126 63
English Pages 187 [181] Year 2010
VLSI Design for Video Coding
Youn-Long Steve Lin • Chao-Yang Kao Huang-Chih Kuo • Jian-Wen Chen
VLSI Design for Video Coding H.264/AVC Encoding from Standard Specification to Chip
123
Prof. Youn-Long Steve Lin National Tsing Hua University Dept. Computer Science 101 Kuang Fu Road HsinChu 300 Section 2 Taiwan R.O.C.
Huang-Chih Kuo National Tsing Hua University Dept. Computer Science 101 Kuang Fu Road HsinChu 300 Section 2 Taiwan R.O.C.
Chao-Yang Kao National Tsing Hua University Dept. Computer Science 101 Kuang Fu Road HsinChu 300 Section 2 Taiwan R.O.C.
Jian-Wen Chen National Tsing Hua University Dept. Computer Science 101 Kuang Fu Road HsinChu 300 Section 2 Taiwan R.O.C.
ISBN 978-1-4419-0958-9 e-ISBN 978-1-4419-0959-6 DOI 10.1007/978-1-4419-0959-6 Springer New York Dordrecht Heidelberg London Library of Congress Control Number: 2009943294 c Springer Science+Business Media, LLC 2010 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
A video signal is represented as a sequence of frames of pixels. There exists vast amount of redundant information that can be eliminated with video compression technology so that its transmission and storage becomes more efficient. To facilitate interoperability between compression at the video producing source and decompression at the consumption end, several generations of video coding standards have been defined and adapted. After MPEG-1 for VCD and MPEG-2 for DVD applications, H.264/AVC is the latest and most advanced video coding standard defined by the international standard organizations. Its high compression ratio comes at the expense of more computational-intensive coding algorithms. For low-end applications, software solutions are adequate. For high-end applications, dedicated hardware solutions are needed. This book describes an academic project of developing an application-specific VLSI architecture for H.264/AVC video encoding. Each subfunction is analyzed before a suitable parallel-processing architecture is designed. Integration of subfunctional modules as well as the integration into a bus-based SOC platform is presented. The whole encoder has been prototyped using an FPGA. Intended readers are researchers, educators, and developers in video coding systems, hardware accelerators for image/video processing, and high-level synthesis of VLSI. Especially, those who are interested in state-of-the-art parallel architecture and implementation of intra prediction, integer motion estimation, fractional motion estimation, discrete cosine transform, context-adaptive binary arithmetic coding, and deblocking filter will find design ideas from this book. HsinChu, Taiwan, ROC
Youn-Long Lin Chao-Yang Kao Huang-Chih Kuo Jian-Wen Chen
v
Acknowledgments
Cheng-Long Wu, Cheng-Ru Chang, Chun-Hsin Lee, Chun-Lin Chiu, Hao-Ting Huang, Huan-Chun Tseng, Huan-Kai Peng, Hui-Ting Yang, Jhong-Wei Gu, Kai-Hsiang Chang, Li-Cian Wu, Ping Chao, Po-Sheng Liu, Sheng-Tsung Hsu, Sheng-Yu Shih, Shin-Chih Lee, Tzu-Jen Lo, Wei-Cheng Huang, Yu-Chien Kao, Yuan-Chun Lin, and Yung-Hung Chan of the Theda.Design Group, National Tsing Hua University contribute to the development of the H.264 Video Encoder System described in this book. The authors appreciate financial support from Taiwan’s National Science Council under Contracts no. 95-2220-E-007-024, 96-2220-E-007-013, and 97-2220-E-007003 and Ministry of Economics Affairs under Contracts no. 94-EC-17-A-01-S1038, 95-EC-17-A-01-S1-038, and 96-EC-17-A-01-S1-038. Financial support from Taiwan Semiconductor Manufacturing Company Limited (TSMC) and Industry Technology Research Institute (ITRI) is also greatly appreciated. Global Unichip Corp. provided us with its UMVP multimedia SOC platform and consultation during the FPGA prototyping stage of the development. The authors are grateful to Chi Mei Optoelectronics for a 52-in. Quad Full HD display panel. Joint research with the Microprocessor Research Center (MPRC) of Peking University has been an important milestone of this project.
vii
Contents
1 Introduction to Video Coding and H.264/AVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.1 Basic Coding Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.2 Video Encoding Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.3 Color Space Conversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.4 Prediction of a Macroblock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.5 Intraframe Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.6 Interframe Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.7 Motion Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.8 Prediction Error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.9 Space-Domain to Frequency-Domain Transformation of Residual Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.10 Coefficient Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.11 Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.12 Motion Compensation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.13 Deblocking Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Book Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1 1 2 2 2 3 4 4 4 4 5 5 5 5 6 6
2 Intra Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.2 Design Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Prediction Time Reduction Approaches. . . . . . . . . . . . . . . . . . . . . . . . 2.2.2 Hardware Area Reduction Approaches . . . . . . . . . . . . . . . . . . . . . . . . 2.3 A VLSI Design for Intra Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Subtasks Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2 Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11 11 12 16 19 19 19 20 20 24 30 30
ix
x
Contents
3 Integer Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.2 Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.2 Data-Reuse Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 A VLSI Design for Integer Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.1 Proposed Data-Reuse Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.2 Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.3 Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31 31 33 36 37 37 43 44 45 47 49 52 53
4 Fractional Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.2 Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 A VLSI Design for Fractional Motion Estimation . . . . . . . . . . . . . . . . . . . . . . 4.3.1 Proposed Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.2 Proposed Resource Sharing Method for SATD Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
57 57 58 61 61 63 63
5 Motion Compensation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.2 Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Memory Traffic Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.2 Interpolation Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 A VLSI Design for Motion Compensation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.1 Motion Vector Generator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.2 Interpolator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
73 73 73 75 75 76 76 77 77 79 83 83
6 Transform Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1.2 Design Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.1 Multitransform Engine Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.2 Trans/Quan or InvQuan/InvTrans Integration Approaches . . . .
85 85 85 97 97 97 97
68 72 72
Contents
6.3
6.4
xi
A VLSI Design for Transform Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.3.1 Subtasks Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.3.2 Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .106 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .106
7 Deblocking Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .107 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .107 7.1.1 Deblocking Filter Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .108 7.1.2 Subtasks Processing Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .112 7.1.3 Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .113 7.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .115 7.3 A VLSI Design for Deblocking Filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .116 7.3.1 Subtasks Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .116 7.3.2 Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .116 7.3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .122 7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .124 8 CABAC Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .125 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .125 8.1.1 CABAC Encoder Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .125 8.1.2 Subtasks Processing Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .134 8.1.3 Design Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .134 8.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .136 8.3 A VLSI Design for CABAC Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .139 8.3.1 Subtasks Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .139 8.3.2 Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .140 8.3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .147 8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .148 9 System Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .151 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .151 9.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .151 9.1.2 Design Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .153 9.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .155 9.3 A VLSI Design for H.264/AVC Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .156 9.3.1 Subtasks Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .156 9.3.2 Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .159 9.3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .165 9.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .166 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .167 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .173
Chapter 1
Introduction to Video Coding and H.264/AVC
Abstract A video signal is represented as a sequence of frames of pixels. There exists a vast amount of redundant information that can be eliminated with video compression technology so that transmission and storage becomes more efficient. To facilitate interoperability between compression at the video producing source and decompression at the consumption end, several generations of video coding standards have been defined and adapted. For low-end applications, software solutions are adequate. For high-end applications, dedicated hardware solutions are needed. This chapter gives an overview of the principles behind video coding in general and the advanced features of H.264/AVC standard in particular. It serves as an introduction to the remaining chapters; each covers an important coding tool and its VLSI architectural design of an H.264/AVC encoder.
1.1 Introduction A video encoder takes as its input a video sequence, performs compression, and then produces as its output a bit-stream data which can be decoded back to a video sequence by a standard-compliant video decoder. A video signal is a sequence of frames. It has a frame rate defined as the number of frames per second (fps). For typical consumer applications, 30 fps is adequate. However, it could be as high as 60 or 72 for very high-end applications or as low as 10 or 15 for video conferencing over a low-bandwidth communication link. A frame consists of a two-dimensional array of color pixels. Its size is called frame resolution. A standard definition (SD) frame has 720 480 pixels per frame whereas a full high definition (FullHD) one has 1,920 1,088. There are large number of frame size variations developed by various applications such as computer monitors. A color pixel is composed of three elementary components: R, G, and B. Each component is digitized to an 8-bit data for consumer applications or a 12-bit one for high-end applications.
Y.-L.S. Lin et al., VLSI Design for Video Coding: H.264/AVC Encoding from Standard Specification to Chip, DOI 10.1007/978-1-4419-0959-6 1, c Springer Science+Business Media, LLC 2010
1
2
1 Introduction to Video Coding and H.264/AVC
The data rate for a raw video signal is huge. For example, a 30-fps FullHD one will have a data rate of 30 1;920 1;088 3 8 D 1:5Gbps, which is impractical for today’s communication or storage infrastructure. Fortunately, by taking advantage of the characteristics of human visual system and the redundancy in the video signal, we can compress the data by two orders of magnitude without scarifying the quality of the decompressed video.
1.1.1 Basic Coding Unit In order for a video encoding or decoding system to handle video of different frame rates and simplify the implementation, a basic size of 16 16 has been popularly adopted. Every main stream coding standards from MPEG-1, MPEG-2, : : : to H.264 has chosen a macroblock of 16 16 pixels as their basic unit of processing. Hence, for video of different resolutions, we just have to process different number of macroblocks. For every 720 480 SD frame, we process 45 30 macroblocks while for every FullHD frame, we process 120 68 macroblocks.
1.1.2 Video Encoding Flow Algorithm 1.1 depicts a typical flow of video encoding. frame(t ) is the current frame to be encoded. frame0 (t 1) is the reconstructed frame for referencing or called reference frame. frame0 (t ) is the reconstructed current frame. We encode F .t / one macroblock (MB) at a time starting from the leftmost MB of the topmost row. We called the MB being encoded as Curr MB. It can be encoded in one of the three modes: I for intra prediction, P or unidirectional interprediction, and B for bidirectional interprediction. The resultant MB from prediction is called Pred MB and the difference between Curr MB and Pred MB is called Res MB for residuals. Res MB goes through space-to-frequency transformation and then quantization processes to become Res Coef or residual coefficients. Entropy coding then compresses Res Coef to get final bit-stream. In order to prepare reconstructed current frame for future reference, we perform inverse quantization and inverse transform on Res Coef to get reconstructed residuals called Reconst res. Adding together Reconst res and Pred MB, we have Reconstruct MB for insertion into frame0 (t ).
1.1.3 Color Space Conversion Naturally, each pixel is composed of R, G, and B 8-bit components. Applying the following conversion operation, it can be represented as one luminance (luma) component Y and two chrominance (chroma) components Cr and Cb. Since the human
1.1 Introduction
3
Algorithm 1.1: Encode a frame. encode a frame (frame(t), mode) for I D 1, N do //** N: #rows of MBs per frame for I D 1, M do //** N: #rows of MBs per frame Curr MB D MB(frame(t), I, J); case (mode) I: Pred MB D Intra Pred (frame(t)’, I, J); P: Pred MB D ME (frame(t-1)’, I, J); B: Pred MB D ME (frame(t-1)’, frame(tC1)’, I, J); endcase Res MB D Curr MB - Pred MB; Res Coef D Quant(Transform(Res MB)); Output(Entropy code(Res Coef)); Reconst res D InverseTransform(InverseQuant(Res Coef)); Reconst MB D Reconst res C Pred MB; Insert(Reconst MB, frame(t)’); endfor endfor end encode a frame;
visual system is more sensitive to luminance component than chrominance ones, we can subsample Cr and Cb to reduce the data amount without sacrificing the video quality. Usually one out of two or one out of four subsampling is applied. The former is called 4:2:2 format and the later 4:2:0 format. In this book, we assume that 4:2:0 format is chosen. Of course, the inverse conversion will give us R, G, B components from a set of Y , Cr, Cb components. Y D 0:299R C 0:587G C 0:114B; Cb D 0:564.B Y /;
(1.1)
Cr D 0:713.R Y /:
1.1.4 Prediction of a Macroblock A macroblock M has 1616 D 256 pixels. It takes 2563 D 768 bytes to represent it in RGB format and 256 .1 C 1=4 C 1=4/ D 384 bytes in 4:2:0 format. If we can find during decoding a macroblock M0 which is similar to M, then we only have to get from the encoding end the difference between M and M0 . If M and M0 are very similar, the difference becomes very small so does the amount of data needed to be transmitted/stored. Another way to interpret similarity is redundancy. There exist two types of redundancy: spatial and temporal. Spatial redundancy results from similarity between a pixel (region) and its surrounding pixels (regions) in a frame. Temporal redundancy results from slow change of video contents from one frame to the next. Redundancy information can be identified and removed with prediction tools.
4
1 Introduction to Video Coding and H.264/AVC
1.1.5 Intraframe Prediction In an image region with smooth change, a macroblock is likely to be similar to its neighboring macroblocks in color or texture. For example, if all its neighbors are red, we can predict that a macroblock is also red. Generally, we can define several prediction functions; each takes pixel values from neighboring macroblocks as its input and produces a predicted macroblock as its output. To carry out intraframe prediction, every function is evaluated and the one resulting in the smallest error is chosen. Only the function type and the error need to be encoded and stored/transmitted. This tool is also called intra prediction and a prediction function is also called a prediction mode.
1.1.6 Interframe Prediction Interframe prediction, also called interprediction, identifies temporal redundancy between neighboring frames. We call the frame currently being processed the current frame and the neighboring one the reference frame. We try to find from the reference frame a reference macroblock that is very similar to the current macroblock of the current frame. The process is called motion estimation. A motion estimator compares the current macroblock with candidate macroblocks within a search window in the reference frame. After finding the best-matched candidate macroblock, only the displacement and the error need to be encoded and stored/transmitted. The displacement from the location of the current macroblock to that of the best candidate block is called motion vector (MV). In other words, motion estimation determines the MV that results in the smallest interprediction error. A bigger search window will give better prediction at the expense of longer estimation time.
1.1.7 Motion Vector A MV obtained from motion estimation is adequate for retrieving a block from the reference frame. Yet, we do not have to encode/transmit the whole of it because there exists similarity (or redundancy) among MVs of neighboring blocks. Instead, we can have a motion vector prediction (MVP) as a function of neighboring blocks’ MVs and just process the difference, called motion vector difference (MVD), between the MV and its MVP. In most cases, the MVD is much smaller than its associated MV.
1.1.8 Prediction Error We call the difference between the current macroblock and the predicted one as prediction error. It is also called residual error or just residual.
1.1 Introduction
5
1.1.9 Space-Domain to Frequency-Domain Transformation of Residual Error Residual error is in the space domain and can be represented in the frequency domain by applying discrete cosine transformation (DCT). DCT can be viewed as representing an image block with a weighted sum of elementary patterns. The weights are termed as coefficients. For computational feasibility, a macroblock of residual errors is usually divided into smaller 4 4 or 8 8 blocks before applying DCT one by one.
1.1.10 Coefficient Quantization Coefficients generated by DCT carry image components of various frequencies. Since human visual system is more sensitive to low frequency components and less sensitive to high frequency ones, we can treat them with different resolution by means of quantization. Quantization effectively discards certain least significant bits (LSBs) of a coefficient. By giving smaller quantization steps to low frequency components and larger quantization steps to high frequency ones, we can reduce the amount of data without scarifying the visual quality.
1.1.11 Reconstruction Both encoding and decoding ends have to reconstruct video frame. In the encoding end, the reconstructed frame instead of the original one should be used as reference because no original frame is available in the decoding end. To reconstruct, we perform inverse quantization and inverse DCT to obtain reconstructed residual. Note that the reconstructed residual is not identical to the original residual as quantization is irreversible. Therefore, distortion is introduced here. We then add prediction data to the reconstructed residual to obtain reconstructed image. For an intrapredicted macroblock, we perform predict function on its neighboring reconstructed macroblocks while for an interpredicted one we perform motion compensation. Both methods give a reconstructed version of the current macroblock.
1.1.12 Motion Compensation Given a MV, the motion compensator retrieves from the reference frame a reconstructed macroblock pointed to by the integer part of the MV. If the MV has fractional part, it performs interpolation over the retrieved image to obtain the final reconstructed image. Usually, interpolation is done twice, one for half-pixel accuracy and the other for quarter-pixel accuracy.
6
1 Introduction to Video Coding and H.264/AVC
1.1.13 Deblocking Filtering After every macroblock of a frame is reconstructed one by one, we obtain a reconstructed frame. Since the encoding/decoding process is done macroblock-wise, there exists blocking artifacts between boundaries of adjacent macroblocks or subblocks. Deblocking filter is used to eliminate this kind of artificial edges.
1.2 Book Organization This book describe a VLSI implementation of a hardware H.264/AVC encoder as depicted in Fig. 1.1.
AMBA AMBA Slave
AMBA Master AMBA Interface MAU Arbiter DF MAU
MB MAU
SR MAU
BIT MAU
Command Receiver
MainCtrl Engine Inter Info Memory
IntraPred Engine Unfilter Memory DF Engine
FME Engine
IntraMD Engine
MC Engine
Multiplexer
IME Engine
TransCoding Engine
CABAC Engine Recons Engine
ReconsMB Memory
Encoder Core
Fig. 1.1 Top-level block diagram of the proposed design
PE Engine
1.2 Book Organization
7
In Chap. 2, we present intra prediction. Intra prediction is the first process of H.264/AVC intra encoding. It predicts a macroblock by referring to its neighboring macroblocks to eliminate spatial redundancy. There are 17 prediction modes for a macroblock: nine modes for each of the 16 luma 4 4 blocks, four modes for a luma 16 16 block, and four modes for each of the two chroma 8 8 blocks. Because there exists great similarity among equations of generating prediction pixels across prediction modes, effective hardware resource sharing is the main design consideration. Moreover, there exists a long data-dependency loop among luma 4 4 blocks during encoding. Increasing parallelism and skipping some modes are two of the popular methods to design a high-performance architecture for high-end applications. However, to increase throughput will require more hardware area and to skip some modes will degrade video quality. We will present a novel VLSI implementation for intra prediction in this chapter. In Chap. 3, we present integer motion estimation. Interframe prediction in H.264/AVC is carried out in three phases: integer motion estimation (IME), fractional motion estimation (FME), and motion compensation (MC). We will discuss these functions in Chaps. 3, 4, and 5, respectively. Because motion estimation in H.264/AVC supports variable block sizes and multiple reference frames, high computational complexity and huge data traffic become main difficulties in VLSI implementation. Moreover, high-resolution video applications, such as HDTV, make these problems more critical. Therefore, current VLSI designs usually adopt parallel architecture to increase the total throughput and solve high computational complexity. On the other hand, many data-reuse schemes try to increase data-reuse ratio and, hence, reduce required data traffic. We will introduce several key points of VLSI implementation for IME. In Chap. 4, we present fractional motion estimation. Motion estimation in H.264/AVC supports quarter-pixel precision and is usually carried out in two phases: IME and FME. We have talked about IME in Chap. 3. After IME finds an integer motion vector (IMV) for each of the 41 subblocks, FME performs motion search around the refinement center pointed to by IMV and further refines 41 IMVs into fractional MVs (FMVs) of quarter-pixel precision. FME interpolates halfpixels using a six-tap filter and then quarter-pixels a two-tap one. Nine positions are searched in both half refinement (one integer-pixel search center pointed to by IMV and eight half-pixel positions) and then quarter refinement (one half-pixel position and eight quarter-pixel positions). The position with minimum residual error is chosen as the best match. FME can significantly improve the video quality (C0:3 to C0:5dB) and reduce bit-rate (20–37%) according to our experimental results. However, our profiling report shows that FME consumes more than 40% of the total encoding time. Therefore, an efficient hardware accelerator for fractional motion estimation is indispensable. In Chap. 5, we present motion compensation. Following integer and fractional motion estimation, motion compensation (MC) is the third stage in H.264/AVC interframe prediction (P or B frame). After the motion estimator finds MVs and related information for each current macroblock, the motion compensator generates
8
1 Introduction to Video Coding and H.264/AVC
compensated macroblocks (MBs) from reference frames. Due to quarter-pixel precision and variable-block-size motion estimation supported in H.264, motion compensation also needs to generate half- or quarter-pixels for MB compensation. Therefore, motion compensation also has high computational complexity and dominates the data traffic on DRAM. Current VLSI designs for MC usually focus on reducing memory traffic or increasing interpolator throughput. In this chapter, we will introduce several key points of VLSI implementation for motion compensation. In Chap. 6, we present transform coding. In H.264/AVC, both transform and quantization units consist of forward and inverse parts. Residuals are transformed into frequency domain coefficients in the forward transform unit and quantized in the forward quantization unit to reduce insignificant data for bit-rate saving. To generate reconstructed pixels for the intra prediction unit and reference frames for the motion estimation unit, quantized coefficients are rescaled in the inverse quantization unit and transformed back to residuals in the inverse transform unit. There are three kinds of transforms used in H.264/AVC: 4 4 integer discrete cosine transform, 2 2 Hadamard transform, and 4 4 Hadamard transform. To design an area-efficient architecture is the main design challenge. We will present a VLSI implementation of transform coding in this chapter. In Chap. 7, we present deblocking filter. The deblocking filter (DF) adopted in H.264/AVC reduces the blocking artifact generated by block-based motioncompensated interprediction, intra prediction, and integer discrete cosine transform. The filter for eliminating blocking artifacts is embedded within the coding loop. Therefore, it is also called in-loop filter. Expirically, it achieves up to 9% bit-rate saving at the expense of intensive computation. Even with today’s fastest CPU, it is hard to perform software-based real-time encoding of high-resolution sequences such as QFHD (3,840 2,160). Consequently, accelerating the deblocking filter by VLSI implementation is indeed required. Through optimizing processing cycle, external memory access, and working frequency, we show a design that can support QFHD at 60-fps application by running at 195 MHz. In Chap. 8, we present context-based adaptive binary arithmetic coding. Contextbased adaptive binary arithmetic coding (CABAC) adopted in H.264/AVC main profile is the state-of-the-art in terms of bit-rate efficiency. In comparison with context-based adaptive variable length coding (CAVLC) used in baseline profile, it can save up to 7% of the bit-rate. However, CABAC occupies 9.6% of total encoding time and its throughput is limited by bit-level data dependency. Moreover, for ultrahigh resolution, such like QFHD (3,840 2,160), its performance is difficult to meet real-time requirement for a pure software CABAC encoder. Therefore, it is necessary to accelerate the CABAC encoder by VLSI implementation. In this chapter, a novel architecture of CABAC encoder will be described. Its performance is capable of real-time encoding QFHD video in the worst case of main profile Level 5.1. In Chap. 9, we present system integration. Hardware cost and encoding performance are the two main challenges in designing a high-performance H.264/AVC encoder. We have proposed several high-performance architectures for the functional
1.2 Book Organization
9
units in an H.264/AVC encoder. In addition, external memory management is another design issue. We have to access an external memory up to 3.3 GBps for real-time encoding 1080pHD video in our encoder. We propose several AMBAcompliant memory access units (MAUs) to efficiently access an external memory. We will present our H.264/AVC encoder in this chapter.
Chapter 2
Intra Prediction
Abstract Intra prediction is the first process of H.264/AVC intra encoding. It predicts a macroblock by referring to its neighboring macroblocks to eliminate spatial redundancy. There are 17 prediction modes for a macroblock: nine modes for each of the 16 luma 4 4 blocks, four modes for a luma 16 16 block, and four modes for each of the two chroma 8 8 blocks. Because there exists great similarity among equations of generating prediction pixels across prediction modes, effective hardware resource sharing is the main design consideration. Moreover, there exists a long data-dependency loop among luma 4 4 blocks during encoding. Increasing parallelism and skipping some modes are two of the popular methods to design a high-performance architecture for high-end applications. However, to increase throughput will require more hardware area and to skip some modes will degrade video quality. We will present a novel VLSI implementation for intra prediction in this chapter.
2.1 Introduction H.264/AVC intra encoding achieves higher compression ratio and quality compared with the latest still image coding standard JPEG2000 [1]. The intra prediction unit, which is the first process of H.264/AVC intra encoding, employs 17 kinds of prediction modes and supports several different block sizes. For baseline, main, and extended profiles, it supports 4 4 and 16 16 block sizes. For high profile, it additionally supports an 8 8 block size. In this chapter, we focus on the intra prediction for baseline, main, and extended profiles. The intra prediction unit refers to reconstructed neighboring pixels to generate prediction pixels. Therefore, its superior performance comes at the expense of very high computational complexity. We describe the detailed algorithm of intra prediction in Sect. 2.1.1 and address some design considerations in Sect. 2.1.2.
Y.-L.S. Lin et al., VLSI Design for Video Coding: H.264/AVC Encoding from Standard Specification to Chip, DOI 10.1007/978-1-4419-0959-6 2, c Springer Science+Business Media, LLC 2010
11
12
2 Intra Prediction
2.1.1 Algorithm All intra prediction pixels are calculated based on the reconstructed pixels of previously encoded neighboring blocks. Figure 2.1 lists all intra prediction modes with different block sizes. For the luma component, a 16 16 macroblock can be partitioned into sixteen 4 4 blocks or just one 16 16 block. The chroma component simply contains one 8 8 Cb block and one 8 8 Cr block. There are nine prediction modes for each of the 16 luma 4 4 blocks and four prediction modes for a luma 16 16 block and two chroma 8 8 blocks. Figure 2.2 illustrates the reference pixels of a luma macroblock. A luma 16 16 block is predicted by referring to its upper, upper-left, and left neighboring luma 16 16 blocks. For a luma 4 4 block, we utilize its upper, upper-left, left, and upper-right neighboring 4 4 blocks. There are 33 and 13 reference pixels for a luma 16 16 block and a luma 4 4 block, respectively. To predict a chroma 8 8 block is like to predict a luma 16 16 block by using its upper, upper-left, and left neighboring chroma blocks. There are 17 reference pixels for a chroma block. Figure 2.3 shows all the computation equations of luma 4 4 modes. Upper case letters from “A” to “M” denote the 13 reference pixels and lower case letters from “a” to “p” denote the 16 prediction pixels. Component
Block Size 1 16x16
L16_VER L16_HOR L16_DC L16_PLANE
0:vertical 1:horizontal 2:DC 3:diagonal down-left 4:diagonal down-right 5:vertical-right 6:horizontal-down 7:vertical-left 8:horizontal-up
L4_VER L4_HOR L4_DC L4_DDL L4_DDR L4_VR L4_HD L4_VL L4_HU
1 8x8
0:DC 1:horizontal 2:vertical 3:plane
CB8_DC CB8_HOR CB8_VER CB8_PLANE
1 8x8
0:DC 1:horizontal 2:vertical 3:plane
CR8_DC CR8_HOR CR8_VER CR8_PLANE
16 4x4 16x16
8x8 Cr
Abbreviation
0:vertical 1:horizontal 2:DC 3:plane
Y
Cb
Prediction Modes
8x8
Fig. 2.1 Intra prediction modes
2.1 Introduction
13
MB_UpperLeft
MB_Upper
MB_Left
MB_Current
MB_UpperRight
blk_UL blk_U blk_UR blk_L
blk_C
Fig. 2.2 Reference pixels of a luma macroblock
There are four modes for a luma 16 16 block and two chroma 8 8 blocks: horizontal, vertical, DC, and plane as shown in Fig. 2.4. All except the plane mode are similar to that of luma 4 4 modes. Plane modes defined for smoothly varying image are the most complicated. Every prediction pixel has a unique value. Figure 2.5 shows the equations of a luma 16 16 plane mode. Each prediction pixel value depends on its coordinate in the block and parameters a, b, and c which are calculated from pixels of neighboring blocks. After generating prediction pixels for each mode, the intra mode decision unit will compute the cost of each mode, based on distortion and bit-rate, and choose the one with the lowest cost. We describe the encoding process of the intra prediction unit using Algorithm 2.1 and show the corresponding flow chart in Fig. 2.6. The primary inputs of the intra prediction unit are Xcoord and Ycoord , which indicate the location of the current macroblock in a frame. For example, in a CIF (352 288) frame, (Xcoord ,Ycoord / D .0; 0/ denotes the first macroblock and (Xcoord ,Ycoord / D .21; 17/ denotes the last. We use upper-case letters A through M to represent 13 reference pixels for a luma 4 4 block as shown in Fig. 2.3. HSL and VSL represent sets of left and upper reference pixels for a luma 16 16 block. QL represents upper-left reference pixels for a luma macroblock. HSCb , VSCb , QCb , HSCr , VSCr , and QCr are for two chroma 8 8 blocks. Figure 2.7 shows the order in which to process the subtasks of an intra prediction unit. Table 2.1 shows the cycle budget for an H.264/AVC encoder to encode a
14
2 Intra Prediction Prediction Mode (Abbreviation) MAB I a b J e f K i j Lmn
C c g k o
DE FGH d (Reference Pixels) h A-D: Upper Pixels l E-H: Upper Right Pixels p I-L : Left Pixels
Vertical (VER) MABCDE FGH I a=e=i=m=A J b=f=j=n=B K c=g=k=o=C L d=h=l=p=D
M : Upper Left Pixels (Prediction Pixels) a-p
Horizontal (HOR) MABCDE FGH I a=b=c=d=I J e=f=g=h=J K i=j=k=l=K L m=n=o=o=L
DC (DC) MABCDE FGH I If (A-D,I-L available) J a-p=(SH+SV+4)>>3 K else if (I-L available) L a-p=(SH+2)>>2 SH=SUM(I-L) SV=SUM(A-D)
Diagonal Down-Left (DDL) MABCDE FGH I a=(A+2B+C+2)>>2 J b=e=(B+2C+D+2)>>2 K c=f=i=(C+2D+E+2)>>2 L d=g=j=m=
Diagonal Down-Right (DDR) MABCDE FGH I m=(L+2K+J+2)>>1 J i=n=(K+2J+I+2)>>2 K e=j=o=(J+2I+M+2)>>2 L a=f=k=p=
(D+2E+F+2)>>2 h=k=n= (E+2F+G+2)>>2 l=o=(F+2G+H+2)>>2 p=(G+3H+2)>>2
(I+2M+A+2)>>2 b=g=l= (M+2A+B+2)>>2 c=h=(A+2B+C+2)>>2 d=(B+2C+D+2)>>2
Vertical-Right (VR) MABCDE FGH a=j=(M+A+1)>>1 I b=k=(A+B+1)>>1 J c=l=(B+C+1)>>1 K d=(C+D+1)>>1 L m=(K+2J+I+2)>>2
Horizontal-Down (HD) MABCDE FGH m=(L+K+1)>>1 I i=o=(K+J+1)>>1 J e=k=(J+I+1)>>1 K a=g=(I+M+1)>>1 L n=(L+2K+J+2)>>2
i=(J+2I+M+2)>>2 e=n=(I+2M+A+2)>>2 f=o=(M+2A+B+2)>>2 g=p=(A+2B+C+2)>>2 h=(B+2C+D+2)>>2
j=p=(K+2J+I+2)>>2 f=l=(J+2I+M+2)>>2 b=h=(I+2M+A+2)>>2 c=(M+2A+B+2)>>2 d=(A+2B+C+2)>>2
Vertical-Left (VL) MABCDE FGH a=(A+B+1)>>1 I b=i=(B+C+1)>>1 J c=j=(C+D+1)>>1 K d=k=(D+E+1)>>1 L l=(E+F+1)>>1
else if (A-D available) a-p=(SV+2)>>2 else a-p=128
Horizontal-Up (HU) MA B C D E F GH I k=l=m=o=p=L J g=i=(L+K+1)>>1 c=e=(K+J+1)>>1 K a=(J+I+1)>>1 L
e=(A+2B+C+2)>>2 f=m=(B+2C+D+2)>>2 g=n=(C+2D+E+2)>>2 h=o=(D+2E+F+2)>>2 p=(E+2F+G+2)>>2
Fig. 2.3 Equations of luma 4 4 modes prediction pixels
h=j=(3L+J+2)>>2 d=f=(L+2K+J+2)>>2 b=(K+2J+I+2)>>2
2.1 Introduction
a
15 Vertical
Horizontal
b
Vertical
Plane
DC
DC
Horizontal
Plane
Pred [y,x] Mean (32 neighboring pixels)
PlanePred [y,x]
Fig. 2.4 (a) Luma 16 16 and (b) chroma 8 8 prediction modes
ReconsPixel[–1,–1] ReconsPixel[–1,0]
1*(ReconsPixel[8, –1] - ReconsPixel[6,–1]) ReconsPixel[0,–1]
3*(ReconsPixel[–1,10] - ReconsPixel[–1,4])
PlanePred [y, x] = Clip1{ a+16+b*(x–7)+c*(y–7))>>5}, x, y = 0~15 a = (ReconsPixel [–1, 15] + ReconsPixel [15,–1]) 6 c = (5*V + 32) >> 6 7
H = Σ (x' +1)∗(ReconsPixel[8 + x',–1] – ReconsPixel[6 – x', –1]) x'=0 7
V = Σ (y' +1)∗(ReconsPixel[ –1,8 + y' ] – ReconsPixel[ – 1,6 – y' ]) y'=0
Fig. 2.5 Illustration of plane modes
16
2 Intra Prediction
Algorithm 2.1: Intra prediction. Intra Prediction (Xcoord ,Ycoord , AM, HSL , VSL , QL , HSC b , VSC b , QC b , HSC r , VSC r , QC r ) for 16 luma 4x4 block do PredPixels4x4DC D Gen 4x4DC(A,B,C,D,I,J,K,L); if up block available then PredPixels4x4VER D Gen 4x4VER(A,B,C,D); PredPixels4x4DDL D Gen 4x4DDL(A,B,C,D,E,F,G,H); PredPixels4x4VL D Gen 4x4VL(A,B,C,D,E,F,G); endif if left block available then PredPixels4x4HOR D Gen 4x4HOR(I,J,K,L); PredPixels4x4H U D Gen 4x4HU(I,J,K,L); endif if left block available & up block available & up left block available then PredPixels4x4DDR D Gen 4x4DDR(A,B,C,D,I,J,K,L,M); PredPixels4x4VR D Gen 4x4VR(A,B,C,D,I,J,K,M); PredPixels4x4HD D Gen 4x4HD(A,B,C,I,J,K,L,M); endif endfor PredPixel16x16DC D Gen 16x16DC(HSL ,VSL ); if up luma mb available then PredPixels16x16VER D Gen 16x16VER(VSL ); endif if left luma mb available then PredPixels16x16HOR D Gen 16x16HOR(HSL ); endif if left luma mb available & up luma mb available then PredPixels16x16PLANE D Gen 16x16PLANE(VSL ,HSL ,QL ); endif if up chroma mb available then PredPixels8x8VER C b D Gen 8x8VER(VSC b ); PredPixels8x8VER C r D Gen 8x8VER(VSC r ); endif if left chroma mb available then PredPixels8x8HOR C b D Gen 8x8HOR(HSC b ); PredPixels8x8HOR C r D Gen 8x8HOR(HSC r ); endif if left chroma mb available & up chroma mb available then PredPixels8x8PLANE C b D Gen 8x8PLANE(VSC b ,HSC b ,QC b ); PredPixels8x8PLANE C r D Gen 8x8PLANE(VSC r ,HSC r ,QC r ); endif
1080pHD (1,920 1,088) video at 30 fps at different working frequencies. If the intra prediction unit generates four prediction pixels per cycle, it will take 960 cycles to predict a macroblock.
2.1.2 Design Consideration Because all neighboring pixels are reconstructed, the intra prediction unit can only start to predict a luma 4 4 block, a luma 16 16 block, or a chroma 8 8 block
2.1 Introduction
17 Start Predict luma 4x4 DC Upper block available? Yes Predict luma 4x4 VER, DDL, VL Left block available? Yes Predict luma 4x4 HOR, HU All neighbor block available? Yes Predict luma 4x4 DDR, VR, HD
No
No
No
No All 16 4x4 block done? Yes Predict luma 16x16 or chroma DC Upper MB available? Yes Predict luma 16x16 or chroma VER
No
Left MB available? Yes Predict luma 16x16 or chroma HOR
No
All neighbor MB available? Yes Predict luma 16x16 or chroma Plane
No
No Luma 16x16, Cb8x8, and Cr 8x8 done?
Yes End
Fig. 2.6 Flow chart of the intra prediction unit
after its neighboring blocks are reconstructed as shown in Fig. 2.2. This data dependency exists in both the macroblock level for a luma 16 16 block and two chroma 8 8 blocks and the 4 4 block level for 16 luma 4 4 blocks. The data dependency among 16 luma 4 4 blocks usually dominates the system performance of an H.264/AVC intra encoder since it takes 576 960 D 60% of the total processing time as shown in Fig. 2.7. In Fig. 2.8, the arrows denote the data dependency among 16 luma 4 4 blocks, and the numbers inside the 4 4 blocks show the processing order defined in the
18
2 Intra Prediction
Predict luma 4x4 DC for a 4x4 block Predict luma 4x4 VER, DDL, VL for a 4x4 block Predict luma 4x4 HOR, HU for a 4x4 block Predict luma 4x4 DDR, VR, HD for a 4x4 block Predict luma16x16 DC, VER, HOR, PLANE Predict chroma DC, VER, HOR, PLANE Cycles
36
576
832
Fig. 2.7 Order of processing subtasks Table 2.1 Cycle budget at different working frequency for 1080pHD video Frequency (MHz) Cycles 300 1,225 250 1,021 200 816 166 678 150 612 125 510 100 408
1
2
5
6
3
4
7
8
9
10
13
14
11
12
15
16
Fig. 2.8 Data dependency and processing order of 16 luma 4 4 blocks
960
2.2 Related Works
19
H.264/AVC standard. For example, the arrows point to block 13, which means we have to predict it by referring to the reconstructed pixels of blocks 7, 8, and 10.
2.2 Related Works Several VLSI architectures exist for H.264/AVC intra prediction. Some of them address how to shorten prediction time to support high-resolution video applications. Others aim to provide a hardware-efficient solution to minimize the system cost.
2.2.1 Prediction Time Reduction Approaches The intra prediction unit and the intra mode decision unit together account for about 80% of the computation time in the H.264/AVC intra encoding, according to our profiling using the H.264/AVC reference sofware Joint Model (JM) 11.0. An allmode mode decision approach evaluates costs of nine luma 4 4 modes, four luma 16 16 modes, and four chroma 8 8 modes, whereas a partial-mode approach evaluates fewer modes by skipping modes that have a lower probability of being the best one. By adopting the partial-mode mode decisions can shorten prediction time but result in video-quality degradation. Several previous designs [7, 54, 78] propose partial-mode mode decision algorithms. For example, a three-step encoding algorithm is proposed by Cheng and Chang [7] to adaptively choose the next prediction mode. Instead of reducing the computation load, one can reduce prediction time by increasing pixel-level parallelism (PLP). Increasing the PLP directly decreases the total prediction time. For example, if the intra prediction unit predicts 8 pixels per cycle, the lower bound of prediction time will be 480 cycles. Several previous works [21, 29, 32] have proposed the ability to predict 4 pixels per cycle, and another work [46] proposes to predict 8 pixels per cycle by employing two prediction engines. The data dependency among luma 4 4 blocks introduces bubble cycles. Huang et al. [21] proposes inserting luma 16 16 prediction into these bubble cycles to eliminate them.
2.2.2 Hardware Area Reduction Approaches Instead of using a dedicated hardware [61] for each prediction mode, several previous designs [21, 29, 46] employ reconfigurable architectures. Some previous works [29, 46] save hardware area by removing the plane mode from the prediction mode capability (PMC). PMC is defined as the prediction modes that the intra prediction unit supports. PMC affects the prediction mode number and
20
2 Intra Prediction
the candidate mode set for the mode decision unit. The size of the candidate mode set also affects compression ratio. A smaller set results in reduced coding efficiency. Several previous designs schedule the processing order of prediction modes according to their computation load to save hardware costs. Huang et al. [21] schedules DC and plane modes last for data preparation, while the unit outputs pixels for vertical and horizontal modes.
2.3 A VLSI Design for Intra Prediction We propose a VLSI architecture for intra prediction in this section. We first describe how we schedule all subtasks in Sect. 2.3.1, and we then propose our hardware architecture in Sect. 2.3.2. In Sect. 2.3.3, we evaluate its performance.
2.3.1 Subtasks Scheduling We categorize all intra prediction modes into reconstruction-loop (RL) modes and nonreconstruction-loop (Non-RL) modes, as shown in Table 2.2. The former includes all luma 4 4 modes, whereas the later includes luma 16 16 and chroma 8 8 modes. Our profiling shows that RL modes occupy 59%, and three plane modes occupy approximately 40% of overall computations. Since the data dependency among 4 4 blocks dominates the processing time, prediction of RL modes is the performance bottleneck. Our design spends 5 cycles to generate prediction pixels of nine intra 4 4 modes for a 4 4 block by increasing PLP to 16 (pixels/mode) 2 (modes/cycle) D 32 pixels/cycle. Ideally, it takes 5 (cycles/block) 16 (blocks/macroblock) D 80 cycles to predict 16 luma 4 4 blocks.
Table 2.2 Mode categories Prediction modes RL modes Category Bypass DC Plane Skew
Luma 4 4 L4 HOR L4 VER L4 DC L4 L4 L4 L4 L4 L4
DDL DDR VR HD VL HU
Non-RL modes Luma 16 16 L16 HOR L16 VER L16 DC L16 PLANE
Cb 8 8 CB8 HOR CB8 VER CB8 DC CB8 PLANE
Cr 8 8 CR8 HOR CR8 VER CR8 DC CR8 PLANE
2.3 A VLSI Design for Intra Prediction
21
We also increase the PLP for predicting Non-RL modes. Our design generates 16 pixels of Non-RL modes per cycle. Therefore, it takes 16 (cycles/macroblock) 4 (modes) C 4 (cycles/chroma 8 8 block) 2 (chroma types) 4 (modes) D 96 cycles to predict a luma 16 16 block and two chroma 8 8 blocks. To alleviate the performance bottleneck caused by the long data-dependency loop among luma 4 4 blocks, we modify the processing order of 4 4 blocks to process two luma 4 4 blocks at a time. In Fig. 2.9, the arrows show the data dependency among 16 luma 4 4 blocks, and the numbers inside the 4 4 blocks show the modified processing order. To predict two chroma 8 8 blocks, we use a 4 4 block as a unit to generate prediction pixels. We predict one Cb and one Cr 4 4 block at the same time to shorten the processing time. Moreover, to utilize the bubble cycles between two luma 4 4 blocks, we generate prediction pixels of luma 16 16 modes after generating prediction pixels of luma 4 4 modes for a luma 4 4 block. Figure 2.10 shows the timing diagram of our proposed design. As mentioned before, there is a data dependency among luma 4 4 blocks. After the intra prediction
1'
2'
3'
4'
3'
4'
5'
6'
5'
6'
7'
8'
7'
8'
9'
10'
Fig. 2.9 Proposed processing order of 16 luma 4 4 blocks
Fetch data RL engine 1 predicts luma 4x4 modes for a 4x4 block RL engine 2 predicts luma 4x4 modes for a 4x4 block Non-RL engine predicts luma 16x16 modes Non-RL engine predicts chroma 8x8 modes Cycles
16
32
Fig. 2.10 Timing diagram of the proposed design
129
163
22
2 Intra Prediction
unit finishes one luma 4 4 encoding iteration (encoding one or two luma 4 4 blocks), it needs to wait for transform, quantization, inverse quantization, inverse transform, and reconstruction units before starting the next iteration. Our design requires 1 cycle to read reference pixels. By adopting the proposed processing order, two RL engines take 5 (cycles/iteration) 10 (iterations/macroblock) D 50 cycles to predict 16 luma 4 4 blocks. We schedule the Non-RL engine to predict luma 16 16 modes between two luma 4 4 encoding iterations. The Non-RL engine generates prediction pixels of luma 16 16 modes for one 4 4 block in iterations 1, 2, 9, and 10 and for two 4 4 blocks in other iterations, as two RL engines do. The Non-RL engine also computes parameters for the chroma plane and DC modes at iterations 2, 3, 4, 5, and 6 and outputs prediction pixels at iterations 9 and 10. In total, it takes 163 cycles to predict a macroblock. To schedule all prediction modes efficiently, we divide them into four categories: bypass, DC, skew, and plane, according to their characteristics as shown in Table 2.2. The bypass category consists of vertical and horizontal modes which require no computation. The DC category has four DC modes in luma 4 4, luma 16 16, Cb 8 8, and Cr 8 8, respectively. The skew category has six modes: DDL, DDR, VR, HD, VL, and HU, all for luma 4 4 prediction. Their computation equations are mainly three-tap and two-tap filters. The plane category has three modes: luma 16 16 plane, Cb 8 8 plane, and Cr 8 8 plane. They are the most complicated. To predict RL modes, we first separate those in the skew category into three groups (1) HD and HU modes, (2) DDR and VR modes, and (3) DDL and VL modes. We then schedule the horizontal mode and vertical mode into group (4) and the DC mode into group (5). Our design takes 5 cycles to output them in order of group number. To predict Non-RL modes, our design performs horizontal, vertical, DC, and plane prediction in order for a luma 16 16 block and two chroma 8 8 blocks. Although increasing PLP will increase the hardware area, we propose an optimized processing element (PE) scheduling scheme to achieve better resource sharing. A PE is a basic hardware unit used to calculate a prediction pixel. With the optimized scheduling scheme, we aim at generating prediction pixels of RL modes with an optimal number of PEs. Our design predicts 32 pixels of RL modes per cycle. If we give every prediction pixel a dedicated PE, 32 PEs will be needed for RL modes. However, for all modes in the bypass category, we need nothing but multiplexers. For all DC modes, we use a 4-pixel adder tree to remove the computation load from the PEs. RL modes in the skew category are the most complicated modes. Still, there are multiple prediction pixels sharing the same computation equation. For illustration, we draw Table 2.3 by reorganizing a table proposed by Huang et al. [21]. Table 2.3 shows all computation equations of RL modes in the skew category. Each row represents a computation equation and the number of prediction pixels that utilize the equation in each mode. The computation equations labeled as “T3,” “T2,” and “Bypass” denote three-tap filter, two-tap filter, and bypass operations, respectively. The numbers in the columns of reference pixels represent the multiplied numbers of corresponding reference pixels. The numbers in the columns of skew modes denote the
2.3 A VLSI Design for Intra Prediction
23
Table 2.3 Operation table of RL modes in the skew category Reference pixels Skew modes Equation L K J I M T3eq0 3 1 T3eq1 1 2 1 T3eq2 1 2 1 T3eq3 1 2 1 T3eq4 1 2 T3eq5 1 T3eq6 T3eq7 T3eq8 T3eq9 T3eq10 T3eq11 T3eq12 T2eq0 T2eq1 T2eq2 T2eq3 T2eq4 T2eq5 T2eq6 T2eq7 T2eq8 T2eq9 T2eq10 T2eq11
1 1 1 1 1 1 1 1 1
Bypass
1
A B C D E F G H DDL DDR VR HD VL HU 2 1 1 2 2 1 2 1 3 1 2 1 4 2 2 2 1 3 2 1 1 2 1 1 2 2 1 1 1 2 1 2 1 1 2 1 2 1 3 2 1 2 1 4 2 1 2 1 3 1 1 2 1 2 1 3 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 1
Sum 2 4 6 6 8 6 7 6 5 6 4 2 1
2 2 1
3 4 3 2 2 3 4 3 2 1 0 0
6
6
1 2 2 2 1
number of prediction pixels sharing the corresponding computation equation. For example, four prediction pixels in DDL mode share the same computation equation, T3eq9. We can reuse the PEs which perform the same computation equation during the same cycle. Moreover, our design is able to predict RL modes with an optimal number of PEs by simultaneously performing two of the most similar modes. For example, DDL and VL modes have the greatest similarity according to Table 2.3. If our design predicts DDL and VL modes in the same cycle, only 7 three-tap filter PEs and 5 two-tap filter PEs are needed. However, DDR mode is similar to both VR mode and HD mode. If we predict DDR mode with HD mode and VR mode with HU mode in the same cycle, the optimal set of PEs will be 8 three-tap filter PEs and 7 two-tap filter PEs. Therefore, our proposed design predicts DDR mode with VR mode and HD mode with HU mode in the same cycle. By adopting this schedule, our design uses only 7 three-tap filter PEs and 5 two-tap filter PEs.
24
2 Intra Prediction
2.3.2 Architecture Figure 2.11 shows the top-level block diagram of the proposed design. Its primary inputs are reconstructed neighboring pixels, and its primary outputs are prediction pixels. It contains two RL engines and one Non-RL engine. We use two RL engines to predict two luma 4 4 blocks in parallel, and we design our Non-RL engine to generate 32 prediction pixels for two 4 4 blocks at a time.
Block-level DC Pixel Generator Block-level Skew Pixel Generator
Non-RL Engine
RL Engine Xcoord, Ycoord
MB-level HV Pixel Generator Block-level DC Pixel Generator Block-level Skew Pixel Generator
Multiplexer
Block-level Reference Pixels
Plane Pixel Calculator
Multiplexer
Plane Seed Pixel Generator
MB-level DC Pixel Generator
16 Prediction Pixels
Multiplexer
Plane Parameter Generator
MB-level Reference Pixels
Block-level HV Pixel Generator
Multiplexer
RL Engine
16 Prediction Pixels
Multiplexer
Block-level Reference Pixels
16 Prediction Pixels
16 Prediction Pixels
Block-level HV Pixel Generator Main Controller Reference Pixels Memory AG Enable, Address
Fig. 2.11 Top-level block diagram of the proposed design
2.3 A VLSI Design for Intra Prediction
2.3.2.1
25
RL Engine
The RL engine consists of a block-level skew pixel generator, a block-level DC pixel generator, and a block-level HV pixel generator. Both HV and DC pixel generators are easy to implement. The HV pixel generator generates prediction pixels for vertical and horizontal modes by bypassing the reference pixels, whereas the DC pixel generator uses a 4-pixel adder tree to compute the DC value. The skew pixel generator predicts pixels for DDL, DDR, HU, HD, VL, and VR modes. Its architecture is shown in Fig. 2.12. It consists of seven PE3s and five PE2s, where PE3 and PE2 are for three-tap filter and two-tap filter operations, respectively. It first takes 13 reference pixels and distributes them to appropriate PEs. Next, each PE selects input pixels to produce prediction values. Finally, 32 multiplexers select the prediction values according to mode specification. We design two customized architectures for PE3 and PE2, respectively, as depicted in Fig. 2.13. In each PE3, three multiplexers first select current input pixels. Next, two adders and one shifter sum up the current input pixels. Then, the rounding & shifting and clipping units proceed to postprocess the summation value. The architecture of PE2 is similar to that of PE3 except that it sums up two reference pixels before rounding. 2.3.2.2
Non-RL Engine
The Non-RL engine consists of three units: an MB-level HV pixel generator, an MB-level DC pixel generator, and an MB-level plane pixel generator. The HV pixel
PE3_1 PE3_2
PE3_5
HU, VR, VL
Multiplexer
PE3_4
Multiplexer
PE3_3
HD, DDR, DDL
PE3_6 PE3_7 13 block-level reference pixels PE2_1 PE2_2 PE2_3 PE2_4 PE2_5
Fig. 2.12 Architecture of the block-level skew pixel generator
26
2 Intra Prediction K G L F J H J D K C I E I H J GMH M E I D A F A F M E B G L C L B K D B B A A C C
1 d q = clip3(c1,– c1, d pi) q1 = q1+ d q
End
Fig. 7.6 Flow chart of weak edge filter
Fetch unfiltered pixels and coding information Generate boundary strength, threshold and filter Sample flag Perform edge filter
Flush filtered pixels
Cycles
96 99103
432
528
Fig. 7.7 Subtasks timing of the deblocking filter
7.1.3 Design Considerations Designing a deblocking filter has five considerations (1) processing cycles, (2) external memory traffic, (3) working frequency, (4) hardware cost, and (5) skip mode support for eliminating unnecessary filtering. We will describe them in the following subsections.
114
7.1.3.1
7 Deblocking Filter
Processing Cycles
The cycle counts of the deblocking filter is determined by both filter utilization and I/O interface width. Because there are 192 filtering operations per MB, at least 192 cycles are needed using a single edge filter and 96 cycles for two edge filters. Since there are 96 32 bits of pixel data to be read from previous stage and written to the external memory, we need at least 96 cycles using a 32-bit I/O interface and 48 using a 64-bit one.
7.1.3.2
External Memory Traffic
The deblocking filter’s external memory access is composed of three parts: write access of current MB, read/write access of left neighbor, and read/write access of top neighbor. The write access of current MB is constant but there is a tradeoff between local SRAM size and traffic for left and top neighbors. With a 32 32-bit of local memory, we can save 192 bytes of traffic for left neighbor. However, in order to eliminate upper neighbor traffic, we need to store the whole row of upper neighboring blocks with a huge local SRAM whose size is proportional to the picture width.
7.1.3.3
Working Frequency
The deblocking filter algorithm is both complex and adaptive. A single filtering operation implemented in hardware includes input selection from memories or registers, table look-up, threshold clipping, pixel comparison, pixel filtering with addition and shift, pixel clipping in case of overflow, and output to memories or registers. Therefore, it is important to carefully allocate these operations into different pipelined stages for high frequency application. A pipelined design should further ensure that there is no pipeline hazard during consecutive filtering operations given a filtering order. Otherwise, extra stall cycles have to be inserted and, hence, performance degrades.
7.1.3.4
Hardware Cost
Hardware cost is determined together by local memory size and logic gate count. The memory size depends on two factors. One is the tradeoff between external memory traffic and local SRAM size. The other is filtering order. A design with memory-efficient filtering order would completely filter a 4 4 block as soon as possible to reduce the size of the local buffer for partially filtered blocks. The logic gate count mainly depends on the number and structure (pipelined or nonpipelined) of edge filters and the number and size of transpose registers.
7.2 Related Works
115
Fig. 7.8 Profiling of Skip-top mode and Skip-all mode in P-type frames
7.1.3.5
Skip Mode Support for Eliminating Unnecessary Filtering
According to the distribution of BS values, two special MB modes (Skip-top mode and Skip-all mode) are very likely to occur in an inter-predicted frame. An MB of Skip-top mode has zero BS at its top boundary edges. Thus, we do not have to access external memory for its upper-neighbor pixels. An MB of Skip-all mode has all its BSs zero. Thus, no filtering is needed. Figure 7.8 shows the profiling result of these two modes in some P-type frames. A design that takes advantage of skip modes will need less memory access and fewer cycle counts.
7.2 Related Works Several previous designs focus on cycle reduction by increasing the filter utilization. Huang’s architecture [19] is the first that adopts straightforward filtering order. Chang’s design [17] reduces processing cycles by overlapping data transfer with filtering operation and employing an adaptive macroblock transmission scheme to skip transfer of pixels that need no filtering. Khurana’s design [33] and Shih’s designs [58, 59] achieve near-optimal cycle counts using single filter by performing horizontal filtering and vertical filtering alternately. Later, Lin [52] proposes an architecture with two edge filters, an 128-bit I/O, and a large local SRAM for cycle reduction. In order to integrate into a hardwired decoder or codec, Liu [48] and Shih [59] hold both left and upper neighbor pixels in a large local SRAM. The later further reduces the size for upper neighbor by storing only half of the chroma components in
116
7 Deblocking Filter
the local memory. On the contrary, Chang [17] proposes a bus-connected accelerator using an adaptive macroblock transmission scheme. It need not store neighbor pixels in local SRAM. However, it requires extra external memory traffic. For achieving high working frequency, many previous designs employ pipelined edge filters. Li [41] proposes a four-stage pipelined data path that is 47% faster than a nonpipelined one. Chao [16], Bojnordi [2], and Khurana [33] utilize their pipelined edge filters with hazard-free filtering order. Shih [59] employs a forwarding scheme to prevent hazard from happening. For hardware resource saving, Chao’s [16] and Khurana’s [33] architectures use novel filtering order and hence need less local memory. Li [41], Lin [52], and Bojnordi [2] all propose special multibank memory organizations to save the use of transpose registers.
7.3 A VLSI Design for Deblocking Filter Based on the above design considerations, our target is to fulfill the demand of ultra-high throughput with low hardware overhead. Thus, we propose a DF architecture [43], which nearly achieves the lower bound on cycle count using two filters and a 64-bit I/O interface. Furthermore, our design can save all left neighbor traffic and most upper neighbor traffic in inter-predicted frames without storing the entire row of upper neighbors in the local SRAM. It employs a five-stage pipelined and hardware-shared dual-edge filter to generate two filtering results every cycle.
7.3.1 Subtasks Scheduling In order to achieve ultra-high throughput, we must well schedule the processing of subtasks of the deblocking filter as shown in Fig. 7.9. In the beginning, it generates the boundary strengths of all edges. Then, it employs a five-stage pipelined dual-edge filter and a 64-bit I/O to filter two boundary edges within 4 cycles. When filtering the rightmost macroblock, it will further spends cycles to flush the rightmost filtered pixels. Because our H.264/AVC encoder is also pipelined, we can further reduce 48 cycles by overlapping boundary strengths generation with filtering of previous blocks.
7.3.2 Architecture Figure 7.10 shows the block diagram of our deblocking filter architecture. The three dotted boxes, BSG, DF, and MAU-DF, represent the modules involved in the deblocking filtering process. Outside the dotted boxes are global memories.
7.3 A VLSI Design for Deblocking Filter
117
Generate boundary strengths
Fetch unfiltered pixels and generate threshold Perform left strong filter Perform left weak filter Perform right weak filter
output filtered pixels
flush rightmost filtered pixels
Cycles
48
52
56
148 160
Fig. 7.9 Subtasks scheduling
These three modules are executed at different pipelined stages. When the DF module filters MB N , the Upper-neighbor Prefetch submodule reads the upper neighbor of MB (N C 1) from the reference frame and the BSG module calculates the boundary strengths of MB (N C 2). Meanwhile the Ref-frame Interface submodule takes the filtered pixels of MB (N 1) or (N 2) from the Filtered memory to the external memory. The BSG module calculates BSs of an MB within 48 cycles. It gets motion vectors, reference indexes, coding information, and filtering flags from the input memories MV0, MV1, Refidx0, Refidx1, Para, and Predinfo, respectively. BSG also informs the MAU-DF module of Skip-top mode and the DF module of Skip-all mode. The MAU-DF module manages the external memory access including the pixel prefetch for upper neighbor and the output for filtered pixels. The submodule, Upper-neighbor Prefetch, reads the upper neighboring blocks from the reference frame and stores them into the Upper-neighbor memory. After an MB is completely filtered, the Ref-frame interface will store this MB as well as its upper neighbor from the Filtered memory to the reference frame. If it is a Skip-top MB, we will not access its upper neighbor.
118
7 Deblocking Filter Skip-top mode
Skip-all mode
Reconstruct
MAU-DF
Upper-neighbor Pre-fetch Upper-neighbor 64
Filtered
T1
ROP Selector
64 64 64
LocalPixel 64
32
32
BS Generator
BS
32
Threshold Generator BS Selector
Two-Result-Per-Cycle Edge Filter
32
Memory
64
DF
Local-Pixe l Selector
Coding information
Ref-frame Interface
T2 32
32
Logic block
Fig. 7.10 Proposed deblocking filter block diagram
The DF module needs the pixels of the current MB as well as its left and upper neighbor as inputs. It takes the unfiltered current MB through the Reconstruct memory and the upper neighboring blocks through the Upper-neighbor memory. After an MB is filtered, its rightmost column of 4 4 blocks has been stored in the LocalPixel memory as the left neighbor of the next MB. The DF module is composed of several submodules including Threshold Generator, BS Selector, Two-Result-Per-Cycle Edge Filter, T0, T1, and memory interfaces. Threshold Generator and BS selector provide two sets of boundary strengths and threshold variables for the Two-Result-Per-Cycle Edge Filter. Two transpose registers, T0 and T1, are used for transposing pixels. The Two-Result-Per-Cycle Edge Filter takes 12 pixels as its input and filters them through a five-stage pipeline. It selects pixels from the Local-Pixel memory, Reconstruct memory, T0, T1, and its own output port. The Local-Pixel Selection submodule selects the output of Two-ResultPer-Cycle Edge Filter and T0 or the unfiltered pixels of the Reconstruct memory to store. Parallel-connected ROP Pixel Selector submodule selects two individual 4 4 blocks as input and outputs them alternately with a parallel-connected Rowof-Pixel.
7.3.2.1
Filtering Order
Figure 7.11 shows the filtering order of our Two-Result-Per-Cycle Edge Filter. It can concurrently filter 12 pixels across two edges every cycle. The number associated with an edge denotes the processing step. The letters “L” and “R” denote the left
7.3 A VLSI Design for Deblocking Filter
L0 5L
T0
T1
T2
T3
8L
23L
11L
10L
B0 5R B1 9L 23R
8R L1 6L
B2
B3
11R
B4 6R B5 7L B6 19L
9R
24L
10R 7R
B7
22L
21L
L2 16L B8 16R B9 20L B10 20R 19R L3 17L
24R 17R
B12
22R 18L
B11
T5
4L
3L
4R L5
B15
T4
B16 L4 1L 1R
21R 18R
B14
B13
119
3R
2L B18 2R
Y
B17
Cb
B19
T6
T7
15L
14L
L6 12L B20 12R
B21 14R
15R L7 13L B22 13R
B23
Cr
Fig. 7.11 Proposed filtering order
or right data paths used. We start from the Cb block which are followed by the top-half of the Y block except “23L” and “23R.” Then we filter the Cr block and the remaining edges of the Y block. This order ensures no access conflict in the local and output memory. Therefore, we can filter all 48 edges without any stall cycles.
7.3.2.2
Memory Organization
We employ a row-of-pixels (ROP) approach for data placement in the memories. Figure 7.12 shows the pixel organizations of the input memories, output memory, and local memory. The Reconstruct memory could provide 8 pixels with serialconnected ROP in 1 cycle. The Upper-neighbor memory and Filtered memory store pixels as parallel-connected ROP. Both Reconstruct and Upper-neighbor memories adopt a ping-pong structure. The Filtered memory expends 60 64 3 bits for three MBs to collect filtered pixels for effective off-chip memory access. In the LocalPixel memory, we use a two-ported two-bank SRAM (28 32 2 bits) to store the left neighboring pixels and temporal pixels.
7.3.2.3
Two-Result-Per-Cycle Edge Filter
Figure 7.13 shows the proposed five-stage pipelined dual-edge filter. Stage 1 gets 12 pixels from the memories, transpose registers, or output of the edge filter. Stage 2 filters the left edge with left strong filter (BS D 4). The left delta values of the left edge and the R21 delta value of the right edge are also derived at this stage, since “R21” is ready. Stage 3 filters the left edge with left weak edge filter (BSD .1 >k)&0x01); k D k 1; endwhile break ; endif endwhile return with Binstream
8.1 Introduction
8.1.1.3
129
Context Modeler
The context modeler generates a context index (CtxIdx) for each bin to look up its probability model. The modeler contains 399 (defined in H.264/AVC main profile) probability models which are used to estimate the probability of a bin. The probability is then passed to the binary arithmetic encoder to generate output bit-stream. We employ two equations to calculate CtxIdx. The first one is “CtxIdx D CtxIdxOffset C CtxIdxInc,” and the second one is “CtxIdx D CtxIdxOffset C CtxIdxInc C CtxBlkCatOffset.” The CtxIdxOffset value is dependent on SE type. The values of CtxIdxInc and CtxBlkCatOffset depend on four reference schemes. The first equation is used in the first three reference schemes while the second one is used in the last scheme.
Reference to Syntax Elements in the Neighboring Macroblock The two parameters, CtxIdx Left-Inc and CtxIdx Top-Inc, are derived from SEs of the neighboring (top and left) macroblocks or subblocks. Their sum gives CtxIdxInc.
Reference to Previously Coded Bin Value In this scheme, CtxIdxInc is derived from some previously coded bin values.
Reference to Syntax Elements in Previously Encoded Macroblocks CtxIdxInc is derived from some SEs of the macroblock that precedes the current macroblock in the encoding order. When the current macroblock is the first macroblock of a slice, there is no macroblock preceding the current macroblock.
Reference to Syntax Elements in the Same Coefficient Block This scheme is used for calculating CtxIdxInc and CtxBlkCatOffset of the second equation. In Table 8.1, CtxBlkCatOffset is derived from two parameters, SE type and context category. The context category is decided by residual block type, e.g., Luma DC, Luma AC, or Chroma DC, as shown in Table 8.2. For calculating CtxIdxInc, two methods are used. The first method relies on the position in the scanning path and is applied to SE types sig coeff flag and last sig coeff flag. The second method relies on two variables and is applied to coeff level type. One variable is used to accumulate the number of coeff level SEs whose value is greater than 1; the other variable is used to accumulate the number of coeff level SEs with value equals to 1.
130
8 CABAC Encoder Table 8.1 CtxBlkCatOffset values as function of context categories and SE types Context category 0 1 2 3 4 coded block flag 0 4 8 12 16 sig coeff flag 0 15 29 44 47 0 15 29 44 47 last sig coeff flag 10 20 30 39 coeff abs level minus1 0
Table 8.2 Basic block types and associated context categories Block type Context category Luma DC for intra 16 16 0 Luma AC for intra 16 16 1 Luma for intra 4 4 and Inter 2 Chroma DC 3 Chroma AC 4
8.1.1.4
Binary Arithmetic Encoder
CABAC employs a recursive interval subdivision procedure as illustrated in Fig. 8.2 to produce a series of bits. It uses two variables, Low and Range, to specify the subinterval. In addition, since all bins are either of Most Probable Symbol (MPS) or of Least Probable Symbol (LPS), the subinterval is further divided into two parts, RangeMPS (rMps) and RangeLPS (rLps). rMps is used as the new subinterval if the bin is an MPS; otherwise, rLps is used. The new subinterval is renormalized if it becomes too small. Several bits of Low are output to the bit-stream during each renormalization. Three arithmetic coding engines employed by binary arithmetic encoder are described next. Their main difference lies in the source of input probability model. Both bypass and terminal coding engines use predefined probability while the regular engine uses that computed by the context modeler.
Regular Coding Engine Algorithm 8.3 depicts the regular coding engine. Because this engine needs to fetch and update the probability model, it is more complex than the others. In the first step, it uses the rangeTabLPS table to simplify the calculation of “Range Probability” and gets an approximate RangeLPS value. In the second step, new Low and Range are computed according to whether the bin is MPS. The third step is used to update the probability model (i.e., MPS and pStateIdx). The last step, RenormE presented in Algorithm 8.4, performs renormalization process. The renormalization process contains shift, subtraction, and PutBit function. If the value of Range is too small, it (i.e., RenormE step) will renormalize it and output the overflow bits of Low to the bit-stream.
8.1 Introduction
131 Range Lps
RangeMps Low0 Range0 (interval)
Range Mps Low1 Range1
Low 2 Output bit stream
Range2
Fig. 8.2 Interval subdivision procedure in binary arithmetic encoding Algorithm 8.3: The regular coder. regular coding engine (Range,Low,context data,bin) q D (Range>>)&3; RangeLPS D rangeTabLPS[pStateIdx][q]; RangeMPS D Range-RangeLPS; if bin ! D MPS Low D Low C RangeMPS; Range D RangeLPS; if pStateIdx! D 0 MPS=1-MPS; endif pStateIdx D transIdxLPS[pStateIdx]; else Range D RangeMPS; pStateIdx D transIdxMPS[pStateIdx]; endif RenormE (Range,Low);
Bypass Coding Engine The probability for the bypass engine is predefined as 1/2, so the new Range will be updated as Range/2. However, as shown in Algorithm 8.5, the renormalization process is combined with this engine’s interval subdivision procedure, so the Range is unchanged. Finally, output bits of Low are generated.
Terminal Coding Engine The terminal engine as depicted in Algorithm 8.6 is similar to that of the regular engine except that its RangeLPS always equals to 2. Besides, at the end of each slice, this engine flushes remaining undetermined output bits (bitOutstanding bits).
132
8 CABAC Encoder
Algorithm 8.4: The renormalization process. RenormE (Range,Low) while Range