The Regularized Fast Hartley Transform: Low-Complexity Parallel Computation of the FHT in One and Multiple Dimensions [2 ed.] 3030682447, 9783030682446

This book describes how a key signal/image processing algorithm – that of the fast Hartley transform (FHT) or, via a sim…


Table of contents:
Preface
Audience
Acknowledgements
Contents
About the Author
Part I: The Discrete Fourier and Hartley Transforms
Chapter 1: Background to Research
1.1 Introduction
1.2 The DFT and Its Efficient Computation
1.3 Twentieth-Century Developments of the FFT
1.4 The DHT and Its Relation to the DFT
1.5 Attractions of Computing the Real-Data DFT via the FHT
1.6 Modern Hardware-Based Parallel Computing Technologies
1.7 Hardware-Based Arithmetic Units
1.8 Performance Metrics and Constraints
1.9 Key Parameters, Definitions and Notation
1.10 Organization of Monograph
References
Chapter 2: The Real-Data Discrete Fourier Transform
2.1 Introduction
2.2 Real-Data FFT Algorithms
2.2.1 The Bergland Algorithm
2.2.2 The Bruun Algorithm
2.3 Real-From-Complex Strategies
2.3.1 Computation of Real-Data DFT via Complex-Data FFT
2.3.2 Computation of Two Real-Data DFTs via Complex-Data FFT
2.3.3 Computation of Real-Data DFT via Half-Length Complex-Data FFT
2.4 Data Reordering
2.5 Discussion
References
Chapter 3: The Discrete Hartley Transform
3.1 Introduction
3.2 Orthogonality of DHT
3.3 Decomposition into Even and Odd Components
3.4 Connecting Relations Between DFT and DHT
3.4.1 Real-Data DFT
3.4.2 Complex-Data DFT
3.5 Fundamental Theorems for DFT and DHT
3.5.1 Reversal Theorem
3.5.2 Addition Theorem
3.5.3 Shift Theorem
3.5.4 Convolution Theorem
3.5.5 Product Theorem
3.5.6 Autocorrelation Theorem
3.5.7 First Derivative Theorem
3.5.8 Second Derivative Theorem
3.5.9 Summary of Theorems and Related Properties
3.6 Fast Solutions to DHT - The FHT Algorithm
3.7 Accuracy Considerations
3.8 Discussion
References
Part II: The Regularized Fast Hartley Transform
Chapter 4: Derivation of Regularized Formulation of Fast Hartley Transform
4.1 Introduction
4.2 Derivation of the Conventional Radix-4 Butterfly Equations
4.3 Single-to-Double Conversion of Radix-4 Butterfly Equations
4.4 Radix-4 Factorization of the FHT
4.5 Closed-Form Expression for Generic Radix-4 Double Butterfly
4.5.1 Twelve-Multiplier Version of Generic Double Butterfly
4.5.2 Nine-Multiplier Version of Generic Double Butterfly
4.6 Trigonometric Coefficient Storage, Retrieval and Generation
4.6.1 Minimum-Arithmetic Addressing Scheme
4.6.2 Minimum-Memory Addressing Scheme
4.6.3 Trigonometric Coefficient Generation via Trigonometric Identities
4.7 Comparative Complexity Analysis with Existing FFT Designs
4.8 Scaling Considerations for Fixed-Point Implementation
4.9 Discussion
References
Chapter 5: Design Strategy for Silicon-Based Implementation of Regularized Fast Hartley Transform
5.1 Introduction
5.2 The Fundamental Properties of FPGA and ASIC Devices
5.3 Low-Power Design Techniques
5.3.1 Clock Frequency
5.3.2 Silicon Area
5.3.3 Switching Frequency
5.4 Proposed Hardware Design Strategy
5.4.1 Scalability of Design
5.4.2 Partitioned-Memory Processing
5.4.3 Flexibility of Design
5.5 Constraints on Available Resources
5.6 Assessing the Resource Requirements
5.7 Discussion
References
Chapter 6: Architecture for Silicon-Based Implementation of Regularized Fast Hartley Transform
6.1 Introduction
6.2 Single-PE Versus Multi-PE Architectures
6.3 Conflict-Free Parallel Memory Addressing Schemes
6.3.1 Parallel Storage and Retrieval of Data
6.3.2 Parallel Storage, Retrieval and Generation of Trigonometric Coefficients
6.3.2.1 Minimum-Arithmetic Addressing Scheme
6.3.2.2 Minimum-Memory Addressing Scheme
6.3.2.3 Comparative Analysis of Addressing Schemes
6.4 Design of Pipelined PE for Single-PE Recursive Architecture
6.4.1 Parallel Computation of Generic Double Butterfly
6.4.2 Space-Complexity Considerations
6.4.3 Time-Complexity Considerations
6.5 Performance and Requirements Analysis of FPGA Implementation
6.6 Derivation of Range of Validity for Regularized FHT
6.7 Discussion
References
Chapter 7: Design of CORDIC-Based Processing Element for Regularized Fast Hartley Transform
7.1 Introduction
7.2 Accuracy Considerations
7.3 Fast Multiplier Approach
7.4 CORDIC Arithmetic Approach
7.4.1 CORDIC Formulation of Complex Multiplier
7.4.2 Parallel Formulation of CORDIC-Based PE
7.4.3 Discussion of CORDIC-Based Solution
7.4.4 Logic Requirement of CORDIC-Based PE
7.5 Comparative Analysis of PE Designs
7.6 Discussion
References
Part III: Applications of Regularized Fast Hartley Transform
Chapter 8: Derivation of Radix-2 Real-Data Fast Fourier Transform Algorithms Using Regularized Fast Hartley Transform
8.1 Introduction
8.2 Computation of Real-Data DFT via Two Half-Length Regularized FHTs
8.2.1 Derivation of Radix-2 Algorithm via Double-Resolution Approach
8.2.2 Implementation of Double-Resolution Algorithm
8.2.2.1 Single-FHT Solution for Computation of Regularized FHTs
8.2.2.2 Two-FHT Solution for Computation of Regularized FHTs
8.2.2.3 Comparative Analysis of Solutions
8.3 Computation of Real-Data DFT via One Double-Length Regularized FHT
8.3.1 Derivation of Radix-2 Algorithm via Half-Resolution Approach
8.3.2 Implementation of Half-Resolution Algorithm
8.4 Comparative Complexity Analysis with Standard Radix-2 FFT
8.5 Discussion
References
Chapter 9: Computation of Common DSP-Based Functions Using Regularized Fast Hartley Transform
9.1 Introduction
9.2 Fast Transform-Space Convolution and Correlation
9.3 Up-Sampling and Differentiation of Real-Valued Signal
9.3.1 Up-Sampling via Hartley-Space
9.3.2 Differentiation via Hartley-Space
9.3.3 Combined Up-Sampling and Differentiation
9.4 Correlation of Two Arbitrary Signals
9.4.1 Computation of Complex-Data Correlation via Real-Data Correlation
9.4.2 Cross-Correlation of Two Finite-Length Data Sets
9.4.3 Auto-Correlation: Finite-Length Against Infinite-Length Data Sets
9.4.4 Cross-Correlation: Infinite-Length Against Infinite-Length Data Sets
9.4.5 Combining Functions in Hartley-Space
9.5 Channelization of Real-Valued Signal
9.5.1 Single Channel: Fast Hartley-Space Convolution
9.5.2 Multiple Channels: Conventional Polyphase DFT Filter Bank
9.5.2.1 Alias-Free Formulation
9.5.2.2 Implementation Issues
9.6 Distortion-Free Multi-Carrier Communications
9.7 Discussion
References
Part IV: The Multi-dimensional Discrete Hartley Transform
Chapter 10: Parallel Reordering and Transfer of Data Between Partitioned Memories of Discrete Hartley Transform for 1-D and m-D Cases
10.1 Introduction
10.2 Memory Mappings of Regularized FHT
10.3 Requirements for Parallel Reordering and Transfer of Data
10.4 Sequential Construction of Reordered Data Sets
10.5 Parallelization of Data Set Construction Process
10.6 Parallel Transfer of Reordered Data Sets
10.7 Discussion
References
Chapter 11: Architectures for Silicon-Based Implementation of m-D Discrete Hartley Transform Using Regularized Fast Hartley Transform
11.1 Introduction
11.2 Separable Version of 2-D DHT
11.2.1 Two-Stage Formulation of 2-D SDHT
11.2.2 Hartley-Space Filtering of 2-D Data Sets
11.2.3 Relationship Between 2-D SDHT and 2-D DFT
11.3 Architectures for 2-D SDHT
11.3.1 Single-FHT Recursive Architecture
11.3.2 Two-FHT Pipelined Architecture
11.3.3 Relative Merits of Proposed Architectures
11.4 Complexity Analysis of 2-D SDHT
11.4.1 Complexity Summary for Regularized FHT
11.4.2 Space-Complexity of 2-D Solutions
11.4.3 Time-Complexity of 2-D Solutions
11.4.4 Computational Density of 2-D Solutions
11.4.5 Comparative Complexity of 2-D Solutions
11.4.6 Relative Start-up Delays and Update Times of 2-D Solutions
11.4.7 Application of 2-D SDHT to Filtering of 2-D Data Sets
11.4.8 Application of 2-D SDHT to Computation of 2-D Real-Data DFT
11.5 Generalization of 2-D Solutions to Processing of m-D Data Sets
11.5.1 Space and Time Complexities of m-D Solutions
11.5.2 Comparative Complexity of m-D Solutions
11.5.3 Relative Start-up Delays and Update Times of m-D Solutions
11.6 Constraints on Achieving and Maintaining Real-Time Operation
11.7 Discussion
References
Part V: Results of Research
Chapter 12: Summary and Conclusions
12.1 Outline of Problems Addressed
12.2 Summary of Results
12.3 Conclusions
References
Appendix A: Computer Programme for Regularized Fast Hartley Transform
A.1 Introduction
A.2 Description of Functions
A.2.1 Control Routine
A.2.2 Generic Double Butterfly Routines
A.2.3 Address Generation and Data Reordering Routines
A.2.4 Data Memory Retrieval and Updating Routine
A.2.5 Trigonometric Coefficient Generation Routines
A.2.6 Look-Up Table Generation Routines
A.2.7 FHT-to-FFT Conversion Routine
A.3 Brief Guide to Running the Programme
A.4 Available Scaling Strategies
Appendix B: Source Code for Regularized Fast Hartley Transform
B.1 Listings for Main Programme and Signal Generation Routine
B.2 Listings for Preprocessing Functions
B.3 Listings for Processing Functions
Appendix C: MATLAB Code for Parallel Reordering of Data via Dibit-Reversal Mapping
C.1 Listing for MATLAB Data Reordering Program
C.2 Discussion
Glossary
Index

Keith John Jones

The Regularized Fast Hartley Transform Low-Complexity Parallel Computation of the FHT in One and Multiple Dimensions Second Edition


Keith John Jones Wyke Technologies Ltd. Weymouth, Dorset, UK

ISBN 978-3-030-68244-6    ISBN 978-3-030-68245-3 (eBook)
https://doi.org/10.1007/978-3-030-68245-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2010, 2022 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Most real-world spectrum analysis problems involve the computation of the real-data discrete Fourier transform (DFT), a unitary transform that maps elements of the linear (or vector) space of real-valued N-tuples, R^N, to elements of its complex-valued counterpart, C^N. The computation is conventionally carried out via a ‘real-from-complex’ strategy using a complex-data version of the familiar fast Fourier transform (FFT), the generic name given to the class of fast recursive algorithms used for the efficient computation of the DFT. Such algorithms are typically derived by exploiting the property of symmetry, whether it exists in just the transform kernel or, in certain circumstances, in the input data and/or output data as well. When the input data to the DFT is real-valued, for example, the resulting output data is in the form of a Hermitian (or conjugate)-symmetric frequency spectrum which may be exploited to some advantage in terms of reduced arithmetic-complexity. To make effective use of a complex-data FFT, however, via the chosen real-from-complex strategy, the input data to the DFT must first be converted from elements of the linear space R^N to those of C^N.

The reason for choosing the computational domain of real-data problems such as this to be C^N, rather than R^N, is due in part to the fact that manufacturers of computing equipment have invested so heavily in producing digital signal processing (DSP) devices built around the design of the fast complex-data multiplier-and-accumulator (MAC). This is an arithmetic unit that’s ideally suited to the implementation of the radix-2 butterfly, which is the computational engine used for carrying out the repetitive arithmetic operations required by the complex-data version of the radix-2 FFT. The net result of such a strategy is that the problem of computing the real-data DFT is effectively modified so as to match an existing complex-data solution, rather than a solution being sought that matches the actual problem needing to be solved – which is the approach that’s been adopted in this book.

The accessibility of the increasingly powerful field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) technologies is now giving DSP design engineers far greater control, however, over the type of algorithm that may be used in the building of high-performance DSP systems, so that more
appropriate algorithmically specialized solutions to the real-data DFT may be actively sought and exploited to some advantage with implementations based upon the use of these silicon-based technologies. These technologies facilitate the use of both multiple arithmetic units – such as those based upon the fast multiplier and/or the CORDIC phase rotator – and multiple banks of fast memory in order to enhance the performance of key signal processing algorithms, such as the FFT, via their parallel computation.

The first part of the book, after providing the background information necessary for a better understanding of both the problems to be addressed and of the proposed solutions, concerns itself with the design of a new and highly parallel formulation of the fast Hartley transform (FHT) which is to be used, in turn, for the efficient computation of the real-data DFT – where both transforms are restricted to the one-dimensional (1-D) case – which would, in turn, enable it to be used for those DSP-based problems commonly addressed via the FFT. The FHT is the generic name given to the class of fast recursive algorithms used for the efficient computation of the discrete Hartley transform (DHT) – a bilateral and orthogonal transform and close relative of the DFT that possesses many of the same properties – which, for the processing of real-valued data, has attractions over the complex-data FFT in terms of reduced arithmetic and memory requirements. Its bilateral property means that it may be straightforwardly applied to the transformation from Hartley-space to data-space as well as from data-space to Hartley-space, thus making it equally applicable to the computation of both the forward and the inverse DFT algorithms and an attractive option for carrying out filtering-type operations with real-valued data.

A drawback, however, of conventional FHT algorithms lies in the lack of regularity (as relates to the algorithm structure and which equates to the amount of repetition and symmetry present in the design) arising from the need for two sizes of butterfly – and thus for two separate butterfly designs – single-sized and double-sized for efficient fixed-radix formulations where, for a radix ‘R’ algorithm, a single-sized butterfly produces R outputs from R inputs whilst a double-sized butterfly produces 2R outputs from 2R inputs. A generic version of the double-sized butterfly, to be referred to as the generic double butterfly, has therefore been sought for the radix-4 factorization of the FHT that might overcome the problem in an elegant fashion, where the resulting single-design solution, to be referred to as the regularized FHT, would be required to lend itself to an efficient implementation with parallel computing technology – as typified by the FPGA and the ASIC.

Consideration has been given in the design process to the fact that when producing electronic equipment, whether for commercial or military use, great emphasis is inevitably placed upon minimizing the unit cost so that the design engineer is seldom blessed with the option of using the latest state-of-the-art device technology. The most common situation encountered is one where the expectation is to use the smallest (and thus least expensive) device that’s capable of yielding solutions able to meet the desired performance objectives, which means using devices that are often one or more generations behind the latest specification. As a result, there are situations where there would be great merit in having designs that are not totally
reliant on the increasing availability of large quantities of expensive embedded resources, such as the fast multipliers and fast memory provided by the manufacturers of the latest silicon-based devices, but are sufficiently flexible as to yield efficient implementations in silicon even when such resources are scarce. To help address the problem, several versions of a processing element (PE) – as required for the low-complexity parallel computation of the generic double butterfly – have been sought which are each required to be a simple variation of the same basic design and each compatible with a single-PE computing architecture. This would enable parallel solutions to be defined for the radix-4 factorization of the FHT and the real-data FFT that are resource-efficient (whereby just a single PE may be used by the transform to carry out the repetitive arithmetic operations required by each and every instance of the large double butterfly), scalable (which refers to the ease with which the solution may be modified in order to accommodate increasing or decreasing transform sizes) and possess universal application (in that each new application would necessitate minimal re-design effort and costs). Such solutions would enable the use of the available silicon resources to be optimized so as to maximize the achievable computational density – that is, the throughput per unit area of silicon – and, in so doing, to match the computational density of the most advanced commercially available complex-data FFTs (which are invariably based upon a multi-PE computing architecture) for potentially just a fraction of the silicon resources. A further requirement of any design is that it should be able to cater for a range of resource-constrained environments, as might be encountered in applications typified by that of mobile communications, for example, where a small battery may be the only source of power supply for long periods of time so that power-efficiency would have to be a key requirement in the design of any such solution. Such a design process inevitably involves particular resources being consumed and traded off, one against another, this being most simply expressed in the general terms of a trade-off of the power requirement (which would have to satisfy some pre-defined constraint) against the space-complexity (through the memory and arithmetic components – which typically exist as embedded resources on an FPGA – and the programmable logic) and the time-complexity (through either the latency or the update time – although for a single-PE architecture the two parameters are identical – which would be constrained by the data set refresh rate). The choice of which particular computing device to use for carrying out a comparative analysis of the various design options to be considered in the book has not been regarded as being of relevance to the results obtained, as the intention has been that the attractions of the proposed solutions should be considered to be valid regardless of the specific device onto which they are mapped – that is, a ‘good’ design should be device-independent. The author is well aware, however, that the intellectual investment made in achieving such a design may seem to fly in the face of current wisdom, whereby the need for good engineering design and practice is often dispensed with through the adoption of ever-more costly and powerful (and power consuming) computing devices offering seemingly endless quantities of embedded resources for overcoming design issues.

Thus, in order to meet the desired design objectives subject to the stated constraints, a single-PE architecture has been sought for the parallel computation of the generic double butterfly and of the resulting regularized FHT that would yield attractive solutions, particularly when implemented with parallel computing technology, that would be resource-efficient, scalable and device-independent (so that they would not be dependent upon the specific characteristics of any particular device, being able to exploit whatever resources happen to be available on the target device). A high computational throughput has been sought by exploiting a combination of parallel processing techniques, these including the use of both pipelined and single-instruction multiple-data (SIMD) processing and the adoption and exploitation of partitioned memory for the parallel storage and retrieval of both the data and the trigonometric coefficients (as defined by the transform kernel).

The above issues have already been successfully addressed and discussed in some depth in a previous Springer book from 2010: The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments (Signals and Communication Technology Series). A new feature of the present edition of the book – which is an updated and expanded version of the previous edition – involves the search for attractive solutions for the parallel computation of the multi-dimensional (m-D) DHT – and, in particular, of a separable version referred to as the m-D SDHT – and equivalently, via the relationship of their kernels, of the m-D real-data DFT. The aim has been to use the regularized FHT as a building block in producing attractive parallel solutions that would exploit the already proven benefits of the regularized FHT and which, like those solutions for the 1-D case, would be resource-efficient, scalable and device-independent, being thus able to optimize the use of the available silicon resources so as to maximize the achievable computational density. The adoption of the regularized FHT for the provision of such solutions to the 2-D and 3-D problems, in particular, would enable it to be beneficially used as a key component in the design of systems for the real-time processing of 2-D and 3-D images, respectively, as well as that of conventional 1-D signals.

Weymouth, UK

Keith John Jones

Audience

The book is aimed at practising DSP engineers, academics, researchers and students from engineering, computer science and mathematics backgrounds with an interest in the design and implementation of sequential and parallel algorithms and, in particular, of the FHT and the real-data FFT for both 1-D and m-D cases. It is intended, in particular, to provide the reader with the tools necessary to both understand and implement the new formulations with a choice of simple design variations that offer clear implementational attractions/advantages, both theoretical and practical, when compared to more conventional solutions based upon the adoption of the familiar complex-data FFT.


Acknowledgements

I would like to thank my wife Deborah for suggesting and encouraging the work as I would otherwise have been forced to engage more fully with my DIY and gardening duties – with my skills, like my enthusiasm, leaving much to be desired. Thanks also to our aging cat, Titus, who has spent many an hour sat with me eating biscuits and staring at an unchanging computer screen!



About the Author

Keith John Jones is a Chartered Mathematician (CMath) and Fellow of the Institute of Mathematics & its Applications (FIMA), in the UK, having obtained a BSc Honours degree in Mathematics from the University of London in 1974 as an external student, an MSc in Applicable Mathematics from Cranfield Institute of Technology in 1977, and a PhD in Computer Science from Birkbeck College, University of London, in 1992, again as an external student. The PhD was awarded primarily for research into the design of novel systolic processor array architectures for the parallel computation of the discrete Fourier transform (DFT). Dr Jones has subsequently published widely in the signal processing and sensor array processing fields, having a particular interest in the application of number theory, algebra and non-standard arithmetic techniques to the design of low-complexity algorithms and circuits for efficient implementation with suitably defined parallel computing architectures. Dr Jones, who also holds a number of patents in these fields, has been regularly named in both the Who’s Who in Science and Engineering and the Dictionary of International Biography (otherwise known as the ‘Cambridge Blue Book’) since 2008.


Part I: The Discrete Fourier and Hartley Transforms

Chapter 1: Background to Research

1.1 Introduction

The subject of spectrum or harmonic analysis started in earnest with the work of Joseph Fourier (1768–1830), who asserted and proved that an arbitrary function could be represented via a suitable transformation as a sum of trigonometric functions [6]. It seems likely, however, that such ideas were already common knowledge amongst European mathematicians by the time Fourier appeared on the scene, mainly through the earlier work of Joseph-Louis Lagrange (1736–1813) and Leonhard Euler (1707–1783), with the first appearance of the discrete version of this transformation, the discrete Fourier transform (DFT) [45, 48], dating back to Euler’s investigations of sound propagation in elastic media in 1750 and to the astronomical work of Alexis Claude Clairaut (1713–1765) in 1754 [26]. The DFT is now widely used in many branches of science, playing in particular a central role in the field of digital signal processing (DSP) [45, 48], enabling digital signals – namely, those that have been both sampled and quantized – to be viewed in the frequency domain where, compared to the time domain, the information contained in the signal may often be more easily extracted and/or visualized, or where many common DSP functions, such as that of the finite impulse response (FIR) filter or the matched filter [45, 48], may be more easily or efficiently carried out.

The monograph is primarily concerned with the problem of computing the real-valued discrete Hartley transform (DHT) and equivalently, via the relationship of their kernels, of the above-mentioned DFT, initially in just one dimension (1-D) but later to be extended to that of multiple dimensions (m-D). Solutions are achieved via the application of various factorization techniques and are targeted at implementation with silicon-based parallel computing equipment – as typified by the field-programmable gate array (FPGA) and the application-specific integrated circuit (ASIC) technologies [40] – bearing in mind the size and power constraints relevant to the particular field of interest. With mobile communications, for example, a small battery may be the only source of power supply for long periods of time so that power efficiency would have to be a key requirement of any such solution. Through the adoption of the DHT, the monograph looks also to exploit the fact that the measurement data, as with many real-world problems, is real-valued in nature, with each sample of data thus belonging to R, the field of real numbers [4], although the restriction to fixed-point implementations effectively limits the range of interest still further to that of Z, the commutative ring of integers [4].

1.2 The DFT and Its Efficient Computation

Turning firstly to its definition, the DFT is a unitary transform [19], which, for the 1-D case of N input/output samples, may be expressed in normalized form via the equation:

$$X^{(F)}[k] = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n] \, W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \tag{1.1}$$

where the input/output data vectors belong to C^N, the linear (or vector) space of complex-valued N-tuples [4]. This equation may be rewritten in matrix-vector form as

$$X_N^{(F)} = F_{N \times N} \, x_N \tag{1.2}$$

where F_{N×N}, the N × N complex-valued Fourier matrix, is the matrix representation of the transform kernel whose elements derive from the term:

$$W_N = \exp(-i 2\pi / N), \qquad i = \sqrt{-1}, \tag{1.3}$$

the primitive N’th complex root of unity [4, 41, 43]. The Fourier matrix is clearly conjugate-symmetric about its leading diagonal and therefore equal to its own conjugate-transpose. The unitary nature of the DFT means, in addition, that the inverse of the Fourier matrix is equal to its conjugate-transpose, whilst its columns form an orthogonal basis [6, 7, 19] – similarly, a transform is said to be orthogonal when the inverse of the transform matrix is equal simply to its transpose, as is the case with any real-valued kernel. Note that the multiplication of any power of the term W_N by any number belonging to C, the field of complex numbers [4], results in a simple phase shift of that complex number – the amplitude or magnitude remains unchanged. This suggests that the dominant operation of any fast solution to the DFT will be that of phase rotation which, with the right choice of arithmetic unit, could be exploited to some advantage. The direct computation of the N-point DFT, however, as defined above in Eq. 1.1, involves O(N²) arithmetic operations, so that many of the early scientific problems involving the DFT could not be seriously attacked without access to fast algorithms for its efficient solution, where the key to the design of such algorithms is the identification and exploitation of the property of symmetry [54, 58], whether it exists in just the transform kernel or, in certain circumstances, in the input data and/or output data as well.

One early area of activity with such transforms involved astronomical calculations, and in the early part of the nineteenth century the great Carl Friedrich Gauss (1777–1855) used the DFT for the interpolation of asteroidal orbits from a finite set of equally spaced observations [26]. He developed a fast two-factor algorithm for its computation that was identical to that described in 1965 by James Cooley and John Tukey [14] – as with many of Gauss’s greatest ideas, however, the algorithm was never published outside of his collected works and only then in an obscure Latin form. This algorithm, which for a transform length of N = N₁ × N₂ involves just O((N₁ + N₂) × N) arithmetic operations, was probably the first member of the class of fast recursive algorithms now commonly referred to as the fast Fourier transform (FFT) [5, 6, 10, 14, 19, 44], which is unquestionably the most ubiquitous algorithm in use today for the analysis or manipulation of digital data. In fact, Gauss is known to have first used the above-mentioned two-factor FFT algorithm for the solution of the DFT as far back as 1805, the same year that Admiral Nelson routed the French fleet at the Battle of Trafalgar – interestingly, Fourier served in Napoleon Bonaparte’s army from 1798 to 1801, during its invasion of Egypt, acting as scientific advisor.

Although the DFT, as defined above, allows for both the input and output data sets to be complex-valued (i.e. possessing both amplitude and phase), many real-world spectrum analysis problems, including those addressed by Gauss, involve only real-valued (i.e. possessing amplitude only) input data, so that there is a genuine need for the identification of a subset of the class of FFT algorithms that are able to exploit this fact – bearing in mind that the use of real-valued data leads to a Hermitian (or conjugate)-symmetric frequency spectrum:

complex-data FFT ⇒ exploitation of kernel symmetry,

whilst

real-data FFT ⇒ exploitation of kernel symmetry + spectral symmetries,

with the exploitation of symmetry in the transform kernel being typically achieved by invoking the property of periodicity and of the shift theorem, as will be discussed later in Chap. 3 of the monograph. There is a requirement, in particular, for the development of real-data FFT algorithms which retain the regularity – as relates to the algorithm structure – of their complex-data counterparts as regular algorithms lend themselves more naturally to an efficient implementation. Regularity, which equates to the amount of repetition and symmetry present in the design, is most straightforwardly achieved through the adoption of fixed-radix formulations, such as with the familiar radix-2 and radix-4 algorithms [10, 13], as this essentially reduces the FFT design to that of a

single fixed-radix butterfly, which is the computational engine used for carrying out the repetitive arithmetic operations required by the fixed-radix algorithm. Note that with such a formulation, the radix actually corresponds to the size of the resulting butterfly (in terms of the number of inputs/outputs), although in Chap. 8 it is seen how a DFT, whose length is a power of two (a radix-2 integer) but not a power of four (a radix-4 integer), may be solved by means of a highly optimized radix-4 butterfly.

An additional attraction of fixed-radix FFT formulations, which for an arbitrary radix R decompose an N-point DFT into log_R N temporal stages each comprising N/R radix-R butterflies, is that they lend themselves naturally to a parallel solution. Such decompositions may be defined over either (1) the spatial domain, facilitating its mapping onto a single-instruction multiple-data (SIMD) computing architecture [1], whereby the same set of operations (e.g. those corresponding to the radix-R butterfly) may be carried out simultaneously on multiple sets of data stored within the same memory, or (2) the temporal domain, facilitating its mapping, via the technique of pipelining [1], onto a systolic computing architecture [1, 35], whereby all stages of the systolic array (referred to hereafter simply as a ‘pipeline’ or ‘computational pipeline’) operate simultaneously on different temporal stages of the computation and each of the pipeline’s stages communicates with only its nearest neighbours. A parallel solution may also be defined over both the spatial and the temporal domains which would involve a computational pipeline where each of its stages involves the simultaneous execution of multiple arithmetic operations via SIMD processing – such an architecture being often referred to in the computing literature as ‘parallel-pipelined’. Parallel decompositions such as these suggest that the fixed-radix FFT would lend itself naturally to an efficient implementation with one of the increasingly more accessible/affordable parallel computing technologies.
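As a concrete illustration of Eqs. (1.1) and (1.2) and of the O(N²) cost of direct evaluation, the short sketch below (an illustrative Python/NumPy fragment, not code from the book) builds the Fourier matrix explicitly, computes the normalized DFT and confirms the Hermitian-symmetric spectrum produced by real-valued input data:

```python
import numpy as np

def dft_direct(x):
    """Direct O(N^2) evaluation of the normalized DFT of Eq. (1.1)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    # Fourier matrix of Eq. (1.2): elements W_N^{nk}, with W_N = exp(-i*2*pi/N)
    F = np.exp(-2j * np.pi * np.outer(n, n) / N)
    return (F @ x) / np.sqrt(N)

x = np.random.default_rng(0).standard_normal(16)   # real-valued input
X = dft_direct(x)
# Agreement with an unnormalized library FFT, up to the 1/sqrt(N) factor
assert np.allclose(X, np.fft.fft(x) / np.sqrt(len(x)))
# Hermitian (conjugate) symmetry of the spectrum: X[k] = conj(X[N-k])
assert np.allclose(X[1:], np.conj(X[1:][::-1]))
```

The matrix-vector product here performs N multiplications per output sample, which is precisely the O(N²) cost that the fast algorithms discussed below are designed to avoid.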

1.3 Twentieth-Century Developments of the FFT

As far as modern-day developments in FFT design are concerned, it is the names of Cooley and Tukey that are invariably mentioned first in any historical account, but this does not really do justice to the many contributors from the first half of the twentieth century whose work was simply not picked up on, or appreciated, at the time of their development or publication. The prime reason for such a situation was the lack of a suitable technology for their efficient implementation, this remaining the case until the advent of the semiconductor technology of the 1960s. Early pioneering work was carried out by the German mathematician Carl Runge [49], who in 1903 recognized that the periodicity of the DFT kernel could be exploited to enable the computation of a 2N-point DFT to be expressed in terms of the computation of two N-point DFTs, this factorization technique being subsequently referred to as the doubling algorithm. The Cooley-Tukey algorithm, which does not rely on any specific factorization of the transform length, may thus be viewed as a simple generalization of this algorithm, as the successive application of


the doubling algorithm leads straightforwardly to the radix-2 version of the Cooley-Tukey algorithm. Runge’s influential work was subsequently picked up and popularized in publications by Karl Stumpff [55] in 1939 and Gordon Danielson and Cornelius Lanczos [15] in 1942, each in turn making contributions of their own to the subject. Danielson and Lanczos, for example, produced reduced-complexity solutions to the DFT through the exploitation of symmetries in the transform kernel, whilst Stumpff discussed versions of both the doubling algorithm and the analogous tripling algorithm, whereby a 3N-point DFT is expressed in terms of the computation of three N-point DFTs.

All of the techniques developed, including those of more recent origin such as the nesting algorithm of Shmuel Winograd [60] and the split-radix algorithm of Pierre Duhamel [17], rely upon the ‘divide-and-conquer’ [34] principle, whereby the computation of a composite-length DFT is broken down into that of a number of smaller DFTs where the small-DFT lengths correspond to the multiplicative factors of the original transform length. Depending upon the particular factorization of the transform length, this process may be repeated in a recursive fashion on the increasingly smaller DFTs. When the lengths of the small DFTs have common factors, as encountered with the familiar fixed-radix formulations, there will be a need for the intermediate results occurring between the successive stages of small DFTs to be modified by elements of the Fourier matrix, these terms being commonly referred to in the FFT literature as ‘twiddle factors’. When the algorithm in question is a fixed-radix algorithm of the decimation-in-time (DIT) type [10], whereby the set of data-space samples is decomposed into successively smaller subsequences, the twiddle factors are applied to the inputs to the butterflies, whereas when the fixed-radix algorithm is of the decimation-in-frequency (DIF) type [10], whereby the set of transform-space samples is decomposed into successively smaller subsequences, the twiddle factors are applied to the outputs to the butterflies.

Note, however, that when the lengths of the small DFTs have no common factors at all – that is, when they are relatively prime [4, 41] – the need for the application of the twiddle factors disappears as each twiddle factor becomes equal to one. This particular result was made possible through the development of a new number-theoretic data reordering scheme in 1958 by the statistician Jack Good [22], the scheme being based upon the ubiquitous Chinese remainder theorem (CRT) [41, 43, 44] – which for the interest of those readers of a more mathematical disposition provides a means of obtaining a unique solution to a set of simultaneous linear congruences – whose origins supposedly date back to the first century A.D. [16]. Also, it should be noted that in the FFT literature, the class of fast algorithms based upon the decomposition of a composite-length DFT into smaller DFTs whose lengths have common factors – such as the Cooley-Tukey algorithm – is often referred to as the common factor algorithm [41, 43, 44], whereas the class of fast algorithms based upon the decomposition of a composite-length DFT into smaller DFTs whose lengths are relatively prime is often referred to as the prime factor algorithm [41, 43, 44].
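To make the doubling step and the role of the twiddle factors concrete, the following recursive sketch (an illustrative Python/NumPy fragment, not code from the book) implements the radix-2 DIT form of the Cooley-Tukey algorithm: each N-point DFT is split into two half-length DFTs over the even- and odd-indexed samples, with the odd-indexed half modified by the twiddle factors W_N^k before being combined. The result is unnormalized, matching common library conventions rather than the normalized form of Eq. (1.1):

```python
import numpy as np

def fft_radix2(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    E = fft_radix2(x[0::2])               # N/2-point DFT of even-indexed samples
    O = fft_radix2(x[1::2])               # N/2-point DFT of odd-indexed samples
    k = np.arange(N // 2)
    t = np.exp(-2j * np.pi * k / N) * O   # twiddle factors W_N^k applied to odd half
    return np.concatenate([E + t, E - t])

x = np.random.default_rng(1).standard_normal(32)
assert np.allclose(fft_radix2(x), np.fft.fft(x))
```

Each of the log₂N levels of recursion performs O(N) work, giving the familiar O(N log N) arithmetic complexity of the fixed-radix FFT.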

Before moving on from this brief historical discussion, it is worth returning to the last name mentioned, namely, that of Jack Good, as his background is a particularly interesting one for anyone with an interest in the history of computing. During World War II, Good served at Bletchley Park in Buckinghamshire, England, working alongside Alan Turing [27] on, amongst other things, the decryption of messages produced by the Enigma machine [21] – as used by the German armed forces. At the same time, and on the same site, a team of engineers under the leadership of Tom Flowers [21] – all seconded from the Post Office Research Establishment at Dollis Hill in North London – were, unbeknown to the outside world (and remaining so for several decades), developing the world’s first electronic computer, the Colossus [21], under the supervision of Turing and Cambridge mathematician Max Newman (Turing’s former supervisor from his student days at Cambridge University and future colleague at Manchester University [38] where work was to be undertaken on the development of both hardware and software for the Mark I computer [37]). The Colossus was built primarily to automate various essential code-breaking tasks such as the cracking of the Lorenz code used by Adolf Hitler to communicate with his generals and was the first serious device – albeit a very large and a very specialized one – on the path towards our current state of technology whereby entire signal processing systems may be mapped onto a single silicon chip.

1.4 The DHT and Its Relation to the DFT

A close relative of the Fourier transform is that of the Hartley transform, as introduced by Ralph Hartley (1890–1970) in 1942 for the analysis of transient and steady-state transmission problems [25]. The discrete-time version of this bilateral and orthogonal transform is referred to as the DHT [8], which, for the 1-D case of N input/output samples, may be expressed in normalized form via the equation:

$$X^{(H)}[k] = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x[n]\,\mathrm{cas}(2\pi nk/N), \qquad k = 0, 1, \ldots, N-1 \qquad (1.4)$$

where the input/output data vectors belong to R^N, the linear space of real-valued N-tuples, and the transform kernel – which may be represented when using matrix-vector terminology by means of the symmetric Hartley matrix – is expressed in terms of the 'cas' function:

$$\mathrm{cas}(2\pi nk/N) \equiv \cos(2\pi nk/N) + \sin(2\pi nk/N). \qquad (1.5)$$

Note that as the elements of the Hartley matrix are all real-valued, the DHT is orthogonal (although unitary as well, given that R ⊂ C), with the columns of the matrix forming an orthogonal basis. Unlike the DFT, the DHT has no natural interpretation as a frequency spectrum, although the discrete version of the power spectral density (PSD) may be determined directly from the DHT coefficients. Its most natural use is as:

1. A bilateral transform that satisfies the circular convolution theorem (CCT), so that it may be used for both the forward and the inverse transformations of a transform-based solution for the filtering of real-valued data (in both one and multiple dimensions)
2. A means for computing the DFT, so that fast solutions to the DHT – which are referred to generically as the fast Hartley transform (FHT) [7–9, 52] – have become increasingly popular as an alternative to the FFT for the efficient computation of the DFT

The FHT is particularly attractive for when the input data to the DFT is real-valued, its applicability being made possible by the fact that all of the familiar properties associated with the DFT, such as the CCT and the shift theorem, are also applicable to the DHT (as will be discussed in Chap. 3), and that the complex-valued DFT output set and real-valued DHT output set may each be simply obtained, one from the other. To see the truth of this, note that the equality

$$\mathrm{cas}(2\pi nk/N) = \mathrm{Re}\left(W_N^{nk}\right) - \mathrm{Im}\left(W_N^{nk}\right) \qquad (1.6)$$

(where 'Re' stands for the real component and 'Im' for the imaginary component) relates the kernels of the two transformations, both of which are periodic with a period of 2π radians. As a result,

$$X^{(H)}[k] = \mathrm{Re}\left(X^{(F)}[k]\right) - \mathrm{Im}\left(X^{(F)}[k]\right), \qquad (1.7)$$

which expresses the DHT output in terms of the DFT output, whilst

$$\mathrm{Re}\left(X^{(F)}[k]\right) = \tfrac{1}{2}\left(X^{(H)}[N-k] + X^{(H)}[k]\right) \qquad (1.8)$$

and

$$\mathrm{Im}\left(X^{(F)}[k]\right) = \tfrac{1}{2}\left(X^{(H)}[N-k] - X^{(H)}[k]\right), \qquad (1.9)$$

which express the real and imaginary components of the DFT output, respectively, in terms of the DHT output.
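
To make the conversion concrete, the following minimal C sketch (the function name and array conventions are illustrative, not taken from the monograph's own appendix code) recovers the complex-valued DFT output set from the real-valued DHT output set by direct application of Eqs. 1.8 and 1.9:

```c
/* Recover the N complex-valued DFT outputs of a real-valued data set from
   its N real-valued DHT outputs, via Eqs. (1.8) and (1.9). XH[] holds the
   DHT outputs; XFre[]/XFim[] receive the DFT real/imaginary components. */
void dht_to_dft(const double *XH, double *XFre, double *XFim, int N)
{
    for (int k = 0; k < N; k++) {
        int Nk = (N - k) % N;              /* index N-k, taken modulo N */
        XFre[k] = 0.5 * (XH[Nk] + XH[k]);  /* Eq. (1.8) */
        XFim[k] = 0.5 * (XH[Nk] - XH[k]);  /* Eq. (1.9) */
    }
}
```

For real-valued input data the DFT outputs are Hermitian-symmetric, so only the first N/2 + 1 of these complex-valued outputs actually need to be formed.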

1.5 Attractions of Computing the Real-Data DFT via the FHT

Although applicable to the computation of the DFT for both real-valued and complex-valued data, the major computational advantage of the FHT over the FFT, as implied above, lies in the processing of real-valued data. As most real-world spectrum analysis problems involve only real-valued data, significant performance gains may be obtained by using the FHT without any great loss of generality. This is evidenced by the fact that if one computes the complex-data FFT of an N-point real-valued data set, the result will be 2N real-valued (or, equivalently, N complex-valued or N pairs of real-valued) samples, one half of which are redundant. The FHT, on the other hand, will produce just N real-valued outputs, from which the required N/2 complex-valued DFT outputs may be straightforwardly obtained, thereby requiring only one half as many arithmetic operations and one half the memory requirement for storage of the input/output data. The reduced memory requirement is particularly relevant when the transform is large and the available resources are limited, as might be encountered in applications typified by that of mobile communications.

The traditional approach to computing the DFT has been to use a complex-data solution, regardless of the nature of the data, this often entailing the initial conversion of real-valued data to complex-valued data via a wideband digital down-conversion (DDC) process or through the adoption of a 'real-from-complex' strategy, whereby two real-data DFTs are computed simultaneously via one full-length complex-data FFT [51] or where one real-data DFT is computed via one half-length complex-data FFT [13, 51]. Each of the real-from-complex solutions, however, involves a computational overhead when compared to the more direct approach of a real-data FFT in terms of increased memory, increased processing delay to allow for the possible acquisition/processing of pairs of data sets and additional packing/unpacking complexity. With the DDC approach, the integrity of the information content of short-duration signals may also be compromised through the introduction of the filtering operation.

The reason for such a situation is due in part to the fact that manufacturers of computing equipment have invested so heavily in producing DSP devices built around the fast complex-data multiplier-and-accumulator (MAC). This is an arithmetic unit ideally suited to the implementation of the radix-2 butterfly, which is the computational engine used for carrying out the repetitive arithmetic operations required by the complex-data version of the radix-2 FFT. The net result is that the problem of computing the real-data DFT is effectively modified so as to match an existing complex-data solution, rather than a solution being sought that matches the actual problem needing to be solved – which is the approach to be adopted here.

It should be noted that specialized FFT algorithms [2, 11, 17, 18, 20, 36, 42, 53, 56] do however exist for dealing with the case of real-valued data. Such algorithms compare favourably, in terms of their arithmetic and memory requirements, with those of the FHT, but suffer in terms of a loss of regularity and reduced flexibility in that different algorithms are often required for the computation of the forward and the inverse DFT algorithms. Clearly, in applications requiring transform-space processing followed by a return to data-space, this could prove something of a disadvantage, particularly when compared to the adoption of a bilateral transform, such as the DHT, which may be straightforwardly applied to the transformation from Hartley-space to data-space as well as from data-space to Hartley-space, making it thus equally applicable to the computation of both the forward and the inverse DFT algorithms – the bilateral property of the DHT means that its definitions for the two directions, up to a possible scaling factor, are identical.

A drawback of conventional FHT algorithms [7–9, 52], however, lies in the need for two sizes of butterfly – and thus for two separate butterfly designs – for fixed-radix formulations where, for a radix 'R' algorithm, a single-sized butterfly produces R outputs from R inputs and a double-sized butterfly produces 2R outputs from 2R inputs. A generic version of the double-sized butterfly, referred to hereafter as the generic double butterfly [28], is therefore developed in this monograph for the radix-4 factorization of the FHT, which overcomes the problem in an elegant fashion. The resulting radix-4 FHT, referred to hereafter as the regularized FHT [28], will be shown to lend itself naturally to an efficient implementation with parallel computing technology.

1.6 Modern Hardware-Based Parallel Computing Technologies

The type of high-performance parallel computing equipment referred to above is typified by the increasingly accessible FPGA and ASIC technologies which now give design engineers far greater flexibility and control over the type of algorithm that may be used in the building of high-performance DSP systems, so that more appropriate hardware solutions to the real-data FFT may be actively sought and exploited to some advantage with these silicon-based technologies. With such technologies, however, it is no longer adequate to view the complexity of the FFT purely in terms of arithmetic operation counts, as has conventionally been done, as there is now the facility to use both multiple arithmetic units – such as those based upon the fast multiplier – and multiple banks of fast memory in order to enhance the FFT performance via its parallel computation. As a result, a whole new set of constraints has arisen relating to the design of ‘efficient’ FFT algorithms.

With the recent and explosive growth of wireless technology, and in particular that of mobile communications, algorithms are now being designed subject to new and often conflicting performance criteria, whereby the ideal is to simultaneously:

1. Maximize the throughput
2. Satisfy a timing constraint, relating either to the 'latency', which is defined as the elapsed time involved in the production of an output data set from its input data set, or the 'update time', which is defined as the elapsed time between the arrival of the latest input data set and the subsequent production of the latest output data set (noting that for a 1-D block-based solution these two timing parameters are identical)
3. Minimize the required silicon resources (and thus the cost of implementation)
4. Keep the power consumption to within the available power budget

The task, therefore, is to find those solutions that are best able to deal with the trade-offs that inevitably need to be made if the above objectives are to be adequately addressed. Note that the difference between the latency and the update time is most evident and easiest to visualize when the algorithm is implemented via a computational pipeline, as the latency then corresponds to the elapsed time across the entire length of the pipeline, whereas the update time corresponds to the elapsed time across a single stage of the pipeline.

Such trade-offs are considered in considerable detail for the silicon-based implementations of the regularized FHT discussed in this monograph, bearing in mind the aim of achieving resource-efficient and power-efficient solutions for the parallel computation [1, 3, 23] of the DHT and equivalently, via the relationship of their kernels, of the real-data DFT, for the processing of both 1-D data, to be discussed in Chaps. 4, 5, 6 and 7, and m-D data, to be discussed in Chaps. 10 and 11.

As a final observation, the adoption of the FHT for wireless communications technology would seem to be particularly apt, given the contribution made by the originator of the Hartley transform (albeit the continuous-time rather than the discrete-time version) to the foundation of information theory, where the Shannon-Hartley theorem [46] helped to establish Shannon's idea of channel capacity [46, 50]. The theorem states that if the rate at which information is transmitted over a given communication channel is less than the channel capacity, then error-free communication may be achieved, whereas if it exceeds that capacity, then errors in transmission will always occur no matter how well the communication equipment is designed.
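
For reference, the Shannon-Hartley theorem gives the capacity C (in bits per second) of a channel of bandwidth B hertz, subject to additive white Gaussian noise with signal-to-noise power ratio S/N, as

$$C = B \log_2\!\left(1 + \frac{S}{N}\right).$$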

1.7 Hardware-Based Arithmetic Units

When producing electronic equipment, whether for commercial or military use, great emphasis is inevitably placed upon minimizing the unit cost, so that one is seldom blessed with the option of using the latest state-of-the-art device technology. The most common situation encountered is one where the expectation is to use the smallest (and thus least expensive) device that's capable of yielding solutions able to meet the desired performance objectives, which means using devices that are often one or more generations behind the latest specification. As a result, there are situations where there would be great merit in having designs that are not totally reliant on the increasing availability of large quantities of expensive embedded resources, such as the fast multipliers and fast memory provided by the manufacturers of the latest silicon-based devices, but are sufficiently flexible as to yield efficient implementations in silicon even when such resources are scarce.

One way of achieving such flexibility with the regularized FHT would be through the design of a processing element (PE) for the computation of the generic double butterfly that minimizes or perhaps even avoids the need for fast multipliers, or fast memory, or both, according to the availability of the resources on the target computing device. Despite the increased use of the hardware-based computing technologies, however, there is still a strong reliance upon the use of software-based techniques for the design of the arithmetic unit. These techniques, as typified by the familiar fast multiplier, are relatively inflexible in terms of the precision they offer and, although increasingly more power-efficient, tend to be expensive in terms of silicon resources.

There are a number of hardware-based arithmetic techniques available, however, such as the shift-and-add techniques, as typified by the Co-Ordinate Rotation DIgital Computer (CORDIC) arithmetic unit [57], and the look-up table (LUT) techniques, as typified by the distributed arithmetic (DA) unit [59], which date back to the DSP revolution of the mid-twentieth century but nevertheless still offer great attractions for use with the new hardware-based technologies. The CORDIC arithmetic unit, for example, which may be used to carry out in an optimal fashion the operation of phase rotation – the key operation involved in the computation of the DFT – may be implemented by means of a computational structure whose form may range from fully sequential to fully parallel, with the update time of the CORDIC operation decreasing linearly with increasing parallelism. The application of the CORDIC technique to the computation of the regularized FHT is considered in this monograph for its ability both to minimize the memory requirement and to yield a flexible-precision solution to the problem of computing the real-data DFT.
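
To illustrate the shift-and-add principle, the following C sketch gives a floating-point behavioural model of a CORDIC rotation in circular mode; the function name, iteration count and gain constant are illustrative assumptions, and a hardware unit would use fixed-point right-shifts and a small angle look-up table rather than divisions and calls to atan():

```c
#include <math.h>

#define CORDIC_ITERS 16   /* roughly one bit of precision gained per iteration */

/* Rotate the vector (x, y) by 'angle' (assumed pre-reduced to [-pi/2, pi/2])
   using only micro-rotations of the form x -/+ y*2^-i, y +/- x*2^-i. */
void cordic_rotate(double x, double y, double angle, double *xr, double *yr)
{
    static const double K = 0.6072529350088813; /* 1/prod(sqrt(1 + 2^-2i)) */
    double z = angle;
    for (int i = 0; i < CORDIC_ITERS; i++) {
        double d = (z >= 0.0) ? 1.0 : -1.0;     /* direction of micro-rotation  */
        double p = 1.0 / (double)(1 << i);      /* 2^-i: a right-shift in hardware */
        double xn = x - d * y * p;
        double yn = y + d * x * p;
        z -= d * atan(p);                       /* atan(2^-i), from a small LUT */
        x = xn; y = yn;
    }
    *xr = K * x;   /* remove the accumulated CORDIC gain */
    *yr = K * y;
}
```

Unrolled in hardware, each iteration becomes one pipeline stage, which is why the update time of the operation decreases linearly with increasing parallelism.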

1.8 Performance Metrics and Constraints

Having introduced and defined the algorithms of interest in this introductory chapter, namely, the DFT and its close relation the DHT, as well as discussing very briefly the various types of computing architecture and technology available for the implementation of their fast solutions, via the FFT and the FHT, respectively, it is now worth devoting a little time to considering the type of performance metrics and constraints most appropriate to each. For the mapping of such algorithms onto a single-processor (or von Neumann) computing device, for example, the performance might typically be assessed according to the following:

Performance Metric for Single-Processor Sequential Computing Device: A solution for the computation of a discrete unitary or orthogonal transform, when executed on a single-processor sequential computing device, may be assessed according to its arithmetic-complexity, such that one solution may be said to be more operation-efficient than another if it requires fewer arithmetic operations to carry out the task.

With this definition, the arithmetic complexity of a discrete unitary or orthogonal transform is minimized by identifying and exploiting the property of symmetry, whether it exists in just the transform kernel or, in certain circumstances, in the input data and/or output data as well. For the mapping of such algorithms onto a multi-processor computing device, on the other hand, the performance might typically be assessed according to the following:

Performance Metric for Multi-Processor Parallel Computing Device: A solution for the computation of a discrete unitary or orthogonal transform, when executed on a multi-processor parallel computing device, may be assessed according to its time-complexity, such that one solution may be said to be more time-efficient than another if it requires less time to carry out the task.

With this definition, the time-complexity of a discrete unitary or orthogonal transform is minimized by exploiting the parallelism on offer, so that many operations may be carried out simultaneously, or in parallel. The idealized objective of a parallel solution is to obtain a linear speed-up in performance which is directly proportional to the number of processors used, although in reality, with most multi-processor applications, being able to obtain such a speed-up is rarely achievable. The main problem relates to the communication complexity arising from the need to move potentially large quantities of data between the various processors.

Finally, for the mapping of such algorithms onto a silicon-based computing device, as typified by the FPGA and the ASIC to be discussed in Chap. 5, which takes into account the constraints and trade-offs of the various parameters of interest, the performance might typically be assessed according to the following:

Performance Metric for Silicon-Based Parallel Computing Device: A solution for the computation of a discrete unitary or orthogonal transform, when executed on a silicon-based parallel computing device, may be assessed according to its throughput per unit area of silicon, such that one solution may be said to be more resource-efficient than another if it requires fewer resources (and thus a smaller silicon area) to carry out the task within a given time period.

Although other metrics could be used in assessing performance, this particular metric relating to throughput per unit area of silicon – referred to hereafter as the computational density – is targeted specifically at the type of power-constrained environment that one would expect to encounter with applications typified by that of mobile communications, as it's assumed that a solution that yields a high computational density will be attractive in terms of both power efficiency and resource efficiency, given the known influence of silicon area upon power consumption, to be discussed in Chap. 5. Note, however, that two solutions may achieve the same computational density on the same device but require different quantities of silicon resources, so that one solution will be more resource-efficient than the other. This may occur with the allocation of different time periods (as given by the data set refresh rate, to be defined in the next section) for the completion of the task, as the availability of a shorter time period will need to be overcome through the exploitation of additional parallelism and additional silicon resources.

1.9 Key Parameters, Definitions and Notation

Two timing parameters, the latency and the update time, have already been introduced in Sect. 1.6 of this chapter as a means of assessing the realizability of a given solution. Another timing parameter that will be referred to at various points throughout the monograph is that of the 'update period', which is defined as the elapsed time between the production of consecutive input data sets (and as such must be greater than or equal to the update time for a realizable solution). The value of each of the timing parameters is determined by the speed of the input/output (I/O) system, which is the rate at which data samples are generated by the external input data source, such as an analog-to-digital conversion (ADC) unit, being typically equal to, or some integer multiple of, the clock frequency of the target computing device. This also dictates the value of the 'data set refresh rate', which is defined as the rate at which each new input data set is transferred from the external input data source to suitably defined data-space memory (DSM). The I/O rate is assumed in this monograph, for ease of illustration, to be equal to one sample per clock cycle, leading to a data set refresh rate of:

1. N samples every update period of N clock cycles for the case of a 1-D transform, where the transform length N is taken to be a radix-4 integer as required by the regularized FHT
2. N^m samples every update period of N^m clock cycles for the case of an m-D transform, for m ≥ 2, where the common length N of each dimension of the transform is also taken to be a radix-4 integer for compatibility with the adoption of the N-point regularized FHT

For the m-D case, the elapsed time between the acquisitions of consecutive N-sample subsets of the input data set is referred to hereafter as the 'slicing period', where 'slicing' refers to the extraction of N samples in a given dimension of an m-D data set by fixing the indices of the remaining m−1 dimensions. As already stated, for a realizable solution whereby the transform is able to operate in a real-time fashion, it is necessary that the update time be kept shorter than the update period, as dictated by the data set refresh rate, as the transform cannot process the data faster than it's generated.

When assessing the space-complexity – which comprises both arithmetic and memory components – of the various solutions developed in the monograph, it may be necessary to deal with very large numbers for representing the memory component or the device capacity when measured in binary words. To address this problem, the memory component will typically be measured in units of KWords, for multiples of 10^3 words, or MWords, for multiples of 10^6 words, or GWords, for multiples of 10^9 words, where the word length (in terms of the number of bits) will be as specified.

To clarify the use of a few basic terms, note that the input data to unitary and orthogonal transforms, such as the DFT and the DHT, may be said to belong to 'data-space', where each individual input data vector, as already stated, will belong to C^N for the case of the complex-data DFT or R^N for the case of the DHT and the real-data DFT. Analogously, the output data from such transforms may be said to belong to 'transform-space' – which for the case of the DFT is referred to as 'Fourier-space' and for the case of the DHT is referred to as 'Hartley-space' – where each individual output data vector will belong to C^N for the case of the DFT or R^N for the case of the DHT. As already implied, all data vectors with an attached superscript of '(F)' will be assumed to reside within Fourier-space, whilst all those with an attached superscript of '(H)' will be assumed to reside within Hartley-space. These definitions will be used throughout the monograph, where appropriate, in order to simplify or clarify the exposition.

Note also that curly brackets '{.}' will be used throughout the monograph to denote a finite set or sequence of digital samples, as required, for example, for expressing the input/output relationship for both the DFT and the DHT, whether for 1-D or m-D data. The indexing convention generally adopted when using such sequences is that the elements of a finite sequence in data-space, as typically denoted with a lower-case character such as 'x', are typically indexed by means of the letter 'n', whereas the elements of a finite sequence in transform-space, as typically denoted with an upper-case character such as 'X', are typically indexed by means of the letter 'k'.

Various algorithms are assessed throughout the monograph in terms of their arithmetic and time complexities, where the arithmetic-complexity is typically expressed in terms of the numbers of multiplications and additions required to carry out a task as a function of its input length, whilst the time-complexity is expressed in terms of the associated number of clock cycles. Rather than trying to evaluate the exact numbers of operations or clock cycles, which may not be possible without delving into the inaccessible low-level details of each operation, the 'Big-Oh' notation is often used for defining the 'order of complexity' [1], whereby such details may be ignored. Thus:

The function f(n) is said to be of order at most g(n), denoted O(g(n)), if there exist positive constants 'c' and 'm' such that f(n) ≤ c.g(n) for all values of 'n' such that n ≥ m.

Finally, it has already been stated that for the case of a fixed-radix FFT, the trigonometric elements of the transform kernel, as represented by the Fourier matrix and applied to the appropriate butterfly inputs/outputs, are generally referred to as twiddle factors. However, for consistency and generality, the elements of the transform kernels as represented by both the Fourier matrix and the Hartley matrix, as required for use by the butterflies of their respective decompositions, will instead be referred to hereafter simply as the trigonometric coefficients. Also, for the derivation of fast solutions to both transform types, the elements of their respective kernels are generally decomposed into pairs of real numbers in order to facilitate their efficient application by their respective butterflies, given that the kernel of each transform type involves the use of both 'sinusoidal' and 'cosinusoidal' terms.

1.10 Organization of Monograph

Part I of the monograph – entitled "The Discrete Fourier and Hartley Transforms" – provides the background information necessary for a better understanding of the problem being addressed, namely, that of deriving resource-efficient, scalable (which refers to the ease with which the solution may be modified in order to accommodate increasing or decreasing transform sizes) and device-independent (so that it is not dependent upon the specific characteristics of any particular device, being able to exploit whatever resources happen to be available on the target device) solutions for the parallel computation of the DHT and equivalently, via the relationship of their kernels, of the real-data DFT, initially for just the 1-D case but later to be extended to the m-D case. This involves, as discussed already in this chapter, an outline of the problem set in a historical context, followed in Chap. 2 by an account of the real-data DFT and of the fast algorithms and techniques conventionally used for its solution. Chapter 3 next provides a detailed account of the DHT and the class of recursive FHT algorithms used for its fast solution, and of those properties of the DHT that make the FHT of particular interest with regard to the fast solution of the real-data DFT and to its application to a key problem concerned with the filtering of real-valued data.

Part II of the monograph – entitled "The Regularized Fast Hartley Transform" – deals with the novel solution proposed for dealing with the problem of computing the DHT, and hence the DFT, for the case of 1-D real-valued data, where the transform length N is taken to be a radix-4 integer. This involves, in Chap. 4, a detailed account of the design of an efficient solution for the computation of the DHT based upon the use of the generic double butterfly – namely, the regularized FHT – which lends itself naturally to an efficient implementation with parallel computing technology. Design constraints and trade-offs for the silicon-based technologies of the FPGA and the ASIC are then discussed in Chap. 5 prior to the consideration, in Chap. 6, of different possible architectures for the efficient mapping of the regularized FHT onto such hardware. A single-PE recursive architecture exploiting fine-grained pipelining of the PE is identified for the parallel computation of the generic
double butterfly and of the resulting regularized FHT [29, 30], whereby SIMD processing is applied within each stage of the computational pipeline and both the data and the trigonometric coefficients are partitioned or distributed across multiple banks of fast memory, referred to hereafter as the PE's data memory (PDM) and trigonometric coefficient memory (PCM), respectively. The resulting parallel solution is resource-efficient, scalable (which, for the 1-D case, is understood to be in relation to the length of the transform) and device-independent, being able to maximize the computational density as time-complexity is traded off against space-complexity. It is next seen, in Chap. 7, how the fast multipliers used by the generic double butterfly might in certain circumstances be beneficially replaced by a hardware-based parallel arithmetic unit, such as that based upon the use of CORDIC arithmetic, which is able to yield a flexible-precision solution, without the need of the PCM, when implemented with one of the proposed silicon-based technologies.

Part III of the monograph – entitled "Applications of Regularized Fast Hartley Transform" – deals with applications of the regularized FHT to the low-complexity parallel computation of a number of common DSP-based functions. This involves, in Chap. 8, the derivation of two new radix-2 real-data FFT algorithms, both exploiting the regularized FHT, where the transform length is now a power of two (a radix-2 integer), but not a power of four (a radix-4 integer). This shows how the regularized FHT may be applied, potentially, to a great many more problems than originally envisioned. This is followed by its application, in Chap. 9, to the computation of some of the more familiar and computationally intensive DSP-based functions, such as those of correlation – both auto-correlation and cross-correlation – and of the wideband channelization of real-valued radio-frequency (RF) data via the polyphase DFT filter bank [24, 32]. With each such function – which might typically be encountered in that increasingly important area of wireless communications concerned with the geolocation [47] of signal emitters – the adoption of the regularized FHT may result in both conceptually and computationally simplified solutions. A more recent application involving a novel transform-space scheme for enhancing the performance of multi-carrier communications in the presence of inter-modulation distortion (IMD) is also briefly discussed [31, 33].

Part IV of the monograph – entitled "The Multidimensional Discrete Hartley Transform" – deals with the design of architectures for resource-efficient and scalable solutions for the parallel computation of the m-D DHT, for m ≥ 2, where the common length N of each dimension of the transform is taken to be a radix-4 integer for compatibility with the adoption of the N-point regularized FHT. A new parallel data reordering scheme is first discussed, in Chap. 10, for the parallel reordering and transfer of the regularized FHT's naturally ordered input data from the DSM to the PDM (which is applicable to both the 1-D and m-D cases) and, for the m-D case, for the parallel reordering and transfer of intermediate DHT output data, as stored in naturally ordered form within suitably defined Hartley-space memory (HSM), to the PDM of the target PE, making full use of the potential parallelism made available by the partitioned nature of the memory. The reordering of the data – which is able to be carried out simultaneously with its transfer between
partitioned memories – is obtained from the application of the familiar digit-reversal mapping [12] geared specifically to the DIT formulation of the radix-4 FHT algorithm being proposed. With Chap. 11 it is seen how the regularized FHT may be exploited as a building block in producing attractive solutions for the parallel computation of the m-D DHT and equivalently, via the relationship of their kernels, of the m-D real-data DFT. This is achieved through the adoption of:

1. A separable formulation of the m-D DHT (referred to as the SDHT) [8], so that the familiar row-column method (RCM) [34] may be applied
2. Memory partitioning, double-buffering (whereby functions performed on two equally sized regions of memory alternate with successive input data sets) and parallel addressing schemes – namely, the data reordering scheme discussed in Chap. 10 – that are consistent with those used by the regularized FHT

Combining these features, the regularized FHT may then be used as a building block for the processing of each stage of the m-D formulation of the SDHT, with the resulting parallel solutions, like those for the 1-D case, being resource-efficient, scalable (which, for the m-D case where 'm' is fixed, is understood to be in relation to the common length of each dimension of the transform) and device-independent, being thus able to maximize the computational density as time-complexity is traded off against space-complexity.

Part V of the monograph – entitled "Results of Research" – which consists of just Chap. 12, outlines the background to the research problems addressed by the monograph, before summarizing the results obtained and drawing conclusions from those results.

Finally, three appendices are provided relating to software. The first two appendices, Appendix A and Appendix B, provide both a detailed description and a listing of computer source code, written in the 'C' programming language, for all those functions required by the proposed single-PE solution to the regularized FHT, this code being used for proving the mathematical/logical correctness of its operation. The computer programme provides the user with various choices of PE design and of storage/retrieval scheme for the trigonometric coefficients, helping the user to identify how the algorithm might be efficiently mapped onto suitable parallel computing equipment following translation of the sequential 'C' code to parallel code as produced by a suitably chosen hardware description language (HDL). The third and final appendix, Appendix C, provides a detailed description and a listing of computer source code, written this time using the MATLAB [39] computing environment, for determining how naturally ordered samples are distributed across partitioned memory after being reordered according to the familiar 'digit-reversal' mapping. The code is thus used to prove the mathematical/logical correctness of operation of the parallel reordering of the data and of its simultaneous transfer from one partitioned memory to another.

References

1. S.G. Akl, The Design and Analysis of Parallel Algorithms (Prentice-Hall, 1989)
2. G.D. Bergland, A fast Fourier transform algorithm for real-valued series. Comm. ACM 11(10) (1968)
3. A.W. Biermann, Great Ideas in Computer Science (MIT Press, 1995)
4. G. Birkhoff, S. MacLane, A Survey of Modern Algebra (Macmillan, 1977)
5. R. Blahut, Fast Algorithms for Digital Signal Processing (Addison-Wesley, 1985)
6. R.N. Bracewell, The Fourier Transform and Its Applications (McGraw-Hill, 1978)
7. R.N. Bracewell, The fast Hartley transform. Proc. IEEE 72(8) (August 1984)
8. R.N. Bracewell, The Hartley Transform (Oxford University Press, 1986)
9. R.N. Bracewell, Computing with the Hartley transform. Comput. Phys. 9(4) (July/August 1995)
10. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice-Hall, Englewood Cliffs, 1988)
11. G. Bruun, Z-transform DFT filters and FFTs. IEEE Trans. ASSP 26(1) (January 1978)
12. E. Chu, A. George, Inside the FFT Black Box (CRC Press, 2000)
13. J.W. Cooley, P.A.W. Lewis, P.D. Welch, "The Fast Fourier Transform Algorithm and Its Applications", Technical Report RC-1743 (IBM, February 1967)
14. J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(4), 297–301 (1965)
15. G.C. Danielson, C. Lanczos, Some improvements in practical Fourier series and their application to X-ray scattering from liquids. J. Franklin Inst. 233, 365–380 and 435–452 (April 1942)
16. C. Ding, D. Pei, A. Salomaa, Chinese Remainder Theorem: Applications in Computing, Coding, Cryptography (World Scientific, 1996)
17. P. Duhamel, Implementations of split-radix FFT algorithms for complex, real and real-symmetric data. IEEE Trans. ASSP 34(2), 285–295 (April 1986)
18. P. Duhamel, M. Vetterli, Improved Fourier and Hartley transform algorithms: Application to cyclic convolution of real data. IEEE Trans. ASSP 35(6), 818–824 (June 1987)
19. D.F. Elliott, K. Ramamohan Rao, Fast Transforms: Algorithms, Analyses, Applications (Academic, 1982)
20. O. Ersoy, Real discrete Fourier transform. IEEE Trans. ASSP 33(4) (April 1985)
21. P. Gannon, Colossus: Bletchley Park's Greatest Secret (Atlantic Books, London, 2006)
22. I.J. Good, The interaction algorithm and practical Fourier series. J. R. Stat. Soc. Ser. B 20, 361–372 (1958)
23. D. Harel, Algorithmics: The Spirit of Computing (Addison-Wesley, 1997)
24. F.J. Harris, Multirate Signal Processing for Communication Systems (Prentice-Hall, Upper Saddle River, 2004)
25. R.V.L. Hartley, A more symmetrical Fourier analysis applied to transmission problems. Proc. IRE 30 (1942)
26. M.T. Heideman, D.H. Johnson, C.S. Burrus, Gauss and the history of the fast Fourier transform. IEEE ASSP Mag. 1, 14–21 (October 1984)
27. A. Hodges, Alan Turing: The Enigma (Vintage, London, 1992)
28. K.J. Jones, Design and parallel computation of regularised fast Hartley transform. IEE Proc. Vis. Image Sig. Process. 153(1), 70–78 (February 2006)
29. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Sig. Process. 1(3), 128–138 (September 2007)
30. K.J. Jones, The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments (Springer (Series on Signals & Communication Technology), 2010)
31. K.J. Jones, Low-Complexity Scheme for Enhancing Multi-Carrier Communications, GB Patent No: 2504512, July 2012

32. K.J. Jones, Resource-efficient and scalable solution to problem of real-data polyphase discrete Fourier transform channelisation with rational over-sampling factor. IET Sig. Process. 7(4), 296–305 (June 2013)
33. K.J. Jones, Design of low-complexity scheme for maintaining distortion-free multi-carrier communications. IET Sig. Process. 8(5), 495–506 (July 2014)
34. L. Kronsjo, Computational Complexity of Sequential and Parallel Algorithms (Wiley, 1985)
35. S.Y. Kung, VLSI Array Processors (Prentice-Hall, Englewood Cliffs, 1988)
36. J.B. Marten, Discrete Fourier transform algorithms for real valued sequences. IEEE Trans. ASSP 32(2) (February 1984)
37. S. Lavington, A History of Manchester Computers (The British Computer Society (BCS), The Chartered Institute for IT, 1998)
38. S. Lavington (ed.), Alan Turing and His Contemporaries: Building the World's First Computers (The British Computer Society (BCS), The Chartered Institute for IT, 2012)
39. MATLAB @ www.mathworks.com
40. C. Maxfield, The Design Warrior's Guide to FPGAs (Newnes (Elsevier), 2004)
41. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice-Hall, 1979)
42. H. Murakami, Real-valued fast discrete Fourier transform and decimation-in-frequency algorithms. IEEE Trans. Circ. Syst. II: Analog Digit. Sig. Proc. 41(12), 808–816 (1994)
43. I. Nivan, H.S. Zuckerman, An Introduction to the Theory of Numbers (Wiley, 1980)
44. H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms (Springer, 1981)
45. A.V. Oppenheim, R.W. Schafer, Discrete-Time Signal Processing (Prentice-Hall, 1989)
46. J.R. Pierce, An Introduction to Information Theory: Symbols, Signals and Noise (Dover Publications Inc., New York, 1980)
47. R.A. Poisel, Electronic Warfare: Target Location Methods (Artech House, 2005)
48. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice-Hall, 1975)
49. C. Runge, Über die Zerlegung empirisch periodischer Funktionen in Sinus-Wellen. Zeit. für Math. und Physik 48, 443–456 (1903)
50. C.E. Shannon, A mathematical theory of communication. BSTJ 27, 379–423, 623–657 (1948)
51. G.R.L. Sohie, W. Chen, Implementation of Fast Fourier Transforms on Motorola's Digital Signal Processors, downloadable document from website: www.Motorola.com
52. H.V. Sorensen, D.L. Jones, C.S. Burrus, M.T. Heideman, On computing the discrete Hartley transform. IEEE ASSP 33, 1231–1238 (1985)
53. H.V. Sorensen, D.L. Jones, M.T. Heideman, C.S. Burrus, Real-valued fast Fourier transform algorithms. IEEE Trans. ASSP 35(6), 849–863 (June 1987)
54. I. Stewart, Why Beauty Is Truth: A History of Symmetry (Basic Books, 2007)
55. K. Stumpff, Tafeln und Aufgaben zur Harmonischer Analyse und Periodogrammrechnung (Julius Springer, Berlin, 1939)
56. P.R. Uniyal, Transforming real-valued sequences: Fast Fourier versus fast Hartley transform algorithms. IEEE Sig. Proc. 42(11) (November 1994)
57. J.E. Volder, The CORDIC trigonometric computing technique. IRE Trans. Electron. Comput. EC-8(3), 330–334 (1959)
58. H. Weyl, Symmetry (Princeton Science Library, 1989)
59. S.A. White, Application of distributed arithmetic to digital signal processing: A tutorial review. IEEE ASSP Mag., 4–19 (July 1989)
60. S. Winograd, Arithmetic Complexity of Computations (SIAM, 1980)

Chapter 2
The Real-Data Discrete Fourier Transform

2.1 Introduction

Since the original developments of spectrum analysis in the eighteenth century, the vast majority of real-world applications have been concerned with the processing of real-valued data where the data, generally of 1-D form, corresponds to amplitude measurements of some particular signal of interest. As a result, there has always been a genuine practical need for fast solutions to the problem of computing the DFT of real-valued data, with two quite distinct approaches evolving over this period to address the problem. The first and more intellectually challenging approach involves trying to design specialized algorithms which are geared specifically to real-data applications and therefore able to exploit, in a direct way, the real-valued nature of the data. Such data is known to result in a Hermitian-symmetric frequency spectrum where, for the case of an N-point transform, the outputs are such that

$$\mathrm{Re}\left(X^{(F)}[k]\right) = \mathrm{Re}\left(X^{(F)}[N-k]\right) \qquad (2.1)$$

and

$$\mathrm{Im}\left(X^{(F)}[k]\right) = -\mathrm{Im}\left(X^{(F)}[N-k]\right), \qquad (2.2)$$

so that one half of the DFT outputs are actually redundant. Such solutions, as typified by the Bergland algorithm [1] and the Bruun algorithm [3, 14], only need therefore to produce one half of the DFT outputs. The second and less demanding approach – but also the most commonly adopted, particularly for applications requiring a hardware solution – involves restructuring the data so as to use an existing complex-data FFT algorithm, possibly coupled with pre-FFT and/or post-FFT stages, to produce the
DFT of either one or two (produced simultaneously) real-valued data sets, such solutions thus said to be obtained via a 'real-from-complex' strategy [16]. Both of these approaches are now discussed in some detail prior to a summary of their relative merits and drawbacks. Before delving into the details of these two approaches, however, it is perhaps worth restating the definition of the DFT, as given in Sect. 1.2 of Chap. 1, namely, that for the case of N input/output samples, the 1-D DFT may be expressed in normalized form via the equation:

$$X^{(F)}[k] = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x[n]\,W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \qquad (2.3)$$

where the transform kernel derives from the term

$$W_N = \exp(-i2\pi/N), \qquad i \equiv \sqrt{-1}, \qquad (2.4)$$

the primitive N'th complex root of unity. This is the definition of the DFT that will be referred to throughout the remainder of this chapter.

2.2 Real-Data FFT Algorithms

Since the re-emergence of computationally efficient FFT algorithms, as initiated by the published work of James Cooley and John Tukey in the mid-1960s [5], a number of attempts have been made [1, 3, 7, 8, 10, 11, 13, 17, 18] at producing fast algorithms that are able to directly exploit the spectral symmetry that arises from the processing of real-valued data. Two such algorithms are those due to Glenn Bergland (1968) and Georg Bruun (1978), and these are now briefly discussed so as to give a flavour of the type of algorithmic structures that can result from pursuing such an approach. The Bergland algorithm effectively modifies the DIT formulation of the familiar radix-2 Cooley-Tukey algorithm [2] to account for the fact that only one half of the DFT outputs need to be computed, whilst the Bruun algorithm adopts an unusual recursive polynomial-factorization approach – the DIF formulation of the fixed-radix Cooley-Tukey algorithm, referred to as the Sande-Tukey algorithm [2], may also be expressed in such a form – which involves only real-valued polynomial coefficients until the last stage of the computation, making it particularly suited therefore to the problem of computing the real-data DFT. Examples of the signal-flow graphs (SFGs) for both the DIT and the DIF formulations of the standard radix-2 Cooley-Tukey algorithm are as given in Figs. 2.1 and 2.2, respectively.

Fig. 2.1 Signal-flow graph for DIT decomposition of eight-point DFT

Fig. 2.2 Signal-flow graph for DIF decomposition of eight-point DFT

2.2.1 The Bergland Algorithm

The Bergland algorithm is a real-data FFT algorithm based upon the observation that the frequency spectrum arising from the processing of real-valued data is Hermitian-symmetric, so that only one half of the DFT outputs need to be computed. Starting with the DIT formulation of the familiar complex-data radix-2 Cooley-Tukey FFT algorithm, if the input data is real-valued, then for each of the log2N temporal stages of the algorithm, the computation involves the repeated combination of two transforms to yield one longer double-length transform. From this, Bergland observed that the property of Hermitian symmetry may actually be exploited for each of the log2N temporal stages of the algorithm. Thus, as all the odd-addressed output
samples for each such double-length transform form the second half of the frequency spectrum, which can in turn be straightforwardly obtained from the property of spectral symmetry, the Bergland algorithm instead uses those memory locations for storing the imaginary components of the data. Note that the log2N stages are referred to above as being 'temporal' in the sense that the computations required for any given stage can only commence after those of its predecessor have been completed and need to have been completed before those of its successor can begin, so that the computations for each of the log2N stages need to be carried out in a specific temporal order.

Also, with the Bergland algorithm, given that the input data set is real-valued, all the intermediate results may be stored within just N memory locations – each location thus corresponding to just one word of memory. The computation can also be carried out in an 'in-place' fashion – whereby the outputs from each butterfly are stored within the same set of memory locations as used by the inputs – although the indices of the set of butterfly outputs are not in bit-reversed order, as they are with the Cooley-Tukey algorithm, being instead ordered according to the Bergland ordering scheme [1], as also are the indices of the twiddle factors or trigonometric coefficients. However, the natural ordering of the twiddle factors may, with due care, be converted to the Bergland ordering, and the Bergland ordering of the FFT outputs subsequently converted to the natural ordering, as required for an efficient in-place solution [1, 16].

Thus, the result of the above modifications is an FFT algorithm with an arithmetic-complexity of O(N.log2N) arithmetic operations, as is obtained with the standard radix-2 Cooley-Tukey algorithm, but which yields a factor-of-two saving, when compared to the conventional zero-padded complex-data FFT solution – to be discussed in Sect. 2.3.1 – in terms of both its arithmetic-complexity and its memory requirement.

2.2.2 The Bruun Algorithm

The Bruun algorithm is a real-data FFT algorithm based upon an unusual recursive polynomial-factorization approach, proposed initially for the case of N input samples, where N is a power of two, but subsequently generalized by Hideo Murakami, in 1996 [13], to deal with the case where N is an arbitrary even number. With reference to Eq. 2.3, by defining the polynomial x(z) whose coefficients are those elements of the finite sequence, {x[n]}, such that

$$x(z) = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x[n]\,z^{n}, \qquad (2.5)$$

it is possible to view the DFT as a reduction of this polynomial [12], so that

$$X^{(F)}[k] = x\left(W_N^k\right) = x(z) \bmod \left(z - W_N^k\right) \qquad (2.6)$$

where 'mod' stands for the modulo operation [12], which denotes the polynomial remainder upon division of x(z) by (z − W_N^k) [12]. The key to fast execution of the Bruun algorithm stems from being able to perform this set of N polynomial remainder operations in a recursive fashion. Computation of the DFT involves evaluating the remainder of x(z) modulo some polynomial of degree one, more commonly referred to as a 'monomial', a total of N times, as suggested by Eqs. 2.5 and 2.6. To do this efficiently, one can combine the remainders recursively in the following way: suppose it is required to evaluate x(z) modulo U(z) as well as x(z) modulo V(z). Then, by first evaluating x(z) modulo the polynomial product, U(z).V(z), the degree of the polynomial x(z) is reduced, thereby making subsequent modulo operations less computationally expensive.

Now the product of all of the monomials, (z − W_N^k), for values of 'k' from 0 up to N−1, is simply (z^N − 1), whose roots are clearly the N complex roots of unity. A recursive factorization of (z^N − 1) is therefore required which breaks it down into polynomials of smaller and smaller degree, with each possessing as few non-zero coefficients as possible. To compute the DFT, one takes x(z) modulo each level of this factorization in turn, recursively, until one arrives at the monomials and the final result. If each level of the factorization splits every polynomial into an O(1) number of smaller polynomials, each with an O(1) number of non-zero coefficients, then the modulo operations for that level will take O(N) arithmetic operations, thus leading to a total arithmetic-complexity, for all log2N levels, of O(N.log2N) arithmetic operations, as is obtained with the standard radix-2 Cooley-Tukey algorithm.

Note that when N is a power of two, the Bruun algorithm factorizes the polynomial (z^N − 1) recursively via the rules:

$$z^{2M} - 1 = \left(z^{M} - 1\right)\left(z^{M} + 1\right) \qquad (2.7)$$

and

$$z^{4M} + a\,z^{2M} + 1 = \left(z^{2M} + \sqrt{2-a}\,z^{M} + 1\right)\left(z^{2M} - \sqrt{2-a}\,z^{M} + 1\right), \qquad (2.8)$$

where 'a' is a constant such that |a| ≤ 2. On completion of the recursion, when M = 1, there remain polynomials of degree two that can each be evaluated modulo two roots of the form (z − W_N^k) for each polynomial. Thus, at each recursive stage, all of the polynomials may be factorized into two parts, each of half the degree and possessing at most three non-zero coefficients, leading to an FFT algorithm with an arithmetic-complexity of O(N.log2N) arithmetic operations. Moreover, since all the polynomials have purely real coefficients, at least until the last stage, they quite naturally exploit the special case where the input data is real-valued, thereby yielding
a factor-of-two saving when compared to the conventional zero-padded complex-data FFT solution – to be discussed in Sect. 2.3.1 – in terms of both arithmetic and memory requirements.
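
As a small worked example of these rules (the transform length here is an illustrative choice, not taken from the text), for N = 8 the recursive factorization proceeds as

$$z^{8} - 1 = \left(z^{4} - 1\right)\left(z^{4} + 1\right) = \left(z^{2} - 1\right)\left(z^{2} + 1\right)\left(z^{2} + \sqrt{2}\,z + 1\right)\left(z^{2} - \sqrt{2}\,z + 1\right),$$

where the final pair of quadratic factors follows from Eq. 2.8 with a = 0 and M = 1, and every coefficient remains real-valued until the degree-two polynomials are themselves evaluated at the complex roots.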

2.3 Real-From-Complex Strategies

By far the most common approach to computing the real-data DFT is that based upon the use of an existing complex-data FFT algorithm as it simplifies the problem, at worst, to one of designing pre-FFT and/or post-FFT stages for the packing of the real-valued data into the correct format required for input to the FFT algorithm and for the subsequent unpacking of the FFT output data to obtain the spectrum (or spectra) of the original real-valued data set (or sets). Note that any fast algorithm may be used for carrying out the complex-data FFT, so that both the DIT and DIF formulations of fixed-radix FFTs, as already discussed, as well as more sophisticated FFT designs such as those corresponding to the mixed-radix algorithm, split-radix algorithm, prime-factor algorithm, prime-length algorithm (due to Charles Rader [12]) and Winograd’s nested algorithm [9, 12], for example, might be used.

2.3.1 Computation of Real-Data DFT via Complex-Data FFT

The most straightforward approach to the problem involves first packing the real-valued data into the real component of a complex-valued data set, padding the imaginary component with zeros – this action more commonly referred to as 'zero padding' – and then feeding the resulting complex-valued data set into a complex-data FFT. The arithmetic requirement of such an approach is clearly identical to that obtained when a standard complex-data FFT is applied to genuine complex-valued data, so that no computational benefits stemming from the simplified nature of the data are achieved with such an approach. On the contrary, computational resources are wasted with such an approach, as excessive arithmetic operations are performed for the computation of the required outputs and twice the required amount of memory is used for the storage of the input/output data sets.
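
The packing step itself is trivial, as the following C sketch illustrates (the function and array names are illustrative; any complex-data FFT routine may then be applied to the packed set):

```c
/* Zero-pad an N-point real-valued data set for input to a complex-data FFT:
   the data occupies the real component, the imaginary component is zeroed. */
void zero_pad(const double *x, double *xre, double *xim, int N)
{
    for (int n = 0; n < N; n++) {
        xre[n] = x[n];   /* real component carries the real-valued data */
        xim[n] = 0.0;    /* imaginary component is zero-padded          */
    }
}
```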

2.3.2 Computation of Two Real-Data DFTs via Complex-Data FFT

The next approach to the problem involves computing two N-point real-data DFTs, simultaneously, by means of one N-point complex-data FFT. This is achieved by packing one real-valued data set into the real component of a complex-valued data
set and another real-valued data set into its imaginary component. Thus, given two real-valued data sets, {g[n]} and {h[n]}, a complex-valued data set, {x[n]}, may be simply obtained by setting

$$x[n] = g[n] + i\,h[n], \qquad (2.9)$$

with the k'th output obtained from taking the DFT of the resulting data set being written in normalized form, in terms of the DFTs of {g[n]} and {h[n]} – denoted {G[k]} and {H[k]}, respectively – as

$$X^{(F)}[k] = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x[n]\,W_N^{nk} = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} g[n]\,W_N^{nk} + i\,\frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} h[n]\,W_N^{nk} \qquad (2.10)$$

which may be rewritten as

$$X^{(F)}[k] = G[k] + i\,H[k] = \left(G_R[k] - H_I[k]\right) + i\left(G_I[k] + H_R[k]\right), \qquad (2.11)$$

where G_R[k] and G_I[k] are the real and imaginary components, respectively, of G[k], the same applying to H_R[k] and H_I[k] with respect to H[k]. Similarly, the (N–k)'th output may be written in normalized form as

$$X^{(F)}[N-k] = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x[n]\,W_N^{n(N-k)} = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} g[n]\,W_N^{-nk} + i\,\frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} h[n]\,W_N^{-nk} \qquad (2.12)$$

which may be rewritten as X ðFÞ ½N  k ¼ G ½k  þ i:H  ½k ¼ðGR ½k þ H I ½k Þ þ iðGI ½k þ H R ½kÞ,

ð2:13Þ

where the superscript ‘*’ stands for the operation of complex conjugation, so that upon combining the expressions of Eqs. 2.11 and 2.13, the DFT outputs G[k] and H[k] may be written, in terms of the DFT outputs X(F)[k] and X(F)[N–k], as

30

2 The Real-Data Discrete Fourier Transform

G½k ¼ GR ½k  þ i:GI ½k  h i h i ¼ 1=2 Re X ðF Þ ½k þ X ðFÞ ½N  k þ i:Im X ðF Þ ½k  X ðFÞ ½N  k

ð2:14Þ

and H ½k ¼ H R ½k þ i:H I ½k  h i h i ð2:15Þ ¼ 1=2 Im X ðFÞ ½k  þ X ðFÞ ½N  k   i: Re X ðF Þ ½k  X ðF Þ ½N  k , where the terms Re(X(F)[k]) and Im(X(F)[k]) denote the real and imaginary components, respectively, of X(F)[k]. Thus, it is evident that the DFT of the two real-valued data sets, {g[n]} and {h[n]}, may be computed simultaneously, via one full-length complex-data FFT algorithm, with the DFT of the data set, {g[n]}, being as given by Eq. 2.14 and that of the data set, {h[n]}, by Eq. 2.15. The pre-FFT data packing stage is quite straightforward in that it simply involves the assignment of one real-valued data set to the real component of the complex-valued data set and one real-valued data set to its imaginary component. The post-FFT data unpacking stage simply involves separating out the two spectra from the complex-valued FFT output data, this requiring two real additions/subtractions for each real-data DFT output together with two scaling operations each by a factor of two (which in fixed-point hardware reduces to that of a simple right-shift operation of length one).

2.3.3

Computation of Real-Data DFT via Half-Length Complex-Data FFT

Finally, the last approach to the problem involves showing how an N-point complexdata FFT may be used to carry out the computation of one 2N-point real-data DFT. The output obtained from taking the DFT of the 2N-point real-valued data set, {x[n]}, may be written in normalized form as 2N1 1 X X ðF Þ ½k ¼ pffiffiffiffiffiffi x½n:W nk 2N 2N n¼0

k ¼ 0, 1, . . . , N  1

N 1 N 1 1 X 1 X k p ffiffiffiffiffiffi ¼ pffiffiffiffiffiffi x½2n:W nk þ W x½2n þ 1:W nk N 2N N, 2N n¼0 2N n¼0

which, upon setting g[n] ¼ x[2n] and h[n] ¼ x[2n + 1], may be rewritten as

ð2:16Þ

2.3 Real-From-Complex Strategies

31

N 1 N 1 1 X 1 X k X ðFÞ ½k ¼ pffiffiffiffiffiffi g½n:W nk h½n:W nk N þ W 2N pffiffiffiffiffiffi N 2N n¼0 2N n¼0

k ¼ 0, 1, . . . , N  1

¼ G½k  þ W k2N :H ½k : ð2:17Þ Therefore, by setting y[n] ¼ g[n] + i.h[n] and exploiting the combined expressions of Eqs. 2.11 and 2.13, the DFT output Y[k] may be written as Y ½k ¼ ðGR ½k  H I ½kÞ þ i:ðGI ½k  þ H R ½k Þ

ð2:18Þ

and that for Y[N–k] as Y ½N  k  ¼ ðGR ½k þ H I ½kÞ þ i:ðGI ½k  þ H R ½k Þ:

ð2:19Þ

Then, by combining the expressions of Eqs. 2.17, 2.18 and 2.19, the real component of X(F)[k] may be written as ðF Þ

X R ½k ¼ 1=2 Re ðY ½k þ Y ½N  kÞþ þ 1=2 cos ðkπ=N Þ:ImðY ½k þ Y ½N  kÞ 

1=2 sin ðkπ=N Þ: Re ðY ½k 

ð2:20Þ

 Y ½N  k Þ

and the imaginary component as ðF Þ

X I ½k  ¼1=2ImðY ½k   Y ½N  k Þ  1=2 sin ðkπ=N Þ:ImðY ½k  þ Y ½N  k Þ

ð2:21Þ

 1=2 cos ðkπ=N Þ: Re ðY ½k  Y ½N  k Þ: Thus, it is evident that the DFT of one real-valued data set, {x[n]}, of length 2N, may be computed via one N-point complex-data FFT algorithm, with the real component of the DFT output being as given by Eq. 2.20 and the imaginary component as given by Eq. 2.21. The pre-FFT data packing stage is conceptually simple, but nonetheless burdensome, in that it involves the assignment of the evenaddressed samples of the real-valued data set to the real component of the complexvalued data set and the odd-addressed samples to its imaginary component. The post-FFT data unpacking stage, in turn, is considerably more complex than that required for the approach of Sect. 2.3.2, requiring the application of eight real additions/subtractions for each DFT output, together with two scaling operations, each by a factor of two, and four real multiplications by precomputed trigonometric coefficients.

32

2.4

2 The Real-Data Discrete Fourier Transform

Data Reordering

All of the fixed-radix formulations of the FFT – at least for the case where the transform length is a power or two – require that either the naturally ordered (referred to hereafter as NAT-ordered, for brevity, as this expression will be used repeatedly throughout the monograph) inputs to or the outputs from the transform be permuted according to the familiar digit-reversal mapping [4]. In fact, it is possible to place the data reordering either before or after the execution of the transform for both the DIT and DIF formulations [4]. For the case of a radix-2 algorithm, the data reordering is more commonly known as the bit-reversal mapping, being based upon the exchanging of single bits of the data addresses, whilst for the radix-4 case, it is known as the dibit-reversal mapping – referred to hereafter as the DBR mapping, for brevity, as this expression will also be used repeatedly throughout the monograph – being based instead upon the exchanging of pairs of bits of the data addresses. Such data reordering, when mapped onto a single-processor sequential computing device, might typically be carried out via the use of either: 1. An LUT, at the expense of additional memory 2. A fast algorithm using just shifts, additions/subtractions and memory exchanges 3. A fast algorithm that also makes use of a small LUT – containing the reflected bursts of ones that change on the lower end of the data addresses with incrementing address – in order to optimize the speed at the cost of a slight increase in memory, with the optimum choice being dependent upon the available resources and the time constraint imposed by the particular application. This time constraint dictates that the data reordering should be carried out at a sufficiently fast rate in order to keep up with the data set refresh rate, as will be discussed in more detail in Chap. 10, where it will be seen how the data reordering – as required for dealing with both 1-D and m-D problems – might be achieved with approximately 16-fold parallelism when multiple memory banks (eight for our purposes, as will be discussed in Chaps. 4, 5, 6 and 7) are available for the storage of the data, both prior to and following the reordering.

2.5

Discussion

The aim of this chapter has been to highlight both the advantages and the disadvantages of the conventional approaches to the problem of computing the DFT for the case of 1-D real-valued data. As is evident from the examples discussed in Sect. 2.2, namely, the Bergland algorithm and the Bruun algorithm, the adoption of specialized real-data FFT algorithms may well yield solutions possessing attractive performance metrics in terms of their arithmetic and memory requirements – namely, when assessed according to the “Performance Metric for Single-Processor Sequential

2.5 Discussion

33

Computing Device” of Sect. 1.8 in Chap. 1 – but generally this is only achieved at the expense of a more complex algorithmic structure when compared to those of the highly-regular fixed-radix designs. As a result, such algorithms would not seem to lend themselves particularly well to being mapped onto modern parallel computing equipment. Similarly, from the examples of Sect. 2.3, namely, the real-from-complex strategies, the regularity of the conventional fixed-radix designs may only be exploited at the expense of introducing additional processing modules, namely, the pre-FFT and/or post-FFT stages for the packing of the real-valued data into the correct format required for input to the FFT algorithm and for the subsequent unpacking of the FFT output data to obtain the spectrum (or spectra) of the original real-valued data set(s). An additional set of problems associated with the real-from-complex strategies, at least when compared to the more direct approach of a real-data FFT, relate to the need for increased memory and increased processing delay to allow for the possible acquisition/processing of pairs of data sets. It is worth noting that an alternative DSP-based approach to those discussed above is to first convert the real-valued data to complex-valued data by means of a wideband DDC process, this followed by the application of a conventional complex-data FFT. Such an approach, however, introduces an additional function to be performed – typically an FIR filter with length dependent upon the performance requirements of the application – which also introduces an additional processing delay prior to the execution of the FFT. Drawing upon a philosophical analogy, namely, the maxim of the fourteenth-century Franciscan scholar, William of Occam, commonly known as ‘Occam’s Razor’ [15]: why use two functions to perform a given task when just one will suffice! A related and potentially serious problem arises when there is limited information available on the signal under analysis as the integrity of such information might well be compromised via the filtering operation, particularly when the duration of the signal is short relative to that of the transient response of the filter – as might be encountered, for example, with problems relating to the detection of extremely shortduration dual-tone multi-frequency (DTMF) signals [6]. Thus, there are clear drawbacks to all such approaches, especially when the particular application requires a solution in hardware using silicon-based parallel computing equipment, so that the investment of searching for alternative solutions to the problem of computing the real-data DFT is still well merited. More specifically, solutions are required that: 1. Possess highly-regular designs that lend themselves naturally to mapping onto silicon-based parallel computing equipment 2. Possess attractive performance metrics in terms of their arithmetic, memory and power requirements 3. Do not require excessive packing/unpacking requirements 4. Do not incur the latency problems (as arising from the increased processing delay) associated with the adoption of certain of the real-from-complex strategies Such a solution will be fully developed in Chaps. 4, 5, 6 and 7 with the introduction of the generic double butterfly and the regularized FHT.

34

2 The Real-Data Discrete Fourier Transform

References 1. G.D. Bergland, A fast Fourier transform algorithm for real-valued series. Comm. ACM 10, 11 (1968) 2. E.O. Brigham, The Fast Fourier Transform and its Applications (Prentice-Hall, Englewood Cliffs, 1988) 3. G. Bruun, Z-transform DFT filters and FFTs. IEEE Trans. ASSP 1, 26 (January 1978) 4. E. Chu, A. George, Inside the FFT Black Box (CRC Press, 2000) 5. J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(4), 297–301 (1965) 6. A.Z. Dodd, The Essential Guide to Telecommunications, 5th edn. (Prentice-Hall, 2012) 7. P. Duhamel, Implementations of split-radix FFT algorithms for complex, real and realsymmetric data. IEEE Trans. ASSP 34(2), 285–295 (April 1986) 8. P. Duhamel, M. Vetterli, Improved Fourier and Hartley transform algorithms: Application to cyclic convolution of real data. IEEE Trans. ASSP 35(6), 818–824 (June 1987) 9. P. Duhamel, M. Vetterli, Fast Fourier transforms: A tutorial review and a state of the art. Signal Process. 19, 259–299 (1990) 10. O. Ersoy, Real discrete Fourier transform. IEEE Trans. ASSP 4, 33 (April 1985) 11. J.B. Marten, Discrete Fourier transform algorithms for real valued sequences. IEEE Trans. ASSP 2, 32 (February 1984) 12. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice-Hall, 1979) 13. H. Murakami, Real-valued fast discrete Fourier transform and cyclic convolution algorithms of highly composite even length. Proc. ICASSP 3, 1311–1314 (1996) 14. H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms (Springer, 1981) 15. B. Russell, History of Western Philosophy (George Allen & Unwin, 1961) 16. G.R.L. Sohie, W. Chen, “Implementation of Fast Fourier Transforms on Motorola’s Digital Signal Processors”, downloadable document from website: www.Motorola.com 17. H.V. Sorensen, D.L. Jones, M.T. Heideman, C.S. Burrus, Real-valued fast Fourier transform algorithms. IEEE Trans. ASSP 35(6), 849–863 (June 1987) 18. P.R. Uniyal, Transforming real-valued sequences: Fast Fourier versus fast Hartley transform algorithms. IEEE Sig. Proc. 11, 42 (November 1994)

Chapter 3

The Discrete Hartley Transform

3.1

Introduction

An algorithm that would appear to satisfy most, if not all, of the requirements laid down in Sect. 2.5 of Chap. 2 is that of the DHT, as introduced in Eq. 1.4 of Chap. 1, a discrete orthogonal transform [1, 9] that involves only real arithmetic and is intimately related to the DFT, satisfying all of those properties required of it (as discussed in Sect. 3.5) as well as possessing fast algorithms for its efficient solution (as discussed in Sect. 3.6). The algorithm has already been used successfully as an alternative to the DFT for carrying out numerous common DSP-based functions, as is discussed in some depth in Chap. 9 for the case of 1-D real-valued data, as well as proving an attractive alternative to the discrete cosine transform (DCT) [13] for the transform-based coding of signals, such as speech [2], for the purpose of data compression [14]. Before delving into the details of the DHT and its properties, however, it is perhaps worth re-stating the definition, as given in Sect. 1.4 of Chap. 1, namely that for the case of N input/output samples, the 1-D DHT may be expressed in normalized form via the equation N 1 1 X X ðH Þ ½k  ¼ pffiffiffiffi x½n:casð2πnk=N Þ N n¼0

k ¼ 0, 1, . . . , N  1

ð3:1Þ

where the input/output data vectors belong to RN, the linear space of real-valued Ntuples, and the transform kernel is given by the ‘cas’ function: casð2πnk=N Þ ¼ cos ð2πnk=N Þ þ sin ð2πnk=N Þ,

ð3:2Þ

a periodic function with period 2π and possessing (amongst others) the following set of useful properties:

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. J. Jones, The Regularized Fast Hartley Transform, https://doi.org/10.1007/978-3-030-68245-3_3

35

36

3 The Discrete Hartley Transform

casðA þ BÞ ¼ cos A:casB þ sin A:casðBÞ casðA þ BÞ ¼ cos A:casðBÞ þ sin A:casB casA:casB ¼ cos ðA þ BÞ þ sin ðA þ BÞ

ð3:3Þ

casA þ casB ¼ 2:casð1=2ðA þ BÞÞ: cos ð1=2ðA  BÞÞ casA  casB ¼ 2:casð1=2ðA þ BÞÞ: sin ð1=2ðA  BÞÞ as will be exploited later in Chap. 4 for the derivation of the proposed FHT algorithm. This is the definition of the DHT that will be referred to throughout the remainder of this chapter. pffiffiffiffi Note that without the presence of the scaling factor, 1 N, that has been included in the current definition of the DHT, as given above (as well as in the definition of the DFT, as given in Sect. 1.2 of Chap. 1), the magnitudes of the outputs of the second DHT in Eq. 3.5 would actually be equal to N times those of the inputs of the first DHT, so that the role of the scaling factor is to ensure that the magnitudes are preserved. It should be borne in mind, however, that the presence of a coherent signal in the input data will result in most of the growth in magnitude occurring in the forward transform, so that any future scaling strategy – as discussed in Chap. 4 – must reflect this fact. A scaling factor of 1/N is often pffiffiffiffiused for the forward definition of both the DFT and the DHT, with the value of 1 N being used instead here purely for mathematical elegance, as it has the attraction of reducing the definitions of the DHT for both the forward and the inverse (or reverse) directions to an identical form. The fundamental theorems discussed in Sect. 3.5 for both the DFT and the DHT, however, are valid regardless of the particular version of the scaling factor used.

3.2

Orthogonality of DHT

The definition of the DHT, as given by Eq. 3.1 above, may be re-written equivalently in matrix-vector form as ðH Þ

X N ¼ H NN :xN

ð3:4Þ

where HNN, the N  N real-valued Hartley Matrix, is the matrix representation of the transform kernel which, from the definition of the ‘cas’ function, is clearly symmetrical about its leading diagonal and therefore equal to its own transpose. Also, the DHT can be shown (with the application of a little algebra) to be bilateral, whereby the product of the Hartley Matrix with itself reduces to the identity matrix, making the Hartley Matrix equal to both its own transpose and its own inverse so that the forward and inverse versions of the normalized transform are identical. These combined properties mean that the DHT may also be considered to be orthogonal – which simply requires that the inverse of the Hartley Matrix is equal to its own transpose – and thus a member of that class of algorithms comprising the

3.3 Decomposition into Even and Odd Components

37

discrete orthogonal transforms [1] and therefore possessing those properties shared by all those algorithms belonging to this important class. Note that the bilateral property referred to above means that if the transform is applied twice, in succession, the first time to a real-valued data set, {x[n]}, the second time to the output of the first operation, then the output from the second operation, {y[n]}, can be written as fy½ng ¼ DHTðDHTðfx½ngÞÞ  fx½ng,

ð3:5Þ

so that the output of the second operation is actually equivalent, as implied above, to the input of the first operation. One important property of an orthogonal (or unitary) transform, as is stated in Sect. 3.5.9, is that of Parseval’s Theorem [6], which concerns the preservation of energy (up to a scaling factor) contained in the signal under the operation of such a transform. Thus, the energy measured in data-space is equivalent to that measured in transform-space, and when a normalized version of the transform is used (as with the normalized expressions given by Eq. 3.1 for the DHT and Eq. 2.3 of Chap. 2 for the DFT) the measurements will in fact be equal.

3.3

Decomposition into Even and Odd Components

The close relationship between the DFT and the DHT hinges upon symmetry considerations which may be best explained by considering the decomposition of ðH Þ each DHT output into its ‘even’ and ‘odd’ components [4], denoted X E ½k and ðH Þ X O ½k, respectively, for the k’th output, and written as ðH Þ

ðH Þ

X ðH Þ ½k  ¼ X E ½k  þ X O ½k 

ð3:6Þ

ðH Þ

where, for an N-point transform, X E ½k is such that ðH Þ

ðH Þ

X E ½k  ¼ X E ½k

ð3:7Þ

ðH Þ

and X O ½k is such that ðH Þ

ðH Þ

X O ½k  ¼ X O ½k

ð3:8Þ

where, from transform periodicity, index ‘–k’ may be regarded as being equivalent to ‘N–k’. As a result, the even and odd components may each be expressed in terms of the DHT outputs via the expressions

38

3 The Discrete Hartley Transform

  ðH Þ X E ½k  ¼ 1=2 X ðH Þ ½k þ X ðH Þ ½k

ð3:9Þ

  ðH Þ X O ½k ¼ 1=2 X ðH Þ ½k  X ðH Þ ½k ,

ð3:10Þ

and

respectively, from which the relationship between the DFT and DHT outputs may be straightforwardly obtained. Thus, if, as is the case with real-valued data, a DFT spectrum is found to be Hermitian symmetric, then the real component of the spectrum will be an even function whilst the imaginary component will be an odd function.

3.4

Connecting Relations Between DFT and DHT

Firstly, from the equality    nk  casð2πnk=N Þ ¼ Re W nk N  Im W N ,

ð3:11Þ

which relates the kernels of the two transformations, the DFT outputs may be expressed in terms of the DHT outputs as ðH Þ

ðH Þ

X ðF Þ ½k ¼ X E ½k   i:X O ½k ,

ð3:12Þ

so that the real and imaginary components are given, respectively, as     Re X ðFÞ ½k  ¼ 1=2 X ðH Þ ½k þ X ðH Þ ½k

ð3:13Þ

    Im X ðFÞ ½k  ¼ 1=2 X ðH Þ ½k  X ðH Þ ½k ,

ð3:14Þ

and

whilst the DHT outputs may be expressed in terms of the DFT outputs as     X ðH Þ ½k  ¼ Re X ðFÞ ½k   Im X ðFÞ ½k :

ð3:15Þ

3.4 Connecting Relations Between DFT and DHT

3.4.1

39

Real-Data DFT

Thus, from Eqs. 3.13, 3.14 and 3.15, the complex-valued DFT output data set and the real-valued DHT output data set, as obtained from the processing of a real-valued input data set, may now be simply obtained, one from the other, so that a fast algorithm for the solution of the real-data DFT may also be used for the efficient computation of the DHT whilst a fast algorithm for the solution of the DHT may similarly be used for the efficient computation of the real-data DFT. This means, in turn, that the DHT may also be used for solving those DSP-based problems commonly addressed via the DFT, and vice versa. Note from the above equations that pairs of real-valued DHT outputs combine to give individual complex-valued DFT outputs, such that X ðH Þ ½k & X ðH Þ ½k  $ X ðF Þ ½k

ð3:16Þ

for k ¼ 1, 2, . . ., N/2 – 1, whilst the remaining two terms are such that X ðH Þ ½0 $ X ðF Þ ½0

ð3:17Þ

X ðH Þ ½N=2 $ X ðF Þ ½N=2:

ð3:18Þ

and

With regard to the two trivial mappings provided above by Eqs. 3.17 and 3.18, it may also be noted from Eq. 3.11 that when k ¼ 0, we have casð2πnk=N Þ ¼ W nk N ¼ 1,

ð3:19Þ

so that the zero-address component in Hartley-space maps to the zero-address (or zero-frequency) component in Fourier-space, and vice versa, as implied by Eq. 3.17, whilst when k ¼ N/2, we have n casð2πnk=N Þ ¼ W nk N ¼ ð1Þ ,

ð3:20Þ

so that the Nyquist-address component in Hartley-space similarly maps to the Nyquist-address (or Nyquist-frequency) component in Fourier-space, and vice versa, as implied by Eq. 3.18.

40

3 The Discrete Hartley Transform

3.4.2

Complex-Data DFT

Now, having defined the relationship between the Fourier-space and the Hartleyspace representations of a real-valued data set, it is a simple task to extend the results to the case of a complex-valued data set. Given the linearity of the DFT – this property follows from the Addition Theorem to be discussed in the following section – the DFT of a complex-valued data set, {xR[n] + i.xI[n]}, can be written as the sum of the DFTs of the individual real and imaginary components, so that DFT ðfxR ½n þ i:xI ½ngÞ ¼ DFTðfxR ½ngÞ þ i:DFTðfxI ½ngÞ n o n o ðF Þ ðF Þ  X R ½k  þ i: X I ½k :

ð3:21Þ

Therefore, by first taking the DHT of the individual real and imaginary components of the complex-valued data set and then deriving the DFT of each such component by means of Eqs. 3.13 and 3.14, the real and imaginary components of the DFT of the complex-valued data set may be written in terms of the two DHTs as       ðH Þ ðH Þ ðH Þ ðH Þ Re X ðF Þ ½k ¼ 1=2 X R ½k  þ X R ½k  1=2 X I ½k   X I ½k

ð3:22Þ

      ðH Þ ðH Þ ðH Þ ðH Þ Im X ðF Þ ½k ¼ 1=2 X R ½k   X R ½k þ 1=2 X I ½k þ X I ½k ,

ð3:23Þ

and

respectively, so that it is now possible to compute the DFT of both real-valued and complex-valued data sets by means of the DHT – pseudo code is provided for both the real-valued data and complex-valued data cases in Figs. 3.1 and 3.2, respectively. The significance of the decomposition described here for the complex-data DFT is that it introduces an additional level of parallelism to the problem, as the resulting DHTs are independent and thus able to be computed simultaneously, or in parallel, when implemented with parallel computing technology – a subject to be discussed in Chap. 5. This is particularly relevant when the transform is long, the throughput requirement high and a fast algorithm is available for the efficient computation of each DHT.

3.5

Fundamental Theorems for DFT and DHT

As has already been stated, if the DFT and DHT algorithms are to be used interchangeably for solving certain types of signal processing problem, then it is essential that there are corresponding theorems [4] for the two transforms which enable the input data sets to be similarly related to their respective transforms. Using the normalized definition of the DHT, as given by Eq. 3.1 – with a similar scaling

3.5 Fundamental Theorems for DFT and DHT

41

# # Description: The real and imaginary components of the real-data N-point DFT outputs are optimally stored in the following way: XRdata[0] XRdata[1] XRdata[N–1] XRdata[2] XRdata[N–2] --- --- ---

= zeroth frequency output = real component of 1st frequency output = imaginary component of 1st frequency output = real component of 2nd frequency output = imaginary component of 2nd frequency output --- --- ---

--- --- ---

--- --- ---

XRdata[N/2–1] = real component of (N/2–1)th frequency output XRdata[N/2+1] = imaginary component of (N/2–1)th frequency output XRdata[N/2] = (N/2)th frequency output # # Note: The components XRdata[0] and XRdata[N/2] do not need to be modified to yield zeroth and (N/2)th frequency outputs. # # Pseudo-Code for DHT-to-DFT Conversion: k = N – 1; for ( j = 1; j < (N/2); j=j+1) { store = XRdata[k] + XRdata[j]; XRdata[k] = XRdata[k] – XRdata[j]; XRdata[j] = store; XRdata[j] = XRdata[j] / 2; XRdata[k] = XRdata[k] / 2; k = k – 1; } Fig. 3.1 Pseudo-code for computing real-data DFT outputs from DHT outputs

strategy assumed for the definition of the DFT, as given by Eq. 1.1 of Chap. 1 – together with the connecting relations of Sect. 3.4, the following commonly encountered theorems may be derived, each one carrying over from one transform-space to the other. Note that the 1-D data set is assumed, in each case, to be of length N.

3.5.1

Reversal Theorem

The DFT-based relationship is given by n o DFTðfx½ngÞ ¼ X ðFÞ ½k ,

ð3:24Þ

with the corresponding DHT-based relationship given by n o DHTðfx½ngÞ ¼ X ðH Þ ½k  :

ð3:25Þ

42

3 The Discrete Hartley Transform

# # Description: The complex-data N-point DFT outputs are optimally stored with array ‘XRdata’ holding the real component of both the input and output data, whilst the array ‘XIdata’ holds the imaginary component of both the input and output data. # # Note: The components XRdata[0] and XRdata[N/2] do not need to be modified to yield zeroth and (N/2)th frequency outputs. # # Pseudo-Code for DHT-to-DFT Conversion: k = N – 1; for (j = 1; j < (N/2); j=j+1) { // Real component data channel. store = XRdata[k] + XRdata[j]; XRdata[k] = XRdata[k] – XRdata[j]; XRdata[j] = store; XRdata[j] = XRdata[j] / 2; XRdata[k] = XRdata[k] / 2; // Imaginary component data channel. store = XIdata[k] + XIdata[j]; XIdata[k] = XIdata[k] – XIdata[j]; XIdata[j] = store; XIdata[j] = XIdata[j] / 2; XIdata[k] = XIdata[k] / 2; // Combine outputs from data channels. store1 = XRdata[j] + XIdata[k]; store2 = XRdata[j] – XIdata[k]; store3 = XIdata[j] + XRdata[k]; XIdata[k] = XIdata[j] – XRdata[k]; XRdata[j] = store2; XRdata[k] = store1; XIdata[j] = store3; k = k – 1; } Fig. 3.2 Pseudo-code for computing complex-data DFT outputs from DHT outputs

3.5.2

Addition Theorem

The DFT-based relationship is given by DFTðfx1 ½n þ x2 ½ngÞ ¼ DFTðfx1 ½ngÞ þ DFTðfx2 ½ngÞ, n o n o ðF Þ ðF Þ ¼ X 1 ½k  þ X 2 ½k  ,

ð3:26Þ

with the corresponding DHT-based relationship given by DHTðfx1 ½n þ x2 ½ngÞ ¼ DHTðfx1 ½ngÞ þ DHTðfx2 ½ngÞ, n o n o ðH Þ ðH Þ ¼ X 1 ½k  þ X 2 ½k :

ð3:27Þ

3.5 Fundamental Theorems for DFT and DHT

3.5.3

43

Shift Theorem

The DFT-based relationship is given by n o DFTðfx½n  n0 gÞ ¼ ei2πn0 k=N :X ðFÞ ½k ,

ð3:28Þ

with the corresponding DHT-based relationship given by DHTðfx½n  n0 gÞ ¼ n o n o cos ð2πn0 k=N Þ:X ðH Þ ½k   sin ð2πn0 k=N Þ:X ðH Þ ½k  :

3.5.4

ð3:29Þ

Convolution Theorem

Denoting the operation of circular or cyclic convolution by means of the symbol ‘’, the DFT-based relationship (up to a scaling factor) is given by n o ðF Þ ðF Þ DFTðfx1 ½ng  fx2 ½ngÞ ¼ X 1 ½k:X 2 ½k ,

ð3:30Þ

with the corresponding DHT-based relationship (up to a scaling factor) given by DHTðfx1 ½ng  fx2 ½ngÞ n  ðH Þ ðH Þ ðH Þ ðH Þ ¼ 1=2 X 1 ½k:X 2 ½k  X 1 ½k :X 2 ½k  o ðH Þ ðH Þ ðH Þ ðH Þ þX 1 ½k :X 2 ½k  þ X 1 ½k :X 2 ½k :

3.5.5

ð3:31Þ

Product Theorem

The DFT-based relationship (up to a scaling factor) is given by n o n o ðF Þ ðF Þ DFTðfx1 ½n:x2 ½ngÞ ¼ X 1 ½k  X 2 ½k  ,

ð3:32Þ

with the corresponding DHT-based relationship (up to a scaling factor) given by

44

3 The Discrete Hartley Transform

DHTðfx1 ½n:x2 ½ngÞ ¼ n  ðH Þ ðH Þ ðH Þ ðH Þ 1=2 X 1 ½k   X 2 ½k   X 1 ½k   X 2 ½k þ o ðH Þ ðH Þ ðH Þ ðH Þ X 1 ½k   X 2 ½k þ X 1 ½k  X 2 ½k :

3.5.6

ð3:33Þ

Autocorrelation Theorem

Denoting the operation of circular or cyclic correlation by means of the symbol ‘’, the DFT-based relationship (up to a scaling factor) is given by DFTðfx½ng  fx½ngÞ ¼

n  o X ðF Þ ½k2 ,

ð3:34Þ

with the corresponding DHT-based relationship (up to a scaling factor) given by DHTðfx½ng  fx½ngÞ ¼

3.5.7

n  2  2  o 1=2 X ðH Þ ½k  þ X ðH Þ ½k  :

ð3:35Þ

First Derivative Theorem

The DFT-based relationship is given by n o DFTðfx0 ½ngÞ ¼ i:2πk:X ðF Þ ½k ,

ð3:36Þ

with the corresponding DHT-based relationship given by n o DHTðfx0 ½ngÞ ¼ 2πk:X ðH Þ ½k :

3.5.8

ð3:37Þ

Second Derivative Theorem

The DFT-based relationship is given by n o DFTðfx00 ½ngÞ ¼ 4π 2 k 2 :X ðFÞ ½k ,

ð3:38Þ

3.5 Fundamental Theorems for DFT and DHT

45

with the corresponding DHT-based relationship given by n o DHTðfx00 ½ngÞ ¼ 4π 2 k2 :X ðH Þ ½k :

3.5.9

ð3:39Þ

Summary of Theorems and Related Properties

This section simply highlights the fact that for every fundamental theorem associated with the DFT, there is an analogous theorem for the DHT, which may be applied, in a straightforward fashion, so that the DHT may be used to address the same type of signal processing problems as solved by the DFT, and vice versa. An important example is that of the digital filtering of an effectively infinite-length data sequence with a fixed-length FIR filter – as will be discussed in Chap. 9 – the process being more commonly referred to as continuous convolution, where the associated linear convolution is carried out via the piecewise application of the CCT using either the ‘overlap-add’ or the ‘overlap-save’ technique [6] for connecting the resulting pieces together in a relatively seamless fashion. The role of the DHT, in this respect, is much like that of the number-theoretic transforms (NTTs) [10] – as typified by the Fermat number transform (FNT) and the Mersenne number transform (MNT) – which gained considerable popularity back in the 1970s amongst the academic community. These transforms, which are defined over finite or Galois fields [10] via the use of residue number arithmetic [10], exist primarily for their ability to satisfy the CCT within the arithmetic field of interest. An additional and important result that may be derived via the Product Theorems of Eqs. 3.32 and 3.33, is that when the real-valued data sets, {x1[n]} and {x2[n]}, are identical, Parseval’s Theorem [6] may be obtained as N 1 X n¼0

jx½nj2 

N 1  N 1  X X   X ðFÞ ½k 2  X ðH Þ ½k 2 , k¼0

ð3:40Þ

k¼0

which simply states that the energy contained in the signal is preserved (up to a scaling factor) under the operation of both the DFT and the DHT (and, in fact, under any discrete unitary or orthogonal transformation), so that the energy measured in data-space is equivalent to that measured in transform-space. This theorem will be used later in Chap. 8, where it will be invoked to enable a fast radix-4 FHT algorithm, based upon transform-space interpolation via the processing of a zeropadded data set, to be applied to the fast computation of the real-data DFT whose transform length is a power of two (a radix-2 integer), but not a power of four (a radix-4 integer). Also, note that Eq. 3.34 above simply states the familiar result that by taking the DFT of the discrete version of the autocorrelation function one obtains the discrete

46

3 The Discrete Hartley Transform

version of the PSD [6, 9], this function being typically obtained therefore by forming the squared-magnitudes of the DFT outputs where the DFT computation is efficiently performed by means of a suitably chosen FFT algorithm. However, given the availability of Hartley-space outputs, there is no need for these to be first transformed to Fourier-space in order to compute the PSD, as the PSD may be computed directly from either the Fourier-space outputs (as given by Eq. 3.34) or the Hartley-space outputs (as given by Eq. 3.35), with the k’th component of the PSD being expressed as  2  2  2 PSD½k ¼ X ðF Þ ½k  1=2 X ðH Þ ½k  þ X ðH Þ ½k ,

ð3:41Þ

where, for real-valued data, ‘k’ may take on values from 0 up to N/2–1.

3.6

Fast Solutions to DHT – The FHT Algorithm

Knowledge that the DHT is in possession of many of the same properties as the DFT is all very well, but to be of practical significance, it is also necessary that the DHT, like the DFT, possesses fast recursive algorithms for its efficient computation. The first widely published work in this field is thought to be that due to Ronald Bracewell [3–5], who produced both radix-2 and radix-4 versions of the DIT formulation of the fixed-radix FHT algorithm. His work in this field was summarized in a short monograph [4], which did much to popularize the use of the FHT amongst the DSP community and which has formed the inspiration for the work discussed here. The solutions produced by Bracewell are attractive in that they achieve the desired performance metrics in terms of their arithmetic and memory requirements. That is, compared to a conventional complex-data FFT, they require one half of the arithmetic operations and one half the memory requirement, but suffer from the fact that they need two sizes of butterfly – and thus two separate butterfly designs – for efficient fixed-radix formulations. For the radix-4 algorithm, for example, a singlesized butterfly is used to produce four outputs from four inputs, as shown in Fig. 3.3, whilst a double-sized butterfly is used to produce eight outputs from eight inputs, as shown in Fig. 3.4, both of which will be developed in some detail from first principles in Chap. 4. This lack of regularity makes an in-place solution (whereby each output set produced by the butterfly is to be written back to the same set of memory locations as used by the input set) somewhat difficult to achieve, necessitating the use of additional memory between the temporal stages, as well as making an efficient mapping onto parallel computing equipment somewhat less than straightforward. Although other algorithmic variations for the efficient solution to the DHT have subsequently appeared [7, 8, 15], they all suffer, to varying extents, in terms of their lack of regularity, so that alternative solutions to the DHT are still sought that

3.6 Fast Solutions to DHT – The FHT Algorithm

47

X[0]

X[0]

X[1]

-

X[1]

X[2] -

X[3]

-

X[2]

-

X[3]

(a)

X[0]

X[0]

2

X[1] X[2]

-

2

X[1] X[2]

-

X[3]

-

X[3]

(b) Fig. 3.3 Signal flow graphs for single-sized butterfly for radix-4 FHT algorithm. (a) Zero-address version of single-sized butterfly. (b) Nyquist-address version of single-sized butterfly

possess the regularity associated with the complex-data versions of the fixed-radix FFT algorithms but without sacrificing the benefits of the existing FHT algorithms in terms of their reduced arithmetic and memory requirements and their optimal timecomplexity (in terms of latency or update time). Various FHT designs could be studied, including versions of the popular radix-2 factorization and the Split-Radix Algorithm [7], but when transform lengths allow for comparison, the radix-4 FHT is more computationally efficient than the radix-2 FHT, its design more regular than that of the Split-Radix FHT, and it has the potential for an eight-fold speed up with parallel computing equipment over that achievable via a purely sequential solution. This makes the radix-4 version of the FHT a good candidate to pursue for potential hardware implementation and, as a result, has been selected as the algorithm of choice in this monograph.

48

3 The Discrete Hartley Transform trigonometric coefficients X[0]

X[0]

X[1]

-

-

-

X[1] X[2]

X[2]

X[3]

-

-

-

X[3] X[4]

X[4]

X[5]

-

-

-

X[5] X[6]

X[6]

X[7]

-

-

-

X[7]

Fig. 3.4 Signal flow graph for double-sized butterfly for radix-4 FHT algorithm

3.7

Accuracy Considerations

When compared to a full-length FFT solution based upon one of the real-fromcomplex strategies, as discussed in Sect. 2.3 of Chap. 2, the FHT approach will involve approximately the same number of arithmetic operations (when the complex arithmetic operations of the FFT are reduced to equivalent real arithmetic operations) in order to obtain each real-data DFT output. The associated numerical errors may be due to both rounding, as introduced via the discarding of the lower-order bits from the fixed-point multiplier outputs, and truncation, as introduced via the discarding of the least-significant bit from the adder outputs after an overflow has occurred. The underlying characteristics of such errors for the two approaches will also be very similar, however, due to the similarity of their butterfly structures, so that when compared to FFT-based solutions possessing a comparable arithmetic requirement, the errors will inevitably be very similar [11, 16]. This feature of the FHT will be particularly relevant when dealing with a fixedpoint implementation, as is still the case with most solutions that are to be mapped onto an FPGA or an ASIC (although the availability of floating-point arithmetic units is becoming increasing more common with such devices as an additional

3.8 Discussion

49

embedded resource, albeit an expensive one in terms of silicon resources when compared to the fixed-point arithmetic unit), where the combined effects of both truncation errors [12] and rounding errors [12] will need to be properly assessed and catered for through the optimum choice of word-length and scaling strategy.

3.8

Discussion

When the DHT is applied to the computation of the DFT, as discussed in Sect. 3.4, a conversion routine is required to map the DHT outputs from Hartley-space to Fourier-space. For the case of real-valued data, as outlined in Fig. 3.1, the conversion process involves two real additions/subtractions for each DFT output together with two scaling operations, whilst for the complex-data case, as outlined in Fig. 3.2, this increases to four real additions/subtractions for each DFT output together with two scaling operations – note that the DFT output will always be complex-valued with a non-zero imaginary component unless the input data happens to be Hermitian symmetric! All the scaling operations, however, are by a factor of two, which in fixed-point arithmetic reduces to that of a simple right-shift operation of length one. Also, it should be noted that with many of the specialized real-data FFT algorithms, apart from their lack of regularity, they also suffer from the fact that different algorithms are often required for the fast computation of the forward and the inverse DFT algorithms. Clearly, in applications requiring transform-space processing followed by a return to data-space, as encountered, for example, with matched filtering, this could prove something of a disadvantage, particularly when compared to the adoption of a bilateral transform, such as the DHT, where the definitions of both the forward and the inverse transforms, up to a scaling factor, are identical. Finally, note that the presence of the factor 1/2 appearing in several of the DHT-based theorems provided in Sect. 3.5 is due to the fact that the ‘cas’ function pffiffiffi is 2 times stronger than the sinusoidal or cosinusoidal functions, so that when squared terms are formed this discrepancy increases to a factor of 2. Also, note that whenever theorems about the DHT and its properties involve ‘dual’ Hartley-space terms in their expression – such as the terms X(H )[k] and X(H )[k], for example, in the circular convolution and correlation theorems –it is necessary for care to be taken to treat the zero-address and the Nyquist-address terms separately, as neither term possesses its own dual. It will be seen in Chap. 4, for example, that the way these two terms are treated is crucial to the development of a single generic double butterfly design, as required for the efficient computation of the proposed radix-4 version of the FHT.

50

3 The Discrete Hartley Transform

References 1. N. Ahmed, C.B. Johnson, Orthogonal Transforms for Digital Signal Processing (Springer, 2012) 2. N. Aloui, S. Bousselmi, A. Cherif, New algorithm for speech compression based on discrete Hartley transform. Int. Arab J. Inf. Technol. 16(1), 156–162 (January 2019) 3. R.N. Bracewell, The fast Hartley transform. Proc. IEEE 72(8) (August 1984) 4. R.N. Bracewell, The Hartley Transform (Oxford University Press, 1986) 5. R.N. Bracewell, Computing with the Hartley transform. Comput. Phys. 9(4), 373 (July/August 1995) 6. E.O. Brigham, The Fast Fourier Transform and its Applications (Prentice-Hall, Englewood Cliffs, 1988) 7. P. Duhamel, Implementations of split-radix FFT algorithms for complex, real and realsymmetric data. IEEE Trans. ASSP 34(2), 285–295 (April 1986) 8. P. Duhamel, M. Vetterli, Improved Fourier and Hartley transform algorithms: application to cyclic convolution of real data. IEEE Trans. ASSP 35(6), 818–824 (June 1987) 9. D.F. Elliott, K. Ramamohan Rao, Fast Transforms: Algorithms, Analyses, Applications (Academic, New York, 1982) 10. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice-Hall, Englewood Cliffs, 1979) 11. J.B. Nitschke, G.A. Miller, Digital filtering in EEG/ERP analysis: Some technical and empirical comparisons. Behav. Res. Methods Instrum. Comput. 30(1), 54–67 (1998) 12. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice-Hall, 1975) 13. K.R. Rao, P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications (Academic, Boston, 1990), pp. 2–14 14. D. Salomon, Data Compression: The Complete Reference (Springer, New York, 2004) 15. H.V. Sorensen, D.L. Jones, C.S. Burrus, M.T. Heideman, On computing the discrete Hartley transform. IEEE ASSP 33, 1231–1238 (1985) 16. A. Zakhor, A.V. Oppenheim, Quantization errors in the computation of the discrete Hartley transform. IEEE Trans. ASSP 35(11), 1592–1602 (1987)

Part II

The Regularized Fast Hartley Transform

Chapter 4

Derivation of Regularized Formulation of Fast Hartley Transform

4.1

Introduction

A drawback of conventional FHT algorithms, as highlighted in Chap. 3, lies in the need for two sizes of butterfly – and thus for two separate butterfly designs – for efficient fixed-radix formulations. For the case of the radix-4 FHT to be discussed here, a single-sized butterfly, producing four outputs from four inputs, is required for both the zero-address and the Nyquist-address iterations of the relevant temporal stages of the algorithm, whilst a double-sized butterfly, producing eight outputs from eight inputs, is required for each of the remaining iterations. We look now at how this lack of regularity might be overcome, bearing in mind the requirement to map the resulting algorithmic structure onto suitably defined parallel computing equipment in a way that’s ultimately consistent with the silicon-based performance metric, as stated in Sect. 1.8 of Chap. 1. Note that the attraction of a solution based upon the radix-4 factorization, rather than that of the more familiar radix-2 factorization, is its greater computational efficiency – in terms of both a reduced arithmetic requirement and reduced memory access for the retrieval of data from memory – and the potential for exploiting greater parallelism, at the arithmetic level, via the larger-sized butterfly, thereby offering the possibility of achieving a higher computational density when implemented in silicon, to be discussed in greater detail in Chap. 6.

4.2

Derivation of the Conventional Radix-4 Butterfly Equations

The first step towards achieving this goal concerns the derivations of the two different-sized butterflies – the single-sized and the double-sized – as required for efficient implementation of the radix-4 FHT. A DIT formulation of the algorithm is © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. J. Jones, The Regularized Fast Hartley Transform, https://doi.org/10.1007/978-3-030-68245-3_4

53

54

4

Derivation of Regularized Formulation of Fast Hartley Transform

to be adopted which will prove to be particularly suited to the chosen computing architecture, to be identified and discussed in Chaps. 5 and 6, as well as yielding a slightly better signal-to-noise ratio (SNR) than a DIF formulation when the processing is to be performed with fixed-point arithmetic [8, 13]. In fact, the noise variance of the DIF formulation of the algorithm can be shown to be twice that of the DIT formulation [13], so that the DIT approach offers the possibility of using shorter word lengths and ultimately less silicon for a given level of performance. Let us first decompose the basic DHT expression as given by Eq. 3.1 of Chap. 3 – although in this instance without the scaling factor and with the output vector X(H ) now replaced simply by X for ease of exposition – into four partial summations, such that X ½k  ¼

N 1 X

x½n:casð2πnk=N Þ

n¼0



N=41 X

x½4n:casð2π ð4nÞk=N Þ

n¼0

þ

N=41 X

x½4n þ 1:casð2π ð4n þ 1Þk=N Þ

ð4:1Þ

n¼0

þ

N=41 X

x½4n þ 2:casð2π ð4n þ 2Þk=N Þ

n¼0

þ

N=41 X

x½4n þ 3:casð2π ð4n þ 3Þk=N Þ:

n¼0

Suppose now that x1 ½n ¼ x½4n, x2 ½n ¼ x½4n þ 1, x3 ½n ¼ x½4n þ 2 & x4 ½n ¼ x½4n þ 3

ð4:2Þ

and note from Eq. 3.3 of Chap. 3 that casð2π ð4n þ r Þk=N Þ ¼ casð2πnk=ðN=4Þ þ 2πrk=N Þ ¼ cos ð2πrk=N Þ:casð2πnk=ðN=4ÞÞ

ð4:3Þ

þ sin ð2πrk=N Þ:casð2πnk=ðN=4ÞÞ and casð2πnk=N Þ ¼ casð2πnðN  kÞ=N Þ: Then if the partial summations of Eq. 4.1 are written as

ð4:4Þ

4.2 Derivation of the Conventional Radix-4 Butterfly Equations

X 1 ½k  ¼

N=41 X

55

x1 ½n:casð2πnk=ðN=4ÞÞ

ð4:5Þ

x2 ½n:casð2πnk=ðN=4ÞÞ

ð4:6Þ

x3 ½n:casð2πnk=ðN=4ÞÞ

ð4:7Þ

x4 ½n:casð2πnk=ðN=4ÞÞ,

ð4:8Þ

n¼0

X 2 ½k  ¼

N=41 X n¼0

X 3 ½k  ¼

N=41 X n¼0

X 4 ½k  ¼

N=41 X n¼0

it enables the equation to be rewritten as X ½k  ¼ X 1 ½k þ cos ð2πk=N Þ:X 2 ½k þ sin ð2πk=N Þ:X 2 ½N=4  k  þ cos ð4πk=N Þ:X 3 ½k þ sin ð4πk=N Þ:X 3 ½N=4  k  þ cos ð6πk=N Þ:X 4 ½k þ sin ð6πk=N Þ:X 4 ½N=4  k ,

ð4:9Þ

the first of the double-sized butterfly equations. Now, by exploiting the properties of Eqs. 4.3 and 4.4, the remaining double-sized butterfly equations may be written as X ½N=4  k  ¼ X 1 ½N=4  k þ sin ð2πk=N Þ:X 2 ½N=4  k  þ cos ð2πk=N Þ:X 2 ½k  cos ð4πk=N Þ:X 3 ½N=4  k þ sin ð4πk=N Þ:X 3 ½k

ð4:10Þ

 sin ð6πk=N Þ:X 4 ½N=4  k   cos ð6πk=N Þ:X 4 ½k X ½k þ N=4 ¼ X 1 ½k  sin ð2πk=N Þ:X 2 ½k þ cos ð2πk=N Þ:X 2 ½N=4  k   cos ð4πk=N Þ:X 3 ½k  sin ð4πk=N Þ:X 3 ½N=4  k 

ð4:11Þ

þ sin ð6πk=N Þ:X 4 ½k  cos ð6πk=N Þ:X 4 ½N=4  k  X ½N=2  k ¼ X 1 ½N=4  k  cos ð2πk=N Þ:X 2 ½N=4  k þ sin ð2πk=N Þ:X 2 ½k þ cos ð4πk=N Þ:X 3 ½N=4  k  sin ð4πk=N Þ:X 3 ½k  cos ð6πk=N Þ:X 4 ½N=4  k þ sin ð6πk=N Þ:X 4 ½k

ð4:12Þ

56

4

Derivation of Regularized Formulation of Fast Hartley Transform

X ½k þ N=2 ¼ X 1 ½k  cos ð2πk=N Þ:X 2 ½k  sin ð2πk=N Þ:X 2 ½N=4  k þ cos ð4πk=N Þ:X 3 ½k þ sin ð4πk=N Þ:X 3 ½N=4  k  cos ð6πk=N Þ:X 4 ½k  sin ð6πk=N Þ:X 4 ½N=4  k X ½3N=4  k  ¼ X 1 ½N=4  k  sin ð2πk=N Þ:X 2 ½N=4  k   cos ð2πk=N Þ:X 2 ½k  cos ð4πk=N Þ:X 3 ½N=4  k þ sin ð4πk=N Þ:X 3 ½k þ sin ð6πk=N Þ:X 4 ½N=4  k  þ cos ð6πk=N Þ:X 4 ½k X ½k þ 3N=4 ¼ X 1 ½k þ sin ð2πk=N Þ:X 2 ½k   cos ð2πk=N Þ:X 2 ½N=4  k  cos ð4πk=N Þ:X 3 ½k  sin ð4πk=N Þ:X 3 ½N=4  k

ð4:13Þ

ð4:14Þ

ð4:15Þ

 sin ð6πk=N Þ:X 4 ½k  þ cos ð6πk=N Þ:X 4 ½N=4  k X ½N  k ¼ X 1 ½N=4  k  þ cos ð2πk=N Þ:X 2 ½N=4  k  sin ð2πk=N Þ:X 2 ½k  þ cos ð4πk=N Þ:X 3 ½N=4  k  sin ð4πk=N Þ:X 3 ½k 

ð4:16Þ

þ cos ð6πk=N Þ:X 4 ½N=4  k  sin ð6πk=N Þ:X 4 ½k , where N/4 is the length of the DHT output subsequences, {X1[k]}, {X2[k]}, {X3[k]} and {X4[k]}, and the parameter ‘k’ varies from 1 up to N/8–1. When k ¼ 0, which corresponds to the zero-address case, we obtain the first set of single-sized butterfly equations: X ½0 ¼ X 1 ½0 þ X 2 ½0 þ X 3 ½0 þ X 4 ½0

ð4:17Þ

X ½N=4 ¼ X 1 ½0 þ X 2 ½0  X 3 ½0  X 4 ½0

ð4:18Þ

X ½N=2 ¼ X 1 ½0  X 2 ½0 þ X 3 ½0  X 4 ½0

ð4:19Þ

X ½3N=4 ¼ X 1 ½0  X 2 ½0  X 3 ½0 þ X 4 ½0,

ð4:20Þ

and when k ¼ N/8, which corresponds to the Nyquist-address case, we obtain the second set of single-sized butterfly equations: pffiffiffi 2:X 2 ½N=8 þ X 3 ½N=8 pffiffiffi X ½3N=8 ¼ X 1 ½N=8  X 3 ½N=8 þ 2:X 4 ½N=8 pffiffiffi X ½5N=8 ¼ X 1 ½N=8  2:X 2 ½N=8 þ X 3 ½N=8 pffiffiffi X ½7N=8 ¼ X 1 ½N=8  X 3 ½N=8  2:X 4 ½N=8: X ½N=8 ¼ X 1 ½N=8 þ

ð4:21Þ ð4:22Þ ð4:23Þ ð4:24Þ

Thus, two different-sized butterflies are required for efficient computation of the DIT formulation of the radix-4 FHT, their SFGs being as given in Figs. 3.3 and 3.4

4.3 Single-to-Double Conversion of Radix-4 Butterfly Equations

57

of Chap. 3. For the single-sized butterfly equations, the computation of each output involves the addition of either three or four terms, whereas for the double-sized butterfly equations, the computation of each output involves the addition of exactly seven terms. The resulting lack of regularity makes an attractive hardware implementation very difficult to achieve, therefore, without suitable reformulation of the associated equations.

4.3

Single-to-Double Conversion of Radix-4 Butterfly Equations

Thus, in order to derive a computationally efficient single-design solution to the radix-4 FHT, it is necessary to ‘regularize’ the algorithm structure by replacing the single-sized and double-sized butterflies with a single generic version of the doublesized butterfly. Before this can be achieved, however, it is first necessary to show how the single-sized butterfly equations may be converted to the same form as those of the double-sized butterfly. When just the zero-address equations need to be carried out, it may be achieved via the interleaving of two sets, each of four equations, one set involving the four terms {X1[0], X2[0], X3[0], X4[0]}, say, and the other set involving the four terms {Y1[0], Y2[0], Y3[0], Y4[0]}, say. This yields the modified butterfly equations: X ½0 ¼ X 1 ½0 þ X 2 ½0 þ X 3 ½0 þ X 4 ½0

ð4:25Þ

Y ½0 ¼ Y 1 ½0 þ Y 2 ½0 þ Y 3 ½0 þ Y 4 ½0

ð4:26Þ

X ½N=4 ¼ X 1 ½0 þ X 2 ½0  X 3 ½0  X 4 ½0

ð4:27Þ

Y ½N=4 ¼ Y 1 ½0 þ Y 2 ½0  Y 3 ½0  Y 4 ½0

ð4:28Þ

X ½N=2 ¼ X 1 ½0  X 2 ½0 þ X 3 ½0  X 4 ½0

ð4:29Þ

Y ½N=2 ¼ Y 1 ½0  Y 2 ½0 þ Y 3 ½0  Y 4 ½0

ð4:30Þ

X ½3N=4 ¼ X 1 ½0  X 2 ½0  X 3 ½0 þ X 4 ½0

ð4:31Þ

Y ½3N=4 ¼ Y 1 ½0  Y 2 ½0  Y 3 ½0 þ Y 4 ½0,

ð4:32Þ

with the associated double-sized butterfly being referred to as the Type-I butterfly. Similarly, when both the zero-address and the Nyquist-address equations need to be carried out – which is always the case when the solution to the Nyquist-address equations is required – the two corresponding sets of equations may be interleaved and combined in the same fashion as above to yield the butterfly equations: X ½0 ¼ X 1 ½0 þ X 2 ½0 þ X 3 ½0 þ X 4 ½0

ð4:33Þ

58

4

Derivation of Regularized Formulation of Fast Hartley Transform

X ½N=8 ¼ X 1 ½N=8 þ

pffiffiffi 2:X 2 ½N=8 þ X 3 ½N=8

ð4:34Þ

X ½N=4 ¼ X 1 ½0 þ X 2 ½0  X 3 ½0  X 4 ½0 pffiffiffi X ½3N=8 ¼ X 1 ½N=8  X 3 ½N=8 þ 2:X 4 ½N=8

ð4:35Þ

X ½N=2 ¼ X 1 ½0  X 2 ½0 þ X 3 ½0  X 4 ½0 pffiffiffi X ½5N=8 ¼ X 1 ½N=8  2:X 2 ½N=8 þ X 3 ½N=8

ð4:37Þ

X ½3N=4 ¼ X 1 ½0  X 2 ½0  X 3 ½0 þ X 4 ½0 pffiffiffi X ½7N=8 ¼ X 1 ½N=8  X 3 ½N=8  2:X 4 ½N=8,

ð4:39Þ

ð4:36Þ

ð4:38Þ

ð4:40Þ

with the associated double-sized butterfly being referred to as the Type-II butterfly. With the indexing assumed to start from zero, rather than one, the even-indexed equations thus correspond to the zero-address butterfly and the odd-indexed equations to the Nyquist-address butterfly. Thus, the sets of single-sized butterfly equations may be reformulated in such a way that the resulting composite butterflies now accept eight inputs and produce eight outputs, the same as for the standard radix-4 double-sized butterfly, referred to as the Type-III butterfly. The result is that the radix-4 FHT, instead of requiring both single-sized and double-sized butterflies, may now be carried out instead with three simple variations of the double-sized butterfly.

4.4

Radix-4 Factorization of the FHT

A radix-4 factorization of the FHT may be obtained in a straightforward fashion in terms of the double-sized butterfly equations through application of the familiar divide-and-conquer [7] principle, as used in the derivation of other fast discrete unitary and orthogonal transforms [4], such as the FFT. This factorization leads to the algorithm described by the pseudo-code of Fig. 4.1, where all instructions within the scope of the outermost ‘for’ loop constitute a single iteration in the ‘temporal’ domain and all instructions within the scope of the innermost ‘for’ loop constitute a single iteration in the ‘spatial’ domain. Thus, each iteration in the temporal domain, more commonly referred to as a ‘stage’, comprises N/8 iterations in the spatial domain where each such iteration corresponds to the execution of a single set of double-sized butterfly equations. The implication of the above definitions is that for the processing of a single Nsample data set, where N is a power of four, the computations associated with a given stage may only be carried out after those of its predecessor and before those of its successor, whereas those associated with every iteration of a given stage may in theory be executed simultaneously. Thus, each stage is time dependent and may only be executed sequentially, whereas if all N samples of the input data set to a given

4.4 Radix-4 Factorization of the FHT Fig. 4.1 Pseudo-code for radix-4 factorization of FHT algorithm

// // //

//

// //

59

Set up transform length. N = 4α ; Di-bit reverse input data addresses. ( in ) X N = PΦ 0 .x N ; Loop through log4 temporal stages. offset = 1; for (i = 0; i < α ; i=i+1) { M = 8×offset; Loop through N/8 spatial iterations. for (j = 0; j < N; j=j+M) { for (k = 0; k < offset; k=k+1) Carry out radix-4 double butterfly equations. { Double Butterfly Routine:

//

computes 8 outputs from 8 inputs ( out )

XN

(

)

M = f X N , CM n = 0,1,2,3 n , k , Sn , k ( in )

} } offset = 2 (2i + 1);

}

stage are available for processing, then the iterations within that stage may in theory be executed simultaneously, or in parallel. Note from the pseudo-code of Fig. 4.1 that Φ0 is the bijective mapping, or permutation, and PΦ0 the associated permutation matrix corresponding to the DBR mapping of the FHT input data addresses, whilst the double-sized butterfly section referred to in the pseudo-code makes use of both cosinusoidal and sinusoidal terms, as given by CM n,k ¼ cos ð2πnk=M Þ

n ¼ 0, 1, 2, 3

ð4:41Þ

SM n,k ¼ sin ð2πnk=M Þ,

n ¼ 0, 1, 2, 3,

ð4:42Þ

and

respectively, the trigonometric coefficients defined in Chap. 1 which are each a function of the indices of the innermost and outermost ‘for’ loops. For the FHT factorization described here, the double-sized butterfly routine referred to in the pseudo-code implements either the Type-I butterfly of Eqs. 4.9– 4.16, the Type-II butterfly of Eqs. 4.25–4.32 or the Type-III butterfly of Eqs. 4.33– 4.40. As a result, the FHT appears to require a different SFG for each ‘type’ of double butterfly and so appears to lack at this stage the regularity necessary for an

60

4

Derivation of Regularized Formulation of Fast Hartley Transform

efficient mapping onto a single regular computational structure, as will be required for an efficient hardware implementation with suitably defined parallel computing equipment.

4.5

Closed-Form Expression for Generic Radix-4 Double Butterfly

The first step towards addressing this problem is to reformulate the double-sized butterfly equations so that they may be expressed in a recursive closed-form fashion, as once this is achieved it will then be a simple task to show how the same SFG can be used to describe the operation of each of the Type-I, Type-II and Type-III doublesized butterflies. This first step is achieved through the introduction of the address permutations Φ1, Φ2, Φ3 and Φ4, as defined in Table 4.1, and through the introduction of arithmetic redundancy into the processing via the use of the trigonometric M coefficients EM n,k (for the even-valued index ‘k’) and On,k (for the odd-valued index ‘k’), as defined in Table 4.2, where the cosinusoidal and sinusoidal terms referred to M in the table, CM n,k and Sn,k , are as given by Eqs. 4.41 and 4.42, respectively. Through the use of such operators and terms, it can be shown how the same set of arithmetic operations may be carried out upon the input data set for every instance of the double-sized butterfly, despite the fact that for certain of the Type-I and Type-II Table 4.1 Address permutations for generic double butterfly

Table 4.1 Address permutations for generic double butterfly

Input address           0   1   2   3   4   5   6   7
Φ1: type = I, II        0   1   2   6   4   5   3   7
Φ1: type = III          0   1   2   3   4   5   6   7
Φ2: type = I, II        0   4   3   2   1   5   6   7
Φ2: type = III          0   4   2   6   1   5   3   7
Φ3: type = I, II        0   4   1   5   2   6   3   7
Φ3: type = III          0   4   1   3   2   6   7   5
Φ4: type = I, II, III   0   4   1   5   6   2   3   7

Table 4.2 Trigonometric coefficients for generic double butterfly (writing Cn for C^M_{n,k} and Sn for S^M_{n,k})

Index m                 0    1    2    3    4    5    6      7
E^M_{m,k}: type = I     1    0    1    0    1    0    1      0
E^M_{m,k}: type = II    1    0    1    0    1    0    1/√2   1/√2
E^M_{m,k}: type = III   1    0    C1   S1   C2   S2   C3     S3
O^M_{m,k}: type = I     0    1    0    1    0    1    0      1
O^M_{m,k}: type = II    0    1    0    1    0    1    1/√2   1/√2
O^M_{m,k}: type = III   0    1    S1   C1   S2   C2   S3     C3


This holds despite the fact that, for certain of the Type-I and Type-II cases, the values of the set of trigonometric coefficients suggest that the multiplications are trivial and thus avoidable – that is, that one or more of the trigonometric coefficients belong to the set {−1, 0, +1}. The even-valued and odd-valued indices for the addressing of the input data to the double-sized butterfly are both ‘arithmetic sequences’ and are consequently generated very simply via the pseudo-code of Fig. 4.2, with the associated double-sized butterfly – referred to hereafter as the generic double butterfly [5, 6] and abbreviated to GD-BFLY – being expressed via the pseudo-code of Fig. 4.3.

if (i == 0)
{
   // Set up 1st even and odd data indices for Type-I double butterfly.
   twice_offset = offset & index_even[0] = j & index_odd[0] = j + 4;
   // Set up address permutations for Type-I double butterfly.
   Φn = Φn^(I,II),  n = 1, 2, 3, 4
}
else
{
   twice_offset = 2 × offset;
   if (k == 0)
   {
      // Set up 1st even and odd data indices for Type-II double butterfly.
      index_even[0] = j & index_odd[0] = j + offset;
      // Set up address permutations for Type-II double butterfly.
      Φn = Φn^(I,II),  n = 1, 2, 3, 4
   }
   else
   {
      // Set up 1st even and odd data indices for Type-III double butterfly.
      index_even[0] = j + k & index_odd[0] = j + twice_offset − k;
      // Set up address permutations for Type-III double butterfly.
      Φn = Φn^(III),  n = 1, 2, 3, 4
   }
}
// Set up remaining even and odd data indices for double butterfly.
for (n = 1; n < 4; n = n+1)
{
   index_even[n] = index_even[n-1] + twice_offset;
   index_odd[n] = index_odd[n-1] + twice_offset;
}

Fig. 4.2 Pseudo-code for generation of data indices and address permutations

// Set up input data vector.
for (n = 0; n < 4; n = n+1)
{
   X[2n] = X^(in)[index_even[n]] & X[2n+1] = X^(in)[index_odd[n]];
}
// Apply 1st address permutation.
Y = P^T_Φ1 · X
// Apply trigonometric coefficients and 1st set of additions/subtractions.
for (n = 1; n < 4; n = n+1)
{
   store = E^M_{2n,k} × Y[2n] + E^M_{2n+1,k} × Y[2n+1];
   Y[2n+1] = O^M_{2n,k} × Y[2n] − O^M_{2n+1,k} × Y[2n+1];
   Y[2n] = store;
}
// Apply 2nd address permutation.
X = P^T_Φ2 · Y
// Apply 2nd set of additions/subtractions.
for (n = 0; n < 4; n = n+1)
{
   store = X[2n] + X[2n+1] & X[2n+1] = X[2n] − X[2n+1] & X[2n] = store;
}
// Apply 3rd address permutation.
Y = P^T_Φ3 · X
// Apply 3rd set of additions/subtractions.
for (n = 0; n < 4; n = n+1)
{
   store = Y[2n] + Y[2n+1] & Y[2n+1] = Y[2n] − Y[2n+1] & Y[2n] = store;
}
// Apply 4th address permutation.
X = P^T_Φ4 · Y
// Set up output data vector.
for (n = 0; n < 4; n = n+1)
{
   X^(out)[index_even[n]] = X[2n] & X^(out)[index_odd[n]] = X[2n+1];
}

Fig. 4.3 Pseudo-code for carrying out generic double butterfly

The address permutations are dependent only upon the ‘type’ of GD-BFLY being executed, with just two slightly different versions being required for each of the first three permutations, and only one for the last permutation. The two versions of Φ1 differ in just two (of the eight possible) exchanges, whilst the two versions of Φ2 and Φ3 each differ in just three (of the eight possible) exchanges, as evidenced from the contents of Table 4.1. The trigonometric coefficients, which as stated above include the trivial constants belonging to the set {−1, 0, +1}, are dependent also upon the value of the


parameter ‘k’ corresponding to the innermost ‘for’ loop of the pseudo-code of Fig. 4.1.

An elegant and informative way of representing the four permutation mappings may be achieved by noting from the group-theoretic properties of the ‘symmetric group’ [1] – which for order N is the set of all permutations of N objects – that any permutation can be expressed as a product of cyclic permutations and that each such cyclic permutation can in turn be simply expressed as a product of transpositions [1]. As shorthand for describing a permutation, a cyclic notation is first introduced in order to describe how the resolving of a given permutation into transpositions is achieved. With this notation, each element within parentheses is replaced by the element to its right, with the last element being replaced by the first element in the set – note that any element that replaces itself is omitted. Thus, the two versions of Φ1 may be expressed as

Φ1 = (3, 6)    (4.43)

and

Φ1 = (·),    (4.44)

the second version being the length-eight identity mapping; the two versions of Φ2 as

Φ2 = (1, 4)(2, 3) = (1, 4)(3, 2)    (4.45)

and

Φ2 = (1, 4)(3, 6);    (4.46)

the two versions of Φ3 as

Φ3 = (1, 4, 2)(3, 5, 6) = (2, 1, 4)(5, 6, 3) = (2, 1)(2, 4)(5, 6)(5, 3)    (4.47)

and

Φ3 = (1, 4, 2)(5, 6, 7) = (2, 1, 4)(5, 6, 7) = (2, 1)(2, 4)(5, 6)(5, 7);    (4.48)

and finally the single version of Φ4 as

Φ4 = (1, 4, 6, 3, 5, 2) = (3, 5, 2, 1, 4, 6) = (3, 5)(3, 2)(3, 1)(3, 4)(3, 6).    (4.49)
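To make the cyclic notation concrete, the following C fragment – illustrative only, and not part of the original design – builds Φ4 both from its cycle form and from its transposition form of Eq. 4.49 (applying the transpositions left to right), and confirms that both agree with the Φ4 row of Table 4.1, namely 0 4 1 5 6 2 3 7.

#include <stdio.h>

int main(void)
{
    const int cycle[6] = {1, 4, 6, 3, 5, 2};
    const int transp[5][2] = {{3,5},{3,2},{3,1},{3,4},{3,6}};
    int from_cycle[8], from_transp[8];

    for (int a = 0; a < 8; a++) {         /* identity unless moved below */
        from_cycle[a] = a;
        from_transp[a] = a;
    }
    for (int n = 0; n < 6; n++)           /* each element is replaced by */
        from_cycle[cycle[n]] = cycle[(n + 1) % 6]; /* its right neighbour */

    for (int a = 0; a < 8; a++) {         /* apply the transpositions    */
        int x = a;                        /* left-to-right to element a  */
        for (int n = 0; n < 5; n++) {
            if      (x == transp[n][0]) x = transp[n][1];
            else if (x == transp[n][1]) x = transp[n][0];
        }
        from_transp[a] = x;
    }
    for (int a = 0; a < 8; a++)           /* both columns print the same */
        printf("%d -> %d (%d)\n", a, from_cycle[a], from_transp[a]);
    return 0;
}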


From these compact representations, Eqs. 4.43–4.49 – which are equivalent to those given in tabular form in Table 4.1 – both the commonalities and the differences between the two versions of each permutation are straightforwardly visualized, with each pair being distinguished by means of a single transposition, whilst the common component (whether in terms of cyclic permutations or transpositions) is fixed and thus amenable to hardwiring. The ordering of the transpositions has been adjusted in the above expressions so as to minimize the associated communication lengths involved in the exchanges. For Φ1 the first version involves the application of a single transposition, involving addresses ‘3’ and ‘6’; for Φ2, the two versions differ only in the final transposition, involving the exchange of address ‘3’ with either address ‘2’ or address ‘6’; and for Φ3, they differ only in terms of the final transposition, involving the exchange of new address ‘5’ (original address ‘6’) with either address ‘3’ or address ‘7’. The ability to hardwire most of the required transpositions and to keep the associated communication lengths as short as possible has the additional attraction of minimizing the associated power consumption required for the exchanges – as will be discussed in Chap. 5.

Notice that as the combined effect of the first four trigonometric coefficients – corresponding to indices m = 0 and m = 1 in Table 4.2 – for every instance of the GD-BFLY is simply for the first two inputs to the GD-BFLY to pass directly through to the second permutation, the first four multiplications and the associated pair of additions may be simply removed from the SFG of Fig. 3.4 shown in Chap. 3, to yield the SFG shown below in Fig. 4.4, this being obtained at the cost of slightly reduced regularity, at the arithmetic level, within the GD-BFLY. This results in the need for just twelve real multiplications for the GD-BFLY, rather than sixteen, whose trigonometric coefficient multiplicands may be obtained, through symmetry relations, from just six stored trigonometric coefficients: two each – both cosinusoidal and sinusoidal – for the ‘single-angle’, ‘double-angle’ and ‘triple-angle’ cases. Also, the number of additions required prior to the second permutation reduces from eight to just six.

Thus, the three ‘types’ of GD-BFLY each map efficiently onto the same regular computational structure, this structure being represented by means of an SFG consisting of three stages of additive recursion, the first stage being preceded by a pointwise multiplication stage involving the trigonometric coefficients. Denoting the input and output data vectors to the GD-BFLY by X^(in) and X^(out), respectively, the operation of the GD-BFLY may thus be represented in a closed-form fashion by means of a multistage recursion, as given by the expression:

X^(out) = P^T_Φ4 · (A3 · (P^T_Φ3 · (A2 · (P^T_Φ2 · (A1 · M1 · (P^T_Φ1 · X^(in)))))))    (4.50)

where P_Φ1, P_Φ2, P_Φ3 and P_Φ4 are the butterfly-dependent permutation matrices [1] associated with the address permutations Φ1, Φ2, Φ3 and Φ4, respectively. Being orthogonal, whereby P_Φ · P^T_Φ = I_8 – the matrix version of the length-eight identity mapping – they may each be applied to either side of an equation, such that


[Fig. 4.4 Signal-flow graph for 12-multiplier version of generic double butterfly: the input data vector passes through address permutation Φ1, a pointwise multiplication stage involving the trigonometric coefficients, and three stages of adders/subtractors separated by address permutations Φ2, Φ3 and Φ4, to form the output data vector]

Y = P^T_Φ · X  ⟺  P_Φ · Y = X,    (4.51)

where the superscript ‘T’ denotes the transpose operator. The composite matrix A1·M1 is a butterfly-dependent 2 × 2 block diagonal matrix [1] containing the trigonometric coefficients (as defined from the contents of Table 4.2, with the first two terms fixed and equal to one), whilst A2 and A3 are fixed addition blocks, also expressed as 2 × 2 block diagonal matrices, such that

A1·M1 = diag( [+1 0; 0 +1], [+E2 +E3; +O2 −O3], [+E4 +E5; +O4 −O5], [+E6 +E7; +O6 −O7] ),    (4.52)

where each bracketed 2 × 2 block lies on the main diagonal and Em, Om are shorthand for E^M_{m,k}, O^M_{m,k}, and


A2 = A3 = diag( [+1 +1; +1 −1], [+1 +1; +1 −1], [+1 +1; +1 −1], [+1 +1; +1 −1] ).    (4.53)

Note that as long as each data set for input to the GD-BFLY is accompanied by an appropriately set ‘type’ flag – indicating whether the current instance of the GD-BFLY is of Type I, Type II or Type III – the correct versions of the first three permutators may be appropriately applied for any given instance of the GD-BFLY. The reformulated equations, which were obtained through the introduction of arithmetic redundancy into the processing, thus correspond to a double butterfly which overcomes, in an elegant fashion, the loss of regularity associated with more conventional fixed-radix formulations of the FHT. The resulting radix-4 algorithm is referred to hereafter as the regularized FHT and abbreviated to R24 FHT [5, 6], where the ‘R24’ part of the expression is short for ‘regularized radix-4’.
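As a concrete illustration of Eq. 4.50 and the pseudo-code of Fig. 4.3, a C sketch of one GD-BFLY instance follows; it reproduces the twelve multiplications and twenty-two additions of the SFG. The permutation rows phi1–phi4 are assumed to come from Table 4.1 and the coefficient vectors e[], o[] from Table 4.2 for the current butterfly type; the index convention y[m] = x[phi[m]] for applying a permutation is an assumption of this sketch rather than a definitive reading of the matrix P^T_Φ.

static void apply_perm(double *y, const double *x, const int *phi)
{
    for (int m = 0; m < 8; m++) y[m] = x[phi[m]];   /* assumed convention */
}

void gd_bfly(double x[8], const int *phi1, const int *phi2,
             const int *phi3, const int *phi4,
             const double e[8], const double o[8])
{
    double y[8], t;

    apply_perm(y, x, phi1);                       /* 1st permutation      */
    for (int n = 1; n < 4; n++) {                 /* coefficients + adds  */
        t        = e[2*n] * y[2*n] + e[2*n+1] * y[2*n+1];
        y[2*n+1] = o[2*n] * y[2*n] - o[2*n+1] * y[2*n+1];
        y[2*n]   = t;                             /* first pair passes    */
    }                                             /* straight through     */
    apply_perm(x, y, phi2);                       /* 2nd permutation      */
    for (int n = 0; n < 4; n++) {                 /* 2nd adds/subtracts   */
        t = x[2*n] + x[2*n+1];  x[2*n+1] = x[2*n] - x[2*n+1];  x[2*n] = t;
    }
    apply_perm(y, x, phi3);                       /* 3rd permutation      */
    for (int n = 0; n < 4; n++) {                 /* 3rd adds/subtracts   */
        t = y[2*n] + y[2*n+1];  y[2*n+1] = y[2*n] - y[2*n+1];  y[2*n] = t;
    }
    apply_perm(x, y, phi4);                       /* 4th permutation      */
}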

4.5.1 Twelve-Multiplier Version of Generic Double Butterfly

As evidenced from the SFG of Fig. 4.4, the GD-BFLY described above requires a total of twelve real multiplications and twenty-two real additions, whilst the effect of the permutators regarding a parallel solution is to reduce the communication topology to that of ‘nearest neighbour’ for input to both the adders and the multipliers, with the data entering/leaving the arithmetic components in consecutive pairs. The only change to the operation of the GD-BFLY, from one instance to another, is in terms of the definitions of the first three address permutations, with one of two slightly different versions being appropriately selected for each such permutation according to the particular ‘type’ of the GD-BFLY being executed – see the permutation definitions of Table 4.1. As a consequence, each instance of the twelve-multiplier version of the GD-BFLY may be carried out using precisely the same components and represented by means of precisely the same SFG.

4.5.2 Nine-Multiplier Version of Generic Double Butterfly

A version of the above GD-BFLY with lower arithmetic complexity may be achieved by noting that each block of four multipliers and its associated two adders corresponds to the solution of a pair of bilinear forms [12], which can be optimally solved, in terms of multiplications, with just three multipliers – see the corresponding section of the SFG for the standard Type-III GD-BFLY in Fig. 4.5. This complexity reduction is achieved at the expense of three extra adders for the GD-BFLY and six extra adders for the generation of the trigonometric coefficients. The complete SFG for the resulting reduced-complexity solution is as shown in Fig. 4.6, from which it can be seen that the GD-BFLY now requires a total of nine real multiplications and twenty-five real additions. As with the twelve-multiplier version, there are minor changes to the operation of the GD-BFLY, from one instance to another, in terms of the definitions of the first three address permutations, with one of two slightly different versions being appropriately selected for each such permutation according to the particular ‘type’ of the GD-BFLY being executed – see the permutation definitions of Table 4.1.

[Fig. 4.5 Reduced-complexity arithmetic block for a set of bilinear forms: the multiplication–addition block of the standard Type-III double butterfly, [ã; b̃] = [cos θ, sin θ; sin θ, −cos θ]·[a; b], realized with the three multiplicative constants c1 = cos θ + sin θ, c2 = cos θ and c3 = cos θ − sin θ]
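In C, the reduced-complexity block of Fig. 4.5 may be sketched as follows, with the three multiplicative constants assumed to be precomputed (at the cost of the two extra additions per block noted above); three multiplications and three additions replace the original four and two.

/* [a~; b~] = [cos t, sin t; sin t, -cos t].[a; b] with three multipliers,
   given c1 = cos t + sin t, c2 = cos t, c3 = cos t - sin t. */
void bilinear3(double a, double b, double c1, double c2, double c3,
               double *a_out, double *b_out)
{
    double t = c2 * (a + b);   /* cos.a + cos.b        */
    *a_out = t - c3 * b;       /* = cos.a + sin.b      */
    *b_out = c1 * a - t;       /* = sin.a - cos.b      */
}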

[Fig. 4.6 Signal-flow graph for nine-multiplier version of generic double butterfly: as Fig. 4.4, but with each four-multiplier/two-adder block replaced by the three-multiplier arrangement of Fig. 4.5 and with switchable (±) adder pairs directly following the multipliers]

Additional minor changes are also required, however, to the operation of the stage of adders directly following the multipliers and to the ordering of the outputs from the resulting operations:

1. For the first set of three multipliers, if the GD-BFLY is of Type I or Type II, each of the two adders performs addition on its two inputs and the ordering of the two outputs is the same as that of the two inputs; if it is of Type III, each adder performs subtraction and the ordering of the two outputs is reversed.
2. For the second set of three multipliers, the behaviour is identical to that of the first set.
3. For the last set of three multipliers, if the GD-BFLY is of Type I, each adder performs addition and the output ordering is preserved; if it is of Type II or Type III, each adder performs subtraction and the output ordering is reversed.

Note that the reversal of each pair of outputs is straightforwardly achieved, as shown in Fig. 4.6, by means of a simple switch. As a consequence, each instance of the nine-multiplier version of the GD-BFLY may be carried out using precisely the same components and represented by means of precisely the same SFG.

4.6 Trigonometric Coefficient Storage, Retrieval and Generation

An efficient implementation of the R24 FHT invariably requires an efficient mechanism for the storage and retrieval of the trigonometric coefficients required for feeding into each instance of the GD-BFLY. The requirement, more exactly, is that six non-trivial coefficients be either retrieved from the PCM or suitably generated on-the-fly in order to be able to carry out the necessary processing for any given input data set. Referring to the definitions for the non-trivial cosinusoidal and sinusoidal terms, as given by Eqs. 4.41 and 4.42, respectively, if we put β = N/M, where the parameters M and N are as defined in the pseudo-code of Fig. 4.1, then

C^M_{n,k} = cos(2πnkβ/N) = C^N_{n,kβ},  for n = 1, 2, 3    (4.54)

and

S^M_{n,k} = sin(2πnkβ/N) = S^N_{n,kβ},  for n = 1, 2, 3,    (4.55)

enabling the terms to be straightforwardly addressed from suitably constructed LUTs via the parameters ‘n’, ‘k’ and ‘β’. The total size requirement of the LUT can be minimized by exploiting the relationship between the cosinusoidal and sinusoidal functions, as given by the expression

cos(x) = sin(x + π/2),    (4.56)

as well as the periodic nature of each, as given by the expressions

sin(x + 2π) = sin(x)    (4.57)

and

sin(x + π) = −sin(x).    (4.58)

Two schemes are now outlined which enable a simple trade-off to be made between memory size and addressing complexity – as measured in terms of the number of arithmetic/logic operations required for computing the necessary memory addresses.

4.6.1 Minimum-Arithmetic Addressing Scheme

As already stated, the trigonometric coefficient set comprises both cosinusoidal and sinusoidal terms for the single-angle, double-angle and triple-angle cases. To minimize the arithmetic/logic requirement for the generation of the addresses, the LUT is sized according to a single-quadrant addressing scheme, whereby the trigonometric coefficients are read from a sampled version of the sinusoidal function with argument defined from 0 up to π/2 radians. As a result, each LUT may be accessed by means of a single, easy-to-compute, input parameter which may be updated from one access to another via simple addition using a fixed increment – that is, the addresses form an arithmetic sequence. Thus, for the case of an N-point R24 FHT, it is required that the LUT be of length N/4, yielding a total PCM requirement, denoted C^{Aopt}_{MEM}, of

C^{Aopt}_{MEM} = N/4    (4.59)

words. This scheme would seem to offer, therefore, a reasonable compromise between the PCM requirement and the addressing complexity, using more than the theoretical minimum amount of memory required for the storage of the trigonometric coefficients so as to keep the arithmetic/logic requirement of the addressing as simple as possible.
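A C sketch of the single-quadrant addressing scheme is given below; it assumes, for simplicity, one extra LUT entry for the π/2 endpoint (so T holds N/4 + 1 words, T[j] = sin(2πj/N) for j = 0..N/4), and folds a full period onto the first quadrant using Eqs. 4.56–4.58.

/* sin(2.pi.i/N) from a quarter-wave table. */
double lut_sin(const double *T, long N, long i)
{
    int sign = 1;
    i %= N; if (i < 0) i += N;
    if (i >= N/2) { sign = -1; i -= N/2; }  /* sin(x + pi) = -sin(x)   */
    if (i >  N/4) i = N/2 - i;              /* sin(pi - x) = sin(x)    */
    return sign * T[i];
}

/* cos(2.pi.i/N) via the complementary angle. */
double lut_cos(const double *T, long N, long i)
{
    return lut_sin(T, N, i + N/4);          /* cos(x) = sin(x + pi/2)  */
}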

4.6.2 Minimum-Memory Addressing Scheme

Another approach to this problem is to adopt a two-level LUT, this comprising one ‘coarse-resolution’ region of length N/4L for the sinusoidal function, covering 0 up to π/2 radians, and one ‘fine-resolution’ region of length L for each of the cosinusoidal and sinusoidal functions, covering 0 up to π/2L radians. The required trigonometric coefficients may then be obtained from the contents of the two-level LUT through the application of one or other of the standard trigonometric identities:

cos(θ + φ) = cos(θ)·cos(φ) − sin(θ)·sin(φ)    (4.60)

and

sin(θ + φ) = sin(θ)·cos(φ) + cos(θ)·sin(φ),    (4.61)

where θ corresponds to the angle defined over the coarse-resolution region and φ to the angle defined over the fine-resolution region.


By expressing the combined size of the two-level LUT for the sinusoidal function as having to cater for

f(L) = N/4L + L    (4.62)

words, it can be seen that the optimum LUT length is obtained when the derivative

df/dL = 1 − N/4L²    (4.63)

is set to zero, giving L = √N/2 and resulting in a total PCM requirement, denoted C^{Mopt}_{MEM}, of

C^{Mopt}_{MEM} = (3/2)·√N    (4.64)

words – √N/2 for the coarse-resolution region (as required by the sinusoidal function) and √N/2 for each of the two fine-resolution regions (as required by both the cosinusoidal and sinusoidal functions). This scheme therefore yields the theoretical minimum memory requirement for the storage of the trigonometric coefficients at the expense of an increased arithmetic/logic requirement for the associated addressing.

The two-level LUT will actually be regarded hereafter as consisting of three separate complementary-angle LUTs, each of length √N/2, rather than as a single LUT, as all three may need to be accessed simultaneously if an efficient parallel solution to the R24 FHT is to be achieved when mapped onto parallel computing equipment. Although each LUT is accessed by means of a single input parameter, their computation is no longer straightforward, as each trigonometric coefficient is now made up from components taken from all three LUTs, so that the corresponding input parameters to the three LUTs are dependent upon each other.
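The following C fragment illustrates the recombination of Eqs. 4.60 and 4.61; for clarity it holds four tables (coarse and fine, cosinusoidal and sinusoidal), whereas the scheme described above stores only three, the coarse cosinusoidal values being obtained from the coarse sinusoidal LUT via complementary-angle addressing. All names are illustrative.

typedef struct {
    const double *Cc, *Sc;   /* coarse: length N/(4L), angle step 2.pi.L/N */
    const double *Cf, *Sf;   /* fine:   length L,      angle step 2.pi/N  */
    long L;
} two_level_lut;

/* cos and sin of 2.pi.i/N for 0 <= i < N/4, split as i = q.L + r. */
void lut2_cos_sin(const two_level_lut *t, long i, double *c, double *s)
{
    long q = i / t->L, r = i % t->L;
    *c = t->Cc[q] * t->Cf[r] - t->Sc[q] * t->Sf[r];  /* Eq. 4.60 */
    *s = t->Sc[q] * t->Cf[r] + t->Cc[q] * t->Sf[r];  /* Eq. 4.61 */
}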

4.6.3 Trigonometric Coefficient Generation via Trigonometric Identities

With both of the storage schemes discussed above, after deriving the single-angle trigonometric coefficients from the respective LUT(s), there is the option for the double-angle and triple-angle trigonometric coefficients to be obtained directly from the single-angle trigonometric coefficients through the application of the standard trigonometric identities:

cos(2θ) = 2·cos²(θ) − 1    (4.65)

sin(2θ) = 2·sin(θ)·cos(θ)    (4.66)

cos(3θ) = (2·cos(2θ) − 1)·cos(θ)    (4.67)

and

sin(3θ) = (2·cos(2θ) + 1)·sin(θ),    (4.68)

respectively, or alternatively, through the replication of the respective LUT(s) for each of the double-angle and triple-angle cases in order to reduce the associated time-complexity at the expense of increased space-complexity. The schemes covered in this section for the storage, retrieval and generation of the trigonometric coefficients will be discussed further in Chap. 6 in connection with the development of conflict-free parallel memory addressing schemes for both the data (for which the addressing is also in-place) and the trigonometric coefficients, as required for the parallel operation of the GD-BFLY and thus of the R24 FHT when implemented using the silicon-based parallel computing technologies to be introduced in Chap. 5.
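In C, the on-the-fly expansion of Eqs. 4.65–4.68 from a single stored cos/sin pair might look as follows:

/* Double- and triple-angle coefficients from the single-angle pair. */
void expand_angles(double c1, double s1,             /* cos/sin(theta)  */
                   double *c2, double *s2,           /* cos/sin(2theta) */
                   double *c3, double *s3)           /* cos/sin(3theta) */
{
    *c2 = 2.0 * c1 * c1 - 1.0;        /* Eq. 4.65 */
    *s2 = 2.0 * s1 * c1;              /* Eq. 4.66 */
    *c3 = (2.0 * (*c2) - 1.0) * c1;   /* Eq. 4.67 */
    *s3 = (2.0 * (*c2) + 1.0) * s1;   /* Eq. 4.68 */
}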

4.7 Comparative Complexity Analysis with Existing FFT Designs

This chapter has concerned itself with the detailed derivation of the GD-BFLY, the computational engine required for a regularized version of the DIT formulation of the radix-4 FHT, referred to as the R24 FHT, the initial intention being to use the resulting algorithm for the efficient parallel computation of both the DHT and the DFT for the case of 1-D real-valued data, although later to be extended, in Chap. 11, to that of m-D real-valued data.

For most applications, the real-data DFT is still generally solved with a real-from-complex strategy, as discussed in some detail in Chap. 2, whereby an N-point complex-data FFT simultaneously computes the outputs of two N-point real-data DFTs, or where the output of an N-point real-data DFT is obtained from the output of one N/2-point complex-data FFT. Such approaches, however, are adopted at the possible expense of increased memory, increased processing delay to allow for the acquisition/processing of pairs of data sets and additional packing/unpacking complexity. The class of specialized real-data FFTs discussed in Chap. 2 is also commonly used, and although these algorithms compare favourably, in terms of operation counts and memory requirement, with those of the FHT, they suffer in terms of a loss of regularity and reduced flexibility in that different algorithms are often required for the computation of the forward and the inverse DFT.


Table 4.3 Algorithmic comparison for real-data and complex-data FFT designs

                                             Complex-data   Real-data      Standard       Regularized
                                             N-point FFT    N-point FFT    N-point FHT    N-point FHT
Design regularity                            High           Low            Low            High
No. of butterfly designs required            1              1              2              1
Parallelization                              High           Low            Low            High
Arithmetic domain                            Complex field  Complex field  Real field     Real field
Arithmetic complexity (no. of operations)    O(N·log4 N)    O(N·log4 N)    O(N·log4 N)    O(N·log4 N)
Time complexity (no. of clock cycles)        O(N·log4 N)    O(N·log4 N)    O(N·log4 N)    O(N·log4 N)
Data memory for N-point real-data DFT        2N             N              N              N
Data memory for N-point complex-data DFT     2N             –              N              N
Pin count for N-point real-data DFT          2 × 2N         2N             2N             2N
Pin count for N-point complex-data DFT       2 × 2N         –              2N             2N
Processing delay for N-point real-data DFT   2D             D              D              D
Applicable to forward & inverse DFTs         Yes            No             Yes            Yes
Additive complexity for unpacking of
  N-point real-data DFT                      N              –              N              N
Additive complexity for unpacking of
  N-point complex-data DFT                   –              –              4N             4N

The performance of the R24 FHT is therefore compared very briefly with those of the real-data and complex-data FFTs, as described in Chap. 2, together with the conventional or non-regularized FHT [2, 3]. The performance is evaluated for the computation of both real-data and complex-data DFTs, where the application of the FHT to complex-valued data is achieved very simply by processing separately the real and imaginary components of the data and additively combining the outputs to yield the required complex-data DFT output – this was discussed in some detail in Sect. 3.4 of Chap. 3. The results are summarized in Table 4.3, where a single-PE recursive architecture is adopted for each solution such that the PE is assumed to be able to produce all the outputs for a single instance of the respective butterfly (remember that there are two types of butterfly required for the computation of the conventional or non-regularized FHT) simultaneously via the exploitation of fine-grained parallelism at the arithmetic


level – such architectural considerations are to be discussed later in some depth in Chaps. 5 and 6. Such a performance may prove difficult (if not impossible) to attain for some approaches, however, as the third row of the table suggests that neither the N-point real-data FFT nor the conventional or non-regularized FHT lends itself particularly well to parallelization.

However, as can be seen from the table, the regularity and simplicity of the design and the bilateral nature of the algorithm make the R24 FHT an attractive solution when compared to the class of real-data FFTs, whilst the reduced processing delay (for the real-data case) and reduced data memory and pin-count requirement (for both the real-data and the complex-data cases) offer additional advantages over the conventional complex-data FFT approach. The low memory requirement of the R24 FHT approach is particularly relevant for applications involving large transform lengths, as might be the case, for example, with many wide-bandwidth channelization problems.

Summarizing the results, the regularity of the design, combined with the ease of parallelization, nearest-neighbour communication topology at the arithmetic component level (as effected by the permutators) for a parallel solution, simplicity of the arithmetic components, optimum processing delay, and low pin-count and memory requirements, makes the R24 FHT an extremely attractive candidate to pursue for possible realization in hardware with parallel computing equipment. The arithmetic (no. of operations) and time (no. of clock cycles) complexities are shown to be of the same order for each solution considered, with the arithmetic requirement of the GD-BFLY being actually equivalent to that achievable for the butterfly of an optimally designed complex-data radix-4 FFT algorithm [9], widely considered the most computationally attractive of all fixed-radix butterflies.

4.8 Scaling Considerations for Fixed-Point Implementation

For a fixed-point implementation of the R24 FHT, as is the case of interest in this monograph, the registers available for holding the trigonometric coefficients and the data are of fixed length, whilst the register used for holding the outputs from the arithmetic operations (namely, the accumulator), although also of fixed length, is generally longer than those used for holding the trigonometric coefficients and the data. This additional length for the accumulator is to prevent the unnecessary loss of accuracy from the rounding of the results following the arithmetic operations, as the multiplication of a K-bit word and an L-bit word yields a (K + L)-bit result, whilst the addition of two L-bit words yields an (L + 1)-bit result. When the trigonometric coefficients are each less than or equal to one in magnitude, however, as they are for the R24 FHT, each multiplication will introduce no word growth, whereas the addition of any two terms following the multiplication stage may produce word growth of one bit.


The maximum growth in magnitude through the GD-BFLY occurs when all the input samples possess equal magnitude and the rotation angle associated with the trigonometric coefficients is π/4, the magnitude then growing by a factor of up to 1 + 3√2 ≈ 5.24. If the data register is fully occupied, this will result in three bits of overflow. To prevent this, an ‘unconditional’ scaling strategy could be applied whereby the data are right-shifted by three bits prior to each stage of GD-BFLYs. However, apart from reducing the dynamic range of the data, such scaling introduces truncation error if the discarded bits are non-zero. The possibility of overflow would therefore be eliminated at the cost of unnecessary shifting of the data and a potentially large loss of accuracy through the discarding of the lower-order bits.

A more accurate approach would be to adopt a ‘conditional’ scaling strategy, namely, the block floating-point technique [9], whereby the data are shifted only when overflow would otherwise occur. The block floating-point mechanism comprises two parts. The output part calculates the maximum magnitude of the output data for the current stage of GD-BFLYs, from which a scaling factor is derived as a reference value for the input scaling of the next stage of GD-BFLYs. The input part receives the scaling factor generated by the previous stage of GD-BFLYs, so that the number of bits to be right-shifted for the current input data set will be based upon the scaling factor provided. Therefore, the data overflow and the precision of the integer operations are controlled automatically by the block floating-point mechanism, which provides information not only for the word growth of the current stage of GD-BFLYs but also for the word growth of all the previous stages. Such scaling, however, is far more complex to implement than unconditional scaling.

An alternative to the above two approaches is to allow the data registers to possess a limited number of guard bits to cater for some or all of the word growth, such that the scaling strategy need only cater for limited word growth, rather than for the worst case. The performance of such a scheme, however, as with that of unconditional scaling, will always be suboptimal – in terms of accuracy and dynamic range – when compared to that achievable by the conditional block floating-point scheme.
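A C sketch of the two-part block floating-point mechanism described above is given below; the word length, guard-bit allowance and names are illustrative assumptions, not design figures from this monograph.

#include <stdint.h>
#include <stdlib.h>

#define DATA_BITS 18                       /* assumed register length */

/* Output part: find the block maximum for the current stage and derive
   the right-shift to be applied to the next stage's inputs, keeping
   three bits of headroom for the worst-case growth of 1 + 3.sqrt(2). */
int bfp_output_scale(const int32_t *x, int n)
{
    int32_t peak = 0;
    for (int i = 0; i < n; i++) {
        int32_t m = labs(x[i]);
        if (m > peak) peak = m;
    }
    int shift = 0;
    while ((peak >> shift) >= (1L << (DATA_BITS - 4)))
        shift++;
    return shift;
}

/* Input part: apply the shift received from the previous stage and track
   the accumulated block exponent. */
void bfp_input_scale(int32_t *x, int n, int shift, int *block_exp)
{
    for (int i = 0; i < n; i++) x[i] >>= shift;  /* truncation occurs here */
    *block_exp += shift;
}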

4.9 Discussion

To summarize the situation so far, a new formulation of the radix-4 FHT has been derived, referred to as the regularized FHT or R24 FHT, whereby the major limitation of existing fixed-radix FHT designs, namely, the lack of regularity arising from the need for two sizes of butterfly – and thus for two separate butterfly designs – has been overcome. It remains now to see how easily the resulting algorithmic structure lends itself to mapping onto parallel computing equipment, bearing in mind that the requirement is to derive a resource-efficient solution for power-constrained applications, as typified by that of mobile communications, where parallelism will need to


be fully and efficiently exploited in order that the required throughput rates are attained. This issue is to be pursued in some detail in Chap. 6, where the silicon-based technology of the FPGA – to be discussed, together with that of the ASIC, in Chap. 5 – will provide the target computing device. There is reason to be optimistic in the endeavour in that the large size of the GD-BFLY, which results in it being able to produce eight outputs from eight inputs, offers the promise of an eightfold speed-up with parallel computing equipment over that achievable via a purely sequential solution, whilst the arithmetic requirement of the GD-BFLY, as indicated from its SFG, suggests that it should lend itself naturally to internal pipelining, whereby each stage of the computational pipeline would be made up from various combinations of the arithmetic components (namely, the adders and fast multipliers) and permutators of which the GD-BFLY is composed.

Note that the radix-4 butterfly used for the standard formulation of the radix-4 FFT is sometimes referred to in the technical literature as a ‘dragonfly’, rather than a butterfly, due to its resemblance to the said insect – a radix-8 butterfly may also be referred to as a ‘spider’, for the same reason, although arachnophobic tendencies amongst the scientific community might prevent such an acceptance!

Finally, it should be noted that the property of ‘symmetry’ [10, 11] has been exploited not only to minimize the number of arithmetic operations required by both FFT and FHT algorithms, through the regular nature of the respective decompositions, but also to minimize the memory requirement through the periodic and symmetrical nature of the fundamental trigonometric function from which the associated transform kernels are derived, namely, the sinusoidal function. The basic properties of this function – together with those of its complementary function, the cosinusoid – are as described by Eqs. 4.56–4.58 given earlier in the chapter, with the sinusoid being an even-symmetric function relative to any odd-integer multiple of the argument π/2 and an odd-symmetric function relative to any even-integer multiple of π/2, whilst the cosinusoid is an even-symmetric function relative to any even-integer multiple of the argument π/2 and an odd-symmetric function relative to any odd-integer multiple of π/2. That is, they are each either even-symmetric or odd-symmetric according to whether the axis of symmetry is an appropriately chosen multiple of π/2.

References

1. G. Birkhoff, S. MacLane, A Survey of Modern Algebra (Macmillan, 1977)
2. R.N. Bracewell, The fast Hartley transform. Proc. IEEE 72(8) (August 1984)
3. R.N. Bracewell, The Hartley Transform (Oxford University Press, 1986)
4. D.F. Elliott, K. Ramamohan Rao, Fast Transforms: Algorithms, Analyses, Applications (Academic, 1982)
5. K.J. Jones, Design and parallel computation of regularised fast Hartley transform. IEE Proc. Vis. Image Signal Process. 153(1), 70–78 (February 2006)
6. K.J. Jones, The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments, Series on Signals & Communication Technology (Springer, 2010)
7. L. Kronsjo, Computational Complexity of Sequential and Parallel Algorithms (Wiley, 1985)
8. Y. Li, Z. Wang, J. Ruan, K. Dai, A low-power globally synchronous locally asynchronous FFT processor. HPCC 2007, LNCS 4782, 168–179 (2007)
9. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice-Hall, 1975)
10. I. Stewart, Why Beauty Is Truth: A History of Symmetry (Basic Books, 2007)
11. H. Weyl, Symmetry (Princeton Science Library, 1989)
12. S. Winograd, Arithmetic Complexity of Computations (SIAM, 1980)
13. A. Zakhor, A.V. Oppenheim, Quantization errors in the computation of the discrete Hartley transform. IEEE Trans. ASSP 35(11), 1592–1602 (1987)

Chapter 5

Design Strategy for Silicon-Based Implementation of Regularized Fast Hartley Transform

5.1 Introduction

The type of high-performance parallel computing equipment typified by the increasingly powerful silicon-based FPGA and ASIC technologies [2] now gives design engineers far greater flexibility and control over the type of algorithm that may be used in the building of high-performance DSP systems, so that more appropriate hardware solutions to the problem of computing the FHT, and thus of computing the real-data FFT, may be actively sought and exploited to some advantage with these silicon-based technologies. With such technologies, however, it is no longer adequate to view the FHT/FFT complexity purely in terms of arithmetic operation counts, as has conventionally been done, as there is now the facility to use both multiple arithmetic units – typically in the form of fast multipliers and adders – and multiple banks of fast random-access memory (RAM) in order to enhance the FHT/FFT performance via its parallel computation. As a result, a whole new set of constraints has arisen relating to the design of efficient FHT/FFT algorithms.

With the recent and explosive growth of wireless technology – and in particular that of mobile communications, where a small battery may be the only source of power supply for long periods of time – algorithms are now being designed subject to new and often conflicting performance criteria, whereby the ideal is to optimize or maximize the throughput given the constraint of the data set refresh rate, which dictates that the algorithm cannot process the data faster than it is generated. At the same time, there is a requirement to minimize the silicon resources used (and thereby minimize the implementation cost) whilst keeping the power consumption to within the available power budget.

To be able to produce such a solution for the FHT, however, it is first necessary to identify the relevant key parameters [2] – namely, the clock frequency, silicon area and switching frequency – involved in the design process and then to outline the trade-offs that need to be made between them. The optimization of this parameter set leads to questions relating to the most appropriate choice of computing architecture,


in terms of whether one or multiple PEs (where a PE, for our purposes, is the processor assigned the task of carrying out the repetitive computations required of the GD-BFLY, as introduced in Chap. 4) should be used – and thus the merits of different pipelining strategies for the parallelization of the solution and, also, the benefits of exploiting multiple banks of memory, referred to hereafter as ‘partitioned’ memory, for the efficient storage and retrieval of the data and the trigonometric coefficients. The question of scalability of the proposed design is also considered, as this property, combined with that of flexibility (in terms of arithmetic precision and/or the ability to optimize the use of the various embedded resources), will help maintain performance when costly embedded resources, such as fast multipliers and fast memory, are scarce. A scalable solution will also help to minimize the re-design effort and costs when dealing with new applications. The ultimate aim of the design process, for those resource-constrained applications with requirements typified by that of mobile communications, is to obtain a solution that optimizes the use of the available silicon resources on the target computing device whilst keeping the associated power consumption to within the available power budget and, in so doing, maximizing the achievable computational density – namely, the throughput per unit area of silicon. The two possible device technologies, the FPGA and the ASIC, are now considered in some detail before moving on to a discussion of the key design issues.

5.2 The Fundamental Properties of FPGA and ASIC Devices

An FPGA device [2] is an integrated circuit made up of a number of programmable or configurable blocks of logic (CLBs), together with configurable interconnections between them, which provide the user with both logic and storage capabilities. Each CLB is made up of a number of ‘slices’ of logic, where each slice consists of a number of LUTs, each typically with three or four inputs, together with a number of flip-flops. To implement a 4-input LUT, for example, which can perform any 4-input Boolean function, 2⁴ or 16 bits of memory are needed, which can be configured as a 16 × 1-bit synchronous RAM, more commonly referred to as distributed RAM. An attraction of these devices is that they are able to be configured or programmed by DSP design engineers to perform a wide variety of signal processing tasks, with most modern FPGAs offering the additional facility that they may be repeatedly reprogrammed. Thus, the main attractions of the FPGA – which is often referred to as being ‘coarse-grained’ because it is physically realized using higher-level blocks of programmable logic – are at the ‘system’ level, where it offers both flexibility and cost-effectiveness for low-volume price-sensitive applications.

An ASIC device [2], on the other hand, is custom-designed to address a specific application and as such is able to offer the ultimate solution in terms of size (as expressed by the number of transistors), complexity and performance, where


the performance is typically measured in terms of computational density. Designing and building an ASIC is an extremely time-consuming and expensive process, however, with the added disadvantage that the final design is frozen in silicon [2] and cannot be modified without creating, at some expense, a new version of the device. Thus, the main attractions of the ASIC – which is often referred to as being ‘fine-grained’ because ultimately it is implemented at the level of the primitive logic gates – are at the ‘circuit’ level, where it offers reduced delay, area and power consumption, which will always yield optimum performance when measured in terms of computational density.

To overcome some of the apparent deficiencies of the FPGA, at least when compared to the ASIC, manufacturers have looked to enhance its capabilities and competitiveness by providing the design engineer with access to embedded resources, such as fast multipliers and banks of fast RAM with dedicated arithmetic routing, which are considerably smaller, faster and more power-efficient than when implemented in programmable logic. More recent devices may even offer floating-point and/or CORDIC arithmetic units as embedded resources for those applications where the performance achieved via the use of the standard fixed-point multiplier might be found wanting. These features, when coupled with the massive parallelism on offer, enable the FPGA to outperform the fastest of the conventional single-processor DSP devices by two or even three orders of magnitude.

The cost of an FPGA, as one would expect, is much lower than that of an ASIC. At the same time, implementing design changes is also much easier, with the time to market for such designs being therefore considerably shorter than that required by an ASIC. This means that the FPGA allows the design engineer to realize software and hardware concepts on an FPGA-based test platform without having to incur the enormous costs associated with ASIC designs. As a result, high-performance DSP designs – such as for the FHT and the real-data FFT considered in this monograph – even when ultimately targeted at an ASIC implementation, will generally, for reasons of ease, time and cost, be developed and tested on an FPGA before being mapped onto the target device.

Thus, for the analysis carried out in this monograph, emphasis is placed on implementations of the R24 FHT using FPGA technology where, for the particular implementation to be discussed in Chap. 6, the target device family is the Virtex-II Pro, as produced by Xilinx Inc. [6] of the USA. Although this family is now somewhat old (but very popular at the time of writing of the original edition of this monograph), its use is intended only to facilitate comparison between the two main types of computing architecture available for the R24 FHT's parallel computation, namely:

1. The single-PE architecture, which looks to exploit fine-grained parallelism and lends itself naturally to ‘block-based’ operation, whereby all the data samples must first be generated and stored before they can be processed
2. The multi-PE architecture, which looks to exploit coarse-grained parallelism and lends itself naturally to ‘streaming’ operation, whereby the data samples are processed as soon as they arrive at the first PE in the computational pipeline

5.3 Low-Power Design Techniques

Over the past couple of decades, power consumption has grown from a secondary to a major constraint in the design of hardware-based DSP solutions. In portable applications, such as that encountered with mobile communications, low power consumption has long been the main design constraint, due in part to the increasing cost of cooling and packaging, but also to the resulting rise in on-chip temperature, which in turn results in reduced reliability. The result is that the identification and application of low-power techniques, at both the arithmetic and the algorithmic or system levels, are crucial to the specification of an achievable design in silicon that meets the required power-related set of performance objectives.

The power consumption associated with the silicon-based implementation of a high-performance DSP algorithm, such as the R24 FHT, comprises both ‘static’ (or quiescent) and ‘dynamic’ components. The dynamic component has until recently dominated the total power consumption, although as the devices become ever bigger and ever more powerful, the contribution of the static component to the total power consumption is becoming increasingly significant. Given our hardware-efficient objectives, however, we restrict our attention here to the dynamic component, denoted P_D, which may be expressed as

P_D = C × V² × f,    (5.1)

where ‘C’ is the capacitance of the node switching, ‘V’ is the supply voltage and ‘f’ is the switching frequency. This dynamic component is primarily driven by the clock frequency of the device, the silicon area required for its implementation (which is determined primarily by the size of the arithmetic unit, the total memory requirement and the data routing) and the average switching rate of the individual circuits over each clock cycle (where the clock cycle is inversely related to the clock frequency, as discussed next). These items are now discussed in more detail in order that a suitable design strategy might be identified and pursued.
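As a numerical illustration of Eq. 5.1 – with all figures invented for the purpose – the following fragment compares one PE driven at a high clock frequency against eight PEs driven at a much lower one; at a fixed supply voltage the total dynamic power would be unchanged, so the gain comes from the voltage scaling that the lower clock frequency permits.

#include <stdio.h>

int main(void)
{
    double C = 1.0e-9;                           /* 1 nF switched capacitance
                                                    per PE (invented)        */
    double pd_fast = C * 1.2 * 1.2 * 400e6;      /* one PE: 400 MHz, 1.2 V   */
    double pd_par  = 8 * (C * 0.9 * 0.9 * 50e6); /* eight PEs: 50 MHz, 0.9 V */
    printf("single PE: %.3f W, eight PEs: %.3f W\n", pd_fast, pd_par);
    return 0;                                    /* 0.576 W versus 0.324 W   */
}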

5.3.1 Clock Frequency

To achieve high throughput with a hardware-based solution to the DSP-based problem of interest, the clock frequency is typically traded off against parallelism, with the choice of solution ranging from that based upon the use of a single processor, driven to a potentially very high clock frequency, to that based upon the use of multiple processors, typically operating simultaneously or concurrently via suitably defined parallel processing techniques, in order to achieve the required performance but with a potentially much reduced clock frequency. For the particular problem of interest in this monograph, the parallelism can be exploited at the local arithmetic level, in terms of the fine-grained parallelism of the GD-BFLY, and/or at


the global algorithmic level, in terms of the coarse-grained parallelism of the resulting R24 FHT algorithm, with pipelining techniques being the most attractive and power-efficient means of achieving parallelism due to the resulting nearest-neighbour communication requirement.

Thus, given the strong dependence of the power consumption upon clock frequency, there is clearly great attraction in being able to keep the clock frequency as low as possible for the implementation of the R24 FHT, provided that the resulting solution is able to meet the required set of performance objectives. To achieve such a solution, however, it is necessary that an appropriate parallelization scheme be defined, as this would enable the problem to be reduced from one large task, driven with a high clock frequency, to a number of smaller tasks, each driven with a low clock frequency. Such a scheme might typically be based upon fine-grained and/or coarse-grained pipelining techniques, combined with SIMD processing within each stage of the computational pipeline(s), where the choice made will additionally impact upon the total silicon area requirement, as now discussed.

5.3.2 Silicon Area

Suppose that the transform to be computed, the R24 FHT, is of length N where

N = 4^α,    (5.2)

so that ‘α’, the radix exponent of the transform, given by log4N, represents the number of temporal stages required by the algorithm where each stage, as discussed in Chap. 4, involves the computation of N/8 GD-BFLYs. High-performance solutions may be obtained through coarse-grained pipelining of the algorithm via the adoption of an α-stage computational pipeline, as shown in Fig. 5.1, where each

[Fig. 5.1 Multi-PE pipelined architecture for regularized FHT: the input data passes through an α-stage pipeline of PEs (GD-BFLY for PE No 1, PE No 2, …, PE No α), each PE with its own PCM and PDM and successive PEs connected via HSM; each PE performs N/8 GD-BFLYs, with the output data emerging from the final PE]

[Fig. 5.2 Single-PE recursive architecture for regularized FHT: the input data enters the data memory, which, together with the trigonometric coefficient memory, feeds the GD-BFLY of the single parallel PE, looping through α × N/8 GD-BFLYs before the output data is produced]

stage of the pipeline is assigned its own PE and (for all but the last stage) its own block of double-buffered HSM for storing the partial-FHT outputs – note that with double-buffered memory, the functions performed on two equally sized regions of memory alternate with successive input data sets, with one region of memory being filled with new data whilst the data already stored in the other is being processed. But this means that the amount of silicon required by the R24 FHT will be both dependent upon and proportional to the length of the transform being computed, as is the case with most commercially available FFT designs. A solution based upon a pipelined multi-PE architecture such as this achieves an O(N) time-complexity at the cost of an O(log4 N) space-complexity, where space-complexity refers loosely to the total silicon area requirement and comprises an arithmetic component, consisting of the required numbers of fast multipliers and adders, together with a memory component, consisting of the various memories required for the storage of both the data and the trigonometric coefficients.

Alternatively, through the adoption of a single-PE architecture, as shown in Fig. 5.2, high performance may be achieved for the R24 FHT via parallelization of the PE's internal arithmetic, as achieved with fine-grained pipelining and the simultaneous execution, via SIMD processing, of the multiple arithmetic operations to be performed within each stage of the computational pipeline. However, the success of this scheme relies heavily, if adequate throughput is to be achieved, upon the adoption of an appropriate storage scheme for the data and the trigonometric coefficients. This may be achieved through the efficient use of partitioned memory, so that multiple data samples may be both retrieved and updated, and multiple trigonometric coefficients retrieved, from their respective memory banks, with each set of multiple reads/writes being performed in parallel. Optimal efficiency also requires that the processing for each instance of the GD-BFLY be carried out in an in-place fashion so that the memory requirement may be kept to a minimum. When such a solution is possible, the result is area-efficient, with space-complexity – apart from the memory component – being independent of the length N of the transform being computed.


The corresponding time-complexity (denoting the latency or, equivalently in this case, the update time) is of O(N·log4 N), which leads to an approximate figure of (N/P)·log4 N clock cycles after taking into account the level of parallelism, P, as introduced via the adoption of the partitioned data memory – with a data set refresh rate of N samples every update period of N clock cycles, this ensures continuous real-time operation at the cost of just O(1) space-complexity. Thus, it is evident that the greater the area-efficiency, the lower the achievable throughput, as one would expect, so that the ultimate choice of computing architecture for the proposed solution will be very much dependent upon the ability of the solution to meet the timing constraint imposed upon the problem – namely, that the throughput rate of the solution should be able to keep up with the data set refresh rate – as will be discussed in some detail in Chap. 6.

Note that the single-PE architecture described in this section may also be regarded as being ‘recursive’ in the sense that the output from each of the first α − 1 temporal stages of processing is fed back as input to the succeeding stage, where the same set of computational components (as provided by the PE for carrying out the computations of the GD-BFLY) is used to perform the same set of operations on the input data to every stage. The multi-PE architecture is obtained when this α-fold recursion is unfolded, with the α-stage recursion becoming an α-stage computational pipeline, successive stages of the pipeline being connected by means of a double-buffered memory and the processing for each stage being carried out by means of its own PE.

Thus, the choice of computing architecture for the parallel computation of the R24 FHT reduces to that of a single-PE recursive architecture versus a multi-PE pipelined architecture, where for the case of a transform of length N = 4^α, the achievable time-complexity of the single-PE solution – in terms of update time – may be shown to be approximately α times that of the multi-PE solution, although with a commensurate saving in terms of the silicon resources required for the production of each new N-sample output data set.
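The timing constraint just described is easily checked numerically; the following fragment (illustrative values only) evaluates the approximate single-PE update time of (N/P)·log4 N cycles against the N-cycle refresh period, here sitting exactly on the boundary.

#include <stdio.h>

int main(void)
{
    long N = 65536L;                    /* transform length, N = 4^8  */
    int  P = 8, log4N = 8;              /* assumed parallelism level  */
    long update = (N / P) * log4N;      /* approx. update time        */
    printf("update = %ld cycles, refresh = %ld cycles -> %s\n",
           update, N,
           update <= N ? "real-time OK" : "cannot keep up");
    return 0;
}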

5.3.3 Switching Frequency

Another important factor affecting power consumption is the switching power, which relates to the number of times that a gate makes a logic transition, 0 → 1 or 1 → 0, within each clock cycle. Within the arithmetic unit, for example, when one of the inputs is constant, as is the case with the precomputed trigonometric coefficients, it is possible to use the precomputed values to reduce the number of logic transitions involved, when compared to a conventional fast multiplier solution, and thus to reduce the associated switching power. With a parallel DA [4] unit, for example, it is possible to reduce both switching power and silicon area for the implementation of the arithmetic components at the expense of increased memory for the storage of precomputed sums or inner products [1], whereas with a parallel CORDIC arithmetic [3] unit, it is possible to completely eliminate the PCM requirement and the


associated power-hungry memory accesses, which also involve switching activity, at the expense of increased arithmetic and control logic for the on-the-fly generation of the trigonometric coefficients within each stage of the CORDIC pipeline.

5.4 Proposed Hardware Design Strategy

Having discussed very briefly the key parameters involved in the production of a low-power silicon-based solution to the R24 FHT, the resulting properties or features required of such a solution need to be identified. One should bear in mind, however, that in using the proposed solution for different applications, where the available silicon resources may vary considerably from one application to another, it would be advantageous to be able to define a small number of design variations for the PE, with each version conforming to the same basic design and each compatible with the chosen computing architecture. As a result, the appropriate choice of PE would enable one to optimize the use of the available silicon resources on the target computing device so as to obtain a solution that maximizes the achievable computational density. Three key properties impacting upon the specification of such a solution are now discussed.

5.4.1 Scalability of Design

The above discussion suggests that a desirable feature of our solution is that it should be easily adapted, when dealing with new applications, at minimal re-design effort and cost. This may in part be achieved by making the solution ‘scalable’, a term which has already been introduced and used several times in this monograph and which may mean different things in different contexts, although here it simply refers to the ease with which the sought-after solution may be modified in order to accommodate increasing or decreasing transform lengths (where the data is assumed, at present, to be of 1-D form) as the solution is applied to different applications. This might best be achieved through the adoption of a highly parallel version of the single-PE recursive architecture whereby the hardware requirements for each new application would remain essentially unaltered as the transform length is increased or decreased, other than the varying memory requirement necessary to cater for the varying amounts of data and trigonometric coefficients. The silicon area, as the transform length is increased, would remain essentially constant at the expense of the latency, which increases according to the number of times the GD-BFLY is executed per transform, namely, (1/8)·N·log₄N for a transform of length N. Such an approach would in turn play a key role in keeping the power consumption to within the available power budget.


With a multi-PE pipelined approach, on the other hand, the required number of PEs corresponds to the number of temporal stages required by the transform, namely, log₄N for a radix-4 transform of length N, which clearly increases/decreases logarithmically with the length of the transform, so that the hardware requirement will in turn increase/decrease in like fashion with the length of the transform. As a consequence, the re-design effort and costs in going from one application to another, where different transform lengths are involved, will be significantly greater with a multi-PE approach than with a single-PE approach, making the single-PE approach a particularly attractive one provided the associated timing constraint arising from the data set refresh rate can be met.

5.4.2 Partitioned-Memory Processing

An additional requirement arising from the property of scalability is that relating to the parallelization of the PE, as used to carry out the computations of the GD-BFLY. This may be addressed by ensuring that both the data and the trigonometric coefficients can be appropriately stored within partitioned memory so that multiple data samples may be both retrieved and updated and multiple trigonometric coefficients retrieved, from their respective memory banks, with each set of multiple reads/writes being performed in parallel. The resulting combination of scalability of design and an efficient storage scheme via the use of partitioned memory, if it could be achieved, would yield a solution that was both area-efficient and able to yield high throughput and which would therefore be better able, for all transform lengths of interest (except for pathologically large cases), to satisfy the timing constraint arising from the data set refresh rate. Being able to place the memory close to where the processing is actually taking place would in addition eliminate the need for long on-chip communication paths which might otherwise result in long processing delays and increased power consumption. An additional attraction of such a solution is that the adoption of partitioned memory as a key component in the parallelization of the solution – whereby multiple memory banks combined with a low clock frequency are used rather than a single global memory combined with a high clock frequency – would, as already inferred, lead to a potential reduction in the power consumption [5].

5.4.3 Flexibility of Design

The final property to be considered is that of flexibility whereby the design ensures that the best possible use is made of the available silicon resources. This might be achieved, for example, with the provision of a few variations of the same basic PE design, where each version is compatible with the chosen computing architecture so that one is able to select a specific version of the PE according to the particular silicon resources available on the target computing device. Such flexibility has already been implied in the results of Sects. 4.5 and 4.6 of Chap. 4, where both nine-multiplier and twelve-multiplier versions of the GD-BFLY were considered together with different PCM addressing schemes, one of which minimized the arithmetic requirement at the cost of an increased PCM requirement and the other minimizing the PCM requirement at the cost of an increased arithmetic requirement.

A different type of flexibility relates to the available arithmetic precision, as provided by the arithmetic unit. Different signal processing applications involving the use of an FFT may require very different processing functions in order to carry out the necessary tasks and often different levels of precision for each such function. The FFT may well be fed directly by means of an ADC unit, for example, so that the word length (in terms of the number of bits) of the data into and out of the FFT will be dictated both by the capability of the ADC unit and by the dynamic range requirements of the processing functions into which the FFT feeds. For the design to have truly universal application, therefore, it would be beneficial that the arithmetic unit should be easily adapted to cater for arbitrary arithmetic precision processing, including those applications where the requirements are not adequately addressed through the use of embedded resources, so that different levels of accuracy may be achieved for different applications without having to alter the basic design of the PE. One way of achieving such flexibility, at least in terms of arithmetic precision, would be via the use of a pipelined version of the CORDIC arithmetic unit, as will be discussed in some depth in Chap. 7, where increased precision may be obtained by simply increasing the length of the associated computational pipeline – noting that all the CORDIC stages used in the pipeline are both identical and simple to implement, comprising just adders and shifters – at the expense of a proportionate increase in the latency. Most FPGA manufacturers now provide their own version of the CORDIC arithmetic unit as an embedded resource, alongside that of the fast fixed-point multiplier, for use when the basic operation of interest is that of phase rotation.
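As an illustration of how arithmetic precision scales with pipeline length, the following floating-point model of a CORDIC phase rotation – a sketch only, not the fixed-point hardware form, and with illustrative names – shows the accuracy of the rotated output improving as stages are added, each stage using only shift-and-add style updates:

    import math

    def cordic_rotate(x, y, angle, n_stages):
        # Circular rotation-mode CORDIC: each extra pipeline stage adds
        # roughly one further bit of precision to the rotated (x, y).
        angles = [math.atan(2.0 ** -k) for k in range(n_stages)]  # constants
        gain = 1.0
        for k in range(n_stages):
            gain *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * k))        # scale factor
        z = angle
        for k in range(n_stages):
            d = 1.0 if z >= 0.0 else -1.0
            x, y = x - d * y * 2.0 ** -k, y + d * x * 2.0 ** -k   # shifts/adds
            z -= d * angles[k]
        return gain * x, gain * y

    # Rotating (1, 0) by pi/5 approaches (cos, sin) as the pipeline grows.
    for stages in (8, 16, 24):
        print(stages, cordic_rotate(1.0, 0.0, math.pi / 5, stages))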

5.5 Constraints on Available Resources

As already discussed in Sect. 1.7 of Chap. 1, when producing electronic equipment, whether for commercial or military use, one is seldom blessed with the option of using the latest state-of-the-art device technology. As a result, there are situations where there would be great merit in having designs that are not totally reliant on the availability of the increasingly large quantities of expensive embedded resources, such as fast multipliers and fast memory, as provided by the manufacturers of the latest silicon-based devices, but are sufficiently flexible as to lend themselves to an attractive implementation in silicon even when constrained by the scarcity of such costly resources.


A problem may arise in practice, for example, when the length of the transform to be computed is very large when compared to the capability of the target computing device such that there are insufficient embedded resources to enable a successful or attractive mapping of the transform (and of those additional DSP functions both preceding and succeeding the transform) onto the device to take place. In such a situation, where the use of a larger and more powerful device is simply not an option, it is thus required that some means be found of facilitating a successful mapping onto the available device and one way of achieving this is through the design of a more appropriate arithmetic unit, namely, one which does not rely too heavily upon the use of embedded resources. This might be achieved – as with the requirement already discussed for flexible-precision processing – via the use of a pipelined version of the CORDIC arithmetic unit, for example, as this would eliminate the requirement for both fast fixed-point multipliers and fast RAM for storage of the trigonometric coefficients, so that the PCM could be effectively dispensed with.

5.6 Assessing the Resource Requirements

Given the device-independent nature of the R24 FHT design(s) sought in this monograph, a somewhat theoretical approach has been adopted for assessing the resource requirements for its implementation in silicon, this assessment being based on the determination of the individual requirements, measured in logic slices, for addressing both the arithmetic requirement and the memory requirement. Such an approach can only tell part of the story, however, as the amount of logic required for controlling the operation and interaction of the various components (which ideally are manufacturer-supplied embedded components for optimal size and power efficiency) of the design is rather more difficult (if not impossible) to assess, if considered in isolation from the actual hardware design process, this due in part to the automated and somewhat unpredictable nature of that process, as outlined below.

Typically, after designing and implementing the hardware design in an HDL, there is a multistage process to go through before the design is ready for use in an FPGA. The first stage is synthesis, which takes the HDL code and translates it into a ‘netlist’ which is simply a textual description of a circuit diagram or schematic. This is followed by a simulation which verifies that the design specified in the netlist functions correctly. Once verified, the netlist is translated into a binary format, with the components and connections that it defines being then mapped to CLBs before the design is finally placed and routed to fit onto the target computing device. A second simulation is then performed to help establish how well the design has been placed and routed before a configuration file is generated to enable the design to be loaded onto the FPGA.

The reality, after this process has been gone through, is that the actual logic requirement will invariably be somewhat greater than predicted by the theory, due to the inefficient and unpredictable use made of the available resources in meeting the various design constraints. This situation is true for any design considered, however, so that in carrying out a comparative analysis of different FHT or FFT designs, the same inefficiencies will inevitably apply to each of the selected algorithms. Such an overhead in the logic requirement needs to be borne in mind, therefore, when actually assessing whether a particular device has sufficient resources to meet the given task.

5.7 Discussion

This chapter has looked at the various key design parameters – namely, the clock frequency, silicon area and switching frequency – and the trade-offs that need to be made between them when trying to design a low-power solution to the R24 FHT for implementation with silicon-based parallel computing technology, as typified by the FPGA and the ASIC, where the sought-after solution is required to be able to optimize the use of the available silicon resources on the target computing device so as to obtain a solution that maximizes the achievable computational density.

The optimization of the above parameter set has led to questions relating to the most appropriate choice of computing architecture, in terms of whether one or multiple PEs should be used, together with how parallelism might best be exploited for each such approach – namely, the merits of different pipelining strategies and of the adoption of partitioned memory for the efficient storage and retrieval of the data and the trigonometric coefficients. The question of scalability of the proposed design was also considered, as this property, combined with that of flexibility (in terms of arithmetic precision and/or the ability to optimize the use of the various embedded resources), could help to maintain performance when costly embedded resources, such as fast multipliers and fast memory, are scarce. A scalable solution would also help to minimize the re-design effort and costs when dealing with new applications. Clearly, if silicon-based designs could be produced that minimize the requirement for such costly embedded resources, then smaller lower-complexity devices might be used, rather than those at the top end of the device range, as is commonly the case, thus minimizing the financial cost of implementation – the less the reliance on the use of embedded resources, the greater the flexibility in the choice of target hardware.

Given the wish list of properties outlined in this chapter for our idealized R24 FHT solution, the single-PE recursive architecture looks to be a particularly attractive one to adopt given that the associated computing engine, the GD-BFLY, has already shown itself capable, in Chap. 4, of producing eight outputs from eight inputs, so that a parallel solution achieved through the fine-grained pipelining of the PE would offer a theoretical eightfold speed up over a purely sequential solution. This would, however, necessitate being able to store in an efficient manner both the data and the trigonometric coefficients within suitably defined partitioned memory (i.e. using multiple banks of fast memory) so that multiple data samples might be both retrieved and updated and multiple trigonometric coefficients retrieved, from their respective memory banks, with each set of multiple reads/writes being performed in parallel. This use of partitioned memory would in turn result in a further decrease in the power consumption and would also enable the simultaneous execution of the multiple arithmetic operations needing to be performed within each stage of the computational pipeline via SIMD processing. Being able to offer a small number of design variations of the PE – with each version conforming to the same basic design and each compatible with the chosen computing architecture, which looks most likely to be that based upon the use of a single PE – would also be advantageous, offering a performance ranging from optimality in terms of the arithmetic requirement to optimality in terms of the memory requirement, as this would provide the user with the ability to optimize the design of the PE for each new application according to the resources available on the target computing device.

The outcome of the analysis carried out in this chapter is thus to enable a performance objective to be defined relating to the efficient implementation of the R24 FHT that’s able to exploit the silicon-based parallel computing technology, taking into account the constraints and trade-offs of the various parameters and assessing the performance according to the silicon-based performance metric introduced in Sect. 1.8 of Chap. 1. Thus:

Performance Objective for Silicon-Based Implementation: The requirement is to produce a computing architecture for yielding resource-efficient, scalable and device-independent solutions for the parallel computation of the regularized FHT, when implemented with silicon-based computing technology, using a performance metric based upon maximization of the computational density.

Although various definitions could be used when deciding what a given solution should be looking to achieve, this particular definition – which for the latency-constrained problem looks for that solution that’s best able to maximize the computational density and thus minimize the silicon cost – is targeted specifically at the type of power-constrained environment that one would expect to encounter with applications typified by that of mobile communications. Another commonly used definition for such applications is that an attractive solution may be regarded as being one in possession of a low size, weight and power (SWAP) requirement – assuming of course that it also meets the appropriate timing constraint, as given by the data set refresh rate – although the additional properties of scalability and device independence are not catered for by such a simplistic definition. The task now, for the following chapters – particularly those of Chaps. 6 and 7 for the 1-D case and of Chaps. 10 and 11 for the m-D case – is to produce detailed solutions for the R24 FHT and for its key applications, such as those of the real-data DFT and the filtering of real-valued data sets, where the solution to the R24 FHT is capable of achieving the new silicon-based performance objective.


References

1. G. Birkhoff, S. MacLane, A Survey of Modern Algebra (Macmillan, 1977)
2. C. Maxfield, The Design Warrior’s Guide to FPGAs (Newnes (Elsevier), 2004)
3. J.E. Volder, The CORDIC trigonometric computing technique. IRE Trans. Electron. Comput. EC-8(3), 330–334 (1959)
4. S.A. White, Application of distributed arithmetic to digital signal processing: A tutorial review. IEEE ASSP Mag. 6, 4–19 (July 1989)
5. T. Widhe, J. Melander, L. Wanhammar, Design of efficient radix-8 butterfly PE for VLSI. Proc. IEEE Int. Symp. Circ. Syst., Hong Kong (June 1997)
6. Xilinx Inc., company and product information available at company web site: www.xilinx.com

Chapter 6
Architecture for Silicon-Based Implementation of Regularized Fast Hartley Transform

6.1 Introduction

A point has now been reached whereby an attractive formulation of the FHT algorithm has been produced, namely, the R24 FHT as introduced in Chap. 4, whilst those properties required of such an algorithm and of its associated computing architecture for an optimal mapping of the solution onto silicon-based parallel computing equipment – as typified, in particular, by the FPGA – have also been outlined, as in Chap. 5. This has resulted in a performance objective for the implementation of the R24 FHT, as stated in Sect. 5.7 of Chap. 5, that’s relevant to the proposed technology. The question now to be addressed is whether such a mapping can be found, given that there appears to be need for a ‘squaring of the circle’, namely, that of maximizing the computational throughput in order to keep up with the data set refresh rate, whilst at the same time minimizing the required silicon resources so as to reduce both the power consumption and the cost of implementation. A solution that yields a high computational density is assumed to be attractive in terms of both power consumption and resource efficiency, given the known influence of silicon area on the power consumption – as discussed in Chap. 5. The constraint on its execution being completed within the update period, as dictated by the data set refresh rate, is to ensure continuous real-time operation for the R24 FHT for transform lengths up to and including some maximum value, yet to be determined – as will be discussed later in Sect. 6.6 of this chapter. For transforms whose length exceeds this maximum value, therefore, it will not be possible to sustain continuous real-time operation through the use of a single R24 FHT, when based upon the single-PE recursive architecture, so that other approaches will be needed – this point is also taken up again later in Sect. 6.6.

From the silicon-based performance objective, as stated in Sect. 5.7 of Chap. 5, it is possible to ensure that any scalable, parallel solution to the R24 FHT, if found, will possess those properties outlined in Chap. 5 that are considered to be desirable, if not essential, for an attractive hardware implementation. It will also be necessary, however, that a proper comparison be made of the time-complexity, as given by the latency (although for the ‘block-based’ processing solution discussed in this monograph the latency and the update time are equivalent when dealing with 1-D data sets), as well as the space-complexity, as expressed in terms of the required silicon resources via both arithmetic and memory components, with those of existing commercially available industry-standard FFT devices – although, as already stated, most if not all such commercially available solutions will almost invariably involve the computation of the conventional complex-data version of the radix-2 FFT.

Note that the basic data set used for input/output to/from the GD-BFLY consists of eight samples where – as will be seen later in the chapter – the samples are stored within the PDM with either each sample being assigned to its own separate memory bank or each pair of consecutive samples being assigned to its own separate memory bank so that only four of the eight memory banks would be used for any particular data set, either the even-addressed ones or the odd-addressed ones. This eight-sample data set, stored in the above fashion, will be referred to hereafter as a woctad (where octad is defined as meaning ‘set of eight objects’, according to the Concise Oxford Dictionary, so that woctad is adopted here as meaning ‘set of eight words’, where each word holds a single sample of data) in order to avoid unnecessary verbiage through its repeated usage.

6.2 Single-PE Versus Multi-PE Architectures

Two types of parallel computing architecture were briefly discussed in Sect. 5.3.2 of Chap. 5, one based upon the adoption of multiple PEs and the other upon that of a single PE, where the multi-PE pipelined architecture enables the required computational throughput to be achieved via coarse-grained pipelining of the FHT algorithm and the other via fine-grained pipelining of the PE – with each pipelined solution able, potentially, to exploit SIMD processing within each stage of its computational pipeline to further enhance performance. The multi-PE pipelined architecture thus lends itself more naturally to ‘streaming’ operation – which generally takes NAT-ordered input data and (for the case of a radix-4 algorithm) produces DBR-ordered output data – whereby the data samples are processed as soon as they arrive at the first PE in the pipeline. The locally pipelined single-PE recursive architecture, on the other hand, lends itself more naturally to ‘block-based’ operation – which generally (for the case of a radix-4 algorithm) takes DBR-ordered input data and produces NAT-ordered output data – whereby all the data samples must first be generated and stored before they can be processed.

The single-PE recursive architecture certainly looks to offer the most promise for the problem under consideration, but in order for the required computational throughput to be achieved, it will be necessary that the memory, as required for storing both the data and the trigonometric coefficients, should be suitably organized. The memory structure should be such that both the data set and the trigonometric coefficients required for the execution of any given instance of the GD-BFLY may be accessed simultaneously, and without conflict, thereby facilitating SIMD processing for the simultaneous execution of the multiple arithmetic operations to be performed within each stage of its computational pipeline and the GD-BFLY thus able to produce output woctads at the rate of one per clock cycle. In order to achieve this, it is necessary that the memory should be organized according to that required by the single-PE recursive architecture of Fig. 6.1, where the topology of the data routing network is shown in the form of an H tree [5] so as to keep the communication paths between the PE and each memory bank of equal length – although in reality, when mapping such designs onto an FPGA, one may no longer have that level of control over such matters. The input/output data is distributed over the eight memory banks that make up the PDM, whilst the trigonometric coefficients are distributed over the three memory banks or LUTs that make up the PCM, so that suitable parallel addressing schemes need now to be defined which ideally enable one woctad (the GD-BFLY input) to be read from the PDM and another (the GD-BFLY output) to be written to the PDM every clock cycle, in an in-place fashion (whereby each output woctad produced by the GD-BFLY is to be written back to the same set of memory locations as used by the input woctad) and without conflict, and two trigonometric coefficients to be read from each bank of the PCM every clock cycle [3, 4], again without conflict. Such addressing schemes are now discussed in more detail according to various optimization criteria.

[Fig. 6.1 Single-PE recursive architecture for regularized FHT: the generic radix-4 double butterfly is linked, via an H-tree routing network, to the eight banks of the partitioned data memory and the three banks of the partitioned coefficient memory, with a trigonometric coefficient generator and an address generator completing the design]

6.3 Conflict-Free Parallel Memory Addressing Schemes

The parallel memory addressing schemes described here for the R24 FHT are based upon the assumption that the memories are of dual-port type. Such memory is assumed to have four data ports, two for the data inputs and two for the data outputs, although there is only one address input for each input/output data pair. As a result, each memory bank is able to cater for the execution of either two simultaneous reads, as required for the case of the PCM, two simultaneous writes or one simultaneous read and write using separate read and write addresses. These read/write options will be shown to be sufficient for the addressing requirements of both the PCM, which requires the execution of two simultaneous reads per clock cycle, and the PDM, which for the implementation discussed in this monograph will be shown to need all three options. With regard to the PDM, the addressing scheme is also to be regarded as being performed in-place with each output woctad produced by the GD-BFLY being written back to the same set of memory locations as used by the input woctad.

6.3.1 Parallel Storage and Retrieval of Data

The GD-BFLY, for Type-I, Type-II and Type-III cases, as described in Chap. 4, requires that woctads be read/written from/to the PDM, in an in-place fashion, in order to be able to carry out the processing for a given data set. One way for this to be achieved is if the woctad to be processed by the GD-BFLY is stored with one sample in each PDM bank, so that all eight PDM banks are used for each instance of the GD-BFLY. Another way, given the availability of dual-port memory, would be to have two samples in each of four PDM banks (either the four even-addressed or the four odd-addressed memory banks) with alternate sets of four PDM banks being used on alternate sets of data. The problem is addressed here by adopting suitably modified versions of the rotation-based radix-4 memory mapping, ‘Ψ4’, as given by the definition:

Mapping for data memory addressing:

    Ψ4(n, α) = ( Σ_{k=1}^{α} ((n mod 4^k) >> 2(k−1)) ) mod 4    (6.1)

where ‘>>’ denotes the right bit-shift operation – the suitably modified versions of this mapping being the memory-bank mappings denoted Ω1 and Ω2 below – whilst the address of each sample within its assigned memory bank is given by the mapping

    Φ(n) = n >> 3    (6.4)

so that Φ(n) ∈ {0, 1, . . ., N/8 − 1}, where the parameter n ∈ {0, 1, . . ., N−1} corresponds to the sample address after reordering of the data by the DBR mapping. To better understand the workings of these rotation-based memory mappings for the storage/retrieval of the data from memory, it is best to first visualize the data as being stored within a two-dimensional array of four columns and N/4 rows, where the data is stored on a row-by-row basis, with four samples to a row. The effect of the generic address mapping, Ψ4, as shown in the example given in Table 6.1, is to apply a left-sense rotation to each row of data where the amount of rotation is dependent upon the particular (N/4) × 4 sub-array to which it belongs, as well as the particular (N/16) × 4 sub-array within that sub-array, as well as the particular (N/64) × 4 sub-array within that sub-array, etc., until all the relevant partitions have been accounted for – there are log₄N of these. As a result, there is a cyclic rotation being applied to the data over each such sub-array – the cyclic nature of the mapping means that within each sub-array, the amount of rotation to be applied to a given row of data is one position greater than that for the preceding row. This cyclic property, as will later be seen, may be beneficially exploited by the GD-BFLY through the way in which it stores/retrieves the samples of the input/output woctads, for both individual instances of the GD-BFLY, via the address mapping Ω1, and for consecutive pairs of instances, via the address mapping Ω2, over all eight memory banks. Examples of the address mappings Ω1 and Ω2 for the case of a data set of length 64 are given below in Tables 6.2 and 6.3, respectively, where each pair of consecutive rows of memory bank addresses correspond to the sample locations of a complete GD-BFLY input/output woctad.

Table 6.1 Structure of generic address mapping Ψ4 for case of length-64 data set

    Row :  value of generic address mapping Ψ4
     0  :  0  2  4  6
     1  :  2  4  6  0
     2  :  4  6  0  2
     3  :  6  0  2  4
     4  :  2  4  6  0
     5  :  4  6  0  2
     6  :  6  0  2  4
     7  :  0  2  4  6
     8  :  4  6  0  2
     9  :  6  0  2  4
    10  :  0  2  4  6
    11  :  2  4  6  0
    12  :  6  0  2  4
    13  :  0  2  4  6
    14  :  2  4  6  0
    15  :  4  6  0  2

Table 6.2 Structure of address mapping Ω1 for case of length-64 data set

    Row :  value of address mapping Ω1
     0  :  0  3  4  7
     1  :  2  5  6  1
     2  :  4  7  0  3
     3  :  6  1  2  5
     4  :  2  5  6  1
     5  :  4  7  0  3
     6  :  6  1  2  5
     7  :  0  3  4  7
     8  :  4  7  0  3
     9  :  6  1  2  5
    10  :  0  3  4  7
    11  :  2  5  6  1
    12  :  6  1  2  5
    13  :  0  3  4  7
    14  :  2  5  6  1
    15  :  4  7  0  3

Suppose now, for ease of illustration, that the arithmetic required for any given instance of the GD-BFLY can be carried out with a zero processing delay so that a woctad may be read from the PDM, processed by the GD-BFLY and then written back to the PDM within a single clock cycle – this is not, of course, actually achievable, and a more realistic scenario is to be discussed later in Sect. 6.4 of this chapter when internal pipelining of the PE is introduced in order to address the problems associated with a non-zero processing delay.

Table 6.3 Structure of address mapping Ω2 for case of length-64 data set

    Row :  value of address mapping Ω2
     0  :  0  2  4  6
     1  :  2  4  6  0
     2  :  5  7  1  3
     3  :  7  1  3  5
     4  :  2  4  6  0
     5  :  4  6  0  2
     6  :  7  1  3  5
     7  :  1  3  5  7
     8  :  4  6  0  2
     9  :  6  0  2  4
    10  :  1  3  5  7
    11  :  3  5  7  1
    12  :  6  0  2  4
    13  :  0  2  4  6
    14  :  3  5  7  1
    15  :  5  7  1  3
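The structure of these tables may be regenerated in a few lines. The sketch below assumes the digit-sum reading of the Ψ4 definition given above, with the even/odd bank offsets for Ω1 and Ω2 inferred from Tables 6.2 and 6.3 rather than taken from the text (the entries of Table 6.1 corresponding to 2·Ψ4, i.e. Ψ4 spread over the even-addressed banks); all function names are illustrative:

    def psi4(n, alpha):
        # Rotation-based radix-4 mapping of Eq. 6.1: the sum of the
        # base-4 digits of n, taken modulo 4.
        return sum((n % 4**k) >> (2 * (k - 1)) for k in range(1, alpha + 1)) % 4

    def omega1(n, alpha):
        # One sample per bank: consecutive samples alternate between the
        # even-addressed and odd-addressed banks.
        return 2 * psi4(n, alpha) + (n % 2)

    def omega2(n, alpha):
        # Two samples per bank: whole woctads alternate between the four
        # even-addressed and the four odd-addressed banks.
        return 2 * psi4(n, alpha) + ((n >> 3) % 2)

    for row in range(16):  # regenerate Tables 6.1-6.3 (N = 64, alpha = 3)
        n = 4 * row
        print(row,
              [2 * psi4(n + j, 3) for j in range(4)],   # Table 6.1
              [omega1(n + j, 3) for j in range(4)],     # Table 6.2
              [omega2(n + j, 3) for j in range(4)])     # Table 6.3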

The input/output woctad to/from the GD-BFLY comprises four even-addressed samples and four odd-addressed samples, where for a given instance of the GD-BFLY for the first temporal stage, each of the eight PDM banks will contain just one sample, as required, whilst for a given instance of the GD-BFLY for the remaining α–1 temporal stages, four of the eight PDM banks will each contain one even-addressed sample and one odd-addressed sample with the remaining four PDM banks being unused. As a result, it is generally not possible to carry out the execution of all eight reads/writes for the same woctad using all eight PDM banks within a single clock cycle. However, if, for all but the first temporal stage, we consider any pair of consecutive instances of the GD-BFLY, then it may be shown that the sample addresses of the second instance will occupy the four PDM banks not utilized by the first, so that for every two consecutive clock cycles, the eight even-addressed samples and the eight odd-addressed samples required by the pair of consecutive instances of the GD-BFLY may be both read from and written to the PDM, as required for conflict-free and in-place memory addressing – see Fig. 6.2. Thus, based upon our simplistic assumption of a zero processing delay for the execution of the GD-BFLY’s arithmetic, all eight PDM banks for the first temporal stage may be both read from and written to within a single clock cycle, whilst for the remaining α–1 temporal stages, it can be shown that in any one clock cycle, the input woctad for one instance of the GD-BFLY may be read from the PDM, prior to being processed by the GD-BFLY, whilst the output woctad produced by its predecessor may be written back to the PDM, where the four memory banks being written to form the complement to those four memory banks whose contents are being processed. As a result, the R24 FHT solution based upon the single-PE recursive architecture will be able to yield complete GD-BFLY output woctads at the rate of

two every two clock cycles – with one being completed and another one started with each clock cycle – thus equating to a rate of one woctad per clock cycle, as required.

[Fig. 6.2 Addressing of hypothetical pair of consecutive generic double butterflies for all stages other than first: the first butterfly of the pair reads/writes two samples from/to each of the four even-addressed (or odd-addressed) memory banks, whilst the second butterfly of the pair uses the complementary four banks (ES – even-address sample; OS – odd-address sample)]

An alternative way of handling the pipelining for the last α–1 temporal stages would be to read just four samples for the first clock cycle, with one sample from each of the four even-addressed memory banks. This would be followed by a woctad for each succeeding clock cycle apart from the last, with four samples for the current instance of the GD-BFLY being read from the four even-/odd-addressed memory banks and four samples for the succeeding instance of the GD-BFLY being read from the remaining four odd-/even-addressed memory banks. The processing would be completed by reading just four samples for the last clock cycle, with one sample from each of the four odd-addressed memory banks. In this way, for each clock cycle apart from the first and the last, a woctad could be read/written from/to all eight memory banks, one sample per memory bank, with one complete GD-BFLY output woctad being thus produced and another partly produced, to be completed on the succeeding clock cycle. Note, however, that a temporary buffer would be needed to hold one complete GD-BFLY output woctad as the samples written back to memory would also need to come from consecutive GD-BFLY output woctads, rather than from a single GD-BFLY output woctad, due to the dual-port nature of the memory. For the last clock cycle, the remaining GD-BFLY output woctad could also be written out to all eight memory banks, again one sample per memory bank.

The choice of how best to carry out the pipelining is really down to the individual HDL programmer, but for the purposes of consistency within the current monograph, it will be assumed that the input woctad required for a given instance of the GD-BFLY is to be read from the PDM within a single clock cycle, two samples per even-/odd-addressed memory bank as originally described, so that the input woctad for one instance of the GD-BFLY may be read from the PDM, prior to its being processed by the GD-BFLY, whilst the output woctad produced by its predecessor is written back to the PDM.
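This complementary-bank behaviour may be checked directly against the Ω2 mapping; the fragment below restates the hypothetical psi4/omega2 helpers from the sketch given earlier in this section and confirms, for the length-64 case, that each pair of consecutive butterflies covers all eight banks between them:

    def psi4(n, alpha):
        return sum((n % 4**k) >> (2 * (k - 1)) for k in range(1, alpha + 1)) % 4

    def omega2(n, alpha):
        return 2 * psi4(n, alpha) + ((n >> 3) % 2)

    def woctad_banks(b, alpha=3):
        # Butterfly b occupies rows 2b and 2b+1 of the four-column data
        # array, i.e. samples 8b .. 8b+7 (after DBR reordering).
        return {omega2(8 * b + j, alpha) for j in range(8)}

    # The four consecutive butterfly pairs for N = 64 (alpha = 3):
    for b in range(0, 8, 2):
        first, second = woctad_banks(b), woctad_banks(b + 1)
        assert len(first) == 4 and second == set(range(8)) - first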

6.3.2 Parallel Storage, Retrieval and Generation of Trigonometric Coefficients

Turning now to the trigonometric coefficients, the GD-BFLY, as described in Chap. 4, requires that six non-trivial trigonometric coefficients be either retrieved from the PCM or efficiently generated in order to be able to carry out the GD-BFLY processing for a given woctad. Two schemes are now outlined for performing this task whereby all six trigonometric coefficients may be retrieved simultaneously, within a single clock cycle, these schemes offering a straightforward trade-off of memory requirement against addressing-complexity – as measured in terms of the number of arithmetic/logic operations required for computing the necessary addresses. The two schemes considered cater for those extremes whereby the choice is either to minimize the arithmetic requirement, at the expense of an increased PCM requirement, or to minimize the PCM requirement, at the expense of an increased arithmetic requirement. Clearly, other options that fall between these two extremes are also possible, but these may be easily defined and developed given an understanding of the techniques discussed here and in Sect. 4.6 of Chap. 4.

6.3.2.1 Minimum-Arithmetic Addressing Scheme

The trigonometric coefficient set comprises both cosinusoidal and sinusoidal terms for single-angle, double-angle and triple-angle cases. Therefore, in order for all six trigonometric coefficients to be retrieved simultaneously from dual-port RAM, three LUTs are required with the two single-angle coefficients being read from the first LUT, the two double-angle coefficients from the second LUT and the two triple-angle coefficients from the third LUT. In order to keep the arithmetic requirement of the addressing to a minimum, each LUT may be defined as in Sect. 4.6.1 of Chap. 4, being sized according to the single-quadrant addressing scheme, whereby the trigonometric coefficients are read from a sampled version of the sinusoidal function with argument defined from 0 up to π/2 radians. Thus, for the case of an N-point R24 FHT, it is required that each of the three single-quadrant LUTs be of length N/4, yielding a total PCM requirement, denoted C_MEM^(A-opt), of

    C_MEM^(A-opt) = (3/4)·N    (6.5)

words. This scheme would seem to offer a reasonable compromise, therefore, between the PCM requirement and the addressing-complexity, using more memory than is theoretically necessary, in terms of replicated LUTs, in order to keep the arithmetic/logic requirement of the addressing as simple as possible – namely, a zero arithmetic requirement when using the twelve-multiplier version of the GD-BFLY or six additions when using the nine-multiplier version.
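A minimal software model of the single-quadrant scheme is sketched below, with illustrative names, showing how any sinusoidal or cosinusoidal coefficient for an N-point transform may be recovered from a length-N/4 quadrant LUT via the symmetries of the sine function:

    import math

    def make_quadrant_lut(N):
        # Sampled sine over the first quadrant only: N/4 words per LUT.
        return [math.sin(2.0 * math.pi * k / N) for k in range(N // 4)]

    def lut_sin(lut, N, k):
        # sin(2*pi*k/N) for any k, via quadrant symmetry of the sine.
        q, r = divmod(k % N, N // 4)
        s = lut[r] if q in (0, 2) else (lut[N // 4 - r] if r else 1.0)
        return s if q < 2 else -s

    def lut_cos(lut, N, k):
        # cos(x) = sin(x + pi/2): shift the index by a quarter period.
        return lut_sin(lut, N, k + N // 4)

    N = 64
    lut = make_quadrant_lut(N)
    assert all(abs(lut_sin(lut, N, k) - math.sin(2 * math.pi * k / N)) < 1e-12
               for k in range(N))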

6.3.2.2 Minimum-Memory Addressing Scheme

Another approach to the problem is to adopt a two-level LUT for the first of the three angles, where the associated complementary-angle LUTs are as defined in Sect. 4.6.2 of Chap. 4, comprising one coarse-resolution region of length (1/2)·√N for the sinusoidal function, and one fine-resolution region of length (1/2)·√N for each of the sinusoidal and cosinusoidal functions. To keep the PCM requirement to a minimum, the double-angle and triple-angle trigonometric coefficients are then obtained straightforwardly through the application of standard trigonometric identities, as given by Eqs. 4.65–4.68 of Chap. 4, so that the solution requires that three complementary-angle LUTs be used for just the single-angle trigonometric coefficient case, each LUT of length (1/2)·√N, yielding a total PCM requirement, denoted C_MEM^(M-opt), of

    C_MEM^(M-opt) = (3/2)·√N    (6.6)

words. The double-angle and triple-angle trigonometric coefficients could also be obtained by assigning a two-level LUT to the storage of each, but the associated arithmetic requirement involved in generating the addresses turns out to be identical to that obtained when the trigonometric coefficients are obtained through the direct application of standard trigonometric identities, so that in this instance the replication of the two-level LUT provides us with three times the memory requirement but with no arithmetic advantage as compensation. With the proposed technique, therefore, the PCM requirement, as given by Eq. 6.6, is minimized at the expense of additional arithmetic/logic for the addressing – namely, an arithmetic requirement of seven multiplications and eight additions when using the twelve-multiplier version of the GD-BFLY or seven multiplications and fourteen additions when using the nine-multiplier version.
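The generation step itself is easily modelled; the sketch below applies one standard set of double- and triple-angle identities – offered here as a plausible reading of Eqs. 4.65–4.68, which are not reproduced in this chapter – to produce the remaining coefficients from the single-angle pair:

    import math

    def expand_coefficients(s1, c1):
        # From s1 = sin(a), c1 = cos(a), generate the double-angle and
        # triple-angle coefficients via standard trigonometric identities.
        c2 = 2.0 * c1 * c1 - 1.0      # cos(2a)
        s2 = 2.0 * s1 * c1            # sin(2a)
        c3 = c1 * (2.0 * c2 - 1.0)    # cos(3a)
        s3 = s1 * (2.0 * c2 + 1.0)    # sin(3a)
        return s2, c2, s3, c3

    a = 2.0 * math.pi * 3 / 64        # an arbitrary single angle
    s2, c2, s3, c3 = expand_coefficients(math.sin(a), math.cos(a))
    assert abs(s3 - math.sin(3 * a)) < 1e-12
    assert abs(c3 - math.cos(3 * a)) < 1e-12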

6.3.2.3 Comparative Analysis of Addressing Schemes

The results of this section are summarized in Table 6.4, where the PCM requirement and the arithmetic requirement for each of the conflict-free parallel addressing schemes are given – the PDM is assumed to be double-buffered. A trade-off has clearly to be made between the PCM requirement and the arithmetic requirement, with the choice being ultimately made according to the resources available on the target computing device. Versions I and II of the solution to the R24 FHT correspond to the adoption of the minimum-arithmetic addressing scheme for the twelve-multiplier and nine-multiplier PEs, respectively, whilst Versions III and IV correspond to the adoption of the minimum-memory addressing scheme for the twelve-multiplier and nine-multiplier PEs, respectively. The trigonometric coefficient retrieval/generation schemes required for Versions I–IV of the above solution are illustrated via Figs. 6.3, 6.4, 6.5 and 6.6, respectively, with the associated arithmetic requirement for the addressing given by zero when using Version I of the R24 FHT solution, six additions when using Version II, seven multiplications and eight additions when using Version III and seven multiplications and fourteen additions when using Version IV. Note that with the minimum-memory addressing scheme of Figs. 6.5 and 6.6, pipelining will certainly need to be introduced so as to ensure that a complete new set of trigonometric coefficients is available for input to the GD-BFLY for each new clock cycle – thus enabling real-time processing to be achieved and maintained.

Table 6.4 Performance/resource comparison for fast multiplier versions of N-point regularized FHT

    Version of   Arithmetic complexity                       Memory requirement (words)                        Time-complexity (clock cycles)
    solution     PE mults  PE adders  CG mults  CG adders    Data memory (double-buffered)  Coefficient memory  Update time/latency
    I            12        22         0         0            2 × 8 × (1/8)N = 2N            3 × (1/4)N = (3/4)N   (1/8)·N·log₄N
    II           9         25         0         6            2 × 8 × (1/8)N = 2N            3 × (1/4)N = (3/4)N   (1/8)·N·log₄N
    III          12        22         7         8            2 × 8 × (1/8)N = 2N            3 × (1/2)√N = (3/2)√N (1/8)·N·log₄N
    IV           9         25         7         14           2 × 8 × (1/8)N = 2N            3 × (1/2)√N = (3/2)√N (1/8)·N·log₄N
    (PE – processing element; CG – trigonometric coefficient generator)

[Fig. 6.3 Resources required for trigonometric coefficient retrieval/generation for Version I of solution with one-level LUTs: the nine coefficients D1–D9 are formed from the single-, double- and triple-angle values Sn = sin(nθ) and Cn = cos(nθ), each read from one of three one-level LUTs of length (1/4)N words]

[Fig. 6.4 Resources required for trigonometric coefficient retrieval/generation for Version II of solution with one-level LUTs: as for Fig. 6.3, but with the coefficients D1–D9 formed via the additions required by the nine-multiplier PE]

[Fig. 6.5 Resources required for trigonometric coefficient retrieval/generation for Version III of solution with two-level LUT: the single-angle values S1 = sin(α), C1 = cos(α) ∈ LUT[1] and the fine-resolution values S2 = sin(β) ∈ LUT[2], C2 = cos(β) ∈ LUT[3], each LUT of length (1/2)√N words, feed delay-matched stages that generate the coefficients D1–D9 on the fly]

[Fig. 6.6 Resources required for trigonometric coefficient retrieval/generation for Version IV of solution with two-level LUT: as for Fig. 6.5, but with additional delay stages to match the extra additions of the nine-multiplier PE]

6.4 Design of Pipelined PE for Single-PE Recursive Architecture

To exploit the multibank memories and LUTs, together with the associated conflict-free parallel memory addressing schemes for both the data (for which the addressing is also in place) and the trigonometric coefficients, the PE needs now to be able to produce complete GD-BFLY output woctads at the rate of one per clock cycle in order to achieve the required computational throughput. However, it must be borne in mind that although, for the first temporal stage, all eight PDM banks can be both read from and written to within the same clock cycle, for the remaining α–1 temporal stages, only those four PDM banks not currently being read from may be written to (and vice versa). Also, the simplistic assumption of a zero processing delay for the execution of the GD-BFLY’s arithmetic must now be discarded to allow for the more realistic scenario of a non-zero processing delay with the arithmetic now being mapped onto a suitably defined computational pipeline in order to produce new output woctads at the rate of one per clock cycle.

6.4.1 Parallel Computation of Generic Double Butterfly

The above constraints suggest that a suitable PE design may be obtained if the GD-BFLY is carried out by means of a β-stage computational pipeline, as shown in the simple example of Fig. 6.7, where β is an odd-valued integer and where each computational stage of the PE (or PCS) contains its own set of storage registers for holding the current set of eight processed samples. In this way, if a start-up delay of DCG clock cycles is required for a pipelined version of the trigonometric coefficient generator and DPE clock cycles for a pipelined version of the PE, where

    DPE = β − 1,    (6.7)

then given that the trigonometric coefficients will be required before the first outputs have been produced by the PE, the first temporal stage of processing may be safely assumed to produce its first outputs after a start-up delay of DSU clock cycles, where for the worst possible case

    DSU = DCG + DPE,    (6.8)

where one would expect DCG to be much smaller than DPE. This will enable the PE to read in one woctad and write out one woctad every clock cycle, thereby enabling the first temporal stage to be safely completed in N/8 + DSU clock cycles, and subsequent temporal stages to be completed in N/8 clock cycles. Note that the pipeline delay DPE must account not only for the sets of adders and permutators but also for the fixed-point multipliers which are themselves typically implemented as pipelines, possibly requiring as many as five PCSs according to the required precision. As a result, it is likely that at least nine PCSs might be required for implementation of the computational pipeline, with each temporal stage of the R24 FHT requiring the PE to feed N/8 consecutive woctads through the pipeline and with SIMD processing being used for the simultaneous execution of the multiple arithmetic operations to be performed within each stage of the pipeline. A description of the pipelined PE including the structure of the memory, for both the data and the trigonometric coefficients, together with its associated interconnections, is given in Fig. 6.8.

Note, however, that depending upon the relative lengths of the computational pipeline, β, and the transform, N, an additional delay may need to be applied for every temporal stage, as well as the first, in order to ensure that woctads are not updated in one temporal stage before they have been processed and written back to the PDM in the preceding temporal stage, as this would result in the production of invalid outputs. If the transform length is sufficiently greater than the pipeline delay, however, as will invariably be the case, this problem may be avoided – these points are discussed further in Sect. 6.4.3.

[Fig. 6.7 Parallel solution for PE using five-stage computational pipeline, PCS0–PCS4: for stage 0 of the transform, both the even-addressed (EB) and odd-addressed (OB) memory banks are read from and written to at the same time, one sample per memory bank; for stages 1 to α−1, when the even-addressed banks are read from, the odd-addressed banks are written to, and vice versa, two samples per memory bank]

[Fig. 6.8 Memory structure and interconnections for internally pipelined PE: the address generation feeds the eight data memory banks PDM0–PDM7 (eight reads and eight writes per clock cycle) and the three coefficient memory banks PCM0–PCM2 (six reads per clock cycle), which in turn feed the computational stages PCS0 to PCSβ−1 of the PE (PCM – coefficient memory of PE; PCS – computational stage of PE; PDM – data memory of PE)]

6.4.2 Space-Complexity Considerations

The space-complexity is determined by the combined requirements of an arithmetic component, comprising the arithmetic/logic components, and a memory component, comprising the multibank fast dual-port memory. Adopting the minimum-arithmetic addressing scheme of Versions I and II of the R24 FHT solution (as detailed in Table 6.4), together with the adoption of partitioned memory for the storage of both the data and the trigonometric coefficients, the ‘worst-case’ memory component for the single-PE recursive architecture, denoted M_FHT^(W), is given by

    M_FHT^(W) = 2 × (8 × (1/8)N) + 3 × (1/4)N = (11/4)·N    (6.9)

words, where 2N words are required for double buffering of the eight-bank PDM and 3N/4 words for the three single-quadrant LUTs that make up the PCM. In comparison, by adopting the minimum-memory addressing scheme of Versions III and IV of the R24 FHT solution (as detailed in Table 6.4), together with the adoption of partitioned memory for the storage of both the data and the trigonometric coefficients, the ‘best-case’ memory component for the single-PE recursive architecture, denoted M_FHT^(B), is given by

    M_FHT^(B) = 2 × (8 × (1/8)N) + 3 × ((1/2)·√N) = 2N + (3/2)·√N    (6.10)

words, where 2N words are required for double buffering of the eight-bank PDM and (3/2)·√N words for the three complementary-angle LUTs that make up the PCM. The arithmetic/logic requirement is dominated by the presence of the dedicated fast fixed-point multipliers, with a total of nine or twelve being required by the GD-BFLY and up to seven for the memory addressing, depending upon the chosen addressing scheme.
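The two memory budgets of Eqs. 6.9 and 6.10 are easily compared; the following helper, with illustrative names, shows how quickly the PCM term becomes negligible under the minimum-memory scheme as N grows:

    import math

    def memory_words(N, scheme="min-arithmetic"):
        # Eqs. 6.9/6.10: double-buffered eight-bank PDM plus PCM.
        pdm = 2 * (8 * (N // 8))                  # = 2N words
        if scheme == "min-arithmetic":
            pcm = 3 * (N // 4)                    # three single-quadrant LUTs
        else:
            pcm = 3 * (math.isqrt(N) // 2)        # three complementary-angle LUTs
        return pdm + pcm

    for N in (4096, 16384, 65536):
        print(N, memory_words(N), memory_words(N, "min-memory"))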

6.4.3 Time-Complexity Considerations

The single-PE recursive architecture, as based upon the internally pipelined PE described in Sect. 6.4.1, exploits partitioned memory for the storage of both the data and the trigonometric coefficients so as to enable the GD-BFLY to produce output woctads at the rate of one per clock cycle. Therefore, the first temporal stage will be completed in N/8 + DSU clock cycles and subsequent temporal stages in either N/8 clock cycles or N/8 + DSM clock cycles, where the additional delay, DSM, provides the necessary safety margin needed to ensure that the outputs produced from each stage of GD-BFLYs are valid. This delay, as already stated, depends upon the relative lengths of the computational pipeline and the transform and may range, theoretically, from zero to as large as DPE. As a result, the N-point R24 FHT has a worst-case time-complexity, denoted T_FHT^(W), of

    T_FHT^(W) = (DSU + (1/8)N) + (α − 1) × (DSM + (1/8)N)
              = (DSU + (α − 1)·DSM) + (1/8)·N·log₄N    (6.11)

clock cycles, and a best-case or standard time-complexity, denoted T_FHT^(B), for when the safety margin delay, DSM, is not required, of

    T_FHT^(B) = (DSU + (1/8)N) + (α − 1) × (1/8)N
              = DSU + (1/8)·N·log₄N    (6.12)

clock cycles, given that α = log₄N. More generally, for any given combination of pipeline length and transform length, it should be a straightforward task to calculate the exact safety margin delay, DSM, required after each temporal stage in order to guarantee the generation of valid outputs, although for most parameter combinations of practical interest it will almost certainly be set to zero so that the time-complexity for each instance of the transform will be as given by Eq. 6.12. Note that for those cases of interest where the update time is less than the update period, as dictated by the data set refresh rate, the time delay to the production of the first complete output data set will be approximated by

    D_FHT^(F) ≈ N + (1/8)·N·log₄N    (6.13)

clock cycles (this including the time needed to initialize the PDM which, it is assumed, may be filled with new data, directly by the ADC, every N clock cycles), with a time period – namely, the update period – of N clock cycles between the subsequent production of consecutive N-sample output data sets.
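These timing expressions are captured by the short helper below – a sketch only, with DSU and DSM treated as assumed inputs rather than values taken from any particular hardware build:

    import math

    def fht_cycles(N, d_su=10, d_sm=0):
        # Eqs. 6.11/6.12: the first stage takes N/8 + DSU cycles; each of
        # the remaining alpha-1 stages takes N/8 + DSM cycles.
        alpha = round(math.log(N, 4))
        return (d_su + N // 8) + (alpha - 1) * (d_sm + N // 8)

    def first_output_delay(N, d_su=10):
        # Eq. 6.13: N cycles to fill the PDM plus one transform's update time.
        return N + fht_cycles(N, d_su)

    for N in (1024, 4096, 16384):
        print(N, fht_cycles(N), first_output_delay(N), "update period:", N)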

6.5 Performance and Requirements Analysis of FPGA Implementation

The theoretical complexity requirements discussed above have been proven in silicon by TRL Technology in the UK, who have produced an implementation of the R24 FHT on a Xilinx Virtex-II Pro 100 FPGA [10], running at close to 200 MHz, for use in various wireless communication systems. Although (at the time of writing of this new edition of the monograph) this particular device is now somewhat dated, it was, at the time of writing of the original monograph, a perfectly reasonable choice (as it would still be) for proving the mathematical/logical correctness of operation of the proposed solution and in illustrating its relative performance compared to that of commercially available complex-data FFT solutions.

A simple comparison with the state-of-the-art performances of the RFEL QuadSpeed FFT [6] and Roke Manor Research FFT solutions [7] (both multi-PE IP cores from the UK whereby a complex-data FFT may be used to process simultaneously two real-valued data sets so that packing/unpacking of the input/output data sets needs to be accounted for) is given in Table 6.5 for the case of 4096-point and 16,384-point real-data FFTs, where the RFEL and Roke Manor Research results are extrapolated from company data sheets and where the Version II solution of the R24 FHT described in Sect. 6.3.2.3 – using the minimum-arithmetic addressing scheme together with a nine-multiplier PE – is assumed for the TRL solution. Clearly, many alternatives to these two commercially available devices could have been used for the purposes of this comparison, but at the time these devices were both considered to be perfectly viable options with performances that were quite representative of this particular class of multi-PE streaming FFT solutions. The particular choice of real-from-complex strategy to be applied to the two commercially available solutions has been made to ensure that we compare like with like, or as close as we can make it, as the adoption of the DDC-based approach would introduce additional filtering operations to complicate the issue together with an accompanying processing delay. As a matter of interest, for an efficient implementation with the particular device used here, the Virtex-II Pro 100, a complex DDC with 84 dB of spurious-free dynamic range (SFDR) has been shown to require approximately 1700 slices of programmable logic [1].

Table 6.5 Performance and resource utilization for 4096-point and 16,384-point real-data radix-4 FFTs (clock frequency: 200 MHz for all solutions)

    4096-point FFT:
    Solution  Input word  18×18 multipliers    1K×18 RAMs (with       Logic slices    I/O speed        Update time per       Latency per
              length                           double buffering)                      (samples/cycle)  real-data FFT (μs)    real-data FFT (μs)
    TRL(a)    18          9 (2.0% capacity)    11 (2.5% capacity)     ~5000 (5.0%)    1                ~15 (1 channel)       ~15 (1 channel)
    RFEL(b)   12          30 (6.8% capacity)   33 (7.5% capacity)     ~5000 (5.0%)    4                ~10 (2 channels)      ~21 (2 channels)
    ROKE(b)   10          48 (10.8% capacity)  42 (9.5% capacity)     ~3800 (3.8%)    4                ~10 (2 channels)      ~21 (2 channels)

    16,384-point FFT:
    TRL(a)    18          9 (2.0% capacity)    44 (9.9% capacity)     ~5000 (5.0%)    1                ~72 (1 channel)       ~72 (1 channel)
    RFEL(b)   14          37 (8.3% capacity)   107 (24.1% capacity)   ~6500 (6.5%)    4                ~41 (2 channels)      ~83 (2 channels)
    ROKE(b)   12          55 (12.4% capacity)  124 (28.0% capacity)   ~5800 (5.8%)    4                ~41 (2 channels)      ~83 (2 channels)

    (a) DHT-to-DFT conversion not accounted for in figures
    (b) Packing/unpacking requirement not accounted for in figures

Although the performances – in terms of the update time and latency figures – are similar for the solutions described, it is clear from the respective I/O requirements that the RFEL and Roke Manor Research performance figures are achieved at the expense of having to process twice as much data at a time (two channels yielding two output sets instead of just one) as the TRL solution and (for the case of an N-point transform) having to execute N/2 radix-2 butterflies every N/2 clock cycles, so that the computational pipeline needs to be fed with data generated by the sampling system at the rate of N complex-valued (or 2N real-valued) samples every N/2 clock cycles. This means the need for a data set refresh rate of four times that of the TRL solution, which will necessitate a significantly faster sampling system which might in turn involve the use of multiple ADC units.

The results highlight the fact that although the computational densities of the three solutions are not that dissimilar, the TRL solution is considerably more resource-efficient, requiring a small fraction of the memory and fast multiplier requirements of the other two solutions in order to satisfy the latency constraint, whilst the logic requirement – as required for controlling the operation and interaction of the various components of the FPGA implementation – which increases significantly with transform length for the RFEL and Roke Manor Research solutions, remains relatively constant with the TRL solution. The scalable nature of the TRL solution means also that only the memory requirement needs substantially changing when going from one application to another (and thus from one transform length to another in order to reflect the increased/decreased quantity of data needing to be processed), making the cost of adapting the solution for new applications negligible. For longer transforms, better use of the resources could probably be achieved by trading off memory requirement against the required number of fast multipliers through the choice of a more memory-efficient addressing scheme – as discussed above in Sect. 6.3.

Note that in order to support the continuous real-time operation of the TRL solution, it has been necessary that the memory be ‘double-buffered’ whereby functions performed on two equally sized regions of memory alternate with successive input data sets, with one region of memory being filled with new data whilst the data already stored in the other is being processed. Thus, for a given double-buffered memory, the ‘active’ region is defined within this monograph (and referred to hereafter as such) as being that region of memory whose contents are currently available for processing, whilst the ‘passive’ region is defined within this monograph (and referred to hereafter as such) as being that region of memory currently available for the storing of new data.

6.6 Derivation of Range of Validity for Regularized FHT

An important point to note is that most, if not all, of the commercially available FFT solutions are multi-PE solutions geared to streaming operation, where achieving and maintaining continuous real-time operation equates to constraining or minimizing the update time – so as to maximize the throughput – rather than satisfying some constraint on the latency, as has been addressed in this monograph with the design of the R24 FHT. In fact, the point should perhaps again be made here that, with the silicon-based performance objective, as stated in Sect. 5.7 of Chap. 5, it was required that for a valid or realizable solution the latency (or, equivalently in this case, the update time) of the transform should be less than the update period, as dictated by the data set refresh rate and assumed here, for a transform of length N and an I/O rate of one sample per clock cycle, to be N clock cycles. Thus, from Eq. 6.12 (and ignoring, for the moment, the combined effect of the various small timing delays), the length of the R24 FHT is limited to those values of N for which

$$\tfrac{1}{8}N\log_4 N < N, \qquad (6.14)$$

where the left-hand side of the inequality approximates the update time (in clock cycles) and the right-hand side corresponds to the update period (in clock cycles), as dictated by the data set refresh rate, so that the transform is restricted to those data sets for which N ≤ 16,384. From this constraint it may be deduced that the larger the size of the data set, the smaller the size of the 'safety margin' (i.e. the difference between the update time and the update period) and the more problematic the timing issues – such as those discussed in Sect. 6.4.3 of Chap. 6 – are likely to be.

For transforms whose length exceeds this upper limit of 16,384, it would be necessary to increase the throughput rate by some appropriate means in order that continuous real-time operation might still be achieved and maintained. With a pipelined FFT approach, exploiting multiple PEs, continuous real-time operation is achieved and maintained for any given transform length by increasing the throughput rate through the minimization of the update time – which effectively means increasing the length of the computational pipeline in line with the transform length and, if necessary, exploiting additional parallelism within each stage of the pipeline. With the R24 FHT approach, where the latency and the update time are equivalent, the block-based nature of the processing means that continuous real-time operation may only be achieved and maintained, for those transforms whose length exceeds the upper limit of 16,384, by having either multiple versions of the R24 FHT applied to consecutive input data sets, in turn, or, as with the commercially available solutions, by having the R24 FHT exploit multiple PEs in its design instead of just one.

The problem with the multi-PE approach is that it would mean having to modify the single-PE recursive architecture for each transform length of interest, so that the attractions of scalability, generality and simplicity, for both its design and its operation, would be lost. With a multi-R24 FHT solution, on the other hand, where the throughput rate of each individual R24 FHT is unable to keep up with the data set refresh rate, two or more such R24 FHTs would be applied to consecutive input data sets, in turn, to produce interleaved output data sets, thereby achieving the desired throughput rate through the simple replication of silicon resources. Thus, when two R24 FHTs are required, one (with its own double-buffered DSM) would be assigned the task of processing all the even-addressed input data sets, whilst the other (also with its own double-buffered DSM) would be assigned the task of processing all the odd-addressed input data sets. The data set refresh rate for each individual R24 FHT would be reduced in proportion to the number of R24 FHTs used – two, for this particular example – thereby enabling the throughput rate of each individual R24 FHT to keep up with the reduced refresh rate.

Reducing the data set refresh rate of each R24 FHT in this way means that, for the case of two R24 FHTs operating in parallel, the permissible latency of each R24 FHT will now be bounded above by 2N clock cycles, rather than N clock cycles, which would extend the real-time capability from one able to cater for those data sets for which N ≤ 16,384, as described above, to one now able to cater for those data sets for which N ≤ 4¹⁵ – albeit achieved at the cost of a doubling of the silicon resources through the need for two R24 FHTs. Continuing in this fashion, the adoption of three R24 FHTs operating in parallel would further extend the real-time capability to one able to cater for those data sets for which N ≤ 4²³.

The multi-R24 FHT approach could also be used to some advantage when applied to the computation of the complex-data DFT, as discussed in Sect. 3.4.2 of Chap. 3, where one R24 FHT would be assigned the task of processing the real component of the data set and another the task of processing the imaginary component. A suitably defined conversion routine would then be used to combine the two sets of real-valued Hartley-space outputs to produce a single set of complex-valued Fourier-space outputs. A highly parallel 'dual-R24 FHT' solution such as this would be able to achieve, for the complex-data case, the same eight-fold speed-up over a purely sequential solution as already achieved for the real-data case, at the cost of a doubling of the silicon resources – thus it would possess the attractive linear property of requiring twice the amount of silicon resources in order to achieve a doubling of the throughput rate.
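The upper limits quoted above may be verified directly from the update-time approximation – a short worked rearrangement of Eq. 6.14, writing P for the number of R24 FHTs operating in parallel (so that the permissible latency becomes PN clock cycles):

$$\tfrac{1}{8}N\log_4 N < PN \iff \log_4 N < 8P \iff N \le 4^{8P-1},$$

giving N ≤ 4⁷ = 16,384 for P = 1, N ≤ 4¹⁵ for P = 2 and N ≤ 4²³ for P = 3, in agreement with the figures quoted above.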

6.7 Discussion

The outcome of this chapter has been the specification of a single-PE recursive architecture for the parallel computation of the R24 FHT where the resulting solutions are resource-efficient, scalable and device-independent, being able to achieve a high computational density as time-complexity is traded off against space-complexity – as required by the silicon-based performance objective stated in Sect. 5.7 of Chap. 5. This has involved the exploitation of fine-grained pipelining of the PE, partitioned memory for the storage of both the data and the trigonometric coefficients and the specification of conflict-free parallel memory addressing schemes for both the data (for which the addressing is also in-place) and the trigonometric coefficients.

As a result, the R24 FHT has been able to achieve a high computational throughput through the exploitation of two levels of parallel processing via a parallel-pipelined approach: (1) 'fine-grained' pipelining at the arithmetic level for the internal operation of the PE and (2) SIMD processing for the simultaneous execution of the multiple arithmetic operations to be performed within each stage of the fine-grained computational pipeline. These features, when combined, enable the GD-BFLY to produce output woctads at the rate of one per clock cycle, so that an O(N·log₄N) time-complexity (denoting the latency or, equivalently in this case, the update time) is achieved for the N-point R24 FHT, which leads to an approximate figure of (N/8)·log₄N clock cycles after taking into account the eight-fold parallelism introduced via the adoption of the partitioned data memory.

Four versions of the PE have been described, with each being a simple variation of the same basic design and each compatible with the single-PE recursive computing architecture. These provide the user with the ability to optimize the space-complexity by trading off the arithmetic component, in terms of both adders and fast fixed-point multipliers, against the memory component, with a theoretical performance and resource comparison of the resulting four solutions being provided in tabular form. The mathematical/logical correctness of the operation of all four versions – referred to as Versions I, II, III and IV – of the solution has been proven in software via a computer programme written in the 'C' programming language – see Appendices A and B for details.

Silicon-based implementations of 4096-point and 16,384-point transforms have been produced and studied, each using Version II of the R24 FHT solution, which uses the nine-multiplier version of the PE together with the minimum-arithmetic addressing scheme. The R24 FHT results were seen to compare very favourably with those of two commercially available IP cores, both multi-PE pipelined solutions, with both the 4096-point and 16,384-point transforms achieving the stated performance objective whilst requiring greatly reduced silicon resources compared to their commercial counterparts. Although the target computing device may now be somewhat old, it was more than adequate for the purpose, which was simply to facilitate comparison of the relative merits of the single-PE and multi-PE architectures. As already stated, with real-world applications it is not always possible, for various practical/financial reasons, to have access to the latest device technologies. Such a situation does tend to focus the mind, however, forcing one to work within whatever silicon budget one happens to have been dealt.

Note that a number of single-PE designs for the fixed-radix FFT [2, 8, 9], along the lines of that discussed in this chapter for the R24 FHT, have already appeared in the technical literature in recent years for the more straightforward complex-data case, each such solution using a simplified version of the memory addressing scheme discussed here, whereby multi-bank or partitioned memory is again used to facilitate the parallel computation of the algorithm.
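Returning to the throughput figure quoted above, a minimal helper in 'C' (illustrative only – the function name is hypothetical and not taken from the book's software) that evaluates the update time and the real-time validity test of Eq. 6.14:

```c
#include <stdio.h>

/* Approximate update time, in clock cycles, of the N-point R24FHT:
   (N/8).log4(N), per the discussion above; N is assumed a power of 4. */
static unsigned long r24fht_update_time(unsigned long n)
{
    unsigned long stages = 0;                 /* stages = log4(N) */
    for (unsigned long m = n; m > 1; m >>= 2)
        stages++;
    return (n / 8) * stages;
}

int main(void)
{
    /* Eq. 6.14: real-time operation requires update time < N cycles */
    for (unsigned long n = 64; n <= 65536; n <<= 2)
        printf("N = %6lu: update time = %6lu cycles, real-time: %s\n",
               n, r24fht_update_time(n),
               (r24fht_update_time(n) < n) ? "yes" : "no");
    return 0;
}
```

Running the sketch confirms the boundary cases: for N = 16,384 the update time is 14,336 cycles (valid), whilst for N = 65,536 it is 65,536 cycles and the inequality fails.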


Another important property of the proposed set of R24 FHT designs discussed here is that they are able, via the application of the block floating-point scaling technique, to optimize the achievable dynamic range of the Hartley-space (and thus Fourier-space) outputs and therefore to outperform the more conventional streaming FFT solutions which, given the need to process the data as and when it arrives, are restricted to the use of various fixed scaling strategies in order to address the fixed-point overflow problem. With fully optimized streaming operation, the application of block floating-point scaling would involve having to stall the optimal flow of data through the computational pipeline, as the entire set of outputs from each stage of butterflies would need to be passed through the 'maximum' function in order for the required common exponent to be found. As a result, the block-based nature of the single-PE recursive operation of the R24 FHT means that it is also able to produce higher-accuracy transform-space outputs than is achievable by its multi-PE FFT counterparts.

Finally, note that for the R24 FHT designs considered in this chapter, it is assumed that the DBR-based reordering of the input data produced by the external input data source, the ADC unit, may be carried out sequentially according to one of the techniques described in Sect. 2.4 of Chap. 2, such that the reordering of an N-sample input data set may be comfortably carried out within the update period of N clock cycles, as dictated by the data set refresh rate. A continuous real-time performance is achieved and maintained through the use of double-buffered memory within the PE, whereby one N-sample input data set is being reordered and written to one set of PDM banks whilst another reordered N-sample data set, its predecessor, is being processed by the arithmetic components of the R24 FHT, with the functions performed on the contents of the two sets of memory banks being interchanged after the completion of each R24 FHT. The question of data reordering – as required for dealing with both 1-D and m-D problems – is to be dealt with in considerably more detail in Chap. 10, where the partitioned nature of the memory used by the R24 FHT for the storage of the data is fully exploited in order to parallelize the reordering and transfer of data from one partitioned memory to another.
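To make the 'maximum'-function step concrete, the following is a minimal sketch (an assumption-laden illustration, not the author's hardware design) of how a common block exponent adjustment might be derived before each stage of butterflies:

```c
#include <stddef.h>
#include <stdint.h>

/* Scan a block of fixed-point samples and derive the common right-shift
   needed so that a worst-case growth of growth_bits bits through the next
   stage of butterflies cannot overflow a word_bits-wide two's complement
   word. The returned shift is applied to the whole block and folded into
   the block exponent. */
static int block_exponent_shift(const int32_t *x, size_t n,
                                int word_bits, int growth_bits)
{
    int32_t max_mag = 0;                      /* the 'maximum' function */
    for (size_t i = 0; i < n; i++) {
        int32_t mag = (x[i] < 0) ? -x[i] : x[i];
        if (mag > max_mag)
            max_mag = mag;
    }
    int headroom = 0;                         /* unused bits below the sign bit */
    while (headroom < word_bits - 1 &&
           max_mag < (INT32_C(1) << (word_bits - 2 - headroom)))
        headroom++;
    int shift = growth_bits - headroom;
    return (shift > 0) ? shift : 0;
}
```

The whole-block scan is precisely the step that would stall a streaming pipeline, which is why the block-based R24 FHT can exploit it where a streaming FFT cannot.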

References

1. R. Hosking, New FPGAs tackle real-time DSP tasks for defense applications (Defense & Aerospace, Boards & Solutions, November 2006)
2. L.G. Johnson, Conflict-free memory addressing for dedicated FFT hardware. IEEE Trans. Circ. Syst. II: Analog Digit. Signal Process. 39(5), 312–316 (May 1992)
3. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (September 2007)
4. K.J. Jones, The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments, Series on Signals & Communication Technology (Springer, 2010)
5. W. Moore, A. McCabe, R. Urquhart, Systolic Arrays (Adam Hilger, 1987)


6. RF Engines Ltd., "IP Cores – Xilinx FFT Library", product information sheet available at company web site: www.rfel.com
7. Roke Manor Research Ltd., "Ultra High Speed Pipeline FFT Core", product information sheet available at company web site: www.roke.com
8. B.S. Son, B.G. Jo, M.H. Sunwoo, Y.S. Kim, A high-speed FFT processor for OFDM systems. Proc. IEEE Int. Symp. Circ. Syst. 3, 281–284 (2002)
9. C.H. Sung, K.B. Lee, C.W. Jen, Design and implementation of a scalable fast Fourier transform core. Proc. 2005 IEEE Asia and South Pacific Design Automation Conf., 920–923 (2005)
10. Xilinx Inc., company and product information available at company web site: www.xilinx.com

Chapter 7: Design of CORDIC-Based Processing Element for Regularized Fast Hartley Transform

7.1 Introduction

A detailed account was provided in Chap. 6 of how the R24 FHT could be mapped onto a parallel structure based upon the single-PE recursive computing architecture, with fine-grained pipelining of the PE and the adoption of partitioned memory for the storage of the data and the trigonometric coefficients. Conflict-free parallel memory addressing schemes were also specified for both the data (for which the addressing is also in-place) and the trigonometric coefficients to enable the computational power of the silicon-based parallel computing technologies, as discussed in Chap. 5, to be effectively exploited, with each solution able to exploit SIMD processing within each stage of the computational pipeline to further enhance performance. Four versions of this highly parallel solution to the DHT and the real-data DFT have been produced – Versions I–IV – where the associated PEs are each based upon the same basic design and where the solution capabilities range from optimality in terms of the arithmetic requirement to optimality in terms of the memory requirement, although the common feature of all four versions is that they each involve the use of a fast fixed-point multiplier.

No consideration has been given, as yet, as to whether an arithmetic unit based upon the fast multiplier is always the most appropriate to adopt or, when such an arithmetic unit is used, how the fast multiplier might best be implemented. With the use of FPGA technology, however, the fast multiplier is typically available to the user as an embedded resource which, although expensive in terms of silicon resources, is becoming increasingly more power-efficient and is therefore the logical solution to adopt. A problem may arise in practice, however, when the length of the transform to be computed is very large when compared to the capability of the target computing device, such that there are insufficient embedded resources – in terms of fast multipliers, fast RAM or both – to enable a successful or attractive mapping of the transform (and of those additional DSP functions both preceding and succeeding the transform) onto the device to take place. In such a situation, where the use of a larger and more powerful device is simply not an option, it is thus required that some means be found of facilitating a successful mapping onto the available device, and one way of achieving this is through the design of a more appropriate arithmetic unit – namely, one which does not rely too heavily upon the use of embedded resources.

The choice of which type of arithmetic unit to adopt for the proposed resource-constrained solution has been made in favour of the CORDIC unit, rather than the DA unit, as the well-documented optimality of CORDIC arithmetic for the operation of phase rotation [1], as is shown to be the required operation here, combined with the ability to generate the rotation angles that correspond to the trigonometric coefficients very efficiently, on-the-fly, with a trivial memory requirement, makes it the obvious candidate to pursue – the DA unit would inevitably involve a considerably larger memory requirement due to the storage of the precomputed sums or inner products. A number of attractive CORDIC-based FPGA solutions to the FFT have appeared in the technical literature in recent years, albeit for the more straightforward complex-data case, with two such solutions discussed in references [2, 12].

Note that the sizing to be carried out in this chapter for the various R24 FHT solutions, including those based upon both the fast fixed-point multiplier and the CORDIC phase rotator, is to be performed for hypothetical FPGA implementations exploiting only programmable logic, rather than embedded resources, in order to facilitate their comparison.

7.2 Accuracy Considerations

To obtain L-bit accuracy in the GD-BFLY outputs, it will be necessary to retain sufficient bits out of the multipliers as well as to use sufficient guard bits in order to protect both the least significant bit (LSB) and the most significant bit (MSB). This is due to the fact that with fixed-point processing, the accuracy may be degraded through the possible word growth of one bit with each stage of adders. For the MSB, the guard bits correspond to those higher-order (initially unoccupied) bits, appended to the left of the L most significant data bits out of the multipliers, which could in theory, after completion of the stages of GD-BFLY adders, contain the MSB of the output data. For the LSB, the guard bits correspond to those lower-order (initially occupied) bits, appearing to the right of the L most significant data bits out of the multipliers, which could in theory, after completion of the stages of GD-BFLY adders, affect or contribute to the LSB of the output data.

Thus, the possible occurrence of truncation errors due to the three stages of adders is accounted for by varying the lengths of the registers as the data progresses across the computational stages of the PE. Allowing for word growth in this fashion permits the application of block floating-point scaling [10] – as discussed in Sect. 4.8 of Chap. 4 – prior to each stage of GD-BFLYs, thereby enabling the dynamic range of any signals present in the data to be maximized at the output of the R24 FHT.

7.3 Fast Multiplier Approach

Apart from the potentially large PCM requirement associated with the four PE designs discussed in Chaps. 4 and 6, an additional limitation relates to their relative inflexibility, in terms of the achievable arithmetic precision, due to their reliance on the embedded fast fixed-point multiplier. For example, when the word length, 'L', of the two multiplicands exceeds the word-length capability, 'K', of the embedded multiplier, it would typically be necessary to use four embedded multipliers to carry out four K × K multiplications (assuming that K < L ≤ 2K), the outputs of which would then be combined, via the use of two 2K-bit adders, to produce a single 4K-bit output from which the required result would then be extracted. The embedded multiplier used on the Xilinx Virtex-II Pro 100 FPGA, for example, as used in the illustrative examples of Sect. 6.5 of Chap. 6, is restricted to the handling of two 18-bit two's complement multiplicands. To be able to cater for different applications, therefore, requiring different levels of arithmetic precision – typically lying between 16-bit and 32-bit – the arithmetic unit would have to cater for the worst-case situation (thus requiring four embedded multipliers per multiplication) and might therefore be far from optimal (in terms of wasted, costly embedded resources) for some of those lower-precision applications.

When implemented on an FPGA in programmable logic, it is to be assumed that one L × L pipelined multiplier will require of the order of 5L²/8 slices [3, 4] in order to produce a new output each clock cycle, whilst one L-bit adder will require just L/2 slices [4]. The PCM will require L-bit RAM, with the single-port version involving L/2 slices and the dual-port version involving L slices [9]. These simplistic logic-based complexity figures will be used later in the chapter for carrying out sizing comparisons of the PE designs discussed in this and the previous chapters.

To obtain L-bit accuracy in the outputs of the twelve-multiplier version of the GD-BFLY, which involves three stages of adders, it is necessary that L + 3 bits be retained from the multipliers, each of size L × L, in order to guard the LSB, whilst the first stage of adders is carried out to (L + 4)-bit precision, the second stage to (L + 5)-bit precision and the third stage to (L + 6)-bit precision, in order to guard the MSB, at which point the data is scaled to yield the L-bit results. Similarly, to obtain L-bit accuracy in the outputs of the nine-multiplier version of the GD-BFLY, which involves four stages of adders, it is necessary that the first stage of adders (preceding the multipliers) be carried out to (L + 1)-bit precision, with L + 4 bits being retained from the multipliers, each of size (L + 1) × (L + 1), whilst the second stage of adders is carried out to (L + 5)-bit precision, the third stage to (L + 6)-bit precision and the fourth stage to (L + 7)-bit precision, at which point the data is scaled to yield the L-bit results.
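As an illustration of the four-multiplier decomposition just described (a sketch under stated assumptions, not the book's hardware design): with K = 18, as on the Virtex-II Pro, an operand a may be split as a = aH·2^K + aL, so that a·b = aH·bH·2^2K + (aH·bL + aL·bH)·2^K + aL·bL. The version below is written for operands of up to 24 bits so that every partial product fits comfortably within 64-bit arithmetic:

```c
#include <stdint.h>

/* Compose one wide multiplication from four K x K "embedded" multiplies,
   K = 18. The low K bits are treated as unsigned and the remaining high
   bits as signed - a standard two's complement split (arithmetic right
   shifts assumed). Valid here for operands of up to 24 bits, i.e. the
   precisions discussed in the text. */
static int64_t wide_mul_from_18x18(int32_t a, int32_t b)
{
    const int K = 18;
    const int64_t mask = (INT64_C(1) << K) - 1;
    int64_t aL = a & mask, aH = a >> K;
    int64_t bL = b & mask, bH = b >> K;
    int64_t ll = aL * bL;                    /* the four K x K products */
    int64_t lh = aL * bH;
    int64_t hl = aH * bL;
    int64_t hh = aH * bH;
    /* two wide additions recombine the partial products */
    return (hh << (2 * K)) + ((lh + hl) << K) + ll;
}
```

The point of the exercise is the cost: a single over-width multiplication consumes four embedded multipliers plus two wide adders, which is precisely the worst-case provisioning burden described above.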


Thus, given that the twelve-multiplier version of the GD-BFLY involves a total of twelve pipelined multipliers, six stage-one adders, eight stage-two adders and eight stage-three adders, the PE can be constructed with an arithmetic-based logic requirement, denoted $L_A^{M12}$, of

$$L_A^{M12} \approx \tfrac{1}{2}\left(15L^2 + 22L + 112\right) \qquad (7.1)$$

slices, whilst the nine-multiplier version of the GD-BFLY, which involves a total of three stage-one adders, nine pipelined multipliers, six stage-two adders, eight stage-three adders and eight stage-four adders, requires an arithmetic-based logic requirement, denoted $L_A^{M9}$, of

$$L_A^{M9} \approx \tfrac{1}{8}\left(45L^2 + 190L + 548\right) \qquad (7.2)$$

slices.

These figures, together with the PCM requirement – as discussed in Sect. 6.3 of Chap. 6 and given by Eqs. 6.5 and 6.6 – will combine to form the benchmarks with which to assess the merits of the hardware-based arithmetic unit now discussed.
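As a quick illustrative evaluation of Eqs. 7.1 and 7.2 (arithmetic logic only, for a representative word length of L = 18, matching the embedded multiplier width):

$$L_A^{M12} \approx \tfrac{1}{2}(15\cdot 324 + 22\cdot 18 + 112) = 2684 \text{ slices}, \qquad L_A^{M9} \approx \tfrac{1}{8}(14580 + 3420 + 548) \approx 2319 \text{ slices},$$

so that, before the PCM requirement is accounted for, the nine-multiplier PE is the cheaper of the two in programmable logic.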

7.4 CORDIC Arithmetic Approach

The CORDIC algorithm [13] is an arithmetic technique used for carrying out two-dimensional vector rotations. Its relevance here is in its ability to carry out the phase rotation of a complex number, as this will be seen to be the underlying operation required by the GD-BFLY. The vector rotation, which is a convergent linear process, is performed very simply as a sequence of elementary rotations with an ever-decreasing elementary rotation angle where each elementary rotation, which can be carried out using just shift and add-subtract operations, yields one extra bit of accuracy in the final result.

7.4.1 CORDIC Formulation of Complex Multiplier

For carrying out the particular operation of phase rotation, a vector (X, Y) is rotated by an angle θ to obtain the new vector (X′, Y′). For the n'th elementary rotation, the fixed elementary rotation angle, $\arctan(2^{-n})$, which is stored within a suitably defined read-only memory (ROM) or LUT, is subtracted/added from/to the angle remainder, $\theta_n$, so that the angle remainder approaches zero with increasing 'n'. The mathematical relations for the conventional non-redundant CORDIC rotation operation [1] are as given below via the four sets of equations:


1. Phase rotation operation:

$$X' = \cos(\theta)\,X - \sin(\theta)\,Y, \qquad Y' = \cos(\theta)\,Y + \sin(\theta)\,X, \qquad \theta' = 0 \qquad (7.3)$$

2. Phase rotation operation as sequence of elementary rotations:

$$X' = \prod_{n=0}^{K-1}\cos\left(\arctan\left(2^{-n}\right)\right)\left(X_n - \sigma_n\,Y_n\,2^{-n}\right)$$
$$Y' = \prod_{n=0}^{K-1}\cos\left(\arctan\left(2^{-n}\right)\right)\left(Y_n + \sigma_n\,X_n\,2^{-n}\right) \qquad (7.4)$$
$$\theta' = \theta - \sum_{n=0}^{K-1}\sigma_n\arctan\left(2^{-n}\right)$$

3. Expression for n'th elementary rotation:

$$X_{n+1} = X_n - \sigma_n\,2^{-n}\,Y_n, \qquad Y_{n+1} = Y_n + \sigma_n\,2^{-n}\,X_n, \qquad \theta_{n+1} = \theta_n - \sigma_n\arctan\left(2^{-n}\right) \qquad (7.5)$$

where $\sigma_n$ is either +1 or −1, for non-redundant CORDIC, depending upon the sign of the angle remainder term, denoted here as $\theta_n$.

4. Expression for CORDIC magnification factor:

$$M = \prod_{n=0}^{K-1}\frac{1}{\cos\left(\arctan\left(2^{-n}\right)\right)} = \prod_{n=0}^{K-1}\sqrt{1 + 2^{-2n}} \approx 1.647 \qquad (7.6)$$

for large 'K' – the number of elementary rotations – which may need to be scaled out of the rotated output in order to preserve the correct amplitude of the phase-rotated complex number. The choice of 'non-redundant' CORDIC, whereby the value of the term $\sigma_n$ is allowed to be either +1 or −1, rather than a 'redundant' version, whereby the value of the term $\sigma_n$ is also allowed to be 0, ensures that the value of the magnification factor, which is a function of the number of iterations, is independent of the rotation angle being applied and is therefore fixed for every instance of the GD-BFLY, whether it is of Type I, Type II or Type III – for the definitions see Sect. 4.3 of Chap. 4.

7.4.2 Parallel Formulation of CORDIC-Based PE

From Eq. 7.5, the CORDIC algorithm requires one pair of shift/add-subtract operations and one add-subtract operation for each bit of accuracy. When implemented sequentially [1], therefore, the CORDIC unit implements these elementary operations, one after another, using a single PCS and feeding the output from one iteration as the input to the next iteration in a ‘recursive’ fashion whereby each new recursion introduces one extra bit of accuracy in the final result. A sequential or recursive CORDIC unit with L-bit output has a latency of L clock cycles and produces a new output every L clock cycles. On the other hand, when this recursion is unfolded, a computational pipeline results – see Fig. 7.1 – which enables the CORDIC unit to implement the required elementary operations in a parallel manner [1] using an array of identical PCSs. A parallel CORDIC unit operated as a computational pipeline with L-bit output has a latency of L clock cycles but produces a new output every clock cycle – that is, it has an update time of one clock cycle – provided that all the operations required by each stage of the pipeline may be carried out in parallel. An attraction of the parallel-pipelined architecture is that the shifters in each PCS involve a fixed right shift, so that they may be implemented very efficiently in the wiring. Also, the elementary rotation angles may be distributed as constants, one to each PCS, so that they may also be hardwired. As a result, the entire CORDIC rotator may be reduced to an array of interconnected add-subtract units. Pipelining is achieved by inserting registers between the add and subtract units, although with most FPGA architectures, there are already registers present in each logic cell, so that the addition of the pipeline registers involves no additional hardware cost.

7.4.3 Discussion of CORDIC-Based Solution

The twelve-multiplier version of the GD-BFLY produces eight outputs from eight inputs, these samples denoted by (X1,Y1) through to (X4,Y4), with the multiplication stage of the GD-BFLY comprising twelve real multiplications which, together with the accompanying set of additions/subtractions, may be expressed for the case of the standard Type-III GD-BFLY via the three sets of equations: 

$$\begin{bmatrix} X_2' \\ Y_2' \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ \sin\theta & -\cos\theta \end{bmatrix}\begin{bmatrix} X_2 \\ Y_2 \end{bmatrix} \qquad (7.7)$$

$$\begin{bmatrix} X_3' \\ Y_3' \end{bmatrix} = \begin{bmatrix} \cos 2\theta & \sin 2\theta \\ \sin 2\theta & -\cos 2\theta \end{bmatrix}\begin{bmatrix} X_3 \\ Y_3 \end{bmatrix} \qquad (7.8)$$


[Fig. 7.1 Pipeline architecture for CORDIC rotator. The recursion is unfolded into a cascade of identical stages: stage n (n = 0, 1, ..., N−1) applies the fixed right shifts X̃ₙ = Xₙ ≫ n and Ỹₙ = Yₙ ≫ n, forms Xₙ ± Ỹₙ and Yₙ ± X̃ₙ under the control of sign(Zₙ), and updates the angle remainder via Zₙ ± αₙ, the elementary angles αₙ being hardwired (LUT) constants.]


$$\begin{bmatrix} X_4' \\ Y_4' \end{bmatrix} = \begin{bmatrix} \cos 3\theta & \sin 3\theta \\ \sin 3\theta & -\cos 3\theta \end{bmatrix}\begin{bmatrix} X_4 \\ Y_4 \end{bmatrix} \qquad (7.9)$$

where clearly θ is the single-angle, 2θ the double-angle and 3θ the triple-angle rotation angle. These sets of equations are equivalent to what would be obtained if we multiplied the complex number interpretations of (X2, Y2) by $e^{i\theta}$, (X3, Y3) by $e^{i2\theta}$ and (X4, Y4) by $e^{i3\theta}$, followed, for the case of the standard Type-III GD-BFLY, by negation of the components Y2, Y3 and Y4.

As with the nine-multiplier and twelve-multiplier versions of the GD-BFLY, there are minor changes to the operation of the GD-BFLY, from one instance to another, in terms of the definitions of the first three address permutations, with one of the two slightly different versions being appropriately selected for each according to the particular 'Type' of GD-BFLY being executed – see Table 4.1 of Chap. 4. In addition, however, there are also minor changes required to the outputs of the CORDIC units in that if the GD-BFLY is of Type I, then the components Y2, Y3 and Y4 do not need to be negated, whereas if the GD-BFLY is of Type II, then only component Y4 needs to be negated, and if the GD-BFLY is of Type III, as discussed in the previous paragraph, then all three components need to be negated.

Note, however, that the outputs will have grown due to the CORDIC magnification factor, M, of Eq. 7.6, so that this growth needs to be adequately accounted for within the GD-BFLY. The most efficient way of achieving this would be to allow the growth to remain within components (X2, Y2) through to (X4, Y4) and for the components (X1, Y1) to be scaled multiplicatively by the term M, this being achieved with just two constant coefficient multipliers – see Fig. 7.2.

[Fig. 7.2 Signal flow graph for CORDIC-based version of generic double butterfly. The input data vector passes through the address permutations Φ1–Φ4, with the first two inputs routed through a fixed scaler and the remaining inputs through three un-scaled CORDIC rotators driven by the (negated) rotation angles, selected outputs then being negated ahead of the output data vector.]


This would result in a growth of approximately 1.647 in all the eight inputs to the second address permutation Φ2. Note that scaling by such a constant differs from the operation of a standard fast multiplier in that Booth encoding/decoding circuits are no longer required, whilst efficient recoding methods [5] can be used to further reduce the logic requirement of the simplified operation to approximately one third that of the standard fast fixed-point multiplier.

An obvious attraction of the CORDIC-based approach is that the GD-BFLY only requires knowledge of the single-angle, double-angle and triple-angle rotation angles, so that there is no longer any need to construct, maintain and access the potentially large LUTs required for the storage of the trigonometric coefficients – that is, for the storage of sampled versions of the sinusoidal function with argument defined from 0 up to π/2 radians. As a result, the radix-4 factorization of the CORDIC-based FHT may be expressed very simply, with the updating of the rotation angles for the execution of each instance of the GD-BFLY being performed on-the-fly and involving only additions and subtractions.

The optimum throughput for the GD-BFLY is achieved with the parallel-pipelined hardwired solution of Fig. 7.3, whereby each PCS of the pipeline uses nine add-subtract units to carry out simultaneously the three elementary phase rotations – note that in the figure, the superscripts 'S', 'D' and 'T' stand for single-angle, double-angle and triple-angle, respectively. Due to the decomposition of the original rotation angle into K elementary rotation angles, it is clear that execution of the phase rotation operation can only be approximated, with the accuracy of the outputs of the last iteration being limited by the magnitude of the last elementary rotation angle applied. Thus, if L-bit accuracy is required of the rotated output, one would expect the number of iterations, K, to be chosen so that K = L, as the right shifts carried out in the K'th (and last) iteration would be of length L−1. This, in turn, necessitates two guard bits on the MSB and log₂L guard bits on the LSB. The MSB guard bits cater for the magnification factor of Eq. 7.6 and the maximum possible range extension of $\sqrt{2}$, whilst the LSB guard bits cater for the accumulated rounding error from the L iterations.
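As a small worked instance of these sizing rules (illustrative only), take L = 16: then K = 16 iterations are used and the working registers require

$$L + 2 + \log_2 L = 16 + 2 + 4 = 22 \text{ bits},$$

the two MSB guard bits covering the worst-case growth of $1.647 \times \sqrt{2} \approx 2.33 < 4$ and the four LSB guard bits absorbing the rounding error accumulated over the 16 shift-and-add iterations.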

[Fig. 7.3 Computational stage of pipeline for CORDIC rotator with scalar inputs. Each PCS operates simultaneously on the three channels (Xˢ, Yˢ, Zˢ), (Xᴰ, Yᴰ, Zᴰ) and (Xᵀ, Yᵀ, Zᵀ); within each channel the fixed right shifts X̃ = X ≫ n and Ỹ = Y ≫ n are applied, X ± Ỹ and Y ± X̃ are formed under the control of sign(Z), and the angle remainder is updated via Z ± αₙ, where n and αₙ are fixed for the stage.]


Note also, from the definition of the elementary rotation angles,

$$\tan(\theta_n) = 2^{-n}, \qquad (7.10)$$

that the CORDIC algorithm is known to converge over the range −π/2 ≤ θ ≤ +π/2, so that in order to cater for rotation angles between ±π, an additional rotation angle of ±π/2 may need to be applied prior to the application of the elementary rotation angles in order to ensure that the algorithm converges, thus increasing the number of iterations from K = L to K = L + 1. This may be very simply achieved, however, via application of the equations:

$$X' = -\sigma\,Y, \qquad Y' = +\sigma\,X, \qquad \theta' = \theta + \sigma\left(\tfrac{1}{2}\pi\right) \qquad (7.11)$$

where

$$\sigma = \begin{cases} +1 & Y < 0 \\ -1 & \text{otherwise}, \end{cases} \qquad (7.12)$$

whenever the rotation angle lies outside the range of convergence, with the above equations being carried out via precisely the same components and represented by means of precisely the same SFG as those equations – namely, Eqs. 7.3, 7.4, 7.5 and 7.6 – corresponding to the elementary rotation angles.
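A minimal software sketch of the rotation just described – fixed-point, non-redundant, with the ±π/2 pre-rotation keyed here off the sign of the residual angle (i.e. Eq. 7.11 with σ = −sign(θ)) – written in the same 'C' register as the proof-of-concept software referred to elsewhere in the monograph. It is an illustration of Eqs. 7.5 and 7.10–7.12, not the author's pipelined RTL; the Q-format choices, iteration count and the assumption of a POSIX math.h providing M_PI_2 (and of arithmetic right shifts) are all ours:

```c
#include <math.h>
#include <stdint.h>

#define K_ITER 24                              /* plays the role of 'K' */

static int32_t alpha[K_ITER];                  /* arctan(2^-n) in Q29 */

void cordic_init(void)
{
    for (int n = 0; n < K_ITER; n++)
        alpha[n] = (int32_t)lrint(atan(ldexp(1.0, -n)) * (1 << 29));
}

/* Rotate (x,y), held in Q27, by theta radians held in Q29. The outputs
   carry the magnification M ~ 1.647 of Eq. 7.6, to be scaled out (or
   absorbed, as in Fig. 7.2). */
void cordic_rotate(int32_t *x, int32_t *y, int32_t theta)
{
    const int32_t half_pi = (int32_t)lrint(M_PI_2 * (1 << 29));
    int32_t X = *x, Y = *y;

    /* extend convergence from |theta| <= pi/2 to |theta| <= pi (Eq. 7.11) */
    if (theta > half_pi || theta < -half_pi) {
        int32_t s = (theta < 0) ? -1 : 1;
        int32_t t = X;
        X = -s * Y;                            /* pre-rotation by s.(pi/2) */
        Y = s * t;
        theta -= s * half_pi;
    }
    for (int n = 0; n < K_ITER; n++) {         /* elementary rotations, Eq. 7.5 */
        int32_t s = (theta < 0) ? -1 : 1;      /* sigma_n, non-redundant */
        int32_t t = X;
        X -= s * (Y >> n);
        Y += s * (t >> n);
        theta -= s * alpha[n];
    }
    *x = X;
    *y = Y;
}
```

The K elementary angles are exactly the hardwired constants of Fig. 7.1, each iteration costing only shifts and add-subtract operations, in keeping with the discussion above.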

7.4.4 Logic Requirement of CORDIC-Based PE

Referring back to the SFG of Fig. 7.2, it is to be assumed that the GD-BFLY outputs are to be computed to L-bit accuracy. Therefore, because of the two stages of adders following the CORDIC rotators, it will be necessary for the CORDIC rotators to adopt L + 3 iterations in order to produce data to (L + 2)-bit accuracy for input to the first stage of adders. This in turn requires that each CORDIC rotator adopt L + 4 + log₂(L + 2) bits for the registers, this including log₂(L + 2) guard bits for the LSB and two guard bits for the MSB. Following their operation, the data will have been magnified by one bit, so that just the top MSB guard bit needs to be removed, together with the lowest log₂(L + 2) + 1 bits, to leave the required L + 2 bits for input to the adders. The first stage of adders is then carried out to (L + 3)-bit precision and the second stage to (L + 4)-bit precision, at which point the data is scaled to yield the final L-bit result. The outputs from the two fixed-coefficient multipliers – note that in the time it takes for the CORDIC operation to be executed, the same fixed-coefficient multiplier could be used to carry out the scaling operation for both of the first two inputs – are retained to (L + 2)-bit precision in order to ensure consistency with the precision of the outputs from the CORDIC rotators.

Thus, the CORDIC-based version of the GD-BFLY involves three (L + 3)-stage pipelined CORDIC rotators, eight (L + 3)-bit stage-one adders, eight (L + 4)-bit stage-two adders and one shared fixed-coefficient multiplier using an (L + 2)-bit coefficient, so that the PE may be constructed with a total arithmetic-based logic requirement, denoted $L_A^{C}$, of

$$L_A^{C} \approx \tfrac{1}{2}\left(10L^2 + 83L + 9(L+3)\log_2(L+2) + 168\right) \qquad (7.13)$$

slices. Note that the single-angle, double-angle and triple-angle rotation angles are fed directly to the GD-BFLY, so that the only memory requirement is for the storage of three generation angles for each stage of the transform, from which the rotation angles may then be recursively derived via simple addition. Thus, assuming single-port memory, the memory-based logic requirement, denoted $L_M^{C}$, is given by just

$$L_M^{C} \approx \tfrac{3}{2}\alpha L \qquad (7.14)$$

slices, where α = log₄N, with the required single-angle, double-angle and triple-angle rotation angles being computed on-the-fly as and when they are required.

7.5 Comparative Analysis of PE Designs

This section provides a very brief theoretical comparison of the silicon resources required for all five types of PE so far considered – four corresponding to the use of a pipelined fixed-point multiplier, as discussed in some detail in Chap. 6, and one corresponding to the use of the pipelined CORDIC arithmetic unit – where the sizing is based upon the simplistic logic-based complexity figures discussed in Sects. 7.3 and 7.4. An FPGA implementation [6, 7] would of course be able to exploit the available embedded resources, whether using the fast fixed-point multiplier or the CORDIC arithmetic unit, as most FPGA manufacturers now provide their own version of the CORDIC unit, in addition to the fast multipliers and RAM, as an embedded resource to be exploited by the user. A pipelined version of the CORDIC arithmetic unit may even be obtained as an IP core [11] and subsequently used as a building block for constructing larger DSP systems. The assumption here is that any relative advantages obtained from an implementation in programmable logic will be even greater when the PEs are implemented using such optimized embedded resources.


requirement – are as summarized in Table 7.1, from which the potential attraction of Version V, the CORDIC-based solution, is evident. The benefits stem basically from the fact that there is no longer any need to construct, maintain and access potentially large LUTs required for the storage of the trigonometric coefficients. The same word lengths, denoted ‘L’, are assumed for both the input/output data, to/from the GD-BFLY, and the trigonometric coefficients. The control-based logic requirements – for controlling the operation and interaction of the various components of the design – as discussed in Sect. 5.6 of Chap. 5, are not included in the results as they are rather more difficult (if not impossible) to assess, if considered in isolation from the actual hardware design process, this due in part to the automated and somewhat unpredictable nature of that process. It seems clear, however, that the potential gains achieved by the CORDICbased R24 FHT solution in not having to maintain and access the PCM will be somewhat counterbalanced by the need to control a potentially large number of adders rather than just a few fast fixed-point multipliers. Also, the two versions of the R24 FHT solution based upon the minimum-memory addressing scheme (of Versions III and IV) will involve greater control-complexity than those versions based upon the minimum-arithmetic addressing scheme (of Versions I and II), as evidenced from the discussions of Sect. 6.3.2 of Chap. 6. For each of the five versions, however, the control-based logic requirement will vary little with transform length or word length, as indicated in the results of Sect. 6.5 of Chap. 6, due to the scalable nature of the designs. Estimates for the logic requirements due to both the arithmetic requirement and the PCM requirement for various combinations of transform length and data/coefficient word length, for all the solutions considered, are as given in Table 7.2, with the results reinforcing the attraction of the CORDIC-based solution for those parameter sets typically encountered in high-performance DSP applications. It is evident from the results displayed in Tables 7.1 and 7.2 that as the transform length increases, the associated memory-based logic requirement makes all those solutions based upon the fast fixed-point multiplier increasingly less attractive, as the silicon requirement is clearly dominated for such solutions by the increasing memory requirement. The only significant change to the CORDIC-based solution as the transform length varies relates to the memory allocation for the double-buffered storage of the input/output data.

7.6 Discussion

The primary question addressed in this chapter concerned the optimal choice of arithmetic unit given the requirement for a resource-constrained solution to the R24 FHT when costly embedded resources on the target device, such as fast fixed-point multipliers and fast RAM, are scarce. As stated in Sect. 7.3, the embedded multiplier used on the Xilinx Virtex-II Pro 100 FPGA, for example, was restricted to the handling of two 18-bit two's complement multiplicands, so that in order to be able to cater for different applications, requiring different levels of arithmetic precision – typically lying between 16-bit and 32-bit – the arithmetic unit would have to cater for the worst-case situation (thus requiring four embedded multipliers per multiplication) and might therefore be far from optimal (in terms of wasted, costly embedded resources) for some of those lower-precision applications.

Table 7.1 Logic resources required for different versions of PE and trigonometric coefficient generator assuming N-point regularized FHT and L-bit accuracy

| Version of solution | Processing element type | Arithmetic-based logic for double butterfly (slices) | Memory-based logic for coefficients (slices) | Arithmetic-based logic for coefficient generator (slices) |
|---|---|---|---|---|
| I | Fast multiplier | $\tfrac{1}{2}(15L^2 + 22L + 112)$ | $\tfrac{3}{4}LN$ | 0 |
| II | Fast multiplier | $\tfrac{1}{8}(45L^2 + 190L + 548)$ | $\tfrac{3}{4}LN$ | $3L$ |
| III | Fast multiplier | $\tfrac{1}{2}(15L^2 + 22L + 112)$ | $\tfrac{3}{2}L\sqrt{N}$ | $\tfrac{1}{8}(35L^2 + 162L + 277)$ |
| IV | Fast multiplier | $\tfrac{1}{8}(45L^2 + 190L + 548)$ | $\tfrac{3}{2}L\sqrt{N}$ | $\tfrac{1}{8}(35L^2 + 246L + 831)$ |
| V | CORDIC unit | $\tfrac{1}{2}(10L^2 + 83L + 9(L+3)\log_2(L+2) + 168)$ | $\tfrac{3}{2}L\log_4 N$ | 0 |


Table 7.2 Logic resources for combined arithmetic and PCM requirements for combinations of transform length N and word length L (approximate sizing in slices × 1K)

| Version of solution | Processing element type | N = 1024, L = 16 | L = 20 | L = 24 | N = 4096, L = 16 | L = 20 | L = 24 | N = 16,384, L = 16 | L = 20 | L = 24 |
|---|---|---|---|---|---|---|---|---|---|---|
| I | Fast multiplier | 14 | 18 | 23 | 50 | 63 | 77 | 194 | 243 | 293 |
| II | Fast multiplier | 14 | 18 | 22 | 50 | 63 | 76 | 194 | 243 | 292 |
| III | Fast multiplier | 4 | 6 | 9 | 5 | 7 | 10 | 7 | 9 | 12 |
| IV | Fast multiplier | 4 | 6 | 8 | 5 | 7 | 9 | 7 | 9 | 11 |
| V | CORDIC unit | 3 | 4 | 5 | 3 | 4 | 5 | 3 | 4 | 5 |

To address this problem, the fast fixed-point multipliers used for the computation of the GD-BFLY were replaced by a hardware-based parallel arithmetic unit, with the particular design investigated being based upon the use of CORDIC arithmetic, as this is known to be computationally optimal for the operation of phase rotation, as required for the FHT and FFT algorithms of interest in this monograph – most FPGA manufacturers now provide their own version of the CORDIC unit, in addition to the fast multipliers and RAM, as an embedded resource to be exploited by the user. The resulting version of the R24 FHT is thus based upon the adoption of a simple variation of the basic PE design, as was discussed in Chap. 6, which, like those previously discussed designs, is also compatible with the single-PE recursive computing architecture. The adoption of such a PE results in a solution to the problem of computing the DHT and the real-data DFT offering the promise of greatly reduced quantities of silicon resources – at least for the arithmetic requirement and the memory requirement – and, when implemented with FPGA technology, the possibility of adopting a lower-complexity and lower-cost device compared to that based upon the use of the fast fixed-point multiplier. The mathematical/logical correctness of the operation of the resulting CORDIC-based version of the R24 FHT solution, as with those versions based upon the use of the fast fixed-point multiplier, has been proven in software via a computer programme written in the 'C' programming language – see Appendices A and B for details.

The comparative benefits of the various designs, as suggested by the complexity figures derived for a hypothetical FPGA implementation with programmable logic, should also carry over, not only when exploiting embedded resources but also when implemented with ASIC technology, where the high regularity of the CORDIC-based design could prove particularly attractive. In fact, a recent study [8] has shown that, compared to an implementation using a standard cell ASIC, the FPGA area required to implement a typical DSP algorithm – such as the R24 FHT – is on average forty times larger, whilst the achievable speed, which relates to the critical path delay and hence the maximum allowable clock frequency, is on average one third that achievable by the ASIC. As a result, it is possible to hypothesize a dynamic power consumption for an FPGA implementation which is on average an order of magnitude greater than that for the ASIC when embedded features are used [8], this increasing still further when only programmable logic is used [8].

Note that the design constraint on the PE discussed in Sect. 6.4 of Chap. 6, concerning the total number of PCSs in the computational pipeline, is applicable for the CORDIC-based solution as well as for those based upon the fast fixed-point multiplier, namely, that the total number of PCSs – including those corresponding to the CORDIC iterations – needs to be an odd-valued integer, so as to avoid any possible addressing conflicts arising from the reading/writing of the input/output data sets from/to the eight PDM banks for each new clock cycle.

Finally, it should be noted that the potential benefits of adopting the CORDIC-based design, rather than one of the more conventional designs based upon the use of the fast fixed-point multiplier, may only be achieved at the expense of incurring greater latency, given that the delay associated with a pipelined fixed-point multiplier might typically be of O(log₂L) clock cycles whereas that for the pipelined CORDIC arithmetic unit is of O(L) clock cycles. As a result, for the processing of 16-bit to 24-bit data, for example, whereas the PE design based upon the pipelined fixed-point multiplier might typically involve a total pipeline delay of nine clock cycles, say, that based upon the CORDIC arithmetic unit might typically involve a total pipeline delay of two to three times that size, which might in turn (although only for small transform sizes) necessitate the adoption of a safety margin delay with each stage of GD-BFLYs.

References

1. R. Andraka, A survey of CORDIC algorithms for FPGA based computers (Proc. of ACM/SIGDA 6th Int. Symp. on FPGAs, Monterey, 1998), pp. 191–200
2. A. Banerjee, S.D. Anindya, S. Banerjee, FPGA realization of a CORDIC-based FFT processor for biomedical signal processing. Microproc. and Microsyst. (Elsevier) 25(3), 131–142 (May 2001)
3. M. Becvar, P. Stukjunger, Fixed-point arithmetic in FPGA. Acta Polytechnica 45(2), 67–72 (2005)


4. C.H. Dick, FPGAs: The High-End Alternative for DSP Applications, DSP Engineering (2000)
5. K. Hwang, Computer Arithmetic: Principles, Architectures and Design (John Wiley & Sons, New York, 1979)
6. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)
7. K.J. Jones, The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments, Series on Signals & Communication Technology (Springer, 2010)
8. I. Kuon, J. Rose, Measuring the gap between FPGAs and ASICs (FPGA '06, Monterey, 2006)
9. C. Maxfield, The Design Warrior's Guide to FPGAs (Newnes/Elsevier, 2004)
10. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice-Hall, Englewood Cliffs, 1975)
11. RFEL: rfel.com/products/Products_Cordic.asp
12. T. Sansaloni, A. Perez-Pascual, J. Valls, Area-efficient FPGA-based FFT processor. Electron. Lett. 39(19), 1369–1370 (September 2003)
13. J.E. Volder, The CORDIC trigonometric computing technique. IRE Trans. on Electronic Comput. EC-8(3), 330–334 (1959)

Part III: Applications of Regularized Fast Hartley Transform

Chapter 8: Derivation of Radix-2 Real-Data Fast Fourier Transform Algorithms Using Regularized Fast Hartley Transform

8.1 Introduction

The results discussed so far in this monograph have been concerned with the application of the R24 FHT [1–3] to the computation of the DHT and, equivalently, via the relationship of their kernels, to that of the real-data DFT, where the transform length is a power of four (a radix-4 integer). Given the amount of effort and resources devoted to the design of the GD-BFLY and the associated R24 FHT, however, there would be great attraction in being able to extend its range of applicability to that of a radix-2 solution to the DHT and the real-data DFT, where the transform length is a power of two (a radix-2 integer), but not a power of four, provided the computational efficiency would not be unduly compromised. A direct radix-2 version of the regularized FHT could of course be developed, but this would yield just four-fold parallelism, at best, rather than the eight-fold parallelism of the radix-4 solution already developed, whilst the time-complexity would increase, relative to that of the radix-4 solution, by a factor of log₂N / log₄N, which equates to a factor of two (since log₄N = ½·log₂N). If the applicability of the R24 FHT could be generalized, therefore, without significantly compromising performance, it could result in a very flexible solution to the problem of computing the DHT and the real-data DFT that would be able to address a great many more problems than originally envisioned.

Two approaches to the problem of computing the real-data DFT are now discussed:

1. The first involves the exploitation of a half-length version of the R24 FHT, with the transform being applied separately to both the even-addressed and the odd-addressed subsequences of the real-valued input data set, before their outputs are appropriately combined to yield the required real-data DFT outputs.
2. The second involves the exploitation of one double-length version of the R24 FHT, this being applied to a zero-padded version of the input data set to yield an interpolated Hartley-space data set from which the required real-data DFT outputs may be obtained.


The first approach needs to be able to convert, in an efficient manner, the real-valued Hartley-space outputs produced by the half-length R24 FHT(s) into the required complex-valued outputs of the real-data DFT. This approach will be referred to as the double-resolution approach, as it involves producing DFT outputs possessing the required resolving capability via R24 FHT-based processing at double the required transform-space resolution (thus corresponding to outputs possessing one half the required resolving capability). The second, and conceptually much simpler, approach will be referred to as the half-resolution approach, as it involves producing the real-data DFT outputs possessing the required resolving capability via R24 FHT-based processing at one half the required transform-space resolution (thus corresponding, when a full data set is available, to outputs possessing twice the required resolving capability) in order to produce interpolated Hartley-space outputs from which the required real-data DFT outputs may be subsequently obtained. The required resolution will be referred to as the 'full resolution', as it corresponds to the resolving capability of the sought-after solution.

Note, however, that with the half-resolution approach, the effect of zero-padding the input data set is to produce interpolated results in transform-space, so that although greater accuracy may be achieved in locating tonal signals within that space, the resolving capability – that is, the ability to distinguish closely spaced transform-space components of the signal – will not change. Given that the resolution is inversely related to the duration of the actual signal segment being processed, the resolution achieved via zero-padding will inevitably be less than that achieved when the zero-valued samples are replaced by genuine samples, as it is derived from a smaller quantity of valid data.

8.2 Computation of Real-Data DFT via Two Half-Length Regularized FHTs

This section discusses the first of the two R24 FHT-based approaches, the double-resolution approach, which is concerned with the derivation of a 2N-point real-data FFT algorithm where N is a power of four. For this to be achieved, a regular and highly parallel conversion routine is required which enables the two Hartley-space data sets – as obtained from the application of the R24 FHT to both the even-addressed and the odd-addressed subsequences of the real-valued input data set – to be suitably combined to produce the required DFT outputs. The approach exploits the following properties:

1. The outputs from a real-data DFT of length 2N, where N is a power of four, may be obtained from the outputs of a complex-data DFT of length N [5] (as discussed in Sect. 2.3.3 of Chap. 2).
2. The real and imaginary components of the complex-data DFT outputs may each be independently obtained via an R24 FHT of length N (as discussed in Sect. 3.4.2 of Chap. 3).


The resulting algorithm, which thus exploits one (for sequential mode) or two (for parallel mode) N-point R24 FHTs and one conversion routine (referred to more formally in this chapter as the 'R4FHT-to-R2FFT' conversion routine, as it produces outputs for a radix-2 FFT algorithm from outputs produced by a radix-4 FHT algorithm), ultimately produces outputs in Fourier-space, rather than Hartley-space, and so may be regarded in some sense as belonging to the same class of specialized real-data FFT algorithms as those discussed earlier in Sect. 2.2 of Chap. 2 – in fact, its performance will be compared in some detail with one such algorithm in the discussions of Sect. 8.4.

8.2.1 Derivation of Radix-2 Algorithm via Double-Resolution Approach

Let us start by denoting the real-valued input data set by {x[n]}, with the even-addressed subsequence given by {xE[n]} and the odd-addressed subsequence by {xO[n]}. After processing each subsequence by means of an N-point R₂⁴ FHT, let the R₂⁴ FHT outputs from the processing of the even-addressed samples be denoted by {X_E^(H)[k]} and those obtained from the processing of the odd-addressed samples by {X_O^(H)[k]}. The R₂⁴ FHT outputs may then be converted to Fourier-space by means of the expressions:

$$X_{R,E}^{(F)}[k] = \tfrac{1}{2}\bigl(X_{E}^{(H)}[k] + X_{E}^{(H)}[N-k]\bigr) \qquad (8.1)$$

$$X_{I,E}^{(F)}[k] = \tfrac{1}{2}\bigl(X_{E}^{(H)}[N-k] - X_{E}^{(H)}[k]\bigr) \qquad (8.2)$$

for the even-addressed terms, and

$$X_{R,O}^{(F)}[k] = \tfrac{1}{2}\bigl(X_{O}^{(H)}[k] + X_{O}^{(H)}[N-k]\bigr) \qquad (8.3)$$

$$X_{I,O}^{(F)}[k] = \tfrac{1}{2}\bigl(X_{O}^{(H)}[N-k] - X_{O}^{(H)}[k]\bigr) \qquad (8.4)$$

for the odd-addressed terms, where X_{R,E/O}^(F) denotes the real component of the output and X_{I,E/O}^(F) the imaginary component. Suppose now that the real and imaginary components of double-resolution Fourier-space samples, denoted {Y_R[k]} and {Y_I[k]}, respectively, are introduced via the expressions:

$$Y_R[k] = X_{R,E}^{(F)}[k] - X_{I,O}^{(F)}[k] \qquad (8.5)$$

$$Y_I[k] = X_{I,E}^{(F)}[k] + X_{R,O}^{(F)}[k] \qquad (8.6)$$

and

$$Y_R[N-k] = X_{R,E}^{(F)}[k] + X_{I,O}^{(F)}[k] \qquad (8.7)$$

$$Y_I[N-k] = X_{R,O}^{(F)}[k] - X_{I,E}^{(F)}[k]. \qquad (8.8)$$

Then the real and the imaginary components of the required 2N-point real-data DFT outputs, as denoted by {X_R^(F)[k]} and {X_I^(F)[k]}, respectively, may be written as

$$X_R^{(F)}[k] = \tfrac{1}{2}\bigl[(Y_R[k] + Y_R[N-k]) + \cos(2\pi k/2N)(Y_I[k] + Y_I[N-k]) - \sin(2\pi k/2N)(Y_R[k] - Y_R[N-k])\bigr] \qquad (8.9)$$

$$X_I^{(F)}[k] = \tfrac{1}{2}\bigl[(Y_I[k] - Y_I[N-k]) - \sin(2\pi k/2N)(Y_I[k] + Y_I[N-k]) - \cos(2\pi k/2N)(Y_R[k] - Y_R[N-k])\bigr], \qquad (8.10)$$

as already demonstrated with Eqs. 2.18 and 2.19 of Chap. 2.
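As a numerical sanity check on Eqs. 8.1–8.10, the following sketch rebuilds the 2N-point real-data DFT from the two N-point Hartley-space data sets. It is an illustration under assumed conventions only: a software DHT computed via the FFT stands in for the R₂⁴ FHT, and an unnormalized transform convention is used throughout.

```python
import numpy as np

def dht(x):
    """Unnormalized N-point discrete Hartley transform via the FFT
    (H[k] = Re F[k] - Im F[k]); stands in for the R2^4 FHT here."""
    F = np.fft.fft(x)
    return F.real - F.imag

rng = np.random.default_rng(1)
N = 16                                   # half-length; the 2N-point real-data DFT is the target
x = rng.standard_normal(2 * N)
HE, HO = dht(x[0::2]), dht(x[1::2])      # Hartley-space outputs of the two streams

k = np.arange(N)
mk = (-k) % N                            # index N-k (mod N)
# Eqs. 8.1-8.4: Hartley-space to Fourier-space, per stream.
XRE, XIE = 0.5 * (HE + HE[mk]), 0.5 * (HE[mk] - HE)
XRO, XIO = 0.5 * (HO + HO[mk]), 0.5 * (HO[mk] - HO)
# Eqs. 8.5-8.8: double-resolution Fourier-space samples.
YR, YI = XRE - XIO, XIE + XRO
YRN, YIN = XRE + XIO, XRO - XIE          # Y_R[N-k], Y_I[N-k]
# Eqs. 8.9-8.10: recombination into the 2N-point real-data DFT.
c, s = np.cos(np.pi * k / N), np.sin(np.pi * k / N)
XR = 0.5 * ((YR + YRN) + c * (YI + YIN) - s * (YR - YRN))
XI = 0.5 * ((YI - YIN) - s * (YI + YIN) - c * (YR - YRN))

ref = np.fft.fft(x)[:N]                  # non-negative half of the spectrum
assert np.allclose(XR + 1j * XI, ref)
```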

The even-symmetric nature of the sinusoidal function in the above two equations, relative to an argument of π/2 radians (corresponding to coefficient index N/2), and of the cosinusoidal function, relative to an argument of 0 radians (corresponding to coefficient index 0), together with their periodicity (both being of period 2N) – see Eqs. 4.56–4.58 of Chap. 4 – means that just N/2 − 1 non-trivial values and two trivial values (i.e. the values 0 and 1) of the sinusoidal function, covering the closed region [0, π/2] radians, may be pre-computed and stored in an LUT of length N/2 + 1, from which each pair of trigonometric coefficients may be obtained. As will be seen, each such pair may be used for the production of four 2N-point real-data DFT outputs, except for the cases corresponding to the coefficient indices 0 and N/2, which each produce just two such outputs. The DFT output indices corresponding to coefficient index 0 are given by k = 0 (the term for zero frequency) and k = N/2 (the term for one half the Nyquist frequency for a length-2N transform), whilst those corresponding to coefficient index N/2 are given by k = N/4 (the term for one quarter the Nyquist frequency for a length-2N transform) and k = 3N/4 (the term for three quarters the Nyquist frequency for a length-2N transform). Now, by exploiting the symmetry properties of the sinusoidal and cosinusoidal functions described above and putting

$$A = X_E^{(H)}[k] + X_E^{(H)}[N-k] \qquad (8.11)$$

$$B = X_E^{(H)}[k] - X_E^{(H)}[N-k] \qquad (8.12)$$

$$C = X_O^{(H)}[k] + X_O^{(H)}[N-k] \qquad (8.13)$$

$$D = X_O^{(H)}[k] - X_O^{(H)}[N-k], \qquad (8.14)$$

the single complex-valued output of Eqs. 8.9 and 8.10 may be expanded and replaced by the two complementary pairs of complex-valued outputs:

$$X_R^{(F)}[k] = \tfrac{1}{2}\bigl(A + C\cos\varphi_k - D\sin\varphi_k\bigr) \qquad (8.15)$$

$$X_I^{(F)}[k] = \tfrac{1}{2}\bigl(-B - D\cos\varphi_k - C\sin\varphi_k\bigr) \qquad (8.16)$$

$$X_R^{(F)}[N-k] = \tfrac{1}{2}\bigl(A - C\cos\varphi_k + D\sin\varphi_k\bigr) \qquad (8.17)$$

$$X_I^{(F)}[N-k] = \tfrac{1}{2}\bigl(B - D\cos\varphi_k - C\sin\varphi_k\bigr) \qquad (8.18)$$

and, applying the same relations at the index N/2 − k (whose complement with respect to N is N/2 + k, and for which A, B, C and D are formed as in Eqs. 8.11–8.14 with N/2 − k in place of k),

$$X_R^{(F)}[N/2-k] = \tfrac{1}{2}\bigl(A + C\cos\varphi_{N/2-k} - D\sin\varphi_{N/2-k}\bigr) \qquad (8.19)$$

$$X_I^{(F)}[N/2-k] = \tfrac{1}{2}\bigl(-B - D\cos\varphi_{N/2-k} - C\sin\varphi_{N/2-k}\bigr) \qquad (8.20)$$

$$X_R^{(F)}[N/2+k] = \tfrac{1}{2}\bigl(A - C\cos\varphi_{N/2-k} + D\sin\varphi_{N/2-k}\bigr) \qquad (8.21)$$

$$X_I^{(F)}[N/2+k] = \tfrac{1}{2}\bigl(B - D\cos\varphi_{N/2-k} - C\sin\varphi_{N/2-k}\bigr), \qquad (8.22)$$

where

$$\varphi_k = 2\pi k/2N. \qquad (8.23)$$

Since φ_{N/2−k} = π/2 − φ_k, the cosine and sine values required for the second pair of outputs are simply those of the first pair interchanged,

so that the real-data DFT outputs are now expressed directly in terms of the R₂⁴ FHT outputs, as required, with each pair of trigonometric coefficients being used to produce the real and imaginary components of four 2N-point real-data DFT outputs, where those outputs corresponding to the required non-negative half of the frequency spectrum are addressed by means of the index k ∈ {0, 1, ..., N−1}. The real and imaginary components of the DFT outputs for the zero-frequency and one-half Nyquist-frequency indices are given by the following:

$$X_R^{(F)}[0] = X_E^{(H)}[0] + X_O^{(H)}[0] \qquad (8.24)$$

$$X_I^{(F)}[0] = 0 \qquad (8.25)$$

$$X_R^{(F)}[N/2] = X_E^{(H)}[N/2] \qquad (8.26)$$

$$X_I^{(F)}[N/2] = -X_O^{(H)}[N/2]. \qquad (8.27)$$
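A compact sketch of the conversion routine itself, organized as in Eqs. 8.11–8.27 – four DFT outputs per pair of LUT coefficients, plus the special-case terms – is given below. It is again an illustration under assumed conventions, with an FFT-derived, unnormalized DHT standing in for the R₂⁴ FHT:

```python
import numpy as np

def dht(x):
    """Unnormalized DHT via the FFT (H[k] = Re F[k] - Im F[k])."""
    F = np.fft.fft(x)
    return F.real - F.imag

def pair_outputs(HE, HO, k, c, s):
    """Eqs. 8.11-8.18: the complementary output pair X[k], X[N-k] of the
    2N-point real-data DFT, given cos/sin of phi_k."""
    N = len(HE)
    A = HE[k] + HE[N - k]            # Eq. 8.11
    B = HE[k] - HE[N - k]            # Eq. 8.12
    C = HO[k] + HO[N - k]            # Eq. 8.13
    D = HO[k] - HO[N - k]            # Eq. 8.14
    Xk  = 0.5 * (A + C * c - D * s) + 0.5j * (-B - D * c - C * s)  # Eqs. 8.15/8.16
    XNk = 0.5 * (A - C * c + D * s) + 0.5j * ( B - D * c - C * s)  # Eqs. 8.17/8.18
    return Xk, XNk

rng = np.random.default_rng(2)
N = 16
x = rng.standard_normal(2 * N)
HE, HO = dht(x[0::2]), dht(x[1::2])

SIN = np.sin(np.pi * np.arange(N // 2 + 1) / N)   # one-quadrant LUT of length N/2+1
X = np.empty(N, dtype=complex)
X[0] = HE[0] + HO[0]                              # Eqs. 8.24/8.25
X[N // 2] = HE[N // 2] - 1j * HO[N // 2]          # Eqs. 8.26/8.27
X[N // 4], X[3 * N // 4] = pair_outputs(HE, HO, N // 4, SIN[N // 4], SIN[N // 4])
for k1 in range(1, N // 4):                       # Eq. 8.28: k2, k3, k4 follow
    k2 = N // 2 - k1
    S1, S2 = SIN[k1], SIN[k2]
    X[k1], X[N - k1]      = pair_outputs(HE, HO, k1, S2, S1)   # quadrants 1 & 4
    X[k2], X[N // 2 + k1] = pair_outputs(HE, HO, k2, S1, S2)   # quadrants 2 & 3

assert np.allclose(X, np.fft.fft(x)[:N])          # non-negative half-spectrum
```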


The real and imaginary components of the additional two DFT outputs (corresponding to the one-quarter and three-quarter Nyquist-frequency indices) may be straightforwardly obtained from Eqs. 8.15 to 8.18 by setting the value of 'k' equal to N/4, so that the value of 'N − k' becomes equal to 3N/4. Note, however, that with the SFG to be introduced later in Sect. 8.2.3, only the upper half will be needed to represent the computation of those two complex-valued real-data DFT outputs with indices N/4 and 3N/4.

Thus, apart from the two pairs of additional terms, the complex-valued outputs of the 2N-point real-data DFT may be efficiently computed four outputs at a time, where each such set is derived from two sets, each of four real-valued Hartley-space outputs, by means of the proposed R4FHT-to-R2FFT conversion routine. The addresses for each set of four DFT outputs, which correspond exactly (from within their respective memories) with those of the two sets of stored Hartley-space outputs from which they are obtained, are expressed via the indices

$$k_1 \in \{1, 2, \ldots, N/4-1\}, \quad k_2 = N/2 - k_1, \quad k_3 = N/2 + k_1 \;\;\&\;\; k_4 = N - k_1, \qquad (8.28)$$

so that the addresses k1 and k4 are complementary with respect to N, as are the addresses k2 and k3. As already stated, just two trigonometric coefficients are required for the computation of each set of four DFT outputs, these being represented by means of the parameters S1 and S2, where S1 = SIN[k1] and S2 = SIN[k2], with 'SIN' being an LUT for the storage of a single quadrant of the sinusoidal function with addresses ranging from 0 up to N/2. The corresponding two multiplicands required for each quadrant are expressed in terms of the trigonometric coefficients as

$$\text{Quadrant 1:} \quad \cos\varphi_{k_1} = S2 \;\;\&\;\; \sin\varphi_{k_1} = S1 \qquad (8.29)$$

$$\text{Quadrant 2:} \quad \cos\varphi_{k_2} = S1 \;\;\&\;\; \sin\varphi_{k_2} = S2 \qquad (8.30)$$

$$\text{Quadrant 3:} \quad \cos\varphi_{k_3} = -S1 \;\;\&\;\; \sin\varphi_{k_3} = S2 \qquad (8.31)$$

$$\text{Quadrant 4:} \quad \cos\varphi_{k_4} = -S2 \;\;\&\;\; \sin\varphi_{k_4} = S1, \qquad (8.32)$$

thus enabling the Hartley-space data acquired from all four quadrants of each of the two partitioned memories used for the storage of the R₂⁴ FHT output data sets – the 'even-stream' and the 'odd-stream' data memories, denoted DME and DMO, respectively – to be efficiently combined. Note, however, that each of these memories, which is already 'physically' partitioned 'column-wise' into eight memory banks (addressed from 1 through to 8, although the addressing of data samples is assumed to start from 0, which is more appropriate for representing the zero-frequency term), in order to facilitate the efficient storage and retrieval of the Hartley-space outputs, now needs to be 'conceptually' partitioned 'row-wise', with each memory bank of N/8 time slots being divided into four quadrants, each of N/32 rows (or time slots).


Table 8.1 Distribution of four-quadrant data sets across partitioned memory (N = 64)

| k1 | bank | k2 | bank | k3 | bank | k4 | bank |
|----|------|----|------|----|------|----|------|
| 0  | 1    | –  | –    | 32 | 1    | –  | –    |
| 1  | 2    | 31 | 8    | 33 | 2    | 63 | 8    |
| 2  | 3    | 30 | 7    | 34 | 3    | 62 | 7    |
| 3  | 4    | 29 | 6    | 35 | 4    | 61 | 6    |
| 4  | 5    | 28 | 5    | 36 | 5    | 60 | 5    |
| 5  | 6    | 27 | 4    | 37 | 6    | 59 | 4    |
| 6  | 7    | 26 | 3    | 38 | 7    | 58 | 3    |
| 7  | 8    | 25 | 2    | 39 | 8    | 57 | 2    |
| 8  | 1    | 24 | 1    | 40 | 1    | 56 | 1    |
| 9  | 2    | 23 | 8    | 41 | 2    | 55 | 8    |
| 10 | 3    | 22 | 7    | 42 | 3    | 54 | 7    |
| 11 | 4    | 21 | 6    | 43 | 4    | 53 | 6    |
| 12 | 5    | 20 | 5    | 44 | 5    | 52 | 5    |
| 13 | 6    | 19 | 4    | 45 | 6    | 51 | 4    |
| 14 | 7    | 18 | 3    | 46 | 7    | 50 | 3    |
| 15 | 8    | 17 | 2    | 47 | 8    | 49 | 2    |
| 16 | 1    | –  | –    | 48 | 1    | –  | –    |

A tabulated example of the addressing for the case where N = 64 is given in Table 8.1, which illustrates how the four-quadrant data sets – with addresses k1, k2, k3 and k4 – are distributed across the eight memory banks of each of the two memories containing the R₂⁴ FHT output data sets. Thus, it can be seen from Fig. 8.1 that:

1. Address 'k1' corresponds to locations in the first quadrant, as stored in the first N/32 time slots of the memory bank.
2. Address 'k2' corresponds to locations in the second quadrant, as stored in the second N/32 time slots of the memory bank.
3. Address 'k3' corresponds to locations in the third quadrant, as stored in the third N/32 time slots of the memory bank.
4. Address 'k4' corresponds to locations in the fourth quadrant, as stored in the last N/32 time slots of the memory bank.

Note that the computation of the R₂⁴ FHT outputs may be carried out in either sequential mode (whereby just one R₂⁴ FHT is used for the computation of the two DHTs) or parallel mode (whereby a separate R₂⁴ FHT is assigned to the computation of each DHT) – as will be discussed later – but however this is done, the computation of the Hartley-space outputs precedes that of the R4FHT-to-R2FFT conversion routine, which can only commence processing once the Hartley-space outputs have been appropriately stored in the respective memories, either DME or DMO.
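The bank and time-slot arithmetic implied by Table 8.1 may be expressed as in the following sketch; the modulo-eight mapping shown is inferred from the table entries rather than quoted from the text:

```python
N, BANKS = 64, 8                       # example dimensions of Table 8.1

def bank(a):  return a % BANKS + 1     # memory bank (1..8) holding sample 'a'
def quad(a):  return a // (N // 4) + 1 # quadrant (1..4), i.e. N/32 rows apiece

# Reproduce a row of Table 8.1: the four-quadrant address set for k1 = 5.
k1 = 5
k2, k3, k4 = N // 2 - k1, N // 2 + k1, N - k1
print([(k, bank(k), quad(k)) for k in (k1, k2, k3, k4)])

# Worst-case clashes: all four samples of a set land in a single bank,
# e.g. k1 = 8 (bank 1) and k1 = 4, 12 (bank 5), as noted in the text below.
print([k1 for k1 in range(1, N // 4)
       if len({bank(k) for k in (k1, N//2 - k1, N//2 + k1, N - k1)}) == 1])
```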


[Fig. 8.1 Memory structure for Hartley-space data using even-stream and odd-stream data memories: each memory – the DME, for storage of the regularized FHT output for the 'even-addressed' input samples, and the DMO, for the 'odd-addressed' input samples – comprises quadrants 1 to 4, each of N/4 samples spread across the eight memory banks and indexed by 'k1', 'k2', 'k3' and 'k4', respectively]

The fact that the R4FHT-to-R2FFT conversion routine has to wait for all the R₂⁴ FHT outputs to have been produced before being able to commence operation is not an issue as long as the appropriate timing constraint is able to be met, namely, that the latency (or, equivalently, the update time for the case of a 'block-based' algorithm such as this) of the solution should be less than the update period of 2N clock cycles, as dictated by the data set refresh rate. Thus, in order to obtain a solution that meets the timing constraint, it is required that once the computation of the two sets of Hartley-space outputs has been completed by the R₂⁴ FHT(s), the R4FHT-to-R2FFT conversion routine combines them in an appropriate manner (as performed by Eqs. 8.11–8.22) to produce the required 2N-point real-data DFT outputs, but doing so in the same highly parallel fashion as is achieved with the R₂⁴ FHT (as discussed in Chaps. 4, 5, 6 and 7). This means that a computing architecture is required that is able to exploit the various partitioned memories so as to enable parallel memory addressing of both the data and the trigonometric coefficients, for the operation of both the R₂⁴ FHT(s) and the subsequent R4FHT-to-R2FFT conversion routine, enabling output woctads to be produced at the required rate of one per clock cycle.

[Fig. 8.2 Signal-flow graph for R4FHT-to-R2FFT conversion routine – derivation of real-data DFT outputs from Hartley-space inputs. The SFG combines the eight Hartley-space inputs XE[k1], XE[k4], XE[k2], XE[k3] and XO[k1], XO[k4], XO[k2], XO[k3], via stages of adders and of multipliers fed with the trigonometric coefficients S1 and S2, to produce the Fourier-space outputs XR[k1], XI[k1], XR[k4], XI[k4], XR[k2], XI[k2], XR[k3] and XI[k3], each output being scaled via a right-shift of length one ('>> 1'). Note: S1 = LUT[k1] & S2 = LUT[k2]]

8.2.2 Implementation of Double-Resolution Algorithm

Now in order for the operation of the R4FHT-to-R2FFT conversion routine to be carried out in such a manner – which, from the SFG of Fig. 8.2, requires a total of eight real multiplications, 20 real additions and eight right-shift operations, each of length one, in order to obtain four complex-valued 2N-point real-data DFT outputs from eight real-valued Hartley-space outputs – it is necessary that the two sets, each of four Hartley-space outputs, required for each instance of the conversion routine (obtained by taking one sample from each quadrant of both the DME and the DMO) be such that, of the four samples obtained from each memory, no more than two possess the same memory bank address, given the dual-port nature of the memory. Unfortunately, this is not the case, as it is quite possible for all four samples from each memory to appear in precisely the same memory bank – as is evident from the simple example for N = 64 illustrated in Table 8.1, where one complete four-quadrant data set occurs in the first memory bank, two complete four-quadrant data sets occur in the fifth memory bank and the two complete two-sample data sets also occur in the first memory bank.

All is not lost, however, as an attractive property of the Hartley-space data stored within the DME and the DMO is that each pair of four consecutive four-quadrant data sets required by the R4FHT-to-R2FFT conversion routine occupies just four rows (or time slots) of each memory, one row from each quadrant. Thus, by reading the corresponding four rows of data from each memory, woctad-by-woctad, and temporarily storing them – in some appropriate fashion to be defined – it should be possible to extract each of the eight eight-sample data sets (where each eight-sample data set may be simply derived from the concatenation of two four-quadrant data sets) from the temporarily stored data and feed it into the R4FHT-to-R2FFT conversion routine, as illustrated by the SFG of Fig. 8.2, at the rate of one set per clock cycle.

To see how this might be efficiently achieved, it is first necessary to introduce a small intermediate data memory, denoted DMIN and of 2-D form, partitioned into eight rows by eight columns, where each of the sixty-four memory banks is capable of holding a single sample of the Hartley-space data. As a result, the 2-D memory is capable of holding eight complete woctads, as produced by the R₂⁴ FHT(s): four from those stored within the DME and four from those stored within the DMO. Each such woctad, which is stored within a single row (or time slot) of either the DME or the DMO, is now mapped to a single row of the DMIN, so that once the DMIN is full, all eight sets of samples may be extracted and fed to the R4FHT-to-R2FFT conversion routine in the required order for processing. Fortunately, the eight samples of each set now come from a single column of eight distinct single-sample memory banks, so that the data may be retrieved from the 2-D memory, woctad-by-woctad, at the rate of one set per clock cycle – see the sketch below.

As a result, a pipelined implementation of the R4FHT-to-R2FFT conversion routine, as illustrated in Fig. 8.3, may be defined whereby the multiple arithmetic operations to be performed within each stage of the computational pipeline may be simultaneously executed via SIMD processing. This would enable four new 2N-point real-data DFT outputs to be produced every clock cycle, these being subsequently directed to a suitably defined external data store.
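The row-write/column-read behaviour of the DMIN may be visualized with the following sketch – an assumed software model of the corner-turn, not the monograph's hardware description:

```python
import numpy as np

# Model the 8x8 DMIN: eight woctads (rows) are written one per clock cycle,
# after which the eight conversion-routine input sets are read as columns,
# each column drawing its eight samples from eight distinct one-sample banks.
woctads = np.arange(64).reshape(8, 8)      # stand-ins for the 8 stored woctads

dmin = np.empty((8, 8), dtype=int)
for t in range(8):                         # write phase: one woctad per cycle
    dmin[t, :] = woctads[t]

for t in range(8):                         # read phase: one input set per cycle
    eight_sample_set = dmin[:, t]          # no two samples share a bank
    print(eight_sample_set)
```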

[Fig. 8.3 Computational pipeline for R4FHT-to-R2FFT conversion routine: Hartley-space inputs → adders (A1) → multipliers (M) → adders (A2) → adders (A3) → shifters (S) → real-data DFT outputs]

The individual stages of the computational pipeline need to correspond to the arithmetic components displayed in the SFG of Fig. 8.2 and so comprise three stages of adders – two stages containing eight adders and one stage containing four adders, all operating in parallel – together with one stage of multipliers, containing eight multipliers also operating in parallel, and one stage of parallel shifters for scaling each of the DFT outputs (both real and imaginary components) by two via a right-shift operation of length one. Note, however, that the fixed-point multipliers are themselves typically implemented as pipelines, possibly requiring as many as five stages according to the required precision. The result is thus an arithmetic requirement of 8 fast multipliers and 20 adders for fully parallel operation of the pipelined R4FHT-to-R2FFT conversion routine.

The pipelined solution is best achieved through double buffering via the introduction of a second DMIN, identical in form to the first, as this would enable two complete sets of R₂⁴ FHT outputs, each set comprising four woctads, to be built up and stored within one DMIN (the 'passive' region of the double-buffered memory) whilst data sets from the other DMIN (the 'active' region of the double-buffered memory) are being fed into the R4FHT-to-R2FFT conversion routine. This processing scheme – which is similar to that to be described in Sect. 10.5 of Chap. 10 concerning the parallel construction and transfer of reordered data sets between partitioned memories – involves a start-up delay of eight clock cycles to allow for the first DMIN to be filled, first time around, plus a pipeline delay to allow for the initial data to traverse the computational pipeline, with the functions of the two intermediate data memories then alternating every eight clock cycles. This enables the contents of one DMIN to be updated, woctad-by-woctad, with Hartley-space data from the two memories, DME and DMO, whilst the contents of the other DMIN are being fed, woctad-by-woctad, into the R4FHT-to-R2FFT conversion routine.

Thus, it is possible for a computing architecture to be defined that exploits the various partitioned memories to enable the processing for both the N-point R₂⁴ FHT(s) and the R4FHT-to-R2FFT conversion routine to be efficiently carried out, where the basic components of the solution are as shown in the processing scheme of Fig. 8.4. Note that the double-buffered intermediate data memory, DMIN, required of such a scheme is best built with programmable logic, so as not to waste potentially large quantities of fast and expensive embedded RAM in its construction, as embedded memory normally comes with a minimum size of some several thousands of bits, rather than just the few tens of bits required for each of the sixty-four banks of each DMIN.

8.2.2.1 Single-FHT Solution for Computation of Regularized FHTs

With regard to the computation of the 2N-point real-data DFT using the 'sequential' version of the double-resolution approach, whereby a single R₂⁴ FHT is assigned to the computation of the two DHTs so that they must be executed sequentially, one after the other, the O(N.log₄N) time-complexity, denoted T_SDR, may be approximated by


[Fig. 8.4 Computation of 2N-point real-data DFT using double-resolution processing scheme with N-point regularized FHT: the input data is split into 'even' and 'odd' addresses feeding two R₂⁴ FHTs, whose outputs fill the even-stream and odd-stream data memories; these in turn feed the double-buffered intermediate data memory and the R4FHT-to-R2FFT conversion routine, which delivers the real-data DFT outputs]

$$T_{SDR} = 2 \times \bigl(\tfrac{1}{8}N\log_4 N\bigr) + \tfrac{1}{4}N = \tfrac{1}{4}N\bigl(\log_4 N + 1\bigr) \qquad (8.33)$$

clock cycles, this figure ignoring any start-up/pipeline delays. With a data set refresh rate of 2N samples every update period of 2N clock cycles, this suggests a real-time capability for those data sets for which N ≤ 4096. The space-complexity for the combined operations of the R₂⁴ FHT and the subsequent R4FHT-to-R2FFT conversion routine has an arithmetic component of either 20 multipliers and 42 adders, when using Version I of the R₂⁴ FHT solution, or 17 multipliers and 51 adders when using Version II – the arithmetic components of all four versions of the R₂⁴ FHT solution, including Versions III and IV, are provided later in Table 8.2 of Sect. 8.2.2.3. The sequential solution has a 'worst-case' memory component that involves:

1. Two sets, each of eight PDM banks, for the double-buffered storage of the R₂⁴ FHT input data, with each memory bank holding N/8 samples
2. Two sets, each of N memory locations, for the temporary data stores DME and DMO
3. Two sets, each of sixty-four memory locations, for the double-buffered DMIN
4. Four single-quadrant LUTs for storage of the trigonometric coefficients – one set of three, with each LUT holding N/4 double-resolution coefficients for minimum-arithmetic addressing by the R₂⁴ FHT, and one LUT holding N/2 full-resolution coefficients for minimum-arithmetic addressing by the R4FHT-to-R2FFT conversion routine
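The real-time bound quoted above follows directly from Eq. 8.33, as the following check illustrates (a sketch only; N is stepped through powers of four and the update time compared against the 2N-cycle refresh period):

```python
import math

def t_sdr(n):
    """Eq. 8.33: sequential-mode update time, in clock cycles."""
    return n * (math.log(n, 4) + 1) / 4

for n in (4 ** m for m in range(4, 9)):
    verdict = "meets budget" if t_sdr(n) < 2 * n else "misses budget"
    print(f"N = {n:6d}: {t_sdr(n):9.0f} cycles vs {2 * n:6d} -> {verdict}")
# N = 4096 is the largest power of four satisfying the constraint;
# N = 16384 yields exactly 2N cycles and so just fails the strict bound.
```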


Table 8.2 Theoretical performance analysis for computation of 2N-point real-data DFT via double-resolution approach – note that 'S' denotes sequential mode and 'P' parallel mode

| Type of solution | Multipliers | Adders | Memory (words) | Time complexity (clock cycles) |
|---|---|---|---|---|
| Version I, mode 'S' | 20 | 42 | ~(21/4)N + 128 | ~(1/4)N(log₄N + 1) |
| Version I, mode 'P' | 32 | 64 | ~8N + 128 | ~(1/8)N(log₄N + 2) |
| Version II, mode 'S' | 17 | 51 | ~(21/4)N + 128 | ~(1/4)N(log₄N + 1) |
| Version II, mode 'P' | 26 | 82 | ~8N + 128 | ~(1/8)N(log₄N + 2) |
| Version III, mode 'S' | 27 | 50 | ~4N + (3/2)(√N + √(2N)) + 128 | ~(1/4)N(log₄N + 1) |
| Version III, mode 'P' | 46 | 80 | ~6N + 3√N + (3/2)√(2N) + 128 | ~(1/8)N(log₄N + 2) |
| Version IV, mode 'S' | 24 | 59 | ~4N + (3/2)(√N + √(2N)) + 128 | ~(1/4)N(log₄N + 1) |
| Version IV, mode 'P' | 40 | 98 | ~6N + 3√N + (3/2)√(2N) + 128 | ~(1/8)N(log₄N + 2) |

This results in a memory component, denoted M_SDR^(W), of

$$M_{SDR}^{(W)} = 2N + 2N + \tfrac{3}{4}N + \tfrac{1}{2}N + 128 = \tfrac{21}{4}N + 128 \qquad (8.34)$$

words, with the associated arithmetic component for the memory addressing of the R₂⁴ FHT given by zero when using the Version I solution or six adders when using Version II. This figure ignores the DSM requirement for the initial storage of the input data set as produced by the external data source, the ADC unit. In comparison, the sequential solution has a 'best-case' memory component that involves:

1. Two sets, each of eight PDM banks, for the double-buffered storage of the R₂⁴ FHT input data, with each memory bank holding N/8 samples
2. Two sets, each of N memory locations, for the temporary data stores DME and DMO
3. Two sets, each of sixty-four memory locations, for the double-buffered DMIN
4. Six complementary-angle LUTs for storage of the trigonometric coefficients – one set of three, with each LUT holding √N/2 double-resolution coefficients for minimum-memory addressing by the R₂⁴ FHT, and one set of three, with each LUT holding √(2N)/2 full-resolution coefficients for minimum-memory addressing by the R4FHT-to-R2FFT conversion routine

This results in a memory component, denoted M_SDR^(B), of

$$M_{SDR}^{(B)} = 2N + 2N + \tfrac{3}{2}\sqrt{N} + \tfrac{3}{2}\sqrt{2N} + 128 = 4N + \tfrac{3}{2}\bigl(\sqrt{N} + \sqrt{2N}\bigr) + 128 \qquad (8.35)$$

words, with the associated arithmetic component for the addressing of the R₂⁴ FHT given by 7 multipliers and 8 adders when using the Version III solution or 7 multipliers and 14 adders when using Version IV. This figure also ignores the DSM requirement for the initial storage of the input data set as produced by the external data source, the ADC unit.

8.2.2.2 Two-FHT Solution for Computation of Regularized FHTs

With regard to the computation of the 2N-point real-data DFT using the 'parallel' version of the double-resolution approach, whereby a separate R₂⁴ FHT is assigned to the computation of each of the DHTs so that they may be executed simultaneously, or in parallel, the O(N.log₄N) time-complexity, denoted T_PDR, may be approximated by

$$T_{PDR} = \tfrac{1}{8}N\log_4 N + \tfrac{1}{4}N = \tfrac{1}{8}N\bigl(\log_4 N + 2\bigr) \qquad (8.36)$$

clock cycles, this figure ignoring any start-up/pipeline delays. With a data set refresh rate of 2N samples every update period of 2N clock cycles, this suggests a real-time capability for those data sets for which N ≤ 4¹³. The space complexity for the combined operations of the two R₂⁴ FHTs and the subsequent R4FHT-to-R2FFT conversion routine has an arithmetic component of either 32 multipliers and 64 adders, when using Version I of the R₂⁴ FHT solution, or 26 multipliers and 82 adders when using Version II – the corresponding arithmetic components of all four versions of the R₂⁴ FHT solution, including Versions III and IV, are provided later in Table 8.2 of Sect. 8.2.2.3. The parallel solution has a 'worst-case' memory component that involves:

1. Four sets, each of eight PDM banks, for the double-buffered storage of the input data to each R₂⁴ FHT, with each memory bank holding N/8 samples
2. Two sets, each of N memory locations, for the temporary data stores DME and DMO
3. Two sets, each of sixty-four memory locations, for the double-buffered DMIN
4. Seven single-quadrant LUTs for storage of the trigonometric coefficients – two sets of three, with each LUT holding N/4 double-resolution coefficients for minimum-arithmetic addressing by the two R₂⁴ FHTs, and one LUT holding N/2 full-resolution coefficients for minimum-arithmetic addressing by the R4FHT-to-R2FFT conversion routine.

This results in a memory component, denoted M_PDR^(W), of

$$M_{PDR}^{(W)} = 4N + 2N + \tfrac{3}{2}N + \tfrac{1}{2}N + 128 = 8N + 128 \qquad (8.37)$$

words, with the associated arithmetic component for the addressing of each R₂⁴ FHT given by zero when using the Version I solution or six adders when using Version II. This figure ignores the DSM requirement for the initial storage of the input data set as produced by the external data source, the ADC unit. In comparison, the parallel solution has a 'best-case' memory component that involves:

1. Four sets, each of eight PDM banks, for the double-buffered storage of the input data to each R₂⁴ FHT, with each memory bank holding N/8 samples
2. Two sets, each of N memory locations, for the temporary data stores DME and DMO
3. Two sets, each of sixty-four memory locations, for the double-buffered DMIN
4. Nine complementary-angle LUTs for storage of the trigonometric coefficients – two sets of three, with each LUT holding √N/2 double-resolution coefficients for minimum-memory addressing by the two R₂⁴ FHTs, and one set of three, with each LUT holding √(2N)/2 full-resolution coefficients for minimum-memory addressing by the R4FHT-to-R2FFT conversion routine

This results in a memory component, denoted M_PDR^(B), of

$$M_{PDR}^{(B)} = 4N + 2N + 3\sqrt{N} + \tfrac{3}{2}\sqrt{2N} + 128 = 6N + 3\sqrt{N} + \tfrac{3}{2}\sqrt{2N} + 128 \qquad (8.38)$$

words, with the associated arithmetic component for the addressing of each R₂⁴ FHT given by 7 multipliers and 8 adders when using the Version III solution or 7 multipliers and 14 adders when using Version IV. This figure also ignores the DSM requirement for the initial storage of the input data set as produced by the external data source, the ADC unit.

8.2.2.3 Comparative Analysis of Solutions

The theoretical performance and resource utilization figures for the latency-constrained computation of the 2N-point real-data DFT by means of the R₂⁴ FHT, where N is a power of four, are summarized in Table 8.2, where 'S' refers to the sequential solution discussed in Sect. 8.2.2.1, in which one R₂⁴ FHT is assigned to the computation of both N-point DHTs, one after the other, and 'P' refers to the parallel solution discussed in Sect. 8.2.2.2, in which a separate R₂⁴ FHT is assigned to the computation of each N-point DHT. The results highlight the achievable computational density of the parallel solution, when compared to that of the sequential version, as the resulting throughput rate is nearly doubled at the minimal expense of an increased space complexity, comprising (1) an additional 12 fast fixed-point multipliers and 22 adders for Version I of the R₂⁴ FHT solution, or just 9 fast fixed-point multipliers and 31 adders for Version II (and, of course, increased programmable logic), plus (2) additional memory of 2N words for the simultaneous double buffering of both the even-addressed and the odd-addressed subsets of the input data set – the increased space-complexity figures for Versions III and IV are as given in Table 8.2.

8.3 Computation of Real-Data DFT via One Double-Length Regularized FHT

This section discusses the second of the two R₂⁴ FHT-based approaches, the half-resolution approach, which is concerned with the derivation of an N-point real-data FFT algorithm where 2N is a power of four. To see how this may be achieved, using one 2N-point R₂⁴ FHT, let us first turn to an important result from Sect. 3.5 of Chap. 3, namely, that of Parseval's theorem, which states that the energy in a signal is preserved under a unitary or orthogonal transformation, such as with the DFT or the DHT, this being expressed as

$$\sum_{n=0}^{N-1}\bigl|x[n]\bigr|^2 \equiv \sum_{k=0}^{N-1}\bigl|X^{(F)}[k]\bigr|^2 \equiv \sum_{k=0}^{N-1}\bigl|X^{(H)}[k]\bigr|^2, \qquad (8.39)$$

so that the energy measured in data-space is equal to that measured in transform-space – which, for the purposes of this chapter, will be that measured in Hartley-space. This result, combined with the familiar DFT-based technique of obtaining an interpolated frequency spectrum by performing the DFT on a zero-padded version of the input data set, is now exploited to yield a simple algorithm for obtaining interpolated DHT outputs where the length of the zero-padded data set is a power of four (so that the length of the original data set is a power of two, but not a power of four). As a result, the DHT may be efficiently carried out via the R₂⁴ FHT with the outputs sub-sampled by a factor of two – so that only the even-addressed outputs need actually be computed by the last stage of GD-BFLYs – to yield Hartley-space samples with the required resolution. These outputs may then be fed to the Hartley-space to Fourier-space conversion routine, as described in Sect. 3.4 of Chap. 3, to yield the required N-point real-data DFT outputs.

8.3.1 Derivation of Radix-2 Algorithm via Half-Resolution Approach

Let us start by applying the DHT to a data set of length N, so that the output, denoted {X_N^(H)[k]}, is given by

$$X_N^{(H)}[k] = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x[n]\,\mathrm{cas}(2\pi nk/N) \qquad (8.40)$$

and then apply the 2N-point DHT to a data set of length 2N obtained by appending N zero-valued samples to the same N samples as used above, so that the output, denoted {X_2N^(H)[k]}, is given by

$$X_{2N}^{(H)}[k] = \frac{1}{\sqrt{2N}}\sum_{n=0}^{2N-1} x[n]\,\mathrm{cas}(2\pi nk/2N). \qquad (8.41)$$

Then, by considering only the even-addressed outputs of Eq. 8.41, we have that

$$X_{2N}^{(H)}[2k] = \frac{1}{\sqrt{2N}}\sum_{n=0}^{2N-1} x[n]\,\mathrm{cas}(2\pi n2k/2N) = \frac{1}{\sqrt{2N}}\sum_{n=0}^{N-1} x[n]\,\mathrm{cas}(2\pi nk/N), \qquad (8.42)$$

so that

$$X_{2N}^{(H)}[2k] = \tfrac{1}{\sqrt{2}}\,X_N^{(H)}[k]. \qquad (8.43)$$
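Equation 8.43 is easily verified numerically, as in the following sketch (an FFT-derived DHT with the 1/√N normalization of Eqs. 8.40/8.41 is assumed):

```python
import numpy as np

def dht_norm(x):
    """Normalized DHT via the FFT: H[k] = (Re F[k] - Im F[k]) / sqrt(N)."""
    F = np.fft.fft(x)
    return (F.real - F.imag) / np.sqrt(len(x))

rng = np.random.default_rng(3)
N = 32
x = rng.standard_normal(N)
XN  = dht_norm(x)                                  # Eq. 8.40
X2N = dht_norm(np.r_[x, np.zeros(N)])              # Eq. 8.41, zero-padded input

assert np.allclose(X2N[0::2], XN / np.sqrt(2))     # Eq. 8.43
```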

This means that the signal energy measured at index 'k' using the N-point transform is equal to twice that obtained when it is measured at the corresponding index (i.e. 2k) using the 2N-point transform, as with the longer transform the energy is being spread over twice as many outputs. In fact, from the discussion of Parseval's theorem in Sect. 3.5.9 of Chap. 3, we have that

$$\sum_{n=0}^{N-1}\bigl|x[n]\bigr|^2 \equiv \sum_{k=0}^{N-1}\bigl|X_N^{(H)}[k]\bigr|^2 = \sum_{k=0}^{N-1}\Bigl(\bigl|X_{2N}^{(H)}[2k]\bigr|^2 + \bigl|X_{2N}^{(H)}[2k+1]\bigr|^2\Bigr) = \sum_{k=0}^{N-1}\Bigl(\tfrac{1}{2}\bigl|X_N^{(H)}[k]\bigr|^2 + \bigl|X_{2N}^{(H)}[2k+1]\bigr|^2\Bigr), \qquad (8.44)$$


so that one half of the signal energy is contained in the even-addressed outputs and the other half in the odd-addressed outputs. The Hartley-space outputs of interest correspond to those possessing the even-valued addresses, which include the zero-frequency term, so that although the solution to the problem of computing the 2N-point DHT – as carried out by the R₂⁴ FHT – may be used to produce all 2N outputs, both the even-addressed and the odd-addressed terms, it is only the even-addressed outputs that actually need to be computed by the last stage of GD-BFLYs and subsequently converted from Hartley-space to Fourier-space using the conversion routine equations described in Sect. 3.4 of Chap. 3.

8.3.2 Implementation of Half-Resolution Algorithm

Although the input data need only be generated N samples at a time, the on-chip memory required by the 2N-point R₂⁴ FHT – in the form of two sets, each of eight PDM banks, required for double-buffered storage of the real-valued input data set, and three LUTs required by the PCM for storage of the trigonometric coefficients, as discussed in Chap. 6 – needs to cater for twice that amount of data and up to twice the corresponding number of trigonometric coefficients, due to the fact that half-resolution processing is being used to derive the required DFT outputs. As a result, each of the PDM banks needs to be able to hold N/4 data samples (instead of the standard N/8), although only N/8 samples need to be updated each time, and each of the LUTs needs to be able to hold either N/2 trigonometric coefficients (instead of the standard N/4), for Versions I and II of the R₂⁴ FHT solution, or √(2N)/2 trigonometric coefficients (instead of the standard √N/2) for Versions III and IV. Thus, disregarding the complexity requirement of the standard Hartley-space to Fourier-space conversion routine, which from Sect. 3.4 of Chap. 3 is trivial, the computation of the N-point real-data DFT by means of a 2N-point R₂⁴ FHT via the half-resolution approach (with only the even-addressed outputs being computed by the final temporal stage of GD-BFLYs) has an O(N.log₄N) time-complexity, denoted T_HR, which may be approximated by

$$T_{HR} = \tfrac{1}{4}N\bigl(\log_4 2N - 1\bigr) + \tfrac{1}{8}N \qquad (8.45)$$

clock cycles, this figure excluding any start-up/pipeline delays. With a data set refresh rate of N samples every update period of N clock cycles, this suggests a real-time capability able to cater for those data sets for which N ≤ 64, which is clearly considerably less than that achievable via the double-resolution approach of Sect. 8.2. Also, following the analysis of Sect. 8.2.2, the space-complexity can be shown to possess a 'worst-case' memory component, denoted M_HR^(W), of

$$M_{HR}^{(W)} = 2 \times 2N + \bigl(3 \times \tfrac{1}{4} \times 2N\bigr) = \tfrac{11}{2}N \qquad (8.46)$$

words, and a 'best-case' memory component, denoted M_HR^(B), of

$$M_{HR}^{(B)} = 2 \times 2N + \Bigl(3 \times \tfrac{1}{2}\sqrt{2N}\Bigr) = 4N + \tfrac{3}{2}\sqrt{2N} \qquad (8.47)$$

words. Thus, it is evident from the time-complexity figure of Eq. 8.45 that in order to produce a new Hartley-space interpolated output data set of length 2N every N clock cycles, as required, it will be necessary to set up a new input data set every N clock cycles, where each such data set comprises N new samples and, appended to these, an additional N zero-valued samples. This means, from the time-complexity figures of Eqs. 6.11 and 6.12 in Chap. 6, that for those data sets for which N > 64, the throughput rate of the standard R₂⁴ FHT-based approach would have to be increased through some appropriate means if it is to keep up with the data set refresh rate. The simplest way of achieving this would be to have two or more R₂⁴ FHTs applied to consecutive input data sets, in turn, in order to produce interleaved output data sets and thus to increase the throughput rate through the simple replication of silicon resources – as already discussed in Sect. 6.6 of Chap. 6.

Thus, with a 'dual-R₂⁴ FHT' solution, for example, the operation of the two R₂⁴ FHTs would be offset by N clock cycles (the update period, as dictated by the data set refresh rate) relative to each other and would be running in parallel – see Fig. 8.5. One R₂⁴ FHT (with its own double-buffered DSM) would be assigned the task of processing all the even-addressed input data sets, whilst the other (also with its own double-buffered DSM) would be assigned the task of processing all the odd-addressed input data sets. In this way, the permissible latency of each R₂⁴ FHT would now be bounded above by 2N clock cycles, rather than by N clock cycles, which would extend the real-time capability from one able to cater for those data sets for which N ≤ 64 to one able to cater for those data sets for which N ≤ 16,384 – albeit achieved at the cost of a doubling of the silicon resources through the need for two R₂⁴ FHTs. Continuing in this fashion, the adoption of three R₂⁴ FHTs operating in parallel would further extend the real-time capability to one able to cater for those data sets for which N ≤ 4¹¹.

[Fig. 8.5 Dual-R₂⁴ FHT approach to half-resolution processing scheme: successive regularized FHTs are launched at t = 0, N, 2N, ..., 7N clock cycles, the two engines alternating so that their operation is offset by N clock cycles relative to each other]


Therefore, given that a single 2N-point R₂⁴ FHT requires twice the PDM requirement and up to twice the PCM requirement (depending upon the particular addressing scheme used) of a single N-point R₂⁴ FHT – although the same arithmetic component in terms of the numbers of multipliers and adders – achieving the required throughput rate via the use of two 2N-point R₂⁴ FHTs operating in parallel would involve a space complexity with up to four times the memory component and twice the arithmetic component of a hypothetical R₂⁴ FHT-based solution directly applicable to the computation of the real-data DFT for such transform lengths. Thus, the achievable computational density for the radix-2 real-data FFT derived via the half-resolution approach (which is able to meet the required timing constraint) may be said to lie between one quarter and one half of that achievable for the radix-4 real-data FFT derived via the conventional use of the R₂⁴ FHT (namely, where the length of the data set being processed is a power of four), the exact fraction being dependent upon the length of the transform – the longer the transform, the larger the relative memory component and the lower the relative computational density – and upon the chosen implementation of the R₂⁴ FHT. The practicality of such a solution is therefore very much dependent upon the implementational efficiency of the R₂⁴ FHT which, from the results described in Chap. 6, suggests the R₂⁴ FHT-based approach to solving both the radix-2 and the radix-4 versions of the real-data FFT to be perfectly viable.

8.4 Comparative Complexity Analysis with Standard Radix-2 FFT

The first of the two solutions discussed in this chapter, which was based upon the double-resolution approach of Sect. 8.2, showed how the highly parallel R₂⁴ FHT may be effectively exploited for the derivation of a radix-2 real-data FFT algorithm of length 2N, where N is a power of four. The solution involved FHT-based processing at double the required transform-space resolution via the application of half-length R₂⁴ FHTs to both the even-addressed and the odd-addressed subsequences of the input data set. The subsequent operation of the R4FHT-to-R2FFT conversion routine, being performed in a regular and highly parallel fashion via the efficient use of partitioned memory, enabled the computational benefits of the R₂⁴ FHT to be fully exploited.

The solution has some other interesting properties, even when the complexity is viewed purely in terms of sequential arithmetic operation counts, as when assessed via the 'Performance Metric for Single-Processor Sequential Computing Device' stated in Sect. 1.8 of Chap. 1. The computation of the 2N-point real-data DFT, for when N is a power of four, requires a total of

$$C_{FFT}^{mply} = \tfrac{1}{2}\bigl(\tfrac{1}{2} \times 2N \times \log_2 2N \times 4\bigr) = 2N\log_2 2N \qquad (8.48)$$

real multiplications and

$$C_{FFT}^{adds} = \tfrac{1}{2}\bigl(\tfrac{1}{2} \times 2N \times \log_2 2N \times 6\bigr) = 3N\log_2 2N \qquad (8.49)$$

real additions, when obtained via one of the real-from-complex strategies discussed in Chap. 2 using the standard complex-data radix-2 Cooley-Tukey algorithm (these figures including a reduction by a factor of two to account for the simultaneous production of two real-data FFT output data sets), but only

$$C_{FHT}^{mply} = 2 \times \bigl(\tfrac{1}{8} \times N \times \log_4 N \times 9\bigr) + \bigl(\tfrac{1}{4} \times N \times 8\bigr) = \tfrac{9}{4}N\log_4 N + 2N \qquad (8.50)$$

real multiplications and

$$C_{FHT}^{adds} = 2 \times \bigl(\tfrac{1}{8} \times N \times \log_4 N \times 31\bigr) + \bigl(\tfrac{1}{4} \times N \times 20\bigr) = \tfrac{31}{4}N\log_4 N + 5N \qquad (8.51)$$

real additions, when obtained via the combined use of the Version II solution of the R₂⁴ FHT and the R4FHT-to-R2FFT conversion routine.

Thus, for the computation of a 2048-point real-data DFT, for example, this means 22,528 real multiplications and 33,792 real additions via the standard complex-data radix-2 FFT, or 13,568 real multiplications and 44,800 real additions via the R₂⁴ FHT-based solution, implying a significant reduction of about 40% in the number of real multiplications by using the R₂⁴ FHT-based solution outlined here. This is obtained, however, at the expense of an increase of approximately 33% in the number of real additions, although the complexity of an addition is considerably less than that of a multiplication – the real multiplier possesses an O(L²) space complexity, whilst the real adder possesses an O(L) space complexity, where 'L' is the word length. The split-radix algorithm could of course be used instead of the Cooley-Tukey algorithm to further reduce the multiplication count of the radix-2 FFT, but only at the expense of a loss of regularity in the FFT design.
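These operation counts are easily reproduced; the following is a simple check of Eqs. 8.48–8.51 for the 2048-point example (i.e. 2N = 2048, N = 1024):

```python
import math

N = 1024                                                   # half-length of the 2048-point DFT
fft_mults = 2 * N * math.log2(2 * N)                       # Eq. 8.48 -> 22528
fft_adds  = 3 * N * math.log2(2 * N)                       # Eq. 8.49 -> 33792
fht_mults = 9 * N * math.log(N, 4) / 4 + 2 * N             # Eq. 8.50 -> 13568
fht_adds  = 31 * N * math.log(N, 4) / 4 + 5 * N            # Eq. 8.51 -> 44800
print(fft_mults, fft_adds, fht_mults, fht_adds)
print(1 - fht_mults / fft_mults)                           # ~0.40: ~40% fewer multiplications
```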


Consider next the situation where the solution's complexity is viewed in terms of the computational density, as when assessed via the 'Performance Metric for Silicon-Based Parallel Computing Device' stated in Sect. 1.8 of Chap. 1. When only 'block-based' solutions for the parallel computation of the 2048-point real-data DFT are considered and compared (so as to facilitate a like-with-like comparison), the standard complex-data radix-2 FFT with four radix-2 butterflies operating in parallel, in SIMD fashion, is able to achieve a time-complexity of 2816 clock cycles (this figure including a reduction by a factor of two to account for the simultaneous production of two real-data FFT output data sets) at the expense of a space-complexity with an arithmetic component of 16 real multipliers and 24 real adders, where for each butterfly the set of multipliers operates in parallel, as does the set of adders. In comparison, the sequential-mode version of the double-resolution approach using the Version II solution of the R₂⁴ FHT is able to achieve a time-complexity of 1536 clock cycles at the expense of a space-complexity with an arithmetic component of 17 real multipliers and 51 real adders. As a result, the R₂⁴ FHT-based solution looks capable of outperforming the standard complex-data radix-2 FFT in terms of both sequential operation counts and computational density by a factor of about two to one for the example of the 2048-point data set considered.

Note that the arithmetic components of the above two solutions are clearly comparable, given the relative space complexities of the real multiplier and real adder of O(L²) and O(L), respectively, where 'L' is the word length. The associated memory components of the two solutions are also comparable, given that whereas the standard complex-data radix-2 FFT solution needs to store and process two sets of real-valued data at a time, when performed via one of the real-from-complex strategies, the R₂⁴ FHT-based solution needs the use of the two temporary data stores, DME and DMO. When comparing the computational density of the standard complex-data radix-2 Cooley-Tukey algorithm, which with eight radix-2 butterflies operating in parallel has an arithmetic component of 32 real multipliers and 48 real adders, with that of the parallel-mode version of the double-resolution approach, which has an arithmetic component of 26 real multipliers and 82 real adders, the time-complexities reduce to 1408 clock cycles and 896 clock cycles, respectively, although the standard complex-data radix-2 FFT uses approximately 25% more real multipliers than the double-resolution approach. This suggests that for the 2048-point data set considered, the advantage of the double-resolution approach over the more conventional complex-data radix-2 FFT, in terms of increased computational density, is similar for the two modes of operation, namely, whether the solution is operated in sequential or parallel mode.

The second of the two solutions discussed in this chapter, which was based upon the half-resolution approach of Sect. 8.3, showed how the highly parallel GD-BFLY may be effectively exploited for the derivation of a radix-2 real-data FFT algorithm of length N, where 2N is a power of four. The solution involved FHT-based processing at one half the required transform-space resolution via the application of one double-length R₂⁴ FHT. The solution appears clearly unable to match the real-time capability of the double-resolution approach, however, in terms of its range of permissible transform lengths, although it was seen how this could be easily overcome through the simple replication of silicon resources, whereby separate R₂⁴ FHTs are applied to alternate even-addressed and odd-addressed input data sets. Although much simpler to implement than the double-resolution approach – there being no need for a complex conversion routine – it is clearly unable to achieve the same level of performance, either in terms of arithmetic operation counts or computational density.

8.5 Discussion

Summarizing the results of this chapter, the derivations of two new radix-2 real-data FFT algorithms have been discussed in some detail, where the transform length is assumed now to be a power of two, but not a power of four. The performances of the two solutions, which were based upon the double-resolution and half-resolution approaches defined in the introduction of Sect. 8.1, were discussed in terms of both the space-complexity and the time-complexity, with a detailed comparison being made against that of the more conventional complex-data radix-2 FFT for the case where the input data set was of length 2048. The two solutions have shown how the R₂⁴ FHT might be applied, potentially, to a great many more problems than originally envisioned, with the scalability of the R₂⁴ FHT design carrying over, in each case, to those of the two radix-2 FFTs, and with the solution based upon the double-resolution approach of Sect. 8.2 looking particularly attractive in terms of both arithmetic-complexity and computational density.

A point worth noting with the half-resolution approach, as described above in Sect. 8.3, is that if one performs the DHT of a data sequence of length four, say, such that

$$\mathrm{DHT}\bigl(\{x[0], x[1], x[2], x[3]\}\bigr) = \bigl\{X^{(H)}[0], X^{(H)}[1], X^{(H)}[2], X^{(H)}[3]\bigr\}, \qquad (8.52)$$

then it is also true, via a theorem applicable to both the DFT and the DHT, namely, the stretch or repeat theorem [4], that

$$\mathrm{DHT}\bigl(\{x[0], x[1], x[2], x[3], x[0], x[1], x[2], x[3]\}\bigr) = \bigl\{2X^{(H)}[0], 0, 2X^{(H)}[1], 0, 2X^{(H)}[2], 0, 2X^{(H)}[3], 0\bigr\}, \qquad (8.53)$$

this result being true not just for the four-point data sequence shown, but for a data sequence of any length.
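The theorem is easily verified numerically, as in the following sketch (using an FFT-derived, unnormalized DHT as a stand-in):

```python
import numpy as np

def dht(x):
    """Unnormalized DHT via the FFT (H[k] = Re F[k] - Im F[k])."""
    F = np.fft.fft(x)
    return F.real - F.imag

x = np.array([1.0, -2.0, 0.5, 3.0])
H = dht(x)
H2 = dht(np.tile(x, 2))          # transform of the repeated data set

# Eq. 8.53: even-addressed outputs are doubled, odd-addressed outputs vanish.
assert np.allclose(H2[0::2], 2 * H)
assert np.allclose(H2[1::2], 0)
```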

As a result, an alternative to the zero-padding approach – one involving instead the idea of transforming a repeated, or replicated, data set – could be used to extract the required FHT outputs from those of a double-length R₂⁴ FHT. Note, however, that the magnitudes of the required even-addressed output samples are twice what they should be, so that scaling may be necessary – namely, division by two, which in fixed-point hardware reduces to a simple right-shift operation – in order to achieve the correct magnitudes, this being applied either to the input samples or to the output samples.


References

1. K.J. Jones, Design and parallel computation of regularised fast Hartley transform. IEE Proc. Vis. Image Signal Process. 153(1), 70–78 (2006)
2. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)
3. K.J. Jones, The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments, Series on Signals & Communication Technology (Springer, 2010)
4. J.O. Smith III, Mathematics of the Discrete Fourier Transform (DFT) with Audio Applications (W3K Publishing, 2007)
5. H.V. Sorensen, D.L. Jones, M.T. Heideman, C.S. Burrus, Real-valued fast Fourier transform algorithms. IEEE Trans. ASSP 35(6), 849–863 (1987)

Chapter 9
Computation of Common DSP-Based Functions Using Regularized Fast Hartley Transform

9.1 Introduction

Having now seen how the R₂⁴ FHT might be used for the efficient parallel computation of an N-point DFT, where N may be a power of either two or four – although for optimal computational density it should be a power of four – the monograph continues with the description of a number of DSP-based functions where the adoption of Hartley-space, rather than Fourier-space, as the chosen computational domain may lead to conceptually and computationally simplified solutions, particularly when based upon the adoption of the R₂⁴ FHT. Three particular sets of functions common to many modern DSP systems are discussed, namely:

1. The up-sampling and differentiation – for the case of both first and second derivatives – of a real-valued signal, either individually or in combination
2. The correlation function of two real-valued or complex-valued signals, where the signals may both be of infinite duration, as encountered with cross-correlation, or where one signal is of finite duration and the other of infinite duration, as encountered with auto-correlation
3. The channelization of a real-valued signal which, for the case of a single channel (or small number of channels), may be achieved by means of a DDC process where the filtering is carried out via fast Hartley-space convolution, whilst for the case of multiple channels, may be achieved via the application of the polyphase DFT filter bank where the DFT is carried out with an FHT

One important area of wireless communications where all three sets of functions might typically be encountered is that relating to the geolocation [11] of signal emitters, where there is a requirement to produce accurate timing measurements from the data gathered at a number of sensors, these measurements being generally obtained from the up-sampled outputs of a correlator. When the signal under analysis is of sufficiently wide bandwidth, however, the data would first have to be partitioned in frequency before such measurements could be made, so as to

optimize the SNR of the signal for specific frequency bands of interest prior to the correlation process. For the case of a single channel (or small number of channels), the associated filtering operation may, depending upon the parameters, be most efficiently carried out by means of fast transform-space convolution, whilst when there is a sufficiently large number of channels which are of equal spacing and of equal bandwidth, this process – which is generally referred to in the technical literature as 'channelization' – is best carried out by means of a polyphase DFT filter bank [1, 4, 15].

The adoption of the transform-space approach in signal processing makes particular sense when a significant amount of the processing is able to be efficiently carried out in transform-space, so that several distinct tasks might be beneficially performed there before the resulting signal is transformed back to data-space. A multi-sensor digital signal conditioner [5] has been defined, for example, which exploits the transform-space approach to carry out in a highly efficient manner, in Fourier-space, the various tasks of sample-rate conversion, spectral shaping or filtering, and malfunctioning-sensor detection and compensation, prior to the formation in the time-domain of conventional beams [9]. A novel transform-space scheme for enhancing the performance of multi-carrier communications in the presence of inter-modulation distortion, or IMD – a seemingly intractable problem [6, 8] – is also briefly discussed in Sect. 9.6, based upon the exploitation of fast polynomial arithmetic/algebraic techniques and a suitably defined FFT routine, where the integer-valued nature of the algorithm inputs (which are simply the carrier frequency locations) suggests that the FHT might prove an attractive choice for carrying out the required forward and inverse transformations.

9.2 Fast Transform-Space Convolution and Correlation

Given the emphasis placed on the transform-space approach in this chapter, it is perhaps worth illustrating firstly its importance by considering the simple case of the filtering of a real-valued signal by means of an FIR filter of length N. A linear system [12, 13] such as this is characterized by means of an output signal that is obtained from the convolution of the system input signal with the system impulse response – as represented by a finite set of filter coefficients. A direct data-space formulation of the problem may be written, in an un-normalized complex-data form, as

$$R_{h,x}^{conv}[k] = \sum_{n=0}^{N-1} h^{*}[n]\,x[k-n], \qquad (9.1)$$

where the superscript '*' refers to the operation of complex conjugation, so that each filter output requires N multiplications – this yields an arithmetic complexity of O(N²) arithmetic operations for the production of N filter outputs. Alternatively, a fast Hartley-space convolution approach – see Sect. 3.5 of Chap. 3 – combined with the familiar 'overlap-save' or 'overlap-add' technique [2] associated with conventional FFT-based linear convolution (where the FHT of the filter coefficient set is fixed and pre-computed), might typically involve the application of two 2N-point FHTs and one element-wise transform-space product of length 2N in order to produce N filter outputs – this yields an arithmetic complexity of O(N.logN) arithmetic operations. Thus, with a suitably chosen FHT algorithm, clear computational gains are potentially achievable via fast Hartley-space convolution for even relatively small values of N, although the larger the problem, the larger the potential gains.
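By way of illustration, the following sketch performs a circular convolution of two real sequences entirely in Hartley-space via the Chap. 3 convolution theorem; an FFT-derived, unnormalized DHT is used as a stand-in, and the overlap-save segmentation needed for true FIR filtering of a long signal is omitted:

```python
import numpy as np

def dht(x):
    """Unnormalized DHT via the FFT (H[k] = Re F[k] - Im F[k])."""
    F = np.fft.fft(x)
    return F.real - F.imag

def hartley_circular_convolve(h, x):
    """Circular convolution of two real sequences computed in Hartley-space."""
    N = len(x)
    rev = (-np.arange(N)) % N                  # index N-k (mod N)
    Hh, Hx = dht(h), dht(x)
    Hy = 0.5 * (Hh * (Hx + Hx[rev]) + Hh[rev] * (Hx - Hx[rev]))
    return dht(Hy) / N                         # inverse DHT = forward DHT / N

rng = np.random.default_rng(4)
h, x = rng.standard_normal(16), rng.standard_normal(16)
ref = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)))   # FFT-based check
assert np.allclose(hartley_circular_convolve(h, x), ref)
```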

The correlation function is generally defined as measuring the degree of correlation, or similarity, between a given signal and a shifted replica of that signal. From this, the basic data-space formulation for the cross-correlation function of two arbitrary complex-valued signals may be written, in an un-normalized form and with arbitrary upper and lower limits, as

$$R_{h,x}^{corr}[k] = \sum_{n=lower}^{upper} h^{*}[n]\,x[k+n], \qquad (9.2)$$

which is similar in form to that for the convolution function of Eq. 9.1, except that there is no need to apply the folding operation [2] to one of the two functions to be correlated. In fact, if either of the two functions to be correlated is an even function, then the operations of convolution and correlation are equivalent. The above expression is such that:

1. When both sequences are of finite length, it corresponds to the cross-correlation function of two finite-duration signals – to be discussed in Sect. 9.4.2.
2. When one sequence is of infinite length and the other a finite-length stored reference, it corresponds to the auto-correlation function – to be discussed in Sect. 9.4.3.
3. When both sequences are of infinite length, it corresponds to the cross-correlation function of two continuous data streams – to be discussed in Sect. 9.4.4.

As evidenced from the discussion above relating to the convolution-based filtering problem, the larger the correlation problem, the greater the potential benefits to be gained from the adoption of a transform-space approach, particularly when the correlation operation is carried out by means of a fast unitary/orthogonal transform such as the FFT or the FHT.

9.3 Up-Sampling and Differentiation of Real-Valued Signal

This section looks briefly at how two basic DSP-based functions, namely, those of up-sampling and differentiation, might be efficiently carried out by first transforming the real-valued signal from data-space to Hartley-space, via the application of an FHT, and then modifying in some way the resulting Hartley-space data, before returning to data-space via the application of a second FHT to obtain the data corresponding to an appropriately modified version of the original real-valued signal.

9.3.1 Up-Sampling via Hartley-Space

The first function considered is that of up-sampling where the requirement is to increase the sampling rate of the signal without introducing additional frequency components to the signal outside of its frequency range or band of definition – this function being also referred to as band-limited interpolation. Suppose that the signal is initially represented by means of N real-valued samples and that it is required to increase or interpolate this by a factor of L. To achieve this, the real-valued data is first transformed from data-space to Hartley-space, via the application of an FHT of length N, with zero-valued samples being then inserted between the samples of the Hartley-space data according to the following rule [14]:

$$Y[k] = \begin{cases} L \cdot X[k] & \text{for } k \in [0,\ N/2-1] \\ \tfrac{1}{2} L \cdot X[N/2] & \text{for } k = N/2 \\ 0 & \text{for } k \in [N/2+1,\ M-N/2-1] \\ \tfrac{1}{2} L \cdot X[N/2] & \text{for } k = M-N/2 \\ L \cdot X[k-M+N] & \text{for } k \in [M-N/2+1,\ M-1] \end{cases} \qquad (9.3)$$

where M = L × N, before returning to data-space via the application of a second FHT, this time of length M, to obtain the resulting up-sampled signal, as required – see Fig. 9.1. Note that the non-zero terms in the above expression have been magnified by a factor of L so as to ensure, upon return to data-space, that the magnitudes of the original samples are preserved. When L is chosen to be a power of two, however, this reduces to a simple left-shift operation of appropriate length.

Note that the above technique, which has been defined for the up-sampling of a single segment of signal data, may be straightforwardly applied to the case of a continuous signal through the piecing together of multiple data-space signal segments via a suitably adapted reconstruction technique [3] which combines the use of the overlap-save technique, as associated with conventional FFT-based linear convolution, with that of temporal windowing [10], in order to keep the root-mean-square (RMS) interpolation error to an acceptable level. Without taking such precautions, the interpolation error may well prove to be unacceptably high due to the inclusion of error maxima near the segment boundaries – this problem being referred to in the technical literature as the 'end effect' or 'boundary effect' [2].
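A minimal numpy sketch of Eq. 9.3 for a single segment follows, assuming N even and L a positive integer; the dht helper (DHT via a complex FFT) and the function name are illustrative assumptions.

```python
import numpy as np

def dht(x):
    F = np.fft.fft(x)                  # DHT via FFT: H[k] = Re(F) - Im(F)
    return F.real - F.imag

def upsample_hartley(x, L):
    """Band-limited interpolation of a real signal by integer factor L (Eq. 9.3)."""
    N, M = len(x), L * len(x)
    X = dht(x)
    Y = np.zeros(M)
    Y[: N // 2] = L * X[: N // 2]              # low-index terms, magnified by L
    Y[N // 2] = 0.5 * L * X[N // 2]            # Nyquist term split across two bins
    Y[M - N // 2] = 0.5 * L * X[N // 2]
    Y[M - N // 2 + 1 :] = L * X[N // 2 + 1 :]  # mirrored high-index terms
    return dht(Y) / M                          # second FHT (inverse, 1/M scaling)

t = np.arange(32) / 32.0
y = upsample_hartley(np.sin(2 * np.pi * 3 * t), 4)      # 128-sample interpolation
assert np.allclose(y[::4], np.sin(2 * np.pi * 3 * t), atol=1e-9)
```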

Fig. 9.1 Scheme for up-sampling of signal using FHT: {x[n]} → FHT → {X(H)[k]} → zero-pad centre of spectrum (see Eq. 9.3) → {Y(H)[k]} → FHT → {y[n]}

9.3.2 Differentiation via Hartley-Space

The second function considered is that of differentiation where, from the first and second derivative theorems of Sect. 3.5 in Chap. 3, it was stated, for the case of a data set of length N, that

$$\mathrm{DHT}(\{x'[n]\}) = \left\{ 2\pi k \, X^{(H)}[N-k] \right\} \qquad (9.4)$$

and

$$\mathrm{DHT}(\{x''[n]\}) = \left\{ -4\pi^{2} k^{2} \, X^{(H)}[k] \right\}, \qquad (9.5)$$

respectively, so that by transforming the real-valued signal from data-space to Hartley-space, via the application of an FHT of length N, and then modifying the resulting Hartley-space samples according to Eq. 9.4 or 9.5, before returning to data-space via the application of a second FHT, also of length N, it is possible to obtain the first or second derived function corresponding to the original real-valued signal, as required – see Fig. 9.2.
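A minimal sketch applying Eqs. 9.4 and 9.5 literally is given below; the dht helper is again an FFT-based stand-in for a true FHT, and note, as an assumption of this sketch, that for content in the upper half of the index range a signed-frequency weighting (k replaced by k − N for k > N/2) would generally be required in place of the literal 2πk factor of the theorem statements.

```python
import numpy as np

def dht(x):
    F = np.fft.fft(x)
    return F.real - F.imag

def differentiate_hartley(x, order=1):
    """First/second derivative of a real signal via Hartley-space
    (literal rendering of Eqs. 9.4 and 9.5)."""
    N = len(x)
    k = np.arange(N)
    X = dht(x)
    if order == 1:
        Y = 2 * np.pi * k * X[(N - k) % N]   # Eq. 9.4: dual term X[N-k], mod N
    elif order == 2:
        Y = -4 * np.pi**2 * k**2 * X         # Eq. 9.5
    else:
        raise ValueError("order must be 1 or 2")
    return dht(Y) / N                        # second FHT returns to data-space
```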

9.3.3 Combined Up-Sampling and Differentiation

Note from the results of the above two sections that it is a straightforward task to carry out both the up-sampling and the differentiation of the real-valued signal by simply applying both sets of modifications to the same set of Hartley-space samples before returning to data-space. Thus, after modifying the Hartley-space samples according to Eq. 9.4 or 9.5 of Sect. 9.3.2, the resulting samples are then zero-padded according to Eq. 9.3 of Sect. 9.3.1, before being returned to data-space via the application of a second FHT to yield an up-sampled version of the first or second derived function of the original real-valued signal, as required – see Fig. 9.3.

Fig. 9.2 Scheme for differentiation of signal using FHT: {x[n]} → FHT → {X(H)[k]} → modify: Y(H)[k] = 2πk · X(H)[N−k] → {Y(H)[k]} → FHT → {y[n]}

Fig. 9.3 Scheme for combined up-sampling and differentiation of signal using FHT: {x[n]} → FHT → {X(H)[k]} → modify: Z(H)[k] = 2πk · X(H)[N−k] → {Z(H)[k]} → zero-pad centre of spectrum (see Eq. 9.3) → {Y(H)[k]} → FHT → {y[n]}
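Composing the two previous sketches, the two Hartley-space modifications may be applied back-to-back between a single forward/inverse FHT pair, per Sect. 9.3.3; the function name is hypothetical and the dht helper is as before.

```python
import numpy as np

def dht(x):
    F = np.fft.fft(x)
    return F.real - F.imag

def upsample_derivative_hartley(x, L, order=1):
    """Up-sampled first/second derivative via one forward and one inverse FHT:
    modify per Eq. 9.4 or 9.5, then zero-pad per Eq. 9.3 (Sect. 9.3.3)."""
    N, M = len(x), L * len(x)
    k = np.arange(N)
    X = dht(x)
    # Step 1: differentiate in Hartley-space
    Z = 2 * np.pi * k * X[(N - k) % N] if order == 1 else -4 * np.pi**2 * k**2 * X
    # Step 2: zero-pad the centre of the modified spectrum
    Y = np.zeros(M)
    Y[: N // 2] = L * Z[: N // 2]
    Y[N // 2] = Y[M - N // 2] = 0.5 * L * Z[N // 2]
    Y[M - N // 2 + 1 :] = L * Z[N // 2 + 1 :]
    return dht(Y) / M
```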

9.4 Correlation of Two Arbitrary Signals

Having covered very briefly the problems of up-sampling and differentiation, the computationally more intensive problem of correlation, as introduced in Sect. 9.2, is now addressed in some detail.

As evidenced from the discussions of Sect. 9.2 relating to fast transform-space convolution and correlation, when the correlation operation is performed upon two finite segments of signal, each comprising N samples, a direct data-space implementation will yield an arithmetic complexity of O(N²) arithmetic operations, whereas a transform-space implementation involving two forward transforms, one element-wise transform-space product and one inverse transform, will yield an arithmetic complexity of O(N·logN) arithmetic operations, via the application of a fast unitary/orthogonal transform. This suggests that the larger the correlation problem, the greater the potential benefits to be gained from the adoption of a transform-space approach.

A key ingredient for the success and the generality of the transform-space approach is in being able to carry out a linear correlation by means of one or more circular correlations, so that by invoking the circular correlation theorem [2] – which is analogous to the more familiar circular convolution theorem [2], already abbreviated in Chap. 1 to the CCT – it is possible to move the processing from data-space to transform-space where a fast algorithm may be exploited. Thus, when the data in question is complex-valued, the processing may be carried out in Fourier-space via the use of an FFT, whereas when the data is real-valued, it may be carried out in Hartley-space via the use of an FHT.

Note that with the problem of geolocation, it is possible for either cross-correlation or auto-correlation to be encountered: if the sensors operate in passive mode, then each operation will be assumed to be that of cross-correlation and thus to be performed on signals from two different sensors to provide time-difference-of-arrival (TDOA) or equivalent relative-range measurements, whereas if the sensors operate in active mode, then each operation will be assumed to be that of auto-correlation (so that one of the two signals is simply a stored reference of the other) to provide time-of-arrival (TOA) or equivalent range measurements. The essential difference, in terms of processing requirement, between the two modes of operation is that with auto-correlation, one of the two signals is of finite duration and the other of infinite duration, whilst with cross-correlation, both of the signals are of infinite duration. The signal of interest is typically in the form of a sampled pulse or pulse train, for both active and passive systems, so that the received signal, although often regarded as being of infinite duration for the purposes of correlator implementation, is actually a succession of temporally spaced finite-duration segments.

9.4.1 Computation of Complex-Data Correlation via Real-Data Correlation

Although all of the techniques discussed in this chapter are geared to the processing of real-valued signals, it is worth pointing out that as the operation of correlation, denoted by means of the symbol '⊛', is a linear process – thereby satisfying the property of additivity – the correlation of two complex-valued signals, as encountered, for example, when the signal processing is carried out at baseband [4, 12, 13, 15], may be decomposed into the summation of four correlations each operating upon two real-valued signals, so that

$$\{X_R[n] + i \, X_I[n]\} \circledast \{Y_R[n] + i \, Y_I[n]\} \equiv (\{X_R[n]\} \circledast \{Y_R[n]\} + \{X_I[n]\} \circledast \{Y_I[n]\}) + i \, (\{X_R[n]\} \circledast \{Y_I[n]\} - \{X_I[n]\} \circledast \{Y_R[n]\}), \qquad (9.6)$$

this expression taking into account the operation of complex conjugation that needs to be performed upon one of the two input signals, as shown in Eq. 9.2.

Fig. 9.4 Scheme for complex-data correlation via real-data correlation: the four real-data correlations {XR}⊛{YR}, {XI}⊛{YI}, {XR}⊛{YI} and {XI}⊛{YR} are computed independently and then combined, via addition and subtraction, to form {ZR[n]} and {ZI[n]}

The attraction of the 'complex-to-real' decomposition described here for the complex-data correlation operation is that it introduces an additional level of parallelism to the problem, as the resulting real-data correlations are independent and thus able to be computed simultaneously, or in parallel, as shown in Fig. 9.4. This is particularly relevant when the quantities of data to be correlated are large and the throughput requirement high, as a transform-space approach may then be the only viable approach to adopt, leaving the conventional complex-data approach to rely upon the parallelization of the complex-data FFT and its inverse as the only logical means of achieving the required computational throughput. With the complex-to-real decomposition, however, the required performance may be more easily obtained by running in parallel multiple versions of the R24 FHT in both forward and reverse directions. The transformation from data-space to Hartley-space, for example, may be carried out by running in parallel two (when using a stored reference) or four (when cross-correlating two arbitrary signals) R24 FHTs, this being followed by the computation of four sets of element-wise transform-space products, again in parallel, with each such product taking the form of

Fig. 9.5 Scheme for complex-data correlation using FHT: the four real-valued input streams {XR[n]}, {XI[n]}, {YR[n]} and {YI[n]} are transformed by parallel FHTs, combined element-wise in Hartley-space according to Eq. 9.7, additively combined, and returned to data-space by two further FHTs to yield {ZR[n]} and {ZI[n]}

$$Z[k] = \tfrac{1}{2} X^{(H)}[k] \left( Y^{(H)}[k] + Y^{(H)}[N-k] \right) + \tfrac{1}{2} X^{(H)}[N-k] \left( Y^{(H)}[N-k] - Y^{(H)}[k] \right). \qquad (9.7)$$

The results of the four element-wise transform-space products may then be additively combined prior to being transformed back to data-space by running in parallel two R24 FHTs to yield the required correlation results – see Fig. 9.5. Thus, compared to a solution based upon the use of a complex-data FFT, this approach results in a potential doubling of the parallelism (in addition to that achievable via the efficient implementation of the R24 FHT, as discussed in Chap. 6) with which to increase the throughput of the complex-data correlation operation.
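As a concrete rendering of the element-wise combine, the sketch below applies Eq. 9.7 to two real sequences and returns the result to data-space; the dht helper is again an FFT-based stand-in for the R24 FHT, and the lag-sign convention of the resulting circular correlation depends on the argument order. The four real-data correlations of Eq. 9.6 are then just four such combines, added and subtracted as shown in Fig. 9.4.

```python
import numpy as np

def dht(x):
    F = np.fft.fft(x)
    return F.real - F.imag

def hartley_correlate(x, y):
    """Element-wise Hartley-space combine of Eq. 9.7, returned to data-space."""
    N = len(x)
    X, Y = dht(x), dht(y)
    k = np.arange(N)
    Xr, Yr = X[(-k) % N], Y[(-k) % N]             # dual terms X[N-k], Y[N-k]
    Z = 0.5 * X * (Y + Yr) + 0.5 * Xr * (Yr - Y)  # Eq. 9.7
    return dht(Z) / N                             # inverse FHT (1/N scaling)

# With real data the combine yields a circular cross-correlation of x and y:
rng = np.random.default_rng(1)
x, y = rng.standard_normal(8), rng.standard_normal(8)
direct = np.array([sum(y[m] * x[(k + m) % 8] for m in range(8)) for k in range(8)])
assert np.allclose(hartley_correlate(x, y), direct)
```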

9.4.2 Cross-Correlation of Two Finite-Length Data Sets

Before moving on to the two important cases of auto-correlation and cross-correlation where at least one of the two signals is of infinite duration, the simple problem of cross-correlating two finite-duration signals by means of the DHT is considered. To achieve this, if one of the two signal segments is represented by N₁ samples and the other signal segment by N₂ samples, then the length N of the DHT is first chosen so that

$$N \geq N_1 + N_2 - 1. \qquad (9.8)$$

One segment is then pre-zero-padded out to a length of N samples and the other segment post-zero-padded, also out to a length of N samples. Following this, each zero-padded segment is passed through an FHT of length N, their transforms then multiplied, element-by-element, before the transform-space product is transformed back to data-space by means of another FHT, also of length N, to yield the required cross-correlator output. There will, however, be a deterministic shift of length

$$S = N - (N_1 + N_2 - 1) \qquad (9.9)$$

samples, which needs to be accounted for when interpreting the output, as the resulting data set out of the final FHT comprises N samples whereas the correlation of the two segments is known to be only of length N₁ + N₂ − 1. This procedure is outlined in Fig. 9.6.

Fig. 9.6 Scheme for correlation of two signal segments: {x[n]} is pre-zero-padded and {y[n]} post-zero-padded, each is passed through an FHT, the transforms {X(H)[k]} and {Y(H)[k]} are combined element-wise according to Eq. 9.7, and {Z(H)[k]} is returned to data-space by a further FHT to yield {z[n]}
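A minimal sketch of the procedure follows, with N chosen as the next radix-4 length satisfying Eq. 9.8 (any N ≥ N₁ + N₂ − 1 works); which segment receives the pre- versus post-zero-padding follows the text, and the deterministic shift of Eq. 9.9 must still be accounted for when interpreting the N outputs, as noted in the code.

```python
import numpy as np

def dht(x):
    F = np.fft.fft(x)
    return F.real - F.imag

def finite_cross_correlate(x, y):
    """Linear cross-correlation of two finite segments via length-N FHTs,
    with N >= N1 + N2 - 1 (Eq. 9.8) chosen as a radix-4 integer."""
    N1, N2 = len(x), len(y)
    N = 4 ** int(np.ceil(np.log(N1 + N2 - 1) / np.log(4)))
    xp = np.concatenate([np.zeros(N - N1), x])     # pre-zero-padded segment
    yp = np.concatenate([y, np.zeros(N - N2)])     # post-zero-padded segment
    X, Y = dht(xp), dht(yp)
    k = np.arange(N)
    Xr, Yr = X[(-k) % N], Y[(-k) % N]
    Z = 0.5 * X * (Y + Yr) + 0.5 * Xr * (Yr - Y)   # Eq. 9.7 combine
    # N outputs; only N1 + N2 - 1 are meaningful, offset by S = N - (N1 + N2 - 1)
    return dht(Z) / N
```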

9.4.3 Auto-Correlation: Finite-Length Against Infinite-Length Data Sets

The next type of problem considered relates to that of auto-correlation, whereby a finite-duration signal segment – in the form of a stored reference – is correlated against a continuous or infinite-duration signal. The stored reference correlator is commonly referred to in the technical literature as a 'matched filter', where the output of a detector based upon the application of such a filter is known to optimize the peak received SNR in the presence of additive white Gaussian noise (AWGN). The output is also known to correspond – at least for the case of idealized distortion-free and multipath-free propagation – to the auto-correlation function of the stored signal.

This type of problem is best tackled by viewing it as a segmented correlation, a task most simply solved by means of the familiar overlap-save or overlap-add technique associated with conventional FFT-based linear convolution. The approach involves decomposing the infinite-duration received signal into segments and computing the correlation of the stored reference and the received signal as a number of smaller circular correlations. With the overlap-save technique, for example, suitable zero-padding of the stored reference combined with the selection of an appropriate segment length enables the required correlation outputs to be obtained from the segmented circular correlation outputs without the need for further arithmetic. With the overlap-add technique, on the other hand, the received signal segments need also to be zero-padded, with the required correlation outputs being obtained through appropriate combination – although only via addition – of the segmented circular correlation outputs.

Fig. 9.7 Scheme for auto-correlation using FHT: the post-zero-padded stored reference {x[n]} is transformed once, each successive overlapped segment of {y[n]} is transformed, the transforms are combined element-wise according to Eq. 9.7, and each combined set is returned to data-space, with the invalid outputs being discarded

A solution based upon the adoption of the overlap-save technique is as outlined in Fig. 9.7, where the stored reference comprises N₁ samples and the FHT is of length N, where

$$N \geq 2 N_1, \qquad (9.10)$$

and the number of valid samples produced from each length-N signal segment out of the correlator is given by N₂, where

$$N_2 = N - N_1 + 1, \qquad (9.11)$$

these samples appearing at the beginning of each new output segment, with the last N₁ − 1 samples of each such segment being invalid and thus discarded. To achieve this, consecutive signal segments of length N are overlapped by N₁ − 1 samples, with the first such segment being pre-zero-padded by N₁ − 1 samples to account for the lack of a predecessor. The optimal choice of segment length is dependent very much upon the length of the stored reference, with a sensible lower limit being given by twice the length of the stored reference – as given by Eq. 9.10. Clearly, the shorter the segment length, the smaller the memory requirement but the lower the computational efficiency of the solution, whereas the larger the segment length, the higher the computational efficiency but the larger the memory requirement of the solution. Thus, there is once again a direct trade-off to be made of the arithmetic requirement against the memory requirement, according to how long one makes the signal segment.
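The following sketch renders the overlap-save segmented correlation just described; the FHT length N is a free parameter satisfying Eq. 9.10, the dht helper is an FFT-based stand-in for the R24 FHT, and the function name is hypothetical.

```python
import numpy as np

def dht(x):
    F = np.fft.fft(x)
    return F.real - F.imag

def overlap_save_correlate(ref, stream, N):
    """Overlap-save segmented correlation of a stored reference against a long
    signal (Sect. 9.4.3); requires N >= 2 * len(ref), per Eq. 9.10."""
    N1 = len(ref)
    k = np.arange(N)
    R = dht(np.concatenate([ref, np.zeros(N - N1)]))    # fixed, pre-computed
    Rr = R[(-k) % N]
    valid = N - N1 + 1                                  # Eq. 9.11
    stream = np.concatenate([np.zeros(N1 - 1), stream]) # first segment pre-padded
    out = []
    for start in range(0, len(stream) - N + 1, valid):  # segments overlap by N1-1
        S = dht(stream[start : start + N])
        Sr = S[(-k) % N]
        Z = 0.5 * S * (R + Rr) + 0.5 * Sr * (Rr - R)    # Eq. 9.7 combine
        z = dht(Z) / N
        out.append(z[:valid])                           # last N1-1 samples invalid
    return np.concatenate(out)
```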

9.4.4 Cross-Correlation: Infinite-Length Against Infinite-Length Data Sets

The final type of problem considered relates to that of cross-correlation, whereby a continuous or infinite-duration signal is correlated against another signal of similar type. This type of problem, as with the auto-correlation problem of the previous section, is best tackled by viewing it as a segmented correlation, albeit one requiring a rather more complex solution.

With the cross-correlation of two infinite-duration signals, each region of signal that carries information will be of finite duration, so that if 50% overlapped signal segments are generated from the data acquired at each sensor – where the segment length corresponds to twice the anticipated duration of each signal region of interest added to twice the maximum possible propagation delay arising from the separation of the sensors – then for some given acquisition period, the current signal region of interest is guaranteed to appear in the corresponding segment of both sensors. Thus, the cross-correlation breaks down into the successive computation of a number of overlapped cross-correlations of finite-duration signals, one of which corresponds to the cross-correlation of the current signal region of interest. If the length of the segment is short enough to facilitate its direct computation – that is, there is adequate memory to hold the sensor data and cross-correlator outputs – then the overlapped cross-correlation of each two signal segments may be carried out by means of the technique described in Sect. 9.4.2. If this is not the case, however, then it is likely that the number of cross-correlator outputs of actual significance – that is, those that correspond to the temporal region containing the dominant peaks – will be considerably smaller than the number of samples in the segment, so that computational advantage may be taken of this fact.

To see how this may be achieved [9], each segment needs first to be broken down into a number of smaller sub-segments, with the cross-correlation of the original two segments being subsequently obtained from the cross-correlation of the sub-segments in the following way. Suppose that we regard each long signal segment as comprising K samples, with the number of samples in each sub-segment being denoted by N, where

$$K = M \times N, \qquad (9.12)$$

for some integer M. Then, denoting the sub-segment index by 'm', we carry out the following steps (a code sketch of the procedure follows the list):

1. Segment each set of K samples to give

$$x_m[n] = \begin{cases} x[n + (m-1)N] & n = 0, 1, \ldots, N-1 \\ 0 & n = N, N+1, \ldots, 2N-1 \end{cases}$$

for m = 0, 1, . . ., M−2, and

$$y_m[n] = y[n + (m-1)N], \quad n = 0, 1, \ldots, 2N-1,$$

for m = 0, 1, . . ., M−2, and

$$y_m[n] = \begin{cases} y[n + (m-1)N] & n = 0, 1, \ldots, N-1 \\ 0 & n = N, N+1, \ldots, 2N-1 \end{cases} \qquad (9.13)$$

for m = M−1.

2. Carry out the 2N-point FHT of each sub-segment to give

$$\left\{ X_m^{(H)}[k] \right\} = \mathrm{DHT}(\{x_m[n]\}) \quad\text{and}\quad \left\{ Y_m^{(H)}[k] \right\} = \mathrm{DHT}(\{y_m[n]\}), \qquad (9.14)$$

each for m = 0, 1, . . ., M−1.

3. Multiply the two Hartley-space output sets, element-by-element, to give

$$Z_m[k] = \tfrac{1}{2} X_m^{(H)}[k] \left( Y_m^{(H)}[k] + Y_m^{(H)}[2N-k] \right) + \tfrac{1}{2} X_m^{(H)}[2N-k] \left( Y_m^{(H)}[2N-k] - Y_m^{(H)}[k] \right) \qquad (9.15)$$

for k = 0, 1, . . ., 2N−1 and for m = 0, 1, . . ., M−1.

4. Sum the element-wise Hartley-space products over all M sets to give

$$Z^{(H)}[k] = \sum_{m=0}^{M-1} Z_m[k] \qquad (9.16)$$

for k = 0, 1, . . ., 2N−1.

5. Carry out the 2N-point FHT of the resulting summed product to give

$$\{z[n]\} = \mathrm{DHT}\left( \left\{ Z^{(H)}[k] \right\} \right). \qquad (9.17)$$

The above sequence of steps, which illustrates how to carry out the required segmented cross-correlation operation, is also given in diagrammatic form in Fig. 9.8.
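A direct rendering of steps 1–5 follows, under the assumption of zero-based sub-segment offsets (i.e. with the offset (m − 1)N of the text read as mN for m = 0, 1, . . ., M − 1) and with the final DHT scaled by 1/2N so as to act as the inverse transform; the dht helper is the usual FFT-based stand-in.

```python
import numpy as np

def dht(x):
    F = np.fft.fft(x)
    return F.real - F.imag

def segmented_cross_correlate(x, y, N):
    """Steps 1-5 of Sect. 9.4.4: cross-correlation of two length-K segments via
    M = K/N sub-segment products of length 2N, summed in Hartley-space."""
    K = len(x)
    M = K // N                                   # Eq. 9.12: K = M x N
    k = np.arange(2 * N)
    Z = np.zeros(2 * N)
    for m in range(M):
        xm = np.concatenate([x[m * N : (m + 1) * N], np.zeros(N)])  # Step 1
        if m < M - 1:
            ym = y[m * N : m * N + 2 * N]        # full (overlapped) sub-segment
        else:
            ym = np.concatenate([y[m * N :], np.zeros(N)])          # last: padded
        Xm, Ym = dht(xm), dht(ym)                # Step 2: two 2N-point FHTs
        Xr, Yr = Xm[(-k) % (2 * N)], Ym[(-k) % (2 * N)]
        Z += 0.5 * Xm * (Ym + Yr) + 0.5 * Xr * (Yr - Ym)  # Steps 3-4 (Eqs. 9.15/9.16)
    return dht(Z) / (2 * N)                      # Step 5 (Eq. 9.17), inverse scaling
```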

Fig. 9.8 Scheme for cross-correlation using FHT: each post-zero-padded sub-segment {xm[n]} and each overlapped sub-segment {ym[n]} is transformed by a 2N-point FHT, the transforms {Xm(H)[k]} and {Ym(H)[k]} are combined and summed according to Eqs. 9.15 and 9.16, and the result {Z(H)[k]} is returned to data-space via a final FHT to yield {z[n]}

Note that if the sampled data is complex-valued rather than real-valued, then the above sequence of steps may be straightforwardly modified to account for the four real-data combinations required by the complex-to-real parallel decomposition discussed in Sect. 9.4.1. Also, for each of the correlation schemes discussed in this section, if the length of the correlation operations is chosen to be a power of four, then the R24 FHT may be beneficially applied to enable the function to be carried out in a computationally-efficient manner.

9.4.5 Combining Functions in Hartley-Space

Having shown in the previous sections how different functions, such as those of up-sampling and differentiation, may be efficiently carried out, either individually or in combination, via transformation to Hartley-space, it is easy to visualize – through straightforward manipulation of the Hartley-space data – how such functions may also be combined with that of correlation to enable the output signal from the correlator to be produced in up-sampled form, or as a derived function of the standard correlator output signal or, upon combining of the two ideas, as an up-sampled version of a derived function of the standard correlator output signal.

The adoption of the first derived function of the standard correlator output signal, for example, enables one to replace peak detection by zero detection for the estimation of either TOA or TDOA. The utility of such an idea is particularly evident in the seemingly intractable problem of trying to find the TOA corresponding to the direct-path component of a multipath signal, given that the largest peak of the standard correlator output signal does not necessarily correspond to the location of the direct-path component. With the first derived function, for example, it can be shown that the position of the peak of the direct-path signal corresponds to the point at which the value of the first derived function first starts to decrease, whilst with the second derived function, it can be shown that the position of the peak of the direct-path signal corresponds to the point at which the first negative peak of the second derived function appears. Thus, both first and second derived functions of the standard correlator output signal may be used to attack the problem.

Finally, note that with all of the correlation-based expressions given in this section that involve the use of dual Hartley-space terms, such as the terms X(H)[k] and X(H)[N−k], it is necessary that care be taken to treat the zero-address and Nyquist-address terms separately, as neither term possesses a dual.

9.5 Channelization of Real-Valued Signal

The function of a digital multichannel receiver [12, 13] is to simultaneously down-convert a set of frequency-division multiplexed (FDM) channels residing in a single sampled data stream. The traditional approach to solving this problem has been to use a bank of digital down-conversion (DDC) units, with each channel being produced individually via a DDC unit which digitally down-converts the signal to baseband, constrains the bandwidth with a digital filter and then reduces the sampling rate by an amount commensurate with the reduction in bandwidth.

The problem with the DDC approach, however, is one of cost, in that multiple channels are produced via replication of the DDC unit, so that there is no commonality of processing and therefore no possibility of computational savings being made. This is particularly relevant when the bandwidth of the signal under analysis dictates that a large number of channels be produced, as the DDC unit required for each channel typically requires the use of two FIR low-pass filters and one stored version of the period of a complex sinusoid sampled at the input rate. Two cases are now considered, the first corresponding to the efficient production of a single channel (or small number of channels) by means of a DDC process where the filtering is carried out via fast Hartley-space convolution, and the second corresponding to the production of multiple channels via the application of the polyphase DFT filter bank.

9.5.1 Single Channel: Fast Hartley-Space Convolution

For the simple example of a single channel, after the real-valued signal has been frequency-shifted to baseband, the remaining task of the DDC process is to filter the resulting two channels of data so as to constrain the bandwidth of the signal and thus enable the sampling rate to be reduced by an amount commensurate with the reduction in bandwidth. Each filtering operation may be viewed as a convolution-type problem, where the impulse response function of the digital filter is being convolved with a continuous or infinite-duration signal.

As already stated, this convolution-based problem may be solved with either a data-space or a transform-space approach, the optimum choice being very much dependent upon the achievable down-sampling rate out of the two FIR filters – one filter for the 'in-phase' channel and another for the 'quadrature' channel. Clearly, if the down-sampling rate is sufficiently large and/or the length of the impulse response of each filter sufficiently short, then the computational efficiency of the data-space approach may well be difficult to improve upon.

For the case of the transform-space approach, however, this type of problem is best tackled by viewing it as a segmented convolution, a task most simply solved by means of the familiar overlap-save or overlap-add technique, as discussed already in relation to the analogous problem of segmented correlation. The approach involves decomposing the infinite-duration received signal into segments and computing the convolution of the impulse response function of the filter and the received signal as a number of smaller circular convolutions. With the overlap-save technique, for example, suitable zero-padding of the impulse response function combined with the selection of an appropriate segment length enables the required convolution outputs to be obtained from the segmented circular convolution outputs without the need for further arithmetic.

Fig. 9.9 Scheme for filtering complex-valued signal using FHT: the post-zero-padded impulse response {x[n]} is transformed once, successive overlapped segments of the in-phase and quadrature components {yI[n]} and {yQ[n]} are each transformed, combined element-wise with {X(H)[k]} according to Eq. 9.20, and returned to data-space as {zI[n]} and {zQ[n]}, with the invalid outputs being discarded

A solution based upon the adoption of the overlap-save technique is as outlined in Fig. 9.9, where the impulse response function, {x[n]}, comprises N₁ samples, or coefficients, and the FHT is of length N, where

$$N \geq 2 N_1, \qquad (9.18)$$

and the number of valid samples produced from each length-N signal segment out of the convolver is given by N₂, where

$$N_2 = N - N_1 + 1, \qquad (9.19)$$

these samples appearing at the end of each new output segment, with the first N₁ − 1 samples of each such segment being invalid and thus discarded. To achieve this, consecutive length-N segments of the in-phase and quadrature components of the signal are overlapped by N₁ − 1 samples, with the first such segment being pre-zero-padded by N₁ − 1 samples to account for the lack of a predecessor. The element-wise transform-space product associated with each of the small circular convolutions takes the form of

$$Z[k] = \tfrac{1}{2} X^{(H)}[k] \left( Y^{(H)}[k] + Y^{(H)}[N-k] \right) + \tfrac{1}{2} X^{(H)}[N-k] \left( Y^{(H)}[N-k] - Y^{(H)}[k] \right), \qquad (9.20)$$


with the in-phase and quadrature components of the final filtered data-space output denoted by {zI[n]} and {zQ[n]}, respectively. The optimum choice of segment length is dependent very much upon the length of the impulse response function of the filter, with a sensible lower limit being given by twice the length of the impulse response function – as given by Eq. 9.18. Clearly, as with the case of segmented correlation, the shorter the segment length, the smaller the memory requirement but the lower the computational efficiency of the solution, whereas the larger the segment length, the higher the computational efficiency but the larger the memory requirement of the solution. Thus, there is once again a direct trade-off to be made of the arithmetic requirement against the memory requirement, according to how long one makes the signal segment.

9.5.2 Multiple Channels: Conventional Polyphase DFT Filter Bank

A common situation, of particular interest, is where multiple channels – possibly even thousands of channels – are to be produced which are of equal spacing and of equal bandwidth, as a polyphase decomposition may then be beneficially used to enable the bank of DDC processes to be transformed into an alternative filter bank structure, namely, the polyphase DFT, as described in Fig. 9.10 for the most general case of a complex-valued signal, whereby large numbers of channels may be simultaneously produced at computationally attractive levels.

For a brief mathematical justification of this decomposition, it should first be noted that a set of N filters, {Hk(z)}, is said to be a uniform DFT filter bank [1, 4, 15] if the filters are expressible as

$$H_k(z) \equiv H_0\!\left(z \cdot W_N^{k}\right), \qquad (9.21)$$

where

$$H_0(z) = 1 + z^{-1} + \ldots + z^{-(N-1)}, \qquad (9.22)$$

with z⁻¹ corresponding to the unit delay and W_N to the primitive N'th complex root of unity, as given by Eq. 1.3 in Chap. 1.

Two additional ideas of particular importance are those conveyed by the equivalency theorem and the Noble identity [1, 4, 15], where the invoking of the equivalency theorem enables the operations of down-conversion followed by low-pass filtering to be replaced by those of band-pass filtering followed by down-conversion, whilst that of the Noble identity enables the ordering of the operations of filtering followed by down-sampling to be straightforwardly reversed.

Fig. 9.10 Scheme for polyphase DFT channelization of complex-valued signal: the band-pass complex-valued input {x[n]} is passed through a delay line, down-sampled by N into the polyphase filter branches H0(z), H1(z), . . ., HN−1(z), whose instantaneous outputs feed an N-point complex-data fast Fourier transform to yield the low-pass complex-valued channel outputs {y0[m]}, {y1[m]}, . . ., {yN−1[m]}

With these two key ideas in mind, assume that the prototype filter, denoted P(z), is expressible in polyphase form as

$$P(z) = \sum_{n=0}^{N-1} z^{-n} \, H_n\!\left(z^{N}\right), \qquad (9.23)$$

for the case of an N-branch system, so that the filter corresponding to the k'th branch, Pk(z), may thus be written as

$$P_k(z) = P\!\left(z \cdot W_N^{k}\right) = \sum_{n=0}^{N-1} \left( z^{-1} W_N^{-k} \right)^{n} H_n\!\left(z^{N}\right), \qquad (9.24)$$

with the output of Pk(z), denoted Yk(z), given by

$$Y_k(z) = \sum_{n=0}^{N-1} W_N^{-nk} \left( z^{-n} \, H_n\!\left(z^{N}\right) X(z) \right), \qquad (9.25)$$

which corresponds to the polyphase structure shown above in Fig. 9.10. With this structure, therefore, the required band-pass filters are obtained by adopting a polyphase filter bank, with each filter branch being obtained by delaying and sub-sampling the impulse response of a single prototype FIR low-pass filter, followed by the application of a DFT to the instantaneous output sets produced by the polyphase filter bank. The effect of the polyphase filtering is to isolate and down-sample the individual channels, whilst the DFT is used to convert each channel to baseband. In this way, the same polyphase filter bank is used to generate all the channels, with additional complexity reduction being made possible by computing the DFT with an appropriately chosen FFT algorithm. When the sampled data is complex-valued, the feeding of N samples into an N-branch polyphase DFT filter bank will result in the production of N independent channels via the use of a complex-data FFT, whereas when the sampled data is real-valued, the feeding of N samples into the N-branch polyphase DFT filter bank will result in the production of just N/2 independent channels via the use of a real-data FFT.

For the efficient computation of the polyphase DFT filter bank, as for that of the standard DFT, the traditional approach to the problem has been to use a complex-data solution, regardless of the nature of the data, this often entailing the initial conversion of the real-valued data to complex-valued data via a wideband DDC process, or through the adoption of a real-from-complex strategy whereby two real-valued data sets are built up from the polyphase filter bank outputs to enable two real-data DFTs to be computed simultaneously via one full-length complex-data FFT, or where one real-data DFT is performed on the polyphase filter bank outputs via one half-length complex-data FFT. The most commonly adopted approach is probably to apply the polyphase DFT filter bank after the real-valued data has first been converted to baseband via the wideband DDC process, which means that the data has to undergo two separate stages of filtering – one stage following the frequency shifting and another for the polyphase filter bank – before it is in the required form. The same drawbacks are therefore equally valid for the computation of the real-data polyphase DFT filter bank as they are for that of the real-data DFT, these drawbacks having already been comprehensively discussed in Chap. 2.

A typical channelization problem involves a real-valued wide-bandwidth RF signal, sampled at an intermediate frequency (IF) with a potentially high sampling rate, and a significant number of channels, so that the associated computational demands of a solution based upon the use of the polyphase DFT filter bank would typically be met through the mapping of the polyphase filter bank and the associated real-data DFT placed at its output onto appropriately chosen parallel computing equipment, as might be provided by a sufficiently powerful FPGA device. As a result, if the number of polyphase filter branches is a power of four, then the real-data DFT placed at the output of the polyphase filter bank may be efficiently carried out by means of the R24 FHT without recourse to the use of a complex-data FFT – as is discussed in reference [7].
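To make the structure concrete, the following is a minimal numpy sketch of a critically sampled N-channel polyphase DFT analysis bank in the spirit of Fig. 9.10; the function name is hypothetical, the prototype filter h is assumed to be designed elsewhere, a complex-data FFT is used for clarity in place of the real-data R24 FHT route described above, and the channel ordering depends on the DFT sign convention chosen.

```python
import numpy as np

def polyphase_dft_channelize(x, h, N):
    """Critically sampled N-channel polyphase DFT filter bank (cf. Fig. 9.10).
    x: complex-valued input stream (np.ndarray); h: prototype low-pass FIR
    filter whose length is a multiple of N (zero-pad h if necessary)."""
    T = len(h) // N                        # taps per polyphase branch
    E = h.reshape(T, N)                    # E[t, n] = h[t*N + n]: branch n taps
    M = len(x) // N - T                    # number of full output vectors
    y = np.empty((M, N), dtype=complex)
    t = np.arange(T)[:, None]
    n = np.arange(N)[None, :]
    for m in range(M):
        # Branch n of the commutated delay line:
        #   v[n] = sum_t h[t*N + n] * x[(m + T - t)*N - n];
        # the m + T offset keeps indices non-negative (a fixed output latency).
        v = np.sum(E * x[(m + T - t) * N - n], axis=0)
        y[m] = np.fft.fft(v)               # DFT across branches -> channels
    return y                               # y[m, k]: m-th sample of channel k
```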

9.5.2.1 Alias-Free Formulation

An important problem associated with the polyphase DFT filter bank is that of adjacent channel interference, which arises through the nature of the sampling process – namely, the fact that with the conventional formulation of the polyphase DFT filter bank, all the channels are critically sampled at the Nyquist rate – as this results in overlapping of the channel frequency responses and hence aliasing of a signal in the transition region of one channel into the transition region of one or both of its neighbours. To overcome this problem, the presence of aliased signals arising from the poor filtering performance near the channel boundaries may be reduced or eliminated by over-sampling the individual channels to above the Nyquist rate. This over-sampling may be most simply achieved, with a rational factor, by overlapping the segments of sampled data into the polyphase filter bank, using simple memory shifts/exchanges, and then removing the resulting frequency-dependent phase shifts at the output of the polyphase filter bank by applying circular time shifts to the filtered data, this being achieved by reordering the data with simple memory shifts/exchanges [4]. The effect of over-sampling is to create redundant spectral regions between the desired channel boundaries and thus to prevent the overlapping of adjacent-channel frequency responses.

For a channel bandwidth of W, suppose that an over-sampling ratio of 4/3 is used – equating to an overlap of 25% of the sampled data segments – and that the pass-band and stop-band edges are symmetrically placed at (3/4) × (W/2) and (5/4) × (W/2), respectively, relative to the channel boundary. This results in the creation of a spectral band (in the centre of the redundant region), of width BW, where

$$BW = 2 \times \left( \tfrac{4}{3} \times \tfrac{W}{2} - \tfrac{5}{4} \times \tfrac{W}{2} \right) = \tfrac{W}{12}, \qquad (9.26)$$

which separates adjacent channel stop-band edges and thus prevents possible aliasing problems, so that the redundant regions may be easily identified and discarded upon spectrum analysis of the individual channels of interest.

Clearly, by suitably adjusting the position of the stop-band edge – that is, by setting it to exactly R_P × (W/2), where R_P is the over-sampling ratio for the polyphase filtering – it is possible to completely eliminate this spectral safety region such that the locations of the stop-band edges of adjacent channels actually coincide. If the signal is real-valued and the number of channels to be produced is equal to N/2 – and hence the length of the sampled data segments as well as the number of branches used by the polyphase filter is equal to N – then an over-sampling ratio of R_P for use by the polyphase filter bank will require an overlap, O_P, of

$$O_P = N \left( 1 - 1/R_P \right) \qquad (9.27)$$

samples for the data segments – for N = 1024 branches, for example, an over-sampling ratio of R_P = 4/3 requires, from Eq. 9.27, an overlap of O_P = 256 samples, i.e. a 25% overlap. As with the computation of any DSP-based function, there is a direct trade-off to be made between complexity and performance, in that the larger the over-sampling ratio, the larger the arithmetic requirement but the easier the task of the polyphase filtering process. This results in a reduction in the number of taps required by each of the small filters used by the polyphase filter bank, which in turn leads to a reduced latency and a reduced-duration transient response. A realistic value for the over-sampling ratio, as commonly adopted in many channelization problems, is given by two, whereby the requirement is thus for a 50% overlap of the sampled data segments.

9.5.2.2 Implementation Issues

With a simplified formulation (albeit not a particularly useful one) of the polyphase DFT filter bank, which takes no account of the aliasing problem, if N real-valued samples are fed into an N-branch polyphase filter bank, then the solution to the associated problem of computing the real-data DFT will equate to the execution of one N-point real-data FFT every N clock cycles – which is the very problem that has already been addressed in this monograph through the development of the R24 FHT.

For the more interesting and relevant situation, however, where an over-sampling ratio of two is adopted to address the aliasing problem, the solution to the associated problem of computing the real-data DFT will equate to the execution of one N-point real-data FFT every N/2 clock cycles, so that it will be necessary, whenever N > 64 (from the time-complexity figure of Eq. 6.12 in Chap. 6), to double the throughput of the standard R24 FHT and hence of the real-data FFT. This may be achieved by computing two R24 FHTs simultaneously, or in parallel, on consecutive overlapped sets of polyphase filter outputs, in turn, in order to produce interleaved output data sets – as discussed in some detail in Sect. 6.6 of Chap. 6. When the over-sampling ratio is reduced to 4/3, however, the problem of computing the real-data DFT simplifies to the execution of one N-point real-data FFT every 3N/4 clock cycles, so that for those situations where N ≤ 1024 (from the time-complexity figure of Eq. 6.12 in Chap. 6), a single R24 FHT may well suffice.

9.6 Distortion-Free Multi-Carrier Communications

Finally, another important application to benefit from the transform-space approach is that concerned with a recent and novel approach to distortion-free multi-carrier communications. The scheme described in [6, 8] – and which is only briefly touched upon here – enables one to improve the quality of one's own wireless communications, over a given frequency (or set of frequencies), when in the presence of IMD. This type of distortion is generated by one's own power amplifier (PA), when operating over an adjacent band of frequencies, and arises as the result of the non-linear nature of the PA when engaged in the transmission of modulated multi-carrier signals. The distortion appears in the form of inter-modulation products (IMPs), these occurring at multiple frequencies which may potentially coincide with (one or more of) one's own communication frequency (or frequencies).

The proposed scheme enables one to predict the frequency locations and strengths of the IMPs – as outlined in Fig. 9.11 – and, when coincident with a given communication frequency, to clear the IMPs from that frequency regardless of the levels of distortion present, as outlined in Fig. 9.12. The speed at which the IMPs may be identified and cleared from the communication frequency is the key to the scheme's attraction – attributable to the efficient exploitation of fast polynomial arithmetic/algebraic techniques and a suitably defined FFT routine for carrying out both forward and inverse DFTs – offering the promise of maintaining reliable real-time communications without having to interrupt the operation of one's own electronic equipment.

The integer-valued nature of the algorithm inputs (which are simply the carrier frequency locations) means that, with a little ingenuity, advantage may be taken of the fact that the forward FFT maps real-valued data to complex-valued data and the inverse FFT maps (Hermitian-symmetric) complex-valued data to real-valued data, making possible the use of a specialized real-data transform such as the FHT – and, in particular, the R24 FHT – that would offer the possibility of a resource-efficient real-time solution when implemented with one of the silicon-based technologies discussed in Chap. 5. The arithmetic complexity of the proposed transform-space solution is of O(M·log₂M) arithmetic operations, where 'M' corresponds to the number of frequency channels residing within the IMD region of interest.

9.7 Discussion

This chapter has focused on the application of the DHT to a number of computationally-intensive DSP-based functions which may benefit from the adoption of transform-space processing, particularly when the DHT is carried out by means of the R24 FHT – as discussed in Chaps. 4, 5, 6 and 7. The particular application area of geolocation was discussed in some detail as it is a potential vehicle for all of the DSP-based functions considered. With most geolocation systems, there is typically a requirement to produce up-sampled correlator outputs from which the TOA or TDOA timing measurements may subsequently be derived. The TOA measurement forms the basis of those geolocation systems based upon the exploitation of multiple range estimates, whilst the TDOA measurement forms the basis of those geolocation systems based upon the exploitation of multiple relative-range estimates.

Fig. 9.11 Scheme for real-time prediction of IMP locations and strengths: from the carrier signal channel locations, a binary ('0'/'1') indicator polynomial coefficient set is set up, transformed to Fourier-space via a forward FFT, cyclically re-sampled for those combinations of valid channel weights catering for each order of IMD of interest, the element-wise vector products combined into a single expression, and the result returned to data-space via an inverse FFT to yield the coefficient set for the polynomial representation of the IMD distribution – i.e. the IMP channel locations and strengths

Fig. 9.12 Scheme for real-time clearance of communication channels of interest: the latest set of carrier signal channel locations is partitioned into 'R' non-overlapping subsets; for each subset the indicator polynomial is modified and the residual IMD at the channel of interest calculated using the predictor; the 'best' subset (that with the smallest residual IMD) is selected and, if the residual IMD still exceeds some threshold delta and the permitted attempts (a loop of 'E' passes) are not exhausted, the initial set of carrier signal channel locations is modified and the process repeated

The up-sampling, differentiation and correlation functions, as was shown, may all be efficiently performed, in various combinations, when the processing is carried out via Hartley-space, with the linearity of the complex-data correlation operation also leading to its decomposition into four parallel real-data correlation operations. This parallel decomposition is particularly useful when the quantities of data to be correlated are large and the throughput requirement high, as it enables the correlation to be efficiently computed by running in parallel multiple versions of the R24 FHT.

With regard to the channelization problem, it was suggested that the computational complexity involved in the production of a single channel (or small number of channels) by means of a DDC process may, depending upon the parameters, be considerably reduced relative to that of the direct data-space approach, by carrying out the filtering via fast Hartley-space convolution. For the case of multiple channels, it was seen that the channelization of a real-valued signal by means of the polyphase DFT filter bank may also be considerably simplified through the adoption of an FHT for carrying out the associated real-data DFT. With most RF channelization problems, where the number of channels is large enough to make the question of implementational complexity a serious issue, the sampled intermediate frequency (IF) data is naturally real-valued, so that advantage may be taken of this fact in trying to reduce the complexity to manageable levels. This can be done by means of the following two steps: firstly, by replacing each pair of short FIR filters – as applied to the in-phase and quadrature channels – required by the standard solution for each polyphase branch, with a single short FIR filter, as the data remains real-valued right the way through the polyphase filtering process, and, secondly, by replacing the complex-data DFT at the output of the standard polyphase filter bank by a real-data DFT which, for a suitably chosen number of channels, may be efficiently computed by means of the R24 FHT.

Finally, a more recent application involving a novel transform-space scheme for enhancing the performance of multi-carrier communications in the presence of IMD was briefly discussed which required both forward and inverse FFTs, where the forward FFT mapped real-valued data to complex-valued data and the inverse FFT mapped (Hermitian-symmetric) complex-valued data to real-valued data. Both of these transformations, with a little ingenuity, could be catered for in an efficient manner by means of a specialized real-data transform such as the FHT – and, in particular, the R24 FHT – which would offer the possibility of a resource-efficient real-time solution when implemented with one of the silicon-based technologies.

References

1. A.N. Akansu, R.A. Haddad, Multiresolution Signal Decomposition: Transforms – Subbands – Wavelets (Academic Press, 2001)
2. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice-Hall, Englewood Cliffs, 1988)
3. D. Fraser, Interpolation by the FFT revisited – an experimental investigation. IEEE Trans. ASSP 37(5), 665–675 (May 1989)
4. F.J. Harris, Multirate Signal Processing for Communication Systems (Prentice-Hall, Upper Saddle River, 2004)
5. K.J. Jones, Digital signal conditioning for sensor arrays (G.B. Patent Application No: 0112415.5, May 2001)
6. K.J. Jones, Low-complexity scheme for enhancing multi-carrier communications (G.B. Patent No: 2504512, July 2012)
7. K.J. Jones, Resource-efficient and scalable solution to problem of real-data polyphase discrete Fourier transform channelisation with rational over-sampling factor. IET Signal Process. 7(4), 296–305 (June 2013)
8. K.J. Jones, Design of low-complexity scheme for maintaining distortion-free multi-carrier communications. IET Signal Process. 8(5), 495–506 (July 2014)
9. R. Nielson, Sonar Signal Processing (Artech House, 1991)
10. A.V. Oppenheim, R.W. Schafer, Discrete-Time Signal Processing (Prentice-Hall, 1989)
11. R.A. Poisel, Electronic Warfare: Target Location Methods (Artech House, 2005)
12. J.G. Proakis, Digital Communications (McGraw-Hill, 2001)
13. B. Sklar, Digital Communications: Fundamentals and Applications (Prentice-Hall, 2002)
14. C.C. Tseng, S.L. Lee, Design of FIR digital differentiator using discrete Hartley transform and backward difference (European Signal Processing Conference (EUSIPCO), 2008)
15. P.P. Vaidyanathan, Multirate Systems and Filter Banks (Prentice-Hall, 1993)

Part IV: The Multi-dimensional Discrete Hartley Transform

Chapter 10: Parallel Reordering and Transfer of Data Between Partitioned Memories of Discrete Hartley Transform for 1-D and m-D Cases

10.1 Introduction

With the sequential data reordering techniques discussed in Sect. 2.4 of Chap. 2, it was stated that the timing constraint required the data reordering to be carried out at a sufficiently fast rate in order to keep up with the data set refresh rate – this constraint needing also to be met for the production of each DHT output data set by the R24 FHT. Typically, for the case of 1-D data, as discussed in Chaps. 4, 5, 6 and 7, this means that it should be possible for a NAT-ordered input data set (as produced by the external input data source, the ADC unit) of length N, where N is a radix-4 integer, to be reordered according to the DBR mapping within the update period of N clock cycles, as dictated by the data set refresh rate, where a 'clock cycle' is as defined by the clock frequency of the target computing device, as discussed in Chap. 5. As a result, a sequential solution to the data reordering problem for the case of 1-D data seems quite appropriate, as any of the techniques discussed in Sect. 2.4 of Chap. 2 will be more than capable of meeting the imposed timing constraint, with the reordered data being either written directly, as it is being sequentially generated, to the partitioned memory of the PDM residing on the PE of the R24 FHT or, alternatively, written to the partitioned memory of the DSM, from where it may be subsequently transferred in a more parallel and time-efficient fashion to the PDM.

When dealing with the computation of the m-D DHT, for m ≥ 2, where the common length N of each dimension of the transform is taken to be a radix-4 integer (and therefore compatible with the adoption of the N-point R24 FHT) and where the separable version, referred to as the m-D SDHT (to be discussed in some detail in Chap. 11), is to be assumed, the NAT-ordered data sets needing to be reordered via the DBR mapping will not just be those coming from the ADC unit but will also be those intermediate output data sets produced by each of the first m − 1 stages of the RCM-based formulation of the m-D SDHT. These intermediate output data sets will be appropriately stored within their own partitioned memories, as each stage, for a fully pipelined solution, may be assigned its own R24 FHT and each stage except the last its own block of HSM – the output from the last stage being directed to a suitably defined external output data store.

With the adoption of a parallel computing architecture, the reordering and transfer of NAT-ordered data between the relevant partitioned memories may then be carried out in a parallel fashion with considerably reduced time-complexity. Thus, given the need for solutions that may be easily adapted for various applications and thus able to address in an efficient manner the data reordering needs of both the 1-D DHT and the m-D SDHT – each of which relies upon the application of the DBR mapping to NAT-ordered data stored within one partitioned memory followed by its transfer to another partitioned memory for subsequent processing – the sequential approach would no longer appear to be the most appropriate. Instead, a more 'generic' solution should be sought to the data reordering problem, one able to exploit the partitioned nature of the different memory types, for both the 1-D and the m-D cases, and therefore better able to exploit the potential parallelism on offer.

10.2 Memory Mappings of Regularized FHT

Before proceeding with the development of such a scheme, it should be noted that for each woctad of the DBR-reordered N-sample data set (referred to hereafter simply as a DBR-reordered woctad) that is mapped onto the PDM, the relevant PDM addresses – comprising a memory bank address for each sample within the woctad and a time slot address for the location of the woctad within the memory – are modified on entry to (and by) the R24 FHT through the memory mappings of Chap. 6, whereby:

(1) the memory bank addresses of the samples within each DBR-reordered woctad are reordered according to the pre-FHT/post-FHT mapping, Ω1(n, α), of Eq. 6.2, where 'n' is the sample address within the DBR-reordered N-sample input data set to the R24 FHT and 'α' the radix exponent corresponding to the transform length N; and such that

(2) the time slot address of each DBR-reordered woctad within the PDM is made consistent with the address-offset memory mapping, Φ(n), of Eq. 6.4, where 'n' is as defined above – as all the partitioned data memories are assumed to be using the same number of memory banks, however, the time slot address will already be in the required form and so needs no further modification.

With regard to the samples within each output woctad produced by the R24 FHT – for subsequent direction to a suitably defined external output data store or, for the case of m-D data, for possible mapping onto one of the HSMs – in order for the samples to be transferred in the required NAT-ordered form, the PDM addresses are modified on exit from (and by) the R24 FHT through the pre-FHT/post-FHT mapping, Ω1(n, α), of Eq. 6.2, where 'n' is as defined in the paragraph above. Note, however, that as with the DBR-reordered woctads, the time slot address of each NAT-ordered output woctad from the PDM will already be in the required form and so needs no further modification. Figure 10.1 illustrates the data reordering requirements for the 2-D case, where the input data set is transferred from the DSM to the PDM residing on the PE of the first R24 FHT, the output of which is then transferred to the HSM.

Fig. 10.1 Memory mappings between DSM, PDM and HSM: NAT-ordered data held in the DSM is reordered via the DBR mapping and mapped, via Ω1, onto the PDM of the Regularized FHT, whose NAT-ordered output is then mapped, via Ω1, onto the HSM – Ω1 being as defined by Eq. 6.2 of Chap. 6

The pre-FHT/post-FHT memory mapping required for the above address modifications for the input/output data sets on entry/exit to/from (and by) the R24 FHT has already been discussed in some depth in Chap. 6 and is straightforwardly computed, either through its on-the-fly calculation or through the use of a suitably defined LUT for the storage of the radix-4 memory mapping, Ψ4, of Eq. 6.1, from which the pre-FHT/post-FHT memory mapping is simply obtained at minimal computational expense. Therefore, it will be assumed, hereafter, when referring to the ordering of the samples within each woctad, whether from the DBR-reordered input data set or the NAT-ordered output data set, that these simple memory address modifications will on input be, or on output have been, carried out by the R24 FHT and so are thus 'invisible' to those functions carried out both immediately prior to and following the execution of the R24 FHT.

10.3 Requirements for Parallel Reordering and Transfer of Data

To develop the required data reordering scheme, it is first necessary to summarize the set of requirements that must be satisfied if such a solution is to be achieved, as well as to simplify, where appropriate, certain oft-repeated expressions. Thus, instead of continually referring to the transferred data as coming from 'either the DSM or the HSM', we shall refer to it as coming from the 'source memory', and instead of referring to its transfer to 'that PDM residing on the PE of the target R24 FHT', we shall refer to its transfer to the 'target PDM' – these terms being applicable to the processing of both 1-D and m-D data, as discussed in this chapter as well as in Chap. 11, where different architectures will be developed and discussed for the parallel computation of the m-D SDHT.


Now, there are two types of partitioned memory – apart from that of the target PDM – referred to throughout this chapter: (1) the DSM, as introduced in Chap. 1 for the storage of the NAT-ordered input data (as generated by the external input data source, the ADC unit) to be subsequently transferred to the relevant target PDM; and (2) the HSM, as introduced in Chap. 1 for the storage of the NAT-ordered intermediate output data (as produced by each of the first m − 1 stages of the RCM-based formulation of the m-D SDHT) to be subsequently transferred to the relevant target PDM, noting that each stage, for a fully pipelined solution, may be assigned its own R24 FHT and each stage except the last its own block of HSM. Thus, both the DSM and the HSM, as already stated, are the source memories from which data is to be transferred to the relevant target PDM. For the efficient operation of the R24 FHT, however, whether for the processing of 1-D data sets or as a building block for the processing of m-D data sets, each of the three memory types referred to above ideally needs (at least for a fully pipelined solution) to be double-buffered – as defined in Sect. 6.5 of Chap. 6 – whereby the 'active' region is that region of memory whose contents are currently available for processing, whilst the 'passive' region is that region of memory currently available for the storing of new data.

For both the 1-D DHT and the m-D SDHT, addressing of the DSM requires that consecutive NAT-ordered data samples, as generated by the ADC unit, should be stored cyclically within consecutive banks of the DSM (where, for the DSM, memory bank no 8 is always followed by memory bank no 1). For the case of the 2-D SDHT – the number of dimensions being limited initially to just two for ease of illustration, where the input data set is assumed to be of size N × N – addressing of the 2-D HSM, which is configurable as a 2-D array of 8 × 8 memory banks, requires that consecutive row-DHT output data sets, each of length N and NAT-ordered, should be stored cyclically within consecutive rows of banks of the HSM – as illustrated in Fig. 10.2 for a size 16 × 16 data set – with consecutive samples within each output data set being stored cyclically within consecutive banks of the appropriate row of the 2-D HSM to which it's assigned. In this way, consecutive column-DHT input samples will be stored cyclically within consecutive banks of the appropriate column of the 2-D HSM, as required for subsequent transfer, in DBR-reordered form, to the target PDM. The cyclic nature of the data storage scheme means that for each 1-D DSM/PDM or row/column of the 2-D HSM, memory bank no 8 is always to be followed by memory bank no 1, whilst for the 2-D HSM, row/column no 8 is always to be followed by row/column no 1.

With the adoption of the above scheme, the data stored within the source memory, prior to reordering by the DBR mapping, is in the required NAT-ordered form. As each DBR-reordered woctad is subsequently produced, it's mapped onto the target PDM, so that the stored data is then in the correct form for processing by the R24 FHT [1–3]. Thus, prior to being transferred to the target PDM, each such set of N NAT-ordered samples needs first to be converted to its DBR-reordered form, so that the task now is to determine how this might best be performed.
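To make the cyclic storage conventions concrete, the following MATLAB-style helpers – hypothetical illustrations in the spirit of the Appendix C listings, not taken from them – map a sample to its bank and time slot addresses for both the 1-D DSM and the 2-D HSM (the N/8 address offset used for the HSM is discussed further below):

% Cyclic DSM addressing: the n'th NAT-ordered sample (n = 1,2,...,N)
% lands in bank mB of the eight DSM banks, at time slot tS, with bank
% no 8 always followed by bank no 1.
function [mB, tS] = dsm_address(n)
  mB = mod(n-1, 8) + 1;
  tS = floor((n-1)/8) + 1;
end

% Cyclic 2-D HSM addressing: sample k (k = 1,...,N) of the j'th N-sample
% row-DHT output set (j = 1,...,N) lands in bank-row bR and bank-column
% bC of the 8 x 8 array, with the time slot offset growing by N/8 for
% each new N-sample data set stored within the same row of banks.
function [bR, bC, tS] = hsm_address(j, k, N)
  bR = mod(j-1, 8) + 1;
  bC = mod(k-1, 8) + 1;
  tS = floor((j-1)/8)*(N/8) + floor((k-1)/8) + 1;
end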


Fig. 10.2 Storage scheme for N × N output/input data sets from/to row-DHT/column-DHT stage (example shown for N = 16), where entry m(n) represents the address of the mth/nth sample of the nth/mth naturally ordered N-sample output/input data set

The data reordering can be most efficiently performed by pre-determining the way the stored samples would be distributed across the active region of the double-buffered source memory if the DBR mapping were to be physically applied to the stored data and the reordered data then written back to the same memory – see the contents of Table 10.1, which was generated with a MATLAB-based [4] computer program, as described and listed in Appendix C. Then, with this information, instead of physically reordering the stored data prior to its transfer, the samples would just need to be accessed in the required order, from the appropriate source memory banks, as they are being selected for transfer to the target PDM. Each sample would thus be addressed within the source memory according to the memory bank to which it belongs and the time slot to which it is assigned within that particular memory bank. Note that for the parallel computation of the 2-D SDHT, it will be seen that each N-sample data set produced by the row-DHT stage resides within a single row of memory banks of the 2-D HSM, whilst each N-sample data set to be processed by the column-DHT stage resides within a single column of memory banks from the updated 2-D HSM containing the row-DHT outputs. Thus, in addressing the samples for a given N-sample data set from the 2-D source memory, which holds all N² samples, appropriate address offsets (that is, time slot addresses) must be used to account for any other N-sample data sets (and there could be up to N/8 − 1 of these for each row/column of memory banks) that may have already been stored within the same set of memory banks – this address offset incrementing by N/8 for the appropriate set of eight memory banks with each newly stored N-sample data set. A sketch of how such a distribution may be pre-determined is given below.
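The distributions of Table 10.1 may be reproduced with a few lines of MATLAB – a minimal sketch in the spirit of the Appendix C program rather than the listing itself, assuming that the DBR mapping denotes the usual radix-4 dibit reversal of the zero-based sample address:

% For each woctad of the DBR-reordered N-sample data set, tally how many
% samples are drawn from each of the eight source memory banks.
N = 64;                                    % transform length (radix-4 integer)
alpha = round(log(N)/log(4));              % number of radix-4 digits
counts = zeros(N/8, 8);                    % woctads x source memory banks
for n = 0:N-1
  d = mod(floor(n ./ 4.^(0:alpha-1)), 4);  % radix-4 digits of n (LS digit first)
  dbr = sum(fliplr(d) .* 4.^(0:alpha-1));  % dibit-reversed sample address
  woctad = floor(n/8) + 1;                 % position within reordered data set
  bank = mod(dbr, 8) + 1;                  % source bank holding that sample
  counts(woctad, bank) = counts(woctad, bank) + 1;
end
disp(counts)                               % cf. Table 10.1(b) for N = 64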


Table 10.1 Distribution of N DBR-reordered samples across eight memory banks of DSM or HSM

PDM woctad addresses    Bank1 Bank2 Bank3 Bank4 Bank5 Bank6 Bank7 Bank8
(a) Sample distribution for N = 16 ⇒ 2 woctads
1                         2     2     –     –     2     2     –     –
2                         –     –     2     2     –     –     2     2
(b) Sample distribution for N = 64 ⇒ 8 woctads
1–2                       4     –     –     –     4     –     –     –
3–4                       –     4     –     –     –     4     –     –
5–6                       –     –     4     –     –     –     4     –
7–8                       –     –     –     4     –     –     –     4
(c) Sample distribution for N > 64 ⇒ M = N/8 woctads & P = M/16
1–P                       8     –     –     –     –     –     –     –
(P+1)–2P                  –     –     –     –     8     –     –     –
(2P+1)–3P                 8     –     –     –     –     –     –     –
(3P+1)–4P                 –     –     –     –     8     –     –     –
(4P+1)–5P                 –     8     –     –     –     –     –     –
(5P+1)–6P                 –     –     –     –     –     8     –     –
(6P+1)–7P                 –     8     –     –     –     –     –     –
(7P+1)–8P                 –     –     –     –     –     8     –     –
(8P+1)–9P                 –     –     8     –     –     –     –     –
(9P+1)–10P                –     –     –     –     –     –     8     –
(10P+1)–11P               –     –     8     –     –     –     –     –
(11P+1)–12P               –     –     –     –     –     –     8     –
(12P+1)–13P               –     –     –     8     –     –     –     –
(13P+1)–14P               –     –     –     –     –     –     –     8
(M−2P+1)–(M−P)            –     –     –     8     –     –     –     –
(M−P+1)–M                 –     –     –     –     –     –     –     8

Before delving into the details of the construction of the DBR-reordered woctads from data stored within the source memory, and of their subsequent transfer to the target PDM, a number of terms are introduced that will help simplify the explanations provided. Firstly, let us introduce the term D[mB, tS] to stand for the data sample corresponding to the tS'th time slot of the mB'th bank of the source memory, where mB varies from 1 up to 8 and tS varies from 1 up to N/8. Also, let us introduce the parameter 'M' for the total number of woctads derived from an N-sample data set, thus given as M = N/8; the parameter 'P' for the 'short' address increment required for address generation when N > 64, given as P = M/16; and the parameter 'Q' for the 'long' address increment required for address generation when N > 64, given as Q = 4P.

10.4 Sequential Construction of Reordered Data Sets

For the construction of DBR-reordered woctads from data stored within the source memory, it is evident from the results produced with the MATLAB-based computer program referred to above that when:

1. N = 16, the two DBR-reordered woctads may be constructed from samples obtained from the two NAT-ordered woctads stored within the source memory in the correct temporal order (that is, in order of increasing time slot address within the target PDM), from the set of samples

{D[m+1, 1], D[m+5, 1], D[m+1, 2], D[m+5, 2], D[m+2, 1], D[m+6, 1], D[m+2, 2], D[m+6, 2]}     (10.1)

by using successive elements of what's referred to hereafter as the 'm-sequence', as given by the set {0,2}. That is, the first DBR-reordered woctad is constructed from pairs of samples obtained from each of the first, second, fifth and sixth source memory banks, whilst the second is constructed from pairs of samples obtained from each of the third, fourth, seventh and eighth source memory banks;

2. N = 64, the eight DBR-reordered woctads may be constructed from samples obtained from the eight NAT-ordered woctads stored within the source memory in the correct temporal order (that is, in order of increasing time slot address within the target PDM), from the set of samples

{D[m+1, s+1], D[m+1, s+3], D[m+1, s+5], D[m+1, s+7], D[m+5, s+1], D[m+5, s+3], D[m+5, s+5], D[m+5, s+7]}     (10.2)

by using successive elements of the m-sequence, as given by the set {0,1,2,3}, and, for each fixed element of the m-sequence, by using successive elements of what's referred to hereafter as the 's-sequence', as given by the set {0,1}. That is, the first two DBR-reordered woctads are constructed from 4-sample sets obtained from each of the first and fifth source memory banks, whilst the next two are constructed from 4-sample sets obtained from each of the second and sixth source memory banks, the next two from 4-sample sets obtained from each of the third and seventh source memory banks and the final two from 4-sample sets obtained from each of the fourth and eighth source memory banks; and

3. N > 64, where the situation is a little more complex, so we first introduce the parameter Es = E + s + 1, which uses an element of the s-sequence – as discussed in the next paragraph – and where the value of E is used to distinguish between those DBR-reordered woctads that are constructed when a particular source memory bank is encountered for the first time (for a period of 8P consecutive time slots) and those that are constructed when it's encountered for the second and last time (also for a period of 8P consecutive time slots).


Table 10.2 Values of s-sequence used for generating address offsets required for construction of DBR-reordered woctads for 256 ≤ N ≤ 16,384

N        Row-wise tabulation of s-sequence (of length P) per data dimension
256      0 4
1024     0 16 2 18 4 20 6 22
4096     0 64 8 72 16 80 24 88
         2 66 10 74 18 82 26 90
         4 68 12 76 20 84 28 92
         6 70 14 78 22 86 30 94
16,384   0 256 32 288 64 320 96 352
         8 264 40 296 72 328 104 360
         16 272 48 304 80 336 112 368
         24 280 56 312 88 344 120 376
         2 258 34 290 66 322 98 354
         10 266 42 298 74 330 106 362
         18 274 50 306 82 338 114 370
         26 282 58 314 90 346 122 378
         4 260 36 292 68 324 100 356
         12 268 44 300 76 332 108 364
         20 276 52 308 84 340 116 372
         28 284 60 316 92 348 124 380
         6 262 38 294 70 326 102 358
         14 270 46 302 78 334 110 366
         22 278 54 310 86 342 118 374
         30 286 62 318 94 350 126 382

Then, the M DBR-reordered woctads may be constructed from samples obtained from the M NAT-ordered woctads stored within the source memory in the correct temporal order (that is, in order of increasing time slot address within the target PDM), from the set of samples

{D[m, Es], D[m, Es+Q], D[m, Es+2Q], D[m, Es+3Q], D[m, Es+P], D[m, Es+P+Q], D[m, Es+P+2Q], D[m, Es+P+3Q]}.     (10.3)

This is achieved with the use of successive elements of the m-sequence, as given by the set {1,5,1,5,2,6,2,6,3,7,3,7,4,8,4,8} and as derived from the contents of Table 10.1, and, for each fixed element of the m-sequence, by using two groups each of 8P consecutive time slots, as identified via the address offsets obtained from the use of successive elements of the s-sequence. Each element of the s-sequence, which is of length P, is applied for eight consecutive time slots, taking on values – as generated by the MATLAB-based computer program – displayed in Table 10.2.


For the first set of 8P consecutive time slots during which a particular source memory bank is encountered, all elements of the s-sequence are successively utilized and the value of E is set to '0', whilst for the second set of 8P consecutive time slots during which it's encountered, all elements of the s-sequence are again successively utilized but the value of E is now set to '1'. Unlike the reordering schemes described above for when N = 16 or 64, when N > 64 the M DBR-reordered woctads constructed from the M NAT-ordered woctads are such that each DBR-reordered woctad is constructed from eight samples obtained from a single source memory bank.

Thus, when N = 16, the time slot addresses of the samples within the source memory required for the construction of each DBR-reordered woctad, as obtained from the appropriate four source memory banks, are as indicated by Eq. 10.1; when N = 64, the time slot addresses of the samples required for the construction of each DBR-reordered woctad, as obtained from the appropriate two source memory banks, are as indicated by Eq. 10.2; and when N > 64, the time slot addresses of the samples required for the construction of each DBR-reordered woctad, as obtained from the appropriate single source memory bank, are as indicated by Eq. 10.3. As a result, if a time-efficient data reordering scheme is to be obtained, particularly for when N ≥ 64, then the contents of all eight source memory banks need to be processed simultaneously over several consecutive clock cycles in order that the DBR-reordered woctads might be constructed at the rate of at least one woctad per clock cycle – thus making the throughput rate compatible with that of the R24 FHT.
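Putting the pieces of the N > 64 case together, the following MATLAB-style sketch – an illustration only, using the s-sequence tabulated in Table 10.2 for N = 1024 and the group structure described above – generates, for each DBR-reordered woctad, the source memory bank and the eight time slot addresses of Eq. 10.3:

N = 1024;  M = N/8;  P = M/16;  Q = 4*P;
m_seq = [1 5 1 5 2 6 2 6 3 7 3 7 4 8 4 8];   % source bank per group of P woctads
s_seq = [0 16 2 18 4 20 6 22];               % Table 10.2 row for N = 1024
banks = zeros(M,1);  slots = zeros(M,8);
w = 0;
for g = 1:16                                 % one group per m-sequence element
  E = floor(mod(g-1,4)/2);                   % 0 on a bank's first encounter, 1 on its second
  for j = 1:P
    w = w + 1;
    Es = E + s_seq(j) + 1;                   % per the definition of Sect. 10.4
    banks(w) = m_seq(g);
    slots(w,:) = [Es Es+Q Es+2*Q Es+3*Q Es+P Es+P+Q Es+P+2*Q Es+P+3*Q];
  end
end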

10.5 Parallelization of Data Set Construction Process

To handle the construction of the woctads, for when N ≥ 64, in a more time-efficient parallel fashion, a single procedure may be defined whereby the DBR-reordered data samples are read from the source memory – assumed to be in the form of dual-port RAM – two samples at a time in the required order (as dictated by the DBR mapping) from each of the eight source memory banks, so that after repeating this operation four times a total of eight DBR-reordered samples will have been obtained from each source memory bank. Thus, it will be possible, with every four clock cycles, to construct eight new DBR-reordered woctads, which equate to a fraction, given as 64/N ≤ 1 for when N ≥ 64, of one complete N-sample data set, which may then be mapped onto the target PDM.

When N = 64, the eight sets of time slot addresses required for the construction of the complete set of eight DBR-reordered woctads produced in this way will correspond to samples obtained from all eight source memory banks, where the samples required to construct each woctad will come from two source memory banks and will be temporally spread according to the ordering implied by the combined values of the m-sequence and the s-sequence – Table 10.1b indicates the temporal ordering (in terms of time slot address) of the DBR-reordered woctads within the target PDM.


Similarly, when N > 64, the eight sets of time slot addresses required for the construction of each set of eight DBR-reordered woctads produced in this way will correspond to samples obtained from all eight source memory banks, where the samples required to construct each woctad will now come from just a single source memory bank and will be temporally spread according to the ordering implied by the combined values of the m-sequence and the s-sequence together with the parameters E, P and Q – Table 10.1c indicates the temporal ordering (in terms of time slot address) of the DBR-reordered woctads within the target PDM.

To carry out the proposed data reordering, for when N ≥ 64, in the most time-efficient fashion, a small intermediate data memory (DMIN) may be introduced, of 2-D form, partitioned into eight rows by eight columns, where each memory bank is capable of holding a single sample of data. As a result, the entire memory will be capable of holding the sixty-four DBR-reordered samples corresponding to eight consecutive woctads, with sets of eight samples being obtained from each of the eight source memory banks holding the NAT-ordered data. Suppose now, for each clock cycle, that two read instructions are performed simultaneously for retrieving the latest two woctads from the source memory, as well as two simultaneous write instructions for transferring the two previously accessed woctads to the DMIN. In this way, the intermediate data memory DMIN may be filled row-wise, every four clock cycles, with eight woctads of DBR-reordered samples, as they are being produced. Once the DMIN is full, the samples may then be transferred column-wise, woctad by woctad, to the target PDM, given that the DBR-reordered woctads are stored column-wise within the 2-D intermediate data memory – when N = 64, each DBR-reordered woctad must be retrieved from two columns, whereas when N > 64, each DBR-reordered woctad may be retrieved from just a single column. This scheme is most efficiently implemented through double-buffering, via the introduction of a second DMIN identical in form to the first, so that one set of eight DBR-reordered woctads may be built up and stored within the rows of one DMIN (the 'passive' region of the double-buffered memory) whilst another set of eight DBR-reordered woctads, as obtained from the columns of the second DMIN (the 'active' region of the double-buffered memory), is being transferred to the target PDM – see Fig. 10.3. The scheme involves a start-up delay of four clock cycles to allow for the filling of the first DMIN, first time around, before the functions of the two memories begin to alternate, every four clock cycles, with the contents of one DMIN being updated with new data obtained from the source memory whilst the contents of the other DMIN are being transferred to the target PDM.

Fig. 10.3 Memory configuration for parallel construction and transfer of reordered N-sample data sets between partitioned memories, where the 2-D input data set is of size N × N: source memory (double-buffered DSM of 2 × (1 × 8) memory banks, or double-buffered HSM of 2 × (8 × 8) memory banks, holding the N² input samples) → double-buffered intermediate data memory of 2 × (8 × 8) single-sample memory banks → target memory (double-buffered PDM of 2 × (1 × 8) memory banks, yielding the N² output samples)


Note that the two intermediate data memories, DMIN, required by the above scheme are best built with programmable logic, so as not to waste potentially large quantities of fast and expensive embedded RAM in their construction, as embedded memory normally comes with a minimum size measured in thousands of bits, rather than the few tens of bits required for each of the sixty-four banks of each DMIN.

The m-sequence, as defined in Sect. 10.4, is used to enable DBR-reordered woctads to be constructed sequentially in the correct temporal order (that is, in the order of increasing time slot address within the target PDM) for mapping directly onto the target PDM – and in so doing, it also indicates which source memory banks are to be used in their construction. For the parallelization of the woctad construction process, for when N ≥ 64, the sequentially accessed m-sequence may be dispensed with, as the data is now to be retrieved from all eight source memory banks simultaneously, so that there is no longer any temporal significance to the ordering of the memory bank addresses within the sequence. The DBR-reordered woctads are thus to be constructed simultaneously from all eight source memory banks which, for the case where N > 64, are accessed a total of 2P times, with each instance (applied over four consecutive clock cycles) producing one new DBR-reordered woctad from the stored data of each source memory bank – that is, eight new DBR-reordered woctads every four clock cycles. For the first P instances the value of E must be set to '0', whilst for the second P instances the value of E must be set to '1' – where E, as already stated in Sect. 10.4, is used to distinguish between those DBR-reordered woctads that are constructed when the eight source memory banks are simultaneously encountered for the first time (for P instances or, equivalently, for 8P consecutive time slots) and those that are constructed when the eight source memory banks are simultaneously encountered for the second and last time (also for P instances or, equivalently, for 8P consecutive time slots) – note that the samples obtained from eight consecutive time slots of a given source memory bank correspond, after the mapping, to those for just one time slot of the target PDM. The time slot addresses of the samples required for the construction of each DBR-reordered woctad from the stored data of the appropriate source memory banks – as indicated in Eqs. 10.1, 10.2 and 10.3 – may be pre-computed using the appropriate values of m, Es, P and Q before being stored in a suitably defined LUT.

10.6 Parallel Transfer of Reordered Data Sets

From the above, it is evident that for the parallel transfer of the DBR-reordered woctads from the source memory to the target PDM, for when N ≥ 64, the woctads may be accessed from the source memory at the rate of two per clock cycle and transferred to the target PDM at the same rate. Thus, with the introduction of double-buffering of the intermediate data memory, DMIN, and the simultaneous execution of each set of multiple read/write instructions, the DBR-reordered samples may be transferred from one set of memory banks to another at the rate of two woctads per clock cycle, so that each DBR-reordered N-sample data set may be both constructed and transferred in approximately N/16 clock cycles – with a small time delay of four clock cycles being needed for initializing the first of the intermediate data stores the first time around.

Note that the order in which each DBR-reordered woctad is stored within the target PDM needs to reflect the temporal ordering (that is, the appropriate time slot address) implied by the construction process discussed in Sect. 10.5. Thus, for:

1. N = 16, as the two DBR-reordered woctads may be constructed over one clock cycle for the two known time slot addresses (as derived from the contents of Table 10.1), then clearly they may also be transferred to the target PDM in one clock cycle in the temporal order implied by the parallel construction process;

2. N = 64, as the eight DBR-reordered woctads may be constructed over four clock cycles for the eight known time slot addresses (as derived from the contents of Table 10.1), then clearly they may also be transferred to the target PDM in four clock cycles in the temporal order implied by the parallel construction process; and

3. N > 64, as each set of eight new DBR-reordered woctads may be constructed over four clock cycles, then clearly each set may also be transferred to the target PDM in four clock cycles in the temporal order implied by the parallel construction process.

The temporal ordering of the M DBR-reordered woctads stored within the target PDM, in terms of the time slot addresses, may be expressed as

{k, 4P+k, 8P+k, 12P+k, P+k, 5P+k, 9P+k, 13P+k}     (10.4)

for the first M/2 woctads (corresponding to the first time period during which each of the eight memory banks is encountered), as 'k' takes on successive values from 1 up to P, followed by

{2P+k, 6P+k, 10P+k, 14P+k, 3P+k, 7P+k, 11P+k, 15P+k}     (10.5)

for the second M/2 woctads (corresponding to the second time period during which each of the eight memory banks is encountered), as again 'k' takes on successive values from 1 up to P, as may be verified from inspection of the contents of Table 10.1.
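As a quick consistency check – an illustrative verification, not part of the Appendix C program – a few lines of MATLAB confirm that the orderings of Eqs. 10.4 and 10.5 together visit each of the M = 16P target PDM time slots exactly once:

P = 8;  M = 16*P;                                % e.g. N = 1024
order = [];
for k = 1:P
  order = [order, k + [0 4 8 12 1 5 9 13]*P];    % Eq. 10.4: first M/2 woctads
end
for k = 1:P
  order = [order, k + [2 6 10 14 3 7 11 15]*P];  % Eq. 10.5: second M/2 woctads
end
isequal(sort(order), 1:M)                        % returns logical 1 (true)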

Note also, from Sects. 10.4 and 10.5, that for the parallel construction of the DBR-reordered woctads, increased computational efficiency may be obtained for N > 64 by having those multiples and combinations of address offsets involving the parameters P and Q – that is, 2Q, 3Q, P + Q, P + 2Q and P + 3Q – pre-computed and stored within a small suitably sized LUT, and the elements of the s-sequence, as displayed in Table 10.2, also pre-computed and stored within an LUT of length P. Similar efficiencies may be obtained with the parallel transfer of the DBR-reordered woctads through the pre-computation and storage of those address offsets involving multiples of the parameter P. As a result, the addressing complexity may be reduced (for the most part) to the use of a couple of small LUTs combined with the on-the-fly application of a relatively small number of pre-computed address increments, resulting in a low-complexity parallel solution to the problem addressed: namely, that involving the reordering and transfer of data from one partitioned memory to another.

10.7 Discussion

This chapter has shown how NAT-ordered data stored within one partitioned memory may be efficiently transferred, in a reordered form according to the DBR mapping, to a second partitioned memory – where both memories are assumed to be dual-port in nature. This was shown to be achievable as a single combined operation (namely, one able to carry out simultaneously both the data reordering and the transfer of the reordered data from one partitioned memory, referred to as the source memory, to another, referred to as the target PDM) with approximately 16-fold parallelism – where the adoption of eight memory banks has been assumed for consistency with the adoption of the R24 FHT. The point was made, however, when referring to the ordering of the samples within each woctad for input/output to/from the R24 FHT, whether for the DBR-reordered input woctads or the NAT-ordered output woctads, that the memory address modifications due to the pre-FHT/post-FHT memory mapping of Chap. 6 – as discussed in Sect. 10.1 – would on input be, or on output have been, carried out by the R24 FHT and so are thus 'invisible' to those functions carried out both immediately prior to and following the execution of the R24 FHT. The partitioned memories of interest were the two source memories, the DSM and the HSM, together with the target PDM – as required for solutions to both the 1-D DHT and the m-D SDHT when solved via the use of the R24 FHT.

Three different data reordering scenarios may arise in practice, in terms of the associated problem complexity, as will be discussed in far greater detail in Chap. 11 when solutions to the m-D SDHT are introduced. For the first and simplest case, the R24 FHT is to be applied to the computation of the 1-D DHT, where the length N of the transform is a radix-4 integer, as has already been discussed in Chaps. 4, 5, 6 and 7. The data reordering problem thus involves the DBR-based parallel reordering of a single NAT-ordered input data set, as generated by the ADC unit and stored within the DSM, from where it is transferred in parallel, woctad by woctad, to the PDM.

For the next case, the R24 FHT is to be applied to the computation of the m-D SDHT, for m ≥ 2, where the common length N of each dimension of the transform is taken to be a radix-4 integer and therefore compatible with the adoption of the N-point R24 FHT, whilst a single R24 FHT (and thus a single block of HSM) is used to carry out the processing for all 'm' stages of its RCM-based formulation. The data reordering problem now involves the repeated (N^(m−1) times) DBR-based parallel reordering of: (1) a NAT-ordered input data set, of length N, as generated by the


ADC unit and stored within the DSM, from where it is transferred in parallel, woctad by woctad, to the PDM; as well as, for each of the first m − 1 stages, the repeated (N^(m−1) times) DBR-based parallel reordering of: (2) a NAT-ordered intermediate output data set, of length N, as produced by each stage and stored within the HSM, from where it is transferred in parallel, woctad by woctad, to the PDM. The parallel reordering and transfer of the m-D data sets, as produced by both the ADC unit and each of the first m − 1 stages, to the PDM, is carried out sequentially, data set by data set, with the parallel reordering of each N-sample subset of each m-D data set being accompanied by its simultaneous transfer to the PDM.

For the final case, the R24 FHT is again to be applied to the computation of the m-D SDHT, for m ≥ 2, where again the common length N of each dimension of the transform is taken to be a radix-4 integer and therefore compatible with the adoption of the N-point R24 FHT, but each stage of its RCM-based formulation is now assigned its own R24 FHT and each stage but the last its own block of HSM. The data reordering problem now involves the repeated (N^(m−1) times) DBR-based parallel reordering of: (1) a NAT-ordered input data set, of length N, as generated by the ADC unit and stored within the DSM, from where it is transferred in parallel, woctad by woctad, to its target PDM; as well as, for each of the first m − 1 stages, the repeated (N^(m−1) times) DBR-based parallel reordering of: (2) a NAT-ordered intermediate output data set, of length N, as produced by each stage and stored within its own block of HSM, from where it is transferred in parallel, woctad by woctad, to its target PDM. The parallel reordering and transfer of the m-D data sets, as produced by both the ADC unit and each of the first m − 1 stages, to the m target PDMs, is carried out simultaneously, data set by data set, with the parallel reordering of each N-sample subset of each m-D data set being accompanied by its simultaneous transfer to its target PDM.

Thus, as the above problems illustrate, the task likely to benefit most from the proposed scheme for the parallel reordering and transfer of data is that concerned with the computation of the m-D SDHT and, in particular, that of the 2-D SDHT, to be discussed in some detail in Chap. 11, whereby the R24 FHT may be used as a building block with a separate R24 FHT being assigned to each stage of its RCM-based formulation. As has been shown, the two operations (namely, the parallel reordering of the data and the transfer of the reordered data from one partitioned memory to another) may be carried out simultaneously as a single combined operation in an optimal fashion in approximately N/16 clock cycles when each memory is partitioned into eight banks of equal size, a performance not achievable via any of the sequential solutions discussed in Sect. 2.4 of Chap. 2. This has been achieved, for when N ≥ 64, with the minimal overhead of a small start-up delay of four clock cycles and a small amount of additional working memory – this comprising a double-buffered version of the intermediate data memory, DMIN – so that continuous real-time operation might be comfortably achieved and maintained.


References

1. K.J. Jones, Design and parallel computation of regularised fast Hartley transform. IEE Proc. Vision Image Signal Process. 153(1), 70–78 (February 2006)
2. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (September 2007)
3. K.J. Jones, The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments, Series on Signals & Communication Technology (Springer, Dordrecht, 2010)
4. MATLAB @ www.mathworks.com

Chapter 11

Architectures for Silicon-Based Implementation of m-D Discrete Hartley Transform Using Regularized Fast Hartley Transform

11.1 Introduction

A 2-D version of the DHT [3] for the processing of a size N × N data set is expressed in its normalized form as

X^{(H)}[k_1, k_2] = (1/N) \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} x[n_1, n_2]·cas(2π n_1 k_1/N + 2π n_2 k_2/N)     (11.1)

for k_1, k_2 = 0, 1, ..., N − 1, where the input/output data sets belong to R^{N×N}, the linear space of real-valued square arrays of size N × N. This 2-D formulation of the DHT, like that introduced in Sect. 3.1 of Chap. 3 for the 1-D case, can be shown (with the application of a little algebra) to be bilateral, whereby the forward and inverse versions of the normalized 2-D transform are identical, so that after passing a 2-D data set twice through the 2-D DHT of Eq. 11.1, the original data set will be obtained. This property, combined with the symmetry of the transform's 2-D kernel, means that the 2-D transform may also be considered to be orthogonal and thus, as with the 1-D case, a member of that class of algorithms comprising the discrete orthogonal transforms [1, 6] and therefore possessing those properties shared by all those algorithms belonging to this important class.

The definition of Eq. 11.1 is derived from direct extension of that given for the 1-D case of Eq. 3.1 in Chap. 3, so that it preserves the symmetries of the 1-D transform, as expressed via Eq. 3.12 of Chap. 3, via the following 2-D version of the equation:

X^{(F)}[k_1, k_2] = X_E^{(H)}[k_1, k_2] − i·X_O^{(H)}[k_1, k_2]     (11.2)

where the 'even' and 'odd' components of an arbitrary 2-D function 'Z' are written (in similar vein to that given for the 1-D case of Sect. 3.3 of Chap. 3) as


Z_E[n_1, n_2] = ½(Z[n_1, n_2] + Z[−n_1, −n_2])     (11.3)

and

Z_O[n_1, n_2] = ½(Z[n_1, n_2] − Z[−n_1, −n_2]),     (11.4)

respectively, and where, from transform periodicity, index '−k' may be regarded as being equivalent to 'N − k' for each dimension. Note, in addition, that

X^{(H)}[k_1, k_2] = Re{X^{(F)}[k_1, k_2]} − Im{X^{(F)}[k_1, k_2]},     (11.5)

so that the output data sets from the 2-D DFT and the 2-D DHT may now be simply obtained, one from the other, via Eqs. 11.2 and 11.5. As a result of the above properties, the 2-D DHT can also be shown to satisfy a 2-D version of the CCT [3] analogous to that given for the 1-D transform by Eq. 3.31 of Chap. 3, namely that the 2-D circular convolution (as represented by the operator '**') of two arbitrary 2-D real-valued data sets, {x[n_1,n_2]} and {y[n_1,n_2]}, when using the 2-D DHT, may be expressed (up to a scaling factor) as the addition/subtraction of four element-wise Hartley-space products:

DHT({x[n_1, n_2]} ** {y[n_1, n_2]}) = ½{ X^{(H)}[k_1, k_2]·Y^{(H)}[k_1, k_2] − X^{(H)}[−k_1, −k_2]·Y^{(H)}[−k_1, −k_2] + X^{(H)}[k_1, k_2]·Y^{(H)}[−k_1, −k_2] + X^{(H)}[−k_1, −k_2]·Y^{(H)}[k_1, k_2] },     (11.6)

which may be straightforwardly obtained from that corresponding to the use of the 2-D DFT, namely

DFT({x[n_1, n_2]} ** {y[n_1, n_2]}) = { X^{(F)}[k_1, k_2]·Y^{(F)}[k_1, k_2] },     (11.7)

and vice versa, by simply exploiting the connecting relations of Eqs. 11.2 and 11.5, which enable the output data sets of the two 2-D transforms to be simply obtained, one from the other. Unfortunately, the kernel of the 2-D DHT (and, in fact, of the m-D DHT), unlike that of the 2-D DFT (and the m-D DFT), is not separable [3] – that is, it cannot be expressed as the product of 1-D kernels – so that an alternative definition to that given by Eq. 11.1 is required if fast algorithms for the solution to the 2-D DHT (and, in fact, to the m-D DHT) are to be found (such as those exploiting the RCM) which rely on the separability of the kernel. The aim of this chapter is thus to provide such a definition and to show how realizable architectures for its resource-efficient parallel computation might be obtained and how the resulting solutions might be used to carry out those basic DSP-related tasks conventionally performed by the DFT.
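A small illustrative MATLAB check of Eq. 11.6 – not taken from the original text – computes the 2-D DHT from the 2-D DFT via Eq. 11.5 and compares the Hartley-space products against a directly computed 2-D circular convolution:

N = 8;  x = randn(N);  y = randn(N);
dht2 = @(u) real(fft2(u)) - imag(fft2(u));   % unnormalized 2-D DHT, via Eq. 11.5
X = dht2(x);  Y = dht2(y);
r = [1, N:-1:2];                             % index map realizing k -> -k (mod N)
Z = (X.*Y - X(r,r).*Y(r,r) + X.*Y(r,r) + X(r,r).*Y)/2;   % Eq. 11.6
z = real(ifft2(fft2(x).*fft2(y)));           % direct 2-D circular convolution
max(max(abs(dht2(z) - Z)))                   % of the order of machine precision

With the unnormalized transforms used here, the scaling factor referred to in the text is simply unity.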


Before proceeding, however, it is perhaps worth restating, in view of the key role played by the reordering of data for the 2-D and m-D solutions to be discussed in the chapter, that it’s to be assumed when referring to the ordering of the samples within each woctad for input/output to/from each instance of the R24 FHT, whether for the DBR-reordered input woctads or the NAT-ordered output woctads, that the memory address modifications due to the pre-FHT/post-FHT memory mapping, Ω1 – as introduced in Sect. 6.3.1 of Chap. 6 and further discussed in Sect. 10.1 of Chap. 10 – will on input be, or on output have been, carried out by the R24 FHT and so will be ‘invisible’ to those functions carried out both immediately prior to and following the execution of the R24 FHT.

11.2 Separable Version of 2-D DHT

A modified formulation [3] of that given by Eq. 11.1 for the 2-D DHT, which meets the desired objective of separability, is now described, followed by brief discussions on how solutions based upon this new formulation might be used to produce computationally efficient parallel solutions – other solutions to these and related problems are discussed in references [15, 17], together with architectures suitable for FPGA implementation – to the key problems (as briefly discussed in [20]) of: (1) the filtering of 2-D real-valued data sets, as typically encountered when dealing with digital images; and (2) the computation of the 2-D real-data DFT – as might be required, for example, as a key component of a low-pass frequency-domain beamformer [13], as used in the processing of real-valued hydrophone data (although such data sets would generally need to be rectangular rather than square, as one would expect there to be more temporal frequency bins than sensors).

11.2.1 Two-Stage Formulation of 2-D SDHT

A separable formulation of the 2-D DHT for the processing of a size N × N data set, again in normalized form, may be given by means of the expression

X^{(SH)}[k_1, k_2] = (1/N) \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} x[n_1, n_2]·cas(2π n_1 k_1/N)·cas(2π n_2 k_2/N)     (11.8)

for k_1, k_2 = 0, 1, ..., N − 1, which may be reformulated as


X^{(SH)}[k_1, k_2] = (1/√N) \sum_{n_1=0}^{N-1} y^{(H)}[n_1, k_2]·cas(2π n_1 k_1/N)     (11.9)

for k_1 = 0, 1, ..., N − 1, where

y^{(H)}[n_1, k_2] = (1/√N) \sum_{n_2=0}^{N-1} x[n_1, n_2]·cas(2π n_2 k_2/N)     (11.10)

for k_2 = 0, 1, ..., N − 1. The attraction of this two-stage formulation is that it enables Eq. 11.8 to be efficiently carried out via the RCM, whereby the set of N 1-D DHTs of Eq. 11.10 is first applied to the N rows of the 2-D input data set, one DHT per row, followed by the application of the set of N 1-D DHTs of Eq. 11.9 to the N columns of the resulting 2-D intermediate output data set, one DHT per column, without trigonometric coefficients having to be applied to the intermediate outputs produced by the row-DHT stage.

This separable version of the 2-D DHT – referred to hereafter as the 2-D SDHT – like the non-separable version of Eq. 11.1, can be shown (with the application of a little algebra) to be bilateral, whereby the forward and inverse versions of the normalized 2-D SDHT are identical. This property, combined with the symmetry of the separable 2-D kernel, means that the 2-D SDHT may also be considered to be orthogonal and thus, as with the cases of the 1-D DHT and the 2-D non-separable DHT, a member of that class of algorithms comprising the discrete orthogonal transforms and therefore possessing those properties shared by all those algorithms belonging to this important class. The algorithm has already been successfully used for carrying out various 2-D DSP applications [8, 14, 19, 20], proving particularly popular as an alternative to the 2-D DCT for the transform-based coding of images [17], for the purpose of data compression [16], where its bilateral nature makes perfect image reconstruction theoretically possible.

The relationship of the outputs of this separable version of the 2-D DHT to the even and odd components of the standard non-separable version [20] is given by means of the expression

X^{(SH)}[k_1, k_2] = X_E^{(H)}[k_1, −k_2] + X_O^{(H)}[k_1, k_2]     (11.11)

whilst

X^{(H)}[k_1, k_2] = X_E^{(SH)}[k_1, −k_2] + X_O^{(SH)}[k_1, k_2]     (11.12)

expresses the non-separable 2-D DHT outputs in terms of the outputs of the even and odd components of the 2-D SDHT, as given by Eqs. 11.3 and 11.4. As a result, any expression given in terms of one version of the 2-D transform can be easily reformulated in terms of the other by means of one or other of the above two equations. Unfortunately, however, this separable formulation of the 2-D DHT, unlike that for the non-separable version, does not satisfy the same symmetry properties as the 1-D transform, as expressed via Eq. 3.12 of Chap. 3, so that it will not satisfy the standard form of the 2-D CCT, as given by Eq. 11.6 above. An alternative version of the 2-D CCT will therefore be needed if the 2-D SDHT is to be effectively used for carrying out the filtering-type operations required for the processing of 2-D images.

From Eqs. 11.9 and 11.10 above it is evident that, given an attractive solution to the problem of computing the 1-D DHT, this algorithm could also be used as a building block for the efficient computation of both the row-DHT and the column-DHT stages of the 2-D formulation of the SDHT, given that all the necessary theorems and properties required for its application would be satisfied. Care needs to be taken, however, with regard to the design of suitable memory partitioning, double-buffering and parallel addressing schemes for the 2-D solution, as they would need to be consistent with those used by the 1-D solution if the associated benefits of the 1-D solution are to be fully exploited – they would also need to be generic in the sense that they should be easily generalized in order to deal with the case of m-D data, where m ≥ 2, as will be discussed later in Sect. 11.5.
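The row-column evaluation of Eqs. 11.9 and 11.10 may be illustrated with a small MATLAB sketch – an illustration only, with each 1-D DHT here performed by direct cas-matrix multiplication standing in for the R24 FHT:

N = 16;  n = 0:N-1;
C = (cos(2*pi*(n')*n/N) + sin(2*pi*(n')*n/N))/sqrt(N);  % normalized 1-D cas matrix
x = randn(N);
yH = x*C;                    % Eq. 11.10: one 1-D DHT applied to each row of x
X  = C*yH;                   % Eq. 11.9: one 1-D DHT applied to each column
max(max(abs(C*(X*C) - x)))   % bilateral: a second pass recovers x exactly

Note that x*C transforms along the second (column) index and C*yH along the first (row) index, and that no trigonometric coefficients need be applied between the two stages.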

11.2.2 Hartley-Space Filtering of 2-D Data Sets

An important requirement of the 2-D SDHT, as stated above, is that it should satisfy some form of the 2-D CCT, as this theorem forms the basis for the digital filtering of images. Fortunately, it can be shown [20], via the connecting relations of Eqs. 11.11 and 11.12 and the non-separable version of the 2-D CCT given by Eq. 11.6, that the 2-D circular convolution of two arbitrary 2-D real-valued data sets, {x[n_1,n_2]} and {y[n_1,n_2]}, may again be expressed (up to a scaling factor) in terms of four element-wise Hartley-space products, when using the separable version of the transform, as

SDHT({x[n_1, n_2]} ** {y[n_1, n_2]}) = ½{ X_E^{(SH)}[k_1, k_2]·Y_E^{(SH)}[k_1, k_2] − X_O^{(SH)}[k_1, −k_2]·Y_O^{(SH)}[k_1, −k_2] + X_E^{(SH)}[k_1, −k_2]·Y_O^{(SH)}[k_1, k_2] + X_O^{(SH)}[k_1, k_2]·Y_E^{(SH)}[k_1, −k_2] }.     (11.13)
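The four-product form of Eq. 11.13 may be checked numerically with a small illustrative MATLAB fragment – assuming the standard cas kernel – against a directly computed 2-D circular convolution, the factor 2N below being the scaling referred to above for the normalized transform:

N = 8;  n = 0:N-1;
C = cos(2*pi*(n')*n/N) + sin(2*pi*(n')*n/N);
sdht2 = @(u) (C*u*C)/N;                      % normalized 2-D SDHT
x = randn(N);  y = randn(N);
X = sdht2(x);  Y = sdht2(y);
r = [1, N:-1:2];                             % index map realizing k -> -k (mod N)
XE = (X + X(r,r))/2;  XO = (X - X(r,r))/2;   % even/odd parts, Eqs. 11.3/11.4
YE = (Y + Y(r,r))/2;  YO = (Y - Y(r,r))/2;
Z = (XE.*YE - XO(:,r).*YO(:,r) + XE(:,r).*YO + XO.*YE(:,r))/2;   % Eq. 11.13
z = real(ifft2(fft2(x).*fft2(y)));           % direct 2-D circular convolution
max(max(abs(sdht2(z) - 2*N*Z)))              % of the order of machine precision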

The satisfying of this modified form of the 2-D CCT by the 2-D SDHT, which may be viewed as a natural extension of the 1-D techniques introduced in Sect. 9.2 of Chap. 9, thus ensures that filtering-type operations – as well as conventional energy calculation and thresholding for image compression purposes [17] – may be efficiently carried out in 2-D Hartley-space (whether using separable or non-separable versions of the transform), just as they are in 2-D Fourier-space, with the bilateral property of the transform facilitating the perfect reconstruction of the filtered data on the return to 2-D data-space via the transform's repeated application. Significant computational savings may be made when the nature of one or other of the functions generating the 2-D data sets is known, as will be seen later in Sect. 11.4.7, when one of the two functions is taken to be an impulse response function (or, for imaging systems, a point spread function), which is also a real-valued even function.

11.2.3 Relationship Between 2-D SDHT and 2-D DFT

The relationship between the outputs of this 2-D separable transform and those of the 2-D DFT is given by means of the expression

X^{(SH)}[k_1, k_2] = Re{X^{(F)}[k_1, −k_2]} − Im{X^{(F)}[k_1, k_2]},     (11.14)

whilst

Re{X^{(F)}[k_1, k_2]} = ½(X^{(SH)}[k_1, −k_2] + X^{(SH)}[−k_1, k_2])     (11.15)

and

Im{X^{(F)}[k_1, k_2]} = ½(X^{(SH)}[−k_1, −k_2] − X^{(SH)}[k_1, k_2])     (11.16)

express the real and imaginary components of the 2-D DFT outputs, respectively, in terms of the 2-D SDHT outputs [19]. Thus, as with the 1-D case, the output data sets from the two 2-D transforms may be simply obtained, one from the other, so that efficient solutions to the 2-D SDHT could equally well be beneficially used for solving those DSP-based problems commonly addressed via the 2-D DFT, and vice versa, particularly when the input data to the DFT is real-valued – as is discussed in more detail in Sect. 11.4.8. Detailed accounts of the parallel computation of a number of m-D orthogonal/unitary transforms that are reducible to the m-D DFT may be found in references [2, 4].

An additional property of the 2-D SDHT, given its orthogonality, is that of satisfying Parseval's Theorem, as given for the case of 2-D data by the equation

\sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} |x[n_1, n_2]|² ≡ \sum_{k_1=0}^{N-1} \sum_{k_2=0}^{N-1} |X^{(F)}[k_1, k_2]|² ≡ \sum_{k_1=0}^{N-1} \sum_{k_2=0}^{N-1} |X^{(SH)}[k_1, k_2]|²,     (11.17)

which simply states that the energy contained in the 2-D signal is preserved (up to a scaling factor) under the operation of 2-D versions of both the DFT and the DHT (and, in fact, under the operation of any discrete orthogonal or unitary transform, given that any orthogonal transform is also unitary from the fact that R ⊂ C, including both separable and non-separable versions of the 2-D DHT), so that the energy measured in data-space is equivalent to that measured in both Fourier-space and Hartley-space.
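The connecting relations of Eqs. 11.15 and 11.16 may likewise be verified numerically – an illustrative sketch, assuming the standard cas kernel and MATLAB's fft2 conventions – by recovering the 2-D DFT of a real-valued data set from its unnormalized 2-D SDHT:

N = 8;  n = 0:N-1;
C = cos(2*pi*(n')*n/N) + sin(2*pi*(n')*n/N);   % unnormalized cas matrix
x = randn(N);
T = C*x*C;                                     % unnormalized 2-D SDHT
r = [1, N:-1:2];                               % index map realizing k -> -k (mod N)
ReF = (T(:,r) + T(r,:))/2;                     % Eq. 11.15
ImF = (T(r,r) - T)/2;                          % Eq. 11.16
max(max(abs(fft2(x) - (ReF + 1i*ImF))))        % of the order of machine precision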

11.3 Architectures for 2-D SDHT

Given the two-stage formulation of the 2-D SDHT, as given by Eqs. 11.9 and 11.10 above, and the availability, through the adoption of the R24 FHT, of an efficient solution to the 1-D DHT, it is now seen how these two features might be appropriately combined to yield equally attractive solutions to the 2-D SDHT (and thus to the 2-D DFT) and, ultimately, to the m-D SDHT (and the m-D DFT). Two versions are considered, the first exploiting a single-FHT recursive architecture and the second a two-FHT pipelined architecture, where the data set is assumed to be of size N × N, with N taken to be a radix-4 integer for compatibility with the N-point R24 FHT, and where the data is both processed and transferred between source and target memories woctad by woctad.

The first architecture is defined as being 'recursive' in the sense that the output from the first stage of processing (the row-DHT stage) is fed back as input to the second stage (the column-DHT stage), where the same set of computational components (as provided by the R24 FHT) is used to perform the same set of operations on the 2-D input data to both stages. The 'pipelined' architecture is obtained when this two-fold recursion is unfolded, so that the two-stage recursion then becomes a two-stage computational pipeline, with the two stages of the pipeline being connected by means of a double-buffered memory and the processing for each stage being carried out by means of its own R24 FHT. Thus, the choice of computing architecture for the parallel computation of the 2-D SDHT reduces to that of a single-FHT recursive architecture versus a two-FHT pipelined architecture, where the achievable time-complexity of the single-FHT solution may be shown to be approximately twice that of the two-FHT solution (as one would expect), although with a commensurate saving in terms of the silicon resources required for the production of each new 2-D output data set. These considerations, which are similar to those discussed in Sect. 5.3.2 of Chap. 5 relating to the choice of single-PE versus multi-PE architectures for the parallel computation of the R24 FHT, generalize in an obvious fashion to the processing of m-D data sets, for m ≥ 2, as will be discussed later in Sect. 11.5.

The efficient reordering of NAT-ordered data by the DBR mapping, as required for every N-sample data set needing to be input to the R24 FHT, for both the 1-D and m-D cases, has already been discussed in some detail in Chap. 10. Before proceeding, however, it should be noted that with a 2-D data set of size N × N, the 'update period', as dictated by the data set refresh rate, has been defined (from Sect. 1.9 of Chap. 1) as the time needed to transfer all N² samples from the ADC unit to the DSM, assumed here (with an I/O rate of one sample per clock cycle) to be N² clock cycles. This is true even when the input data is stored and processed just one row at a time, as the update period is defined here as the elapsed time between the production of consecutive 2-D input data sets, not the elapsed time between the production of consecutive N-sample subsets of the 2-D input data set, which has already been defined (from Sect. 1.9 of Chap. 1) as the 'slicing period'. These two parameters, the update period and the slicing period, will play key roles in determining the ability of each 2-D solution (and, ultimately, m-D solution) to achieve and maintain continuous real-time operation.


Finally, note that the cyclic nature of the data storage scheme means that for each 1-D DSM/PDM or row/column of the 2-D HSM, memory bank no 8 is always to be followed by memory bank no 1, whilst for the 2-D HSM, row/column no 8 is always to be followed by row/column no 1.

11.3.1 Single-FHT Recursive Architecture

The solution based upon the single-FHT recursive architecture, as outlined in Fig. 11.1, carries out the processing for all the row-DHTs before commencing the processing for the column-DHTs. The solution operates in a recursive fashion whereby the partitioned memory of the HSM of Fig. 11.2, configurable as a 2-D array of 8 × 8 equally-sized memory banks (with each memory bank thus holding N²/64 samples), is first filled row-wise with the outputs of the row-DHT stage before the stored outputs are fed back column-wise as inputs to the column-DHT stage, where the ordering of the stored rows and columns of data within the HSM is as specified in Sect. 10.3 of Chap. 10. The outputs of the column-DHT stage are subsequently directed to a suitably defined external output data store as they are produced.

A more detailed description of the computing architecture for this solution is as given in Fig. 11.3, whereby the active region of the double-buffered and partitioned DSM of Fig. 11.4, configurable as a 1-D array of 1 × 8 equally-sized memory banks, contains all N² samples of the latest 2-D input data set transferred from the ADC unit that are ready to be processed. The samples are stored row-wise within the DSM with consecutive samples of each N-sample data set being stored cyclically within consecutive memory banks.

Fig. 11.1 Computation of N × N 2-D SDHT via two-stage recursion (row-DHT and column-DHT stages sharing one FHT and a double-buffered memory; the FHT carries out 2N 1-D DHTs)

Fig. 11.2 Partitioned Hartley-space memory (8 × 8 memory banks). Note: Each row of memory banks holds N/8 complete row-DHT output sets, whilst each column subsequently holds N/8 complete column-DHT input sets – each input/output set comprising N samples. Dual-port memory enables DBR-reordered N-sample data sets to be retrieved from a single column of memory banks and transferred to the internal data memory of the Regularized FHT at a rate of one set every N/16 clock cycles

Fig. 11.3 Single-FHT recursive architecture for resource-efficient parallel computation of N × N 2-D SDHT. Note: DBR mapping used to reorder data for input to Regularized FHT

Fig. 11.4 Double-buffered and partitioned data-space memory (2 × 8 memory banks). Note: Each set of 8 memory banks holds N data sets, each of N input samples, for transfer to the internal data memory of the Regularized FHT, whilst the remaining set of eight memory banks is being filled with new data. Dual-port memory enables DBR-reordered N-sample data sets to be retrieved from memory and transferred to the internal data memory of the Regularized FHT at a rate of one set every N/16 clock cycles


Each N-sample row of the 2-D NAT-ordered input data set is read out, in DBR-reordered form, from the DSM's active region and written to the PDM's passive region, with the samples being stored cyclically within consecutive memory banks. Whilst this is happening, the previous row of data held in the PDM's active region is being read out and processed by the PE's arithmetic components for the execution of the row-DHT stage. The NAT-ordered outputs of the row-DHT stage are subsequently written row-wise to the HSM, with successive N-sample rows of output data being assigned cyclically to successive rows of memory banks and with consecutive samples of each N-sample row being stored cyclically within consecutive memory banks of the appropriate row. On the completion of the row-DHT stage, the contents are read out column-wise from the HSM, in DBR-reordered form, and written to the PDM's passive region, with consecutive samples being stored cyclically within consecutive memory banks. Whilst this is happening for each given N-sample column of NAT-ordered data, the data held in the PDM's active region is being read out and processed by the PE's arithmetic components for execution of the column-DHT stage – the outputs of which are then directed to a suitably defined external output data store as they are produced.

By the time a new 2-D input data set has been stored within the DSM's passive region ready for processing, the processing for both the row-DHT and column-DHT stages needs to have been completed. Thus, for a realizable single-FHT solution, the processing needs to be able to keep up with the data set refresh rate, so that both the row-DHT and the column-DHT stages need to have been completed within N² clock cycles – at which point the DSM's passive region will have been filled with a new 2-D input data set ready to be processed – with the processing by the row-DHT stage of all N² input samples stored within the DSM's active region needing to be completed before the column-DHT stage can commence processing. Achieving such a solution is partly facilitated through the double-buffering of the PDM, which enables the reading, reordering and transfer of each N-sample data subset from the source memory (which may be either the DSM or the HSM) to be carried out at the same time that its predecessor is being processed by the PE's arithmetic components. The timing constraints to be addressed for this single-FHT solution will be discussed later in Sects. 11.4.3 and 11.4.5 in considerably more detail.
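A back-of-envelope MATLAB calculation – illustrative arithmetic only, based on the constraints stated above and on the workloads quoted in the notes to Figs. 11.1 and 11.5 – makes the resulting per-transform budgets explicit:

N = 1024;
update_period = N^2;                  % clock cycles between 2-D input data sets
budget_single = update_period/(2*N)   % single-FHT: 2N 1-D DHTs -> N/2 cycles each
budget_pipe   = update_period/N       % two-FHT pipeline: N 1-D DHTs per FHT -> N cycles each

Thus the pipelined solution doubles the time available to each R24 FHT for the production of every N-point transform, consistent with the factor-of-two time-complexity comparison made at the start of Sect. 11.3.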

11.3.2 Two-FHT Pipelined Architecture

The solution based upon the two-FHT pipelined architecture, as outlined in Fig. 11.5, carries out the processing for the row-DHT stage at the same time as the processing for the column-DHT stage. The solution achieves this by operating in a pipelined fashion, with the HSM of Fig. 11.2 now needing to be available in double-buffered form. The functions performed on the two regions of the HSM alternate with successive 2-D input data sets, with the passive region of the memory being used for storing the latest outputs of the row-DHT stage whilst the contents of the

Fig. 11.5 Computation of N × N 2-D SDHT via two-stage pipeline – Version A solution assumed. [Figure: data flows from the external input data source through a 2 × (N × N)-sample double-buffered memory into the row-DHT stage FHT, then through a second 2 × (N × N)-sample double-buffered memory into the column-DHT stage FHT, and on to the external output data store. Note: each FHT carries out N 1-D DHTs.]

Fig. 11.6 Two-FHT pipelined architecture for resource-efficient parallel computation of N × N 2-D SDHT – Version A solution assumed. [Figure: rows of 2-D DHT input data [1] from the external input data source are held in the 2 × (N × N)-word DSM and, via an address generator reading 8-sample sets of DBR-reordered data (× N/8, ×8), written to the PDM of Processing Element No 1 of the first N-point Regularized FHT, looping ×N over type [1] data sets to fill the rows of the 2 × (N × N)-word HSM with row-DHT output data; columns of row-DHT output data [2] are similarly reordered and written to the PDM of Processing Element No 2 of the second N-point Regularized FHT, looping ×N over type [2] data sets to produce the output data sent to the external output data store. Note: DBR mapping used to reorder data for input to Regularized FHT.]

active region, containing the previous outputs of the row-DHT stage, are being fed as inputs to the column-DHT stage – the outputs of which are then directed to a suitably defined external output data store as they are produced. A more detailed description of the computing architecture for this solution is given in Fig. 11.6, whereby the DSM of Fig. 11.4 – which is used in precisely the same way as for the single-FHT solution – may be reduced in size, as it need only cater now for the storage of one row of data at a time for each set of memory banks,


rather than for all N rows of data, as it is no longer necessary for all N² input samples to have been processed by the row-DHT stage before processing for the column-DHT stage can commence. This is due to the double-buffering of the row-DHT outputs, which enables one set of NAT-ordered row-DHT outputs to be processed by the column-DHT stage whilst another set is being produced. As a consequence, the size of the double-buffered DSM may be reduced, relative to that for the single-FHT solution, from 2N² words, for what's referred to hereafter as the 'Version A' solution, to 2N words, for what's referred to as the 'Version B' solution.

With the double-buffered HSM, the latest 2-D row-DHT output data set, as produced by the first R24 FHT, is written row-wise to the HSM's passive region with successive N-sample rows of output data being assigned cyclically to successive rows of memory banks and with consecutive samples of each N-sample row being stored cyclically within consecutive memory banks of the appropriate row. Whilst this is happening, the previous 2-D row-DHT output data set, now held in the HSM's active region, is being read out column-wise, in DBR-reordered form, and written to the passive region of the PDM residing on the PE of the second R24 FHT, with consecutive samples being stored cyclically within consecutive memory banks. Finally, whilst this is happening for each N-sample column of NAT-ordered data, the data held in the active region of this PDM is being processed by the PE's arithmetic components for execution of the column-DHT stage – the outputs of which are then directed to a suitably defined external output data store as they are produced.

Note that for realizable two-FHT solutions the processing, as for the single-FHT solution, needs to be able to keep up with the data set refresh rate, so that the simultaneous processing of the row-DHT and the column-DHT stages must be completed within N² clock cycles – at which point the DSM's and the HSM's passive regions will each have been filled with a new 2-D data set ready to be processed. Achieving such a solution is partly facilitated through the double-buffering of the PDM on each PE, which enables the reading, reordering and transfer of each N-sample data subset from each source memory (which may be either the DSM or the HSM) to the target PDM to be carried out at the same time that its predecessor is being processed by the target PE's arithmetic components. The different timing constraints to be addressed for these two-FHT solutions are discussed later in Sects. 11.4.3 and 11.4.5 in considerably more detail.
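The active/passive role-swapping of a double-buffered memory, as exploited by both the HSM and the PDMs above, can be modelled behaviourally in a few lines of Python. The sketch below is illustrative only – the class name and the trivial list-based storage are not part of the design:

```python
class PingPongMemory:
    # Minimal model of a double-buffered memory: the 'active' region is
    # read by the consumer whilst the 'passive' region is filled by the
    # producer; the roles swap once per data set.
    def __init__(self):
        self.regions = [[], []]   # two equally-sized regions
        self.active = 0           # index of region currently being read

    def write_passive(self, data):
        self.regions[1 - self.active] = list(data)

    def read_active(self):
        return self.regions[self.active]

    def swap(self):
        self.active = 1 - self.active

# Conceptual pipeline schedule for the two-FHT architecture: while the
# row-DHT stage writes data set i into the HSM's passive region, the
# column-DHT stage consumes data set i-1 from the HSM's active region.
hsm = PingPongMemory()
for i, data_set in enumerate([[1, 2], [3, 4], [5, 6]]):
    hsm.write_passive(data_set)      # row-DHT outputs for set i
    if i > 0:
        _ = hsm.read_active()        # column-DHT inputs for set i-1
    hsm.swap()
```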

11.3.3 Relative Merits of Proposed Architectures

The computing architectures described in this section extend the results of the original study carried out into the design of the R24 FHT for the computation of the 1-D DHT and the real-data DFT [10–12]. They enable the time-complexity to be directly traded off against the space-complexity, for the case of the 2-D SDHT, with the time-complexity being measured in terms of either the latency or the update time


and the space-complexity being measured in terms of an arithmetic component (i.e. the required numbers of multipliers and adders) and a memory component (i.e. the required amount of fast RAM). The two-FHT pipelined architecture offers solutions capable of running at approximately twice the rate of the single-FHT solution at the cost of an approximate doubling of the arithmetic component – the memory component, as already stated, may actually be reduced via the Version B solution.

The two-FHT solutions are able to achieve this high computational throughput through the exploitation of three levels of parallel processing via a parallel-pipelined approach: (1) 'coarse-grained' pipelining at the FHT level for the global operation of the two FHT-based stages of the algorithm; (2) 'fine-grained' pipelining at the arithmetic level for the internal operation of each PE, as discussed in Chap. 6 for the case of the R24 FHT; and (3) SIMD processing for the simultaneous execution of the multiple arithmetic operations to be performed within each stage of the fine-grained computational pipeline, also discussed in Chap. 6. Later, in Sect. 11.4.3, it will be seen how these features, when combined, enable the GD-BFLY to produce output woctads at the rate of one per clock cycle, so that when the FHT is carried out by means of an N-point R24 FHT a time-complexity of O(N²·log₄N) clock cycles may be achieved for the parallel computation of the 2-D SDHT.

Note that if fast memory access/transfer rates are to be achieved and maintained, for any of the above solutions, then it is desirable that the target computing device should possess sufficient on-chip memory in the form of fast dual-port RAM to ideally hold both the DSM and the HSM – as well as the PDM residing on the PE(s) – otherwise data will have to be repeatedly read/written from/to external memory, causing undesirable processing delays as a result of the movement of potentially large quantities of data (particularly as the number of data dimensions increases) both onto and off of the device before the final outputs have been produced. However, whereas the resource-efficient R24 FHT has already shown itself capable – from the complexity analysis and examples of Sect. 6.5 in Chap. 6 – of being mapped onto the smallest of FPGAs, when applied to the 1-D DHT/DFT, the quantities of data involved when dealing with the m-D SDHT, even when restricted to just two dimensions, may prove prohibitive for most low-to-medium range FPGAs. Therefore, for the processing of sufficiently large data sets it seems most likely that both types of memory will be required – that is, both on-chip and external memory – the task then being to optimize the use of the available fast on-chip RAM (that is, residing on the target computing device) so as to minimize the resulting processing delays. Clearly, the on-chip memory would need to be prioritized for the storage of those data and trigonometric coefficient sets that need to be retrieved/updated the most regularly, with the external memory being reserved for those data and trigonometric coefficient sets that can be retrieved/updated at a much slower rate. For the processing of certain 'reasonably-sized' 2-D data sets, however, the memory requirement could be satisfied with an FPGA chosen from the higher end of the available device range, such as with the 7 Series family of FPGAs from Xilinx


[21], for example, as the largest device in this family, in terms of memory capacity, offers 2820 blocks of Block-RAM, where each block can hold 1024 18-bit words (thus yielding a total of 2.88 MWords of RAM when using 18-bit words). These blocks may also be combined in pairs to instead yield 1410 blocks of Block-RAM, where each block now holds 1024 36-bit words (thus yielding a total of 1.44 MWords of RAM when using 36-bit words) – or some suitable combination of 18-bit blocks and 36-bit blocks in order to optimize the performance of the processing algorithms.

11.4 Complexity Analysis of 2-D SDHT

The space-complexity of the solutions to the 2-D SDHT based upon the adoption of the above-mentioned computing architectures is straightforwardly defined, as stated above, in terms of the arithmetic and memory components, although the distinction should be noted between a solution's 'arithmetic complexity', as defined in terms of the required number of arithmetic operations, and the 'arithmetic component' of its space-complexity, as defined in terms of the numbers of adders and multipliers needed for carrying out the operations. The time-complexity, on the other hand, may be defined in terms of either the latency or the update time. Clearly, for solutions based upon the single-FHT recursive architecture, the two metrics are equivalent, whereas for solutions based upon the two-FHT pipelined architecture, the latency will be approximately equal to twice the update time. Therefore, to avoid any confusion, it will be assumed hereafter that time-complexity for the 2-D case (and, ultimately, for the m-D case) will mean update time, as this metric will best determine whether the throughput rate of a given solution is able to keep up with the data set refresh rate and therefore whether the solution is able to achieve and maintain continuous real-time operation.

However, it is perhaps first worth introducing two related timing parameters that will be repeatedly referred to in this chapter, namely: the 'processing time', which is that part of the update period during which the processing of data is actually taking place; and the 'processing delay', which is that part of the update period when the processing of data is not taking place – thus, the two parameters are complementary with respect to the update period which, for the case of a 2-D data set of size N × N, is taken to be N² clock cycles. Note that the processing delay may also be regarded and referred to as a 'safety margin', being used to ensure that there's a sufficient time delay between the processing of consecutive input data sets/subsets (where the size of the data sets/subsets will be dependent upon the architecture used, with N² samples for the data sets of the single-FHT and Version A solutions but only N samples for the data subsets of the Version B solution) in order to eliminate possible timing problems arising from the combined effects of the various small timing delays, such as those due to pipelining and/or those that might be needed in order to avoid addressing conflicts – as discussed in Sect. 6.4.3 of Chap. 6.


11.4.1 Complexity Summary for Regularized FHT

Before delving into the complexity requirements of the proposed 2-D solutions, those of the R24 FHT are first summarized, as this algorithm plays such a key role in the overall complexity of each 2-D (and, ultimately, m-D) solution. The space-complexity of the N-point R24 FHT – as given for the Version II solution, summarized in Table 6.4 of Chap. 6, to be used throughout this chapter for sizing purposes – may be expressed in terms of an arithmetic component of 9 multipliers and 31 adders for the double butterfly operation and a memory component of 3N/4 words for the storage of the trigonometric coefficients within the PCM and 2N words for the double-buffered storage of the input data set within the PDM. These figures correspond to a solution designed to minimize the arithmetic component whilst keeping the PCM addressing relatively simple. Thus, the space-complexity of the Version II solution for the R24 FHT possesses an O(N) memory component, consisting of 11N/4 words, whilst the time-complexity (denoting the latency or, equivalently in this case, the update time) of O(N·log₄N) leads to an approximate figure of (N/8)·log₄N clock cycles after taking into account the eight-fold parallelism as introduced via the adoption of partitioned data memory. This time-complexity figure, as provided by Eq. 6.12 of Chap. 6, has been derived from the fact that the GD-BFLY is able to produce output woctads at the rate of one per clock cycle – with two data samples being read/written from/to each of the eight memory banks of each partitioned memory every two clock cycles, as discussed in Sect. 6.3.1 of Chap. 6.
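For a given radix-4 transform length these per-FHT figures can be evaluated directly; the helper below is an illustrative sketch of the two expressions just quoted (the function name is not taken from the original design):

```python
from math import log

def r24fht_complexity(N):
    # Version II R24 FHT figures quoted above: PCM (3N/4 words) plus
    # double-buffered PDM (2N words) gives 11N/4 words of memory, with
    # an update time of ~(N/8).log4(N) clock cycles.
    assert N >= 16 and 4 ** round(log(N, 4)) == N, "N must be a radix-4 integer"
    memory_words = 3 * N // 4 + 2 * N
    update_cycles = (N // 8) * round(log(N, 4))
    return memory_words, update_cycles

print(r24fht_complexity(1024))   # -> (2816, 640), i.e. 11N/4 words, (N/8).log4(N) cycles
```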

11.4.2 Space-Complexity of 2-D Solutions

Turning firstly to the single-FHT solution to the 2-D SDHT of Sect. 11.3.1, in order to address those memory management operations involved in getting data both into and out of the N-point R24 FHT, memories are required of sizes: (1) 2N² words for the double-buffered storage of the entire 2-D input data set within the DSM, from where it is reordered in N-sample subsets according to the DBR mapping and transferred to the PDM; and (2) N² words for storage of the 2-D data set within the HSM as built up from the outputs of the row-DHT stage. Thus, when combined with the memory requirement of the PDM, as given in Sect. 11.4.1, the space-complexity of the single-FHT solution to the 2-D SDHT possesses a memory component of approximately 3N² + (11/4)N words.

Turning now to the two-FHT solutions to the 2-D SDHT of Sect. 11.3.2, as the two N-point R24 FHTs are used simultaneously to carry out the processing of the two stages, the space-complexity of the 2-D SDHT, when compared to that for the single-FHT solution, will have its arithmetic component and PDM requirement both doubled – the double-buffered DSM requirement, however, may be reduced from 2N² words, as required for the Version A solution, to just 2N words, as required for the Version B solution, as already discussed in Sect. 11.3.2. On the other hand, the


storage of the row-DHT outputs within the HSM is now double-buffered, so that there is a requirement for an additional N² words of memory for the HSM. Thus, when combined with the memory requirement of the two PDMs, the space-complexity of the two-FHT solutions to the 2-D SDHT may have its memory component for Versions A and B approximated by 4N² + (11/2)N words and 2N² + (15/2)N words, respectively.

Summarizing, the space-complexity of the single-FHT solution to the 2-D SDHT has an arithmetic component of

$S_A^{(1)} = 9\ \text{multipliers}\ \&\ 31\ \text{adders}$   (11.18)

and a memory component of

$S_M^{(1)} \approx 3N^2 + \tfrac{11}{4}N$   (11.19)

words, with all N rows of the input data set being available within the DSM, whilst the space-complexity of the two-FHT solutions has an arithmetic component of

$S_A^{(2,A)} = S_A^{(2,B)} = 18\ \text{multipliers}\ \&\ 62\ \text{adders}$   (11.20)

for both solutions, a memory component of

$S_M^{(2,A)} \approx 4N^2 + \tfrac{11}{2}N$   (11.21)

words for the Version A solution, whereby all N rows of the input data set are available within the DSM, and a memory component of

$S_M^{(2,B)} \approx 2N^2 + \tfrac{15}{2}N$   (11.22)

words for the Version B solution, whereby the input data set is available only one row at a time within the DSM. Thus, the space-complexity of all three of the above solutions to the 2-D SDHT possesses a memory component of O(N²) words in catering for the combined requirements of the DSM, HSM and PDM(s).
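The three memory components of Eqs. 11.19, 11.21 and 11.22 are easily tabulated; the following sketch (with an illustrative function name) reproduces, for example, the N = 256 figures quoted later in Table 11.1:

```python
def memory_words_2d(N):
    # Memory components of Eqs. 11.19, 11.21 and 11.22, in words.
    return {
        "single-FHT":        3 * N**2 + 11 * N // 4,
        "two-FHT Version A": 4 * N**2 + 11 * N // 2,
        "two-FHT Version B": 2 * N**2 + 15 * N // 2,
    }

# N = 256 reproduces the ~198k, ~0.26M and ~0.13M word figures of Table 11.1.
print(memory_words_2d(256))
```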

11.4.3 Time-Complexity of 2-D Solutions

Turning firstly to the single-FHT solution to the 2-D SDHT of Sect. 11.3.1, there are memory management operations involved in getting data both into and out of the N-point R24 FHT, this including the transfer of each DBR-reordered N-sample data set from the source memory to the PDM for subsequent processing by the PE's arithmetic components, which, from Chap. 10, involves a time-complexity of


N/16 + δ_SU clock cycles, where δ_SU represents the time delay (of four clock cycles) resulting from the initialization of the intermediate data memory, DMIN. Therefore, as the double-buffering of data sets feeding into and out of the R24 FHT enables the memory management operations and the execution of the double butterflies to be carried out simultaneously, the time-complexity of the single-FHT solution to the 2-D SDHT may be expressed as

$T_{2D}^{(1)} \approx 2 \times \left(N \times T^{(0)}\right)$   (11.23)

clock cycles, where T^(0) is equal to the maximum of the time-complexities of the two overlapping sets of operations, so that

$T^{(0)} = \max\left(\tfrac{1}{16}N + \delta_{SU},\ \tfrac{1}{8}N\log_4 N + D_{SU}\right) \approx \tfrac{1}{8}N\log_4 N$   (11.24)

clock cycles, for all those data set sizes of practical interest – namely for N ≤ 16,384, as discussed in Sect. 6.6 of Chap. 6. Thus, the time-complexity of the single-FHT solution to the 2-D SDHT may be expressed as

$T_{2D}^{(1)} \approx \tfrac{1}{4}N^2\log_4 N$   (11.25)

clock cycles, this figure being based upon the assumption that the R24 FHT commences the processing of each new N-sample data set as soon as the processing of the previous N-sample data set has been completed, rather than having to wait until N clock cycles have elapsed each time. This is clearly possible, however, since all N² samples of the 2-D input data set are already available within the source memory before the processing commences.

Turning now to the two-FHT solutions to the 2-D SDHT of Sect. 11.3.2, the time-complexity required for the transfer of each DBR-reordered N-sample data set from each source memory to its target PDM is the same as for the single-FHT solution, namely N/16 + δ_SU clock cycles. With regard to the DSM, if all N² samples of the 2-D input data set are available within its active region, as is the case for the Version A solution, then as the two N-point R24 FHTs are operating simultaneously via the two-stage pipeline – one operating upon the row-DHT input data sets and the other upon the column-DHT input data sets – the time-complexity, when compared to that for the single-FHT solution, will approximately halve. Note, however, that if only N samples of the 2-D input data set are available from its active region, as is the case for the Version B solution, then the processing by each N-point R24 FHT will need to be restarted with the availability of each new N-sample data set – that is, every slicing period. Thus, for all those data set sizes of practical interest, whilst with the Version A solution, the processing is restarted after every N² clock cycles, with the Version B


solution, the processing is restarted after every N clock cycles. This, in turn, means that with the Version B solution, a small processing delay of N − T^(0) clock cycles is incurred following the processing of each N-sample data set by each R24 FHT, whereas with the Version A solution, where the processing time for each RCM stage is equal to the update time, a processing delay of N times that size is incurred following the processing by each R24 FHT of a complete 2-D data set (input data for the first R24 FHT and row-DHT/column-DHT output/input data for the second R24 FHT). As a result, although the update times of Versions A and B of the two-FHT solution are different, their processing times for an entire 2-D data set are identical; it's just that Version B incurs one relatively small processing delay every slicing period (of N clock cycles) whilst Version A incurs one relatively large processing delay every update period (of N² clock cycles). This also means that the Version A solution may be regarded as having a much larger safety margin, when viewed in absolute terms (that is, when measured in terms of clock cycles), than that of Version B. Note also that for the first 2-D data set to be processed, the single-FHT solution and Version A of the two-FHT solution each require an additional N² clock cycles to initialize the DSM whilst Version B requires just N additional clock cycles to initialize the DSM; all three solutions then require an additional N/16 clock cycles to initialize the PDM residing on the PE of the first R24 FHT. At this point the arithmetic components of the first PE are able to commence processing of the data.

Summarizing, the time-complexity of the single-FHT solution to the 2-D SDHT may be expressed as

$T_{2D}^{(1)} \approx \tfrac{1}{4}N^2\log_4 N$   (11.26)

clock cycles, with all N rows of the 2-D input data set being available within the DSM, whilst that for Version A of the two-FHT solution may be expressed as

$T_{2D}^{(2,A)} \approx \tfrac{1}{8}N^2\log_4 N$   (11.27)

clock cycles, for when all N rows of the 2-D input data set are available within the DSM, and that for Version B of the two-FHT solution as

$T_{2D}^{(2,B)} \approx N^2 - \left(N - T^{(0)}\right) = N^2 - N + \tfrac{1}{8}N\log_4 N$   (11.28)

clock cycles, for when the input data set is available only one row at a time within the DSM. The time-complexity figure produced for the Version B solution is based upon the assumption that the two R24 FHTs can only commence the processing of their respective N-sample data sets every slicing period of N clock cycles, rather than as soon as the processing of the previous N-sample data sets has been completed. Reiterating, the difference in the time-complexities (that is, in the update times) for


the two two-FHT solutions is based upon the fact that whereas the Version A solution only has to restart the processing of data every N² clock cycles, thus incurring a single processing delay with each new 2-D input data set, the Version B solution has to restart every N clock cycles, thus incurring a processing delay every N clock cycles.
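As a numerical check of Eqs. 11.26, 11.27 and 11.28, the update times may be evaluated directly against the update period; the helper below is an illustrative sketch only:

```python
from math import log

def update_times_2d(N):
    # Update times of Eqs. 11.26-11.28 in clock cycles, with T(0) the
    # per-row processing time of ~(N/8).log4(N) cycles (Eq. 11.24).
    t0 = (N / 8) * log(N, 4)
    return {
        "update period T_U": N**2,
        "single-FHT":        2 * N * t0,       # Eq. 11.26
        "two-FHT Version A": N * t0,           # Eq. 11.27
        "two-FHT Version B": N**2 - N + t0,    # Eq. 11.28
    }

# N = 1024: the single-FHT solution needs 5/4.T_U (not realizable),
# whilst Version A needs only 5/8.T_U.
print(update_times_2d(1024))
```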

11.4.4 Computational Density of 2-D Solutions

Suppose that the processing is to be carried out with L-bit precision, where the silicon complexity of an: (1) L-bit multiplier is of O(L²) slices of logic; (2) L-bit adder is of O(L) slices of logic; and (3) L-bit word of dual-port RAM is also of O(L) slices of logic [12]. Then, from the space and time complexities of Sects. 11.4.2 and 11.4.3, respectively, with a 'unit area of silicon' defined as L² slices of logic and a 'unit of time' defined as N² clock cycles, the computational density (ignoring their similar logic requirements for ease of analysis) of the single-FHT solution to the 2-D SDHT may be expressed as

$C_{2D}^{(1)} \approx (4L/3)/\log_4 N$   (11.29)

outputs per units of time and silicon area, with that for Version A of the two-FHT solution expressed as

$C_{2D}^{(2,A)} \approx 2L/\log_4 N$   (11.30)

outputs per units of time and silicon area, and that for Version B of the two-FHT solution expressed as

$C_{2D}^{(2,B)} \approx 4L/\log_4 N$   (11.31)

outputs per units of time and silicon area, these expressions concisely combining in an illustrative manner the space and time complexities for the three 2-D solutions considered.
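Under these definitions the densities are trivially evaluated; the sketch below assumes L = 18-bit working, matching the Block-RAM word length of Sect. 11.3.3, and uses an illustrative function name:

```python
from math import log

def computational_density_2d(N, L=18):
    # Densities of Eqs. 11.29-11.31: outputs per unit time (N^2 clock
    # cycles) and unit silicon area (L^2 slices of logic).
    e = log(N, 4)
    return {
        "single-FHT":        (4 * L / 3) / e,  # Eq. 11.29
        "two-FHT Version A": (2 * L) / e,      # Eq. 11.30
        "two-FHT Version B": (4 * L) / e,      # Eq. 11.31
    }

print(computational_density_2d(1024))
```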

11.4.5 Comparative Complexity of 2-D Solutions

For all three of the considered solutions to the 2-D SDHT, using both the single-FHT recursive architecture and the two-FHT pipelined architecture, the space-complexity possesses a memory component of O(N²) words in catering for the combined requirements of the DSM, HSM and PDM(s), whilst the time-complexity is of O(N²·log₄N) clock cycles when based solely on the processing times – that is,


when the processing delays are excluded. Clearly, the step-up in going from 1-D data sets to 2-D data sets means that the task of achieving resource-efficient 2-D solutions is now likely to be increasingly dominated by the question of how best to deal with the problem of a greatly increased data memory requirement, as expressed above by the O(N²) memory component, as it may well mean having to resort to the use of an FPGA chosen from the higher end of the available device range that's able to offer the required memory capacity – such as is described with the Xilinx device in Sect. 11.3.3.

A summary of the space-complexity and time-complexity results is provided in Table 11.1 below, for various sizes of 2-D data set, from which it can be deduced that with both architectures, the larger the size of the data set, the smaller the size of the safety margin and the more problematic the timing issues – such as those discussed in Sect. 6.4.3 of Chap. 6 – are likely to be. With regard to the space-complexity, the memory component – as expressed by Eqs. 11.19, 11.21 and 11.22 for the three solutions – for Version B of the two-FHT solution is less than that of the other two solutions, requiring approximately 1/2 that of Version A and 2/3 that of the single-FHT solution. With regard to the time-complexity, the single-FHT solution is realizable for the processing of those 2-D data sets for which N ≤ 64, whilst the two two-FHT solutions are each realizable for the processing of those 2-D data sets for which N ≤ 16,384 – meaning, in each case, that their update time is less than the update period, as dictated by the data set refresh rate.

Note that for Version B of the two-FHT solution, the update time includes the processing delays incurred after the processing of each N-sample data set, which is why the relative value of the update time to that of the update period in Table 11.1 is so close to one, for each size of 2-D data set. When restricted to the processing of each individual N-sample data set, however, the relative value of the processing time to that of the slicing period of N clock cycles needed to acquire the data is much lower, being approximated by the ratio of T^(0) to N – which is actually equivalent to the ratio of the update time to the update period for the Version A solution. As a result, Versions A and B of the two-FHT solution are equally viable in terms of a realizable implementation when viewed in terms of their processing times, although Version A possesses a much larger safety margin when viewed in absolute rather than relative terms.

Clearly, by modifying the number of rows of data that are able to be stored by the double-buffered DSM, so that the number lies between one and N, new versions of the two-FHT solution may be obtained, yielding different complexities and safety margins that lie between those provided by Versions A and B. Working backwards, therefore, it would be a simple task to determine the number of rows of data needing to be stored within the DSM in order to achieve a specific safety margin that's able to eliminate possible timing problems arising from the combined effects of the various small timing delays, such as those due to pipelining and/or those that might be needed in order to avoid addressing conflicts – as discussed in Sect. 6.4.3 of Chap. 6. Whichever solution is adopted, however, when considered in terms of processing time rather than update time, the time-complexity of all three solutions to the 2-D SDHT may be expressed as O(N²·log₄N) clock cycles.

Table 11.1 Complexity versus architecture for N × N 2-D SDHT solutions, where T_U is the update period of N² clock cycles – arithmetic requirement based upon Version II solution of regularized FHT

Single-FHT solution – arithmetic = 9 multipliers & 31 adders:

  N        Memory (words)   Update time + safety margin (clock cycles)
  16       0.9 × 10³        1/2·T_U + 1/2·T_U
  64       13 × 10³         3/4·T_U + 1/4·T_U
  256      198 × 10³        T_U + 0
  1024     3.2 × 10⁶        5/4·T_U + N/A
  4096     54 × 10⁶         3/2·T_U + N/A
  16,384   806 × 10⁶        7/4·T_U + N/A

Two-FHT solution, Version A – arithmetic = 18 multipliers & 62 adders:

  N        Memory (words)   Update time + safety margin (clock cycles)
  16       1.2 × 10³        1/4·T_U + 3/4·T_U
  64       17 × 10³         3/8·T_U + 5/8·T_U
  256      0.26 × 10⁶       1/2·T_U + 1/2·T_U
  1024     4.2 × 10⁶        5/8·T_U + 3/8·T_U
  4096     68 × 10⁶         3/4·T_U + 1/4·T_U
  16,384   1.1 × 10⁹        7/8·T_U + 1/8·T_U

Two-FHT solution, Version B – arithmetic = 18 multipliers & 62 adders:

  N        Memory (words)   Update time + safety margin (clock cycles)
  16       0.7 × 10³        (T_U − 3/4·N) + 3/4·N
  64       8.7 × 10³        (T_U − 5/8·N) + 5/8·N
  256      0.13 × 10⁶       (T_U − 1/2·N) + 1/2·N
  1024     2.2 × 10⁶        (T_U − 3/8·N) + 3/8·N
  4096     34 × 10⁶         (T_U − 1/4·N) + 1/4·N
  16,384   0.54 × 10⁹       (T_U − 1/8·N) + 1/8·N
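The entries of Table 11.1 follow mechanically from Eqs. 11.19, 11.21 and 11.22 and Eqs. 11.26, 11.27 and 11.28; the short script below regenerates them (an illustrative sketch only, with hypothetical helper names):

```python
from math import log

def table_11_1_row(N):
    # One row of Table 11.1: memory (words) and update time expressed
    # as a fraction of the update period T_U = N^2 clock cycles.
    e = round(log(N, 4))   # radix exponent of N
    return {
        "N": N,
        "single-FHT": (3 * N * N + 11 * N // 4, f"{e}/4 . T_U"),
        "Version A":  (4 * N * N + 11 * N // 2, f"{e}/8 . T_U"),
        "Version B":  (2 * N * N + 15 * N // 2, f"T_U - {8 - e}/8 . N"),
    }

for N in (16, 64, 256, 1024, 4096, 16384):
    print(table_11_1_row(N))
```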


11.4.6 Relative Start-up Delays and Update Times of 2-D Solutions

For those cases of practical interest, where the update time is less than the update period, as dictated by the data set refresh rate, the time delay to the production of the first complete 2-D output data set for the single-FHT solution is given by

$D_{2D}^{(1)} \approx \left(N^2 + N/16\right) + 2 \times \left(N \times T^{(0)}\right) = N^2 + N/16 + \tfrac{1}{4}N^2\log_4 N$   (11.32)

clock cycles, with an elapsed time – namely the update period – of N² clock cycles between the subsequent production of consecutive 2-D output data sets, whilst that for Version A of the two-FHT solution is similarly given by

$D_{2D}^{(2,A)} \approx \left(N^2 + N/16\right) + 2 \times \left(N \times T^{(0)}\right) = N^2 + N/16 + \tfrac{1}{4}N^2\log_4 N$   (11.33)

clock cycles, with an elapsed time of N² clock cycles between the subsequent production of consecutive 2-D output data sets, and that for Version B of the two-FHT solution by

$D_{2D}^{(2,B)} \approx \left(N + N/16\right) + 2 \times \left(N^2 - N + T^{(0)}\right) = 17N/16 + 2 \times \left(N^2 - N + \tfrac{1}{8}N\log_4 N\right)$   (11.34)

clock cycles, with an elapsed time of N² clock cycles between the subsequent production of consecutive 2-D output data sets – with each of the above three equations thus including the time delay needed to initialize both the double-buffered DSM and the double-buffered PDM residing on the PE of the first R24 FHT.

Thus, the above complexity results tell us that the respective time delays of Versions A and B of the two-FHT solution to the production of the first 2-D output data set are in the ratio, $R_D^{(2)}$, where

$R_D^{(2)} = D_{2D}^{(2,A)}/D_{2D}^{(2,B)} \approx \left(4 + \log_4 N\right)/\left(8 + \tfrac{1}{N}\log_4 N\right)$   (11.35)

which increases linearly with respect to the radix exponent for N, from an approximate value of 3/4 for N = 16 to an approximate value of 11/8 for N = 16,384, thereby incrementing by 1/8 with increasing radix exponent. This implies that Version A of the two-FHT solution is able to produce the first 2-D output data set more quickly than Version B, for those data sets for which N ≤ 64, and as quickly as the single-FHT solution, for those data sets for which N ≤ 16,384.


The complexity results also tell us that the respective update times of Versions A and B of the two-FHT solution are in the ratio, $R_U^{(2)}$, where

$R_U^{(2)} = T_{2D}^{(2,A)}/T_{2D}^{(2,B)} \approx \log_4 N/\left(8 + \tfrac{1}{N}\log_4 N\right)$   (11.36)

which increases linearly with respect to the radix exponent for N, from an approximate value of 1/4 for N = 16 to an approximate value of 7/8 for N = 16,384, thereby incrementing by 1/8 with increasing radix exponent. This implies that Version A of the two-FHT solution is able to produce new 2-D output data sets more quickly than Version B (although they both involve the same processing time) and thus with a larger safety margin, and twice as quickly as the single-FHT solution, for those data sets for which N ≤ 16,384. However, from the memory component of their space-complexities – as expressed by Eqs. 11.19, 11.21 and 11.22 for the three solutions – Version B of the two-FHT solution requires less memory than the other two solutions, requiring approximately 1/2 that of Version A and 2/3 that of the single-FHT solution.

11.4.7 Application of 2-D SDHT to Filtering of 2-D Data Sets

The most direct approach to carrying out the digital filtering of a size N × N real-valued data set by means of a filter with an impulse response function (or, for imaging systems, a point spread function) of size M × M is to carry out the computations in data-space using real-only arithmetic, whereby each filtered output would involve O(M²) arithmetic operations. Provided the parameters M and/or N are sufficiently large, however, there are computational advantages to be had in moving the domain of computation from data-space to transform-space, via a 2-D version of the CCT, where a fast transform could then be used to some advantage. Traditionally, this might be performed in Fourier-space with a 2-D FFT – typically possessing an O(N²·logN) arithmetic-complexity for the processing of an N × N data set – being used for both the forward and the inverse transformations. The drawback of such an approach, however, is that whereas the forward transformation would involve mapping real-valued input data to complex-valued output data, the inverse transformation would involve mapping complex-valued input data to real-valued output data, so that different algorithms would be required for carrying out the two transformations in an efficient manner – leading to a potentially complex and irregular solution that requires complex arithmetic despite the real-valued nature of the problem. Alternatively, the problem could be simply overcome by carrying out the computations in Hartley-space via a parallel solution to the 2-D SDHT and a 2-D version of the CCT, as discussed in Sect. 11.2.2. This was shown to involve the addition/subtraction of four element-wise matrix products, these being each of size K × K, where K ≥ M + N to allow for the zero-padding [5, 7] of both the data set and the


Fig. 11.7 Scheme for digital filtering of N × N 2-D data set via 2-D SDHT where impulse response function is even and real-valued. [Figure: the zero-padded data set {x[n₁,n₂]} is transformed by a 2-D SDHT into its 'even' and 'odd' Hartley-space components, E₂[k₁,k₂] and O₂[k₁,k₂]; these are multiplied element-wise by the pre-computed SDHT components E₁[k₁,k₂] and E₁[k₁,−k₂] of the zero-padded impulse response function {h[n₁,n₂]}, summed, and returned to data-space by a further SDHT to yield the filtered outputs {y[k₁,k₂]}.]

impulse response function – ideally, parameter K will also be chosen as a radix-4 integer if the R24 FHT is to be most effectively exploited. Significant computational savings may be made, however, when the impulse response function is assumed – as is typically the case – to be a real-valued even function. The circular convolution of the 2-D data set and the 2-D impulse response function, previously expressed by Eq. 11.13 as the addition/subtraction of four element-wise Hartley-space products, then reduces to

$\mathrm{SDHT}\left(\{h[n_1,n_2]\} \circledast \{x[n_1,n_2]\}\right) = \left\{H_E^{(SH)}[k_1,k_2]\,X_E^{(SH)}[k_1,k_2] + H_E^{(SH)}[k_1,-k_2]\,X_O^{(SH)}[k_1,k_2]\right\}$   (11.37)

the sum of just two element-wise Hartley-space products, as illustrated in Fig. 11.7, where the even and odd components of the Hartley-space version of the impulse response function may be pre-computed and stored in suitably defined LUTs. As a result, a three-stage computational pipeline may be defined for the digital filtering of the 2-D data set, with the first stage comprising a 2-D SDHT for transforming the 2-D input data from data-space to Hartley-space, followed by a second stage comprising a Hartley-space filtering module, which involves the combining of the two Hartley-space products, followed finally by a third stage comprising another 2-D SDHT for returning the filtered data from Hartley-space back to data-space. Blocks of double-buffered memory, each partitioned into eight memory banks, would be required both on entry to, and on exit from, the filtering stage to enable all three


stages of the computational pipeline to operate simultaneously – the size of the memories would be dependent upon how the filtering stage is to be implemented. Let us assume the worst-case situation, computationally, where the size of the impulse response function is the same as that of the data set, so that K = 2N. Then the filtering stage of the computational pipeline, as performed in Hartley-space, will involve 8N² real MAC operations which will need to be carried out within the update period of N² clock cycles, as dictated by the data set refresh rate, in order for continuous real-time operation to be achieved and maintained. Therefore, a high level of parallelism will need to be exploited in order for this to happen, involving pipelined and/or SIMD processing for the simultaneous execution of the multiple read/write instructions and arithmetic operations to be performed upon the woctads accessed from partitioned memory. From Fig. 11.7, it is evident that the two K × K element-wise matrix products may be carried out simultaneously, followed by the application of one K × K matrix addition to yield the required Hartley-space filtered outputs. For the two transformations, from data-space to Hartley-space and from Hartley-space to data-space, a size 2N × 2N 2-D SDHT would be needed in each case in order to produce the filtered outputs of the size N × N input data set. Thus, as an example, a size 512 × 512 input data set would require each 2-D SDHT to be of size 1024 × 1024, which, with Version A of the two-FHT solution, would result in a new block of filtered outputs being produced every 5N²/8 clock cycles – see contents of Table 11.1 of Sect. 11.4.5 – well within the N² clock cycles of the update period, as dictated by the data set refresh rate.
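The sizing argument of this section is easily mechanized; the sketch below (with illustrative, hypothetical names, and rounding K up to a radix-4 integer as suggested above) reproduces the worst-case figure of 8 real MACs per clock cycle for the filtering stage:

```python
def filtering_stage_requirements(N, M):
    # Pad the N x N data set and M x M impulse response to K x K, with
    # K >= M + N rounded up to a radix-4 integer; the Hartley-space
    # filtering stage then involves two K x K element-wise products,
    # i.e. 2*K^2 real MACs per update period of N^2 clock cycles.
    K = M + N
    K_pad = 16
    while K_pad < K:
        K_pad *= 4
    macs = 2 * K_pad * K_pad
    update_period = N * N
    return {"padded size K": K_pad,
            "real MACs per update": macs,
            "required MACs per clock cycle": macs / update_period}

# Worst case M = N from the text: K = 2N, hence 8 MACs per clock cycle.
print(filtering_stage_requirements(512, 512))
```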

11.4.8 Application of 2-D SDHT to Computation of 2-D Real-Data DFT

The most direct approach to the computation of the 2-D real-data DFT is via a 2-D version of the RCM, whereby the computation reduces to a row-DFT stage followed by a column-DFT stage, thus enabling a 1-D FFT to be used for carrying out the processing for both stages in an efficient manner. The drawback of such an approach, however, is that for the row-DFT stage the 1-D FFT would need to map real-valued input data to complex-valued output data, whereas for the column-DFT stage it would need to map complex-valued input data to complex-valued output data, so that different 1-D FFT algorithms would be required for carrying out the processing for the two stages of the RCM in an efficient manner – leading to a potentially complex and irregular solution. Alternatively, the problem could be simply overcome by carrying out the computations via a parallel solution to the 2-D SDHT, as discussed in Sect. 11.2.3, with the associated time-complexity figures for the 2-D SDHT given by Eqs. 11.26, 11.27 and 11.28 needing to be suitably modified in order to take account (according to Eqs. 11.15 and 11.16) of the conversion routine which involves the execution of an


additional two right shifts, one addition and one subtraction for each DFT output. As the N × N output data set produced by the real-data DFT satisfies the Hermitian-symmetry property

$X^{(F)}[-k_1,-k_2] = \mathrm{conj}\left(X^{(F)}[+k_1,+k_2]\right)$   (11.38)

so the number of independent outputs reduces to N²/2, these outputs being provided by any two adjacent quadrants of the 2-D DFT spectrum – given that opposite quadrants are related by the above conjugate equivalence. As a result, an additional N² right shifts, N²/2 additions and N²/2 subtractions would be required for the derivation of each complete 2-D real-data DFT output data set from the corresponding 2-D SDHT outputs, although when compared to the complexity figures of the 2-D SDHT, as given in Table 11.1 of Sect. 11.4.5, this is a minimal additional cost.

Note also that an efficient solution to the 2-D SDHT could also be used to some advantage when applied to the computation of the 2-D complex-data DFT, where one 2-D SDHT solution would be assigned the task of processing the real component of the 2-D data set and another the task of processing the imaginary component. A suitably defined conversion routine would then be used which combines the two sets of 2-D real-valued Hartley-space outputs to produce a single set of 2-D complex-valued Fourier-space outputs. A highly-parallel 'dual-SDHT' solution such as this would be able to achieve, for the complex-data case, the same level of speed-up over a purely sequential solution as already achieved for the real-data case, at the cost of a doubling of the silicon resources – thus, it would possess the attractive linear property of requiring twice the amount of silicon resources in order to achieve a doubling of the throughput rate.

11.5 Generalization of 2-D Solutions to Processing of m-D Data Sets

The generalization of the 2-D designs discussed in this chapter to facilitate the processing of m-D data sets, for m ≥ 2, where the common length N of each dimension of the transform is taken to be a radix-4 integer for compatibility with the adoption of the N-point R24 FHT, follows in an obvious fashion via the application of the m-D form of the RCM using the separable version of the m-D DHT, whose outputs, as illustrated for the 2-D case by Eqs. 11.11 and 11.12, are directly related to those of the non-separable version [2]. As a result, solutions to the m-D SDHT, which may be encountered in various m-D DSP applications, such as those of computer vision and medical imaging [18], may be obtained using either: (1) a single-FHT recursive architecture, whereby the space-complexity is optimized at the expense of the time-complexity; or (2) a multi-FHT pipelined architecture, as obtained when the m-fold recursion of the single-FHT solution is unfolded, whereby

Fig. 11.8 Data movement between source and target memories for generic stage of m-D SDHT. [Figure: data passes from the source memory (the DSM or an HSM, as fed by the ADC or a PDM) to the target memory (the PDM of the target PE) and on to either an HSM or the output data file.]

each stage of the RCM is assigned its own FHT and, apart from the last stage, its own block of HSM, so that time-complexity is optimized at the expense of the space-complexity – namely, an m-fold increase in the arithmetic component as well as a commensurate increase in the memory component. Each design, like that of the R24 FHT, can be shown to possess the attractions of being scalable (which, for the m-D case where 'm' is fixed, is understood to be in relation to the common length N of each dimension of the transform), highly parallel and highly regular, in the sense that the same processing scheme may be adopted for each stage of the RCM-based formulation of the m-D SDHT, as illustrated with the movement of data between the source and target memories of a generic stage in Fig. 11.8 – each block of HSM being now of m-D form and assumed to possess eight memory banks in each dimension. In order to assess the performance of the m-D SDHT solutions, the space and time complexities are evaluated, as with the 2-D case, using the Version II solution of the R24 FHT.

Before dealing with complexity issues, however, note that it can be shown that, as with the discussions of the 2-D case in Sect. 11.2.1, the separable and non-separable versions of the 3-D DHT may each be simply obtained [7, 9], one from the other, at minimal computational expense involving the addition/subtraction of four 3-D matrices, each of size N × N × N. Also, it can be shown that, as with the discussions of the 2-D case in Sect. 11.2.2, a 3-D version of the CCT may be defined for use with the 3-D SDHT whereby the circular convolution of two 3-D data sets, each of size N × N × N, may be simply expressed (up to a scaling factor) in terms of the addition/subtraction of four element-wise Hartley-space products [7, 9], thus enabling the 3-D SDHT to be effectively used for carrying out the filtering-type operations required for the processing of 3-D images. Finally, as with the discussions of the 2-D case in Sect. 11.4.7, significant computational savings may be made when the nature of one or other of the functions generating the 3-D data sets is known, as, for example, when one of the two functions is taken to be an impulse response function which is also a real-valued even function.
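In software terms, the m-D form of the RCM amounts to sweeping a 1-D DHT along each of the m axes in turn, each sweep corresponding to the generic stage of Fig. 11.8. The NumPy sketch below is illustrative only, with a direct O(N²) 1-D DHT again standing in for the R24 FHT:

```python
import numpy as np

def dht_1d(x):
    # Direct 1-D DHT, as before: an O(N^2) stand-in for the R24 FHT.
    N = len(x)
    theta = 2.0 * np.pi * np.outer(np.arange(N), np.arange(N)) / N
    return (np.cos(theta) + np.sin(theta)) @ x

def sdht_md(X):
    # m-D separable DHT via the m-D form of the RCM: a 1-D DHT is
    # applied along each of the m dimensions in turn.
    for axis in range(X.ndim):
        X = np.apply_along_axis(dht_1d, axis, X)
    return X

V = np.random.rand(16, 16, 16)   # 3-D data set, N = 16 per dimension
out = sdht_md(V)
```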


11.5.1 Space and Time Complexities of m-D Solutions

The space-complexity of the single-FHT solution to the m-D SDHT has an arithmetic component identical to that for the 1-D and 2-D cases, when based upon the use of the R24 FHT, namely

$S_A^{(1)} = 9\ \text{multipliers}\ \&\ 31\ \text{adders}$   (11.39)

together with a memory component of

$S_M^{(1)} \approx 3N^m + \tfrac{11}{4}N$   (11.40)

words, with all N^(m−1) rows of the m-D input data set being available within the DSM. For both of the multi-FHT solutions to the m-D SDHT, the space-complexity has an arithmetic component given by

$S_A^{(m,A)} = S_A^{(m,B)} = 9m\ \text{multipliers}\ \&\ 31m\ \text{adders}$   (11.41)

with a memory component of

$S_M^{(m,A)} \approx 2m \times N^m + \tfrac{11}{4}m \times N$   (11.42)

words for the Version A solution, whereby all N^(m−1) rows of the m-D input data set are available within the DSM, and a memory component of

$S_M^{(m,B)} \approx (2m - 2) \times N^m + \left(\tfrac{11}{4}m + 2\right) \times N$   (11.43)

words for the Version B solution, whereby the input data set is available only one row at a time from the DSM.

The time-complexity of the single-FHT solution to the m-D SDHT may be expressed as

$T_{mD}^{(1)} \approx m \times \left(N^{m-1} \times T^{(0)}\right) = \tfrac{m}{8}N^m\log_4 N$   (11.44)

clock cycles, which, as for the 2-D case, is based upon the assumption that the R24 FHT commences the processing of each new N-sample data set as soon as the processing of the previous N-sample data set has been completed, rather than having to wait until the N clock cycles of the slicing period has elapsed each time. This is clearly possible, however, since all N^m samples of the m-D input data set are already


available within the DSM before the processing commences. For Version A of the multi-FHT solution to the m-D SDHT, the time-complexity is given by

$T_{mD}^{(m,A)} \approx N^{m-1} \times T^{(0)} = \tfrac{1}{8}N^m\log_4 N$   (11.45)

clock cycles, whereby all N^(m−1) rows of the m-D input data set are available within the DSM, and for Version B by

$T_{mD}^{(m,B)} \approx N^m - \left(N - T^{(0)}\right) = N^m - N + \tfrac{1}{8}N\log_4 N$   (11.46)

clock cycles, whereby the input data set is available only one row at a time from the DSM. The time-complexity figure produced for the Version B solution is based upon the assumption that the 'm' R24 FHTs can only commence the processing of their respective N-sample data sets every slicing period of N clock cycles, rather than as soon as the processing of the previous N-sample data sets has been completed.

Note that, as for the 2-D case, the difference in the time-complexities (that is, in the update times) for the multi-FHT solutions is based upon the fact that whereas the Version A solution only has to restart the processing of data every update period of N^m clock cycles, thus incurring a single processing delay, the Version B solution has to restart every slicing period of N clock cycles, thus incurring a processing delay every N clock cycles. Note also that for the first m-D data set to be processed, the single-FHT solution and Version A of the multi-FHT solution each require an additional N^m clock cycles to initialize the DSM whilst Version B requires just N additional clock cycles to initialize the DSM; all three solutions then require an additional N/16 clock cycles to initialize the PDM residing on the PE of the first R24 FHT. At this point the arithmetic components of the first PE are able to commence processing of the data.
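Equations 11.40 to 11.46 are easily evaluated for any (N, m) pair; the sketch below (illustrative names only) reproduces, for m = 3, the N = 16 figures of Table 11.2 given later in Sect. 11.6:

```python
from math import log

def md_complexities(N, m):
    # Space (memory words) and time (clock cycles) figures of
    # Eqs. 11.40-11.46 for the m-D SDHT solutions.
    e = log(N, 4)
    t0 = (N / 8) * e
    memory = {
        "single-FHT": 3 * N**m + (11 / 4) * N,                     # Eq. 11.40
        "Version A":  2 * m * N**m + (11 / 4) * m * N,             # Eq. 11.42
        "Version B":  (2 * m - 2) * N**m + ((11 / 4) * m + 2) * N, # Eq. 11.43
    }
    update = {
        "single-FHT": m * N**(m - 1) * t0,                         # Eq. 11.44
        "Version A":  N**(m - 1) * t0,                             # Eq. 11.45
        "Version B":  N**m - N + t0,                               # Eq. 11.46
    }
    return memory, update

print(md_complexities(16, 3))
```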

11.5.2 Comparative Complexity of m-D Solutions

By modifying the number of rows of data needing to be stored by the double-buffered DSM, so that the number lies between one and N^(m−1), new versions of the multi-FHT solution may be obtained – as discussed for the two-FHT case in Sect. 11.4.5 – yielding different safety margins that lie between those provided by Versions A and B. Working backwards, therefore, it would be a simple task to determine the number of rows of data needing to be stored within the DSM in order to achieve a specific safety margin that's able to eliminate the possibility of timing problems arising from the combined effects of the various small timing delays, such as those due to pipelining and/or those that might be needed in order to avoid


addressing conflicts – as discussed in Sect. 6.4.3 of Chap. 6. For all three of the above solutions to the m-D SDHT, using both the single-FHT recursive architecture and the multi-FHT pipelined architecture, the space-complexity possesses a memory component of O(N^m) words in catering for the combined requirements of the DSM, HSM(s) and PDM(s), whilst the time-complexity is of O(N^m·log₄N) clock cycles, when based solely on the processing times – that is, when the processing delays are excluded.

The multi-FHT solutions, as with the two-FHT solutions already discussed, are able to achieve a high computational throughput through the exploitation of three levels of parallel processing via a parallel-pipelined approach: (1) 'coarse-grained' pipelining at the FHT level for the global operation of the multiple FHT-based stages of the algorithm; (2) 'fine-grained' pipelining at the arithmetic level for the internal operation of each PE, as discussed in Chap. 6 for the case of the R24 FHT; and (3) SIMD processing for the simultaneous execution of the multiple arithmetic operations to be performed within each stage of the fine-grained computational pipeline, also discussed in Chap. 6.

However, in generalizing the 2-D complexity results discussed in Sect. 11.4.5, it is clear that the step-up in going from 1-D data sets to 2-D data sets and now to m-D data sets, for m ≥ 2, means that the task of achieving resource-efficient m-D solutions is likely to be increasingly dominated by the question of how best to deal with the greatly increased data memory requirement, as expressed above by the O(N^m) memory component, as it will almost certainly mean having to resort to the use of an FPGA chosen from the higher end of the available device range that's able to offer the required memory capacity – such as is described with the Xilinx device in Sect. 11.3.3. The contents of Table 11.1, as provided in Sect. 11.4.5, have already illustrated that even for the 2-D case the space-complexity for all those transform sizes of interest is dominated by the increasing memory component, whilst the modest arithmetic component for each of the chosen architectures remains constant.

Finally, note that the computational densities of the 2-D solutions, as produced in Sect. 11.4, may be generalized in a straightforward fashion (where 'unit of time' is now defined as N^m clock cycles), with an approximation for the single-FHT solution to the m-D SDHT expressed as

$C_{mD}^{(1)} \approx (8L/3m)/\log_4 N$   (11.47)

outputs per units of time and silicon area, with that for Version A of the multi-FHT solution expressed as

$C_{mD}^{(m,A)} \approx (4L/m)/\log_4 N$   (11.48)

outputs per units of time and silicon area, and that for Version B of the multi-FHT solution expressed as

$C_{mD}^{(m,B)} \approx \left(4L/(m-1)\right)/\log_4 N$   (11.49)

outputs per units of time and silicon area, these expressions concisely combining in an illustrative manner the space and time complexities for the three m-D solutions considered.

11.5.3 Relative Start-up Delays and Update Times of m-D Solutions

For those cases of practical interest, where the update time is less than the update period, as dictated by the data set refresh rate, the time delay to the production of the first complete m-D output data set for the single-FHT solution to the m-D SDHT is given by

$D_{mD}^{(1)} \approx \left(N^m + N/16\right) + m \times \left(N^{m-1} \times T^{(0)}\right) = \left(N^m + N/16\right) + \tfrac{m}{8}N^m\log_4 N$   (11.50)

clock cycles, with an elapsed time – namely the update period – of N^m clock cycles between the subsequent production of consecutive m-D output data sets, whilst that for Version A of the multi-FHT solution to the m-D SDHT is similarly given by

$D_{mD}^{(m,A)} \approx \left(N^m + N/16\right) + m \times \left(N^{m-1} \times T^{(0)}\right) = \left(N^m + N/16\right) + \tfrac{m}{8}N^m\log_4 N$   (11.51)

clock cycles, with an elapsed time of N^m clock cycles between the subsequent production of consecutive m-D output data sets, and that for Version B of the multi-FHT solution by

$D_{mD}^{(m,B)} \approx \left(N + N/16\right) + m \times \left(N^m - N + T^{(0)}\right) = 17N/16 + m \times \left(N^m - N + \tfrac{1}{8}N\log_4 N\right)$   (11.52)

clock cycles, with an elapsed time of N^m clock cycles between the subsequent production of consecutive m-D output data sets – with each of the above three equations thus including the time delays needed to initialize both the double-buffered DSM and the double-buffered PDM residing on the PE of the first R24 FHT.


Thus, the complexity results tell us that the respective time delays of Versions A and B of the multi-FHT solution to the production of the first m-D output data set are in the ratio, $R_D^{(m)}$, where

$R_D^{(m)} = D_{mD}^{(m,A)}/D_{mD}^{(m,B)} \approx \left(8 + m\log_4 N\right)/m\left(8 + N^{1-m}\log_4 N\right)$   (11.53)

which increases linearly with respect to the radix exponent for N, for each fixed value of m, so that for the case where m = 3 it varies from an approximate value of 7/12 for N = 16 to an approximate value of 29/24 for N = 16,384, thereby incrementing by 1/8 with increasing radix exponent. This implies, for example, that Version A of the three-FHT solution is able to produce the first 3-D output data set more quickly than Version B, for those data sets for which N ≤ 1024, and as quickly as the single-FHT solution, for those data sets for which N ≤ 16,384. However, from the memory component of their space-complexities – as expressed by Eqs. 11.40, 11.42 and 11.43 for the three solutions – Version B of the three-FHT solution requires approximately 2N³ words of memory less than Version A but approximately N³ words of memory more than that of the single-FHT solution.

The complexity results also tell us that the respective update times of Versions A and B of the multi-FHT solution to the m-D SDHT are in the ratio, $R_U^{(m)}$, where

$R_U^{(m)} = T_{mD}^{(m,A)}/T_{mD}^{(m,B)} \approx \log_4 N/\left(8 + N^{1-m}\log_4 N\right)$   (11.54)

which increases linearly with respect to the radix exponent for N, from an approximate value of 1/4 for N = 16 to an approximate value of 7/8 for N = 16,384, thereby incrementing by 1/8 with increasing radix exponent for each value of m. This implies, for example, that Version A of the three-FHT solution is able to produce new 3-D output data sets more quickly than Version B (although they both involve the same processing time) and thus with a larger safety margin, and three times as quickly as the single-FHT solution, for those data sets for which N ≤ 16,384.

11.6 Constraints on Achieving and Maintaining Real-Time Operation

From the above generalized time-complexity results of Eqs. 11.44, 11.45 and 11.46, together with the timing constraint requiring that the update time must be less than the update period, it may be deduced for the case of the single-FHT solution to the m-D SDHT that continuous real-time operation may only be achieved and maintained when the radix-4 integer N is such that

$\log_4 N < 8/m$   (11.55)

which is dependent upon the number of dimensions of the data set being processed, whereas with the multi-FHT solutions to the m-D SDHT, continuous real-time operation may be achieved and maintained for both the Version A and B solutions when N is such that

$\log_4 N < 8$   (11.56)

which equates to an upper bound of seven, being independent of the number of dimensions of the data set being processed. Thus, ignoring for the moment the space-complexity issues arising from the size of the memory component, the single-FHT solution, when applied to 2-D data (as defined in terms of ‘rows’ and ‘columns’), is realizable for those data sets for which N 64 and when applied to 3-D data (as defined in terms of ‘rows’, ‘columns’ and ‘slices’) is realizable for those data sets for which N 16. The two multi-FHT solutions, on the other hand, when applied to m-D data, are realizable for those data sets for which N 16,384 for all values of m. The size and timing constraints for the 2-D and 3-D cases may be straightforwardly verified from an examination of the contents of Table 11.1, from Sect. 11.4.5, for the 2-D case and Table 11.2, as given below, for the 3-D case. Note that there is a clear similarity in the timing figures provided by the two tables, relative to their respective update periods, so that the relative time-complexities clearly generalize to the m-D case. Therefore, if the timing constraint is met for a given value of N for the 2-D case, then it will be similarly met for that value of N for any number of dimensions. Suppose now that the target computing device is assumed, for the purpose of illustration, to be chosen from the Xilinx 7 Series family of FPGAs, as described earlier in Sect. 11.3.3, where the particular device chosen offers the maximum memory capacity of 2.88 MWords of RAM with 18-bit words. From the generalized figures produced for the space-complexity – where the memory component for the three solutions is as given by Eqs. 11.40, 11.42 and 11.43 – it may be deduced, when restricted to the 2-D case, that the single-FHT solution may be carried out with the entire memory component (that is, the combined DSM, HSM and PDM requirements) catered for by the fast on-chip RAM for those data sets for which N 256. For the two-FHT solutions, whereas Version A, when applied to 2-D data, may also be carried out with all of its memory requirement catered for by the on-chip RAM for those data sets for which N 256, with Version B, the range of validity or realizability extends out to those data sets for which N 1024. When both the timing and the memory constraints are taken into account, the range of validity or realizability for those 2-D solutions using only fast on-chip RAM is restricted in the following way: (1) the single-FHT solution to those data sets for which N 64; (2) Version A of the two-FHT solution to those data sets for which N 256; and (3) Version B of the two-FHT solution to those data sets for which N 1024. Note, however, that with the adoption and efficient usage of sufficiently

Table 11.2 Complexity versus architecture for N × N × N 3-D SDHT solutions, where T_U is the update period of N³ clock cycles – arithmetic requirement based upon Version II solution of the regularized FHT

Single-FHT solution (arithmetic = 9 multipliers & 31 adders):

  N        Memory (words)   Update time    Safety margin (clock cycles)
  16       12.3 × 10³       1/2 · T_U      1/2 · T_U
  64       0.79 × 10⁶       3/4 · T_U      1/4 · T_U
  256      51 × 10⁶         T_U            0
  1024     3.3 × 10⁹        5/4 · T_U      N/A
  4096     0.21 × 10¹²      3/2 · T_U      N/A
  16,384   13.2 × 10¹²      7/4 · T_U      N/A

Three-FHT solution, Version A (arithmetic = 27 multipliers & 93 adders):

  N        Memory (words)   Update time    Safety margin (clock cycles)
  16       24.6 × 10³       1/4 · T_U      3/4 · T_U
  64       1.58 × 10⁶       3/8 · T_U      5/8 · T_U
  256      101 × 10⁶        1/2 · T_U      1/2 · T_U
  1024     6.5 × 10⁹        5/8 · T_U      3/8 · T_U
  4096     0.42 × 10¹²      3/4 · T_U      1/4 · T_U
  16,384   26.4 × 10¹²      7/8 · T_U      1/8 · T_U

Three-FHT solution, Version B (arithmetic = 27 multipliers & 93 adders):

  N        Memory (words)   Update time       Safety margin (clock cycles)
  16       16.4 × 10³       T_U − 3/4 · N     3/4 · N
  64       1.05 × 10⁶       T_U − 5/8 · N     5/8 · N
  256      68 × 10⁶         T_U − 1/2 · N     1/2 · N
  1024     4.3 × 10⁹        T_U − 3/8 · N     3/8 · N
  4096     0.28 × 10¹²      T_U − 1/4 · N     1/4 · N
  16,384   17.6 × 10¹²      T_U − 1/8 · N     1/8 · N


Note, however, that with the adoption and efficient usage of sufficiently large amounts of external memory – of the order of a GWord – it would be theoretically possible, from the timing constraints, for the range of validity or realizability of the two-FHT solutions to each be extended out to those data sets for which N ≤ 16,384, with the reduced memory component of the Version B solution making it look particularly attractive.

Finally, for data sets larger than those given in Tables 11.1 and 11.2, the block-based nature of the processing schemes for the proposed solutions enables continuous real-time operation to be achieved and maintained by having multiple versions of the single-FHT or multi-FHT solutions operating upon consecutive m-D input data sets, in turn, in order to produce interleaved output data sets and thus to achieve the desired throughput rate through the simple replication of silicon resources – as already discussed for the 1-D case in Sect. 6.6 of Chap. 6. Also, note that as for the 1-D and 2-D cases, the output data sets from the generalized m-D versions of the SDHT and the DFT may be simply obtained (but requiring proportionately more additions and subtractions), one from the other [2]. As a result, efficient solutions to the m-D SDHT may equally well be beneficially used for solving those DSP-based problems commonly addressed via the m-D DFT, and vice versa, particularly when the input data to the DFT is real-valued in nature.
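As for the relationship itself [2], in the 1-D case the real-data DFT outputs follow from the DHT outputs (and vice versa) through a simple even/odd decomposition. A minimal floating-point sketch of the forward direction, assuming the usual cas-kernel DHT and e^(−j2πnk/N) DFT conventions (the monograph's own conversion routine is the fixed-point counterpart of this idea):

/* Real-data DFT from DHT outputs H[0..N-1]:
   Re F[k] = (H[k] + H[N-k]) / 2,  Im F[k] = (H[N-k] - H[k]) / 2,
   with the index N-k wrapped back to 0 when k = 0. */
static void dht_to_dft (const double *H, double *Fre, double *Fim, int N)
{
    int k;
    for (k = 0; k <= N / 2; k++)
    {
        int nk = (N - k) % N;
        Fre[k] = 0.5 * (H[k] + H[nk]);
        Fim[k] = 0.5 * (H[nk] - H[k]);
    }
}

The m-D case applies the same pairing along each dimension, which is the source of the proportionately greater number of additions and subtractions noted above.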

11.7 Discussion

The research described in this chapter has been concerned with the design of new architectures and solutions for the parallel computation of the m-D DHT and shown to be equally applicable, via the relationship of their kernels, to the parallel computation of the m-D real-data DFT. The designs have exploited the benefits of the R24 FHT, a scalable solution for the resource-efficient parallel computation of the 1-D transform – as discussed in Chaps. 4, 5, 6 and 7 in some detail and already proven in silicon with a fixed-point implementation using FPGA technology – which achieves the computational density of the most advanced commercially available FFT solutions for just a fraction of the silicon resources.

In order to make efficient use of the R24 FHT, however, as a building block for solving the m-D problem, it has first been necessary to adopt a separable formulation of the DHT – the number of dimensions being limited initially to just two for ease of illustration – rather than the more familiar non-separable version as derived from direct extension of the 1-D transform. With this separable formulation, the processing was able to be carried out via the RCM whereby a fast algorithm for the 1-D DHT was first applied to the rows of the original 2-D input data set during the row-DHT stage, followed by its application to the columns of the resulting 2-D output data set during the column-DHT stage.

Through the adoption of appropriate memory partitioning, double-buffering and parallel addressing schemes – as discussed in Chaps. 6 and 10 in some detail – it was then seen how the 1-D DHTs could be efficiently carried out via the R24 FHT, enabling the resulting parallel solutions to the 2-D SDHT – one solution based upon


a single-FHT recursive architecture and two solutions based upon a two-FHT pipelined architecture – to be resource-efficient, scalable and to achieve a high computational density. The generalization of the designs of these two computing architectures to facilitate the processing of m-D data sets, for m ≥ 2, and the derivation of their associated space and time complexities and computational densities was also discussed in some detail, as well as the constraints on their achieving and maintaining real-time operation.

Note that throughout this chapter it has been assumed that the data sets may be represented by means of m-D data matrices where the common length N of each matrix dimension is a radix-4 integer. In this way, the N-point R24 FHT might then be used for carrying out each stage of processing resulting from the RCM-based formulation of the m-D SDHT. However, the same is also clearly true with m-D data matrices where there is no common length N for the matrix dimensions, provided that the length of each dimension can nevertheless be expressed as a radix-4 integer, so that the R24 FHT might still be used for carrying out each stage of processing resulting from the RCM-based formulation of the m-D SDHT.

Summarizing, resource-efficient and scalable solutions were produced, each with a continuous real-time capability, for the parallel computation of m-D versions of both the SDHT and the real-data DFT. As the performance of each solution was shown to be heavily reliant upon that of the R24 FHT, so the known benefits of the R24 FHT would be expected to carry over directly to the m-D transforms when implemented with the same parallel computing technology – such as a suitably sized FPGA. The extension of the 2-D results to the processing of m-D data means that it may now be beneficially used as a key component in the design of 2-D and 3-D systems for the continuous real-time processing of images, as well as that of conventional 1-D signals.

References

1. N. Ahmed, C.B. Johnson, Orthogonal Transforms for Digital Signal Processing (Springer, Cham, 2012)
2. T. Bortfeld, W. Dinter, Calculation of multi-dimensional Hartley transforms using one-dimensional Fourier transforms. IEEE Trans. Signal Process. 43(5), 1306–1310 (1995)
3. R.N. Bracewell, The Hartley Transform (Oxford University Press, New York, 1986)
4. M.A. Chicheva, Parallel computation of multidimensional discrete orthogonal transforms reducible to a discrete Fourier transform. Pattern Recognit. Image Anal. 21, 381 (September 2011)
5. D. Dudgeon, R. Mersereau, Multidimensional Digital Signal Processing (Prentice-Hall, Englewood Cliffs, 1983)
6. D.F. Elliott, R.K. Rao, Fast Transforms: Algorithms, Analyses, Applications (Academic Press, Orlando, 1982)
7. A. Erdi, B. Wessels, E. Yorke, Use of the fast Hartley transform for efficient 3D convolution in calculation of radiation dose, in Proceedings of the 16th International Conference of IEEE Engineering in Medicine and Biology Society, pp. 2226–2233 (2000)


8. S.K. Ghosal, J.K. Mandal, Color image authentication based on two-dimensional separable discrete Hartley transform. AMSE J. 2014 Ser. Adv. B 57(1), 68–87 (June 2014)
9. H. Hao, R.N. Bracewell, A three-dimensional DFT algorithm using the fast Hartley transform. Proc. IEEE 75, 264–265 (1987)
10. K.J. Jones, Design and parallel computation of regularised fast Hartley transform. IEE Vis. Image Signal Process. 153(1), 70–78 (2006)
11. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)
12. K.J. Jones, The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments, Series on Signals & Communication Technology (Springer, Dordrecht, 2010)
13. R. Nielson, Sonar Signal Processing (Artech House, St. Albans, 1991)
14. M. Perkins, A separable Hartley-like transform in two or more dimensions. Proc. IEEE 75(8), 1127–1129 (1987)
15. L. Pyrgas, P. Kitsos, A. Skodras, Compact FPGA architectures for the two-band fast discrete Hartley transform. Microprocess. Microsyst. 61, 117–125 (September 2018)
16. D. Salomon, Data Compression: The Complete Reference (Springer, New York, 2004)
17. V.K. Sharma, R. Agrawal, U.C. Pati, K.K. Mahapatra, 2-D separable discrete Hartley transform architecture for efficient FPGA resource, in International Conference on Computer & Communication Technology (ICCCT) (2010), pp. 236–241
18. R.S. Sunder, C. Eswaran, N. Sriraam, Medical image compression using 3-D Hartley transform. Comput. Biol. Med. 36, 958–973 (2006)
19. L. Tao, H. Kwan, J.J. Gu, Filterbank-based fast parallel algorithms for 2-D DHT-based real-valued discrete Gabor transform, in IEEE International Symposium on Circuits & Systems, Brazil (May 2011), pp. 1512–1515
20. B. Watson, A. Poirson, Separable two-dimensional discrete Hartley transform. J. Opt. Soc. Am. A 3(12), 19861201 (1986)
21. Xilinx Inc., company and product information available at company web site: www.xilinx.com

Part V: Results of Research

Chapter 12: Summary and Conclusions

12.1 Outline of Problems Addressed

The traditional approach to spectrum analysis has involved the computation of the real-data DFT, which is conventionally achieved via a real-from-complex strategy using a complex-data solution, regardless of the nature of the data. This often entails the initial conversion of real-valued data to complex-valued data via a wideband DDC process, or through the adoption of a real-from-complex strategy whereby two real-data FFTs are computed simultaneously via one full-length complex-data FFT, or where one real-data FFT is computed via one half-length complex-data FFT. Each such solution, however, involves a computational overhead when compared to the more direct approach of a real-data FFT in terms of increased memory, increased processing delay to allow for the possible acquisition/processing of pairs of data sets and additional packing/unpacking complexity. With the DDC approach, where two processing functions now have to be used instead of just one, the integrity of the information content of short-duration signals may also be compromised through the introduction of the filtering operation.

Thus, the traditional approach to the problem of computing the DFT for 1-D real-valued data has effectively been to modify the problem so as to match an existing complex-data solution – the aim of the research carried out in this monograph has instead been to seek a solution that matches the actual problem needing to be solved. The research problems addressed in this monograph have therefore been concerned primarily with the parallel computation of the DHT (due to its real-valued nature) and equivalently, via the relationship of their kernels, of the real-data DFT, for both the 1-D and the m-D cases. The solutions have been particularly targeted at an implementation with silicon-based parallel computing equipment for use in the type of resource-constrained environments that might be encountered in applications typified by that of mobile communications. Scalable solutions based upon the highly-regular design of the fixed-radix FFT have been actively sought for the real-data DFT, although with the computing power now available via the silicon-


based parallel computing technologies, such as with the FPGA and the ASIC, it is no longer adequate to view the complexity of such solutions purely in terms of arithmetic operation counts, as has conventionally been done. There is now the facility to use both multiple arithmetic units – such as fast multipliers and CORDIC phase rotators – and multiple banks of fast memory in order to enhance the performance of such algorithms via their parallel computation. As a result, a whole new set of constraints and metrics has arisen relating to the design of efficient FFT algorithms for silicon-based implementation.

With the environment encountered in mobile communications, for example, where a small battery may be the only source of power supply for long periods of time, algorithms are now being designed subject to new and often conflicting performance criteria, where the ideal is either to maximize the throughput (i.e. to minimize the update time) or to satisfy some constraint on the latency, whilst at the same time minimizing the required silicon resources (and thereby minimizing the cost of implementation) as well as keeping the power consumption to within the available power budget.

The DHT, which is a real-data transform that is both bilateral and orthogonal and a close relative to the DFT, possessing many of the same properties, was identified as an attractive algorithm for attacking the problem of computing the real-data DFT, as the outputs from a real-data DFT may be straightforwardly obtained from the outputs of the DHT, and vice versa. As a result, the DHT may also be used for solving those DSP-based problems commonly addressed via the DFT, whilst fast recursive algorithms required for its solution – referred to generically as the FHT – are now commonly encountered in the technical literature. A drawback of conventional FHTs, however, lies in the lack of regularity arising from the need for two sizes of butterfly – and thus for two separate butterfly designs – for fixed-radix formulations where, for a radix ‘R’ algorithm, a single-sized butterfly produces R outputs from R inputs whilst a double-sized butterfly produces 2R outputs from 2R inputs.
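As a concrete illustration of the packing/unpacking overhead incurred by one of the real-from-complex strategies referred to above, the standard ‘two real-data FFTs via one complex-data FFT’ unpacking may be sketched as follows, with cfft() standing for an assumed externally supplied in-place complex-data FFT (the name is hypothetical and not one of the monograph's routines):

#include <complex.h>

extern void cfft (double complex *z, int N);   /* assumed complex-data FFT */

/* Two length-N real-data DFTs, X of x[] and Y of y[], from one length-N
   complex-data DFT of the packed sequence z[n] = x[n] + j*y[n], using
   the conjugate symmetry of real-data spectra. */
void two_real_ffts (const double *x, const double *y, double complex *z,
                    double complex *X, double complex *Y, int N)
{
    int k;
    for (k = 0; k < N; k++) z[k] = x[k] + I * y[k];
    cfft (z, N);
    for (k = 0; k < N; k++)
    {
        double complex Zc = conj (z[(N - k) % N]);
        X[k] = 0.5 * (z[k] + Zc);          /* X[k] = (Z[k] + Z*[N-k]) / 2  */
        Y[k] = -0.5 * I * (z[k] - Zc);     /* Y[k] = (Z[k] - Z*[N-k]) / 2j */
    }
}

The packing loop, the unpacking loop and the need to acquire two data sets before a single transform can be launched are precisely the memory, delay and complexity overheads discussed above.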

12.2 Summary of Results

To address the above situation, a generic version of the double-sized butterfly, referred to as the generic double butterfly and abbreviated to GD-BFLY, was developed for the radix-4 factorization of the FHT which overcame the 1-D problem in an elegant fashion. The resulting single-design solution, referred to as the regularized FHT and abbreviated to R24 FHT, showed itself capable of an efficient implementation with the silicon-based parallel computing technologies.

An architecture was identified and developed that exploited partitioned memory for the parallel computation of the GD-BFLY and of the resulting R24 FHT, whereby both the data and the trigonometric coefficients were each partitioned or distributed across multiple equally-sized banks of fast memory. The approach exploited a single locally-pipelined PE that yielded an attractive high-performance solution that was resource-efficient, scalable (which, for the 1-D case, is understood to be in relation to


the transform length) and device-independent (so that it's not dependent upon the specific characteristics of any particular device, being able to exploit whatever resources happen to be available on the target device). High performance was achieved by having the PE able to process, in a parallel fashion and without conflict, the input/output data sets to the GD-BFLY – these eight-sample data sets being referred to as woctads – which, in turn, facilitated SIMD processing for the simultaneous execution of the multiple arithmetic operations to be performed within each stage of the computational pipeline. These features, when combined, enabled the GD-BFLY to produce output woctads at the rate of one per clock cycle, so that an O(N·log₄N) time-complexity (denoting the latency or, equivalently in this case, the update time) was achieved for the N-point R24 FHT which yielded an approximate figure of (N/8)·log₄N clock cycles after taking into account the eightfold parallelism as introduced via the adoption of the partitioned data memory.

Several versions of the PE were described, using either fast fixed-point multipliers or CORDIC phase rotators, with each conforming to the same basic design and each compatible with the chosen single-PE computing architecture. This enabled the space-complexity to be optimized by having the arithmetic and memory components traded off against the addressing complexity according to the resources available on the target computing device. The result was a set of FHT designs based upon the single-PE recursive architecture, where each design used partitioned memory for the storage of both the data and the trigonometric coefficients to yield a resource-efficient solution with universal application, such that each new application would involve minimal re-design effort and costs. The resulting solutions were shown to be amenable to efficient implementation with the silicon-based computing technologies and capable of achieving the computational density – that is, the throughput per unit area of silicon – of the most advanced commercially available complex-data FFT solutions for potentially just a fraction of the silicon resources [1–3].

The resource efficiency makes the single-PE design particularly attractive for those applications where the size of the real-data transform would otherwise result in a significant space-complexity when addressed with a conventional multi-PE solution. This is particularly true with the memory component, as double-buffered memories (each of which is commensurate in size with that of the transform) would typically be required to connect the consecutive stages of the computational pipeline. Also, the block-based nature of the operation of the single-PE designs means that they are also able, via the block floating-point scaling strategy, to produce higher accuracy transform-domain outputs when using fixed-point arithmetic than is achievable by their streaming FFT counterparts based upon the multi-PE design.

The resulting R24 FHT was next applied to the derivation of two new radix-2 real-data FFT algorithms, where the transform length was now a power of two (a radix-2 integer), but not a power of four (a radix-4 integer). This showed how the R24 FHT could be applied, potentially, to a great many more problems than originally envisioned, with the scalability of the R24 FHT design carrying over to those of the two radix-2 FFTs and with the radix-2 solution based upon the double-resolution


approach (i.e. using a half-length transform) looking particularly attractive in terms of both arithmetic complexity, for fully-sequential operation, and computational density, for fully-parallel operation.

This was followed by the application of the R24 FHT to the computation of some of the more familiar and computationally intensive DSP-based functions, such as those of correlation (both auto-correlation and cross-correlation) and of the wideband channelization of RF data via the polyphase DFT filter bank [4]. With each such function – which might typically be encountered in that increasingly important area of wireless communications relating to the geolocation of signal emitters – the adoption of the R24 FHT offers the possibility of achieving simplified, low-complexity solutions. A more recent application involving a novel transform-space scheme for enhancing the performance of multi-carrier communications in the presence of IMD – a seemingly intractable problem – was also briefly discussed [5].

Next, it was seen how the R24 FHT could be exploited in producing attractive new solutions for the parallel computation of the 2-D DHT – the number of dimensions being limited initially to just two for ease of illustration – which was shown to be equally applicable, via the relationship of their kernels, to that of the 2-D real-data DFT. The 2-D designs, where the common length N of each dimension of the transform was taken to be a radix-4 integer for compatibility with the adoption of the N-point R24 FHT, were based upon the adoption of (1) a separable formulation of the DHT, referred to as the SDHT, so that the familiar RCM could be applied and (2) memory partitioning, double-buffering and parallel addressing schemes that were consistent with those used by the R24 FHT. Combining these features, the R24 FHT was able to be used as a building block for the processing of both the row-DHT and the column-DHT stages of the 2-D formulation of the SDHT. These 2-D solutions, which were also shown to yield attractive low-complexity solutions for the filtering of 2-D real-valued data sets, were thus able to achieve a high computational throughput through the exploitation of multiple levels of parallel processing via a parallel-pipelined approach: (1) ‘coarse-grained’ pipelining at the FHT level for the global operation of the algorithm for when two FHT-based stages were used; (2) ‘fine-grained’ pipelining at the arithmetic level for the internal operation of each PE; and (3) SIMD processing for the simultaneous execution of the multiple arithmetic operations to be performed within each stage of the fine-grained computational pipeline.

Finally, it was seen how the 2-D designs could be generalized in a straightforward fashion to deal with m-D versions of both the SDHT and the real-data DFT, for m ≥ 2, where again the common length N of each dimension of the transform was taken to be a radix-4 integer for compatibility with the adoption of the N-point R24 FHT. The resulting parallel solutions, like those for the 2-D case, were resource-efficient, scalable (which, for the m-D case where ‘m’ is fixed, is understood to be in relation to the common length of each dimension of the transform) and device-independent. The solutions were able to achieve a high computational density as the time-complexity (denoting the update time) was traded off against the space-complexity, with an O(Nᵐ·log₄N) time-complexity being achieved which yielded (for


the more interesting multi-FHT case) an approximate figure of (Nᵐ/8)·log₄N clock cycles after taking into account the eightfold parallelism as introduced via the adoption of partitioned data memory, this achieved at the expense of an increased space-complexity possessing an O(Nᵐ) memory component.
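The row-column decomposition that underpins both the 2-D and the m-D solutions can be summarized in a few lines of code. A floating-point sketch of the 2-D case is given below, with fht_1d() standing in for the N-point R24 FHT – the name and the in-place convention are assumptions for illustration, and the silicon solutions additionally interleave the two stages via the partitioned-memory and double-buffering schemes described above:

extern void fht_1d (double *x, int N);   /* assumed in-place 1-D FHT */

/* 2-D SDHT via the row-column method (RCM): apply the 1-D transform to
   each row of the N x N array (row-DHT stage), then to each column of
   the intermediate result (column-DHT stage). X is stored row-major and
   col[] is scratch storage of length N. */
void sdht_2d_rcm (double *X, double *col, int N)
{
    int r, c;
    for (r = 0; r < N; r++)                  /* row-DHT stage    */
        fht_1d (&X[r * N], N);
    for (c = 0; c < N; c++)                  /* column-DHT stage */
    {
        for (r = 0; r < N; r++) col[r] = X[r * N + c];
        fht_1d (col, N);
        for (r = 0; r < N; r++) X[r * N + c] = col[r];
    }
}

For the m-D case the same 1-D routine is simply applied, in turn, along each of the m dimensions, giving the m·Nᵐ⁻¹ transform calls from which the O(Nᵐ·log₄N) time-complexity above follows.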

12.3 Conclusions

The primary aims of the research have been shown to have been achieved, as discussed above, with the R24 FHT being successfully exploited in producing resource-efficient, scalable and device-independent solutions for the parallel computation of both the DHT and the real-data DFT, for both the 1-D and the m-D cases, rather than just the 1-D case as originally addressed, thereby enabling the R24 FHT to be used for all those signal and image processing problems commonly addressed via a complex-data FFT.

Given the reliance of the single-PE recursive architecture of the R24 FHT upon the use of partitioned memory, it has been necessary that a new parallel data reordering scheme be developed for use in computing both the 1-D DHT and the m-D SDHT, which enables NAT-ordered data stored within one partitioned memory to be efficiently transferred, in a reordered form (in this case according to the DBR mapping, but equally applicable to other digit-reversal mappings), to a second partitioned memory. This was shown to be achievable as a single combined operation (namely, one able to carry out simultaneously both the data reordering and the transfer of the reordered data from one partitioned memory to another) with approximately 16-fold parallelism, exploiting in an optimal fashion the dual-port nature of the memory.

The solutions described within the monograph yield clear implementational attractions/advantages, both theoretical and practical, when compared to the more conventional complex-data solutions based upon the adoption of the complex-data FFT. The highly parallel formulation of the 1-D version of the FHT (i.e. the R24 FHT) and the real-data FFT described in the monograph has been shown to lead to scalable and device-independent solutions to both the 1-D and the m-D problems which are able to maximize the achievable computational density and thus to optimize the use of the available silicon resources. The adoption of the R24 FHT as a building block to be used for the production of such solutions to the 2-D and 3-D problems, in particular, means that it may now be beneficially used as a key component in the design of systems for the real-time processing of 2-D and 3-D images, respectively, as well as that of conventional 1-D signals.

Note that a small number of journal publications, as produced by the author, have been listed at the end of this chapter, as these have produced the ideas upon which the current monograph has been based. The first three publications deal with the development of the R24 FHT for use in solving the problems of the DHT and the real-data DFT for the case of 1-D data. The fourth and fifth publications show how the


R24 FHT may be applied to a couple of communication problems: the first involving the channelization of a real-valued RF signal via the polyphase DFT and the second involving a transform-space technique for achieving distortion-free multi-carrier communications. Finally, note that the mathematical/logical correctness of the operation of the various functions used by the R24 FHT has been proven in software with a computer programme written in the ‘C’ programming language. This code provides the user with various choices of PE design and storage/retrieval scheme for the trigonometric coefficients, helping the user to identify how the algorithm might best be mapped onto the available parallel computing equipment following translation of the sequential ‘C’ code to the parallel code produced by a suitably chosen HDL. A second computer programme, written this time using the MATLAB computing environment, has been provided for proving the mathematical/logical correctness of operation of the parallel data reordering scheme as required for the reordering of NAT-ordered data according to the DBR mapping and of its simultaneous transfer from one partitioned memory to another.
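For reference, the DBR mapping itself is just the reversal of the radix-4 digits (dibits) of the sample index. A minimal sequential sketch is given below, whereas the scheme described above performs the equivalent reordering in parallel, with approximately 16-fold parallelism, during the memory-to-memory transfer:

/* Dibit-reversal (DBR) of an index n for an N-point transform, where
   N = 4^alpha: reverse the order of the alpha radix-4 digits of n.
   Example: for N = 64 (alpha = 3), n = 27 = (1,2,3) in dibits maps
   to (3,2,1) = 57. */
static int dbr_index (int n, int alpha)
{
    int k, r = 0;
    for (k = 0; k < alpha; k++)
    {
        r = (r << 2) | (n & 3);   /* append the least-significant dibit */
        n >>= 2;
    }
    return r;
}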

References

1. K.J. Jones, Design and parallel computation of regularised fast Hartley transform. IEE Proc. Vis. Image Signal Process. 153(1), 70–78 (February 2006)
2. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (September 2007)
3. K.J. Jones, The Regularized Fast Hartley Transform: Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments, Series on Signals & Communication Technology (Springer, 2010)
4. K.J. Jones, Resource-efficient and scalable solution to problem of real-data polyphase discrete Fourier transform channelisation with rational over-sampling factor. IET Signal Process. 7(4), 296–305 (June 2013)
5. K.J. Jones, Design of low-complexity scheme for maintaining distortion-free multi-carrier communications. IET Signal Process. 8(5), 495–506 (July 2014)

Appendix A: Computer Programme for Regularized Fast Hartley Transform

A.1 Introduction

The processing functions required for a fixed-point implementation of the R24 FHT break down into two quite distinct categories, namely, those pre-processing functions that need to be carried out in advance of the real-time processing for performing the following tasks:

– setting up of LUTs for trigonometric coefficients
– setting up of permutation mappings for GD-BFLY

and those processing functions that need to be carried out as part of the real-time solution:

– data reordering via DBR mapping
– PDM reads and writes
– PCM reads and trigonometric coefficient generation
– GD-BFLY computation
– FHT-to-FFT conversion

The individual modules – written in the ‘C’ programming language with the Microsoft Visual C++ compiler under their Visual Studio computing environment – that have been developed to implement these particular pre-processing and processing functions are now outlined, where integer-only arithmetic has been used to model the fixed-point nature of the associated arithmetic operations. This is followed by a brief guide on how to run the programme and of the scaling strategies available to the user.

Please note, however, that the programme has not been exhaustively tested, so it is quite conceivable that various bugs may still be present in the current version of the code. The notification of any such bugs, if identified, would be greatly welcomed by the author. The computer code for the complete solution is listed in Appendix B of this monograph.


A.2 Description of Functions

Before the R24 FHT can be executed, it is first necessary that a main module or programme be produced, ‘RFHT4_Computer_Program.c’, which carries out all the preprocessing functions, as required for providing the necessary inputs to the R24 FHT, as well as setting up the input data to the R24 FHT through the calling of a separate module, ‘SignalGeneration.c’, such that the data – real valued or complex valued – may be either accessed from an existing binary or text file or generated by the signal generation module.

A.2.1 Control Routine

Once all the pre-processing functions have been carried out and the input data made ready for feeding to the R24 FHT, a control module, ‘RFHT4_Control.c’, called from within the main programme then carries out in the required order all the processing functions that make up the real-time solution, starting with the application of the DBR mapping to the input data set, followed by the execution of the R24 FHT, and finishing with the conversion of the output data, should it be required, from Hartley-space to Fourier-space.

A.2.2 Generic Double Butterfly Routines

Three versions of the GD-BFLY have been produced, as discussed in Chaps. 6 and 7, with the first version, involving twelve fast fixed-point multipliers, being carried out by means of the module, ‘Butterfly_V12M.c’, the second version, involving nine fast fixed-point multipliers, by means of the module, ‘Butterfly_V09M.c’, and the third version, involving three CORDIC arithmetic units, by means of the module, ‘Butterfly_Cordic.c’. The last version makes use of a separate module, ‘Rotation.c’, for carrying out the individual phase rotations.

A.2.3 Address Generation and Data Reordering Routines

The generation of the four permutation mappings used by the GD-BFLY, as discussed in Chap. 4, is carried out by means of the module, ‘ButterflyMappings. c’, whilst the application of the DBR mapping to the input data set is carried out with the module, ‘DibitReversal.c’, and the addresses of the woctads required for input to the GD-BFLY are obtained by means of the module, ‘DataIndices.c’. For optimal


efficiency, the four permutation mappings used by the GD-BFLY only store information relating to the non-trivial exchanges.

A.2.4 Data Memory Retrieval and Updating Routine

The reading/writing of woctads from/to the PDM, which requires the application of the memory address mappings discussed in Chap. 6, is carried out by means of the module, ‘MemoryBankAddress.c’, which, given the address of a single DBR-reordered sample, produces both the memory bank address and the address offset within that particular memory bank. The code provided is at present generic, in the sense that it is valid for any length of transform, although for optimal efficiency the code should be tailored to the particular transform length under consideration – note that code written for one particular transform length will also be valid for every transform length shorter than it, albeit somewhat wasteful in terms of unnecessary arithmetic/logic.
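By way of illustration of the interface only – the mapping actually used by the module is the conflict-free one derived in Chap. 6 – the shape of the operation is simply that of splitting a sample address into a bank number and a word offset, as in the deliberately naive sketch below; the real mapping permutes the addresses so that the eight samples of every woctad always land in eight different banks:

/* NAIVE interface sketch only - NOT the conflict-free mapping of Chap. 6.
   Splits a DBR-reordered sample address into one of 8 bank numbers and
   a word offset (0 .. N/8 - 1) within that bank. */
static void bank_and_offset (int address, int *bank, int *offset)
{
    *bank   = address & 7;
    *offset = address >> 3;
}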

A.2.5 Trigonometric Coefficient Generation Routines

The trigonometric coefficient sets accessed from the PCM which are required for the execution of the GD-BFLY are dependent upon the particular version of the GD-BFLY used, namely, whether it involves twelve or nine fast fixed-point multipliers, as well as the type of addressing scheme used, namely, whether the storage of the trigonometric coefficients is based upon the adoption of one-level or two-level LUTs. The four combinations are handled as follows:

– twelve-multiplier GD-BFLY with three one-level LUTs: coefficients generated via the module, ‘Coefficients_V12M_1Level.c’;
– nine-multiplier GD-BFLY with three one-level LUTs: coefficients generated via the module, ‘Coefficients_V09M_1Level.c’;
– twelve-multiplier GD-BFLY with three two-level LUTs: coefficients generated via the module, ‘Coefficients_V12M_2Level.c’;
– nine-multiplier GD-BFLY with three two-level LUTs: coefficients generated via the module, ‘Coefficients_V09M_2Level.c’.

All four versions produce sets of either nine or twelve trigonometric coefficients, as specified, which are required for the execution of the GD-BFLY.
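The saving offered by the two-level scheme rests on the angle-addition identity sin(c + f) = sin(c)·cos(f) + cos(c)·sin(f), which allows a single-angle coefficient to be rebuilt from one coarse LUT and two fine LUTs, each of length sqrt(N)/2, rather than from one LUT of length N/4. A floating-point sketch of the retrieval is given below – the routines named above are its fixed-point counterparts and also produce the double- and triple-angle terms:

/* Rebuild sin(k * 2*pi/N) for 0 <= k < N/4 from the two-level LUTs:
   write k = q * RootNd2 + r, where RootNd2 = sqrt(N)/2, so that the
   angle splits into a coarse part (q * pi/sqrt(N)) and a fine part
   (r * 2*pi/N); cos of the coarse part is read from the coarse sin LUT
   at the complementary index, its final entry holding sin(pi/2) = 1. */
static double sin_two_level (int k, int RootNd2,
                             const double *sin_coarse,   /* length RootNd2 + 1 */
                             const double *cos_fine,     /* length RootNd2     */
                             const double *sin_fine)     /* length RootNd2     */
{
    int q = k / RootNd2, r = k % RootNd2;
    return sin_coarse[q] * cos_fine[r] + sin_coarse[RootNd2 - q] * sin_fine[r];
}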


A.2.6 Look-Up Table Generation Routines

The generation of the LUTs required for the storage of the trigonometric coefficients is carried out by means of the module, ‘Look_Up_Table_1Level.c’, for the case of the one-level LUTs, or the module, ‘Look_Up_Table_2Level.c’, for the case of the two-level LUTs.

A.2.7 FHT-to-FFT Conversion Routine

Upon completion of the R24 FHT, the outputs may be converted from Hartley-space to Fourier-space, if required, this being carried out by means of the module, ‘Conversion.c’. The routine is able to operate with FHT outputs obtained from the processing of either real-data inputs or complex-data inputs.

A.3 Brief Guide to Running the Programme

The parameters that define the operation of the R24 FHT are listed as constants at the top of the main programme, ‘RFHT4_Computer_Program.c’, these constants enabling the various versions of the GD-BFLY to be selected, as required, as well as the transform length, word lengths (for both data and trigonometric coefficients), input/output data formats, scaling strategy, etc., to be set up by the user at run time. The complete list of parameters is reproduced here, as shown in Fig. A.1, this including a typical set of parameter values and an accompanying description of each parameter.

The input data set used for testing the various double butterfly and memory addressing combinations may be read either from a binary or text file (real data or complex data), with the appropriate file name being as specified in the signal generation routine, ‘SignalGeneration.c’, or mathematically generated to model a signal in the form of a single tone (real-data or complex-data versions) where the address of the excited FHT/FFT bin is as specified on the last line of Fig. A.1. For a real-valued input data set, the programme is able to produce transform outputs in either Hartley-space or Fourier-space, whilst when the input data set is complex-valued, the programme will automatically produce the outputs in Fourier-space.

Note that when writing the outputs of an N-point FHT to file, the programme stores one sample to a line; when writing the outputs of an N-point real-data FFT to file, it stores the zero-frequency term on the first line followed by the positive-frequency terms on the next N/2–1 lines, with the real and imaginary components of each term appearing on the same line; and finally, when writing the outputs of an N-point complex-data FFT to file, it stores the zero-frequency term on the first line followed by the positive and then negative-frequency terms on the next N–1 lines, with the real and imaginary components of each term appearing on the same line – although the Nyquist-frequency term, like the zero-frequency term, possesses only a real component.

//  SYSTEM PARAMETERS:
#define FHT_length          1024   //  transform length: must be a power of 4
#define data_type           1      //  1 => real-valued data, 2 => complex-valued data
#define FHT_FFT_flag        1      //  1 => FHT outputs, 2 => FFT outputs
#define BFLY_type           3      //  Bfly type: 1 => 12 mplys, 2 => 9 mplys, 3 => 3 Cordics
#define MEM_type            1      //  Memory type: 1 => one-level LUT, 2 => two-level LUT
#define scaling             2      //  1 => FIXED, 2 => BFP

//  REGISTER-LENGTH PARAMETERS:
#define no_of_bits_data     18     //  no of bits representing input data
#define no_of_bits_coeffs   24     //  no of bits representing trigonometric coefficients

//  CORDIC BUTTERFLY PARAMETERS:
#define no_of_iterations    18     //  no of Cordic iterations = output accuracy (bits)
#define no_of_bits_angle    27     //  no of bits representing Cordic rotation angle
#define LSB_guard_bits      5      //  no of guard bits for LSB: ~ log2(no_of_iterations)

//  FILE PARAMETERS:
#define input_file_format   2      //  1 => HEX, 2 => DEC
#define output_file_format  2      //  1 => HEX, 2 => DEC

//  FIXED SCALING PARAMETERS - ONE FACTOR PER FHT STAGE:
#define scale_factor_0      2      //  bits to shift for stage = 0
#define scale_factor_1      2      //  bits to shift for stage = 1
#define scale_factor_2      2      //  bits to shift for stage = 2
#define scale_factor_3      2      //  bits to shift for stage = 3
#define scale_factor_4      2      //  bits to shift for stage = 4 - last stage for 1K FHT
#define scale_factor_5      2      //  bits to shift for stage = 5 - last stage for 4K FHT
#define scale_factor_6      1      //  bits to shift for stage = 6 - last stage for 16K FHT
#define scale_factor_7      0      //  bits to shift for stage = 7 - last stage for 64K FHT

//  SYNTHETIC DATA PARAMETERS:
#define data_input          1      //  0 => read data from file, 1 => generate data
#define dft_bin_excited     117    //  tone excited: between 0 and FHT_length/2-1

Fig. A.1 Typical parameter set for regularized FHT programme

Bear in mind that for the case of the real-data FFT, the magnitude of a zero-frequency tone (or Nyquist-frequency tone, if computed), if measured in the frequency-domain, will be twice that of a comparable positive-frequency tone (i.e. having the same signal amplitude), which shares its energy equally with its negative-frequency counterpart.

A.4 Available Scaling Strategies

With regard to the fixed-point scaling strategies, note that when the scaling of the intermediate results is carried out via the conditional block floating-point technique, it is applied at the input to each stage of GD-BFLYs. As a result, any possible magnification incurred during the last stage of GD-BFLYs is not scaled out of the results, so that up to three bits of growth will still need to be accounted for in the R24 FHT outputs according to the particular post-FHT processing requirements. Examples of block floating-point scaling for both the twelve-multiplier and the nine-multiplier versions of the GD-BFLY are given in Figs. A.2 and A.3, respectively, each geared to the use of an 18-bit fast multiplier – the scaling for the CORDIC version is essentially the same as that for the twelve-multiplier version. The programme provides the user with specific information relating to the chosen parameter set, printing to the screen the amount of scaling required, if any, for each stage of GD-BFLYs required by the transform.

For the case of the unconditional fixed scaling technique – the individual scale factors to be applied for each stage of GD-BFLYs are as specified by the set of constants given in Fig. A.1 – a small segment of code has been included within the generic double-butterfly routines which prints to the screen an error message whenever the register for holding the input data to either the fast multiplier or the CORDIC arithmetic unit overflows. For the accurate simulation of a given hardware device, this segment of code needs to be replaced by a routine that mimics the actual behaviour of the device in response to such an overflow – such a response being dependent upon the particular device used.

[Fig. A.2 Block floating-point scaling for use with twelve-multiplier and CORDIC versions of generic double butterfly – input data: 18 bits + zero growth; output data: 18 bits + growth, where growth ∈ {0, 1, 2, 3}; register details: PE internal = 21 (min) & 24 (max) bits, PE external = 21 bits]


[Fig. A.3 Block floating-point scaling for use with nine-multiplier version of generic double butterfly – input data: 17 bits + zero growth; output data: 17 bits + growth (theory => 23 bits maximum), where growth ∈ {0, 1, 2, 3}; register details: PE internal = 20 (min) & 23 (max) bits, PE external = 20 bits]

When the nine-multiplier version of the GD-BFLY is adopted, the presence of the stage of adders prior to that of the fast fixed-point multipliers is far more likely to result in an overflow unless additional scaling is applied immediately after this stage of adders has been completed, as is performed by the computer programme, or alternatively, unless the data word length into the GD-BFLY is constrained to be one bit shorter than that for the twelve-multiplier version. Clearly, in order to prevent fixed-point overflow, the settings for the individual scale factors will need to take into account both the transform length and the particular version of the GD-BFLY chosen, with experience invariably dictating when an optimum selection of scale factors has been made. Bear in mind, however, that with the CORDIC-based version of the GD-BFLY, there is an associated magnification of the data magnitudes by approximately 1.647 with each temporal stage of GD-BFLYs which needs to be accounted for by the scale factors.

Finally, note that when the CORDIC-based GD-BFLY is selected, regardless of the scaling strategy adopted, the programme will also print to the screen exactly how many non-trivial shifts/additions are required for carrying out the two fixed-coefficient multipliers for the chosen parameter set. For the case of an 18-stage CORDIC arithmetic unit, for example, a total of nine such non-trivial shifts/additions are required.
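In outline, the conditional block floating-point mechanism amounts to measuring the worst-case bit growth across the block at the input to each stage of GD-BFLYs and shifting it out before the next stage. The integer sketch below, with hypothetical names not taken from the programme, illustrates the idea for a growth of up to three bits per radix-4 stage:

/* Measure the block growth: the number of bits (capped at 3 per stage)
   by which the largest magnitude in the block exceeds the nominal
   fixed-point word length. */
static int block_growth (const int *x, int n, int nominal_bits)
{
    int i, growth = 0, limit = 1 << (nominal_bits - 1);
    for (i = 0; i < n; i++)
    {
        int mag = (x[i] < 0) ? -x[i] : x[i];
        while (growth < 3 && mag >= (limit << growth)) growth++;
    }
    return growth;
}

/* Scale the measured growth out of the block before the next stage. */
static void scale_block (int *x, int n, int growth)
{
    int i;
    for (i = 0; i < n; i++) x[i] >>= growth;
}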

Appendix B: Source Code for Regularized Fast Hartley Transform

B.1 Listings for Main Programme and Signal Generation Routine

Functions listed in this section of Appendix B:
– Main programme
– SignalGeneration

#include "stdafx.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>

//  DEFINE PARAMETERS -------------------------------------------
//
//  SYSTEM PARAMETERS:
#define FHT_length          1024   //  transform length: must be a power of 4
#define data_type           1      //  1 => real-valued data, 2 => complex-valued data
#define FHT_FFT_flag        1      //  1 => FHT outputs, 2 => FFT outputs
#define BFLY_type           3      //  Bfly type: 1 => 12 mplys, 2 => 9 mplys, 3 => 3 Cordics
#define MEM_type            1      //  Memory type: 1 => one-level LUT, 2 => two-level LUT
#define scaling             2      //  1 => FIXED, 2 => BFP
//
//  REGISTER-LENGTH PARAMETERS:
#define no_of_bits_data     18     //  no of bits representing input data
#define no_of_bits_coeffs   24     //  no of bits representing trigonometric coefficients
//
//  CORDIC BUTTERFLY PARAMETERS:
#define no_of_iterations    18     //  no of Cordic iterations = output accuracy (bits)
#define no_of_bits_angle    27     //  no of bits representing Cordic rotation angle
#define LSB_guard_bits      5      //  no of guard bits for LSB: ~ log2(no_of_iterations)
//
//  FILE PARAMETERS:
#define input_file_format   2      //  1 => HEX, 2 => DEC
#define output_file_format  2      //  1 => HEX, 2 => DEC
//
//  FIXED SCALING PARAMETERS - ONE FACTOR PER FHT STAGE:
#define scale_factor_0      2      //  bits to shift for stage = 0
#define scale_factor_1      2      //  bits to shift for stage = 1
#define scale_factor_2      2      //  bits to shift for stage = 2
#define scale_factor_3      2      //  bits to shift for stage = 3
#define scale_factor_4      2      //  bits to shift for stage = 4 - last stage for 1K FHT
#define scale_factor_5      2      //  bits to shift for stage = 5 - last stage for 4K FHT
#define scale_factor_6      2      //  bits to shift for stage = 6 - last stage for 16K FHT
#define scale_factor_7      2      //  bits to shift for stage = 7 - last stage for 64K FHT
//
//  SYNTHETIC DATA PARAMETERS:
#define data_input          1      //  0 => read data from file, 1 => generate data
#define dft_bin_excited     256    //  tone excited: between 0 and FHT_length/2-1

void main ()
{
//  REGULARIZED FAST HARTLEY TRANSFORM ALGORITHM
//  --------------------------------------------
//  Author: Dr. Keith John Jones, June 14th 2009
//
//  FIXED-POINT FHT IMPLEMENTATION FOR FPGA
//    - DATA & COEFFICIENTS QUANTIZED
//  UTILIZES ONE DOUBLE-SIZED BUTTERFLY
//    - TYPE = 12 fast multipliers & 22 adders -or-
//              9 fast multipliers & 25 adders -or-
//              3 Cordic arithmetic units & 2 fixed multipliers & 16 adders
//  UTILIZES EIGHT DATA MEMORY BANKS
//    - SIZE = N / 8 words per bank
//  UTILIZES THREE COEFFICIENT MEMORY BANKS
//    - SIZE = N / 4 words or sqrt(N) / 2 words or zero words per bank
//
//  Description:
//  ------------
//  This program carries out the FHT using a generic radix-4 double-sized butterfly. The solution
//  performs 8 simultaneous reads/writes using 8 memory banks, each of length N/8 words. Three LUTs,
//  each of length N/4 words or sqrt(N)/2 words, may also be used for holding the trigonometric
//  coefficients, enabling all six coefficients to be accessed simultaneously - these LUTs are not
//  required, however, when the arithmetic is performed with the Cordic unit. Three types of
//  double-sized butterfly are available for use by the FHT: one involves the use of 12 fast
//  fixed-point multipliers and 22 adders, another involves the use of 9 fast fixed-point multipliers
//  and 25 adders, whilst a third involves the use of 3 Cordic arithmetic units, 2 fixed multipliers
//  and 16 adders. Two coefficient memory addressing schemes are also available for use by the FHT:
//  one involves the use of 3 LUTs, each of length N/4 words, whilst another involves the use of
//  3 LUTs, each of length sqrt(N)/2 words. The following combinations of arithmetic and memory are
//  thus possible:
//  1) for a 12-multiplier double-sized butterfly & N/4 word LUTs, the coefficient generation
//     involves no arithmetic operations;
//  2) for a 12-multiplier double-sized butterfly & sqrt(N)/2 word LUTs, the coefficient generation
//     involves 7 multiplications and 8 additions;
//  3) for a 9-multiplier double-sized butterfly & N/4 word LUTs, the coefficient generation involves
//     just additions;
//  4) for a 9-multiplier double-sized butterfly & sqrt(N)/2 word LUTs, the coefficient generation
//     involves 7 multiplications and 14 additions; whilst
//  5) for a Cordic double-sized butterfly, the coefficients are efficiently generated on-the-fly.
//  Scaling may be carried out within the regularized FHT to prevent overflow in the data registers -
//  this may be carried out with either fixed scaling coefficients after each temporal stage, or by
//  means of a block floating-point scheme in order to optimize the dynamic range out of the FHT.
//  The program may produce either FHT or FFT output, where the input data may be either real valued
//  or complex valued. For the case of complex-valued data, the FHT is simply applied to the real and
//  imaginary components of the data separately before being appropriately combined via the
//  FHT-to-FFT conversion routine. The inputs/outputs may be read/written from/to file with either
//  decimal or hexadecimal formats.
//
//  Files Used:
//  -----------
//  For input/output data memory:
//    input_data_read.txt      - input file from which data is read.
//    output_data_fht_fft.txt  - FHT/FFT output data file.
//  For one-level trigonometric coefficient memory:
//    LUT_A1.txt  - LUT for single-angle argument.
//    LUT_A2.txt  - LUT for double-angle argument.
//    LUT_A3.txt  - LUT for triple-angle argument.
//  For two-level trigonometric coefficient memory:
//    LUT_Sin_Coarse.txt  - coarse resolution sin LUT for single-angle argument.
//    LUT_Sin_Fine.txt    - fine resolution sin LUT for single-angle argument.
//    LUT_Cos_Fine.txt    - fine resolution cos LUT for single-angle argument.
//
//  Functions Used:
//  ---------------
//    FHT_Computer_Program      - main program.
//    SignalGeneration          - signal generation routine.
//    RFHT4_Control             - regularized FHT control routine.
//    LookUpTable_1Level        - one-level LUT generation routine.
//    LookUpTable_2Level        - two-level LUT generation routine.
//    ButterflyMappings         - address permutation generation routine.
//    DibitReversal             - sequential di-bit reversal routine & 1-D to 2-D conversion.
//    Butterfly_V12M            - double butterfly calculation routine: 12-multiply version.
//    Butterfly_V09M            - double butterfly calculation routine: 9-multiply version.
//    Butterfly_Cordic          - double butterfly calculation routine: Cordic version.
//    Coefficients_V12M_1Level  - one-level coefficient generation: 12-multiply version.
//    Coefficients_V09M_1Level  - one-level coefficient generation: 9-multiply version.
//    Coefficients_V12M_2Level  - two-level coefficient generation: 12-multiply version.
//    Coefficients_V09M_2Level  - two-level coefficient generation: 9-multiply version.
//    DataIndices               - data address generation routine.
//    Conversion                - DHT-to-DFT conversion routine.
//    MemoryBankAddress         - memory bank address/offset calculation routine.
//    Rotation                  - Cordic phase rotation routine.
//
//  Externs:
//  --------
    void RFHT4_Control (int**, int*, int*, int*, int*, int*, int*, int*, int*, int*, int*, int*, int, int, int, int, int, int, int, int, int*, int*, int, int*, int*, int*, int, int, int, int, int*, int, int, int, int, int, int, int);
    void SignalGeneration (int*, int*, int, int, int, int, int, int);
    void LookUpTable_1Level (int, int, int*, int*, int*, int);
    void LookUpTable_2Level (int, int, int*, int*, int*, int);
    void ButterflyMappings (int*, int*, int*, int*);
    void DibitReversal (int, int, int*, int, int*, int**);
    void Conversion (int, int, int, int*, int*);
    void MemoryBankAddress (int, int, int, int, int*, int*);
//
//  Declarations:
//  -------------
//  Integers:
    int wordsize, m, M, n, n1, n2, N, N2, N4, N8, no_of_bits, data_levels, coef_levels;
    int zero = 0, count, RootN, RootNd2, max_magnitude, real_type = 1, imag_type = 2;
    int fft_length, offset, halfpi, growth, growth_copy, angle_levels, minusquarterpi;
    int Root_FHT_length, alpha, lower, upper;
//  Integer Arrays:
    int index1[4], index2[16], index3[16], index4[8];
    int scale_factors[8], power_of_two_A[15], power_of_two_B[8];
    int beta1[8], beta2[8], beta3[8], growth_binary[32], arctans[32];
//  Floats:
    double pi, halfpi_float, quarterpi_float, twopi, angle, growth_float;
//  Pointer Variables:
    int *XRdata, *XIdata;
    int *bank1, *offset1, *bank2, *offset2, *scale_total;
    int *Look_Up_Sin_A1, *Look_Up_Sin_A2, *Look_Up_Sin_A3;
    int *Look_Up_Sin_Coarse, *Look_Up_Cos_Fine, *Look_Up_Sin_Fine;
    int **XRdata_2D = new int*[8];
//  Files:
    FILE *myinfile, *output;
//
//  ************************************************************************************
//  ##  R E G U L A R I S E D   F H T   I N I T I A L I S A T I O N.
//  Set up transform parameters.
    Root_FHT_length = (int) (sqrt(FHT_length+0.5));
    for (n = 3; n < 9; n++)
    {
        if (FHT_length == (int) (pow(4,n))) { alpha = n; }
    }
//  Set up standard angles.
    pi = atan(1.0)*4.0;
    halfpi_float = atan(1.0)*2.0;
    twopi = atan(1.0)*8.0;
    quarterpi_float = atan(1.0);
    wordsize = sizeof (int);
    memset (&index1[0], 0, wordsize 1); N8 = (N4 >> 1);
    RootN = Root_FHT_length;
    RootNd2 = RootN / 2;
    if (data_type == 1) { fft_length = N2; } else { fft_length = N; }
//  Set up number of quantisation levels for data.
    data_levels = (int) (pow(2,(no_of_bits_data-1))-1);
//  Set up number of quantisation levels for coefficients.
    coef_levels = (int) (pow(2,(no_of_bits_coeffs-1))-1);
//  Set up number of quantisation levels for Cordic rotation angles.
    angle_levels = (int) (pow(2,(no_of_bits_angle-1))-1);
//  Set up maximum allowable data magnitude into double butterfly.
    max_magnitude = (int) (pow(2,(no_of_bits_data-1)));
//  Set up register overflow bounds for use with unconditional fixed scaling strategy.
    lower = -(data_levels+1);
    upper = data_levels;
//  Set up power-of-two array.
    no_of_bits = alpha complex valued data
//
//  Parameters (SignalGeneration):
//  ------------------------------
//  dft_bin_excited    = integer representing DFT bin excited.
//  data_input         = data type: 0 => read data from file, 1 => generate data.
//  data_levels        = no of quantized data levels.
//  input_file_format  = input file format: 1 => HEX, 2 => DEC.
//
//  Note:
//  -----
//  Complex data is stored in data file in the form of alternating real and imaginary components.
//
//  Declarations:
//  -------------
//  Integers:
    int n;
//  Floats:
    double twopi, argument;
//
//  ************************************************************************************
//  ##  T E S T   D A T A   G E N E R A T I O N.
    if (data_input == 0)
    {
//      Read in FHT input data from file.
        FILE *input;
        if ((input = fopen("input_data_fht.txt", "r")) == NULL)
            printf ("\n Error opening input data file to read from");
        if (input_file_format == 1)
        {
//          "H E X" file format.
            if (data_type == 1)
                { for (n = 0; n < N; n++) fscanf (input, "%x", &XRdata[n]); }
            else
                { for (n = 0; n < N; n++) fscanf (input, "%x %x", &XRdata[n], &XIdata[n]); }
        }
        else
        {
//          "D E C" file format.
            if (data_type == 1)
                { for (n = 0; n < N; n++) fscanf (input, "%d", &XRdata[n]); }
            else
                { for (n = 0; n < N; n++) fscanf (input, "%d %d", &XRdata[n], &XIdata[n]); }
        }
//      Close file.
        fclose (input);
    }
    else
    {
//      Generate single-tone signal for FHT input data.
        twopi = 8*atan(1.0);
        for (n = 0; n < N; n++)
        {
            argument = (twopi*n*dft_bin_excited)/N;
            XRdata[n] = (int) (cos(argument)*data_levels);
            if (data_type == 2) { XIdata[n] = (int) (sin(argument)*data_levels); }
        }
    }
//  End of function.
}

B.2 Listings for Preprocessing Functions

Functions listed in this section of Appendix B:
– LookUpTable_1Level
– LookUpTable_2Level
– ButterflyMappings

#include "stdafx.h"
#include <stdio.h>
#include <math.h>

void LookUpTable_1Level (int N, int N4, int *Look_Up_Sin_A1, int *Look_Up_Sin_A2, int *Look_Up_Sin_A3, int coef_levels) { // //

Description: --------------

273

274

Appendix B: Source Code for Regularized Fast Hartley Transform

//

Routine to set up the one-level LUTs containing the trigonometric coefficients.

// // //

Parameters: ------------N

= transform length.

//

N4

= N / 4.

//

Look_Up_Sin_A1

= l ook-up table for single-angle argument.

//

Look_Up_Sin_A2

= look-up table for double-angle argument.

//

Look_Up_Sin_A3

= look-up table for triple-angle argument.

//

coef_levels

= number of trigonometric coefficient quantisation levels.

// // // //

Declarations: --------------Integers: ---------int i;

// //

Floats: -------double angle, twopi, rotation;

//

************************************************************************************

//

Set up output files for holding LUT contents. FILE *output1; if ((output1 = fopen("LUT_A1.txt", "w")) == NULL) printf ("\n Error opening 1st LUT file"); FILE *output2; if ((output2 = fopen("LUT_A2.txt", "w")) == NULL) printf ("\n Error opening 2nd LUT file"); FILE *output3; if ((output3 = fopen("LUT_A3.txt", "w")) == NULL) printf ("\n Error opening 3rd LUT file"); twopi = (double) (atan(1.0) * 8.0); rotation = (double) (twopi / N);

//
//  Set up size N/4 LUT for single-angle argument.
    angle = (double) 0.0;
    for (i = 0; i < N4; i++)
    {
       Look_Up_Sin_A1[i] = (int) (sin(angle) * coef_levels);
       angle += (double) rotation;
       fprintf (output1,"%x\n", Look_Up_Sin_A1[i]);
    }
//
//  Set up size N/4 LUT for double-angle argument.
    angle = (double) 0.0;
    for (i = 0; i < N4; i++)
    {
       Look_Up_Sin_A2[i] = (int) (sin(angle) * coef_levels);
       angle += (double) rotation;
       fprintf (output2,"%x\n", Look_Up_Sin_A2[i]);
    }
//
//  Set up size N/4 LUT for triple-angle argument.
    angle = (double) 0.0;
    for (i = 0; i < N4; i++)
    {
       Look_Up_Sin_A3[i] = (int) (sin(angle) * coef_levels);
       angle += (double) rotation;
       fprintf (output3,"%x\n", Look_Up_Sin_A3[i]);
    }
//
//  Close files.
    fclose (output1); fclose (output2); fclose (output3);
//
//  End of function.
}
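Each one-level LUT thus holds the quantised first quadrant of the sine function sampled at the transform's angular resolution,

$$\mathrm{LUT}[i] = \big\lfloor \texttt{coef\_levels}\times\sin(2\pi i/N)\big\rfloor,\qquad 0 \le i < N/4,$$

the remaining quadrants being obtained at run time through the symmetries of the sinusoid – for example $\cos\theta = \sin(\pi/2-\theta)$ via the complementary address $N/4-m$, as exploited by the coefficient-generation routines of Sect. B.3.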

#include "stdafx.h"
#include <stdio.h>
#include <math.h>

void LookUpTable_2Level (int N, int RootNd2, int *Look_Up_Sin_Coarse, int *Look_Up_Cos_Fine, int *Look_Up_Sin_Fine, int coef_levels)
{
//
//  Description:
//  ------------
//  Routine to set up the two-level LUTs containing the trigonometric coefficients.
//
//  Parameters:
//  -----------
//  N                  = transform length.
//  RootNd2            = sqrt(N) / 2.
//  Look_Up_Sin_Coarse = coarse resolution sin LUT for single-angle argument.
//  Look_Up_Cos_Fine   = fine resolution cos LUT for single-angle argument.
//  Look_Up_Sin_Fine   = fine resolution sin LUT for single-angle argument.
//  coef_levels        = number of trigonometric coefficient quantisation levels.
//
//  Declarations:
//  -------------
//  Integers:
//  ---------
    int i;
//
//  Floats:
//  -------
    double angle_coarse, angle_fine, twopi, rotation_coarse, rotation_fine;
//
//  ************************************************************************************
//
//  Set up output files for holding LUT contents.
    FILE *output1;
    if ((output1 = fopen("LUT_Sin_Coarse.txt", "w")) == NULL) printf ("\n Error opening 1st LUT file");
    FILE *output2;
    if ((output2 = fopen("LUT_Cos_Fine.txt", "w")) == NULL) printf ("\n Error opening 2nd LUT file");
    FILE *output3;
    if ((output3 = fopen("LUT_Sin_Fine.txt", "w")) == NULL) printf ("\n Error opening 3rd LUT file");
    twopi = (double) (atan(1.0) * 8.0);
    rotation_coarse = (double) (twopi / (2*sqrt((float)N)));
    rotation_fine = (double) (twopi / N);

//
//  Set up size sqrt(N)/2 LUTs for single-angle argument.
    angle_coarse = (double) 0.0; angle_fine = (double) 0.0;
    for (i = 0; i < RootNd2; i++)
    {
       Look_Up_Sin_Coarse[i] = (int) (sin(angle_coarse) * coef_levels);
       Look_Up_Cos_Fine[i] = (int) (cos(angle_fine) * coef_levels);
       Look_Up_Sin_Fine[i] = (int) (sin(angle_fine) * coef_levels);
       fprintf (output1,"%x\n", Look_Up_Sin_Coarse[i]);
       fprintf (output2,"%x\n", Look_Up_Cos_Fine[i]);
       fprintf (output3,"%x\n", Look_Up_Sin_Fine[i]);
       angle_coarse += (double) rotation_coarse;
       angle_fine += (double) rotation_fine;
    }
    Look_Up_Sin_Coarse[RootNd2] = coef_levels;
//
//  Close files.
    fclose (output1); fclose (output2); fclose (output3);
//
//  End of function.
}
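The two-level scheme rests on the angle-sum identity: writing the single-angle LUT address $m$ in terms of a coarse index $a$ and a fine index $b$,

$$m = a\,\frac{\sqrt N}{2} + b,\qquad 0 \le a, b < \frac{\sqrt N}{2},\qquad \theta = \frac{2\pi m}{N} = \underbrace{\frac{\pi a}{\sqrt N}}_{\theta_c} + \underbrace{\frac{2\pi b}{N}}_{\theta_f},$$

$$\sin\theta = \sin\theta_c\cos\theta_f + \cos\theta_c\sin\theta_f,\qquad \cos\theta = \cos\theta_c\cos\theta_f - \sin\theta_c\sin\theta_f,$$

where $\cos\theta_c$ is read from the coarse sin LUT at the complementary address $\sqrt N/2 - a$ – hence the extra full-scale entry written at index RootNd2 for $\sin(\pi/2)$. Three LUTs of $\sqrt N/2$ words therefore replace a single LUT of $N/4$ words at the cost of a little extra arithmetic, as carried out by the two-level coefficient-generation routines of Sect. B.3.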

#include "stdafx.h"

void ButterflyMappings (int *index1, int *index2, int *index3, int *index4)
{
//
//  Description:
//  ------------
//  Routine to set up the address permutations for the generic double butterfly.
//
//  Parameters:
//  -----------
//  index1 = 1st address permutation.
//  index2 = 2nd address permutation.
//  index3 = 3rd address permutation.
//  index4 = 4th address permutation.
//
//  ************************************************************************************
//
//  1st address permutation for Type-I and Type-II generic double butterflies.
    index1[0] = 6; index1[1] = 3;
//
//  1st address permutation for Type-III generic double butterfly.
    index1[2] = 3; index1[3] = 6;
//
//  2nd address permutation for Type-I and Type-II generic double butterflies.
    index2[0] = 0; index2[1] = 4; index2[2] = 3; index2[3] = 2;
    index2[4] = 1; index2[5] = 5; index2[6] = 6; index2[7] = 7;
//
//  2nd address permutation for Type-III generic double butterfly.
    index2[8] = 0; index2[9] = 4; index2[10] = 2; index2[11] = 6;
    index2[12] = 1; index2[13] = 5; index2[14] = 3; index2[15] = 7;
//
//  3rd address permutation for Type-I and Type-II generic double butterflies.
    index3[0] = 0; index3[1] = 4; index3[2] = 1; index3[3] = 5;
    index3[4] = 2; index3[5] = 6; index3[6] = 3; index3[7] = 7;
//
//  3rd address permutation for Type-III generic double butterfly.
    index3[8] = 0; index3[9] = 4; index3[10] = 1; index3[11] = 3;
    index3[12] = 2; index3[13] = 6; index3[14] = 7; index3[15] = 5;
//
//  4th address permutation for Type-I, Type-II and Type-III generic double butterflies.
    index4[0] = 0; index4[1] = 4; index4[2] = 1; index4[3] = 5;
    index4[4] = 6; index4[5] = 2; index4[6] = 3; index4[7] = 7;
//
//  End of function.
}

B.3  Listings for Processing Functions

Functions listed in this section of Appendix B:
–  RFHT4_Control
–  DibitReversal
–  Conversion
–  Coefficients_V09M_1Level
–  Coefficients_V09M_2Level
–  Coefficients_V12M_1Level
–  Coefficients_V12M_2Level
–  Butterfly_V12M
–  Butterfly_V09M
–  Butterfly_Cordic
–  Rotation
–  DataIndices
–  MemoryBankAddress

#include "stdafx.h"
#include <stdio.h>
#include <math.h>

void RFHT4_Control (int **Xdata_2D, int *index1, int *index2, int *index3, int *index4, int *Look_Up_Sin_A1, int *Look_Up_Sin_A2, int *Look_Up_Sin_A3, int *Look_Up_Sin_Coarse, int *Look_Up_Cos_Fine, int *Look_Up_Sin_Fine, int *power_of_two, int alpha, int N, int N2, int N4, int RootNd2, int coef_levels, int no_of_bits_coeffs, int scaling, int *scale_factors, int *scale_total, int max_magnitude, int *beta1, int *beta2, int *beta3, int angle_levels, int halfpi, int minusquarterpi, int growth, int *arctans, int no_of_iterations, int no_of_bits_angle, int LSB_guard_bits, int lower, int upper, int BFLY_type, int MEM_type)
{
//
//  Description:
//  ------------
//  Routine to carry out the regularized FHT algorithm, with options to use either the twelve-multiplier,
//  nine-multiplier or Cordic version of the generic double butterfly and N/4-word, sqrt(N)/2-word or
//  zero-word LUTs for the storage of the trigonometric coefficients.
//
//  Externs:
//  --------
    void Butterfly_V12M (int, int, int, int*, int*, int*, int*, int*, int*, int*, int, int, int, int*, int, int, int);
    void Butterfly_V09M (int, int, int, int*, int*, int*, int*, int*, int*, int*, int, int, int, int*, int, int, int, int);
    void Butterfly_Cordic (int*, int*, int*, int*, int*, int*, int*, int, int, int, int*, int, int, int, int, int*, int, int, int, int);
    void Coefficients_V12M_1Level (int, int, int, int, int, int*, int*, int*, int*, int);
    void Coefficients_V09M_1Level (int, int, int, int, int, int*, int*, int*, int*, int);
    void Coefficients_V12M_2Level (int, int, int, int, int, int, int, int*, int*, int*, int*, int, int);
    void Coefficients_V09M_2Level (int, int, int, int, int, int, int, int*, int*, int*, int*, int, int);
    void DataIndices (int, int, int, int, int*, int[2][4], int[2][4], int, int);

//
//  Parameters:
//  -----------
//  Xdata_2D           = 2-D data.
//  index1             = 1st address permutation.
//  index2             = 2nd address permutation.
//  index3             = 3rd address permutation.
//  index4             = 4th address permutation.
//  Look_Up_Sin_A1     = LUT for single-angle argument.
//  Look_Up_Sin_A2     = LUT for double-angle argument.
//  Look_Up_Sin_A3     = LUT for triple-angle argument.
//  Look_Up_Sin_Coarse = coarse resolution sin LUT for single-angle argument.
//  Look_Up_Cos_Fine   = fine resolution cos LUT for single-angle argument.
//  Look_Up_Sin_Fine   = fine resolution sin LUT for single-angle argument.
//  power_of_two       = array containing powers of 2.
//  alpha              = no of temporal stages for transform.
//  N                  = transform length.
//  N2                 = N / 2.
//  N4                 = N / 4.
//  RootNd2            = sqrt(N) / 2.
//  coef_levels        = number of trigonometric coefficient quantisation levels.
//  no_of_bits_coeffs  = number of bits representing trigonometric coefficients.
//  scaling            = scaling flag: 1 => FIXED, 2 => BFP.
//  scale_factors      = bits to shift for double butterfly stages.
//  scale_total        = total number of BFP scaling bits.
//  max_magnitude      = maximum magnitude of data into double butterfly.
//  beta1              = initial single-angle Cordic rotation angle.
//  beta2              = initial double-angle Cordic rotation angle.
//  beta3              = initial triple-angle Cordic rotation angle.
//  angle_levels       = number of Cordic rotation angle quantisation levels.
//  halfpi             = integer value of +(pi/2).
//  minusquarterpi     = integer value of -(pi/4).
//  growth             = integer value of Cordic magnification factor.
//  arctans            = Cordic micro-rotation angles.
//  no_of_iterations   = no of Cordic iterations.
//  no_of_bits_angle   = no of bits representing Cordic rotation angle.
//  LSB_guard_bits     = no of bits for guarding LSB.
//  lower              = lower bound for register overflow with unconditional scaling.
//  upper              = upper bound for register overflow with unconditional scaling.
//  BFLY_type          = BFLY type: 1 => 12 multipliers, 2 => 9 multipliers, 3 => 3 Cordic units.
//  MEM_type           = MEM type: 1 => LUT = one-level, 2 => LUT = two-level.
//
//  Declarations:
//  -------------
//  Integers:
//  ---------
    int i, j, k, n, n2, offset, M, beta, bfly_count, Type, negate_flag, shift;
//
//  Integer Arrays:
//  ---------------
    int X[9], kk[4], kbeta[3], Data_Max[1], coeffs[9], threshold[3];
    int index_even_2D[2][4], index_odd_2D[2][4];
//

//  ************************************************************************************
//
//  Set up offset for address permutations.
    kk[3] = 0;
//
//  Set up block floating-point thresholds.
    threshold[0] = max_magnitude; threshold[1] = max_magnitude scale_factors[i]); }
//
//  W R I T E S - Set up output data vector for double butterfly.
    for (n = 0; n < 4; n++) { n2 = (n
//
//  S T A G E > 0.
//
//  Set up data indices for double butterfly.
    DataIndices (i, j, k, offset, kk, index_even_2D, index_odd_2D, bfly_count, alpha);
    bfly_count ++;
//
//  Set up trigonometric coefficients for double butterfly.
    if (BFLY_type == 1)
    {
//     Butterfly is twelve-multiplier version.
       if (MEM_type == 1)
       {
//        Standard arithmetic & standard memory solution.
          Coefficients_V12M_1Level (i, k, N2, N4, kbeta[0], Look_Up_Sin_A1, Look_Up_Sin_A2, Look_Up_Sin_A3, coeffs, coef_levels);
       }
       else
       {
//        Standard arithmetic & reduced memory solution.
          Coefficients_V12M_2Level (i, k, N2, N4, RootNd2, alpha, kbeta[0], Look_Up_Sin_Coarse, Look_Up_Cos_Fine, Look_Up_Sin_Fine, coeffs, coef_levels, no_of_bits_coeffs);
       }
//     Increment address offset.
       kbeta[0] += beta;
    }
    else
    {
//     Butterfly is nine-multiplier version.
       if (BFLY_type == 2)
       {
          if (MEM_type == 1)
          {
//           Reduced arithmetic & standard memory solution.
             Coefficients_V09M_1Level (i, k, N2, N4, kbeta[0], Look_Up_Sin_A1, Look_Up_Sin_A2, Look_Up_Sin_A3, coeffs, coef_levels);
          }
          else
          {
//           Reduced arithmetic & reduced memory solution.
             Coefficients_V09M_2Level (i, k, N2, N4, RootNd2, alpha, kbeta[0], Look_Up_Sin_Coarse, Look_Up_Cos_Fine, Look_Up_Sin_Fine, coeffs, coef_levels, no_of_bits_coeffs);
          }
       }
//     Increment address offset.
       kbeta[0] += beta;
    }
//
//  R E A D S - Set up input data vector for double butterfly.
    for (n = 0; n < 4; n++) { n2 = (n > scale_factors[i]); }
//
//  W R I T E S - Set up output data vector for double butterfly.
    for (n = 0; n < 4; n++) { n2 = (n i3) { store = Xdata[i3]; Xdata[i3] = Xdata[j1]; Xdata[j1] = store; }
//
//  Convert to 2-D form.
    MemoryBankAddress (i3, j2, 0, alpha, bank, offset);
    Xdata_2D[*bank][i1] = Xdata[i3];
    i3 ++; } }
//
//  Delete dynamic memory.
    delete [] bank; delete [] offset;
//
//  End of function.
}

#include "stdafx.h"

void Conversion (int channel_type, int N, int N2, int *XRdata, int *XIdata)

{
//
//  Description:
//  ------------
//  Routine to convert DHT coefficients to DFT coefficients. If the FHT is to be used for the computation
//  of the real-data FFT, as opposed to being used for the computation of the complex-data FFT, the
//  complex-valued DFT coefficients are optimally stored in the following way:
//
//     XRdata[0]     = zero'th frequency component
//     XRdata[1]     = real component of 1st frequency component
//     XRdata[N-1]   = imag component of 1st frequency component
//     XRdata[2]     = real component of 2nd frequency component
//     XRdata[N-2]   = imag component of 2nd frequency component
//        ---                  ---
//     XRdata[N/2-1] = real component of (N/2-1)th frequency component
//     XRdata[N/2+1] = imag component of (N/2-1)th frequency component
//     XRdata[N/2]   = (N/2)th frequency component
//
//  For the case of the complex-valued FFT, however, the array "XRdata" stores the real component of
//  both the input and output data, whilst the array "XIdata" stores the imaginary component of both
//  the input and output data.
//
//  Parameters:
//  -----------
//  channel_type = 1 => real input channel, 2 => imaginary input channel.
//  N            = transform length.
//  N2           = N / 2.
//  XRdata       = on input: FHT output for real input channel; on output: as in "description" above.
//  XIdata       = on input: FHT output for imaginary input channel; on output: as in "description" above.
//
//  Declarations:
//  -------------
//  Integers:
//  ---------
    int j, k, store, store1, store2, store3;
//
//  ************************************************************************************
    if (channel_type == 1)
    {

//
//     R E A L   D A T A   C H A N N E L.
//
//     Produce DFT output for this channel.
       k = N - 1;
       for (j = 1; j < N2; j++)
       {
          store = XRdata[k] + XRdata[j];
          XRdata[k] = XRdata[k] - XRdata[j];
          XRdata[j] = store;
          XRdata[j] /= 2; XRdata[k] /= 2;
          k --;
       }
    }
    else
    {
//     I M A G I N A R Y   D A T A   C H A N N E L.
//
//     Produce DFT output for this channel.
       k = N - 1;
       for (j = 1; j < N2; j++)
       {
          store = XIdata[k] + XIdata[j];
          XIdata[k] = XIdata[k] - XIdata[j];
          XIdata[j] = store;
          XIdata[j] /= 2; XIdata[k] /= 2;
//
//        Produce DFT output for complex data.
          store1 = XRdata[j] + XIdata[k];
          store2 = XRdata[j] - XIdata[k];
          store3 = XIdata[j] + XRdata[k];
          XIdata[k] = XIdata[j] - XRdata[k];
          XRdata[j] = store2;
          XRdata[k] = store1;
          XIdata[j] = store3;
          k --;
       }
    }
//
//  End of function.
}
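The pairwise sums and differences above implement the standard Hartley-to-Fourier conversion: taking the DFT kernel as $e^{-2\pi ink/N}$, the DFT output of a real-valued channel with DHT output $H(k)$ satisfies

$$\mathrm{Re}\,X(k) = \tfrac{1}{2}\big[H(k) + H(N-k)\big],\qquad \mathrm{Im}\,X(k) = \tfrac{1}{2}\big[H(N-k) - H(k)\big],$$

which is exactly the sum/difference-and-halve operation applied to each index pair $(j, N-j)$ in the loops above; the second block of the imaginary channel then combines the two real-data results into the DFT of the original complex-valued data set.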

#include "stdafx.h"
#include <math.h>

void Coefficients_V09M_1Level (int i, int k, int N2, int N4, int kbeta, int *Look_Up_Sin_A1, int *Look_Up_Sin_A2, int *Look_Up_Sin_A3, int *coeffs, int coef_levels)
{
//
//  Description:
//  ------------
//  Routine to set up the trigonometric coefficients for use by the nine-multiplier version of the
//  generic double butterfly where one-level LUTs are exploited.
//
//  Parameters:
//  -----------
//  i              = temporal addressing index.
//  k              = spatial addressing index.
//  N2             = N / 2.
//  N4             = N / 4.
//  kbeta          = temporal/spatial index.
//  Look_Up_Sin_A1 = look-up table for single-angle argument.
//  Look_Up_Sin_A2 = look-up table for double-angle argument.
//  Look_Up_Sin_A3 = look-up table for triple-angle argument.
//  coeffs         = current set of trigonometric coefficients.
//  coef_levels    = number of trigonometric coefficient quantisation levels.
//
//  Declarations:
//  -------------
//  Integers:
//  ---------
    int m, n, n3, store_00, store_01;
    static int startup, coeff_00, coeff_01, coeff_02, coeff_03, coeff_04;
//
//  ************************************************************************************
    if (startup == 0)
    {

//
//     Set up trivial trigonometric coefficients - valid for each type of double butterfly.
       coeff_00 = +coef_levels; coeff_01 = 0; coeff_02 = -coef_levels;
//
//     Set up additional constant trigonometric coefficient for Type-II double butterfly.
       coeff_03 = (int) ((sqrt(2.0) / 2) * coef_levels);
       coeff_04 = coeff_03 + coeff_03;
       startup = 1;
    }
    if (i == 0)
    {
//     Set up trigonometric coefficients for Type-I double butterfly.
       n3 = 0;
       for (n = 0; n < 3; n++)
       {
          coeffs[n3++] = coeff_00; coeffs[n3++] = coeff_01; coeffs[n3++] = coeff_00;
       }
    }
    else
    {
       if (k == 0)
       {
//        Set up trigonometric coefficients for Type-II double butterfly.
          n3 = 0;
          for (n = 0; n < 2; n++)
          {
             coeffs[n3++] = coeff_00; coeffs[n3++] = coeff_01; coeffs[n3++] = coeff_00;
          }
          coeffs[6] = coeff_04; coeffs[7] = coeff_03; coeffs[8] = 0;
       }
       else
       {
//        Set up trigonometric coefficients for Type-III double butterfly.
//
//        Set up single-angle sinusoidal & cosinusoidal terms.
          m = kbeta;
          store_00 = Look_Up_Sin_A1[N4-m];
          store_01 = Look_Up_Sin_A1[m];
          coeffs[0] = store_00 + store_01;
          coeffs[1] = store_00;
          coeffs[2] = store_00 - store_01;

//
//        Set up double-angle sinusoidal & cosinusoidal terms.
          m > bits_to_shift; store2 = ((_int64)sum2*sv1) >> bits_to_shift; store3 = ((_int64)sum3*cv1) >> bits_to_shift;
          store_00 = (int) (store1 - store2);
          store_01 = (int) (store1 - store3);
          coeffs[0] = store_00 + store_01;
          coeffs[1] = store_00;
          coeffs[2] = store_00 - store_01;
//
//        Set up double-angle sinusoidal & cosinusoidal terms.
          store1 = ((_int64)store_00*store_00) >> bits_to_shift_m1;
          store2 = ((_int64)store_00*store_01) >> bits_to_shift_m1;
          store_02 = (int) (store1 - coef_levels);
          store_03 = (int) store2;
          coeffs[3] = store_02 + store_03;
          coeffs[4] = store_02;
          coeffs[5] = store_02 - store_03;
//
//        Set up triple-angle sinusoidal & cosinusoidal terms.
          store1 = ((_int64)store_02*store_00) >> bits_to_shift_m1;
          store2 = ((_int64)store_02*store_01) >> bits_to_shift_m1;
          store_04 = (int) (store1 - store_00);
          store_05 = (int) (store2 + store_01);
          coeffs[6] = (int) (store_04 + store_05);
          coeffs[7] = (int) (store_04);
          coeffs[8] = (int) (store_04 - store_05);
       }
    }
//
//  End of function.
}
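The coefficient triples assembled above feed the three-multiplier rotation used by this version of the butterfly: storing $(c+s,\; c,\; c-s)$ for $c = \cos\theta$, $s = \sin\theta$, and pre-forming the sum $u = x + y$, a full rotation of the pair $(x, y)$ needs only three multiplications,

$$c\,x + s\,y = c\,u - (c-s)\,y,\qquad s\,x - c\,y = (c+s)\,x - c\,u,$$

rather than the four of the direct form. The double-angle and triple-angle triples are produced from the single-angle pair through the recurrences

$$\cos 2\theta = 2\cos^2\theta - 1,\qquad \sin 2\theta = 2\cos\theta\sin\theta,\qquad \cos 3\theta = 2\cos 2\theta\cos\theta - \cos\theta,\qquad \sin 3\theta = 2\cos 2\theta\sin\theta + \sin\theta,$$

which is what the fixed-point products and shifts in the listings compute – the shift by bits_to_shift_m1, one bit fewer, supplying the factor of two.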

#include "stdafx.h"
#include <math.h>

void Coefficients_V12M_1Level (int i, int k, int N2, int N4, int kbeta, int *Look_Up_Sin_A1, int *Look_Up_Sin_A2, int *Look_Up_Sin_A3, int *coeffs, int coef_levels)
{
//
//  Description:
//  ------------
//  Routine to set up the trigonometric coefficients for use by the twelve-multiplier version of the
//  generic double butterfly where one-level LUTs are exploited.
//
//  Parameters:
//  -----------
//  i              = temporal addressing index.
//  k              = spatial addressing index.
//  N2             = N / 2.
//  N4             = N / 4.
//  kbeta          = temporal/spatial index.
//  Look_Up_Sin_A1 = look-up table for single-angle argument.
//  Look_Up_Sin_A2 = look-up table for double-angle argument.
//  Look_Up_Sin_A3 = look-up table for triple-angle argument.
//  coeffs         = current set of trigonometric coefficients.
//  coef_levels    = number of trigonometric coefficient quantisation levels.
//
//  Declarations:
//  -------------
//  Integers:
//  ---------
    int m, n, n3;
    static int startup, coeff_00, coeff_01, coeff_02, coeff_03;
//
//  ************************************************************************************
    if (startup == 0)
    {
//     Set up trivial trigonometric coefficients - valid for each type of double butterfly.
       coeff_00 = +coef_levels; coeff_01 = 0; coeff_02 = -coef_levels;
//
//     Set up additional constant trigonometric coefficient for Type-II double butterfly.
       coeff_03 = (int) ((sqrt(2.0) / 2) * coef_levels);
       startup = 1;
    }

    if (i == 0)
    {
//     Set up trigonometric coefficients for Type-I double butterfly.
       n3 = 0;
       for (n = 0; n < 3; n++)
       {
          coeffs[n3++] = coeff_00; coeffs[n3++] = coeff_01; coeffs[n3++] = coeff_02;
       }
    }
    else
    {
       if (k == 0)
       {
//        Set up trigonometric coefficients for Type-II double butterfly.
          n3 = 0;
          for (n = 0; n < 2; n++)
          {
             coeffs[n3++] = coeff_00; coeffs[n3++] = coeff_01; coeffs[n3++] = coeff_02;
          }
          for (n = 6; n < 9; n++) { coeffs[n] = coeff_03; }
       }
       else
       {
//        Set up trigonometric coefficients for Type-III double butterfly.
//
//        Set up single-angle sinusoidal & cosinusoidal terms.
          m = kbeta;
          coeffs[0] = Look_Up_Sin_A1[N4-m];
          coeffs[1] = Look_Up_Sin_A1[m];
//
//        Set up double-angle sinusoidal & cosinusoidal terms.
          m alpham1; ca1 = RootNd2 - sa1; sca2 = m % RootNd2;
          cv1 = Look_Up_Sin_Coarse[ca1]; sv1 = Look_Up_Sin_Coarse[sa1];
          cv2 = Look_Up_Cos_Fine[sca2]; sv2 = Look_Up_Sin_Fine[sca2];
          sum1 = cv1 + sv1; sum2 = cv2 + sv2; sum3 = cv2 - sv2;
          store1 = ((_int64)sum1*cv2) >> bits_to_shift;
          store2 = ((_int64)sum2*sv1) >> bits_to_shift;
          store3 = ((_int64)sum3*cv1) >> bits_to_shift;
          coeffs[0] = (int) (store1 - store2);
          coeffs[1] = (int) (store1 - store3);
//
//        Set up double-angle sinusoidal & cosinusoidal terms.
          cv1 = coeffs[0]; sv1 = coeffs[1];
          store1 = ((_int64)cv1*cv1) >> bits_to_shift_m1;
          store2 = ((_int64)cv1*sv1) >> bits_to_shift_m1;
          coeffs[3] = (int) (store1 - coef_levels);
          coeffs[4] = (int) store2;
//
//        Set up triple-angle sinusoidal & cosinusoidal terms.
          cv2 = coeffs[3];
          store1 = ((_int64)cv1*cv2) >> bits_to_shift_m1;
          store2 = ((_int64)sv1*cv2) >> bits_to_shift_m1;
          coeffs[6] = (int) (store1 - cv1);
          coeffs[7] = (int) (store2 + sv1);
//
//        Set up remaining trigonometric coefficients through symmetry.
          coeffs[2] = coeffs[0]; coeffs[5] = coeffs[3]; coeffs[8] = coeffs[6];
       }
    }
//
//  End of function.
}

#include "stdafx.h"
#include <stdio.h>

void Butterfly_V12M (int i, int j, int k, int *X, int *coeffs, int *kk, int *index1, int *index2, int *index3, int *index4, int coef_levels, int no_of_bits_coeffs, int scaling, int *Data_Max, int shift, int lower, int upper)
{
//
//  Description:
//  ------------
//  Routine to carry out the generic double butterfly computation using twelve fixed-point fast multipliers.
//
//  Parameters:
//  -----------
//  i                 = index for temporal loop.
//  j                 = index for outer spatial loop.
//  k                 = index for inner spatial loop.
//  X                 = 1-D data array.
//  coeffs            = current set of trigonometric coefficients.
//  kk                = offsets for address permutations.
//  index1            = 1st address permutation.
//  index2            = 2nd address permutation.
//  index3            = 3rd address permutation.
//  index4            = 4th address permutation.
//  coef_levels       = number of trigonometric coefficient quantisation levels.
//  no_of_bits_coeffs = number of bits representing trigonometric coefficients.
//  scaling           = scaling flag: 1 => FIXED, 2 => BFP.
//  Data_Max          = maximum magnitude of output data set.
//  shift             = no of bits for input data to be shifted.
//  lower             = lower bound for register overflow with unconditional scaling.
//  upper             = upper bound for register overflow with unconditional scaling.
//
//  Declarations:
//  -------------
//  Integers:
//  ---------
    int m, n, n2, n2p1, n3, n3p1, store, bits_to_shift1, bits_to_shift2;
//
//  Long Integers:
//  --------------
    _int64 m1, m2, m3, m4;
//
//  Integer Arrays:
//  ---------------
    int Y[8];
//
//  ************************************************************************************

//
//  Apply 1st address permutation - comprising one data exchange.
    m = kk[0];
    store = X[index1[m++]]; X[6] = X[index1[m]]; X[3] = store;
//
//  Set up scaling factor for multiplication stage.
    bits_to_shift2 = no_of_bits_coeffs - 1;
    if (scaling == 1)
    {
       Y[0] = X[0]; Y[1] = X[1];
//
//     ### Check for register overflow & flag when overflow arises.
       for (n = 0; n < 8; n++)
       {
          if ((X[n] < lower) || (X[n] > upper)) { printf ("\n\n Overflow occurred on input register"); }
       }
//     ### Check for register overflow completed.
    }
    else
    {
//     Set up scaling factor for first two samples of input data set.
//     Shift data so that MSB occupies optimum position.
       bits_to_shift1 = 3 - shift;
       Y[0] = X[0] shift; }
//
//     Build in three guard bits for LSB.
       bits_to_shift2 -= 3;
    }
//
//  Apply trigonometric coefficients and 1st set of additions/subtractions.
    n3 = 0;
    for (n = 1; n < 4; n++)
    {
       n2 = (n << 1); n2p1 = n2 + 1; n3p1 = n3 + 1;
       m1 = ((_int64)coeffs[n3]*X[n2]) >> bits_to_shift2;
       m2 = ((_int64)coeffs[n3p1]*X[n2p1]) >> bits_to_shift2;
       Y[n2] = (int) (m1 + m2);
//
//     Truncate contents of registers to required levels.
       m3 = ((_int64)coeffs[n3p1]*X[n2]) >> bits_to_shift2;
       m4 = ((_int64)coeffs[n3+2]*X[n2p1]) >> bits_to_shift2;
       Y[n2p1] = (int) (m3 - m4);
       n3 += 3;
    }
//
//  Apply 2nd address permutation.
    m = kk[1];
    for (n = 0; n < 8; n++) { X[index2[m++]] = Y[n]; }
//
//  Apply 2nd set of additions/subtractions.
    for (n = 0; n < 4; n++) { n2 = (n 3); if (abs(Y[n]) > abs(Data_Max[0])) Data_Max[0] = Y[n]; }
//
//  Apply 4th address permutation.
    X[index4[m++]] = Y[n]; }
//
//  End of function.
}

#include "stdafx.h"
#include <stdio.h>

void Butterfly_V09M (int i, int j, int k, int *X, int *coeffs, int *kk, int *index1, int *index2, int *index3, int *index4, int coef_levels, int no_of_bits_coeffs, int scaling, int *Data_Max, int shift, int Type, int lower, int upper)
{
//
//  Description:
//  ------------
//  Routine to carry out the generic double butterfly computation using nine fixed-point fast multipliers.
//
//  Parameters:
//  -----------
//  i                 = index for temporal loop.
//  j                 = index for outer spatial loop.
//  k                 = index for inner spatial loop.
//  X                 = 1-D data array.
//  coeffs            = current set of trigonometric coefficients.
//  kk                = offsets for address permutations.
//  index1            = 1st address permutation.
//  index2            = 2nd address permutation.
//  index3            = 3rd address permutation.
//  index4            = 4th address permutation.
//  coef_levels       = number of trigonometric coefficient quantisation levels.
//  no_of_bits_coeffs = number of bits representing trigonometric coefficients.
//  scaling           = scaling flag: 1 => FIXED, 2 => BFP.
//  Data_Max          = maximum magnitude of output data set.
//  shift             = no of bits for input data to be shifted.
//  Type              = butterfly type indicator: I, II or III.
//  lower             = lower bound for register overflow with unconditional scaling.
//  upper             = upper bound for register overflow with unconditional scaling.
//
//  Note:
//  -----
//  Dimension array X[n] from 0 to 8 in calling routine RFHT4_Control.
//
//  Declarations:
//  -------------
//  Integers:
//  ---------
    int m, n, n2, n2p1, store, bits_to_shift1, bits_to_shift2;
//
//  Long Integers:
//  --------------
    _int64 product;
//
//  Integer Arrays:
//  ---------------
    int Y[11];
//
//  ************************************************************************************

//
//  Apply 1st address permutation - comprising one data exchange.
    m = kk[0];
    store = X[index1[m++]]; X[6] = X[index1[m]]; X[3] = store;
//
//  Set up scaling factor for multiplication stage.
    bits_to_shift2 = no_of_bits_coeffs - 1;
    if (scaling == 2)
    {
//     Set up scaling factor for first two samples of input data set.
//     Shift data so that MSB occupies optimum position.
       bits_to_shift1 = 3 - shift;
       X[0] = X[0] shift; }
//
//     Build in three guard bits for LSB.
       bits_to_shift2 -= 3;
    }
//
//  Apply 1st set of additions/subtractions.
    Y[0] = X[0]; Y[1] = X[1];
    Y[2] = X[2]; Y[3] = X[2] + X[3]; Y[4] = X[3];
    Y[5] = X[4]; Y[6] = X[4] + X[5]; Y[7] = X[5];
    Y[8] = X[6]; Y[9] = X[6] + X[7]; Y[10] = X[7];
    if (scaling == 1)
    {
//     Scale outputs of 1st set of additions/subtractions.
       for (n = 0; n < 11; n++) Y[n] = (Y[n]>>1);
//
//     ### Check for register overflow & flag when overflow arises.
       for (n = 0; n < 11; n++)
       {
          if ((Y[n] < lower) || (Y[n] > upper)) { printf ("\n\n Overflow occurred on input register"); }
       }
//     ### Check for register overflow completed.
    }
//
//  Apply trigonometric coefficients.
    for (n = 0; n < 9; n++)
    {
       product = ((_int64)coeffs[n]*Y[n+2]) >> bits_to_shift2;
       X[n] = (int) product;
    }
//
//  Apply 2nd set of additions/subtractions.
    if (Type < 3)
       { Y[2] = X[0] + X[1]; Y[3] = X[1] + X[2]; Y[4] = X[3] + X[4]; Y[5] = X[4] + X[5]; }
    else
       { Y[2] = X[1] - X[2]; Y[3] = X[0] - X[1]; Y[4] = X[4] - X[5]; Y[5] = X[3] - X[4]; }
    if (Type < 2)
       { Y[6] = X[6] + X[7]; Y[7] = X[7] + X[8]; }
    else
       { Y[6] = X[7] - X[8]; Y[7] = X[6] - X[7]; }
//
//  Apply 2nd address permutation.
    m = kk[1];
    for (n = 0; n < 8; n++) { X[index2[m++]] = Y[n]; }
//
//  Apply 3rd set of additions/subtractions.
    for (n = 0; n < 4; n++) { n2 = (n 3); if (abs(Y[n]) > abs(Data_Max[0])) Data_Max[0] = Y[n]; }
//
//  Apply 4th address permutation.
    X[index4[m++]] = Y[n]; }
//
//  End of function.
}

#include "stdafx.h"
#include <stdio.h>

void Butterfly_Cordic (int *X, int *kbeta, int *kk, int *index1, int *index2, int *index3, int *index4, int halfpi, int minusquarterpi, int growth, int *arctans, int no_of_iterations, int no_of_bits_angle, int negate_flag, int scaling, int *Data_Max, int shift, int LSB_guard_bits, int lower, int upper)
{
//
//  Description:
//  ------------
//  Routine to carry out the generic double butterfly computation using three Cordic arithmetic units.
//
//  Externs:
//  --------
    void Rotation (int*, int*, int*, int, int, int*);
//
//  Parameters:
//  -----------
//  X                = data.
//  kbeta            = current set of rotation angles.
//  kk               = offsets for address permutations.
//  index1           = 1st address permutation.
//  index2           = 2nd address permutation.
//  index3           = 3rd address permutation.
//  index4           = 4th address permutation.
//  halfpi           = integer version of +(pi/2).
//  minusquarterpi   = integer version of -(pi/4).
//  growth           = integer version of Cordic magnification factor.
//  arctans          = micro-rotation angles.
//  no_of_iterations = no of Cordic iterations.
//  no_of_bits_angle = no of bits to represent Cordic rotation angle.
//  negate_flag      = negation flag for Cordic output.
//  scaling          = scaling flag: 1 => FIXED, 2 => BFP.
//  Data_Max         = maximum magnitude of output data set.
//  shift            = no of bits for input data to be shifted.
//  LSB_guard_bits   = no of bits for guarding LSB.
//  lower            = lower bound for register overflow with unconditional scaling.
//  upper            = upper bound for register overflow with unconditional scaling.
//
//  Declarations:
//  -------------
//  Integers:
//  ---------
    int m, n, n2, n2p1, store, bits_to_shift1, bits_to_shift2;
//
//  Integer Arrays:
//  ---------------
    int Y[8], xs[3], ys[3], zs[3];
//
//  ************************************************************************************

//
//  Apply 1st address permutation - comprising one data exchange.
    m = kk[0];
    store = X[index1[m++]]; X[6] = X[index1[m]]; X[3] = store;
//
//  Set up scaling factor for multiplication stage.
    bits_to_shift1 = no_of_bits_angle - 1;
    if (scaling == 1)
    {
//     ### Check for register overflow & flag when overflow arises.
       for (n = 0; n < 8; n++)
       {
          if ((X[n] < lower) || (X[n] > upper)) { printf ("\n\n Overflow occurred on input register"); }
       }
//     ### Check for register overflow completed.
    }
    else
    {
//     Set up scaling factor for first two samples of input data set.
       bits_to_shift2 = LSB_guard_bits - shift + 2;
//
//     Shift data so that MSB occupies optimum position.
       X[0] = X[0] >> shift; X[1] = X[1] >> shift;
       for (n = 2; n < 8; n++) { X[n] = X[n]
    }
    Y[0] = (int) (((_int64)growth*X[0]) >> bits_to_shift1);
    Y[1] = (int) (((_int64)growth*X[1]) >> bits_to_shift1);
//
//  Set up inputs to Cordic phase rotations of remaining permuted inputs.
    xs[0] = X[2]; xs[1] = X[4]; xs[2] = X[6];
    ys[0] = X[3]; ys[1] = X[5]; ys[2] = X[7];
    zs[0] = kbeta[0]; zs[1] = kbeta[1]; zs[2] = kbeta[2];
    if (negate_flag == 1) zs[2] = minusquarterpi;
//
//  Carry out Cordic phase rotations of remaining permuted inputs.
    Rotation (xs, ys, zs, halfpi, no_of_iterations, arctans);
//
//  Set up outputs from Cordic phase rotations of remaining permuted inputs.
    Y[2] = xs[0]; Y[4] = xs[1]; Y[6] = xs[2];
    Y[3] = ys[0]; Y[5] = ys[1]; Y[7] = ys[2];
    if (scaling == 2)
    {
//     Scale Cordic outputs to remove LSB guard bits.
       for (n = 2; n < 8; n++) { Y[n] = Y[n] >> LSB_guard_bits; }
    }
//
//  Negate, where appropriate, phase rotated outputs.
    if (negate_flag > 0)
    {
       Y[7] = -Y[7];
       if (negate_flag > 1) { Y[3] = -Y[3]; Y[5] = -Y[5]; }
    }
//
//  Apply 2nd address permutation.
    m = kk[1];
    for (n = 0; n < 8; n++) { X[index2[m++]] = Y[n]; }
//
//  Apply 1st set of additions/subtractions.
    for (n = 0; n < 4; n++) { n2 = (n 2); if (abs(Y[n]) > abs(Data_Max[0])) Data_Max[0] = Y[n]; }
//
//  Apply 4th address permutation.
    X[index4[m++]] = Y[n]; }
//
//  End of function.
}

#include "stdafx.h"

void Rotation (int *xs, int *ys, int *zs, int halfpi, int no_of_iterations, int *arctans)
{
//
//  Description:
//  ------------
//  Routine to carry out the phase rotations required by the Cordic arithmetic unit for the single-angle,
//  double-angle and triple-angle cases.
//
//  Parameters:
//  -----------
//  xs               = X coordinates.
//  ys               = Y coordinates.
//  zs               = rotation angles.
//  halfpi           = +(pi/2).
//  no_of_iterations = no of Cordic iterations.
//  arctans          = set of micro-rotation angles.
//
//  Declarations:
//  -------------
//  Integers:
//  ---------
    int k, n;
//
//  Integer Arrays:
//  ---------------
    int temp[3];
//
//  ************************************************************************************
//
//  P H A S E   R O T A T I O N   R O U T I N E.

//
//  Reduce three rotation angles to region of convergence: [-pi/2,+pi/2].
    for (n = 0; n < 3; n++)
    {
       if (zs[n] < -halfpi)
          { temp[n] = +ys[n]; ys[n] = -xs[n]; xs[n] = temp[n]; zs[n] += halfpi; }
       else if (zs[n] > +halfpi)
          { temp[n] = -ys[n]; ys[n] = +xs[n]; xs[n] = temp[n]; zs[n] -= halfpi; }
    }
//
//  Loop through Cordic iterations.
    for (k = 0; k < no_of_iterations; k++)
    {
//     Carry out phase micro-rotation of three complex data samples.
       for (n = 0; n < 3; n++)
       {
          if (zs[n] < 0)
             { temp[n] = xs[n] + (ys[n] >> k); ys[n] -= (xs[n] >> k); xs[n] = temp[n]; zs[n] += arctans[k]; }
          else
             { temp[n] = xs[n] - (ys[n] >> k); ys[n] += (xs[n] >> k); xs[n] = temp[n]; zs[n] -= arctans[k]; }
       }
    }
//
//  End of function.
}
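Each pass of the iteration loop above applies one standard Cordic micro-rotation in fixed point,

$$x_{k+1} = x_k - d_k 2^{-k} y_k,\qquad y_{k+1} = y_k + d_k 2^{-k} x_k,\qquad z_{k+1} = z_k - d_k\arctan(2^{-k}),$$

with $d_k = \pm 1$ chosen according to the sign of the residual angle $z_k$ (the two branches of the if statement), so that $z$ is driven towards zero and $(x, y)$ ends up rotated through the original angle. The vector magnitude grows by the fixed factor $\prod_k\sqrt{1+2^{-2k}}\approx 1.6468$, which is why the calling butterfly scales the two non-rotated samples by the matching integer constant "growth" so that all eight outputs share a common gain.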

#include "stdafx.h"

void DataIndices (int i, int j, int k, int offset, int *kk, int index_even_2D[2][4], int index_odd_2D[2][4], int bfly_count, int alpha)
{
//
//  Description:
//  ------------
//  Routine to set up the data indices for accessing the input data for the generic double butterfly.
//
//  Parameters:
//  -----------
//  i             = index for temporal loop.
//  j             = index for outer spatial loop.
//  k             = index for inner spatial loop.
//  offset        = element of power-of-two array.
//  kk            = offsets for address permutations.
//  index_even_2D = even data address indices.
//  index_odd_2D  = odd data address indices.
//  bfly_count    = double butterfly address for stage.
//  alpha         = no of temporal stages for transform.
//
//  Externs:
//  --------
    void MemoryBankAddress (int, int, int, int, int*, int*);
//
//  Declarations:
//  -------------
//  Integers:
//  ---------
    int n, n1, n2, twice_offset;
//
//  Pointer Variables:
//  ------------------
    int *bank1, *offset1, *bank2, *offset2;
//
//  ************************************************************************************
//
//  Set up dynamic memory.
    bank1 = new int [1]; bank1[0] = 0;
    bank2 = new int [1]; bank2[0] = 0;
    offset1 = new int [1]; offset1[0] = 0;
    offset2 = new int [1]; offset2[0] = 0;

//
//  Calculate data indices.
    if (i == 0)
    {
//     S T A G E = 0.
//
//     Set up even and odd data indices for Type-I double butterfly.
       twice_offset = offset;
       n1 = j - twice_offset; n2 = n1 + 4;
       for (n = 0; n < 4; n++)
       {
          n1 += twice_offset; n2 += twice_offset;
          MemoryBankAddress (n1, n, 1, alpha, bank1, offset1);
          index_even_2D[0][n] = *bank1; index_even_2D[1][n] = *offset1;
          MemoryBankAddress (n2, n, 1, alpha, bank2, offset2);
          index_odd_2D[0][n] = *bank2; index_odd_2D[1][n] = *offset2;
       }
//
//     Set up offsets for address permutations.
       kk[0] = 0; kk[1] = 0; kk[2] = 0;
    }
    else
    {
//     S T A G E > 0.
       twice_offset = (offset
//
//  Note:
//  -----
//  For N > 16K introduce new first line of code – more time-efficient than generic version supplied,
//  but less code-efficient.
//
//  Parameters:
//  -----------
//                = initialisation flag: 0 => start up, 1 => butterfly.
//
//  Declarations:
//  -------------
//  Integers:
//  ---------
    int k1, k2, sub_block_size, mapping;
//
//  ************************************************************************************
//
//  Calculate memory bank address for N up to and including 1K.
//
    bank[0] = ((((address%4)+((address%16)>>2)+((address%64)>>4)+((address%256)>>6)+
    (address>>8)) % 4) >2)+((address%64)>>4)+((address%256)>>6)+
    ((address%1024)>>8)+(address>>10)) % 4) >2)+((address%64)>>4)+((address%256)>>6)+
    ((address%1024)>>8)+((address%4096)>>10)+(address>>12)) % 4) 3;
//
//  End of function.
}
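The surviving fragments of MemoryBankAddress above show the bank number being formed as a sum of the address's radix-4 digits (dibits) taken modulo 4. As a rough, illustrative sketch only of this style of conflict-free bank mapping – the helper name, the dibit grouping and the exact bit-level combination below are assumptions, and the constants of the printed routine may well differ – such an eight-bank scheme can be written as:

#include <stdio.h>

/* Illustrative sketch only: derive an 8-bank memory address from a sample
   address by summing its radix-4 digits (dibits) modulo 4 - the mechanism
   visible in the garbled listing above - then forming a 3-bit bank number.
   The exact combination used in the printed routine may differ. */
static void memory_bank_address (int address, int *bank, int *offset)
{
   int dibit_sum = 0, a = address >> 1;   /* sum dibits of the word-pair address */
   while (a > 0) { dibit_sum += (a % 4); a >>= 2; }
   *bank   = ((dibit_sum % 4) << 1) + (address % 2);  /* 3-bit bank number, 0..7 */
   *offset = address >> 3;                            /* word offset within bank */
}

int main (void)
{
   int bank, offset, n;
   for (n = 0; n < 16; n++)
   {
      memory_bank_address (n, &bank, &offset);
      printf ("address %2d -> bank %d, offset %d\n", n, bank, offset);
   }
   return 0;
}

The point of any such mapping is that the eight samples of a woctad land in eight distinct banks, so that a full double butterfly's worth of data can be read or written in a single parallel memory cycle.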

Appendix C: MATLAB Code for Parallel Reordering of Data via Dibit-Reversal Mapping

C.1  Listing for MATLAB Data Reordering Program

%
% Author: Dr. Keith Jones, March 2020.
% ------------------------------------
% This MatLab program is used for determining the distribution of naturally-ordered samples across
% partitioned memory after reordering via the dibit-reversal (DBR) mapping – the computer code is
% based upon the parallel algorithms described in Chapter 10 of the monograph.
%
% The computed addresses are held in the 2-D array, "DBR", whose values may be produced
% "on-the-fly" using simple additions, shifts and the pre-computed values of the 's' sequence, which
% may be stored in an LUT of length P. The addresses should be the same as those produced using
% the MatLab function, "digitrevorder", only now stored in 2-D form – the memory bank address
% followed by the time slot. The addresses may then be used as a check to ensure that the outputs
% produced by the program would be the same whether using the 1-D "digitrevorder" function or
% the 2-D array "DBR" of computed memory addresses.
%
% The first index of the 2-D DBR array corresponds to the memory bank address, which ranges
% from 1 to 8 for compatibility with the use of the regularized FHT, whilst the second corresponds
% to the time slot for that memory bank; the value of DBR contains the address of the DBR-reordered
% sample residing within the memory with that particular combination of memory bank address
% and time slot.
%
% The program lists, for each woctad of naturally-ordered samples:
%
%   1) the corresponding set of naturally-ordered sample addresses
%   2) the corresponding set of DBR-reordered sample addresses
%   3) the memory bank address of each DBR-reordered sample
%   4) the number of DBR-reordered samples falling within each memory bank
%
% and, for the full naturally-ordered data set:
%
%   5) the number of DBR-reordered samples falling within each memory bank
%
% Note: 1 woctad = 8 samples, 1 per memory bank
%


clear all;
%
% # *****************************************************************
%
% # Set up program parameters.
%
N = 64;    % no of samples in data set: power of 4 integer
B = 8;     % no of memory banks - should not be changed
M = N/B;   % no of samples per memory bank
P = M/16;  % short address increment for when N > 64
%
% # Set up data storage option.
%
inc = 1;   % naturally-ordered data storage scheme: 1 => 'row-wise', 2 => 'column-wise'
%
% # Set up 1-D array for storing naturally-ordered data addresses.
%
xords = 0 : 1 : (N-1);
%
% # Set up 1-D array for storing DBR-reordered data addresses.
%
yords = digitrevorder(xords,4);
%
% # Set up 2-D array for storing DBR-reordered data addresses.
%
DBR = zeros(B,M);
%
if (N == 16)
%
   DBR(1,1) = 1;  DBR(1,2) = 3;  DBR(2,1) = 5;  DBR(2,2) = 7;
   DBR(3,1) = 9;  DBR(3,2) = 11; DBR(4,1) = 13; DBR(4,2) = 15;
   DBR(5,1) = 2;  DBR(5,2) = 4;  DBR(6,1) = 6;  DBR(6,2) = 8;
   DBR(7,1) = 10; DBR(7,2) = 12; DBR(8,1) = 14; DBR(8,2) = 16;
%
elseif (N == 64)
%
   DBR(1,1) = 1;  DBR(1,2) = 9;  DBR(1,3) = 2;  DBR(1,4) = 10;
   DBR(2,1) = 17; DBR(2,2) = 25; DBR(2,3) = 18; DBR(2,4) = 26;
   DBR(3,1) = 33; DBR(3,2) = 41; DBR(3,3) = 34; DBR(3,4) = 42;
   DBR(4,1) = 49; DBR(4,2) = 57; DBR(4,3) = 50; DBR(4,4) = 58;
   DBR(5,1) = 5;  DBR(5,2) = 13; DBR(5,3) = 6;  DBR(5,4) = 14;
   DBR(6,1) = 21; DBR(6,2) = 29; DBR(6,3) = 22; DBR(6,4) = 30;
   DBR(7,1) = 37; DBR(7,2) = 45; DBR(7,3) = 38; DBR(7,4) = 46;
   DBR(8,1) = 53; DBR(8,2) = 61; DBR(8,3) = 54; DBR(8,4) = 62;
%
   DBR(1,5) = 3;  DBR(1,6) = 11; DBR(1,7) = 4;  DBR(1,8) = 12;
   DBR(2,5) = 19; DBR(2,6) = 27; DBR(2,7) = 20; DBR(2,8) = 28;
   DBR(3,5) = 35; DBR(3,6) = 43; DBR(3,7) = 36; DBR(3,8) = 44;
   DBR(4,5) = 51; DBR(4,6) = 59; DBR(4,7) = 52; DBR(4,8) = 60;
   DBR(5,5) = 7;  DBR(5,6) = 15; DBR(5,7) = 8;  DBR(5,8) = 16;
   DBR(6,5) = 23; DBR(6,6) = 31; DBR(6,7) = 24; DBR(6,8) = 32;
   DBR(7,5) = 39; DBR(7,6) = 47; DBR(7,7) = 40; DBR(7,8) = 48;
   DBR(8,5) = 55; DBR(8,6) = 63; DBR(8,7) = 56; DBR(8,8) = 64;
%
else
%
%  # Set up 1-D "s-sequence", as referred to in Chapter 10 of monograph, for storage in LUT
%    and use in generating "DBR" sequence "on-the-fly" for when N > 64.
%
   s = zeros(1,P);
%
   if (N
%
% # Check for addressing errors: Errors = 0 => agreement, > 0 => disagreement.
%
if (Errors > 0)
   fprintf('\n # No of detected addressing errors = %6d\n\n',Errors');
end
%
% # End of program.
%

C.2  Discussion

The aim of the computer code provided above has been to show that the 2-D version of the DBR mapping, as required for the reordering of NAT-ordered data to be transferred from one partitioned memory to another – with each partitioned memory comprising eight memory banks, as required for compatibility with the R24 FHT – can be efficiently generated, on the fly, using a couple of small LUTs combined with the application of a relatively small number of precomputed address increments. The in-built MATLAB function, 'digitrevorder', was used to prove the mathematical/logical correctness of operation of the proposed data reordering scheme.
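For readers without access to MATLAB, a minimal C sketch of the underlying dibit-reversal mapping that digitrevorder(x,4) computes – here with 0-based addresses, whereas the listing above uses MATLAB's 1-based indexing – is as follows:

#include <stdio.h>

/* Reverse the base-4 digits (dibits) of a 0-based sample address;
   N must be a power of 4. This mirrors what MATLAB's digitrevorder(x,4)
   computes for the address vector 0..N-1. */
static int dibit_reverse (int address, int N)
{
   int reversed = 0, n;
   for (n = N; n > 1; n >>= 2)          /* one iteration per dibit */
   {
      reversed = (reversed << 2) | (address & 3);
      address >>= 2;
   }
   return reversed;
}

int main (void)
{
   int n, N = 16;
   for (n = 0; n < N; n++) printf ("%2d -> %2d\n", n, dibit_reverse (n, N));
   return 0;
}

Applying the 2-D memory bank mapping to the dibit-reversed addresses produced this way should reproduce the hard-coded DBR tables of the MATLAB listing, up to the 0-based/1-based offset.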

Glossary

ADC      analog-to-digital conversion
ASIC     application-specific integrated circuit
AWGN     additive white Gaussian noise
CCT      circular convolution theorem
CLB      configurable logic block
CN       linear space of complex-valued N-tuples
CORDIC   Co-Ordinate Rotation DIgital Computer
CRT      Chinese remainder theorem
DA       distributed arithmetic
DBR      dibit-reversal
DCT      discrete cosine transform
DDC      digital down-conversion
DFT      discrete Fourier transform
DHT      discrete Hartley transform
DIF      decimation-in-frequency
DIT      decimation-in-time
DME      even-stream data memory
DMIN     intermediate data memory
DMO      odd-stream data memory
DSM      data-space memory
DSP      digital signal processing
DTMF     dual-tone multi-frequency
FDM      frequency-division multiplexed
FFT      fast Fourier transform
FHT      fast Hartley transform
FNT      Fermat number transform
FPGA     field-programmable gate array
GD-BFLY  generic double butterfly
HDL      hardware description language


HSM      Hartley-space memory
IF       intermediate frequency
IMD      inter-modulation distortion
IMP      inter-modulation product
I/O      input/output
LSB      least-significant bit
LUT      look-up table
MAC      multiplier-and-accumulator
MNT      Mersenne number transform
MSB      most-significant bit
NTT      number-theoretic transform
O        'Big-Oh' notation for order of complexity
PA       power amplifier
PCM      trigonometric coefficient memory of PE
PCS      computational stage of PE
PDM      data memory of PE
PE       processing element
PSD      power spectral density
RAM      random-access memory
RCM      row-column method
R24 FHT  regularized radix-4 fast Hartley transform
RF       radio frequency
RN       linear space of real-valued N-tuples
RN×N     linear space of real-valued N×N square arrays
ROM      read-only memory
SDHT     separable DHT
SFDR     spurious-free dynamic range
SFG      signal flow graph
SIMD     single-instruction multiple-data
SNR      signal-to-noise ratio
SWAP     size, weight and power
TDOA     time-difference-of-arrival
TOA      time-of-arrival
woctad   set of eight words obtained from four or eight memory blocks
