Matrix and Tensor Decompositions in Signal Processing
ISBN 1786301555, 9781786301550


Table of contents:
Cover
Half-Title Page
Title Page
Copyright Page
Contents
Introduction
I.1. What are the advantages of tensor approaches?
I.2. For what uses?
I.3. In what fields of application?
I.4. With what tensor decompositions?
I.5. With what cost functions and optimization algorithms?
I.6. Brief description of content
1. Matrix Decompositions
1.1. Introduction
1.2. Overview of the most common matrix decompositions
1.3. Eigenvalue decomposition
1.3.1. Reminders about the eigenvalues of a matrix
1.3.2. Eigendecomposition and properties
1.3.3. Special case of symmetric/Hermitian matrices
1.3.4. Application to compute the powers of a matrix and a matrix polynomial
1.3.5. Application to compute a state transition matrix
1.3.6. Application to compute the transfer function and the output of a discrete-time linear system
1.4. URVH decomposition
1.5. Singular value decomposition
1.5.1. Definition and properties
1.5.2. Reduced SVD and dyadic decomposition
1.5.3. SVD and fundamental subspaces associated with a matrix
1.5.4. SVD and the Moore–Penrose pseudo-inverse
1.5.5. SVD computation
1.5.6. SVD and matrix norms
1.5.7. SVD and low-rank matrix approximation
1.5.8. SVD and orthogonal projectors
1.5.9. SVD and LS estimator
1.5.10. SVD and polar decomposition
1.5.11. SVD and PCA
1.5.12. SVD and blind source separation
1.6. CUR decomposition
2. Hadamard, Kronecker and Khatri–Rao Products
2.1. Introduction
2.2. Notation
2.3. Hadamard product
2.3.1. Definition and identities
2.3.2. Fundamental properties
2.3.3. Basic relations
2.3.4. Relations between the diag operator and Hadamard product
2.4. Kronecker product
2.4.1. Kronecker product of vectors
2.4.2. Kronecker product of matrices
2.4.3. Rank, trace, determinant and spectrum of a Kronecker product
2.4.4. Structural properties of a Kronecker product
2.4.5. Inverse and Moore–Penrose pseudo-inverse of a Kronecker product
2.4.6. Decompositions of a Kronecker product
2.5. Kronecker sum
2.5.1. Definition
2.5.2. Properties
2.6. Index convention
2.6.1. Writing vectors and matrices with the index convention
2.6.2. Basic rules and identities with the index convention
2.6.3. Matrix products and index convention
2.6.4. Kronecker products and index convention
2.6.5. Vectorization and index convention
2.6.6. Vectorization formulae
2.6.7. Vectorization of partitioned matrices
2.6.8. Traces of matrix products and index convention
2.7. Commutation matrices
2.7.1. Definition
2.7.2. Properties
2.7.3. Kronecker product and permutation of factors
2.7.4. Multiple Kronecker product and commutation matrices
2.7.5. Block Kronecker product
2.7.6. Strong Kronecker product
2.8. Relations between the diag operator and the Kronecker product
2.9. Khatri–Rao product
2.9.1. Definition
2.9.2. Khatri–Rao product and index convention
2.9.3. Multiple Khatri–Rao product
2.9.4. Properties
2.9.5. Identities
2.9.6. Khatri–Rao product and permutation of factors
2.9.7. Trace of a product of matrices and Khatri–Rao product
2.10. Relations between vectorization and Kronecker and Khatri–Rao products
2.11. Relations between the Kronecker, Khatri–Rao and Hadamard products
2.12. Applications
2.12.1. Partial derivatives and index convention
2.12.2. Solving matrix equations
3. Tensor Operations
3.1. Introduction
3.2. Notation and particular sets of tensors
3.3. Notion of slice
3.3.1. Fibers
3.3.2. Matrix and tensor slices
3.4. Mode combination
3.5. Partitioned tensors or block tensors
3.6. Diagonal tensors
3.6.1. Case of a tensor X ∈ K[N;I]
3.6.2. Case of a square tensor
3.6.3. Case of a rectangular tensor
3.7. Matricization
3.7.1. Matricization of a third-order tensor
3.7.2. Matrix unfoldings and index convention
3.7.3. Matricization of a tensor of order N
3.7.4. Tensor matricization by index blocks
3.8. Subspaces associated with a tensor and multilinear rank
3.9. Vectorization
3.9.1. Vectorization of a tensor of order N
3.9.2. Vectorization of a third-order tensor
3.10. Transposition
3.10.1. Definition of a transpose tensor
3.10.2. Properties of transpose tensors
3.10.3. Transposition and tensor contraction
3.11. Symmetric/partially symmetric tensors
3.11.1. Symmetric tensors
3.11.2. Partially symmetric/Hermitian tensors
3.11.3. Multilinear forms with Hermitian symmetry and Hermitian tensors
3.11.4. Symmetrization of a tensor
3.12. Triangular tensors
3.13. Multiplication operations
3.13.1. Outer product of tensors
3.13.2. Tensor–matrix multiplication
3.13.3. Tensor–vector multiplication
3.13.4. Mode-(p, n) product
3.13.5. Einstein product
3.14. Inverse and pseudo-inverse tensors
3.15. Tensor decompositions in the form of factorizations
3.15.1. Eigendecomposition of a symmetric square tensor
3.15.2. SVD decomposition of a rectangular tensor
3.15.3. Connection between SVD and HOSVD
3.15.4. Full-rank decomposition
3.16. Inner product, Frobenius norm and trace of a tensor
3.16.1. Inner product of two tensors
3.16.2. Frobenius norm of a tensor
3.16.3. Trace of a tensor
3.17. Tensor systems and homogeneous polynomials
3.17.1. Multilinear systems based on the mode-n product
3.17.2. Tensor systems based on the Einstein product
3.17.3. Solving tensor systems using LS
3.18. Hadamard and Kronecker products of tensors
3.19. Tensor extension
3.20. Tensorization
3.21. Hankelization
4. Eigenvalues and Singular Values of a Tensor
4.1. Introduction
4.2. Eigenvalues of a tensor of order greater than two
4.2.1. Different definitions of the eigenvalues of a tensor
4.2.2. Positive/negative (semi-)definite tensors
4.2.3. Orthogonally/unitarily similar tensors
4.3. Best rank-one approximation
4.4. Orthogonal decompositions
4.5. Singular values of a tensor
5. Tensor Decompositions
5.1. Introduction
5.2. Tensor models
5.2.1. Tucker model
5.2.2. Tucker-(N1,N) model
5.2.3. Tucker model of a transpose tensor
5.2.4. Tucker decomposition and multidimensional Fourier transform
5.2.5. PARAFAC model
5.2.6. Block tensor models
5.2.7. Constrained tensor models
5.3. Examples of tensor models
5.3.1. Model of multidimensional harmonics
5.3.2. Source separation
5.3.3. Model of a FIR system using fourth-order output cumulants
Appendix. Random Variables and Stochastic Processes
A1.1. Introduction
A1.2. Random variables
A1.2.1. Real scalar random variables
A1.2.2. Real multidimensional random variables
A1.2.3. Gaussian distribution
A1.3. Discrete-time random signals
A1.3.1. Second-order statistics
A1.3.2. Stationary and ergodic random signals
A1.3.3. Higher order statistics of random signals
A1.4. Application to system identification
A1.4.1. Case of linear systems
A1.4.2. Case of homogeneous quadratic systems
References
Index
Other titles from ISTE in Digital Signal and Image Processing

Matrix and Tensor Decompositions in Signal Processing

Matrices and Tensors with Signal Processing Set coordinated by Gérard Favier

Volume 2

Matrix and Tensor Decompositions in Signal Processing

Gérard Favier

First published 2021 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK

John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA

www.iste.co.uk

www.wiley.com

© ISTE Ltd 2021

The rights of Gérard Favier to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2021938218

British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-78630-155-0


Introduction

The first book of this series was dedicated to introducing matrices and tensors (of order greater than two) from the perspective of their algebraic structure, presenting their similarities, differences and connections with representations of linear, bilinear and multilinear mappings. This second volume will now study tensor operations and decompositions in greater depth. In this introduction, we will motivate the use of tensors by answering five questions that prospective users might and should ask:

– What are the advantages of tensor approaches?
– For what uses?
– In what fields of application?
– With what tensor decompositions?
– With what cost functions and optimization algorithms?

Although our answers are necessarily incomplete, our aim is to:

– present the advantages of tensor approaches over matrix approaches;
– show a few examples of how tensor tools can be used;
– give an overview of the extensive diversity of problems that can be solved using tensors, including a few example applications;
– introduce the three most widely used tensor decompositions, presenting some of their properties and comparing their parametric complexity;
– state a few problems based on tensor models in terms of the cost functions to be optimized;


– describe various types of tensor-based processing, with a brief glimpse of the optimization methods that can be used.

I.1. What are the advantages of tensor approaches?

In most applications, a tensor X of order N is viewed as an array of real or complex numbers. The current element of the tensor is denoted x_{i_1,...,i_N}, where each index i_n ∈ {1, ..., I_n}, for n ∈ {1, ..., N}, is associated with the nth mode, and I_n is its dimension, i.e. the number of elements for the nth mode. The order of the tensor is the number N of indices, i.e. the number of modes. Tensors are written with calligraphic letters¹. An Nth-order tensor with entries x_{i_1,...,i_N} is written X = [x_{i_1,...,i_N}] ∈ K^{I_1×···×I_N}, where K = R or C, depending on whether the tensor is real-valued or complex-valued, and I_1 × ··· × I_N represents the size of X.

In general, a mode (also called a way) can have one of the following interpretations: (i) as a source of information (user, patient, client, trial, etc.); (ii) as a type of entity attached to the data (items/products, types of music, types of film, etc.); (iii) as a tag that characterizes an item, a piece of music, a film, etc.; (iv) as a recording modality that captures diversity in various domains (space, time, frequency, wavelength, polarization, color, etc.).

Thus, a digital image in color can be represented as a three-dimensional tensor (of pixels) with two spatial modes, one for the rows (width) and one for the columns (height), and one channel mode (RGB colors). For example, a color image can be represented as a tensor of size 1024 × 768 × 3, where the third mode corresponds to the intensity of the three RGB colors (red, green, blue). For a volumetric image, there are three spatial modes (width × height × depth), and the points of the image are called voxels. In the context of hyperspectral imagery, in addition to the two spatial dimensions, there is a third dimension corresponding to the emission wavelength within a spectral band.

Tensor approaches benefit from the following advantages over matrix approaches:

– the essential uniqueness property², satisfied by some tensor decompositions, such as PARAFAC (parallel factors) (Harshman 1970) under certain mild conditions; for matrix decompositions, this property requires certain restrictive conditions on the factor matrices, such as orthogonality, non-negativity, or a specific structure (triangular, Vandermonde, Toeplitz, etc.);
– the ability to solve certain problems, such as the identification of communication channels, directly from measured signals, without requiring the calculation of high-order statistics of these signals or the use of long pilot sequences. The resulting deterministic and semi-blind processings can be performed with signal recordings that are shorter than those required by statistical methods, based on the estimation of high-order moments or cumulants. For the blind source separation problem, tensor approaches can be used to tackle the case of underdetermined systems, i.e. systems with more sources than sensors;
– the possibility of compressing big data sets via a data tensorization and the use of a tensor decomposition, in particular, a low multilinear rank approximation;
– a greater flexibility in representing and processing multimodal data by considering the modalities separately, instead of stacking the corresponding data into a vector or a matrix. This allows the multilinear structure of data to be preserved, meaning that interactions between modes can be taken into account;
– a greater number of modalities can be incorporated into tensor representations of data, meaning that more complementary information is available, which allows the performance of certain systems to be improved, e.g. wireless communication, recommendation, diagnostic, and monitoring systems, by making detection, interpretation, recognition, and classification operations easier and more efficient. This led to a generalization of certain matrix algorithms, like SVD (singular value decomposition) to MLSVD (multilinear SVD), also known as HOSVD (higher order SVD) (de Lathauwer et al. 2000a); similarly, certain signal processing algorithms were generalized, like PCA (principal component analysis) to MPCA (multilinear PCA) (Lu et al. 2008) or TRPCA (tensor robust PCA) (Lu et al. 2020) and ICA (independent component analysis) to MICA (multilinear ICA) (Vasilescu and Terzopoulos 2005) or tensor PICA (probabilistic ICA) (Beckmann and Smith 2005).

It is worth noting that, with a tensor model, the number of modalities considered in a problem can be increased either by increasing the order of the data tensor or by coupling tensor and/or matrix decompositions that share one or several modes. Such a coupling approach is called data fusion using a coupled tensor/matrix factorization. Two examples of this type of coupling are presented later in this introductory chapter. In the first, EEG signals are coupled with functional magnetic resonance imaging (fMRI) data to analyze the brain function; in the second, hyperspectral and multispectral images are merged for remote sensing. The other approach, namely, increasing the number of modalities, will be illustrated in Volume 3 of this series by giving a unified presentation of various models of wireless communication systems designed using tensors. In order to improve system performance, both in terms of transmission and reception, the idea is to employ multiple types of diversity simultaneously in various domains (space, time, frequency, code, etc.), each type of diversity being associated with a mode of the tensor of received signals. Coupled tensor models will also be presented in the context of cooperative communication systems with relays.

¹ Scalars, vectors, and matrices are written in lowercase, bold lowercase, and bold uppercase letters, respectively: a, a, A.
² A decomposition satisfies the essential uniqueness property if it is unique up to permutation and scaling factors in the columns of its factor matrices.
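As a concrete illustration of modes, fibers and slices, here is a minimal NumPy sketch built around the 1024 × 768 × 3 color image example of section I.1 (NumPy, the pixel values and the variable names are illustrative assumptions, not material from the book):

```python
import numpy as np

# A color image as a third-order tensor: width x height x RGB channel.
image = np.zeros((1024, 768, 3))

# Element x_{i1,i2,i3}: intensity of channel i3 at pixel (i1, i2).
image[0, 0, :] = [255, 128, 0]     # set the RGB values of one pixel

# A mode-3 fiber: all channel values at a fixed spatial position.
fiber = image[10, 20, :]           # shape (3,)

# A slice: one color channel seen as a matrix.
red_channel = image[:, :, 0]       # shape (1024, 768)

print(image.ndim, image.shape)     # 3 (1024, 768, 3): order and size of the tensor
```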


I.2. For what uses?

In the big data³ era, digital information processing plays a key role in various fields of application. Each field has its own specificities and requires specialized, often multidisciplinary, skills to manage both the multimodality of the data and the processing techniques that need to be implemented. Thus, the “intelligent” information processing systems of the future will have to integrate representation tools, such as tensors and graphs, and signal and image processing methods, with artificial intelligence techniques based on artificial neural networks and machine learning. The needs of such systems are diverse and numerous – whether in terms of storage, visualization (3D representation, virtual reality, dissemination of works of art), transmission, imputation, prediction/forecasting, analysis, classification or fusion of multimodal and heterogeneous data. The reader is invited to refer to Lahat et al. (2015) and Papalexakis et al. (2016) for a presentation of various examples of data fusion and data mining based on tensor models.

³ Big data is characterized by 3Vs (Volume, Variety, Velocity) linked to the size of the data set, the heterogeneity of the data and the rate at which it is captured, stored and processed.

Some of the key applications of tensor tools are as follows:

– decomposition or separation of heterogeneous datasets into components/factors or subspaces with the goal of exploiting the multimodal structure of the data and extracting useful information for users from uncertain or noisy data or measurements provided by different sources of information and/or types of sensor. Thus, features can be extracted in different domains (spatial, temporal, frequential) for classification and decision-making tasks;
– imputation of missing data within an incomplete database using a low-rank tensor model, where the missing data results from defective sensors or communication links, for example. This task is called tensor completion and is a higher order generalization of matrix completion (Candès and Recht 2009; Signoretto et al. 2011; Liu et al. 2013);
– recovery of useful information from compressed data by reconstructing a signal or an image that has a sparse representation in a predefined basis, using compressive sampling (CS; also known as compressed sensing) techniques (Candès and Wakin 2008; Candès and Plan 2010), applied to sparse, low-rank tensors (Sidiropoulos and Kyrillidis 2012);
– fusion of data using coupled tensor and matrix decompositions;
– design of cooperative multi-antenna communication systems (also called MIMO (multiple-input multiple-output) systems); this type of application, which led to the development of several new tensor models, will be considered in the next two volumes of this series;
– multilinear compressive learning that combines compressed sensing with machine learning;
– reduction of the dimensionality of multimodal, heterogeneous databases with very large dimensions (big data) by solving a low-rank tensor approximation problem;
– multiway filtering and tensor data denoising.

Tensors can also be used to tensorize neural networks with fully connected layers by expressing the weight matrix of a layer as a tensor train (TT) whose cores represent the parameters of the layer. This considerably reduces the parametric complexity and, therefore, the storage space. This compression property of the information contained in layered neural networks when using tensor decompositions provides a way to increase the number of hidden units (Novikov et al. 2015). Tensors, when used together with multilayer perceptron neural networks to solve classification problems, achieve lower error rates with fewer parameters and less computation time than neural networks alone (Chien and Bao 2017). Neural networks can also be used to learn the rank of a tensor (Zhou et al. 2019), or to compute its eigenvalues and singular values, and hence the rank-one approximation of a tensor (Che et al. 2017).

I.3. In what fields of application?

Tensors have applications in many domains. The fields of psychometrics and chemometrics in the 1970s and 1990s paved the way for signal and image processing applications, such as blind source separation, digital communications, and computer vision in the 1990s and early 2000s. Today, there is a quantitative explosion of big data in medicine, astronomy, meteorology, with fifth-generation wireless communications (5G), for medical diagnostic aid, web services delivered by recommendation systems (video on demand, online sales, restaurant and hotel reservations, etc.), as well as for information searching within multimedia databases (texts, images, audio and video recordings) and with social networks. This explains why various scientific communities and the industrial world are showing a growing interest in tensors.

Among the many examples of applications of tensors for signal and image processing, we can mention:

– blind source separation and blind system identification. These problems play a fundamental role in signal processing. They involve separating the input signals (also called sources) and identifying a system from the knowledge of only the output signals and certain hypotheses about the input signals, such as statistical independence in the case of independent component analysis (Comon 1994), or the assumption of a finite alphabet in the context of digital communications. This type of processing is, in particular, used to jointly estimate communication channels and information symbols emitted by a transmitter. It can also be used for speech or music separation, or to process seismic signals;
– use of tensor decompositions to analyze biomedical signals (EEG, MEG, ECG, EOG⁴) in the space, time and frequency domains, in order to provide a medical diagnostic aid; for instance, Acar et al. (2007) used a PARAFAC model of EEG signals to analyze epileptic seizures; Becker et al. (2014) used the same type of decomposition to locate sources within EEG signals;
– analysis of brain activity by merging imaging data (fMRI) and biomedical signals (EEG and MEG) with the goal of enabling non-invasive medical tests (see Table I.4);
– analysis and classification of hyperspectral images used in many fields (medicine, environment, agriculture, monitoring, astrophysics, etc.). To improve the spatial resolution of hyperspectral images, Li et al. (2018) merged hyperspectral and multispectral images using a coupled Tucker decomposition with a sparse core (coupled sparse tensor factorization (CSTF)) (see Table I.4);
– design of semi-blind receivers for point-to-point or cooperative MIMO communication systems based on tensor models; see the overviews by de Almeida et al. (2016) and da Costa et al. (2018);
– modeling and identification of nonlinear systems via a tensor representation of Volterra kernels or Wiener–Hammerstein systems (see, for example, Kibangou and Favier 2009a, 2010; Favier and Kibangou 2009; Favier and Bouilloc 2009, 2010; Favier et al. 2012a);
– identification of tensor-based separable trilinear systems that are linear with respect to (w.r.t.) the input signal and trilinear w.r.t. the coefficients of the global impulse response, modeled as a Kronecker product of three individual impulse responses (Elisei-Iliescu et al. 2020). Note that such systems are to be compared with third-order Volterra filters that are linear w.r.t. the Volterra kernel coefficients and trilinear w.r.t. the input signal;
– facial recognition, based on face tensors, for purposes of authentication and identification in surveillance systems. For facial recognition, photos of people to recognize are stored in a database with different lighting conditions, different facial expressions, from multiple angles, for each individual. In Vasilescu and Terzopoulos (2002), the tensor of facial images is of order five, with dimensions 28 × 5 × 3 × 3 × 7943, corresponding to the modes: people × views × illumination × expressions × pixels per image. For an overview of various facial recognition systems, see Arachchilage and Izquierdo (2020);


– tensor-based anomaly detection used in monitoring and surveillance systems.

⁴ Electroencephalography (EEG), magnetoencephalography (MEG), electrocardiography (ECG) and electrooculography (EOG).

Table I.1 presents a few examples of signal and image tensors, specifying the nature of the modes in each case.

Signals | Modes | References
Antenna processing | space (antennas) × time × sensor subnetwork | (Sidiropoulos et al. 2000a)
Antenna processing | space × time × polarization | (Raimondi et al. 2017)
Digital communications | space (antennas) × time × code | (Sidiropoulos et al. 2000b)
Digital communications | antennas × blocks × symbol periods × code × frequencies | (Favier and de Almeida 2014b)
ECG | space (electrodes) × time × frequencies | (Acar et al. 2007; Padhy et al. 2019)
EEG | space (electrodes) × time × frequencies × subjects or trials | (Becker et al. 2014; Cong et al. 2015)
EEG + fMRI | subjects × electrodes × time + subjects × voxels (model with matrix and tensor factorizations coupled via the “subjects” mode) | (Acar et al. 2017)

Images | Modes | References
Color images | space (width) × space (height) × channel (colors) |
Videos in grayscale | space (width) × space (height) × time |
Videos in color | space × space × channel × time |
Hyperspectral images | space × space × spectral bands | (Makantasis et al. 2018)
Computer vision | people × views × illumination × expressions × pixels | (Vasilescu and Terzopoulos 2002)

Table I.1. Signal and image tensors

Other fields of application are considered in Table I.2. Below, we give some details about the application concerning recommendation systems, which play an important role in various websites. The goal of these systems is to help users to select items from tags that have been assigned to each item by users. These items could, for example, be movies, books, musical recordings, webpages, products for sale on an e-commerce site, etc. A standard recommendation system is based on the three following modes: users × items × tags.

Collaborative filtering techniques use the opinions of a set of people, or assessments from these people based on a rating system, to generate a list of recommendations for a specific user. This type of filtering is, for example, used by websites like Netflix for renting DVDs. Collaborative filtering methods are classified into three categories, depending on whether the filtering is based on (a) history and a similarity metric; (b) a model based on matrix factorization using algorithms like SVD or non-negative matrix factorization (NMF); (c) some combination of both, known as hybrid collaborative filtering techniques. See Luo et al. (2014) and Bokde et al. (2015) for approaches based on matrix factorization.
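To make the matrix-factorization variant of collaborative filtering concrete, here is a minimal sketch (not the algorithms of the cited references): a rank-R approximation of a small user × item rating matrix is built with a truncated SVD and used to score unrated items. The ratings, the convention that 0 means "unrated" and the choice R = 2 are all illustrative assumptions.

```python
import numpy as np

# Toy user x item rating matrix (0 = unrated, values chosen for illustration only).
Y = np.array([[5., 4., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [0., 1., 5., 4.]])

R = 2  # target rank of the approximation

# Truncated SVD: keep the R dominant singular triplets of Y.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
Y_hat = U[:, :R] @ np.diag(s[:R]) @ Vt[:R, :]

# Predicted score of item 2 for user 1 (unrated in Y above).
print(Y_hat[1, 2])
```

A tensor-based recommendation system extends the same idea to a users × items × tags (and possibly context) array by replacing the matrix factorization with one of the tensor decompositions introduced in section I.4.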


Other so-called passive filtering techniques exploit the data of a matrix of relations between items to deduce recommendations for a user from correlations between items and the user’s previous choices, without using any kind of rating system. This is known as a content-based approach.

Domains | Modes | References
Phonetics | subjects × vowels × formants | (Harshman 1970)
Chemometrics (fluorescence) | excitation × emission × samples (excitation/emission wavelengths) | (Bro 1997, 2006; Smilde et al. 2004)
Contextual recommendation systems | users × items × tags × context 1 × ··· × context N | (Rendle and Schmidt-Thieme 2010; Symeonidis and Zioupos 2016; Frolov and Oseledets 2017)
Transportation (speed measurements) | space (sensors) × time (days) × time (weeks) (periods of 15 s and 24 h) | (Goulart et al. 2017; Tan et al. 2013; Ran et al. 2016)
Music | types of music × frequencies × frequencies | (Panagakis et al. 2010)
Music | users × keywords × songs | (Nanopoulos et al. 2010)
Music | recordings × (audio) characteristics × segments | (Benetos and Kotropoulos 2008)
Bioinformatics | medicine × targets × diseases | (Wang et al. 2019)

Table I.2. Other fields of application

Recommendation systems can also use information about the users (age, nationality, geographic location, participation on social networks, etc.) and the items themselves (types of music, types of film, classes of hotels, etc.). This is called contextual information. Taking this additional information into account allows the relevance of the recommendations to be improved, at the cost of increasing the dimensionality and the complexity of the data representation model and, therefore, of the processing algorithms. This is why tensor approaches are so important for this type of application today.

Note that, for recommendation systems, the data tensors are sparse. Consequently, some tags can be automatically generated by the system based on similarity metrics between items. This is, for example, the case for music recommendations based on the acoustic characteristics of songs (Nanopoulos et al. 2010). Personalized tag recommendations take into account the user’s profile, preferences, and interests. The system can also help the user select existing tags or create new ones (Rendle and Schmidt-Thieme 2010). The articles by Bobadilla et al. (2013) and Frolov and Oseledets (2017) present various recommendation systems with many bibliographical references.

Operating according to a similar principle as recommendation systems, social network websites, such as Wikipedia, Facebook, or Twitter, allow different types of data to be exchanged and shared, content to be produced and connections to be established.


I.4. With what tensor decompositions?

It is important to note that, for an Nth-order tensor X ∈ K^{I_1×···×I_N}, the number of elements is Π_{n=1}^{N} I_n and, assuming I_n = I for all n ∈ {1, ..., N}, this number becomes I^N, which induces an exponential increase with the tensor order N. This is called the curse of dimensionality (Oseledets and Tyrtyshnikov 2009). For big data tensors, tensor decompositions play a fundamental role in alleviating this curse of dimensionality, due to the fact that the number of parameters that characterize the decompositions is generally much smaller than the number of elements in the original tensor.
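A quick numerical check of this exponential growth (a small illustrative script; the values I = 10 and R = 5 are arbitrary choices): the element count I^N explodes with the order N, whereas the parameter count N·I·R of a rank-R CPD model (see Table I.3 below) grows only linearly in N.

```python
# Number of elements of an Nth-order tensor with all dimensions equal to I,
# compared with the parameter count N*I*R of a rank-R CPD model.
I, R = 10, 5
for N in range(3, 9):
    elements = I ** N          # curse of dimensionality: exponential in N
    cpd_params = N * I * R     # CPD storage: linear in N
    print(N, elements, cpd_params)
```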

We now introduce three basic decompositions: PARAFAC/CANDECOMP/CPD, TD and TT⁵. The first two are studied in depth in Chapter 5, whereas the third, briefly introduced in Chapter 3, will be considered in more detail in Volume 3.

⁵ PARAFAC for parallel factors; CANDECOMP for canonical decomposition; CPD for canonical polyadic decomposition; TD for Tucker decomposition; TT for tensor train.

Table I.3 gives the expression of the element x_{i_1,...,i_N} of a tensor X ∈ K^{I_1×···×I_N} of order N and size I_1 × ··· × I_N, either real (K = R) or complex (K = C), for each of the three decompositions cited above. Their parametric complexity is compared in terms of the size of each matrix and tensor factor, assuming I_n = I and R_n = R for all n ∈ {1, ..., N}. Figures I.1–I.3 show graphical representations of the PARAFAC model ⟦A^(1), A^(2), A^(3); R⟧ and the TD model ⟦G; A^(1), A^(2), A^(3)⟧ for a third-order tensor X ∈ K^{I_1×I_2×I_3}, and of the TT model ⟦A^(1), A^(2), A^(3), A^(4)⟧ for a fourth-order tensor X ∈ K^{I_1×I_2×I_3×I_4}. In the case of the PARAFAC model, we define A^(n) ≜ [a_1^(n), ..., a_R^(n)] ∈ K^{I_n×R} using its columns, for n ∈ {1, 2, 3}.

Figure I.1. Third-order PARAFAC model

We can make a few remarks about each of these decompositions:

– The PARAFAC decomposition (Harshman 1970), also known as CANDECOMP (Carroll and Chang 1970) or CPD (Hitchcock 1927), of an Nth-order tensor X is a sum of R rank-one tensors, each defined as the outer product of one column from each of the N matrix factors A^(n) ∈ K^{I_n×R}. When R is minimal, it is called the rank of the tensor. If the matrix factors satisfy certain conditions, this decomposition has the essential uniqueness property. See Figure I.1 for a third-order tensor (N = 3), and Chapter 5 for a detailed presentation.

Decompositions | Notation | Element x_{i_1,...,i_N} | Parameters | Complexity
PARAFAC / CPD | ⟦A^(1), ..., A^(N); R⟧ | Σ_{r=1}^{R} Π_{n=1}^{N} a^(n)_{i_n,r} | A^(n) ∈ K^{I_n×R}, for all n ∈ {1, ..., N} | O(NIR)
TD | ⟦G; A^(1), ..., A^(N)⟧ | Σ_{r_1=1}^{R_1} ··· Σ_{r_N=1}^{R_N} g_{r_1,...,r_N} Π_{n=1}^{N} a^(n)_{i_n,r_n} | G ∈ K^{R_1×···×R_N}, A^(n) ∈ K^{I_n×R_n}, for all n ∈ {1, ..., N} | O(NIR + R^N)
TT | ⟦A^(1), A^(2), ..., A^(N−1), A^(N)⟧ | Σ_{r_1=1}^{R_1} ··· Σ_{r_{N−1}=1}^{R_{N−1}} Π_{n=1}^{N} a^(n)_{r_{n−1},i_n,r_n} | A^(1) ∈ K^{I_1×R_1}, A^(N) ∈ K^{R_{N−1}×I_N}, A^(n) ∈ K^{R_{n−1}×I_n×R_n} for all n ∈ {2, 3, ..., N−1} | O(2IR + (N−2)IR^2)

Table I.3. Parametric complexity of the CPD, TD, and TT decompositions

Figure I.2. Third-order Tucker model

Figure I.3. Fourth-order TT model
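As a sanity check of the CPD element formula in Table I.3, the following sketch rebuilds a third-order tensor from its factor matrices (the sizes, the random factors and the use of np.einsum are illustrative assumptions; einsum is just one convenient way to code the sum of R rank-one terms):

```python
import numpy as np

I1, I2, I3, R = 4, 5, 6, 3
rng = np.random.default_rng(0)

# Factor matrices A^(1), A^(2), A^(3) of a rank-R PARAFAC/CPD model.
A1 = rng.normal(size=(I1, R))
A2 = rng.normal(size=(I2, R))
A3 = rng.normal(size=(I3, R))

# x_{i,j,k} = sum_r A1[i,r] * A2[j,r] * A3[k,r]  (sum of R rank-one tensors)
X = np.einsum('ir,jr,kr->ijk', A1, A2, A3)

# Equivalent outer-product form, term by term.
X_check = sum(np.multiply.outer(np.multiply.outer(A1[:, r], A2[:, r]), A3[:, r])
              for r in range(R))
print(np.allclose(X, X_check))  # True
```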

– The Tucker decomposition (Tucker 1966) can be viewed as a generalization of the PARAFAC decomposition that takes into account all the interactions between the columns of the matrix factors A^(n) ∈ K^{I_n×R_n} via the introduction of a core tensor G ∈ K^{R_1×···×R_N}. This decomposition is not unique in general. Note that, if R_n ≤ I_n for all n ∈ {1, ..., N}, then the core tensor G provides a compressed form of X. If R_n, for n ∈ {1, ..., N}, is chosen as the rank of the mode-n matrix unfolding⁶ of X, then the N-tuple (R_1, ..., R_N) is minimal, and it is called the multilinear rank of the tensor. Such a Tucker decomposition can be obtained using the truncated high-order SVD (THOSVD), under the constraint of column-orthonormal matrices A^(n) (de Lathauwer et al. 2000a). This algorithm is described in section 5.2.1.8. See Figure I.2 for a third-order tensor, and Chapter 5 for a detailed presentation.

– The TT decomposition (Oseledets 2011) is composed of a train of third-order tensors A^(n) ∈ K^{R_{n−1}×I_n×R_n}, for n ∈ {2, 3, ..., N−1}, the first and last carriages of the train being matrices A^(1) ∈ K^{I_1×R_1} and A^(N) ∈ K^{R_{N−1}×I_N}, which implies r_0 = r_N = 1, and therefore a^(1)_{r_0,i_1,r_1} = a^(1)_{i_1,r_1} and a^(N)_{r_{N−1},i_N,r_N} = a^(N)_{r_{N−1},i_N}. The dimensions R_n, for n ∈ {1, ..., N−1}, called the TT ranks, are given by the ranks of some matrix unfoldings of the original tensor. This decomposition has been used to solve the tensor completion problem (Grasedyck et al. 2015; Bengua et al. 2017), for facial recognition (Brandoni and Simoncini 2020) and for modeling MIMO communication channels (Zniyed et al. 2020), among many other applications. A brief description of the TT decomposition is given in section 3.13.4 using the mode-(p, n) product. Note that a specific SVD-based algorithm, called TT-SVD, was proposed by Oseledets (2011) for computing a TT decomposition. This decomposition and the hierarchical Tucker (HT) one (Grasedyck and Hackbush 2011; Ballani et al. 2013) are special cases of tensor networks (TNs) (Cichocki 2014), as will be discussed in more detail in the next volume.

⁶ See definition [3.41], in Chapter 3, of the mode-n matrix unfolding X_n of a tensor X, whose columns are the mode-n vectors obtained by fixing all indices but the nth one.
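The same kind of check can be written for the Tucker and TT element formulas of Table I.3 (again a sketch with arbitrary small dimensions and random entries; the einsum index strings simply transcribe the summations of the table):

```python
import numpy as np

rng = np.random.default_rng(1)
I1, I2, I3 = 4, 5, 6

# Tucker model: core G of size R1 x R2 x R3 and factors A^(n) of size In x Rn.
R1, R2, R3 = 2, 3, 2
G = rng.normal(size=(R1, R2, R3))
A1 = rng.normal(size=(I1, R1))
A2 = rng.normal(size=(I2, R2))
A3 = rng.normal(size=(I3, R3))
X_tucker = np.einsum('pqr,ip,jq,kr->ijk', G, A1, A2, A3)

# TT model (third order): matrices B1 (I1 x S1), B3 (S2 x I3) and a core B2 (S1 x I2 x S2).
S1, S2 = 2, 3
B1 = rng.normal(size=(I1, S1))
B2 = rng.normal(size=(S1, I2, S2))
B3 = rng.normal(size=(S2, I3))
X_tt = np.einsum('ia,ajb,bk->ijk', B1, B2, B3)

print(X_tucker.shape, X_tt.shape)  # (4, 5, 6) (4, 5, 6)
```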


From this brief description of the three tensor models, one can conclude that, unlike matrices, the notion of rank is not unique for tensors, since it depends on the decomposition used. Thus, as mentioned above, one defines the tensor rank (also called the canonical rank or Kruskal’s rank) associated with the PARAFAC decomposition, the multilinear rank that relies on the Tucker model, and the TT-ranks linked with the TT decomposition.

It is important to note that the number of characteristic parameters of the PARAFAC and TT decompositions is proportional to N, the order of the tensor, whereas the parametric complexity of the Tucker decomposition increases exponentially with N. This is why the first two decompositions are especially valuable for large-scale problems. Although the Tucker model is not unique in general, imposing an orthogonality constraint on the matrix factors yields the HOSVD decomposition, a truncated form of which gives an approximate solution to the best low multilinear rank approximation problem (de Lathauwer et al. 2000a). This solution, which is based on an a priori choice of the dimensions R_n of the core tensor, is to be compared with the truncated SVD in the matrix case, although it does not have the same optimality property. It is widely used to reduce the parametric complexity of data tensors.

From the above, it can be concluded that the TT model combines the advantages of the other two decompositions, in terms of parametric complexity (like PARAFAC) and numerical stability (like the Tucker model), due to a parameter estimation algorithm based on a calculation of SVDs.

To illustrate the use of the PARAFAC decomposition, let us consider the case of multi-user mobile communications with a CDMA (code-division multiple access) encoding system. The multiple access technique allows multiple emitters to simultaneously transmit information over the same communication channel by assigning a code to each emitter. The information is transmitted as symbols s_{n,m}, with n ∈ {1, ..., N} and m ∈ {1, ..., M}, where N and M are the number of transmission time slots, i.e. the number of symbol periods, and the number of emitting antennas, respectively. The symbols belong to a finite alphabet that depends on the modulation being used. They are encoded with a space-time coding that introduces code diversity by repeating each symbol P times with a code c_{p,m} assigned to the mth emitting antenna, p ∈ {1, ..., P}, where P denotes the length of the spreading code. The signal received by the kth receiving antenna, during the nth symbol period and the pth chip period, is a linear combination of the symbols encoded and transmitted by the M emitting antennas:

    x_{k,n,p} = Σ_{m=1}^{M} h_{k,m} s_{n,m} c_{p,m},        [I.1]

where h_{k,m} is the fading coefficient of the communication channel between the receiving antenna k and the emitting antenna m.
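A minimal simulation of [I.1] is sketched below (the dimensions and the random complex Gaussian entries are illustrative assumptions; a real system would use a finite symbol alphabet and a designed code matrix). It simply builds the received-signal tensor from the three matrices appearing in the model.

```python
import numpy as np

K, N, P, M = 4, 50, 8, 3   # receive antennas, symbol periods, code length, emitters
rng = np.random.default_rng(2)

H = rng.normal(size=(K, M)) + 1j * rng.normal(size=(K, M))   # channel coefficients h_{k,m}
S = rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))   # symbols s_{n,m} (toy: Gaussian)
C = rng.normal(size=(P, M)) + 1j * rng.normal(size=(P, M))   # spreading codes c_{p,m}

# x_{k,n,p} = sum_m h_{k,m} s_{n,m} c_{p,m}   (equation [I.1])
X = np.einsum('km,nm,pm->knp', H, S, C)
print(X.shape)   # (4, 50, 8): the space x time x code tensor of received signals
```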


The received signals, which are complex-valued, therefore form a third-order tensor X ∈ C^{K×N×P} whose modes are space × time × code, associated with the indices (k, n, p). This signal tensor satisfies a PARAFAC decomposition ⟦H, S, C; M⟧ whose rank is equal to the number M of emitting antennas and whose matrix factors are the channel (H ∈ C^{K×M}), the matrix of transmitted symbols (S ∈ C^{N×M}) and the coding matrix (C ∈ C^{P×M}). This example is a simplified form of the DS-CDMA (direct-sequence CDMA) system proposed by Sidiropoulos et al. (2000b).

I.5. With what cost functions and optimization algorithms?

We will now briefly describe the most common processing operations carried out with tensors, as well as some of the optimization algorithms that are used. It is important to first present the preprocessing operations that need to be performed. Preprocessing typically involves data centering operations (offset elimination), scaling of non-homogeneous data, suppression of outliers and artifacts, image adjustment (size, brightness, contrast, alignment, etc.), denoising, signal transformation using certain transforms (wavelets, Fourier, etc.), and finally, in some cases, the calculation of statistics of signals to be processed.

Preprocessing is fundamental, both to improve the quality of the estimated models and, therefore, of the subsequent processing operations, and to avoid numerical problems with optimization algorithms, such as conditioning problems that may cause the algorithms to fail to converge. Centering and scaling preprocessing operations are potentially problematic because they are interdependent and can be combined in several different ways. If data are missing, centering can also reduce the rank of the tensor model. For a more detailed description of these preprocessing operations, see Smilde et al. (2004).

For the processing operations themselves, we can distinguish between several different classes:

– supervised/non-supervised (blind or semi-blind), i.e. with or without training data, for example, to solve classification problems, or when a priori information, called a pilot sequence, is transmitted to the receiver for channel estimation;
– real-time (online)/batch (offline) processing;
– centralized/distributed;
– adaptive/blockwise (with respect to the data);
– with/without coupling of tensor and/or matrix models;
– with/without missing data.


It is important to distinguish batch processing, which is performed to analyze data recorded as signal and image sets, from the real-time processing required by wireless communication systems, recommendation systems, web searches and social networks. In real-time applications, the dimensionality of the model and the algorithmic complexity are predominant factors. The signals received by receiving antennas, the information exchanged between a website and the users and the messages exchanged between the users of a social network are time-dependent. For instance, a recommendation system interacts with the users in real-time, via a possible extension of an existing database by means of machine learning techniques. For a description of various applications of tensors for data mining and machine learning, see Anandkumar et al. (2014) and Sidiropoulos et al. (2017).

Tensor-based processings lead to various types of optimization algorithm as follows:

– constrained/unconstrained optimization;
– iterative/non-iterative, or closed-form;
– alternating/global;
– sequential/parallel.

Furthermore, depending on the information that is available a priori, different types of constraints can be taken into account in the cost function to be optimized: low rank, sparseness, non-negativity, orthogonality and differentiability/smoothness. In the case of constrained optimization, weights need to be chosen in the cost function according to the relative importance of each constraint and the quality of the a priori information that is available.

Table I.4 presents a few examples of cost functions that can be minimized for the parameter estimation of certain third-order tensor models (CPD, Tucker, coupled matrix Tucker (CMTucker) and coupled sparse tensor factorization (CSTF)), for the imputation of missing data in a tensor and for the estimation of a sparse data tensor with a low-rank constraint expressed in the form of the nuclear norm of the tensor.

REMARK I.1.– We can make the following remarks:

– the cost functions presented in Table I.4 correspond to data fitting criteria. These criteria, expressed in terms of tensor and matrix Frobenius norms (‖·‖_F), are quadratic in the difference between the data tensor X and the output of CPD and TD models, as well as between the data matrix Y and a matrix factorization model, in the case of the CMTucker model. They are trilinear and quadrilinear, respectively, with respect to the parameters of the CPD and TD models to be estimated, and bilinear with respect to the parameters of the matrix factorization model;


– for the missing data imputation problem using a CPD or TD model, the binary tensor W, which has the same size as X, is defined as:

w_ijk = 1 if x_ijk is known, and w_ijk = 0 if x_ijk is missing.   [I.2]

The purpose of the Hadamard product (denoted ⊙) of W with the difference between X and the output of the CPD and TD models is to fit the model to the available data only, ignoring any missing data for model estimation (see the numerical sketch after Table I.4). This imputation problem, known as the tensor completion problem, was originally dealt with by Tomasi and Bro (2005) and Acar et al. (2011a) using a CPD model, followed by Filipovic and Jukic (2015) using a TD model. Various articles have discussed this problem in the context of different applications. An overview of the literature will be given in the next volume;

Data: X ∈ K^{I×J×K}, Y ∈ K^{I×M}

Problems | Models | Cost functions
Estimation | CPD | f(A, B, C) = ‖X − ⟦A, B, C; R⟧‖²_F
Estimation | TD | f(G, A, B, C) = ‖X − ⟦G; A, B, C⟧‖²_F
Estimation | CMTucker | f(G, A, B, C, U) = ‖X − ⟦G; A, B, C⟧‖²_F + ‖Y − AU^T‖²_F
Estimation | CSTF | f(G, W, H, S) = ‖X − ⟦G; W*, H*, S⟧‖²_F + ‖Y − ⟦G; W, H, S*⟧‖²_F + λ‖G‖₁
Imputation | CPD | f_W(A, B, C) = ‖W ⊙ (X − ⟦A, B, C; R⟧)‖²_F
Imputation | TD | f_W(G, A, B, C) = ‖W ⊙ (X − ⟦G; A, B, C⟧)‖²_F
Imputation with low-rank constraint | CPD | f_W(A, B, C) = ‖W ⊙ (X − ⟦A, B, C; R⟧)‖²_F + λ‖X‖_*
Imputation with low-rank constraint | TD | f_W(G, A, B, C) = ‖W ⊙ (X − ⟦G; A, B, C⟧)‖²_F + λ‖X‖_*

Table I.4. Cost functions for model estimation and recovery of missing data
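To make the role of the Hadamard mask in Table I.4 concrete, here is a minimal numpy sketch (not part of the original text) that evaluates the masked CPD fitting cost f_W(A, B, C) = ‖W ⊙ (X − ⟦A, B, C⟧)‖²_F; the dimensions, the mask density and the function names are illustrative assumptions.

import numpy as np

def cpd_reconstruct(A, B, C):
    # X_hat[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def masked_cpd_cost(X, W, A, B, C):
    # f_W(A, B, C) = || W o (X - [[A, B, C]]) ||_F^2  (Hadamard-masked fit)
    E = W * (X - cpd_reconstruct(A, B, C))
    return np.sum(E ** 2)

rng = np.random.default_rng(0)
I, J, K, R = 8, 7, 6, 3                                # illustrative dimensions and rank
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))
X = cpd_reconstruct(A, B, C)                           # noiseless rank-R tensor
W = (rng.random((I, J, K)) < 0.7).astype(float)        # 1 = observed entry, 0 = missing

print(masked_cpd_cost(X, W, A, B, C))                  # 0.0: missing entries are ignored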

– for the imputation problem with the low-rank constraint, the term ‖X‖_* in the cost function replaces the low-rank constraint with the nuclear norm of X, since the function rank(X) is not convex, and the nuclear norm is the closest


convex approximation of the rank. In Liu et al. (2013), this term is replaced by Σ_{n=1}^{3} λ_n ‖X_n‖_*, where X_n represents the mode-n unfolding of X (see definition [3.41] of the unfolding X_n, and definitions [1.65] and [1.67] of the Frobenius norm ‖·‖_F and the nuclear norm ‖·‖_* of a matrix; for a tensor, see section 3.16);

– in the case of the CMTucker model, the coupling considered here relates to the first modes of the tensor X and the matrix Y of data via the common matrix factor A. Coupled matrix and tensor factorization (CMTF) models were introduced in Acar et al. (2011b) by coupling a CPD model with a matrix factorization and using the gradient descent algorithm to estimate the parameters. This type of model was used by Acar et al. (2017) to merge EEG and fMRI data with the goal of analyzing brain activity. The EEG signals are modeled with a normalized CPD model (see Chapter 5), whereas the fMRI data are modeled with a matrix factorization. The data are coupled through the subjects mode (see Table I.1). The cost function to be minimized is therefore given by:

f(g, Σ, A, B, C, U) = ‖X − ⟦g; A, B, C⟧‖²_F + ‖Y − AΣU^T‖²_F + α‖g‖₁ + α‖σ‖₁,   [I.3]

where the column vectors of the matrix factors (A, B, C) have unit norm, Σ is a diagonal matrix whose diagonal elements are the coefficients of the vector σ, and α > 0 is a penalty parameter that allows the importance of the sparseness constraints on the weight vectors (g, σ), modeled by means of the l₁ norm, to be increased or decreased. The advantage of merging EEG and fMRI data with the criterion [I.3] is that the acquisition and observation methods are complementary in terms of resolution, since EEG signals have a high temporal resolution but low spatial resolution, while fMRI imaging provides high spatial resolution;

– in the case of the CSTF model (Li et al. 2018), the tensor of high-resolution hyperspectral images (HR-HSI) is represented using a third-order Tucker model that has a sparse core (X = G ×₁ W ×₂ H ×₃ S), with the following modes: space (width) × space (height) × spectral bands. The matrices W ∈ R^{M×n_w}, H ∈ R^{N×n_h} and S ∈ R^{P×n_s} denote the dictionaries for the width, height and spectral modes, composed of n_w, n_h and n_s atoms, respectively, and the core tensor G contains the coefficients relative to the three dictionaries. The matrices W*, H* and S* are spatially and spectrally subsampled versions with respect to each mode. The term λ is a regularization parameter for the sparseness constraint on the core tensor, expressed in terms of the l₁ norm of G.

The criteria listed in Table I.4 can be globally minimized using a nonlinear optimization method such as a gradient descent algorithm (with fixed or optimal step size), or the Gauss–Newton and Levenberg–Marquardt algorithms, the latter being a


regularized form of the former. In the case of constrained optimization, the augmented Lagrangian method is very often used, as it allows the constrained optimization problem to be transformed into a sequence of unconstrained optimization problems. The drawbacks of these optimization methods include slow convergence for gradient-type algorithms and high numerical complexity for the Gauss–Newton and Levenberg–Marquardt algorithms, due to the need to compute the Jacobian matrix of the criterion w.r.t. the parameters being estimated, as well as the inverse of a large matrix.

Alternating optimization methods are therefore often used instead of a global optimization w.r.t. all matrix and tensor factors to be estimated. These iterative methods perform a sequence of separate optimizations of criteria linear in each unknown factor while fixing the other factors with the values estimated at previous iterations. An example is the standard ALS (alternating least squares) algorithm, presented in Chapter 5 for estimating PARAFAC models. For constrained optimization, the alternating direction method of multipliers (ADMM) is often used (Boyd et al. 2011).

To complete this introductory chapter, let us outline the key knowledge needed to employ tensor tools, whose presentation constitutes the main objective of this second volume:
– arrangement (also called reshaping) operations that express the data tensor as a vector (vectorization), a matrix (matricization), or a lower order tensor by combining modes; conversely, the tensorization and Hankelization operations allow us to construct tensors from data contained in large vectors or matrices;
– tensor operations such as transposition, symmetrization, Hadamard and Kronecker products, inversion and pseudo-inversion;
– the notions of eigenvalue and singular value of a tensor;
– tensor decompositions/models, and their uniqueness properties;
– algorithms used to solve dimensionality reduction problems and, hence, best low-rank approximation, parameter estimation and missing data imputation. This algorithmic aspect linked to tensors will be explored in more depth in Volume 3.

I.6. Brief description of content

Tensor operations and decompositions often use matrix tools, so we will begin by reviewing some matrix decompositions in Chapter 1, going into further detail on eigenvalue decomposition (EVD) and SVD, as well as a few of their applications.


The Hadamard, Kronecker and Khatri–Rao matrix products are presented in detail in Chapter 2, together with many of their properties and a few relations between them. To illustrate these operations, we will use them to represent first-order partial derivatives of a function, and solve matrix equations, such as Sylvester and Lyapunov ones. This chapter also introduces an index convention that is very useful for tensor computations. This convention, which generalizes Einstein’s summation convention (Pollock 2011), will be used to represent various matrix products and to prove some matrix product vectorization formulae, as well as various relations between the Kronecker, Khatri-Rao and Hadamard products. It will be used in Chapter 3 for tensor matricization and vectorization in an original way, as well as in Chapter 5 to establish matrix forms of the Tucker and PARAFAC decompositions. Chapter 3 presents various sets of tensors before introducing the notions of matrix and tensor slices and of mode combination on which reshaping operations are based. The key tensor operations listed above are then presented. Several links between products of tensors and systems of tensor equations are also outlined, and some of these systems are solved with the least squares method. Chapter 4 is dedicated to introducing the notions of eigenvalue and singular value for tensors. The problem of the best rank-one approximation of a tensor is also considered. In Chapter 5, we will give a detailed presentation of various tensor decompositions, with a particular focus on the basic Tucker and CPD decompositions, which can be viewed as generalizations of matrix SVD to tensors of order greater than two. Block tensor models and constrained tensor models will also be described, as well as certain variants, such as HOSVD and BTD (block term decomposition). CPD-type decompositions are generally used to estimate latent parameters, whereas Tucker decomposition is often used to estimate modal subspaces and reduce the dimensionality via low multilinear rank approximation and truncated HOSVD. A description of the ALS algorithm for parameter estimation of PARAFAC models will also be given. The uniqueness properties of the Tucker and CDP decompositions will be presented, as well as the various notions of the rank of a tensor. The chapter will end with illustrations of BTD and CPD decompositions for the tensor modeling of multidimensional harmonics, the problem of source separation in an instantaneous linear mixture and the modeling and estimation of a finite impulse response (FIR) linear system, using a tensor of fourth-order cumulants of the system output. High-order cumulants of random signals that can be viewed as tensors play a central role in various signal processing applications, as illustrated in Chapter 5. This motivated us to include an Appendix to present a brief overview of some basic results concerning the higher order statistics (HOS) of random signals, with two applications to the HOS-based estimation of a linear time-invariant system and a homogeneous quadratic system.

1 Matrix Decompositions

1.1. Introduction The goal of this chapter is to give an overview of the most important matrix decompositions, with a more detailed presentation of the eigenvalue decomposition (EVD) and singular value decomposition (SVD), as well as some of their applications. Matrix decompositions (also called factorizations) play a key role in matrix computation, in particular, for computing the pseudo-inverse of a matrix (see section 1.5.4), the low-rank approximation of a matrix (see section 1.5.7), the solution of a system of linear equations using the least squares (LS) method (see section 1.5.9), or for parametric estimation of nonlinear models using the ALS method, as illustrated in Chapter 5 with the estimation of tensor models. Matrix decompositions have two goals. The first is to factorize a given matrix with structured factor matrices that are easier to invert, and the second is to reduce the dimensionality, in order to reduce both the memory capacity required to store the data and the computational cost of the data processing algorithms. After giving a brief overview of the most common decompositions, we will recall a few results about the eigenvalues of a matrix, and then present the EVD decomposition of a square matrix. The use of this decomposition will be illustrated by computing the powers of a matrix, a matrix polynomial, a state transition matrix and the transfer function of a discretetime linear system. The URVH decomposition of a rectangular matrix will then be introduced, followed by a presentation of the SVD decomposition. The latter can be viewed as a special case of the URVH decomposition with U and V orthogonal (respectively, unitary) in the case of a real (respectively, complex) matrix and R pseudo-diagonal. The SVD can also be viewed as an extension of the EVD for diagonalizing rectangular matrices.



We will present several results relating to the SVD as the links between SVD and the fundamental spaces of a matrix and certain matrix norms. Applications of the SVD to compute the pseudo-inverse of a matrix and hence the LS estimator, as well as a low-rank matrix approximation, will also be described. Polar decomposition will be demonstrated using the SVD. The connection between SVD and principal component analysis (PCA) will be established, with an application to data compression by reducing the dimensionality of a data matrix. The use of the SVD for the blind source separation (BSS) problem will also be considered. Finally, the CUR decomposition of a matrix, which is based on selecting certain columns and rows, will be briefly described. 1.2. Overview of the most common matrix decompositions Table 1.1 presents the most common matrix decompositions as products of matrices. These decompositions differ from one another in terms of the structural properties of their factor matrices (diagonal/pseudo-diagonal, upper/lower triangular, orthogonal/unitary). The EVD, the SVD and the polar decomposition are presented in detail in sections 1.3 and 1.5. The URVH and CUR decompositions are also described in this chapter. Full-rank decomposition is discussed in Chapter 3, in section 3.15.4. For other decompositions, see Lawson and Hanson (1974), Favier (1982, 2019), Golub and Van Loan (1983), Lancaster and Tismenetsky (1985), Horn and Johnson (1985, 1991) and Meyer (2000), among many others. We can make the following remarks: – A square root of a symmetric positive semi-definite square matrix A ∈ RI×I is S defined as a square matrix S ∈ RI×I such that A = SST . We often write S = A1/2 . The square root is not unique, since any matrix SQ with Q orthogonal is also a square root of A. The Cholesky decomposition gives a square root in lower triangular (L) or upper triangular (U) form for a symmetric positive semi-definite matrix A. The matrix L is computed row by row, from left to right and top to bottom, whereas U is computed column by column, from right to left and bottom to top. See Favier (1982) for a detailed presentation. The factors L and U are unique if A is positive definite. In the complex case, Cholesky decompositions are expressed in the form A = LLH = UUH . The UD decomposition is obtained by modifying the Cholesky decomposition so that the factor matrices L and U are unit lower triangular and unit upper triangular, respectively, and D is diagonal.


– The Schur decomposition is written as A = UTU^H (respectively, A = QTQ^T), where U is unitary (respectively, Q is orthogonal), and T is upper or lower triangular with the eigenvalues of A along the diagonal. This decomposition, which can also be written U^H AU = T (respectively, Q^T AQ = T), shows that every complex (respectively, real) square matrix is unitarily (respectively, orthogonally) similar to a triangular matrix (see Table 1.6).
– The decomposition A = LU, where A ∈ C^{I×I} is of rank R, L ∈ C^{I×I} is lower triangular and U ∈ C^{I×I} is upper triangular, satisfies the property that U or L is non-singular. This decomposition does not always exist. In the case of a non-singular matrix A ∈ R^{I×I}, a variant of the LU decomposition is A = LDU, where L and U are unit lower and upper triangular, respectively, and D is diagonal. In this case, the matrices L and U are unique.
– The QR decomposition can be computed using various orthogonalization methods based on Householder, Givens or Gram–Schmidt transformations¹.
– The LU and QR decompositions are used to solve systems of linear equations of the form Ax = b using the LS method. Using the decomposition A = LU, with A ∈ R^{I×I}, the original system is replaced by two triangular systems Ly = b and Ux = y. To solve these new systems, the square triangular matrices L and U are inverted using forward and backward substitution algorithms, respectively, instead of inverting A. Similarly, using the decomposition A = QR, with A ∈ R^{I×J} assumed to have full column rank (I ≥ J), the original system of equations can be solved by inverting a triangular matrix. To see this, set:

Q^T A = R = [R₁; 0], Q^T b = [c; d],   [1.1]

where R₁ ∈ R^{J×J} is non-singular, c ∈ R^J and d ∈ R^{I−J}. Using the fact that pre-multiplying a vector by an orthogonal matrix preserves its Euclidean norm (‖Qy‖₂ = ‖y‖₂), the LS criterion to minimize can be rewritten as follows:

‖Ax − b‖²₂ = ‖Q^T(Ax − b)‖²₂ = ‖R₁x − c‖²₂ + ‖d‖²₂.   [1.2]

1. A matrix Q ∈ R^{I×J}, with J ≤ I, is said to be column-wise orthonormal (or simply column orthonormal) if its column vectors form an orthonormal set (Q^T Q = I_J). Similarly, when I ≤ J, the matrix Q is said to be row-wise orthonormal (or simply row orthonormal) if its row vectors form an orthonormal set (QQ^T = I_I).


Matrices | Decompositions | Properties
A ∈ R^{I×I}_S, A ≥ 0 | Square root: A = A^{1/2}(A^{1/2})^T = A^{1/2}A^{T/2} | A^{1/2} ≥ 0
A ∈ R^{I×I}_S, A ≥ 0 | Cholesky: A = LL^T, A = UU^T | L lower triangular, l_ii ≥ 0, ∀i ∈ I; U upper triangular, u_ii ≥ 0, ∀i ∈ I
A ∈ R^{I×I}_S, A > 0 | UD: A = LDL^T, A = UDU^T | L unit lower triangular, l_ii = 1, ∀i ∈ I; U unit upper triangular, u_ii = 1, ∀i ∈ I; D diagonal, d_ii > 0, ∀i ∈ I
A ∈ C^{I×I} | Schur: A = UTU^H | U unitary; T upper triangular, t_ii = λ_i ∈ sp(A), ∀i ∈ I
A ∈ R^{I×I} | A = QTQ^T | Q orthogonal
A ∈ C^{I×I} | LU: A = LU | L lower triangular; U upper triangular
A ∈ C^{I×I}, r_A = I | LDU: A = LDU | L unit lower triangular, l_ii = 1, ∀i ∈ I; U unit upper triangular, u_ii = 1, ∀i ∈ I; D diagonal
A ∈ R^{I×J}, r_A = J | QR: A = QR | Q column orthonormal (Q^T Q = I_J); R upper triangular, r_jj > 0, ∀j ∈ J
A ∈ C^{I×I}, r_A = I | A = QR | Q unitary (Q^H Q = QQ^H = I_I); R upper triangular, r_ii > 0, ∀i ∈ I
A ∈ C^{I×J}, r_A = R | Full rank: A = FG | F ∈ C^{I×R}, G ∈ C^{R×J}, r_F = r_G = R
A ∈ R^{I×I} | EVD: A = PDP^{-1} | P matrix of eigenvectors; D diagonal matrix of eigenvalues
A ∈ C^{I×I} Hermitian | A = PDP^H | P unitary; D real diagonal
A ∈ R^{I×I}_S symmetric | A = PDP^T | P orthogonal; D real diagonal
A ∈ C^{I×J} | SVD: A = UΣV^H | U ∈ C^{I×I}, V ∈ C^{J×J} unitary; Σ ∈ R^{I×J} positive pseudo-diagonal
A ∈ C^{I×I} | Polar: A = QP or A = SQ | Q unitary; P, S ≥ 0 Hermitian

Table 1.1. The most common matrix decompositions

The LS solution is therefore given by x_LS = R₁^{-1}c, where R₁ is upper triangular, and the estimation error is equal to ‖d‖²₂. For a discussion of solving the LS problem with SVD, see section 1.5.9.
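As an illustration of [1.1]–[1.2], the following sketch solves an overdetermined system in the LS sense via the QR decomposition with numpy/scipy; the sizes and the random data are illustrative assumptions, and scipy's triangular solver is used for the back-substitution step.

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
I, J = 10, 4                                   # overdetermined system (illustrative sizes)
A = rng.standard_normal((I, J))
b = rng.standard_normal(I)

# Thin QR: A = QR with Q (I x J) column orthonormal and R1 (J x J) upper triangular
Q, R1 = np.linalg.qr(A)
c = Q.T @ b                                    # the vector c of [1.1]
x_ls = solve_triangular(R1, c, lower=False)    # x_LS = R1^{-1} c by back-substitution

# Same solution as the generic least squares routine
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_ls, x_ref))                # True
print(np.linalg.norm(A @ x_ls - b) ** 2)       # squared residual, equal to ||d||_2^2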


1.3. Eigenvalue decomposition

1.3.1. Reminders about the eigenvalues of a matrix

Given a square matrix A ∈ K^{I×I}, a scalar λ_k is said to be an eigenvalue of A if there exists a non-zero vector v_k ∈ K^I such that:

Av_k = λ_k v_k.   [1.3]

The vector v_k is called a right eigenvector (or simply an eigenvector) of A associated with the eigenvalue λ_k. This definition can also be written as follows:

(A − λ_k I_I)v_k = 0_I, with v_k ≠ 0_I,   [1.4]

which implies that the matrix A − λ_k I_I is singular, and therefore has determinant zero. We can then conclude that the eigenvalues of A are the I roots of the characteristic equation:

p_A(λ) = det(λI_I − A) = 0.   [1.5]

Similarly, a left eigenvector of A associated with the eigenvalue μ_i is a vector u_i ≠ 0_I such that²:

u_i^H A = μ_i u_i^H,   [1.6]

or, equivalently:

u_i^H(μ_i I_I − A) = 0_I^T ⟺ (μ_i^* I_I − A^H)u_i = 0_I,   [1.7]

which implies that det(μ_i I_I − A) = 0. We again recover the characteristic equation [1.5], which shows that the left and right eigenvectors are associated with the same eigenvalues.

Below, we recall a few results about the eigenvalues of a matrix that were presented in Volume 1 (Favier 2019). Table 1.2 summarizes certain key results about the definitions of left and right eigenvectors, i.e. the mode-1 and mode-2 eigenvectors, respectively, of a real square matrix, as well as the characteristic equation to compute them³.

2. In the case of a real matrix (K = R), replace the transconjugation operation in [1.6] by that of transposition, and equation [1.7] becomes (μ_i I_I − A^T)u_i = 0_I.
3. For the definition of the mode-p product, denoted ×_p, see section 3.13.2.

Note that the mode-1


and mode-2 eigenvalues are identical, whereas the mode-1 and mode-2 eigenvectors associated with the same eigenvalue are, in general, not the same, except when A is Hermitian (or symmetric, in the real case). Indeed, in this case, it is easy to check that the equations defining the left and right eigenvectors are identical, and therefore the mode-1 and mode-2 eigenvectors associated with the same eigenvalue are also identical.

A ∈ R^{I×I}, u, v ≠ 0_I | Definitions, characteristic equation, and properties
Right eigenvectors (mode-2) | Av = A ×₂ v^T = λv, or Σ_{j=1}^{I} a_ij v_j = λv_i, i ∈ I
Left eigenvectors (mode-1) | u^T A = A ×₁ u^T = μu^T, or Σ_{i=1}^{I} a_ij u_i = μu_j, j ∈ I
Characteristic equation | det(λI − A) = det(μI − A) = 0 ⇒ λ_i = μ_i, ∀i ∈ I

Table 1.2. Eigenvalues of a real square matrix

Table 1.3 recalls the links between the eigenvalues of a real symmetric matrix⁴ A ∈ R^{I×I}_S and the extrema of the Rayleigh quotient λ = x^T Ax / x^T x, as well as the optimization⁵ of the quadratic criterion x^T Ax subject to the constraint of the unit Euclidean norm, ‖x‖₂ = 1. The notions of generalized eigenvalue and generalized eigenpair are also recalled.

Eigenvalues of A ∈ R^{I×I}_S:
min/max λ = x^T Ax / x^T x = (Σ_{i,j=1}^{I} a_ij x_i x_j) / x^T x, x ≠ 0 ⇒ Ax = λx
min/max x^T Ax = Σ_{i,j=1}^{I} a_ij x_i x_j, ‖x‖₂ = 1; KKT optimality conditions: Ax = λx with x^T x = 1

Generalized eigenvalues of A ∈ R^{I×I}_S, A ≥ 0, with B ∈ R^{I×I}_S, B > 0:
min/max λ = x^T Ax / x^T Bx, x ≠ 0 ⇒ Ax = λBx

Table 1.3. Eigenvalues and extrema of the Rayleigh quotient

4. R^{I×I}_S is the subspace of symmetric square matrices of order I.
5. This optimization problem with an equality constraint is solved using the Lagrangian L(x, λ) = x^T Ax − λ(x^T x − 1). The Karush–Kuhn–Tucker (KKT) optimality conditions are ∂L(x, λ)/∂x = 0 and ∂L(x, λ)/∂λ = 0, which leads to Ax = λx, with x^T x = 1.

REMARK 1.1.– The generalized eigenvalues λ are found by solving the following equation:

det(λB − A) = 0 ⇔ det(λI − B^{-1}A) = 0,   [1.8]


i.e. by computing the eigenvalues of B^{-1}A, where A and B are symmetric positive definite or semi-definite and symmetric positive definite, respectively. The equation Ax = λBx can be written as AP = BPD, where D = diag(λ₁, ···, λ_I) and P contain the generalized eigenvalues and the generalized eigenvectors of the pair (A, B), respectively, and (P, D) is called an eigenpair of (A, B).

The following proposition shows that any left eigenvector is orthogonal to any right eigenvector when the two vectors are associated with distinct eigenvalues.

PROPOSITION 1.2.– Given a right eigenvector v_k and a left eigenvector u_i associated with distinct eigenvalues λ_k and λ_i, respectively, the following orthogonality relation is satisfied with respect to the Hermitian scalar product if K = C (or with respect to the Euclidean scalar product if K = R):

⟨v_k, u_i⟩ = u_i^H v_k = 0.   [1.9]

PROOF.– Pre-multiplying equation [1.3] by u_i^H and post-multiplying [1.6] by v_k, after replacing μ_i with λ_i, yields:

u_i^H Av_k = λ_k u_i^H v_k = λ_i u_i^H v_k.   [1.10]

Therefore, by subtracting these two equations on both sides and using the hypothesis that λ_i ≠ λ_k, we deduce the orthogonality relation u_i^H v_k = 0.

1.3.2. Eigendecomposition and properties

Concatenating the I columns associated with equation [1.3] for k ∈ I gives us the following matrix equation:

A[v₁ ··· v_I] = [λ₁v₁ ··· λ_I v_I] = [v₁ ··· v_I] diag(λ₁, ···, λ_I).

After defining the diagonal matrix D = diag(λ₁, ···, λ_I) with the eigenvalues of A along the diagonal, repeated with their multiplicity, and the matrix P = [v₁, ···, v_I] of right eigenvectors in the same order as the corresponding eigenvalues, this equation becomes:

AP = PD.   [1.11]


We can therefore conclude that A is diagonalizable if and only if it has I linearly independent eigenvectors, i.e. if the matrix P is invertible. If so, the matrix D = P^{-1}AP is diagonal, and A decomposes into:

A = PDP^{-1}.   [1.12]

This decomposition is called the eigendecomposition of A, also known as the spectral decomposition of A, named after the "spectrum", which is the set of eigenvalues of a matrix.

Diagonalization condition: In order for a matrix A ∈ K^{I×I} to be diagonalizable, it is necessary and sufficient for all of its eigenvalues to belong to K and for the dimension of the eigenspace associated with each eigenvalue to be equal to the multiplicity of this eigenvalue. In other words, an eigenvalue of multiplicity q must have q linearly independent eigenvectors. This condition implies that any matrix A ∈ K^{I×I} with I distinct eigenvalues is diagonalizable.

The decomposition [1.12] can be used to show the next proposition.

PROPOSITION 1.3.– Given the matrix A ∈ K^{I×I}, we have the following relations:

tr(A) = Σ_{i=1}^{I} λ_i, det(A) = Π_{i=1}^{I} λ_i.   [1.13]

PROOF.– From the decomposition [1.12], we deduce the following equalities:

tr(A) = tr(PDP^{-1}) = tr(P^{-1}PD) = tr(D) = Σ_{i=1}^{I} λ_i,
det(A) = det(PDP^{-1}) = det(D) = Π_{i=1}^{I} λ_i,

which proves the identities [1.13]. The next proposition can be deduced from the orthogonality property [1.9].
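The identities [1.12] and [1.13] can be checked numerically; the short numpy sketch below (with an arbitrary random matrix as an illustrative assumption) verifies the eigendecomposition and the trace/determinant relations.

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))                       # generic square matrix (illustrative)

# Eigendecomposition A = P D P^{-1} (A has distinct eigenvalues with probability 1)
eigvals, P = np.linalg.eig(A)
D = np.diag(eigvals)
print(np.allclose(P @ D @ np.linalg.inv(P), A))       # True: A = P D P^{-1}   [1.12]

# Identities [1.13]: trace = sum of eigenvalues, determinant = product of eigenvalues
print(np.allclose(np.trace(A), eigvals.sum()))        # True
print(np.allclose(np.linalg.det(A), eigvals.prod()))  # True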

PROPOSITION 1.4.– Let P and Q be the square matrices of order I formed by the right and left eigenvectors of A:

P = [v₁, ···, v_I], Q = [u₁, ···, u_I].   [1.14]

If all the eigenvalues of A are distinct, then P and Q satisfy the following orthogonality relation:

Q^H P = I_I,   [1.15]

i.e. P and Q are bi-unitary and their inverses satisfy P^{-1} = Q^H and Q^{-1} = P^H.


By taking the orthogonality relation [1.15] into account, the decomposition [1.12] can be rewritten in terms of the matrix Q of left eigenvectors of A as follows:

A = PDQ^H,   [1.16]

or, in expanded form:

A = [v₁ ··· v_I] diag(λ₁, ···, λ_I) [u₁ ··· u_I]^H = Σ_{i=1}^{I} λ_i v_i u_i^H,   [1.17]

where u_i and v_i are the left and right eigenvectors of A associated with the eigenvalue λ_i.

EXAMPLE 1.5.– Consider the matrix A = [0 j; j 0], with j² = −1. This is a skew-Hermitian matrix (A^H = −A). Its eigenvalues are therefore purely imaginary; they are the solutions of the equation det(λI₂ − A) = λ² + 1 = 0, namely λ₁ = j, λ₂ = −j. The right eigenvectors satisfy:

jx₁ = jy₁ ⇒ v₁ = (1/√2)[1; 1], jx₂ = −jy₂ ⇒ v₂ = (1/√2)[1; −1].

It is easy to check that we can choose the left eigenvectors so that u₁ = v₁, u₂ = v₂. The eigenvector matrices are therefore P = Q = (1/√2)[1 1; 1 −1]. These matrices are such that P^{-1} = P = Q and, hence, Q^H P = I₂. The EVD of A is given by:

A = PDP^{-1} = (1/2)[1 1; 1 −1][j 0; 0 −j][1 1; 1 −1] = [0 j; j 0],

or, equivalently:

A = Σ_{i=1}^{2} λ_i v_i u_i^H = (j/2)[1 1; 1 1] − (j/2)[1 −1; −1 1] = [0 j; j 0].
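Example 1.5 can be reproduced numerically; the following sketch (using numpy's general eigensolver, an implementation choice rather than part of the text) verifies the EVD and the dyadic form of A.

import numpy as np

A = np.array([[0, 1j],
              [1j, 0]])                          # the skew-Hermitian matrix of example 1.5

eigvals, P = np.linalg.eig(A)
print(eigvals)                                   # 1j and -1j (possibly in another order)

D = np.diag(eigvals)
print(np.allclose(P @ D @ np.linalg.inv(P), A))  # True: A = P D P^{-1}

# Dyadic form A = sum_i lambda_i v_i u_i^H with u_i = v_i
# (P is unitary here because A is normal with distinct eigenvalues)
A_sum = sum(eigvals[i] * np.outer(P[:, i], P[:, i].conj()) for i in range(2))
print(np.allclose(A_sum, A))                     # True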


1.3.3. Special case of symmetric/Hermitian matrices

Table 1.4 recalls the eigenvalue properties of (skew-)symmetric/(skew-)Hermitian matrices, orthogonal matrices and unitary matrices.

Matrix classes | Eigenvalue properties
A ∈ C^{I×I} Hermitian | λ_i ∈ R
A ∈ C^{I×I} skew-Hermitian | λ_i^* = −λ_i
A ∈ R^{I×I}_S symmetric | λ_i ∈ R
A ∈ R^{I×I} skew-symmetric | λ_i^* = −λ_i
A ∈ R^{I×I}_S positive definite | λ_i > 0
A ∈ R^{I×I}_S positive semi-definite | λ_i ≥ 0
A ∈ R^{I×I} orthogonal | |λ_i| = 1
A ∈ C^{I×I} unitary | |λ_i| = 1

Table 1.4. Eigenvalue properties of certain matrix classes

REMARK 1.6.– The eigenvalues of a symmetric or Hermitian matrix A are real-valued, and any right eigenvector is also a left eigenvector associated with the same eigenvalue. Therefore, any two eigenvectors associated with distinct eigenvalues are orthogonal with respect to the Euclidean inner product (real case) or the Hermitian inner product (complex case), and the columns of P form an orthonormal basis. Note that, if A is non-symmetric, its eigenvectors are not orthogonal. For more details, see Favier (2019).

The above results allow us to deduce the following proposition.

PROPOSITION 1.7.– Every real symmetric (respectively, complex Hermitian) matrix has orthogonal eigenvectors and real-valued eigenvalues, and can be diagonalized using its EVD:

A = PDP^T (respectively, A = PDP^H),   [1.18]

where P is orthogonal (respectively, unitary), and D is diagonal and real. We can also write A as follows:

A = Σ_{i=1}^{I} λ_i u_i u_i^T (respectively, A = Σ_{i=1}^{I} λ_i u_i u_i^H).   [1.19]

This expression shows that the eigenvectors are only determined up to some phase ambiguity, since, for j² = −1 and θ_i ∈ R, ∀i ∈ I, we have:

Σ_{i=1}^{I} λ_i (e^{jθ_i} u_i)(e^{jθ_i} u_i)^H = Σ_{i=1}^{I} λ_i u_i u_i^H.   [1.20]


Table 1.5 summarizes the eigenvalue properties of positive (or negative) (semi-)definite real symmetric matrices.

A ∈ R^{I×I}_S | Constraints | Eigenvalue properties
A > 0 | x^T Ax > 0, ∀x ≠ 0 | λ_i > 0, ∀i ∈ I
A ≥ 0 | x^T Ax ≥ 0, ∀x | λ_i ≥ 0, ∀i ∈ I

Using the SVD [1.34] of A, the decompositions [1.35] and [1.36] can be rewritten as follows:

AA^H = (UΣV^H)(VΣ^T U^H) = UΣΣ^T U^H,   [1.38]
A^H A = (VΣ^T U^H)(UΣV^H) = VΣ^T ΣV^H,   [1.39]

which implies D₁ = ΣΣ^T and D₂ = Σ^T Σ, and, hence, λ₁ = σ₁², ···, λ_r = σ_r². The diagonal coefficients σ_k of Σ are the positive square roots of the eigenvalues λ_k, i.e. σ_k = √λ_k; they are called the singular values of A. Since these singular values are ordered non-increasingly along the diagonal of Σ, we define Σ_r = diag(σ₁, ···, σ_r), where r is the rank of A, equal to the number of non-zero singular values.

From [1.38] and [1.39], we conclude that the span of the columns of U is the column space of A, whereas the span of the columns of V is the column space of A^H, i.e. the row space of A. Since V is unitary, we have V^{-1} = V^H, and we can rewrite the SVD as AV = UΣ, from which we deduce that the columns of U and V satisfy the relations:

Av_k = σ_k u_k, k ∈ r.   [1.40]

Thus, A can be viewed as the matrix of the linear transformation that transforms a vector v_k of its row space into a vector u_k of its column space. Similarly, from equation [1.34], we have U^H A = ΣV^H, which gives u_k^H A = σ_k v_k^H, or, after transconjugating both sides:

A^H u_k = σ_k v_k, k ∈ r.   [1.41]

From the equations [1.40] and [1.41], we deduce that:

A^H Av_k = σ_k A^H u_k = σ_k² v_k,   [1.42]
AA^H u_k = σ_k Av_k = σ_k² u_k.   [1.43]


By comparison with [1.35]–[1.37], we conclude that σ_k² = λ_k is an eigenvalue of both A^H A and AA^H. The above results are summarized in Table 1.9.

A ∈ C^{I×J}, u ∈ C^I, v ∈ C^J
Av = σu ⇔ Σ_{j=1}^{J} a_ij v_j = σu_i, i ∈ I
A^H u = σv ⇔ Σ_{i=1}^{I} a_ij^* u_i = σv_j, j ∈ J

Table 1.9. Relations between left and right singular vectors
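The relations of Table 1.9 and the link σ_k = √λ_k between the singular values and the eigenvalues of A^H A can be verified numerically; the complex test matrix and its dimensions in the sketch below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4)) + 1j * rng.standard_normal((6, 4))  # illustrative complex matrix

U, s, Vh = np.linalg.svd(A, full_matrices=False)   # reduced SVD: A = U diag(s) V^H

# Singular values are the positive square roots of the eigenvalues of A^H A
lam = np.linalg.eigvalsh(A.conj().T @ A)[::-1]     # eigenvalues in decreasing order
print(np.allclose(s, np.sqrt(lam)))                # True

# Relations of table 1.9: A v_k = sigma_k u_k and A^H u_k = sigma_k v_k
V = Vh.conj().T
print(np.allclose(A @ V, U * s))                   # column k: A v_k = sigma_k u_k
print(np.allclose(A.conj().T @ U, V * s))          # column k: A^H u_k = sigma_k v_k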

1.5.2. Reduced SVD and dyadic decomposition

In the case where A ∈ C^{I×J} is of rank r ≤ min(I, J), by partitioning as U = [U_r U_{I−r}] and V = [V_r V_{J−r}], the SVD can be written in the following partitioned form:

A = [U_r U_{I−r}] [Σ_r, 0_{r×(J−r)}; 0_{(I−r)×r}, 0_{(I−r)×(J−r)}] [V_r V_{J−r}]^H,   [1.44]

with U_r ∈ C^{I×r}, U_{I−r} ∈ C^{I×(I−r)}, V_r ∈ C^{J×r}, V_{J−r} ∈ C^{J×(J−r)}, and Σ_r ∈ R^{r×r}. After simplification, we obtain:

A = U_r Σ_r V_r^H.   [1.45]

This simplified form is called the reduced SVD of A. It is also known as the compact SVD of A. It corresponds to the special case of the URV^H decomposition described in [1.31] where the matrix R is diagonal. Using [1.45], we can interpret the reduced SVD in terms of a dyadic decomposition, i.e. as a sum of rank-one matrices:

A = Σ_{k=1}^{r} σ_k u_k v_k^H.   [1.46]

In the case of a multiple singular value σ_i with multiplicity n, the SVD is not unique. Indeed, if we suppose that σ_i = σ_{i+1} = ··· = σ_{i+n−1}, then the sum of the terms in [1.46] associated with this singular value can be written as:

σ_i Σ_{k=i}^{i+n−1} u_k v_k^H = σ_i U_{i,n} V_{i,n}^H,   [1.47]

where U_{i,n} = [u_i ··· u_{i+n−1}] and V_{i,n} = [v_i ··· v_{i+n−1}]. For any unitary matrix Q ∈ C^{n×n}, equation [1.47] is equivalent to:

σ_i (U_{i,n}Q)(V_{i,n}Q)^H = σ_i U_{i,n}QQ^H V_{i,n}^H = σ_i U_{i,n}V_{i,n}^H,   [1.48]


which shows that the left and right singular vectors associated with the multiple singular value σ_i are unique up to a unitary matrix. If σ_k is a simple singular value (σ_m ≠ σ_k, ∀m ≠ k), then the singular vectors u_k and v_k are unique up to a factor e^{jθ}, with j² = −1 and θ ∈ R, i.e.:

(e^{jθ}u_k)(e^{jθ}v_k)^H = (e^{jθ}u_k)(e^{−jθ}v_k^H) = u_k v_k^H.

Case of a full-rank matrix

In the case of a full-rank matrix, the SVD takes one of two possible forms. If A ∈ C^{I×J} has full row rank (r = I ≤ J), the reduced SVD can be rewritten as:

A = U [Σ_I 0_{I×(J−I)}] [V_I V_{J−I}]^H = UΣ_I V_I^H,   [1.49]

where U ∈ C^{I×I} is unitary, V_I ∈ C^{J×I} and V_{J−I} ∈ C^{J×(J−I)} are column orthonormal, and Σ_I ∈ R^{I×I} is a square matrix of order I. If A ∈ C^{I×J} has full column rank (r = J ≤ I), the reduced SVD simplifies as follows:

A = [U_J U_{I−J}] [Σ_J; 0_{(I−J)×J}] V^H = U_J Σ_J V^H,   [1.50]

where U_J ∈ C^{I×J} and U_{I−J} ∈ C^{I×(I−J)} are column orthonormal, V ∈ C^{J×J} is unitary and Σ_J ∈ R^{J×J} is a square matrix of order J. The key results about the SVD are summarized in Table 1.10.

From the reduced SVD [1.45], we deduce A^H = V_r Σ_r U_r^H. Noting that Σ_r is invertible and that V_r and U_r are column orthonormal (V_r^H V_r = U_r^H U_r = I_r), we can deduce the relations listed in Table 1.11, which encompass the relations of Table 1.9 for the singular vectors associated with non-zero singular values. For k ∈ r, the column vectors u_k and v_k of U_r and V_r form two orthonormal bases. From the relations in Table 1.11, we deduce:

⟨Av_i, Av_j⟩ = ⟨σ_i u_i, σ_j u_j⟩ = σ_i σ_j ⟨u_i, u_j⟩ = σ_i² δ_ij,   [1.51]
⟨A^H u_i, A^H u_j⟩ = ⟨σ_i v_i, σ_j v_j⟩ = σ_i² δ_ij.   [1.52]


A ∈ C^{I×J} (A ∈ R^{I×J}) of rank r

SVD: A = UΣV^H (A = UΣV^T)
U ∈ C^{I×I}, V ∈ C^{J×J} unitary; Σ ∈ R^{I×J} pseudo-diagonal
(U ∈ R^{I×I}, V ∈ R^{J×J} orthogonal; Σ ∈ R^{I×J} pseudo-diagonal)
Columns of U are eigenvectors of AA^H (AA^T), called left singular vectors.
Columns of V are eigenvectors of A^H A (A^T A), called right singular vectors.
Σ = diag(σ₁, ···, σ_{min(I,J)}) with σ₁ ≥ ··· ≥ σ_r > 0; σ_{r+1} = ··· = σ_{min(I,J)} = 0

Reduced SVD: A = U_r Σ_r V_r^H (A = U_r Σ_r V_r^T)
U_r ∈ C^{I×r}, V_r ∈ C^{J×r} column orthonormal; Σ_r = diag(σ₁, ···, σ_r) ∈ R^{r×r}
(U_r ∈ R^{I×r}, V_r ∈ R^{J×r} column orthonormal; Σ_r = diag(σ₁, ···, σ_r) ∈ R^{r×r})

Dyadic decomposition of A: A = Σ_{k=1}^{r} σ_k u_k v_k^H (A = Σ_{k=1}^{r} σ_k u_k v_k^T)

Table 1.10. Definition and properties of the SVD

A = U_r Σ_r V_r^H
U_r = AV_r Σ_r^{-1} | V_r = A^H U_r Σ_r^{-1}
AV_r = U_r Σ_r | A^H U_r = V_r Σ_r
Av_k = σ_k u_k, k ∈ r | A^H u_k = σ_k v_k, k ∈ r

Table 1.11. Relations between the left and right singular vectors

Case of a square matrix

For a square matrix A ∈ C^{I×I}, noting that det(UΣV^H) = det(U)det(Σ)det(V^H), det(V^H) = [det(V)]^* and |det(U)| = |det(V)| = 1 because U and V are unitary, we have:

|det(A)| = |det(UΣV^H)| = |det(Σ)| = Π_{i=1}^{I} σ_i.   [1.53]


From this relation, we can conclude that, if A is non-singular (det(A) ≠ 0), then all the singular values are non-zero. Conversely, if A is singular, at least one singular value must be zero. In the case where A ∈ C^{I×I} is a Hermitian matrix, it admits an EVD PDP^H that corresponds to the special case of the SVD where U = V = P and Σ² = D.

1.5.3. SVD and fundamental subspaces associated with a matrix

By considering the partitions U = [U_r U_{I−r}] and V = [V_r V_{J−r}] and the decompositions [1.29]–[1.30], we deduce that the span of the columns of U_r (respectively, V_r) formed by the r first columns of U (respectively, V) is the column space of A (respectively, the row space of A), whereas the left null space of A (respectively, the null space of A) is spanned by the I − r last columns of U (respectively, the J − r last columns of V). These results are summarized in Table 1.12⁷. This leads us back to the interpretations given in proposition 1.9 for the URV^H decomposition.

A ∈ C^{I×J} of rank r
C(A) = C(AA^H) = C(U_r) = Vect(u₁, ···, u_r)
C(A^H) = C(A^H A) = C(V_r) = Vect(v₁, ···, v_r)
N(A) = C(V_{J−r}) = Vect(v_{r+1}, ···, v_J)
N(A^H) = C(U_{I−r}) = Vect(u_{r+1}, ···, u_I)

Table 1.12. SVD and fundamental subspaces

7. The notation Vect(u₁, ···, u_r) denotes the subspace spanned by {u₁, ···, u_r}.

1.5.4. SVD and the Moore–Penrose pseudo-inverse

From the expressions [1.34] and [1.45], we can deduce the formulae of the following proposition for the inverse and the Moore–Penrose pseudo-inverse of A.

PROPOSITION 1.10.– If A ∈ C^{I×I} is invertible, its inverse may be expressed as a function of the SVD as follows:

A^{-1} = VΣ^{-1}U^H = Σ_{i=1}^{I} (1/σ_i) v_i u_i^H.   [1.54]

If A ∈ C^{I×I} is singular, or A ∈ C^{I×J} is rectangular, of rank r, then the Moore–Penrose pseudo-inverse can be expressed as a function of the reduced SVD as follows:

A† = V_r Σ_r^{-1} U_r^H = Σ_{k=1}^{r} (1/σ_k) v_k u_k^H ∈ C^{J×I}.   [1.55]


PROOF.– Using the expression [1.34] of the SVD and taking into account the fact that U and V are unitary (U^{-1} = U^H, V^{-1} = V^H), we obtain:

A^{-1} = (UΣV^H)^{-1} = (V^H)^{-1}Σ^{-1}U^{-1} = VΣ^{-1}U^H.

Similarly, using the expression [1.45] of the reduced SVD and the orthonormality properties U_r^H U_r = V_r^H V_r = I_r, it is easy to check that [1.55] satisfies the four relations defining the Moore–Penrose pseudo-inverse:

AA† = (U_r Σ_r V_r^H)(V_r Σ_r^{-1} U_r^H) = U_r U_r^H = (AA†)^H,   [1.56]
A†A = (V_r Σ_r^{-1} U_r^H)(U_r Σ_r V_r^H) = V_r V_r^H = (A†A)^H,   [1.57]
AA†A = U_r U_r^H (U_r Σ_r V_r^H) = U_r Σ_r V_r^H = A,   [1.58]
A†AA† = V_r V_r^H (V_r Σ_r^{-1} U_r^H) = V_r Σ_r^{-1} U_r^H = A†,   [1.59]

which concludes the proof of the proposition.

REMARK 1.11.– If A has full row or column rank, then, using the expressions [1.49] and [1.50] of the reduced SVD, we can rewrite the pseudo-inverse [1.55] as follows:

A† = (V_I^H)† Σ_I^{-1} U^H and A† = V Σ_J^{-1} U_J†,   [1.60]

with U_J† = U_J^H, V_I† = V_I^H, and so (V_I^H)† = V_I.

1.5.5. SVD computation There are several algorithms for computing the SVD. Table 1.13 describes the SVD computation based on the eigendecomposition [1.36] of AH A (replace AH by AT for the case of a real matrix) and the use of the relations in Table 1.11. A formula to compute the Moore–Penrose pseudo-inverse obtained from the reduced SVD is also recalled; this formula is very useful for computing the pseudo-inverse of a non-square matrix, as is often needed when solving systems of linear equations with the LS method. Note that, in the case where I < J, the SVD can be computed from the eigendecomposition [1.35] of AAH = UD1 UH , whose dimension is lower than that of AH A, and the matrix V is computed using the relation vk = σ1k AH uk , k ∈ r, given in Table 1.11. In this case, computing the SVD is numerically less expensive than if we use the eigendecomposition [1.36] of AH A. Several different algorithms for computing the SVD are presented in Cline and Dhillon (2007).


SVD computation: A ∈ C^{I×J} of rank r

1) A^H A = VD₂V^H, with D₂ = diag(λ₁, ···, λ_r, 0, ···, 0) ∈ R^{J×J} and r ≤ J
2) V = [v₁ ··· v_J]
3) Σ = [Σ_r, 0_{r×(J−r)}; 0_{(I−r)×r}, 0_{(I−r)×(J−r)}], with Σ_r = diag(σ₁, ···, σ_r) and σ_k = √λ_k, k ∈ r
4) U_r = AV_r Σ_r^{-1}, i.e. u_k = (1/σ_k)Av_k, k = 1, ···, r; u_{r+1}, ···, u_I chosen such that the columns of U form an orthonormal set of vectors
5) A = UΣV^H; reduced SVD: A = U_r Σ_r V_r^H

Computation of the Moore–Penrose pseudo-inverse: A† = V_r Σ_r^{-1} U_r^H ∈ C^{J×I}

Table 1.13. Computation of an SVD and of the Moore–Penrose pseudo-inverse
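As a hedged illustration of the procedure in Table 1.13, the sketch below computes a reduced SVD from the eigendecomposition of A^H A and the resulting pseudo-inverse, and compares them with numpy's built-in routines; the full-column-rank test matrix and the helper name svd_from_gram are assumptions made for the example.

import numpy as np

def svd_from_gram(A):
    """Reduced SVD of A following the steps of table 1.13 (full column rank assumed)."""
    lam, V = np.linalg.eigh(A.conj().T @ A)      # eigendecomposition of A^H A
    idx = np.argsort(lam)[::-1]                  # sort eigenvalues in decreasing order
    lam, V = lam[idx], V[:, idx]
    sigma = np.sqrt(np.clip(lam, 0.0, None))     # singular values sigma_k = sqrt(lambda_k)
    U = (A @ V) / sigma                          # u_k = (1 / sigma_k) A v_k
    return U, sigma, V

rng = np.random.default_rng(4)
A = rng.standard_normal((7, 3))                  # full column rank with probability 1

U, sigma, V = svd_from_gram(A)
print(np.allclose(U @ np.diag(sigma) @ V.conj().T, A))          # A = U_r Sigma_r V_r^H
print(np.allclose(sigma, np.linalg.svd(A, compute_uv=False)))   # matches library singular values

A_pinv = V @ np.diag(1.0 / sigma) @ U.conj().T                  # A† = V_r Sigma_r^{-1} U_r^H
print(np.allclose(A_pinv, np.linalg.pinv(A)))                   # True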

1.5.6. SVD and matrix norms

Let us begin by recalling the definition of a matrix norm. We will then describe different types of matrix norm.

DEFINITION.– Given any matrix A ∈ K^{I×J}, a matrix norm is a function, denoted ‖·‖ : K^{I×J} → R₊, that satisfies the following properties:

Positivity: ‖A‖ ≥ 0, ‖A‖ = 0 ⇔ A = 0.   [1.61]
Homogeneity: ‖αA‖ = |α| ‖A‖, ∀α ∈ K.   [1.62]
Triangle inequality: ‖A + B‖ ≤ ‖A‖ + ‖B‖, ∀A, B ∈ K^{I×J}.   [1.63]

If we also have ‖AB‖ ≤ ‖A‖ ‖B‖ for A ∈ K^{I×J}, B ∈ K^{J×K}, then we say that the norm is submultiplicative.

Hölder norm: For A ∈ K^{I×J}, we define the Hölder norm, also known as the l_p-norm, as follows:

‖A‖_{l_p} ≜ (Σ_{i=1}^{I} Σ_{j=1}^{J} |a_ij|^p)^{1/p}, p ≥ 1.   [1.64]

This norm is equivalent to the l_p-norm of the vector vec(A).


Frobenius norm: For p = 2, we obtain the Frobenius norm, denoted ‖A‖_F:

‖A‖²_F ≜ Σ_{i=1}^{I} Σ_{j=1}^{J} |a_ij|² = tr(A^H A) = tr(AA^H)
= tr(UΣV^H VΣ^H U^H) = tr(UΣΣ^H U^H) = tr(U^H UΣΣ^H) = tr(ΣΣ^H) = Σ_{n=1}^{r} σ_n² = ‖σ(A)‖²₂,   [1.65]

where r is the rank of A and σ(A) ≜ [σ₁, ···, σ_r]^T denotes the vector of non-zero singular values. This norm is also equal to the Hermitian norm of a vectorized form of A. Its square is equal to the sum of the squares of the Hermitian norms of the row vectors or column vectors of A:

‖A‖²_F = ‖vec(A)‖²₂ = Σ_{i=1}^{I} ‖A_{i.}‖²₂ = Σ_{j=1}^{J} ‖A_{.j}‖²₂.

REMARK 1.12.–
– The Frobenius norm is submultiplicative: ‖AB‖_F ≤ ‖A‖_F ‖B‖_F.
– This norm is left unchanged if A ∈ C^{I×J} is pre-multiplied by a unitary matrix P ∈ C^{I×I} or post-multiplied by a unitary matrix Q ∈ C^{J×J}: ‖PA‖²_F = tr(A^H P^H PA) = tr(A^H A) = ‖A‖²_F. Similarly, ‖AQ‖²_F = ‖A‖²_F.

Schatten norm: The Schatten p-norm is defined as:

‖A‖_{σ_p} ≜ ‖σ(A)‖_{l_p} = (Σ_{n=1}^{r} σ_n^p)^{1/p},   [1.66]

i.e. the l_p-norm of the vector of singular values. We therefore have ‖A‖_F = ‖A‖_{σ_2}.

Nuclear norm: The nuclear norm, denoted ‖A‖_*, is equal to the sum of the singular values, i.e. the l₁-norm of σ(A), which also coincides with the Schatten 1-norm:

‖A‖_* ≜ Σ_{n=1}^{r} σ_n = ‖σ(A)‖_{l_1} = ‖A‖_{σ_1}.   [1.67]

Induced norms: For A ∈ K^{I×J}, we can also define matrix norms induced by vector norms as follows:

‖A‖ ≜ max_{‖x‖=1} ‖Ax‖ for x ∈ K^J.   [1.68]


This matrix norm, induced by the Euclidean vector norm and denoted ‖A‖₂, is such that:

‖A‖₂ ≜ max_{‖x‖₂=1} ‖Ax‖₂ = σ₁,   [1.69]

where σ₁ is the largest singular value of A. It is called the spectral norm. If A ∈ K^{I×I} is non-singular, we have:

‖A^{-1}‖₂ = 1 / min_{‖x‖₂=1} ‖Ax‖₂ = 1/σ_I,   [1.70]

where σ_I is the smallest singular value of A. We also have:

‖A‖₁ ≜ max_{‖x‖_{l_1}=1} ‖Ax‖_{l_1} = max_j Σ_{i=1}^{I} |a_ij|,   [1.71]
‖A‖_∞ ≜ max_{‖x‖_∞=1} ‖Ax‖_∞ = max_i Σ_{j=1}^{J} |a_ij|.   [1.72]

The norm ‖A‖₁ (respectively, ‖A‖_∞) is therefore equal to the largest sum of the absolute values of the elements of a column (respectively, row) over the set of columns (respectively, rows). These norms are called the maximum absolute column sum and maximum absolute row sum norms, respectively. The main matrix norms are summarized in Table 1.14.

Norms | Expressions
Hölder | ‖A‖_{l_p} = (Σ_{i=1}^{I} Σ_{j=1}^{J} |a_ij|^p)^{1/p}, p ≥ 1
Frobenius | ‖A‖²_F = Σ_{i=1}^{I} Σ_{j=1}^{J} |a_ij|² = tr(A^H A) = tr(AA^H) = Σ_{n=1}^{r} σ_n²
Schatten | ‖A‖_{σ_p} = ‖σ(A)‖_{l_p} = (Σ_{n=1}^{r} σ_n^p)^{1/p}
Nuclear | ‖A‖_* = Σ_{n=1}^{r} σ_n = ‖σ(A)‖_{l_1} = ‖A‖_{σ_1}

Induced norms
‖A‖₁ = max_{‖x‖_{l_1}=1} ‖Ax‖_{l_1} = max_j Σ_{i=1}^{I} |a_ij|
Spectral | ‖A‖₂ = max_{‖x‖₂=1} ‖Ax‖₂ = σ₁
‖A‖_∞ = max_{‖x‖_∞=1} ‖Ax‖_∞ = max_i Σ_{j=1}^{J} |a_ij|

Table 1.14. Matrix norms
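The norms of Table 1.14 are all available through numpy; the following sketch (with an arbitrary random matrix as an illustrative assumption) checks their expressions in terms of the singular values and of the column/row sums.

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 3))
s = np.linalg.svd(A, compute_uv=False)            # singular values of A

print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2))))         # Frobenius norm
print(np.isclose(np.linalg.norm(A, 2), s[0]))                              # spectral norm = sigma_1
print(np.isclose(np.linalg.norm(A, 'nuc'), s.sum()))                       # nuclear norm = sum of sigma_n
print(np.isclose(np.linalg.norm(A, 1), np.abs(A).sum(axis=0).max()))       # max absolute column sum
print(np.isclose(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max()))  # max absolute row sum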


1.5.7. SVD and low-rank matrix approximation

In the next proposition, we give the expression for the best rank-k approximation of a matrix of rank r ≥ k.

PROPOSITION 1.13.– Let A ∈ K^{I×J} be a matrix of rank r. The best approximation of rank k ≤ r of A is given by:

A_k = U_k Σ_k V_k^H = Σ_{n=1}^{k} σ_n u_n v_n^H.   [1.73]

This approximation, known as the Eckart–Young theorem (1936), is given by the SVD truncated to order k, i.e. the sum of the first k terms in the dyadic decomposition of A, associated with the k largest singular values. The approximation errors obtained when minimizing the Frobenius and spectral norms are given by:

‖A − A_k‖²_F = Σ_{n=k+1}^{r} σ_n², ‖A − A_k‖₂ = σ_{k+1}.   [1.74]

The square of the Frobenius norm of the rank-k approximation error is therefore equal to the sum of the squares of the r − k smallest non-zero singular values, whereas the spectral norm of the approximation error is equal to the first singular value that is not taken into account in the approximation.

PROOF.– Using the best approximation A_k of rank k ≤ r, the SVD can be written as:

A = [U_k U_{r−k}] [Σ_k, 0; 0, Σ_{r−k}] [V_k V_{r−k}]^H,   [1.75]

where U_k ∈ K^{I×k} (V_k ∈ K^{J×k}) is a column orthonormal matrix composed of the k left (right) singular vectors of A associated with the k largest singular values, and U_{r−k} ∈ K^{I×(r−k)} (V_{r−k} ∈ K^{J×(r−k)}) is a column orthonormal matrix composed of the r − k left (right) singular vectors associated with the r − k smallest non-zero singular values. The diagonal elements of the diagonal matrices Σ_k ∈ R^{k×k} and Σ_{r−k} ∈ R^{(r−k)×(r−k)} are the k largest and r − k smallest non-zero singular values, respectively. We can expand A as follows:

A = A_k + U_{r−k} Σ_{r−k} V_{r−k}^H,   [1.76]


from which we deduce the rank-k approximation error A − A_k = U_{r−k} Σ_{r−k} V_{r−k}^H, and therefore:

‖A − A_k‖²_F = ‖U_{r−k} Σ_{r−k} V_{r−k}^H‖²_F = tr[(U_{r−k} Σ_{r−k} V_{r−k}^H)(U_{r−k} Σ_{r−k} V_{r−k}^H)^H]
= tr[U_{r−k} Σ_{r−k} Σ_{r−k}^H U_{r−k}^H] = tr[Σ²_{r−k}] = Σ_{n=k+1}^{r} σ_n².   [1.77]

Furthermore, from the definition [1.69], we deduce that the spectral norm of the approximation error is equal to the largest singular value of U_{r−k} Σ_{r−k} V_{r−k}^H, i.e. the first singular value not taken into account in the approximation:

‖A − A_k‖₂ = σ_{k+1}.   [1.78]

This concludes the proof of the approximation errors.

Table 1.15 summarizes the approximation formula for a matrix of rank r by a matrix of rank k ≤ r, as well as the approximation errors obtained when minimizing the spectral and Frobenius norms.

Rank-k approximation formula: A ∈ C^{I×J} of rank r ≤ min(I, J)
A_k = U_k Σ_k V_k^H = Σ_{n=1}^{k} σ_n u_n v_n^H, k ≤ r

Approximation errors:
min_{B / rank(B)=k} ‖A − B‖₂ = ‖A − A_k‖₂ = σ_{k+1}
min_{B / rank(B)=k} ‖A − B‖²_F = ‖A − A_k‖²_F = Σ_{n=k+1}^{r} σ_n²

Table 1.15. Rank-k approximation and approximation errors

k 

un uH n )A

[1.79]

n=1

= APVk = A(Vk VkH ) = A(

k  n=1

vn vnH ),

[1.80]


where P_{U_k} = U_k U_k^H is an orthogonal projection matrix⁸ onto the column space of U_k, i.e. the space spanned by the k left singular vectors of A associated with the k largest singular values. Similarly, P_{V_k} = V_k V_k^H is an orthogonal projection matrix onto the column space of V_k, i.e. the space spanned by the k right singular vectors of A associated with the k largest singular values.

PROOF.– From the expression [1.75] of A, and noting that the orthonormality of the columns of U implies U_k^H U_k = I_k and U_k^H U_{r−k} = 0_{k×(r−k)}, we have:

(U_k U_k^H)A = (U_k U_k^H)[U_k U_{r−k}] [Σ_k, 0; 0, Σ_{r−k}] [V_k V_{r−k}]^H = U_k U_k^H U_k Σ_k V_k^H = U_k Σ_k V_k^H = A_k.   [1.82]

The proof of [1.80] can be established in the same way by using the orthonormality of the columns of V.

EXAMPLE 1.15.– Let A = [1 1; 1 0; 0 1]. We have A^T A = [2 1; 1 2] and det(A^T A − λI) = λ² − 4λ + 3 = 0, which gives λ₁ = 3, λ₂ = 1, and v₁ = (1/√2)[1; 1], v₂ = (1/√2)[1; −1]. Hence, we deduce that:

V₂ = [v₁ v₂] = (1/√2)[1 1; 1 −1],
σ₁ = √λ₁ = √3, σ₂ = √λ₂ = 1 ⇒ Σ₂ = [√3 0; 0 1],
U₂ = AV₂Σ₂^{-1} = (1/√6)[2 0; 1 √3; 1 −√3].

The matrix A therefore factorizes as follows:

A = U₂Σ₂V₂^T = (1/√6)[2 0; 1 √3; 1 −√3] [√3 0; 0 1] (1/√2)[1 1; 1 −1],

8. Definition: A square matrix P ∈ K^{n×n} is an orthogonal projector if it is idempotent and symmetric in the case K = R (Hermitian in the case K = C), i.e.:
P² = P and P^T = P (P^H = P in the complex case).   [1.81]
The matrix P^⊥ = I_n − P is called the orthogonal complement of P.


and the Moore–Penrose pseudo-inverse of A is given by:

A† = V₂Σ₂^{-1}U₂^T = (1/√2)[1 1; 1 −1] [1/√3 0; 0 1] (1/√6)[2 1 1; 0 √3 −√3] = (1/3)[1 2 −1; 1 −1 2].

The best rank-one approximation is equal to A₁ = σ₁u₁v₁^T = (1/2)[2 2; 1 1; 1 1]. Furthermore, the approximation error is given by A − A₁ = (1/2)[0 0; 1 −1; −1 1], which implies that ‖A − A₁‖₂² = ‖A − A₁‖²_F = 1 = σ₂².

1.5.8. SVD and orthogonal projectors

The SVD of a matrix A ∈ K^{I×J} of rank r allows us to define orthogonal projectors onto the fundamental subspaces recalled in Table 1.7. These projectors, denoted P_{C(A)}, P_{C(A^H)}, P_{N(A)} and P_{N(A^H)}, are orthogonal projections onto the column space C(A), the row space C(A^H) and the null spaces N(A) and N(A^H), respectively. Using equations [1.56] and [1.57], as well as the relations in Table 1.8, we have (Meyer 2000):

AA† = U_r U_r^H = P_{C(A)},   [1.83]
A†A = V_r V_r^H = P_{C(A^H)},   [1.84]
P_{N(A)} = I − P_{C(A^H)} = I − A†A,   [1.85]
P_{N(A^H)} = I − P_{C(A)} = I − AA†.   [1.86]
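The projectors [1.83]–[1.86] can be formed directly from the pseudo-inverse; the sketch below (illustrative sizes, full column rank assumed) checks that they are idempotent and symmetric, and that I − AA† annihilates the columns of A.

import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 3))                   # full column rank with probability 1
A_pinv = np.linalg.pinv(A)

P_col = A @ A_pinv                                # P_C(A): projector onto the column space of A  [1.83]
P_row = A_pinv @ A                                # P_C(A^H): projector onto the row space of A   [1.84]

# Both are idempotent and symmetric (orthogonal projectors)
print(np.allclose(P_col @ P_col, P_col), np.allclose(P_col, P_col.T))
print(np.allclose(P_row @ P_row, P_row), np.allclose(P_row, P_row.T))

# P_N(A^H) = I - A A† annihilates the columns of A
P_lnull = np.eye(6) - P_col                       # [1.86]
print(np.allclose(P_lnull @ A, 0))                # True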

1.5.9. SVD and LS estimator

PROPOSITION 1.16.– Given the overdetermined system of linear equations y = Ax, where A ∈ C^{I×J} has full column rank (r_A = J, I > J), the vector x ∈ C^J that minimizes the LS criterion J_LS(x) = ‖y − Ax‖²₂ is given by:

x_LS = (A^H A)^{-1}A^H y = A†y,   [1.87]

with the following minimum criterion value:

J_LS(x_LS) = ‖y‖²₂ − ‖Ax_LS‖²₂   [1.88]
= ‖P_A^⊥ y‖²₂ = y^H P_A^⊥ y,   [1.89]

where P_A ≜ P_{C(A)} is the orthogonal projector onto the space C(A) defined in [1.83].


P ROOF .– The criterion JLS (x) is minimized by setting its gradient to zero for x = xLS . We have:  ∂JLS (x) ∂  = [y − Ax]H [y − Ax] ∂x ∂x ∂ H H = (x A Ax − xH AH y − yH Ax + yH y) ∂x = 2(xH AH A − yH A).

[1.90]

Hence, by canceling the gradient and transconjugating the right-hand side of the above equation, we obtain the LS solution xLS , satisfying the so-called normal equations: AH AxLS = AH y.

[1.91]

Since A is assumed to have full column rank, AH A is non-singular, which gives: xLS = (AH A)−1 AH y = A† y.

[1.92]

We still need to check that this optimum is indeed a minimum, i.e. that the Hessian 2 JLS (x) = 2AH A ≥ 0. matrix is positive semi-definite: ∂ ∂x 2 The normal equations [1.91] can be rewritten as follows: AH (y − AxLS ) = 0,

[1.93]

which shows that the estimation error y − AxLS is orthogonal to the columns of A. This result is called the orthogonality principle or condition. Using this orthogonality condition, the minimum criterion value is given by: JLS (xLS ) = [y − AxLS ]H [y − AxLS ]

[1.94]

= yH (y − AxLS ) (by [1.93]) = y22 − (AH y)H xLS   H = y22 − AH AxLS + (y − AxLS ) xLS = y22 − (AH AxLS )H xLS

(by [1.93])

= y22 − AxLS 22 ,

[1.95]

which corresponds to equation [1.88]. Using the expression [1.92] of xLS , the estimated value of y is given by:  = AxLS = AA† y = PC(A) y, y

[1.96]


where PC(A)  AA† is the projector onto the column space of A defined in [1.83]. By the expression [1.96] and the idempotence property of the orthogonal † ⊥ 2 ⊥ complement P⊥ A = I − PA = I − AA , i.e. (PA ) = PA , the minimum criterion value [1.88] can be rewritten as follows: 2 H ⊥ JLS (xLS ) = (I − AA† )y22 = P⊥ A y2 = y PA y,

which proves the expression [1.89] of the minimum criterion value JLS (xLS ).

[1.97] 

E XAMPLE 1.17.– Consider the system of equations y = u x, with y, u ∈ CI , and x ∈ C. The LS estimator of x is given as: I ∗ y, u i=1 yi ui xLS = u† y = (uH u)−1 uH y = = .  2 I 2 u2 i=1 |ui | Case of a rank-deficient matrix It is important to note that the LS solution is unique when A has full column rank. However, if A has deficient rank r, then dim[N (A)] = J − r, and there are infinitely many solutions, as stated in the following proposition. P ROPOSITION 1.18.– In the case where A ∈ CI×J is of rank r, the set of solutions of the system of equations y = Ax, in the LS sense, is given by the following expression: xLS = A† y + (I − A† A)z = A† y + PN (A) z,

∀z ∈ KJ ,

[1.98]

where the pseudo-inverse A† can be computed using the formula [1.55] of the reduced SVD of A, and I − A† A = PN (A) is the orthogonal projector onto the null space N (A) defined in [1.85]. From the property AA† A = A of the Moore–Penrose pseudo-inverse, we deduce that APN (A) = A − AA† A = 0. Hence, by the definition [1.83], combining the  with [1.98] gives: expression [1.96] of the estimate y  = AxLS = AA† y + A(I − A† A)z = AA† y = PC(A) y. y

[1.99]

Thus, we recover the same expression as in the case where A has full column rank. From the expression [1.98], we can conclude that the set of solutions in the LS sense is obtained by adding an element PN (A) z to A† y; this element corresponds to the orthogonal projection of an arbitrary vector z ∈ KJ onto the null space N (A), such that APN (A) z = 0.


Case of an ill-conditioned matrix and regularized LS solution

If the matrix A is ill-conditioned, i.e. close to singularity, we can use Tikhonov's regularization method, which introduces a regularization term into the LS criterion:

J_LS(x) = ‖y − Ax‖²₂ + ‖Γx‖²₂,   [1.100]

where Γ is a real diagonal matrix with positive diagonal elements. The LS solution is then given by:

x_Γ = (A^H A + Γ²)^{-1}A^H y.   [1.101]

The regularization works by adding positive terms to the diagonal of the matrix to be inverted. The larger these terms, the further we move away from the optimal LS solution [1.92]. If we choose Γ = 0, we recover the non-regularized LS solution. In practice, we often choose Γ = I, which favors a solution with small norm, since the second term of the criterion [1.100] becomes ‖x‖²₂. Table 1.16 summarizes the key results established in this section.

y = Ax, A ∈ C^{I×J}
Case where A has full column rank: x_LS = A†y = (A^H A)^{-1}A^H y
Case where A has deficient rank: x_LS = A†y + (I − A†A)z, ∀z ∈ K^J, with A† = V_r Σ_r^{-1} U_r^H = Σ_{k=1}^{r} (1/σ_k) v_k u_k^H
Case where A is ill-conditioned: x_Γ = (A^H A + Γ²)^{-1}A^H y, Γ diagonal

Table 1.16. Least squares estimator
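To illustrate Table 1.16, the following sketch computes the full-column-rank LS solution via the normal equations and via the pseudo-inverse, and a Tikhonov-regularized solution; the choice Γ = μI with μ = 0.1 and the noise level are illustrative assumptions, not prescriptions from the text.

import numpy as np

rng = np.random.default_rng(8)
I, J = 20, 5
A = rng.standard_normal((I, J))                   # full column rank with probability 1
y = A @ rng.standard_normal(J) + 0.05 * rng.standard_normal(I)

# Full column rank: x_LS = (A^H A)^{-1} A^H y = A† y
x_normal = np.linalg.solve(A.T @ A, A.T @ y)      # via the normal equations [1.91]
x_pinv = np.linalg.pinv(A) @ y
print(np.allclose(x_normal, x_pinv))              # True

# Tikhonov-regularized solution [1.101] with Gamma = mu * I (illustrative choice)
mu = 0.1
x_reg = np.linalg.solve(A.T @ A + (mu**2) * np.eye(J), A.T @ y)
print(np.linalg.norm(x_reg) <= np.linalg.norm(x_normal) + 1e-12)   # smaller-norm solution favoured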

1.5.10. SVD and polar decomposition

DEFINITION.– A polar decomposition of a square matrix A ∈ C^{I×I} is a decomposition of the form:

A = QP or A = SQ,   [1.102]

where Q is a unitary matrix, and P = (A^H A)^{1/2} and S = (AA^H)^{1/2} are positive semi-definite Hermitian matrices. Such a decomposition always exists, and it is unique if A is non-singular. If so, the matrices P and S are positive definite.


P ROOF .– The decomposition A = QP can be shown using the SVD. Indeed, since V is unitary, we can write: A = UΣVH = U(VH V)ΣVH = QP,

[1.103]

where Q  UVH and P  VΣVH . By these definitions, and the fact that U and V are unitary, we can deduce that: QH Q = (UVH )H UVH = V(UH U)VH = VVH = I.

[1.104]

Similarly, we have QQH = I, and so Q is unitary. Furthermore, noting that A A = VΣ2 VH = (VΣVH )2 , we deduce that P = (AH A)1/2 ≥ 0, where, by definition, PH = (VΣVH )H = P, which implies that P is Hermitian positive semi-definite and equal to a square root of AH A. Similarly, H

A = UΣVH = UΣ(UH U)VH = SQ,

[1.105]

where Q is defined as before, and S  UΣUH . It is easy to check that S can also be written S = (AAH )1/2 , i.e. as a square root of AAH .  The polar decompositions [1.103] and [1.105] are said to be right and left polar decompositions, respectively. In the case of a real square matrix A, the matrix Q is orthogonal, whereas P and S are symmetric positive semi-definite. Case of a full-rank rectangular matrix If A has full column rank (r = J), then, following the same approach as above, the reduced SVD [1.50] gives us the following polar decomposition: A = UJ ΣJ VH = UJ (VH V)ΣJ VH = QP,

[1.106]

where UJ ∈ CI×J is column orthonormal, V ∈ CJ×J is unitary, Q  UJ VH and H P  VΣJ VH . The matrix Q is then column orthonormal (QH Q = VUH = J UJ V H H 1/2 VV = IJ ), and P = (A A) is Hermitian positive definite. In the case where A has full row rank (r = I ≤ J), then, from the reduced SVD [1.49], we obtain the following polar decomposition: A = UΣI VIH = UΣI (UH U)VIH = SQ,

[1.107]

where U ∈ CI×I is unitary, VI ∈ CJ×I is column orthonormal, Q  UVIH and S  UΣI UH . The matrix Q is then row orthonormal (QQH = UVIH VI UH = UUH = II ), and S = (AAH )1/2 is Hermitian positive definite.
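In practice, the polar factors can be obtained from an off-the-shelf routine; the sketch below uses scipy.linalg.polar (an implementation choice, not part of the text) and checks the factors against the SVD-based construction Q = UV^H, P = VΣV^H of [1.103]. The random non-singular matrix is an illustrative assumption.

import numpy as np
from scipy.linalg import polar

rng = np.random.default_rng(9)
A = rng.standard_normal((4, 4))                   # non-singular with probability 1

Q_r, P = polar(A, side='right')                   # right polar decomposition A = Q P
Q_l, S = polar(A, side='left')                    # left polar decomposition A = S Q
print(np.allclose(Q_r @ P, A), np.allclose(S @ Q_l, A))          # True True

# Q is orthogonal (unitary in the complex case); P is symmetric positive semi-definite
print(np.allclose(Q_r @ Q_r.T, np.eye(4)))
print(np.allclose(P, P.T), bool(np.all(np.linalg.eigvalsh(P) >= -1e-12)))

# Consistency with the SVD-based construction Q = U V^H and P = V Sigma V^H of [1.103]
U, s, Vh = np.linalg.svd(A)
print(np.allclose(Q_r, U @ Vh), np.allclose(P, Vh.T @ np.diag(s) @ Vh))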


1.5.11. SVD and PCA

PCA is a widely used method for data analysis and compression. This method, also known as the Karhunen–Loève transform (KLT), is one of the techniques of factor analysis. From a statistical perspective, it decorrelates correlated random variables. This is called statistical orthogonalization, with the scalar product of two random variables x_i and x_j defined in terms of mathematical expectation and, therefore, as the correlation between variables: ⟨x_i, x_j⟩ = E[x_i x_j]. The decorrelated variables are called principal components or, alternatively, features in the context of classification.

PCA can also be viewed as a method for reducing the dimensionality, whose goal is to find L latent variables, called factors, to represent a set of M observed random variables, where L < M and often L ≪ M. PCA was introduced by Pearson (1901) to represent a system of points in a space with a reduced dimension. Today, it is used in many different fields of application, such as denoising, collaborative filtering, signal separation and data visualization or classification. See Sanguansat (2012) for a review of the multidisciplinary applications of PCA. Its many extensions include independent component analysis (ICA) and multilinear PCA (MPCA). The latter technique, introduced by Kroonenberg and de Leeuw (1980), can be used to analyze tensors of multimodal data or for applications involving the reconstruction of 3D objects (Lu et al. 2008). Other extensions, such as kernel PCA, allow the nonlinear nature of the data to be taken into consideration, so that nonlinear features can be extracted to support data classification (Scholkopf et al. 1998; Mika et al. 1999).

1.5.11.1. Principle of the method

Consider a random vector x ∈ R^M that characterizes a physical phenomenon observed over N sampling periods, where M might, for example, represent the number of sensors in a source separation problem or the number of objects in a classification problem. Suppose that the measurement vector x is centered. If not, the data can be centered by subtracting their statistical mean. The covariance matrix C_x = E[xx^T], which is equal to the autocorrelation matrix in the case of centered data, is symmetric non-negative definite and therefore diagonalizable using its EVD C_x = PΛP^T, where P is an orthogonal matrix formed by the eigenvectors, and Λ is a diagonal matrix whose diagonal elements are the non-negative eigenvalues λ_m = σ_m^2 ≥ 0, with m ∈ M, in decreasing order:

P = [u_1 u_2 ··· u_M], Λ = diag(λ_1, λ_2, ..., λ_M).


The principle of the PCA method is to linearly transform the vector x into a vector y = Qx so as to (statistically) orthogonalize its components, i.e. decorrelate them, since decorrelating the components of y is equivalent to orthogonalizing them. This amounts to diagonalizing the covariance matrix C_y. Indeed, if [C_y]_{i,j} = E[y_i y_j] = 0 for i ≠ j, then the components y_i and y_j are decorrelated. The diagonal elements [C_y]_{i,i} = E[y_i^2] are the variances of the components of y.

The diagonalization of the covariance matrix C_y is obtained by taking Q = P^{−1} = P^T. Indeed, we have:

y = P^{−1}x ⟹ C_y = E[yy^T] = P^{−1} E[xx^T] P^{−T} = P^{−1} (PΛP^T) P^{−T} = Λ, [1.108]

and so E[y_m^2] = λ_m. The components of the vector y = P^T x are called the principal components. Since the eigenvalues λ_m are arranged in decreasing order, we have:

E[y_1^2] ≥ E[y_2^2] ≥ ··· ≥ E[y_M^2].
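A small NumPy sketch (synthetic data, illustrative only) showing that the transformation y = P^T x produced by the EVD of C_x yields decorrelated components:

```python
import numpy as np

rng = np.random.default_rng(0)
L_mix = np.array([[2.0, 0.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 0.5, 0.3]])
X = L_mix @ rng.standard_normal((3, 100_000))  # (approximately) centered, correlated samples of x

Cx = X @ X.T / X.shape[1]            # empirical covariance of x
lam, P = np.linalg.eigh(Cx)          # EVD: Cx = P diag(lam) P^T (eigh returns ascending order)
Y = P.T @ X                          # y = P^T x: principal components
Cy = Y @ Y.T / Y.shape[1]
print(np.round(Cy, 3))               # approximately diagonal: the components of y are decorrelated
```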

1.5.11.2. PCA and variance maximization

The transformation y = Qx = [q_1, ···, q_M]^T x, where Q is expressed in terms of its rows q_m^T, can also be determined by maximizing the variance of the principal components. Thus, the first principal component y_1 = q_1^T x has variance:

E[y_1^2] = q_1^T E[xx^T] q_1 = q_1^T C_x q_1. [1.109]

Maximizing this variance subject to the constraint that q_1 is a unit vector (‖q_1‖_2 = 1) leads us to solve the constrained optimization problem:

u_1 = arg max_{‖q_1‖_2 = 1} q_1^T C_x q_1, [1.110]

which is equivalent to maximizing the Rayleigh quotient, defined as (q_1^T C_x q_1)/(q_1^T q_1), under the constraint q_1^T q_1 = 1. The solution is given by the eigenvector associated with the largest eigenvalue of C_x, i.e. the first column u_1 of the matrix P of eigenvectors of C_x. Furthermore, the maximum variance is equal to the largest eigenvalue λ_1 = σ_1^2.

Thus, we again find that the first principal component is equal to y_1 = u_1^T x, i.e. the first row of y = P^T x, and its variance is equal to the first element of the diagonal matrix Λ of eigenvalues. The second principal component y_2 is obtained similarly, by solving the following optimization problem:

u_2 = arg max_{‖q_2‖_2 = 1} E[ (q_2^T [x − u_1 u_1^T x])^2 ], [1.111]


where x − u_1 u_1^T x = (I − u_1 u_1^T)x is the error associated with reconstructing x from the first principal component. Since the matrix I − u_1 u_1^T is the orthogonal complement of the projector onto the column space generated by u_1, the reconstruction error is orthogonal to u_1. Hence, maximizing the criterion [1.111] implies that the vector u_2 must be orthogonal to u_1. As before, the solution is given by the eigenvector associated with the largest eigenvalue of the covariance matrix:

E[ ((I − u_1 u_1^T)x) ((I − u_1 u_1^T)x)^T ] = [I − u_1 u_1^T] C_x [I − u_1 u_1^T]^T = C_x − λ_1 u_1 u_1^T = Σ_{m=2}^{M} λ_m u_m u_m^T,

where the penultimate equality follows from the orthonormality property of the eigenvectors (u_m^T u_1 = δ_{m1}) and the eigendecomposition of C_x = Σ_{m=1}^{M} λ_m u_m u_m^T. The solution is therefore given by the eigenvector u_2 associated with the second eigenvalue of C_x.

Repeating this procedure m times yields the mth principal component by maximizing the criterion:

u_m = arg max_{‖q_m‖_2 = 1} E[ (q_m^T [I − Σ_{i=1}^{m−1} u_i u_i^T] x)^2 ]
    = arg max_{‖q_m‖_2 = 1} E[ (q_m^T [I − U_{m−1} U_{m−1}^T] x)^2 ], [1.112]

where U_{m−1} represents the matrix composed of the m − 1 first columns of the matrix P of eigenvectors. The solution u_m of this maximization problem is orthogonal to the column subspace of U_{m−1}. It is given by the eigenvector associated with the largest eigenvalue of the matrix C_x − Σ_{i=1}^{m−1} λ_i u_i u_i^T, i.e. the eigenvector u_m associated with the mth eigenvalue of C_x.

1.5.11.3. PCA and dimensionality reduction

From the perspective of reducing the dimensionality, consider the matrix P_L ∈ R^{M×L}, consisting of the first L columns of P, i.e. the L eigenvectors of C_x associated with the L largest eigenvalues. We can define the transformation y_L = P_L^T x ∈ R^L, which gives the L principal components, characterized by the L largest variances E[y_m^2] = σ_m^2 = λ_m, m ∈ L. This transformation allows us to reduce the dimension of the vector y by retaining only the L principal components y_m, m ∈ L.


The choice of L is often based on the following criterion. For some fixed threshold δ, we can choose L as the smallest value such that:

(Σ_{i=1}^{L} λ_i) / (Σ_{i=1}^{M} λ_i) ≥ δ, [1.113]

which is equivalent to finding L such that the L first eigenvalues satisfy the variance percentage δ defined in [1.113], i.e. the percentage of information in the original data that we wish to be able to recover by PCA after reducing the dimensionality.

1.5.11.4. PCA of data

In practice, we have N samples of the random signal x, assumed to be centered. Let X_N = [x_1, ···, x_N] ∈ R^{M×N} be the matrix whose columns are these samples. In a classification problem, the dimensions M and N might, for example, represent the number of data points to be classified and the number of variables characterizing each data point, respectively. The covariance C_{x_N} is then estimated using the empirical covariance, which is defined as follows:

Ĉ_{x_N} = (1/N) Σ_{n=1}^{N} x_n x_n^T = (1/N) X_N X_N^T. [1.114]

The transformation matrix that allows us to reduce the dimensionality of X_N can be determined by computing the EVD of the estimated covariance matrix Ĉ_{x_N}. It can also be determined by computing the SVD of Y = (1/√N) X_N. Indeed, we have YY^T = (1/N) X_N X_N^T = Ĉ_{x_N}, and, as we saw earlier, the SVD Y = UΣV^T gives the matrix U of left singular vectors of Y, which is the matrix of eigenvectors of YY^T = UΣΣ^T U^T, and, therefore, of Ĉ_{x_N}, i.e. P = U.

This link between PCA and the SVD of Y allows us to determine the matrix P_L directly from the data matrix X_N, without the intermediate step of estimating the covariance matrix Ĉ_{x_N} = (1/N) X_N X_N^T.

With the goal of dimensionality reduction, we can choose the column orthonormal matrix P_L = U_L ∈ R^{M×L}, formed by the L first left singular vectors of Y associated with the L largest singular values. The reduced-dimension data matrix is then defined as:

Y_L = U_L^T Y = (1/√N) U_L^T X_N ∈ R^{L×N}, [1.115]


whose empirical covariance matrix is:

Ĉ_{Y_L} = Y_L Y_L^T = U_L^T Y Y^T U_L = U_L^T (UΣΣ^T U^T) U_L
        = U_L^T [U_L U_{M−L}] diag(Σ_L^2, Σ_{M−L}^2) [U_L U_{M−L}]^T U_L
        = [U_L^T U_L  U_L^T U_{M−L}] diag(Σ_L^2, Σ_{M−L}^2) [U_L^T U_L  U_L^T U_{M−L}]^T.

Since the column vectors of U are orthonormal, we have U_L^T U_L = I_L and U_L^T U_{M−L} = 0_{L×(M−L)}. Hence, the covariance Ĉ_{Y_L} simplifies as follows:

Ĉ_{Y_L} = Σ_L^2. [1.116]

The diagonal elements of the diagonal matrix Σ_L^2 ∈ R^{L×L} are the L largest eigenvalues λ_l = σ_l^2, l ∈ L, of the covariance matrix Ĉ_{x_N}.

The reconstruction of X_N is then deduced from [1.115] as X̂_N = √N U_L Y_L = U_L U_L^T X_N ∈ R^{M×N}, i.e. by projecting X_N onto the L-dimensional subspace spanned by the columns {u_1, ···, u_L} of U_L, which are the L principal left singular vectors of Y and therefore the L principal eigenvectors of the empirical covariance matrix Ĉ_{x_N}.

REMARK 1.19.– Taking L = M, which implies that U_L = U, we have X̂_N = √N U Y = U U^T X_N = X_N. Thus, we recover the original data matrix exactly.

In conclusion, the PCA method allows both the dimension of the data matrix to be reduced and the data to be decorrelated via the transformation Y_L = U_L^T Y. This objective is achieved by projecting the original data (column vectors of X_N) onto the subspace spanned by the L principal left singular vectors of Y associated with the L largest singular values. These projections, called the principal components, are characterized by having maximum variance, the maximized variances being equal to the L largest eigenvalues of Ĉ_{x_N}.

1.5.11.5. PCA algorithm

The various steps of the PCA algorithm are summarized as follows:

– Compute Y = (1/√N) X_N.

– Compute the matrix U of left singular vectors of Y: Y = UΣV^T. [1.117]

– Determine L using the criterion [1.113].


– Compute the reduced-dimension data matrix:

Y_L = U_L^T Y ∈ R^{L×N}. [1.118]

– Reconstruct the original data matrix:

X̂_N = √N U_L Y_L ∈ R^{M×N}. [1.119]

– Reduce another data vector x ∈ R^M:

y_L = (1/√N) U_L^T x ∈ R^L. [1.120]

– Reconstruct the data vector:

x̂ = √N U_L y_L ∈ R^M. [1.121]
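The steps above can be condensed into a short NumPy sketch (illustrative only; the threshold δ and the data are placeholders, and the criterion [1.113] is applied to the squared singular values of Y):

```python
import numpy as np

def pca_reduce(XN, delta=0.95):
    """PCA of a centered data matrix XN (M x N) via the SVD of Y = XN / sqrt(N)."""
    M, N = XN.shape
    Y = XN / np.sqrt(N)
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    ratios = np.cumsum(s**2) / np.sum(s**2)     # variance-percentage criterion [1.113]
    L = int(np.searchsorted(ratios, delta) + 1)
    UL = U[:, :L]
    YL = UL.T @ Y                               # reduced-dimension data      [1.118]
    XN_hat = np.sqrt(N) * UL @ YL               # reconstruction = UL UL^T XN [1.119]
    return UL, YL, XN_hat

rng = np.random.default_rng(1)
XN = rng.standard_normal((8, 500))
XN -= XN.mean(axis=1, keepdims=True)            # center the data
UL, YL, XN_hat = pca_reduce(XN)
print(UL.shape, YL.shape, np.linalg.norm(XN - XN_hat) / np.linalg.norm(XN))
```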

REMARK 1.20.– We can make the following remarks:

– The reconstructed data vector can be expressed as follows:

x̂ = U_L U_L^T x = [u_1 ··· u_L] [⟨x, u_1⟩, ..., ⟨x, u_L⟩]^T = Σ_{i=1}^{L} ⟨x, u_i⟩ u_i,

which amounts to projecting the data vector x onto the L principal components, i.e. onto the subspace U_L spanned by {u_1, ···, u_L}.

– If the goal is to reduce the dimension of the row space of X_N and decorrelate its row vectors, the same reasoning must be applied to Y = (1/√M) X_N^T, which amounts to diagonalizing the covariance matrix (1/M) X_N^T X_N.

1.5.12. SVD and blind source separation

In this section, we will show how the SVD can be used to solve the BSS problem in an instantaneous (i.e. memoryless) linear mixture, first for the noiseless case, and then for the noisy case.

1.5.12.1. Noiseless case

Consider the following noiseless instantaneous linear mixture of P sources:

x_n = A s_n,

[1.122]

where xn ∈ CM and sn ∈ CP are the vectors containing the observations provided by M sensors and the P sources at time n, and A ∈ CM ×P is the mixture matrix. The sources are assumed to be mutually statistically independent, stationary, centered, with


variance σ_p^2 = E[|s_n(p)|^2], where s_n(p) is the pth source at time n. The independence and centering hypotheses on the sources induce a diagonal spatial covariance matrix for the source vector:

C_s ≜ E[s_n s_n^H] = diag(σ_1^2, ···, σ_P^2). [1.123]

Furthermore, the sources are assumed to be time-uncorrelated. The spatiotemporal non-correlation means that:

E[s_n(p) s_t^*(q)] = σ_p^2 δ_{nt} δ_{pq}. [1.124]

We also assume that M ≥ P, i.e. the system is over-dimensioned, with more sensors than sources, and that the matrix A has full column rank. The covariance matrix of the observations is given by:

C_x ≜ E[x_n x_n^H] = A E[s_n s_n^H] A^H = A C_s A^H. [1.125]

Note that the assumption that the mixture matrix has full column rank implies that C_x has rank equal to P, which is the rank of C_s.

The BSS problem involves estimating the sources and the mixture matrix from only signal measurements x_n, for n ∈ N. The next proposition presents a solution based on the SVD of a matrix of observations Y_N. In Chapter 5, we will present two tensor-based solutions.

PROPOSITION 1.21.– Let Y_N be the observation matrix constructed from N samples of the signal x_n, with N ≥ M:

Y_N ≜ [x_1 ··· x_N] = A S_N ∈ C^{M×N}, [1.126]

where S_N ≜ [s_1 ··· s_N]. Consider the reduced SVD of Y_N of order P:

Y_N = U_P Σ_P V_P^H, [1.127]

where U_P ∈ C^{M×P} and V_P ∈ C^{N×P} are column orthonormal (U_P^H U_P = V_P^H V_P = I_P), and Σ_P ∈ R^{P×P} is diagonal and non-singular, since C_x has rank P. The source matrix S_N and the mixture matrix A can be estimated over a time window of length N using the following equations:

(Ŝ_LS)_N = √N C_s^{1/2} Σ_P^{−1} U_P^H Y_N = √N C_s^{1/2} V_P^H [1.128]

Â = (1/√N) U_P Σ_P C_s^{−1/2}. [1.129]


PROOF.– Let us first determine the covariance C_S ≜ (1/N) E[S_N S_N^H] ∈ C^{P×P} of the source matrix S_N. By the double spatiotemporal non-correlation [1.124] of the sources, we have:

(C_S)_{ij} = (1/N) E[ Σ_{n=1}^{N} s_n(i) s_n^*(j) ] = (1/N) Σ_{n=1}^{N} σ_i^2 δ_{ij} [1.130]

C_S = diag(σ_1^2, ···, σ_P^2) = C_s. [1.131]

This result can be recovered by writing: (1/N) E[S_N S_N^H] = (1/N) Σ_{n=1}^{N} E[s_n s_n^H] = C_s. Using the definition [1.126], the covariance of the observation matrix Y_N is then given by:

C_Y ≜ (1/N) E[Y_N Y_N^H] = (1/N) A E[S_N S_N^H] A^H = A C_s A^H. [1.132]

Moreover, from the SVD [1.127] of Y_N, the empirical covariance of the observation matrix is given by:

Ĉ_Y ≜ (1/N) Y_N Y_N^H = (1/N) U_P Σ_P^2 U_P^H. [1.133]

Identifying this empirical covariance with the exact covariance [1.132] allows us to deduce that A C_s^{1/2} = (1/√N) U_P Σ_P, which gives the following estimate of the mixture matrix:

Â = (1/√N) U_P Σ_P C_s^{−1/2}, [1.134]

which corresponds to the expression [1.129].

Applying the LS method to the observation equation Y_N = A S_N with the hypothesis that the mixture matrix has full column rank yields the LS estimator of the source matrix:

(Ŝ_LS)_N = A^† Y_N = (A^H A)^{−1} A^H Y_N. [1.135]

Replacing A by its estimate [1.129] gives:

(Ŝ_LS)_N = √N C_s^{1/2} (Σ_P U_P^H U_P Σ_P)^{−1} C_s^{1/2} C_s^{−1/2} Σ_P U_P^H Y_N,

or alternatively, after simplification:

(Ŝ_LS)_N = √N C_s^{1/2} Σ_P^{−1} U_P^H Y_N.

After replacing Y_N with its SVD [1.127], we also have:

(Ŝ_LS)_N = √N C_s^{1/2} Σ_P^{−1} U_P^H (U_P Σ_P V_P^H) = √N C_s^{1/2} V_P^H,

which proves the formula [1.128]. □

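As a sanity check on Proposition 1.21, here is a minimal NumPy sketch of the noiseless case with unit-variance sources (i.e. C_s = I_P, the simplification discussed in the remark below); the data are synthetic and the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
M, P, N = 6, 3, 1000
A = rng.standard_normal((M, P))          # unknown mixing matrix (full column rank)
S = rng.standard_normal((P, N))          # unit-variance, uncorrelated sources
Y = A @ S                                # observation matrix Y_N = A S_N

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
UP, sP, VPt = U[:, :P], s[:P], Vt[:P, :]

S_hat = np.sqrt(N) * VPt                 # source estimate (unit-variance case)
A_hat = UP @ np.diag(sP) / np.sqrt(N)    # mixture estimate

# the data are recovered exactly; A_hat and S_hat match A and S only up to
# the usual P x P rotation/scaling ambiguity of blind separation
print(np.allclose(A_hat @ S_hat, Y))
```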

REMARK 1.22.–

– It can be checked that:

Â (Ŝ_LS)_N = (1/√N) U_P Σ_P C_s^{−1/2} (√N C_s^{1/2} V_P^H) = U_P Σ_P V_P^H = Y_N.

The data are therefore recovered exactly by using the mixture and source matrices estimated using the SVD of the observation matrix Y_N.

– In realistic situations, the variance of the sources is generally unknown and is therefore assumed to be one, which is equivalent to choosing the identity matrix C_s = I_P as the spatial covariance matrix of the sources. The latter are then estimated up to a scaling factor, and equations [1.128] and [1.129] simplify as follows:

(Ŝ_LS)_N = √N Σ_P^{−1} U_P^H Y_N = √N V_P^H ; Â = (1/√N) U_P Σ_P. [1.136]

1.5.12.2. Noisy case

Let us now consider the noisy case, with:

y_n = x_n + b_n = A s_n + b_n,

[1.137]

where the additive white noise b_n ∈ C^M is assumed to be stationary, centered, with spatial covariance C_b = σ_b^2 I_M, and independent of the source signals.

We define the noisy observation matrix YN constructed from N samples of the measured signal yn , with N ≥ M , as follows: YN = [y1 · · · yN ] = ASN + BN ∈ CM ×N .

[1.138]

PROPOSITION 1.23.– In the noisy case defined above, the estimates of the source and mixture matrices are given by:

(Ŝ_LS)_N = √N C_s^{1/2} (Σ_P^2 − N σ_b^2 I_P)^{−1/2} U_P^H Y_N [1.139]
         = √N C_s^{1/2} (Σ_P^2 − N σ_b^2 I_P)^{−1/2} Σ_P V_P^H [1.140]

Â = (1/√N) U_P (Σ_P^2 − N σ_b^2 I_P)^{1/2} C_s^{−1/2}. [1.141]

PROOF.– From [1.138], we can deduce the LS estimator of the source matrix, which is given by a formula⁹ that is identical to [1.135]:

(Ŝ_LS)_N = (A^H A)^{−1} A^H Y_N. [1.142]

9 In the case of the observation equation Y_N = A S_N + B_N, with an additive white Gaussian noise (AWGN) with covariance matrix C_B = σ_b^2 I_M, i.e. when the M components of b_n are spatially uncorrelated and have the same variance σ_b^2, the least squares estimator is optimal, in the sense of coinciding with the maximum likelihood (ML) estimator, whose expression is (Ŝ_ML)_N = (A^H C_b^{−1} A)^{−1} A^H C_b^{−1} Y_N = (A^H A)^{−1} A^H Y_N = (Ŝ_LS)_N.


As for the covariance C_S of the sources, defined in [1.131], the spatiotemporal non-correlation of the noise implies that C_B = σ_b^2 I_M. Hence, the covariance of Y_N deduced from the model [1.138] is given by:

C_Y ≜ (1/N) E[Y_N Y_N^H] = A C_s A^H + σ_b^2 I_M. [1.143]

Taking into account the full SVD of Y_N = UΣV^H, decomposed in the same way as [1.75], with r = M and k = P, the empirical covariance matrix of the observations can be decomposed as follows:

Ĉ_Y ≜ (1/N) Y_N Y_N^H = (1/N) (U_P Σ_P^2 U_P^H + U_{M−P} Σ_{M−P}^2 U_{M−P}^H). [1.144]

This decomposition gives two sets of left singular vectors. The first P are associated with the signal subspace, and the M − P others, associated with the M − P smallest singular values, define the noise subspace. Since the eigenvectors form an orthonormal basis, these two subspaces are orthogonal. By separating the signal space associated with the P largest singular values of the SVD from the noise space, we can write:

I_M = U U^H = [U_P | U_{M−P}] [U_P | U_{M−P}]^H = U_P U_P^H + U_{M−P} U_{M−P}^H.

By identifying the signal components of the expressions [1.143] and [1.144] of C_Y, we deduce:

A C_s A^H = (1/N) U_P (Σ_P^2 − N σ_b^2 I_P) U_P^H, [1.145]

which gives the expression [1.141] of the estimated mixture matrix. Furthermore, by identifying the noise components of C_Y in [1.143] and [1.144], we have σ_p^2 = N σ_b^2 for p ∈ {P + 1, ···, M}, where σ_p is the pth singular value of Y_N.

After replacing A with its estimate [1.141] in the expression [1.142] of the LS estimator, we obtain the following estimate for the source matrix:

(Ŝ_LS)_N = √N [C_s^{−1/2} (Σ_P^2 − N σ_b^2 I_P) C_s^{−1/2}]^{−1} C_s^{−1/2} (Σ_P^2 − N σ_b^2 I_P)^{1/2} U_P^H Y_N
         = √N C_s^{1/2} (Σ_P^2 − N σ_b^2 I_P)^{−1/2} U_P^H Y_N. [1.146]

Furthermore, from the SVD [1.127] of Y_N, we deduce:

(Ŝ_LS)_N = √N C_s^{1/2} (Σ_P^2 − N σ_b^2 I_P)^{−1/2} Σ_P V_P^H. [1.147]

This proves the formula [1.140]. □

REMARK 1.24.–

– Setting σ_b = 0 makes equations [1.139]–[1.141] identical to equations [1.128] and [1.129] for the noiseless case.


– If we assume that the sources have unit variance, equations [1.139]–[1.141] simplify as follows (Zarzoso and Nandi 1999):

(Ŝ_LS)_N = √N (Σ_P^2 − N σ_b^2 I_P)^{−1/2} Σ_P V_P^H ; Â = (1/√N) U_P (Σ_P^2 − N σ_b^2 I_P)^{1/2}.

1.6. CUR decomposition

As we saw above, the SVD gives the best rank-k approximation of a matrix A ∈ R^{I×J}, denoted A_k. Although very useful in many applications, the SVD has two drawbacks:

– the singular vectors are difficult to interpret in classification applications;

– the sparseness and non-negativity properties are not preserved.

These limitations led to the development of low-rank matrix decompositions that select certain columns, and possibly also rows, from the original matrix and are therefore expressed in terms of the data to analyze. This corresponds to the two following types of decomposition:

A ≈ CX and A ≈ CUR,

[1.148]

where the matrices C ∈ R^{I×c} and R ∈ R^{l×J} are formed by c < J columns and l < I rows of A, respectively; in general, l ≪ I and c ≪ J. The matrix U ∈ C^{c×l} is called the intersection matrix.

Column selection algorithms choose c columns from A in order to construct a matrix C that minimizes the following criterion (Boutsidis et al. 2010):

J_c = ‖A − C C^† A‖_ξ, [1.149]

where C^† is the Moore–Penrose pseudo-inverse of C, the term C C^† A is the projection of A onto the subspace spanned by the columns of C, and ξ denotes the norm being used: ξ = 2, ξ = F, or ξ = ∗, depending on whether we are considering the spectral norm, the Frobenius norm, or the nuclear norm. Note that choosing C = A makes the criterion zero, due to the properties of the pseudo-inverse (A A^† A = A). If the matrix A is large, making an optimal selection of c columns is a difficult combinatorial problem, since there are C_J^c possible combinations.

In the case of a sparse matrix A, decompositions of the form CX allow a sparse structure to be preserved for C, but not for the matrix X ∈ R^{c×J} of coefficients. This type of decomposition is only useful when I ≫ J.

CUR decomposition methods aim to determine the triplet (C, U, R) in such a way that the product CUR is as close as possible to A, which amounts to minimizing an approximation error criterion of the form:

J_c = ‖A − CUR‖_ξ,

[1.150]


where the norm ξ is defined as above. Like the SVD, the CUR decomposition of a matrix A ∈ RI×J provides a low-rank approximation (Mahoney and Drineas 2009). Of course, this decomposition is not unique. It is closely linked to the problem of selecting column and row subsets via random sampling of the data and assigning probabilities to the vectors of the original matrix or its truncated SVD as a function of the Euclidean norm of these vectors. In the case of a large data matrix, the CUR decomposition can be used for purposes such as compression, representation, classification or even recovery and extrapolation of data. This type of decomposition has been used in various fields of applications, including the analysis of documents on the Internet and of medical data (Mahoney and Drineas 2009; Sorensen and Embree 2015), the extrapolation of urban traffic data (Mitrovic et al. 2013) and image compression (Voronin and Martinsson 2017). Various algorithms have been proposed in the literature to compute a CUR decomposition (Drineas et al. 2006, 2008; Wang and Zhang 2013; Boutsidis and Woodruff 2014; Voronin and Martinsson 2017). Most algorithms are composed of two steps: first, a column and row selection method is applied to determine the matrices C and R, then the matrix U is optimized to minimize the approximation error criterion [1.150]. For the Frobenius norm (ξ = F ), the matrix U that minimizes the criterion [1.150] is given by: U = C† AR† ,

[1.151]

where C^† and R^† denote the pseudo-inverses of C and R, respectively. Determining C and R is equivalent to selecting a subset of columns of A and A^T, respectively.

Drineas et al. (2008) presented polynomial-time random algorithms for the first time to compute CX and CUR decompositions of a matrix A ∈ R^{I×J}. These algorithms require l = O(k ε^{−4} log^2 k) rows and c = O(k ε^{−2} log k) columns of A to be selected, achieving an approximation error ‖A − Â‖, where Â = CX or Â = CUR, that is bounded as follows, with a probability of at least 0.7:

‖A − Â‖_F ≤ (1 + ε) ‖A − A_k‖_F.

[1.152]

The parameter ε (satisfying 0 < ε < 1) defines the precision of the approximation, k (satisfying 1 ≤ k ≪ min(I, J)) is a rank parameter and A_k is the best rank-k approximation of A, obtained by truncating its SVD to order k, i.e. A_k = U_k Σ_k V_k^T. The approximation error is therefore equal, up to a constant factor, to the approximation error provided by the SVD truncated to order k.


CUR algorithms that compute the k principal singular vectors of A use a subspace sampling technique based on sampling probabilities that are proportional to the Euclidean norm of the singular vectors¹⁰:

p_i = (1/k) Σ_{j=1}^{k} (U_k)_{ij}^2 , q_j = (1/k) Σ_{i=1}^{k} (V_k^T)_{ij}^2.

[1.153]

The probabilities p_i and q_j, where i ∈ I and j ∈ J, are used to select rows and columns, respectively. For example, one possible column selection algorithm is to retain the c columns of A with the largest probabilities q_j, for j ∈ J, and to fix the corresponding columns of C as C_{.j} = (1/√(c q_j)) A_{.j}.

For a bound on the approximation error identical to [1.152], Wang and Zhang (2013) present a CUR algorithm that only requires l = O(k ε^{−2}) rows and c = O(k ε^{−1}) columns to be selected, while, for the algorithms proposed by Boutsidis and Woodruff (2014), these numbers are reduced to l = c = O(k ε^{−1}). The latter reference compares several CUR algorithms in terms of the number of columns and rows to be selected, the bound on the approximation errors and the computation time.

The CUR decomposition was extended to tensors by Mahoney et al. (2008), who applied it to hyperspectral imagery for the purpose of classification and to recommendation systems in order to reconstruct missing data, as well as by Caiafa and Cichocki (2010).
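To make the ingredients above concrete, here is a rough NumPy sketch (not one of the published algorithms cited above) that samples columns and rows with leverage-type probabilities of the form [1.153] and then computes the intersection matrix as in [1.151]; all sizes and the sampling scheme are illustrative:

```python
import numpy as np

def simple_cur(A, k, c, l, seed=0):
    """Rough CUR sketch: sample columns/rows with probabilities p_i, q_j, then U = C^+ A R^+."""
    rng = np.random.default_rng(seed)
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Vk = U[:, :k], Vt[:k, :].T
    p = (Uk**2).sum(axis=1) / k                     # row probabilities p_i (sum to 1)
    q = (Vk**2).sum(axis=1) / k                     # column probabilities q_j (sum to 1)
    cols = rng.choice(A.shape[1], size=c, replace=False, p=q)
    rows = rng.choice(A.shape[0], size=l, replace=False, p=p)
    C, R = A[:, cols], A[rows, :]
    Umid = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # minimizes ||A - CUR||_F
    return C, Umid, R

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 6)) @ rng.standard_normal((6, 40))   # exactly rank-6 matrix
C, Umid, R = simple_cur(A, k=6, c=10, l=12)
print(np.linalg.norm(A - C @ Umid @ R) / np.linalg.norm(A))       # small relative error
```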

10 Since the matrices U_k and V_k are column orthonormal, we have U_k^T U_k = V_k^T V_k = I_k, and, hence, ‖U_k‖_F^2 = ‖V_k‖_F^2 = k, from which we deduce Σ_{i=1}^{I} Σ_{j=1}^{k} (U_k)_{ij}^2 = Σ_{i=1}^{J} Σ_{j=1}^{k} (V_k)_{ij}^2 = k, and so Σ_{i=1}^{I} p_i = Σ_{j=1}^{J} q_j = 1, which explains why the quantities p_i and q_j can be interpreted as probabilities.

2 Hadamard, Kronecker and Khatri–Rao Products

2.1. Introduction In this chapter, we consider three matrix products that play a very important role in matrix computation: the Hadamard, Kronecker and Khatri–Rao products. The Kronecker product, also known as the tensor product, is widely used in many signal and image processing applications, such as compressed sampling using Kronecker dictionaries (Duarte and Baraniuk 2012), and for image restoration (Nagy et al. 2004). It is also widely used in systems theory (Brewer 1978) and in linear algebra to express and solve matrix equations such as Sylvester and Lyapunov equations (Bartels and Stewart 1972), as well as generalized Lyapunov equations (Lev Ari 2005). Additionally, it plays a key role in simplifying and implementing rapid transform algorithms, such as Fourier, Walsh-Hadamard, and Haar transforms (Regalia and Mitra 1989; Pitsianis 1997; Van Loan 2000). Recently, the identification of bilinear filters decomposed into Kronecker products was proposed in Paleologu et al. (2018), and so-called Kronecker receivers allowed various wireless communication systems based on different tensor models to be presented with a single, unified approach (da Costa et al. 2018). The Hadamard product has been applied in statistics (Styan 1973; Neudecker et al. 1995; Neudecker and Liu 2001). The Khatri–Rao product has been used to define space-time codes (Sidiropoulos and Budampati 2002) and space-time-frequency codes (de Almeida and Favier 2013) in the context of wireless communications. The Kronecker and Khatri–Rao products are also found in tensor calculus, since they naturally appear in the matrix unfoldings of basic tensor decompositions, such as PARAFAC (Harshman 1970) and Tucker (Tucker 1966) decompositions, and more generally constrained PARAFAC decompositions (Favier and de Almeida 2014).



For a block matrix, the block Kronecker product can also be defined (Tracy and Singh 1972; Hyland and Collins 1989; Koning et al. 1991). Two other key notions are also considered in this chapter: the vectorization operator vec and the vec-permutation matrix (Tracy and Dwyer 1969), also called the commutation matrix (Magnus and Neudecker 1979 and 1988). The most common use of the vec operator is to vectorize a matrix by stacking its columns on top of each other. This operator was originally introduced by Sylvester in 1884 to solve linear matrix equations. However, there are several variants of the vectorization operator that have led to different notations, depending on whether the rows or the columns are being stacked, whether the operator is being applied to a symmetric square matrix, in which case the vectorization only includes the distinct elements of the matrix located on and below the diagonal (Henderson and Searle 1979), or alternatively whether the operator is being applied to a partitioned matrix. The vec-permutation matrix, introduced for the first time in Tracy and Dwyer (1969) to compute matrix derivatives, transforms the vectorized form vec(A) of a matrix A into that of its transpose vec(AT ). It also transforms the Kronecker product of two vectors or two matrices (u ⊗ v ; A ⊗ B) into the Kronecker product of the same vectors or matrices in reverse order (v ⊗ u ; B ⊗ A). A historical overview of vectorization operators and the Kronecker, Hadamard, and Khatri–Rao products can be found in previous studies (Magnus and Neudecker 1979; Henderson and Searle 1981; Henderson et al. 1983; Regalia and Mitra 1989; Horn 1990; Koning et al. 1991; Pitsianis 1997; Liu and Trenkler 2008; Van Loan 2009). The following few paragraphs describe the content of this chapter. The next section will define the notation used for the matrix products considered in this chapter. The Hadamard, Kronecker and Khatri–Rao products will then be presented in sections 2.3, 2.4 and 2.9, respectively, and the Kronecker sum will be defined in section 2.5. The key properties satisfied by these products will be presented. This chapter will also provide an original contribution concerning the use of the index convention, which was introduced in Pollock (2011), in order to show some of the stated properties and relations. This convention, which generalizes Einstein’s summation convention, allows us to make both the proofs more concise and the formulae more compact. In section 2.6, it will first be used to express matrix products, then to describe the vectorization of matrices and matrix products, as well as to establish formulae relating to the trace of matrix products. In Chapter 3, it will be employed to define matrix and vector unfoldings of a tensor, and in Chapter 5 it will be used for matricizing PARAFAC and Tucker decompositions.


In section 2.7, we will introduce the notion of the commutation matrix, and we will illustrate how to use it to permute the factors of simple Kronecker products, multiple Kronecker products and block Kronecker products, as well as to vectorize partitioned matrices. In sections 2.8 and 2.10, we will present several relations between the diag operator and the Kronecker product, as well as between the vectorization operator and the Kronecker and Khatri–Rao products. Further relations between the diag operator and the Hadamard product and among the three matrix products will also be presented in sections 2.3 and 2.11. Finally, in section 2.12, we will describe various examples involving the use of Kronecker and Khatri–Rao products to: – compute and arrange first-order partial derivatives of a function using the tensor formalism; – solve matrix equations, in particular Lyapunov and Sylvester equations, which play an important role in the theory of linear systems. Algorithms for estimating the factor matrices of a Khatri–Rao product and a Kronecker product will be described. These algorithms are based on the construction of rank-one matrices and tensors from the vectorization of the factor matrices. This reduces the estimation problem to that of rank-one matrix or tensor approximation using the SVD algorithm, or the HOSVD algorithm in the case of multiple Khatri– Rao and Kronecker products. 2.2. Notation We will write A∗ , AT , AH , A† , Ai. , A.j , r(A) (or rA ), and det(A), for the conjugate, the transpose, the transconjugate (also known as conjugate transpose or Hermitian transpose), the Moore–Penrose pseudo-inverse, the ith row, the jth column, the rank and the determinant of A ∈ KI×J , respectively. Furthermore, ai , aij = (A)ij , aijk = (A)ijk , ai1 ,··· ,iN = (A)i1 ,··· ,iN are the elements of the vector a ∈ KI , the matrix A = [aij ] ∈ KI×J and the tensors A ∈ KI×J×K and A ∈ KI1 ×···×IN , respectively. The symbol 1I denotes a column vector of size I whose elements are all equal to 1. The elements of the matrices 0I×J and 1I×J of size (I × J) are all equal to 0 and (N ) 1, respectively. The symbols IN and en denote the identity matrix of order N and the nth vector of the canonical basis of the vector space RN , respectively. The notation for the various matrix operations considered in this chapter is summarized in Table 2.1, with a reference to the section where it is introduced in


each case1. Note that the symbols chosen to represent these operations are not universal. For example, in the literature, the Hadamard product is also denoted by ∗ or ◦. Similarly, the symbols  and are also used for the Khatri–Rao and block Kronecker product, respectively. Notation

Products

Section

 ⊗ ⊗b |⊗| 

Hadamard Kronecker block Kronecker (Tracy-Singh) strong Kronecker Khatri-Rao

2.3 2.4 2.7.5 2.7.6 2.9

Notation

Sum

Section



Kronecker

2.5

Table 2.1. Hadamard, Kronecker, and Khatri-Rao products and Kronecker sum

2.3. Hadamard product

2.3.1. Definition and identities

Let A and B ∈ K^{I×J} be two matrices of the same size. The Hadamard product of A and B is the matrix C ∈ K^{I×J} defined as follows:

C = A ⊙ B = [ a_{11}b_{11}  a_{12}b_{12}  ···  a_{1J}b_{1J}
              a_{21}b_{21}  a_{22}b_{22}  ···  a_{2J}b_{2J}
                  ⋮             ⋮                  ⋮
              a_{I1}b_{I1}  a_{I2}b_{I2}  ···  a_{IJ}b_{IJ} ], [2.1]

i.e. c_{ij} = a_{ij} b_{ij}, and therefore C = [a_{ij} b_{ij}], with i ∈ I, j ∈ J.

PROPOSITION 2.1.– For A ∈ K^{I×J} and B ∈ K^{I×I}, the following identities follow directly from the definition of the Hadamard product:

A ⊙ 0_{I×J} = 0_{I×J} ⊙ A = 0_{I×J}

[2.2]

A ⊙ 1_{I×J} = 1_{I×J} ⊙ A = A

[2.3]

B ⊙ I_I = I_I ⊙ B = diag(b_{11}, b_{22}, ..., b_{II}).

[2.4]

1 The Hadamard product is sometimes also called the Schur or Schur–Hadamard product. This is due to Schur’s (product) theorem (1911), which establishes that the Hadamard product of two positive semi-definite matrices is also a positive semi-definite matrix. See Horn (1990) for a historical overview of the Hadamard product and its various names.


2.3.2. Fundamental properties

PROPOSITION 2.2.– The Hadamard product satisfies the properties of commutativity, associativity and distributivity with respect to addition:

A ⊙ B = B ⊙ A
A ⊙ (B ⊙ C) = (A ⊙ B) ⊙ C = A ⊙ B ⊙ C

[2.5] [2.6]

(A + B) ⊙ (C + D) = (A ⊙ C) + (A ⊙ D) + (B ⊙ C) + (B ⊙ D). [2.7]

The Hadamard product also satisfies the following properties:

(A ⊙ B)^T = A^T ⊙ B^T, (A ⊙ B)^H = A^H ⊙ B^H
r_{A⊙B} ≤ r_A r_B.

[2.8] [2.9]

The relations [2.8] continue to hold if the matrices are replaced by vectors. The Hadamard product satisfies the following structural properties: – Using the properties [2.8], it is easy to deduce that if A and B ∈ CI×I are symmetric (resp. Hermitian), then their Hadamard product is itself a symmetric (respectively, Hermitian) matrix:  T  A =A ⇒ (A  B)T = AT  BT = A  B [2.10] BT = B   H A =A ⇒ (A  B)H = AH  BH = A  B. [2.11] BH = B – The Hadamard product of Hermitian positive definite or semi-definite matrices is itself a positive definite or semi-definite matrix, as summarized in Table 2.2. Hypotheses | Properties A, B > 0 | A  B > 0 A, B ≥ 0 | A  B ≥ 0

Table 2.2. Hadamard product of Hermitian positive (semi-)definite matrices

2.3.3. Basic relations Table 2.3 states a few basic relations satisfied by the Hadamard product. In the last relation of this table, the matrices A, B, C can be permuted without changing the result.


x, y, z ∈ KI and u, v ∈ KJ xT (y  z) = yT (x  z) = zT (x  y) =

I

xi yi zi  (y  =  =  = yi zi uj ∈ KI×J

 (xuT )  (yvT ) = (x  y)(u  v)T = xi yi uj vj ∈ KI×J z)uT

(yuT )

(z1T J)

(y1T J)

(zuT )

i=1

A, B, C, D ∈ KI×K

  K I×I (A  B)(C  D)T = k=1 aik bik cjk djk ∈ K  K I×I II  [(A  B)CT ] = diag k=1 aik bik cjk ∈ K

Table 2.3. Basic relations satisfied by the Hadamard product

2.3.4. Relations between the diag operator and Hadamard product Table 2.4 presents a few relations between the diag operator and Hadamard product. u, v ∈ KI , w ∈ KJ , A, B ∈ KI×J diag(u)v = u  v = diag(v)u ∈ KI I×J diag(u)A = u 1T J  A = [ui aij ] ∈ K



A diag(w) = A  1I wT = [aij wj ] ∈ KI×J diag(u)A  B diag(w) = [ui aij bij wj ] ∈ KI×J = diag(u)(A  B)diag(w)

= diag(u)A diag(w)  B = A  diag(u)B diag(w) = A diag(w)  diag(u)B     vec(A  B) = diag vec(A) vec(B) = diag vec(B) vec(A)

Table 2.4. The diag operator and the Hadamard product
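As a quick numerical illustration (not part of the original text), the first two relations of Table 2.4 can be checked with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.standard_normal(4), rng.standard_normal(4)
A = rng.standard_normal((4, 3))

# diag(u) v = u * v  (Hadamard product of vectors)
print(np.allclose(np.diag(u) @ v, u * v))
# diag(u) A = (u 1_J^T) * A, i.e. row i of A scaled by u_i
print(np.allclose(np.diag(u) @ A, np.outer(u, np.ones(3)) * A))
```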

P ROOF .– Recall that the diag operator transforms a vector argument into a diagonal matrix whose diagonal elements are the components of the vector: diag(u) =

I  i=1

(I)

(I)

ui ei (ei )T .

[2.12]

Hadamard, Kronecker and Khatri–Rao Products (I)

– For u, v ∈ KI , the orthonormality property of the basis vectors ei commutativity of the Hadamard product give us: I I I    (I) (I) (I) (I) diag(u)v = ui ei (ei )T ( vj ej ) = ui vj δij ei i=1

j=1


and the

i,j=1

or alternatively: diag(u)v =

I 

(I)

ui vi ei

= u  v = v  u = diag(v)u ∈ KI .

[2.13]

i=1

– By decomposing the matrix A using its columns and the relation [2.13], with v = A.j , together with the second relation in Table 2.3, we have: diag(u)A =

J 

(J)

diag(u)A.j (ej )T =

j=1

= u 1TJ 

J  

 (J) u  A.j (ej )T

j=1 J 

(J)

A.j (ej )T



= u 1TJ  A = [ui aij ] ∈ KI×J .

j=1

[2.14] – Using the relation [2.14] and the transposition and commutativity properties of the Hadamard product, we obtain:  T Adiag(w) = diag(w)AT = (w 1TI  AT )T = A  1I wT = [aij wj ] ∈ KI×J .

[2.15]

– Using the relations [2.14] and [2.15], as well as the commutativity and associativity properties of the Hadamard product, we obtain:     diag(u)A  Bdiag(w) = [ui aij ]  [wj bij ] = [ui wj aij bij ] ∈ KI×J   = [ui wj aij ]  [bij ] = diag(u)A diag(w)  B   = [aij ]  [ui wj bij ] = A  diag(u)B diag(w)     = [wj aij ]  [ui bij ] = A diag(w)  diag(u)B , and



       diag(u)A  Bdiag(w) = u 1TJ  A  B  1I wT = u 1TJ  (A  B)  1I wT = diag(u)(A  B)diag(w).


– Applying the formula [2.13] with (u, v) = (vec(A), vec(B)) gives:     diag vec(A) vec(B) = vec(A)  vec(B) = diag vec(B) vec(A). The last relation in Table 2.4 follows from the property vec(A)  vec(B) = vec(A  B). 

This concludes the proof of the relations in Table 2.4. 2.4. Kronecker product 2.4.1. Kronecker product of vectors 2.4.1.1. Definition

Below, the Kronecker product is defined for two vectors, then for three and N vectors. – For u ∈ KI and v ∈ KJ , we have: x = u ⊗ v ∈ KIJ ⇐⇒ xj+(i−1)J = ui vj or equivalently:

⎤ u1 v ⎥ ⎢ u ⊗ v = [u1 v1 , u1 v2 , · · · , u1 vJ , u2 v1 , · · · , uI vJ ]T = ⎣ ... ⎦ . uI v ⎡

– Similarly, for u ∈ KI , v ∈ KJ , and w ∈ KK , we have: x = u ⊗ v ⊗ w ∈ KIJK ⇐⇒ xk+(j−1)K+(i−1)JK = ui vj wk . R EMARK 2.3.– By convention, the order of the dimensions in a product IJK follows the order of variation of the corresponding indices (i, j, k). For example, KIJK means that the index i varies more slowly than j, which itself varies more slowly than k. – In general, for N vectors u(n) ∈ KIn , the multiple Kronecker product N

⊗ u(n)  u(1) ⊗ u(2) ⊗ · · · ⊗ u(N ) ∈ KI1 I2 ···IN is such that:

n=1

N

⊗ u(n)

n=1

! = i

N  n=1

(n)

uin with i = iN + (iN −1 )IN +

N −2  n=1

(in − 1)

N 

Ij .

j=n+1

[2.16] Note that the Kronecker product of column vectors is identical to their Khatri–Rao product.

Hadamard, Kronecker and Khatri–Rao Products

55

2.4.1.2. Fundamental properties P ROPOSITION 2.4.– The Kronecker product satisfies the properties of associativity and distributivity with respect to addition: u ⊗ (v ⊗ w) = (u ⊗ v) ⊗ w = u ⊗ v ⊗ w (u + v) ⊗ (x + y) = (u ⊗ x) + (u ⊗ y) + (v ⊗ x) + (v ⊗ y).

[2.17] [2.18]

However, the Kronecker product is not commutative, i.e. if u = v, then, in general, u ⊗ v = v ⊗ u. It also satisfies the following properties: (αu) ⊗ v = u ⊗ (αv) = α(u ⊗ v) , ∀α ∈ K (u ⊗ v)T = uT ⊗ vT , (u ⊗ v)H = uH ⊗ vH .

[2.19]

2.4.1.3. Basic relations In this section, we present various relations involving Kronecker products of vectors. Table 2.5 gives two basic relations satisfied by the Kronecker product of vectors. u ∈ KI , v ∈ KJ uH ⊗ v = vuH = v ⊗ uH ∈ KJ×I u ⊗ vH = uvH = vH ⊗ u ∈ KI×J

Table 2.5. Basic relations satisfied by the Kronecker product of vectors

P ROOF .– These Kronecker products can be developed as follows: uH ⊗ v = [u∗1 v, · · · , u∗I v] = vuH ⎡ ⎤ ⎡ ⎤ v1 u∗1 · · · v1 u∗I v 1 uH ⎢ ⎥ .. ⎥ = ⎢ .. H = ⎣ ... ⎦=v⊗u . ⎦ ⎣ . vJ u∗1 · · · vJ u∗I vJ u H ⎤ ⎡ u1 v H ⎥ ⎢ .. H H u ⊗ vH = ⎣ ⎦ = uv = v ⊗ u by (2.20), . uI v

[2.20]

[2.21]

[2.22]

H

which concludes the proof of the relations in Table 2.5.



56

Matrix and Tensor Decompositions in Signal Processing

These relations continue to hold if transconjugation is replaced by transposition, i.e.: uT ⊗ v = vuT = v ⊗ uT

[2.23]

u ⊗ vT = uvT = vT ⊗ u.

[2.24]

P ROPOSITION 2.5.– From the above, we can conclude that the Kronecker product of a column vector with a row vector is independent of the order of the vectors. Therefore, in a multiple Kronecker product of row vectors and column vectors, we can permute a row vector followed by a column vector (or vice versa) without changing the result, but we cannot permute two consecutive column or row vectors, since the Kronecker product is not commutative. Thus, for u ∈ KI , v ∈ KJ and w ∈ KK , we have: uT ⊗ v ⊗ w = v ⊗ uT ⊗ w = v ⊗ w ⊗ uT ∈ KJK×I

= w ⊗ v ⊗ uT ∈ KKJ×I uT ⊗ v ⊗ wT = uT ⊗ wT ⊗ v = v ⊗ uT ⊗ wT ∈ KJ×IK

= v ⊗ wT ⊗ uT ∈ KJ×KI . E XAMPLE 2.6.– For u ∈ K3 , v ∈ K2 and w ∈ K2 , we have:

u1 v1 u2 v1 u3 v1 ⊗w uT ⊗ v ⊗ w = u1 v2 u2 v2 u3 v2 ⎡ ⎤ u1 v 1 w 1 u 2 v 1 w 1 u 3 v 1 w 1 ⎢ u1 v 1 w 2 u 2 v 1 w 2 u3 v 1 w 2 ⎥ ⎥ = ⎢ ⎣ u1 v 2 w 1 u 2 v 2 w 1 u3 v 2 w 1 ⎦ u1 v 2 w 2 u 2 v 2 w 2 u3 v 2 w 2 and



⎡ ⎤ v 1 w 1 u1 v1 w1 ⎢ v1 w2 ⎥ ⎢ v 1 w 2 u1 T T ⎢ ⎥ v⊗w⊗u = ⎢ ⎣ v 2 w 1 ⎦ ⊗ u = ⎣ v 2 w 1 u1 v2 w2 v 2 w 2 u1

v1 w1 u 2 v1 w2 u 2 v2 w1 u 2 v2 w2 u 2

⎤ v 1 w 1 u3 v 1 w 2 u3 ⎥ ⎥ v 2 w 1 u3 ⎦ v 2 w 2 u3

= uT ⊗ v ⊗ w. By the property [2.24], we can deduce the following identity: (I×J)

Eij

(I)

(J)

(I)

= ei (ej )T = ei

(J)

⊗ (ej )T ∈ RI×J .

[2.25]

Hadamard, Kronecker and Khatri–Rao Products

57

Omitting the dimensions of the basis vectors to alleviate the notation, A ∈ KI×J (I×J) and AT can be expressed as follows in the canonical basis {Eij }: A=

J I  

(I×J)

aij Eij

=

i=1 j=1

AT =

J I  

aij ei eTj =

i=1 j=1

J I  

(J×I)

aij Eji

=

i=1 j=1

J I  

J I  

aij ei ⊗ eTj

[2.26]

aij ej ⊗ eTi .

[2.27]

i=1 j=1

aij ej eTi =

i=1 j=1

J I   i=1 j=1

2.4.1.4. Rank-one matrices and Kronecker products of vectors Table 2.6 states a few relations satisfied by rank-one matrices expressed in terms of Kronecker products of vectors. u ∈ KI , v ∈ KJ , x ∈ KK , y ∈ KL (u ⊗ v)(x ⊗ y)H = uxH ⊗ vyH ∈ KIJ×KL Generalization to P vectors  P up ∈ KIp , xp ∈ KJp , I = P p=1 Ip , J = p=1 Jp P P H P ⊗ up ⊗ xp = ⊗ up xH ∈ KI×J p p=1

p=1

p=1

Special cases (u ⊗

v)xH

= (u ⊗ v)(xH ⊗ 1) = uxH ⊗ v ∈ KIJ×K

(u ⊗ v)xH = (u ⊗ v)(1 ⊗ xH ) = u ⊗ vxH ∈ KIJ×K x(u ⊗ v)H = xuH ⊗ vH = uH ⊗ xvH ∈ KK×IJ

Table 2.6. Rank-one matrices and Kronecker products of vectors

Using the associativity property [2.17] and the transconjugation property [2.19], as well as the properties stated in Tables 2.5 and 2.6, the product (u ⊗ v)(x ⊗ y)H can be decomposed in the various ways summarized in Table 2.7. (u ⊗ v)(x ⊗ y)H = (u ⊗ v) ⊗ (x ⊗ y)H = u ⊗ v ⊗ xH ⊗ yH = u ⊗ vxH ⊗ yH (u ⊗ v)(x ⊗ y)H = (x ⊗ y)H ⊗ (u ⊗ v) = xH ⊗ yH ⊗ u ⊗ v = xH ⊗ uyH ⊗ v

Table 2.7. Other decompositions of (u ⊗ v)(x ⊗ y)H

2.4.1.5. Vectorization of a rank-one matrix In this section, we present several relations between the vectorization operator applied to rank-one matrices and the Kronecker product of vectors. These relations can

58

Matrix and Tensor Decompositions in Signal Processing

easily be shown using the definition of the vectorization operator and the associativity [2.17] and transposition/transconjugation [2.19] properties of the Kronecker product, together with the identity [2.22]. For x ∈ KI , y ∈ KJ , u ∈ KK , v ∈ KL , we have: vec(x ⊗ yH ) = vec(xyH ) = y∗ ⊗ x vec(x ⊗ y ⊗ uH ) = u∗ ⊗ x ⊗ y   vec (x ⊗ y)(u ⊗ v)H = u∗ ⊗ v∗ ⊗ x ⊗ y.

[2.28] [2.29] [2.30]

These relations continue to hold if transconjugation is replaced by transposition, removing the conjugation of the vectors. 2.4.2. Kronecker product of matrices 2.4.2.1. Definitions and identities Below, we define the Kronecker product of two matrices, followed by the multiple Kronecker product of more than two matrices. – Given two matrices A ∈ KI×J and B ∈ KM ×N of arbitrary size, the right Kronecker product of A by B is the matrix C ∈ KIM ×JN defined as follows: ⎡ ⎤ a11 B a12 B · · · a1J B ⎢ a21 B a22 B · · · a2J B ⎥ ⎢ ⎥ C=A⊗B=⎢ . [2.31] ⎥ = [aij B]. .. .. ⎣ .. ⎦ . . aI1 B aI2 B · · · aIJ B The Kronecker product is a matrix partitioned into (I, J) blocks, where the block (i, j) is given by the matrix aij B ∈ KM ×N . The element aij bmn is located at the position ((i − 1)M + m, (j − 1)N + n) in A ⊗ B. – For Ap ∈ CIp ×Jp , p ∈ P , we define the multiple Kronecker product as follows: P

⊗ Ap  A1 ⊗ A2 ⊗ · · · ⊗ AP ∈ CI1 ···IP ×J1 ···JP .

p=1

[2.32]

Hadamard, Kronecker and Khatri–Rao Products

⎛ ⎜a11 B A⊗B=⎜ ⎝ ··· a21 B

! ! a12 b b , B = 11 12 , we have: a22 b21 b22 ⎛ .. ⎞ ⎜a11 b11 a11 b12 . a12 b11 ⎜ .. a12 B⎟ ⎜ ⎜a11 b21 a11 b22 . a12 b21 ⎜ ··· ⎟ ··· ··· ⎠ = ⎜ ··· ⎜ . . ⎜a21 b11 a21 b12 . a22 b11 a22 B ⎝ . a21 b21 a21 b22 .. a22 b21

a11 a21

E XAMPLE 2.7.– For A =

.. . .. .

59

⎞ a12 b12 ⎟ ⎟ a12 b22 ⎟ ⎟ ··· ⎟ ⎟. ⎟ a22 b12 ⎟ ⎠ a22 b22

The Kronecker product [2.31] can also be partitioned into column block or row block matrices. Indeed, the columns (respectively, rows) of A ⊗ B are formed by all the Kronecker products of one column (respectively, row) of A with one column (respectively, row) of B, arranging the columns (respectively, rows) in lexicographical order in each case: A ⊗ B = A.1 ⊗ B.1 · · · A.1 ⊗ B.N · · · A.J ⊗ B.1 · · · A.J ⊗ B.N [2.33] and



⎤ A1. ⊗ B1. ⎢ ⎥ .. ⎢ ⎥ . ⎢ ⎥ ⎢ A1. ⊗ BM. ⎥ ⎢ ⎥ ⎢ ⎥ .. A⊗B=⎢ ⎥. . ⎢ ⎥ ⎢ AI. ⊗ B1. ⎥ ⎢ ⎥ ⎢ ⎥ .. ⎣ ⎦ . AI. ⊗ BM.

P ROPOSITION 2.8.– For A ∈ KI×J , it is easy to check the following identities:   IN ⊗ A = diag A, A, · · · , A ∈ CN I×N J [2.34] )* + ( N terms

1I ⊗ 1TJ = 1TJ ⊗ 1I = 1I×J

[2.35]

II ⊗ IJ = IIJ

[2.36]

(αIM ) ⊗ βIN = αβ IM N ,

[2.37]

where 1I denotes the vector of ones of size I. 2.4.2.2. Fundamental properties For the rest of this chapter, we will assume that the matrices have suitable sizes to ensure that all operations are well defined.

60

Matrix and Tensor Decompositions in Signal Processing

P ROPOSITION 2.9.– Like the Kronecker product of vectors, the Kronecker product of matrices satisfies the properties of associativity, distributivity with respect to addition, transposition/transconjugation and multiplication by a scalar, as summarized in the following: A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C = A ⊗ B ⊗ C

[2.38]

(A + B) ⊗ (C + D) = (A ⊗ C) + (A ⊗ D) + (B ⊗ C) + (B ⊗ D) [2.39] (A ⊗ B)T = AT ⊗ BT , (A ⊗ B)H = AH ⊗ BH

[2.40]

(αA) ⊗ B = A ⊗ (αB) = α(A ⊗ B) , ∀α ∈ K.

[2.41]

If A = B, their Kronecker product is not commutative in general, i.e. A ⊗ B = B ⊗ A. P ROPOSITION 2.10.– If A ∈ CI×I and B ∈ CJ×J are symmetric (respectively, Hermitian), then their Kronecker product is itself a symmetric (respectively, Hermitian) matrix. P ROOF .– Using the transposition/transconjugation property [2.40] of the Kronecker product, we can deduce that:  T  A =A ⇒ (A ⊗ B)T = AT ⊗ BT = A ⊗ B [2.42] BT = B   H A =A ⇒ (A ⊗ B)H = AH ⊗ BH = A ⊗ B, [2.43] BH = B which shows that A ⊗ B is symmetric (respectively, Hermitian).



P ROPOSITION 2.11.– For A ∈ KI×J , B ∈ KM ×N , C ∈ KJ×K and D ∈ KN ×P , we have the following identities: A ⊗ B = (A ⊗ IM )(IJ ⊗ B) = (II ⊗ B)(A ⊗ IN )

[2.44]

AC ⊗ IM = (A ⊗ IM )(C ⊗ IM )

[2.45]

IN ⊗ BD = (IN ⊗ B)(IN ⊗ D)

[2.46]

(IN ⊗ A)k = IN ⊗ Ak .

[2.47]

Hadamard, Kronecker and Khatri–Rao Products

P ROOF .– – The equation of the definition [2.31] can be rewritten as: ⎡ ⎤⎡ a11 IM a12 IM · · · a1J IM B ⎢ a21 IM a22 IM · · · a2J IM ⎥ ⎢ 0 ⎢ ⎥⎢ A⊗B= ⎢ ⎥ ⎢ .. .. .. .. ⎣ ⎦⎣ . . . . aI1 IM

···

aI2 IM

aIJ IM

0

0 ··· B ··· .. .

0 0 .. .

0

B

···

⎤ ⎥ ⎥ ⎥ ⎦

= (A ⊗ IM )(IJ ⊗ B). Similarly, we have: ⎡ B ⎢ 0 ⎢ A⊗B= ⎢ . ⎣ .. 0

0 ··· B ··· .. .

0 0 .. .

0

B

···

61

[2.48] ⎤⎡ ⎥⎢ ⎥⎢ ⎥⎢ ⎦⎣

a11 IN a21 IN .. .

a12 IN a22 IN .. .

··· ···

a1J IN a2J IN .. .

aI1 IN

aI2 IN

···

aIJ IN

= (II ⊗ B)(A ⊗ IN ),

⎤ ⎥ ⎥ ⎥ ⎦ [2.49]

which proves the identities [2.44]. – Using the definition of the Kronecker product, we have: ⎡ ⎤⎡ a11 IM · · · a1J IM c11 IM ⎢ ⎥ ⎢ . . .. .. .. (A ⊗ IM )(C ⊗ IM ) = ⎣ ⎦⎣ . aI1 IM ⎡ J ⎢ = ⎢ ⎣

with

J j=1

···

aIJ IM

a1j cj1 IM .. . J a j=1 Ij cj1 IM j=1

··· ···

···

⎤ c1K IM ⎥ .. ⎦ . cJK IM ⎤

cJ1 IM · · · J j=1 a1j cjK IM ⎥ .. ⎥, . ⎦ J a c I j=1 Ij jK M

aij cjk = (AC)ik , which implies:

(A ⊗ IM )(C ⊗ IM ) = AC ⊗ IM . This proves [2.45]. The identity [2.46] can be shown in the same way. The identity [2.47] follows directly from [2.46].  Using the properties [2.44]–[2.46], we can show the fundamental identity stated in the following proposition. P ROPOSITION 2.12.– For A ∈ KI×J , B ∈ KM ×N , C ∈ KJ×K and D ∈ KN ×P , we have the fundamental identity: (A ⊗ B)(C ⊗ D) = AC ⊗ BD.

[2.50]

62

Matrix and Tensor Decompositions in Signal Processing

P ROOF .– Using the results of the previous proposition, we obtain: (A ⊗ B)(C ⊗ D) = (A ⊗ IM )(IJ ⊗ B)(C ⊗ IN )(IK ⊗ D) = (A ⊗ IM )(C ⊗ B)(IK ⊗ D) = (A ⊗ IM )(C ⊗ IM )(IK ⊗ B)(IK ⊗ D) = (AC ⊗ IM )(IK ⊗ BD) = AC ⊗ BD. 

This proves the identity [2.50].

The properties and identities presented above continue to hold if the matrices are replaced by vectors. For instance, from the relation [2.50], and using the fact that a scalar is equal to its transpose, we can deduce the relations stated in Table 2.8. x, u ∈ KI , y, v ∈ KJ (x ⊗ y)T (u ⊗ v) = xT u ⊗ yT v = (xT u)(yT v) x, y ∈ KI (x ⊗

x∗ )T (y∗

⊗ y) =

(xT y∗ )(xH y)

= (xT y∗ )T (xH y) = yH (xxH )y

x, y, u, v ∈ KI (x ⊗ yT )(uT ⊗ v) = (xuT ) ⊗ (yT v) = (yT v)xuT

Table 2.8. Relations involving Kronecker products of vectors

2.4.2.3. Multiple Kronecker product Table 2.9 states a generalization of the distributivity property [2.39] and the fundamental identity [2.50] to the case of the multiple Kronecker product. 2.4.2.4. Interpretation as a tensor product (I )

Let En = KIn and Fn = KJn , for n ∈ N , be vector spaces with bases {binn } (J )

and {cjnn }, respectively. Given N linear mappings fn : En → Fn and the associated matrices An ∈ KJn ×In with respect to these bases, consider the multilinear mapping: N  , N N  Fn , × En  (u1 , · · · , uN ) → ⊗ fn (un ) ∈

n=1

where

N -

n=1

[2.51]

n=1

Fn denotes the tensor space of the N vector spaces Fn . The multilinearity of

n=1

this mapping follows from the linearity of the fn , namely fn (λun +vn ) = λfn (un )+ fn (vn ) for every un , vn ∈ En and all λ ∈ K.

Hadamard, Kronecker and Khatri–Rao Products

63

Distributivity of the multiple Kronecker product with respect to addition  P   Q  P Q   p=1 Ap ⊗ p=1 q=1 Bq = q=1 Ap ⊗ Bq Fundamental identities for the multiple Kronecker product P  p=1



P

⊗ Ap



p=1

  P P Ap ⊗ Bp Ap ⊗ B p = p=1

p=1



P

P

⊗ B p = A1 B 1 ⊗ · · · ⊗ AP B P = ⊗

p=1

p=1

Ap B p

A ∈ KI×I , B ∈ KJ×J (A ⊗ B)p = Ap ⊗ Bp

Table 2.9. Properties of the multiple Kronecker product

By the universal property applied to this multilinear mapping, there exists a unique N N N linear mapping ⊗ fn from En into Fn such that (Favier 2019): n=1

N

⊗ fn

n=1





N

n=1

n=1

N

⊗ un → ⊗ fn (un ).

n=1

[2.52]

n=1

N

This equation defines the tensor product ⊗ fn of the linear mappings fn . By the n=1

linearity of fn , namely fn (un ) = An un , and the third property of the Kronecker product in Table 2.9, we deduce that: N , N En  ⊗ un → n=1

n=1 N

⊗ un →

n=1

N

N

⊗ fn (un ) = ⊗ (An un ) ∈

n=1



n=1



N

N , Fn n=1

N

⊗ An ( ⊗ un ).

n=1

[2.53]

n=1

N

This equation shows that the multiple Kronecker product ⊗ An is the matrix n=1

N

associated with the tensor product ⊗ fn of the linear mappings fn . n=1

2.4.2.5. Identities involving matrix-vector Kronecker products Using the fundamental identity [2.50] and the transposition property [2.40], it is easy to check the identities in Table 2.10. Note that, for x ∈ KM , we have: ⎡ ⎤ ⎡ x x1 IJ ⎢ ⎥ ⎢ . M J×J .. x ⊗ IJ = ⎣ , IJ ⊗ x = ⎣ ⎦∈K xM I J

⎤ ..

⎥ JM ×J . ⎦∈K

. x

64

Matrix and Tensor Decompositions in Signal Processing

x ∈ KM , y ∈ KJ , A ∈ KI×J , B ∈ KJ×K , C ∈ KK×J (x ⊗ A)B = (x ⊗ A)(1 ⊗ B) = x ⊗ (AB) (A ⊗ x)B = (A ⊗ x)(B ⊗ 1) = (AB) ⊗ x C(x ⊗ A)T = xT ⊗ (CAT ) C(A ⊗ x)T = (CAT ) ⊗ xT Special case: A = IJ (x ⊗ IJ )B = x ⊗ B , (IJ ⊗ x)B = B ⊗ x C(x ⊗ IJ )T = xT ⊗ C , C(IJ ⊗ x)T = C ⊗ xT Special case: x ∈ KM , y ∈ KJ x ⊗ y = (x ⊗ IJ )y = (IM ⊗ y)x

Table 2.10. Identities involving matrix-vector Kronecker products

2.4.3. Rank, trace, determinant and spectrum of a Kronecker product Table 2.11 presents a few properties of the rank, trace, determinant and spectrum of a Kronecker product. A ∈ KI×J , B ∈ KK×L rA⊗B = rB⊗A = rA rB A ∈ KI×I , B ∈ KJ×J   sp(A) = Ii=1 {λi } , sp(B) = J j=1 {μj } tr(A ⊗ B) = tr(B ⊗ A) = tr(A)tr(B) det(A ⊗ B) = det(B ⊗ A) = [det(A)]J [det(B)]I  sp(A ⊗ B) = sp(B ⊗ A) = I,J i,j=1 {λi μj } I,J eigenvectors(A ⊗ B) = i,j=1 {ui ⊗ vj }

Table 2.11. Rank, trace, determinant, and spectrum of a Kronecker product

P ROOF .– – For a proof of the relations involving the rank and trace of A ⊗ B, see sections 2.4.6 and 2.6.8. – For A ∈ KI×I and B ∈ KJ×J , the identities [2.44] give: det(A ⊗ B) = det(A ⊗ IJ )det(II ⊗ B)

[2.54]

det(B ⊗ A) = det(B ⊗ II )det(IJ ⊗ A).

[2.55]

Hadamard, Kronecker and Khatri–Rao Products

65

Using the property [2.171], shown in section 2.7.3, which relates the Kronecker products A ⊗ B and B ⊗ A, and noting that the determinant of a permutation matrix is equal to 1, we have: det(A ⊗ IJ ) = det(IJ ⊗ A) = [det(A)]J

[2.56]

det(B ⊗ II ) = det(II ⊗ B) = [det(B)]I .

[2.57]

Therefore, by substituting these two expressions into [2.54] and [2.55], we obtain: det(A ⊗ B) = det(B ⊗ A) = [det(A)]J [det(B)]I .

[2.58]

From this relation, we can deduce that the product A ⊗ B is non-singular if and only if A and B are non-singular. – Suppose that A ∈ CI×I and B ∈ CJ×J admit the eigenpairs (λi , ui ), i ∈ I, and (μj , vj ), j ∈ J, respectively. The fundamental relation [2.50], with C = ui , D = vj , allows us to write: (A ⊗ B)(ui ⊗ vj ) = Aui ⊗ Bvj = λi μj (ui ⊗ vj ),

[2.59]

from which we can conclude that ui ⊗ vj is an eigenvector of A ⊗ B, associated with the eigenvalue λi μj , for i ∈ I and j ∈ J. The spectrum of A ⊗ B therefore consists of the set of IJ products of eigenvalues of A and B, namely: sp(A ⊗ B) =

I,J .

{λi μj },

[2.60]

i,j=1

where sp(A) represents the spectrum of A, i.e. the set of its eigenvalues. Furthermore, the eigenvectors of A ⊗ B are given the IJ Kronecker products of an eigenvector /by I,J of A with an eigenvector of B, i.e. i,j=1 {ui ⊗ vj }. Given that the determinant of a matrix is equal to the product of its eigenvalues, the formula [2.58] implies that sp(B ⊗ A) = sp(A ⊗ B).  P ROPOSITION 2.13.– From the relation rA⊗B = rA rB , we can conclude that, if A and B have full column rank, then the matrix A ⊗ B also has full column rank. Thus, in the case of square matrices, if A and B are non-singular, then A ⊗ B is also non-singular. We can also deduce that:  A A  r = A A

  A −A r = A A

 1 1

  1 1  r ⊗A = r r A = rA 1 1 1 1

    1 −1 1 −1 r ⊗A = r r A = 2 rA . 1 1 1 1

66

Matrix and Tensor Decompositions in Signal Processing

2.4.4. Structural properties of a Kronecker product The Kronecker product satisfies the following structural properties: – Using the relation [2.60], we can deduce the properties of the Kronecker product of positive/negative definite matrices, which are listed in Table 2.12. Similar properties can be deduced for positive/negative semi-definite matrices. Hypotheses

| Properties

A, B > 0

| A⊗B>0

A, B < 0

| A⊗B>0

A > 0, B < 0 | A ⊗ B < 0 A < 0, B > 0 | A ⊗ B < 0

Table 2.12. Kronecker product of positive/negative definite matrices

P ROOF .– We have:  A, B > 0 ⇔ λi (A) > 0 and μj (B) > 0 A, B < 0 ⇔ λi (A) < 0 and μj (B) < 0  A > 0, B < 0 ⇔ λi (A) > 0 and μj (B) < 0 A < 0, B > 0 ⇔ λi (A) < 0 and μj (B) > 0

∀i, j ∀i, j ∀i, j ∀i, j

 ⇒A⊗B>0  ⇒ A ⊗ B < 0,

from which we deduce that the Kronecker product of two positive/negative definite matrices is itself a positive/negative definite matrix.  – The Kronecker product of two diagonal matrices is a diagonal matrix. Similarly, the Kronecker product of two upper (respectively, lower) triangular matrices is an upper (respectively, lower) triangular matrix. – The Kronecker product of two Toeplitz matrices2 is a block Toeplitz matrix, i.e. a matrix consisting of identical Toeplitz blocks along the diagonals parallel to the main diagonal. E XAMPLE 2.14.– Let A be a Toeplitz matrix of order 3, and let B be a Toeplitz matrix of order N . Then A ⊗ B is a block Toeplitz matrix of order 3N : ⎤ ⎡ ⎤ ⎡ a0 a−1 a−2 a0 B a−1 B a−2 B A = ⎣ a1 a0 a−1 ⎦ ⇒ A ⊗ B = ⎣ a1 B a0 B a−1 B ⎦ . a2 B a1 B a2 a1 a0 a0 B 2 A square Toeplitz matrix T = [tij ] of order I satisfies tij = ai−j , i, j ∈ I. It consists of identical coefficients along the diagonals parallel to the main diagonal, i.e. tij = tkl if i − j = k − l.

Hadamard, Kronecker and Khatri–Rao Products

67

Similarly, the Kronecker product of two Hankel3 matrices is a block Hankel matrix. – Using the property [2.40] and the fundamental relation [2.50], we can conclude that the Kronecker product of two orthogonal (respectively, unitary) matrices A ∈ KI×I and B ∈ KJ×J is itself an orthogonal (respectively, unitary) matrix:  T  A A = AAT = II ⇒ (A ⊗ B)T (A ⊗ B) = (A ⊗ B)(A ⊗ B)T = IIJ BT B = BBT = IJ  H  A A = AAH = II ⇒ (A ⊗ B)H (A ⊗ B) = (A ⊗ B)(A ⊗ B)H = IIJ . BH B = BBH = IJ – Similarly, if A ∈ KI×J and B ∈ KK×L are column orthonormal matrices, then A ⊗ B is itself column orthonormal:  H  A A = IJ ⇒ (A ⊗ B)H (A ⊗ B) = AH A ⊗ BH B = IJL . BH B = I L The above results are summarized in Table 2.13. Hypotheses A, B

symmetric/Hermitian diagonal upper/lower triangular column orthonormal orthogonal/unitary Toeplitz (Hankel)

| |

Structural properties A⊗B

| symmetric/Hermitian | diagonal | upper/lower triangular | column orthonormal | orthogonal/unitary | block Toeplitz (block Hankel)

Table 2.13. Structural properties of the Kronecker product

2.4.5. Inverse and Moore–Penrose pseudo-inverse of a Kronecker product For non-singular square matrices A and B, we have: (A ⊗ B)−1 = A−1 ⊗ B−1 .

[2.61]

This result can be deduced directly from the fundamental relation [2.50]. 3 A square Hankel matrix H = [hij ] of order I satisfies hij = ai+j−1 . It consists of identical coefficients along diagonals parallel to the antidiagonal.


The Moore–Penrose pseudo-inverse of the Kronecker product of two matrices A ∈ K^{I×J} and B ∈ K^{M×N} of full column rank is equal to the Kronecker product of the Moore–Penrose pseudo-inverses of these two matrices:
$$(A \otimes B)^{\dagger} = A^{\dagger} \otimes B^{\dagger}. \qquad [2.62]$$

PROOF.– Since the matrices A and B have full column rank, their Kronecker product also has full column rank. The Moore–Penrose pseudo-inverse of A ⊗ B is then given by:
$$(A \otimes B)^{\dagger} = \big[ (A \otimes B)^H (A \otimes B) \big]^{-1} (A \otimes B)^H.$$
Using the properties [2.50] and [2.61], we obtain:
$$(A \otimes B)^{\dagger} = (A^H A \otimes B^H B)^{-1} (A^H \otimes B^H) = \big[ (A^H A)^{-1} \otimes (B^H B)^{-1} \big] (A^H \otimes B^H) = \big[ (A^H A)^{-1} A^H \big] \otimes \big[ (B^H B)^{-1} B^H \big] = A^{\dagger} \otimes B^{\dagger}.$$
This proves [2.62].
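As a numerical illustration of [2.62] and of the singular value property used in section 2.4.6 below, the following NumPy sketch (an editorial addition using random full column rank matrices, not part of the original text) checks both relations:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))   # generically full column rank
B = rng.standard_normal((4, 2))

K = np.kron(A, B)

# Relation [2.62]: (A (x) B)^+ = A^+ (x) B^+
lhs = np.linalg.pinv(K)
rhs = np.kron(np.linalg.pinv(A), np.linalg.pinv(B))
print(np.allclose(lhs, rhs))      # True

# The singular values of A (x) B are the pairwise products of those of A and B.
sK = np.sort(np.linalg.svd(K, compute_uv=False))
sAB = np.sort(np.outer(np.linalg.svd(A, compute_uv=False),
                       np.linalg.svd(B, compute_uv=False)).ravel())
print(np.allclose(sK, sAB))       # True
```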



2.4.6. Decompositions of a Kronecker product

Using the relation [2.50], it is easy to deduce the Cholesky, UD, QR and SVD decompositions of A ⊗ B from the corresponding decompositions of A and B.

– For symmetric positive definite matrices A ∈ R^{I×I} and B ∈ R^{J×J}, we have:
$$\begin{cases} A = S_A S_A^T \\ B = S_B S_B^T \end{cases} \;\Rightarrow\; A \otimes B = (S_A \otimes S_B)(S_A \otimes S_B)^T$$
$$\begin{cases} A = U_A D_A U_A^T \\ B = U_B D_B U_B^T \end{cases} \;\Rightarrow\; A \otimes B = (U_A \otimes U_B)(D_A \otimes D_B)(U_A \otimes U_B)^T,$$
where (S_A, S_B) are the upper or lower triangular matrices with positive diagonal elements that define the Cholesky decompositions of A and B, respectively, (U_A, U_B) are upper or lower triangular matrices whose diagonal elements are equal to 1, and (D_A, D_B) are positive definite diagonal matrices from the UD decompositions of A and B. Thus, we obtain the Cholesky and UD decompositions of A ⊗ B from the Cholesky and UD decompositions of A and B.

– Regarding the QR decomposition, for matrices A ∈ R^{I×J} and B ∈ R^{K×L} of full column rank, we have:
$$\begin{cases} A = Q_A R_A \\ B = Q_B R_B \end{cases} \;\Rightarrow\; A \otimes B = (Q_A \otimes Q_B)(R_A \otimes R_B),$$


where R_A ∈ R^{J×J} and R_B ∈ R^{L×L} are upper triangular matrices, and Q_A ∈ R^{I×J} and Q_B ∈ R^{K×L} are column orthonormal matrices. Since the matrix Q_A ⊗ Q_B is itself column orthonormal, and R_A ⊗ R_B is upper triangular (see Table 2.13), we then obtain the QR decomposition of A ⊗ B from the QR decompositions of A and B.

– Similarly, for the SVD decomposition, we have:
$$\begin{cases} A = U_A \Sigma_A V_A^H \\ B = U_B \Sigma_B V_B^H \end{cases} \;\Rightarrow\; A \otimes B = (U_A \otimes U_B)(\Sigma_A \otimes \Sigma_B)(V_A \otimes V_B)^H,$$
where U_A ∈ K^{I×I}, U_B ∈ K^{K×K}, V_A ∈ K^{J×J}, V_B ∈ K^{L×L} are unitary matrices in the complex case (K = C) and orthogonal matrices in the real case (K = R), whereas Σ_A and Σ_B are pseudo-diagonal matrices. From the last relation, we can deduce that the non-zero singular values of A ⊗ B are equal to (σ_A)_i (σ_B)_j, where the (σ_A)_i, i ∈ r_A, and the (σ_B)_j, j ∈ r_B, are the non-zero singular values of A and B, and r_A and r_B denote the rank of A and B, respectively. Hence, since the rank of a matrix is equal to the number of non-zero singular values, we can deduce that the rank of A ⊗ B is equal to r_A r_B, i.e. the product of the ranks of A and B (see Table 2.11).

2.5. Kronecker sum

In this section, we introduce the definition and a few properties of the Kronecker sum.

2.5.1. Definition

The Kronecker sum of A ∈ K^{I×I} and B ∈ K^{J×J} is defined as⁴:
$$A \oplus B = A \otimes I_J + I_I \otimes B \in \mathbb{K}^{IJ \times IJ}. \qquad [2.63]$$

4. Some authors define the Kronecker sum as: A ⊕ B = I_J ⊗ A + B ⊗ I_I ∈ K^{JI×JI}.

EXAMPLE 2.15.– For $A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$ and $B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$, we have:
$$A \oplus B = \begin{bmatrix} a_{11}+b_{11} & b_{12} & a_{12} & 0 \\ b_{21} & a_{11}+b_{22} & 0 & a_{12} \\ a_{21} & 0 & a_{22}+b_{11} & b_{12} \\ 0 & a_{21} & b_{21} & a_{22}+b_{22} \end{bmatrix}.$$


2.5.2. Properties

The Kronecker sum A ⊕ B satisfies the property of associativity, and its spectrum has a simple expression in terms of the spectra of A and B.

– Associativity: For A ∈ K^{I×I}, B ∈ K^{J×J}, C ∈ K^{K×K}, we have:
$$A \oplus (B \oplus C) = (A \oplus B) \oplus C = A \oplus B \oplus C.$$

– Spectrum: The spectrum of A ⊕ B is given by:
$$\mathrm{sp}(A \oplus B) = \mathrm{sp}(B \oplus A) = \bigcup_{i,j=1}^{I,J} \{\lambda_i + \mu_j\}, \qquad [2.64]$$
where (λ_i, u_i) and (μ_j, v_j) are eigenpairs of A and B, respectively, and u_i ⊗ v_j is the eigenvector associated with the eigenvalue λ_i + μ_j.

PROOF.– Applying the fundamental relation [2.50] allows us to write:
$$(A \oplus B)(u_i \otimes v_j) = (A \otimes I_J)(u_i \otimes v_j) + (I_I \otimes B)(u_i \otimes v_j) = A u_i \otimes v_j + u_i \otimes B v_j = (\lambda_i u_i \otimes v_j) + (u_i \otimes \mu_j v_j) = (\lambda_i + \mu_j)(u_i \otimes v_j), \qquad [2.65]$$
from which we deduce that (λ_i + μ_j, u_i ⊗ v_j) is an eigenpair of A ⊕ B. □
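The definition [2.63] and the spectral property [2.64] can be verified numerically; the following NumPy sketch (an editorial illustration with random matrices, not part of the original text) builds the Kronecker sum explicitly and compares its eigenvalues with the sums λ_i + μ_j:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))

# Kronecker sum [2.63]: A (+) B = A (x) I_J + I_I (x) B
ksum = np.kron(A, np.eye(4)) + np.kron(np.eye(3), B)

# Property [2.64]: sp(A (+) B) = { lambda_i + mu_j }
lam = np.linalg.eigvals(A)
mu = np.linalg.eigvals(B)
expected = np.add.outer(lam, mu).ravel()

# Comparison up to ordering and round-off (sort by real part, then imaginary part).
print(np.allclose(np.sort_complex(np.linalg.eigvals(ksum)),
                  np.sort_complex(expected)))
```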

REMARK 2.16.– In general, the Kronecker sum is not distributive with respect to the Kronecker product, i.e.:
$$(A \oplus B) \otimes C \neq (A \otimes C) \oplus (B \otimes C), \qquad C \otimes (A \oplus B) \neq (C \otimes A) \oplus (C \otimes B).$$
This is easy to check, for example by choosing A = B = 1 and C = I_J:
$$(A \oplus B) \otimes C = C \otimes (A \oplus B) = 2C = 2\, I_J$$
$$(A \otimes C) \oplus (B \otimes C) = (C \otimes A) \oplus (C \otimes B) = C \oplus C = 2\, I_{J^2}.$$

2.6. Index convention

Like Einstein's summation convention, the index convention enables us to eliminate the summation symbols to write certain formulae involving multi-index


variables more compactly. The index convention also simplifies certain proofs, as illustrated later in this chapter. If an index i ∈ I is repeated in an expression (or more generally in a term in an equation), the expression (or term) is interpreted as being summed over this index, from 1 to I. For example, $\sum_{i=1}^{I} a_i b_i$ is simply written as $a_i b_i$. However, there are two differences relative to Einstein's summation convention:

– each index can be repeated more than twice in an expression;
– ordered index sets are allowed.

The index notation can be interpreted in terms of two types of summation, the first associated with the row indices (superscripts) and the second associated with the column indices (subscripts), with the following rules, which follow from the relations [2.23] and [2.24]:

– the order of the column indices is independent of the order of the row indices;
– consecutive row and column indices (or index sets) can be permuted.

For the rest of this chapter, we will use the index convention to represent matrix products, in particular the Kronecker and Khatri–Rao products, to establish vectorization formulae for matrix products and partitioned matrices, and to express various relations involving the traces of matrix products, as well as relations between the Kronecker, Khatri–Rao and Hadamard products. In Chapter 3, the index convention will be used to define matrix and vector unfoldings, i.e. for the matricization and vectorization of a tensor. Finally, in Chapter 5, the index convention will allow us to write the Tucker and PARAFAC decompositions compactly, as well as to establish matricized forms of these decompositions.

2.6.1. Writing vectors and matrices with the index convention

Omitting the dimension of the vectors of the canonical basis to alleviate the notation, with $E_{ij}^{(I\times J)}$ defined as in [2.25], the index convention allows us to write the vectors u ∈ K^I and v^T ∈ K^{1×J} and the matrices A ∈ K^{I×J} and A^T as:
$$\mathbf{u} = \sum_{i=1}^{I} u_i\, e_i^{(I)} = u_i\, e_i \qquad [2.66]$$
$$\mathbf{v}^T = \sum_{j=1}^{J} v_j\, (e_j^{(J)})^T = v_j\, e^j \; ; \quad \mathbf{v}^H = v_j^*\, e^j \qquad [2.67]$$


and
$$A = \sum_{i=1}^{I}\sum_{j=1}^{J} a_{ij}\, \big( e_i^{(I)} \otimes (e_j^{(J)})^T \big) = a_{ij}\, e_i^{j} = a_{ij}\, E_{ij}^{(I\times J)} \qquad [2.68]$$
$$A^T = a_{ij}\, e_j^{i} \; ; \quad A^H = a_{ij}^*\, e_j^{i}. \qquad [2.69]$$

Similarly, the identity matrix of order J, the matrix 1_{I×J} and the vector 1_J of ones, of size I × J and J, respectively, and the integer I can be written as:
$$I_J = \sum_{j=1}^{J} e_j^{(J)} \otimes (e_j^{(J)})^T = e_j^{j} = E_{jj}^{(J\times J)} \qquad [2.70]$$
$$\mathbf{1}_{I\times J} = u_{ij}\, e_i^{j}, \;\text{with}\; u_{ij} = 1 \;\; \forall i \in I, \; \forall j \in J \qquad [2.71]$$
$$\mathbf{1}_J = \sum_{j=1}^{J} e_j^{(J)} = \delta_{jj}\, e_j \qquad [2.72]$$
$$I = e^i e_i. \qquad [2.73]$$

2.6.2. Basic rules and identities with the index convention

Consider the ordered set S = {n_1, . . . , n_N} obtained by permuting the elements of {1, . . . , N}, and define the index combination I = i_{n_1} · · · i_{n_N} associated with S. Let e_I and e^I denote the following Kronecker products:
$$e_I \triangleq \mathop{\otimes}_{n \in S} e_{i_n}^{(I_n)} = e_{i_{n_1}}^{(I_{n_1})} \otimes e_{i_{n_2}}^{(I_{n_2})} \otimes \cdots \otimes e_{i_{n_N}}^{(I_{n_N})} \in \mathbb{R}^{I_{n_1} I_{n_2} \cdots I_{n_N}}$$
$$e^I \triangleq \mathop{\otimes}_{n \in S} \big( e_{i_n}^{(I_n)} \big)^T = \big( e_{i_{n_1}}^{(I_{n_1})} \big)^T \otimes \cdots \otimes \big( e_{i_{n_N}}^{(I_{n_N})} \big)^T \in \mathbb{R}^{1 \times I_{n_1} I_{n_2} \cdots I_{n_N}}.$$

Using the above definition, together with the expression $u^{(n)} = u_{i_n}^{(n)} e_{i_n}^{(I_n)} \in \mathbb{K}^{I_n}$, the multiple Kronecker product $\mathop{\otimes}_{n \in S} u^{(n)}$ can be written as:
$$\mathop{\otimes}_{n \in S} u^{(n)} = u_{i_{n_1}}^{(n_1)} e_{i_{n_1}}^{(I_{n_1})} \otimes \cdots \otimes u_{i_{n_N}}^{(n_N)} e_{i_{n_N}}^{(I_{n_N})} = \Big( \prod_{n \in S} u_{i_n}^{(n)} \Big) \mathop{\otimes}_{n \in S} e_{i_n}^{(I_n)} = \Big( \prod_{n \in S} u_{i_n}^{(n)} \Big) e_I. \qquad [2.74]$$


PROPOSITION 2.17.– Let I and J be two ordered index sets, partitioned into two subsets (I_1, I_2) and (J_1, J_2), respectively. The rules stated above imply the following identities:
$$e^{J}_{I} = e_I \otimes e^J = e^J \otimes e_I = e^{J_1 J_2}_{I_1 I_2} \qquad [2.75]$$
$$= e_{I_1} \otimes e_{I_2} \otimes e^{J_1} \otimes e^{J_2} = e_{I_1} \otimes e^{J_1} \otimes e_{I_2} \otimes e^{J_2} = e_{I_1} \otimes e^{J_1} \otimes e^{J_2} \otimes e_{I_2} = e^{J_1} \otimes e^{J_2} \otimes e_{I_1} \otimes e_{I_2} = e^{J_1} \otimes e_{I_1} \otimes e^{J_2} \otimes e_{I_2} = e^{J_1} \otimes e_{I_1} \otimes e_{I_2} \otimes e^{J_2}.$$
These identities generalize the identities stated by Pollock (2011) for a term of the form $e^{kl}_{ij}$ to two index sets partitioned into two subsets. They follow from the properties [2.23] and [2.24] of the Kronecker product of a column vector with a row vector, which namely state that this product is independent of the order of the vectors (u^T ⊗ v = v ⊗ u^T). This implies that, in a sequence of Kronecker products of column vectors and row vectors, a column vector can be permuted with a row vector without changing the end result, provided that the order of the column vectors and the order of the row vectors are not changed in the sequence. In particular, we have:
$$e^{ji}_{ij}\, e^{ij}_{ji} = (e_{ij} \otimes e^{ji})(e_{ji} \otimes e^{ij}) = (e_{ij}\, e^{ij}) \otimes (e^{ji}\, e_{ji}) \quad \text{(by [2.50])}$$
$$= e^{ij}_{ij} = I_{IJ} \quad \text{(because } e^{ji} e_{ji} = 1 \text{ for fixed } i \text{ and } j\text{, and by [2.70])} \qquad [2.76]$$
from which we deduce:
$$(e^{ji}_{ij})^{-1} = e^{ij}_{ji}. \qquad [2.77]$$

EXAMPLE 2.18.– For I = 2, J = 3, using the relations [2.75] and [2.24], we have:
$$e^{ji}_{ij} = e_{ij} \otimes e^{ji} = e_{ij}\, e^{ji} = \sum_{i=1}^{2} \sum_{j=1}^{3} \big( e_i^{(2)} \otimes e_j^{(3)} \big) \big( e_j^{(3)} \otimes e_i^{(2)} \big)^T,$$
which gives, after summing the six outer products:
$$e^{ji}_{ij} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \qquad e^{ij}_{ji} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} = (e^{ji}_{ij})^{-1}.$$

REMARK 2.19.– Since the matrix $e^{ji}_{ij}$ is a permutation matrix, we have $(e^{ji}_{ij})^{-1} = (e^{ji}_{ij})^T$ (see [2.168]).

2.6.3. Matrix products and index convention

Using the index convention, the products v^T u and Au can be rewritten as follows, with u, v ∈ K^J, and A ∈ K^{I×J}:
$$\mathbf{v}^T \mathbf{u} = \sum_{i} u_i v_i = u_i v_i \qquad [2.78]$$
$$A\mathbf{u} = \sum_{i=1}^{I} (A_{i.}\mathbf{u})\, e_i^{(I)} = (A_{i.}\mathbf{u})\, e_i \qquad [2.79]$$
$$= \sum_{i=1}^{I} \sum_{j=1}^{J} a_{ij} u_j\, e_i = a_{ij} u_j\, e_i \in \mathbb{K}^{I}. \qquad [2.80]$$

Similarly, for u ∈ K^J, A ∈ K^{I×J}, B ∈ K^{J×K} and C ∈ K^{K×J}, we have:
$$AB = \sum_{i=1}^{I} \sum_{k=1}^{K} \Big( \sum_{j=1}^{J} a_{ij} b_{jk} \Big) e_i^{k} = a_{ij} b_{jk}\, e_i^{k} \in \mathbb{K}^{I\times K} \qquad [2.81]$$
$$= (a_{ij}\, e_i)(b_{jk}\, e^k) = A_{.j} B_{j.} \qquad [2.82]$$
$$A C^T = a_{ij} c_{kj}\, e_i^{k} \in \mathbb{K}^{I\times K}. \qquad [2.83]$$

We also have:
$$A\, \mathrm{diag}(u_1, \cdots, u_J)\, B = u_j A_{.j} B_{j.}, \qquad [2.84]$$
and the Hadamard product of A ∈ K^{I×J} with B ∈ K^{I×J} can be written as:
$$A \odot B = a_{ij} b_{ij}\, e_i^{j}. \qquad [2.85]$$
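The index convention maps directly onto einsum-style numerical computation. The following NumPy sketch (an editorial illustration with random matrices; the repeated-index summations are written as einsum subscripts) checks [2.80], [2.81], [2.84] and [2.85]:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))   # a_ij
B = rng.standard_normal((3, 5))   # b_jk
u = rng.standard_normal(3)

# [2.80]  (Au)_i = a_ij u_j        (summation over the repeated index j)
print(np.allclose(np.einsum('ij,j->i', A, u), A @ u))

# [2.81]  (AB)_ik = a_ij b_jk
print(np.allclose(np.einsum('ij,jk->ik', A, B), A @ B))

# [2.84]  A diag(u) B = u_j A_.j B_j.
print(np.allclose(np.einsum('ij,j,jk->ik', A, u, B), A @ np.diag(u) @ B))

# [2.85]  Hadamard product: (A o B)_ij = a_ij b_ij
C = rng.standard_normal((4, 3))
print(np.allclose(np.einsum('ij,ij->ij', A, C), A * C))
```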


2.6.4. Kronecker products and index convention

The Kronecker product of vectors can be written compactly using the index convention, as illustrated in Table 2.14.

For u ∈ K^I, v ∈ K^J, w ∈ K^K and x^{(n)} ∈ K^{I_n}, with n ∈ N:
$$\mathbf{u} \otimes \mathbf{v} = u_i v_j\, e_{ij} \qquad \mathbf{u}^T \otimes \mathbf{v}^T = u_i v_j\, e^{ij} \qquad \mathbf{u} \otimes \mathbf{v}^T = u_i v_j\, e_i^{j}$$
$$\mathbf{u} \otimes \mathbf{v} \otimes \mathbf{w} = u_i v_j w_k\, e_{ijk} \qquad \mathbf{u} \otimes \mathbf{v}^T \otimes \mathbf{w} = u_i v_j w_k\, e_{ik}^{j}$$
$$\mathop{\otimes}_{n=1}^{N} \mathbf{x}^{(n)} = \Big( \prod_{n=1}^{N} x^{(n)}_{i_n} \Big) e_{i_1 \cdots i_N}$$

Table 2.14. Kronecker products of vectors and index convention
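Numerically, the coefficients u_i v_j of Table 2.14 are simply the outer-product entries read in lexicographic order of the indices; the following NumPy sketch (an editorial illustration, not part of the original text) checks this for two and three vectors:

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.standard_normal(2)   # u_i
v = rng.standard_normal(3)   # v_j
w = rng.standard_normal(4)   # w_k

# u (x) v = u_i v_j e_ij : the table of coefficients u_i v_j with i varying slowest.
print(np.allclose(np.einsum('i,j->ij', u, v).ravel(), np.kron(u, v)))

# u (x) v (x) w = u_i v_j w_k e_ijk
print(np.allclose(np.einsum('i,j,k->ijk', u, v, w).ravel(),
                  np.kron(np.kron(u, v), w)))
```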

P ROOF .– First, note that: (I)

ei (I)

ei (I )

(J)

⊗ ej

(J)

⊗ ej

(K)

⊗ ek

(I )

(IJ)

= e(i−1)J+j = eij

[2.86]

(IJK)

= e(i−1)JK+(j−1)K+k = eijk

(I )

[2.87]

(I I ···I )

2 N = ei1 ···iN ei1 1 ⊗ ei2 2 ⊗ · · · ⊗ eiNN = e(i11−1)I 2 ···IN +(i2 −1)I3 ···IN +···+iN [2.88]

The expressions in Table 2.14 can be proven using the properties [2.75]. Thus, for u ∈ KI , v ∈ KJ , the index convention allows us to write: u⊗v =

I 

(I)

ui ei





J 



j=1

i=1

=

(J)

vj ej

I  J 

(I)

ui vj (ei

(J)

⊗ ej ) = ui vj eij ,

[2.89]

i=1 j=1

which shows the first relation in Table 2.14. This relation implies that: (IJ)

ui vj = [u ⊗ v](i−1)J+j = (e(i−1)J+j )T (u ⊗ v) (I)

= (ei

(J)

(I)

⊗ ej )T (u ⊗ v) = (u ⊗ v)T (ei

= eij (u ⊗ v) = (u ⊗ v)T eij .

[2.90] (J)

⊗ ej ) [2.91]


Similarly, we have: uT ⊗ vT = (ui ei ) ⊗ (vj ej ) = ui vj eij

[2.92]

u ⊗ vT = (ui ei ) ⊗ (vj ej ) = ui vj eji .

[2.93]

For the Kronecker product of three vectors, we have: u⊗v⊗w =

J  K I  

(IJK)

ui vj wk e(i−1)JK+(j−1)K+k

i=1 j=1 k=1 (IJK)

= ui vj wk e(i−1)JK+(j−1)K+k = ui vj wk eijk u ⊗ vT ⊗ w = (ui ei ) ⊗ (vj ej ) ⊗ (wk ek ) = ui vj wk ejik .

[2.94] [2.95]

In general, for x(n) ∈ KIn , with n ∈ N , we have: N

⊗ x(n) =

n=1

I1  I2 

···

i1 =1 i2 =1

=

N 

(n)

x in

IN   N  iN =1

(n)

x in



(I I ···I )

2 N e(i11−1)I 2 ···IN +(i2 −1)I3 ···IN +···+iN

n=1

 ei1 ···iN .

[2.96]

n=1



This concludes the proof of the expressions in Table 2.14. For the Kronecker product of matrices, we have the equations listed in Table 2.15.

For A ∈ K^{I×J} and B ∈ K^{K×L}:
$$A \otimes B = a_{ij} b_{kl}\, e_{ik}^{jl} \qquad A^T \otimes B^T = a_{ij} b_{kl}\, e_{jl}^{ik} \qquad A \otimes B^T = a_{ij} b_{kl}\, e_{il}^{jk} \qquad A^T \otimes B = a_{ij} b_{kl}\, e_{jk}^{il}$$

Table 2.15. Kronecker products of matrices and index convention

P ROOF .– Using the expressions [2.68] and [2.69] of a matrix and its transpose using the index convention, we have: A ⊗ B = (aij eji ) ⊗ (bkl elk ) = aij bkl ejl ik

[2.97]

AT ⊗ BT = (aij eij ) ⊗ (bkl ekl ) = aij bkl eik jl

[2.98]


and A ⊗ BT = (aij eji ) ⊗ (bkl ekl ) = aij bkl ejk il

[2.99]

AT ⊗ B = (aij eij ) ⊗ (bkl elk ) = aij bkl eil jk .

[2.100] 

This proves the formulae in Table 2.15. Note that, from [2.97], we can deduce that: (I)

aij bkl = eik (A ⊗ B)ejl = (ei

(K)

(J)

⊗ ek )T (A ⊗ B)(ej

(L)

⊗ el ).

[2.101]

The expression [2.97] can be generalized to the case of a multiple Kronecker product, as formulated in the following proposition. P ROPOSITION 2.20.– Using the index convention, the multiple Kronecker product ⊗ A(n) can alternatively be written as:

n∈N 

⊗ A(n) =

n∈N 

N 

N  (n)  (n)  ···rN ain ,rn eR ain ,rn eri11···i = I , N

n=1

[2.102]

n=1

(n)

where ain ,rn is the current element of A(n) ∈ KIn ×Rn , I = i1 · · · iN , and R = r1 · · · rN . 2.6.5. Vectorization and index convention Noting that: vec(abT ) = b ⊗ a

[2.103]

and therefore:   (I) (J) T (J) (I) = ej ⊗ ei = eji , vec ei ej

[2.104]

the index convention allows us to write the vectorization operator compactly, as illustrated in the following proposition. P ROPOSITION 2.21.– For A ∈ KI×J and B ∈ KK×L , we have: vec(A) = vec

J I  

(I) (J) T

aij ei ej

i=1 j=1

= aij eji ∈ K



=

J I  

  (J) (I) aij ej ⊗ ei

i=1 j=1

JI

[2.105]

vec(AT ) = aij eij ∈ KIJ

[2.106]

vec(A ⊗ B) = aij bkl ejlik ∈ KJLIK ,

[2.107]


where, as a special case of the last relation, for x ∈ KI and y ∈ KL : vec(x ⊗ yT ) = xi yl eli = y ⊗ x ∈ KLI .

[2.108]

Thus, we recover the relation [2.28] with transposition instead of transconjugation. It should be noted that vec(AT ) is equivalent to forming a column vector by stacking the row vectors of A, rearranged as column vectors. The element aij of A is located at the position i + (j − 1)I in vec(A) and at j + (i − 1)J in vec(AT ). The element aij bkl of A ⊗ B is located at the position k + (i − 1)K + (l − 1)IK + (j − 1)LIK in the vector vec(A ⊗ B). E XAMPLE 2.22.– From the vectorization [2.105] of the matrix A, with aij = δij , we can deduce the vectorization of the identity matrix II : vec(II ) = eii .

[2.109]

In particular, for I = 2, we have:

  1 0 (2) (2) (2) (2) = e1 ⊗ e1 + e2 ⊗ e2 vec(I2 ) = vec 0 1







1 1 0 0 = ⊗ + ⊗ 0 0 1 1 ⎤ ⎤ ⎡ ⎤ ⎡ ⎡ 1 0 1 ⎢ 0 ⎥ ⎢ 0 ⎥ ⎢ 0 ⎥ ⎥ ⎥ ⎢ ⎥ ⎢ = ⎢ ⎣ 0 ⎦ + ⎣ 0 ⎦ = ⎣ 0 ⎦. 1 1 0 E XAMPLE 2.23.– For A = [a11 , a12 ] and B = A ⊗ B = [a11 B, a12 B] =

a11 b11 a11 b21

b11 b21

b12 b22

a11 b12 a11 b22

, we have: a12 b11 a12 b21

a12 b12 a12 b22



vec(A ⊗ B) = [a11 b11 , a11 b21 , a11 b12 , a11 b22 , a12 b11 , a12 b21 , a12 b12 , a12 b22 ]T ( )* + ( (

k=1,2

)* l=1,2

+ )*

j=1,2

+

which corresponds to a variation of the indices of ejlik in [2.107] such that k varies faster than l and j, and l varies faster than j, with i = 1.
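The element positions stated after [2.107] can be checked directly; the following NumPy sketch (an editorial illustration; note that the column-wise vec of this chapter corresponds to ravel(order='F') in NumPy) verifies the positions of a_ij in vec(A) and vec(A^T), and of a_ij b_kl in vec(A ⊗ B):

```python
import numpy as np

rng = np.random.default_rng(5)
I, J, K, L = 2, 3, 2, 2
A = rng.standard_normal((I, J))
B = rng.standard_normal((K, L))

vecA = A.ravel(order='F')          # column-wise vec, as in [2.105]
i, j = 1, 2                        # 1-based indices used in the checks below
print(np.isclose(vecA[(j - 1) * I + i - 1], A[i - 1, j - 1]))    # a_ij at row i + (j-1)I

vecAT = A.T.ravel(order='F')       # vec(A^T), as in [2.106]
print(np.isclose(vecAT[(i - 1) * J + j - 1], A[i - 1, j - 1]))   # a_ij at row j + (i-1)J

# Position of a_ij b_kl in vec(A (x) B): k + (i-1)K + (l-1)IK + (j-1)LIK
vecAB = np.kron(A, B).ravel(order='F')
k, l = 2, 1
pos = k + (i - 1) * K + (l - 1) * I * K + (j - 1) * L * I * K
print(np.isclose(vecAB[pos - 1], A[i - 1, j - 1] * B[k - 1, l - 1]))
```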


2.6.6. Vectorization formulae Here, we present several vectorization formulae for the Kronecker product of two matrices, and then for other matrix products. These formulae can be proven using the index convention. P ROPOSITION 2.24.– For A ∈ KI×J and B ∈ KK×L , we have:   vec(A ⊗ B) = (IJ ⊗ eil li ⊗ IK ) vec(A) ⊗ vec(B) ,

[2.110]

where eil li represents the commutation matrix defined in [2.162]. P ROOF .– Using the relations [2.105], [2.107] and [2.75], as well as the second property in Table 2.9, and noting that: T T T T eil li eil = eli (ei ⊗ el )(ei ⊗ el ) = eli (ei ei ⊗ el el ) = eli ,

[2.111]

we deduce: vec(A) ⊗ vec(B) = aij eji ⊗ bkl elk = aij bkl ejilk = aij bkl (ej ⊗ eil ⊗ ek ), vec(A ⊗ B) = aij bkl ejlik = aij bkl (ej ⊗ eli ⊗ ek ) = aij bkl (IJ ⊗ eil li ⊗ IK )(ej ⊗ eil ⊗ ek )   = (IJ ⊗ eil li ⊗ IK ) vec(A) ⊗ vec(B) , which proves the relation [2.110].



P ROPOSITION 2.25.– From the relation [2.110], with A = u ∈ KI , B = vT ∈ K1×L , and J = K = 1, we deduce that: LI vec(u ⊗ vT ) = v ⊗ u = eil li (u ⊗ v) ∈ K .

[2.112]

E XAMPLE 2.26.– For A = u ∈ K2 , B = vT ∈ K1×2 , i.e. I = L = 2, J = K = 1, the formula [2.110] becomes: ⎡ ⎤ u1 v1

 ⎢  u2 v1 ⎥ u1 v 1 u1 v 2 ⎥ vec(u ⊗ vT ) = vec =⎢ ⎣ u1 v2 ⎦ = v ⊗ u u2 v 1 u2 v 2 u2 v2 ⎤⎡ ⎡ ⎤ ⎡ ⎤ u1 v 1 1 0 0 0 u1 v 1   ⎢ 0 0 1 0 ⎥⎢ u v ⎥ ⎢ u v ⎥ T ⎥⎢ 1 2 ⎥ ⎢ 2 1 ⎥ ⎢ eil li u ⊗ v = ⎣ 0 1 0 0 ⎦ ⎣ u v ⎦ = ⎣ u v ⎦ = vec(u ⊗ v ). 2 1 1 2 0 0 0 1 u2 v 2 u2 v 2


PROPOSITION 2.27.– For A ∈ K^{I×J}, B ∈ K^{J×M}, C ∈ K^{M×N}, we have:
$$\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B) \qquad [2.113]$$
$$= (I_N \otimes AB)\,\mathrm{vec}(C) \qquad [2.114]$$
$$= (C^T B^T \otimes I_I)\,\mathrm{vec}(A) \qquad [2.115]$$
$$\mathrm{vec}(AB) = (I_M \otimes A)\,\mathrm{vec}(B) \qquad [2.116]$$
$$= (B^T \otimes I_I)\,\mathrm{vec}(A) \qquad [2.117]$$
$$= (B^T \otimes A)\,\mathrm{vec}(I_J). \qquad [2.118]$$

P ROOF .– Let us first show the relation [2.113] using the definition of the vectorization operation. The product ABC ∈ KI×N expands into: n ABC = (aij eji )(bjm em j )(cmn em ) n n = aij bjm cmn eji em j em = aij bjm cmn ei .

[2.119]

Applying the vectorization formula [2.105] then gives us: vec(ABC) = (ABC)in eni = aij bjm cmn eni .

[2.120]

Noting that: emj em j  = (em ⊗ ej )(em ⊗ ej  ) = em em ⊗ ej ej  = δmm δjj 

[2.121]

emj ni em j 

[2.122]

= (eni ⊗ e

mj

)em j  = eni ⊗ e

mj

em j  = δmm δjj  eni ,

applying the formulae [2.105] and [2.100] allows us to rewrite vec(ABC) as: vec(ABC) = cmn aij bj  m emj ni em j  T = (cmn aij emj ni )(bj  m em j  ) = (C ⊗ A)vec(B),

[2.123]

which proves [2.113]. According to Horn and Johnson (1991), this formula appears to have been proposed for the first time by Roth (1934). It is important to note that introducing the double Kronecker product $e^{mj} e_{m'j'}$ is what led to the formation of the factors C^T ⊗ A and vec(B). This double Kronecker product allows us to replace the double sum over the indices m and j associated with the product ABC followed by the vectorization of a matrix of size I × N, by the product of the matrix C^T ⊗ A ∈ K^{NI×MJ} with the vector vec(B) ∈ K^{MJ}. The relations [2.116]–[2.118] can be deduced directly from [2.113] by replacing the triplet (A, B, C) with (A, B, I_M), (I_I, A, B) and (A, I_J, B), respectively. Similarly, the relations [2.114] and [2.115] can be deduced from [2.113] by replacing (A, B, C) with (AB, C, I_N) and (I_I, A, BC), respectively. □
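The vectorization identities [2.113], [2.116] and [2.117] are easy to verify numerically; the following NumPy sketch (an editorial illustration with random matrices and a column-wise vec helper, not part of the original text) does so:

```python
import numpy as np

def vec(M):
    """Column-wise vectorization, as used throughout the chapter."""
    return M.ravel(order='F')

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
C = rng.standard_normal((5, 2))

# [2.113]: vec(ABC) = (C^T (x) A) vec(B)
print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))

# [2.116] and [2.117]: vec(AB) = (I_M (x) A) vec(B) = (B^T (x) I_I) vec(A)
M, I = B.shape[1], A.shape[0]
print(np.allclose(vec(A @ B), np.kron(np.eye(M), A) @ vec(B)))
print(np.allclose(vec(A @ B), np.kron(B.T, np.eye(I)) @ vec(A)))
```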


R EMARK 2.28.– The reasoning used to prove [2.113] will be reused later in this chapter. To illustrate this reasoning, let us apply it to prove the identities [2.116] and [2.117] that allow us to compute vec(AB). A classical method to compute vec(AB) is to compute C = AB ∈ KI×M then vectorize C. By replacing K with M and using the expressions [2.81] and [2.105], AB and vec(AB) can be written as: m AB = C = cim em i = aij bjm ei

vec(AB) = vec(C) = cim emi = aij bjm emi .

[2.124] [2.125]

Another method is to replace emi with emj mi em j  . Equation [2.125] can then be written as: vec(AB) = (aij emj mi )(bj  m em j  )

[2.126]

j = (em m ⊗ aij ei )vec(B)

[2.127]

= (IM ⊗ A)vec(B),

[2.128]

which corresponds to the identity [2.116]. Similarly, by replacing emi with eji mi ej  i , equation [2.125] becomes: vec(AB) = (bjm eji mi )(ai j  ej  i ) =

(bjm ejm



eii )vec(A)

= (BT ⊗ II )vec(A),

[2.129] [2.130] [2.131]

which is the identity [2.117]. Thus, the decompositions of emi used above allowed us to replace the computation of the matrix C = AB followed by its vectorization with two matrix products involving the vectorization of A or B. The identities [2.113]–[2.118] play a very important role in solving systems of matrix equations, as will be illustrated in section 2.12. P ROPOSITION 2.29.– Using the relation [2.113], it is easy to deduce the following vectorization formulae for A ∈ KI×J , B ∈ KJ×M , C ∈ KM ×N , D ∈ KN ×P : vec(ABCD) = (DT ⊗ AB)vec(C)

[2.132]

= (DT CT ⊗ A)vec(B)

[2.133]

= (D C B ⊗ II )vec(A)

[2.134]

= (IP ⊗ ABC)vec(D).

[2.135]

T

T

T


2.6.7. Vectorization of partitioned matrices In this section, we will use the index convention to compute the block vectorization of a partitioned matrix. P ROPOSITION 2.30.– Let A ∈ KI×J  be a matrix  partitioned into blocks R S Ars ∈ KIr ×Js , with r ∈ R , s ∈ S, and r=1 Ir = I, s=1 Js = J. The block vectorization of A is given as: (rs) (SRJ I )

s ,r,ir vecb (A) = air js es,r,jss,irr = (es,j s,r,js ,ir )vec(A)

[2.136]

js ,r = (IS ⊗ er,j ⊗ IIr )vec(A). s

[2.137]

P ROOF .– Since the matrix A is partitioned in the form: ⎤ ⎡ A11 · · · A1S ⎢ .. ⎥ , A = ⎣ ... . ⎦ ···

AR1

[2.138]

ARS

it can be written as follows with the index convention: A=

S R  

(E(R×S) ⊗ Ars ) = esr ⊗ Ars , rs

[2.139]

r=1 s=1 (R×S)

(R)

(S)

where Ers = er (es )T = esr is the matrix of size R × S with 1 at the position (r, s) and 0 everywhere else, defined as in [2.25]. Using the formulae [2.110] and [2.105], we have: vec(A) = vec(esr ⊗ Ars ) (rs)

(rs)

s =air ,js vec(esr ⊗ ejirs ) = air ,js vec(es,j r,ir )

(rs)

(SJ RI )

=air ,js es,jss,r,irr .

[2.140]

The difference between block vectorization and standard vectorization is the order of variation of the indices r and js , with: (rs)

(SRJ I )

vecb (A) = air ,js es,r,jss,irr ,

[2.141]

with the index r of the block varying more slowly than the index js of the element (rs) air ,js of A in the block vectorization, whereas the reverse holds for vec(A). Therefore, we deduce that: s,js ,r,ir js ,r vec(A) = (ess ⊗ er,j ⊗ eiirr )vec(A) vecb (A) = es,r,j s ,ir s js ,r = (IS ⊗ er,j ⊗ IIr )vec(A), s

which proves [2.137].

[2.142] [2.143] 


(rs)

R EMARK 2.31.– This formula means that the position of the element air ,js in vec(A) (SJ RI )

(S)

(J )

(R)

(I )

is given by the canonical basis vector es,jss,r,irr = es ⊗ejs s ⊗er ⊗eir r , whereas (SJ RI )

(S)

in vecb (A) it is given by es,r,jss ,irr = es

(R)

⊗ er

(J )

(I )

⊗ ejs s ⊗ eir r .

E XAMPLE 2.32.– Consider the case where R = S = 2, with Ir = 1, Js = 2, r, s ∈ {1, 2}: ⎡  A=

A11

A12

A21

A22



(11) a11

(11) a12

⎢ ⎢ ⎢ = ⎢ ··· ⎢ ⎣ (21) a11

.. .

(12) a11

···

···

(21)

.. .

(12)

(22)

a12

(22)

a11

(12) a12



⎥ ⎥ ⎥ ··· ⎥. ⎥ ⎦ (22) a12

We have: (11)

(11)

(21)

(21)

(12)

(22)

vec(A) = [a11 , a11 , a12 , a12 , a11 , a11 , a12 , a12 ]T . By definition, vecb (A) is given by: ⎡ ⎤ vec(A11 ) ⎢ ⎥ ⎢ vec(A21 ) ⎥ ⎢ ⎥ vecb (A) = ⎢ ⎥ ⎢ vec(A12 ) ⎥ ⎣ ⎦ vec(A22 )

(11) (11) (21) (21) (12) (12) (22) (22) T = a11 , a12 , a11 , a12 , a11 , a12 , a11 , a12 . ( )*+ ( )* + r=1,2 ; s=2 ir =1 ( )* + (

js =1,2

)*

+

r=1,2 ; s=1

Applying the formula [2.137], we obtain: ⎡ 0 vecb (A) = [I2 ⊗

ejrjs rs ]vec(A)

=

1 0 0 0



⎢ ⎥1 ⎢ 0 0 1 0 ⎥ ⎢ ⎥ I2 ⊗ ⎢ ⎥ vec(A), ⎢ 0 1 0 0 ⎥ ⎣ ⎦ 0 0 0 1

[2.144]


which gives:



⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ vecb (A) = ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣

1

0

0

0

0

0

1

0

0

1

0

0

0 ···

0 ···

0 ···

1 ···

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

.. . .. . .. . .. . .. . .. . .. . .. .

⎤ 0

0

0

0

0

0

0

0

0

0 ···

0 ···

0 ···

1

0

0

0

0

1

0

1

0

0

0

0

To illustrate remark 2.31, consider the element 1 ; s = js = 2). Its position in vec(A) is given by (1)

(8)

0 ⎥ ⎥⎡ 0 ⎥ ⎥ ⎥⎢ 0 ⎥ ⎥⎢ ⎥⎢ ⎥⎢ 0 ⎥⎢ ⎥⎢ ··· ⎥⎢ ⎥⎢ ⎥⎢ 0 ⎥⎢ ⎥⎢ ⎥⎢ 0 ⎥⎣ ⎥ ⎥ 0 ⎥ ⎦ 1

(11)

a11 (21) a11 (11) a12 (21) a12 (12) a11 (22) a11 (12) a12 (22) a12

⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥. ⎥ ⎥ ⎥ ⎥ ⎥ ⎦

(12) a12 corresponding to (r = ir (2) (2) (SJs RIr ) (2) es,js ,r,ir = e2 ⊗ e2 ⊗ e1

= ⊗

e1 = e7 , which is the seventh row of vec(A), whereas in vecb (A) its position is (SRJ I ) (2) (2) (2) (1) (8) given by es,r,jss,irr = e2 ⊗ e1 ⊗ e2 ⊗ e1 = e6 , which is the sixth row. 2.6.8. Traces of matrix products and index convention Using the index convention, we can prove the following relations for traces of matrices: – For A, B, C, D ∈ KI×J , we have: J I     (aij bij )(cij dij ) = aij bij cij dij tr (A  B)(C  D)T =

[2.145]

i=1 j=1 I  J    (aij bij )cij = aij bij cij tr (A  B)CT =

[2.146]

i=1 j=1

  = (aij cij )bij = tr (A  C)BT .

[2.147]

– For A ∈ KI×J , B ∈ KJ×I , we have: tr(AB) = aij bji = vecT (AT )vec(B) = 1TI (A  BT )1J .

[2.148] [2.149]

P ROOF .– The index convention allows us to write: tr(AB) = (AB)ii = Ai. B.i = aij bji .

[2.150]


Furthermore, like in the proof of the formula [2.123], introducing the scalar product eij ei j  = δii δjj  allows us to replace the summation of scalars aij bji over the indices i and j with the scalar product of the vectors vec(AT ) and vec(B) of size IJ. Using the relations [2.105] and [2.106], we obtain: tr(AB) = (aij eij )(bj  i ei j  ) = vecT (AT )vec(B). We can also write tr(AB) as: tr(AB) = aij bji =

I  

(A  BT )ij = 1TI (A  BT )1J ,

i=1 j=1



which proves [2.149]. We also have: – For A ∈ KI×I and B ∈ KJ×J : tr(A ⊗ B) = tr(A)tr(B). – For A ∈ K

I×J

,B ∈ K

J×M

[2.151] ,C ∈ K

M ×N

,D ∈ K

N ×I

:

tr(ABCD) = vecT (DT )(CT ⊗ A)vec(B)

[2.152]

= vecT (B)(C ⊗ AT )vec(DT )

[2.153]

= vec (D)(A ⊗ C )vec(B )

[2.154]

= vecT (A)(B ⊗ DT )vec(CT ).

[2.155]

T

T

T

– For A ∈ KI×J , B ∈ KJ×M , C ∈ KM ×I : tr(ABC) = vecT (II )(CT ⊗ A)vec(B)

[2.156]

= vec (A)(B ⊗ II )vec(C ).

[2.157]

T

T

I 

P ROOF .– First, note that the trace of A can be written tr(A) =

i=1

aii = eTi Aei .

– By the expression [2.101], the trace of the Kronecker product can be written as: tr(A ⊗ B) =

I  J 

aii bjj

i=1 j=1

= (ei ⊗ ej )T (A ⊗ B)(ei ⊗ ej ) = (eTi Aei ) ⊗ (eTj Bej ) = (eTi Aei )(eTj Bej ) = tr(A)tr(B). The fourth equality follows from the fact that the quantities in parentheses are scalars. This proves the relation [2.151].
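The trace relations [2.151] and [2.152] can also be checked numerically, as in the following sketch (an editorial NumPy illustration with random matrices, using the column-wise vec):

```python
import numpy as np

def vec(M):
    return M.ravel(order='F')   # column-wise vectorization

rng = np.random.default_rng(7)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))

# [2.151]: tr(A (x) B) = tr(A) tr(B)
print(np.isclose(np.trace(np.kron(A, B)), np.trace(A) * np.trace(B)))

# [2.152]: tr(ABCD) = vec(D^T)^T (C^T (x) A) vec(B)
A2 = rng.standard_normal((3, 4)); B2 = rng.standard_normal((4, 5))
C2 = rng.standard_normal((5, 2)); D2 = rng.standard_normal((2, 3))
lhs = np.trace(A2 @ B2 @ C2 @ D2)
rhs = vec(D2.T) @ np.kron(C2.T, A2) @ vec(B2)
print(np.isclose(lhs, rhs))
```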


j – Following the same approach as before, by the relations eni em n i emj = δii δjj  δmm δnn , [2.100], [2.99], [2.105] and [2.106], we can write:

tr(ABCD) = aij bjm cmn dni  

j = dni eni (cm n ai j  em n i ) bjm emj

= vecT (DT )(CT ⊗ A)vec(B), 



n which proves [2.152]. Similarly, by introducing the term eji em j  i emn , we obtain: 



n tr(ABCD) = aij eji (bj  m dn i em j  i )cmn emn

= vecT (A)(B ⊗ DT )vec(CT ), i.e. the relation [2.155]. The formula [2.152] can also be shown using the relations [2.149] and [2.113], together with the relation tr(AB) = tr(BA):   tr(ABCD) = tr D(ABC) = vecT (DT )vec(ABC) = vecT (DT )(CT ⊗ A)vec(B). The relation [2.153] is obtained by transposing the right-hand side of [2.152], which is a scalar and therefore equal to its transpose. Similarly, since tr(AB) = tr(AT BT ), we have:     tr(ABCD) = tr DT (ABC)T = tr DT (CT BT AT ) , and, using the relations [2.149] and [2.113], we obtain: tr(ABCD) = vecT (D)vec(CT BT AT ) = vecT (D)(A ⊗ CT )vec(BT ), which is the relation [2.154]. The formulae [2.156] and [2.157] can be deduced from [2.152] and [2.155], respectively, by taking N = I and D = II .  2.7. Commutation matrices In this section, we introduce the notion of commutation matrix via the transformation that transforms vec(A) to vec(AT ). Then, after presenting a few properties of these matrices, we will describe several of their uses to permute the factors of a simple or multiple Kronecker product, as well as to define the block Kronecker product.


2.7.1. Definition P ROPOSITION 2.33.– For A ∈ KI×J , we can transform vec(A) to vec(AT ) using the following formula: vec(AT ) = KIJ vec(A), KIJ =

eji ij

[2.158]

∈ RIJ×JI ,

[2.159]

where KIJ is called the vec-permutation matrix, or commutation matrix (Magnus and Neudecker 1979). P ROOF .– We have:  

 

eij = eij ⊗ ej i eji = (δii δjj  ejiji )eji .

[2.160]

The term in parentheses means that there is no summation over the indices i and j, which enables us to separate this term from the factor eji . Hence, from the relations [2.105] and [2.106], we deduce:  

vec(AT ) = aij eij = (δii δjj  ejiji )(aij eji ) = eji ij vec(A),

[2.161] 

which proves [2.158]. The commutation matrix KIJ can be expanded as follows: (I)

KIJ = eji ij = (ei (IJ)

(J)

(J)

⊗ ej )(ej (JI)

= e(i−1)J+j (e(j−1)I+i ) (I×J)

= eji ⊗ eij = Eij (I×J)

where Eij

(I×J)

Eij

(I) T

⊗ ei )

T

[2.162] [2.163]

(J×I)

⊗ Eji

,

[2.164]

is defined in [2.25] as: (I)

= eji = ei

(J)

(I)

(J)

⊗ (ej )T = ei (ej )T .

[2.165]

R EMARK 2.34.– By the definition [2.159], the vectorization formula [2.110] can also be written as:   vec(A ⊗ B) = (IJ ⊗ KLI ⊗ IK ) vec(A) ⊗ vec(B) . [2.166] P ROPOSITION 2.35.– The commutation matrix only depends on the size of A and not on its elements. The 1 in each column n of KIJ is obtained by computing the matrix eji ij for the index pair (i, j) from the element aij located on the nth row of vec(A), with n = (j − 1)I + i.


E XAMPLE 2.36.– For A =

a11 a21

a12 a22

, with I = J = 2, we have:

vec(A) = [a11 , a21 , a12 , a22 ]T , vec(AT ) = [a11 , a12 , a21 , a22 ]T and: T T T T K22 = eji ij = e11 e11 + e12 e21 + e21 e12 + e22 e22 ! ! ! !  1! 1  1 1 T  1 = ⊗ ⊗ + ⊗ 0 0 0 0 0 ! ! ! !  0! 1  1 0 T  0 + ⊗ ⊗ ⊗ 1 1 0 0 1 ⎤ ⎡ 1 0 0 0 ⎢ 0 0 1 0 ⎥ ⎥ = ⎢ ⎣ 0 1 0 0 ⎦. 0 0 0 1

! ! ! 0  0 1 T ⊗ + 1 1 0 ! ! ! 0  0 0 T ⊗ 1 1 1

[2.167]

2.7.2. Properties P ROPOSITION 2.37.– Since the commutation matrix is a permutation matrix, it satisfies the following properties (Magnus and Neudecker 1979): KJI = KTIJ = K−1 IJ .

[2.168]

P ROOF .– These relations can be shown with the index convention: ji T T KJI = eij ji = (eij ) = KIJ .

Furthermore, by the relation [2.76], we deduce that: ij −1 KIJ KJI = eji ij eji = IIJ ⇒ KIJ = KJI ,

which concludes the proof of the relations [2.168].

[2.169] 

2.7.3. Kronecker product and permutation of factors Given the vectors u ∈ KI and v ∈ KK , the Kronecker products u ⊗ v ∈ KIK and v ⊗ u ∈ KKI both contain the same components, consisting of all products of a component of u with a component of v. These two Kronecker products are related by a row permutation matrix. Similarly, for A ∈ KI×J and B ∈ KK×L , the Kronecker products A ⊗ B ∈ and B⊗A ∈ KKI×LJ are related by row and column permutation matrices, K as we will show in the next proposition using an original approach based on the index convention. IK×JL


P ROPOSITION 2.38.– Given the vectors u ∈ KI , v ∈ KK and the matrices A ∈ KI×J , B ∈ KK×L , we have: v ⊗ u = KKI (u ⊗ v)

[2.170]

B ⊗ A = KKI (A ⊗ B)KJL ,

[2.171]

where KKI and KJL are row and column permutation matrices, i.e. commutation matrices defined as in [2.162]: Δ

(K)

KKI = eik ki = (ek =

K  I 

(I)

(I)

⊗ ei )(ei

(K) (I)T

ek ei

(K)

⊗ ek ) T

(I) (K)T

⊗ ei ek

∈ RKI×IK ,

[2.172] [2.173]

k=1 i=1

and Δ

(J)

KJL = elj jl = (ej =

L J  

(L)

(L)

⊗ el )(el

(J) (L)T

ej el

(J)

⊗ e j )T

(L) (J)T

⊗ el ej

∈ RJL×LJ .

[2.174] [2.175]

j=1 l=1

P ROOF .– Using the expression of u ⊗ v given in Table 2.14, we have: u ⊗ v = ui vk eik ; v ⊗ u = ui vk eki .

[2.176]

 

Noting that eikik eik = eki , we deduce that:  

v ⊗ u = (δii δkk eikik )(ui vk eik ) = KKI (u ⊗ v),

[2.177]

which proves the relation [2.170]. Similarly, in the matrix case, by [2.97], we have: lj A ⊗ B = aij bkl ejl ik ; B ⊗ A = aij bkl eki

[2.178]

and, noting that:  

lj lj eikik ejl ik ej  l = eki ,

[2.179]

we can rewrite B ⊗ A as:  

lj B ⊗ A = aij bkl eikik ejl ik ej  l  

lj   = (δii δkk eikik ) (aij bkl ejl ik ) (δjj δll ej  l ) lj = eik ki (A ⊗ B) ejl

= KKI (A ⊗ B)KJL , which proves [2.171], with KKI and KJL defined as in [2.173] and [2.175].




P ROPOSITION 2.39.– Using the property [2.168] of the commutation matrix, the relation [2.171] can also be written as: (B ⊗ A)KLJ = KKI (A ⊗ B).

[2.180]

In the case where A and B are square matrices (I = J, K = L), the relations [2.171] and [2.180] become: B ⊗ A = KKI (A ⊗ B)KIK (B ⊗ A)KKI = KKI (A ⊗ B).

[2.181] [2.182]

Table 2.16 summarizes the formulae established for the vectorization of a transposed matrix and a Kronecker product, as well as for the permutation of factors of Kronecker products of vectors and matrices. 2.7.4. Multiple Kronecker product and commutation matrices P ROPOSITION 2.40.– The relations [2.170] and [2.171] can be generalized to the Kronecker product of three vectors u ∈ KI , v ∈ KJ , w ∈ KK and of three matrices A ∈ KI×J , B ∈ KK×L , C ∈ KM ×N . For example: jk u ⊗ w ⊗ v = eijk ikj (u ⊗ v ⊗ w) = (II ⊗ ekj )(u ⊗ v ⊗ w)

[2.183]

i k w ⊗ v ⊗ u = eijk kji (u ⊗ v ⊗ w) = (ek ⊗ IJ ⊗ ei )(u ⊗ v ⊗ w)

[2.184]

njl C ⊗ A ⊗ B = eikm mik (A ⊗ B ⊗ C)ejln

[2.185]

lnj B ⊗ C ⊗ A = eikm kmi (A ⊗ B ⊗ C)ejln

[2.186]

n j C ⊗ B ⊗ A = (eim ⊗ IK ⊗ em i )(A ⊗ B ⊗ C)(ej ⊗ IL ⊗ en ).

[2.187]

and

E XAMPLE 2.41.– For I = J = K = 2, we have: u ⊗ v ⊗ w = [u1 v1 w1 , u1 v1 w2 , u1 v2 w1 , u1 v2 w2 , u2 v 1 w 1 , u 2 v 1 w 2 , u 2 v 2 w 1 , u 2 v 2 w 2 ] T w ⊗ v ⊗ u = [u1 v1 w1 , u2 v1 w1 , u1 v2 w1 , u2 v2 w1 , u1 v1 w2 , u2 v1 w2 , u1 v2 w2 , u2 v2 w2 ]T = eijk kji (u ⊗ v ⊗ w)

[2.188]


with the following row permutation matrix: j i k eijk kji = ek ⊗ ej ⊗ ei ⎡ 1 0 0 ⎢ 0 0 0 ⎢ ⎢ 0 0 1 ⎢ ⎢ 0 0 0 = ⎢ ⎢ 0 1 0 ⎢ ⎢ 0 0 0 ⎢ ⎣ 0 0 0 0 0 0

0 0 0 0 0 0 1 0

0 1 0 0 0 0 0 0

0 0 0 0 0 1 0 0

0 0 0 1 0 0 0 0

0 0 0 0 0 0 0 1

⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥. ⎥ ⎥ ⎥ ⎥ ⎦

(I×J)

KIJ = eji ij = Eij

[2.189]

(J×I)

⊗ Eji

−1 KJI = KT IJ = KIJ

u ∈ KI , v ∈ KK v ⊗ u = KKI (u ⊗ v) A ∈ KI×J , B ∈ KK×L vec(AT ) = KIJ vec(A)   vec(A ⊗ B) = (IJ ⊗ KLI ⊗ IK ) vec(A) ⊗ vec(B) B ⊗ A = KKI (A ⊗ B)KJL or (B ⊗ A)KLJ = KKI (A ⊗ B)

Table 2.16. Commutation matrices, vectorization, and Kronecker product

For example, the 1 in the fifth and seventh columns of the permutation matrix are obtained for the triplets (i, j, k) = (2, 1, 1) and (2, 2, 1), respectively, i.e. by computing:





0 1 1 0 0 0 2 1 1 e211 = e ⊗ e ⊗ e = ⊗ ⊗ 112 1 1 2 0 0 0 0 1 0





0 1 0 0 0 0 221 2 2 1 ⊗ ⊗ . e122 = e1 ⊗ e2 ⊗ e2 = 0 0 0 1 1 0 The formulae [2.183]–[2.187] can easily be generalized to arbitrary numbers of Kronecker products.


2.7.5. Block Kronecker product Below, we illustrate another use of the index convention for the block Kronecker product of two partitioned vectors and of two partitioned matrices. P ROPOSITION 2.42.– For u ∈ KI and v ∈ KK , partitioned into R blocks  u(r) ∈ KIr R (m) Km and M blocks v ∈ K , respectively, with r ∈ R , m ∈ M  and r=1 Ir = M I, m=1 Km = K, the block Kronecker product u ⊗b v is defined as: ⎡ (1) ⎤ u ⊗ v(1) .. ⎢ ⎥ ⎢ ⎥ ⎢ (1) . (M ) ⎥ ⎢ u ⊗v ⎥ ⎢ ⎥ ⎢ ⎥ . .. u ⊗b v = ⎢ ⎥. ⎢ ⎥ ⎢ u(R) ⊗ v(1) ⎥ ⎢ ⎥ ⎢ ⎥ .. ⎣ ⎦ . u(R) ⊗ v(M ) It can be written in terms of u ⊗ v using the following formula: (r) (m) (RM I K )

u ⊗b v = uir vkm er,m,irr,kmm

[2.190]

r ,m,km = er,i r,m,ir ,km (u ⊗ v),

[2.191]

where the commutation matrix is defined as in section 2.7.1. ⎤ ⎡ (1) ⎤ v u(1) ⎢ .. ⎥ ⎢ .. ⎥ P ROOF .– For u = ⎣ . ⎦ and v = ⎣ . ⎦, we can write: u(R) v(M ) ⎡

) (m) ⊗ u(r) , v = e(M . u = e(R) r m ⊗v

[2.192]

Hence, the Kronecker product u ⊗ v can be expanded as follows: ) (m) u ⊗ v = (e(R) ⊗ u(r) ) ⊗ (e(M ) r m ⊗v (r) (I )

(m) (K )

m ) ⊗ uir eir r ) ⊗ (e(M = (e(R) r m ⊗ vkm ekm )

(r) (m) (RI M K )

= uir vkm er,ir r,m,kmm .


With the index convention, we can expand the block Kronecker product into the Kronecker products of each pair of vectors u(r) ⊗ v(m) , which gives: ) (r) u ⊗b v = e(R) ⊗ e(M ⊗ v(m) r m ⊗u (r) (m)

(I )

(K )

r m ) = uir vkm e(R) ⊗ e(M r m ⊗ eir ⊗ ekm

(r) (m) (RM I K )

= uir vkm er,m,irr,kmm r ,m,km = er,i r,m,ir ,km (u ⊗ v).

[2.193] 

This proves the relation [2.191].

P ROPOSITION 2.43.– For A ∈ KI×J and B ∈ KK×L partitioned into blocks Ars ∈ KIr ×Js , and Bmn ∈ KKm ×Ln , respectively, with r ∈ R , s ∈ S, m ∈ M , R S M N n ∈ N , and r=1 Ir = I, s=1 Js = J, m=1 Km = K, n=1 Ln = L, the block Kronecker product of A with B, called the Tracy–Singh product (1972) and denoted A ⊗b B, is defined as: ⎡ ⎤ A11 ⊗ B11

.. ⎢ . ⎢ ⎢ A11 ⊗ BM 1 ⎢ .. A ⊗b B = ⎢ . ⎢ ⎢ AR1 ⊗ B11 ⎢ .. ⎣

. AR1 ⊗ BM 1

of size (

R

··· .. . ··· .. . ··· .. . ···

M

r=1

A11 ⊗ B1N .. . A11 ⊗ BM N .. . AR1 ⊗ B1N .. . AR1 ⊗ BM N

m=1 Ir Km ,

S

N

s=1

n=1

··· .. . ··· .. . ··· .. . ···

A1S ⊗ B11 ... A1S ⊗ BM 1 .. . ARS ⊗ B11 .. . ARS ⊗ BM 1

··· .. . ··· .. . ··· .. . ···

A1S ⊗ B1N ⎥ ... ⎥ A1S ⊗ BM N ⎥ ⎥ .. ⎥, . ⎥ ARS ⊗ B1N ⎥ ⎥ .. ⎦ . ARS ⊗ BM N

Js Ln ).

The block Kronecker product can be written concisely as: A ⊗b B = [Ars ⊗ B]rs = [Ars ⊗ Bmn ]mn rs ,

[2.194]

where Ars ⊗ B is the (r, s)th sub-block of A ⊗b B, of size Ir K × Js L, which itself admits the matrix Ars ⊗ Bmn , of size Ir Km × Js Ln , as its (m, n)th sub-block. The block Kronecker product therefore corresponds to the Kronecker product of all pairs of blocks of the partitioned matrices A and B. It can be written compactly in terms of the Kronecker product A ⊗ B using the commutation matrix: (r,s) (m,n)

s ,ln A ⊗b B = air ,js bkm ,ln es,n,j r,m,ir ,km

s,n,js ,ln r ,m,km = er,i r,m,ir ,km (A ⊗ B)es,js ,n,ln .

[2.195] [2.196]


The relations [2.195] and [2.196] can be proven with the same approach as for the block Kronecker product of two vectors, with the following correspondences between rows and columns: (r, m, ir , km ) ↔ (s, n, js , ln ). We also define the block Kronecker product of two matrices A = [A1 , · · · , AK ] and B = [B1 , · · · , BK ] partitioned into K column blocks Ak ∈ KI×M and Bk ∈ KJ×N , with k ∈ K, denoted A ⊗b B, as follows: A ⊗b B = A1 ⊗ B1 , · · · , AK ⊗ BK ∈ KIJ×KM N . [2.197] 2.7.6. Strong Kronecker product Another Kronecker product of partitioned matrices, called the strong Kronecker product and denoted | ⊗ |, was introduced by de Launey and Seberry (1994) to generate orthogonal matrices from Hadamard matrices. This Kronecker product is also used to represent tensor train decompositions in the case of large-scale tensors (Lee and Cichocki 2017). Given the matrices A and B partitioned into (R, S) blocks Ars ∈ CI×J , with (r ∈ R, s ∈ S), and (S, N ) blocks Bsn ∈ CK×L , with n ∈ N , respectively, we define the strong Kronecker product A| ⊗ |B as the matrix C partitioned into (R, N ) blocks Crn ∈ CIK×JL such that: Crn =

S 

Ars ⊗ Bsn , r ∈ R , n ∈ N .

s=1

This operation is fully determined by the parameters (R, S, N ). 2.8. Relations between the diag operator and the Kronecker product P ROPOSITION 2.44.– Given the vectors u ∈ CI , w ∈ CJ , x ∈ CIJ , we have: diag(u) ⊗ diag(w) = diag(u ⊗ w)   diag(u ⊗ w) x = diag(x)(u ⊗ w).

[2.198] [2.199]

P ROOF .– – by the expression [2.12] of diag and using index notation, we have: diag(u) ⊗ diag(w) = ui eii ⊗ wj ejj = ui wj eij ij = diag(u ⊗ w). – by the relations [2.198] and [2.13], we deduce that:     diag(u) ⊗ diag(w) x = diag(u ⊗ w) x = diag(x)(u ⊗ w), which completes the proof.

[2.200] 


E XAMPLE 2.45.– For I = J = 2, we have: ⎡ 0 u1 w 1   ⎢ u w 1 2 diag(u) ⊗ diag(w) x = ⎢ ⎣ u2 w 1 0 u2 w 2 ⎡ u 1 w 1 x1 0 ⎢ u w x 1 2 2 = ⎢ ⎣ u2 w 1 x3 0


⎤⎡

⎤ x1 ⎥ ⎢ x2 ⎥ ⎥⎢ ⎥ ⎦ ⎣ x3 ⎦ x4 ⎤ ⎥ ⎥ ⎦ u2 w 2 x 4

= diag(x)(u ⊗ w). 2.9. Khatri–Rao product The Khatri–Rao product of two matrices with the same number of columns is identical to the column-wise Kronecker product of these matrices. As we will see in Chapter 5, this product plays an important role in the matricization of the PARAFAC decomposition of a tensor. After introducing the definition of the simple and multiple Khatri–Rao products of matrices, we will present a few identities that it satisfies, then we will show that it can be used to express the trace of a product of matrices. 2.9.1. Definition Let A ∈ KI×J and B ∈ KK×J be two matrices with the same number of columns. The Khatri–Rao product of A with B is the matrix denoted A  B ∈ KIK×J and defined as (Khatri and Rao 1968, 1972): A  B = A.1 ⊗ B.1 , A.2 ⊗ B.2 , · · · , A.J ⊗ B.J . [2.201] We say that A  B is the column-wise Kronecker product of A and B. It is a matrix partitioned into J column blocks, where the jth block is equal to the Kronecker product of the jth column vector of A with the jth column vector of B. P ROPOSITION 2.46.– The Khatri–Rao product can also be written as a matrix partitioned into I row blocks: ⎡ ⎤ BD1 (A) ⎢ BD2 (A) ⎥ ⎢ ⎥ AB=⎢ [2.202] ⎥, .. ⎣ ⎦ . BDI (A) where Di (A) = diag(ai1 , ai2 , · · · , aiJ ) denotes the diagonal matrix with the elements of the ith row of A along the diagonal.


P ROOF .– By the definition of the Khatri–Rao product, the ith row block can be written as:     ai1 B.1 , · · · , aiJ B.J = B.1 , · · · , B.J diag(ai1 , · · · , aiJ ) = BDi (A) ∈ CK×J . By stacking the row blocks i ∈ I, we deduce the expression [2.202]. E XAMPLE 2.47.– For A =

 A  B = A.1 ⊗ B.1

or alternatively:

[2.203] 

! ! a12 b11 b12 ,B = , we have: a22 b21 b22 ⎛ ⎞ a11 b11 a12 b12 ⎜ ⎟  ⎜a11 b21 a12 b22 ⎟ ⎟ A.2 ⊗ B.2 = ⎜ · · · · · · · · · · · · ⎜ ⎟ ⎝a21 b11 a22 b12 ⎠ a21 b21 a22 b22

a11 a21



!⎞ a11 0 ⎛ ⎞ B ⎜ 0 a12 ⎟ BD1 (A) ⎜ ⎟ ⎟ ⎝ ··· ⎠. AB=⎜ ⎜ · · · · · · · · · · · ·!⎟ = ⎝ BD2 (A) 0 ⎠ a B 21 0 a22

2.9.2. Khatri–Rao product and index convention Table 2.17 presents three examples of the Khatri–Rao product expressed using the index convention. These formulae can be viewed as simplified versions of the formulae involving the Kronecker product listed in Table 2.15.

A ∈ KI×R , B ∈ KK×R , and C ∈ KR×N A  B = air bkr erik (A  B)T = air bkr eik r A  CT = air eri  crn ern = air crn erin

Table 2.17. Khatri-Rao product and index convention
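Since NumPy has no dedicated operator for the column-wise Kronecker product, the following sketch (an editorial illustration; the helper khatri_rao() is a hypothetical name, and SciPy provides an equivalent scipy.linalg.khatri_rao) builds it directly from the definition [2.201]:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product [2.201]: the j-th column is A[:, j] (x) B[:, j]."""
    I, J = A.shape
    K, J2 = B.shape
    assert J == J2, "A and B must have the same number of columns"
    return np.einsum('ij,kj->ikj', A, B).reshape(I * K, J)

rng = np.random.default_rng(9)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((5, 3))

C = khatri_rao(A, B)
print(C.shape)                                            # (20, 3)
print(np.allclose(C[:, 1], np.kron(A[:, 1], B[:, 1])))    # column-wise Kronecker product
```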


2.9.3. Multiple Khatri–Rao product Given the vectors u(n) ∈ KIn and the matrices A(n) ∈ KIn ×R , with n ∈ N , we define the multiple Khatri–Rao product as: N

N

n=1

n=1

 u(n)  u(1)  u(2)  · · ·  u(N ) = ⊗ u(n) ∈ KI1 ···IN

N

 A(n)  A(1)  A(2)  · · ·  A(N ) ∈ KI1 ···IN ×R .

[2.204] [2.205]

n=1

(n)

Using the index convention, and writing ain ,r for the current element of A(n) , the above equations can be rewritten compactly as: N

 u(n) =

N 

n=1

N

 A(n) =

(n)  uin eI

[2.206]

n=1 N 

n=1

(n)  ain ,r erI ,

[2.207]

n=1 N

(I )

N

(I )

(R)

where I = i1 · · · iN , eI = ⊗ einn and erI = ( ⊗ einn ) ⊗ (er )T . n=1

n=1

2.9.4. Properties The Khatri–Rao product satisfies the following properties: – For A ∈ KI×J , B ∈ KM ×J , and ∀α ∈ K: (αA)  B = A  (αB) = α(A  B).

[2.208]

– Associativity: For A ∈ KI×J , B ∈ KK×J , C ∈ KM ×J : A  (B  C) = (A  B)  C = A  B  C.

[2.209]

– Distributivity with respect to addition: For A, B ∈ KI×J , C ∈ KK×J : (A + B)  C = A  C + B  C

[2.210]

C  (A + B) = C  A + C  B.

[2.211]

– Transconjugation: For A ∈ KI×J , B ∈ KK×J : (A  B)H = D∗1 (A)BH , · · · , D∗I (A)BH ∈ KJ×IK D∗i (A) = diag(a∗i1 , · · · , a∗iJ ).

[2.212] [2.213]


P ROOF .– The properties [2.208]–[2.210] are obvious, whereas [2.212] can be deduced by transconjugating the right-hand side of equation [2.202].  – For A ∈ KI×J , B ∈ KK×J , we have:   r(A  B) ≥ max r(A), r(B) .

[2.214]

In the case where A and B have full column rank, A  B also has full column rank, as shown in section 2.11. Like the Kronecker product, the Khatri–Rao product is not commutative in general. However, for a ∈ KJ and B ∈ KI×J , we have: aT  B = B  aT = B diag(a) ∈ KI×J .

[2.215]

2.9.5. Identities P ROPOSITION 2.48.– The Khatri–Rao product AB, where A ∈ KI×R , B ∈ KK×R , satisfies the following identity: A  B = (A ⊗ 1K )  (1I ⊗ B).

[2.216]

P ROOF .– Noting that 1K = δkk ek , and using the first relation in Table 2.17, we have: (A ⊗ 1K )  (1I ⊗ B) = air δkk erik  bkr δii erik = air bkr erik = A  B, 

which proves [2.216].

By choosing A = II ⊗ 1TK and B = 1TI ⊗ IK with R = IK, in the relation [2.216], we obtain the following identity: (II ⊗ 1TK )  (1TI ⊗ IK ) = (II ⊗ 1TK ⊗ 1K )  (1I ⊗ 1TI ⊗ IK ) = (II ⊗ 1K×K )  (1I×I ⊗ IK ) = IIK . E XAMPLE 2.49.– For I = K = R = 2, we have: ⎡ ⎤ ⎡ ⎤ a11 a12 b11 b12 ⎢ a11 a12 ⎥ ⎢ b21 b22 ⎥ ⎥ ⎢ ⎥ (A ⊗ 12 )  (12 ⊗ B) = ⎢ ⎣ a21 a22 ⎦  ⎣ b11 b12 ⎦ a21 a22 b21 b22 ⎡ ⎤ a11 b11 a12 b12 ⎢ a11 b21 a12 b22 ⎥ ⎥ = ⎢ ⎣ a21 b11 a22 b12 ⎦ = A  B. a21 b21 a22 b22

[2.217]


2.9.6. Khatri–Rao product and permutation of factors From [2.170] and [2.171], and with KKI as defined in [2.173], we can deduce that, for u ∈ KI , v ∈ KK , A ∈ KI×J , and B ∈ KK×J : v  u = KKI (u  v)

[2.218]

B  A = KKI (A  B).

[2.219]

Similarly, for A ∈ KI×J , B ∈ KK×J , C ∈ KM ×J , the formulae [2.185]–[2.187] become: C  A  B = eikm mik (A  B  C)

[2.220]

B  C  A = eikm kmi (A  B  C)

[2.221]

C  B  A = (eim ⊗ IK ⊗ em i )(A  B  C).

[2.222]

N

In the case of a multiple Khatri–Rao product  u(n) of N vectors u(n) ∈ K(In ) , n=1

we can permute u(p) into the first position of the Khatri–Rao product, with p ∈ {2, · · · , N }, using the following formula:   N   N  p−1  u(p)   u(n)   u(n) = Π  u(n) [2.223] n=1

n=p+1

n=1

with the permutation matrix: i ···i

p Π = ei1p i1 ···i ⊗ IIp+1 ···IN p−1

(I )

= ei p p ⊗

 p−1 (In )  p (In ) T ⊗ ein ⊗ ein ⊗ IIp+1 ···IN , n=1

n=1

[2.224] [2.225]

where IIp+1 ···IN is the identity matrix of order Ip+1 · · · IN . E XAMPLE 2.50.– For u ∈ KI , v ∈ KJ and w ∈ KK , with I = J = K = 2, we have: ij v  u  w = eijk jik (u  v  w) = (eji ⊗ I2 )(u  v  w), with: ⎡ ⎤ 1 0 0 0

⎢ 0 0 1 0 ⎥ 1 0 ij ⎢ ⎥ ⊗ eji ⊗ I2 = ⎣ 0 1 0 0 ⎦ 0 1 0 0 0 1 ⎡ ⎤ I2 02 02 02 ⎢ 02 02 I2 02 ⎥ ⎥ = ⎢ ⎣ 02 I2 02 02 ⎦ , 02 02 02 I2


which gives:



⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⊗ I = eij 2 ji ⎢ ⎢ ⎢ ⎢ ⎣

1 0 0 0 0 0 0 0

0 1 0 0 0 0 0 0

0 0 0 0 1 0 0 0

0 0 0 0 0 1 0 0

0 0 1 0 0 0 0 0

0 0 0 1 0 0 0 0

0 0 0 0 0 0 1 0

0 0 0 0 0 0 0 1

⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥. ⎥ ⎥ ⎥ ⎥ ⎦

It is now easy to check that: v  u  w = [u1 v1 w1 , u1 v1 w2 , u2 v1 w1 , u2 v1 w2 , u 1 v2 w1 , u 1 v2 w2 , u 2 v2 w1 , u 2 v2 w2 ] T = (eij ji ⊗ I2 )(u ⊗ v ⊗ w) = (eij ji ⊗ I2 )[u1 v1 w1 , u1 v1 w2 , u1 v2 w1 , u1 v2 w2 , u2 v 1 w 1 , u 2 v 1 w 2 , u 2 v 2 w 1 , u 2 v 2 w 2 ] T . 2.9.7. Trace of a product of matrices and Khatri–Rao product P ROPOSITION 2.51.– For A ∈ KI×J , B ∈ KJ×M , C ∈ KM ×I , we have: tr(ABC) = vecT (A)(B  CT )1M .

[2.226]

P ROOF .– Analogously to [2.150], we have: tr(ABC) = aij bjm cmi .

[2.227]

By the relations [2.72] and [2.105], the expression of B  CT deduced from  Table 2.17, and after introducing the term eji em j  i em = δii δjj  δmm , we obtain: 

T T tr(ABC) =(aij eji )(bj  m cm i em j  i )(δmm em ) = vec (A)(B  C )1M ,

which corresponds to the relation [2.226].



Comparing the formulae [2.226] and [2.157], which express tr(ABC) in terms of the Khatri–Rao and Kronecker products, respectively, we deduce: (B  CT )1M = (B ⊗ II )vec(CT ).

[2.228]


2.10. Relations between vectorization and Kronecker and Khatri–Rao products The following proposition presents two expressions for the vectorized form of the matrix product UVT in terms of the Kronecker and Khatri–Rao products of the factor matrices U and V. P ROPOSITION 2.52.– Consider the product UVT ∈ KI×J with U = [u1 · · · uR ] ∈ KI×R and V = [v1 · · · vR ] ∈ KJ×R . We have: vec(UVT ) = (V ⊗ U)vec(IR )

[2.229]

= (V  U)1R ,

[2.230]

where IR and 1R , respectively, denote the identity matrix of order R and the vector of size R composed of ones. P ROOF .– The relation [2.229] can be deduced directly from [2.113] with (A, B, C) = (U, IR , VT ). Moreover, using the relation [2.28], we have: vec(UVT ) = vec(

R 

ur vrT ) =

r=1

R 

vec(ur vrT ) =

r=1

R 

vr  ur

r=1

= (V  U)1R , 

which proves [2.230].
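Relations [2.229] and [2.230] can be checked numerically; the following sketch (an editorial illustration, re-defining the same khatri_rao() helper so that the snippet is self-contained) verifies both for a random rank-3 product UV^T:

```python
import numpy as np

def vec(M):
    return M.ravel(order='F')

def khatri_rao(A, B):
    I, J = A.shape
    K, _ = B.shape
    return np.einsum('ij,kj->ikj', A, B).reshape(I * K, J)

rng = np.random.default_rng(10)
U = rng.standard_normal((4, 3))
V = rng.standard_normal((5, 3))

# [2.230]: vec(U V^T) = (V |x| U) 1_R   (sum of the vectorized rank-one terms)
print(np.allclose(vec(U @ V.T), khatri_rao(V, U) @ np.ones(3)))

# [2.229]: vec(U V^T) = (V (x) U) vec(I_R)
print(np.allclose(vec(U @ V.T), np.kron(V, U) @ vec(np.eye(3))))
```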

2.11. Relations between the Kronecker, Khatri–Rao and Hadamard products There is a certain hierarchy of complexity between the Hadamard, Khatri–Rao, Kronecker and block Kronecker products. – The Hadamard product of A ∈ KI×J with B ∈ KI×J consists of a subset of rows of their Khatri–Rao product: A  B = STI (A  B),

[2.231]

where STI is a row selection matrix defined as: STI

=

I  i=1

(I)

ei



(I)

ei

(I)

⊗ ei

T


= eii i (with the index convention)  2 T 2 (I ) (I 2 ) (I 2 ) (I 2 ) = e1 , eI+2 , e2I+3 , · · · , eI 2 ∈ RI×I .

[2.232] [2.233]

P ROOF .– Noting that the ith row of A  B, for i ∈ I, is given by: (A  B)i. = [ai1 bi1 , ai2 bi2 , · · · , aiJ biJ ]  T  2 T (I) (I) (I ) = ei ⊗ ei (A  B) = eii (A  B), the matrix A  B is obtained by varying i from 1 to I: AB=

I 

(I)

ei (A  B)i. =

i=1

=



I 

(I)

ei

i=1

(I)  (I 2 ) T e1 e1

+

(I)  (I 2 ) T e2 eI+2



(I 2 )

T

eii

(A  B) = eii i (A  B)

 (I 2 ) + · · · + (eI 2 )T (A  B)

[2.234]

or alternatively: ⎡

(I 2 ) T

(e1

)

⎢ (I 2 ) T ⎢ (eI+2 ) AB=⎢ .. ⎢ ⎣ . (I 2 )

⎤ ⎥ ⎥ ⎥ (A  B) = STI (A  B), ⎥ ⎦

[2.235]

(eI 2 )T where the row selection matrix STI is defined as in [2.233]. This proves [2.231]. – The Khatri–Rao product of A ∈ K columns of their Kronecker product:

I×J

with B ∈ K

A  B = (A ⊗ B)SJ ,

K×J



consists of a subset of [2.236]

where SJ is a column selection matrix defined as:   2 2 (J ) (J 2 ) (J 2 ) (J 2 ) SJ = e1 , eJ+2 , e2J+3 , · · · , eJ 2 ∈ RJ ×J

[2.237]

= ejjj (with the index convention).

[2.238]

This relation can be shown by applying the same reasoning to the columns of A ⊗ B as used earlier for the rows of A  B to establish the relation [2.231]. R EMARK 2.53.– If A and B have full column rank, which implies J ≤ min(I, K), then A ⊗ B is itself a matrix with full column rank (see the rank property of the


Kronecker product in Table 2.11). Hence, by [2.236], we can deduce that A  B has the same rank as SJ , which is therefore equal to J. This proves that, if A and B have full column rank, then their Khatri–Rao product also has full column rank, like their Kronecker product. – By combining the relations [2.231] and [2.236], we deduce the following relation between the Kronecker and Hadamard products of two matrices A, B ∈ KI×J , of same size (Marcus and Khan 1959; Lev Ari 2005): A  B = STI (A ⊗ B)SJ

[2.239]

with the row and column selection matrices defined in [2.233] and [2.237]. Table 2.18 summarizes the relations between the three products considered in this chapter.

A, B ∈ KI×J SI =

A  B = ST I (A  B)

(I 2 ) e1

(I 2 )

(I 2 )

(I 2 )

, eI+2 , e2I+3 , · · · , eI 2



A ∈ KI×J , B ∈ KK×J A  B = (A ⊗ B)SJ 

(J 2 ) (J 2 ) (J 2 ) (J 2 ) SJ = e1 , eJ+2 , e2J+3 , · · · , eJ 2 A, B ∈ KI×J A  B = ST I (A ⊗ B)SJ

Table 2.18. Relations between the Hadamard, Khatri-Rao, and Kronecker products
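The selection-matrix relations of Table 2.18 can be verified numerically; in the following sketch (an editorial illustration; selection() builds the matrix S_J of [2.237] and khatri_rao() is the helper used earlier) the three relations are checked on random matrices:

```python
import numpy as np

def selection(J):
    """S_J = [e_1, e_{J+2}, e_{2J+3}, ..., e_{J^2}] as in [2.237], of size J^2 x J."""
    S = np.zeros((J * J, J))
    for j in range(J):
        S[j * (J + 1), j] = 1.0
    return S

def khatri_rao(A, B):
    I, J = A.shape
    K, _ = B.shape
    return np.einsum('ij,kj->ikj', A, B).reshape(I * K, J)

rng = np.random.default_rng(11)
I, J = 3, 4
A = rng.standard_normal((I, J))
B = rng.standard_normal((I, J))

SI, SJ = selection(I), selection(J)
print(np.allclose(A * B, SI.T @ khatri_rao(A, B)))        # Hadamard from Khatri-Rao [2.231]
print(np.allclose(khatri_rao(A, B), np.kron(A, B) @ SJ))  # Khatri-Rao from Kronecker [2.236]
print(np.allclose(A * B, SI.T @ np.kron(A, B) @ SJ))      # Hadamard from Kronecker [2.239]
```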

E XAMPLE 2.54.– For A, B ∈ K2×2 , after becomes: ⎡ 1 2    ⎢ 0 (2) (2) (2) S2 = ej ⊗ ej (ej )T = ⎢ ⎣ 0 j=1 0

transposition, the definition [2.232] ⎤ 0 0 ⎥ ⎥ 0 ⎦ 1


and



a11 b11 a11 b12 ⎢ a11 b21 a11 b22 A⊗B= ⎢ ⎣ a21 b11 a21 b12 a21 b21 a21 b22 ⎡ a11 b11 a12 b12 ⎢ a11 b21 a12 b22 AB= ⎢ ⎣ a21 b11 a22 b12 a21 b21 a22 b22 a11 b11 a12 b12 AB= a21 b21 a22 b22

a12 b11 a12 b21 a22 b11 a22 b21 ⎤

⎤ a12 b12 a12 b22 ⎥ ⎥ a22 b12 ⎦ a22 b22

⎥ ⎥ = (A ⊗ B)S2 ⎦

= ST2 (A  B) = ST2 (A ⊗ B)S2 .

Below, we present a few identities relating the Hadamard and Kronecker products of matrices, and then of vectors. – Let A, C ∈ KI×J , and B, D ∈ KK×L . We have the identity: (A ⊗ B)  (C ⊗ D) = (A  C) ⊗ (B  D).

[2.240]

This identity is easy to prove using the index convention: jl jl (A ⊗ B)  (C ⊗ D) = (aij bkl ejl ik )  (cij dkl eik ) = aij cij bkl dkl eik

= aij cij bkl dkl (eji ⊗ elk ) = (aij cij eji ) ⊗ (bkl dkl elk ) = (A  C) ⊗ (B  D).

[2.241]

– Similarly, for u, x ∈ KI , and v, y ∈ KJ , we have: (u ⊗ v)  (x ⊗ y) = (u  x) ⊗ (v  y).

[2.242]

– Given the vectors u, v, x, y, of respective sizes I, J, K, L, we have the following identities: (u ⊗ v)(x ⊗ y)T = (u ⊗ 1J )(1K ⊗ y)T  (1I ⊗ v)(x ⊗ 1L )T

[2.243]

= (1I ⊗ v)(1K ⊗ y)T  (u ⊗ 1J )(x ⊗ 1L )T

[2.244]

u ⊗ v ⊗ x ⊗ y = (u ⊗ v ⊗ 1KL )  (1IJ ⊗ x ⊗ y)

[2.245]

= (u ⊗ 1J ⊗ x ⊗ 1L )  (1I ⊗ v ⊗ 1K ⊗ y)

[2.246]

= (u ⊗ 1JK ⊗ y)  (1I ⊗ v ⊗ x ⊗ 1L ).

[2.247]


P ROOF .– As an example, we will check the relation [2.243]. The other relations can be checked similarly. Using [2.72] and [2.89], we have: (u ⊗ 1J )(1K ⊗ y)T  (1I ⊗ v)(x ⊗ 1L )T = (ui δjj eij )(δkk yl ekl )  (δii vj eij ) (xk δll ekl ) kl = (ui yl δjj δkk ekl ij )  (vj xk δii δll eij ) kl = ui yl vj xk ekl ij = (ui vj eij )(xk yl e )

= (u ⊗ v)(x ⊗ y)T .  – From [2.243]–[2.247], it is easy to deduce the other relations below, choosing (y = L = 1) and (v = J = 1): (u ⊗ v)xT = (u ⊗ 1J )1TK  (1I ⊗ v)xT

[2.248]

= (1I ⊗ v)1TK  (u ⊗ 1J )xT

[2.249]

u(x ⊗ y)T = u(1K ⊗ y)T  1I (x ⊗ 1L )T

[2.250]

= 1I (1K ⊗ y)T  u(x ⊗ 1L )T

[2.251]

u ⊗ v ⊗ x = (u ⊗ v ⊗ 1K )  (1IJ ⊗ x)

[2.252]

= (u ⊗ 1J ⊗ x)  (1I ⊗ v ⊗ 1K )

[2.253]

= (u ⊗ 1JK )  (1I ⊗ v ⊗ x).

[2.254]

The three products also satisfy the relations in Table 2.19, which are proven below. – Using the relation [2.236] between the Khatri–Rao and Kronecker products, then the properties [2.40] and [2.50] of the Kronecker product, and finally the relation [2.239] between the Hadamard and Kronecker products, we have: (A  B)H (C  D) = STJ (A ⊗ B)H (C ⊗ D)SN = STJ [(AH C) ⊗ (BH D)]SN = (AH C)  (BH D) ∈ KJ×N .

[2.255]

– Similarly, using the relation [2.236] between the Khatri–Rao and Kronecker products, then the property [2.50] of the Kronecker product, and finally the relation [2.236] once again, we obtain: (A ⊗ B)(C  D) = (A ⊗ B)(C ⊗ D)SN = (AC ⊗ BD)SN = (AC  BD) ∈ KIM ×N .

[2.256]


A ∈ KI×J , B ∈ KK×J , C ∈ KI×N , D ∈ KK×N (A  B)H (C  D) = (AH C)  (BH D) ∈ KJ×N A ∈ KI×J , B ∈ KM ×K , C ∈ KJ×N , D ∈ KK×N (A ⊗ B)(C  D) = (AC)  (BD) ∈ KIM ×N A ∈ KI×J , C ∈ KJ×K ⎡ ⎤ λ1 ⎢ ⎥ vec Adiag(λ1 , · · · , λJ )C = (CT  A) ⎣ ... ⎦ ∈ KKI λJ A ∈ KI×J , B ∈ KK×J , where A  B has full column rank

−1 (A  B)† = (AH A  BH B) (A  B)H An ∈ KIn ×J , Cn ∈ KIn ×K , n ∈ N  N

N

N

n=1

n=1

n=1

J×K (  An )H (  Cn ) =  (AH n Cn ) ∈ K

Table 2.19. Relations between the Kronecker, Khatri-Rao, and Hadamard products of matrices
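Two of the relations in Table 2.19 are checked numerically below (an editorial NumPy sketch with random real matrices, so that transconjugation reduces to transposition):

```python
import numpy as np

def khatri_rao(A, B):
    I, J = A.shape
    K, _ = B.shape
    return np.einsum('ij,kj->ikj', A, B).reshape(I * K, J)

rng = np.random.default_rng(12)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((5, 2))
D = rng.standard_normal((4, 2))

# First relation in Table 2.19: (A |x| B)^H (C |x| D) = (A^H C) o (B^H D)
print(np.allclose(khatri_rao(A, B).T @ khatri_rao(C, D), (A.T @ C) * (B.T @ D)))

# Pseudo-inverse of a full column rank Khatri-Rao product:
# (A |x| B)^+ = (A^H A o B^H B)^{-1} (A |x| B)^H
KR = khatri_rao(A, B)
lhs = np.linalg.pinv(KR)
rhs = np.linalg.inv((A.T @ A) * (B.T @ B)) @ KR.T
print(np.allclose(lhs, rhs))
```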

– Noting that: Adiag(λ1 , · · · , λJ )C =

J 

λj A.j Cj.

j=1

and using the property [2.28] with x = A.j , yH = Cj. , we deduce: J J       λj vec A.j Cj. = λj (CTj.  A.j ) vec Adiag(λ1 , · · · , λJ )C = j=1





λ1 ⎢ .. ⎥ = (C  A) ⎣ . ⎦ . λJ T

j=1

[2.257]

This relation can also be deduced from [2.113] by dropping the zeroes from vec diag(λ1 , · · · , λJ ) .


– Using the property [2.255], the Moore–Penrose pseudo-inverse of A  B can be written as:  −1 (A  B)† = (A  B)H (A  B) (A  B)H  −1 = (AH A  BH B) (A  B)H ,

[2.258]

which proves the formula of the Moore–Penrose pseudo-inverse of the Khatri–Rao product of two full column rank matrices. – The last relation in Table 2.19 is a generalization of the first relation to the case of multiple Khatri–Rao products. We also have the relations listed in Table 2.20. u, a ∈ KI , v, b ∈ KJ , x, y ∈ KK (u ⊗ v) xT  yT = uxT  vyT (u ⊗ v)xT = uxT  v 1T K (u ⊗ v)  (a ⊗ b) = (u  a)  (v  b) U, A ∈ KI×K , V, B ∈ KJ×K (U  V)  (A  B) = (U  A)  (V  B)

Table 2.20. Other relations between the Kronecker, Khatri-Rao, and Hadamard products of vectors and matrices

PROOF.– The first relation can be obtained from the property [2.256] by setting A = u, B = v, C = x^T and D = y^T, whereas the second relation follows from the first, with y = 1_K, and from the identity x^T = x^T ⋄ 1_K^T. The third relation can be shown using the index convention, noting that u ⊙ a = u_i e_i ⊙ a_i e_i = u_i a_i e_i and u ⊗ v = u ⋄ v = u_i v_j e_{ij} (see Table 2.14):

(u ⊗ v) ⊙ (a ⊗ b) = (u_i v_j e_{ij}) ⊙ (a_i b_j e_{ij}) = u_i v_j a_i b_j e_{ij}   [2.259]
                  = (u_i a_i e_i) ⊗ (v_j b_j e_j) = (u ⊙ a) ⊗ (v ⊙ b).   [2.260]

Finally, the last relation can be deduced from the previous one by decomposing the matrix Khatri–Rao products column by column.   □

Tables 2.21 and 2.22 summarize the key relations presented in this chapter.


A ∈ K^{I×J}, B ∈ K^{M×N}, C ∈ K^{J×K}, D ∈ K^{N×P}:  (A ⊗ B)(C ⊗ D) = AC ⊗ BD
A ∈ K^{I×J}, B ∈ K^{M×N}, C ∈ K^{J×K}, D ∈ K^{N×K}:  (A ⊗ B)(C ⋄ D) = (AC) ⋄ (BD)
A ∈ K^{I×J}, B ∈ K^{K×J}, C ∈ K^{I×N}, D ∈ K^{K×N}:  (A ⋄ B)^H (C ⋄ D) = (A^H C) ⊙ (B^H D)
A, C ∈ K^{I×J}, B, D ∈ K^{K×L}:  (A ⊗ B) ⊙ (C ⊗ D) = (A ⊙ C) ⊗ (B ⊙ D)

Table 2.21. Basic relations

A ∈ K^{I×J}, B ∈ K^{J×M}, C ∈ K^{M×N}:  vec(ABC) = (C^T ⊗ A) vec(B)
A ∈ K^{I×J}, C ∈ K^{J×K}:  vec(A diag(λ_1, ⋯, λ_J) C) = (C^T ⋄ A) [λ_1 ⋯ λ_J]^T

Table 2.22. Vectorization formulae
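The two vectorization formulae of Table 2.22 are easy to check numerically. The following minimal NumPy sketch is not part of the original text; the sizes, the random seed and the helper `vec` (column-major vectorization) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, M, N, K = 2, 3, 4, 5, 4
A  = rng.standard_normal((I, J))
B  = rng.standard_normal((J, M))
C  = rng.standard_normal((M, N))
C2 = rng.standard_normal((J, K))
lam = rng.standard_normal(J)

vec = lambda X: X.flatten(order='F')          # vec() stacks the columns

# vec(ABC) = (C^T kron A) vec(B)
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))

# vec(A diag(lam) C2) = (C2^T khatri-rao A) [lam_1 ... lam_J]^T
KR = np.column_stack([np.kron(C2.T[:, j], A[:, j]) for j in range(J)])  # (K*I, J)
assert np.allclose(vec(A @ np.diag(lam) @ C2), KR @ lam)
```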

2.12. Applications

In this section, we present two applications of the Kronecker product as examples. The first concerns the computation and arrangement of first-order partial derivatives. Multiple definitions of these derivatives are provided using the index convention. The second application relates to solving matrix equations such as Sylvester and Lyapunov equations, both continuous-time and discrete-time, as well as other equations that often appear when estimating the parameters of tensor models. The problem of estimating the factors of a Khatri–Rao product and a Kronecker product is also solved at the end of the chapter.

2.12.1. Partial derivatives and index convention

Matrix differential calculus, i.e. computing the derivative of a matrix function with respect to a matrix variable, was developed by Magnus and Neudecker (1985, 1988) in the case of real functions and real variables (see also Magnus 2010).


In the following, we will consider real scalar, vector and matrix functions of a real variable that can itself be a scalar, vector or matrix. The nine cases of partial derivatives that we will consider are summarized in Table 2.23. All functions are assumed to be differentiable. Our objective here is to introduce different ways of arranging the first-order partial derivatives5 as a vector or a matrix using the index convention. We will also illustrate these results with the Jacobian matrix of certain functions.

Variable \ Function   |  f ∈ R        |  f ∈ R^I       |  F ∈ R^{I×J}
x ∈ R                 |  ∂f(x)/∂x     |  ∂f(x)/∂x      |  ∂F(x)/∂x
x ∈ R^N               |  ∂f(x)/∂x     |  ∂f(x)/∂x      |  ∂F(x)/∂x
X ∈ R^{M×N}           |  ∂f(X)/∂X     |  ∂f(X)/∂X      |  ∂F(X)/∂X

Table 2.23. Partial derivatives of various functions

Table 2.24 gives various definitions of the first-order partial derivative of a real function with respect to a real variable, stating the size of the derivatives in each case, according to whether the function and the variable are scalars or vectors, and using the index convention. These definitions characterize how the partial derivatives are arranged within a vector or a matrix. For example, for (f, x) ∈ R^I × R^N, the partial derivatives ∂f_i(x)/∂x_n, with i ∈ ⟨I⟩ and n ∈ ⟨N⟩, can be arranged in a matrix in two different ways:

∂f(x)/∂x^T = (∂f_i(x)/∂x_n) e_i^n = [ ∂f_1(x)/∂x_1 ⋯ ∂f_1(x)/∂x_N ; ⋮ ; ∂f_I(x)/∂x_1 ⋯ ∂f_I(x)/∂x_N ] ∈ R^{I×N}   [2.261]

and

∂f^T(x)/∂x = (∂f_i(x)/∂x_n) e_n^i = [ ∂f_1(x)/∂x_1 ⋯ ∂f_I(x)/∂x_1 ; ⋮ ; ∂f_1(x)/∂x_N ⋯ ∂f_I(x)/∂x_N ] ∈ R^{N×I}.   [2.262]

5 A partial derivative of a function f(x_1, ..., x_N) of several independent variables is the derivative with respect to one of the variables, say x_n, while assuming that the other variables are constant. This partial derivative is denoted ∂f(x)/∂x_n, where x = (x_1, ..., x_N).


(Function, variable)        Derivative                                   Size
(f, x) ∈ R × R^N            ∂f(x)/∂x = (∂f(x)/∂x_n) e_n                  R^N
(f, x) ∈ R × R^N            ∂f(x)/∂x^T = (∂f(x)/∂x_n) e^n                R^{1×N}
(f, x) ∈ R^I × R            ∂f(x)/∂x = (∂f_i(x)/∂x) e_i                  R^I
(f, x) ∈ R^I × R            ∂f^T(x)/∂x = (∂f_i(x)/∂x) e^i                R^{1×I}
(f, x) ∈ R^I × R^N          ∂f(x)/∂x^T = (∂f_i(x)/∂x_n) e_i^n            R^{I×N}
(f, x) ∈ R^I × R^N          ∂f^T(x)/∂x = (∂f_i(x)/∂x_n) e_n^i            R^{N×I}
(f, x) ∈ R^I × R^N          ∂f(x)/∂x = (∂f_i(x)/∂x_n) e_{in}             R^{IN}

Table 2.24. Derivatives of a scalar function and a vector function

The form [2.261] is called the Jacobian matrix of f. The ith row vector of this matrix is given by ∂f_i(x)/∂x^T = [ ∂f_i(x)/∂x_1 ⋯ ∂f_i(x)/∂x_N ]. This is the transpose of the gradient vector of the function f_i(x), with i ∈ ⟨I⟩. When I = N, the determinant of the Jacobian matrix is called the Jacobian of f.

EXAMPLE 2.55.– For x ∈ R^I, we have: ∂x/∂x^T = ∂x^T/∂x = I_I.

EXAMPLE 2.56.– For R^2 ∋ x = [r ; θ] → f(x) = [r cos θ ; r sin θ] ∈ R^2, we have:

∂f(x)/∂x^T = [ cos θ, −r sin θ ; sin θ, r cos θ ],   ∂f^T(x)/∂x = [ cos θ, sin θ ; −r sin θ, r cos θ ],

and the Jacobian is equal to r(cos²θ + sin²θ) = r.


EXAMPLE 2.57.– For f(x) = Ax ∈ R^I, where x ∈ R^N and A ∈ R^{I×N} is a constant matrix, the matrix of first-order partial derivatives can be defined as:

∂(Ax)/∂x^T = A ∈ R^{I×N}
∂(Ax)^T/∂x = ∂(x^T A^T)/∂x = A^T ∈ R^{N×I}.

EXAMPLE 2.58.– For f(x) = x^T A x = x_m a_{mn} x_n = x_n a_{nm} x_m, where x ∈ R^M and A ∈ R^{M×M} is a constant matrix, we have:

∂f(x)/∂x = (∂(x^T A x)/∂x_m) e_m = (a_{mn} x_n + x_n a_{nm}) e_m = (A + A^T) x.

If A is a symmetric matrix, then the above expression becomes:

∂(x^T A x)/∂x = 2Ax.   [2.263]
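The gradient expression of Example 2.58 can be checked with central finite differences. The sketch below is not from the book; the dimension, the seed and the step size eps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4
A = rng.standard_normal((M, M))
x = rng.standard_normal(M)

f = lambda z: z @ A @ z                     # f(x) = x^T A x
grad_analytic = (A + A.T) @ x               # expression derived in Example 2.58

eps = 1e-6
grad_num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(M)])   # central differences
assert np.allclose(grad_num, grad_analytic, atol=1e-6)
```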

In the case of a real matrix function F(X) ∈ R^{I×J} and a real matrix variable X ∈ R^{M×N}, the partial derivatives ∂f_{ij}/∂x_{mn} form a fourth-order tensor of size I × J × M × N. There are therefore several ways to arrange these partial derivatives in a vector or matrix, depending on which combinations of the modes (i, j, m, n) are considered. To reduce the problem to that of the partial derivatives of a vector function with respect to a vector variable, one solution is to vectorize F or F^T and X or X^T. Table 2.25 presents four definitions that depend on these vectorizations of F, F^T, X and X^T.

Derivatives

∂vec[F(X)]/∂(vec(X))^T = (∂f_{ij}(X)/∂x_{mn}) e_{ji}^{nm} ∈ R^{JI×NM}
∂vec[F^T(X)]/∂(vec(X^T))^T = (∂f_{ij}(X)/∂x_{mn}) e_{ij}^{mn} ∈ R^{IJ×MN}
∂vec[F(X)]/∂(vec(X^T))^T = (∂f_{ij}(X)/∂x_{mn}) e_{ji}^{mn} ∈ R^{JI×MN}
∂vec[F^T(X)]/∂(vec(X))^T = (∂f_{ij}(X)/∂x_{mn}) e_{ij}^{nm} ∈ R^{IJ×NM}

Table 2.25. Derivatives of a matrix function with respect to a matrix variable


For example, for ∂vec[F(X)]/∂(vec(X))^T ∈ R^{JI×NM}, the matrices F and X are arranged as two column vectors g = vec(F) ∈ R^{JI} and y = vec(X) ∈ R^{NM} such that⁶: g_{(j−1)I+i} = f_{ij} and y_{(n−1)M+m} = x_{mn}. The derivative can then be rewritten as ∂g(y)/∂y^T. The pth row vector of this matrix, with p = (j − 1)I + i, contains the partial derivatives of the function f_{ij}(X) with respect to the various components of the matrix variable X arranged into the vector vec(X).

We can also define ∂g^T(y)/∂y = ∂vec^T[F(X)]/∂vec(X) ∈ R^{NM×JI}. The derivatives ∂g(y)/∂y^T and ∂g^T(y)/∂y correspond to the definitions ∂f(x)/∂x^T and ∂f^T(x)/∂x in Table 2.24, respectively.

To illustrate the results in Table 2.25, let us consider several examples of functions F(X). Let A ∈ R^{I×M} and B ∈ R^{J×N} be constant matrices, and let X ∈ R^{M×N} be a matrix variable. Recalling that vec(AXB^T) = (B ⊗ A) vec(X) ∈ R^{JI}, and using the definition ∂vec[F(X)]/∂(vec(X))^T of the derivative, we obtain:

F(X) = AXB^T ⇒ ∂vec[F(X)]/∂(vec(X))^T = B ⊗ A ∈ R^{JI×NM}   [2.264]
F(X) = AX ⇒ ∂vec[F(X)]/∂(vec(X))^T = I_N ⊗ A ∈ R^{NI×NM}   [2.265]
F(X) = XB^T ⇒ ∂vec[F(X)]/∂(vec(X))^T = B ⊗ I_M ∈ R^{JM×NM}.   [2.266]
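Relation [2.264] can be verified numerically: since vec(AXB^T) is linear in vec(X), a finite-difference Jacobian recovers B ⊗ A exactly up to rounding. The following NumPy sketch is an illustration added here, not part of the book; the sizes and seed are arbitrary assumptions, and vec is implemented with column-major (Fortran) ordering to match the book's convention.

```python
import numpy as np

rng = np.random.default_rng(2)
I, M, J, N = 2, 3, 4, 5
A = rng.standard_normal((I, M))
B = rng.standard_normal((J, N))

vec = lambda Z: Z.flatten(order='F')        # column-major vectorization

def vecF(xvec):
    X = xvec.reshape(M, N, order='F')
    return vec(A @ X @ B.T)

x0 = rng.standard_normal(M * N)
eps = 1e-6
Jac = np.column_stack([(vecF(x0 + eps * e) - vecF(x0 - eps * e)) / (2 * eps)
                       for e in np.eye(M * N)])
assert np.allclose(Jac, np.kron(B, A), atol=1e-6)   # relation [2.264]
```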

For F(X) = X ∈ R^{M×N}, we have:

∂vec[F(X)]/∂(vec(X))^T = e_{nm}^{nm} = I_{NM}   [2.267]
∂vec[F^T(X)]/∂(vec(X^T))^T = e_{mn}^{mn} = I_{MN}.   [2.268]

This last example shows that the definition ∂vec[F(X)]/∂(vec(X))^T gives a matrix of partial derivatives equal to the identity matrix. This definition can be viewed as a generalization of the Jacobian matrix of a vector function of a vector variable (Example 2.55), which justifies it being called the Jacobian matrix of the matrix function F(X) (Magnus and Neudecker, 1985).

6 Recall that, by convention, the order of the dimensions in a product JI is related to the order of variation of the indices, with j varying more slowly than i.


Let us use the same definition of the derivative to compute the derivative of the product of two matrix functions F(X) ∈ R^{I×J} and G(X) ∈ R^{J×K} of a matrix variable X ∈ R^{M×N}. Using the index convention, we obtain:

∂vec[FG(X)]/∂(vec(X))^T = (∂(FG)_{ik}(X)/∂x_{mn}) e_{ki}^{nm} = (∂(f_{ij} g_{jk})/∂x_{mn}) e_{ki}^{nm}
= [ g_{jk} ∂f_{ij}/∂x_{mn} + f_{ij} ∂g_{jk}/∂x_{mn} ] e_{ki}^{nm}.   [2.269]

Noting that e_{ki}^{nm} = e_{ki}^{ji} e_{ji}^{nm} = e_{ki}^{kj} e_{kj}^{nm}, the expansion [2.269] can be rewritten as:

∂vec[FG(X)]/∂(vec(X))^T = (g_{jk} e_{ki}^{ji}) ((∂f_{ij}/∂x_{mn}) e_{ji}^{nm}) + (f_{ij} e_{ki}^{kj}) ((∂g_{jk}/∂x_{mn}) e_{kj}^{nm}),   [2.270]

with:

g_{jk} e_{ki}^{ji} = (g_{jk} e_k^j) ⊗ e_i^i = G^T ⊗ I_I
f_{ij} e_{ki}^{kj} = e_k^k ⊗ (f_{ij} e_i^j) = I_K ⊗ F.

The formula [2.270] can therefore be written as:

∂vec[FG(X)]/∂(vec(X))^T = (G^T ⊗ I_I) ∂vec(F(X))/∂(vec(X))^T + (I_K ⊗ F) ∂vec(G(X))/∂(vec(X))^T ∈ R^{KI×NM}.   [2.271]

This formula was proven by Magnus and Neudecker (1985) with a different proof that does not use the index convention.

Consider now the Hadamard product of two matrix functions F(X), G(X) ∈ R^{I×J} of the matrix variable X ∈ R^{M×N}. Using the vectorization formula of a Hadamard product from Table 2.4, we can write:

vec(F ⊙ G) = diag(vec(F)) vec(G) = diag(vec(G)) vec(F).   [2.272]

We then deduce the following formula for the derivative of the Hadamard product F ⊙ G (Magnus and Neudecker 1985):

∂vec(F ⊙ G)/∂(vec(X))^T = diag(vec(F)) ∂vec(G)/∂(vec(X))^T + diag(vec(G)) ∂vec(F)/∂(vec(X))^T.

For F(X) ∈ R^{I×J}, X ∈ R^{M×N}, we can also define the matrix of partial derivatives as a partitioned matrix, divided either into I × J blocks of size M × N, where each block (i, j) is the matrix ∂f_{ij}(X)/∂X, with i ∈ ⟨I⟩ and j ∈ ⟨J⟩, or into


M × N blocks of size I × J, where each block (m, n) is the matrix ∂F(X)/∂x_{mn}, with m ∈ ⟨M⟩ and n ∈ ⟨N⟩. These two partitioned forms are presented below using the index convention:

e_i^j ⊗ ∂f_{ij}(X)/∂X = [ ∂f_{11}(X)/∂X ⋯ ∂f_{1J}(X)/∂X ; ⋮ ; ∂f_{I1}(X)/∂X ⋯ ∂f_{IJ}(X)/∂X ] ∈ R^{IM×JN}   [2.273]

e_m^n ⊗ ∂F(X)/∂x_{mn} = [ ∂F(X)/∂x_{11} ⋯ ∂F(X)/∂x_{1N} ; ⋮ ; ∂F(X)/∂x_{M1} ⋯ ∂F(X)/∂x_{MN} ] ∈ R^{MI×NJ}   [2.274]

with:

∂f_{ij}(X)/∂X = (∂f_{ij}(X)/∂x_{mn}) e_m^n ∈ R^{M×N}   [2.275]
∂F(X)/∂x_{mn} = (∂f_{ij}(X)/∂x_{mn}) e_i^j ∈ R^{I×J}.   [2.276]

This gives us the differentiation formulae stated in Table 2.26.

F ∈ R^{I×J}, X ∈ R^{M×N}:

Σ_{i=1}^I Σ_{j=1}^J E_{ij}^{(I×J)} ⊗ (∂f_{ij}/∂X) = e_i^j ⊗ (∂f_{ij}/∂X) = (∂f_{ij}/∂x_{mn}) e_{im}^{jn} ∈ R^{IM×JN}
Σ_{m=1}^M Σ_{n=1}^N E_{mn}^{(M×N)} ⊗ (∂F/∂x_{mn}) = e_m^n ⊗ (∂F/∂x_{mn}) = (∂f_{ij}/∂x_{mn}) e_{mi}^{nj} ∈ R^{MI×NJ}

Table 2.26. Definitions of ∂F(X)/∂X in partitioned form

– For F(X) = X ∈ R^{M×N}, the two differentiation formulae in Table 2.26 give the same partial derivative:

(∂x_{ij}/∂x_{mn}) e_{mi}^{nj} = δ_{im} δ_{jn} e_{mi}^{nj} = e_{mm}^{nn} = vec(I_M) vec^T(I_N) ∈ R^{M²×N²},

since, from [2.109], we have e_{mm}^{nn} = vec(I_M) vec^T(I_N), which is a rank-one matrix. This differentiation formula should be compared against [2.267] and [2.268]. Since the Jacobian matrix of the identity function (X → F(X) = X) is the identity matrix, we can conclude that the definition ∂vec[F(X)]/∂(vec(X))^T is preferable to the other definitions (Magnus and Neudecker, 1985).


– For F(X) ∈ R^{I×J} and G(X) ∈ R^{J×K}, with X ∈ R^{M×N}, the second differentiation formula in Table 2.26 gives:

∂(FG(X))/∂X = Σ_{m=1}^M Σ_{n=1}^N E_{mn}^{(M×N)} ⊗ ∂(FG(X))/∂x_{mn}.   [2.277]

Noting that e_i^k = e_i^j e_j^k, F = f_{ij} e_i^j, G = g_{jk} e_j^k, and FG = f_{ij} g_{jk} e_i^k, the property [2.50] of the Kronecker product allows us to expand the above equation as follows using the index convention:

∂(FG(X))/∂X = e_m^n ⊗ (∂(f_{ij} g_{jk})/∂x_{mn}) e_i^k = e_m^n ⊗ (g_{jk} ∂f_{ij}/∂x_{mn} + f_{ij} ∂g_{jk}/∂x_{mn}) e_i^k
= (e_m^n ⊗ (∂f_{ij}/∂x_{mn}) e_i^j)(I_N ⊗ g_{j'k} e_{j'}^k) + (I_M ⊗ f_{ij} e_i^j)(e_m^n ⊗ (∂g_{j'k}/∂x_{mn}) e_{j'}^k)
= (∂F(X)/∂X)(I_N ⊗ G) + (I_M ⊗ F)(∂G(X)/∂X) ∈ R^{MI×NK},   [2.278]

where ∂F(X)/∂X ∈ R^{MI×NJ} and ∂G(X)/∂X ∈ R^{MJ×NK} are in partitioned form, as defined in Table 2.26.

Table 2.27 presents several definitions of the partial derivatives of a scalar function of a matrix variable, as well as of a matrix function of a scalar variable. Below, we prove several formulae for computing the derivatives of traces of matrix products using the index convention.

EXAMPLE 2.59.– For X ∈ R^{M×M} and f(X) = tr(X) = Σ_{j=1}^M x_{jj} = x_{jj}, we have:

∂tr(X)/∂vec(X^T) = (∂tr(X)/∂x_{mn}) e_{mn} = (∂x_{jj}/∂x_{mn}) e_{mn} = δ_{jm} δ_{jn} e_{mn} = e_{mm} = vec(I_M) ∈ R^{M²}
∂tr(X)/∂X = (∂tr(X)/∂x_{mn}) e_m^n = (∂x_{jj}/∂x_{mn}) e_m^n = δ_{jm} δ_{jn} e_m^n = e_m^m = I_M ∈ R^{M×M}.

EXAMPLE 2.60.– Let f(X) = tr(AX) = a_{ji} x_{ij}, i ∈ ⟨M⟩, j ∈ ⟨N⟩, where A ∈ R^{N×M} is a constant matrix, and X ∈ R^{M×N}. We have:

∂tr(AX)/∂X = (∂tr(AX)/∂x_{mn}) e_m^n = a_{nm} e_m^n = A^T.   [2.279]


(Function, variable)           Derivative                                          Size
(f, X) ∈ R × R^{M×N}           ∂f(X)/∂(vec(X))^T = (∂f(X)/∂x_{mn}) e^{nm}          R^{1×NM}
(f, X) ∈ R × R^{M×N}           ∂f(X)/∂vec(X^T) = (∂f(X)/∂x_{mn}) e_{mn}            R^{MN}
(f, X) ∈ R × R^{M×N}           ∂f(X)/∂X = (∂f(X)/∂x_{mn}) e_m^n                    R^{M×N}
(f, X) ∈ R × R^{M×N}           ∂f(X)/∂X^T = (∂f(X)/∂x_{mn}) e_n^m                  R^{N×M}
(F, x) ∈ R^{I×J} × R           ∂vec[F^T(x)]/∂x = (∂f_{ij}(x)/∂x) e_{ij}            R^{IJ}
(F, x) ∈ R^{I×J} × R           ∂F(x)/∂x = (∂f_{ij}(x)/∂x) e_i^j                    R^{I×J}

Table 2.27. Other derivatives

EXAMPLE 2.61.– Let f(X) = tr(AXB) = a_{im} x_{mn} b_{ni}, where A ∈ R^{I×M} and B ∈ R^{N×I} are constant matrices, and X ∈ R^{M×N}. We have:

∂tr(AXB)/∂X = (∂tr(AXB)/∂x_{mn}) e_m^n = a_{im} b_{ni} e_m^n = (BA)_{nm} e_m^n = (BA)^T = A^T B^T.   [2.280]

In Table 2.28, we present several derivatives of traces of matrix products that can be proven from the above examples. Thus, noting that tr(X^T A) = tr(A^T X), the formula for the derivative ∂tr(X^T A)/∂X can be deduced from [2.279] by replacing A with A^T. Similarly, since tr(AX^T B) = tr(B^T X A^T), the formula for the derivative ∂tr(AX^T B)/∂X can be obtained from [2.280] by replacing (A, B) with (B^T, A^T). Other differentiation formulae involving Kronecker products can be found in Tracy and Dwyer (1969) and Brewer (1978).

From the above, we can conclude that there are several possible definitions for arranging partial derivatives. Consequently, it is important to always state the definitions used in any given computation.

Size of the matrices                                          Derivatives
X ∈ R^{M×M}                                                   ∂tr(X)/∂X = I_M
(A, X) ∈ R^{N×M} × R^{M×N}                                    ∂tr(AX)/∂X = A^T
(A, X) ∈ R^{M×N} × R^{M×N}                                    ∂tr(X^T A)/∂X = ∂tr(A^T X)/∂X = A
(A, X, B) ∈ R^{I×M} × R^{M×N} × R^{N×I}                       ∂tr(AXB)/∂X = A^T B^T
(A, X, B) ∈ R^{I×N} × R^{M×N} × R^{M×I}                       ∂tr(AX^T B)/∂X = BA

Table 2.28. Derivatives of matrix traces
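The trace-derivative formulae of Table 2.28 are easy to check with finite differences, using the arrangement ∂f(X)/∂X of Table 2.27. The NumPy sketch below is an added illustration, not part of the book; the sizes, the seed and the helper num_grad are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
I, M, N = 2, 3, 4
X  = rng.standard_normal((M, N))
A  = rng.standard_normal((N, M))
A2 = rng.standard_normal((M, N))
A3 = rng.standard_normal((I, M))
B3 = rng.standard_normal((N, I))

def num_grad(f, X, eps=1e-6):
    """Arrange (d f / d x_mn) at position (m, n), as in Table 2.27."""
    G = np.zeros_like(X)
    for m in range(X.shape[0]):
        for n in range(X.shape[1]):
            E = np.zeros_like(X); E[m, n] = eps
            G[m, n] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

assert np.allclose(num_grad(lambda Z: np.trace(A @ Z), X), A.T, atol=1e-6)       # d tr(AX)/dX
assert np.allclose(num_grad(lambda Z: np.trace(Z.T @ A2), X), A2, atol=1e-6)     # d tr(X^T A)/dX
assert np.allclose(num_grad(lambda Z: np.trace(A3 @ Z @ B3), X), A3.T @ B3.T, atol=1e-6)
```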

2.12.2. Solving matrix equations

The Kronecker product can be used to represent matrix systems of linear equations in such a way that the unknowns initially contained in a matrix are arranged into a vector. A new expression for the equations is obtained by vectorizing both sides of each equation. The system to solve is then expressed in the standard form Ax = b.

PROPOSITION 2.62.– For T ∈ C^{M×I}, S ∈ C^{N×J}, and X ∈ C^{I×J}, using the relations [2.116], [2.117] and [2.113], we can deduce the following equivalences:

TX = C ⇔ (I_J ⊗ T) vec(X) = vec(C),  C ∈ C^{M×J}   [2.281]
XS^T = C ⇔ (S ⊗ I_I) vec(X) = vec(C),  C ∈ C^{I×N}   [2.282]
TXS^T = C ⇔ (S ⊗ T) vec(X) = vec(C),  C ∈ C^{M×N}.   [2.283]

More generally, for T_p ∈ C^{M×I}, S_p ∈ C^{N×J}, C ∈ C^{M×N}, and X ∈ C^{I×J}, we have:

Σ_{p=1}^P T_p X S_p^T = C ⇔ (Σ_{p=1}^P S_p ⊗ T_p) vec(X) = vec(C).   [2.284]
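The equivalence [2.283] can be checked directly, and also used to recover X from C by solving the vectorized system. The following NumPy sketch is an illustration added here (sizes and seed are arbitrary assumptions); vec uses column-major ordering.

```python
import numpy as np

rng = np.random.default_rng(4)
M, I, N, J = 4, 3, 5, 2
T = rng.standard_normal((M, I))
S = rng.standard_normal((N, J))
X = rng.standard_normal((I, J))

vec = lambda Z: Z.flatten(order='F')
C = T @ X @ S.T

# relation [2.283]: vec(T X S^T) = (S kron T) vec(X)
assert np.allclose(vec(C), np.kron(S, T) @ vec(X))

# LS recovery of X (S kron T has full column rank for generic tall T and S)
x_hat, *_ = np.linalg.lstsq(np.kron(S, T), vec(C), rcond=None)
assert np.allclose(x_hat.reshape(I, J, order='F'), X)
```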

2.12.2.1. Continuous-time Sylvester and Lyapunov equations

Using the relations [2.281] and [2.282], the continuous-time Sylvester equation:

TX + XS = C   [2.285]

with T ∈ C^{I×I}, S ∈ C^{J×J}, and C, X ∈ C^{I×J}, can be written as a system of IJ linear equations with IJ unknowns in the following form:

[(I_J ⊗ T) + (S^T ⊗ I_I)] vec(X) = vec(C),   [2.286]

or alternatively:

(S^T ⊕ T) vec(X) = vec(C),   [2.287]

118

Matrix and Tensor Decompositions in Signal Processing

where ⊕ denotes the Kronecker sum defined in [2.63]. If the matrix S^T ⊕ T is non-singular, we obtain the following LS solution:

vec(X̂) = (S^T ⊕ T)^{-1} vec(C).   [2.288]

Given the property [2.64], the eigenvalues of S^T ⊕ T are λ_i + μ_j, where λ_i and μ_j are eigenvalues of T and S, and therefore of S^T, respectively. Hence, the uniqueness condition of the LS solution [2.288] is λ_i + μ_j ≠ 0, ∀i ∈ ⟨I⟩, ∀j ∈ ⟨J⟩, which means that the eigenvalues of S must not be equal to the eigenvalues of T with opposite sign.

In the case where S = T^T, with I = J, Sylvester's equation [2.285] corresponds to a continuous-time Lyapunov equation that is encountered in filtering and optimal control problems. The uniqueness condition stated above then becomes λ_i + λ_j ≠ 0, ∀ i, j ∈ ⟨I⟩.
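As a worked illustration of [2.286]–[2.288], the Sylvester equation can be solved by forming the Kronecker sum explicitly and solving the vectorized system. This NumPy sketch is not from the book (for large problems a dedicated solver such as scipy.linalg.solve_sylvester would be preferable); sizes and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
I, J = 4, 3
T = rng.standard_normal((I, I))
S = rng.standard_normal((J, J))
C = rng.standard_normal((I, J))

vec = lambda Z: Z.flatten(order='F')
kron_sum = np.kron(np.eye(J), T) + np.kron(S.T, np.eye(I))   # S^T ⊕ T, eq. [2.286]

x = np.linalg.solve(kron_sum, vec(C))                        # eq. [2.288]
X = x.reshape(I, J, order='F')
assert np.allclose(T @ X + X @ S, C)                         # eq. [2.285] is satisfied
```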

2.12.2.2. Discrete-time Sylvester and Lyapunov equations

In the discrete-time case, the Sylvester and Lyapunov equations are written as follows, respectively:

X − TXS = C   [2.289]

and

X − TXT^T = C.   [2.290]

In the case where I = J, these equations can be rewritten as:

[I_{I²} − (S^T ⊗ T)] vec(X) = vec(C)   [2.291]
[I_{I²} − (T ⊗ T)] vec(X) = vec(C),   [2.292]

which leads us to the following LS solutions:

vec(X̂) = [I_{I²} − (S^T ⊗ T)]^{-1} vec(C)   [2.293]
vec(X̂) = [I_{I²} − (T ⊗ T)]^{-1} vec(C)   [2.294]

if the matrices I_{I²} − (S^T ⊗ T) and I_{I²} − (T ⊗ T) are non-singular, or equivalently if det(I_{I²} − (S^T ⊗ T)) ≠ 0 and det(I_{I²} − (T ⊗ T)) ≠ 0, i.e. if none of the eigenvalues of S^T ⊗ T and T ⊗ T is equal to 1. Noting the property [2.60] concerning the spectrum of a Kronecker product, this can be translated into the following conditions: λ_i μ_j ≠ 1, ∀i ∈ ⟨I⟩, ∀j ∈ ⟨J⟩, and λ_i λ_j ≠ 1, ∀i, j ∈ ⟨I⟩, respectively, where λ_i and μ_j represent eigenvalues of T and S, respectively.


2.12.2.3. Equations of the form Y = AXB^T

Consider the equation AXB^T = Y, where A and B have full column rank, which implies that B ⊗ A has full column rank. Using the relation [2.283] and the property [2.62], the optimal LS solution is given by:

min_X ‖Y − AXB^T‖²_F ⇔ min_X ‖vec(Y) − (B ⊗ A) vec(X)‖²_2
⇓
vec(X̂) = (B ⊗ A)^† vec(Y) = [(B ⊗ A)^H (B ⊗ A)]^{-1} (B ⊗ A)^H vec(Y)
= (B^H B ⊗ A^H A)^{-1} (B^H ⊗ A^H) vec(Y)
= [(B^H B)^{-1} B^H ⊗ (A^H A)^{-1} A^H] vec(Y) = (B^† ⊗ A^†) vec(Y).   [2.295]

This solution can also be written as:

X̂ = A^† Y (B^T)^†.   [2.296]

PROOF.– Applying the property [2.113] of the Kronecker product to the right-hand side of equation [2.296], we obtain vec(X̂) = ([(B^T)^†]^T ⊗ A^†) vec(Y), where [(B^T)^†]^T = [B^*(B^T B^*)^{-1}]^T = (B^H B)^{-1} B^H = B^†, and so we recover [2.295].   □
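The two equivalent forms [2.295] and [2.296] of the LS solution can be checked numerically in the noiseless case. The NumPy sketch below is an added illustration (real-valued data, arbitrary sizes and seed), not part of the book.

```python
import numpy as np

rng = np.random.default_rng(6)
I, M, J, N = 6, 3, 5, 2                      # A and B tall, hence full column rank
A = rng.standard_normal((I, M))
B = rng.standard_normal((J, N))
X = rng.standard_normal((M, N))
Y = A @ X @ B.T

X_hat = np.linalg.pinv(A) @ Y @ np.linalg.pinv(B.T)          # eq. [2.296]
assert np.allclose(X_hat, X)

vec = lambda Z: Z.flatten(order='F')
assert np.allclose(np.kron(np.linalg.pinv(B), np.linalg.pinv(A)) @ vec(Y), vec(X))  # eq. [2.295]
```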

2.12.2.4. Equations of the form Y = (B ⋄ A)X

Consider the equation (B ⋄ A)X = Y, where B ∈ C^{I×J}, A ∈ C^{K×J}, X ∈ C^{J×N}, Y ∈ C^{IK×N}. This type of equation will be encountered in Chapter 5 when estimating the parameters of a third-order PARAFAC model. Assuming that B ⋄ A has full column rank, which implies the necessary but not sufficient condition IK ≥ J, and using the relation [2.255], the LS estimate of X is given by:

X̂ = (B ⋄ A)^† Y = [(B ⋄ A)^H (B ⋄ A)]^{-1} (B ⋄ A)^H Y = (B^H B ⊙ A^H A)^{-1} (B ⋄ A)^H Y.   [2.297]

If B is column orthonormal, then using the orthonormality property of B allows us to simplify this LS solution as:

X̂ = (I_J ⊙ A^H A)^{-1} (B ⋄ A)^H Y = [diag(‖A_{.1}‖²_2, ⋯, ‖A_{.J}‖²_2)]^{-1} (B ⋄ A)^H Y.   [2.298]

This type of simplified solution is used in Freitas et al. (2018) to compute a semi-blind receiver in a communication system with relays using column orthonormal coding matrices.
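The interest of [2.297] is that only the small J × J Gram matrix B^H B ⊙ A^H A has to be inverted, rather than forming the pseudo-inverse of the tall IK × J Khatri–Rao product. The following NumPy sketch (real-valued, arbitrary sizes and seed, with a hand-rolled khatri_rao helper) is an added illustration, not part of the book.

```python
import numpy as np

def khatri_rao(B, A):
    """Column-wise Kronecker product: column j is B[:, j] kron A[:, j]."""
    I, J = B.shape
    K, _ = A.shape
    return (B[:, None, :] * A[None, :, :]).reshape(I * K, J)

rng = np.random.default_rng(7)
I, K, J, N = 4, 5, 3, 6
B = rng.standard_normal((I, J))
A = rng.standard_normal((K, J))
X = rng.standard_normal((J, N))
Y = khatri_rao(B, A) @ X

# eq. [2.297]: X_hat = (B^H B hadamard A^H A)^(-1) (B kr A)^H Y
G = (B.T @ B) * (A.T @ A)                    # J x J Gram matrix via the Hadamard product
X_hat = np.linalg.solve(G, khatri_rao(B, A).T @ Y)
assert np.allclose(X_hat, X)
```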


2.12.2.5. Estimation of the factors of a Khatri–Rao product

In many applications based on tensor approaches using PARAFAC or Tucker models, parametric estimation of these models plays a fundamental role. In some cases, it is desirable to estimate the factors of Khatri–Rao or Kronecker products, respectively. Some of these factors, denoted A, B, C and D, are estimated by minimizing the following LS criteria:

min_{A,B} ‖Y − A ⋄ B‖²_F  and  min_{C,D} ‖Z − C ⊗ D‖²_F,   [2.299]

where A = [a_1, ⋯, a_R] ∈ K^{I×R}, B = [b_1, ⋯, b_R] ∈ K^{J×R}, C ∈ K^{I×J}, and D ∈ K^{K×L}, and Y = [y_1, ⋯, y_R] ∈ K^{IJ×R} and Z ∈ K^{IK×JL} are given noisy matrices, i.e. Y = A ⋄ B + E and Z = C ⊗ D + F, where E and F represent additive noise matrices caused by estimation and/or modeling errors.

First, consider the case of a Khatri–Rao product. By vectorizing Y, the LS criterion can be rewritten as:

min_{A,B} ‖vec(Y) − vec(A ⋄ B)‖²_2 = min_{a_r, b_r, r ∈ ⟨R⟩} Σ_{r=1}^R ‖y_r − a_r ⊗ b_r‖²_2.   [2.300]

Since each term in this sum can be minimized separately, the columns a_r ∈ K^I and b_r ∈ K^J are estimated by minimizing the criterion min_{a_r, b_r} ‖y_r − a_r ⊗ b_r‖²_2.

Let us define Y_r ≜ unvec(y_r) ∈ K^{J×I} as the matrix obtained by inverting the vectorization operation. Applying the identity [2.103] then gives us:

min_{a_r, b_r} ‖Y_r − b_r a_r^T‖²_F = min_{a_r, b_r} ‖Y_r − b_r ∘ a_r‖²_F,  r ∈ ⟨R⟩,   [2.301]

where ∘ denotes the outer product defined in [3.107]. Now that the criterion has been rewritten in terms of a rank-one matrix, the vectors a_r and b_r can be estimated by computing the rank-one reduced SVD of Y_r:

Y_r = σ_r^(1) u_r^(1) (v_r^(1))^H,   [2.302]

where σ_r^(1) denotes the largest singular value, and u_r^(1) and v_r^(1) are the left and right singular vectors associated with σ_r^(1), respectively. Identifying the reduced SVD [2.302] with b_r a_r^T leads us to the following estimation formulae (Kibangou and Favier 2009b):

â_r = (σ_r^(1))^{1/2} (v_r^(1))^*,  b̂_r = (σ_r^(1))^{1/2} u_r^(1).   [2.303]
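For one column y_r = a_r ⊗ b_r, the rank-one SVD estimator [2.302]–[2.303] can be sketched as follows in NumPy. This is an added illustration (real-valued data, arbitrary sizes and seed), not part of the book; the reshape implements the unvec transformation [2.304] given in the remark below.

```python
import numpy as np

rng = np.random.default_rng(8)
I, J = 4, 3
a = rng.standard_normal(I)
b = rng.standard_normal(J)
y = np.kron(a, b)                            # one column of A khatri-rao B

Y_mat = y.reshape(J, I, order='F')           # unvec: (Y_r)_{j,i} = (y_r)_{j+(i-1)J}
U, s, Vh = np.linalg.svd(Y_mat)
b_hat = np.sqrt(s[0]) * U[:, 0]              # eq. [2.303]
a_hat = np.sqrt(s[0]) * Vh[0, :].conj()

# a_hat and b_hat carry a reciprocal scaling ambiguity (lambda, 1/lambda),
# but their Kronecker product reproduces y exactly
assert np.allclose(np.kron(a_hat, b_hat), y)
```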


REMARK 2.63.–

– The matrix Y_r can be formed from the vector y_r using the following transformation:

(Y_r)_{j,i} = (y_r)_{j+(i−1)J},  i ∈ ⟨I⟩, j ∈ ⟨J⟩.   [2.304]

– Note that the R columns of A and B can be estimated in parallel.

– Each vector a_r and b_r, r ∈ ⟨R⟩, is only estimated up to a scaling factor, since (λ_r a_r) ⊗ ((1/λ_r) b_r) = a_r ⊗ b_r for every non-zero λ_r ∈ K. To eliminate this scaling ambiguity, we simply need to know one component of one of these two vectors. For example, if we assume that the first component (a_r)_1 is equal to 1, then λ_r = 1/(â_r)_1, and the estimated vectors are as follows after removing the ambiguity:

â̂_r = (1/(â_r)_1) â_r,  b̂̂_r = (â_r)_1 b̂_r,   [2.305]

where â_r and b̂_r are defined in [2.303]. From the above, we can conclude that the factors A and B of a Khatri–Rao product can be estimated without scaling ambiguity if we have a priori knowledge of one row of A or B. This follows from the fact that AΛ ⋄ BΛ^{-1} = A ⋄ B for any non-singular matrix Λ = diag(λ_1, ⋯, λ_R) ∈ K^{R×R}.

– In the special case of a Khatri–Rao product of two vectors of which one is known, the other can be estimated using the LS method. Let y = a ⊗ b, where a ∈ K^I, b ∈ K^J and y ∈ K^{IJ}. The LS estimates of the components a_i, i ∈ ⟨I⟩, and b_j, j ∈ ⟨J⟩, of a and b are given by:

- If b is known:

â_i = b^T y^(i) / ‖b‖²_2,  i ∈ ⟨I⟩   [2.306]
y^(i) ≜ [y_{1+(i−1)J}, ⋯, y_{iJ−1}, y_{iJ}]^T = a_i b ∈ K^J.   [2.307]

- If a is known:

b̂_j = a^T y_(j) / ‖a‖²_2,  j ∈ ⟨J⟩   [2.308]
y_(j) ≜ [y_j, y_{j+J}, ⋯, y_{j+(I−1)J}]^T = b_j a ∈ K^I.   [2.309]

Note that, from [2.306] and [2.308], we can construct an iterative alternating least squares (ALS) algorithm for estimating a and b by applying each formula in turn. See section 5.2.5.7 for a description of the ALS algorithm in the context of estimating a PARAFAC model. The advantage of this algorithm compared to the previous method based on computing the SVD is its high numerical simplicity, since it does not require any matrix computations. However, like any iterative algorithm, its speed of convergence depends on the initialization.
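A minimal sketch of this ALS scheme is given below, alternating [2.306] and [2.308] from a random initialization. It is an added NumPy illustration (real-valued, noiseless data, arbitrary sizes and seed), not the book's implementation; with noiseless data the scheme converges after a single sweep.

```python
import numpy as np

rng = np.random.default_rng(9)
I, J = 4, 3
a_true = rng.standard_normal(I)
b_true = rng.standard_normal(J)
y = np.kron(a_true, b_true)                  # y_{(i-1)J + j} = a_i b_j

Y = y.reshape(I, J)                          # row i of Y is the sub-vector y^(i) = a_i b
b = rng.standard_normal(J)                   # arbitrary initialization
for _ in range(20):
    a = Y @ b / (b @ b)                      # eq. [2.306] for every i at once
    b = Y.T @ a / (a @ a)                    # eq. [2.308] for every j at once

# a and b are recovered up to the usual reciprocal scaling ambiguity
assert np.allclose(np.outer(a, b), np.outer(a_true, b_true))
```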


The algorithm that estimates the factors of a Khatri–Rao product by finding rank-one approximations of R matrices can be generalized to the case of a multiple Khatri–Rao product: Y = ⋄_{n=1}^N A^(n), with A^(n) = [a_1^(n), ⋯, a_R^(n)] ∈ K^{I_n×R}, n ∈ ⟨N⟩. This generalization is based on the rank-one approximation of R tensors of order N constructed from the columns of the factors A^(n). Consider the column vector y_r = ⊗_{n=1}^N a_r^(n) ∈ K^{I_1⋯I_N}, for r ∈ ⟨R⟩, and the associated rank-one tensor X_r = ∘_{n=1}^N a_r^(n) ∈ K^{I_1×⋯×I_N} defined using the following transformation:

(X_r)_{i_1,⋯,i_N} = (y_r)_{\overline{i_1⋯i_N}}, where \overline{i_1⋯i_N} ≜ i_N + Σ_{n=1}^{N−1} (i_n − 1) Π_{k=n+1}^N I_k,   [2.310]

as defined in [3.118] for the vectorization of a rank-one tensor. We can now determine the columns a_r^(n) from a rank-one approximation of the tensor X_r using the THOSVD algorithm, which will be presented in section 5.2.1.8. Like for a Khatri–Rao product of two vectors, the R columns of the factors A^(n) can be estimated in parallel.

2.12.2.6. Estimation of the factors of a Kronecker product

Like for the Khatri–Rao product, an LS estimate of the factors of Z = C ⊗ D can be found by solving a rank-one matrix approximation problem using the SVD (Van Loan and Pitsianis 1993). By the identity [2.33] and the property ‖u ⊗ v‖²_2 = ‖u v^T‖²_F, the Frobenius norm of the Kronecker product can be written as follows:

‖C ⊗ D‖²_F = Σ_{j=1}^J Σ_{l=1}^L ‖C_{.j} ⊗ D_{.l}‖²_2 = Σ_{j=1}^J Σ_{l=1}^L ‖C_{.j} D_{.l}^T‖²_F.   [2.311]

By concatenating the columns of C and D, i.e. by vectorizing C and D, we also have:

‖C ⊗ D‖²_F = ‖vec(C) vec^T(D)‖²_F.   [2.312]

This transformation of Z = C ⊗ D into X = vec(C) vec^T(D) amounts to placing the element c_{ij} d_{kl} = z_{k+(i−1)K, l+(j−1)L} in X at the position defined by the matrix e_{ji}^{lk} = e_j^{(J)} ⊗ e_i^{(I)} ⊗ (e_l^{(L)})^T ⊗ (e_k^{(K)})^T, which is equivalent to performing the following transformation:

x_{i+(j−1)I, k+(l−1)K} = z_{k+(i−1)K, l+(j−1)L}.   [2.313]

The LS criterion can be rewritten in terms of the matrix X as:

‖Z − C ⊗ D‖²_F = ‖X − vec(C) vec^T(D)‖²_F = ‖X − vec(C) ∘ vec(D)‖²_F.   [2.314]


Since the matrix X is of rank one, its SVD can be written as X = UΣV^H = σ^(1) u^(1) (v^(1))^H, where σ^(1) is the largest singular value, and u^(1) and v^(1) are the left and right singular vectors associated with σ^(1). The factor matrices C and D can therefore be estimated in vectorized form as follows:

vec(Ĉ) = (σ^(1))^{1/2} u^(1),  vec(D̂) = (σ^(1))^{1/2} (v^(1))^*.   [2.315]

EXAMPLE 2.64.– For I = J = K = L = 2, we have:

Z = C ⊗ D =
[ c11 d11  c11 d12  c12 d11  c12 d12
  c11 d21  c11 d22  c12 d21  c12 d22
  c21 d11  c21 d12  c22 d11  c22 d12
  c21 d21  c21 d22  c22 d21  c22 d22 ]

X = vec(C) vec^T(D) =
[ c11 d11  c11 d21  c11 d12  c11 d22
  c21 d11  c21 d21  c21 d12  c21 d22
  c12 d11  c12 d21  c12 d12  c12 d22
  c22 d11  c22 d21  c22 d12  c22 d22 ]
=
[ z11  z21  z12  z22
  z31  z41  z32  z42
  z13  z23  z14  z24
  z33  z43  z34  z44 ].
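The rearrangement [2.313] followed by a rank-one SVD, i.e. the Van Loan–Pitsianis procedure summarized in [2.315], can be sketched in a few lines of NumPy. This is an added illustration (real-valued, noiseless data, arbitrary sizes and seed), not part of the book; each row of X_mat below is a vectorized block Z_ij = c_ij D.

```python
import numpy as np

rng = np.random.default_rng(10)
I, J, K, L = 2, 3, 4, 2
C = rng.standard_normal((I, J))
D = rng.standard_normal((K, L))
Z = np.kron(C, D)

# rearrangement [2.313]: row (i, j) of X_mat = vec(Z_ij) = c_ij vec(D)
X_mat = np.vstack([Z[i*K:(i+1)*K, j*L:(j+1)*L].flatten(order='F')
                   for j in range(J) for i in range(I)])     # (IJ, KL)

U, s, Vh = np.linalg.svd(X_mat)
vecC = np.sqrt(s[0]) * U[:, 0]                               # eq. [2.315]
vecD = np.sqrt(s[0]) * Vh[0, :].conj()
C_hat = vecC.reshape(I, J, order='F')
D_hat = vecD.reshape(K, L, order='F')

# C_hat and D_hat are determined up to reciprocal scalar factors (lambda, 1/lambda),
# so their Kronecker product matches Z exactly
assert np.allclose(np.kron(C_hat, D_hat), Z)
```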

REMARK 2.65.– Unlike a Khatri–Rao product of two matrices, which admits a diagonal ambiguity matrix, a Kronecker product is ambiguous up to a scalar scaling factor, since (λC) ⊗ ((1/λ)D) = C ⊗ D. The factors C and D can therefore be estimated without ambiguity, provided that one element of C or D is known.

In the case of a Kronecker product Z = C ⊗ D where one of the two factors is known, it is possible to estimate the other factor using the LS method. Thus, for example, if D is known, then each element c_{ij} of C can be estimated from the block Z_{ij} = c_{ij} D of Z associated with the coefficient c_{ij}. By vectorizing both sides of this equation, we obtain the following LS estimate:

ĉ_{ij} = [vec(D)]^H vec(Z_{ij}) / ‖D‖²_F,  i ∈ ⟨I⟩, j ∈ ⟨J⟩.   [2.316]

Similarly, in the case where C is known, after permuting the factors C and D using the relation [2.171]:

Y = D ⊗ C = K_{KI} (C ⊗ D) K_{JL},   [2.317]


with the commutation matrices K_{KI} and K_{JL} defined in [2.173] and [2.175], it is possible to estimate the factor D by applying the LS method to the equation Y_{kl} = d_{kl} C, which gives:

d̂_{kl} = [vec(C)]^H vec(Y_{kl}) / ‖C‖²_F,  k ∈ ⟨K⟩, l ∈ ⟨L⟩.   [2.318]

Like for the Khatri–Rao product, we can construct an iterative estimation algorithm for C and D based on the ALS method using equations [2.316] and [2.318] in turn.

The algorithm for estimating the factors of a Kronecker product presented above can be generalized to the case of a multiple Kronecker product: Z = ⊗_{n=1}^N A^(n), with A^(n) ∈ K^{I_n×J_n}, n ∈ ⟨N⟩.

The idea is to construct a rank-one tensor from Z, defined as the outer product of the vectorized forms vec(A^(n)) ∈ K^{J_n I_n} of the matrices A^(n):

X = vec(A^(1)) ∘ vec(A^(2)) ∘ ⋯ ∘ vec(A^(N)) ∈ K^{J_1 I_1 × J_2 I_2 × ⋯ × J_N I_N}.   [2.319]

This is equivalent to performing the following transformation:

(X)_{k_1,⋯,k_N} = z_{\overline{i_1⋯i_N}, \overline{j_1⋯j_N}},  k_n ∈ ⟨K_n⟩, n ∈ ⟨N⟩,   [2.320]

where k_n = (j_n − 1)I_n + i_n, K_n = J_n I_n, and \overline{i_1⋯i_N} and \overline{j_1⋯j_N} are defined as in [3.118]. The problem of estimating the factors A^(n) can be solved in the sense of minimizing the following LS criterion, which generalizes the criterion [2.314] to the case of a Kronecker product of N matrix factors:

min_{A^(1),⋯,A^(N)} ‖Z − ⊗_{n=1}^N A^(n)‖²_F = min_{A^(1),⋯,A^(N)} ‖X − vec(A^(1)) ∘ ⋯ ∘ vec(A^(N))‖²_F.

This minimization amounts to finding the best rank-one approximation for the tensor X using the THOSVD method. This approximation gives an estimate of the vectorized forms vec(A^(n)), from which it is then easy to deduce an estimate of the factors A^(n). This estimate is subject to a scalar scaling ambiguity λ_n for each factor A^(n), where Π_{n=1}^N λ_n = 1.

3 Tensor Operations

3.1. Introduction

Tensor operations play an important role in tensor calculus. They allow us to rearrange the elements of a tensor into a vector or a matrix, and more generally into a reduced order tensor, to define tensor models and decompositions, and to compute the eigenvalues of a tensor. In this chapter, we will study two tensor multiplication operations, called the Tucker and Einstein products, more thoroughly. We will use this last product to introduce the notions of inverse and pseudo-inverse tensors, as well as tensor decompositions in the form of factorizations. The Hadamard and Kronecker tensor products will also be defined. Finally, we will describe a few examples of tensor systems that can be solved using the least squares (LS) method.

As we saw in Volume 1 (Favier 2019), there is a very close link between tensors, polynomials and multilinear forms. We will now use this connection to present different classes of tensors. Just as, in the matrix case, the notions of symmetric bilinear form and positive definite quadratic form lead to symmetric and positive definite matrices, the notion of a symmetric positive definite multilinear form will lead us to symmetric positive definite tensors, as presented in Chapter 4. Several types of symmetry are presented in detail in this chapter, whose key objectives are to:

– present some links between multilinear forms, homogeneous polynomials and different sets of tensors;
– describe various operations with tensors (matricization, vectorization, transposition, multiplication, inversion, pseudo-inversion, extension, tensorization and Hankelization);
– present certain tensor decompositions, e.g. singular value decomposition (SVD) and full-rank factorization;




– introduce the notion of tensor systems and illustrate the solution of some of them using the LS method. After introducing some notations relating to tensors and various sets of tensors, we will define the notions of slice (fibers, matrix slices and tensor slices) and mode combination. We will then present the most common ways of arranging the elements of a tensor as matrices and vectors, corresponding to the operations of matricization (also known as matrix unfolding) and vectorization. Next, we will describe the operation of transposition/transconjugation, introducing the notions of orthogonal/unitary tensors, as well as idempotence for square tensors. Various tensor multiplication operations and their properties will be presented, with a particular focus on the mode-p product, also known as the Tucker product, and the Einstein product. We will show how this last product can be used to define the notion of the generalized inverse and therefore of the Moore–Penrose pseudo-inverse of a tensor, to develop tensor factorizations, such as the eigendecomposition and SVD of a tensor, and to solve tensor systems. The Hadamard and Kronecker products of tensors will also be defined, as well as the operations of tensor extension, tensorization and Hankelization. The organization and content of this chapter is as follows: – in section 3.2, we define the notations and the various sets of tensors considered throughout the chapter; – in section 3.3, we present the notions of matrix and tensor slices; – the modal contraction operation resulting from a mode combination is presented in section 3.4; – the notion of the partitioned tensor is introduced in section 3.5; – (partially) diagonal tensors are defined in section 3.6; – the operations of matricization and vectorization based on mode combinations are presented in sections 3.7 and 3.9. Mode-n matricized forms, with n ∈ N , allow us to define the modal subspaces and the multilinear rank of an N th-order tensor in section 3.8; – section 3.10 focuses on the operation of transposition; – a few structured tensors, including (partially) symmetric tensors and triangular tensors are presented in sections 3.11 and 3.12; – various products with tensors that generalize matrix–matrix multiplication and their properties are described, distinguishing between the outer product (section 3.13.1), tensor–matrix (also called mode-p) multiplication (section 3.13.2), tensor–vector multiplication (section 3.13.3), mode-(p, n) product (section 3.13.4) and the Einstein product (section 3.13.5). The TT model will be briefly introduced using the mode-(p, n) product;



– the notions of the inverse, generalized inverse and Moore–Penrose pseudoinverse of a tensor are introduced in section 3.14; – tensor decompositions expressed as factorizations, such as the eigendecomposition of a square tensor and the SVD and full-rank decompositions of a rectangular tensor, are presented in section 3.15; – the notions of the inner product, Frobenius norm and trace of a tensor are defined in section 3.16; – some links between tensor products and (linear and multilinear) tensor systems, as well as the LS solution of certain systems, are established in section 3.17; – the Hadamard and Kronecker tensor products are defined in section 3.18; – the operations of tensor extension and tensorization, which consists of constructing a data tensor, are described in sections 3.19 and 3.20, respectively; – the Hankelization operation, which can be viewed as a particular tensorization method allowing us to build a Hankel matrix or a higher order Hankel tensor from a long data vector, will also be briefly discussed in section 3.21. 3.2. Notation and particular sets of tensors Let X ∈ KI1 ×···×IN be a tensor of order N and size I1 × · · · × IN . The order corresponds to the number of indices that characterize its elements xi1 ,··· ,iN ∈ K, also denoted xi1 ···iN or (X )i1 ,··· ,iN . Each index in ∈ In , for n ∈ N , is associated with a mode, also called a way, and In denotes the dimension of the nth mode. The N number of elements in X is equal to n=1 In . The tensor X is said to be real (respectively, complex) if its elements are real numbers (respectively, complex numbers), which corresponds to K = R (respectively, K = C). It is said to have even order (respectively, odd order) if N is even (respectively, odd). The special cases N = 2 and N = 1 correspond to the sets of matrices X ∈ KI×J and column vectors x ∈ KI , respectively. A real tensor X ∈ RI1 ×···×IN is said to be non-negative (respectively, positive) if all of its elements are positive or zero (respectively, positive), i.e. if xi1 ,··· ,iN ≥ 0 (respectively, xi1 ,··· ,iN > 0) for all in ∈ In  and all n ∈ N . The tensor X is said to be the zero tensor if all of its elements are zero, regardless of its size. The zero tensor is denoted as O. In applications, a tensor is typically viewed as an array of numbers [xi1 ,··· ,iN ]. This explains some of the other names given to tensors, such as N -way or multiway array. As we saw in Volume 1, the notion of a tensor of order N can be introduced by defining the tensor product (denoted ⊗) of N vector spaces Un of dimension In defined over the same field K (Lim 2013). The resulting tensor space, denoted U1 ⊗ · · · ⊗ UN ,



is the set of tensors of order N and size I1 × · · · × IN , where In = dim(Un ). By (n) (n) defining a basis B (n) = {b1 , · · · , bIn } for each vector space Un , any tensor X can be represented with respect to the bases {B (n) , n ∈ N } as a linear combination of tensor products of the basis vectors, i.e.: X =

I1 

···

i1 =1

IN  iN =1

(1)

(2)

(N )

ci1 ,··· ,iN bi1 ⊗ bi2 ⊗ · · · ⊗ biN .

[3.1]

The coordinates ci1 ,··· ,iN , for in ∈ In , define a hypermatrix C = ci1 ,··· ,iN ∈ KI1 ×···×IN and characterize the tensor X with respect to the bases {B (n) , n ∈ N }. Note that the operator ⊗ applied to any general vectors of the vector spaces Un is called the tensor product. But when the vector spaces Un are defined as KIn , the tensor space U1 ⊗ · · · ⊗ UN is more specifically the space KI1 ×···×IN of dimension N

(n)

(1)

(N )

I1 ×· · ·×IN , and the tensor product of the basis vectors ⊗ bin  bi1 ⊗· · ·⊗biN , n=1

(n)

with bin ∈ KIn , can be replaced by the outer product of these vectors, denoted

N (n) ◦ b n=1 in

(1)

(N )

 bi1 ◦ · · · ◦ biN , which defines a rank-one tensor. In particular, in the canonical basis, we have: X =

I1  i1 =1

···

IN  iN =1

N

[3.2]

n=1

(I ×I ×···×IN )

= xi1 ,··· ,iN Ei1 i12 ···i2N (I )

(I )

xi1 ,··· ,iN ◦ einn ,

(I ×I ×···×I )

N where einn is the in th vector of the canonical basis of RIn , and Ei1 i12 ···i2N is the I1 ×I2 ×···×IN , with 1 at the position (i1 , i2 , · · · , iN ) and zeros everywhere tensor of R else. The last equality uses the index convention. The coordinate hypermatrix is then written as X = xi1 ,··· ,iN , i.e. as the tensor itself, with in ∈ In  for n ∈ N .

In the following, tensors will be identified with their coordinate hypermatrices, which means that the bases {B (n) , n ∈ N } are implicitly fixed. If I1 = · · · = IN = I, the N th-order tensor X = [xi1 ,··· ,iN ] ∈ KI×I×···×I is said to be hypercubic, of dimensions I, with in ∈ I, for n ∈ N . The number of elements in X is then equal to I N . The set of (real or complex) hypercubic tensors of order N and dimensions I will be denoted K[N ;I] . The special cases N = 3 and N = 2 correspond to the cubic tensors of size I × I × I and the square matrices of size I × I, respectively. Hypercubic tensors play an important role in statistical signal processing approaches to solving blind source separation (BSS), and system identification



problems using tensors of cumulants of order greater than two, in which each mode has the same dimension. The identity tensor of order N and dimensions I is denoted IN,I = [δi1 ,··· ,iN ], with in ∈ I for n ∈ N , or simply I. It is a hypercubic tensor whose elements are defined using the generalized Kronecker delta:  1 if i1 = · · · = iN . δi1 ,··· ,iN = 0 otherwise The identity tensor IN,I ∈ R[N ;I] is a diagonal tensor whose diagonal elements are equal to 1 and whose other elements are all zero. For the rest of this chapter, we will write KI1 ×···×IP ×J1 ×···×JN  KI P ×J N , with I P = I1 × · · · × IP and J N = J1 × · · · × JN for the set of tensors of order P + N and of size I1 × · · · × IP × J1 × · · · × JN . In the case where Ip = I, ∀p ∈ P , and Jn = J, ∀n ∈ N , the corresponding set of tensors will be denoted K[P +N ;I,J] . If we also have N = P and I = J, the set of tensors of order 2P and of dimensions I 2P will be denoted K[2P ;I] . Some authors write KI for this set instead. Table 3.1 summarizes the notation used for sets of indices and dimensions. iP  {i1 , · · · , iP } ; jN  {j1 , · · · , jN } IP  {I1 , · · · , IP } ; JN  {J1 , · · · , JN } I P  I1 × · · · × IP ; J N  J1 × · · · × JN I P × J N = I1 × · · · × IP × J 1 × · · · × J N I P × I P = I1 × · · · × IP × I1 × · · · × IP  ΠIP  I1 · · · IP = P p=1 Ip

Table 3.1. Notation for sets of indices and dimensions

Using this notation and the index convention, the multiple sum over the indices of xi1 ,··· ,iP yi1 ,··· ,iP will be abbreviated to: I1  i1 =1

···

IP  iP =1

xi1 ,··· ,iP yi1 ,··· ,iP =

IP  iP =1

xiP yiP = xiP yiP ,

[3.3]

where 1 denotes a set of ones, whose number is fixed by the index P of the set IP . The notations iP and IP allow us to simplify the expression of the multiple sum into a single sum over an index set, which is further simplified by using the index convention. Table 3.2 summarizes the various sets of tensors that will be considered in this chapter.



Order

Size

Sets of tensors

P

I P = I1 × · · · × IP

KI1 ×···×IP  KI P

P

I P = I1 × · · · × IP with Ip = I, ∀p ∈ P 

K[P ;I]

P + N I P × J N = I1 × · · · × IP × J1 × · · · × JN

KI P ×J N

P +N

IP × JN = I × · · · × I × J × · · · × J with Ip = I, ∀p ∈ P  and Jn = J, ∀n ∈ N 

2P

I P × I P with Ip = I, ∀p ∈ P 

K[2P ;I]

I1 × J1 × · · · × IP × JP

[2P ; Ip ×Jp ] Kp

2P

K[P +N ;I,J]

Table 3.2. Various sets of tensors

R EMARK 3.1.– We can make the following remarks about the sets of tensors defined in Table 3.2: – for P = N = 1, the set K[2;I,J] is the set KI×J of (real or complex) matrices of size I × J; – the set K[P ;I] is also denoted KI

P

or TP (KI ) by some authors;

– the set KI P ×I P is called the set of even-order (or square) tensors of order 2P and size I P × I P . The name square tensor comes from the fact that the index set is divided into two identical subsets of dimension I P ; – analogously to matrices, tensors in the sets KI P ×J P with Jp = Ip and KI P ×J N are said to be rectangular. The set KI P ×J N is called the set of rectangular tensors with index blocks of dimensions I P and J N ; [2P ; I ×J ]

p p – the set Kp  KI1 ×J1 ×···×IP ×JP , introduced by Huang and Qi (2018) and called the set of even-order paired tensors, corresponds to the case where the indices are divided into adjacent pairs {ip , jp } associated with the dimensions Ip ×Jp , for p ∈ P . The index p in Kp refers to this adjacent index pairing. These tensors play an important role in elasticity theory.

The various tensor classes introduced above can be associated with scalar multilinear forms in vector variables and homogeneous polynomials. Analogously to matrices, which can be associated with bilinear and quadratic forms in the real case and sesquilinear and Hermitian forms in the complex case, we will distinguish between real-valued multilinear forms and complex-valued multilinear forms. Like in the matrix case, we will distinguish between homogeneous polynomials of degree P that depend on the components of P vector variables and those that depend on just one vector variable. Another difference between the tensor case (for orders greater than two) and the matrix case is the number of conjugated variables compared to the number of non-conjugated variables.



These various multilinear forms are summarized in Table 3.3, which also states the transformations corresponding to each of them, as well as the associated tensors. Multilin. forms real-valued

Transformations

Tensors

 P  × RIp  (x(1) , · · · x(P ) ) −→ f x(1) , · · · x(P ) ∈ R

A ∈ RI P

RI  x −→ f (x, · · · , x) ∈ R   

A ∈ R[P ;I]

p=1

in P vectors real-valued

P terms

in one vector complex-valued in P + N vectors

P

N

p=1

n=1

× CIp × × CJn  (y(1) , · · · , y (P ) , x(1) , · · · , x(N ) ) −→ A ∈ CI P ×J N   f (y(1) )∗ , · · · , (y(P ) )∗ , x(1) , · · · , x(N ) ∈ C       N terms

P terms

complex-valued

CI  x −→ f (x∗ , · · · , x∗ , x, · · · , x) ∈ C       P terms

in one vector

A ∈ C[2P ;I]

P terms

Table 3.3. Multilinear forms and associated tensors

Table 3.4 recalls the definitions of bilinear/quadratic and sesquilinear/Hermitian forms using the index convention, and then presents the multilinear forms defined in Table 3.3, as well as the associated tensors from Table 3.2 and the corresponding homogeneous polynomials. For instance, the real-valued multilinear form in P vectors x(p) ∈ RIp , p ∈ P  can be written as: IP I1 P      (1) (P ) (p) f x(1) , · · · , x(P ) = xi p . ··· ai1 ,··· ,iP xi1 · · · xiP = aiP i1 =1

iP =1

p=1

[3.4] Similarly, the complex multilinear form in P conjugated vectors and N non-conjugated vectors can be written as: IP  I1 J1     f (y(1) )∗ , · · · , (y(P ) )∗ , x(1) , · · · , x(N ) = ··· ··· i1 =1 JN  jN =1

iP =1 j1 =1

ai1 ,··· ,iP ,j1 ,··· ,jN



(1)

(P )

(1)

P 

(N )

(yi1 )∗ · · · (yiP )∗ xj1 · · · xjN = aiP ,j

N

(p)

(yip )∗

p=1

N 

(n)

xj n .

[3.5]

n=1

Forms

Matrices/Tensors

Homogeneous polynomials

Bilinear

A ∈ RI×J ; y ∈ RI , x ∈ RJ

f (x, y) = yT Ax = aij yi xj

Quadratic

RI×I ; x

f (x) = xT Ax = aij xi xj

A∈



RI

Sesquilinear

A ∈ CI×J ; y ∈ CI , x ∈ CJ

f (x, y) = yH Ax = aij yi∗ xj

Hermitian

A ∈ CI×I ; x ∈ CI

f (x) = xH Ax = aij x∗i xj   (p) f x(1) , · · · x(P ) = aiP P p=1 xip

Real multilinear in P vectors

A ∈ RI P ; x(p) ∈ RIp

Real multilinear

A ∈ R[P ;I] ; x ∈ RI



f (x, · · · , x) = aiP   

p=1

xi p

P terms

in one vector Complex multilinear

P

[2P ;I]

A∈C

  f x∗ , · · · , x∗ , x, · · · , x =      

; x ∈ CI

P terms

in one vector

Complex multilinear

A ∈ CI P ×J N ;

in P + N vectors

y(p) ∈ CIp , p ∈ P ; x(n) ∈ CJn , n ∈ N 

aiP ,j

P

p=1

P

P terms

x∗i p

P

n=1

xj n

 f (y(1) )∗ , · · · , (y(P ) )∗ , x(1) , · · · , x(N ) =       

P terms

aiP ,j

N terms

P N

(p) ∗ N (n) p=1 (yip ) n=1 xjn

Table 3.4. Multilinear forms and associated homogeneous polynomials

R EMARK 3.2.– We can make the following remarks: – In the same way that bilinear/sesquilinear forms depend on two variables that do not belong to the same vector space, general real and complex multilinear forms depend on variables that belong to different vector spaces: P variables x(p) ∈ RIp in the real case; P conjugated variables y(p) ∈ CIp and N non-conjugated variables x(n) ∈ CJn in the complex case, with p ∈ P  and n ∈ N . – Analogously to the quadratic and Hermitian forms obtained from bilinear and sesquilinear forms by replacing the pair (x, y) with the vector x, real and complex multilinear forms expressed using just one vector x ∈ KI will be called multiquadratic forms (K = R) and multi-Hermitian forms (K = C), respectively. In section 3.11, we will see that, in the same way the symmetric quadratic/Hermitian forms lead to the notion of a symmetric/Hermitian matrix, the symmetry of multiquadratic/ multi-Hermitian forms is directly linked to the symmetry of their associated tensors.



3.3. Notion of slice A slice is a sub-tensor obtained by fixing one or more indices. In the case of a tensor of order N , if we fix N − 1 indices, we obtain a vector called a fiber, which generalizes the column and row vectors of a matrix to tensors of order greater than two. 3.3.1. Fibers In the case of a third-order tensor X ∈ KI×J×K , we obtain the following three types of fiber, called columns, rows, and tubes, or mode-1, -2 and -3 fibers: – Columns: j and k fixed ⇒ JK columns x•jk ∈ KI . – Rows:

i and k fixed

⇒ IK rows xi•k ∈ KJ .

– Tubes:

i and j fixed

⇒ IJ tubes

xij• ∈ KK .

The fibers defined above will also be denoted x•,j,k , xi,•,k and xi,j,• , where the dot indicates which index varies. More generally, in the case of a tensor X ∈ KI N of order N , the mode-n fiber will be denoted xi1 ,··· ,in−1 , • ,in+1 ,··· ,iN , where the bold dot (•) makes it easier to see which index varies. These fibers are illustrated in Figure 3.1 for a third-order tensor X ∈ KI×J×K , with I = J = K = 3.

Figure 3.1. Fibers of a third-order tensor

3.3.2. Matrix and tensor slices If N − 2 indices are fixed, we obtain a matrix slice. Thus, for a third-order tensor, there are three possible types of matrix slice, called horizontal, lateral and frontal when the indices i, j and k are fixed, respectively: Xi.. ∈ KJ×K , X.j. ∈ KK×I , X..k ∈ KI×J ,

[3.6]

134

Matrix and Tensor Decompositions in Signal Processing

defined as: ⎛

Xi..

xi11 ⎜ xi21 ⎜ =⎜ . ⎝ ..

xi12 xi22 .. .

··· ··· .. .

⎞ xi1K xi2K ⎟ ⎟ J×K , .. ⎟ ∈ K . ⎠

xiJ1

xiJ2

···

xiJK



X.j.

X..k

x2j1 x2j2 .. .

··· ··· .. .

x1jK x2jK ⎛ x11k x12k ⎜x21k x22k ⎜ =⎜ . .. ⎝ .. . xI1k xI2k

···

x1j1 ⎜ x1j2 ⎜ =⎜ . ⎝ ..

··· ··· .. . ···

⎞ xIj1 xIj2 ⎟ ⎟ K×I , .. ⎟ ∈ K ⎠ . xIjK ⎞ x1Jk x2Jk ⎟ ⎟ I×J . .. ⎟ ∈ K . ⎠ xIJk

These matrix slices are illustrated in Figure 3.2.

Figure 3.2. Matrix slices of a third-order tensor

[3.7]

[3.8]

[3.9]

Tensor Operations

135

Using the index convention introduced in section 2.6, the horizontal, lateral and frontal slices can be written as follows: Xi.. =

K J  

(J)

xijk ej

(K)

⊗ (ek )T = xijk ekj

[3.10]

j=1 k=1

X.j. =

I  K 

(K)

xijk ek

(I)

⊗ (ei )T = xijk eik

[3.11]

i=1 k=1

X..k =

I  J 

(I)

xijk ei

(J)

⊗ (ej )T = xijk eji .

[3.12]

i=1 j=1

In general, for a tensor X ∈ KI1 ×···×IN of order N , we obtain tensors with a reduced order N − p by fixing p indices. Thus, by fixing the mode n, we obtain a tensor slice of order N − 1, called a mode-n slice, denoted X···in ··· , of size In+1 × · · · × IN × I1 × · · · × In−1 . R EMARK 3.3.– Some authors define this mode-n slice as having the size I1 × · · · × In−1 × In+1 × · · · × IN . E XAMPLE 3.4.– For X ∈ KI×J×K , with I ! x111 x112 , X2.. = X1.. = x121 x122 ! x111 x211 , X.2. = X.1. = x112 x212 ! x111 x121 , X..2 = X..1 = x211 x221

= J = K = 2, we have: ! x211 x212 x221 x222 ! x121 x221 x122 x222 ! x112 x122 x212 x222

E XAMPLE 3.5.– For a fifth-order tensor X ∈ KI 5 , we can define matrix slices by fixing three indices, or tensor slices by fixing one or two indices, such as: Xi1 ,•,i3 ,•,i5 ∈ KI2 ×I4 ; Xi1 ,•,•,•,• ∈ KI2 ×I3 ×I4 ×I5 ; X•,•,i3 ,•,i5 ∈ KI1 ×I2 ×I4 . 3.4. Mode combination The mode combination operation plays a very important role in tensor calculus. It can be viewed as a contraction of a tensor of order N to a tensor of order N1 < N . Various contractions are possible, depending on how the modes are combined. For example, suppose that the set N  = {1, . . . , N } is partitioned into N1 disjoint

136

Matrix and Tensor Decompositions in Signal Processing

ordered subsets, with 1 ≤ N1 ≤ N − 1, with each subset Sn1 being composed of N1  p(n1 ) = N . p(n1 ) modes, with n1 ∈ N1  and n1 =1

The mode combination associated with Sn1 has dimension Jn1 =



In . n∈Sn1

Thus,

we can rewrite the tensor X ∈ KI1 ×···×IN of order N , defined in [3.2], as the tensor Y ∈ KJ1 ×···×JN1 of order N1 such that: Y=

J1  j1 =1

···

JN 1  jN1 =1

yj1 ,··· ,jN1

N1

(J

)

(J

)

◦ ejnn1 with ejnn1 =

n1 =1

1

1

(I )

⊗ einn .

n∈Sn1

[3.13]

A mode combination is also called a reshaping operation. It transforms a tensor of order N into a reduced order tensor containing the same elements. Two particular types of mode combination corresponding to the operations of matricization and vectorization will be presented in sections 3.7 and 3.9, respectively. E XAMPLE 3.6.– Case of a fourth-order tensor X ∈ KI×J×K×L . The following expressions represent combinations of four, three and two modes, corresponding to a vectorization, a matricization and a contraction to a third-order tensor, respectively, starting from the tensor X : y  xIJKL =

M 

(M ) ym em

m=1

Y  XI×JKL =

P I  

(I)

yi,p ei

◦ ep(P )

i=1 p=1

Y  XI×J×KL =

Q J  I  

(I)

yi,j,q ei

(J)

◦ ej

◦ e(Q) q ,

i=1 j=1 q=1

M = IJKL , P = JKL , Q = KL.

[3.14]

These three mode combinations give three tensors of reduced order, namely a vector y = xIJKL ∈ KM , a matrix Y = XI×JKL ∈ KI×P and a third-order tensor Y = XI×J×KL ∈ KI×J×Q , such that, for i ∈ I, j ∈ J, k ∈ K, l ∈ L, we have: ym = xi,j,k,l with m = l + (k − 1)L + (j − 1)KL + (i − 1)JKL ∈ M  yi,p = xi,j,k,l with p = l + (k − 1)L + (j − 1)KL ∈ P  yi,j,q = xi,j,k,l with q = l + (k − 1)L ∈ Q,

Tensor Operations

137

where the dimensions M , P and Q are defined by [3.14]. In the following, we will also use the notation: m  ijkl , p  jkl , q  kl,

[3.15]

and in general: i1 i2 · · · iP  iP +

P −1 

(ip − 1)

p=1

P 

Ik .

[3.16]

k=p+1

PRecall that, by convention, the order of the dimensions in a product p=1 Ip  I1 · · · IP associated with the index combination i1 i2 · · · iP follows the order of variation of the indices iP = (i1 , · · · , iP ), with i1 varying more slowly than i2 , which in turn varies more slowly than i3 , etc. Thus, for X ∈ KI P , lexicographical vectorization gives the vector y  xI1 ···IP , with element xiP at the position m = i1 i2 · · · iP of y, i.e. ym = xi1 ,··· ,iP = xiP . For example, for KIJK , the index i varies more slowly than j, which in turn varies more slowly than k. The elements of the vector y = xIJK are therefore arranged in such a way that xi,j,k is located at the position m = k + (j − 1)K + (i − 1)JK in y, so ym = xi,j,k . For I = J = K = 2, we have: y = xIJK = [x111 x112 x121 x122 x211 x212 x221 x222 ]T . R EMARK 3.7.– If we choose to vary i1 (the quickest) and iP (the slowest), the position of ym = xiP in y = xI1 ···IP becomes: m = i1 +

P  p=2

(ip − 1)

p−1 

I k  iP · · · i 2 i1 .

[3.17]

k=1

3.5. Partitioned tensors or block tensors In the same way that a vector x ∈ KI and a matrix X ∈ KI×J can be partitioned into M sub-vectors xm ∈ KIm and M N sub-matrices Xm,n ∈ KIm ×Jn , N M respectively, with m ∈ M , n ∈ N , m=1 Im = I and n=1 Jn = J, a tensor N X ∈ KI1 ×···×IN of order N > 2 can be partitioned into n=1 Mn sub-tensors (1) (N ) M (n) Xm1 ,··· ,mN ∈ KJm1 ×···×JmN of order N , with mn ∈ Mn  and mnn=1 Jmn = In , for n ∈ N . This tensor partitioning is obtained by partitioning each dimension In into Mn (n) subsets of dimensions Jmn so that each subset is associated with indices

138

Matrix and Tensor Decompositions in Signal Processing

M (n) (n) (n) jmn ∈ Jmn , and mnn=1 Jmn = In . Thus, the element (Xm1 ,··· ,mN )j (1) ,··· ,j (N ) of m1 mN the sub-tensor Xm1 ,··· ,mN corresponds to the element xi1 ,··· ,iN of the tensor X such that: m n −1 (n) (n) in = J k + jm , ∀n ∈ N . n k=1

R EMARK 3.8.– In the matrix case, the element (Xm,n )im ,jn of the sub-matrix Xm,n ∈ KIm ×Jn corresponds to the element xi,j of X ∈ KI×J such that: (Xm,n )im ,jn = xi,j with i =

m−1  k=1

I k + im , j =

n−1 

Jk + j n .

k=1

The matrix and tensor slices of a tensor introduced in section 3.3.2 are special partitionings. Thus, a third-order tensor X ∈ KI×J×K can be partitioned into K (matrix) sub-tensors of size I × J, denoted X..k with k ∈ K, corresponding to the frontal slices of X . Similarly, a fourth-order tensor X ∈ KI1 ×I2 ×I3 ×I4 can be partitioned into I1 I2 (matrix) sub-tensors of size I3 × I4 , denoted Xi1 ,i2 ,.. , with i1 ∈ I1  and i2 ∈ I2 . In general, a tensor X of order N partitioned into sub-tensors of order N will be called a block tensor, with each block having a size smaller than or equal to the size of X . Tensor blocks generalize matrix blocks to orders greater than two. Like for matrix blocks, some operations such as transposition, addition and vectorization can be performed block-wise. See section 3.10.2 for block transposition. E XAMPLE 3.9.– A cubic tensor X ∈ KI1 ×I2 ×I3 , with I1 = I2 = I3 = 6, contains 216 elements. It can, for example, be partitioned into 27 cubic sub-tensors of size (3) (1) (2) 2 × 2 × 2, by choosing M1 = M2 = M3 = 3 and Jm1 = Jm2 = Jm3 = 2, for mn ∈ 3 and n ∈ 3. The tensor X is then partitioned into sub-tensors Xm1 ,m2 ,m3 ∈ K2×2×2 , each containing eight elements. We can therefore view the tensor X with this partitioning as a cubic block tensor of size 3 × 3 × 3, where each block is itself a cubic tensor of size 2 × 2 × 2. For this partitioning, the element x2,3,4 of the tensor X is located in the sub-tensor X1,2,2 , at the position (2, 1, 2), namely the element (X1,2,2 )2,1,2 . Similarly, x4,5,6 corresponds to the element (X2,3,3 )2,1,2 . Note that the sub-tensor X2,3,3 is formed by the set of elements xi1 ,i2 ,i3 such that 3 ≤ i1 ≤ 4 and 5 ≤ i2 , i3 ≤ 6. E XAMPLE 3.10.– A tensor X ∈ K9×6×4 can be partitioned into 12 sub-tensors, with (1) (1) (1) (2) (2) (3) M1 = 3, M2 = M3 = 2, J1 = J2 = 4, J3 = 1, J1 = J2 = 3, J1 = (3) J2 = 2. The element x7,2,3 is located at the position (3, 2, 1) of the sub-tensor

Tensor Operations

139

X2,1,2 , i.e. x7,2,3 = (X2,1,2 )3,2,1 , where the sub-tensor X2,1,2 consists of the set of (1) (1) (1) (2) elements xi1 ,i2 ,i3 , such that i1 ∈ {J1 + 1, · · · , J1 + J2 }, i2 ∈ {1, · · · , J1 } (3) (3) (3) and i3 ∈ {J1 + 1, · · · , J1 + J2 }, i.e. {xi1 ,i2 ,i3 ; 5 ≤ i1 ≤ 8, 1 ≤ i2 ≤ 3, and 3 ≤ i3 ≤ 4}. Block tensors can also be obtained by concatenating column block or row block tensors. Thus, if we consider the tensors A ∈ KI P ×J N , I P ×K N M P ×J N B∈K ,C ∈ K , we can define column block and row block tensors as follows:   X = A ... B ∈ KI P ×LN with Ln = Jn + Kn , n ∈ N  [3.18] 

xiP ,lN = and

aiP ,j N biP ,kN

for ln = jn ∈ Jn  , n ∈ N  for ln = Jn + kn ∈ {Jn + 1, · · · Jn + Kn } , n ∈ N 



⎤ A Y = ⎣ · · · ⎦ ∈ KLP ×J N with Lp = Ip + Mp , p ∈ P  [3.19] C 3 aiP ,j for lp = ip ∈ Ip  , p ∈ P  N ylP ,j = N for lp = Ip + kp ∈ {Ip + 1, · · · Ip + Kp } , p ∈ P  bkP ,j N

3.6. Diagonal tensors Below, we present several types of diagonal and partially diagonal tensors according to which class they belong to. 3.6.1. Case of a tensor X ∈ K[N ;I] Recall that a hypercubic tensor X ∈ K[N ;I] of order N and dimensions I is diagonal if xi1 ,i2 ··· ,iN = 0 only holds when i1 = i2 = · · · = iN . We can also define partially diagonal tensors, i.e. tensors that are diagonal with respect to two or more modes (Rezghi and Elder 2011). Let T = {in1 , · · · , inT } be a subset of {i1 , · · · , iN }, with T ≤ N , and In1 = · · · = InT . A tensor X ∈ KI1 ×···×IN is said to be diagonal with respect to the set of modes associated with T if xi1 ,··· ,iN

= 0 only holds when in1 = in2 = · · · = inT . For example, a hypercubic tensor X ∈ K[N ;I] of order N and dimensions I is said to be row-diagonal if it is diagonal with respect to its N −1 last modes (Liu et al. 2018).

140

Matrix and Tensor Decompositions in Signal Processing

This means that every tensor slice Xi1 ... obtained by fixing the first mode (i1 ∈ I) is diagonal, i.e. the coefficients xi1 ,i2 ,··· ,iN are only non-zero if i2 = · · · = iN . The tensor X ∈ KI1 ×···×IN is diagonal with respect to two disjoint subsets T = {in1 , · · · , inT } and Q = {ip1 , · · · , ipQ }, with In1 = · · · = InT and Ip1 = · · · = IpQ , if xi1 ,··· ,iN = 0 only holds when in1 = · · · = inT and ip1 = · · · = ipQ . E XAMPLE 3.11.– A third-order tensor X ∈ KI×J×K , with J = K, is diagonal with respect to its last two modes if xijk = 0 , ∀j = k. The horizontal slices Xi.. , for i ∈ I, are then diagonal matrices, and the matrix unfolding XJ×IK = [X1.. · · · XI.. ] is a column block matrix, whose column blocks are diagonal matrices. Similarly, in the case where I = J, the tensor X ∈ KI×I×K is said to be diagonal with respect to the first two modes if the frontal slices X..k , for k ∈ K, are diagonal matrices, i.e. if xijk = 0, ∀i = j. E XAMPLE 3.12.– For a fourth-order tensor X ∈ KI×J×K×L , with I = J and K = L, we say that it is diagonal with respect to the two index subsets {i, j} and {k, l} if xijkl = 0 only holds when i = j and k = l. Note that, in this case, the matrix unfolding X{1,3};{2,4} = XIK×JL is a diagonal matrix. 3.6.2. Case of a square tensor In the case of a square tensor X ∈ KI N ×I N of order 2N , we say that it is block diagonal if its elements satisfy (Brazell et al. 2013):  if in = jn , ∀n ∈ N  αiN , j N xiN , j = [3.20] N 0 otherwise where the αiN , j are arbitrary scalars. This means that every element of X is zero N except the elements xi1 ,··· ,iN ,i1 ,··· ,iN , with in ∈ In for n ∈ N . E XAMPLE 3.13.– For a tensor X ∈ K3×3×3×3 , which is a fourth-order hypercubic tensor, we say that X is block diagonal if the only non-zero coefficients are as follows: x1111 , x1212 , x1313 , x2121 , x2222 , x2323 , x3131 , x3232 , x3333 .

[3.21]

Note that X is diagonal only if the non-zero coefficients are: x1111 , x2222 , and x3333 . We can therefore conclude that any diagonal tensor is also block diagonal. It is clear that the converse is not true.

Tensor Operations

141

We define the identity block tensor of order 2N , denoted J2N ∈ RI N ×I N , as the tensor whose elements satisfy: (J )iN , j = N

namely:

N 

δin ,jn , in , jn ∈ In ,

[3.22]

n=1



(J )iN , j = N

1 if in = jn , ∀n ∈ N  . 0 otherwise

[3.23]

Note that the matrix unfolding JI1 ···IN ×J1 ···JN , with In = Jn for n ∈ N , of this N identity block tensor of order 2N is the identity matrix IN of order I n=1 In . In n n=1 other words: N JN = IN . n=1 In × n=1 In n=1 In

[3.24]

Indeed, the diagonal elements (J )iN , iN of this unfolding are equal to 1, whereas the other elements (J )iN , j , where in = jn for at least one index n ∈ N , are zero. N

E XAMPLE 3.14.– When N = 2 and In = 3, for n ∈ 2, the unfolding corresponding to the identity block tensor J4 is given by: (J )I1 I2 ×J1 J2 = diag(j1111 , j1212 , j1313 , j2121 , j2222 , j2323 , j3131 , j3232 , j3333 ) = I9 , with ji1 ,i2 ,i1 ,i2 = 1 for i1 , i2 ∈ 3. 3.6.3. Case of a rectangular tensor Analogously to the case of a rectangular matrix, a rectangular tensor X ∈ KI P ×J N is said to be pseudo-diagonal if (Behera et al. 2019): xiP ,j = 0 for i1 · · · iP = j1 · · · jN , N

[3.25]

where i1 · · · iP and j1 · · · jN are defined as in [3.16]. 3.7. Matricization This section presents the operation of matricization of a tensor, which plays a very important role in the processing of data tensors. In Chapter 5, this operation will be detailed for various tensor decompositions. First, we will define the various matricized forms of a third-order tensor. We will then use the index convention in an original way to prove an expression for these matrix unfoldings in terms of matrix slices, as well as to establish compact and general expressions for the matricization of a tensor of order N and for tensor matricization by index blocks.

142

Matrix and Tensor Decompositions in Signal Processing

3.7.1. Matricization of a third-order tensor Different matricizations can be defined by stacking the matrix slices. Thus, for a third-order tensor X ∈ KI×J×K , there are six flat unfoldings, denoted XI×JK , XI×KJ , XJ×KI , XJ×IK , XK×IJ , XK×JI , and six tall unfoldings XJK×I , XKJ×I , XKI×J , XIK×J , XIJ×K , XJI×K , which are transposes of the flat unfoldings. For example, by stacking the mode-1, mode-2 and mode-3 matrix slices defined in [3.7]–[3.9] as column blocks, we obtain the flat unfoldings: XI×KJ = [X..1 · · · X..K ] ∈ KI×KJ

[3.26]

XJ×IK = [X1.. · · · XI.. ] ∈ KJ×IK

[3.27]

XK×JI = [X.1. · · · X.J. ] ∈ K

[3.28]

K×JI

or alternatively:

XI×JK = XT.1. · · · XT.J. ∈ KI×JK XJ×KI = XT..1 · · · XT..K ∈ KJ×KI XK×IJ = XT1.. · · · XTI.. ∈ KK×IJ .

[3.29] [3.30] [3.31]

This gives us matrices partitioned into column blocks. R EMARK 3.15.– Writing the dimensions as a subscript allows us to distinguish between mode combinations more easily. Noting that X..k = [x.1k · · · x.Jk ] ∈ KI×J , with x.jk ∈ KI , for j ∈ J and k ∈ K, the matrix XI×KJ defined in [3.26] can also be partitioned into KJ columns: XI×KJ = x.11 · · · x.J1 x.12 · · · x.J2 · · · x.1K · · · x.JK . E XAMPLE 3.16.– For I = J = K x111 x121 XI×KJ = x211 x221 x111 x112 XI×JK = x211 x212 x111 x211 XJ×KI = x121 x221 x111 x121 XK×IJ = x112 x122

= 2, we have:

x112 x122 ∈ K2×4 x212 x222

x121 x122 ∈ K2×4 x221 x222

x112 x212 ∈ K2×4 x122 x222

x211 x221 ∈ K2×4 . x212 x222

Tensor Operations

143

Similarly, we can define vertical stacks of the matrix slices [3.7]–[3.9] as matrices partitioned into row blocks such that: ⎡ ⎤ X.1. ⎢ ⎥ XJK×I = ⎣ ... ⎦ = XTI×JK [3.32]

XKI×J

X.J.

⎤ X..1 ⎥ ⎢ = ⎣ ... ⎦ = XTJ×KI X..K ⎡

[3.33]

and ⎡

XIJ×K

⎤ X1.. ⎢ ⎥ = ⎣ ... ⎦ = XTK×IJ , XI..

[3.34]

where XI×JK , XJ×KI and XK×IJ are defined in [3.29]–[3.31]. This implies, for example: xijk = [XI×KJ ]i,(k−1)J+j = [XJ×IK ]j,(i−1)K+k = [XK×JI ]k,(j−1)I+i . 3.7.2. Matrix unfoldings and index convention We can use the index convention to express matrix unfoldings in terms of the matrix slices [3.7]–[3.9]. Thus, for example, using the expression [3.12] of X..k , we have: k j XI×KJ = xijk ekj i = xijk (ei ⊗ e ⊗ e )

[3.35]

= ek ⊗ (xijk eji ) = ek ⊗ X..k = X..1 · · · X..K ∈ KI×KJ .

[3.36] [3.37]

Thus, we recover the unfolding [3.26]. Similarly, noting that vec(Xi.. ) = xijk ekj and vecT (Xi.. ) = xijk ekj by [2.105], the unfolding XI×KJ can also be written as: ⎤ ⎡ vecT (X1.. ) ⎥ ⎢ .. I×KJ . XI×KJ = ei ⊗ (xijk ekj ) = ei ⊗ vecT (Xi.. ) = ⎣ ⎦∈K . vecT (XI.. )

144

Matrix and Tensor Decompositions in Signal Processing

Likewise, using the equations [3.10]–[3.12], we obtain: ⎤ X.1. ⎥ ⎢ = xijk eijk = ej ⊗ (xijk eik ) = ej ⊗ X.j. = ⎣ ... ⎦ X.J. ⎤ ⎡ X..1 ⎥ ⎢ = xijk ejki = ek ⊗ (xijk eji ) = ek ⊗ X..k = ⎣ ... ⎦ ⎡

XJK×I

XKI×J

X..K ⎤ X1.. ⎥ ⎢ = ⎣ ... ⎦ . XI.. ⎡

XIJ×K = xijk ekij = ei ⊗ (xijk ekj ) = ei ⊗ Xi..

Thus, we recover the unfoldings [3.32]–[3.34]. 3.7.3. Matricization of a tensor of order N Let us consider a tensor X ∈ KI N of order N , together with a partitioning of the set of modes {1, · · · , N } into two disjoint ordered subsets S1 and S2 , composed of p and N − p modes, respectively, with p ∈ N − 1. A general matrix unfolding formula was given by Favier and de Almeida (2014a) as follows:

XS1 ;S2 =

I1  i1 =1

···

IN 

xi1 ,··· ,iN

iN =1



n∈S1

(I ) einn

! ⊗

n∈S2

(I ) einn

!T

∈ KJ1 ×J2 , [3.38]

where Jn1 =



In , n∈Sn1

for n1 = 1 and 2. We say that XS1 ;S2 is a matrix unfolding of X

along the modes of S1 for the rows and along the modes of S2 for the columns, with S1 ∩ S2 = ∅ and S1 ∪ S2 = N . Using the index convention, the unfolding XS1 ;S2 defined in [3.38] can be written concisely as: XS1 ;S2 = xi1 ,··· ,iN eII21 ∈ KJ1 ×J2 ,

[3.39]

where I1 and I2 represent the index combinations associated with the mode subsets S1 and S2 , respectively.

Tensor Operations

145

From this unfolded form XS1 ;S2 , we can deduce the following expression for the element xiN : xiN = xi1 ,··· ,iN =

(I )

!T

⊗ einn

n∈S1

XS1 ;S2

(I )

!

⊗ einn

n∈S2

= eI1 XS1 ;S2 eI2 .

[3.40]

In the case of a tensor X ∈ KI N of order N , we define two particular matrix unfoldings: the flat mode-n unfolding, denoted Xn , and the unfolding Xn corresponding to S1 = {1, · · · , n}, S2 = {n + 1, · · · , N }, such that: Xn = XIn ×In+1 ···IN I1 ···In−1

[3.41]

Xn = X{1,··· ,n};{n+1,··· ,N } = XI1 ···In ×In+1 ···IN .

[3.42]

R EMARK 3.17.– The element xiN  xi1 ,··· ,iN is placed at position (in , j) in Xn , and at (m, l) in Xn , with: j = in−1 + (in−2 − 1)In−1 + · · · + (i1 − 1)

n−1 

Ik + (iN − 1)

k=2

+ (in+1 − 1)

N 

n−1 

Ik + · · ·

k=1

Ik

k=1,k=n

m = in + (in−1 − 1)In + · · · + (i1 − 1)

n 

I k = i 1 i2 · · · in

k=2

l = iN + (iN −1 − 1)IN + · · · + (in+1 − 1)

N 

Ik = in+1 in+2 · · · iN .

k=n+2

R EMARK 3.18.– Some authors define the flat mode-n unfolding as having size In × I1 · · · In−1 In+1 · · · IN , which is equivalent to combining the modes 1, · · · , n − 1, n + 1, · · · , N without reversing the order. R EMARK 3.19.– In the case of a tensor X ∈ KI×J×K , the unfoldings [3.29]–[3.31] correspond to: XI×JK = X1 , XJ×KI = X2 , XK×IJ = X3 . For a fourth-order tensor X ∈ KI×J×K×L , the matrix XIJ×KL can be written as a column block matrix with KL columns of size IJ. Indeed, using the index convention, the unfolding XIJ×KL can be written as: kl kl XIJ×KL = xi,j,k,l ekl ij = e ⊗ xi,j,k,l eij = e ⊗ y.kl

146

Matrix and Tensor Decompositions in Signal Processing

where y.kl ∈ KIJ , k ∈ K, l ∈ L is the mode-1 fiber of the third-order tensor Y ∈ KM ×K×L , with M = IJ, obtained by combining the first two modes of X , satisfying ym,k,l = xi,j,k,l , with m = j + (i − 1)J. Thus, we obtain: XIJ×KL = y.11 · · · y.1L y.21 · · · y.2L · · · y.K1 · · · y.KL ∈ KIJ×KL , with y.kl = [x1,1,k,l · · · x1,J,k,l · · · xI,1,k,l · · · xI,J,k,l ]T , which gives: ⎡ x1111 · · · x111L · · · x11K1 · · · x11KL ⎢ .. .. .. .. ⎢ . ··· . ··· . ··· . ⎢ ⎢ x1J11 · · · x1J1L · · · x1JK1 · · · x1JKL ⎢ ⎢ .. .. .. .. XIJ×KL = ⎢ . ··· . ··· . ··· . ⎢ ⎢ xI111 · · · xI11L · · · xI1K1 · · · xI1KL ⎢ ⎢ .. .. .. .. ⎣ . ··· . ··· . ··· . xIJ11 Similarly, by writing:

···

xIJ1L

···

xIJK1

···

⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥. ⎥ ⎥ ⎥ ⎥ ⎦

xIJKL

T XIJ×KL = eij ⊗ xi,j,k,l ekl = eij ⊗ vij. T the unfolding XIJ×KL can be decomposed into IJ row blocks vij. ∈ K1×KL , for i ∈ I and j ∈ J, where vij. is the mode-3 fiber of the tensor V ∈ KI×J×N , with N = KL, obtained by combining the last two modes of X , satisfying: T vij. = xi,j,1,1 · · · xi,j,1,L · · · xi,j,K,1 · · · xi,j,K,L . .

E XAMPLE 3.20.– For I = J = K = L = 2, we have: ⎡

⎤ x11kl ⎢ x12kl ⎥ ⎥ XIJ×KL = [y.11 y.12 y.21 y.22 ] with y.kl = ⎢ ⎣ x21kl ⎦ , x22kl ⎡ ⎤ x1111 x1112 x1121 x1122 ⎢ x1211 x1212 x1221 x1222 ⎥ 4×4 ⎥ which gives: XIJ×KL = ⎢ ⎣ x2111 x2112 x2121 x2122 ⎦ ∈ K . x2211 x2212 x2221 x2222 Similarly, we have: ⎡ T v11. T ⎢ v12. ⎢ XIJ×KL = ⎣ T v21. T v22.

⎤ ⎥ T ⎥ with vij. = xi,j,1,1 ⎦

xi,j,1,2

xi,j,2,1

xi,j,2,2



.

Tensor Operations

147

Writing Y = XIJ×KL ∈ KM ×N , with M = IJ and N = KL, the element xi,j,k,l of the tensor X is located at the position (m, n) of Y, such that: ym,n = xi,j,k,l with m = j + (i − 1)J and n = l + (k − 1)L. Table 3.5 summarizes the main notation used. Notation X ∈

Definitions

KI N

Tensor of order N , size I1 × · · · , ×IN

xiN  xi1 ,··· ,iN = (X )i1 ,··· ,iN

Element (i1 , · · · , iN ) of X

Xi1 ,··· ,in−1 ,•,in+1 ,··· ,iN

Mode-n fiber of size In

X···in ···

Mode-n tensor slice, of size In+1 × · · · × IN × I1 × · · · × In−1 .

Xn  XIn ×In+1 ···IN I1 ···In−1

Mode-n flat matrix unfolding of size In × In+1 · · · IN I1 · · · In−1

XS1 ;S2

Matrix unfolding with respect to the mode combinations S1 and S2

Xn  X{1,··· ,n};{n+1,··· ,N }

Unfolding with respect to the mode combinations {1, · · · , n} and {n + 1, · · · , N } for the rows and columns of size I1 · · · In × In+1 · · · IN

Table 3.5. Notation for tensors

3.7.4. Tensor matricization by index blocks In section 3.13.5, we will use an important special case of matrix unfolding of a tensor X ∈ KI P ×J N with the Einstein product, namely the unfolding XI×J  XI1 ···IP ×J1 ···JN , with I = I1 · · · IP and J = J1 · · · JN , corresponding to a combination of the P first and N last modes of X for the rows and columns of XI×J , respectively. Using the definition [3.16], the element xiP , j is located in XI×J at position N (i1 · · · iP , j1 · · · jN ), with: j1 · · · jN  j N +

N −1 

(jn − 1)

n=1

N 

Jk .

[3.43]

k=n+1

From the expression [2.68] of a matrix using the index convention, the matrix unfolding XI×J can be written as: ···jN XI×J = xiP , j eji 1···i , N

1

P

[3.44]

148

Matrix and Tensor Decompositions in Signal Processing

with: P

(I )

N

(J )

···jN = ⊗ eipp ( ⊗ ejnn )T . eji 1···i 1

P

p=1

[3.45]

n=1

R EMARK 3.21.– In the special case of a hypercubic tensor X ∈ KI P ×I P with Ip = I P P for p ∈ P , the unfolding X  XI1 ···IP ×I1 ···IP ∈ KI ×I satisfies:   = xiP , j , [3.46] X P

i1 ···iP ,j1 ···jP

where ip , jp ∈ I, i1 · · · iP , and j1 · · · jP are defined similarly as: j1 · · · jP  jP + (jP −1 − 1)I + · · · + (j2 − 1)I P −2 + (j1 − 1)I P −1 . [3.47] Table 3.6 summarizes the matricizations for three particular classes of tensors, whose index sets are divided into either two blocks of the same or different dimensions or into index pairs, i.e. the sets KI P ×I P , KI P ×J N and KI1 ×J1 ×···×IP ×JP , respectively. Space of X

Unfolding XI×I /XI×J

KI P ×I P

(XI×I )iP

KI P ×J N

(XI×J )iP

KI1 ×J1 ×···×IP ×JP (XI×J )iP

,j

N

P

= x iP

,j

N

= x iP

,j

,j ,j

P N

= xi1 ,j1 ,··· ,iP ,jP

Space of X KI×I KI×J KI×J

Dimensions I ; J  I= P p=1 Ip P  I = p=1 Ip ; J = N n=1 Jn P N I = p=1 Ip ; J = n=1 Jn

Table 3.6. Matricization by index blocks for different sets of tensors

3.8. Subspaces associated with a tensor and multilinear rank For a tensor X ∈ KI×J×K , the column vectors of the unfoldings XI×JK , XJ×KI and XK×IJ are called the mode-1, mode-2 and mode-3 vectors of X , respectively. They span three linear spaces whose dimensions are called the mode-1, mode-2 and mode-3 ranks, respectively, denoted R1 = r(XI×JK ) = r1 (X ), R2 = r(XJ×KI ) = r2 (X ) and R3 = r(XK×IJ ) = r3 (X ). The triplet (R1 , R2 , R3 ) is called the trilinear rank of X . It satisfies R1 ≤ I, R2 ≤ J and R3 ≤ K. In the case of a tensor X ∈ KI N , the column vectors of the matricized form Xn are the mode-n vectors, and Rn = r(Xn ) = rn (X ) ≤ In is the mode-n rank of X . The N -tuple (R1 , ..., RN ) is called the multilinear (or N -linear) rank of X . The multilinear rank generalizes the row and column ranks of a matrix to the case of tensors of order greater than two. Note that, for a matrix, the row and column ranks are both equal to the rank of the matrix. However, this does not necessarily hold for N > 2, and the modal ranks Rn , n ∈ N  may differ, unless X is a symmetric tensor.

Tensor Operations

149

3.9. Vectorization In this section, we present a general formula for the vectorization of a tensor of arbitrary order using the index convention. This formula extends the vectorization results presented in Chapter 2 for matrices, to the case of tensors of order greater than two. The inverse operation of vectorization, denoted unvec, transforms a vectorized form into the original tensor. 3.9.1. Vectorization of a tensor of order N By [3.2], using the index convention, a tensor X = [xi1 ,··· ,iN ] of order N can be written as follows with respect to the canonical basis: N

(I )

X = xi1 ,··· ,iN ◦ einn .

[3.48]

n=1

There are N ! vectorizations in the form of a column vector, each associated with a different mode combination. Thus, for lexicographical order, the vectorization is given as: IN I1   (I ) (I ) (I ) xI1 I2 ···IN = ··· xi1 ,··· ,iN (ei1 1 ⊗ ei2 2 ⊗ · · · ⊗ eiNN ) ∈ KI1 I2 ···IN , i1 =1

iN =1

or alternatively, using the index convention: N

(I )

(I ···I )

xI1 ···IN = xi1 ,··· ,iN ⊗ einn = xi1 ,··· ,iN ei1 1···iNN . n=1

[3.49]

By comparing [3.48] with [3.49], we conclude that the vectorization of a tensor of order N can be translated into (N − 1) Kronecker products of the N basis vectors. In general, the vectorization of an outer product of N vectors is obtained by replacing the outer products with the Kronecker products of the vectors. If we consider the set S = {n1 , . . . , nN } associated with a permutation of the modes {1, · · · , N }, the vectorization of X according to this new ordering of the modes is: (In1 In2 ···InN ) 1 in2 ···inN

xIn1 In2 ···InN = xi1 ,··· ,iN ein

∈ KIn1 In2 ···InN .

[3.50]

The transformation from the lexicographical vectorization to the vectorization associated with the set S is given by the following equation, corresponding to a permutation of the rows of xI1 I2 ···IN : xIn1 In2 ···InN = (eii1n ⊗ eii2n ⊗ · · · ⊗ eiiN )xI1 I2 ···IN . n 1

2

N

[3.51]

150

Matrix and Tensor Decompositions in Signal Processing

Note that, since (in1 , · · · , inN ) is a permutation of (i1 , · · · , iN ), the formula [3.51] does indeed involve a repetition of each index, which implies that the index convention is being used. The 1 of each column n of the permutation matrix is obtained by computing eii1n ⊗ eii2n ⊗ · · · ⊗ eiiN for the N -tuple of indices nN 1 2 (i1 , · · · , iN ) of the element xi1 ,··· ,iN , located on the nth row of xI1 I2 ···IN . The next section gives an example for a third-order tensor. 3.9.2. Vectorization of a third-order tensor Given three vectors u ∈ KI , v ∈ KJ , w ∈ KK , their outer product defines a rank-one, third-order tensor such that: X = u ◦ v ◦ w ∈ KI×J×K ⇔ xijk = ui vj wk .

[3.52]

The lexicographical vectorization is given by: xIJK = ui vj wk eijk ∈ KIJK .

[3.53]

E XAMPLE 3.22.– For I = J = K = 2, we have: ij k xJIK = ui vj wk ejik = (eij ji ⊗ ek )xIJK = (eji ⊗ IK )xIJK ,

[3.54]

with the permutation matrix eij ji , for I = J = 2, defined in [2.167], which gives: ⎤ ⎡ ⎡ ⎤ I2 0 0 0 1 0 0 0

⎢ 0 0 I2 0 ⎥ ⎢ 0 0 1 0 ⎥ 1 0 ⎥ ⎢ ⎥ =⎢ eij ji ⊗ I2 = ⎣ 0 1 0 0 ⎦ ⊗ ⎣ 0 I2 0 0 ⎦ , 0 1 0 0 0 1 0 0 0 I2 where 0 = 02×2 is the zero square matrix of order two. The vectorization xJIK can therefore also be deduced from xIJK using the following equation: ⎤⎡ ⎤ ⎡ ⎡ ⎤ u1 v 1 w 1 1 0 0 0 0 0 0 0 u1 v 1 w 1 ⎢ u1 v 1 w 2 ⎥ ⎢ 0 1 0 0 0 0 0 0 ⎥ ⎢ u1 v 1 w 2 ⎥ ⎥⎢ ⎥ ⎢ ⎢ ⎥ ⎢ u2 v 1 w 1 ⎥ ⎢ 0 0 0 0 1 0 0 0 ⎥ ⎢ u1 v 2 w 1 ⎥ ⎥⎢ ⎥ ⎢ ⎢ ⎥ ⎢ u2 v 1 w 2 ⎥ ⎢ 0 0 0 0 0 1 0 0 ⎥ ⎢ u1 v 2 w 2 ⎥ ⎥⎢ ⎥=⎢ ⎥ xJIK = ⎢ ⎢ u1 v 2 w 1 ⎥ ⎢ 0 0 1 0 0 0 0 0 ⎥ ⎢ u2 v 1 w 1 ⎥ . ⎥⎢ ⎥ ⎢ ⎢ ⎥ ⎢ u1 v 2 w 2 ⎥ ⎢ 0 0 0 1 0 0 0 0 ⎥ ⎢ u2 v 1 w 2 ⎥ ⎥⎢ ⎥ ⎢ ⎢ ⎥ ⎣ u2 v 2 w 1 ⎦ ⎣ 0 0 0 0 0 0 1 0 ⎦ ⎣ u2 v 2 w 1 ⎦ 0 0 0 0 0 0 0 1 u2 v 2 w 2 u2 v 2 w 2 It can be checked that the 1 of the fourth column, associated with the element u1 v2 w2 of the vector xIJK , and hence with the triplet (i, j, k) = (1, 2, 2), is obtained by computing: ⎡ ⎤ 0

⎢ 0 ⎥ 0 0 12 2 ⎢ ⎥ 0 1 0 0 ⊗ , e21 ⊗ e2 = ⎣ 1 ⎦ 0 1 0

Tensor Operations

which gives:



⎢ ⎢ ⎢ ⎢ ⎢ 2 ⎢ ⊗ e = e12 21 2 ⎢ ⎢ ⎢ ⎢ ⎣

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 1 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

151

⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥. ⎥ ⎥ ⎥ ⎥ ⎦

E XAMPLE 3.23.– To illustrate the vectorization of a third-order tensor, consider the tensor X ∈ K2×2×2 , corresponding to I = J = K = 2. We have: xIJK = xIKJ = xKIJ = xJKI =



x111

x112

x121

x122

x211

x212

x221

x222

x111

x121

x112

x122

x211

x221

x212

x222

x111

x121

x211

x221

x112

x122

x212

x222

x111

x211

x112

x212

x121

x221

x122

x222

T T T T

.

3.10. Transposition 3.10.1. Definition of a transpose tensor The transposition of a tensor X ∈ KI N of order N amounts to reordering the indices (i1 , · · · , iN ), i.e. performing a permutation π of the modes (1, · · · , N ). The transpose tensors of X can therefore be written as: yij1 ,··· ,ijN = xi1 ,··· ,iN with jn = π(n), n ∈ N ,

[3.55]

where the set (j1 , · · · , jN ) is obtained by applying the permutation π ∈ SN to the set N , where SN is the symmetric permutation group of order N . The transpose tensor Y, denoted X T,π , therefore has the entries yiπ(1) ,··· ,iπ(N ) and the size Iπ(1) × · · · × Iπ(N ) . Using the notations from Tables 3.1 and 3.2, equation [3.55] can also be written as: yiπ(N ) = xiN ,

[3.56]

with yiπ(N ) = yiπ(1) ,··· ,iπ(N ) and xiN = xi1 ,··· ,iN , and the size of the tensor Y is I π(N )  Iπ(1) × · · · × Iπ(N ) .

152

Matrix and Tensor Decompositions in Signal Processing

For a tensor of order N , there are N ! − 1 transpose tensors. Thus, for N = 3, there are five transpose tensors of X ∈ KI×J×K , of sizes I × K × J, J × I × K, J × K × I, K × I × J, and K × J × I. We can distinguish between total transposition and partial transposition. A total transposition corresponds to a permutation π, such that every index in is no longer in its original position after applying the permutation, i.e. π(n) = n, ∀n ∈ N . Thus, in the case of a third-order tensor, there are two total transpositions, corresponding to the transpose tensors of size J × K × I and K × I × J. The three other transpose tensors, of size I × K × J, K × J × I, and J × I × K, correspond to partial transpositions associated with permutations of only two modes. 3.10.2. Properties of transpose tensors The transposition operation satisfies the following properties (Brazell et al. 2013; Pan 2014): – Composition of two transpositions associated with the mode permutations π1 and π2 is such that: (X T,π1 )T,π2 = X T,π2 π1 ,

[3.57]

where π2 π1 = π2 ◦ π1 represents the composition of the two permutations. In particular, denoting by π −1 the inverse of the permutation π, we have −1 (X T,π )T,π = X . – Given two tensors A and B of the same size, we have: AT,π , B T,π  = A, B,

[3.58]

in other words, the inner product of two tensors of the same size is preserved if the two tensors are transposed using the same permutation of modes (see section 3.16 for the definition of the inner product). – For a tensor A ∈ KI N of order N , a sequence of N transpositions associated with a cyclic permutation (π1 , π2 , · · · , πN −1 , πN ) of the indices (i1 , · · · , iN ), of length N , leaves the tensor unchanged. Indeed, since the composition of the permutations π = πN ◦ πN −1 ◦ · · · ◦ π2 ◦ π1 is the identity mapping, i.e. the mapping that transforms the N -tuple (i1 , · · · , iN ) into itself, we have: 

T,πN   · · · (AT,π1 )T,π2 · · · = A.

[3.59]

Tensor Operations

153

E XAMPLE 3.24.– For N = 3, with A ∈ KI×J×K , there are two cyclic permutations of length three, which satisfy: π1 : (i, j, k) → (j, k, i) ; π2 : (j, k, i) → (k, i, j) ; π3 : (k, i, j) → (i, j, k) π1 : (i, k, j) → (k, j, i) ; π2 : (k, j, i) → (j, i, k) ; π3 : (j, i, k) → (i, k, j) T,π3  = AT,π3 π2 π1 = A. and we therefore have: (AT,π1 )T,π2 – In the matrix case, the properties [3.58] and [3.59] become: AT , BT  = A, B , (AT )T = A.

[3.60]

– The transpose of a tensor X of order N can be obtained from its vectorization using the following formula: ···iN vec(X T,π ) = eii1π(1) ···iπ(N ) vec(X ),

[3.61]

···iN where eii1π(1) ···iπ(N ) is the permutation matrix associated with the permutation π of the set N  that transforms the indices (i1 , · · · , iN ) of the tensor X into the indices (iπ(1) , · · · , iπ(N ) ) of the transpose tensor X T,π . See section 2.7.4 for this type of permutation.

E XAMPLE 3.25.– Consider the tensor X ∈ KI×J×K×L and the permutation π that transforms the quadruple {i, k, j, l} into the quadruple {i, j, k, l}. The formula [3.61] that allows us to compute the transpose tensor Y = X T,π such that yikjl = xijkl is then given by: vec(X T,π ) = (II ⊗ ejk kj ⊗ IL )vec(X ), where the permutation matrix ejk kj is defined as in [2.167]. – In the case of a tensor X ∈ RI P ×J N of order P +N , we can define two particular types of partial block-wise transposition/transconjugation with respect to the P first and the N last modes, depending on whether or not the indices associated with each block are permuted using some permutations π and σ, respectively. The corresponding transpose tensors, denoted X T and X T,π,σ , are given in Table 3.7, with the following notation for the index sets after permutation, and the associated dimensions: iπ(P ) = {iπ(1) , · · · , iπ(P ) } , jσ(N ) = {jσ(1) , · · · , jσ(N ) }

[3.62]

Iπ(P ) = {Iπ(1) , · · · , Iπ(P ) } , Jσ(N ) = {Jσ(1) , · · · , Jσ(N ) }

[3.63]

I π(P ) = Iπ(1) × · · · × Iπ(P ) , J σ(N ) = Jσ(1) × · · · × Jσ(P ) .

[3.64]

The block-wise transposition of even-order tensors characterized by two mode sets of same dimension I P and belonging to the space KI P ×I P is also defined. This tensor space will be considered in section 3.13.5, in connection with the Einstein product.

154

Matrix and Tensor Decompositions in Signal Processing

Properties

Notation

Conditions

Dimensions

X ∈ KI P ×J N Block transpose Block transconjugate Block transpose with permutations

Y=

XT

Y=

XH

Y = X T,π,σ

IP × JN yj

N

yj

N

yj

σ(N )

, iP

= xiP

,j

=

, iπ(P )

= x iP

, iP

JN × IP

N

x∗i , j P

JN × IP

N

,j

N

X ∈ KI P ×I P Block transpose Block transconjugate Block transpose with permutations

Y=

XT

Y=

XH

Y = X T,π

J σ(N ) × I π(P ) IP × IP

yj yj yj

P P

π(P )

, iP

= x iP , j

P

, iP

x∗i , j P

P

=

, iπ(P )

= x iP , j

IP × IP IP × IP P

I π(P ) × I π(P )

Table 3.7. Block-wise transposition of X ∈ KI P ×J N and X ∈ KI P ×I P

E XAMPLE 3.26.– Let X ∈ CI1 ×I2 ×J1 ×J2 , with I1 = 3 and I2 block-wise transconjugate tensor Y = X H ∈ CJ1 ×J2 ×I1 ×I2 YJ1 J2 ×I1 I2 is given by: ⎡ ∗ x1111 x∗1211 x∗2111 x∗2211 x∗3111 ⎢ x∗1112 x∗1212 x∗2112 x∗2212 x∗3112 YJ1 J2 ×I1 I2 = ⎢ ⎣ x∗1121 x∗1221 x∗2121 x∗2221 x∗3121 x∗1122 x∗1222 x∗2122 x∗2222 x∗3122

= J1 = J2 = 2. The in the unfolded form ⎤ x∗3211 x∗3212 ⎥ ⎥. x∗3221 ⎦ x∗3222

If we choose the permutations π and σ to be identical and such that (1, 2) → (2, 1), the block-wise transpose tensor with permutations Y = X T,π,σ is given by: ⎡ ⎤ x1111 x2111 x3111 x1211 x2211 x3211 ⎢ x1121 x2121 x3121 x1221 x2221 x3221 ⎥ ⎥ YJ2 J1 ×I2 I1 = ⎢ [3.65] ⎣ x1112 x2112 x3112 x1212 x2212 x3212 ⎦ . x1122 x2122 x3122 x1222 x2222 x3222 P ROPOSITION 3.27.– The block-wise transpose tensor with permutations Y = X T,π,σ satisfies the following relation with the tensor X in terms of matrix unfoldings: YJσ(1) ···Jσ(N ) ×Iπ(1) ···Iπ(P ) = Pσ (XI×J )T Pπ ,

[3.66]

P N where I = p=1 Ip , J = n=1 Jn , and Pσ and Pπ are permutation matrices defined as follows (see section 2.7 for commutation matrices): ···jN Pσ = ejj1σ(1) ···jσ(N ) i

···iπ(P )

Pπ = eiπ(1) 1 ···iP

[3.67] [3.68]

Tensor Operations

155

E XAMPLE 3.28.– For the tensor X ∈ CI1 ×I2 ×J1 ×J2 , with I1 = 3 and I2 = J1 = J2 = 2 from the previous example, the unfolded form [3.65] can be recovered using the formulae [3.66]–[3.68], with: ⎡ ⎤ x1111 x1112 x1121 x1122 ⎢ x1211 x1212 x1221 x1222 ⎥ ⎢ ⎥ ⎢ x2111 x2112 x2121 x2122 ⎥ ⎢ ⎥, XI×J = XI1 I2 ×J1 J2 = ⎢ ⎥ ⎢ x2211 x2212 x2221 x2222 ⎥ ⎣ x3111 x3112 x3121 x3122 ⎦ x3211 x3212 x3221 x3222 ⎤ ⎡ 1 0 0 0 0 0 ⎤ ⎡ ⎢ 0 0 0 1 0 0 ⎥ 1 0 0 0 ⎥ ⎢ ⎥ ⎢ ⎢ 0 1 0 0 0 0 ⎥ 0 0 1 0 ⎥ i2 i1 ⎥ ⎢ Pσ = ejj12 jj21 = ⎢ , P = e = π i1 i2 ⎢ 0 0 0 0 1 0 ⎥. ⎣ 0 1 0 0 ⎦ ⎥ ⎢ ⎣ 0 0 1 0 0 0 ⎦ 0 0 0 1 0 0 0 0 0 1 3.10.3. Transposition and tensor contraction The transposition operation can be used to perform a tensor contraction. For example, using index notation, the contraction defined by: yi,j,k = ai,l,j,m,n bm,k,l,n ,

[3.69]

corresponding to triple summation over the indices (l, m, n), can be obtained by multiplying two matrix unfoldings obtained by combining the modes of the transpose tensors of A ∈ KI×L×J×M ×N and B ∈ KM ×K×L×N , as summarized below: 1) transposition of the tensor A: ai,l,j,m,n → xi,j,l,m,n ; 2) matrix unfolding of the transpose tensor X of A: XIJ×LM N ; 3) transposition of the tensor B: bm,k,l,n → zl,m,n,k ; 4) matrix unfolding of the transpose tensor Z of B: ZLM N ×K ; 5) matrix multiplication: YIJ×K = XIJ×LM N ZLM N ×K ; 6) reconstruction of the tensor Y: YIJ×K → yi,j,k . For tensors with very large dimensions, efficient algorithms can be used to determine the transpose tensors, in particular with the goal of minimizing the computational complexity of tensor contractions (see, for example, Lyakh (2015)).

156

Matrix and Tensor Decompositions in Signal Processing

3.11. Symmetric/partially symmetric tensors 3.11.1. Symmetric tensors A hypercubic tensor A ∈ K[N ;I] of order N and dimensions I is said to be symmetric if it is invariant under any permutation π of its modes, i.e.: aπ(i1 ,i2 ,··· ,iN )  aiπ(1) ,iπ(2) ,··· ,iπ(N ) = ai1 ,i2 ,··· ,iN

[3.70]

or equivalently: A = AT,π , ∀π ∈ SN .

[3.71]

A tensor is therefore symmetric if it is equal to all of its transpose tensors. If so, the symmetry is said to be total or complete. Note that in the case of a complex tensor (K = C), Hermitian symmetry will be considered in section 3.11.2. The set of symmetric real or complex tensors of order N and dimensions I will [N ;I] be denoted KS . Another notation used in literature is SN (KI ). Some authors call these tensors super-symmetric tensors (Kofidis and Regalia 2002; Qi 2005). E XAMPLE 3.29.– High-order moments and cumulants of random variables form symmetric tensors. See the property [A1.31] in the Appendix. In the case of a third-order tensor A ∈ KSI×J×K , with I = J = K, (total) symmetry implies the following equalities for every i, j, k ∈ I: aijk = aikj = ajik = ajki = akij = akji .

[3.72] [N ;I]

Like for a symmetric matrix, a symmetric tensor A ∈ KS is uniquely determined by the set of entries located in its upper triangular part, i.e. {ai1 ,··· ,iN ; 1 ≤ i1 ≤ · · · ≤ iN ≤ I}. A symmetric rank-one tensor of order N and dimensions I can be written as (N − 1) outer products of some vector x ∈ KI with itself (see section 3.13.1 for the definition of the outer product): [N ;I]

A=x · · ◦ x+ = x◦N ∈ KS ( ◦ ·)*

.

[3.73]

N terms

3.11.2. Partially symmetric/Hermitian tensors In the same way that we defined partially diagonal tensors, we can also define partially symmetric tensors, i.e. tensors that are symmetric with respect to subsets of

Tensor Operations

157

modes. There are different ways to define partial symmetry according to the partitioning being considered for the index set. For example, for a cubic tensor A ∈ KI×I×I , there are three partial symmetries with respect to two modes (i, j), (i, k), or (j, k), which correspond to the following three conditions, respectively: (1) xijk = xjik , (2) xijk = xkji and (3) xijk = xikj , satisfied for every i, j, k ∈ I. Note that if two of these partial symmetries are satisfied, then this implies the third is also satisfied. In the following, we will consider partial block-wise symmetries for three particular sets of even-order tensors: KI P ×I P , K[2P ;I] and [2N ; I n ×J n ] I1 ×J1 ×···×IN ×JN Rp R . A complex square tensor A = ai1 ,··· ,iP ,j1 ,··· ,jP = aiP ,j ∈ CI P ×I P of order P 2P is said to be Hermitian (by blocks of order P ), written as AH = A, if it is invariant under transconjugation by blocks of order P (Panigrahy and Mishra 2018; Ni 2019): aj

P

,iP

= a∗iP ,j , ∀ip and jp ∈ Ip , p ∈ P . P

[3.74]

If so, the diagonal coefficients aiP ,iP are real numbers. P ROPOSITION 3.30.– Considerthe tensor A ∈ CI P ×I P and the matrix unfolding P A ∈ CΠIP ×ΠIP , with ΠIP  p=1 Ip , defined as: (A)i1 ···iP ,j1 ···jP = aiP , j , P

[3.75]

where i1 · · · iP is defined as in [3.16], with ip , jp ∈ Ip  , ∀p ∈ P . If A is Hermitian by blocks of order P , then the matrix unfolding A defined above is a Hermitian matrix, and we therefore have the following equivalence: AH = A ⇐⇒ AH = A.

[3.76]

E XAMPLE 3.31.– Consider the tensor A ∈ K2×2×2×2 , with I1 = I2 = 2, assumed to be Hermitian by blocks of order two. Its unfolding AI1 I2 ×I1 I2 can then be written as: ⎡ ⎤ a1111 a∗1211 a∗2111 a∗2211 ⎢ a1211 a1212 a∗2112 a∗2212 ⎥ ⎥ AI1 I2 ×I1 I2 = ⎢ [3.77] ⎣ a2111 a2112 a2121 a∗2221 ⎦ , a2211 a2212 a2221 a2222 where the elements a1111 , a1212 , a2121 and a2222 are real numbers. Similarly, using a transposition/transconjugation by blocks of order P , we can define a skew-Hermitian, symmetric and skew-symmetric tensor, as outlined in Table 3.8.

158

Matrix and Tensor Decompositions in Signal Processing

Properties

Notation

Conditions

A ∈ KI P ×I P Hermitian Skew-Hermitian

AH = A AH

= −A

Symmetric

AT = A

Skew-symmetric

AT = −A

aj aj

P

aj aj

P

,iP

P

P

,iP

,iP

,iP

= a∗i

P ,jP

= −a∗i

P ,jP

= aiP ,j

P

= −aiP ,j

P

Table 3.8. Hermitian/symmetric tensors by blocks of order P

P ROPOSITION 3.32.– Like for matrices, any real tensor A ∈ RI P ×I P has a unique decomposition as the sum of a symmetric tensor and a skew-symmetric tensor by blocks of order P : A=

1 1 (A + AT ) + (A − AT ), 2 2

[3.78]

where AT is defined as in Table 3.8. In the case of a complex tensor A ∈ CI P ×I P , replacing AT by AH in [3.78] gives a unique decomposition of A as the sum of a Hermitian tensor and a skew-Hermitian tensor by blocks of order P . We say that a real (respectively, complex) even-order hypercubic tensor A ∈ R[2P ;I] is twice block-wise symmetric (respectively, Hermitian) if it is invariant under any permutation by blocks of order P and under any permutation π of the indices {i1 , · · · , iP } and {j1 , · · · , jP } of each block, i.e.:   [3.79] aπ(j ),π(iP ) = aiP ,j , resp. a∗π(j ),π(iP ) = aiP , j P

P

P

P

ip , jp ∈ I , p ∈ P .

[3.80]

We can also define a partial symmetry by blocks of order P such that: aπ(iP ),π(j

P

)

= aiP , j

P

, ip , jp ∈ I , p ∈ P .

[3.81]

This partial symmetry is equivalent to a symmetry with respect to the P first modes combined with a symmetry with respect to the P last modes. A real even-order paired tensor A ∈ RI1 ×J1 ×···×IN ×JN , with In = Jn for n ∈ N , is said to be paired symmetric if it is invariant under any permutation of adjacent indices {in , jn } for n ∈ N  (Huang and Qi 2018): ai1 j1 i2 j2 ···iN jN = aj1 i1 i2 j2 ···iN jN = ai1 j1 j2 i2 ···iN jN = · · · = ai1 j1 i2 j2 ···jN iN . [3.82]

Tensor Operations

159

We say that A ∈ RI1 ×J1 ×···×IN ×JN , with In = I and Jn = J for n ∈ N , is circularly symmetric if it is invariant under paired circular permutation of indices (Huang and Qi 2018): ai1 j1 i2 j2 ···iN jN = ai2 j2 i3 j3 ···i1 j1 = · · · = aiN jN i1 j1 ···iN −1 jN −1 .

[3.83]

E XAMPLE 3.33.– For a fourth-order tensor A = [aijkl ] ∈ R[4 ; I] , with i, j, k, l, ∈ I, paired symmetry means that the following constraints are satisfied1: aijkl = ajikl = aijlk = ajilk ∀i, j, k, l ∈ I.

[3.84]

One example of this type of tensor is given by the fourth-order elasticity tensor2 C ∈ R[4;3] , which satisfies the following symmetries (Olive and Auffray 2013): cijkl = cjikl = cijlk ∀i, j, k, l ∈ 3

[3.85]

cijkl = cklij

[3.86]

∀i, j, k, l ∈ 3.

The paired symmetries corresponding to the constraints [3.85] are called minor symmetries, whereas the symmetry by blocks of order two [3.86] is described as major. The symmetry relations [3.85] reduce the number of independent coefficients from 34 = 81 to 36, and the symmetries [3.86] further reduce this number to 21. 3.11.3. Multilinear forms with Hermitian symmetry and Hermitian tensors In the next two propositions, we will prove that, given certain symmetry conditions on the associated tensor, the multilinear form [3.5], for P = N with Ip = Jp , ∀p ∈ P , is Hermitian, and the multi-Hermitian form deduced from this multilinear form is real-valued. P ROPOSITION 3.34.– Consider the complex multilinear form deduced from [3.5] by choosing P = N , written using the index convention: P P     (p) (p) f (y(1) )∗ , · · · , (y(P ) )∗ , x(1) , · · · , x(P ) = aiP ,j (yip )∗ xjp , [3.87] P

p=1

p=1

1 Note that the third equality immediately follows from the first two. 2 In the case of small deformations of an elastic solid under stress, Hooke’s experimental law (1676) can be written as follows with the index convention: tij = cijkl kl , where [tij ], [kl ], and [cijkl ] represent the stress, strain and rigidity tensors, respectively. The rigidity tensor is also called the tensor of elastic constants or the elasticity tensor. The symmetry of the stress tensor (tij = tji ) and the strain tensor (kl = lk ) implies that the rigidity tensor must also be symmetric.

160

Matrix and Tensor Decompositions in Signal Processing

with x(p) , y(p) ∈ CIp , ∀p ∈ P  and A ∈ CI P ×I P . This multilinear form satisfies the property of Hermitian symmetry in the sense of the following equality:     f (x(1) )∗ , · · · , (x(P ) )∗ , y(1) , · · · , y(P ) = f ∗ (y(1) )∗ , · · · , (y(P ) )∗ , x(1) , · · · , x(P ) [3.88] if and only if A is Hermitian, as defined in Table 3.8, i.e. AH = A. ΠIP ×ΠIP P ROOF .– Consider of the tensor A ∈ CI P ×I P , P the matrix unfolding A ∈ C with ΠIP  p=1 Ip , as defined in [3.75], as well as the following vectors:

x = x(1) ⊗ · · · ⊗ x(P ) ∈ CΠIP ( )* +

[3.89]

P terms

y = y(1) ⊗ · · · ⊗ y(P ) ∈ CΠIP . ( )* +

[3.90]

P terms

The complex multilinear form [3.87] can then be rewritten as: f (y∗ , x) = yH Ax.

[3.91]

Using the property [3.76], the hypothesis of Hermitian symmetry of the tensor A implies that the matrix unfolding A is Hermitian, i.e. AH = A. Hence, we can deduce that: f (x∗ , y) = xH Ay = (xH Ay)T (by transposition of a scalar quantity) = yT AT x∗ = (yH AH x)∗ = (yH Ax)∗ (by the hypothesis AH = A) = f ∗ (y∗ , x).

[3.92]

From this equality, we can deduce the Hermitian symmetry property [3.88], and conversely, [3.92] implies AH = A.  P ROPOSITION 3.35.– Consider the complex multilinear form defined in Table 3.4: P P     f x∗ , · · · , x∗ , x, · · · , x = aiP ,j x∗i p xj n . P ( )* + ( )* + P terms

P terms

p=1

[3.93]

n=1

This multilinear form is Hermitian in the sense of the following equality:     f ∗ x∗ , · · · , x∗ , x, · · · , x = f x∗ , · · · , x∗ , x, · · · , x if and only if A is Hermitian (AH = A), in which case it is real-valued.

[3.94]

Tensor Operations

161

P ROOF .– Note that the multilinear form [3.93] can be deduced from the multilinear form [3.87] by choosing x(p) = y(p) = x ∈ CI , for ∀p ∈ P , and A ∈ C[2P ;I] . We can therefore deduce from the previous proposition that if A is Hermitian, then the multilinear form [3.93] is multi-Hermitian in the sense of the equality [3.94], and therefore it is real-valued, since it is equal to its conjugate.  R EMARK 3.36.– Similarly, we can show that the multilinear form [3.93] is skew-Hermitian if A is skew-Hermitian (AH = −A). 3.11.4. Symmetrization of a tensor Like for matrices, it is possible to symmetrize a tensor, either as a tensor with the same size as the original tensor, or as a block tensor whose blocks are either transposed forms of the original tensor or zero tensors. In the matrix case, for a square matrix A ∈ RI×I , a symmetrized form is given by: 1 [3.95] (A + AT ). 2 The next proposition extends the above symmetrization formula to the case of a hypercubic tensor of order N . Y=

P ROPOSITION 3.37.– Any real hypercubic tensor A ∈ R[N ;I] of order N and dimensions I can be uniquely symmetrized using the following formula: 1  T,πn Y= A , [3.96] N ! π ∈S n

N

where πn denotes one of the N ! permutations of SN , the symmetric group defined over the set N  of the N first integers. E XAMPLE 3.38.– Consider the cubic tensor A ∈ RI×I×I , with I = 2, whose vectorization, according to lexicographical order, is the following vector: aIJK = [1, 1, 2, 3, 3, 2, 4, 3]T ∈ R8 , which corresponds to the following mode-1 flat matrix unfolding:

1 1 | 2 3 . AI×JK = 3 2 | 4 3 Using the formula [3.96], the symmetrized tensor is given by: T yIJK = 1 2 2 3 2 3 3 3

1 2 | 2 3 YI×JK = . 2 3 | 3 3

162

Matrix and Tensor Decompositions in Signal Processing

Note that this symmetrized form Y of the tensor A does indeed satisfy the symmetry conditions [3.72], namely: y112 = y121 = y211 = 2, and y122 = y212 = y221 = 3. In the matrix case, a block symmetrized form of a rectangular matrix A ∈ RI1 ×I2 , denoted X, is given by:

0I1 ×I1 A X= ∈ R(I1 +I2 )×(I1 +I2 ) , [3.97] AT 0I2 ×I2 which implies X = XT . P ROPOSITION 3.39.– The block symmetric matrix X can be used to determine the SVD of A from the EVD of X. P ROOF .– To prove this result, recall equations [1.42] and [1.43], which show that the right singular vectors (vk ) and left singular vectors (uk ) associated with the singular value σk are the eigenvectors of the symmetric matrices AT A and AAT , respectively, i.e.: Avk = σk uk ⇔ AT Avk = σk2 vk

[3.98]

AT uk = σk vk ⇔ AAT uk = σk2 uk .

[3.99]

Using the block symmetrized form [3.97] of A, the equations defining the above singular vectors can be rewritten as:



0I1 ×I1 A uk uk = σ , [3.100] k vk vk AT 0I2 ×I2 which corresponds to the equation of the EVD of X defined in [3.97]. Assuming that A has rank R, the SVD of A (written UΣVT ) can therefore be determined by using the EVD [3.100] of the block symmetrized form [3.97], with: U = [u1 , · · · , uR ] , V = [v1 , · · · , vR ] , Σ = diag(σ1 , · · · , σR ),

[3.101]

where the σr , with r ∈ R, are the R non-zero singular values associated with the R non-zero eigenvalues of X, and the corresponding eigenvectors provide the singular vectors (uk , vk ) of A.  In the next proposition, we extend the block symmetrization [3.97] to the case of a tensor A ∈ RI N (Ragnarsson and Van Loan 2013). As we have just seen, in the case of a matrix A ∈ RI1 ×I2 , the block symmetrized matrix X ∈ R(I1 +I2 )×(I1 +I2 ) is formed of the blocks A and AT , as well as two zero blocks. Analogously to the

Tensor Operations

163

matrix case, a tensor A RI N can be symmetrized as a block hypercubic tensor ∈ N [N,I] X ∈R , with I = n=1 In , whose blocks are either the N ! transpose tensors of A or zero tensors. Recall the notation AT,πp , with p ∈ N !, of the transpose tensor of A associated with the permutation πp of the set N  of modes, such that:   T,πp  T,πp   → A A A i − i N

πp N 

∈ AT,πp .

[3.102]

4 5 The multi-index πp N   πp (1), · · · , πp (N ) is used to define the position of the transpose tensor AT,πp in the block tensor X , as detailed in the next proposition. IN P ROPOSITION 3.40.– Any tensor NA ∈ R can be symmetrized as a hypercubic block [N,I] tensor X ∈ R , with I = n=1 In , defined as:

X j = N



AT,πp for jN = πp N  for jN = πp N  ∀p ∈ N ! Oj

[3.103]

N

I where X πp N  ∈ R πp N  , with I πp N  = Iπp (1) × · · · × Iπp (N ) , denotes the block   of X located at position πp (1), · · · , πp (N ) , and the zero block Oj ∈ RIj1 ×···×IjN is located at position (j1 , · · · , jN ) in X . The block Oj will N N also be denoted OIj1 ×···×IjN , with jn ∈ N , n ∈ N , in order to specify its dimensions. R EMARK 3.41.– The zero tensor Oj is positioned in X using the multi-index N jN = {j1 , · · · , jN }, such that jN = πp N  for ∀p ∈ N !, i.e. an N -tuple jN that cannot be associated with any permutation πp of the set N . This means that at least two of the N indices jn are equal. Before proving that the block tensor X thus constructed is symmetric, we will illustrate the block symmetrization formula [3.103] for a third-order tensor (Ragnarsson and Van Loan 2013). For A ∈ RI1 ×I2 ×I3 , there exist six transpose tensors associated with the six permutations πp , with p ∈ 6, of the set 3. The blocks X π 3 of the tensor X p containing these six transpose tensors are summarized in Table 3.9, together with their dimensions.

164

Matrix and Tensor Decompositions in Signal Processing

πp

Permutations πp

3 −→ πp 3 π1 (1, 2, 3) → (1, 2, 3) π2 (1, 2, 3) → (1, 3, 2) π3 (1, 2, 3) → (2, 1, 3) π4 (1, 2, 3) → (2, 3, 1) π5 (1, 2, 3) → (3, 1, 2) π6 (1, 2, 3) → (3, 2, 1)

  X π

Dimensions of AT,πp

p 3

  X 123   X 132   X 213   X 231   X 312   X 321

= AT,π1

I1 × I2 × I3

= AT,π2

I1 × I3 × I2

=

AT,π3

I2 × I1 × I3

=

AT,π4

I2 × I3 × I1

=

AT,π5

I3 × I1 × I2

= AT,π6

I3 × I2 × I1

Table 3.9. Block symmetrized tensor X of a third-order tensor A ∈ RI1 ×I2 ×I3

The equations below indicate the positions of the six transpose tensors AT,πp , for p ∈ 6, in the tensor slices (X )..k ∈ RI×I×Ik , with k ∈ 3 and I = I1 + I2 + I3 . ⎤ ⎡ .. .. O . O . O I1 ×I2 ×I1 I1 ×I3 ×I1 ⎥ ⎢ I1 ×I1 ×I1 ⎥ ⎢ ... ... ... ⎥ ⎢ ⎥ ⎢ . . .. T,π4 (X )..1 = ⎢ OI ×I ×I .. OI ×I ×I ⎥ ∈ RI×I×I1 A 2 1 1 2 2 1 ⎥ ⎢ ⎥ ⎢ ... ... ... ⎦ ⎣ .. .. T,π6 O . A . O I3 ×I1 ×I1



(X )..2

⎢ OI1 ×I1 ×I2 ⎢ ... ⎢ ⎢ = ⎢ OI ×I ×I 2 1 2 ⎢ ⎢ ... ⎣ AT,π5 ⎡

(X )..3

⎢ OI1 ×I1 ×I3 ⎢ ... ⎢ ⎢ = ⎢ AT,π3 ⎢ ⎢ ... ⎣ OI3 ×I1 ×I3

I3 ×I3 ×I1

.. . .. . .. . .. . .. . .. .

OI1 ×I2 ×I2 ... OI2 ×I2 ×I2 ... OI3 ×I2 ×I2 A ...

T,π1

OI2 ×I2 ×I3 ... OI3 ×I2 ×I3

.. . .. . .. . .. . .. . .. .

⎤ A ...

T,π2

OI2 ×I3 ×I2 ...

⎥ ⎥ ⎥ ⎥ ⎥ ∈ RI×I×I2 ⎥ ⎥ ⎦

OI3 ×I3 ×I2 ⎤ OI1 ×I3 ×I3 ⎥ ⎥ ... ⎥ ⎥ ∈ RI×I×I3 OI2 ×I3 ×I3 ⎥ ⎥ ⎥ ... ⎦ OI3 ×I3 ×I3

R EMARK 3.42.– We can make the following remarks:

Tensor Operations

165

– there are zero tensors at the positions (j1 , j2 , j3 ) of the slices (X )..k whenever the triplet (j1 , j2 , j3 ) does not correspond to a permutation of the set 3, i.e. when two or three of the indices j1 , j2 , and j3 are equal. For example, in the slice (X )..2 , the block [X ]122 is the zero tensor OI1 ×I2 ×I2 ∈ RI1 ×I2 ×I2 due to the repetition of the last two indices (j2 = j3 = 2); – all blocks in the same slice (X )..k have the same third dimension Ik . We will now prove that the block tensor X is symmetric, which means that it is equal to all of its transpose tensors. Let Y = X T,πq be the transpose tensor of X associated with the permutation πq of the set N . Since the block tensor X is hypercubic with dimensions I, its transpose tensors arethemselves hypercubic block tensors with dimensions I, i.e. Y ∈ R[N,I] , N with I = n=1 In . The blocks of Y are composed of the transposes of the blocks X j , with N 

positions πp (jN  ) = (πp (j1 ), · · · , πp (jN )). Hence, by [3.103], we can conclude that the blocks of Y are equal to either transposed forms of the blocks AT,πp , or to transposed zero tensors. Thus, after transposition, the block X πp N  = AT,πp of X

is transformed into the block (AT,πp )T,πq = AT,πq πp , at the position πq πp N  in Y: T,πq = (AT,πp )T,πq = AT,πq πp = X πq πp N  , Y πq πp N  = Xπp N 

[3.104] T,πq πp , and hence equal to which proves that the block Y πq πp N  is equal to A X π π N  by [3.103]. q

p

To illustrate this result, consider the symmetrized form defined in [3.103] of the third-order tensor A, with p = 6, q = 2. Noting that the composition of the permutations π2 π6 transforms the triplet (1, 2, 3) into (3, 1, 2), we have π2 π6 = π5 , and therefore: (AT,π6 )T,π2 = AT,π2 π6 = AT,π5 = X π5 3 = X 312 ,   as can be checked in the tensor slice X ..2 detailed above. The transpose tensor X T,π2 is therefore such that the block X 321 = AT,π6 of X is transformed into Y 312 = AT,π2 π6 = AT,π5 = X 312 in Y = X T,π2 . Now consider a zero block Oj ∈ RIj1 ×···×IjN in X , for jN = {j1 , · · · , jN }, N jn ∈ N , and n ∈ N , with at least two identical indices ji and jk , i.e. jN = πp N  for ∀p ∈ N !. The transpose of the block Oj in X T,πq is given by: N  T,πq Y πq (j ) = Oj = Oπq (j ) = X πq (j ) ∈ RIπq (j1 ) ×···×Iπq (jN ) . N

N

N

N

166

Matrix and Tensor Decompositions in Signal Processing

For example, for the tensor A ∈ RI 1 ×I2 ×I3 , after the transposition associated with the permutation π2 , the zero block X 121 = OI1 ×I2 ×I1 of X is transformed into: T,π2   2 Y π2 (1,2,1) = Y 112 = X 121 = (O)T,π 121 = O π2 (1,2,1) = OI1 ×I1 ×I2 . From the above results, we can write the transpose tensor Y = X T,πq as:  T,πq πp A for jN = πq πp N  Y j = X T,πq j = (OiN )T,πq for jN = πq (iN ) N N

where iN = πp N  for ∀p ∈ N !.

In summary, we conclude that X T,πq = X for every q ∈ N !, i.e. regardless of the transposition of X , which implies that X is symmetric. 3.12. Triangular tensors Let A ∈ KI N ×I N . This tensor is said to be triangular if its matrix unfolding AI×I , N with I = n=1 In , is itself triangular. More precisely, noting that: (AI×I )i1 ···iN , j1 ···jN = aiN ,j ,

[3.105]

N

where i1 · · · iN is defined as in [3.16], with in , jn ∈ In  , ∀n ∈ N , the tensor A is upper (respectively, lower) triangular if AI×I is upper (respectively, lower) triangular, which is equivalent to aiN ,j = 0 for i1 · · · iN > j1 · · · jN (respectively, j1 · · · jN > N i1 · · · iN ). Similarly, we say that A is unit upper (lower) triangular if AI×I is unit upper (lower) triangular, i.e. with diagonal terms equal to 1. E XAMPLE 3.43.– Let A ∈ K2×2×2×2 , with AI1 I2 ×I1 I2 can be written as: ⎡ a1111 a1112 a1121 ⎢ a1211 a1212 a1221 AI1 I2 ×I1 I2 = ⎢ ⎣ a2111 a2112 a2121 a2211 a2212 a2221

I1 = I2 = 2. Its matrix unfolding ⎤ a1122 a1222 ⎥ ⎥. a2122 ⎦ a2222

[3.106]

The tensor A is unit upper triangular if: a1111 = a1212 = a2121 = a2222 = 1 and a1211 = a2111 = a2112 = a2211 = a2212 = a2221 = 0, and the other terms can take any values. 3.13. Multiplication operations There are several types of multiplication with tensors. These operations can be classified as follows:

Tensor Operations

167

– products whose result is a tensor of the same order as the tensors being multiplied, like the Hadamard product (section 3.18); – products that give a tensor of order higher than the tensors being multiplied, like the outer product (section 3.13.1) and the Kronecker product (section 3.18); – contracted products that correspond to a contraction of tensors being multiplied over one or several modes, like the mode-p and mode-(p, n) products, denoted ×p and ×np , respectively (sections 3.13.2 and 3.13.4), or the Einstein product, denoted N (section 3.13.5). These products involve summation over one or several indices corresponding to the modes shared by both tensors being multiplied. Note that mode-p multiplication, also known as the Tucker product, is used for tensor decompositions into a linear combination of rank-one tensors, like the PARAFAC and Tucker decompositions that will be presented in Chapter 5, whereas the Einstein product is used for developing tensor factorizations, like the SVD and full-rank decomposition, which will be described in section 3.15. The Einstein product is also used to define certain tensor systems (section 3.17), and to develop methods for solving them based on the notions of inverse and pseudo-inverse tensors (section 3.14). These various multiplication operations are presented below. First, we will describe three fundamental operations, namely the outer product of tensors, tensor-matrix multiplication and tensor-vector multiplication. We will then present the more general operation of tensor contraction, i.e. tensor–tensor multiplication based on the mode-(p, n) and Einstein products. In section 3.16, the inner product of two tensors will be defined using the Einstein product. The notions of the Frobenius norm and trace of a tensor will also be defined. In section 3.17, these multiplication operations will be used to establish a connection between tensors and tensor systems. Finally, the Hadamard and Kronecker products of tensors will be introduced in section 3.18. R EMARK 3.44.– It should be noted that other types of tensor multiplication exist, such as the t-product of third-order tensors A, B, denoted A  B (Kilmer and Martin 2011). This type of tensor product is based on the matrix multiplication of a block circulant matrix constructed from frontal slices of A, with a block vectorized form of B also constructed using frontal slices. Unlike the mode-p and Einstein products, the t-product preserves the order of the tensors being multiplied. Furthermore, the t-product can be computed using the fast Fourier transform (FFT). The t-product operation was used to develop the t-SVD, t-polar, t-LU, t-QR and t-Schur decompositions of third-order tensors (Kilmer and Martin 2011; Hao et al. 2013; Miao et al. 2020).

168

Matrix and Tensor Decompositions in Signal Processing

3.13.1. Outer product of tensors The outer product, also called the tensor product, of two non-zero vectors u ∈ KI and v ∈ KJ , denoted u ◦ v, defines a rank-one matrix of size I × J such that: (u ◦ v)ij = ui vj ⇒ u ◦ v ∈ KI×J .

[3.107]

The outer product of P non-zero vectors u(p) ∈ KIp , p ∈ P , gives a rank-one tensor of order P and size I P such that: 

P

◦ u(p)

p=1

 iP

=

P 

(p)

ui p

p=1



P

◦ u(p) ∈ KI P .

[3.108]

p=1

In particular, the outer product of three non-zero vectors u ∈ KI , v ∈ KJ , and w ∈ KK gives a rank-one, third-order tensor u ◦ v ◦ w of size I × J × K such that: (u ◦ v ◦ w)ijk = ui vj wk , i ∈ I, j ∈ J, k ∈ K.

[3.109]

In the case where the vectors u(p) are all identical and equal to u ∈ KI , the definition [3.108] gives a symmetric hypercubic tensor of order P and dimensions I, as defined in [3.70]: u · · ◦ u+  u◦P = ( ◦ ·)*

P 

P terms

[P ;I] ui p ∈ KS .

[3.110]

p=1

Table 3.10 presents a few examples of outer products of matrices and tensors, indicating the space which the tensors resulting from these products belong to, as well as their order. Matrices/tensors being multiplied Outer products A(p) ∈ KIp ×Jp A ∈ KI P , B ∈ KI P A∈

KI P

,B ∈

A(p) ∈ K

KJ N

J Np

Spaces

Orders

◦ A(p)

K[2P ;Ip ×Jp ]

2P

A◦B

KI P ×I P

2P

A◦B

KI P ×J N

P

p=1

P

◦ A(p)

p=1

J ×···×J N P K N1

P +N P

p=1

Np

Table 3.10. Outer products of matrices and tensors

R EMARK 3.45.– Note that the outer product of P matrices A(p) ∈ KIp ×Jp gives an even-order paired tensor belonging to the space KI1 ×J1 ···×IP ×JP , also denoted K[2P ;Ip ×Jp ] . Moreover, the outer product of two tensors A and B ∈ KI P gives a square tensor of the space KI P ×I P , whereas the outer product of A ∈ KI P with B ∈ KJ N gives a rectangular tensor of order P + N , belonging to the space KI P ×J N .

Tensor Operations

Elements of X

Matrices/tensors

Outer products

A(p) ∈ KIp ×Jp

X = ◦ A(p)

A ∈ KI P , B ∈ KI P

X =A◦B

x iP

A ∈ KI P , B ∈ KJ N

X =A◦B

x iP

A(p) ∈ K

J Np

P

p=1

xi1 j1 ···iP jP =

p=1

Indices

p=1

(p)

aip jp

ip ∈ Ip  , p ∈ P  jp ∈ Jp  , p ∈ P 

P

X = ◦ A(p) xj

P

169

N1

,j

P

= a iP b j

P

,j

N

= a iP b j

N

, ··· , j

NP

=

P

p=1

ip , jp ∈ Ip  , p ∈ P  ip ∈ Ip  , p ∈ P  jn ∈ Jn  , n ∈ N  (p)

aj

Np

jN = (j1 , · · · , jNp ) p

p ∈ P 

Table 3.11. Scalar elements of outer products

Table 3.11 gives expressions for the scalar elements of each tensor resulting from the outer products in Table 3.10. The next proposition presents the properties of the outer product. P ROPOSITION 3.46.– The outer product is associative and distributive with respect to tensor addition but not commutative. For every A, A1 , A2 ∈ KI P , B, B1 , B2 ∈ KJ N , C ∈ KK L and λ ∈ K, we have: A ◦ (B ◦ C) = (A ◦ B) ◦ C = A ◦ B ◦ C ∈ KI P ×J N ×K L

[3.111]

(A1 + A2 ) ◦ B =A1 ◦ B + A2 ◦ B ∈ KI P ×J N

[3.112]

A ◦ (B1 + B2 ) =A ◦ B1 + A ◦ B2 ∈ KI P ×J N

[3.113]

(λA) ◦ B = A ◦ (λB) = λ(A ◦ B) ∈ KI P ×J N .

[3.114]

From the properties [3.112]–[3.114], we can conclude that the outer product is a bilinear mapping from RI P × RJ N to RI P ×J N : RI P × RJ N  (A, B) −→ A ◦ B ∈ RI P ×J N . The next proposition gives general matricized and vectorized forms for a rank-one tensor of order P .

170

Matrix and Tensor Decompositions in Signal Processing

P

P ROPOSITION 3.47.– Let X = ◦ u(p) ∈ KI P be the rank-one tensor of order P p=1

defined in [3.108]. The general matricized form [3.39] of X is given by: ! !T (p) (I ) (p) (I ) ⊗ uip eipp ⊗ uip eipp XS1 ;S2 = xi1 ,··· ,iP eII21 = = where Jp1 =



⊗ u(p)

p∈S1

Ip , p∈Sp1

p∈S1

!

⊗ u(p)

!T

p∈S2

p∈S2

∈ KJ1 ×J2

[3.115]

for p1 = 1 and 2. !T

Applying the relation [2.118] to [3.115], with A = ⊗ u p∈S1

(p)

,B =

and J = 1, gives the following general vectorized form:     vec(XS1 ;S2 ) = ⊗ u(p) ⊗ ⊗ u(p) . p∈S2

p∈S1

⊗ u

(p)

p∈S2

[3.116]

We can therefore conclude that the vectorization of a rank-one tensor transforms the outer product of the vectors u(p) into a Kronecker product of these same vectors. In particular, the vectorization according to lexicographical order is given by: P

P

p=1

p=1

xI1 ···IP = vec( ◦ u(p) ) = ⊗ u(p) ,

[3.117]

P (p) where the element xi1 ,··· ,iP = p=1 uip of X is located at the position i in the vectorized form xI1 ···IP given by (see [3.43]): i  i1 · · · iP  iP +

P −1 

P 

(ip − 1)

p=1

Ik .

[3.118]

k=p+1

Table 3.12 illustrates the matricization and vectorization formulae for a rank-one, third-order tensor. The Frobenius norm is also given. The proof of this formula is given in section 3.16.2. 3.13.2. Tensor-matrix multiplication 3.13.2.1. Definition of the mode-p product The mode-p product of a tensor X ∈ KI P of order P with a matrix A ∈ KJp ×Ip , denoted X ×p A, gives the tensor Y of order P and size I1 × · · · × Ip−1 × Jp × Ip+1 × · · · × IP such that (Carroll et al. 1980): yi1 ,...,ip−1 ,jp ,ip+1 ,...,iP =

Ip  ip =1

ajp ,ip xi1 ,...,ip−1 ,ip ,ip+1 ,...,iP .

[3.119]

Tensor Operations

171

X = u ◦ v ◦ w ∈ KI×J×K Matricized forms XI×JK = u(v ⊗

w)T

, XJ×KI = v(w ⊗ u)T , XK×IJ = w(u ⊗ v)T Vectorized forms

xIJK = u ⊗ v ⊗ w , xJKI = v ⊗ w ⊗ u , xKIJ = w ⊗ u ⊗ v Frobenius norm u ◦ v ◦ wF = u2 v2 w2

Table 3.12. Matricized and vectorized forms of a rank-one third-order tensor

The operation ×p therefore corresponds to summation over the index ip associated with the mode p of the tensor X , which is equivalent to performing a contraction over the mode p of X and the mode 2 of A. This operation is also called the Tucker product, since it is used to write the Tucker decomposition of a tensor (see section 5.2.1.2). It can be expressed using the mode-p unfoldings of the tensors X and Y as follows: Yp = AXp .

[3.120]

This operation can be interpreted in terms of a linear mapping from the mode-p space of X to the mode-p space of Y, associated with the matrix A. Note that [3.120] implies that the mode shared by X and A corresponds to the second index of A. Table 3.13 gives three examples of the mode-p product of a third-order tensor X ∈ KI×J×K with a matrix A ∈ KP ×I , B ∈ KQ×J and C ∈ KR×K for p ∈ 3. Tensor Y

Entries of Y Dimensions of Y Unfoldings of Y I X ×1 A ypjk = i=1 api xijk P ×J ×K YP ×JK = AXI×JK X ×2 B yiqk = X ×3 C yijr =

J

j=1 bqj xijk

K

k=1 crk xijk

I ×Q×K

YQ×KI = BXJ×KI

I ×J ×R

YR×IJ = CXK×IJ

Table 3.13. Mode-p products of a third-order tensor with a matrix

If we consider an ordered set S = {n1 , . . . , nP } obtained by permuting the elements of the set P  = {1, . . . , P }, a series of mode-np products of X ∈ KI P with A(np ) ∈ KJnp ×Inp , p ∈ P , will be written concisely as follows: nP

X ×n1 A(n1 ) · · · ×nP A(nP )  X × A(n) . n=n1

[3.121]

172

Matrix and Tensor Decompositions in Signal Processing

In the case where the tensor-matrix product is performed for every mode {1, . . . , P }, de Silva and Lim (2008) proposed another way to concisely express [3.121]: X ×n1 A(n1 ) · · · ×nP A(nP )  (A(n1 ) , · · · , A(nP ) ) . X ,

[3.122]

where the order of the matrices A(n1 ) , · · · , A(nP ) is identical to the order of the mode-np products, i.e. ×n1 , · · · , ×nP . R EMARK 3.48.– The mode-p product of a tensor can be viewed as a generalization of left (mode-1) and right (mode-2) multiplication of a matrix by another matrix, as illustrated in Table 3.14. Matrices

Matrix products

A ∈ KK×J , B ∈ KI×K

BA = A ×1 B = B ×2 AT

A ∈ KK×J , B ∈ KI×K , C ∈ KJ×L

BAC = A ×1 B ×2 CT

U ∈ KI×R , V ∈ KJ×R , Σ ∈ KR×R

UΣVT = Σ ×1 U ×2 V

Table 3.14. Matrix products and mode-p product

3.13.2.2. Properties P ROPOSITION 3.49.– The mode-p product satisfies the following properties: – For X ∈ KI P , αm ∈ K and A(m) ∈ KJp ×Ip , for m ∈ M , we have: M     αm A(m) X ×p α1 A(1) + · · · + αM A(M ) = X ×p m=1

=

M 

  αm X ×p A(m) .

[3.123]

m=1

– Let X ∈ KI P and A(p) ∈ KJp ×Ip , p ∈ P . For any permutation π of the elements of the set P , such that np = π(p), we have: nP

P

n=n1

p=1

X × A(n) = X × A(p) . This means that the order of the mode-p products does not matter when the modes are all distinct. Using the index convention, we have: IP I1 P P      P  (p) (p) X × A(p) j = ajp ,ip = xiP ··· xi1 ,··· ,iP ajp ,ip . [3.124] p=1

P

i1 =1

iP =1

p=1

p=1

Tensor Operations

173

– For two products of X ∈ KI P along the mode p, with A ∈ KJp ×Ip and B ∈ KKp ×Jp , we have (de Lathauwer 1997): Y = X ×p A×p B = X ×p (BA) ∈ KI1 ×···×Ip−1 ×Kp ×Ip+1 ×···×IP .

[3.125]

P ROOF .– Defining Z = X ×p A, we have Y = Z×p B, and using the definition [3.119] of the mode-p product, we have: zi1 ,...,ip−1 ,jp ,ip+1 ,...,iP =

Ip 

ajp ,ip xi1 ,...,ip−1 ,ip ,ip+1 ,...,iP

[3.126]

bkp ,jp zi1 ,...,ip−1 ,jp ,ip+1 ,...,iP .

[3.127]

ip =1

yi1 ,...,ip−1 ,kp ,ip+1 ,...,iP =

Jp  jp =1

By substituting [3.126] into [3.127] and reversing the order of summation, we obtain: yi1 ,...,ip−1 ,kp ,ip+1 ,...,iP =

=

Jp 

bkp ,jp

Ip 

ajp ,ip xi1 ,...,ip−1 ,ip ,ip+1 ,...,iP .

jp =1

ip =1

Ip Jp  

bkp ,jp ajp ,ip xi1 ,...,ip−1 ,ip ,ip+1 ,...,iP .

ip =1

jp =1

The sum in square brackets gives the current element ckp ,ip of the matrix C = BA, which proves the property [3.125].  From this property, we can conclude that the double mode-p product is commutative only if the matrices A and B commute (AB = BA). If so, we have: Y = X ×p A×p B = X ×p (BA) = X ×p (AB) = X ×p B×p A.

[3.128]

– For X ∈ KI P , A(p) ∈ KJp ×Ip , and B(p) ∈ KKp ×Jp , with p ∈ P , we have: P

P

P

p=1

p=1

p=1

Y = X × A(p) × B(p) = X × (B(p) A(p) ) ∈ KK P .

[3.129]

– For X ∈ KI P and factors A(p) ∈ KJp ×Ip , p ∈ P , of full column rank, we have: P

P

p=1

p=1



Y = X × A(p) ⇔ X = Y × A(p) , †

where A(p) denotes the Moore–Penrose pseudo-inverse of A(p) .

[3.130]

174

Matrix and Tensor Decompositions in Signal Processing

P ROOF .– Using the property [3.129], we have:  P  P P P † † † Y × A(p) = X × A(p) × A(p) = X × (A(p) A(p) ) = X p=1

because A

(p) †



p=1

A(p) = A

p=1

(p) T

A(p)

−1

A

p=1

(p) T

A(p) = IIp .



P ROPOSITION 3.50.– The mode-p product of X ∈ KI P with a matrix A(p) ∈ KJp ×Ip of full column rank does not change the mode-p rank of X . Consequently, the multiple mode-p product with matrices A(p) of full column rank, for p ∈ P , does not change the multilinear rank (R1 , · · · , RP ) of X . P ROOF .– Let Y = X ×p A(p) . Using the expression [3.120] of the mode-p unfolding of Y, we have: Yp = A(p) Xp .

[3.131]

Since the rank of a matrix is preserved when it is pre-multiplied by a matrix of full column rank, we deduce that r(Yp ) = r(Xp ) = Rp , and hence the mode-p rank of Y is identical to that of X . It is now easy to deduce that the multilinear rank of P

Y = X × A(p) is identical to that of X if the matrices A(p) all have full column p=1



rank. 3.13.3. Tensor–vector multiplication 3.13.3.1. Definition

The mode-p product of X ∈ KI P with the row vector uT ∈ K1×Ip , denoted by X ×p uT , gives a tensor Y of order P − 1 and size I1 × · · · × Ip−1 × Ip+1 × · · · × IP such that: Ip  yi1 ,··· ,ip−1 ,ip+1 ,··· ,iP = uip xi1 ,··· ,ip−1 ,ip ,ip+1 ,··· ,iP , ip =1

which can be written in vectorized form as vecT (Y) = uT Xp ∈ K1×Ip+1 ···IP I1 ···Ip−1 , with Xp defined as in [3.41]. From this relation, we can deduce that the ip th mode-p tensor slice of X , defined (I ) T

in Table 3.5, is given by X···ip ··· = X ×p eipp with the P basis vectors P (I ) T xi1 ,··· ,iP = X × eipp . p=1

(I ) eipp

. Thus, by multiplying the tensor X

along the P modes, we obtain the entry

Tensor Operations

175

3.13.3.2. Case of a third-order tensor For X ∈ KI×J×K , u ∈ KI , v ∈ KJ , w ∈ KK , with the index convention, we have: X × 1 uT =

I 

ui Xi.. ∈ KJ×K , (X ×1 uT )jk = ui xijk

i=1

X ×2 v T =

J 

vj X.j. ∈ KK×I , (X ×2 vT )ki = vj xijk

j=1

X ×3 w T =

K 

wk X..k ∈ KI×J , (X ×3 wT )ij = wk xijk .

k=1

Similarly, we have: y = X × 1 uT × 2 v T =

I  J 

ui vj Xij. ∈ KK , yk = ui vj xijk

[3.132]

ui wk Xi.k ∈ KJ , yj = ui wk xijk

[3.133]

vj wk X.jk ∈ KI , yi = vj wk xijk

[3.134]

i=1 j=1

y = X × 1 uT × 3 w T =

K I   i=1 k=1

y = X × 2 v T ×3 w T =

K J   j=1 k=1

and: X × 1 uT × 2 v T × 3 w T =

J  K I  

ui vj wk xijk = ui vj wk xijk ∈ K. [3.135]

i=1 j=1 k=1

R EMARK 3.51.– The product [3.132] of X with uT and vT can be performed in three different ways, as illustrated by the following three expressions: X ×1 uT ×2 vT = (X ×2 vT )×1 uT = (X ×1 uT )×1 vT .

[3.136]

The first equality means that the mode-1 and mode-2 products can be performed consecutively, regardless of their order, whereas, for the third expression, the product X ×1 uT ∈ KJ×K is performed first, which eliminates the first mode of X associated with the index i to replace it with the mode associated with the index j. This explains the mode-1 product of (X ×1 uT ) with vT .

176

Matrix and Tensor Decompositions in Signal Processing

For X ∈ KI×J×K , the mode-p products with canonical basis vectors give: (I) T

Xi.. = X ×1 ei

(I) T

Xij. = X ×1 ei

(J) T

X.jk = X ×2 ej

(I) T

Xi.k = X ×1 ei

(I) T

xijk = X ×1 ei

,

(J) T

X.j. = X ×2 ej (J) T

×2 ej

(K) T

(K) T (J) T

×2 ej

(K) T

X..k = X ×3 ek

∈ KK

×3 ek ×3 ek

,

∈ KI ∈ KJ (K) T

×3 ek

.

3.13.4. Mode-(p, n) product The mode-p product of a tensor with a matrix or a vector can be extended to a product with tensors of arbitrary orders (Gelfand et al. 1992; Lee and Cichocki 2014). If we consider the tensors A ∈ KI P and B ∈ KJ N , with IP = J1 , the mode-(P, 1) product of A with B, denoted A ×1P B, gives a tensor C of order P + N − 2 and size I1 × · · · × IP −1 × J2 × · · · × JN , with entries: ci1 ,··· ,iP −1 ,j2 ,··· ,jN =

IP 

ai1 ,··· ,iP −1 ,k bk,j2 ,··· ,jN .

[3.137]

k=1

This contraction operation can be performed using a matrix multiplication involving tall mode-P and flat mode-1 matrix unfoldings of A and B, respectively: CI1 ···IP −1 ×J2 ···JN = AI1 ···IP −1 ×IP BIP ×J2 ···JN .

[3.138]

R EMARK 3.52.– In the matrix case with A ∈ KI×K and B ∈ KK×J , where the number of columns of A is equal to the number of rows of B, we have: C = A ×12 B = AB ⇔ cij =

K 

aik bkj ,

[3.139]

k=1

i.e. the standard matrix product. This contraction operation can also be performed for two arbitrary modes (p, n) of A ∈ KI P and B ∈ KJ N , with Ip = Jn = K. In this case, it is denoted A ×np B and

Tensor Operations

177

gives a tensor C of order P + N − 2 and size I1 × · · · × Ip−1 × Ip+1 × · · · × IP × J1 × · · · × Jn−1 × Jn+1 × · · · × JN such that: ci1 ,··· , ip−1 , ip+1 ,··· ,iP , j1 ,··· , jn−1 , jn+1 ,··· , jN =

K 

ai1 ,··· ,ip−1 , k, ip+1 ,··· ,iP

k=1

[3.140] × bj1 ,··· ,jn−1 , k, jn+1 ,··· ,jN . This type of contraction, of which [3.137] is a special case, was introduced by Gelfand et al. (1992) under the name of convolution (or product), denoted A p,n B. P ROPOSITION 3.53.– The contracted product ×np is associative; in other words, for any tensors A ∈ KI P , B ∈ KJ N and C ∈ KK Q such that Ip = Jn and Jm = Kq , with m = n, we have: (A ×np B) ×qm C = A ×np (B ×qm C) = A ×np B ×qm C.

[3.141]

This double contracted product gives a tensor of order P + N + Q − 4. This property is easy to prove using the definition of the product ×np . R EMARK 3.54.– In the matrix case, with A ∈ KI×J , B ∈ KJ×K and C ∈ KK×L , the property [3.141] becomes: (A ×12 B) ×12 C = A ×12 (B ×12 C) = A ×12 B ×12 C,

[3.142]

or equivalently: (AB)C = A(BC) = ABC.

[3.143]

Thus, we recover the associativity property of the standard matrix product. E XAMPLE 3.55.– The mode-(p, n) product can be used to define the TT decomposition, introduced in Table I.3, that will be studied in more detail in the next volume. This decomposition represents an N th-order tensor X ∈ KI P as a train of two matrices G(1) and G(P ) , and (P − 2) third-order tensors G (p) , p ∈ {2, · · · , P − 1}, with the following sizes: G(1) ∈ KI1 ×R1 ; G(P ) ∈ KRP −1 ×IP ; G (p) ∈ KRp−1 ×Ip ×Rp . The factors G (p) , p ∈ P , are called the TT-cores, and the integers Rp , p ∈ P − 1, are called the TT-ranks. Note that, for p = 1 and p = P , the TT-cores are two matrices.

178

Matrix and Tensor Decompositions in Signal Processing

The TT decomposition can be written as: X = G(1) ×12 G (2) ×13 G (3) ×14 · · · ×1P −1 G (P −1) ×1P G(P ) ,

[3.144]

or in scalar form, with the index convention: xiP = =

R1  r1 =1 P 



RP −1

···

(1)

rP −1 =1

(2)

(3)

(P −1)

(P )

gi1 ,r1 gr1 ,i2 ,r2 gr2 ,i3 ,r3 · · · grP −2 ,iP −1 ,rP −1 grP −1 ,iP

(p)

grp−1 ,ip ,rp

[3.145]

p=1 (1)

(1)

(P )

(P )

with gr0 ,i1 ,r1 = gi1 ,r1 and grP −1 ,iP ,rP = grP −1 ,iP . This entry-wise representation can also be written using matrix–vector and matrix–matrix products as: (1)

(2)

(3)

(P −1)

(P )

xiP = gi1 ,• G•,i2 ,• G•,i3 ,• · · · G•,iP −1 ,• g•,iP (1)

[3.146]

(P )

where gi1 ,• ∈ K1×R1 and g•,iP ∈ KRP −1 are the i1 th row vector of G(1) , and the (p)

iP th column vector of G(P ) , respectively, whereas G•,ip ,• ∈ KRp−1 ×Rp is the ip th lateral slice of G (p) , for p ∈ {2, · · · , P − 1}. The contraction operation can also be applied to several arbitrary pairs of indices corresponding to modes with the same dimension. Thus, for example, for A ∈ KI P and B ∈ KJ N , with Ip = Jn , Is = Jr , and p < s , n < r, it is possible to sum over the index ip of A and the index jn of B on one hand, and over the index is of A and the index jr of B on the other hand. This contraction is defined as follows: A, Bp:n;s:r =

Ip Is  

ai1 ,··· ,k,··· ,l,···iP bj1 ,··· ,jn−1 ,k,jn+1 ··· ,jr−1 ,l,jr+1 ,···jN .

k=1 l=1

Like in the example [3.138], this contraction can be performed by multiplying the matrices AI1 ···Ip−1 Ip+1 ···Is−1 Is+1 ···IP ×Ip Is and BIp Is ×J1 ···Jn−1 Jn+1 ···Jr−1 Jr+1 ···JN to obtain the unfolding CI1 ···Ip−1 Ip+1 ···Is−1 Is+1 ···IP ×J1 ···Jn−1 Jn+1 ···Jr−1 Jr+1 ···JN of the tensor C resulting from a double contraction. 3.13.5. Einstein product We will now define the Einstein product, denoted N , which generalizes the product [3.137]. We will then prove that the space KI N ×I N equipped with the product N has a group structure, and will show that there is an isomorphism of groupsbetween this space KI N ×I N and the general linear group GLI , where N I = n=1 In , i.e. the group of invertible square matrices of order I under the operation of standard matrix multiplication. This isomorphism will allow us to define

Tensor Operations

179

the inverse of a tensor X ∈ KI N ×I N via the inverse of its matrix unfolding XI×I ∈ KI×I (Brazell et al. 2013). Note that the Einstein product will be used to define the properties of orthogonality and idempotence for a tensor, as well as certain tensor factorizations (see sections 3.13.5.2 and 3.15). 3.13.5.1. Definition and properties The product [3.137] can be generalized by defining the Einstein product of the tensor A ∈ KI P ×J N of order P + N with the tensor B ∈ KJ N ×K Q of order N + Q along the N shared indices jN = {j1 , · · · , jN }, corresponding to the N last modes of A and the N first modes of B. This product gives the tensor C = A N B ∈ KI P ×K Q of order P + Q such that: ciP , kQ =

JN 

aiP , j bj

j =1

N

N

, kQ

= aiP , j bj N

N

, kQ ,

[3.147]

N

where the last equality follows by using the index convention, which implies summation over the repeated indices jN , explaining the name of Einstein product. Recall the notation defined earlier in Table 3.1: iP = {i1 , · · · , iP } , kQ = {k1 , · · · , kQ } , JN = {J1 , · · · , JN }. For the Einstein product [3.147], the modes associated with the P first indices and the N last indices of aiP , j are called the row modes and column modes of A, N respectively. E XAMPLE 3.56.– For N = 1, with A ∈ KI P ×J and B ∈ KJ×K Q , we have: (A 1 B)iP , kQ = aiP , j bj, kQ

[3.148]

and hence: A 1 B = A ×1P +1 B ∈ KI P ×K Q .

[3.149]

E XAMPLE 3.57.– For A ∈ KI×J×K , X ∈ KJ×K and x ∈ KK , the 2 product of A with X and the 1 product of A with x give a vector y ∈ KI and a matrix Y ∈ KI×J , respectively, such that: J,K 

yi = [A 2 X]i =

ai,j,k xj,k = ai,j,k xj,k

[3.150]

ai,j,k xk = ai,j,k xk .

[3.151]

j,k=1

yi,j = [A 1 x]i,j =

K  k=1

180

Matrix and Tensor Decompositions in Signal Processing

E XAMPLE 3.58.– For A ∈ KI×J×K×L and B ∈ KK×L×M ×N , the 2 product of A with B gives the fourth-order tensor C ∈ KI×J×M ×N such that: ci,j,m,n = [A 2 B]i,j,m,n =

K,L 

ai,j,k,l bk,l,m,n = ai,j,k,l bk,l,m,n .

[3.152]

k,l=1

P ROPOSITION 3.59.– The Einstein product is associative and distributive with respect to addition, i.e. for A ∈ KI P ×J N , B, D ∈ KJ N ×K M and C ∈ KK M ×LQ , we have: (A N B) M C = A N (B M C) = A N B M C.

[3.153]

This double contracted product gives a tensor P ∈ KI P ×LQ of order P + Q. Moreover, we have: A N (B + D) = (A N B) + (A N D) ∈ KI P ×K M

[3.154]

(B + D) M C = (B M C) + (D M C) ∈ KJ N ×LQ .

[3.155]

R EMARK 3.60.– – The Einstein product is not commutative. – For A ∈ KI×K and B ∈ KK×J , the 1 product of A with B gives the matrix C ∈ KI×J , such that: ci,j = [A 1 B]i,j =

K 

ai,k bk,j = ai,k bk,j .

[3.156]

k=1

The 1 product of two matrices is therefore equivalent to the standard matrix product, and we have: A 1 B = A ×12 B = B ×1 A = AB.

[3.157]

– The Einstein product is also defined for even-order paired tensors A ∈ KI1 ×K1 ···IP ×KP and B ∈ KK1 ×J1 ···KP ×JP , giving a tensor C ∈ KI1 ×J1 ···IP ×JP that is also an even-order paired tensor of order 2P (Chen et al. 2019): ci1 j1 ···iP jP =

KN 

ai1 k1 ···iP kP bk1 j1 ···kP jP .

[3.158]

kN =1

In this case, the summations apply to alternating indices, i.e. every other index. – The Einstein product can also be used with block tensors, such as those defined in section 3.5, as illustrated using the following examples.

Tensor Operations

181

For A ∈ KI P ×J N , B ∈ KI P ×K N , C ∈ KM P ×J N , D ∈ KJ N ×K Q , F ∈ K , and X ∈ KH Q ×I P , Y ∈ KJ N ×GQ , we have:     X P A ... B = X P A ... X P B ∈ KH Q ×LN J N ×RQ

[3.159] with LN = J N × K N ⎡ ⎤ A A N Y ⎣ · · · ⎦ N Y = ⎣ ⎦ ∈ KLP ×GQ ··· C N Y C ⎡







A  ⎣ · · · ⎦ N D C

.. .

[3.160]

with LP = I P × M P ⎤ ⎡ ..  D . A  F A  N N ⎦ ∈ KLP ×S Q ⎣ F = .. C N D . C N F [3.161] with LP = I P × M P , S Q = K Q × RQ .

Table 3.15 summarizes the key results about multiplication with tensors, using the notations from Table 3.1 and the index convention. Tensors X ∈

KI P

,A ∈

Operations KJ×Ip

X ∈ KI P , u ∈ KI p

Definitions 

X ×p A

ip



X × p uT

X ∈ KI P , Y ∈ KJ N with IP = J1

X ×1P Y

X ∈ KI P ×J N , Y ∈ KJ N ×K Q

X N Y

IP

aj,ip xiP = aj,ip xiP

ip

uip xiP = uip xiP

k=1 xi1 ,··· ,iP −1 ,k y k,j2 ,··· ,jN = xi1 ,··· ,iP −1 ,k y k,j2 ,··· ,jN

J N  j

N

=1

xiP ,j yj N

N

,kQ

= xiP ,j yj N

N

,kQ

Table 3.15. Different types of multiplication with tensors

3.13.5.2. Orthogonality and idempotence properties Using the Einstein product, we can define the properties of orthogonality3 and idempotence for an even-order tensor X ∈ KI N ×I N , as outlined in Table 3.16, where J2N is the block identity tensor defined in [3.22], and the tensors X T and X H are 3 The orthogonality property based on the Einstein product was introduced by Brazell et al. (2013) for a fourth-order tensor.

182

Matrix and Tensor Decompositions in Signal Processing

defined as in Table 3.7, replacing the transposition operation with that of transconjugation for X H . The properties of orthogonality and idempotence for a square matrix X ∈ KI×I are also recalled. Properties

Conditions for a matrix X∈

Conditions for a tensor

KI×I

X ∈ KI N ×I N

X, X orthogonal

XT X = XXT = II

X T N X = X N X T = J2N

X, X unitary

XH X = XXH = II

X H N X = X N X H = J2N

X2

X, X idempotent

=X

X N X = X

Table 3.16. Orthogonality and idempotence properties

3.13.5.3. Isomorphism of tensor and matrix groups Below, we consider the space (KI N ×I N , N ), i.e. the space of square tensors of size I N × I N , equipped with the N product, and the matrix unfolding XI×I , with N I = n=1 In , of X ∈ KI N ×I N , as defined in [3.105]. In the next proposition, we begin by showing two preliminary results concerning the closure property of the space (KI N ×I N , N ) under the operation N and the existence of a neutral (also called identity) element for this operation. P ROPOSITION 3.61.– The N product satisfies the following properties: – For any tensors X , Y ∈ KI N ×I N , the product Z = X N Y gives an element of K . This means that the space (KI N ×I N , N ) is closed under the operation N . The unfolding of the product Z can be computed using the product of the unfoldings of X and Y, i.e.: I N ×I N

ZI×I = (X N Y)I×I = XI×I YI×I .

[3.162]

From this property and the definition of an idempotent tensor given in Table 3.16, we can conclude that the tensor X ∈ KI N ×I N is idempotent whenever its unfolding XI×I is idempotent itself, i.e. X2I×I = XI×I . P ROOF .– The closure property follows from the definition [3.147] of the product N : Z = X N Y ⇒ ziN ,kN = xiN ,j yj N

N

,kN

⇒ Z ∈ KI N ×I N ,

[3.163]

with in , jn , kn ∈ In  , ∀n ∈ N . Using the expression [2.68] of a matrix with the index convention, the matrix unfolding ZI×I can be written as: ···kN ZI×I = ziN ,kN eki 1···i . 1

N

[3.164]

Tensor Operations

183

Replacing ziN ,kN with its expression [3.163], using the expression [2.81] of the matrix product and the index convention, we obtain: ZI×I = xiN ,j yj N

N

k1 ···kN ,kN ei ···i 1

[3.165]

N

= XI×I YI×I , with I =

N

n=1 In .

[3.166] 

This proves the relation [3.162].

– The identity tensor J2N defined in [3.22] is the neutral element for the N product, i.e. for any X ∈ KI N ×I N , we have: X N J2N = J2N N X = X .

[3.167]

P ROOF .– This result follows from the definition of the N product: (X N J2N )iN ,j = N

IN 

xiN ,kN

kN =1

N 

δkn ,jn = xiN ,j ,

n=1

N

[3.168]

with in , jn ∈ In , ∀n ∈ N , which proves that X N J2N = X . The equality J2N N X = X can be proven in the same way. We conclude that the space (KI N ×I N , N ) admits J2N as a neutral element. This result can also be recovered using the properties [3.162] and [3.24]: (X N J2N )I×I = XI×I II = XI×I ,

[3.169] 

which completes the proof of [3.167].

E XAMPLE 3.62.– For X , Y ∈ K2×1×2×1 , i.e. for I1 = 2, I2 = 1, the product Z = X 2 Y satisfies: zijmn =

2  1 

xijkl yklmn , i, m ∈ {1, 2} , j, n = 1,

[3.170]

k=1 l=1

which gives: z1111 = x1111 y1111 + x1121 y2111 , z1121 = x1111 y1121 + x1121 y2121 z2111 = x2111 y1111 + x2121 y2111 , z2121 = x2111 y1121 + x2121 y2121 . This result can be found using equation [3.162]:

z1111 z1121 = XI×I YI×I (Z)I×I = z2111 z2121



y1111 y1121 x1111 x1121 . = x2111 x2121 y2111 y2121

184

Matrix and Tensor Decompositions in Signal Processing

To show the group isomorphism, we define the set (TN , N ), equipped with the I N ×I N Einstein product as a binary operation, , N composed of the square tensors X ∈ K whose unfolding XI×I , with I = n=1 In , is a non-singular matrix, i.e.: TN = {X ∈ KI N ×I N / det(XI×I ) = 0}.

[3.171]

Since the matrix XI×I is invertible, it is an element of the general linear group GLI , whose elements are the invertible square matrices of order I. Note that the invertibility of XI×I implies that this matrix has full rank: r(XI×I ) = I. Now, let us define the mapping f that transforms a tensor X ∈ TN into its matrix unfolding XI×I : f : TN  X −→ f (X ) = XI×I ∈ KI×I with I =

N 

In .

[3.172]

n=1

This mapping is bijective. Indeed, bijectivity follows from the relation [3.105], which associates each element xiN ,j of X with the element (XI×I )i1 ···iN , j1 ···jN of N the unfolding XI×I , via the bijective mapping g defined as: I N × I N  (iN , jN ) −→ g(iN , jN ) = (i1 · · · iN , j1 · · · jN ) ∈

N 

N 

In ×

n=1

In .

n=1

[3.173] The property [3.162] can be rewritten using the mapping f as follows: f (X N Y) = f (X )f (Y) = XI×I YI×I .

[3.174]

This formula allows us to determine the tensor Z = X N Y by computing its matrix unfolding ZI×I = XI×I YI×I . Since the mapping f is bijective, it admits an inverse f −1 such that: X N Y = f −1 f (X )f (Y) .

[3.175]

The relation [3.175] can also be expressed using matrix unfoldings as follows: f −1 (XI×I ) N f −1 (YI×I ) = f −1 (XI×I YI×I ).

[3.176]

Furthermore, the property [3.24] can be written as: f (J2N ) = II with I =

N  n=1

In ,

[3.177]

Tensor Operations

185

which means that the image of the neutral element of TN under f is the neutral element of the group GLI . Using the properties [3.174], [3.175] and [3.177], we can deduce the following proposition, which shows that the inverse of a square tensor X ∈ KI N ×I N belonging to the set TN , defined in [3.171], can be determined from the inverse of its matrix unfolding f (X ) = XI×I . P ROPOSITION 3.63.– Any tensor X ∈ TN admits a unique inverse, denoted X −1 ∈ TN , with respect to the Einstein product, satisfying (Brazell et al. 2013): X N X −1 = X −1 N X = J2N . Its matrix unfolding is f (X

−1

)=

X−1 I×I ,

[3.178] i.e.

X −1 = f −1 (X−1 I×I ). P ROOF .– Using the properties [3.174], X N X −1 = J2N gives us:

[3.179] [3.175] and [3.177],

the equality

f (X N X −1 ) = f (X )f (X −1 ) = XI×I f (X −1 ), = f (J2N ) = II

[3.180]

from which we deduce: −1 = f −1 (X−1 f (X −1 ) = X−1 I×I ⇔ X I×I ).

The same result can be obtained from the equality X

[3.181] −1

N X = J2N .



In summary, from the properties stated in propositions 3.59, 3.61 and 3.63, we conclude that: – the set (TN , N ) is closed under the operation N ; – the operation N is associative; – there exists a multiplicative neutral element, namely the identity tensor J2N ; – every element X ∈ (TN , N ) admits an inverse. These properties define a group structure on the set (TN , N ). The property [3.174] then defines a morphism of groups. Furthermore, since f is bijective, it is an isomorphism from the tensor group (TN , N ) to the general linear group GLI . R EMARK 3.64.– It is easy to check that the mapping f maps a symmetric matrix XI×I to a tensor X ∈ RI N ×I N that is symmetric by blocks of order N (see Table 3.8). Moreover, if X ∈ TN is orthogonal (respectively, unitary) according to the definition given in Table 3.16, its matrix unfolding XI×I is orthogonal itself (respectively, unitary).

186

Matrix and Tensor Decompositions in Signal Processing

3.14. Inverse and pseudo-inverse tensors The concept of the inverse tensor, introduced by Brazell et al. (2013) for square tensors, and therefore the invertibility conditions of a tensor depend on several factors simultaneously: the structure of the tensor, defined in terms of order and dimensions; the definition of a multiplication operation (•) between tensors; and an identity tensor J for this operation, satisfying X • X −1 = X −1 • X = J , where X −1 is the inverse of X . In general, inversion is based on exploiting an isomorphism f between the tensor space (T, •) containing X , equipped with the product • and a space (M, ·) of invertible matrices that contains a particular matrix unfolding of the tensor X , equipped with the standard matrix product, whose multiplication symbol (·) is typically omitted from the notation: f : T  X −→ X = f (X ) ∈ M.

[3.182]

Since the mapping f is an isomorphism, the inverse mapping f −1 allows us to determine the inverse of X from that of its unfolding X, i.e.: X −1 = f −1 [f (X )]−1 = f −1 [X−1 ]. [3.183] This result is illustrated by the formula [3.179], which gives the inverse of a tensor X ∈ TN . Table 3.17 gives inverses for orthogonal and unitary tensors. Properties of X ∈ KI N ×I N Orthogonal Unitary

Definitions XT

N X = X N

XT

Inverses = J2N

X −1

X H N X = X N X H = J2N X −1

  = X T = f −1 XT I×I   = X H = f −1 XH I×I

Table 3.17. Inverses of orthogonal/unitary tensors

In the next proposition, we show how the inverse of the tensor Z = X N Y can be determined via its matrix unfolding, i.e. Z −1 = f −1 Z−1 I×I . P ROPOSITION 3.65.– Consider the tensors X , Y ∈ TN . The inverse of their product −1 −1 Z = X N Y can be obtained from its matrix unfolding Z−1 I×I = YI×I XI×I , where XI×I and YI×I are the unfoldings of X and Y, i.e.: −1 −1 (YI×I X−1 Z −1 = f −1 (Z−1 I×I ) = f I×I ) = f −1 f (Y −1 )f (X −1 )

[3.185]

= Y −1 N X −1 by [3.175].

[3.186]

[3.184]

Tensor Operations

187

R EMARK 3.66.– This result requires XI×I and YI×I to be invertible, which is the case by the hypothesis X , Y ∈ TN . E XAMPLE 3.67.– Consider two tensors X , Y ∈ K2×2×2×2 , whose unfoldings XI×I and YI×I , with I = I1 I2 and I1 = I2 = 2, are given by: ⎤ ⎤ ⎡ ⎡ 1 0 0 0 1 1 0 0 ⎢ 1 1 0 0 ⎥ ⎢ 0 1 1 0 ⎥ ⎥ ⎥ ⎢ XI×I = ⎢ ⎣ 0 1 1 0 ⎦ , YI×I = ⎣ 0 0 1 1 ⎦ , 0 0 1 1 0 0 0 1 which corresponds to xijkl = 0 for all (i, j, k, l) such that l + 2(k − 1) > j + 2(i − 1) and yijkl = 0 for all (i, j, k, l) such that j + 2(i − 1) > l + 2(k − 1), respectively. It is easy to check that: ⎡ 1 0 0 0 ⎢ −1 1 0 0 −1 XI×I = ⎢ ⎣ 1 −1 1 0 −1 1 −1 1





⎥ ⎥ , Y−1 I×I ⎦

1 −1 ⎢ 0 1 =⎢ ⎣ 0 0 0 0

⎤ 1 −1 −1 1 ⎥ ⎥, 1 −1 ⎦ 0 1

and hence the unfoldings ZI×I and Z−1 I×I are given by: ⎡

ZI×I = XI×I YI×I

−1 −1 Z−1 I×I = YI×I XI×I

1 ⎢ 1 =⎢ ⎣ 0 0 ⎡

4 ⎢ −3 =⎢ ⎣ 2 −1

1 2 1 0

0 1 2 1

⎤ 0 0 ⎥ ⎥ 1 ⎦ 2

⎤ −3 2 −1 3 −2 1 ⎥ ⎥. −2 2 −1 ⎦ 1 −1 1

The latter matrix does indeed correspond to the unfolding of the inverse of the tensor Z, as can be checked by computing the product ZI×I Z−1 I×I , which must give the unfolding of the identity tensor J4 defined in [3.22], namely the identity matrix I4 , by the relation [3.24]. Indeed, we have: ⎡ ZI×I Z−1 I×I

1 ⎢ 1 =⎢ ⎣ 0 0

1 2 1 0

0 1 2 1

⎤⎡ 4 −3 0 ⎢ −3 3 0 ⎥ ⎥⎢ 1 ⎦ ⎣ 2 −2 −1 1 2

2 −2 2 −1

⎤ −1 1 ⎥ ⎥ = I4 . −1 ⎦ 1

In section 3.15, we will show how the group isomorphism introduced above can be exploited to develop tensor decompositions in a factorized form, such as the SVD and full-rank factorization.

188

Matrix and Tensor Decompositions in Signal Processing

The next proposition presents a generalization of the inversion formula [3.184] for Z = X T P X , with X ∈ KI P ×J N . P ROPOSITION 3.68.– Let X ∈ KI P ×J N and X T ∈ KJ N ×I P . The inverse of the Einstein product Z = X T P X ∈ KJ N ×J N admits the matrix unfolding −1  T , where XI×J is the matrix unfolding of the tensor X , with XI×J XI×J P N I = p=1 Ip and J = n=1 Jn . Hence, the necessary condition for Z to be invertible is that XI×J must have full column rank, i.e. N P r(XI×J ) = J = n=1 Jn ≤ I = p=1 Ip . a tensorX ∈ KI P ×J N into its P ROOF .– By defining the mapping f that transforms P N matrix unfolding f (X ) = XI×J with I = p=1 Ip and J = n=1 Jn , as defined in [3.44]–[3.45], we have: Z = X T P X ∈ KJ N ×J N

[3.187]

f (Z) = XTI×J XI×J = f (X T )f (X ) = ZJ×J T −1 Z −1 = f −1 (Z−1 (XI×J XI×J )−1 . J×J ) = f

[3.188]

and:

[3.189]

We can therefore conclude that Z is invertible if XTI×J XI×J is invertible, which implies the necessary condition that XI×J must have full column rank, i.e. r(XI×J ) = J.  E XAMPLE 3.69.– Consider the fourth-order rectangular tensor X ∈ R3×2×2×2 corresponding to (I1 , I2 , J1 , J2 ) = (3, 2, 2, 2) and hence (I, J) = (6, 4), with the following matrix unfolding: ⎤ ⎡ 1 0 0 0 ⎢ 1 1 0 0 ⎥ ⎥ ⎢ ⎢ 0 1 1 0 ⎥ ⎥. ⎢ [3.190] XI×J = ⎢ ⎥ ⎢ 0 0 1 1 ⎥ ⎣ 0 0 0 1 ⎦ 0 0 0 0 From this matrix unfolding, we can deduce the elements of the tensor X , such as: (XI×J )i2 +(i1 −1)I2 ,j2 +(j1 −1)J2 = xi1 ,i2 ,j1 ,j2 . For example, x1221 = (XI×J )23 = 0, and x3122 = (XI×J )54 = 1. Using the formulae [3.188] and [3.189], we obtain: ⎡ ⎤ ⎡ ⎤ 0.8 −0.6 0.4 −0.2 2 1 0 0 ⎢ −0.6 ⎢ 1 2 1 0 ⎥ 1.2 −0.8 0.4 ⎥ −1 ⎢ ⎥, ⎥ ZJ×J = ⎢ ⎣ 0 1 2 1 ⎦ , ZJ×J = ⎣ 0.4 −0.8 1.2 −0.6 ⎦ −0.2 0.4 −0.6 0.8 0 0 1 2

Tensor Operations

189

with, for example: z1211 = (Z)1211 =

2 3  

= xi1 ,i2 ,1,2 xi1 ,i2 ,1,1 = (ZJ×J )21 = 1

i1 =1 i2 =1

(Z −1 )2212 = (Z−1 J×J )42 = 0.4. −1 It is easy to check that ZJ×J Z−1 J×J = ZJ×J ZJ×J = I4 .

In the following, we will introduce the notions of the generalized inverse and Moore–Penrose pseudo-inverse of a tensor A ∈ KI P ×J N of order P + N , using the Einstein product (Sun et al. 2016; Behera and Mishra 2017; Liang and Zheng 2018). Generalized inverses and the Moore–Penrose pseudo-inverse of tensors are in particular used to solve certain tensor systems (Brazell et al. 2013; Bu et al. 2014; Sun et al. 2016; Behera and Mishra 2017; Jin et al. 2018; Liang and Zheng 2018b), as will be illustrated in section 3.17. Table 3.18 states the four conditions4 that define different types of generalized inverses of a matrix A ∈ KI×J and a tensor A ∈ KI P ×J N of order P + N , denoted A and A , respectively. When these four conditions are satisfied5, we have the Moore–Penrose pseudo-inverse, denoted A† in the matrix case and A† in the tensor case. Conditions A ∈ KI×J , A ∈ KJ×I A ∈ KI P ×J N , A ∈ KJ N ×I P (1)

AA A = A

A N A P A = A

(2)

A AA = A

A P A N A = A

(3)

(AA )H

AA

(A N A )H = A N A

(4)

(A A)H = A A

(A P A)H = A P A

=

Table 3.18. Generalized inverses

There are several types of generalized inverse for a matrix A or a tensor A, depending on which conditions are satisfied in Table 3.18. For example, Burns et al. (1974) and Sun et al. (2016) define the following generalized inverses for matrices and tensors, respectively: – for (1): inner inverse, denoted A{1} ; – for (2): outer inverse, denoted A{2} ; 4 The conditions (1)–(4) are called the general, reflexivity, normalization and inverse normalization conditions, respectively. 5 In the real case (K = R), replace the operation of tranconjugation with that of transposition in the conditions (3) and (4).

190

Matrix and Tensor Decompositions in Signal Processing

– for (1) and (2): reflexive generalized inverse, denoted A{1,2} ; – for (1)–(3): weak (or normalized) generalized inverse, denoted A{1,2,3} ; – for (1)–(4): Moore–Penrose pseudo-inverse, denoted A† . Based on the isomorphism f defined in [3.172], Brazell et al. (2013) presented the SVD of a real fourth-order square tensor. A generalization of the SVD to the case of a rectangular tensor X ∈ CI P ×J N , via the Einstein product, was established by Liang and Zheng (2018), using the mapping f defined as: f :C

I P ×J N

 X −→ f (X ) = XI×J ∈ C

I×J

P 

with I =

p=1

Ip , J =

N 

Jn .

n=1

[3.191] Like the mapping f defined in [3.172], the mapping defined by the above equation is bijective. Before establishing the Moore–Penrose pseudo-inverse and the SVD of a rectangular tensor, let us present a preliminary result in the following lemma (Stanimirovic et al. 2020): L EMMA 3.70.– Consider the rectangular tensors X ∈ CI P ×J N and Y ∈ CJ N ×K L , and their respective matrix unfoldings XI×J ∈ CI×J and YJ×K ∈ CJ×K , with N L P I = p=1 Ip , J = n=1 Jn , and K = l=1 Kl . The mapping f defined in [3.191] satisfies the following properties: f (X N Y) = XI×J YJ×K = f (X )f (Y) ∈ CI×K X N Y = f −1 (XI×J YJ×K ) = f −1 (XI×J ) N f −1 (YJ×K )

[3.192] [3.193]

and: H f −1 (XTI×J ) = X T , f −1 (XH I×J ) = X .

[3.194]

R EMARK 3.71.– From the above results, we can deduce the expression of the transconjugate of an Einstein product of tensors. Let X ∈ CI P ×J N , Y ∈ CJ N ×K M , and Z ∈ CK M ×LQ . We have: (X N Y)H = Y H N X H ∈ CK M ×I P (X N Y M Z)H = Z H M Y H N X H ∈ CLQ ×I P .

[3.195] [3.196]

Using the isomorphism defined in [3.191], we can deduce the following proposition.

Tensor Operations

191

P ROPOSITION 3.72.– Any rectangular tensor X ∈ CI P ×J N whose matrix unfolding XI×J has full column rank admits a Moore–Penrose pseudo-inverse, defined as: [3.197] X † = f −1 [X†I×J ] = f −1 (XTI×J XI×J )−1 XTI×J . Table 3.19 summarizes the key results established for the Einstein product over the previous sections. Mappings/Properties

Classes of tensors and operations with the Einstein product

Eq.

KI N ×I N

Square tensors: TN = {X ∈ / det(XI×I ) = 0} (3.171)  X , Y ∈ KI N ×I N , I = N n=1 In Isomorphism: f

TN  X −→ f (X ) = XI×I ∈ KI×I

(3.172)

Product

f (X N Y) = XI×I YI×I = f (X )f (Y)

(3.174)

X −1

Inverse of X Inverse of X N Y

(X N

Y)−1

=

=

f −1 (X−1 I×I )

−1 f −1 [YI×I X−1 I×I ]

=

(3.179) Y −1

N

X −1

(3.186)

Rectangular tensors: X ∈ CI P ×J N , Y ∈ CJ N ×K L  N L I= P p=1 Ip , J = n=1 Jn , K = l=1 Kl Mapping: f

CI P ×J N  X −→ f (X ) = XI×J ∈ CI×J

(3.191)

Product

f (X N Y) = XI×J YJ×K = f (X )f (Y)

(3.192) (3.193)

Inverse of X T P X

X N Y = f −1 (XI×J ) N f −1 (YJ×K )  −1  (X T P X )−1 = f −1 XT I×J XI×J −1 T   X † = f −1 [X†I×J ] = f −1 XT XI×J I×J XI×J

Moore-Penrose pseudo-inverse

(3.189) (3.197)

Table 3.19. Operations with the Einstein product

Table 3.20 presents various properties of the Moore–Penrose pseudo-inverse of a tensor X ∈ CI P ×J N , in parallel to those of the Moore–Penrose pseudo-inverse of a matrix (Favier, 2019), with the following hypotheses for specific properties: – Property (xv): A ∈ KI×J of full column rank, in the matrix case; A ∈ KI P ×J N , with the unfolding AI×J of full column rank. – Property (xvi): A ∈ KI×J of full row rank, in the matrix case; A ∈ KI P ×J N , with the unfolding AI×J of full row rank. – Property (xvii): A ∈ KI×K , B ∈ KK×J and C ∈ KK×K of full rank, in the matrix case; A ∈ KI P ×J N , B ∈ KJ N ×K Q and C ∈ KJ N ×J N , in the tensor case, with C assumed to be invertible in the sense of the definition [3.179]. – Property (xviii): A ∈ CI×I and B ∈ CJ×J in the matrix case, and A ∈ CI P ×I P and B ∈ CJ N ×J N in the tensor case, are assumed to be unitary, with C ∈ KI×J and C ∈ KI P ×J N .

192

Matrix and Tensor Decompositions in Signal Processing

Matrix case

Tensor case

CI×J

A ∈ CI P ×J N

A∈ SVD Pseudo-

A = UDVH U ∈ CI×I , V ∈ CJ×J unitary

A = U P D N V H U ∈ CI P ×I P , V ∈ CJ N ×J N unitary

A† = VD† UH ∈ CJ×I

A† = V N D† P U H ∈ CJ N ×I P

inverse (i) (ii) (iii) (iv) (v) (vi) (vii) (viii) (ix) (x) (xi) (xii)

(A† )† = A (A† )† = A (AH )† = (A† )H (AH )† = (A† )H (AA† )2 = AA† (A N A† ) P (A N A† ) = A N A† (A† A)2 = A† A (A† P A) N (A† P A) = A† P A (AA† )† = AA† (A N A† )† = A N A† (A† A)† = A† A (A† P A)† = A† P A (AAH )† = (A† )H A† (A N AH )† = (A† )H N A† (AH A)† = A† (A† )H (AH P A)† = A† P (A† )H AH = AH AA† = A† AAH AH = AH P A N A† = A† P A N AH A† = AH (A† )H A† A† = AH P (A† )H N A† = A† (A† )H AH = A† P (A† )H N AH A = AAH (A† )H A = A N AH P (A† )H = (A† )H AH A = (A† )H N AH P A A† = (AH A)† AH A† = (AH P A)† N AH = AH (AAH )† = AH P (A N AH )†

(xiii) (xiv)

AA† = (AH )† AH A† A = AH (AH )†

A N A† = (AH )† N AH A† P A = AH P (AH )†

(xv) (xvi)

A† = (AH A)−1 AH A† = AH (AAH )−1

A† = (AH P A)−1 N AH A† = AH P (A N AH )−1

(xvii) (xviii)

(ACB)† = B† C−1 A† (ACB)† = BH C† AH

(A N C N B)† = B† N C −1 N A† (A P C N B)† = BH N C † P AH

(xix)

(0I×J )† = 0J×I

(0I P ×J N )† = 0J N ×I P

Table 3.20. Properties of the Moore-Penrose pseudo-inverse

R EMARK 3.73.– We can make the following remarks: – In the case of an invertible square matrix A ∈ KI×I and an invertible square tensor A ∈ KI P ×I P , we have A† = A−1 and A† = A−1 , as defined in [3.179]. – The relations (iii) and (iv) mean that AA† and A† A, in the matrix case, and A N A† and A† P A, in the tensor case, are idempotent. To prove these properties, it suffices to replace the pseudo-inverses A† and A† by the expressions provided by the definition equations (1) and (2) in Table 3.18. – The expressions (xv) and (xvi) follow directly from the relations (xii), because the pseudo-inverses on both right-hand sides can be replaced with inverses, due to the hypotheses of full column rank and full row rank, respectively.

Tensor Operations

193

Some of the results presented in Table 3.20 are proven in Panigrahy et al. (2017) for a tensor A ∈ CI P ×J P , i.e., for N = P . In general, the proof of these results can be established using the equations defining the Moore–Penrose pseudo-inverse given in Table 3.18, together with the properties [3.192]–[3.193] and [3.197]. 3.15. Tensor decompositions in the form of factorizations In the next three sections, we present the eigendecomposition of a symmetric square tensor, the SVD of a rectangular tensor and the full-rank decomposition of a tensor of arbitrary order. 3.15.1. Eigendecomposition of a symmetric square tensor P ROPOSITION 3.74.– Let A ∈ RI P ×I P be a block symmetric square tensor in the sense of the definition in Table 3.8. Then there exists an orthogonal tensor U ∈ RI P ×I P and a diagonal tensor D ∈ RI P ×I P , such that: A = U P D P U T , where the diagonal tensor D is defined as in [3.20]:  σi1 ···iP if ip = jp , ∀p ∈ P  diP , j = P 0 otherwise

[3.198]

[3.199]

This factorization is called the eigendecomposition of A. It is expressed in terms of the tensors (U , D) associated with the matrices (U, D) of the eigendecomposition of the unfolding AI×I . It constitutes an extension of the eigendecomposition of a fourth-order tensor introduced by Brazell et al. (2013). P ROOF .– Since the tensor A is symmetric, its unfolding AI×I is a symmetric matrix. This matrix therefore admits an eigendecomposition of the form: AI×I = UDUT ,

[3.200]

where U ∈ RI×I is orthogonal and D ∈ RI×I = diag(σ1 · · · σI ) is diagonal, with P I = p=1 Ip . Using the inverse isomorphism f −1 and based on the property [3.193], it is easy to deduce [3.198] from [3.200], with U = f −1 (U) and D = f −1 (D). Moreover, using the property [3.193] and the relation [3.177] once again, we conclude that the tensor U is itself orthogonal. Indeed, from UUT = II , we deduce: f −1 (UUT ) = f −1 (U) P f −1 (UT ) = U P U T = f −1 (II ) = I2P . [3.201] Similarly, we can show that U T P U = I2P , which implies that the tensor U is orthogonal. 

194

Matrix and Tensor Decompositions in Signal Processing

3.15.2. SVD decomposition of a rectangular tensor P ROPOSITION 3.75.– Any rectangular tensor X ∈ CI P ×J N whose matrix unfolding XI×J is of rank R admits an SVD of the following form (Liang and Zheng 2018; Behera et al. 2019): X = U P D N V H ,

[3.202]

I P ×I P

J N ×J N

and V ∈ C are unitary tensors as defined in Table 3.16, where U ∈ C and D ∈ CI P ×J N is a diagonal tensor as defined in [3.20]:  σr > 0 if i1 · · · iP = j1 · · · jN = r ∈ R (D)iP , j = . [3.203] N 0 otherwise P ROOF .– If we assume that the matrix unfolding XI×J is of rank R, with N P I = p=1 Ip and J = n=1 Jn , its SVD can be written as follows: XI×J = UDVH ,

[3.204]

and V ∈ C where U ∈ C pseudo-diagonal matrix such that: I×I

J×J

 (D)i1 ,··· ,iP , j1 ,··· ,jN =

are unitary matrices, and D ∈ C

I×J

is a

σr > 0 if i1 · · · iP = j1 · · · jN = r ∈ R . 0 otherwise [3.205]

Using the property [3.193], it is easy to deduce the expression [3.202] of the SVD of the tensor X from the SVD [3.204] of XI×J . Furthermore, proceeding in the same way as in the previous section, but replacing the operation of transposition with that of transconjugation, we can show that the tensors U and V are unitary, i.e.: U P U H = U H P U = I2P

[3.206]

V N V H = V H N V = I2N .

[3.207]

This completes the proof of the proposition.



R EMARK 3.76.– The SVD decomposition presented above is a generalization of the SVD proposed by Sun et al. (2016) for tensors X ∈ CI P ×J P , i.e. corresponding to N = P. 3.15.3. Connection between SVD and HOSVD In this section, we will establish the link between the SVD [3.202] of the tensor X ∈ CI P ×J N and its HOSVD decomposition, deduced from [5.15]: P

N

p=1

n=1

X = G × U(p) × U(p+n) ,

[3.208]

Tensor Operations

195

where U(p) ∈ CIp ×Ip , for p ∈ P  and U(p+n) ∈ CJn ×Jn , for n ∈ N , are unitary matrices, and G ∈ CI P ×J N is the core tensor. This connection is the object of the next proposition, which generalizes the relationship established by Cui et al. (2015) for a square tensor to the case of a rectangular tensor. P ROPOSITION 3.77.– Consider the tensor X ∈ CI P ×J N . Its SVD [3.202] P can be rewritten using the matrix unfolding defined in [3.191], with I = p=1 Ip and N J = n=1 Jn : X  XI×J = f (X ) = UΣVH ,

[3.209]

where the unitary matrices U = f (U ) and V = f (V) and the pseudo-diagonal matrix Σ = f (D) of singular values are directly related to the matrices of the HOSVD [3.208] of X , as well as to the matrices of the EVDs of GGH and GH G, where G  GI×J is the matrix unfolding of the core tensor G, by the following relations:   P [3.210] U = ⊗ U(p) Q p=1

V=



 N ⊗ (U(p+n) )∗ P

n=1

[3.211]

GGH = QDI QH ; GH G = PDJ PH [3.212]  √ λr if i1 · · · iP = j1 · · · jN = r ∈ R (Σ)i1 ,··· ,iP , j1 ,··· ,jN = 0 otherwise [3.213] where the λj are the non-zero eigenvalues of GGH and GH G, and R is the rank of G. P ROOF .– Let X  XI×J be the unfolding of X , as defined in [3.191]. Using the general matricization formula [5.5] for the Tucker model, this matrix unfolding is given by:  P   N  X = ⊗ U(p) G ⊗ (U(p+n) )T , [3.214] p=1

n=1

where G  GI×J . Taking into account the fact that matrices U(p+n) are unitary for n ∈ N , i.e. (U(p+n) )T (U(p+n) )∗ = IJn , we deduce that:   P H  P XXH = ⊗ U(p) GGH ⊗ U(p) ∈ CI×I . [3.215] p=1

p=1

Since the matrix GGH is Hermitian positive semi-definite, it admits the following EVD: GGH = QDI QH ,

[3.216]

196

Matrix and Tensor Decompositions in Signal Processing

where Q ∈ CI×I is unitary and DI ∈ CI×I is diagonal positive semi-definite. After replacing GGH with its EVD, equation [3.215] becomes:   P H  P XXH = ⊗ U(p) QDI QH ⊗ U(p) = UDI UH p=1

U



p=1

 P ⊗ U(p) Q.

[3.217] [3.218]

p=1

Note that U is a unitary matrix because the matrices U(p) and Q are themselves unitary, which implies:   P H  P UUH = ⊗ U(p) QQH ⊗ U(p) p=1

p=1

P

= ⊗ U(p) (U(p) )H = II . p=1

Similarly, we have UH U = II . Proceeding in the same way, we have:    N  N XH X = ⊗ (U(p+n) )∗ GH G ⊗ (U(p+n) )T ∈ CJ×J . n=1

n=1

[3.219]

Defining the EVD of GH G as: GH G = PDJ PH , with P ∈ CJ×J unitary and DJ ∈ CJ×J positive semi-definite and diagonal, equation [3.219] becomes: XH X = VDJ VH   N V  ⊗ (U(p+n) )∗ P. n=1

[3.220] [3.221]

Like U, the matrix V is unitary. Furthermore, as we saw in Volume 1 (Favier 2019), the matrices GGH and GH G have the same (positive) non-zero eigenvalues. Suppose that I ≥ J, and R is the rank of X, so that r(X) = R ≤ min(I, J) = J, and write D ∈ CI×J for the diagonal matrix (if I = J or pseudo-diagonal if I > J), whose diagonal (or pseudo-diagonal) terms satisfy:  λj > 0 for j ∈ R [3.222] djj = 0 otherwise where the λj are the non-zero eigenvalues of GGH and GH G, i.e. the squares of the singular values (σj ) of G. The SVD of the matrix unfolding X is then given by: X = UΣVH ,

[3.223]

where the unitary matrices U and V are defined in [3.218] and [3.221], respectively, and Σ is defined in [3.213]. This completes the proof of the proposition. 

Tensor Operations

197

3.15.4. Full-rank decomposition Before presenting the full-rank decomposition for tensors, we will recall an elementary result from matrix algebra, concerning the factorization of any matrix as the product of a matrix of full column rank with a matrix of full row rank. This is called the full-rank factorization. P ROPOSITION 3.78.– Any matrix A ∈ KI×J of rank R can be expressed as the product of a matrix F ∈ KI×R of full column rank with a matrix G ∈ KR×J of full row rank: A = FG.

[3.224]

P ROOF .– Let F ∈ KI×R be a matrix whose columns form a basis of the column space C (A) of A. Each column j ∈ J of A can then be uniquely written as a linear combination of the R columns of F. Hence, by denoting the matrix by G ∈ KR×J , such that each column j contains the R coefficients of the linear combination for the jth column of A, we obtain the factorization [3.224]. Furthermore, since the matrix F has full column rank, we have: r(A) = r(FG) = r(G) = R, which implies that G has full row rank.

[3.225] 

Full-rank matrix decomposition can be extended to tensors A ∈ KI P ×J N using a new definition of the rank of the tensor A as the rank of its matrix unfolding AI×J , i.e. r(A) = r[f (A)] = r(AI×J ). The tensor A is then said to have full column (respectively, row) rank if its unfolding AI×J has full column (respectively, row) rank itself (Liang and Zheng 2018). This decomposition is the object of the next proposition, which uses the Einstein product. P ROPOSITION 3.79.– Consider a tensor A ∈ KI P ×J N and its matrix R unfolding AI×J , as defined in [3.191]. If we assume that r(AI×J ) = K = r=1 Kr , and P consider the full-rank decomposition of AI×J as in [3.224], with I = p=1 Ip and N J = n=1 Jn : AI×J = FG,

[3.226]

with F ∈ KI×K and G ∈ KK×J , then the full-rank decomposition of the tensor A is given by Liang and Zheng (2018) and Behera et al. (2020): A = F R G,

[3.227]

where F = f −1 (F) ∈ KI P ×K R is a tensor of full column rank and G = f −1 (G) ∈ KK R ×J N is a tensor of full row rank.

198

Matrix and Tensor Decompositions in Signal Processing

P ROOF .– This result can be shown using equation [3.193]: f −1 (FG) = f −1 (F) R f −1 (G) = F R G. Furthermore, since the matrices F and G are the factors of the full-rank decomposition [3.226] of AI×J , they have full column rank and full row rank, respectively. Hence, given that these matrices are the matrix unfoldings of the tensors F and G, we may conclude that these tensors also have full column rank and full row rank, respectively.  In the next proposition, we derive a formula for computing the Moore–Penrose pseudo-inverse of the tensor A ∈ KI P ×J N , based on the full-rank decomposition [3.226] of AI×J . Our proof of this formula is different to the one proposed by Liang and Zheng (2018). P ROPOSITION 3.80.– Consider a tensor A ∈ KI P ×J N whose unfolding AI×J is assumed to have full column rank. Its Moore–Penrose pseudo-inverse, defined in [3.197], can be computed using the factors F ∈ KI P ×K R and G ∈ KK R ×J N in its full-rank factorization [3.227], thanks to the following formula: A† = G H R (F H P A N G H )−1 R F H . P ROOF .– From [3.197] and [3.226], we have: A† = f −1 (A†I×J ) = f −1 (FG)† .

[3.228]

[3.229]

Since the matrices F and G have full column rank and full row rank, respectively, the properties (xv) and (xvi) in Table 3.20 give us: (FG)† = G† F† = GH (GGH )−1 (FH F)−1 FH ,

[3.230]

where FH F and GGH have full rank, and are therefore invertible. From [3.226], we deduce: FH AI×J GH = (FH F)(GGH ) ⇒ (GGH )−1 (FH F)−1 = (FH AI×J GH )−1 . Substituting this last expression into [3.230], then into [3.229], we obtain: [3.231] A† = f −1 GH (FH AI×J GH )−1 FH . Since the matrix FH AI×J GH is invertible and of size R × R, using the properties [3.183], [3.193] and [3.194] leads to: f −1 (FH AI×J GH )−1 = (F H P A N G H )−1 . [3.232] Hence, from [3.231] and [3.232], it is easy to deduce the formula [3.228].



Tensor Operations

199

3.16. Inner product, Frobenius norm and trace of a tensor 3.16.1. Inner product of two tensors In the case of two tensors A, B ∈ RI N of order N and same size, a contraction on every pair of indices is equivalent to the N product of the two tensors, which corresponds to their inner product: A, B = A N B =

I1 

IN 

···

i1 =1

ai1 ,··· ,iN bi1 ,··· ,iN =

iN =1

IN  iN =1

aiN biN .

[3.233]

Using the index convention, the inner product can be concisely written as: A, B = aiN biN .

[3.234]

We can also write it using the Euclidean inner product of vectorized forms of A and B: A, B = vecT (B) vec(A) = vecT (A) vec(B),

[3.235]

where vec(A) and vec(B) are vectorizations of A and B associated with the same mode combination. In the case of complex-valued tensors, the Hermitian inner product is obtained by conjugating bi1 ,··· ,iN in the definition [3.233], i.e.: A, B = aiN b∗iN = vecH (B) vec(A).

[3.236] P

P ROPOSITION 3.81.– For A ∈ RI P and B = ◦ x(p) ∈ RI P , a rank-one tensor of p=1

order P , their inner product is given by: P

A, ◦ x(p)  = aiP p=1

P 

(p)

P

T

xip = A × x(P ) . p=1

p=1

[3.237]

The inner product of the tensor A with a rank-one tensor is therefore equal to the multiple mode-p product of A, with P vectors that define the rank-one tensor. P ROPOSITION 3.82.– Let A ∈ RI P and x(p) ∈ RIp for p ∈ P . The inner product P

T

of the vector y = A × x(q) ∈ RIp with x(p) is equivalent to the multiple mode-p q=1 q=p

product of A with the P vectors x(p) , and therefore to the inner product of A with the P

rank-one tensor ◦ x(P ) , by [3.237]: p=1

P

T

P

T

P

A × x(q) , x(p)  = A × x(P ) = A, ◦ x(p) . q=1 q=p

p=1

p=1

[3.238]

200

Matrix and Tensor Decompositions in Signal Processing

P

T

P ROOF .– Since the ip th component of the vector y = A × x(q) is: q=1 q=p

yip =

I1 



Ip−1

···

i1 =1



Ip+1

···

ip−1 =1 ip+1 =1

IP 

ai1 ,··· ,ip ,··· ,iP

iP =1

P 

(q)

x iq ,

[3.239]

q=1 q=p

we have: y, x

(p)

=

Ip  ip =1

(p)

yip xip = aiP

P

P 

(p)

xi p

[3.240]

p=1 T

P

= A × x(p) = A, ◦ x(p) ,

[3.241]

p=1

p=1



which proves [3.238] 3.16.2. Frobenius norm of a tensor

The Frobenius norm of the tensor A ∈ KI N is the square root of the inner product of the tensor with itself, i.e.: 6 6 7 I 7 I1 I 7 N N   7 7 AF = A, A1/2 = 8 ··· |ai1 ,··· ,iN |2 = 8 |aiN |2 , [3.242] i1 =1

iN =1

iN =1

where |.| represents the absolute value or the modulus, depending on whether A is real or complex. Since the Frobenius norm is equal to the square root of the sum of the squares of the absolute value or modulus of all elements of the tensor, it is also equal to the following expressions: AF = vec(A)2 = AS1 ;S2 F = An F , ∀n ∈ N .

[3.243]

The Frobenius norm of a tensor can therefore be defined as the Euclidean norm of one of its vectorized forms, or as the Frobenius norm of one of its matrix unfolding AS1 ;S2 or mode-n unfolding An , defined in [3.39] and [3.41], respectively.

Tensor Operations

201

P

P ROPOSITION 3.83.– For a rank-one tensor A = ◦ u(p) ∈ KI P of order P , applying p=1

the result [3.243] with the vectorized form [3.117] leads to: P

P

P

p=1

p=1

p=1

 ◦ u(p) 2F = vec( ◦ u(p) )22 =  ⊗ u(p) 22 P

= ( ⊗ u(p) )H



p=1

P 

=

 P P ⊗ u(p) = ⊗ (u(p) )H u(p)

p=1

p=1

u(p) 22 .

[3.244] [3.245]

[3.246]

p=1

From this relation, we deduce that the Frobenius norm of an outer product of P vectors is equal to the product of the Hermitian (respectively, Euclidean) norms of these vectors in the complex (respectively, real) case. Like for vectors and matrices, we can define other norms for tensors of order greater than two, such as the Höder or lp norm: Ap =

IN 

|aiN |p

1/p

, 1 ≤ p ≤ ∞,

[3.247]

iN =1

which has the norms l1 and l∞ as special cases: A1 =

IN 

|aiN |

[3.248]

iN =1

A∞ = max |ai1 ,··· ,iN |. i1 ,··· ,iN

[3.249]

The Frobenius norm of a complex (respectively, real) tensor is preserved when the tensor is multiplied mode-p with a unitary (respectively, orthogonal) matrix, as shown in the next proposition. P ROPOSITION 3.84.– Let X ∈ CI P , and let U(p) ∈ CIp ×Ip be a unitary matrix, i.e. U(p) (U(p) )H = (U(p) )H U(p) = IIp . The mode-p product of X with U(p) does not change the Frobenius norm of X . P ROOF .– Let Y = X ×p U(p) . Using the results [3.243] and [3.120], as well as the expression of the Frobenius norm of a matrix in terms of the trace (AF = tr(AH A)), we have: (p) H (p) YF = Yp F = tr(YpH Yp ) = tr(XH ) U Xp F , p (U

[3.250]

202

Matrix and Tensor Decompositions in Signal Processing

and taking into account the hypothesis that U(p) is unitary, we deduce that: YF = tr(XH p Xp ) = Xp F = X F ,

[3.251] 

which proves the stated property.

Like in the matrix case, this property means that any unitary transformation applied to the mode-p subspace of a tensor does not change the norm of the tensor. It is easy to prove that this result may be generalized to the case of a multiple mode-p product, P

i.e. for Y = X × U(p) , where the matrices U(p) are unitary. We then have: p=1

P

X × U(p) F = X F .

[3.252]

p=1

P

As we will see in section 5.2.1, the equation Y = X × U(p) corresponds to a p=1

Tucker model of the tensor Y, with unitary factors U(p) , where the tensor X is called the core tensor. Equation [3.252] therefore implies that the Frobenius norm of a tensor represented using a Tucker model with unitary factors is equal to the Frobenius norm of the core tensor. R EMARK 3.85.– In the case of a real tensor X ∈ RI P , proposition 3.84 holds with an orthogonal matrix U(p) ∈ RIp ×Ip . Below, we give a formula for computing the Frobenius norm of an Einstein product of two tensors. P ROPOSITION 3.86.– Consider the Einstein product C = A N B ∈ RI P ×K Q of the tensors A ∈ RI P ×J N and B ∈ RJ N ×K Q . By [3.243], the Frobenius norm of C is P equal to the Frobenius norm of its matrix unfolding CI×K , where I  p=1 Ip , and Q K  q=1 Kq . It can therefore be computed using the Frobenius norm of the product of the matrix unfoldings of A and B as follows: CF = A N BF = AI×J BJ×K F , with J 

N n=1

Jn .

[3.253]

Tensor Operations

203

3.16.3. Trace of a tensor The next two propositions show that the traces of a hypercubic tensor of order N and a square tensor of order 2N can be obtained as the inner products of these tensors with identity tensors. P ROPOSITION 3.87.– Let A ∈ K[N ;I] be a hypercubic tensor of order N and dimensions I. Its inner product with the identity tensor of order N gives the trace of A, denoted tr(A): I 

A, IN,I  = A N IN ,I =

ai1 , ··· ,iN δi1 ,··· ,iN

[3.254]

i1 ,··· ,iN =1

= aiN δiN =

I 

ai,··· ,i = tr(A).

[3.255]

i=1

R EMARK 3.88.– In the matrix case, for A ∈ KI×I , we have: A, I =

I 

ai ,i = tr(A).

[3.256]

i=1

P ROPOSITION 3.89.– Let A ∈ KI N ×I N be a square tensor of order 2N and size I N × I N . Its trace is given by the inner product of A with the identity block tensor of order 2N , defined in [3.22]–[3.23]: tr(A) =

IN  iN =1

aiN , iN = A, J2N .

[3.257]

Indeed, we have: A, J2N  = A 2N J2N =

IN IN   iN =1 j =1 N

= aiN , iN =

I1  i1 =1

···

IN 

aiN , j

N  N

δin ,jn

[3.258]

n=1

ai1 ,··· ,iN ,i1 ,··· ,iN = tr(A).

[3.259]

iN =1

3.17. Tensor systems and homogeneous polynomials The tensor products ×n and N can be used to define (linear and multilinear) tensor systems, as well as homogeneous polynomials.

204

Matrix and Tensor Decompositions in Signal Processing

3.17.1. Multilinear systems based on the mode-n product The bilinear form φ, with the associated matrix A ∈ RI×J , such that: φ : RI × RJ  (x, y) → φ(x, y) = xT Ay = b ∈ R,

[3.260]

can be written as follows using the mode-n product: A ×1 x T ×2 y T =

I  J 

aij xi yj = aij xi yj = b,

[3.261]

i=1 j=1

i.e., as a homogeneous polynomial of degree 2 in the components xi and yj , with i ∈ I and j ∈ J, of the vectors x and y. Since the quantity b is scalar, it is equal to its transpose, and using the properties [2.149] and [2.28] gives us: b = tr(xT Ay) = tr(AyxT ) = vecT (AT )vec(yxT ) = vecT (AT )(x ⊗ y).

[3.262]

This expression clearly shows the bilinear character of the system with respect to the vectors x and y. Similarly, from equation [3.135], we have: G ×1 xT ×2 yT ×3 zT =

J  K I  

gijk xi yj zk = gijk xi yj zk = b.

[3.263]

i=1 j=1 k=1

This tensor equation, which is trilinear with respect to the vectors (x, y, z), corresponds to a third-order Tucker model, denoted G; x, y, z, with G ∈ RI×J×K , x ∈ RI , y ∈ RJ , z ∈ RK (see section 5.2.1.6). Using the vectorization formula given in Table 5.1, and replacing (A, B, C) with (xT , yT , zT ), we can rewrite this trilinear system as: b = vecT (G)(x ⊗ y ⊗ z),

[3.264]

with vec(G) ∈ RIJK . This equation generalizes the bilinear system [3.262] to the trilinear case. It should be noted that equation [3.263] is associated with the following trilinear form: φ : RI × RJ × RI  (x, y, z) → φ(x, y, z) = G ×1 xT ×2 yT ×3 zT ∈ R. [3.265] The generalization to the case of a P -linear form, with P ≥ 2, is given by: P

P

φ : × RIp  (x(1) , · · · , x(P ) ) → φ(x(1) , · · · , x(P ) ) = G × x(p) p=1

T

p=1

=

I1  i1 =1

···

IP  iP =1

(1)

(P )

gi1 ,··· ,iP xi1 · · · xiP ∈ R,

[3.266]

where the tensor G ∈ RI P of order P is called the core tensor in the Tucker model.

Tensor Operations

205

The bilinear form [3.261] and the trilinear form [3.263] can be obtained as special cases of the P -linear form [3.266] for P = 2 and P = 3, respectively. In the case of a hypercubic tensor A ∈ R[P ;I] of order P and dimensions I, the mode-p multiproduct of A with the vector x ∈ RI gives a P -linear form that can be written as: P

φ : RI  x → φ(x) = A × xT

[3.267]

p=1

I 

=

i1 ,··· ,iP =1

ai1 ,··· ,iP xi1 · · · xiP = aiP

P 

xip ∈ R.

[3.268]

p=1

This P -linear form is often written using the notation AxP , which was introduced by Qi (2005). It corresponds to a homogeneous polynomial of degree P in the variables xi , with i ∈ I, whose coefficients are the entries ai1 ,··· ,iP of the tensor A. It can also be written as the inner product of the tensor A with the rank-one P

tensor x◦P  ◦ x = [xi1 · · · xiP ], or as the Euclidean inner product of the vectors x and Ax

p=1 ◦(P −1)

, i.e.: P

AxP  A × xT = A, x◦P  = Ax◦(P −1) , x = xT (Ax◦(P −1) ). [3.269] p=1

R EMARK 3.90.– In the matrix case (P = 2), with A ∈ RI×I , we obtain a homogeneous polynomial of degree 2 in the variables xi , with i ∈ I: φ : RI  x → φ(x) = Ax2  A ×1 xT ×2 xT = xT Ax =

I 

ai,j xi xj ∈ R,

i,j=1

i.e., a quadratic form. R EMARK 3.91.– A is symmetric and satisfies a PARAFAC decomposition of rank If R ◦P I R, i.e. A = r=1 ur , with ur ∈ K (see equation [5.31]), equation [3.269] becomes: R R   P T AxP = ( u◦P ) × x = (xT ur )P . r r=1

p=1

r=1

We then obtain a homogeneous polynomial of degree P in I variables xi , expressed as a sum of powers of linear forms, which is directly connected to the Waring problem.

206

Matrix and Tensor Decompositions in Signal Processing

E XAMPLE 3.92.– An example of this type of nonlinear system is given by a Volterra series, whose homogeneous term of order P can be written as6: yk =

MP 

hi1 ,··· ,iP uk−i1 · · · uk−iP ,

[3.270]

i1 ,··· ,iP =1

where uk and yk represent the input and output of the system at the sampling instant kT , where T is the sampling period. The tensor hi1 ,··· ,iP is the Volterra kernel of order P , and MP is the memory of this kernel. Writing the kernel of order P as the P th-order tensor HP ∈ RMP ×···×MP , and defining the input vector uTk = [uk−1 , · · · , uk−MP ], equation [3.270] of the system output at time kT can also be written as: P

yk = HP × uTk = HP , u◦P k ,

[3.271]

p=1

which is a homogeneous polynomial of degree P in the components of the input vector uk . In order to reduce the complexity of this type of model, a PARAFAC decomposition of the symmetrized kernel was exploited by Favier et al. (2012a). More generally, for a hypercubic tensor A ∈ K[P ;I] of order P and dimensions I, we define the multiple mode-(Q + 1, · · · , P ) product, with 0 ≤ Q < P , of A with x, as follows: φ : KI  x → φ(x) = A

P

×

p=Q+1

xT = AxP −Q .

[3.272]

We then obtain a tensor B of order Q and dimensions I such that: I 

bi1 ,··· ,iQ =

ai1 ,··· ,iQ ,iQ+1 ,··· ,iP xiQ+1 · · · xiP .

[3.273]

iQ+1 ,··· ,iP =1

In particular, for Q = 1 and Q = 2, we obtain: P

– a vector b = A × xT = AxP −1 ∈ KI , for which each component is a p=2

homogeneous polynomial of degree P − 1 in the variables xi , with i ∈ I, that can be deduced from [3.268] by fixing the index i1 = i of the mode-1 of A: 

AxP −1

 i

=

I 

ai,i2 ,i3 ,··· ,iP xi2 xi3 · · · xiP ∈ K , with i ∈ I;

i2 ,i3 ,··· ,iP =1

[3.274] 6 Here, we assume that the system has a pure delay of one sampling period. If this is not the case, the lower bound on the sum over the indices must be considered 0.

Tensor Operations

207

P

– a matrix B = A × xT = AxP −2 ∈ KI×I for which each element is given by: p=3



AxP −2

 ij

=

I 

ai,j,i3 ,··· ,iP xi3 · · · xiP ∈ K , with i, j ∈ I.

i3 ,i4 ,··· ,iP =1

[3.275] E XAMPLE 3.93.– For a third-order tensor A ∈ CI×I×I and x ∈ CI with I = 2, the system of equations ([3.274] becomes Ax2 = b, or in expanded form: a111 x21 + (a112 + a121 )x1 x2 + a122 x22 = b1 a211 x21 + (a212 + a221 )x1 x2 + a222 x22 = b2 . Thus, we obtain a system of two polynomial equations consisting of two homogeneous polynomialsof degree two in the variables (x1 , x2 ), whereas equation  2 [3.275] can be written as: Ax = k=1 aijk xk , which corresponds to four linear ij

equations in the variables (x1 , x2 ). In the case of a complex hypercubic tensor A ∈ C[2P ;I] , which is a square tensor of order 2P , we define the following complex multilinear form (Fu et al. 2018): P

2P

p=1

p=P +1

φ : CI  x → φ(x∗ , x) = A × xH

×

xT = ai2P

P  p=1

x∗ip

2P 

xi n .

n=P +1

[3.276] This multilinear form can also be written as the Hermitian inner product of the tensor A with the rank-one tensor x · · ◦ x+ ◦ (x∗ ◦ ·)* · · ◦ x∗+ = x◦P ◦ (x∗ )◦P : ( ◦ ·)* P terms

P terms

φ(x∗ , x) = A, x◦P ◦ (x∗ )◦P .

[3.277]

The multi-products presented above will be used in Chapter 4 to define the notions of eigenvalue and eigenvector of a tensor. 3.17.2. Tensor systems based on the Einstein product We can also use the Einstein product to define multilinear and linear tensor systems. Thus, a bilinear tensor system associated with a bilinear mapping,

208

Matrix and Tensor Decompositions in Signal Processing

characterized by the tensor A ∈ RI×J×K×L is defined in Brazell et al. (2013) as follows: φ : RK×L × RI×J  (X, Y) → φ(X, Y) = A 2 X 2 Y = b ∈ R. [3.278] Similarly, for a bilinear tensor system whose variables are the tensors (X , Y), and which is characterized by the tensor A ∈ RI×J×K×L×M ×N , we can write: φ : RM ×N ×P ×RK×L×P  (X , Y) → φ(X , Y) = A2 X 3 Y = B ∈ RI×J . Another example of a tensor system using the Einstein product is given by Sylvester’s tensor equation, which can be stated as follows (Behera and Mishra 2017): T P X + X N S = C,

[3.279]

with T ∈ RI P ×I P , S ∈ RJ N ×J N and C, X ∈ RI P ×J N . This equation generalizes Sylvester’s matrix equation [2.285] to the tensor case. It can be rewritten using the Einstein product of block tensors as follows:



  I2N X OP ×N T I2P P = C, [3.280] N S OP ×N X where OP ×N is the zero tensor of size I P × J N , and I2P and I2N are the identity tensors of size I P × I P and J N × J N , respectively, as defined in [3.22]. Two other examples of basic tensor equations are as follows: A N X = C ; X N B = D,

[3.281]

with A ∈ RI P ×J N , B ∈ RJ N ×I P , C, D ∈ RI P and X ∈ RJ N . This type of system was called multilinear by Brazell et al. (2013), but in our view it seems more appropriate to speak of a system of linear tensor equations (or a tensor system) with respect to the unknown tensor X . R EMARK 3.94.– Using the definition [3.147] of the Einstein product and the results [3.266] and [3.273], we can prove the following proposition, which establishes a link between tensor systems based on the mode-p and Einstein products. P ROPOSITION 3.95.– Let A ∈ KI P ×J N and x(n) ∈ KJn , n ∈ N . We have: B = A ×p+1 x(1) ×p+2 · · · ×p+N x(N ) = A N C

[3.282]

N   N (n) xj n ∈ K J N . C = ◦ x(n) = cj =

[3.283]

n=1

N

n=1

Tensor Operations

209

P ROOF .– By the definition of the multiple mode-p product, we have: B = A ×p+1 x

(1)

×p+2 · · · ×p+N x

(N )

⇒ biP =

JN  j =1

(1)

(N )

aiP ,j xj1 · · · xjN . N

N

Taking into account the definition [3.283] of the tensor C, we deduce that: biP =

JN  j =1

aiP ,j cj N

N

⇒ B = A N C,

N



which proves the relation [3.282]. 3.17.3. Solving tensor systems using LS

Using the results from sections 3.13.5 and 3.14, it is possible to use the LS method to solve systems of tensor equations expressed in terms of the Einstein product. E XAMPLE 3.96.– Consider the following tensor system: A 2 X = B,

[3.284]

with A ∈ RI×J×K×L , X ∈ RK×L and B ∈ RI×J . The solution of this system in the sense of minimizing the LS criterion minA 2 X − B2F can be obtained by rewriting X

equation [3.284] using the unfolding AIJ×KL of the tensor A and the vectorized forms xKL and bIJ of the matrices X and B. The LS criterion then becomes: minAIJ×KL xKL − bIJ 22 . xKL

[3.285]

Minimizing this criterion leads us to the normal equations: xKL = ATIJ×KL bIJ (ATIJ×KL AIJ×KL )ˆ ⇓ ˆ KL = (ATIJ×KL AIJ×KL )−1 ATIJ×KL bIJ x

[3.286] [3.287] [3.288]

if the matrix ATIJ×KL AIJ×KL is invertible, i.e. if AIJ×KL has full column rank, which implies the necessary but not sufficient condition that IJ ≥ KL. The normal equations and LS solution can be rewritten using the Einstein product as follows: ˆ = AT 2 B ⇒ X ˆ = (AT 2 A)−1 2 AT 2 B, AT 2 A 2 X with AT ∈ RK×L×I×J .

[3.289]

210

Matrix and Tensor Decompositions in Signal Processing

E XAMPLE 3.97.– The following tensor system was considered by Brazell et al. (2013): A 2 X = B,

[3.290]

with A ∈ RI×J×K×L , X ∈ RK×L×M ×N and B ∈ RI×J×M ×N . The LS solution of this system of equations is obtained by minimizing the criterion: minA 2 X − B2F .

[3.291]

X

Using proposition 3.86, this criterion can also be written as: min AIJ×KL XKL×M N − BIJ×M N 2F .

XKL×M N

[3.292]

Its minimization then leads us to the following normal equations and solution: ˆ KL×M N = AT ATIJ×KL AIJ×KL X IJ×KL BIJ×M N

[3.293]

⇓ −1 ˆ KL×M N = (AT ATIJ×KL BIJ×M N X IJ×KL AIJ×KL )

[3.294]

if ATIJ×KL AIJ×KL is invertible, which implies the necessary but not sufficient condition that IJ ≥ KL. After defining AT ∈ RK×L×I×J , the normal equations and the LS solution can be rewritten in terms of the Einstein product as follows: AT 2 A 2 Xˆ = AT 2 B ⇒ Xˆ = (AT 2 A)−1 2 AT 2 B.

[3.295]

Thus, we recover the result of Brazell et al. (2013) much more straightforwardly because of our use of the property [3.253] and by rewriting the LS criterion in the form [3.292]. Minimizing this criterion then leads us to the matrix normal equations [3.293], from which it is easy to deduce the tensor normal equations. Note that the solution [3.295] is of the same form as [3.289], with tensors X and B instead of matrices X and B. The solutions [3.289] and [3.295] are expressed using the Moore–Penrose pseudo-inverse of the tensor A, which is given by A† = (AT 2 A)−1 2 AT , = and deduced from the matrix pseudo-inverse A†IJ×KL (ATIJ×KL AIJ×KL )−1 ATIJ×KL (see proposition 3.72). Proceeding in the same way, we can solve the tensor system A 2 X = b, with A ∈ RI×K×L , X ∈ RK×L and b ∈ RI . The LS criterion is then given by: minA 2 X − b22 ⇔ minAI×KL xKL − b22 . X

xKL

[3.296]

Tensor Operations

211

Minimizing these criteria gives us the following solutions: ˆ = (AT 1 A)−1 2 AT ×3 b ˆ KL = (ATI×KL AI×KL )−1 ATI×KL b ⇔ X x [3.297] with AT ∈ RK×L×I . This LS solution exists if the matrix AI×KL has full column rank, which implies the necessary but not sufficient condition that I ≥ KL. The LS solution of the three tensor systems considered here are summarized in Table 3.21. Tensor systems

LS solutions

Necessary conditions

A 2 X = B A ∈ RI×J×K×L , X ∈ RK×L×M ×N Xˆ = (AT 2 A)−1 2 AT 2 B B ∈ RI×J×M ×N

IJ ≥ KL

A 2 X = B A ∈ RI×J×K×L , X ∈ RK×L B ∈ RI×J

ˆ = (AT 2 A)−1 2 AT 2 B X

IJ ≥ KL

A 2 X = b A ∈ RI×K×L , X ∈ RK×L b ∈ RI

ˆ = (AT 1 A)−1 2 AT ×3 b X

I ≥ KL

Table 3.21. LS solutions of tensor linear systems

3.18. Hadamard and Kronecker products of tensors The Hadamard and Kronecker products of matrices can be extended to tensors (Lee and Cichocki 2017), as summarized in Tables 3.22 and 3.23, where we have used the notation [3.16] to denote the combination of indices associated with the Kronecker product: i1 i2 · · · iI  iI + (iI−1 − 1)II + · · · + (i1 − 1)I2 · · · II ,

[3.298]

with the convention that the first index i1 varies more slowly than i2 , which itself varies more slowly than i3 , and so on. In particular, ki  i+(k−1)I , lj  j+(l−1)J. E XAMPLE 3.98.– The Hadamard product of two tensors appears, in particular, when solving the problem of estimating a tensor model for a data tensor characterized by missing data. Let X ∈ KI P be a tensor modeled by means of a PARAFAC model A(1) , · · · , A(P ) ; R of order P and rank R, as defined in section 5.2.5, and let W ∈ KI P be the binary tensor such that:  0 if xiP missing wi1 ,··· ,iP = wiP = 1 if xiP measured

212

Matrix and Tensor Decompositions in Signal Processing

for ∀ip ∈ Ip  and ∀p ∈ P . The quadratic criterion in terms of the difference between the data tensor X and the model output, taking into account the missing data, can be written as:   f (A(1) , · · · , A(P ) ) = W  X − A(1) , · · · , A(P ) ; R 2F . [3.299]

Operations

Symbols and entries A, B ∈

KI×J ,

C∈

Dimensions

KK×L

Outer product

(A ◦ C)i,j,k,l = aij ckl

I ×J ×K×L

Hadamard product

(A  B)ij = aij bij

I ×J

Kronecker product

(C ⊗ A)ki , lj = aij ckl

KI × LJ

Table 3.22. Matrix operations

Operations

Symbols and entries

Dimensions

v ∈ KJ , A, B ∈ KI P , C ∈ KJ P , D ∈ KK N (A ◦ v)iP

Outer product

(A ◦ D)iP

Outer product

,j

, kN

P

= a iP v j

I P × J = × Ip × J p=1

= a i P d kN

IP × KN

(A  B)iP = aiP biP

Hadamard product

(C ⊗ A)j1 i1 ,··· ,j

Kronecker product

P iP

= aiP cj

IP P

J1 I1 × · · · × JP IP

Table 3.23. Tensor operations

Since the rank R is fixed, the criterion needs to be minimized with respect to the factor matrices (A(1) , · · · , A(P ) ) (see Table I.4). P

E XAMPLE 3.99.– Let A, B and C be three rank-one tensors defined as A = ◦ a(p) , p=1

P

P

p=1

p=1

B = ◦ b(p) and C = ◦ c(p) , with a(p) , b(p) ∈ KIp and c(p) ∈ KJp , p ∈ P . Their Hadamard and Kronecker products are given by: P

P

p=1

p=1

A  B = ◦ (a(p)  b(p) ) ∈ KI P ; C ⊗ A = ◦ (c(p) ⊗ a(p) ) ∈ KJ1 I1 ×···×JP IP . We can also define partial Hadamard and Kronecker products (Favier and de Almeida 2014a; Lee and Cichocki 2017). The partial Hadamard product was used to model a wireless communication system, represented using a generalized PARATUCK model, whose core tensor is obtained as the partial Hadamard product

Tensor Operations

213

of the coding tensor with the resource allocation tensor (Favier and de Almeida 2014b). The partial Hadamard product, denoted X Y, along a set iP = (i1 , · · · , iP ) of iP

RL ×I P

and Y ∈ KS M ×I P is defined as the tensor indices shared by the tensors X ∈ K RL ×S M ×I P D∈K whose entries are given by (without using the index convention): drL ,sM ,iP = xrL ,iP ysM ,iP ,

[3.300]

with rL = {r1 , · · · , rL } and sM = {s1 , · · · , sM }. E XAMPLE 3.100.– Consider two third-order tensors X ∈ KR×I1 ×I2 and Y ∈ KS×I1 ×I2 . The partial Hadamard product along the indices corresponding to the shared modes 2 and 3, denoted X  Y, gives a fourth-order tensor {i1 ,i2 }

D ∈ KR×S×I1 ×I2 such that dr,s,i1 ,i2 = xr,i1 ,i2 ys,i1 ,i2 . In the case of two tensors X ∈ KRL ×I P ×J M and Y ∈ KS L ×I P ×K M that share the indices (i1 , · · · , iP ), we define the partial Kronecker product as the tensor D such that: D··· ,i1 ,··· ,iP ,··· = X··· ,i1 ,··· ,iP ,··· ⊗ Y··· ,i1 ,··· ,iP ,··· ∈ KR1 S1 ×···×RL SL ×I1 ×···×IP ×J1 K1 ×···×JM KM , where the Kronecker product applies to each pair of indices (rl , sl ), for l ∈ L, and (jm , km ), for m ∈ M , not shared by the tensors X and Y. 3.19. Tensor extension Tensor extension of a matrix refers to the construction of a tensor of order greater than two from this matrix. For example, given a matrix B ∈ KI×J , we define the tensor A ∈ KI×J×K such that ai,j,k = bi,j for k = 1, · · · , K. This amounts to defining A in such a way that its K frontal slices A..k , k ∈ K, are all equal to B. Using index notation, the mode-1 flat unfolding AI×KJ of A is given by: AI×KJ = ai,j,k ekj i =

K 

ek ⊗ bi,j eji

k=1

=

1TK

⊗ B = B(1TK ⊗ IJ ) = [B · · B+]. ( ·)* K terms

[3.301]

214

Matrix and Tensor Decompositions in Signal Processing

Similarly, we can extend the matrix B to a third-order tensor A ∈ KM ×I×J such that am,i,j = bi,j for m = 1, · · · , M , i.e. a tensor whose horizontal slices Am.. , for m ∈ M , are all equal to B. We then have: AM I×J = am,i,j ejmi =

M 

em ⊗ bi,j eji

m=1

= 1M ⊗ B = (1M

⎤⎫ B ⎪ ⎬ ⎢ ⎥ ⊗ II )B = ⎣ ... ⎦ M terms. ⎪ ⎭ B ⎡

[3.302]

E XAMPLE 3.101.– For the extension of B ∈ KI×J to A ∈ KM ×I×J×K such that: am,i,j,k = bi,j , ∀ m ∈ M , ∀ k ∈ K, combining the formulae [3.301] and [3.302] gives: AM I×KJ = (1M ⊗ II )B(1TK ⊗ IJ ).

[3.303]

We therefore obtain a matrix partitioned into (M, K) blocks, whose blocks of size I × J are all equal to the matrix B. We can also extend a tensor by repeating some of its dimensions. For example, the third-order tensor A ∈ KI1 ×I2 ×I3 can be extended to the tensor B ∈ KR1 I1 ×R2 I2 ×I3 by repeating its first two dimensions R1 and R2 times, respectively. The tensor B can then be written as the following Tucker decomposition: B = A ×1 (1R1 ⊗ II1 ) ×2 (1R2 ⊗ II2 ) ×3 II3 ∈ KJ1 ×J2 ×I3 ,

[3.304]

with J1 = R1 I1 and J2 = R2 I2 , and A as the core tensor (see section 5.2.1 for the definition of a Tucker model). The tensor B admits the following mode-1 flat and mode-2 tall unfoldings (see equation [5.5]):  T BJ1 ×J2 I3 = (1R1 ⊗ II1 )AI1 ×I2 I3 (1R2 ⊗ II2 ) ⊗ II3 = (1R1 ⊗ II1 )AI1 ×I2 I3 (1R2 ⊗ II2 I3 )T BJ1 J2 ×I3 = (1R1 ⊗ II1 ) ⊗ (1R2 ⊗ II2 )AI1 I2 ×I3 = (1R1 R2 ⊗ II1 I2 )AI1 I2 ×I3 . From these equations, we conclude that the unfolding BJ1 ×J2 I3 corresponds to a partitioned matrix, with R1 row blocks and R2 column blocks, all equal to AI1 ×I2 I3 , whereas BJ1 J2 ×I3 is a matrix partitioned into R1 R2 row blocks, all equal to AI1 I2 ×I3 , as illustrated by Figure 3.3.

Tensor Operations

215

Figure 3.3. Unfolding BJ1 ×J2 I3 .

3.20. Tensorization Tensorization is the process of constructing data tensors. We distinguish three main types of tensorization approach based on (i) segmentation of data vectors or matrices, (ii) repetition of experiments in different configurations and (iii) system design. The first approach can be viewed as the inverse process to vectorization and matricization. It involves constructing a high-order tensor from a given vector or matrix. For instance, a vector of size I1 I2 · · · IP partitioned into P sub-vectors of size Ip , p ∈ P , can be reformatted into of order P and size I1 ×I2 ×· · ·×IP . Similarly, a partitioned aPtensor P2 1 matrix of size p=1 Jp can be tensorized into a tensor of order P1 + P2 Ip × p=1 and size I1 × · · · × IP1 × J1 × · · · × JP2 . Thus, a matrix A ∈ CIK×JL partitioned into (I, J) blocks of same size K × L can be tensorized into a fourth-order tensor B ∈ KI×J×K×L , such that: bijkl = ak+(i−1)K , l+(j−1)L = aik,jl using the notation defined in [3.298]. Note that A can be seen as the matrix unfolding BIK×JL of the tensor B obtained by combining the modes 1 and 3 along the rows and the modes 2 and 4 along the columns.

216

Matrix and Tensor Decompositions in Signal Processing

E XAMPLE 3.102.– For I = J = 3, K = 2, L = 1, the matrix A ∈ C6×3 is partitioned into (3,3)-blocks of size 2 × 1, i.e. vectors of size 2: ⎤ ⎡ a11 | a12 | a13 ⎢ a21 | a22 | a23 ⎥ ⎥ ⎢ ⎢ − − − ⎥ ⎥ ⎢ ⎢ a31 | a32 | a33 ⎥ ⎥ A=⎢ ⎢ a41 | a42 | a43 ⎥ . ⎥ ⎢ ⎢ − − − ⎥ ⎥ ⎢ ⎣ a51 | a52 | a53 ⎦ a61 | a62 | a63 The element a52 of the matrix A corresponding to (i = 3, j = 2, k = 1, l = 1) is associated with the element b3,2,1,1 of the tensor B. Similarly, a43 , which corresponds to (i = 2, j = 3, k = 2, l = 1), is associated with b2,3,2,1 . We can also increase the order of a tensor, i.e. transform a tensor X ∈KI P of order Q P JQ P into a tensor Y ∈ K of order Q, with Q > P and q=1 Jq = p=1 Ip . This type of transformation is used in image processing, in particular for the compression and reconstruction of images (Latorre 2005; Bengua et al. 2017). Below, we summarize the idea introduced by Latorre (2005) to represent a color image, viewed as a third-order tensor X ∈ KI1 ×I2 ×I3 , where i1 and i2 are the spatial modes associated with the rows and columns of an image, and i3 is the color mode (red, green, blue), with I3 = 3. An image of size I1 × I2 = 2N × 2N can be divided into elementary square blocks of 2 × 2 pixels, such that each block is associated with a matrix: Y=

4  3  j1 =1 i3 =1

(4)

(3)

xj1 ,i3 ej1 ◦ ei3 ,

where xj1 ,i3 represents the value of the pixel for the color i3 ∈ {1, 2, 3}, with j1 ∈ {1, 2, 3, 4} according to the position of the pixel within the block: top left (j1 = 1), top right (j1 = 2), bottom left (j1 = 3) and bottom right (j1 = 4), respectively. For an image containing 2N × 2N pixels, we define a tensor Y ∈ K4×···×4×3 of order N + 1 containing all pixels, of the form: Y=

4  j1 =1

···

3 4   jN =1 i3 =1

(4)

(4)

(3)

yj1 ,··· ,jN ,i3 ejN ◦ · · · ◦ ej1 ◦ ei3 .

Each mode jn , n ∈ N , is associated with some division of the image into N overlapping elementary square 2×2 blocks, and the set of modes (j1 , · · · , jN ) defines

Tensor Operations

217

the position of the pixel of value yj1 ,··· ,jN ,i3 for the color i3 . Thus, for example, in the case N = 2, for j1 = 3, j2 = 2, the position within the image is given by: ⎤ ⎤ ⎡ ⎡ 0 0 0 0 0 ⎥ ⎢ 1 ⎥ ⎢ (4) (4) ⎥ 0 0 1 0 = ⎢ 0 0 1 0 ⎥. ej2 ◦ ej1 = ⎢ ⎣ 0 0 0 0 ⎦ ⎣ 0 ⎦ 0 0 0 0 0 The value j2 = 2 is associated with the 2×2 block in the top right, whereas j1 = 3 corresponds to a position of the bottom left pixel of the top right block. An example of the second tensorization approach that is often used in applications is to stack a set of data matrices of the same size along a third mode to build a third-order data tensor. This type of tensorization can result, for example, from the repetition of some experiments at different time instances (time diversity), the use of multiple sensors as in array processing (space diversity), or the exploitation of various conditions like illumination, views, and expressions in a face recognition system (see Table I.1). The third tensorization approach is illustrated by wireless communication systems whose design can simultaneously incorporate several diversities, like space-time-frequency-code diversities (Favier and de Almeida 2014). Tensor-based approaches to designing such communication systems lead to tensors of received signals at the receiver that satisfy tensor models with different structures. The multilinear structure of the received signals is then exploited to develop semi-blind receivers to jointly estimate the channels and information symbols. This application will be studied in detail in the next two volumes. Another tensorization approach that leads to tensor decompositions originated from the use of high-order cumulants to solve BSS and blind system identification problems (see the Appendix). For instance, in the context of wireless communication systems, fourth-order cumulants of the signals, measured at the output of a finite impulse response (FIR) communication channel, satisfy a PARAFAC model that can be exploited to estimate the channel. See section 5.3.3 for a detailed presentation of this application. There are particular tensorization techniques that construct structured tensors, like Cauchy, Toeplitz or Hankel tensors, from a given vector of data. The corresponding Hankelization is briefly described in the next section. For an overview of tensorization techniques, see Debals and de Lathauwer (2017). 3.21. Hankelization Hankelization is the process of constructing Hankel matrices and Hankel tensors from a given data vector. In the matrix case, the Hankelization maps a vector u ∈ KN to a Hankel matrix A ∈ KI×J , defined as follows:

218

Matrix and Tensor Decompositions in Signal Processing

KN  u → A ∈ KI×J , N = I + J − 1 aij = ui+j−1 , i ∈ I , j ∈ J.

[3.305]

It is important to note that the Hankelization process introduces redundancy to the information contained in the vector u, due to the repetition of its components in the Hankel matrix. This redundancy allows the performance of estimating the parameters contained in u to be improved by Hankelization. ⎡

u1 E XAMPLE 3.103.– For N = 5, I = J = 3, we obtain: A = ⎣ u2 u3

u2 u3 u4

⎤ u3 u4 ⎦ . u5

For certain vectors u, Hankelization provides low rank Hankel matrices or low multilinear rank Hankel tensors. For instance, when the components of u are exponentials such that un = z n−1 , n ∈ N , the Hankel matrix A is a rank-one matrix that can be factored as: ⎡ ⎤ ⎡ ⎤ 1 z ··· z J−1 1 ⎢ z ⎥ ⎢ z ⎥ z2 · · · zJ ⎢ ⎥ ⎢ ⎥ A=⎢ . ⎥ = ⎢ .. ⎥ 1 z · · · z J−1 . .. .. . ⎣ . ⎦ ⎣ . ⎦ . . z I−1

zI

···

z I+J−2

z I−1

R n−1 , n ∈ N , the In the case of a sum of R exponentials un = r=1 zr Hankelization process gives a Hankel matrix that can be factored as the product of two Vandermonde matrices A = UVT , with: ⎡ ⎡ ⎤ ⎤ 1 1 ··· 1 1 1 ··· 1 ⎢ z1 ⎢ z1 z2 · · · zR ⎥ z2 · · · zR ⎥ ⎢ ⎢ ⎥ ⎥ U = ⎢ .. .. .. ⎥ , V = ⎢ .. .. .. ⎥ . ⎣ . ⎣ . . . ⎦ . . ⎦ z1I−1

z2I−1

···

I−1 zR

z1J−1

z2J−1

···

J−1 zR

If the generators (zr ) are all distinct and the number R of exponentials is less than or equal to min{I, J}, the Vandermonde matrices have full column rank, which implies that the Hankel matrix admits a full-rank decomposition. Such a Hankelization is used to solve the BSS problem. In this context, de Lathauwer (2011) and Debals et al. (2017) present different examples of matrix Hankelization from data vectors obtained by evaluating various signals corresponding to sums and/or products of exponentials, sinusoids and/or polynomials. The rank of the resulting Hankel matrix is given for each class of signals.

Tensor Operations

219

Similarly, the Hankelization of a vector u ∈ KN to a Hankel tensor A ∈ KI1 ×···×IP of order P is defined as the following transformation (Papy et al. 2005): KN  u → A ∈ KI1 ×···×IP , N =

P 

Ip − P + 1

p=1

ai1 ,··· ,iP = ui1 +···+iP −P +1 , ip ∈ Ip  , p ∈ P .

[3.306]

The tensor thus formed is composed of P − 1 tensor slices, each with size I1 × · · · × IP −1 . E XAMPLE 3.104.– For N = 5, P = 3, I1 = tensor A ∈ K3×2×2 contains two frontal I1 × I2 = 3 × 2, defined as follows: ⎤ ⎡ ⎡ u2 u1 u 2 A..1 = ⎣ u2 u3 ⎦ , A..2 = ⎣ u3 u4 u3 u 4

3, I2 = I3 = 2, the third-order Hankel slices with Hankel structure and size ⎤ u3 u4 ⎦ . u5

For this example, data Hankelization is obtained by stacking two Hankel matrices along the third mode. Tensor Hankelization was used by de Lathauwer (2011) and Phan et al. (2017) to solve the BSS problem. In section 5.3.2, we will briefly present this application for an instantaneous source mixture.

4 Eigenvalues and Singular Values of a Tensor

4.1. Introduction The notion of determinant plays an important role in matrix computation, and more generally in linear algebra, for example for determining the rank, inverse and eigenvalues of a matrix, as well as for solving systems of linear equations using Cramer formulae. It is a well-known fact that the system of equations Ax = b, with A ∈ CN ×N and x, b ∈ CN , is invertible, in the sense that there exists a unique solution, if and only if the matrix A is invertible, and therefore if and only if the determinant of this matrix is non-zero. The notions of eigenvalue and eigenvector of A ∈ CN ×N are also encountered when solving the system of equations Ax = λx, where the eigenvalue–eigenvector pair (λ, x) ∈ C × {CN \{0}} is called an eigenpair of A. The existence of a solution x = 0 for the system (A − λI)x = 0 means that the kernel of A − λI must contain non-zero vectors, which implies that the matrix A − λI is singular, and therefore det(A − λI) = 0, known as the characteristic equation. The polynomial pA (λ) = A − λI is called the characteristic polynomial of A, and its roots are the eigenvalues of A. The set Vλ of all eigenvectors associated with the eigenvalue λ is called the eigenspace associated with λ. The problem of computing eigenvalues is closely related to that of solving ordinary and partial differential equations. Thus, in the case of a system of first-order linear differential equations with constant coefficients such that du(t) = Au(t), with the dt initial condition u(0) = u0 , an exponential solution u(t) = eλt x leads to the equation Ax = λx, whose unknowns λ and x correspond to an eigenpair of A. Given N particular solutions un (t) = eλn t xn , n ∈ N , the superposition principle for linear N systems implies that every linear combination n=1 cn un (t) is also a solution.

Matrix and Tensor Decompositions in Signal Processing, First Edition. Gérard Favier. © ISTE Ltd 2021. Published by ISTE Ltd and John Wiley & Sons, Inc.

222

Matrix and Tensor Decompositions in Signal Processing

Once the N eigenvectors xn have been computed, the coefficients cn of this linear combination can be determined using the initial condition, namely: ⎤ ⎡ c1 N N   ⎥ ⎢ cn un (0) = cn xn = [x1 · · · xN ] ⎣ ... ⎦ = u0 . n=1

n=1

cN In the same way that an exponential solution of a first-order differential equation leads to an eigenvalue problem, searching for an exponential solution u(t) = ejωt x 2 = Au(t), with A, B ∈ to a system of second-order differential equations B d dtu(t) 2 CN ×N , leads to a generalized eigenvalue problem of the form Ax = λBx, where λ = (jω)2 . This problem admits a solution x = 0 only if the matrix A−λB is singular and therefore if det(A − λB) = 0. The roots λ of this polynomial are then called the generalized eigenvalues of A and B, and the problem is called the generalized form of the classical eigenvalue problem. This type of generalized eigenvalue problem is often encountered in solid mechanics. In the case where B is symmetric positive definite, the above polynomial can be rewritten as det(B−1 A − λI) = 0, which amounts to solving a classical eigenvalue problem for B−1 A corresponding to the equation B−1 Ax = λx. Like in the case of a system of first-order matrix differential equations, searching for an exponential solution u(t) = eλt x to the system of first-order tensor differential equations defined in terms of the tensors A and B as: dB(u(t))P −1 = A(u(t))P −1 dt

[4.1]

leads to the following generalized eigenvalue problem (Ding and Wei 2015): AxP −1 = (P − 1)λBxP −1 .

[4.2]

Another application of the spectral properties of a matrix, i.e. the properties concerning its eigenvectors and eigenvalues, is encountered when studying continuous-time and discrete-time linear dynamic systems represented using state-space models described by the following equations:   ˙ x(t) = Ax(t) + Bu(t) x(k + 1) = Ax(k) + Bu(k) and [4.3] y(t) = Cx(t) + Du(t) y(k) = Cx(k) + Du(k) where x ∈ RP , u ∈ RM and y ∈ RQ represent the state, the control and the output of the system, with A ∈ RP ×P , B ∈ RP ×M , C ∈ RQ×P and D ∈ RQ×M . When the matrices (A, B, C, D) are independent of time t and k, the system is said to be time-invariant or stationary. The spectral properties of the state-transition matrix A allow us to study the stability of the system and determine the state–space representation associated with

Eigenvalues and Singular Values of a Tensor

223

the modal form of A, i.e. its diagonal or Jordan canonical form. It is a well-known fact that the bounded-input bounded-output (BIBO) stability of systems represented using state-space models [4.3] is guaranteed if and only if all eigenvalues of A have negative real parts in the continuous-time case or modulus less than 1 in the discrete-time case. The state-space models recalled above can be generalized to the case of tensor systems defined in terms of the Einstein product as follows: 

X˙ (t) = A P X (t) + B R U (t) Y(t) = C P X (t) + D R U (t)

 ,

X (k + 1) = A P X (k) + B R U(k) Y(k) = C P X (k) + D R U (k) [4.4]

where X ∈ RI P , U ∈ RM R and Y ∈ RQS are called the state, input and output tensors, respectively, and A ∈ RI P ×I P , B ∈ RI P ×M R , C ∈ RQS ×I P and D ∈ RQS ×M R represent the dynamics, control and output (or observation) tensors of the tensor state-space model. Analogously to standard state-space models [4.3], we can study the stability, controllability and observability properties of dynamic systems represented using tensor state-space models [4.4]. Thus, in the case of a discrete-time tensor system where A ∈ RI1 ×I1 ×···×IP ×IP is an even-order paired tensor, and with the Einstein product defined as in [3.158], asymptotic stability in the sense of X (k)F → 0 as k → ∞ is guaranteed if and only if every eigenvalue of the dynamics tensor A defined by the equation A P X = λX has modulus less than 1 (Chen et al. 2019). See equation [4.22] in definition 4.4. Note that computing matrix eigenvalues and singular values is the key to determining the eigenvalue decomposition (EVD) and the HOSVD for high-order tensors. Spectral theory has been developed in more detail for specific classes of structured tensors, such as symmetric tensors, non-negative tensors, Toeplitz tensors, Hankel tensors and Cauchy tensors. Two important applications of the spectral theory of non-negative tensors concern: – hypergraphs, to study the spectral properties of the connectivity hypermatrix (adjacency hypermatrix, Laplacian hypermatrix, etc.) of a hypergraph (Li et al. 2013; Hu and Qi 2014; Pearson and Zhang 2014; Xie and Qi 2016; Banerjee et al. 2017); – high-order Markov chains, to study the properties of the transition probability tensors (Chang and Zhang 2013b; Li and Ng 2014; Culp et al. 2017).

224

Matrix and Tensor Decompositions in Signal Processing

See the article by Chang et al. (2013) for a presentation of different applications of the spectral theory of non-negative tensors. Unlike the matrix case, there are several ways to define the eigenvalues of a tensor. Solving a polynomial optimization problem plays a key role in each of them. The objective of this chapter is to present the various definitions. The notions of eigenvalue and singular value of a tensor will be defined in sections 4.2 and 4.5, respectively. Positive/negative definite tensors and orthogonally/unitarily similar tensors will also be introduced. The link between eigenvalues and the best rank-one approximation of a tensor will be established in section 4.3, and the notion of orthogonal decomposition of a tensor will be introduced in section 4.4. 4.2. Eigenvalues of a tensor of order greater than two Following the publication of the articles by Qi (2005) and Lim (2005) on the eigenvalues and singular values of a tensor, there was extensive research concerning the spectral properties of tensors. There are various notions and problems underlying the concept of eigenvalue of a tensor, such as the notions of characteristic polynomial and hyperdeterminant, the Perron–Frobenius theorem for non-negative tensors, the computation of eigenvalues and therefore solving polynomial equations, among many other results that generalize known results from the matrix case. See for example the surveys on the spectral theory of tensors by Qi (2012) and Chang et al. (2013). The notion of eigenvalue for a tensor can be defined using either the mode-p product (Qi 2005) or the Einstein product (Cui et al. 2016; Liang et al. 2018). In the latter case, we speak of an eigentensor rather than an eigenvector. Five different definitions are introduced in the next section. 4.2.1. Different definitions of the eigenvalues of a tensor The work by Qi (2005) was originally motivated by research into positive definite homogeneous polynomials such as [3.269]. These polynomials play an important role in automatic control when studying the stability of nonlinear autonomous systems of . the form x(t) = f (x(t)) using Lyapunov’s direct method1. .

1 Consider the continuous-time autonomous nonlinear system x(t) = f (x(t)), with f (0) = 0. Lyapunov’s direct method aims to construct a positive definite polynomial function (called a Lyapunov function) V (x) > 0 in the neighborhood of the origin, i.e. on a ball Br of radius r around the origin, with V (0) = 0, such that: . ∂V T . ∂V T V (x) = ( ) x(t) = ( ) f (x) < 0 for ∀x ∈ Rn , x = 0, ∂x ∂x

Eigenvalues and Singular Values of a Tensor

225

They also allow us to determine whether a tensor is positive definite (see the definition in Table 4.2). [P ;I]

The notion of eigenvalue of a real symmetric2 hypercubic tensor A ∈ RS defined by Qi (2005) in terms of the multiple tensor–vector product [3.274]. [P ;I]

D EFINITION 4.1.– A pair (λ, x) is said to be an eigenpair of A ∈ RS the following system of homogeneous polynomial equations: AxP −1 = λx[P −1] ,

was

if it satisfies [4.5]

−1 T ] is the vector whose components are equal to the where x[P −1] = [x1P −1 , · · · , xP I components of x raised to the power P − 1, i.e. (x[P −1] )i = xiP −1 for i ∈ I. The system [4.5] is equivalent to the I following polynomial equations: I  i2 =1

···

I 

−1 ai,i2 ,··· ,iP xi2 · · · xiP = λxP , i ∈ I. i

[4.6]

iP =1

In the case where the pair (λ, x) ∈ R × {RI \{0}} is real, it is called an H-eigenpair. Equation [4.5] implies that λ ∈ C is an eigenvalue of A if and only if it is a non-zero root of the characteristic polynomial φ(λ) = det(A − λ IP,I ), where IP,I = [δi1 ,··· ,iP ], with ip ∈ I for p ∈ P , is the identity tensor of order P . The hyperdeterminant of the tensor A, denoted det(A), is defined as the resultant of the I homogeneous polynomials (AxP −1 )i , with i ∈ I, in the I components xi of x:   det(A) = Res (AxP −1 )1 , · · · , (AxP −1 )I , [4.7] where (AxP −1 )i is the ith component of the vector AxP −1 , i ∈ I. ∂V ∂V T where ∂V  [ ∂x , · · · , ∂x ] is the gradient of V (x). The origin is then an asymptotically ∂x n 1 stable equilibrium point with domain of attraction Br , i.e. lim x(t) = 0, ∀x(0) ∈ Br . .

t→+∞

In the case of an autonomous linear system x(t) = Ax(t), a Lyapunov function is given by the quadratic form V (x) = xT Px, where P ∈ Rn×n is a positive definite symmetric matrix. The Lyapunov condition is then written as: .   .T . V (x) = x Px + xT Px = xT AT P + PA x < 0, which leads to the Lyapunov equation AT P + PA = −Q, with Q > 0. The autonomous . linear system x(t) = Ax(t) is asymptotically stable at the origin if and only if, for a given positive definite matrix Q, the Lyapunov equation admits precisely one solution P > 0. [P ;I] 2 Recall that RS denotes the set of real symmetric tensors of order P and dimensions I.

226

Matrix and Tensor Decompositions in Signal Processing

R EMARK 4.1.– – Resultants play an important role in the study of polynomial equations. The notion of resultant, which generalizes that of determinant, is used to establish whether a system of N homogeneous polynomial equations in N variables admits a non-zero solution, or equivalently if the N polynomials defining the N equations have a shared non-zero root. Thus, the system of I polynomial equations AxP −1 = 0 admits a non-zero solution if and only if det(A) = 0, with det(A) as defined in [4.7]. The resultant is a homogeneous polynomial in the coefficients aiP of A of degree I(P − 1)I−1 . – In the case of a system of linear equations Ax = 0, with A = [aij ] ∈ RI×I , the resultant coincides exactly with the determinant of A, which is a homogeneous polynomial of degree I in the coefficients aij . – The theory of resultants is closely linked to the theories of discriminants and hyperdeterminants (Gelfand et al. 1992; Morozov and Shakirov 2010). The discriminant of a polynomial Q is used to determine the existence of multiple roots. Thus, since the existence of a double root implies the existence of a root shared by Q and its derivative Q , the discriminant of a unitary polynomial Q of degree N , denoted Dis(Q), is given by Dis(Q) = (−1)N (N −1)/2 Res(Q, Q ), where Res(Q, Q ) is the resultant of Q and Q . See Coste (2001) for further details. If the order P is even, the eigenvalues defined in [4.5] can be interpreted in terms of optimizing the following polynomial form (Lim 2005): f (x) = AxP subject to the constraint xP = 1,

[4.8]

 I  P 1/P where .P is the lP norm defined by xP = , and i=1 |xi |    I P P a x = a x , using the index AxP = i ,i ,··· ,i i i i P p p i1 ,i2 ,··· ,iP =1 1 2 p=1 p=1 P convention. P ROOF .– This optimization with  an equality  constraint can be solved using the Lagrangian L(x, λ) = AxP − λ xP − 1 , where λ is the Lagrange multiplier. For P −1 P even P , since the gradient of the lP norm is given by ∂ x

= x[P −1] /xP , the P ∂x Karush–Kuhn–Tucker (KKT) optimality conditions give: ∂L(x, λ) ∂L(x, λ) = P AxP −1 −λP x[P −1] = 0 , = 1−xP P = 0, [4.9] ∂x ∂λ from which we deduce AxP −1 = λx[P −1] , namely equation [4.5] defining the eigenvalues. The corresponding pair (λ, x) is called an lP -eigenpair by Lim (2005). We can therefore conclude that the H-eigenvalues are identical to the lP -eigenvalues (see Table 4.1).  R EMARK 4.2.– For P = 2 (matrix case), we recover the interpretation of the eigenvalues of a matrix A, defined as solutions of the equation Ax = λx, in terms of

Eigenvalues and Singular Values of a Tensor

optimizing the quadratic form Ax2 x22 = xT x = 1 (see Table 1.3).



227

xT Ax subject to the constraint

Another definition of an eigenpair (λ, x) was given by Qi (2005) as follows. D EFINITION 4.2.– A pair (λ, x) is an eigenpair if it is a solution of the following system of non-homogeneous polynomial equations, subject to the constraint that x has unit Euclidean norm: AxP −1 = λx subject to the constraint x22 = 1.

[4.10]

If λ and x are real, the pair (λ, x) is called a Z-eigenpair3. In the case of a complex pair (λ, x) ∈ C×{CI \{0}}, it is called an E-eigenpair by Qi (2005), where the letter E refers to the Euclidean norm in [4.10]. R EMARK 4.3.– We can make the following remarks: – In the matrix case, the definitions [4.5] and [4.10] are equivalent to Ax = λx. – Unlike the matrix case, where the eigenvalues and eigenvectors of a symmetric/Hermitian matrix are all real (see Table 1.4), the same is not true for a real symmetric tensor, which explains the importance of real H- and Z-eigenvalues when studying positive/negative definite tensors. – Two E-eigenpairs (λ, x) and (λ , x ) are said to be equivalent if there exists a non-zero complex number α ∈ C such that (Cartwright and Sturmfels 2013): λ = αP −2 λ and x = αx.

[4.11]

Indeed, we have A(x )P −1 = αP −1 AxP −1 and λ x = αP −1 λx. Hence, taking [4.10] into account, we deduce that A(x )P −1 = λ x . This means that if (λ, x) is an E-eigenvalue of A, then (αP −2 λ, αx) is also an E-eigenpair for any complex number α = 0. – Since the system of polynomial equations [4.5] is homogeneous of degree P −1, if x is an eigenvector associated with the eigenvalue λ, then αx is also an eigenvector associated with λ, for any complex number α = 0. As we saw in the previous remark, this is not the case for the system of polynomial equations [4.10] due to the presence of the constraint x2 = 1 instead of the constraint xP = 1. The normalization constraint on the eigenvectors based on the l2 norm in [4.10] implies that there exists an E-eigenpair equivalent to (λ, x) given by ((−1)P λ, −x), which means that for odd P the eigenvalues are determined up to their sign, whereas for even P the sign indeterminacy disappears. 3 The name of Z-eigenvalue was suggested by Zhou (2004), as cited in Qi (2005).

228

Matrix and Tensor Decompositions in Signal Processing

– From the definitions [4.5] and [4.10], we can conclude that computing the eigenvalues of a tensor of order greater than two is a nonlinear problem in the sense that it requires us to solve a system of polynomial equations in the components xi of the eigenvector x. This problem has been addressed by various articles, in particular to compute the largest and smallest eigenvalues of a symmetric tensor (Hu et al. 2013; Hao et al. 2015). Like for matrices, the spectral radius of a tensor A is the quantity defined as follows: ρ(A) = max|λi (A)|,

[4.12]

i

with λi (A) ∈ sp(A), where sp(A) is the spectrum, i.e. the set of eigenvalues of A. The geometric multiplicity mi of an eigenvalue λi is defined as the maximum number of linearly independent eigenvectors associated with λi . Unlike the matrix case, the spectral radius of a tensor depends on the definition of its eigenvalues. [3;2]

E XAMPLE 4.4.– For a third-order tensor A ∈ RS becomes: 2 

of dimensions two, equation [4.5]

aijk xj xk = λx2i with i ∈ {1, 2},

[4.13]

j,k=1

or in developed form: a111 x21 + (a112 + a121 )x1 x2 + a122 x22 = λx21 a211 x21 + (a212 + a221 )x1 x2 + a222 x22 = λx22 . For a positive tensor A such that a111 = a222 = 1, a122 = a211 = , with 0 < < 1, and aijk = 0 for the other triplets (i, j, k), the above system of equations simplifies as follows (Pearson 2010): x21 + x22 = λx21

x21 + x22 = λx22 . By adding and subtracting both sides of these equations, we obtain the eigenvalues λ1 = 1 + and λ2 = 1 − , with eigenvectors satisfying the equations x21 = x22 and x21 = −x22 , respectively. For λ1 , we have the real eigenvectors x = (1, 1) and x = (1, −1), whereas for λ2 we have the complex eigenvectors x = (1, i) and x = (1, −i), where i2 = −1. We can therefore conclude that the real geometric multiplicity of λ1 and the complex geometric multiplicity of λ2 are both equal to 2. [P ;I]

Qi (2005) proved the following properties for a real symmetric tensor A ∈ RS

:

Eigenvalues and Singular Values of a Tensor

229

– The number of eigenvalues is equal to I(P − 1)I−1 , which is the degree of the resultant defined in [4.7] (see remark 4.1). Their product is equal to det(A), the resultant of the I homogeneous polynomials (AxP −1 )i , with i ∈ I. The sum of all eigenvalues is equal to (P − 1)I−1 tr(A). – If P is even, then A is positive (semi-)definite if and only if all its H- and Z-eigenvalues are positive (non-negative); furthermore, a necessary condition for A to be positive semi-definite is det(A) ≥ 0. – The Z-eigenpair corresponding to the Z-eigenvalue with the largest absolute value gives the best rank-one approximation of A (Qi 2005; Qi et al. 2009) (see proposition 4.12). We therefore have the following result (Qi 2005): P ROPOSITION 4.5.– Any real symmetric even-order tensor A is positive (semi-)definite (see the definition in Table 4.2) if and only if its smallest H-eigenvalue is positive (non-negative). P ROOF .– Let (λopt , xopt ) be an H-eigenpair that minimizes the criterion [4.8] and is [P −1] −1 therefore a solution of the equation AxP = λopt xopt . Taking the inner product of opt the two sides of this equation with the vector xopt , we have: [P −1]

−1 P AxP opt , xopt  = Axopt and xopt

I

, xopt  =

I 

xP opt,i ,

i=1

= λopt Taking into account the hypothesis from which we deduce that P is even, which implies that xP ≥ 0 for i ∈ I, as well as the fact that A opt,i is positive definite, in the sense that AxP > 0, ∀x ∈ RI , we deduce from the above equality that λopt > 0. Hence, the smallest eigenvalue λopt is positive (or non-negative for a positive semi-definite tensor A).  AxP opt

P i=1 xopt,i .

To illustrate the H- and Z-eigenvalues, let us consider the case of a symmetric [3;I] third-order tensor A ∈ RS of dimensions I. In the case of an H-eigenpair, the eigenvalues are obtained by solving the system of homogeneous polynomial equations deduced from [4.6], which are of degree two in the components of x ∈ RI , x = 0: I 

aijk xj xk = λx2i , i ∈ I,

[4.14]

j,k=1

whereas in the case of a Z-eigenpair the system of non-homogeneous polynomial equations that must be solved is as follows: I  j,k=1

aijk xj xk = λxi , i ∈ I.

[4.15]

230

Matrix and Tensor Decompositions in Signal Processing

Since equations [4.15] are not homogeneous, the eigenvectors are not invariant up to a scaling factor. This is due to the constraint x22 = 1. To satisfy this invariance property,  we need to consider the l3 norm for the constraint I x3 = ( i=1 |xi |3 )1/3 = 1, which then gives us equations [4.14]. The notion of generalized eigenpair introduced by Chang et al. (2009) is defined as follows. For more details, see the articles by Kolda and Mayo (2014) and Ding and Wei (2015). [P ;I]

D EFINITION 4.3.– A pair (λ, x) is said to be a B-eigenpair of A ∈ RS solution of the following system of homogeneous polynomial equations: AxP −1 = λBxP −1 .

if it is a [4.16]

It is also called a B-eigenvalue and a B-eigenvector. This generalized eigenpair can be interpreted in terms of a constrained optimization (Kolda and Mayo 2014): min/max f (x) =

AxP P xP P subject to the constraint xP = 1, BxP

[4.17]

[P ;I]

where A, B ∈ RS are symmetric tensors of order P and dimensions I, the tensor A is positive semi-definite, the tensor B is positive definite (so that BxP > 0 for ∀x = 0) and λ is the Lagrange multiplier defined as: λ=

AxP . BxP

[4.18]

  P ROOF .– After defining the Lagrangian as L(x, λ) = f (x) − λ xP P − 1 , the KKT optimality conditions [4.9] become:  ∂L(x, λ) AxP P  P P −1 P −1 (Ax − P λx = 0 )x + Ax − ( )Bx = ∂x BxP BxP ∂L(x, λ) [4.19] = 1 − xP P = 0, ∂λ from which we deduce equations [4.16] and [4.18]. For more details about the proof, see Kolda and Mayo (2014), who also suggest a method for computing generalized eigenpairs.  R EMARK 4.6.– The definition [4.5] can be deduced as a special case of [4.16] by choosing B = IP , the identity tensor of order P . Indeed, we then have, for x ∈ RI : BxP = IP , x◦P  =

I 

xpi = xP P

[4.20]

i=1

BxP −1 = IP xP −1 = x[P −1] .

[4.21]

Eigenvalues and Singular Values of a Tensor

231

Equation [4.16] with B = IP then becomes AxP −1 = λx[P −1] , i.e. equation [4.5]. As we mentioned earlier, the notion of eigenvalue can also be defined using the Einstein product (Liang et al. 2018). D EFINITION 4.4.– Given a complex square tensor A ∈ CI P ×I P of order 2P , a pair (λ, X ) ∈ C × CI P ×I P , with X = O, is said to be an eigenvalue–eigentensor pair if it satisfies the following equation: A P X = λX ,

[4.22]

which amounts to solving the following system of tensor equations: (λ J2P − A) P X = O.

[4.23]

Taking the property [3.24] into account, equation [4.23] can also be written as: (λ II − AI×I )XI×I = 0 , I =

P 

Ip .

[4.24]

p=1

The eigenvalues are then obtained by solving the characteristic equation of the square tensor A expressed in terms of its matrix unfolding AI×I : det(λ II − AI×I ) = 0,

[4.25]

and the eigentensor associated with the eigenvalue λ is determined by solving the tensor equation [4.23] or the matrix equation [4.24]. D EFINITION 4.5.– A variant of the eigenvalue–eigentensor problem was proposed by Cui et al. (2015) in the form of equation [4.22], with a tensor X ∈ CI P . Equation [4.24] then becomes: (λ II − AI×I )xI = 0,

[4.26]

where xI ∈ CI is a vectorized form of the tensor X obtained using a mode combination as defined in [3.16]. The eigenvalue–eigentensor problem is then reduced to a standard eigenvalue–eigenvector problem for the matrix AI×I , and the eigentensor is computed in the vectorized form xI by solving equation [4.26]. The eigenvalue problem for a Toeplitz tensor was also solved by Cui et al. (2015) and applied to image restoration. The various definitions of tensor eigenpairs introduced above are summarized in Table 4.1.

232

Matrix and Tensor Decompositions in Signal Processing

Systems of polynomial equations [P ;I]

(λ, x) ∈C×

A ∈ RS

{CI \{0}}

AxP −1

=

Eigenpairs eigenvalue-eigenvector

λx[P −1]

eigenpair

∈ C × {CI \{0}}

AxP −1 = λx , xT x = 1

E-eigenpair

∈ R × {RI \{0}}

AxP −1 = λx[P −1]

H-eigenpair

∈R×

{RI \{0}}

(λ, x)

AxP −1

= λx ,

xT x

=1

Systems of polynomial equations [P ;I]

A, B ∈ RS

AxP −1 = λBxP −1

(λ, X ), X = O

A ∈ CI P ×I P

{CI P

A P X = λX

\{O}}

Generalized eigenpair

with A ≥ 0, B > 0

∈ C × {CI \{0}}

∈C×

Z-eigenpair

generalized eigenpair

eigenvalue-eigentensor

Table 4.1. Eigenpairs and generalized eigenpairs for tensors

4.2.2. Positive/negative (semi-)definite tensors The multiple tensor-vector product [3.274] and the notion of eigenvalue of a tensor have been used to define various classes of tensors, such as positive definite tensors, M -tensors and P-tensors (Ding et al. 2013; Zhang et al. 2014; Song and Qi 2015). In the same way that a positive (respectively, negative) definite matrix A ∈ RI×I S satisfies the property that the quadratic form xT Ax = A ×1 xT ×2 xT , denoted Ax2 , is positive (respectively, negative), a real symmetric hypercubic tensor A ∈ [P ;I] RS is positive definite (respectively, negative) if the P -linear form AxP is itself positive (respectively, negative) definite, as summarized in Table 4.2. These definitions generalize those of Table 1.5 to tensors of order greater than two (see Chen and Qi 2015). Properties A positive semi-definite (A ≥ 0)

Conditions AxP

≥ 0, ∀x ∈ RI

A positive definite (A > 0)

AxP > 0, ∀x ∈ RI , x = 0

A negative semi-definite (A ≤ 0)

AxP ≤ 0, ∀x ∈ RI

A negative definite (A < 0)

AxP < 0, ∀x ∈ RI , x = 0 [P ;I]

Table 4.2. Properties of hypercubic symmetric tensors A ∈ RS

Eigenvalues and Singular Values of a Tensor

233

4.2.3. Orthogonally/unitarily similar tensors Analogously to the definition [1.21] of orthogonally similar matrices, we say that the tensors A and B ∈ R[P ;I] are orthogonally similar if there exists an orthogonal matrix Q ∈ RI×I such that (Qi 2012): P

B = A ×1 Q ×2 Q · · · ×P Q  A × Q,

[4.27]

p=1

or alternatively, in scalar form and using the index convention: I 

bi1 i2 ···iP =

qi1 k1 qi2 k2 · · · qiP kP ak1 k2 ···kP

[4.28]

k1 ,k2 ,··· ,kP =1

biP = akP

P 

q ip k p .

[4.29]

p=1

Similarly, in the complex case, analogously to the definition [1.22] in the matrix case, we say that the tensors A and B ∈ C[2P ;I] are unitarily similar if there exists a unitary matrix Q ∈ CI×I such that: P

2P

p=1

p=P +1

B = A ×1 Q ×2 · · · ×P Q ×P +1 Q∗ ×P +2 · · · ×2P Q∗  A × Q

×

Q∗ .

[4.30] The relations [4.28] and [4.29] then become: biP , j = P

I 

qi1 k1 · · · qiP kP qj∗1 n1 · · · qj∗P nP ak1 ···kP n1 ···nP

k1 ,··· ,kP ,n1 ,··· ,nP =1

[4.31] = a kP , nP

P 

qip kp qj∗p np .

[4.32]

p=1

R EMARK 4.7.– We have the following results: – If A is symmetric, then B defined in [4.27] is also symmetric. – Using the orthogonality property of Q, we deduce from [4.27] that: P

P

p=1

p=1

A = B × Q−1 = B × QT .

[4.33]

Similarly, from [4.30], with a unitary matrix Q, we deduce: P

2P

p=1

p=P +1

A = B × QH

×

QT .

[4.34]

234

Matrix and Tensor Decompositions in Signal Processing

PROPOSITION 4.8.– Any two orthogonally similar symmetric tensors have the same eigenvalues in the sense of the definition [4.10]. Furthermore⁴, if (λ, x) is an eigenpair of A, then (λ, Qx) is an eigenpair of B. This property generalizes the corresponding property of orthogonally similar matrices to the case of tensors of order greater than two (see Table 1.6).

PROOF.– Let A and B be symmetric and orthogonally similar tensors, according to the relation [4.27], and let (λ, x) be an eigenpair of A in the sense of the definition Ax^{P−1} = λx. Defining the vector y = Qx, or equivalently x^T = y^T Q, we have:

Ax^{P−1} = A ×_{p=2}^{P} x^T = A ×_{p=2}^{P} (y^T Q) = (A ×_{p=2}^{P} Q) ×_{p=2}^{P} y^T    [4.35]

λx = λQ^{−1}Qx = λQ^{−1}y.    [4.36]

From these two equations, we deduce:

(A ×_{p=2}^{P} Q) ×_{p=2}^{P} y^T = λQ^{−1}y.    [4.37]

By pre-multiplying the two sides of this equation by Q, we obtain:

(A ×_{p=1}^{P} Q) ×_{p=2}^{P} y^T = B ×_{p=2}^{P} y^T = λy,    [4.38]

which proves that (λ, Qx) is an eigenpair of B. □
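The key step of this proof can be checked numerically with the sketch below (an illustration, not from the book; the contraction helpers are hypothetical): for any vector x, contracting B = A ×_p Q over modes 2, …, P with y = Qx equals Q(Ax^{P−1}), so an eigenpair (λ, x) of A indeed maps to the eigenpair (λ, Qx) of B.

```python
import numpy as np

rng = np.random.default_rng(2)

def mode_n_product(T, M, n):
    return np.moveaxis(np.tensordot(M, T, axes=([1], [n])), 0, n)

def contract_last_modes(T, x):
    """Compute A x^{P-1}: contract x along modes 2,...,P, leaving a vector."""
    for _ in range(T.ndim - 1):
        T = np.tensordot(T, x, axes=([1], [0]))
    return T

P, I = 3, 4
A = rng.standard_normal((I,) * P)
A = (A + A.transpose(0, 2, 1) + A.transpose(1, 0, 2)
       + A.transpose(1, 2, 0) + A.transpose(2, 0, 1) + A.transpose(2, 1, 0)) / 6  # symmetrize
Q, _ = np.linalg.qr(rng.standard_normal((I, I)))

B = A
for n in range(P):
    B = mode_n_product(B, Q, n)                      # B = A x_1 Q ... x_P Q

x = rng.standard_normal(I)
y = Q @ x
print(np.allclose(contract_last_modes(B, y), Q @ contract_last_modes(A, x)))  # True
```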

REMARK 4.9.– In the case of a symmetric tensor A ∈ R_S^{[P;I]}, the HOSVD decomposition [5.15] can be written as:

A = G ×_{p=1}^{P} U,    [4.39]

where G is symmetric and U is orthogonal. The tensors A and G are therefore orthogonally similar, and, by the previous proposition, we conclude that the eigenpair (λ, x) of G corresponds to the eigenpair (λ, Ux) of A. Equivalently, each eigenpair (λ, y) of A is associated with the eigenpair (λ, U^T y) of G. Note that, in the general (non-symmetric) case, the HOSVD of the tensor A is not determined by computing its eigenvalues but by computing the singular values of its P modal unfoldings. In the symmetric case, U is composed of the left singular vectors of one of the P modal unfoldings, which are all identical due to the symmetry of A.

4 Our proof of this result is different from the one given in Qi (2005).


4.3. Best rank-one approximation

The best rank-one approximation of a tensor plays a very important role in applications involving the simplification of the representation of a data tensor (de Lathauwer et al. 2000b; Zhang and Golub 2001; Kofidis and Regalia 2002). Recall that, in the matrix case, the best rank-one approximation, denoted Â, in the sense of the Frobenius norm, of a matrix A ∈ K^{I×J} of rank R, is directly obtained from the compact SVD, namely A = Σ_{r=1}^{R} σ_r u_r v_r^H, as Â = σ_1 u_1 v_1^H, where σ_1 is the largest singular value of A, and (u_1, v_1) are the left and right singular vectors associated with σ_1. The approximation error is given by ‖A − Â‖²_F = Σ_{r=2}^{R} σ_r², with σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_R > 0 (see section 1.5.7). In this section, we present the link between the best rank-one approximation of a given tensor and the maximization of a homogeneous polynomial.

PROPOSITION 4.10.– Let A ∈ R^{I_P} be a tensor of order P and size I_1 × ⋯ × I_P. The best rank-one approximation of A, in the sense of the Frobenius norm, is obtained by minimizing the following criterion with respect to the scalar λ and the vectors x^(p) ∈ R^{I_p}:

f(λ, x^(1), ⋯, x^(P)) = ‖A − λ ◦_{p=1}^{P} x^(p)‖²_F    [4.40]
= Σ_{i_1=1}^{I_1} ⋯ Σ_{i_P=1}^{I_P} ( a_{i_P} − λ ∏_{p=1}^{P} x^(p)_{i_p} )²    [4.41]

subject to the constraint ‖x^(p)‖_2 = 1 for every p ∈ ⟨P⟩. A sub-optimal solution is obtained by minimizing [4.40] in two steps. First, the criterion is minimized with respect to λ while assuming that the vectors x^(p) are fixed, then the criterion is minimized with respect to x^(p) using the optimized value of λ found during the first step. This second step is the object of proposition 4.13. The first minimization leads us to maximize the following criterion with respect to the vectors x^(p):

⟨A, ◦_{p=1}^{P} x^(p)⟩² = ( A ×_{p=1}^{P} x^(p)T )² = ( Σ_{i_1=1}^{I_1} ⋯ Σ_{i_P=1}^{I_P} a_{i_P} ∏_{p=1}^{P} x^(p)_{i_p} )²    [4.42]
= ( a_{i_P} ∏_{p=1}^{P} x^(p)_{i_p} )²,    [4.43]

where the last equality follows from the index convention.


PROOF.– The criterion [4.40] can be developed as follows:

‖A − λ ◦_{p=1}^{P} x^(p)‖²_F = ⟨A − λ ◦_{p=1}^{P} x^(p), A − λ ◦_{p=1}^{P} x^(p)⟩
= ‖A‖²_F − 2λ⟨A, ◦_{p=1}^{P} x^(p)⟩ + λ²‖◦_{p=1}^{P} x^(p)‖²_F.    [4.44]

Since the vectors x^(p) are assumed to be fixed, the minimization of [4.44] with respect to λ is obtained by setting the gradient with respect to λ equal to zero, which gives the following solution:

λ_opt(x^(p)) = ⟨A, ◦_{p=1}^{P} x^(p)⟩ / ‖◦_{p=1}^{P} x^(p)‖²_F.    [4.45]

This expression can be interpreted as a high-order generalized Rayleigh quotient. Taking into account the constraint ‖x^(p)‖_2 = 1 for every p ∈ ⟨P⟩, as well as the expression [3.246] for the Frobenius norm of a rank-one tensor of order P, we have:

‖◦_{p=1}^{P} x^(p)‖²_F = ∏_{p=1}^{P} ‖x^(p)‖²_2 = 1.    [4.46]

By proposition 3.81, the solution [4.45] can also be written as:

λ_opt(x^(p)) = ⟨A, ◦_{p=1}^{P} x^(p)⟩ = A ×_{p=1}^{P} x^(p)T = a_{i_P} ∏_{p=1}^{P} x^(p)_{i_p}.    [4.47]

After replacing λ with this value λ_opt in [4.44] and taking the relation [4.46] into account, the value of the minimized criterion is given by:

‖A − λ_opt ◦_{p=1}^{P} x^(p)‖²_F = ‖A‖²_F − 2λ_opt⟨A, ◦_{p=1}^{P} x^(p)⟩ + λ²_opt = ‖A‖²_F − λ²_opt
= ‖A‖²_F − ( a_{i_P} ∏_{p=1}^{P} x^(p)_{i_p} )².    [4.48]

Since the Frobenius norm of A is constant, minimizing the above criterion is equivalent to maximizing λ²_opt and hence to maximizing the criterion [4.43]. □

REMARK 4.11.– In the case of a symmetric tensor A ∈ R_S^{[P,I]} of order P and dimensions I, the criterion [4.40] is written as f(λ, x) = ‖A − λ I_P x^P‖²_F, where I_P is the identity tensor of order P. By the above proposition, minimizing this criterion is equivalent to maximizing the criterion A ×_{p=1}^{P} x^T = Ax^P subject to the constraint of unit Euclidean norm. Hence, we can conclude that maximizing the criterion [4.43] gives the largest eigenvalue of A, in the sense of the definition [4.10]. This result is summarized in the next proposition.

PROPOSITION 4.12.– The best rank-one approximation of a symmetric tensor A ∈ R_S^{[P,I]} is given by the eigenvalue, according to the definition [4.10], with the largest absolute value, which can be obtained by solving the system of polynomial equations Ax^{P−1} = λx.

In the general case of a tensor A ∈ R^{I_P}, maximizing the criterion [4.43] with respect to the vectors x^(p) subject to the constraint ‖x^(p)‖_2 = 1 for every p ∈ ⟨P⟩ constitutes the second step of proposition 4.10. This maximization leads to the solution described in the next proposition.

PROPOSITION 4.13.– The best rank-one approximation of a tensor A ∈ R^{I_P} is given by λ ◦_{p=1}^{P} x^(p), where the unit-norm vectors x^(p) and the scalar λ are solutions of the following polynomial equations:

A ×_{q=1, q≠p}^{P} x^(q)T = λ x^(p)    [4.49]
A ×_{p=1}^{P} x^(p)T = λ.    [4.50]

PROOF.– The criterion [4.43] can be maximized subject to the constraint ‖x^(p)‖_2 = 1 for every p ∈ ⟨P⟩ using the Lagrangian:

L(λ_1, ⋯, λ_P, x^(1), ⋯, x^(P)) = A ×_{p=1}^{P} x^(p)T − Σ_{p=1}^{P} (λ_p/2)( ‖x^(p)‖²_2 − 1 ).    [4.51]

By setting the partial derivatives of the Lagrangian [4.51] with respect to the optimization variables to zero, we obtain the following equations, for every p ∈ ⟨P⟩, as the KKT optimality conditions:

∂L/∂λ_p = (1/2)(1 − ‖x^(p)‖²_2) = 0    [4.52]
∂L/∂x^(p) = A ×_{q=1, q≠p}^{P} x^(q)T − λ_p x^(p) = 0.    [4.53]


By considering the inner product of the vector equation ∂L/∂x^(p) = 0 with the vector x^(p), and taking equation [4.52] into account, we obtain:

⟨∂L/∂x^(p), x^(p)⟩ = A ×_{p=1}^{P} x^(p)T − λ_p ‖x^(p)‖²_2    [4.54]
= A ×_{p=1}^{P} x^(p)T − λ_p = 0,    [4.55]

from which we deduce that the λ_p are all identical and equal to λ = A ×_{p=1}^{P} x^(p)T. The optimality equations [4.53] therefore give the best rank-one approximation, as defined in [4.49], where λ is given by the expression [4.50]. □

EXAMPLE 4.14.– To illustrate the above results, consider the case of a third-order tensor A ∈ R^{I×J×K} (Zhang and Golub 2001). Writing the vectors x^(p), for p ∈ {1, 2, 3}, as x ∈ R^I, y ∈ R^J, z ∈ R^K, the optimality equations [4.49]–[4.50] become:

A ×_2 y ×_3 z = λx ⇔ Σ_{j,k} a_{ijk} y_j z_k = λx_i, i ∈ ⟨I⟩    [4.56]
A ×_1 x ×_3 z = λy ⇔ Σ_{i,k} a_{ijk} x_i z_k = λy_j, j ∈ ⟨J⟩    [4.57]
A ×_1 x ×_2 y = λz ⇔ Σ_{i,j} a_{ijk} x_i y_j = λz_k, k ∈ ⟨K⟩    [4.58]
A ×_1 x ×_2 y ×_3 z = Σ_{i,j,k} a_{ijk} x_i y_j z_k = λ.    [4.59]

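These coupled equations suggest an alternating scheme in which each vector is updated from the contraction of A with the other two and then normalized. A minimal NumPy sketch of such an iteration is given below (an illustration only, not the book's algorithm; the random tensor and iteration count are arbitrary assumptions, and the scheme may converge to a local optimum).

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, K = 4, 5, 6
A = rng.standard_normal((I, J, K))

def unit(v):
    return v / np.linalg.norm(v)

x = unit(rng.standard_normal(I))
y = unit(rng.standard_normal(J))
z = unit(rng.standard_normal(K))

for _ in range(100):                                  # alternate over [4.56]-[4.58]
    x = unit(np.einsum('ijk,j,k->i', A, y, z))
    y = unit(np.einsum('ijk,i,k->j', A, x, z))
    z = unit(np.einsum('ijk,i,j->k', A, x, y))

lam = np.einsum('ijk,i,j,k->', A, x, y, z)            # [4.59]
A_hat = lam * np.einsum('i,j,k->ijk', x, y, z)        # rank-one approximation (local optimum)
print(np.linalg.norm(A - A_hat) / np.linalg.norm(A))  # relative approximation error
```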
In Zhang and Golub (2001), the Newton and alternating least squares (ALS) algorithms are suggested to solve these polynomial equations. See section 5.2.5.8 for the ALS algorithm.

REMARK 4.15.– Unlike the matrix case, the low-rank approximation problem is ill-posed for tensors. De Silva and Lim (2008) showed that a tensor A ∈ R^{I_P} of order P does not admit a best approximation of rank R, with R ≤ min(I_1, ⋯, I_P), in general.

4.4. Orthogonal decompositions

In Chapter 1, we saw that any real symmetric matrix A ∈ R_S^{I×I} of rank R always admits an orthogonal decomposition, namely the eigendecomposition, which can be written as:

A = UΛU^T = Σ_{r=1}^{R} λ_r u_r u_r^T = Σ_{r=1}^{R} λ_r u_r ◦ u_r,    [4.60]

where Λ = diag(λ_1, ⋯, λ_R) ∈ R^{R×R} is a diagonal matrix, and U = [u_1 ⋯ u_R] ∈ R^{I×R} is a column orthonormal matrix (u_i^T u_j = δ_ij). The eigenvalues are determined by solving the equation Au_i = λ_i u_i, which is equivalent to determining the extrema of the Rayleigh quotient u^T A u / ‖u‖²_2.

In the case of a real symmetric tensor A ∈ R_S^{I_P} of order P, an orthogonal decomposition is defined as a decomposition of the form:

A = Σ_{r=1}^{R} λ_r (u_r ◦ ⋯ ◦ u_r) [P terms] ≜ Σ_{r=1}^{R} λ_r u_r^{◦P},    [4.61]

where u_r is the eigenvector of unit Euclidean norm associated with the eigenvalue λ_r, as defined in [4.10]. It can be obtained by solving the system of polynomial equations [4.10], written in scalar form as:

Σ_{i_2=1}^{I} ⋯ Σ_{i_P=1}^{I} a_{i,i_2,⋯,i_P} u_{i_2} ⋯ u_{i_P} = λu_i, i ∈ ⟨I⟩,    [4.62]

with Σ_{i=1}^{I} u_i² = 1. It can also be obtained using a variational approach that maximizes the generalized Rayleigh quotient (Lim 2005). Note that, unlike the matrix case, not every real symmetric tensor admits such an orthogonal decomposition.

4.5. Singular values of a tensor

We define the singular values and singular vectors u ∈ R^I, v ∈ R^J, and w ∈ R^K of a third-order tensor A ∈ R^{I×J×K} as the values and critical points of the Rayleigh quotient:

A ×_1 u ×_2 v ×_3 w / ( ‖u‖_3 ‖v‖_3 ‖w‖_3 ).    [4.63]

The I,J,K

J,K 

aijk vj wk = σu2i , i ∈ I

[4.64a]

aijk ui wk = σvj2 , j ∈ J

[4.64b]

j,k=1 I,K  i,k=1 I,J  i,j=1

aijk ui vj = σwk2 , k ∈ K,

[4.64c]

240

Matrix and Tensor Decompositions in Signal Processing

where σ is a singular value of A, and u, v and w are mode-1, -2 and -3 singular vectors, respectively. These equations generalize the equations in Table 1.9, which allow us to compute singular values and singular vectors of a matrix by solving a system of linear equations. This definition can be generalized to a tensor A ∈ RI1 ×···×IN of order N . The mode-n singular vectors, denoted u(n) , with n ∈ N , associated with the singular value σ, are then solutions of the following homogeneous polynomial equations: 

I1 ,··· ,In−1 ,In+1 ,··· ,IN i1 ,··· ,in−1 ,in+1 ··· ,iN =1

(1)

(n−1) (n+1)

(N )

(n)

ai1 ,··· ,in ,··· ,iN ui1 · · · uin−1 uin+1 · · · uiN = σ(uin )N −1 , (n)

where each equation is defined by the factor uin , with n ∈ N  and in ∈ In . For N = 3, we recover equations [4.64a]–[4.64c], (u(1) , u(2) , u(3) ) = (u, v, w) and (I1 , I2 , I3 ) = (I, J, K).

with

5 Tensor Decompositions

5.1. Introduction As we saw in the introductory chapter, tensor-based approaches have several advantages over traditional matrix-based ones, offering in particular the possibility to process multimodal and incomplete large-dimensional data efficiently. The exponential increase of the number of elements in a data tensor with the number of modes, i.e. the tensor order, is called the curse of dimensionality (Oseledets and Tyrtyshnikov 2009). Tensor decompositions, also called tensor models, allow us to represent data tensors by means of matrix factors and lower order tensors, called core tensors. Such representations can be exploited to reduce the storage memory while facilitating the data processing and leading to new solutions for various applications such as source separation, array processing, wireless communications, biomedical signal processing, data analysis and fusion, recommender systems, and estimation of missing data (tensor completion problem), to mention only a few. Many different tensor models exist. Some of them have been developed within the framework of certain applications, as will be described in the next two volumes through the design of new wireless communication systems. Other models like the hierarchical Tucker (HT) and tensor train (TT) decompositions (Oseledets and Tyrtyshnikov 2010; Grasedyck and Hakbush 2011; Oseledets 2011; Ballani et al. 2013), which can be viewed as special cases of tensor networks (TNs) (Cichocki 2014), will also be presented in the next volume. The aim of this chapter is to present the basic Tucker and PARAFAC decompositions, as well as some of their variants such as the Tucker-(N1 , N )decomposition, block Tucker and PARAFAC decompositions, constrained decompositions (PARALIND and CONFAC) and block term decomposition (BTD). The third-order Tucker and PARAFAC models will be described in further detail. Matrix and Tensor Decompositions in Signal Processing, First Edition. Gérard Favier. © ISTE Ltd 2021. Published by ISTE Ltd and John Wiley & Sons, Inc.

242

Matrix and Tensor Decompositions in Signal Processing

For the tensor models considered in this chapter, we will give different forms of representation (scalar forms, with mode-n products, with outer products, vectorized forms and matrix unfoldings). The uniqueness properties of the Tucker and PARAFAC models will be detailed. We will describe the HOSVD algorithm, which is based on computing the SVD of the modal matrix unfoldings, leading to a Tucker model with orthogonal matrix factors in the real case and unitary matrix factors in the complex case. For the parameter estimation of the PARAFAC models, we will present the alternating least squares (ALS) algorithm, which is the most widely used algorithm for estimating the parameters of tensor models. In the last section, four examples of tensor models will be presented to illustrate the use of tensor decompositions: – multidimensional harmonic model, based on the observation of a sum of complex exponentials; – instantaneous linear mixture using the received signals, then using the fourthorder cumulants of the received signals, in the context of source separation; – representation of a linear finite impulse response (FIR) system using the fourthorder cumulants of the output, with the perspective of system identification. 5.2. Tensor models The Tucker model will first be presented in section 5.2.1, with the HOSVD algorithm, whereas the Tucker-(N1 , N ) model will be defined in section 5.2.2. The PARAFAC decomposition will then be considered in section 5.2.5, with the presentation of the ALS algorithm. Some generalizations, including block tensor models (section 5.2.6) and constrained tensor models (section 5.2.7), will also be described. 5.2.1. Tucker model 5.2.1.1. Definition in scalar form For a tensor X ∈ KI N of order N , a Tucker model can be written in scalar form as follows (Tucker 1966): xi1 ,··· ,iN =

R1  r1 =1

···

RN  rN =1

gr1 ,··· ,rN

N 

(n)

ain ,rn ,

[5.1]

n=1

with in ∈ In  for n ∈ N , where gr1 ,··· ,rN is an element of the core tensor G ∈ (n) KRN , and ain ,rn is an element of the factor matrix A(n) ∈ KIn ×Rn , n ∈ N .

Tensor Decompositions

243

Using the index convention, the Tucker model can be rewritten concisely as follows: xi1 ,··· ,iN = gr1 ,··· ,rN

N 

(n)

ain ,rn .

[5.2]

n=1

5.2.1.2. Expression in terms of mode-n products R n (n) Noting that the sum gr1 ,··· ,rN ain ,rn in [5.1] represents the mode-n product of rn =1

the core tensor with the factor matrix A(n) , i.e. G×n A(n) , the Tucker model can also be written in terms of the mode-n products, n ∈ N , as follows: X = G×1 A(1) ×2 A(2) ×3 · · · ×N A(N ) N

 G × A(n) .

[5.3]

n=1

This decomposition, abbreviated as G; A(1) , · · · , A(N) , can be interpreted as N linear transformations of the core tensor G, each associated with a matrix A(n) ∈ KIn ×Rn applied to the mode-n vector space of G ∈ KRN . Indeed, for X = G×n A(n) , we have Xn = A(n) Gn , where Xn and Gn are the mode-n flat unfoldings of X and G as defined in [3.41]. 5.2.1.3. Expression in terms of outer products N  (n) Noting that the factor ain ,rn in [5.1] can be interpreted as the (i1 , · · · , iN )th n=1

(n)

element of the outer product of the column vectors A.rn of the matrix factors A(n) , the Tucker model can also be written as a linear combination of these outer products: X =

R1  r1 =1

···

RN  rN =1

N

(n) gr1 ,··· ,rN ◦ A.r . n

[5.4]

n=1

This expression gives X as a weighted sum of

N 

Rn rank-one tensors, where the

n=1

elements gr1 ,··· ,rN of the core tensor are the weights of the linear combination. Each element gr1 ,··· ,rN can also be interpreted as a weight of the interactions between the (n) columns A.rn , n ∈ N , of the factor matrices. 5.2.1.4. Matricization The matrix unfolding [3.38] of the Tucker model along the mode subsets S1 and S2 is given by the following equation: ! !T (n) (n) GS1 ;S2 ⊗ A ⊗ A ∈ KM1 ×M2 , [5.5] XS1 ;S2 = n∈S1

n∈S2

244

Matrix and Tensor Decompositions in Signal Processing

with GS1 ;S2 ∈ KK1 ×K2 , Ki =



Rn , and Mi =



n∈Si

In , n∈Si

for i = 1 and 2.

P ROOF .– Let us define (I1 , I2 ) and (R1 , R2 ) as the sets of indices in and rn associated with the mode subsets (S1 , S2 ) combined to form the rows and columns of XS1 ;S2 , respectively. Applying the formula [3.40] to the core tensor allows us to write the element gr1 ,··· ,rN as: gr1 ,··· ,rN = eR1 GS1 ;S2 eR2 .

[5.6]

Using the expressions [5.2] and [5.6] of xi1 ,··· ,iN and gr1 ,··· ,rN , we obtain: XS1 ;S2 = xi1 ,··· ,iN eII21 = eI1 xi1 ,··· ,iN eI2 = eI1 gr1 ,··· ,rN

N 

(n)  ain ,rn eI2

n=1

=



N  n=1

=



(n)  ain ,rn eI1 eR1 GS1 ;S2 eR2 eI2

  (n)  I2 (n)  1 ain ,rn eR ain ,rn eR2 . I1 GS1 ;S2

n∈S1

n∈S2

Taking into account the expression [2.102] of the multiple Kronecker product, XS1 ;S2 can now be written as: XS1 ;S2 =

! ⊗ A(n) GS1 ;S2

n∈S1

⊗ A(n)

n∈S2

!T , 

which completes the proof of [5.5]. 5.2.1.5. Vectorization

Applying the vectorization formula [2.113] to equation [5.5] gives the following expression for the vectorized form of the matrix unfolding XS1 ;S2 :     ⊗ A(n) vec(GS1 ;S2 ). ⊗ A(n) ⊗ [5.7] vec(XS1 ;S2 ) = n∈S2

n∈S1

In the case where S1 is the singleton {n} and S2 = {n + 1, · · · , N, 1, · · · , n − 1}, equation [5.5] becomes the mode-n flat matrix unfolding defined in Table 3.5: Xn = A(n) Gn (A(n+1) ⊗ · · · ⊗ A(N ) ⊗ A(1) ⊗ · · · ⊗ A(n−1) )T ∈ KIn ×In+1 ···IN I1 ···In−1 .

[5.8]

Tensor Decompositions

245

The corresponding vectorized form is obtained by applying the identity [2.113] to [5.8], which gives: vec(Xn ) = (A(n+1) ⊗ · · · ⊗ A(N ) ⊗ A(1) ⊗ · · · ⊗ A(n) )vec(Gn ).

[5.9]

This expression can also be deduced from the general vectorized form [5.7] by choosing S1 = {n} and S2 = {n + 1, · · · , N, 1, · · · , n − 1}. Similarly, when S2 is the singleton {n} and S1 = {n + 1, · · · , N, 1, · · · , n − 1}, we obtain the mode-n tall matrix unfolding equal to the transpose of Xn , defined in [5.8], of size In+1 · · · IN I1 · · · In−1 × In . 5.2.1.6. Case of a third-order tensor X ∈ KI×J×K Setting A(1) = A, A(2) = B, A(3) = C, and (R1 , R2 , R3 ) = (P, Q, S) and using the index convention, equations [5.1], [5.3] and [5.4] then become: xijk =

Q  S P  

gpqs aip bjq cks = gpqs aip bjq cks

[5.10]

p=1 q=1 s=1

X = G×1 A×2 B×3 C =

Q  P  S 

[5.11]

gpqs A.p ◦ B.q ◦ C.r = gpqs A.p ◦ B.q ◦ C.s ,

[5.12]

p=1 q=1 s=1

with G ∈ KP ×Q×S , and A ∈ KI×P , B ∈ KJ×Q , C ∈ KK×S . Equation [5.12] translates the fact that the third-order Tucker model, denoted Tucker3, expresses the tensor X as the weighted sum of P QS first-order tensors, where the weights of these first-order tensors are given by the coefficients gpqs of the core tensor G. This Tucker model is denoted G; A, B, C. See Figure I.2 for a graphical representation of this Tucker model. The various representations of a third-order Tucker model are summarized in Table 5.1, where the flat matrix unfoldings and the vectorized form were deduced from the general formulae [5.8] and [5.9], with X1 = XI×JK , X2 = XJ×KI , X3 = XK×IJ , and G1 = GP ×QS , G2 = GQ×SP , G3 = GS×P Q . 5.2.1.7. Uniqueness The Tucker model is not unique in general, since the factors A(n) are invariant up to non-singular matrices Λ(n) whose effect is compensated by the core tensor, i.e. X = G; A(1) , · · · , A(N)  is left unchanged if we replace the factors A(n) with −1 −1 A(n) Λ(n) and the core tensor with C = G; (Λ(1) ) , · · · , (Λ(N ) ) . Indeed, using the property [3.125] of the mode-n product, we have: C; A(1) Λ(1) , · · · , A(N) Λ(N )  = G; A(1) Λ(1) (Λ(1) )

−1

, · · · , A(N) Λ(N ) (Λ(N ) )

= G; A(1) , · · · , A(N)  = X .

−1



246

Matrix and Tensor Decompositions in Signal Processing

This means that the Tucker model is not changed if we apply a linear transformation with matrix Λ(n) to the column space of each factor A(n) and the −1 inverse transformation with matrix (Λ(n) ) to the mode-n subspace of the tensor G, −1 with n ∈ N , which is equivalent to replacing Gn with (Λ(n) ) Gn . Indeed, under this double mode-n transformation, the matrix unfolding [5.8] becomes: Xn = A(n) Λ(n) (Λ(n) )

−1

Gn (A(n+1) ⊗ · · · ⊗ A(N ) ⊗ A(1) ⊗ · · · ⊗ A(n−1) )T

= A(n) Gn (A(n+1) ⊗ · · · ⊗ A(N ) ⊗ A(1) ⊗ · · · ⊗ A(n−1) )T .

G∈

KP ×Q×S ,

xijk

X ∈ KI×J×K A ∈ KI×P , B ∈ KJ×Q , C ∈ KK×S

Scalar expression Q  S P   = gpqs aip bjq cks p=1 q=1 s=1

Expression with mode-n products X = G×1 A×2 B×3 C Expression with outer products Q  S P   gpqs A.p ◦ B.q ◦ C.s X = p=1 q=1 s=1

Matrix unfoldings XI×JK = AGP ×QS (B ⊗ C)T XJ×KI = BGQ×SP (C ⊗ A)T XK×IJ = CGS×P Q (A ⊗ B)T Vectorization vec(X3 ) = xIJK = (A ⊗ B ⊗ C)vec(G3 ) vec(G3 ) = gP QS

Table 5.1. Third-order Tucker model

The matrix unfolding Xn is therefore left unchanged. The same reasoning holds for any mode-n ∈ N , which implies that the Tucker model is unique up to these linear transformations, characterized by the non-singular matrices Λ(n) , for n ∈ N . Nevertheless, it is important to note that the Tucker model is guaranteed to be unique if certain conditions are satisfied, such as: – the core tensor is perfectly known; – there are certain zeroes in the core tensor, i.e. under certain sparseness constraints on the core tensor (ten Berge and Smilde 2002); – certain structural constraints are imposed on the core tensor (Favier et al. 2012c), where the core tensor is characterized by Hankel and Vandermonde matrix slices.

Tensor Decompositions

247

See Smilde et al. (2004) for a review of uniqueness results for Tucker models. 5.2.1.8. HOSVD algorithm Since the Tucker model of an N th-order tensor is nonlinear with respect to the parameters (G; A(1) , · · · , A(N) ), the ALS method can be used for parameter estimation through an alternating minimization of least squares criteria constructed from the matrix unfoldings [5.8] for n ∈ N , and the vectorized form [5.9] for n = 1, for example. This type of algorithm will be presented in section 5.2.5.7 to estimate the parameters of a third-order PARAFAC model. With regard to the identifiability of the Tucker model  [5.1], the number of data N points contained in the tensor X ∈ KI N , which is equal to n=1 In , must be greater than or equal to the number of parameters that must be estimated in thecore tensor N G ∈ KRN and the factor matrices A(n) ∈ KIn ×Rn , n ∈ n, which is n=1 Rn + N n=1 In Rn , giving us the condition: N  n=1

In ≥

N  n=1

Rn +

N 

In Rn .

[5.13]

n=1

We now present a non-iterative solution based on the orthogonality constraint on the factor matrices. This method, called HSOVD (higher order singular value decomposition) or MLSVD (multilinear SVD), has the advantage of being simple to implement, since it only involves computing N matrix SVDs for a tensor of order N (de Lathauwer et al. 2000a). To simplify the presentation, we will consider the case of a complex third-order tensor X ∈ CI×J×K . From the expressions of the matrix unfoldings XI×JK , XJ×KI and XK×IJ given in Table 5.1, we can conclude that the column spaces of these unfoldings are identical to the column spaces of the matrices A, B and C, respectively. The HOSVD method computes the SVD of these three unfoldings and takes the unitary matrices (U(1) , U(2) , U(3) ) of the left singular vectors of these three SVDs as an estimate for A ∈ CI×I , B ∈ CJ×J and C ∈ CK×K , respectively. The core tensor G ∈ CI×J×K is then deduced from (1) (2) (3) T XI×JK = U GI×JK (U ⊗ U ) in the following unfolded form: GI×JK = (U(1) )H XI×JK (U(2) ⊗ U(3) )∗ ,

[5.14]

which is obtained by pre-multiplying XI×JK with (U(1) )H and post-multiplying it with (U(2) ⊗ U(3) )∗ . P ROOF .– The orthogonality property of the factor matrices implies the following identities: (U(1) )H U(1) = U(1) (U(1) )H = II , (U(2) )∗ (U(2) )T = IJ , (U(3) )∗ (U(3) )T = IK , from which we deduce (U(2) )−T = (U (2) )∗ and (U(3) )−T = (U (3) )∗ .

248

Matrix and Tensor Decompositions in Signal Processing

The inverses of U(1) and (U(2) ⊗ U(3) )T are therefore given by (U(1) )−1 = (U ) and (U(2) ⊗ U(3) )−T = (U(2) )−T ⊗ (U(3) )−T = (U(2) )∗ ⊗ (U(3) )∗ = (U(2) ⊗ U(3) )∗ , respectively. The formula [5.14] is then deduced using these inverses.  (1) H

The resulting estimation algorithm is summarized below. 1) Compute the SVD of XI×JK , and choose A as the matrix U(1) of left singular vectors. 2) Compute the SVD of XJ×KI , and choose B as the matrix U(2) of left singular vectors. 3) Compute the SVD of XK×IJ , and choose C as the matrix U(3) of left singular vectors. 4) Compute the core tensor in the unfolded form [5.14]. In the case of an N th-order tensor X ∈ KI N , its HOSVD can be written as: N

X = G × U(n) , n=1

[5.15]

where U(n) ∈ KIn ×In is unitary in the complex case (K = C) and orthogonal in the real case (K = R), given by the SVD of Xn = U(n) Σ(n) (V(n) )H . The columns of U(n) can be viewed as the mode-n singular vectors of X . In the complex case, the core tensor G ∈ CI N , which has the same size as X , can be obtained by applying the property [3.130] of the mode-n product, with (U(n) )† = (U(n) )H : N

G = X × (U(n) )H . n=1

[5.16]

It satisfies the all-orthogonality and ordering constraints, which means that: – all the slices along a fixed mode-n are mutually orthogonal: Gin =i , Gin =j  = 0 for i = j and ∀n ∈ N 

[5.17]

where Gin =i denotes the ith slice along the mode-n; – the slices along each mode-n are arranged in such a way that their norms are in decreasing order, i.e. such that: Gin =1 F ≥ Gin =2 F ≥ · · · ≥ Gin =In F ∀n ∈ N .

[5.18]

Furthermore, the Frobenius norm of the slice Gin of the core tensor is the mode-n (n) singular value Gin F = σin = (Σ(n) )in ,in of X .

Tensor Decompositions

249

R EMARK 5.1.– We can make the following remarks: – In the case of a symmetric tensor X , all the factors U(n) are identical, and the core tensor itself is symmetric. The HOSVD then coincides with the HOEVD (higher order eigenvalue decomposition). – The THOSVD (truncated higher order singular value decomposition) method provides a low multilinear rank approximation of X . Choosing (R1 , · · · , RN ) as an approximated multilinear rank, with Rn ≤ In , implies that only the Rn principal mode-n left singular vectors associated with the Rn largest mode-n singular values are kept. The approximation of X is then given by: N

N

n=1

n=1

ˆ (n) , Gˆ = X × (U ˆ (n) )H , Xˆ = Gˆ × U

[5.19]

ˆ (n) ∈ KIn ×Rn , where the dimension Rn is often chosen to be much smaller with U ˆ (n) is a column orthonormal matrix, which than In . Note that, with the THOSVD, −1 U(n) H (n) † (n) H (n) ˆ ˆ ) = (U ˆ ) U ˆ ) = (U ˆ (n) )H . implies (U (U This THOSVD method does not give the best multilinear rank-(R1 , · · · , RN ) approximation because it does not minimize the criterion X − Xˆ F globally, since the optimization is performed separately for each mode-n via the best rank-Rn approximation for Xn . Nevertheless, it is widely used in practice to reduce the dimensionality of data tensors (aspect compression) because it is very straightforward to implement and generally offers good performance if the approximated multilinear rank is suitably chosen. The THOSVD can be viewed as a compressed form of the Tucker decomposition. For more details, see the article by de Lathauwer et al. (2000b). – To improve the performance of the THOSVD method, de Lathauwer et al. (2000b) proposed the HOOI (higher order orthogonal iteration) algorithm. The idea is to begin estimating the Tucker model with the HOSVD method, then refine the ˆ (n) iteratively and alternately using the Rn principal estimates of the factor matrices U left singular vectors obtained by computing the SVD of the mode-n matrix unfolding of the tensor reconstructed at each iteration. After convergence is achieved, the core ˆ (n) ∈ KIn ×Rn tensor is computed using equation [5.19], after replacing the matrices U with the factors determined from this iterative procedure. 5.2.2. Tucker-(N1 , N ) model A Tucker-(N1 , N ) model for an N th-order tensor X ∈ KI N , with N ≥ N1 , corresponds to the case where N − N1 factor matrices are equal to identity matrices (Favier and de Almeida 2014a).

250

Matrix and Tensor Decompositions in Signal Processing

For example, if we assume that A(n) = IIn , which implies Rn = In , for n = N1 + 1, · · · , N , and hence G ∈ KR1 ×···×RN1 ×IN1 +1 ×···×IN , then equations [5.1] and [5.3] become: xi1 ,··· ,iN =

R1 

RN 1

···

r1 =1



rN1 =1

gr1 ,··· ,rN1 ,iN1 +1 ,··· ,iN

N1 

(n)

ain ,rn

[5.20]

n=1

X = G×1 A(1) ×2 · · · ×N1 A(N1 ) ×N1 +1 IIN1 +1 · · · ×N IIN N1

= G × A(n) .

[5.21] [5.22]

n=1

For a third-order tensor X ∈ KI×J×K , two special cases are given by the Tucker-(2,3) and Tucker-(1,3) models, often called Tucker2 and Tucker1, respectively. These models are obtained by fixing one or two of the matrix factors equal to identity matrices. Table 5.2 summarizes the equations of the Tucker-(2,3) and Tucker-(1,3) models, in the case where C = IK for the Tucker-(2,3) model, and (B = IJ , C = IK ) for the Tucker-(1,3) model. |

Tucker-(2,3) model X ∈ G ∈ KP ×Q×K A ∈ KI×P , B ∈ KJ×Q , C = IK

Tucker-(1,3) model

KI×J×K | |

G ∈ KP ×J×K A ∈ KI×P , B = IJ , C = IK

Scalar expression xijk =

Q P   p=1 q=1

gpqk aip bjq

|

xijk =

P  p=1

gpjk aip

X = G×1 A×2 B

With mode-n products |

X = G×1 A

XIJ×K = (A ⊗ B)GP Q×K XJK×I = (B ⊗ IK )GQK×P AT XKI×J = (IK ⊗ A)GKP ×Q BT

Matrix unfoldings | | |

XIJ×K = (A ⊗ IJ )GP J×K XJK×I = GJK×P AT XKI×J = (IK ⊗ A)GKP ×J

xIJK = (A ⊗ B ⊗ IK )gP QK

Vectorization |

xIJK = (A ⊗ IJK )gP JK

Table 5.2. Tucker-(2,3) and Tucker-(1,3) models

Tensor Decompositions

251

5.2.3. Tucker model of a transpose tensor Given the Tucker model G; A(1) , · · · , A(N)  of an N th-order tensor X , the transpose tensor X T,π associated with the permutation π admits the Tucker model G T,π ; Aπ(1) , · · · , Aπ(N ) , or alternatively: X T,π = (G ×1 A(1) ×2 · · · ×N A(N ) )T,π = G T,π ×1 Aπ(1) ×2 · · · ×N Aπ(N ) ,

[5.23]

where Aπ(n) is the matrix factor of X associated with the mode π(n), i.e. the permuted mode of n. The core tensor of the Tucker model of X T,π is therefore the transpose G T,π of the core tensor G, with the matrix factor Aπ(n) for the mode n. Thus, for the third-order tensor X ∈ KI×J×K admitting the Tucker model G; A, B, C, with G ∈ KP ×Q×S , and the permutation π(1, 2, 3) = (2, 3, 1), the transpose tensor X T,π ∈ KJ×K×I admits the Tucker model G T,π ; B, C, A, with G T,π ∈ KQ×S×P . 5.2.4. Tucker decomposition and multidimensional Fourier transform Given the tensor X ∈ KI N of order N , its discrete N -dimensional Fourier transform is the tensor Y ∈ CI N defined as: N

Y = X ×1 FI1 ×2 FI2 ×3 · · · ×N FIN  X × FIn . n=1

[5.24]

The tensor Y is obtained by applying the Fourier transform with matrix FIn to the mode-n subspace of X , for n ∈ N . It can be written in scalar form as: yk1 ,··· ,kN =

I1 IN  1  (i −1)(k1 −1) (i −1)(kN −1) √ ··· xi1 ,··· ,iN ωI11 · · · ωINN , I n n=1 i =1 i =1

N 

1

N

for kn ∈ In , n ∈ N , with ωIn = exp(−2πi/In ), i2 = −1, and where FIn is the Fourier matrix of order In defined as: ⎤ ⎡ 1 1 1 ··· 1 I −1 2 n ⎥ ⎢ 1 ωI n ω In · · · ω In ⎥ In −2 ⎥ 2 4 1 ⎢ ⎢ 1 ωI n ω In · · · ω In [5.25] FIn = √ ⎢ ⎥. ⎥ In ⎢ .. .. .. .. . . ⎦ ⎣ . . . . . 1

ωIInn −1

ωIInn −2

···

ω In

The Fourier transform [5.24] of X can be interpreted as the Tucker decomposition X ; FI1 , · · · , FIN  of Y, where X is the core tensor and the Fourier

252

Matrix and Tensor Decompositions in Signal Processing

matrices are the matrix factors. Goulart and Favier (2014) applied such a multidimensional Fourier transform to a circulant-constraint canonical polyadic decomposition (CPD), i.e. a CPD of a hypercubic tensor whose factors are circulant matrices, to derive a specialized estimation algorithm that exploits the Fourier transform of such structured tensors. If the tensor X satisfies a Tucker model G; A(1) , · · · , A(N ) , its multidimensional Fourier transform leads to the Tucker model G; FI1 A(1) , · · · , FIN A(N ) , which amounts to pre-multiplying each factor matrix A(n) by the Fourier matrix FIn . 5.2.5. PARAFAC model 5.2.5.1. Scalar expression The PARAFAC (parallel factors) decomposition of a tensor, also called CANDECOMP (canonical decomposition), was introduced independently by Harshman (1970) and Carroll and Chang (1970) for applications in psychometrics and phonetics, respectively. It is also called CP for CANDECOMP/PARAFAC in Kiers (2000) and CPD following work by Hitchcock (1927), who defined a tensor as a sum of polyads. Thus, a third-order tensor can be expressed as a sum of triads, i.e. a sum of products of three factors: xijk =

R 

air bjr ckr .

[5.26]

r=1

A PARAFAC model of an N th-order tensor corresponds to the special case of a Tucker model whose core tensor is the identity tensor of order N and size R ×· · ·×R, i.e.: G = IN,R

(also denoted IR )



gr1 ,··· ,rN = δr1 ,··· ,rN ,

where δr1 ,··· ,rN is the generalized Kronecker delta, which takes the value 1 if r1 = · · · = rN = r, with r ∈ R, and 0 otherwise. Equations [5.1]–[5.2] then become: xi1 ,··· ,iN =

N R   r=1 n=1

(n)

ain ,r =

N 

(n)

ain ,r

[5.27]

n=1

(n) where A(n) = ain ,r ∈ CIn ×R , n ∈ N , are the factor matrices, and the last equality follows from using the index convention, since the repetition of the index r implies summation over r. We will write this decomposition as A(1) , · · · , A(N ) ; R.

Tensor Decompositions

253

5.2.5.2. Other expressions Equations [5.3] and [5.4] in terms of mode-n products and outer products now become: N

X = IR × A(n) , n=1

X =

R  r=1

N

◦ A(n) .r .

n=1

[5.28] [5.29]

We can make the following remarks: – the expression [5.27] in the form of a sum of polyads was called a polyadic form of X by Hitchcock (1927); – the PARAFAC model [5.27]–[5.29] is equivalent to decomposing the tensor X into a sum of R components, with each component being a rank-one tensor, equal to the outer product of the rth columns of the N factor matrices. When R is minimal, it is called the tensor rank or canonical rank of X (Kruskal 1977). This explains why the PARAFAC model has also been called the rank-revealing decomposition. This marks N a distinction between PARAFAC and Tucker models, which are instead a sum of n=1 Rn components, where each component is a rank-one tensor obtained from the outer product of one column from each factor matrix. Unlike PARAFAC models, which only involve interactions between the same columns (r ∈ R) of the factor matrices, Tucker models take into account all interactions between distinct columns (rn ∈ Rn , n ∈ N ) of the factor matrices; – unlike matrices, for which the rank is always at most equal to the smallest dimension, the rank of a tensor of order greater than two can be larger than its largest dimension In , n ∈ N ; in other words, it is possible to have R > max(In ). This n property allows us to solve the source separation problem when the number of sources is greater than the number of sensors, which corresponds to an underdetermined system. The maximum rank is defined as the largest attainable rank for a given set of tensors. In Kruskal (1989), it was shown that the maximum rank Rmax for the set of third-order tensors X ∈ KI×J×K satisfies the following inequalities: max(I, J, K) ≤ Rmax (I, J, K) ≤ min(IJ, JK, KI).

[5.30]

There are other definitions of rank for tensors, like the typical and generic ranks, as well as the symmetric rank for a symmetric tensor. For more details, see Comon et al. (2008, 2009a);

254

Matrix and Tensor Decompositions in Signal Processing

– every symmetric tensor A of dimensions I can be written as a linear combination of symmetric rank-one tensors: ◦N A = x◦N 1 + · · · + x RS 

RS 

x◦N , xr ∈ KI , r ∈ RS . r

[5.31]

r=1

The smallest integer RS such that A can be written as a linear combination of RS symmetric rank-one tensors is called the symmetric rank of A. The symmetric rank of a tensor is difficult to determine. A general formula was established by Alexander and Hirschowitz (1995) for a symmetric tensor of order greater than two. Comon et al. (2008) formulated the conjecture that the symmetric rank is equal to the rank. See Comon et al. (2008), Landsberg (2012), and Nie (2017) for more details about symmetric tensors and their applications, as well as the computation of symmetric tensor decompositions. – In telecommunications applications, the structure parameters (rank, dimensions of the modes and of the core tensor) of a tensor model are design parameters that are chosen as a function of the desired system performance. However, in most applications, like in data analysis, the structure parameters are generally unknown and need to be determined. Several approaches have been proposed to determine these parameters (see Bro and Kiers 2003; da Costa et al. 2008, 2010, 2011, and the references given in these articles). In practice, the structure parameters are often determined by a trial-and-error approach. It is important to mention that the determination of the tensor rank is an NP-hard problem (Hästad 1990; Hillar and Lim 2009). 5.2.5.3. Matricization The matrix unfolding [3.38] of the PARAFAC model is given by: XS1 ;S2 =

 A(n)

!

n∈S1

 A(n)

n∈S2

!T .

[5.32] (n)

P ROOF .– Using the expression [5.27] of xi1 ,··· ,iN and the identity [2.206] with A.r instead of u(n) , we obtain: 10 1 0  (n)  (n) I2 I2 [5.33] ain ,r eI1 ain ,r e XS1 ;S2 = xi1 ,··· ,iN eI1 = =

 A(n) .r

n∈S1

!

n∈S1

 A(n) .r

n∈S2

n∈S2

!T =

 A(n)

n∈S1

!

 A(n)

n∈S2

!T .

[5.34]

Tensor Decompositions

255

The last equality follows from the sum over r implicit in the expression [5.34], using the index convention. This completes the proof of [5.32].  In particular, the mode-n flat unfolding corresponding to S1 = {n} and S2 = {n + 1, · · · , N, 1, · · · , n − 1} is given by: Xn = A(n) (A(n+1)  · · ·  A(N )  A(1)  · · ·  A(n−1) )T .

[5.35]

P ROPOSITION 5.2.– Any matrix unfolding XS1 ;S2 has rank at most equal to R, and the rank of X satisfies R ≥ max(Rn ), with Rn ≤ In for n ∈ N . n

P ROOF .– From [5.32], we deduce:  ⎫ In × R ⎪ dim(  A(n) ) = ⎪ n∈S1 n∈S1 ⎬ dim(  A n∈S2

(n)

)=

 n∈S2

⎪ ⎪ In × R ⎭

⇒ r(XS1 ;S2 ) ≤ R.

[5.36]

In particular, we have r(Xn ) ≤ R, i.e. the mode-n rank satisfies Rn ≤ R, for every n ∈ N , which implies: R ≥ max(R1 , R2 , · · · , RN ).

[5.37]

From [5.35], we can also deduce that: Rn ≤ In , n ∈ N .

[5.38] 

This completes the proof of the proposition. 5.2.5.4. Vectorization

Applying the property [2.257] to [5.32], with A =  A(n) , C = (  A(n) )T n∈S1

n∈S2

and λ1 = · · · = λR = 1, gives the following vectorization formula, associated with the unfolding XS1 ;S2 :      A(n)  [5.39]  A(n) 1R . vec(XS1 ;S2 ) = n∈S2

n∈S1

In particular, the vectorization associated with a mode combination according to lexicographical order is given by: xI1 ···IN = (A(1)  A(2)  · · ·  A(N ) )1R .

[5.40]

256

Matrix and Tensor Decompositions in Signal Processing

5.2.5.5. Normalized form The PARAFAC model is often defined in the following normalized form: xi1 ,··· ,iN =

R 

gr

r=1

N 

(n)

bin ,r

with gr > 0,

[5.41]

n=1 (n)

(n)

where the column vectors B.r are obtained by normalizing the columns A.r : B(n) .r = gr =

1 (n) A.r F N 

A(n) .r , r ∈ R , n ∈ N 

(n) A.r F .

[5.42]

[5.43]

n=1

In this case, the identity tensor IR in [5.28] is replaced by the diagonal tensor G ∈ RR×···×R whose diagonal elements are equal to the scaling factors gr :  gr if r1 = · · · = rN = r , r ∈ R . gr1 ,··· ,rN = 0 otherwise The PARAFAC model in normalized form [5.41] can be viewed as a Tucker model G, B(1) , · · · , B(N)  with a diagonal core tensor. T We can also write g, B(1) , · · · , B(N) , where g = g1 , · · · , gR is the vector whose components are the weights gr , r ∈ R. The matricization [5.32] and vectorization [5.40] formulae then become: XS1 ;S2 =

!  B(n) diag(g)

n∈S1

 B(n)

n∈S2

xI1 ···IN = (B(1)  B(2)  · · ·  B(N ) )g,

!T [5.44] [5.45]

where diag(g) is a diagonal matrix whose diagonal elements are the coefficients gr . Note that, with this normalized form, the PARAFAC model is unique up to sign N (n) (n) ambiguities δr for each column of the factors B(n) , satisfying n=1 δr = 1 for every r ∈ R. To eliminate the column permutation ambiguities, we can arrange them in such a way that they are associated with the factors gr in decreasing order (g1 ≥ g2 ≥ · · · ≥ gR ).

Tensor Decompositions

257

5.2.5.6. Case of a third-order tensor In the case of a third-order tensor X ∈ KI×J×K , equations [5.27]–[5.29] become: xijk =

R 

air bjr ckr

[5.46]

r=1

X = IR ×1 A×2 B×3 C =

R 

[5.47]

A.r ◦ B.r ◦ C.r ,

[5.48]

r=1

where A ∈ KI×R , B ∈ KJ×R , C ∈ KK×R are the factor matrices. This PARAFAC model of rank R is denoted A, B, C; R. Figure I.1 in the introductory chapter illustrates this model in the form of a sum of R rank-one tensors, each tensor being the outer product of the rth column vectors of the three (1) (2) (3) factor matrices, denoted (ar , ar , ar ) in Figure I.1, with r ∈ R. R EMARK 5.3.– In the case of a rank-one tensor X ∈ KI×J×K , PARAFAC models will be denoted a, b, c, with a ∈ KI , b ∈ KJ , c ∈ KK and xijk = ai bj ck , i ∈ I, j ∈ J, k ∈ K. We can also decompose the tensor into matrix slices. For example, using index notation and the identity [2.84], the frontal slices obtained by fixing the index k in the third-order PARAFAC model [5.46] are given by: X..k = ckr air bjr eji = ckr (air ei )(bjr ej ) = ckr A.r BT.r

[5.49]

= Adiag(ck1 , · · · , ckR )BT = Adiag(Ck. )BT = ADk (C)BT ∈ KI×J ,

[5.50]

where Dk (C)  diag(Ck. )  diag(ck1 , · · · , ckR ), i.e. the diagonal matrix whose diagonal entries are the elements of the kth row of C. Similarly, it is easy to deduce the following expressions for the horizontal and lateral matrix slices: Xi.. = air bjr ckr ekj = BDi (A)CT ∈ KJ×K X.j. =

bjr ckr air eik

= CDj (B)A ∈ K T

K×I

[5.51] .

[5.52]

The matricized and vectorized forms can be deduced from equations [5.32] and [5.39]. The various representations of a third-order PARAFAC model are summarized in Table 5.3.

258

Matrix and Tensor Decompositions in Signal Processing

X ∈ KI×J×K A ∈ KI×R , B ∈ KJ×R , C ∈ KK×R Scalar expression R  xijk = air bjr ckr r=1

Expression with mode-n products X = IR ×1 A×2 B×3 C Expression with outer products R  A.r ◦ B.r ◦ C.r X = r=1

Matrix unfoldings XIJ×K = (A  B)CT XJK×I = (B  C)AT XKI×J = (C  A)BT Vectorization xIJK = (A  B  C)1R Matrix slices X..k = A Dk (C) BT ∈ KI×J Xi.. = B Di (A) CT ∈ KJ×K X.j. = C Dj (B) AT ∈ KK×I

Table 5.3. Third-order PARAFAC model

5.2.5.7. ALS algorithm for estimating a PARAFAC model The PARAFAC model [5.27] is multilinear (more precisely N -linear) in its parameters in the sense that it is linear with respect to each of its N factors. ALS-based estimation of the parameters A(n) , n ∈ N , relies on the minimization of the quadratic error between the data tensor X and its PARAFAC model, which amounts to minimizing the following cost function:

min

X −

A(n) ,n∈N 

R  r=1

N

2 ◦ A(n) .r F =

n=1

min

A(1) ,··· ,A(N )

X − A(1) , · · · , A(N) 2F , [5.53]

where the rank R was fixed beforehand or estimated using a trial-and-error method. This non-linear optimization problem can be solved using the ALS algorithm (Harshman 1970; Carroll and Chang 1970). This algorithm estimates each factor matrix separately and alternately by minimizing a quadratic error criterion conditional on fixing the other factor matrices with their previously estimated values.

Tensor Decompositions

259

Thus, using the mode-n matrix unfolding [5.35], we can estimate the factor A(n) at the iteration t by minimizing the LS criterion conditional on the estimated values of the other factors, i.e.: (n+1)

min XTn − (At−1

A(n)

(N )

(1)

(n−1)

 · · ·  At−1  At  · · ·  At

)(A(n) )T 2F , [5.54]

(n)

where At denotes the estimate of A(n) at the iteration t. The details of this ALS algorithm are presented below for a third-order tensor. To guarantee the uniqueness of the LS solution resulting from the minimization of [5.54], the matrix to be pseudo-inverted must have full column rank, i.e. R ≤ N  Ip . Hence, to guarantee the identifiability of the N factors, the rank R must p=1,p=n

satisfy the following necessary, but not sufficient, condition: N 

R ≤ min( n

Ip ).

[5.55]

p=1,p=n

In the case of the third-order PARAFAC model, the factors (A, B, C) are estimated by minimizing the following LS criterion: min X −

A,B,C

R 

A.r ◦ B.r ◦ C.r 2F .

[5.56]

r=1

Since the PARAFAC model is trilinear with respect to its parameters, the ALS algorithm replaces the criterion [5.56] with three quadratic criteria obtained using the three matrix unfoldings given in Table 5.3, while fixing two of these matrices using the values estimated at the previous iterations: minXJK×I − (Bt−1  Ct−1 )AT 2F ⇒ At

[5.57]

minXKI×J − (Ct−1  At )BT 2F ⇒ Bt

[5.58]

A

B

minXIJ×K − (At  Bt )CT 2F ⇒ Ct . B

[5.59]

The ALS algorithm for a third-order PARAFAC model is summarized as follows: 1) Initialization: t = 0; B0 , C0 . 2) Increment t and compute: †

(At )T

= (Bt−1  Ct−1 ) XJK×I

(Bt )T

= (Ct−1  At ) XKI×J

(Ct )T

= (At  Bt ) XIJ×K .





260

Matrix and Tensor Decompositions in Signal Processing

3) Return to step 2 until convergence. Stopping criterion and identifiability condition: the stopping criterion is often based on the variation (between two consecutive iterations t − 1 and t) in the quadratic error between the measured data and the data reconstructed using the estimated model, i.e.: |et − et−1 | ≤ with et = XJK×I − (Bt  Ct )ATt 2F ,

[5.60]

where the threshold, for example, is set to the value = 10−6 . We can also use a stopping criterion based on the variation in the estimated parameters between two consecutive iterations. For a third-order PARAFAC model, the necessary condition [5.55] for identifiability, corresponding to the constraint that A  B, B  C and C  A have full column rank, becomes: R ≤ min(IJ, JK, KI).

[5.61]

Hence, taking into account the lower bound [5.37] for N = 3, the rank R must satisfy: max(R1 , R2 , R3 ) ≤ R ≤ min(IJ, JK, KI).

[5.62]

We can make the following remarks: – the ALS algorithm was originally proposed by Harshman (1970) and Carroll and Chang (1970) for third-order PARAFAC models. However, these two articles do not explicitly mention Khatri–Rao products of the factor matrices; – taking into account the property [2.255] of the Khatri–Rao product, the computation of the pseudo-inverses in the ALS algorithm can be simplified as follows: (Bt−1  Ct−1 )† = [(Bt−1  Ct−1 )H (Bt−1  Ct−1 )]−1 (Bt−1  Ct−1 )H H −1 = (BH (Bt−1  Ct−1 )H t−1 Bt−1  Ct−1 Ct−1 ) H −1 (Ct−1  At )† = (CH (Ct−1  At )H t−1 Ct−1  At At ) H −1 (At  Bt )H . (At  Bt )† = (AH t A t  Bt Bt )

[5.63] [5.64] [5.65]

Tensor Decompositions

261

This amounts to replacing the computation of the pseudo-inverses of matrices of size JK ×R, KI ×R, and IJ ×R by the computation of the inverses of three matrices of size R × R. If a matrix factor is also column orthonormal (e.g. CH C = IR ), then the formulae [5.63] and [5.64] simplify as follows: −1   (Bt−1  Ct−1 )† = diag (B.1 )t−1 22 , · · · , (B.R )t−1 22 (Bt−1  Ct−1 )H −1   (Ct−1  At )† = diag (A.1 )t 22 , · · · , (A.R )t 22 (Ct−1  At )H . – Many algorithms have been proposed for the parametric estimation of PARAFAC models. See, for example, Faber et al. (2003) for a review of various estimation algorithms for third-order PARAFAC models and the associated identifiability conditions. Similarly, in Tomasi and Bro (2006), different algorithms are compared for third-order models, with a presentation of two possible approaches for accelerating convergence (line search and compression). Other optimization-based estimation methods like the Gauss–Newton algorithm can also be used (Phan et al. 2013). For an overview of various parameter estimation algorithms for PARAFAC, see the article by Comon et al. (2009b). – The main advantages of the ALS algorithm are its simplicity to implement and the fact that it can be straightforwardly generalized to an arbitrary order N . On the other hand, it has the drawbacks of being slow to converge and potentially converging to a local minimum, depending on the choice of initialization. This slow convergence occurs especially in the case of degenerate PARAFAC models, i.e. when there is a linear dependency between the columns of the factor matrices that can induce a model order modification (Kruskal et al. 1989). In Paatero (2000), the degeneracy phenomenon is illustrated for third-order PARAFAC models of size 2 × 2 × 2, as in the example presented below. The case of p × p × 2 models is considered in Stegeman (2006). I×J×K E XAMPLE 5.4.– Consider satisfying the sequence of rank-two tensors Xn ∈ K the PARAFAC model [An , Bn , Cn ; 2] defined as: An = na | −na + n12 v ; Bn = nb | −nb + n12 w Cn = nc + n12 u | −nc ,

262

Matrix and Tensor Decompositions in Signal Processing

i.e.: Xn = (na) ◦ (nb) ◦ (nc + =X −

1 1 1 u) + (−na + 2 v) ◦ (−nb + 2 w) ◦ (−nc) 2 n n n

1 (v ◦ w ◦ c) n3

[5.66]

X = a ◦ b ◦ u + v ◦ b ◦ c + a ◦ w ◦ c,

[5.67]

where a, v ∈ KI , b, w ∈ KJ and c, u ∈ KK are the generating vectors of the model, assumed to be such that (a, v) are independent, as well as (b, w) and (c, u). As n → ∞, each of the columns of the three factors (An , Bn , Cn ) tends to infinity proportionally to n, becoming increasingly dependent on the fact that the contributions of the vectors (u, v, w) decrease proportionally to n12 and tend to zero, giving: An

[ na |

−na ] Bn

[ nb

| −nb ] Cn

[ nc |

−nc ].

However, equations [5.66]–[5.67] show that the tensors Xn remain finite. The original second-order PARAFAC model is constructed in such a way that the sequence Xn tends to the third-order tensor X defined in [5.67], while the difference Xn − X tends to zero proportionally to n13 . In general, PARAFAC degeneracy occurs when a tensor is approximated by a lower rank tensor. As in the above example, some factors tending to infinity cancel each other, which induces the convergence toward a higher rank tensor. In practice, the degeneracy problem can be avoided by imposing orthogonality constraints as with the HOSVD algorithm or non-negativity ones. Thus, Lim and Comon (2009) showed that, for any non-negative tensor, a best non-negative rank-R approximation always exists in the sense of minimizing the LS criterion [5.53]. 5.2.5.8. Estimation of a third-order rank-one tensor Consider a rank-one tensor X ∈ RI×J×K admitting a PARAFAC model of the form [5.41], with factors u ∈ RI , v ∈ RJ and w ∈ RK , of unit norm, i.e.: X = g u ◦ v ◦ w ⇔ xijk = g ui vj wk , g > 0.

[5.68]

The LS parameter estimation problem is solved by minimizing the following cost function f (g, u, v, w) = X − g(u ◦ v ◦ w)2F with respect to (g, u, v, w) subject to the constraints u2 = v2 = w2 = 1.

Tensor Decompositions

263

Using the relations [3.132]–[3.135], and noting that the vectors u, v, w have unit norm, we can deduce the following equations: X × 1 uT × 2 v T =

J I  

ui vj Xij. = g

J I  

i=1 j=1

X × 1 u ×3 w = T

T

K I  

X × 2 v T ×3 w T =

ui wk Xi.k = g

K I  

u2i wk2 v = gv [5.70]

i=1 k=1

vj wk X.jk = g

j=1 k=1

X × 1 uT × 2 v T × 3 w T =

[5.69]

i=1 j=1

i=1 k=1 K J  

u2i vj2 w = gw

J  K I  

K J  

vj2 wk2 u = gu [5.71]

j=1 k=1

xijk ui vj wk = g

i=1 j=1 k=1

J  K I  

u2i vj2 wk2 = g.

i=1 j=1 k=1

From these equations, and taking into account the constraints of unit norm for the vectors (u, v, w), we deduce the ALS estimation algorithm described below. Initialization: it = 0 : u0 , v0 . T T (i) wit+1 = X ×1 uTit ×2 vit /X ×1 uTit ×2 vit 2 T T (ii) vit+1 = X ×1 uTit ×3 wit+1 /X ×1 uTit ×3 wit+1 2 T T T T (iii) uit+1 = X ×2 vit+1 ×3 wit+1 /X ×2 vit+1 ×3 wit+1 2

Return to (i) until convergence. T T × 3 w∞ , g∞ = X ×1 uT∞ ×2 v∞

where (u∞ , v∞ , w∞ ) are the estimated values of (u, v, w) at convergence. R EMARK 5.5.– Note that, for any iteration it, we have: f (uit , vit , wit ) ≤ f (uit , vit , wit−1 ) ≤ f (uit , vit−1 , wit−1 ) ≤ f (uit−1 , vit−1 , wit−1 ), which means that the cost function f (uit , vit , wit ) decreases monotonely. 5.2.5.9. Case of a fourth-order tensor The various possible representations of the PARAFAC model (A, B, C, D; R) of a fourth-order tensor X ∈ KI×J×K×L are summarized in Table 5.4.

264

Matrix and Tensor Decompositions in Signal Processing

X ∈ KI×J×K×L A ∈ KI×R , B ∈ KJ×R , C ∈ KK×R , D ∈ KL×R Scalar expression R  xijkl = air bjr ckr dlr r=1

Expression with mode-n products X = IR ×1 A×2 B×3 C×4 D Expression with outer products R  A.r ◦ B.r ◦ C.r ◦ D.r

X =

r=1

Matrix unfoldings XIJ×KL = (A  B)(C  D)T XIJK×L = (A  B  C)DT Vectorization xIJKL = (A  B  C  D)1R Matrix slices Xij.. = Cdiag(Bj. )diag(Ai. )DT ∈ KK×L

Table 5.4. Fourth-order PARAFAC model

In particular, applying the formula [5.32] with S1 = {1, 2, 3} and S2 = {4}, as well as the formula [2.202] defining the Khatri–Rao product, we have: XIJK×L = (A  B  C)DT ⎡ CD1 (B)D1 (A) ⎢ .. ⎢ . ⎢ ⎡ ⎤ ⎢ CDJ (B)D1 (A) (B  C)D1 (A) ⎢ ⎢ ⎢ ⎥ T .. .. =⎣ = D ⎢ ⎦ . . ⎢ ⎢ CD1 (B)DI (A) (B  C)DI (A) ⎢ ⎢ .. ⎣ . CDJ (B)DI (A)

⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ T ⎥ D ∈ KIJK×L . ⎥ ⎥ ⎥ ⎥ ⎦

Since the matrix unfolding XIJK×L is obtained by stacking the IJ matrix slices Xij.. ∈ KK×L , we deduce the following expression for these slices: Xij.. = CDj (B)Di (A)DT ∈ KK×L .

[5.72]

Noting that: Dj (B)Di (A) = D(j−1)I+i (B  A) = D(i−1)J+j (A  B),

[5.73]

where D(j−1)I+i (BA) and D(i−1)J+j (AB) are diagonal matrices whose diagonal entries are the elements of the rows (j − 1)I + i and (i − 1)J + j of the Khatri–Rao

Tensor Decompositions

265

products B  A and A  B, respectively, the matrix unfolding [5.72] can be rewritten as: Xij.. = CD(j−1)I+i (B  A)DT = CD(i−1)J+j (A  B)DT .

[5.74]

Similarly, we have: X.jk. = AD(j−1)K+k (B  C)DT = AD(k−1)J+j (C  B)DT ∈ KI×L [5.75] X..kl = AD(k−1)L+l (C  D)BT = AD(l−1)K+k (D  C)BT ∈ KI×J . [5.76] Other slices and matrix unfoldings can be deduced from the expressions presented above using simple permutations of the factor matrices. Thus, for example, we have: XIL×JK = (A  D)(B  C)T ; XILJ×K = (A  D  B)CT Xi..l = BDl (D)Di (A)CT ∈ KJ×K = BD(l−1)I+i (D  A)CT = BD(i−1)L+l (A  D)CT . By combining two modes (for example the modes 2 and 3), we can reshape the tensor X into a third-order tensor Y ∈ KI×M ×L , where M = JK and the combined mode is defined as m = k + (j − 1)K. Setting fmr = bjr ckr , or equivalently F = B  C ∈ KM ×R , the scalar equation of the tensor Y can be written as: yiml =

R 

air fmr dlr ,

[5.77]

r=1

which is a PARAFAC model A, F, D; R . From Table 5.3, we can deduce the following vectorized and matricized forms of Y, as well as the matrix slice Yi.. : YIM ×L = (A  F)DT ; yIM L = (A  F  D)1R ; Yi.. = Fdiag(Ai. )DT ∈ KM ×L . R EMARK 5.6.– In general, by combining P modes of an N th-order tensor X that admits a PARAFAC model of rank R, we obtain a tensor Y of reduced order N −P +1 that admits a PARAFAC model of rank R characterized by N − P + 1 matrix factors, one of which is given by the Khatri–Rao products of the P factor matrices associated with the P combined modes. 5.2.5.10. Variants of the PARAFAC model Table 5.5 presents a few variants of the third-order PARAFAC model. The acronyms mean “individual differences scaling” (INDSCAL), “symmetric CP”

266

Matrix and Tensor Decompositions in Signal Processing

(Sym CP), “doubly symmetric CP” (DSym CP), “shifted CP” (ShiftCP) and “convolutive CP” (ConvCP). The symmetric CP model was used to model Volterra kernels in Favier and Bouilloc (2010) and Favier et al. (2012a), whereas the doubly symmetric model was used to model nonlinear communication channels (Bouilloc and Favier 2012; Crespo-Cadenas et al. 2014). Models CP INDSCAL

xi,j,k

Applications

air bjr ckr

Various fields

air ajr ckr

Psychometrics

air ajr akr

Volterra models

air ajr a∗kr

Nonlinear channels

air bj−tk ,r ckr

Neuro-imaging

Ref (Harshman 1970) (Carroll and Chang 1970) (Comon et al. 2008)

Sym CP

DSym CP (Bouilloc and Favier 2012;

R  r=1 R  r=1 R  r=1 R  r=1

Crespo-Cadenas et al. 2014) ShiftCP

(Harshman et al. 2003;

R  r=1

Morup 2011) ConvCP

(Morup 2011;

T R  

air bj−t,r ck,r,t

Neuro-imaging

r=1 t=1

Morup et al. 2011) Table 5.5. Variants of the third-order PARAFAC model

The convolutive CP model, described by the equation:

x_{i,j,k} = Σ_{r=1}^{R} Σ_{t=1}^{T} a_{ir} b_{j−t,r} c_{k,r,t},     [5.78]

was used to model three-dimensional neuro-imaging data whose first two modes are space–time and whose third mode is the experiment number. The dependency of the components of the time factor b_r = B_{.r} with respect to the time shift t reflects variability between experiments by taking into account, for the experiment k, an amplitude that itself depends on t via the coefficient c_{k,r,t}. The sum Σ_{t=1}^{T} b_{j−t,r} c_{k,r,t} can be interpreted as a time convolution, which explains the name of the convolutive CP model. The ShiftCP model corresponds to the special case of the ConvCP model where each time factor b_r only depends on a shift t_k for the experiment k. The ConvCP model can therefore be viewed as a generalization of the ShiftCP model that introduces a convolution over the time interval T. For more details, see Morup et al. (2011).
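A ConvCP tensor as in [5.78] is straightforward to build explicitly, and the ShiftCP special case is recovered by concentrating each coefficient sequence c_{k,r,·} on a single shift t_k. The numpy sketch below is illustrative only (hypothetical dimensions, zero-based shifts, helper shifted defined here) and is not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, R, T = 4, 8, 3, 2, 3
A = rng.standard_normal((I, R))                 # space factor
B = rng.standard_normal((J, R))                 # time factor b_{j,r}
C = rng.standard_normal((K, R, T))              # ConvCP coefficients c_{k,r,t}

def shifted(Bmat, t):
    """Rows of B delayed by t samples (zeros enter at the top), i.e. b_{j-t,r}."""
    out = np.zeros_like(Bmat)
    out[t:, :] = Bmat[:Bmat.shape[0] - t, :]
    return out

# ConvCP model [5.78]: x_{ijk} = sum_r sum_t a_{ir} b_{j-t,r} c_{k,r,t}
X_conv = sum(np.einsum('ir,jr,kr->ijk', A, shifted(B, t), C[:, :, t]) for t in range(T))

# ShiftCP: one shift t_k per experiment k, with coefficients c_{k,r}
t_k = rng.integers(0, T, size=K)
c = rng.standard_normal((K, R))
X_shift = np.stack([np.einsum('ir,jr,r->ij', A, shifted(B, t_k[k]), c[k]) for k in range(K)], axis=2)

# concentrating c_{k,r,.} on t_k turns the ConvCP sum into the ShiftCP model
C_conc = np.zeros((K, R, T))
for k in range(K):
    C_conc[k, :, t_k[k]] = c[k]
X_check = sum(np.einsum('ir,jr,kr->ijk', A, shifted(B, t), C_conc[:, :, t]) for t in range(T))
print(X_conv.shape, np.allclose(X_shift, X_check))    # (4, 8, 3) True
```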


5.2.5.11. PARAFAC model of a transpose tensor

Given the PARAFAC model ⟨A^(1), ..., A^(N); R⟩ of an Nth-order tensor X, the transpose tensor X^{T,π} associated with the permutation π admits the PARAFAC model ⟨A^{π(1)}, ..., A^{π(N)}; R⟩, or alternatively:

X^{T,π} = (I_R ×_1 A^(1) ×_2 ··· ×_N A^(N))^{T,π} = I_R ×_1 A^{π(1)} ×_2 ··· ×_N A^{π(N)},     [5.79]

where A^{π(n)} is the matrix factor of X associated with the mode π(n), i.e. the permuted mode of n. The PARAFAC model of X^{T,π} therefore has A^{π(n)} as its mode-n factor matrix. Note that the rank is not changed by a permutation of the modes. Thus, for example, for the third-order tensor X ∈ K^{I×J×K} admitting the PARAFAC model ⟨A, B, C; R⟩ and the permutation π(1, 2, 3) = (2, 3, 1), the transpose tensor X^{T,π} admits the PARAFAC model ⟨B, C, A; R⟩.

5.2.5.12. Uniqueness and identifiability

Since the pioneering work of Harshman (1970) and Kruskal (1977), various articles have been written about the problem of essential uniqueness of PARAFAC models, i.e. uniqueness up to trivial indeterminacies corresponding to permutation and scaling ambiguities among the columns of the factor matrices. The permutation ambiguity arises from the fact that the order of the rank-one tensors in [5.29] can be changed without changing the result of the sum, due to the commutativity of addition. This is equivalent to permuting the columns of the factors A^(n) using a permutation matrix Π. Similarly, it is possible to multiply each column A^(n)_{.r} with a scalar factor λ^(n)_r without changing equation [5.29] of the model, provided that ∏_{n=1}^{N} λ^(n)_r = 1. In other words, the factors A^(n) can be replaced with A^(n)ΠΛ^(n) without changing the model if ∏_{n=1}^{N} Λ^(n) = I_R, where Λ^(n) = diag(λ^(n)_1, ..., λ^(n)_R), for n ∈ N.

This result can also be recovered from a matrix unfolding using the following properties of the Khatri–Rao product:

(AΠ) ⋄ (BΠ) = (A ⋄ B)Π     [5.80]
(AΛ_1) ⋄ (BΛ_2) = (A ⋄ B)Λ_1Λ_2     [5.81]

for any permutation matrix Π and any diagonal matrices Λ_1 and Λ_2. For example, consider the mode-1 flat unfolding defined in [5.35] for the PARAFAC model ⟨A, B, C; R⟩ and the triplet (AΠΛ^(A), BΠΛ^(B), CΠΛ^(C)). Using the properties [5.80] and [5.81] gives:

AΠΛ^(A) (BΠΛ^(B) ⋄ CΠΛ^(C))^T = AΠΛ^(A)Λ^(B)Λ^(C) Π^T (B ⋄ C)^T.     [5.82]


Taking into account the properties Λ^(A)Λ^(B)Λ^(C) = ΠΠ^T = I_R, we conclude that the model ⟨AΠΛ^(A), BΠΛ^(B), CΠΛ^(C); R⟩ is equivalent to the model ⟨A, B, C; R⟩.

An overview of the main uniqueness conditions of PARAFAC models can be found in Domanov and de Lathauwer (2013) for the deterministic case, i.e. for a particular PARAFAC model, and in Domanov and de Lathauwer (2014) for the generic case, i.e. when the elements of the factor matrices are random and generated from a continuous distribution. Below, we briefly recall a few results about the uniqueness of PARAFAC models, after defining the notion of k-rank.

The k-rank (also known as the Kruskal rank) of a matrix A, denoted k_A, is the largest integer such that every set of k_A columns of A is linearly independent. From this definition, we can deduce the following results:
– k_A ≤ r_A;
– if A ∈ K^{I×J} is a matrix whose elements are randomly generated from an absolutely continuous distribution, then A has full rank, which implies that k_A = r_A = min(I, J);
– if A contains a zero column, then k_A = 0;
– if A contains two proportional columns, then k_A = 1.

EXAMPLE 5.7.– Below are three examples of square matrices of order three with the same rank but different column k-ranks:

A_1 = [1 0 0; 0 1 0; 0 0 0]  ⇒  r_{A_1} = 2, k_{A_1} = 0
A_2 = [1 1 1; 1 1 0; 1 1 0]  ⇒  r_{A_2} = 2, k_{A_2} = 1
A_3 = [1 0 1; 0 1 1; 0 0 0]  ⇒  r_{A_3} = k_{A_3} = 2.

The next example shows that the column and row k-ranks are not necessarily the same:

A = [1 0 0 1; 0 1 0 0; 0 0 1 0].

We have r_A = 3, with k_A = 1 for the column k-rank, and k_A = 3 for the row k-rank.
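For small matrices the k-rank can be computed by brute force, checking every subset of columns. The sketch below is not from the book (the function k_rank and the tolerance are illustrative choices); it reproduces the k-ranks of Example 5.7 and the row/column k-rank example.

```python
import numpy as np
from itertools import combinations

def k_rank(A, tol=1e-10):
    """Kruskal rank: largest k such that EVERY set of k columns is linearly independent.
    Brute-force search, intended only for small matrices."""
    R = A.shape[1]
    for k in range(R, 0, -1):
        if all(np.linalg.matrix_rank(A[:, list(cols)], tol=tol) == k
               for cols in combinations(range(R), k)):
            return k
    return 0

A1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0]], float)
A2 = np.array([[1, 1, 1], [1, 1, 0], [1, 1, 0]], float)
A3 = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 0]], float)
for M in (A1, A2, A3):
    print(np.linalg.matrix_rank(M), k_rank(M))      # (2, 0), (2, 1), (2, 2)

A = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]], float)
print(k_rank(A), k_rank(A.T))                        # column k-rank 1, row k-rank 3
```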


PROPOSITION 5.8.– The PARAFAC model of order N described by equations [5.27]–[5.29] is essentially unique if the following condition is satisfied (Sidiropoulos and Bro 2000):

Σ_{n=1}^{N} k_{A^(n)} ≥ 2R + N − 1.     [5.83]

In the generic case, the factor matrices have full rank with probability one, which implies that k_{A^(n)} = min(I_n, R), and the condition [5.83] becomes:

Σ_{n=1}^{N} min(I_n, R) ≥ 2R + N − 1.     [5.84]

For a third-order PARAFAC model ⟨A, B, C; R⟩, the Kruskal condition can be written as follows:

k_A + k_B + k_C ≥ 2R + 2.     [5.85]

REMARK 5.9.– We can make the following remarks.
– The condition [5.83] is sufficient but not necessary to guarantee essential uniqueness. This condition does not hold for R = 1. It is necessary for R = 2 and R = 3, but not for R > 3 (ten Berge and Sidiropoulos 2002). If all the factor matrices have full column rank, then every PARAFAC model of a tensor of order N > 3 and rank strictly greater than one is essentially unique.
– The first sufficient condition for the essential uniqueness of a third-order PARAFAC model was established by Harshman (1972), then generalized by Kruskal (1977) using the notion of k-rank. A more accessible proof of the Kruskal condition is given by Stegeman and Sidiropoulos (2007). This condition was then extended to complex-valued tensors by Sidiropoulos et al. (2000) and to tensors of order N > 3 by Sidiropoulos and Bro (2000).
– From the condition [5.85], we can conclude that if two factor matrices (A, B) have full column rank (k_A = k_B = R), then the PARAFAC model is essentially unique if the third factor does not have any proportional columns (k_C > 1). Hence, we can deduce the following condition for the essential uniqueness of a third-order PARAFAC model:

min(k_A, k_B, k_C) ≥ 2.     [5.86]

– If a factor matrix (e.g. C) has full column rank, then the condition [5.85] becomes:

k_A + k_B ≥ R + 2.     [5.87]


– In Jiang and Sidiropoulos (2004) and de Lathauwer (2006), necessary and sufficient uniqueness conditions, less strict than the Kruskal condition, are established for third- and fourth-order tensors subject to the hypothesis that at least one factor matrix has full column rank. However, these conditions are difficult to exploit in practice. Other less strict conditions were independently derived by Stegeman (2008) and Guo et al. (2011) for third-order tensors with a factor matrix (e.g. C) of full column rank. It has been shown that the PARAFAC model is essentially unique if the factors (A, B) satisfy the following conditions:

1) k_A, k_B ≥ 2
2) r_A + k_B ≥ R + 2  or  r_B + k_A ≥ R + 2.     [5.88]

These conditions are less restrictive than [5.87]. Indeed, if, for example, k_A = 2 and r_A = k_A + δ with δ > 0, then applying [5.87] implies that k_B = R, i.e. B must have full column rank, whereas [5.88] gives k_B ≥ R − δ, which does not require B to have full column rank.
– When a factor matrix (e.g. C) is known and the Kruskal condition [5.85] is satisfied, as is often the case in telecommunications applications, essential uniqueness is guaranteed without any permutation ambiguity, leaving only scaling ambiguities (Λ_A, Λ_B) such that Λ_AΛ_B = I_R.

Table 5.6 summarizes a few important results about the rank and k-rank of certain matrices frequently encountered in signal processing applications, namely random matrices with i.i.d. (independent and identically distributed) elements with a uniformly continuous law, Vandermonde matrices and Khatri–Rao products, as well as the sufficient conditions guaranteeing the essential uniqueness of a third-order PARAFAC model presented above.

With regard to the identifiability of a PARAFAC model ⟨A^(1), ..., A^(N); R⟩ of order N, with A^(n) ∈ K^{I_n×R}, n ∈ N, the number of data points contained in the tensor X ∈ K^{I_N}, which is equal to ∏_{n=1}^{N} I_n, must be greater than or equal to the number of unknown parameters to be estimated¹, which is equal to (Σ_{n=1}^{N} I_n − N + 1)R, giving us the following upper bound for the rank of the tensor:

R ≤ ∏_{n=1}^{N} I_n / (Σ_{n=1}^{N} I_n − N + 1).     [5.89]

1. The term (N − 1)R subtracted from the total number of coefficients of the factor matrices corresponds to the number of scaling ambiguity factors contained in the diagonal matrices Λ^(n) ∈ K^{R×R}, n ∈ N, minus R due to the relation ∏_{n=1}^{N} Λ^(n) = I_R, which allows us to compute one ambiguity matrix if the other N − 1 are known. The number (N − 1)R therefore represents the amount of a priori information needed to eliminate the scaling ambiguities.
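The sufficient condition [5.83]/[5.84] and the rank bound [5.89] are easy to evaluate programmatically. The short sketch below is illustrative only; the function names and the example dimensions are hypothetical choices, not part of the book.

```python
def kruskal_condition(k_ranks, R):
    """Sufficient uniqueness condition [5.83]: sum_n k_{A(n)} >= 2R + N - 1."""
    N = len(k_ranks)
    return sum(k_ranks) >= 2 * R + N - 1

def rank_upper_bound(dims):
    """Identifiability bound [5.89]: R <= prod(I_n) / (sum(I_n) - N + 1), rounded down."""
    N = len(dims)
    prod = 1
    for I in dims:
        prod *= I
    return prod // (sum(dims) - N + 1)

# third-order example in the generic case, where k_{A(n)} = min(I_n, R)
dims, R = (4, 4, 4), 5
k = [min(I, R) for I in dims]
print(kruskal_condition(k, R))     # True  (12 >= 2*5 + 3 - 1 = 12)
print(rank_upper_bound(dims))      # 6     (64 // 10)
```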


References | Conditions | Properties

Random matrix A ∈ K^{I×J} with i.i.d. elements from a uniformly continuous law | — | k_A = r_A = min(I, J)

Vandermonde matrix A = [α_j^{i−1}] ∈ K^{I×J} (Sidiropoulos and Liu 2001) | α_j ≠ 0 ∀j ∈ J, α_k ≠ α_j ∀k ≠ j | k_A = r_A = min(I, J)

Khatri–Rao product A ⋄ B; A ∈ K^{I×R}, B ∈ K^{J×R} (Sidiropoulos et al. 2000a) | k_A + k_B ≥ R + 1 | r_{A⋄B} = R

(Sidiropoulos and Liu 2001; Jiang et al. 2001) | k_A ≥ 1, k_B ≥ 1 | k_{A⋄B} ≥ min(k_A + k_B − 1, R)

(Sidiropoulos and Liu 2001; Jiang et al. 2001) | A, B random or Vandermonde with random generators | r_{A⋄B} = k_{A⋄B} = min(IJ, R)

Essential uniqueness of a PARAFAC model ⟨A, B, C; R⟩ (Kruskal 1977) | k_A + k_B + k_C ≥ 2R + 2 | Ā = AΠΛ_A, B̄ = BΠΛ_B, C̄ = CΠΛ_C, with Λ_AΛ_BΛ_C = I_R

(Stegeman 2008; Guo et al. 2011) | k_C = R; k_A, k_B ≥ 2; r_A + k_B ≥ R + 2 or r_B + k_A ≥ R + 2 | Ā = AΠΛ_A, B̄ = BΠΛ_B, C̄ = CΠΛ_C, with Λ_AΛ_BΛ_C = I_R

(Kruskal 1977) | k_A + k_B + k_C ≥ 2R + 2 with C known | Ā = AΛ_A, B̄ = BΛ_B, with Λ_AΛ_B = I_R

Table 5.6. k-rank of certain matrices and essential uniqueness of a third-order PARAFAC model

5.2.6. Block tensor models In some applications, the data tensor X ∈ KI N of order N is decomposed into a sum of P tensor components, each with a differently structured tensor model. For example, in the case of a tensor X decomposed into the sum of P tensors X (p) ∈ KI N admitting a Tucker model, we have: X =

P  p=1

N

X (p) , X (p) = G (p) × A(p,n) , n=1

[5.90]

272

Matrix and Tensor Decompositions in Signal Processing (p,1)

(p,N )

(p,n)

×···×R where G (p) ∈ KR and A(p,n) ∈ KIn ×R , n ∈ N , are the core tensor and the factor matrices of X (p) . The matrix unfolding [5.5] then becomes:

XS1 ;S2 =

P 

(p) XS1 ;S2

=

p=1

P  p=1

⊗ A

(p,n)

!

n∈S1

Let us define the partitioned matrices:  ⊗ A(2,n) , ⊗ A(1,n) , ⊗b A(n) = n∈S n∈S 1

n∈S1

(p) GS1 ;S2

··· ,

1

⊗ A

(p,n)

!T .

n∈S2

⊗ A(P,n)

[5.91]



n∈S1

  (1) (2) (P ) GS1 ;S2 = bdiag GS1 ;S2 , GS1 ;S2 , · · · , GS1 ;S2 ,

[5.92] [5.93]

where ⊗b denotes a block-wise Kronecker product, and bdiag is the block-wise diagonalization operator. The matrix unfolding [5.91] can then be rewritten in the following compact form, corresponding to a block Tucker model: XS1 ;S2 =



  T ⊗b A(n) GS1 ;S2 ⊗b A(n) .

n∈S1

[5.94]

n∈S2

Similarly, for a block PARAFAC model, we have: X =

P 

N

X (p) , X (p) = IN,R(p) × A(p,n) ,

[5.95]

n=1

p=1

where A(p,n) ∈ KIn ×R

(p)

, n ∈ N , are the factor matrices of the component X (p) .

The matrix unfolding [5.32] then becomes: XS1 ;S2 =

P 

(p) XS1 ;S2

=

p=1

=

P  p=1

b A(n)

n∈S1

!

 A

(p,n)

n∈S1

b A(n)

!  A

(p,n)

!T

n∈S2

!T ,

[5.96]

n∈S2

where b denotes the block Khatri–Rao product, defined as:    A(2,n) , · · · ,  A(P,n) .  A(1,n) , b A(n) = n∈S n∈S n∈S n∈S1

1

1

1

[5.97]

Constrained block PARAFAC decompositions were introduced by de Almeida et al. (2007) to represent three different multiuser wireless communication systems in a unified way, whereas constrained block Tucker models were used to design MIMO OFDM systems with tensor-based space–time multiplexing codes (de Almeida et al. 2006) and for blind beamforming (de Almeida et al. 2009b).

Tensor Decompositions

273

5.2.7. Constrained tensor models 5.2.7.1. Interpretation and use of constraints Constraints in tensor models can be interpreted as interactions or linear dependencies between the matrix factors. Such dependencies are encountered in applications in psychometrics and chemometrics for example, which are at the root of the PARATUCK-2 (Harshman and Lundy 1996) and PARALIND (Bro et al. 2009) models, respectively. The PARALIND model was introduced by Carroll et al. (1980) under the name CANDELINC (canonical decomposition with linear constraints). The earliest applications of the PARATUCK-2 model in signal processing concerned the identification and blind equalization of Wiener–Hammerstein communication channels (Kibangou and Favier 2007), followed by the design of point-to-point (de Almeida et al. 2009) and cooperative (Ximenes et al. 2014) MIMO communication systems. The PARALIND model was used to estimate propagation parameters in the context of antenna processing (Xu et al. 2012). Constraints can also be used for resource allocation. For example, this is the case with the CONFAC (constrained factors) (de Almeida et al. 2008; Favier and de Almeida 2014a) and PARATUCK-(N1 , N ) (Favier et al. 2012b) models. For a unified presentation of constrained PARAFAC models, see Favier and de Almeida (2014a). In the following two sections, we present two examples of constrained tensor models: the CONFAC model and an extension of the PARAFAC model, called BTD (de Lathauwer 2008), while giving an interpretation as a CONFAC model. 5.2.7.2. Constrained PARAFAC models Various constrained PARAFAC models can be defined using a Tucker model G, A(1) , · · · , A(N ) , with A(n) ∈ KIn ×Rn . If we assume that the core tensor G ∈ RR1 ×···×RN satisfies a PARAFAC model Φ(1) , · · · , Φ(N ) ; R , then its factors Φ(n) ∈ RRn ×R allow us to introduce constraints on the matrices A(n) . Indeed, using the property [3.125] of the mode-n product, we have: N

N

n=1

n=1

X = G × A(n) , G = IN,R × Φ(n)



N

X = IN,R × A(n) Φ(n) . n=1

[5.98] The tensor X therefore satisfies a constrained PARAFAC model (1) (N ) (n) A , · · · , A ; R  whose factors are given by A = A(n) Φ(n) , n ∈ N . The (n) matrices Φ can be interpreted in two different ways: – either in terms of linear dependency between the columns of A(n) , as is the case for the PARALIND model;

274

Matrix and Tensor Decompositions in Signal Processing

– or in terms of resource allocation, as is the case for the CONFAC model, in the context of digital communications, where the columns of Φ(n) are chosen to be canonical vectors of the space RRn . Equations [5.98] can be written in scalar form as follows: gr1 ,··· ,rN =

N R  

ϕr(n) n ,r

[5.99]

r=1 n=1

xi1 ,··· ,iN =

R1 

···

r1 =1

=

RN 

gr1 ,··· ,rN

rN =1

R  N 

N 

(n)

ain ,rn

[5.100]

n=1

(n) ain ,r

with

(n) ain ,r

=

RN 

(n)

ain ,rn ϕr(n) . n ,r

rn =1

r=1 n=1

[5.101]

Using the property [2.256], the matrix unfolding [5.32] can be developed in the following different ways: XS1 ;S2 = = =

 A

(n)

!  A

n∈S1

!

n∈S1

n∈S1

!T ,

n∈S2

 A(n) Φ(n)

⊗ A(n)

(n)

!

[5.102]

 A(n) Φ(n)

!T [5.103]

n∈S2

 Φ(n)

!

n∈S1

 Φ(n)

!T

n∈S2

⊗ A(n)

!T . [5.104]

n∈S2

! After defining the matrix unfolding GS1 ;S2 =

 Φ

n∈S1

(n)

!T  Φ

n∈S2

(n)

of the

PARAFAC model Φ(1) , · · · , Φ(N ) ; R  satisfied by the core tensor G, we also have: XS1 ;S2 =

! ⊗ A(n) GS1 ;S2

n∈S1

⊗ A(n)

n∈S2

!T .

[5.105]

5.2.7.3. BTD model The BTD in blocks of multilinear rank-(Rp , Rp , 1) for a third-order tensor X ∈ KI×J×K can be viewed as a constrained PARAFAC decomposition. Indeed, it is defined by de Lathauwer (2008) as: X =

P   p=1

 A(p) (B(p) )T ◦ c(p) ,

[5.106]

Tensor Decompositions (p)

275

(p)

where the matrices A(p) = [ai,rp ] ∈ KI×Rp and B(p) = [bj,rp ] ∈ KJ×Rp have full (p)

(p)

column rank, and c(p) = [c1 , · · · , cK ]T ∈ KK is a column vector whose components are assumed to be non-zero, for p ∈ P . (p)

(p)

Defining dk,rp = ck , for rp ∈ Rp , the BTD model can be rewritten in scalar form as: xi,j,k =

RP P   p=1

rp =1

(p) (p)  (p) ai,rp bj,rp ck

=

RP P   p=1

rp =1

(p) (p) (p)  ai,rp bj,rp dk,rp .

[5.107]

We can therefore interpret the BTD model as the sum of P third-order PARAFAC models A(p) , B(p) , D(p) ; Rp , with p ∈ P , and D(p) = c(p) 1TRp ∈ KK×Rp , i.e. a matrix where the Rp columns are equal to c(p) . Each PARAFAC model is associated P with a tensor X (p) ∈ KI×J×K , and we have X = p=1 X (p) . After defining matrices partitioned into P column blocks A = [A(1) · · · A(P ) ] ∈ K , B = [B(1) · · · B(P ) ] ∈ KJ×R , D = [D(1) · · · D(P ) ] ∈ KK×R and C = P  Rp , the BTD model [5.106] can be rewritten [c(1) · · · c(P ) ] ∈ KK×P , with R = I×R

p=1

as the following block PARAFAC model: X = I3,R ×1 A ×2 B ×3 D.

[5.108]

The factor matrix D can also be written as: (P ) (1) D = [c ·)* · · c(P+) ] = CΦ, · · c(1)+, · · · , c ( ( ·)* R1

⎡ ⎢ where Φ = ⎣

1TR1

[5.109]

RP

⎤ ..

.

⎥ ⎦ ∈ KP ×R can be viewed as a constraint matrix.

1TRP We can therefore conclude that the BTD model is equivalent to the constrained block PARAFAC model A, B, CΦ; R  (Favier and de Almeida 2014a). Using the formulae in Table 5.3, the matrix unfoldings of X are given by: XJK×I = (B  CΦ)AT =

P 

(B(p)  c(p) 1TRp )(A(p) )T

[5.110]

p=1

XKI×J = (CΦ  A)BT

[5.111]

XIJ×K = (A  B)ΦT CT .

[5.112]

276

Matrix and Tensor Decompositions in Signal Processing

Noting that, by the hypotheses, A(p) and B(p) have full column rank equal to Rp and D(p) = cp 1TRp has rank one, we can conclude that the BTD model amounts to decompose a tensor X into a sum of P tensors X (p) of multilinear rank (Rp , Rp , 1). The BTD model is unique up to permutations of terms of same rank Rp , scaling ambiguities (αp , βp , γp ) in the factors of each term, with αp βp γp = 1, ∀p ∈ P , and a non-singular multiplicative ambiguity matrix Λ(p) ∈ CRp ×Rp , since any triplet (αp A(p) Λ(p) , βp B(p) (Λ(p) )−T , γp c(p) ) gives the same tensor as defined in [5.106]. In the special case where Rp = 1, ∀p ∈ P , the BTD model [5.106] reduces to a standard CPD model of rank P : X =

P 

(ap bTp ) ◦ c(p) =

p=1

P 

(ap ◦ bp ) ◦ c(p) =

p=1

xijk =

P 

P 

ap ◦ bp ◦ cp ,

[5.113]

p=1

aip bjp ckp ,

[5.114]

p=1

which corresponds to a CPD model whose matrix factors are A = [a1 · · · aP ] ∈ KI×P , B = [b1 · · · bP ] ∈ KJ×P , C = [c1 · · · cP ] ∈ KK×P , with cp = c(p) . By comparing [5.113] with [5.106], we can conclude that the BTD decomposition into terms of multilinear rank-(Rp , Rp , 1) replaces the rank-one factors ap bTp with the factors A(p) (B(p) )T of rank Rp . Using the formula [5.50] and the column-blockpartitioned forms of (A, B, D), the kth frontal slice for the BTD model is given by: X..k = Adiag(Dk. )BT

 (1) (P )  (1) (P ) = [A(1) · · · A(P ) ]diag [ck , · · · , ck , · · · , ck , · · · , ck ] [B(1) · · · B(P ) ]T ( ( )* + )* + R1

=

P 

RP

(p)

ck A(p) (B(p) )T ∈ KI×J .

[5.115]

p=1

This frontal slice is therefore equal to the sum of P products of two matrices of (p) rank Rp , each weighted by the kth component ck of the vector c(p) . This expression can be deduced directly from [5.106] as follows: X =

K P   (p) (p) T   (p) (K) ck ek ) A (B ) ◦ ( p=1

k=1

Tensor Decompositions

=

K  P  k=1

=

K 

277

 (p) (K) ck A(p) (B(p) )T ◦ ek )

p=1 (K)

X..k ◦ ek ,

[5.116]

k=1

with X..k defined in [5.115]. 5.3. Examples of tensor models In this section, we present four examples of tensor models for multidimensional harmonics and an instantaneous linear mixture, using both a BTD model of the received signals and a PARAFAC model of the tensor of fourth-order cumulants of the received signals. The fourth example concerns the representation of a linear FIR system using a tensor of fourth-order cumulants of the system output. 5.3.1. Model of multidimensional harmonics Multidimensional harmonics retrieval (MHR) from the observation of a sum of complex exponentials is an old and classical problem in the field of signal processing (Pisarenko 1973). The signal model corresponding to the superposition of M undamped exponentials on a P -dimensional grid can be written as: xi1 ,··· ,iP =

M  m=1

αm

P 

ip −1 zm,p , ip ∈ Ip ,

[5.117]

p=1

where αm is the complex amplitude of the mth exponential zm,p = ejωm,p , and ωm,p is the angular frequency of the pth dimension. The tensor of received signals X ∈ CI P satisfies a structured PARAFAC model g, V(1) , · · · , V(P ) ; M  of rank M , where g  [α1 , · · · , αM ]T is the amplitude vector, and the factors V(p) are Vandermonde matrices defined as: (p) (p) V(p) = V.1 , · · · , V.M ∈ CIp ×M , p ∈ P , [5.118] with: (p) Ip −1 T V.m = 1, zm,p , · · · , zm,p ∈ CIp , m ∈ M .

[5.119]

This PARAFAC model is called a higher order Vandermonde decomposition (HOVDMD) by Papy et al. (2005). Its uniqueness properties were studied by Sidiropoulos (2001). Using a correspondence between PARAFAC models and TT models (Zniyed et al. 2019a), the model [5.117] can be expressed as a train of

278

Matrix and Tensor Decompositions in Signal Processing

Vandermonde tensors (Zniyed et al. 2019b). In the case of a high-order tensor, this approach allows us to reduce the computational complexity of parameter estimation, since the high-order PARAFAC model is transformed into a train of coupled third-order tensors, each containing a Vandermonde factor. 5.3.2. Source separation 5.3.2.1. Instantaneous mixture modeled using the received signals One of the earliest applications of the BTD model was the blind source separation (BSS) problem (de Lathauwer 2011), which we will briefly discuss here. Consider the following matrix equation, which models an instantaneous linear mixture, (also called a memoryless or non-convolutive mixture) of P sources, measured using K sensors, over N sampling periods: Y = MS =

P 

M.p Sp. =

p=1

P 

Y(p) ,

[5.120]

p=1

or in scalar form: yk,n =

P 

mk,p sp;n =

p=1

P 

(p)

yk,n , k ∈ K, n ∈ N .

[5.121]

p=1

The matrix M ∈ KK×P is called the instantaneous mixture, whereas S ∈ KP ×N and Y ∈ KK×N are the matrices containing the source signals and the observation signals over N sampling periods, respectively. The BSS problem is to estimate the matrices M and S from the observed signals in Y only. This problem can be reformulated using a BTD model. We assume that N is odd and set N = I + J − 1 with I = J, and we construct a Hankel matrix H(p) ∈ KI×I from the row vector Sp. ∈ K1×(2I−1) , as in [3.305]: ⎡ ⎤ sp1 sp2 ··· spI ⎢ sp2 sp3 · · · sp,I+1 ⎥ ⎢ ⎥ (p) H(p) = ⎢ . [5.122] ⎥ = [hij ] .. ⎣ .. ⎦ . spI

sp,I+1

···

sp,2I−1

(p)

hij = sp,i+j−1 .

[5.123]

The matrix H(p) corresponds to a Hankelization of the row vector Sp. . The matrix Y = M.p Sp. ∈ KK×N , which is the component of Y associated with the source p, is transformed into a third-order tensor X (p) ∈ KI×J×K by defining the frontal (p) slice X..k as: (p)

(p)

(p)

xijk = mkp hij = mkp sp,i+j−1 , i, j ∈ I, k ∈ K, (p)

or equivalently X..k = mkp H(p) .

Tensor Decompositions

279

Using the expression [5.116], the tensor X (p) can be written as: X (p) =

K 

(p)

(K)

X..k ◦ ek

=

k=1

K 

(K)

mkp H(p) ◦ ek

k=1

= H(p) ◦ (

K 

(K)

mkp ek ) = H(p) ◦ mp ,

[5.124]

k=1

where mp = M.p ∈ KK is the pth column vector of the mixture matrix M. It is important to note that rewriting the row vector Sp. as the Hankel matrix H(p) defined in [5.122] introduces redundancy into the model by repeating the source signals. The observation matrix Y defined in [5.120] is transformed into the third-order tensor: X =

P 

X (p) =

p=1

P 

H(p) ◦ mp ∈ KI×I×K ,

[5.125]

p=1

which satisfies a BTD model. Since the measurement matrix Y is replaced with the (p) redundant measurement tensor X after replacing each row Yk. ∈ K1×(2I−1) of (p) Y(p) with a frontal slice X..k ∈ KI×I of the tensor X (p) , we conclude that this transformation of [5.120] to [5.125] increases the number of measurements from KN = K(2I − 1) to IJK = KI 2 . Writing H for the Hankelization operator, the transformation of the matrix model [5.120] to the tensor model [5.125] is summarized by the following equations: H

K1×(2I−1)  Sp. → H(p) ∈ KI×I → X..k = mkp H(p) ⇓ KK×(2I−1)  Y(p) → X (p) ∈ KI×I×K , p ∈ P .

[5.126]

De Lathauwer (2011) used the BTD model for mixtures corresponding to linear combinations of exponentials observed during N = 2I − 1 sampling periods, i.e. sp,n+1 =

Rp  rp =1

crp ,p zrnp ,p , p ∈ P  , n = 0, 1, · · · , N − 1.

[5.127]

The Hankel matrix H(p) defined in [5.122] then admits the following Vandermonde factorization: H(p) = V(p) C(p) (V(p) )T ,

[5.128]

280

with:

Matrix and Tensor Decompositions in Signal Processing



1

1

··· ···

1



⎢ z1,p z2,p zRp ,p ⎥ ⎢ ⎥ V(p) = ⎢ .. .. ⎥ ∈ KI×Rp ⎣ . . ⎦ I−1 I−1 I−1 z2,p · · · zR z1,p p ,p ⎡ ⎤ c1,p 0 ··· 0 ⎢ 0  c2,p · · · 0 ⎥ ⎢ ⎥ C(p) = ⎢ . = diag c1,p , ⎥ . .. ⎦ ⎣ .. 0 0 · · · cRp ,p

 · · · ,cRp ,p ∈ KRp ×Rp .

If we assume that I ≥ max(R1 , R2 , · · · , Rp ) and the exponentials are distinct, i.e. the generators zrp ,p of the Vandermonde matrix V(p) are distinct for rp ∈ Rp , then V(p) has full column rank. Therefore, assuming that all coefficients crp ,p are non-zero, we can deduce that H(p) has full rank equal to Rp , which implies that the tensor slice X (p) has trilinear rank (Rp , Rp , 1), by [5.124] and [5.128]. 5.3.2.2. Instantaneous mixture modeled using cumulants of the received signals Consider the output of a sensor network in the blind source separation problem in the case of an instantaneous and noisy linear mixture, analogous to equation [1.137], with the correspondences (A, sn ) ⇔ (H, x(n)): y(n) = Hx(n) + e(n),

[5.129]

where x(n) ∈ C , y(n) and e(n) ∈ C represent the complex-valued vector of P sources that must be separated, the output vector of K sensors and an additive white Gaussian noise (AWGN) sequence that is independent of the input signal x(n) at time n, respectively, and H ∈ CK×P is the mixture matrix. The sources are assumed to be statistically independent and stationary in the strict sense, and circular to fourth order, which implies that their fourth-order cumulant is a diagonal tensor: P

K

cum(xp1 (n), xp2 (n), x∗p3 (n), x∗p4 (n)) = κ4x (p1 )δp1 p2 p3 p4 ,

[5.130]

where δp1 p2 p3 p4 is the generalized Kronecker delta, which is equal to 1 for p1 = p2 = p3 = p4 and 0 otherwise, with p1 , p2 , p3 , p4 ∈ P . Consider the fourth-order cumulants of the output signal2:   cy,2,2  cy,y,y∗ ,y∗ (τ1 , τ2 , τ3 ) = cum y(n), y(n − τ1 ), y∗ (n − τ2 ), y∗ (n − τ3 ) . 2 For complex-valued signals, we can define different cumulants of order P according to the numbers P1 and P2 of non-conjugated and conjugated terms, with P1 + P2 = P . In the case of circular signals in the strict sense, the cumulants of order P = P1 + P2 are zero if P1 = P2 , i.e. if the number of non-conjugated and conjugated terms is different (Picinbono 1994). See the reminders on the cumulants of random signals in section A1.3.3 of the Appendix.

Tensor Decompositions

281

Due to the stationarity hypothesis, these cumulants only depend on the time shifts (τ1 , τ2 , τ3 ). They form a seventh-order tensor with four spatial dimensions and three time dimensions. In the case where the time shifts are zero (τ1 = τ2 = τ3 = 0), the cumulants cum(y, y, y∗ , y∗ ) form a fourth-order tensor C4y ∈ CK×K×K×K with only four spatial modes. P ROPOSITION 5.10.– The tensor C4y of output cumulants satisfies a PARAFAC model g, H, H, H∗ , H∗ ; P  of rank P with a Hermitian symmetry, factor matrices equal to the mixture matrix and its conjugate, and g = [κ4x (1), · · · , κ4x (P )]T . P ROOF .– Using the index convention, the output of the sensor k at time n can be written as: yk (n) =

P 

hkp xp (n) + ek (n) = hkp xp (n) + ek (n) , k ∈ K.

p=1

Since the additive white noise is assumed to be Gaussian, its fourth-order cumulant is zero, which implies that the fourth-order cumulant of the output y(n) is independent of the noise. Omitting the time index, this cumulant is given by: cum(yk1 , yk2 , yk∗3 , yk∗4 ) = cum(hk1 p1 xp1 , hk2 p2 xp2 , h∗k3 p3 x∗p3 , h∗k4 p4 x∗p4 ) with ki ∈ K for i ∈ 4. Taking into account the multilinearity property of the cumulant, and using the expression [5.130] of the source cumulants, the element ck1 ,k2 ,k3 ,k4  cum(yk1 , yk2 , yk∗3 , yk∗4 ) of the tensor C4y is given by: ck1 ,k2 ,k3 ,k4 = hk1 p1 hk2 p2 h∗k3 p3 h∗k4 p4 cum(xp1 , xp2 , x∗p3 , x∗p4 ) = κ4x (p1 )δp1 p2 p3 p4 hk1 p1 hk2 p2 h∗k3 p3 h∗k4 p4

[5.131]

= κ4x (p)hk1 p hk2 p h∗k3 p h∗k4 p =

P 

κ4x (p)hk1 p hk2 p h∗k3 p h∗k4 p .

[5.132]

p=1

The fourth-order cumulants of the output therefore form a fourth-order tensor characterized by a Hermitian symmetry. Equation [5.132] describes a PARAFAC model g, H, H, H∗ , H∗ ; P  of rank P whose four matrix factors are equal to the mixture matrix or its conjugate, with a scaling factor κ4x (p) for each column p equal to the fourth-order cumulant of the source P , which defines g = [κ4x (1), · · · , κ4x (P )]T . 

282

Matrix and Tensor Decompositions in Signal Processing

This model generalizes the DSym CP model presented in Table 5.5 to order four. 2 2 A matrix unfolding C4y ∈ CK ×K of the cumulants tensor C4y ∈ CK×K×K×K can be deduced directly from the general matricization formula [5.44], namely:   C4y = (H  H)diag κ4x (1), · · · , κ4x (P ) (H  H)H . [5.133] Thus, we obtain a Hermitian matrix due to the Hermitian symmetry of the cumulants tensor C4y . This equation can be viewed as a diagonalization of the tensor C4y of fourth-order cumulants. It provides the basis for developing source separation methods (Comon and Cardoso 1990). Applying the formula [2.257] also allows us to vectorize the cumulants tensor C4y in the following form: ⎡ ⎤ κ4x (1) ⎢ ⎥ .. c4y = vec(C4y ) = (H∗  H∗  H  H) ⎣ [5.134] ⎦. . κ4x (P ) 5.3.3. Model of a FIR system using fourth-order output cumulants Consider a stationary FIR system of order N described by the following equation: y(t) =

N 

bn x(t − n) + e(t),

[5.135]

n=0

where {x(t)} is a sequence of independent, centered, stationary, complex-valued random variables admitting the following fourth-order cumulant:   cum x(t), x(t − τ1 ), x∗ (t − τ2 ), x∗ (t − τ3 ) = κ4x δ0,τ1 ,τ2 ,τ3 . [5.136] Here, note that we are assuming the samples of the input signal to be time independent, whereas in the source separation problem above the sources were space independent. The model [5.135] can also be viewed as a convolutive mixture of sources, or a noisy moving average (MA) model. P ROPOSITION 5.11.– The fourth-order output  cumulants c4y (i1 , i2 , i3 )  cy,2,2   ∗ ∗ cum y(t), y(t − i1 ), y (t − i2 ), y (t − i3 ) satisfy a third-order PARAFAC model g, H, H∗ , H∗ ; N + 1  of rank N + 1, with:

Tensor Decompositions

⎡ ⎢ ⎢ H= ⎢ ⎢ ⎣

b0

b1

0 .. .

b0 .. .

0

...

··· .. . .. . 0

⎤ bN .. ⎥ . ⎥ ⎥ ∈ C(N +1)×(N +1) ⎥ b1 ⎦ b0

g = κ4x [b0 , b1 , · · · , bN ]T = κ4x HT1. ,

283

[5.137]

[5.138]

where H1. represents the first row of H, which is a Toeplitz matrix. P ROOF .– As in the previous example, the hypothesis of additive white Gaussian noise implies that the fourth-order cumulant of the output does not depend on the noise. Using the index convention, the equation of the model, and the multilinearity property of the cumulant, we obtain: c4y (i1 , i2 , i3 ) = cum(y(t), y(t − i1 ), y ∗ (t − i2 ), y ∗ (t − i3 ))  = bn bτ1 b∗τ2 b∗τ3 cum x(t − n), x(t − i1 − τ1 ),  x∗ (t − i2 − τ2 ), x∗ (t − i3 − τ3 ) . The stationarity and independence hypotheses on the input imply that: c4y (i1 , i2 , i3 ) = κ4x bn bτ1 b∗τ2 b∗τ3 δn,i1 +τ1 ,i2 +τ2 ,i3 +τ3 = κ4x

N 

bn bn−i1 b∗n−i2 b∗n−i3 .

[5.139]

n=0

/ [0, N ], we deduce that c4y (i1 , i2 , i3 ) = 0, ∀|i1 |, |i2 |, |i3 | > N . Since bn = 0, ∀n ∈ These cumulants form a third-order tensor C4y ∈ C(N +1)×(N +1)×(N +1) . Defining him ,n = bn−im , for m = 1, 2, 3, with im = 0, 1, · · · , N , the cumulant of the output can also be written as: c4y (i1 , i2 , i3 ) = κ4x

N 

bn hi1 ,n h∗i2 ,n h∗i3 ,n ,

[5.140]

n=0

which is the equation of a third-order PARAFAC model g, H, H∗ , H∗ ; N + 1  of rank N + 1, where H and g are as defined in [5.137] and [5.138].  Using the general matricization formula [5.32] for a PARAFAC model, a matrix unfolding [C4y ]I1 I2 ×I3 of the cumulants tensor C4y is given by: C4y = (H  H∗ )DHH ∈ C(N +1)

2

×(N +1)

.

[5.141]

284

Matrix and Tensor Decompositions in Signal Processing

with D = κ4x diag(b0 , b1 , · · · , bN ). Applying the identity [2.257], with A = H  H∗ , C = HH , and λn = κ4x bn−1 , for n = 1, · · · , N + 1, allows us to deduce the following vectorized form for the output cumulants tensor: c4y = vec(C4y ) = (H∗  H  H∗ )g.

[5.142]

Note that the matricized and vectorized forms [5.141] and [5.142] do not take the cumulants c4y (i1 , i2 , i3 ) into account for i1 , i2 , i3 ∈ [−N, −1]. One such matricization with i1 , i2 , i3 ∈ [−N, N ] is considered in Fernandes et al. (2008), as well as the case of a MIMO FIR system, i.e. with y(t) ∈ CK and x(t) ∈ CP , corresponding to the case of K sensors and P sources.

Appendix Random Variables and Stochastic Processes

A1.1. Introduction The goal of this appendix is to give a brief overview of a few fundamental results about the higher order statistics (HOS) of random signals (i.e. of order greater than two). HOS play an important role in digital signal processing (SP) for the representation, detection, analysis, classification, equalization and filtering of signals such as radar, sonar, seismic, biomedical and communication signals. In many applications, second-order statistics are not sufficient to completely characterize the signals to be processed. It is then beneficial to use HOS (higher order cumulants and their associated Fourier transforms, known as polyspectra). HOS-based SP methods, originally developed to solve the blind source separation (BSS) and blind deconvolution problems, offer the following advantages: – they are robust to additive white Gaussian noise (AWGN), since the cumulants of order greater than two of Gaussian signals are zero, and the cumulants of the sum of independent stationary random signals are equal to the sum of the cumulants of each random signal; consequently, in the presence of an additive Gaussian measurement noise, independent of the noiseless output signal, as commonly assumed in most applications, HOS-based SP methods are blind to this additive noise; – they allow us to identify non-minimum phase linear systems (i.e. with unstable inverses); this is not possible with second-order statistics (autocorrelation and power spectrum) that do not preserve non-minimum phase information, making phase reconstruction only achievable for minimum phase systems; – they are the basis of certain blind channel equalization techniques that compensate the distortion undergone by a signal transmitted through a channel. In the

Matrix and Tensor Decompositions in Signal Processing, First Edition. Gérard Favier. © ISTE Ltd 2021. Published by ISTE Ltd and John Wiley & Sons, Inc.

286

Matrix and Tensor Decompositions in Signal Processing

context of digital communications, the function of an equalizer is to invert the channel in such a way that the cascade of the channel and the equalizer is as close as possible to the identity transfer function in order to recover the transmitted symbols. Since a blind equalizer does not use a learning sequence, we also speak of unsupervised equalization, with the goal of improving the transmission rate of the communication system; – they are useful for analyzing non-Gaussian signals, detecting and characterizing non-linearities, and identifying nonlinear systems. Higher-order cumulants can be viewed as tensors (McCullagh 1987), so it is natural for methods based on cumulants to take advantage of tensor decompositions, as illustrated in Chapter 5. In the first section of this appendix, we will recall a few definitions and properties relating to random variables (r.v.s), while distinguishing between real and complex r.v.s, as well as between scalar and multidimensional r.v.s. The definition and properties of cumulants will be presented for real r.v.s, then for complex r.v.s, and the notion of circular complex r.v.s will be introduced. The Gaussian distribution, one of the most widely used distributions in SP applications, will be considered in particular detail. In the second section, HOS of discrete-time random signals will be presented for real random signals, then for complex random signals. Here, it should be noted that complex-valued random signals play an important role in SP, especially in array processing and digital communications. This was the motivation for writing this appendix, which should also prove useful for the examples given in Chapter 5 involving the application of cumulant tensors to model signals and systems. The notions of polyspectra, also called higher order spectra, and circular signals will be defined. While the power spectrum allows a second-order spectral analysis, polyspectra such as the bispectrum or the trispectrum allow us to perform a so-called higher order spectral (or polyspectral) analysis. In the third section, the use of HOS will be illustrated with the supervised identification of linear time-invariant (LTI) systems and homogeneous quadratic systems using higher order spectra and cross-spectra of input and output signals. This will also allow us to introduce the link between Volterra models and tensors via Volterra kernels. In Volume 4 of this set of books, this link will be used to reduce the parametric complexity of Volterra models via symmetrization and tensor decomposition of the kernels, as well as for their identification (Favier and Bouilloc 2010; Favier et al. 2012a). For a more detailed presentation of HOS and their applications in SP, see the books by Nikias and Petropulu (1993), Picinbono (1993), Amblard et al. (1996b), Lacoume

Appendix

287

et al. (1997) and Nandi (1999), as well as the articles by Brillinger (1965), Nikias and Raghuveer (1987), Mendel (1991), and Nikias and Mendel (1993). A1.2. Random variables First, let us recall a few results about second-order statistics of real scalar r.v.s (section A1.2.1) and multidimensional r.v.s (section A1.2.2). The first and second characteristic functions are defined, also called moment-generating and cumulant-generating functions, respectively, since they allow us to generate moments and cumulants. Some relationships between cumulants and moments are presented, as well as some properties of cumulants. A1.2.1. Real scalar random variables A1.2.1.1. Distribution function and probability density Let (Ω, F , P ) be a probability space, where Ω is the set of all possible events for a random experiment, F is a set of events, called a σ-algebra on Ω, and P denotes a probability measure. For more details, see Picinbono (1993) and Favier (2019). Given a real-valued scalar r.v. x(ω), with ω ∈ Ω, its distribution function, denoted Fx , is a mapping from R into [ 0, 1 ] that defines the probability that x takes a value less than or equal to u: Fx (u) = Prob x(ω) ≤ u . If Fx (u) is assumed to be continuous and differentiable, its derivative px (u)  is called the probability density function (pdf) of x. It satisfies:

dFx (u) du

Prob a < x ≤ b = Fx (b) − Fx (a) =

=

b a

= px (u)du with

∞ −∞

px (u)du = 1.

A1.2.1.2. Moments and central moments D EFINITION.– The ith-order moment μx,i of the real, continuous, scalar r.v. x(ω) with pdf px (u) is defined as1: = ∞ ui px (u)du, [A1.1] μx,i  E[xi ] = −∞

where E is the mathematical expectation. The first-order moment μx,1  E[x] = μx is called the mean of x. We say that the r.v. x is centered if its mean is zero. If not, x 1 In the rest of the appendix, x(ω) will be written x to alleviate the notation.

288

Matrix and Tensor Decompositions in Signal Processing

can be centered by subtracting its mean, giving E[x − μx ] = 0. The ith-order moment will also be denoted mx,i . In the case of a discrete r.v. x taking the values xn with the probabilities pn , the mean value of x is defined by:  xn p n , [A1.2] E[x] = n

which corresponds to the average of all possible values xn weighted by their probabilities pn . If the distribution of x is symmetric, i.e. if its pdf is an even function (px (−u) = px (u)), we deduce from the definition [A1.1] that the odd-order moments are zero: μx,2k+1 = 0 , ∀k ≥ 0.

[A1.3]

D EFINITION.– The ith-order central moment of x, denoted νx,i , is the ith-order moment about the mean, defined as: = ∞ i νx,i  E[(x − μx ) ] = (u − μx )i px (u)du. [A1.4] −∞

The second-order central moment νx,2  σx2 is called the variance of x, and its square root σx is called the standard deviation. We have the relation: σx2 = E[(x − μx )2 ] = E[x2 ] − [E(x)]2 = E[x2 ] − μ2x ,

[A1.5]

i.e. the variance is equal to the difference between the mean of the square and the square of the mean. To characterize the shape of a distribution, we define two dimensionless quantities in terms of the third- and fourth-order central moments, namely the skewness coefficient γx,3 , and the kurtosis γx,4 , in the following normalized forms: γx,3 =

νx,3 νx,4 − 3σx4 , γ = . x,4 σx3 σx4

[A1.6]

Note that the kurtosis is zero for a zero-mean Gaussian r.v. Table A1.1 recalls a few definitions and results regarding the second-order statistics of jointly distributed scalar r.v.s, with the joint pdf px,y (u, v), denoted p(u, v).

Appendix

289

Quantities

Definitions ∞ i j  ∞ Joint moment E[xi y j ] = −∞ −∞ u v p(u, v)dudv   ∞ ∞ Central joint moment E[(x − μx )i (y − μy )j ] = −∞ −∞ (u − μx )i (v − μy )j p(u, v)dudv Special cases Cross-correlation

ϕxy  E[xy]

Cross-covariance

σxy  Cov[x, y] = E[(x − μx )(y − μy )]

Correlation coefficient

ρxy 

Cov[x,y] σx σy

=

σxy σx σy

Relation and property Cov[x, y] = E[xy] − μx μy or σxy = ϕxy − μx μy |ρxy | ≤ 1 ⇔ |Cov[x, y]| ≤ σx σy

Table A1.1. Some definitions for jointly distributed r.v.s x and y

A1.2.1.3. Independence, non-correlation and orthogonality Table A1.2 gives definitions of uncorrelated r.v.s and orthogonal r.v.s. The notion of orthogonality stems from the fact that E[xy] can be interpreted as the inner product of x and y in the Hilbert space of scalar r.v.s. These two properties are equivalent for centered random variables. Properties

Definitions

x and y uncorrelated ρxy = 0 ⇔ Cov[x, y] = 0 ⇔ ϕxy = μx μy x and y orthogonal

ϕxy = E[xy] = 0

Table A1.2. Definitions of uncorrelated and orthogonal r.v.s x and y

The r.v.s x and y are said to be independent if their joint pdf is separable2: p(u, v) = px (u)py (v) ⇒ E[xi y j ] = E[xi ]E[y j ] ∀i, j ∈ N∗ .

[A1.7]

2 When two continuous r.v.s are independent, their marginal pdfs px (u) and py (v) are equal to the conditional probability densities px (u/v) of x given y = v and py (v/u) of y given x = u, respectively:  p(u, v) px (u)py (v) = = px (u)  p(u, v)dv px (u/v)  py (v) py (v)  p(u, v) px (u)py (v) = = py (v)  p(u, v)du. py (v/u)  px (u) px (u) p (i) p (j)

y In the case of two independent discrete r.v.s, we have: px (i/j) = p(i,j) = x py (j) = py (j)  px (i)  P [x = xi ] = j p(i, j), where px (i/j)  P [x = xi /y = yj ] is the conditional

290

Matrix and Tensor Decompositions in Signal Processing

In particular, for i = j = 1, we then have ϕxy = μx μy . This means that, if the r.v.s x and y are statistically independent, then they are uncorrelated, but they are not orthogonal in general. Orthogonality (ϕxy = 0) is satisfied if at least one of the r.v.s is zero-mean. The converse is not true, i.e. two uncorrelated r.v.s are not independent in general. Independence is therefore a more restrictive concept that non-correlation. An exception is the case of two jointly Gaussian r.v.s, where non-correlation implies statistical independence. The proof of this result is given in section A1.2.3.3. These properties are summarized in Table A1.3. Hypotheses ⎡

x and y independent

Properties ⎤

⇒ x and y uncorrelated

x and y uncorrelated ⎦ ⇒ x and y orthogonal and x or/and y is/are zero-mean ⎡ ⎤ x and y jointly Gaussian ⎣ ⎦ ⇒ x and y independent and uncorrelated



Table A1.3. Properties of r.v.s

R EMARK A1.1.– In SP, the notions of non-correlation and statistical independence are the basis of two classes of fundamental methods for solving the BSS problem in an instantaneous, i.e. non-convolutive, mixture: principal component analysis (PCA) and independent component analysis (ICA). PCA methods attempt to decorrelate the signals of the mixture to obtain sources that are spatially decorrelated but not independent in general. These methods either diagonalize the covariance matrix of the received signals using the EVD decomposition or exploit the SVD decomposition of a data matrix directly (see Chapter 1). ICA methods, on the other hand, seek to maximize the statistical independence of the sources, either by using adaptive algorithms or HOS-based processings of data blocks. Here, the separation principle corresponds to maximizing an independence criterion, called a contrast function (Comon 1994), and independence of the sources to p-th order means that all cross-cumulants of the sources are zero up to order p. A1.2.1.4. Ensemble averages and empirical averages In practice, the statistics of r.v.s are estimated as averages computed from finitely many measurements performed with identical systems or by repeating the same probability, px (i) and py (j) are the marginal probabilities, and  p(i, j)  P [(x = xi ) and (y = yj )]. Similarly, we have: py (j/i) = py (j)  P [y = yj ] = i p(i, j).

Appendix

291

experiment multiple times under the same conditions. This is equivalent to replacing ensemble averages (in the sense of the mathematical expectation) with empirical averages. Thus, if we assume that N samples {xn , yn ; n ∈ N } of independent realizations of the variables x and y are known, we can estimate the empirical moments, cross-correlation and cross-covariance as: μ x,i

N N N 1  i 1  1  = x , μ x = xn , μ y = yn N n=1 n N n=1 N n=1

σ xy =

N N 1  1  (xn − μ x )(yn − μ y ) , ϕ xy = xn y n . N n=1 N n=1

In the case of discrete-time stationary random signals, the statistics are estimated using time averages computed from signals measured over a sufficiently long window. Replacing the ensemble averages by time averages corresponds to the hypothesis that the signals are ergodic, which is very commonly assumed in SP to estimate the HOS. Note that ergodicity requires stationarity, whereas a stationary random signal might not be ergodic (see section A1.3.2). A1.2.1.5. Characteristic functions, moments and cumulants The P th-order moment of the scalar r.v. x, defined in [A1.1], is generated by the first characteristic function using the following formula: mx,P = E[xP ] =

> 1 dP 1 ) > Ξ (u)  P Ξ(P x x (0), u=0 j P duP j

[A1.8]

(P )

where j 2 = −1 and Ξx (0) is the P -th derivative of Ξx (u) calculated at u = 0. In the continuous case, the characteristic function is defined as the mean of ejux : = +∞ ejuv px (v)dv, Ξx (u) = E[ejux ] = −∞

and in the discrete case:  ejuxn pn . Ξx (u) = n

P ROOF .– Using the Taylor–MacLaurin expansion of ejux , we obtain: ∞ ∞   j k mx,k k (ju)k Ξx (u) = E[ejux ] = E[xk ] = u . k! k! i=0 k=0

By differentiating P times and taking the derivative at u = 0, we deduce the expression [A1.8] for the P th-order moment. 

292

Matrix and Tensor Decompositions in Signal Processing

The second characteristic function is defined as the natural logarithm of Ξx (u) in Papoulis (1984): Ψx (u) = Log[Ξx (u)].

[A1.9]

In the case of a non-centered r.v., we have: 2 cum(x, x) = E[x2 ] − E[x] = νx,2 = σx2 3 cum(x, x, x) = E[x3 ] − 3E[x]E[x2 ] + 2 E[x] = νx,3 2 2 4 cum(x, x, x, x) = E[x4 ] − 4E[x]E[x3 ] − 3 E[x2 ] + 12 E[x] E[x2 ] − 6 E[x] = νx,4 − 3σx4 , where νx,3 and νx,4 are the third- and fourth-order central moments as defined in [A1.4]. Note that the cumulants are identical to the central moments up to third order. In the case of a zero-mean r.v., the second-, third-, and fourth-order cumulants satisfy the following relations with the moments: cum(x, x) = E[x2 ] ; cum(x, x, x) = E[x3 ] 2 cum(x, x, x, x) = E[x4 ] − 3 E[x2 ] .

[A1.10] [A1.11]

A1.2.2. Real multidimensional random variables A1.2.2.1. Second-order statistics In the case of a real-valued N -dimensional r.v. (x1 , · · · , xN ), we define the vector x ∈ RN whose components are the r.v.s xn . We say that x is a (real) random vector of size N , and the second-order statistics (cross-correlation and cross-covariance) of the r.v.s xn define the autocorrelation and covariance matrices of the random vector x. Table A1.4 gives the definitions of the autocorrelation, cross-correlation and cross-covariance matrices of real random vectors.

covariance,

R EMARK A1.2.– We can make the following remarks: – The element (i, j) of the autocorrelation matrix Φx is the correlation ϕxi xj = E[xi xj ] between the r.v.s xi and xj . The autocorrelation matrix Φx is symmetric and non-negative definite, i.e. uT Φx u ≥ 0 for every non-zero real vector u. Indeed, we have uT E[xxT ] u = E[uT xxT u] = E[y 2 ] ≥ 0, where y  xT u = uT x. – When the r.v.s xi , i ∈ n, are mutually orthogonal (ϕxi xj = 0 ∀i, j ∈ n, i = j), the autocorrelation matrix is diagonal, and the ith element of the diagonal is equal to ϕxi = E[x2i ].

Appendix

Quantities

Definitions

Autocorrelation matrix

Φx  E[xxT ]

Covariance matrix

Σx  E[(x − μx )(x − μx )T ]

Cross-correlation matrix

Φxy  E[xyT ]

293

Cross-covariance matrix Σxy  E[(x − μx )(y − μy )T ] Hypotheses

Relations and properties Σx = Φx − μx μT x

x centered

Σx = Φx Φxy = ΦT yx T Σxy = ΣT yx = Φxy − μx μy

x and y uncorrelated

Σxy = 0 ⇔ Φxy = μx μT y

x and y orthogonal

Φxy = 0

Table A1.4. Definitions and properties of the second-order statistics of real random vectors x and y

– The element (i, j) of Σx is the cross-covariance σxi xj = E[(xi −μxi )(xj −μxj )] between the r.v.s xi and xj , where σx2i = E[(xi − μxi )2 ] is the variance of xi . The matrix Σx is also symmetric. – When the r.v.s xi , i ∈ n, are mutually uncorrelated, we have σxi xj = σx2i δij , the matrix Σx is diagonal, and the ith element σx2i of the diagonal is equal to the variance of xi . – From the relation Σxy = Φxy − μx μTy , we can conclude that if at least one of the two random vectors x and y is zero-mean, then the cross-correlation and crosscovariance matrices are identical: Φxy = Σxy . If so, non-correlation of the random vectors (Σxy = 0) is equivalent to orthogonality (Φxy = 0). – Like for scalar r.v.s, if the random vectors are independent, then they are uncorrelated, since independence of these vectors implies: Σxy = E[(x − μx )(y − μy )T ] = E[x − μx ]E[(y − μy )T ] = 0.

[A1.12]

The converse is not true in general, with one exception being the case of jointly Gaussian vectors, for which non-correlation implies statistical independence. This result is proven in section A1.2.3.3.

294

Matrix and Tensor Decompositions in Signal Processing

In the case of complex-valued random vectors, the definitions and relations recalled in Table A1.4 become: Φx  E[xxH ] = ΦH x

[A1.13]

H Σx  E[(x − μx )(x − μx )H ] = Φx − μx μH x = Σx

[A1.14]

Φxy  E[xyH ] = ΦH yx

[A1.15]

H Σxy  E[(x − μx )(y − μy )H ] = Φxy − μx μH y = Σyx .

[A1.16]

The autocorrelation matrix Φx is then Hermitian and non-negative definite, in the sense that uH Φx u ≥ 0 for every non-zero complex vector u. The same holds for the covariance matrix Σx . A1.2.2.2. Characteristic functions, moments and cumulants Given the real random vector x ∈ RN , the P th-order moments and cumulants of its components define the P th-order tensors Mx,P ∈ R[P ;N ] and Cx,P ∈ R[P ;N ] such that, for ip ∈ N :   mx,P i1 ,i2 ,··· ,iP = E[xi1 xi2 · · · xiP ] [A1.17]   cx,P i1 ,i2 ,··· ,i = cum(xi1 , xi2 , · · · , xiP ). [A1.18] P

The order P of the moment and the cumulant corresponds to the number of indices {ip }. For P = 2, the second-order moment Mx,2 corresponds to the autocorrelation matrix Φx . In the case of a real random vector x ∈ RN , the characteristic function, denoted T Ξx (u), is defined as the mean of h(x) = eju x , with u ∈ RN : = ∞ juT x T Ξx (u) = E e = eju v px (v)dv. [A1.19] −∞

Using the geometric series expansion eju Ξx (u) =

T

x

=

∞

jk T k k=0 k! [u x] ,

∞  j k  T k E u x . k!

we deduce that: [A1.20]

k=0

The characteristic function N allows us to generate the moments. Thus, the crossmoment of order P = n=1 pn of the components of the random vector x can be expressed as follows in terms of the partial derivatives of Ξx (u) at the point u = 0: E

>  ∂ pN  ∂ p 1 > . xpnn = j −P ··· Ξx (u)> ∂u ∂u u=0 1 N n=1

N 

[A1.21]

Appendix

295

The second characteristic function is defined as the natural logarithm of Ξx (u) in Papoulis (1984): Ψx (u) = Log[Ξx (u)].

[A1.22]

This function allows us togenerate the cumulants. For example, we define the N cross-cumulant of order P = n=1 pn as: >   ∂ p N   ∂ p1 > cum x1 , · · · , x1 , · · · , xN , · · · , xN = j −P ··· Ψx (u)> . ( )* + ( )* + ∂u1 ∂uN u=0 p1 terms

pN terms

[A1.23] Here, the order of the cumulant corresponds to the sum of the repetitions pn of each r.v. xn , with n ∈ N . A1.2.2.3. Relationship between cumulants and moments A formula established by Leonov and Shiryaev (1959) allows us to pass from moments to cumulants. Thus, the joint cumulant of P random variables can be expressed in terms of the moments of order smaller than or equal to P as: cum(x1 , x2 , · · · , xP ) =

P 

(−1)q−1 (q − 1)!

q=1





Pqj ∈Pq Ik ∈Pqj

E

 

 xi m ,

im ∈Ik

[A1.24] where Pq represents the set of partitions of order q of the index set I = {1, · · · , P }, i.e. the set of partitions with q disjoint non-empty subsets Ik of I whose union is the set I, and Pqj is an element of the set Pq . Using this formula for centered r.v.s, we can check that the cumulants of order less than or equal to four satisfy the following relations with the moments: cum(xi ) = E[xi ] = 0 cum(xi1 , xi2 ) = E[xi1 xi2 ] = ϕx1 x2 cum(xi1 , xi2 , xi3 ) = E[xi1 xi2 xi3 ]

[A1.25] [A1.26] [A1.27]

cum(xi1 , xi2 , xi3 , xi4 ) = E[xi1 xi2 xi3 xi4 ] − E[xi1 xi2 ]E[xi3 xi4 ] − E[xi1 xi3 ]E[xi2 xi4 ] − E[xi1 xi4 ]E[xi2 xi3 ]. [A1.28] R EMARK A1.3.– If we choose the variables xin = x for n ∈ 4, the formulae [A1.26]–[A1.28] give the second-, third- and fourth-order cumulants [A1.10]–[A1.11] of a centered scalar r.v.

296

Matrix and Tensor Decompositions in Signal Processing

In the case of non-centered r.v.s, we have: cum(xi1 , xi2 ) = E[xi1 xi2 ] − E[xi1 ]E[xi2 ] = σxi1 xi2

[A1.29]

cum(xi1 , xi2 , xi3 ) = E[xi1 xi2 xi3 ] − E[xi1 ]E[xi2 xi3 ] −E[xi2 ]E[xi1 xi3 ]− E[xi3 ]E[xi1 xi2 ] + 2E[xi1 ]E[xi2 ]E[xi3 ].

[A1.30]

A1.2.2.4. Properties of cumulants

The cumulants satisfy several important properties, which are described below (Brillinger 1965; Mendel 1991; Nikias and Mendel 1993; Nikias and Petropulu 1993; Picinbono 1993):

– P1: The higher order cumulants (i.e. of order greater than two) of any set of jointly Gaussian r.v.s are zero, which explains why higher order processing methods are robust to additive Gaussian noise. This property, which will be proven in section A1.2.3.3, allows us to define Gaussianity tests for r.v.s and linearity tests for systems (Hinich 1982).

– P2: The odd-order cumulants of a r.v. with a symmetric distribution are zero. This is, for example, the case for the uniform, Gaussian and Laplacian distributions. This property also holds for the odd-order moments, as indicated in [A1.3].

– P3: Changing the order of the partial derivatives in [A1.23] does not change the result, from which we can conclude that the cumulants are symmetric functions with respect to their arguments, i.e.:

$$\mathrm{cum}(x_1, \cdots, x_N) = \mathrm{cum}(x_{\pi(1)}, \cdots, x_{\pi(N)}) \qquad [A1.31]$$

for any permutation $\pi$ of the set $\{1, \cdots, N\}$. This property means that the tensor $\mathcal{C}_{x,P} \in \mathbb{R}_S^{[P;N]}$ of $P$th-order cumulants of a set of $N$ r.v.s is symmetric. This property also holds for moments: $\mathcal{M}_{x,P} \in \mathbb{R}_S^{[P;N]}$.

– P4: The cumulants (and moments) satisfy the following multilinearity property: for a set of $N$ random vectors $u^{(n)} \in \mathbb{R}^{I_n}$ linearly transformed to $y^{(n)} = B^{(n)} u^{(n)}$, with $B^{(n)} \in \mathbb{R}^{J_n \times I_n}$, the tensor of $N$th-order cumulants of the vectors $y^{(n)} \in \mathbb{R}^{J_n}$ is given by:

$$\mathcal{C}_{y,N} = \mathrm{cum}(y^{(1)}, \cdots, y^{(N)}) = \mathcal{C}_{u,N} \times_{n=1}^{N} B^{(n)} \in \mathbb{R}^{J_1 \times \cdots \times J_N} \qquad [A1.32]$$
$$\mathcal{C}_{u,N} = \mathrm{cum}(u^{(1)}, \cdots, u^{(N)}) \in \mathbb{R}^{I_1 \times \cdots \times I_N}. \qquad [A1.33]$$


This key property establishes a close link between cumulants and tensors.

Developing the above, the cross-cumulant of the components $y_{j_n}^{(n)}$ of the vectors $y^{(n)}$, with $n \in \{1, \cdots, N\}$, can be written as:

$$\mathrm{cum}\big(y_{j_1}^{(1)}, \cdots, y_{j_N}^{(N)}\big) = \sum_{i_1} \cdots \sum_{i_N} b_{j_1,i_1}^{(1)} \cdots b_{j_N,i_N}^{(N)}\, \mathrm{cum}\big(u_{i_1}^{(1)}, \cdots, u_{i_N}^{(N)}\big) = \prod_{n=1}^{N} b_{j_n,i_n}^{(n)}\, \mathrm{cum}\big(u_{i_1}^{(1)}, \cdots, u_{i_N}^{(N)}\big), \qquad [A1.34]$$

where the last equality follows from the use of the index convention.
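The multilinearity property [A1.32]/[A1.34] can be verified numerically. The following Python/NumPy sketch (not from the book; the sizes, seed and the choice of a skewed common vector are illustrative) estimates third-order cumulant tensors by sample third moments and checks that transforming the variables by matrices $B^{(n)}$ amounts to a multilinear (mode-$n$) transformation of the cumulant tensor.

```python
# Minimal sketch: multilinearity of third-order cumulants under linear transformations.
import numpy as np

rng = np.random.default_rng(1)
T, I, J = 200_000, 3, 2
V = rng.exponential(1.0, size=(I, T)) - 1.0             # zero-mean, skewed random vector (T samples)
B = [rng.standard_normal((J, I)) for _ in range(3)]     # B(1), B(2), B(3)
Y = [B[n] @ V for n in range(3)]                        # y(n) = B(n) v

# Sample estimates of the third-order cumulant tensors (for zero-mean variables, cum(a,b,c) = E[abc]).
Cu = np.einsum('at,bt,ct->abc', V, V, V) / T            # cum of the components of v
Cy = np.einsum('at,bt,ct->abc', Y[0], Y[1], Y[2]) / T   # cum of the components of y(1), y(2), y(3)

# Multilinear transformation of Cu by the matrices B(n), as in [A1.32]/[A1.34].
Cy_from_Cu = np.einsum('ia,jb,kc,abc->ijk', B[0], B[1], B[2], Cu)
print(np.max(np.abs(Cy - Cy_from_Cu)))                  # round-off level: the identity is exact on sample averages
```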

This multilinearity property implies the following two properties:

– P5: The cumulants (respectively, moments) of the r.v.s $x_n$, $n \in \{1, \cdots, N\}$, multiplied by constant scaling factors $\lambda_n$ are equal to the cumulants (respectively, moments) of the r.v.s multiplied by the product of all scaling factors:

$$\mathrm{cum}(\lambda_1 x_1, \cdots, \lambda_N x_N) = \Big(\prod_{n=1}^{N} \lambda_n\Big)\, \mathrm{cum}(x_1, \cdots, x_N). \qquad [A1.35]$$

In particular, we have $\mathrm{cum}(\lambda x_1, x_2, \cdots, x_N) = \lambda\, \mathrm{cum}(x_1, \cdots, x_N)$.

– P6: The cumulants are additive with respect to their arguments: $\mathrm{cum}(x_1 + y, x_2, \cdots, x_N) = \mathrm{cum}(x_1, x_2, \cdots, x_N) + \mathrm{cum}(y, x_2, \cdots, x_N)$. This property holds for both real and complex r.v.s. It is also satisfied by moments.

– P7: The cumulants of the sum of two random vectors whose components are statistically independent are equal to the sum of the cumulants of each random vector, considered separately:

$$\mathrm{cum}(x_1 + y_1, \cdots, x_N + y_N) = \mathrm{cum}(x_1, \cdots, x_N) + \mathrm{cum}(y_1, \cdots, y_N). \qquad [A1.36]$$

Indeed, since the r.v.s $x_n$ are independent of the r.v.s $y_n$, for $n \in \{1, \cdots, N\}$, we have: $\Psi_{x+y}(u) = \mathrm{Log}\, E[e^{ju^T(x+y)}] = \mathrm{Log}\, E[e^{ju^T x}] + \mathrm{Log}\, E[e^{ju^T y}] = \Psi_x(u) + \Psi_y(u)$. Since the second characteristic function can be written as a sum of functions, we can deduce the property [A1.36]. This property justifies the name of cumulant, since the cumulant of a sum of two sets of independent r.v.s is equal to the sum of


the cumulants of each set of r.v.s, considered separately. It should be noted that this property is not satisfied by moments. This property is very often used in SP in the case of an additive Gaussian noise $e(k)$ that is assumed to be independent of the noiseless component $y(k)$ of the measured signal $s(k) = y(k) + e(k)$. Since the cumulants of order greater than two of Gaussian noise are zero, using these cumulants of the measured signal makes it possible to eliminate the effect of additive Gaussian noise. This property will be used in section A1.4.

– P8: The cumulants of a set of statistically independent r.v.s are zero. Indeed, the independence property implies that the pdf $p(x)$ can be factorized into $\prod_{n=1}^{N} p(x_n)$, and the second characteristic function defined in [A1.22] can then be written as a sum $\Psi_x(u) = \sum_{n=1}^{N} \Psi_{x_n}(u_n)$. Therefore, the partial derivatives in [A1.23] of $\Psi_{x_n}(u_n)$ with respect to $u_m$, with $m \neq n$, are zero. This property is also satisfied in the case where only one subset of the $N$ r.v.s is statistically independent of the other r.v.s. It is not satisfied by the moments in general, since $\Phi_x(u) = \prod_{n=1}^{N} \Phi_{x_n}(u_n)$.

In conclusion, the properties P1, P7 and P8, which distinguish cumulants from moments, are the main arguments in favor of using the former rather than the latter in SP applications. Moreover, the fourth-order cumulants are often used because the third-order cumulants of symmetrically distributed random signals are zero. This is in particular the case when solving blind communication channel identification/deconvolution problems.

A1.2.2.5. Cumulants of complex random variables

Complex-valued random variables and complex-valued random signals were studied by Amblard et al. (1996a, 1996b), extending the notion of circularity to non-Gaussian r.v.s (Picinbono 1994). A complex r.v. $z$ can be defined as $z = x_1 + jx_2$, with $j^2 = -1$ and $x_1, x_2 \in \mathbb{R}$. It can be viewed as a real two-dimensional r.v. whose real and imaginary parts, $x_1$ and $x_2$, respectively, have a joint pdf. The pdf of $z$ is also a function of the conjugated variable $z^* = x_1 - jx_2$, i.e. $p(z, z^*)$. The first characteristic function is then defined as:

$$\Xi_{z,z^*}(w, w^*) = E\big[e^{\frac{j}{2}(wz^* + w^*z)}\big], \qquad [A1.37]$$

with $w = u_1 + ju_2$, which implies $\frac{1}{2}(wz^* + w^*z) = u_1 x_1 + u_2 x_2 = u^T x$, with $u^T = [u_1\ u_2]$ and $x^T = [x_1\ x_2]$. The first characteristic function can therefore be viewed as the characteristic function [A1.19] of a real two-dimensional r.v. with components $(x_1, x_2)$ corresponding to the real and imaginary parts of $z$, i.e.:

$$\Xi_{z,z^*}(u) = E\big[e^{ju^T x}\big]. \qquad [A1.38]$$


In the same way as for real r.v.s, the second characteristic function is defined as:

$$\Psi_{z,z^*}(w, w^*) = \mathrm{Log}\big[\Xi_{z,z^*}(w, w^*)\big]. \qquad [A1.39]$$

Using the Taylor–McLaurin expansion of the exponential in [A1.37] and the binomial formula, we obtain:

$$\Xi_{z,z^*}(w, w^*) = \sum_{k=0}^{\infty} \frac{j^k}{2^k k!} \sum_{m=0}^{k} C_k^m\, w^m (w^*)^{k-m}\, E\big[(z^*)^m z^{k-m}\big], \qquad [A1.40]$$

where $C_k^m = \frac{k!}{m!(k-m)!}$ are the binomial coefficients.

This expression of the characteristic function shows the introduction of complex moments of the form $E\big[(z^*)^{k-m} z^m\big]$, i.e. which depend on both $z$ and $z^*$. For the $P$th-order moment, there are therefore $P+1$ different moments, depending on how many conjugated terms are considered. Thus, to second order, there are three different moments: $E[z^2]$, $E[zz^*]$ and $E[z^{*2}]$.

From equation [A1.40], we can deduce the following expression for the moment $E\big[(z^*)^{k-m} z^m\big]$ involving $k-m$ partial derivatives of $\Xi_{z,z^*}(w, w^*)$ with respect to $w$, and $m$ partial derivatives with respect to $w^*$, calculated at the point $(w, w^*) = (0, 0)$:

$$E\big[(z^*)^{k-m} z^m\big] = \frac{2^k}{j^k}\, \Big(\frac{\partial}{\partial w}\Big)^{k-m} \Big(\frac{\partial}{\partial w^*}\Big)^{m}\, \Xi_{z,z^*}(w, w^*)\Big|_{w=0,\, w^*=0}. \qquad [A1.41]$$

Similarly, from the second characteristic function, we deduce:

$$\mathrm{cum}[\underbrace{z, \cdots, z}_{m}, \underbrace{z^*, \cdots, z^*}_{k-m}] = \frac{2^k}{j^k}\, \Big(\frac{\partial}{\partial w}\Big)^{k-m} \Big(\frac{\partial}{\partial w^*}\Big)^{m}\, \Psi_{z,z^*}(w, w^*)\Big|_{w=0,\, w^*=0}. \qquad [A1.42]$$

Using the formula [A1.30] with $(x_{i_1}, x_{i_2}, x_{i_3}) = (z, z^*, z^*)$, we deduce the following expression for the cumulant $\mathrm{cum}(z, z^*, z^*)$:

$$\mathrm{cum}(z, z^*, z^*) = E[zz^{*2}] - E[z]\,E[z^{*2}] - 2E[z^*]\,E[zz^*] + 2E[z]\,\big(E[z^*]\big)^2.$$

A1.2.2.6. Circular complex random variables

The notion of circularity, introduced by Goodman (1963) in the Gaussian case, was generalized to the non-Gaussian case by Amblard et al. (1996a).

DEFINITION.– We say that a complex r.v. $z$ is circular to order $n$ if and only if its statistics of order less than or equal to $n$ involving a number ($p$) of non-conjugated terms that is not equal to the number ($q$) of conjugated terms are zero, i.e. for every $(p, q)$ such that $p + q \le n$, with $p \neq q$, we have:

$$m_{z,p,q} \triangleq E\big[z^p (z^*)^q\big] = 0, \qquad c_{z,p,q} \triangleq \mathrm{cum}[\underbrace{z, \cdots, z}_{p}, \underbrace{z^*, \cdots, z^*}_{q}] = 0. \qquad [A1.43]$$


PROPERTY.– Let $z = x + jy$, with $x, y \in \mathbb{R}$, be a circular complex r.v. By the definition of circularity, we have:

$$E[z] = 0 \;\Rightarrow\; E[x] + jE[y] = 0 \;\Rightarrow\; E[x] = E[y] = 0$$
$$E[z^2] = 0 \;\Rightarrow\; E[x^2] - E[y^2] + 2jE[xy] = 0 \;\Rightarrow\; E[x^2] = E[y^2] \text{ and } E[xy] = 0.$$

We can therefore conclude that the real ($x$) and imaginary ($y$) parts of $z$ are centered, with the same standard deviation ($\sigma_x = \sigma_y$), and uncorrelated ($\varphi_{xy} = E[xy] = 0$). Furthermore, if $z$ is a complex Gaussian r.v., which implies that $x$ and $y$ are real Gaussian r.v.s, then $x$ and $y$ are independent, since non-correlation implies independence for Gaussian r.v.s (see section A1.2.3.3).

EXAMPLE A1.4.– Let $x$ be a complex random vector of size $N$ whose components $x_n$, with $n \in \{1, \cdots, N\}$, are circular complex r.v.s. Only one type of fourth-order cumulant is non-zero, namely $c_{x,2,2} = \mathrm{cum}(x_i, x_j, x_k^*, x_l^*)$, obtained by considering two conjugated and two non-conjugated terms. This tensor of fourth-order cumulants is called the quadricovariance by Cardoso (1990) in the context of antenna processing to localize and identify sources. If we write $c_{i,j,k,l} \triangleq \mathrm{cum}(x_i, x_j, x_k^*, x_l^*)$, where the indices $(i, j)$ are associated with the non-conjugated terms and the indices $(k, l)$ are associated with the conjugated terms, the quadricovariance tensor satisfies the following symmetries:

$$c_{i,j,k,l} = c_{j,i,l,k} = c_{k,l,i,j}^* = c_{l,k,j,i}^*. \qquad [A1.44]$$

A1.2.3. Gaussian distribution

In SP, and more generally in statistics, the Gaussian distribution, also called the normal distribution, is very widely used to model various physical phenomena and certain signals, such as measurement noise that results from the addition of a large number of random perturbations. The importance of the Gaussian distribution is mainly due to the central limit theorem, which states that a sum of $N$ independent identically distributed (i.i.d.) r.v.s tends toward a Gaussian distribution as $N$ tends to infinity, even if the added r.v.s are not themselves Gaussian.

A1.2.3.1. Case of a scalar Gaussian variable

DEFINITION.– The pdf of a real scalar Gaussian r.v. $x$ is given by:

$$p(u) = \frac{1}{\sigma\sqrt{2\pi}}\, \exp\Big(-\frac{(u-\mu)^2}{2\sigma^2}\Big), \qquad [A1.45]$$


where $\mu$ is a real number and $\sigma$ is a positive real number representing the mean and the standard deviation, respectively, i.e.:

$$E[x] = \int_{-\infty}^{+\infty} u\, p(u)\, du = \mu, \qquad E[(x-\mu)^2] = \int_{-\infty}^{+\infty} (u-\mu)^2\, p(u)\, du = \sigma^2.$$

The Gaussian pdf is completely determined by the two parameters $\mu$ and $\sigma$, and it is often written as $x \sim \mathcal{N}(\mu, \sigma^2)$.

A1.2.3.2. Characteristic functions and HOS

The first and second characteristic functions are given by:

$$\Xi_x(u) = E[e^{jux}] = e^{j\mu u}\, e^{-(1/2)\sigma^2 u^2} \qquad [A1.46]$$
$$\Psi_x(u) = \mathrm{Log}[\Xi_x(u)] = j\mu u - (1/2)\sigma^2 u^2. \qquad [A1.47]$$

By [A1.8] and [A1.23], the $P$th-order moments and cumulants are given by:

$$m_{x,P} = E[x^P] = \frac{1}{j^P}\, \Xi_x^{(P)}(0) \qquad [A1.48]$$
$$c_{x,P} = \mathrm{cum}(\underbrace{x, \cdots, x}_{P\ \text{terms}}) = \frac{1}{j^P}\, \Psi_x^{(P)}(0), \qquad [A1.49]$$

where $\Xi_x^{(P)}(0)$ and $\Psi_x^{(P)}(0)$ are the $P$th derivatives of $\Xi_x(u)$ and $\Psi_x(u)$ at $u = 0$.

In the case of a zero-mean Gaussian r.v. $x \sim \mathcal{N}(0, \sigma^2)$, the series expansion of [A1.46] with $\mu = 0$ can be written as:

$$\Xi_x(u) = e^{-(1/2)\sigma^2 u^2} = \sum_{k=0}^{\infty} \frac{(-1)^k \sigma^{2k}}{2^k k!}\, u^{2k}.$$

Since $\Xi_x(u)$ only depends on the even powers of $u$, using the formula [A1.48] for $P = 2k+1$ and $P = 2k$ allows us to deduce the following expressions for the odd- and even-order moments:

$$m_{x,2k+1} = E(x^{2k+1}) = 0, \qquad m_{x,2k} = E(x^{2k}) = \frac{(2k)!}{2^k k!}\, \sigma^{2k}. \qquad [A1.50]$$

It is worth highlighting that all odd-order moments of a centered Gaussian variable are zero. Furthermore, we have:

$$m_{x,2} = \sigma^2; \qquad m_{x,4} = 3\sigma^4, \qquad [A1.51]$$

and consequently the kurtosis is zero. In the case of a non-zero-mean Gaussian r.v., we have:

$$m_{x,1} = \mu; \quad m_{x,2} = \sigma^2 + \mu^2; \quad m_{x,3} = 3\sigma^2\mu + \mu^3; \quad m_{x,4} = 3\sigma^4 + 6\mu^2\sigma^2 + \mu^4.$$
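As a quick numerical check of the centered-case formulas [A1.50]–[A1.51], the following Python/NumPy sketch (not from the book; $\sigma$, the seed and the sample size are arbitrary choices) estimates the first four moments of a zero-mean Gaussian sample and verifies that the fourth-order cumulant (non-normalized kurtosis) vanishes.

```python
# Minimal sketch: empirical check of E[x^3] = 0, E[x^4] = 3*sigma^4 and zero kurtosis for x ~ N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.5
x = rng.normal(0.0, sigma, size=1_000_000)

m2, m3, m4 = np.mean(x**2), np.mean(x**3), np.mean(x**4)
print(m2, sigma**2)       # ~ sigma^2
print(m3)                 # ~ 0 (odd-order moment)
print(m4, 3 * sigma**4)   # ~ 3 sigma^4
print(m4 - 3 * m2**2)     # ~ 0 : fourth-order cumulant of a Gaussian r.v.
```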


These expressions can be proven by computing the $k$th-order central moments of $x$. For example:

$$E[(x-\mu)^3] = E[x^3] - 3\mu E[x^2] + 3\mu^2 E[x] - \mu^3 = m_{x,3} - 3\mu(\sigma^2 + \mu^2) + 2\mu^3 = 0,$$

from which we deduce the expression of $m_{x,3} = 3\sigma^2\mu + \mu^3$.

From the formulae [A1.47] and [A1.49], we can conclude that all the cumulants $c_{x,P}$ of order greater than two of a Gaussian r.v. are zero, i.e. $c_{x,P} = 0$, $\forall P > 2$. This proves the property P1 stated in section A1.2.2.4. Furthermore, we have $c_{x,1} = \mu$ and $c_{x,2} = \sigma^2$.

This nullity property of the cumulants of order greater than two of a Gaussian r.v. provides the basis for developing blind identification methods for systems excited by a non-Gaussian input and corrupted by additive Gaussian noise, as will be illustrated in section A1.4.1. These identification methods based on the use of cumulants of order greater than two of the output signal are said to be robust with respect to additive Gaussian noise, as the output cumulants do not depend on the cumulants of the Gaussian noise, which are zero for orders greater than two.

A1.2.3.3. Case of a Gaussian random vector

DEFINITION.– A real random vector $x \in \mathbb{R}^N$ is Gaussian if every linear combination $a^T x$ of its components follows a one-dimensional Gaussian distribution. We also say that the components $x_n$ of $x$ are jointly Gaussian. The pdf of a real Gaussian random vector of size $N$, mean $\mu$, and covariance matrix $\Sigma$, is given by:

$$p(u) = \frac{1}{(2\pi)^{N/2}\,[\det(\Sigma)]^{1/2}}\, e^{-\frac{1}{2}(u-\mu)^T \Sigma^{-1} (u-\mu)}. \qquad [A1.52]$$

Like in the scalar case, the pdf $p(u)$ is fully defined by the first- and second-order statistics, which explains the notation $\mathcal{N}(\mu, \Sigma)$. The first and second characteristic functions [A1.46] and [A1.47] become:

$$\Xi_x(u) = E[e^{ju^T x}] = e^{ju^T \mu}\, e^{-(1/2)u^T \Sigma u} \qquad [A1.53]$$
$$\Psi_x(u) = \mathrm{Log}[\Xi_x(u)] = ju^T \mu - (1/2)u^T \Sigma u. \qquad [A1.54]$$


EXAMPLE A1.5.– Case of a real two-dimensional Gaussian vector: The components $x_1$ and $x_2$ of $x$ are jointly Gaussian if their joint pdf is of the form:

$$p(u) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\, \exp\Big(-\frac{1}{2(1-\rho^2)}\Big[\Big(\frac{u_1-\mu_1}{\sigma_1}\Big)^2 - 2\rho\,\frac{(u_1-\mu_1)(u_2-\mu_2)}{\sigma_1\sigma_2} + \Big(\frac{u_2-\mu_2}{\sigma_2}\Big)^2\Big]\Big), \qquad [A1.55]$$

where

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}, \qquad [A1.56]$$

and $\rho$ is the correlation coefficient between $x_1$ and $x_2$, with $\det(\Sigma) = \sigma_1^2\sigma_2^2(1-\rho^2)$.

PROPERTIES.–

– If the r.v.s $x_1$ and $x_2$ are jointly Gaussian, then they are marginally Gaussian, with the marginal pdfs:

$$p_{x_i}(u_i) = \frac{1}{\sigma_i\sqrt{2\pi}}\, \exp\Big(-\frac{(u_i-\mu_i)^2}{2\sigma_i^2}\Big), \qquad i \in \{1, 2\}. \qquad [A1.57]$$

PROOF.– The marginal pdf of $x_i$ is given by:

$$p_{x_i}(u_i) = \int_{-\infty}^{\infty} p(u_1, u_2)\, du_j, \qquad i, j \in \{1, 2\},\ i \neq j,$$

where $p(u_1, u_2) = p(u)$ is defined in [A1.55]. For example, if we choose $j = 1$, $i = 2$, and we rewrite the bracket of the exponential in [A1.55] in the following form:

$$\Big(\frac{u_1-\mu_1}{\sigma_1} - \rho\,\frac{u_2-\mu_2}{\sigma_2}\Big)^2 + (1-\rho^2)\,\frac{(u_2-\mu_2)^2}{\sigma_2^2}, \qquad [A1.58]$$

we obtain:

$$p_{x_2}(u_2) = \int_{-\infty}^{\infty} p(u_1, u_2)\, du_1 = A\, \exp\Big(-\frac{(u_2-\mu_2)^2}{2\sigma_2^2}\Big) \int_{-\infty}^{\infty} \exp\Big(-\frac{1}{2(1-\rho^2)}\Big(\frac{u_1-\mu_1}{\sigma_1} - \rho\,\frac{u_2-\mu_2}{\sigma_2}\Big)^2\Big)\, du_1,$$

with $A = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}$. By performing the change of variables $u = \frac{u_1-\mu_1}{\sigma_1} - \rho\,\frac{u_2-\mu_2}{\sigma_2}$, which gives $du = \frac{du_1}{\sigma_1}$, and using the result $\int_{-\infty}^{\infty} e^{-au^2}\, du = \sqrt{\frac{\pi}{a}}$, with $a = \frac{1}{2(1-\rho^2)}$, the integral term in $p_{x_2}(u_2)$ simplifies as follows:

$$\int_{-\infty}^{\infty} \exp\Big(-\frac{1}{2(1-\rho^2)}\Big(\frac{u_1-\mu_1}{\sigma_1} - \rho\,\frac{u_2-\mu_2}{\sigma_2}\Big)^2\Big)\, du_1 = \sigma_1 \int_{-\infty}^{\infty} \exp\Big(-\frac{u^2}{2(1-\rho^2)}\Big)\, du = \sigma_1\sqrt{2\pi(1-\rho^2)}.$$


Hence, the marginal pdf $p_{x_2}(u_2)$ can be written as:

$$p_{x_2}(u_2) = \frac{1}{\sigma_2\sqrt{2\pi}}\, \exp\Big(-\frac{(u_2-\mu_2)^2}{2\sigma_2^2}\Big),$$

which proves that $x_2$ is marginally Gaussian $\mathcal{N}(\mu_2, \sigma_2^2)$. We can prove that $x_1$ is marginally Gaussian $\mathcal{N}(\mu_1, \sigma_1^2)$ in the same way. □

– More generally, the components of a Gaussian vector of size $N \ge 2$ are marginally Gaussian.

– The converse of the above property is not true. Gaussianity of each component $x_i$ of $x$ is not sufficient to guarantee that $x$ is a Gaussian vector.

– If jointly Gaussian r.v.s are uncorrelated, then they are independent.

PROOF.– Consider a Gaussian vector $x \sim \mathcal{N}(\mu, \Sigma)$ of size $N$ whose components $x_n \sim \mathcal{N}(\mu_n, \sigma_n^2)$ are assumed to be uncorrelated. The covariance matrix is then diagonal, $\Sigma = \mathrm{diag}(\sigma_n^2)$, so $\Sigma^{-1} = \mathrm{diag}(\sigma_n^{-2})$. Hence, the pdf of $x$ can be written as:

$$p(u) = \frac{1}{(2\pi)^{N/2}\prod_{n=1}^{N}\sigma_n}\, \prod_{n=1}^{N} \exp\Big(-\frac{(u_n-\mu_n)^2}{2\sigma_n^2}\Big) = \prod_{n=1}^{N} p(u_n) \quad \text{with} \quad p(u_n) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\, \exp\Big(-\frac{(u_n-\mu_n)^2}{2\sigma_n^2}\Big).$$

We can therefore conclude that non-correlation of the Gaussian r.v.s $x_n$ implies their independence. This proves the property stated in Table A1.3. □

– Let $x \sim \mathcal{N}(0, \Sigma_x)$ be a centered Gaussian vector. The cross-moments $E\big[\prod_{n=1}^{N} x_{i_n}^{p_n}\big]$ of odd order $P = \sum_{n=1}^{N} p_n = 2k+1$ of its components are zero. The cross-cumulants $\mathrm{cum}(\underbrace{x_1, \cdots, x_1}_{p_1\ \text{terms}}, \cdots, \underbrace{x_N, \cdots, x_N}_{p_N\ \text{terms}})$ of order greater than two ($P = \sum_{n=1}^{N} p_n > 2$) are zero.

– Given a random vector $y$ of size $M$ obtained by applying an affine transformation to a Gaussian random vector $x \sim \mathcal{N}(\mu_x, \Sigma_x)$ of size $N$, i.e. $y = Ax + b$, with $A \in \mathbb{R}^{M \times N}$, the vector $y$ is itself Gaussian, $y \sim \mathcal{N}(\mu_y, \Sigma_y)$, with:

$$\mu_y = A\mu_x + b, \qquad \Sigma_y = A\Sigma_x A^T.$$

This Gaussianity preservation property under any linear transformation of a Gaussian random vector plays a fundamental role in SP applications. In particular, it can be exploited to test the nonlinearity of a system. Indeed, it is sufficient for any cumulant of order greater than two of the output of a system excited by a Gaussian input to be non-zero in order to conclude that the system is nonlinear. A numerical illustration of the affine transformation property is sketched below.
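The following Python/NumPy sketch (not from the book; the mean, covariance factor, matrix $A$, vector $b$ and sample size are arbitrary illustrative choices) verifies empirically that $y = Ax + b$ has mean $A\mu_x + b$ and covariance $A\Sigma_x A^T$.

```python
# Minimal sketch: affine transformation of a Gaussian vector, y = A x + b.
import numpy as np

rng = np.random.default_rng(3)
N, M, T = 3, 2, 500_000
mu_x = np.array([1.0, -2.0, 0.5])
L = rng.standard_normal((N, N))
Sigma_x = L @ L.T                                    # a valid covariance matrix
A = rng.standard_normal((M, N))
b = np.array([0.3, -1.0])

X = rng.multivariate_normal(mu_x, Sigma_x, size=T)   # samples of x (one per row)
Y = X @ A.T + b                                      # samples of y = A x + b

print(np.mean(Y, axis=0), A @ mu_x + b)              # empirical vs. theoretical mean mu_y
print(np.cov(Y, rowvar=False), A @ Sigma_x @ A.T)    # empirical vs. theoretical covariance Sigma_y
```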


A1.3. Discrete-time random signals

A discrete-time random signal (also called a stochastic process) can be viewed as a sequence of random variables indexed by time, denoted $x(k)$, with $k \in \mathbb{N}$ if the signal is assumed to be causal. Therefore, like for random variables, we can define the moments and cumulants of a random signal. Below, we first recall the second-order statistics, then the notion of stationary and ergodic signals, before introducing higher-order statistics, consisting of cumulants and polyspectra.

A1.3.1. Second-order statistics

Let $x(k)$ and $y(k)$ be two discrete-time, real, stationary scalar random signals. Table A1.5 gives the definitions of the mean, variance, autocorrelation and covariance of the signal $x(k)$, as well as the cross-correlation and cross-covariance of the signals $x(k)$ and $y(k)$.

Quantities and definitions:
– Mean: $\mu_x(k) = E[x(k)]$
– Variance: $\sigma_x^2(k) = E[(x(k) - \mu_x(k))^2] = E[x^2(k)] - \mu_x^2(k)$
– Autocorrelation: $\varphi_x(k, t) = E[x(k)x(t)]$
– Covariance: $\sigma_x(k, t) = E\big[\big(x(k) - \mu_x(k)\big)\big(x(t) - \mu_x(t)\big)\big] = \varphi_x(k, t) - \mu_x(k)\mu_x(t)$
– Cross-correlation: $\varphi_{x,y}(k, t) = E[x(k)y(t)]$
– Cross-covariance: $\sigma_{x,y}(k, t) = E\big[\big(x(k) - \mu_x(k)\big)\big(y(t) - \mu_y(t)\big)\big] = \varphi_{x,y}(k, t) - \mu_x(k)\mu_y(t)$

Properties:
– $x(k)$ and $y(k)$ uncorrelated: $\sigma_{x,y}(k, t) = 0$, $\forall k, t \;\Leftrightarrow\; \varphi_{x,y}(k, t) = \mu_x(k)\mu_y(t)$
– $x(k)$ and $y(k)$ orthogonal: $\varphi_{x,y}(k, t) = 0$, $\forall k, t$

Table A1.5. Definitions and properties for real random signals x(k) and y(k)

REMARK A1.6.– Note that:

– if the processes are uncorrelated and at least one of them is zero-mean, then they are orthogonal;

– the covariance gives information about the fluctuations of the signal around its mean.


A1.3.2. Stationary and ergodic random signals

We say that a random signal is strictly stationary (or stationary in the strict sense) if its statistics are independent of the time origin, or equivalently if they are invariant under any translation in time. This means, for example, that the autocorrelation function $\varphi_x(k, t) = E[x(k)x(t)]$ only depends on the time interval $\tau = k - t$, in which case it is defined as $\varphi_x(\tau) = E[x(k)x(k-\tau)]$.

The hypothesis of stationarity is often used in SP, together with the hypothesis of ergodicity, since this allows us to substitute ensemble averages with time averages. This amounts to replacing averages calculated using an ensemble of realizations of a random signal with averages computed using samples measured over a time window of finite duration³, for only a single realization of the signal.

In practice, the hypothesis of stationarity is impossible to verify, since the measurements are performed over a finite period of time. Nevertheless, it is very often assumed in order to allow us to estimate the statistics used in processing algorithms. Thus, for second-order methods, we assume the hypothesis of stationarity to second order (also called weak stationarity or stationarity in the wide sense), which means that the signals are assumed to have constant mean, and the autocorrelation function $\varphi_x(k, t)$ only depends on the time interval $\tau = k - t$. Similarly, for methods based on fourth-order cumulants, we would assume stationarity to fourth order.

Under the hypotheses of causality ($x(k) = 0$, $\forall k < 0$), stationarity to second order, and ergodicity, we can estimate the mean and the autocorrelation function as follows:

$$\hat{\mu}_x = \frac{1}{T+1}\sum_{k=0}^{T} x(k); \qquad \hat{\varphi}_x(\tau) = \frac{1}{T+1}\sum_{k=\tau}^{T+\tau} x(k)x(k-\tau).$$

These estimators tend asymptotically to $\mu_x$ and $\varphi_x(\tau)$ as $T$ tends to infinity. In the case of stationarity to order $P > 2$, we can estimate the moments $m_{x,p} = E\big[x(k)x(k-\tau_1)\cdots x(k-\tau_{p-1})\big]$ of order $p \le P$ using time averages of products of $p$ shifted signals:

$$\hat{m}_{x,p}(\tau_1, \cdots, \tau_{p-1}) = \frac{1}{T+1}\sum_{k=\tau}^{T+\tau} x(k)x(k-\tau_1)\cdots x(k-\tau_{p-1}), \qquad [A1.59]$$

where $\tau = \max(\tau_q)$, with $q \in \{1, \cdots, p-1\}$.

3. The time window considered to estimate the statistics of a signal needs to be sufficiently large to guarantee high-quality estimates. The higher the order of the statistics being estimated, the longer the window needs to be. For some signals, like periodic signals, the stationarity hypothesis is replaced by the cyclo-stationarity hypothesis, which states that the statistics of these signals vary periodically (Gardner 1991).
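The following Python/NumPy sketch (not from the book; the MA(1) signal, its coefficient 0.8 and the sample length are illustrative assumptions) implements time-average estimators of the mean and of the autocorrelation function in the spirit of [A1.59] for $p = 2$.

```python
# Minimal sketch: ergodic (time-average) estimators of the mean and autocorrelation.
import numpy as np

def sample_mean(x):
    return np.mean(x)

def sample_autocorr(x, tau):
    # biased (divide-by-T) time-average estimate of E[x(k)x(k - tau)]
    x = np.asarray(x)
    T = len(x)
    return np.sum(x[tau:] * x[:T - tau]) / T

rng = np.random.default_rng(4)
e = rng.standard_normal(100_000)
x = np.convolve(e, [1.0, 0.8])[:len(e)]     # MA(1) signal: x(k) = e(k) + 0.8 e(k-1)

print(sample_mean(x))                                          # ~ 0
print([round(sample_autocorr(x, t), 3) for t in range(4)])     # ~ [1.64, 0.8, 0.0, 0.0] for this MA(1) model
```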


Two processes $x(k)$ and $y(k)$ are said to be jointly wide-sense stationary if each of them is wide-sense stationary, and their cross-correlation function $\varphi_{x,y}(k, t)$ only depends on the interval $\tau = k - t$. If so, we have: $\varphi_{x,y}(k-t) = E[x(k)y(t)] \Leftrightarrow \varphi_{x,y}(\tau) = E[x(k)y(k-\tau)]$. Similarly, their cross-covariance is given by:

$$\sigma_{x,y}(\tau) = E\big[\big(x(k) - m_x\big)\big(y(k-\tau) - m_y\big)\big] = \varphi_{x,y}(\tau) - m_x m_y.$$

The random signals $x(k)$ and $y(k)$ are said to be orthogonal (respectively, uncorrelated) if their cross-correlation (respectively, cross-covariance) function is zero.

PROPERTY.– Given the real-valued random signals $x(k)$ and $y(k)$, assumed to be wide-sense stationary, their autocorrelation and cross-correlation functions have the following symmetry properties:

$$\varphi_x(\tau) = \varphi_x(-\tau), \qquad \varphi_{yx}(\tau) = \varphi_{xy}(-\tau). \qquad [A1.60]$$

For a stationary discrete-time random signal, we define the power spectrum, also called the power spectral density (PSD) or simply the spectrum, as the (one-dimensional) discrete-time Fourier transform (DTFT) of the autocorrelation function⁴:

$$\Sigma_x(\omega) = \sum_{\tau=-\infty}^{\infty} \varphi_x(\tau)\, e^{-j\omega\tau}, \qquad |\omega| \le \pi. \qquad [A1.61]$$

The autocorrelation function is given by the inverse Fourier transform of the spectrum:

$$\varphi_x(\tau) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Sigma_x(\omega)\, e^{j\omega\tau}\, d\omega. \qquad [A1.62]$$

In particular, the average power is given by:

$$E[|x|^2] = \varphi_x(0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Sigma_x(\omega)\, d\omega. \qquad [A1.63]$$

The power spectrum is an even function of $\omega$, taking real non-negative values:

$$\Sigma_x(-\omega) = \Sigma_x(\omega) \in \mathbb{R}^+. \qquad [A1.64]$$

4. This result, known as the Wiener–Khintchine identity or theorem, was published by Norbert Wiener in 1930 and independently by Aleksandr Khintchine in 1934. It states that a stationary ergodic random signal admits a spectral decomposition given by the DTFT of its autocorrelation function.
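The Wiener–Khintchine relation [A1.61] can be illustrated numerically. The following Python/NumPy sketch (not from the book; the MA(1) model, lag truncation and frequency grid are illustrative assumptions) computes the spectrum as the DTFT of an estimated autocorrelation function and compares it with the known spectrum $|1 + 0.8\,e^{-j\omega}|^2$ of the chosen model.

```python
# Minimal sketch: power spectrum as the DTFT of the estimated autocorrelation function.
import numpy as np

rng = np.random.default_rng(5)
e = rng.standard_normal(200_000)
x = np.convolve(e, [1.0, 0.8])[:len(e)]                 # MA(1) signal with known spectrum

max_lag = 20
phi = np.array([np.sum(x[t:] * x[:len(x) - t]) / len(x) for t in range(max_lag + 1)])

omega = np.linspace(-np.pi, np.pi, 512)
# Sigma_x(w) = sum_tau phi_x(tau) e^{-j w tau}, using phi_x(-tau) = phi_x(tau) for a real signal.
Sigma = phi[0] + 2.0 * np.sum(phi[1:, None] * np.cos(np.outer(np.arange(1, max_lag + 1), omega)), axis=0)
Sigma_true = np.abs(1.0 + 0.8 * np.exp(-1j * omega))**2
print(np.max(np.abs(Sigma - Sigma_true)))               # small: truncation + estimation error only
```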


Note that $\Sigma_x(\omega)$ is periodic⁵, with period $2\pi$. With normalized frequencies ($f = \omega/2\pi$), the spectrum is given by:

$$S_x(f) = \sum_{\tau=-\infty}^{\infty} \varphi_x(\tau)\, e^{-j2\pi f\tau}. \qquad [A1.65]$$

For two jointly stationary signals $x(k)$ and $y(k)$, we define the cross-spectrum, also called the cross PSD, as the Fourier transform of the cross-correlation function:

$$\Sigma_{xy}(\omega) = \sum_{\tau=-\infty}^{\infty} \varphi_{xy}(\tau)\, e^{-j\omega\tau}. \qquad [A1.66]$$

For real signals, the cross-correlation $\varphi_{xy}(\tau)$ is real, and the cross-spectrum is such that:

$$\Sigma_{xy}(\omega) = \Sigma_{xy}^*(-\omega), \qquad [A1.67]$$

which implies that its magnitude is even, whereas its phase is odd. The cross-correlation can be obtained from the inverse Fourier transform of $\Sigma_{xy}(\omega)$ as:

$$\varphi_{xy}(\tau) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Sigma_{xy}(\omega)\, e^{j\omega\tau}\, d\omega. \qquad [A1.68]$$

In Table A1.6, we summarize the definitions and the properties of the second-order statistics for two stationary complex random signals $x(k)$ and $y(k)$.

REMARK A1.7.– We can make the following remarks:

– For a stationary complex-valued random signal, we have: $\varphi_x(\tau) = E[x(k)x^*(k-\tau)] = E[x(k+\tau)x^*(k)] = \varphi_x^*(-\tau)$, $-\infty < \tau < \infty$. This relation corresponds to the Hermitian symmetry property of the autocorrelation function of a complex random signal that is stationary to second order.

– The cross-spectrum $\Sigma_{x,y}(\omega)$ of two stationary complex random signals is complex-valued, satisfies $\Sigma_{xy}(\omega) = \Sigma_{yx}^*(\omega)$, and is periodic, with period $2\pi$.

5. Some authors use the notation $\Sigma_x(e^{j\omega})$ instead of $\Sigma_x(\omega)$ to better highlight the periodicity with period $2\pi$ of the spectrum. Indeed, we have: $\Sigma_x(e^{j(\omega+2k\pi)}) = \Sigma_x(e^{j\omega})$, $\forall k \in \mathbb{Z}$.

Correlations (definitions and properties):
– Autocorrelation: $\varphi_x(\tau) = E[x(k)x^*(k-\tau)] = \varphi_x^*(-\tau)$
– Covariance: $\sigma_x(\tau) = E\big[\big(x(k) - \mu_x\big)\big(x(k-\tau) - \mu_x\big)^*\big] = \varphi_x(\tau) - |\mu_x|^2 = \sigma_x^*(-\tau)$
– Cross-correlation: $\varphi_{x,y}(\tau) = E[x(k)y^*(k-\tau)] = \varphi_{y,x}^*(-\tau)$
– Cross-covariance: $\sigma_{x,y}(\tau) = E\big[\big(x(k) - \mu_x\big)\big(y(k-\tau) - \mu_y\big)^*\big] = \sigma_{y,x}^*(-\tau)$

Spectra (properties):
– Spectrum: $\Sigma_x(\omega) = \Sigma_x^*(\omega)$
– Cross-spectrum: $\Sigma_{x,y}(\omega) = \Sigma_{y,x}^*(\omega)$

Table A1.6. Definitions and properties of second-order statistics for stationary complex random signals x(k) and y(k)

We have the following inequality: $|\Sigma_{xy}(\omega)|^2 \le \Sigma_x(\omega)\Sigma_y(\omega)$, from which we define the coherence function:

$$C_{xy}(\omega) = \frac{|\Sigma_{xy}(\omega)|^2}{\Sigma_x(\omega)\Sigma_y(\omega)} \quad \text{such that} \quad 0 \le C_{xy}(\omega) \le 1. \qquad [A1.69]$$
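In practice, the coherence [A1.69] is estimated by replacing the spectra and cross-spectrum with averaged periodograms. The following Python/NumPy sketch (not from the book; a basic segment average without windowing or overlap, with arbitrary segment sizes, filter and noise level) illustrates this for a filtered-plus-noise pair and for two independent signals.

```python
# Minimal sketch: coherence estimate from segment-averaged (cross-)periodograms.
import numpy as np

def coherence(x, y, nseg=64, nfft=256):
    X = np.reshape(x[:nseg * nfft], (nseg, nfft))
    Y = np.reshape(y[:nseg * nfft], (nseg, nfft))
    Fx, Fy = np.fft.fft(X, axis=1), np.fft.fft(Y, axis=1)
    Sxx = np.mean(np.abs(Fx)**2, axis=0)
    Syy = np.mean(np.abs(Fy)**2, axis=0)
    Sxy = np.mean(Fx * np.conj(Fy), axis=0)
    return np.abs(Sxy)**2 / (Sxx * Syy)          # 0 <= C_xy(w) <= 1

rng = np.random.default_rng(6)
x = rng.standard_normal(64 * 256)
y = np.convolve(x, [0.5, 1.0, 0.5])[:len(x)] + 0.3 * rng.standard_normal(len(x))   # filtered x + noise
z = rng.standard_normal(len(x))                                                     # independent of x

print(coherence(x, y).mean(), coherence(x, z).mean())   # high coherence vs. near-zero coherence
```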

Although the correlation functions give information about the correlation, and hence the similarity, of two samples of the same signal (autocorrelation) or of two different signals (cross-correlation) separated by a time interval $\tau$, the power spectrum gives information about the frequency content of a signal (in terms of the frequency $f$ or the angular frequency $\omega = 2\pi f$). This leads to spectral analysis or frequency analysis of signals and is generally performed using a fast Fourier transform (FFT) algorithm.

A1.3.3. Higher order statistics of random signals

A1.3.3.1. Cumulants of real random signals

Let $x(k)$ be a real random signal. The $P$th-order cumulant, denoted $c_{x,P}(k_1, \cdots, k_P)$, is the joint $P$th-order cumulant of the r.v.s $x(k_p)$, for $p \in \{1, \cdots, P\}$:

$$c_{x,P}(k_1, \cdots, k_P) \triangleq \mathrm{cum}\big(x(k_1), x(k_2), \cdots, x(k_P)\big). \qquad [A1.70]$$

The stationarity hypothesis in the strict sense for the signal $x(k)$ implies that the $P$th-order cumulant only depends on $P-1$ time lags $\tau_p \in \mathbb{Z}$ for $p \in \{1, \cdots, P-1\}$. If we select the time instant $k_1 = k$ as reference time and define $\tau_p = k - k_{p+1}$, the cumulant defined in [A1.70] can be written as:

$$c_{x,P}(\tau_1, \cdots, \tau_{P-1}) = \mathrm{cum}\big(x(k), x(k-\tau_1), \cdots, x(k-\tau_{P-1})\big). \qquad [A1.71]$$


The $P$th-order cumulants of a stationary random signal therefore form a tensor of order $P-1$. The cumulants are also called multicorrelations. In particular, the third- and fourth-order cumulants are called the bicorrelation and the tricorrelation, respectively (Lacoume et al. 1997).

Recall that, for a stationary random signal $x(k)$, the $P$th-order moment is defined as:

$$m_{x,P}(\tau_1, \cdots, \tau_{P-1}) \triangleq E\big[x(k)x(k-\tau_1)\cdots x(k-\tau_{P-1})\big]. \qquad [A1.72]$$

In the case of a non-zero-mean stationary signal $x(k)$, using the relations [A1.25], [A1.29] and [A1.30] with $x_{i_1} = x(k)$, $x_{i_2} = x(k-\tau_1)$, $x_{i_3} = x(k-\tau_2)$ gives us the following expressions for the first-, second- and third-order cumulants:

$$c_{x,1} = m_{x,1}; \qquad c_{x,2}(\tau) = m_{x,2}(\tau) - [m_{x,1}]^2 \qquad [A1.73]$$
$$c_{x,3}(\tau_1, \tau_2) = m_{x,3}(\tau_1, \tau_2) - m_{x,1}\big[m_{x,2}(\tau_2-\tau_1) + m_{x,2}(\tau_1) + m_{x,2}(\tau_2)\big] + 2\big[m_{x,1}\big]^3. \qquad [A1.74]$$

Analogously to the formulae [A1.26]–[A1.28], the second-, third- and fourth-order cumulants of the random signal $x(k)$, assumed to be stationary and centered, are given by:

$$c_{x,2}(\tau) = E[x(k)x(k-\tau)] = m_{x,2}(\tau) \qquad [A1.75]$$
$$= \varphi_x(\tau) = \varphi_x(-\tau) = \mathrm{cum}\big(x(k), x(k-\tau)\big) \qquad [A1.76]$$
$$c_{x,3}(\tau_1, \tau_2) = E[x(k)x(k-\tau_1)x(k-\tau_2)] = m_{x,3}(\tau_1, \tau_2) \qquad [A1.77]$$
$$c_{x,4}(\tau_1, \tau_2, \tau_3) = E[x(k)x(k-\tau_1)x(k-\tau_2)x(k-\tau_3)] - c_{x,2}(\tau_1)c_{x,2}(\tau_2-\tau_3) - c_{x,2}(\tau_2)c_{x,2}(\tau_3-\tau_1) - c_{x,2}(\tau_3)c_{x,2}(\tau_1-\tau_2). \qquad [A1.78]$$

Note that, for a zero-mean random signal, the second- and third-order cumulants are identical to the second- and third-order moments, respectively, whereas the fourth-order cumulant depends on both the second- and fourth-order moments.

REMARK A1.8.– The relations [A1.77] and [A1.78] can be used to estimate the third- and fourth-order cumulants of a stationary, centered random signal $x(k)$, after replacing the moments with values estimated using time averages, such as:

$$\hat{c}_{x,3}(\tau_1, \tau_2) = \frac{1}{T+1}\sum_{k=0}^{T} x(k)x(k-\tau_1)x(k-\tau_2).$$

By fixing $\tau = \tau_1 = \tau_2 = \tau_3 = 0$ in the relations [A1.76]–[A1.78], we obtain the second-, third- and fourth-order cumulants with zero time lags, which correspond to the variance, the non-normalized forms of skewness ($\gamma_{x,3}$) and kurtosis ($\gamma_{x,4}$) of the signal $x(k)$, expressed in terms of the cumulants:

$$c_{x,2}(0) = E[x^2(k)] = m_{x,2}(0), \qquad c_{x,3}(0, 0) = E[x^3(k)] = \gamma_{x,3}, \qquad c_{x,4}(0, 0, 0) = E[x^4(k)] - 3c_{x,2}^2(0) = \gamma_{x,4}.$$

The symmetry property [A1.31] of the cumulants implies that, for the $P$th-order cumulant defined in [A1.70], there are $P!$ possible ways to choose the order of the time instants $(k_1, \cdots, k_P)$ without changing the cumulant. In the case of a stationary signal, there are $(P-1)!$ ways to choose the order of the time lags $\tau_p$, $p \in \{1, \cdots, P-1\}$, for each choice of reference time. For example, for the bicorrelation, choosing $k_1 = k$ as the reference time leads to the following symmetry relation:

$$c_{x,3}(\tau_1, \tau_2) = c_{x,3}(\tau_2, \tau_1). \qquad [A1.79]$$

Similarly, by choosing $k' = k - \tau_1$ as the reference time, we deduce the following symmetry relations:

$$\mathrm{cum}\big(x(k), x(k-\tau_1), x(k-\tau_2)\big) = \mathrm{cum}\big(x(k'), x(k'+\tau_1), x(k'+\tau_1-\tau_2)\big)$$
$$\Downarrow$$
$$c_{x,3}(\tau_1, \tau_2) = c_{x,3}(-\tau_1, \tau_2-\tau_1) = c_{x,3}(\tau_2-\tau_1, -\tau_1). \qquad [A1.80]$$

Likewise, if we choose $k' = k - \tau_2$ as the reference time, permuting the time lags $\tau_1$ and $\tau_2$ in the above relations gives:

$$c_{x,3}(\tau_1, \tau_2) = c_{x,3}(-\tau_2, \tau_1-\tau_2) = c_{x,3}(\tau_1-\tau_2, -\tau_2). \qquad [A1.81]$$

These symmetry relations define six regions in the $(\tau_1, \tau_2)$-plane, as illustrated in Figure A1.1. Knowing the third-order cumulants in one of these six regions is sufficient to deduce the values of $c_{x,3}(\tau_1, \tau_2)$ in the other five regions. Thus, we can restrict the third-order cumulant estimation to the region defined by: $0 < \tau_2 \le \tau_1$, with $\tau_1 \ge 0$ and $\tau_2 \ge 0$ (region I in Figure A1.1).

A1.3.3.2. Polyspectra

The multidimensional DTFT⁶ of the multicorrelations of a stationary discrete-time random signal gives us the polyspectra of the signal, also called cumulant spectra, a notion that was introduced by Brillinger (1965).

6. This is a transformation in the reduced frequency $f$ associated with the frequency $F$ of the continuous-time Fourier transform via the relation $f = F/F_s$, where $F_s = 1/T$ is the sampling frequency and $T$ the sampling period. Recall that the discrete-time Fourier transform (DTFT) of a sampled signal $x(k)$, assumed to be absolutely summable, is defined as the function $G(f) = \sum_{k=-\infty}^{\infty} x(k)e^{-j2\pi fk}$, with $f \in [-1/2, 1/2]$, since $G$ is a periodic function with period 1, whereas the continuous Fourier transform (CFT) of a continuous-time signal $x(t)$ is defined as $G(F) = \int_{-\infty}^{\infty} x(t)e^{-j2\pi Ft}\, dt$, with $F \in (-\infty, \infty)$, since the continuous time implies that $G$ is not periodic.


Figure A1.1. Symmetry regions of the third-order cumulant

If we assume that $c_{x,P}(\tau_1, \cdots, \tau_{P-1})$ is absolutely summable, i.e.:

$$\sum_{\tau_1=-\infty}^{+\infty} \cdots \sum_{\tau_{P-1}=-\infty}^{+\infty} \big|c_{x,P}(\tau_1, \cdots, \tau_{P-1})\big| < \infty,$$

the $P$th-order cumulant spectrum of the signal $x(k)$ is defined as the $(P-1)$-dimensional Fourier transform of $c_{x,P}(\tau_1, \cdots, \tau_{P-1})$:

$$S_{x,P}(f_1, \cdots, f_{P-1}) \triangleq \mathrm{DTFT}\big[c_{x,P}(\tau_1, \cdots, \tau_{P-1})\big] \qquad [A1.82]$$
$$= \sum_{\tau_1=-\infty}^{+\infty} \cdots \sum_{\tau_{P-1}=-\infty}^{+\infty} c_{x,P}(\tau_1, \cdots, \tau_{P-1})\, e^{-j2\pi\sum_{p=1}^{P-1} f_p\tau_p}. \qquad [A1.83]$$

The cumulant spectrum is periodic with period 1, i.e. $S_{x,P}(f_1+1, \cdots, f_{P-1}+1) = S_{x,P}(f_1, \cdots, f_{P-1})$, which enables us to restrict consideration of the cumulant spectrum to only one period, i.e. $|f_p| \le \frac{1}{2}$, $p \in \{1, \cdots, P-1\}$, and $\big|\sum_{p=1}^{P-1} f_p\big| \le \frac{1}{2}$.

The cumulant spectrum can also be defined in terms of the angular frequency $\omega = 2\pi f$ as:

$$\Sigma_{x,P}(\omega_1, \cdots, \omega_{P-1}) = \sum_{\tau_1=-\infty}^{+\infty} \cdots \sum_{\tau_{P-1}=-\infty}^{+\infty} c_{x,P}(\tau_1, \cdots, \tau_{P-1})\, e^{-j\sum_{p=1}^{P-1} \omega_p\tau_p}. \qquad [A1.84]$$

The cumulant spectrum is then periodic with period $2\pi$, i.e. $\Sigma_{x,P}(\omega_1, \cdots, \omega_{P-1}) = \Sigma_{x,P}(\omega_1+2\pi, \cdots, \omega_{P-1}+2\pi)$, with $|\omega_p| \le \pi$, $p \in \{1, \cdots, P-1\}$, and $\big|\sum_{p=1}^{P-1} \omega_p\big| \le \pi$.


We also define the $P$th-order z-cumulant spectrum, denoted $C_{x,P}(z_1, \cdots, z_{P-1})$, as the bilateral multidimensional z-transform of the $P$th-order cumulant:

$$C_{x,P}(z_1, \cdots, z_{P-1}) = \sum_{\tau_1=-\infty}^{+\infty} \cdots \sum_{\tau_{P-1}=-\infty}^{+\infty} c_{x,P}(\tau_1, \cdots, \tau_{P-1})\, z_1^{-\tau_1} \cdots z_{P-1}^{-\tau_{P-1}}. \qquad [A1.85]$$

The expressions [A1.83] and [A1.84] can be deduced from [A1.85] using the transformations:

$$z_p = e^{j2\pi f_p} = e^{j\omega_p}, \qquad p \in \{1, \cdots, P-1\}, \qquad [A1.86]$$

which give the relations:

$$S_{x,P}(f_1, \cdots, f_{P-1}) = C_{x,P}(e^{j2\pi f_1}, \cdots, e^{j2\pi f_{P-1}}) \qquad [A1.87]$$
$$\Sigma_{x,P}(\omega_1, \cdots, \omega_{P-1}) = C_{x,P}(e^{j\omega_1}, \cdots, e^{j\omega_{P-1}}). \qquad [A1.88]$$

Polyspectra allow us to study random signals in the frequency domain to identify relationships between frequencies, whereas cumulants are used in the time domain to obtain information relating to the temporal (multi)correlation of signals. Taking the inverse Fourier transform of [A1.83] and [A1.84], we have:

$$c_{x,P}(\tau_1, \cdots, \tau_{P-1}) = \int_{-\frac{1}{2}}^{+\frac{1}{2}} \cdots \int_{-\frac{1}{2}}^{+\frac{1}{2}} S_{x,P}(f_1, \cdots, f_{P-1})\, e^{j2\pi\sum_{p=1}^{P-1} f_p\tau_p}\, df_1 \cdots df_{P-1}$$
$$= \frac{1}{(2\pi)^{P-1}} \int_{-\pi}^{+\pi} \cdots \int_{-\pi}^{+\pi} \Sigma_{x,P}(\omega_1, \cdots, \omega_{P-1})\, e^{j\sum_{p=1}^{P-1} \omega_p\tau_p}\, d\omega_1 \cdots d\omega_{P-1}.$$

It is a well-known fact that the PSD $S_{x,2}(f) = \sum_{\tau=-\infty}^{+\infty} c_{x,2}(\tau)\, e^{-j2\pi f\tau}$ corresponds to the DTFT of the autocorrelation function $\varphi_x(\tau) = c_{x,2}(\tau) = \mathrm{cum}_{x,2}(\tau)$. In the cases of the bicorrelation $c_{x,3}(\tau_1, \tau_2)$ and the tricorrelation $c_{x,4}(\tau_1, \tau_2, \tau_3)$, the two- and three-dimensional DTFTs give the bispectrum $S_{x,3}(f_1, f_2)$ and the trispectrum $S_{x,4}(f_1, f_2, f_3)$, respectively.

Like cumulants, polyspectra are characterized by various symmetries. This makes it easier to estimate them using the DTFT in a reduced frequency domain. Thus, the spectrum satisfies $S_{x,2}(f) = S_{x,2}(-f)$. As a result of this parity property and the periodicity of period 1, we can restrict consideration to the frequencies $0 \le f \le \frac{1}{2}$ when estimating the spectrum.


In the case of third-order statistics, the bispectrum (corresponding to $P = 3$) is defined as the two-dimensional DTFT of the bicorrelation:

$$\Sigma_{x,3}(\omega_1, \omega_2) = \sum_{\tau_1=-\infty}^{+\infty} \sum_{\tau_2=-\infty}^{+\infty} c_{x,3}(\tau_1, \tau_2)\, e^{-j(\omega_1\tau_1+\omega_2\tau_2)}. \qquad [A1.89]$$

Taking into account the symmetries [A1.79]–[A1.81] of the bicorrelation, we can deduce the following symmetries for the bispectrum (Therrien 1992; Nikias and Petropulu 1993):

$$S_{x,3}(f_1, f_2) = S_{x,3}(f_2, f_1) = S_{x,3}(f_1, -f_1-f_2) = S_{x,3}(f_2, -f_1-f_2) = S_{x,3}(-f_1-f_2, f_1) = S_{x,3}(-f_1-f_2, f_2). \qquad [A1.90]$$

As a result of these symmetries, we can restrict the bispectrum estimation to the domain $0 \le f_2 \le f_1$, with $f_1 + f_2 \le \frac{1}{2}$. Similar symmetry properties exist for the trispectrum.

EXAMPLE A1.9.– Case of a stationary zero-mean non-Gaussian signal, white to order $P$⁷: This type of signal is often considered in SP to model source signals. The $P$th-order cumulant and cumulant spectrum of such a signal $e(k)$ are given by:

$$c_{e,P}(\tau_1, \cdots, \tau_{P-1}) = \gamma_{e,P}\, \delta(0, \tau_1, \cdots, \tau_{P-1}) \qquad [A1.91]$$
$$S_{e,P}(f_1, \cdots, f_{P-1}) = \gamma_{e,P}, \qquad [A1.92]$$

where $\gamma_{e,P} = \mathrm{cum}(e(k), \cdots, e(k))$ is a constant and $\delta(0, \tau_1, \cdots, \tau_{P-1})$ is the generalized Kronecker delta, which is equal to 0 except if $\tau_p = 0$ for $p \in \{1, \cdots, P-1\}$. We can therefore conclude that the cumulant spectrum is constant for every frequency, and the cumulant of order $P$ is non-zero only if all the time lags $\tau_p$ are zero. We say that $e(k)$ is a sequence of white noise to order $P$.

7. When we are only interested in second-order statistics, the notion of white noise corresponds to a sequence $e(k)$ of uncorrelated r.v.s (to second order) with autocorrelation function $\varphi_e(\tau) = E[e(k)e(k-\tau)] = \sigma^2\delta(\tau)$ and hence a constant power spectrum. Using HOS, the notion of white noise was extended to orders greater than two by Bondon and Picinbono (1990), defining whiteness to orders $P > 2$ in terms of the cumulants. We say that a stationary signal $e(k)$ is white to order $P$ if all its cumulants of orders less than or equal to $P$ are zero for non-zero time lags. A white noise is said to be pure if it is a sequence of independent random variables, which implies whiteness to every order $P$. Furthermore, the stationarity hypothesis means that the r.v.s are identically distributed. The sequence is then described as i.i.d. (independent and identically distributed). Generally, the Gaussian distribution is chosen, as is the case in signal processing with additive white Gaussian noise (AWGN).


This result should be compared against the case of a sequence of i.i.d. zero-mean Gaussian noise, a hypothesis widely used to model measurement noise. As we saw earlier, all cumulants of order greater than two of a centered Gaussian r.v. are zero, the only non-zero cumulant being the second-order cumulant. Therefore, unlike the non-Gaussian case, the hypothesis of Gaussian white noise implies that every cumulant of order greater than two is zero ($\gamma_{e,P} = 0$ for $P > 2$ in [A1.91]). In particular, the fourth-order cumulants are zero, a property that is exploited by the applications considered in Chapter 5. Moreover, the spectrum is equal to $S_{e,2}(f) = \sum_{\tau=-\infty}^{\infty} \varphi_e(\tau)e^{-j2\pi f\tau} = \sigma_e^2$, with $\varphi_e(\tau) = \sigma_e^2\delta_{\tau 0}$.

A1.3.3.3. Cumulants of complex random signals

The case of stationary complex-valued random signals leads to different definitions of the cumulants involving the conjugation of certain terms. Thus, we define the cumulant of order $p + q$ as follows:

$$c_{x,p,q}(\tau_1, \cdots, \tau_{p+q-1}) = \mathrm{cum}\big[x(k), x(k-\tau_1), \cdots, x(k-\tau_{p-1}), x^*(k-\tau_p), \cdots, x^*(k-\tau_{p+q-1})\big],$$

with $p$ non-conjugated terms and $q$ conjugated terms. The stationarity hypothesis implies a dependency on $p+q-1$ time lags $\tau_i$ with $i \in \{1, \cdots, p+q-1\}$. The cumulants of order $p+q$ therefore form a tensor of order $p+q-1$.

In general, there are $2^{p+q}$ different definitions of a complex cumulant of order $p+q$. For example, for the third-order cumulants, there are eight possible definitions, including the following three:

$$c_{x,3,0}(\tau_1, \tau_2) = \mathrm{cum}\big[x(k), x(k-\tau_1), x(k-\tau_2)\big]$$
$$c_{x,2,1}(\tau_1, \tau_2) = \mathrm{cum}\big[x(k), x^*(k-\tau_1), x(k-\tau_2)\big]$$
$$c_{x,1,2}(\tau_1, \tau_2) = \mathrm{cum}\big[x(k), x^*(k-\tau_1), x^*(k-\tau_2)\big].$$

A1.3.3.4. Case of complex circular random signals

Complex random signals are characterized by an important property: circularity. Analogously to circular complex r.v.s (see [A1.43]), we say that a complex random signal is circular to order $n$ if and only if its statistics of order less than or equal to $n$ involving $p$ non-conjugated terms and $q$ conjugated terms are zero for $p \neq q$, and, for all $(p, q)$ such that $p + q \le n$:

$$m_{x,p,q} = E\big[x(k)x(k-\tau_1)\cdots x(k-\tau_{p-1})x^*(k-\tau_p)\cdots x^*(k-\tau_{p+q-1})\big] = 0$$
$$c_{x,p,q} = \mathrm{cum}\big[x(k), \cdots, x(k-\tau_{p-1}), x^*(k-\tau_p), \cdots, x^*(k-\tau_{p+q-1})\big] = 0.$$

Thus, for a complex signal circular to fourth order, we have $E[x^2(k)] = E[x^2(k)x^*(k)] = E[x^3(k)x^*(k)] = 0$. Exactly one type of fourth-order


cumulant is non-zero, namely $c_{x,2,2} = \mathrm{cum}\big[x(k), x(k-\tau_1), x^*(k-\tau_2), x^*(k-\tau_3)\big]$, obtained by considering two conjugated terms and two non-conjugated terms. These cumulants, denoted $c_{i,j,m} \triangleq \mathrm{cum}\big[x(k), x(k-\tau_i), x^*(k-\tau_j), x^*(k-\tau_m)\big]$, form a third-order tensor, where the index $i$ is associated with the non-conjugated term $x(k-\tau_i)$, and the indices $(j, m)$ are associated with the conjugated terms $x^*(k-\tau_j)$ and $x^*(k-\tau_m)$. This tensor satisfies the following partial symmetry:

$$c_{i,j,m} = c_{i,m,j}. \qquad [A1.93]$$
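Before turning to the system identification applications of section A1.4, the following Python/NumPy sketch (not from the book) shows how the bicorrelation of a real, centered signal can be estimated by time averages as in Remark A1.8, and how a bispectrum estimate [A1.89] can then be obtained as its two-dimensional DTFT on a coarse grid. The skewed driving noise, the filter, the maximum lag and the grid size are illustrative assumptions; the symmetry regions of Figure A1.1 could be exploited to reduce the computation, but the full lag grid is kept here for simplicity.

```python
# Minimal sketch: bicorrelation (third-order cumulant) and bispectrum estimation for a real signal.
import numpy as np

def bicorrelation(x, max_lag):
    # hat c_{x,3}(t1, t2) = (1/T) sum_k x(k) x(k - t1) x(k - t2), for lags in [-max_lag, max_lag]
    x = np.asarray(x) - np.mean(x)
    T = len(x)
    lags = np.arange(-max_lag, max_lag + 1)
    c3 = np.zeros((len(lags), len(lags)))
    for i, t1 in enumerate(lags):
        for j, t2 in enumerate(lags):
            k = np.arange(max(0, t1, t2), T + min(0, t1, t2))
            c3[i, j] = np.sum(x[k] * x[k - t1] * x[k - t2]) / T
    return lags, c3

def bispectrum(lags, c3, omegas):
    # Sigma_{x,3}(w1, w2) = sum_{t1,t2} c_{x,3}(t1, t2) e^{-j(w1 t1 + w2 t2)}   [A1.89]
    E = np.exp(-1j * np.outer(omegas, lags))
    return E @ c3 @ E.T

rng = np.random.default_rng(7)
e = rng.exponential(1.0, 200_000) - 1.0        # zero-mean, skewed noise: non-zero third-order cumulants
x = np.convolve(e, [1.0, -0.5])[:len(e)]       # a simple filtered version of it

lags, c3 = bicorrelation(x, max_lag=10)
w = np.linspace(-np.pi, np.pi, 64)
S3 = bispectrum(lags, c3, w)
print(c3[10, 10], np.abs(S3).max())            # c3[10, 10] is the zero-lag value hat c_{x,3}(0, 0)
```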

A1.4. Application to system identification

The goal of this section is to present methods for identifying linear systems and homogeneous quadratic systems using the spectra and cumulant spectra of input and output signals. If the input is not measurable, only the output statistics can be used for identification. This is called blind or unsupervised identification. In the case of a linear system represented by means of its transfer function, and hence its frequency response, we will show that the second-order statistics of the output do not contain any information about the phase and hence do not allow non-minimum phase systems to be identified. For this type of system, which is very widespread in practice, we need to use HOS like the bispectrum or trispectrum to estimate the gain and the phase of the system (Alshebeili and Cetin 1990; Nikias and Petropulu 1993; Li and Ding 1994).

REMARK A1.10.– In the case of autoregressive (AR), moving average (MA) and ARMA systems, the non-measurable input is assumed to be a non-Gaussian i.i.d. sequence. For these systems, very often considered in SP applications, various methods of blind identification based on HOS of the output have been proposed in the literature (Giannakis 1987; Nikias 1988; Giannakis and Mendel 1989; Tugnait 1990; Swami and Mendel 1990; Alshebeili et al. 1993; Favier et al. 1994; Na et al. 1995; Nandi 1999; Abderrahim et al. 2001; Favier 2004).

A1.4.1. Case of linear systems

Consider a stable, discrete-time linear system represented using the following input–output equation:

$$y(k) = \sum_{i=-\infty}^{\infty} h(i)u(k-i) = \sum_{i=-\infty}^{\infty} h(k-i)u(i) = h(k) * u(k)$$
$$s(k) = y(k) + e(k),$$

where the symbol $*$ denotes time convolution, and $h(k)$ is the impulse response (i.r.) of the system, whose bilateral z-transform corresponds to the discrete transfer function


of the system: $H(z) = \sum_{k=-\infty}^{\infty} h(k)z^{-k}$. For a causal system, we have $h(k) = 0$, $\forall k < 0$. As the system is assumed to be stable, its i.r. is absolutely summable: $\sum_{k=-\infty}^{\infty} |h(k)| < \infty$.

The complex-valued signals $u(k)$, $y(k)$ and $s(k)$ denote the input, assumed to be non-Gaussian, centered and stationary, the noiseless output and the noisy measured output of the system. The sequence of additive white noise $e(k)$ is assumed to be Gaussian, centered, with variance $\sigma^2$, and independent of the input.

Since the system is assumed to be stable and the input is assumed to be centered and stationary, the output is also centered and stationary. Furthermore, since the input $u(k)$ is assumed to be independent of the additive noise, the noiseless output $y(k)$ is also independent of the noise. This allows us to exploit the additivity property [A1.36] of the cumulants for the noisy output $s(k)$ to separate the contributions of the noise and the noiseless output signal in the statistics of the noisy measured output. Thus, the second- and third-order statistics of the output $s(k)$ are given by:

$$c_{s,2}(\tau) = c_{y,2}(\tau) + c_{e,2}(\tau) = c_{y,2}(\tau) + \sigma^2\delta(\tau) \qquad [A1.94]$$
$$\Phi_s(z) = \Phi_y(z) + \Phi_e(z) = H(z)H^*(z^{-*})\Phi_u(z) + \sigma^2 \qquad [A1.95]$$
$$c_{s,3,0}(\tau_1, \tau_2) = c_{y,3,0}(\tau_1, \tau_2) \qquad [A1.96]$$
$$\Sigma_{s,3}(\omega_1, \omega_2) = \Sigma_{y,3}(\omega_1, \omega_2) = H(\omega_1)H(\omega_2)H(-\omega_1-\omega_2)\Sigma_{u,3}(\omega_1, \omega_2). \qquad [A1.97]$$

As we mentioned earlier, using the bispectrum [A1.97] enables us to make the statistics of the noisy output independent of the additive Gaussian noise, which is not the case with the spectrum [A1.95]. Table A1.7 summarizes the expressions of the second-order statistics of the noiseless output $y(k)$. Below, we will prove some of the formulae stated in this table. The others are easy to prove using the same reasoning.

PROOF.– By the definition of the autocorrelation function of the output, we have:

$$c_{s,2}(\tau) \triangleq \varphi_s(\tau) = E[s(k)s^*(k-\tau)] = E\big[\big(y(k) + e(k)\big)\big(y^*(k-\tau) + e^*(k-\tau)\big)\big].$$

Taking into account the hypotheses on the additive noise, we obtain:

$$\varphi_s(\tau) = \varphi_y(\tau) + \sigma^2\delta(\tau) \qquad [A1.98]$$

with

$$\varphi_y(\tau) \triangleq E[y(k)y^*(k-\tau)] = E\Big[\sum_i h(i)u(k-i)y^*(k-\tau)\Big] = \sum_i h(i)\varphi_{uy}(\tau-i) = h(\tau) * \varphi_{uy}(\tau). \qquad [A1.99]$$


– Noiseless output: $y(k) = \sum_{i=-\infty}^{\infty} h(i)u(k-i) = \sum_{i=-\infty}^{\infty} h(k-i)u(i) = h(k) * u(k)$
– Cross-correlation $\varphi_{yu}(\tau)$: $\varphi_{yu}(\tau) \triangleq E[y(k)u^*(k-\tau)] = h(\tau) * \varphi_u(\tau) = \varphi_{uy}^*(-\tau)$
– Cross-correlation $\varphi_{uy}(\tau)$: $\varphi_{uy}(\tau) = h^*(-\tau) * \varphi_u(\tau)$
– Autocorrelation $\varphi_y(\tau)$: $\varphi_y(\tau) \triangleq E[y(k)y^*(k-\tau)] = h(\tau) * \varphi_{uy}(\tau) = h(\tau) * h^*(-\tau) * \varphi_u(\tau) = \varphi_y^*(-\tau)$
– z-transform of the output: $Y(z) = H(z)\,U(z)$
– Cross power spectra in z: $\Phi_{yu}(z) \triangleq \sum_{\tau=-\infty}^{\infty} \varphi_{yu}(\tau)z^{-\tau} = H(z)\Phi_u(z) = \Phi_{uy}^*(z^{-*})$; $\Phi_{uy}(z) \triangleq \sum_{\tau=-\infty}^{\infty} \varphi_{uy}(\tau)z^{-\tau} = H^*(z^{-*})\Phi_u(z)$
– Cross power spectra in ω: $\Sigma_{yu}(\omega) \triangleq \sum_{\tau=-\infty}^{\infty} \varphi_{yu}(\tau)e^{-j\omega\tau} = H(\omega)\Sigma_u(\omega) = \Sigma_{uy}^*(\omega)$; $\Sigma_{uy}(\omega) \triangleq \sum_{\tau=-\infty}^{\infty} \varphi_{uy}(\tau)e^{-j\omega\tau} = H^*(\omega)\Sigma_u(\omega)$
– Power spectrum of the output: $\Phi_y(z) \triangleq \sum_{\tau=-\infty}^{\infty} \varphi_y(\tau)z^{-\tau} = H(z)\Phi_{uy}(z) = H(z)H^*(z^{-*})\Phi_u(z) = \Phi_y^*(z^{-*})$; $\Sigma_y(\omega) \triangleq \sum_{\tau=-\infty}^{\infty} \varphi_y(\tau)e^{-j\omega\tau} = H(\omega)\Sigma_{uy}(\omega) = |H(\omega)|^2\Sigma_u(\omega) = \Sigma_y^*(\omega)$

Table A1.7. Second-order statistics of the output of a linear system
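The last row of Table A1.7 can be checked numerically. The following Python/NumPy sketch (not from the book; the FIR impulse response, segment sizes and white input are illustrative assumptions) compares an averaged periodogram of the output of a known FIR filter driven by unit-variance white noise with the theoretical value $|H(\omega)|^2\Sigma_u(\omega) = |H(\omega)|^2$.

```python
# Minimal sketch: Sigma_y(w) = |H(w)|^2 Sigma_u(w) for a known FIR filter and white input.
import numpy as np

rng = np.random.default_rng(8)
h = np.array([1.0, -0.6, 0.25])                 # assumed FIR impulse response (illustrative)
nseg, nfft = 200, 512
u = rng.standard_normal(nseg * nfft)            # white input, Sigma_u(w) = 1
y = np.convolve(u, h)[:len(u)]                  # noiseless output y(k) = h(k) * u(k)

Y = np.fft.fft(np.reshape(y, (nseg, nfft)), axis=1)
Sy_hat = np.mean(np.abs(Y)**2, axis=0) / nfft   # averaged periodogram of y

w = 2 * np.pi * np.arange(nfft) / nfft
H = np.sum(h[None, :] * np.exp(-1j * np.outer(w, np.arange(len(h)))), axis=1)
print(np.max(np.abs(Sy_hat - np.abs(H)**2)))    # the averaged periodogram fluctuates around |H(w)|^2
```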

Similarly, for the cross-correlation function, we have:

$$\varphi_{su}(\tau) \triangleq E[s(k)u^*(k-\tau)] = E\big[\big(y(k) + e(k)\big)u^*(k-\tau)\big] = \varphi_{yu}(\tau) \qquad [A1.100]$$
$$\varphi_{yu}(\tau) \triangleq E\big[y(k)u^*(k-\tau)\big] = \sum_i h(i)\varphi_u(\tau-i) = h(\tau) * \varphi_u(\tau). \qquad [A1.101]$$


Taking into account the symmetry property of the autocorrelation function of the input, $\varphi_u^*(-\tau) = \varphi_u(\tau)$ and therefore $\varphi_u^*(-\tau-i) = \varphi_u(\tau+i)$, we also obtain:

$$\varphi_{us}(\tau) = \varphi_{uy}(\tau) \triangleq E\big[u(k)y^*(k-\tau)\big] = E\Big[u(k)\sum_i h^*(i)u^*(k-\tau-i)\Big] = \sum_i h^*(i)\varphi_u(\tau+i) = \varphi_{yu}^*(-\tau) = h^*(-\tau) * \varphi_u(\tau).$$

Hence, replacing $\varphi_{uy}(\tau)$ with this expression in [A1.99], we deduce that:

$$\varphi_y(\tau) = h(\tau) * h^*(-\tau) * \varphi_u(\tau) = \varphi_y^*(-\tau) \qquad [A1.102]$$

or equivalently:

$$\varphi_y(\tau) = \sum_{i=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} h(i)\, h^*(m)\, \varphi_u(\tau-i+m). \qquad [A1.103]$$

The double convolution in [A1.102] corresponds to the double sum in [A1.103].

Consider now the spectrum in z of the noiseless output, defined as the bilateral z-transform of the autocorrelation function, expressed in the form [A1.103]:

$$\Phi_y(z) \triangleq \sum_{\tau=-\infty}^{\infty} \varphi_y(\tau)z^{-\tau} = \sum_{\tau=-\infty}^{\infty} \sum_{i=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} h(i)\, h^*(m)\, \varphi_u(\tau-i+m)\, z^{-\tau}. \qquad [A1.104]$$

By decomposing $z^{-\tau} = z^{-(\tau-i+m)}z^{-i}z^{m}$ and setting $t = \tau - i + m$, the triple sum can be expressed as:

$$\Phi_y(z) = \Big(\sum_{i=-\infty}^{\infty} h(i)z^{-i}\Big)\Big(\sum_{m=-\infty}^{\infty} h^*(m)z^{m}\Big)\Big(\sum_{t=-\infty}^{\infty} \varphi_u(t)z^{-t}\Big). \qquad [A1.105]$$

Noting that $\sum_{m=-\infty}^{\infty} h^*(m)z^{m} = \big(\sum_{m=-\infty}^{\infty} h(m)(z^*)^{m}\big)^* = H^*(z^{-*})$, equation [A1.105] gives us:

$$\Phi_y(z) = H(z)H^*(z^{-*})\Phi_u(z). \qquad [A1.106]$$

From this expression, we can deduce that the term $h^*(-\tau)$ of the double convolution [A1.102], which comes from the conjugated term $y^*(k-\tau)$ in the autocorrelation function $\varphi_y(\tau) = E[y(k)y^*(k-\tau)]$, is associated with the term $H^*(z^{-*})$ in the spectrum in z.


Proceeding in the same way for the spectrum in ω, we have⁸:

$$\Sigma_y(\omega) \triangleq \sum_{\tau=-\infty}^{\infty} \varphi_y(\tau)e^{-j\omega\tau} = \Big(\sum_{i=-\infty}^{\infty} h(i)e^{-j\omega i}\Big)\Big(\sum_{m=-\infty}^{\infty} h^*(m)e^{j\omega m}\Big)\Big(\sum_{t=-\infty}^{\infty} \varphi_u(t)e^{-j\omega t}\Big).$$

Noting that $\sum_{m=-\infty}^{\infty} h^*(m)e^{j\omega m} = \big(\sum_{m=-\infty}^{\infty} h(m)e^{-j\omega m}\big)^* = H^*(\omega)$, we deduce the following expression for the spectrum in ω:

$$\Sigma_y(\omega) = H(\omega)H^*(\omega)\Sigma_u(\omega) = |H(\omega)|^2\Sigma_u(\omega), \qquad [A1.107]$$

where $\Sigma_u(\omega)$ is the spectrum of the input signal. We therefore conclude that the term $H^*(\omega)$ is now associated with the term $h^*(-\tau)$ of the double convolution [A1.102]. □

REMARK A1.11.– We can make the following remarks:

– The power spectrum in ω satisfies the Hermitian symmetry property $\Sigma_y(\omega) = \Sigma_y^*(\omega)$ and the non-negativity property $\Sigma_y(\omega) \ge 0$, $0 \le \omega \le 2\pi$.

– The expression [A1.107] clearly shows that the output spectrum does not contain any information about the phase of the system, unlike the bispectrum (see Table A1.9). Only the modulus of $H(\omega)$ can be determined using the spectra $\Sigma_u(\omega)$ and $\Sigma_y(\omega)$ of the input and output signals. This means that the same input signal filtered by various filters with the same magnitude but different phases provides different output signals that have the same second-order statistics (autocorrelation function and spectrum).

– The stationarity property in the wide sense of the zero-mean complex random signal $y(k)$ is reflected in the following relation satisfied by its power spectrum in z:

$$\varphi_y(\tau) = \varphi_y^*(-\tau), \quad -\infty < \tau < \infty \qquad [A1.108]$$
$$\Updownarrow$$
$$\Phi_y(z) = \Phi_y^*(z^{-*}). \qquad [A1.109]$$

In other words, the spectrum in z has the property of para-Hermitian symmetry. We say that $\Phi_y(z)$ is a para-Hermitian polynomial.

8. Some authors use the notation $\Sigma_y(e^{j\omega})$, $\Sigma_u(e^{j\omega})$, and $H(e^{j\omega})$ instead of $\Sigma_y(\omega)$, $\Sigma_u(\omega)$, and $H(\omega)$ to better highlight the periodicity with period $2\pi$ of the discrete-time Fourier transform.


PROOF.– Using the definition of the power spectrum in z and the property of Hermitian symmetry [A1.102] of the autocorrelation function, we have:

$$\Phi_y^*(z^{-*}) = \sum_{\tau=-\infty}^{\infty} \varphi_y^*(\tau)z^{\tau} = \sum_{\tau=-\infty}^{\infty} \varphi_y^*(-\tau)z^{-\tau} \quad \text{(by changing the sign of } \tau\text{)}$$
$$= \sum_{\tau=-\infty}^{\infty} \varphi_y(\tau)z^{-\tau} = \Phi_y(z) \quad \text{(by [A1.102])} \qquad [A1.110]$$

which proves the para-Hermitian symmetry property [A1.109] of the spectrum. □

– In the case of a system whose transfer function $H(z)$ has real coefficients, we have $H^*(z^{-*}) = \sum_{\tau=-\infty}^{\infty} h(\tau)z^{\tau} = H(z^{-1})$. Hence, the expression [A1.106] of the output spectrum in z can be written as:

$$\Phi_y(z) = H(z)H(z^{-1})\Phi_u(z) = \Phi_y(z^{-1}). \qquad [A1.111]$$

It should be noted that identifying a linear system requires knowledge of the second-order (or higher-order) statistics of the input. In the case of a Volterra system of order $P$, i.e. with input nonlinearities of order $P$, parameter estimation requires us to appeal to input statistics of order at least $2P$. Thus, as we will see in the next section, identifying a quadratic Volterra system requires the use of input statistics of order at least 4.

Consider now the cumulant spectrum of the output associated with the cumulant $c_{y,P,Q} = \mathrm{cum}\big[y(k), y(k-\tau_1), \cdots, y(k-\tau_{P-1}), y^*(k-\tau_P), \cdots, y^*(k-\tau_{P+Q-1})\big]$ involving $P$ non-conjugated terms and $Q$ conjugated terms. From our remarks on the expressions [A1.106] and [A1.107] of the spectra in z and ω of the output signal $y(k)$, we can deduce the rules stated in Table A1.8 for determining the factors of the polyspectra in z, ω and f associated with each term $y(k-\tau_p)$, for $p \in \{1, \cdots, P-1\}$, $y^*(k-\tau_q)$, for $q \in \{P, P+1, \cdots, P+Q-1\}$, and with $y(k)$.

– $C_{y,P,Q}(z_1, \cdots, z_{P+Q-1})$: a factor $H(z_p)$ for each term $y(k-\tau_p)$, a factor $H^*(z_q^{-*})$ for each term $y^*(k-\tau_q)$, and a factor $H\big(\prod_p z_p^{-1}\prod_q z_q\big)$ for the term $y(k)$.
– $\Sigma_{y,P,Q}(\omega_1, \cdots, \omega_{P+Q-1})$: a factor $H(\omega_p)$ for each term $y(k-\tau_p)$, a factor $H^*(\omega_q)$ for each term $y^*(k-\tau_q)$, and a factor $H^*\big(-\sum_p \omega_p + \sum_q \omega_q\big)$ for the term $y(k)$.
– $S_{y,P,Q}(f_1, \cdots, f_{P+Q-1})$: a factor $H(f_p)$ for each term $y(k-\tau_p)$, a factor $H^*(f_q)$ for each term $y^*(k-\tau_q)$, and a factor $H^*\big(-\sum_p f_p + \sum_q f_q\big)$ for the term $y(k)$.

The products and sums over $p$ and $q$ range over the indices associated with the non-conjugated terms $y(k-\tau_p)$, $p \in \{1, \cdots, P-1\}$, and the conjugated terms $y^*(k-\tau_q)$, $q \in \{P, \cdots, P+Q-1\}$, respectively.

Table A1.8. Factors of the polyspectra

Applying these rules allows us to easily deduce the bispectra (P + Q = 3) and trispectra (P + Q = 4) in z and ω for different choices of P and Q, as stated in Table A1.9.


As mentioned earlier, the output spectrum does not allow us to estimate the phase of the transfer function. For an input signal that is white to third order, and therefore non-Gaussian, with a non-symmetric pdf, the bispectrum $\Sigma_{u,3,0}(\omega_1, \omega_2)$ is a non-zero constant $\gamma_{u,3}$. We can therefore use the bispectrum of the output $\Sigma_{y,3,0}(\omega_1, \omega_2)$ given in Table A1.9 to estimate the phase $\varphi_H$ of the system from the following relation:

$$\varphi_{\Sigma_{y,3,0}}(\omega_1, \omega_2) = \varphi_H(\omega_1) + \varphi_H(\omega_2) - \varphi_H(\omega_1+\omega_2), \qquad [A1.112]$$

where $\varphi_{\Sigma_{y,3,0}}(\omega_1, \omega_2)$ is the phase of the output bispectrum. Different methods for reconstructing the phase of the system from the relation [A1.112] are given by Nikias and Petropulu (1993).

Bispectra in z:
$$C_{y,3,0}(z_1, z_2) = H\Big(\frac{1}{z_1 z_2}\Big)H(z_1)H(z_2)\, C_{u,3,0}(z_1, z_2)$$
$$C_{y,2,1}(z_1, z_2) = H\Big(\frac{z_2}{z_1}\Big)H(z_1)H^*(z_2^{-*})\, C_{u,2,1}(z_1, z_2)$$

Bispectra in ω:
$$\Sigma_{y,3,0}(\omega_1, \omega_2) = H(-\omega_1-\omega_2)H(\omega_1)H(\omega_2)\, \Sigma_{u,3,0}(\omega_1, \omega_2)$$
$$\Sigma_{y,2,1}(\omega_1, \omega_2) = H^*(-\omega_1+\omega_2)H(\omega_1)H^*(\omega_2)\, \Sigma_{u,2,1}(\omega_1, \omega_2)$$

Trispectra in z and ω:
$$C_{y,2,2}(z_1, z_2, z_3) = H\Big(\frac{z_2 z_3}{z_1}\Big)H(z_1)H^*(z_2^{-*})H^*(z_3^{-*})\, C_{u,2,2}(z_1, z_2, z_3)$$
$$\Sigma_{y,2,2}(\omega_1, \omega_2, \omega_3) = H^*(-\omega_1+\omega_2+\omega_3)H(\omega_1)H^*(\omega_2)H^*(\omega_3)\, \Sigma_{u,2,2}(\omega_1, \omega_2, \omega_3)$$

Table A1.9. Bispectra and trispectra of the output of a linear system

Note that the equation of the bispectrum also gives us the following relation to compute the modulus of $H(\omega)$:

$$|\Sigma_{y,3,0}(\omega_1, \omega_2)| = |\gamma_{u,3}|\, |H(\omega_1)|\, |H(\omega_2)|\, |H(\omega_1+\omega_2)|. \qquad [A1.113]$$

Relations similar to [A1.112] and [A1.113] can be deduced from the trispectrum of the output for an input signal that is white to fourth order. Similarly, we can use the cross-spectra of the input–output signals to identify the transfer function of a linear system. Thus, from the relations established in Table A1.7, we have:

$$H(\omega) = \frac{\Sigma_{yu}(\omega)}{\Sigma_u(\omega)} \quad \text{and} \quad \frac{H(\omega)}{H^*(\omega)} = \frac{\Sigma_{yu}(\omega)}{\Sigma_{uy}(\omega)} = e^{j2\varphi_H}, \qquad [A1.114]$$

where $\varphi_H$ is the phase of $H(\omega)$.
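The first relation in [A1.114] gives a simple supervised identification scheme when the input is measured: estimate the cross-spectrum and the input spectrum, and divide. The following Python/NumPy sketch (not from the book; the FIR impulse response, segment sizes, noise level and the use of unwindowed averaged periodograms are illustrative assumptions) recovers an impulse-response estimate this way.

```python
# Minimal sketch: H(w) = Sigma_yu(w) / Sigma_u(w) estimated with averaged (cross-)periodograms.
import numpy as np

rng = np.random.default_rng(9)
h = np.array([0.5, 1.0, -0.3, 0.1])             # assumed FIR impulse response to recover
nseg, nfft = 400, 256
u = rng.standard_normal(nseg * nfft)
y = np.convolve(u, h)[:len(u)] + 0.1 * rng.standard_normal(len(u))   # noisy measured output

U = np.fft.fft(np.reshape(u, (nseg, nfft)), axis=1)
Y = np.fft.fft(np.reshape(y, (nseg, nfft)), axis=1)
Syu = np.mean(Y * np.conj(U), axis=0)           # cross-spectrum estimate (common scale cancels below)
Su = np.mean(np.abs(U)**2, axis=0)              # input spectrum estimate
H_hat = Syu / Su

h_hat = np.real(np.fft.ifft(H_hat))[:len(h)]    # back to an impulse-response estimate
print(np.round(h_hat, 3))                       # close to h = [0.5, 1.0, -0.3, 0.1]
```

Since the additive noise is independent of the input, it does not bias the cross-spectrum estimate, in line with the robustness arguments given above.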


Likewise, for the cross-bicorrelation, we have, for example:

$$c_{s,u,u}(\tau_1, \tau_2) = \mathrm{cum}\big(s(k), u(k-\tau_1), u(k-\tau_2)\big) = c_{y,u,u}(\tau_1, \tau_2)$$
$$c_{y,u,u}(\tau_1, \tau_2) = \sum_m h(m)\,\mathrm{cum}\big(u(k-m), u(k-\tau_1), u(k-\tau_2)\big) = \sum_m h(m)\,\mathrm{cum}\big(u(k), u(k+m-\tau_1), u(k+m-\tau_2)\big).$$

Hence, setting $t_i = \tau_i - m$, with $i \in \{1, 2\}$, the corresponding cross-bispectrum is given by:

$$\Sigma_{y,u,u}(\omega_1, \omega_2) = \sum_m h(m)e^{-jm(\omega_1+\omega_2)} \sum_{t_1,t_2} \mathrm{cum}\big(u(k), u(k-t_1), u(k-t_2)\big)\, e^{-j(\omega_1 t_1+\omega_2 t_2)} = H(\omega_1+\omega_2)\,\Sigma_{u,u,u}(\omega_1, \omega_2).$$

From this example, we can conclude that different cross-polyspectra can be used according to the hypotheses formulated on the input.

A1.4.2. Case of homogeneous quadratic systems

Truncated Volterra models, also called discrete-time Volterra series expansions, are an extension of linear finite impulse response (FIR) models to account for nonlinearities in the input. This extension provides a link to tensors via Volterra kernels. Recall the input–output equation of a Volterra model of order $P$:

$$y(k) = h_0 + \sum_{p=1}^{P} \sum_{m_1=0}^{M_p-1} \cdots \sum_{m_p=0}^{M_p-1} h_{m_1,\cdots,m_p}^{(p)} \prod_{i=1}^{p} u(k-m_i), \qquad [A1.115]$$

where $M_p$ is the memory of the $p$th-order kernel $\mathcal{H}^{(p)} \in \mathbb{R}^{M_p \times \cdots \times M_p}$, for which $h_{m_1,\cdots,m_p}^{(p)}$ is a coefficient. This kernel can be interpreted as a $p$th-order tensor.

Volterra models are very widely used in many fields of application. They satisfy several important properties:

– linearity with respect to their parameters, i.e. the coefficients of the Volterra kernels, with the possibility of reducing the parametric complexity via symmetrization and decomposition of the kernels, viewed as tensors (Favier and Bouilloc 2010; Favier et al. 2012a);


– interpretation of the homogeneous term of order $P$ as a $P$-dimensional convolution of the $P$th-order kernel with the input defined as $u(k_1, \cdots, k_P) = u(k_1)\cdots u(k_P)$;

– sufficient condition for BIBO (bounded-input bounded-output) stability of a homogeneous system of order $P$ in terms of absolute summability of the $P$th-order kernel, generalizing the BIBO stability condition for FIR linear systems.

Below, we will consider a homogeneous quadratic system modeled using a second-order Volterra model described by the following equation:

$$y(k) = \sum_{m_1=0}^{M-1} \sum_{m_2=0}^{M-1} h(m_1, m_2)u(k-m_1)u(k-m_2), \qquad [A1.116]$$

where $h(m_1, m_2)$ is the second-order Volterra kernel with memory $M$, and $0 \le m_1, m_2 \le M-1$, and $u(k)$ and $y(k)$ are the input and output signals, assumed to be real-valued. After defining the input vector $u^T(k) = [\,u(k)\ u(k-1)\ \cdots\ u(k-M+1)\,] \in \mathbb{R}^M$, the output can be rewritten as a quadratic form in the input vector:

$$y(k) = u^T(k)Hu(k), \qquad [A1.117]$$

where $H$ is a square matrix of order $M$ containing the coefficients $h(m_1, m_2)$ of the kernel.

Unlike the case of linear systems previously considered, we assume that the input is a measured centered Gaussian signal. This is known as supervised identification. Our goal is to show that the quadratic kernel can be estimated using the cross-cumulant $c_{yuu}(\tau_1, \tau_2) \triangleq \mathrm{cum}\big(y(k), u(k-\tau_1), u(k-\tau_2)\big)$ and the cumulant spectrum associated with it. We will omit the additive Gaussian measurement noise, which is assumed to be independent of the input, since we will be considering third-order statistics, which are zero for a Gaussian signal.

Taking into account the multilinearity property [A1.32] of the cumulants, the cross-cumulant $c_{yuu}(\tau_1, \tau_2)$ is given by:

$$c_{yuu}(\tau_1, \tau_2) = \mathrm{cum}\big(y(k), u(k-\tau_1), u(k-\tau_2)\big) \qquad [A1.118]$$
$$= \sum_{m_1=0}^{M-1} \sum_{m_2=0}^{M-1} h(m_1, m_2)\,\mathrm{cum}\big(u(k-m_1)u(k-m_2), u(k-\tau_1), u(k-\tau_2)\big). \qquad [A1.119]$$


Using the relation [A1.30], with xi1 = u(k − m1 )u(k − m2 ) , xi2 = u(k − τ1 ) , xi3 = u(k − τ2 ), we obtain:   cum u(k − m1 )u(k − m2 ), u(k − τ1 ), u(k − τ2 ) = E u(k − m1 )u(k − m2 )u(k − τ1 )u(k − τ2 ) − E u(k − m1 )u(k − m2 ) E u(k − τ1 )u(k − τ2 ) = mu,4 (m1 , m2 , τ1 , τ2 ) − cu,2 (m2 − m1 )cu,2 (τ2 − τ1 ). [A1.120] By the hypotheses, since the input is a centered Gaussian signal, its N th-order moment is given by Schetzen (1980) and Picinbono (1993): E

N 



u(k − τn ) =



0

n=1

if N = 2k + 1, k ∈ N cu,2 (τi − τj ) if N = 2k

[A1.121]

where the sum ranges over all partitionings of the $N$ indices $\tau_1, \cdots, \tau_N$ into pairs $(\tau_i, \tau_j)$. There are $\frac{(2k)!}{k!\,2^k} = (2k-1)(2k-3)\cdots(1)$ such partitionings in total. From this formula, we deduce that all odd-order moments of the input are zero (as we saw earlier for a Gaussian r.v.) and that every even-order moment can be expressed as a sum of products of second-order moments, i.e. the autocorrelation. Thus, for the fourth-order moment $m_{u,4}(m_1, m_2, \tau_1, \tau_2)$, we have:

$$m_{u,4}(m_1, m_2, \tau_1, \tau_2) = c_{u,2}(m_2 - m_1)\, c_{u,2}(\tau_2 - \tau_1) + c_{u,2}(m_1 - \tau_1)\, c_{u,2}(m_2 - \tau_2) + c_{u,2}(m_1 - \tau_2)\, c_{u,2}(m_2 - \tau_1). \qquad \text{[A1.122]}$$

Replacing the fourth-order moment of the input with this expression [A1.122] in [A1.120], the cross-cumulant [A1.119] can be written as follows, for $\tau_1, \tau_2 \in \{0, 1, \cdots, M-1\}$:

$$c_{yuu}(\tau_1, \tau_2) = \sum_{m_1=0}^{M-1} \sum_{m_2=0}^{M-1} \big[\, c_{u,2}(m_1 - \tau_1)\, c_{u,2}(m_2 - \tau_2) + c_{u,2}(m_1 - \tau_2)\, c_{u,2}(m_2 - \tau_1) \,\big]\, h(m_1, m_2). \qquad \text{[A1.123]}$$
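As a quick sanity check of the Gaussian moment factorization [A1.121]–[A1.122], the following sketch (mine; the MA coloring filter and the lags are arbitrary choices) compares an empirical fourth-order moment of a colored, centered Gaussian input with the sum of the three pairwise products of sample autocorrelations.

```python
# Numerical check of [A1.122] for a colored, centered Gaussian input.
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
b = np.array([1.0, 0.5, 0.2])                      # arbitrary MA(2) coloring filter
u = np.convolve(rng.standard_normal(N + len(b) - 1), b, mode="valid")

def c2(lag):
    """Sample autocorrelation c_{u,2}(lag) = E[u(k) u(k - lag)] (even in lag)."""
    lag = abs(lag)
    return np.mean(u[lag:] * u[:len(u) - lag])

m1, m2, t1, t2 = 0, 1, 1, 2                        # arbitrary lags
k = np.arange(max(m1, m2, t1, t2), len(u))
empirical = np.mean(u[k - m1] * u[k - m2] * u[k - t1] * u[k - t2])
theory = (c2(m2 - m1) * c2(t2 - t1)
          + c2(m1 - t1) * c2(m2 - t2)
          + c2(m1 - t2) * c2(m2 - t1))
print(empirical, theory)                           # close for large N
```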

After writing the kernel in triangularized form using the equations:

$$h_{\mathrm{tri}}(m_1, m_2) =
\begin{cases}
h(m_1, m_2) + h(m_2, m_1) & \forall\, m_1 < m_2 \le M - 1 \\
h(m_1, m_2) & \forall\, m_1 = m_2 \\
0 & \forall\, m_1 > m_2
\end{cases} \qquad \text{[A1.124]}$$


equation [A1.116] of the Volterra model can also be written as:

$$y(k) = \sum_{m_1=0}^{M-1} \sum_{m_2=m_1}^{M-1} h_{\mathrm{tri}}(m_1, m_2)\, u(k - m_1)\, u(k - m_2). \qquad \text{[A1.125]}$$
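A small helper (my own sketch, not from the book) implementing the triangularization [A1.124]; by construction, using $H_{\mathrm{tri}}$ in [A1.125] gives the same output as $H$ in [A1.116].

```python
# Triangularization of a quadratic Volterra kernel, following [A1.124].
import numpy as np

def triangularize_kernel(H):
    """Return H_tri: h(m1,m2)+h(m2,m1) above the diagonal, h(m,m) on it, 0 below."""
    H = np.asarray(H, dtype=float)
    Htri = np.triu(H + H.T)               # strictly upper part becomes h(m1,m2) + h(m2,m1)
    np.fill_diagonal(Htri, np.diag(H))    # restore the original diagonal terms h(m,m)
    return Htri

H = np.array([[1.0, 0.3],
              [0.5, 2.0]])                # arbitrary example kernel
print(triangularize_kernel(H))            # [[1.0, 0.8], [0.0, 2.0]]
```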

By defining the matrices $C_{uu}$ and $C_{yuu} \in \mathbb{R}^{M \times M}$ such that $[C_{uu}]_{\tau_1, \tau_2} = c_{u,2}(\tau_1 - \tau_2)$ and $[C_{yuu}]_{\tau_1, \tau_2} = c_{yuu}(\tau_1, \tau_2)$, with $\tau_1, \tau_2 \in \{0, 1, \cdots, M-1\}$, equation [A1.123] of the cross-cumulant can be rewritten in matrix form as:

$$C_{yuu} = C_{uu}\,\big(H_{\mathrm{tri}} + H_{\mathrm{tri}}^T\big)\,C_{uu}, \qquad \text{[A1.126]}$$

where $H_{\mathrm{tri}} \in \mathbb{R}^{M \times M}$ is the matrix of coefficients of the quadratic kernel, in upper triangular form, and $C_{uu}$ and $C_{yuu}$ are the autocorrelation matrix of the input signal and the matrix of cross-cumulants between the input and output formed by the cumulants $c_{yuu}(\tau_1, \tau_2)$. These two matrices are symmetric and of size $M \times M$. If we assume that the autocorrelation matrix is non-singular, the quadratic kernel can be estimated using the following equation:

$$H_{\mathrm{tri}} + H_{\mathrm{tri}}^T = C_{uu}^{-1}\, C_{yuu}\, C_{uu}^{-1}. \qquad \text{[A1.127]}$$
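The estimator [A1.127] is a single matrix computation once $C_{uu}$ and $C_{yuu}$ are available. The sketch below (my own; the kernel values are arbitrary) applies it in the simplest case of a white, unit-variance input, for which $C_{uu} = I$ and, by [A1.123], $C_{yuu} = H + H^T$.

```python
# Supervised estimation of the symmetrized quadratic kernel via [A1.127].
import numpy as np

def estimate_symmetrized_kernel(Cuu, Cyuu):
    """Return the estimate of H_tri + H_tri^T = Cuu^{-1} Cyuu Cuu^{-1} ([A1.127])."""
    Cuu_inv = np.linalg.inv(Cuu)          # assumes a non-singular autocorrelation matrix
    return Cuu_inv @ Cyuu @ Cuu_inv

M = 2
H = np.array([[1.0, 0.3],
              [0.5, 2.0]])                # arbitrary kernel coefficients h(m1, m2)
Cuu = np.eye(M)                           # white, unit-variance input: c_{u,2}(tau) = delta(tau)
Cyuu = H + H.T                            # what [A1.123] reduces to for a white input
print(estimate_symmetrized_kernel(Cuu, Cyuu))   # recovers H + H^T = H_tri + H_tri^T
```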

EXAMPLE A1.12.– For $M = 2$, we have:

$$C_{yuu} = \begin{bmatrix} c_{yuu}(0,0) & c_{yuu}(0,1) \\ c_{yuu}(1,0) & c_{yuu}(1,1) \end{bmatrix}, \quad
C_{uu} = \begin{bmatrix} c_{u,2}(0) & c_{u,2}(1) \\ c_{u,2}(1) & c_{u,2}(0) \end{bmatrix}, \quad
H_{\mathrm{tri}} = \begin{bmatrix} h(0,0) & h(0,1) + h(1,0) \\ 0 & h(1,1) \end{bmatrix}.$$

Taking the two-dimensional Fourier transform of the two sides of equation [A1.123], we obtain the cross-bispectrum of the output and input:

$$\Sigma_{yuu}(\omega_1, \omega_2) \triangleq \sum_{\tau_1} \sum_{\tau_2} c_{yuu}(\tau_1, \tau_2)\, e^{-j(\omega_1 \tau_1 + \omega_2 \tau_2)}$$
$$= \sum_{\tau_1} \sum_{\tau_2} \sum_{m_1=0}^{M-1} \sum_{m_2=0}^{M-1} h(m_1, m_2)\, \big[\, c_{u,2}(m_1 - \tau_1)\, c_{u,2}(m_2 - \tau_2) + c_{u,2}(m_1 - \tau_2)\, c_{u,2}(m_2 - \tau_1) \,\big]\, e^{-j(\omega_1 \tau_1 + \omega_2 \tau_2)}. \qquad \text{[A1.128]}$$


By performing the change of variables $(\sigma_1, \sigma_2) = (m_1 - \tau_1, m_2 - \tau_2)$ in the first term of the bracket and $(\sigma_1, \sigma_2) = (m_1 - \tau_2, m_2 - \tau_1)$ in the second term, we can rewrite the cross-bispectrum as:

$$\Sigma_{yuu}(\omega_1, \omega_2) = \sum_{m_1=0}^{M-1} \sum_{m_2=0}^{M-1} h(m_1, m_2)\, \big[\, e^{-j(\omega_1 m_1 + \omega_2 m_2)} + e^{-j(\omega_1 m_2 + \omega_2 m_1)} \,\big] \times \Big( \sum_{\sigma_1} c_{u,2}(\sigma_1)\, e^{j\omega_1 \sigma_1} \Big) \Big( \sum_{\sigma_2} c_{u,2}(\sigma_2)\, e^{j\omega_2 \sigma_2} \Big). \qquad \text{[A1.129]}$$

Under the hypothesis of real signals, with the symmetry ($\Sigma_u(-\omega) = \Sigma_u(\omega)$) of the input spectrum, we obtain:

$$\Sigma_{yuu}(\omega_1, \omega_2) = \big[ H(\omega_1, \omega_2) + H(\omega_2, \omega_1) \big]\, \Sigma_u(\omega_1)\, \Sigma_u(\omega_2). \qquad \text{[A1.130]}$$

Considering the kernel in symmetrized form $h_{\mathrm{sym}}(\tau_1, \tau_2)$, the corresponding transfer function in $\omega$ satisfies $H_{\mathrm{sym}}(\omega_1, \omega_2) = \frac{1}{2}\big[ H(\omega_1, \omega_2) + H(\omega_2, \omega_1) \big]$. We can therefore estimate the two-dimensional transform of the symmetrized quadratic kernel using the spectrum of the input and the cross-bispectrum of the input and output as follows:

$$H_{\mathrm{sym}}(\omega_1, \omega_2) = \frac{\Sigma_{yuu}(\omega_1, \omega_2)}{2\, \Sigma_u(\omega_1)\, \Sigma_u(\omega_2)}. \qquad \text{[A1.131]}$$

This formula was established by Tick (1961) for continuous-time quadratic Volterra systems. An extension to the case of discrete-time Volterra systems of arbitrary order was proposed by Koukoulas and Kalouptsidis (1995) for a Gaussian input (see also Nikias and Petropulu 1993; Mathews and Sicuranza 2000).

REMARK A1.13.– The cross-bispectrum $\Sigma_{yuu}(\omega_1, \omega_2)$ can be estimated by computing the two-dimensional DTFT of the cross-cumulant defined in [A1.118], which is in turn estimated using the expression [A1.120] with an empirical estimator of the second- and fourth-order moments of the input.
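To tie Remark A1.13 and formula [A1.131] together, here is an end-to-end sketch (mine, not from the book), restricted for simplicity to a white Gaussian input of known variance, so that $\Sigma_u(\omega)$ is constant and the cross-cumulant is supported on $\{0, \cdots, M-1\}^2$: the sample cross-cumulants are computed from input/output data, their 2D DFT estimates $\Sigma_{yuu}$ on a discrete frequency grid, and division by $2\,\Sigma_u(\omega_1)\Sigma_u(\omega_2)$ yields an estimate of $H_{\mathrm{sym}}(\omega_1, \omega_2)$. All variable names and values are illustrative assumptions.

```python
# Frequency-domain estimation of H_sym via [A1.131], white Gaussian input case.
import numpy as np

rng = np.random.default_rng(2)
M = 3
H = rng.standard_normal((M, M))               # arbitrary quadratic kernel h(m1, m2)
N = 200_000
s2 = 1.0                                      # known input variance, so Sigma_u(w) = s2
u = np.sqrt(s2) * rng.standard_normal(N)      # white, centered Gaussian input

# Output of the quadratic Volterra model [A1.116]
y = np.zeros(N)
for k in range(M - 1, N):
    uk = u[k::-1][:M]
    y[k] = uk @ H @ uk

# Sample cross-cumulants c_yuu(t1, t2), t1, t2 in {0, ..., M-1}, for a centered input
k = np.arange(M - 1, N)
Cyuu = np.empty((M, M))
for t1 in range(M):
    for t2 in range(M):
        u1, u2 = u[k - t1], u[k - t2]
        Cyuu[t1, t2] = np.mean(y[k] * u1 * u2) - np.mean(y[k]) * np.mean(u1 * u2)

# [A1.131] on the DFT grid: Sigma_yuu / (2 Sigma_u Sigma_u), with Sigma_u = s2
Hsym_hat = np.fft.fft2(Cyuu) / (2.0 * s2 ** 2)
Hsym_true = (np.fft.fft2(H) + np.fft.fft2(H).T) / 2.0    # (H(w1,w2) + H(w2,w1)) / 2
print(np.max(np.abs(Hsym_hat - Hsym_true)))              # estimation error; decreases as N grows
```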

References

Abderrahim, K., Ben Abdennour, R., Favier, G., Ksouri, M., Faouzi, M. (2001). New results on FIR system identification using cumulants. APII-JESA, 35(5), 601–622. Acar, E., Aykut-Bingol, C., Bingol, H., Bro, R., Yener, B. (2007). Multiway analysis of epilepsy tensors. Bioinformatics, 23, i10–i18. Acar, E., Dunlavy, D.M., Kolda, T.G., Morup, M. (2011a). Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1), 41–56. Acar, E., Kolda, T.G., Dunlavy, D.M. (2011b). All-at-once optimization for coupled matrix and tensor factorizations. KDD Workshop on Mining and Learning with Graphs. arXiv.org:1105.3422. Acar, E., Levin-Schwartz, Y., Calhoun, V.D., Adal, T. (2017). Tensor-based fusion of EEG and fMRI to understand neurological changes in schizophrenia. IEEE Int. Symp. on Circuits and Systems (ISCAS’2017), Baltimore, USA. Alexander, J. and Hirschowitz, A. (1995). Polynomial interpolation in several variables. J. Algebraic Geom., 4(4), 201–222. de Almeida, A.L.F. and Favier, G. (2013). Double Khatri-Rao space-time-frequency coding using semi-blind PARAFAC based receiver. IEEE Signal Processing Letters, 20(5), 471–474. de Almeida, A.L.F., Favier, G., Mota, J.C.M. (2006). Tensor-based space-time multiplexing codes for MIMO-OFDM systems with blind detection. Proc. of 17th IEEE Symp. Pers. Ind. Mob. Radio Com. (PIMRC’2006), Helsinki, Finland. de Almeida, A.L.F., Favier, G., Mota, J.C.M. (2007). PARAFAC-based unified tensor modeling for wireless communication systems with application to blind multiuser equalization. Signal Processing, 87(2), 337–351. de Almeida, A.L.F., Favier, G., Mota, J.C.M. (2008). A constrained factor decomposition with application to MIMO antenna systems. IEEE Trans. Signal Process., 56(6), 2429–2442. de Almeida, A.L.F., Favier, G., Mota, J.C.M. (2009a). Space-time spreading-multiplexing for MIMO wireless communication systems using the PARATUCK-2 tensor model. Signal Processing, 89(11), 2103–2116. de Almeida, A.L.F., Favier, G., Mota, J.C.M. (2009b). Constrained Tucker-3 model for blind beamforming. Signal Process, 89, 1240–1244.



de Almeida, A.L.F., Favier, G., da Costa, J.P.C.L., Mota, J.C. (2016). Overview of tensor decompositions with applications to communications. In Signals and Images: Advances and Results in Speech, Estimation, Compression, Recognition, Filtering, and Processing, Coelho, R.F., Nascimento, V.H., de Queiroz, R.L., Romano, J.M., Cavalcante, C.C. (eds). CRC Press, Boca Raton. Alshebeili, S.A. and Cetin, A.E. (1990). A phase reconstruction algorithm from bispectrum. IEEE Tr. on Geoscience and Remote Sensing, 28, 166–170. Alshebeili, S.A., Venetsanopoulos, A.N., Cetin, A.E. (1993). Cumulant based identification approaches for nonminimum phase FIR systems. IEEE Tr. Signal Proc., 41(4), 1576–1588. Amblard, P.O., Gaeta, M., Lacoume, J.L. (1996a). Statistics for complex variables and signals. Part I: Variables. Signal Processing, 53, 1–13. Amblard, P.O., Gaeta, M., Lacoume, J.L. (1996b). Statistics for complex variables and signals. Part II: Signals. Signal Processing, 53, 15–25. Anandkumar, A., Ge, R., Hsu, D., Kakade, S.M., Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. J. Machine Learning Res., 15(1), 2773–2832 [Online]. Available at: http://dl.acm.org/citation.cfm?id=2627435.2697055. Arachchilage, S.W. and Izquierdo, E. (2020). Deep-learned faces: A survey. EURASIP J. on Image and Video Processing, 25 [Online]. Available at: https://doi.org/10.1186/ s13640-02000510-w. Ballani, J., Grasedyck, L., Kluge, M. (2013). Black box approximation of tensors in hierarchical Tucker format. Linear Algebra and Its Applications, 438, 639–657. Banerjee, A., Char, A., Mondal, B. (2017). Spectra of general hypergraphs. Linear Algebra and Its Applications, 518, 14–30. Bartels, R. and Stewart, G. (1972). Solution of the matrix equation AX + XB = C. Comm. of the ACM, 15(9), 820–826. Becker, H., Albera, L., Comon, P., Haardt, M., Birot, G., Wendling, F., Gavaret, M., Bénar, C.-G., Merlet, I. (2014). EEG extended source localization: Tensor-based vs. conventional methods. NeuroImage, 96, 143–157. Beckmann, C.F. and Smith, S.M. (2005). Tensorial extensions of independent component analysis for multisubject fMRI analysis. NeuroImage, 25(1), 294–311. Behera, R. and Mishra, D. (2017). Further results on generalized inverses of tensors via the Einstein product. Linear and Multilinear Algebra, 65(8), 1662–1682. Behera, R., Maji, S., Mohapatra, R.N. (2020). Weighted Moore–Penrose inverses of arbitrary-order tensors. Computational and Applied Mathematics, 39, 284. Benetos, E. and Kotropoulos, C. (2008). A tensor-based approach for automatic music genre classification. EUSIPCO, Lausanne, Switzerland. Bengua, J.A., Phien, H.N., Tuan, H.D., Do, M.N. (2017). Efficient tensor completion for color image and video recovery: Low-rank tensor train. IEEE Tr. Image Processing, 26(5), 2466–2479. ten Berge, J.M.F. and Sidiropoulos, N.D. (2002). On uniqueness in CANDECOMP/ PARAFAC. Psychometrika, 67(3), 399–409. ten Berge, J.M.F. and Smilde, A.K. (2002). Non-triviality and identification of a constrained Tucker3 analysis. Journal of Chemometrics, 16, 609–612.


Bobadilla, J., Ortega, F., Hernando, A., Gutiérrez, A. (2013). Recommender systems survey. Knowledge-Based Systems, 46, 109–132. Bokde, D., Girase, S., Mukhopadhyay, D. (2015). Matrix factorization model in collaborative filtering algorithms: A survey. Procedia Computer Science, 49(1), 136–146. Bondon, P. and Picinbono, B. (1990). De la blancheur et de ses transformations. Traitement du signal, 7(5), 385–395. Bouilloc, T. and Favier, G. (2012). Nonlinear channel modeling and identification using bandpass Volterra-PARAFAC models. Signal Processing, 92(6), 1492–1498. Boutsidis, C., Mahoney, M.W., Drineas, P. (2010). An improved approximation algorithm for the column subset selection problem. Proc. 19th Annual ACM-SIAM Symp. on Discrete Algorithms (SODA), 968–977. Boutsidis, C. and Woodruff, D.P. (2014). Optimal CUR matrix decompositions. Proc. 46th ACM Symp. on Theory of Computing (STOC’14), 353–362. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122. Brandoni, D. and Simoncini, V. (2020). Tensor-train decomposition for image recognition. Calcolo, 57(9), 1–24. Brazell, M., Li, N., Navasca, C., Tamon, C. (2013). Solving multilinear systems via tensor inversion. SIAM J. Matrix Analysis and Applications, 34(2), 542–570. Brewer, J.W. (1978). Kronecker products and matrix calculus in system theory. IEEE Tr. on Circuits and Systems, 25(9), 772–781. Brillinger, D.R. (1965). An introduction to polyspectra. Ann. Math. Statist., 36, 1351–1374. Bro, R. (1997). PARAFAC. Tutorial and applications. Chemometrics and Intelligent Laboratory Systems, 38(2), 149–171. Bro, R. (2006). Review on multiway analysis in chemisty – 2000–2005. Critical Reviews in Analytical Chemistry, 36, 279–293. Bro, R. and Kiers, H.A.L. (2003). A new efficient method for determining the number of components in PARAFAC models. J. Chemometrics, 17(5), 274–286. Bro, R., Harshman, R.A., Sidiropoulos, N.D., Lundy, M.E. (2009). Modeling multi-way data with linearly dependent loadings. Chemometrics, 23(7–8), 324–340. Bu, C., Zhang, X., Zhou, J., Wang, W., Wei, Y. (2014). The inverse, rank and product of tensors. Linear Algebra and Its Applications, 446, 269–280. Burns, F., Carlson, D., Haynsworth, E., Markham, T. (1974). Generalized inverse formulas using the Schur complement. SIAM J. Appl. Math., 26(2), 254–259. Caiafa, C.F. and Cichocki, A. (2010). Generalizing the column–row matrix decomposition to multi-way arrays. Linear Algebra and Its Applications, 433, 557–573. Candès, E.J. and Plan, Y. (2010). Matrix completion with noise. Proc. IEEE, 98(6), 925–936. Candès, E.J. and Recht, B. (2009). Exact matrix completion via convex optimization. Found. Comput. Math., 9, 717–772. Candès, E.J. and Wakin, M.B. (2008). An introduction to compressive sampling. IEEE Signal Proc. Mag., 25(2), 21–30.


Cardoso, J.F. (1990). Localisation et identification par la quadricovariance. Traitement du signal, 7(5), 397–406. Carroll, J.D. and Chang, J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition. Psychometrika, 35(3), 283–319. Carroll, J.D., Pruzansky, S., Kruskal, J.B. (1980). CANDELINC: A general approach to multidimensional analysis of many-way arrays with linear constraints on parameters. Psychometrika, 45(1), 3–24. Cartwright, D. and Sturmfels, B. (2013). The number of eigenvalues of a tensor. Linear Algebra and Its Applications, 438, 942–952. Chang, K.C. and Zhang, T. (2013). On the uniqueness and non-uniqueness of the positive Z-eigenvector for transition probability tensors. J. Math. Anal. Appl., 408, 525–540. Chang, K.C., Pearson, K., Zhang, T. (2009). On eigenvalue problems of real symmetric tensors. J. Math. Anal. and Appl., 350, 416–422. Chang, K.C., Qi, L., Zhang, T. (2013). A survey on the spectral theory of nonnegative tensors. Numerical Linear Algebra Appl., 20(6), 891–912. Che, M., Cichocki, A., Wei, Y. (2017). Neural networks for computing best rank-one approximations of tensors and its applications. Neurocomputing, 267(6), 114–133. Chen, H. and Qi, L. (2015). Positive definiteness and semi-definiteness of even order symmetric Cauchy tensors. American Inst. of Math. Sciences, 11(4), 1263–1274. Chen, C., Surana, A., Bloch, A., Rajapakse, I. (2019). Multilinear time invariant system theory. Proc. of SIAM Conf. on Control and its Applications, 118–125 [Online]. Available at: arXiv:1905.07427v1. Chien, J.-T. and Bao, Y.-T. (2017). Tensor factorized neural networks. IEEE Tr. on Neural Networks and Learning Systems, 29(5), 1998–2011. Cichocki, A. (2014). Tensor networks for big data analytics and large-scale optimization problems [Online]. Available at: arXiv:1407.3124. Cline, A.K. and Dhillon, I.S. (2007). Computation of the singular value decomposition. In Handbook of Linear Algebra, Hogben, L. (ed.). Chapman & Hall/CRC Press, Boca Raton. Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314. Comon, P. and Cardoso, J.F. (1990). Eigenvalue decomposition of a cumulant tensor with applications. SPIE Conf. on Advanced Signal Processing Algorithms, Architectures, and Implementations, 361–372, San Diego, USA. Comon, P., Golub, G., Lim, L.-H., Mourrain, B. (2008). Symmetric tensors and symmetric tensor rank. SIAM J. Matrix Anal. Appl., 30(3), 1254–1279. Comon, P., ten Berge, J.M.F., de Lathauwer, L., Castaing, J. (2009a). Generic and typical ranks of multi-way arrays. Linear Algebra and Its Applications, 430(11), 2997–3007. Comon, P., Luciani, X., de Almeida, A.L.F. (2009b). Tensor decompositions, alternating least squares and other tales. J. of Chemometrics, 23(7–8), 393–405. Cong, F., Lin, Q.-H., Kuang, L.-D., Gong, X.-F., Astikainen, P., Ristaniemi, T. (2015). Tensor decomposition of EEG signals: A brief review. J. Neuroscience Methods, 248, 59–69.


da Costa, J.P.C.L., Haardt, M., Roemer, F. (2008). Robust methods based on HOSVD for estimating the model order in PARAFAC models. Proc. of 5th IEEE Sensor Array and Multich. Signal Proc. Workshop (SAM 2008), 510–514, Darmstadt, Germany. da Costa, J.P.C.L., Roemer, F., Weis, M., Haardt, M. (2010). Robust R-D parameter estimation via closed-form PARAFAC. Proc. of ITG Workshop on Smart Antennas (WSA’2010), 99–106, Bremen, Germany. da Costa, J.P.C.L., Roemer, F., Haardt, M., de Sousa, R.T. (2011). Multi-dimensional model order selection. EURASIP J. on Advances in Signal Processing, 26 [Online]. Available at: http://asp.eurasipjournals.com/content/2011/1/26. da Costa, M.N., Favier, G., Romano, J.-M. (2018). Tensor modelling of MIMO communication systems with performance analysis and Kronecker receivers. Signal Processing, 145, 304–316. Coste, M. (2001). Elimination, résultant. Discriminant. Agrégation preparation. Université de Rennes 1, Rennes. Crespo-Cadenas, C., Aguilera-Bonet, P., Becerra-Gonzalez, J.A., Cruces, S. (2014). On nonlinear amplifier modeling and identification using baseband Volterra-PARAFAC models. Signal Processing, 96, 401–405. Cui, L.-B., Chen, C., Li, W., Ng, M.K. (2015). An eigenvalue problem for even order tensors with its applications. Linear and Multilinear Algebra, 64(4), 602–621. Culp, J., Pearson, K.J., Zhang, T. (2017). On the uniqueness of the Z1-eigenvector of transition probability tensors. Linear and Multilinear Algebra, 65(5), 891–896. Debals, O. and de Lathauwer, L. (2017). The concept of tensorization. Technical report. ESATSTADIUS, KU Leuven, 1–34 [Online]. Available at: ftp://134.58.56.3/pub/sista/odebals/ debals2017concept.pdf. Ding, W. and Wei, Y. (2015). Generalized tensor eigenvalue problems. SIAM J. Matrix Anal. Appl., 36(3), 1073–1099. Ding, W., Qi, L., Wei, Y. (2013). M -tensors and nonsingular M -tensors. Linear Algebra and Its Applications, 439(10), 3264–3278. Domanov, I. and de Lathauwer, L. (2013). On the uniqueness of the canonical polyadic decomposition of third-order tensors. Part I: Basic results and uniqueness of one factor matrix. SIAM J. Matrix Anal. Appl., 34(3), 855–875. Domanov, I. and de Lathauwer, L. (2014). Generic uniqueness conditions for the canonical polyadic decomposition and INDSCAL. SIAM J. Matrix Anal. Appl., 36(4), 1567–1589. Drineas, P., Kannan, R., Mahoney, M.W. (2006). Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM J. Comput., 36(1), 184–206. Drineas, P., Mahoney, M.W., Muthukrishnan, S. (2008). Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl., 30(2), 844–881. Duarte, M.F. and Baraniuk, R.G. (2012). Kronecker compressive sensing. IEEE Tr. on Image Processing, 21(2), 494–504. Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 211–218.


Elisei-Iliescu, C., Dogariu, L.-M., Paleologu, C., Bebnesty, J., Enescu, A.-A., Ciochina, S. (2020). A recursive least-squares algorithm for the identification of trilinear forms. MDPI/Algorithms, 13, 135. Faber, N.M., Bro, R., Hopke, P.K. (2003). Recent developments in CANDECOMP/PARAFAC algorithms: A critical review. Chemometrics and Intell. Lab. Syst., 65, 119–137. Favier, G. (1982). Filtrage, modélisation et identification de systèmes linéaires stochastiques à temps discret. CNRS, Paris. Favier, G. (2004). Estimation paramétrique de modèles entrée-sortie. In Signaux aléatoires: modélisation, estimation, détection, Guglielmi, M. (ed.). Hermes, Lavoisier, Cachan. Favier, G. (2019). From Algebraic Structures to Tensors, ISTE Ltd, London and John Wiley & Sons, New York. Favier, G. and de Almeida, A.L.F. (2014a). Overview of constrained PARAFAC models. EURASIP J. Advances in Signal Processing, 5, 41. Favier, G. and de Almeida, A.L.F. (2014b). Tensor space-time-frequency coding with semi-blind receivers for MIMO wireless communication systems. IEEE Tr. Signal Processing, 62(22), 5987–6002. Favier, G. and Bouilloc, T. (2009). Parametric complexity reduction of Volterra models using tensor decompositions. 17th European Signal Proc. Conf. (EUSIPCO), Glasgow. Favier, G. and Bouilloc, T. (2010). Identification de modèles de Volterra basée sur la décomposition PARAFAC de leurs noyaux et le filtre de Kalman étendu. Traitement du signal, 27(1), 27–51. Favier, G. and Kibangou, A.Y. (2009). Tensor-based methods for system identification. Part 2: Three examples of tensor-based system identification methods. Int. Journal on Sciences and Techniques of Automatic Control (IJ-STA), 3(1), 870–889. Favier, G., Dembélé, D., Peyre, J.L. (1994). ARMA identification using high-order statistics based linear methods: A unified presentation. EUSIPCO’94, 203–207, Edinburgh. Favier, G., Kibangou, A., Bouilloc, T. (2012a). Nonlinear system modeling and identification using Volterra-PARAFAC models. Int. J. of Adaptive Control and Signal Proc., 26(1), 30–53. Favier, G., da Costa, M.N., de Almeida, A.L.F., Romano, J.M.T. (2012b). Tensor space–time (TST) coding for MIMO wireless communication systems. Signal Processing, 92(4), 1079–1092. Favier, G., Bouilloc, T., de Almeida, A.L.F. (2012c). Blind constrained block-Tucker2 receiver for multiuser SIMO NL-CDMA communication systems. Signal Processing, 92(7), 1624–1636. Favier, G., Fernandes, C.E.R, de Almeida, A.L.F. (2016). Nested Tucker tensor decomposition with application to MIMO relay systems using tensor space–time coding (TSTC). Signal Processing, 128, 318–331. Filipovic, M. and Jukic, A. (2015). Tucker factorization with missing data with application to low-n-rank tensor completion. Multidimensional Systems and Signal Processing, 26, 677–692. Freitas, W., Favier, G., de Almeida, A.L.F. (2018). Generalized Khatri-Rao and Kronecker space-time coding for MIMO relay systems with closed-form semi-blind receivers. Signal Processing, 151, 19–31.


Frolov, E. and Oseledets, I. (2017). Tensor methods and recommender systems. WIREs Data Mining Knowl. Discov., 7(3). Fu, T., Jiang, B., Li, Z. (2018). On decompositions and approximations of conjugate partial-symmetric complex tensors [Online]. Available at: arXiv:1802.09013v1. Gardner, W.A. (1991). Exploitation of spectral redundancy in cyclostationary signals. IEEE Signal Proc. Magazine, 8(2), 14–36. Gelfand, I.M., Kapranov, M.M., Zelevinsky, A.V. (1992). Hyperdeterminants. Advances in Mathematics, 96, 226–263. Giannakis, G.B. (1987). Cumulants: A powerful tool in signal processing. Proc. of the IEEE, 75, 1333–1334. Giannakis, G.B. and Mendel, J.M. (1989). Identification of nonminimum phase systems using higher order statistics. IEEE Tr. on Acoustics, Speech and Signal Proc., 37(3), 360–377. Golub, G.H. and Van Loan, C.F. (1983). Matrix Computations. Johns Hopkins University Press, Oxford. Goodman, N.R. (1963). Statistical analysis based on certain multivariate complex Gaussian distribution. Ann. Math. Statist., 34, 152–176. Goulart, J.H.M. and Favier, G. (2014). An algebraic solution for the CANDECOMP/PARAFAC decomposition with circulant factors. SIAM J. Matrix Anal. Appl., 35(4), 1543–1562. Goulart, J.H.M., Kibangou, A., Favier, G. (2017). Traffic data imputation via tensor completion based on soft thresholding of Tucker core. Transportation Research, Part C: Emerging Technologies, 85, 348–362. Grasedyck, L. and Hackbusch, W. (2011). An introduction to hierarchical (h-)rank and TT-rank of tensors with examples. Comput. Meth. in Appl. Math., 11, 291–304. Grasedyck, L., Kluge, M., Kramer, S. (2015). Variants of alternating least squares tensor completion in the tensor train format. SIAM J. Sci. Comput., 37(5), A2424–A2450. Guo, X., Miron, S., Brie, D., Zhu, S., Liao, X. (2011). A CANDECOMP/PARAFAC perspective on uniqueness of DOA estimation using a vector sensor array. IEEE Trans. Signal Process., 59(7), 3475–3481. Hao, N., Kilmer, M.E., Braman, K., Hoover, R.C. (2013). Facial recognition using tensor-tensor product decompositions. SIAM J. Imaging Sci., 6, 437–463. Hao, C., Cui, C., Dai, Y.-H. (2015). A sequential subspace projection method for extreme Z-eigenvalues of supersymmetric tensors. Numerical Linear Algebra with Applications, 22(2), 283–298. Harshman, R.A. (1970). Foundations of the PARAFAC procedure: Model and conditions for an “explanatory” multimodal factor analysis. UCLA Working Papers in Phonetics, 16, 1–84. Harshman, R.A. (1972). Determination and proof of minimum uniqueness conditions for PARAFAC1. UCLA Working Papers in Phonetics, 22, 111–117. Harshman, R.A. and Lundy, M.E. (1996). Uniqueness proof for a family of models sharing features of Tucker’s three-mode factor analysis and PARAFAC/CANDECOMP. Psychometrika, 61, 133–154. Harshman, R.A., Hong, S., Lundy, M.E. (2003). Shifted factor analysis. Part I: Models and properties. J. Chemometrics, 17, 363–378.


Hastad, J. (1990). Tensor rank is NP-complete. J. Algorithms, 11(4), 644–654. Henderson, H.V. and Searle, S.R. (1979). Vec and Vech operators for matrices, with some uses in Jacobians and multivariate statistics. Canad. J. Statist, 7(1), 65–81. Henderson, H.V. and Searle, S.R. (1981). The vec-pemutation matrix, the vec operator and Kronecker products: A review. Linear and Multilinear Algebra, 9, 271–288. Henderson, H.V., Pukelsheim, F., Searle, S.R. (1983). On the history of the Kronecker product. Linear and Multilinear Algebra, 14, 113–120. Hillar, C.J. and Lim, L.H. (2013). Most tensor problems are NP-hard. J. of the ACM, 60(6), 45:1–45:39. Hinich, M.J. (1982). Testing for Gaussianity and linearity of a stationary time series. J. Time Series Anal., 3, 169–176. Hitchcock, F.L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(3), 164–189. Horn, R.A. (1990). The Hadamard product. Proc. Symp. Appl. Math., 40, 87–169. Horn, R.A. and Johnson, C.A. (1985). Matrix Analysis. Cambridge University Press, Cambridge. Horn, R.A. and Johnson, C.A. (1991). Topics in Matrix Analysis. Cambridge University Press, Cambridge. Hu, S. and Qi, L. (2014). The eigenvectors associated with the zero eigenvalues of the Laplacian and signless Laplacian tensors of a uniform hypergraph. Discrete Applied Mathematics, 169, 140–151. Hu, S., Huang, Z.H., Qi, L. (2013). Finding the extreme Z-eigenvalues of tensors via a sequential semidefinite programming method. Numerical Linear Algebra with Applications, 20, 972–984. Huang, Z. and Qi, L. (2018). Positive definiteness of paired symmetric tensors and elasticity tensors. Computational and Applied Mathematics, 338, 22–43. Hyland, D.C. and Collins, E.G. (1989). Block Kronecker products and block norm matrices in large-scale systems analysis. SIAM J. Matrix Anal. Appl., 10(1), 18–29. Jiang, T. and Sidiropoulos, N.D. (2004). Kruskal’s permutation lemma and the identification of CANDECOMP/PARAFAC and bilinear models with constant modulus constraints. IEEE Trans. Signal Process., 52(9), 2625–2636. Jiang, T., Sidiropoulos, N.D., ten Berge, J.M.F. (2001). Almost-sure identifiability of multidimensional harmonic retrieval. IEEE Trans. Signal Process., 49(9), 1849–1859. Jin, H., Bai, M., Benitez, J., Liu, X. (2017). The generalized inverses of tensors and application to linear models. Computers and Math. with Appl., 74(3), 385–397. Khatri, C.G. and Rao, C.R. (1968). Solutions to some functional equations and their applications to characterization of probability distributions. Sankhya, Indian J. Statistics, Series A, 30, 167–180. Khatri, C.G. and Rao, C.R. (1972). Functional equations and characterization of probability laws through linear functions of random variables. J. of Multivariate Analysis, 2, 162–173. Kibangou, A.Y. and Favier, G. (2007). Blind joint identification and equalization of Wiener-Hammerstein communication channels using PARATUCK-2 tensor decomposition. Proc. EUSIPCO’2007, Poznan, Poland.


Kibangou, A.Y. and Favier, G. (2009a). Identification of parallel-cascade Wiener systems using joint diagonalization of third-order Volterra kernel slices. IEEE Signal Processing Letters, 16(3), 188–191. Kibangou, A.Y. and Favier, G. (2009b). Non-iterative solution for PARAFAC with a Toeplitz matrix factor. Proc. of EUSIPCO, Glasgow. Kibangou, A.Y. and Favier, G. (2010). Tensor analysis-based model structure determination and parameter estimation for block-oriented nonlinear systems. IEEE Journal of Selected Topics in Signal Processing, 4(3), 514–525. Kiers, H.A.L. (2000). Towards a standardized notation and terminology in multiway analysis. J. Chemometrics, 14(2), 105–122. Kilmer, M.E. and Martin, C.D. (2011). Factorization strategies for third-order tensors. Linear Algebra and Its Applications, 435, 641–658. Kofidis, E. and Regalia, P.A. (2002). On the best rank-1 approximation of higher-order supersymmetric tensors. SIAM J. Matrix Anal. Appl., 23(3), 863–884. Kolda, T.G. and Bader, B.W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500. Kolda, T.G. and Mayo, J.R. (2014). An adaptive shifted power method for computing generalized tensor eigenpairs. SIAM J. Matrix Anal. Appl., 35(4), 1563–1581. Koning, R.H., Neudecker, H., Wansbeek, T. (1991). Block Kronecker products and the vecb operator. Linear Algebra and Its Applications, 149, 165–184. Koukoulas, P. and Kalouptsidis, N. (1995). Nonlinear system identification using Gaussian inputs. IEEE Tr. Signal Processing, 43(8), 1831–1841. Kroonenberg, P.M. and de Leeuw, J. (1980). Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika, 45(1), 69–97. Kruskal, J.B. (1977). Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl., 18(2), 95–138. Kruskal, J.B. (1989). Rank, decomposition, and uniqueness for 3-way and N-way arrays. In Multiway Data Analysis, Coppi, R., Bolasco, S. (eds). Elsevier, Amsterdam. Kruskal, J.B., Harshman, R.A., Lundy, M.E. (1989). How 3-MFA data can cause degenerate PARAFAC solutions, among other relationships. In Multiway Data Analysis, Coppi, R., Bolasco, S. (eds). Elsevier, Amsterdam. Lacoume, J.-L., Amblard, P.-O., Comon, P. (1997). Statistiques d’ordre supérieur pour le traitement du signal. Masson, Paris. Lahat, D., Adah, T., Jutten, C. (2015). Multimodal data fusion: An overview of methods, challenges and prospects. Proceedings of the IEEE, 103(9), 1449–1477. Lancaster, P. and Tismenetsky, M. (1985). The Theory of Matrices with Applications. Academic Press, New York. Landsberg, J.M. (2012). Tensors: Geometry and Applications. American Mathematical Society, Providence, RI. de Launey, W. and Seberry, J. (1994). The strong Kronecker product. Journal of Combinatorial Theory, 66(2), 192–213. de Lathauwer, L. (1997). Signal processing based on multilinear algebra. PhD. Thesis, KUL, Leuven.


de Lathauwer, L. (2006). A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM J. Matrix Anal. Appl., 28(3), 642–666. de Lathauwer, L. (2008). Decompositions of a higher-order tensor in block terms. Part II: Definitions and uniqueness. SIAM J. Matrix Anal. Appl., 30(3), 1033–1066. de Lathauwer, L. (2011). Blind separation of exponential polynomials and the decomposition of a tensor in rank-(Lr , Lr , 1) terms. SIAM J. Matrix Anal. Appl., 32(4), 1451–1474. de Lathauwer, L., de Moor, B., Vandewalle, J. (2000a). A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 21(4), 1253–1278. de Lathauwer, L., de Moor, B., Vandewalle, J. (2000b). On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl., 21(4), 1324–1342. Latorre, J.I. (2005). Image compression and entanglement [Online]. Available at: https://arxiv.org/abs/quant-ph/0510031. Lawson, C.L. and Hanson, R.J. (1974). Solving Least Squares Problems. Prentice-Hall, Englewood Cliffs, NJ. Lee, N. and Cichocki, A. (2014). Big data matrix singular value decomposition based on low-rank tensor train decomposition. In Advances in Neural Networks-ISNN, Zeng, Z., Li, Y., King, I. (eds). Springer, Cham. Lee, N. and Cichocki, A. (2017). Fundamental tensor operations for large-scale data analysis using tensor network formats. Multidim. Syst. Signal Process, 29, 921–960. Leonov, V.P. and Shiryaev, A.N. (1959). On a method of calculations of semi-invariants. Theory Probab. Appl., IV(3), 319–328. Lev Ari, H. (2005). Efficient solution of linear matrix equations with application to multistatic antenna array processing. Communications in Information and Systems, 5(1), 123–130. Li, Y. and Ding, Z. (1994). A new nonparametric method for linear system phase recovery from bispectrum. IEEE Tr. Circuits and Syst. II: Analog and Digital Signal Proc., 41, 415–419. Li, W. and Ng, M. (2014). On the limiting probability distribution of a transition probability tensor. Linear and Multilinear Algebra, 62, 362–385. Li, G., Qi, L., Yu, G. (2013). The Z-eigenvalues of a symmetric tensor and its application to spectral hypergraph theory. Numerical Linear Algebra with Applications, 20, 1001–1029. Li, S., Dian, R., Fang, L., Bioucas-Dias, J.M. (2018). Fusing hyperspectral and multispectral images via coupled sparse tensor factorisation. IEEE Tr. on Image Processing, 27(8), 4118–4130. Liang, M. and Zheng, B. (2018). Further results on Moore–Penrose inverses of tensors with application to tensor nearness problems. Computers & Mathematics with Applications, 77(5), 1282–1293. Liang, M., Zheng, B., Zhao, R.-J. (2018). Tensor inversion and its application to the tensor equations with Einstein product. Linear and Multilinear Algebra [Online]. Available at: https://doi.org/10.1080/03081087.2018.1500993. Lim, L.-H. (2005). Singular values and eigenvalues of tensors: A variational approach. Proc. of the IEEE Int. Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP’05). Puerto Vallarta, Mexico. Lim, L.-H. (2013). Tensors and hypermatrices. In Handbook of Linear Algebra, 2nd edition, Hogben, L. (ed.). Chapman & Hall/CRC Press, Boca Raton.


Lim, L.H. and Comon, P. (2009). Nonnegative approximations of nonnegative tensors. J. of Chemometrics, 23, 432–441. Liu, S. and Trenkler, G. (2008). Hadamard, Khatri-Rao, Kronecker and other matrix products. Int. J. Information and Syst. Sc., 4(1), 160–177. Liu, J., Musialski, P., Wonka, P., Ye, J. (2013). Tensor completion for estimating missing values in visual data. IEEE Tr. Pattern Analysis and Machine Intelligence, 35(1), 208–220. Liu, D., Li, W., Vong, S.-W. (2018). The tensor splitting with application to solve multilinear systems. Journal of Computational and Applied Mathematics, 330, 75–94. Lu, H., Plataniotis, K.N., Venetsanopoulos, A.N. (2008). MPCA: Multilinear principal component analysis of tensor objects. IEEE Trans. Neural Netw., 19(1), 18–39. Lu, C., Feng, J., Chen, Y., Liu, W., Lin, Z., Yan, S. (2020). Tensor robust principal component analysis with a new tensor nuclear norm. IEEE Tr. on Pattern Analysis and Machine Intelligence, 42(4), 925–938. Luo, X., Zhou, M., Xia, Y., Zhu, Q. (2014). An efficient non-negative matrix-factorizationbased approach to collaborative filtering for recommender systems. IEEE Tr. Industrial Informatics, 10(2), 1273–1284. Lyakh, D.I. (2015). An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. Computer Physics Communications, 189, 84–91. Magnus, J.R. (2010). On the concept of matrix derivative. J. of Multivariate Analysis, 101, 2200–2206. Magnus, J.R. and Neudecker, H. (1979). The commutation matrix: Some properties and applications. Annals of Statistics, 7(2), 381–394. Magnus, J.R. and Neudecker, H. (1985). Matrix differential calculus with applications to simple Hadamard, and Kronecker products. J. Mathematical Psychology, 29, 474–492. Magnus, J.R. and Neudecker, H. (1988). Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, Chichester. Mahoney, M.W. and Drineas, P. (2009). CUR matrix decompositions for improved data analysis. Proc. of the National Academy of Sciences (PNAS), 106(3), 697–702. Mahoney, M.W., Maggioni, M., Drineas, P. (2008). Tensor-CUR decompositions for tensor-based data. SIAM J. on Matrix Analysis and Applications, 30(3), 957–987. Makantasis, K., Doulamis, A., Nikitakis, A. (2018). Tensor-based classification models for hyperspectral data analysis. IEEE Tr. Geoscience and Remote Sensing, 56(12), 6884–6898. Marcus, M. and Khan, N.A. (1959). A note on the Hadamard product. Canadian Math. Bull., 2, 81–83. Mathews, V.J. and Sicuranza, G.L. (2000). Polynomial Signal Processing. John Wiley & Sons, New York. McCullagh, P. (1987). Tensor Methods in Statistics. Chapman & Hall/CRC Press, Boca Raton. Mendel, J.M. (1991). Tutorial on higher-order statistics (spectra) in signal processing and system theory: Theoretical results and some applications. Proc. of the IEEE, 79(3), 278–305. Meyer, C.D. (2000). Matrix Analysis and Applied Linear Algebra. SIAM, Philadelphia. Miao, Y., Qi, L., Wei, Y. (2021). T-Jordan canonical form and T-Drazin inverse based on the T-product. Communications on Applied Math. and Comput., 3, 201–220.


Mika, S., Schölkopf, B., Smola, A.J., Müller, K.-R., Scholz, M., Rätsch, G. (1999). Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems 11, Kearns, M.S., Solla, S.A., Cohn, D.A. (eds). MIT Press, Cambridge. Mitrovic, N., Asif, M.T., Rasheed, U., Dauwels, J., Jaillet, P. (2013). CUR decomposition for compression and compressed sensing of large-scale traffic data. 16th Int. IEEE Conf. on Intelligent Transportation Systems (ITSC’2013), 1475–1480, The Hague, The Netherlands. Morozov, A. and Shakirov, S. (2010). New and old results in resultant theory. Theoretical and Math. Physics, 163, 587–617. Morup, M. (2011). Applications of tensor (multiway array) factorizations and decompositions in data mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 24–40. Morup, M., Hansen, L.K., Madsen, K.H. (2011). Modeling latency and shape changes in trial based neuroimaging data. Asilomar-SSC, Monterey, USA. Na, Y.J., Kim, K.S., Song, I., Kim, T. (1995). Identification of nonminimum phase systems using the third and fourth order cumulants. IEEE Tr. on Signal Proc., 43(8), 2018–2022. Nagy, J.G., Ng, M.K., Perrone, L. (2004). Kronecker product approximation for image restoration with reflexive boundary conditions. SIAM J. Matrix Anal. Appl., 25(3), 829–841. Nandi, A.K. (1999). Blind Estimation Using Higher-Order Statistics. Kluwer Academic Publishers, Boston, MA. Nanopoulos, A., Rafailidis, D., Symeonidis, P., Manolopoulos, Y. (2010). MusicBox: Personalized music recommendation based on cubic analysis of social tags. IEEE Tr. Audio, Speech and Language Processing, 18, 407–412. Neudecker, H. and Liu, S. (2001). Some statistical properties of Hadamard products of random matrices. Statist. Papers, 42, 475–487. Neudecker, H., Liu, S., Polasek, W. (1995). The Hadamard product and some of its applications in statistics. Statistics, 26(4), 365–373. Ni, G. (2019). Hermitian tensor and quantum mixed state [Online]. Available at: arXiv:1902.02640v4. Nie, J. (2017). Generating polynomials and symmetric tensor decompositions. Foundations of Comput. Math., 17, 423–465. Nikias, C.L. (1988). ARMA bispectrum approach to nonminimum phase system identification. IEEE Tr. on Acoustics, Speech and Signal Proc., 36, 513–524. Nikias, C.L. and Mendel, J.M. (1993). Signal processing with higher-order spectra. IEEE Signal Proc. Magazine, 10–37. Nikias, C.L. and Petropulu, A.P. (1993). Higher-order Spectra Analysis. A Nonlinear Signal Processing Framework. Prentice-Hall, Englewood Cliffs, NJ. Nikias, C.L. and Raghuveer, M.R. (1987). Bispectrum estimation: A digital signal processing framework. Proc. of the IEEE, 75(7), 869–891. Novikov, A., Podoprikhin, D., Osokin, A., Vetrov, D. (2015). Tensorizing neural networks. Proc. of the 28th Int. Conf. on Neural Information Processing Systems (NIPS’15), 1, 442–450. Olive, M. and Auffray, N. (2013). Symmetry classes for even-order tensors. Mathematics and Mechanics Complex Systems, 1(2), 177–210.


Oseledets, I. (2011). Tensor-train decomposition. SIAM J. Sci. Computing, 33(5), 2295–2317. Oseledets, I. and Tyrtyshnikov, E. (2009). Breaking the curse of dimensionality, or how to use SVD in many dimensions,. SIAM J. Sci. Computing, 31, 3744–3759. Oseledets, I. and Tyrtyshnikov, E. (2010). TT-cross approximation for multidimensional arrays. Linear Algebra and Its Applications, 432, 70–88. Paatero, P. (2000). Construction and analysis of degenerate PARAFAC models. J. Chemometrics, 14, 285–299. Padhy, S., Goovaerts, G., Boussé, M., De Lathauwer, L., Van Huffel, S. (2019). The power of tensor-based approaches in cardiac applications In Biomedical Signal Processing. Advances in Theory, Algorithms and Applications, Naik, G. (ed.). Springer, Singapore. Paleologu, C., Benesty, J., Ciochina, S. (2018). Linear system identification based on a Kronecker product decomposition. IEEE/ACM Tr. on Audio, Speech, and Language Proc., 26(10), 1793–1808. Pan, R. (2014). Tensor transpose and its properties [Online]. Available at: arXiv:1411.1503v1. Panagakis, Y., Kotropoulos, C., Arce, G.R. (2010). Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification. IEEE Tr. on Audio, Speech, and Language Processing, 18(3), 576–588. Panigrahy, K. and Mishra, D. (2018). An extension of the Moore–Penrose inverse of a tensor via the Einstein product [Online]. Available at: arXiv:1806.03655v1. Panigrahy, K., Behera, R., Mishra, D. (2020). Reverse order law for the Moore–Penrose inverses of tensors. Linear and Multilinear Algebra, 68(2), 246–264. Papalexakis, E., Faloutsos, C., Siropoulos, N.D. (2016). Tensors for data mining and data fusion: Models, applications, and scalable algorithms. ACM Tr. Intelligent Systems and Technology, 8(2). Papoulis, A. (1984). Probability, Random Variables and Stochastic Processes. McGraw-Hill, New York. Papy, J.M., de Lathauwer, L., Van Huffel, S. (2005). Exponential data fitting using multilinear algebra: The single-channel and multi-channel case. Numerical Algebra with Applications, 12(8), 809–826. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(6), 559–572. Pearson, K.J. (2010). Essentially positive tensors. Int. J. of Algebra, 4(9), 421–427. Pearson, K.J. and Zhang, J. (2014). On spectral hypergraph theory of the adjacency tensor. Graphs and Combinatorics, 30, 1233–1248. Phan, A.-H., Tichavsky, P., Cichocki, A. (2013). Low complexity damped Gauss–Newton algorithms for CANDECOMP/PARAFAC. SIAM J. Appl. Math., 34(1), 126–147. Phan, A.-H., Tichavsky, P., Cichocki, A. (2017). Blind source separation of single channel mixture using tensorization and tensor diagonalization. VA/ICA 2017, LNCS 10169, 36–46. Picinbono, B. (1993). Random Signals and Systems. Prentice-Hall, Englewood Cliffs, NJ. Picinbono, B. (1994). On circularity. IEEE Tr. Signal Processing, 42(12), 3473–3482. Pisarenko, V.F. (1973). The retrieval of harmonics from a covariance function. Geophysical J. Int., 33(3), 347–366.


Pitsianis, N.P. (1997). The Kronecker product in approximation and fast transform generation. PhD Thesis, Cornell University. Pollock, D.S.G. (2011). On Kronecker products, tensor products and matrix differential calculus. Working paper 11/34, Department of Economics, University of Leicester [Online]. Available at: http://www.le.ac.uk/ec/research/RePEc/lec/leecon/dp11-34.pdf. Qi, L. (2005). Eigenvalues of a real supersymmetric tensor. J. Symbolic Computation, 40, 1302–1324. Qi, L. (2012). The spectral theory of tensors [Online]. Available at: arXiv:1201.3424v1. Qi, L., Wang, F., Wang, Y. (2009). Z-eigenvalue methods for a global polynomial problem. Mathematical Programming, 118, 301–316. Ragnarsson, S. and Van Loan, C.F. (2013). Block tensors and symmetric embeddings. Linear Algebra and Its Applications, 438, 853–874. Raimondi, F., Cabral Farias, R., Michel, O., Comon, P. (2017). Wideband multiple diversity tensor array processing. IEEE Tr. on Signal Processing, 65(20), 5334–5346. Ran, B., Tan, H., Wu, Y., Jin, P.J., (2016). Tensor based missing traffic data completion with spatial-temporal correlation. Physica A: Statistical Mechanics and its Applications, 446(15), 54–63. Regalia, P.A. and Mitra, S.K. (1989). Kronecker products, unitary matrices and signal processing applications. SIAM Review, 31(4), 586–613. Rendle, S. and Schmidt-Thieme, L. (2010). Pairwise interaction tensor factorization for personalized tag recommendation. Proc. of the Third ACM Intern. Conf. on Web Search and Data Mining, 81–90. Rezghi, M. and Elden, L. (2011). Diagonalization of tensors with circulant structure. Linear Algebra and its Applications, 435, 422–447. Rocha, D.S., Fernandes, C.E.R., Favier, G. (2019a). MIMO multi-relay systems with tensor space-time coding based on coupled nested Tucker decomposition. Digital Signal Processing, 89(3), 170–185. Rocha, D., Favier, G., Fernandes, C.E.R. (2019b). Closed-form receiver for multi-hop MIMO relay systems with tensor space-time coding. Journal of Communication and Information Systems, 34(1), 50–54. Roth, W.E. (1934). On direct product matrices. Bulletin of the American Mathematical Society, 40, 461–468. Sanguansat, P. (2012). Principal Component Analysis: Multidisciplinary Applications. Intech, London. Schetzen, M. (1980). The Volterra and Wiener Theories of Nonlinear Systems. John Wiley & Sons, New York. Schölkopf, B., Smola, A.J., Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319. Sidiropoulos, N.D. (2001). Generalizing Carathéodory’s uniqueness of harmonic parameterization to N dimensions. IEEE Tr. on Information Theory, 47(4), 1687–1690. Sidiropoulos, N.D. and Bro, R. (2000). On the uniqueness of multilinear decomposition of N-way arrays. J. Chemometrics, 14, 229–239.


Sidiropoulos, N.D. and Budampati, R. (2002). Khatri-Rao space-time codes. IEEE Trans. Signal Process., 50(10), 2377–2388. Sidiropoulos, N.D. and Kyrillidis, A. (2012). Multi-way compressed sensing for sparse low-rank tensors. IEEE Signal Proc. Letters, 19(11), 757–760. Sidiropoulos, N.D. and Liu, X. (2001). Identifiability results for blind beamforming in incoherent multipath with small delay spread. IEEE Trans. Signal Process., 49(1), 228–236. Sidiropoulos, N.D., Bro, R., Giannakis, G.B. (2000a). Parallel factor analysis in sensor array processing. IEEE Trans. Signal Process., 48(8), 2377–2388. Sidiropoulos, N.D., Giannakis, G.B., Bro, R. (2000b). Blind PARAFAC receivers for DS-CDMA systems. IEEE Trans. Signal Process., 48(3), 810–823. Sidiropoulos, N.D., de Lathauwer, L., Fu, X., Huang, K., Papalexakis, E., Faloutsos, C. (2017). Tensor decomposition for signal processing and machine learning. IEEE Tr. Signal Processing, 65(13), 3551–3582. Signoretto, M., Plas, R.V., De Moor, B., Suykens, J.A.K. (2011). Tensor versus matrix completion: A comparison with application to spectral data. IEEE Signal Proc. Letters, 18(7), 403–406. de Silva, V. and Lim, L.-H. (2008). Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl., 30(3), 1084–1127. Smilde, A.K., Bro, R., Geladi, P. (2004). Multi-way Analysis. Applications in the Chemical Sciences. Wiley, Chichester. Söderström, T. (1994). Discrete-time Stochastic Systems. Prentice-Hall, Englewood Cliffs. Song, Y. and Qi, L. (2015). Properties of some classes of structured tensors. Journal of Optimization Theory and Applications, 165(3), 854–873. Sorensen, D.C. and Embree, M. (2015). A DEIM induced CUR factorization. Technical report [Online]. Available at: https://scholarship.rice.edu/handle/1911/102226. Springer, P., Sankaran, A., Bientinesi, P. (2017). TTC: A tensor transposition compiler for multiple architectures. Proc. of the 3rd ACM SIGPLAN Int. Workshop on Libraries, Languages, and Compilers for Array Programming, 41–46, New York, USA. Stanimirovic, P., Ciric, M., Katsikis, V., Li, C., Ma, H. (2020). Outer and (b, c) inverses of tensors. Linear Multilinear Algebra, 68(5) [Online]. Available at: https://doi.org/10.1080/03081087.2018.1521783. Stegeman, A. (2006). Degeneracy in CANDECOMP/PARAFAC explained for p × p × 2 arrays of rank p + 1 or higher. Psychometrika, 71(3), 483–501. Stegeman, A. (2008). On uniqueness conditions for CANDECOMP/PARAFAC and INDSCAL with full column rank in one mode. Lin. Alg. Appl., 431(1–2), 211–227. Stegeman, A. and Sidiropoulos, N.D. (2007). On Kruskal’s uniqueness condition for the CANDECOMP/PARAFAC decomposition. Lin. Alg. Appl., 420, 540–552. Styan, G.P.H. (1973). Hadamard products and multivariate statistical analysis. Linear Algebra Appl., 6, 217–240. Sun, L., Zheng, B., Bu, C., Wei, Y. (2016). Moore–Penrose inverse of tensors via Einstein product. Linear Multilinear Algebra, 64, 686–698. Swami, A. and Mendel, J.M. (1990). ARMA parameter estimation using only output cumulants. IEEE Tr. on Acoustics, Speech and Signal Proc., 38, 1257–1265.


Symeonidis, P. and Zioupos, A. (2016). Matrix and Tensor Factorization Techniques for Recommender Systems. Springer, Cham. Tan, H., Feng, G., Feng, J., Wang, W., Zhang, Y.-J., Li, F. (2013). A tensor-based method for missing traffic data completion. Transportation Research Part C: Emerging Technologies, 28, 15–27. Therrien, C.W. (1992). Discrete Random Signals and Statistical Signal Processing. Prentice-Hall, Englewood Cliffs, NJ. Tick, L.J. (1961). The estimation of transfer functions of quadratic systems. Technom., 3, 563–567. Tomasi, G. and Bro, R. (2005). PARAFAC and missing values. Chemometrics and Intelligent Laboratory Systems, 75(2), 163–180. Tomasi, G. and Bro, R. (2006). A comparison of algorithms for fitting the PARAFAC model. Computational Statistics & Data Analysis, 50(7), 1700–1734. Tracy, D.S. and Dwyer, P.S. (1969). Multivariate maxima and minima with matrix derivatives. J. Amer. Statist. Assoc., 64(328), 1576–1594. Tracy, D.S. and Singh, R.P. (1972). A new matrix product and its applications in matrix differentiation. Statist. Neerlandica, 26, 143–157. Tucker, L.R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31, 279–311. Tugnait, J.K. (1990). Approaches to FIR system identification with noisy data using higher order statistics. IEEE Tr. on Acoustics, Speech and Signal Proc., 38(7), 1307–1317. Van Loan, C.F. (2000). The ubiquitous Kronecker product. J. of Computational and Applied Math., 123, 85–100. Van Loan, C.F. (2009). The Kronecker product. A product of the times. SIAM Conf. on Applied Linear Algebra, Monterey, CA. Van Loan, C.F. and Pitsianis, N. (1993). Approximation with Kronecker products. In Linear Algebra for Large Scale and Real-time Applications, Moonen, M.S., Golub, G.H., de Moor, B.L. (eds). Kluwer Academic Publishers, Dordrecht. Vasilescu, M.A.O. and Terzopoulos, D. (2002). Multilinear analysis of image ensembles: TensorFaces. Proc. of the European Conf. on Computer Vision (ECCV ’02), 447–460, Copenhagen. Vasilescu, M.A.O. and Terzopoulos, D. (2005). Multilinear independent component analysis. Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR ’05), 1, 547–553, San Diego, USA. Voronin, S. and Martinsson, P.-G. (2017). Efficient algorithms for CUR and interpolative matrix decompositions. Adv. Comput. Math., 43, 495–516. Wang, S. and Zhang, Z. (2013). Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. J. of Machine Learning Res., 14, 2729–2769. Wang, R., Li, S., Cheng, L., Wong, M.H., Leung, K.S. (2019). Predicting associations among drugs, targets and diseases by tensor decomposition for drug repositioning. BMC Bioinformatics, 20(26), 628. Xie, J. and Qi, L. (2016). Spectral directed hypergraph theory via tensors. Linear and Multilinear Algebra, 780–794.


Ximenes, L., Favier, G., de Almeida, A. (2014). PARAFAC-PARATUCK semi-blind receivers for two-hop cooperative MIMO relay systems. IEEE Tr. Signal Processing, 62(14), 3604–3615. Xu, L., Liang, G., Longxiang, Y., Hongbo, Z. (2012). PARALIND-based blind joint angle and delay estimation for multipath signals with uniform linear array. EURASIP J. Advances in Signal Proc., 130. Zarzoso, V. and Nandi, A.K. (1999). Blind source separation. In Blind Estimation Using Higher-Order Statistics, Nandi, A. (ed.), Kluwer Academic Publishers, Boston, MA. Zhang, T. and Golub, G.H. (2001). Rank-one approximation of higher order tensors. SIAM J. Matrix Anal. Appl., 23, 534–550. Zhang, L., Qi, L., Zhou, G. (2014). M-tensors and some applications. SIAM J. Matrix Anal. Appl., 35(2), 437–452. Zhou, M., Liu, Y., Long, Z., Chen, L., Zhu, C. (2019). Tensor rank learning in CP decomposition via convolutional neural network. Signal Processing: Image Communication, 73, 12–21. Zniyed, Y., Boyer, R., de Almeida, A., Favier, G. (2019a). High-order CPD estimation with dimensionality reduction using a tensor train model. EUSIPCO, Rome. Zniyed, Y., Boyer, R., de Almeida, A., Favier, G. (2019b). Multidimensional harmonic retrieval based on Vandermonde tensor train. Signal Processing, 163(10), 75–86. Zniyed, Y., Boyer, R., de Almeida, A., Favier, G. (2020). Tensor train representation of MIMO channels using the JIRAFE method. Signal Processing, 171, 107479.

Index

A, B

D, E

alternating least squares (ALS) algorithm, 258–261 best rank-one approximation of a tensor, 235 big data, xiv blind source separation, xv, 38, 278 block tensor models, 271 term decomposition (BTD), 274, 278

dyadic decomposition, 17, 19 EEG/ECG, xvi, xvii eigendecomposition of a matrix, 4, 7 of a symmetric/Hermitian matrix, 10 of a symmetric square tensor, 193 eigenvalue–eigentensor pair, 231 eigenvalues of a positive definite matrix, 11 of a symmetric/Hermitian matrix, 10 of a tensor, 224–230 Einstein product, 178, 191, 223

C CDMA, xxii Cholesky decomposition, 2, 4 CONFAC models, 273 constrained tensor models, 273 coupled tensor factorization, xxv, xxvi cumulants of complex random signals, 280, 282, 315 of random variables, 291, 294, 298 properties, 296 CUR decomposition, 43

F facial recognition, xvi, xvii fibers, 133 FIR systems, 282 identification of, 316 Frobenius norm of a matrix, 23 of a tensor, 200 full-rank decomposition, 197 fundamental subspaces, 14, 20



G Gaussian distribution, 300 generalized Kronecker delta, 129 eigenpair, 230

H Hadamard product of matrices, 50–54, 212 of tensors, 212 Hankelization, 217 Hölder norm, 22 homogeneous polynomial equations, 225, 230 polynomials, 132, 203–207 HOSVD algorithm, 194, 247–249 hyperdeterminant, 225 hypermatrix, 128

I imputation, xiv, xxv index convention, 70–74, 129, 143, 155, 243, 252 inner product of tensors, 199

K Khatri-Rao product, 95 and index convention, 76 factors estimation, 120 factors permutation, 99 multiple, 97 Kronecker product and index convention, 75 block, 92 decompositions, 68 factors estimation, 122 factors permutation, 88 inverse, 67 multiple, 90

of matrices, 58, 212 of tensors, 212 of vectors, 54 pseudo-inverse, 68 rank, 64 spectrum, 64 strong, 94 structural properties, 66 Kronecker sum, 69 spectrum, 70 Kruskal’s rank (k-rank) of a matrix, 268–271

L least squares estimator, 28, 31, 209–211 low-rank matrix approximation, 25 Lyapunov equations, 117, 118, 225

M matrices commutation, 86 equivalent/similar, 11 idempotent, 182 Moore–Penrose pseudo-inverse, 192 orthogonal/unitary, 182 matricization of a PARAFAC model, 254 of a tensor, 141–148 of a Tucker model, 243 matrix decompositions, 1 exponential, 13 norms, 22, 24 polynomial, 12 products, 74, 172 mode-p (Tucker) product, 170–174, 243


mode-(p,n) product, 176, 177 modes, xii, xvii, xviii combination, 135 multidimensional harmonics, 277 multilinear forms, 131, 159, 204–207 rank, 148, 249 systems, 204

N, O nuclear norm of a matrix, 23 of a tensor, xxv orthogonal projectors, 28 outer product, 168, 169, 212, 243, 253

P PARAFAC/CANDECOMP, xix, 252 constrained, 273 normalized form, 256 variants, 256 PARALIND models, 273 partial derivatives and index convention, 108–116 polar decomposition, 4, 31 polynomial equations, 225, 227, 230 polyspectra, 311, 321, 322 preprocessing, xxiii principal component analysis (PCA), 33

Q, R QR decomposition, 3, 4 random signals, 305 higher-order statistics, 309 second-order statistics, 309 stationary/ergodic, 306 random variables circular complex, 299


Gaussian, 300 independent/uncorrelated, 289 jointly distributed, 289 scalar, 287 Rayleigh quotient, 6, 239 recommendation systems, xviii resultants, 226

S Schatten norm, 23 Schur decomposition, 3, 4 singular value decomposition (SVD), 4, 15 computation, 21 of a rectangular tensor, 194 reduced, 17, 19 singular values of a matrix, 16 of a tensor, 239 slices, 133 spectral radius of a tensor, 228 Sylvester equations, 117, 118, 208

T tensorization, 215 tensor(s) block, 137 diagonal, 140 identity, 141 models, 271 transpose, 154 completion, xiv constrained models, 273 contraction, 155, 167, 176–178 decompositions, xix diagonal, 139 eigendecomposition,193 extension, 213

  generalized inverses, 189
  Hermitian, 156–158
  hypercubic, 128, 225, 232
  idempotent, 182
  identity, 129
  models, 242
  inner product, 199
  inverse, 186
  Moore–Penrose pseudo-inverse, 192, 198
  multiplication, 166, 181
  orthogonal decomposition, 238
  orthogonal/unitary, 182
  orthogonally/unitarily similar, 233
  positive/negative definite, 232
  sets, 130
  slices, 133
  state-space models, 223
  SVD decomposition, 194
  symmetric, 156–158
  symmetrization, 161
  systems, 203–211
  transpose, 151–154
THOSVD, 249
trace
  of a tensor, 203
  of matrix products, 84, 100

TTD (tensor train decomposition), xxi, 177
Tucker decomposition (TD), xxi, 242, 249–251

U
uniqueness
  essential, xii
  of a PARAFAC model, 267
  of a Tucker model, 245

V
vectorization
  and index convention, 77
  of a PARAFAC model, 255
  of a rank-one matrix, 57
  of a tensor, 149
  of a Tucker model, 244
  of partitioned matrices, 82
Volterra systems, 206
  quadratic, 323

Other titles from ISTE in Digital Signal and Image Processing

2019
FAVIER Gérard
From Algebraic Structures to Tensors (Matrices and Tensors with Signal Processing Set – Volume 1)
MEYER Fernand
Topographical Tools for Filtering and Segmentation 1: Watersheds on Node- or Edge-weighted Graphs
Topographical Tools for Filtering and Segmentation 2: Flooding and Marker-based Segmentation on Node- or Edge-weighted Graphs

2017
CESCHI Roger, GAUTIER Jean-Luc
Fourier Analysis
CHARBIT Maurice
Digital Signal Processing with Python Programming
CHAO Li, SOULEYMANE Bella-Arabe, YANG Fan
Architecture-Aware Optimization Strategies in Real-time Image Processing

FEMMAM Smain
Fundamentals of Signals and Control Systems
Signals and Control Systems: Application for Home Health Monitoring
MAÎTRE Henri
From Photon to Pixel – 2nd edition
PROVENZI Edoardo
Computational Color Science: Variational Retinex-like Methods

2015
BLANCHET Gérard, CHARBIT Maurice
Digital Signal and Image Processing using MATLAB®
Volume 2 – Advances and Applications: The Deterministic Case – 2nd edition
Volume 3 – Advances and Applications: The Stochastic Case – 2nd edition
CLARYSSE Patrick, FRIBOULET Denis
Multi-modality Cardiac Imaging
GIOVANNELLI Jean-François, IDIER Jérôme
Regularization and Bayesian Methods for Inverse Problems in Signal and Image Processing

2014
AUGER François
Signal Processing with Free Software: Practical Experiments
BLANCHET Gérard, CHARBIT Maurice
Digital Signal and Image Processing using MATLAB®
Volume 1 – Fundamentals – 2nd edition
DUBUISSON Séverine
Tracking with Particle Filter for High-dimensional Observation and State Spaces
ELL Todd A., LE BIHAN Nicolas, SANGWINE Stephen J.
Quaternion Fourier Transforms for Signal and Image Processing

FANET Hervé
Medical Imaging Based on Magnetic Fields and Ultrasounds
MOUKADEM Ali, OULD Abdeslam Djaffar, DIETERLEN Alain
Time-Frequency Domain for Segmentation and Classification of Nonstationary Signals: The Stockwell Transform Applied on Bio-signals and Electric Signals
NDAGIJIMANA Fabien
Signal Integrity: From High Speed to Radiofrequency Applications
PINOLI Jean-Charles
Mathematical Foundations of Image Processing and Analysis
Volumes 1 and 2
TUPIN Florence, INGLADA Jordi, NICOLAS Jean-Marie
Remote Sensing Imagery
VLADEANU Calin, EL ASSAD Safwan
Nonlinear Digital Encoders for Data Communications

2013
GOVAERT Gérard, NADIF Mohamed
Co-Clustering
DAROLLES Serge, DUVAUT Patrick, JAY Emmanuelle
Multi-factor Models and Signal Processing Techniques: Application to Quantitative Finance
LUCAS Laurent, LOSCOS Céline, REMION Yannick
3D Video: From Capture to Diffusion
MOREAU Eric, ADALI Tulay
Blind Identification and Separation of Complex-valued Signals
PERRIN Vincent
MRI Techniques
WAGNER Kevin, DOROSLOVACKI Milos
Proportionate-type Normalized Least Mean Square Algorithms

2012
FERNANDEZ Christine, MACAIRE Ludovic, ROBERT-INACIO Frédérique
Digital Color Imaging
FERNANDEZ Christine, MACAIRE Ludovic, ROBERT-INACIO Frédérique
Digital Color: Acquisition, Perception, Coding and Rendering
NAIT-ALI Amine, FOURNIER Régis
Signal and Image Processing for Biometrics
OUAHABI Abdeljalil
Signal and Image Multiresolution Analysis

2011
CASTANIÉ Francis
Digital Spectral Analysis: Parametric, Non-parametric and Advanced Methods
DESCOMBES Xavier
Stochastic Geometry for Image Analysis
FANET Hervé
Photon-based Medical Imagery
MOREAU Nicolas
Tools for Signal Compression

2010
NAJMAN Laurent, TALBOT Hugues
Mathematical Morphology

2009
BERTEIN Jean-Claude, CESCHI Roger
Discrete Stochastic Processes and Optimal Filtering – 2nd edition
CHANUSSOT Jocelyn et al.
Multivariate Image Processing

DHOME Michel
Visual Perception through Video Imagery
GOVAERT Gérard
Data Analysis
GRANGEAT Pierre
Tomography
MOHAMAD-DJAFARI Ali
Inverse Problems in Vision and 3D Tomography
SIARRY Patrick
Optimization in Signal and Image Processing

2008
ABRY Patrice et al.
Scaling, Fractals and Wavelets
GARELLO René
Two-dimensional Signal Analysis
HLAWATSCH Franz et al.
Time-Frequency Analysis
IDIER Jérôme
Bayesian Approach to Inverse Problems
MAÎTRE Henri
Processing of Synthetic Aperture Radar (SAR) Images
MAÎTRE Henri
Image Processing
NAIT-ALI Amine, CAVARO-MENARD Christine
Compression of Biomedical Images and Signals
NAJIM Mohamed
Modeling, Estimation and Optimal Filtration in Signal Processing
QUINQUIS André
Digital Signal Processing Using Matlab

2007
BLOCH Isabelle
Information Fusion in Signal and Image Processing
GLAVIEUX Alain
Channel Coding in Communication Networks
OPPENHEIM Georges et al.
Wavelets and their Applications

2006
CASTANIÉ Francis
Spectral Analysis
NAJIM Mohamed
Digital Filters Design for Signal and Image Processing