Linear Algebra Tools for Data Mining [2 ed.] 9789811270338, 9789811270345, 9789811270352



English Pages [1002] Year 2023


Table of contents:
Contents
Preface
About the Author
1. Preliminaries
1.1 Introduction
1.2 Functions
1.3 Sequences
1.4 Permutations
1.5 Combinatorics
1.6 Groups, Rings, and Fields
1.7 Closure and Interior Systems
Bibliographical Comments
2. Linear Spaces
2.1 Introduction
2.2 Linear Spaces
2.3 Linear Independence
2.4 Linear Mappings
2.5 Bases in Linear Spaces
2.6 Isomorphisms of Linear Spaces
2.7 Constructing Linear Spaces
2.8 Dual Linear Spaces
2.9 Topological Linear Spaces
2.10 Isomorphism Theorems
2.11 Multilinear Functions
Exercises and Supplements
Bibliographical Comments
3. Matrices
3.1 Introduction
3.2 Matrices with Arbitrary Elements
3.3 Fields and Matrices
3.4 Invertible Matrices
3.5 Special Classes of Matrices
3.6 Partitioned Matrices and Matrix Operations
3.7 Change of Bases
3.8 Matrices and Bilinear Forms
3.9 Generalized Inverses of Matrices
3.10 Matrices and Linear Transformations
3.11 The Notion of Rank
3.12 Matrix Similarity and Congruence
3.13 Linear Systems and LU Decompositions
3.14 The Row Echelon Form of Matrices
3.15 The Kronecker and Other Matrix Products
3.16 Outer Products
3.17 Associative Algebras
Exercises and Supplements
Bibliographical Comments
4. MATLAB Environment
4.1 Introduction
4.2 The Interactive Environment of MATLAB
4.3 Number Representation and Arithmetic Computations
4.4 Matrices and Multidimensional Arrays
4.5 Cell Arrays
4.6 Solving Linear Systems
4.7 Control Structures
4.8 Indexing
4.9 Functions
4.10 Matrix Computations
4.11 Matrices and Images in MATLAB
Exercises and Supplements
Bibliographical Comments
5. Determinants
5.1 Introduction
5.2 Determinants and Multilinear Forms
5.3 Cramer’s Formula
5.4 Partitioned Matrices and Determinants
5.5 Resultants
5.6 MATLAB Computations
Exercises and Supplements
Bibliographical Comments
6. Norms and Inner Products
6.1 Introduction
6.2 Basic Inequalities
6.3 Metric Spaces
6.4 Norms
6.5 The Topology of Normed Linear Spaces
6.6 Norms for Matrices
6.7 Matrix Sequences and Matrix Series
6.8 Conjugate Norms
6.9 Inner Products
6.10 Hyperplanes in Rn
6.11 Unitary and Orthogonal Matrices
6.12 Projection on Subspaces
6.13 Positive Definite and Positive Semidefinite Matrices
6.14 The Gram–Schmidt Orthogonalization Algorithm
6.15 Change of Bases Revisited
6.16 The QR Factorization of Matrices
6.17 Matrix Groups
6.18 Condition Numbers for Matrices
6.19 Linear Space Orientation
6.20 MATLAB Computations
Exercises and Supplements
Bibliographical Comments
7. Eigenvalues
7.1 Introduction
7.2 Eigenvalues and Eigenvectors
7.3 The Characteristic Polynomial of a Matrix
7.4 Spectra of Hermitian Matrices
7.5 Spectra of Special Matrices
7.6 Geometry of Eigenvalues
7.7 Spectra of Kronecker Products and Sums
7.8 The Power Method for Eigenvalues
7.9 The QR Iterative Algorithm
7.10 MATLAB Computations
Exercises and Supplements
Bibliographical Comments
8. Similarity and Spectra
8.1 Introduction
8.2 Diagonalizable Matrices
8.3 Matrix Similarity and Spectra
8.4 The Sylvester Operator
8.5 Geometric versus Algebraic Multiplicity
8.6 λ-Matrices
8.7 The Jordan Canonical Form
8.8 Matrix Norms and Eigenvalues
8.9 Matrix Pencils and Generalized Eigenvalues
8.10 Quadratic Forms and Quadrics
8.11 Spectra of Positive Matrices
8.12 Spectra of Positive Semidefinite Matrices
8.13 MATLAB Computations
Exercises and Supplements
Bibliographical Comments
9. Singular Values
9.1 Introduction
9.2 Singular Values and Singular Vectors
9.3 Numerical Rank of Matrices
9.4 Updating SVDs
9.5 Polar Form of Matrices
9.6 CS Decomposition
9.7 Geometry of Subspaces
9.8 Spectral Resolution of a Matrix
9.9 MATLAB Computations
Exercises and Supplements
Bibliographical Comments
10. The k-Means Clustering
10.1 Introduction
10.2 The k-Means Algorithm and Convexity
10.3 Relaxation of the k-Means Problem
10.4 SVD and Clustering
10.5 Evaluation of Clusterings
10.6 MATLAB Computations
Exercises and Supplements
Bibliographical Comments
11. Data Sample Matrices
11.1 Introduction
11.2 The Sample Matrix
11.3 Biplots
Exercises and Supplements
Bibliographical Comments
12. Least Squares Approximations and Data Mining
12.1 Introduction
12.2 Linear Regression
12.3 The Least Square Approximation and QR Decomposition
12.4 Partial Least Square Regression
12.5 Locally Linear Embedding
12.6 MATLAB Computations
Exercises and Supplements
Bibliographical Comments
13. Dimensionality Reduction Techniques
13.1 Introduction
13.2 Principal Component Analysis
13.3 Linear Discriminant Analysis
13.4 Latent Semantic Indexing
13.5 Recommender Systems and SVD
13.6 Metric Multidimensional Scaling
13.7 Procrustes Analysis
13.8 Non-negative Matrix Factorization
Exercises and Supplements
Bibliographical Comments
14. Tensors and Exterior Algebras
14.1 Introduction
14.2 The Summation Convention
14.3 Tensor Products of Linear Spaces
14.4 Tensors on Inner Product Spaces
14.5 Contractions
14.6 Symmetric and Skew-Symmetric Tensors
14.7 Exterior Algebras
14.8 Linear Mappings between Spaces SKSV,k
14.9 Determinants and Exterior Algebra
Exercises and Supplements
Bibliographical Comments
15. Multidimensional Array and Tensors
15.1 Introduction
15.2 Multidimensional Arrays
15.3 Outer Products
15.4 Tensor Rank
15.5 Matricization and Vectorization
15.6 Inner Product and Norms
15.7 Evaluation of a Set of Bilinear Forms
15.8 Matrix Multiplications and Arrays
15.9 MATLAB Computations
15.10 Hyperdeterminants
15.11 Eigenvalues and Singular Values
15.12 Decomposition of Tensors
15.13 Approximation of mdas
Exercises and Supplements
Bibliographical Comments
Bibliography
Index


Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Library of Congress Cataloging-in-Publication Data Names: Simovici, Dan A., author. Title: Linear algebra tools for data mining / Dan A. Simovici, University of Massachusetts Boston, USA, Dana-Farber Cancer Institute, USA. Description: Second edition. | New Jersey : World Scientific Publishing Co. Pte. Ltd., [2023] | Includes bibliographical references and index. Identifiers: LCCN 2022062233 | ISBN 9789811270338 (hardcover) | ISBN 9789811270345 (ebook for institutions) | ISBN 9789811270352 (ebook for individuals) Subjects: LCSH: Data mining. | Parallel processing (Electronic computers) | Computer algorithms. | Linear programming. Classification: LCC QA76.9.D343 S5947 2023 | DDC 006.3/12--dc23/eng/20230210 LC record available at https://lccn.loc.gov/2022062233 British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.

Copyright © 2023 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

For any available supplementary material, please visit https://www.worldscientific.com/worldscibooks/10.1142/13248#t=suppl Desk Editors: Balasubramanian/Steven Patt Typeset by Stallion Press Email: [email protected] Printed in Singapore

To my wife, Doina, and to the memory of my brother, Dr. George Simovici


Preface

Linear algebra plays an increasingly important role in data mining and pattern recognition research, either directly or through the applications of linear algebra in graph theory and optimization. Linear algebra-based algorithms are elegant and fast, are based on a common mathematical doctrine with its collection of basic ideas and techniques, and are easy to implement; they are especially suitable for parallel and distributed computation to approach large-scale challenging problems such as searching and extracting patterns from the entire web. Thus, the application of linear algebra-based techniques in data mining and machine learning research constitutes an increasingly attractive area. Many linear algebra results are important for their applications in biology, chemistry, psychology, and sociology.

The standard undergraduate education of a computer scientist includes one or, rarely, two semesters of linear algebra, which is woefully inadequate for a researcher in data mining or pattern recognition. Even a casual review of publications in these disciplines convincingly demonstrates the use of quite sophisticated tools from linear algebra, optimization, probabilities, functional analysis, and other areas. Linear algebra and its field of applications are constantly growing, and this volume is a mere introduction to a life-long study.

A mathematical background is essential for understanding current data mining and pattern recognition research and for conducting research in these disciplines. Therefore, this book was constructed to provide this background and to present a volume of applications


that will attract the reader to the study of their mathematical basis. We do not focus on the numerical aspects of the algorithms, particularly error sensitivity, because this extremely important topic has been treated in a vast body of literature in numerical analysis and is not specific to data mining applications.

Among the data mining applications we discuss are the k-means algorithm and several of its relaxations, principal component analysis and singular value decomposition for data dimension reduction, biplots, non-negative matrix factorization for unsupervised and semisupervised learning, and latent semantic indexing.

Preparing the second edition of this volume involved correcting the existing text, considerable rewriting, and introducing new major topics: tensors, exterior algebra, and multidimensional arrays.

The intended readership consists of graduate students and researchers who work in data mining and pattern recognition. I strived to make this volume as self-contained as possible. The reader interested in applications will find in this volume most of the mathematical background that is currently needed. There are few routine exercises; most of them support the material presented in the main sections of each chapter, and there are more than 600 exercises and supplements.

Special thanks are due to the librarians of the Joseph P. Healey Library at the University of Massachusetts Boston, whose dedicated and timely help was essential in completing this project.

I gratefully acknowledge the support provided by MathWorks Inc. of Natick, Massachusetts. Their book program provided the license for various components of MATLAB, the paramount current tool for linear algebra computations, which we use in this book.

Last, but not least, I wish to thank my wife, Doina, a source of strength and loving support.

About the Author

Dan Simovici is a Professor of Computer Science and the Director of the Computer Science Graduate Program at the University of Massachusetts Boston. His main research interests are in the applications of algebraic and information-theoretical methods in data mining and machine learning. He has published over 200 research papers and several books whose subjects range from theoretical computer science to applications of mathematical methods in machine learning. Dr. Simovici served as a visiting professor at Tohoku University in Japan and the University of Lille, France, and he is currently the Editor-in-Chief of the Journal of Multiple-Valued Logic and Soft Computing.



Chapter 1

Preliminaries

1.1 Introduction

We include in this chapter a presentation of basic set-theoretical notions: functions, permutations, and algebraic structures that are intended to make this work as self-contained as possible. We assume that the reader is familiar with certain important classes of relations such as equivalences and partial orders, as presented, for example, in [53] or [152]. Also, we assume familiarity with basic notions and results concerning directed and undirected graphs.

We use the following standard notations for several numerical sets:

C    the set of complex numbers
R    the set of real numbers
R≥0    the set of non-negative real numbers
R>0    the set of positive real numbers
R≤0    the set of non-positive real numbers
R≥0 ∪ {+∞} and R>0 ∪ {+∞}    the corresponding sets extended with +∞
I    the set of irrational numbers
N    the set of natural numbers

1.2 Functions

In this section, we review several types of functions and discuss some results that are used in later chapters.


The set of functions defined on S and ranging on T is denoted by S −→ T. If f belongs to this set of functions, we write f : S −→ T.

The image of a subset P of S under a function f : S −→ T is the set f(P) = {t ∈ T | t = f(p) for some p ∈ P}. It is easy to verify that for any two subsets P, Q of S we have f(P ∪ Q) = f(P) ∪ f(Q) and f(P ∩ Q) ⊆ f(P) ∩ f(Q).

The preimage of a subset U of T under a function f : S −→ T is the set f^{−1}(U) = {s ∈ S | f(s) ∈ U}. Again, it is easy to verify that for any two subsets U, V of T, we have f^{−1}(U ∪ V) = f^{−1}(U) ∪ f^{−1}(V) and f^{−1}(U ∩ V) = f^{−1}(U) ∩ f^{−1}(V).

Definition 1.1. Let S, T be two sets. A function f : S −→ T is:
(i) an injection if f(s) = f(s′) implies s = s′;
(ii) a surjection if for every t ∈ T there exists s ∈ S such that f(s) = t;
(iii) a bijection, if it is both an injection and a surjection.

The set S −→ S contains the identity function 1S defined by 1S(s) = s for s ∈ S.

Definition 1.2. Let S, T, U be three sets and let f : S −→ T and g : T −→ U be two functions. The composition or the product of f and g is the function gf : S −→ U defined by (gf)(s) = g(f(s)) for s ∈ S.

Note that if f : S −→ T, then f 1S = f and 1T f = f. It is easy to verify that for four sets S, T, U, V and f : S −→ T, g : T −→ U, h : U −→ V, we have h(gf) = (hg)f (associativity of function composition).

Theorem 1.1. A function f : S −→ T is a surjection if and only if for any two functions g1 : T −→ U and g2 : T −→ U the equality g1 f = g2 f implies g1 = g2.


Proof. Let f : S −→ T be a surjection such that g1 f = g2 f, and let t ∈ T. There exists s ∈ S such that t = f(s). Then g1(t) = g1(f(s)) = (g1 f)(s) = (g2 f)(s) = g2(f(s)) = g2(t) for every t ∈ T, which implies g1 = g2.

Conversely, suppose that g1 f = g2 f implies g1 = g2 for any g1, g2, and assume that f is not a surjection. Then there exists t ∈ T such that t ≠ f(s) for every s ∈ S. Since t ∉ f(S), it is possible to define g1, g2 : T −→ U such that g1(t) ≠ g2(t), yet g1(f(s)) = g2(f(s)) for every s ∈ S, that is, g1 f = g2 f. This contradicts the initial supposition, so f must be a surjection. □
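A function between finite sets {1, . . . , n} and {1, . . . , m} can be stored as a vector of images, so the properties in Definition 1.1 are easy to test computationally. The following sketch uses MATLAB, the environment discussed in Chapter 4; the vector f and the names used here are only illustrative.

    % f : {1,...,n} -> {1,...,m} stored as a vector of images, f(i) in {1,...,m}.
    f = [2 3 1 3];                                   % here n = 4
    m = 3;                                           % size of the codomain
    is_injection  = numel(unique(f)) == numel(f);    % distinct arguments have distinct images
    is_surjection = numel(unique(f)) == m;           % every element of the codomain is an image
    is_bijection  = is_injection && is_surjection;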

A similar result holds for injections:

Theorem 1.2. A function f : S −→ T is an injection if and only if for any two functions k1 : X −→ S and k2 : X −→ S the equality f k1 = f k2 implies k1 = k2.

Proof. The argument is left to the reader. □

1.3 Sequences

Definition 1.3. Let S be a set. An S-sequence of length n is a function s : {0, . . . , n − 1} −→ S. The number n is the length of the sequence s and is denoted as |s|.

The set of S-sequences of length n is denoted as Seqn(S); the set of all sequences on S is Seq(S) = ⋃_{n ≥ 0} Seqn(S).

An S-sequence s of length n is denoted by (s(0), . . . , s(n − 1)). The element s(j) is the jth symbol of s.

Example 1.1. If S = {↑, ↓, −}, then (↑, ↑, −, ↓, −) is an S-sequence of length 5.


For a finite set S, with |S| = m, there exist m^n S-sequences of length n. In particular, taking n = 0 it follows that there exists a unique S-sequence of length 0 denoted as λ = () and referred to as the null sequence.

If u ∈ Seqp(S), v ∈ Seqq(S), u = (u0, . . . , up−1), and v = (v0, . . . , vq−1), the concatenation of u and v is the sequence uv given by uv = (u0, . . . , up−1, v0, . . . , vq−1). This means that |uv| = |u| + |v|. It is easy to verify that sequence concatenation is associative. This means that if u, v, w ∈ Seq(S), we have (uv)w = u(vw), as the reader can easily verify. The null sequence plays the role of the unit element, that is, uλ = λu = u. If S contains more than one element, then sequence concatenation is not commutative, as the next example shows.

Example 1.2. Let S = {0, 1}, u = (0, 1, 1), and v = (1, 0, 0, 1). We have uv = (0, 1, 1, 1, 0, 0, 1) and vu = (1, 0, 0, 1, 0, 1, 1). It is clear that uv ≠ vu.

If s = (j1, . . . , jn) ∈ Seq(N), we refer to each pair (jp, jq) such that p < q and jp > jq as an inversion. The set of inversions of s is denoted by INV(s) and |INV(s)| is denoted by inv(s).

Definition 1.4. For S ⊆ N, a sequence s ∈ Seq(N) is strict if s = (a0, . . . , an−1) and a0 < a1 < · · · < an−1, where n ≥ 1.

In other words, a sequence s ∈ Seq(N) is strict if it consists of distinct elements and inv(s) = 0. The sequence obtained from the sequence s ∈ Seq(N) by sorting its elements in increasing order is denoted as (s).

Example 1.3. If s = (4, 1, 8, 2, 1, 6), then (s) = (1, 1, 2, 4, 6, 8).
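The length identity for concatenation and the inversion count inv(s) can be checked directly. The following MATLAB sketch is only illustrative; it uses the sequences of Examples 1.2 and 1.3 and counts inversions with two nested loops.

    % Concatenation and the identity |uv| = |u| + |v|.
    u = [0 1 1]; v = [1 0 0 1];
    uv = [u v];
    assert(numel(uv) == numel(u) + numel(v));
    % Counting the inversions of a sequence s: pairs p < q with s(p) > s(q).
    s = [4 1 8 2 1 6];
    invs = 0;
    for p = 1:numel(s)-1
        for q = p+1:numel(s)
            invs = invs + (s(p) > s(q));
        end
    end
    sorted_s = sort(s);      % (1, 1, 2, 4, 6, 8), the sorted version of s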


If s and t are two strict sequences in Seq(N), their concatenation is not strict, in general, even if they have no elements in common.

Example 1.4. Let s = (1, 5, 8) and t = (2, 7, 9, 11). The sequence st = (1, 5, 8, 2, 7, 9, 11) is clearly not strict and we have inv(st) = 3. The sequence (st) is (1, 2, 5, 7, 8, 9, 11).

If s and t are two strict sequences, the number of inversions in the sequence st is denoted by inv(s, t).

Theorem 1.3. Let s, t, u ∈ Seq(N). We have

inv((st), (u)) = inv(s, u) + inv(t, u),
inv((s), (tu)) = inv(s, t) + inv(s, u).

Proof. Since (st) and (u) are sorted sequences, there are no inversions in (st) or in (u). Therefore, inversions in (st)(u) may occur only because an inversion occurs between a component of s and a component of u, or between a component of t and a component of u. This justifies the first equality. The second equality has a similar argument. □

1.4 Permutations

The notion of permutation that we discuss in this section is essential for the study of determinants and exterior algebras. Definition 1.5. A permutation of a set S is a bijection φ : S −→ S. A permutation φ of a finite set S = {s1 , . . . , sn } is completely described by the sequence (φ(s1 ), . . . , φ(sn )). No two distinct components of such a sequence may be equal because of the injectivity of φ, and all elements of the set S appear in this sequence because φ is surjective. Therefore, the number of permutations equals the number of such sequences, so there are n(n − 1) · · · 2 · 1 permutations of a finite set S with |S| = n. The number n(n − 1) · · · 2 · 1 is usually denoted by n!. This notation is extended by defining 0! = 1, which is consistent with the interpretation of n! as the number of bijections of a set that has n elements.
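As a quick numerical confirmation of the count n!, here is a small MATLAB sketch; perms and factorial are standard functions, and the value n = 4 is chosen only for illustration.

    n = 4;
    P = perms(1:n);                       % all permutations of {1,...,n}, one per row
    assert(size(P, 1) == factorial(n));   % there are n! = 24 of them for n = 4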


The set of permutations of the set Sn = {1, . . . , n} is denoted by PERMn. If φ ∈ PERMn is such a permutation, we write

φ:
    1    · · ·   i    · · ·   n
    a1   · · ·   ai   · · ·   an

where ai = φ(i) for 1 ≤ i ≤ n. To simplify the notation, we shall specify φ just by the sequence of distinct numbers s = (a1, . . . , ai, . . . , an). The permutation ιn ∈ PERMn is defined as ιn(j) = j for 1 ≤ j ≤ n and is known as the identity permutation.

Definition 1.6. Let φ, ψ ∈ PERMn be two permutations. Their composition or product is the permutation ψφ defined by (ψφ)(j) = ψ(φ(j)) for 1 ≤ j ≤ n.

It is clear that the composition of two permutations of Sn is a permutation of Sn.

Example 1.5. Let ψ, φ ∈ PERM4 be the permutations

φ:
    1 2 3 4
    3 1 4 2

and

ψ:
    1 2 3 4
    4 2 1 3

The permutations ψφ and φψ are given by

ψφ:
    1 2 3 4
    1 4 3 2

and

φψ:
    1 2 3 4
    2 1 3 4

Thus, the composition of permutations is not commutative.

Note that every permutation φ ∈ PERMn has an inverse φ^{−1} because φ is a bijection.

Example 1.6. Let φ ∈ PERM4 be the permutation

φ:
    1 2 3 4
    3 1 4 2

Its inverse is

φ^{−1}:
    1 2 3 4
    2 4 1 3

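Representing a permutation φ of {1, . . . , n} as the row vector (φ(1), . . . , φ(n)), composition and inversion reduce to indexing. The following MATLAB sketch (variable names are ad hoc) reproduces Examples 1.5 and 1.6.

    phi = [3 1 4 2];                   % phi of Example 1.5
    psi = [4 2 1 3];                   % psi of Example 1.5
    psiphi = psi(phi);                 % (psi phi)(j) = psi(phi(j)), gives [1 4 3 2]
    phipsi = phi(psi);                 % (phi psi)(j) = phi(psi(j)), gives [2 1 3 4]
    phiinv(phi) = 1:4;                 % the inverse of phi, gives [2 4 1 3]
    assert(isequal(phiinv(phi), 1:4))  % phiinv composed with phi is the identity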

Theorem 1.4. Let PERMn = {φ1, . . . , φn!}. If ψ ∈ PERMn, then {ψφ1, . . . , ψφn!} = PERMn.

Proof. Note that ψφp = ψφq implies φp = φq for 1 ≤ p, q ≤ n!. The statement follows immediately. □

Definition 1.7. Let S be a finite set, φ be a permutation of S, and x ∈ S. The cycle of x is the set of elements of the form Cφ,x = {φ^i(x) | i ∈ N}. The number |Cφ,x| is the length of the cycle. Cycles of length 1 are said to be trivial.

Theorem 1.5. The cycles of a permutation φ of a finite set S form a partition πφ of S.

Proof. Let S be a finite set. Since Cφ,x ⊆ S, it is clear that Cφ,x is a finite set. If |Cφ,x| = ℓ, then Cφ,x = {x, φ(x), . . . , φ^{ℓ−1}(x)}. Note that the elements φ^i(x) and φ^j(x) are distinct for 0 ≤ i, j ≤ ℓ − 1 and i ≠ j, because otherwise we would have |Cφ,x| < ℓ. Moreover, φ^ℓ(x) = x.

If z ∈ Cφ,x, then z = φ^k(x) for some k, 0 ≤ k ≤ ℓ − 1, where ℓ = |Cφ,x|. Since x = φ^ℓ(x), it follows that x = φ^{ℓ−k}(z), which shows that x ∈ Cφ,z. Thus, Cφ,x = Cφ,z. □

Definition 1.8. A k-cyclic permutation or a cycle of a finite set S is a permutation φ such that πφ consists of a cycle of length k and a number of |S| − k cycles of length 1. A transposition of S is a 2-cyclic permutation.

Note that if φ is a transposition of S, then φ^2 = 1S.

Theorem 1.6. Let S be a finite set, φ be a permutation, and πφ = {Cφ,x1, . . . , Cφ,xm} be the cycle partition associated with φ. Define the cyclic permutations ψ1, . . . , ψm of S as

ψp(t) = φ(t) if t ∈ Cφ,xp, and ψp(t) = t otherwise.

Then, ψp ψq = ψq ψp for every p, q such that 1 ≤ p, q ≤ m.


Proof. Observe first that u ∈ Cφ,x if and only if φ(u) ∈ Cφ,x for any cycle Cφ,x. We can assume that p ≠ q. Then the cycles Cφ,xp and Cφ,xq are disjoint.

If u ∉ Cφ,xp ∪ Cφ,xq, then ψp(ψq(u)) = ψp(u) = u and ψq(ψp(u)) = ψq(u) = u.

Suppose now that u ∈ Cφ,xp − Cφ,xq. We have ψp(ψq(u)) = ψp(u) = φ(u). On the other hand, ψq(ψp(u)) = ψq(φ(u)) = φ(u) because φ(u) ∉ Cφ,xq. Thus, ψp(ψq(u)) = ψq(ψp(u)). The case where u ∈ Cφ,xq − Cφ,xp is treated similarly. Also, note that Cφ,xp ∩ Cφ,xq = ∅, so, in all cases, we have ψp(ψq(u)) = ψq(ψp(u)). □

The set of cycles {ψ1, . . . , ψm} is the cyclic decomposition of the permutation φ.

Definition 1.9. An adjacent transposition is a transposition that changes the places of two adjacent elements.

Example 1.7. The permutation φ ∈ PERM5 given by

φ:
    1 2 3 4 5
    1 3 2 4 5

is an adjacent transposition of the set {1, 2, 3, 4, 5} because it changes the position of the elements 2 and 3. On the other hand, the permutation

ψ:
    1 2 3 4 5
    1 5 3 4 2

is a transposition but not an adjacent transposition of the same set because the pair of elements involved are not consecutive.

Theorem 1.7. Any transposition of a finite set is a product of adjacent transpositions.

Proof. Let φk,ℓ be a transposition that swaps k and ℓ, where k < ℓ. We can move k to ℓ one step at a time and then move ℓ to where k was:

φk,ℓ = φk,k+1 φk+1,k+2 · · · φℓ−1,ℓ φℓ−2,ℓ−1 · · · φk,k+1. □

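The cycle partition of Definition 1.7 can be computed by following each element until its orbit closes. The sketch below is illustrative MATLAB; the permutation phi is an arbitrary example, and the result is stored as a cell array of cycles.

    phi = [3 1 4 2 5];                % phi(1)=3, phi(2)=1, phi(3)=4, phi(4)=2, phi(5)=5
    n = numel(phi);
    visited = false(1, n);
    cycles = {};
    for x = 1:n
        if ~visited(x)
            c = x; visited(x) = true;
            y = phi(x);
            while y ~= x
                c(end+1) = y;         % extend the current cycle
                visited(y) = true;
                y = phi(y);
            end
            cycles{end+1} = c;        % store the cycle of x
        end
    end
    % cycles is {[1 3 4 2], [5]}: a cycle of length 4 and a trivial cycle.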

For a permutation φ ∈ PERMn specified by the sequence s = (j1, . . . , jn), the set INV(s) is denoted by INV(φ) and |INV(φ)| is denoted by inv(φ).

Definition 1.10. The sign of a permutation is the function sign : PERMn −→ {−1, 1} defined as sign(φ) = (−1)^{inv(φ)}.

If sign(φ) = 1, we say that φ is an even permutation; otherwise, that is, if sign(φ) = −1, we refer to φ as an odd permutation.

Theorem 1.8. If φ, ψ ∈ PERMn, then sign(φψ) = sign(φ)sign(ψ).

Proof. The theorem can be proven by showing that inv(φ) + inv(ψ) has the same parity as inv(φψ). Suppose that

φ =
    1      2      · · ·   n
    φ(1)   φ(2)   · · ·   φ(n)

and

ψφ =
    φ(1)       φ(2)       · · ·   φ(n)
    ψ(φ(1))    ψ(φ(2))    · · ·   ψ(φ(n))

and consider the following cases:
(i) if i < j, φ(i) < φ(j), and ψ(φ(i)) < ψ(φ(j)), then no inversions are generated in φ, ψ, and ψφ;
(ii) if i < j, φ(i) < φ(j), and ψ(φ(i)) > ψ(φ(j)), then φ has no inversion, ψ has an inversion, and so does ψφ;
(iii) if i < j, φ(i) > φ(j), and ψ(φ(i)) > ψ(φ(j)), then φ has an inversion, ψ has no inversion, and ψφ has an inversion;
(iv) if i < j, φ(i) > φ(j), and ψ(φ(i)) < ψ(φ(j)), then φ has an inversion, ψ has an inversion, and ψφ has no inversions.
In each of these cases, inv(φ) + inv(ψ) differs from inv(ψφ) by an even number. □

A descent of a sequence with distinct numbers s = (j1, . . . , jn) is a number k such that 1 ≤ k ≤ n − 1 and jk > jk+1. The set of descents of s is denoted by D(s) and the set of descents of a permutation φ specified by s is denoted by D(φ).
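Both the inversion count and the sign of a permutation are easy to compute from the row-vector representation. The MATLAB sketch below (the helper functions are ad hoc) also spot-checks the multiplicativity of the sign stated in Theorem 1.8.

    % inv_count(p) counts the pairs k < l with p(k) > p(l); sign_of(p) = (-1)^inv(p).
    inv_count = @(p) sum(arrayfun(@(k) sum(p(k) > p(k+1:end)), 1:numel(p)-1));
    sign_of   = @(p) (-1)^inv_count(p);
    phi = [3 1 4 2];                          % the permutation phi of Example 1.5
    s = sign_of(phi);                         % inv(phi) = 3, so sign(phi) = -1
    psi = randperm(4);                        % an arbitrary second permutation
    assert(sign_of(psi(phi)) == sign_of(phi) * sign_of(psi));   % Theorem 1.8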


Example 1.8. Let φ ∈ PERM6 be

φ:
    1 2 3 4 5 6
    4 2 5 1 6 3

We have INV(φ) = {(4, 2), (4, 1), (4, 3), (2, 1), (5, 1), (5, 3), (6, 3)}, and inv(φ) = 7. Furthermore, D(φ) = {1, 3, 5}.

It is easy to see that the following conditions are equivalent for a permutation φ of the finite set S:
(i) φ = 1S;
(ii) inv(φ) = 0;
(iii) D(φ) = ∅.

Theorem 1.9. Every permutation φ ∈ PERMn is a composition of transpositions.

Proof. If D(φ) = ∅, then φ = 1{1,...,n} and the statement is vacuous. Suppose therefore that D(φ) ≠ ∅, and let k ∈ D(φ), which means that (jk, jk+1) is an inversion of φ. Let ψ be the adjacent transposition that exchanges jk and jk+1. It is clear that inv(ψφ) = inv(φ) − 1. Thus, if ψ1, . . . , ψp are the transpositions that correspond to all adjacent inversions of φ, where p = inv(φ), it follows that ψp · · · ψ1 φ has 0 inversions and, as observed above, ψp · · · ψ1 φ = 1S. Since ψ^2 = 1S for every transposition ψ, we have φ = ψ1 · · · ψp, which gives the desired conclusion. □

Corollary 1.1. If a permutation φ ∈ PERMn can be factored as a product of p transpositions, then φ^{−1} can be factored as the same number of transpositions.

Proof. This follows immediately from Theorem 1.9. □

Theorem 1.10. If φ is a permutation of the finite set S, then inv(φ) is the least number of adjacent transpositions that constitute a factorization of φ, and the number of adjacent transpositions involved in any other factorization of φ as a product of adjacent transpositions differs from inv(φ) by an even number.


Proof. Let φ = ψq · · · ψ1 be a factorization of φ as a product of adjacent transpositions. Then ψ1 · · · ψq φ = 1S, and we can define the sequence of permutations φℓ = ψℓ · · · ψ1 φ for 1 ≤ ℓ ≤ q. Since each ψi is an adjacent transposition, we have inv(φℓ+1) − inv(φℓ) = 1 or inv(φℓ+1) − inv(φℓ) = −1. If

|{ℓ | 1 ≤ ℓ ≤ q − 1 and inv(φℓ+1) − inv(φℓ) = 1}| = r,

then

|{ℓ | 1 ≤ ℓ ≤ q − 1 and inv(φℓ+1) − inv(φℓ) = −1}| = q − r,

so inv(φ) + r − (q − r) = 0, which means that q = inv(φ) + 2r. This implies the desired conclusion. □

Theorem 1.11. A transposition is an odd permutation.

Proof. Suppose that φ ∈ PERMn is the transposition

φ:
    1 · · · i · · · j · · · n
    1 · · · j · · · i · · · n

so j > i. If p and q form an inversion in φ, it follows that we must have i ≤ p < q ≤ j. For i < p < q < j, the pair (p, q) contributes no inversion. If i < q ≤ j, the pair (i, q) contributes an inversion because φ(i) = j > φ(q); there are j − i such pairs. If i < p < j, the pair (p, j) contributes an inversion because φ(p) = p > i = φ(j); there are (j − 1) − i such pairs. Thus, the total number of inversions of φ is (j − i) + (j − 1) − i = 2(j − i) − 1 and this number is odd. □

Next, we introduce the Levi-Civita symbols that are useful in the study of permutations.

Definition 1.11. Let (i1 · · · in) be a permutation of the set {1, 2, . . . , n}. The Levi-Civita symbols¹ ε_{i1 i2 ··· in} and ε^{i1 i2 ··· in} are defined as follows:

¹ Tullio Levi-Civita (March 29, 1873–December 29, 1941) was an Italian mathematician, well-known for his work on tensor calculus and its applications to the theory of relativity. His work included foundational papers in both pure and applied mathematics, celestial mechanics, analytic mechanics, and hydrodynamics. He was born in Padua and graduated in 1892 from the University of Padua Faculty of Mathematics, where he became a professor in 1898. In 1918, he was appointed at the University of Rome.


(i) ε_{1 2 ··· n} = ε^{1 2 ··· n} = 1;
(ii) ε_{··· ip ··· iq ···} = −ε_{··· iq ··· ip ···} and ε^{··· ip ··· iq ···} = −ε^{··· iq ··· ip ···} (the antisymmetry property);
(iii) when all indices i1, . . . , in are distinct, we have ε_{i1 ··· in} = ε^{i1 ··· in} = (−1)^p, where p = inv(i1 · · · in).

When two indices ip and iq are equal, the antisymmetry of ε_{i1 i2 ··· in} and ε^{i1 i2 ··· in} implies ε_{i1 i2 ··· in} = ε^{i1 i2 ··· in} = 0.

Example 1.9. For Levi-Civita symbols ε_{i1 i2 i3} with n = 3, we have 27 components. Since an equality of any of the indices implies that the number ε_{i1 i2 i3} is 0, it follows that only six of these components are non-zero. The non-zero values are

ε_{123} = ε_{231} = ε_{312} = 1,
ε_{132} = ε_{213} = ε_{321} = −1.
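For n = 3 the Levi-Civita symbol can be stored as a 3 × 3 × 3 array. The following MATLAB sketch (the array name is arbitrary) fills in the six non-zero components of Example 1.9; all other entries remain 0.

    epsLC = zeros(3, 3, 3);
    epsLC(1,2,3) = 1;  epsLC(2,3,1) = 1;  epsLC(3,1,2) = 1;    % even permutations of (1,2,3)
    epsLC(1,3,2) = -1; epsLC(2,1,3) = -1; epsLC(3,2,1) = -1;   % odd permutations of (1,2,3)
    % Entries with a repeated index stay equal to 0.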

1.5 Combinatorics

Let S be a finite nonempty set, S = {s1, . . . , sn}. We seek to count the sequences of S having length k without repetitions. Suppose initially that k ≥ 1. For the first place in a sequence s of length k, we have n choices. Once an element of S has been chosen for the first place, we have n − 1 choices for the second place because the sequence may not contain repetitions, etc. For the kth component of s, there are n − k + 1 choices. Thus, the number of sequences of length k without repetitions is given by n(n − 1) · · · (n − k + 1). We shall denote this number by A(n, k).

There exists only one sequence of length 0, namely the empty sequence, so we extend the definition of A by A(n, 0) = 1 for every n ∈ N.

Let S be a finite set with |S| = n. A k-combination of S is a subset M of S such that |M| = k. Define the equivalence relation ∼ on Seq(S) by s ∼ t if there exists a bijection f such that s = tf.


If T is a subset of S such that |T| = k, there exists a bijection t : {0, . . . , k − 1} −→ T; clearly, this is a sequence without repetitions and there exist A(n, k) such sequences. Note that if u is an equivalent sequence (that is, if t ∼ u), then the range of this sequence is again the set T and there are k! such sequences (due to the existence of the k! permutations f) that correspond to the same set T. Therefore, we may conclude that Pk(S) contains A(n, k)/k! elements. We denote this number by \binom{n}{k} and we refer to it as the (n, k)-binomial coefficient. We can write \binom{n}{k} using factorials as follows:

\binom{n}{k} = \frac{A(n, k)}{k!} = \frac{n(n − 1) · · · (n − k + 1)}{k!} = \frac{n(n − 1) · · · (n − k + 1)(n − k) · · · 2 · 1}{k!(n − k)!} = \frac{n!}{k!(n − k)!}.

We mention the following useful identities:

k \binom{n}{k} = n \binom{n − 1}{k − 1},   (1.1)

\binom{n}{m} = \frac{n}{m} \binom{n − 1}{m − 1}.   (1.2)

Equality (1.1) can be extended as

k(k − 1) · · · (k − ℓ) \binom{n}{k} = n(n − 1) · · · (n − ℓ) \binom{n − ℓ − 1}{k − ℓ − 1}   (1.3)

for 0 ≤ ℓ ≤ k − 1.

The set of polynomials with complex (real) coefficients in the indeterminate x is denoted by C[x] (respectively, R[x]). Consider now the n-degree polynomial p ∈ R[x]:

p(x) = (x + a0) · · · (x + an−2)(x + an−1).

The coefficient of x^{n−k} consists of the sum of all monomials of the form a_{i0} · · · a_{ik−1}, where the subscripts i0, . . . , ik−1 are distinct. Thus, the coefficient of x^{n−k} contains \binom{n}{k} terms corresponding to the k-element subsets of the set {0, . . . , n − 1}.


Consequently, the coefficient of x^{n−k} in the power (x + a)^n can be obtained from the similar coefficient in p(x) by taking a0 = · · · = an−1 = a; thus, the coefficient is \binom{n}{k} a^k. This allows us to write

(x + a)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n−k} a^k.   (1.4)

This equality is known as Newton's binomial formula and has numerous applications.

Example 1.10. If we take x = a = 1 in Formula (1.4), we obtain the identity

2^n = \sum_{k=0}^{n} \binom{n}{k}.   (1.5)

Note that this equality can be obtained directly by observing that the right member enumerates the subsets of a set having n elements by their cardinality k.

A similar interesting equality can be obtained by taking x = 1 and a = −1 in Formula (1.4). This yields

0 = \sum_{k=0}^{n} (−1)^k \binom{n}{k} = \left( \binom{n}{0} + \binom{n}{2} + \binom{n}{4} + · · · \right) − \left( \binom{n}{1} + \binom{n}{3} + \binom{n}{5} + · · · \right).

This equality shows that each set contains an equal number of subsets having an even or odd number of elements.

Example 1.11. Consider the equality (x + a)^n = (x + a)^{n−1}(x + a). The coefficient of x^{n−k} a^k in the left member is \binom{n}{k}. In the right member, x^{n−k} a^k has the coefficient \binom{n − 1}{k} + \binom{n − 1}{k − 1}, so we obtain the equality

\binom{n}{k} = \binom{n − 1}{k} + \binom{n − 1}{k − 1},   (1.6)

for 0 ≤ k ≤ n − 1.
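The binomial identities above can be verified numerically with nchoosek; the following MATLAB spot check is only illustrative, with arbitrary choices of n and k.

    n = 9; k = 4;
    b = arrayfun(@(j) nchoosek(n, j), 0:n);               % all binomial coefficients for this n
    assert(sum(b) == 2^n);                                % identity (1.5)
    assert(sum((-1).^(0:n) .* b) == 0);                   % the alternating sum above
    assert(nchoosek(n, k) == nchoosek(n-1, k) + nchoosek(n-1, k-1));   % identity (1.6)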


Multinomial coefficients are generalizations of binomial coefficients that can be introduced as follows. The nth power of the sum x1 + · · · + xk can be written as

(x1 + · · · + xk)^n = \sum_{(r1, . . . , rk)} c(n, r1, . . . , rk) x1^{r1} · · · xk^{rk},

where the sum involves all (r1, . . . , rk) ∈ N^k such that \sum_{i=1}^{k} ri = n. By analogy with the binomial coefficients, we denote c(n, r1, . . . , rk) by \binom{n}{r1, . . . , rk}. As we did with binomial coefficients in Example 1.11, starting from the equality (x1 + · · · + xk)^n = (x1 + · · · + xk)^{n−1}(x1 + · · · + xk), the coefficient of the monomial x1^{r1} · · · xk^{rk} in the left member is \binom{n}{r1, . . . , rk}. In the right member, the same coefficient is \sum_{i=1}^{k} \binom{n − 1}{r1, . . . , ri − 1, . . . , rk}, so we obtain the identity

\binom{n}{r1, . . . , rk} = \sum_{i=1}^{k} \binom{n − 1}{r1, . . . , ri − 1, . . . , rk},   (1.7)

a generalization of the identity (1.6).
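A multinomial coefficient equals n!/(r1! · · · rk!), so identity (1.7) can be spot-checked as follows. This is a MATLAB sketch; the helper multinom and the values of n and r are illustrative, and all ri are assumed to be at least 1.

    multinom = @(n, r) factorial(n) / prod(factorial(r));   % n!/(r1!...rk!)
    n = 7; r = [3 2 2];                    % r1 + r2 + r3 = n
    rhs = 0;
    for i = 1:numel(r)
        ri = r; ri(i) = ri(i) - 1;         % decrease the i-th entry by one
        rhs = rhs + multinom(n-1, ri);
    end
    assert(multinom(n, r) == rhs);         % identity (1.7)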

1.6 Groups, Rings, and Fields

The notion of operation on a set is needed for introducing various algebraic structures on sets.

Definition 1.12. Let n ∈ N. An n-ary operation on a set S is a function f : S^n −→ S. The number n is the arity of the operation f.

If n = 0, we have the special case of zero-ary operations. A zero-ary operation is a function f : S^0 = {∅} −→ S, which is essentially a constant element of S, f().

Operations of arity 1 are referred to as unary operations. Binary operations (of arity 2) are frequently used. For example, the union, intersection, and difference of subsets of a set S are binary operations on the set P(S).


If f is a binary operation on a set, we denote the result f(x, y) of the application of f to x, y by xf y rather than f(x, y). We now introduce certain important types of binary operations.

Definition 1.13. A binary operation f on a set S is
(i) associative if (xf y)f z = xf(yf z) for every x, y, z ∈ S,
(ii) commutative if xf y = yf x for every x, y ∈ S, and
(iii) idempotent if xf x = x for every x ∈ S.

Example 1.12. Set union and intersection are both associative, commutative, and idempotent operations on every set of the form P(S). The addition of real numbers "+" is an associative and commutative operation on R; however, "+" is not idempotent.

The binary operation g : R^2 −→ R given by g(x, y) = (x + y)/2 for x, y ∈ R is a commutative and idempotent operation on R that is not associative. Indeed, we have (xgy)gz = (x + y + 2z)/4 and xg(ygz) = (2x + y + z)/4.

Example 1.13. The binary operations max{x, y} and min{x, y} are associative, commutative, and idempotent operations on the set R.

Next, we introduce special elements relative to a binary operation on a set.

Definition 1.14. Let f be a binary operation on a set S.
(i) An element u is a unit for f if xf u = uf x = x for every x ∈ S.
(ii) An element z is a zero for f if zf x = xf z = z for every x ∈ S.

Note that if an operation f has a unit, then this unit is unique. Indeed, suppose that u and u′ were two units of the operation f. According to Definition 1.14, we would have uf x = xf u = x and, in particular, uf u′ = u′f u = u′. Applying the same definition to u′ yields u′f x = xf u′ = x and, in particular, u′f u = uf u′ = u. Thus, u = u′.

Similarly, if an operation f has a zero, then this zero is unique. Suppose that z and z′ were two zeros for f. Since z is a zero, we have zf x = xf z = z for every x ∈ S; in particular, for x = z′, we have zf z′ = z′f z = z. Since z′ is a zero, we also have z′f x = xf z′ = z′ for every x ∈ S; in particular, for x = z, we have z′f z = zf z′ = z′, and this implies z = z′.


Definition 1.15. Let f be a binary associative operation on S such that f has the unit u. An element x has an inverse relative to f if there exists y ∈ S such that xf y = yf x = u.

An element x of S has at most one inverse relative to f. Indeed, suppose that both y and y′ are inverses of x. Then we have y = yf u = yf(xf y′) = (yf x)f y′ = uf y′ = y′, which shows that y coincides with y′.

If the operation f is denoted by "+", then we refer to the inverse of x as the additive inverse of x, or the opposite element of x; similarly, when f is denoted by "·", we refer to the inverse of x as the multiplicative inverse of x. The additive inverse of x is usually denoted by −x, while the multiplicative inverse of x is denoted by x^{−1}.

Definition 1.16. An element x of a set S equipped with a binary operation "∗" is idempotent if x ∗ x = x.

Observe that every unit and every zero of a binary operation is an idempotent element.

Definition 1.17. Let F = {fi | i ∈ I} be a set of operations on a set A indexed by a set I. An algebra type is a mapping θ : I −→ N. An algebra of type θ is a pair A = (A, F) such that
(i) A is a set, and
(ii) the operation fi has arity θ(i) for every i ∈ I.

The algebra A = (A, F) is finite if the set A is finite. The set A is referred to as the carrier of the algebra A. If the indexing set I is finite, we say that the type θ is a finite type and refer to A as an algebra of finite type. If θ : I −→ N is a finite algebra type, we assume, in general, that the indexing set I has the form {0, 1, . . . , n − 1}. In this case, we denote θ by the sequence (θ(0), θ(1), . . . , θ(n − 1)).

Definition 1.18. A groupoid is an algebra of type (2), A = (A, {f}). If f is an associative operation, then we refer to this algebra as a semigroup.

In other words, a groupoid is a set equipped with a binary operation f.


Example 1.14. The algebra (R, {f}) where f(x, y) = (x + y)/2 is a groupoid. However, it is not a semigroup because f is not an associative operation.

Example 1.15. Define the binary operation g on R by xgy = ln(e^x + e^y) for x, y ∈ R. Since

(xgy)gz = ln(e^{xgy} + e^z) = ln(e^x + e^y + e^z),
xg(ygz) = ln(e^x + e^{ygz}) = ln(e^x + e^y + e^z),

for every x, y, z ∈ R, it follows that g is an associative operation. Thus, (R, g) is a semigroup. It is easy to verify that this semigroup has no unit element.

Definition 1.19. A monoid is an algebra of type (0, 2), A = (A, {e, f}), where e is a zero-ary operation, f is a binary operation, and e is the unit element for f.

Example 1.16. Let gcd(m, n) be the greatest common divisor of the numbers m, n ∈ N. The algebras (N, {1, ·}) and (N, {0, gcd}) are monoids. In the first case, the binary operation is the multiplication of natural numbers, the unit element is 1, and the algebra is clearly a monoid. In the second case, the unit element is 0.

We claim that gcd is an associative operation. Let m, n, p ∈ N. We need to verify that gcd(m, gcd(n, p)) = gcd(gcd(m, n), p). Let k = gcd(m, gcd(n, p)). Then (k, m) ∈ δ and (k, gcd(n, p)) ∈ δ, where δ is the divisibility relation. Since gcd(n, p) divides evenly both n and p, it follows that (k, n) ∈ δ and (k, p) ∈ δ. Thus, k divides gcd(m, n), and therefore k divides h = gcd(gcd(m, n), p). Conversely, h being gcd(gcd(m, n), p), it divides both gcd(m, n) and p. Since h divides gcd(m, n), it follows that it divides both m and n. Consequently, h divides gcd(n, p) and therefore divides k = gcd(m, gcd(n, p)). Since k and h are both natural numbers that divide each other evenly, it follows that k = h, which allows us to conclude that gcd is an associative operation.

Since n divides 0 evenly, for any n ∈ N, it follows that gcd(0, n) = gcd(n, 0) = n, which shows that 0 is the unit for gcd.
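The monoid properties of gcd claimed in Example 1.16 can be spot-checked numerically; the MATLAB sketch below uses arbitrary values, and gcd is a built-in function.

    m = 12; n = 18; p = 27;
    assert(gcd(m, gcd(n, p)) == gcd(gcd(m, n), p));   % associativity
    assert(gcd(m, n) == gcd(n, m));                   % commutativity
    assert(gcd(0, n) == n && gcd(n, 0) == n);         % 0 is the unit for gcd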

(xgy)gz = ln(exgy+e ) = ln(ex + ey + ez ), xg(ygz) = ln(x + eygz ) = ln(ex + ey + ez ), for every x, y, z ∈ R, it follows that g is an associative operation. Thus, (R, g) is a semigroup. It is easy to verify that this semigroup has no unit element. Definition 1.19. A monoid is an algebra of type (0, 2), A = (A, {e, f }), where e is a zero-ary operation, f is a binary operation, and e is the unit element for f . Example 1.16. Let gcd(m, n) be the greatest common divisor of the numbers m, n ∈ N. The algebras (N, {1, ·}) and (N, {0, gcd}) are monoids. In the first case, the binary operation is the multiplication of natural numbers, the unit element is 1, and the algebra is clearly a monoid. In the second case, the unit element is 0. We claim that gcd is an associative operation. Let m, n, p ∈ N. We need to verify that gcd(m, gcd(n, p)) = gcd(gcd(m, n), p). Let k = gcd(m, gcd(n, p)). Then (k, m) ∈ δ and (k, gcd(n, p)) ∈ δ, where δ is the divisibility relation. Since gcd(n, p) divides evenly both n and p, it follows that (k, n) ∈ δ and (k, p) ∈ δ. Thus, k divides gcd(m, n), and therefore k divides h = gcd(gcd(m, n), p). Conversely, h being gcd(gcd(m, n), p), it divides both gcd(m, n) and p. Since h divides gcd(m, n), it follows that it divides both m and p. Consequently, h divides gcd(n, p) and therefore divides k = gcd(m, gcd(n, p)). Since k and h are both natural numbers that divide each other evenly, it follows that k = h, which allows us to conclude that gcd is an associative operation. Since n divides 0 evenly, for any n ∈ N, it follows that gcd(0, n) = gcd(n, 0) = n, which shows that 0 is the unit for gcd.

Preliminaries

19

Definition 1.20. A group is an algebra of type (0, 2, 1), A = (A, {e, f, h}), where e is a zero-ary operation, f is a binary operation, e is the unit element for f , and h is a unary operation such that f (h(x), x) = f (x, h(x)) = e for every x ∈ A. Note that if we have xf y = yf x = e, then y = h(x). Indeed, we can write h(x) = h(x)f e = h(x)f (xf y) = (h(x)f x)f y = ef y = y. We refer to the unique element h(x) as the inverse of x. The usual notation for h(x) is x−1 . Definition 1.21. A group A = (A, {e, f, h}) is Abelian if xf y = yf x for all x, y ∈ A. Abelian groups are also known as commutative groups. Traditionally, the zero-ary operation of Abelian groups is denoted by 0, the binary operation by “+”, and the inverse of an element x is denoted by −x. Thus, we usually write an Abelian group A as (A, {0, +, −}). Example 1.17. The algebra (Z, {0, +, −}) is an Abelian group where “+” is the usual addition of integers, and the additive inverse of an integer n is −n. Example 1.18. The set of permutations of {1, . . . , n} is the symmetric group Sn = (PERMn , ιn , ·, ), where · stands for the permutation composition. Example 1.19. Let ≡n ⊆ Z × Z be the equivalence relation defined on Z by ≡n = {(p, q) ∈ Z × Z | n evenly divides p − q}. In each equivalence class [p] there exists a least non-negative element m such that 0  m  n − 1, due to the properties of the division of integers. This allows us to consider the quotient set Z/ ≡n = {[0], . . . , [n − 1]}. For example, if n = 2, the quotient set contains the classes [0] and [1]. All even integers, belong to the class [0] and all odd integers, to the class [1]. We denote the quotient set Z/ ≡n by Zn . The sum of two equivalence classes [p] and [q] is the class [p + q]. It is easy to verify that

20

Linear Algebra Tools for Data Mining (Second Edition)

this addition is well-defined for if r ∈ [p] and s ∈ [q], then n divides evenly r − p and s − q and, therefore, it divides evenly r + s − (p + q). For example, Z4 consists of {[0], [1], [2], [3]} and the addition is defined by the table + [0] [1] [2] [3]

[0] [0] [1] [2] [3]

[1] [1] [2] [3] [0]

[2] [2] [3] [0] [1]

[3] [3] [0] [1] [2]

It is easy to verify that (Zn , {[0], +, −}) is an Abelian group, where −[p] is [n − p] for 0  p  n − 1 and [n] = [0]. Definition 1.22. A ring is an algebra of type (0, 2, 1, 2), A = (A, {e, f, h, g}), such that A = (A, {e, f, h}) is an Abelian group and g is a binary associative operation such that xg(uf v) = (xgu)f (xgv), (uf v)gx = (ugx)f (vgx), for every x, u, v ∈ A. These equalities are known as left and right distributivity laws, respectively. The operation f is known as the ring addition, while · is known as the ring multiplication. Frequently, these operations are denoted by “+” and “·”, respectively. Example 1.20. The algebra (Z, {0, +, −, ·}) is a ring. The distributive laws amount to the well-known distributive properties p · (q + r) = (p · q) + (p · r), (q + r) · p = (q · p) + (r · p), for p, q, r ∈ Z, of integer addition and multiplication. Example 1.21. A more interesting type of ring is defined on the set √ of numbers of the form m + n 2, where m and n are integers. The ring operations are given by √ √ √ (m + n 2) + (p + q 2) = m + p + (n + q) 2, √ √ √ (m + n 2) · (p + q 2) = m · p + 2 · n · q + (m · q + n · p) 2.

Preliminaries

21

If the multiplicative operation of a ring has a unit element 1, then we say that the ring is a unitary ring. We consider a unitary ring as an algebra of type (0, 0, 2, 1, 2) by regarding the multiplicative unit as another zero-ary operation. Observe, for example, that the ring (Z, {0, 1, +, −, ·}) is a unitary ring. Also, note that the set of even numbers also generates a ring ({2k | k ∈ Z}, {0, +, −, ·}). However, no multiplicative unit exists in this ring. Example 1.22. Let (S, {0, 1, +, −, ·}) be a commutative ring and let s = (s0 , s1 , . . .) be a sequence of elements of S. The support of the sequence s is the set supp(s) = {i ∈ N | si = 0}. A polynomial over S is a sequence that has a finite support. If supp(s) = ∅, then s is the zero polynomial. The degree of a polynomial p is the number deg(p) = max supp(p). The degree of the zero polynomial is 0. The addition of the polynomials p = (p0 , p1 , . . .) and q = (q0 , q1 , . . .) produces the polynomial p + q = (p0 + q0 , p1 + q1 , . . .). The product of the polynomials p and q is

$$pq = \Bigl(p_0 q_0,\; p_0 q_1 + p_1 q_0,\; \ldots,\; \sum_{i=0}^{m} p_i q_{m-i},\; \ldots\Bigr).$$

The “usual notation” for polynomials involves considering a symbol λ referred to as an indeterminate. Then the polynomial p = (p_0, p_1, . . .) is denoted as p(λ) = p_0 + p_1 λ + · · · + p_n λ^n, where n = deg(p). The set of polynomials over a ring S in the indeterminate λ is denoted by S[λ]. The reader accustomed to the usual notation for polynomials will realize that the addition and multiplication defined above correspond to the usual addition and multiplication of polynomials. It is easy to verify that the set S[λ] of polynomials in the indeterminate λ over S is itself a ring with the addition and multiplication defined above.
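In MATLAB, polynomials over R are commonly represented by their coefficient sequences, listed in descending powers of the indeterminate, and the product formula above is realized by the built-in convolution conv. The polynomials below are our own choices, used only for illustration:

p = [2 0 1];                                     % p(lambda) = 2*lambda^2 + 1
q = [1 -3];                                      % q(lambda) = lambda - 3
sum_pq  = p + [zeros(1, numel(p)-numel(q)), q];  % align degrees, then add coefficients
prod_pq = conv(p, q)                             % [2 -6 1 -3], i.e., 2*lambda^3 - 6*lambda^2 + lambda - 3

Each entry of conv(p, q) is exactly a sum of the form $\sum_i p_i q_{m-i}$ from the product formula.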


Example 1.23. Example 1.22 can be extended to polynomials of several variables. Let $\mathbb{N}^k = (z_0, z_1, \ldots)$ be the set of k-tuples of natural numbers listed in a fixed order. A polynomial in k variables over S is a sequence $p = (p_{z_0}, p_{z_1}, \ldots, p_{z_n}, \ldots)$ such that the support set $\mathrm{supp}(p) = \{z \in \mathbb{N}^k \mid p_z \neq 0\}$ is finite. If supp(p) = ∅, then p is the zero polynomial. The set of polynomials in k variables over S is denoted by S[λ_1, . . . , λ_k]. If |supp(p)| = 1, then p is a monomial. Let λ_1, . . . , λ_k be k indeterminates. A monomial p such that supp(p) = {(a_1, . . . , a_k)} is written as $p(\lambda_1, \ldots, \lambda_k) = \lambda_1^{a_1}\lambda_2^{a_2}\cdots\lambda_k^{a_k}$ and the number deg(p) = a_1 + · · · + a_k is the degree of the monomial p. If p, q are two polynomials in k variables, $p = (p_{z_0}, p_{z_1}, \ldots, p_{z_n}, \ldots)$ and $q = (q_{z_0}, q_{z_1}, \ldots, q_{z_n}, \ldots)$, their sum is the polynomial p + q given by $p + q = (p_{z_0} + q_{z_0}, p_{z_1} + q_{z_1}, \ldots, p_{z_n} + q_{z_n}, \ldots)$. The product of p and q is the polynomial pq, where
$$(pq)_z = \sum_{u + v = z} p_u q_v.$$

Definition 1.23. A polynomial p ∈ S[λ_1, . . . , λ_k] is symmetric if p(λ_1, . . . , λ_k) = p(λ_{φ(1)}, . . . , λ_{φ(k)}) for every permutation φ ∈ PERM_k.

Example 1.24. The elementary symmetric polynomials in λ_1, . . . , λ_n are defined by
$$s_0(\lambda_1, \ldots, \lambda_n) = 1,$$
$$s_1(\lambda_1, \ldots, \lambda_n) = \sum_{i=1}^{n} \lambda_i,$$
$$s_2(\lambda_1, \ldots, \lambda_n) = \sum_{1 \le i < j \le n} \lambda_i \lambda_j,$$

Solution: For the first part, we compute the coefficient of $t^b$ for b ∈ N in both sides of the equality. In the left member this coefficient is
$$\sum \{\lambda_1^{a_1} \cdots \lambda_n^{a_n} \mid a_1 + \cdots + a_n = b\}$$
because $c_k$ is the sum of all monomials of degree k in λ_1, . . . , λ_n. In the right member we have a product of n series of the form
$$\frac{1}{1 - \lambda_i t} = 1 + \lambda_i t + \cdots + \lambda_i^p t^p + \cdots.$$
Therefore, the coefficient of $t^b$ in this product equals the value of the coefficient for the left member for every b, which establishes the equality. The second part follows immediately from the first part. The equality $C_n(t) S_n(-t) = 1$ can be written as
$$\Bigl(\sum_{i=0}^{n} (-1)^i s_i(\lambda_1, \ldots, \lambda_n)\, t^i\Bigr) \cdot \Bigl(\sum_{k \ge 0} c_k(\lambda_1, \ldots, \lambda_n)\, t^k\Bigr) = 1.$$
The coefficient of $t^n$ in the left member is
$$\sum_{j=0}^{n} (-1)^j s_j(\lambda_1, \ldots, \lambda_n)\, c_{n-j}(\lambda_1, \ldots, \lambda_n),$$

while the coefficient of $t^n$ in the right-hand member is 0.

(4) If π, σ ∈ PERM_n, prove that sign(πσ) = sign(π) sign(σ).
(5) If σ ∈ PERM_n, prove that sign(σ^{-1}) = sign(σ).
(6) Let π ∈ PERM_n and σ ∈ PERM_{n+m}. Define the permutation π̃ ∈ PERM_{n+m} as
$$\tilde{\pi}(k) = \begin{cases} \pi(k) & \text{if } 1 \le k \le n,\\ k & \text{if } n + 1 \le k \le n + m.\end{cases}$$
Prove that sign(π̃σ) = sign(π) sign(σ).
(7) Let θ ∈ PERM_{p+q} be defined as
$$\theta: \begin{pmatrix} 1 & 2 & \cdots & p & p+1 & \cdots & p+q \\ p+1 & p+2 & \cdots & p+q & 1 & \cdots & q \end{pmatrix}.$$
Prove that sign(θ) = (−1)^{pq}.
(8) Prove that if θ ∈ PERM_n can be written as products of transpositions, θ = τ_1 τ_2 · · · τ_r = τ'_1 τ'_2 · · · τ'_s, then r ≡ s (mod 2).
(9) Prove that
$$\epsilon_{i_1 \cdots i_n} = \begin{cases} 1 & \text{if } (i_1, \ldots, i_n) \text{ is an even permutation},\\ -1 & \text{if } (i_1, \ldots, i_n) \text{ is an odd permutation},\\ 0 & \text{otherwise}.\end{cases}$$
(10) Prove that the Levi-Civita symbols satisfy the identity
$$\sum_{k=1}^{3} \epsilon_{ijk}\,\epsilon_{\ell m k} = \epsilon_{ij1}\epsilon_{\ell m 1} + \epsilon_{ij2}\epsilon_{\ell m 2} + \epsilon_{ij3}\epsilon_{\ell m 3}.$$
(11) Prove that
$$\sum_{i=1}^{3} \sum_{j=1}^{3} \sum_{k=1}^{3} \epsilon_{ijk}\,\epsilon_{ijk} = 6.$$
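Supplement (11) is easy to spot-check numerically. The MATLAB sketch below builds the rank-3 Levi-Civita array directly from the definition in supplement (9); it is an illustration, not a proof:

eps3 = zeros(3, 3, 3);                                   % Levi-Civita symbol as a 3x3x3 array
eps3(1,2,3) = 1;  eps3(2,3,1) = 1;  eps3(3,1,2) = 1;     % even permutations of (1,2,3)
eps3(3,2,1) = -1; eps3(1,3,2) = -1; eps3(2,1,3) = -1;    % odd permutations of (1,2,3)
total = sum(eps3(:).^2)                                  % evaluates to 6, as claimed in (11)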


Let X be a finite set, X = {x_1, . . . , x_n}. A selection of r objects from X where each object can be selected more than once is a combination of n objects taken r at a time with repetition. Two combinations taken r at a time with repetition are identical if they have the same elements repeated the same number of times, regardless of the order. For instance, if X = {x_1, x_2, x_3, x_4}, then there are 20 combinations of the elements of X taken three at a time with repetition:
x1 x1 x1, x1 x1 x2, x1 x1 x3, x1 x1 x4, x1 x2 x2, x1 x2 x3, x1 x2 x4, x1 x3 x3, x1 x3 x4, x1 x4 x4, x2 x2 x2, x2 x2 x3, x2 x2 x4, x2 x3 x3, x2 x3 x4, x2 x4 x4, x3 x3 x3, x3 x3 x4, x3 x4 x4, x4 x4 x4.

(12) Show that the following numbers are equal:
(a) the number of combinations of n objects taken r at a time with repetition;
(b) the number of non-negative integer solutions of the equation x_1 + x_2 + · · · + x_n = r;
(c) the number $\binom{n+r-1}{n-1} = \binom{n+r-1}{r}$.

Solution: We prove only that the number of non-negative integer solutions of the equation x_1 + x_2 + · · · + x_n = r equals $\binom{n+r-1}{n-1}$. A solution of the equation x_1 + x_2 + · · · + x_n = r can be represented as a binary string of length n + r − 1 as
$$\underbrace{1 \cdots 1}_{x_1}\; 0\; \underbrace{1 \cdots 1}_{x_2}\; 0\; \cdots\; 0\; \underbrace{1 \cdots 1}_{x_n}$$
containing x_1 + · · · + x_n = r digits equal to 1 and n − 1 digits equal to 0. Since there are $\binom{n+r-1}{n-1}$ such binary strings, the conclusion follows. Note that in the example that precedes this supplement, the number of combinations of X taken three at a time with repetition is $\binom{4+3-1}{3} = 20$.

Bibliographical Comments

There are many sources for general algebra [3, 20, 108] and combinatorics [154, 155, 163, 166] that contain material that would amply satisfy the needs of the reader.

Chapter 2

Linear Spaces

2.1

Introduction

This chapter is dedicated to the study of linear spaces, which are mathematical structures that play a central role in linear algebra. A linear space is defined in connection with a field and makes use of two operations: an additive operation between its elements, and an external multiplication operation that involves elements of the field and those of the linear space. 2.2

Linear Spaces

Definition 2.1. Let F = (F, {0, 1, +, −, ·}) be a field. An F-linear space is a pair (L, s) such that L = (V, {0_V, +, −}) is an Abelian group and s : F × V −→ V is a function referred to as the scalar multiplication that satisfies the following conditions:
(1) s(a + b, x) = s(a, x) + s(b, x);
(2) s(a, x + y) = s(a, x) + s(a, y);
(3) s(ab, x) = s(a, s(b, x));
(4) s(1, x) = x
for every a, b ∈ F and x, y ∈ V. The elements of the field F are referred to as scalars, while the elements of V are referred to as vectors.


The result of the scalar multiplication s(a, v) is denoted simply by av. This allows us to write the previous equalities as (1) (a + b)v = av + bv; (2) a(v + w) = av + aw; (3) (ab)v = a(bv); (4) 1v = v for a, b ∈ F and v, w ∈ L. We omit the explicit mention of the scalar multiplication function from the definition of an F-linear space V = (V, s) and we refer to V simply as V. If the field F is irrelevant, or it is clearly designated from the context, we refer to an F-linear space just as a linear space. On the other hand, if F is the real field R or the complex field C, we refer to an R-linear space as a real linear space and to a C-linear space as a complex linear space. Example 2.1. If F = (F, {0, 1, +, −, ·}) is a field, then the oneelement linear space V = {0V }, where a0V = 0V for every a ∈ F, is the zero F-linear space, or, for short, the zero linear space. The field F itself is an F-linear space, where the Abelian group is (F, {0, +, −}) and scalar multiplication coincides with the scalar multiplication of F. Example 2.2. The set of all sequences of real numbers, Seq(R), is a real linear space, where the sum of two sequences x = (x0 , x1 , . . .) and y = (y0 , y1 , . . .) is the sequence x + y defined by x + y = (x0 + y0 , x1 + y1 , . . .) and the multiplication of x by a scalar a is ax = (ax0 , ax1 , . . .). A related real linear space is the set Seqn (R) of all sequences of real numbers having length n, where the sum and the scalar multiplications are defined in a similar manner. Namely, if x = (x0 , x1 , . . . , xn−1 ) and y = (y0 , y1 , . . . , yn−1 ), the sequence x + y is defined by x + y = (x0 + y0 , x1 + y1 , . . . , xn−1 + yn−1 ) and the multiplication of x by a scalar a is ax = (ax0 , ax1 , . . . , axn−1 ). This linear space is denoted by Rn and its zero element is denoted by 0n . Example 2.3. If the real field R is replaced by the complex field C, we obtain the linear space Seq(C) of all sequences of complex numbers. Similarly, we have the complex linear space Cn , which consists of all sequences of length n of complex numbers.


Example 2.4. Let V be an F-linear space and let S be a non-empty set. The set $V^S$ that consists of all functions of the form f : S −→ V is an F-linear space. The addition of functions is defined by (f + g)(s) = f(s) + g(s), while the multiplication by a scalar is given by (af)(s) = af(s), for s ∈ S and a ∈ F. We leave to the reader the task of verifying that the definition of a linear space is satisfied.

Example 2.5. Let S be a set and let F_2 be the two-element field defined in Example 1.27. Define the scalar multiplication of a subset T of S by an element of the field as 0 · T = ∅ and 1 · T = T for every T ∈ P(S). The sum of two subsets U and V is defined as their symmetric difference U + V = (U − V) ∪ (V − U). With these definitions the set of subsets of S is an F_2-linear space, as the reader can easily verify.

Informally, a subspace of an F-linear space is a subset of the F-linear space that behaves exactly like an F-linear space.

Definition 2.2. A non-empty subset U of an F-linear space V is a subspace of V if
(i) x + y ∈ U for all x, y ∈ U,
(ii) ax ∈ U for a ∈ F and x ∈ U.
If U is a subspace of an F-linear space V, then U is itself an F-linear space.

Example 2.6. The subset {0_V} of any F-linear space V is a subspace of V named the zero subspace. This is the smallest subspace of V.

Theorem 2.1. If $\mathcal{V} = \{V_i \mid i \in I\}$ is a collection of subspaces of an F-linear space V, then $\bigcap \mathcal{V}$ is a subspace of V.

Proof. Suppose that $x, y \in \bigcap \mathcal{V}$. Then, $x, y \in V_i$, so $x + y \in V_i$ and $ax \in V_i$ for every i ∈ I. Thus, $x + y \in \bigcap \mathcal{V}$ and $ax \in \bigcap \mathcal{V}$, which allows us to conclude that $\bigcap \mathcal{V}$ is a subspace of V.


Since V itself is a subspace of V, it follows that the collection of subspaces of a linear space is a closure system V. If K sub is the closure operator induced by V, then for every subset X of V, K sub (X) is the smallest subspace of V that contains X. If U is a subspace of a linear space V and x ∈ V, we denote the set {x + u | u ∈ U } by x + U . The following statements are immediate for an F-linear space V : (i) the sets V and {0V } are subspaces of V ; (ii) each subspace U of V contains 0V . 2.3

Linear Independence

Let F be a field and let I be a non-empty set. If ϕ : I −→ F is a function, define the support of ϕ as the subset of I given by supp(ϕ) = {i ∈ I | ϕ(i) ≠ 0}.

Definition 2.3. Let F be a field, I be a non-empty set, and let ϕ : I −→ F be a function that has finite support. If V is an F-linear space, the linear combination determined by ϕ is an element $w_\phi$ of V that can be written as
$$w_\phi = \sum_{i \in \mathrm{supp}(\phi)} \phi(i)\, x_i,$$

where xi ∈ V for i ∈ supp(ϕ). In the special case when supp(ϕ) = ∅, we define w ϕ = 0V . If supp(ϕ) = {1, . . . , n} and {x1 , . . . , xn } ⊆ X, then a X-linear combination is an element x of V that can be written as x = a1 x1 + · · · + an xn , where ai = ϕ(i) for 1  i  n are scalars called the coefficients of the linear combination w. The set of all X-linear combinations is denoted by X and is referred to as the set spanned by X. Theorem 2.2. Let V be an F-linear space. If X ⊆ V, then X is the smallest subspace of V that contains the set X. In other words, we have


(i) X is a subspace of V ; (ii) X ⊆ X; (iii) if X ⊆ M, where M is a subspace of V, then X ⊆ M . Proof. It is clear that if u and v are two X-linear combinations, then u + v and au are also X-linear combinations, so X is a subspace of V. Also, for x ∈ X, we can write 1x = x, so X ⊆ X. Finally, suppose that X ⊆ M , where M is a subspace of V and a1 x1 + · · · + an xn ∈ X, where x1 , . . . , xn ∈ X. Since X ⊆ M , we have x1 , . . . , xn ∈ M , hence a1 x1 + · · · + an xn ∈ M because M is a subspace. Thus, X ⊆ M .  Corollary 2.1. Let V be an F-linear space. If X ⊆ V, then X equals K sub (X), the subspace of V generated by X. Proof.

This statement follows from Theorem 2.2.



From now on, we use the notation ⟨X⟩ instead of $K^{\mathrm{sub}}(X)$.

Definition 2.4. Let V be an F-linear space. A finite subset U = {x_1, . . . , x_n} of V is linearly dependent if 0_V can be written as 0_V = a_1 x_1 + · · · + a_n x_n, where at least one element a_i of F is not equal to 0. If this condition is not satisfied, then U is said to be linearly independent.

A set U that consists of one vector x ≠ 0_V is linearly independent. The subset U = {x_1, . . . , x_n} of V is linearly independent if a_1 x_1 + · · · + a_n x_n = 0_V implies a_1 = · · · = a_n = 0. Also, note that a set U that is linearly independent does not contain 0_V.

Example 2.7. Let V be an F-linear space. If u ∈ V, then the set V_u = {au | a ∈ F} is a linear subspace of V. Moreover, if u ≠ 0_V, then the set {u} is linearly independent. Indeed, if au = 0_V and a ≠ 0, then multiplying both sides of the above equality by a^{-1} we obtain (a^{-1}a)u = a^{-1} 0_V, or equivalently, u = 0_V, which contradicts the initial assumption. Thus, {u} is a linearly independent set.

Definition 2.4 is extended to arbitrary subsets of a linear space.

Definition 2.5. Let V be an F-linear space. A subset W of V is linearly dependent if it contains a finite subset U that is linearly dependent. A subset W is linearly independent if it is not linearly dependent.


Thus, W is linearly independent if every finite subset of W is linearly independent. Further, any subset of a linearly independent set is linearly independent and any superset of a linearly dependent set is linearly dependent.

Example 2.8. For every F-linear space V, the set {0_V} is linearly dependent because we have 1 · 0_V = 0_V.

Example 2.9. Consider the F-linear space SF(I, F) that consists of the set of functions that map the non-empty set I into F. For i ∈ I, define the function e_i : I −→ F as
$$e_i(j) = \begin{cases} 1 & \text{if } j = i,\\ 0 & \text{otherwise.}\end{cases}$$
The set E = {e_i | i ∈ I} is linearly independent in the linear space SF(I, F). Indeed, let {e_{i_k} | 1 ≤ k ≤ p} be a finite subset of E and assume that $a_{i_1} e_{i_1} + \cdots + a_{i_p} e_{i_p} = z$, where z is the function defined by z(i) = 0 for i ∈ I. Thus, choosing i = i_k, we have $a_{i_1} e_{i_1}(i_k) + \cdots + a_{i_p} e_{i_p}(i_k) = z(i_k)$. All terms in the left member equal 0 with the exception of $a_{i_k} e_{i_k}(i_k) = a_{i_k}$, so $a_{i_k} = 0$.

Theorem 2.3. Let V be an F-linear space and let W be a linearly independent subset of V. If y ∈ ⟨W⟩, that is,
y = a_1 x_1 + · · · + a_n x_n,
for some finite subset {x_1, . . . , x_n} of W, then the coefficients a_1, . . . , a_n are uniquely determined.

Proof.

Suppose that y can be alternatively written as y = b1 x1 + · · · + bn xn ,

for some b1 , . . . , bn ∈ F. Since W is linearly independent, this implies (a1 − b1 )x1 + · · · + (an − bn )xn = 0V , which, in turn, yields a1 − b1 = · · · = an − bn = 0. Thus, we have  ai = bi for 1  i  n.
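Linear independence and the uniqueness of the coefficients in Theorem 2.3 are easy to explore numerically when V = R^n. The MATLAB sketch below uses vectors of our own choosing: independence is checked through the rank of the matrix whose columns are the given vectors, and the unique coefficients of a vector in their span are then recovered with the backslash operator.

W = [1 0; 1 1; 0 2];                  % columns w1, w2 of a subset of R^3
independent = (rank(W) == size(W, 2)) % true: only the trivial combination yields the zero vector
y = 3*W(:,1) + 2*W(:,2);              % a vector in the span of {w1, w2}
a = W \ y                             % recovers the unique coefficients [3; 2]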


2.4


Linear Mappings

Linear mappings between linear spaces are functions that are compatible with the algebraic operations of linear spaces, as introduced next. Definition 2.6. Let F be a field and let V and W be two F-linear spaces. A linear mapping is a function h : V −→ W such that h(ax + by) = ah(x) + bh(y) for each of the scalars a, b ∈ F and x, y ∈ V. An affine mapping is a function f : V −→ W such that there exists a linear mapping h : V −→ W and b ∈ W such that f (x) = h(x) + b for x ∈ V. Linear mappings are also referred to as linear space homomorphisms, linear morphisms, or linear operators. The set of morphisms between two F-linear spaces V and W is denoted by Hom(V, W ). The set of affine mappings between two Flinear spaces V and W is denoted by Aff(V, W ). Definition 2.7. Let h, g ∈ Hom(V, W ) be two linear mappings between the F-linear spaces V and W . The sum of h and g is the mapping h + g defined by (h + g)(x) = h(x) + g(x) for x ∈ V. If a ∈ F, the product af is defined as (af )(x) = af (x) for x ∈ V. If V, W are two F-linear spaces, then the set Hom(V, W ) is never empty because the zero morphism 0V,W : V −→ W defined as 0V,W (x) = 0V for x ∈ V is always an element of Hom(V, W ). Note that (f + g)(ax + by) = f (ax + by) + g(ax + by) = af (x) + bf (y) + ag(x) + bg(y) = f (ax + by) + g(ax + by), for all a, b ∈ F and x, y ∈ L. This shows that the sum of two linear mappings is also a linear mapping.
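Although matrices are introduced only in Chapter 3, the computation above can be seen concretely for two linear mappings on R^2, each represented (as MATLAB requires) by a matrix of our own choosing:

F = [1 2; 0 1];  G = [0 -1; 1 0];              % two linear mappings f, g on R^2
x = [3; -1];  y = [2; 5];  a = 4;  b = -2;     % arbitrary vectors and scalars
lhs = (F + G) * (a*x + b*y);                   % (f + g)(a x + b y)
rhs = a*(F*x) + b*(F*y) + a*(G*x) + b*(G*y);   % a f(x) + b f(y) + a g(x) + b g(y)
norm(lhs - rhs)                                % 0: the sum f + g is again a linear mapping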


Theorem 2.4. Hom(V, W ) equipped with the sum and product defined above is an F-linear space. Proof. The zero element of Hom(V, W ) is the mapping 0V,W . We leave to the reader to verify that Hom(V, W ) satisfies the properties  mentioned in Definition 2.1. Definition 2.8. Let V be an F-linear space. A linear form on V is a morphism in Hom(V, F), where the field F is regarded as an F-linear space. 2.5

Bases in Linear Spaces

Definition 2.9. A basis of an F-linear space V is a linearly independent subset B such that B = V. If an F-linear space V has a finite basis, then we say that V is a linear space of finite type. Example 2.10. Let F be a field and let I be a non-empty set. In Example 2.9, we saw that the set E = {ei | i ∈ I} is a linear independent set in the linear space SF(I, F). Let  ϕ : I −→ F and suppose that supp(ϕ) = {j1 , . . . , jp }. Then, ϕ(i) = pi=1 ϕ(jk )ejk (i) for i ∈ I. Therefore, E is a basis in SF(I, F). Example 2.11. Let ei be the vector in Cn given by ⎛ ⎞ 0 ⎜.⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜0⎟ ⎟ ei = ⎜ ⎜1⎟, ⎜ ⎟ ⎜ ⎟ ⎜0⎟ ⎝ ⎠ .. .0 having one non-zero component equal to 1 on its ith position. Note that for x ∈ Cn , we have x = x1 e1 + · · · + xn en ,


which means that Cn = {e1 , . . . , en }. If x1 e1 + · · · + xn en = 0n , it follows immediately that x1 = · · · = xn = 0, which implies that {e1 , . . . , en } is a basis for Cn . Theorem 2.5. Every non-zero F-linear space V has a basis. Proof. Let V be a non-zero F-linear space and let U be a set such that U  = L. Note that at least one such set exists because V  = V. The set U contains at least an element distinct from 0V because {0V } = {0V }. Let IU be the collection of linearly independent subsets of U . Since for every x ∈ U such that x = 0V the set {x} is linearly independent, it follows that IU is a non-empty collection. We claim that in the partially ordered set (IU , ⊆) every chain has an upper bound. Indeed, let {Ki | i ∈ J}  be a chain of independent subsets in (IU , ⊆). We claim that K = {Ki | i ∈ J} is a linearly independent set. Indeed, suppose that {xk | 1  k  n} is a finite subset of K. For every l, 1  k  n there exists ik ∈ J such that xk ∈ Kik . Since {Ki | i ∈ J} is a chain, there exists a set K  among the sets Ki1 , . . . , Kin that includes all others. Thus, {xk | 1  k  n} ⊆ K  , which implies that the set {xk | 1  k  n} is linearly independent. Consequently, the set K is linearly independent and, therefore, it is an upper bound for the chain {Ki | i ∈ J}. By Zorn’s Lemma (see, for example, Section 4.10 of [152]), the partially ordered set (IU , ⊆) contains an element maximal B. Clearly, B is linearly independent. To prove that B is a basis, we need to show only that B spans the entire linear space V, that is, that B = V. To this end, it suffices to prove that the set U of generators of V is included in B. Let x ∈ U − B and let X = B ∪ {x}. Then, since B is maximal, it follows that X is linearly dependent. Therefore, there exists a linear  combination ax + pi=1 ai xi = 0V such that a = 0. Since Fis a field, there exists an inverse a−1 of a. Therefore, x = −a−1 pi=1 ai xi , which implies x ∈ B. Thus, we may conclude that U ⊆ B, so  V = U  ⊆ B = V. Corollary 2.2. Let U be a subset of an F-linear space V such that U  = V. If B ⊆ U is a linearly independent set, then there is a basis Z of V such that B ⊆ Z ⊆ U .


Proof. Let IU be the collection of linearly independent subsets of U . By Zorn’s Lemma, for every element B of IU there exists a maximal element Z of (IU , ⊆) such that B ⊆ Z. The desired basis is Z.  Corollary 2.3 (Independent set extension corollary). Let V be an F-linear space. If S is a linearly independent set, then there exists a basis B of V such that S ⊆ B. Proof. Since S is a linearly independent set, if T  = V, then S ∪T  also generates V. The statement follows by Corollary 2.2. If an F-linear space V has a finite basis, then we say that V is a linear space of finite type. Lemma 2.1. Let V be a finite type F-linear space and let T be a finite subset of V that is not linearly independent. If k = |T |  2 and (t1 , . . . , tk ) is a list of the vectors in T, then there exists a number j such that 2  j  m and tj is a linear combination of its predecessors in the sequence. Furthermore, we have T − {tj } = T . Proof. Suppose that T is linearly dependent. Then there exists a linear combination ki=1 ai ti = 0V such that some of the scalars a1 , . . . , ak are different from 0. Let j be the largest number such that 1  j  k and aj = 0. The definition of j implies that a1 t1 +  ai · · · + aj tj = 0V , so tj = − j−1 i=1 aj ti , which shows that tj is a linear combination of its predecessors in the list. Consequently, the set of  linear combinations of the vectors in T − {tj } equals T . Theorem 2.6 (The Replacement theorem). Let V be a finitetype F-linear space such that the set S spans the linear space V and |S| = n. If U is a linearly independent set in V such that |U | = m, then m  n and there exists a subset S  of S such that S  contains n − m vectors and U ∪ S  spans the space V. Proof. Suppose that S = {w1 , . . . , w n } and U = {u1 , . . . , um }. The argument is by induction on m. The basis case, m = 0, is immediate. Suppose the statement holds for m and let U = {u1 , . . . , um , um+1 } be a linearly independent set that contains m+1 vectors.


The set {u1 , . . . , um } is linearly independent, so by the inductive hypothesis m  n and there exists a subset S  of S that contains n−m vectors such that {u1 , . . . , um } ∪ S  spans the space V. Without loss of generality we may assume that S  = {w1 , . . . , wn−m }. Thus, um+1 is a linear combination of the vectors of {u1 , . . . , um , w 1 , . . . , wn−m }, so we have um+1 = a1 u1 + · · · + am um + b1 w1 + · · · + bn−m wn−m . We have m+1  n because, otherwise, m+1 = n and um+1 would be a linear combination of u1 , . . . , um , thereby contradicting the linear independence of the set U . The set {u1 , . . . , um , um+1 , w1 , . . . , w n−m } is not linearly independent. Let v be the first member of the sequence (u1 , . . . , um , um+1 , w1 , . . . , wn−m ) that is a linear combination of its predecessors. Then, v cannot be one of the ui (with 1  i  m) because this would contradict the linear independence of the set U . Therefore, there exists k such that w k is a linear combination of its predecessors and 1  k  n − m. By Lemma 2.1, we can remove this element from the set {u1 , . . . , um , um+1 , w1 , . . . , w n−m } without affecting the set  spanned. Corollary 2.4. Let V be a finite-type F-linear space and let B, C be two bases of L. Then |B| = |C|. Proof. Since B is a linearly independent set, and C = V, by Theorem 2.6 we have |B|  |C|. The reverse inequality, |C|  |B|, is obtained by asserting that C is linearly independent and C = V.  Thus, |B| = |C|. Corollary 2.4 allows the introduction of the notion of dimension for a linear space. Definition 2.10. The dimension of a finite-type linear space V is the number of elements of any basis of V. The dimension of V is denoted by dim(V ). The dimension of the zero F-linear space {0} is 0. If a linear space V is not of finite type, then we say that dim(V ) is of infinite type.
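For subspaces of R^n, the dimension of Definition 2.10 can be computed in MATLAB as the rank of a matrix whose columns span the subspace. The data below are merely an illustration chosen by us:

X = [1 0 1 2;
     0 1 1 1;
     1 1 2 3;
     2 1 3 5];       % columns span a subspace of R^4; the last two are combinations of the first two
d = rank(X)          % dimension of the spanned subspace; here d = 2
B = orth(X);         % the d columns of B form an orthonormal basis of that subspace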


Theorem 2.7. Let V be an F-linear space of finite type having the basis B = {x1 , . . . , xn } and let Y = {y 1 , . . . , y n } be a subset of an Flinear space W . If f : B −→ Y is a function between B and W, then there exists a unique extension of f as a linear mapping f : V −→ W such that f (xi ) = y i for 1  i  n. an xn because Proof. If x ∈ V, we have x = a1 x1 + · · · + {x1 , . . . , xn } is a basis of V. Define f (x) as f (x) = ni=1 ai y i . The uniqueness of the expression of x as a linear combination of the elements of B makes f well-defined. The linearity of f is immediate. For uniqueness of the extension of f , observe that the value of f is  determined by the values of f (xi ). Example 2.12. Let S be a non-empty, finite set. The linear space CS has dimension |S|. Indeed, for each t ∈ S, consider the function ft : S −→ C defined by ft (s) =

$$\begin{cases} 1 & \text{if } s = t,\\ 0 & \text{otherwise,}\end{cases}$$

for t ∈ S. If S = {t_1, . . . , t_n}, then the set of functions {f_{t_1}, . . . , f_{t_n}} is linearly independent, for if $c_1 f_{t_1}(s) + \cdots + c_n f_{t_n}(s) = 0$, then taking s = t_k we obtain c_k = 0 for any k, 1 ≤ k ≤ n. Furthermore, if f : S −→ C is a function and f(t_i) = c_i, then $f = \sum_{i=1}^{n} c_i f_{t_i}$, so {f_{t_1}, . . . , f_{t_n}} is a basis for C^S.

Example 2.13. Let V, W be two linear spaces of finite type with dim(V) = p and dim(W) = q. Then, dim(Hom(V, W)) = pq. Suppose that {x_1, . . . , x_p} is a basis in V and {y_1, . . . , y_q} is a basis in W. As we have shown in Theorem 2.7, for every i such that 1 ≤ i ≤ p and j such that 1 ≤ j ≤ q, there exists a unique linear mapping f_{ij} : {x_1, . . . , x_p} −→ W such that

$$f_{ij}(x_k) = \begin{cases} y_j & \text{if } i = k,\\ 0_W & \text{otherwise,}\end{cases}$$
for 1 ≤ k ≤ p.


Note that if $x = \sum_{k=1}^{p} a_k x_k$, the linearity of $f_{ij}$ implies
$$f_{ij}(x) = f_{ij}\Bigl(\sum_{k=1}^{p} a_k x_k\Bigr) = \sum_{k=1}^{p} a_k f_{ij}(x_k) = a_i f_{ij}(x_i).$$

We claim that the set {f_{ij} | 1 ≤ i ≤ p, 1 ≤ j ≤ q} is a basis for Hom(V, W). Let f : V −→ W be a linear mapping. If x ∈ V, we can write $x = \sum_{i=1}^{p} a_i x_i$, so $f(x) = \sum_{i=1}^{p} a_i f(x_i)$. In turn, since {y_1, . . . , y_q} is a basis in W, $f(x_i) = \sum_{j=1}^{q} b_{ij} y_j$, for some b_{ij} ∈ F. This allows us to write
$$f(x) = \sum_{i=1}^{p} a_i \sum_{j=1}^{q} b_{ij} y_j = \sum_{i=1}^{p} \sum_{j=1}^{q} a_i b_{ij} y_j = \sum_{i=1}^{p} \sum_{j=1}^{q} a_i b_{ij} f_{ij}(x),$$

which shows that each linear mapping in Hom(V, W) is a linear combination of the functions f_{ij}. Furthermore, the set {f_{ij} | 1 ≤ i ≤ p, 1 ≤ j ≤ q} is linearly independent in Hom(V, W). Indeed, suppose that $\sum_{i=1}^{p} \sum_{j=1}^{q} c_{ij} f_{ij}(x) = 0_W$. Then, for x = x_i we have $\sum_{j=1}^{q} c_{ij} y_j = 0_W$, which implies c_{ij} = 0. We may conclude that dim(Hom(V, W)) = dim(V) dim(W).

Theorem 2.8. If W is a subspace of a finite-type linear space V, then dim(W) ≤ dim(V).

Proof. If U is a linearly independent set in the subspace W, then it is clear that U is linearly independent in V. There exists a basis B of V such that U ⊆ B and |B| = dim(V). Therefore, dim(W) ≤ dim(V).

Definition 2.11. Let V and W be two F-linear spaces and let h : V −→ W be a linear mapping. The kernel of h is the subset Ker(h) of V given by Ker(h) = {v ∈ V | h(v) = 0_W}. The image of h is the subset Im(h) of W given by Im(h) = {w ∈ W | h(v) = w for some v ∈ V}.
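When V = R^n, W = R^m, and h is given by h(x) = Ax for a matrix A, the kernel and the image of Definition 2.11 can be computed directly in MATLAB. The matrix below is an assumption chosen for illustration:

A = [1 2 0 1;
     0 1 1 1;
     1 3 1 2];            % a linear mapping from R^4 to R^3
K = null(A);              % columns form a basis of Ker(h)
R = orth(A);              % columns form a basis of Im(h)
[size(K, 2), size(R, 2)]  % here [2 2]; the two dimensions add up to 4 = dim(R^4), cf. Theorem 2.10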


The notion of subspace is closely linked to the notion of linear mapping as we show next. Theorem 2.9. Let V, W be two F-linear spaces. If h : V −→ W is a linear mapping, then Im(h) is a subspace of W and Ker(h) is a subspace of V. Proof. Let w1 and w2 be two elements of Im(h). There exist v 1 , v 2 ∈ V such that w1 = h(v 1 ) and w2 = h(v 2 ). Since h is a linear mapping, we have w1 + w2 = h(v 1 ) + h(v 2 ) = h(v 1 + v 2 ). Thus, w1 +w2 ∈ Im(h). Further, if a ∈ F and w ∈ W , then w = h(v) for some v ∈ V and we have aw = ah(v) = h(av), so aw ∈ Im(h). Thus, Im(h) is indeed a subspace of W . Suppose now that s and t belong to Ker(h), that is h(s) = h(t) = 0W . Then h(s + t) = h(s) + h(t) = 0W , so s + t ∈ Ker(h). Also, h(as) = ah(s) = a0W = 0W , which allows us to conclude that  Ker(h) is a subspace of W . Theorem 2.10. Let V and W be two linear spaces, where dim(V ) = n, and let h : V −→ W be a linear mapping. Then, we have dim(Ker(h)) + dim(Im(h)) = n. Proof. Suppose that {e1 , . . . , em } is a basis for the subspace Ker(h) of V. By Corollary 2.3, each such basis can be extended to a basis {e1 , . . . , em , em+1 , . . . , en } of the linear space V. Any v ∈ V can be written as v=

$\sum_{i=1}^{n} a_i e_i$.
Since {e_1, . . . , e_m} ⊆ Ker(h), we have h(e_i) = 0_W for 1 ≤ i ≤ m, so
$$h(v) = \sum_{i=m+1}^{n} a_i h(e_i).$$

This means that the set {h(e_{m+1}), . . . , h(e_n)} spans the subspace Im(h) of W. We show now that this set is linearly independent. Indeed, suppose that $\sum_{i=m+1}^{n} b_i h(e_i) = 0_W$. This implies $h\bigl(\sum_{i=m+1}^{n} b_i e_i\bigr) = 0_W$, that is, $\sum_{i=m+1}^{n} b_i e_i \in \mathrm{Ker}(h)$. Since {e_1, . . . , e_m} is a basis for Ker(h), there exist m scalars c_1, . . . , c_m such that
$$\sum_{i=m+1}^{n} b_i e_i = c_1 e_1 + \cdots + c_m e_m.$$

The fact that {e1 , . . . , em , em+1 , . . . , en } is a basis for V implies that c1 = · · · = cm = bm+1 = · · · = bn = 0, so the set {h(em+1 ), . . . , h(en )} is linearly independent and, therefore, a basis for Im(h). Thus, dim(Im(h)) = n − m, which concludes the argument.  Definition 2.12. Let V and W be two F-linear spaces and let h ∈ Hom(V, W ). The rank of h is rank(h) = dim(Im(h)); the nullity of h is nullity(h) = dim(Ker(h)). The spark of a linear mapping h : V −→ W is the minimum size of a subset S of Im(h) such that 0W ∈ S. Theorem 2.10 can now be rephrased by saying that if h : V −→ W is a linear mapping and V is a linear space of finite type, then dim(V ) = rank(h) + nullity(h). Theorem 2.11. Let h : V −→ W be a linear mapping between the linear spaces V and W . Then rank(h)  min{dim(V ), dim(W )}. Proof. From Theorem 2.10 it follows that rank(h)  dim(V ). On the other hand, rank(h) = dim(Im(h))  dim(W ) because Im(h) is a  subspace of W , so the inequality of the theorem follows. Example 2.14. Let V, W be two F-linear spaces. For h ∈ V ∗ and y ∈ W , define the mapping h,y : V −→ W as h,y (x) = h(x)y for x ∈ V. It is easy to verify that h,y is a linear mapping, that is, h,y ∈ Hom(V, W ). Furthermore, we have rank(h,y ) = 1 because Im(h,y ) consists of the multiples of the vector y. Let f ∈ Hom(V, W ) be a linear mapping of rank r, which means that dim(Im(f )) = r. There exists a basis {y 1 , . . . , y r } in Im(f ) such


that for every x ∈ V, f(x) can be uniquely written as
$$f(x) = \sum_{i=1}^{r} a_i y_i.$$

Let h_i ∈ V* be the linear form defined as h_i(x) = a_i for 1 ≤ i ≤ r. Then, $f(x) = \sum_{i=1}^{r} h_i(x)\, y_i$, hence f is the sum of r linear mappings of rank 1.

2.6

Isomorphisms of Linear Spaces

Definition 2.13. Let V and W be two F-linear spaces. An isomorphism between these linear spaces is a linear mapping h : V −→ W , which is a bijection. If an isomorphism exists between two F-linear spaces V and W , we say that these linear spaces are isomorphic and we write V ∼ = W. Two F-linear spaces that are isomorphic are indiscernible from an algebraic point of view. Theorem 2.12. Let V and W be two F-linear spaces and let h ∈ Hom(V, W ). Then Im(h) ∼ = (V /Ker(h)). Proof. Define the mapping g : V /Ker(h) −→ Im(h) by g([x]) = h(x) for x ∈ V. We show that g is an isomorphism. The mapping g is well-defined since if u ∈ [x], then u ∼Ker(h) x, so u − x ∈ Ker(h). Therefore, h(u − x) = 0W , hence h(u) = h(x). We leave it to the reader to verify that g is a linear mapping. Further, it is clear that g is surjective. To prove that g is injective, suppose that g([x]) = g([y]). This amounts to h(x) = h(y), which is equivalent to h(x − y) = 0W . Thus, x − y ∈ Ker(h), which implies [x] = [y]. In other words, g is injective and, therefore, is a bijection. This shows that the linear spaces Im(h) and V /Ker(h) are  isomorphic. Corollary 2.5. Let V and W be two F-linear spaces and let h : V −→ W be a surjective morphism of linear spaces. Then, W ∼ = V /Ker(h). Proof.

This is an immediate consequence of Theorem 2.12.




Theorem 2.13. Isomorphism is an equivalence relation between linear spaces.

Proof. It is clear that every linear space V is isomorphic to itself, which follows from the fact that the identity map 1_V is an isomorphism. Suppose now that the linear spaces V and W are isomorphic and let h : V −→ W be an isomorphism. It is easy to verify that the inverse mapping h^{-1} : W −→ V is a linear morphism and a bijection, so the existence of an isomorphism is symmetric. Finally, if h : V −→ W and g : W −→ U are isomorphisms, then gh is an isomorphism from V to U, so the existence of an isomorphism is transitive. Thus, isomorphism is an equivalence.

Theorem 2.14. If V, W are two finite-dimensional F-linear spaces and V ≅ W, then dim(V) = dim(W).

Proof. Suppose that dim(V) = n and that B = {x_1, . . . , x_n} is a basis for V. We claim that if f : V −→ W is an isomorphism, then B′ = {f(x_1), . . . , f(x_n)} is a basis for W. Let y ∈ W. Since f is a surjection, there exists x ∈ V such that y = f(x). Then, x = a_1 x_1 + · · · + a_n x_n for some a_1, . . . , a_n in F because B is a basis for V, hence y = f(x) = f(a_1 x_1 + · · · + a_n x_n) = a_1 f(x_1) + · · · + a_n f(x_n). This shows that B′ spans the space W. To prove that B′ is linearly independent, assume that 0_W = c_1 f(x_1) + · · · + c_n f(x_n) = f(c_1 x_1 + · · · + c_n x_n). Since f is injective, we have c_1 x_1 + · · · + c_n x_n = 0_V, which implies c_1 = · · · = c_n = 0. Thus, B′ is also linearly independent, hence B′ is a basis of W. We conclude that dim(V) = |B| = |B′| = dim(W).

Theorem 2.15. If V, W are two finite-dimensional F-linear spaces and dim(V) = dim(W), then V ≅ W.

Proof. To prove that V ≅ W, it suffices to show that any of these two spaces is isomorphic to F^n. Suppose that dim(V) = n and that B = {x_1, . . . , x_n} is a basis for V. Define the mapping h : V −→ F^n as
$$h(x) = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix},$$


where x = a1 x1 + · · · + an xn . Since the expression of x in the basis B is unique, this function is a well-defined injective morphism. The  function is also a surjection, so it is an isomorphism. Corollary 2.6. If V, W are two finite-dimensional F-linear spaces, then V ∼ = W if and only if dim(V ) = dim(W ). Proof.

This follows from Theorems 2.14 and 2.15.



Theorem 2.16. Let h : V −→ W be a linear mapping between the F-linear spaces V and W such that dim(V ) = dim(W ) = n. The following statements are equivalent: (i) h is surjective; (ii) h is an isomorphism; (iii) h is injective. Proof. (i) implies (ii): Suppose that h is surjective, that is, Im(h) = W . Since dim(Ker(h)) + dim(Im(h)) = n, it follows that dim(Ker(h)) = 0, so Ker(h) = {0V }. Therefore, by Theorem 2.23, h is an injection and, therefore, an isomorphism. (ii) implies (iii): This implication is immediate. (iii) implies (i): Suppose that h is injective. Then dim(Im(h)) = n,  so Im(h) = W , which means that h is a surjection. Theorem 2.17. Let V be a subspace of the linear space W . Then dim(W/V ) = dim(W ) − dim(V ). Proof. We apply Theorem 2.10 to the mapping hV : W −→ W/V. We noted that Ker(hV ) = V and Im(hV ) = W/V. Therefore, dim(W ) = dim(V ) + dim(W/V ), which yields our equality.  Let X be a non-empty set, F be the real or the complex field, and let FREEF (X) be a subset of FX that consists of the mappings f : X −→ F such that the set supp(f ) = {x ∈ X | f (x) = 0} is finite. Addition and scalar multiplication are defined as usual. If f, g ∈ FREEF (X), then supp(f + g) ⊆ supp(f ) ∪ supp(g) and  supp(f ) if a = 0, supp(af ) = {0} if a = 0. This turns FREEF (X) into an F-linear space. The zero element of FREEF (X) denoted by 0∗ is given by 0∗ (x) = 0 for x ∈ X.


Definition 2.14. Let X be a non-empty set. The F-free linear space over X is the linear space FREEF (X) constructed above. If the field F is clear from context, the subscript F is omitted. Example 2.15. The free R-linear space on the set {1, . . . , n} is isomorphic to Rn . For z ∈ X, let δz be the function defined by  1 if x = z, δz (x) = 0 otherwise. Since supp(δz ) = {z}, it is clear that δz ∈ FREE(X). The notation δz is known as Kronecker delta.1 A related notation is δuv : X × X −→ {0, 1}, which denotes a function given by  1 if u = v, δuv = 0 otherwise. a basis in FREE(X). Indeed, The set {δz | z ∈ X} is   if supp(f ) = {z1 , . . . , zn }, then f (z) = ni=1 f (zi )δzi (z), or f = ni=1 f (zi )δzi . nThe set {δz | z ∈ X} is linearly independent because, if j=1 aj δzj (x) = 0, then choosing x = zi we obtain aj = 0 for 1  j  n. Example 2.16. The free R-linear space FREE(N) consists of all sequences r0 n0 +r1 n1 +· · · , where the set {ri ∈ R | i ∈ N and ri = 0} is finite. The zero element of this space is 0n0 + 0n1 + · · · . Theorem 2.18 (Universal property of hX ). Let X be a set and let hX : X −→ FREE(X) be the mapping defined by h(x) = δx for x ∈ X. If L is an F-vector space and φ : X −→ L is a mapping, there exists a unique linear mapping ψ : FREE(X) −→ L such that the following diagram is commutative: 1

Leopold Kronecker (December 7, 1823 in Liegnitz–December 29, 1891 in Berlin) was a German mathematician who worked on number theory, algebra, and logic. He studied at the Universities of Bonn, Breslau, Berlin, where he defended his dissertation in algebraic number theory in 1845. He was elected a member of the Berlin Academy in 1861. There are numerous concepts in mathematics named after Kronecker: Kronecker delta, Kronecker product, and many others.


[Diagram: $h_X : X \to \mathrm{FREE}(X)$, $\phi : X \to L$, and $\psi : \mathrm{FREE}(X) \to L$, with $\phi = \psi h_X$.]

Proof. Define ψ as the function that maps the element δ_z of the basis of FREE(X) into φ(z). It is immediate that ψ is a linear mapping and that the diagram is commutative. To prove uniqueness, suppose that ψ_1 is another linear mapping such that φ = ψ h_X = ψ_1 h_X. Then for x ∈ X, we have φ(x) = ψ(h_X(x)) = ψ_1(h_X(x)), which implies φ(x) = ψ(δ_x) = ψ_1(δ_x). Since the set {δ_x | x ∈ X} is a basis in FREE(X), it follows that ψ = ψ_1.

Consider the set EQ(g, h) = {u ∈ V | g(u) = h(u)}.

If u, v ∈ EQ(g, h), then g(au+bv) = ag(u)+bg(v) = ah(u)+bh(u) = h(au + bv), so au + bv ∈ EQ(g, h) for every a, b ∈ F, which implies that EQ(g, h) is a subspace of V. Since X ⊆ EQ(g, h), it follows that  X ⊆ EQ(g, h), which yields the desired conclusion. We refer to EQ(g, h) as the equalizer of g and h. Theorem 2.20. Let V and W be two F-linear spaces. A morphism h ∈ Hom(V, W ) is injective if and only if h(x) = 0W implies x = 0V . Proof. Let h be a morphism such that h(x) = 0W implies x = 0V . If h(x) = h(y), by the linearity of h we have h(x − y) = 0W , which implies x − y = 0V , that is, x = y. Thus, h is injective. Conversely, suppose that h is injective. If x = 0V , then h(x) =  h(0V ) = 0W . Thus, h(x) = 0W implies x = 0V . An endomorphism of an F-linear space V is a morphism h : V −→ V. The set of endomorphisms of V is denoted by End(V ). Often, we refer to endomorphisms of V as linear operators on V.


Let F be a field and let V be an F-linear space. Define the mapping ha : V −→ V by ha (x) = ax for x ∈ V. It is easy to verify that ha is a linear operator on L. This mapping is known as a homothety on V. If a = 1, then h1 is given by h1 (x) = x for x ∈ V ; this is the identity morphism of V, which is usually denoted by 1V . For a = 0, we obtain the zero endomorphism of V denoted by 0V and given by 0V (x) = 0V for x ∈ V. Example 2.17. Let V be an F-linear space and let z ∈ V. The translation generated by z ∈ V is the mapping tz : V −→ V defined by tz (x) = x + z for x ∈ V. A translation is a bijection but not a morphism unless z = 0V . Its inverse is t−z . Definition 2.15. Let V be an F-linear space and let U and Z be two subsets of L. Define the subset U + Z of V as U + Z = {u + z | u ∈ U and z ∈ Z}. For a ∈ F, the set aU is aU = {au | u ∈ U }. Theorem 2.21. Let V, W, U be three F-linear spaces. The following properties of compositions of linear mappings hold: (i) If f ∈ Hom(V, W ) and g ∈ Hom(W, U ), then gf ∈ Hom(V, U ). (iii) If f ∈ Hom(V, W ) and g0 , g1 ∈ Hom(W, U ), then f (g0 + g1 ) = f g0 + f g1 . (iii) If f0 , f1 ∈ Hom(V, W ) and g ∈ Hom(W, U ), then (f0 + f1 )g = f0 g + f1 g. Proof. We prove only the second part of the theorem and leave the proofs of the remaining parts to the reader. Let x ∈ V. Then, f (g0 + g1 )(x) = f ((g0 + g1 )(x)) = f (g0 (x) + g1 (x)) = f (g0 (x)) + f (g1 (x)) for x ∈ V, which yields the desired  equality.


Corollary 2.7. Let V, W be two F-linear spaces. The algebra (Hom(V, W ), {h0 , +, −}) is an Abelian group that has the zero morphism 0V,W as its zero-ary operation and the addition of linear mappings as its binary operation; the opposite of a linear mapping h is the mapping −h. Moreover, (End(V ), {0V , 1V , +, −, ·}) is a unitary ring, where the multiplication is defined as the composition of linear mappings. Proof. The first part of the statement is a simple verification that the operations h0 , +, − satisfy the definition of Abelian groups. The second part of the corollary follows immediately from  Theorem 2.21. An endomorphism h of a field is idempotent if h2 = h, that is, if h(h(x)) = h(x) for every x ∈ M . Corollary 2.8. If h is an idempotent endomorphism of the field (End(V ), {h0 , 1V , +, −, ·}), then 1 − h is also an idempotent endomorphism. Proof. This statement is a direct consequence of Corollary 2.7 and  of Theorem 1.12. Definition 2.16. Let h be an endomorphism of linear space V. The mth iteration of h (for m ∈ N) is defined as (i) h0 = 1V ; (ii) hm+1 (x) = h(hm (x)) for m ∈ N. For every m  1, hm is an endomorphism of V ; this can be shown by a straightforward proof by induction on m. Example 2.18. Let F be a field and F[λ] be the set of polynomials in the indeterminate λ with coefficients in F. If p ∈ F[λ], we have p(λ) = p0 + p1 λ + · · · + pn λn . For an endomorphism of an F-linear space V, h : V −→ V define the function f = p(h) as f (x) = p0 + p1 h(x) + · · · + pn hn (x), for x ∈ V. Theorem 2.22. Let V be an F-linear space and let p ∈ F[λ] be a polynomial. If h : V −→ V is an endomorphism of V, then p(h) is an affine mapping on V.


Proof. Let p(λ) = p_0 + p_1 λ + · · · + p_n λ^n. Since h^m is an endomorphism of V for every m ≥ 1 and the sum of endomorphisms is an endomorphism, it follows that q(h) is an endomorphism, where q(λ) = p_1 λ + · · · + p_n λ^n. Thus, f = p(h) is an affine mapping because f(x) = p_0 + q(h)(x) for x ∈ V.

Note that if p(0) = 0 and h is an endomorphism, then p(h) is also an endomorphism.
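When V = R^n and the endomorphism h is given by a matrix H, a polynomial in h can be evaluated in MATLAB with polyvalm. Note that polyvalm uses the common convention that the constant term acts as p_0 · 1_V, so the result is an endomorphism rather than the affine mapping of Example 2.18; the polynomial and matrix below are our own choices:

H = [0 1; -1 2];                 % an endomorphism h of R^2
p = [3 1 2];                     % p(lambda) = 3*lambda^2 + lambda + 2, descending powers
P1 = polyvalm(p, H);             % p(h) evaluated in the matrix sense: 3*H^2 + H + 2*eye(2)
P2 = 3*H^2 + H + 2*eye(2);       % the same endomorphism computed directly
norm(P1 - P2)                    % 0

2.7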

Constructing Linear Spaces

This section is concerned with methods of constructing new linear spaces. Definition 2.17. Let S be a subspace of an F-linear space V. The equivalence generated by S is the relation ∼S on the set V defined by x ∼S y if x − y ∈ S. The quotient set V /S is the set of equivalence classes of ∼S , V /S = {[x] | x ∈ V }. It is easy to verify that ∼S is indeed an equivalence relation on V. The quotient set has a natural structure of linear space when we define the addition of classes as [x] + [y] = [x + y] and the scalar multiplication as a[x] = [ax] for a ∈ F and x, y ∈ V. These operations are well-defined. Indeed, if u ∈ [x] and v ∈ [y], then x − u ∈ S and y − v ∈ S. Therefore, x + y − (u + v) ∈ S, so [u + v] = [x + y]. Similarly, ax − au = a(x − u) ∈ S, so [au] = [ax]. We leave it to the reader to check that they satisfy the definition of a linear space. Definition 2.18. The quotient space of an F-linear space V by a subspace S is the F-linear space defined on the set V /∼S whose operations are defined as above. The quotient of V by S is denoted by V /S.


Observe that the surjective mapping hS : V −→ V /S, also known as the canonical morphism (or canonical surjection) of S defined by hS (x) = [x] for x ∈ V, is a linear mapping. We have Ker(hS ) = S and Im(hS ) = V /S. We saw that for a linear mapping h : V −→ W , the set Im(h) is a subspace of W . The quotient linear space W/Im(h) is referred to as the co-kernel of h and is denoted by Coker(h). If h is a surjective mapping, Im(h) = W and, therefore, Coker(h) = {[0W ]}. The co-kernel of a linear mapping h is the zero subspace if and only if h is surjective. Theorem 2.23. Let V, W, Z be two F-linear spaces. If h ∈ Hom(V, W ), then the following three statements are equivalent: (i) h is an injection; (ii) if f, g : Z −→ V are two linear mappings such that hf = hg, then f = g; (iii) Ker(h) = {0V }. Proof. (i) implies (ii): Let h be an injection and suppose that hf = hg, where f, g : Z −→ V are two linear operators. For z ∈ Z, we have h(f (z)) = h(g(z)), which yields f (z) = g(z) for z ∈ Z because h is injective. Thus, f = g. (ii) implies (iii): Let i : Ker(h) −→ V be the linear application defined by i(x) = x for x ∈ Ker(h). Let k : Ker(h) −→ W be the linear function defined by k(v) = 0W . Note that h(i(z)) = h(k(z)) = 0W . Thus, i(z) = k(z) = 0W , which implies z = 0V , so Ker(h) = {0V }. (iii) implies (i): Suppose that Ker(h) = {0V } and that h(x) = h(y). Therefore, h(x − y) = 0W , so x − y ∈ Ker(h) = {0V }, which means  that x = y, that is, h is an injection. A similar characterization exists for surjective linear mappings. Theorem 2.24. Let V, W be two F-linear spaces. If h ∈ Hom(V, W ), then the following statements are equivalent: (i) h is a surjection; (ii) if U is an F-linear space and f, g ∈ Hom(W, U ) are such that f h = gh, then f = g.


Proof. (i) implies (ii): Suppose that h is a surjective linear mapping. Then, for every y ∈ W there exists x such that y = h(x). Thus, we have f (y) = f (h(x)) = g(h(x)) = g(y) for every y ∈ W , so f = g. (ii) implies (i): If condition (ii) is satisfied, let U = Coker(h) and define f, g : W −→ Coker(h) as f (w) = [h(w)] in Coker(h) = W/Im(h) and g(w) = 0Coker(h) . Then f (h(v)) = g(h(v)) means that  [h(v)] = 0Coker(h) , hence Coker(h) = [0W ], hence h is surjective. Theorem 2.25. Let V, W, U be three linear spaces and let h : V −→ W and g : V −→ U be two linear mappings. If Ker(h) ⊆ Ker(g) and h is surjective, then there exists a unique k ∈ Hom(W, U ) such that g = kh. Proof. The statement of the theorem asserts the existence and uniqueness of the morphism k : W −→ U that makes the diagram h

: V −→ W, g : V −→ U, k : W −→ U

commutative. Note that if Ker(h) ⊆ Ker(g), then h(x) = 0W implies g(x) = 0U . Since h is surjective, if y ∈ W , there exists x ∈ V such that h(x) = y. Define the mapping k : W −→ U as k(y) = g(x), where h(x) = y. We verify first that k is well-defined. Indeed, suppose that x1 ∈ V is such that h(x1 ) = y. This implies h(x) = h(x1 ), which is equivalent to h(x − x1 ) = 0W . Therefore, g(x − x1 ) = 0U , which means that g(x1 ) = g(x), so k is well-defined. The mapping k is linear. Indeed, let a1 , a2 ∈ F and y 1 , y 2 ∈ W . Suppose that x1 , x2 ∈ V are such that h(xi ) = y i for i = 1, 2. Then h(a1 x1 + a2 x2 ) = a1 y 1 + a2 y 2 . This means that k(a1 y 1 + a2 y 2 ) = g(a1 x1 + a2 x2 ), so by the linearity of g, we have k(a1 y 1 + a2 y 2 ) = a1 g(x1 ) + a2 g(x2 ) = a1 k(y 1 ) + a2 k(y 2 ), which proves that k is linear. The definition of h implies that k(h(x)) = g(x) for every x ∈ V, so g = kh.


Suppose that k_1 is a morphism in Hom(W, U) such that g = k_1 h, so k_1 h = kh. Since h is a surjection, by Theorem 2.24, we obtain k_1 = k.

Theorem 2.26. Let h : V −→ W be a linear mapping between the F-linear spaces V and W and let S be a subspace of V. There exists a unique linear mapping $\bar{h} : V/S \longrightarrow W$ such that $\bar{h}([x]) = h(x)$ (where [x] = x + S for all x ∈ V) if and only if h(s) = 0_W for all s ∈ S.

[Diagram: $h : V \to W$, $h_S : V \to V/S$, and the induced mapping $\bar{h} : V/S \to W$ with $h = \bar{h}\, h_S$.]

If $\bar{h}$ exists and is unique, then for every linear mapping $\bar{g} : V/S \longrightarrow W$ there exists a unique linear transformation g : V −→ W that generates $\bar{g}$ in this manner.

Proof. Suppose that $\bar{h}$ exists such that $\bar{h}([x]) = h(x)$. Then, if s ∈ S, we have $h(s) = \bar{h}(s + S) = \bar{h}(0_{V/S}) = 0_W$. Conversely, suppose now that h(s) = 0_W for all s ∈ S. Define $\bar{h}$ as $\bar{h}([x]) = h(x)$. Note that $\bar{h}$ is well-defined for, if [x] = [y], we have x − y ∈ S, hence h(x − y) = 0_W, or h(x) = h(y). The final part of the theorem follows by noting that the equality $\bar{h}(x + S) = h(x)$ shows that either of the mappings h, $\bar{h}$ is determined by the other.

Let SUBSP(V) be the collection of subspaces of a linear space V. If this set is equipped with the inclusion relation ⊆ (which is a partial order), then for any two subspaces K, L both sup{K, L} and inf{K, L} exist and are given by
sup{K, L} = {x + y | x ∈ K and y ∈ L},

(2.1)

inf{K, L} = K ∩ L.

(2.2)

Let H = {x + y | x ∈ K and y ∈ L}. Observe that we have both K ⊆ H and L ⊆ H because 0 belongs to both K and L.


If u and v belong to H, then u = x1 + y 1 and v = x2 + y 2 , where x1 , x2 ∈ K and y 1 , y 2 ∈ L. Since x1 − x2 ∈ K and y 1 − y 2 ∈ L (because K and L are subspaces), it follows that u − v = x1 + y 1 − (x2 + y 2 ) = (x1 − x2 ) + (y 1 − y 2 ) ∈ H. We have au = ax1 + ax2 ∈ H because ax1 ∈ K and ax2 ∈ L. Thus, H is a subspace of V and is an upper bound of {K, L} in the partially ordered set (SUBSP(V ), ⊆). If G is a subspace of V that contains both K and L, then x+y ∈ G for x ∈ K and y ∈ L, so H ⊆ G. Thus, H = sup{K, L}. We denote H = sup{K, L} by K + L. The lattice of subspaces SUBSP(V ) of a linear space V is actually a complete lattice because the collection of subspaces of a linear space is a closure system. Next, we prove the modularity of SUBSP(V ). Theorem 2.27. Let V be an F-linear space. For any P, Q, R ∈ SUBSP(V ) such that Q ⊆ P, we have P ∩ (Q + R) = Q + (P ∩ R). Proof. Note that Q ⊆ P ∩(Q+R), P ∩R ⊆ P ∩(Q+R). Therefore, we have the inclusion Q + (P ∩ R) ⊆ P ∩ (Q + R) =, which leaves us with the reverse inclusion to prove. Let z ∈ P ∩ (Q + R). This implies z ∈ P and z = x + y, where x ∈ Q ⊆ P and y ∈ R. Therefore, y = z − x ∈ P , so y ∈ P ∩ R. Consequently, z ∈ Q + (P ∩ R), so P ∩ (Q + R) ⊆ Q + (P ∩ R).  Theorem 2.35 can now be reformulated as Corollary 2.9. Let K, L be two subspaces of a linear space V. We have dim(sup{K, L}) + dim(inf{K, L}) = dim(K) + dim(L). An immediate consequence of Corollary 2.9 is the following inequality valid for two linear subspaces K, L of a linear space V, namely: dim(K + L)  dim(K) + dim(L).

(2.3)

Theorem 2.28. Let V be an n-dimensional linear space and let W be a subspace of V. Then the space V /W is finite-dimensional and dim(V /W ) = dim(V ) − dim(W ).


Proof. Let {w 1 , . . . , w n } be a basis of W and let {w 1 , . . . , wn , v 1 , . . . , v k } be its extension to a basis for V, where dim(W ) = n and dim(V ) = n + k. The set {v 1 + W, . . . , v k + W } is a basis of V /W . Indeed, suppose that a1 (v 1 + W ) + · · · + ak (v k + W ) = W for a1 , . . . , ak ∈ F, where W is the zero element of V /W . This equality amounts to (a1 v 1 + · · · + ak v k ) + W = W, which implies a1 v 1 + · · · + ak v k ∈ W . Thus, we can write a1 v 1 + · · · + ak v k = b1 w1 + · · · + bn wn for some b1 , . . . , bn ∈ F. Since a1 v 1 + · · · + ak v k − b1 w1 − · · · − bn wn = 0V , and {w1 , . . . , w n , v 1 , . . . , v k } is a basis, it follows that a1 = · · · = ak = b1 = · · · = bn = 0. This shows that {v 1 + W, . . . , v k + W } is linearly independent. Let v + W ∈ V /W . Since the set {w 1 , . . . , wn , v 1 , . . . , v k } spans V, we can write v = a1 w 1 + · · · + an wn + b1 v 1 + · · · + bk v k for some a1 , . . . , an , b1 , . . . , bk ∈ F. Therefore, we can write v + W = (a1 w1 + · · · + an wn + b1 v1 + · · · + bk v k ) + W = (b1 v 1 + · · · + bk v k ) + W = b1 (v 1 + W ) + · · · + bk (v k + W ). This shows that the set {v 1 + W, . . . , v k + W } spans V /W , and we conclude that this set forms a basis of V /W . Therefore, dim(V /W ) =  k = dim(V ) − dim(W ).


Definition 2.19. Let V = {V_i | i ∈ I} be a family of F-linear spaces indexed by a set I. The direct product of V is the linear space $\prod_{i \in I} V_i$ that consists of all functions $f : I \longrightarrow \bigcup_{i \in I} V_i$ such that f(i) ∈ V_i for i ∈ I. The addition and scalar multiplication are defined by
$$(f + g)(i) = f(i) + g(i), \qquad (af)(i) = af(i)$$
for every $f, g \in \prod_{i \in I} V_i$ and a ∈ F.

The support of $f \in \prod_{i \in I} V_i$ is the set $\mathrm{supp}(f) = \{i \mid f(i) \neq 0_{V_i}\}$.

We have
$$\mathrm{supp}(f + g) \subseteq \mathrm{supp}(f) \cup \mathrm{supp}(g), \qquad \mathrm{supp}(af) \subseteq \mathrm{supp}(f)$$
for every $f, g \in \prod_{i \in I} V_i$ and a ∈ F.

The notion of direct sum of linear spaces has different meanings depending on the nature of the linear spaces involved.

Definition 2.20. Let V = {V_i | i ∈ I} be a family of F-linear spaces. The external direct sum of V is the F-linear space $\bigoplus_{i \in I} V_i$ that consists of all functions $f : I \longrightarrow \bigcup_{i \in I} V_i$ such that f(i) ∈ V_i for i ∈ I that have a finite support. The addition and scalar multiplication are defined exactly as in the case of the members of the direct product.

It is clear that the direct sum of a family of F-linear spaces is a subspace of their direct product. Also, for finite families of linear spaces, the direct product is identical to the direct sum.

If V_1, V_2 are subspaces of an F-linear space V, then their intersection is non-empty because 0_V ∈ V_1 ∩ V_2. Moreover, it is easy to see that V_1 ∩ V_2 is also a subspace of V. Let V_1, V_2 be two subspaces of a linear space V. Their internal sum, or simply their sum, is the subset V_1 + V_2 of V defined by V_1 + V_2 = {x + y | x ∈ V_1 and y ∈ V_2}. It is immediate to verify that V_1 + V_2 is a subspace of V and that 0_V ∈ V_1 ∩ V_2.
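For subspaces of R^n given by spanning columns, the dimensions of the sum and of the intersection can be obtained in MATLAB from ranks alone; the data below are an assumption chosen for illustration:

U = [1 0; 0 1; 0 0; 0 0];            % columns span V1, a subspace of R^4
W = [0 0; 1 0; 0 1; 0 0];            % columns span V2
dimU = rank(U);  dimW = rank(W);
dimSum = rank([U W]);                % dim(V1 + V2): the span of all columns together
dimInt = dimU + dimW - dimSum        % dimension of the intersection, using Corollary 2.9; here 1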


Theorem 2.29. Let V1 , V2 be two subspaces of the F-linear space V. If V1 ∩ V2 = {0V }, then any vector x ∈ V1 + V2 can be uniquely written as x = x1 + x2 , where x1 ∈ V1 and x2 ∈ V2 . Proof. By the definition of the sum V1 + V2 , it is clear that any vector x ∈ V1 + V2 can be written as x = x1 + x2 , where v 1 ∈ V1 and v 2 ∈ V2 . We need to prove only the uniqueness of x1 and x2 . Suppose that x = x1 + x2 = y 1 + y 2 , where x1 , y 1 ∈ V1 and x2 , y 2 ∈ V2 . This implies x1 − y 1 = y 2 − x2 and, since x1 − y 1 ∈ V1 and y 2 − x2 ∈ V2 , it follows that x1 − y 1 = y 2 − x2 = 0V by hypothesis. Therefore, x1 = y 1 and x2 = y 2 .  Theorem 2.30. Let V1 , V2 be two subspaces of the F-linear space V. If every vector x ∈ V1 + V2 can be uniquely written as x = x1 + x2 , then V1 ∩ V2 = 0V . Proof. Suppose that the uniqueness of the expression of x holds but z ∈ V1 ∩ V2 and z = 0V . If x = x1 + x2 , then we can also write x = (x1 + z) + (x2 − z), where x1 + z ∈ V1 and x2 − z ∈ V2 , x1 + z = x1 and x2 − z = x2 , and this contradicts the uniqueness  property. Corollary 2.10. Let V be an F-linear space and let V1 , V2 be two subspaces of V. The decomposition of a vector x ∈ V as x = x1 + x2 , where x1 ∈ V1 and x2 ∈ V2 , is unique if and only if V1 ∩ V2 = {0V }. Proof.

This statement follows from Theorems 2.29 and 2.30.



If V1 and V2 are subspaces of the F-linear space V and V1 ∩ V2 = {0V }, we refer to V1 + V2 as the direct sum of the subspaces V1 and V2 . The direct sum V1 + V2 is denoted by V1 ⊕ V2 . Theorem 2.31. Let V be an F-linear space and U0 be a subspace of V. There exists a subspace U1 of V such that V = U0 ⊕ U1 . Proof. Let {ei | i ∈ I} be a basis of the subspace U0 and let {ei | i ∈ I ∪ J} be its completion to a basis of the entire linear space V, where I ∩ J = ∅. If U1 is the subspace generated by {ei | i ∈ J},  then V = U0 + U1 .


Subspaces of linear spaces are related to idempotent endomorphisms as shown next. Theorem 2.32. Let V be an F-linear space. If h ∈ End(V ) is an idempotent endomorphism of V, then V = Ker(h) ⊕ Im(h). Proof. Observe that both Ker(h) and Im(h) are subspaces of V. Furthermore, suppose that x ∈ Ker(h) ∩ Im(h). Since x ∈ Im(h), we have x = h(y) for some y ∈ V. On the other hand, x ∈ Ker(h) means that h(x) = 0V . Thus, h(h(y)) = h(y) implies h(x) = x, which yields x = 0V . This allows us to conclude that Ker(h) ∩ Im(h) = {0V }. If z ∈ V, we have z = (z−h(z))+h(z). Observe that h(z) ∈ Im(h) and h(z − h(z)) = h(z)− h(h(z)) = 0V , so z − h(z) ∈ Ker(h) because h is idempotent. We conclude that V = Ker(h) ⊕ Im(h).  Theorem 2.33. Let V be an F-linear space. If U and W are two subspaces of V such that V = U ⊕ W, then there exists an idempotent endomorphism h of V such that U = Ker(h) and W = Im(h). Proof. Since V is a direct sum of U and W , each v ∈ V can be uniquely written as v = u + w, where u ∈ U and w ∈ W . Let π1 and π2 be the canonical projections and let h1 and h2 be the canonical injections of the subspaces U and W shown in what follows, where h1 (u) = u, h2 (w) = w and p1 (u + w) = u, p2 (u + w) = w for u ∈ U and w ∈ W . p2  p1 - V  U W h1 h2 Define the endomorphism g1 ∈ End(V ) as g1 = h1 p1 . Note that v ∈ Ker(g1 ), where v = u + w for u ∈ U and w ∈ W if and only if g1 (v) = 0V . This, in turn, is equivalent to h1 (p1 (v)) = h1 (u) = u = 0V , so v = w ∈ W . This shows that Ker(g1 ) = W . On the other hand, z ∈ Im(g1 ) means that z = g1 (v) = h1 p1 (u + v) = h1 (u) = u, so Im(g1 ) = U . It is immediate to verify  that g1 is idempotent. In Theorem 2.31, we saw that if U is a subspace of an F-linear space V, there exists another subspace W of V such that V = U ⊕W . By Theorem 2.33, there exists an idempotent endomorphism h of V such that U = Ker(h) and W = Im(h).
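Theorem 2.32 is easy to illustrate in MATLAB for an idempotent endomorphism of R^3 given by a matrix of our own choosing:

P = [1 0 1; 0 1 1; 0 0 0];           % P^2 = P, so the mapping x -> P*x is idempotent
norm(P*P - P)                        % 0 confirms idempotency
K = null(P);                         % basis of Ker(h)
R = orth(P);                         % basis of Im(h)
rank([K R])                          % 3 = dim(R^3): Ker(h) and Im(h) together span the whole space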

64

Linear Algebra Tools for Data Mining (Second Edition)

Definition 2.21. If V is an F-linear space and U, W are two subspaces of V such that V = U ⊕ W , then we say that U, W are complementary subspaces. Theorem 2.34. Let V be an F-linear space and let U, W be two subspaces of L. The following statements are equivalent: (i) U, W are complementary spaces; (ii) V = {u + w | u ∈ U and w ∈ W } and U ∩ W = {0}; (iii) If BU and BW are two bases of U and W, respectively, then BU ∪ BW is a basis of V and BU ∩ BW = ∅. Proof. (i) implies (ii): Suppose that V = U ⊕ W . Then, every v ∈ V can be uniquely written as a sum v = u + w with u ∈ U and w ∈ W , so V = {u + w | u ∈ U and w ∈ W }. If x = u + w, where u ∈ U and w ∈ W , and t ∈ U ∩ W , t = 0L , then we can also write x = (u − t) + (t + w), which contradicts the uniqueness of the representation of x. Thus, t = 0V . (ii) implies (iii): Since U ∩ W = {0V }, we have BU ∩ BW = ∅ because 0V does not belong to either BU or BW . Every v ∈  V can be written as v = u+w, where u ∈ U and w ∈ W . Thus, u = {ai ui | ai ∈ F and ui ∈ U, i ∈ I}, w = {bi wi | bi ∈ F and w i ∈ U, i ∈ I}, so v = u + w can be written as a linear combination of BU ∪ BW . (iii) implies (i): Suppose now that BU and BW are two bases of subspaces U and W , respectively, such that BU ∪ BW is a basis of V and BU ∩ BW = ∅. Since BU ∪ BW is a basis of V, it is clear that every v ∈ V is a sum of the form u + w, where u ∈ U and w ∈ W . Suppose that t ∈ U ∩ W and t = 0L . Then, t can be expressed both as a linear combination of BU and as a linear combination of BW , u =   a u = b wj and not all ai and not all bj equal 0. This i i∈I i  j∈J j implies i∈I ai ui − j∈J bj wj = 0V , which contradicts the linear  independence of the set BU ∪ BW . A generalization of Theorem 2.34 is given next. Theorem 2.35. Let V be an F-linear space and let U, W be two subspaces of V. The set T = {t ∈ V | t = u + w, u ∈ U, w ∈ W } is a subspace of V and dim(T ) = dim(U ) + dim(W ) − dim(U ∩ W ).

Linear Spaces

65

Proof. It is straightforward to verify that T is indeed a subspace of V. Suppose that BZ = {z 1 , . . . , z k } is a basis for the subspace Z = U ∩ W . By the Extension corollary (Corollary 2.3), B can be extended to a basis for U , BU = BZ ∪ {u1 , . . . , up } and to a basis for W , BW = BZ ∪ {w1 , . . . , w q }. It is clear that B = BU ∪ BW generates the subspace T . The set B is linear independent. Indeed, suppose that a1 z 1 + · · · + ak z k + b1 u1 + · · · + bp up + c1 w1 + · · · + cq wq = 0V , for some a1 , . . . , ak , b1 , . . . , bp , c1 , . . . , cq ∈ F, so c1 w1 + · · · + cq w q = − (a1 z 1 + · · · + ak z k + b1 u1 + · · · + bp up ) ∈ U and, of course, c1 w1 + · · · + cq wq ∈ W . Thus, c1 w1 + · · · + cq wq ∈ U ∩ W , so c1 w1 + · · · + cq wq = d1 z 1 + · · · + dk z k , which implies c1 = · · · = cq = d1 = · · · = dk = 0 because BW is a basis for W . Using this fact, we obtain a1 z 1 + · · · + ak z k + b1 u1 + · · · + bp up = 0V . Since BU is a basis, this implies a1 = · · · = ak = b1 = · · · = bp = 0. We conclude that BU ∪ BW is indeed a basis for T and dim(T ) = |B| = |BU | + |BV | − |BU ∩ BV | = dim(U ) + dim(W ) − dim(U ∩ W ).



Corollary 2.11. Let U, W be two subspaces of an F-linear space V with dim(V ) = n. We have dim(U ) + dim(W ) − dim(U ∩ W )  n. Proof. The inequality follows from Theorem 2.35 by observing  that dim(T )  dim(L) = n. Theorem 2.36. An F-linear space V is a direct  sum of the family of subspaces {Vi | i ∈ I} if and only if V = i∈I Vi and for each i ∈ I we have  Vj = {0L }. (2.4) Vi ∩ j∈I−{i}

66

Linear Algebra Tools for Data Mining (Second Edition)

Proof. Suppose that V is the direct sum of the  family of its subspaces {V | i ∈ I}. It is clear that V = i i∈I Vi . If z ∈  Vi ∩ j∈I−{i} Vj , then z = xi for some xi ∈ Vi and z = xj1 + · · · xj , where jp = i for 1  p  . By the uniqueness of direct sum representation, we have xi = 0V , and z  = 0V , hence Equality (2.4) holds. Conversely, suppose that V = i∈I Vi and Equality (2.4) holds. Suppose that a vector x ∈ V can be written as x = xj 1 + · · · + xj n and as x = tk1 + · · · + tkm , where xjp ∈ Vjp for 1  p  n and tkq ∈ Vkq for 1  q  m. Without loss of generality we may assume that n = m (by adding to the expressions of x an appropriate number of zero terms). Thus, we can assume that x = x i1 + · · · + x in = t i1 + · · · + t in , where xi and ti both belong to the subspace Vi for 1    n. Therefore, xi − ti ∈ Vi for 1    n. Since (xi1 − ti1 ) + · · · + (xin − tin ) = 0V , if follows that each difference xir − tir is a sum of vectors from other subspaces. This is possible if and only if xir − tir = 0V , hence, xir =  tir for 1  r  n. If {xi | i ∈ I} and {y j | j ∈ J} are bases in the linear spaces V and W , respectively, then the set {(xi , 0W ) | i ∈ I} ∪ {(0V , y j ) | j ∈ J} is a basis in V + W , as the reader can easily verify. If both V and W are of finite type, then so is V + W and dim(V + W ) = dim(V ) + dim(W ).

Linear Spaces

67

The direct sum of two linear spaces allows us to introduce four linear mappings: the injections i1 : V −→ V + W , i2 : W −→ V + W , and the projections p1 : V + W −→ V and p2 : V + W −→ W given by i1 (x) = (x, 0W ), i2 (y) = (0V , y), p2 (x, y) = y, p1 (x, y) = x, for x ∈ V and y ∈ W . It is immediate to verify that i1 , i2 are injective and p1 , p2 are surjective mappings. The definitions of i1 , i2 , p1 , and p2 imply immediately p1 i1 = 1V p1 i2 = 0W

p2 i2 = 1W , p2 i1 = 0V .

Additionally, we have (i1 p1 + i2 p2 )(x, y) = i1 p1 (x, y) + i2 p2 (x, y) = (x, 0W ) + (0V , y) = (x, y), for every (x, y) ∈ V × W . Theorem 2.37. Let V, W, U be three linear vector spaces for which there exist the linear mappings h1 : V −→ U , h2 : W −→ U and g1 : U −→ V, g2 : U −→ W such that g1 h1 = 1V , g1 h2 = 0W,V ,

g2 h2 = 1W , g2 h1 = 0V,W ,

and h1 g1 + h2 g2 = 1U . Then, there exists an isomorphism h : V + W −→ U such that h1 = hi1 h2 = hi2

g1 = p1 h−1 , g2 = p2 h−1 .

Proof. The linear mappings mentioned above are shown in the following commutative diagram.

68

Linear Algebra Tools for Data Mining (Second Edition)

V h1

p1 g1

i1 h

V +W

U g2

i2

h2

p2 W

Define the linear mappings k : U −→ V + W as k(z) = (g1 (z), g2 (z)) for z ∈ U and  : V + W −→ U as (x, y) = h1 (x) + h2 (y) for x ∈ V and y ∈ W . Note that (k(z)) = (g1 (z), g2 (z)) = h1 (g1 (z)) + h2 (g2 (z)) = (h1 g1 + h2 g2 )(z) = z, and k((x, y)) = k(h1 (x) + h2 (y)) = k(h1 (x)) + k(h2 (y)) = (g1 (h1 (x)), g2 (h1 (x))) + (g1 (h2 (y)), g2 (h2 (y))) = (x, 0L+M ) + (0M +L , y) = (x, y). This shows that the linear mappings  and k are inverse isomor phisms. 2.8

Dual Linear Spaces

Let V be an F-linear space. The set of linear forms defined on V is denoted by V ∗ . This set has the natural structure of an F-linear space known as the dual of the space V . The elements of V ∗ are also referred to as covariant vectors or covectors. We will refer to the vectors of the original linear space as contravariant vectors. The reason for adopting the designation of

Linear Spaces

69

“covariant” and “contravariant” terms for the vectors of V ∗ and V, respectively, will be discussed in Section 3.7. Definition 2.22. Let B = {ui ∈ V | 1  i  n} be a basis in an n-dimensional F-linear space V, and let f ∈ V ∗ be a covector. The numbers ai = f (ui ), where 1  i  n are the components of the covector f relative to the basis B of V. Theorem 2.38. Let B = {ui ∈ L | 1  i  n} be a basis in an n-dimensional F-linear space V. If {ai ∈ F | 1  i  n} is a set of scalars, then there is a unique covector f ∈ V ∗ such that f (ui ) = ai for 1  i  n.  Proof. Since B is a basis in V, we can write v = ni=1 ci ui for every v ∈ V. Thus,

f (v) = f

n  i=1

 ci ui

=

n 

ci ai ,

i=1

which shows that the f is uniquely determined by the n⎞ ⎛ covector a1 ⎜ ⎟ tuple of scalars a = ⎝ ... ⎠.  an ˜ = {˜ Let B = {ui ∈ V | 1  i  n} B u i ∈ V | 1  i  n} be two bases in the linear space V, where u˜i = nj=1 cij uj for 1  i  n. ˜ can be written The components of a covector f ∈ V ∗ in the basis B as ⎛ ⎞ n n   cij uj ⎠ = cij f (uj ). f (˜ ui ) = f ⎝ j=1

j=1

This equality shows that the components of a covector transform in the same manner as the basis. This justifies the use of the term covariant applied to these components. Theorem 2.39. Let V be an n-dimensional F-linear space. Then, its dual V ∗ is isomorphic to Fn , and, thus, dim(V ∗ ) = dim(V ) = n.

70

Proof.

Linear Algebra Tools for Data Mining (Second Edition)

The function h : Fn −→ Hom(V, F) that maps the vector ⎛ ⎞ a1 ⎜.⎟ ⎟ a=⎜ ⎝ .. ⎠ an

to the function f defined as in Theorem 2.38 is an isomorphism, as  it can be checked easily. Theorem 2.38 shows that a linear form f ∈ V ∗ is uniquely determined by its values on the basis of the space V. This allows us to prove the following extension theorem. Theorem 2.40. Let U be a subspace of a finite-dimensional F-linear space V. A linear function g : U −→ F belongs to U ∗ if and only if there exists a linear form f ∈ V ∗ such that g is the restriction of f to U . Proof. If g is the restriction of f to U , then it is immediate that g ∈ U ∗ . Conversely, let g ∈ U ∗ and let B = {u1 , . . . , up } be a basis of U , where dim(U ) = p. Consider an extension of B to a basis of the entire space B1 = {u1 , . . . , up , up+1 , . . . , un }, where n = dim(L) and define the linear form f : V −→ F by  g(ui ) if i  p, f (ui ) = 0 if p + 1  i  n. Since f and g coincide for all members of the basis of U , it follows  that g is the restriction of f to U . We refer to f as the zero-extension of the linear form g defined on the subspace U . Theorem 2.41. If B = {v 1 , . . . , v n } is a basis of the F-linear space V, then the set of linear forms F = {f j | 1  j  n} defined by  1 if i = j, j f (v i ) = 0 otherwise is a basis of the dual linear space V ∗ .

Linear Spaces

71

Proof. The set F = {f 1 , . . . , f n } spans the entire dual space V ∗ . ∗ Indeed, n let i f ∈ V be defined by f (v i ) = ai for 1  i  n. If v = i=1 c v i , then 

n n n    ci v i = ci f (v i ) = ci ai . f (v) = f i=1

i=1

i=1

On the other hand, n 

i

ai f (v) =

n 

i=1

j

i

ai c f (v j ) =

i=1

n 

ai ci ,

i=1

due to the definition of the linear forms f 1 , . . . , f n . Therefore, f = a1 f 1 + · · · + an f n , which shows that F  = L∗ . To prove that the set F is linearly independent in V ∗ , suppose that a1 f 1 + · · · + an f n = 0V ∗ . This implies a1 f 1 (v) + · · · + an f n (v) = 0V for every v ∈ V. Choosing v = v j , we obtain aj f j (v j ) = 0, hence aj = 0, and this can be shown for 1  j  n, which implies the linear  independence. The basis F = {f 1 , . . . , f n } of V ∗ constructed in Theorem 2.41 is the dual basis of the basis B = {v 1 , . . . , v n } of V. We refer to the pair (B, F ) as a pair of dual bases. In general, we will index vectors in a linear space V using subscripts and covectors in the dual linear space using superscripts. Corollary 2.12. The dual of an n-dimensional F-linear space V is an n-dimensional linear space. This statement follows immediately from Theorem 2.41. 

Proof.

Let E = {e1 , . . . , en } be a basis of V and let F = {f 1 , . . . , f n } be its dual basis in V ∗ . If x = a1 e1 + · · · + an en , then, we have i

i

f (x) = f (a1 e1 + · · · + an en ) =

n 

aj f i (ej ) = ai

(2.5)

j=1

for 1  i  n. Thus, we have x=

n  i=1

f i (x)ei .

(2.6)

72

Linear Algebra Tools for Data Mining (Second Edition)

Similarly, if f ∈ V ∗ and f = b1 f 1 + · · · + bn f n , then (f , ei ) = bi , and f=

n  (f , ei )f i .

(2.7)

i=1

Example 2.19. Let P2 [x] be the linear space of polynomials of degree 2 in x, that consists of polynomials of the form p(x) = ax2 + bx + c. The set {p0 , p1 , p2 } given by p0 (x) = 1, p1 (x) = x and p2 (x) = x2 is a basis in P2 [x]. Note that, we have c = p(0), 1 b = (p(1) − p(−1)), 2 1 a = (p(1) + p(−1) − 2p(0)). 2 If f : P2 [x] −→ R is a linear form, we have f (p) = af (x2 ) + bf (x) + cf (0) 1 = (p(1) + p(−1) − 2p(0))f (x2 ) 2 1 + (p(1) − p(−1))f (x) + p(0)f (1). 2 Therefore, a basis in P2 [x]∗ consists of the functions f 0 (p) = p(0), 1 f 1 (p) = (p(1) − p(−1)), 2 1 f 2 (p) = (p(1) + p(−1) − 2p(0)). 2 Theorem 2.42. Let V be a finite-dimensional F-linear space and let v ∈ V − {0V }. There exists a linear form f v in V ∗ such that f v (v) = 1. Proof. Since v = 0V , there is a basis B in V that includes v. Then  f v can be defined as the linear form that corresponds to v.

Linear Spaces

73

We saw that the dual V ∗ of an F-linear space V is an F-linear space. The construction of the dual may be repeated, and V ∗∗ , the dual of the dual F-linear space V ∗ is an F-linear space. In the case of finite-dimensional linear spaces, we have dim(V ∗∗ ) = dim(V ∗ ) = dim(V ), and all these spaces are isomorphic. Theorem 2.43. Let V be a finite-dimensional F-linear space. Then, the dual V ∗∗ of the dual V ∗ of L is an F-linear space isomorphic to V. Proof. The space V ∗∗ consists of linear functions of the form φ : V ∗ −→ F that map linear forms f ∈ V ∗ into scalars in F . Let Ψ : V −→ V ∗∗ be the mapping given by Ψ(v) = φ, where φ is the linear form defined by φ(f ) = f (v) for all v ∈ V and f ∈ V ∗ . The mapping Ψ is injective. Indeed, suppose that Ψ(u) = Ψ(v) = φ for u, v ∈ V. This implies φ(f ) = f (u) = f (v) for every f ∈ V ∗ , which, in turn, implies u = v. Suppose that Ψ(v) = 0V ∗∗ . Then, f (v) = 0 for every f , which implies v = 0V (by taking f = 1V ). Thus, by Theorem 2.16,  Ker(Ψ) = {0V }, which implies that Ψ is an isomorphism. Theorem 2.43 allows us to identify V ∗∗ with V when necessary. Definition 2.23. Let V be an F-linear space let U be a subset of V. The annihilator of U is the subset Ann(U ) of the dual space V ∗ given by Ann(U ) = {f ∈ V ∗ | f (x) = 0 for every x ∈ U } = {f ∈ V ∗ | U ⊆ Ker(f )}. The annihilator of a subset T of the dual space V ∗ of a linear space V is the set Ann(T ) = {v ∈ V | f (v) = 0 for every f ∈ T }  = {Ker(f ) | f ∈ T }. Theorem 2.44. Let V be an F-linear space. For any subsets U ⊆ V and T ⊆ V ∗ , Ann(U ) and Ann(T ) are subspaces of V ∗ and V, respectively.

74

Linear Algebra Tools for Data Mining (Second Edition)

Proof. Let f , f˜ ∈ Ann(U ) and let a, b ∈ F. We have (af +bf˜ )(x) = af (x) + bf˜ (x) = 0 for every x ∈ U , so af + bf˜ ∈ Ann(U ). The argument for Ann(T ) is similar.  Theorem 2.45. If V is a finite-dimensional F-linear space, let U ⊆ V and T ⊆ V ∗ be subspaces of V and V ∗ , respectively. The dual space U ∗ is isomorphic to the quotient space V ∗ /Ann(U ) and the dual space T ∗ is isomorphic to the quotient space V ∗∗ /Ann(T ). The quotient linear spaces (V /U )∗ and (V ∗ /T )∗ are isomorphic to the subspaces Ann(U ) and Ann(T ), respectively. Proof. Recall that the zero-extension of a linear form defined on a subspace U of V was introduced in Theorem 2.40. Let Φ : V ∗ −→ U ∗ , where Φ(f ) = g if f is the zero-extension of g for g ∈ U ∗ . It is easy to verify that Φ is a surjective linear mapping. Furthermore, Ker(Φ) consists of those linear forms f in V ∗ such that Φ(f ) is the zero linear form on U , so f (x) = 0 for every x ∈ U . Consequently, Ker(Φ) = Ann(U ). By Corollary 2.5, the quotient space V ∗ /Ann(U ) is isomorphic to U ∗ . Proving that T ∗ is isomorphic to V ∗∗ /Ann(T ) is left to the reader. For the second part of the theorem, define a mapping Ψ : (V /U )∗ −→ Ann(U ) as follows. Consider a linear form φ ∈ Hom(V /U, F ) and the canonical mapping hU : V −→ V /U , and define Ψ(φ) = φhU . Observe that φ(hU (x)) = 0 for x ∈ U , so φhU ∈ Ann(U ). It is immediate that Ψ is a linear mapping. If g ∈ Ann(U ), this means that Ker(hU ) = U ⊆ Ker(g), so by Theorem 2.25, since hU is surjective, there exists a unique factorization g = khU , so k = φ and g = Ψ(φ). Thus, Ψ is an isomorphism. We leave to the reader the proof of the isomorphism between  (V ∗ /T )∗ and Ann(T ). Definition 2.24. Let V, W be two F-linear spaces and let h ∈ Hom(V, W ) be a linear mapping. The dual of h is the morphism h∗ ∈ Hom(W ∗ , V ∗ ) defined by h∗ (f ) = f h, shown in the commutative diagram that follows. h V W f ∈V∗ h∗ (f ) = f h ∈ U ∗ ~

?

F

Linear Spaces

75

To verify the correctness of Definition 2.24, we need to show that h∗ is indeed a morphism between the linear spaces W ∗ and V ∗ . Let f , g ∈ W ∗ and let a, b ∈ F. The linear form φ = af + bg is defined by φ(v) = af (v) + bg(v) for v ∈ W . Then, h∗ (φ) = h∗ (af + bg) = (af + bg)h. Since (af + bg)h(u) = af (h(u)) + b(g(h(u) = a(h∗ f )(u) + b(h∗ g)(u) for u ∈ U , it follows that h∗ (af + bg) = ah∗ f + bh∗ g, which proves that h∗ is a morphism. Theorem 2.46. If V, W are two F-linear spaces, the mapping Φ : Hom(V, W ) −→ Hom(W ∗ , V ∗ ) defined by Φ(h) = h∗ for h ∈ Hom(V, W ) is linear. Proof. Let h0 , h1 ∈ Hom(V, W ) and let a, b ∈ F. We need to prove that (ah0 + bh1 )∗ = ah∗0 + bh∗1 . Let f : W −→ F be a linear form. By the definition of dual morphisms, we have (ah0 + bh1 )∗ (f ) = f (ah0 + bh1 ). Therefore, for every v ∈ W , we have (ah0 + bh1 )∗ (f )(v) = f (ah0 (v) + bh1 (v)) = af h0 (v) + bf h1 (v) = ah∗0 (f )(v) + bh∗1 (v), which proves that Φ is a linear mapping.



We leave it to the reader to verify that the dual of 1V is 1V ∗ and that for any two linear mappings h0 ∈ Hom(U, V ) and h1 ∈ Hom(V, W ), we have (h1 h0 )∗ = h∗1 h∗0 . Theorem 2.47. Let V be a finite-dimensional F-linear space and let Z be a subspace of V. Then dim(Ann(Z)) = dim(V ) − dim(Z) and Ann(Ann(Z)) = Z. Proof. If dim(V ) = n and Z is a subspace of V, then dim(V /Z) = n − dim(Z) by Theorem 2.17. Since (V /Z)∗ ∼ = Ann(Z), it follows that dim(Ann(Z)) = dim(V /Z)∗ = n − dim(Z). Thus, dim(V ) = dim(Z) + dim(Ann(Z)).

76

Linear Algebra Tools for Data Mining (Second Edition)

For the second part of the theorem, observe that by Definition 2.23 we have  Ann(Ann(Z)) = {Ker(f ) | f ∈ Ann(Z)}  = {Ker(f ) | Z ⊆ Ker(f )}, which implies Z ⊆ Ann(Ann(Z)). Note that both Z and Ann(Ann(Z)) are subspaces of V. The first part of the theorem implies dim(Ann(Ann(Z))) = dim(V ) − dim(Ann(Z)). On the other hand, we saw that dim(Z) + dim(Ann(Z)) = dim(V ), so dim(Z) = dim(Ann(Ann(Z)). This  implies Ann(Ann(Z)) = Z. Theorem 2.48. Let V and W be two finite-dimensional F-linear spaces. If h ∈ Hom(V, W ), then rank(h) = rank(h∗ ). Proof. Note that if f ∈ Ker(h∗ ) is equivalent to h∗ f (u) = 0, it is equivalent to having f (h(u)) = 0 for every u ∈ V, that is, with f (Im(h)) = 0. Thus, Ker(h∗ ) = Ann(Im(h)), so dim(Ker(h∗ )) = dim(Ann(Im(h))). Since dim(W ) = dim(Ker(h∗ )) + dim(Im(h∗ )) = dim(Ann(Im(h)) + dim(Im(h∗ )), it follows that dim(Im(h∗ )) = dim(W ) − dim(Ann(Im(h)) =  dim(Im(h)), by Theorem 2.47. Thus, rank(h∗ ) = rank(h). 2.9

Topological Linear Spaces

We are examining now the interaction between the algebraic structure of linear spaces and topologies that can be defined on linear spaces that are compatible in a certain sense with the algebraic structure. Compatibility, in this case, is defined as the continuity of addition and scalar multiplication. Definition 2.25. Let F be the real field R or complex field C. An F-topological linear space is a topological space (V, O) such that

Linear Spaces

77

(i) V is an F-linear space; (ii) the vector addition is a continuous function between V 2 and V; (iii) the scalar multiplication is a continuous function between F × V and V. Unless stated otherwise, we assume that the field F is either the real or the complex field. Theorem 2.49. Let (V, O) be an F-topological linear space and let z ∈ V. The translation mapping tz : V −→ V is a homeomorphism. Proof. It is immediate that tz is a bijection whose inverse is t−z . The continuity of both tz and t−z follows from the continuity of the  vector addition of V. Definition 2.26. The mapping tz introduced in Theorem 2.49 is the translation generated by z. Example 2.20. If a = 0, then each homothety ha of a topological linear space (V, O) is a homeomorphism. Indeed, the inverse of ha is ha−1 . The continuity of both ha and ha−1 follows from the continuity of the scalar multiplication of V. Theorem 2.50. Let (V, O) be a topological linear space. If W is a neighborhood of 0V , then tx (W ) is a neighborhood of x. Moreover, every neighborhood of x can be obtained by a translation of a neighborhood of 0. Proof. Since W is a neighborhood of the origin, there exists an open subset L of V such that 0 ∈ L ⊆ W . This implies x = tx (0) ∈ tx (L) ⊆ tx (W ). It follows that tx (L) is an open set and this, in turn, implies that tx (W ) is a neighborhood of x. Conversely, let U be a neighborhood of x and let K be an open set such that x ∈ K ⊆ U . Then, we have 0 = t−x (x) ∈ t−x (K) ⊆ t−x (U ). Since t−x (K) is an open set, it follows that t−x (K) is a neighborhood of 0 and the desired conclusion follows from the fact  that U = tx (t−x (U )). Theorem 2.50 shows that in a topological linear space (V, O) the neighborhoods of any point are obtained by translating the neighborhoods of 0V .

78

Linear Algebra Tools for Data Mining (Second Edition)

Corollary 2.13. If Fx is a fundamental system of neighborhoods of x in the topological linear space (V, O), then Fx can be obtained by a translation of a fundamental system of neighborhoods F0 of 0V . Proof.

This statement follows immediately from Theorem 2.50. 

The next theorem shows that a linear function between two topological linear spaces is continuous if and only if it is continuous in the zero element of the first space. Theorem 2.51. Let (V1 , O1 ) and (V2 , O2 ) be two topological F-linear spaces having 01 and 02 as zero elements, respectively. A linear operator f ∈ Hom(V1 , V2 ) is continuous in x ∈ V1 if and only if it is continuous in 01 ∈ V1 . Proof. Let f be a function that is continuous in a point x ∈ V1 . If U ∈ neigh02 (O2 ), then f (x) + U is a neighborhood of f (x). Since f is continuous, there exists a neighborhood W of x such that f (W ) ⊆ f (x) + U . Observe that the set −x + W is a neighborhood of 01 . Moreover, any neighborhood of 01 has this form. If t ∈ −x+W , then t+x ∈ W and, therefore, f (t) + f (x) = f (t + x) ∈ f (x) + U . This shows that f (t) ∈ U , which proves that f is continuous in 01 . Conversely, suppose that f is continuous in 01 . Let x ∈ V1 and let Z ∈ neighf (x) (O2 ). The set −f (x) + Z is a neighborhood of 02 in V2 . The continuity of f in 01 implies the existence of a neighborhood T of 01 such that f (T ) ⊆ −f (x)+Z. Note that x+T is a neighborhood of x in V1 and every neighborhood of x in V1 has this form. Since  f (x + T ) ⊆ Z, it follows that f is continuous in x. Corollary 2.14. Let (V1 , O1 ) and (V2 , O2 ) be two topological F-linear spaces. A linear operator f ∈ Hom(V1 , V2 ) is either continuous on V1 or is discontinuous in every point of V1 . Proof.

This statement is a direct consequence of Theorem 2.51. 

Theorem 2.52. Let C and D be two subsets of Rn such that C is compact and D is closed. Then the set C + D = {x + y | x ∈ C, y ∈ D} is closed.

Linear Spaces

79

Proof. Let x ∈ K(C + D). There exists a sequence (x0 , x1 , . . .) such that xi ∈ C + D and limn→∞ xn = x. The definition of C + D means that there is a sequence (u0 , u1 , . . .) ∈ Seq∞ (C) and a sequence (v 0 , v 1 , . . .) ∈ Seq∞ (D) such that xi = ui + vi for i ∈ N. Since C is compact, the sequence (v 0 , v 1 , . . .) contains a convergent subsequence (ui0 , ui1 , . . .). Let u = limm→∞ uim . Clearly, limm→∞ xim = x. Since D is a closed set, limm→∞ v im = x − u ∈ D. Therefore, x = u + v ∈ C + D, so K(C + D) = C + D, which means that C + D is closed.  2.10

Isomorphism Theorems

Theorem 2.53 (Noether’s first isomorphism theorem). Let V be an F-linear space, and let K and L be two subspaces of V such that K ⊆ L. Then L/K is a linear subspace of M/K and M/L ∼ = (M/K)/(L/K). Proof. Consider the canonical linear mappings hK : V −→ V /K and hL : V −→ V /L for which we have Ker(hK ) = K ⊆ L = Ker(hL ). Since K ⊆ L, by Theorem 2.25, there exists a linear mapping e : V /K −→ V /L such that hL = ehK , that is e(hK (x)) = hL (x), which means that e([x]K ) = [x]L for every x ∈ V. We denoted the ∼K -class of x by [x]K and the ∼L -class of the same element by [x]L . Since Ker(e) = L/K and Im(e) = V /L, we obtain that V /L is isomorphic  to (V /K)/(L/K) by Theorem 2.12. Theorem 2.54 (Noether’s second isomorphism theorem). Let V be an F-linear space, and let K and L be two subspaces of V. Then (L + K)/L ∼ = H/(L ∩ K). Proof. Consider the linear mapping h : K −→ K + L defined by h(x) = x for every x ∈ K and the canonical linear mapping g : K + L −→ (K + L)/L. Define the linear mapping e = gh. We have Ker(e) = K ∩ L and Im(e) = (K + L)/L, so by Theorem 2.12,  the subspaces (L + K)/L and H/(L ∩ K) are isomorphic.

80

2.11

Linear Algebra Tools for Data Mining (Second Edition)

Multilinear Functions

The notion of linear mapping can be extended as follows. Definition 2.27. Let V1 , . . . , Vn , U be real linear spaces. A multilinear function is a mapping f : V1 × · · · × Vn −→ U that is linear in each of its arguments when the other arguments are held fixed. In other words, f satisfies the following conditions: ⎞ ⎛ k  aj xji , xi+1 , . . . , xn ⎠ f ⎝x1 , . . . , xi−1 , j=1

=

k 

aj f (x1 , . . . , xi−1 , xji , xi+1 , . . . , xn ),

j=1

for every xi , xji ∈ Vi and a1 , . . . , ak ∈ R. If V1 = · · · = Vn = V and U = R, the mapping f is called an n-linear form on V or an n-tensor on V. The set of n-tensors on V is denoted as Tn (V ). When n = 2, V1 = V2 = Rm , and U = R, we refer to f as a bilinear form on Rm . A bilinear form f : Rm × Rm −→ R is skew-symmetric if f (x, y) = −f (y, x) for x, y ∈ Rm . More generally, a tensor t over the linear space V is a multilinear function ∗ · · × V ∗ × V · · × V −→ R. t:V  × ·  × · p

q

In this case, we say t is covariant of order p and contravariant of order q. The set of real multilinear functions defined on the linear spaces V1 , . . . , Vn and ranging in the real linear space V is denoted by M(V1 , . . . , Vn ; V ). The set of real multilinear forms is M(V1 , . . . , Vn ; R). Multilinear functions are defined on the Cartesian product of the sets V1 , . . . , Vn , not on the direct product of the linear spaces V1 , . . . , Vn . Example 2.21. Let V be a linear space and let V ∗ be its dual. Consider the following four pair of vectors:

Linear Spaces

81

(i) u in V and v ∈ V ; (ii) u in V and g ∈ V ∗ ; (iii) f in V ∗ and v ∈ V ; (iv) f in V ∗ and g ∈ V ∗ . Each of these pairs generates a bilinear function as follows: (i) φu,v : V ∗ × V ∗ −→ R given by φu,v (h, l) = h(u)l(v); (ii) φu,g : V ∗ × V −→ R given by φu,g (h, t) = h(u)g(t); (iii) φf ,v : V × V ∗ −→ R given by φv,f (x, h) = f (x)h(v); (iv) φf ,g : V × V −→ R given by φf ,g (u, v) = f (u)g(v). The function φf ,v : V × V ∗ −→ R defined as φf ,v (x, h) = f (x)h(v) for x ∈ V and h ∈ V ∗ is linear in x by the linearity of f and is linear in the second variable h by the definition of V ∗ . Therefore, h ∈ M(V, V ∗ ; R). If f : U × W −→ V is a bilinear function, we have f (0U , w) = 0V and f (u, 0W ) = 0V for every w ∈ W and u ∈ U . Definition 2.28. Let V, W be two complex linear spaces. A function f : V × W −→ C is said to be Hermitian bilinear if it is linear in the first variable and skew-linear in the second, that is, it satisfies the following equalities: f (a1 x1 + a2 x2 , y) = a1 f (x1 , y) + a2 f (x2 , y), f (x, b1 y 1 + b2 y 2 ) = b1 f (x, y 1 ) + b2 f (x, y 2 ), for x1 , x2 , x ∈ V, y, y 1 , y 2 ∈ W , and a1 , a2 , b1 , b2 ∈ C.

82

Linear Algebra Tools for Data Mining (Second Edition)

Example 2.22. Multilinearity is distinct from the notion of linearity on a product of linear spaces. For instance, the mapping h : R2 −→ R defined by h(x, y) = x + y is linear but not bilinear. On the other hand, the mapping g : R2 −→ R given by h(x, y) = xy is bilinear but not linear. Definition 2.29. Let V1 , . . . , Vn , V be real linear spaces. If f, g ∈ M(V1 , . . . , Vn ; L) are two multilinear functions, their sum is the function f + g defined by (f + g)(x1 , . . . , xn ) = f (x1 , . . . , xn ) + g(x1 , . . . , xn ), and the product af , where a ∈ F is the function af given by (af )(x1 , . . . , xn ) = af (x1 , . . . , xn ) for xi ∈ Vi and 1  i  n. It is immediate to verify that M(V1 , . . . , Vn ; V ) is an R-linear space relative to these operations. Let f : V1 × V2 −→ V be a real bilinear function. Observe that for x ∈ V1 and y ∈ V2 , we have f (x, 0V2 ) = f (x, 0y) = 0f (x, y) = 0V and f (0V1 , y) = f (0x, y) = 0f (x, y) = 0V .

(2.8)

Example 2.23. Let V be an R-linear space and let ·, · : V ∗ ×V −→ R be the function given by h, y = h(y) for h ∈ V ∗ and y ∈ V. It is immediate that ·, · is a bilinear function because ah + bg, y = ah, y + bg, y, h, ay + bz = ah, y + bh, z, for a, b ∈ R, h, g ∈ V ∗ , and y, z ∈ V. Moreover, we have h, y = 0 for every y ∈ V if and only if h = 0V ∗ and h, y = 0 for every h ∈ V ∗ if and only if y = 0V . Example 2.24. Let V1 , . . . , Vn , V be R-linear spaces, ai ∈ Vi for 1  i  n, and let gi ∈ Vi∗ . Define the function G : V1 ×· · ·×Vn −→ R as G(a1 , . . . , an ) = g1 (a1 ) · · · gn (an ) for ai ∈ Vi and 1  i  n.

Linear Spaces

83

The function G is multilinear. Indeed, if ai , bi ∈ Vi and a ∈ R, it is immediate to verify that G(a1 , . . . , ai + bi , . . . , an ) = G(a1 , . . . , ai , . . . , an ) + G(a1 , . . . , bi , . . . , an ), and G(a1 , . . . , aai , . . . , an ) = aG(a1 , . . . , ai , . . . , an ). Note, however, that G is not a linear function because G(aa1 , . . . , aan ) = an G(a1 , . . . , an ) for a ∈ R. Example 2.25. The function f : R2 −→ R defined by f (x1 , x2 ) = x1 x2 is bilinear because it is linear in each of its variables, separately, but is not linear in the ensemble of its arguments. Indeed, we have f (x1 + y1 , x2 ) = f (x1 , x2 ) + f (y1 , x2 ), f (x1 , x2 + y2 ) = f (x1 , x2 ) + f (x1 , y2 ), for every x1 , x2 , y1 , y2 ∈ R, which shows the bilinearity of f . However, we have f (x1 + x2 , y1 + y2 ) = x1 y1 + x1 y2 + x2 y1 + x2 y2 = f (x1 , y1 ) + f (x2 , y2 ), which means that f is not a linear function. Let h : U ×V −→ W be a bilinear mapping. For a ∈ U and b ∈ V, define ha : V −→ W and hb : U −→ W as ha (v) = h(a, v) for v ∈ V, and hb (u) = h(u, b) for u ∈ U. Then, we have ha ∈ Hom(V, W ) and hb ∈ Hom(U, W ) because of the bilinearity of the mapping h.

84

Linear Algebra Tools for Data Mining (Second Edition)

Conversely, if f : U −→ Hom(V, W ) is a linear mapping, then the function h : U × V −→ W defined by h(u, v) = f (u)(v) is bilinear and hu = f (u). If W = R, then h is a bilinear form, ha ∈ V ∗ and hb ∈ U ∗ . Thus, we have a mapping Φ : M(U, V ; R) −→ Hom(U, V ∗ ) given by Φ(h)(a) = ha for a ∈ U , and a mapping Ψ : M(U, V ; R) −→ Hom(V, U ∗ ) given by Ψ(h)(b) = hb for b ∈ V. Theorem 2.55. Let U, V be two real linear spaces and let M(U, V ; R) be the linear space of bilinear forms defined on U × V. The linear spaces M(U, V ; R), Hom(U, V ∗ ), and Hom(V, U ∗ ) are isomorphic. Proof. It is immediate that Φ is a linear mapping because for c, d ∈ R and h1 , h2 ∈ M(U, V ; R), we have Φ(ch1 + dh2 )(a)(v) = ((ch1 + dh2 )a )(v) = (ch1 + dh2 )(a, v) = ch1 (a, v) + dh2 (a, v) = cha1 (v) + dha2 (v) = cΦ(h1 )(a)(v) + dΦ(h2 )(a)(v), or Φ(ch1 + dh2 ) = cΦ(h1 ) + dΦ(h2 ). Note that Φ maps h : U −→ V into the linear form that transforms a into ha for a ∈ U . Thus, if Φ(h1 ) = Φ(h2 ), we have both h1 and h2 yield equal values for a ∈ U , so h1 = h2 , which proves the injectivity of Φ. Let f ∈ Hom(U, V ∗ ). For every a ∈ U there exists a linear form g : V −→ R such that f (a) = g, or f (a)(v) = g(v) for every v ∈ V. The mapping h : U × V −→ R defined by h(u, bf v) = f (u)(v) is bilinear and Φ(h)(u)(v) = hu (v) = h(u, v) = f (u)(v), which means that Φ(h) = f . Thus, Φ is also surjective and, therefore, it is an isomorphism between the linear spaces M(U, V ; R) and Hom(U, V ∗ ). The existence of an isomorphism targeted to Hom(V, U ∗ ) has a  similar argument. Corollary 2.15. Let V, W be two dim(M(V, W ; R)) = dim(V ) · dim(W ).

R-linear

spaces.

Then,

Linear Spaces

85

Proof. Since dim(W ∗ ) = dim(W ) = n, we have dim(Hom(V, W ∗ ))  = mn. The result follows immediately from Theorem 2.55. Let U, V, W be three R-linear spaces of finite dimensions having the bases {u1 , . . . , um }, {v 1 , . . . , v n }, and {w 1 , . . . , wp }, respectively, and let f : U× V −→ W be a bilinear function. If u =  m n i=1 ai ui ∈ U , v = j=1 bj v j , then f (u, v) =

m  n 

ai bj f (ui , v j ).

i=1 j=1

Since f (ui , v j ) ∈ W , there exist ckij such that f (ui , v j ) = p k k=1 cij w k , hence f (u, v) =

p m  n  

ai bj ckij wk .

i=1 j=1 k=1

{ckij

∈ R | 1  i  m, 1  j  n, 1  k  p} (which Thus, the set contains mnp elements) determines a bilinear function relative to the chosen bases in U, V, and W . The numbers ckij are known as the structural constants of f . In the special case of bilinear forms, when W = R, f : U ×V −→ R is a bilinear form and can be written using structural constants cij as in m  n  ai bj cij , (2.9) f (u, v) = m

i=1 j=1

n

where u = i=1 ai ui ∈ U , v = j=1 bj v j , and cij are the structural constants. The matrix of f is Cf = (cij ) and the form itself can be written as f (u, v) = u Cf v for u ∈ U and v ∈ V. Example 2.26. Unlike the case n = 1, the set of values of a multilinear function f : V1 × · · · × Vn −→ W is not a subspace of W in general. Indeed, consider a two-dimensional R-linear space U having a basis {u1 , u2 }, a four-dimensional R-linear space W having the basis {w 1 , w 2 , w 3 , w 4 }, and the bilinear function f : U × U −→ W defined as f (u, v) = u1 v1 w1 + u1 v2 w2 + u2 v1 w3 + u2 v2 w4 , where u = u1 u1 + u2 u2 and v = v1 u1 + v2 u2 .

86

Linear Algebra Tools for Data Mining (Second Edition)

Let S be the set of all vectors of the form s = f (u, v). By the definition of S, there exist u, v ∈ U such that s1 = u1 v1 , s2 = u1 v2 , s3 = u2 v1 , s4 = u2 v2 , hence s1 s4 = s2 s3 for any s ∈ S. Define the vectors z, t in W as z = 2w 1 + 2w 2 + w3 + w4 , t = w1 + w3 . Note that we have both z ∈ S and t ∈ S. However, x = z − t = w1 + 2w2 + w4 does not belong to S because x1 x4 = 1 and x2 x3 = 0. This example given in [71] shows that the range of a bilinear function is not necessarily a subspace of the space of values of the function. Example 2.27. Let f : R2 × R2 −→ R be a bilinear form. Since the vectors     1 0 and e1 = e1 = 1 0 form a basis in R2 , f can be written as f (ae1 + be2 , ce1 + de2 ) = af (e1 , ce1 + de2 ) + bf (e2 , ce1 + de2 ) = acf (e1 , e1 ) + adf (e1 , e2 ) + bcf (e2 , e1 ) + bdf (e2 , e2 ) = αf (e1 , e1 ) + βf (e1 , e2 ) + γf (e2 , e1 ) + δf (e2 , e2 ), where α = ac, β = ad, γ = bc, δ = bd. Thus, the multilinearity of f implies αδ = βγ. Theorem 2.56. Let V, W, U be three finite-dimensional linear spaces having the bases {v i | 1  i  }, {wj | 1  i  m}, and {uk | 1  k  p}, and the dual bases {v ∗i | 1  i  }, {w ∗j | 1  i  m}, {u∗k | 1  k  p}, respectively. Then, the functions φkij : V × W −→ U defined as φkij (v, w) = v ∗i (v)w∗j (w)uk are bilinear and form a basis of M(V × W, U ).

Linear Spaces

87

Proof. The bilinearity of the functions φkij is immediate. If f ∈ M(V × W, U ), v ∈ V, and w ∈ W , we have ⎛ f (v, w) = f ⎝



v ∗i (v),

i

=

 i

= =

w∗j (w)wj ⎠

v ∗i (v)w ∗j (w)wj f (vi , w j )

j

j

j

v ∗i (v)w∗j (w)w j u∗ (f (v i , w j ))uk

k

 i



j

 i



u∗ (f (vi , wj ))φkij (v, w).

k

Thus, the set of mappings φkij spans M(V × W, U ) and the set of numbers u∗k (f (vi , wj )) uniquely determines the bilinear function f , which shows that the set of functions φkij is linearly independent.  Corollary 2.16. We have dim(M(V × W, U )) = dim(V ) dim(W ) dim(U ). Proof. The equality Theorem 2.56.

is

an

immediate

consequence

of 

Exercises and Supplements (1) Let V be an R-linear space. Prove that the following three statements that concern a non-empty subset P of V are equivalent: (a) P is a subspace of V ; (b) if x, y ∈ P and a ∈ F, then x + y ∈ P and ax ∈ P ; (c) x, y ∈ P and a, b ∈ F imply ax + by ∈ P . (2) Let h be an endomorphism of an R-linear space V. Prove that hm is an endomorphism of V for every m ∈ nn and m  1. (3) Let V and W be two linear spaces such that B = {v 1 , . . . , v n } is a basis of V and let f ∈ Hom(V, W ). Prove that f is an injective linear mapping if and only if the set {f (v 1 ), . . . , f (v n )} is linearly independent.

Linear Algebra Tools for Data Mining (Second Edition)

88

Solution: Suppose that f is injective and that a1 f (v1 ) + · · · + an f (vn ) = 0W . Then, since f is linear, we have f (a1 v 1 + · · · + an vn ) = 0W . The injectivity of f implies a1 v 1 + · · · + an v n = 0V and this, in turn, implies a1 = · · · = an = 0. Thus, {f (v 1 ), . . . , f (v n )} is linearly independent. Conversely, suppose that the set {f (v 1 ), . . . , f (v n )} is linearly independent, and that f (u) = f (v) for some u, v ∈ V. Since B is a basis in V, we can write u = b1 v 1 + · · · + bn v n and v = c1 v 1 + · · · + cn v n , which, in turn, imply b1 f (v1 ) + · · · + bn f (v n ) = c1 f (v 1 ) + · · · + cn f (v n ), or (b1 − c1 )f (v 1 ) + · · · + (bn − cn )f (v n ) = 0W .

(4)

(5) (6)

(7)

The linear independence of {f (v 1 ), . . . , f (v n )} means that bi = ci for 1  i  n and this implies u = v. Thus, f is injective. Let V be a linear space such that dim(V ) = n. Prove that no finite set U that contains n − 1 vectors may span V ; prove that no finite set of n + 1 vectors can be linearly independent. Prove that every subset of a linearly independent set of a linear space is independent. Let U and W be two finite, linearly independent subsets of a linear space V such that |U |  |W |. Prove that there exists u ∈ U − W such that W ∪ {u} is a linearly independent set. Let U, V be two linear subspaces of Cn . If dim(U )+dim(V ) > n, prove that the subspace U ∩ V contains a vector x = 0n . Solution: We know that 0n ∈ U ∩ V for any two subspaces U and V. Suppose that dim(U ) + dim(V ) > n and that U ∩ V = {0n } and let u1 , . . . , up and v 1 , . . . , v q be two bases of U and V, respectively, where p = dim(U ), q = dim(V ), and p + q > n. The set {u1 , . . . , up , v 1 , . . . , v q } is linearly dependent, so there exist b1 , . . . , bp , c1 , . . . , cq such that not all these numbers are null and b1 u1 + · · · + bp up + c1 v1 + · · · + cq vq = 0n .

Linear Spaces

89

Let u = b1 u1 +· · ·+bp up ∈ U and v = c1 v 1 +· · ·+cq v q ∈ V. The previous equality means that u + v = 0n . Since u = −v, both u and v belong to U ∩ V, so u = v = 0n . This contradicts the linear independence of at least one of the bases, {u1 , . . . , up } and {v 1 , . . . , v q }. The notion of linear independence can be extended to subspaces of finite-dimensional linear spaces. Namely, if V1 , . . . , Vk are subspaces of a finite-dimensional linear space V, then we say that V1 , . . . , Vk are linearly independent if for xi ∈ Vi , 1  i  k, the equality  k i=1 xi = 0V implies xi = 0V for 1  i  k. (8) Prove that the subspaces V1 , . . . , Vk of a linear space V are linearly independent if and only if any set {xi | xi ∈ Vi − {0V } for 1  i  k} is linearly independent. (9) Let U be a subspace of a linear space V. Prove that there exists a non-zero linear form  : V −→ R such that Ker() = U . (10) Let h be an endomorphism of Rn and let x be a vector in Rn − {0n } such that there exists a least integer p such that hp (x) = 0. Prove that the set {x, h(x), h2 (x), . . . , hp−1 (x)} is linearly independent. Solution: Suppose that a0 x+ a1 h(x)+ · · · + ap−1 hp−1 (x) = 0n . Then, we have a0 h(x) + a1 h2 (x) + · · · + ap−2 hp−1 (x) = 0n a0 h2 (x) + a1 h3 (x) + · · · + ap−3 hp−1 (x) = 0n .. . a0 hp−3 (x) + a1 hp−2 (x) + a2 hp−1 (x) = 0n a0 hp−2 (x) + a1 hp−1 (x) = 0n a0 hp−1 (x) = 0n . The last equality implies a0 = 0. Substituting a0 in the previous equality yields a1 = 0, etc. Eventually, we obtain a0 = a1 = · · · = ap−1 = 0, which proves that S is linearly independent. (11) Let (S, O) be a topological space and let U and W be two subsets of S. If U is open and U ∩W = ∅, then prove that U ∩K(W ) = ∅. (12) Let (S, O) be a topological space and let I be its interior operator. Prove that the poset of open sets  (O, ⊆) is a complete  lattice, where sup L = L and inf L = I ( L) for every family of open sets L.

90

Linear Algebra Tools for Data Mining (Second Edition)

(13) Let (S, O) be a topological space and let I be its interior operator. Prove that the poset of open sets  (O, ⊆) is a complete  lattice, where sup L = L and inf L = I ( L) for every family of open sets L. (14) Let (S, O) be a topological space, let K be its interior operator, and let K be its collection of closed sets. ⊆) is a  Prove that (K, complete lattice, where sup L = K ( L) and inf L = L for every family of closed sets. (15) Let T be a subspace of the topological space (S, O). Let K S , I S , and ∂S be the closure, interior, and border operators associated to S and K T , I T , and ∂T be the corresponding operators associated to T . Prove that (a) K T (U ) = K S (U ) ∩ T , (b) I S (U ) ⊆ I T (U ), and (c) ∂T U ⊆ ∂S U for every subset U of T . (16) Let V be an R-linear space, S a subspace of V with dim(S) = k, and suppose that the quotient space V /S is of dimension p. Prove that dim(V ) = k + p and that a basis for V can be obtained by supplementing a basis of S with p representatives of the classes of the quotient space (one per class). (17) Let V be an R-linear space, and let S and T be complementary subspaces of V. Prove that the restriction of the function f : V −→ V /S given by f (x) = x + S to T is an isomorphism between T and S/V. (18) Let V1 , V2 , V be R-linear spaces. Prove that if f : V1 × V2 −→ V is both a linear and bilinear function, then f (x, y) = 0V for x ∈ V1 and y ∈ V2 . Solution: The bilinearity of f implies f (x, 0V2 ) = f (0V1 , y) = 0V by Equalities (2.8). On the other hand, since f is linear, we have f (x, y) = f (0V1 , y) + f (x, 0V2 ) = 0V . The next Supplement is a generalization of Corollary 2.15.

Linear Spaces

91

(19) Let V1 , . . . , Vm , V be m linear spaces with dim(Vi ) = ni for 1  i  m and dim(V ) = n. Prove that dim(M(V1 , . . . , Vn ; V )) = n m i=1 ni . Solution: Let Γ(n1 , . . . , nm ) be the set of sequences of integers (a1 , . . . , am ) of length m such that 1  ai  ni for 1  i  m. Suppose that the linear space Vt has the base {et1 , . . . , etnt } and V has the base {u1 , . . . , un }.  For each sequence α ∈ Γ(n1 , . . . , nm ), define n m t=1 nt functions: 

m  ξtα(t) uj , φα,j (v 1 , . . . , v m ) = t=1

n t

where v t = s=1 ξts ets . Each of these functions is in that if v k is replaced by M(V1 , . . . , Vn ; V ). Indeed, observe  v k , then the product m ξ is replaced by cv k + d˜ tα(t) t=1 ξ1α(1) · · · ξ(k−1)α(k−1) (cξkα(k) + dξ˜kα(k) ) × ξ(k+1)α(k+1) · · · ξm α(m) = cξ1α(1) · · · ξ(k−1)α(k−1) ξkα(k) · · · ξmα(m) + dξ1α(1) · · · ξ(k−1)α(k−1) ξ˜kα(k) · · · ξmα(m) .  Therefore, each of the n m t=1 nt functions φα,j belong to M(V1 , . . . , Vn ; V ). Note that for φ, θ ∈ M(V1 , . . . , Vn ; V ), we have φ = θ if and only if φ(ec ) = θ(ec ) for every c ∈ Γ(n1 , . . . , nm ) because of the multilinearity of φ and θ. Let φ be an arbitrary multilinear function such that φ(eγ ) =

n 

cγj uj

j=1

for γ ∈ Γ(n1 , . . . , nm ). Define θ=



n 

β∈Γ(n1 ,...,nm ) j=1

cβj φβ,j .

92

Linear Algebra Tools for Data Mining (Second Edition)

We have n 



θ(eγ ) =

cβj φβ,j (eγ )

β∈Γ(n1 ,...,nm ) j=1 n 



=

cβj δβ,γ uj

β∈Γ(n1 ,...,nm ) j=1

=

n 



cβj δβ,γ uj

j=1 β∈Γ(n1 ,...,nm )

=

n 

cγ j uj = φ(eγ ).

j=1

Thus, an arbitrary φ is expressed as a linear combination of φβj . This means that the subspaces generated by the functions φαj are M(V 1 , . . . , Vn ; V ).n If θ = β∈Γ(n1 ,...,nm ) j=1 dβj φβ,j = 0, then 

θ(eγ ) =

n 

dβj δβγ uj

β∈Γ(n1 ,...,nm ) j=1

=

n 

dγj uj = 0,

j=1

which implies dγj = 0 for 1  j  n. (20) Let f ∈ M(R2 , R2 ; R). Prove that f (u1 + u2 , v 1 + v 2 ) = f (u1 , v 1 ) + f (u1 , v 2 ) + f (u2 , v 1 ) + f (u2 , v 2 ). (21) Let f ∈ M(R2 , R2 ; R) and let (s, u) and (t, v) be two distinct bases in R2 . Suppose that w, x, y, z are vectors in R2 such that w = as + bu, x = ct + dv, y = es + gu, z = ht + kv, for a, b, c, d, e, g, h, k ∈ R. Prove that if adgh = bcek, then f (s, v) and f (u, t) can be determined from f (s, t), f (u, v), f (w, x), and f (y, z).

Linear Spaces

93

(22) Prove the following equalities involving Levi-Civita and Kronecker symbols: (a) 3 

ijk mk = δi δjm − δim δjl ,

k=1

where all indices vary in the set {1, 2, 3}; (b) 3  3 

ijk jk = 2δi ,

j=1 k=1

where all indices vary in the set {1, 2, 3}. (23) Let V, W be two finite-dimensional linear spaces having the bases {v i | 1  i  m} and {w j | 1  i  n}, respectively. If {f i | 1  i  } is the dual basis in V ∗ of {v i | 1  i  }, then the set of morphisms {hij : V −→ W | hij (v) = f i (v)wj for 1  i , 1  j  n} is a basis in Hom(V, W ). Solution: It is immediate that hij are homomorphisms in Hom(V, W ) for 1  i  m and 1  j  n. Let h be an arbitrary homomorphism in Hom(V,  W ). If v ∈ V, then v = a1 v 1 + · · · + am v m , hence h(v) = m i=1 ai h(v i ).  Since h(v i ) ∈ W , we can write: h(v i ) = nj=1 bij wj , for 1  i  m. Therefore, n m m m  n     ai h(v i ) = ai bij wj = ai bij wj . h(v) = i=1

i=1

j=1

i=1 j=1

i

Since {f | 1  i  } is the dual basis of {v i | 1  i  }, we have

m   a v  wj = ai wj . hij (v) = f i (v)w j = f i Thus, h(v) =

m n i=1

=1

i j=1 bij hj (v),

hence

{hij : V −→ W | hij (v) = f i (v)wj for 1  i , 1  j  n} spans the linear space Hom(V, W ).

94

Linear Algebra Tools for Data Mining (Second Edition)

m n If c hi is the zero morphism of Hom(V, W ), i=1 m j=1 nij j i cij hj (v i ) = 0W , which amounts to then m ni=1 j=1 i i=1 j=1 cij f (v)w j = 0W . Since {w j | 1  i  n} is m i a basis in W , we obtain i=1 cij f (v) = 0V , and therefore,  m i i=1 cij vi v i = 0V , which implies cij = 0. Thus, {hj : V −→ W | hij (v) = f i (v)w j for 1  i , 1  j  n} is a basis in Hom(V, W ). (24) Let V, W, U be three linear spaces and let f : V × W −→ U be a bilinear mapping. Define the mapping φ : V ×W −→ U as φ(z) = f (π1 (z), π2 (z)), where π1 and π2 are the canonical projections of V × W on V and W , respectively. Prove that φ(z 1 + z 2 ) + φ(z 1 − z 2 ) = 2(φ(z1 ) + φ(z2 )) and φ(az) = a2 φ(z) for every a ∈ R. (25) Let f : V 3 −→ R be a real multilinear form. Prove that if f is symmetric in the first two arguments and is skew-symmetric in the last two arguments, then f (x, y, z) = 0 for x, y, z ∈ V. Solution: We have f (x, y, z) = f (y, x, z) (by the symmetry in the first two arguments) = −f (y, z, x) (by the skew-symmetry in the last two arguments) = −f (z, y, x) (by the symmetry in the first two arguments) = f (z, x, y) (by the skew-symmetry in the last two arguments) = f (x, z, y) (by the symmetry in the first two arguments) = −f (x, y, z) (by the skew-symmetry in the last two arguments),

hence 2f (x, y, z) = 0. (26) Prove that every bilinear form f over R or C is uniquely expressible as a sum f = f1 + f2 , where f1 is symmetric and f2 is skew-symmetric.

Linear Spaces

95

(27) Prove that every symmetric bilinear form f over R or C is determined by its values of the form f (v, v). Solution: For any v and w, we can write 1 (f (v + w, v + w) − f (v, v) − f (w, w)) 2 1 = (f (v, w) + f (w, v)) = f (v, w), 2 which leads to the desired conclusion. i ···i

The generalized Kronecker symbol δj11 ···jpp is defined as

i ···i

δj11 ···jpp =

⎧ ⎪ ⎪ ⎪ ⎪ 1 ⎪ ⎪ ⎪ ⎨

if

i1 · · · ip



j · · · jp 

1 i1 · · · ip

is an even permutation

⎪ is an odd permutation −1 if ⎪ ⎪ ⎪ j · · · j ⎪ 1 p ⎪ ⎪ ⎩0 in all other cases.

Thus, if {i1 , . . . , ip } or {j1 , . . . , jp } does not consist of distinct integers or {i1 , . . . , ip } = {j1 , . . . , jp }, we have i ···i

δj11 ···jpp = 0. i1 i2 ···in and i1 i2 ···in = δi12···n . (28) Verify that i1 i2 ···in = δ12···n 1 i2 ···in (29) Prove that n  i1 =1

···

n  ik =1

···ik δii11···i = k

n! . (n − k)!

(30) Prove that i ···i

δj11 ···jpp = =

 φ∈PERMp



φ∈PERMp

i ···i

(−1)inv(φ) δj1φ(1)p···jφ(p) i

···iφ(p)

(−1)inv(φ) δjφ(1) 1 ···jp

.

96

Linear Algebra Tools for Data Mining (Second Edition)

Bibliographical Comments The books of MacLane and Birkhoff [108], Artin [3], and van der Waerden [165] contain a vast amount of material from many areas of algebra presented in a lucid and readable manner. Supplement 2.11 appears in [110].

Chapter 3

Matrices

3.1

Introduction

Matrices are rectangular arrays and their elements belong typically to a field. Numeric matrices (that is, matrices whose elements belong to R or C) serve as representations of linear transformations between linear spaces in the context of certain linear bases in these spaces. Historically, determinants (discussed in Chapter 5) preceded matrices. The term matrix was introduced by Sylvester,1 who regarded matrices as generators of determinants and therefore adopted the Latin word matrix (in Latin, a neutral noun matrix, matricis) (which signifies womb) to designate arrays of numbers. We begin with matrices whose entries belong to arbitrary sets and then focus on matrices whose components belong to fields. Then, we present several classes of matrices, discuss matrix partitioning, and the notion of invertible matrix. The fundamental relationship between matrices and linear transformations between finite-dimensional linear spaces is explored and properties of linear mappings are transferred to matrices.

1

James Joseph Sylvester was born on September 3, 1814 in London and died on March 15, 1897 in the same city. Sylvester made fundamental contributions in algebra, number theory, and Combinatorics. He studied at the University of London and at St. John’s College in Cambridge. Sylvester taught at the University College London, the Royal Military Academy, Johns Hopkins University in Baltimore, and Oxford. 97

98

Linear Algebra Tools for Data Mining (Second Edition)

The notion of matrix rank is introduced starting from the dimension of the range of linear mappings attached to matrices. We discuss linear systems and the application of matrices in solving such systems. We conclude with a presentation of Kronecker, Hadamard, and Khatri–Rao matrix products.

3.2

Matrices with Arbitrary Elements

We define a class of two-argument finite functions that is ubiquitous in mathematics and is central for linear algebra and its applications. Definition 3.1. A matrix on C is a function A : {1, . . . , m} × {1, . . . , n} −→ C. The pair (m, n) is the format of the matrix A. If A(i, j) ∈ R for 1  i  m and 1  j  n, then we say that A is a matrix on R. Matrices can be conceptualized as two-dimensional arrays as follows: ⎛

⎞ A(1, 1) A(1, 2) . . . A(1, n) ⎜A(2, 1) A(2, 2) . . . A(2, n)⎟ ⎜ ⎟ ⎜ ⎟ .. ⎟ . .. ⎜ .. . ⎠ ⎝ . . ... A(i, 1) A(i, 2) . . . A(i, n) If A : {1, . . . , m} × {1, . . . , n} −→ C, we say that A is an (m × n)matrix on C. The set of all such matrices are denoted by Cm×n . The element A(i, j) of the matrix A ∈ Cm×n is denoted either by Aij or by aij , for 1  i  m and 1  j  n. Definition 3.2. A row in a matrix is C1×n ; a column in a matrix is Cn×1 . Rows or columns are denoted by small bold-faced letters: r, s, etc.

Matrices

99

A matrix A ∈ Cm×n can be regarded as consisting of m rows, where the i th row is a sequence of the form (ai1 , ai2 , . . . , ain ), for 1  i  n, or as a collection of n columns, where the j th column has the form ⎞ ⎛ a1j ⎜ a2j ⎟ ⎟ ⎜ ⎜ .. ⎟ ⎝ . ⎠ amj for 1  j  m. The i th row of a matrix is denoted by A(i, :); similarly, the j th column of A is denoted by A(:, j). The main diagonal of a matrix A ∈ Cm×n is the set {aii | 1  i ≤ min{m, n}}. Note that the set {aij | i − j = k} for 1 ≤ k  n − 1 consists of elements located on the k th diagonal above the main diagonal of the matrix A. Similarly, the set {aij | j − i = k} for 1  k  n − 1 consists of elements located on the k th diagonal below the main diagonal. Definition 3.3. A square matrix on C is an (n × n)-matrix on the set C for some n  1. The number n is referred to as the order of the matrix A. An (n × n)-square matrix A = (aij ) on C is symmetric if aij = aji for every i, j such that 1  i, j  n. A is skew-symmetric if aij = −aij for 1  i, j  n. Example 3.1. The (3 × 3)-matrix ⎛ ⎞ 1 0.5 1 ⎜ ⎟ 2⎠ A = ⎝0.5 1 1 2 0.3 over the set of reals R is symmetric. The matrix  0 1 −2 B= −1 0 32 −3 0 is skew-symmetric.

100

Linear Algebra Tools for Data Mining (Second Edition)

Example 3.2. Let f : Rm × Rm −→ R be a skew-symmetric form defined by f (x, y) = x Ay, where A ∈ Rm×m . We have m

aij xi yj = −

i,j=1

m

aij yi xj .

i,j=1

Changing the summation indices in the second term of the above equality yields m

(aij + aji )xi yj = 0

i,j=1

for every x, y ∈ Rm , which implies aij + aji = 0, hence aij = −aji . This shows that the matrix A that defines f is skew-symmetric. Theorem 3.1. Let A ∈ Rn×n . We have x Ax = 0 for every x ∈ Rn if and only if A is a skew-symmetric matrix. n n Proof. Since x Ax = i=1 j=1 xi aij xj = i i) =0 (because j  i < k implies hjk = 0). Therefore, LH is a lower triangular matrix. The argument for upper triangular matrices is similar.



Matrices

107

A simple and useful observation is contained in the next theorem. Theorem 3.5. Let A and B be two matrices in Cn×n . If A = BR, where R is an upper triangular matrix, then the first q columns of A are linear combinations of the first q columns of B for 1  q  n. If A = LB, where L is a lower triangular matrix, then the first q rows of A are linear combinations of the first q rows of B for 1  q  n. Proof.

Indeed, the equality A = BR can be written as ⎛ ⎞ r11 r12 · · · r1n ⎜0 r ⎟ 22 · · · r2n ⎟ ⎜ ⎜ ⎟ (a1 a2 · · · an ) = (b1 b2 · · · bn ) ⎜ . .. .. ⎟ , . . ··· . ⎠ ⎝ . 0 0 · · · rrr

so a1 = r11 b1 , a2 = r12 b1 + r22 b2 .. . an = r1n b1 + r2n b2 + · · · + rnn bn . By applying a transposition it is easy to see that if A = LB, where L is a lower triangular matrix, the first q rows of A are linear com binations of the first q rows of B. Let SA,q be the subspace generated by the first q columns of a matrix A ∈ Cm×n . This notation is used in the next statement. 3.4

Invertible Matrices

Let A ∈ Cn×n be a square matrix. Suppose that there exist two matrices U and V such that AU = In and V A = In . This implies V = V In = V (AU ) = (V A)U = In U = U. Thus, if AU = V A = In , the two matrices involved, U and V, must be equal.

Linear Algebra Tools for Data Mining (Second Edition)

108

Definition 3.12. A matrix A ∈ Cn×n is invertible if there exists a matrix B ∈ Cn×n such that AB = BA = In . Suppose that C is another matrix such that AC = CA = In . By the associativity of the matrix product we have C = CIn = C(AB) = (CA)B = In B = B. Therefore, if A is invertible, there is exactly one matrix B such that AB = BA = In . We denote the matrix B by A−1 and we refer to it as the inverse of the matrix, A. Note that A ∈ Cn×n is a unitary matrix if and only if A−1 = AH . Theorem 3.6. If A, B ∈ Cn×n are two invertible matrices, then the product AB is invertible and (AB)−1 = B −1 A−1 . Proof.

Applying the definition of the inverse of a matrix, we obtain

(AB)(B −1 A−1 ) = A(BB −1 )A−1 = AIn A−1 = AA−1 = In , which implies (AB)−1 = B −1 A−1 .



Theorem 3.7. If A ∈ Cn×n is invertible, then AH is invertible and (AH )−1 = (A−1 )H . Proof. Since AA−1 = In , we have (A−1 )H AH = In , which shows  that (A−1 )H is the inverse of AH and (AH )−1 = (A−1 )H . Example 3.9. Let

 A=

a11 a12 a21 a22



be a matrix in R2×2 . We seek to determine conditions under which A is invertible. Suppose that  x11 x12 X= x21 x22 is a matrix in R2×2 such that AX = I2 . This matrix equality amounts to four scalar equalities: a11 x11 + a12 x21 a11 x12 + a12 x22 a21 x11 + a22 x21 a21 x12 + a22 x22

= 1, = 0, = 1, = 0,

Matrices

109

which, under certain conditions, can be solved with respect to x11 , x12 , x21 , x22 . By multiplying the first equality by a22 and the third by −a12 and adding the resulting equalities, we obtain (a11 a22 − a12 a21 )x11 = −a22 . Thus, if a11 a22 − a12 a21 = 0, we have x11 = −

a22 . a11 a22 − a12 a21

The same condition, a11 a22 − a12 a21 = 0, suffices to allow us to obtain the value of the remaining components of X, as the reader can easily verify. Thus, A is an invertible matrix if and only if a11 a22 − a12 a21 = 0. Example 3.10. Let

 φ:

1 ··· k ··· n , a1 · · · ak · · · an

be a permutation of the set {1, . . . , n}, where ak = φ(k) for 1  k  n. The matrix of this permutation is the square matrix Pφ = (pij ) ∈ {0, 1}n×n , where  1 if j = φ(i), (3.1) pij = 0 otherwise for 1  i, j  n. The set of invertible matrices in Rn×n is a group relative to matrix multiplication known as the general linear group GL(n, R). Similarly, the set of invertible matrices in Cn×n forms the group GL(n, C). Note that the matrix of the permutation ιn is In . Also, if φ, ψ are two permutations of nthe set {1, . . . , n}, then Pψφ = Pφ Pψ . Indeed, since (Pφ Pψ )ij = k=1 (Pφ )ik (Pψ )kj , observe that only the term (Pφ )ik (Pψ )kj in which k = φ(i) and j = ψ(k) is different from 0. Thus, (Pφ Pψ )ij = 0 if and only if j = ψ(φ(i)), which means that Pψφ = Pφ Pψ . Thus, if φ and φ−1 are two inverse permutations in PERMn , we have Pφ Pφ−1 = In , so Pφ is invertible and Pφ−1 = Pφ−1 .

110

Linear Algebra Tools for Data Mining (Second Edition)

For instance, if φ ∈ PERM4 is the permutation considered in Example 1.6,  1 2 3 4 φ: , 3 1 4 2 then

⎛ 0 ⎜1 ⎜ Pφ = ⎜ ⎝0 0

Its inverse is

0 0 0 1 ⎛

Pφ−1 = Pφ−1

1 0 0 0

0 ⎜0 ⎜ =⎜ ⎝1 0

⎞ 0 0⎟ ⎟ ⎟. 1⎠ 0

1 0 0 0

0 0 0 1

⎞ 0 1⎟ ⎟ ⎟. 0⎠ 0

It is easy to verify that the inverse of a permutation matrix Pφ coincides with its transpose (Pφ ) . Observe that if A ∈ Rn×n having the rows r 1 , . . . , r n and Pφ is a permutation matrix, then Pφ A is the matrix whose rows are r φ(1) , r φ(2) , . . . , r φ(n) . Similarly, if the columns of A are c1 , . . . , cn , the columns of the matrix APφ are cφ(1) , . . . , cφ(n) . In other words, Pφ A is obtained from A be permuting its rows according to the permutation φ and APφ is obtained from A by permuting the columns according to the same permutation. Since every column and row of a permutation matrix contains exactly one 1, it follows that each such matrix is also a doubly stochastic matrix. Theorem 3.8. Let A ∈ Rn×n be a lower (upper) triangular matrix such that aii = 0 for 1  i  n. The matrix A is invertible and its inverse is a lower (upper) triangular matrix having diagonal elements equal to the reciprocal of the diagonal elements of A.

Matrices

Proof.

111

Let A be a lower triangular matrix ⎛ ⎞ 0 ··· 0 a11 0 ⎜a ⎟ ⎜ 21 a22 0 · · · 0 ⎟ ⎟ A=⎜ .. .. . ⎟, ⎜ .. ⎝ . . . · · · .. ⎠ an1 an2 an3 · · · ann

where aii = 0 for 1  i  n. The proof is by induction on n  1. The base case, n = 1, is immediate, since the inverse of the matrix   1 (a11 ) is a11 . Suppose that the statement holds for matrices in R(n−1)×(n−1) . Then A can be written as  0n−1 B , A= an1 an2 · · · an n−1 ann where B ∈ R(n−1)×(n−1) is a lower triangular matrix. By the inductive hypothesis, this matrix is invertible, its inverse B −1 is also lower triangular, and the diagonal elements of B −1 are the reciprocal elements of the corresponding diagonal elements of B. The matrix   B −1 0n−1 v

1 ann

1 a B −1 , and a = is the inverse of A, where v = − ann (an1 , an2 , . . . , an n−1 ), as the reader can easily verify. A similar argument can be used for upper triangular matrices. 

Theorem 3.9. Let A ∈ Rn×n be an invertible matrix, Then, its transpose A is invertible and (A )−1 = (A−1 ) . Proof.

Observe that A (A−1 ) = (A−1 A) = In = In .

Therefore, (A )−1 = (A−1 ) .



Linear Algebra Tools for Data Mining (Second Edition)

112

If A ∈ Rn×n is an invertible matrix, we have AA−1 = In , so trace(A)trace(A−1 ) = trace(In ) = n, by Theorem 3.14. This implies n . (3.2) trace(A−1 ) = trace(A) The trace of a matrix A ∈ Cn×n can be obtained as n

ei Aei . trace(A) = i=1

Theorem 3.10. Let A and B be two matrices in Cn×n . If A = BR, where R is an invertible upper triangular matrix, then SA,q = SB,q for every q, 1  q  n. Proof. By Theorem 3.5, A = BR implies SA,q ⊆ SB,q for every q, 1  q  n. Since R is invertible, B = AR−1 , so SB,q ⊆ SA,q for every q, 1  q  n by the same theorem. This implies the desired  equality. It is interesting to compute two matrix products that can be formed starting from the columns u and v given by ⎛ ⎞ ⎛ ⎞ v1 u1 ⎜ v2 ⎟ ⎜ u2 ⎟ ⎜ ⎟ ⎜ ⎟ u = ⎜ . ⎟ and v = ⎜ . ⎟ . ⎝ .. ⎠ ⎝ .. ⎠ un vn Note that u v ∈ F1×1 , that is, u v = u1 v1 + u2 v2 + · · · + un vn . This product is known as the inner product of u and v. Theorem 3.11. If u ∈ Rn×1 and u u = 0, then u = 0. Proof.

Let

⎞ u1 ⎜ u2 ⎟ ⎜ ⎟ u = ⎜ . ⎟. ⎝ .. ⎠ un ⎛

We have u u = u21 + u22 + · · · + u2n , so u u = 0 implies u1 = u2 =  · · · = un = 0, that is, u = 0.

Matrices

113

The mth power of a square matrix A ∈ Fn×n (where m ∈ N) can be defined inductively as follows. Definition 3.13. The 0th power of any matrix A ∈ Fn×n is A0 = In . The (m+1)st power of A is the matrix Am+1 = Am A for m  0. Example 3.11. Let

 A=

a b . −b a

We have A0 = I2 , A1 = A0 A = I2 A = A,    2 a b a b a − b2 2ab = . A2 = A1 A = −ba −ba −2ab a2 − b2 Example 3.12. Let En ∈ Cn×n be the matrix  0n−1 In−1 . En = 0 0n−1 We claim that the pth power of this matrix is given by   0n−1 · · · 0n−1 In−p p . (En ) = 0 ··· 0 0n−p Note that



⎞ e2 ⎜ .⎟ ⎜ .. ⎟ ⎟ En = (0 e1 · · · en−1 ) = ⎜ ⎜  ⎟. ⎝en ⎠ 0

The equality that we need to prove is equivalent to ⎛  ⎞ ep+1 ⎜ . ⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜  ⎟ ⎜ en ⎟ ⎟ (En )p = ⎜ ⎜ 0 ⎟ . ⎜ ⎟ ⎜ . ⎟ ⎜ . ⎟ ⎝ . ⎠ 0

114

Linear Algebra Tools for Data Mining (Second Edition)

The proof is by induction on p  1. The base case, p = 1, is immediate. Suppose that the equality holds for p − 1. Since  1 if i = j,  ei ej = 0 otherwise for 1  i, j  n, we have



⎞ ep ⎜ .⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜en ⎟ p p−1 ⎟ (En ) = (En ) En = ⎜ ⎜ 0 ⎟ (0 e1 · · · en−1 ) ⎜ ⎟ ⎜ .⎟ ⎜ .⎟ ⎝ .⎠ 0 (by the inductive hypothesis) ⎛  ⎞ ep+1 ⎜ . ⎟ ⎜ .. ⎟ ⎟ ⎜ ⎜  ⎟ ⎜ en ⎟ ⎟ =⎜ ⎜ 0 ⎟ . ⎟ ⎜ ⎜ . ⎟ ⎜ . ⎟ ⎝ . ⎠ 0

The previous equality implies immediately (En )n = On,n . Theorem 3.12. Let T ∈ Cn×n be an upper (a lower) triangular matrix. Then T k is an upper (a lower) triangular matrix. Proof. By Theorem 3.4, the product of two upper (lower) triangular matrices is an upper (lower) triangular matrix. The current  statement follows immediately. Definition 3.14. A matrix A ∈ Cn×n is nilpotent if there is m ∈ N such that Am = On,n . The nilpotency of A is the number nilp(A) = min{m ∈ N | Am = On,n }. If A ∈ Cn×n is a nilpotent matrix, we have nilp(A) = m if and only if Am = On,n but Am−1 = On,n .

Matrices

115

Example 3.13. Let a and b be two positive numbers in R. The matrix A ∈ R3×3 given by ⎛ ⎞ 0 a 0 ⎜ ⎟ A = ⎝0 0 b ⎠ 0 0 0 is nilpotent because ⎛

⎞ ⎛ ⎞ 0 0 ab 0 0 0 ⎜ ⎟ ⎜ ⎟ A2 = ⎝0 0 0 ⎠. and A3 = ⎝0 0 0⎠. 0 0 0 0 0 0

Thus, nilp(A) = 3. Definition 3.15. A matrix A ∈ Fn×n is idempotent if A2 = A. Example 3.14. The matrix



A=

0.5 1 0.25 0.5



is idempotent, as the reader can easily verify. Matrix product is distributive with respect to matrix addition. Theorem 3.13. Let A ∈ Fm×n , B, C ∈ Fn×p , and D ∈ Fp×q . We have A(B + C) = AB + AC, (B + C)D = BD + CD. Proof.

We have (A(B + C))ik =

n

aij (bjk + cjk )

j=1

=

n

j=1

aij bjk +

n

aij cjk

j=1

= (AB)ik + (AC)ik , for 1  i  m and 1  k  p, which proves the first equality. The proof of the second equality is equally straightforward and is left to  the reader.

116

Linear Algebra Tools for Data Mining (Second Edition)

Definition 3.16. The (m×n)-vectorization mapping is the mapping vec : Cm×n −→ Cmn defined by ⎛ ⎞ a11 ⎜ . ⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ am1 ⎟ ⎜ ⎟ ⎜ ⎟ vec(A) = ⎜ ... ⎟ , ⎜ ⎟ ⎜a ⎟ ⎜ 1n ⎟ ⎜ ⎟ ⎜ .. ⎟ ⎝ . ⎠ amn obtained by reading A column-wise. The following equality is immediate for a matrix A ∈ Cm×n : ⎛ ⎞ Ae1 ⎜ Ae ⎟ ⎜ 2⎟ ⎟ vec(A) = ⎜ (3.3) ⎜ .. ⎟ . ⎝ . ⎠ Aen The vectorization mapping vec is an isomorphism between the linear space Cm×n and the linear space Cmn , as the reader can easily verify. Example 3.15. For the matrix In , we have ⎛ ⎞ e1 ⎜e ⎟ ⎜ 2⎟ ⎟ vec(In ) = ⎜ ⎜ .. ⎟ . ⎝. ⎠ en The MATLAB implementation of vec is discussed in Example 15.32. Definition 3.17. Let A = (aij ) ∈ Cn×n be a square matrix. The trace of A is the number trace(A) given by trace(A) = a11 + a22 + · · · + ann .

Matrices

117

Theorem 3.14. Let A and B be two square matrices in Cn×n . We have (i) trace(aA) = a trace(A), (ii) trace(A + B) = trace(A) + trace(B), and (iii) trace(AB) = trace(BA). Proof. The first two parts are direct consequences of the definition of the trace. For the last part, we can write n n

n



(AB)ii = aij bji . trace(AB) = i=1

i=1 j=1

Exchanging the subscripts i and j and, then the order of the summations, we have n

n

i=1 j=1

aij bji =

n

n

aji bij =

j=1 i=1

n

n

i=1 j=1

n

bij aji = (BA)ii , i=1

which proves the desired equality.



Note that the elements on the diagonal of a skew-symmetric matrix are 0 and, therefore, its trace equals 0. Let A, B, C be three matrices in Cn×n . We have trace(ABC) = trace((AB)C) = trace(C(AB)) = trace(CAB), and trace(ABC) = trace(A(BC)) = trace((BC)A) = trace(BCA). However, it is important to notice that the third part of Theorem 3.14 cannot be extended to arbitrary permutations of a product of matrices. Consider, for example, the matrices    1 1 1 1 1 0 ,B = , and C = . A= 1 1 1 0 0 1 We have

 ABC =

 1 2 2 1 and ACB = , 2 3 3 1

so trace(ABC) = 4 and trace(ACB) = 3.

118

Linear Algebra Tools for Data Mining (Second Edition)

Definition 3.18. A matrix A ∈ Rm×n is non-negative if aij  0 for 1  i  m and 1  j  n. This is denoted by A  Om,n . A is positive if aij > 0 for 1  i  m and 1  j  n. This is denoted by A > 0m,n . If B, C ∈ Rm×n , we write B  C (B > C) if B − C  Om,n (B − C > Om,n , respectively). The sets of non-negative (non-positive, positive, negative) m × nm×n m×n (Rm×n matrices is denoted by Rm×n 0 0 , R>0 , R j + p implies aij = 0. A has upper bandwidth q if j > i + q implies aij = 0. A is a (p, q)-band matrix if it has lower bandwidth p and upper bandwidth q. A tridiagonal matrix is a (1, 1)-band matrix. A lower Hessenberg matrix is an (m − 1, 1)-band matrix, while an upper Hessenberg matrix is a (1, n − 1)-band matrix. Example 3.17. The matrix A ∈ R4×6 defined by ⎛ ⎞ 1 2 0 0 0 0 ⎜1 2 3 0 0 0⎟ ⎜ ⎟ ⎜ ⎟ ⎝2 1 3 5 0 0⎠ 0 2 1 4 3 0 is a (2, 1)-band matrix. The matrix B ∈ R4×6 given by ⎛

1 ⎜1 ⎜ ⎜ ⎝0 0 is a tridiagonal matrix.

2 2 1 0

0 2 3 1

0 0 2 4

0 0 0 2

⎞ 0 0⎟ ⎟ ⎟ 0⎠ 0

Matrices

The matrix



1 ⎜1 ⎜ ⎜ ⎝0 0

2 2 1 0

is an upper Hessenberg matrix. ⎛ 1 2 ⎜1 2 ⎜ ⎜ ⎝2 1 3 2

3 2 3 1

4 3 2 4

119

5 4 5 2

⎞ 6 4⎟ ⎟ ⎟ 0⎠ 0

On the other hand, the matrix ⎞ 0 0 0 0 3 0 0 0⎟ ⎟ ⎟ 3 5 0 0⎠ 1 4 3 0

is a lower Hessenberg matrix. Note that a matrix L ∈ Fm×n is lower triangular if it is an (m − 1, 0)-band matrix. Similarly, U is upper triangular if it is a (0, n − 1)-band matrix. In other words, L is lower triangular if its upper bandwidth is 0, that is, if j > i implies lij = 0; U is upper triangular if its lower bandwidth is 0, that is, i > j implies uij = 0. Next, we define several classes of matrices whose components are real numbers. Definition 3.20. A Toeplitz matrix is a matrix T ∈ Rn×n such that the elements located in any line parallel to the main diagonal of T (including, of course, the main diagonal) are equal. Example 3.18. Let the matrix ⎛ 1 ⎜6 ⎜ ⎜ T = ⎜7 ⎜ ⎝8 9

2 1 6 7 8

3 2 1 6 7

4 3 2 1 6

⎞ 5 4⎟ ⎟ ⎟ 3⎟ ⎟ 2⎠ 1

be a 5 × 5 Toeplitz matrix. The elements of a Toeplitz matrix T ∈ Rn×n are completely determined by its first row and its first column (which must have their

120

Linear Algebra Tools for Data Mining (Second Edition)

first components equal). Therefore, such a matrix is fully defined by a set of 2n − 1 numbers. Definition 3.21. A circulant form ⎛ c1 ⎜ cn ⎜ ⎜ c C=⎜ ⎜ n−1 ⎜ .. ⎝ . c2

matrix is a Toeplitz matrix C of the c2 c1 cn .. . c3

c3 c2 c1 .. . c4

⎞ · · · cn · · · cn−1 ⎟ ⎟ ⎟ · · · cn−2 ⎟ . ⎟ .. ⎟ ··· . ⎠ ···

c1

Note that if C = (cij ), then cij = c(j−i+1)

mod n .

Definition 3.22. A Hankel matrix is a matrix H ∈ Rn×n such that the elements located in any line parallel to the skew-diagonal of H (including, of course, the skew-diagonal) are equal. Example 3.19. Let T be the matrix ⎛ 1 ⎜2 ⎜ ⎜ T =⎜ ⎜3 ⎜4 ⎝ 5

2 3 4 5 6

3 4 5 6 7

4 5 6 7 8

⎞ 5 6⎟ ⎟ ⎟ 7⎟ ⎟ 8⎟ ⎠ 9

be a 5 × 5 Hankel matrix. Next, we introduce a class of matrices that is essential in the study of the theory of Markov chains. Definition 3.23. A stochastic matrix is a matrix A ∈ Rn×n such n that aij  0 for 1  i, j  n and j=1 aij = 1 for every i, 1  i  n. A doubly stochastic matrix is a matrix A ∈ Rn×n such that both A and A are stochastic. The rows of a stochastic matrix can be regarded as discrete probability distributions.

Matrices

121

Example 3.20. The matrix A ∈ R3×3 defined by ⎛1 1⎞ 2 0 2 ⎜1 1 1⎟ A = ⎝3 2 6⎠ 0 23 31 is a stochastic matrix. Let C be the field of complex numbers. A complex matrix is a matrix A ∈ Cm×n . Definition 3.24. The conjugate of a matrix A ∈ Cm×n is the matrix A¯ ∈ Cm×n , where A(i, j) = A(i, j) for 1  i  m and 1  j  n. The notion of symmetry is extended to accommodate complex matrices. Definition 3.25. The transpose conjugate of the matrix A ∈ Cm×n or its Hermitian adjoint is the matrix B ∈ Cn×m given by B = A¯ = (A ). The transpose conjugate of A is denoted by AH . Example 3.21. Let A ∈ C3×2 be the matrix ⎛ ⎞ 1+i 2 ⎜ ⎟ i ⎠. A = ⎝2 − i 0 1 − 2i The matrix AH is given by  1−i 2+i 0 H . A = 2 −i 1 + 2i Using Hermitian conjugates, several important classes of matrices are defined. Definition 3.26. The matrix A ∈ Cn×n is (i) Hermitian if A = AH ; (ii) skew-Hermitian if AH = −A; (iii) normal if AAH = AH A; (iv) unitary if AAH = AH A = In .

122

Linear Algebra Tools for Data Mining (Second Edition)

The set of unitary matrices in Cn×n constitutes the unitary group UG(n, C). It is immediate that all unitary, Hermitian, and skew-Hermitian matrices are normal. However, there are normal matrices outside these three classes. Example 3.22. The matrix  A=

1 −1 1 1

is not Hermitian or skew-Hermitian, and  2 H H AA = A A = 0

it is not unitary because 0 . 2

However, A is normal. Example 3.23. Let α, β, γ, δ, and θ be five real numbers such that α − β − γ + δ is a multiple of 2π. The matrix   eiα cos θ −eiβ sin θ Mα,β,γ,δ (θ) = eiγ sin θ eiδ cos θ introduced in [57] is unitary because Mα,β,γ,δ (θ)H Mα,β,γ,δ (θ)     e−iα cos θ e−iγ sin θ eiα cos θ −eiβ sin θ 1 0 = . = 0 1 −e−iβ sin θ e−iδ cos θ eiγ sin θ eiδ cos θ It is immediate that all unitary matrices are invertible, and their inverse is equal to their Hermitian conjugate. Furthermore, the product of two unitary matrices is a unitary matrix. Indeed, suppose that A, B ∈ Cn×n are unitary matrices, that is, AAH = BB H = In . Then (AB)(AB)H = ABB H AH = AAH = In , hence AB is a unitary matrix. If A ∈ Rn×n is a unitary real matrix, we refer to A as an orthogonal matrix or an orthonormal matrix for reasons that we will discuss in Section 6.11. If A ∈ Rn×n is a matrix with real entries, then its Hermitian adjoint coincides with the transposed matrix A . Thus, a real matrix is Hermitian if and only if it is symmetric.

Matrices

123

Observe that if z ∈ Cn and

⎛ ⎞ z1 ⎜.⎟ ⎟ z=⎜ ⎝ .. ⎠ , zn

then z H z = z 1 z1 + · · · + z n zn =

n

2 i=1 |zi | .

Theorem 3.15. Let A ∈ Cn×n . The following statements hold: (i) the matrices A + AH , AAH , and AH A are Hermitian and A − AH is skew-Hermitian; (ii) if A is a Hermitian matrix, then so is Ak for k ∈ N; (iii) if A is Hermitian and invertible, then so is A−1 ; (iv) if A is Hermitian, then aii are real numbers for 1  i ≤ n. Proof. All statements follow directly from the definition of Hermi tian matrices. Theorem 3.16. If A ∈ Cn×n there exists a unique pair of Hermitian matrices (H1 , H2 ) such that A = H1 + iH2 . Proof.

Let 1 i H1 = (A + AH ) and H2 = − (A − AH ). 2 2

It is immediate that both H1 and H2 are Hermitian and that H1 + iH2 = A. Suppose that A = H3 + iH4 , where H3 and H4 are Hermitian. Then, we have 2H1 = A + AH = H3 + iH4 + H3H − iH4H = 2H3 , so H1 = H3 . Therefore, H2 = H4 , so the matrices H1 and H2 are  uniquely determined. Theorem 3.17. If A ∈ Cn×n , there exists a unique pair of matrices (H, S) such that H is Hermitian, S is skew-Hermitian, and A = H + S.

Linear Algebra Tools for Data Mining (Second Edition)

124

Proof. By Theorem 3.16, A can be written as A = H1 +iH2 , where H1 and H2 are Hermitian matrices. Choose H = H1 and S = iH2 . By Exercise 3.17, S is skew-Hermitian. The uniqueness of the pair  (H, S) is immediate. Next, we discuss a characterization of Hermitian matrices. Theorem 3.18. A matrix A ∈ Cn×n is Hermitian if and only if xH Ax is a real number for every x ∈ Cn . Proof.

Suppose that A is Hermitian. Then xH Ax = xH AH x = x A x = x A (xH ) = xH Ax,

so xH Ax is a real number because it is equal to its conjugate. Conversely, suppose that xH Ax is a real number for every x ∈ Cn . This implies that (x + y)H A(x + y) = xH Ax + xH Ay + y H Ax + y H Ay is a real number, so xH Ay + y H Ax is real for every x, y ∈ Cn . Let x = ep and y = eq . Then apq + aqp is a real number. If we choose x = −iep and y = ej , it follows that −iapq + iaqp is a real number. Thus, (apq ) = −(aqp ) and (apq ) = (aqp ), which leads to apq = aqp for 1  p, q  n. These equalities are equivalent to  A = AH , so A is Hermitian. Observe that for any matrix B ∈ Cm×n , the matrix B H B is normal since (B H B)H B H B = B H BB H B and B H B(B H B)H = B H BB H B. Example 3.24. All Hermitian or skew-Hermitian matrices are normal. Indeed, if A is Hermitian, then A = AH and the normality condition is obviously satisfied. If A is skew-Hermitian, then AH = −A and the normality follows from (−A)A = A(−A) = −A2 . If A is a real, symmetric matrix, then A is obviously normal. Theorem 3.19. A matrix A ∈ Cn×n is normal and upper triangular (or lower triangular) if and only if A is a diagonal matrix. Proof. Clearly, any diagonal matrix is both normal and upper (and lower) triangular. Therefore, we need to show only that if A is both triangular and normal, then A is diagonal. We make the argument for the case when A is upper triangular.

Matrices

125

Since AH A = AAH , we have the equality (AH A)pp = (AAH )pp for 1  p  n. We prove by induction on p that the non-diagonal elements of A are 0. For the base case, p = 1, the conditions of the theorem imply a ¯11 a11 =

n

a1j a ¯1j = a11 a ¯11 +

j=1

n

a1j a ¯1j .

j=2

n ¯11 = |a11 |2 , it follows that ¯1j = Since a ¯11 a11 = a11 a j=2 a1j a n 2 j=2 |a1j | = 0, so a1j = 0 for 2  j  n. Thus, all non diagonal element located on the first line of A (or the first column of AH ) are zero. Suppose now that all non-diagonal elements of the first p − 1 lines of A are 0. For the pth diagonal element of (AH A)pp , we have (AH A)pp =

n

a ¯ip aip

i=1

=

n

a ¯ip aip

i=p

(by the inductive hypothesis) =a ¯pp app (because A is an upper diagonal matrix). This allows us to write n n



apj a ¯pj = app a ¯pp + apj a ¯pj , a ¯pp app = j=p

j=p+1

so n

j=p+1

apj a ¯pj =

n

|apj |2 = 0,

j=p+1

which implies ap p+1 = · · · = apn = 0. Thus, all non-diagonal ele ments on the line p are 0. Theorem 3.20. If U ∈ Cn×n is a unitary matrix, then the matrix Z ∈ Rn×n given by zij = |uij |2 for 1  i, j  n is a doubly stochastic matrix.

126

Proof. Thus,

Linear Algebra Tools for Data Mining (Second Edition)

Since U is a unitary matrix, we have U U H = U H U = In . n

zij =

j=1

n

|uij |2 =

j=1

=

n

n

uij uij

j=1

uij (U H )ji = (U U H )ii = 1.

j=1

n

The equality i=1 zij = 1 can be established in a similar manner, so  Z is indeed a doubly stochastic matrix. Let A ∈ Cm×n be a matrix. The matrix of the absolute values of A is the matrix abs(A) ∈ Rm×n defined by (abs(A))ij = |aij | for 1  i  m and 1  j  n. In particular, if x ∈ Cn , we have (abs(x))j = |xj |. Theorem 3.21. Let A ∈ Cm×n and B ∈ Cn×p be two matrices. We have abs(AB)  abs(A)abs(B). Proof. Since (AB)ik = nj=1 aij bjk , it follows that n n n 



  aij bjk  ≤ |aij bjk | = |aij | |bjk |, |(AB)ik | =  j=1

j=1

j=1

for 1  i  m and 1  k  p. This amounts to abs(AB)   abs(A)abs(B). Theorem 3.22. For A ∈ Cn×n we have abs(Ak )  (abs(A))k for every k ∈ N. Proof. The proof is by induction on k. The base case, k = 0, is immediate. Suppose that the inequality holds for k. We have abs(Ak+1 ) = abs(Ak A) ≤ abs(Ak )abs(A) (by Theorem 3.21)

Matrices

127

≤ (abs(A))k abs(A) (by the inductive hypothesis) = (abs(A))k+1 , which completes the induction case.The factor of S’*A*S tends to be  sparser than the factor of A. 3.6

Partitioned Matrices and Matrix Operations

Let A ∈ Fm×n be a matrix and suppose that m = m1 + · · · + mp and n = n1 + · · · + nq , where F is the real or the complex field. A partitioning of A is a collection of matrices Ahk ∈ Fmh ×nk such that Ahk is the contiguous submatrix

m1 + · · · + mh−1 + 1, . . . , m1 + · · · + mh−1 + mh , A n1 + · · · + nk−1 + 1, . . . , n1 + · · · + nk for 1  h  p and 1  k  q. If {Ahk | 1  h  p and 1  k  q} written as ⎛ A11 A12 · · · ⎜A ⎜ 21 A22 · · · A=⎜ .. ⎜ .. ⎝ . . ··· Ap1 Ap2 · · ·

is a partitioning of A, A is ⎞ A1q A2q ⎟ ⎟ ⎟ .. ⎟ . . ⎠ Apq

The matrices Ahk are referred to as the blocks of the partitioning. All blocks located in a column must have the same number of columns; all blocks located in a row must have the same number of rows. Example 3.25. The matrix A ∈ F5×6 given by ⎞ ⎛ a11 a12 a13 a14 a15 a16 ⎟ ⎜a ⎜ 21 a22 a23 a24 a25 a26 ⎟ ⎟ ⎜ ⎟ A=⎜ ⎜a31 a32 a33 a34 a35 a36 ⎟ ⎟ ⎜a ⎝ 41 a42 a43 a44 a45 a46 ⎠ a51 a52 a53 a54 a55 a56

128

Linear Algebra Tools for Data Mining (Second Edition)

can be partitioned as ⎛

a11 ⎜a ⎜ 21 ⎜ ⎜a31 ⎜ ⎜a ⎝ 41 a51

Thus, if we introduce ⎛ a11 a12 ⎜ A11 = ⎝a21 a22 a31 a32  a41 a42 A21 = a51 a52

a12 a22 a32 a42 a52

a13 a23 a33 a43 a53

a14 a24 a34 a44 a54

a15 a25 a35 a45 a55

⎞ a16 a26 ⎟ ⎟ ⎟ a36 ⎟. ⎟ a46 ⎟ ⎠ a56

the matrices ⎛ ⎞ ⎛ ⎞ a13 a14 a15 ⎜ ⎟ ⎜ ⎟ a23 ⎠ , A12 = ⎝a24 ⎠ , A13 = ⎝a25 a33 a34 a35   a45 a45 a43 , A22 = , A23 = a53 a55 a55

⎞ a16 ⎟ a26 ⎠ , a36 a46 , a56

the matrix A can be written as  A11 A12 A13 . A= A21 A22 A23 Partitioning matrices is useful because matrix operations can be performed on block submatrices in a manner similar to scalar operations as we show next. Theorem 3.23. Let A ∈ Fm×n and B ∈ Fn×p be two matrices. Suppose that the matrices A, B are partitioned as ⎞ ⎛ ⎞ ⎛ B11 · · · B1 A11 · · · A1k ⎜ . ⎜ . . ⎟ . ⎟ ⎟ ⎜ ⎟ A=⎜ ⎝ .. · · · .. ⎠ and B = ⎝ .. · · · .. ⎠ , Ah1 · · · Ahk Bk1 · · · Bk where Ars ∈ Fmr ×ns , Bst ∈ Fns ×pt for 1  r  h, 1  s  k and 1  t  . Then, the product C = AB can be partitioned as ⎞ ⎛ C11 . . . C1 ⎜ . .. ⎟ ⎟ . C=⎜ ⎝ . ··· . ⎠, Ch1 · · · Chl where Cuv = kt=1 Aut Btv , 1  u  h, and 1  v  .

Matrices

129

Proof. Note that m1 + · · · + mh = m and p1 + · · · + p = p. For a pair (i, j) such that 1  i  m and 1  j  n, let u be the least number such that i  m1 + · · · + mu and let v be the least number such that j  p1 + · · · + pv . The definition of u and v implies m1 + · · · + mu−1 + 1  i ≤ m1 + · · · + mu and p1 + · · · + pv−1 + 1  j  p1 + · · · + pv . This implies that the cij element of the product is located in the submatrix Cuv = kt=1 Aut Btv of C. By the definition of the matrix product, we have cij =

n

aig bgj

g=1

=

n1

g=1

aig bgj +

n

1 +n2 g=n1 +1

aig bgj + · · · +

n1 +···+n

s

aig bgj .

g=n1 +···+nk−1 +1

Observe that the vectors (ai1 , . . . , ain1 ) and (b1j , . . . , bn1 j ) represent the line number i − (m1 + · · · + mu−1 + 1) and the column number j − (p1 + · · · + pv−1 + 1) of the matrix Au1 and B1v , etc. Similarly, (ai,n1 +···+nk−1 +1 , . . . , ai,n1 +···+ns ) and (bn1 +···+nk−1 +1,j , . . . , bn1 +···+ns ,j ) represent the line number i − (m1 + · · · + mu−1 + 1) and the column number j − (p1 + · · · + pv−1 + 1) of the matrix Auk and Bkv , which shows that cij is computed correctly as an element of the block  Cuv . 3.7

Change of Bases

˜ = {˜ ˜n } be two bases in Cn . e1 , . . . , e Let B = {e1 , . . . , en } and B These bases define the matrices ˜n ) e1 · · · e MB = (e1 · · · en ) and MB˜ = (˜ ˜ in Cn×n whose columns are the vectors of the bases B and B, respectively.

Linear Algebra Tools for Data Mining (Second Edition)

130

As vectors of each base can be expressed in terms of the other base, we can write ei =

n

˜j , for 1  i  n, cij e

j=1

˜h = e

n

dhk ek , for 1  h  n.

k=1

In matrix formulation the previous equalities can be written as MB = CMB˜ and MB˜ = DMB . Theorem 3.24. The matrices C = (cij ) and D = (dhk ) defined above are inverse to each other. Proof.

We have  n  n n n

n





˜j = cij e cij djk ek = cij djk ek ei = j=1

j=1

j=1 k=1

k=1

for 1  i  n. Equating the coefficients of vectors of B yields the following equalities: n

cij dji = 1, and

j=1

n

cij djk = 0 if k = i.

j=1

If P = CD, these equalities amount to pii = 0 and pik = n j=1 cij djk = 0 if i = k, which means that P = In . Thus, the matrices C and D are inverse to each other.  n ˜ be the expression of the vector x in the basis Let x = i=1 x ˜e n i i ˜ ˜ B. Since ei = k=1 dik ek , it follows that x=

n

˜i = x ˜i e

i=1

n

j=1

xj ej =

n

xj

j=1

n

i=1

˜i = cji e

n n



˜i , xj cji e

i=1 j=1

which implies x ˜i =

n

cji xj

(3.4)

j=1

˜ x for 1  i  n. The components of x in the new basis B, ˜i can be expressed via the components of x in the old basis B using the

Matrices

131

matrix C (that was used to express the old basis B in terms of the ˜ as MB = CM ˜ ). This justifies the term contravariant new basis B B components applied to the components xk because the bases and the components transform in opposite directions. Similarly, we can write x=

n

j=1

xj ej =

n

˜i = x˜i e

i=1

n

x˜i

i=1

n

dij ej =

j=1

n

n

x˜i dij ej ,

j=1 i=1

hence xj =

n

x˜i dij ej

(3.5)

i=1

for 1  j  n. The effects of a change of basis on covectors have been discussed in Section 2.8.

3.8

Matrices and Bilinear Forms

The link between matrices and bilinear form is discussed next. Let C ∈ Rm×n be a matrix. If x ∈ Rm and y ∈ Rn , the function fC : L × M −→ R defined by fC (x, y) = x Cy can be easily seen to be bilinear. The next theorem shows that all bilinear functions between two finite-dimensional spaces can be defined in this manner. Theorem 3.25. Let V, W be two finite-dimensional R-linear spaces. If f : V × W −→ R is a bilinear form, then there is a matrix Cf ∈ Rm×n such that f (x, y) = x Cf y for all x ∈ V and y ∈ W . Proof. Suppose that B = {x1 , . . . , xm } and B  = {y 1 , . . . , y n } are bases in V and W , respectively. Let x = a1 x1 + · · · + am xm be the expression of x ∈ V in the base B. Similarly, let y = b1 y 1 +· · ·+bn y n

132

Linear Algebra Tools for Data Mining (Second Edition)

be the expression of y ∈ W in B  . The bilinearity of f implies ⎞ ⎛ m n



ai x i , bj y j ⎠ f (x, y) = f ⎝ i=1

=

n m



j=1

ai bj f (xi , y j ).

i=1 j=1

If Cf is the matrix



⎞ f (x1 , y 1 ) · · · f (x1 , y n ) ⎜ ⎟ .. .. .. ⎟, Cf = ⎜ . . . ⎝ ⎠ f (xm , y 1 ) · · · f (xm , y n )

then f (x, y) = x Cf y, where ⎛ ⎞ ⎛ ⎞ b1 a1 ⎜.⎟ ⎜ . ⎟ ⎟ ⎜ ⎟ x=⎜ ⎝ .. ⎠ and y = ⎝ .. ⎠ . am bn



Rn×n

Let A ∈ be a symmetric matrix. The quadratic form associated to the matrix A is the function fA : Rn −→ R defined as fA (x) = x Ax for x ∈ Rn . The polar form of the quadratic form fA is the bilinear form f˜A defined by f˜A (x, y) = x Ay for x, y ∈ Rn . Since x Ay and y  Ax are scalars, they are equal and we have fA (x + y) = (x + y) A(x + y) = x Ax + y  Ay + x Ay + y  Ax = fA (x) + fA (y) + 2f˜A (x, y), which allows us to express the polar form of fA as 1 f˜A (x, y) = (fA (x + y) − fA (x) − fA (y)). 2 If A ∈ Cn×n is a Hermitian matrix, the quadratic Hermitian form associated to A is the function fA : Cn −→ C defined as

Matrices

133

fA (x) = xH Ax for x ∈ Cn . Note that fA (x) = xsH AH x = xsH Ax = fA (x) for x ∈ Cn . This allows us to conclude that values of the quadratic Hermitian form associated to a Hermitian matrix are real numbers. Let us consider the values of a quadratic Hermitian form fA (x) when x = 1. Since fA (x) is a continuous function and the set {x | x = 1} is compact, the function fA attains its supremum. Thus, there exists z 1 ∈ Cn with z 1 = 1 such that fA (z 1 ) attains its maximum M1 . Consider again the maximization of fA (x) subjected to the restrictions x = 1 and (x, z 1 ) = 0. By the same reasoning as above, there exists a unit vector z 2 such that z 2 = 1 and z 2 ⊥ z 2 where the maximum M2 is attained, etc. We obtain an orthonormal sequence of vectors z 1 , z 2 , . . . , z n known as the sequence of principal directions of A. The matrix ⎛ ⎞ z1 ⎜ . ⎟ ⎜ . ⎟ ⎝ . ⎠ z n is obviously unitary. 3.9

Generalized Inverses of Matrices

The next theorem involves a generalization of the notion of inverse of a matrix that is applicable to rectangular matrices. Theorem 3.26. If A ∈ Cm×n , there exists at most one matrix M ∈ Cn×m such that the matrices M A and AM are Hermitian, AM A = A, and M AM = M . Proof. Suppose that both M and P satisfy the equalities of the theorem. We have P = P AP = P (AP ) = P (AP )H = P P H AH = P P H (AM A)H

134

Linear Algebra Tools for Data Mining (Second Edition)

= P P H AH M H AH = P P H AH (AM )H = P (AP )H (AM )H = P AP AM = P (AP A)M = P AM = P AM AM = (P A)H (M A)H M = AH P H AH M H M = (AP A)H M H M = AH M H M = (M A)H M = M AM = M. Thus, P = M and the uniqueness of M is proven.



The matrix M introduced by Theorem 3.26 is known as the Moore-Penrose pseudoinverse of A and is denoted by A† . Theorem 3.27. Let A ∈ Cn×n . If A is invertible, then A† exists and equals A−1 . Proof. If A is invertible, it is clear that the matrices AA−1 and A−1 A are Hermitian because both are equal to In . Furthermore, AA−1 A = A and A−1 AA−1 = A−1 , so A−1 = A† by  Theorem 3.26. The Moore–Penrose pseudoinverse of a matrix may exist even if † = On,m , as it can the inverse does not. For example, we have Om,n be easily seen. However, On,n is not invertible. The symmetry of the equalities defining the Moore Penrose pseudoinverse of a matrix show that (A† )† = A. 3.10

Matrices and Linear Transformations

Let h ∈ Hom(Cm , Cn ) be a linear transformation between the linear spaces Cm and Cn . Consider a basis in Cm , R = {r 1 , . . . , r m }, and a basis in Cn , S = {s1 , . . . , sn }. The function h is completely determined by the images of the elements of the basis R, that is, by the set {h(r 1 ), . . . , h(r m )}. If h(r j ) = a1j s1 + a2j s2 + · · · + anj sn =

n

i=1

aij si ,

Matrices

135

then, for x = x1 r 1 + · · · + xm rm , we have by linearity h(x) = x1 h(r 1 ) + · · · + xm h(r m ) =

m

xj h(r j )

j=1

=

n m



xj aij si .

j=1 i=1

In other words, we have aij = (h(r j ))i for 1  i  n and 1  j  m. In a more compact form, we can write ⎛ ⎞⎛ ⎞ a11 · · · a1m x1 ⎜ . ⎟ ⎜ . ⎟⎜ . ⎟ ⎟ h(x) = (s1 · · · sn ) ⎜ ⎝ .. · · · .. ⎠ ⎝ .. ⎠ . an1 · · · anm xm The image of a vector x = x1 r 1 + · · · + xm r m under h equals Ah x, where Ah is ⎛ ⎞ a11 · · · a1m ⎜ . ⎟ .. · · · ... ⎟ ∈ Cn×m . Ah = ⎜ ⎝ ⎠ an1 · · · anm Clearly, the matrix Ah attached to h : Cm −→ Cn depends on the bases chosen for the linear spaces Cm and Cn and (Ah )ij equals (h(r j ))i , the i th component of image h(r j ) of the basis vector r j . Let now h : Cn −→ Cn be an endomorphism of Cn and let R = {r 1 , . . . , r n } and S = {s1 , . . . , sn } be two bases of Cn . The vectors si can be expressed as linear combinations of the vectors r 1 , . . . , r n as follows: si = pi1 r 1 + · · · + pin r n

(3.6)

for 1  i  n, which implies h(si ) = pi1 h(r 1 ) + · · · + pin h(r n )

(3.7)

136

Linear Algebra Tools for Data Mining (Second Edition)

for 1  i  n. Therefore, the matrix associated to a linear form h : Cm −→ C is a column vector r. In this case we can write h(x) = r H x for x ∈ Rn . Theorem 3.28. Let h ∈ Hom(Cm , Cn ). The matrix Ah∗ ∈ Cm×n is the transpose of the matrix Ah , that is, we have Ah∗ = Ah . Proof. By the previous discussion, if 1 , . . . , n is a basis of the space (Cn )∗ , then the j th column of the matrix Ah∗ ∈ Cm×n is obtained by expressing the linear form h∗ (j ) = j h in terms of a basis in the dual space (Cm )∗ . Therefore, we need to evaluate the linear form j h ∈ (Cm )∗ . Let {p1 , . . . , pm } be a basis in Cm and let {g1 , . . . , gm } be its dual in (Cm )∗ . Also, let {q 1 , . . . , q n } be a basis in Cn , and let {1 , . . . , n } be its dual (Cn )∗ . Observe that if v ∈ Cm can be expressed as v = m j=1 vj pj , then ⎛ gp (v) = gp ⎝

m

⎞ vj pj ⎠ =

j=1

m

vj gp (pj ) = vp ,

j=1

because {g1 , . . . , gm } is the dual of {p1 , . . . , pm } in (Cm )∗ . On the other hand, we can write ⎞ ⎛ ⎛ ⎞ m m n



vp h(pp )⎠ = j ⎝ vp aip q i ⎠ j (h(v)) = j ⎝ p=1

p=1

i=1

⎛ ⎞ n n m

m



vp aip qi ⎠ = vp aip j (q i ) = j ⎝ p=1 i=1

=

m

p=1

vp ajp =

p=1 i=1 m

ajp gp (v).

p=1

Thus, h∗ (j ) = m p=1 ajp gp for every j, 1 ≤ j  m. This means that the j th column of the matrix Ah∗ is the transposed j th row of the  matrix Ah , so Ah∗ = (Ah ) . Matrix multiplication corresponds to the composition of linear mappings, as we show next.

Matrices

137

Theorem 3.29. Let h ∈ Hom(Cm , Cn ) and g ∈ Hom(Cn , Cp ). Then, Agh = Ag Ah . Proof. If p1 , . . . , pm is a basis for Cm , then Agh (pi ) = gh(pi ) = g(h(pi )) = g(Ah pi ) = Ag (Ah (pi )) for every i, where 1  i  n. This proves that Agh = Ag Ah .  Thus, if h is an idempotent endomorphism of a linear space, the matrix Ah is idempotent. The inverse direction, from matrices to linear operators, is introduced next. Definition 3.27. Let A ∈ Cn×m be a matrix. The linear operator associated to A is the mapping hA : Cm −→ Cn given by hA (x) = Ax for x ∈ Cm . If {e1 , . . . , em } is a basis for Cm , then hA (ei ) is the i th column of the matrix A. It is immediate that AhA = A and hAh = h. Attributes of a matrix A are usually transferred to the linear operator hA . For example, if A is Hermitian, we say that hA is Hermitian. Definition 3.28. Let A ∈ Cn×m be a matrix. The range of A is the subspace Im(hA ) of Cn . The null space of A is the subspace Ker(hA ). The range of A and the null space of A are denoted by range(A) and null(A), respectively. Clearly, CA,n = range(A). The null space of A ∈ Cm×n consists of those x ∈ Cn such that Ax = 0. Let {p1 , . . . , pm } be a basis of Cm . Since range(A) = Im(hA ), it follows that this subspace is generated by the set {hA (p1 ), . . . , hA (pm )}, that is, by the columns of the matrix A. For this reason, the subspace range(A) is also known as the column subspace of A. Theorem 3.30. Let A, B ∈ Cm×n be two matrices. Then range(A + B) ⊆ range(A) + range(B). Proof. Let u ∈ range(A + B). There exists v ∈ Cn such that u = (A + B)v = Av + Bv. If x = Av and y = Bv, we have x ∈ range(A)  and y ∈ range(B), so u = x + y ∈ range(A) + range(B).

138

Linear Algebra Tools for Data Mining (Second Edition)

Several facts concerning idempotent endomorphisms that were presented in Chapter 2 can now be formulated in terms of matrices. For example, Theorem 2.32 applied to Cn states that if A is an idempotent matrix, then Cn = null(A)  range(A). Conversely, by Theorem 2.33, if U and W are two subspaces of Cn such that Cn = U  W , then there exists an idempotent matrix A ∈ Cn×n such that U = null(A) and W = range(A). Multilinear functions can be represented by generalizations of matrices. Let V, W , and U be finite-dimensional linear spaces having the bases {v 1 , . . . , v n }, {w 1 , . . . , wm }, and {u1 , . . . , up }, respectively, and let f : V × W −→ U be a multilinear function. For v = a1 v 1 + · · · + an v n and w = b1 w1 + · · · + bm wm , by the multilinearity of f we can write   f (v, w) = f a1 v 1 + · · · + an v n , b1 w1 + · · · + bm wm =

n

m

ai bj f (vi , w j ).

i=1 j=1

Since f (vi , w j ) ∈ U , we can further write f (v i , wj ) =

p

ckij uk ,

k=1

for 1  i  n and 1  j  m. Thus, f (v, w) =

p n

m



ai bj ckij uk .

i=1 j=1 k=1

Thus, f is completely specified by nmp numbers ckij . Conversely, every set of nmp numbers specifies a multilinear function. 3.11

The Notion of Rank

Definition 3.29. The rank of a matrix A is the same number denoted by rank(A) given by rank(A) = dim(range(A)) = dim(Im(hA )).

Matrices

139

Thus, the rank of A is the maximal size of a set of linearly independent columns of A. Theorem 2.10 applied to the linear mapping hA : Cm −→ Cn means that for A ∈ Cn×m , we have dim(null(A)) + rank(A) = m.

(3.8)

Observe that if A ∈ Cm×m is non-singular, then Ax = 0m implies x = 0m . Thus, if x ∈ null(A) ∩ range(A), it follows that Ax = 0, so the subspaces null(A) and range(A) are complementary. Example 3.26. For the matrix ⎛

⎞ 1 0 2 ⎜1 −1 1⎟ ⎜ ⎟ A=⎜ ⎟, ⎝2 1 5⎠ 1 2 4 we have rank(A) = 2. Indeed, if c1 , c2 , c3 are its columns, then it is easy to see that {c1 , c2 } is a linearly independent set, and c3 = 2c1 + c2 . Thus, the maximal size of a set of linearly independent columns of A is 2. Example 3.27. Let A ∈ Cn×m and B ∈ Cp×q . For the matrix C ∈ C(n+p)×(m+q) defined by  C=

A On,q , Op,m B

we have rank(C) = rank(A) + rank(B). Suppose that rank(C) =  and let c1 , . . . , c be a maximal set of linearly independent columns of C. Without loss of generality we may assume that the first k columns are among the first m columns of A and the remaining  − k columns are among the last q columns of C. The first k columns of C correspond to k linearly independent columns of A, while the last −k columns correspond to −k linearly independent columns of B. Thus, rank(C) = k  rank(A) + rank(B). Conversely, suppose that rank(A) = s and rank(B) = t. Let ai1 , . . . , ais be a maximal set of linearly independent columns of A

Linear Algebra Tools for Data Mining (Second Edition)

140

and let bj1 , . . . , bjt be a maximal set of linearly independent columns of B. Then, it is easy to see that the vectors     ais 0n 0n a i1 ,··· , ,..., ,..., 0n 0n bj1 b jt constitute a linearly independent set of columns of C, so rank(A) + rank(B)  rank(C). Thus, rank(C) = rank(A) + rank(B). Example 3.28. Let x and y be two vectors in Cn − {0}. The matrix xy H has rank 1. Indeed, if y H = (y1 , y2 , . . . , yn ), then we can write xy H = (y1 x y2 x · · · yn x), which implies that the maximum number of linearly independent columns of xy H is 1. Example 3.29. Let A, B ∈ Cn×m . We have rank(A + B)  rank(A) + rank(B). Let A = (a1 a2 · · · am ) and B = (b1 b2 · · · bm ) be two matrices, where a1 , . . . , am , b1 , . . . , bm ∈ Cn . Clearly, we have A + B = (a1 + b1 a2 + b2 · · · am + bm ). If x ∈ Im(A + B), we can write x = x1 (a1 + b1 ) + x2 (a2 + b2 ) + · · · + xm (am + bm ) = y + z, where y = x1 a1 + · · · + xm am ∈ Im(A), z = x1 b1 + · · · + xm bm ∈ Im(B). Thus, Im(A + B) ⊆ Im(A) + Im(B). Since the dimension of the sum of two subspaces of a linear space is less or equal to the dimension of the sum of these subspaces, the result follows. For a matrix A ∈ Cn×m the range of the matrix A ∈ Cm×n is the subspace of Cn generated by the rows of the original matrix A and coincides with the subspace Im(h∗A ) of the dual mapping h∗A .

Matrices

141

By Theorem 2.48, we have rank(h∗A ) = rank(hA ), so the rank of the transposed matrix A is equal to rank(A). Thus, dim(null(A )) + rank(A) = n,

(3.9)

and the maximal size of a set of linearly independent rows of A coincides with the rank of A. The above discussion also shows that if A ∈ Cn×m , then rank(A)  min{m, n}. Theorem 3.31. Let A ∈ Cm×n be a matrix. We have rank(A) = rank(A). Proof. Suppose that A = (a1 , . . . , an ) and that the set {ai1 , . . . , aip } is a set of linearly independent columns of A. Then, the set {ai1 , . . . , aip } is a set of linearly independent columns of A. This implies rank(A) = rank(A).  Corollary 3.1. We have rank(A) = rank(AH ) for every matrix A ∈ Cm×n . Proof.

Since AH = A , the statement follows immediately.



Definition 3.30. A matrix A ∈ Cn×m is a full-rank matrix if rank(A) = min{m, n}. If A ∈ Cm×n is a full-rank matrix and m  n, then the n columns of the matrix are linearly independent; similarly, if n  m, the m rows of the matrix are linearly independent. A matrix that is not a full-rank is said to be degenerate. A degenerate square matrix is said to be singular. A non-singular matrix A ∈ Cn×n is a matrix that is not singular and, therefore, has rank(A) = n. Theorem 3.32. A matrix A ∈ Cn×n is non-singular if and only if it is invertible. Proof. Suppose that A is non-singular, that is, rank(A) = n. In other words, the set of columns {c1 , . . . , cn } of A is linearly independent, and therefore, is a basis of Cn . Then, each of the vectors ei can

142

Linear Algebra Tools for Data Mining (Second Edition)

be expressed as a unique combination of the columns of A, that is ei = b1i c1 + b2i c2 + · · · + bni cn , for 1  i  n. These equalities can ⎛ b11 ⎜b ⎜ 21 (c1 · · · cn ) ⎜ ⎜ .. ⎝ . bn1

be written as ⎞ · · · b1n · · · b2n ⎟ ⎟ ⎟ .. ⎟ = In . ··· . ⎠ · · · bnn

Consequently, the matrix A is invertible and ⎛ ⎞ b11 · · · b1n ⎜b ⎟ ⎜ 21 · · · b2n ⎟ −1 ⎜ ⎟ A =⎜ . .. ⎟ . . ⎝ . ··· . ⎠ bn1 · · · bnn Suppose now that A is invertible and that d1 c1 + · · · + dn cn = 0. This is equivalent to

⎞ d1 ⎜.⎟ ⎟ A⎜ ⎝ .. ⎠ = 0. dn ⎛

Multiplying both sides by A−1 implies ⎛ ⎞ d1 ⎜.⎟ ⎜ . ⎟ = 0, ⎝.⎠ dn so d1 = · · · = dn = 0, which means that the set of columns of A is  linearly independent, so rank(A) = n. Corollary 3.2. A matrix A ∈ Cn×n is non-singular if and only if Ax = 0 implies x = 0 for x ∈ Cn .

Matrices

143

Proof. If A is non-singular then, by Theorem 3.32, A is invertible. Therefore, Ax = 0 implies A−1 (Ax) = A−1 0, so x = 0. Conversely, suppose that Ax = 0 implies x = 0. If A = (c1 · · · cn ) and x = (x1 , . . . , xn ) , the previous implication means that x1 c1 + · · · + xn cn = 0 implies x1 = · · · = xn = 0, so {c1 , . . . , cn }  is linearly independent. Therefore, rank(A) = n, so A is non-singular. Let A ∈ Cn×m be a matrix. It is easy to see that the square matrix B = AH A ∈ Cm×m is symmetric. Theorem 3.33. Let A ∈ Cn×m be a matrix and let B = AH A. The matrices A and B have the same rank. Proof. We prove that null(A) = null(B). If Au = 0, then Bu = AH (A0) = 0, so null(A) ⊆ null(B). If v ∈ null(B), then AH Av = 0, which implies that v H AH Av = 0. This, in turn can be written as (Av)H (Av) = 0, so, by Theorem 3.11, we have Av = 0, which means that v ∈ null(A). We conclude that null(A) = null(AH A). The equalities dim(null(A)) + rank(A) = m, dim(null(A)) + rank(AH A) = m, imply that rank(AH A) = m.



Corollary 3.3. Let A ∈ Cn×m be a matrix of full-rank. If m  n, then the matrix AH A is non-singular; if n  m, then AAH is nonsingular. Proof. Suppose that m  n. Then, rank(AH A) = rank(A) = m because A is a full-rank matrix. Thus, AH A ∈ Cm×m is non-singular.  The argument for the second part of the corollary is similar. Example 3.30. Let A = (a1 · · · am ) ∈ Cn×m . Since AAH = a1 aH1 + · · · + am aHm , it follows that the rank of the matrix a1 aH1 + · · · + am aHm equals the rank of the matrix A and, therefore, it cannot exceed m. Theorem 3.34 (Sylvester’s rank theorem). Let A ∈ Cm×n and B ∈ Cn×p be two matrices. We have rank(AB) = rank(B) − dim(null(A) ∩ range(B)).

144

Linear Algebra Tools for Data Mining (Second Edition)

Proof. Both null(A) and range(B) are subspaces of Cn . Therefore, null(A) ∩ range(B) is a subspace of Cn . If u1 , . . . , uk is a basis of the subspace null(A) ∩ range(B), then there exists a basis u1 , . . . , uk , uk+1 , . . . , ul of the subspace range(B). The set {Auk+1 , . . . , Aul } is linearly independent. Indeed, suppose that there exists a linear combination a1 Auk+1 + · · · + al−k Aul = 0. Then, A(a1 uk+1 +· · ·+al−k ul ) = 0, so a1 uk+1 +· · ·+al−k ul ∈ null(A). Since uk+1 , . . . , ul ∈ range(B), it follows that a1 uk+1 +· · ·+al−k ul ∈ null(A) ∩ range(B). Since u1 , . . . , uk is a basis of the subspace null(A) ∩ range(B), we have a1 uk+1 + · · · + al−k ul = d1 u1 + · · · + dk uk for some d1 , . . . , dk ∈ C, which implies a1 uk+1 + · · · + al−k ul − d1 u1 − · · · − dk uk = 0. Since u1 , . . . , uk , uk+1 , · · · , ul is a basis of range(B), it follows that a1 = · · · = al−k = d1 = · · · = dk = 0, so Auk+1 , . . . , Aul is indeed linear independent. Next, we show that Auk+1 , . . . , Aul spans the subspace range(AB). Since uj ∈ range(B), it is clear that Auj ∈ range(AB) for k + 1  j  l. If w ∈ range(AB), then w = ABx for some x ∈ Cp . Since Bx ∈ range(B), we can write Bx = b1 u1 + · · · + bk uk + bk+1 uk+1 + · · · + bl ul , which implies w = ABx = bk+1 Auk+1 + · · · + bl Aul , because Au1 = · · · = Auk = 0, as u1 , . . . , uk belong to null(A). Thus, Auk+1 , . . . , Aul spans the subspace range(AB), which allows us to conclude that this linearly independent set is a basis for this subspace that contains l−k elements. This allows us to conclude that rank(AB) = dim(range(AB)) = rank(B) − dim(null(A) ∩ range(B)). Corollary 3.4. Let A ∈ Cm×n . If R ∈ Cm×m and Q ∈ Cn×n are invertible matrices, then rank(A) = rank(RA) = rank(AQ) = rank(RAQ).

Matrices

145

Proof. Note that rank(R) = m and rank(Q) = n. Thus, null(R) = {0m } and null(Q) = {0n }. By Sylvester’s rank theorem we have rank(RA) = rank(A) − dim(null(R) ∩ range(A)) = rank(A) − dim({0}) = rank(A). On the other hand, we have rank(AQ) = rank(Q) − dim(null(A) ∩ range(Q)) = n − dim(null(A)) = rank(A), because range(Q) = Cn . The last equality of the theorem follows from the first two.



Corollary 3.5 (Frobenius2 inequality). Let A ∈ Cm×n , B ∈ Cn×p , and C ∈ Cp×q be three matrices. Then, rank(AB) + rank(BC)  rank(B) + rank(ABC). Proof.

By Sylvester’s rank theorem, we have rank(ABC) = rank(BC) − dim(null(A) ∩ range(BC)), rank(AB) = rank(B) − dim(null(A) ∩ range(B)).

These equalities imply rank(ABC) + rank(B) = rank(AB) + rank(BC) − dim(null(A) ∩ range(BC)) + dim(null(A) ∩ range(B)). Since null(A) ∩ range(BC) ⊆ null(A) ∩ range(B), we have dim(null(A) ∩ range(BC))  dim(null(A) ∩ range(B)), which implies the desired inequality. 2



Ferdinand Georg Frobenius was born on October 26, 1849 in CharlottenburgBerlin and died on August 3, 1917 in Berlin. He studied at the University of G¨ ottingen and Berlin and taught at the University of Berlin and ETH in Z¨ urich. Frobenius has contributed to the study of elliptic functions, algebra, and mathematical physics.

Linear Algebra Tools for Data Mining (Second Edition)

146

Corollary 3.6. Let A ∈ Cm×n and B ∈ Cn×p be two matrices. We have dim(null(AB)) = dim(null(B)) + dim(null(A) ∩ range(B)). Proof.

By Equality (3.8), we have dim(null(AB)) + rank(AB) = p, dim(null(B)) + rank(B) = p.

An application of Sylvester’s rank theorem implies dim(null(AB)) = dim(null(B)) + dim(null(A) ∩ range(B)).



Corollary 3.7. Let A ∈ Cm×n and B ∈ Cn×p be two matrices. We have rank(A)+rank(B)−n  rank(AB)  min{rank(A), rank(B)}, (3.10) and max{dim(null(A)), dim(null(B))} ≤ dim(null(AB)) ≤ dim(null(A)) + dim(null(B)). Proof. Since dim(null(A)∩rank(B))  dim(null(A)) = n−rank(A), it follows that rank(AB)  rank(B) − (n − rank(A)) = rank(A) + rank(B) − n. For the second inequality, observe that Sylvester’s rank Theorem, implies immediately rank(AB)  rank(B). Also, rank(AB) = rank((AB) ) = rank(B  A )  rank(A ) = rank(A), so rank(AB) ≤ min{rank(A), rank(B)}. The second part of the Corollary follows from the first part.  Inequality (3.10) is also known as the Sylvester rank inequality. Corollary 3.8. If A ∈ Cm×n is a full-rank matrix with m  n, then rank(AB) = rank(B) for any B ∈ Cn×p . Proof. Since m  n, we have rank(A) = n; therefore, the n columns of A are linearly independent so null(A) = {0}. By  Sylvester’s Rank theorem, we have rank(AB) = rank(B).

Matrices

147

Theorem 3.35 (The full-rank factorization theorem). Let A ∈ Cm×n be a matrix with rank(A) = r > 0. There exists B ∈ Cm×r and C ∈ Cr×n such that A = BC. Furthermore, if A = DE, where D ∈ Cm×r , E ∈ Cr×n , then both D and E are full-rank matrices, that is, we have rank(D) = rank(E) = r. Proof. Let {b1 , . . . , br } ⊆ Cm be a basis for the range(A). Define B = (b1 · · · br ) ∈ Cm×r . The columns of A, a1 , . . . , an can be written as ai = c1i b1 + · · · cri br for 1  i  n, which amounts to ⎛ ⎞ c11 · · · c1r ⎜ . . ⎟ ⎟ A = (a1 · · · an ) = (b1 · · · br ) ⎜ ⎝ .. · · · .. ⎠ . cr1 · · · cr Thus, A = BC, where

⎞ c11 · · · c1r ⎜ . . ⎟ ⎟ C=⎜ ⎝ .. · · · .. ⎠ . cr1 · · · cr ⎛

Suppose now that A = DE, where D ∈ Cm×r , E ∈ Cr×n . It is clear that we have both rank(D)  r and rank(E)  r. On the other hand, by Corollary 3.7, r = rank(A) = rank(DE)  min{rank(D), rank(E)} implies r  rank(D) and r ≤ rank(E), so rank(D) = rank(E) = r.  Corollary 3.9. Let A ∈ Cm×n be a matrix such that rank(A) = r > 0, and let A = BC be a full-rank factorization of A. If the columns of B constitute a basis of the column space of A, then C is uniquely determined. Furthermore, if the rows of C constitute a basis of the row space of A, then B is uniquely determined. Proof. This statement is an immediate consequence of the full rank factorization theorem. Corollary 3.10. If A ∈ Cm×n is a matrix with rank(A) = r > 0, then A can be written as A = b1 c1 + · · · + br cr , where {b1 , . . . , br } ⊆ Cm and {c1 , . . . , cr } ⊆ Cn are linearly independent sets.

148

Linear Algebra Tools for Data Mining (Second Edition)

Proof. The corollary follows from Theorem 3.35 by adopting the set of columns of B as {b1 , . . . , br } and the transposed rows of C as  {c1 , . . . , cr }. Theorem 3.36. Let A ∈ Cm×n be a full-rank matrix. If m  n, then there exists a matrix D ∈ Cn×m such that DA = In . If n  m, then there exists a matrix E ∈ Cn×m such that AE = Im . Proof. Suppose that A = (a1 · · · an ) ∈ Cm×n is a full-rank matrix and m  n. Then, the n columns of A are linearly independent and, by Corollary 2.3, we can extend the set of columns to a basis of Cm , {a1 , . . . , an , d1 , . . . , dm−n }. The matrix T = (a1 · · · an d1 · · · dm−n ) is invertible, so there exists ⎛ ⎞ t1 ⎜ . ⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ −1 t T = ⎜ n ⎟, ⎜ ⎟ ⎜ ..⎟ ⎝tn+1 .⎠ tm such that T −1 T = Im . If we define ⎛ ⎞ t1 ⎜.⎟ ⎟ D=⎜ ⎝ .. ⎠ , tn it is immediate that DA = In . The argument for the second part is similar.



Definition 3.31. Let A ∈ Cm×n . A left inverse of A is a matrix D ∈ Cn×m such that DA = In . A right inverse of A is a matrix E ∈ Cn×m such that AE = Im . Theorem 3.36 can now be restated as follows. Let A ∈ Cm×n be a full-rank matrix. If m  n, then A has a left inverse; if n  m, then A has a right inverse. Corollary 3.11. Let A ∈ Cn×n be a square matrix. The following statements are equivalent.

Matrices

149

(i) A has a left inverse; (ii) A has a right inverse; (iii) A has an inverse. Proof. It is clear that (iii) implies both (i) and (ii). Suppose now that A has a left inverse, so DA = In . Then, the columns of A, c1 , . . . , cn are linearly independent, for if a1 c1 + · · · an cn = 0, we have a1 Dc1 + · · · + an Dcn = a1 e1 + · · · + an cn = 0, which implies a1 = · · · = an = 0. Thus, rank(A) = n, so A has an inverse. In a similar manner (using the rows of A), we can show that (ii)  implies (iii). Theorem 3.37. Let A ∈ Cm×n be a matrix with rank(A) = r > 0. There exists a non-singular matrix G ∈ Cm×m and a non-singular matrix H ∈ Cn×n such that   Or,n−r Ir H. A=G Om−r,r Om−r,n−r Proof. By the Full-Rank factorization theorem (Theorem 3.35), there are two full-rank matrices B ∈ Cm×r and C ∈ Cr×n such that A = BC. Let {b1 , . . . , br } be the columns of B and let c1 , . . . , cr be the rows of C. It is clear that both sets of vectors are linearly independent and, therefore, for the first set, there exist br+1 , . . . , bm such that {b1 , . . . , bm } is a basis of Cm ; for the second set, we have the vectors cr+1 , . . . , cn such that {c1 , . . . , cn } is a basis for Rn . Define G = (b1 , . . . , bm ) and ⎛ ⎞ c1 ⎜.⎟ ⎟ H=⎜ ⎝ .. ⎠ . cn Clearly, both G and H are non-singular and  Or,n−r Ir H. A=G Om−r,r Om−r,n−r



Next we examine the relationships between full-rank factorization and the Moore–Penrose pseudoinverse.

Linear Algebra Tools for Data Mining (Second Edition)

150

Theorem 3.38. Let A ∈ Cm×n be a matrix with rank(A) = r > 0 and let A = BC be a full-rank factorization of A, where B ∈ Cm×r and C ∈ Cr×n are full-rank matrices. The following statements hold: (1) the matrices B H B ∈ Cr×r and CC H ∈ Cr×r are non-singular; (2) the matrix B H AC H is non-singular; (3) the Moore–Penrose pseudoinverse of A is given by A† = C H (CC H )−1 (B H B)−1 B H . Proof. The first part of the theorem follows from Corollary 3.3. For Part (ii), note that B H AC H = B H (BC)C H = (B H B)(CC H ), so B H AC H is non-singular as a product of two non-singular matrices. For Part (iii), observe that (B H AC H )−1 = ((B H B)(CC H ))−1 = (CC H )−1 (B H B)−1 . Therefore, C H (B H AC H )−1 B H = C H (CC H )−1 (B H B)−1 B H , and it is easy to verify that the matrix C H (CC H )−1 (B H B)−1 B H satis fies the conditions of Theorem 3.26. Lemma 3.1. If A ∈ Cm×n is a matrix and x ∈ Cm , y ∈ Cn are two vectors such that xH Ay = 0, then rank(AyxH A) = 1. Proof. By the associative property of matrix product, we have AyxH A = A(yxH )A, so rank(AyxH A)  min{rank(yxH , rank(A)} = 1, by Corollary 3.7. We claim that AyxH A = Om,n . Suppose that AyxH A = Om,n . This implies xH AyxH Ay = 0. If z = xH Ay, the previous equality amounts to z 2 = 0, which yields z = xH Ay = 0. This contradicts the hypothesis of the lemma, so AyxH A = Om,n , which implies rank(AyxH A)  1. This allows us to conclude that  rank(AyxH A) = 1. The rank-1 matrix AyxH A discussed in Lemma 3.1 plays a central role in the next statement.

Matrices

151

Theorem 3.39 (Wedderburn’s theorem). Let A ∈ Cm×n be a matrix. If x ∈ Cm and y ∈ Cn are two vectors such that xH Ay = 0 and B is the matrix B =A−

1 AyxH A, x Ay H

then rank(B) = rank(A) − 1. Proof. have

Observe that if z ∈ null(A), then Az = 0. Therefore, we

Bz = −

1 AyxH Az = 0, x Ay H

so null(A) ⊆ null(B). Conversely, if z ∈ null(B), we have Az −

1 AyxH Az = 0, xH Ay

which can be written as Az = =

1 Ay(xH Az) x Ay H

xH Az Ay. xH Ay

Thus, we obtain A(z − ky) = 0, where k=

xH Az . xH Ay

Since Ay = 0, this shows that a basis of null(B) can be obtained by adding y to a basis of null(A). Therefore, dim(null(B)) =  dim(null(A)) + 1, so rank(B) = rank(A) − 1.

Linear Algebra Tools for Data Mining (Second Edition)

152

Theorem 3.40. A square matrix A ∈ Cn×n generates an increasing sequence of null spaces {0} = null(A0 ) ⊆ null(A1 ) ⊆ · · · ⊆ null(Ak ) ⊆ · · · and a decreasing sequence of subspaces Cn = range(A0 ) ⊇ range(A1 ) ⊇ · · · ⊇ range(Ak ) ⊇ · · ·

Furthermore, there exists a number  such that null(A0 ) ⊂ null(A1 ) ⊂ · · · ⊂ null(A ) = null(A+1 ) = · · · and range(A0 ) ⊃ range(A1 ) ⊃ · · · ⊃ range(A ) = range(A+1 ) = · · · Proof. The proof of the existence of the increasing sequence of null subspaces and the decreasing sequence of ranges is immediate. Since null(Ak ) ⊆ Cn for every k, there exists a least number p such that range(Ap ) = range(Ap+1 ). Therefore, range(Ap+i ) = Ai range(Ap ) = Ai range(Ap+1 ) = range(Ap+i+1 ) for every i ∈ N. Thus, once two consecutive subspaces range(A ) and range(A+1 ) are equal, the sequence of range subspaces stops growing. By Equality (3.8), we have dim(range(Ak )) + dim(null(Ak )) = n,  so the sequence of null spaces stabilizes at the same number . Definition 3.32. The index of a square matrix A ∈ Cn×n is the number  defined in Theorem 3.40. We denote the index of a matrix A ∈ Cn×n by index(A). Observe that if A ∈ Cn×n is a non-singular matrix, then index(A) = 0 because in this case Cn = range(A0 ) = range(A). Theorem 3.41. Let A ∈ Cn×n be a square matrix. The following statements are equivalent: (i) range(Ak ) ∩ null(Ak ) = {0}; (ii) Cn = range(Ak )  null(Ak ); (iii) k  index(A). Proof. We prove this theorem by showing that (i) and (ii) are equivalent, (i) implies (iii), and (iii) implies (ii).

Matrices

153

Suppose that the first statement holds. By Theorem 2.35, the set T = {t ∈ V | t = u + v, u ∈ range(Ak ), v ∈ null(Ak )} is a subspace of Cn and dim(T ) = dim(range(Ak )) + dim(null(Ak )) = n. Therefore, T = Cn , so Cn = range(Ak )  null(Ak ). The second statement clearly implies the first. Suppose now that Cn = range(Ak )  null(Ak ). Then range(Ak ) = Ak Cn = Arange(Ak ) = range(Ak+1 ), so k  index(A). Conversely, if k  index(A) and x ∈ range(Ak ) ∩ null(Ak ), then x = Ak y and Ak x = 0, so A2k y = 0. Thus, y ∈ null(A2k ) = null(Ak ), which means that x = Ak y = 0. Thus, the first statement holds.  The notion of spark of linear mappings can be transferred to matrices. Definition 3.33. Let A ∈ Cm×n be a matrix. The spark of A is the minimum size of a set of columns that is linearly dependent. If the set of columns of A is linearly independent, then spark(A) = n + 1. Note that for A ∈ Cm×n , we have 1  spark(A)  n + 1. 3.12

Matrix Similarity and Congruence

Define the similarity relation “∼” on the set of square matrices Cn×n by A ∼ B if there exists an invertible matrix X such that A = XBX −1 . If X is a unitary matrix, then we say that A and B are unitarily similar and we write A ∼u B, so ∼u is a subset of ∼. In this case, we have A = XBX H . Theorem 3.42. The relations “∼” and “∼u ” are equivalence relations. Proof. We have A ∼ A because A = In A(In )−1 , so ∼ is a reflexive relation. To prove that ∼ is symmetric suppose that A = XBX −1 . Then, B = X −1 AX and, since X −1 is invertible, we have B ∼ A.

154

Linear Algebra Tools for Data Mining (Second Edition)

Finally, to verify the transitivity, let A, B, C be such that A = XBX −1 and B = Y CY −1 , where X and Y are two invertible matrices. This allows us to write A = XBX −1 = XY CY −1 X −1 = (XY )C(XY )−1 , which proves that A ∼ C. We leave to the reader the similar proof concerning ∼u .



Theorem 3.43. If A ∼u B, where A, B ∈ Cn×n , then AH A ∼u B H B. Proof. Since A ∼u B, there exists a unitary matrix X such that A = XBX −1 = XBX H . Then, AH = XB H X H , so AH A = XB H X H XBX H = XB H BX H . Thus, AH A is unitarily similar to B H B. 

The similarity relation can be extended to sets of rectangular matrices as follows. Definition 3.34. Let A, B be two matrices in Cm×n . Then, A and B are similar (written A ∼ B) if there exist two non-singular matrices G ∈ Cm×m and H ∈ Cn×n such that A = GBH. It is easy to verify that ∼ is an equivalence relation on Cm×n . Theorem 3.44. Let A and B be two matrices in Cm×n . We have A ∼ B if and only if rank(A) = rank(B). Proof. By Theorem 3.37, if A ∈ Cm×n is a matrix with rank(A) = r > 0, then  Or,n−r Ir . A∼ Om−r,r Om−r,n−r Thus, for every two matrices A, B ∈ Cn×m of rank r, we have A ∼ B because both are similar to  Or,n−r Ir . Om−r,r Om−r,n−r

Matrices

155

Conversely, suppose that A ∼ B, that is, A = GBH, where G ∈ Cm×m and H ∈ Cn×n are non-singular matrices. By Corollary 3.4,  we have rank(A) = rank(B). Definition 3.35. A matrix A ∈ Cn×n is diagonalizable if there exists a diagonal matrix D such that A ∼ D. Let M be a class of matrices. A is M-diagonalizable if there exists a matrix M ∈ M such that A = M DM −1 . For example, if A is M-diagonalizable and M is the class of unitary matrices, we say that A is unitarily diagonalizable. Let f : Cn −→ C be a polynomial given by f (z) = a0 z n + a1 z n−1 + · · · + an , where a0 , a1 , . . . , an ∈ C. If A ∈ Cm×m , then the matrix f (A) is defined by f (A) = a0 An + a1 An−1 + · · · + an Im . Theorem 3.45. If T ∈ Cm×m is an upper (a lower) triangular matrix and f is a polynomial, then f (T ) is an upper (a lower) triangular matrix. Furthermore, if the diagonal elements of T are t11 , t22 , . . . , tmm , then the diagonal elements of f (T ) are f (t11 ), f (t22 ), . . . , f (tmm ), respectively. Proof. By Theorem 3.12, any power T k of T is an upper (a lower) triangular matrix. Since the sum of upper (lower) triangular matrices is upper (lower) triangular, if follows that f (T ) is an upper triangular (a lower triangular) matrix. An easy argument by induction on k (left to the reader) shows that if the diagonal elements of T are t11 , t22 , . . . , tmm , then the diagonal elements of T k are tk11 , tk22 , . . . , tkmm . The second part of the theorem  follows immediately. Theorem 3.46. Let A, B ∈ Cm×m . If A ∼ B and f is a polynomial, Then f (A) ∼ f (B).

156

Linear Algebra Tools for Data Mining (Second Edition)

Proof. Let X be an invertible matrix such that A = XBX −1 . It is straightforward to verify that Ak = XB k X −1 for k ∈ N. This implies  that f (A) = Xf (B)X −1 , so f (A) ∼ f (B). Then f (A) ∼ f (B). Definition 3.36. Let A and B be two matrices in Cn×n . The matrices A and B are congruent if there exists an invertible matrix X ∈ Cn×n such that B = XAX H . This is denoted by A ∼H B. The relation ∼H is an equivalence on Cn×n . We have A ∼H A because A = In AInH . If A ∼H B, then B = XAX H , so A = X −1 B(X H )−1 = X −1 B(X −1 )H , which implies B ∼H A. Finally, ∼H is transitive because if B = XAX H and C = Y BY H , where X and Y are invertible matrices, then C = (Y X)A(Y X)H and Y X is an invertible matrix. It is immediate that any two congruent matrices have the same rank. To recapitulate the definitions of the important similarity relations discussed in this section, consider the following list: (i) A and B are similar matrices, A ∼ B, if there exists an invertible matrix X such that A = XBX −1 ; (ii) A and B are congruent matrices, A ∼H B, if there exists an invertible matrix X ∈ Cn×n such that B = XAX H ; (iii) A and B are unitarily similar, A ∼u B, if there exists a unitary matrix U such that A = U BU −1 . Since every unitary matrix is invertible and its inverse equals its conjugate Hermitian matrix, it follows that ∼u is a subset of both ∼ and ∼H . 3.13

Linear Systems and LU Decompositions

Consider the following set of linear equalities a11 x1 + . . . + a1n xn = b1 , a21 x1 + . . . + a2n xn = b2 , .. .. . . am1 x1 + . . . + amn xn = bm ,

Matrices

157

where aij and bi belong to a field F . This set constitutes a system of linear equations. Solving this system means finding x1 , . . . , xn that satisfy all equalities. The system can be written succinctly in a matrix form as Ax = b, where ⎞ ⎛ ⎛ ⎞ a11 · · · a1n b1 ⎟ ⎜a ⎜ b2 ⎟ ⎜ 21 · · · a2n ⎟ ⎜ ⎟ ⎟ A=⎜ . ⎟, .. ⎟ , b = ⎜ ⎜ .. ⎝ .. ⎠ ⎝ . ··· . ⎠ bm ··· a a m1

and

mn

⎞ x1 ⎜ x2 ⎟ ⎜ ⎟ x = ⎜ .. ⎟ . ⎝ . ⎠ ⎛

xn If the set of solutions of a system Ax = b is not empty, we say that the system is consistent. Note that Ax = b is consistent if and only if b ∈ range(A). Let Ax = b be a linear system in matrix form, where A ∈ Cm×n . The matrix [A b] ∈ Cm×(n+1) is the augmented matrix of the system Ax = b. Theorem 3.47. Let A ∈ Cm×n be a matrix and let b ∈ Cn×1 . The linear system Ax = b is consistent if and only if rank(A b) = rank(A). Proof. If Ax = b is consistent and x = (x1 , . . . , xn ) is a solution of this system, then b = x1 c1 + · · · + xn cn , where c1 , . . . , cn are the columns of A. This implies rank([A b]) = rank(A). Conversely, if rank(A b) = rank(A), the vector b is a linear combination of the columns of A, which means that Ax = b is a consistent  system. Definition 3.37. A homogeneous linear system is a linear system of the form Ax = 0m , where A ∈ Cm×n , x ∈ Cn,1 , and 0 ∈ Cm×1 . Clearly, any homogeneous system Ax = 0m has the solution x = 0n . This solution is referred to as the trivial solution. The set of solutions of such a system is null(A), the null space of the matrix A.

158

Linear Algebra Tools for Data Mining (Second Edition)

Let u and v be two solutions of the system Ax = b. Then A(u − v) = 0m , so z = u − v is a solution of the homogeneous system Ax = 0m , or z ∈ null(A). Thus, the set of solutions of Ax = b can be obtained as a “translation” of the null space of A by any particular solution of Ax = b. In other words, the set of solutions of Ax = b is {x + z | z ∈ null(A)}. Thus, for A ∈ Cm×n , the system Ax = b has a unique solution if and only if null(A) = {0n }, that is, according to Equality (3.8), if rank(A) = n. Theorem 3.48. Let A ∈ Cn×n . Then, A is invertible (which is to say that rank(A) = n) if and only if the system Ax = b has a unique solution for every b ∈ Cn . Proof. If A is invertible, then x = A−1 b, so the system Ax = b has a unique solution. Conversely, if the system Ax = b has a unique solution for every b ∈ Cn , let c1 , . . . , cn be the solution of the systems Ax = e1 , . . . , Ax = en , respectively. Then, we have A(c1 | · · · |cn ) = In , which shows that A is invertible and A−1 = (c1 | · · · |cn ).



Corollary 3.12. A homogeneous linear system Ax = 0, where A ∈ Cn×n has a non-trivial solution if and only if A is a singular matrix. Proof.

This statement follows from Theorem 3.48.



Thus, by calculating the inverse of A we can solve any linear system of the form Ax = b. In Chapter 5, we discuss this type of calculation in detail. Definition 3.38. A matrix A ∈ Cn×n is diagonally dominant if |aii | > {|aik | | 1  k  n and k = i}. Theorem 3.49. A diagonally dominant matrix is non-singular. Proof. Suppose that A ∈ Cn×n is a diagonally dominant matrix that is singular. By Corollary 3.12, the homogeneous system Ax = 0 has a non-trivial solution x = 0. Let xk be a component of x that

Matrices

159

has the largest absolute value. Since x = 0, we have |xk | > 0. We can write

{akj xj | 1  j  n and j = k}, akk xk = − which implies



   {akj xj | 1  j  n and j = k} |akk | |xk | =  

  {|akj | |xj | | 1  j  n and j = k}

{|akj | | 1  j  n and j = k}.  |xk |

Thus, we obtain |akk | 

{|akj | | 1  j  n and j = k},

which contradicts the fact that A is diagonally dominant.



Definition 3.39. The sparsity of a vector v ∈ Cn is the number sparse(v) of components of v equal to 0. Let ν0 : Rn −→ R be the function defined by ν0 (x) = {i | 1  i  n, xi = 0}, which gives the number of non-zero components of x. It is immediate to verify that ν0 (x + y)  ν0 (x) + ν0 (y) for x, y ∈ Rn . Note that ν0 (v) = n − sparse(v). If x ∈ null(A), then ν0 (x)  spark(A) because vectors in this subspace are linear combinations of columns of A that equal 0n and at least spark(A) columns are needed to produce the vector 0n . The following result was obtained in [43]: Theorem 3.50. If a linear system Ax = b has a solution x such that ν0 (x) < 12 spark(A), then x is a sparsest solution of the system. Proof. Let y be a solution of the same system. We have Ax−Ay = A(x −y) = 0n . By the definition of spark(A) we have ν0 (x)+ν0 (y)  ν0 (x − y)  spark(A). Thus, the number of non-zero components of the vector x − y cannot exceed the sum of the number of non-zero components within each of the vectors x and y. Since x0 satisfies ν0 (x0 ) < spark(A)/2, it follows that any other solution has more  than spark(A)/2 non-zero components.

160

3.14

Linear Algebra Tools for Data Mining (Second Edition)

The Row Echelon Form of Matrices

We begin with a class of linear systems that can easily be solved. Definition 3.40. A matrix C ∈ Cm×n is in row echelon form if the following conditions are satisfied: (i) rows that contain nonzero elements precede zero rows (that is, rows that contain only zeros); (ii) if cij is the first nonzero element of the row i, all elements in the j th column located below cij , that is, entries of the form ckj with k > j are zero (see Figure 3.1); (iii) if i < , ciji is the first non-zero element of the row i, and cj is the first non-zero element of the row , then ji < j (see Figure 3.2). The first non-zero element of a row i (if it exists) is called the pivot of the row i.

Fig. 3.1

Condition (ii) of Definition 3.40.

Fig. 3.2

Condition (iii) of Definition 3.40.

Matrices

161

Example 3.31. Let C ∈ R4×5 be the matrix ⎛

1 ⎜0 ⎜ C=⎜ ⎝0 0

2 2 0 0

⎞ 0 2 0 3 0 1⎟ ⎟ ⎟. 0 −1 2⎠ 0 0 0

It is clear that C is in row echelon form; the pivots of the first, second, and third rows are c11 = 1, c22 = 2, and c34 = −1. Theorem 3.51. Let C ∈ Cm×n be a matrix in row echelon form such that the rows that contain non-zero elements are the first r rows. Then, rank(C) = r. Proof. Let c1 , . . . , cr be the non-zero rows of C. Suppose that the row ci has the first non-zero element in the column ji for 1  i  r. By the definition of the echelon form, we have j1 < j2 < · · · < jr . Suppose that a1 c1 + · · · + ar cr = 0. This equality can be written as a1 c1j1 = 0, a1 c1j2 + a2 c2j2 = 0, .. . a1 c1n + a2 c2n + . . . + ar crn = 0. Since c1j1 = 0, we have a1 = 0. Substituting a1 by 0 in the second equality implies a2 = 0 because c2j2 = 0, etc. Thus, we obtain a1 = a2 = . . . = ar = 0, which proves that the rows c1 , . . . , cr are linearly independent. Since this is a maximal set of rows of C that is linearly  independent, it follows that rank(C) = r. Linear systems whose augmented matrices are in row echelon form can be easily solved using a process called back substitution. Consider the following augmented matrix in row echelon form of a system with

162

Linear Algebra Tools for Data Mining (Second Edition)

m equations and n unknowns: ⎛ 0 · · · 0 a1j1 · · · ⎜0 · · · 0 0 · · · ⎜ ⎜. . .. ⎜. . ··· ⎜ . · · · .. ⎜ ⎜0 · · · 0 0 · · · ⎜ ⎜ ⎜0 · · · 0 0 · · · ⎜ . .. ⎜ .. ⎝ . · · · .. . ··· 0 ··· 0 0 ···

··· ··· a2j2 · · · .. . ··· · · · arjr ··· 0 .. . ··· ··· 0

⎞ a1n b1 a2n b2 ⎟ ⎟ .. .. ⎟ ⎟ . . ⎟ ⎟ · · · br ⎟ ⎟. ⎟ · · · br+1 ⎟ ⎟ .. .. ⎟ . . ⎠ · · · bm

The system of equations has the following form: a1j1 xj1 + · · · + a1n xn = b1 a2j2 xj2 + · · · + a2n xn = b2 .. . arjr xjr + · · · + arn xn = br 0 = br+1 .. .. .=. 0 = bm . The variables xj1 , xj2 , . . . , xjr that correspond to the columns where the pivot elements occur are referred to as the basic variables or principal variables. The remaining variables are non-basic or non-principal. Note that we have r  min{m, n}. If r < m and there exists b = 0 for r <   m, then the system is inconsistent and no solutions exist. If r = m or b = 0 for r <   m  n, one can choose the variables that do not correspond to the pivot elements, {xi | i ∈ {j1 , j2 , . . . , jr }, as parameters and express the basic variables as functions of these parameters. The process starts with the last basic variable, xjr (because every other variable in the equation arjr xjr + · · · + arn xn = br is a parameter), and then substitutes this

Matrices

163

variable in the previous equality. This allows us to express xjr−1 as a function of parameters, etc. This explains the term back substitution previously introduced. If r = n, then no parameters exist. To conclude, if r < m, the system has a solution if and only if bj = 0 for j > r. If r = m, the system has a solution. This solution is unique if r = n. Definition 3.41. A linear system Ax = b, where A ∈ Rm×n , b ∈ Rm , and m  n, is said to be in explicit form if A contains the columns of the matrix Im . Example 3.32. The linear system Ax = b is defined by   1 0 1 −1 2 A= and b = . 0 1 −1 1 1 Note that A is a full-rank matrix because rank(A) = 2. The system is already in explicit form relative to the variables x1 and x2 because the first two columns of the matrix [A|b] are the columns of I2 . There  are six possible explicit forms of this system corresponding to the 42 = 6 subsets of the set of variables {x1 , x2 , x3 , x4 }. Note, however, that if the columns chosen form a submatrix of rank less than 2, the explicit form does not exist. This is the case when we chose the explicit form relative to x3 and x4 . Therefore, this system has five explicit forms relative to the sets of variables {x1 , x2 }, {x1 , x3 }, {x1 , x4 }, {x2 , x3 }, and {x2 , x4 }. In general, a system with m equations and n variables has maxin explicit forms. mum m Example 3.33. Consider the system x1 + 2x2 + 2x2 + 3x3 +

2x4 x5 −x4 + x5 0

= b1 = b2 . = b3 = b4

The matrix A ∈ R4×5 is not of full rank because rank(A) = 3.

164

Linear Algebra Tools for Data Mining (Second Edition)

The augmented matrix ⎛ 1 ⎜0 ⎜ ⎜ ⎝0 0

of this system is ⎞ 2 0 2 0 b1 2 3 0 1 b2 ⎟ ⎟ ⎟. 0 0 −1 2 b3 ⎠ 0 0 0 0 b4

The basic variables are x1 , x2 , and x4 . If b4 = 0, the system is consistent. Under this assumption we can choose x3 and x5 as parameters. Let x3 = p and x4 = q. The third equation yields x4 = q − b3 . Similarly, the second equation implies x2 = 0.5(b2 − 3p − q). Substituting these values in the first equation allows us to write x1 = b1 − b2 + 2b3 − 3p − q. Further transformations of this system allow us to construct an equivalent linear system whose matrix contains the columns of the matrix I3 . Subtracting the second row from the first yields ⎞ ⎛ 1 0 −3 2 −1 b1 − b2 ⎜0 2 3 0 1 b2 ⎟ ⎟ ⎜ ⎟. ⎜ ⎝0 0 0 −1 2 b3 ⎠ 0 0 0 0 0 b4 Then, dividing the second row by 2 will produce ⎞ ⎛ 1 0 −3 2 −1 b1 − b2 ⎜0 1 3/2 0 1/2 b /2 ⎟ 2 ⎟ ⎜ ⎟, ⎜ ⎝0 0 0 −1 2 b3 ⎠ 0 0 0 0 0 b4 which creates the second column of I3 . Next, multiply the third row by −1: ⎞ ⎛ 1 0 −3 2 −1 b1 − b2 ⎟ ⎜0 1 3 0 1 b2 ⎟ ⎜ 2 2 2 ⎟. ⎜ ⎝0 0 0 1 −2 −b3 ⎠ 0 0 0 0 0 b4

Matrices

165

Finally, multiply the third by −2 and add it to the first row: ⎛ 1 ⎜0 ⎜ ⎜ ⎝0 0

0 −3 0 1 0 0

3 2

0 0

b1 − b2 + 2b3

1

b2 2

0 12 1 −2 0 0

−b3 b4

⎞ ⎟ ⎟ ⎟. ⎠

Thus, if we choose for the basic variables x1 = b1 − b2 + 2b3 , x2 =

b2 , and x4 = −b3 2

and for the non-basic variables x3 = x5 = 0, we obtain a solution of the system. The extended echelon form of a system can be achieved by applying certain transformations on the rows of the augmented matrix of the system (which amount to transformations involving the equations of the system). In preparation, a few special invertible matrices are introduced in the next examples. Example 3.34. Consider the matrix ⎛

T (i)↔(j)

1 ⎜ . ⎜ .. ⎜ ⎜ ⎜· · · ⎜ ⎜ = ⎜ ... ⎜ ⎜· · · ⎜ ⎜ ⎜ .. ⎝ .

.. . 0 .. . 1 .. .

··· .. . ··· .. . ··· .. . ···

.. . 1 .. . 0 .. .

⎞ ··· .. ⎟ . ⎟ ⎟ ⎟ · · ·⎟ ⎟ .. ⎟ , . ⎟ ⎟ · · ·⎟ ⎟ ⎟ .. ⎟ . ⎠ 1

where line i contains exactly one 1 in position j and line j contains exactly one 1 in position i. If T (i)↔(j) ∈ Cp×p and A ∈ Cp×q , it is easy to see that the matrix T (i)↔(j) A is obtained from the matrix A by permuting the lines i and j.

166

Linear Algebra Tools for Data Mining (Second Edition)

For instance, consider the matrix ⎛ 1 ⎜0 ⎜ T (2)↔(4) = ⎜ ⎝0 0 and the matrix A ∈ F4×5 . We ⎛ 1 0 0 ⎜0 0 0 ⎜ T (2)↔(4) A = ⎜ ⎝0 0 1 0 1 0 ⎛ a11 ⎜a ⎜ 41 =⎜ ⎝a31 a21

a12 a42 a32 a22

T (2)↔(4) ∈ C4×4 defined by ⎞ 0 0 0 0 0 1⎟ ⎟ ⎟ 0 1 0⎠ 1 0 0

have ⎞⎛ 0 a11 ⎟ ⎜ 1⎟ ⎜a21 ⎟⎜ 0⎠ ⎝a31 a41 0 a13 a43 a33 a23

a14 a44 a34 a24

a12 a22 a32 a42

a13 a23 a33 a43

a14 a24 a34 a44

⎞ a15 a25 ⎟ ⎟ ⎟ a35 ⎠ a45

⎞ a15 a45 ⎟ ⎟ ⎟. a35 ⎠ a25

The inverse of T (i)↔(j) is T (i)↔(j) itself. Example 3.35. Let T a(i) ∈ Cp×p be the matrix ⎛ ⎞ 1 0 ··· 0 ···0 ⎜ 0 1 · · · 0 · · · 0⎟ ⎜ ⎟ ⎜ . ⎟ .. .. ⎜ . ⎟ . · · · . · · · 0⎟ ⎜ . a(i) ⎜ ⎟ =⎜ T ⎟ · · · · · · · · · a · · · 0 ⎜ ⎟ ⎜ . ⎟ .. .. ⎜ .. ⎟ . · · · . · · · 0 ⎝ ⎠ 0 0 ··· 0 ···1 that has a ∈ F − {0} on the i th diagonal element, 1 on the remaining diagonal elements, and 0 everywhere else. The product T a(i) A is obtained from A by multiplying the i th row by a. The inverse of this 1 matrix is T a (i) .

Matrices

167

As an example, consider the matrix T 3(2) ∈ C4×4 given by ⎛ ⎞ 1 0 0 0 ⎜0 3 0 0⎟ ⎜ ⎟ T 3(2) = ⎜ ⎟. ⎝0 0 1 0⎠ 0 0 0 1 If A ∈ C4×5 , the matrix T 3(2) A is obtained from A by multiplying its second line by 3. We have ⎞ ⎛ ⎞⎛ 1 0 0 0 a11 a12 a13 a14 a15 ⎟ ⎜0 3 0 0⎟ ⎜a ⎜ ⎟ ⎜ 21 a22 a23 a24 a25 ⎟ ⎟ ⎜ ⎟⎜ ⎝0 0 1 0⎠ ⎝a31 a32 a33 a34 a35 ⎠ a41 a42 a43 a44 a45 0 0 0 1 ⎞ a11 a12 a13 a14 a15 ⎟ ⎜3a ⎜ 21 3a22 3a23 3a24 3a25 ⎟ =⎜ ⎟. ⎝ a31 a32 a33 a34 a35 ⎠ a41 a42 a43 a44 a45 ⎛

Example 3.36. Let T (i)+a(j) ∈ Cp×p be the matrix whose entries are identical to the matrix Ip with the exception of the element located in row i and column j that equals a: ⎛ ⎞ 1 0 ··· ··· ··· 0 ···0 ⎜0 1 · · · · · · · · · 0 · · · 0⎟ ⎜ ⎟ ⎜. . ⎟ . . ⎜ ⎟ T (i)+a(j) = ⎜ .. .. · · · .. · · · .. · · · 0⎟ . ⎜ ⎟ ⎜0 0 · · · a · · · 1 · · · 0⎟ ⎠ ⎝ .. .. .. .. . . ··· . ··· . ···1 The result of the multiplication T (i)+a(j) A is a matrix that can be obtained from A by adding the j th line of A multiplied by a to the i th line of A. The inverse of the matrix T (i)+a(j) A is T (i)−a(j) A.

Linear Algebra Tools for Data Mining (Second Edition)

168

For example, we have ⎛ ⎞⎛ 1 0 0 0 a11 ⎜0 1 0 0⎟ ⎜a ⎜ ⎟ ⎜ 21 T (4)+2(2) A = ⎜ ⎟⎜ ⎝0 0 1 0⎠ ⎝a31 a41 0 2 0 1

a12 a22 a32 a42

a13 a23 a33 a43

a14 a24 a34 a44

⎞ a15 a25 ⎟ ⎟ ⎟ a35 ⎠ a45

⎞ a12 a13 a14 a15 a11 ⎟ ⎜ a21 a22 a23 a24 a25 ⎟ ⎜ =⎜ ⎟. ⎠ ⎝ a31 a32 a33 a34 a35 a41 + 2a21 a42 + 2a22 a43 + 2a23 a44 + 2a24 a45 + 2a25 ⎛

It is easy to see that if one multiplies a matrix A at the right by T (i)↔(j) , T a(i) , and T (i)+a(j) , the effect on A consists of exchanging the columns i and j, multiplying the i th column by a, and adding the j th column multiplied by a to the i th column, respectively. Definition 3.42. Let F be a field, A, C ∈ Fm×n and let b, d ∈ Fm×1 . Two systems of linear equations Ax = b and Cx = d are equivalent if they have the same set of solutions. If Ax = b is a system of linear equations in matrix form, where A ∈ Cm×n and b ∈ Cm×1 , and T ∈ Cm×m is a matrix that has an inverse, then the systems Ax = b and (T A)x = (T b) are equivalent. Indeed, any solution of Ax = b satisfies the system (T A)x = (T b). Conversely, if (T A)x = (T b), by multiplying this equality by T −1 to the left, we get (T −1 T )Ax = (T −1 T )b, that is, Ax = b. The matrices T (i)↔(j) , T a(i) , and T (i)+a(j) introduced in Examples 3.34–3.36 play a special role in Algorithm 3.14.1 that transforms a linear system Ax = b into an equivalent system in row echelon form. These transformations are known as elementary transformation matrices. Example 3.37. Consider the linear system x1 + 2x2 + 3x3 = 4 x1 + 2x2 + x3 = 3 x1 + 3x2 + x3 = 1.

Matrices

169

Algorithm 3.14.1: Algorithm for the Row Echelon Form of a Matrix Data: A matrix A ∈ Fp×q Result: A row echelon form of A 1 r = 1; 2 c = 1; 3 while r  p and c  q do 4 while A(∗, c) = 0 do 5 c =c+1 6 end 7 j = r; 8 while A(j, c) = 0 do 9 j = j+1 10 end 11 if j = r then 12 exchange line r with line j 13 end 1 14 multiply line r by A(r,c) ; 15 for each k = r + 1 to p do 16 add line r multiplied by −A(k, c) to line k 17 end 18 r = r + 1; 19 c = c + 1; 20 end The augmented matrix of this system is ⎛ 1 2 3 ⎜ [A|b] = ⎝1 2 1 1 3 1

⎞ 4 ⎟ 3⎠ . 1

By subtracting the first row from the second and the third, we obtain the matrix ⎛ ⎞ 1 2 3 4 ⎜ ⎟ T (3)−1(1) T (2)−1(1) [A|b] = ⎝0 0 −2 −1⎠ . 0 1 −2 −3

170

Linear Algebra Tools for Data Mining (Second Edition)

Next, the second and third row are exchanged yielding the matrix ⎛ ⎞ 1 2 3 4 ⎜ ⎟ T (2)↔(3) T (3)−1(1) T (2)−1(1) [A|b] = ⎝0 1 −2 −3⎠ . 0 0 −2 −1 To obtain a 1 in the pivot of the third row, we multiply the third row by − 12 : ⎛

⎞ 1 2 3 4 ⎜ ⎟ T −0.5(3) T (2)↔(3) T (3)−1(1) T (2)−1(1) [A|b] = ⎝0 1 −2 −3⎠ , 0 0 1 0.5 which is the row echelon form of the matrix [A|b]. To achieve the row echelon form, we needed to multiply the matrix [A|b] by the matrix T = T −0.5(3) T (2)↔(3) T (3)−1(1) T (2)−1(1) . The solutions of the system can now be obtained by back substitution from the linear system x1 + 2x2 + 3x3 = 4, x2 − 2x3 = −3, x3 = 0.5. The last equation yields x3 = 0.5. Substituting x3 in the second equation implies x2 = −2; finally, from the first equality we have x1 = 6.5. Theorem 3.52. Let T a(i) , T (p)↔(q) , and T (i)+a(j) be the matrices in Rm×m that correspond to the row transformations applied to matrices in Rm×n , where i = j and p = q. We have ⎧ ⎪ T (p)↔(q) T a(i) if i ∈ {p, q}, ⎪ ⎨ T a(i) T (p)↔(q) = T (p)↔(q) T a(q) if i = p, ⎪ ⎪ ⎩T (p)↔(q) T a(p) if i = q,

Matrices

T (i)+a(j) T (p)↔(q) =

⎧ ⎪ T (p)↔(q) T (i)+a(j) ⎪ ⎪ ⎪ ⎪ ⎨T (q)+a(j) T (p)↔(q) ⎪ T (i)+a(p) T (p)↔(q) ⎪ ⎪ ⎪ ⎪ ⎩T (q)+a(p) T (p)↔(q)

171

if {i, j} ∩ {p, q} = ∅, if i = p and j = q, if i = p and j = q, if i = p and j = q.

Proof. The equalities of the theorem follow immediately from the  definitions of the matrices. The matrices that describe elementary transformations are of two types: lower triangular matrices of the form T a(i) or T (i)+a(j) or permutation matrices of the form T (p)↔(q) . If all pivots encountered in the construction of the row echelon form of the matrix A are not zero, then there is no need to use any permutation matrix T (p)↔(q) among the matrices that multiply A at the left. Thus, there is a lower matrix T and an upper triangular matrix U such that T A = U . The matrix T is a product of invertible matrices and, therefore, it is invertible. Since the inverse L = T −1 of a lower triangular matrix is lower triangular, as we saw in Theorem 3.8, it follows that A = LU ; in other words, A can be decomposed into a product of a lower triangular and an upper triangular matrix. This factorization of matrices is known as an LU -decomposition of A. Example 3.38. Let A ∈ R3×3 be ⎛ 1 ⎜ A = ⎝2 1

the matrix ⎞ 0 1 ⎟ 1 1⎠ . −1 2

Initially, we add the first row multiplied by −2 to the second row, and the same first row, multiplied by −1, to the third row. This amounts to ⎛ ⎞ 1 0 1 ⎜ ⎟ T (3),−(1) T (2,−2(1) A = ⎝0 1 −1⎠ . 0 −1 1 Next, we add the second row to the third to produce the matrix ⎛ ⎞ 1 0 1 ⎜ ⎟ T (3)+(2) T (3),−(1) T (2,−2(1) A = ⎝0 1 −1⎠ , 0 0 0

172

Linear Algebra Tools for Data Mining (Second Edition)

which is an upper triangular matrix. By Theorem 3.51, we can conclude that rank(A) = 2. We can write ⎛ ⎞ 1 0 1 ⎜ ⎟ A = (T (2,−2(1) )−1 (T (3),−(1) )−1 (T (3)+(2) )−1 ⎝0 1 −1⎠ . 0 0 0 Thus, the lower triangular matrix we are seeking is L = (T (2)−2(1) )−1 (T (3),−(1) )−1 (T (3)+(2) )−1 = T (2)+2(1) T (3)+(1) T (3)−(2) ⎛ ⎞⎛ ⎞⎛ ⎞ ⎛ ⎞ 1 0 0 1 0 0 1 0 0 1 0 0 ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟ = ⎝2 1 0⎠ ⎝0 1 0⎠ ⎝0 1 0⎠ = ⎝2 1 0⎠ , 0 0 1 1 0 1 0 −1 1 1 −1 1 which shows that A can be ⎛ 1 ⎜ A = ⎝2 1

written as ⎞ ⎞⎛ 0 0 1 0 1 ⎟ ⎟⎜ 1 0⎠ ⎝0 1 −1⎠ . −1 1 0 0 0

Suppose that during the construction of the matrix U some of the elementary transformation matrices are permutation matrices of the form T (p)↔(q) . By Theorem 3.52, matrices of the form T (p)↔(q) can be shifted to the right. Therefore, instead of the previous factorization of the matrix A, we have a lower triangular matrix T and a permutation matrix, which results as a product of all permutation matrices of the form T (p)↔(q) used in the algorithm such that T P A = U . In this case, we obtain an LU -factorization of P A instead of A. We give now an alternative characterization of the notion of matrix rank. Note that if T and S are matrices of any of the elementary transformations, then rank(T A) = rank(A) and rank(AS) = rank(A) by Corollary 3.4. Theorem 3.53. Let A ∈ Rm×n be a matrix. If rank(A) = k, then the largest non-singular square submatrix B of A is a k × k-matrix.

Matrices

173

Proof. Since rank(A) = k, there is a maximal linearly independent set of k columns of A, {ci1 , . . . , cik }, and a maximal linearly independent set of k of rows {r j1 , . . . , r jk }. Let K be the submatrix of A defined by

j1 · · · jk K=A . i1 · · · ik We claim that K is non-singular, that is, rank(K) = k. Using right multiplications by elementary transformation matrices, we can produce a matrix D that has the columns ci1 , . . . , cik on the first k positions. Since the remaining columns are linear combinations of these first k columns, using the same type of column transformations we can transform each of the remaining columns into 0m . Thus, there exists a non-singular matrix S such that D = AS = ((ci1 · · · cik |Om,n−k ) and rank(D) = rank(A) = k. The matrix D has the rows numbered i1 , . . . , ik as a maximal set of linearly independent rows. Now, by applying elementary transformations, these rows can be brought in the first k places and the remaining m − r rows can be nullified. Thus, there is an invertible matrix T such that  K O E = T AS = , O O where K is a non-singular k × k-matrix because rank(E) = rank(A). Thus, A contains a submatrix of rank k. Suppose that G is a non-singular square submatrix of A. By a series of row elementary and column elementary transformations (described by the matrices P and Q, respectively) we have  G H P AQ = L M and rank(P AQ) = rank(A) = k. Consider now the invertible matrices P1 and Q1 defined by   I O I −G−1 H and Q1 = . P1 = −LG−1 I O I

174

Linear Algebra Tools for Data Mining (Second Edition)

Since P1 is lower triangular and Q1 is upper triangular having all diagonal elements equal to 1, both P1 and Q1 are invertible. Therefore, rank(P1 P AQQ1 ) = rank(A). On the other hand,    G H I −G−1 H I O P1 P AQQ1 = L M −LG−1 I 0 I  =

G O . O M − LG−1 H)

Therefore, rank(A) = rank(G) + rank(M − LG−1 H)  rank(G), so k is the maximal rank of a nonsingular submatrix of A.  3.15

The Kronecker and Other Matrix Products

Definition 3.43. Let A ∈ Cm×n and B ∈ Cp×q be two matrices. The Kronecker product of these matrices is the matrix A ⊗ B ∈ Cmp×nq defined by ⎞ ⎛ a11 B a12 B · · · a1n B ⎜a B a B ··· a B ⎟ 22 2n ⎟ ⎜ 21 ⎟. A⊗B =⎜ ⎜ .. .. ... ... ⎟ . ⎠ ⎝ . am1 B am2 B · · · amn B The Kronecker product A ⊗ B creates mn copies of the matrix B and multiplies each copy by the corresponding element of A. Note that the Kronecker product can be defined between any two matrices regardless of their format. If x ∈ Cm and y ∈ Cn , we have x ⊗ y = vec(yx ),

(3.11)

x ⊗ y  = xy  = y  ⊗ x.

(3.12)

and

Example 3.39. Consider the matrices ⎛ ⎞ b11 b12 b13  a11 a12 ⎜ ⎟ and B = ⎝b21 b22 b23 ⎠ . A= a21 a22 b31 b32 b33

Matrices

Their Kronecker product is ⎛ a11 b11 a11 b12 ⎜a b a b ⎜ 11 21 11 22 ⎜ ⎜a11 b31 a11 b32 A⊗B =⎜ ⎜a b a b ⎜ 21 11 21 12 ⎜ ⎝a21 b21 a21 b22 a21 b31 a21 b32

a11 b13 a11 b23 a11 b33 a21 b13 a21 b23 a21 b33

175

a12 b11 a12 b21 a12 b31 a22 b11 a22 b21 a22 b31

a12 b12 a12 b22 a12 b32 a22 b12 a22 b22 a22 b32

⎞ a12 b13 a12 b23 ⎟ ⎟ ⎟ a12 b33 ⎟ ⎟. a22 b13 ⎟ ⎟ ⎟ a22 b23 ⎠ a22 b33

Let C ∈ Cmp×nq be the Kronecker product of the matrices A ∈ and B ∈ Cp×q . We seek to express the value of cij , where 1  i  mp and 1  j ≤ nq. It is easy to see that Cm×n

cij = a i , j  bi−p i −1,j−q j −1 . p

q

p

(3.13)

q

Conversely, we have ars bvw = (A ⊗ B)p(r−1)+v,q(s−1)+w

(3.14)

for 1  r  m, 1  s  n and 1  v  p, 1  w  q. Theorem 3.54. The Kronecker product is associative. In other words, if A ∈ CK×L , B ∈ CM ×N , and C ∈ CR×S , then (A⊗ B)⊗ C = A ⊗ (B ⊗ C). Proof. The product akl bmn is the entry ((k − 1)M + m, ( − 1)N + +n) of A ⊗ B. Therefore, the product (akl bmn )crs is the entry (((k − 1)M + m − 1)R + r, (( − 1)N = n − 1)S + s) of (A ⊗ B) ⊗ C. On the other hand, the product akl (bmn crs ) is the entry of A ⊗ (B ⊗ C) that occupies the position ((k − 1)M R + (m − 1)R + r, ( − 1)N S + (n − 1)S + s), which is identical to ((k − 1)M + (m − 1)R + r, (( − 1)N + (n − 1)S + s). Thus, the product akl bmn cop occupies the same position in (A⊗B)⊗C  and in A ⊗ (B ⊗ C). Therefore, (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C).

Linear Algebra Tools for Data Mining (Second Edition)

176

Theorem 3.55. Let A1 , . . . , An and B1 , . . . , Bn be matrices. Then, we have (A1 ⊗ B1 )(A2 ⊗ B2 ) · · · (An ⊗ Bn ) = (A1 A2 · · · An ) ⊗ (B1 B2 · · · Bn ). Proof.

Note that

(A ⊗ B)(C ⊗ D) ⎞⎛ ⎞ ⎛ c11 D · · · a1p D a11 B · · · a1n B ⎜ ⎜ . .. ⎟ .. ⎟ ⎟ ⎜ .. ⎟ . =⎜ . · · · . . · · · . ⎠ ⎠⎝ ⎝ am1 B · · · amn B cn1 D · · · cnp D ⎞ a1j cj1 BD · · · j a1j cjp BD ⎟ ⎜ .. .. ⎟ =⎜ . ··· . ⎠ ⎝ a c BD · · · a c BD j mj j1 j mj jp ⎛

j

⎞ a1j cj1 · · · j a1j cjp ⎟ ⎜ .. .. ⎟ ⊗ BD = (AC) ⊗ (BD). =⎜ ··· . ⎠ ⎝ . a c · · · a c j mj j1 j mj jp ⎛

j

Repeated multiplication yields the desired equality.



The next theorem contains a few other elementary properties of Kronecker’s product. Theorem 3.56. For any complex matrices A, B, C, D, we have the following: (i) (A ⊗ B) = A ⊗ B  , (ii) (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C), (iii) A ⊗ B + A ⊗ C = A ⊗ (B + C), (iv) A ⊗ D + B ⊗ D = (A + B) ⊗ D, (v) (A ⊗ B) = A ⊗ B  , (vi) (A ⊗ B)H = AH ⊗ B H , when the usual matrix sum and multiplication are well-defined in each of the above equalities. Proof.

The proof is straightforward and is left to the reader.



Matrices

177

Example 3.40. Let x ∈ Cn and y ∈ Cm . We have ⎞ ⎛ ⎞ ⎛ y1 x x1 y ⎜ . ⎟ ⎜ . ⎟ ⎟ ⎜ ⎟ x⊗y =⎜ ⎝ .. ⎠ and y ⊗ x = ⎝ .. ⎠ . xn y ym x Note that the Kronecker product is not commutative. For example, we have x ⊗ y = y ⊗ x. We have ⎛ ⎞ 5 ⎜4⎟ ⎛ ⎞ ⎜ ⎟ 1  ⎜ ⎟ ⎜10⎟ 4 ⎜ ⎟ ⎟ =⎜ ⎝2⎠ ⊗ ⎜ 8 ⎟, 5 ⎜ ⎟ 3 ⎜ ⎟ ⎝15⎠ 12 but



⎞ 4 ⎟ ⎛ ⎞ ⎜ ⎜8⎟ ⎜ ⎟  1 4 ⎜ ⎟ ⎜ 12⎟ ⎟ ⎝2⎠ = ⎜ ⎜ 5 ⎟. 5 ⎜ ⎟ 3 ⎜ ⎟ ⎝10⎠ 15

Example 3.41. Let D ∈ Cp×p . The Kronecker product C = Im ⊗ D is given by cij = I i , j  di−p i −1,j−p j −1 p

=

p

p

 di−p(k−1),j−p(k−1) 0

p

if  pi  =  pj  = k otherwise,

for 1  i, j  mp. Let now E ∈ Cm×m . The Kronecker product L = E ⊗ Ip is given by  e i , j  if p pi  − i =  pj  − j, p p lij = 0 otherwise, for 1  i, j  mp.

Linear Algebra Tools for Data Mining (Second Edition)

178

Theorem 3.57. If A ∈ Cn×n and B ∈ Cm×m are two invertible matrices, then A ⊗ B is invertible and (A ⊗ B)−1 = A−1 ⊗ B −1 . Proof.

Since (A ⊗ B)(A−1 ⊗ B −1 ) = (AA−1 ⊗ BB −1 ) = In ⊗ Im ,

the theorem follows by observing that In ⊗ Im = Inm .



In a sequence of Kronecker products of column and row vectors, a column vector can be permuted with a row vector without altering the final result. This is formalized in the next statement. Theorem 3.58. Let u ∈ Cm and let v ∈ Cn . We have u ⊗ v  = v  ⊗ u. Proof.

Suppose that

⎞ ⎛ ⎞ v1 u1 ⎜ .. ⎟ ⎜ .. ⎟ u = ⎝ . ⎠ and v = ⎝ . ⎠ . um vn

We have

Also,



⎞ ⎞ ⎛ u1 v1 · · · u1 vn u1 v  ⎜ .. .. ⎟ . u ⊗ v = ⎝ · · · ⎠ = ⎝ ... . . ⎠  um v um v1 · · · um vn ⎛

⎞ ⎞ ⎛ v1 u1 · · · vn u1 u1 ⎜ ⎟ ⎜ .. .. .. ⎟ ⎟ v  ⊗ u = (v1 , · · · , vn ) ⎝ ... ⎠ = ⎜ . . ⎠, ⎝ . um v1 um · · · vn um ⎛

which establishes our equality.



Example 3.42. Suppose that u1 ∈ Cm , u2 ∈ Cp , and v ∈ Cm . Then, we have u1 ⊗ u2 ⊗ v  = u1 ⊗ v  ⊗ u2 = v  ⊗ u1 ⊗ u2 . Note that such transformations can be applied to Kronecker products of column and row vectors provided we do not change the order of the column and the row vectors.

Matrices

179

Theorem 3.59. Let A ∈ Cn×n and B ∈ Cm×m be two normal (unitary) matrices. Their Kronecker product A ⊗ B is also a normal (a unitary) matrix. Proof.

By Theorem 3.56, we can write

(A ⊗ B) (A ⊗ B) = (A ⊗ B  )(A ⊗ B) = (A A ⊗ B  B) = (AA ⊗ BB  ) (because both A and B are normal) = (A ⊗ B)(A ⊗ B) , which implies that A ⊗ B is normal.



Definition 3.44. Let A ∈ Cm×m and B ∈ Cn×n be two square matrices. Their Kronecker sum is the matrix A⊕B ∈ Cmn×mn defined by A ⊕ B = (A ⊗ In ) + (Im ⊗ B). The Kronecker difference is the matrix A  B ∈ Cmn×mn defined by A  B = (A ⊗ In ) − (Im ⊗ B). The element (A ⊕ B)ij is given by (A ⊕ B)ij = (A ⊗ In )ij + (Im ⊗ B)ij . Thus, (A ⊕ B) can be computed by applying the formulas developed in Example 3.41:  a i , j  if n ni  − i = n nj  − j, n n (A ⊗ In )ij = 0 otherwise,  bi−n(k−1),j−n(k−1) if  ni  =  nj  = k, (Im ⊗ B)ij = 0 otherwise, for 1  i, j  mn. Similar fact can be proven about Kronecker difference by replacing B by −B in the formula involving the Kronecker sum.

180

Linear Algebra Tools for Data Mining (Second Edition)

Definition 3.45. Let A, B ∈ Cm×n . The Hadamard product of A and B is the matrix A  B ∈ Cm×n defined by ⎞ ⎛ a11 b11 a12 b12 · · · a1n b1n ⎟ ⎜a b ⎜ 21 21 a22 b22 · · · a2n b2n ⎟ ⎟. AB = ⎜ .. .. ⎟ .. ⎜ .. . ⎠ ⎝ . . . am1 bm1 am2 bm2 · · · amn bmn The Hadamard quotient A  B is defined only if bij = 0 for 1  i  m and 1  j  n. In this case, ⎛ a11 b11 a21 b21

a12 b12 a22 b22

.. .

··· ··· .. .

a1n b1n a2n b2n

am1 bm1

am2 bm2

···

amn bmn

⎜ ⎜ AB =⎜ ⎜ .. ⎝ .



⎟ ⎟ .. ⎟ ⎟. . ⎠

Theorem 3.60. If A, B, C ∈ Cm×n and c ∈ C, we have (i) A  B = B  A; (ii) A  Jm,n = Jm,n  A = A; (iii) A  (B + C) = A  B + A  C; (iv) A  (cB) = c(A  B). Proof.

The proof is straightforward and is left to the reader.



Note that the Hadamard product of two matrices A, B ∈ Cm×n is a submatrix of the Kronecker product A ⊗ B. Example 3.43. Let A, B ∈ C2×3 be the matrices  A=

a11 a12 a13 a21 a22 a23



 and B =

b11 b12 b13 . b21 b22 b23

The Kronecker product of these matrices is A ⊗ B ∈ C4×9 given by ⎛ a11 b11 ⎜ ⎜a11 b21 A⊗B = ⎜ ⎜a b ⎝ 21 11 a21 b21

a11 b12

a11 b13

a12 b11

a12 b12

a12 b13

a13 b11

a13 b12

a11 b22

a11 b23

a12 b21

a12 b22

a12 b23

a13 b21

a13 b22

a21 b12

a21 b13

a22 b11

a22 b12

a22 b13

a23 b11

a23 b12

a21 b22

a21 b23

a22 b21

a22 b22

a22 b23

a23 b21

a23 b22

a13 b13



⎟ a13 b23 ⎟ ⎟. a23 b13 ⎟ ⎠ a23 b23

Matrices

181

The Hadamard product of the same matrices is  a11 b11 a12 b12 a13 b13 , AB = a21 b21 a22 b22 a23 b23 and we can regard the Hadamard product as a submatrix of the Kronecker product A ⊗ B,

1, 5, 9 A  B = (A ⊗ B) . 4, 4, 4 Another matrix product involves matrices that have the same number of columns. Definition 3.46. Let A ∈ Cm×n and B ∈ Cp×n be two matrices that have the same number n of columns, A = (a1 · · · an ) and B = (b1 · · · bn ). The Khatri–Rao product of A and B (or the column-wise Kronecker product) is the matrix A ∗ B = (a1 ⊗ b1 a2 ⊗ b2 · · · an ⊗ bn ). Example 3.44. The Khatri–Rao product of the ⎛ 1 0  1 2 3 ⎜ A= and B = ⎝ 2 1 4 5 6 −1 2

matrices ⎞ 2 ⎟ 3⎠ 1

is the matrix (a1 ⊗ b1 a2 ⊗ b2 a3 ⊗ b3 ), which equals ⎛ ⎞ 1 0 6 ⎜ 2 2 9⎟ ⎜ ⎟ ⎜ ⎟ ⎜−1 4 3 ⎟ ⎜ ⎟ ⎜ 4 0 12⎟ . ⎜ ⎟ ⎜ ⎟ ⎝ 8 5 18⎠ −4 10 6 Note that for any vectors a, b, we have a ⊗ b = a ∗ b.

Linear Algebra Tools for Data Mining (Second Edition)

182

3.16

Outer Products

Definition 3.47. Let u ∈ Cm and v ∈ Cn . The outer product of the vectors u and v is the matrix u ◦ v ∈ Cm×n defined by u ◦ v = uv H . As we saw in Example 3.28, the outer product of two vectors is a matrix of rank 1. For u ∈ Cm and v ∈ Cn , we have v ◦u = vuH = (uv H )H = (u ◦v)H . Therefore, the outer product is not commutative because for u ∈ Cm and v ∈ Cn , we have u ◦ v ∈ Cm×n and v ◦ u ∈ Cn×m . Note that when m = n, we have uv H = trace(u ◦ v). Example 3.45. Let

⎛ ⎞ u1  v1 ⎜ ⎟ . u = ⎝u2 ⎠ and v = v2 u3

We have

⎞ u1 v1 u1 v2  v1 u1 v1 u2 v1 u3 ⎟ ⎜ . u ◦ v = ⎝u2 v1 u2 v2 ⎠ and v ◦ u = v2 u1 v2 u2 v2 u3 u3 v1 u3 v2 ⎛

Contrast this with the Kronecker products: ⎞ ⎛ ⎞ ⎛ v1 u1 u1 v1 ⎜v u ⎟ ⎜u v ⎟ ⎜ 1 2⎟ ⎜ 1 2⎟ ⎟ ⎜ ⎟ ⎜ ⎜v1 u3 ⎟ ⎜u2 v1 ⎟ ⎟ ⎜ ⎟ u⊗v =⎜ ⎜u v ⎟ and v ⊗ u = ⎜v u ⎟ . ⎜ 2 1⎟ ⎜ 2 2⎟ ⎟ ⎟ ⎜ ⎜ ⎝v2 u2 ⎠ ⎝u3 v1 ⎠ u3 v2 v2 u3 Note that the entries of the Kronecker product u⊗v can be obtained by reading the entries of u ◦ v row-wise.

3.17

Associative Algebras

Definition 3.48. An F-associative algebra is a pair (V, m), where V is an F-linear space and m : V × V −→ V is a bilinear mapping that is associative (which means that m(x, m(y, z)) = m(m(x, y), z) for all x, y, z ∈ V ).

Matrices

183

If there exists u ∈ V such that m(v, u) = v for every v ∈ V , then u is referred to as the unit element and (V, m) is a unital associative algebra. The dimension of an F-associative algebra (V, m) is dim(V ). We denote m(v 1 , v 2 ) as v 1 v 2 . In a unital associative algebra there exists a unique unit element. Example 3.46. The set Rn×n of square matrices with real elements is a unital associative algebra, where m(A, B) = AB. The unit element is the matrix In . Example 3.47. Let Hom(V, V ) be the linear space of homomorphisms of an F-linear space V. For h, k ∈ Hom(V, V ), define m(h, k) as the composition of mappings hk. It is easy to verify that (Hom(V, V ), m) is a unital associative algebra having 1V as its unit element. Definition 3.49. An associative algebra morphism between the associative algebras (V, m) and (W, m ) is a linear mapping h : V −→ W such that h(m(u, v)) = m (h(u), h(v)) for every u, v ∈ V . If (V, m) and (W, m ) have unit elements u and u , respectively, then h(u) = u . If h is a bijective morphism, we refer to h as an isomorphism of associative algebras. Let B = {e1 , . . . , en } be a basis in V, where (V, m) is nan asson i j ciative algebra. If x, y ∈ V , x = i=1 a ei , and y = j=1 b ej , then xy =

n

n

ai bj ei ej .

i=1 j=1

Since ei ej ∈ V , it is possible to write ei ej =

n

k γij ek ,

k=1 k are n3 scalars in F. We refer to these numbers as the strucwhere γij tural coefficients of the associative algebra (V, m).

184

Linear Algebra Tools for Data Mining (Second Edition)

k be structural coefficients of the associative Theorem 3.61. Let γij algebra (V, m) having the basis {e1 , . . . , en }. These coefficients satisfy the following n4 equalities:

p

q r r γjk γiq = γij γpk q

p

for i, h, j, k between 1 and n. p q ep and ej ek = q γjk eq , we can write Proof. Since ei ej = p γij ei (ej ek ) =

q

=



q

(ei ej )ek =

q γjk ei eq =



r

q

q r γjk γiq er =

p γij ep ek =



p

=



p

q γjk

p r γij γpk er =

r

r γiq er

r



r

p γij

p



q



q r γjk γiq er

r γpk er

r



r

p r γij γpk er .

p

The associativity property implies the following equalities:

p

q r r γjk γiq = γij γpk q

for i, h, j, k between 1 and n.

p



An associative algebra (V, m) over a finite-dimensional linear space V is completely defined by its multiplicative table. Starting from a basis {e1 , . . . , en } place at the intersection of the i th row and the j th column the expansion of the product ei ej in terms of the basis: e ··· e ··· e jk nk 1k e1 ··· ··· k γ11 ek k γ1j ek k γ1n ek .. .. .. .. .. .. . . . . . . k k k ei γ e · · · γ e · · · γ k i1 k k ij k k in ek .. .. .. .. .. .. . . . . . . k k k en ··· ··· k γn1 ek k γnj ek k γnn ek

Matrices

185

Example 3.48. Let V be a four-dimensional real linear space having the basis {1, i, j, k} whose elements have the form v = a1 + bi + cj + dk, where a, b, c, d ∈ R. The multiplication table of this algebra is given by 1 i j k

1 1 i j k

i i −1 −k j

j j k −1 −i

k k −j i −1

If q ∈ V , we can write q = a1 + bi + cj + dk for some a, b, c, d ∈

R. Such an element is said to be a quaternion. Its conjugate is the

quaternion q = a1 − bi − cj − dk. It is easy to see that qq = (a2 + b2 + c2 + d2 )1. Definition 3.50. A subalgebra of an associative algebra (V, m) is a subspace U of V such that m(u1 , u2 ) ∈ U for every u1 , u2 ∈ U . If U is a subalgebra of (V, m) and m = m U , then (U, m ) can be regarded as an associative algebra. The intersection of any family of subalgebras of an associative algebra is a subalgebra. An ideal of an associative algebra (V, m) is a subspace I of V such that m(v, i) ∈ I and m(i, v) ∈ I for every v ∈ V and i ∈ I. Let {Ij | j ∈ J} be a family  of ideals of an associative algebra (V, m). It is easy to see that j∈J is an ideal of (V, m). Moreover, if S ⊆ V , the intersection of all ideals that contain S is again an ideal denoted by IS . If I is an ideal of an associative algebra (V, m), we can consider the quotient linear space V /I. Define the multiplication m on V /I as ˜ ∈ [x] m([x], [y]) = [m(x, y)]. Note that m is well-defined because if x ˜ ∈ [y], we have x ˜ − x ∈ I and y ˜ − y ∈ I. In other words, there and y ˜ = x + u and ˜[y] = y + v. This implies exist u, v ∈ I such that x ˜ ) = m(x + u, y + v) m(˜ x, y = m(x, y) + m(x, v) + m(u, y) + m(u, v) = m(x, y) + z,

186

Linear Algebra Tools for Data Mining (Second Edition)

˜ )] = [m(x, y)] and where z ∈ I. Therefore, [m(˜ x, y ˜ )] = [m(x, y)], m([˜ x], [˜ y ]) = [m(˜ x, y which shows that m is well-defined. The associative algebra (V /I, m) is the factor algebra of V by I. Definition 3.51. A graded linear space is a vector space V that can be written as a direct sum of the form V = n∈N Vn , where each Vn is a vector space. The linear space Vn is the set of elements of degree n. Example 3.49. Let R[x, y] be the linear space of polynomials with real coefficients in the indeterminates x and y. If R[x, y]n is the set of all homogeneous polynomials of degree n in x and y, then R[x, y] is a graded linear space. Generalizing Example 3.49, if V is a direct sum, V = n∈N Vn , we refer to members of the summand Vn as homogeneous elements of degree n. If v is v = n∈N v n where v n ∈ Vn , v n is the homogeneous component of v of degree n. A graded associative algebra is a graded vector space V = n∈N Vn such that m(Vp , Vq ) ⊆ Vp+q for p, q ∈ N. A subalgebra W of the graded associative algebra V is a graded subalgebra if W = n∈N (W ∩ Vn ). Exercises and Supplements (1) Let A ∈ Cm×n and B ∈ Cn×p be two conformant matrices. Prove that: (a) computing the matrix G = AB using standard matrix multiplication requires mnp number multiplications; (b) if C ∈ Cp×q and we compute the matrix D = (AB)C = A(BC) by the standard method, the first modality D = (AB)C, requires mp(n + q) multiplications, while the second, D = A(BC), requires nq(m + p) multiplications. (2) Let {i1 , . . . , ik } be a subset of {1, . . . , n} and let {j1 , . . . , jq } = {1, . . . , n} − {i1 , . . . , ik }, where j1 < · · · < jq and k + q = n.

Matrices

187

i1 · · · ik be a Let A ∈ be a matrix and let B = A i1 · · · ik principal submatrix of A. Prove that if y ∈ Ck , then y H By = xH Ax, where  yi if i = jr , xr = 0 otherwise Cn×n

for 1  r  n. (3) Let X = (x1 · · · xn ) be a matrix in Cn×n and let C = diag(c1 , . . . , cn ) ∈ Cn×n . Prove that XC = c1 x1 e1 + · · · + c1 xn en . (4) Let X = (x1 · · · xn ) and Y = (y 1 · · · y n ) be two matrices in Cn×n , and let C = diag(c1 , . . . , cn ), D = diag(d1 , . . . , dn ). Prove that XCY D =

n

n

ci dj yji xi ej ,

i=1 j=1

and that trace(XCY D) = ni=1 nj=1 ci dj yji xji . (5) Let a1 , . . . , an be n complex numbers. Prove that n 

diag(1, . . . , 1, ai , 1, . . . , 1) = diag(a1 , . . . , an ).

i=1

(6) Let D = diag(d1 , . . . , dn ) ∈ Cn×n and let A ∈ Cn×n . Prove that (DAD)ij = di aij dj for 1  i, j  n. (7) Let S n×n be the set of n×n symmetric matrices in Rn×n . Prove . that S n×n is a subspace of Rn×n and dim(S n×n ) = n(n+1) 2 n×n n×n (8) Let A ∈ R be a symmetric be a matrix and let B ∈ R matrix such that x Bx = ni=1 nj=1 aij (xi − xj )2 for every x ∈ Rn . Prove that: (a) for 1  k  n, we have bkk = 2 {aik | 1  i  n and i = k}; n n (b) i=1 j=1 bij = 0. (9) Let ψ ∈ PERMn and let A be a square matrix, A ∈ Cn×n . Prove that (Pψ A)ij = aψ(i)j and (APψ )ij = aiψ−1 (j) for 1  i, j  n.

188

Linear Algebra Tools for Data Mining (Second Edition)

Let φ ∈ PERMn be a permutation and let A ∈ Cn×n be a square matrix. A φ-diagonal of A is a set of n elements Dφ (A) = {a1φ(1) , a2φ(2) , . . . , anφ(n) }. A permutation diagonal of A is a φdiagonal of A for some φ ∈ PERMn . (10) Let A ∈ Cn×n be a square matrix and let ψ ∈ PERMn be a permutation. Prove that if Dφ (A) is a permutation diagonal of A, the same set is a permutation diagonal Dψ−1 φ (B) of B = APψ . n Solution: Note that (APψ )ij = k=1 aik pkj = aiψ−1 (j) . Therefore, we have Dφ (APψ ) = {(APψ )1φ(1) , . . . , (APψ )nφ(n) } = {a1ψ−1 (φ(1)) , . . . , anψ−1 (φ(n)) }, which allows us to conclude that Dφ (A) coincides with the permutation diagonal Dψ−1 φ (B) of B = APψ . (11) Let A ∈ Cn×n be an upper-Hessenberg matrix and let U ∈ Cn×n be an upper-triangular matrix. Prove that both AU and U A are upper-Hessenberg matrices. Solution: Let B = AU . For i > j + 1, we have bij =

n

k=1

aik ukj =

i−2

k=1

aik ukj +

n

aik ukj .

k=i−1

i−2 Since A is upper-Hessenberg, n we have k=1 aik ukj = 0; since U is upper triangular, k=i−1 aik ukj = 0 because j < i − 1. Thus, bij = 0, so B is indeed upper-Hessenberg. The argument for U A is similar. (12) Let A, B ∈ Cn×n be two Hermitian matrices. Prove that AB is a Hermitian matrix if and only if AB = BA. Solution: Suppose that AB = BA. Then, (AB)H = (BA)H = AH B H = AB, hence AB is a Hermitian matrix. Conversely, if AB is Hermitian, we have AB = (AB)H = B H AH = BA. (13) Let A, B be two matrices in Cn×n . Suppose that B = C + D, where C is a Hermitian matrix and D is a skew-Hermitian matrix. Prove that if A is Hermitian and AB = BA, then AC = CA and AD = DA.

Matrices

189

(14) Prove that if A = diag(A1 , . . . , Ak ) is a block-diagonal matrix, then A is Hermitian if and only if every block Ai is Hermitian. (15) Let A ∈ Cn×n be a matrix. Prove that each φ-diagonal of A contains a 0 if and only if A contains a zero submatrix B ∈ Cp×q such that p + q = n + 1. This fact is known as the Frobenius– K¨ onig Theorem. Solution: Suppose that A contains the submatrix

i1 , . . . , ip = Op,q , B=A j1 , . . . , jq where p + q = n + 1 and Dφ = {a1φ(1) , a2φ(2) , . . . , anφ(n) } = {aφ−1 (1)1 , . . . , . . . , aφ−1 (n)n } is a φ-diagonal without zero entries. Then none of the components aφ−1 (j1 )j1 , . . . , . . . , aφ−1 (jq )jq are 0, so they must be located in the rows {1, . . . , n} − {i1 , . . . , ip }. Therefore, q + p  n and this contradicts the fact that p + q = n + 1. Conversely, suppose that every diagonal of A contains a 0. We show by induction on n  1 that A contains a zero submatrix. The base case, n = 1 is immediate. Suppose that the implication holds for matrices of size less than n. If A = On,n , the reverse implication is trivial. Therefore, we can assume that A contains a non-zero component. Without loss of generality we may assume that ann = 0. Every diagonal of the submatrix

1, · · · , n − 1 S=A 1, . . . , n − 1 contains a 0 and, therefore, by inductive hypothesis, S contains a zero submatrix of format r × s, where r + s = n. It is obvious that this submatrix of S is also a zero-submatrix of A. There exist two permutations σ, τ ∈ PERMn such that Pσ APτ =

C Or,s , D E

190

Linear Algebra Tools for Data Mining (Second Edition)

where C ∈ Cr×r and E ∈ Cs×s . Let η ∈ PERMn . An η-diagonal of A has the form Dη (A) = {a1η(1) , . . . , anη(n) }. Since Pσ permutes the rows of A and Pτ permutes the columns of A, the set Dη corresponds to the set {aσ(1)τ (η(1)) , . . . , aσ(n)τ (η(n)) } = {a1σ−1 (τ (η(1))) , . . . , anσ−1 (τ (η(n))) }, which is a diagonal set of Pσ APτ . Thus, the collection of diagonal sets of A and Pσ APτ are the same. If all elements of the diagonal set of Pσ APτ that correspond to C are non-zero, then the remaining elements of a diagonal are located in E and must contain a 0. Therefore, if all elements of a diagonal of C are non-zero, then each a diagonal of E must contain a 0. Thus, either all diagonals of C contain a 0, or all diagonals of E contain a 0. In the first case, by inductive hypothesis, C contains a k ×h zero submatrix with k +h = r +1. This implies that the first r rows of Pσ APτ contain a k × (h + s) zero submatrix and k + h + s = r + 1 + s = n + 1. Similarly, in the second case, E contains a p × q zero-matrix with p + q = s + 1. Therefore, the last s columns of Pσ APτ contain a zero submatrix of format (r + p) × q and r + p + q = r + s + 1 = n + 1. (16) Prove that row exchanges for matrices in Rm×n can be expressed as a sequence of row multiplications and additions of a row multiplied by a constant to another row by showing that T (i)↔(j) = T (i)−(j) T −(j) T (j)−(i) T ((i)+(j) for 1  i, j  m. (17) Let A ∈ Cm×n be a matrix. If v = vec(A) ∈ Cmn , prove that vj = aj−m j−1 , j  m

m

for 1  j  mn. (18) Prove that the matrices A, B given by   1 0 1 1 A= and B = 0 1 0 1 are not similar.

Matrices

191

(19) Let F be the set of functions f : C − {− dz } −→ C given by f (z) =

az + b . cz + d

Denote by Mf the matrix  Mf =

a b . c d

Prove that (a) if f, g ∈ F, then Mf g = Mf Mg ; (b) f (f (z)) − (a + d)f (z) + ad − bc = 0 for every z ∈ C. (20) Recall that Jn,n ∈ Rn×n is the complete n × n matrix, that is the matrix having all components equal to 1. Prove that for every number m ∈ N and m  1, we have m = nm−1 Jn,n . Jn,n

(21) Let A, B ∈ Cn×n be two matrices. Prove that if AB = BA, then we have the following equality known as Newton’s binomial: (A + B)n =

n 

n k=0

k

An−k B k .

Give an example of matrices A, B ∈ C2×2 such that AB = BA for which the above formula does not hold. (22) Let A ∈ Cn×n be a circulant matrix whose first row is (c1 , . . . , cn ) and let P ∈ Cn×n be the circulant matrix whose first row is (0, 1, 0, . . . , 0). Prove that A = c1 P 0 + c1 P 1 + · · · + cn P n−1 . (23) Let A ∈ Rn×n and B ∈ Rn×n be two matrices. Prove that if AB is invertible, then both A and B are invertible. Solution: Since AB is invertible, we have AB(AB)−1 = In . Thus, A is invertible and A−1 = B(AB)−1 . Similarly, since (AB)−1 AB = In , it follows that B is invertible and B −1 = (AB)−1 A. (24) Let c ∈ (0, 1] and let A ∈ Rn×n be the matrix A = cIn + (1 − c)Jn . Prove that A is a non-singular matrix.

Linear Algebra Tools for Data Mining (Second Edition)

192

Solution: Suppose that x ∈ Rn is a vector such that Ax = 0n . This amounts to cx + (1 − c)Jn x = cx + (1 − c)ξ1n , where ξ = ni=1 xi . Thus, x = − (1−c)ξ c 1n , which means that all components of x must equal − (1−c)ξ c . This implies ξ = (1−c)ξ −n c , so ξ = 0 which implies x = 0n . Thus, A is nonsingular. (25) Let A, B ∈ Cn×n be two invertible matrices. Prove that B −1 − A−1 = B −1 (A − B)A−1 . Conclude that rank(A − B) = rank(B −1 − A−1 ). (26) Let A ∈ Rn×n be a non-singular matrix. Prove that the set {x1 , . . . , xm } ⊆ Rn is a linearly independent set if and only if {Ax1 , . . . , Axm } is a linearly independent set. (27) Prove that the matrix T (i)+a(j) ∈ Rn×n can be expressed as T (i)+a(j) = In + aei ej . (28) Let  U=

Ok,k · · · On−k,k W



be a matrix in Cn×n , where W is an upper triangular matrix, and let Y be an upper triangular matrix such that yk+1,k+1 = 0. Prove that (U Y )ij = O for 1  i, j  k + 1. Solution: Note that Y can be written as  Y12 Y11 , Y = On−k,k Y22 where Y11 and Y22 are upper triangular matrices. Thus,  Ok,k BY22 . UY = On−k,k W Y22 Thus, the first k columns of the product U Y consist only of Os.

Matrices

193

The k + 1st column of U Y is ⎛

⎞ y1 k+1 ⎜ . ⎟ ⎜ .. ⎟ ⎟ ⎜ ⎟ ⎜ ⎜yk k+1 ⎟ ⎟ ⎜ ⎟ U⎜ ⎜ 0 ⎟ ⎜ 0 ⎟ ⎟ ⎜ ⎟ ⎜ ⎜ .. ⎟ ⎝ . ⎠ 0

because Y is an upper triangular, matrix. If p < k, then (U Y )p k+1 = 0 because the first k columns of U contain zeros. (29) Let m, n be two positive natural numbers such that m = pq and n = rs, where p, q, r, s ∈ N. Prove that there exists a bijection between the set Fm×n and the set (Fp×r )q×s . Interpret the existence of this bijection in terms of matrices. (30) Let A ∈ Cn×n be a strictly upper triangular matrix. Prove that A is nilpotent. Solution: Since A is strictly upper triangular, we have aij = 0 for 1  j  i and every i, 1  i  n. We prove by induction on (m) (m) m that for Am = (aij ), we have aij = 0 for 1  j  i+m−1 and every i, 1  i  n. The base case m = 1 is immediate. Suppose that the statement (m+1) = 0 for 1  j  i + m. We holds for m. We show that aij have (m+1)

aij

=

n

(m)

aik akj

k=1

=

n

(m)

aik akj = 0

k=i+m

because akj = 0 when j  i + m  k. Thus, An = 0 and A is nilpotent. (31) Let A ∈ Cn×n be a nilpotent matrix. Prove that rank(A)  n nilp(A)−1 nilp(A) .

Linear Algebra Tools for Data Mining (Second Edition)

194

A Hadamard matrix is a matrix H ∈ for 1  i, j  n and HH  = nIn .

Rn×n

such that hij ∈ {−1, 1}

(32) Verify that the matrices 

1 1 1 −1

and ⎛ ⎞ 1 1 1 1 ⎜1 1 −1 −1⎟ ⎜ ⎟ ⎜ ⎟ ⎝1 −1 −1 1 ⎠ 1 −1 1 −1 are Hadamard matrices. (33) Let H = (h1 · · · hj · · · hn ) ∈ Rn×n be a Hadamard matrix. Prove that the matrix (h1 · · · −hj · · · hn ) is also a Hadamard matrix. (34) Let A = (aij ) be an (m × n)-matrix of real numbers. Prove that max min aij  min max aij j

i

i

j

(the minimax inequality). Solution: Observe that aij0  maxj aij for every i and j0 , so mini aij0  mini maxj aij , again for every j0 . Thus, maxj mini aij  mini maxj aij . (35) Let A ∈ Rn×n and let ΦA : Rn×n −→ Rn× be the function defined by ΦA (X) = AX − XA, for X ∈ Rn×n . Prove that: (a) ΦA is linear, that is, for any a, b ∈ R and X, Y ∈ Rn×n , we have ΦA (aX + bY ) = aΦA (X) + bΦA (Y ); (b) ΦA (XY ) = ΦA (X)Y + XΦA (Y ); (c) if X is an invertible matrix, then ΦA (X) = −XΦA (X −1 )X; (d) ΦA (ΦB (X)) − ΦB (ΦA (X)) = ΦAB (X) − ΦBA (X); (e) trace(ΦA (X)) = 0 for every X ∈ Rn×n .

Matrices

195

(36) Let A be a matrix in Cm×n such that rank(A) = r. Prove that A can be factored as A = P C, where P ∈ Cm×r , C ∈ Cr×n , and rank(P ) = r and as A = DQ, where D ∈ Cm×r , Q ∈ Cr×n and rank(Q) = r. Solution: Let p1 , . . . , pr be a basis for range(A), where A = (a1 · · · an ). Every column ai of A can be written as a linear combination ai = ci1 p1 + · · · + cir pr for 1  i  n. In matrix form these equalities amount to A = P C. The argument for the second part is similar and involves the rows of A. (37) Let {Ai ∈ Rm×ni | 1  i  k} be a set of matrices and let  = ki=1 ni . Prove that the subspaces {range(Ai ) | 1  i  k} are linearly independent if and only if rank(A1 |A2 | · · · |Ak ) =

k

rank(Ai ).

i=1

= dim(range(Ai )) and Hint: Note that rank(Ai ) rank(A1 |A2 | · · · |Ak ) = dim(range(A1 ) + · · · + range(Ak )). (38) Let π = {B1 , . . . , Bk } be a partition of a set S = {s1 , . . . , sn }. Define the characteristic matrix of π, B ∈ Rn×k as  1 if si ∈ Bj bij = 0 otherwise for 1  i  n and 1  j  k. Prove that B  B is a diagonal matrix, (B  B)jj = |Bj | for 1  j  k and that B  B is invertible. (39) Let A and B be two matrices in Cp×q . Prove that rank(A+B)  rank(A B)  rank(A) + rank(B). Solution: We have rank(A B) = rank(A A+B)  rank(A+B) because adding the first q columns of the matrix (A B) to the last q columns does not change the rank of a matrix, and the rank of a submatrix is not larger than the rank of the matrix. On the other hand, rank(A B) = rank((A Op,q ) + (Op,q B))  rank(A Op,q ) + rank(Op,q B) = rank(A) + rank(B). (40) If A ∈ Cp×q and B ∈ Cq×r , then prove that rank(AB)  min{rank(A), rank(B)}.

196

Linear Algebra Tools for Data Mining (Second Edition)

(41) Let A ∈ Rn×k be a matrix, where n  k. If A A = Ik , prove that rank(AA ) = rank(A ) = rank(A)  k. Solution: By Sylvester’s rank theorem, we have k = rank(A A) = rank(A) − dim(null(A ) ∩ range(A)), rank(AA ) = rank(A ) − dim(null(A) ∩ range(A )). The first equality implies rank(A)  k. Let t ∈ null(A) ∩ range(A ). We have At = 0 and t = A z for some z ∈ Rn . Therefore, t t = t A z = (At) z = 0, so t = 0k . Thus, dim(null(A)∩range(A )) = 0, so rank(AA ) = rank(A ) = rank(A)  k. (42) Let A ∈ R3×3 be a matrix such that A2 = O3,3 . Prove that rank(A)  1. (43) Let Z ∈ Rn×n , W ∈ Rm×m and let U, V ∈ Rn×m be four matrices such that each of the matrices W , Z, Z + U W V  , and W −1 + U W V  has an inverse. Prove that (Z + U W V  )−1 = Z −1 − Z −1 U (W −1 + V  Z −1 U )−1 V  Z −1 (the Woodbury–Sherman–Morrison identity). Solution: Consider the following system of matrix equations: ZX + U Y = In , V  X − W −1 Y = Om,n , where X ∈ Rn×n and Y ∈ Rm×n . The second equation implies V  X = W −1 Y , so U W V  X = U Y . Substituting U Y in the first equation yields ZX + U W V  X = In . Therefore, we have (Z + U W V  )X = In , which implies X = (Z + U W V  )−1 .

(3.15)

On the other hand, we have X = Z −1 (In − U Y ) from the first equation. Substituting X in the second equation yields V  Z −1 (In − U Y ) = W −1 Y,

Matrices

197

which is equivalent to V  Z −1 = +W −1 Y + V  Z −1 U Y = (W −1 + V  Z −1 U )Y. Thus, we have Y = (W −1 + V  Z −1 U )−1 V  Z −1 . Substituting the values of X and Y in the first equality implies ZX + U (W −1 + V  Z −1 U )−1 V  Z −1 = In . Therefore, ZX = In − U (W −1 + V  Z −1 U )−1 V  Z −1 ,

(3.16)

which implies X = Z −1 − Z −1 U (W −1 + V  Z −1 U )−1 V  Z −1 .

(3.17)

The Woodbury–Sherman–Morrison identity follows immediately from Equalities (3.15) and (3.17). (44) None of the following matrices A, B, C are invertible:    0 0 1 0 1 1 A= , B= , C= . 0 0 1 1 1 1 However, their pseudoinverses exist and are given by      1 1 1 1 0 0 , B † = 2 2 , C † = 41 14 , A† = 0 0 0 0 4 4 respectively. (45) Let A ∈ Rm×n be a matrix whose columns form an orthonormal set (that is, A A = In ). Prove that the Moore–Penrose pseudoinverse of A is A† = A . Solution: Let M = A . Note that M A = A A = In and AM = AA . Clearly, M A is a symmetric matrix and so is AM . Furthermore, we have AM A = A(A A) = A and M AM = A AA = A = M , so A† = A .

198

Linear Algebra Tools for Data Mining (Second Edition)

(46) Let A ∈ R^{n×n} be an invertible matrix, and let c, d ∈ R^n be two vectors such that d⊤A^{−1}c ≠ −1. Prove that
\[
(A + cd^{\top})^{-1} = A^{-1} - \frac{1}{1 + d^{\top}A^{-1}c}\,A^{-1}cd^{\top}A^{-1}.
\]
Hint: Apply the Woodbury–Sherman–Morrison identity.
(47) Let M ∈ R^{n×n} be a partitioned matrix,
\[
M = \begin{pmatrix} A & B\\ C & D \end{pmatrix},
\]
where A ∈ R^{m×m} and m < n. Prove that, if all inverse matrices mentioned in what follows exist and Q = (A − BD^{−1}C)^{−1}, then
\[
M^{-1} = \begin{pmatrix} Q & -QBD^{-1}\\ -D^{-1}CQ & D^{-1} + D^{-1}CQBD^{-1} \end{pmatrix}.
\]
Hint: Multiply M by the matrix M^{−1} given above.
(48) Let c_1, ..., c_p be p vectors in R^n and k_1, ..., k_p be p numbers. Find sufficient conditions for the existence of a symmetric matrix X ∈ R^{n×n} such that c_i⊤Xc_i = k_i for 1 ≤ i ≤ p.
(49) Let A ∈ C^{n×n} be a Hermitian matrix such that rank(A) = r. Prove that there is a permutation matrix P and a principal submatrix B ∈ C^{r×r} of A such that
\[
P^{H}AP = \begin{pmatrix} I_r\\ F \end{pmatrix} B\,(I_r \;\; F^{H}),
\]
where F ∈ C^{(n−r)×r}.
Solution: Since rank(A) = r, there is a sequence of r linearly independent columns; since A is Hermitian, the rows that have the same numbers are also linearly independent. The elements at the intersection of those rows and columns yield an r × r submatrix B of A having rank r. There exists a permutation matrix P such that
\[
P^{H}AP = \begin{pmatrix} B & C^{H}\\ C & D \end{pmatrix}.
\]
Since P^H AP has rank r and so does (B C^H), it follows that (B C^H) generates the rows of P^H AP. Consequently, C = FB for some matrix F, so D = FC^H, which implies
\[
P^{H}AP = \begin{pmatrix} B & B^{H}F^{H}\\ FB & FB^{H}F^{H} \end{pmatrix}
= \begin{pmatrix} I_r\\ F \end{pmatrix} B\,(I_r \;\; F^{H}).
\]

200

Linear Algebra Tools for Data Mining (Second Edition)

(57) Prove that if A ∈ A = On,n .

Cn×n ,

then trace(AAH ) = 0 if and only if

Solution: By Equality (6.12), we have A 2F = trace(AAH ). Therefore, if trace(AAH ) = 0, it follows that A 2F = 0, so A = On,n . The reverse implication is immediate. (58) Let A ∈ Cn×n . Prove that the following conditions are equivalent: (a) A is Hermitian; (b) xH Ax ∈ R for any x ∈ Cn ; (c) A2 = AH A. (59) Let R(p,q) ∈ Rn×n be the matrix defined by R = (0 · · · 0 ep 0 · · · 0), where ep occurs in the qth position. (a) Prove that if p = q, we have (I + aR(p,q))−1 = I − aR(p,q) for every a ∈ R. (b) Prove that if T ∈ Rn×n and p < q, then ⎞ ⎛ 0 ··· 0 ··· 0 ··· 0 ⎟ ⎜. ⎜ .. · · · ... · · · ... · · · ... ⎟ ⎟ ⎜ ⎟ ⎜ R(p,q)T = ⎜0 · · · 0 · · · tqq · · · tqn ⎟ , ⎟ ⎜ ⎜ .. . . . ⎟ ⎝ . · · · .. · · · .. · · · .. ⎠ 0 ··· 0 ···

0

where the elements tqq , . . . , tqn occur ⎛ 0 · · · t1p ⎜. ⎜ .. · · · ... ⎜ ⎜ ⎜0 · · · tpp (p,q) =⎜ TR ⎜0 · · · 0 ⎜ ⎜. . ⎜. ⎝ . · · · .. 0 · · · t1p

···

0

in line p, and ⎞ ··· 0 .⎟ · · · .. ⎟ ⎟ ⎟ · · · 0⎟ ⎟, · · · 0⎟ ⎟ .. ⎟ ⎟ · · · .⎠ ··· 0

where t1p , . . . , tpp occur in the qth column. (c) Prove that if T ∈ Rn×n and p < q, then R(p,q)T R(p,q) = O. (d) Prove that if T ∈ Rn×n is an upper triangular matrix and p < q, then ((I + aRp,q )−1 T (I + aRp,q ))ij = tij unless i = p and j  q or j = q and i  p; moreover, prove that ((I + aRp,q )−1 T (I + aRp,q ))pq = tpq + a(tpp − tqq ).

Matrices

201

Solution: We discuss only the last part of this supplement. Observe that (I + aR(p,q))−1 T (I + aR(p,q) ) = (I − aR(p,q) )T (I + aR(p,q) ) = T − aR(p,q) T + aT R(p,q). Taking into account the first three parts, the claims made in the last part are immediately justified. (60) Let Z (k) ∈ Cn×n be the matrix defined by  1 if i = j and i = k (k) (Z )ij = 0 otherwise, for 1  i, j  n. If A ∈ Cn×n , prove that AZ (k) is obtained from A by replacing the k th column by zeros; similarly, Z (k) A is obtained from A by replacing the k th row by zeros. (61) Let K be a subset of {1, . . . , n} and let k = |K|. Define the matrix C (n,K) as the n × k matrix obtained from In by eliminating all columns whose numbers do not occur in K; the matrix R(n,K) is the k × n matrix obtained from In by eliminating all rows whose numbers do not occur in K. (a) If A ∈ Cn×n , prove that AC (n,K) is obtained from A by eliminating all columns whose numbers do not occur in K; similarly, R(n,K) A is obtained from A by erasing all rows whose numbers do not

occur in K. i1 · · · ik be a submatrix of A ∈ Cm×n . Prove (b) Let B = A j1 · · · jh that B = R(m,{i1 ,...,ik } AC (n,{j1 ,...,jh }) . (62) Prove that rank(AH A) = rank(AAH ) = rank(A) for any matrix A ∈ Cm×n using Sylvester’s Theorem. (63) Prove that if A ∈ Cm×n , then AH A is singular if and only if m < n. (64) Prove or disprove: (a) If 0 ∈ W , where W = {w 1 , . . . , wn }, then W is linearly independent. (b) If W = {w1 , . . . , w n } is linearly independent and w is not a linear combination of the vectors of W , then W ∪ {w} is linearly independent.

202

Linear Algebra Tools for Data Mining (Second Edition)

(c) If W = {w1 , . . . , w n } is linearly dependent, then any of wi is a linear combination of the others. (d) If y is not a linear combination of {w 1 , . . . , w n }, then {y, w1 , . . . , w n } is linearly independent. (e) If any n − 1 vectors of the set W = {w 1 , . . . , wn } are linearly dependent, then W is linearly independent. (65) Let A ∈ Cn×n be a matrix such that I + A is invertible. The Cayley transform of A is the matrix C(A) = (I − A)(I + A)−1 . (a) Prove that if C(A) exists, then C(A) is a unitary matrix if and only if A is skew-Hermitian matrix. (b) Prove that C(C(A)) = A if all transforms exist. (c) Prove that   0 tan α cos 2α − sin 2α C = . − tan α 0 sin 2α cos 2α (66) Prove that if A ∈ Cn×n is an idempotent matrix, then range(A) ∩ null(A) = {0}. (67) Let A ∈ Cn×n be an idempotent matrix. Prove that I − A is an idempotent matrix and that rank(A) + rank(I − A) = n. Solution: We leave to the reader the proof of the idempotency of I − A. To prove the desired equality, start from Equality (3.8), that is, from rank(A) + dim(null(A)) = n. Thus, it suffices to show that rank(I − A) = dim(null(A)). In turn, it is sufficient to show that range(I − A) = null(A). Let u ∈ range(I −A). We have u = (I −A)x for some x, which implies Au = (A − A2 )x = 0. Thus, u ∈ null(A). Conversely, if u ∈ null(A), then Au = 0, and u = Iu − Au = (I − A)u, so u ∈ range(I − A). (68) Let c ∈ Cn be a vector such that 1n c = 1 and let Qc ∈ Cn×n be the matrix defined by Qc = In − 1n c . Prove that Qc 1n = 0n and Qc is idempotent. Solution: It is clear that c 1n = 1. Therefore, Q2c = (In −1n c )(In −1n c ) = In −21n c +1n c 1n c = In −1n c = Qc .

(69) Let c be a vector in idempotent matrix.

Cn .

Prove that if 1n c = 1, then 1n c is an

Matrices

203

An involutive matrix is a matrix A ∈ Cn×n such that A2 = In . (70) Prove that if B ∈ Cn×n is an idempotent matrix, then A = 2B − In is an involutive matrix. (71) Let A ∈ Cn×n be an involutive matrix and let S = {x ∈ Cn | Ax = x} and T = {x ∈ Cn | Ax = −x}. Prove that both S and T are subspaces of Cn and Cn = S  T . (72) Let A ∈ Rn×n be the matrix obtained from In by swapping the i th column ei with the j th column ej . Prove that A is an involutive matrix and XA is obtained from X by swapping the i th column xi with the j th column xj . (73) Let A ∈ Cn×m be a matrix, and let c ∈ Cn and d ∈ Cm . Prove that rank(A) − 1 ≤ rank(A) − rank(cdH )  rank(A − cdH ) ≤ rank(A) + rank(cdH )  rank(A) + 1. (74) Let A ∈ Cm×n be a matrix, u ∈ Cm and v ∈ Cn be two vectors, and let a ∈ C. Prove that if rank(A − auv H ) < rank(A), then u ∈ range(A) and v ∈ range(AH ). Solution: Observe that we can write ⎛

⎞ au1 vH ⎜ . ⎟ ⎟ auv H = (av1 u · · · avn u) = ⎜ ⎝ .. ⎠ . aum v H Thus, if w1 , . . . , w m are the rows of the matrix A, we have ⎛ ⎞ w1 − au1 v H ⎜ ⎟ .. ⎟. A − auv H = ⎜ . ⎝ ⎠ wm − aum v H Suppose that rank(A) = r and let wi1 , . . . , w ir be a maximal set of linearly independent rows. Since rank(A − auv H ) < r,

Linear Algebra Tools for Data Mining (Second Edition)

204

there is a linear combination α1 (w i1 − aui1 v H ) + · · · + αr (w ir − auir v H ) = 0, such that not all coefficients αi are 0. The previous equality can be written as α1 wi1 + · · · + αr wir = (aα1 ui1 + · · · + aαr uir )v H ,

(75) (76) (77) (78)

which shows that aα1 ui1 + · · · + aαr uir = 0 and that vH is a linear combination of the rows of A. Thus, v ∈ range(AH ). Proving that u ∈ range(A) is entirely similar. Prove that if A ∈ Cn×n is a normal matrix and U ∈ Cn×n is a unitary matrix, then U H AU is a normal matrix. Let A ∈ Cn×n be a normal matrix and let a, b ∈ C. Prove that the matrix B = aA + bIn is also normal. Let A, B ∈ Cn×n be two matrices such that AB = pA + qB, where p, q ∈ C − {0}. Prove that AB = BA. Let A ∈ Cm×n , B ∈ Cp×q , C ∈ Cn×r , and D ∈ Cq×s . Prove that (A ⊗ B)(C ⊗ D) = AC ⊗ BD. A ∈ Cn×n and B ∈ Cm×n . Prove that (A ⊗ B)H = AH ⊗ B H ; if A and B are Hermitian, then so is A ⊗ B; if A and B are invertible, then so is A⊗B and (A⊗B)−1 = A−1 ⊗ B −1 . Prove that if A and B are two square submatrices in Cn×n , then A ∗ B is a principal submatrix of A ⊗ B. Prove that trace(A ⊗ B) = trace(A)trace(B), for any matrices A and B. Prove that rank(A ⊗ B) = rank(A)rank(B) for any matrices A and B. Let A ∈ Cm×n and B ∈ Cp×q . Prove that if A and B are Hermitian, then so is A ⊗ B.

(79) Let (a) (b) (c) (80) (81) (82) (83)

Solution: By the last part of Theorem 3.56 we have (A⊗B)H = AH ⊗ B H = A ⊗ B, which gives the desired conclusion.

Matrices

205

(84) Let ψ : {1, . . . , m1 } × {1, . . . , mk } −→ {1, . . . , m1 · · · mk } be the function defined by ψ(i1 , . . . , ik ) = 1 +

p−1 k

 (ip − 1) mj , p=1

j=1

where 1  i  m and 1    k. Prove that ψ is a bijection. Solution: It suffices to show that ψ is injective. Suppose that ψ(i1 , . . . , ik ) = ψ(i1 , . . . , ik ). By the definition of ψ, we have (i1 − 1) + (i2 − 1)m1 + (i3 − 1)m1 m2 + · · · + (ik − 1)m1 · · · mk = (i1 − 1) + (i2 − 1)m1 + (i3 − 1)m1 m2 + · · · + (ik − 1)m1 · · · mk .

Since i1  m1 , by reducing the previous equality modulo m1 it follows that i1 = i1 and (i2 − 1) + (i3 − 1)m2 + · · · + (ik − 1) · · · mk = (i2 − 1) + (i3 − 1)m2 + · · · + (ik − 1)m2 · · · mk . The same argument involving m2 implies i2 = i2 , etc. (n) (85) Let ei be the vector in Rn whose unique non-zero component (m) (n) is in position i, where 1  i  n. Prove that ei ⊗ ej = (mn)

em(i−1)+j . Extend this result by proving that (m1 )

e i1

(m2 )

⊗ ei2

(mk )

⊗ · · · ⊗ e ik

k

m

r = eψ(ir=1 , 1 ,...,ik )

where ψ is the function defined in Supplement 3.17. (86) Prove that: (m) (n) (m) (n) (m,n) (m,n) , where Eij is the (a) ei (ej ) = ei ⊗ (ej ) = Eij m×n matrix in R which has a unique non-zero entry in position (i, j); (b) if A = (aij ) is a matrix in Rm×n , then A=

n m

i=1 j=1

(m)

aij ei

(ej ) = (n)

n m

i=1 j=1

(m)

aij ei

⊗ (ej ) . (n)


(87) Let A ∈ R^{m×n} and B ∈ R^{p×q}. Prove that
\[
A\otimes B = \sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{h=1}^{p}\sum_{k=1}^{q}
a_{ij}b_{hk}\, e^{(mp)}_{p(i-1)+h}\,\bigl(e^{(qn)}_{q(j-1)+k}\bigr)^{\top}.
\]
Solution: We can write
\[
\begin{aligned}
A\otimes B &= \left(\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}\, e^{(m)}_{i}\otimes\bigl(e^{(n)}_{j}\bigr)^{\top}\right)
\otimes\left(\sum_{h=1}^{p}\sum_{k=1}^{q} b_{hk}\, e^{(p)}_{h}\otimes\bigl(e^{(q)}_{k}\bigr)^{\top}\right)\\
&= \sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{h=1}^{p}\sum_{k=1}^{q}
a_{ij}b_{hk}\, e^{(m)}_{i}\otimes\bigl(e^{(n)}_{j}\bigr)^{\top}\otimes e^{(p)}_{h}\otimes\bigl(e^{(q)}_{k}\bigr)^{\top}\\
&= \sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{h=1}^{p}\sum_{k=1}^{q}
a_{ij}b_{hk}\, e^{(m)}_{i}\otimes e^{(p)}_{h}\otimes\bigl(e^{(n)}_{j}\bigr)^{\top}\otimes\bigl(e^{(q)}_{k}\bigr)^{\top}\\
&= \sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{h=1}^{p}\sum_{k=1}^{q}
a_{ij}b_{hk}\, e^{(mp)}_{p(i-1)+h}\otimes\bigl(e^{(qn)}_{q(j-1)+k}\bigr)^{\top}.
\end{aligned}
\]
(88) Let A^{(r)} ∈ R^{m_r×p_r} for 1 ≤ r ≤ N be N matrices. Extend Supplement 87 for the matrix A^{(1)} ⊗ ··· ⊗ A^{(N)} ∈ R^{∏_{r=1}^{N} m_r × ∏_{r=1}^{N} p_r}.
(89) Let A, X, B be three matrices such that A ∈ R^{m×p}, X ∈ R^{p×q}, and B ∈ R^{q×n}. Prove that:
(a) we have vec(AXB) = (B⊤ ⊗ A)vec(X);
(b) if the matrix B⊤ ⊗ A is invertible, then the equation AXB = C is solvable in X.
Solution: Suppose that
\[
B = \begin{pmatrix} b_{11} & \cdots & b_{1n}\\ \vdots & & \vdots\\ b_{q1} & \cdots & b_{qn} \end{pmatrix}.
\]
By the definition of Kronecker products, we have
\[
B^{\top}\otimes A = \begin{pmatrix} b_{11}A & \cdots & b_{q1}A\\ \vdots & & \vdots\\ b_{1n}A & \cdots & b_{qn}A \end{pmatrix}.
\]
By writing X columnwise, X = (x_1 ··· x_q), we further have
\[
\operatorname{vec}(X) = \begin{pmatrix} x_1\\ \vdots\\ x_q \end{pmatrix}
\qquad\text{and}\qquad
(B^{\top}\otimes A)\operatorname{vec}(X) = \begin{pmatrix} b_{11}Ax_1 + \cdots + b_{q1}Ax_q\\ \vdots\\ b_{1n}Ax_1 + \cdots + b_{qn}Ax_q \end{pmatrix}.
\]
On the other hand, we have XB = (b_{11}x_1 + ··· + b_{q1}x_q, ..., b_{1n}x_1 + ··· + b_{qn}x_q), which implies
AXB = (b_{11}Ax_1 + ··· + b_{q1}Ax_q, ..., b_{1n}Ax_1 + ··· + b_{qn}Ax_q).
This shows that vec(AXB) = (B⊤ ⊗ A)vec(X). The second part of the supplement is immediate.
(90) Let U ∈ C^{m×n} and V ∈ C^{n×p} be two matrices. Prove that:
(a) vec(UV) = (I_p ⊗ U)vec(V);
(b) vec(UV) = (V⊤ ⊗ I_m)vec(U);
(c) vec(UV) = diag(U, ..., U)vec(V).
Solution: This follows immediately from Supplement 89.
(91) Prove that if x ∈ C^m and y ∈ C^n, then x ⊗ y = vec(yx⊤) and x ⊗ y⊤ = xy⊤ = y⊤ ⊗ x.
(92) Let A, B ∈ R^{n×n}. Prove that trace(AB) = (vec(A⊤))⊤vec(B).
Solution: Since trace(AB) = Σ_{i=1}^{n} e_i⊤ABe_i, the result follows from Equality (3.3).
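Supplements 89(a) and 92 can be verified numerically with reshape, which implements the vec operator when called with a single column; the sizes below are arbitrary illustrative assumptions:

% Hedged sketch: numerical check of Supplements 89(a) and 92.
A = rand(3,4); X = rand(4,2); B = rand(2,5);
norm(reshape(A*X*B,[],1) - kron(B',A)*reshape(X,[],1))   % vec(AXB) = (B' kron A) vec(X), ~0
P = rand(4); Q = rand(4);
abs(trace(P*Q) - reshape(P',[],1)'*reshape(Q,[],1))      % trace(PQ) = vec(P')' vec(Q), ~0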


Let d_i ∈ C^m be the ith column of the matrix I_m and e_j be the jth column of the matrix I_n. Define the matrix H_{ij} = d_i e_j⊤ ∈ C^{m×n}, which has a 1 in its (i, j) position and 0 elsewhere. Then, any matrix A ∈ C^{m×n} can be written as
\[
A = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}H_{ij}. \tag{3.18}
\]
This implies
\[
A^{\top} = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}H_{ij}^{\top}. \tag{3.19}
\]
(93) Let e_i be the ith column of the matrix I_n for 1 ≤ i ≤ n and let E_{[ij]} = e_i e_j⊤. Prove that:
(a) the single nonzero entry of E_{[ij]} is (E_{[ij]})_{ij} = 1;
(b) E_{[ij]}e_j = e_i;
(c) Σ_{i=1}^{n} E_{[ii]} = I_n;
(d) vec(I_n) = Σ_{i=1}^{n} e_i ⊗ e_i;
(e) E_{[ij]} ⊗ E_{[kl]} = vec(E_{[ki]})(vec(E_{[lj]}))⊤;
(f) Σ_{i=1}^{n}Σ_{j=1}^{n}(E_{[ij]} ⊗ E_{[ij]}) = (vec(I_n))(vec(I_n))⊤.
(94) Prove that if e_i ∈ C^m and e_j ∈ C^n, then e_i ⊗ e_j = e^{(mn)}_{(i−1)n+j}.
Define the matrix K_{mn} ∈ R^{mn×mn} as
\[
K_{mn} = \sum_{i=1}^{m}\sum_{j=1}^{n}\bigl(H_{ij}\otimes H_{ij}^{\top}\bigr).
\]
K_{mn} is a square mn-dimensional matrix partitioned into mn submatrices in R^{n×m} such that the (i, j) submatrix has a 1 in position (j, i) and 0 elsewhere. K_{mn} is referred to as the mn-commutation matrix (see [109], which contains a detailed study of these matrices). For example, K_{23} is a 6 × 6 matrix partitioned into six submatrices, as follows:
\[
K_{23} = \sum_{i=1}^{2}\sum_{j=1}^{3}\bigl(H_{ij}\otimes H_{ij}^{\top}\bigr) =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0\\
0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.
\]
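The commutation matrix can be generated and tested numerically; the sketch below builds K_{mn} directly from its definition and checks the property proved in the next supplement (the dimensions are those of the example above):

% Hedged sketch: build the commutation matrix K_{mn} from its definition
% and check that K_{mn} vec(A) = vec(A').
m = 2; n = 3;
K = zeros(m*n);
for i = 1:m
    for j = 1:n
        H = zeros(m,n); H(i,j) = 1;   % the matrix H_{ij}
        K = K + kron(H, H');
    end
end
A = rand(m,n);
norm(K*reshape(A,[],1) - reshape(A',[],1))   % should be 0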

(95) Prove that if A ∈ Rm×n , we have Kmn vec(A) = vec(A ). Solution: We have ⎛ ⎞ ⎞ ⎛



 ⎠ aij Hij = vec ⎝ (di Aej )(ej di )⎠ vec(A ) = vec ⎝ ⎛ = vec ⎝ =



ij



⎞ ej di Aej di ⎠ =

ij

ij



  vec(Hij AHij )

ij

 Hij Hij vec(A)

ij

(by Supplement 3.17) = Kmn vec(A). Here di ∈ Cm is the i th column of the matrix Im and ej is the j th column of the matrix In . (96) Prove the following properties of the commutation matrix Kmn defined above:  (a) Kmn = nj=1 (ej ⊗ Im ⊗ ej ) = m i=1 (di ⊗ In ⊗ di );  = Knm ; (b) Kmn −1 = K (c) Kmn nm ; (d) K1n = Kn1 = In ; (e) trace(Kmn ) = 1 + gcd(m − 1, n − 1); (f) if A ∈ Cn×s and B ∈ Cm×t , then Kmn (A⊗B)Kst = B ⊗A. Solution: Part (a): of Kmn implies Kmn = The definition   )= (H ⊗ H (d e ⊗ e d ). By applying Exercise 91, ij i j i ij j ij ij

210

Linear Algebra Tools for Data Mining (Second Edition)

we have

(di ej ⊗ ej di ) =

ij

(ej ⊗ di ⊗ di ⊗ ej ) ij

=

j

=





 ej

⊗ 

 ej ⊗

i



j

=

 di ⊗ di

 ⊗ ej

 di di

 ⊗ ej

i

(ej ⊗ Im ⊗ ej ). j

Similarly,



(di ej ⊗ ej di ) = (di ⊗ ej ⊗ ej ⊗ di ) ij

ij

=

i

⎛ ⎞ ⎞

⎝di ⊗ ⎝ ej ⊗ ej ⎠ ⊗ di ⎠ ⎛

j

⎛ ⎞ ⎞



⎝di ⊗ ⎝ ej ej ⎠ di ⎠ = ⎛

i

j

(di ⊗ In ⊗ di ). = i

Part (b): We have



 = (di ej ⊗ ej di ) = (ej di ⊗ di ej ) = Kmn . Kmn ij

ji

 K Part (c): We need to verify that Kmn mn = Imn . This equality follows from



 Kmn = (ej di ⊗ di ej ) (ds et ⊗ et ds ) Kmn ji

st

(ej di ds et ⊗ di ej et ss ) = jist

⎛ ⎞ 



(ej ej ⊗ di di ) = ⎝ ej ej ⎠ di di = Imn . = ji

j

i

Matrices

211

Part (d): We have



(ej ⊗ ej ) = In = (ej ⊗ ej ) = Kn1 . K1n = j

j

Part (e): Note that ⎛



trace(Kmn ) = trace ⎝ (di ej ⊗ ej di )⎠ =



ij

trace((di ⊗ ej )(ej ⊗ di ))

ij

(ej ⊗ di ) (di ⊗ ej ). = ij

The mn-vector ej ⊗ di has a unique 1 in its ((j-1)m +i)th position and 0s elsewhere, and the vector di ⊕ ej has a unique 1 in its ((i − 1)n + j)th position. Therefore, we have (ej ⊗ di ) (di ⊗ ej ) =

 1 if(j − 1) + i = (i − 1)n + j, 0 otherwise.

Let {E1 , . . . , Ep } be a set of equalities and let B[Ei | 1  i  p] be the number of valid equalities in this set. We have trace(Kmn ) =

(ej ⊗ di ) (di ⊗ ej ) ij

= B[(j − 1)m + i = (i − 1)n + j | 1  i  m, 1  j  n] = B[(j − 1)(m − 1) = (i − 1)(n − 1) | 1  i  m, 1  j  n] = 1 + B[(j − 1)(m − 1) = (i − 1)(n − 1) | 2  i  m, 2  j  n] = 1 + B[j(m − 1) = i(n − 1) | 1  i  m − 1,

Linear Algebra Tools for Data Mining (Second Edition)

212

1  j  n − 1] i m−1 = | 1  i  m − 1, = 1+B n−1 j 1  j  n − 1] . Let m =

m n and n = . gcd(m, n) gcd(m, n)

Any pair (i, j) such that ji = m n where 1  i  m and 1  j  n must be of the form i = αm and j = αn , where α is a positive rational number smaller or equal to gcd(m, n). We can write α = pq , where p, q are positive integers and gcd(p, q) = 1. Then, 



pn i = pm q and j = q . The numbers i and j are integers if and only if pm and pn are both divisible by q. Since gcd(p, q) = 1, the only common divisor of m and n is 1, which implies q = 1. This, in turn, implies i = pm and j = pn , where 1  p  gcd(m, n). Thus, there are gcd(m, n) pairs (i, j) such that ji = m n , that is,

  m − 1 i   = | 1  i  m − 1, 1  j  n − 1  = gcd(m, n). B n−1 j

Part (f): Let X ∈ Rs×t . We have Kmn (A ⊗ B)Kst vec(X) = Kmn (A ⊗ B)vec(X  ) = (by Supplement 3.17) = Kmn vec(BX  A ) = vec(AXB  ) = (B ⊗ A)vec(X), which gives the desired result. (97) Prove that for every A ∈ Cm×n , rank(A) = rank(AH ). ¯In ) for any If m = n, prove that rank(A − aIn ) = rank(AH − a a ∈ C.


(98) Let U ∈ Rn×n be the upper triangular matrix ⎛ ⎞ 1 1 ··· 1 1 ⎜0 1 · · · 1 1⎟ ⎜ ⎟ ⎟ U =⎜ .. .. ⎟ ⎜ .. .. ⎝. . · · · . .⎠ 0 0 ··· 0 1 and let V ∈ Rn×n be the ⎛ 1 ⎜0 ⎜ ⎜. V =⎜ ⎜ .. ⎜ ⎝0 0

matrix −1 0 1 −1 .. .. . . 0 0 0 0

··· ···

0 1 . · · · .. · · · −1 ··· 0

⎞ 0 1⎟ ⎟ .. ⎟ ⎟ .⎟ . ⎟ 1⎠ 1

(a) Verify that U V = In . (b) Let a, b ∈ Rn and let sk = ki=1 ai . Using Part (a), prove Abel’s equality: n

i=1

ai bi =

n−1

si (bi − bi+1 ) + sn bn .

i=1

Solution: Since Part (a) ⎛ is immediate, we discuss only ⎞ s1 ⎜ ⎟ Part (b). Note that if s = ⎝ ... ⎠, then s = U  a. Also, we sn have ⎞ ⎛ b1 − b2 ⎜ b2 − b3 ⎟ ⎟ ⎜ ⎟ ⎜ . ⎟. ⎜ .. Vb=⎜ ⎟ ⎟ ⎜ ⎝bn−1 − bn ⎠ bn This allows us to write n−1

si (bi − bi−1 ) + sn bn = s (V b) = a U V b = a b, i=1

which yields Abel’s equality.
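Abel's equality is easy to confirm numerically; the vectors in the sketch below are arbitrary random choices used only for illustration:

% Hedged sketch: numerical check of Abel's summation equality.
n = 7;
a = rand(n,1); b = rand(n,1);
s = cumsum(a);                                   % s_k = a_1 + ... + a_k
lhs = a'*b;
rhs = sum(s(1:n-1).*(b(1:n-1) - b(2:n))) + s(n)*b(n);
abs(lhs - rhs)                                   % should be ~0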


(99) Let x, y, z ∈ Rn such that x1  x2  · · ·  xn  0 and i i   j=1 yj  j=1 zj for 1  i  n. Prove that x y  x z. Solution: Let a ∈ Rn be defined by ai = zi − yi . To apply i Abel’s equality to the vectors a and x, let si = j=1 zj − i j=1 yj  0 for 1  i  n. We have x z − x y =

n

ai x i =

i=1

n−1

si (xi − xi+1 ) + sn xn  0,

i=1

which gives the desired inequality. (100) Let a, b ∈ Rn be two vectors such that a1  a2  · · ·  an  0 and b1  b2  · · ·  bn  0 and let P  On,n be a matrix in Rn×n such that P 1n  1n and P  1n  1n . Prove that a P b  a b. Solution: Observe that n

k

pij =

j=1 i=1

because

n

j=1 pij

k

n

pij  k,

i=1 j=1

 1 for every i. This allows us to write

k

k

k n



pij +

j=1 i=1

pij  k,

j=k+1 i=1

hence, k n



pij  k −

j=k+1 i=1

k

k

pij

j=1 i=1

=

k

j=1



1−

k

 pij

.

i=1

Since b1  b2  · · ·  bn  0, we get a stronger inequality by introducing the factors b1 , . . . , bn as follows:   k n k k





pij bj ≤ bj 1 − pij . (3.20) j=k+1 i=1

j=1

i=1


Let c = P b. We have k

ci =

i=1

n k



n

pij bj =

i=1 j=1

=

k

j=1

bj

bj

j=1

k

i=1

pij

i=1

n

pij +

k

bj

k

j=k+1

pij .

i=1

Combining the last inequality and Inequality (3.20), we have k

ci 

i=1



k

bj

k

pij +

n

j=1

i=1

j=k+1

k

k

k

j=1

bj

i=1

pij +

bj

bj 

k

pij

i=1

1−

j=1

k

i=1

 pij

=

k

bj

j=1

for 1  k  n. The result follows from Supplement 3.17. (101) Let A ∈ Cm×n be a matrix. If X ∈ Cn×k and Y ∈ Cm×k are two matrices such that the matrix R = Y H AX ∈ Ck×k is invertible and B = A − AXR−1 Y H A, then rank(B) = rank(A) − rank(AXR−1 Y H A). (102) Let  A(α) =

1 1 , 1 α

where α ∈ R. Define the function f : R −→ R as f (α) = rank(Aα ). Prove that f is not continuous in 1. (103) Let A ∈ Rn×n be a matrix such that A  On,n . Prove that A is a stochastic matrix if and only if A1n = 1; prove that A is doubly stochastic if and only if A1n = 1 and 1n A1n . (104) Prove that if A, B ∈ Rn×n are (doubly) stochastic matrices, then AB is a (doubly) stochastic matrix. Also, prove that for every c ∈ [0, 1], the matrix cA+(1−c)B is a (doubly) stochastic matrix. (105) Prove that if A ∈ Rn×n is a (doubly) stochastic matrix, then Am is a (doubly) stochastic matrix for m ∈ N.


(106) Let A ∈ Rn×n be a doubly stochastic matrix and let x, y ∈ Rn be two vectors such that x1  x2  · · ·  xn  0 and y1  y2  · · ·  yn  0. Prove that x Ay  x y. Solution: Note that there exist n nonnegative numbers a1 , . . . , an and n nonnegative numbers b1 , . . . , bn such that xp = ar + · · · + an and yp = br + · · · + bn for 1  r  n. We have x y − x Ay = x(I − A)y = a H  (I − A)AHb, because x = Ha and y = Hb, where ⎛ 1 1 1 ··· ⎜0 1 1 · · · ⎜ H=⎜ ⎜ .. .. .. .. ⎝. . . . 0 0 0 ···

⎞ 1 1⎟ ⎟ ⎟ .. ⎟ . .⎠ 1

Thus, it suffices to show that H  (I − A)H  0n,n . We have (H  AH)ij =

n

n

hpi (I − A)pq hqj

p=1 q=1

=

j i

(I − A)pq . p=1 q=1

If i  j, we have 

(H AH)ij = i −

j i



apq  0,

p=1 q=1

because no sum jq=1 apq can exceed 1 and there are i such sums (1  p  i). The case when j  i is similar. (107) Let A ∈ Rn×n be a matrix such that AP = P A for every permutation matrix P . Prove that there exist a, b ∈ R such that A = aIn + bJn . Solution: Suppose that P = Pφ , where φ ∈ PERMn . Taking into account Equality (3.1), (AP )ij = (P A)ij is equivalent to


aiφ−1 (j) = aφ(i)j for every permutation φ ∈ PERMn . If φ is the transposition that exchanges i and j, then the last equality amounts to aii = ajj for 1  i, j  n, so all elements on the main diagonal are equal. Let now aik and aj be two elements outside the main diagonal, so i = k and  = j. Let φ be a permutation such that φ(k) = j and φ(i) = . Since φ−1 (j) = k, it follows that aik = aj , so any off-diagonal elements of A are also equal. Thus, A has the desired form. (108) Let A, B ∈ Rn×n be two matrices such that A > 0 and B > 0 and let D(A, B) be the divergence defined by D(A, B) =

n  n

i=1 j=1

aij − aij + bij . aij ln bij

Prove that: (a) D(A, B)  0 and that D(A, B) = 0 implies A = B. (b) If A and B are stochastic matrices, then D(A, B) = both aij n n i=1 j=1 aij ln bij . (In this case D(A, B) is the Kullback–Leibler divergence, denoted by KL(A, B).) Solution: Let f : R>0 −→ R be the function defined by f (x) = x − 1 − ln x. We have f (1) = 0, f  (x) = 1 − x1 , and f  (x) = x12 , so f has a minimum for x = 1. Thus, f (x)  f (1) = 0, which allows us to conclude that x − 1 − ln x  0 for x > 0; also, x − 1 − ln x = 0 if and only if x = 1. Therefore, we have bij bij − 1 − ln  0, aij aij a

or bij − aij + aij ln bijij  0, where the equality takes place only when aij = bij . This implies immediately the inequality of the first part. The second part follows immediately from the definition of stochastic matrices. (109) Give an example of a non-invertible matrix A ∈ Cn×n that satisfies the condition |aii |  {|aik | | 1  k  n and k = i} for 1  i  n. Note that this condition is weaker than the condition of diagonal dominance used in Theorem 3.49. (110) Prove that rank(A† ) = rank(A) for every A ∈ Cm×n .


(111) Let A ∈ Cm×n and B ∈ Cn×p . If rank(A) = rank(B) = n, then prove that (AB)† = B † A† . (112) Let Qr,p be the collection of r-element subsets of the set {1, . . . , p} and let A ∈ Rm×n be a matrix such that rank(A) = r. Denote by I(A) = {I ∈ Qr,m | rank(A(I, :)) = r}, J(A) = {J ∈ Qr,n | rank(A(:,  J))

= r}, and N(A) = {(I, J) ∈ I = r. Prove that: Qr,m × Qr,n | rank A J (a) I(A), J(A) and N(A) denote the maximal sets of linearly independent rows and columns, and of maximal nonsingular matrices, respectively; (b) N(A) = I(A) × J(A); (c) if 0  k  r2 , then the intersection of any r − k linearly independent rows and r − k linearly independent columns is a matrix of rank at least equal to r − 2k. Solution: The inclusion N(A) ⊆ I(A) × J(A) is immediate. By the Full-rank factorization theorem (Theorem 3.35) there r×n such that are two full-rank matrices B ∈ C m×r

and C ∈ C I A = BC. Then, every matrix A can be factored as J I A = C(I, :)B(:, J), J which implies I(A) × J(A) ⊆ N(A). For the third part, note that dropping a row and a column of a matrix decreases the rank by at most 2. The Pauli matrices are the matrices P1 , P2 , P3 ∈ C2×2 defined by    0 1 0 −i 1 0 , P2 = , P3 = . P1 = 1 0 i 0 0 −1 (113) Prove that Pk2 = I2 and Pk Ph + Ph Pk = 2δhk I2 for k, k ∈ {1, 2, 3}. (114) Prove that the set {I2 , P1 , P2 , P3 } is a basis in C2×2 . (115) Prove that P12 = P22 = P32 = −iP1 P2 P3 = I2 , and trace(Pk ) = 0 for 1  k  3.

Matrices

219

(116) Let u ∈ Cm and v ∈ Cn . Prove that det(u ∗ v) = 0. (117) Prove that if u, v, w ∈ Cn , then (a) (u ∗ v)H = v ∗ u; (b) (v + w) ∗ u = v ∗ u + w ∗ u; (c) u ∗ (v + w) = u ∗ v + u ∗ w; (d) c(u ∗ v) = (cu) ∗ v = u ∗ (cv). Let x, y ∈ Rn . To compute the product x y = ni=1 xi yi , a number of n multiplications and n − 1 additions of real numbers are required. The main component of this time is given by the number of multiplications, so minimizing this number is important in reducing computing time. (118) For x, y ∈ Rn , define ξ and η as n/2

ξ=

j=1

n/2

x2j−1 x2j , η =



y2j−1 y2j .

j=1

Note that the computation of ξ and η requires 2n/2 products. Prove that x y is given by x y =

⎧ n/2 ⎪ ⎨ j=1 (x2j−1 + y2j )(x2j + y2j−1 ) − ξ − η

if n is even,

⎪ ⎩n/2 (x

if n is odd.

j=1

2j−1

+ y2j )(x2j + y2j−1 ) − ξ − η + xn yn

Thus, the total number of products needed to compute x y is upper bounded by 2n/2 + (n + 1)/2. (119) Let A ∈ Rm×n and B ∈ Rn×p . If we write ⎛ ⎞ a1 ⎜ . ⎟ ⎟ A=⎜ ⎝ .. ⎠ and B = (b1 , . . . , bp ), am computing the matrix product AB using the standard computation is equivalent to performing mp inner products of ndimensional vectors that require mpn multiplications. Prove that computing these mp inner products using the technique described in Exercise 3.17 requires (m + p)n + (mp − m − p)(n + 1)/2 multiplications. For large values of m, n, p, this number is roughly half of mnp.

Linear Algebra Tools for Data Mining (Second Edition)

220

(120) Let  A=

a11 a12 a21 a22



 and B =

b11 b12 b21 b22



be two matrices in R2×2 . The standard method for computing their product requires eight multiplications. Define the following seven numbers: I = (a11 + a22 )(b11 + b22 ) II = (a21 + a22 )b11 III = a11 (b12 − b22 ) IV = a22 (−b11 + b21 ) V = (a11 + a12 )b22 V I = (−a11 + a21 )(b11 + b12 ) V II = (a12 − a22 )(b21 + b22 ). Verify that their product C = AB can be written as c11 c21 c12 c22

= I + IV − V + V II = II + IV = III + V = I + III − II + V I,

using a total of seven multiplications and 18 additions. (121) Let A, B ∈ Rn×n be two matrices, where n = 2k , and   B11 B12 A11 A12 and B = , A= A21 A22 B21 B22 k−1

k−1

and Aij , Bij ∈ R2 ×2 . Using multiplications of blockmatrices, the product C = AB ∈ Rn×n can be written as  C11 C12 , C= C21 C22 , where Cij ∈ Rn/2×n/2 . Using the result contained in Exercise 3.17, prove that to compute C the number of multiplications required is O(nlog 7 ).

Matrices

221

Solution: Denote by τ (n) the number of operations required to multiply two n × n matrices, where n = 2k . We have τ (n) =  2   7τ n2 + 18 n2 ) for n  2. This implies τ (n) = O(7log n ) = O(nlog 7 ). Let A ∈ Cn×n be a square matrix. Its directed graph is the graph GA having {1, . . . , n} as its set of vertices. An edge (i, j) exists in GA if and only if aij = 0. (122) Draw the graph of the matrix ⎛

0 ⎜2 ⎜ A=⎜ ⎝0 0

1 1 2 0

2 1 0 0

⎞ 0 2⎟ ⎟ ⎟. 1⎠ 2

(123) Let G = (V, E) be a graph having V as a set of vertices and E as a set of edges. Prove that the relation γG on V that consists of all pairs of vertices (x, y) such that there is a path that joins x to y is an equivalence on V. Let G = (V, E) be a directed graph and let V1 , . . . , Vk be the set of equivalence classes relative to the equivalence γG . The condensed graph of G is the digraph C(G) = ({V1 , . . . , VE }, K), having strong components as its vertices such that (Vi , Vj ) ∈ K if and only if there exist vi ∈ Vi and vj ∈ Vj such that (vi , vj ) ∈ E. (124) Let G = (V, E) be a directed graph. Prove that the condensed graph C(G) is acyclic. Solution: Suppose that (Vi1 , . . . , Vi , Vi1 ) is a cycle in the graph C(G) that consists of  distinct vertices. By the definition of C(G) we have a sequence of vertices (vi1 , . . . , vi , vi1 ) in G such that the (vip , vip+1 ) ∈ E for 1  p   − 1 and (vi , vi1 ) ∈ E. Therefore, (vi1 , . . . , vi , vi1 ) is a cycle, and for any two vertices u, v of this cycle, there is a path from u to v and a path from v to u. In other words, for any pair of vertices (u, v) of this cycle we have (u, v) ∈ γG , so Vi1 = · · · = Vi , which contradicts our initial assumption.

222

Linear Algebra Tools for Data Mining (Second Edition)

(125) Let φ ∈ PERMn and let Pφ be its permutation matrix. Prove that if A ∈ Cn×n and the vertices of the graph GA are renumbered by replacing each number j by φ(j), the resulting graph corresponds to the matrix B = Pφ APφ . A matrix A ∈ Cn×n isreducible if there exists a permutation matrix U V Pφ , where U ∈ Cp×p , V ∈ Cp×q , and Pφ such that A = Pφ Op,q W W ∈ Cq×q . Otherwise, A is irreducible. (126) Prove that a matrix A ∈ Cn×n is irreducible if and only if there exists no partition {I, J} of the set {1, . . . , n} such that aij = 0 when i ∈ I and j ∈ J. Let A ∈ Cn×n . The degree of reducibility of A is k, where 0  k  n−1 if there exists a partition {I1 , . . . , Ik+1 } ∈ PARTn such that any I submatrix A i is irreducible for 1  i  k+1 and apq = 0 whenever Ii p ∈ Ii , q ∈ Ij , and i = j. The degree of reducibility of A is denoted by red(A). (127) Let A ∈ Rn×n be a matrix. Prove that GA is a strongly connected digraph if and only if A is irreducible. Solution: Let A ∈ Rn×n be a reducible matrix. There exists a partition {I, J} of the set {1, . . . , n} such that aij = 0 when i ∈ I and j ∈ J. Therefore, there is no edge from a vertex in J to a vertex in I, which implies that there exists a vertex i ∈ I and a vertex j ∈ J such that there is no path leading from j to i. Thus, GA is not strongly connected. This shows that if GA is strongly connected, then A is irreducible. Conversely, suppose that GA is not strongly connected and let V1 , . . . , Vk be the strong connected components of GA , where k > 1. Since the condensed digraph C(GA ) is acyclic (by Supplement 3.17), we may assume without loss of generality that its vertices V1 , . . . , Vk are numbered in topological order. In other words, the existence of an edge (Vi , Vj ) in C(G) implies i < j.

Matrices

223

Assume initially that the vertices of the strong component Vi are vp , vp+1 , . . . , vp+|Vi |−1 , where p = 1 + i−1 j=1 |Vj | for 1  i  j  k. Under this assumption we have apq = 0 if vp ∈ Vi , vq ∈ V , and i > . In other words, the matrix A has the form ⎞ ⎛ A11 A12 · · · A1k ⎟ ⎜ O A 22 · · · A2k ⎟ ⎜ ⎟ A=⎜ .. . ⎟, ⎜ .. ⎝ . . · · · .. ⎠ O O · · · Akk where Aii is the incidence matrix of the subgraph induced by Vi . Thus, A is not irreducible. If the vertices of GA are not numbered according to the previous assumptions, let φ be a permutation that rearranges the vertices in the needed order. Then Pφ APφ has the necessary form and, again, A is not irreducible. (128) Prove that a matrix A ∈ Cn×n is irreducible if and only if its transpose is irreducible. (129) Let A ∈ Rn×n be a non-negative matrix. For m  1, we have (Am )ij > 0 if and only if there exists a path of length m in the graph GA from i to j. Solution: The argument is by induction on m ≥ 1. The base case, m = 1, is immediate. Suppose that the statement holds for numbers less than m. Then (Am )ij = nk=1 (Am−1 )ik Akj . (Am )ij > 0 if and only if there is a positive term (Am−1 )ik Akj in the right-hand sum because all terms are non-negative. By the inductive hypothesis, this is the case if and only if there exists a path of length m − 1 joining i to k and an edge joining k to j, that is, a path of length m joining i to j. (130) Let A ∈ Rn×n be an irreducible matrix such that A  On,n . If i ki > 0 for 1  i  n − 1, then n−1 i=0 ki A > On,n . Solution: Since A is an irreducible matrix, the graph GA is strongly connected. Thus, there exists a path of length no larger than n − 1 that joins any two distinct vertices i and

Linear Algebra Tools for Data Mining (Second Edition)

224

j of the graph GA . This implies that for some m  n − 1 we have (Am )ij > 0. Since n−1  n−1



i ki A = ki (Am )ij , (3.21) i=0

ij

i=0

and all numbers that occur in this equality are  non-negative,  n−1 i > 0. If i = j, it follows that for i = j we have i=0 ki A ij

the same inequality follows from the fact that k0 In > On,n . (131) Let A ∈ Rn×n be an irreducible matrix such that A  On,n . Prove that (In + A)n−1 > On,n .   Solution: If we choose ki = ni for 0  i  n − 1 in Inequality (3.21), the desired inequality follows. (132) Let A ∈ Rp× , B ∈ Rq× , C ∈ Rr× be three matrices having the same number of columns. Prove that: (a) the Khatri–Rao product is associative, that is, (A ∗ B) ∗ C = A ∗ (B ∗ C); (b) (A ∗ B) (A ∗ B) = (A A) ∗ (B  B); (c) (A∗)† = ((A A) ∗ (B  B))† (A ∗ B) . (133) Let A ∈ Rm×n and B ∈ Rp×q be two matrices. Prove that if x ∈ Rnq , y ∈ Rmp , then we have y = (A ⊗ B)x if and only if Y = BXA , where X = (x1 x2 · · · xn ) ∈ Rq×n and Y = (y 1 y 2 · · · y m ) ∈ Rp×m . Solution: We begin by partitioning x and y as ⎛ ⎞ ⎛ ⎞ y1 x1 ⎜ . ⎟ ⎜ . ⎟ ⎟ ⎜ ⎟ x=⎜ ⎝ .. ⎠ and y = ⎝ .. ⎠ , xn ym

Matrices

225

where x1 , . . . , xn ∈ Rq and y 1 , . . . , y m ∈ Rp . Then we have ⎞⎛ ⎞ ⎛ x1 a11 B a12 B · · · a1n B ⎜ a B a B · · · a B ⎟ ⎜x ⎟ 22 2n ⎟ ⎜ 2 ⎟ ⎜ 21 ⎟⎜ ⎟ y = (A ⊗ B)x = ⎜ .. .. ⎟ ⎜ .. ⎟ , ⎜ .. .. . ⎝ . . . ⎠⎝ . ⎠ am1 B am2 B · · · amn B

xn

which implies y i = ai1 Bx1 + · · · a1n Bxn ⎛ ⎞ ai1 ⎜ . ⎟ ⎟ = (Bx1 · · · Bxn ) ⎜ ⎝ .. ⎠ ain ⎛

⎞ ai1 ⎜ . ⎟ ⎟ = B(x1 · · · xn ) ⎜ ⎝ .. ⎠ ain = BXai , where ai = (ai1 · · · ain ). By the definition of Y , we have Y = (BXa1 · · · BXam ) = BX(a1 · · · bf am ) = BXA . (134) Let u ∈ CI , v ∈ CJ , and w ∈ CK . Prove that x = u ⊗ v ⊗ w if and only if ui vj wk = xk+(j−1)K+(i−1)JK for 1  i  I, 1  j  J, and 1  k  K. Bibliographical Comments Wedderburn’s Theorem (Theorem 3.39) was obtained in [169]. In Supplement 23 of Chapter 5, we discuss a converse of this statement, a result of Egerv´ ary [42]. Theorem 3.35 (the Full-Rank factorization theorem) was formulated in [112]. Supplement 3.17 is an extension of Wedderburn’s theorem obtained in [29]. Supplement 3.17 is a result of Ky Fan [49]. A comprehensive reference of the Moore–Penrose pseudoinverse of a matrix is the monograph [11].

226

Linear Algebra Tools for Data Mining (Second Edition)

Fundamental references to linear algebra include the two-volume book by Gantmacher [168]. More advanced topics are treated in [18] and [78, 79]. Concentrated and useful sources are [111] and [177]. Properties of Kronecker products are studied in [109, 118, 133, 134]. The computation of x y described in Exercise 3.17 was discovered in [172] and initiated a research stream for more economical ways for multiplying matrices. The reduction of the number of necessary multiplications of two 2×2 matrices from eight to seven shown in Exercise 3.17 was obtained by Strassen in [161].

Chapter 4

MATLAB Environment

4.1 Introduction

MATLAB, which stands for "matrix laboratory" [116], is a formidable tool for anybody interested in linear algebra and its applications.

4.2 The Interactive Environment of MATLAB

MATLAB

is an interactive system. Commands can be entered at the prompt >> in the command window shown in Figure 4.1. It is important to remember that • MATLAB is case sensitive; • variables are not typed; • typing the name of a variable causes the value of the variable to be printed; • ending a command with a semicolon will suppress the screen display of the results of the command. In Figure 4.1 we show the command window of MATLAB . Commas and semicolons separate statements that are placed on the same line as in‘

x = 1 ; y =2; z =3;

As we mentioned above, semicolons suppress the output. To continue a line on the next line, one ends the line with three periods (...). 227

228

Linear Algebra Tools for Data Mining (Second Edition)

Fig. 4.1

The command window of MATLAB .

MATLAB is equipped with extensive help and documentation facilities. To display a help topic, it suffices to type in the command help followed by the name of the topic or to access the MATLAB documentation using the command doc.

4.3

Number Representation and Arithmetic Computations

In MATLAB , we deal with two representations of integers: the unsigned integers and the signed integers. The classes of unsigned integers are denoted by unitk, where k may assume the values 8, 16, 32, and 64 corresponding to representations over one, two, four, and eight bytes, respectively. Example 4.1. To create a one-byte unsigned integer having the value 77, we write >> x = uint8(77) x = 77

To obtain information on x, we type whos(’x’). MATLAB returns the main characteristics of x

matlab Environment

>> whos(’x’) Name Size x 1x1

Bytes 1

Class uint8

229

Attributes

Since x is a scalar, MATLAB regards x as a 1 × 1-matrix. The function size returns the sizes of each dimension of an array. Example 4.2. For the matrix A defined by A = [1,2,3;0,5,1]

we obtain d = 2

3

The function unique applied to an array A produces an array that contains the same data as A with no repetitions and in sorted order. Example 4.3. The application of unique to the array has the following effect: A = [1 2 1 1 0]; >> unique(A) ans = 0 1 2

The call [m,n] = size(A)

returns the size of the matrix A in separate variables m and n. More generally, [d1,d2,...,dn]=size(A)

returns the sizes of the first n dimensions of the array A in separate variables. If we need to return the size of the dimension of A specified by the scalar dim into the variable m, we write m = size(A,dim)

The function isa determines if its first argument belongs to the type specified by its second argument.

230

Linear Algebra Tools for Data Mining (Second Edition)

Example 4.4. The type of x can be verified by typing >> isa(x,’integer’) ans = 1 >> isa(x,’uint8’) ans = 1 >> isa(x,’unit32’) ans = 0

The limits of any of the above integer types can be determined by using the functions intmin and intmax, as follows: Example 4.5. >> intmin(’uint32’) ans = 0 >> intmax(’uint32’) ans = 4294967295

For any of these unsigned types, a number larger than intmax is mapped to intmax and a number lower than intmin is mapped to intmin: >> uint8(1000) ans = 255 >> uint8(-1000) ans = 0

Signed integers are represented in MATLAB using the types of the form intk, where k ranges among 8,16,32, and 64. The leftmost digit is reserved for the sign (1 for a negative integer and 0 for a positive integer). Thus, we have >> intmax(’int8’) ans = 127 >> intmin(’int8’) ans = -128

matlab Environment sign R



exponent q digits

- ···

s

Fig. 4.2

231

mantissa p digits

-

··· p−1

2 1 0

Representation of floating point numbers.

MATLAB

makes use of the floating point representation of real numbers. This representation uses a number of s+1 binary digits and comprises three parts: the sign bit, the sequence of exponent bits, and the sequence of mantissa (or significant) bits, as shown in Figure 4.2. The bits are numbered from right to left starting with 0. We assume that the mantissa uses p bits and the exponent uses q bits, where p + q = s. Depending on the desired precision, MATLAB supports a single-precision format or a double-precision format and the latter format (as specified by the IEEE Standard 754) is the default representation. For double-precision numbers, we have s = 63. The 64th bit is the bit sign b63 defined as  0 if x ≥ 0, b63 = 1 if x < 0, q = 11 bits are reserved for the exponent, and p = 52 bits are used by the significant. The number represented by the sequence of bits (b63 , b62 , . . . , b0 ) is   52  b63 −i b52−i 2 · 2y−1034 , x = (−1) · 1 + i=

where y is the equivalent of (b1 · · · b11 )2 . For single-precision numbers, we have s = 32, the exponent uses q = 8 bits (biased by 127), and the significant uses the remaining p = 23 bits. Accordingly, single-precision numbers are given by   22  b22−i 2−i · 2y−127 , x = (−1)b31 · 1 + i=

where y is the equivalent of (b1 · · · b8 )2 . Single-precision numbers require less memory than double-precision numbers, but they are represented to less precision.

232

Linear Algebra Tools for Data Mining (Second Edition)

Double-precision numbers are created with assignments such as x = 19.43, because the default format in MATLAB is double precision. To represent the same number in single precision, we need to write y = single(19.43). It is important to realize that the set of real numbers that are representable in any of these formats is finite. For example, in the case of the double-precision format, we have at our disposal 64 bits, which means that only 264 real numbers have an exact representation as double-precision numbers. For single-precision numbers, the set of real numbers that have exact representations consists of 232 numbers. The remaining real numbers can be represented only with a degree of approximation, which has considerable consequences for numerical computing. Since the set of reals that can be represented exactly on any floating-point system is finite, it follows that there exists a small gap between each double-precision number and the next larger doubleprecision number. The gap, which limits the precision of computations, can be determined using the eps function. If x is a doubleprecision number, there are no other double-precision numbers in the interval (x, x + eps(x)). Example 4.6. The distance between 7 and the next double-precision number can be determined as follows: >> eps(7) ans = 8.8818e-016

As x increases so does eps(x). We have >> eps(70) ans = 1.4211e-014 >> eps(700) ans = 1.1369e-013 >> eps(7000) ans = 9.0949e-013

matlab Environment

233

Entering eps without arguments is equivalent to eps(1): >> eps ans = 2.2204e-16

Similar considerations hold for single-precision numbers. Here the gaps between numbers are wider because there are fewer exactly representable numbers. Example 4.7. If we define y by y = single(7), then eps(y) returns 4.7684e-007, a value larger than eps(7) computed above. Example 4.8. Let x = 123/1256. This number is not a sum of powers of 2, so it cannot be represented exactly as a double-precision number. Consider the following MATLAB dialog: >> x =123/1256 x = 0.0979 >> y = 123 -1256 * x y = -1.4211e-014

The approximative representation of x means that y is computed, in turn, approximatively, which explains the fact that MATLAB does not return 0 for y. Rounding of decimal numbers can cause unexpected results. This phenomenon is known as roundoff error and it can be seen in the next example. Example 4.9. Define x as x = 0.1. Using the equality test == we have the following results: >> x + x == 0.2 ans = 1 >> x + x + x == 0.3 ans = 0 >> x + x + x + x == 0.4 ans = 1

234

Linear Algebra Tools for Data Mining (Second Edition)

>> x + x + x + x + x + x + x + x == 0.8 ans = 0

Another risk in performing floating point computations is the inadvertent subtraction of two large and close numbers. This may result in a catastrophic cancellation as we show next. Example 4.10. Consider the following MATLAB computation. >> a = 5 a = 5 >> b = 5e24 b = 5.0000e+024 >> c = a + b - b c = 0

returns 0 c rather than 5 since the numbers a + b and b have the same floating-point representation.

MATLAB

The range of double-precision numbers is determined using the functions realmin and realmax: >> rangeDouble = ’Double-precision numbers range between %g and %g’; >> sprintf(rangeDouble,realmin,realmax) ans = Double-precision numbers range between 2.22507e-308 and 1.79769e+308

When these functions are called with the argument ‘single’, the corresponding values for the single-precision type are returned: >> rangeSingle = ‘Single-precision numbers range between %g and %g’; >> sprintf(rangeSingle,realmin(‘single’), realmax(‘single’)) ans = Single-precision numbers range between 1.17549e-038 and 3.40282e+038 MATLAB operates with an extended set of reals; the values Inf and -Inf represent real numbers outside the representation ranges, as shown next.

matlab Environment

235

realmax(‘single’) + .0001e+038 ans = Inf -realmax(‘single’) - .0001e+038 ans = -Inf

Values that are not real or complex numbers are represented by the symbol NaN, an acronym for “Not a Number.” Expressions like 0/0 and Inf/Inf yield NaN, as do any arithmetic operations involving a NaN: x = 0/0 x = NaN

The imaginary unit of complex numbers is represented in by either of two letters: i or j. A complex number can be created in MATLAB by writing

MATLAB

z = 3 + 4i

or, equivalently, by using the complex function: >> z = complex(3,4) z = 3.0000 + 4.0000i

The real and imaginary parts of a complex number can be obtained using the real and imag functions: >> x = real(z) x = 3 >> y = imag(z) y = 4

The functions complex, real, and imag can be applied to arrays, where they act componentwise: >> x= [1 2 3]; >> y = [4 5,6]; >> z=complex(x,y) z = 1.0000 + 4.0000i >> zreal = real(z)

2.0000 + 5.0000i

3.0000 + 6.0000i

Linear Algebra Tools for Data Mining (Second Edition)

236

zreal = 1 2 3 >> zimag = imag(z) zimag = 4 5 6

Conversions from types are possible using built-in MATLAB functions named after the target type. For example, to convert other numeric data to double precision, we can use the MATLAB function double. Example 4.11. A signed integer created by y = int64 (-123456789122) is converted to double-precision floating point by x = double(y). Arithmetic operators involve the usual arithmetic operators. In these operators are overloaded, which means that they can be applied both to numbers and to matrices whose formats accommodate the requirements of these operations. These operators include:

MATLAB

Operator + * ^ \ / ’ .’ .* .\^ .\ ./

Significance Addition or unary plus Subtraction or unary minus Numeric or matrix multiplication Numeric or matrix power Backslash or left matrix division Slash or right matrix division Transpose (for real matrices) or Hermitian conjugate (for complex matrices) Nonconjugated transpose Array multiplication (element-wise) Array power (element-wise) Left array divide (element-wise) Right array divide (element-wise)

The slash or right matrix division B/A is equivalent to BA−1 , while the backslash or left matrix division A\B is equivalent to A−1 B. 4.4

Matrices and Multidimensional Arrays

MATLAB

accommodates both real and complex matrices. Matrices can be entered row-wise from the MATLAB console making sure that

matlab Environment

237

(i) elements of a row are separated by spaces or commas; (ii) rows are separated by a semi-colon; (iii) the matrix is enclosed between square brackets. For example, the matrix A ∈ C2×3   1+i 1−i i A= 2 i 3 − 2i is entered as >>A = [1+i MATLAB

A

1-i

i; 2 i

3-2i]

prints the content of the matrix as

= 1. + i 2.

1. - i i

i 3. - 2.i

To inspect the content of the matrix, we type its name at the prompt >>A

and, again, MATLAB prints the matrix A, as above. Lines can be continued by placing ... at the end of the line. For example, we can write >>B = [1 >> 2 >> 3

-11 22 44

22;... 33;... 55]

to define the matrix



⎞ 1 −11 22 B = ⎝2 22 33⎠. 3 44 55

Scalars can be entered in a rather straightforward manner. To enter x = 1 + 3i, one writes >>x = 1+3i

The complex unit may be denoted either by i or by j. MATLAB has a special notation for designating contiguous submatrices known as the colon notation. To designate the ith row of a matrix A, we can use the notation A(i,:). Similarly, the jth column is designated by A(:,j).

238

Linear Algebra Tools for Data Mining (Second Edition)

The submatrix of A that consists of all the rows between the ith row and the kth row is specified by A(i:k,:). Similarly, the submatrix that consists of the columns of A between the jth column and the hth column is A(:,j:h). Example 4.12. Let A ∈ R3×4 be the matrix ⎛ ⎞ 1 2 3 4 A = ⎝5 6 7 8 ⎠. 9 10 11 12 Then A(:,2:3) returns ans = 2 6 10

3 7 11

The colon notation is also used for specifying sequences. A vector that consists of the members of the arithmetic progression i, i + r, i + 2r, . . . , i + pr, where i + pr > 1:2:8 ans = 1 3 >> 1:2:9 ans = 1 3 >> 5:-0.6:1 ans = 5.0000 2.0000 >> 1:5 ans = 1 2

5

7

5

7

4.4000 1.4000

3

9

3.8000

4

3.2000

2.6000

5

Special matrix can be generated using built-in functions as shown in what follows. For example, the function eye(m,n) generates a

matlab Environment

239

matrix that contains a unit submatrix Ip , where p = min{m, n}, as shown next. >>I = eye(3,3) I = 1. 0. 0. 1. 0. 0.

0. 0. 1.

>>I = eye(4,3) I = 1. 0. 0. 1. 0. 0. 0. 0.

0. 0. 1. 0.

>>I = eye(3,4) I = 1. 0. 0. 1. 0. 0.

0. 0. 1.

0. 0. 0.

Random matrices having elements uniformly distributed in the interval (0, 1) can be generated using the function rand. For example, >>A = rand(p,q)

creates a random matrix A ∈ Rp×q . For example, ->A = rand(2,3)

produces the result A

= 0.3616361 0.2922267

0.5664249 0.4826472

0.3321719 0.5935095

Random number producing functions can be used to produce multidimensional arrays whose entries belong to certain intervals. Example 4.14. To create a 3 × 4 × 2 3-dimensional array whose entries belong to the (0, 1) interval, we can write -> A = rand(3,4,2)

resulting in the array

Linear Algebra Tools for Data Mining (Second Edition)

240

A(:,:,1) = 0.8147 0.9058 0.1270

0.9134 0.6324 0.0975

0.2785 0.5469 0.9575

0.9649 0.1576 0.9706

0.1419 0.4218 0.9157

0.7922 0.9595 0.6557

0.0357 0.8491 0.9340

A(:,:,2) = 0.9572 0.4854 0.8003

The number of elements of an array can be determined using the function numel. which returns the number of elements of an array A. Its effect is equivalent to that of \prod(size(A)). Experimentation in data mining is greatly helped by the capability of MATLAB for generating pseudorandom numbers having a variety of distributions. The function rand produces uniformly distributed pseudorandom numbers in the interval (0, 1). Depending on the number of arguments, this function can return an n × n matrix of such numbers (for rand(n)), or an m × n matrix (when using the call rand(m,n)). Calls of the form rand(n,’double’) or rand(n,’single’) return matrices of numbers that belong to the specified types. Example 4.15. To generate a 3 × 3 matrix of pseudorandom numbers in (0, 1), we write A = rand(3). This may yield A = 0.4505 0.0838 0.2290

0.9133 0.1524 0.8258

0.5383 0.9961 0.0782

The function randi generates a matrix in Rm×n whose entries are pseudorandom integers from a uniform discrete distribution on an interval [h, k] if called as randi([h k],[m,n]). Example 4.16. The call A = randi([5 10],[3 3]) returns a 3× 3 matrix with integer components in the interval [5, 10]:

matlab Environment

241

A = 7 5 10

5 9 9

10 5 7

If interval [h,k] is replaced with a single argument l, then the randi returns a matrix with pseudorandom components in the interval [1,l]. For instance, the call B = randi(10,2,4) returns a 2 × 4 matrix of numbers in the interval [1, 10]. The function permute rearranges the dimensions of an array in the order specified by its second argument that is a vector. Example 4.17. Let A be a three-dimensional array whose entries belong to the interval [0, 15] produced by the statement A = randi([0 15],[3 4 2]): A(:,:,1) = 10 12 11

6 10 2

11 0 4

0 1 13

12 12 2

7 7 10

A(:,:,2) = 11 5 15

0 7 6

To permute the first and third dimensions of the three-dimensional array A defined above, we write B=permute(A, [3 2 1])

resulting in the array B having the dimension vector [2 4 3]: B(:,:,1) = 10 11

6 0

11 12

0 7

10 7

0 12

1 7

B(:,:,2) = 12 5

Linear Algebra Tools for Data Mining (Second Edition)

242

B(:,:,3) = 11 15

2 6

4 2

13 10

The repositioning of the elements of array A is shown in Figure 4.3. To generate pseudorandom numbers following a normal distribution, one can use the function randn. The call randn(m) returns an m × m-matrix following the normal distribution N (0, 1); using a syntax similar to rand and randi, it is possible to generate matrices of other formats. An additional parameter for the functions of this family located last on their list of parameters allow the generation of values that belong to a more restricted type. For example, a call like randi(10,100,1,’uint32’) returns an array of 100 4-byte integers, while randn(10,’double’) returns an array of double numbers. The sequence of pseudorandom numbers produced by any of the generating functions is determined by the internal state of an internal uniform pseudorandom number generator. Resetting this generator to the same state allows computations to be repeated. The function normrand can be used to produce randomly distributed arrays as normally distributed. If M and S are arrays, then normrand returns an array of random numbers chosen from a normal distribution with mean M and standard deviation S having the same format as M and S. If either M or S are scalars, the result is an array having the format of the other parameter. The function normrnd(M,S,[p,q]) (that we use next) returns a p × q array. B(:, :, 3) 2 4

11

13

15 6 2 10 B(:, :, 2) 12 10 0 1 A(:, :, 2)

5 7 12 B(:, :, 1) 10 6 11 0 11

Fig. 4.3

0

12

A(:, :, 1)

7

7

Rearranging of the elements of an array.

matlab Environment

243

Example 4.18. We begin by generating 15 random points in R2 using U = normrnd(0,1,[15 2])

Starting from U we produce another set of 15 points V applying a rotation by 30 degrees, a scaling by 0.5, and a translation by 2 and we add some noise. To this end, we write >> S = [sqrt(3)/2 -1/2; 1/2 sqrt(3)/2]; >> V = normrnd(0.5*U*S+2,0.05,[15 2])

Submatrices can be extracted by indicating ranges of indices or by using the placeholder :, which stands for all rows or columns, respectively. To extract the second column of the matrix A, we write >>A(:,2) ans = 0.5664249 0.4826472

The following is a list of several built-in functions that create or process matrices. Function ones(m, n) zeros(m, n) eye(m, n) toeplitz(u) diag(u) diag(A) triu(A) tril(A) linspace(a, b, n) kron(A, B) rank(A) A A∗B A+B

Description Creates a matrix in Rm×n containing 1s Creates a matrix in Rm×n containing 0s Creates a matrix in Rm×n having 1s on its main diagonal Creates a Toeplitz matrix whose first row is u Creates a diagonal matrix having u on its main diagonal Yields the diagonal of matrix A Gives the upper part of A Gives the lower part of A Creates a vector in Rn whose components divide [a, b] in n − 1 equal subintervals Computes the Kronecker product of A and B Rank of a matrix A The Hermitian adjoint AH of A The product of matrices A and B The sum of matrices A and B

Operations introduced by a dot apply elementwise. As an example, consider the following operations:

Linear Algebra Tools for Data Mining (Second Edition)

244 A.2 A. ∗ B A./B

A matrix that contains the squares of the components of A The Hadamard product of A and B A matrix that contains the ratios of corresponding components of A and B

If A is a real matrix, then A is the transpose of A; for complex matrices, A denotes the Hermitian adjoint AH . Occasionally, we need to make sure that numerical representation of a symmetric matrix A ∈ Rn×n does not affect its symmetry. This can be achieved by the statement A = (A + A’)/2. The function diag mentioned above has several variants. For example, diag(V,k), where V is an n-dimensional vector and k is an integer returns a square matrix of size n + |k| having the components of V on its kth diagonal. The value k = 0 corresponds to the main diagonal, positive values correspond to diagonals above the main diagonal, and negative values of k refer to diagonals below the main diagonal. The effect of diag(V) is identical to the effect of diag(V,0). If X is a matrix, diag(X,k) is a vector whose components are the elements of the k th diagonal of X. Example 4.19. The expression diag(ones(4,1)) + diag(ones(3,1),1) + diag(ones(3,1),-1)

generates the tridiagonal matrix

ans =
     1     1     0     0
     1     1     1     0
     0     1     1     1
     0     0     1     1

The function blkdiag produces a block-diagonal matrix from its matrix input arguments.

Example 4.20. If we start with the matrices defined by

A = [1 2; 3 4];
B = [5 6 7; 8 9 10; 11 12 13];
C = [14 15; 16 17];

the call blkdiag(A,B,C) will produce the matrix

ans =
     1     2     0     0     0     0     0
     3     4     0     0     0     0     0
     0     0     5     6     7     0     0
     0     0     8     9    10     0     0
     0     0    11    12    13     0     0
     0     0     0     0     0    14    15
     0     0     0     0     0    16    17

Block matrices can be formed using vertical or horizontal concatenation of matrices. To concatenate vertically two matrices A and B, the number of columns of A must equal the number of columns of B and the operation is realized as C = [A ; B]. Horizontal concatenation requires equality of the numbers of rows and can be obtained as E = [A D].

Example 4.21. Let A, B, and D be the matrices

>> A = [1 2 3; 4 5 6]
A =
     1     2     3
     4     5     6
>> B = [7 8 9; 10 11 12; 13 14 15]
B =
     7     8     9
    10    11    12
    13    14    15
>> D = [17 18; 19 20]
D =
    17    18
    19    20

The vertical concatenation of A and B and the horizontal concatenation of A and D are shown as follows:

>> C = [A ; B]
C =
     1     2     3
     4     5     6
     7     8     9
    10    11    12
    13    14    15
>> E = [A D]
E =
     1     2     3    17    18
     4     5     6    19    20

The function repmat replicates a matrix A (referred to as a tile) and creates a new matrix B, which consists of an m × n tiling of copies of A when we use the call B = repmat(A,m,n). The format of B is (pm) × (qn) when the format of A is p × q. The statement repmat(A,n) creates an n × n tiling.

Example 4.22. Let A be the matrix

A =
     1     2     3
     4     5     6

To create a 3 × 2 tiling using A as a tile, we write B = repmat(A,3,2), which results in

B =
     1     2     3     1     2     3
     4     5     6     4     5     6
     1     2     3     1     2     3
     4     5     6     4     5     6
     1     2     3     1     2     3
     4     5     6     4     5     6

Real matrices can be sorted in ascending or descending order using the function sort. If v is a vector, sort(v) sorts the elements of v in ascending order. If X is a matrix, sort(X) sorts each column of X in ascending order. This function is polymorphic: if X is an array of strings, then sort(X) sorts the strings in ASCII dictionary order. Another variant of sort, sort(X,d,m), sorts X on the dimension d, in either ascending or descending order, as specified by

the third parameter m, which can assume the values 'ascend' or 'descend'. If this function is called with two output parameters as in [Y,I] = sort(X,d,m), then the function returns additionally an index matrix.

Example 4.23. Starting from the unidimensional array

>> X = [9 2 8 5 11 3 7]
X =
     9     2     8     5    11     3     7

the function call [Y,I] = sort(X) returns the matrices

Y =
     2     3     5     7     8     9    11
I =
     2     6     4     7     3     1     5

Clearly, Y contains the sorted elements of X, while I gives the position of each of the elements of Y in the original matrix X. If X is a complex matrix, the elements are sorted in the order of their absolute values, and elements that tie for the absolute value are sorted in the order of their angle.

A sparse matrix is a matrix whose elements are, to a large extent, equal to 0. Such a matrix can be represented by the position and value of its non-zero elements using the MATLAB function sparse.

Example 4.24. Starting from the matrix

A =
     1     0     0     2
     3     1     0     0

the function call B = sparse(A) returns B, the representation of A as a sparse matrix:

B =
   (1,1)        1
   (2,1)        3
   (2,2)        1
   (1,4)        2

The format of B is the same as the format of A.


All matrix operations can be applied to sparse matrices, or to mixtures of sparse and full matrices. Operations on sparse matrices return sparse matrices and operations on full matrices return full matrices. In most cases, operations on mixtures of sparse and full matrices return full matrices. The exceptions include situations where the result of a mixed operation is structurally sparse. For instance, the Hadamard product A .* S is at least as sparse as S.

Example 4.25. Consider the matrix

D =
     5     2     4
     0     1     2
     0     0     1
     1     0     3

Its sparse form is

E =
   (1,1)        5
   (4,1)        1
   (1,2)        2
   (2,2)        1
   (1,3)        4
   (2,3)        2
   (3,3)        1
   (4,3)        3

The sparse form of the matrix A*D can be computed as sparse(A*D) and is

   (1,1)        7
   (2,1)       15
   (1,2)        2
   (2,2)        7
   (1,3)       10
   (2,3)       14

The same result can be obtained by multiplying the sparse forms of the matrices A and D, that is, by writing B*E.

The function call S = sparse(i,j,s,m,n,nzmax) uses three vectors of equal length i, j, and s to generate an m × n sparse matrix S such that S(i(k),j(k)) = s(k), with space allocated for nzmax


non-zeros. Any elements of s that are zero are ignored, along with the corresponding values of i and j. Any elements of s that have duplicate values of i and j are added together. To convert a sparse matrix to a full representation, we can use the function full. Its use is illustrated in Example 4.27.

The function call [B,d] = spdiags(A) starts with a rectangular matrix A of format m × n and returns a matrix B whose columns contain the non-zero diagonals of A. The components of the vector d indicate the position of these diagonals.

Example 4.26. If A is the 3 × 4 matrix

A =
     1     2     3     4
     5     6     7     8
     9    10    11    12

that has 6 diagonals, the call [B,d] = spdiags(A) produces the results

B =
     0     0     1     2     3     4
     0     5     6     7     8     0
     9    10    11    12     0     0
d =
    -2
    -1
     0
     1
     2
     3

Note that the first column of B is the −2nd diagonal of A that consists of 9, the second column is the −1st diagonal of A, etc. Other useful formats of spdiags exist. We mention just one, A = spdiags(B,d,m,n) that creates an m × n sparse matrix A from the columns of B and places them along the diagonals specified by the vector d.


Example 4.27. Starting with the matrix B and the vector d given by

B =
     1     2     3     4
     5     6     7     8
     9    10    11    12
    13    14    15    16
d =
    -2
    -1
     0
     1

the call A = spdiags(B,d,5,4) produces the sparse matrix

A =
   (1,1)        3
   (2,1)        2
   (3,1)        1
   (1,2)        8
   (2,2)        7
   (3,2)        6
   (4,2)        5
   (2,3)       12
   (3,3)       11
   (4,3)       10
   (5,3)        9
   (3,4)       16
   (4,4)       15
   (5,4)       14

The normal representation of matrices can be obtained by applying the function full. Thus, C = full(A) produces

C =
     3     8     0     0
     2     7    12     0
     1     6    11    16
     0     5    10    15
     0     0     9    14

Note that if the length of the diagonals of C is insufficient to accommodate the full columns of B, these columns are truncated. Multidimensional arrays generalize matrices. Such arrays can be created starting with matrices and then extending these matrices.


Example 4.28. After creating the matrix A as

>> A = [1 0 2 1; -1 1 0 3; 1 2 3 4]
A =
     1     0     2     1
    -1     1     0     3
     1     2     3     4

a second matrix with the same format as the initial matrix can be added by writing:

>> A(:,:,2) = [5 1 6 7; 8 9 0 2; -1 3 1 0]

This results in a multidimensional array having the format 3 × 4 × 2 that is displayed as

A(:,:,1) =
     1     0     2     1
    -1     1     0     3
     1     2     3     4
A(:,:,2) =
     5     1     6     7
     8     9     0     2
    -1     3     1     0

Multidimensional arrays can be created using the concatenation function cat.

Example 4.29. To add a third page to the previously created array A, we write

A = cat(3,A,[6 7 0 8; -1 -2 -3 0; 4 5 6 3])

resulting in

A(:,:,1) =
     1     0     2     1
    -1     1     0     3
     1     2     3     4
A(:,:,2) =
     5     1     6     7
     8     9     0     2
    -1     3     1     0
A(:,:,3) =
     6     7     0     8
    -1    -2    -3     0
     4     5     6     3

A fourth page of this array that consists of repeated occurrences of 8 can be added by writing A(:,:,4) = 8

The functions prod and sum return the product of the elements and the sum of the elements of an array, respectively. The sum of the elements of a matrix A can be computed along any of its dimensions by using the function sum(A,d), where d is the dimension. For a matrix A, sum(A) returns a row vector containing the sums of the columns of A; the sum of all elements of A can be obtained as sum(sum(A)). For prod, similar conventions apply.

Example 4.30. For the matrix

X =
     5     8     1     4
     6     2     1     3

we can execute the following computations:

>> sum(X)
ans =
    11    10     2     7
>> sum(X,1)
ans =
    11    10     2     7
>> sum(X,2)
ans =
    18
    12

For the three-dimensional array created in Example 4.28, prod(A,[1,2]) returns a 1 × 1 × 3 array whose elements are the products of the elements of each page of A; similarly, sum(A,[1,2]) returns an array whose elements are the sums of the elements of each page of A:

>> sum(A,[1,2])
ans(:,:,1) =
    17
ans(:,:,2) =
    41
ans(:,:,3) =
    33

The repmat function can be used to produce multidimensional arrays. Its format in this case is repmat(A,r), where A is an array and r is a vector that specifies the repetition scheme.

Example 4.31. Let A be the matrix defined by

A = [1 2 3; 4 5 6]

Copies of the matrix A are repeated in a 2 × 3 × 2 multiarray by writing

B = repmat(A,[2 3 2])

This results in

B(:,:,1) =
     1     2     3     1     2     3     1     2     3
     4     5     6     4     5     6     4     5     6
     1     2     3     1     2     3     1     2     3
     4     5     6     4     5     6     4     5     6
B(:,:,2) =
     1     2     3     1     2     3     1     2     3
     4     5     6     4     5     6     4     5     6
     1     2     3     1     2     3     1     2     3
     4     5     6     4     5     6     4     5     6

If, instead, we have written

C = repmat(A,[2 3 1])

this would have resulted in

C =
     1     2     3     1     2     3     1     2     3
     4     5     6     4     5     6     4     5     6
     1     2     3     1     2     3     1     2     3
     4     5     6     4     5     6     4     5     6

Obviously, the last dimension (1) of C is useless and can be eliminated using the function squeeze which removes dimensions of length 1 of an array: D = squeeze(C)
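As a small illustrative sketch (the array E below is introduced only for this purpose), squeeze removes singleton dimensions:

E = ones(3,1,4);        % a 3 x 1 x 4 array
size(E)                 % returns 3 1 4
size(squeeze(E))        % returns 3 4: the singleton dimension was removed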

4.5   Cell Arrays

A cell array is an array whose components, called cells, can contain any type of data. A cell array can be created using the constructor {}.

Example 4.32. To create a 2 × 4 cell array containing numbers, strings, and arrays, we can write

cellAr = {1, 9, 4, 3; 'Febr', randi(2,3), {11; 12; 13}, randi(2,2)}

resulting in

cellAr =
  2 x 4 cell array
    {[     1]}    {[         9]}    {[       4]}    {[         3]}
    {'Febr'  }    {3 x 3 double}    {3 x 1 cell}    {2 x 2 double}


An empty cell array can be created using C = {}; to create an empty cell array of format 2 × 3 × 4, we can write D = cell(2,3,4).

Example 4.33. A cell array of text and data can be created by writing

A = {'one','two','three'; 1,2,3}

resulting in

A =
  2 x 3 cell array
    {'one'}    {'two'}    {'three'}
    {[  1]}    {[  2]}    {[    3]}

Components of a cell array can be accessed individually using curly braces, or as a set of cells using small parentheses.

Example 4.34. Here A refers to the cell array introduced in Example 4.33. To access the component (2,3) of A, we write A{2,3}. To access the leftmost four cells of the cell array A, we write A(1:2,1:2) and obtain

  2 x 2 cell array
    {'one'}    {'two'}
    {[  1]}    {[  2]}

4.6   Solving Linear Systems

The function inv computes the inverse of an invertible square matrix.

Example 4.35. Let A be the invertible matrix

A =
     1     2
     2     3

Its inverse is given by

>> inv(A)
ans =
   -3.0000    2.0000
    2.0000   -1.0000

On the other hand, if inv is applied to the singular matrix

A =
     1     2
     2     4

a warning message is posted:

Warning: Matrix is singular to working precision.
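A possible guard (an illustrative sketch, not part of the text) is to test the rank before inverting:

A = [1 2; 2 4];
if rank(A) < size(A,1)
    disp('A is singular; its inverse does not exist')
else
    B = inv(A);
end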

If A is nonsingular, the function inv can be used to solve the system Ax = b by writing x = inv(A)*b, although a better method is described as follows:

Example 4.36. For
$$A = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix} \quad\text{and}\quad b = \begin{pmatrix} 13 \\ 23 \end{pmatrix},$$
the solution of the system Ax = b is

x = inv(A)*b
x =
    7.0000
    3.0000

This is not the best way to solve a system of linear equations. In certain circumstances (which we discuss in Section 6.18), this method produces errors and has a poor time performance. A better approach for solving a linear system Ax = b is to use the backslash operator x = A \ b or x = mldivide(A,b). The term mldivide is related to the position of the matrix A at the left of x.

Example 4.37. Define A and b as

>> A = [5 11 2; 10 6 -4; -2 9 7]
A =
     5    11     2
    10     6    -4
    -2     9     7
>> b = [53; 26; 48]
b =
    53
    26
    48

Then either x = A\b or x = mldivide(A,b) produces

x =
    1.0000
    4.0000
    2.0000

The system xA = c, where A ∈ R^{n×m} and c ∈ R^m, can be solved using either x = c/A or x = mrdivide(c,A). It is easy to see that these operations are related by A\b = (b'/A')'.
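The identity can be checked numerically on the system of Example 4.37 (a small illustrative sketch):

A = [5 11 2; 10 6 -4; -2 9 7];
b = [53; 26; 48];
x1 = A\b;          % mldivide
x2 = (b'/A')';     % mrdivide applied to the transposed system
norm(x1 - x2)      % essentially zero, up to rounding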

4.7   Control Structures

Relational expressions allow comparisons between numbers using the following relational operators:

    Relational operator    Significance
    <                      Less than
    >                      Greater than
    <=                     Less or equal
    >=                     Greater or equal
    ==                     Equal to
    ~=                     Not equal

The result of a comparison is a logical value defined as one of the numbers 1 or 0, where 1 is equivalent to true and 0 is equivalent to false. If two scalars are compared, the result is a logical value; if two arrays having the same format are compared, the result is an array of logical values (having the same format as the numerical arrays), which contains the logical values that result from the componentwise comparisons of the numerical arrays.

Example 4.38. The result of a comparison between the arrays x and y is the following array ans:

>> x = [1 5 2 4 9 6]
x =
     1     5     2     4     9     6
>> y = [7 3 1 3 2 9]
y =
     7     3     1     3     2     9
>> x > y
ans =
     0     1     1     1     1     0

Example 4.39. A scalar a can be compared with an array B. The result is an array of logical values having the same format as B.

a = 5
>> B = [1 5 2; 4 9 6; 6 1 7; 7 3 1]
B =
     1     5     2
     4     9     6
     6     1     7
     7     3     1
>> a > B
ans =
     1     0     1
     1     0     0
     0     1     0
     0     1     1

A logical matrix can also be used to index another array. For instance, if M = logical(eye(3)), then

>> A = [1 2 3; 4 5 6; 7 8 9]
A =
     1     2     3
     4     5     6
     7     8     9
>> A(M)
ans =
     1
     5
     9

The find function applied to an array produces a list of the positions where the components of the array are non-zero. For a one-dimensional array X, find(X) returns a one-dimensional array I, as in the following example:

X =
     2     0    -1     0     0     3
>> I = find(X)
I =
     1     3     6

If X is a matrix, I contains the places of the non-zero components of X, where X is regarded as an array obtained by concatenating its columns vertically. For example, we have

X =
     1     0     3     0     2
     0     0     2     1     1
>> I = find(X)
I =
     1
     5
     6
     8
     9
    10

The call [I,J] = find(X) returns the row and column indices of the non-zero components of X, as follows:

I =
     1
     1
     2
     2
     1
     2
J =
     1
     3
     3
     4
     5
     5

Example 4.40. Let A and B be two matrices:

>> A = [1 2 5; 3 0 2; 6 1 7]
A =
     1     2     5
     3     0     2
     6     1     7
>> B = [5 9 1; 4 3 8; 5 6 7]
B =
     5     9     1
     4     3     8
     5     6     7

The answer to the comparison A < B is the logical matrix

     1     1     0
     1     1     1
     0     1     0

If a matrix is regarded as an array obtained by concatenating its columns vertically, then C = find(A < B) will return

C =
     1
     2
     4
     5
     6
     8

MATLAB has four basic control structures: if-then-else, for, while, and switch. Their semantics, discussed next, is somewhat different from that of similar control structures in common programming languages. The syntactic diagram of the if-then-else structure is shown in Figure 4.4.

Fig. 4.4    Syntactic diagram of the if-then-else structure.

Example 4.41. The following piece of MATLAB code

if(sum(sum(A < B)) == size(A,1)*size(A,2))
    disp('A is less than B')
elseif(sum(sum(A > B)) == size(A,1)*size(A,2))
    disp('A is greater than B')
elseif(sum(sum(A == B)) == size(A,1)*size(A,2))
    disp('A and B are equal')
else
    disp('A and B are incomparable')
end

applied to two matrices A and B having the same format will return

A and B are incomparable

when
$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \quad\text{and}\quad B = \begin{pmatrix} 2 & 1 \\ 6 & 6 \end{pmatrix}.$$

The function sum when applied to a matrix returns a vector that contains the sum of the columns of the matrix. Therefore, sum(sum(A)) returns the sum of elements of A. The calls size(A,1) and size(A,2) return the dimensions of A. Note that a condition like sum(sum(A < B))== size (A,1)*size(A,2) is satisfied if all entries of the matrix A < B are

equal to 1, that is, if every element of A is less than the corresponding element of B.

Fig. 4.5    Syntactic diagram of the for structure.
Fig. 4.6    Syntactic diagram of the while structure.

The syntax of the for structure is shown in Figure 4.5. The list of values can be entered as a vector or as a colon expression.

Example 4.42. Either

>> for i=[5 7 9 11]
     disp(sqrt(i))
   end

or

>> for i=5:2:12
     disp(sqrt(i))
   end

displays the square roots of the odd numbers between 5 and 12:

    2.2361
    2.6458
    3
    3.3166

The syntax diagram of the while structure is shown in Figure 4.6.


Example 4.43. The sum of the terms of the harmonic series that are not smaller than 0.01 is computed by the following code fragment:

s = 0;
n = 1;
while(1/n >= 0.01)
    s = s + 1/n;
    n = n + 1;
end;
fprintf('Sum is %g\n',s)

which returns

Sum is 5.18738

The syntactic diagram of the switch structure is shown in Figure 4.7. The statements following the first case where the switch expression matches the case expression are executed; the second list of statements is executed when the second case expression matches the switch expression. If no case expression is a match for the switch expression, then the statements that follow otherwise are executed, if this option exists. Unlike the similar C structure, only one case is executed. The switch expression can be a scalar or a string.

Example 4.44. The following code fragment will display cat:

>> animal = 'cougar';
>> switch lower(animal)
     case {'ostrich','emu','turkey'}
        disp('bird');
     case {'tiger','lion','cougar','leopard'}
        disp('cat');
     case {'alligator','crocodile','frog'}
        disp('reptile');
   end

Fig. 4.7    Syntactic diagram of the switch structure.

4.8   Indexing

Access to individual components of vectors and matrices can be done by indexing. For example, to access the component a_{ij} of matrix A, we write A(i,j). The index of a vector can be another vector.

Example 4.45. Let v = 1:3:22 and let u = [1 4 5]. Then, we have

>> v = 1:3:22
v =
     1     4     7    10    13    16    19    22
>> u = [1 4 5]
u =
     1     4     5
>> v(u)
ans =
     1    10    13

This technique can be applied to swapping the two halves of v by writing

>> v([5:8 1:4])
ans =
    13    16    19    22     1     4     7    10


The special operator end allows us to access the final position of a vector as follows:

Example 4.46. For the vector v defined in Example 4.45, the expression v(1:2:end) returns

ans =
     1     7    13    19

that is, the components of v located in odd-numbered positions.

A popular technique in MATLAB is logical indexing: if the indexing expression is of logical type, it extracts those elements of the array that make the expression true.

Example 4.47. To extract the even components of the vector v defined in Example 4.45, we can write v(mod(v,2)==0). This will result in

ans =
     4    10    16    22

Example 4.48. Suppose that we have a matrix that has entries that are not defined; such entries can be represented by the special value NaN (not a number) and can be recognized by the logical function isnan that returns true if its input is NaN. Starting from the matrix

>> X = [1 NaN 3 4; 8 -2 NaN NaN; 0 1 NaN 5]
X =
     1   NaN     3     4
     8    -2   NaN   NaN
     0     1   NaN     5

the statement X(isnan(X)) = 0 results in the substitution of all NaN values by 0:

X =
     1     0     3     4
     8    -2     0     0
     0     1     0     5

4.9   Functions

Functions can be created in files having the extension .m, which have the same name as the functions they define. In Example 4.49, we define the function kronsum; thus, the file is named kronsum.m. The first line of the file has the form

function [Y_1, ..., Y_q] = f(X_1, ..., X_p)

where X_1, ..., X_p are the input arguments, f is the name of the function, and Y_1, ..., Y_q are the values computed by the function.

Example 4.49. Kronecker's sum, A ⊕ B, is computed by the function kronsum as follows:

function [S] = kronsum(A,B)
%KRONSUM computes the Kronecker sum of matrices A and B
if (ndims(A) ~= 2 || ndims(B) ~= 2)
    return;
end
[rowsA,colsA] = size(A);
[rowsB,colsB] = size(B);
if (rowsA ~= colsA) || (rowsB ~= colsB)
    return;
end
S = kron(eye(colsB),A) + kron(B,eye(colsA));
end

We ensure that the arguments presented to kronsum are matrices by using the function ndims that returns the number of dimensions of the arguments. The formats of the arguments are computed using the function size and the computation proceeds only if the two arguments are square matrices. Finally, the last line of the function computes effectively the value of the result. Note the presence of comment lines introduced by %. The first comment line is returned when we use the command lookfor or request help. Example 4.50. The function datagen serves as a generator of datasets in R2 that contain a prescribed number of points that are grouped around given centers.
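A possible usage sketch (assuming kronsum.m is on the MATLAB path; the matrices are chosen only for illustration):

A = [0 1; 1 0];
B = eye(2);
S = kronsum(A,B);    % the 4 x 4 matrix kron(eye(2),A) + kron(B,eye(2))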

function [T] = datagen(spec)
%DATAGEN produces a set of points in R^2 starting from a k x 4
%   matrix called spec. Points are grouped in k clusters; the center
%   of the j-th cluster has coordinates [spec(j,1) spec(j,2)];
%   the j-th cluster contains spec(j,3) points and its diameter
%   is spec(j,4)

% determine the number of clusters
noc = size(spec,1);
% npgen gives the number of points currently generated
npgen = 0;
for j=1:noc
    centr = [spec(j,1) spec(j,2)];
    T(npgen+1:npgen+spec(j,3),:) = ...
        spec(j,4)*randn(spec(j,3),2) + ...
        repmat(centr,spec(j,3),1);
    npgen = npgen + spec(j,3);
end

Starting from the matrix spec

spec =
    1.0000    1.0000   50.0000    1.0000
    1.0000    8.0000   40.0000    1.3000
    5.0000    4.0000   50.0000    1.0000

and calling the function defined above as T = datagen(spec), we obtain the set of objects shown in Figure 4.8.

4.10   Matrix Computations

As a “matrix laboratory,” MATLAB offers built-in functions for a wide variety of matrix computations. We offer a few examples that involve commonly used functions. The function trace can be applied only to a square matrix A; trace(A) returns the trace of A, that is, the sum of its diagonal elements.
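For instance (a small illustrative sketch):

A = [1 2; 3 4];
t = trace(A);    % returns 5, the sum of the diagonal entries 1 and 4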

Fig. 4.8    Dataset obtained with the function datagen.

The function abs applied to a matrix A returns the matrix of absolute values of the elements of A.

Example 4.51. Let A be the matrix

A =
   1.0000 + 1.0000i   3.0000 + 4.0000i
   2.0000 - 5.0000i  -7.0000

We obtain

>> B = abs(A)
B =
    1.4142    5.0000
    5.3852    7.0000

The function rref produces the reduced row echelon form of a matrix A when called as R = rref(A). A variant of this function, [R,r] = rref(A), also yields a vector r such that r indicates the non-zero pivots, length(r) is the rank of A, and A(:,r) is a basis for the range of A. Roundoff errors may cause this algorithm to produce a rank for A that is different from the actual rank. A pivot tolerance tol used by the algorithm to determine negligible columns can be specified using rref(A,tol).

Example 4.52. Starting from the matrix

A =
     1     2     3     4     5     6
     7     8     9    10    11    12
     1     3     5     7     9    11

the function call [R,r] = rref(A) returns

R =
     1     0    -1    -2    -3    -4
     0     1     2     3     4     5
     0     0     0     0     0     0
r =
     1     2

showing that the rank of A is 2.

Example 4.53. The echelon form of a matrix A can be used to construct a basis of the null space of the matrix. To this end, consider the following function nullspbasis, where colpiv and colnonpiv designate the pivot and non-pivot columns of A, respectively.

function B = nullspbasis(A)
tol = sqrt(eps);
[R, colpiv] = rref(A,tol);
[m, n] = size(A);
r = length(colpiv);
colnonpiv = 1:n;
colnonpiv(colpiv) = [];
B = zeros(n, n-r);
B(colnonpiv,:) = eye(n-r);
B(colpiv,:) = -R(1:r, colnonpiv);

For the matrix A from Example 4.52, we obtain the basis B of null(A) given by

B =
     1     2     3     4
    -2    -3    -4    -5
     1     0     0     0
     0     1     0     0
     0     0     1     0
     0     0     0     1
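A quick check (an illustrative sketch, assuming nullspbasis.m is on the path): since every column of B lies in null(A), the product A*B must be numerically zero.

A = [1 2 3 4 5 6; 7 8 9 10 11 12; 1 3 5 7 9 11];
B = nullspbasis(A);
norm(A*B)      % essentially zero, up to rounding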

The LU decomposition of a matrix A is computed by the function call [L,U,P] = lu(A), which returns a lower triangular matrix L, an upper triangular matrix U, and a permutation matrix P such that PA = LU.

Example 4.54. Starting from the matrix

>> A = [1 0 1; 2 1 1; 1 -1 2]
A =
     1     0     1
     2     1     1
     1    -1     2

considered in Example 3.38 and applying the function lu, we obtain

>> [L,U,P] = lu(A)
L =
    1.0000         0         0
    0.5000    1.0000         0
    0.5000    0.3333    1.0000
U =
    2.0000    1.0000    1.0000
         0   -1.5000    1.5000
         0         0         0
P =
     0     1     0
     0     0     1
     1     0     0

When the function lu is called with two output arguments, the first output argument contains the matrix P'L, where P is the permutation matrix discussed before. Of course, the first output argument is not a lower triangular matrix, as shown next.

>> [L,U] = lu(A)
L =
    0.5000    0.3333    1.0000
    1.0000         0         0
    0.5000    1.0000         0
U =
    2.0000    1.0000    1.0000
         0   -1.5000    1.5000
         0         0         0

4.11   Matrices and Images in MATLAB

Binary matrices can be used to encode black and white images in the binary format using the Image Processing Toolbox. The function imshow(A) displays the binary image of a matrix A in a figure such that pixels with the value 0 (zero) are displayed as black and pixels with value 1, as white.

Example 4.55. Consider the matrix A ∈ {0, 1}^{8×8} defined as

A = [1 0 1 0 1 0 1 0; 0 1 0 1 0 1 0 1; 1 0 1 0 1 0 1 0; ...
     0 1 0 1 0 1 0 1; 1 0 1 0 1 0 1 0; 0 1 0 1 0 1 0 1; ...
     1 0 1 0 1 0 1 0; 0 1 0 1 0 1 0 1];

which represents the pattern of squares of a chess board. The squares themselves are specified using the matrix

>> B = ones(50,50);

and the board is generated as the Kronecker product of the matrices A and B: >> C=kron(A,B);

The matrix C has the format 400 × 400. Using the function imshow of the MATLAB image processing toolbox as in >> imshow(C,’Border’,’tight’)

the result can be visualized and saved as a pdf file shown in Figure 4.9.

Fig. 4.9    Chessboard generated in MATLAB.

Example 4.56. The pgm format (an acronym of “Portable Gray Map”) is the simplest grayscale graphic image representation. The content of the pgm file that encodes the digit 4 shown in Figure 4.10(a) is given in Figure 4.10(b). On its first row, P2 identifies the file type. The digit image has been discretized in a 10 × 10 matrix whose format is shown in the third line; the fourth line contains the maximum gray value, and the rest of the file contains the matrix A ∈ R^{10×10} (having non-negative integers as entries), which we subject to SVD analysis in Chapter 9 (Figure 4.10).

Exercises and Supplements

(1) Write a MATLAB function that, when applied to a matrix A, returns bases for both the null space of A and the range of A.
(2) Write a MATLAB function that starts with three real numbers a, b, c and an integer n and returns a tridiagonal n × n matrix having all elements on the main diagonal equal to a, all elements immediately below the diagonal equal to b, and all elements immediately above the diagonal equal to c.
(3) Write a MATLAB function that returns a true value if the number of non-zero components of a vector of integers is odd.


Fig. 4.10    Digit 4 representation (a) and the corresponding matrix (b).

To measure the time used by MATLAB operations, one can use a pair of MATLAB functions named tic and toc. The function tic (with no argument) starts a stopwatch; toc reads the stopwatch and displays the elapsed time in seconds since the most recent invocation of tic. A similar role can be played by the function cputime, which returns the CPU time in seconds used by MATLAB since the current session began. For example,

t = cputime;
...
cputime - t

returns the cpu time elapsed between these consecutive calls of cputime.
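For instance, a minimal timing sketch (the matrix size is arbitrary):

tic;
A = rand(1000);
B = A*A;
elapsed = toc;    % elapsed wall-clock time in seconds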


(4) Write a MATLAB function that starts from two integer parameters n and k with k ≤ n and returns a matrix S ∈ R^{k×m} whose columns contain the distinct subsets of the set {1, . . . , n} with no more than k elements, so $m = 1 + \binom{n}{1} + \cdots + \binom{n}{k}$. For example, if n = 4 and k = 3, the function should return the matrix

     0 1 2 3 4 1 1 1 2 2 3 1 1 1 2
     0 0 0 0 0 2 3 4 3 4 4 2 2 3 3
     0 0 0 0 0 0 0 0 0 0 0 3 4 4 4

Compute and plot the time used by the algorithm for various values of n and k.
(5) Write a MATLAB function that tests if two integer matrices commute; examine the difficulties of writing such a function that deals with real-number matrices.
(6) Write a MATLAB script that picks n points at random on the circle of radius 1 and draws the corresponding polygonal contour.
Hint: Generate uniformly n angles α1, . . . , αn in the interval [0, 2π] and then graph the points (cos αi, sin αi).

The MATLAB function fzero can be used to determine the zeros z of a nonlinear single-argument function f : R → R located near a number u ∈ R using two arguments: a string s describing the function to investigate and the number u. For example, to find the zeros of the function f given by f(x) = x^3 − 6x^2 + 11x − 6, we write

>> z = fzero('x^3 -6*x^2 + 11*x -6',1.3)

which results in z = 1.0000

An alternative technique is to use an anonymous function; such a function can be written as F = @(x)x^3 -6*x^2 + 11*x -6

Then, we could write z = fzero(F,4)

which results in z = 3


(7) Experiment with the family of functions f_a : R → R defined by f_a(x) = x^3 − 10x^2 + 21x + a for a ∈ R and determine the number of zeros for various values of a.
(8) Consider the complex row vector:

>> v = [1+i 2-i 1.5+2*i]

Which MATLAB expression will compute v^H correctly, v' or v.', and why?
(9) The least element of a matrix may occur in several positions. Write a MATLAB function that has a matrix A as an argument and replaces its minimal element in each position of the matrix.
(10) Write a MATLAB function that generates a random m × n matrix with integer entries distributed uniformly between a and b.
(11) Write a MATLAB function that will start with a vector of complex numbers and return a vector having its components in the set {1, 2, 3, 4} corresponding to the orthants where the image of each complex number is placed.
(12) Write a MATLAB function that will start with a vector of complex numbers and circularly shift its components one position to the left or to the right, as specified by a parameter of the function.
(13) Write a MATLAB function that accepts four square matrices A1, A2, A3, and A4 in R^{n×n} and outputs the minimum and the maximum trace of a product A_{i1} A_{i2} A_{i3} A_{i4}, where
$$\begin{pmatrix} 1 & 2 & 3 & 4 \\ i_1 & i_2 & i_3 & i_4 \end{pmatrix}$$
is a permutation.
(14) Write a MATLAB function that starts with two vectors that represent the sequences of coefficients of two polynomials and generates the sequence of coefficients of the product of these polynomials.
(15) The Lagrange interpolating polynomial is a polynomial of degree no larger than n − 1 whose graph passes through the points (x1, y1), . . . , (xn, yn) in R^2 and is given by
$$p(x) = \sum_{i=1}^{n} y_i \, \frac{\prod\{(x - x_j) \mid 1 \le j \le n \text{ and } j \ne i\}}{\prod\{(x_i - x_j) \mid 1 \le j \le n \text{ and } j \ne i\}}.$$
Given two vectors x, y ∈ R^n, write a MATLAB function that computes the coefficients of the Lagrange interpolating polynomial.


Bibliographical Comments

MATLAB's popularity in the technical and research communities has generated a substantial literature. We mention, as especially useful, such titles as [64, 75, 116, 167] and [83]. Space limitations did not allow us to present the outstanding visualization capabilities of MATLAB for scientific data. The reader should consult [75].

Chapter 5

Determinants

5.1   Introduction

Determinants are a class of numerical multilinear functions defined on the set of square matrices. They play an important role in theoretical considerations of linear algebra and are useful for symbolic computations. As we shall see, determinants can be used to solve certain small and well-behaved linear systems; however, they are of limited use for large or numerically difficult linear systems.

Historically, determinants appeared long before matrices, in connection with solving linear systems. In modern times, determinants were introduced by Leibniz at the end of the 17th century, and Cramer's formula appeared in 1750. The term “determinant” was introduced by Gauss in 1801.

5.2   Determinants and Multilinear Forms

Theorem 5.1. Let F be a field, M be an F-linear space, and let f : M n −→ F be a skew-symmetric F-multilinear form. If two arguments of f are interchanged, then the value of f is multiplied by −1, that is, f (x1 , . . . , xi , . . . , xj , . . . , xn ) = −f (x1 , . . . , xj , . . . , xi , . . . , xn ) for x1 , . . . , xn ∈ M .


Proof.

Since f is a multilinear form, we have

f (x1 , . . . , xi + xj , . . . , xi + xj , . . . , xn ) = f (x1 , . . . , xi , . . . , xi , . . . , xn ) + f (x1 , . . . , xi , . . . , xj , . . . , xn ) +f (x1 , . . . , xj , . . . , xi , . . . , xn ) + f (x1 , . . . , xj , . . . , xj , . . . , xn ) = f (x1 , . . . , xi , . . . , xj , . . . , xn ) + f (x1 , . . . , xj , . . . , xj , . . . , xn ). By the defining property of skew-symmetry, we have the equalities f (x1 , . . . , xi + xj , . . . , xi + xj , . . . , xn ) = 0, f (x1 , . . . , xi , . . . , xi , . . . , xn ) = 0, f (x1 , . . . , xj , . . . , xj , . . . , xn ) = 0, which yield f (x1 , . . . , xi , . . . , xj , . . . , xn ) = −f (x1 , . . . , xj , . . . , xi , . . . , xn ), for x1 , . . . , xn ∈ M .



Corollary 5.1. Let V be an F-linear space and let f : V n −→ F be a skew-symmetric multilinear form. If xi = xj for i = j, then f (x1 , . . . , xi , . . . , xj , . . . , xn ) = 0. Proof.

This follows immediately from Theorem 5.1.



Theorem 5.1 has the following useful extension. Theorem 5.2. Let V be an F-linear space and let f : V n −→ F be a skew-symmetric F-multilinear form. If φ ∈ PERMn is a permutation given by   1 ··· i ··· n , φ: j1 · · · ji · · · jn then f (xj1 , . . . , xjn ) = (−1)inv(φ) f (x1 , . . . , xn ) for x1 , . . . , xn ∈ M . Proof. The argument is by induction on p = inv(φ). The basis case, p = 0 is immediate because in this case, φ is the identity mapping. Suppose that the argument holds for permutations that have no more than p inversions and let φ be a permutation that has p + 1 inversions. Then, as we saw in the proof of Theorem 1.9, there exists an adjacent transposition ψ such that for the permutation φ defined


as φ = ψφ we have inv(φ ) = inv(φ) − 1. Suppose that φ is the permutation   1 2 ···   + 1 ··· n  φ : j1 j2 · · · j j+1 · · · jn and ψ is the adjacent transposition that exchanges j and j+1 , so   1 2 ···   + 1 ··· n . φ: j1 j2 · · · j+1 j · · · jn By the inductive hypothesis, 

f (xj1 , . . . , xj , xj+1 , . . . , xjn ) = (−1)inv(φ ) f (x1 , . . . , xn ) and f (xj1 , . . . , xj+1 , xj , . . . , xjn ) = −f (xj1 , . . . , xj , xj+1 , . . . , xjn ) 

= −(−1)inv(φ ) f (x1 , . . . , xn ) = (−1)inv(φ) f (x1 , . . . , xn ), which concludes the argument.



Theorem 5.3. Let F be a field, V be an F-linear space, f : V n −→ F be a skew-symmetric F-multilinear form, and let a ∈ F. If i = j and x1 , . . . , xn ∈ V n , then f (x1 , . . . , xn ) = f (x1 , . . . , xi + axj , . . . , xn ). Proof.

Suppose that i < j. Then, by the linearity of f , we have

f (x1 , . . . , xi + axj , . . . , xn ) = f (x1 , . . . , xi , . . . , xn ) + af (x1 , . . . , xj , . . . , xj , . . . , xn ) = f (x1 , . . . , xi , . . . , xn ), by Corollary 5.1.



Theorem 5.4. Let V be a linear space and let f : V n −→ R be a skew-symmetric linear form on V. If x1 , . . . , xn are linearly dependent, then f (x1 , . . . , xn ) = 0.


Proof. Suppose that x1 , . . . , xn are linearly dependent, that is, one of the vectors can be expressed as a linear combination of the remaining vectors. Suppose that xn = a1 x1 + · · · + an−1 xn−1 . Then, f (x1 , . . . , xn−1 , xn ) = f (x1 , . . . , xn−1 , a1 x1 + · · · + an−1 xn−1 ) n−1  ai f (x1 , . . . , xi , . . . , xn−1 , xi ) = 0, = i=1



by Corollary 5.1.

Theorem 5.5. Let V be an n-dimensional F-linear space and let {e1 , . . . , en } be a basis in V. There exists a unique, skew-symmetric multilinear form dn : V n −→ R such that dn (e1 , . . . , en ) = 1. Proof.

Let u1 , . . . , un be n vectors such that ui = a1i e1 + a2i e2 + · · · + ani en

for 1  i  n. If dn is a skew symmetric multilinear form, dn : V n −→ R, then dn (u1 , u2 , . . . , un ) ⎛ ⎞ n n n    aj11 ej1 , aj22 ej2 , . . . , ajnn ejn ⎠ = dn ⎝ j1 =1

=

n 

n 

j1 =1 j2 =1

···

j2 =1

n  jn =1

jn =1

aj11 aj22 · · · ajnn dn (ej1 , ej2 , . . . , ejn ).

We need to retain only the terms of this sum in which the arguments of dn (ej1 , ej2 , . . . , ejn ) are pairwise distinct (because the term where jp = jq for p = q is zero, by Corollary 5.1). In other words, only terms in which the list (j1 , . . . , jn ) is a permutation of (1, . . . , n) have a non-zero contribution to the sum. By Theorem 5.2, we can write dn (u1 , u2 , . . . , un ) = dn (e1 , e2 , . . . , en )

 j1 ,...,jn

(−1)inv(j1 ,...,jn) aj11 aj22 · · · ajnn .


where the sum extends over all n! permutations (j_1, . . . , j_n) of (1, . . . , n). Since d_n(e_1, . . . , e_n) = 1, it follows that
$$d_n(u_1, u_2, \ldots, u_n) = \sum_{j_1,\ldots,j_n} (-1)^{\mathrm{inv}(j_1,\ldots,j_n)} a_1^{j_1} a_2^{j_2} \cdots a_n^{j_n}.$$
Note that d_n(u_1, u_2, . . . , u_n) is expressed using the elements of the matrix A, where
$$A = \begin{pmatrix} a_1^1 & a_1^2 & \cdots & a_1^n \\ a_2^1 & a_2^2 & \cdots & a_2^n \\ \vdots & \vdots & \cdots & \vdots \\ a_n^1 & a_n^2 & \cdots & a_n^n \end{pmatrix}.$$

Definition 5.1. Let A = (a_i^j) ∈ C^{n×n} be a square matrix. The determinant of A is the number det(A) defined as
$$\det(A) = \sum_{j_1,\ldots,j_n} (-1)^{\mathrm{inv}(j_1,\ldots,j_n)} a_1^{j_1} a_2^{j_2} \cdots a_n^{j_n}. \tag{5.1}$$
The determinant of A is denoted either by det(A) or by
$$\begin{vmatrix} a_1^1 & a_1^2 & \cdots & a_1^n \\ a_2^1 & a_2^2 & \cdots & a_2^n \\ \vdots & \vdots & \cdots & \vdots \\ a_n^1 & a_n^2 & \cdots & a_n^n \end{vmatrix}.$$
Equality (5.1) is known as the Leibniz formula. Note that det(A) can be written using the Levi-Civita symbols introduced in Definition 1.11 as
$$\det(A) = \sum_{j_1,\ldots,j_n} \epsilon_{j_1,\ldots,j_n} \, a_1^{j_1} a_2^{j_2} \cdots a_n^{j_n}. \tag{5.2}$$
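As an illustration of Equality (5.1), the following brute-force MATLAB sketch (feasible only for small n, since it sums over all n! permutations; the superscript in Definition 5.1 is read as the column index) compares the Leibniz sum with the built-in det:

n = 4;
A = randi(5,n);                   % an arbitrary small integer matrix
P = perms(1:n);                   % all n! permutations of 1..n
d = 0;
for k = 1:size(P,1)
    p = P(k,:);
    nInv = 0;                     % number of inversions of p
    for i = 1:n-1
        for j = i+1:n
            nInv = nInv + (p(i) > p(j));
        end
    end
    t = (-1)^nInv;
    for i = 1:n
        t = t * A(i,p(i));        % factor a_i^{j_i}: row i, column p(i)
    end
    d = d + t;
end
abs(d - det(A))                   % essentially zero, up to rounding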

Theorem 5.6. Let A ∈ C^{n×n} be a matrix. We have det(A') = det(A).


Proof.

The definition of A allows us to write  (−1)inv(j1 ,...,jn) a1j1 a2j2 · · · anjn , det(A ) = j1 ,...,jn

where the sum extends to all permutations of (1, . . . , n). Due to the commutativity of numeric multiplication, we can rearrange the term a1j1 a2j2 · · · anjn as ak11 ak22 · · · aknn , where     1 2 ··· n 1 2 ··· n and ψ : φ: k1 k2 · · · kn j1 j2 · · · jn are inverse permutations. Since both φ and ψ have the same parity, it follows that (−1)inv(j1 ,...,jn ) a1j1 a2j2 · · · anjn = (−1)inv(k1 ,...,kn ) ak11 ak22 · · · ajnn , which implies det(A ) = det(A).



Corollary 5.2. If A ∈ C^{n×n}, then $\det(A^H) = \overline{\det(A)}$. Furthermore, if A is a Hermitian matrix, det(A) is a real number.

Proof. Let $\bar{A}$ be the matrix obtained from A by replacing each entry by its conjugate. Since conjugation of complex numbers permutes with both the sum and product of complex numbers, it follows that $\det(\bar{A}) = \overline{\det(A)}$. Thus, $\det(A^H) = \det(\bar{A}') = \det(\bar{A}) = \overline{\det(A)}$. The second part of the corollary follows from the equality $\det(A) = \overline{\det(A)}$.

Corollary 5.3. If A ∈ C^{n×n} is a unitary matrix, then |det(A)| = 1.

Proof. Since A is unitary, we have A^H A = A A^H = I_n. By Theorem 5.8, $\det(AA^H) = \det(A)\det(A^H) = \det(A)\overline{\det(A)} = |\det(A)|^2 = 1$. Thus, |det(A)| = 1.

Example 5.1. Let A ∈ R^{3×3} be the matrix
$$A = \begin{pmatrix} a_1^1 & a_1^2 & a_1^3 \\ a_2^1 & a_2^2 & a_2^3 \\ a_3^1 & a_3^2 & a_3^3 \end{pmatrix}.$$
The number det(A) is the sum of six terms corresponding to the six permutations of the set {1, 2, 3}, as follows:

    Permutation φ    inv(φ)    Term
    (1, 2, 3)        0          a_1^1 a_2^2 a_3^3
    (3, 1, 2)        2          a_1^3 a_2^1 a_3^2
    (2, 3, 1)        2          a_1^2 a_2^3 a_3^1
    (2, 1, 3)        1         −a_1^2 a_2^1 a_3^3
    (3, 2, 1)        3         −a_1^3 a_2^2 a_3^1
    (1, 3, 2)        1         −a_1^1 a_2^3 a_3^2

Thus, we have
$$\det(A) = a_1^1 a_2^2 a_3^3 + a_1^3 a_2^1 a_3^2 + a_1^2 a_2^3 a_3^1 - a_1^2 a_2^1 a_3^3 - a_1^3 a_2^2 a_3^1 - a_1^1 a_2^3 a_3^2.$$

The number of terms n! grows very fast with the size n of the determinant. For instance, for n = 10, we have 10! = 3,628,800 terms. Thus, direct computations of determinants based on this formula are very expensive.

The definition of the determinant that makes use of skew-symmetric multilinear forms has the advantage of yielding quite simple proofs for many elementary properties of determinants.

Theorem 5.7. The following properties of det(A) hold for any A ∈ C^{n×n}:
(i) det(A) is a linear function of the rows of A (of the columns of A);
(ii) if A has two equal rows (two equal columns), then det(A) = 0;
(iii) if two rows (columns) are permuted, then det(A) changes sign;
(iv) if a row of a matrix, multiplied by a constant, is added to another row, then det(A) remains unchanged; the same holds if instead of rows we consider columns;
(v) if a row (column) equals 0, then det(A) = 0.

Proof. We begin with the above statements that involve rows of A. Let A = (a_i^j) and let x_i = (a_i^1, . . . , a_i^n) ∈ C^n be the ith row of the matrix A. We saw that det(A) = f(x_1, . . . , x_n), where f is the skew-symmetric multilinear form defined by f(e_1, . . . , e_n) = 1. The linearity in each argument follows immediately from the linearity of f.


Part (ii) follows from Corollary 5.1. The third part follows from skew-symmetry of f . Theorem 5.3 implies Part (iv). Finally, the last statement is immediate. The corresponding statements concerning columns of A follow from Theorem 5.6 because the columns of A are  the transposed rows of A . Theorem 5.8. Let A, B ∈ Cn×n be two matrices. We have det(AB) = det(A) det(B). Proof. Let a1 , . . . , an and b1 , . . . , bn be the rows of the matrices A and B, respectively. We assume that ai = (a1i , . . . , ani ) for 1  i  n. Then, the rows c1 , . . . , cn of the matrix C = AB are given by ci = aji bj , where 1  j  n, as it can be easily seen. If dn : (Cn )n −→ C is the skew-symmetric multilinear function that defines the determinant whose existence and uniqueness were shown in Theorem 5.5, then we have det(AB) = dn (c1 , . . . , ci , . . . , cn ) ⎛ ⎞ n n n    aj11 bj1 , . . . , aji i bji , . . . , ajnn bjn ⎠ = dn ⎝ j1 =1

=

n 

···

j1 =1

j=1

n  j1 =1

n 

···

j1 =1

jn =1

aj11 · · · aji i · · · ajnn dn (bj1 , . . . ,

bji , . . . , bjn ), due to the linearity of dn . Observe now that only the sequences (j1 , . . . , jn ) that represent permutations of the set {1, . . . , n} contribute to the sum because dn is skew-symmetric. Furthermore, if (j1 , . . . , jn ) represents a permutation φ, then dn (bj1 , . . . , bji , . . . , bjn ) = (−1)inv(φ) dn (b1 , . . . , bn ). Thus, we can write det(AB) =

n  j1 =1

···

n  j1 =1

···

n  j1 =1

aj11 · · · aji i · · · ajnn dn (bj1 , . . . , bji , . . . , bjn )


⎛ =⎝

n  j1 =1

···

n 

···

j1 =1

n 


⎞ (−1)inv(j1 ,...,jn ) aj11 · · · aji i · · · ajnn ⎠

j1 =1

×dn (b1 , . . . , bn ) = det(A) det(B).



Lemma 5.1. Let B ∈ R(n+1)×(n+1) be the matrix ⎞ ⎛ 1 0 0 ··· 0 ⎜0 a1 a2 · · · an ⎟ ⎜ 1 1 1⎟ ⎟ B=⎜ .. ⎟. ⎜ .. .. .. ⎝. . . · · · . ⎠ 0 a1n a2n · · · ann We have det(B) = det(A), where ⎞ ⎛ 1 2 a1 a1 · · · an1 ⎟ ⎜ A = ⎝ ... ... · · · ... ⎠. a1n a2n · · · ann Proof.

Note that if B = (bji ), then 1 if j = 1 bj1 = 0 otherwise,

and

b1i

=

1 if i = 1 0 otherwise.

Also, if i > 1 and j > 1, then bji = aj−1 i−1 for 2  i, j  n + 1. By the definition of the determinant, each term of the sum that defines det(B) must include an element of the first row. However, only the first element of this row is non-zero, so




det(B) =

j

n+1 (−1)inv(j1 ,j2 ,...,jn+1 ) bj11 bj22 . . . bn+1 ,

(j1 ,j2 ,...,jn+1 )



=

jn −1 jn+1 −1 (−1)inv(1,j2 ,...,jn+1 ) aj12 −1 · · · an−1 an ,

(j2 ,...,jn+1 )

where (j2 , . . . , jn+1 ) is a permutation of the set {2, . . . , n + 1}. Since inv(1, j2 , . . . , jn+1 ) = inv(j2 , . . . , jn+1 ), it follows that det(B) =



(−1)inv(j2 ,...,jn+1 ) aj12 −1 aj23 −1 . . . ajnn+1 −1 .

(j2 ,...,jn+1 )

Observe now that if (j2 , . . . , jn+1 ) is a permutation of the set {2, . . . , n + 1}, then (k1 , . . . , kn ), where ki = ji+1 − 1 for 1  i  n is a permutation of (1, . . . , n) that has the same number of inversions as (j2 , . . . , jn+1 ). Therefore,  (−1)inv(k1 ,...,kn ) ak11 ak22 . . . aknn = det(A). det(B) = (k1 ,...,kn )

Lemma 5.2. Let A ∈ ⎛



Rn×n

a11 ⎜ . ⎜ .. ⎜ ⎜ a1 ⎜ A = ⎜ 1p ⎜ ap+1 ⎜ ⎜ .. ⎝ . a1n

be a matrix partitioned as ⎞ n · · · aq1 aq+1 · · · a 1 1 .. .. .. ⎟ ··· . . ··· . ⎟ ⎟ q q+1 · · · ap ap · · · anp ⎟ ⎟ n ⎟ · · · aqp+1 aq+1 · · · a p+1 ⎟ p+1 ⎟ .. .. . ⎟ ··· . . · · · .. ⎠ · · · aqn aq+1 · · · ann n

and let B ∈ R(n+1)×(n+1) be defined by ⎛ a11 · · · aq1 0 aq+1 1 ⎜ . ⎜ .. · · · ... ... ... ⎜ ⎜ a1 · · · aq 0 aq+1 p p ⎜ p ⎜ B = ⎜ 0 ··· 0 1 0 ⎜ 1 ⎜ ap+1 · · · aqp+1 0 aq+1 p+1 ⎜ .. .. .. ⎜ .. ⎝ . ··· . . . a1n · · · aqn 0 aq+1 n Then det(B) = (−1)p+q det(A).

⎞ · · · an1 . ⎟ · · · .. ⎟ ⎟ · · · anp ⎟ ⎟ ⎟ ··· 0 ⎟. ⎟ · · · anp+1 ⎟ ⎟ . ⎟ · · · .. ⎠ · · · ann


Proof. By permuting the (p+1)st row of B with each of the p rows preceding it in the matrix B and, then, by permuting the (q+1)st column with each of the q columns preceding it, we obtain the matrix C given by ⎛

1 0 ⎜ 0 a11 ⎜ ⎜ .. .. ⎜. . ⎜ 1 C=⎜ ⎜ 0 ap ⎜ 0 a1 ⎜ p+1 ⎜. . ⎝ .. .. 0 a1n

0 0 0 q q+1 · · · a1 a1 . .. · · · .. . · · · aqp aq+1 p · · · aqp+1 aq+1 p+1 .. .. ··· . .

0 0 · · · an1 . · · · ..



⎟ ⎟ ⎟ ⎟ ⎟ · · · anp ⎟ ⎟. · · · anp+1 ⎟ ⎟ .. ⎟ ··· . ⎠

· · · aqn aq+1 · · · ann n

By the third part of Theorem 5.7, each of these row or column permutations multiplies det(B) by −1, so det(C) = (−1)p+q det(B). By Lemma 5.1, we have det(C) = det(A), so det(B) = (−1)p+q det(A).  Definition 5.2. Let A ∈ Cm×n . A minor of order k of A is a determinant of the form   i1 · · · ik . det A j1 · · · kk A principal minor of order k of A is a determinant of the form   i1 · · · ik . det A i1 · · · ik The leading principal minor of order k is the determinant   1 ··· k det A . 1 ··· k For A ∈ Cn×n , det(A) is the unique principal minor of order n, and the principal minors of order 1 of A are just the diagonal entries of A: a11 , . . . , ann .


Theorem 5.9. Let A ∈ Cn×n . Define the matrix Aji ∈ C(n−1)×(n−1) as

 1 ··· i − 1 i + 1 ··· n j , Ai = A 1 ··· j − 1 j + 1 ··· h that is, the matrix obtained from A by removing the ith row and the jth column. Then, we have n  det(A) if i = , j i+j j (−1) ai det(A ) = 0 otherwise, j=1 for every i, , 1  i,   n. Proof.

Let xi be the ith row of A, which can be expressed as xi =

n 

aji ej ,

j=1

where e1 , . . . , en is a basis of Rn such that dn (e1 , . . . , en ) = 1. By the linearity of dn , we have dn (A) = dn (x1 , . . . , xn ) ⎛ = dn ⎝x1 , . . . , xi−1 ,

n 

⎞ aji ej , xi+1 , . . . , xn ⎠

j=1

=

n 

aji dn (x1 , . . . , xi−1 , ej , xi+1 , . . . , xn ).

j=1

The determinant dn (x1 , . . . , xi−1 , ej , xi+1 , . . . , xn ) corresponds to a matrix D (i,j) obtained from A by replacing the ith row by the sequence (0, . . . , 0, 1, 0, . . . , 0), whose unique nonzero component is on the jth position. Next, by multiplying the ith row by −ajk and adding the result to the kth row for 1  k  i − 1 and i + 1  k  n, we obtain a matrix E (i,j) that coincides with the matrix A with the following exceptions:


(i) the elements of row i are 0 with the exception of the jth element of this row that equals 1, and (ii) the elements of column j are 0 with the exception of the element mentioned above. Clearly, det(D (i,j) ) = det(E (i,j) ). By applying Lemma 5.2, we obtain det(E (i,j) ) = (−1)i+j det(Aji ), so dn (A) =

n 

aji dn (E (i,j) ) =

j=1

n  (−1)i+j aji det(Aji ), j=1

which is the first case of the desired formula. Suppose now that i = . The same determinant could be computed by using an expansion on the th row, as follows: dn (A) =

n 

(−1)i+j aj det(Aj ).

j=1

 Then nj=1 (−1)i+j aji det(Aj ) is the determinant of a matrix obtained from A by replacing the th row by the ith row and such a determinant is 0 because the new matrix has two identical rows. This proves  the second case of the equality of the theorem. The equality of the theorem is known as the Laplace expansion of det(A) by row i. Since the determinant of a matrix A equals the determinant of A , det(A) can be expanded by the jth row as det(A) =

n  (−1)i+j aji det(Aji ) i=1

for every 1  j  n. Thus, we have n  (−1)i+j aji det(Aj ) = i=1



det(A) 0

if i = , if i =  .

This formula is the Laplace expansion of det(A) by column j.


The number cof(aji ) = (−1)i+j det(Aji ) is the cofactor of aij in either kind of Laplace expansion. Thus, both types of Laplace expansions can be succinctly expressed by the equalities det(A) =

n  j=1

aji cof(aji )

=

n 

aji cof(aji )

(5.3)

i=1

for all i, j ∈ {1, . . . , n}. Cofactors of the form cof(aii ) are known as principal cofactors of A. Example 5.2. Let a = (a1 , . . . , an ) be a sequence of n real numbers. The Vandermonde determinant Va contains on its ith row the successive powers of ai , namely a0i = 1, a1i = ai , . . . , ani : 1 a1 (a1 )2 · · · (a1 )n−1 1 a2 (a2 )2 · · · (a2 )n−1 Va = .. .. . .. . . · · · . 1 an (an )2 · · · (an )n−1 By subtracting the first line from the remaining lines, we have n−1 2 1 a1 a · · · a 1 1 0 a2 − a1 (a2 )2 − (a1 )2 · · · (a2 )n−1 − (a1 )n−1 Va = . .. .. .. . ··· . 2 2 n−1 n−1 0 an − a1 (an ) − (a1 ) · · · (an ) − (a1 ) a2 − a1 (a2 )2 − (a1 )2 · · · (a2 )n−1 − (a1 )n−1 .. . . . . = . . . ··· . an − a1 (an )2 − (a1 )2 · · · (an )n−1 − (a1 )n−1 Factoring now ai+1 − a1 from the ith line of the new determinant for 1  i  n yields 1 a2 + a1 · · · n−2 an−2−i ai 1 i=0 2 .. .. Va = (a2 − a1 ) · · · (an − a1 ) ... . . · · · . 1 an + a1 · · · n−2 an−2−i ai 1 i=0 n


Consider two successive columns of this determinant: ⎛k−1 k−1−i i ⎞ ⎛ k ⎞ k−i i a1 i=0 a2 i=0 a2 a1 ⎜ ⎜ ⎟ ⎟ .. .. ck = ⎝ ⎠ and ck+1 = ⎝ ⎠. . . k−1 k−1−i i k k−i i a1 i=0 an i=0 an a1 Observe that ⎛

ck+1

⎞ ak2 ⎜ ⎟ = ⎝ ... ⎠ + a1 ck , akn

it follows that by subtracting from each column ck+1 from the previous column multiplied by a1 (from right to left), we obtain 1 a2 · · · an−2 2 Va = (a2 − a1 ) · · · (an − a1 ) ... ... · · · ... 1 an · · · an−2 n = (a2 − a1 ) · · · (an − a1 )V(a2 ,...,an ) . By applying repeatedly this formula, we obtain  (ap − aq ), Va = p>q

where 1  p, q  n. Theorem 5.8 can be extended to products of rectangular matrices. Theorem 5.10. Let A ∈ Cm×n and B ∈ Cn×m be two matrices, where m  n. We have det(AB)      k1 · · · km 1 ··· m det B = det A 1 ··· m k1 · · · km  | 1  k1 < k2 < · · · < km  n . This equality is known as the Cauchy–Binet formula.


Proof. Let a1 , . . . , an be the rows of the matrix  A and let C = AB. The first column of the matrix AB equals nk1 =1 ak1 bk1 1 . Since det(C) is linear, we can write 1 2 a c1 · · · cn1 k1 n a2 c2 · · · cn  2 2 k1 bk1 1 . . det(C) = . .. .. · · · ... k1 =1 m 2 a cm · · · cnm k1 n Similarly, the second row of C equals k2 =1 ak2 bk2 1 . A further decomposition yields the sum 1 1 a a · · · cn1 k1 k2 n n  a2 a2 · · · cn  2 k k 1 2 bk1 1 bk2 2 . . det(C) = , .. .. · · · ... k1 =1 k2 =1 m 2 a a · · · cnm k1 k2 and so on. Eventually, we can write

k a 1 1 . ··· b1k1 · · · bm det(C) = km .. k1 =1 km =1 ak1 m n 

n 

· · · ak1m .. .. , . . · · · akm m

due to the multilinearity of the determinants. Only terms involving distinct numbers k1 , . . . , km are retained in this sum because any term with kp = kq equals 0. Suppose that {k1 , . . . , km } = {h1 , . . . , hm }, where h1 < · · · < hm and φ is the bijection defined by ki = φ(hi ) for 1  i  m. Then h k a 1 · · · ahm a 1 · · · akm 1 1 1 1 .. .. .. . . . = (−1)inv(k1 ,...,km ) ... ... ... , ah1 · · · ahm ak1 · · · akm m m m m which allows us to write h 1  a1 det(C) = ... h1 A = [1 2 3; 4 5 6; 7 8 9] A = 1 2 3 4 5 6 7 8 9


>> d = det(A)
d =
     0

A =
     1     5     2
     4     1     6
     3     1     4
>> d = det(A)
d =
    10
>> B = inv(A)
B =
      -1/5      -9/5      14/5
       1/5      -1/5       1/5
      1/10       7/5    -19/10

The Symbolic Math Toolbox of MATLAB provides functions for solving, plotting, and manipulating mathematics. The function syms creates symbolic variables and functions. Example 5.16. To create the variables x and y, one could write syms x y

Polynomials can be defined as shown in what follows. For example, to re-create the polynomials defined in Example 5.11, one could write

f = x^3 - 2*x^2 - x + 2
g = x^2 - 5*x + 6

The resultant of these polynomials in the variable x is obtained with res = resultant(f,g,x)

and, as expected, the result is 0. Suppose now that f and g are two polynomials in x and the parameter m as in f = x^3 - (m+1)x^2 - mx + 2 g = x^2-(2m+3) x + 2m+4

Clearly, the previous polynomials are obtained by taking m = 1.
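This specialization can be checked with subs (a small illustrative sketch, assuming the Symbolic Math Toolbox):

syms x m
f = x^3 - (m+1)*x^2 - m*x + 2;
g = x^2 - (2*m+3)*x + 2*m + 4;
simplify(subs(f,m,1))    % returns x^3 - 2*x^2 - x + 2
simplify(subs(g,m,1))    % returns x^2 - 5*x + 6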


We begin by computing the resultant as a polynomial in m: >> syms x m >> f = x^3 -(m+1)* x^2 - m*x + 2 f = x^3 + (- m - 1)*x^2 - m*x + 2 >> g=x^2 -(2*m +3)*x + 2*m + 4 g = x^2 + (- 2*m - 3)*x + 2*m + 4 >> res = resultant(f,g,x) res = - 8*m^4 - 24*m^3 - 8*m^2 + 24*m + 16

To determine the values of m that make the resultant equal to 0, we write

>> mRoots = solve(res)

which returns

mRoots =
 -2
 -1
 -1
  1

Exercises and Supplements

(1) Let φ ∈ PERM_n be
$$\varphi: \begin{pmatrix} 1 & \cdots & i & \cdots & n \\ a_1 & \cdots & a_i & \cdots & a_n \end{pmatrix},$$
and let v_p(φ) = |{(i_k, i_l) | i_l = p, k < l, i_k > i_l}| be the number of inversions of φ that have p as their second component, for 1 ≤ p ≤ n. Prove that


(a) v_p(φ) ≤ n − p for 1 ≤ p ≤ n;
(b) for every sequence of numbers (v_1, . . . , v_n) ∈ N^n such that v_p ≤ n − p for 1 ≤ p ≤ n, there exists a unique permutation φ that has (v_1, . . . , v_n) as its sequence of inversions.
(2) Let p be the polynomial defined as
$$p(x_1, \ldots, x_n) = \prod_{i < j} (x_i - x_j).$$
(37) Let A, B ∈ C^{n×n} be two matrices and let p be the polynomial p(x) = det(A + xB). Prove that the degree of p does not exceed the rank of B.
Solution: By Theorem 3.37, if rank(B) = r, there exist non-singular matrices G, H ∈ C^{n×n} such that B = G L_r H, where
$$L_r = \begin{pmatrix} I_r & O_{r,n-r} \\ O_{n-r,r} & O_{n-r,n-r} \end{pmatrix}.$$
Therefore, p(x) = det(A + xG L_r H) = det(G(G^{-1} A H^{-1} + x L_r)H) = det(G) det(H) det(M + x L_r), where M = G^{-1} A H^{-1}. The highest degree of a term in det(M + x L_r) cannot exceed r, and the result follows immediately.


(38) Let f(x_1, . . . , x_r) be a homogeneous polynomial of degree n. Prove that
$$\sum_i x_i \frac{\partial f}{\partial x_i} = n f.$$
(This result is known as Euler's theorem.)
Solution: The homogeneity of f means that f(tx_1, . . . , tx_r) = t^n f(x_1, . . . , x_r). Differentiating with respect to t yields
$$\sum_i x_i \frac{\partial f}{\partial x_i}(tx_1, \ldots, tx_r) = n t^{n-1} f(x_1, \ldots, x_r),$$
and the desired result follows taking t = 1.
(39) Let f_1, . . . , f_k be k polynomials in R[x]. Define the polynomials
F(x) = a_1 f_1(x) + a_2 f_2(x) + · · · + a_k f_k(x),
G(x) = b_1 f_1(x) + b_2 f_2(x) + · · · + b_k f_k(x).
Prove that the polynomials f_1, . . . , f_k have a common root if and only if R_{F,G} = 0 for all a_1, . . . , a_k, b_1, . . . , b_k.
(40) The discriminant of a polynomial f ∈ R[x] is $D_f = \frac{1}{a_0} R_{f,f'}$, where a_0 is the leading coefficient of f. Prove that f has roots of multiplicity at least 2 if and only if D_f = 0.
(41) Let f_1, f_2 be two polynomials in R[x] of degrees k_1 and k_2, respectively. Prove that $D_{f_1 f_2} = (-1)^{k_1 k_2} D_{f_1} D_{f_2} R_{f_1,f_2}^2$.
(42) Let g, h be two polynomials of degrees k_1 and k_2, respectively. Prove that $D_{gh} = (-1)^{k_1 k_2} D_g D_h R_{g,h}^2$.
(43) Prove that the polynomial f ∈ C[x] defined by f(x) = ax^3 + bx^2 + cx + d has the expression b^2 c^2 − 4ac^3 − 4b^3 d − 27a^2 d^2 + 18abcd as its discriminant.
(44) Let f ∈ C[x] be a polynomial of degree n. Prove that D_{f(x+a)} = D_f for any constant a ∈ C. If g(x) = f(ax), prove that D_g = a^{n(n−1)} D_f.


A homogeneous linear system has the form
$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= 0,\\
&\;\;\vdots\\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= 0,
\end{aligned}$$
or Ax = 0_m, where A = (a_{ij}) ∈ R^{m×n} and x ∈ R^n. Observe that each such system has the trivial solution x_1 = x_2 = · · · = x_n = 0.
(45) Prove that the set of solutions of the homogeneous system Ax = 0_m is a subspace of R^n.
(46) Let A ∈ R^{n×n} be a matrix. Prove that the system Ax = 0_n has a non-trivial solution if and only if det(A) = 0.
(47) Let A ∈ R^{3×4} and consider the system Ax = 0_3, where x ∈ R^4. Denote by A_k the matrix in R^{3×3} obtained from A by eliminating the kth column, where 1 ≤ k ≤ 4. Prove that if the system Ax = 0_3 has a non-trivial solution x ∈ R^4, then
$$\frac{x_1}{\det(A_1)} = \frac{x_2}{-\det(A_2)} = \frac{x_3}{\det(A_3)} = \frac{x_4}{-\det(A_4)}.$$
Solution: The system Ax = 0_3 has the explicit form
$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + a_{14}x_4 &= 0\\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4 &= 0\\
a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4 &= 0.
\end{aligned}$$
Equivalently, we can write
$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 &= -a_{14}x_4\\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 &= -a_{24}x_4\\
a_{31}x_1 + a_{32}x_2 + a_{33}x_3 &= -a_{34}x_4.
\end{aligned}$$
Thus, we have
$$x_1 = -x_4 \frac{\begin{vmatrix} a_{14} & a_{12} & a_{13} \\ a_{24} & a_{22} & a_{23} \\ a_{34} & a_{32} & a_{33} \end{vmatrix}}{\det(A_4)} = -x_4 \frac{\det(A_1)}{\det(A_4)}.$$

Determinants

333

Similar formulas are x2 = x4

det(A2 ) , det(A4 )

and x3 = −x4

det(A3 ) . det(A4 )
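The identities of Supplement (47) are easy to check numerically in MATLAB. The matrix A below is an arbitrary full-rank example, so the script is only an illustrative sketch.

A = [1 2 3 4; 2 0 1 5; 3 1 0 2];      % an arbitrary A in R^{3x4} of rank 3
x = null(A);                           % a basis of the null space: A*x = 0
d = zeros(1, 4);
for k = 1:4
    Ak = A;  Ak(:, k) = [];            % remove the k-th column
    d(k) = det(Ak);
end
ratios = x(:)' ./ (d .* [1 -1 1 -1])   % the four ratios coincide

The four displayed ratios agree up to the common scaling of the null-space vector returned by null.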

Bibliographical Comments

An encyclopedic reference on Schur's complement and its applications in numerical analysis, probability, statistics, and other areas can be found in [178]. The result contained in Supplement 22 appears in [42]. Supplement 35 is a result of [10].


Chapter 6

Norms and Inner Products

6.1 Introduction

A norm is a real-valued function defined on a linear space intended to model the “length” of a vector. Norms generate metrics on linear spaces, and these metrics, in turn, generate topologies that are useful in constructing algorithms on these spaces. Elementary notions of set topology used in this chapter can be found in [152]. The other fundamental notion discussed in this chapter is that of inner product spaces. Inner products generate norms and allow the introduction of the concept of orthogonality and of unitary and orthogonal (or orthonormal) matrices. We also introduce positive definite matrices and discuss two decomposition results for matrices: the Cholesky decomposition for positive definite matrices and the QR factorization.

6.2 Basic Inequalities

Lemma 6.1. Let p, q ∈ ℝ − {0, 1} be such that 1/p + 1/q = 1. Then we have p > 1 if and only if q > 1. Furthermore, one of the numbers p, q belongs to the interval (0, 1) if and only if the other number is negative.

Proof. The statement follows immediately from the equality
$$q = \frac{p}{p-1}.$$


Lemma 6.2. Let p, q ∈ ℝ − {0, 1} be two numbers such that 1/p + 1/q = 1 and p > 1. Then, for every a, b ∈ ℝ≥0, we have
$$ab \leq \frac{a^p}{p} + \frac{b^q}{q},$$
where the equality holds if and only if $a = b^{-\frac{1}{1-p}}$.

Proof. By Lemma 6.1, we have q > 1. Consider the function $f(x) = \frac{x^p}{p} + \frac{1}{q} - x$ for x ≥ 0. We have $f'(x) = x^{p-1} - 1$, so the minimum is achieved when x = 1 and f(1) = 0. Thus,
$$f\left(ab^{-\frac{1}{p-1}}\right) \geq f(1) = 0,$$
which amounts to
$$\frac{a^p b^{-\frac{p}{p-1}}}{p} + \frac{1}{q} - ab^{-\frac{1}{p-1}} \geq 0.$$
By multiplying both sides of this inequality by $b^{\frac{p}{p-1}}$, we obtain the desired inequality.

Observe that if 1/p + 1/q = 1 and p < 1, then q < 0. In this case, we have the reverse inequality
$$ab \geq \frac{a^p}{p} + \frac{b^q}{q}, \tag{6.1}$$
which can be shown by observing that the function f has a maximum in x = 1. The same inequality holds when q < 1 and therefore p < 0.

Theorem 6.1 (The Hölder inequality). Let a₁, …, aₙ and b₁, …, bₙ be 2n nonnegative numbers, and let p and q be two numbers such that 1/p + 1/q = 1 and p > 1. We have
$$\sum_{i=1}^{n} a_i b_i \leq \left(\sum_{i=1}^{n} a_i^p\right)^{\frac{1}{p}} \cdot \left(\sum_{i=1}^{n} b_i^q\right)^{\frac{1}{q}}.$$

Proof. If a₁ = ⋯ = aₙ = 0 or if b₁ = ⋯ = bₙ = 0, then the inequality is clearly satisfied. Therefore, we may assume that at least one of a₁, …, aₙ and at least one of b₁, …, bₙ is non-zero. Define the numbers
$$x_i = \frac{a_i}{\left(\sum_{i=1}^n a_i^p\right)^{\frac{1}{p}}} \quad \text{and} \quad y_i = \frac{b_i}{\left(\sum_{i=1}^n b_i^q\right)^{\frac{1}{q}}}$$
for 1 ≤ i ≤ n. Lemma 6.2 applied to xᵢ, yᵢ yields
$$\frac{a_i b_i}{\left(\sum_{i=1}^n a_i^p\right)^{\frac{1}{p}} \left(\sum_{i=1}^n b_i^q\right)^{\frac{1}{q}}} \leq \frac{1}{p}\,\frac{a_i^p}{\sum_{i=1}^n a_i^p} + \frac{1}{q}\,\frac{b_i^q}{\sum_{i=1}^n b_i^q}.$$
Adding these inequalities, we obtain
$$\sum_{i=1}^{n} a_i b_i \leq \left(\sum_{i=1}^{n} a_i^p\right)^{\frac{1}{p}} \left(\sum_{i=1}^{n} b_i^q\right)^{\frac{1}{q}},$$
because 1/p + 1/q = 1.
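A small numerical illustration of the Hölder inequality can be run in MATLAB; the vectors a, b and the exponent p below are arbitrary choices used only as a sketch.

a = [1 2 3 4];  b = [0.5 1 0.25 2];
p = 3;  q = p/(p - 1);                          % conjugate exponents, 1/p + 1/q = 1
lhs = sum(a .* b);
rhs = sum(a.^p)^(1/p) * sum(b.^q)^(1/q);
fprintf('lhs = %.4f <= rhs = %.4f\n', lhs, rhs)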

The nonnegativity of the numbers a₁, …, aₙ, b₁, …, bₙ can be relaxed by using absolute values. Indeed, we can easily prove the following variant of Theorem 6.1.

Theorem 6.2. Let a₁, …, aₙ and b₁, …, bₙ be 2n numbers and let p and q be two numbers such that 1/p + 1/q = 1 and p > 1. We have
$$\left|\sum_{i=1}^{n} a_i b_i\right| \leq \left(\sum_{i=1}^{n} |a_i|^p\right)^{\frac{1}{p}} \cdot \left(\sum_{i=1}^{n} |b_i|^q\right)^{\frac{1}{q}}.$$

Proof. By Theorem 6.1, we have
$$\sum_{i=1}^{n} |a_i||b_i| \leq \left(\sum_{i=1}^{n} |a_i|^p\right)^{\frac{1}{p}} \cdot \left(\sum_{i=1}^{n} |b_i|^q\right)^{\frac{1}{q}}.$$
The needed inequality follows from the fact that
$$\left|\sum_{i=1}^{n} a_i b_i\right| \leq \sum_{i=1}^{n} |a_i||b_i|.$$


Corollary 6.1 (The Cauchy–Schwarz inequality for ℝⁿ). Let a₁, …, aₙ and b₁, …, bₙ be 2n real numbers. We have
$$\left|\sum_{i=1}^{n} a_i b_i\right| \leq \sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}.$$

i=1

and

i=1

If p < 1, the inequality sign is reversed. Proof. For p = 1, the inequality is immediate. Therefore, we can assume that p > 1. Note that n n n    (ai + bi )p = ai (ai + bi )p−1 + bi (ai + bi )p−1 . i=1

i=1

i=1

By H¨older’s inequality for p, q such that p > 1 and 1p + 1q = 1, we have  n 1  n 1 n q   p p  ai (ai + bi )p−1  ai (ai + bi )(p−1)q i=1

 =

i=1 n 

1  p

api

i=1

i=1 n  (ai + bi )p

1 q

.

i=1

Similarly, we can write n 

 p−1

bi (ai + bi )

i=1



n 

1  bpi

p

i=1

n  (ai + bi )p

1 q

.

i=1

Adding the last two inequalities yields ⎛ 1  n 1 ⎞  n 1 n n p p q     ⎠ (ai + bi )p  ⎝ api + bpi (ai + bi )p , i=1

i=1

i=1

i=1

Norms and Inner Products

339

which is equivalent to the inequality 

n  (ai + bi )p i=1

6.3

1



p



n 

1 api



p

+

i=1

n  i=1

1 bpi

p

. 

Metric Spaces

Definition 6.1. A function d : S 2 −→ R0 is a metric if it has the following properties: (i) d(x, y) = 0 if and only if x = y for x, y ∈ S; (ii) d(x, y) = d(y, x) for x, y ∈ S; (iii) d(x, y)  d(x, z) + d(z, y) for x, y, z ∈ S. The pair (S, d) will be referred to as a metric space. If property (i) is replaced by the weaker requirement that d(x, x) = 0 for x ∈ S, then we refer to d as a semimetric on S. Thus, if d is a semimetric, d(x, y) = 0 does not necessarily imply x = y and we can have for two distinct elements x, y of S, d(x, y) = 0. If d is a semimetric, then we refer to the pair (S, d) as a semimetric space. Example 6.1. Let S be a nonempty set. Define the mapping d : S 2 −→ R0 by  1 if u = v, d(u, v) = 0 otherwise for x, y ∈ S. It is easy to see that d satisfies the definiteness property. To prove that d satisfies the triangular inequality, we need to show that d(x, y)  d(x, z) + d(z, y) for all x, y, z ∈ S. This is clearly the case if x = y. Suppose that x = y, so d(x, y) = 1. Then, for every z ∈ S, we have at least one of the inequalities x = z or z = y, so at least one of the numbers d(x, z) or d(z, y) equals 1. Thus, d satisfies the triangular inequality. The metric d introduced here is the discrete metric on S.

340

Linear Algebra Tools for Data Mining (Second Edition)

Example 6.2. Consider the mapping d : (Seqn (S))2 −→ R0 defined by d(p, q) = |{i | 0  i  n − 1 and p(i) = q(i)}| for all sequences p, q of length n on the set S. It is easy to see that d is a metric. We justify here only the triangular inequality. Let p, q, r be three sequences of length n on the set S. If p(i) = q(i), then r(i) must be distinct from at least one of p(i) and q(i). Therefore, {i | 0  i  n − 1 and p(i) = q(i)} ⊆ {i | 0  i  n − 1 and p(i) = r(i)} ∪ {i | 0  i  n − 1 and r(i) = q(i)}, which implies the triangular inequality. Example 6.3. For x ∈ Rn and y ∈ Rn , the Euclidean metric is the mapping

n

 d2 (x, y) = (xi − yi )2 . i=1

The first two conditions of Definition 6.1 are obviously satisfied. To prove the third inequality, let x, y, z ∈ Rn . Choosing ai = xi − yi and bi = yi − zi for 1  i  n in Minkowski’s inequality implies

n

n

n





 (xi − zi )2  (xi − yi )2 + (yi − zi )2 , i=1

i=1

i=1

which amounts to d(x, z)  d(x, y)+d(y, z). Thus, we conclude that d is indeed a metric on Rn . We frequently use the notions of closed sphere and open sphere. Definition 6.2. Let (S, d) be a metric space. The closed sphere centered in x ∈ S of radius r is the set Bd [x, r] = {y ∈ S|d(x, y)  r}. The open sphere centered in x ∈ S of radius r is the set Bd (x, r) = {y ∈ S|d(x, y) < r}.
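The metrics of Examples 6.2 and 6.3 can be computed directly in MATLAB; the sequences and vectors below are arbitrary choices used only as a sketch.

p = [1 0 2 2 1];  q = [1 1 2 0 1];
d_seq = sum(p ~= q)             % Example 6.2: number of positions where p and q differ
x = [1 2 3];  y = [4 0 3];
d_2 = sqrt(sum((x - y).^2))     % Example 6.3: Euclidean metric; equals norm(x - y)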


Definition 6.3. Let (S, d) be a metric space. The diameter of a subset U of S is the number diam_{S,d}(U) = sup{d(x, y) | x, y ∈ U}. The set U is bounded if diam_{S,d}(U) is finite.

The diameter of the metric space (S, d) is the number diam_{S,d} = sup{d(x, y) | x, y ∈ S}. If the metric space is clear from the context, then we denote the diameter of a subset U just by diam(U). If (S, d) is a finite metric space, then diam_{S,d} = max{d(x, y) | x, y ∈ S}.

A mapping d : S × S → $\hat{\mathbb{R}}_{\geq 0}$ can be extended to the set of subsets of S by defining d(U, V) as
$$d(U, V) = \inf\{d(u, v) \mid u \in U \text{ and } v \in V\} \tag{6.2}$$
for U, V ∈ 𝒫(S). Observe that, even if d is a metric, its extension is not, in general, a metric on 𝒫(S) because it does not satisfy the triangular inequality. Instead, we can show that for every U, V, W we have
$$d(U, W) \leq d(U, V) + \operatorname{diam}(V) + d(V, W).$$
Indeed, by the definition of d(U, V) and d(V, W), for every ε > 0 there exist u ∈ U, v, v′ ∈ V, and w ∈ W such that
$$d(U, V) \leq d(u, v) \leq d(U, V) + \tfrac{\varepsilon}{2}, \qquad d(V, W) \leq d(v', w) \leq d(V, W) + \tfrac{\varepsilon}{2}.$$
By the triangular axiom, we have d(u, w) ≤ d(u, v) + d(v, v′) + d(v′, w). Hence, d(u, w) ≤ d(U, V) + diam(V) + d(V, W) + ε, which implies d(U, W) ≤ d(U, V) + diam(V) + d(V, W) + ε for every ε > 0. This yields the needed inequality.

Definition 6.4. Let (S, d) be a metric space. The sets U, V ∈ 𝒫(S) are separate if d(U, V) > 0.


We denote the number d({u}, V) = inf{d(u, v) | v ∈ V} by d(u, V). It is clear that u ∈ V implies d(u, V) = 0.

The notion of dissimilarity is a generalization of the notion of metric.

Definition 6.5. A dissimilarity on a set S is a function d : S² → ℝ≥0 satisfying the following conditions:
(i) d(x, x) = 0 for all x ∈ S;
(ii) d(x, y) = d(y, x) for all x, y ∈ S.
The pair (S, d) is a dissimilarity space.

A related concept is the notion of similarity.

Definition 6.6. A similarity on a set S is a function s : S² → ℝ≥0 satisfying the following conditions:
(i) s(x, y) ≤ s(x, x) = 1 for all x, y ∈ S;
(ii) s(x, y) = s(y, x) for all x, y ∈ S.
The pair (S, s) is a similarity space.

Example 6.4. Let d : S² → ℝ≥0 be a metric on the set S. Then s : S² → ℝ≥0 defined by s(x, y) = 2^{−d(x,y)} for x, y ∈ S is a similarity, since s(x, y) ≤ s(x, x) = 1 for every x, y ∈ S.

6.4 Norms

In this chapter, we study norms on real or complex linear spaces.

Definition 6.7. A seminorm on an F-linear space V is a mapping ν : V → ℝ that satisfies the following conditions:
(i) ν(x + y) ≤ ν(x) + ν(y) (subadditivity), and
(ii) ν(ax) = |a|ν(x) (positive homogeneity),
for x, y ∈ V and a ∈ F.

By taking a = 0 in the second condition of the definition, we have ν(0) = 0 for every seminorm on a real or complex space.

A seminorm can be defined on every linear space. Indeed, if B is a basis of V, B = {vᵢ | i ∈ I}, J is a finite subset of I, and x = Σ_{i∈I} xᵢvᵢ, define νJ(x) as
$$\nu_J(x) = \begin{cases} 0 & \text{if } x = 0,\\ \sum_{j \in J} |x_j| & \text{otherwise} \end{cases}$$
for x ∈ V. We leave to the reader the verification of the fact that νJ is indeed a seminorm.

Theorem 6.4. If V is a real or complex linear space and ν : V → ℝ is a seminorm on V, then ν(x − y) ≥ |ν(x) − ν(y)| for x, y ∈ V.

Proof.

We have ν(x) ≤ ν(x − y) + ν(y), so
$$\nu(x) - \nu(y) \leq \nu(x - y). \tag{6.3}$$
Since ν(x − y) = |−1| ν(y − x) ≥ ν(y) − ν(x), we have
$$-(\nu(x) - \nu(y)) \leq \nu(x - y). \tag{6.4}$$
Inequalities (6.3) and (6.4) give the desired inequality.

Corollary 6.2. If ν : V → ℝ is a seminorm on V, then ν(x) ≥ 0 for x ∈ V.

Proof. By choosing y = 0 in the inequality of Theorem 6.4, we have ν(x) ≥ |ν(x)| ≥ 0.

Definition 6.8. A norm on an F-linear space V is a seminorm ν : V → ℝ such that ν(x) = 0 implies x = 0 for x ∈ V. The pair (V, ν) is referred to as a normed linear space.

Example 6.5. The set of real-valued continuous functions defined on the interval [−1, 1] is a real linear space. The addition of two such functions f, g is defined by (f + g)(x) = f(x) + g(x) for x ∈ [−1, 1]; the multiplication of f by a scalar a ∈ ℝ is (af)(x) = af(x) for x ∈ [−1, 1]. Define ν(f) = sup{|f(x)| | x ∈ [−1, 1]}. Since |f(x)| ≤ ν(f) and |g(x)| ≤ ν(g) for x ∈ [−1, 1], it follows that |(f + g)(x)| ≤ |f(x)| + |g(x)| ≤ ν(f) + ν(g). Thus, ν(f + g) ≤ ν(f) + ν(g). We leave to the reader the verification of the remaining properties of Definition 6.7. We denote ν(f) by ‖f‖.


Theorem 6.5. For p ≥ 1, the function νp : ℝⁿ → ℝ≥0 defined by
$$\nu_p(x) = \left(\sum_{i=1}^{n} |x_i|^p\right)^{\frac{1}{p}},$$
where x = (x₁, …, xₙ)ᵀ ∈ ℝⁿ, is a norm on ℝⁿ.

Proof. We must prove that νp satisfies the conditions of Definition 6.7 and that νp(x) = 0 implies x = 0. Let x = (x₁, …, xₙ)ᵀ and y = (y₁, …, yₙ)ᵀ. Minkowski's inequality applied to the nonnegative numbers aᵢ = |xᵢ| and bᵢ = |yᵢ| amounts to
$$\left(\sum_{i=1}^{n} (|x_i| + |y_i|)^p\right)^{\frac{1}{p}} \leq \left(\sum_{i=1}^{n} |x_i|^p\right)^{\frac{1}{p}} + \left(\sum_{i=1}^{n} |y_i|^p\right)^{\frac{1}{p}}.$$

Since |xᵢ + yᵢ| ≤ |xᵢ| + |yᵢ| for every i, we have
$$\left(\sum_{i=1}^{n} |x_i + y_i|^p\right)^{\frac{1}{p}} \leq \left(\sum_{i=1}^{n} |x_i|^p\right)^{\frac{1}{p}} + \left(\sum_{i=1}^{n} |y_i|^p\right)^{\frac{1}{p}},$$
that is, νp(x + y) ≤ νp(x) + νp(y). We leave to the reader the verification of the remaining conditions. Thus, νp is a norm on ℝⁿ.

Example 6.6. The mapping ν₁ : ℝⁿ → ℝ given by
$$\nu_1(x) = |x_1| + |x_2| + \cdots + |x_n|,$$
for x = (x₁, …, xₙ)ᵀ ∈ ℝⁿ, is a norm on ℝⁿ.


Example 6.7. A special norm on ℝⁿ is the function ν∞ : ℝⁿ → ℝ≥0 given by
$$\nu_\infty(x) = \max\{|x_i| \mid 1 \leq i \leq n\}. \tag{6.5}$$
We verify here that ν∞ satisfies the first condition of Definition 6.7. We start from the inequality |xᵢ + yᵢ| ≤ |xᵢ| + |yᵢ| ≤ ν∞(x) + ν∞(y) for every i, 1 ≤ i ≤ n. This in turn implies
$$\nu_\infty(x + y) = \max\{|x_i + y_i| \mid 1 \leq i \leq n\} \leq \nu_\infty(x) + \nu_\infty(y),$$
which gives the desired inequality.
This norm can be regarded as a limit case of the norms νp. Indeed, let x ∈ ℝⁿ and let M = max{|xᵢ| | 1 ≤ i ≤ n} = |x_{ℓ₁}| = ⋯ = |x_{ℓ_k}| for some ℓ₁, …, ℓ_k, where 1 ≤ ℓ₁, …, ℓ_k ≤ n. Here x_{ℓ₁}, …, x_{ℓ_k} are the components of x that have the maximal absolute value and k ≥ 1. We can write
$$\lim_{p\to\infty} \nu_p(x) = \lim_{p\to\infty} M\left(\sum_{i=1}^{n} \left(\frac{|x_i|}{M}\right)^p\right)^{\frac{1}{p}} = \lim_{p\to\infty} M k^{\frac{1}{p}} = M,$$
which justifies the notation ν∞.
We use the alternative notation ‖x‖_p for νp(x). We refer to ‖x‖₂ as the Euclidean norm of x and we denote this norm simply by ‖x‖ when there is no risk of confusion.

Example 6.8. For p ≥ 1, let ℓ_p be the set that consists of sequences of real numbers x = (x₀, x₁, …) such that the series Σ_{i=0}^{∞} |xᵢ|^p is convergent. We can show that ℓ_p is a linear space. Let x, y be two sequences in ℓ_p. Using Minkowski's inequality, we have
$$\left(\sum_{i=0}^{n} |x_i + y_i|^p\right)^{\frac{1}{p}} \leq \left(\sum_{i=0}^{n} (|x_i| + |y_i|)^p\right)^{\frac{1}{p}} \leq \left(\sum_{i=0}^{n} |x_i|^p\right)^{\frac{1}{p}} + \left(\sum_{i=0}^{n} |y_i|^p\right)^{\frac{1}{p}},$$
which shows that x + y ∈ ℓ_p. It is immediate that x ∈ ℓ_p implies ax ∈ ℓ_p for every a ∈ ℝ.
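The limit νp(x) → ν∞(x) of Example 6.7 is easy to observe numerically in MATLAB; the vector x below is an arbitrary choice.

x = [3 -1 4 1 -5 9 2 -6];
for p = [1 2 4 8 16 32 64]
    fprintf('p = %2d   nu_p(x) = %.6f\n', p, sum(abs(x).^p)^(1/p));
end
fprintf('nu_inf(x) = %.6f\n', max(abs(x)))     % the limit; also norm(x, Inf)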


The following statement shows that any norm defined on a linear space generates a metric on the space.

Theorem 6.6. Each norm ν : V → ℝ≥0 on a real linear space V generates a metric on the set V defined by dν(x, y) = ν(x − y) for x, y ∈ V.

Proof. Note that if dν(x, y) = ν(x − y) = 0, it follows that x − y = 0, that is, x = y. The symmetry of dν is obvious, so we need to verify only the triangular axiom. Let x, y, z ∈ V. Applying the subadditivity of norms, we have
$$\nu(x - z) = \nu(x - y + y - z) \leq \nu(x - y) + \nu(y - z),$$
or, equivalently, dν(x, z) ≤ dν(x, y) + dν(y, z) for every x, y, z ∈ V, which concludes the argument.

We refer to dν as the metric induced by the norm ν on the linear space V. Observe that the norm ν can be expressed using dν as
$$\nu(x) = d_\nu(x, 0) \tag{6.6}$$
for x ∈ V.
For p ≥ 1, dp denotes the metric dνp induced by the norm νp on the linear space ℝⁿ, known as the Minkowski metric on ℝⁿ. If p = 2, we have the Euclidean metric on ℝⁿ given by
$$d_2(x, y) = \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.$$
For p = 1, we have
$$d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|.$$
This metric is also known as the city-block metric. The norm ν∞ generates the metric d∞ given by
$$d_\infty(x, y) = \max\{|x_i - y_i| \mid 1 \leq i \leq n\},$$
also known as the Chebyshev metric.
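In MATLAB, the metrics d₁, d₂, and d∞ induced by these norms can be evaluated with the built-in norm function; the vectors below are arbitrary choices.

x = [1 -2 4];  y = [3 0 1];
d1   = norm(x - y, 1)        % city-block metric
d2   = norm(x - y, 2)        % Euclidean metric
dinf = norm(x - y, Inf)      % Chebyshev metric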

Fig. 6.1 The distances d₁(x, y) and d₂(x, y).

A representation of these metrics can be seen in Figure 6.1 for the special case of ℝ². If x = (x₀, x₁) and y = (y₀, y₁), then d₂(x, y) is the length of the hypotenuse of the right triangle and d₁(x, y) is the sum of the lengths of its two legs.

We can now reformulate the notion of bounded set introduced in Definition 6.3 for general metric spaces in terms of norms defined on linear spaces.

Theorem 6.7. Let ν be a norm on a linear space V. A subset U of V is bounded in the metric space (V, dν) if and only if there is b ∈ ℝ≥0 such that ν(u) ≤ b for u ∈ U.

Proof. Suppose that U is a bounded set in the sense of Definition 6.3, that is, there exists c ∈ ℝ≥0 such that sup{dν(x, y) | x, y ∈ U} = c. Let x₀ be a fixed element of U. Since ν(x) = dν(x, 0) ≤ dν(x, x₀) + dν(x₀, 0) = dν(x, x₀) + ν(x₀), it is immediate that ν(x) ≤ c + ν(x₀), so we can define b as b = c + ν(x₀).
Conversely, suppose that ν(u) ≤ b for u ∈ U. Then dν(x, y) ≤ dν(x, 0) + dν(0, y) ≤ 2b for every x, y ∈ U. Thus, U is bounded in the metric space (V, dν).

Theorem 6.8 (Projections on closed sets theorem). Let U be a closed subset of ℝⁿ such that U ≠ ∅ and let x₀ ∈ ℝⁿ − U. Then there exists x₁ ∈ U such that ‖x − x₀‖₂ ≥ ‖x₁ − x₀‖₂ for every x ∈ U.

Proof. Let d = inf{‖x − x₀‖₂ | x ∈ U} and let Uₙ = U ∩ B[x₀, d + 1/n]. Note that these sets form a descending sequence of bounded and closed sets U₁ ⊇ U₂ ⊇ ⋯ ⊇ Uₙ ⊇ ⋯. Since U₁ is


compact, we have ∩_{n≥1} Uₙ ≠ ∅. Let x₁ ∈ ∩_{n≥1} Uₙ. Since Uₙ ⊆ U for every n, it follows that x₁ ∈ U.
Note that ‖x₁ − x₀‖₂ ≤ d + 1/n for every n because x₁ ∈ Uₙ = U ∩ B[x₀, d + 1/n]. This implies ‖x₁ − x₀‖₂ ≤ d ≤ ‖x − x₀‖₂ for every x ∈ U.

Theorem 6.9 to follow allows us to compare the norms νp (and the metrics of the form dp) that were introduced on ℝⁿ. We begin with a preliminary result.

Lemma 6.3. Let a₁, …, aₙ be n positive numbers. If p and q are two positive numbers such that p ≤ q, then
$$(a_1^p + \cdots + a_n^p)^{\frac{1}{p}} \geq (a_1^q + \cdots + a_n^q)^{\frac{1}{q}}.$$

Proof. Let f : ℝ>0 → ℝ be the function defined by
$$f(r) = (a_1^r + \cdots + a_n^r)^{\frac{1}{r}}.$$
Since
$$\ln f(r) = \frac{\ln(a_1^r + \cdots + a_n^r)}{r},$$
it follows that
$$\frac{f'(r)}{f(r)} = -\frac{1}{r^2}\ln(a_1^r + \cdots + a_n^r) + \frac{1}{r}\cdot\frac{a_1^r \ln a_1 + \cdots + a_n^r \ln a_n}{a_1^r + \cdots + a_n^r}.$$
To prove that f′(r) ≤ 0, it suffices to show that
$$\frac{a_1^r \ln a_1 + \cdots + a_n^r \ln a_n}{a_1^r + \cdots + a_n^r} \leq \frac{\ln(a_1^r + \cdots + a_n^r)}{r}.$$
This last inequality is easily seen to be equivalent to
$$\sum_{i=1}^{n} \frac{a_i^r}{a_1^r + \cdots + a_n^r} \ln \frac{a_i^r}{a_1^r + \cdots + a_n^r} \leq 0,$$
which holds because
$$\frac{a_i^r}{a_1^r + \cdots + a_n^r} \leq 1$$
for 1 ≤ i ≤ n.


Theorem 6.9. Let p and q be two positive numbers such that p ≤ q. For every u ∈ ℝⁿ, we have ‖u‖_p ≥ ‖u‖_q.

Proof. This statement follows immediately from Lemma 6.3.

Corollary 6.3. Let p, q be two positive numbers such that p ≤ q. For every x, y ∈ ℝⁿ, we have dp(x, y) ≥ dq(x, y).

Proof. This statement follows immediately from Theorem 6.9.

Example 6.9. For p = 1 and q = 2, the inequality of Theorem 6.9 becomes
$$\sum_{i=1}^{n} |u_i| \geq \sqrt{\sum_{i=1}^{n} |u_i|^2},$$
which is equivalent to
$$\frac{\sum_{i=1}^{n} |u_i|}{n} \geq \frac{\sqrt{\sum_{i=1}^{n} |u_i|^2}}{n}.$$

Theorem 6.10. Let p ≥ 1. For every x ∈ ℝⁿ, we have
$$\|x\|_\infty \leq \|x\|_p \leq n^{\frac{1}{p}}\|x\|_\infty. \tag{6.7}$$

Proof. Starting from the definition of νp, we have
$$\|x\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{\frac{1}{p}} \leq n^{\frac{1}{p}} \max_{1\leq i\leq n} |x_i| = n^{\frac{1}{p}}\|x\|_\infty.$$
The first inequality is immediate.

Corollary 6.4. Let p and q be two numbers such that p, q ≥ 1. There exist two constants c, d ∈ ℝ>0 such that c‖x‖_q ≤ ‖x‖_p ≤ d‖x‖_q for x ∈ ℝⁿ.

Proof. Since ‖x‖_∞ ≤ ‖x‖_p and ‖x‖_q ≤ n‖x‖_∞, it follows that ‖x‖_q ≤ n‖x‖_p. Exchanging the roles of p and q, we have ‖x‖_p ≤ n‖x‖_q, so
$$\frac{1}{n}\|x\|_q \leq \|x\|_p \leq n\|x\|_q$$
for every x ∈ ℝⁿ.


Corollary 6.5. For every x, y ∈ ℝⁿ and p ≥ 1, we have d∞(x, y) ≤ dp(x, y) ≤ n d∞(x, y). Further, for p, q > 1, there exist c, d ∈ ℝ>0 such that c dq(x, y) ≤ dp(x, y) ≤ d dq(x, y) for x, y ∈ ℝⁿ.

Proof. This statement follows from Theorem 6.10 and Corollary 6.4.

Corollary 6.3 implies that if p ≤ q, then the closed sphere B_{d_p}[x, r] is included in the closed sphere B_{d_q}[x, r]. For example, we have B_{d_1}[0, 1] ⊆ B_{d_2}[0, 1] ⊆ B_{d_∞}[0, 1]. In Figures 6.2 (a)–(c), we represent the closed spheres B_{d_1}[0, 1], B_{d_2}[0, 1], and B_{d_∞}[0, 1].

Fig. 6.2 Spheres B_{d_p}(0, 1) for p = 1, 2, ∞.

A useful consequence of Theorem 6.1 is the following statement:

Theorem 6.11. Let x₁, …, x_m and y₁, …, y_m be 2m nonnegative numbers such that Σ_{i=1}^{m} xᵢ = Σ_{i=1}^{m} yᵢ = 1 and let p and q be two positive numbers such that 1/p + 1/q = 1. We have
$$\sum_{j=1}^{m} x_j^{\frac{1}{p}} y_j^{\frac{1}{q}} \leq 1.$$

Proof. The Hölder inequality applied to x₁^{1/p}, …, x_m^{1/p} and y₁^{1/q}, …, y_m^{1/q} yields the needed inequality
$$\sum_{j=1}^{m} x_j^{\frac{1}{p}} y_j^{\frac{1}{q}} \leq \left(\sum_{j=1}^{m} x_j\right)^{\frac{1}{p}} \left(\sum_{j=1}^{m} y_j\right)^{\frac{1}{q}} = 1.$$
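The inclusions between these closed spheres can be visualized in MATLAB by plotting the boundaries of the unit spheres of d₁, d₂, and d∞ in ℝ²; this is only a plotting sketch.

t = linspace(0, 2*pi, 400);
figure; hold on; axis equal
plot(cos(t), sin(t))                          % boundary of B_{d2}(0,1)
plot([1 0 -1 0 1], [0 1 0 -1 0])              % boundary of B_{d1}(0,1), a diamond
plot([1 1 -1 -1 1], [-1 1 1 -1 -1])           % boundary of B_{dinf}(0,1), a square
legend('d_2', 'd_1', 'd_\infty'); hold off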


Theorem 6.11 allows the formulation of a generalization of the Hölder inequality.

Theorem 6.12. Let A be an n×m matrix, A = (a_{ij}), having positive entries such that Σ_{j=1}^{m} a_{ij} = 1 for 1 ≤ i ≤ n. If p = (p₁, …, pₙ) is an n-tuple of positive numbers such that Σ_{i=1}^{n} pᵢ = 1, then
$$\sum_{j=1}^{m} \prod_{i=1}^{n} a_{ij}^{p_i} \leq 1.$$

Proof. The argument is by induction on n ≥ 2. The basis case, n = 2, follows immediately from Theorem 6.11 by choosing p = 1/p₁, q = 1/p₂, x_j = a_{1j}, and y_j = a_{2j} for 1 ≤ j ≤ m.
Suppose that the statement holds for n, let A be an (n + 1) × m matrix having positive entries such that Σ_{j=1}^{m} a_{ij} = 1 for 1 ≤ i ≤ n + 1, and let p = (p₁, …, pₙ, p_{n+1}) be such that p₁ + ⋯ + pₙ + p_{n+1} = 1. It is easy to see that
$$\sum_{j=1}^{m} \prod_{i=1}^{n+1} a_{ij}^{p_i} \leq \sum_{j=1}^{m} a_{1j}^{p_1} \cdots a_{n-1,j}^{p_{n-1}} (a_{nj} + a_{n+1,j})^{p_n + p_{n+1}}.$$
By applying the inductive hypothesis, we have $\sum_{j=1}^{m} \prod_{i=1}^{n+1} a_{ij}^{p_i} \leq 1$.

A more general form of Theorem 6.12 is given next.

Theorem 6.13. Let A be an n×m matrix, A = (a_{ij}), having positive entries. If p = (p₁, …, pₙ) is an n-tuple of positive numbers such that Σ_{i=1}^{n} pᵢ = 1, then
$$\sum_{j=1}^{m} \prod_{i=1}^{n} a_{ij}^{p_i} \leq \prod_{i=1}^{n} \left(\sum_{j=1}^{m} a_{ij}\right)^{p_i}.$$

Proof. Let B = (b_{ij}) be the matrix defined by
$$b_{ij} = \frac{a_{ij}}{\sum_{j=1}^{m} a_{ij}}$$
for 1 ≤ i ≤ n and 1 ≤ j ≤ m. Since Σ_{j=1}^{m} b_{ij} = 1, we can apply Theorem 6.12 to this matrix. Thus, we can write
$$\sum_{j=1}^{m} \prod_{i=1}^{n} b_{ij}^{p_i} = \sum_{j=1}^{m} \prod_{i=1}^{n} \left(\frac{a_{ij}}{\sum_{j=1}^{m} a_{ij}}\right)^{p_i} = \frac{\sum_{j=1}^{m} \prod_{i=1}^{n} a_{ij}^{p_i}}{\prod_{i=1}^{n} \left(\sum_{j=1}^{m} a_{ij}\right)^{p_i}} \leq 1.$$

We now give a generalization of Minkowski's inequality (Theorem 6.3). First, we need a preliminary result.

Lemma 6.4. If a₁, …, aₙ and b₁, …, bₙ are positive numbers and r < 0, then
$$\sum_{i=1}^{n} a_i^r b_i^{1-r} \geq \left(\sum_{i=1}^{n} a_i\right)^r \cdot \left(\sum_{i=1}^{n} b_i\right)^{1-r}.$$

Proof. Let c₁, …, cₙ, d₁, …, dₙ be 2n positive numbers such that Σ_{i=1}^{n} cᵢ = Σ_{i=1}^{n} dᵢ = 1. Inequality (6.1) applied to the numbers a = cᵢ^{1/p} and b = dᵢ^{1/q} yields
$$c_i^{\frac{1}{p}} d_i^{\frac{1}{q}} \geq \frac{c_i}{p} + \frac{d_i}{q}.$$
Summing these inequalities produces the inequality
$$\sum_{i=1}^{n} c_i^{\frac{1}{p}} d_i^{\frac{1}{q}} \geq 1,$$
or
$$\sum_{i=1}^{n} c_i^{r} d_i^{1-r} \geq 1,$$
where r = 1/p < 0. Choosing cᵢ = aᵢ / Σ_{i=1}^{n} aᵢ and dᵢ = bᵢ / Σ_{i=1}^{n} bᵢ, we obtain the desired inequality.


Theorem 6.14. Let A be an n×m matrix, A = (a_{ij}), having positive entries, and let p and q be two numbers such that p > q and p ≠ 0, q ≠ 0. We have
$$\left(\sum_{j=1}^{m}\left(\sum_{i=1}^{n} a_{ij}^{p}\right)^{\frac{q}{p}}\right)^{\frac{1}{q}} \geq \left(\sum_{i=1}^{n}\left(\sum_{j=1}^{m} a_{ij}^{q}\right)^{\frac{p}{q}}\right)^{\frac{1}{p}}.$$

Proof. Define
$$E = \left(\sum_{j=1}^{m}\left(\sum_{i=1}^{n} a_{ij}^{p}\right)^{\frac{q}{p}}\right)^{\frac{1}{q}}, \qquad F = \left(\sum_{i=1}^{n}\left(\sum_{j=1}^{m} a_{ij}^{q}\right)^{\frac{p}{q}}\right)^{\frac{1}{p}},$$
and uᵢ = Σ_{j=1}^{m} a_{ij}^q for 1 ≤ i ≤ n. There are three distinct cases to consider, related to the position of 0 relative to p and q.
Suppose initially that p > q > 0. We have
$$F^p = \sum_{i=1}^{n} u_i^{\frac{p}{q}} = \sum_{i=1}^{n} u_i\, u_i^{\frac{p}{q}-1} = \sum_{i=1}^{n}\sum_{j=1}^{m} a_{ij}^{q}\, u_i^{\frac{p}{q}-1} = \sum_{j=1}^{m}\sum_{i=1}^{n} a_{ij}^{q}\, u_i^{\frac{p}{q}-1}.$$
By applying the Hölder inequality, we have
$$\sum_{i=1}^{n} a_{ij}^{q}\, u_i^{\frac{p}{q}-1} \leq \left(\sum_{i=1}^{n} (a_{ij}^{q})^{\frac{p}{q}}\right)^{\frac{q}{p}} \cdot \left(\sum_{i=1}^{n} \left(u_i^{\frac{p}{q}-1}\right)^{\frac{p}{p-q}}\right)^{1-\frac{q}{p}} = \left(\sum_{i=1}^{n} a_{ij}^{p}\right)^{\frac{q}{p}} \cdot \left(\sum_{i=1}^{n} u_i^{\frac{p}{q}}\right)^{1-\frac{q}{p}}, \tag{6.8}$$
which implies F^p ≤ E^q F^{p−q}. This, in turn, gives F^q ≤ E^q, which implies the generalized Minkowski inequality.


Suppose now that 0 > p > q, so 0 < −p < −q. Applying the generalized Minkowski inequality (already established for the positive exponents −q > −p > 0) to the positive numbers b_{ij} = 1/a_{ij} gives the inequality
$$\left(\sum_{i=1}^{n}\left(\sum_{j=1}^{m} b_{ij}^{-q}\right)^{\frac{p}{q}}\right)^{-\frac{1}{p}} \geq \left(\sum_{j=1}^{m}\left(\sum_{i=1}^{n} b_{ij}^{-p}\right)^{\frac{q}{p}}\right)^{-\frac{1}{q}},$$
which is equivalent to
$$\left(\sum_{i=1}^{n}\left(\sum_{j=1}^{m} a_{ij}^{q}\right)^{\frac{p}{q}}\right)^{-\frac{1}{p}} \geq \left(\sum_{j=1}^{m}\left(\sum_{i=1}^{n} a_{ij}^{p}\right)^{\frac{q}{p}}\right)^{-\frac{1}{q}}.$$
A last transformation (taking reciprocals of both positive sides) gives
$$\left(\sum_{j=1}^{m}\left(\sum_{i=1}^{n} a_{ij}^{p}\right)^{\frac{q}{p}}\right)^{\frac{1}{q}} \geq \left(\sum_{i=1}^{n}\left(\sum_{j=1}^{m} a_{ij}^{q}\right)^{\frac{p}{q}}\right)^{\frac{1}{p}},$$
which is the inequality to be proven.
Finally, suppose that p > 0 > q. Since p/q < 0, Inequality (6.8) is replaced by the opposite inequality through the application of Lemma 6.4:
$$\sum_{i=1}^{n} a_{ij}^{q}\, u_i^{\frac{p}{q}-1} \geq \left(\sum_{i=1}^{n} a_{ij}^{p}\right)^{\frac{q}{p}} \cdot \left(\sum_{i=1}^{n} u_i^{\frac{p}{q}}\right)^{1-\frac{q}{p}}.$$
This leads to F^p ≥ E^q F^{p−q}, or F^q ≥ E^q. Since q < 0, this implies F ≤ E.

6.5 The Topology of Normed Linear Spaces

Every norm ν defined on a linear space V generates a metric d : V 2 −→ R0 given by d(x, y) = ν(x − y). Therefore, any normed space can be equipped with the topology of a metric space, using the metric defined by the norm. Since this topology is induced by a


metric, any normed space is a Hausdorff space. Further, if v ∈ V, then the collection of subsets {B_d(v, r) | r > 0} is a fundamental system of neighborhoods for v.
By specializing the definition of local continuity of functions between metric spaces, a function f : V → W between two normed spaces (V, ν) and (W, ν′) is continuous in x₀ ∈ V if for every ε > 0 there exists δ > 0 such that ν(x − x₀) < δ implies ν′(f(x) − f(x₀)) < ε. A sequence (x₀, x₁, …) of elements of V converges to x if for every ε > 0 there exists n_ε ∈ ℕ such that n ≥ n_ε implies ν(x_n − x) < ε.

Theorem 6.15. In a normed linear space (V, ν), the norm, the multiplication by scalars, and the vector addition are continuous functions.

Proof. By Theorem 6.4, we have ν(x − y) ≥ |ν(x) − ν(y)| for every x, y ∈ V. Therefore, if lim_{n→∞} x_n = x, we have ν(x_n − x) ≥ |ν(x_n) − ν(x)|, which implies lim_{n→∞} ν(x_n) = ν(x). Thus, the norm is continuous.
Suppose now that lim_{n→∞} a_n = a and lim_{n→∞} x_n = x, where (a_n) is a sequence of scalars. Since the convergent sequence (a_n) is bounded, we have
$$\nu(ax - a_n x_n) \leq \nu(ax - a_n x) + \nu(a_n x - a_n x_n) \leq |a - a_n|\nu(x) + |a_n|\nu(x - x_n),$$
which implies that lim_{n→∞} a_n x_n = ax. This shows that the multiplication by scalars is a continuous function.
To prove that the vector addition is continuous, let (x_n) and (y_n) be two sequences in V such that lim_{n→∞} x_n = x and lim_{n→∞} y_n = y. Note that
$$\nu((x + y) - (x_n + y_n)) \leq \nu(x - x_n) + \nu(y - y_n),$$
which implies that lim_{n→∞} (x_n + y_n) = x + y. Thus, the vector addition is continuous.

Definition 6.9. Two norms ν and ν′ on a linear space V are equivalent if they generate the same topology.


Theorem 6.16. Let V be a linear space and let ν : V → ℝ≥0 and ν′ : V → ℝ≥0 be two norms on V that generate the topologies O and O′ on V, respectively. The topology O′ is finer than the topology O (that is, O ⊆ O′) if and only if there exists c ∈ ℝ>0 such that ν(v) ≤ cν′(v) for every v ∈ V.

Proof. Suppose that O ⊆ O′. Then any open sphere Bν(0, r₀) = {x ∈ V | ν(x) < r₀} (in O) must be an open set in O′. Therefore, there exists an open sphere Bν′(0, r₁) such that Bν′(0, r₁) ⊆ Bν(0, r₀). This means that for r₀ ∈ ℝ≥0 there exists r₁ ∈ ℝ≥0 such that ν′(v) < r₁ implies ν(v) < r₀ for every v ∈ V. In particular, for r₀ = 1, there is k > 0 such that ν′(v) < k implies ν(v) < 1, which is equivalent to: cν′(v) < 1 implies ν(v) < 1, for every v ∈ V and c = 1/k.
For w = v/((c + ε)ν′(v)), where ε > 0, it follows that
$$c\nu'(w) = c\,\nu'\!\left(\frac{v}{(c+\varepsilon)\nu'(v)}\right) = \frac{c}{c+\varepsilon} < 1,$$
so
$$\nu(w) = \nu\!\left(\frac{v}{(c+\varepsilon)\nu'(v)}\right) = \frac{1}{(c+\varepsilon)\nu'(v)}\,\nu(v) < 1.$$
Since this inequality holds for every ε > 0, it follows that ν(v) ≤ cν′(v).
Conversely, suppose that there exists c ∈ ℝ>0 such that ν(v) ≤ cν′(v) for every v ∈ V. Since
$$\left\{v \;\middle|\; \nu'(v) \leq \frac{r}{c}\right\} \subseteq \{v \mid \nu(v) \leq r\}$$
for v ∈ V and r > 0, it follows that O ⊆ O′.

Corollary 6.6. Let V be a linear space and let ν : V → ℝ≥0 and ν′ : V → ℝ≥0 be two norms on V. Then ν and ν′ are equivalent norms if and only if there exist a, b ∈ ℝ>0 such that aν(v) ≤ ν′(v) ≤ bν(v) for v ∈ V.

Proof. This statement follows directly from Theorem 6.16.

Example 6.10. By Corollary 6.4, any two norms νp and νq on ℝⁿ (with p, q ≥ 1) are equivalent.

Continuous linear operators between normed spaces have a simple characterization.

Theorem 6.17. Let (V, ν) and (V′, ν′) be two normed F-linear spaces, where F is either ℝ or ℂ. A linear operator f : V → V′ is continuous if and only if there exists M ∈ ℝ>0 such that ν′(f(x)) ≤ Mν(x) for every x ∈ V.

Proof. Suppose that f : V → V′ satisfies the condition of the theorem. Then
$$f\left(B_\nu\left(0, \frac{r}{M}\right)\right) \subseteq B_{\nu'}(0, r)$$
for every r > 0, which means that f is continuous in 0 and, therefore, it is continuous everywhere (by Theorem 2.51).
Conversely, suppose that f is continuous. Then there exists δ > 0 such that f(Bν(0, δ)) ⊆ Bν′(f(0), 1), which is equivalent to: ν(x) < δ implies ν′(f(x)) < 1. Let ε > 0 and let z ∈ V be defined by
$$z = \frac{\delta}{\nu(x) + \varepsilon}\, x.$$
We have ν(z) = δν(x)/(ν(x) + ε) < δ. This implies ν′(f(z)) < 1, which is equivalent to
$$\frac{\delta}{\nu(x) + \varepsilon}\,\nu'(f(x)) < 1$$
because of the linearity of f. This means that ν′(f(x)) < (ν(x) + ε)/δ for every ε > 0, so ν′(f(x)) ≤ (1/δ)ν(x).


Lemma 6.5. Let (V, ν) and (V  , ν  ) be two normed F-linear spaces where F is either R or C. A linear function f : V −→ V  is not injective if and only if there exists u ∈ V − {0V } such that f (u) = 0V  .

358

Linear Algebra Tools for Data Mining (Second Edition)

Proof. It is clear that the condition of the lemma is sufficient for failing injectivity. Conversely, suppose that f is not injective. There exist t, v ∈ V such that t = v and f (t) = f (v). The linearity of f implies f (t − v) = 0V  . By defining u = t − v = 0V , we have the desired element u.  Theorem 6.18. Let (V, ν) and (V  , ν  ) be two normed F-linear spaces where F is either R or C. A linear function f : V −→ V  is injective if and only if there exists m ∈ R>0 such that ν  (f (x))  mν(x) for every x ∈ V . Proof. Suppose that f is not injective. By Lemma 6.5, there exists u ∈ V − {0V } such that f (u) = 0V  , so ν  (f (u)) < mν(u) for any m > 0. Thus, the condition of the theorem is sufficient for injectivity. Suppose that f is injective, so the inverse function f −1 : V  −→ V is a linear function. By Theorem 6.17, there exists M > 0 such that ν(f −1 (y))  M ν  (y) for every y ∈ V  . Choosing y = f (x) yields ν(x)  M ν  (f (x), so 1 , which concludes the argument.  ν  (f (x))  mν(x) for m = M Corollary 6.7. Every linear function f : Cm −→ Cn is continuous. Proof. Suppose that both Cm and Cn are equipped with the norm ν1 . If x ∈ Cm , we can write x = x1 e1 + · · · + xm xm and the linearity of f implies   ν1 (f (x)) = ν1

f

m 

 xi ei

 = ν1

i=1



m 

|xi |ν1 (f (ei ))  M

i=1

where M = follows.

m

i=1 ν1 (f (ei )).

m 

 xi f (ei )

i=1 m 

|xi | = M ν1 (x),

i=1

By Theorem 6.17, the continuity of f



Next, we introduce a norm on the linear space Hom(Cm , Cn ) of linear functions from Cm to Cn . Recall that if f : Cm −→ Cn is a linear function and ν, ν  are norms on Cm and Cn , respectively, then

Norms and Inner Products

359

there exists a non-negative constant m such that ν  (f (x))  M ν(x) for every x ∈ Cm . Define the norm of f , μ(f ), as μ(f ) = inf{M ∈ R0 | ν  (f (x))  M ν(x) for every x ∈ Cm }. (6.9) Theorem 6.19. The mapping μ defined by Equality (6.9) is a norm on the linear space of linear functions Hom(Cm , Cn ). Proof. Let f, g be two functions in Hom(Cm , Cn ). There exist Mf and Mg in R0 such that ν  (f (x))  Mf ν(x) and ν  (g(x))  Mg ν(x) for every x ∈ V . Thus, ν  ((f + g)(x)) = ν  (f (x) + g(x))  ν  (f (x)) + ν  (g(x))  (Mf + Mg )ν(x),

so Mf + Mg ∈ {M ∈ R0 | ν  ((f + g)(x))  M ν(x) for every x ∈ V }. Therefore, μ(f + g)  μ(f ) + μ(g). We leave to the reader the verification of the remaining norm prop erties of μ. Since the norm μ defined by Equality (6.9) depends on the norms ν and ν  , we denote it by N (ν, ν  ). Theorem 6.20. Let f : Cm −→ Cn and g : Cn −→ Cp and let μ = N (ν, ν  ), μ = N (ν  , ν  ) and μ = N (μ, μ ), where ν, ν  , ν  are norms on Cm , Cn , and Cp , respectively. We have μ (gf )  μ(f )μ (g). Proof.

Let x ∈ Cm . We have ν  (f (x))  (μ(f ) +  )ν(x)

for every  > 0. Similarly, for y ∈ Cn , ν  (g(y))  (μ (g) +  )ν  (y) for every  > 0. These inequalities imply ν  (g(f (x))  (μ (g) +  )ν  (f (x))  (μ (g) +  )μ(f ) +  )ν(x). Thus, we have μ (gf )  (μ (g) +  )μ(f ) +  ) for every  and  . This allows us to conclude that μ (f g)   μ(f )μ (g).

360

Linear Algebra Tools for Data Mining (Second Edition)

Equivalent definitions of the norm μ = N (ν, ν  ) are given next. Theorem 6.21. Let f : Cm −→ Cn and let ν and ν  be two norms defined on Cm and Cn , respectively. If μ = N (ν, ν  ), we have (i) μ(f ) = inf{M ∈ R0 | ν  (f (x))  M ν(x) for every x ∈ Cm }; (ii) μ(f ) = sup{ν  (f (x)) | ν(x)  1}; (iii) μ(f ) = max{ν  (f (x)) | ν(x)  1};  (iv) μ(f ) = max{ν   (f (x)) | ν(x) = 1};  ν (f (x)) m − {0 } . (v) μ(f ) = sup | x ∈ C m ν(x) Proof. The first equality is the definition of μ(f ). Let  be a positive number. By the definition of the infimum, there exists M such that ν  (f (x))  M ν(x) for every x ∈ Cm and M  μ(f )+. Thus, for any x such that ν(x)  1, we have ν  (f (x))  M  μ(f ) + . Since this inequality holds for every , it follows that ν  (f (x))  μ(f ) for every x ∈ Cm with ν(x)  1. Furthermore, if  is a positive number, we claim that there exists x0 ∈ Cm such that ν(x0 )  1 and μ(f ) −   ν  (f (x0 ))  μ(f ). Suppose that this is not the case. Then, for every z ∈Cn with  ν(z)  1   n 1, we have ν (f (z))  μ(f ) −  . If x ∈ C , then ν ν(x) x = 1, so ν  (f (x))  (μ(f ) −  )ν(x), which contradicts the definition of μ(f ). This allows us to conclude that μ(f ) = sup{ν  (f (x)) | ν(x)  1}, which proves the second equality. Observe that the third equality (where we replaced sup by max) holds because the closed sphere B(0, 1) is a compact set in Rn . Thus, we have μ(A) = max{ν  (f (x)) | ν(x)  1}.

(6.10)

For the fourth equality, since {x | ν(x) = 1} ⊆ {x | ν(x)  1}, it follows that max{ν  (f (x)) | ν(x) = 1}  max{ν  (f (x)) | ν(x)  1} = μ(f ). By the third equality, there exists a vector z ∈ Rn − {0} such that ν(z)  1 and ν  (f (z)) = μ(f ). Thus, we have       z z ν f . μ(f ) = ν(z)ν f ν(z) ν(z)

Norms and Inner Products

361

  z Since ν ν(z) = 1, it follows that μ(A)  max{ν  (f (x)) | ν(x) = 1}. This yields the desired conclusion. Finally, to prove the last equality observe that for every x ∈ 1 1 m C − {0m }, ν(x) x is a unit vector. Thus, ν  (f ( ν(x) x)  μ(f ), by the fourth equality. On the other hand, by the third equality, there exists x0 such that ν(x0 ) = 1 and ν  (f (x0 )) = μ(f ). This concludes the argument.  6.6

Norms for Matrices

In Chapter 3, we saw that the set Cm×n is a linear space. Therefore, it is natural to consider norms defined on matrices. We discuss two basic methods for defining norms for matrices. The first approach treats matrices as vectors (through the vec mapping). The second regards matrices as representations of linear operators, and defines norms for matrices starting from operator norms. The vectorization mapping vec was introduced in Definition 3.16. Its use allows us to treat a matrix A ∈ Cm×n as a vector from Cmn . Using vector norms on Cmn , we can define vectorial norms of matrices. Definition 6.10. Let ν be a vector norm on the space Rmn . The vectorial matrix norm μ(m,n) on Rm×n is the mapping μ(m,n) : Rm×n −→ R0 defined by μ(m,n) (A) = ν(vec(A)) for A ∈ Rm×n . Vectorial norms of matrices are defined without regard for matrix products. The link between linear transformations of finite-dimensional linear spaces and Theorem 6.20 suggests the introduction of an additional condition. Since every matrix A ∈ Cm×n corresponds to a linear transformation hA : Cm −→ Cn , if ν and ν  are norms on Cm and Cn , respectively, it is natural to define a norm on Cm×n as μ(A) = μ(hA ), where μ = N (ν, ν  ) is a norm on the space of linear transformations between Cm and Cn . Suppose that ν, ν  , and ν  are vector norms defined on Cm , Cn , and Cp , respectively. By Theorem 6.20, μ (gf )  μ(f )μ (g), where μ = N (ν, ν  ), μ = N (ν  , ν  ), and μ = N (μ, μ ), so μ (AB)  μ(A)μ (B). This suggests the following definition.

Linear Algebra Tools for Data Mining (Second Edition)

362

Definition 6.11. A consistent family of matrix norms is a family of functions μ(m,n) : Cm×n −→ R0 , where m, n ∈ P, that satisfies the following conditions: (i) μ(m,n) (A) = 0 if and only if A = Om,n ; (ii) μ(m,n) (A+B)  μ(m,n) (A)+μ(m,n) (B) (the subadditivity property); (iii) μ(m,n) (aA) = |a|μ(m,n) (A); (iv) μ(m,p) (AB)  μ(m,n) (A)μ(n,p) (B) for every matrix A ∈ Rm×n and B ∈ Rn×p (the submultiplicative property). If the format of the matrix A is clear from context or is irrelevant, then we shall write μ(A) instead of μ(m,n) (A). Example 6.11. Let P ∈ Cn×n be an idempotent matrix. If μ is a matrix norm, then either μ(P ) = 0 or μ(P )  1. Indeed, since P is idempotent, we have μ(P ) = μ(P 2 ). By the submultiplicative property, μ(P 2 )  (μ(P ))2 , so μ(P )  (μ(P ))2 . Consequently, if μ(P ) = 0, then μ(P )  1. Some vectorial matrix norms turn out to be actual matrix norms; others fail to be matrix norms. This point is illustrated by the next two examples. by Example 6.12. Consider the vectorial  matrix  norm μ1 induced m×n . |a | for A ∈ R the vector norm ν1 . We have μ1 (A) = ni=1 m ij j=1 Actually, this is a matrix norm. To prove this fact, consider the matrices A ∈ Rm×p and B ∈ Rp×n . We have  p  p n  n  m  m       aik bkj   |aik bkj | μ1 (AB) =    

i=1 j=1 k=1 p m  n  

p 

k  =1

k  =1

i=1 j=1

i=1 j=1 k=1

|aik ||bk j |

(because we added extra non-negative terms to the sums) ⎞  ⎛ n p m p   |aik | · ⎝ |bk j |⎠ = i=1 k  =1

= μ1 (A)μ1 (B).

j=1 k  =1

Norms and Inner Products

363

We denote this vectorial matrix norm by the same notation as the corresponding vector norm, that is, by A1 . The vectorial matrix norm μ2 induced by the vector norm ν2 is also a matrix norm. Indeed, using the notations as above, we have  p 2 m  n      aik bkj  (μ2 (AB)) =    i=1 j=1 k=1  p  p  n m     2 2  |aik | |blj | 2

i=1 j=1

k=1

l=1

(by Cauchy–Schwarz inequality)  (μ2 (A))2 (μ2 (B))2 . The vectorial norm of A ∈ Cm×n , ⎛ ⎞1 2 n  m  2⎠ ⎝ |aij | , μ2 (A) = i=1 j=1

denoted also by AF , is known as the Frobenius norm. For A ∈ Rm×n , we have

 m

n  a2ij . AF = i=1 j=1

It is easy to see that for real matrices we have A2F = trace(AA ) = trace(A A).

(6.11)

For complex matrices, the corresponding equality is A2F = trace(AAH ) = trace(AH A). Note that AH 2F = A2F for every A.

(6.12)

364

Linear Algebra Tools for Data Mining (Second Edition)

Example 6.13. The vectorial norm μ∞ induced by the vector norm ν∞ is denoted by A∞ and is given by A∞ = max |aij | i,j

for A ∈ Cn×n . This is not a matrix norm. Indeed, let a, b be two positive numbers and consider the matrices  A=

a a a a



 and B =

 b b . b b

We have A∞ = a and B∞ = b. However, since  AB =

 2ab 2ab , 2ab 2ab

we have AB∞ = 2ab and the submultiplicative property of matrix norms is violated. A technique that always produces matrix norms starting from vector norms is introduced in the next theorem. Definition 6.12. Let νm be a norm on Cm and νn be a norm on Cn and let A ∈ Cn×m be a matrix. The operator norm of A is the number μ(n,m) (A) = μ(n,m) (hA ), where μ(n,m) = N (νm , νn ). Theorem 6.22. Let {νn | n  1} be a family of vector norms, where νn is a vector norm on Cn . The family of norms {μ(n,m) | n, m  1} is consistent. Proof. It is easy to see that the family of norms {μ(n,m) | n, m  1} satisfies the first three conditions of Definition 6.11 because the corresponding operator norms satisfy similar conditions. For example, if μ(n,m) (A) = 0 for A ∈ Cn×m , this means that μ(n,m) (hA ) = 0, so νn (Ax) = 0 for every x ∈ Cm , such that νm (x)  1. This implies Ax = 0n for every x ∈ Rm , which, in turn, implies A = On,m . Since μ(Om,n ) = 0, the first condition is satisfied.

Norms and Inner Products

365

For the fourth condition of Definition 6.11 and A ∈ Cn×m and B ∈ Cm×p , we have μ(n,p) (AB) = sup{νn ((AB)x) | νp (x)  1} = sup{νn (A(Bx)) | νp (x)  1}      Bx  νm (Bx)νp (x)  1 = sup νn A νm (Bx)    μ(n,m) (A) sup{νm (Bx)νp (x)  1}   Bx (because νm ν(Bx) = 1) = μn,m (A)μm,p (B).



Theorem 6.21 implies the following equivalent definitions of μ(n,m) (A). Theorem 6.23. Let νn be a norm on Cn for n  1. The following equalities hold for μ(n,m) (A), where A ∈ C(n,m) : μ(n,m) (A) = inf{M ∈ R0 | νn (Ax)  M νm (x) for every x ∈ Cm } = sup{νn (Ax) | νm (x)  1} = max{νn (Ax) | νm (x)  1} = max{ν  (f (x)) | ν(x) = 1}    ν (f (x)) m | x ∈ C − {0m } . = sup ν(x) Proof.

The theorem is simply a reformulation of Theorem 6.21. 

Corollary 6.8. Let μ be the matrix norm on Cn×n induced by the vector norm ν. We have ν(Au)  μ(A)ν(u) for every u ∈ Cn . Proof. The inequality is obviously satisfied when u = 0n . There1 u. Clearly, fore, we may assume that u = 0n and let x = ν(u) ν(x) = 1 and Equality (6.10) implies that   1 u  μ(A) ν A ν(u) for every u ∈ Cn − {0n }. This implies immediately the desired  inequality.

366

Linear Algebra Tools for Data Mining (Second Edition)

If μ is a matrix norm induced by a vector norm on Rn , then μ(In ) = sup{ν(In x) | ν(x)  1} = 1. This necessary condition can be used for identifying matrix norms that are not induced by vector norms. The operator matrix norm induced by the vector norm  · p is denoted by ||| · |||p . Example 6.14. To compute |||A|||1 = sup{Ax1 | x1  1}, where A ∈ Rn×n , suppose that the columns of A are the vectors a1 , . . . , an , that is ⎛ ⎞ a1j ⎜ a2j ⎟ ⎜ ⎟ aj = ⎜ .. ⎟. ⎝ . ⎠ anj Let x ∈ Rn be a vector whose components are x1 , . . . , xn . Then, Ax = x1 a1 + · · · + xn an , so Ax1 = x1 a1 + · · · + xn an 1 n   |xj |aj 1 j=1

 max aj 1 j

n 

|xj |

j=1

= max aj 1 · x1 . j

Thus, |||A|||1  maxj aj 1 . Let ej be the vector whose components are 0 with the exception of its jth component that is equal to 1. Clearly, we have ej 1 = 1 and aj = Aej . This, in turn implies aj 1 = Aej 1  |||A|||1 for 1  j  n. Therefore, maxj aj 1  |||A|||1 , so |||A|||1 = max aj 1 = max j

j

n 

|aij |.

i=1

In other words, |||A|||1 equals the maximum column sum of the absolute values.
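This characterization agrees with MATLAB's built-in norm(A, 1); the matrix A below is an arbitrary choice used only as a sketch.

A = [1 -7 2; 3 0 -4; -2 5 6];
n1_builtin = norm(A, 1)             % operator norm induced by the vector 1-norm
n1_colsum  = max(sum(abs(A), 1))    % maximum column sum of absolute values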

Norms and Inner Products

367

Example 6.15. Consider now a matrix A ∈ Rn×n . We have    n    aij xj  Ax∞ = max  1in   j=1  max

1in

n 

|aij xj |

j=1

 max x∞ 1in

n 

|aij |.

j=1

 Consequently, if x∞  1, we have Ax∞  max1in nj=1 |aij |.  Thus, |||A|||∞  max1in nj=1 |aij |. The converse inequality is immediate if A = On,n . Therefore, assume that A = On×n , and let (ap1 , . . . , apn ) be any row of A that has at least one element distinct from 0. Define the vector z ∈ Rn by  |a | pj if apj = 0, zj = apj 1 otherwise for 1  j  n. It is clear that zj ∈ {−1, 1} for every j, 1  j  n and, therefore, z∞ = 1. Moreover, we have |apj | = apj zj for 1  j  n. Therefore, we can write    n  n n       |apj | = apj zj   apj zj   j=1  j=1 j=1      n   aij zj   max  1in   j=1 = Az∞  max{Ax∞ | x∞  1} = |||A|||∞ . Since n this holds for every row of A, it follows that max1in j=1 |aij |  |||A|||∞ , which proves that |||A|||∞ = max

1in

n 

|aij |.

j=1

In other words, |||A|||∞ equals the maximum row sum of the absolute values.

Linear Algebra Tools for Data Mining (Second Edition)

368

Example 6.16. Let D = diag(d1 , . . . , dn ) ∈ Cn×n be a diagonal matrix. If x ∈ Cn , we have ⎞ ⎛ d1 x1 ⎟ ⎜ Dx = ⎝ ... ⎠, dn xn so |||D|||2 = max{Dx2 | x2 = 1}  = max{ (d1 x1 )2 + · · · + (dn xn )2 | x21 + · · · + x2n = 1} = max{|di | | 1  1  n}. The next result shows that certain norms are invariant with respect to multiplication by unitary matrices. We refer to these norms as unitarily invariant norms. Theorem 6.24. Let U ∈ Cn×n be a unitary matrix. The following statements hold: (i) U x2 = x2 for every x ∈ Cn ; (ii) |||U A|||2 = |||A|||2 for every A ∈ Cn×p ; (iii) U AF = AF for every A ∈ Cn×p . Proof.

For the first part of the theorem, note that U x22 = (U x)H U x = xH U H U x = xH x = x22 ,

because U H A = In . The second part of the theorem is as follows: |||U A|||2 = max{(U A)x2 | x2 = 1} = max{U (Ax)2 | x2 = 1} = max{Ax2 | x2 = 1} (by Part (i)) = |||A|||2 . For the Frobenius norm, note that   U AF = trace((U A)H U A) = trace(AH U H U A)  = trace(AH A) = AF , by Equality (6.11).



Norms and Inner Products

369

Corollary 6.9. If U ∈ Cn×n is a unitary matrix, then |||U |||2 = 1. Proof. Since |||U |||2 = sup{U x2 | x2  1}, by Part (ii) of Theorem 6.24, |||U |||2 = sup{x2 | x2  1} = 1.



Corollary 6.10. Let A, U ∈ Cn×n . If U is a unitary matrix, then U H AU F = AF . Proof. Since U is a unitary matrix, so is U H . By Part (iii) of Theorem 6.24, U H AU F = AU F = U H AH 2F = AH 2F = A2F , 

which proves the corollary.

Example 6.17. Let S = {x ∈ Rn | x2 = 1} be the surface of the sphere in Rn . The image of S under the linear transformation hU that corresponds to the unitary matrix U is S itself. Indeed, by Theorem 6.24, hU (x)2 = x2 = 1, so hU (x) ∈ S for every x ∈ S. Also, note that hU restricted to S is a bijection because hU H (hU (x)) = x for every x ∈ Rn . More details on transformations of S are given in Supplement 19 of Chapter 9. Theorem 6.25. Let A ∈ Rn×n . We have |||A|||2  AF . Proof.

Let x ∈

Rn .

We have

⎞ r1 x ⎟ ⎜ Ax = ⎝ ... ⎠, ⎛

rn x where r 1 , . . . , r n are the rows of the matrix A. Thus, n 2 Ax2 i=1 (r i x) = . x2 x2 By Cauchy–Schwarz inequality, we have (r i x)2  r i 22 x22 , so

n  Ax2

 r i 22 = AF . x2 i=1

This implies |||A|||2  AF .



370

Linear Algebra Tools for Data Mining (Second Edition)

We shall prove in Chapter 9 (in Corollary 9.4) that for every A ∈ Rn×n , we have √ AF  n|||A|||2 . (6.13)

6.7

Matrix Sequences and Matrix Series

In this section, we make use of the notion of series of complex numbers and we extend this concept ∞ to series of matrices. Recall (see [7], for example) that a series i=1 ai converges absolutely if the series ∞ |a | converges. Absolute convergence implies convergence. The i=1 i terms of an absolutely convergent series can be rearranged by altering its sum. The set of matrices Cm×p is a C-linear space, and the set of matrices Rm×p is an R-linear space. Using matrix norms, these spaces can be equipped with a topological structure, as we indicated above. We focus now on the normed linear space (Rp×p , ||| · |||), where ||| · |||) is a matrix norm. Let A ∈ Rp×p . We can show by induction on n that |||An |||  (|||A|||)n .

(6.14)

The base step, n = 0, is immediate. Suppose that the inequality holds for n. We have |||An+1 ||| = |||An A|||  |||An ||||||A||| (because ||| · ||| is a matrix norm)  (|||A|||)n |||A||| (by the inductive hypothesis) = (|||A|||)n+1 , which concludes our argument. If |||A||| < 1, the sequence of matrices (A, A2 , . . . , An , . . .) converges toward the zero matrix Op,p . Indeed, limn→∞ |||An − Op,p ||| = limn→∞ |||An |||  limn→∞ (|||A|||)n = 0, which shows that limn→∞ An = Op,p .

Norms and Inner Products

371

Definition 6.13. Let A = (A0 , A1 , . . . , An , . . .) be a sequence of matrices in Rp×p . A matrix series having A as its sequenceof terms is the sequence of matrices (S0 , S1 , . . . , Sn , . . .), where Si = ik=0 Ak . The series (S0 , S1 , . . . , Sn , . . .) is denoted also by A0 + A1 + · · · + An + · · · . We say that the series A0 + A1 + · · · + An + · · · converges to a matrix S if limn→∞ Sn = S. This is also denoted by A0 + A1 + · · · + An + · · · = S. The series A0 + A1 + · · · + An + · · · converges absolutely if each of the series (A0 )ij + (A1 )ij + · · · + (An )ij + · · · converges absolutely for 1  i, j  n. The subadditivity property of the norm can be generalized to a series of matrices. Namely, if the series A0 + A1 + · · · + An + · · · converges to S, then ∞  |||Ai |||. |||S|||  i=0

Indeed, by the usual subadditivity property, n  |||Ai |||. |||A0 + A1 + · · · + An |||  i=0

This implies |||A0 + A1 + · · · + An |||  for every n ∈ N, so |||S|||  matrix norm.

∞

i=0 |||Ai |||,

∞ 

|||Ai |||,

i=0

due to the continuity of the

Example 6.18. Let A ∈ Rp×p be a matrix such that |||A||| < 1. We claim that the matrix I −A is invertible and A0 +A1 +· · ·+An +· · · = (I − A)−1 . Suppose that I−A is not invertible. Then, the system (I−A)x = 0 has a non-trivial solution. This implies x = Ax, so |||A|||  1, which contradicts the hypothesis. Thus, I − A is an invertible matrix. Observe that (A0 + A1 + · · · + An )(I − A)−1 = I − An+1 .

Linear Algebra Tools for Data Mining (Second Edition)

372

Therefore,   lim (A0 + A1 + · · · + An ) (I − A)−1 = lim (I − An+1 ) = I, n→∞

n→∞

since limn→∞ An+1 = O. This shows that the series A0 + A1 + · · · + An + · · · converges to the inverse of the matrix I − A, so (I − A)−1 =

∞ 

Ai .

i=0

Moreover, we have −1

|||(I − A)

 ∞  ∞      i  ||| =  A   |||Ai |||   i=0

i=0

∞  (|||A|||)i = = i=0

1 , 1 − |||A|||

because |||A|||  1. Example 6.19. Let (x0 , x1 , . . . , xn , . . .) be a sequence of vectors defined inductively by xn+1 = Axn + b
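The series of Example 6.18 can be checked numerically: for a matrix whose norm is smaller than 1, the partial sums approximate (I − A)⁻¹. The matrix A below is an arbitrary choice scaled so that its norm is small; this is only an illustrative sketch.

A = 0.1 * [1 2 0; -1 0 1; 0 3 1];        % norm(A, 1) = 0.5 < 1
S = eye(3);  P = eye(3);
for k = 1:50
    P = P * A;                           % P is now A^k
    S = S + P;                           % partial sum I + A + ... + A^k
end
err = norm(S - inv(eye(3) - A), 'fro')   % close to zero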

(6.15)

for n ∈ N, where A ∈ Cn×n and b ∈ Cn . It is easy to verify that xn = An x0 + (An−1 + · · · + A + I)b. If |||A||| < 1, limn→∞ An = O, limn→∞ (An−1 + · · · + A + I) = (I − A)−1 , so limn→∞ xn exists. If limn→∞ xn = x, it follows from Equality (6.15) that x = Ax + b. Let en = xn − x be the error sequence. Clearly, if |||A||| < 1, limn→∞ en = 0.  , . . . , Am , . . . be matrices in Cn×n . If ∞ Lemma 6.6. Let A0 m=0 Am ∞ is convergent, then U A V is convergent for every U, V and m m=0 ∞ ∞ m=0 U Am V = U ( m=0 Am ) V . ∞ Proof. Suppose that m=0 Am is convergent and let S =  ∞ A be its sum. We can write m=0 m

Norms and Inner Products ∞      U Am V − U SV   m=0



373

  ∞        = U Am − S V    m=0  ∞ ∞       n2 U ∞  Am − S  V ∞ .   m=0



∞

Therefore, the convergence of m=0 A m implies the convergence of ∞ ∞ ∞ U A V and U A V = U (  m m m=0 m=0 m=0 Am ) V . n×n and let Lemma 6.7. Let μ be a vectorial norm on ∞C n×n A0 , . . . , Am , . . . be matrices in C . The m=0 Am is absoseries ∞ lutely convergent if and only if the series m=0 μ(Am ) is convergent.  Proof. If the series ∞ m=0 μ(Am ) is convergent, then for every i, j such that 1  i, j  n we have |(Am )ij |  kμ(Am ) for some positive constant k(see Supplement 6.20) which implies the absolute convergence of ∞ m=0 Am . ∞ Conversely, suppose that m=0 Amis absolutely convergent. Then, there exists a number c such that pm=0 |(Am )ij |  c for every p ∈ N and 1  i, j  n. Therefore, we can write p  m=0

μ(Am ) 

p  n 

|(Am )ij |  n2 c,

m=0 i,j=1

which allows us to conclude that

∞

m=0 μ(Am )

is convergent.



n×n . If Theorem 6.26. Let A0 , . . . , Am , . . . be  ∞matrices in C ∞ m=0 Am is absolutely convergent, then m=0 U Am V is absolutely convergent for every U, V .

Proof.

The second part of Supplement 6.20 implies

U Am V ∞  n2 P ∞ Am ∞ V ∞  cAm ∞ , ∞ where c does not depend on m. By Lemma 6.7, m=0 Am ∞ is ∞ for every U, V , which convergent, so m=0 U Am V ∞ is convergent  U A V.  implies the absolute convergence of ∞ m m=0

Linear Algebra Tools for Data Mining (Second Edition)

374

6.8

Conjugate Norms

Let ν : Cn −→ R0 . Consider the function ν ∗ : Cn −→ R0 defined by ν ∗ (y) = max{|y H x| | ν(x) = 1} for y ∈ Cn . Theorem 6.27. The mapping ν ∗ is a norm on Cn . Proof.

Let y, z ∈ Cn . For ν(x) = 1, we have |(y + z)H x|  |y H x| + |z H x|  ν ∗ (y) + ν ∗ (z),

which implies ν ∗ (y + z)  ν ∗ (y) + ν ∗ (z). Thus, ν ∗ is subadditive. The positive homogeneity is immediate. Suppose that ν ∗ (y) = 0 but y = 0. Since   y = 1, ν ν(y) this implies ν ∗ (y)  |y H

y22 y |= . ν(y) ν(y)

Therefore, if y = 0, then ν ∗ (y) > 0. Consequently, ν ∗ (y) = 0 implies  y = 0, which allows us to conclude that ν ∗ is a norm. Definition 6.14. The norm ν ∗ is the conjugate norm of the norm ν. Example 6.20. Let νp be the norm introduced in Theorem 6.5,  n 1 p  |xi |p νp (x) = i=1

for x ∈ Rn . To compute its dual νp∗ , we need to compute νp∗ (y) = max{|y H x | νp (x) = 1} = max{|y1 x1 + · · · + yn xn |

n  i=1

for y ∈

Cn .

|xi |p = 1}

Norms and Inner Products

375

Without loss of generality, we can assume that y1 , . . . , yn , x1 , . . . , xn belong to R0 . By introducing a Lagrange multiplier λ, we consider the function  n   p xi − 1 . Φ(y1 , . . . , yn , x1 , . . . , xn , λ) = y1 x1 + · · · + yn xn − λ i=1

The necessary extremum conditions are ∂Φ = yi − λpxp−1 =0 i ∂xi for 1  i  n. Thus, we must have y1

xp−1 1

= ··· =

yn

xp−1 n

= pλ.

Equivalently, we have p

p

p ynp−1 y1p−1 p−1 . p = ··· = p = (pλ) x1 xn

For q =

p p−1 ,

the last equality can be written as νq (y)q ynq y1q q = · · · = = (pλ) = = νq (y)q , νp (x)p xp1 xpn

since νp (x) = 1. Thus, pλ = νq (y) and, therefore, yi xi = xpi νq (y) for 1  i  n. The maximum of y1 x1 + · · · + yn xn for νp (x) = 1 is therefore n  i=1

y i xi =

n 

xpi νq (y) = νq (y).

i=1

This shows that the conjugate of the norm νp is the norm νq , where 1 1 + = 1. p q Observe that the conjugate of the Euclidean norm ν2 is ν2 itself.

376

Linear Algebra Tools for Data Mining (Second Edition)

Let U be a subspace of Cn and let h : U −→ C be a linear functional defined on U . We observed that there exists y ∈ Cn such that h can be expressed as h(x) = y H x for x ∈ U . Note that the vector y is not unique because h(x) = (y + z)H x for x ∈ U for every vector z ∈ U ⊥ . Theorem 6.28 (Hahn–Banach theorem). Let h be a linear functional defined on a subspace U of Cn and let ν be a norm on C such that max{|h(x)| | x ∈ U and ν(x) = 1} = M. ˆ of h to Cn such that There exists an extension h ˆ max{|h(x)| | x ∈ Cn and ν(x) = 1} = M. Proof. If the subspace U coincides with Cn , then there is nothing to prove. Suppose that U ⊂ Cn and let w be a vector in Cn − U . Clearly, we have w = 0. If u ∈ U − {0}, we have |h(u)|  M ν(u). The linearity of h implies h(u1 )−h(u2 ) = h(u1 −u2 )  M ν(u1 −u2 )  M (ν(u1 +w)+ν(u2 +w)) which yields h(u1 ) − M ν(u1 + w)  h(u2 ) + M ν(u2 + w) for any u1 , u2 ∈ U . Therefore, every number in the set A = {h(u1 ) − M ν(u1 + w) | u1 ∈ U } is less than or equal to any number of the set B = {h(u2 ) + M ν(u2 + w) | u2 ∈ U }. We aim to linearly extend h to a linear functional hw defined on the subspace W = U ∪{w}. To this end, define hw (w) = −a, where a is a number located between the sets A and B. Since hw (u) = h(u) for every u ∈ U , we have hw (u) − M ν(u + w)  −hw (w)  hw (u) + M ν(u + w), which implies hw (u + w) − M ν(u + w)  0  hw (u + w) + M ν(u + w).

Norms and Inner Products

377

This is equivalent to |hw (u + w)|  M ν(u + w). For α ∈ C − {0}, define hw (u + αw) = h(u) + αhw (w). We have      1 1   |hw (u + αw)| = |α|hw u + w   |α|M ν u+w |α| |α| = M ν(u + αw). If W = Cn , this extension can be repeated a finite number of times because Cn is of finite dimension. Eventually, we obtain a linear functional defined on Cn with the preservation of the boundedness  condition. Corollary 6.11. The conjugate ν ∗∗ of the conjugate ν ∗ of a norm ν on Cn equals ν. Proof.

Since ν ∗ (y) = max{|y H x| | ν(x) = 1}, it follows that |y H x|  ν ∗ (y)ν(x)

(6.16)

for x, y ∈ Cn , an inequality that is a generalization of the Cauchy– Schwarz inequality. Therefore, we have ν ∗∗ (x)  max{|xH y | ν ∗ (y) = 1}  ν(x). To prove the converse inequality, we need to use the Hahn–Banach Theorem. Consider the linear functional g defined on the subspace x by g(u) = aν(x) for u = ax and a ∈ C. If ν(u) = 1, we have 1 |a| = ν(x) , so |g(u)| = |a|ν(x) = 1. Thus, by the Hahn–Banach Theorem, g can be extended to gˆ : C∗ −→ C such that |ˆ(g)(v)|  1 if ν(v) = 1 for v ∈ C. If gˆ(v) = z H v, then

ν ∗ (z) = max{|z H v| | ν(v) = 1} = 1 and |z H v| = ν(v). Therefore, ν ∗∗ (x) = max{|xH z| | ν ∗ (z) = 1}  ν(x), so ν ∗∗ (x) = ν(x).



378

6.9

Linear Algebra Tools for Data Mining (Second Edition)

Inner Products

Definition 6.15. Let V be a C-linear space. An inner product on V is a function f : V × V −→ C that has the following properties: (i) f (ax + by, z) = af (x, z) + bf (y, z) (linearity in the first argument); (ii) f (x, y) = f (y, x) for y, x ∈ V (conjugate symmetry); (iii) if x = 0V , then f (x, x) is a positive real number (positivity); (iv) f (x, x) = 0 if and only if x = 0V (definiteness); for every x, y, z ∈ V and a, b ∈ C. The pair (V, f ) is called an inner product space. An alternative terminology [106] for real inner product spaces is Euclidean spaces, and Hermitian spaces for complex inner product spaces. For the second argument of an inner product on a C-linear space, we have the property of conjugate linearity, that is, f (z, ax + by) = a ¯f (z, x) + ¯bf (z, y) for every x, y, z ∈ V and a, b ∈ C. Indeed, by the conjugate symmetry property, we can write f (z, ax + by) = f (ax + by, z) = af (x, z) + bf (y, z) =a ¯f (x, z) + ¯bf (y, z) =a ¯f (z, x) + ¯bf (z, y). To simplify notations, if there is no risk of confusion, we denote the inner product f (u, v) as (u, v). Observe that the conjugate symmetry property on inner products implies that for x ∈ V , (x, x) is a real number because (x, x) = (x, x). When V is a real linear space, the definition of the inner product becomes simpler because the conjugate of a real number a is a itself. Namely, for real linear spaces, the conjugate symmetry is replaced by the plain symmetry property, (x, y) = (y, x) for x, y ∈ V . Thus, in case of real linear spaces, an inner product is linear in both arguments.


Let $W = \{w_1, \ldots, w_n\}$ be a basis in the complex $n$-dimensional inner product space $V$. If $x = \sum_{i=1}^n x_i w_i$ and $y = \sum_{j=1}^n y_j w_j$, then
$$(x, y) = \sum_{i=1}^n \sum_{j=1}^n x_i \overline{y_j}\,(w_i, w_j),$$
due to the linearity of the inner product in its first argument and its conjugate linearity in the second. If we denote $(w_i, w_j)$ by $g_{ij}$, then $(x, y)$ can be written as
$$(x, y) = \sum_{i=1}^n \sum_{j=1}^n x_i \overline{y_j}\, g_{ij} \qquad (6.17)$$
for $x, y \in V$, when $V$ is a complex inner product linear space. If $V$ is a real inner product space, then
$$(x, y) = \sum_{i=1}^n \sum_{j=1}^n x_i y_j\, g_{ij}.$$

Note that in this case (gij ) is a symmetric matrix. The matrix G = (gij ) is actually the Gram matrix of the basis W , which will be referred to as the fundamental matrix of the basis B. Definition 6.16. Two vectors u, v ∈ Cn are said to be orthogonal with respect to an inner product if (u, v) = 0. This is denoted by x ⊥ y. An orthogonal set of vectors in an inner product space V equipped with an inner product is a subset W of V such that for every u, v ∈ W , we have u ⊥ v. Theorem 6.29. Any inner product on a linear space V generates a norm on that space defined by x = (x, x) for x ∈ V . Proof. Let V be a C-linear space. We need to verify that the norm satisfies the conditions of Definition 6.7. Applying the properties of the inner product, we have x + y2 = (x + y, x + y) = (x, x) + 2(x, y) + (y, y) = x2 + 2(x, y) + y2


 x2 + 2xy + y2 = (x + y)2 . Because x  0, it follows that x + y  x + y, which is the subadditivity property.    a(x, x) = |a|2 (x, x) = If a ∈ C, then ax = (ax, ax) = a¯  |a| (x, x) = |a|x. Finally, from the definiteness property of the inner product, it follows that x = 0 if and only if x = 0V , which allows us to  conclude that  ·  is indeed a norm. The norm induced by the inner product f (x, y) = xi y j gij introduced in Equality (6.17) is x2 = f (x, x) = xi xj gij . Theorem 6.30. If W is a set of orthogonal vectors in an ndimensional C-linear space V and 0V ∈ W , then W is linearly independent. Proof. Let c = a1 w1 + · · · + an wn be a linear combination in V such that a1 w1 + · · · + an wn = 0V . Since (c, w i ) = ai w i 2 = 0, we have ai = 0 because w i 2 = 0, and this holds for every i, where 1  i  n. Thus, W is linearly independent.  Definition 6.17. An orthonormal set of vectors in an inner product space V equipped with an inner product is an orthogonal subset W of V such that for every u we have u = 1, where the norm is induced by the inner product. Corollary 6.12. If W is an orthonormal set of vectors in an ndimensional C-linear space V and |W | = n, then W is a basis in L. Proof.

This statement follows immediately from Theorem 6.30. 

If $W = \{w_1, \ldots, w_n\}$ is an orthonormal basis in $\mathbb{C}^n$, we have
$$g_{ij} = (w_i, w_j) = \begin{cases} 1 & \text{if } i = j,\\ 0 & \text{if } i \ne j,\end{cases}$$
which means that the inner product of the vectors $x = \sum_i x_i w_i$ and $y = \sum_j y_j w_j$ is given by
$$(x, y) = \sum_{i}\sum_{j} x_i \overline{y_j}\,(w_i, w_j) = \sum_{i=1}^n x_i \overline{y_i}.$$
Consequently,
$$\|x\|^2 = \sum_{i=1}^n |x_i|^2. \qquad (6.18)$$
The inner product of $x, y \in \mathbb{R}^n$ is
$$(x, y) = \sum_{i}\sum_{j} x_i y_j\,(w_i, w_j) = \sum_{i=1}^n x_i y_i. \qquad (6.19)$$

Next, we present a direct proof for the Cauchy–Schwarz inequality.

Theorem 6.31 (Cauchy–Schwarz inequality). Let $V$ be a $\mathbb{C}$-linear space. Then, for every $x, y \in V$, we have

|(x, y)|  xy. Moreover, the equality takes place if and only if {x, y} is a linear dependent set of vectors. Proof. If y = 0V , the inequality obviously holds. So, suppose that y = 0V . For every t ∈ R, we have (x + ty, x + ty)  0. In particular, , we have taking t = (x,y) y2 0  x − ty2 = (x, x) − t(x, y) − t(y, x) + tt(y, y) = x2 − t(x, y) − t(x, y) + tty2 |(x, y)|2 |(x, y)|2 = x2 − 2 + y2 y2 |(x, y)|2 = x2 − , y2 which implies |(x, y)|  xy. If the equality |(x, y)| = xy takes place, then x − ty = 0,  so x − ty = 0, which means that {x, y} is linearly dependent. Not every norm can be induced by an inner product. A characterization of this type of norms in linear spaces, obtained in [85], is presented next. This equality shown in the next theorem is known as the parallelogram equality. Theorem 6.32. Let V be a real linear space. A norm  ·  is induced by an inner product if and only if x + y2 + x − y2 = 2(x2 + y2 ) for every x, y ∈ V .


Proof. Suppose that the norm is induced by an inner product. In this case, we can write the following for every x and y: (x + y, x + y) = (x, x) + 2(x, y) + (y, y), (x − y, x − y) = (x, x) − 2(x, y) + (y, y). Thus, (x + y, x + y) + (x − y, x − y) = 2(x, x) + 2(y, y), which can be written in terms of the norm generated as the inner product as x + y2 + x − y2 = 2(x2 + y2 ). Conversely, suppose that the condition of the theorem is satisfied by the norm  · . Consider the function f : V × V −→ R defined by  1 x + y2 − x − y2 (6.20) f (x, y) = 4 for x, y ∈ V . The symmetry of f is immediate, that is, f (x, y) = f (y, x) for x, y ∈ V . The definition of f implies  1 y2 −  − y2 = 0. (6.21) f (0, y) = 4 We prove that f is a bilinear form that satisfies the conditions of Definition 6.15. Starting from the parallelogram equality, we can write u + v + y2 + u + v − y2 = 2(u + v2 + y2 ), u − v + y2 + u − v − y2 = 2(u − v2 + y2 ). Subtracting these equalities yields u + v + y2 + u + v − y2 − u − v + y2 − u − v − y2 = 2(u + v2 − u − v2 ). This equality can be written as f (u + y, v) + f (u − y, v) = 2f (u, v). Choosing y = u implies f (2u, v) = 2f (u, v), due to Equality (6.21).

(6.22)


Let t = u + y and s = u − y. Since u = 12 (t + s) and y = 12 (t − s), we have   1 (t + s), v = f (t + s, v), f (t, v) + f (s, v) = 2f 2 by Equality (6.22). Next, we show that f (ax, y) = af (x, y) for a ∈ R and x, y ∈ V . Consider the function φ : R −→ R defined by φ(a) = f (ax + y). The basic properties of norms imply that     ax + y − bx + y  (a − b)x for every a, b ∈ R and x, y ∈ V . Therefore, the function φ, ψ : R −→ R given by φ(a) = ax + y and ψ(a) = ax − y for a ∈ R are continuous. The continuity of these functions implies that the function f defined by Equality (6.20) is continuous relative to a. Define the set S = {a ∈ R | f (ax, y) = af (x, y)}. Clearly, we have 1 ∈ S. Further, if a, b ∈ S, then a + b ∈ S and a − b ∈ S, which implies Z ⊆ S. If b = 0 and b ∈ S, then, by substituting x by 1b x in the equality f (bx, y) = bf (x, y), we have f (x, y) = bf ( 1b x, y), so 1b f (x, y) = f ( 1b x, y). Thus, if a, b ∈ S and b = 0, we have f ( ab x, y) = ab f (x, y), so Q ⊆ S. Consequently, S = R. This allows us to conclude that f is linear in its first argument. The symmetry of f implies the linearity in its second argument, so f is bilinear. Observe that f (x, x) = x2 . The definition of norms implies that f (x, x) = 0 if and only if x = 0, and if x =  0, then f (x, x) > 0.  Thus, f is indeed an inner product and x = f (x, x). Theorem 6.33. Let x, y ∈ Rn be two vectors such that x1  x2  · · ·  xn , y 1  y 2  · · ·  y n . For every permutation matrix P , we have x y  x (P y). For every permutation matrix P , we have x y  x (P y). Proof. Let φ be the permutation that corresponds to the permutation matrix P and suppose that φ = ψp . . . ψ1 , where p = inv(φ)


and ψ1 , . . . , ψp are standard transpositions that correspond to all standard inversions of φ (see Theorems 1.9 and 1.10). Let ψ be a standard transposition of {1, . . . , n},   1 ··· i i + 1 ··· n ψ: . 1 ··· i + 1 i ··· n We have x (P y) = x1 y1 + · · · + xi−1 yi−1 + xi yi+1 + xi+1 yi + · · · + xn yn , so the inequality x y  x (P y) is equivalent to xi yi + xi+1 yi+1  xi yi+1 + xi+1 yi . This, in turn is equivalent to (xi+1 − xi )(yi+1 − yi )  0, which obviously holds in view of the hypothesis. As we observed previously, Pφ = Pψ1 · · · Pψp , so x y  x (Pψp y)  x (Pψp−1 Pψp y)  · · ·  x (Pψ1 · · · Pψp y) = x(P y), which concludes the proof of the first part of the theorem. To prove the second part of the theorem, apply the first part to  the vectors x and −y. Corollary 6.13. Let x, y ∈ Rn be two vectors such that x1  x1  · · ·  xn , y1  y2  · · ·  yn . For every permutation matrix P , we have x − yF  x − P yF . If x1  x1  · · ·  xn and y1  y2  · · ·  yn , then for every permutation matrix P , we have x − yF  x − P yF . Proof.

Note that x − y2F = x2F + y2F − 2x y, x − P y2F = x2F + P y2F − 2x (P y) = x2F + y2F − 2x (P y)

because P y2F = y2F . Then, by Theorem 6.33, x − yF  x − P yF . The argument for the second part of the corollary is similar. 


Corollary 6.14. Let $V$ be an $n$-dimensional linear space. If $W$ is an orthogonal (orthonormal) set and $|W| = n$, then $W$ is an orthogonal (orthonormal) basis of $V$.

Proof. This statement is an immediate consequence of Theorem 6.30. □

For every $A \in \mathbb{C}^{n\times n}$ and $x, y \in \mathbb{C}^n$, we have
$$(Ax, y) = (x, A^H y), \qquad (6.23)$$

which follows from the equalities
$$(Ax, y) = \sum_{i=1}^n (Ax)_i \overline{y_i} = \sum_{i=1}^n \sum_{j=1}^n a_{ij} x_j \overline{y_i} = \sum_{j=1}^n x_j \sum_{i=1}^n a_{ij}\overline{y_i} = \sum_{j=1}^n x_j \overline{\sum_{i=1}^n \overline{a_{ij}}\, y_i} = (x, A^H y).$$
More generally, we have the following definition:

Definition 6.18. A matrix $B \in \mathbb{C}^{n\times n}$ is the adjoint of a matrix $A \in \mathbb{C}^{n\times n}$ relative to the inner product $(\cdot, \cdot)$ if $(Ax, y) = (x, By)$ for every $x, y \in \mathbb{C}^n$. A matrix is self-adjoint if it equals its own adjoint, that is, if $(Ax, y) = (x, Ay)$ for every $x, y \in \mathbb{C}^n$.

Thus, a Hermitian matrix is self-adjoint relative to the inner product $(x, y) = y^H x$ for $x, y \in \mathbb{C}^n$. If we use the Euclidean inner product, we omit the reference to this product and refer to the adjoint of $A$ relative to this product simply as the adjoint of $A$.

Example 6.21. An inner product on $\mathbb{C}^{n\times n}$, the linear space of matrices of format $n \times n$, can be defined as $(X, Y) = \mathrm{trace}(XY^H)$ for $X, Y \in \mathbb{C}^{n\times n}$. Note that $\|X\|_F^2 = (X, X)$ for every $X \in \mathbb{C}^{n\times n}$.
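As a quick numerical illustration of Equality (6.23) and of the inner product of Example 6.21, the following MATLAB sketch can be used; the matrices and vectors are arbitrary random choices, not taken from the text.

% Check (Ax, y) = (x, A^H y) for the Euclidean inner product (u, v) = v^H u
n = 4;
A = randn(n) + 1i*randn(n);
x = randn(n,1) + 1i*randn(n,1);
y = randn(n,1) + 1i*randn(n,1);
abs(y'*(A*x) - (A'*y)'*x)          % of the order of roundoff
% Check that ||X||_F^2 = (X, X) for the inner product (X, Y) = trace(X*Y')
X = randn(n) + 1i*randn(n);
abs(norm(X,'fro')^2 - trace(X*X'))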


By Theorem 2.38, a linear form $f$ defined on $\mathbb{R}^n$ can be uniquely written as
$$f(v) = \sum_{i=1}^n c_i a_i,$$
where $\{e_1, \ldots, e_n\}$ is a basis of $\mathbb{R}^n$, $v = \sum_{i=1}^n c_i e_i$, and $a_i = f(e_i)$ for $1 \le i \le n$. Thus, using the Euclidean inner product, we can write $f(v) = (v, a)$ for $v \in \mathbb{R}^n$, where $a = (a_1, \ldots, a_n)^{\top}$.

The Cauchy–Schwarz inequality implies that $|(x, y)| \le \|x\|_2\,\|y\|_2$. Equivalently, this means that
$$-1 \le \frac{(x, y)}{\|x\|_2\,\|y\|_2} \le 1.$$
This double inequality allows us to introduce the notion of angle between two vectors $x, y$ of a real linear space $V$.

Definition 6.19. The angle between the vectors $x$ and $y$ is the number $\alpha \in [0, \pi]$ defined by
$$\cos\alpha = \frac{(x, y)}{\|x\|_2\,\|y\|_2}.$$

This angle will be denoted by ∠(x, y). Example 6.22. Let u = (u1 , u2 ) ∈ R2 be a unit vector. Since u21 + u22 = 1, there exists α ∈ [0, 2π] such that u1 = cos α and u2 = sin α. Thus, for any two unit vectors in R2 , u = (cos α, sin α) and v = (cos β, sin β) we have (u, v) = cos α cos β + sin α sin β = cos(α − β), where α, β ∈ [0, 2π]. Consequently, ∠(u, v) is the angle in the interval [0, π] that has the same cosine as α − β.
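In MATLAB, the angle of Definition 6.19 can be computed directly from the Euclidean inner product; the vectors below are arbitrary choices and the clamping guards against roundoff (this sketch is ours, not the book's code).

x = [1; 2; 2];
y = [2; 0; 1];
c = (x'*y)/(norm(x)*norm(y));        % cosine of the angle between x and y
alpha = acos(min(max(c, -1), 1));    % angle in [0, pi]
rad2deg(alpha)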


Theorem 6.34 (The Cosine theorem). Let x and y be two vectors in Rn equipped with the Euclidean inner product. We have x − y2 = x2 + y2 − 2xy cos α, where α = ∠(x, y). Proof.

Since the norm is induced by the inner product, we have x − y2 = (x − y, x − y) = (x, x) − 2(x, y) + (y, y) = x2 − 2xy cos α + y2 , 

which is the desired equality. □

If $T \subseteq V$, then the set $T^{\perp}$ is defined by

T ⊥ = {v ∈ V | v ⊥ t for every t ∈ T }. Note that T ⊆ U implies U ⊥ ⊆ T ⊥ . If S, T are two subspaces of an inner product space, then S and T are orthogonal if s ⊥ t for every s ∈ S and every t ∈ T . This is denoted as S ⊥ T . Theorem 6.35. Let V be an inner product space and let T ⊆ V . The set T ⊥ is a subspace of V. Furthermore, T ⊥ = T ⊥ . Proof. Let x and y be two members of T ⊥ . We have (x, t) = (y, t) = 0 for every t ∈ T . Therefore, for every a, b ∈ F , by the linearity of the inner product, we have (ax + by, t) = a(x, t) + b(y, t) = 0 for t ∈ T , so ax + bt ∈ T ⊥ . Thus, T ⊥ is a subspace of V. By a previous observation, since T ⊆ T , we have T ⊥ ⊆ T ⊥ . To prove the converse inclusion, let z ∈ T ⊥ . If y ∈ T , y is a linear combination of vectors of T , y = a1 t1 + · · · + am tm , so (y, z) = a1 (t1 , z) + · · · + am (tm , z) = 0. Therefore, z ⊥ y, which implies z ∈ T ⊥ . This allows us to conclude that  T ⊥ = T ⊥ . We refer to T ⊥ as the orthogonal complement of T . Note that T ∩ T ⊥ ⊆ {0}. If T is a subspace, then this inclusion becomes an equality, that is, T ∩ T ⊥ = {0}.


If x and y are orthogonal, by Theorem 6.34, we have x − y2 = x2 + y2 , which is the well-known Pythagorean Theorem. Theorem 6.36. Let T be a subspace of Cn . We have (T ⊥ )⊥ = T . Proof. Observe that T ⊆ (T ⊥ )⊥ . Indeed, if t ∈ T , then (t, z) = 0 for every z ∈ T ⊥ , so t ∈ (T ⊥ )⊥ . To prove the reverse inclusion, let x ∈ (T ⊥ )⊥ . Theorem 6.37 implies that we can write x = u + v, where u ∈ T and v ∈ T ⊥ , so x − u = v ∈ T ⊥ . Since T ⊆ (T ⊥ )⊥ , we have u ∈ (T ⊥ )⊥ , so x − u ∈ (T ⊥ )⊥ . Consequently, x−u ∈ T ⊥ ∩(T ⊥ )⊥ = {0}, so x = u ∈ T . Thus, (T ⊥ )⊥ ⊆ T , which concludes the argument.  Corollary 6.15. Let Z be a subset of Cn . We have (Z ⊥ )⊥ = Z. Proof. Let Z be a subset of Cn . Since Z ⊆ Z, it follows that Z⊥ ⊆ Z ⊥ . Let now y ∈ Z ⊥ and let z = a1 z 1 + · · · + ap z p ∈ Z, where z 1 , . . . , z p ∈ Z. Since (y, z) = a1 (y, z 1 ) + · · · + ap (y, z p ) = 0, it follows that y ∈ Z⊥ . Thus, we have Z ⊥ = Z⊥ . This allows us to write (Z ⊥ )⊥ = (Z⊥ )⊥ . Since Z is a subspace of Cn , by Theorem 6.36, we have (Z⊥ )⊥ = Z, so (Z ⊥ )⊥ = Z. Theorem 6.37. Let U be a subspace of Cn . Then, Cn = U ⊕ U ⊥ . Proof. If U = {0}, then U ⊥ = Cn and the statement is immediate. Therefore, we can assume that U = {0}. In Theorem 6.35 we saw that U ⊥ is a subspace of Cn . Thus, we need to show that Cn is the direct sum of the subspaces U and U ⊥ . By Corollary 2.10, we need to verify only that every x ∈ Cn can be uniquely written as a sum x = u + v, where u ∈ U and v ∈ U ⊥ . Let u1 , . . . , um be an orthonormal basis of U , that is, a basis such that  1 if i = j, (ui , uj ) = 0 otherwise


for 1  i, j  m. Define the vector u as u = (x, u1 )u1 + · · · + (x, um )um and the vector v as v = x − u. The vector v is orthogonal to every vector ui because (v, ui ) = (x − u, ui ) = (x, ui ) − (u, ui ) = 0. Therefore v ∈ U ⊥ and x has the necessary decomposition. To prove that the decomposition is unique, suppose that x = s + t, where s ∈ U and t ∈ U⊥ . Since s + t = u + v, we have s − u = v − t ∈  U ∩ U ⊥ = {0}, which implies s = u and t = v. Theorem 6.38. Let S be a subspace of Cn such that dim(S) = k. There exists a matrix A ∈ Cn×k having orthonormal columns such that S = range(A). Proof. Let v 1 , . . . , v k be an orthonormal basis of S. Define the if x ⎞ = matrix A as A = (v 1 , . . . , v k ). We have x ∈ S, if and only ⎛ a1 ⎜ ⎟ a1 v 1 + · · · + ak v k , which is equivalent to x = Aa, where a = ⎝ ... ⎠. ak  This amounts to x ∈ range(A). Thus, we have S = range(A). Clearly, for an orthonormal basis in an n-dimensional space, the associated matrix is the diagonal matrix In . In this case, we have (x, y) = x1 y1 + x2 y2 + · · · + xn yn for x, y ∈ V . Observe that if W = {w1 , . . . , w n } is an orthonormal set and x ∈ W , which means that x = a1 w1 +· · ·+an wn , then ai = (x, w i ) for 1  i  n. Definition 6.20. Let W = {w 1 , . . . , wn } be an orthonormal set and let x ∈ W . The equality x = (x, w 1 )w 1 + · · · + (x, wn )w n

(6.24)

is the Fourier expansion of x with respect to the orthonormal set W .


Furthermore, we have Parseval's equality:
$$\|x\|^2 = (x, x) = \sum_{i=1}^n (x, w_i)^2. \qquad (6.25)$$
Thus, if $1 \le q \le n$, we have
$$\sum_{i=1}^q (x, w_i)^2 \le \|x\|^2. \qquad (6.26)$$
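A short MATLAB check of the Fourier expansion (6.24) and of Parseval's equality (6.25); orth is used only to manufacture an orthonormal basis, and the data are random (illustrative sketch).

n = 5;
W = orth(randn(n));      % columns form an orthonormal basis of R^n
x = randn(n,1);
c = W'*x;                % Fourier coefficients c_i = (x, w_i)
norm(x - W*c)            % reconstruction error, ~ 0 (Equality (6.24))
norm(x)^2 - sum(c.^2)    % Parseval's equality (6.25), ~ 0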

It is easy to see that a square matrix C ∈ Cn×n is unitary if and only if its set of columns is an orthonormal set in Cn . Definition 6.21. Let V be an inner product C-linear n-dimensional space and let B = {e1 , . . . , en } be a basis of V. ˜ = {˜ ˜n } is a reciprocal set of B if A set of vectors B e1 , . . . , e ˜j ) = δij for 1  i, j  n. (ei , e Theorem 6.39. Let V be an inner product C-linear n-dimensional ˜ of a basis B = {e1 , . . . , en } exists and is space. The reciprocal set B unique. Proof. Let U = e2 , . . . , en  and let U ⊥ be the orthogonal complement of U . Then, dim(U ⊥ ) = 1 and there exists a non-zero vector w ∈ U ⊥ . Note that (e1 , w) = 0, which allows us to define ˜1 = (e11,w) w. Thus, (˜ e e1 , e1 ) = 1 and (˜ e1 , ej ) = 0 for j = 1. In a similar manner, we can prove the existence of the remaining vectors ˜ ˜n of B. ˜2 , . . . , e e ˜ suppose that there exist two recipTo prove the uniqueness of B, ˜1, . . . , d ˜ n }. Then, (˜ ˜ = {d ˜ ˜n } and D ek − rocal sets of B, B = {˜ e1 , . . . , e ˜ k is orthogonal to ˜ k , ek ) = 0 for 1  k  n, which means that e ˜k − d d ˜ k , v) = 0 for all v ∈ V , which implies B. This is equivalent to (˜ ek − d ˜ ˜k = dk . This proves that the reciprocal set of B is unique.  e Theorem 6.40. The reciprocal set of a basis in an inner product space V is a basis of V. ˜ = {˜ ˜n } be the reciprocal set of a basis B = Proof. Let B e1 , . . . , e ˜1 + · · · + an e ˜n = 0V . Then, {e1 , . . . , en } in V. Suppose that a1 e ˜1 + · · · + an e ˜n , ek ) = ak = 0 (a1 e for 1  k  n, which means that the reciprocal set is linearly inde˜ = n, it follows that B ˜ is a basis in V.  pendent. Since |B|

6.10 Hyperplanes in Rn

Definition 6.22. Let $w \in \mathbb{R}^n - \{0\}$ and let $a \in \mathbb{R}$. The hyperplane determined by $w$ and $a$ is the set $H_{w,a} = \{x \in \mathbb{R}^n \mid w^{\top}x = a\}$.

If $x_0 \in H_{w,a}$, then $w^{\top}x_0 = a$, so $H_{w,a}$ is also described by the equality $H_{w,a} = \{x \in \mathbb{R}^n \mid w^{\top}(x - x_0) = 0\}$.

Any hyperplane $H_{w,a}$ partitions $\mathbb{R}^n$ into three sets:
$$H^{>}_{w,a} = \{x \in \mathbb{R}^n \mid w^{\top}x > a\},\quad H^{0}_{w,a} = H_{w,a},\quad H^{<}_{w,a} = \{x \in \mathbb{R}^n \mid w^{\top}x < a\}.$$
The sets $H^{>}_{w,a}$ and $H^{<}_{w,a}$ are the positive and negative open half-spaces determined by $H_{w,a}$, respectively. The sets
$$H^{\ge}_{w,a} = \{x \in \mathbb{R}^n \mid w^{\top}x \ge a\},\quad H^{\le}_{w,a} = \{x \in \mathbb{R}^n \mid w^{\top}x \le a\}$$

are the positive and negative closed half-spaces determined by Hw,a , respectively. If x1 , x2 ∈ Hw,a , we say that the vector x1 − x2 is located in the hyperplane Hw,a . In this case, w ⊥ x1 −x2 . This justifies referring to w as the normal to the hyperplane Hw,a . Observe that a hyperplane is fully determined by a vector x0 ∈ Hw,a and by w. Let x0 ∈ Rn and let Hw,a be a hyperplane. We seek x ∈ Hw,a such that x − x0 2 is minimal. Finding x amounts to minimizing  the function f (x) = x − x0 22 = ni=1 (xi − x0i )2 subjected to the constraint w1 x1 + · · · + wn xn − a = 0. Using the Lagrangian Λ(x) = f (x) + λ(w  x − a) and the multiplier λ, we impose the conditions ∂Λ = 0 for 1  i  n, ∂xi which amount to ∂f + λwi = 0 ∂xi for 1  i  n. These equalities yield 2(xi − x0i ) + λw i = 0, so we have xi = x0i − 12 λwi . Consequently, we have x = x0 − 12 λw.


Since $x \in H_{w,a}$, this implies
$$w^{\top}x = w^{\top}x_0 - \frac{1}{2}\lambda\, w^{\top}w = a.$$
Thus,
$$\lambda = 2\,\frac{w^{\top}x_0 - a}{w^{\top}w} = 2\,\frac{w^{\top}x_0 - a}{\|w\|_2^2}.$$
We conclude that the closest point in $H_{w,a}$ to $x_0$ is
$$x = x_0 - \frac{w^{\top}x_0 - a}{\|w\|_2^2}\,w.$$
The smallest distance between $x_0$ and a point in the hyperplane $H_{w,a}$ is given by
$$\|x_0 - x\| = \frac{|w^{\top}x_0 - a|}{\|w\|_2}.$$
If we define the distance $d(H_{w,a}, x_0)$ between $x_0$ and $H_{w,a}$ as this smallest distance, we have
$$d(H_{w,a}, x_0) = \frac{|w^{\top}x_0 - a|}{\|w\|_2}. \qquad (6.27)$$
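The closest point and the distance (6.27) can be checked numerically in MATLAB; the hyperplane and the point below are arbitrary choices (sketch).

w = [1; -2; 2]; a = 3;                     % hyperplane {x : w'*x = a}
x0 = [4; 1; 0];
xstar = x0 - ((w'*x0 - a)/norm(w)^2)*w;    % closest point of H_{w,a} to x0
d = abs(w'*x0 - a)/norm(w);                % distance (6.27)
[w'*xstar - a, norm(x0 - xstar) - d]       % both entries ~ 0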

6.11 Unitary and Orthogonal Matrices

Lemma 6.8. Let A ∈ Cn×n . If xH Ax = 0 for every x ∈ Cn , then A = On,n . Proof. If x = ek , then xH Ax = akk for every k, 1  k  n, so all diagonal entries of A equal 0. Choose now x = ek + ej . Then (ek + ej )H A(ek + ej ) = eHk Aek + eHk Aej + eHj Aek + eHj Aej = eHk Aej + eHj Aek = akj + ajk = 0.


Similarly, if we choose x = ek + iej , we obtain (ek + iej )H A(ek + iej ) = (eHk − ieHj )A(ek + iej ) = eHk Aek − ieHj Aek + ieHk Aej + eHj Aej = −iajk + iakj = 0. The equalities akj + ajk = 0 and −ajk + akj = 0 imply akj = ajk = 0. Thus, all off-diagonal elements of A are also 0, hence A = On,n .  Theorem 6.41. A matrix U ∈ Cn×n is unitary if U x2 = x2 for every x ∈ Cn . Proof.

If U is unitary, we have U x22 = (U x)H U x = xH U H U x = x22

because U H U = In . Thus, U x2 = x2 . Conversely, let U be a matrix such that U x2 = x2 for every x ∈ Cn . This implies xH U H U x = xH x, hence xH (U H U − In )x = 0 for x ∈ Cn . By Lemma 6.8, this implies U H U = In , so U is a unitary matrix.  Corollary 6.16. The following statements that concern a matrix U ∈ Cn×n are equivalent: (i) U is unitary; (ii) U x − U y2 = x − y2 for x, y ∈ Cn ; (iii) (U x, U y) = (x, y) for x, y ∈ Cn . Proof.

This statement is a direct consequence of Theorem 6.41. 

The counterpart of unitary matrices in the set of real matrices are introduced next. Definition 6.23. A matrix A ∈ Rn×n is orthogonal or orthonormal if it is unitary. In other words, a real matrix A ∈ Rn×n is orthogonal if and only if A A = AA = In . Clearly, A is orthogonal if and only if A is orthogonal.


Theorem 6.42. If A ∈ Rn×n is an orthogonal matrix, then det(A) ∈ {−1, 1}. Proof. By Corollary 5.3, | det(A)| = 1. Since det(A) is a real num ber, it follows that det(A) ∈ {−1, 1}. Corollary 6.17. Let A be a matrix in Rn×n . The following statements are equivalent: (i) A is orthogonal; (ii) A is invertible and A−1 = A ; (iii) A is invertible and (A )−1 = A; (iv) A is orthogonal. Proof. The equivalence between these statements is an immediate consequence of definitions.  Corollary 6.17 implies that the columns of a square matrix form an orthonormal set of vectors if and only if the set of rows of the matrix is an orthonormal set. Theorem 6.41 specialized to orthogonal matrices shows that a matrix A is orthogonal if and only if it preserves the length of vectors. Theorem 6.43. Let S be an r-dimensional subspace of Rn and let {u1 , . . . , ur }, {v 1 , . . . , v r } be two orthonormal bases of the space S. The orthogonal matrices B = (u1 · · · ur ) ∈ Rn×r and C = (v 1 · · · v r ) ∈ Rn×r of any two such bases are related by the equality B = CT, where T = C  B ∈ Cr×r is an orthogonal matrix. Proof. Since the columns of B form a basis for S, each vector v i can be written as v i = v 1 t1i + · · · + v r tri for 1  i  r. Thus, B = CT . Since B and C are orthogonal, we have B H B = T H C H CT = T H T = Ir , so T is an orthogonal matrix and because it is a square matrix, it is also a unitary matrix. Furthermore, we have C H B = C H CT = T ,  which concludes the argument.


Definition 6.24. A rotation matrix is an orthogonal matrix $R \in \mathbb{R}^{n\times n}$ such that $\det(R) = 1$. A reflection matrix is an orthogonal matrix $R \in \mathbb{R}^{n\times n}$ such that $\det(R) = -1$.

Example 6.23. In the two-dimensional case, $n = 2$, a rotation is a matrix $R \in \mathbb{R}^{2\times 2}$,
$$R = \begin{pmatrix} r_{11} & r_{12}\\ r_{21} & r_{22}\end{pmatrix},$$
such that
$$r_{11}^2 + r_{21}^2 = 1,\quad r_{12}^2 + r_{22}^2 = 1,\quad r_{11}r_{12} + r_{21}r_{22} = 0$$
and $r_{11}r_{22} - r_{12}r_{21} = 1$. This implies
$$r_{22}(r_{11}r_{12} + r_{21}r_{22}) - r_{12}(r_{11}r_{22} - r_{12}r_{21}) = -r_{12},$$
or $r_{21}(r_{22}^2 + r_{12}^2) = -r_{12}$, so $r_{21} = -r_{12}$. If $r_{21} = -r_{12} = 0$, the above equalities imply that either $r_{11} = r_{22} = 1$ or $r_{11} = r_{22} = -1$. Otherwise, the equality $r_{11}r_{12} + r_{21}r_{22} = 0$ implies $r_{11} = r_{22}$. Since $r_{11}^2 \le 1$, it follows that there exists $\theta$ such that $r_{11} = \cos\theta$. This implies that $R$ has the form
$$R = \begin{pmatrix} \cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix}.$$
Its effect on a vector $x = \begin{pmatrix} x_1\\ x_2\end{pmatrix} \in \mathbb{R}^2$


is to produce the vector $y = Rx$, where
$$y = \begin{pmatrix} x_1\cos\theta - x_2\sin\theta\\ x_1\sin\theta + x_2\cos\theta\end{pmatrix},$$
which is obtained from $x$ by a counterclockwise rotation by angle $\theta$. It is easy to see that $\det(R) = 1$, so the term "rotation matrix" is clearly justified for $R$. To mark the dependency of $R$ on $\theta$, we will use the notation
$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix}.$$
The same conclusion can be reached by examining Figure 6.3. If the angle of the vector $x = \begin{pmatrix} x_1\\ x_2\end{pmatrix}$ with the $x_1$ axis is $\alpha$ and $x$ is rotated counterclockwise by $\theta$ to yield the vector $y = y_1 e_1 + y_2 e_2$, then $x_1 = r\cos\alpha$, $x_2 = r\sin\alpha$, and
$$y_1 = r\cos(\alpha + \theta) = r\cos\alpha\cos\theta - r\sin\alpha\sin\theta = x_1\cos\theta - x_2\sin\theta,$$
$$y_2 = r\sin(\alpha + \theta) = r\sin\alpha\cos\theta + r\cos\alpha\sin\theta = x_1\sin\theta + x_2\cos\theta,$$
which are the formulas that describe the transformation of $x$ into $y$.

Fig. 6.3 y is obtained by a counterclockwise rotation by an angle θ from x.
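The rotation R(θ) is easy to verify in MATLAB; θ and x below are arbitrary choices (sketch).

theta = pi/6;
R = [cos(theta) -sin(theta); sin(theta) cos(theta)];
x = [2; 1];
y = R*x;                        % counterclockwise rotation of x by theta
[det(R), norm(y) - norm(x)]     % det(R) = 1 and the length of x is preserved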


An extension of Example 6.23 is the Givens matrix $G(p, q, \theta) \in \mathbb{R}^{n\times n}$, which coincides with the identity matrix $I_n$ except for the entries in rows and columns $p$ and $q$, where
$$G_{pp} = \cos\theta,\quad G_{pq} = \sin\theta,\quad G_{qp} = -\sin\theta,\quad G_{qq} = \cos\theta.$$
We can write
$$G(p, q, \theta) = (e_1 \;\cdots\; \cos\theta\, e_p - \sin\theta\, e_q \;\cdots\; \sin\theta\, e_p + \cos\theta\, e_q \;\cdots\; e_n).$$
It is easy to verify that $G(p, q, \theta)$ is a rotation matrix since it is orthogonal and $\det(G(p, q, \theta)) = 1$. Since
$$G(p, q, \theta)\begin{pmatrix} v_1\\ \vdots\\ v_p\\ \vdots\\ v_q\\ \vdots\\ v_n\end{pmatrix} = \begin{pmatrix} v_1\\ \vdots\\ \cos\theta\, v_p + \sin\theta\, v_q\\ \vdots\\ -\sin\theta\, v_p + \cos\theta\, v_q\\ \vdots\\ v_n\end{pmatrix},$$
the multiplication of a vector $v$ by a Givens matrix amounts to a clockwise rotation by $\theta$ in the plane of the coordinates $(v_p, v_q)$. If $v_p \ne 0$, then the rotation described by the Givens matrix can be used to zero the $q$th component of the resulting vector by taking $\theta$ such that $\tan\theta = \frac{v_q}{v_p}$. It is easy to see that $R(\theta)^{-1} = R(-\theta)$ and that $R(\theta_2)R(\theta_1) = R(\theta_1 + \theta_2)$.
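A MATLAB sketch that builds G(p, q, θ) and uses it to zero the qth component of a vector, as described above; the vector and the indices p, q are arbitrary choices.

n = 5; p = 2; q = 4;
v = [3; 1; -2; 2; 5];
theta = atan2(v(q), v(p));                % tan(theta) = v_q/v_p
G = eye(n);
G([p q], [p q]) = [cos(theta) sin(theta); -sin(theta) cos(theta)];
u = G*v;
[u(q), norm(u) - norm(v)]                 % u(q) ~ 0 and the norm is preserved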


Example 6.24. Let $v \in \mathbb{C}^n - \{0_n\}$ be a unit vector. The Householder matrix $H_v \in \mathbb{C}^{n\times n}$ is defined by $H_v = I_n - 2vv^H$. The matrix $H_v$ is clearly Hermitian. Moreover, we have
$$H_v^H H_v = H_v^2 = (I_n - 2vv^H)^2 = I_n - 4vv^H + 4v(v^H v)v^H = I_n,$$
so $H_v$ is unitary and involutive. Supplement 14 of Chapter 5 implies that $\det(H_v) = -1$, so $H_v$ is a reflection.

For a unit vector $v \in \mathbb{R}^n$, $H_v$ is an orthogonal and involutive matrix. In this case the vector $H_v x$ is a reflection of $x$ relative to the hyperplane $H_{v,0}$ defined by $v^{\top}x = 0$, which follows from the fact that the vector $t = x - H_v x = (I_n - H_v)x = 2v(v^{\top}x)$ is orthogonal to the hyperplane $v^{\top}x = 0$.
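A numerical check of the properties of the Householder matrix H_v (unitary, involutive, determinant −1); the unit vector v is random (sketch).

n = 4;
v = randn(n,1); v = v/norm(v);         % a unit vector
H = eye(n) - 2*(v*v');                 % Householder matrix H_v
[norm(H'*H - eye(n)), norm(H*H - eye(n)), det(H)]   % ~0, ~0, -1
x = randn(n,1);
v'*(x + H*x)    % ~ 0: the midpoint of x and H*x lies in the hyperplane v'*x = 0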

6.12 Projection on Subspaces

We have shown in Theorem 2.33 that if U, W are two complementary subspaces of Cn , then there exist idempotent endomorphisms g and h of Cn such that W = Ker(g), U = Im(g), and U = Ker(h) and W = Im(h). The vector g(x) is the oblique projection of x on U along the subspace W and h(x) is the oblique projection of x on W along the subspace U . If g and h are represented by the matrices BU and BW , respectively, it follows that these matrices are idempotent, g(x) = BU x ∈ U , and h(x) = BW x ∈ W for x ∈ Cn . Also, BU BW = BW BU = On,n . Let U and W be two complementary subspaces of Cn , {u1 , . . . , up } be a basis for U , and let {w1 , . . . , w q } be a basis for W , where p + q = n. Clearly, {u1 , . . . , up , w 1 , . . . , wq } is a basis for Cn and every x ∈ Cn can be written as x = x1 u1 + · · · + xp up + xp+1 w1 + · · · + xp+q wq . Let B ∈ Cn×n be the matrix B = (u1 · · · up w1 · · · wq ),


which is clearly invertible. Note that
$$B_U u_i = u_i,\quad B_U w_j = 0_n,\quad B_W u_i = 0_n,\quad B_W w_j = w_j$$
for $1 \le i \le p$ and $1 \le j \le q$. Therefore, we have
$$B_U B = (u_1 \cdots u_p\; 0_n \cdots 0_n) = B\begin{pmatrix} I_p & O_{p,n-p}\\ O_{n-p,p} & O_{n-p,n-p}\end{pmatrix},$$
so
$$B_U = B\begin{pmatrix} I_p & O_{p,n-p}\\ O_{n-p,p} & O_{n-p,n-p}\end{pmatrix}B^{-1}.$$

Similarly, we can show that   On−q,n−q On−q,q B −1 . BW = B Oq,n−q Iq Note that BU + BW = BIn B −1 = In . Thus, the oblique projection on U along W is given by   Ip Op,n−p B −1 x. g(x) = BU x = B On−p,p On−p,n−p The similar oblique projection on W along U is   On−q,n−q On−q,q B −1 x h(x) = BU x = B Oq,n−q Iq for x ∈ Cn . Observe that g(x) + h(x) = x, so the projection on W along U is h(x) = x − g(x) = (In − BU )x. A special important type of projections involves pairs of orthogonal subspaces. Let U be a subspace of Cn with dim U = p and let BU = {u1 , . . . , up } be an orthonormal basis of U . Taking into account that (U ⊥ )⊥ = U , by Theorem 2.33 there exists an idempotent endomorphism g of Cn such that U = Im(g) and U ⊥ = Ker(g). The proof of Theorem 6.37 shows that this endomorphism is defined by g(x) = (x, u1 )u1 + · · · + (x, um )um . Definition 6.25. Let U be an m-dimensional subspace of Cn and let {u1 , . . . , um } be an orthonormal basis of this subspace. The orthogonal projection of the vector x ∈ Cn on the subspace U is the vector projU (x) = (x, u1 )u1 + · · · + (x, um )um .


Theorem 6.44. Let $U$ be an $m$-dimensional subspace of $\mathbb{R}^n$ and let $x \in \mathbb{R}^n$. The vector $y = x - \mathrm{proj}_U(x)$ belongs to the subspace $U^{\perp}$.

Proof. Let $B_U = \{u_1, \ldots, u_m\}$ be an orthonormal basis of $U$. Note that
$$(y, u_j) = (x, u_j) - \Big(\sum_{i=1}^m (x, u_i)u_i,\; u_j\Big) = (x, u_j) - \sum_{i=1}^m (x, u_i)(u_i, u_j) = 0,$$
due to the orthogonality of the basis $B_U$. Therefore, $y$ is orthogonal on every linear combination of $B_U$, that is, on the subspace $U$. □

Theorem 6.45. Let $U$ be an $m$-dimensional subspace of $\mathbb{C}^n$ having the orthonormal basis $\{u_1, \ldots, u_m\}$. The orthogonal projection $\mathrm{proj}_U$ is given by $\mathrm{proj}_U(x) = B_U B_U^H x$ for $x \in \mathbb{C}^n$, where $B_U = (u_1 \cdots u_m) \in \mathbb{C}^{n\times m}$.

Proof. We can write
$$\mathrm{proj}_U(x) = \sum_{i=1}^m u_i (u_i^H x) = (u_1 \cdots u_m)\begin{pmatrix} u_1^H\\ \vdots\\ u_m^H\end{pmatrix}x = B_U B_U^H x. \qquad \square$$

Since the basis {u1 , . . . , um } is orthonormal, we have BUH BU = Im . Observe that the matrix BU BUH ∈ Cn×n is symmetric and idempotent because (BU BUH )(BU BUH ) = BU (BUH BU )BUH = BU BUH . For an m-dimensional subspace U of Cn , we denote by PU = BU BUH ∈ Cn×n , where BU is a matrix of an orthonormal basis of U as defined before. PU is the projection matrix of the subspace U . Corollary 6.18. For every non-zero subspace U , the matrix PU is a Hermitian matrix, and therefore, a self-adjoint matrix. Proof. Since PU = BU BUH where BU is a matrix of an orthonormal  basis of the subspace S, it is immediate that PUH = PU .


The self-adjointness of PU means that (x, PU y) = (PU x, y) for every x, y ∈ Cn . Corollary 6.19. Let U be an m-dimensional subspace of Cn having the orthonormal basis {u1 , . . . , um }. If BU = (u1 · · · um ) ∈ Cn×m , then for every x ∈ C we have the decomposition x = PU x + QU x, where PU = BU BUH and QU = In − PU , PU x ∈ U and QU x ∈ U ⊥ . Proof.

This statement follows immediately from Theorem 6.45. 
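The projection matrices P_U and Q_U of Theorem 6.45 and Corollary 6.19 can be formed in MATLAB as follows; the subspace U is spanned by the columns of an arbitrary matrix A (sketch).

A  = [1 1; 0 0; 1 3];        % columns span the subspace U of R^3
BU = orth(A);                % orthonormal basis of U, as columns
PU = BU*BU';                 % projection matrix on U
QU = eye(3) - PU;            % projection matrix on the orthogonal complement of U
x  = [1; 2; 3];
[norm(PU*PU - PU), norm(PU' - PU), (PU*x)'*(QU*x)]   % idempotent, self-adjoint, PU*x orthogonal to QU*x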

Observe that Q2U = (In − PU PUH )(In − PU PUH ) = In − PU PUH − PU PUH + PU PUH PU PUH = QU , so QU is an idempotent matrix. The matrix QU is the projection matrix on the subspace U ⊥ . Clearly, we have PU ⊥ = QU = In − PU .

(6.28)

It is possible to give a direct argument for the independence of the projection matrix PU relative to the choice of orthonormal basis in U. Theorem 6.46. Let U be an m-dimensional subspace of Cn having the orthonormal bases {u1 , . . . , um } and {v 1 , . . . , v m } and let BU = ˜U = (v 1 · · · v m ) ∈ Cn×m . The matrix (u1 · · · um ) ∈ Cn×m and B H ˜ m×m ˜ H = BU B H . ˜U B is unitary and B BU BU ∈ C U U ˜U are bases for U , Proof. Since both sets of columns of BU and B m×m ˜U Q. such that BU = B there exists a unique square matrix Q ∈ C H H ˜ ˜U implies B BU = B ˜ B The orthonormality of BU and B U U U = Im . Thus, we can write H ˜H B ˜ Im = BUH BU = QH B U U Q = Q Q,

˜ U = QH B ˜H B ˜ which shows that Q is unitary. Furthermore, BUH B U U = H Q is unitary and ˜U QQH B ˜U B ˜UH = B ˜UH . BU BUH = B



402

Linear Algebra Tools for Data Mining (Second Edition)

In Example 6.11, we have shown that if P is an idempotent matrix, then for every matrix norm μ we have μ(P ) = 0 or μ(P )  1. For orthogonal projection matrices of the form PU , where U is a non-zero subspace, we have |||PU |||2 = 1. Indeed, we can write x = (x − projU (x)) + projU (x), so x22 = x − projU (x)22 + projU (x)22  projU (x)22 . Thus, x2  PU (x) for any x ∈ Cn , which implies   PU x2  n |||PU |||2 = sup x ∈ C − {0}}  1. x2 This implies |||PU |||2 = 1. The next theorem shows that the best approximation of a vector x by a vector in a subspace U (in the sense of Euclidean distance) is the orthogonal projection of x on U . Theorem 6.47. Let U be an m-dimensional subspace of Cn and let x ∈ Cn . The minimal value of d2 (x, u), the Euclidean distance between x and an element u of the subspace U , is achieved when u = projU (x). Proof. We saw that x can be uniquely written as x = y+projU (x), where y ∈ U ⊥ . Let now u be an arbitrary member of U . We have d2 (x, u)2 = x − u22 = (x − projU (x)) + (projU (x) − u)22 . Since x − projU (x) ∈ U ⊥ and projU (x) − u ∈ U , it follows that these vectors are orthogonal. Thus, we can write d2 (x, u)2 = (x − projU (x))22 + (projU (x) − u)22 , which implies that d2 (x, u)  d2 (x − projU (x)).



The orthogonal projections associated with subspaces allow us to define a metric on the collection of subspaces of Cn . Indeed, if S and T are two subspaces of Cn , we define dF (S, T ) = PS − PT F . When using the vector norm  · 2 and the metric induced by this norm on Cn , we denote the corresponding metric on subspaces by d2 .

Norms and Inner Products

403

Example 6.25. Let u, w be two distinct unit vectors in the linear space V. The orthogonal projection matrices of u and w are uu and ww , respectively. Thus, dF (u, w) = uu − vv  F . Suppose now that V = R2 . Since u and w are unit vectors in R2 , there exist α, β ∈ [0, 2π] such that     cos α cos β u= and w = . sin α sin β Thus, we can write   cos2 α − cos2 β cos α sin α − cos β sin β   uu − vv = cos α sin α − cos β sin β sin2 α − sin2 β and dF (u, w) =

√ 2| sin(α − β)|.

We could use any matrix norm in the definition of the distance between subspaces. For example, we could replace the Frobenius norm by ||| · |||1 or by ||| · |||2 . Let S be a subspace of Cn and let x ∈ Cn . The distance between x and S defined by the norm  ·  is d(x, S) = x − projS (x) = x − PS x = (I − PS )x. Theorem 6.48. Let S and T be two non-zero subspaces of Cn and let δS = max{d2 (x, T ) | x ∈ S, x2 = 1}, δT = max{d2 (x, S) | x ∈ T, x2 = 1}. We have d2 (S, T ) = max{δS , δT }. Proof.

If x ∈ S and x2 = 1, we have d2 (x, T ) = x − PT x2 = PS x − PT x2 = (PS − PT )x2  |||PS − PT |||2 .

Therefore, δS  |||PS − PT |||2 . Similarly, δT  |||PS − PT |||2 , so max{δS , δT }  d2 (S, T ).



Note that δS = max{(I − PT )x2 | x ∈ S, x2 = 1}, δT = max{(I − PS )x2 | x ∈ T, x2 = 1}, so, taking into account that PS x ∈ S and PT x ∈ T for every x ∈ Cn , we have (I − PS )PT x2  δS PT x2 , (I − PT )PS x2  δT PS x2 . We have PT (I − PS )x22 = (PT (I − PS )x, PT (I − PS )x) = ((PT )2 (I − PS )x, (I − PS )x) = (PT (I − PS )x, (I − PS )x) = (PT (I − PS )x, (I − PS )2 x) = ((I − PS )PT (I − PS )x, (I − PS )x) because both PS and I − PS are idempotent and self-adjoint. Therefore, PT (I − PS )x22  (I − PS )PT (I − PS )x2 (I − PS )x2  δT PT (I − PS )x2 (I − PS )x2. This allows us to infer that PT (I − PS )x2  δT (I − PS )x2.



We discuss now four fundamental subspaces associated to a matrix $A \in \mathbb{C}^{m\times n}$. The range and the null space of $A$, $\mathrm{range}(A) \subseteq \mathbb{C}^m$ and $\mathrm{null}(A) \subseteq \mathbb{C}^n$, have been already discussed in Chapter 3. We add now two new subspaces: $\mathrm{range}(A^H) \subseteq \mathbb{C}^n$ and $\mathrm{null}(A^H) \subseteq \mathbb{C}^m$.

Theorem 6.49. For every matrix $A \in \mathbb{C}^{m\times n}$, we have $(\mathrm{range}(A))^{\perp} = \mathrm{null}(A^H)$.





Proof. The statement follows from the equivalence of the following statements:
(i) $x \in (\mathrm{range}(A))^{\perp}$;
(ii) $(x, Ay) = 0$ for all $y \in \mathbb{C}^n$;
(iii) $x^H Ay = 0$ for all $y \in \mathbb{C}^n$;
(iv) $y^H A^H x = 0$ for all $y \in \mathbb{C}^n$;
(v) $A^H x = 0$;
(vi) $x \in \mathrm{null}(A^H)$. □

Corollary 6.20. For every matrix $A \in \mathbb{C}^{m\times n}$, we have $(\mathrm{range}(A^H))^{\perp} = \mathrm{null}(A)$.

Proof. This statement follows from Theorem 6.49 by replacing $A$ by $A^H$. □



Corollary 6.21. For every matrix A ∈ Cm×n , we have Cm = range(A) ⊕ null(AH ), Cn = null(A) ⊕ range(AH ).

Proof. By Theorem 6.37, we have $\mathbb{C}^m = \mathrm{range}(A) \oplus \mathrm{range}(A)^{\perp}$ and $\mathbb{C}^n = \mathrm{null}(A) \oplus \mathrm{null}(A)^{\perp}$. Taking into account Theorem 6.49 and Corollary 6.20, we obtain the desired equalities. □
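Corollary 6.21 can be visualized numerically with the built-in functions orth and null; the matrix A below is an arbitrary rank-2 matrix (sketch).

A = randn(4,2)*randn(2,6);       % a rank-2 matrix in R^{4x6}
RA  = orth(A);                   % orthonormal basis of range(A)
NAH = null(A');                  % orthonormal basis of null(A')
norm(RA'*NAH)                    % ~ 0: range(A) is orthogonal to null(A')
size(RA,2) + size(NAH,2)         % = 4, so R^4 = range(A) (+) null(A')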

6.13 Positive Definite and Positive Semidefinite Matrices

Definition 6.26. A matrix A ∈ Cn×n is positive definite if xH Ax is a real positive number for every x ∈ Cn − {0}. Theorem 6.50. If A ∈ Cn×n is positive definite, then A is Hermitian. Proof. Let A ∈ Cn×n be a matrix. Since xH Ax is a real number, it follows that it equals its conjugate, so xH Ax = xH AH x for every x ∈ Cn . By Theorem 3.16, there exists a unique pair of Hermitian matrices H1 and H2 such that A = H1 + iH2 , which implies AH = H1H − iH2H . Thus, we have xH (H1 + iH2 )x = xH (H1H − iH2H )x = xH (H1 − iH2 )x, because H1 and H2 are Hermitian. This implies xH H2 x = 0 for every x ∈ Cn , which, in turn, implies H2 = On,n . Consequently, A = H1 ,  so A is indeed Hermitian.


Definition 6.27. A matrix A ∈ Cn×n is positive semidefinite if xH Ax is a non-negative real number for every x ∈ Cn − {0}. If xH Ax is a non-positive real number for every x ∈ Cn − {0}, then we say that A is a positive definite matrix. Positive definiteness (positive semidefiniteness) is denoted by A  0 (A  0, respectively). The definition of positive definite (semidefinite) matrix can be specialized for real matrices as follows. Definition 6.28. A symmetric matrix A ∈ Rn×n is positive definite if x Ax > 0 for every x ∈ Rn − {0}. If A satisfies the weaker inequality x Ax  0 for every x ∈ Rn − {0}, then we say that A is positive semidefinite. A  0 denotes that A is positive definite and A  0 means that A is positive semidefinite. Note that in the case of real-valued matrices, we need to require explicitly the symmetry of the matrix because, unlike the complex case, the inequality x Ax > 0 for x ∈ Rn − {0n } does not imply the symmetry of A. For example, consider the matrix   a b A= , −b a where a, b ∈ R and a > 0. We have    a b x1  = a(x21 + x22 ) > 0 x Ax = (x1 x2 ) −b a x2 if x = 02 . Example 6.26. The symmetric real matrix   a b A= b c is positive definite if and only if a > 0 and b2 −ac < 0. Indeed, we have x Ax > 0 for every x ∈ R2 −{0} if and only if ax21 +2bx1 x2 +cx22 > 0, where x = (x1 x2 ); elementary algebra considerations lead to a > 0 and b2 − ac < 0.


A positive definite matrix is non-singular. Indeed, if Ax = 0, where A ∈ Rn×n is positive definite, then xH Ax = 0, so x = 0. By Corollary 3.2, A is non-singular. Example 6.27. If A ∈ Cm×n , then the matrices AH A ∈ Cn×n and AAH ∈ Cm×m are positive semidefinite. For x ∈ Cn , we have xH (AH A)x = (xH AH )(Ax) = (Ax)H (Ax) = Ax22  0. The argument for AAH is similar. If rank(A) = n, then the matrix AH A is positive definite because H x (AH A)x = 0 implies Ax = 0, which, in turn, implies x = 0. Theorem 6.51. If A ∈ Cn×n is a positive definite matrix, then any ! i1 · · · ik is a positive definite matrix. principal submatrix B = A i1 · · · ik Proof. Let x ∈ Cn − {0} be a vector such that all components located on ! positions other than i1 , . . . , ik equal 0 and let i1 · · · ik ∈ Ck be the vector obtained from x by retaining y=x 1 only the components located on positions i1 , . . . , ik . Since y H By = xH Ax > 0, it follows that B  0.  Corollary 6.22. If A ∈ Cn×n is a positive definite matrix, then any diagonal element aii is a real positive number for 1  i  n. Proof. This statement follows immediately from Theorem 6.51 by observing that every diagonal element of A is a 1 × 1 principal sub matrix of A. Theorem 6.52. If A, B ∈ Cn×n are two positive semidefinite matrices and a, b are two non-negative numbers, then aA + bB  0. Proof. The statement holds because xH (aA + bB)x = axH Ax + bxH Bx  0, due to the fact that A and B are positive semidefinite. 

Theorem 6.53. Let A ∈ Cn×n be a positive definite matrix and let S ∈ Cn×m . The matrix S H AS is positive semidefinite and has the same rank as S. Moreover, if rank(S) = m, then S H AS is positive definite.


Proof. Since A is positive definite, it is Hermitian and (S H AS)H = S H AS implies that S H AS is a Hermitian matrix. Let x ∈ Cm . We have xH S H ASx = (Sx)H A(Sx)  0 because A is positive definite. Thus, the matrix S H AS is positive semidefinite. If Sx = 0, then S H ASx = 0; conversely, if S H ASx = 0, then xH S H ASx = 0, so Sx = 0. This allows us to conclude that null(S) = null(S H AS). Therefore, by Equality (3.8), we have rank(S) = rank(S H AS). Suppose now that rank(S) = m and that xH S H ASx = 0. Since A is positive definite, we have Sx = 0 and this implies x = 0, because of the assumption made relative to rank(S). Consequently, S H AS is positive definite.  Corollary 6.23. Let A ∈ Cn×n be a positive definite matrix and let S ∈ Cn×n . If S is non-singular, then so is S H AS. Proof.

This is an immediate consequence of Theorem 6.53.



Theorem 6.54. A Hermitian matrix B ∈ Cn×n is positive definite if and only if the mapping f : Cn × Cn −→ C given by f (x, y) = xH By for x, y ∈ Cn defines an inner product on Cn . Proof. Suppose that B defines an inner product on Cn . Then, by Property (iii) of Definition 6.15 we have f (x, x) > 0 for x = 0, which amounts to the positive definiteness of B. Conversely, if B is positive definite, then f satisfies the condition from Definition 6.15. We show here only that f has the conjugate symmetry property. We can write f (y, x) = y H Bx = y  Bx for x, y ∈ Cn . Since B is Hermitian, B = B H = B  , so f (y, x) = y  B  x. Observe that y  B  x is a number (that is, a 1 × 1 matrix), so (y  B  x) = xH By = f (x, y).



Corollary 6.24. A symmetric matrix B ∈ Rn×n is positive definite if and only if the mapping f : Rn × Rn −→ R given by f (x, y) = x By for x, y ∈ Rn defines an inner product on Rn .


Proof.

This follows immediately from Theorem 6.54.




Definition 6.29. Let S = (v 1 , . . . , v m ) be a sequence of vectors in Rn . The Gram matrix of S is the matrix GS = (gij ) ∈ Rm×m defined by gij = v i v j for 1  i, j  m. Note that if AS = (v 1 · · · v m ) ∈ Rn×m , then GS = AS AS . Also, note that GS is a symmetric matrix. Example 6.28. Let ⎛

⎞ ⎛ ⎞ ⎛ ⎞ 1 1 2 v 1 = ⎝ 0 ⎠, v 2 = ⎝2⎠, v 3 = ⎝1⎠. −1 2 0 The Gram matrix of the set S = {v 1 , v 2 , v 3 } is ⎛ ⎞ 2 −1 2 GS = ⎝−1 9 4⎠. 2 4 5 Note that det(GS ) = 1. The Gram matrix of an arbitrary sequence of vectors is positive semidefinite, as the reader can easily verify. Theorem 6.55. Let S = (v 1 , . . . , v m ) be a sequence of m vectors in Rn , where m  n. If S is linearly independent, then the Gram matrix GS is positive definite. Proof. Let S be a linearly independent sequence of vectors, S = (v 1 , . . . , v m ), and let x ∈ Rm . We have x GS x = x AS AS x = (AS x) AS x = AS x22 . Therefore, if x GS x = 0, we have AS x = 0, which is equivalent to x1 v 1 + · · · + xn vn = 0. Since {v 1 , . . . , v m } is linearly independent, it follows that x1 = · · · = xm = 0, so x = 0n .  Thus, GS is indeed positive definite. Definition 6.30. Let S = (v 1 , . . . , v m ) be a sequence of m vectors in Rn , where m  n. The Gramian of S is the number det(GS ). Theorem 6.56. If S = (v 1 , . . . , v m ) is a sequence of m vectors in Rn , then S is linearly independent if and only if det(GS ) = 0.


Proof. Suppose that det(GS ) = 0 and that S is not linearly independent. In other words, the numbers a1 , . . . , am exist such that at least one of them is not 0 and a1 v 1 + · · · + am v m = 0Rn . This implies the equalities a1 (v 1 , v j ) + · · · + am (v m , v j ) = 0 for 1  j  m, so the system GS a = 0n has a non-trivial solution in a1 , . . . , am . This implies det(GS ) = 0, which contradicts the initial assumption. Conversely, suppose that S is linearly independent and det(GS ) = 0. Then the linear system a1 (v 1 , v j ) + · · · + am (v m , v j ) = 0 for 1  j  m, has a non-trivial solution in a1 , . . . , am . Let w = a1 v1 +· · · am v m . We have (w, v i ) = 0 for 1  i  n. This, in turn, implies (w, w) = w22 = 0, so w = 0, which contradicts the  linear independence of S. Theorem 6.57 (Cholesky’s decomposition theorem). Let A ∈ Cn×n be a Hermitian positive definite matrix. There exists a unique upper triangular matrix R with real, positive diagonal elements such that A = RH R. Proof. The argument is by induction on n  1. The base step, n = 1, is immediate. Suppose that a decomposition exists for all Hermitian positive matrices of order n, and let A ∈ C(n+1)×(n+1) be a symmetric and positive definite matrix. We can write  A=

 a11 aH , a B

where B ∈ Cn×n . By Theorem 6.51, a11 > 0 and B ∈ Cn×n is a Hermitian positive definite matrix. It is easy to verify the following

Norms and Inner Products

identity: A=

√

a11 0 √ 1 a In a11



1 0 0 B − a111 aaH


√ a11 0

√ 1 aH a11

In

 .

(6.29)

Let R1 ∈ Cn×n be the upper triangular non-singular matrix  √ a11 √a111 aH . R1 = 0 In This allows us to write

 A = R1 H

where A1 = B −

 1 0 R1 , 0 A1

1 H a11 aa .



Since  1 0 = (R1−1 )H AR1−1 , 0 A1

by Theorem 6.53, the matrix



1 0 0 A1



is positive definite, which allows us to conclude that the matrix A1 = B − a111 aaH ∈ Cn×n is a Hermitian positive definite matrix. By the inductive hypothesis, A1 can be factored as A1 = P H P, where P is an upper triangular matrix. This allows us to write      1 0 1 0 1 0 . = 0 A1 0 PH 0 P Thus,

 A = R1H

1 0 0 PH



If R is defined as  √   a11 1 0 1 0 R1 = R= 0 P 0 P 0

 1 0 R1 . 0 P

√ 1 aH a11

In



√ a11 = 0

√ 1 aH a11

then A = RH R and R is clearly an upper triangular matrix.

P

 , 


We refer to the matrix R as the Cholesky factor of A. Corollary 6.25. If A ∈ Cn×n is a Hermitian positive definite matrix, then det(A) > 0. Proof. By Corollary 5.2, det(A) is a real number. By Theorem 6.57, A = RH R, where R is an upper triangular matrix with real, positive diagonal elements, so det(A) = det(RH ) det(R) = (det(R))2 . Since det(R) is the product of its diagonal elements, det(R) is a real,  positive number, which implies det(A) > 0. Example 6.29. Let A be the symmetric matrix ⎛ ⎞ 3 0 2 A = ⎝0 2 1⎠. 2 1 2 We leave it to the reader to verify that this matrix is indeed positive definite starting from Definition 6.26. By Equality (6.29), the matrix A can be written as ⎞ ⎞⎛ ⎞⎛√ ⎛√ 3 0 0 3 0 √23 1 0 0 A = ⎝ 0 1 0⎠⎝0 2 1 ⎠⎝ 0 1 0 ⎠, √2 0 1 0 1 23 0 0 1 3 because

 A1 =

      1 0  2 1 2 1 0 2 = − . 1 2 1 23 3 2

Applying the same equality to A1 , we have  √ √ 2 0 2 1 0 A1 = √1 1 1 0 6 0 2

√1 2

1

 .

Since the matrix ( 16 ) can be factored directly, we have   √  √ 1 0 1 0 2 0 2 √12 A1 = √1 0 √16 0 √16 1 0 1 2 √  √ 1 2 √2 2 0 . = √1 √1 0 √16 2 6


In turn, this ⎛ 1 ⎝0 0

implies ⎞ ⎛ 1 √0 0 0 2 2 1 ⎠ = ⎝0 0 √12 1 23

⎞⎛ 1 √0 0 ⎜ 2 0 ⎠⎝0 √1 0 0 6


0



√1 ⎟, 2⎠ √1 6

which produces the following Cholesky final decomposition of A: ⎞⎛√ ⎞⎛ ⎞⎛ ⎞ ⎛√ 1 √0 0 1 √0 0 3 0 0 3 0 √23 2 √12 ⎟ 2 0 ⎠⎜ A = ⎝ 0 1 0⎠⎝0 ⎝0 ⎠⎝ 0 1 0 ⎠ 1 2 1 1 √ 0 √2 √6 0 1 0 0 √6 0 0 1 3 √ ⎛ ⎞ ⎞ ⎛√ 3 0 √23 3 √0 0 √ 2 0 ⎠⎜ 2 √12 ⎟ =⎝ 0 ⎝ 0 ⎠. √2 √1 √1 √1 0 0 3 2 6 6 Cholesky’s decomposition theorem can be extended to positive semi-definite matrices. Theorem 6.58 (Cholesky’s decomposition theorem for positive semidefinite matrices). Let A ∈ Cn×n be a Hermitian positive semidefinite matrix. There exists an upper triangular matrix R with real, non-negative diagonal elements such that A = RH R. Proof. The argument is similar to the one used for Theorem 6.57  and is omitted. Observe that for positive semidefinite matrices, the diagonal elements of R are non-negative numbers and the uniqueness of R does not longer holds.   1 −1 Example 6.30. Let A = . Since x Ax = (x1 − x2 ), it −1 1 is clear that A is a positive semidefinite but not a positive definite matrix. Let R be a matrix of the form   r1 r R= 0 r2 such that A = R R. It is easy to see that the last equality is equivalent to r12 = r22 = 1 and rr1 = −1. Thus, we have for distinct Cholesky factors, the following matrices:         1 −1 1 −1 −1 1 −1 1 , , , . 0 1 0 −1 0 1 0 −1
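The decomposition of Example 6.29 can be reproduced with the built-in function chol, which returns the upper triangular Cholesky factor (sketch).

A = [3 0 2; 0 2 1; 2 1 2];
R = chol(A);        % upper triangular with positive diagonal, such that A = R'*R
norm(R'*R - A)      % ~ 0
% [~, p] = chol(A) returns p = 0 exactly when A is (numerically) positive definite,
% so chol can also be used as a positive definiteness test.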


Theorem 6.59. A Hermitian matrix A ∈ Cn×n is positive definite if and only if all its leading principal minors are positive. Proof. By Theorem 6.51, if A is positive definite, then every principal submatrix is positive definite, so by Corollary 6.25, each principal minor of A is positive. Conversely, suppose that A ∈ Cn×n is a Hermitian matrix having positive leading principal minors. We prove by induction on n that A is positive definite. The base case, n = 1, is immediate. Suppose that the statement holds for matrices in C(n−1)×(n−1) . Note that A can be written as   B b , A= bH a where B ∈ C(n−1)×(n−1) is a Hermitian matrix. Since the leading minors of B are the first n − 1 leading minors of A, it follows, by the inductive hypothesis, that B is positive definite. Thus, there exists a Cholesky decomposition B = RH R, where R is an upper triangular matrix with real, positive diagonal elements. Since R is invertible, let w = (RH )−1 b. The matrix B is invertible. Therefore, by Theorem 5.14, we have det(A) = det(B)(a − bH B −1 b) > 0. Since det(B) > 0, it follows that a  bH B −1 b. We observed that if B is positive definite, then so is B −1 . Therefore, a  0 is and we can write a = c2 for some positive c. This allows us to write    H 0 R R w = C H C, A= 0H c wsH c where C is the upper triangular matrix with positive e   R w . C= 0H c This implies immediately the positive definiteness of A.



Cn×n .

Let A, B ∈ We write A  B if A − B  0, that is, if A − B is a positive definite matrix. Similarly, we write A  B if A − B  O, that is, if A − B is positive semidefinite. Theorem 6.60. Let A0 , A1 , . . . , Am be m + 1 matrices in Cn×n such that A0 is positive definite and all matrices are Hermitian. There


exists a > 0 such that for any t ∈ [−a, a] the matrix Bm (t) = A0 + A1 t + · · · + Am tm is positive definite. Proof. Since all matrices A0 , . . . , Am are Hermitian, note that xH Ai x are real numbers for 0  i  m. Therefore, pm (t) = xH Bm (t)x is a polynomial in t with real coefficients and pm (0) = xH A0 x is a positive number if x = 0. Since pm is a continuous function, there exists an interval [−a, a] such that t ∈ [−a, a] implies pm (t) > 0 if  x = 0. This shows that Bm (t) is positive definite. 6.14

6.14 The Gram–Schmidt Orthogonalization Algorithm

The Gram–Schmidt algorithm (Algorithm 6.14.1) constructs an orthonormal basis for a subspace U of Cn , starting from an arbitrary basis of {u1 , . . . , um } of U . The orthonormal basis is constructed sequentially such that w1 , . . . , w k  = u1 , . . . , uk  for 1  k  m. Algorithm 6.14.1: Gram–Schmidt Orthogonalization Algorithm Data: A basis {u1 , . . . , um } for a subspace U of Cn Result: An orthonormal basis {w1 , . . . , w m } for U 1 W = On,m ; 1 2 W (:, 1) = W (:, 1) + U (:,1)2 U (:, 1); 3 for k = 2 to m do 4 P = In − W (:, 1 : (k − 1))W (:, 1 : (k − 1))H ; 1 5 W (:, k) = W (:, k) + P U (:,k) P U (:, k); 2 6 end 7 return W = (w 1 · · · w m ); Next, we prove the correctness of the Gram–Schmidt algorithm. Theorem 6.61. Let (w1 , . . . , w m ) be the sequence of vectors constructed by the Gram–Schmidt algorithm starting from the basis {u1 , . . . , um } of an m-dimensional subspace U of Cn . The set {w 1 , . . . , wm } is an orthogonal basis of U and w1 , . . . , wk  = u1 , . . . uk  for 1  k  m.


Proof. In the algorithm the matrix W is initialized as On,m . Its columns will contain eventually the vectors of the orthonormal basis w1 , . . . , w m . The argument is by induction on k  1. The base case, k = 1, is immediate. Suppose that the statement of the theorem holds for k, that is, the set {w1 , . . . , w k } is an orthonormal basis for Uk = u1 , . . . , uk  and constitutes the set of the initial k columns of the matrix W , that is, Wk = W (:, 1 : k). Then, Pk = In − Wk WkH is the projection matrix on the subspace Uk⊥ , so Pk uk is orthogonal on every wi , where 1  i  k. Therefore, wk+1 = W (:, (k + 1)) is a unit vector orthogonal on all its predecessors w 1 , . . . , wk , so {w1 , . . . , wm } is an orthonormal set. The equality u1 , . . . , uk  = w1 , . . . , w k  clearly holds for k = 1. Suppose that it holds for k. Then, we have 1 (uk+1 − Wk WkH uk+1 ) Pk uk+1 2 1 (uk+1 − (w 1 · · · wk )WkH uk+1 ) . = Pk uk+1 2

wk+1 =

Since w1 , . . . , wk belong to the subspace u1 , . . . , uk  (by inductive hypothesis), it follows that wk+1 ∈ u1 , . . . , uk , uk+1 , so w1 , . . . , w k+1  ⊆ u1 , . . . , uk . For the converse inclusion, since uk+1 = Pk uk+1 2 w k+1 + (w1 · · · wk )WkH uk+1 , it follows that uk+1 ∈ w1 , . . . , w k , w k+1 . Thus, u1 , . . . , uk , uk+1  ⊆ w1 , . . . , wk , wk+1 .  Example 6.31. Let A ∈ R3×2 be the matrix ⎛

⎞ 1 1 A = ⎝0 0⎠. 1 3 It is easy to see that rank(A) = 2. We have {u1 , u2 } ⊆ R3 and we construct an orthogonal basis for the subspace generated by these columns. The matrix W is initialized to O3,2


By Algorithm 6.14.1, we begin by defining
$$w_1 = \frac{1}{\|u_1\|_2}u_1 = \begin{pmatrix}\frac{\sqrt{2}}{2}\\ 0\\ \frac{\sqrt{2}}{2}\end{pmatrix},$$
so
$$W = \begin{pmatrix}\frac{\sqrt{2}}{2} & 0\\ 0 & 0\\ \frac{\sqrt{2}}{2} & 0\end{pmatrix}.$$
The projection matrix is
$$P = I_3 - W(:,1)W(:,1)^{\top} = I_3 - w_1 w_1^{\top} = \begin{pmatrix}\frac{1}{2} & 0 & -\frac{1}{2}\\ 0 & 1 & 0\\ -\frac{1}{2} & 0 & \frac{1}{2}\end{pmatrix}.$$
The projection of $u_2$ is
$$Pu_2 = \begin{pmatrix}-1\\ 0\\ 1\end{pmatrix}$$
and the second column of $W$ becomes
$$w_2 = W(:,2) = \frac{1}{\|Pu_2\|_2}Pu_2 = \begin{pmatrix}-\frac{\sqrt{2}}{2}\\ 0\\ \frac{\sqrt{2}}{2}\end{pmatrix}.$$
Thus, the orthonormal basis we are seeking consists of the vectors
$$\begin{pmatrix}\frac{\sqrt{2}}{2}\\ 0\\ \frac{\sqrt{2}}{2}\end{pmatrix} \quad\text{and}\quad \begin{pmatrix}-\frac{\sqrt{2}}{2}\\ 0\\ \frac{\sqrt{2}}{2}\end{pmatrix}.$$
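The computation of Example 6.31 can be reproduced with a direct MATLAB transcription of Algorithm 6.14.1; the function name gram_schmidt is ours, and no attempt is made here to address the numerical issues discussed next.

function W = gram_schmidt(U)
% Orthonormalizes the columns of U following Algorithm 6.14.1.
% The columns of U are assumed to be linearly independent.
  [n, m] = size(U);
  W = zeros(n, m);
  W(:,1) = U(:,1)/norm(U(:,1));
  for k = 2:m
    P = eye(n) - W(:,1:k-1)*W(:,1:k-1)';   % projection on the orthogonal complement
    t = P*U(:,k);
    W(:,k) = t/norm(t);
  end
end

For the matrix A of Example 6.31, gram_schmidt(A) returns the two columns computed above.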

To decrease the sensitivity of the Gram–Schmidt algorithm to rounding errors, a goal often referred to as numerical stability, a modified Gram–Schmidt algorithm has been proposed. In the modified variant, lines 4 and 5 of Algorithm 6.14.1 (in which projection on the subspace w1 , . . . , w k  is computed) are replaced by a computation of this projection using the equality In − Wk WkH = (In − wk wHk ) · · · (In − w1 w H1 ) of Supplement 6.20. The modified Gram–Schmid algorithm is Algorithm 6.14.2.


Algorithm 6.14.2: The Modified Gram–Schmidt Algorithm Data: A basis {u1 , . . . , um } for a subspace U of Cn Result: An orthonormal basis {w1 , . . . , w m } for U 1 W = On,m ; 1 2 W (:, 1) = W (:, 1) + U (:,1)2 U (:, 1); 3 for k = 2 to m do 4 t = U (:, k); 5 for j = 1 to k do 6 t = (In − W (:, j)W (:, j)H )t; 7 end 1 8 W (:, k) = W (:, k) + t t; 2 9 end 10 return W = (w 1 · · · w m ); Theorem 6.62. If L = (v 1 , . . . , v m ) is a sequence of m vectors in Rn , we have m  v j 22 . det(GL )  j=1

The equality takes place only if the vectors of L are pairwise orthogonal. Proof. Suppose that L is linearly independent and construct the orthonormal set {y 1 , . . . , y m } as y j = bj1 v 1 + · · · + bjj v j for 1  j  m, using Gram–Schmidt algorithm. Since bjj = 0, it follows that we can write v j = cj1 y 1 + · · · + cjj y j for 1  j  m so that (v j , y p ) = 0 if j < p and (v j , y p ) = cjp if p  j. Thus, we have ⎞ ⎛ (v 1 , y 1 ) (v 2 , y 1 ) · · · (v m , y 1 ) ⎜ 0 (v 2 , y 2 ) · · · (v m , y 2 ) ⎟ ⎟ ⎜ ⎜ 0 0 · · · (v m , y 3 ) ⎟ (v 1 , . . . , v m ) = (y 1 , . . . , y m )⎜ ⎟. ⎟ ⎜ .. .. .. .. ⎠ ⎝ . . . . 0 0 · · · (v m , y m )


This implies


⎞ ⎛ ⎞ v1 (v 1 , v 1 ) · · · (v 1 , v m ) ⎟ ⎜ .. ⎟ ⎜ .. .. ⎠ = ⎝ . ⎠(v 1 , . . . , v m ) ⎝ . ··· . vm (v m , v 1 ) · · · (v m , v m ) ⎞ ⎛ (v 1 , y 1 ) 0 0 ⎟ ⎜ (v 2 , y ) (v 2 , y ) 0 1 2 ⎟ ⎜ =⎜ ⎟ .. .. .. ⎠ ⎝ . . . (v m , y 1 ) (v m , y 2 ) (v m , y m ) ⎞ ⎛ (v 1 , y 1 ) (v 2 , y 1 ) · · · (v m , y 1 ) ⎜ 0 (v 2 , y 2 ) · · · (v m , y 2 ) ⎟ ⎟ ⎜ ×⎜ ⎟. .. .. .. .. ⎠ ⎝ . . . . 0 0 · · · (v m , y m ) ⎛

Therefore, we have m m   2 (v i , y i )  (v i , v i )2 , det(GL ) = i=1

i=1

v i )2 (y i , y i )2 and (y i , y i ) = 1 for 1  i  m. because (v i , y i )2  (v i ,  2 To have det(GL ) = m i=1 (v i , v i ) , we must have v i = ki y i , that  is, the vectors v i must be pairwise orthogonal. Corollary 6.26. Let A ∈ Rn×n be a matrix such that |aij |  1 for n 1  i, j  n. Then | det(A)|  n 2 and the equality holds only if A is a Hadamard matrix. Proof. √Let ai = (ai1 , . . . , ain ) be the ith row of A. We have ai 2  n, so |(ai , aj )|  n by the Cauchy–Schwartz inequality, for 1  i, j  n.  2 Note that L = A A, where L is the set of rows of A, so det(A) = G m 2 det(GL )  j=1 v j 2 , so | det(A)| 

m 

n

v j 2  n 2 .

j=1

The equality takes place only when the vectors of A are mutually  orthogonal, so when A is a Hadamard matrix.


We saw that orthonormal sets of vectors that do not contain 0 are linearly independent. The Extension Corollary (Corollary 2.3) can now be specialized for orthonormal sets. Theorem 6.63. Let V be a finite-dimensional linear space. If U is an orthonormal set of vectors, then there exists a basis T of V that consists of orthonormal vectors such that U ⊆ T . Proof. Let U = {u1 , . . . , um } be an orthonormal set of vectors in V. There is an extension of U , Z = {u1 , . . . , um , um+1 , . . . , un } to a basis of V, where n = dim(V ), by the Extension Corollary. Now, apply the Gram–Schmidt algorithm to the set U to produce an orthonormal basis W = {w 1 , . . . , w n } for the entire space V. It is easy to see that wi = ui for 1  i  m, so U ⊆ W and W is the  orthonormal basis of V that extends the set U . Corollary 6.27. If A ∈ Cm×n is a matrix with m  n having orthonormal set of columns, then there exists a matrix B ∈ Cm×(m−n) such that (A B) is an orthogonal (unitary) square matrix. Proof.

This follows directly from Theorem 6.63.



Corollary 6.28. Let U be a subspace of an n-dimensional linear space V such that dim(U ) = m, where m < n. Then dim(U ⊥ ) = n − m. Proof.

Let u1 , . . . , um be an orthonormal basis of U , and let u1 , . . . , um , um+1 , . . . , un

be its completion to an orthonormal basis for V, which exists by Theorem 6.63. Then um+1 , . . . , un is a basis of the orthogonal complement U ⊥ , so dim(U ⊥ ) = n − m.  Theorem 6.64. A subspace U of Rn is m-dimensional if and only if it is the set of solutions of a homogeneous linear system Ax = 0, where A ∈ R(n−m)×n is a full-rank matrix. Proof. Suppose that U is an m-dimensional subspace of Rn . If v 1 , . . . , v n−m is a basis of the orthogonal complement of U , then


v_i^⊤ x = 0 for every x ∈ U and 1 ≤ i ≤ n − m. These conditions are equivalent to the equality Ax = 0, where A ∈ R^{(n−m)×n} is the full-rank matrix whose rows are v_1^⊤, v_2^⊤, . . . , v_{n−m}^⊤, which shows that U is the set of solutions of a homogeneous linear system. Conversely, if A ∈ R^{(n−m)×n} is a full-rank matrix, then the set of solutions of the homogeneous system Ax = 0 is the null subspace of A and, therefore, it is an m-dimensional subspace.
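A minimal MATLAB illustration of this correspondence, for an arbitrarily chosen two-dimensional subspace of R^4 (the matrix B below is a made-up example): the rows of A form a basis of the orthogonal complement of U, and null(A) recovers a subspace of the same dimension as U.

    % Represent a subspace U of R^4 as the solution set of a homogeneous system A*x = 0.
    B = [1 0; 0 1; 1 1; 2 -1];        % the columns of B span a 2-dimensional subspace U
    C = null(B');                     % orthonormal basis of the orthogonal complement of U
    A = C';                           % A is (n-m) x n and has full rank
    disp(norm(A*B));                  % close to 0: every column of B satisfies A*x = 0
    disp(size(null(A),2));            % dimension of the solution set of A*x = 0, equal to 2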

6.15 Change of Bases Revisited

Next, we examine the behavior of vector components relative to change of bases in real linear spaces equipped with inner products. In Section 3.7, we saw that if B = {e_1, . . . , e_n} and B̃ = {ẽ_1, . . . , ẽ_n} are two bases of C^n, then the contravariant components x^k of a vector x relative to the basis B are linked to the contravariant components x̃^k by the equalities x^k = Σ_{i=1}^{n} x̃^i d_{ik}, where D is the matrix defined by ẽ_h = Σ_{k=1}^{n} d_{hk} e_k. If the inner product of two vectors of the basis B is g_{ij} = (e_i, e_j) for 1 ≤ i, j ≤ n, the inner product of the vectors u = Σ_{i=1}^{n} u_i e_i and v = Σ_{j=1}^{n} v_j e_j can be written as

    (u, v) = Σ_{i=1}^{n} Σ_{j=1}^{n} u_i v_j (e_i, e_j) = Σ_{i=1}^{n} Σ_{j=1}^{n} u_i v_j g_{ij},

where g_{ij} = (e_i, e_j) for 1 ≤ i, j ≤ n. The matrix G = (g_{ij}) is the Gram matrix of the vectors of the basis B and, therefore, is a positive definite matrix.

Next, we define the covariant and contravariant components of a vector x in a real n-dimensional linear space L equipped with an inner product.

Definition 6.31. Let L be an n-dimensional inner product real linear space and let B = {e_1, . . . , e_n} be a basis of L. If x ∈ L, the covariant components of x are the numbers x_i = (x, e_i) for 1 ≤ i ≤ n. The contravariant components of x are the numbers x^i defined by x = x^1 e_1 + · · · + x^n e_n.


The application of Definition 6.31 yields

    x_i = (x, e_i) = (Σ_{j=1}^{n} x^j e_j, e_i) = Σ_{j=1}^{n} x^j (e_j, e_i) = Σ_{j=1}^{n} g_{ij} x^j

for 1 ≤ i ≤ n. Conversely, the contravariant components of x can be computed from the covariant components x_i by solving the system

    G (x^1, . . . , x^n)^⊤ = (x_1, . . . , x_n)^⊤,

using Cramer's formula

    x^j = det(G ← j (x_1, . . . , x_n)^⊤) / det(G),

where G ← j c denotes the matrix obtained from G by replacing its jth column by c. Since

    det(G ← j (x_1, . . . , x_n)^⊤) = Σ_{i=1}^{n} g^{ji} x_i,                    (6.30)

it follows that x^j = (Σ_{i=1}^{n} g^{ji} x_i)/g, where g = det(G) and g^{ji} is defined by Equality (6.30). When the basis {e_i | 1 ≤ i ≤ n} is orthonormal, we have g_{ij} = 1 if i = j and g_{ij} = 0 if i ≠ j, where 1 ≤ i, j ≤ n. In this case, x_i = x^i for 1 ≤ i ≤ n, which means that the contravariant components of x coincide with the covariant components of x.
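A small MATLAB illustration of these formulas, using an arbitrarily chosen (non-orthonormal) basis of R^3 (the numbers below are made up for the example): the covariant components are obtained from the Gram matrix, and the contravariant components are recovered by solving the linear system with G.

    % Covariant and contravariant components with respect to a basis E of R^3.
    E = [1 1 0; 0 1 1; 1 0 2]';      % columns e_1, e_2, e_3 form a (non-orthonormal) basis
    G = E' * E;                      % Gram matrix g_ij = (e_i, e_j)
    x = [2; -1; 3];                  % a vector of R^3 given in the standard basis
    xc = E \ x;                      % contravariant components: x = E * xc
    xl = E' * x;                     % covariant components: x_i = (x, e_i)
    disp(norm(G*xc - xl));           % close to 0, since x_i = sum_j g_ij x^j
    disp(norm(G\xl - xc));           % recovering the contravariant components from G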

6.16 The QR Factorization of Matrices

We describe two variants of a factorization algorithm for rectangular matrices. The first algorithm allows us to express a matrix as a product of a rectangular matrix with orthonormal columns and an upper triangular invertible matrix (the thin QR factorization). The second algorithm uses Householder matrices and factorizes a matrix as a product of a square orthogonal matrix and an upper triangular matrix with non-negative diagonal entries (the full QR factorization).

Theorem 6.65 (The Thin QR factorization theorem). Let A ∈ C^{m×n} be a full-rank matrix such that m ≥ n. Then A can be factored as A = QR, where Q ∈ C^{m×n} and R ∈ C^{n×n} are such that
(i) the columns of Q constitute an orthonormal basis for range(A), and
(ii) R = (r_{ij}) is an upper triangular invertible matrix such that its diagonal elements are real non-negative numbers, that is, r_{ii} ≥ 0 for 1 ≤ i ≤ n.

Proof. Let u_1, . . . , u_n be the columns of A. Since rank(A) = n, these columns constitute a basis for range(A). Starting from this set of columns, construct an orthonormal basis w_1, . . . , w_n for the subspace range(A) using the Gram–Schmidt algorithm. Define Q as the matrix with orthonormal columns Q = (w_1 · · · w_n). By the properties of the Gram–Schmidt algorithm, we have ⟨u_1, . . . , u_k⟩ = ⟨w_1, . . . , w_k⟩ for 1 ≤ k ≤ n, so it is possible to write

    u_k = r_{1k} w_1 + · · · + r_{kk} w_k = (w_1 · · · w_n)(r_{1k}, . . . , r_{kk}, 0, . . . , 0)^⊤ = Q (r_{1k}, . . . , r_{kk}, 0, . . . , 0)^⊤.

We can assume that r_{kk} ≥ 0; otherwise, that is, if r_{kk} < 0, replace w_k by −w_k. Clearly, this does not affect the orthonormality of the set {w_1, . . . , w_n}.


It is clear that rank(Q) = n. Therefore, by Corollary 3.7, since rank(A) ≤ min{rank(Q), rank(R)}, it follows that rank(R) = n, so R is an invertible matrix. Therefore, we have r_{kk} > 0 for 1 ≤ k ≤ n.

Example 6.32. Let us determine a QR factorization for the matrix introduced in Example 6.31. We constructed an orthonormal basis for range(A) that consists of the vectors

    w_1 = (1/√2, 0, 1/√2)^⊤   and   w_2 = (−1/√2, 0, 1/√2)^⊤.

Thus, the matrix Q is

    Q = ( 1/√2  −1/√2 )
        (  0      0   )
        ( 1/√2   1/√2 ).

To compute R, we need to express u_1 and u_2 as linear combinations of w_1 and w_2. Since

    u_1 = √2 w_1,
    u_2 = 2√2 w_1 + √2 w_2,

the matrix R is

    R = ( √2  2√2 )
        (  0   √2 ).

Note that the matrix Q obtained by the thin QR decomposition is not a square matrix in general. However, its columns form an orthonormal set. Another variant of the QR decomposition algorithm that produces a first factor Q that is an orthogonal matrix can be obtained using Householder matrices introduced in Example 6.24. We need the following preliminary result.

Theorem 6.66. Let x, y ∈ C^p be two vectors such that ‖x‖ = ‖y‖ and x ≠ y. There exists a Householder matrix H_v ∈ C^{p×p} such that H_v x = y.

Proof. Let H_v = I_p − 2vv^H, where v ∈ C^p is a unit vector. The equality (I_p − 2vv^H)x = y is equivalent to

    x − y = 2vv^H x = 2av,

where a = v^H x. Then, v = (1/(2a))(x − y). Since ‖v‖ = 1, this implies 2a = ‖x − y‖, so

    v = (x − y)/‖x − y‖.
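A minimal MATLAB sketch of this construction in the real case (the specific vectors are made up for the illustration): given two vectors of equal norm, the reflector built from v = (x − y)/‖x − y‖ maps x to y.

    % Householder reflector mapping x to y (real case, with norm(x) == norm(y)).
    x = [3; 4; 0];
    y = [0; 0; 5];                  % same Euclidean norm as x
    v = (x - y) / norm(x - y);      % unit vector defining the reflector
    H = eye(3) - 2 * (v * v');      % Householder matrix
    disp(norm(H*x - y));            % close to 0
    disp(norm(H'*H - eye(3)));      % H is orthogonal (and symmetric, H = H')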

Corollary 6.29. If u, w ∈ C^{1×n} are two row vectors such that ‖u‖_2 = ‖w‖_2 = 1, there exists a Householder matrix H_v such that u = wH_v.

Proof. Clearly, we have ‖u^H‖_2 = ‖w^H‖_2. Thus, by Theorem 6.66, there exists a matrix H_v such that u^H = H_v w^H, which implies u = w H_v^H = w H_v.

Theorem 6.66 is used for describing an iterative process that leads to a QR decomposition of a matrix A ∈ C^{m×n}, where m ≥ n and rank(A) = n.

Theorem 6.67 (The Full QR factorization theorem). Let A ∈ C^{m×n} be a matrix such that m ≥ n and rank(A) = n. Then A can be factored as

    A = Q ( R         )
          ( O_{m−n,n} ),

where Q ∈ C^{m×m} and R ∈ C^{n×n} are such that
(i) Q is a unitary matrix, and
(ii) R = (r_{ij}) is an upper triangular matrix having non-negative diagonal entries.

Proof. A vector x has the same norm as the vector y = ‖x‖ e_1 (which has only one non-zero component, which is non-negative). Therefore, by Theorem 6.66, there exists a Householder matrix H_v such that y = H_v x. Let A = (c_1 c_2 · · · c_n) ∈ C^{m×n}. If we multiply A by the Householder matrix H_{v_1} ∈ C^{m×m} that zeroes all components of the first column c_1 located below the first element, then we obtain a matrix


B_1 ∈ C^{m×n} having the form

    B_1 = H_{v_1} A = [∗ ∗ · · · ∗ ; 0 ∗ · · · ∗ ; ⋮ ; 0 ∗ · · · ∗],

where the asterisks represent components that are not necessarily 0. Next, let

    A_1 = B_1[2, . . . , m ; 2, . . . , n] ∈ C^{(m−1)×(n−1)}

and let H_{v_2} ∈ C^{(m−1)×(m−1)} be a Householder matrix that zeroes the elements of the first column of A_1 below the first element, that is, the subdiagonal elements of the second column of A:

    H_{v_2} A_1 = [∗ ∗ · · · ; 0 ∗ · · · ; ⋮ ; 0 ∗ · · ·] ∈ C^{(m−1)×(n−1)}.

Observe that by multiplying B_1 by the matrix

    [1  0 ; 0  H_{v_2}] ∈ C^{m×m},

the first row and the first column of B_1 are left intact and the submatrix

    ([1  0 ; 0  H_{v_2}] B_1)[2, . . . , m ; 2, . . . , n]

coincides with H_{v_2} A_1. Define

    B_2 = [1  0 ; 0  H_{v_2}] B_1 = [1  0 ; 0  H_{v_2}] H_{v_1} A = [∗ ∗ · · · ∗ ; 0 ∗ · · · ∗ ; 0 0 · · · ∗ ; ⋮ ; 0 0 · · · ∗] ∈ C^{m×n}.


Continuing this construction, we obtain a sequence of matrices B_1, B_2, . . . , B_{n−1} in C^{m×n} such that

    B_{k+1} = [I_k  O ; O  H_{v_{k+1}}] B_k,

where H_{v_{k+1}} zeroes the subdiagonal elements of the (k+1)st column of the matrix B_k, for 1 ≤ k ≤ n − 1. Thus, we have B_n = QA, where Q ∈ C^{m×m} is a unitary matrix and B_n ∈ C^{m×n} can be written as

    B_n = ( R         )
          ( O_{m−n,n} ),

R being an upper triangular matrix in C^{n×n} having non-negative diagonal entries. This allows us to write A = Q^H B_n, which is the desired decomposition of the matrix A.

Observe that by writing Q^H = (Q_0 Q_1), where Q_0 ∈ C^{m×n} and Q_1 ∈ C^{m×(m−n)}, we have A = Q_0 R, which is the thin QR decomposition discussed in Theorem 6.65.

Corollary 6.30. Let A ∈ C^{m×n} be a matrix such that n ≥ m and rank(A) = m. Then, A can be factored as A = (L  O_{m,n−m}) Q, where Q ∈ C^{n×n} and L ∈ C^{m×m} are such that
(i) L = (l_{ij}) is a lower triangular matrix having non-negative diagonal entries, and
(ii) Q is a unitary matrix.


Proof. The statement is obtained by applying Theorem 6.67 to the matrix A^H ∈ C^{n×m}.

By choosing carefully the sequence of Householder matrices, one can prove similar results.

Theorem 6.68. Let A ∈ C^{m×n} be a matrix such that m ≥ n and rank(A) = n. Then, A can be factored as

    A = ( R         ) S,
        ( O_{m−n,n} )

where S ∈ C^{n×n} and R ∈ C^{n×n} are such that
(i) S is a unitary matrix, and
(ii) R is an upper triangular matrix having real non-negative elements on its diagonal ending in the bottom right corner.
Also, if rank(A) = m, then A can be factored as A = T (L  O_{m,n−m}), where T ∈ C^{m×m} and L ∈ C^{m×m} are such that
(i) L is a lower triangular matrix with real non-negative elements on its diagonal ending in the bottom right corner, and
(ii) T is a unitary matrix.

Proof. We prove only the first part of the theorem, which is similar to the proof of the full QR Decomposition Theorem. Let r_1, . . . , r_m be the rows of the matrix A. Clearly, the row vector ‖r_m‖ e_n^H has the same norm as r_m, so, by Corollary 6.29, there exists a unit vector v_m ∈ R^n such that r_m H_{v_m} = ‖r_m‖ e_n^H. Therefore, we have

    B_1 = A H_{v_m} = [∗ ∗ · · · ∗ ∗ ; ∗ ∗ · · · ∗ ∗ ; ⋮ ; 0 0 · · · 0 ∗].

Let

    A_1 = B_1[1, . . . , m−1 ; 1, . . . , n−1] ∈ C^{(m−1)×(n−1)}


and let H_{v_{m−1}} ∈ C^{(n−1)×(n−1)} be a Householder matrix that zeroes the first n − 2 elements of the last row of A_1:

    A_1 H_{v_{m−1}} = [∗ ∗ · · · ∗ ; ⋮ ; 0 · · · 0 ∗] ∈ C^{(m−1)×(n−1)}.

By multiplying B_1 = A H_{v_m} by

    [H_{v_{m−1}}  0_{n−1} ; 0_{n−1}^⊤  1],

the last row and the last column of B_1 are left intact and the submatrix

    (B_1 [H_{v_{m−1}}  0_{n−1} ; 0_{n−1}^⊤  1])[1, . . . , m−1 ; 1, . . . , n−1]

coincides with A_1 H_{v_{m−1}}. Continuing this construction, we obtain a sequence of matrices B_1, B_2, . . . , B_m in C^{m×n} such that

    B_{k+1} = B_k [H_{v_{m−(k+1)}}  0_{n−1} ; 0_{n−1}^⊤  1],

where Hv m−(k+1) zeroes the elements of the (k+1)st row of the matrix Bk placed at the left of the main diagonal, where 1  k  m − 1. Thus, for the last matrix Bm we have Bm = AS, where Bm is an upper triangular matrix and S is a unitary matrix (as a product of unitary matrices).  The second part follows from the first part. The subspaces associated with symmetric and idempotent matrices enjoy a special relationship that we discuss next. Theorem 6.69. Let A ∈ Cn×n be an idempotent matrix. Then, null(A) ⊥ range(A) if and only if the matrix A is Hermitian.


Proof. Let A be an idempotent matrix such that null(A) ⊥ range(A). We have Au ∈ range(A). Also, we have A(I_n − A)v = (A − A^2)v = 0, so (I_n − A)v ∈ null(A). This implies Au ⊥ (I_n − A)v for every u, v ∈ C^n. Thus, (Au)^H (I_n − A)v = u^H A^H (I_n − A)v = 0 for every u, v ∈ C^n, so A^H (I_n − A) = O. Equivalently, A^H = A^H A. Therefore, A = A^H.

Conversely, if A is both idempotent and Hermitian, then

    (Au)^H (I_n − A)v = u^H A^H (I_n − A)v = u^H A(I_n − A)v = u^H (A − A^2)v = 0

for every u, v ∈ C^n, so null(A) ⊥ range(A).



Note that if A is Hermitian and idempotent, then the mapping that transforms a vector x into Ax projects C^n onto the subspace range(A).

Next we examine the possibility of constructing a unitary matrix P for a matrix A ∈ C^{n×n} such that the matrix P^H A P is an upper Hessenberg matrix. To this end, we define a sequence of Householder matrices as follows. Let A_0 = A and write

    A_0 = ( a_{11}  b_0^H )
          ( a_0     B_0   ),

where a_0, b_0 ∈ C^{n−1} and B_0 ∈ C^{(n−1)×(n−1)}. Consider the Householder matrix H_{v_0} (where v_0 ∈ C^{n−1}) such that H_{v_0} a_0 = (∗, 0, . . . , 0)^⊤ and define

    P_1 = ( 1  0       )
          ( 0  H_{v_0} ).


Then

    P_1 A_0 P_1 = ( 1  0       ) ( a_{11}  b_0^H ) ( 1  0       )
                  ( 0  H_{v_0} ) ( a_0     B_0   ) ( 0  H_{v_0} )
                = ( a_{11}        b_0^H H_{v_0}       )
                  ( H_{v_0} a_0   H_{v_0} B_0 H_{v_0} ),

whose first column has the form (a_{11}, ∗, 0, . . . , 0)^⊤.

Define A_1 = H_{v_0} B_0 H_{v_0} and repeat the process on this matrix to obtain a unitary matrix H_{v_1} such that the elements of the first column located under the first subdiagonal element in H_{v_1} A_1 H_{v_1} are 0s. The matrix P_2 is given by

    P_2 = ( I_2  O       )
          ( O    H_{v_1} )

and it is clearly a unitary matrix. We have

    P_2^H P_1^H A P_1 P_2 = [∗ ∗ ∗ · · · ∗ ; ∗ ∗ ∗ · · · ∗ ; 0 ∗ ∗ · · · ∗ ; 0 0 ∗ · · · ∗ ; ⋮ ; 0 0 ∗ · · · ∗].

Continuing in the same manner, we build the matrices P_1, P_2, . . . , P_{n−2} such that C = P_{n−2}^H · · · P_1^H A P_1 · · · P_{n−2} is an upper Hessenberg matrix. If A is a symmetric matrix, then so is C. Therefore, C is actually a tridiagonal matrix, that is, a matrix whose non-zero entries lie only on the main diagonal and on the first sub- and superdiagonals.
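A compact MATLAB sketch of this reduction in the real case is given below. The function name hess_householder and the helper sign0 are my own labels for this didactic version; it is not the built-in hess function, only an illustration of the Householder-based construction described above.

    function H = hess_householder(A)
    %HESS_HOUSEHOLDER reduces a real square matrix to upper Hessenberg form
    % using Householder reflectors (didactic sketch of the construction above).
    n = size(A,1);
    H = A;
    for k = 1:n-2
        x = H(k+1:n, k);                 % entries to be zeroed below the subdiagonal
        v = x;
        v(1) = v(1) + sign0(x(1)) * norm(x);
        if norm(v) > 0
            v = v / norm(v);
            P = eye(n-k) - 2*(v*v');     % Householder reflector acting on rows k+1..n
            H(k+1:n, :) = P * H(k+1:n, :);
            H(:, k+1:n) = H(:, k+1:n) * P;
        end
    end
    end

    function s = sign0(a)
    % sign that maps 0 to 1, to avoid a zero Householder vector
    if a >= 0, s = 1; else, s = -1; end
    end

For instance, H = hess_householder(magic(5)) returns a matrix whose entries below the first subdiagonal are numerically zero and which has the same eigenvalues as magic(5), since only orthogonal similarity transformations are applied.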

6.17 Matrix Groups

One can define on the set of invertible square matrices in C^{n×n} a group structure, where the group operation is matrix multiplication, the unit of the group is I_n, and the inverse of a matrix A is A^{−1}. This is the general linear group GL_n(C) and we shall examine several of its subgroups. The similar group of invertible square matrices in R^{n×n} will be denoted by GL_n(R). Since every matrix with real entries also belongs to C^{n×n}, it is clear that GL_n(R) is a subgroup of GL_n(C). Subgroups of GL_n(C) are known as linear groups.

Definition 6.32. An isometry of R^n is a mapping f : R^n −→ R^n such that ‖x − y‖_2 = ‖f(x) − f(y)‖_2 for x, y ∈ R^n.

Example 6.33. Let z ∈ R^n and let t_z be the translation defined by t_z(x) = x + z for x ∈ R^n. Since t_z(x) − t_z(y) = x − y, it is clear that any translation is an isometry.

It is straightforward to verify that every isometry is a bijection having as inverse an isometry and, therefore, the set of isometries of R^n is a group with respect to function composition. We denote this group by ISO(R^n).

Theorem 6.70. Let f : R^n −→ R^n. The following statements are equivalent:
(i) f is an isometry such that f(0_n) = 0_n;
(ii) f preserves the Euclidean inner product, that is, x^⊤ y = f(x)^⊤ f(y) for every x, y ∈ R^n;
(iii) f(x) = Ax, where A is an orthogonal matrix.

Proof. (i) implies (ii): Let f be an isometry such that f(0_n) = 0_n. Since f preserves distances, we have

    (x − y)^⊤ (x − y) = (f(x) − f(y))^⊤ (f(x) − f(y)).                    (6.31)

Taking y = 0_n in the above equality, we obtain

    x^⊤ x = f(x)^⊤ f(x)                                                   (6.32)


for x ∈ Rn . Equality (6.31) is equivalent to x x − 2y  x + y  y = f (x) f (x) − 2f (y) f (x) + f (y) f (y) and taking into account Equality (6.32), we have y  x = f (y) f (x), which means that f preserves the Euclidean inner product. (ii) implies (iii): Suppose that f : Rn −→ Rn is a mapping that preserves the inner product. Then, 1 = ei ei = f (ei ) f (ei ) and 0 = ei ej = f (ei ) f (ej ) for i = j, 1  i, j  n. Thus, {f (e1 ), . . . , f (en )} is an orthonormal set and the matrix A = (f (e1 ) · · · f (en )) is orthogonal. Since the set of orthogonal matrices is a subgroup of GLn (R), the matrix A−1 is also orthogonal and, by Corollary 6.16, it preserves the inner product. Thus, the mapping g : Rn −→ Rn given by g(x) = A−1 f (x) also preserves the inner product and, in addition, g(ei ) = A−1 f (ei ) = ei for 1  i  n. Since g preserves the inner product, we have xi = x ei = g(x) g(ei ) = g(x) ei = (g(x))i for every x ∈ Rn and 1  i  n, which means that g is the identity mapping on Rn . Consequently, f (x) = Ax. (iii) implies (i): This implication follows immediately from  Corollary 6.16. Theorem 6.71. An isometry h of Rn has the form h(x) = Ax + b for x ∈ Rn . Proof. Let b = h(0n ). The mapping f : Rn −→ Rn given by f (x) = h(x) − h(0n ) has the property f (0n ) = 0n . Moreover, f is an isometry because f (x) − f (y) = h(x) − h(y). Therefore, by Theorem 6.70, there exists an orthogonal matrix A ∈ Rn×n such that f (x) = Ax for x ∈ Rn . This implies that h(x) = Ax + b for x ∈ Rn , where b = h(0n ).  Theorem 6.71 means that every isometry is the composition of a multiplication by an orthogonal matrix followed by a translation. An isometry f (x) = Ax + b for x ∈ Rn is orientation-preserving if the orthogonal matrix A is a rotation matrix, and is orientationreversing if A is a reflection matrix. Definition 6.33. A dilation is a mapping ha : Rn −→ Rn such that ha (x) = ax for x ∈ Rn , where a ∈ R is the ratio of the dilation.


A dilation does not preserve distances (so it is not an isometry), but it preserves angles. Indeed, for u, v ∈ R^n and a ≠ 0, we have

    cos ∠(h_a(u), h_a(v)) = (h_a(u)^⊤ h_a(v)) / (‖h_a(u)‖_2 ‖h_a(v)‖_2) = (u^⊤ v) / (‖u‖_2 ‖v‖_2) = cos ∠(u, v).

The set of dilations having a non-zero ratio is easily seen to be a group, where (h_a)^{−1} = h_{1/a}.
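A short numerical illustration of these facts (an independent sanity check, not part of the original text): an orthogonal matrix followed by a translation preserves pairwise distances, while a dilation preserves angles but not distances.

    % Isometries h(x) = A*x + b preserve distances; dilations preserve angles only.
    [A, ~] = qr(randn(3));           % random orthogonal matrix
    b = randn(3,1);                  % translation vector
    x = randn(3,1); y = randn(3,1);
    h = @(z) A*z + b;
    disp(abs(norm(h(x)-h(y)) - norm(x-y)));            % close to 0
    a = 2.5;                         % dilation ratio
    cosang = @(u,v) dot(u,v)/(norm(u)*norm(v));
    disp(abs(cosang(a*x, a*y) - cosang(x, y)));        % close to 0 (angles preserved)
    disp(abs(norm(a*x - a*y) - norm(x - y)));          % not 0 in general (distances scaled)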

6.18 Condition Numbers for Matrices

Let Au = b be a linear system, where A ∈ C^{n×n} is a non-singular matrix and b ∈ C^n. We examine the sensitivity of the solution of this system to small variations of b. So, together with the original system, we work with a system of the form Av = b + h, where h ∈ C^n is the perturbation of b. Note that A(v − u) = h, so v − u = A^{−1} h. Using a vector norm ‖·‖ and its corresponding matrix norm |||·|||, we have ‖v − u‖ = ‖A^{−1} h‖ ≤ |||A^{−1}||| ‖h‖. Since ‖b‖ = ‖Au‖ ≤ |||A||| ‖u‖, it follows that

    ‖v − u‖ / ‖u‖ ≤ |||A^{−1}||| ‖h‖ / (‖b‖ / |||A|||) = |||A||| |||A^{−1}||| ‖h‖ / ‖b‖.        (6.33)

Thus, the relative variation of the solution, ‖v − u‖/‖u‖, is upper bounded by the number |||A||| |||A^{−1}||| ‖h‖/‖b‖. These considerations justify the following definition.

Definition 6.34. Let A ∈ Cn×n be a non-singular matrix. The condition number of A relative to the matrix norm ||| · ||| is the number cond(A) = |||A||||||A−1 |||. Equality (6.33) implies that if the condition number is large, then small variations in b may generate large variations in the solution of the system Au = b, especially when b is close to 0. When this is the


case, we say that the system Au = b is ill-conditioned. Otherwise, the system Au = b is well-conditioned.

Theorem 6.72. Let A ∈ C^{n×n} be a non-singular matrix. The following statements hold for every matrix norm induced by a vector norm:
(i) cond(A) = cond(A^{−1});
(ii) cond(cA) = cond(A) for every c ≠ 0;
(iii) cond(A) ≥ 1.

Proof. We prove here only Part (iii). Since AA^{−1} = I_n, by the properties of a matrix norm induced by a vector norm, we have cond(A) = |||A||| |||A^{−1}||| ≥ |||AA^{−1}||| = |||I_n||| = 1.

Let A, B be two non-singular matrices in C^{n×n} such that B = aA, where a ∈ C, a ≠ 0. We have B^{−1} = (1/a)A^{−1}, |||B||| = |a| |||A|||, and |||B^{−1}||| = (1/|a|) |||A^{−1}|||, so cond(B) = cond(A). On the other hand, det(B) = a^n det(A). Thus, if n is large enough and |a| < 1, then det(B) can be quite close to 0, while the condition number of B equals that of A. This shows that the determinant and the condition number are relatively independent.

Example 6.34. Let A ∈ C^{2×2} be the matrix

    A = ( a       a + α  )
        ( a + α   a + 2α ),

where a > 0 and α < 0. Since det(A) = −α², we have

    A^{−1} = (1/α²) ( −(a + 2α)   a + α )
                    (  a + α     −a    ),

so |||A|||₁ is of the order of a while |||A^{−1}|||₁ is of the order of a/α², and cond(A) is of the order of (a/α)². Hence, if |α| is small, a system of the form Au = b may be ill-conditioned. Ill-conditioned linear systems Au = b may occur when large differences in scale exist among the columns of A, or among the rows of A.

Theorem 6.73. Let A = (a_1 · · · a_n) be an invertible matrix in

C^{n×n}, where a_1, . . . , a_n are the columns of A. Then

    cond(A) ≥ max{ ‖a_i‖ / ‖a_j‖ | 1 ≤ i, j ≤ n }.

Proof.

Since cond(A) = |||A||| |||A^{−1}|||, we have

    cond(A) = max{‖Ax‖ | ‖x‖ = 1} / min{‖Ax‖ | ‖x‖ = 1},

by Supplement 6.20. Note that Ae_k = a_k, where a_k is the kth column of A, and that ‖e_k‖ = 1. Therefore, max{‖Ax‖ | ‖x‖ = 1} ≥ ‖a_i‖ and min{‖Ax‖ | ‖x‖ = 1} ≤ ‖a_j‖, which implies

    cond(A) ≥ ‖a_i‖ / ‖a_j‖

for all 1 ≤ i, j ≤ n. This yields the inequality of the theorem.

Example 6.35. Let

    A = ( 1  0 )
        ( 1  α ),

where α ∈ R and α > 0. The matrix A is invertible and

    A^{−1} = (  1     0   )
             ( −1/α   1/α ).

It is easy to see that the condition number of A relative to the Frobenius norm is

    cond(A) = (2 + α²)/α.

Thus, if α is sufficiently close to 0, the condition number can reach arbitrarily large values.
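A quick MATLAB check of Example 6.35 (the Frobenius-norm condition number is computed directly, since the built-in cond function uses the 2-norm by default; the value of alpha is chosen arbitrarily):

    % Frobenius condition number of A = [1 0; 1 alpha] versus the formula (2 + alpha^2)/alpha.
    alpha = 1e-3;
    A = [1 0; 1 alpha];
    condF = norm(A,'fro') * norm(inv(A),'fro');
    disp([condF, (2 + alpha^2)/alpha]);     % the two values agree
    disp(cond(A));                          % 2-norm condition number, also large for small alpha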

In general, we use matrix norms of the form |||·|||_p. The corresponding condition number of a matrix A is denoted by cond_p(A).

Example 6.36. Let A = diag(a_1, . . . , a_n) be a diagonal matrix. Then, |||A|||₂ = max_{1≤i≤n} |a_i|. Since A^{−1} = diag(1/a_1, . . . , 1/a_n), it follows that |||A^{−1}|||₂ = 1/min_{1≤i≤n} |a_i|, so

    cond₂(A) = max_{1≤i≤n} |a_i| / min_{1≤i≤n} |a_i|.

6.19 Linear Space Orientation

Definition 6.35. Let V be an n-dimensional linear space. An ordered basis for V is a sequence B = (u_1, . . . , u_n) such that B = {u_1, . . . , u_n} is a basis for V.

Since B is a basis, note that its Gram matrix G_B is positive definite by Theorem 6.55; also, det(G_B) ≠ 0.

Definition 6.36. The ordered basis B has a positive orientation if det(G_B) > 0, and a negative orientation if det(G_B) < 0.

Example 6.37. Let {e_1, e_2, e_3} be the standard basis in R^3. The ordered basis B = (e_1, e_2, e_3) has a positive orientation because

    det(G_B) = det[1 0 0 ; 0 1 0 ; 0 0 1] = 1 > 0.

On the other hand, B̃ = (e_1, e_3, e_2) has a negative orientation because

    det(G_B̃) = det[1 0 0 ; 0 0 1 ; 0 1 0] = −1 < 0.

We extend this definition to an arbitrary sequence of vectors T = (t_1, . . . , t_n) in R^n by saying that T has a positive orientation if det(G_T) > 0 and a negative orientation if det(G_T) < 0.

Definition 6.37. Let u, v be two vectors in R^3. The cross-product of u and v is the vector u × v defined as

    u × v = (u_2 v_3 − u_3 v_2) e_1 + (u_3 v_1 − u_1 v_3) e_2 + (u_1 v_2 − u_2 v_1) e_3.

Note that u × v can be written as a determinant, that is,

    u × v = det[e_1 e_2 e_3 ; u_1 u_2 u_3 ; v_1 v_2 v_3].

The vector w = u × v is orthogonal to both u and v. Indeed, it is easy to see that

    (w, u) = (u_2 v_3 − u_3 v_2) u_1 + (u_3 v_1 − u_1 v_3) u_2 + (u_1 v_2 − u_2 v_1) u_3 = 0,

and, similarly, (w, v) = 0.


Note that the triple T = (u, v, u × v) has a positive orientation when u × v ≠ 0_3, that is, when u and v are not collinear. Indeed, since

    G_T = ( ‖u‖²     (u, v)   0        )
          ( (v, u)   ‖v‖²     0        )
          ( 0        0        ‖u × v‖² ),

we have det(G_T) = (‖u‖² ‖v‖² − (u, v)²) ‖u × v‖² > 0. Finally,

    ‖w‖² = (u_2 v_3 − u_3 v_2)² + (u_3 v_1 − u_1 v_3)² + (u_1 v_2 − u_2 v_1)²
         = (u_1² + u_2² + u_3²)(v_1² + v_2² + v_3²) − (u_1 v_1 + u_2 v_2 + u_3 v_3)²
         = ‖u‖² ‖v‖² (1 − cos²(∠(u, v))).

Thus, ‖w‖ = ‖u‖ ‖v‖ sin α, where α = ∠(u, v). We note that ‖w‖ equals the area of the parallelogram formed by the vectors u and v. The following properties are immediate:
(i) u × v = −(v × u),
(ii) (au) × v = a(u × v),
(iii) (u + v) × w = u × w + v × w,
for u, v, w ∈ R^3 and a ∈ R.

Definition 6.38. Let u, v, w ∈ R^3. The scalar triple product of these vectors is the real number ((u × v), w), denoted as (u, v, w).

It is easy to verify the following properties:
(i) (u, v, w) = −(v, u, w),
(ii) (u, v, w) = (v, w, u) = (w, u, v),
(iii) (au, v, w) = a(u, v, w),
(iv) (u + t, v, w) = (u, v, w) + (t, v, w).

We claim that (u, v, w) equals the volume of the parallelepiped constructed on the sequence of vectors (u, v, w). Indeed, since ‖u × v‖ is the area of the parallelogram formed by the vectors u and v, ((u × v), w) is the product of this area with the projection of w on a vector perpendicular to the parallelogram determined by u and v.


Note that

    (u, v, w) = (det[e_1 e_2 e_3 ; u_1 u_2 u_3 ; v_1 v_2 v_3], w_1 e_1 + w_2 e_2 + w_3 e_3)
              = det[w_1 w_2 w_3 ; u_1 u_2 u_3 ; v_1 v_2 v_3]
              = det[u_1 u_2 u_3 ; v_1 v_2 v_3 ; w_1 w_2 w_3].
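These identities are easy to check numerically in MATLAB with the built-in cross and dot functions (the vectors below are arbitrary test data):

    % Scalar triple product (u, v, w) = dot(cross(u,v), w) equals det([u; v; w]).
    u = [1 2 3]; v = [4 0 -1]; w = [2 5 7];
    t = dot(cross(u, v), w);
    disp([t, det([u; v; w])]);            % the two values agree
    disp(dot(cross(u, v), u));            % 0: u x v is orthogonal to u
    disp(norm(cross(u, v)) - norm(u)*norm(v)*sin(acos(dot(u,v)/(norm(u)*norm(v)))));  % close to 0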

Definition 6.39. Let u, v, w ∈ R^3. The vector triple product of these vectors is the vector u × (v × w).

Note that

    (u × (v × w))_1 = u_2 (v_1 w_2 − v_2 w_1) − u_3 (v_3 w_1 − v_1 w_3)
                    = v_1 (u_2 w_2 + u_3 w_3) − w_1 (u_2 v_2 + u_3 v_3)
                    = v_1 (u_1 w_1 + u_2 w_2 + u_3 w_3) − w_1 (u_1 v_1 + u_2 v_2 + u_3 v_3)
                    = v_1 (u, w) − w_1 (u, v).

Similarly, we have (u × (v × w))_2 = v_2 (u, w) − w_2 (u, v) and (u × (v × w))_3 = v_3 (u, w) − w_3 (u, v), which allows us to write

    u × (v × w) = v (u, w) − w (u, v).

6.20 MATLAB Computations

Vector norms can be computed using the function norm which comes in two signatures: norm(v) and norm(v,p). The first variant computes v2 ; the second computes vp for any p, 1  p  ∞. In addition, norm(v,inf) computes v∞ = max{|vi | | 1  i  n}, where v ∈ Rn . If one uses −∞ as the second parameter, then norm(v,-inf) returns min{|vi | | 1  i  n}. Example 6.38. For the vector v = [2 -3 5 -4]


the computation

    norms = [norm(v,1), norm(v,2), norm(v,2.5), norm(v,inf), norm(v,-inf)]

returns

    norms = 14.0000    7.3485    6.5344    5.0000    2.0000

If the first argument of norm is a matrix A, norm(A) returns |||A|||₂. For the two-parameter format, the second parameter is restricted to the values 1, 2, inf, and 'fro'. Then, norm(A,2) is the same as norm(A), norm(A,1) is |||A|||₁, norm(A,inf) is |||A|||_∞, and norm(A,'fro') yields the Frobenius norm of A.

Example 6.39. For the matrix

    A = [1 -1 2; 3 2 -1; 5 4 2]

the following computation is performed:

    norms = [norm(A,1), norm(A,2), norm(A,inf), norm(A,'fro')]
    norms = 9.0000    7.4783   11.0000    8.0623

For matrices whose norm is expensive to compute, an approximate estimation of |||A|||₂ can be performed using the function normest(A), or normest(A,r), where r is the relative error; the default for r is 10^{−6}.

The following function implements the Gram–Schmidt algorithm:

    function [W] = gram(U)
    %GRAM implements the classical Gram-Schmidt algorithm
    [n,m] = size(U);
    W = zeros(n,m);
    W(:,1) = (1/norm(U(:,1)))*U(:,1);
    for k = 2:1:m
        P = eye(n) - W*W';
        W(:,k) = W(:,k) + (1/norm(P*U(:,k)))*P*U(:,k);
    end
    end


An implementation of the modified Gram–Schmidt algorithm is given next.

    function [W] = modgram(U)
    %MODGRAM implements the modified Gram-Schmidt algorithm
    [n,m] = size(U);
    W = zeros(n,m);
    W(:,1) = (1/norm(U(:,1)))*U(:,1);
    for k = 2:1:m
        t = U(:,k);
        for j = 1:1:k
            t = (eye(n) - W(:,j)*W(:,j)')*t;
        end
        W(:,k) = W(:,k) + (1/norm(t))*t;
    end
    end

The Cholesky decomposition of a Hermitian positive definite matrix is computed in MATLAB using the function chol. The function call R = chol(A) returns an upper triangular matrix R satisfying the equation R^H R = A. If A is not positive definite, an error message is generated. The matrix R is computed using the diagonal and the upper triangle of A, and the computation makes sense only if A is Hermitian.

Example 6.40. Let A be the symmetric positive definite matrix considered in Example 6.29,

    A = [3 0 2; 0 2 1; 2 1 2].

Then R = chol(A) yields

    R = 1.7321         0    1.1547
             0    1.4142    0.7071
             0         0    0.4082

The call L = chol(A,’lower’) returns a lower triangular matrix L from the diagonal and lower triangle of matrix A, satisfying the equation LLH = A. When A is sparse, this syntax of chol is faster.


Example 6.41. For the same matrix A as in Example 6.40, L = chol(A,'lower') returns

    L = 1.7321         0         0
             0    1.4142         0
        1.1547    0.7071    0.4082

For added flexibility, [R,p] = chol(A) and [L,p] = chol(A,'lower') set p to 0 if A is positive definite and to a positive number otherwise, without returning an error message.

The full QR decomposition of a matrix A ∈ C^{m×n} is obtained using the function qr as in

    [Q R] = qr(A)

To obtain the thin (economy-size) decomposition, we write

    [Q R] = qr(A,0)
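For a tall matrix the two calls differ in the sizes of the factors, as the following minimal check (not part of the original text, with an arbitrary random matrix) illustrates:

    % Compare the full and the thin (economy-size) QR factorizations of a tall matrix.
    A = randn(5,3);
    [Qf, Rf] = qr(A);        % full: Qf is 5x5, Rf is 5x3
    [Qt, Rt] = qr(A,0);      % thin: Qt is 5x3, Rt is 3x3
    disp([size(Qf), size(Qt)]);
    disp(norm(A - Qt*Rt));   % close to 0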

The Hessenberg form of a matrix is computed using the function hess. To produce a Hessenberg matrix C and a unitary matrix P such that P^H A P = C (or A = P C P^H), one can use

    [P,H] = hess(A)

For example, if

    A =  1     2     3     4
         5     6     2     5
        -2     2     4     1
         2    -4     1     2

then [P,H] = hess(A) will return

    P = 1.0000         0         0         0
             0   -0.8704   -0.4694    0.1486
             0    0.3482   -0.3733    0.8599
             0   -0.3482    0.8002    0.4883

    H = 1.0000   -2.0889    1.1421    4.8303
       -5.7446    4.1212   -2.0302   -3.3597
             0    5.7089    2.8879   -2.9554
             0         0    0.1780    4.9909


If only the Hessenberg form is desired, one could use the function call H = hess(A). Note that the matrix

    B = 1     2     3     4
        2     2     5     6
        3     5     3     7
        4     6     7     4

is symmetric, so its Hessenberg form, obtained with [P1,H1] = hess(B),

    P1 = 0.6931   -0.6010   -0.3980         0
        -0.6931   -0.4039   -0.5970         0
         0.1980    0.6897   -0.6965         0
              0         0         0    1.0000

    H1 = -0.9118   -0.8868         0         0
         -0.8868   -2.1872    0.1000         0
               0    0.1000    9.0990  -10.0499
               0         0  -10.0499    4.0000

is a tridiagonal matrix.

The condition number of a matrix A is computed using the function cond(A,p), which returns the p-norm condition number of matrix A. When used with a single parameter, as in cond(A), the 2-norm condition number of A is returned.

Example 6.42. Let A be the matrix

    >> A=[10.1 6.2; 5.1 3.1]
    A =
       10.1000    6.2000
        5.1000    3.1000

The condition number cond(A) is 567.966, which is quite large, indicating significant sensitivity to inverse calculations. The inverse of A is

    >> inv(A)
    ans =
      -10.0000   20.0000
       16.4516  -32.5806


If we make a small change in A yielding the matrix

    >> B=[10.2 6.3;5.1 3.1]
    B =
       10.2000    6.3000
        5.1000    3.1000

the inverse of B changes completely:

    >> inv(B)
    ans =
       -6.0784   12.3529
       10.0000  -20.0000

Values of the condition number close to 1 indicate a well-conditioned matrix, and the opposite is true for large values of the condition number.

Example 6.43. Consider the linear systems:

    10.1x_1 + 6.2x_2 = 12          10.2x_1 + 6.3x_2 = 12
     5.1x_1 + 3.1x_2 = 6            5.1x_1 + 3.1x_2 = 6

that correspond to Ax = b and Bx = b, where b = (12, 6)^⊤. In view of the resemblance of A and B, one would expect their solutions to be close. However, this is not the case. The solution of Ax = b is

    >> x=inv(A)*b
    x =
         0
    1.9355

while the solution of Bx = b is

    >> x=inv(B)*b
    x =
    1.1765
         0

Exercises and Supplements

(1) Let ν be a norm on C^n. Prove that there exists a number k ∈ R such that for any vector x ∈ C^n, we have ν(x) ≤ k Σ_{i=1}^{n} |x_i|.


n Solution: n equality x = i=1 nxi ei , we have nStarting from the ν(x e ) = |x |ν(e )  k ν(x)  i i i i=1 i=1 i i=1 |xi |, where k = max{ν(ei ) | 1  i  n}. (2) Prove that the mapping s : Rn × Rn −→ R defined by s(x, y) = d(x, y)22 is not a metric itself. Show that s(x, y)  2(s(x, z) + s(z, y)) for x, y, z ∈ Rn . (3) Prove that if x, y, z are three vectors in Rn and ν is a norm on V, then ν(x − y)  ν(x − z) + ν(z − y). (4) Prove that for any vector norm ν on Rn , we have ν(x + y)2 + ν(x − y)2  4(ν(x)2 + ν(y)2 ) for every x, y ∈ Rn . Solution: Note that ν(x + y)  ν(x) + ν(y); similarly, ν(x − y)  ν(x) + ν(y). Thus, ν(x + y)2  ν(x)2 + ν(y)2 + 2ν(x)ν(y), ν(x − y)2  ν(x)2 + ν(y)2 + 2ν(x)ν(y), hence, ν(x + y)2 + ν(x − y)2  2ν(x)2 + 2ν(y)2 + 4ν(x)ν(y). The desired inequality follows from 2ν(x)ν(y)  ν(x)2 + ν(y)2 . (5) Let x ∈ Rn . Prove that for every  > 0 there exists y ∈ Rn such that the components of the vector x + y are distinct and y2 < . Solution: Partition the set {1, . . . , n} into the blocks B1 , . . . , Bk such that all components of x that have an index in B have a common value cj . Suppose that |Bj | = pj . Then k j j=1 pj = n and the numbers {c1 , c2 , . . . , ck } are pairwise distinct. Let d = mini,j |ci − cj |. The vector y can be defined as follows. If Bj = {i1 , . . . , ipj }, then yi1 = η · 2−1 , yi2 = η · 2−2 , . . . , yipj = η · 2−p , where η > 0, which makes the numbers cj +yi1 , cj +yi2 , . . . , cj + yipj pairwise distinct. It suffices to take η < d to ensure that


the components of x + y are pairwise distinct. Also, note that k η2 nη2 y22  j=1 pj 4 = 4 . It suffices to choose η such that η < min{d, 2 n } to ensure that y2 < . (m,n) : Cm×n −→ R0 be a vectorial matrix norm. Prove (6) Let μ that for every A ∈ Cm×n there exists a constant k ∈ R such n that μ(m,n) (A)  k m i=1 j=1 |aij |. Hint: Apply Supplement 6.20. (7) Prove that a matrix A ∈ Cn×n is normal if and only if Ax = AH x for all x ∈ Cn . (8) Let {μ(m,n) | m, n ∈ N>0 } be a consistent family of matrix norms and let a ∈ Cn be a vector. Prove that for each m ∈ N, the function νm : Rm ‘R0 defined by νm (x) = μ(m,n) (xaH ) is a vector norm that is consistent with the family of matrix norms. (9) Prove that a matrix A ∈ Cn×n is normal if and only if Ax2 = AH x2 for every x ∈ Cn . Solution: Suppose that A is normal. Then Ax22 = (Ax, Ax) = (x, AH Ax) = (x, AAH x) = (AH x, AH x) = AH x22 . Conversely, suppose that Ax2 = AH x2 for every x ∈ Let λ ∈ C be such that |λ| = 1. We have

Cn .

A(λx + y)22 = λAx + Ay22 = (λxH AH + y H AH , λAx + Ay) = (λxH AH , λAx) + (λxH AH , Ay) +(y H AH , λAx) + (y H AH , Ay) = (xH AH , Ax) + (y H AH , Ay) +(λxH AH , Ay) + (y H AH , λAx) = Ax22 + Ay22 + 2(λ(Ax, Ay)), because (λxH AH , Ay) + (y H AH , λAx) = 2(λ(Ax, Ay)). Similarly, AH (λx + y)22 = AH x22 + AH y22 + 2(λ(AH x, AH y)),


hence, Ax22 + Ay22 + 2(λ(Ax, Ay)) = AH x22 +AH y22 + 2(λ(AH x, AH y)). Since Ax22 = AH x22 and Ay22 = AH y22 , we obtain (λ(Ax, Ay) − λ(AH x, AH y)) = (λ(x, AH Ay) − λ(x, AAH y)) = (λ(x, (AH A − AAH )y)) = 0, hence, |x, (AH A − AAH )y)| = 0 for every x ∈ Cn . Therefore, (AH A−AAH )y = 0 for every y ∈ Cn . This implies AH A−AAH = On,n , hence, AH A = AAH . (10) Prove that for every matrix A ∈ Cn×n , we have |||A|||2 = |||AH |||2 . (11) Let U ∈ Cn×n be a matrix whose set of columns is orthonormal and let V ∈ Cn×n be a matrix whose set of rows is orthonormal. Prove that |||U |||2 = |||V |||2 = 1. Solution: Since U H U = In , we have |||U |||2 = max{U x2 | x ∈ Cn andx2 = 1}. For x ∈ Cn such that x2 = 1, we have U x22 = xH U H U x = xH x = x22 = 1, which implies |||U |||2 = 1. The similar argument for V follows from Exercise 6.20. (12) Let U ∈ Cn×n be a matrix whose set of columns is orthonormal and let V ∈ Cn×n be a matrix whose set of rows is orthonormal. Prove that |||U A|||2 = |||AV |||2 = |||A|||2 and, therefore, |||U AV |||2 = |||A|||2 . Solution: By hypothesis, we have U H U = In and V V H = 1. Therefore, |||U A|||22 = max{U Ax22 | x2 = 1} = max{xH AH U H U Ax | x2 = 1} = max{xH AH Ax | x2 = 1} = max{Ax22 | x2 = 1} = |||A|||22 . This allows us to conclude that |||U A|||2 = |||A|||2 . The second equality follows immediately from the first. (13) Let A ∈ Rm×n . Prove that there exists i, 1  i  n such that Aei 22  n1 A2F .


(14) Let A, B ∈ Cn×n be two matrices such that AB = BA. Prove that if A is a normal matrix, then AH B = BAH . Solution: Let C = AH B − BAH . We have C H = B H A − AB H and, therefore, trace(CC H ) = trace((AH B − BAH )(B H A − AB H )) = trace(AH BB H A) − trace(AH BAB H ) −trace(BAH B H A) + trace(BAH AB H ). Since trace(AH BB H A) = trace(AAH BB H ) = trace(AH ABB H ) = trace(AH BAB H ), trace(BAH AB H ) = trace(ABAH B H ) = trace(BAAH B H ) = trace(BAH AB H ), it follows that trace(CC H ) = 0, so C = On,n . Thus, AH B = BAH by Supplement 57 of Chapter 3. This result is known as Fuglede’s theorem [60]. (15) Let A, B ∈ Cn×n be two normal matrices such that AB = BA. Prove that AB is a normal matrix. Solution: Since both A and B are normal, we have (AB)(AB)H = ABB H AH = AB H BAH = B H AAH B = B H AH AB = (AB)H (AB), which proves that AB is normal. (16) Let U ∈ Cn×n be a matrix whose set of columns is orthonormal and let V ∈ Cn×n be a matrix whose set of rows is orthonormal. Prove that |||U AV |||F = |||A|||F . Solution: Since√ U 2F = trace(U H U ) √= n, it follows n. Similarly, V F = n. Consequently, that U F = U AV 2F = trace(V H AH U H U AV ) = trace(V H AH AV ) = trace(V V H AH A) = trace(AH A) = A2F , which yields the needed equality. (17) Let H ∈ Cn×n be a non-singular matrix. Prove that the function f : Cn×n −→ R0 defined by f (X) = HXH −1 2 for X ∈ Cn×n is a matrix norm.


(18) Let A ∈ Cm×n and B ∈ Cp×q be two matrices. Prove that A ⊗ B2F = trace(A A ⊗ B  B). (19) Let x0 , x, and y be three members of Rn . Prove that if t ∈ [0, 1] and u = tx + (1 − t)y, then x0 − u2  max{x0 − x2 , x0 − y2 }. Solution: Since x0 − u = x0 − y − t(x − y), we have x − u22 = x0 − y22 − 2t(x0 − y, x − y) + t2 x − y22 . The graph of the function f : [0, 1] −→ R0 given by f (t) = x − u22 is a segment of a convex parabola. Therefore, maxt∈[0,1] f (t) = max{f (0), f (1)} = max{x0 − y22 , x0 − x22 }, which leads to the desired conclusion. (20) Let u1 , . . . , um be m unit vectors in R2 , such that ui − uj  = 1. Prove that m  6. (21) Prove that if A ∈ Cn×n is an invertible matrix, then μ(A)  1 for any matrix norm μ. μ(A−1 ) (22) Let A ∈ Rm×n and B ∈ Rn×p be two rectangular matrices that have orthonormal sets of columns. Prove that the matrix AB ∈ Rm×p also has an orthonormal set of columns. Solution: By hypothesis, we have A A = In and B  B = Ip . Therefore, (AB) (AB) = B  A AB = B  In B = B  B = Ip , which shows that AB has an orthonormal set of columns. (23) Let Y ∈ Cn×p be a matrix that has an orthonormal set of columns, that is, Y H Y = Ip . Prove the following: (a) Y F = p; (b) for every matrix R ∈ Cp×q we have Y RF = RF .  H Y = I , so Y  = trace(Y H Y ) = Solution: We have Y p F  √ trace(Ip ) = p. For the second part, we can write Y R2F = trace((Y R)H Y R) = trace(RH Y H Y R) = trace(RH R) = R2F , which gives the desired equality.


(24) Let μ : Cn×n −→ R0 be a matrix norm. Prove that there exists a vector norm ν : Cn −→ R0 such that ν(Ax)  μ(A)ν(x) for A ∈ Cn×n and x ∈ Cn . Solution: Let b ∈ Cn − {0n }. It is easy to see that the mapping ν : Cn −→ R0 defined by ν(x) = μ(xb ) for x ∈ Cn is a vector norm. Furthermore, we have ν(Ax) = μ(Axb )  μ(A)μ(xb ) = μ(A)ν(x). (25) Let ΦH : Cm×n −→ Cm×n be the function defined by Φ(X) = XHX, where H ∈ Cn×m . Prove that ΦH (X) − ΦH (Y )F  2HF max{XF , Y F } X − Y F Solution: We have ΦH (X) − ΦH (Y )F = XHX − XHY + XHY − Y HY F  XHX − XHY F + XHY − Y HY F  XF HX − HY F + XH − Y HF Y F  max{XF , Y F }(HX − HY F + XH − Y HF )  2HF max{XF , Y F }X − Y F . (26) Let x, y ∈ Cn − {0}. Prove the following: (a) xy H F = x2 y2 ; (b) |||xy H |||1 = x1 y∞ ; (c) |||xy H |||∞ = x∞ y1 . ˆ B ˆ be four matrices in Cm×n such that none of the (27) Let A, B, A, matrices A, B, or A + B equals Om,n . Define ΔA = Aˆ − A and ˆ − B. Prove that for any matrix norm μ, we have ΔB = B   ˆ − (A + B)) μ(A) + μ(B) μ(ΔA ) μ(ΔB ) μ(Aˆ + B  max , . μ(A + B) μ(A + B) μ(A) μ(B) Solution: By the triangular property of norms, we have μ(ΔA + ΔB )  μ(ΔA ) + μ(ΔB ) μ(ΔB ) μ(ΔA ) + μ(B) ·  μ(A) · μ(A) μ(B)   μ(ΔA ) μ(ΔB ) , .  (μ(A) + μ(B)) max μ(A) μ(B)


ˆ ∈ Cn×p such that none of the (28) Let A, Aˆ ∈ Cm×n and B, B matrices A, B or AB is a zero matrix. Define, as above, ΔA = ˆ − B. Aˆ − A and ΔB = B  ˆ − AB) μ(A)μ(B) μ(ΔA ) μ(ΔB ) μ(AˆB  + μ(AB) μ(AB) μ(A) μ(B)  μ(ΔA ) μ(ΔB ) . + μ(A) μ(B) Solution: We have ˆ − AB = (A + ΔA )(B + ΔB ) − AB AˆB = AΔB + BΔA + ΔA ΔB . By the triangle inequality, ˆ − AB) = μ(AΔB + BΔA + ΔA ΔB ) μ(AˆB  μ(AΔB ) + μ(BΔA ) + μ(ΔA ΔB )  μ(A)μ(ΔB ) + μ(B)μ(ΔA ) + μ(ΔA )μ(ΔB ) (since μ is a matrix norm). (29) Prove that for every matrix norm μ induced by a vector norm, we have μ(I) = 1. (30) Let A ∈ Cn×n be a matrix and let μ be a matrix norm induced by a vector norm ν. Prove that if μ(A) < 1, then the matrix In + A is non-singular and 1 1  μ((In + A)−1 )  . 1 + μ(A) 1 − μ(A) Solution: The matrix In + A is non-singular because, otherwise, the system (In + A)x = 0 would have a non-zero solution u. This would imply u = −Au, so ν(u) = ν(Au)  μ(A)ν(u), which would imply μ(A)  1, contradicting the hypothesis of the statement. Since In = (In + A)(In + A)−1 , we have 1 = μ(In )  μ(In + 1  μ((In + A)μ((In +A)−1 )  (1+μ(A))μ((In +A)−1 ), so 1+μ(A) −1 A) ).


The equality In = (In + A)(In + A)−1 can be written as In = (In + A)−1 + A(In + A)−1 . This implies 1  μ((In +A)−1 )−μ(A(In +A)−1 )  (1−μ(A))μ((In +A)−1 ), so μ((In + A)−1 ) 

1 1−μ(A) . in Cn×n )

be a non-singular matrix. Prove (31) Let A ∈ Rn×n (or that if ν is a norm on Rn (on Cn , respectively), then νA defined by νA (x) = ν(Ax) is a norm on Rn (on Cn ). (32) Prove that the function ν0 : Rn −→ R0 , where ν0 (x) is the number of non-zero components of x, is not a norm, although it satisfies the inequality ν(x + y)  ν(x) + ν(y) for x, y ∈ Rn . (33) Let A ∈ Rm×n and let ν be a vector norm on Rn . Prove that if A ∈ Rm×n , then we have the following equalities: μ(A) = sup{ν(Ax) | ν(x) = 1}   ν(Ax)  n = sup x ∈ R − {0} ν(x) = inf{k | ν(Ax)  kν(x), for every x ∈ Rn }.

Solution: To prove the first equality, note that {ν(Ax) | ν(x) = 1} ⊆ {ν(Ax) | ν(x)  1}. This implies sup{ν(Ax) | ν(x) = 1}  sup{ν(Ax) | ν(x)  1} = μ(A). On the other hand, let x be a vector such that ν(x)  1. We have x = 0 if and only if ν(x) = 0 because ν is a norm. 1 x we have ν(y) = 1 and Otherwise, x = 0, and for y = ν(x) ν(Ax) = ν(A(ν(x)y) = ν(x)ν(Ay)  ν(Ay). Therefore, in


either case we have ν(Ax)  sup{ν(Ay) | ν(y) = 1} for ν(x)  1. Thus, we have the reverse inequality, sup{ν(Ax) | ν(x)  1}  sup{ν(Ay) | ν(y) = 1}, so μ(A) = sup{ν(Ax) | ν(x) = 1}. To prove the second equality, observe that       x ν(Ax)   x =  0 = sup ν(A x =  0 sup   ν(x) ν(x) = sup{ν(Ay) | ν(y) = 1} = μ(A),   x = 1 and every vector y with ν(y) = 1 can because ν ν(x) 1 x for some x = 0. be written as y = ν(x) We leave the third equality to the reader. (34) Let x and y be two vectors in Rn . Prove that if ax + by = cx + dy, then

(a2 − c2 )x2 + (b2 − d2 )y2 + (ab − cd)(y  x + x y) = 0. (35) Let U ∈ Cn×n be a unitary matrix. Prove that cond2 (U ) = 1. (36) Let A and B be two matrices in Cn×n such that A ∼ B. If A = XBX −1 , prove that 1 B2  A2  cond2 (X)B2 . cond2 (X) (37) Prove that |||C (n,K) |||p  1 and |||R(n,K) |||p  1 for any p  1, where C (n,K) and R(n,K) are the matrices defined in Exercise 61. ! i · · · ik be a (38) Let A ∈ Cn×n be a matrix and let B = A 1 j1 · · · jh submatrix of A. Prove that |||B|||p  |||A|||p for any p  1. Hint: Apply Part (b) of Exercise 61 of Chapter 3.


(39) Prove that if D = diag(d1 , . . . , dn ) is a diagonal matrix, then |||D|||p = max{|di | | 1  i  n} for every p  1. (40) Let A ∈ Cn×n . We have seen that if A is a unitary matrix, then Ax2 = x2 for every vector x ∈ Cn (see Theorem 6.24). Prove the inverse statement, that is, if Ax2 = x2 for every vector x ∈ Cn , then A is a unitary matrix. Solution: Observe that the condition satisfied by A implies A(x + y)2 = x + y2 for every x, y ∈ Cn . This, in turn, implies (Ax + Ay, Ax + Ay) = (x + y, x + y), which is equivalent to (Ax, Ay) = (x, y), or (Ax)H Ay = xH y. Choosing x = ei and y = ej , the last condition amounts to  1 if i = j (A A)ij = 0 otherwise H

for 1  i, j  n. Thus, AH A = In . (41) Let  ·  be a unitarily invariant norm. Prove that for every Hermitian matrix A ∈ Cn×n and every unitary matrix U , we have A − In   A − U   A + In . (42) Let  ·  be a unitarily invariant norm. Prove that     A O  A B     O D C D for all conforming matrices A, B, C, D. (43) Let A ∈ Cn×n be an invertible matrix and let  ·  be a norm on Cn . Prove that |||A−1 ||| =

1 / min{‖Ax‖ | ‖x‖ = 1},

where |||·||| is the matrix norm generated by ‖·‖.

Solution: We claim that

    {‖A^{−1} t‖ | ‖t‖ = 1} = {1/‖Ax‖ | ‖x‖ = 1}.

Let a = ‖A^{−1} t‖ for some t ∈ C^n such that ‖t‖ = 1. Define x as

    x = (1/‖A^{−1} t‖) A^{−1} t.

Clearly, we have ‖x‖ = 1. In addition,

    ‖Ax‖ = ‖t‖/‖A^{−1} t‖ = 1/a,

so a ∈ {1/‖Ax‖ | ‖x‖ = 1}. Thus,

    {‖A^{−1} t‖ | ‖t‖ = 1} ⊆ {1/‖Ax‖ | ‖x‖ = 1}.

The reverse inclusion can be shown in a similar way. Therefore,

    |||A^{−1}||| = max{‖A^{−1} t‖ | ‖t‖ = 1} = 1/min{‖Ax‖ | ‖x‖ = 1}.

(44) Prove that if μ1 , . . . , μk are matrix norms on Rn×n , then μ : Rn×n  R0 defined by μ(A) = max{μ1 (A), . . . , μk (A)} for A ∈ Rn×n is a matrix norm. Solution: Let A, B be two matrices in

Rn×n .

We have

μi (AB)  μi (A)μi (B)  max{μ1 (A), . . . , μk (A)} × max{μ1 (B), . . . , μk (B)} for every i, 1  i  k. Therefore, max μ1 (AB), . . . , μk (AB)  max{μ1 (A), . . . , μk (A)} × max{μ1 (B), . . . , μk (B)}, so μ(AB)  μ(A)μ(B). We leave to the reader the verification of the remaining properties of matrix norms.


(45) Let S ∈ Cn×n be a matrix such that  0 if j  i, sij = j−i if j > i, tij δ where tij ∈ C for 1  i, j  n and i < j and δ < 1. Prove that there exists a positive number c such that |||S|||2  cδ. Solution: Let t = max{|tij | | 1  i, j  n, i = j} and let x ∈ Cn be a vector such that x2 = 1. We have n n      sij xj   |tij |δi−j |xj | |(Sx)i | =  j=1

 t(n − i)δ

j=i+1 n 

|xj |  tnδx2 .

j+1

This implies |||S|||2  tnδ, so we obtain the desired inequality with c = tn. (46) Let A ∈ Cn×n , a ∈ C, b ∈ Cn−1 , and C ∈ C(n−1)×(n−1) be such that   a bH . A= b C Prove that b2  |||A|||2 . Solution: Since |||A|||2 = sup{Ax2 | x = 1}, by substitut 1 , the desired inequality follows immediately. ing 0n−1 (47) In Example 6.13 we saw that A∞ fails to be a matrix norm. However, A∞ can be useful due to its simplicity. Let A ∈ Cn×n . Prove the following: (a) if B ∈ Rn×n is a matrix such that abs(A)  B, then A∞ (A)  A∞ ; (b) if A1 , . . . , Ak ∈ Cn×n , then A∞ (A1 · · · Ak )   nk−1 ki=1 A∞ (Ai ). Solution: If abs(A)  B, we have |aij |  bij for 1  i, j  n. Therefore, the elements of B are non-negative and we have A∞ = maxi,j |aij |  maxi,j bij = B∞ .


For the second part, note that (A1 · · · Ak )ij  = {(A1 )ii1 (A2 )i1 i2 · · · (Ak )ik−1 j | (i1 , . . . , ik−1 ) ∈ {1, . . . , n}k−1 }. Thus, |(A1 · · · Ak )ij |  {|(A1 )ii1 | |(A2 )i1 i2 | · · · (Ak )ik−1 j .  (i1 ,...,ik−1 )

Since the last sum contains nk−1 terms and each term is less or equal to ki=1 A∞ (Ai ), the desired inequality follows immediately. (48) Prove that if A, B ∈ Rn×n and abs(A)  abs(B), then AF  BF . (49) Let A and B be two matrices in Cn×n and let ||| · ||| be a matrix norm on Cn×n . Prove that if |||AB − In |||  1, then both A and B are invertible. For x, y ∈ Cn , we write abs(x)  abs(y) if |xi |  |yi | for 1  i  n. A norm ν : Cn −→ R0 is monotone if abs(x)  abs(y) implies ν(x)  ν(y) for x, y ∈ Cn ; ν is said to be absolute if ν(x) = ν(abs(x)) for x ∈ Cn . (50) Prove that a norm ν on absolute.

Cn

is monotone if and only if it is

Solution: If ν is monotone, then it is clearly absolute. Conversely, let ν be an absolute norm and let x, y ∈ Cn such that x = diag(1, . . . , 1, a, 1, . . . , 1)y, where a ∈ [0, 1]. Note that ⎛

⎛ ⎞ ⎞ y1 y1 ⎜ .. ⎟ ⎜ .. ⎟ .⎟ ⎜ . ⎟ 1 − a 1 + a⎜ ⎜ ⎟ ⎜ ⎟ x= ⎜ yk ⎟ + ⎜−yk ⎟. 2 ⎜.⎟ 2 ⎜ . ⎟ ⎝ .. ⎠ ⎝ .. ⎠ yn yn


The definition of norms implies that ν(x)  1−a y ), where 2 ν(˜ ⎞ ⎛ y1 ⎜ .. ⎟ ⎜ . ⎟ ⎟ ⎜ ˜ = ⎜−yk ⎟. y ⎜ . ⎟ ⎝ .. ⎠ yn

1+a 2 ν(y) +

Since ν is absolute, this implies ν(x)  ν(y). If a1 , . . . , an ∈ [0, 1] and x = diag(a1 , . . . , an )y, then by Exercise 5 of Chapter 3 and the previous argument, we have ν(x)  ν(y). Let x, y ∈ Cn such that abs(x)  abs(y), which is equivalent to |xi |  |yi |. This allows finding a1 , . . . , an ∈ [0, 1] such that |xi | = ai |yi |, which implies ν(abs(x))  ν(abs(y)). Since ν is absolute, this implies ν(x)  ν(y). A function g : Rn −→ R0 is a symmetric gauge function if it satisfies the following conditions: (i) g is a norm on Rn ; (ii) g(Pφ u) = g(u) for every permutation φ ∈ PERMn and every u ∈ Rn ; (iii) g(diag(b1 , . . . , bn )u) = g(u) for every (b1 , . . . , bn ) ∈ {−1, 1}n and u ∈ Rn . (51) Prove that for any p  1 the function νp : Rn −→ R0 is a symmetric gauge function on Rn . (52) Let  ·  be a norm defined on Rn . If f : Rn −→ R is a homogeneous function, that is, a function such that f (ax) = af (x) for a  0, then max{f (x) | x = 1} = max{f (x) | x  1}. Solution: It is clear that max{f (x) | x = 1}  max{f (x) | x  1}, because the set from the left member is a subset of the set that occurs in the right member. To prove the reverse inequality, let x be a vector such that x  1. Then  x     = 1.  x


By the defining property of f , we also have   1 x = f (x)  f (x). f x x Thus, max{f (x) | x  1}  max{f (x) | x = 1}. (53) Let A ∈ Rm×p . Prove that AF = vec(A)2 . (54) Let U ∈ Cm×n , V ∈ Cn×p , and let A ∈ Cm×p . Prove that A − U V F = vec(A) − (V  ⊗ Im )vec(U )2 = vec(A) − (In ⊗ U )vec(V )2 . Solution: We have A − U V F = vec(A − U V )2 (by Exercise 6.20) = vec(A) − vec(U V )2 = vec(A) − (V  ⊗ Im )vec(U )2 = vec(A) − (In ⊗ U )vec(V )2 , by Supplement 90 of Chapter 3. (55) Prove that if A ∈ Cn×n , then |||A|||2 = max{|y H Ax| | x2 = y2 = 1}. Solution: By the Cauchy–Schwarz inequality, we have |y H Ax|  y2 Ax2 , so max{|y H Ax| | x2 = y2 = 1}  max{Ax2 | x2 } = |||A|||2 .

(6.34)

Let x̃ be a unit vector such that ‖Ax̃‖₂ = |||A|||₂ and let ỹ be the unit vector ỹ = (1/‖Ax̃‖₂) Ax̃. We have

    ỹ^H A x̃ = (1/‖Ax̃‖₂) x̃^H A^H A x̃ = ‖Ax̃‖₂² / ‖Ax̃‖₂ = ‖Ax̃‖₂ = |||A|||₂.

Thus, the Inequality (6.34) can be replaced by an equality.
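A numerical illustration of this supplement (an independent check, not part of the original text; the maximizing vectors are taken from the singular value decomposition, a topic treated in a later chapter):

    % |||A|||_2 equals the maximum of |y'*A*x| over unit vectors x and y.
    A = randn(4);
    [U, S, V] = svd(A);
    x = V(:,1); y = U(:,1);            % maximizers of |y'*A*x|
    disp(abs(y'*A*x) - norm(A,2));     % close to 0
    % any other pair of unit vectors gives a value that is not larger:
    u = randn(4,1); u = u/norm(u);
    w = randn(4,1); w = w/norm(w);
    disp(abs(w'*A*u) <= norm(A,2) + 1e-12);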


(56) Prove that every symmetric gauge function g : R^n −→ R_{≥0} is an absolute norm on R^n.

Solution: Let u ∈ Rn . Define bi = 1 if ui  0 and bi = −1 if ui < 0. Then |ui | = bi ui for 1  i  n, so abs(u) = diag(b1 , . . . , bn )u. By the definition of gauge functions, g(u) = g(diag(b1 , . . . , bn )u) = g(abs(u)), which implies that g is an absolute norm. (57) For x ∈ Rn , let gk (x) be the sum of the largest k absolute values of components of x. Prove that (a) gk is a gauge symmetric function for 1  k  n; (b) if g : Rn −→ R0 is a gauge symmetric function and gk (x)  gk (y) for 1  k  n, then g(x)  g(y). (58) Let A ∈ Rm×n and let I ⊆ {1, . . . , m} and J ⊆ {1, . . .}. Define A(I, J) = {aij | i ∈ I, j ∈ J}, and AC = maxI,J |A(I, J)|. Prove that  · C is a norm on C m×n . This norm will be referred to as the cut norm. (59) A (b, I, J)-cut matrix is a matrix B ∈ Rm×n such that there exist a number b ∈ R and two sets I, J, such that I ⊆ {1, . . . , m} and J ⊆ {1, . . .} and  b if i ∈ I and j ∈ J, bij = 0 otherwise. Prove that every cut matrix has rank 1 and that there are 2m+n distinct cut matrices having b as their non-zero entry. (60) Let A ∈ Rm×n be such that there exists a pair of sets (I, J) such that A(I, J)  |I||J| for some  > 0 and let b = A(I,J) |I||J| . If B is 2 the (b, I, J)-cut matrix, prove that A−BF  A2F −2 |I||J|. Solution: Note that in A −!B, one subtracts from every eleS ment of the submatrix A their average. Thus, T  {a2ij | i ∈ I or j ∈ J} A − B2F = "     A(I, J) 2  + aij − i ∈ I, j ∈ J |I||J| = A2F −

A(I, J)2  A2F − 2 |I||J|. |I||J|


(61) Let A ∈ Rm×n be such that |aij |  1. Prove that for every  > 0 it is possible to construct a sequence of cut matrices . , Bp such that A − BC  mn and p  12 , where B1 , . . B = pk=1 Bk . Solution: If AC  mn, we take p = 1 and B1 = Om,n . Otherwise, AC > mn and we can apply repeatedly the process described in Supplement 6.20. In this manner, we construct two sequences of matrices A1 , . . . , Ap and B1 , . . . , Bp such that A1 = A − B1 , A2 = A1 − B2 , . . . , Ap = Ap−1 − Bp . By Supplement 6.20, we have Ai+1 2F  Ai 2F − 2 mn. Since  · 2F is non-negative, it is clear that this process can be repeated no more than 12 times until we obtain A − BC  mn, where B = B1 + · · · + Bp . (62) Prove the following extension of the result of Supplement 6.20: if A ∈ Rm×n , then for every  > 0 there exists a sequence of cut matrices B1 , . . . , Bp such that A = B1 + · · · + Bp + R, where RC  mn. The number p is the width of this decomposition, while RC is the error. Solution: Let A be a matrix in Rm×n and let A˜ be the matrix 1 A. We have |˜ aij |  1. Choose now 1 = A ∞ . By A˜ = A ∞ Supplement 6.20, there is a sequence of cut matrices E1 , . . . , Ep such that A˜ = E1 + · · · + Ep + Q and QC  1 mn. Therefore, A = A∞ A˜ = A∞ E1 + · · · + A∞ Ep + A∞ Q and we define Bi as the cut matrix Bi = A∞ Ei for 1  i  p and R = A∞ Q. Note that RC = A∞ QC  A∞ 1 mn  mn. (63) Let D ∈ Rn×n be a diagonal matrix such that dii  0 for 1  i  n. Prove that if X is an orthogonal matrix, then trace(XD)  trace(D). Solution: Since D is a diagonal matrix, we have trace(XD) = n x d i=1 ii ii . The orthogonality of X implies xii  1, so xii dii  dii because dii  0. Thus, trace(XD) =

Σ_{i=1}^{n} x_{ii} d_{ii} ≤ Σ_{i=1}^{n} d_{ii} = trace(D).


(64) Let  ·  be a norm on Rn , Dn = Rn × Rn − {(0, 0)} −→ R and let f : Dn −→ R be defined by f (x, y) =

x + y2 + x − y2 . 2(x2 + y2 )

Prove that 1 for (x, y) ∈ Dn ; (a) f (x + y, x − y) = f (x,y) (b) if a = inf{f (x, y) | (x, y) ∈ Dn } and b = sup{f (x, y) | (x, y) ∈ Dn }, then 1  a  1  b  2, 2 and ab = 1; (c) the norm  ·  is generated by an inner product if and only if a = b = 1. Solution: The first part is immediate. It is clear that a  b. The definition of the norm implies f (x, y)  2, so b  2. By the first part we have a = 1b , which implies a  12 , and a  b implies a  1  b. The last part follows from Theorem 6.32. (65) Prove that if x, y ∈ Cn are such that x2 = y2 , then x+y ⊥ x − y. (66) Let f : V × V −→ R be a bilinear form on a real inner product space such that x ⊥ y implies f (x, y) = 0. Prove that f (x, y) = c(x, y) for some c ∈ R. (67) Let f : V × V R be a bilinear form on the R-linear space V. Define x ⊥f y to mean that f (x, y) = 0. Prove the following: (a) x ⊥f y and x ⊥f z imply x ⊥f (ay + bz) for a, b ∈ R; (b) x1 ⊥f y and x2 ⊥f y imply (ax1 + bx2 ) ⊥f y for a, b ∈ R; (c) Let f : R2 × R2 −→ R be the bilinear form defined by      x x = xx + xy  − x y − yy  . f ,  y y Prove that there exist x, y ∈ R2 such that x ⊥f y but y ⊥f x. (68) If S is a subspace of Cn , prove that (S ⊥ )⊥ = s. (69) Prove that every permutation matrix Pφ is orthogonal.


A matrix A ∈ Cm×n is subunitary if A is a submatrix of a unitary matrix U . If A ∈ Rm×n , and A is subunitary, then A is a suborthogonal matrix (see [14], where suborthogonal matrices were introduced and referred to as suborthonormal matrices). A matrix A ∈ Cm×n is semiunitary if it is rowwise or columnwise unitary; if A ∈ Rm×n , then A is said to be semiorthogonal. Clearly, every unitary (orthogonal) matrix is a semiunitary (semiorthogonal) matrix and every semiunitary (semiorthogonal) matrix is a subunitary (suborthogonal) matrix. (70) Prove that every submatrix of a subunitary (suborthogonal) matrix is subunitary (suborthogonal). (71) Prove that every suborthogonal matrix can be augmented to a semiorthogonal matrix by adding only rows and by adding only columns to it. (72) Let A ∈ Cn×n be an upper Hessenberg matrix and let A = QR be its QR decomposition. If R is a non-singular matrix, prove that both matrices Q and RQ are upper Hessenberg matrices. Solution: By Theorem 3.8, R−1 is an upper triangular matrix. Since Q = AR−1 , it follows from Supplement 11 of Chapter 3 that Q is an upper Hessenberg matrix. From the same supplement, it follows that RQ is an upper Hessenberg matrix. (73) Let A ∈ Rn×n be a symmetric matrix. Prove that (x, Ax) − (y, Ay) = (A(x − y), x + y) for every x, y ∈ Rn . (74) Let x, y ∈ Rn be two unit vectors. Prove that | sin ∠(x, y)| =

x + y2 x − y2 . 2

(75) Let u and v be two unit vectors in R^n. Prove that if α = ∠(u, v), then

    ‖u − v cos α‖₂ = sin α.

Further, prove that v cos α is the closest vector in ⟨v⟩ to u.

464

Linear Algebra Tools for Data Mining (Second Edition)

Solution: By the definition of the Euclidean norm, we have u − v cos α22 = (u − v cos α) (u − v cos α) = 1 − 2u v cos α + cos2 α = 1 − cos2 α = sin2 α, which justifies the equality that we need to prove. For the second part, let w = av be a vector in v. We have u − av22 = (u − av) (u − av) = 1 − 2au v + a2 = 1 − 2a cos α + a2 . The least value of the function in a is achieved when a = cos α. (76) Let {v 1 , . . . , v p } ⊆ Rn be a collection of p unit vectors such that ∠(vi , v j ) = θ, where 0 < θ  π2 for every pair (v i , v j ) . such that 1  i, j  p and i = j. Prove that p  n(n+1) 2 Solution: We shall prove that under the assumptions made above, the set {A1 , . . . , Ap } of symmetric matrices Ai = v i v i ∈ Rn×n is linearly independent. Suppose that a1 A1 + · · · + ap Ap = On×n . This is equivalent to a1 a1 a1 + · · · + ap ap ap = On×n . Therefore, by multiplying the last equality by ai to the left and by ai to the right, we obtain a1 ai a1 a1 ai + · · · + ap ai ap ap ai = 0, which amounts to a1 cos2 θ + · · · + ap−1 cos2 θ = ap + ap+1 cos2 θ + · · · + ap cos2 θ = 0. This equality and the similar p − 1 equalities can be expressed in matrix form as (Ip (1 − cos2 θ) + Jp cos2 θ)a = Op,p , where

⎛ ⎞ a1 ⎜ .. ⎟ a = ⎝ . ⎠. ap

By Supplement 24 of Chapter 3, a = 0p , so A1 , . . . , Ap are linearly independent. By Supplement 7 of the same chapter, . p  n(n+1) 2

Norms and Inner Products

465

(77) Let A ∈ Cn×n be a unitary matrix. If A = (u1 , . . . , un ), prove that the set {u1 , . . . , un } is orthonormal. (78) Let {w 1 , . . . , wk } ⊆ Cn be a set of unit vectors such that wi ⊥ wj for i = j and 1  i, j  k. If Wk = (w 1 · · · wk ) ∈ Cn×k , then prove that In − Wk WkH = (In − wk wHk ) · · · (In − w1 wH1 ). Solution: The straightforward proof is by induction on k. (79) This exercise refines the result presented in Supplement 36 of Chapter 3. Let A be a matrix in Cm×n such that rank(A) = r. Prove that A can be factored as A = P C, where P ∈ Cm×r is a matrix having an orthonormal set of columns (that is, P H P = Ir ), C ∈ Cr×n , and rank(P ) = r. Also, prove that A can be factored as A = DQ, where D ∈ Cm×r , Q ∈ Cr×n has an orthonormal set of rows (that is, QQH = Ir , and rank(Q) = r. (80) Let Cn×n be the linear space of complex matrices. Prove the following: (a) the set of Hermitian matrices H and the set of skewHermitian matrices K in Cn×n are subspaces of Cn×n ; (b) if Cn×n is equipped with the inner product defined in Example 6.21, then K = H⊥ . (81) Give an example of a matrix that has positive elements but is not positive definite. (82) Prove that if A ∈ Rn×n is a positive definite matrix, then A is invertible and A−1 is also positive definite. (83) Let A ∈ Cn×n be a positive definite Hermitian matrix. If A = B + iC, where B, C ∈ Rn×n , prove that the real matrix   B −C D= C B is positive definite. (84) Let A ∈ R2×2 be a matrix such that x Ax > 0 for every x ∈ R2 − {0}. Does it follow that uH Au > 0 for every x ∈ C2×2 − {0}?   i . Hint: Consider A = I2 and x = 0

466

Linear Algebra Tools for Data Mining (Second Edition)

(85) Let x, y ∈ Rn be two vectors such that x > 0. Prove that there exists a number  > 0 such that y2   implies x + y > 0. (86) Let U ∈ Rn×k be a matrix such that U  U = Ik , where k  n. Prove that for every x ∈ Rn , we have x2  U  x2 . Solution: Let u1 , . . . , uk be the columns of the matrix U . We have ⎛ ⎞ u1 ⎜ .. ⎟  U U = ⎝ . ⎠(u1 · · · uk ) = Ik , uk which shows that {u1 , . . . , uk } is an orthonormal set of vectors completion of this in Rn . Let {u1 , . . . , uk , uk+1 , . . . , un } be the n 2 Rn . If x = set to an orthonormal set of i=1 ai ui , then x2 = n 2 i=1 ai . On the other hand, ⎛ ⎞ ⎛  ⎞ ⎛ ⎞ a1 u1 u1 x ⎜ .. ⎟ ⎜ .. ⎟ ⎜ .. ⎟  U x = ⎝ . ⎠x = ⎝ . ⎠ = ⎝ . ⎠, uk ak uk x because of the orthonormality of the set {u1 , . . . , uk }, so Y



x22

=

k  i=1

a2i



n 

a2i = x22 .

i=1

(87) Let V be a real linear space and let  ·  be a norm generated by an inner product defined on V. V is said to be symmetric relative to the norm if ax − y = x − ay for a ∈ R and x, y ∈ V such that x = y = 1. (a) Prove that if a norm on a linear vector space V is induced by an inner product, then V is symmetric relative to that norm. (b) Prove that V satisfies the Ptolemy inequality x − yz  y − zx + z − xy, for x, y, z ∈ V if and only if V is symmetric.

Norms and Inner Products

467

Solution: Suppose that x2 = (x, x) for x ∈ V . Then, if x = y = 1, we have ax − y2 = (ax − y, ax − y) = a2 (x, x) − 2a(x, y) + (y, y) = a2 + 1 + 2a(x, y). It is easy to see that we also have x − ay = a2 + 1 + 2a(x, y), which implies that V is symmetric relative to the norm. For the second part, suppose that V is symmetric relative to the norm. Observe that Ptolemy inequality is immediate if ˆ, y ˆ, z ˆ be three non-zero any of x, y, or z is 0. Therefore, let x vectors defined by ˆ= x

1 1 1 ˆ= x, y y, zˆ = z. 2 2 x y z2

We have ˆ 2 = ˆ x−y

1 2(x, y) 1 x − y2 − + = . x2 x2 y2 y2 x2 y2

ˆ   ˆ ˆ  + ˆ ˆ , the Ptolemy inequality Since ˆ x−y x−z z−y follows immediately. (88) Let Hu be the Householder matrix corresponding to the unit vector u ∈ Rn . If x ∈ Rn is written as x = y + z, where y = au and z ⊥ u, then Hu x is obtained by a reflection of x relative to the hyperplane that is perpendicular to u, that is, Hu x = −u + v. Solution: We have Hu x = Hu (y + z) = Hu y + Hu z. Then, Hu y = (In − 2uu )(au) = au − 2auu u = au − 2au = −au = −y. Also, Hu z = (In − 2uu )z = z − 2uu z = z because u ⊥ z. (89) Let x ∈ Rn be a unit vector such that x ∈ {e1 , −e1 }, and let v=

1 1 (x + e1 ) and w =  (x − e1 ). 2(1 + x1 ) 2(1 − x1 )

Prove that (a) v and w are unit vectors; (b) we have Hv x = −e1 and Hw x = e1 .

468

Linear Algebra Tools for Data Mining (Second Edition)

Solution: Observe that 1 (x + e1 )(x + e1 ) 2(1 + x1 ) 1 (x2 + 2x1 + e1 2 ) = 1. = 2(1 + x1 )

v2 =

The computation for w is similar. Solving the second part is straightforward and is omitted. (90) Let   a b A= ∈ R2×2 . b a Prove that if A is positive semidefinite but not positive definite, then |a| = |b|. (91) Let S = {s1 , . . . , sn } be a finite set. For a subset U of S, define cU ∈ Rn , the characteristic vector of U , by  1 if si ∈ U, (cU )i = 0 otherwise. (a) Prove that (cU , cW ) = |U ∩ V | for any two subsets U, W of S. (b) Let U = (U1 , . . . , Um ) be a sequence of subsets of S. The incidence matrix of U is the matrix CU = (cU1 , . . . , cUn ). If AU ∈ Rm×m is defined by (AU )ij = |Ui ∩ Uj | for 1  i, j  m, prove that AU = CU CU , and therefore, AU is a positive semidefinite matrix. (92) Let Y ∈ Rn×k be a matrix such that Y  Y = Ik , where k  n. Prove that the matrix In − Y Y  is positive semidefinite. Solution: Let x ∈ Rn . We have x (In − Y Y  )x = x x − (Y  x) (Y  x) = x22 − Y  x22 . The desired inequality follows immediately from Supplement 6.20. (93) Let A and B be two matrices in Cn×n , where A is Hermitian and B is positive semidefinite. Prove that xH Ax < 0 for all x ∈ Cn such that Bx = 0 and x = 0 if and only if there exists a ∈ R such that a > 0 and A − aB ≺ 0.

Norms and Inner Products

469

Solution: Suppose that there exists a ∈ R such that a > 0 and A − aB ≺ 0. This means that xH (A − aB)x < 0. Then, if Bx = 0, it is clear that axH Bx = 0, so xH Ax < 0. Conversely, let A be Hermitian and let B be positive semidefinite such that xH Ax < 0 for all x ∈ Cn such that Bx = 0 and x = 0. Suppose that for every a > 0 there exists x = 0 such that xH Ax  axH Bx. In this case, Bx = 0 implies xH Ax  0. This contradiction yields the desired implication. (94) Let {w1 , . . . , w k } ⊆ Cn be an orthonormal set of vectors in Cn and let x ∈ Cn . If ai = (x, wi ) for 1  i  k, prove that  x−

k 

ai w i , x −

i=1

k 

 ai w i

 (x, x) −

i=1

k 

ai ai  (x, x).

i=1

  Prove that (x, x) = ki=1 ai ai if and only if x = ki=1 ai wi . . , An be n + 1 Hermitian matrices and let B(t) = (95) Let n A0 , . . n−k A t . Prove that if An is positive definite, there exists k k=0 a positive number  such that B(t) is positive definite for every t ∈ [−, ]. Solution: The definition of B(t) allows us to write 

x B(t)x =

n 

x Ak xtn−k .

k=0

 Since the function g(t) = nk=0 x Ak xtn−k is a polynomial in t and g(0) = x An x > 0, the continuity of the polynomial g(t) implies that there exists  > 0 such that t ∈ [, ] implies g(t) > 0. This produces the desired conclusion. (96) Let A ∈ Rn×n and let b ∈ Rn . Define the matrix C ∈ Rn×n by cij = bi aij bj for 1  i, j  n. Prove that A is positive semidefinite if and only if C is positive semidefinite. Solution: Let x be a vector in x Cx =

n  n  i=1 j=1

Rn .

xi cij xj =

We have n  n  i=1 j=1

xi bi aij bj xj ,

Linear Algebra Tools for Data Mining (Second Edition)

470

which is equivalent to x Cx = z  Az, where ⎞ ⎛ x 1 b1 ⎟ ⎜ z = ⎝ ... ⎠. x n bn We leave the remainder of the argument to the reader. (97) Prove that if A, B ∈ R2×2 are two rotation matrices, then AB = BA. ˜2 , e˜3 } be two orthonormal bases in R3 . We e1 , e Let {e1 , e2 , e3 } and {˜ ˜ ) = δij . have (ei , ej ) = δij and (˜ e ,e i j  ˜j = 3k=1 rjk ek and ej = 3k=1 rˆjk e˜k . Since both Suppose that e ˜ ). bases are orthonormal, we obtain rj = (e˜j , e ) and rˆj = (ej , e Clearly, we have: rj = rˆj .  ˜j = ˜ , and ej = (98) Prove that rj = rˆj , e k, rjk rk e  r r e . This shows that the matrix R = (rj ) is k, kj k  orthogonal. (99) A vector v ∈ R3 is isotropic if its components are the same relative to any basis. Prove that the single isotropic vector in R3 is 03 . Solution: Suppose that v ∈ R3 is isotropic. Then, we have    ˜j = ˜k , vj ej = vj rkj e v˜k e v= j

and v=

 j

j,k

˜j = v˜j e



k

v˜j rjk ek =

j,k

 k

vk ek =

 If v is isotropic, then v˜j = vj = k rjk vk for any rotation matrix R. Choosing R as 0 10 R = −1 0 0 0 01 we obtain v1 = v2 = 0. Similarly, choosing 1 0 0 R=0 0 1 0 −1 0 we obtain v2 = v3 = 0, hence, v = 03 .

Norms and Inner Products

471

(100) Let u ∈ Rn be a unit vector. A rotation with axis u is an orthogonal matrix A such that Au = u. Prove that if v ⊥ u, then Av ⊥ u and A v ⊥ u. (101) Let u, v, and w be three unit vectors in R2 − {0}. Prove that ∠(u, v)  ∠(u, w) + ∠(w, v). Solution: The hypothesis implies the existence of α, β, γ ∈ (0, 2π) such that       cos α cos β cos γ u= ,v = , and w = . sin α sin β sin γ Thus, ∠(u, v) = arccos(cos(α − β)), ∠(u, w) = arccos(cos(α − γ)), ∠(w, v) = arccos(cos(γ − β)). Without loss of generality, we may assume that α  γ  β. Thus,  α−β if α − β  π, arccos(cos(α − β)) = 2π − α + β if α − β > π and similar equalities can be written for arccos(cos(α−γ)) and arccos(cos(γ − β)). If α − β  π, then α − γ  π and γ − β  π, so ∠(u, v)  ∠(u, w) + ∠(w, v). Otherwise, α − β > π and several cases may occur. Since α − β = (α − γ) + (γ − β), at most one of α − γ and γ − β can be greater than π. If we have α − γ  π and γ − β  π, the inequality to be shown amounts to 2π − α + β  α − γ + γ − β, which clearly holds. If α − γ  π and γ − β  π, then we have 2π − α + β  α − γ + 2π − γ + β, which amounts to the inequality α  γ, which holds according to the initial assumption.

472

Linear Algebra Tools for Data Mining (Second Edition)

(102) Let T be a subspace of Rn and let u and v be two unit vectors in Rn such that v ∈ T . If t = projT (u), prove that ∠(u, v)  ∠(t, v). Solution: Suppose that T is an m-dimensional space, where m  n. Let {v, v 1 , . . . , vm−1 } be the extension of the set {v} to an orthonormal basis of T . Then t = projU (u), we have t = (u, v)v + (u, v 1 )v 1 + · · · + (u, v m−1 )v m−1 , so (t, v) = (u, v). Thus, t2 cos ∠(t, v) = cos ∠(u, v), which implies cos ∠(u, v)  cos ∠(t, v). (103) We now extend the result of Supplement 6.20 to Rn as follows. Let u, v, and w be three unit vectors in Rn − {0}. Prove that ∠(u, v)  ∠(u, w) + ∠(w, v). (104) This supplement formulates a reciprocal to Theorem 6.36. Prove that if A ∈ Cn×n is a Hermitian and idempotent matrix, then there exists a subspace S of Cn such that A is the projection matrix PS . Solution: Let x ∈ Rn be a vector, and let S = range(A). Then, u = Ax ∈ S. If z = x − u, we have (z, u) = z H u = (xH − uH )u = (xH − xH AH )Ax = xH AH )Ax − xH AH Ax = 0, because A is Hermitian and idempotent. Thus, z ⊥ u and z ∈ S ⊥ . By Theorem 6.37, the decomposition of x = u + z is unique, so u = Ax = projS x. Let V be an n-dimensional linear space equipped with an inner product (·, ·). The subsets B = {b1 , . . . , bn } and C = {c1 , . . . , cn } of V are reciprocal if (bi , cj ) = 1 if i = j and (bi , cj ) = 0 if i = j, for 1  i, j  n. (105) Let V be an n-dimensional linear space equipped with an inner product (·, ·). If B = {b1 , . . . , bn } is a basis of V, then there exists a unique reciprocal set of B. Solution: Let Ui be the subspace of V generated by the set B − {bi } and let Ui⊥ be its orthogonal complement. By Corollary 6.28, dim(Ui⊥ ) = 1 because dim(Ui ) = n − 1. Thus, there exists a vector t = 0 in Ui⊥ . Note that (t, bi ) = 0 because bi ∈ Ui . Define ci =

1 bi . (t, bi )

Norms and Inner Products

473

Then (bi , ci ) = 1 and (bi , cj ) = 0 if j = i. This construction can be applied to all i, where 1  i  n and this yields a set C = {c1 , . . . , cn }, which is reciprocal to B. To prove the uniqueness of the set C, assume that D = {d1 , . . . , dn } is another reciprocal set of the basis B. Then since (bi , cj ) = (bi , dj ), it follows that (bi , cj − dj ) = 0 for every i, j. Since cj − dj is orthogonal on all vectors of B, it follows that cj − dj = 0, so cj = dj . Thus, D = C. (106) Let V be an n-dimensional linear space equipped with an inner product (·, ·). If B = {b1 , . . . , bn } is a basis of V, then the reciprocal set C of B is also a basis of V. (107) Let L = (v 1 , . . . , v n ) be a sequence of vectors, where n  2. Prove that the volume Vn of the parallelepiped constructed on these vectors equals the square root of the Gramian of the sequence (v 1 , . . . , v n ). Solution: For the base case n = 2, the area A of the parallelogram is given by A = u2 v2 sin α, where α = ∠(u, v). In other words,   (u, u) (u, v) det(Gu,v ) = det (u, v) (v, v) = u22 v22 − u22 v22 cos2 α = u22 v22 sin2 α = V22 . Suppose that the statement holds for sequences of n vectors and let L = (v 1 , . . . , v n , v n+1 ) be a sequence of n + 1 vectors. Let v n+1 = x + y be the orthogonal decomposition of v n+1 on the subspace Un = v1 , . . . , v n , where x ∈ Un and y ⊥ Un . Since x ∈ Un , there exist a1 , . . . , an ∈ R such that x = a1 v 1 + · · · + an vn . Let    (v 1 , v 1 ) (v 1 , v 2 ) · · · (v 1 , v n ) (v 1 , v n+1 )     (v 2 , v 1 ) (v 2 , v 2 ) · · · (v 2 , v n ) (v 2 , v n+1 )      .. .. .. .. .. det(GL ) =  . . . . . .    (v n , v 1 ) (v n , v 2 ) · · · (v n , v n ) (v n , v n+1 )    (v n+1 , v 1 ) (v n+1 , v 2 ) · · · (v n+1 , v n ) (v n+1 , v n+1 )

474

Linear Algebra Tools for Data Mining (Second Edition)

By subtracting from the last row the first row multiplied by a1 , the second row multiplied by a2 , etc., the value of the determinant remains the same, and we obtain    (v 1 , v 1 ) (v 1 , v 2 ) · · · (v 1 , v n ) (v 1 , v n+1 )     (v 2 , v 1 ) (v 2 , v 2 ) · · · (v 2 , v n ) (v 2 , v n+1 )      .. .. .. .. .. det(GL ) =  . . . . . .   (v n , v 1 ) (v n , v 2 ) · · · (v n , v n ) (v n , v n+1 )    (y, v 1 ) (y, v 2 ) · · · (y, v n ) (y, v n+1 )  Note that (y, v 1 ) = (y, v 2 ) = · · · = (y, v n ) = 0 because y ⊥ Un and (y, v n+1 ) = (y, x + y) = y22 , which allows us to further write    (v 1 , v 1 ) (v 1 , v 2 ) · · · (v 1 , v n ) (v 1 , v n+1 )     (v 2 , v 1 ) (v 2 , v 2 ) · · · (v 2 , v n ) (v 2 , v n+1 )      . . . . . . . . . . det(GL ) =  . . . . . .   (v n , v 1 ) (v n , v 2 ) · · · (v n , v n ) (v n , v n+1 )    0 0 ··· 0 y22  = V2n y22 = V2n+1 . (108) Let A ∈ Rm×n be a matrix such that rank(A) = n. Prove that the R-factor of the QR decomposition of A = QR has positive diagonal elements, it equals the Cholesky factor of A A, and therefore is uniquely determined. Solution: If rank(A) = n, then, by Theorem 6.57, there exists a unique Cholesky factor of the matrix A A. Suppose that A has the full QR decomposition   R , A=Q Om−n,n where Q ∈ Rm×m and R ∈ Rn×n . Then   R    = R R. A A = (R On,m−n )Q Q Om−n,n (109) Let A ∈ Rn×n be a symmetric positive definite matrix and let dA : Rn × Rn −→ R be the function defined by dA (x, y) =

Norms and Inner Products



(x − y) A(x − y) for x, y ∈ on Rn .

Rn .

475

Prove that dA is a metric

Solution: By Cholesky’s Decomposition Theorem, we can factor A as A = R R, where R is an upper triangular matrix R with positive diagonal elements. Therefore, dA (x, y) =  (R(x − y)) (R(x − y)) = R(x − y)2 and the desired conclusion follows immediately. (110) Let A ∈ Cn×m be a full-rank matrix such that m  n. Prove that A can be factored as A = LQ, where L ∈ Cn×n and Q ∈ Cn×m , such that (a) the columns of Q constitute an orthonormal basis for range(AH ), and (b) L = (ij ) is a lower triangular invertible matrix such that its diagonal elements are real non-negative numbers, that is, ii  0 for 1  i  n. n n function defined by fn (x, y) = (111) Let f n n : R −→ R −→ R be the n . Prove that if x, y ∈ C(0 , 1), (1 + x y ) for x, y ∈ R j j n j=1 then |fn (x, y)|  e. Solution: It is easy to verify the elementary inequality ln(1 + |t|)  |t| for t ∈ R. This allows us to write n

|fn (x, y)|  e

(112) (113)

(114) (115)

j=1

ln(1+|xj yj |)

n

e

i=1

|xj yj |

 ex2 y2 .

because x2  1 and y2  1. Prove that if a matrix A ∈ Cn×n is normal, then range(A) ⊥ null(A). We saw that for every matrix A, the matrix AH A is positive semidefinite. Prove that if A = QR is the full QR decomposition of A, then the Cholesky decomposition of AH A is RH R. Let U ∈ Cn×n be a unitary matrix. If U = (X Y ), where X ∈ Cn×p and Y ∈ Cn×(n−p) , prove that range(X) = range(Y )⊥ . Let A ∈ Rn×n be a skew-symmetric matrix. Prove that A2 is a symmetric negative semi definite matrix. Solution: We have (A2 ) = (A )2 = (−A)2 = A2 , so A2 is symmetric. Furthermore, x A2 x = −x A Ax = −(Ax) Ax  0 for x ∈ Rn .

476

Linear Algebra Tools for Data Mining (Second Edition)

(116) Let B = {b1 , . . . , bn } and C = {c1 , . . . , cn } be two orthonormal bases of Rn . The coherence of B and C is the number coh(B, C) = max1i,jn bi c. Prove that 1 √  coh(B, C)  1. n Solution: Note that the matrix D = (bi cj ) ∈ Rn×n is an orthonormal matrix and dk 2 = 1 for each of the columns dk of D. Thus, not all entries of D can be less than √1n , so

max1i,jn bi c  √1n . (117) Let A ∈ Rm×m and B ∈ Rn×n be two orthogonal matrices. Prove that their Kronecker product A ⊗ B ∈ Rmn×mn is also an orthogonal matrix. Solution: We have (A ⊗ B)(A ⊗ B) = (A ⊗ B)(A ⊗ B  ) (by Theorem 3.56) = (AA ) ⊗ (BB  ) (by Theorem 3.55) = Im ⊗ In = Imn .

(118) Let U, V ∈ Rn×n be two orthonormal matrices, where U = (u1 , . . . , un ) and V = (v 1 , . . . , v n ), and let w ∈ Rn . Since the columns of U and V are bases for Rn , for every x ∈ Rn with x2 = 1 there exist a, b ∈ Rn such that x = U a = V b. Prove the inequality a1 + b1  

2 coh(U, V )

known an the uncertainty principle (see [43]). Solution: Since both U and V are orthonormal matrices, we have x2 = a2 = b2 . Therefore, 1 = x2 = x x = a U  V b =

n n   i=1 j=1

 coh(U, V )a1 b1 .

|ai | |bj |ui v j

Norms and Inner Products

This implies a1 b1  . b1  √ 2

1 coh(U,V ) ,

477

which in turn yields a1 +

coh(U,V )

(119) Let u, v, w, z ∈ R3 . Prove that

  (u, w) (u, z)  . ((u × v), (w × z)) =  (v, w) (v, z) 

This is Lagrange’s identity. (120) Let a, b, c, d ∈ R3 be such that a ⊥ b and c ⊥ d. Prove that the vectors x and y are orthogonal, where x = (b×c)×(a×d), and y = (a × c) × (b × d). (121) Let u, v, w ∈ R3 . Prove that u × (v × w) + v × (w × u) + w × (u × v) = 0R3 . This is Jacobi’s identity.  (122) Prove that if u, v ∈ R3 , then (u × v)i = 3j,k=1 ijk uj vk for 1  i  3. (123) Let a, b, c ∈ R3 be three vectors that do not belong to the same plane. Prove that the vectors a + αb, b + βc, and c + γa are coplanar if and only if αβγ = −1. Solution: The scalar triple product of a + αb, b + βc, and c + γa can be rewritten as (a + αb, b + βc, c + γa) = (a, b + βc, c + γa) + α(b, b + βc, c + γa) = (a, b + βc, c + γa) + αβ(b, c, c + γa) = (a, b + βc, c + γa) + αβγ(b, c, a). Similarly, (a, b + βc, c + γa) = (a, b + βc, c) + (a, b + βc, γa) = (a, b, c). Thus, (a + αb, b + βc, c + γa) = (a, b, c)(1 + αβγ).

478

Linear Algebra Tools for Data Mining (Second Edition)

Consequently, the vectors a + αb, b + βc, and c + γa are coplanar if and only if their scalar triple product is 0, that is, if and only if αβγ = −1, because the fact that a, b, c are not coplanar implies (a, b, c) = 0. Bibliographical Comments Supplements 58–61 contain concepts in [59, 87]. Supplement 52 is stated in [113].

and results

developed

Chapter 7

Eigenvalues

7.1

Introduction

The existence of directions that are preserved by linear transformations (which are referred to as eigenvectors) was discovered by Euler in his study of movements of rigid bodies. This work was continued by Lagrange, Cauchy, Fourier, and Hermite. The theme of eigenvectors and eigenvalues acquired increasing significance through its applications in heat propagation and stability theory. Later, Hilbert initiated the study of eigenvalues in functional analysis (in the theory of integral operators). He introduced the terms “eigenvalue” and “eigenvector”.1 7.2

Eigenvalues and Eigenvectors

Definition 7.1. Let A ∈ Cn×n be a square matrix. An eigenvector of A is a vector v ∈ Cn − {0n } such that Av = λv for some λ ∈ C. The complex number λ that satisfies the previous equality is known as an eigenvalue of the matrix A and the pair (λ, v) is known as an eigenpair of A.

1 The term eigenvalue is a German–English hybrid formed from the German word eigen, which means “own”, and the English word “value.” Its use is common in the literature and we adopt it here.

479

480

Linear Algebra Tools for Data Mining (Second Edition)

The set of eigenvalues of a matrix A will be referred to as the spectrum of A and will be denoted by spec(A). Example 7.1. Let A ∈ C2×2 be the matrix  A=

 a b . c d

  The vector v = v1 v2 = 02 is an eigenvector of A if Av = λv, a system equivalent to av1 + bv2 = λv1 , cv1 + dv2 = λv2 . This, in turn, is equivalent to the homogeneous system (a − λ)v1 + bv2 = 0, cv1 + (d − v2 ) = 0. A non-trivial solution exists if and only if det(A) = 0, which is equivalent to (a − λ)(d − λ) − bc = 0 or λ2 − (a + d)λ + ad − bc = 0. The roots of this equation are the eigenvalues of A:  a + d ± (a − d)2 + 4bc . λ1,2 = 2 Theorem 7.1. Let A ∈ Cn×n be a matrix. If A has n distinct eigenvalues and v 1 , . . . , v n are corresponding eigenvectors, then {v 1 , . . . , v n } is a linearly independent set. Proof. Suppose that λ1 , . . . , λn are distinct eigenvalues of A and v1 , . . . , vn are eigenvectors that correspond to these values.

Eigenvalues

481

If {v 1 , . . . , v n } were a linearly dependent set, we would have a linear combination of these vectors a1 v p1 + · · · + ak v pk = 0n ,

(7.1)

containing a minimal number k of vectors such that not every number a1 , . . . , ak is 0. This would yield a1 Av p1 + · · · + ak Av pk = a1 λp1 v p1 + · · · + ak λpk v pk = 0n . Taking into account Equality (7.1) multiplied by λpk , we would have a1 (λp1 − λpk )v p1 + · · · + ak−1 (λpk−1 v pk−1 = 0n , which would contradict the minimality of the number of terms in Equality (7.1). Thus, {v 1 , . . . , v n } is a linearly independent set.  Corollary 7.1. A matrix A ∈ Cn×n has at most n distinct eigenvalues. Proof. Since the maximum size of a linearly independent set in Cn is n, it follows from Theorem 7.1 that A cannot have more than n  distinct eigenvalues. The set of eigenvectors SA,λ that correspond to an eigenvalue λ of A is a subspace of Cn because u, v ∈ SA,λ implies A(au + bv) = aAu + bAv = aλu + bλv = λ(au + bv). The subspace SA,λ is known as the invariant subspace of A for the eigenvalue λ. Clearly, the invariant subspace of A ∈ Cn×n for λ coincides with the null space of the matrix λIn − A. Not every real matrix A ∈ Rn×n has a non-zero invariant subspace in Rn .   1 0 Example 7.2. Let A = ∈ R2×2 . If Ax = λx for x ∈ R2 , 0 −1 then we have x2 = λx1 and −x1 = λx2 , which implies x21 + x22 = 0. This is equivalent to x = 02 . Thus, A has no non-zero invariant subspace. The situation is different if we regard A as a matrix in C2×2 . Under this assumption, the equalities x2 = λx1 and −x1 = λx2

482

Linear Algebra Tools for Data Mining (Second Edition)

imply λ2 + 1 = 0 if x = 02 . Thus, we have λ1 = i and λ2 = −i. In the first case, we have the invariant subspace    x1 2 ∈ C | x2 = ix1 , x2 while in the second case the invariant subspace is    x1 2 ∈ C | x2 = −ix1 . x2 Observe that if λ is an eigenvalue for A ∈ Cn×n , then aλ is an eigenvalue of the matrix aA for every a ∈ C. Thus, aspec(A) = spec(aA). If (λ, x) is an eigenpair of A, then xH Ax = λxH x, so λ=

xH Ax . xH x

(7.2)

Equality (7.2) can be specialized to the real case by replacing xH by x . Namely, if A ∈ Rn×n , λ is an eigenvalue and x is an eigenvector that corresponds to λ, then λ=

x Ax . x x

(7.3)

The geometric multiplicity of an eigenvalue λ of a matrix A ∈

Rn×n is denoted by geomm(A, λ) and is equal to dim(SA,λ ). Equiva-

lently, the geometric multiplicity of λ is

geomm(A, λ) = dim(null(A − λIn )) = n − rank(A − λIn )

(7.4)

by Equality (3.8). Theorem 7.2. Let A ∈ Rn×n . We have 0 ∈ spec(A) if and only if A is a singular matrix. Moreover, in this case, geomm(A, 0) = n − rank(A) = dim(null(A)). Proof. The statement is an immediate consequence of Equal ity (7.4). Corollary 7.2. Let A ∈ Rn×n . If 0 ∈ spec(A) and algm(A, 0) = 1, then rank(A) = n − 1.

Eigenvalues

Proof.

483

Clearly, we have geomm(A, 0) = 1, so rank(A) = n − 1. 

Theorem 7.3. Let A ∈ Cn×n and let S ⊆ Cn be an invariant subspace of A. If the columns of a matrix X ∈ Cn×p constitute a basis of S, then there exists a unique matrix L ∈ Cp×p such that AX = XL. Proof. Let X = (x1 · · · xp ). Since Ax1 ∈ S, it follows that Ax1 can be uniquely expressed as a linear combination of the columns of X, that is, Axj = x1 1j + · · · + xp pj for 1  i  p. Thus,

⎛ ⎞ 1j ⎜ . ⎟ ⎟ Axj = X ⎜ ⎝ .. ⎠. pj

The matrix L is defined by L = (ij ).



Corollary 7.3. Using the notations of Theorem 7.3, the pair (λ, v) is an eigenpair of the matrix L if and only if (λ, Xv) is an eigenpair of A. Proof. The statement is an immediate consequence of Theo rem 7.3. The matrix L introduced in Theorem 7.3 will be referred to as a representation of A on the invariant subspace S. Clearly, L depends on the basis chosen for S, so this representation is not unique. Furthermore, we have spec(L) ⊆ spec(A). Theorem 7.4. Let A ∈ Cm×n be a matrix with rank(A) = n and let B ∈ Cp×q be a matrix such that range(B) = range(A)⊥ . Then range(A) is an invariant subspace of a matrix X ∈ Cp×m if and only if B H XA = Oq,n . Proof. The following statements are easily seen to be equivalent: (i) the subspace range(A) is an invariant subspace of X; (ii) Xrange(A) ⊆ range(A); (iii) Xrange(A) ⊥ range(A)⊥ ; (iv) Xrange(A) ⊥ range(B).  The last statement is equivalent to B H XA = Oq,n .

484

Linear Algebra Tools for Data Mining (Second Edition)

Let A ∈ Cn×n be a matrix having the eigenvalues λ1 , . . . , λn . If x1 , . . . , xn are n eigenvectors corresponding to these values, then we have Ax1 = λ1 x1 , . . . , Axn = λn xn . By introducing the matrix X = (x1 · · · xn ) ∈ Cn×n , these equalities can be written in a concentrated form as AX = Xdiag(λ1 , . . . , λn ).

(7.5)

Obviously, since the eigenvalues can be listed in several ways, this equality is not unique. Suppose now that x1 , . . . , xn are unit vectors and that the eigenvalues λ1 , . . . , λn are distinct. Then X is a unitary matrix, X −1 = X H , and we obtain the following equality: A = Xdiag(λ1 , . . . , λn )X H = λ1 x1 xH1 + · · · + λn xn xHn ,

(7.6)

known as the spectral decomposition of the matrix A. We will discuss later (in Chapter 9) a far-reaching extension of the spectral decomposition. 7.3

The Characteristic Polynomial of a Matrix

If λ is an eigenvalue of the matrix A ∈ Cn×n , there exists a non-zero eigenvector x ∈ Cn such that Ax = λx. Therefore, the homogeneous linear system (λIn − A)x = 0n

(7.7)

has a non-trivial solution. This is possible if and only if det(λIn −A) = 0, so eigenvalues are the solutions of the equation det(λIn − A) = 0. Note that det(λIn −A) is a polynomial of degree n in λ, known as the characteristic polynomial of the matrix A. We denote this polynomial by pA .

Eigenvalues

485

Example 7.3. Let ⎞ ⎛ a11 a12 a13 ⎟ ⎜ A = ⎝a21 a22 a23 ⎠ a31 a32 a33 be a matrix in C3×3 . Its characteristic polynomial is   λ − a11 −a12 −a13      pA (λ) =  −a21 λ − a22 −a23  = λ3 − (a11 + a22 + a33 )λ2    −a31 −a32 λ − a33  + (a11 a22 + a22 a33 + a33 a11 − a12 a21 − a23 a32 − a13 a31 )λ − (a11 a22 a33 + a12 a23 a31 + a13 a32 a21 − a12 a21 a33 − a23 a32 a11 − a13 a31 a22 ). Theorem 7.5. Let A ∈ Cn×n . Then spec(A) = spec(A ) and spec(AH ) = {λ | λ ∈ spec(A)}. Proof.

We have

pA (λ) = det(λIn − A ) = det((λIn − A) ) = det(λIn − A) = pA (λ). Thus, since A and A have the same characteristic polynomials, their spectra are the same. For AH , we can write pAH (λ) = det(λIn − AH ) = det((λIn − A)H ) = (pA (λ))H , which implies the second part of the theorem.



Definition 7.2. Let λ be an eigenvalue of a matrix A ∈ Cn×n . A left eigenvector of the matrix A is a vector v ∈ Cn − {0} such that v H A = λv H . We could have used the obvious term right eigenvectors for the eigenvectors of a matrix A. Since in the vast majority of cases we deal with right eigenvectors, we prefer to use the simpler term eigenvectors for the right eigenvectors. Note that if λ is an eigenvalue of A ∈ Cn×n , the set of solutions of the homogeneous system (λIn −A)v = 0 is non-empty and consists of

486

Linear Algebra Tools for Data Mining (Second Edition)

the non-zero invariant space corresponding to the eigenvalue λ. This is also equivalent to saying that rank(λIn − A) < n. Since rank(λIn − A) = rank(λIn − AH ), it follows that the linear system (λIn − AH )v = 0n has a non-zero solution, so AH v = λv, which is equivalent to v H A = λv H . Thus, every eigenvalue has both an eigenvector and a left eigenvector. Equality of spectra of A and A does not imply that the eigenvectors or the invariant subspaces of the corresponding eigenvalues are identical, as it can be seen from the following example. Example 7.4. Consider the matrix  a A= c

A ∈ C2×2 defined by  0 , b

where a = b and c = 0. It is immediate that spec(A) = spec(A ) = {a, b}. For λ1 = a, we have the distinct invariant subspaces:     a−b  SA,a = k k ∈ C , c     1  SA ,a = k k ∈ C , 0 as the reader can easily verify. If λ1 , λ2 are two distinct eigenvalues of A, u is a right eigenvector that corresponds to λ1 , and v is a left eigenvector that corresponds to λ2 , then u ⊥ v. Indeed, we have λ1 v H u = vH Au = λ2 v H u, so v H u = 0. The leading term of the characteristic polynomial of A is generated by (λ − a11 )(λ − a22 ) · · · (λ − ann ) and equals λn . The fundamental theorem of algebra implies that pA has n complex roots, not necessarily distinct. Observe also that, if A is a matrix with real entries, the roots are paired as conjugate complex numbers. Definition 7.3. The algebraic multiplicity of an eigenvalue λ of a matrix A ∈ Cn×n , algm(A, λ), equals k if λ is a root of order k of the equation pA (λ) = 0. If algm(A, λ) = 1, we refer to λ as a simple eigenvalue.

Eigenvalues

487

Example 7.5. Let A ∈ R3×3 be the matrix ⎛ ⎞ 1 1 1 ⎜ ⎟ A = ⎝0 1 2⎠. 2 1 0 The characteristic polynomial of A is   λ − 1 −1 −1     pA (λ) =  0 λ − 1 −2 = λ3 − 2λ2 − 3λ.    −2 −1 λ  Therefore, the eigenvalues of A are 3, 0, and −1. The eigenvalues of I3 are obtained from the equation   λ − 1 0  0     det(λI3 − I3 ) =  0 λ − 1 0  = (λ − 1)3 = 0.    0 0 λ − 1 Thus, I3 has one eigenvalue, 1, and algm(I3 , 1) = 3. Example 7.6. Let P (a) ∈ Cn×n be the matrix ⎛ ⎞ a 1 ··· 1 ⎜1 a · · · 1⎟ ⎜ ⎟ P (a) = ⎜ .. .. . ⎟. ⎝ . . · · · .. ⎠ 1 1 ··· a

To find the eigenvalues of P (a), we need to solve the equation   λ − a −1 · · · −1     −1 λ − a · · · −1      .. ..  = 0.  ..  . . ··· .     −1 −1 · · · λ − a By adding the first n − 1 columns to the last and factoring out λ − (a + n − 1), we obtain the equivalent equation   λ − a −1 · · · 1    −1 λ − a · · · 1    = 0. (λ − (a + n − 1))  . . . .. · · · ..   ..    −1 −1 · · · 1

488

Linear Algebra Tools for Data Mining (Second Edition)

Adding the last column from the first n − 1 columns and expanding the determinant yields the equation (λ − (a + n − 1))(λ − a + 1)n−1 = 0, which allows us to conclude that P (a) has the eigenvalue a + n − 1 with algm(P (a), a + n − 1) = 1 and the eigenvalue a − 1 with algm(P (a), a − 1) = n − 1. In the special case when a = 1, we have P (1) = Jn,n . Thus, Jn,n has the eigenvalue λ1 = n with algebraic multiplicity 1 and the eigenvalue 0 with algebraic multiplicity n − 1. Definition 7.4. A matrix A ∈ Cn×n is simple if there exists a linearly independent set of n eigenvectors. If A ∈ Cn×n has n distinct eigenvalues, then, by Theorem 7.1, A is a simple matrix. The reverse of this statement is false because there exist simple matrices for which not all eigenvalues are distinct. For example, spec(In ) = {1}, but {e1 , . . . , en } is a linearly independent set of distinct eigenvectors. Theorem 7.6. Let A ∈ Rn×n be a matrix and let λ ∈ spec(A). Then, for any k ∈ P, λk ∈ spec(Ak ). Proof. The proof is by induction on k  1. The base step, k = 1, is immediate. Suppose that λk ∈ spec(Ak ), that is Ak x = λk x for some x ∈ V − {0}. Then Ak+1 x = A(Ak x) = A(λk x) = λk Ax = λk+1 x, so λk+1 ∈ spec(Ak+1 ).  Theorem 7.7. Let A ∈ Rn×n be a non-singular matrix and let λ ∈ spec(A). We have λ1 ∈ spec(A−1 ) and the sets of eigenvectors of A and A−1 are equal. Proof. Since λ ∈ spec(A) and A is non-singular, we have λ = 0 and Ax = λx for some x ∈ V − {0}. Therefore, we have A−1 (Ax) = λA−1 x, which is equivalent to λ−1 x = A−1 x, which implies λ1 ∈ spec(A−1 ). In addition, this implies that the set of eigenvectors of A  and A−1 are identical. Theorem 7.8. Let pA (λ) = λn +c1 λn−1 +· · ·+cn−1 λ+cn be the characteristic polynomial of the matrix A. Then we have ci = (−1)i Si (A) for 1  i  n, where Si (A) is the sum of all principal minors of order i of A.

Eigenvalues

489

Proof. Since pA (λ) = λn + c1 λn−1 + · · · + cn−1 λ + cn , it is easy to see that the derivatives of pA (λ) are given by (1)

pA (λ) = nλn−1 + (n − 1)c1 λn−2 + · · · + cn−1 , (2)

pA (λ) = n(n − 1)λn−2 + (n − 1)(n − 2)c1 λn−3 + · · · + 2cn−2 , .. . (k)

pA (λ) = n(n − 1) · · · (n − k + 1)λn−k + · · · + k!cn−k , .. . (n)

pA (λ) = n!c0 . This implies (k)

cn−k = k!pA (0) for 0  k  n. On the other hand, the derivatives of pA (λ) can be computed using Theorem 5.11, and taking into account the formula given in Exercise 17 of Chapter 5, we have 1 (−1)k k!Sn−k (A) = (−1)n−k Sn−k (A), k! which implies the statement of the theorem. cn−k =



By Vi´ete’s Theorem, taking into account Theorem 7.8, we have λ1 + · · · + λn = a11 + a22 + · · · + ann = trace(A) = −c1 .

(7.8)

Another interesting fact that follows immediately from Theorem 7.8 is λ1 · · · λn = det(A). Theorem 7.9. nomial in C[x],

(7.9)

Let p(λ) = λn + a1 λn−1 + · · · + an−1 λ + an be a polywhere n  2. Then the matrix Ap ∈ Cn×n defined by ⎛

0 1 0 ⎜ 0 0 1 ⎜ Ap = ⎜ .. .. ⎜ .. ⎝ . . . −an −an−1 −an−2 has p as its characteristic polynomial.

⎞ 0 0 ⎟ ⎟ ⎟ .. ⎟ ··· . ⎠ · · · −a1 ··· ···

490

Linear Algebra Tools for Data Mining (Second Edition)

Proof. The proof is by induction on n  2. For the base case, n = 2, we have   0 −1 A= , −a2 −a1 the characteristic polynomial of A is   λ 1   = λ2 + a1 λ + a2 , pA (λ) =   a2 λ + a1 as claimed. Suppose now that the statement holds for polynomials of degree less than n and let p be a polynomial of degree n. Define q as q(λ) = λn−1 + a1 λn−2 + · · · + an−1 . By the inductive hypothesis, the characteristic matrix ⎛ 0 1 0 ··· ⎜ 0 0 1 ··· ⎜ Aq = ⎜ .. .. ⎜ .. ⎝ . . . ··· −an−1 −an−2 −an−3 · · ·

polynomial of the ⎞ 0 0 ⎟ ⎟ ⎟ .. ⎟ . ⎠ −a1

is q(λ). Observe that det(λIn − Ap ) = λ det(Aq ) + an , as follows by expanding det(λIn − Ap ) by its first column. Thus, the characteristic polynomial of Ap is pAp = λpAq (λ) + an = λq(λ) + an = p(λ), which concludes the argument.



The matrix Ap will be referred to as the companion matrix of the polynomial p. Theorem 7.10. Let A ∈ Cm×n and B ∈ Cn×m be two matrices. Then the set of non-zero eigenvalues of the matrices AB ∈ Cm×m and BA ∈ Cn×n are the same and algm(AB, λ) = algm(BA, λ) for each such eigenvalue.

Eigenvalues

Proof.

Consider the following straightforward equalities:      λIm A λIm − AB Om,n Im −A = , On,m λIn B In −λB λIn      λIm A −λIm −A −Im Om,n = . B In On,m λIn − BA −B λIn

Observe that 

det

491

Im

−A

On,m

λIn



λIm

A

B

In



 = det

−Im

Om,n

−B

λIn



λIm

A

B

In

 ,

and therefore,     −λIm −A λIm − AB Om,n = det det . On,m λIn − BA −λB λIn The last equality amounts to λn pAB (λ) = λm pBA (λ). Thus, for λ = 0, we have pAB (λ) = pBA (λ), which gives the desired conclusion.  Corollary 7.4. Let

⎞ a1 ⎜.⎟ ⎟ a=⎜ ⎝ .. ⎠ an ⎛

be a vector in Cn − {0}. Then, the matrix aaH ∈ Cn×n has one eigenvalue distinct from 0, and this eigenvalue is equal to a2 . Proof. By Theorem 7.10, the matrix aaH has the same non-zero eigenvalues as the matrix aH a ∈ C1×1 and the single eigenvalue of aH a is aH a = a2 .  Theorem 7.11. Let A ∈ C(m+n)×(m+n) be a matrix partitioned as   B C , A= On,m D where B ∈ Cm×m , C ∈ Cm×n , and D ∈ Cn×n . Then spec(A) = spec(B) ∪ spec(D).

492

Linear Algebra Tools for Data Mining (Second Edition)

Proof. Let λ ∈ spec(A) and let x ∈ Cm+n be an eigenvector that corresponds to λ. If   u x= , v where u ∈ Cm and v ∈ Cn , then we have        B C u Bu + Cv u Ax = = =λ . On,m D v Dv v This implies Bu + Cv = λu and Dv = λv. If v = 0, then λ ∈ spec(D); otherwise, Bu = λu, which yields λ ∈ spec(B), so λ ∈ spec(B) ∪ spec(D). Thus, spec(A) ⊆ spec(B) ∪ spec(D). To prove the converse inclusion, note that if λ ∈ spec(B) and u is an eigenvector of λ, then Bu = λu, which means that     u u A =λ , 0 0 so spec(B) ⊆ spec(A). Similarly, spec(D) ⊆ spec(A), which implies  the equality of the theorem. 7.4

Spectra of Hermitian Matrices

Theorem 7.12. All eigenvalues of a Hermitian matrix A ∈ Cn×n are real numbers. All eigenvalues of a skew-Hermitian matrix are purely imaginary numbers. Proof. Theorem 3.18 implies that xH x is a real number for every x ∈ Cn . Then, by Equality (7.2), λ is a real number. Suppose now that B is a skew-Hermitian matrix. Then, as above, xH Ax = −xH Ax, which implies that the real part of xH Ax is 0. Thus, xH Ax is a purely imaginary number and, by the same Equality (7.3),  λ is a purely imaginary number.

Eigenvalues

493

Corollary 7.5. If A ∈ Rn×n and A is a symmetric matrix, then all its eigenvalues are real numbers. Proof. This statement follows from Theorem 7.12 by observing that the Hermitian adjoint AH of a matrix A ∈ Rn×n coincides with its transposed matrix A .  Example 7.7. Let A ∈ Rn×n be a symmetric and orthogonal matrix. By Corollary 7.5, all its eigenvalues are real numbers. Let λ ∈ spec(A) and let x be an eigenvector that corresponds to λ. Since Ax = λx, we have x A = λx , hence λ2 x x = (Ax ) Axx A Ax = x x. Therefore, λ2 = 1, hence λ ∈ {−1, 1}. Corollary 7.6. Let A ∈ Cm×n be a matrix. The non-zero eigenvalues of the matrices AAH and AH A are positive numbers and they have the same algebraic multiplicities for the matrices AAH and AH A. Proof. By Theorem 7.10, we need to verify only that if λ is a nonzero eigenvalue of AH A, then λ is a positive number. Since AH A is a Hermitian matrix, by Theorem 7.12, λ is a real number. The equality AH Ax = λx for some eigenvector x = 0 implies λx22 = λxH x = (Ax)H Ax = Ax22 , so λ > 0.



Corollary 7.7. Let A ∈ Cm×n be a matrix. The eigenvalues of the matrix B = AH A ∈ Cn×n are real non-negative numbers. Proof. The matrix B defined above is clearly Hermitian and, therefore, its eigenvalues are real numbers by Theorem 7.12. Next, if λ is an eigenvalue of B, then by Equality (7.2), we have λ=

(Ax)H Ax Ax xH AH Ax = =  0, H H xx xx x

where x is an eigenvector that corresponds to λ.



494

Linear Algebra Tools for Data Mining (Second Edition)

Note that if A is a Hermitian matrix, then AH A = A2 , hence the spectrum of AH A is {λ2 | λ ∈ spec(A)}. Theorem 7.13. If A ∈ Cn×n is a Hermitian matrix and u, v are two eigenvectors that correspond to two distinct eigenvalues λ1 and λ2 , then u ⊥ v. Proof. We have Au = λ1 u and Av = λ2 v. This allows us to write v H Au = λ1 v H u. Since A is Hermitian, we have λ1 v H u = v H Au = vH AH u = (Av)H u = λ2 v H u, which implies v H u = 0, that is, u ⊥ v.



Theorem 7.14 (Ky Fan’s Theorem). Let A ∈ Cn×n be a Hermitian matrix such that spec(A) = {λ1 , . . . , λn }, where λ1  · · ·  λn and let V = (v 1 · · · v n ) be the matrix whose columns consist of the corresponding unit eigenvectors of A. n Let {x1 , . . . , xn } be an orthonormal q in C . For q set of vectors any positive integer q  n, the sums i=1 λi and i=1 λn+1−i are,   respectively, the maximum and minimum of qj=1 xj Axj . Namely, the maximum is obtained by choosing the vectors x1 , . . . , xq as the first q columns of V ; the minimum is obtained by assigning to x1 , . . . , xq the last q columns of V. n i p i Proof. Let xi = p=1 bp v be the expressions of x relative to the basis V for 1  i  m. In matrix form, these equalities can be written as ⎛ 1 ⎞ b1 · · · bn1 ⎜. .⎟ ⎟ (x1 . . . xn ) = (v 1 . . . v n ) ⎜ ⎝ .. · · · .. ⎠, b1n · · · bnn where bpi = (v p ) xi = (xi ) v p . Thus, for X = (x1 . . . xn ) and V = (v 1 . . . v n ), we have X = V B, where B is the orthonormal matrix ⎞ ⎛ 1 b1 · · · bn1 ⎜. .⎟ ⎟ B=⎜ ⎝ .. · · · .. ⎠. b1n · · · bnn

Eigenvalues

495

We have 





xj Axj = xj Abjp v p = bjp xj Av p = bjp (xj ) λp v p = (bjp )2 λp q n n    j 2 j 2 y = λq (bp ) + (λp − λq )(bp ) + (λj − λq )(bjp )2 . p=1

p=1

j=q+1

By Inequality (6.26), this implies (xj ) Axj  λq +

q  (λp − λq )(bjp )2 . p=1

Therefore, q  i=1

λi −

q  j=1

⎞ ⎛ q q   (xj ) Axj  (λi − λq ) ⎝1 − (bji )2 ⎠. i=1

(7.10)

j=1

q j 2 2 Again, by Inequality (6.26), we have j=1 (bi )  xi  = 1, so   q q j 2  0. The left member of Inequali=1 (λi − λq ) 1 − j=1 (bi )   i i ity (7.10) becomes 0, when x = v , so qj=1 (xj ) Axj  qi=1 λi . The  maximum of qj=1 (xj ) Axj is obtained when xj = v j for 1  j  q, that is, when X consists of the first q columns of V that correspond to eigenvectors of the top k largest eigenvalues.  The argument for the minimum is similar. An equivalent form of Ky Fan’s Theorem can be obtained by observing that the orthonormality condition of the set {x1 , . . . , xq } where X ∈ Cn×q is the matrix can be expressed as X  X = Iq ,  X = (x1 · · · xq ). Also, the sum qj=1 xj Axj equals trace(X  AX). Thus, q Theorem is equivalent to the fact that the sums n Ky Fan’s λ and i=1 λn+1−i are, respectively, the maximum and mini=1 i imum of trace(X  AX), where X  X = Iq . In this form, Ky Fan’s Theorem is useful for the discussion of principal component analysis in Chapter 13.

496

7.5

Linear Algebra Tools for Data Mining (Second Edition)

Spectra of Special Matrices

In this section, we examine the spectra of special classes of matrices. We begin with spectra of block upper triangular and block lower triangular matrices. Theorem 7.15. Let A be a block upper triangular partitioned matrix given by ⎞ ⎛ A11 A12 · · · A1m ⎟ ⎜ O A 22 · · · A2m ⎟ ⎜ ⎟ ⎜ A=⎜ . .. .. ⎟, . ··· . ⎠ ⎝ .. O O · · · Amm  where Aii ∈ Rpi ×pi for 1  i  m. Then, spec(A) = ni=1 spec(Aii ). If A is a block lower triangular matrix ⎞ ⎛ A11 O · · · O ⎜ A21 A22 · · · O ⎟ ⎟ ⎜ A = ⎜ .. .. .. ⎟, ⎝ . . ··· . ⎠ Am1 Am2 · · · Amm

the same equality holds. Proof. Let A be a block upper triangular matrix. Its characteristic equation is det(λIn − A) = 0. Observe that the matrix λIn − A is also a block upper triangular matrix: ⎛ ⎞ λIp1 − A11 O ··· O ⎜ −A21 ⎟ λIp2 − A22 · · · O ⎜ ⎟ λIn − A = ⎜ ⎟. .. .. .. ⎝ ⎠ . . ··· . −Am1

−Am2

· · · λIpm − Amm

By Theorem 5.13, the characteristic polynomial of A can be written as m m   det(λIpi − Aii ) = pAii (λ). pA (λ) = i=1

n

i=1

Therefore, spec(A) = i=1 spec(Aii ). The argument for block lower triangular matrices is similar.



Eigenvalues

Corollary 7.8. Let A ∈ Rn×n be a ⎛ A11 O ⎜ O A22 ⎜ A = ⎜ .. .. ⎝ . . O O

497

block diagonal matrix given by ⎞ ··· O ··· O ⎟ ⎟ .. ⎟, ··· . ⎠ · · · Amm

 for 1  i  m. We have spec(A) = ni=1 spec(Aii ) where Aii ∈ Rni ×ni and algm(A, λ) = m i=1 algm(Ai , λ). Moreover, v = 0n is an eigenvector of A if and only if we can write ⎛ ⎞ v1 ⎜ .. ⎟ v = ⎝ . ⎠, vm where each vector v i is either an eigenvector of Ai or 0ni for 1  i  m, and there exists i such that v i = 0ni . Proof.

This statement follows immediately from Theorem 7.15. 

Theorem 7.16. Let A = (aji ) ∈ Cn×n be an upper (lower) triangular matrix. Then, spec(A) = {aii | 1  i  n}. Proof. It is easy to see that the characteristic polynomial of A is pA (λ) = (λ−a11 ) · · · (λ−ann ), which implies immediately the theorem.



Corollary 7.9. If A ∈ Cn×n is an upper triangular matrix and λ is an eigenvalue such that the diagonal entries of A that equal λ occur i in aii11 , . . . , aipp , then SA,λ is a p-dimensional subspace of Cn generated by ei1 , . . . , eip . Proof.

This statement is immediate.



Corollary 7.10. We have spec(diag(d1 , . . . , dn )) = {d1 , . . . , dn }. Proof.

This statement is a direct consequence of Theorem 7.16. 

Theorem 7.17. If A spec(A) = {0}.



Cn×n is a nilpotent matrix, then

498

Linear Algebra Tools for Data Mining (Second Edition)

Proof. Let A ∈ Cn×n be a nilpotent matrix such that nilp(A) = k. By Theorem 7.6, if λ ∈ spec(A), then λk ∈ spec(Ak ) = spec(O) = {0}.  Thus, λ = 0. Theorem 7.18. If A ∈ Cn×n is an idempotent matrix, then spec(A) ⊆ {0, 1}. Proof. Let A ∈ Cn×n be an idempotent matrix, λ be an eigenvalue of A, and let x be an eigenvector of λ. We have P 2 x = P x = λx; on the other hand, P 2 x = P (P x) = P (λx) = λP (x) = λ2 x, so λ2 = λ, which means that λ ∈ {0, 1}.  Definition 7.5. Let A ∈ Cp×p . The spectral radius of A is the number ρ(A) = max{|λ| | λ ∈ spec(A)}. The next statement shows that for a stochastic matrix, the spectral radius is equal to 1. Theorem 7.19. At least one eigenvalue of a stochastic matrix is equal to 1 and all eigenvalues lie on or inside the unit circle. Proof. Let A ∈ Rn×n be a stochastic matrix. Then 1 ∈ spec(A) and 1 is an eigenvector that corresponds to the eigenvalue 1 as the reader can easily verify. If λ is an eigenvalue of A and Ax = λx, then λxi = αji xj for 1  n  n, which implies |λ||xi |  aji |xj |. Since x = 0n , let xp be a component of x such that |xp | = max{|xi | | 1  i  n}. Choosing i = p, we have |λ| 

n  i=1

aji

n

|xj |  j  ai = 1, |xp i=1

which shows that all eigenvalues of A lie on or inside the unit circle.



Theorem 7.20. All eigenvalues of a unitary matrix are located on the unit circle.

Eigenvalues

499

Proof. Let A ∈ Rn×n be a unitary matrix and let λ be an eigenvalue of A. By Theorem 6.24, if x is an eigenvector that corresponds to λ, we have x = Ax = λx = |λ|x, which implies |λ| = 1. 7.6



Geometry of Eigenvalues

Any matrix norm of a matrix A ∈ Cn×n provides an upper bound for the absolute value of any eigenvalue. Indeed, if μ is a matrix norm, and (λ, x) is an eigenpair, then |λ|μ(x) = (μ(Ax)  μ(A)μ(x), so |λ|  μ(A). The next result allows finding a more precise location of the eigenvalues of a square matrix A in the complex plane. n×n be a Theorem 7.21 (Gershgorin’s  theorem). Let A ∈ R {|aij | | 1  j  n and j = i}, for square matrix and let ri = 1  i  n. Then we have spec(A) ⊆

n 

{z ∈ C | |z − aii |  ri }.

i=1

Proof. Let λ ∈ spec(A) and let us suppose that Ax = λx, where x = 0. nLet p be such that |xp | = max{|xi | |1n  i  n}. Then j=1 apj xj = λxp , which is equivalent to j=1,j=p apj xj = (λ − app )xp . This, in turn, implies    n  n      apj xj   |apj ||xj | |xp ||λ − app | =  j=1,j=p  j=1,j=p  |xp |

n 

|apj | = |xp |rp .

j=1,j=p

Therefore, |λ − app |  rp for some p. This yields the desired  conclusion.

500

Linear Algebra Tools for Data Mining (Second Edition)

Definition 7.6. Let A ∈ Rn×n be a square matrix and let ri =  {aij | 1  j  n and j = i}, for 1  i  n. A disk of the form Di (A) = {z ∈ C | |z − aii |  ri } is called a Gershgorin disk. By Theorem 3.17, if A ∈ Cn×n , then there exists a Hermitian matrix HA and a skew-Hermitian matrix SA such that A = HA + SA . Let μA = A∞ = max{|aij | 1  i, j  n} be the generalized ∞norm of A and let μHA and μSA be the generalized ∞-norms of HA and SA , respectively. Let λ1 , . . . , λn be the eigenvalues of A, where |λ1 |  · · ·  |λn |. The real eigenvalues of HA and SA are denoted as η1 , . . . , ηn and σ1 , . . . , σn , where η1  · · ·  ηn and σ1  · · ·  σn . These notations are needed for the next two theorems. Theorem 7.22 (Hirsch’s first theorem). If A ∈ Cn×n , then we have λk  nμA , (λk )  nμHA , and (λk )  nμSA for 1  k  n. Proof. Let x be a unit eigenvector that corresponds to an eigenvalue λk . We have Ax = λk x, which implies (Ax, x) = λk and ¯ k . Thus, we obtain (AH x, x) = (x, Ax) = (x, λk x) = λ ¯k λk + λ 2 (Ax, x) + (AH x, x) = 2   H A+A x, x = 2

(λk ) =

= HA x, and ¯k λk − λ 2i (Ax, x) − (AH x, x) = 2i   H A−A x, x = 2i

(λk ) =

= SA x.

Eigenvalues

501

Since λk = (Ax, x), we have  n n  n  n      aij xi x ¯j   |aij ||xi ||¯ xj | |λk | =    i=1 j=1

 μA

i=1 j=1

n  n 

 |xi ||¯ xj | = μ A

i=1 j=1

n 

2 |xi |

.

i=1

By Inequality (6.7), we have |λk |  nμA .



A related theorem is next. Theorem 7.23 (Hirsch’s second theorem). If A ∈ Cn×n and HA is a real matrix, then  n(n − 1) . | (λk )|  μSA 2 is real, then (aij + aji ) = 0 for 1  i, j  n. Proof. If HA = A+A 2 z−¯ z Since (z) = 2i , it follows that H

¯ij − a ¯ji = 0, aij + aji − a so aij − a ¯ji = −(aji − a ¯ij ) for 1  i, j  n. Therefore, by the proof of Theorem 7.22, λ − λ   k ¯k  | (λk )| =   2i  (Ax, x) − (AH x, x)    =  2i    A − AH   x, x  = 2i ⎛ ⎞ n n   aij − a ¯ji ⎠   ⎝ xj x = ¯i  2i i=1

j=1

 a −a ¯ji   xj x ¯ i − xi x ¯j   ij  ·  2 i i |λ2 |  · · ·  |λn |. Define a sequence of vectors x0 , x1 , . . . as follows. The initial vector x0 ∈ Cn is any unit vector that is not orthogonal on any eigenvector corresponding to λ1 . Then xk+1 is the unit vector given by xk+1 =

Axk Axk 

for k ∈ N. The unit vector xk can be written as xk =

Ak x0 Ak x0 

Linear Algebra Tools for Data Mining (Second Edition)

504

for k  1. Indeed, in the base case (k = 1), this equality clearly holds. Suppose that it holds for k. We have k

xk+1

A x0 A A kx  Ak+1 x0 Axk 0 = = . = k A x0 Axk  Ak+1 x0  A A kx   0

We claim that limk→∞ xk = v 1 . By Theorem 7.1, the set {v 1 , . . . , v n } is linearly independent and this allows us to write x0 = a1 v 1 + · · · + an vn . By the assumption made concerning x0 (as not being orthogonal on any eigenvector corresponding to λ1 ), we have a1 = 0. Thus, Ax0 = a1 Av 1 + · · · + an Av n = a1 λ1 v 1 + · · · + an λn v n . A straightforward induction argument on k  1 shows that Ak x0 = a1 λk1 v 1 + · · · + an λkn v n   k  k  a a λ λn 2 2 n = a1 λk1 v1 + v2 + · · · + vn . a1 λ 1 a1 λ1 |λj | |λ1 |

< 1 for 2  j  n, it follows that       a2 λ2 k an λn k Ak x0 = lim v 1 + v2 + · · · + vn = v1 , lim k→∞ a1 λk k→∞ a1 λ 1 a1 λ1 1

Since

and the speed of convergence of implies

Ak x 0 a1 λk1

to v 1 is determined by

λ2 λ1 .

This

Ak x0  = v 1  = 1. k→∞ |a1 λk 1| lim

Therefore, Ak x0 Ak x0 |a1 λk1 | Ak x0 · = v1. = lim = lim k k→∞ Ak x0  k→∞ |a1 λk k→∞ |a1 λk 1 | A x0  1|

lim xk = lim

k→∞

Since the limit of the sequence x0 , . . . , xk , . . . is v 1 , we have a method to approximatively determine a unit vector corresponding to the dominating eigenvalue λ1 of A.

Eigenvalues

505

Example 7.8. Let us apply the iteration method to the symmetric matrix ⎛ ⎞ 3 2 1 ⎜ ⎟ A = ⎝2 4 2⎠, 1 2 6 beginning with the vector x0 = 13 . The following MATLAB code computes ten vectors of the sequence (x1 , x2 , . . .): x = ones(3,1); sequence(:,1) = x; for i=1:10 x = A*x/norm(A*x); sequence(:,i+1)=x; end sequence

and prints the sequence as 1.0000 0.4015 0.3844 0.3772 0.3740 0.3726 0.3719 0.3716 0.3715 0.3714 0.3714 1.0000 0.5789 0.5678 0.5621 0.5594 0.5581 0.5576 0.5573 0.5572 0.5571 0.5571 1.0000 0.7097 0.7279 0.7361 0.7398 0.7414 0.7422 0.7425 0.7427 0.7427 0.7428

The last vector in the sequence is very close to an eigenvector that corresponds to the eigenvalue λ = 8. A variant of the power method considered in [113] replaces the scaling factor Axk  by m(Axk ), where m(v) is the first component of v that has the largest absolute value. For example, if v = (−2, −4, 3, −4), then m(v) = v2 = −4. We have m(av) = a m(v) for every a ∈ R. The power method is limited to the dominant eigenvalue. However, if a close approximative value is known for an eigenvalue λi , the power method can be applied to compute an eigenvector associated to this eigenvalue. Observe that if |λi − a| < |λj − a| for every λj ∈ spec(A) − {λi }, then λi1−a is the dominant eigenvalue of the matrix B = (A − aIn )−1 . The justification of the algorithm follows from the next statement. Theorem 7.26. Let A ∈ Cn×n and let a be a number such that a ∈ spec(A). Then v is an eigenvector of A that corresponds to the eigenvalue λ if and only if v is an eigenvector of the matrix 1 ∈ spec(B). B = (A − aI)−1 that corresponds to the eigenvalue λ−a

Linear Algebra Tools for Data Mining (Second Edition)

506

Proof. If v is an eigenvector for A that corresponds to the eigenvalue λ, then Av = λv, so (A − aI)v = (λ − a)v. Since a ∈ spec(A), 1 v = (A − aI)−1 v, which proves the matrix A − aI is invertible, so λ−a 1 . that v is an eigenvector of B that corresponds to the eigenvalue λ−a The reverse implication is immediate.  By Theorem 7.26, if a is a good approximation of λ, the power method applied to B will yield an eigenvector v of A that corresponds to λ. This technique is known as the inverse power method. 7.9

The QR Iterative Algorithm

Another technique for computing eigenvalues is the QR iterative algorithm. The approach involves decomposing a matrix A ∈ Rn×n into a product A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix. If A ∈ Cn×n is a complex matrix, then we seek Q as a unitary matrix. A sequence of matrices (A0 , A1 , . . .) is defined, where A0 = A ∈ n×n R . Begin by computing a QR factorization of A0 as A0 = Q0 R0 . If Ai is factored as Ai = Qi Ri , define Ai+1 as Ai+1 = Ri Qi for i ∈ N. Let Pi = Q0 · · · Qi and Si = Ri · · · R0 for i ∈ N. Clearly, Pi is an orthonormal matrix and Si is an upper triangular matrix for every i ∈ N. We claim that Ai+1 = Pi APi .

(7.11)

Indeed, for i = 0, we have P0 AP0 = Q0 Q0 R0 Q0 = R0 Q0 = Q1 . Suppose that the equality holds for i. Then  APi+1 = Qi+1 Pi APi Qi+1 = Qi+1 Ai+1 Qi+1 Pi+1 = Qi+1 Qi+1 Ri+1 Qi+1 = Ai+2 ,

which concludes the proof of the claim. Therefore, each of the matrices Ai is orthonormally similar to A and spec(Ai ) = spec(A) for i ∈ N. Equality (7.11) can also be written as Pi Ai+1 = APi for i ∈ N.

(7.12)

Eigenvalues

507

Note that Pi Ri = Pi−1 Qi Ri = Pi−1 Ai = APi−1 , taking into account Equality (7.12). This implies that the subspace CPi ,q generated by the first q columns of Pi equals CAPi−1 ,q . If limn→∞ Pn exists and it equals P , since Pn−1 Qn = Pn , it follows that limn→∞ Qn = I. Therefore, since An = Qn Rn , we have limn→∞ An = limn→∞ Rn = R. The matrix R is an upper triangular matrix, as the limit of a sequence of upper triangular matrices, has non-negative entries on its main diagonal, which equal the eigenvalues of the matrix A. Therefore, a necessary condition for the algorithm to work is that all eigenvalues of A be real, non-negative numbers. Thus, the value of this initial version of the QR algorithm is limited. In practical terms, it is desirable to replace the initial matrix A with its upper Hessenberg equivalent H1 (which is a tridiagonal matrix when A is symmetric). As we saw in Supplement 72 of Chapter 6, if we factor H1 = Q1 R1 such that R1 is non-singular, Q1 and R1 Q1 are both upper Hessenberg matrices, so the next matrix H2 = R1 Q1 is again an upper Hessenberg matrix, and the process continues in the same manner as in the initial QR algorithm. Other improvements of the QR involve the double shifting of the matrices Hn by a multiple of In . Thus, instead of factoring Hk , one factors H − aIn , where a is an approximation of a real eigenvalue, Hk − aIn = Qk Rk . Then, the next matrix is Hk+1 = Rk Qk + aIn . Therefore, Hk+1 = Rk Qk + aIn = Qk (Hk − aIn )Qk + aIn = Qk Hk Qk for k  1. Thus, spec(Hk+1 ) = spec(Hk ). 7.10

MATLAB

Computations

To compute the eigenvalues of a matrix A ∈ Cn×n , we can use the command eig(A), which returns an n-dimensional vector containing the eigenvalues of A. The variant [V,D] = eig(A) produces a diagonal matrix D of eigenvalues and a full matrix V ∈ Cn×n whose columns are the corresponding eigenvectors so that AV = V D.

508

Linear Algebra Tools for Data Mining (Second Edition)

A faster computation of eigenvalues is obtained using the eigs. If A is a large, sparse, and square matrix, eigs(A) returns a vector that consists of the six largest magnitude eigenvalues. The function call [V,D] = eigs(A) returns a diagonal matrix D that contains six eigenvalues of A having the largest absolute values and a matrix V whose columns are the corresponding eigenvectors. If a parameter flag is added, [V,D,flag] = eigs(A), and flag is 0, then all the eigenvalues converge; otherwise not all converge. The call eigs(A,k) computes the k largest magnitude eigenvalues. Various options can be set by specifying additional parameters. For instance, if A is a symmetric matrix, then the call [V,D] = eigs(A,k,’SA’)

will compute the k smallest eigenvalues of A, and return a diagonal matrix D that contains the eigenvalues and a matrix V that contains the corresponding eigenvectors. Exercises and Supplements (1) Prove that a matrix A ∈ Cn×n is non-singular if and only if 0 ∈ spec(A). (2) Prove that spec(Jn,n ) = {0, n} and that algm(A, 0) = n − 1. (3) Determine the spectrum of the matrix A = aIn + bJn,n . (4) Let X, Y ∈ Cn×n be two non-singular matrices. Prove that λ is an eigenvalue of one of the matrices XY −1 , Y X −1 if and only if λ−1 is an eigenvalue of the other. (5) Let A ∈ Cn×n be a matrix and let a, b ∈ C such that a = 0. Prove that spec(aA + bIn ) = {aλ + b | λ ∈ A}. Solution: Let B = aA + bIn . For the characteristic polynomial pB , we can write pB (λ) = det(λIn − B) = det(λIn − (aA + bIn ))   λ−b n In − A , = det((λ − b)In − aA) = a det a which shows that spec(B) has the desired form. (6) Prove that the non-zero eigenvalues of a real skew-symmetric matrix are purely imaginary numbers.

Eigenvalues

509

Solution: Let λ be a non-zero eigenvalue of a real skewsymmetric matrix A ∈ Rn×n . Since Ax = λx for some x ∈ Rn , it follows that A2 x = λ2 x. By Supplement 115 of Chapter 6, we have λ2  0. (7) Let S be a k-dimensional invariant subspace for a matrix A ∈ Cn×n and let the columns of the matrix X ∈ Cn×k form a basis for S. Prove that there is a unique matrix C ∈ Ck×k such that AX = XC and that spec(C) ⊆ spec(A). Solution: Suppose that X = (x1 · · · xk ). Axi is a unique linear combination of the columns of X because Axi ∈ S, so there exists a unique vector ci such that Axi = Xci for 1  i  k. The matrix C = (c1 · · · ck ) is the matrix we are seeking. If λ ∈ spec(C), then Cu = λu for some vector u, which implies A(Xu) = λ(Xu). Thus, spec(C) ⊆ spec(A). (8) Let     sin α cos α sinh t cosh t and B = . A= − cos α sin α cosh t sinh t Prove that spec(A) = {sin α} and spec(B) = {sinh t + cosh t, sinh t − cosh t}. (9) Let   a11 a12 , A= a21 a22 be a matrix (that is, a12 = a21 ) in R2×2 . Prove that there exists a rotation matrix   cos θ sin θ R(θ) = , − sin θ cos θ such that R(−θ)AR(θ) is a diagonal matrix having the eigenvalue λ1 and λ2 as its diagonal elements. (10) Let A ∈ R2×2 be a matrix and let   cos α x= ∈ R2 . sin α As α varies between 0 and 2π, x rotates around the origin. Prove that Ax moves on an ellipse centered in the origin. Compute α such that the vectors x and Ax are collinear.

510

Linear Algebra Tools for Data Mining (Second Edition)

(11) Let A ∈ Cn×n be a matrix such that spec(A) = {λ1 , . . . , λn }. Prove that spec(A + cIn ) = {λ1 + c, . . . , λn + c}. (12) Let A ∈ Rn×n be a matrix. Prove that if A has a dominating eigenvalue λ1 , then λ1 is a real number. (13) Prove that if A ∈ Cn×n has the spectrum spec(A) = {λ1 , . . . , λn }, then spec(A − aIn ) = {λ1 − a, . . . , λn − a}. (14) Let C(A) be the Cayley transform of the matrix A ∈ Cn×n introduced in Exercise 65 of Chapter 3. (a) Prove that if λ ∈ spec(C(A)), then 1−λ 1+λ ∈ spec(A). (b) Prove that if C(A) exists, then −1 ∈ spec(C(A)). (15) Let A ∈ Cm×m and B ∈ Cn×n . Prove that trace(A ⊗ B) = trace(A)trace(B) and det(A ⊗ B) = det(A)n det(B)m . (16) Let A ∈ Cn×n be a Hermitian matrix such that spec(A) = {λ1 , . . . , λn } consists of n distinct eigenvalues and let v 1 , . . . , v n be n unit eigenvectors that correspond to the eigenvalues λ1 , . . . , λn , respectively. is a unit vector and ci = wH ui for 1  i  n, (a) If w ∈ Cn  prove that ni=1 c2i = 1. (b) Prove that |λi − w H Aw|  maxj {|λi − λj |w − ui 22 . (17) Prove that the matrix A ∈ Cn×n has a set of pairwise orthonormal eigenvectors {u1 , . . . , un } if and only if A is a normal matrix. (18) Prove that the set of simple matrices is dense in the metric space (Cn×n , d), where d(A, B) = |||A − B|||2 for A, B ∈ Cn×n . (19) Let A ∈ Cm×n and B ∈ Cn×m . Prove the following: (a) the matrices 

   On,m On,n AB Om,m and B 0n,m B BA

are similar; (b) if m  n, then BA ∈ Cn×n has the same eigenvalues as AB ∈ Cm×m together with n − m zero eigenvalues; (c) if m = n and at least one of the matrices A or B is nonsingular, then AB and BA are similar. Solution: For the first part, note that the matrix   Im A M= On,m In

Eigenvalues

511

has all its eigenvalues equal to 1 and, therefore, is non-singular. It is easy to see that   Im −A −1 M = On,m In and that M

−1



   AB Om,m On,m On,n , M= B 0n,m B BA

which proves the desired similarity. The last two parts are left to the reader. A commuting family of matrices is a collection of matrices M ⊆ Cn×n such that for each A, B ∈ M, we have AB = BA. An invariant subspace for a commuting family of matrices M ⊆ Cn×n is a subspace W of Cn that is invariant for each matrix in M. (20) Let A ∈ Cn×n and let S be an invariant subspace of A with dim(S)  1. Prove that S contains an eigenvector of A. Solution: Suppose that dim(S) = m  1 and let v 1 , . . . , v m ∈ be a set of vectors that form a basis of S. Let V = (v 1 v 2 · · · v m ) ∈ Cn×m be the matrix whose columns are these vectors. Note that rank(V ) = m so the nullspace of V consists of 0m . Then we have Av i ∈ S for every i, 1  i  m, because S is an invariant subspace of A. Thus, we have AV = V U for some U ∈ Cm×m . If x = 0m is an eigenvector of U , we have U x = λx, hence V U x = λV x, or AV x = λV x. Thus, V x is an eigenvector of A that is contained in S. (21) Prove that if M is a commuting family of matrices in Cn×n , then there exists x ∈ Cn such that x is an eigenvector of every A ∈ M. Cn

Solution: Note that Cn is an invariant subspace for each matrix in M. Thus, we can assume that there exists an invariant subspace W for F such that dim(W ) is minimal and positive. We claim that every w ∈ W − {0n } is an eigenvector of F. Suppose that this is not the case, so not every non-zero vector in W is an eigenvector of A. Since W is M-invariant, it is also A-invariant and, by Supplement 7.10, there exists an eigenvector of A in W such that x = 0n and Ax = λx.

512

Linear Algebra Tools for Data Mining (Second Edition)

Let W0 be the subspace W0 = {y ∈ W | Ay = λy}. Clearly, x ∈ W0 . Since we assumed that not every non-zero vector in W is an eigenvector of A, W0 = W , so dim(W0 ) < dim(W ). If B ∈ M, x ∈ W0 implies Bx ∈ W because W0 ⊆ W and W is M-invariant. Since M is a commuting family, we have A(Bx) = (AB)x = (BA)x = B(Ax) = B(λx) = λ(Bx), hence Bx ∈ W0 . Therefore, W0 is M-invariant, which results in a contradiction, since we assumed that dim(W ) is minimal. Certain spectral properties of Hermitian matrices can be extended to matrices that are self-adjoint with respect to an inner product f : Cn × Cn −→ R. These extensions are discussed next. (22) Prove that a matrix A ∈ Cn×n that is self-adjoint relative to an inner product f : Cn × Cn −→ R has real eigenvalues. Furthermore, if λ, μ ∈ spec(A) are two distinct eigenvalues of A and u, v are eigenvectors of A that correspond to λ and μ, respectively, then u and v are orthogonal relative to f , that is, f (u, v) = 0.  (23) Let A ∈ Cn×n be a matrix such that ni=1 aij = 1 for every j, 1  j  n and let x ∈ Cn be a eigenvector such that 1 x = 0. Prove that the eigenvalue that corresponds to x is 1. (24) Let u, v ∈ Cn . Prove that the matrix uv H + vuH has at most two non-zero eigenvalues.  H u n×2 and B = ∈ C2×n . By Solution: Let A = (v u) ∈ C vH Theorem 7.10, the set of non-zero eigenvalues of the matrices AB = vuH + uv H ∈ Cn×n and BA ∈ C2×2 are the same and algm(AB, λ) = algm(BA, λ) for each such eigenvalue.  (25) Let A ∈ Cn×n be a matrix such that ni=1 |aij |  1 for every j, 1  j  n. Prove that for every eigenvalue λ of A we have |λ|  1. (26) For any matrix A ∈ Cn×m , prove that there exists c ∈ C such that the matrices A + cIn and A − cIn are invertible. Conclude that every matrix is the sum of two invertible matrices. (27) Let A, B, and E be three matrices in Cn×n such that B = A + E and let μ ∈ spec(B) − spec(A). Prove that if Q ∈ Cn×n is an

Eigenvalues

513

invertible matrix, then Q−1 (A − μIn )Q  Q−1 EQ. Solution: Since μ ∈ spec(B) − spec(A), the matrix B − μIn is singular, while A − μIn is nonsingular. We have Q−1 (B − μIn )Q = Q−1 (A − μIn + E)Q = Q−1 (A − μIn )Q[In + Q−1 (A − μIn )−1 Q] × (Q−1 EQ). Since B − μIn is singular, the matrix [In + Q−1 (A − μIn )−1 Q](Q−1 EQ) must be singular, which implies [Q−1 (A − μIn )−1 Q](Q−1 EQ)  1, so Q−1 (A − μIn )−1 QQ−1 EQ  1. This implies the desired inequality. (28) Let v be an eigenvector of the matrix A ∈ Cn×n and let   0 aH , B= a A

(29)

(30)

(31)

(32)

Cn . Prove 

that there exists an eigenvector u of B x such that u = for some x ∈ C if and only if aH v = 0. v Let B ∈ Rn×n be a matrix and let A = diag(b11 , . . . , bnn ). If μ is an eigenvalue of B, prove that there exists a number bii such that |μ − bii |  E. Let A, B be two matrices in Cn×n . Prove that for every > 0, there exists δ > 0 such that if |aij − bij | < δ and λ ∈ spec(A), then there exists θ ∈ spec(B) such that |λ − θ| < . Let A ∈ Cm×m and B ∈ Cn×n be two matrices. Prove that trace(A ⊗ B) = trace(A)trace(B) and det(A ⊗ B) = det(A)n det(B)m . Let A ∈ Cn×n be a matrix whose eigenvalues are λ1 , . . . , λn . k k Define the matrix A[k] = A ⊗ A ⊗ · · · ⊗ A ∈ Cn ×n which is the k th power of A in the sense of the Kronecker product. Prove that the eigenvalues of A[k] have the form λi1 λi2 · · · λik , where 1  i1 , . . . , ik  n. where a ∈

514

Linear Algebra Tools for Data Mining (Second Edition)

(33) Let A ∈ Cn×n be a matrix such that spec(A) = {λ1 , . . . , λn }. Prove that det(In − A) =

n 

n  (1 − λi ) and det(In + A) = (1 + λi ).

i=1

i=1

(34) Let P ∈ Cn×n be a projection matrix. Prove that (a) if rank(P ) = r, then algm(P, 1) = r and algm(P, 0) = n − r; (b) we have rank(P ) = trace(P ). (35) Let A ∈ Rn×n be a matrix such that λ  0 for every λ ∈ spec(A). Prove that   trace(A) n . det(A)  n (36) Let A ∈ Cm×n and B = Cn×m . Prove that λn pAB (λ) = λm pBA (λ). (37) Let φ : Rn −→ R be a quadratic form defined by φ(x) = x Ax for x ∈ Rn , where A ∈ Rn×n is a symmetric matrix. Prove that if the unit vector x is an extreme for φ, then x is an eigenvector for A. Solution: Since x is a unit vector, we have x x − 1 = 0, so the Lagrangian of this problem is Λ(x, λ) = x Ax + λ(x x − 1). We have (∇Λ)(x) = 2Ax + 2λx = 0m , which shows that x must be a unit eigenvector of A. (38) Let Pφ be a permutation matrix, where φ ∈ PERMn is an ncyclic permutation. If z is a root of order n of 1 (that is, z n = 1), prove that (1, z, z 2 , . . . , z n−1 ) is an eigenvector of Pφ . Further, prove that the eigenvalues of the matrix Pφ + Pφ−1 have the form 2 cos 2πk n for 1  k  n. (39) Let A ∈ Rn×n be a matrix. A vector x is said to be subharmonic for A if x  0n and Ax  ax for some a ∈ R. Prove that if x is an eigenvector of A, then abs(x) is a subharmonic vector for abs(A).

Eigenvalues

515

 (40) Let A ∈ Rn×n be a matrix that can be written as A = ri=1 v i v i , where v i ∈ Rn − {0n } and the vectors v 1 , . . . , v r are pairwise orthogonal. Prove that (a) A is a symmetric matrix of rank r; (b) the  eigenpairs of A that involve non-zero eigenvalues are ( v i , v i ) for 1  i  r. n×n be a Hermitian matrix and let r (41) Let A = n ∈ C max{ i=1 |aij | | 1  i  n}. Prove that spec(A) ⊆ [−r, r]. 2 2 (42) Let Knn ∈ Rn ×n be the commutation matrix introduced in Section 3.17. Prove that spec(Knn ) = {−1, 1}, algm(Knn , 1) = n(n+1) , and algm(Knn , −1) = n(n−1) . Also, show 2 2 that det(Knn ) = (−1)

n(n−1) 2

.

Solution: Since the real matrix Knn is orthogonal and symmetric, spec(Knn ) = {−1, 1} by Example 7.7. Therefore, algm(Knn , 1) + algm(Knn , −1) = n2 and det(Knn ) = (−1)algm(Knn ,−1) . On the other hand, by Part (e) of Supplement 96, trace(Knn ) = n = algm(Knn , 1) − algm(Knn , −1) = n2 − 2algm(A, −1). Thereand algm(Knn , 1) = n(n+1) . fore, algm(A, −1) = n(n−1) 2 2 Since det(Knn ) equals the product of eigenvalues, the last equality follows immediately. Bibliographical Comments Lanczos decomposition presented in Supplement 40 was obtained from [98]; the elementary solution given was developed in [148]. The result discussed in Supplement 7.10 was obtained from [8].

This page intentionally left blank

Chapter 8

Similarity and Spectra

8.1

Introduction

This chapter presents the links between spectral properties of matrices and conditions that ensure the existence of diagonal matrices which are equivalent to matrices that enjoy these properties. Further spectral properties are discussed such as the relationships between the geometric and algebraic multiplicities of eigenvalues. The standard Jordan form of matrices is given in the context of λ-matrices and the link between matrix norms and eigenvalues is also examined. 8.2

Diagonalizable Matrices

Diagonalizable matrices were defined in Section 3.12 as square matrices that are similar to diagonal matrices. In the context of spectral theory, we have the following characterization of diagonalizable matrices. Theorem 8.1. A matrix A ∈ Cn×n is diagonalizable if and only if there exists a linearly independent set {v 1 , . . . , v n } of n eigenvectors of A. Proof. Let A ∈ Cn×n be such that there exists a set {v 1 , . . . , v n } of n eigenvectors of A that is linearly independent and let P be the

517

518

Linear Algebra Tools for Data Mining (Second Edition)

matrix (v 1 v 2 · · · v n ) that is clearly invertible. We have P −1 AP = P −1 (Av 1 Av 2 ⎛ λ1 0 ⎜ 0 λ2 ⎜ = P −1 P ⎜ .. .. ⎝. . 0

· · · Av n ) = P −1 (λ1 v 1 λ2 v 2 · · · λn v n ) ⎞ ⎛ ⎞ ··· 0 λ1 0 · · · 0 ⎜ ⎟ ··· 0 ⎟ ⎟ ⎜ 0 λ2 · · · 0 ⎟ . ⎟=⎜ . . . ⎟. · · · .. ⎠ ⎝ .. .. · · · .. ⎠

0 · · · λn

Therefore, we have A = P DP −1 , where ⎛ λ1 0 · · · 0 ⎜ 0 λ2 · · · 0 ⎜ D = ⎜ .. .. . ⎝ . . · · · ..

0

0 · · · λn

⎞ ⎟ ⎟ ⎟, ⎠

0 0 · · · λn

so A ∼ D. Conversely, suppose that A is diagonalizable, so AP = P D, where D is a diagonal matrix and P is an invertible matrix, and let v 1 , . . . , v n be the columns of the matrix P . We have Av i = dii v i for 1  i  n, so each v i is an eigenvector of A. Since P is invertible,  its columns are linear independent. Theorem 8.1 can be restated by saying that a matrix is diagonalizable if and only if it is non-defective. Corollary 8.1. If A ∈ Cn×n is diagonalizable, then the columns of any matrix P such that D = P −1 AP is a diagonal matrix are eigenvectors of A. Furthermore, the diagonal entries of D are the eigenvalues that correspond to the columns of P . This statement follows from the proof of Theorem 8.1.   Corollary 8.2. If A ∈ Cn×n is a matrix such that {geomm(A, λ) | λ ∈ spec(A)} = n, then A is diagonalizable.

Proof.

Proof. Suppose that spec(A) = {λ1 , . . . , λk } and let Bk = k1 kpk {v p , . . . , v } be a basis pof the invariant spaces SA,λk , where k=1 pk = n. Then B = k=1 Bk is a linearly independent set of  eigenvectors, so A is diagonalizable. Corollary 8.3. If the eigenvalues of the matrix A ∈ Cn×n are distinct, then A is diagonalizable.

Similarity and Spectra

519

Proof. By Theorem 7.1, the set of eigenvectors of A is linearly independent. The statement follows immediately from Theorem 8.1. 

Theorem 8.2. Let A ∈ Cn×n be a block diagonal matrix, ⎞ ⎛ A11 O · · · O ⎜ O A ··· O ⎟ 22 ⎟ ⎜ ⎟ A=⎜ .. .. ⎟. ⎜ .. . ··· . ⎠ ⎝ . O O · · · Amm A is diagonalizable if and only if every matrix Aii is diagonalizable for 1  i  m. Proof. Suppose that A is a block diagonal matrix which  is diagonalizable. Furthermore, suppose that Aii ∈ Cni ×ni and m i=1 ni = n. There exists an invertible matrix P ∈ Cn×n such that P −1 AP is a diagonal matrix D = diag(λ1 , . . . , λn ). Let p1 , . . . , pn be the columns of P , which are eigenvectors of A. Each vector pi is divided into m blocks pji with 1  j  m, where pji ∈ Cnj . Thus, P can be written as ⎞ ⎛ 1 1 p1 p2 · · · p1n ⎜ p2 p2 · · · p2 ⎟ ⎜ 1 2 n⎟ ⎟ P =⎜ .. .. ⎟. ⎜ .. ⎝ . . ··· . ⎠ m m p1 pm 2 · · · pn The equality Api = λi pi can be expressed as ⎛

A11 O ⎜O A 22 ⎜ ⎜ .. ⎜ .. ⎝ . . O O

⎞⎛ 1⎞ ⎛ 1⎞ pi O pi 2⎟ ⎟ ⎜ ⎜ 2⎟ p O ⎟⎜ i ⎟ ⎜ pi ⎟ ⎟⎜ ⎟ ⎜ ⎟ .. ⎟ ⎜ .. ⎟ = λi ⎜ .. ⎟, ⎝ . ⎠ ··· . ⎠⎝ . ⎠ m pi pm · · · Amm i ··· ···

which shows that Ajj pji = λi pji for 1  j  m. Let M j = (pj1 pj2 · · · pjn ) ∈ Cnj ×n . We claim that rank(M j ) = nj . Indeed, if rank(M j ) were less than nj , we would have fewer that n independent rows M j for 1  j  m. This, however, would imply that the

520

Linear Algebra Tools for Data Mining (Second Edition)

rank of P is less than n, which contradicts the invertibility of P . Since there are nj linearly independent eigenvectors of Ajj , it follows that each block Ajj is diagonalizable. Conversely, suppose that each Ajj is diagonalizable, that is, there exists an invertible matrix Qj such that Q−1 j Ajj Qj is a diagonal matrix. Then, it is immediate to verify that the block diagonal matrix ⎞ ⎛ Q1 O · · · O ⎜ O Q ··· O ⎟ 2 ⎟ ⎜ ⎟ Q=⎜ .. .. ⎟ ⎜ .. . ··· . ⎠ ⎝ . O O · · · Qm is invertible and Q−1 AQ is a diagonal matrix.



Definition 8.1. Two matrices A, B ∈ Cn×n are simultaneously diagonalizable if there exists a matrix T ∈ Cn×n such that A = T DT −1 and B = T ET −1 , where D and E are diagonal matrices in Cn×n . Lemma 8.1. If C ∈ Cn×n is a diagonalizable matrix and T ∈ Cn×n is an invertible matrix, then T CT −1 is also diagonalizable. Proof. Since C is diagonalizable, there exists a diagonal matrix D and an invertible matrix S such that SCS −1 = D, so C = S −1 DS. Then, T CT −1 = TS−1 DST −1 = (TS−1 )D(T S −1 )−1 . Therefore, (TS−1 )−1 (T CT −1 )(TS−1 ) = D, which shows that T CT −1 is diagonalizable.



Theorem 8.3. Let A, B ∈ Cn×n be two diagonalizable matrices. Then A and B are simultaneously diagonalizable if and only if AB = BA. Proof. Suppose that A and B are simultaneously diagonalizable, so A = TDT−1 and B = TET−1 , where D and E are diagonal matrices. Then AB = TDT−1 TET−1 = TDET−1 and BA = TET−1 TDT−1 = TEDT−1 . Since any two diagonal matrices commute, we have AB = BA.

Similarity and Spectra

Conversely, suppose that AB = BA. there exists a matrix T such that ⎛ λ1 Ik1 Ok1 ,k2 ⎜O ⎜ k2 ,k1 λ2 Ik2 T AT −1 = ⎜ .. ⎜ .. ⎝ . . Okm ,k1 Okm ,k2

521

Since A is diagonalizable, ⎞ · · · Ok1 ,km · · · Ok2 ,km ⎟ ⎟ ⎟ .. ⎟, ··· . ⎠ · · · λm Ikm

where ki is the multiplicity of λi for 1  i  m. It is easy to see that if A and B commute, then TAT−1 and TBT−1 also commute. If we write TBT−1 as a block matrix TBT−1 = (Bpq ), where Bpq ∈ Ckp ×kq for 1  p, q  m, then the fact that the matrices TAT−1 and TBT−1 commute translates into the equality ⎛



⎛ λ1 B11 λ2 B12 ⎜ ⎟ λ2 B22 · · · λ2 B2m ⎟ ⎜ λ1 B21 λ2 B22 ⎜ λ2 B21 ⎜ ⎟ ⎜ ⎜ ⎟=⎜ . .. .. .. .. ⎜ ⎟ ⎝ .. . . . ··· . ⎝ ⎠ λ1 Bm1 λ2 Bm1 λm Bm1 λm Bm1 · · · λm Bmm λ1 B11

λ1 B12

···

λ1 B1m

Thus, if λi = λj we have Bij = Oki ,kj , which TBT−1 is a block diagonal matrix ⎛ B11 Ok1 ,k2 · · · ⎜O B22 · · · ⎜ k2 ,k1 TBT−1 = ⎜ . .. ⎜ . ⎝ . . ··· Okm ,k1 Okm ,k2 · · ·

⎞ λm B1m λm B2m ⎟ ⎟ ⎟. .. ⎠ ··· . · · · λm Bmm ··· ···

shows that the matrix ⎞ Ok1 ,km Ok2 ,km ⎟ ⎟ ⎟ .. ⎟. . ⎠ Bmm

Since B is diagonalizable, it follows that TBT−1 is diagonalizable (by Lemma 8.1), so each matrix Bjj is diagonalizable. Let Wi be a matrix such that Wi−1 Bii Wi is diagonal and let W be the block diagonal matrix ⎞ ⎛ W1 O · · · O ⎜ O W ··· O ⎟ 2 ⎟ ⎜ ⎟ W =⎜ .. .. ⎟. ⎜ .. . ··· . ⎠ ⎝ . O O · · · Wm Then, both W −1 TAT−1 W and W −1 TBT−1 W are diagonal matrices, and this implies that A and B are simultaneously diagonalizable. 

522

Linear Algebra Tools for Data Mining (Second Edition)

It is interesting to observe that if A ∈ Cn×n is a diagonalizable matrix, then there exists a matrix B ∈ Cn×n such that B 2 = A. Indeed, suppose that A ∼ D, where D is a diagonal matrix and, therefore, A = PDP−1 , where P is an invertible matrix ∈ Cn×n given by and D = diag(d √ 1 , . . .√, dn ). Consider the matrix E 2 E = diag( d1 , . . . , dn ) for which we have E = D, and define B = P EP −1 . Then, we have B 2 = PEP−1 PEP−1 = PE2 P −1 = PDP−1 = A. We refer to B as a square root of A. Note that for the existence of the square root of a diagonalizable matrix it is essential that we deal with matrices with complex entries. Under certain conditions, square roots exist also for matrices with real entries, as we shall see in Section 8.12. 8.3

Matrix Similarity and Spectra

Theorem 8.4. If A, B ∈ Cn×n and A ∼ B, then the two matrices have the same characteristic polynomials and, therefore, spec(A) = spec(B). Proof. Since A ∼ B, there exists an invertible matrix X such that A = XBX−1 . Then, the characteristic polynomial det(A − λIn ) can be rewritten as det(A − λIn ) = det(XBX−1 − λXIn X −1 ) = det(X(B − λIn )X −1 ) = det(X) det(B − λIn ) det(X −1 ) = det(B − λIn ), which implies spec(A) = spec(B).



Theorem 8.5. If A, B ∈ Cn×n and A ∼ B, then trace(A) = trace(B). Proof. Since the two matrices are similar, they have the same characteristic polynomials, so both trace(A) and trace(B) equal −c1 ,  where c1 is the coefficient of λn−1 in both pA (λ) and pB (λ).

Similarity and Spectra

523

Theorem 8.6. If A ∼u B, where A, B ∈ Cn×n , then the Frobenius norm of these matrices are equal, that is, AF = BF . Proof. We have shown in Theorem 3.43 that AH A ∼u B H B. Therefore, these matrices have the same characteristic polynomials which allows us to infer that trace(AH A) ∼u trace(B H B). By the analogue of the Equality (6.11) for complex matrices, we obtain the desired conclusion.  Theorem 8.7. Let A ∈ Cn×n and B ∈ Ck×k be two matrices. If there exists a matrix U ∈ Cn×k having an orthonormal set of columns such that AU = UB, then there exists V ∈ Cn×(n−k) such that (UV) ∈ Cn×n is a unitary matrix and

B (U H AV ) H . (UV) A(UV) = O (V H AV ) Proof. Since U has an orthonormal set of columns, by Corollary 6.27, there exists V ∈ Cn×(n−k) such that (UV) is a unitary matrix. We have U H AU = U H U B = Ik B = B, V H AU = V H U B = OB = O, which allows us to write



UH (AU AV ) VH



H B U H AV U AU U H AV = . = V H AU V H AV O V H AV

(UV)H A(UV) = (UV)H (AU AV ) =



Corollary 8.4. Let A ∈ Cn×n be a Hermitian matrix and B ∈ Ck×k be a matrix. If there exists a matrix U ∈ Cn×k having an orthonormal set of columns such that AU = U B, then there exists V ∈ Cn×(n−k) such that (UV) is a unitary matrix and

B O H . (UV) A(UV) = O V H AV

524

Linear Algebra Tools for Data Mining (Second Edition)

Proof. Since A is Hermitian, we have U H AV = U H AH V = (V H AU )H = O, which, by Theorem 8.7 produces the desired result. 

Corollary 8.5. Let A ∈ Cn×n , λ be an eigenvalue of A, and let u be an eigenvector of A with u = 1 that corresponds to λ. There exists V ∈ Cn×(n−1) such that (u V ) ∈ Cn×n is a unitary matrix and

λ (uH AV ) H . (u V ) A(u V ) = O (V H AV ) If A is a Hermitian matrix, then (u V ) A(u V ) = H

Proof.

λ O . O (V H AV )



This statement follows from Theorem 8.7 by taking k = 1.



Theorem 8.8 (Schur’s triangularization theorem). Let A ∈ Cn×n be a square matrix. There exists a unitary matrix U ∈ Cn×n and an upper-triangular matrix T ∈ Cn×n such that A = U H T U and the diagonal elements of T are the eigenvalues of A. Moreover, each eigenvalue λ occurs in the sequence of diagonal values a number of algm(A, λ) times. Proof. The argument is by induction on n  1. The base case, n = 1, is trivial. So, suppose that the statement is true for matrices in C(n−1)×(n−1) . Let λ1 ∈ C be an eigenvalue of A, and let u be an eigenvector that corresponds to this eigenvalue. By Corollary 8.5, we have

λ1 uH AV H , U AU = O V H AV where U = (u|V ) is an unitary matrix. By the inductive hypothesis, since V H AV ∈ C(n−1)×(n−1) , there exists a unitary matrix S ∈ C(n−1)×(n−1) such that V H AV = S H W S, where W is an upper-triangular matrix. Then we have



λ1 λ1 O uH V S H W S H = , U AU = 0 S HW S 0 W

Similarity and Spectra

525

which shows that an upper triangular matrix T that is unitarily similar to A can be defined as

λ1 O . T = 0 W Since T ∼u A, it follows that the two matrices have the same characteristic polynomials and, therefore, the same spectra and algebraic  multiplicities for each eigenvalue. Corollary 8.6. If A ∈ Rn×n is a matrix such that spec(A) = {0}, then A is nilpotent. Proof. By Schur’s Triangularization Theorem, A is unitarily similar to a strictly upper triangular matrix, A = U H T U , so An = U H T n U . By Supplement 30 of Chapter 3, we have T n = O, so  An = O. Example 8.1. Let A ∈ R3×3 be the symmetric matrix ⎛ ⎞ 14 −10 −2 ⎜ ⎟ A = ⎝−10 −5 5 ⎠ −2 5 11 whose characteristic polynomial is pA (λ) = λ3 − 20λ2 − 100λ + 2000. The eigenvalues of A are λ1 = 20, λ2 = 10, and λ3 = −10. It is easy to see that ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ −2 1 −2 ⎝ ⎠ ⎝ ⎠ ⎝ 1 , v 2 = 0 , v 3 = −5⎠ v1 = 1 2 1 are eigenvectors that correspond to the eigenvalues λ1 , λ2 , and λ3 , respectively. The corresponding unit vectors are ⎛ 1 ⎞ ⎛ 2 ⎞ ⎛ 2 ⎞ √ −√ − √6 5 ⎜ ⎟ ⎜ 530 ⎟ ⎜ √1 ⎟ ⎜ √ ⎟ 0⎟ u1 = ⎝ 6 ⎠, u2 = ⎜ ⎝ ⎠, u3 = ⎝− 30 ⎠. √1 6

√2 5

√1 30

For U = (u1 u2 u3 ), we have U  AU = U  (20u1 10u2 − 10u3 ) = diag(20, 10, −10).

526

Linear Algebra Tools for Data Mining (Second Edition)

Corollary 8.7. Let A ∈ Cn×n and let f be a polynomial. If spec(A) = {λ1 , . . . , λn } (including multiplicities), then spec(f (A)) = {f (λ1 ), . . . , f (λn )}. Proof. By Schur’s Triangularization Theorem, there exists a unitary matrix U ∈ Cn×n and an upper triangular matrix T ∈ Cn×n such that A = U H T U and the diagonal elements of T are the eigenvalues of A, λ1 , . . . , λn . Therefore, U f (A)U −1 = f (T ), and by Theorem 3.45, the diagonal elements of f (T ) are f (λ1 ), . . . , f (λm ). Since f (A) ∼ f (T ), we obtain the desired conclusion because two similar matrices have the same eigenvalues with the same algebraic  multiplicities. Let r, s ∈ N be two numbers such that 1  r, s  n, r = s, and let Ers ∈ Cn×n be the matrix whose unique non-zero entry is ers = 1. In other words, Ers is defined by (Ers )ij =

1 if i = r and j = s, 0 otherwise,

for 1  i, j  n. Note that Ers ep =

er

if p = s,

0

otherwise,

es

if q = r,

0

otherwise,

for 1  p  n, and eq Ers

=

for 1  q  n. Lemma 8.2. The matrix In + cErs is nonsingular, and (In + cErs )−1 = In − cErs . Furthermore, for every A ∈ Cn×n , if B = (In − cErs )A(In + cErs ), then B = A − cErs A + cAErs − c2 asr Ers . Proof. Since r = s, we have (Ers )2 = O. Therefore, (In +cErs )(In − cErs ) = In , so (In +cErs )−1 = In −cErs . Also, Ers AErs is the matrix

Similarity and Spectra

527

that has a unique non-zero element, that is, Ers AErs = asr Ers . Since B = (In − cErs )A(In + cErs ), we have B = A − cErs A + cAErs − c2 Ers AErs = A − cErs A + cAErs − c2 asr Ers .



Lemma 8.3. If A ∈ Cn×n is an upper triangular matrix, 1  r < s  n, and B = (In − cErs )A(In + cErs ), then bij = aij only if 1  i  r and j = s, or if i = r and s  j  n (see Figure 8.1, where the elements of B that differ from the corresponding elements of A are represented by thick lines) and brs = ars + c(arr − ass ). Proof. Since A is an upper triangular matrix and r < s, we have asr = 0, so by Lemma 8.2, B = A − cErs A + cAErs . Since A is an upper triangular matrix, we ⎛ 0 0 ······ 0 ⎜. . .. ⎜ .. .. ··· . ⎜ ⎜ Ers A = ⎜0 0 · · · · · · ass ⎜ ⎜ .. .. .. ⎝. . ··· . 0 0 ······ 0

have ⎞ 0 . ⎟ · · · .. ⎟ ⎟ ⎟ · · · asn ⎟, ⎟ .. ⎟ ··· . ⎠ ··· 0 ···

r

B Fig. 8.1

s

Elements of B that differ from the corresponding elements of A.

528

Linear Algebra Tools for Data Mining (Second Edition)

where ass , . . . , asn occur in the r th row of ⎛ 0 0 · · · a1r ⎜. . ⎜ .. .. · · · ... ⎜ ⎜ ⎜0 0 · · · arr AErs = ⎜ ⎜0 0 · · · 0 ⎜ ⎜. . . ⎜. . ⎝ . . · · · .. 0 0 ···

0

Ers A, and ⎞ ··· 0 .⎟ · · · .. ⎟ ⎟ ⎟ · · · 0⎟ ⎟, · · · 0⎟ ⎟ .. ⎟ ⎟ · · · .⎠ ··· 0

where a1r , . . . , arr occur in the sth column. Thus, the elements of B that differ from those of A are located on the r th row at the right of the sth column and on the sth row above the r th row. Also,  brs = ars + c(arr − ass ). Lemma 8.2 implies that if arr = ass , by taking c = arr /(ass − arr ) we have brs = 0. Theorem 8.9. Let A ∈ Cn×n be a square matrix having k distinct eigenvalues λ1 , . . . , λk such that algm(λi ) = ni for 1  i  k and k i=1 ni = n. There exists a block-triangular matrix T ∈ Cn×n such that A ∼u T , ⎞ ⎛ T1 O · · · O ⎜O T · · · O ⎟ 2 ⎟ ⎜ ⎟, T =⎜ . . . ⎟ ⎜. . ⎝ . . · · · .. ⎠ O O · · · Tk where each Ti ∈ Cni ×ni is an upper triangular matrix having all its diagonal elements equal to λi for 1  i  k. Proof. By Schur’s Triangularization Theorem, there exists an upper-triangular matrix V ∈ Cn×n such that A ∼ V and the diagonal elements of V are the eigenvalues of A such that each eigenvalue λi occurs in the sequence of diagonal values a number of algm(A, λi ) times. By Lemmas 8.2 and 8.3, we can construct a matrix T , similar to V and, therefore, similar to A such that T is a block triangular matrix.

Similarity and Spectra

529

To build the matrix T , we construct a sequence of matrices starting from V by zeroing elements situated above the main diagonal, outside the triangular blocks that correspond to each eigenvalue. This process must be conducted such that the creation of a new zero component will not disturb the already zeroed elements. This can be easily done by numbering the positions to be zeroes in such a manner that all numbers located at the right and above the position numbered k are  greater than k, as we show in Example 8.2. Example 8.2. Let A ∈ C8×8 be a matrix whose distinct eigenvalues are λ1 , λ2 , and λ3 such that algm(A, λ1 ) = algm(A, λ2 ) = 3 and algm(A, λ3 ) = 2. In Figure 8.2, we show the order of zeroing the desired elements of the upper triangular matrix in order to construct a block triangular matrix. Theorem 8.8 shows that a Schur factorization A = U H T U exists for every complex matrix A. The next statement presents a property of real matrices that admit real Schur factorizations. Theorem 8.10. Let A ∈ Rn×n be a real square matrix. If there exists a orthogonal matrix U ∈ Rn×n and an upper triangular matrix T ∈ Rn×n such that A = U −1 T U, that is, a real Schur factorization, then the eigenvalues of A are real numbers. Proof. If the above factorization exists, we have T = UAU−1 . Thus, the eigenvalues of A are the diagonal components of T and,  therefore, they are real numbers. λ1

9 12 15 18 21 λ1

8 11 14 17 20 λ1 7 10 13 16 19 λ2 λ2

3

6

2

5

λ2 1

4

λ3 λ3 Fig. 8.2

Order of nullification of elements of a block triangular matrix.

530

Linear Algebra Tools for Data Mining (Second Edition)

Theorem 8.11. Let A ∈ Rn×n be a matrix having n real eigenvalues. Then A has a real Schur factorization, that is, A = U T U −1 , where U ∈ Rn×n is an orthogonal matrix and T ∈ Rn×n is an upper triangular matrix that has the eigenvalues of A on its diagonal. Proof. The argument is a paraphrase of the argument of Theorem 8.4, taking into account that the eigenvalues are real  numbers. Corollary 8.8. If A ∈ Rn×n and A is a symmetric matrix, then A is orthogonally diagonalizable. Proof. By Corollary 7.5, A has real eigenvalues. Theorem 8.11 implies the existence of a real Schur factorization of A, A = U T U −1 , where U is an orthogonal matrix and T is both upper triangular and  symmetric and, therefore, a diagonal matrix. Corollary 8.9. If A ∈ Rn×n is a symmetric matrix, then AF =  n 2 i=1 λi , where spec(A) = {λ1 , . . . , λn }. Proof. By Corollary 8.8, A ∼u D, where D = diag(λ  1 , . . . , λn ).  Therefore, by Theorem 8.6 we have AF = DF = ni=1 λ2i . Theorem 8.12. If A ∈ Rn×n is a symmetric matrix, then geomm(A, λ) = algm(A, λ) for every eigenvalue λ ∈ spec(A). Proof. By Corollary 8.8, A is orthonormally diagonalizable, that is, there exists an orthogonal matrix P such that A = P DP  , where D is a diagonal matrix. Therefore, det(λIn − A) = det(λIn − D). In other words, the matrices λIn − A and λIn − D have the same characteristic polynomial. Thus, if λ ∈ spec(A) and algm(A, λ) = k, λ occurs in k diagonal positions of D, p1 , . . . , pk . Thus, ep1 , . . . , epk are k linearly independent eigenvectors in the invariant subspace SD,λ , which implies that the vectors P ep1 , . . . , P epk are k linearly indepen dent eigenvectors in S(A, λ). Next, we show that unitary diagonalizability is a characteristic property of normal matrices. Theorem 8.13 (Spectral theorem for normal matrices). A matrix A ∈ Cn×n is normal if and only if there exists a unitary matrix U and a diagonal matrix D such that A = U H DU, the columns of U H are unit eigenvectors, and the diagonal elements of D are the eigenvalues of A that correspond to these eigenvectors.

Similarity and Spectra

531

Proof. Suppose that A is a normal matrix. By Schur’s triangularization theorem, there exists a unitary matrix U ∈ Cn×n and an upper triangular matrix T ∈ Cn×n such that A = U −1 T U . Thus, T = UAU−1 = UAUH and T H = UAH U H . Therefore, T H T = UAH U H U AU H = UAH AU H (because U is unitary) = UAAH U H (because A is normal) = UAUH U AH U H = T T H, so T is a normal matrix. By Theorem 3.19, T is a diagonal matrix, so D’s role is played by T . We leave the proof of the converse implication to the reader.  Corollary 8.10. Let A ∈ Cn×n be a normal matrix and let A = U H DU be the factorization whose existence was established in Theorem 8.13. If U H = (u1 , . . . , un ) and D = diag(λ1 , . . . , λn ), where λ1 , . . . , λn are the eigenvalues of A, then each column vector ui is an eigenvector of A that corresponds to the eigenvalue λi for 1  i  n. Proof.

We have ⎛

⎞ λ1 0 · · · 0 ⎛ H ⎞ ⎜ 0 λ · · · 0 ⎟ u1 2 ⎜ ⎟⎜ . ⎟ H ⎟⎜ ⎟ A = U DU = (u1 · · · un ) ⎜ .. ⎟ ⎝ .. ⎠ ⎜ .. .. ⎝ . . ··· . ⎠ uHn 0 0 · · · λn = λ1 u1 uH1 + · · · + λn un uHn .

(8.1)

Therefore, Aui = λi ui for 1  i  n, which proves the statement.  Equality (8.1) is referred to as spectral decomposition of the normal matrix A.

532

Linear Algebra Tools for Data Mining (Second Edition)

Theorem 8.14 (Spectral theorem for hermitian matrices). If the matrix A ∈ Cn×n is Hermitian or skew-Hermitian, A can be written as A = U H DU , where U is a unitary matrix and D is a diagonal matrix having the eigenvalues of A as its diagonal elements. Proof. This statement follows from Theorem 8.13 because any  Hermitian or skew-Hermitian matrix is normal. Corollary 8.11. The rank of a Hermitian matrix is equal to the number of non-zero eigenvalues. Proof. The statement of the corollary obviously holds for any diagonal matrix. If A is a Hermitian matrix, by Theorems 3.44 and 8.14, we have rank(A) = rank(D), where D is a diagonal matrix having the eigenvalues of A as its diagonal elements. This implies the statement  of the corollary. Let A ∈ Cn×n be a Hermitian matrix of rank p. By Theorem 8.14, A can be written as A = U H DU , where U is a unitary matrix and D = diag(λ1 , . . . , λp , 0, . . . , 0), where λ1 , . . . , λp are the non-zero eigenvalues of A and λ1  · · ·  λp > 0. Thus, if W ∈ Cp×n is a matrix that consists of the first p rows of U , we can write ⎛ ⎞ λ1 0 · · · 0 ⎜ 0 λ ··· 0 ⎟ 2 ⎜ ⎟ ⎜ ⎟W A=W ⎜ . . . . .. .. ⎟ ⎝ .. .. ⎠ 0 0 · · · λp ⎛ ⎞ λ1 0 · · · 0 ⎛ ⎞ ⎜ 0 λ · · · 0 ⎟ u1 2 ⎜ ⎟⎜ . ⎟   ⎜ ⎟⎜ . ⎟ = (u1 . . . up ) ⎜ . . . . ⎝ . ⎠ .. .. ⎟ ⎝ .. .. ⎠ up 0 0 · · · λp = λ1 u1 u1 + · · · + λp up up . Note that if A is not Hermitian, rank(A) may differ from the number of non-zero eigenvalues. For example, the matrix

0 1 A= 0 0 has no non-zero eigenvalues. However, its rank is 1.

Similarity and Spectra

533

The spectral decomposition (8.1) of Hermitian matrices, A = λ1 u1 uH1 + · · · + λn un uHn , allows us to extend functions of the form f : R −→ R to Hermitian matrices. Since the eigenvalues of a Hermitian matrix are real numbers, it makes sense to define f (A) as f (A) = f (λ1 )u1 uH1 + · · · + f (λn )un uHn . In particular, if A is positive semi-definite, we have λi  0 for 1  i  n and we can define

√ A = λ1 u1 uH1 + · · · + λn un uHn . Theorem 8.15. Let A, B ∈ Cn×n be two Hermitian matrices, where A is positive definite. There exists a matrix P such that P H AP = In and P H BP is a diagonal matrix. Proof. By Cholesky’s Decomposition Theorem (Theorem 6.57), A can be factored as A = RH R, where R is an invertible matrix, so (RH )−1 AR−1 = In . The matrix C = (RH )−1 BR−1 is Hermitian and, therefore, it is unitarily diagonalizable, that is, there exists a unitary matrix U such that U H CU is diagonal. Since U H In U = In , we have P = R−1 U .  Corollary 8.12. Let A, B ∈ Cn×n . If A  B  O, then A−1 ≺ B −1 . Proof. By Theorem 8.15, there exists a matrix P such that A = P H P and B = P H DP , where D is a diagonal matrix, D = diag(d1 , . . . , dn ). Thus, xH Ax > xH Bx is equivalent to y H y > y H Dy, where y = P x, so A > B if and only if di < 1 for 1  i  n. There−1 fore, A−1 = QH Q and B −1 = QH EQ, where E = diag(d−1 1 , . . . , dn ) −1 −1 −1  and di > 1 for 1  i  n, which implies A < B . Definition 8.2. Let A ∈ Cn×n be a Hermitian matrix. The triple I(A) = (n+ (A), n− (A), n0 (A)), where n+ (A) is the number of positive eigenvalues, n− (A) is the number of negative eigenvalues, and n0 (A) is the number of zero eigenvalues, is the inertia of the matrix A. The number sig(A) = n+ (A) − n− (A) is the signature of A.

534

Linear Algebra Tools for Data Mining (Second Edition)

Example 8.3. If A = diag(4, −1, 0, 0, 1), then I(A) = (2, 1, 2) and sig(A) = 1. Let A ∈ Cn×n be a Hermitian matrix. By Theorem 8.14, A can be written as A = U H DU , where U is a unitary matrix and D = diag(λ1 , . . . , λn ) is a diagonal matrix having the eigenvalues of A (which are real numbers) as its diagonal elements. Without loss of generality, we may assume that the positive eigenvalues of A are λ1 , . . . , λn+ , followed by the negative values λn+ +1 , . . . , λn+ +n− , and the zero eigenvalues λn+ +n− +1 , . . . , λn . Let θj be the numbers defined by ⎧ ⎪ λ if 1  j  n+ , ⎪ ⎨ j θj = −λj if n+ + 1  j  n+ + n− , ⎪ ⎪ ⎩1 if n + n + 1  j  n +



for 1  j  n. If T = diag(θ1 , . . . , θn ), then we can write D = T H GT , where G is a diagonal matrix, G = (g1 , . . . , gn ) defined by ⎧ ⎪ 1 if λj > 0, ⎪ ⎨ gj = −1 if λj < 0, ⎪ ⎪ ⎩0 if λ = 0, j

for 1  j  n. This allows us to write A = U H DU = U H T H GT U = (T U )H G(T U ). The matrix T U is clearly nonsingular, so A ∼H G. The matrix G defined above is the inertia matrix of A and these definitions show that any Hermitian matrix is congruent to its inertia matrix. For a Hermitian matrix A ∈ Cn×n , let S+ (A) be the subspace of n C generated by n+ (A) orthonormal eigenvectors that correspond to the positive eigenvalues of A. Clearly, we have dim(S+ (A)) = n+ (A). This notation is used in the proof of the next theorem. Theorem 8.16 (Sylvester’s inertia theorem). Let A, B be two Hermitian matrices, A, B ∈ Cn×n . The matrices A and B are congruent if and only if I(A) = I(B). Proof. If I(A) = I(B), then we have A = S H GS and B = T H GT , where both S and T are nonsingular matrices. Since A ∼H G and B ∼H G, we have A ∼H B.

Similarity and Spectra

535

Conversely, suppose that A ∼H B, that is, A = S H BS, where S is a nonsingular matrix. We have rank(A) = rank(B), so n0 (A) = n0 (B). To prove that I(A) = I(B), it suffices to show that n+ (A) = n+ (B). Let m = n+ (A) and let v 1 , . . . , v m be m orthonormal eigenvectors of A that correspond to the m positive eigenvalues of this matrix, and let S+ (A) be the subspace generated by these vectors. If v ∈ S+ (A) − {0}, then we have v = a1 v 1 + · · · + am v m , so ⎛ v H Av = ⎝

m  j=1

⎞H



aj v j ⎠ A ⎝

m  j=1

⎞ aj v j ⎠ =

m 

|aj |2 > 0.

j=1

Therefore, xH S H BSx > 0, so if y = Sx, then y H By > 0, which means that y ∈ S+ (B). This shows that S+ (A) is isomorphic to a subspace of S+ (B), so n+ (A)  n+ (B). The reverse inequality can be shown in the same manner, so n+ (A) = n+ (B).  We can add an interesting detail to the full-rank decomposition of a matrix. Corollary 8.13. If A ∈ Cm×n and A = CR is the full-rank decomposition of A with rank(A) = k, C ∈ Cm×k , and R ∈ Ck×n , then C may be chosen to have orthogonal columns and R to have orthogonal rows. Proof. Since the matrix AH A ∈ Cn×n is Hermitian, by Theorem 8.14, there exists a unitary matrix U ∈ Cn×k such that AH A = U H DU , where D ∈ Ck×k is a non-negative diagonal matrix. Let C = AU H ∈ C n×k and R = U . Clearly, CR = A, and R has orthogonal rows because U is unitary. Let cp , cq be two columns of C, where 1  p, q  k and p = q. Since cp = Aup and cq = Auq , where up , uq are the corresponding columns of U , we have cHp cq = uHp AH Auq = uHp U H DU uq = eHp Deq = 0, because p = q.



Theorem 8.17 (Rayleigh–Ritz theorem). Let A ∈ Cn×n be a Hermitian matrix and let λ1 , λ2 , . . . , λn be its eigenvalues, where λ1  λ2  · · ·  λn . Define the Rayleigh–Ritz function

Linear Algebra Tools for Data Mining (Second Edition)

536

ralA : Rn − {0} −→ R as ralA (x) =

xH Ax xH x

Then λ1  ralA (x)  λn for x ∈ Cn − {0n }. Proof. By Theorem 8.13, since A is Hermitian, there exists a unitary matrix P and a diagonal matrix T such that A = P H T P and the diagonal elements of T are the eigenvalues of A, that is, T = diag(λ1 , λ2 , . . . , λn ). This allows us to write x Ax = x P T P x = (P x) T P x = H

H

H

H

n 

λi |(P x)i |2 ,

j=1

which implies λ1 P x2  xH Ax  λn P x2 . Since P is unitary, we also have P x2 = xH P H P x = xH x, which implies λ1 xH x  xH Ax  λn xH x, for x ∈ Cn .



Corollary 8.14. Let A ∈ Cn×n be a Hermitian matrix and let λ1 , λ2 , . . . , λn be its eigenvalues, where λ1  λ2  · · ·  λn . We have λ1 = max{xH Ax | xH x = 1}, λn = min{xH Ax | xH x = 1}. Proof. Note that if x is an eigenvector that corresponds to λ1 , then Ax = λ1 x, so xH Ax = λ1 xH x; in particular, if xH x = 1, we have λ1 = xH Ax, so λ1 = max{xH Ax | xH x = 1}. The equality for λn can be shown in a similar manner.



Similarity and Spectra

537

We discuss next an important result that is a generalization of the Rayleigh–Ritz theorem (Theorem 8.17). Denote by Sk (L) the collection of all subspaces of dimension k of an F-linear space L. Theorem 8.18 (Courant-Fischer theorem). Let A ∈ Cn×n be a Hermitian matrix having the eigenvalues λ1  · · ·  λn . We have λk =

min

max{xH Ax | x2 = 1},

S∈Sn−k+1 (Cn ) x∈S

and λk =

max min{xH Ax | x2 = 1}.

S∈Sk (Cn ) x∈S

Proof. By Theorem 8.13, there exists a unitary matrix U and a diagonal matrix D such that A = U H DU and the diagonal elements of D are the eigenvalues of A, that is, D = diag(λ1 , λ2 , . . . , λn ). We prove only that λk =

min

max{xH Dx | x2 = 1}.

S∈Sn−k+1 (Cn ) x∈S

The proof of the second part of the theorem is entirely similar. Let S be a subspace of Cn with dim(S) = n − k + 1. We have S ∩ e1 , . . . , ek  = {0n } because otherwise the equality S ∩e1 , . . . , ek  = {0n } would imply dim(S)  n − k. Define S˜ = {y ∈ S | y = 1}, Sˆ = {y ∈ S˜ | y ∈ e1 , . . . , ek }. Therefore, Sˆ consists of vectors of S˜ having the form ⎛ ⎞ y1 ⎜.⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜yk ⎟ ⎟ y=⎜ ⎜0⎟ ⎜ ⎟ ⎜.⎟ ⎜.⎟ ⎝.⎠ 0

Linear Algebra Tools for Data Mining (Second Edition)

538

such that

k

2 i=1 yi

ˆ we have = 1. Thus, for all y ∈ S,

y H Dy =

k 

λi |yi |2  λk

i=1

so

k 

|yi |2 = λk .

i=1

˜ it follows that max ˜ y H Dy  max ˆ y H Dy  λk , Since Sˆ ⊆ S, y∈S y∈S min

dim(S)=n−k+1

max{xH Dx | x ∈ S and x2 = 1}  λk . x

Let now S be S = e1 , . . . , ek−1 ⊥ . Clearly, dim(S) = n − k + 1. A vector y ∈ S has the form ⎛ ⎞ 0 ⎜.⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜0⎟ ⎟ y=⎜ ⎜y ⎟. ⎜ k⎟ ⎜.⎟ ⎜.⎟ ⎝.⎠ yn Therefore, y Dy = H

n  i=k

2

λi |yi |  λi

n 

|yi |2 = λi

i=k

for all y ∈ {y ∈ S | y2 = 1}. This implies min

dim(S)=n−k+1

max{xH Dx | x ∈ S and x2 = 1}  λk , x

which yields the desired equality. The matrices A and D have the same eigenvalues. Also xH Ax = xH Ax = xH U H DU x = (U x)H D(U x) and U x2 = x2 , because U is a unitary matrix. This yields the first equality of the theorem.  Another form of the Courant–Fisher theorem can be obtained by observing that every p-dimensional subspace S of Cn is the orthogonal space of an (n − p)-dimensional subspace. Therefore, for

Similarity and Spectra

539

each p-dimensional subspace S there is a sequence of n − p vectors w1 , . . . , w n−p (which is a basis of S ⊥ ) such that S = {x ∈ Cn | x ⊥ w1 , . . . , x ⊥ wn−p }. Theorem 8.19. Let A ∈ Cn×n be a Hermitian matrix having the eigenvalues λ1  · · ·  λn . We have λk = =

min

max{xH Ax | x ⊥ w1 , . . . , x ⊥ wk−1 and x2 = 1},

max

min{xH Ax | x ⊥ w1 , . . . , x ⊥ w n−k , and x2 = 1}.

w1 ,...,w k−1

x

w1 ,...,w n−k

x

Proof. This statement is equivalent to theorem 8.18, as we  observed before. Theorem 8.20 (Interlacing theorem). Let A ∈ Cn×n be a Her i · · · ik mitian matrix and let B = A 1 be a principal submatrix of i1 · · · ik A, B ∈ Ck×k . If spec(A) = {λ1 , . . . , λn } and spec(B) = {μ1 , . . . , μk }, where λ1  · · ·  λn and μ1  · · ·  μk , then λj  μj  λn−k+j for 1  j  k. Proof. Let {j1 , . . . , jq } = {1, . . . , n} − {i1 , . . . , ik }, where j1 < · · · < jq and k + q = n. By the Courant–Fisher theorem, we have λj = min max{xH Ax | x2 = 1 and x ∈ W ⊥ }, W

x

where W ranges over sets of non-zero vectors in Cn containing j − 1 vectors. Therefore, λj  min max{xH Ax | x2 = 1 and x ∈ W ⊥ W

x

and x ∈ ej1 , . . . , ejq ⊥ } = min max{y H By | y2 = 1andy ∈ U ⊥ = μj U

y

(by Exercise 2 of Chapter 3), where U ranges over sets of non-zero vectors in Ck containing j − 1 vectors.

540

Linear Algebra Tools for Data Mining (Second Edition)

Again, by the Courant–Fisher theorem, λn−k+j = max min{xH Ax | x2 = 1 and x ∈ Z⊥ }, x

Z

where Z ranges over sets containing k − j non-zero vectors in Cn . Consequently, λn−k+j  max min{xH Ax | x2 = 1 and x ∈ Z⊥ x

Z

and x ∈ ej1 , . . . , ejq ⊥ } = max min{y H By | y2 = 1 and y ∈ S⊥ } = μj , y

S

where S ranges over the sets of non-zero vectors in Ck containing n − j vectors.  n×n be a Hermitian matrix and let B = Corollary   8.15. Let A ∈ C i · · · ik A 1 be a principal submatrix of A, B ∈ Ck×k . The set spec(B) i1 · · · ik contains no more positive eigenvalues than the number of positive eigenvalues of A and no more negative eigenvalues than the number of negative eigenvalues of A.

Proof. This observation is a direct consequence of the Interlacing  Theorem. Theorem 8.21. Let A ∈ Cn×n be a Hermitian matrix having the eigenvalues λ1  · · ·  λn . If u1 , . . . , un are eigenvectors that correspond to λ1 , . . . , λn , respectively, W = {u1 , . . . , uk } and Z = {uk+2 , . . . , un }, then we have λk+1 = max{xH Ax | x2 = 1 and x ∈ W ⊥ } x

= min{xH Ax | x2 = 1 and x ∈ Z⊥ }. x

Proof. If A = U H DU , where U is a unitary matrix and D is a diagonal matrix, then ui , the i th column of U H , can be written as ui = U H ei . Therefore, by the second part of the proof of the Courant– Fisher theorem, we have xAx  λk+1 if x belongs to the subspace orthogonal to the subspace generated by the first k eigenvectors of A. Consequently, the Courant–Fisher theorem implies the first equality of this theorem. The second equality can be obtained in a similar  manner.

Similarity and Spectra

541

Corollary 8.16. Let A ∈ Cn×n be a Hermitian matrix having the eigenvalues λ1  · · ·  λn . If u1 , . . . , uk are eigenvectors that correspond to λ1 , . . . , λk , respectively, then a unit vector x that maximizes xH Ax and belongs to the subspace orthogonal to the subspace generated by the first k eigenvectors of A is an eigenvector that corresponds to λk+1 . Proof. Let u1 , . . . , un be the eigenvectors ofA and let x ∈ n u1 , . . . , uk ⊥ be a unit vector. We have x = j=k+1 aj uj , and n 2 j=k+1 aj = 1, which implies x Ax = H

n 

λj a2j = λk+1 .

j=k+1

This, in turn, implies ak+1 = 1 and ak+2 = · · · = an = 0, so x =  uk+1 . Theorem 8.22. Let A, B ∈ Cn× be two Hermitian matrices and let E = B − A. Suppose that the eigenvalues of A, B, E are α1  · · ·  αn , β1  · · ·  βn , and 1  · · ·  n , respectively. Then we have n  βi − αi  1 . Proof. Note that E is also Hermitian, so all matrices involved have real eigenvalues. By the Courant–Fisher theorem, βk = min max{xH Bx | x2 = 1 and wHi x = 0 for 1  i  k − 1}, W

x

where W = {w1 , . . . , wk−1 }. Thus, βk  max xH Bx = max(xH Ax + xH Ex). x

x

(8.2)

Let U be a unitary matrix such that U H AU = diag(α1 , . . . , αn ). Choose wi = U ei for 1  i  k − 1. We have wHi x = eHi U H x = 0 for 1  i  k − 1. = x2 = 1. Define y = U H x. Since U is a unitary matrix, y2 n 2 Observe that eHi y = yi = 0 for 1  i  k. Therefore, i=k yi = 1. n H H H 2 This, in turn, implies x Ax = y U AU y = i=k αi yi  αk . From the Inequality (8.2), it follows that βk  αk + max xH Ex  αk + n . x

Since A = B − E, by inverting the roles of A and B we have αk   βk − 1 , or 1  βk − αk , which completes the argument.

542

Linear Algebra Tools for Data Mining (Second Edition)

Theorem 8.23 (Hoffman–Wielandt theorem). Let A, B ∈ Cn×n be two normal matrices having the eigenvalues α1 , . . . , αn and β1 , . . . , βn , respectively. Then there exist permutations φ and ψ in PERMn such that n 

2

|αi − βψ(i) |  A −

B2F



i=1

n 

|αi − βφ(i) |2 .

i=1

Proof. Since A and B are normal matrices, they can be diagonalized as A = U H DA U and B = W H DB W, where U and W are unitary matrices and C, D are diagonal matrices, C = diag(α1 , . . . , αn ) and D = diag(β1 , . . . , βn ). Then we can write A − B2F = U H CU − W H DW 2F = trace(E H E), where E = U H CU − W H DW . Note that E H E = (U H C H U − W H D H W )(U H CU − W H DW ) = U H C H CU + W H D H DW − W H D H W U H CU − U H C H U W H DW = U H C H CU + W H D H DW − U H C H U W H DW − (U H C H U W H DW )H = U H C H CU + W H D H DW − 2 (U H C H U W H DW ). Observe that trace( (U H C H U W H DW )) = (trace(U H C H U W H DW )) = (trace(C H U W H DW U H )). Thus, if Z is the unitary matrix Z = W U H , we have trace( (U H C H U W H DW )) = (trace(C H Z H DZ)).   Since C2F = ni=1 α2i and D2F = ni=1 βi2 , we have trace(E H E) = C2F + D2F − 2 (trace(C H Z H DZ)) ⎛ ⎞ n n n  n    α2i + βi2 − 2 ⎝ a ¯i |zij |2 βj ⎠. = i=1

i=1

i=1 j=1

Theorem 3.20 implies that the matrix S that has the elements |zij |2 is doubly stochastic because Z is a unitary matrix. This allows us to

Similarity and Spectra

543

write A − B2F = trace(E H E) 

n 

α2i +

i=1

n 

⎛ βi2 − max ⎝ S

i=1

n n  

⎞ a ¯i sij βj ⎠,

i=1 j=1

and A − B2F = trace(E H E) 

n 

α2i +

i=1

n 

⎞ ⎛ n  n  βi2 − min ⎝ a ¯i sij βj ⎠, S

i=1

i=1 j=1

where the maximum and the minimum are taken over the set of all doubly stochastic matrices. The Birkhoff–von Neumann Theorem states that the polyhedron of doubly–stochastic matrices has the permutation matrices as its vertices. Therefore, the extremes of the linear function ⎞ ⎛ n n   α ¯ i sij βj ⎠ f (S) = ⎝ i=1 j=1

are achieved when S is a permutation matrix. Let Pφ be the permutation matrix that gives the maximum of f and let Pψ be the permutation matrixthat gives the minimum. If S = Pφ , then nj=1 sij βj = βφ(i) , so ⎞ ⎛ n n n  n    α2i + βi2 − ⎝ α ¯ i βφ(j) ⎠ A − B2F  i=1

=

n 

i=1

i=1 j=1

|αi − βφ(i) |2 .

i=1

In the last equality, we used the elementary calculation a − ¯b) |a − b|2 = (a − b)(a − b) = (a − b)(¯ ¯b = a¯ a + b¯b − a ¯b − a ab), = |a|2 + |b|2 − 2 (¯ for a, b ∈ C. Similarly, if S = Pψ , we obtain the other inequality. 

Linear Algebra Tools for Data Mining (Second Edition)

544

Corollary 8.17. Let A, B ∈ Cn×n be two Hermitian matrices having the eigenvalues α1 , . . . , αn and β1 , . . . , βn , respectively, where α1  · · ·  αn and β1  · · ·  βn . Then, n 

|αi − βi |2  A − B2F .

i=1

If α1  · · ·  αn and β1  · · ·  βn , then n 

|αi − βi |2  A − B2F .

i=1

Proof. Since A and B are Hermitian, their eigenvalues are real numbers. By the Hoffman–Wielandt Theorem, there exist two permutations φ, ψ ∈ PERMn such that n 

2

|αi − βψ(i) |  A −

i=1

B2F



n 

|αi − βφ(i) |2 .

i=1

We have n 

|αi − βφ(i) |2 = a − Pφ b2F

i=1

and n 

|αi − βψ(i) |2 = a − Pψ b2F ,

i=1

where

⎞ ⎛ ⎞ β1 α1 ⎜ . ⎟ ⎜ . ⎟ ⎟ ⎜ ⎟ a=⎜ ⎝ .. ⎠ and b = ⎝ .. ⎠, αn βn ⎛

so a − Pψ b2F  A − B2F  a − Pφ b2F . By Corollary 6.13, since the components of a and b are placed in decreasing order, we have a − Pφ bF  a − bF ,

Similarity and Spectra

545

so a −

b2F

=

n 

|αi − βi |2  a − Pφ bF  A − B2F ,

i=1

which proves the first inequality of the corollary. For the second part, by Corollary 6.13, we have A − BF  a − Pψ bF  a − bF ,



By Theorem 7.16, the characteristic polynomial of an upper triangular matrix T is pT (λ) = (λ − λ1 ) · · · (λ − λn ), where λ1 , . . . , λn are the eigenvalues of T and, at the same time, the diagonal elements of T . Lemma 8.4. Let T ∈ Cn×n be an upper triangular matrix and let pT (λ) = λn +a1 λn−1 +· · ·+an−1 λ+an be its characteristic polynomial. Then pT (T ) = T n + a1 T n−1 + · · · + an−1 T + an In = On,n . Proof.

We have pT (T ) = (T − λ1 In ) · · · (T − λn In ).

Observe that for any matrix A ∈ Cn×n , λj , λk ∈ spec(A), and every eigenvector v of A in SA,λk , we have (λj In − A)v = (λj − λk )v. Therefore, for v ∈ ST,λk , we have pT (T )v = (λ1 In − T ) · · · (λn In − T )v = 0, because (λk I − T )v = 0. By Corollary 7.9, pT (T )ei = 0 for 1  i  n, so pT (T ) = On,n , so  pT (T ) = On,n . Theorem 8.24 (Cayley–Hamilton theorem). If A ∈ Cn×n is a matrix, then pA (A) = On,n .

Linear Algebra Tools for Data Mining (Second Edition)

546

Proof. By Schur’s Triangularization Theorem, there exists a unitary matrix U ∈ Cn×n and an upper–triangular matrix T ∈ Cn×n such that A = U H T U and the diagonal elements of T are the eigenvalues of A. Taking into account that U H U = U U H = In , we can write pA (A) = (λ1 In − A)(λ2 In − A) · · · (λn In − A) = (λ1 U H U − U H T U ) · · · (λn U H U − U H T U ) = U H (λ1 In − T )U U H (λ2 In − T )U · · · U H (λn In − T )U = U H (λ1 In − T )(λ2 In − T ) · · · (λn In − T )U = U H pT (T )U = On,n , 

by Lemma 8.4.

8.4

The Sylvester Operator

Schur’s Theorem allows us to examine the solvability of the matrix equation AX − XB = C, where A ∈

Cm×m ,

B∈

Cn×n ,

and C ∈ Cm×n .

Theorem 8.25. Let A ∈ Cm×m and B ∈ Cn×n be two matrices. The Sylvester operator defined by A and B is the mapping S A,B : Cm×n −→ Cm×n given by S A,B (X) = AX − XB. The linear mapping SA,B has an inverse if and only if spec(A) ∩ spec(B) = ∅. Proof. The linearity of SA,B is immediate. Suppose that λ ∈ spec(A) ∩ spec(B), so Au = λu and v H B = λv H for some u ∈ Cm×1 and v ∈ Cn×1 . Thus, u is an eigenvector for A and v is a left eigenvector for B that corresponds to the same eigenvalue λ. Define the matrix X = uvH ∈ Cm×n . We have AX − XB = Auv H − uv H B = (Au)v H − u(v H B) = 0, which means that S A,B is not invertible. Thus, if SA,B is invertible, the spectra of A and B are disjoint.

Similarity and Spectra

547

Conversely, suppose that S A,B is invertible, so the equation AX − XB = C is solvable for any C ∈ Cm×n . If B = U H T U is a Schur decomposition of B, where U ∈ Cn×n is a unitary matrix and T ∈ Cn×n

⎛ t11 t12 ⎜0 t 22 ⎜ T =⎜ . .. ⎜ . ⎝ . . 0 0

⎞ · · · t1n · · · t2n ⎟ ⎟ ⎟ .. ⎟ ··· . ⎠ · · · tnn

is an upper triangular matrix having as diagonal elements the eigenvalues t11 , t22 , . . . , tnn of B, then AX − XU H T U = C, which implies AXU H − XU H T = CU H . If Z = XU H ∈ Cm×n and D = CU H , we have AZ − ZT = D and this system has the form A(z 1 · · · z n ) − (z 1 · · · z n )T = (d1 · · · dn ). Equivalently, we have Az 1 − t11 z 1 = d1 Az 2 − t12 z 1 − t22 z 2 = d2 .. . Az n − t1n z 1 − t2n z 2 − · · · − tnn z n = dn . Note that the first equation of this system (A − t11 In )z 1 = d1 can be solved for z 1 because t11 , as an eigenvalue of B, does not belong to spec(A). Thus, the second equation Az 2 − t22 z 2 = d2 + t12 z 1 can be resolved with respect to z 2 , because t22 ∈ spec(A), etc. Thus, the matrix Z can be found, so X = ZU .  As observed in [76], special cases of this equation occur in many important problems in linear algebra. For instance, if n = 1 and B = (0), then solving SA,0 (x) = c amounts to solving the linear system Ax = c. Solving the equation SA,A (X) = On,n amounts to finding the matrices X that commute with A, etc. Theorem 8.26. Let A ∈ Cm×m and B ∈ Cn×n be two matrices such that spec(B) ⊂ {z ∈ C | |z| < r} and spec(A) ⊂ {z ∈ C | |z| > r}

548

Linear Algebra Tools for Data Mining (Second Edition)

for some r > 0. Then, the solution of the equation AX − XB = C is given by the series  (A−1 )n+1 CB n . X= n∈N

Proof. Since spec(A) and spec(B) are finite sets, there exist r1 and r2 such that 0 < r1 < r < r2 , spec(B) ⊂ {z ∈ C | |z| < r1 }, and spec(A) ⊂ {z ∈ C | |z| > r2 }. Then spec(A−1 ) ⊂ {z ∈ C | |z| < 1/r2 }. By Theorem 8.48, there exists a positive integer n0 such that −1 n if n  n0 , |||B n |||  r1n and  |||A |||  r2 . Thus, for n  n0 , we n

have |||A−n−1 CB n |||  rr12 |||A−1 C|||, which shows that the series  −1 n+1 CB n is convergent. It is immediate that for this n∈N (A )  choice of X we have AX − XB = C.

Corollary 8.18. Let A ∈ Cm×m and B ∈ Cn×n be two normal matrices such that spec(B) ⊂ {z ∈ C | |z| < r} and spec(A) ⊂ {z ∈ C | |z| > r + a} for some r > 0 and a > 0. If X is the solution of the Sylvester equation AX − XB = C, and ||| · ||| is a unitarily invariant matrix norm, then |||X|||  a1 |||C|||. Proof.

From Theorem 8.26, we have |||X||| 

∞ 

|||A−1 |||n+1 |||C||||||B|||n

n=0

 |||C|||

∞  n=0

(r + a)−n−1 an =

1 |||C|||. a



The Sylvester operator can be seen as a linear transformation of the linear space Cm×n (which is isomorphic to Cmn ). The separation of two matrices A and B, denoted by sep(A, B), was introduced in [157]. This quantity is very useful in studying relationships between invariant spaces of matrices and is defined as sep(A, B) = min{S A,B (X) | X = 1}.

(8.3)

Obviously, the number is dependent on the norm used in its definition. If we wish to specify this norm, we will add a suggestive

Similarity and Spectra

549

subscript; for example, if we use the Frobenius norm, we will denote the corresponding separation by sepF (A, B); sep2 (A, B) will be used when we deal with ||| · |||2 . Theorem 8.27. If A ∈ Cm×m and B ∈ Cn×n are two matrices, then the spectrum of the Sylvester operator S A,B is spec(S A,B ) = {λ − μ | λ ∈ spec(A) and μ ∈ spec(B)}. Proof.

Suppose that θ ∈ spec(S A,B ). There exists a matrix X ∈

Cm×n − {Om,n } such that AX − XB = θX, which amounts to

(A − θIm )X − XB = O. In other words, S A−θIm ,B is singular, which implies that spec(A−θIm )∩spec(B) = ∅. This means that there exist λ ∈ spec(A) and μ ∈ spec(B) such that λ − θ = μ, which implies that θ = λ − μ. Thus, spec(S A,B ) ⊆ {λ − μ | λ ∈ spec(A) and μ ∈ spec(B)}. The reverse inclusion can be shown by reversing the above  implications. Corollary 8.19. If A ∈ Cm×m and B ∈ Cn×n are two diagonalizable matrices such that spec(A) = {λ1 , . . . , λm }, and spec(B) = {μ1 , . . . , μm }, and if ui is an eigenvector of A associated to λi and v j is an eigenvector of B H associated with μj , then the matrix Xij = ui v Hj is an eigenvector of S A,B associated to λi − μj . Proof.

We have

S A,B (Xij ) = Aui v Hj − ui v Hj B = λi ui vj − ui (B H v j )H = λi ui vj − μj ui v Hj = (λi − μj )ui v Hj = (λi − μj )Xij .



Corollary 8.20. Let A ∈ Cm×m and B ∈ Cn×n be two matrices. We have sep(A, B)  min{|λ − μ| | λ ∈ spec(A), μ ∈ spec(B)}. Proof. The inequality holds if sep(A, B) = 0. Therefore, suppose that sep(A, B) > 0. This implies that S A,B is nonsingular and by Supplement 43 of Chapter 6 we have |||S −1 A,B |||2 = −1 1 1 min{S A,B (X)2 | X2 =1} = sep(A,B) . Since the spectral radius of S A,B is less than |||S −1 A,B |||, it follows that

eigenvalues of μ ∈ spec(B).

S −1 A,B ,

1 sep(A,B)

is larger or equal to any

so sep(A, B)  |λ − μ| for every λ ∈ spec(A) and 

Linear Algebra Tools for Data Mining (Second Edition)

550

8.5

Geometric versus Algebraic Multiplicity

Theorem 8.28. Let A ∈ Cn×n be a square matrix and let λ ∈ spec(A). The geometric multiplicity geomm(A, λ) is less than or equal to the algebraic multiplicity algm(A, λ). Proof.

By Equality (7.4),

geomm(A, λ) = dim(null(A − λIn )) = n − rank(A − λIn ). Let m = geomm(A, λ). Starting from an orthonormal basis u1 , . . . , um of the subspace null(A − λIn ), define the matrix U = (u1 · · · um ). We have (A − λIn )U = O, so AU = λU = U (λIm ). Thus, by Theorem 8.7, we have

λIm U H AV , A∼ O V H AV where U ∈ Cn×m and V ∈ Cn×(n−m) . Therefore, A has the same characteristic polynomial as

λIm U H AV , B= O V H AV which implies algm(A, λ) = algm(B, λ). Since the algebraic multiplicity of λ in B is at least equal to m, it follows that algm(A, λ)  m = geomm(A, λ).



Definition 8.3. If λ is an eigenvalue of A and geomm(A, λ) = algm(A, λ), then we refer to λ as a semi-simple eigenvalue. The matrix A is defective if there exist at least one eigenvalue that is not semi-simple. Otherwise, A is said to be non-defective. A is a non-derogatory matrix if geomm(A, λ) = 1 for every eigenvalue λ. If λ is a simple eigenvalue of A, we have algm(A, λ) = 1, and, since algm(A, λ)  geomm(A, λ)  1, it follows that algm(A, λ) = geomm(A, λ) = 1, so λ is semi-simple. Thus, the notion of semisimplicity of an eigenvalue generalizes the notion of simplicity.

Similarity and Spectra

551

Example 8.4. Let A be the matrix ⎛ ⎞ 2 1 0 ⎜ ⎟ A = ⎝0 2 1⎠. 0 0 2 Its characteristic polynomial is   2 − λ 1 0     2−λ 1  = (2 − λ)3 , pA (λ) = det(A − λI3 ) =  0    0 0 2 − λ which means that A has the unique value 2 with algebraic multiplicity 3. ⎛ ⎞ 1 ⎝ If v = 0⎠ is an eigenvector, then Av = 2v, which amounts to 0 v1 = 1, v2 = 0, and v3 = 0. Thus, the geometric multiplicity of 2 is equal to 1, hence the eigenvalue 2 is not semi-simple. Note that if λ is a simple eigenvalue of A, then geomm(A, λ) = algm(A, λ) = 1, so λ is semi-simple. Example 8.5. Let A ∈ R2×2 be the matrix

A=

a b , b a

where a, b ∈ R are such that a = 0 and b = 0. The characteristic polynomial of A is pA (λ) = (a − λ)2 − b2 , so spec(A) = {a + b, a − b} and algm(A, a + b) = algm(A, a − b) = 1. Thus, both a + b and a − b are simple eigenvalues of A, and, therefore, they are also semi-simple. Theorem 8.29. If A ∈ Rn×n is a symmetric matrix, each of its eigenvalues is semi-simple. Proof. We saw that each symmetric matrix has real eigenvalues and is orthonormally diagonalizable (by Corollary 8.8). Starting from the real Schur factorization A = U T U −1 , where U is an orthogonal matrix and T = diag(t11 , . . . , tnn ) is a diagonal matrix, we can write

552

Linear Algebra Tools for Data Mining (Second Edition)

AU = U T . If we denote the columns of U by u1 , . . . , un , then we can write (Au1 , . . . , Aun ) = (t11 u1 , . . . , tnn un ), so Aui = tii ui for 1  i  n. Thus, the diagonal elements of T are the eigenvalues of A and the columns of U are corresponding eigenvectors. Since these eigenvectors are pairwise orthogonal, the dimension of the invariant subspace that corresponds to an eigenvalue equals the algebraic multiplicity of the eigenvalue, so each eigenvalue  is semi-simple. 8.6

λ-Matrices

The set of polynomials over the complex field depending on a nondeterminate λ is denoted by C[λ]. If f (λ) ∈ C[λ], we denote its degree by deg(f ). Definition 8.4. A λ-matrix is a polynomial of the form G(λ) = A0 λm + A1 λm−1 + · · · + Am , where A0 , A1 , . . . , Am are matrices in Cp×q . If A0 = Op,q , then we say that m is the degree of the λ-matrix G and we write deg(G) = m. A square λ-matrix, that is, a λ-matrix of type n × n, is regular if its leading coefficient A0 is a nonsingular matrix. The set of λ-matrices of format p × q over a field F is denoted by F [λ]p×q . In the special case p = q = 1, G(λ) is a usual polynomial in λ. Note that a λ-matrix of type p × q can also be regarded as a matrix whose entries are polynomials in λ, that is, as a member of C[λ]p×q ; the converse is also true, that is, every matrix whose entries are polynomials in λ is a λ-matrix. Example 8.6. Consider the λ-matrix G(λ) = A0 λ3 + A1 λ2 + A2 λ + A3 ,

Similarity and Spectra

where

and

553



⎞ ⎛ ⎞ 0 0 1 1 1 1 ⎜ ⎟ ⎜ ⎟ A0 = ⎝1 −1 0⎠, A1 = ⎝2 0 0⎠, 2 0 0 0 −2 0 ⎛ ⎞ ⎛ ⎞ 2 3 4 0 0 0 ⎜ ⎟ ⎜ ⎟ A2 = ⎝1 9 4⎠, A3 = ⎝1 0 0⎠. 3 0 0 0 0 1

G(λ) is the matrix ⎛

⎞ λ2 + 3λ λ3 + λ2 + 4λ λ2 + 2λ ⎟ ⎜ 4λ G(λ) = ⎝λ3 + 2λ2 + λ + 1 −λ3 + 9λ ⎠. −2λ2 1 2λ3 + 3λ

Theorem 8.30. A λ-matrix G(λ) is invertible if and only if det(G(λ)) is a non-zero constant. Proof. Suppose that G(λ) is invertible, that is, there exists a λmatrix H(λ) such that G(λ)H(λ) = In . Then det(G(λ)) det(H(λ)) = 1, so det(G(λ)) is a constant that cannot be 0. Conversely, suppose that det(G(λ)) is a non-zero constant. If we construct G−1 (λ) using the standard approach, it is immediate that  G−1 (λ) is a λ-matrix, so G(λ) is invertible. The sum and product of λ-matrices are component-wise defined exactly as the general sum and products of matrices. Note that if G(λ) and H(λ) are two λ-matrices, the degree of their product is not necessarily equal to the sum of the degrees of the factors. In general, the degree of G(λ)H(λ) is not larger than the sum of the degrees of the factors. Example 8.7. Let G(λ), H(λ) be the λ-matrices





1 2 1 0 1 G(λ) = λ+ , and H(λ) = 0 0 −2 1 0

defined by

1 1 3 λ+ . 0 3 2

Linear Algebra Tools for Data Mining (Second Edition)

554

We have

GH =



7 7 4 3 λ+ . −2 −2 1 −4

Let G(λ) = A0 λm + A1 λm−1 + · · · + Am be a λ-matrix. The same ˜ matrix can also be written as G(λ) = λm A0 + λm−1 A1 + · · · + Am . ˜ If the matrix H is substituted for λ in G(λ) and in G(λ), the results G(B) = A0 B m + A1 B m−1 + · · · + Am , ˜ G(B) = B m A0 + B m−1 A1 + · · · + Am are distinct, in general, because matrices of the form B j do not commute with A0 , A1 , . . . , Am . Thus, G(A) is referred to as the right ˜ value of G in A and G(A) is referred to as the left value of G in A. Theorem 8.31. Let G(λ) and H(λ) be two n × n λ-matrices such that H(λ) is a regular matrix. There are two unique n × n λ-matrices Q(λ) and R(λ) such that G(λ) = H(λ)Q(λ)+ R(λ) such that R(λ) = On,n or deg(R(λ)) < deg(H(λ)). Proof.

Suppose that G(λ) = A0 λm + A1 λm−1 + · · · + Am , F (λ) = B0 λq + B1 λq−1 + · · · + Bq ,

where A0 = On,n and det(B0 ) = 0. We assume that m  q; otherwise, we can take Q(λ) = On,n and R(λ) = G(λ). We define a sequence of λ-matrices G(1) (λ), G(2) (λ) . . . as having a non-increasing sequence of degrees m1  m2  · · · . The (p) coefficients of G(p) (λ) are denoted by Aj . We define G(1) (λ) = G(λ) − H(λ)B0−1 A0 λm−q and (k−1) mk−1 −q

G(k) (λ) = G(k−1) (λ) − H(λ)B0−1 A0

λ

,

as long as mk  q. If mk+1 < q, the computation halts and (k)

G(k+1) (λ) = G(k) (λ) − H(λ)B0−1 A0 . This allows us to write (1)

(k)

G(λ) = H(λ)[B0−1 A0 λm−q + B0−1 A0 λm1 −q + · · · + B0−1 A0 ] + G(k+1) (λ).

Similarity and Spectra

555

Therefore, we can take (1)

(k)

Q(λ) = B0−1 A0 λm−q + B0−1 A0 λm1 −q + · · · + B0−1 A0 , and R(λ) = G(k+1) (λ); clearly, we have deg(R(λ)) < deg(H(λ). Suppose now that we have both G(λ) = H(λ)Q(λ) + R(λ) and G(λ) = H(λ)Q1 (λ) + R1 (λ), where deg(R(λ)) < deg(H(λ)) and deg(R1 (λ)) < deg(H(λ)). These equalities imply H(λ)(Q(λ) − Q1 (λ)) = R1 (λ) − R(λ). If Q(λ) − Q1 (λ) = On,n , we have deg(H(λ)(Q(λ) − Q1 (λ)))  deg(H(λ)) because H(λ) is regular. However, deg(R1 (λ) − R(λ)) < deg(H(λ)), which leads to a contradiction. Therefore, we have  Q(λ) = Q1 (λ), which implies R1 (λ) = R(λ). We refer to the λ-matrices Q(λ) and R(λ) defined by the equality G(λ) = H(λ)Q(λ) + R(λ) of Theorem 8.31 as the left quotient and left remainder of the division of G(λ) by H(λ). It is possible to prove in an entirely similar manner, under the same assumptions as the ones of Theorem 8.31, that the matrices P (λ) and S(λ), such that G(λ) = B(λ)H(λ) + S(λ) and S(λ) = On,n or deg(S(λ)) < deg(H(λ)), are uniquely determined. In this case we refer to the matrices P (λ) and S(λ) as the right quotient and the right remainder of the division of G(λ) by H(λ). Corollary 8.21. The left remainder of the division of the λ-matrix G(λ) = A0 λm + A1 λm−1 + · · · + Am by H(λ) = λI − C is the matrix G(C), where G(C) = C m A0 + C m−1 A1 + · · · + Am . The right remainder of the division of G(λ) by H(λ) is the matrix ˜ G(C) = A0 C m + A1 C m−1 + · · · + Am . Proof.

Starting from the equality A0 λm + A1 λm−1 + · · · + Am = (λI − C)Q(λ) + R,

where Q(λ) = Q0 λm−1 + · · · + Qm−1 and identifying the coefficients of the same powers of λ, we obtain the equalities A0 = Q0 , A1 = Q1 − CQ0 , A2 = Q2 − CQ1 , . . . , Am−1 = Qm−1 − CQm−2 , Am = R − CQm−1 . These equalities imply R = C m A0 + C m−1 A1 + · · · + CAm−1 + Am =  A(C). The argument for the right remainder is similar.

556

Linear Algebra Tools for Data Mining (Second Edition)

Polynomials with complex coefficients, that is, polynomials in C[λ], are 1 × 1 λ-matrices.

Definition 8.5. Let G, H ∈ C[λ] be two polynomials and let A ∈ Cn×n be a matrix such that det(H(A)) = 0. The matrix Q(A) =

G(A)(H(A))−1 is the quotient of G(A) by H(A). A rational matrix function of A is a quotient of two polynomials H(A)/G(A).

Since G(A)H(A) = H(A)G(A) for every A ∈ Cn×n , if H(A) is invertible, then (H(A))−1 G(A) = G(A)(H(A))−1 . Theorem 8.32. Let A ∈ Cn×n . There exists a unique polynomial mA of minimal degree whose leading coefficient is 1 such that mA (A) = On×n . Proof. Theorem 8.24 involving the characteristic polynomial pA of A shows that the set of polynomials whose leading coefficient is 1 and that has A as a root is non-empty. Thus, we need to show the uniqueness of a polynomial of minimal degree. Suppose that f and g are two distinct polynomials with leading coefficient 1, of minimal degree k, such that f (A) = g(A) = On,n . Then (f − g)(A) = On,n and the degree of the polynomial f − g is less than k. This contradicts the minimality of the degrees of f and  g. Thus, f = g. Definition 8.6. Let A ∈ Cn×n . The polynomial mA of minimal degree whose leading coefficient is 1 such that mA (A) = On×n is referred to as the minimal polynomial of A. Theorem 8.33. A ∈ Cn×n is an invertible matrix if and only if mA (0) = 0. Proof. Let mA (λ) = λk + a1 λk−1 + · · · + ak−1 λ + ak be the minimal polynomial of A. Suppose that A is an invertible matrix. If ak = 0, then mA (λ) = λ(λk−1 + a1 λk−2 + · · · + ak−1 ), so A(Ak−1 + a1 Ak−2 + · · · + ak−1 In ) = On,n . By multiplying the last equality by A−1 to the left, we obtain Ak−1 + a1 Ak−2 + · · · + ak−1 In = On,n , which contradicts the minimality of the degree of mA . Thus, ak = 0.

Similarity and Spectra

557

Conversely, suppose that ak = 0. Since Ak + a1 Ak−1 + · · · + ak−1 A + ak In = On,n , it follows that A(Ak−1 + a1 Ak−2 + · · · + ak−1 In ) = −ak In Thus, A is invertible and its inverse matrix is A−1 = −

1 (Ak−1 + a1 Ak−2 + · · · + ak−1 In ). ak



Definition 8.7. An annihilating polynomial of A ∈ Cn×n is a polynomial f such that f (A) = On,n . Theorem 8.34. If f is an annihilating polynomial for A ∈ Cn×n , than mA divides evenly f . Proof. Suppose that under the hypothesis of the theorem, mA does not evenly divide f . Then we can write f (λ) = mA (λ)q(λ) + r(λ), where r is a polynomial of degree smaller than the degree of mA . Note, however, that r(A) = f (A) − mA (A)q(A) = On,n , which con tradicts the minimality of the degree of mA . Let A ∈ Cn×n be a matrix and let B(λ) be the matrix whose transpose consists of the cofactors of the elements of the matrix λIn − A. Then, B(λ)(λIn − A) = pA (λ)In .

(8.4)

By substituting λ = A, we obtain an alternative proof of the Cayley– Hamilton equality pA (A) = On,n . The matrix B(λ) introduced above allows us to give an explicit form of the minimal polynomial of a matrix A ∈ Cn×n . Theorem 8.35. Let A ∈ Cn×n be a matrix. Its minimal polynomial is given by mA (λ) =

pA (λ) , d(λ)

where d(λ) is the greatest common divisor of the elements of the matrix B(λ).

Linear Algebra Tools for Data Mining (Second Edition)

558

Proof. The definition of d(λ) means that we can write B(λ) = d(λ)C(λ), where the entries of C(λ) are pairwise relatively prime polynomials. Thus, Equality (8.4) can be written as C(λ)(λIn − A) = which shows that pA (λ) d(λ)

pA (λ) d(λ)

pA (λ) In , d(λ)

is an annihilating polynomial for A. There-

is divisible by the minimal polynomial mA (λ), which fore, allows us to write pA (λ) = mA (λ)r(λ). d(λ) Since C(λ)(λIn − A) = mA (λ)r(λ)In , taking into account that mA (λ)In is an annihilator for A and, therefore, divisible by λIn − A, it follows that mA (λ)In = M (λ)(λIn − A). This implies C(λ)(λIn − A) =

pA (λ) In = mA (λ)r(λ)In = r(λ)M (λ)(λIn − A). d(λ)

Therefore, C(λ) = r(λ)M (λ), so r(λ) is a common divisor of the elements of C(λ). Since entries of C(λ) are pairwise relatively prime polynomials, it follows that r(λ) is a constant r0 . By the definition of the polynomials pA and mA , it follows that r0 = 1, which implies A (λ) .  mA (λ) = pd(λ) Theorem 8.36. If A and B are similar matrices in Cn×n , then their minimal polynomials are equal. Proof. Suppose that A ∼ B, that is A = P −1 BP , where P is an invertible matrix and let mA and mB be the two minimal polynomials of A and B, respectively. We have mB (A) = mB (P −1 BP ) = P −1 mB (B)P = On,n , so mA divides mB . In a similar manner, we can prove that mB divides mA , so mA = mB because both polynomials have 1 as leading  coefficient.

Similarity and Spectra

559

Let C(λ) ∈ C[λ]n×n be a λ-matrix such that rank(C) = r. Thus, C(λ) has at least one non-null minor of order r and all minors of order greater than r are null. Denote by δk (λ) the polynomial having 1 as leading coefficient that is the greatest common divisor of all minors of order k of C(λ) for 1  k  r. The polynomial δ0 (λ) is defined as equal to 1. Since every minor of order k is a linear combination of minors of order k − 1, it follows that δk−1 (λ) divides δk (λ) for 1  k  r. Definition 8.8. The invariant factors of the matrix C(λ) with rank(C) = r are the polynomials t0 (λ), . . . , tr−1 (λ), where tr−k (λ) =

δk (λ) δk−1 (λ)

for 1  k  r. Next, we introduce the notion of elementary transformation matrices for λ-matrices. Definition 8.9. The row elementary transformation matrices of λmatrices are the matrices T a(i) , T (p)↔(q) , and T (i)+p(λ)(j) , where T a(i) and T (p)↔(q) were introduced in Examples 3.34 and 3.35, and T (i)+p(λ)(j) is defined as ⎛ ⎞ 1 0 ··· ··· ··· 0 ···0 ⎜0 1 · · · · · · · · · 0 · · · 0⎟ ⎜ ⎟ ⎜. . ⎟ . . ⎜ ⎟ .. · · · .. · · · 0⎟, T (i)+p(λ)(j) = ⎜ .. .. · · · ⎜ ⎟ ⎜0 0 · · · p(λ) · · · 1 · · · 0⎟ ⎝ ⎠ .. .. .. .. . . ··· . ··· . ···1 where the polynomial p(λ) occurs in row i and column j. Note that all elementary transformation matrices are invertible. The effect of a left multiplication of a matrix G(λ) by any of these matrices is clearly identical to the effect of left multiplication of a matrix by the usual elementary transformation matrices that affect the rows of a matrix. Similarly, if one multiplies a matrix G(λ) at

560

Linear Algebra Tools for Data Mining (Second Edition)

the right by T (i)↔(j) , T a(i) , and T (i)+p(λ)(j) , the effect is to exchange columns i and j, multiply the i th column by a, and add the j th column multiplied by a to the i th column, respectively. Definition 8.10. To λ-matrices G(λ), H(λ) are equivalent if one of them can be obtained from the other my multiplications by elementary transformation matrices. It is easy to verify that the relation introduced in Definition 8.10 is indeed an equivalence relation. We denote the equivalence of two λ-matrices G(λ) and H(λ) by A ∼λ B. Theorem 8.37. Let G(λ) ∈ C[λ]n×n be a λ-matrix having rank r. There exists an equivalent diagonal λ-matrix D(λ) = diag(tr−1 (λ), tr−2 (λ), . . . , t0 (λ)), where t0 (λ), . . . , tr−2 (λ), tr−1 (λ) are the invariant factors of G(λ). Furthermore, each invariant factor tr−j is a divisor of the invariant factor tr−j+1 for 1  j  r. Proof. Let B (0) (λ) ∈ C[λ]n×n be a λ-matrix that is equivalent to G(λ) such that the polynomial b11 (λ) has minimal degree among all polynomials bij (λ). We claim that (B (0) (λ)) can be chosen such that (B (0) (λ))11 divides all polynomials located on the first row. Indeed, suppose that this is not the case and (B (0) (λ))11 does not divide (B (0) (λ))1j . Since deg((B (0) (λ))1j )  deg((B (0) (λ))11 ), we can write deg((B (0) (λ))1j ) = deg((B (0) (λ))1j )Q(λ) + R(λ), where deg(R(λ)) < deg((B (0) (λ))1j ). Then by subtracting from the j th column the first column multiplied by Q(λ), we obtain a matrix equivalent to B (0) (λ) (and, therefore, with G(λ)) that has an element in the position (1, j) of degree smaller than the element in position (1, 1). This contradicts the definition of B (0) (λ). A similar argument shows that (B (0) (λ))11 divides all polynomials located on the first column.

Similarity and Spectra

561

Since the entries in the first row and the first column are divisible by (B (0) (λ))11 , starting from the matrix B (0) (λ) we construct, by applying elementary transformations, the matrix B (1) (λ) that has the form ⎞ ⎛ (1) 0 ··· 0 b11 (λ) ⎟ ⎜ (1) (1) ⎜ 0 b22 (λ) · · · b2n (λ)⎟ ⎟ ⎜ ⎜ . .. .. ⎟ ⎜ .. . ··· . ⎟ ⎠ ⎝ (1)

(1)

bn2 (λ) · · · bnn (λ)

0

and we have B (0) (λ) ∼λ B (1)(λ) . Note that the degrees of the components of B (1)(λ) outside the first row and the first columns are at least equal to the degree of (B (1) (λ))11 . Repeating this argument for the matrix B (1) (λ) and involving the second row and the second column of this matrix, we obtain the matrix ⎞ ⎛ (1) b11 (λ) 0 0 0 ⎟ ⎜ (2) ⎟ ⎜ 0 (λ) 0 0 b 22 ⎟ ⎜ , B (2) (λ) = ⎜ . .. .. .. ⎟ ⎟ ⎜ .. . . . ⎠ ⎝ 0

0

(1)

· · · bnn (λ)

which is equivalent to B (1) (λ) and therefore with G(λ). We can assume that the degrees of all polynomials of B (2) located below the first row and at the right of the first columns are at least equal (1) to deg(b11 (λ)). (1) Without loss of generality we may assume that b11 (λ) divides the (2) polynomial b22 (λ). Indeed, if this is not the case, we can write (2)

(1)

b22 (λ) = b11 (λ)u(λ) + v(λ), (1)

where deg(v(λ)) < deg(b11 (λ)). By multiplying the first column of B (2) (λ) by −u(λ) and adding it to the second column, we have the equivalent matrix

562

Linear Algebra Tools for Data Mining (Second Edition)

⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝

(1)

b11 (λ) −b11(1) (λ)u(λ) (2)

0 .. .

b22 (λ) .. .

0

0

0

0

0 .. .

0 .. . (1)

⎞ ⎟ ⎟ ⎟ ⎟. ⎟ ⎠

· · · bnn (λ)

By adding the first row to the second, we have another equivalent matrix ⎞ ⎛ (1) 0 b11 (λ) −b11(1) (λ)u(λ) 0 ⎟ ⎜ ⎜ 0 v(λ) 0 0 ⎟ ⎟ ⎜ ⎟. ⎜ . . . . .. .. .. ⎟ ⎜ .. ⎠ ⎝ 0

0

(1)

· · · bnn (λ)

Finally, by adding the first column multiplied by u(λ) to the second column, we obtain yet another equivalent matrix ⎞ ⎛ (1) 0 0 b11 (λ) 0 ⎟ ⎜ ⎟ ⎜ 0 v(λ) 0 0 ⎟ ⎜ . ⎜ . .. .. .. ⎟ ⎟ ⎜ .. . . . ⎠ ⎝ (1) 0 0 · · · bnn (λ) The existence of this matrix contradicts the assumption made about (1) (1) the matrix B (2) (λ) because deg(v(λ)) < deg(b11 (λ). Thus, b11 (λ) (2) divides the polynomial b22 (λ). Eventually, we are left with the matrix ⎛ (1) ⎞ b11 (λ) 0 0 ··· 0 ⎜ ⎟ (2) ⎜ 0 0 · · · 0⎟ b22 (λ) ⎜ ⎟ ⎜ . ⎟ .. .. ⎜ . ⎟ . . · · · 0⎟ ⎜ . D(λ) = B (r) (λ) = ⎜ ⎟ (r) ⎜ 0 ⎟ (λ) · · · 0 0 b rr ⎜ ⎟ ⎜ ⎟ .. .. ⎜ .. ⎟ ⎝ . . . · · · 0⎠ 0 0 0 ··· 0 which is equivalent to G(λ). The degrees of the polynomials that occur on the diagonal of this matrix are increasing and each poly(j) (j+1) nomial bjj (λ) divides its successor bj+1 j+1 (λ) for 1  j  r − 1.

Similarity and Spectra

563

Note that the divisibility properties obtained above imply that (1)

(2)

δr (λ) = b11 (λ)b22 (λ) · · · b(r) rr (λ) (2)

(r−1)

δr−1 (λ) = b11 (λ) · · · br−1 r−1 (λ) .. . (1)

δ1 (λ) = b11 (λ). Thus, the invariant factors of this matrix are t0 (λ) =

δr (λ) = b(r) rr (λ), δr−1 (λ)

.. . tj (λ) =

δr−j (λ) (r−j) = br−j r−j (λ) δr−j−1 (λ)

.. . tr−1 (λ) =

δ1 (λ) (1) = b11 (λ). δ0 (λ)



Theorem 8.38. Elementary transformations applied to λ-matrices preserve the invariant factors. Proof. It is clear that by multiplying a row (or a column) by a constant or by permuting two rows (or columns), the value of the polynomials δk (λ) is not affected. Suppose that we add the j th row multiplied by the polynomial p to the i th row. Minors of order k can be classified in the following categories: (i) minors that contain both the i th row and the j th row; (ii) minors that contain neither the i th row nor the j th row; (iii) minors that contain the j th row but do not contain the i th row, and (iv) minors that contain the i th row but do not contain the j th row. It is clear that minors in the first three categories are not affected by this elementary transformation. If M is a minor of order k of the matrix G(λ) that contains the i th row but does not contain the j th

564

Linear Algebra Tools for Data Mining (Second Edition)

row and we add the elements of the j th row, then M is replaced by the sum of two minors of order k. Thus, δk (λ) is not affected by this  elementary transformation. Theorem 8.39. If two λ-matrices have the same invariant factors, then they are equivalent. Proof. Let G(λ) and H(λ) be two matrices that have the same invariant factors t0 (λ), . . . , tr−1 (λ). Both matrices are equivalent to the matrix ⎞ ⎛ 0 0 ··· 0 t0 (λ) ⎜ 0 t1 (λ) 0 · · · 0⎟ ⎟ ⎜ ⎟ ⎜ .. .. ⎟ ⎜ .. ⎜ . . . · · · 0⎟ ⎟ ⎜ D(λ) = ⎜ 0 tr−1 (λ) · · · 0⎟ ⎟ ⎜ 0 ⎟ ⎜ . . . ⎟ ⎜ . .. .. · · · 0⎠ ⎝ . 0 0 0 ··· 0 that has their common invariant factors on its main diagonal and,  therefore, are equivalent. We saw that a λ-matrix G(λ) is invertible if and only if det(G(λ)) is a non-zero constant. Lemma 8.5. Every invertible λ-matrix G(λ) ∈ C[λ]n×n is a product of elementary transformation matrices. Proof. Since G(λ) is an invertible λ-matrix, det(A) is a non-zero constant, so δn (λ) is a non-zero constant c. Therefore, all polynomials δk (λ) are equal to c, so all invariant factors of G(λ) equal 1. Thus, G(λ) is equivalent to matrix In , which amounts to the having G(λ)  equal to a product of elementary transformation matrices. Theorem 8.40. The λ-matrices G(λ) and H(λ) are equivalent if and only if there exist two invertible λ-matrices P (λ) and Q(λ) such that G(λ) = P (λ)H(λ)Q(λ). Proof. If G(λ) and H(λ) are equivalent, then H(λ) can be obtained from G(λ) by applying elementary transformations (multiplying by elementary λ-matrices for row transformations and multiplying by

Similarity and Spectra

565

elementary λ-matrices for column transformations). Each of these matrices is invertible and this leads to the desired equality. Conversely, suppose that the equality G(λ) = P (λ)H(λ)Q(λ) holds for two invertible λ-matrices P (λ) and Q(λ). By Lemma 8.5, both P (λ) and Q(λ) are products of elementary transformation  matrices, so G(λ) ∼λ H(λ). Example 8.8. Let ⎞ λ 2λ 2λ ⎟ ⎜ A(λ) = ⎝λ2 + λ 3λ2 + λ 2λ2 + 2λ⎠ λ 2λ λ3 + λ ⎛

be a λ-matrix. This matrix is equivalent to the matrix ⎛ ⎞ λ 0 0 ⎜ ⎟ 0 ⎠. ⎝ 0 λ2 − λ 0 0 λ3 − λ The invariant factors of A are t0 (λ) = λ, t1 (λ) = λ2 − λ, and t2 (λ) = λ3 − λ, Let tr−1 (λ) be the invariant factor of A having the highest degree and let tr−1 (λ) = p1 (λ) · · · pk (λ) be the factorization of tr−1 as a product of irreducible polynomials over the field F that contains the elements of A. Definition 8.11. The elementary divisors of the λ-matrix G(λ) are the irreducible polynomials p1 (λ), . . . , pk (λ) that occur in the factorization of the invariant factor of the highest degree of G(λ). Since i  j implies that ti (λ) divides tj (λ) for 1  i, j  r − 1, it follows that the factorization of any of the polynomials ti (λ) may contain only these elementary divisors. Example 8.9. The elementary divisors of the matrix A(λ) considered in Example 8.8 are λ, λ + 1, and λ − 1.

566

8.7

Linear Algebra Tools for Data Mining (Second Edition)

The Jordan Canonical Form

We begin by introducing a type of matrix which is as defective as possible. Definition 8.12. Let λ be a complex associated with λ is the matrix ⎛ λ 1 0 ··· ⎜0 λ 1 · · · ⎜ ⎜. . . . Br (λ) = ⎜ ⎜ .. .. . . . . ⎜ ⎝0 0 0 · · · 0 0 0 ···

number. An r-Jordan block ⎞ 0 0⎟ ⎟ .. ⎟ r×r ⎟ .⎟ ∈ C . ⎟ 1⎠ λ

The matrix Br (λ) is diagonal if and only if r = 1; otherwise, that is, if r > 1, the matrix Br (λ) is not even diagonalizable. Also, a block Br (λ) can be written as Br (λ) = λIr + Er , where

Er =

0r−1 Ir−1 . 0 0r−1

The unique eigenvalue of the Jordan block Br (λ) is λ1 = λ and algm(Br , λ) = n. On the other hand, the invariant subspace corresponding to λ is the subspace generated by e1 , so geomm(Br , λ) = 1. Thus, every r-Jordan block (for r > 1) is a defective matrix. Definition 8.13. A Jordan segment associated with the number λ is a block-diagonal matrix Jr1 ,...,rk (λ) given by ⎞ ⎛ Br1 (λ) O ··· O ⎜ O O ⎟ Br2 (λ) · · · ⎟ ⎜ ⎜ Jr1 ,...,rk (λ) = ⎜ .. .. .. ⎟ ⎟, . ··· . ⎠ ⎝ . O O · · · Brk (λ) where Br1 (λ), . . . , Brk (λ) are Jordan blocks and r1  r2  · · ·  rk . The sequence (r1 , r2 , . . . , rk ) is the Segr`e sequence of the segment Jr1 ,...,rk (a).

Similarity and Spectra

567

Given a Segr`e sequence (r1 , r2 , . . . , rk ) and a number λ, the Jordan segment Jr1 ,...,rk (λ) is completely determined. A Jordan segment Jr1 ,...,rk (λ) is a diagonal matrix if and only if each of its Jordan blocks is unidimensional. Otherwise, a Jordan segment is not even diagonalizable. Indeed, if Jr1 ,...,rk (λ) = XDX −1 , where D is a diagonal matrix and X is an invertible matrix, then D = diag(λ, . . . , λ) = λI, so Jr1 ,...,rk (λ) = λI, which contradicts the fact that Jr1 ,...,rk (λ) contains a Jordan block of size larger than 1. By Theorem 7.15, the spectrum of a Jordan segment Jr1 ,...,rk (λ) ∈ Cn×n consists of a single eigenvalue λ of algebraic multiplicity n. The geometric multiplicity of λ is k and the eigenvectors that generate the invariant subspace of λ are e1 , er1 +1 , . . . , er1 +···+rk−1 +1 . Definition 8.14. A Jordan matrix R ∈ Cn×n is a block diagonal matrix, whose blocks are Jordan segments, R = (Jr1,1 ,...,r1,k1 (λ1 ), . . . , Jrp,1 ,...,rp,kp (λp )). We shall prove that every matrix A ∈ Cn×n is similar to a Jordan matrix R such that: (i) for each eigenvalue λ of A we have a Jordan segment Jr,1 ,...,r,k (λ ) in R; (ii) for the Jordan segment Jr,1 ,...,r,k (λ ) that corresponds to the eigenvalue λ , the number k of Jordan blocks equals geomm(A, λ ); (iii) the algebraic multiplicity of λ equals the size of the Jordan segment that corresponds to this eigenvalue, that is, we have algm(A, λ ) =

k 

r,i .

i=1

Next we give an algorithmic proof of the fact that for every square matrix A ∈ Cn×n there exists a similar Jordan matrix. This proof was obtained in [54]. Theorem 8.41. Let T ∈ Cn×n be an upper triangular matrix. There exists a nonsingular matrix X ∈ Cn×n such that X −1 T X = diag(K1 , . . . , Km ), where Ki = λi Ipi + Li , Li is strictly upper triangular and each λi is distinct, for 1  i  m.

568

Linear Algebra Tools for Data Mining (Second Edition)

Proof. The argument is by induction on n. The base case, n = 1, is immediate. Suppose that the theorem holds for matrices of size less than n and let T ∈ Cn×n be an upper triangular matrix. Without loss of generality we may assume that

T1 S T = , O T2 where T1 and T2 have no eigenvalues in common and T1 = λ1 I + E, where E is strictly upper triangular. Since







I −Y T1 −T1 Y + S + Y T2 I Y T1 S = , O I O T2 O I O T2 there exists a matrix Y such that





I Y I −Y T1 O T = O T2 O I O I if and only if S = T1 Y − Y T2 . By Theorem 8.25, this equation has a solution Y if and only if spec(T1 ) ∩ spec(T2 ) = ∅, which is the case by the assumption we made about T1 and T2 . By the induction hypothesis, T2 may be reduced to block diagonal form, that is, there is a matrix Z such that Z −1 T2 Z = diag(H1 , . . . , Hp ). Therefore, we have





O I O T1 I O T1 O . = O T2 O Z −1 T2 Z O Z O Z −1 The last matrix is clearly in the block diagonal form, which also shows that the matrix X is given by





I −Y I O I −Y Z X= = . O I O Z O Z  Lemma 8.6. Let E ∈ Ck×k be a matrix of the form   0k−1 Ik−1 E= . 0 0k−1 We have 



EE= Eei+1 = ei and (Ik −

E  E)x

0 0k−1

0k−1 , Ik−1

= (e1 x)e1 for x ∈ Ck .

Similarity and Spectra

569

Proof. The proofs of the equalities of the lemma are straightforward.  Theorem 8.42. Let W ∈ Cn×n be a strictly upper triangular matrix. Then there is a nonsingular matrix X such that X −1 W X = G, where G = diag(E1 , . . . , Em ) with each Ej given by

0 Ik j Ej = 0 0 such that kj+1  kj for 1  j  m − 1. Proof. The proof is by induction on n  1. The statement clearly holds in the base case, n = 1. Assume that the result holds for strictly upper triangular matrices of format (n − 1) × (n − 1) and let W ∈ Cn×n be an upper triangular matrix. We can write

0 u , W = 0 V where V ∈ C(n−1)×(n−1) is an upper triangular matrix. By the inductive hypothesis, there exists a nonsingular matrix Y such that

E1 O −1 , Y VY = O H where H = diag(E2 , . . . , Em ) and the order of E1 is at least equal to the size of any of the matrices Ei , where 2  i  m. Then



0 u Y 1 0 1 0 . W = 0 Y 0 Y −1 V Y 0 Y −1 The matrix



0 u Y 0 Y −1 V Y

can be written as

u Y

0 0 Y −1 V Y







⎞ 0 u1 u2 ⎜ ⎟ = ⎝0 E1 O ⎠ 0 O H

570

Linear Algebra Tools for Data Mining (Second Edition)

by partitioning the vector u Y as u Y = (u1 u2 ). This allows us to write ⎞⎛ ⎞⎛ ⎞ ⎛ 1 u1 E1 0 0 u1 u2 1 −u1 E1 0 ⎟⎜ ⎟⎜ ⎟ ⎜ I 0⎠ ⎝0 E1 O ⎠ ⎝0 I 0⎠ ⎝0 0 O H I 0 0 0 0 I ⎞ ⎛ 0 u1 (I − E1 E1 ) u2 ⎟ ⎜ E1 O ⎠. = ⎝0 0 O H By Lemma 8.6, we have u1 (I − E1 E1 ) = ((I − E1 E1 )u1 ) = (e1 u1 )e1 , so

⎞ ⎛ ⎞ ⎛ 0 (e1 u1 )e1 u2 0 u1 (I − E1 E1 ) u2 ⎟ ⎜ ⎟ ⎜ E1 O ⎠ = ⎝0 E1 O ⎠. ⎝0 0 O H 0 O H

We need to consider two cases depending on whether the number e1 u1 is 0 or not. Case 1: Suppose that e1 u1 = 0. We have ⎞ ⎛ 0 (e1 u1 )e1 u2

E e1 u2 ⎟ ⎜ 0 E O ∼ , ⎠ ⎝ 1 O H 0 O H where

E=

0 e1 . 0 E1

This follows from the equality ⎞ ⎞⎛ ⎞⎛  ⎛ 1 0 0 (e1 u1 )e1 u2 e1 u1 0 0 e1 u1 0 ⎟ ⎜ ⎟⎜ ⎜ 0 I 0 ⎟ E1 O⎠⎝ 0 I 0 ⎠ ⎠ ⎝0 ⎝ 0 0 e 1u1 I 0 O H 0 0 (e1 u1 )I 1 ⎞ ⎛ 0 e1 u2 ⎟ ⎜ = ⎝0 E1 0 ⎠ 0 0 H

Similarity and Spectra

571

because e1 u2

=

u2 . O

Note that the order k of E is strictly greater than the order of and diagonal block of H, so H k−1 = O. Define si = u2 H i−1 for 1  i  k. Then  

 E ei si I −ei+1 si I ei+1 si E ei+1 si+1 = O H 0 I 0 I 0 H for 1  i  k − 1. We have sk = 0 because H k−1 = O, and it follows that W is similar to the matrix

E O . O H Case 2: If e1 u1 = 0, by permuting the rows and columns, the matrix W is similar to the matrix ⎞ ⎛ E1 0 0 ⎟ ⎜ ⎝ 0 0 u2 ⎠. 0 0 H Then, by the inductive hypothesis, there is a nonsingular matrix Z such that

0 u2 Z = L, Z −1 0 H where L has the desired block diagonal form. Thus, W is similar to the matrix

E1 0 . 0 L By applying a permutation of the blocks, we obtain a matrix in the  proper form, which completes the proof.

572

Linear Algebra Tools for Data Mining (Second Edition)

Theorem 8.43. Let A ∈ Cn×n be a matrix such that spec(A) = {λ1 , . . . , λp }. The matrix A is similar to a Jordan matrix R = (Jr1,1 ,...,r1,k1 (λ1 ), . . . , Jrp,1 ,...,rp,kp (λp )), where algm(A, λi ) =

k i

h=1 ri,h

and geomm(A, λi ) = ki for 1  i  p.

Proof. By Schur’s Triangularization Theorem (Theorem 8.8), there exists a unitary matrix U ∈ Cn×n and an upper triangular matrix T ∈ Cn×n such that A = U H T U and the diagonal elements of T are the eigenvalues of A; furthermore, each eigenvalue λ occurs in the sequence of diagonal values a number of algm(A, λ) times. By Theorem 8.41, the upper triangular matrix T is similar to a block upper triangular matrix diag(K1 , . . . , Km ), where Ki = λi Ipi + Li , Li is strictly upper triangular, and each λi is distinct for 1  i  m. Theorem 8.42 implies that each triangular block is similar to a matrix of the required form because for each diagonal block Ki there is a nonsingular Xi such that Xi−1 (λi I+)Xi = λi I + diag(E1 , . . . , Emi ).  Theorem 8.44. For an eigenvalue λ of a square matrix A ∈ Cn×n , we have algm(A, λ) = 1 if and only if geomm(A, λ) = geomm(AH , λ) = 1, and for any eigenvector u that corresponds to λ in A and any eigenvector v that corresponds to the same eigenvalue in AH , we have v H u = 0. Proof. Suppose that λ is an eigenvalue of A such that geomm(A, λ) = geomm(AH , λ) = 1 and for any eigenvector u that corresponds to λ in A and any eigenvector v that corresponds to the same eigenvalue in AH we have v H u = 0. Let B ∈ Cn×n be a matrix that is similar to A. There exists an invertible matrix X such that B = XAX −1 . Then, A − λI ∼ B − λI, so dim(SA,λ ) = dim(SB,λ ), which allows us to conclude that λ has geometric multiplicity 1 in both A and B. Note that

Similarity and Spectra

573

BXu = XAu = λXu, so Xu is an eigenvector of B. Similarly, B H (X H )−1 v = (X H )−1 AH X H (X H )−1 v = (X H )−1 AH v = λ(X H )−1 v, so (X H )−1 v is an eigenvector for B H that corresponds to the same eigenvalue λ. Furthermore, we have ((X H )−1 v)H Xu = v H X −1 Xu = v H u = 0. Thus, if A is a matrix that satisfies the conditions of the theorem, then any matrix similar to A satisfies the same conditions. If λ is a simple eigenvalue of A, then λ is also a simple eigenvalue of the Jordan normal form of A. Therefore, a Jordan segment that corresponds to an eigenvalue λ of geometric multiplicity 1 consists of a single Jordan block B1 (λ) of order 1 that has (1) as an eigenvector. The transposed segment has also (1) as an eigenvector. The eigenvector u of A that corresponds to λ can be obtained from (1) by adding zeros corresponding to the remaining components and the eigenvector v can be obtained from (1) in a similar manner. Since (1)H (1) = 0, we have vH u = 0. Conversely, if A and λ satisfy the conditions of the theorem, then the same conditions are satisfied by the Jordan normal form C of A. Therefore, the Jordan segment that corresponds to λ in C consists of a single block Bm (λ). Suppose that m > 1. Then e1 ∈ Cm is an eigenvector of Bm (λ), em ∈ Cm is an eigenvector of (Bm (λ))H , and eHm e1 = 0, which would imply v H u = 0. Therefore, m = 1 and λ is a  simple eigenvalue of A. 8.8

Matrix Norms and Eigenvalues

There exists a simple relationship between the spectral radius and any matrix norm |||·|||. Namely, if x is an eigenvector that corresponds to λ ∈ spec(A), then |||Ax||| = |λ||||x|||  |||A||||||x|||, which implies λ  |||A|||. Therefore, ρ(A)  |||A|||.

(8.5)

It is easy to see that if A ∈ Cn×n and a ∈ R>0 , then ρ(aA) = aρ(A).

574

Linear Algebra Tools for Data Mining (Second Edition)

In Section 6.7, we have seen that for A ∈ Cp×p , |||A||| < 1 implies limn→∞ An = Op,p . Using the spectral radius we can prove a stronger result: Theorem 8.45 (Oldenburger’s theorem). If A ∈ Cp×p , then limn→∞ An = Op,p if and only if ρ(A) < 1. Proof. Let B be a matrix in Jordan normal form that is similar to A. There exists a non-singular matrix U such that U AU −1 = B, and, therefore, U An U −1 = B n . Therefore, if limn→∞ An = Op,p we have limn→∞ B n = Op,p , so limn→∞ λn = O for any λ ∈ spec(A) = spec(B). This is possible only if |λ| < 1 for λ ∈ spec(A), so ρ(A) < 1. Conversely, if ρ(A) < 1, then |λ| < 1 for λ ∈ spec(A), so for any block Br (λ) of B we have limn→∞ (Br (λ))n = Or,r by Exercise 8.13. Therefore, limn→∞ B n = Op,p , so limn→∞ An = Op,p .  Theorem 8.46. If A, B ∈ Rn×n and abs(A)  B, then ρ(A)  ρ(B). Proof. Suppose that abs(A)  B and ρ(B) < ρ(A). There exists α ∈ R such that 0  ρ(B) < α < ρ(A). If C = α1 A and D = α1 B, then ρ(C) = α1 ρ(A) > 1 and ρ(D) = α1 ρ(B) < 1. Thus, by Theorem 8.45, limk→∞ D k = O. Since abs(A)  B, we have abs(C) = α1 abs(A)  α1 B = D. Therefore, abs(C k )  (abs(C))k < D k , which implies limk→∞ C k = 0. This contradicts the fact that ρ(C) > 1.  Corollary 8.22. If A ∈ Rn×n , then ρ(A)  ρ(abs(A)). Proof. abs(A).

This follows immediately from Theorem 8.46 by taking B = 

Theorem 8.47. Let A ∈ Rn×n be a matrix and x ∈ Rn be a vector such that A  On,n and x  0n . If Ax > ax, it follows that a < ρ(A). Proof. Since ρ(A)  0 we assume that a > 0. Also, since Ax > ax, we have x = 0. The strict inequality Ax > ax implies that Ax  1 A, we have Bx  x, so (a + )x for some  > 0. Therefore, if B = a+ x  Bx  · · ·  B k x. Thus, ρ(B)  1, so a < a +  < ρ(A).



Similarity and Spectra

575

Inequality (8.5) applied to matrix Ak implies ρ(Ak )  |||Ak |||. By Theorem 7.6, we have (ρ(A))k  |||Ak ||| for every k ∈ N. Actually, the following statement holds. Theorem 8.48. For every A ∈ Cp×p , we have  1 k lim |||Ak |||2 = ρ(A). k→∞

 1 Proof. We need to prove only that limk→∞ |||Ak |||2 k  ρ(A), because the reverse inequality was already proven. Let T be an upper triangular matrix and let D = diag(d1 , . . . , dp ) be a diagonal matrix, where di = 0 for 1  i  p. Note that

1 1 −1 , ,..., D = diag d1 dp d

hence (D −1 T D)ij = tij dji for 1  i, j  p. If di = δi for 1  i  p, where δ > 0, denote the upper triangular matrix D −1 T D by Tδ . It follows that ⎧ ⎪ t if i = j, ⎪ ⎨ ii (Tδ )ij = tij δj−i if j > i, ⎪ ⎪ ⎩0 if j < i. The matrix Tδ can now be written as Tδ = E + S, where E is a diagonal matrix such that E = diag(t11 , . . . , tnn ) and 0 if j  i, sij = tij δj−i if j > i. If δ < 1, there exists a positive number c such that |||S|||2  cδ. Therefore, |||Tδ |||2  |||E|||2 + |||S|||2 = max{|tii | | 1  i  n} + |||S|||2  max{|tii | | 1  i  n} + cδ (by Supplement 45). By Schur’s Triangularization Theorem (Theorem 8.8), A can be factored as A = U H T U , where U is a unitary matrix and T is an

Linear Algebra Tools for Data Mining (Second Edition)

576

upper triangular matrix T ∈ Cp×p such that elements of T are the eigenvalues of A and each such eigenvalue occurs in the sequence of diagonal values a number of algm(A, λ) times. If we construct the matrix Tδ as above, starting from the upper triangular matrix T that results from Schur’s decomposition, we have |||Tδ |||2  ρ(A) + cδ. Since T = DTδ D −1 , it follows that A = U H DTδ D −1 U , so A = W Tδ W −1 , where W = U H D. This implies Ak = W Tδk (W −1 )k , so |||Ak |||2  |||Tδk |||2 |||W |||2 |||W −1 |||2  (|||Tδ |||2 )k |||W |||2 |||W −1 |||2 . Consequently, 

k

|||A |||2

1

k

1

1

 |||Tδ |||2 (|||W |||2 |||W −1 |||2 ) k  (ρ(A) + cδ)(|||W |||2 |||W −1 |||2 ) k .

 1 Thus, limk→∞ |||Ak |||2 k  ρ(A) + cδ, and since this equality holds  1 for any δ > 0, it follows that limk→∞ |||Ak |||2 k  ρ(A).  Another interesting connection exists between the matrix norm |||A|||2 and the spectral radius of the matrix A A. Theorem 8.49. Let A ∈ Cm×n . We have √ |||A|||2 = max{ λ | λ ∈ spec(AH A)}. Proof. Observe that if λ ∈ spec(AH A), then AH Ax = λx for x = 0, and therefore, xH AH Ax = λxH x, or (Ax)H (Ax) = λxH x. The last equality amounts to Ax22 = λx22 , which allows us to conclude that all eigenvalues of AH A are real and non-negative. By the definition of |||A|||2 , we have |||A|||2 = max{Ax2 | x2 = 1}. This implies that |||A|||22 is the maximum of Ax22 under the restriction x22 = 1. We apply Lagrange’s multiplier method to determine conditions that are necessary for the existence of the maximum. Let B = (bij ) = AH A. Then, consider the function  n  n  n   xi bij xj − x2i − 1 . g(x1 , . . . , xn ) = i=1 j=1

i=1

Similarity and Spectra

For the maximum, we need to have n 

∂g ∂xi

577

= 0, which implies

bij xj − xi = 0,

j=1

for 1  i  n. This is equivalent to Bx = x, which means that is an eigenvalue of B = AH A. The value of |||Ax|||22 is xH AH Ax = xH x = x22 = . Thus, the maximum of |||Ax|||22 for x2 = 1 is the largest eigenvalue of AH A, which completes the argument.  Theorem 8.49 states that |||A|||2 =

ρ(AH A),

(8.6)

which explains why |||A|||2 is also known as the spectral norm of the matrix A. It is interesting to recall that we also have

AF = trace(AH A), as we saw in Equality (6.12). Corollary 8.23. For A ∈ Rm×n , we have |||A|||22  |||A|||1 |||A|||∞ . Proof. In Theorem 8.49, we have shown that |||A|||22 is an eigenvalue of AH A. Therefore, there exists x = 0 such that AH Ax = |||A|||22 x, which implies |||AH Ax|||1 = |||A|||22 |||x|||1 . Note that |||AH Ax|||1  |||AH |||1 |||A|||1 |||x|||1 , which implies |||A|||22 |||x|||1  |||AH |||1 |||A|||1 |||x|||1 . Thus, we have |||A|||22  |||AH |||1 |||A|||1 . The desired inequality follows by  observing that |||AH |||1 = |||A|||∞ . ∞ Theorem 8.50 (Weyr’s theorem). Let f (z) = m=0 cm z m be a n×n power series having the convergence radius r. If A ∈ C ∞ is such that |λ| < r for every λ ∈ spec(A), then the power series m=0 cm Am converges absolutely. If there exists λ ∈ spec(A) such that |λ| > r,  m then the series ∞ m=0 cm A is divergent. Proof. Suppose that spec(A) = {λ1 , . . . , λn }. By Schur’s Triangularization Theorem (Theorem 8.8), there exists a unitary matrix

578

Linear Algebra Tools for Data Mining (Second Edition)

U ∈ Cn×n and an upper triangular matrix T ∈ Cn×n such that A = U H T U and the diagonal elements of T are theeigenvalues of A. p m are elements of the matrix m=0 cm T pThe diagonal m there exists an eigenvalue λi such that |λi | > r, m=0 cm λi . If ∞ m diverges, which implies that the series then the series m=0 cm T ∞ m m=0 cm A diverges. Suppose now that |λi | < r for 1  i  n. Let ρ1 , . . . , ρn , ρ be n + 1 distinct numbers such that |λi |  ρi  ρ < r and let S an upper triangular matrix such that abs(T )  S such that S has ρ1 , . . . , ρn as its diagonal elements. Let U be a matrix such that U −1 SU = diag(ρ1 , . . . , ρn ). Then abs(T m )  U diag(ρ1 , . . . , ρn )m U −1 , so abs(cm T m )  |cm |U diag(ρ1 , . . . , ρn )m U −1 for m  0. Consequently, cm T m ∞  |cm |U diag(ρ1 , . . . , ρn )m U −1 ∞  |cm |n2 U ∞ diag(ρ1 , . . . , ρn )m ∞ U −1 ∞  C|cm |ρm , where  C is a constant that does not dependon m. Therefore, the m m converges series ∞ series ∞ m=0 cm T  converges, so the  m=0 cm T ∞ m absolutely. This implies that the series m=0 cm A converges abso lutely. Theorem 8.51. Let f (x) be a rational function, f (z) = p(z)/q(z), where p and q belong to C[z], p(z)= p0 + p1 z + · · · + pk z k , q(z) = j q0 + q1 z + · · · + qh z h , and f (z) = ∞ j=0 cj z . ∞ j If the series k=0 cj z converges to z, such that |z| < r and n×n is a matrix such that ⊆ {z ∈ C | |z| < r}, then A∈C  spec(A) j. c A f (A) = (q(A))−1 p(A) equals ∞ j=0 j Proof. By hypothesis, if |z| < r, then q(z) = 0. If spec(A) = {λ1 , . . . , λn }, then det(q(A)) = q(λ1 ) · · · q(λn ) = 0, which means that q(A) is an invertible matrix. This shows that f (A) = (q(A))−1 p(A), or q(A)f (A) = p(A). The definition of f implies h

(q0 + q1 z + · · · + qh z )

∞  k=0

cj z j = p0 + p1 z + · · · + pk z k .

Similarity and Spectra

579

If  k, we have q0 c + q1 c−1 + qh c−h = p ; otherwise, if k < , we have q0 c + q1 c−1 + qh c−h = 0. This implies q(A)

∞ 

cj Aj

k=0

= (q0 In + q1 A + · · · + qh Ah )f (A) =

∞ 

j

q0 cj A +

j=0

=

∞  j=0

=

∞ 

j+1

q1 cj A

+

j=0

q0 cj Aj +

∞ 

∞ 

qh cj Aj+h

j=0

q1 cj−1 Aj +

j=0

∞ 

qh cj−h Aj ,

k=0

∞  (q0 cj + q1 cj−1 + · · · + qh cj−h )Aj j=0

=

k 

pj Aj = f (A),

j=0

where cl = 0 if l < 0. Consequently, f (A).

∞

j j=0 cj A

= q(A)−1 p(A) = 

Definition 8.15. Let A ∈ Cn×n be a matrix such that |λ| < r for every λ ∈ spec(A), where r is the convergence radius ofthe series ∞ j j f (z) = ∞ j=0 cj z . The matrix f (A) is defined as f (A) = j=0 cj A . Theorem 8.51 shows that if f : C −→ C is a rational func(A) ∈ Cn×n can tion, f (z) = p(z) q(z) , the definition of the matrix f  ∞ j be given either as q(A)−1 p(A) or as the sum j=0 cj A , where ∞ f (A) = j=0 cj Aj .  zj Example 8.10. The series ∞ j=0 j! is convergent in C and its sum is ez . Thus, following Definition 8.15, eA is defined as eA =

∞  1 j A. j! j=0

580

Linear Algebra Tools for Data Mining (Second Edition)

Similarly, considering the series 1 1 1 1 sin z = z − z 3 + z 5 − · · · , cos z = 1 − z 2 + z 4 − · · · , 3! 5! 2! 4! 1 5 1 2 1 1 3 sinh z = z + z + z + · · · , cosh z = 1 + z + z 4 + · · · , 3! 5! 2! 4! which are convergent everywhere, we can define 1 1 1 1 sin A = A − A3 + A5 − · · · , cos A = 1 − A2 + A4 − · · · , 3! 5! 2! 4! 1 5 1 2 1 1 3 sinh A = A + A + A + · · · , cosh A = 1 + A + A4 + · · · . 3! 5! 2! 4! Therefore, we obtain the equalities eiA = cos A + i sin A and eA = cosh A + sinh A, and 1 1 sin A = (eiA − e−iA ), cos A = (eiA + e−iA ), 2 2 1 1 A sinh A = (e − e−A ), cosh A = (eA + e−A ), 2 2 n×n . for every A ∈ C Example 8.11. Let A ∈ C2×2 be a matrix such that spec(A) = 1 {− π4 , π4 }. Its characteristic polynomial is pA (λ) = λ2 − 16 . By the 1 2 Cayley–Hamilton Theorem, we have A − 16 I2 = O3,3 . Thus, ∞ 

1 A2n+1 (2n + 1)! n=0

2 n ∞  π 1 n =A (−1) (2n + 1)! 24 n=0

sin A =

=A

(−1)n

∞ 

(−1)n

n=0

π 2n . + 1)!

24n (2n

Theorem 8.52. Let ∞ ∞ ∞    k k ak z , g(z) = bk z , h(z) = ck z k f (z) = k=0

k=0

k=0

be the functions defined by the given series that are convergent for |z| < r and let A ∈ Cn×n be a matrix such that spec(A) ⊆ {z ∈ C |

Similarity and Spectra

581

|z| < r}. If h(z) = f (z)g(z) for |z| < r, then h(A) = f (A)g(A); if h(z) = f (z) + g(z) for |z| < r, then h(A) = f (A) + g(A). Proof. We discuss only the case when h(z) = f (z)g(z). The equality h(z) = f (z)g(z) implies ck = a0 bk + a1 bk−1 + · · · + ak−1 b1 + ak bk0 for k ∈ N. The series with real non-negative coefficients, ∞ k=0 dk z , k k ∈ N is convergent for |z| < r, where dk = j=0 |aj | |bk−j | for ∞ k which implies that the series k=0 dk A is absolutely convergent, by Weyr’s Theorem (Theorem 8.50). Lemma 6.7, this  In turn, by k . Thus, the series |d |A is equivalent to the convergence of ∞ k k=0 ∞ k k k=0 j=0 aj bk−j A is absolutely convergent and its terms can be permuted. Therefore, ⎛ ⎞ ∞ ∞ k    ⎝ ck Ak = aj bk−j ⎠ Ak h(A) = k=0

=

∞  j=0

=

∞  j=0

j=0

k=0

aj

∞ 

bk−j A =

k=j

aj Aj

k

∞  =0

∞  j=0

aj

∞ 

b Aj+

=0

b A = f (A)g(A). 

Some properties of scalar functions are transmitted to the corresponding matrix functions; others fail to do so, as we show in the following. Example 8.12. Since ez e−z = 1, we have eA e−A = In for A ∈ Cn×n , by Theorem 8.52. However, despite the fact that ez ew = ez+w , the corresponding equality does not hold. For example, let



1 1 0 1 A= and B = , 0 1 1 0 where a = 0. By the Cayley–Hamiltion Theorem, we have A2 = 2A − I2 and B 2 = I2 . It is easy to see that An = nA − (n − 1)I2 and B n = I2 if n is even and B n = B if n is odd. Thus, the exponentials of these matrices are ∞  1 (nA − (n − 1)I2 ) e = n! n=0 A

Linear Algebra Tools for Data Mining (Second Edition)

582

∞ ∞ ∞    1 1 1 nA − nI2 + I2 = n! n! n! n=0

n=0

n=0



= (1 + e)A − (1 + e)I2 + I2 = (1 + e)A − eI2 =

1 1+e , 0 1

and 1 1 1 B + I2 + B + · · · 1! 2! 3!

1 e2 + 1 e2 − 1 = I2 cosh 1 + B sinh 1 = 2 . 2e e2 − 1 e2 + 1

eB = 1 +

Thus, eA eB = ((1 + e)A − eI2 )(I2 cosh 1 + B sinh 1)

1 e2 + 2e2 − e e3 + 2e2 + e . = 2e e2 − 1 e2 + 1 The sum of the matrices is

C =A+B =

1 2 , 1 1

and we have C 2 = 2C + I2 . We leave it to the reader to verify that eC = eA eB . 8.9

Matrix Pencils and Generalized Eigenvalues

Definition 8.16. Let A, B ∈ Cn×n . The matrix pencil determined by A and B is the one-parameter set of matrices Pen(A, B) = {A − tB | t ∈ C}. The set of generalized eigenvalues of Pen(A, B) is spec(A, B) = {λ ∈ C | det(A − λB) = 0}.

Similarity and Spectra

583

If λ ∈ spec(A, B) and Ax = λBx, then x is an eigenvector of Pen(A, B). Note that in this case x ∈ null(A − λB). When B = In , then an (A, In )-generalized eigenvalue is an eigenvalue of A and any eigenvector of the pair (A, In ) is just an eigenvector of A. Theorem 8.53. Let A and B be two matrices in Cn×n . If rank(B) = n, then spec(A, B) consists of n eigenvalues (taking into account their multiplicities). Proof. Suppose that rank(B) = n, that is, B is an invertible matrix. Then, λ ∈ spec(A, B) is equivalent with the existence of an eigenvector x = 0n such that Ax = λBx, so B −1 Ax = λx. Since rank(B) = n, by Corollary 3.4, rank(B −1 A) = n and spec(A, B) con sists of n eigenvalues. Corollary 8.24. If B is a nonsingular matrix, then an (A, B)generalized eigenvalue is simply an eigenvalue of the matrix B −1 A. Proof. The corollary follows immediately from the proof of  Theorem 8.53. When null(A)∩null(B) = {0n }, the existence of (A, B)-generalized eigenvalues becomes trivial. If z ∈ (null(A) ∩ null(B)) − {0}, then we have both Az = 0 and Bz = 0. Thus, (A−tB)z = 0n for every t ∈ C, so any complex number is an (A, B)-generalized eigenvalue and any non-zero vector z ∈ (null(A) ∩ null(B)) − {0} is an eigenvector of the pair (A, B). On the other hand, if



a 0 0 c A= and B = , 0 b 0 0 where ab = 0, the pencil Pen(A, B) has no generalized eigenvalues because det(A − λB) = ab = 0. Definition 8.17. Let A, B ∈ Cn be two matrices. We refer to Pen(A, B) as a regular pencil when det(A − λB) is not identically zero. A regular pencil (A, B) can have only a finite number of eigenvalues because the polynomial det(A − λB) is of degree not larger than n.

584

Linear Algebra Tools for Data Mining (Second Edition)

We saw in Theorem 6.54 that if B is a Hermitian and positive definite matrix, then the mapping fB : Cn × Cn −→ R given by fB (x, y) = xH By for x, y ∈ Cn defines an inner product on Cn . If Pen(A, B) is a regular matrix pencil and B is Hermitian, the matrix D = B −1 A is self-adjoint with respect to this inner product. Indeed, we have (B −1 Ax, y) = (B −1 Ax)H By = xH AH (B H )−1 By = xH Ax, (x, (B −1 A)y) = xH BB −1 Ay = xH Ax, which justifies our claim. The self-adjoint matrix D = B −1 A has real eigenvalues {λ1 , . . . , λn } and a linearly independent orthonormal family of eigenvectors {z 1 , . . . , z n } (by the inner product fB ). Thus, we have B −1 Az i = λi z i and 1 if i = j, H fB (z i , z j ) = (z i ) Bz j = 0 if i = j, for 1  i  j (see Exercise 22 of Chapter 7). Definition 8.18. Let Pen(A, B) be a regular matrix pencil, where A, B ∈ Cn×n . The eigenvalues of this pencil are the eigenvalues of the matrix D = B −1 A. The vectors z i are referred to as the eigenvectors of the pencil. For the eigenvectors of a regular pencil Pen(A, B), we have Az i = λi Bzi for 1  i  n. The matrix Z = (z 1 · · · z n ) ∈ Cn×n is nonsingular because the set of vectors is linearly independent; we refer to Z as a principal matrix of Pen(A, B). Using this matrix, we have Z H BZ = In and Z H AZ = Z H diag(λ1 , . . . , λn )BZ = diag(λ1 , . . . , λn ). For w ∈ Cn defined by w = Z −1 x, we have xH Ax = wH Z H AZw == wH diag(λ1 , . . . , λn )w =

n 

λi |wi |2

(8.7)

i=1

and x Bx = w Z AZw = w w = H

H

H

H

n  i=1

|wi |2 .

(8.8)

Similarity and Spectra

585

Let λ1  λ2  · · ·  λn be the eigenvalues of a regular pencil Pen(A, B), where A, B ∈ Rn×n . Consider the generalized Rayleigh– Ritz quotient ralA,B : Rn −→ R defined by x Ax x Bx for x ∈ Rn such that x Bx = 0. If Z is a principal matrix of Pen(A, B), by Equalities (8.7) and (8.7) we can write n λi wi2 ralA,B (x) = i=1 n 2 . i=1 wi ralA,B (x) =

It is clear that λ1  ralA,B (x)  λn . It is easy to see that we have ralA,B (x) = λ1 if and only if w2 = · · · = wn = 0, which means that x = Ze1 = z 1 . Similarly, ralA,B (x) = λn if and only if x = z n . In other words, the minimum of ralA,B (x) is achieved when x is an eigenvector that corresponds to the least eigenvalue of the pencil, while the maximum is achieved when x is an eigenvector that corresponds to the largest eigenvalue. Similar results hold for the second smallest or the second largest eigenvalues of a regular pencil. Note that x = ni=1 wi z i . Thus, if x is orthogonal on z 1 (in the sense of the inner product fB ), we have x Bz1 = w1 = 0, so n λi wi2 ralA,B (x) = i=2 n 2 i=2 wi and min{ralA,B (x) | xBz 1 = 0} = λ2 , max{ralA,B (x) | xBz n = 0} = λn−1 . It is not difficult to show that, in general,    x Ax   λp = min x Bz i = 0 for 1  i  p − 1 x Bx and

 λn−p = max

 x Ax   x Bz i = 0 for n − p + 1  i  n . x Bx

Next, we present an intrinsic characterization of the eigenvalues of a regular pencil, that is, a characterization that avoids the use of eigenvectors.

586

Linear Algebra Tools for Data Mining (Second Edition)

Let C ∈ Rk×n be a matrix. Let us examine the variation of the generalized Rayleigh–Ritz quotient ralA,B : Rn −→ R when x Bx = 0 and Cx = 0k . In other words, we apply k linear constraints of the form r i x = 0 to x, where r 1 , . . . , r k are the rows of the matrix C. Choose k = n − 1 and let C(p) be ⎞ ⎛ r1 ⎜ . ⎟ ⎜ .. ⎟ ⎟ ⎜ ⎟ ⎜ ⎜ r p−1 ⎟ ⎟. ⎜ C(p) = ⎜  ⎟ ⎜z p+1 B ⎟ ⎜ . ⎟ ⎜ . ⎟ ⎝ . ⎠ z n B It is clear that there exists x such that C(p) x = 0n−1 because the system C(p) x = 0n−1 has n − 1 equations. Note that any solution x satisfies z k Bx = 0, which shows that x is orthogonal on the last n − p eigenvectors of Pen(A, B). Therefore, for the vectors x that satisfy these restrictions, we have min

x Ax  λp . x Bx

Consequently, λp is the maximum of the minimum of the ratio when x satisfies p − 1 arbitrary linear restrictions. 8.10

x Ax x Bx

Quadratic Forms and Quadrics

Let g : V × W −→ R be a bilinear form, where V and W are finitedimensional real linear spaces, dim(V ) = n, and dim(W ) = m. If e1 , . . . , en and f 1 , . . . , f m are two bases in V and W , respectively, then for x ∈ V and y ∈ W we can write x = x1 e1 + · · · + xn en and y = y1 f 1 + · · · + ym f m . By the definition of bilinear forms, we have g(x, y) =

n  m  i=1 j=1

xi yj g(ei , f j ) = x Ay,

(8.9)


where A ∈ R^{n×m} is the matrix A = (g(e_i, f_j)). This matrix will be referred to as the matrix of the bilinear form. The bilinear form g is completely determined by the matrix A introduced above.

Let V be a real linear space and let f : V × V → R be a bilinear form. If f(x, y) = f(y, x) for every x, y ∈ V, we say that f is a symmetric bilinear form.

Theorem 8.54. A bilinear form f : R^n × R^n → R is symmetric if and only if its matrix is symmetric.

Proof. Let A be the matrix of f. Since f is symmetric, we have x^T A y = y^T A x for every x, y ∈ R^n. For x = e_i and y = e_j, this implies a_ij = a_ji for every 1 ≤ i, j ≤ n. Thus, A = A^T.
Conversely, if A = A^T, we have (x^T A y)^T = y^T A^T x = y^T A x, so (f(x, y))^T = f(y, x). Since f(x, y) ∈ R, the symmetry of f follows. □

Let f : V × V → R be a symmetric bilinear form. Define the function φ : V → R by φ(x) = f(x, x) for x ∈ V. We have

    φ(u + v) = f(u + v, u + v) = f(u, u) + 2f(u, v) + f(v, v),    (8.10)
    φ(u − v) = f(u − v, u − v) = f(u, u) − 2f(u, v) + f(v, v),    (8.11)

because f is bilinear and symmetric. Therefore, we obtain the equality

    φ(u + v) + φ(u − v) = 2(φ(u) + φ(v)),    (8.12)

which is a generalization of the parallelogram equality from Theorem 6.32.

Definition 8.19. A quadratic form defined on a real linear space V is a function φ : V → R for which there exists a bilinear form f : V × V → R such that φ(x) = f(x, x) for x ∈ V.

Equality (8.10) implies f(u, v) = (1/2)(φ(u + v) − φ(u) − φ(v)), which shows that the bilinear form f is completely determined by its corresponding quadratic form.
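This recovery of f from φ can be verified numerically; in the sketch below the symmetric matrix C is an arbitrary choice standing in for the matrix of f.

% Polarization check: recover f(u,v) from phi(x) = f(x,x)
C = [4 -1 2; -1 5 0; 2 0 3];               % illustrative symmetric matrix of f
phi = @(x) x'*C*x;                         % associated quadratic form
u = randn(3, 1);  v = randn(3, 1);
f_direct    = u'*C*v;                      % value of the bilinear form
f_recovered = (phi(u + v) - phi(u) - phi(v))/2;
disp(abs(f_direct - f_recovered))          % of the order of round-off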


Unlike bilinear forms, a quadratic form may be defined by several matrices.

Theorem 8.55. Let φ be a quadratic form. We have φ(x) = x^T A x = x^T B x for every x ∈ R^n if and only if A + A^T = B + B^T.

Proof. Suppose that φ(x) = x^T A x = x^T B x for every x ∈ R^n. If x = e_i, this amounts to a_ii = b_ii for 1 ≤ i ≤ n. If x = e_i + e_j, this yields a_ii + a_ij + a_ji + a_jj = b_ii + b_ij + b_ji + b_jj, which means that a_ij + a_ji = b_ij + b_ji for 1 ≤ i, j ≤ n. Thus, A + A^T = B + B^T.
Conversely, suppose that A + A^T = B + B^T. Note that φ(x) = (1/2)(φ(x) + φ(x)^T), because φ(x) ∈ R. Thus,

    φ(x) = (1/2)(x^T A x + x^T A^T x) = x^T ((A + A^T)/2) x.

Thus, it is clear that either A or B defines the quadratic form. □

Among the set of matrices that define a quadratic form, there is only one symmetric matrix. This is stated formally next.

Corollary 8.25. If φ : R^n → R is a quadratic form, then there exists a unique symmetric matrix C ∈ R^{n×n} such that φ(x) = x^T C x for x ∈ R^n.

Proof. Suppose that φ(x) = x^T A x for x ∈ R^n. We saw that φ(x) = x^T ((1/2)(A + A^T)) x and the matrix (1/2)(A + A^T) is clearly symmetric. On the other hand, if A + A^T = B + B^T and B is symmetric, then B = (1/2)(A + A^T). □

Since a quadratic form φ can be uniquely written as φ(x) = x^T C x with C symmetric, properties of C can be transferred to φ. For example, if C is positive definite (semidefinite), then we say that φ is positive definite (semidefinite).

Example 8.13. Let x ∈ R^n and let φ(x) = Σ_{i=1}^{n−1} (x_{i+1} − x_i)^2. It is easy to verify that φ(x) = x^T A x,


where A is the tridiagonal matrix

    A = (  1  −1   0  ···   0   0
          −1   2  −1  ···   0   0
           0  −1   2  ···   0   0
           ⋮    ⋮   ⋮   ⋱    ⋮   ⋮
           0   0   0  ···   2  −1
           0   0   0  ···  −1   1 )

and x ∈ R^n. The definition of φ shows that φ(x) ≥ 0 for every x ∈ R^n, with φ(x) = 0 exactly when all components of x are equal; thus, both A and φ are positive semidefinite.

Observe that the symmetric bilinear form f can be recovered from the associated quadratic form φ because

    f(u, v) = (1/2)(φ(u + v) − φ(u) − φ(v)),

for u, v ∈ V.

Example 8.14. Let φ : R^3 → R be the quadratic form defined by

    φ(x) = 4x_1^2 + 8x_2^2 − x_3^2 − 6x_1x_2 + 4x_1x_3 + x_2x_3,

for x = (x_1, x_2, x_3)^T ∈ R^3. The symmetric matrix C that defines φ is

    C = (  4   −3    2
          −3    8   0.5
           2   0.5  −1 ).

The diagonal entries of C are the coefficients of the squared terms, and each off-diagonal entry is half the coefficient of the corresponding cross-product term; for instance, since the coefficient of x_1x_2 is −6, we have c_12 = c_21 = −3.
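The computation in Example 8.14 can be checked in MATLAB; the sketch below evaluates φ directly and through the symmetric matrix C for a random argument.

% Quadratic form of Example 8.14 and its symmetric matrix
C = [4 -3 2; -3 8 0.5; 2 0.5 -1];
phi = @(x) 4*x(1)^2 + 8*x(2)^2 - x(3)^2 - 6*x(1)*x(2) + 4*x(1)*x(3) + x(2)*x(3);
x = randn(3, 1);
disp([phi(x), x'*C*x])                     % the two values agree up to round-off

More generally, if φ(x) = x^T A x for some (not necessarily symmetric) A, the unique symmetric matrix of Corollary 8.25 is obtained as (A + A')/2.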


Theorem 8.56. A continuous function φ : V → R is a quadratic form if and only if it satisfies the Parallelogram Equality (8.12) for every u, v ∈ V.

Proof. The necessity of the condition was shown already. Therefore, we need to show only that if a continuous φ satisfies Equality (8.12), then it is a quadratic form.
Let φ be a function such that φ(u + v) + φ(u − v) = 2(φ(u) + φ(v)) for every u, v ∈ V. If u = v = 0, it follows that φ(0) = 0. Choosing u = 0 in Equality (8.12) implies φ(v) + φ(−v) = 2φ(v), so φ(−v) = φ(v) for v ∈ V; that is, φ is an even function.
Define the symmetric function f : V × V → R by

    f(u, v) = (1/2)(φ(u + v) − φ(u) − φ(v)),

for u, v ∈ V. We need to show that f is bilinear. In view of the symmetry of f, it suffices to prove that f is linear in its first argument. The definition of f allows us to write

    2f(u_1 + u_2, v) = φ(u_1 + u_2 + v) − φ(u_1 + u_2) − φ(v),
    2f(u_1, v) = φ(u_1 + v) − φ(u_1) − φ(v),
    2f(u_2, v) = φ(u_2 + v) − φ(u_2) − φ(v),

which implies

    2(f(u_1 + u_2, v) − f(u_1, v) − f(u_2, v))
      = (φ(u_1 + u_2 + v) + φ(v)) − (φ(u_1 + v) + φ(u_2 + v)) − (φ(u_1 + u_2) − φ(u_1) − φ(u_2)).    (8.13)

Since φ satisfies the parallelogram equality, we have

    φ(u_1 + u_2 + v) + φ(v) = (1/2)(φ(u_1 + u_2 + 2v) + φ(u_1 + u_2)),    (8.14)
    φ(u_1 + v) + φ(u_2 + v) = (1/2)(φ(u_1 + u_2 + 2v) + φ(u_1 − u_2)).    (8.15)


The last two equalities imply

    φ(u_1 + u_2 + v) + φ(v) − φ(u_1 + v) − φ(u_2 + v)
      = (1/2)(φ(u_1 + u_2) − φ(u_1 − u_2))
      = −φ(u_1) − φ(u_2) + φ(u_1 + u_2),    (8.16)

using the parallelogram equality again. The last equality combined with Equality (8.13) implies

    f(u_1 + u_2, v) = f(u_1, v) + f(u_2, v).    (8.17)

We prove now that

    f(au, v) = a f(u, v)    (8.18)

for a ∈ R and u, v ∈ V. Choosing u_1 = u and u_2 = −u, we have f(u, v) = −f(−u, v), so Equality (8.18) holds for a = −1. It is easy to verify by induction on k ∈ N that Equality (8.17) implies f(ku, v) = k f(u, v) for every k ∈ N. Let k, l be two integers with l ≠ 0. We have l f((k/l)u, v) = f(ku, v) = k f(u, v), which implies f((k/l)u, v) = (k/l) f(u, v). Thus, f(ru, v) = r f(u, v) for any rational number r. The continuity of f allows us to conclude that f(au, v) = a f(u, v) for any real number a. Thus, f is a bilinear form. Since φ(x) = f(x, x), it follows that φ is indeed a quadratic form. □

Let φ : V → R be a quadratic form and let v_1, ..., v_n be a basis of V, where V is an n-dimensional real linear space. Since x = Σ_{i=1}^n x_i v_i, the quadratic form can be written as

    φ(x) = x^T C x = Σ_{i=1}^n Σ_{j=1}^n x_i x_j v_i^T C v_j,

where C is the matrix of φ. If i ≠ j implies v_i^T C v_j = 0, then the basis v_1, ..., v_n is referred to as a canonical basis of φ. Relative to this basis we can write

    φ(x) = Σ_{i=1}^n c_i x_i^2,

where c_i = v_i^T C v_i.
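For a symmetric matrix C, an orthonormal canonical basis is provided by the spectral decomposition, and this is easy to carry out in MATLAB; the matrix below is chosen only for illustration.

% Reduce phi(x) = x'*C*x to a sum c_1*y_1^2 + ... + c_n*y_n^2
C = [2 1 0; 1 2 1; 0 1 2];                 % illustrative symmetric matrix
[V, D] = eig(C);                           % columns of V form a canonical basis
x = randn(3, 1);
y = V' * x;                                % coordinates of x in the canonical basis
disp([x'*C*x, sum(diag(D) .* y.^2)])       % the two values coincide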


Definition 8.20. A quadric in R^n is a set of the form

    Q = {x ∈ R^n | x^T A x + 2b^T x + c = 0},

where A ∈ R^{n×n} is a symmetric matrix, b ∈ R^n, and c ∈ R.

We will refer both to a quadric Q and to the equation that describes it as a quadric. The extended matrix of the quadric Q is the symmetric matrix Q ∈ R^{(n+1)×(n+1)} given by

    Q = ( a_11 ··· a_1n  b_1
           ⋮         ⋮    ⋮
          a_n1 ··· a_nn  b_n
          b_1  ··· b_n   c  )  =  ( A    b
                                    b^T  c ).

Consider the quadric

    x^T A x + 2b^T x + c = 0,    (8.19)

where A ∈ R^{n×n} is a symmetric matrix, b ∈ R^n, and c ∈ R. By the Spectral Theorem for Hermitian Matrices (applied, in this case, to real symmetric matrices), there exists an orthogonal matrix U such that A = U D U^T, where D is a diagonal matrix having the eigenvalues of A as its diagonal elements. Consider an isometry defined by x = U y + r. Equation (8.19) can be written as

    (y^T U^T + r^T) A (U y + r) + 2b^T (U y + r) + c = 0.

An equivalent form of this equality is

    y^T U^T A U y + r^T A U y + y^T U^T A r + r^T A r + 2b^T U y + 2b^T r + c
      = y^T U^T A U y + 2(r^T A + b^T) U y + r^T A r + 2b^T r + c = 0.

We used here the fact that both r^T A U y and y^T U^T A r are scalars and, therefore, they coincide with their transposes.

Definition 8.21. Let x^T A x + 2b^T x + c = 0 be a quadric, let x = U y + r be an isometry (where U is an orthogonal matrix), and let y^T Ã y + 2b̃^T y + c̃ = 0 be the transformed equation under the isometry. A function Φ : R^{n×n} × R^n × R → R is an isometric invariant if for any quadric we have

    Φ(A, b, c) = Φ(Ã, b̃, c̃).


Observe that

    Ã = U^T A U,
    b̃ = U^T (b + A r),
    c̃ = r^T A r + 2b^T r + c.

Since A and Ã are similar matrices, they have the same characteristic polynomial, which implies that any function of the form Φ(A, b, c) = a_i, where a_i is the ith coefficient of the characteristic polynomial of A, is an isometric invariant.

Theorem 8.57. Let

    Q = ( A    b
          b^T  c )

be the extended matrix of the quadric x^T A x + 2b^T x + c = 0. Then det(Q) is an isometric invariant.

Proof. We have to show the equality

    det(Q) = det ( A    b   ;  b^T  c )  =  det ( Ã    b̃  ;  b̃^T  c̃ )
           = det ( U^T A U           U^T(b + A r)
                   (b^T + r^T A) U   r^T A r + 2b^T r + c ).

Taking into account the result shown in Supplement 33 of Chapter 5, it follows that

    det ( U^T A U           U^T(b + A r)
          (b^T + r^T A) U   r^T A r + 2b^T r + c )
      = det ( A             b + A r
              b^T + r^T A   r^T A r + 2b^T r + c ),


because U is an orthogonal matrix (so that det(U) det(U^T) = 1). Further elementary transformations yield the following equalities:

    det ( A             b + A r
          b^T + r^T A   r^T A r + 2b^T r + c )
      = det ( A    b + A r
              b^T  b^T r + c )
      = det ( A    b
              b^T  c ),

which show that det(Q̃) = det(Q). □
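Theorem 8.57 can be checked numerically; in the sketch below the quadric, the orthogonal matrix U, and the translation r are generated only for illustration, and the extended matrices before and after the isometry have the same determinant. For an invertible A, the unique center discussed next is obtained as r0 = -A\b.

% Numerical check of the isometric invariance of det(Q)
A = [4 1 0; 1 3 1; 0 1 2];  b = [1; -1; 2];  c = -5;   % illustrative quadric
Q = [A b; b' c];                           % extended matrix of the quadric
[U, ~] = qr(randn(3));                     % random orthogonal matrix
r = randn(3, 1);                           % random translation
At = U'*A*U;                               % transformed matrix
bt = U'*(b + A*r);                         % transformed vector
ct = r'*A*r + 2*b'*r + c;                  % transformed constant
Qt = [At bt; bt' ct];
disp([det(Q), det(Qt)])                    % the two determinants coincide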

If the equation r^T A + b^T = 0^T has a solution r, we refer to r as a center of the quadric. A quadric may have a unique center, a set of centers, or no center, depending on the number of solutions of the equation r^T A + b^T = 0^T, which is equivalent to Ar + b = 0 since A is a symmetric matrix.
If A is an invertible matrix, then Q has a unique center r_0 = −A^{−1} b. Choosing r = r_0, the quadric is defined by the equation

    y^T D y − r_0^T A r_0 + c = 0,    (8.20)

because r_0^T A r_0 + 2b^T r_0 = −r_0^T A r_0. Note that, in the new coordinates, y satisfies Equation (8.20) if and only if −y does. Equivalently, r_0 + U y ∈ Q if and only if r_0 − U y ∈ Q, and the middle of the line segment determined by r_0 + U y and r_0 − U y is r_0. Thus, r_0 is a center of symmetry for the quadric.
The orthonormal columns of the matrix U for which A = U D U^T (or D = U^T A U, where D is a diagonal matrix) are called the principal axes of the quadric Q.
Quadrics are classified based on the number of null eigenvalues of the symmetric matrix A and on the signs of its nonzero eigenvalues. Thus, we can use for the classification either the diagonal matrix D whose entries are the eigenvalues of A or, equivalently, the inertia of the matrix A.
Let x^T A x + 2b^T x + c = 0 be a quadric having a unique center. Since A is a nonsingular matrix, there are no null eigenvalues of A, so I(A) = (n_+(A), n_−(A), 0). There are n + 1 possible classes since we can have 0 ≤ n_+ ≤ n and n_+ + n_− = n. When n_+ = n, the quadric is an n-dimensional ellipsoid. The equality

    y^T D y − r_0^T A r_0 + c = 0


can be written as

    Σ_{i=1}^n λ_i y_i^2 − r_0^T A r_0 + c = 0.

If r_0^T A r_0 − c > 0, the last equality can be written as

    Σ_{i=1}^n y_i^2 / a_i^2 = 1,

where a_i^2 = (r_0^T A r_0 − c)/λ_i for 1 ≤ i ≤ n, and the quadric is a real ellipsoid having a_1, ..., a_n as its semiaxes.
At the other extreme, suppose that n_− = n, which means that spec(A) consists of negative numbers. If k = r_0^T A r_0 − c > 0, then Q = ∅; if k < 0, by multiplying the equation by −1 we revert to the previous case.

Example 8.15. In this example, we consider the case of quadrics x^T A x + 2b^T x + c = 0 in R^3 that have a unique center, that is, for which det(A) ≠ 0. If n_+ = 3, we have an ellipsoid, as we saw before. For n_+ < 3, the classification of the quadric depends on det(Q). If det(Q) ≠ 0, n_+ = 2, and n_− = 1, we have a hyperboloid of one sheet, while in the case when det(Q) ≠ 0, n_+ = 1, and n_− = 2, we have a hyperboloid of two sheets. If det(Q) = 0, we have a cone.
Transform the equation of the quadric x^T A x + 2b^T x + c = 0 to the canonical form given by Equality (8.20), y^T D y − k = 0, where k = r_0^T A r_0 − c. Note that

    det(Q) = det ( λ_1  0    0    0
                   0    λ_2  0    0
                   0    0    λ_3  0
                   0    0    0   −k )  =  −λ_1 λ_2 λ_3 k.


Now, the class of the quadric can be established based on the following table:

    n_+   n_−   det(Q)   Quadric
     3     0     < 0     real ellipsoid
     3     0     > 0     ∅ (imaginary ellipsoid)
     3     0     = 0     a single point
     2     1     ≠ 0     hyperboloid of one sheet
     1     2     ≠ 0     hyperboloid of two sheets
     2     1     = 0     cone
     1     2     = 0     cone

8.11  Spectra of Positive Matrices

Let A ∈ R^{n×n} be a positive matrix, that is, A > O. Observe that its spectral radius ρ(A) is positive because ρ(A) = 0 implies that spec(A) = {0}, which implies that A is nilpotent by Corollary 8.6. Since A > O, this is impossible, so ρ(A) > 0.

Theorem 8.58 (Perron–Frobenius theorem). Let A ∈ R^{n×n} be a symmetric matrix with positive elements and let λ be its largest eigenvalue. The following statements hold:
(i) λ is a positive number;
(ii) there exists an eigenvector x that corresponds to λ such that x > 0_n;
(iii) geomm(A, λ) = 1;
(iv) if θ is any other eigenvalue of A, then |θ| < λ.

Proof. Since the eigenvalues of A are real and their sum equals trace(A) > 0, it follows that the largest of them, λ, is positive, which proves Part (i).
Let u be a real unit eigenvector that belongs to λ, so Σ_{j=1}^n a_ij u_j = λ u_i. We have Au = λu, so u^T A u = λ u^T u = λ‖u‖_2^2 = λ. This allows us to write

    λ = Σ_{i=1}^n Σ_{j=1}^n a_ij u_i u_j,

so

    λ = Σ_{i=1}^n Σ_{j=1}^n a_ij u_i u_j = | Σ_{i=1}^n Σ_{j=1}^n a_ij u_i u_j |,

because λ > 0. We claim that the vector x = abs(u) is an eigenvector that corresponds to λ and has positive components. Note that

    λ = | Σ_{i=1}^n Σ_{j=1}^n a_ij u_i u_j | ≤ Σ_{i=1}^n Σ_{j=1}^n a_ij x_i x_j.

By Corollary 8.14, since λ is the largest eigenvalue of A, we have Σ_{i=1}^n Σ_{j=1}^n a_ij x_i x_j ≤ λ; the equality takes place only if x is an eigenvector of λ.


Since x ≥ 0, if x_i = 0 for some i, then the equality λ x_i = Σ_{j=1}^n a_ij x_j implies that x_j = 0 for 1 ≤ j ≤ n, because all the numbers a_ij are positive. This means that x = 0, which is impossible because x is an eigenvector. Thus, we conclude that all components of x are positive. This completes the proof of Part (ii).
For Part (iii), suppose that the geometric multiplicity of λ is greater than 1 and let u and v be two real unit vectors of the invariant subspace S_{A,λ} such that u ⊥ v. Note that the vectors abs(u) and abs(v) are also eigenvectors corresponding to λ. Suppose that u_i < 0 for some i. Since λ u_i = Σ_{j=1}^n a_ij u_j and λ|u_i| = Σ_{j=1}^n a_ij |u_j|, we have λ(u_i + |u_i|) = 0 = Σ_{j=1}^n a_ij (u_j + |u_j|), and this implies u_j + |u_j| = 0 for 1 ≤ j ≤ n. In other words, we have either u_j = |u_j| > 0 for every j, or u_j = −|u_j| < 0 for every j. The same applies to v, so v^T u = Σ_{i=1}^n v_i u_i ≠ 0, which contradicts the orthogonality of u and v. Thus, geomm(A, λ) = 1.
For Part (iv), let w be a unit eigenvector that corresponds to the eigenvalue θ, so Σ_{j=1}^n a_ij w_j = θ w_i for 1 ≤ i ≤ n. Again, by Corollary 8.14,

    λ ≥ Σ_{i=1}^n Σ_{j=1}^n a_ij |w_i| |w_j| ≥ | Σ_{i=1}^n Σ_{j=1}^n a_ij w_i w_j | = |θ|.

If θ = −λ, these inequalities show that |w_j| = x_j for all j and, therefore, there exists i such that w_i = x_i. Adding the equalities λ x_i = Σ_{j=1}^n a_ij x_j and −λ w_i = Σ_{j=1}^n a_ij w_j yields 0 = Σ_{j=1}^n a_ij (x_j + w_j) ≥ a_ii (x_i + w_i), which contradicts the fact that a_ii > 0 and w_i = x_i > 0. Thus, θ ≠ −λ. □

Definition 8.22. Let A ∈ R^{n×n} be a symmetric matrix with positive elements. The number ρ(A) is the Perron number of A; the positive vector x with ‖x‖_2 = 1 that corresponds to the eigenvalue ρ(A) is the Perron vector of A.

The Perron–Frobenius Theorem states that the spectral radius ρ(A) of a positive matrix A is always an eigenvalue of A. This property also holds for non-negative matrices.

Lemma 8.7. Let A ∈ R^{n×n} be a non-negative matrix that is irreducible, where n > 1. Its spectral radius ρ(A) is an eigenvalue and there exists a positive eigenvector that corresponds to


this eigenvalue. No non-negative eigenvector corresponds to any other eigenvalue of A.

Proof. Since A is an irreducible matrix, the matrix (I_n + A)^{n−1} is positive by Supplement 131 of Chapter 3. This implies ((I_n + A)^{n−1})^T = (I_n + A^T)^{n−1} > O_{n,n}, so by the Perron–Frobenius theorem, there exists a positive vector y such that (I + A^T)^{n−1} y = ρ((I + A^T)^{n−1}) y. Equivalently, we have

    y^T (I + A)^{n−1} = ρ((I + A)^{n−1}) y^T.    (8.21)

Let λ ∈ spec(A) be an eigenvalue of A such that |λ| = ρ(A) and let x be an eigenvector associated with λ. Since λx = Ax, it follows that |λ| abs(x) = abs(Ax) ≤ abs(A) abs(x), so ρ(A) abs(x) ≤ A abs(x) because A is a non-negative matrix. It is immediate that ρ(A)^p abs(x) ≤ A^p abs(x) for p ∈ N. In turn, this implies (1 + ρ(A))^{n−1} abs(x) ≤ (I_n + A)^{n−1} abs(x), so, by Equality (8.21), we have

    (1 + ρ(A))^{n−1} (y^T abs(x)) ≤ y^T (I_n + A)^{n−1} abs(x) = ρ((I + A)^{n−1}) y^T abs(x).

Since y > 0, we have y^T abs(x) > 0 and, therefore,

    (1 + ρ(A))^{n−1} ≤ ρ((I + A)^{n−1}).    (8.22)

By Corollary 8.7, the eigenvalues of (I + A)^{n−1} have the form (1 + λ)^{n−1}, where λ ∈ spec(A). Therefore, there exists an eigenvalue μ of A such that |(1 + μ)^{n−1}| = ρ((I + A)^{n−1}). Since |μ| ≤ ρ(A), Inequality (8.22) implies

    (1 + |μ|)^{n−1} ≤ (1 + ρ(A))^{n−1} ≤ ρ((I + A)^{n−1}) = |(1 + μ)^{n−1}|,

so 1 + |μ| ≤ |1 + μ|. This implies that μ is a positive real number and, therefore, μ = ρ(A). Consequently, (ρ(A))^k abs(x) ≤ A^k abs(x) for k ≥ 1. In particular, for k = 1, we have ρ(A) abs(x) ≤ A abs(x) = μ abs(x).


Since (I + A)^{n−1} abs(x) = (1 + μ)^{n−1} abs(x) = ρ((I + A)^{n−1}) abs(x), and abs(x) > 0 (by Theorem 8.58), it follows that there is only one linearly independent eigenvector associated with μ. Indeed, suppose that we had two linearly independent eigenvectors, u and v, associated with the eigenvalue μ. Since v ≠ 0, there exists i such that v_i ≠ 0, so the vector w = u − (u_i / v_i) v is also an eigenvector of the eigenvalue μ because w ≠ 0. But w_i = 0, which contradicts the fact that abs(x) > 0 for any eigenvector of μ.
Since A ≠ O_{n,n}, we have ρ(A) > 0. Moreover, ρ(A) is a simple eigenvalue of A. Indeed, there is only one linearly independent eigenvector u of A associated with ρ(A), and u > 0. Since A^T is also irreducible (by Exercise 128 of Chapter 3), there exists only one linearly independent eigenvector v of A^T associated with ρ(A^T) = ρ(A), and v > 0. Since v^T u > 0, it follows that ρ(A) is a simple eigenvalue of A by Theorem 8.44.
Suppose that z is an eigenvector of an eigenvalue ζ of A with z > 0, where ζ ≠ ρ(A). We have shown that A^T has an eigenvector w > 0 such that A^T w = ρ(A) w. This implies w^T A z = ζ w^T z. Since w^T A z = (A^T w)^T z = ρ(A) w^T z, we have ζ w^T z = ρ(A) w^T z. Since w^T z > 0, it follows that ζ = ρ(A), which is a contradiction. Thus, no eigenvalue other than ρ(A) has a positive eigenvector. □

Now we prove the existence of a non-negative eigenvector for ρ(A) without the assumption of irreducibility of A made in Lemma 8.7.

Theorem 8.59. Let A ∈ R^{n×n} be a non-negative matrix such that n > 1. Its spectral radius ρ(A) is an eigenvalue of A and there exists a non-negative eigenvector that corresponds to this eigenvalue.

Proof. Suppose that red(A) = k. As we saw in the proof of Supplement 127 of Chapter 3, there exists a permutation matrix P such that

    P A P^T = ( B_11  B_12  ···  B_1k
                O     B_22  ···  B_2k
                ⋮      ⋮     ⋱    ⋮
                O     O     ···  B_kk ),


where the matrices B_11, ..., B_kk are irreducible. By Theorem 7.15, spec(A) = ∪_{i=1}^k spec(B_ii). We have ρ(A) = ρ(P A P^T) = max{ρ(B_ii) | 1 ≤ i ≤ k}. Let j be the smallest number such that ρ(A) = ρ(B_jj). The matrix P A P^T can now be written as

    P A P^T = ( C  E     F
                O  B_jj  G
                O  O     D ),

where C ∈ C^{p×p} and D ∈ C^{q×q} are block upper triangular matrices, or are missing when B_jj is the first diagonal block or the last diagonal block. Since B_jj is irreducible, Lemma 8.7 implies that there exists y_j > 0 such that B_jj y_j = ρ(A) y_j. If j = 1, that is, B_jj is the first diagonal block, then (y_j ; 0) is a non-negative eigenvector of P A P^T. If j > 1, we seek z ≥ 0 such that (z ; y_j ; 0) is an eigenvector of P A P^T. Since

    P A P^T (z ; y_j ; 0) = (C z + E y_j ; B_jj y_j ; 0),

it suffices to find z ≥ 0 such that C z + E y_j = ρ(A) z. Because j is the smallest index with ρ(B_jj) = ρ(A), we have ρ(C) < ρ(A), which means that for the matrix Z = (1/ρ(A)) C we have ρ(Z) < 1. Thus, I + Z + Z^2 + ··· converges to (I − Z)^{−1}. Since Z ≥ O, the terms of the series I + Z + Z^2 + ··· are non-negative matrices, so (I − Z)^{−1} is a non-negative matrix. The desired vector z is

    z = (1/ρ(A)) (I − Z)^{−1} E y_j

and z ≥ 0. □
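The Perron value and the Perron vector of a positive matrix can be approximated with the power method; the sketch below uses an illustrative symmetric positive matrix and compares the result with the dominant eigenvalue returned by eig.

% Power method for the Perron value and Perron vector
A = [4 1 2; 1 3 2; 2 2 5];                 % illustrative symmetric matrix, A > O
x = ones(3, 1);
for k = 1:100
    y = A * x;
    x = y / norm(y);                       % iterates converge to the Perron vector
end
disp([x'*A*x, max(eig(A))])                % Rayleigh quotient vs. dominant eigenvalue
disp(x')                                   % all components are positive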

8.12  Spectra of Positive Semidefinite Matrices

In Theorem 7.12, we saw that the eigenvalues of Hermitian matrices are real numbers. The next theorem links these eigenvalues to positive definiteness.

Theorem 8.60. Let A ∈ C^{n×n} be a Hermitian matrix. If A is positive semidefinite, then all its eigenvalues are non-negative; if A is positive definite, then its eigenvalues are positive.

Proof. Since A is Hermitian, all its eigenvalues are real numbers. Suppose that A is positive semidefinite, that is, x^H A x ≥ 0 for x ∈ C^n. If λ ∈ spec(A), then Av = λv for some eigenvector v ≠ 0. The positive semidefiniteness of A implies v^H A v = λ v^H v = λ‖v‖_2^2 ≥ 0, which implies λ ≥ 0. It is easy to see that if A is positive definite, then λ > 0. □

Theorem 8.61. Let A ∈ C^{n×n} be a Hermitian matrix. If A is positive semidefinite, then all its principal minors are non-negative real numbers. If A is positive definite, then all its principal minors are positive real numbers.

Proof. Since A is positive semidefinite, every principal submatrix A(i_1, ..., i_k ; i_1, ..., i_k) is a Hermitian positive semidefinite matrix by Theorem 6.51, so every principal minor is a non-negative real number. The second part of the theorem is proven similarly. □

Corollary 8.26. Let A ∈ C^{n×n} be a Hermitian matrix. The following statements are equivalent:
(i) A is positive semidefinite;
(ii) all eigenvalues of A are non-negative numbers;
(iii) there exists a Hermitian matrix C ∈ C^{n×n} such that C^2 = A;
(iv) A is the Gram matrix of a sequence of vectors, that is, A = B^H B for some B ∈ C^{n×n}.

Proof. (i) implies (ii): This was shown in Theorem 8.60.
(ii) implies (iii): Suppose that A is a matrix such that all its eigenvalues are the non-negative numbers λ_1, ..., λ_n. By Theorem 8.14,


A can be written as A = U^H D U, where U is a unitary matrix and D = diag(λ_1, ..., λ_n). Define the matrix √D = diag(√λ_1, ..., √λ_n). Clearly, we have (√D)^2 = D. Now we can write A = U^H √D U U^H √D U, which allows us to define the desired matrix C as C = U^H √D U.
(iii) implies (iv): Since C is itself a Hermitian matrix, this implication is obvious.
(iv) implies (i): Suppose that A = B^H B for some matrix B ∈ C^{n×k}. Then, for x ∈ C^n, we have x^H A x = x^H B^H B x = (Bx)^H (Bx) = ‖Bx‖_2^2 ≥ 0, so A is positive semidefinite. □

8.13  MATLAB Computations

The function eigalgmult computes the eigenvalues of a matrix and their algebraic multiplicities:

function [eigvalues, repeats] = eigalgmult(A)
tol = sqrt(eps);
v = sort(eig(A));
v = round(v/tol) * tol;              % round eigenvalues to the tolerance grid
eigvalues = flipud(unique(v));       % distinct eigenvalues in decreasing order
n = length(v);
d = length(eigvalues);
B = v * ones(1, d);
C = ones(n, 1) * eigvalues';
m = abs(C - B) < tol;                % m(i,j) is true when v(i) equals eigvalues(j)
repeats = sum(m);                    % algebraic multiplicity of each eigenvalue

For example, we have:

>> A = [3 1 1; 1 3 1; 1 1 3]
A =
     3     1     1
     1     3     1
     1     1     3
>> [e, mult] = eigalgmult(A)
e =
     5
     2
mult =
     1     2

The function eiggeommult(A) computes eigenvectors of A and their geometric multiplicities. Its outputs are two matrices V and G. The diagonal matrix G has on its diagonal the eigenvalues of A; each value is repeated a number of times equal to the dimension of its invariant subspace, which equals its geometric multiplicity. The matrix V contains eigenvectors corresponding to these eigenvalues.

function [V, G] = eiggeommult(A)
[m, n] = size(A);
[eigvalues, r] = eigalgmult(A);
V = [];
d = [];
for k = 1:length(eigvalues)
    v = nullspbasis(A - eigvalues(k)*eye(n));
    [ms, ns] = size(v);
    V = [V v];
    temp = ones(ns, 1) * eigvalues(k);
    d = [d; temp];
end
G = diag(d);


Example 8.18. The matrix

    A = ( 2 1 0
          0 2 1
          0 0 2 )

has one eigenvalue, 2, with algebraic multiplicity 3 and geometric multiplicity 1, as the following MATLAB computation shows:

>> A = [2 1 0; 0 2 1; 0 0 2]
A =
     2     1     0
     0     2     1
     0     0     2
>> [V, G] = eiggeommult(A)
V =
     1
     0
     0
G =
     2

On the other hand, the next matrix has two non-defective eigenvalues, 5 and 2, as the next computation shows:

>> A = [3 1 1; 1 3 1; 1 1 3]
A =
     3     1     1
     1     3     1
     1     1     3
>> [V, G] = eiggeommult(A)
V =
     1    -1    -1
     1     1     0
     1     0     1
G =
     5     0     0
     0     2     0
     0     0     2


Let A and B be two square matrices, A, B ∈ C^{n×n}. To compute the generalized eigenvalues of A and B, we can use eig(A,B), which returns a vector containing these values. Also, [V,D] = eig(A,B) yields a diagonal matrix D of generalized eigenvalues and a matrix V ∈ C^{n×n} whose columns are the corresponding eigenvectors, so that AV = BVD.
The exponential of a matrix can be computed using the function expm.

Example 8.19. Let A and B be the matrices considered in Example 8.12,

    A = ( 1 1 ,   B = ( 0 1 ,   and C = A + B = ( 1 2
          0 1 )         1 0 )                     1 1 ).

The exponentials of these matrices obtained with the function expm are

>> expm(A)
ans =
    2.7183    2.7183
         0    2.7183
>> expm(B)
ans =
    1.5431    1.1752
    1.1752    1.5431
>> expm(C)
ans =
    5.9209    7.4388
    3.7194    5.9209

Furthermore, the product e^A e^B computed as

>> expm(A)*expm(B)
ans =
    7.3891    7.3891
    3.1945    4.1945

is clearly distinct from e^C.


Similar MATLAB functions exist for computing the principal logarithm log A of a matrix A (namely, logm) and its principal square root (namely, sqrtm(A)), that is, the unique square root for which every eigenvalue has a non-negative real part.
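These functions give a quick numerical illustration of Corollary 8.26: for the positive definite Gram matrix built below (an arbitrary example), sqrtm returns a symmetric matrix C with C^2 = A, and all eigenvalues of A are non-negative.

% Square root of a positive (semi)definite matrix
B = randn(4, 3);
A = B' * B;                                % Gram matrix: symmetric positive semidefinite
C = sqrtm(A);                              % principal square root of A
disp(norm(C*C - A))                        % close to zero
disp(norm(C - C'))                         % C is symmetric up to round-off
disp(min(eig(A)))                          % non-negative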

Exercises and Supplements

(1) Let A, B ∈ C^{n×n} be two matrices. Prove that if B is nonsingular, then AB^{−1} ∼ B^{−1}A.
(2) Prove that if A ∈ C^{m×n} and B ∈ C^{n×m}, then the matrices

C=

AB Om,n B On,n



and D =

0m,m 0m,n B BA



are similar. (3) Let A ∈ Cm×m and B ∈ Cn×n be two matrices. Prove that S A,B (X) = (S −B H ,−AH (X H ))H for X ∈ Cm×n . (4) The Lie bracket of Cn×n is the mapping [·, ·] : Cn×n × Cn×n −→ Cn×n given by [X, Y ] = S X,X (Y ). Prove that [X, Y ] = −[Y, X], [X, X] = On,n and [X, [Y, Z]] + [Y, [Z, X]] + [Z, [X, Y ]] = On,n for any matrices X, Y, Z ∈ Cn,n . (5) Let A ∈ Rn×n be a symmetric matrix having the eigenvalues λ1  · · ·  λn . Prove that |||A −

λ1 − λn λ1 + λn In |||2 = . 2 2

(6) Let A ∈ Rn×n be a matrix and let c ∈ R. Prove that for x, y ∈ Rn − {0n }, we have ralA (x) − ralA (y) = ralB (x) − ralB (y), where B = A + cIn . (7) Let A ∈ Rn×n be a symmetric matrix having the eigenvalues λ1  · · ·  λn and let x and y be two vectors in Rn − {0n }.



Prove that |ralA (x) − ralA (y)|  (λ1 − λn ) sin ∠(x, y). Solution: Assume that x = y = 1 and let B = A − λ1 +λn In . By Exercise 8.13, we have 2 |ralA (x) − ralA (y)| = |ralB (x) − ralB (y)| = |x Bx − y  By| = |B(x − y) (x + y)|. By the Cauchy–Schwarz Inequality, we have x − y x + y 2 = (λ1 − λn ) sin ∠(x, y).

|B(x − y) (x + y)|  2B

(8) Prove that if A is a unitary matrix and 1 ∈ spec(A), then there exists a skew-Hermitian S such that A = (In − S)(In + S)−1 . (9) Let f : Rn×n −→ R be a function such that f (AB) = f (BA) for A, B ∈ Rn×n . Prove that if A ∼ B, then f (A) = f (B). (10) Let Br (λ, a) ∈ Cr×r be the matrix defined by ⎛ ⎞ λ a 0 ··· 0 ⎜0 λ a · · · 0⎟ ⎜ ⎟ ⎜. . . ⎟ . r×r . ⎟ Br (λ, a) = ⎜ ⎜ .. .. . . . . .. ⎟ ∈ C . ⎜ ⎟ ⎝ 0 0 0 · · · a⎠ 0 0

··· λ

0

Note that Br (λ, 1) is a Jordan block. Let a, b ∈ C − {0} be two nonzero numbers. Prove that Bn (λ, a) ∼ Bn (λ, b). (11) Let λ be a complex number. Prove that the k th power of Br (λ) is given by  k  k−r+1 ⎞ ⎛ k k k−1 k k−2 λ · · · r−1 λ 1 λ 2 λ     ⎜ k k k−1 · · · k−r+2 ⎟ λk ⎜0 ⎟ 1 λ r−2 λ k ⎜ ⎟. (Br (λ) = ⎜ . ⎟ . . . . . . . ⎝. ⎠ . . ··· . 0

0

0

···

λk



(12) Prove that if |λ| < 1, then limk→∞ Br (λ)k = O. (13) Let A ∈ Rn×n be a symmetric real matrix. Prove that ∇ralA (x) =

2 (Ax − ralA (x)x). x x

Also, show that the eigenvectors of A are the stationary points of the function ralA (x). (14) Let A, B ∈ Cn×n be two Hermitian matrices. Prove that AB is a Hermitian matrix if and only if AB = BA. (15) Let A ∈ R3×3 be a symmetric matrix. Prove that if trace(A) = 0, the sum of principal minors of order 2 equals 0, and det(A) = 0, then rank(A) = 1. Solution: The characteristic polynomial of A is pA (λ) = λ3 − trace(A)λ2 = 0. Thus, spec(A) = {trace(A), 0}, where algm(A, 0) = 2, so rank(A) = 1. (16) Let A ∈ R3×3 be a symmetric matrix. Prove that if the sum of principal minors of order 2 does not equal 0 but det(A) = 0, then rank(A) = 2. (17) Let A ∈ Cn×n be a Hermitian matrix, u ∈ Cn be a vector, and let a be a complex number. Define the Hermitian matrix B as

A u . B= uH a Let α1  · · ·  αn be the eigenvalues of A and let β1  · · ·  βn  βn+1 be the eigenvalues of B. Prove that β1  α1  β2  · · ·  βn  αn  βn+1 . Solution: Since B ∈ C(n+1)×(n+1) , by the Courant–Fisher theorem we have βk+1 = min max{xH Bx | x2 = 1 and x ∈ W ⊥ } W

x

= max min{xH Bx | x2 = 1 and x ∈ Z⊥ }, Z

x

where W denotes ranges of sets of k non-zero arbitrary vectors, and let Z be a subset of Cn that consists of n − k non-zero arbitrary vectors in Cn+1 .



Let U be a set of k non-zero vectors in Cn and let Y be a set of n − k − 1 vectors in Cn . Define the subsets WU and ZY of Cn+1 as    u  WU = u ∈ U  0 and

   y  ZY = y ∈ Y ∪ {en+1 }. 0

By restricting the sets W and Z to sets of the form WU and ZY , we obtain the double inequality max min{xH Bx | x2 = 1 and x ∈ ZY ⊥ } x

ZY

 βk+1  min max{xH Bx | x2 = 1 and x ∈ WU ⊥ }. WU

x

Note that, if x ∈ ZY ⊥ , then we have x ⊥ en+1 , so xn+1 = 0. Therefore,

y A u H H = y H Ay. x Bx = (y 0) H 0 u a Consequently, max min{xH Bx | x2 = 1, x ∈ ZY ⊥ } x

ZY

= max min{y H Ay | y2 = 1, y ∈ Y ⊥ } = αk . Y

y

This allows us to conclude that αk  βk+1 for 1  k  n. On the another hand, if x ∈ WU ⊥ and

u x= , 0 then xH Bx = uH Au and x2 = u2 . Now we can write min max{xH Bx | x2 = 1, x ∈ WU ⊥ } x

WU

= min max{uH Au | u2 = 1, and u ∈ U ⊥ } = αk+1 , U

u

so βk+1  αk+1 for 1  k  n − 1.

Similarity and Spectra

611

(18) Let A, B ∈ Cn×n be two matrices such that AB = BA. Prove that A and B have a common eigenvector. Solution: Let λ ∈ spec(A) and let {x1 , . . . , xk } be a basis for null(A−λIn ). Observe that the matrices A−λIn and B commute because (A − λIn )B = AB − λB and B(A − λIn ) = BA − B. Therefore, we have (A − λIn )Bxi = B(A − λIn )xi = 0, so (A − λIn )BX = On,n , where X = (x1 , . . . , xk ). Consequently, ABX = λBX. Let y 1 , . . . , y m be the columns of the matrix BX. The last equality implies that Ay i = λy i , so y i ∈ null(A − λIn ). Since X is a basis of null(A − λIn ), it follows that each y i is a linear combination of the columns of X so there exists a matrix P such that (y 1 · · · y m ) = (x1 · · · xk )P , which is equivalent to BX = XP . Let w be an eigenvector of P . We have P w = μw. Consequently, BXw = XP w = μXw, which proves that Xw is an eigenvector of B. Also, A(Xw) = A(BXw) = (λBX)w = λμXw, so Xw is also an eigenvector of A. (19) Let A ∈ Cn×n be a matrix and let spec(A) = {λ1 , . . . , λn }. Prove that n 2 A2F ; (a) p=1 |λp |  (b) the equality np=1 |λp |2 = A2F holds if and only if A is normal. Solution: By Schur’s Triangularization Theorem (Theorem 8.8), there exists a unitary matrix U ∈ Cn×n and an upper triangular matrix T ∈ Cn×n such that A = U H T U and the diagonal elements of T are the eigenvalues of A. Thus, A2F = T 2F =

n  p=1

|λp |2 +



|tij |2 ,

i k, we have W = λ1 u1 u1 + · ·· + λr ur ur . Starting from the r matrices ui ui , we can form kr matrices  of rank k of the form i∈I ui ui by considering all subsets I of {1, . . . , r} that contain k elements. We have W =

r 

λj uj uj

j=1

=



I,|I|=k

αI



ui ui .

i∈I

If we match the coefficients of ui ui , we have λi = If we add these equalities, we obtain k=

r 





I,i∈I,|I|=k αI .

αI .

i=1 I,i∈I,|I|=k

We choose αI to depend on the cardinality of I and take into account  that each αI occurs k times in the previous sum. This implies I,i∈I,|I|=k αI = 1, so each W is a convex combination of matrices of rank k, so K conv (M1 ) = M2 . No matrix of rank greater than k can be an extreme point. Since every convex and compact set has extreme elements, only matrices of rank k can play this role. Since the definition of M2 makes no distinction between the k-rank matrices, it follows that the set of extreme elements coincides with M1 .



(38) Prove that Ky Fan’s Theorem can be derived from Supplement 8.13. (39) Prove that the Jordan block Br (a) can be written as ⎛  ⎞ ⎛ ⎞ e1 e2 ⎜ e ⎟ ⎜e ⎟ ⎜ 2 ⎟ ⎜ 3⎟ ⎜ . ⎟ ⎜.⎟ ⎟ ⎜ ⎟ Br (a) = a ⎜ ⎜ .. ⎟. + ⎜ .. ⎟. ⎜  ⎟ ⎜ ⎟ ⎝er−1 ⎠ ⎝er ⎠ er

0

(40) Let A ∈ Rm×n be a matrix such that rank(A) = r. Prove that A can be factored as A = U DV  , where U ∈ Rm×r and V ∈ Rn×r are matrices with orthonormal columns and D ∈ Rr×r is a diagonal matrix. This factorization is known as the Lanczos decomposition of A. Solution: By Theorem 3.33, the symmetric and square matrix A A ∈ Rn×n has the same rank r as the matrix A. Therefore, by the Spectral Theorem for Hermitian Matrices (Theorem 8.14), there exists an orthogonal matrix V ∈ Rn×n such that A A = V SV  , where S is a diagonal matrix D = (σ1 , . . . , σr , 0, . . . , 0) and σ1  σ2  · · ·  σr > 0. This is equivalent to V  A AV = (AV ) AV = S. The matrix V can be written as V = (V1 V0 ), where V1 = (v 1 · · · v r ) ∈ Rn×r consists of the first r columns of the matrix V, that is, from eigenvectors of A A that correspond to σ1 , . . . , σr . The last n − r columns of V (the columns of V0 ) are eigenvectors that correspond to the eigenvalue 0. Since (AV ) (AV ) = S, it follows that (Av 1 · · · Av r AV0 ) (Av 1 · · · Av r AV0 ) = S, which means that for the vectors z i = Av i (1  i  r), we have σi if i = j,  zizj = 0 otherwise. √ Let λi = σi and let ui = λ1i z i for 1  i  r. The set of vectors {u1 , . . . , ur } is orthonormal and AV = (λ1 u1 , . . . , λr ur ) = U D, where U = (u1 · · · ur ) and D = diag(λ1 , . . . , λr ).



Note that V ∈ Rn×r and that A Av = 0 for any column v of V 0 . This implies AV0 = O, so AV = A(V1 V0 ) = (AV1 O) = (U D O) and

 V  A = (U A O)V = (U A O) 0 = U DV  . V1 The field of values of a matrix A ∈ Cn×n is the set of numbers F (A) = {xAxH | x ∈ Cn and x2 = 1}. (41) Prove that F (A) is a convex set. (42) Prove that spec(A) ⊆ F (A) for any A ∈ Cn×n . (43) If U ∈ Cn×n is a unitary matrix and A ∈ Cn×n , prove that F (U AU H ) = F (A). (44) Prove that for a normal matrix A ∈ Cn×n , F (A) equals the convex hull of spec(A). Infer that A is Hermitian if and only if F (A) is an interval of R. Solution: Since A is normal, by the Spectral Theorem for Normal Matrices (Theorem 8.13), there exists a unitary matrix U and a diagonal matrix D such that A = U H DU and the diagonal elements of D are the eigenvalues of A. Then, by Exercise 8.13, F (A) = F (D). Therefore, z ∈ F  (A) if z = xDxH for some x ∈ Cn such that x2 = 1, so z = nk=1 |xk |2 λk , where spec(A) = spec(D) = {λ1 , . . . , λn }, which proves that F (A) is included in the convex closure of spec(A). The reverse inclusion is immediate. (45) Let

A B M= O C be a block matrix, where A and C are square matrices. Prove that if spec(A) ∩ spec(C) = ∅, then there exists a matrix X such that

−1

I X I X M O I O I is a block diagonal matrix.

624

Linear Algebra Tools for Data Mining (Second Edition)

Solution: The matrix



I X O I



is invertible and we have

−1

I −X I X = . O I O I Thus, we need to find a matrix X such that







I X A B I −X A −AX + B + XC = . O I O C O I O C By Theorem 8.25, if spec(A) ∩ spec(C) = ∅, there exists X such that AX − XC = B. (46) Prove that A ∼ B implies f (A) ∼ f (B) for every A, B ∈ Cn×n and every polynomial f . (47) Let A ∈ Cn×n be a matrix such that spec(A) = {λ1 , . . . , λn }. Prove that (a) n 

2

|λi | 

i=1

n  n 

|aij |2 ;

i=1 j=1

(b) the matrix A is normal if and only if n  i=1

|λi |2 =

n n  

|aij |2 .

i=1 j=1

(48) Let A ∈ Cn×n such that A  On,n . Prove that if 1n is an eigenvector of A, then ρ(A) = |||A|||∞ and if 1n is an eigenvector of A , then ρ(A) = |||A|||1 . (49) Let A be a non-negative matrix in Cn×n and let u = A1n and v = A 1n . Prove that max{min ui , min vj }  ρ(A)  min{max ui , max vj }. (50) Let A ∈ Cn×n be a matrix such that A  On,n . Prove that if there exists k ∈ N such that Ak > On,n , then ρ(A) > 0.

Similarity and Spectra

625

(51) Let A ∈ Cn×n be a matrix such that A  On,n . If A = On,n and there exists an eigenvector x of A such that x > 0n , prove that ρ(A) > 0. (52) Prove that A ∈ Cn×n is positive semidefinite if and only n if there H is a set U = {v 1 , . . . , v n } ⊆ Cn such that A = i=1 v i v i . Furthermore, prove that A is positive definite if and only if there exists a linearly independent set U as above. (53) Prove that A is positive definite if and only if A−1 is positive definite. (54) Prove that if A ∈ Cn×n is a positive semidefinite matrix, then Ak is positive semidefinite for every k  1. (55) Let A ∈ Cn×n be a Hermitian matrix and let pA (λ) = λn + c1 λn−1 + · · · + cm λn−m be its characteristic polynomial, where cm = 0. Then, A is positive semidefinite if and only if ci = 0 for 0  k  m (where c0 = 1) and cj cj+1 < 0 for 0  j  m − 1. (56) Corollary 8.26 can be extended as follows. Let A ∈ Cn×n be a positive semidefinite matrix. Prove that for every k  1 there exists a positive semidefinite matrix B having the same rank as A such that (a) B k = A; (b) AB = BA; (c) B can be expressed as a polynomial in A. Solution: Since A is Hermitian, its eigenvalues are real nonnegative numbers and, by the Spectral Theorem for Hermitian matrices, there exists a unitary matrix U ∈ Cn×n such that A = 1

1

1

U H diag(λ1 , . . . , λn )U . Let B = U H diag(λ1k , . . . , λnk )U , where λik is a non-negative root of order k of λi . Thus, B k = A, B is clearly positive semidefinite, rank(B) = rank(A), and AB = BA. Let p(x) =

n  j=1

1

λjk

n  k=1,k =j

x − λk λj − λk 1

be a Lagrange interpolation polynomial such that p(λj ) = λjk (see Exercise 31). Then, 1

1

p(diag(λ1 , . . . , λn )) = diag(λ1k , . . . , λnk ),

Linear Algebra Tools for Data Mining (Second Edition)

626

so p(A) = p(U H diag(λ1 , . . . , λn )U ) = U H p(diag(λ1 , . . . , λn ))U 1

1

= U H diag(λ1k , . . . , λnk )U = B. (57) Let A ∈ Rn×n be a symmetric matrix. Prove that there exists b ∈ R such that A + b(11 − In ) is positive semidefinite, where 1 ∈ Rn . Solution: We need to find b such that for every x ∈ Rn we will have x (A + b(11 − In ))x  0. We have x (A + b(11 − In ))x = x Ax + bx 11 x − bx x  0, which amounts to

⎛

x Ax + b ⎝

n 

2 xi

⎞ − x22 ⎠  0.

i=1

Since A is symmetric, by the Rayleigh–Ritz theorem, we have x Ax  λ1 x22 , where λ1 is the least eigenvalue of A. Therefore, it suffices to take b  λ1 to satisfy the equality for every x. (58) If A ∈ Rn×n is a positive definite matrix, prove that there exist c, d > 0 such that cx22  x Ax  dx22 , for every x ∈ Rn . (59) Let A = diag(A1 , . . . , Ap ) and B = (B1 , . . . , Bq ) be two block diagonal matrices. Prove that sepF (A, B) = min{sepF (Ai , Bj ) | 1  i  p and 1  j  q}.

Similarity and Spectra

627

(60) Let A ∈ Cn×n be a Hermitian matrix. Prove that if for any λ ∈ spec(A) we have λ > −a, then the matrix A+aI is positivesemidefinite. (61) Let A ∈ Cm×m and B ∈ Cn×n be two matrices that have the eigenvalues λ1 , . . . , λm and μ1 , . . . , μn , respectively. Prove that: (a) if A and B are positive definite, then so is A ⊗ B; (b) if m = n and A, B are symmetric positive definite, the Hadamard product A  B is positive definite. Solution: The first part follows immediately from Theorem 7.24. For the second part, recall that the Hadamard product A B of two square matrices of the same format is a principal submatrix of A ⊗ B. Then, apply Theorem 8.20. (62) Let A ∈ Rn×n be a real matrix that is symmetric and positive semidefinite) such that A1n = 0n . Prove that n

2 max

1in

√ √ aii  ajj . j=1

Solution: By Corollary 8.26, A is the Gram matrix of a sequence of vectors B = (b1 , . . . , bn ), so A = B  B. Since A1n =0n , it follows that (B1n ) (B1n ) = 0, so B1n = 0n . Thus, ni=1 bi = 0n . Then we have   uj |||2  uj 2 , ui 2 = ||| − j =i

j =i

which implies 2 max ui 2  1in

n 

uj 2 .

j=1

This is immediately equivalent to the inequality to be shown. (63) Let A, B ∈ Rn . Prove that the function ψ : Rn −→ R defined by ψ(x) = Ax22 − Bx22 is a quadratic form and find the symmetric matrix that represents ψ.



Bibliographical Comments The reader should consult [162]. The solution of Supplement 11 is given in [78]. The proof of the Perron–Frobenius Theorem that we present was obtained in [119]. The proof of Theorem 8.22 is given in [171]. The Hoffman–Wielandt theorem was shown in [77]. The proof of Theorem 8.48 is given in [104]. Ky Fan’s Theorem appeared in [48]. The proof of Supplement 8.13 was obtained from Overton and Womersley [124], where the reader can find the solution of Exercise 8.13. Theorem 8.45 was obtained in [120]; the result of Theorem 8.50 appeared in [170]. The treatment of spectral resolution of matrices follows [160]. Supplement 8.13 is a result of Juh´ asz, which appears in [86]; Supplement 8.13 originated in [90]. Lemma 9.4 is a specialization of a result proved in [160] for Banach spaces.

Chapter 9

Singular Values

9.1

Introduction

The singular value decomposition has been described as the “Swiss Army knife of matrix decompositions” [121] due to its many applications in the study of matrices; from our point of view, singular value decomposition is relevant for dimensionality reduction techniques in data mining. 9.2

Singular Values and Singular Vectors

The notion of singular value introduced in this section allows us to formulate the singular value decomposition (SVD) theorem, which extends a certain property of unitarily diagonalizable matrices. Let A ∈ Cn×n be a square matrix which is unitarily diagonalizable. There exists a diagonal matrix D = diag(d1 , . . . , dn ) ∈ Cn×n and a unitary matrix X ∈ Cn×n such that A = XDXH ; equivalently, we have AX = XD. If we denote the columns of X by x1 , . . . , xn , then Axi = di xi , which shows that xi is a unit eigenvector that corresponds to the eigenvalue di for 1  i  n. Also, we have ⎛ H⎞ x1 ⎜ . ⎟ ⎟ A = (x1 · · · xn )diag(d1 , . . . , dn ) ⎜ ⎝ .. ⎠ xHn = d1 x1 xH1 + · · · + dn xn xHn . 629



This is the spectral decomposition of A, which we already discussed in Chapter 7. Note that rank(xi xHi ) = 1 for 1  i  n. The SVD theorem extends this decomposition to rectangular matrices. Theorem 9.1 (SVD Theorem). If A ∈ Cm×n is a matrix and rank(A) = r, then A can be factored as A = U DV H , where U ∈ Cm×m and V ∈ Cn×n are unitary matrices, ⎛ σ1 0 0 ⎜0 σ 0 2 ⎜ ⎜. . .. ⎜. . . ⎜. . ⎜ ⎜ D = ⎜ 0 0 ··· ⎜ ⎜ 0 0 ··· ⎜ .. ⎜ .. .. ⎝. . . 0 0 ···

···

0 0 . · · · .. σr · · · 0 ··· . · · · .. 0 ···

⎞ 0 0⎟ ⎟ .. ⎟ ⎟ .⎟ ⎟ m×n 0⎟ , ⎟∈C ⎟ 0⎟ ⎟ .. ⎟ .⎠ 0

and σ1  . . .  σr are real positive numbers. Proof. By Theorem 3.33 and Example 6.27, the square matrix AH A ∈ Cn×n has the same rank as the matrix A and is positive semidefinite. Therefore, there are r positive eigenvalues of this matrix, denoted by σ12 , . . . , σr2 , where σ1  σ2  · · ·  σr > 0. Let v 1 , . . . , v r be the corresponding pairwise orthogonal unit eigenvectors in Cn . We have AH Av i = σi2 v i for 1  i  r. Let V be the matrix V = (v 1 · · · v r v r+1 · · · v n ) obtained by completing the set {v 1 , . . . , v r } to an orthogonal basis for Cn . If V1 = (v 1 · · · vr ) and V2 = (v r+1 · · · v n ), we can write V = (V1 V2 ). The equalities involving the eigenvectors can now be written as H A AV1 = V1 E 2 , where E = diag(σ1 , . . . , σr ). Define U1 = AV1 E −1 ∈ Cm×r . We have U1H = S −1 V1H AH , so U1H U1 = S −1 V1H AH AV1 E −1 = E −1 V1H V1 E 2 E −1 = Ir , which shows that the columns of U1 are pairwise orthogonal unit vectors. Consequently, U1H AV1 E −1 = Ir , so U1H AV1 = E.



If U1 = (u1 , . . . , ur ), let U2 = (ur+1 , . . . , um ) be the matrix whose columns constitute the extension of the set {u1 , . . . , ur } to an orthogonal basis of Cm . Define U ∈ Cm×m as U = (U1 U2 ). Note that  H U1 H A(V1 V2 ) U AV = U2H  H  H U1 AV1 U1H AV2 U1 AV1 U1H AV2 = = U2H AV1 U2H AV2 U2H AV1 U2H AV2   H E O U1 AV1 O = , = O O O O which is the desired decomposition.



Observe that in the SVD described by Theorem 9.1 known as the full SVD of A, the diagonal matrix D has the same format as A, while both U and V are square unitary matrices. Definition 9.1. Let A ∈ Cm×n be a matrix. A number σ ∈ R>0 is a singular value of A if there exists a pair of vectors (u, v) ∈ Cn × Cm such that Av = σu and AH u = σv.

(9.1)

The vector u is the left singular vector and v is the right singular vector associated to the singular value σ. Note that if (u, v) is a pair of vectors associated to σ, then (au, av) is also a pair of vectors associated with σ for every a ∈ C. Let A ∈ Cm×n and let A = U DV H , where U ∈ Cm×m , D = diag(σ1 , . . . , σr , 0, . . . , 0) ∈ Cm×n , and V ∈ Cn×n . Note that Av j = U DV H vj = U Dej (because V is a unitary matrix) = σj U ej = σj uj and AH uj = V D H U H uj = V DU H uj V Dej (because U is a unitary matrix) = σj V ej = σj v j .



Thus, the jth column of the matrix U , uj and the jth column of the matrix V, v j are left and right singular vectors, respectively, associated to the singular value σj . Corollary 9.1. Let A ∈ Cm×n be a matrix and let A = UDVH be the singular value decomposition of A. If  ·  is a unitarily invariant norm, then A = D = diag(σ1 , . . . , σr , 0, . . . , 0). Proof. This statement is a direct consequence of Theorem 9.1  because the matrices U ∈ Cm×m and V ∈ Cn×n are unitary. In other words, the value of a unitarily invariant norm of a matrix depends only on its singular values. As we saw in Theorem 6.24, |||·|||2 and  · F are unitarily invariant. Therefore, the Frobenius norm can be written as

r σr2 . AF = i=1

Definition 9.2. Two matrices A, B ∈ Cm×n are unitarily equivalent (denoted by A ≡u B) if there exist two unitary matrices W1 and W2 such that A = W1H BW2 . Clearly, if A ∼u B, then A ≡u B. Theorem 9.2. Let A and B be two matrices in Cm×n . If A and B are unitarily equivalent, then they have the same singular values. Proof. Suppose that A ≡u B, that is, A = W1H BW2 for some unitary matrices W1 and W2 . If A has the SVD A = U H diag(σ1 , . . . , σr , 0, . . . , 0)V , then B = W1 AW2H = (W1 U H )diag(σ1 , . . . , σr , 0, . . . , 0)(V W2H ). Since W1 U H and V W2H are both unitary matrices, it follows that they  have the same singular values as B. Let v ∈ Cn be an eigenvector of the matrix AH A that corresponds to a non-zero, positive eigenvalue σ 2 , that is, AH Av = σ 2 v.



Define u = σ1 Av. We have Av = σu. Also,  1 H H Av = σv. A u=A σ This implies AAH u = σ 2 u, so u is an eigenvector of AAH that corresponds to the same eigenvalue σ 2 . Conversely, if u ∈ Cm is an eigenvector of the matrix AAH that corresponds to a non-zero, positive eigenvalue σ 2 , we have AAH u = σ 2 u. Thus, if v = σ1 Au, we have Av = σu and v is an eigenvector of AH A for the eigenvalue σ 2 . The Courant–Fisher theorem (Theorem 8.18) allows the formulation of a similar result for singular values. Theorem 9.3. Let A be a matrix, A ∈ Cm×n . If σ1  σ2  · · ·  σk  · · · is the non-increasing sequence of singular values of A, then σk = σk =

min

dim(S)=n−k+1

max{Ax2 | x ∈ S and x2 = 1}

max min{Ax2 | x ∈ T and x2 = 1},

dim(T )=k

where S and T range over subspaces of Cn . Proof. We give the argument only for the second equality of the theorem; the first can be shown in a similar manner. We saw that σk equals the kth largest absolute value of the eigenvalue |λk | of the matrix AH A. By the Courant–Fisher theorem, we have λk = =

max

min{xH AH Ax | x ∈ T and x2 = 1}

max

min{Ax22 | x ∈ T and x2 = 1},

dim(T )=k dim(T )=k

x x

which implies the second equality of the theorem.



Theorem 9.3 can be restated as follows; Theorem 9.4. Let A be a matrix, A ∈ Cm×n . If σ1  σ2  · · ·  σk  · · · is the non-increasing sequence of singular values of A, then σk = =

min

max{Ax2 | x ⊥ w1 , . . . , x ⊥ wk−1 and x2 = 1}

max

min{Ax2 | x ⊥ w1 , . . . , x ⊥ wn−k and x2 = 1}.

w 1 ,...,w k−1 w 1 ,...,w n−k


Proof.

Linear Algebra Tools for Data Mining (Second Edition)

The argument is similar to the one used in Theorem 8.19. 

Corollary 9.2. The smallest singular value of a matrix A ∈ Cm×n equals min{Ax2 | x ∈ Cn and x2 = 1}. The largest singular value of a matrix A ∈ Cm×n equals max{Ax2 | x ∈ Cn and x2 = 1}. Proof.

The corollary is a direct consequence of Theorem 9.3.



The SVD theorem can also be proven by induction on q = min{m, n}. In the base case, q = 1, we have A ∈ C1×1 , or A ∈ Cm×1 , or A ∈ C1×n . Suppose, for example, that A = a ∈ Cm×1 , where ⎛ ⎞ a1 ⎜ . ⎟ ⎟ a=⎜ ⎝ .. ⎠ am and let a = a2 . We seek U ∈ Cm×m , V = (v) ∈ C1×1 such that a = U diag(a)v, where

⎛ ⎞ a ⎜ 0⎟ ⎜ ⎟ m×1 ⎟ . diag(a) = ⎜ ⎜ .. ⎟ ∈ C ⎝.⎠ 0

The role of the matrix U is played by any unitary matrix which has the first column equal to ⎛ a1 ⎞ a

⎜ a2 ⎟ ⎜a⎟ ⎜ ⎟, ⎜ .. ⎟ ⎝ . ⎠ an a

and we can adopt v = 1. The remaining base subcases can be treated in a similar manner.

Singular Values

635

Suppose now that the statement holds when at least one of the numbers m and n is less than q and let us prove the assertion when at least one of m and n is less than q + 1. Let u1 be a unit eigenvector of AAH that corresponds to the eigenvalue σ12 and let v 1 = σ11 AH u1 . We have v 1 2 = 1 and Av 1 =

1 AAH u1 = σ1 u1 , σ1

which shows that (v 1 , u1 ) is a pair of singular vectors corresponding to the singular value σ1 . We have also uH1 AH v 1 =

1 H u AAH u1 = σ1 . σ1 1

Define U = (u1 U1 ) and V = (v 1 V1 ) as unitary matrices having u1 and v 1 as their first columns, respectively. Then,  H   u1 H H AH v 1 V1 U AV = H U1  H H  u1 A  v 1 V1 = H H U1 A  H H u1 A v 1 uH1 AV1 . = U1H Av 1 U1H AV1 Since U is a unitary matrix, every column of U1 is orthogonal to u1 . Therefore, U1H Av 1 =

1 H U AAH u1 = σ1 U1H u1 = 0, σ1 1

and, similarly, uH1 AH V1 = σ1 v H1 V1 = 0 , because v 1 is orthogonal on all columns of V1 . Thus,  σ1 0 H . U AV = 0 U1H AV1

636

Linear Algebra Tools for Data Mining (Second Edition)

The matrix U1H AV1 has fewer rows and columns than U H AV , so we can apply the inductive hypothesis to B = U1H AV1 . Therefore, by the inductive hypothesis, B can be written as B = XDY H , where X and Y are unitary matrices and D is a diagonal matrix. This allows us to write     0 σ1 1 0 σ1 0 1 0 H = . U AV = 0 XDY H 0 X 0 D 0 YH Since the matrices 

1 0 0 X



 and

1 0 0 YH



are unitary, we obtain the desired conclusion. If A ∈ Cn×n is an invertible matrix and σ is a singular value of A, then σ1 is a singular value of the matrix A−1 . Example 9.1. Let ⎞ a1 ⎜.⎟ ⎟ a=⎜ ⎝ .. ⎠ an ⎛

be a non-zero vector in Cn , which can also be regarded as a matrix in Cn×1 . The square of a singular value of A is an eigenvalue of the matrix ⎛ ⎞ a ¯ 1 a1 · · · a ¯ n a1 ⎜a ¯ n a2 ⎟ ⎜ ¯ 1 a2 · · · a ⎟ H ⎜ ⎟ A A=⎜ . . ⎟ . . ⎝ . ··· . ⎠ ¯ n an a ¯ 1 an · · · a and we have seen (in Corollary 7.4) that the unique non-zero eigenvalue of this matrix is a22 . Thus, the unique singular value of a is a2 .

637

Singular Values

Example 9.2. Let A ∈ R3×2 be the matrix ⎛ ⎞ 0 1 ⎜ ⎟ A = ⎝1 1⎠ . 1 0 The matrices AH A and AH A ⎛ 1 1 AAH = ⎝1 2 0 1

are given by ⎞  0 2 1 1⎠ and AH A = . 1 2 1

The eigenvalues of AH A are the roots of the polynomial λ2 − 4λ + 3, and therefore, they are λ1 = 3 and λ2 = 1. The eigenvalues of AAH are 3, 1, and 0. Unit eigenvectors of AH A that correspond to 3 and 1 are √   √  v 1 = α1

2 √2 2 2

and v 2 = α2



2 2√

2 2

,

respectively, where αi ∈ {−1, 1} for i = 1, 2. Unit eigenvectors of AH A that correspond to 3, 1, and 0 are ⎛√ ⎞ ⎛ √ ⎞ ⎛ √ ⎞ 6

3

2

2 ⎜ √6 ⎟ ⎜ 3√ ⎟ ⎜ ⎟ 6 3⎟ ⎜ ⎟ u1 = β1 ⎝ 3 ⎠ , u2 = β2 ⎝ 0 ⎠ , u3 = β3 ⎜ ⎝−√ 3 ⎠ , √ √ 6 3 − 22 6 3

respectively, where βi ∈ {−1, 1} for i = 1, 2, 3. The choice of the columns of the matrices U and V must be done such that for a pair of eigenvectors (u, v) that correspond to a singular value σ, we have v = σ1 AH u or, equivalently, u = σ1 Av, as we saw in the proof of Theorem 9.1. For instance, if we choose α1 = α2 = 1, then √   √  v1 =

2 √2 2 2

, v2 =



2 2√

2 2

,

638

and u1 =

Linear Algebra Tools for Data Mining (Second Edition) √1 Av 1 3

and u2 = Av 2 , that is ⎛√ ⎞ u1 =

6 ⎜ √6 ⎟ ⎜ 6 ⎟ , u2 ⎝ √3 ⎠ 6 6



√ ⎞ − 22 ⎜ ⎟ = ⎝ 0 ⎠, √

2 2

which means that β1 = 1 and β2 = −1; the value of β3 that corresponds to the eigenvalue of 0 can be chosen arbitrarily. Thus, an SVD of A is √ ⎛√ √ ⎞ ⎛√ ⎞ 6 2 3 − 3 0  √2 √2  6 2 3 √ ⎟⎜ ⎜√ ⎟ √2 6 2√ . A=⎜ 0 − 33 ⎟ ⎠ ⎝ 0 1⎠ ⎝ √3 2 √ √ − 22 2 6 2 3 0 0 6

2

3

The singular values of a matrix A ∈ Cm×n are uniquely determined. However, the matrices U and V of the SVD of A are not unique, as we saw in Example 9.2. Once we choose a column of the matrix V for a singular value σ, the corresponding column of U is determined by u = σ1 Av. A variant of the SVD Decomposition Theorem is given next. Corollary 9.3 (The Thin SVD Decomposition Corollary). Let A ∈ Cm×n be a matrix having non-zero singular values σ1 , σ2 , . . . , σr , where σ1  σ2  · · ·  σr > 0 and r  min{m, n}. Then, A can be factored as A = U DV H , where U ∈ Cm×r and V ∈ Cn×r are matrices having orthonormal sets of columns and D is the diagonal matrix ⎞ ⎛ σ1 0 · · · 0 ⎜ 0 σ ··· 0 ⎟ 2 ⎟ ⎜ ⎟ D=⎜ .. ⎟ . ⎜ .. .. ⎝ . . ··· . ⎠ 0 0 · · · σr Proof. The Theorem 9.1.

statement

is

an

immediate

consequence

of 

The decomposition described in Corollary 9.3 is known as a thin SVD decomposition of the matrix A.

639

Singular Values

Example 9.3. The thin SVD decomposition of the matrix A introduced in Example 9.2, ⎛ ⎞ 0 1 ⎜ ⎟ A = ⎝1 1⎠ . 1 0 is

⎛√ A=

6 ⎜ √6 ⎜ 6 ⎝ √3 6 6





2 2



⎟ 0 ⎟ ⎠ √

√

2 2

3 0 0 1

 √

2 √2 2 2



2 2√ − 22

 .

Since U and V in the thin SVD have orthonormal columns, it is easy to see that U H U = V H V = Ip .

(9.2)

Lemma 9.1. Let D ∈ Rn×n be a diagonal matrix, where D = . . , σr ) and σ1  · · ·  σr . Then, we have |||D|||2 = σ1 , and diag(σ1 , . r 2 DF = i=1 σi . Proof.

By the definition of |||D|||2 , we have |||D|||2 = max{Dx2 | x = 1} ⎫  ⎧

 n r ⎬ ⎨  σi2 |xi |2  |xi |2 = 1 . = max ⎭ ⎩  i=1 i=1

Since r

because

n

 σi2 |xi |2



i=1

σ12

r

 2

|xi |

 σ12 ,

i=1

2 i=1 |xi |

= 1, it follows that ⎫ ⎧

 n r ⎬ ⎨   σi2 |xi |2  |xi |2 = 1 = σ1 . max ⎭ ⎩  i=1

The second part is immediate.

i=1



640

Linear Algebra Tools for Data Mining (Second Edition)

Theorem 9.5. Let A ∈ Cm×n be a matrix whose singular values are r 2 σ1  · · ·  σr . Then |||A|||2 = σ1 , and AF = i=1 σi . Proof. Suppose that the SVD of A is A = U DV H , where U and V are unitary matrices. Then, by Theorem 6.24 and Lemma 9.1, we have |||A|||2 = |||U DV H |||2 = |||D|||2 = σ1 ,

r H σi2 . AF = U DV F = DF = i=1



Corollary 9.4. If A ∈ Cm×n is a matrix, then |||A|||2  AF  √ n|||A|||2 . Suppose that σ 1 (A) is the largest of the singular values of r 2 A. Then, since AF = i=1 σi , we have Proof.

σ1 (A)  AF 



√ n max σj (A)2 = σ1 (A) n, i

which is the desired double inequality.



Theorem 9.6. Let A ∈ Cn×n be an invertible matrix. If the singular values of A are σ1  · · ·  σn > 0, then cond(A) =

σ1 . σn

Proof. We have shown in Theorem 9.5 that |||A|||2 = σ1 . Since the singular values of A−1 are 1 1  ···  , σn σ1 it follows that |||A−1 |||2 = ately.

1 σn .

The desired equality follows immedi

Corollary 9.5. Let A ∈ Cn×n be an invertible matrix. We have cond(AH A) = (cond(A))2 .

Singular Values

641

Proof. Let σ be a singular value of A and let u, v be the left and right singular vectors corresponding to σ, respectively. We have Av = σu and A^H u = σv. This implies A^H A v = σ A^H u = σ^2 v, which shows that the singular values of the matrix A^H A are the squares of the singular values of A, which produces the desired conclusion.

Let A = U D V^H be an SVD of A. If we write U and V using their columns as U = (u_1 ··· u_m) and V = (v_1 ··· v_n), then A can be written as

A = U D V^H = (u_1 ··· u_m) diag(σ_1, ..., σ_r, 0, ..., 0) (v_1 ··· v_n)^H = σ_1 u_1 v_1^H + ··· + σ_r u_r v_r^H.     (9.3)

Since u_i ∈ C^m and v_i ∈ C^n, each of the matrices u_i v_i^H is an m × n matrix of rank 1. Thus, the SVD yields an expression of A as a sum of r matrices of rank 1, where r is the number of non-zero singular values of A.

Theorem 9.7. The rank-1 matrices of the form u_i v_i^H, where 1 ≤ i ≤ r, that occur in Equality (9.3) are pairwise orthogonal. Moreover, ‖u_i v_i^H‖_F = 1 for 1 ≤ i ≤ r.

Proof. For i ≠ j and 1 ≤ i, j ≤ r, we have

trace( u_i v_i^H (u_j v_j^H)^H ) = trace( u_i v_i^H v_j u_j^H ) = 0,

because the vectors v_i and v_j are orthogonal. Thus, (u_i v_i^H, u_j v_j^H) = 0. By Equality (6.12), we have

‖u_i v_i^H‖_F^2 = trace((u_i v_i^H)^H u_i v_i^H) = trace(v_i u_i^H u_i v_i^H) = 1,

because the matrices U and V are unitary.
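The conclusion of Theorem 9.7 can also be confirmed numerically; the sketch below uses an arbitrary random matrix purely for illustration.

A = randn(4, 3);
[U, D, V] = svd(A);
T1 = U(:,1) * V(:,1)';      % first rank-1 term of the expansion (9.3), without sigma_1
T2 = U(:,2) * V(:,2)';      % second rank-1 term
trace(T1' * T2)             % close to 0: the terms are orthogonal
norm(T1, 'fro')             % equals 1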



Theorem 9.7 shows that Equality (9.3) is similar to a Fourier expansion of A. Namely, the matrix is expanded in terms of the orthonormal set {u_i v_i^H | 1 ≤ i ≤ r} of the linear space of all matrices of rank no larger than r.

Theorem 9.8. Let A ∈ C^{m×n} be a matrix that has the singular value decomposition A = U D V^H. If rank(A) = r, then the first r columns of U form an orthonormal basis for range(A), and the last n − r columns of V constitute an orthonormal basis for null(A).

Proof. Since both U and V are unitary matrices, it is clear that {u_1, ..., u_r}, the set of the first r columns of U, and {v_{r+1}, ..., v_n}, the set of the last n − r columns of V, are linearly independent sets. Thus, we only need to show that ⟨u_1, ..., u_r⟩ = range(A) and ⟨v_{r+1}, ..., v_n⟩ = null(A).

By Equality (9.3), we have A = σ_1 u_1 v_1^H + ··· + σ_r u_r v_r^H. If t ∈ range(A), then t = As for some s ∈ C^n. Therefore, t = σ_1 u_1 (v_1^H s) + ··· + σ_r u_r (v_r^H s), and, since every product v_j^H s is a scalar for 1 ≤ j ≤ r, it follows that t ∈ ⟨u_1, ..., u_r⟩, so range(A) ⊆ ⟨u_1, ..., u_r⟩. To prove the reverse inclusion, note that

A((1/σ_i) v_i) = u_i

for 1 ≤ i ≤ r, due to the orthogonality of the columns of V. Thus, ⟨u_1, ..., u_r⟩ = range(A).


Note that Equality (9.3) implies that Av_j = 0 for r + 1 ≤ j ≤ n, so ⟨v_{r+1}, ..., v_n⟩ ⊆ null(A). Conversely, suppose that Aw = 0. Since the columns of V form a basis of C^n, we have w = a_1 v_1 + ··· + a_n v_n, so Aw = a_1 Av_1 + ··· + a_r Av_r = 0. The linear independence of {Av_1, ..., Av_r} = {σ_1 u_1, ..., σ_r u_r} implies a_1 = ··· = a_r = 0, so w = a_{r+1} v_{r+1} + ··· + a_n v_n, which shows that null(A) ⊆ ⟨v_{r+1}, ..., v_n⟩. Thus, null(A) = ⟨v_{r+1}, ..., v_n⟩.

Corollary 9.6. Let A ∈ C^{m×n} be a matrix that has the singular value decomposition A = U D V^H. If rank(A) = r, then the first r transposed columns of V form an orthonormal basis for the subspace generated by the rows of A.

Proof. This statement follows immediately from Theorem 9.8 applied to A^H.

9.3 Numerical Rank of Matrices

The SVD allows us to find the best approximation of a matrix by a matrix of limited rank. The central result of this section is Theorem 9.9.

Lemma 9.2. Let A = σ_1 u_1 v_1^H + ··· + σ_r u_r v_r^H be the SVD of a matrix A ∈ R^{m×n}, where σ_1 ≥ ··· ≥ σ_r > 0. For every k, 1 ≤ k ≤ r, the matrix B(k) = Σ_{i=1}^k σ_i u_i v_i^H has rank k.

Proof. The null space of the matrix B(k) consists of those vectors x such that Σ_{i=1}^k σ_i u_i v_i^H x = 0. The linear independence of the vectors u_i, and the fact that σ_i > 0 for 1 ≤ i ≤ r, implies the equalities v_i^H x = 0 for 1 ≤ i ≤ k. Thus, null(B(k)) = null((v_1 ··· v_k)^H). Since v_1, ..., v_k are linearly independent, it follows that dim(null(B(k))) = n − k, which implies rank(B(k)) = k for 1 ≤ k ≤ r.

Theorem 9.9 (Eckhart–Young Theorem). Let A ∈ C^{m×n} be a matrix whose sequence of nonzero singular values is (σ_1, ..., σ_r). Assume that σ_1 ≥ ··· ≥ σ_r > 0 and that A can be written as

A = σ_1 u_1 v_1^H + ··· + σ_r u_r v_r^H.


Let B(k) ∈ C^{m×n} be the matrix defined by B(k) = Σ_{i=1}^k σ_i u_i v_i^H. If

r_k = inf{|||A − X|||_2 | X ∈ C^{m×n} and rank(X) ≤ k},

then |||A − B(k)|||_2 = r_k = σ_{k+1} for 1 ≤ k ≤ r, where σ_{r+1} = 0, and B(k) is the best approximation of A among the matrices of rank no larger than k in the sense of the norm |||·|||_2.

Proof. Observe that

A − B(k) = Σ_{i=k+1}^r σ_i u_i v_i^H,

and the largest singular value of the matrix Σ_{i=k+1}^r σ_i u_i v_i^H is σ_{k+1}. Therefore, by Theorem 9.5,

|||A − B(k)|||_2 = σ_{k+1}

for 1 ≤ k ≤ r.

We prove now that for every matrix X ∈ C^{m×n} such that rank(X) ≤ k, we have |||A − X|||_2 ≥ σ_{k+1}. Since dim(null(X)) = n − rank(X), it follows that dim(null(X)) ≥ n − k. If T is the subspace of R^n spanned by v_1, ..., v_{k+1}, we have dim(T) = k + 1. Since dim(null(X)) + dim(T) > n, the intersection of these subspaces contains a non-zero vector and, without loss of generality, we can assume that this vector is a unit vector x. We have x = a_1 v_1 + ··· + a_k v_k + a_{k+1} v_{k+1} because x ∈ T. The orthogonality of v_1, ..., v_k, v_{k+1} implies ‖x‖_2^2 = Σ_{i=1}^{k+1} |a_i|^2 = 1. Since x ∈ null(X), we have Xx = 0, so

(A − X)x = Ax = Σ_{i=1}^{k+1} a_i A v_i = Σ_{i=1}^{k+1} a_i σ_i u_i.

Thus, we have

‖(A − X)x‖_2^2 = Σ_{i=1}^{k+1} |σ_i a_i|^2 ≥ σ_{k+1}^2 Σ_{i=1}^{k+1} |a_i|^2 = σ_{k+1}^2,

because u_1, ..., u_{k+1} are also orthonormal. This implies |||A − X|||_2 ≥ σ_{k+1} = |||A − B(k)|||_2.
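The following MATLAB sketch (with an arbitrary test matrix chosen here only for illustration) builds B(k) from the first k singular triplets and checks that |||A − B(k)|||_2 equals σ_{k+1}, as Theorem 9.9 states.

A = magic(5); k = 2;
[U, D, V] = svd(A);
Bk = U(:,1:k) * D(1:k,1:k) * V(:,1:k)';   % best rank-k approximation B(k)
norm(A - Bk, 2)                           % equals sigma_{k+1}
s = svd(A); s(k+1)                        % the (k+1)-st singular value, for comparison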


It is interesting to observe that the matrix B(k) provides an optimal approximation of A not only with respect to |||·|||_2 but also relative to the Frobenius norm.

Theorem 9.10. Using the notations introduced in Theorem 9.9, B(k) is the best approximation of A among matrices of rank no larger than k in the sense of the Frobenius norm.

Proof. Note that ‖A − B(k)‖_F^2 = ‖A‖_F^2 − Σ_{i=1}^k σ_i^2. Let X be a matrix of rank k, which can be written as X = Σ_{i=1}^k x_i y_i^H. Without loss of generality we may assume that the vectors x_1, ..., x_k are orthonormal. If this is not the case, we can use the Gram–Schmidt algorithm to express them as linear combinations of orthonormal vectors, replace these expressions in Σ_{i=1}^k x_i y_i^H, and rearrange the terms. Now, the Frobenius norm of A − X can be written as

‖A − X‖_F^2 = trace( (A − Σ_{i=1}^k x_i y_i^H)^H (A − Σ_{i=1}^k x_i y_i^H) )
            = trace( A^H A + Σ_{i=1}^k (y_i − A^H x_i)(y_i − A^H x_i)^H − Σ_{i=1}^k A^H x_i x_i^H A ).

Taking into account that trace(Σ_{i=1}^k (y_i − A^H x_i)(y_i − A^H x_i)^H) is a real non-negative number and that trace(Σ_{i=1}^k A^H x_i x_i^H A) = Σ_{i=1}^k ‖A x_i‖_F^2, we have

‖A − X‖_F^2 ≥ trace( A^H A − Σ_{i=1}^k A^H x_i x_i^H A ) = ‖A‖_F^2 − Σ_{i=1}^k ‖A x_i‖_F^2.

Let A = U diag(σ_1, ..., σ_n) V^H be the singular value decomposition of A. If V = (V_1 V_2), where V_1 has the k columns v_1, ..., v_k, D_1 = diag(σ_1, ..., σ_k), and D_2 = diag(σ_{k+1}, ..., σ_n), then we can write

A^H A = V D^H U^H U D V^H = (V_1 V_2) [ D_1^2  O ; O  D_2^2 ] [ V_1^H ; V_2^H ] = V_1 D_1^2 V_1^H + V_2 D_2^2 V_2^H,

that is, A^H A = V D^2 V^H. These equalities allow us to write

‖A x_i‖_F^2 = trace(x_i^H A^H A x_i)
            = trace( x_i^H V_1 D_1^2 V_1^H x_i + x_i^H V_2 D_2^2 V_2^H x_i )
            = ‖D_1 V_1^H x_i‖_F^2 + ‖D_2 V_2^H x_i‖_F^2
            = σ_k^2 + ( ‖D_1 V_1^H x_i‖_F^2 − σ_k^2 ‖V_1^H x_i‖_F^2 ) − ( σ_k^2 ‖V_2^H x_i‖_F^2 − ‖D_2 V_2^H x_i‖_F^2 ) − σ_k^2 (1 − ‖V^H x_i‖_F^2).

Since ‖V^H x_i‖_F = 1 (because x_i is a unit vector and V is a unitary matrix) and σ_k^2 ‖V_2^H x_i‖_F^2 − ‖D_2 V_2^H x_i‖_F^2 ≥ 0, it follows that

‖A x_i‖_F^2 ≤ σ_k^2 + ( ‖D_1 V_1^H x_i‖_F^2 − σ_k^2 ‖V_1^H x_i‖_F^2 ).

Consequently,

Σ_{i=1}^k ‖A x_i‖_F^2 ≤ k σ_k^2 + Σ_{i=1}^k ( ‖D_1 V_1^H x_i‖_F^2 − σ_k^2 ‖V_1^H x_i‖_F^2 )
                      = k σ_k^2 + Σ_{i=1}^k Σ_{j=1}^k (σ_j^2 − σ_k^2) |v_j^H x_i|^2
                      = k σ_k^2 + Σ_{j=1}^k (σ_j^2 − σ_k^2) Σ_{i=1}^k |v_j^H x_i|^2
                      ≤ Σ_{j=1}^k (σ_k^2 + (σ_j^2 − σ_k^2)) = Σ_{j=1}^k σ_j^2,

which concludes the argument.



Definition 9.3. Let A ∈ C^{m×n}. The numerical rank of A is the function numrank_A : [0, ∞) −→ N given by numrank_A(d) = min{rank(B) | |||A − B|||_2 ≤ d} for d ≥ 0.


Theorem 9.11. Let A ∈ C^{m×n} be a matrix having the sequence of non-zero singular values σ_1 ≥ σ_2 ≥ ··· ≥ σ_r. Then numrank_A(d) = k < r if and only if σ_k > d ≥ σ_{k+1}.

Proof. Let d be a number such that σ_k > d ≥ σ_{k+1}. Equivalently, by the Eckhart–Young Theorem, we have |||A − B(k − 1)|||_2 > d ≥ |||A − B(k)|||_2. Since |||A − B(k − 1)|||_2 = min{|||A − X|||_2 | rank(X) ≤ k − 1} > d, it follows that min{rank(B) | |||A − B|||_2 ≤ d} = k, so numrank_A(d) = k.

Conversely, suppose that numrank_A(d) = k. This means that the minimal rank of a matrix B such that |||A − B|||_2 ≤ d is k. Therefore, |||A − B(k − 1)|||_2 > d. On the other hand, d ≥ |||A − B(k)|||_2 because there exists a matrix C of rank k such that d ≥ |||A − C|||_2, so d ≥ |||A − B(k)|||_2 = σ_{k+1}. Thus, σ_k > d ≥ σ_{k+1}. (A short MATLAB illustration of numerical rank is given after the recapitulation below.)

Recapitulating facts that we have previously established, we have the following equivalent statements concerning a matrix A ∈ R^{m×n} and its rank r = rank(A):

(i) r = dim(range(A));
(ii) r = dim(range(A^T));
(iii) r is the maximal number of linearly independent rows of A;
(iv) r is the maximal number of linearly independent columns of A;
(v) r is minimal with the property A = Σ_{i=1}^r x_i y_i^T, where x_i ∈ R^m and y_i ∈ R^n;
(vi) r is maximal with the property that there exists a nonsingular r × r submatrix of A;
(vii) r is the number of positive singular values of A.
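The sketch below is an illustration only (the 4 × 3 test matrix is arbitrary and has exact rank 2); it shows how numrank_A(d) can be read off the singular values, in line with Theorem 9.11 and with MATLAB's rank(A, tol).

A = [1 2 3; 4 5 6; 7 8 9; 10 11 12];   % exact rank 2; the third singular value is ~1e-15
s = svd(A);
d = 1e-6;
sum(s > d)       % numerical rank for tolerance d: number of singular values above d
rank(A, d)       % MATLAB's rank with tolerance d returns the same value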

9.4 Updating SVDs

Efficient algorithms for updating the SVD of matrices are particularly useful when we deal with large matrices that cannot be entirely accommodated in the main memory of computers. Let A ∈ Cm×n be a matrix having the thin SVD decomposition A = U DV H , where rank(A) = r, U ∈ Cm×r , D = diag(σ1 , . . . , σr ), and V ∈ Cn×r . As we observed in Equality (9.2), U H U = V H V = Ir .


Suppose that A is modified by adding to A a matrix of the form X Y^H, where X ∈ C^{m×c} and Y ∈ C^{n×c}. The matrix

B = A + X Y^H = (U X) [ D  O_{r,c} ; O_{c,r}  I_c ] (V Y)^H     (9.4)

has a new SVD, B = U_1 D_1 V_1^H. We discuss a technique (due to Brand [22]) for producing the factors U_1, D_1, and V_1 of the new SVD starting from the factors of the initial SVD rather than using the matrix A.

Let P_X ∈ C^{m×b} be an orthonormal basis for range((I_m − U U^H)X) and let R_X = P_X^H (I_m − U U^H) X ∈ C^{b×c}. We have

(U X) = (U P_X) [ I_r  U^H X ; O_{b,r}  R_X ].

Similarly, let P_Y ∈ C^{n×d} be an orthonormal basis for range((I_n − V V^H)Y), and let R_Y = P_Y^H (I_n − V V^H) Y ∈ C^{d×c}. Again, we have

(V Y) = (V P_Y) [ I_r  V^H Y ; O_{d,r}  R_Y ].

Thus, Equality (9.4) can be written as

B = (U X) [ D  O_{r,c} ; O_{c,r}  I_c ] (V Y)^H
  = (U P_X) [ I_r  U^H X ; O_{b,r}  R_X ] [ D  O_{r,c} ; O_{c,r}  I_c ] [ I_r  O_{r,d} ; Y^H V  R_Y^H ] [ V^H ; P_Y^H ].

Furthermore, we have

[ I_r  U^H X ; O_{b,r}  R_X ] [ D  O_{r,c} ; O_{c,r}  I_c ] [ I_r  O_{r,d} ; Y^H V  R_Y^H ]
  = [ D + U^H X Y^H V   U^H X R_Y^H ; R_X Y^H V   R_X R_Y^H ]
  = [ D  O_{r,c} ; O_{c,r}  O_{c,c} ] + [ U^H X ; R_X ] [ V^H Y ; R_Y ]^H.


The matrix

K = [ D  O_{r,c} ; O_{c,r}  O_{c,c} ] + [ U^H X ; R_X ] [ V^H Y ; R_Y ]^H

is sparse and rather small. By diagonalizing K as K = W E Z^H, we obtain

A + X Y^H = (U P_X) W E Z^H (V P_Y)^H = ((U P_X) W) E ((V P_Y) Z)^H.

An important special case occurs when c = 1, that is, when X and Y are just vectors x ∈ C^m and y ∈ C^n, respectively. Thus, the matrix A is modified by adding a rank-1 matrix x y^H as follows:

B = A + x y^H = (U x) [ D  0_r ; 0_r^H  1 ] (V y)^H.

The previous matrices P_X and P_Y are now replaced by the vectors

p_x = (I − U U^H)x / ‖(I − U U^H)x‖,   p_y = (I − V V^H)y / ‖(I − V V^H)y‖,

and R_X, R_Y are now reduced to two numbers:

r_x = p_x^H (I − U U^H) x   and   r_y = p_y^H (I − V V^H) y.

This allows us to write

(U x) = (U p_x) [ I_r  U^H x ; 0_r^H  r_x ]   and   (V y) = (V p_y) [ I_r  V^H y ; 0_r^H  r_y ].


In this special case, Equality (9.4) amounts to

B = (U x) [ D  0_r ; 0_r^H  1 ] (V y)^H = (U p_x) [ I_r  U^H x ; 0_r^H  r_x ] [ D  0_r ; 0_r^H  1 ] [ I_r  0_r ; y^H V  r_y ] [ V^H ; p_y^H ].

Furthermore, we have

[ I_r  U^H x ; 0_r^H  r_x ] [ D  0_r ; 0_r^H  1 ] [ I_r  0_r ; y^H V  r_y ]
  = [ D + (U^H x)(y^H V)   (U^H x) r_y ; r_x (y^H V)   r_x r_y ]
  = [ D  0_r ; 0_r^H  0 ] + [ U^H x ; r_x ] [ V^H y ; r_y ]^H.

Finally, by diagonalizing K = W E Z^H, we have

A + x y^H = (U p_x) W E Z^H (V p_y)^H = ((U p_x) W) E ((V p_y) Z)^H.
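A minimal MATLAB sketch of this rank-1 update is given below; it assumes a thin SVD U, D, V of a rank-r matrix A and generic vectors x, y (all names are illustrative, and the degenerate case where x or y already lies in the corresponding range is not handled).

% Rank-1 update of a thin SVD (Brand): given A = U*D*V' of rank r, update to A + x*y'
r = 2;
A = randn(6, r) * randn(r, 4);                  % a 6x4 matrix of rank r
[U, D, V] = svd(A);
U = U(:, 1:r); D = D(1:r, 1:r); V = V(:, 1:r);  % thin SVD factors
x = randn(6, 1); y = randn(4, 1);

px = x - U*(U'*x); rx = norm(px); px = px/rx;   % part of x orthogonal to range(U)
py = y - V*(V'*y); ry = norm(py); py = py/ry;   % part of y orthogonal to range(V)

K = [D zeros(r,1); zeros(1,r) 0] + [U'*x; rx]*[V'*y; ry]';   % small (r+1)x(r+1) matrix
[W, E, Z] = svd(K);

U1 = [U px]*W;  D1 = E;  V1 = [V py]*Z;         % updated SVD factors of A + x*y'
norm(A + x*y' - U1*D1*V1', 'fro')               % should be close to 0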

9.5 Polar Form of Matrices

The SVD allows us to define the polar form of a matrix, which is an extension of the polar representation of a complex number z = |z|e^{iθ}, where |z| is the modulus of z and θ is the argument of z.

Theorem 9.12 (Polar Form Theorem). Let A ∈ C^{m×n} be a matrix with singular values σ_1, σ_2, ..., σ_r, where σ_1 ≥ σ_2 ≥ ··· ≥ σ_r > 0 and r ≤ min{m, n}. If m ≤ n, then A can be factored as A = P Y, where P ∈ C^{m×m} is a positive semidefinite matrix such that P^2 = A A^H and Y ∈ C^{m×n} is a matrix having a set of orthonormal rows (that is, it satisfies the equality Y Y^H = I_m). If m ≥ n, then A can be factored as A = W Q, where W ∈ C^{m×n} has orthonormal columns (that is, W^H W = I_n) and Q ∈ C^{n×n} is a positive semidefinite matrix.


Proof. By the Thin SVD Decomposition Corollary (Corollary 9.3), A can be factored as A = U D V^H, where U ∈ C^{m×r} and V ∈ C^{n×r} are matrices having orthonormal columns and D = diag(σ_1, σ_2, ..., σ_r). Since U has orthonormal columns, we have U^H U = I_r, which allows us to write A = (U D U^H)(U V^H). Define P = U D U^H ∈ C^{m×m}. It is easy to see that P is a positive semidefinite matrix. Also,

P^2 = U D U^H U D U^H = U D D U^H = U D V^H V D U^H = A A^H.

On the other hand, if we define Y = U V^H, we have Y Y^H = U V^H V U^H = I_m, which means that P and Y satisfy the conditions of the theorem.

To prove the second part of the theorem, we apply the first part to the matrix A^H ∈ C^{n×m}. Since A^H = V D^H U^H = V D U^H, the polar decomposition of A^H is A^H = P_1 Y_1, where P_1 = V D V^H is positive semidefinite and Y_1 = V U^H is a matrix such that Y_1 Y_1^H = I_n. Thus, A = Y_1^H P_1^H, which allows us to define W = Y_1^H and Q = P_1^H. Note that W^H W = Y_1 Y_1^H = I_n and that Q is positive semidefinite.

Corollary 9.7. If A ∈ C^{n×n}, then A can be factored as A = P Y = W Q, where P, Q ∈ C^{n×n} are positive semidefinite matrices, and Y, W ∈ C^{n×n} are unitary matrices.

Proof.

This statement follows immediately from Theorem 9.12. 

Note that the positive semidefinite matrix P in the Polar Form Theorem is uniquely determined by the matrix A. Also, the spectrum of P equals the set of singular values of A.
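As a hedged illustration of Theorem 9.12 (the 2 × 4 matrix below is arbitrary), the factors P and Y of the polar form can be assembled directly from a thin SVD in MATLAB:

A = [2 0 1 -1; 1 3 0 2];       % here m = 2 <= n = 4
[U, D, V] = svd(A, 'econ');    % U: 2x2, D: 2x2, V: 4x2
P = U * D * U';                % positive semidefinite factor with P^2 = A*A'
Y = U * V';                    % factor with orthonormal rows, Y*Y' = I_2
norm(A - P*Y, 'fro')           % close to 0
norm(P*P - A*A', 'fro')        % close to 0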

9.6 CS Decomposition

The cosine–sine decomposition of unitary matrices was introduced in Stewart’s paper [158] although it was implicit in an article by Davis and Kahan [33]. As we show in this section and in the next, this


factorization of unitary matrices, referred to as the CS decomposition, plays an essential role in the study of the geometry of subspaces.

Theorem 9.13 (Paige–Wei CS Decomposition Theorem). Let A ∈ C^{n×n} be a unitary matrix. For any partitioning

A = [ A_{11}  A_{12} ; A_{21}  A_{22} ],

where A_{11} ∈ C^{r_1×c_1}, A_{12} ∈ C^{r_1×c_2}, A_{21} ∈ C^{r_2×c_1}, A_{22} ∈ C^{r_2×c_2}, and r_1 + r_2 = c_1 + c_2 = n, there exist unitary matrices U_1, U_2, V_1, V_2 such that

(i) U = diag(U_1, U_2) and V = diag(V_1, V_2);
(ii) we have

U^H A V = D = [ D_{11}  D_{12} ; D_{21}  D_{22} ],     (9.5)

where D_{ij} ∈ R^{r_i×c_j} are real matrices for 1 ≤ i, j ≤ 2, and
(iii) the matrices D_{ij} have the form

D_{11} = diag(I, C, Õ),   D_{12} = diag(Ô^H, S, I),
D_{21} = diag(Ô, S, I),   D_{22} = diag(I, −C, Õ^H),

where C = diag(γ_1, ..., γ_s), S = diag(σ_1, ..., σ_s), C^2 + S^2 = I, 1 > γ_1 ≥ ··· ≥ γ_s > 0, and 1 > σ_s ≥ ··· ≥ σ_1 > 0.

The Ô and Õ matrices are zero matrices and, depending on A and its partitioning, may have no rows or no columns. Some of the unit matrices could be nonexistent and no two need to be equal. The four C and S submatrices are square matrices with the same format and could be nonexistent.

Proof. Let A_{11} = U_1 D_{11} V_1^H be the SVD of A_{11}. Since A is a unitary matrix, no singular value of A_{11} can exceed 1 and D_{11} = U_1^H A_{11} V_1 has the prescribed form D_{11} = diag(I, C, Õ). We have

diag(U_1, I)^H A diag(V_1, I) = [ U_1^H  O ; O  I ] [ A_{11}  A_{12} ; A_{21}  A_{22} ] [ V_1  O ; O  I ] = [ U_1^H A_{11} V_1   U_1^H A_{12} ; A_{21} V_1   A_{22} ] = B,


where B is the unitary matrix  D11 B12 B= . B21 B22 By Theorem 6.68, there exist unitary matrices U2 and V2 such that the matrix (U1H A12 )V2 = B12 V2 is upper triangular and U2H (A21 V1 ) = U2H B21 is lower triangular, both having real non-negative elements on their diagonal ending in the bottom right corner. This implies    D11 B12 I O I O = C, B21 B22 O V2 O U2H where C is the matrix C=



B12 V2 D11 . U2H B21 U2H B22 V2

By the choice of U2 and V2 , C has the form ⎛ I O O ⎞ ⎜ ⎛ I OO ⎜ O C O ⎟ ⎜ ⎜O C O ˜ O O O C12 ⎟ ⎜ ⎜ C=⎜ ⎟=⎜ ˆ ˜ ⎠ ⎜ ⎝O O O ⎜ O O O ⎜ C21 C22 ⎝ F21 S O F31 F32 I

⎞ O E12 E13 O S E23 ⎟ ⎟ ⎟ O O I ⎟ ⎟. G11 G12 G13 ⎟ ⎟ ⎟ G21 G22 G23 ⎠ G31 G32 G33

H S+ The orthonormality of the first three block columns implies F21 H H F31 F32 = O, F31 = O, and F32 = O, so F21 , F31 , and F32 are zero matrices. Similarly, the orthonormality of the first three block rows implies that E12 , E12 , E23 are zero matrices. Thus, the matrix becomes ⎞ ⎛ I O O O O O ⎜O C O O S O ⎟ ⎟ ⎜ ˜ O O I ⎟ ⎜O O O ⎟. ⎜ C=⎜ ˆ ⎟ ⎜ O O O G11 G12 G13 ⎟ ⎝ O S O G21 G22 G23 ⎠ O O I G31 G32 G33

Orthogonality of the third column block on the fourth, fifth, and sixth column blocks implies that G31 , G32 , and G33 are zero matrices;


orthogonality of the third row block on the fourth and fifth row blocks implies that G13 and G23 are zero matrices. Thus, C is actually ⎞ ⎛ I O O O O O ⎜ O C O O S O⎟ ⎟ ⎜ ⎟ ⎜ ˜ O O I⎟ ⎜O O O ⎟. ⎜ C=⎜ ˆ ⎟ ⎜ O O O G11 G12 O ⎟ ⎟ ⎜ ⎝ O S O G21 G22 O ⎠ O O I O O O Orthogonality of the second and fourth blocks of columns implies SG21 = O. Since S is non-singular, G21 is a zero matrix. Similarly, the orthogonality of the second and fourth blocks of rows shows that G12 is a zero matrix. Orthogonality of the fifth and the second blocks of rows yields CS + SG22 = O, so G22 = −C (because C and S are diagonal matrices). Finally, we obtain G11 GH11 = GH11 G11 = I, so G11 is unitary. The matrix G11 can be transformed to I without affecting the rest of the matrix by replacing U2H by diag(GH11 , I, I)U2H .  The unitary character of the matrix implies C 2 + S 2 = I. Corollary 9.8 (The Thin CS Decomposition Theorem). Let A ∈ Cr×c be a matrix having orthonormal columns. For any partitioning  A1 , A= A2 where A1 ∈ Cr1 ×c , A2 ∈ Cr2 ×c , r1  c, r2  c, there exist unitary matrices U1 ∈ Cr1 ×r1 , U2 ∈ Cr2 ×r2 , and V ∈ Cc×c such that (i) U = diag(U1 , U2 ); (ii) we have  C H , U AV = D = S where C ∈ Cri ×n and S ∈ Cr2 ×n are real matrices for 1  i, j  2, and


(iii) the matrices C and S have the form C = diag(cos θ_1, ..., cos θ_n) and S = diag(sin θ_1, ..., sin θ_n), where 0 ≤ θ_1 ≤ θ_2 ≤ ··· ≤ θ_n ≤ π/2.

Proof. This statement follows from Theorem 9.13 for c_2 = 0.

Starting from Equality (9.5),

U^H A V = D = [ D_{11}  D_{12} ; D_{21}  D_{22} ],

it is possible to obtain other variants of the CS decomposition by multiplying both sides of the equality by appropriate unitary matrices that transform D to the desired aspect. Recall that all permutation matrices are unitary matrices. Permuting the first r_1 rows or the last r_2 rows in D or changing the sign of any row or column preserves the almost diagonality of the blocks D_{ij} of D.

Theorem 9.14 (Stewart–Sun CS Decomposition). Let A ∈ C^{n×n} be a unitary matrix. For any partitioning

A = [ A_{11}  A_{12} ; A_{21}  A_{22} ],

where 2ℓ ≤ n, A_{11} ∈ C^{ℓ×ℓ}, A_{12} ∈ C^{ℓ×(n−ℓ)}, A_{21} ∈ C^{(n−ℓ)×ℓ}, A_{22} ∈ C^{(n−ℓ)×(n−ℓ)}, there exist unitary matrices U, V ∈ C^{n×n} such that

(i) U = diag(U_1, U_2) and V = diag(V_1, V_2), where U_1, V_1 ∈ C^{ℓ×ℓ};
(ii) we have

U^H A V = [ C̃  −Ŝ  O ; Ŝ  C̃  O ; O  O  I_{n−2ℓ} ],     (9.6)

where C̃ = diag(γ_1, ..., γ_ℓ) and Ŝ = diag(σ_1, ..., σ_ℓ) are two diagonal matrices having non-negative elements and C̃^2 + Ŝ^2 = I_ℓ.

Proof. Since 2ℓ ≤ n, we have ℓ = r_1 = c_1 ≤ r_2 = n − ℓ in Equality (9.5). We can write Õ = (Õ′ O) and Ô^H = (O Ô′), where


ˆ  are square matrices. The matrix D can now be written ˜  and O O as ⎞ ⎛ ˆˆ ˆ  O O O I O O OO ⎟ ⎛ ⎞ ⎜ ⎜ O C O OO O S O ⎟ ˆH O O I OOO ⎟ ⎜ ⎟ ⎜O C O O S O ⎟ ⎜ ˜  ˜ ˜ ⎜ ⎜ ⎟ ⎜ O OO OO O O O ⎟ ⎟ ⎜ ⎟ H ⎟ ˆ ˜ O O O⎟ ⎜ ⎜O O O ˆ ⎟ ⎜ O O O I O O O O ⎟=⎜ D=⎜ ⎟. ⎜O ⎟ ⎜ H ˆ O O I O O ˆ O O OO I O O ⎟ ⎜ ⎟ ⎜O ⎟ ⎜ ⎟ ⎟ ⎝ O S O O −C O ⎠ ⎜ ⎜ O S O O O O −C O ⎟ ⎟ ⎜ ˜H ⎜ O O I OO O O O OO I O O O ˜ H ⎟ ⎠ ⎝ H ˜ O O O I O O O ˜O By multiplying the last two columns of the last matrix by −1 yields the matrix ⎞ ⎛ ˆ ˆO ˆ O O I O O OO ⎟ ⎜ ⎜ O C O O O O −S O ⎟ ⎟ ⎜ ⎟ ⎜ ˜ ⎜ O OO ˜O O O O ⎟ ˜ O ⎟ ⎜ H ⎜ ˆ ˆ O O O I O O O ⎟ ⎟ ⎜ O ⎟ ⎜ ⎟ ⎜O ⎜ ˆ H O O O O I O O ⎟ ⎟ ⎜ ⎜ O S O OO O C O ⎟ ⎟ ⎜ ⎜ O O I OO O O O ˜ H ⎟ ⎠ ⎝ H ˜ O O O I O O O ˜O Let now C˜ and Sˆ be the square matrices ˆ  , S, I). ˜  ) and Sˆ = diag(O C˜ = diag(I, C, O Permuting rows and columns within the main block, we obtain another variant of the CS decomposition: ⎞ ⎛ C˜ O −Sˆ O ⎜O I O O⎟ ⎟ ⎜ (9.7) U H AV = ⎜ ⎟, ⎝ Sˆ O C˜ O ⎠ OO O I


as shown in [129]. If r_1 = c_1, one obtains the matrix

U^H A V = [ C̃  −Ŝ  O ; Ŝ  C̃  O ; O  O  I_{n−2ℓ} ].     (9.8)

9.7 Geometry of Subspaces

Many of the results discussed in this section and in the next appear in [160].

Let S and T be two subspaces of C^n having the same dimension, m, and let {u_1, ..., u_m} and {v_1, ..., v_m} be two orthonormal bases for S and T, respectively. Consider the matrices B = (u_1 ··· u_m) ∈ C^{n×m} and C = (v_1 ··· v_m) ∈ C^{n×m}. The projection matrices of these subspaces are P_S = B B^H and P_T = C C^H, respectively, and we saw that these Hermitian matrices do not depend on the particular choices of the orthonormal bases of the subspaces.

Theorem 9.15. Let S and T be two subspaces of C^n such that dim(S) = dim(T) = m, and let {u_1, ..., u_m} and {v_1, ..., v_m} be two orthonormal bases for S and T, respectively. Consider the matrices B = (u_1 ··· u_m) ∈ C^{n×m} and C = (v_1 ··· v_m) ∈ C^{n×m}. The singular values γ_1, ..., γ_m of the matrix B^H C do not depend on the choices of the orthonormal bases in the subspaces S and T.

Proof. Let B̃ and C̃ be matrices that correspond to a different choice of bases in the two spaces. By Theorem 9.2, to prove the independence of the singular values on the choice of bases, it suffices to show that the matrices B^H C and B̃^H C̃ are unitarily equivalent. By Theorem 6.46, there exists a unitary matrix Q and a unitary matrix P such that B^H = Q^H B̃^H and C = C̃ P. This allows us to write B^H C = Q^H B̃^H C̃ P, which shows that B^H C ≡_u B̃^H C̃.

Since B and C have orthonormal columns, we have B^H B = I_m and C^H C = I_m, so |||B|||_2 ≤ 1 and |||C|||_2 ≤ 1. This implies |||B^H C|||_2 ≤ |||B^H|||_2 |||C|||_2 ≤ 1, so γ_i ∈ [0, 1].

Definition 9.4. Let S and T be two subspaces of C^n such that dim(S) = dim(T) = m and denote the singular values of the matrix B^H C by γ_1, ..., γ_m, where B and C are matrices that define bases of S and T, respectively. The angles between S and T are θ_1, ..., θ_m ∈ [0, π/2], where cos θ_i = γ_i for 1 ≤ i ≤ m.

The squares of the singular values of B^H C, cos² θ_i for 1 ≤ i ≤ m, are the eigenvalues of the matrices C^H B B^H C = C^H P_B C or B^H C C^H B = B^H P_C B. If we denote a matrix of a basis of the subspace that is orthogonal to range(B) by B_⊥, then the squares of the singular values of (B_⊥)^H C are the eigenvalues of the matrices C^H B_⊥ B_⊥^H C = C^H (I_n − P_B) C = I_m − C^H P_B C. These squares correspond to sin² θ_1, ..., sin² θ_m. Thus, for S = range(B) and T = range(C), the vector

SIN(S, T) = (sin θ_1, ..., sin θ_m)

of sines of the angles between the subspaces S and T consists of the singular values of the matrix B_⊥^H C.

The minimal angle between the subspaces S and T is given by cos θ_1 = γ_1.

Theorem 9.16. Let S and T be two subspaces of R^n. The least angle between S and T is determined by the equality cos θ_1 = |||P_S P_T|||_2, where P_S and P_T are the projection matrices of S and T, respectively.

Proof. We have

cos θ_1 = max{u^T v | u ∈ S, v ∈ T, ‖u‖_2 = ‖v‖_2 ≤ 1}.

For u ∈ S and v ∈ T, we have P_S u = u and P_T v = v, which implies

cos θ_1 = max{u^T P_S P_T v | u ∈ S, v ∈ T, ‖u‖_2 = ‖v‖_2 ≤ 1} = |||P_S P_T|||_2.
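The principal angles of Definition 9.4 can be computed from the SVD of B^H C. The MATLAB sketch below (two arbitrary 2-dimensional subspaces of R^5, used purely for illustration) also checks the identity of Theorem 9.16.

B = orth(randn(5,2));          % orthonormal basis of a subspace S of R^5
C = orth(randn(5,2));          % orthonormal basis of a subspace T of R^5
gam = svd(B' * C);             % cosines of the principal angles, gamma_i
theta = acos(min(gam, 1));     % the angles themselves (clamped for rounding safety)
PS = B*B'; PT = C*C';
norm(PS * PT, 2)               % equals gam(1), the cosine of the least angle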



Theorem 9.17. Let A ∈ C^{n×ℓ} and B ∈ C^{n×ℓ} be two matrices both having orthonormal columns. If 2ℓ ≤ n, there are unitary matrices Q ∈ C^{n×n}, U_1 ∈ C^{ℓ×ℓ}, and V_1 ∈ C^{ℓ×ℓ} such that

Q A U_1 = [ I_ℓ ; O_{ℓ,ℓ} ; O_{n−2ℓ,ℓ} ] ∈ C^{n×ℓ}   and   Q B V_1 = [ C ; S ; O_{n−2ℓ,ℓ} ] ∈ C^{n×ℓ},

where C = diag(γ_1, ..., γ_ℓ), S = diag(σ_1, ..., σ_ℓ), 0 ≤ γ_1 ≤ ··· ≤ γ_ℓ, 1 ≥ σ_1 ≥ ··· ≥ σ_ℓ ≥ 0, and γ_i^2 + σ_i^2 = 1 for 1 ≤ i ≤ ℓ.


Proof. Expand the matrices A and B to two unitary matrices A˜ = ˜ = (B B1 ) ∈ Cn×n , respectively, and define the (A A1 ) ∈ Cn×n and B n×n matrix D ∈ C as  H H B A B A 1 H ˜= . D = A˜ B AH1 B AH1 B1 We have A1 ∈ Cn×(n−) and B1 ∈ Cn×(n−) . The matrix D is unitary ˜ By applying the CS as a product of two unitary matrices, A˜H and B. decomposition, we obtain the existence of the unitary block diagonal matrices U = diag(U1 , U2 ) ∈ Cn×n and V = diag(V1 , V2 ) ∈ Cn×n such that U1 , V1 ∈ C× , U2 , V2 ∈ C(n−)×(n−) , and ⎛ ⎞ C −S O ⎜ ⎟ O ⎠, U H DV = ⎝ S C O O In−2 where C, S ∈ C× . In turn, this yields U1H AH BV1 = C ∈ C× ,

U1H AH B1 V2 = (−S O) ∈ C×(n−) ,  S H H ∈ C(n−)× , U2 A1 BV1 = O  C O H H ∈ C(n−)×(n−) . U2 A1 B1 V2 = O In−2 Let  Q=

U1H AH . U2H AH1

We have 

 H H U1H AH U1 A AU1 AU1 = QAU1 = U2H AH1 U2H AH1 AU1 ⎞ ⎛  I I = ⎝ O, ⎠ ∈ Cn× , = On−, On−2,

(9.9)


because A has orthonormal columns and U1 is unitary, and ⎛ ⎞ ⎛ H H⎞ C  H H U1 A U1 A BV1 ⎜ ⎟ H H = ⎝ S ⎠ ∈ Cn× . QBV1 = ⎝U2 A1 x⎠ BV1 = H H U A BV 1 2 1 y O



Further results can be derived from Theorem 9.17 and its proof. Observe that the Equalities (9.9) imply  C O S H H H H (U2 A1 B1 V2 U2 A1 BV1 ) = O I O so

 U2 A1 (B1 V2 BV1 ) = H

H

C O S O I O



∈ Cn× ,

which is equivalent to 

V2 B 1 V1H B H H

H



⎞ C O ⎟ ⎜ A1 U2 = ⎝ O I ⎠ ∈ C×n . S O ⎛

Also, 

 In− V2H B1H B 1 V2 = ∈ Cn×(n−) V1H B H On,n−

˜ is the because B1 has orthonormal columns and V2 is unitary. If Q matrix  H H V2 B 1 ˜ , Q= V1H B H then we have

⎞ ⎛  C O  I ⎟ ⎜ ˜ ˜ . QA1 U2 == ⎝ O I ⎠ and QB1 V2 = O  S O

A twin of Theorem 9.17 follows.


Theorem 9.18. Let A ∈ Cn× and B ∈ Cn× be two matrices both having orthonormal columns. If   n < 2, there are unitary matrices Q ∈ Cn×n , U1 ∈ C× , and V1 ∈ C× such that ⎛ ⎞ On−,2−n In− ⎜ ⎟ I2−n ⎠ ∈ Cn× QAU1 = ⎝O2−n,n− On−,n− On−,2−n and ⎛

C1

⎜ QBV1 = ⎝O2−n,n− S1

⎞ On−,2−n ⎟ I2−n ⎠ ∈ Cn× , On−,2−n

where C1 = diag(γ1 , . . . , γn− ), S1 = diag(σ1 , . . . , σn− ), 0  γ1  · · ·  γn− , 1  σ1  · · ·  σn−  0, and γi2 + σi2 = 1 for 1  i  n − . Proof. We are using the same notations as in Theorem 9.17 and the discussion that precedes the current theorem. Since A ∈ Cn× , B ∈ Cn× , and   n < 2, there exist two matrices A0 ∈ Cn×n− ˜ = (B0 , B) and B0 ∈ Cn− such that the matrices A˜ = (A0 , A) and B   are both unitary. For  = n −  we have 0   < n. Thus, we can apply Theorem 9.17 to the matrices A0 and B0 ; the roles played by the matrices A1 and B1 in the previous proof are played now by A and B, and the roles played by U1 and V1 are played by U2 and V2 , respectively.  Theorem 9.19. Let S and T be two subspaces of Rn such that dim(S) = dim(T ) =  where 2  n and let σ1 , . . . , σ be the sinuses of the angles between S and T . If PS and PT are the projections on S and T, then the singular values of the matrix PS (In − PT ) are σ1 , . . . , σ , 0, . . . , 0 and the singular values of the matrix PS − PT are σ1 , σ1 , . . . , σ , σ , 0, . . . , 0. Proof. Let B = (u1 · · · u ) ∈ Cn× and C = (v 1 · · · v  ) ∈ Cn× be matrices whose columns are orthonormal bases of S and T , respectively. The projection matrices of these subspaces are the Hermitian matrices PS = BB H and PT = CC H . By Theorem 9.17,


there are unitary matrices Q ∈ Cn×n , U1 ∈ C× , and V1 ∈ C× such that ⎞ ⎞ ⎛ ⎛ I C ⎟ ⎟ ⎜ ⎜ QBU1 = ⎝ O, ⎠ ∈ Cn× and QCV1 = ⎝ S ⎠ ∈ Cn× , On−2, On−2, where C = diag(γ1 , . . . , γ ), S = diag(σ1 , . . . , σ ), 0  γ1  · · · γ , 1  σ1  · · ·  σ  0, and γi2 + σi2 = 1 for 1  i  . Therefore, we have QPS (In − PT )QH = QBB H (In − CC H )QH = QBB H QH − QBB H CC H QH = (QBU1 )(U1H B H QH ) − QBU1 U1H B H QH (QCV1 )(V1H C H QH ) = (QBU1 )(QBU1 )H − (QBU1 )(QBU1 )H (QCV1 )(QCV1 )H = (QBU1 )(QBU1 )H (In − (QCV1 )(QCV1 )H ). Since

and



⎞ I ⎜ ⎟ (QBU1 )(QBU1 )H = ⎝ O, ⎠ (I O, O,n−2 ) On−2, ⎞ ⎛ O, O,n−2 I ⎟ ⎜ O, O,n−2 ⎠ = ⎝ O, On−2, On−2, On−2,n−2 ⎛ ⎜ (QCV1 )(QCV1 )H = ⎝

C S On−2,

⎞ ⎟ ⎠ (C S O,n−2 )

⎞ CS O,n−2 C2 ⎟ ⎜ S2 O,n−2 ⎠ , = ⎝ SC On−2, On−2, On−2,n−2 ⎛


it follows that QPS (In − PT )QH ⎛ ⎞⎛ ⎞ O, O,n−2 I I − C 2 −CS O,n−2 ⎜ ⎟⎜ ⎟ O, O,n−2 ⎠ ⎝ −SC I − S 2 O,n−2 ⎠ = ⎝ O, On−2, On−2, On−2,n−2 On−2, On−2, In−2 ⎛ ⎞ ⎛ ⎞ O,n−2 I − C 2 −CS S ⎜ ⎟ ⎜ ⎟ O, O,n−2 ⎠ = ⎝ O, ⎠ (S − C O,n−2 ) = ⎝ O, On−2, On−2, On−2,n−2 On−2, because I − C 2 = S 2 . Since the rows of the matrix (S − C O,n−2 ) are orthonormal, it follows that the singular values of PS (In − PT ) are σ1 , . . . , σ , 0, . . . , 0. For the matrix PS − PT , we can write Q(PS − PT )QH = QBB H QH − QCC H )QH = QBU1 U1H B H QH − QCV1 V1H C H )QH = (QBU1 )(QBU1 )H − (QCV1 )(QCV1 )H . Substituting QBU1 and QCV1 yields the equality ⎛ ⎞ O, O,n−2 I ⎜ ⎟ O, O,n−2 ⎠ Q(PS − PT )QH = ⎝ O, On−2, On−2, On−2,n−2 ⎞ ⎛ CS O,n−2 C2 ⎟ ⎜ S2 O,n−2 ⎠ − ⎝ SC On−2, On−2, On−2,n−2 ⎞ ⎛ O,n−2 I − C 2 −CS ⎟ ⎜ −S 2 O,n−2 ⎠ = ⎝ −SC On−2, On−2, On−2,n−2 ⎞ ⎛ −CS O,n−2 S2 ⎟ ⎜ −S 2 O,n−2 ⎠ . = ⎝ −SC On−2, On−2, On−2,n−2 The desired conclusion follows from Supplement 9.9.
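Theorem 9.19 can also be checked numerically. The sketch below (two random 2-dimensional subspaces of R^6, illustration only) compares the singular values of P_S(I − P_T) and of P_S − P_T with the sines of the principal angles:

B = orth(randn(6,2)); C = orth(randn(6,2));   % bases of two 2-dimensional subspaces of R^6
PS = B*B'; PT = C*C';
sort(sqrt(max(0, 1 - svd(B'*C).^2)), 'descend')'   % sines of the principal angles
svd(PS*(eye(6) - PT))'     % the same sines followed by zeros
svd(PS - PT)'              % each sine repeated twice, followed by zeros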




Theorem 9.20. Let S and T be two subspaces of R^n such that dim(S) = dim(T) = ℓ, where ℓ ≤ n < 2ℓ, and let σ_1, ..., σ_{n−ℓ} be the sines of the angles between S and T. If P_S and P_T are the projections on S and T, then the singular values of the matrix P_S(I_n − P_T) are σ_1, ..., σ_{n−ℓ}, 0, ..., 0 and the singular values of the matrix P_S − P_T are σ_1, σ_1, ..., σ_{n−ℓ}, σ_{n−ℓ}, 0, ..., 0.

Proof. This statement can be obtained from Theorem 9.18 using an argument similar to the one used in the proof of Theorem 9.19.

9.8 Spectral Resolution of a Matrix

In Theorem 7.3, we have shown that if S is an invariant subspace of a matrix A ∈ Cn×n such that dim(S) = p, and X ∈ Cn×p is a matrix whose set of columns is a basis of S, then there exists a unique matrix L ∈ Cp×p that depends only on the basis S such that AX = XL. Definition 9.5. The representation of a matrix A ∈ Cn×n relative to a basis B = {x1 , . . . , xp } of an invariant subspace S with dim(S) = p is the matrix L ∈ Cp×p that satisfies the equality AX = XL, where X = (x1 · · · xp ). Note that if p = 1, S is a unidimensional subspace and L = (λ) has as its unique entry the eigenvalue of that corresponding to this space. Lemma 9.3. Let A ∈ Cn×n and let X ∈ Cn×p be a full-rank matrix such that rank(X) = p  n. If Y ∈ Cn×(n−p) is a matrix such that range(Y ) = range(X)⊥ , then range(X) is an invariant subspace of A if and only if Y H AX = On−p,p . In this case, range(Y ) is an invariant subspace of AH . Proof. Let X and Y be matrices that satisfy the conditions formulated above. The subspace range(X) is an invariant subspace of A if and only if Ax ∈ range(X), which is equivalent to Ax ⊥ range(Y ) for every column x of X. This, in turn, is equivalent to Y H Ax = 0n−p for every column x of X and, therefore, to Y H AX = On−p,p . The last equality can be written as X H AH Y = Op,n−p , which implies that  range(Y ) is an invariant subspace of AH .


Theorem 9.21. Let S be an invariant subspace for a matrix A ∈ C^{n×n} such that dim(S) = p and let X ∈ C^{n×p} be a matrix whose columns constitute an orthonormal basis for S. Let U = (X Y) be a unitary matrix, where Y ∈ C^{n×(n−p)}. We have

U^H A U = [ K  L ; O_{n−p,p}  M ],     (9.10)

where K ∈ C^{p×p}, L ∈ C^{p×(n−p)}, and M ∈ C^{(n−p)×(n−p)}.

Proof.

We have

(X Y)^H A (X Y) = [ X^H A X   X^H A Y ; Y^H A X   Y^H A Y ].

Since (X Y) is a unitary matrix, it is clear that range(Y) = range(X)^⊥. By Lemma 9.3, it follows that Y^H A X = O_{n−p,p}. Thus, we can define K = X^H A X, M = Y^H A Y, and L = X^H A Y.

The Equality (9.10) is known as the reduced form of S with respect to the unitary matrix U = (X Y). By Theorem 7.15, spec(A) = spec(K) ∪ spec(M). If spec(K) ∩ spec(M) = ∅, we say that S = range(X) is a simple invariant space.

Theorem 9.22. Let S be a simple invariant subspace for a matrix A ∈ C^{n×n} of rank p that has the reduced form

U^H A U = [ K  L ; O_{n−p,p}  M ],     (9.11)

with respect to the unitary matrix U = (X Y), where X ∈ C^{n×p} and Y ∈ C^{n×(n−p)}. There are matrices W ∈ C^{n×(n−p)} and Z ∈ C^{n×p} such that (X W)^{−1} = (Z Y)^H and

A = X K Z^H + W M Y^H,     (9.12)

where K = Z^H A X ∈ C^{p×p} and M = Y^H A W ∈ C^{(n−p)×(n−p)}.


Proof. 

Ip On−p,p

We claim that there exists a matrix Q ∈ Cp×(n−p) such that −Q In−p



K

L M

On−p,p



Ip



Q

On−p,p In−p

 =

K On−p,p

Op,n−p . M

Indeed, the above equality is equivalent to 

K On−p,p

KQ − QM + L M



 =

K On−p,p

Op,n−p M



and the matrix Q that satisfies the equality KQ − QM = −L exists by Theorem 8.25, because S is a simple invariant space (so spec(K)∩ spec(M ) = ∅). This allows us to write 

K On−p,p

Op,n−p M





 H  X Ip −Q Q = A(X Y ) YH On−p,p In−p On−p,p In−p  H X − QY H A(X XQ + Y ). = YH Ip

Observe that −1

(X XQ + Y )

 =

X H − QY H , YH

because 

X H − QY H (X XQ + Y ) YH

= XX H + Y Y H = In .

This follows from the fact that (X Y ) is a unitary matrix. Define W = XQ + Y and Z = X − Y QH . We have (X W )−1 = (Z Y )H ; also, Z H W = Op,n−p . This allows us to write  A = (X XQ + Y )

K On−p,p

Op,n−p M



X H − QY H , YH

which is equivalent to A = XKZ H + W M Y H .

(9.13) 


Note that we have 


 I O ZH (X W ) = , YH O I

so Z H X = Ip , Z H W = Op,n−p , Y H X = On−p,p , and Y H W = In−p . (9.14) Equality (9.12) implies AX = XKZ H X + W M Y H X = X(KZ H X), because Y H X = On,n and AW = XKZ H W + W M Y H W = W (M Y H W ) because Z H W = Op,n−p . The last equality implies that range(W ) is an invariant space of A. Since (X W ) is non-singular, range(X W ) = Cn . The Equality (9.12) and its equivalent (9.13) are known as the spectral resolution equalities of A relative to the subspaces S = range(X) and T = range(W ). Note that range(W ) is a complementary invariant subspace of the subspace range(X). Let P1 = XZ H ∈ Cn×n and P2 = W Y H ∈ Cn×n . The Equalities (9.14) imply P12 = XZ H XZ H = P1 , P22 = W Y H W Y H = P2 and P1 P2 = XZ H W Y H = On,n , P2 P1 = W Y H XZ H = On,n . Also, we have A = P1 AP1 + P2 AP2 . It is clear that P1 s ∈ S for s ∈ S and P2 t ∈ T for t ∈ T , where S = range(X) and T = range(W ). Also P1 t = 0n and P2 s = 0n for s ∈ S and t ∈ T . A vector x ∈ Cn can be written as x = s + t, where s = P1 x and t = P2 x. This justifies referring to P1 as the projection of S along T . Recall that the separation of two matrices was introduced in Equality 8.3.
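The reduced form (9.10) and the vanishing (2,1)-block of Lemma 9.3 can be illustrated numerically; the sketch below is only an illustration, assumes that the random matrix A is diagonalizable, and uses a pair of eigenvectors to produce an invariant subspace.

n = 5; p = 2;
A = randn(n);
[Vec, ~] = eig(A);
X = orth(Vec(:, 1:p));         % orthonormal basis of an invariant subspace S of A
Y = null(X');                  % orthonormal basis of the orthogonal complement of S
U = [X Y];
B = U' * A * U;                % reduced form of A with respect to U = (X Y)
norm(B(p+1:n, 1:p))            % close to 0: the (2,1) block vanishes, as in (9.10)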



Lemma 9.4. Let K ∈ C^{p×p}, H ∈ C^{p×(n−p)}, G ∈ C^{(n−p)×p}, and M ∈ C^{(n−p)×(n−p)} be four matrices such that spec(K) ∩ spec(M) = ∅. Define γ, η, and δ as γ = ‖G‖_F, η = ‖H‖_F, and δ = sep(K, M). If γη < δ^2/4, then there exists a unique X ∈ C^{(n−p)×p} such that S_{M,K}(X) = M X − X K = X H X − G and ‖X‖_F ≤ 2γ/δ.

>> svd(A)
ans =
    4.3674
    1.2034
    0.0000



It is interesting to note that rank(A) = 2 since the last column of A is the sum of the first two columns. Thus, we would expect to see two non-zero singular values. To compute |||A|||_2, which equals the largest singular value of A, we can use max(svd(A)).

MATLAB has the function rank that computes the numerical rank of a matrix, numrank_A(d). By default, the value of d is set to a value known as tolerance, which is max(size(A))*eps(max(s)), where s is the vector of the singular values of A. Otherwise, we can enter our own tolerance level tol using a call rank(A,tol).

Small variations in the values of the elements of a matrix can trigger sudden changes in rank. These variations may occur due to rounding errors, as rational numbers (such as 1/3) may be affected by these rounding errors.

Example 9.4. When we enter the matrix A considered above, the system responds with

A =
    0.3333    1.3333    1.6667
    2.2000    0.8000    3.0000
    0.1429    0.5714    0.7143
    0.1111    0.4444    0.5556

Calls of the rank function with several levels of tolerance return results such as

>> rank(A,10^(-9))
ans = 2
>> rank(A,10^(-15))
ans = 2
>> rank(A,10^(-16))
ans = 2
>> rank(A,10^(-17))
ans = 3

If d of numrankA (d) is sufficiently low, the numerical rank of the representation of A (affected by rounding errors) becomes 3.



Another variant of the svd function, [U,S,V] = svd(A), yields a diagonal matrix S, of the same format as A and with nonnegative diagonal elements in decreasing order, and unitary matrices U and V so that A = U SV H . For the matrix A shown above, we obtain >> [U,S,V] = svd(A) U = -0.4487 0.7557 -0.8599 -0.5105 -0.1923 0.3239 -0.1496 0.2519 S = 4.3674 0 0 1.2034 0 0 0 0 V = -0.4775 -0.3349 -0.8123

-0.6623 0.7447 0.0823

0.4769 0.0000 -0.6726 -0.5658

0.0161 -0.0000 -0.6370 0.7707

0 0 0.0000 0

0.5774 0.5774 -0.5774

The “economical form” of the svd function is [U,S,V] = svd(A,’econ’)

If A ∈ Rm×n and m > n, only the first n columns of U are computed and S ∈ Rn×n . If m < n, only the first m columns of V are computed. Example 9.5. Starting from the matrix ⎛ ⎞ 18 8 20 ⎜−4 20 1 ⎟ ⎜ ⎟ A=⎜ ⎟ ∈ R4×3 ⎝ 25 8 27⎠ 9 4 10 a call to the economical variant of the svd function yields >> [U,D,V] = svd(A,’econ’) U = -0.5717 -0.0211 0.8095 -0.0721 -0.9933 -0.0669


-0.7656 -0.2859

0.1133 -0.0105

-0.4685 -0.3474

49.0923 0 0

0 20.2471 0

0 0 0.0000

-0.6461 -0.2706 -0.7137

0.3127 -0.9468 0.0760

0.6963 0.1741 -0.6963


S =

V =

Example 9.6. The function svapprox given in what follows computes the successive approximations B(k) = Σ_{i=1}^k σ_i u_i v_i^H of a matrix A ∈ R^{m×n} having the SVD A = U D V^H and produces a three-dimensional array C ∈ R^{m×n×r}, where r is the numerical rank of A and C(:, :, k) = B(k) for 1 ≤ k ≤ r.

function [C] = svapprox(A)
%SVAPPROX computes the successive approximations
% of A using the singular component decomposition.
% The number of approximations equals the
% numerical rank of A.
% determine the format of A and its numerical rank
[m, n] = size(A);
r = rank(A,10^(-5));
% compute the SVD of A
[U,D,V] = svd(A);
C = zeros(m,n,r);
C(:,:,1) = D(1,1) * U(:,1) * (V(:,1))';
for k=2:r
    C(:,:,k) = D(k,k) * U(:,k) * (V(:,k))' + C(:,:,k-1);
end;

This function is used in Example 9.7. Example 9.7. In Figure 4.10(a), we have an image of the digit 4 created from a pgm file that contains the representation of this digit, as discussed in Example 4.56. The numerical rank of the matrix A introduced in the example mentioned above is 8. Therefore, the array C computed by

678

Linear Algebra Tools for Data Mining (Second Edition)

(a)

Fig. 9.1

(b)

(c)

(d)

Successive approximations of A.

C = svapprox(A) consists of 8 matrices. To represent these matrices in the pgm format, we cast the components of C to integers of the type uint8 using D = min(16,uint8(C)). Thus, D(:, :, j) contains the rounded jth approximation of A and we represent the images for the first four approximations in Figure 9.1. Note that the digit four is easily recognizable beginning with the second approximation. The use of SVD in approximative recognition of digits is discussed extensively in [44].

Exercises and Supplements (1) Prove that if all singular values of a matrix A ∈ Cn×n are equal, then there exists a unitary matrix U ∈ Cn×n and a real number a such that A = aU . (2) Prove that if A ∈ Cm×n has an orthonormal set of columns (an orthonormal set of rows), then its singular values equal 0 or 1. (3) Let c1 , . . . , c , s1 , . . . , s be 2 real numbers such that s2i + c2i = 1 and si  0 for 1  i  n. Prove that the singular values of the matrix  diag(−c1 s1 , . . . , −c s ) diag(s21 , . . . , s2 ) X= −diag(s21 , . . . , s2 ) diag(−c1 s1 , . . . , −c s ) are s1 , s1 , . . . , s , s . Solution: Note that 

XX =



E O , O E

where E = diag(s21 , . . . , s2 )2 + diag(−c1 s1 , . . . , −c s )2 = diag(s41 + c21 s21 , . . . , s4 + c2 s2 ) = diag(s21 , . . . , s2 ).

Singular Values

679

Thus, XX  has the eigenvalues s21 , . . . , s2 each with algebraic multiplicity 2, which implies that X has the eigenvalues s1 , s1 , . . . , s , s . (4) Prove that if Pφ is a permutation matrix, where φ ∈ PERMm , then the set of singular values of the matrix Pφ A is the same as the set of singular values of the matrix A for every A ∈ Cm×n . (5) Prove that 0 ∈ spec(A), where A ∈ Cn×n if and only if 0 is a singular value of A. (6) Compute the singular values of the matrices  A=

   1 0 0 1 0 0 0 0 ,B = ,C = ,D = . 0 0 0 0 0 1 1 0

Let X, Y be any two of these four matrices. Verify that XY and Y X have 0 as an eigenvalue with multiplicity 2. However, the sets of singular values of each of the matrices in the pairs (AB, BA), (AD, DA), (BC, CB), and (CD, DC) are distinct. (7) Let A ∈ Cn×n be a Hermitian matrix, and let A = U H DU be the decomposition given by the spectral theorem for Hermitian matrices. (a) Prove that if spec(A) ⊆ R0 , then A = U H DU is an SVD of A. (b) If spec(A) = {λ1 , . . . , λn } ⊆ R0 , let S be diagonal matrix such that sii = 1 if λi  0 and sii = −1 if λi < 0. Prove that A = (U H S)(SD)U is an SVD of A. (8) Let A ∈ Cn×n be a Hermitian matrix such that spec(A) = {λ1 , . . . , λn } ⊆ R. Prove that the singular values of A are |λ1 |, . . . , |λn |. Solution: The singular values of A equal the square roots of the non-zero eigenvalues of the matrix AH A. Since A is Her2 mitian, these eigenvalues equal the non-zero eigenvalues of  A , which means that the singular values of A are λ21 , . . . , λ2n , that is, |λ1 |, . . . , |λn |. (9) Next we give a stronger form of Corollary 8.20 that holds for Hermitian matrices. Let A ∈ Cn×n and B ∈ Cm×m be two matrices. Prove that if A and B are Hermitian, then sepF (A, B) = min{|λ − μ| | λ ∈ spec(A), μ ∈ spec(B)}.

680

Linear Algebra Tools for Data Mining (Second Edition)

Solution: The Sylvester operator S A,B (X) = AX − XB can be written as Kronecker operations as vec(S A,B (X)) = vec(AX) − vec(XB) = (In ⊗ A)vec(X) − (B ⊗ Im )vec(X) = (In ⊗ A − B ⊗ Im )vec(X) = (A  B)vec(X) by applying the identities of Supplement 90 of Chapter 3. Since both A and B are Hermitian, by Supplement 83 of the same chapter, the matrix A  B is Hermitian and, by Theorem 7.25, spec(A  B) = {λ − μ | λ ∈ spec(A), μ ∈ spec(B)}. Also, note that vec(X)2 = XF , so S A,B (X)F = (A  B)vec(X)2 . Since sepF (A, B) = min{S A,B (X)F | XF = 1} = min{(A  B)vec(X)2 | vecXF = 1, it follows that sepF (A, B) equals the smallest singular value of the Hermitian matrix A  B. By Corollary 9.2 and by supplement 9.9, this equals min{|λ − μ| | λ ∈ spec(A), μ ∈ spec(B)}. (10) Let A ∈ Cm×n be a matrix having full rank r and let σ1  · · ·  σr > 0 be its non-zero singular values. Prove that if B ∈ Cm×n is a matrix such that |||A − B|||2 < σr , then B has full rank. (11) Let A ∈ Cm×n be a matrix. Prove that (a) for every  > 0 there exists a matrix B having distinct singular values such that A − BF < ; (b) if rank(A) = r < min{m, n}, then for every  > 0 there exists a matrix B ∈ Rm×n having full rank such that A − BF  . Solution: We discuss only the first part of this Supplement. Let A = U DV H be an SVD of the matrix A, where D = diag(σ1 , . . . , σp ) ∈ Cm×n and σ1  σ2  · · ·  σp . Suppose that m  n and consider the matrix Dδ = diag(d11 + δ, d22 + 2δ, . . . , dnn + nδ) ∈ Cm×n , where δ > 0. Clearly, dii = σi for 1  i  p and dii = 0 for p + 1  i  n. If σ1 = · · · = σp , then the diagonal elements of Dδ are all distinct. Otherwise, choose δ such that 0 < δ < min1ip−1 (σi − σi+1 ); this results again in a matrix Dδ with distinct diagonal entries. Define Aδ = U Dδ V H . Since U and V are unitary

681

Singular Values



matrices, it follows that A − Aδ F = D − Dδ F = δ Choosing δ such as   √  2 δ < min min (σi − σi+1 ),  , 1ip−1 n(n + 1)

n(n+1) . 2

implies A − Aδ F < . (12) Let A ∈ Cn×n be a non-singular matrix such that the matrix AH A can be unitarily diagonalized as AH A = U S 2 U H , where U is a unitary matrix and S = diag(σ1 , . . . , σp ). Prove that the matrix V = AH U S −1 is unitary and that A has the singular value decomposition A = U SV H . Let A ∈ Cm×n be a matrix of rank r. By Corollary 6.21, we have Cm = range(A)  null(AH ), Cn = range(AH )  null(A).

Let Brange(A) = {u1 , . . . , ur }, Bnull(AH ) = {ur+1 , . . . , um }, Brange(AH ) = {v 1 , . . . , v r }, Bnull(A) = {v r , . . . , r n }, be orthonormal bases of the subspaces range(A), null(AH ), range(AH ), and null(A), respectively. (13) Let A ∈ Cm×n be a matrix of rank r. Define the unitary matrices U = (u1 · · · ur ur+1 · · · um ) ∈ Cm×m , V = (v 1 · · · v r v r+1 · · · v n ) ∈ Cn×n . An alternative method for obtaining the singular value decomposition of a matrix is as follows. (a) Prove that  U AV = H

C Om−r,r

Or,n−r , Om−r,n−r

where C ∈ Cr×r is a matrix of rank r.



(b) Prove that there are two unitary matrices U1 ∈ Cm×m and V1 ∈ Cn×n such that  U1

C Om−r,r

 D Or,n−r Or,n−r H V1 = , Om−r,n−r Om−r,r Om−r,n−r

where D ∈ Cr×r is a diagonal matrix. (14) Let A ∈ Cn×n be a matrix whose singular values are σ1 , . . . , σn . k k Define the matrix A[k] = A ⊗ A ⊗ · · · ⊗ A ∈ Cn ×n as in Exercise 32. Prove that the singular values of A[k] have the form σi1 σi2 · · · σik , where 1  i1 , . . . , ik  n. (15) Let A ∈ Cn×n be a square matrix having the eigenvalues λ1 , λ2 , . . . , λn and the singular values σ1 , σ2 , . . . , σn , where |λ1 |  |λ2 |  · · ·  |λn | and σ1  σ2  · · ·  σn . Prove that |λ1 · · · λk |  σ1 · · · σk for every k, 1  k  n − 1 and that |λ1 · · · λn | = σ1 · · · σn . (16) Let A = P Y be the polar decomposition of the matrix A, where P ∈ Cm×m is a positive semidefinite matrix such that P 2 = AAH , and Y ∈ Cm×n is such that Y Y H = Im . Obtain the SVD of A by applying Theorem 8.14 to the Hermitian matrix P 2 . n (17) Prove that if A ∈ Cn×n , then | det(A)| = i=1 σi , where σ1 , . . . , σn are the singular values of A. (18) Prove that A ∈ Cn×n , then the singular values of the matrix adj(A) are {σj | j ∈ {1, . . . , n} − {i}} for 1  i  n. (19) Let S = {x ∈ Rn | x2 = 1} be the surface of the unit sphere in Rn and let A ∈ Rn×n be an invertible matrix whose SVD is A = U DV  , where U, V are orthogonal matrices and D = diag(σ1 , . . . , σn ). Prove that (a) the image of S under left multiplication by A, that is, the set {Ax | x ∈ S}, is an ellipsoid having the semiaxes σ1 , . . . , σn ; (b) The columns of U give the directions of the semiaxes of the ellipsoid. Solution: Since A = U DV  , it follows immediately that −1 A = V D −1 U  . Thus, if y = Ax for some x ∈ S, then, by



defining w = U  y, we have w2 w12 w22 + 2 + · · · + 2n = D −1 w22 = D −1 U  y22 2 σn σ1 σ2 = V D −1 U  u22 = A−1 y22 = A−1 Ax22 = x22 = 1. Thus, {U  Ax | x ∈ S} is an ellipsoid having σ1 , . . . , σn as semiaxes. Since multiplication by an orthogonal matrices is isometric, it follows that the set {Ax | x ∈ S} is obtained by rotating the previous ellipsoid, so it is again an ellipsoid having the same semiaxes. (20) Let A = σ1 u1 v H1 + · · · + σr ur v Hr be the SVD of a matrix A ∈ Rm×n , where σ1  · · ·  σr > 0. Prove that the matrix B(k) = k H i=1 σi ui v i from Lemma 9.2 can be written as B(k) =

k

i=1

ui (ui )H A =

k

Av i (v i )H .

i=1

In other words, B(k) is the projection of A on the subspace generated by the first k left singular vectors of A. Solution: The equalities follow immediately from Theorem 9.7. (21) Let A ∈ Cm×n be a matrix. Define the square matrix A˜ ∈ C(m+n)×(m+n) as  Om,m A ˜ . A= AH On,n Prove that: (a) A˜ is a Hermitian matrix; (b) the matrix A has the singular values σ1  · · ·  σk , where ˜ consists of the numk = min{m, n} if and only if spec(A) bers −σ1  · · ·  −σk  σk  . . .  σ1 and an additional number of 0s. Solution: The first part is immediate.



Suppose that m  n and let A = U DV H be the singular value decomposition of A, where  S , D= Om−n,n U ∈ Cm×m , V ∈ Cn×n , and S = diag(σ1 , . . . , σn ). Since m  n, we can write U = (Y Z), where Y ∈ Cm×n and Z ∈ Cm×(m−n) . Define the matrix   1 √ Y − √1 Y √1 Z 2 2 2 ∈ C(m+n)×(m+n) . W = √1 Z √1 Z O n,m−n 2 2 This matrix is unitary because  WWH =  =

√1 Y 2 √1 Z 2

− √12 Y √1 Z 2

Im Om,n On,m In

√1 Z 2



On,m−n





√1 Y H 2 ⎜− √1 Y H ⎝ 2 √1 Z H 2

√1 Z H 2 √1 Z H 2

Om−n,n

⎞ ⎟ ⎠

= Im+n .

Then, A˜ can be factored as ⎞ ⎛ On,m−n S On,n ⎟ ⎜ −S On,m−n ⎠ W H , A˜ = W ⎝ On,n Om−n,n Om−n,n Om−n,m−n which shows that the spectrum of A˜ has the form claimed above. (22) Let A and E be two matrices in Cm×n and let p = min{m, n}. Denote by σ1 (A)  · · ·  σp (A) the singular values of A arranged in decreasing order and by σ1 (E)  · · ·  σp (E) and σ1 (A + E)  · · ·  σp (A + E) the similar sequences for E and A + E, respectively, where p = min{m, n}.  Prove that |σi (A + E) − σi (A)|  |||E|||2 for 1  i  p and pi=1 (σi (A + E) − σi (A))2  E2F . ˜ E ˜ and A  Solution: Consider the matrices A, + E defined in ˜  Supplement 9.9 . It is immediate to verify that A + E = A˜ + E.



The eigenvalues of the matrix A˜ arranged in increasing order are −σ1 (A)  · · ·  −σk (A)  σk (A)  · · ·  σ1 (A), ˜ in the same order are and the eigenvalues of E −σ1 (E)  · · ·  −σk (E)  σk (E)  · · ·  σ1 (E). By Weyl’s Theorem (Supplement 30 of Chapter 8), we have −σ1 (E)  σi (A + E) − σi (A)  σ1 (E) = |||E|||2 , by Theorem 9.5, so |σi (A + E) − σi (A)|  |||E|||2 for 1  i  p. (23) Let S, T be two subspaces of Rn such that dim(S) = dim(T ) = r and let {u1 , . . . , ur } and {v 1 , . . . , v r } be two orthonormal bases of S and of T represented by the matrices BS = (u1 . . . ur ) ∈ Cn×r and BT = (v 1 . . . v r ) ∈ Cn×r , respectively. Prove that the singular values of the matrix BT BS are located in the interval [0, 1]. (24) Let  ·  be a unitarily invariant norm on Cm×n , where m  n, z ∈ Cm . Define the matrix Z = (diag(z1 , . . . , zm ) Om,n−m ∈ Cm×n . Prove that (a) the set of singular values of Z is {|z1 |2 , . . . , |zm |2 }; (b) the function g : Cm −→ R0 defined by g(z) = Z is a symmetric gauge function. (25) Let g : Rn −→ R0 be a symmetric gauge function and let  A  g be defined by Ag = g(s), where A ∈ Cm×n , and z =

σ1 . . . σn

is

a vector whose components are the singular values of A. Prove that  · g is a unitarily invariant norm. In Exercise 51 of Chapter 6, we noted that νp is a symmetric gauge function on Rn . The norms  · νp for p  1 are known as the Schatten norms. Note that for p = 2 the Schatten norm coincides with the Frobenius norm. For p = 1, we obtain Aν1 = σ1 +· · ·+σr , where σ1 , . . . , σr are the non-zero singular values of A. Since Aν1 = trace(D), where D is the diagonal matrix that occurs in the SVD of A, this Schatten norm is also known as the trace norm or Ky Fan’s norm.

686

Linear Algebra Tools for Data Mining (Second Edition)

(26) Let A ∈ Cm×n be a matrix of rank r such that there exists a factorization  R Or,n−r A=U V H, Om−r,r Om−r,n−r where R ∈ Cr×r is a matrix of rank r, and U ∈ Cm×m and V ∈ Cn×n are unitary matrices. Note that the SVD of A is a special case of this factorization. Prove that the Moore–Penrose pseudoinverse of A is given by  −1 R Ok,m−k † U H. A =V On−k,k On−k,m−k Solution: By Theorem 3.26, it suffices to verify that AA† A = A and A† AA† = A† . (27) Prove that for the matrices A(1) , . . . , A(n) , we have (A(1) ⊗ A(2) ⊗ · · · ⊗ A(n) )† = A(1)† ⊗ A(2)† ⊗ · · · A(n)† , where A(k) ∈ Rmk ×nk for 1  k  n. Solution: Note that the matrices A(i) can be factored as  Ori ,ni −ri Ri (i) A = Ui Vi , Omi −ri ,ri Omi −ri ,ni −ri where Ui ∈ Rmi ×mi , Vi ∈ Rni ×ni are orthonormal matrices, Ri ∈ Rri ×ri , and ri = rank(A(i) ). The argument is by induction on n. For the base case n = 2, observe that the pseudoinverses A(i)† are   −1 R O i Ui A(i)† = Vi 0 O for i = 1, 2. Therefore, A(1)† ⊗ A(2)†         −1 O R1−1 O R 2 = V1 ⊗ U1 ⊗ V2 U2 0 O O O   R1−1 ⊗ R2−1 O (U1 ⊗ U2 ). = (V1 ⊗ V2 ) O O

Singular Values

687

On the other hand, we have "† ! A(1) ⊗ A(2)   † R1 O R1 O  V1 ⊗ U1 V2 = U1 O O O O   †  R1 O R2 O = (U1 ⊗ U2 ) ⊗ (V1 ⊗ V2 ) O O O O   (R1 ⊗ R2 )−1 O = (V1 ⊗ V2 ) (U1 ⊗ U2 ) O O     (R1−1 ⊗ R2−1 O (U1 ⊗ U2 ) , = (V1 ⊗ V2 ) O O 



which concludes the base step. The proof of the induction step is straightforward. (28) Let A, B ∈ Rm×p and let Q ∈ Rp×p be an orthogonal matrix. Prove that (a) A − BQF is minimal if and only if trace(Q B  A) is maximal; (b) if the SVD of the matrix B  A is B  A = U diag (σ1 , . . . , σp )V  . The number trace(Q B  A) is maximal if Q = U V . Solution: As we saw in Example 6.21, A − BQ2F = trace((A − BQ) (A − BQ)) = trace((A − Q B  )(A − BQ) = trace(A A − Q B  A − A BQ + Q B  BQ) = trace(A A) − trace(Q B  A) − trace(A BQ) + trace(Q B  BQ) = A2F − 2trace(Q B  A) + trace(BQQ B  ) = A2F − 2trace(Q B  A) + trace(BB  )

688

Linear Algebra Tools for Data Mining (Second Edition)

(because the orthogonality of Q means that QQ = I) = A2F + B2F − 2trace(Q B  A), which justifies the claim made in the first part. For the second part, we start from the equality trace(Q B  A) = trace(Q U diag(σ1 , . . . , σp )V  ) = trace(V  Q U diag(σ1 , . . . , σp )). Note that the matrix T = V  Q U is orthogonal and trace(T diag(σ1 , . . . , σp ) =

p

tii σi 

i=1

p

σi  .

i=1

When Q = U V  , we have T = I, so the maximum is achieved. (29) Let A ∈ Cm×n be a matrix having σ1 , . . . , σr as its non-zero singular values. The volume of A is the number vol(A) = σ1 · · · σr . Using the notations of Supplement 35 of Chapter 5, prove that vol(A) = Dr2 (A). Solution: Let A = U DV H be the thin SVD decomposition of A (see Corollary 9.3), where U ∈ Cm×r and V ∈ Cn×r are matrices having orthonormal sets of columns and D = diag(σ1 , . . . , σr ). Then, vol(D) = σ1 · · · σr . Note that matrices U and DV H constitute a full-rank decomposition of A, so by Supplement 35 of Chapter 5, Dr2 (A) = Dr2 (U )Dr2 (DV H ), In turn, D and V H are full-rank matrices, so Dr2 (DV H ) = Dr2 = σ12 · · · σr2 , which concludes the argument. # $ 1, . . . , p m×n be a matrix and let B = A . If σ1  (30) Let A ∈ C 1, . . . , n · · ·  σn are the singular values of A and τ1  · · ·  τn are the singular values of B, prove that σi  τi for 1  i  n. Solution: We can write

 B A= , C

where C ∈ C(m−p)×n . We have AH A = B H B + C H C. The singular values of A are the eigenvalues of AH A, τ1 , . . . , τn are the

Singular Values

689

singular values B H B. The singular values of C, γ1 , . . . , γn are non-negative. By applying Theorem 8.22, we obtain σi  τi for 1  i  n. (31) Let A ∈ Cm×n be a matrix such that rank(A) = r. Prove that there exists a set of r(m + n + 1) numbers that determines the matrix A. Solution: The matrix A has mn components and a thin SVD decomposition of the form A = σ1 u1 v 1 + · · · + σr ur v r , where ui ∈ Cm and v i ∈ Cn . Thus, if we select r singular values, mr components for the vectors ui and nr components for the vectors v i , the matrix A is determined by r + mr + nr = r(m + n + 1) numbers. (32) Let A = U DV H be the thin SVD decomposition of the matrix A ∈ Cm×n . Prove that A(AH A)−1/2 = U V H . Solution: Note that AH A = V DU H U DV H = V D 2 V H because U is a Hermitian matrix. Thus, (AH A)−1/2 = V D −1 V H , which yields A(AH A)−1/2 = U DV H V D −1 V H = U V H . (33) Let A ∈ Cm×n be a matrix with m  n and rank(A) = r  n. Prove that A can be factored as A = U1 D1 V H , where U1 ∈ Cm×n matrix, D ∈ Cn×n is a diagonal matrix, and V ∈ Cn×n such that U H U = V V H = V H V = In . Solution: Starting from the full SVD of A, A = U DV H let U1 ∈ Cm×n be the matrix that consists of the first n columns of the matrix U , and let D1 be the matrix that consists of the first n rows of D. Then, A = U1 D1 V H is the desired decomposition. (34) Prove that a matrix A is subunitary if and only if its singular values belong to [0, 1]. Solution: Let A ∈ Cn×k with n  k and let A = U1 D1 V H be the SVD decomposition of A as established in Supplement 9.9. Suppose that for the largest singular value we have σ1  1 and let Z be the unitary completion of U1 to a unitary matrix (U1 Z) ∈ Cm×m . For the matrix Y = (A U1 (In − D12 )1/2 Z) ∈ Cm×(m+n) , we have Y Y H = AAH + ZZ H = Im , so Y is a semiunitary matrix, which implies that A is a subunitary matrix. Conversely, suppose that A is a subunitary matrix and let W be a semiunitary matrix such that W = (A T ) and whose rows are orthonormal. If w ∈ Cm is a vector with wH w = 1m , we

690

Linear Algebra Tools for Data Mining (Second Edition)

have wH w = wH Im w = wH (AAH + T T H )w  wH AAH w, which implies that A has no singular value greater than 1. (35) The product of two subunitary (suborthogonal) matrices is a subunitary (suborthogonal) matrix. Solution: Let A ∈ Cm×k and B ∈ Ck×n be two subunitary matrices. Let A+ be the expansion of A to a matrix with orthonormal columns and B+ be the expansion of B to a matrix with orthonormal rows. Since the non-zero singular values of the H AH = A+ AH+ are the same as the eigenvalues matrix A+ B+ B+ of AH+ A+ = I, it follows that A+ B+ is subunitary. Since AB is a submatrix of A+ B+ , it follows that AB is subunitary. (36) Let A and B be two matrices in Cn×n having the singular values σ1  · · ·  σn and τ1  · · ·  τn , respectively. Prove that n σk τ k . |trace(AB)|  σk=1

Solution: Let A = U DV H and B = W CZ H be the singular value decompositions of A and B, where U, V, W, Z are unitary matrices and D = diag(σ1 , . . . , σn ) and C = diag(τ1 , . . . , τn ). We have trace(AB) = trace(U DV H W CZ H) = trace(Z H U DV H W C) = trace(SDT C), where S = Z H U and T = V H W are unitary matrices. Therefore, trace(AB) =

n

spq tpq σp τq

p,q=1



n n 1 2 1 2 spq σp τq + tpq σp τq . 2 2 p,q=1

p,q=1

Since S and T are unitary matrices, it follows that the matrices S1 = (|spq |2 ) and T1 = (|t2pq |) are doubly stochastic, and the desired conclusion follows immediately from Supplement 106 of Chapter 3.

691

Singular Values

(37) Let A ∈ Rn×n be a suborthogonal matrix with rank(A) = r and let C ∈ Rn×n be a diagonal matrix, C = diag(c1 , . . . , cn ), where c1  · · ·  cn  0. Prove that trace(AC)  c1 + · · · + cr . Hint: Apply Supplement 9.9. (38) Let A ∈ Rm×n be a rectangular matrix. Extend the Rayleigh– Ritz function ralA defined for square matrices in Theorem 8.17 to rectangular matrices as ralA =

x Ay xy

for x ∈ Rm and y ∈ Rn . Prove that σ is a singular value corresponding to singular vectors x and y when (x, y, σ) is a critical point of ralA . Solution: Consider the Lagrangian function L : R −→ R given by

Rm

× Rn ×

L(x, y, σ) = x Ay − σ(x y − 1) that is continuously differentiable for x = 0m and y = 0n . The first-order optimality conditions yield ˜ ˜ ˜ x A x y A˜ y =σ and =σ . ˜ y ˜ x ˜ x ˜ y By defining u = A u = λv.

1 ˜ ˜ x x

and v =

1 ˜, ˜ y y

we obtain Av = λu and

Bibliographical Comments Theorem 9.9 was obtained in [41]. An important historical presentation of the singular value decomposition can be found in [159]. The proof of Theorem 9.10 is given in this article. The proof of the CS Decomposition Theorem was given in [129] and [128]. Theorem 9.22 is a result of [160]. Supplements 33 and 35 are results of ten Berge that appear in [12]. In Supplement 36, we give a result of von Neumann [85]; the solution belongs to Mirsky [114]. For Supplement 37, see [12]. The proof of Supplement 27 appears in [99].

This page intentionally left blank

Chapter 10

The k-Means Clustering

10.1

Introduction

The k-means algorithm is one of the best-known clustering algorithms and has been in existence for a long time [73]. In a recent publication [173], the k-means algorithm was listed among the top ten algorithms in data mining. This algorithm computes a partition of a set of points in Rn that consists of k blocks (clusters) such that the objects that belong to the same block have a high degree of similarity, and the objects that belong to distinct blocks are dissimilar. The algorithm requires the specification of the number of clusters k as an input. The set of objects to be clustered S = {u1 , . . . , um } is a subset of Rn . Due to its simplicity and to its many implementations, it is a very popular algorithm despite this requirement. 10.2

The k-Means Algorithm and Convexity

The k-means algorithm begins with a randomly chosen set of k points c1 , . . . , ck in Rn called centroids. An initial partition of the set S of objects is computed by assigning each object ui to its closest centroid cj and adopting a rule for breaking ties when there are several centroids that are equally distanced from ui (e.g., assigning ui to the centroid with the lowest index). As we shall see, the algorithm alternates between assigning cluster membership for each object and computing the center of each cluster.

693

694

Linear Algebra Tools for Data Mining (Second Edition)

Let Cj be the set of points assigned to the centroid cj . The assignments of objects to centroids are expressed by a matrix B = (bij ) ∈ Rm×k , where  1 if ui ∈ Cj , bij = 0 otherwise. Since m one cluster, we have k each object is assigned to exactly b = 1. On the other hand, ij j=1 i=1 bij equals the number of objects assigned to the centroid cj . After these assignments, expressed by the matrix B, the centroid cj is recomputed using the following formula: m bij ui (10.1) cj = i=1 m i=1 bij for 1 ≤ j ≤ k. The matrix of the centroids, C = (c1 · · · ck ) ∈ Rn×k can be written as



C = (u1 · · · um )Bdiag

1 1 ,..., m1 mk

 ,

 where mj = m i=1 bij is the number of objects of cluster Cj . The sum of squared errors of a partition π = {C1 , . . . , Ck } of a set of objects S is intended to measure the quality of the assignment of objects to centroids and is defined as sse(π) =

k  

d2 (u, cj ),

(10.2)

j=1 u∈Cj

where cj is the centroid of Cj for 1 ≤ j ≤ k. It is clear that sse(π) is unaffected if the data are centered.

The k-Means Clustering

695

We can write sse(π) as sse(π) =

m  k 

bij xi − cj 22

i=1 j=1

=

m  k  i=1 j=1

bij

n 

(xip − cjp )2 .

p=1

The nk necessary conditions for a local minimum of this function, m

∂sse(π)  = bij (−2(xip − cjp )) = 0 ∂cjp i=1

for 1 ≤ p ≤ n and 1 ≤ j ≤ k, can be written as m 

bij xip =

i=1

m 

bij cjp = cjp

i=1

m 

bij ,

i=1

or, as cjp

m i=1 bij xip =  m i=1 bij

for 1 ≤ p ≤ n. In vectorial form, these conditions amount to m bij xi , cj = i=1 m i=1 bij which is exactly the formula (10.1) that is used to update the centroids. Thus, the choice of the centroids can be justified by the goal of obtaining local minima of the sum of squared errors of the clusterings. Since we have new centroids, objects must be reassigned, which means that the values of bij must be re-computed, which, in turn, will affect the values of the centroids, etc. The halting criterion of the algorithm depends on particular implementations and it may involve

696

Linear Algebra Tools for Data Mining (Second Edition)

(i) performing a certain number of iterations; (ii) lowering the sum of squared errors sse(π) below a certain limit; (iii) the current partition coincides with the previous partition. This variant of the k-means algorithm is known as Forgy–Lloyd algorithm (Algorithm 10.2.1) [56, 107]: Algorithm 10.2.1: Forgy–Lloyd Algorithm Data: set of objects to be clustered S = {x1 , . . . , xm } ⊆ Rn and the number of clusters k. Result: a collection of k clusters 1 generate a randomly chosen collection of k points c1 , . . . , ck in Rn ; 2 assign each object xi to the closest centroid cj ; 3 let π = {C1 , . . . , Ck } be the partition defined by c1 , . . . , ck ; 4 recompute the centroids of the clusters C1 , . . . , Ck ; 5 while halting criterion is not met do 6 compute the new value of the partition π using the current centroids; 7 recompute the centroids of the blocks of π; 8 end The popularity of the k-means algorithm stems from its simplicity and its low time complexity which is O(km), where m is the number of objects to be clustered and  is the number of iterations that the algorithm is performing. Another variant of the k-means algorithm redistributes objects to clusters based on the effect of such a re-assignment on the objective function. If sse(π) decreases, the object is moved and the two centroids of the affected clusters are recomputed. This variant is carefully analyzed in [16]. As the iterations of the k-means algorithm develop, the algorithm may be trapped into a local minimum, which is the serious limitation of this clustering. The next theorem shows another limitation of the k-means algorithm because this algorithm produces only clusters whose convex closures may intersect only at the points of S.

The k-Means Clustering

697

Theorem 10.1. Let S = {x1 , . . . , xm } ⊆ Rn be a set of m objects. If C1 , . . . , Ck is the set of clusters computed by the k-means algorithm in any step, then the convex closure of each cluster Ci , K conv (Ci ) is included in a polytope Pi that contains ci for 1 ≤ i ≤ k. Proof. Suppose that the centroids of the partition {C1 , . . . , Ck } are c1 , . . . , ck . Let mij = 12 (ci + cj ) be the midpoint of the segment ci cj and let Hij be the hyperplane (ci − cj ) (x − mij ) = 0 that is the perpendicular bisector of the segment ci cj . Equivalently,   1 Hij = x ∈ Rm | (ci − cj ) x = (ci − cj ) (ci + cj ) . 2 The halfspaces determined by Hij are described by the following inequalities: 1 + : (ci − cj ) x ≤ (ci 22 − cj 22 ), Hij 2 1 − : (ci − cj ) x ≥ (ci 22 − cj 22 ). Hij 2 + − and cj ∈ Hij . Moreover, if d2 (ci , x) < It is easy to see that ci ∈ Hij + − d2 (cj , x), then x ∈ Hij , and if d2 (ci , x) > d2 (cj , x), then x ∈ Hij . Indeed, suppose that d2 (ci , x) < d2 (cj , x), which amounts to ci − x22 < cj − x22 . This is equivalent to

(ci − x) (ci − x) < (cj − x) (cj − x). The last inequality is equivalent to ci 22 − 2ci x < cj 22 − 2cj x, + . In other words, x is located in the same which implies that x ∈ Hij half-space as the closest centroid of the set {ci , cj }. Note also that if + − ∩ Hij = Hij , that is, d2 (ci , x) = d2 (cj , x), then x is located in Hij on the hyperplane shared by Pi and Pj . Let Pi be the closed polytope defined by + | j ∈ {1, . . . , k} − {i}}. Pi = {Hij

Objects that are closer to ci than to any other centroid cj are located in the closed polytope Pi . Thus, Ci ⊆ Pi and this implies  K conv (Ci ) ⊆ Pi .

Linear Algebra Tools for Data Mining (Second Edition)

698

10.3

Relaxation of the k-Means Problem

The idea of relaxing the requirements of the k-means algorithm in order to apply ideas that originate in principal component analysis belongs to Ding and He [35]. Let X ∈ Rm×n be the data matrix of the set S = {x1 , . . . , xm } ⊆ n R of m objects to be clustered, ⎛ ⎞ x1 ⎜ . ⎟ ⎟ X=⎜ ⎝ .. ⎠ . xm Note that ⎛

⎛  ⎞ ⎞ x1 x1 x1 · · · x1 xm ⎜ . ⎟ ⎜ . .. ⎟ ⎜ ⎟ ⎟ XX  = ⎜ . ⎠, ⎝ .. ⎠ (x1 · · · xm ) = ⎝ .. · · · xm xm x1 · · · xm xm  which implies trace(XX  ) = j=1 xj xj . To simplify the arguments, we shall assume that X is a centered matrix. Let π = {C1 , . . . , Ck } be a clustering of the set S and let mj =  |Cj | for 1 ≤ j ≤ k. Clearly, we have kj=1 mj = m. There exists a permutation matrix P that allows us to rearrange the rows of the matrix X such that the rows that belong to a cluster Cj are located in contiguous columns. In other words, we can write P X as a block matrix ⎛ ⎞ X(1) ⎜ . ⎟ ⎟ PX = ⎜ ⎝ .. ⎠ , X(k)) where X(j) ∈ Rmj ×n . The centroid cj of Cj can be written as cj = 1  mj 1mj X(j). Furthermore, we have sse(π) =

k   j=1 x∈Cj

d2 (x, cj ) =

k   j=1 x∈Cj

x − cj 22

The k-Means Clustering

=

k  

(x − cj ) (x − cj ) =

j=1 x∈Cj

=



=



=

x∈S

(x x − 2cj x + cj cj )

cj x +

k 

k  

cj cj

j=1 x∈Cj

mj cj cj

j=1

x∈S



k   j=1 x∈Cj

x x −

k   j=1 x∈Cj

x x − 2

x∈S

699

k  1  xx− 1 X(j)X(j) 1mj . mj mj 

j=1

Let Rk = (r 1 · · · r k ) ∈ Rm×k be the matrix defined by ⎛ ⎞ 0m1 ⎜ . ⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜0mj−1 ⎟ ⎟ 1 ⎜ ⎜ 1m ⎟ rj = √ j ⎟ ⎜ mj ⎜ ⎟ ⎜0mj+1 ⎟ ⎜ ⎟ ⎜ .. ⎟ ⎝ . ⎠ 0mk for 1 ≤ j ≤ k. The columns of Rkare multiples of the characteristic √ vectors of the clusters, we have kj=1 mj r j = 1m and the set of columns of Rk is orthonormal, that is, Rk Rk = Ik . Note that the 1 X  rj . centroids of the clusters can be written now as cj = √m j Now sse(π) can be written as sse(π) = trace(XX  ) − trace(Rk XX  Rk ). The first term of sse(π) is constant; therefore, to minimize sse(π), we need to maximize trace(Rk XX  Rk ). Let T ∈ Rk×k be an orthonormal matrix (T  T = T T  = Ik ) having the last column ⎛ m ⎞ 1

m

⎜ . ⎟ . ⎟ tk = ⎜ ⎝ . ⎠. m k

m

(10.3)

Linear Algebra Tools for Data Mining (Second Edition)

700

Define the matrix Qk = (q 1 · · · q k ) ∈ Rm×k as Qk = Rk T . We have q i = r 1 t1i + · · · + r k tki for 1 ≤ i ≤ k, so the last column of Qk is  qk = r1

m1 + · · · + rk m



1 mk = √ 1k . m m

Also, Qk Qk = T  Rk Rk T = T  T = Ik . Let Qk−1 be the matrix that consists of the first k − 1 columns of Qk . Clearly, Qk−1 has orthogonal columns and qi 1m = 0 for 1 ≤ i ≤ k − 1. Since Qk XX  Qk = T  Rk XX  Rk T , we have trace(Qk XX  Qk ) = trace(T  Rk XX  Rk T ) = X  Rk T 2F = X  Rk 2F = trace(Rk XX  Rk ). Therefore, sse(π) = trace(XX  ) − trace(Qk XX  Qk ). This allows us to write sse(π) = trace(XX  ) − trace(Qk XX  Qk ) 1 = trace(XX  ) − √ 1m XX  1m − trace(Qk−1 XX  Qk−1 ) m = trace(XX  ) − trace(Qk−1 XX  Qk−1 ), because X being a centered matrix, we have X  1m = 0m . The above argument reduces the k-means clustering to determining a matrix Qk−1 such that trace(Qk−1 XX  Qk−1 ) is maximal, subjected to the following restrictions: (i) Qk−1 Qk−1 = Ik−1 ; (ii) q j 1m = 0 for 1 ≤ j ≤ k; (iii) vectors of the form q j are obtained from vectors of the form r j via a linear transformation because Qk is defined as Qk = Rk T . Recall that directions and the principal components of the centered data matrix X are the vectors dj and qj , respectively, such that Xdj = σj q j and X  q j = σj dj for 1 ≤ j ≤ k.

The k-Means Clustering

701

In [35], the idea of relaxing this formulation of k-means by dropping the last requirement is introduced. By Ky Fan’s Theorem (Theorem 7.14), we have max{trace(Qk−1 XX  Qk−1 ) | Qk−1 Qk−1 = Ik−1 } = λ1 + · · · + λk−1 , where λ1 ≥ λ2 ≥ · · · are the eigenvalues of the matrix XX  , that is, the squares of the singular values of X. Moreover, the columns of the optimal Qk−1 are the k − 1 eigenvectors that correspond to the top k − 1 eigenvalues of the matrix XX  , that is, the top q − 1 principal components of X. These columns represent the continuous solutions for the transformed discrete cluster membership problem. Theorem 10.2. Let c1 , . . . , ck be the centroids of the clusters obtained by applying the k-means algorithm to the centered data matrix X ∈ Rm×n . For the optimal solution of the relaxed problem, the subspace c1 , . . . , ck  is spanned by the first k − 1 principal directions of the centered data matrix X. Proof. If q 1 , . . . , q k−1 be the vectors that correspond to the optimal solution, that is, the principal components of the centered matrix X. Since Rk = Qk T  , we can write 1 1 cj = √ X  r j = √ X  (q 1 t1j + · · · + q k tkj ) mj mj 1 = √ X  (q 1 tj1 + · · · + qk tjk ) mj 1 = √ σ1 d1 tj1 + · · · + σk dk tjk ) mj for 1 ≤ j ≤ k. Thus, every centroid belongs to the subspace generated by the principal directions of the matrix X. Conversely, let dj be a principal direction of X. Since dj = =

1  1 X qj = X  (r 1 t1j + · · · + r k tkj ) σj σj √ 1 √ ( m1 c1 t1j + · · · + mk ck tkj ), σj

which shows that every principal direction of X belongs to the sub space generated by an optimal set of centroids.

702

Linear Algebra Tools for Data Mining (Second Edition)

Note that the vectors q j may have both positive and negative components. The recovery of the characteristic vectors r 1 , . . . , r k−1 , r k of the clusters cannot proceed from the equality Qk = Rk T because the matrix T itself can be determined only after obtaining the results of the clustering. Instead, the idea proposed in [35] takes advantage of Theorem10.2.  Let C ∈ Rm×m be the matrix C = ki=1 q j q j . Recall that q j = r j T , so q j qj = r j T T  r j = r j r j . The rank-1 matrix r j r j consists of a block surrounded by 0s, so C has a block-diagonal structure. Thus, if cij = 0, xi and xj belong to the same cluster. The matrix C computed starting from q 1 , . . . , q k may have negative elements. In general, entries of C below a certain value are ignored to eliminate noise. 10.4

SVD and Clustering

We refer to the current approach for seeking a k-means clustering as the discrete clustering problem (DCP). The DCP problem can n be mrelaxed by 2seeking a subspace V of R so that dim(V ) ≤ k and i=1 d(xi , V ) is minimal. This new problem will be referred to as the continuous clustering problem (CCP). It was shown in [39], using the SVD, that the relaxation of the problem because it can be used to obtain a 2-approximation of the optimum for the original problem and the approximative solution is interesting in its own right. Furthermore, the CCP problem can be solved in polynomial time because it can be obtained from the SVD of a matrix constructed on the data points. It is easy to see that sse(π) ≥ gS (V ) for any k-clustering π. Indeed, if U = c1 , . . . , ck , then dim(U ) ≤ k and sse(π) ≥ gS (V ). Observe that for each cluster center cj , the set Cj = {x ∈ S | d(x, cj ) ≤ d(x, ck ), where k = j} that consists of points whose closest point in C is cj , which is a polyhedron. Thetotal  number of faces of the polyhedra C1 , . . . , Ck does not exceed k2 because each face is determined by at least one pair of points. The hyperplanes that define the faces of the polyhedra can be moved without modifying the partition κ such that each face contains

The k-Means Clustering

703

at least n points of the set S. If the points of S are in general position and 0n ∈ S, each face contains n affinely independent points of S. Observe that there are    m  k n (10.4) N (k, m, n) = k ≤ t ≤ t 2 hyperplanes each of which contains n affinely independent points of M . The following is an enumerative algorithm for resolving the discrete clustering problem (Algorithm 10.4.1). Algorithm 10.4.1: Enumerative Algorithm Data: set of objects to be clustered S = {x1 , . . . , xm } ⊆ Rn and the number of clusters k. Result: a collection of k clusters 1 enumerate all N (k, m, n) sets of hyperplanes each of which contains n affinely independent points; 2 retain only family of hyperplanes that partition S into k cells; tn choices as to which cell to assign each 3 make one of the 2 point of S located on a hyperplane; 4 find the centroid of each set in the partition π and compute sse(π); Let V be a k-dimensional subspace of Rn and let z 1 , . . . , z m be the orthogonal projections of the vectors x1 , . . . , xm on V. Consider the matrices ⎛ ⎞ ⎛ ⎞ z1 x1 ⎜ . ⎟ ⎜ . ⎟ ⎟ ⎜ ⎟ A=⎜ ⎝ .. ⎠ and B = ⎝ .. ⎠ . xm z m Clearly, we have rank(B) ≤ k. We have A −

B2F

=

m  i=1

xi −

z i 22

=

m 

d(xi , V )2 .

i=1

Let (σ1 , . . . , σr ) be the sequence of non-zero singular values of A, where σ1 ≥ · · · ≥ σr > 0. The SVD Theorem implies that A can be

704

Linear Algebra Tools for Data Mining (Second Edition)

written as A = σ1 u1 v H1 + · · · + σr ur vHr . Let B(k) ∈ Cm×n be the matrix defined by B(k) =

k 

σi ui v Hi .

i=1

By Theorem 9.10, B(k) is the best approximation of A among the matrices of rank no larger than k in the sense of Frobenius norm and A − B(k)2F = A2F −

k 

σi2 .

i=1

To solve the CCP, we choose the subspace V as the subspace generated by the first k left singular vectors, u1 , . . . , uk of the matrix A. We saw in Supplement 20 of Chapter 9 that B(k) is the projection of A on this subspace. Combining the DCP Algorithm 10.4.1 with the CCP approach produces a 2-approximation of the DCP (Algorithm 10.4.2): Algorithm 10.4.2: 2-Approximation Algorithm Data: set of objects to be clustered S = {x1 , . . . , xm } ⊆ Rn and the number of clusters k. Result: a set D of cluster centers 1 compute the k-dimensional subspace generated by the first k left singular vectors, u1 , . . . , uk , of the matrix A; 2 solve the DCP with input A to obtain the set of centers D = {d1 , . . . , dk }; 3 output the set of centers {d1 , . . . , dk }. The optimal value Z obtained from the DCP algorithm satisfies the inequality Z≥

m  i=1

d(xi , z i )2 .

(10.5)

The k-Means Clustering

705

If C = {c1 , . . . , ck } is an optimal center set for the DCP, and d1 , . . . , dk are the projections of these centers on W , then Z=

m 

d(xi , C) ≥ 2

i=1

m 

d(z i , D)2 .

(10.6)

i=1

The Inequalities (10.5) and (10.6) imply 2Z ≥

m 

d(xi , z i )2 +

i=1

=



m 

d(z i , D)2

i=1

i = 1m d(xi , D)2 ,

which shows that the Algorithm 10.4.2 produces indeed a 2approximative solution of the DCP. 10.5

Evaluation of Clusterings

The silhouette method is an unsupervised method for evaluation of clusterings that computes certain coefficients for each object. The set of these coefficients allows an evaluation of the quality of the clustering. Let S = {x1 , . . . , xn } be a collection of objects, d : S × S −→ R≥0 a dissimilarity on S, and let {C1 , . . . , Ck } be a clustering of S. Suppose that xi ∈ C . The average dissimilarity of xi is given by  {d(xi , u) | u ∈ C − {xi }} , a(xi ) = |C | that is, the average dissimilarity between xi to all other objects of C , the cluster to which xi is assigned. For xi and a cluster C = C , let  {d(xi , u) | f (u) = C} , d(xi , C) = |C| be the average dissimilarity between xi and the objects of the cluster C. Definition 10.1. Let {C1 , . . . , Ck } be a clustering. A neighbor of xi is a cluster C = C for which d(xi , C) is minimal.

706

Linear Algebra Tools for Data Mining (Second Edition)

In other words, a neighbor of an object xi is “the second best choice” for a cluster for xi . Let b : S −→ R≥0 be the function defined by b(xi ) = min{d(xi , C) | C = C }. Definition 10.2. The silhouette of is the number sil(xi ) given by ⎧ i) ⎪ 1 − a(x ⎪ b(xi ) ⎨ sil(xi ) = 0 ⎪ ⎪ ⎩ b(xi ) − 1 a(xi )

the object xi for which |C | ≥ 2, if a(xi ) < b(xi ) if a(xi ) = b(xi ) if a(xi ) > b(xi ).

Equivalently, we have sil(xi ) =

b(xi ) − a(xi ) max{a(xi ), b(xi )}

for xi ∈ O. Observe that −1 ≤ sil(xi ) ≤ 1 (see Exercise 10.6). When sil(xi ) is close to 1, this means that a(xi ) is much smaller than b(xi ) and we may conclude that xi is well-classified. When sil(xi ) is near 0, it is not clear which is the best cluster for xi . Finally, if sil(xi ) is close to −1, the average distance from u to its neighbor(s) is much smaller than the average distance between xi and other objects that belong to the same cluster f (xi ). In this case, it is clear that xi is poorly classified. Definition 10.3. The average silhouette width of a cluster C is  {sil(u) | u ∈ C} . sil(C) = |C| The average silhouette width of a clustering κ is  {sil(u) | u ∈ O} . sil(κ) = |O| The silhouette of a clustering can be used for determining the “optimal” number of clusters. If the average silhouette of the clustering is above 0.7, we have a strong clustering.

The k-Means Clustering

10.6

707

MATLAB Computations

MATLAB implements the k-means algorithm using the function kmeans(T,k), where T is a data matrix in Rm×n and k is the number of clusters. The function considers the rows of T as the points to be clustered. A vector idx ∈ Rm is returned indicating that the object xi belongs to the cluster idxi . Non-numeric values in data are treated by kmeans as missing data and rows that contain such data are ignored. By default, kmeans uses the Euclidean distance between vectors. The function kmeans(S,k) has several return formats as indicated in the following table: Return format [IDX,C] [IDX, C, SUMD] [IDX, C, SUMD, D]

Effect Returns the k cluster centroid locations in the k × n matrix C Returns, additionally, the within-cluster sums of point-to-centroid distances in the row vector sumD D is an m × k matrix that contains the distances from each point to every centroid

A more general format [ ... ] = kmeans(..., ’par1’,val1, ’par2’,val2, ...)

specifies optional parameter/value pairs to direct the algorithm. The following is a partial list of these choices: Parameter ‘Distance’

Value ‘sqEuclidean’ ‘cityblock’ ‘cosine’ ‘correlation’

‘Start’ ‘sample’ ‘uniform’ ‘cluster’

‘Replicates’



Meaning Distance between objects Squared Euclidean distance (default) d1 distance 1 − cos α, where α = ∠(u, v) 1 − corr(u, v) method used to choose initial cluster centroid positions; choose k vectors from S at random (default); choose k vectors uniformly from S; perform preliminary clustering phase on a random 10% subsample of S; this first phase is itself initialized using ‘sample’. number of times to repeat the clustering, each with a new set of initial centroids;  must be a positive integer, default is 1.

Linear Algebra Tools for Data Mining (Second Edition)

708

‘EmptyAction’ ‘error’ ‘drop’ ‘singleton’ ‘Options’ ‘Display’ ‘MaxIter’

Action to take if a cluster loses all of its members; treat an empty cluster as an error (default); remove any clusters that become empty; create a new cluster consisting of the one observation furthest from its centroid. Options for the iterative algorithm used to minimize the fitting criterion; level of display output. Choices are ‘off’, (default), ‘iter’, and ‘final’. maximum number of iterations allowed; default is 100.

Example 10.1. We begin by generating a dataset containing three clusters using the function datagen introduced in Example 4.50. >> opts=statset(’Display’,’final’); >> [idclust,centers]=kmeans(T,3,’Replicates’,4,’Options’,opts); >> plot(T(idclust==1,1),T(idclust==1,2),’+’,... T(idclust==2,1),T(idclust==2,2),’*’,... T(idclust==3,1),T(idclust==3,2),’x’,... centers(:,1),centers(:,2),’o’);

The cluster centers are represented by small circles in Figure 10.1. Note the use of the MATLAB function statset that creates an option structure such that the named parameters have the specified values. ∈ Rm×n is a data matrix whose m rows are the then silhouette(X, CLUST) plots cluster silhouettes for the matrix X, with clusters defined by CLUST. By default, silhouette uses the squared Euclidean distance between points; the function call [S,H] = silhouette(X,CLUST) plots the silhouettes, and returns the silhouette values in the vector S and the figure handle in H. Other inter-point distances can be used by calling silhouette(X,CLUST,d), where d specifies the distance (’Euclidean’,’cityblock’,’cosine’, etc). The silhouettes of the 140 points in R2 generated above using If X

x1 , . . . , xm ,

[S,H]=silhouette(T,idclust)

are shown in Figure 10.2. As expected, the values of these silhouettes are quite high, since the points were grouped tightly around the centers and we used the same number of clusters (k = 3) as the number used in the generation process.

The k-Means Clustering

709

12

10

8

6

4

2

0

−2 −2

−1

0

1

Fig. 10.1

2

3

4

5

6

7

Clustering of a dataset.

Cluster

1

2

3

0

0.2

Fig. 10.2

0.4

0.6 Silhouette Value

0.8

Silhouettes of the 3-clustering.

1

710

Linear Algebra Tools for Data Mining (Second Edition)

Exercises and Supplements (1) Let c0 , c1 , and x be three vectors in Rn . Prove the following: (a) if d(c0 , c1 ) > 2d(x, c1 ), then d(x, c0 ) ≥ d(x, c1 ); (b) d(x, c1 ) ≥ max{0, d(x, c0 ) − d(c0 , c1 )}. Examine ways to use the above inequalities to improve the performance of the k-means algorithm. (2) Let X = {x1 , . . . , x2n } be a set that consists of 2n real numbers. A balanced clustering of X is a partition π = {C,  D} of X such that |C| = |D| = n. Prove that if π is such that x,y∈C d(x, y)+  x,y∈D d(x, y) is minimal, then one of the clusters consists of the first n points detected by scanning the line from left to right and the other cluster consists of the remaining points. Solution: Let z ∈ R − X be such that there are n points of X at its left and an equal number of points at its right. For an arbitrary balanced clustering {C, D} of X, define the sets Cl,z = {x ∈ C | x < z}, Dl,z = {x ∈ D | x < z},

Cr,z = {x ∈ C | x > z}, Dr,z = {x ∈ D | x > z}.

Note that |Cl,z |+|Dl,z | = |Cr,z |+|Dr,z | = n, so {Cl,z ∪Dl,z , Cr,z ∪ Dr,z } is again a balanced clustering. Suppose that Cl,z = {u1 , . . . , uk }, Dl,z = {vk+1 , . . . , vn },

Cr,z = {uk+1 , . . . , yn }, Dr,z = {v1 , . . . , vk }.

Clearly, C = Cl,z ∪ Cr,z and D = Dl,z ∪ Dr,z . If the distance between two finite subsets A, B of R is defined as d(A, B) =  {|a − b| | a ∈ A, b ∈ B}, then we claim that d(Cl,z , Cr,z ) + d(Dl,z , Dr,z ) ≥ d(Cl,z , Dl,z ) + d(Cr,z , Dr,z ). We have d(Cl,z , Cr,z ) =

k n  

(uj − ui )

i=1 j=k+1

=

k  i=1

⎛ ⎝

n 

j=k+1

⎞ uj ⎠ − (n − k)ui

The k-Means Clustering n 

=k

711

uj − (n − k)

k 

d(Dl,z , Dr,z ) =

k n  

(vi − vj )

j=k+1 i=1

=



n 

ui ,

i=1

j=k+1

k 

 vi

− kvj

i=1

j=k+1

= (n − k)

k 

n 

vi − k

i=1

vj ,

j=k+1

and d(Cl,z , Dl,z ) =

n k  

|ui − vj |,

i=1 j=k+1



k n  

(|z − ui | + |vj − z|),

i=1 j=k+1

= (n − k)

k 

|z − ui | + k

i=1

d(Cr,z , Dr,z ) =

k n  

n 

|vj − z|,

j=k+1

|uj − vi |

j=k+1 i=1



k n  

|uj − z| + |z − vi |

j=k+1 i=1

=k

n  j=k+1

|uj − z| + (n − k)

k 

|z − vi |.

i=1

The desired equality follows after some elementary algebra. (3) Let g be the centroid of a finite, non-empty subset S of Rn . ˆ Define the set of centered vectors S={y ∈ Rn | y=x−g, x ∈ S}.

712

Linear Algebra Tools for Data Mining (Second Edition)

Prove that   ˆ = {y22 | y ∈ S} {x22 | x ∈ S} − |S|g22 . (4) Let S be a finite, non-empty subset of Rn , π = {C1 , . . . , Ck } be a partition of S, g be the centroid of S, and ci be the centroid of Ci for 1 ≤ i ≤ k. Prove that g is the convex combination g=

k  |Ci | i=1

|S|

ci .

(5) This exercise, based on an example given in [88], shows that the Forgy–Lloyd k-means algorithm may converge to a locally minimal solution that is arbitrarily bad compared to the optimal solution. Let S = {x1 , x2 , x3 , x4 } ⊆ R be the set to be clustered using the Forgy–Lloyd k-means algorithm with k = 3. Suppose that x1 < x2 < x3 < x4 and that x4 − x3 < x2 − x1 < x3 − x2 . Prove that: (a) there are only three possible partition outcomes of the Forgy–Lloyd algorithm: π1 = {{x1 }, {x2 }, {x3 , x4 }}, π2 = {{x1 , x2 }, {x3 }, {x4 }}, and π3 = {{x1 }, {x2 , x3 }, {x4 }}; 2 2 3) 1) , sse(π2 ) = (x2 −x , and sse(π3 ) = (b) sse(π1 ) = (x4 −x 2 2 (x3 −x2 )2 . 2

(c) Show that π1 is the optimal partition and the ratio sse(π2 ) sse(π1 ) (where π2 is a locally minimal partition and, therefore, a possible outcome of the Forgy–Lloyd algorithm) can be arbitrarily large if the distance between x2 and x3 is sufficiently large. (6) Let π = {C1 , . . . , Ck } be a clustering of a finite set of objects S = {x1 , . . . , xm }. Prove that k  1 sse(π) = 2mj j=1

where mj = |Cj | for 1 ≤ j ≤ k.

 xp ,xq ∈Cj

xp − xq 22 ,

The k-Means Clustering

713

Solution: To prove this equality, it suffices to show that 

1 2mj

x − cj 22 =

x∈Cj



xp − xq 22 .

xp ,xq ∈Cj

This follows by writing 1 2mj =



xp − xq 22

xp ,xq ∈Cj

1 2mj



(xp xp + xq xq − 2xp xq )

xp ,xq ∈Cj



=

  1 ⎝ xp xp + mj xq xq mj 2mj xp ∈Cj



−2





xq ∈Cj

xp xq ⎠

xp ∈Cj xq ∈Cj

=

 x∈Cj

=



x x −

1  mj



xp xq

xp ∈Cj xq ∈Cj

x x − mj cj cj ,

x∈Cj

and 

x − cj 22 =

x∈Cj



(x x − 2x cj + cj cj )

x∈Cj

=



x x − mj c cj ,

x∈Cj

which confirms the needed equality. Let π = {C1 , . . . , Ck } be a partition of a finite, non-empty set S ⊆ Rn and let α be a non-negative number. Define sseα (π) as   sseα (π) = kj=1 |Cj |α x∈Cj d2 (x, cj ). Note that for α = 0, we have sse0 (π) = sse(π). This idea was introduced in [81].

Linear Algebra Tools for Data Mining (Second Edition)

714

(7) Examine the effect of using sseα with α > 0 instead of sse0 on the relative size of the clusters resulting from an application of the k-means algorithm. (8) Prove that the number of hyperplanes given in Formula (10.4)    m  k n k ≤ t ≤ t 2   2 nk is in O m 2 . (9) Let C, D ⊆ Rn be two finite, non-empty, and disjoint subsets of Rn and let  δ(C, D) = {x − y22 | x ∈ C, y ∈ D}. Prove that δ(C,C) δ(D,D) + c − d22 , where m = |C|, p = (a) δ(C,D) mp = m2 + p2 |D| and c,d are the centroids of C and  D, respectively. 2 (b) δ(C, D) + δ(C, C) + δ(D, D) = |S| y∈Sˆ y2 . Solution: For the first part, we have  δ(C, C) = {xi − xj 22 | xi , xj ∈ C}  = {xi xi − 2xi xj + xj xj | xi , xj ∈ C}  = (m − 1) {xi xi | xi ∈ C}   {xi xj | i < j}. −2 xi ∈C xj ∈C

m

= mc, the norm of the centroid of C is    {xi xi | xi ∈ C} + 2 {xi xj | i < j}, m2 c  c =

Since

i=1 xi

xi ∈C xj ∈C

so δ(C, C) + m2 c c = m Similarly, δ(D, D) + p2 d d = p For δ(C, D), we have  δ(C, D) = xi ∈C,xj ∈D

 {xi xi | xi ∈ C}.

 {xj xj | xj ∈ D}.

(xi xi + xj xj − 2xi xj )

The k-Means Clustering

=p



xi xi + m

xi ∈C

=p





xj xj − 2

xj ∈D

xi xi + m

xi ∈C



715

 

xi xj

xi ∈C xj ∈D

xj xj − 2mp c d.

xj ∈D

These equalities yield δ(C, D) δ(C, C) δ(D, D) − − mp m2 p2 1   1   = xi xi + xj xj − 2 c d m p xi ∈C

+ c c − −

xj ∈D

1   {xi xi | xi ∈ C} + d d m

1  {xj xj | xj ∈ D} p

+ c c + d d − 2c d = c − d22 , which concludes the argument for the first part. For the second part, using the expressions derived above, we obtain δ(C, D) + δ(C, C) + δ(D, D)   xi xi + m xj xj − 2mp c d =p xi ∈C



xj ∈D

+m {xi xi | xi ∈ C} − m2 c c  +p {xj xj | xj ∈ D} − p2 d d  x x − 2mp c d − m2 c c − p2 d d = |S| x∈S

= |S|



x x − mc + pd22

x∈S

= |S|



x∈S

x x − |S|2 g22

Linear Algebra Tools for Data Mining (Second Edition)

716

= |S|



y22 ,

y∈Sˆ

taking into account Exercise 3. (10) Using the notations introduced in Supplement 9, prove that for any two finite, non-empty, and disjoint subsets C, D of Rn , we have 2

δ(C, C) δ(D, D) δ(C, D) + . ≥ mp m2 p2

Hint: The inequality follows immediately from Supplement 9. (11) Let π = {C, D} be a clustering of the set S ⊆ Rn , where C = {x1 , . . . , xm } and D = {xm+1 , . . . , xm+p }, m + p = |S|, and let T = (tij ) ∈ Rs×s be the matrix defined by tij = xi − xj 22 for 1 ≤ i, j ≤ s, where s = |S| = m + p. Define   δ(C, D) δ(C, C) δ(D, D) mp 2 − − . J(π) = |S| mp m2 p2 Let q ∈ Rs be the vector given by ⎧ ⎨ p if 1 ≤ i ≤ m, ms qi =  m ⎩ − if m + 1 ≤ i ≤ m + p. ps Prove that (a) 1| S| q = 0 and q2 = 1; (b) q  T q = −J(π); (c) the previous two parts hold if the vectors of S are not arranged in any particular order. Solution: The first part is immediate. Applying the definition of T , we obtain 

q Tq =

m+p  m+p 

q i tij qj

i=1 j=1

=

m  m  i=1 j=1

q i tij qj +

m m+p   i=1 j=m+1

q i tij qj

The k-Means Clustering m+p 

m 

qi tij qj +

i=m+1 j=1

m+p 

717 m+p 

qi tij qj

i=m+1 j=m+1

m m m m+p p  1  2 = xi − xj 2 − xi − xj 22 ms s i=1 j=1



i=1 j=m+1

m+p m 1   xi − xj 22 s i=m+1 j=1

+

m+p m  ps

m+p 

xi − xj 22

i=m+1 j=m+1

p 2 m δ(C, C) − δ(C, D) + δ(D, D) ms s ps   mp 2δ(C, D) δ(C, C) δ(D, D) − − = −J(π). =− s mp m2 p2 =

(12) Prove that for 2-means clustering π = {C, D} of a finite, nonempty set S ⊆ Rn , we have sse(π)+ 12 J(π) = 12 y∈Sˆ y  y, where Sˆ is the set of centered vectors that correspond to the vectors of S. Solution: The sum sse(π) + 12 J(π) can be expressed using the function δ introduced earlier as follows: 1 sse(π) + J(π) 2 1 1 1 δ(C, C) + δ(D, D) + δ(C, D) = 2m 2p 2|S| −

pδ(C, C) mδ(D, D) − 2m|S| 2p|S|

(by Supplements 10.6 and 10.6) =

1  1 (δ(C, D) + δ(C, C) + δ(D, D)) = y22 , 2|S| 2 y∈Sˆ

(by the second part of Supplement 9), where m = |C| and p = |D|.

718

Linear Algebra Tools for Data Mining (Second Edition)

Supplement 12 suggests a way for obtaining optimal 2-means clustering. Since sse(π)+ 12 J(π) does not depend on the partition π, minimizing sse(π) is tantamount to maximizing J(π). Note that the effect of maximizing J(π) is to ensure that the average inter-cluster distance δ(C,D) mp is maximal, while the average intra-

and δ(D,D) are minimal. This ensures cluster distances δ(C,C) m2 p2 well-separated, compact clusters. In [35], a relaxation of this problem is considered by allowing the components of q to assume values in the interval [−1, 1] rather than two discrete values, but still requires that q  q = 1. Then, to maximize J(π), that is, to minimize −J(π) amounts to finding q as the unit eigenvector that corresponds to the least eigenvalue of T . An alternative relaxation method is discussed next. (13) Let S be a finite, non-empty subset of Rn with |S| = m, and let Tˆ = Hm T Hm ∈ Rm×m be the centered distance matrix, where T is the matrix of squared distances between the vectors of S. Prove that (a) for every vector q such that 1m q = 0, we have q  Tˆq = q  T q; (b) every eigenvector of Tˆ that corresponds to a non-zero eigenvalue is orthogonal on 1m . Solution: Observe that q  Tˆq = q Hm T Hm q = q  T q because Hm q = q  Hm = q. Since 1m is an eigenvector of Tˆ that corresponds to the eigenvalue 0, every eigenvector of Tˆ that belongs to a different eigenvalue is orthogonal on 1m . (14) Prove that −1 ≤ sil(x) ≤ 1 for every object x ∈ S, where S is a set equipped with a clustering. Bibliographical Comments The spectral relaxation of the k-means algorithm was obtained in [176]. The link between k-means clustering and non-negative matrix factorization was established in [36]. For a solution to Exercise 1, see [45]. Supplement 2 is a result obtained in [132]. Supplements 9–13 are based on Ding and He’s paper where the PCA-guided clustering is introduced.

Chapter 11

Data Sample Matrices

11.1

Introduction

Matrices are natural tools for organizing datasets. Let such a dataset consist of a sequence E of m vectors of Rn , (u1 , . . . , um ). The jth components (ui )j of these vectors correspond to the values of a random variable Vj , where 1 ≤ j ≤ n. This data series will be represented as a matrix having m rows u1 , . . . , um and n columns v1 , . . . , v n . We refer to matrices obtained in this manner as sample matrices. The number m is the size of the sample. In this chapter, we present algebraic properties of vectors and matrices associated with a sample matrix: the mean vector and the covariance matrix. Biplots, a technique for exploring and visualizing sample matrices, are also introduced. 11.2

The Sample Matrix

Each row vector ui corresponds to an experiment Ei in the series of experiments E = (E1 , . . . , Em ); the experiment Ei consists of measuring the n components of ui = (xi1 , . . . , xin ), as follows. v1 u1 x11 u2 x21 .. .. . . um xm1

· · · vn · · · x1n · · · x2n .. .. . . . · · · xmn

719

720

Linear Algebra Tools for Data Mining (Second Edition)

The column vector



⎞ x1j ⎜ x2j ⎟ ⎜ ⎟ v j = ⎜ .. ⎟ ⎝ . ⎠ xmj represents the measurements of the jth variable Vj of the experiment, for 1 ≤ j ≤ n, as shown in the following. These variables are usually referred to as attributes or features of the series E. Definition 11.1. The sample matrix of E is the matrix X ∈ Cm×n given by ⎛ ⎞ u1 ⎜ .. ⎟ X = ⎝ . ⎠= (v 1 · · · v n ). um

Clearly, we have (v j )i = (ui )j = xij for 1 ≤ i ≤ m and 1 ≤ j ≤ n. If E is clear from the context, the subscript E is omitted. We will use both representations of the sample matrix and will write ⎛ ⎞ u1 ⎜ .. ⎟ X = ⎝ . ⎠= (v 1 · · · v n ), um when we are interested in the vectors that represent results of experiments and X = (v 1 , . . . , v n ), when we need to work with vectors that represent the values of variables. Pairwise distances between the row vectors of the sample matrix X ∈ Rm×n can be computed with the MATLAB function pdist(X). comThis form of the function returns a vector D having m(m−1) 2 m ponents corresponding to 2 pairs of observations arranged in the order d2 (u2 , u1 ), d2 (u3 , u1 ), d2 (u3 , u2 ), . . ., that is the order of the lower triangle of the distance matrix.

Data Sample Matrices

721

Example 11.1. Let X be the data matrix ⎛ ⎞ 1 4 5 ⎜2 3 7⎟ ⎟ X=⎜ ⎝5 1 4⎠. 6 2 4 The function call D = pdist(X) returns D = 2.4495

6.0000

5.4772

7.3485

5.0990

5.0990

Equivalently, a distance matrix can be obtained using the auxiliary function squareform, by writing E = squareform(D), which yields E = 0 2.4495 6.0000 5.4772

2.4495 0 7.3485 5.0990

6.0000 7.3485 0 5.0990

5.4772 5.0990 5.0990 0

There are versions of pdist that can return other distances by using a second string parameter. For instance, pdist(X, ‘cityblock’) computes d1 (xi , xj ) and pdist(X,‘cebyshev’) computes d∞ (xi , xj ). In general, Minkowski’s distance dp can be computed using D = pdist(X,‘minkowski’,p). A linear data mapping for a data sequence (u1 , . . . , um ) ∈ Seqm (Rn ) is the morphism r : Rn −→ Rq . If R ∈ Rn×q is the matrix that represents this mapping, then r(ui ) = Rui for 1 ≤ i ≤ m. If q < n, we refer to r as a linear dimensionality-reduction mapping. The reduced data matrix is given by ⎛ ⎞ ⎛ ⎞ r(u1 ) (Ru1 ) ⎜ . ⎟ ⎜ . ⎟ m×q ⎟ ⎜ ⎟ . r(XE ) = ⎜ ⎝ .. ⎠ = ⎝ .. ⎠ = XE R ∈ R r(um ) (Rum ) The reduced dataset r(XE ) has new variables Y1 , . . . , Yq . We denote this by writing (Y1 , . . . , Yq ) = r(V1 , . . . , Vn ). The mapping r is a linear feature selection mapping if R ∈ {0, 1}q×n is a 0/1-matrix having exactly one unit in every row and at most one unit in every column.

722

Linear Algebra Tools for Data Mining (Second Edition)

Definition 11.2. Let (u1 , . . . , um ) be a series of observations in Rn . The sample mean of this sequence is the vector m 1

˜= u i ∈ Rn . (11.1) u m i=1

˜ = 0n . The series is centered if u ˜ , . . . , um − u ˜ ) is always centered. Also, Note that the series (u1 − u observe that 1 ˜ = (u1 · · · um )1m . (11.2) u m If n = 1, the series of observations is reduced to a vector v ∈ Rm . Definition 11.3. The standard deviation of a vector v ∈ Rm is the number m 1

(vi − v)2 , sv = m−1 i=1

where v is the mean of the components of v. The standard deviation of sample matrix X ∈ Rm×n , where X = (v 1 · · · v n ), is the row s = (sv1 , . . . , svn ). If the measurement scale for the variables V1 , . . . , Vn involved in the experiment are very different due to different measurement units, some variables may inappropriately influence the analysis process. Therefore, the columns of the data sample matrix need to be scaled in order to make their values comparable. To scale a matrix, we need to replace each column v i by s1v v i . This will yield a matrix having i the standard deviation of each column equal to 1. Next, we examine the effect of centering on a sample matrix. Theorem 11.1. Let X ∈ Rm×n be a sample matrix ⎛ ⎞ u1 ⎜ . ⎟ ⎟ X=⎜ ⎝ .. ⎠ . um The sample matrix that corresponds to the centered sequence is   1  ˆ X = Im − 1m 1m X. m

Data Sample Matrices

Proof.

723

The matrix that corresponds to the centered sequence is ⎞ ⎛  ˜ u1 − u ⎟ ⎜ .. ⎟ = X − 1m u ˆ =⎜ ˜ . X . ⎠ ⎝ ˜ um − u

By Equality (11.2), it follows that 1 ˆ = X − 1m u ˜  = X − 1m 1m X = X m

  1  Im − 1m 1m X, m 

which yields the desired equality.

Theorem 11.1 shows that to center a data matrix X ∈ Rm×n , we need to multiply it at the left by the centering matrix H m = Im −

1 1m 1m ∈ Rm×m , m

ˆ = Hm X. Note that Hm = Im − 1 Jm . It is easy to see that that is, X m Hm is both symmetric and idempotent. Since Hm 1m = 1m −

1 1m 1m 1m = 0, m

it follows that Hm has the eigenvalue 0. If X ∈ Rm×n is a matrix, the standard deviations are computed in MATLAB using the function std(X), which returns an n-dimensional row s containing the square roots of the sample variances of the columns of U , that is, their standard deviations. The means of the columns of X is computed in MATLAB using the function mean(X). The MATLAB function Z = zscore(X) computes a centered and scaled version of a data sample matrix having the same format as X. If X is a matrix, then z-scores are computed using the mean and standard deviation along each column of X. The columns of Z have sample mean zero and sample standard deviation one (unless a column of X is constant, in which case that column of Z is constant at 0). If we use the format [Z,mu,sigma] = zscore(X),

the mean vector is returned to mu and the vector of standard deviations, to sigma.

Linear Algebra Tools for Data Mining (Second Edition)

724

Example 11.2. Let X be the matrix X = 1 3 2 5

12 15 15 18

77 80 75 98

The means and the standard deviations of the columns of X are obtained as follows. >> m = mean(X) m = 2.7500

15.0000

82.5000

2.4495

10.5357

>> s=std(X) s = 1.7078

Finally, to compute together the mean, the standard deviation, and the matrix Z, we write >> [Z,m,s]=zscore(A) Z = -1.0247 0.1464 -0.4392 1.3175

-1.2247 0 0 1.2247

-0.5220 -0.2373 -0.7119 1.4712

2.7500

15.0000

82.5000

1.7078

2.4495

10.5357

m =

s =

Definition 11.4. Let u = (u1 , . . . , um ) be a sequence of vectors in Rn . The inertia of this sequence relative to a vector z ∈ Rn is the

Data Sample Matrices

725

number Iz (u) =

m

j=1

uj − z22 .

Theorem 11.2 (Huygens’s Inertia Theorem). Let u = (u1 , . . . , um ) ∈ Seqm (Rn ). We have Iz (u) − Iu˜ (u) = m˜ u − z22 , for every z ∈ Rn . Proof.

˜ is The inertia of u relative to u Iu˜ (u) =

m

˜ 22 uj − u

j=1 m

˜ ) (uj − u ˜) = (uj − u j=1

=

m

˜  uj − uj u ˜ +u ˜ u ˜ ). (uj uj − u j=1

Similarly, we have Iz (u) =

m

j=1

(uj uj − z  uj − uj z + z  z).

This allows us to write Iz (u) − Iu˜ (u) =

m m



˜ u ˜ (˜ u − z) uj + uj (˜ u − z) + z  z − u j=1

= (˜ u − z)

j=1

m

i=1

⎛ ⎞ m

˜ u ˜) uj + ⎝ uj ⎠ (˜ u − z) + m(z  z − u j=1

˜ u ˜ + m˜ ˜) u − z) + m(z  z − u u (˜ = m(˜ u − z) u = m˜ u − z22 , which is the equality of the theorem.



Linear Algebra Tools for Data Mining (Second Edition)

726

Corollary 11.1. Let u = (u1 , . . . , um ) ∈ Seqm (Rn ). The minimal ˜. value of the inertia Iz (u) is achieved for z = u Proof.

This statement follows immediately from Theorem 11.2. 

Let u and w be two vectors in Rm , where m > 1, having the means u and w, and the standard deviations su and sv , respectively. Definition 11.5. The covariance coefficient of u and w is the number cov(u, w) =

m−1 1

(ui − u)(wi − w). m−1 i=1

The correlation coefficient of u and w is the number ρ(u, w) =

cov(u, w) . su sw

By the Cauchy–Schwarz Inequality (Corollary 6.1), we have  m m m  



  (ui − u)2 · (wi − w)2 ,  (ui − u)(wi − w) ≤   i=1

i=1

i=1

which implies −1 ≤ ρ(u, w) ≤ 1. ˆ be Definition 11.6. Let X ∈ Rm×n be a sample matrix and let X the centered sample matrix corresponding to X. The sample covariance matrix is the matrix cov(X) =

1 ˆ ˆ X X ∈ Rn×n . m−1

1 X  X. Note that if X is centered, cov(X) = m−1 If n = 1, the matrix is reduced to one column X = (v) and

cov(v) =

1 v  v ∈ R. m−1

In this case, we refer to cov(v) as the variance of v; this number is denoted by var(v).

Data Sample Matrices

727

If X = (v 1 · · · v n ), then (cov(X))ij = cov(v i , v j ) for 1 ≤ i, j ≤ n. The covariance matrix can be written also as cov(X) =

1 1 X  Hm Hm X = X  Hm X. m−1 m−1

The sample correlation matrix is the matrix corr(X) given by (corr(X))ij = ρ(v i , v j ) for 1 ≤ i, j ≤ n. 1 If X is centered, then cov(X) = m−1 X  X. Clearly, the covariance matrix is a symmetric, positive semidefinite matrix. Furthermore, by ˆ and, Theorem 3.33, the rank of cov(X) is the same as the rank of X since m, the size of the sample, is usually much larger than n, we are often justified in assuming that rank(cov(X)) = n. Let X = (v 1 · · · v n ) ∈ Rm×n be a sample matrix. Note that   1 1  Hm v p = Im − 1m 1m v p = v p − 1m 1m v p = v p − ap 1m , m m 1  ˜  = (a1 , . . . , an ). 1m v p = ap for 1 ≤ p ≤ n, where u because m The covariance matrix can be written as

1  Hm (v 1 · · · vn ) (v 1 · · · v n ) Hm m−1 1 (Hm v 1 · · · Hm vn ) (Hm v1 · · · Hm v n ), = m−1

cov(X) =

which implies that the (p, q)-entry of this matrix is cov(X)pq =

1 1 (Hm vp ) (Hm v q ) = (v p −ap 1m ) (v q −aq 1m ). m−1 m−1

For a diagonal element, we have m

cov(X)pp

1

= (v q − aq 1m )2i , m−1 i=1

which shows that cov(X)pp measures the scattering of the values of the pth variable around the corresponding component ai of the mean sample. This quantity is known as the pth variance and is denoted by σp2 for 1 ≤ p ≤ n. The total variance tvar(X) of X is trace(cov(X)).

728

Linear Algebra Tools for Data Mining (Second Edition)

For p = q, the element cpq of the matrix C = cov(X) is referred to as the (p, q)-covariance. We have 1 (v p − ap 1m ) (v q − aq 1m ) m−1 1   v p v q − ap 1m v q − aq v p 1m + map aq = m = vp v q − ap aq .

(cov(X))pq =

If cov(X)pq = 0, then we say that the variables Vp and Vq are uncorrelated. The behavior of the covariance matrix with respect to multiplication by orthogonal matrices is discussed next. Theorem 11.3. Let ⎞ x1 ⎜ . ⎟ ⎟ X=⎜ ⎝ .. ⎠ xm ⎛

be a centered sample matrix and let R ∈ Rn×n be an orthogonal matrix. If Z ∈ Rm×n is a matrix such that Z = XR, then Z is centered, cov(Z) = R cov(X)R and tvar(Z) = tvar(X). Proof.

By writing explicitly the rows of the matrix Z, ⎛ ⎞ z1 ⎜ . ⎟ ⎟ Z=⎜ ⎝ .. ⎠ , zm

we have z i = xi R for 1 ≤ i ≤ m because Z = XR. Note that the sample mean of Z is 1 1 ˜ Z˜ = 1m Z = 1m XR = XR, m m ˜ is the sample mean of X. Since X is centered, we have where X ˜ ˜ Z = X = 0n , so Z is centered as well.

Data Sample Matrices

729

The covariance matrix of Z is cov(Z) =

1 1 Z Z = R X  XR = R cov(X)R. m−1 m−1

Since the trace of two similar matrices are equal (by Theorem 8.5) and cov(Z) is similar to cov(X), the total variance of Z equals the total variance of X, that is, tvar(Z) = trace(cov(Z)) = trace(cov(X)) = tvar(X). 

Since the covariance matrix of a centered matrix X, cov(X) = ∈ Rn×n is symmetric, by Corollary 8.8, cov(X) is orthonormally diagonalizable, so there exists an orthogonal matrix R ∈ Rn×n such that R cov(X)R = D, which corresponds to a sample matrix Z = XR. Let cov(Z) = D = diag(d1 , . . . , dn ). The number dp is the sample variance of the pth variable of the data matrix, and the covariances of the form cov(Z)pq with p = q are 0. From a statistical point of view, this means that the components p and q are uncorrelated. Without loss of generality we can assume that d1 ≥ · · · ≥ dn . The columns of the matrix Z correspond to the new variables Z1 , . . . , Zn . Often the variables of a data sample matrix are not expressed using different units. In this case, the components of the covariance have no meaning because variables that have large numerical values have a disproportionate influence compared to variables that have small numerical value. For example, if a spatial variable is measured in millimeters, its values are three orders of magnitude larger than the values of a variable expressed in meters. 1  m−1 X X

11.3

Biplots

Biplots introduced by Gabriel in [61] offer a way of representing graphically the elements sets of vectors (hence, the term biplot). Let A ∈ Rm×n be a matrix that can be A = LR, where L ∈ Rm×r , R ∈ Rr×n are

succinct and powerful of a matrix using two written as a product, the left and the right

730

Linear Algebra Tools for Data Mining (Second Edition)

factors, respectively. Suppose that ⎛ ⎞ l1 ⎜ . ⎟ ⎟ L=⎜ ⎝ .. ⎠ and R = (r 1 · · · r m ), lm where l1 , . . . , lm , r 1 , . . . , r n are m + n vectors in Rr . Then, each element aij of A can be regarded as a inner product of two vectors aij = li r j

(11.3)

for 1 ≤ i ≤ m and 1 ≤ j ≤ n. Such matrix factorizations are common in linear algebra and we have already discussed a number of factorization techniques (full-rank decompositions, QR decompositions, etc.) Starting from the factorization A = LR, new factorizations of A can be built as A = (LK  )(R K −1 ) for every invertible matrix K ∈ Rr×r . Therefore, the above representation for A is not unique in general. Thus, to use the biplot for a representation of the relations between the rows w 1 , . . . , wn of A, one could choose R such that RR = Ir , which yields AA = LL . This implies wi wj = li lj for 1 ≤ i, j ≤ n. Taking i = j, we have w i  = li , which , in turn, implies ∠(wi , wj ) = ∠(li , lj ). A similar choice can be made for the columns of A by imposing the requirement L L = Ir , which implies A A = R R. The case when the rank r of the matrix A is 2 is especially interesting because we can draw the vectors l1 , . . . , lm , r 1 , . . . , r n to obtain an exact two-dimensional representation of A, as we show in the next example. Example 11.3. Let



18 ⎜−4 ⎜ A=⎜ ⎝ 25 9

8 20 8 4

⎞ 20 1⎟ ⎟ ⎟ 27⎠ 10

be a matrix of rank 2 in R4×3 that can be written as A = LR, where ⎛ ⎞ 2 4   ⎜−2 3⎟ 5 −4 4 ⎜ ⎟ L=⎜ . ⎟ and R = ⎝ 3 5⎠ 2 4 3 1 2

Data Sample Matrices

731

The vectors that help us with the representation of A are         2 −2 3 1 , l2 = , l3 = , l4 = l1 = 3 5 2 4 and

      −4 4 5 r1 = , r2 = , r3 = . 2 4 3

For example, the a32 element of A can be written as   −4  a32 = l3 r 2 = (3 5) = 8. 4 Equality (11.3) shows that each vector li corresponds to a row of A and each vector r j , to a column of A. When we can factor a sample data matrix X as X = LR, a column of the right factor r j is referred to as the biplot axis and corresponds to a variable Vj (Figure 11.1). Each vector li represents an observation in the sample matrix. It is interesting to observe that Equality (11.3) implies that the magnitude of projection of li on the biplot axis r j is li 2 cos ∠(li , r j ) =

li r j aij = . r j 2 r j 2

Therefore, if we choose the unit of measure on the axis r j as the number r1j 2 , we can read the values of the entries aij directly on 6 r2

Fig. 11.1

+

l3

+l 1

r3 l4  + 1r 1   @

l2 @+ @ @

-

Representation of the vectors li and rj .

732

Linear Algebra Tools for Data Mining (Second Edition)

the axis r j . For instance, the unit along the biplot axis is r13 2 = 0.2. It is also clear that if two axis of the biplot point roughly in the same direction, the corresponding variables will show a strong correlation. In general, the rank of the data matrix A is larger than 2. In this case, approximative representations of A can be obtained by using the thin singular value decomposition of matrices (Corollary 9.3). Let A be a matrix of rank r and let 

A = U DV =

r

i=1

σi ui vi

be the thin SVD, where U ∈ Rm×r and V ∈ Rn×r are matrices of rank r (and, therefore, full-rank matrices) having orthonormal sets of columns. Here U = (u1 · · · ur ) and V = (v 1 · · · v r ). The matrix D containing √ can be split between U √ singular values and V by defining L = U D and R = DV  . The usefulness of the SVD for biplots is based on the Eckhart–Young Theorem (Theorem 9.9), which stipulates that the best approximation of A in the sense of the matrix norm ||| · |||2 in the class of matrix of rank k is the matrix defined by B(k) =

k

σi ui v i .

i=1

According to Theorem 9.10, the same matrix B(k) is the best approximation of A in the sense of Frobenius norm. The extent of the deficiency of this approximation is measured by A − B(k)2F = 2 + · · · + σr2 . Since A2F = σ12 + · · · + σr2 , an absolute measure of σk+1 the quality of the approximation of A by B(k) is qk = 1 −

σ12 + · · · + σk2 A − B(k)2F = . A2F σ12 + · · · + σr2

In the special case, k = 2, the quality of the approximation is q2 =

σ12 + σ22 σ12 + · · · + σr2

and it is desirable that this number be as close to one as possible. The rank-2 approximation of A is useful because we can apply biplots to the visualization of A.

Data Sample Matrices

Example 11.4. Let A ∈ R5×3 be ⎛ 1 ⎜0 ⎜ ⎜ A = ⎜1 ⎜ ⎝1 0

733

the matrix defined by ⎞ 0 0 1 0⎟ ⎟ ⎟ 1 1⎟ . ⎟ 1 0⎠ 0 1

It is easy to see that the rank of this matrix is 3 and, using MATLAB , a singular value decomposition can be obtained as U =

[

0.2787 0.2787 0.7138 0.5573 0.1565

-0.2176 -0.2176 0.3398 -0.4352 0.7749

-0.7071 0.7071 -0.0000 -0.0000 0.0000

2.3583 0 0 0 0

0 1.1994 0 0 0

0 0 1.0000 0 0

0.6572 0.6572 0.3690

-0.2610 -0.2610 0.9294

-0.7071 0.7071 0.0000

-0.2996 -0.2996 -0.4037 0.7033 0.4037

-0.5341 -0.5341 0.4605 0.0736 -0.4605

S =

V =

The rank-2 approximation of this matrix is B(2) = σ1 u1 v H1 + σ2 u2 v H2 , and is computed in MATLAB using >> B2 = 2.3583* U(:,1) * V(:,1)’ + 1.1994 * U(:,2) * V(:,2)’ B2 = 0.5000 0.5000 1.0000 1.0000 -0.0000

0.5000 0.5000 1.0000 1.0000 -0.0000

-0.0000 -0.0000 1.0000 -0.0000 1.0000

Linear Algebra Tools for Data Mining (Second Edition)

734

If we split the singular values as √ √ √ √ B(2) = ( σ1 u1 )( σ1 v 1 )H + ( σ2 u2 )( σ2 v 2 )H , then B(2) can be written as ⎛ ⎞ 0.4280 −0.2383 ⎜0.4280 −0.2383⎟   ⎜ ⎟ 1.0092 1.0092 0.5667 ⎜ ⎟ B(2) = ⎜1.0962 0.3721 ⎟ . ⎜ ⎟ −0.2858 −0.2858 1.0179 ⎝0.8559 −0.4766⎠ 0.2403 0.8487 The biplot that represents matrix A is shown in Figure 11.2. The quality of the approximation of A is q2 =

2.35832 + 1.19942 = 0.875. 2.35832 + 1.19942 + 1

The “allocation” of singular values among the columns of the matrices U and V may lead to biplots that have distinct properties. 1.2 1

l5

r

3

0.8 0.6 0.4

l3

0.2 0 −0.2

r1,r2

l1,l2

−0.4

l

4

−0.6 −1

−0.5

Fig. 11.2

0

0.5

1

Biplot of the rank-2 approximation of A.

Data Sample Matrices

735

For example, we could write B(2) = (σ1 u1 )v H1 + (σ2 u2 )v H2 ,

(11.4)

B(2) = u1 (σ1 v 1 )H + u2 (σ2 v 2 )H .

(11.5)

or

The first allocation leads to the factorization B(2) = LR, where ⎛

⎞ 0.6572 −0.2610 ⎜0.6572 −0.2610⎟ ⎜ ⎟ ⎜ ⎟ L = ⎜1.6834 0.4075 ⎟ ⎜ ⎟ ⎝1.3144 −0.5219⎠ 0.3690 0.9294 and  R = 0.65720.65720.3690 − 0.2610 − 0.26100.9294 , while the second yields the factors  L = 0.2787

−0.21760.2787

−0.21760.7138

0.33980.5573

−0.43520.1565

0.7749



and  R=

 1.5499 1.5499 0.8703 . −0.3130 −0.3130 1.1147

The first variant (Equality 11.4) leads to a representation, where the distances between the vectors li approximates the Euclidean distances between rows, while for the second variant (Equality 11.5), the cosine of angles between the vectors r j approximates the correlations between variables. Exercises and Supplements (1) Verify that the centering matrix Hm is both symmetric and idempotent. (2) Compute the spectrum of a centering matrix Hm .

736

Linear Algebra Tools for Data Mining (Second Edition)

(3) Let X ∈ Rm×n be a matrix and let Qc ∈ Rm×m be the matrix introduced in Supplement 10 of Chapter 3. Prove that (a) if c ∈ Rm , then Qc X = X − 1m c X; if c is the mean of the 1 rows of X, that is if c = m X  1m , then Qc X = Hm X; n (b) if c ∈ R , then XQc = X − X1m c ; if c is the mean of the columns of X, that is if c = n1 X1n , then XQc = XHn . (4) Usually, for a data sample matrix X ∈ Rm×n , we can assume that m ≥ n. Examine the possibility that rank(X) < n. (5) Let (u1 , . . . , um ) be a sequence of m vectors in Rn . Prove that the covariance matrix C of this sequence can be written as ⎞ ⎛  ˜ u1 − u ⎟ ⎜ 1 .. ⎟. ˜ · · · un − u ˜) ⎜ (u1 − u C= . ⎠ ⎝ m−1   ˜ un − u  (6) Let q ∈ Rn be a vector such that i = 1n qi = 0. Prove that  Hn q = q Hn = q. (7) Let A ∈ Rn×n and c ∈ Rn . Define the mapping f : Rn −→ r n as f (u) = Au + c for u ∈ Rn and let wi = f (ui ) for 1 ≤ i ≤ m. Prove that the covariance matrix D of the sequence (w1 , . . . , w m ) is D = ACA . (8) Prove that the function dC : {u1 , . . . , um }2 −→ R defined by ˜ for 1 ≤ i, j ≤ m is a metric ˜ ) C −1 (uj − u) dC (ui , uj ) = (ui − u on {u1 , . . . , um }. Furthermore, prove that dD (f (ui ), f (uj )) = d(ui , uj ) for 1 ≤ i, j ≤ m. The function dC is known as the Mahalanobis metric. (9) Let X ∈ Rn×p be a data sample matrix. Prove that   1  ˆ ∗ 1n X ⊗ 1n . X=X− n ˆ starting from X. Write a MATLAB function that computes X m (10) Prove that if x ∈ R , then Hm x = x − x1m , where x is the mean of x. ⎛ ⎞ x1 ⎜ ⎟  m 1 ⎜ .. ⎟ (11) Prove that x Hm x = m i=1 (xi − x), where x = ⎝ . ⎠ and x xm is the mean of x.

Data Sample Matrices

737

Let X ∈ Rm×n be a sample matrix. Its correlation matrix, corr(X) ∈ Rn×n , is defined by (corr(X))ij = ρ(v i , v j ) for 1 ≤ i, j ≤ n, where X = (v 1 · · · v n ). (12) Prove that corr(X) = D −1 cov(X)D −1 , where D = diag(sv1 , . . . , svn ). (13) Justify the claims made at the end of Example 11.4 that refer to the allocation variants for singular values. The generalized sample variance of a data sample matrix X is the determinant det(cov(X)), which offers a succinct summary of the variances and covariances of X. (14) Let X ∈ Rm×n be a centered data sample matrix, X = (v 1 , . . . , v n ). Prove that the generalized sample variance of X 1 Vn (X)2 , where Vn (X) is the volume of the paralequals n−1 lelepiped constructed on the vectors v 1 , . . . , v n . Hint: See Supplement 107 of Chapter 6. Bibliographical Comments There are several excellent sources for biplots [67, 68, 96]. A recent readable introduction to biplots is [69].

This page intentionally left blank

Chapter 12

Least Squares Approximations and Data Mining

12.1

Introduction

The least square method is used in data mining as a method of estimating the parameters of a model by adopting the values that minimize the sum of the squared differences between the predicted and the observed values of data. This estimation process is also known as regression, and several types of regression exist depending on the nature of the assumed model of dependency between the predicted and the observed data. 12.2

Linear Regression

The aim of linear regression is to explore the existence of a linear relationship between the outcome of an experiment and values of variables that are measured during the experiment. As we saw in Chapter 11, experimental data often are presented as a data sample matrix B ∈ Rm×n , where m is the number of experiments and n is the number of variables measured. The results of the experiments are

739

Linear Algebra Tools for Data Mining (Second Edition)

740 3800

PT

3600 RO

NO

calories per day

3400 CY

3200

3000

FI

NL

BA SK

2800 YU 2600

2400

0

Fig. 12.1

1

2

3 4 5 6 gdp per person in $10K units

7

8

9

Calories vs. GDP in 10K units per person in Europe.

the components of a vector ⎞ b1 ⎜ ⎟ b = ⎝ ... ⎠ . ⎛

bm Linear regression amounts to determining r ∈ Rn such that Br = b. Knowing the components of r allows us to express the value of the result as a linear combination of the values of the variables. Unfortunately, since m is usually much larger than n, this system is overdetermined and, in general, is inconsistent. The columns v1 , . . . , v n of the matrix B are referred to as the regressors; the linear combination r1 v 1 + · · · + rn v n is the regression of b onto the regressors v 1 , . . . , v n . Example 12.1. In Figure 12.1, we represent (using the function plot of MATLAB), the number of calories consumed by a person per day vs. the gross national product per person in European countries starting from the following table.

Least Squares Approximations and Data Mining ccode ‘AL’ ‘AT’ ‘BY’ ‘BE’ ‘BA’ ‘BG’ ‘HR’ ‘CY’ ‘CZ’ ‘DK’ ‘EE’ ‘FI’ ‘FR’ ‘GE’ ‘DE’ ‘GR’ ‘HU’ ‘IS’ ‘IE’

gdp 0.74 4.03 1.34 3.79 0.66 1.28 1.75 1.21 2.56 3.67 1.90 3.53 3.33 0.48 3.59 3.02 1.90 3.67 3.76

cal 2824.00 3651.00 2895.00 3698.00 2950.00 2813.00 2937.00 3208.00 3346.00 3391.00 3086.00 3195.00 3602.00 2475.00 3491.00 3694.00 3420.00 3279.00 3685.00

ccode ‘IT’ ‘LV’ ‘LT’ ‘LU’ ‘MK’ ‘MT’ ‘MD’ ‘NL’ ‘NO’ ‘PL’ ‘PT’ ‘RO’ ‘RU’ ‘YU’ ‘SK’ ‘SI’ ‘ES’ ‘CH’

gdp 3.07 1.43 1.59 8.18 0.94 2.51 0.25 4.05 5.91 1.88 2.30 1.15 1.59 1.10 2.22 2.84 2.95 4.29

741

cal 3685.00 3029.00 3397.00 3778.00 2881.00 3535.00 2841.00 3240.00 3448.00 3375.00 3593.00 3474.00 3100.00 2689.00 2825.00 3271.00 3329.00 3400.00

This data set was extracted from [55]. We seek to approximate the calorie intake as a linear function of the gdp of the form cal = r1 + r2 gdp. This amounts to solving a linear system that consists of 37 equations and two unknowns: r1 + 0.74r2 = 2824 .. . r1 + 4.29r2 = 3400 and, clearly such a system is inconsistent. If the linear system Br = b has no solution, the “next best thing” is to find a vector c ∈ Rn such that Bc − b2 ≤ Bw − b2 for every w ∈ Rn , an approach known as the least square method. We will refer to the triple (B, r, b) as an instance of the least square problem. Note that Br ∈ range(B) for any r ∈ Rn . Thus, solving this problem amounts to finding a vector Br in the subspace range(B) such that Br is as close to b as possible.

742

Linear Algebra Tools for Data Mining (Second Edition)

Let B ∈ Rm×n be a full-rank matrix such that m > n, so rank(B) = n. The symmetric square matrix B  B ∈ Rn×n has the same rank n as the matrix B, as we saw in Theorem 3.33. Therefore, the system (B  B)r = B  b has a unique solution s. Moreover, B  B is positive definite because r  B  Br = (Br) Br = Br22 > 0 for r = 0. Theorem 12.1. Let B ∈ Rm×n be a full-rank matrix such that m > n and let b ∈ Rm . The unique solution of the system (B  B)r = B  b equals the projection of the vector b on the subspace range(B). Proof. The n columns of the matrix B = (v 1 · · · v n ) constitute a basis of the subspace range(B). Therefore, we seek the projection c of b on range(B) as a linear combination c = Bt, which allows us to reduce this problem to a minimization of the function f (t) = Bt − b22 = (Bt − b) (Bt − b) = (t B  − b )(Bt − b) = t B  Bt − b Bt − t B  b + b b. The necessary condition for the minimum is (∇f )(t) = 2B  Bt − 2B  b = 0, which implies B  Bt = B  b.



The linear system (B  B)t = B  b is known as the system of normal equations of B and b. Example 12.2. We augment the data sample matrix by a column that consists of 1s to accommodate a constant term r1 ; thus, we work with the data sample matrix B ∈ R37×2 given by ⎛ ⎞ 1 0.74 ⎜. .. ⎟ ⎟ . B=⎜ . ⎠ ⎝. 1 4.29 whose second column consists of the countries’ gross domestic products in $10K units. The matrix C = B  B is  37.0000 94.4600 C= . 94.4600 333.6592

Least Squares Approximations and Data Mining

743

4200 4000 3800

calories per day

3600 3400 3200 3000 2800 2600 2400

0

1

2

3 4 5 gdp per person in $10K units

Fig. 12.2

6

7

8

Regression line.

Solving the normal system using the MATLAB statement r = C\(B  ∗b) yields  2894.2 r= , 142.3 so the regression line is cal = 142.3 ∗ gdp + 2894.2, shown in Figure 12.2. Suppose now that B ∈ Rm×n has rank k, where k < min{m, n}, and U ∈ Rm×m , V ∈ Rn×n are orthonormal matrices such that B can be factored as B = U M V  , where  R Ok,n−k M= ∈ Rm×n , Om−k,k Om−k,n−k R ∈ Rk×k , and rank(R) = k.

 c1 define c = ∈ and let c = , where c1 ∈ Rk For b ∈ c2 and c2 ∈ Rm−k . Since rank(R) = k, the linear system Rz = c1 has a unique solution z 1 . Rm

U b

Rm

744

Linear Algebra Tools for Data Mining (Second Edition)

Theorem 12.2. All vectors r that minimize Br − b2 have the form  z r=V w for an arbitrary w. Proof.

We have

Br − b22 = U M V  r − U U  b22 = U (M V  r − U  b)22 = M V  r − U  b22 (because multiplication by an orthonormal matrix is norm-preserving) = M V  r − c22 = M y − c22 = Rz − c1 22 + c2 22 , where z consists of the first r components of y. This shows that the minimal value of Br − b22 is achieved by the solution of Therefore, the vectors the system Rz = c1 and is equal to c2 22 .  z for an arbitrary r that minimize Br − b22 have the form w  w ∈ Rn−r . Instead of the Euclidean norm we can use the  · ∞ . Note that we have t = Br−b∞ if and only if −t1 ≤ Br−b ≤ t1, so finding r that minimizes  · ∞ amounts to solving a linear programming problem: minimize t subjected to the restrictions −t1 ≤ Br − b ≤ t1. Similarly, we can use the norm  · p . If y = Br − b, then we need to minimize ypp = |y1 |p + · · · + |ym |p , subjected to the restrictions −y ≤ Ar − b ≤ y. 12.3

The Least Square Approximation and QR Decomposition

Solving the system of normal equation presents numeric difficulties because, by Corollary 9.5 the condition number of the matrix B  B is the square of the condition number of B. An alternative approach

Least Squares Approximations and Data Mining

745

to finding r ∈ Rn that minimizes f (u) = Bu − b22 is to use a full QR decomposition of the matrix B, where B ∈ Rm×n is a full-rank matrix and m > n, as described in Theorem 6.67. Suppose that  R , B=Q Om−n,n where Q ∈ Rm×m is an orthonormal matrix and R ∈ Rn×n is an upper triangular matrix such that  R ∈ Rm×n . Om−n,n We have



Bu − b = Q  =Q

R

Om−n,n R

Om−n,n

u−b

u − QQ b

(because Q is orthonormal and therefore QQ = Im )  R  =Q u−Qb . Om−n,n By Theorem 6.24, multiplication by an orthogonal matrix preserves the Euclidean norm of vectors. Thus,



2 R



u − Q b

. Bu − b22 =

Om−n,n 2 If we write Q = (L1 L2 ), where L1 ∈ Rm×n and L2 ∈ Rm×(m−n) , then  



 R L1 b

2

2 u− Bu − b2 =



Om−n,n L2 b 2





Ru − L1 b

2 =



−L2 b 2 = Ru − L1 b22 + L2 b22 . Observe that the system Ru = L1 b can be solved and its solution minimizes Bu − b2 .

746

12.4

Linear Algebra Tools for Data Mining (Second Edition)

Partial Least Square Regression

When the number of variables of an experiment is large, multiple output variables exist, and some or all these variables are correlated, the least square regression is replaced by a dimensionality reduction technique known as partial least square regression (PLS). Its inventor, the Swedish mathematician Herman Wold, observed that the acronym PLS is also consistent with “projection on latent structures” and this is a more accurate description of this data analysis method. In this section, we assume that the set of variables of an experiment E is partitioned into two disjoint sets {X1 , . . . , Xp } referred to as the set of predictor variables and {Y1 , . . . , Yr } named the set of response variables. Thus, the sample matrix XE ∈ Rm×(p+r) can be written as XE = (X, Y ), where X ∈ Rm×p and Y ∈ Rm×r . The basic model of PLS starts from the assumption that the matrices X and Y can be written as X = T P  + E, Y = T Q + F, where the matrices T, U, P, Q, E, F are described in the following table. Matrix Notation T P Q E F

Format Rn×s Rp×s Rr×s Rn×p Rn×r

Matrix Designation Matrix of score vectors Matrix of loadings Matrix of loadings Matrix of residuals Matrix of residuals

The iterative process introduced in Algorithm 12.4.1 begins with the matrices X and Y and proceeds to construct two sequences of matrices X0 , . . . , Xk and Y0 , . . . , Yk , where X0 = X and Y0 = Y , as follows. In the repeat loop that extends between lines 4 and 12, we compute four sequence of vectors w ∈ Rp , t ∈ Rn , c ∈ Rr , and u ∈ Rn .

Least Squares Approximations and Data Mining

Algorithm 12.4.1: Iterative Algorithm for PLS Data: Matrices X ∈ Rn×p and Y ∈ Rn×r Result: Sequences of matrices X0 , . . . , Xk and Y0 , . . . , Yk 1 for i = 1 to k do 2  = 0; 3 initialize u0 as the first column of Yi−1 ; 4 repeat u  5 w = X u u ; 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21



1 w   w  ;  t = wXw w ;    c = Yt tt ;  c ← c1  c ; u+1 = cY cc ;  

w ←

6

 ←  + 1; until until convergence condition is met; t(i) = t ; u(i) = u ; c(i) = c ;  t(i) ; p = (t(i)1) t(i) Xi−1 1 Y  u(i) ; (u(i) ) u(i) i−1 (t(i) ) t(i) bi = (u (i) ) t(i) ; Xi ← Xi−1 − t(i) (p(i) ) ; Yi ← Yi−1 − bi t(i) (c(i) ) ;

q=

end

Observe that u+1 = Y c /(c c ) = Y Y  t /(c c )(t t ) = Y Y  Xw  /(c c )(t t )(w  w ) = Y Y  XX  u /(c c )(t t )(w  w )(u u ).

747

Linear Algebra Tools for Data Mining (Second Edition)

748

Similarly, we have w+1 = X  u+1 /(u+1 u+1 ) = X  Y Y  XX  u /(c c )(t t )(w  w )(u u )(u+1 u+1 ) = X  Y Y  Xw  /(c c )(t t )(w  w )(u+1 u+1 ), t+1 = Xw +1 /(w +1 w+1 = XX  Y Y  Xw /(c c )(t t )(w  w )(u+1 u+1 )(w +1 w+1 = XX  Y Y  t /(c c )(t t )(u+1 u+1 )(w +1 w+1 ), and c+1 = Y  t+1 /(t+1 t+1 ) = Y  XX  Y Y  t /(c c )(t t )(u+1 u+1 )(w +1 w+1 )(t+1 t+1 ) = Y  XX  Y c /(c c )(u+1 u+1 )(w +1 w+1 )(t+1 t+1 ). 12.5

Locally Linear Embedding

Often data that results from observations obey mathematical relationships that are not immediately apparent because of the high dimensionality of the space where data resides. These relationships, once identified, could point to the location of the data vectors on a manifold whose intrinsic dimensionality could be much lower than the dimension of the ambient space. The locally linear embedding (LLE) introduced by Roweis and Saul in [141] is based on the basic idea that a point and its immediate neighbors are approximatively located on a linear manifold that is locally close to the real underlying manifold. Suppose that we have a sample data set {x1 · · · xn } ⊆ Rm that consists of n vectors in a high-dimension space Rm . The matrix X ∈ Rm×n is defined by X = (x1 · · · xn ). The LLE algorithm computes the neighbors Ni = {xj | j ∈ Ji } of each data vector xi (as, for example, the nearest k data points in the sense of Euclidean distance). Then, it seeks to compute a matrix of weights W ∈ Rn×n such that (i) if j ∈ Ji , then wij = 0; n (ii) i=1 wij = 1 for every i, 1 ≤ i ≤ n;

Least Squares Approximations and Data Mining

749

2 n



n

(iii) the error err = i=1

xi − j=1 wij xj

is minimal for 1 ≤ 2 i ≤ n. Observe that the second condition can be equivalently written as W 1n = 1n . Using the weights obtained in the second phase, we seek to determine a set of low-dimensional vectors y 1 , . . . , y n in Rd , where d is much lower than m such that the local linear structure defined by the matrix W is preserved. An important property of the weights wij is their invariance with respect to rotations, rescalings, and translations of the data points and of their neighbors. Indeed, if a rotation is applied to xi and to its neighbors in Ni , the value of err remains the same. Also, if both xi and the vectors in Ni are rescaled by the same factor, the values of the coefficients remain unchanged. Finally, if a translation by a vector t is applied to xi and to all xj ∈ Ji , we have xi + t −

n

wij (xj + t) = xi −

j=1

n

wij xj ,

j=1

n

because j=1 wij = 1. Thus, the weights wij characterize the intrinsic geometric properties of the neighborhood of xi . In its first phase, the Saul–Roweis algorithm identifies the neighbors of each data vector either by seeking the k closest vectors or by identifying the points within a closed sphere of fixed radius. The number k of neighbors of a point xi is, in general, not larger than the input dimensionality m; if this is not the case, special terms need to be added to the reconstruction costs (see Supplement 6). The second phase of the algorithm involves computing the weights wij such that err is minimal and nj=1 wij = 1. Note that err =

n

n n

n

2

2







wij xj

= wij (xi − xj )



xi −

i=1

=

n i=1

=

j=1

2

i=1

j=1

⎞ ⎛  n n ⎝ wij (xi − xj )⎠ wik (xi − xk ) j=1

n n n i=1 j=1 k=1

k=1

wij (xi − xj ) wik (xi − xk ).

2

750

Linear Algebra Tools for Data Mining (Second Edition)

The optimal weights are determined using the Lagrangian L(W, λ1 , . . . , λn ) =

n n n

wij (xi − xj ) wik (xi − xk )

i=1 j=1 k=1



n

⎛ ⎞ n λi ⎝ wij − 1⎠.

i=1

j=1

The necessary extremal conditions 2

n

∂L ∂wpq (W, λ1 , . . . , λn )

= 0 imply

wpk (xp − xq ) (xp − xk ) − λp = 0

k=1

for 1 ≤ p, q ≤ n. Let k be the chosen number of neighbors of points and let G(p) ∈ Rk×k be the local Gram matrix in xp , (G(p))qr = (xp − xq ) (xp − xr ) for 1 ≤ q, r ≤ k. The matrix G(p) is symmetric and positive semidefinite. The extremum condition can be written as 2

k

wp G(p)q = λp

=1

or, as G(p)w p = 12 λp 1n , where w p is the pth row of W (involving the coefficients that correspond to xp ). Therefore, wp = 12 λp G(p)−1 1k . Using the condition 1k wp = 1, we have λp 1k G(p)−1 1k = 2, so λp =

2

. 1k G(p)−1 1k

Consequently, wi is given by wp =

1 G(p)−1 1k .  1k G(p)−1 1k

For the final phase of the algorithm we seek a set of vectors {y 1 , . . . , y n } in the lower-dimensional space Rd , where d < m, such that the matrix Y = (y 1 · · · y n ) ∈ Rd×n minimizes the cost function

2





Φ(Y ) = ni=1

y i − nj=1 wij y j

. Observe that in the absence of 2

Least Squares Approximations and Data Mining

751

other conditions Φ(Y ) is optimal for Y = Od,n . Therefore, to make this problem well-posed it is necessary to require that the covariance matrix cov(Y ) equals In . This will compel the vectors y i to be different from 0d and pairwise orthogonal. Let M ∈ Rn×n be the symmetric and positive semidefinite matrix defined by M = (In − W ) (In − W ). The cost function can be written as Φ(Y ) =

n n n (y i − wij y j ) (y i − wih y h ) i=1

=

n

j=1

y i y i −

i=1

+

h=1

n n

wij y j y i −

i=1 j=1

n n n

n n

wih y i y h

i=1 h=1

wij wih y j y h .

i=1 j=1 h=1

Note that Y M Y  = Y (In − W  − W + W  W )Y  = Y Y  − Y W Y  − Y W Y  + Y W W Y  and we have the obvious equalities trace(Y Y  ) =

n

y i y i

i=1

trace(Y W Y  ) =

n n

y i wij y j

i=1 j=1

trace(Y W  Y  ) =

n n

y i (W  )ij y j =

i=1 j=1

trace(Y W  W Y  ) =

n n n

n n i=1 j=1

y i (W  )ij Wjh y h

i=1 j=1 h=1

=

n n n i=1 j=1 h=1

so Φ(Y ) = trace(Y M Y  ).

y i wji wjh y h ,

y i wji y j

752

Linear Algebra Tools for Data Mining (Second Edition)

It is important to observe that the matrix M has (1n , 0) as an eigenpair since M 1n = (In − W ) (In − W )1n = (In − W ) 0n = 0n , and all its eigenvalues are non-negative. The cost function is invariant under translations. This makes it necessary to add the second condition that fixes the center of the set of vectors {y 1 , . . . , y n } in 0n , which is equivalent to asking M 1n = 0. For the case of dimension d, we can write the matrix Y as a collection of n-dimensional rows as follows: ⎛ ⎞ z1 ⎜ ⎟ . Y = ⎝ . .⎠ . zd Now the cost function becomes ⎛⎛

⎞ ⎞ z1 Φ(Y ) = trace ⎝⎝. . .⎠ M (z 1 · · · z d )⎠ zd =

d p=1

trace(z p M z p ) =

d

z p M z p .

p=1

Let λ1 ≥ · · · ≥ λd ≥ λd+1 = 0 be the least d + 1 eigenvalues of M . To minimize Φ(Y ) we need to minimize each of the non-negative numbers z p M z p , which means that we need to adopt for z 1 , . . . , z d the unit eigenvectors that correspond to the eigenvalues λ1 , . . . , λd . The matrix Y is given by ⎛ ⎞ z1 ⎜.⎟ ⎟ Y =⎜ ⎝ .. ⎠ = (y 1 · · · y n ). zd The eigenvectors that correspond to the smallest d + 1 eigenvalues provide a matrix (y 1 , . . . , y d , y d+1 ) ∈ Rn×(d+1) . The last eigenvector y d+1 = 1n corresponds to the eigenvalue 0. The d-dimensional rows z 1 , . . . , z n of the matrix S = (y 1 , . . . , y d ) ∈ Rn×d yield the lowdimensional representation z 1 , . . . , z n of {x1 , . . . , xn }.

Least Squares Approximations and Data Mining

753

Example 12.3. We discuss the MATLAB algorithm developed for LLE in [140]. We begin by generating a data set in R2 using the MATLAB code i = 1; for x = -pi:0.1:pi X(1,i) = x*sqrt(3)/2 -1 -0.5*sin(x - pi); X(2,i) = 0.5*x + sqrt(3) + sqrt(3)*sin(x-pi)/2; i=i+1; end

The 63 columns of matrix X ∈ R2×63 have been plotted in Figure 12.3 and they are located on a curve in R2 which is a one-dimensional manifold. Next we determined the nearest two points for each point. For this computation we need to compute the matrix of distances dist ∈ R63×63 between points for i=1:1:63 for j=1:1:63 dist(i,j)= norm(X(:,i)-X(:,j)); end end [sorted,index] = sort(dist); neighborhood = index(2:3,:); 3.5

3

2.5

2

1.5

1

0.5

0 −4

−3

Fig. 12.3

−2

−1

0

1

Representation of the data set in R2 .

2

754

Linear Algebra Tools for Data Mining (Second Edition)

The matrix dist is sorted yielding the matrix sorted having each of its columns arranged in ascending order. Thus, the jth column of sorted contains the distances between xj and the remaining points. The matrix index contains in its jth column the indices of the points that correspond to the sorted distances. This allows us to find the next two neighbors of xj by simply extracting the second and the third elements of dist. W = zeros(2,63); for ii = 1:63 z = X(:,neighborhood(:,ii))-repmat(X(:,ii),1,2); % next we compute the local Gram matrix G = z’*z; % we solve Gw=1 W(:,ii) = G\ones(2,1); % enforce sum(w)=1 W(:,ii) = W(:,ii)/sum(W(:,ii)); end;

In the next phase the eigenvectors of the cost matrix are computed. % M is a sparse matrix with storage for 4*2*63 nonzero elements M = sparse(1:63,1:63,ones(1,63),63,63,4*2*63); for ii=1:63 w = W(:,ii); jj = neighborhood(:,ii); M(ii,jj) = M(ii,jj) - w’; M(jj,ii) = M(jj,ii) - w; M(jj,jj) = M(jj,jj) + w*w’; end; options.disp = 0; options.isreal = 1; options.issym = 1; [Y,eigenvals] = eigs(M,2,0,options); Y = Y(:,2:2)’*sqrt(63);

The resulting matrix Y is a row vector containing the representation of the 63 points located on the curve. 12.6

MATLAB

Computations

The MATLAB function lsqr attempts to compute the least square solution x to the linear system of equations Ax = b by finding x that

Least Squares Approximations and Data Mining

755

minimizes Ax − b. If the system is consistent, the linear squares solution is also a solution of the linear system and a corresponding message is displayed. Example 12.4. Let A and b ⎛ 1 1 ⎜ A = ⎝2 −1 3 1

be given by ⎞ ⎛ ⎞ −2 −2 ⎟ ⎜ ⎟ 0 ⎠ and b = ⎝ 1 ⎠. −1 2

The following MATLAB code solves the system Ax = b: >> A=[1 1 -2;2 -1 0;3 1 -1] A = 1 1 2 -1 3 1 >> b=[-2;1;2] b = -2 1 2 >> x=lsqr(A,b) x = 1.0000 1.0000 2.0000

-2 0 -1

Example 12.5. Let us expand the matrix A by adding one more row: >> A=[1 1 -2;2 -1 0;3 1 -1;1 -1 3]

and setting \bfb = [-2; 1; 2; 8]

The system Ax = b is now incompatible and the result returned is >> x=lsqr(A,b) lsqr converged at iteration 3 to a solution with relative residual 0.096. x = 1.1905 1.3810 2.6190

756

Linear Algebra Tools for Data Mining (Second Edition)

The result includes the relative residual error ation number at which the algorithm halted.

 b−Ax x

and the iter-

If A ∈ Rn×n and B is a column vector with n components, or a matrix with several such columns, then X = A\B is the solution to the equation AX = B in the least square sense. A warning message is displayed if A is badly scaled or nearly singular.

Exercises and Supplements (1) Let X ∈ Rm×n , where m ≥ n, be a full-rank matrix, b be a vector in Rm , and let r = Xx − b be the residual vector corresponding to x. Prove that r is the solution of the normal system X  Xr = X  b if and only if r is orthogonal to the columns of X. (2) Let a, b ∈ Rm be two vectors and let ax = b be a oneindeterminate linear system. The system is, in general, incompatible, if m > 1. (a) Prove that the best approximation of the solution in the sense defined in Section 12.1 is x=

(a, b) . a22

Replace the norm  · 2 with  · 1 ; in other words, define the best approximation of the solution as x such that Ax − b1 is minimal. Prove that if a = (a, a, a) ∈ R3 with a > 0, then the optimum is achieved for x = b2 . (3) Let b1 , . . . , bn be n vectors in Rm such that ni=1 bi = 0m and let Hw,a be a hyperplane in Rm , where w is a unit vector. Let B = (b1 · · · bn ). Prove that if ni=1 d(Hw,a , bi )2 is minimal (where d(Hw,a , bi ) is the distance from bi to Hw,a ), then w is an eigenvector of the matrix BB  that corresponds to the least eigenvalue of this matrix and a = 0. Solution: We need to minimize ni=1 (w bi − a)2 = w  B − a1 22 subjected to the restriction w2 = 1. Consider the

Least Squares Approximations and Data Mining

757

Lagrangian

n  n  2 2 F (w1 , . . . , wm , λ) = (w bi − a) − λ w − 1 i=1

=1

⎛ ⎞2

n  n m ⎝ wj bji − a⎠ − λ w2 − 1 . = i=1

j=1

=1

Then, we can write ⎛ ⎞ n m ∂F = 2⎝ wj bji − a⎠ bki − 2λwk = 0, ∂wk i=1

j=1

which is equivalent to n m

wj bji bki − a

i=1 j=1

m

bkj − λwk = 0.

j=1

Since ni=1 bi = 0m , it follows that nj=1 bkj = 0 for every k so the necessary extremal conditions are n m

wj bji bki = λwk = 0

i=1 j=1

for 1 ≤ k ≤ m. This equality amounts to w BB  = λw , or BB  w = λw, which shows that w must be an eigenvector of BB  . The total sum of the squared distances is D = w B − a1 22 = (w B − a1 )(w  B − a1 ) = (w B − a1 )(B  w − a1) = w BB  w − 2a1 B  w + a2 1 1 = λw w + ma2 = λ + ma2 . Thus, w must be a unit eigenvector of BB  that corresponds to the least eigenvalue. The minimal value of D is obtained when a = 0.

758

Linear Algebra Tools for Data Mining (Second Edition)

(4) Formulate and prove a result similar to the one in Supplement 3 without the condition ni=1 bi = 0m . Hint: Consider the ci = bi − b0 , where b0 = n1 sumni=1 bi vectors n and observe that i=1 ci = 0. (5) Let W ∈ Rn×n be a matrix. Prove that (In − W ) (In − W )1n = 0n if and only if W 1n = 0n . (6) Prove that when k, the number of nearest k neighbors exceeds the dimension m of the input space of the locally linear embedding algorithm, the local Gram matrix can be non-invertible. Prove that this problem can be corrected by replacing G(p) by G(p) + c/kIn , where c > 0 is a small number. (7) Let A ∈ Cm×n be a matrix with rank(A) = k such that A = U RV H . Here we assume that U ∈ Cm×n , V ∈ Cn×n are unitary matrices, and R ∈ Cm×n can be written as  R=

Q Om−k,k

Ok,n−k , Om−k,n−k

where Q ∈ Ck×k and rank(Q) = k. Let b ∈ Cm , x ∈ Cn , g = U H b ∈ Cm , and y = V H x. Write  g=

g1 g2



 and y =

y1 , y2

where g 1 ∈ Ck , g 2 ∈ Cm−k , y 1 ∈ Ck , and y 2 ∈ Cn−k . Prove that (a) if z is the unique solution of Qz = g 1 , then any vector x that minimizes Ax − b has the form  ˆ=V x

z ; y2

(b) any optimal solution gives the residual vector  r = b − Aˆ x=U where r = g 2 ;

0k , g2

Least Squares Approximations and Data Mining

759

(c) the unique solution of minimal Euclidean norm is  z ˆ=V , x 0n−k where z was defined above; (d) the solution of minimum Euclidean length, the minimal value of b − Ax, and set of all solutions are unique. Solution: The hypothesis implies Ax − b2 = U RV H x − b2 = U RV H x − U U H b2 = U (RV H x − U H b)2 = RV H x − U H b2 (because U is a unitary matrix)





Qy 1 − g 1

2 = Ry − g =



g2 = Qy 1 − g 1 2 + g 2 2 . The minimum value of Ax − b2 is achieved when Qy 1 = g 1 and it equals g 2 2 . Thus, if z is the unique solution of the equation Qz = g 1 , then any solution that minimizes Ax − b has the form  −1  Q g1 z ˆ=V =V , x g2 y2 where y 2 is arbitrary. The solution that minimizes Ax − b and has the minimum Euclidean norm is  z ˆ=V . x 0n−k  ˆ1 y H ˆ = ˆ=V x . The residual vector of an optimal soluLet y ˆ2 y tion is ˆ r = b − Aˆ x = U U H b − U RV H x ˆ ) = U (U H b − Rˆ y) = U = U (U b − RV x H

H

The last part follows immediately.



0k . g2

Linear Algebra Tools for Data Mining (Second Edition)

760

(8) Prove that the unique minimum Euclidean length solution x of the minimization problem has the form  ˆ=V x

Q−1 On−k,k

Ok,m−k U H b, On−k,m−k

where the notations are the same as in Supplement 7. Solution: The conclusion is immediate from the equalities  −1  Q g1 Ok,m−k Q−1 g 1 ˆ=V V x 0n−k On−k,k On−k,m−k g2  −1 Ok,m−k Q U H b. =V On−k,k On−k,m−k 

(9) Let A ∈ Cm×n be a matrix and let z i ∈ Rm be the solution of minimal Euclidean length of the least square problem instance (A, z, ei ), where ei ∈ Cm and z i ∈ Cn for 1 ≤ i ≤ m. Prove that the Moore–Penrose pseudoinverse of A is A† = (z 1 · · · z m ). Solution: In Supplement 26 of Chapter 9, we have shown that the pseudoinverse of A has the form †



A =V

R−1 On−k,k

Ok,m−k U H, On−k,m−k

R

Ok,n−k V H, Om−k,n−k

where  A=U

Om−k,k

R ∈ Ck×k is a matrix of rank k, and U ∈ Cm×m and V ∈ Cn×n are unitary matrices. The columns of A† are given by †

z i = A ei = V



R−1 On−k,k

Ok,m−k U H ei , On−k,m−k

and z i is precisely the solution of minimal Euclidean length of the least square problem min{ei − Az}.

Least Squares Approximations and Data Mining

761

(10) Let A ∈ Rm×n and b ∈ Rn . Define d = A b, B = (A b), and   A A d  , H =BB= d a2 where a2 = b b. Prove that if H has the Cholesky factorization H = U  U , where  W y U= , 0 ρ ˆ that minimizes b − Ax. then |ρ| = b − Aˆ x for every x (11) Let A ∈ Rm×n and C ∈ Rm×p be two matrices and let X ∈ Rn×p be a matrix such that AX − CF is minimal. Prove that the equation A AX = A C is always solvable in X and that its solution minimizes AX − CF . Solution: Let C = {c1 · · · cp }. We saw that the equation A Ax = A ci is always solvable with respect to x; let xi be its solution, where 1 ≤ i ≤ p. Then, we have A A(x1 · · · xp ) = (c1 . . . cp ) = C, so X = (x1 · · · xp ) is a solution of A AX = A C. Note that ⎛  ⎞ x1 A − c1 ⎜ ⎟ .. ⎟(Ax1 − c1 · · · Axp − cp ), (X  A − C  )(AX − C) =⎜ . ⎝ ⎠   xp − c p which implies AX − C2F = trace(X  A − C  )(AX − C) =

p

(xi A − ci )(Axi − ci )

i=1

=

p i=1

Axi − ci 22 .

This implies the second part of this supplement.

762

Linear Algebra Tools for Data Mining (Second Edition)

(12) Let A ∈ Rm×n be a matrix whose columns form an orthonormal set of vectors and let c ∈ Rm . Prove that x = A c minimizes Ax − c. Note that in this case A = A† by Supplement 45 of Chapter 3. (13) Let A ∈ Rm×n be a matrix with rank(A) = r < n. If X ⊆ Rn is the set of vectors x that minimize Ax − b, prove that this set is convex. Therefore, X contains a unique element having minimum 2-norm, denoted by xLS (the subscript suggests the words “least square”). Solution: Suppose that both u and v minimize Ax−b. Then, for λ ∈ [0, 1], we have A(λu + (1 − λ)v − b  λAu − b + (1 − λ)Av − b = Ax − b. (14) Let A ∈ Rm×n be a matrix and rank(A) = r, and let A = U DV  be the SVD decomposition of A, where U = (u1 · · · um ) ∈ Rm×m , V = (v 1 · · · v n ) ∈ Rn×n are orthogonal matrices and D = diag(σ1 , . . . , σr , 0, . . . , 0). u b If b ∈ Rm , prove that xLS = ri=1 σii v i minimizes Ax − b has the smallest norm of all minimizers. Solution: We have Ax − b2 = (U  AV )(V  x) − U  b2 (because U and V are orthogonal) =

r  i=1

σi (V  x)i − ui b

2

+

m

(ui b)2 .

i=r+1

If xLS solves the least square problem, this implies (V  x)i = ui b σi . (15) Let A ∈ Rm×n be a matrix and rank(A) = r, and let A = U DV  be the SVD decomposition of A. Prove that the matrix  1 1 † , . . . , , 0, . . . , 0 U  A = V diag σ1 σr is the pseudoinverse of A and xLS = A† b is the least square solution.

Least Squares Approximations and Data Mining

763

Bibliographical Comments Exercise 2 is based on an observation made in [65]. Monographs dedicated to the least squares problems are [21] and [103]. Partial square regression is discussed in [1]. Locally linear embedding is due to S. T. Roweis and L. K. Saul [141]. Further improvements and variations on the LLE theme can be found in [9, 38, 144]. Supplement 6 is discussed in [143]. Supplements 7–10 are results of [103].

This page intentionally left blank

Chapter 13

Dimensionality Reduction Techniques

13.1

Introduction

Physical and biological data as well as economic and demographic data often have high dimensionality. Intelligent data-mining algorithms work best in interpretation and decision-making based on this data when we are able to simplify their tasks by reducing the high dimensionality of the data. Dimensionality reduction refers to the extraction of the relevant information for a specific objective, while ignoring the unnecessary information and is a key concept in pattern recognition, data mining, feature processing, and machine learning. Dimensionality reduction requires tuning in terms of the expected number of dimensions, or the parameters of the learning algorithms. 13.2

Principal Component Analysis

Principal component analysis (PCA) is a dimensionality reduction technique that aims to create a few new, uncorrelated linear combinations of the variables of experiments that “explain” the major parts of the data variability. Let W ∈ Rm×n be a data matrix given by ⎛ ⎞ u1 ⎜ . ⎟ ⎟ W =⎜ ⎝ .. ⎠ . um 765

Linear Algebra Tools for Data Mining (Second Edition)

766

Definition 13.1. Let w ∈ Rn be a unit vector. The residual of ui relative to w is the number r(ui ) = ui −(ui w)w2 and it represents the error committed when the vector ui is replaced by its projection on w. Theorem 13.1. If r(ui ) is the residual of the vector ui of a data matrix W relative to the vector w with w = 1, then r(ui ) = ui 2 − (ui w)2 . Proof.

We have

r(ui ) = ui − (ui w)w2 = (ui − (ui w)w) (ui − (ui w)w) = (ui − (ui w)w )(ui − (ui w)w) = ui ui − (ui w)w  ui − ui (ui w)w + (ui w)w  (ui w)w = ui 2 − 2(ui w)2 + (ui w)2 = ui 2 − (ui w)2 because w w = 1.



Definition 13.2. The mean square error MSE(W, w) of the projections of the experiments u1 , . . . , um of the data matrix W ∈ Rm×n on the unit vector w ∈ Rn is the sum of the residuals m

MSE(W, w) =

1  r(ui ). m i=1

The average of the projections of the experiment vectors on the unit vector w is m

uw =

1   ˜  w, ui w = u m i=1

˜ is the sample mean defined by Equality (11.1). where u The variance of the projections of the experiment vectors on w is m

1   (ui w − uw )2 . V (W, w) = m i=1

Dimensionality Reduction Techniques

767

˜ = 0n and, therefore, we Note that if W is centered, we have u have uw = 0. Theorem 13.2. We have the following equalities: m

V (W, w) =

1   2 (ui w) − u2w , m i=1

and m

1  ui 2 − u2w − V (W, w). MSE(W, w) = m i=1

Proof. The first equality is well known from elementary statistics. For the second equality, we can write m

MSE(W, w) =

1  r(ui ) m i=1 m

1  ui 2 − (ui w)2 = m i=1 m

m

i=1

i=1

1   2 1  ui 2 − (ui w) = m m =

1 m

m  i=1

ui 2 − u2w − V (W, w). 

Corollary 13.1. If W is a centered data matrix, then minimizing MSE(W, w) amounts to maximizing the variance of the projections of the vectors of the experiments. Proof. Since W is centered, we have uw = 0. Therefore, the equality involving MSE(W, w) from Theorem 13.2 becomes m

MSE(W, w) =

1  ui 2 − V (W, w). m i=1

The first term does not depend on w. Therefore, to minimize the mean square error, we need to maximize the variance of the projec tions of the vectors of the experiments.

768

Linear Algebra Tools for Data Mining (Second Edition)

If W is a centered data matrix, we have uw = 0 and the variance of the data matrix reduces to m 1   2 (ui w) . V (W, w) = m i=1

This expression can be transformed as V (W, w) =

1 1 (W w) (W w) = w W  W w = w Zw, m m

1 W W . where Z = m We need to choose the unit vector w to maximize V (W, w). In other words, we need to maximize V (W, w) subjected to the restriction w w − 1 = 0. This can be resolved using a Lagrange multiplier λ to optimize the function

L(w, λ) =

1   w W W w − λ(w w − 1). m

Since ∂L = w w − 1, ∂λ ∂L = 2Z  w − 2λv, ∂w which implies w w = 1 and Zw = λw. The last equality amounts to 1  W W w = λw, m which means that w must be an eigenvector of the covariance matrix cov(W ). This is an n × n symmetric matrix, so its eigenvectors are mutually orthogonal and all its eigenvalues are non-negative. These eigenvectors are the principal components of the data. Definition 13.3. Let ⎛

⎞ u1 ⎜ . ⎟ m×n ⎟ W =⎜ ⎝ .. ⎠ = (v 1 , . . . , v n ) ∈ R um

ˆ = Hm W be the corresponding be a data sample matrix and let W centered data matrix.

Dimensionality Reduction Techniques

769

The principal directions of W are the eigenvectors of the covariance matrix 1 1 ˆˆ 1  WW = W  Hm W  Hm W ∈ Rn×n. cov(W ) = Hm W m−1 m−1 m−1 The principal components of W are the eigenvectors of the matrix ˆW ˆ . W Note that the covariance matrix cov(W ) is a scalar multiple of the ˆ of the columns v ˆ W ˆ1, . . . , v ˆ n of the centered data Gram matrix W ˆ matrix W . If R ∈ Rn×n is the orthogonal matrix that diagonalizes cov(W ), then the principal directions of W are the columns of R because R cov(W )R = D, or equivalently, cov(W )R = RD. Without loss of generality, we assume in this section that D = diag(d1 , d2 , . . . , dn ) and that d1  d2  · · ·  dn . The first eigenvector of cov(W ) (which corresponds to d1 ) is the first principal direction of the data matrix W ; in general, the kth eigenvector r k is called the kth principal direction of W . There exists an immediate link between PCA and the SVD decomˆ . Namely, if W ˆ s = σr and position of a centered data matrix W  ˆ and s is the correˆ r = σs, then r is a principal component of W W ˆ. sponding principal direction of W If ⎛ ⎞ ˆ1 u ⎜ . ⎟ ˆ = ⎜ . ⎟, W ⎝ . ⎠ ˆ m u we have ˆ m r = sm , ˆ 1 r = σs1 , . . . , u u which shows that the principal components are the projections of the centered data points on the principal directions. As we saw in Theorem 11.3, the sum of the elements of D’s main diagonal equals the total variance tvar(W ). The principal directions “explain” the sources of the total variance: sample vectors grouped around r 1 explain the largest portion of the variance; sample vectors grouped around r2 explain the second largest portion of the variance, etc.

770

Linear Algebra Tools for Data Mining (Second Edition)

Let Q ∈ Rn× be a matrix having orthogonal columns. Starting from a sample matrix X ∈ Rm×n , we can construct a new sample matrix W ∈ Rm× having  variables. Each experiment Ei is represented now by a row wi that is linked by ui by the equality wi = ui Q. This means that the component (w i )k that corresponds to the new variable Wk is obtained as n  (ui )p qpk , (wi )k = p=1

a linear combination of the values that correspond to the previous variables. Theorem 13.3. Let W ∈ Rm×n be a centered sample matrix and let R ∈ Rn×n be an orthogonal matrix such that R cov(W )R = D, where D ∈ Rn×n is a diagonal matrix D = diag(d1 , . . . , dn ) and d1  · · ·  dn . Let Q ∈ Rn× be a matrix having orthogonal columns and let X = W Q ∈ Rm× . Then, trace(cov(X)) is maximized when Q consists of the first  columns of R and is minimized when Q consists of the last  columns of R. Proof. This result follows from Ky Fan’s Theorem (Theorem 7.14) applied to the symmetric covariance matrix of the transformed  dataset. ˆ = U DV  be the thin SVD of the centered data matrix Let W m×n ˆ , where U ∈ Rm×r and V ∈ Rr×n are matrices having W ∈ R orthogonal columns and ⎞ ⎛ σ1 0 · · · 0 ⎜ 0 σ ··· 0 ⎟ 2 ⎟ ⎜ ⎟ D=⎜ .. ⎟ , ⎜ .. .. ⎝ . . ··· . ⎠ 0 0 · · · σr where σ1  · · ·  σr > 0 are the singular values of A. For the covariance matrix of cov(W ), we have 1 1 ˆˆ WW = V D  U  U DV  cov(W ) = m−1 m−1 1 1 V D  DV  = V D2 V  , = m−1 m−1

Dimensionality Reduction Techniques

771

due to the orthogonality of the columns of U . As we saw before, the columns of V are the eigenvectors of cov(W ). The matrix V is known as the matrix of loadings. The matrix S = U D ∈ Rm×r is known as the matrix of scores. It ˆ. ˆ W is clear that σ12 , . . . , σr2 coincide with the eigenvalues of W ˆ = SV , where S is the scores matrix and V is Observe that W the loadings matrix. Since the columns of V are orthogonal, we also ˆ V. have S = W ˆ can be written as The SVD of W ˆ = W

r 

σi ui v i .

i=1

ˆ v i = σ 2 v i . Since u W ˆ = σi v , it follows that v  ˆ W This implies W i i i i ˆ . Similarly, ui are is a weighted sum of the rows of the matrix W ˆ. weighted sums of the columns of W As observed in [61], if ˆ = (u1 · · · ur )(σ1 v 1 . . . σr v r ) , W then Ir = (u1 · · · ur ) (u1 · · · ur ), ˆ = (u1 · · · ur )(u1 · · · ur ) . ˆ X (n − 1)X Example 13.1. We use the FAO dataset introduced in Example 12.1 showing the protein and fat consumption for 37 European countries. The sample matrix X ∈ R37×2 is obtained from the second and third columns of this table that correspond to the variables prot and fat. The vector of the sample variances of the two columns is s = (15.5213 28.9541). Since the magnitudes of the sample variances are substantial and quite distinct, we normalize the data by dividing the columns of X by their respective sample variances. The normalization is done by using the function zscore; namely, zscore(X) returns a centered and scaled version of X having the same format as X such that the columns of the result have sample mean 0 and sample variance 1.

Linear Algebra Tools for Data Mining (Second Edition)

772

The loading matrix or the coefficient matrix is given by

0.7071 −0.7071 . 0.7071 0.7071

Both coefficients in the first column (which represents the first principal component) are equal and positive, which means that the first principal component is a weighted average of the two variables. The second principal component corresponds to a weighted difference of the original variables. The coordinates of the data in the new coordinate system is defined by the matrix scores. These scores have been plotted in Figure 13.1. Principal component analysis in MATLAB is done using the function pca of the statistics toolbox. There are several signatures of this function which we review next. The statement coeff = pca(A) performs principal components analysis (PCA) on the matrix A ∈ Rm×n , and returns the principal component coefficients, also known as loadings. Rows of A correspond to observations, and columns to variables. The columns of the matrix coeff (an n × n matrix) contain coefficients for one principal 1.5

1

SK MK

BE

HR

0.5 second pc

CH HU

YU

AT ES IT FR

BG

0

LU GR MD

−0.5

RU

GE BA

AL

−1

−1.5 −3

Fig. 13.1

−2

−1

RO LT

0 first pc

IS MT

1

2

3

The first two principal components of the FAO dataset.

Dimensionality Reduction Techniques

773

component and these columns are in order of decreasing component variance. The function pca computes the principal components of a sample matrix X. The are several incarnations of the function pca, described as follows. (i) [coeff,score] = pca(X) returns the matrix score, the principal component scores, that is, the representation of X in the principal component space. The rows of score correspond to observations, columns to components. (ii) [coeff,score,latent] = pca(X) returns the vector latent which contains the eigenvalues of the covariance matrix of X. The matrix score contains the data formed by transforming the original data into the space of the principal components. The values of the vector latent are the variance of the columns of score. The function pca centers X by subtracting off column variance means, but does not rescale the columns of X. To perform principal components analysis with standardized variables, we need to use pca(zscore(X)). Example 13.2. The dataset that we are about to analyze originates in a study of the health condition of Boston neighborhoods [28] produced by the Health Department of the City of Boston. The data include incidence of various diseases and health events that occur in the 16 neighborhoods of the city identified as Neighborhood Allston/Brighton Back Bay Charlestown East Boston Fenway Hyde Park Jamaica Plain Mattapan

Code AB BB CH EB FW HP JP MT

Neighborhood North Dorchester North End Roslindale Roxbury South Boston South End South Dorchester West Roxbury

Code ND NE RO RX SB SE SD WR

This is entered in MATLAB as neighborhoods = [’AB’;’BB’;’CH’;’EB’;’FW’;’HP’;’JP’;... ’MT’;’ND’;’NE’;’RS’;’RX’;’SB’;’SD’;’SE’;’WR’]

774

Linear Algebra Tools for Data Mining (Second Edition)

The diseases and the health conditions are listed as the vector categories: Category Hepatitis B Hepatitis C HIV/AIDS Chlamydia Syphilis Gonorrhea

Code HepB HepC HIVA CHLA SYPH GONO

Category Tuberculosis Live Births Low weight at birth Infant Mortality Children with Elevated Lead Subst. Abuse Treat. Admissions

Code TBCD B154 LBWE INFM CELL SATA

This is entered in MATLAB as categories = [’HepB’;’HepC’;’HIVA’;’CHLA’;’SYPH’;’GONO’;... ’TBCD’;’B154’;’LBWE’;’INFM’;’CELL’;’SATA’]

The data itself are contained and have the form 38 57 13 168 5 22 17 24 16 179 11 52 10 13 0 46 0 8 12 46 11 150 10 16 18 19 8 163 9 44 11 18 8 179 9 25 6 32 18 213 10 46 15 24 11 264 14 56 42 76 22 611 15 135 0 0 0 0 0 0 13 22 0 115 6 31 21 50 17 477 8 72 9 52 0 85 7 25 68 78 23 760 24 176 51 35 35 124 31 61 9 10 0 17 0 0

in the matrix diseaseinc in R16×12 20 8 0 17 7 9 6 8 29 0 0 8 5 24 11 0

607 306 284 718 125 487 420 285 1350 89 488 829 403 656 439 419

40 25 25 43 10 46 35 26 168 5 39 87 25 67 34 34

7 0 0 10 0 12 5 7 28 0 6 27 0 9 0 0

13 0 0 43 0 19 10 25 88 0 28 30 18 63 0 11

624 497 489 1009 272 2781 1071 390 1492 130 330 2075 1335 1464 6064 179

The array containing the sample variances of columns is computed by applying the function std: stdinc = std(diseaseinc)

Next, by using the function repmat as in si = diseaseinc./repmat(stdinc,16,1)

Dimensionality Reduction Techniques

775

we create a 16 × 12 matrix consisting of 16 copies of stdinc and compute the normalized matrix si that is subjected to PCA in [loadings,scores,variances]=pca(si)

This is one of several formats of the function pca. This function is applied to a data matrix and it centers the matrix by subtracting off column means. In the format that we use here, the function returns the matrices loadings, scores, and variances that contain the following data: (i) The columns of the matrix loadings contain the principal components. The entries of this matrix are, of course, known as loadings. In our case, loadings is a 12 × 12 matrix, where each column represents one principal component. The columns are in order of decreasing component variance. We reproduce the first three columns of this matrix as follows: 0.2914 0.3207 0.2666 0.3267 0.2426 0.3209 0.3215 0.3055 0.3061 0.2655 0.3026 0.1423

0.2732 −0.0568 0.3848 −0.0800 0.4650 0.0668 −0.0161 −0.2594 −0.2659 −0.3215 −0.2903 0.4702

−0.2641 −0.1593 0.1427 −0.2671 −0.0270 −0.3463 −0.1444 0.3184 0.2735 0.3832 −0.0815 0.5848

(ii) The matrix scores in R16×12 contain the principal component scores, that is, the representation of si in the principal component space. Rows of scores correspond to neighborhoods and columns, to components. (iii) The matrix variances contain the principal component variances, that is, the eigenvalues of the covariance matrix of si. The first two columns of the matrix scores contain the projections of data on the first two principal components. This is done by running plot(scores(:,1),scores(:,2),’*’)

Linear Algebra Tools for Data Mining (Second Edition)

776 5

SE

Second Principal Component

4 3 2 1

SD

BB

0

AB

NE RS

−1

EB RX

−2 −3 −4

Fig. 13.2

ND

−2

0 2 4 First Principal Component

6

8

Projections on the first two principal components.

After the plot is created, labels can be added to the axes using xlabel(’First Principal Component’) ylabel(’Second Principal Component’)

The resulting plot is shown in Figure 13.2. The neighborhood codes are applied to this plot by running gname(neighborhoods). An inspection of the figure shows that the health issues are different for neighborhoods like South End (SE), South Dorchester (SD), and North Dorchester (ND). The matrix variances allows us to examine the percentage of the total variability explained by each principal component. Initially, we compute the matrix percent_explained as percent_explained=100*variances/sum(variances)

and using the function pareto we write pareto(percent_explained) xlabel(’Principal Component’) ylabel(’Variance Explained’)

This code produces the histogram as shown in Figure 13.3.

Variance Explained

Dimensionality Reduction Techniques

777

90

90%

80

80%

70

70%

60

60%

50

50%

40

40%

30

30%

20

20%

10

10%

0

Fig. 13.3

1

2

3 Principal Component

4

5

0%

Percentage of variability explained by principal components.

To visualize the results, one can use the biplot function as in biplot(loadings(:,1:2),’scores’,scores(:,1:2),’varlabels’, categories)

resulting in Figure 13.4. Each of the 12 variables is represented by a vector in this figure. Since the first principal component has positive coefficients, all vectors are located in the right half-plane. On the other hand, the signs of the coefficients of the second principal component are varying. These components distinguish between neighborhoods where there is a high incidence of substance abuse treatment admissions (SATA), syphilis (SYPH), HIV/Aids (HIVA), Hepatitis B (HepB), and Gonorrhea (GONO) and low incidence of the others and neighborhoods where the opposite situation occurs. As observed in [84], the conclusions of a PCA analysis of data are mainly qualitative. The numerical precision (4 decimal digits) is not especially relevant for the PCA. Next, we present a geometric point of view of principal component analysis.

Linear Algebra Tools for Data Mining (Second Edition)

778 0.5

SATA

0.4

SYPH HIVA

0.3

HepB

Component 2

0.2 0.1

GONO

0

TBCD HepC

−0.1

CHLA

−0.2

B154 LBWE

−0.3

CELL INFM

−0.4 −0.5 −0.5

−0.4

−0.3

Fig. 13.4

−0.2

−0.1

0 0.1 Component 1

0.2

0.3

0.4

0.5

Representation of the 12 variables.

Let t ∈ Rn be a unit vector. The projection of a vector w ∈ Rn on the subspace t generated by t is given by projt (w) = tt w. To simplify the notation, we shall write projt instead of projt . Let ˆ ∈ Rm×n be a centered sample matrix that corresponds to a W sequence of experiments (u1 , . . . , um ), that is ⎞ u1 ⎜ ⎟ ˆ = ⎜ .. ⎟ . W ⎝ . ⎠ um ⎛

ˆ  )) on the subspace genWe seek to evaluate the inertia I0 (projt (W n ˆ erated by the unit vector t ∈ R . Since W  = (u1 · · · um ), by the

Dimensionality Reduction Techniques

779

definition of inertia, we have ˆ  )) = I0 (projt (W

m  j=1

=

m 

tt uj 22 uj tt tt uj

j=1

=

m 

uj tt uj

j=1

(because t t = 1) =

m 

t uj uj t

j=1

(because both uj t and t uj are scalars) = t X  Xt. The necessary condition for the existence of extreme values of this inertia as a function of t is 



ˆ W ˆ t + λ(1 − t t) ˆ  )) + λ(1 − t t) = grad t W grad I0 (projt (W ˆ u − 2λt = 0, ˆ W = 2W ˆ t = λt. In other ˆ W where λ is a Lagrange multiplier. This implies W ˆ  )), t must words, to achieve extreme values of the inertia I0 (projt (W ˆ , that is, be chosen as an eigenvector of the covariance matrix of W ˆ as a principal direction of W . ˆ ∈ Rm×n can The principal directions of a data sample matrix W be obtained directly from the data sample matrix W by applying Corollary 8.16, a consequence of the Courant–Fisher theorem. ˆ are the numbers λ1  · · ·  ˆ W Suppose that the eigenvalues of W λn . The first principal direction t1 of W , which corresponds to the ˆ , is ˆ W largest eigenvalue of W   ˆW ˆ  t | t ∈ Rn , t2 = 1 t1 = arg max t W t   ˆ  t2 | t2 = 1 . = arg max W 2 t

780

Linear Algebra Tools for Data Mining (Second Edition)

ˆ. Suppose that we computed the principal directions t1 , . . . , tk of W n Then, by Corollary 8.16, tk+1 ∈ R is a unit vector t that maximizes ˆ  t2 ˆW ˆ  t = W t W 2 and belongs to the subspace orthogonal to the subspace generated ˆ , that is, by the first k principal directions of W   ˆ  t22 | t ∈ Rn , t2 = 1, t ∈ t1 , . . . , tk ⊥ . tk+1 = arg max W t

Note that for every vector z ∈ Rn , we have ⎞ ⎛ k  ⎝I − tj tj ⎠ z = z − projt1 ,...,tk  z ∈ t1 , . . . , tk ⊥ . j=1

 Therefore, x ∈ t1 , . . . , tk ⊥ is equivalent to x = (I − kj=1 tj tj )x. Thus, we can write ⎫ ⎧ ⎛ ⎞  k ⎬ ⎨    ˆ ⎝  ⎠  tj tj t | t2 = 1 , tk+1 = arg max W I − t ⎩ ⎭  j=1

2

for 0  k  n − 1. This technique allows finding the principal direcˆ by solving a sequence of optimization problems involving tions of W ˆ. the matrix W Next, we discuss an algorithm known as the Nonlinear Iterative Partial Least Squares Algorithm (NIPALS), which can be used for computing the principal components of a centered data matrix (Algorithm 13.2.1). The algorithm computes a sequence of matrices ˆ and needs a parameX1 , X2 , . . . beginning with the matrix X1 = W ter  to impose an iteration limitation. ˆ ), we found the principal directions of W . In this If c = rank(W ˆ can be written as W ˆ = T V , where T = (t1 · · · tc ) and case, W V = (v 1 · · · v c ). Let j = Xj t2 . When the algorithm completes the repeat loop, we have ˘t = Xj v j is approximatively equal to tj , so Xj Xj v j = j v j , which implies that j is close to an eigenvalue and v j is close to an eigenvector of Xj Xj . Also, we have t t = v j Xj Xj v j = v j (Xj Xj v j ) = j v j v j = j , since v j is a unit vector.

Dimensionality Reduction Techniques

781

Algorithm 13.2.1: The NIPALS Algorithm ˆ ∈ Rm×n Data: A centered sample matrix W ˆ , where Result: The first c principal directions of W 1  c  rank(A) ˆ 1 X1 = W ; 2 for j = 1 to c do 3 choose tj as any column of Xj ; 4 repeat 5 6 7 8 9

X t

v j = X jt2 ; j ˘t = Xj v j until ˘t − tj 2 < ; Xj+1 = Xj − tj v j ; end ˆ = X1 = t1 v  + X2 . We have After the first run, we have W 1 ˆ v 1 − v1 λ1 = 0. ˆ  t1 − v 1 t t1 = W ˆ W ˆ − t1 v ) t1 = W (W 1 1

Since t2 is initially a column of X2 , t2 it is orthogonal to t1 and remains so to the end of the loop. ˆ = t1 v  + After the second run through the for loop, we have W 1  t2 v2 + X3 , etc. When the loop is completed, we have ˆ = t1 v  + t2 v  + · · · + tc vc + Xc+1 . W 1 2 ˆ ), then Xc+1 = O. If c = rank(W 13.3

Linear Discriminant Analysis

Linear discriminant analysis (LDA) is a supervised dimensionreduction technique. If a dataset contains data that belong to k distinct classes, LDA aims to find a low-dimensional subspace such that the projections of the data vectors on this subspace are well-separated clusters. Let (u1 , . . . , um ) ∈ Seq(Rn ) be a sequence of vectors that belong to two classes C1 and C2 . Denote by mi the mean of the vectors that

782

Linear Algebra Tools for Data Mining (Second Edition)

belong to the class Ci for i = 1, 2 and by m the global mean m |C1 | |C2 | 1  ui = m1 + m2 . m= m m m i=1

We seek a unit vector w such that the projections on the subspace w of the vectors that belong to the two classes are as separated as possible. The function φ : Rn −→ R defined by φ(u) = w u is a discriminant function and the set of projections consists of {φ(ui ) = w ui | 1  i  m}. The means of the projections on w are 1   {w u ∈ Ci } = w mi mi,w = |Ci | for i = 1, 2. The class scatter matrix for Ci is the matrix Si ∈ Rn×n defined by  {(u − mi )(u − mi ) | u ∈ Ci } Si = for i = 1, 2. The intra-class scatter matrix is Sintra = S1 + S2 . The scatter of Ci relative to w is defined as the number s2i,w = w Sintra w and the total intra-scatter relative to w is s2w = s21,w + s22,w = w Sintra w.

(13.1)

Note that the matrix Sintra is symmetric and positive semidefinite. Let Sinter be the inter-class scatter matrix defined by Sinter = |C1 |(m1 − m)(m1 − m) + |C2 |(m2 − m)(m2 − m) . By substituting the value of m in the definition of Sinter , we obtain |C1 ||C2 | (13.2) (m1 − m2 )(m1 − m2 ) . m Thus, the inter-class scatter matrix is a matrix of rank at most 1 (when m1 = m2 ). Sinter =

Lemma 13.1. The separation between the means of the classes is given by |m1,w − m2,w |2 = w Sinter w.

(13.3)

Dimensionality Reduction Techniques

Proof.

783

By applying the definitions of m1,w and m2,w , we have |m1,w − m2,w |2 = (w m1 − w m2 )2 = w (m1 − m2 )(m1 − m2 ) w m = w Sinter w. |C1 | |C2 |



Definition 13.4. The Fisher linear discriminant is the generalized Rayleigh–Ritz quotient of the inter-class scattering matrix and the intra-class scattering matrix: w Sinter w F(w) =  w Sintra w for w ∈ Rn − {0n }. Equalities (13.1) and (13.1) imply the equality |C1 | |C2 | |m1,w − m2,w |2 . (13.4) m s2w To obtain well-separated projections of the vectors of the classes, we need to find w that maximizes Fisher’s linear discriminant. As observed in Section 8.9, the largest value of the generalized Rayleigh– Ritz quotient (and, therefore the largest value of the Fisher linear discriminant) is obtained when w is a generalized eigenvector that corresponds to the largest generalized eigenvalue of the matrix pencil (Sinter , Sintra ). When the vectors of the sequence (u1 , . . . , um ) ∈ Seq(Rn ) belong to several classes C1 , . . . , Ck , we need to consider the projection on a (k − 1)-dimensional  subspace. If mi = |C1i | {u | u ∈ Ci }, then, as before, the intra-class  scatter matrix is Sintra = ki=1 Si , where  {(u − mi ) (u − mi ) | u ∈ Ci } ∈ Rn×n Si = F(w) =

for 1  i  k. The inter-class scatter matrix is now given by Sinter =

k 

|Ci |(mi − m)(mi − m) ,

j=1

where m is the global mean of the sequence  of vectors. Thus, Sinter is the sum of k matrices of rank 1. Since ki=1 (mi − m) = 0, the actual rank of Sinter is k − 1.

784

Linear Algebra Tools for Data Mining (Second Edition)

If Sintra is a non-singular matrix, then the eigenvalues of the pencil −1 (Sinter , Sintra ) coincide with the eigenvalues of the matrix Sintra Sinter . n×n be the projection matrix on the desired subspace. Let P ∈ R The projection of the mean of the class Ci is ˘i= m

1 {P u | u ∈ Ci } = P mi |Ci |

and the projection of the global mean is ˘ = m



|Ci | ˜ i = P m. m k i=1 |Ci |

Thus, the intra-scatter matrix of the projections of class Ci is  {(P u − P mi )(P u − P mi ) | u ∈ Ci } S˘i =  =P {(u − mi )(u − mi ) P  | u ∈ Ci } = P Si P  , which implies that the intra-scatter matrix of the projections is S˘intra = P Sintra P  . Similarly, the inter-scatter matrix of the projections is S˘inter = P Sinter P  . The projection matrix P is chosen in this case such that −1 ˘ Sinter ) is maximal. trace(S˘intra An interesting connection between LDA and the least square regression is shown in [174]. 13.4

Latent Semantic Indexing

Latent semantic indexing (LSI) is an information-retrieval technique presented in [34, 40] that is based on the SVD. The central problem in information retrieval (IR) is the computation of the sets of documents that contain terms specified by queries submitted by users of collections of documents. The problem is quite challenging because any such retrieval needs to take into account that a concept can be expressed by many equivalent words (synonymy) and that the same word may mean different things in various contexts (polysemy). This can lead the retrieval technique to return documents that are irrelevant to the query (false positives) or to omit documents that may be relevant (false negatives).

Several models of information retrieval exist. We refer the interested reader to [6] for a comprehensive presentation. We discuss here the vector model and examine an application of SVD to this model. The next definition establishes the framework for discussing LSI.

Definition 13.5. A corpus is a pair K = (T, D), where T = {t_1, ..., t_m} is a finite set whose elements are referred to as terms, and D = (D_1, ..., D_n) is a sequence of documents. Each document D_j is a finite sequence of terms, D_j = (t_{j1}, t_{j2}, ..., t_{jk}, ...).

Let K be a corpus with |T| = m terms and n documents. If t_i is a term and D_j is a document of K, the frequency of t_i in D_j is the number of occurrences of t_i in D_j, that is, a_{ij} = |{p | t_{jp} = t_i}|. The frequency matrix of the corpus K is the matrix A ∈ R^{m×n} defined by A = (a_{ij}). Each term t_i generates a row vector (a_{i1}, a_{i2}, ..., a_{in}) referred to as a term vector, and each document D_j generates a column vector d_j = (a_{1j}, ..., a_{mj})'.

A query is a sequence of terms q ∈ Seq(T); it is also represented as a vector q = (q_1, ..., q_m)', where q_i = 1 if the term t_i occurs in q, and q_i = 0 otherwise.
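As an illustration (not part of the original text), the following MATLAB sketch builds the frequency matrix A and a query vector q for a small hypothetical corpus; the term names and documents are assumptions made for the example.

   terms = {'matrix','vector','norm','rank'};     % hypothetical terms t_1,...,t_4
   docs  = { [1 2 1 3], [2 3 3], [1 4 4 2] };     % documents as sequences of term indices
   m = numel(terms); n = numel(docs);
   A = zeros(m,n);                                % frequency matrix of the corpus
   for j = 1:n
       for p = 1:numel(docs{j})
           i = docs{j}(p);
           A(i,j) = A(i,j) + 1;                   % a_ij = occurrences of t_i in D_j
       end
   end
   q = zeros(m,1); q([1 3]) = 1;                  % query containing the terms t_1 and t_3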


When a query q is applied to a corpus K, an IR system based on the vector model computes the similarity between the query and the documents of the corpus by evaluating the cosine of the angle between the query vector q and the vectors of the documents of the corpus. For the angle α_j between q and d_j, we have

   cos α_j = (q, d_j)/(‖q‖_2 ‖d_j‖_2).

The IR system returns those documents D_j for which this angle is small, that is, cos α_j ≥ t, where t is a parameter provided by the user.

The LSI method aims to capture relationships between documents motivated by the underlying structure of the documents. This structure is obscured by synonymy, polysemy, the use of insignificant syntactic-sugar words, and plain noise caused by misspelled words or counting errors.

By Theorem 9.8, if A ∈ R^{m×n} is the matrix of a corpus K, A = UDV^H is an SVD of A, and rank(A) = p, then the first p columns of U form an orthonormal basis for range(A), the subspace generated by the document vectors of K, and the last n − p columns of V constitute an orthonormal basis for null(A). Also, by Corollary 9.6, the first p transposed columns of V form an orthonormal basis for the subspace of R^n generated by the term vectors of K.

Example 13.3. Consider a tiny corpus K = (T, D), where T = {t_1, ..., t_5} and D = (D_1, D_2, D_3). Suppose that the matrix of the corpus is

          D1   D2   D3
   t1      1    0    0
   t2      0    1    0
   t3      1    1    1
   t4      1    1    0
   t5      0    0    1

A matrix of this type may occur when documents D_1 and D_2 are fairly similar (they contain two common terms, t_3 and t_4) and when t_1 and t_2 are synonyms. A query that seeks documents that contain t_1 returns the single document D_1. However, since t_1 and t_2 are synonyms, it would be desirable to have both D_1 and D_2 returned. Of course, the matrix representation cannot directly account for the equivalence of the terms t_1 and t_2. The fact that t_1 and t_2 are synonymous is consistent with the fact that both these terms appear in the common context {t_3, t_4} in D_1 and D_2. The successive approximations of A are

   B(1) = σ_1 u_1 v_1^H =
      0.4319   0.4319   0.2425
      0.4319   0.4319   0.2425
      1.1063   1.1063   0.6213
      0.8638   0.8638   0.4851
      0.2425   0.2425   0.1362

   B(2) = σ_1 u_1 v_1^H + σ_2 u_2 v_2^H =
      0.5000   0.5000   0.0000
      0.5000   0.5000  -0.0000
      1.0000   1.0000   1.0000
      1.0000   1.0000   0.0000
      0.0000  -0.0000   1.0000

Another property of SVDs that is useful for LSI was shown in Theorem 9.7, namely that the rank-1 matrices u_i v_i^H are pairwise orthogonal and their Frobenius norms are all equal to 1. If we regard the noise as distributed with relative uniformity over the p orthogonal components of the SVD, then by omitting several such components that correspond to relatively small singular values, we eliminate a substantial part of the noise and obtain a matrix that better reflects the underlying hidden structure of the corpus.

Example 13.4. Suppose that we apply to the miniature corpus described in Example 13.3 a query whose vector is q = (1, 0, 0, 1, 0)'.

The similarity between q and the document vectors d_i, 1 ≤ i ≤ 3, that constitute the columns of A is cos(q, d_1) = 0.8165, cos(q, d_2) = 0.4082, and cos(q, d_3) = 0, suggesting that d_1 is by far the most relevant document for q. However, if we compute the same cosine for q and the columns b_1, b_2, and b_3 of the matrix B(2), we have cos(q, b_1) = cos(q, b_2) = 0.6708 and cos(q, b_3) = 0. This approximation of A uncovers the hidden similarity of d_1 and d_2, a fact that is quite apparent from the structure of the matrix B(2).
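The computations in Examples 13.3 and 13.4 can be reproduced with a short MATLAB sketch such as the one below (added here for illustration; it simply restates the data of the examples).

   A = [1 0 0; 0 1 0; 1 1 1; 1 1 0; 0 0 1];       % corpus matrix of Example 13.3
   q = [1 0 0 1 0]';                               % query vector of Example 13.4
   [U,S,V] = svd(A);
   B2 = U(:,1:2)*S(1:2,1:2)*V(:,1:2)';             % rank-2 approximation B(2)
   cosA = (q'*A)  ./ (norm(q)*sqrt(sum(A.^2)));    % cosines with the columns of A
   cosB = (q'*B2) ./ (norm(q)*sqrt(sum(B2.^2)));   % cosines with the columns of B(2)

The first vector of cosines reproduces the values quoted above for the columns of A, and the second one the values for the columns of B(2).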

13.5  Recommender Systems and SVD

Recommender systems use data analysis techniques to help customers find products they would like to buy, websites they would be interested in visiting, entertainment appropriate for their tastes, etc. Frequently, these systems are based on collaborative filtering, a process that involves collaboration among multiple agents having different viewpoints. Collaborative filtering requires collecting data concerning the interests or tastes of many users and is based on the assumption that users who made similar choices in the past will agree in the future. Thus, collaborative recommender systems recommend items to users based on ratings of these items awarded by other users.

Formally, we can regard a recommender system as a bipartite weighted graph G = (B ∪ T, E, r), where {B, T} is a partition of the set of vertices of the graph. The sets B = {b_1, ..., b_m} and T = {t_1, ..., t_n} are the set of buyers and the set of items, respectively. An edge (b_i, t_j) ∈ B × T exists in E if buyer b_i has rated item t_j with the rating a_{ij}. In this case, r(b_i, t_j) = a_{ij}. For large recommender systems, the sizes of the sets B and T can be of the order of millions, an aspect that raises issues of performance and scalability for the designers of such systems.

SVDs can be used for recommender systems in the same way that they are used for latent semantic indexing (see Section 13.4), that is, by using lower-rank approximations of the recommender system matrix, which are likely to filter out the noise effect of smaller singular value components.


Generally, each buyer rates a rather small number of items; thus, the set of ratings is sparse. To remedy this sparsity, one replaces each missing rating of an item by the average of the existing ratings for that item, as proposed in [142]. We obtain a matrix A ∈ R^{m×n}. The vector r ∈ R^m of average ratings for each buyer is r = (1/n) A 1_n and serves for the normalization of A. This normalization process consists in subtracting the average rating of each buyer from each rating generated by that buyer. The normalized matrix N is given by

   N = A − diag(r_1, ..., r_m) J_{m,n}.

The SVD of the resulting matrix is computed and a low-rank approximation of the rating matrix is considered (Algorithm 13.5.1). The new rating matrix Â is obtained by adding the average ratings of the customers to the rank-k approximation of the normalized rating matrix, that is,

   Â = diag(r_1, ..., r_m) J_{m,n} + U(:, 1:k) D(1:k, 1:k) V(:, 1:k)'.

Algorithm 13.5.1: Algorithm for Computing a Ratings Matrix
Data: The bipartite graph G of the recommender system
Result: A matrix of ratings
1  compute the vector of average ratings for items;
2  construct the matrix A by adopting the average rating of each item for its missing ratings;
3  compute the normalized matrix N by subtracting the average rating of each buyer from each rating in A;
4  compute the SVD (U, D, V) of the normalized matrix N;
5  adopt a rank k;
6  compute the new ratings matrix as diag(r_1, ..., r_m) J_{m,n} + U(:, 1:k) D(1:k, 1:k) V(:, 1:k)'
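A minimal MATLAB sketch of Algorithm 13.5.1 is given below (added for illustration). It assumes that the matrix A already contains the item-average replacements for the missing ratings; the rating matrix of Example 13.5 is used as toy data and the rank k = 2 is an arbitrary choice.

   A = [2 2 4 2.5; 2 3 4 2.5; 2 3 1 2; 2 3 4 3];   % ratings of Example 13.5
   [m,n] = size(A);
   r = (1/n)*A*ones(n,1);                          % average rating of each buyer
   N = A - r*ones(1,n);                            % normalized matrix
   [U,D,V] = svd(N);
   k = 2;                                          % chosen rank
   Ahat = r*ones(1,n) + U(:,1:k)*D(1:k,1:k)*V(:,1:k)';   % new ratings matrix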

Example 13.5. Suppose we have a collaborative system having the bipartite graph shown in Figure 13.5.

Fig. 13.5  Bipartite graph of a recommender system.

The average ratings of the items are as follows:

   Item            t1    t2    t3    t4
   Average rating   2     3    3.25   2.5

Thus, the rating matrix is

   A = | 2   2   4   2.5 |
       | 2   3   4   2.5 |
       | 2   3   1   2   |
       | 2   3   4   3   |

The average ratings of the buyers are given by r = (1/4) A 1_4 and are obtained in MATLAB using the expression 0.25*A*ones(4,1). We obtain the vector of averages

   2.6250  2.8750  2.0000  3.0000

The normalized matrix N is

  -0.6250   -0.6250    1.3750   -0.1250
  -0.8750    0.1250    1.1250   -0.3750
        0    1.0000   -1.0000         0
  -1.0000         0    1.0000         0

The SVD decomposition N = U D V' is computed by [U,D,V] = svd(N):

U =
   -0.5962    0.2111    0.0004   -0.7746
   -0.4995   -0.4647    0.6840    0.2582
    0.4036   -0.7549   -0.0232   -0.5164
   -0.4818   -0.4118   -0.7291    0.2582

D =
    2.7177         0         0         0
         0    1.1825         0         0
         0         0    0.3013         0
         0         0         0    0.0000

V =
    0.4752    0.5805    0.4326    0.5000
    0.2626   -0.7991    0.2060    0.5000
   -0.8342    0.0936    0.2129    0.5000
    0.0963    0.1250   -0.8515    0.5000

The rank-2 approximation B(2) is computed as the product of the matrices U_2 √D_2 and √D_2 V_2', where U_2 consists of the first two columns of U, V_2 consists of the first two columns of V, and D_2 is the matrix

   D_2 = | 2.7177       0  |
         |      0  1.1825  |,

which gives

   √D_2 = | 1.6485       0  |
          |      0  1.0874  |.

13.6  Metric Multidimensional Scaling

Multidimensional scaling (MDS) is a process that allows us to represent a dissimilarity space using a low-dimensional Euclidean space. Scaling is important for visualizing the results of data exploration. Two basic types of scaling algorithms exist. Metric multidimensional scaling starts with a finite dissimilarity space and produces a set of vectors that optimizes a function known as the strain. Non-metric multidimensional scaling seeks a monotonic relationship between the values of a finite dissimilarity and the distances between the vectors that represent the elements of the dissimilarity space. We discuss only metric multidimensional scaling.

Let (x_1, ..., x_m) be a sequence of m vectors in R^n. The corresponding matrix is X = (x_1 · · · x_m) ∈ R^{n×m}. Note that X is the transpose of the sample data matrix previously considered (which had x_1', ..., x_m' as its rows). Given the matrix of squared Euclidean distances D = (d²_{ij}) ∈ R^{m×m}, where d²_{ij} = ‖x_i − x_j‖²_2 = (x_i − x_j)'(x_i − x_j) for 1 ≤ i, j ≤ m, we need to retrieve the vectors x_1, ..., x_m. Clearly, this problem does not have a unique solution because the matrix D is the same for (x_1, ..., x_m) and for (x_1 + c, ..., x_m + c) for every c ∈ R^n.

Let G ∈ R^{m×m} be the Gram matrix of X, G = G_X = X'X. Since g_{pq} = x_p'x_q, we have

   d²_{ij} = (x_i − x_j)'(x_i − x_j) = g_{ii} + g_{jj} − 2g_{ij}    (13.5)

for 1 ≤ i, j ≤ m. Suppose now that F ∈ R^{m×m} is the Gram matrix of another sequence of vectors (y_1, ..., y_m), that is, F = G_Y = Y'Y, where Y = (y_1 · · · y_m) ∈ R^{n×m}, such that d²_{ij} = f_{ii} + f_{jj} − 2f_{ij}. Then g_{ii} + g_{jj} − 2g_{ij} = f_{ii} + f_{jj} − 2f_{ij} for 1 ≤ i, j ≤ m. Let W = G − F. Then W is a symmetric matrix and w_{ii} + w_{jj} − 2w_{ij} = 0, so w_{ij} = (1/2)(w_{ii} + w_{jj}). Let w = (1/2)(w_{11}, ..., w_{mm})' and note that the matrix W can now be written as W = w 1_m' + 1_m w', which proves that W is a special, rank-2 matrix. Consequently,

   G = F + w 1_m' + 1_m w'.    (13.6)

Thus, the set of vectors that correspond to a distance matrix is not unique, and the Gram matrices of any two such sequences differ by a symmetric matrix of rank 2.


It is possible to construct X starting from D if we assume that the centroid of the columns of X is 0_n, that is, if Σ_{i=1}^m x_i = 0_n. Let A ∈ R^{m×m} be the matrix defined by A = −(1/2) D. Elementwise, this means that a_{ij} = −(1/2) d²_{ij} for 1 ≤ i, j ≤ m. Consider the averages defined by

   a_{i·} = (1/m) Σ_{j=1}^m a_{ij},
   a_{·j} = (1/m) Σ_{i=1}^m a_{ij},
   a_{··} = (1/m²) Σ_{i=1}^m Σ_{j=1}^m a_{ij}.

The components of the Gram matrix G ∈ R^{m×m}, g_{ij} = x_i'x_j for 1 ≤ i, j ≤ m, can be expressed using these averages, assuming that the set of columns of X is centered in 0_n.

Theorem 13.4. Let X = (x_1 · · · x_m) ∈ R^{n×m} be a matrix such that Σ_{i=1}^m x_i = 0_n and let A be the matrix defined by a_{ij} = −(1/2)‖x_i − x_j‖²_2 for 1 ≤ i, j ≤ m. The components of the Gram matrix G, g_{ij} = x_i'x_j, are given by

   g_{ij} = a_{ij} − a_{i·} − a_{·j} + a_{··}

for 1 ≤ i, j ≤ m.

Proof.

By Equality (13.5), we have −2a_{ij} = g_{ii} + g_{jj} − 2g_{ij} for 1 ≤ i, j ≤ m. Note that

   Σ_{i=1}^m g_{ij} = Σ_{j=1}^m g_{ij} = 0,

because Σ_{i=1}^m x_i = 0_n. The averages introduced earlier can be written as

   −2a_{·j} = (1/m) Σ_{i=1}^m d²_{ij} = (1/m) Σ_{i=1}^m (g_{ii} + g_{jj} − 2g_{ij})
            = (1/m) Σ_{i=1}^m g_{ii} + g_{jj} − (2/m) Σ_{i=1}^m x_i'x_j
            = g_{jj} + (1/m) Σ_{i=1}^m g_{ii},

because Σ_{i=1}^m x_i = 0_n. Similarly, we have

   −2a_{i·} = g_{ii} + (1/m) Σ_{j=1}^m g_{jj}.    (13.7)

Therefore,

   (1/(2m²)) Σ_{i=1}^m Σ_{j=1}^m d²_{ij} = (1/m) Σ_{i=1}^m g_{ii}.    (13.8)

Thus, we have

   g_{ii} = (1/m) Σ_{j=1}^m d²_{ij} − (1/m) Σ_{j=1}^m g_{jj},
   g_{jj} = (1/m) Σ_{i=1}^m d²_{ij} − (1/m) Σ_{i=1}^m g_{ii}.

Equality (13.5) yields

   g_{ij} = −(1/2)(d²_{ij} − g_{ii} − g_{jj})
          = −(1/2)(d²_{ij} − (1/m) Σ_{j=1}^m d²_{ij} + (1/m) Σ_{j=1}^m g_{jj} − (1/m) Σ_{i=1}^m d²_{ij} + (1/m) Σ_{i=1}^m g_{ii})
          = −(1/2)(d²_{ij} − (1/m) Σ_{j=1}^m d²_{ij} − (1/m) Σ_{i=1}^m d²_{ij} + (1/m²) Σ_{i=1}^m Σ_{j=1}^m d²_{ij})
            (by Equality (13.8))
          = a_{ij} − a_{i·} − a_{·j} + a_{··},

which completes the proof.

Corollary 13.2. The Gram matrix G = X'X of the sequence of vectors X = (x_1, ..., x_m) in R^n can be obtained from the matrix A = (−(1/2)‖x_i − x_j‖²_2) as G = H_m A H_m, where H_m is the centering matrix H_m = I_m − (1/m) 1_m 1_m'.

Proof. The matrix H_m A H_m can be written as

   H_m A H_m = (I_m − (1/m) 1_m 1_m') A (I_m − (1/m) 1_m 1_m')
             = (I_m − (1/m) 1_m 1_m')(A − (1/m) A 1_m 1_m')
             = A − (1/m) 1_m 1_m' A − (1/m) A 1_m 1_m' + (1/m²) 1_m 1_m' A 1_m 1_m'.

The terms of the above sum correspond to a_{ij}, a_{·j}, a_{i·}, and a_{··}, respectively. The desired conclusion then follows from Theorem 13.4.

The rank of the matrix G = X'X ∈ R^{m×m} is equal to the rank of X, namely rank(G) = n, as we saw in Theorem 3.33. Since G is symmetric, positive semidefinite, and of rank n, it follows that G has n positive eigenvalues and m − n zero eigenvalues.

By the Spectral Theorem for Hermitian matrices (Theorem 8.14), we have G = U'DU, where U = (u_1 · · · u_m) is an orthogonal matrix, D = diag(λ_1, ..., λ_n, 0, ..., 0), and λ_1 ≥ · · · ≥ λ_n > 0. Taking into account that the last m − n elements of the diagonal of G are 0, we can write G = V' diag(λ_1, ..., λ_n) V, where V ∈ R^{n×m}. By defining X as X = diag(√λ_1, ..., √λ_n) V ∈ R^{n×m}, we have G = X'X, and the m columns of X yield the desired vectors in R^n.

A more general problem begins with a matrix of dissimilarities Δ = (δ_{ij}) ∈ R^{m×m} and seeks to determine whether there exists a sequence of vectors (x_1, ..., x_m) in R^n such that d(x_i, x_j) = δ_{ij} for 1 ≤ i, j ≤ m.

Lemma 13.2. Let A, G ∈ R^{m×m} be two matrices such that G = H_m A H_m, where H_m is the centering matrix H_m = I_m − (1/m) 1_m 1_m'. Then

   g_{ii} + g_{jj} − 2g_{ij} = a_{ii} + a_{jj} − 2a_{ij}

for 1 ≤ i, j ≤ m.


Proof. We saw that if G = H_m A H_m, then g_{ij} = a_{ij} − a_{i·} − a_{·j} + a_{··}. Therefore, we have

   g_{ii} = a_{ii} − 2a_{i·} + a_{··},
   g_{jj} = a_{jj} − 2a_{·j} + a_{··}.

This allows us to write

   g_{ii} + g_{jj} − 2g_{ij} = a_{ii} − 2a_{i·} + a_{··} + a_{jj} − 2a_{·j} + a_{··} − 2(a_{ij} − a_{i·} − a_{·j} + a_{··}) = a_{ii} + a_{jj} − 2a_{ij}.

Theorem 13.5. Let Δ ∈ R^{m×m} be a matrix of dissimilarities, let A ∈ R^{m×m} be the matrix defined by a_{ij} = −(1/2)δ²_{ij} for 1 ≤ i, j ≤ m, and let G be the centered matrix G = H_m A H_m. If G is a positive semidefinite matrix and rank(G) = n, then there exists a sequence (x_1, ..., x_m) of vectors in R^n such that d(x_i, x_j) = δ_{ij} for 1 ≤ i, j ≤ m.

Proof. Since G is a symmetric, positive semidefinite matrix having rank n, by Theorem 8.14 it is possible to write G = V'DV, where V = (v_1 · · · v_m) ∈ R^{n×m}, D = diag(λ_1, ..., λ_n), and λ_1 ≥ · · · ≥ λ_n > 0. Let X = diag(√λ_1, ..., √λ_n) V ∈ R^{n×m}. We claim that the distances between the columns x_1, ..., x_m of X equal the prescribed dissimilarities. Indeed, since x_i = diag(√λ_1, ..., √λ_n) v_i, we have

   d(x_i, x_j)² = (x_i − x_j)'(x_i − x_j)
                = x_i'x_i + x_j'x_j − 2x_i'x_j
                = g_{ii} + g_{jj} − 2g_{ij}   (by Lemma 13.2)
                = a_{ii} + a_{jj} − 2a_{ij} = −2a_{ij} = δ²_{ij},

which is the desired conclusion.

Since G = Hm AHm and Hm has an eigenvalue equal to 0, it is clear that G also has such an eigenvalue. Therefore, rank(G) ≤ m−1, so there exist m vectors of dimensionality not larger than m − 1 such that their distances are equal to the given dissimilarities.
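The construction used in the proofs of Theorems 13.4 and 13.5 translates directly into a few lines of MATLAB. The following sketch is added for illustration; the toy points P are an assumption of the example, and Delta is assumed to be a symmetric dissimilarity matrix with a zero diagonal.

   P = [0 0; 1 0; 0 2; 3 1];                       % toy planar points (rows)
   Delta = squareform(pdist(P));                   % their Euclidean distance matrix
   m  = size(Delta,1);
   Hm = eye(m) - ones(m)/m;                        % centering matrix H_m
   G  = Hm*(-0.5*Delta.^2)*Hm;                     % G = H_m A H_m, a_ij = -delta_ij^2/2
   [V,L] = eig((G+G')/2);                          % symmetrize to guard against round-off
   [lambda,idx] = sort(diag(L),'descend');
   pos = lambda > 1e-10;                           % positive eigenvalues of G
   X = diag(sqrt(lambda(pos)))*V(:,idx(pos))';     % columns of X are the recovered points
   % squareform(pdist(X')) reproduces Delta up to round-off when G is positive semidefinite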


We saw that the matrices XX' and G = X'X have the same rank and that their non-zero eigenvalues are positive numbers with the same algebraic multiplicities for both matrices (by Corollary 7.6). Let w be a principal component of the matrix X' ∈ R^{m×n}, that is, an eigenvector of the matrix XX'. Suppose that rank(G) = r and let X' = UDV' be the thin SVD decomposition of the matrix X', where D = diag(σ_1, ..., σ_r) and U and V have r orthonormal columns (see Corollary 9.3), so that U'U = V'V = I_r. Since the numbers σ_1, ..., σ_r are positive, D is invertible and we obtain U = X'VD^{-1}. Thus, MDS involves a process that is dual to the usual PCA; some authors refer to it as the dual PCA.

MATLAB deals with metric MDS using the function cmdscale. The function call X = cmdscale(D) is applied to a distance matrix D ∈ R^{m×m} and returns a matrix X ∈ R^{m×n}. The rows of X are the coordinates of m points in n-dimensional space for some n, where n < m. When D is a Euclidean distance matrix, the distances between those points are given by D. The number n is the smallest dimension of the subspace in which the m points whose inter-point distances are given by D can be embedded.

The call [X,e] = cmdscale(D) also returns the eigenvalues of XX' as components of the vector e. When D is Euclidean, the first n elements of e are positive and the rest are zero. If the first k elements of e are much larger than the remaining n − k, then it is possible to use the first k columns of X to produce k-dimensional vectors whose inter-point distances approximate D. This can provide a useful dimension reduction for visualization, e.g., for k = 2.

D need not be a Euclidean distance matrix. If it is non-Euclidean or a more general dissimilarity matrix, then some elements of e are negative, and cmdscale chooses n as the number of positive eigenvalues. In this case, the reduction to n or fewer dimensions provides a reasonable approximation to D only if the negative elements of e are small in magnitude.

D can be specified as either a full dissimilarity matrix, or in upper-triangle vector form such as is output by pdist. A full dissimilarity matrix must be real and symmetric, and have zeros along the diagonal and positive elements everywhere else. A dissimilarity matrix in upper-triangle form must have real, positive entries. D can also be specified as a full similarity matrix, with ones along the diagonal and all other elements less than one. The function cmdscale transforms such a


similarity matrix to a dissimilarity matrix in such a way that the distances between the points returned in X are equal to or approximate the distances given by √(1 − D).

Example 13.6. We start from a matrix of driving distances between five northeastern cities: Boston, Providence, Hartford, New York, and Concord.

dist =
         0    41.9000    92.8800   189.9000    63.4700
   41.9000          0    65.3600   154.8400    95.7800
   92.8800    65.3600          0    99.7600   115.5900
  189.9000   154.8400    99.7600          0   213.7800
   63.4700    95.7800   115.5900   213.7800          0

Applying cmdscale to this matrix, [X,e] = cmdscale(dist) produces the matrix

X =
   58.1439  -20.4773   -4.2664
   19.3304  -34.2586    3.4664
  -29.8485    8.8070    1.1787
 -129.6169    7.7975   -1.1686
   81.9911   38.1313    0.7899

and the vector of eigenvalues

e = 1.0e+004 *
    2.8168
    0.3185
    0.0034
   -0.0000
   -0.0006

Next, by defining the matrix

>> cities = ['BOS';'PRO';'HAR';'NYC';'CON']

the results are displayed using

>> plot(X(:,1),X(:,2),'+')
>> gname(cities)

which produces the representation contained in Figure 13.6.

Fig. 13.6  Representation of the five cities.

Of course, the representation is approximate, but the relative positions of the cities are reasonably close to their real placement.

13.7  Procrustes Analysis

Procrustes analysis (PRA) starts with two sets of vectors U = {u_1, ..., u_m} ⊆ R^n and V = {v_1, ..., v_m} ⊆ R^p, where p ≤ n, and tries to evaluate and improve the quality of the correspondence between these sets, as specified by the subscripts of the elements of the two sets. The dimensionality of the vectors of the two sets can be equalized by appending n − p zeroes as the last components of the vectors of V.

The name comes from Damastes, known as Procrustes (the stretcher), a mythological Greek character who had the annoying habit of fitting his guests to the bed he offered them: if the guest was too short, he stretched the guest; if the guest was too long, he cut off a part of the body to make the guest fit the bed. He was killed by the hero Theseus, the founder of Athens, who treated him to his own bed.


PRA tries to construct a linear transformation (through the application of dilations, rotations, reflections, and translations) that maps the points of V as close as possible to the points of U. The evaluation criterion for the quality of the matching is

   d(V, U) = Σ_{i=1}^m (v_i − u_i)'(v_i − u_i),

which we need to minimize.

Define the transformation h : R^n −→ R^n as h(v) = bT'v + c, where b ∈ R, T ∈ R^{n×n} is an orthogonal matrix, and c ∈ R^n. Clearly, h is the composition of a rotation (or a reflection), a dilation, and a translation. The quality of the matching between the displaced set of vectors h(V) and U is

   d(h(V), U) = Σ_{i=1}^m (bT'v_i + c − u_i)'(bT'v_i + c − u_i).

Let ũ and ṽ be the means of the vectors of U and V, respectively. Since

   bT'v_i + c − u_i = bT'(v_i − ṽ) − (u_i − ũ) + bT'ṽ + c − ũ

and

   Σ_{i=1}^m (bT'(v_i − ṽ) − (u_i − ũ)) = 0_n,

it follows that

   d(h(V), U) = Σ_{i=1}^m (bT'(v_i − ṽ) − (u_i − ũ))'(bT'(v_i − ṽ) − (u_i − ũ)) + m(bT'ṽ + c − ũ)'(bT'ṽ + c − ũ).

To ensure the minimum of d(h(V), U), we need to take c = ũ − bT'ṽ. Thus, the new set of vectors is defined by h(v_i) = bT'(v_i − ṽ) + ũ. This implies that the mean of the set {h(v_i) | 1 ≤ i ≤ m} coincides with ũ, that is, the set of vectors {v_i | 1 ≤ i ≤ m} is transformed such that the means of the two groups of vectors, h(V) and U, coincide.

Suppose now that both means of the sets U and V coincide with 0_n. In this case,

   d(h(V), U) = Σ_{i=1}^m (bT'v_i − u_i)'(bT'v_i − u_i)
              = b² Σ_{i=1}^m v_i'TT'v_i + Σ_{i=1}^m u_i'u_i − 2b Σ_{i=1}^m u_i'T'v_i
              = b² Σ_{i=1}^m v_i'v_i + Σ_{i=1}^m u_i'u_i − 2b Σ_{i=1}^m u_i'T'v_i
              = b² trace(VV') + trace(UU') − 2b trace(VTU'),    (13.9)

where U and V are the matrices whose rows are u_1', ..., u_m' and v_1', ..., v_m', respectively. The ambiguity introduced by denoting these matrices by the same letters as the sets of vectors is harmless. The minimum of this quadratic function of b is achieved when

   b = trace(TU'V)/trace(VV') = trace(VTU')/trace(VV')

and it equals

   d_min = (trace(VV') trace(UU') − trace(VTU')²)/trace(VV').

Once the translation (c) and the dilation ratio (b) have been chosen, we need to focus on the rotation matrix T .


Theorem 13.6 (Berge–Sibson Theorem). Let M ∈ R^{n×n} be a matrix having the singular value decomposition M = PDQ', where P, Q ∈ R^{n×n} are orthogonal matrices and D = diag(σ_1, ..., σ_n) with σ_i ≥ 0 for 1 ≤ i ≤ n. Then, for every orthogonal matrix T ∈ R^{n×n}, we have

   trace(TM) ≤ trace((M'M)^{1/2}),

and equality takes place when T = QP'.

Proof. Starting from the SVD M = PDQ', we have

   trace(TM) = trace(TPDQ') = trace(Q'TPD) = trace(LD),

where L = Q'TP ∈ R^{n×n} is an orthogonal matrix (as a product of orthogonal matrices). Since the elements of an orthogonal matrix cannot exceed 1, it follows that

   trace(TM) ≤ Σ_{i=1}^n σ_i = trace(D) = trace((D'D)^{1/2}).

Taking into account that D = P'MQ (and D' = Q'M'P), we have

   trace((D'D)^{1/2}) = trace((Q'M'PP'MQ)^{1/2}) = trace((Q'M'MQ)^{1/2}) = trace((M'MQQ')^{1/2})  (by Corollary 7.6) = trace((M'M)^{1/2}),

which is the desired inequality. If T = QP', we can write

   TM = QP'M = QP'PDQ' = QDQ' = (QDQ'QDQ')^{1/2} = (QDDQ')^{1/2} = (QDP'PDQ')^{1/2} = (M'M)^{1/2},

which concludes the argument.

The Berge–Sibson Theorem allows us to choose the optimal orthogonal matrix in Equality (13.9) by maximizing trace(VTU') = trace(TU'V). To this end, we need to take T = QP', where U'V has the SVD U'V = PDQ'.


Since T = QP', it follows that Q'TPD = D because P and Q are orthogonal. In turn, this implies QQ'TPDQ' = QDQ', so TPDQ' = QDQ'. Therefore,

   TU'V = TPDQ' = QDQ' = (QD²Q')^{1/2} = (QDP'PDQ')^{1/2} = ((U'V)'(U'V))^{1/2}.

The optimal dilation ratio can now be written as

   b = trace(((U'V)'(U'V))^{1/2})/trace(VV').

Since trace(TU'V) = trace(VTU'), the minimal value of d(h(V), U) is

   d_min = (trace(VV') trace(UU') − trace(VTU')²)/trace(VV')
         = trace(UU') − trace(VTU')²/trace(VV')
         = trace(UU') − trace(((U'V)'(U'V))^{1/2})²/trace(VV').

A normalized version of this criterion,

   d_min/trace(U'U) = 1 − trace(((U'V)'(U'V))^{1/2})²/(trace(VV') trace(UU')),

is known as the Procrustes criterion.

MATLAB performs PRA starting from two sample data matrices U and V, which have the same number of rows. The rows of U and V are obtained by transposing the vectors of U and V, respectively. The PRA matches the ith row in V to the ith row in U. Rows in U can have fewer dimensions (columns) than those in V; in this case, columns of zeros are added to U to equalize the number of columns of the two matrices. Calling the function

[d, Z] = procrustes(V,U)

also returns Z = (h(u_1) · · · h(u_m))' = bUT + c, containing the transformed U vectors. The transformation that maps U into Z is returned by the function call


[d, Z, transform] = procrustes(V,U)

The variable transform is a structure that consists of the following components:

• c, the translation component;
• T, the orthogonal matrix involved;
• b, the dilation factor.

Other interesting variants of procrustes are described as follows:

   procrustes(..., 'Scaling', false)       no scale component (b = 1)
   procrustes(..., 'Scaling', true)        includes a scale component (default)
   procrustes(..., 'Reflection', false)    no reflection component (det(T) = 1)
   procrustes(..., 'Reflection', 'best')   best fit solution (default)
   procrustes(..., 'Reflection', true)     solution includes a reflection (det(T) = −1)
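Before turning to the built-in function, the optimal T, b, and c derived above can also be computed directly. The following MATLAB sketch is an illustration added here (not the text's own code); it assumes two point sets stored row-wise in matrices U and V of the same size, with V a rotated, scaled, and shifted copy of U as toy data.

   U = randn(10,2);                                 % target point set (rows are points)
   V = 0.5*U*[cos(1) -sin(1); sin(1) cos(1)] + 3;   % toy distorted copy of U
   u0 = mean(U,1); v0 = mean(V,1);                  % means of the two sets
   Uc = U - u0;   Vc = V - v0;                      % centered sets
   [P,D,Q] = svd(Uc'*Vc);                           % SVD of U'V (centered)
   T = Q*P';                                        % optimal orthogonal matrix (Theorem 13.6)
   b = trace(D)/sum(Vc(:).^2);                      % optimal dilation ratio
   Z = b*Vc*T + u0;                                 % transformed copy of V
   d = sum(sum((Z - U).^2));                        % criterion d(h(V),U)

Apart from the normalization of the residual d used by the built-in function, this mirrors what procrustes(U,V) computes.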

Example 13.7. We apply the Procrustes analysis to the sets of points U and V in R² produced as in Example 4.18:

U =                         V =
    0.5377   -0.2050            2.2260    1.7753
    1.8339   -0.1241            2.7057    1.4795
   -2.2588    1.4897            1.3409    3.2412
    0.8622    1.4090            2.6851    2.4492
    0.3188    1.4172            2.3451    2.5894
   -1.3077    0.6715            1.6735    2.5745
   -0.4336   -1.2075            1.5266    1.5894
    0.3426    0.7172            2.2899    2.1642
    3.5784    1.6302            4.0256    1.7556
    2.7694    0.4889            3.2358    1.5190
   -1.3499    1.0347            1.6690    2.8621
    3.0349    0.7269            3.4838    1.5175
    0.7254   -0.3034            2.2542    1.7058
   -0.0631    0.2939            2.0618    2.1317
    0.7147   -0.7873            2.0694    1.5363


The function call >> [d,Z,tr] = procrustes(U,V)

returns

d = 0.0059

Z =
    0.6563   -0.1488
    1.7723   -0.1696
   -2.3302    1.4579
    0.7597    1.4628
    0.0388    1.3591
   -1.0910    0.6567
   -0.3485   -1.1707
    0.3733    0.5786
    3.7441    1.6315
    2.6363    0.4321
   -1.3887    1.1426
    3.0605    0.6795
    0.7745   -0.2389
    0.0172    0.2933
    0.6304   -0.7142

tr =
    T: [2x2 double]
    b: 1.9805
    c: [15x2 double]

The results can be represented graphically using

>> plot(U(:,1),U(:,2),'bx',V(:,1),V(:,2),'b+',Z(:,1),Z(:,2),'bo')

as shown in Figure 13.7. A summary inspection of Figure 13.7 shows that the transformation of the points of V (marked by '+') into the points of Z (marked by 'o') yields points that are quite close to the points of U (marked by 'x').

Fig. 13.7  Procrustes analysis of the sets U and V.

13.8  Non-negative Matrix Factorization

Throughout this book we have presented several factorization techniques for matrices that satisfy specific conditions involving the matrix to be factored or the factor matrices. In this chapter, we are concerned with a special type of approximate factorization of a matrix. Namely, starting from a matrix A ∈ R^{n×m} and a positive integer p < min{m, n}, we seek a pair of matrices (U, V) such that U ∈ R^{n×p}, V ∈ R^{p×m}, and ‖A − UV‖_F is minimal. If A is a non-negative matrix and we impose the non-negativity condition on the factors U, V, then we have a non-negative factorization of A. Note that rank(UV) ≤ p.

We begin by examining the existence of an approximate factorization for an arbitrary matrix A ∈ R^{n×m}. The minimization of ‖A − UV‖_F is equivalent to minimizing the function f(U, V) = (1/2)‖A − UV‖²_F. We have

   grad_U ‖A − UV‖²_F = UVV' − AV'    (13.10)


and

   grad_V ‖A − UV‖²_F = U'UV − U'A.    (13.11)

Thus, the necessary conditions for a stationary point are UVV' = AV' and U'UV = U'A. Also, if (U, V) is a solution of the problem, then so is (aU, (1/a)V), where a is a positive number. Thus, if the problem has a solution, this solution is not unique, in general.

The minimization of ‖A − UV‖_F when we seek non-negative factors U and V is equivalent to minimizing the function f(U, V) = (1/2)‖A − UV‖²_F subject to np + pm restrictions: u_{ij} ≥ 0 for 1 ≤ i ≤ n and 1 ≤ j ≤ p, and v_{jk} ≥ 0 for 1 ≤ j ≤ p and 1 ≤ k ≤ m, respectively.

We apply the Karush–Kuhn–Tucker Theorem to the function f(U, V) that depends on the np + pm components u_{ij} and v_{jk} of the matrices U ∈ R^{n×p} and V ∈ R^{p×m}. The functions g and h that capture the constraints are g : R^{n×p} × R^{p×m} −→ R^{n×p} and h : R^{n×p} × R^{p×m} −→ R^{p×m}, defined by

   g(U, V) = −U and h(U, V) = −V.

Let E^{a×b}_{qr} be the a × b matrix whose single non-zero component is equal to 1 and is located at the intersection of row q and column r. For the constraint functions, we have

   grad_U g_{ij} = −E^{n×p}_{ij},   grad_U h_{jk} = O_{n,p},
   grad_V g_{ij} = O_{p,m},         grad_V h_{jk} = −E^{p×m}_{jk}

for 1 ≤ i ≤ n, 1 ≤ j ≤ p, and 1 ≤ k ≤ m. The constraints are g(U, V) = −U ≤ O_{n,p} and h(U, V) = −V ≤ O_{p,m}. We introduce two matrices of Lagrange multipliers, B ∈ R^{n×p} and C ∈ R^{p×m}.


The first Karush–Kuhn–Tucker condition amounts to the equalities

   grad_U ‖A − UV‖²_F + Σ_{i=1}^n Σ_{j=1}^p b_{ij} grad_U g_{ij} + Σ_{j=1}^p Σ_{k=1}^m c_{jk} grad_U h_{jk} = O_{n,p},
   grad_V ‖A − UV‖²_F + Σ_{i=1}^n Σ_{j=1}^p b_{ij} grad_V g_{ij} + Σ_{j=1}^p Σ_{k=1}^m c_{jk} grad_V h_{jk} = O_{p,m},

which are equivalent to

   UVV' − AV' − B = O_{n,p},
   U'UV − U'A − C = O_{p,m}.

The remaining conditions can be written as b_{ij}u_{ij} = 0 and c_{jk}v_{jk} = 0 for 1 ≤ i ≤ n, 1 ≤ j ≤ p, and 1 ≤ k ≤ m, and B ≥ O_{n,p}, C ≥ O_{p,m}. If both U > O_{n,p} and V > O_{p,m}, then we have B = O_{n,p} and C = O_{p,m}, and the Karush–Kuhn–Tucker conditions imply UVV' = AV' and U'UV = U'A. If we seek matrices U and V of full rank, then, by Corollary 3.3, VV' and U'U are non-singular matrices and we obtain

   U = AV'(VV')^{-1} and V = (U'U)^{-1}U'A.

These arguments suggest a non-negative matrix factorization algorithm known as the alternating least-squares (ALS) algorithm (see [17, 127]). This algorithm takes advantage of the fact that, although the function F : R^{n×p} × R^{p×m} −→ R defined by F(U, V) = ‖A − UV‖_F is not convex with respect to the pair (U, V), it is convex in either U or V separately. Its basic form is Algorithm 13.8.1.

Algorithm 13.8.1: Alternating Least-Squares Factorization Algorithm
Data: A non-negative matrix A ∈ R^{n×m} and a maximum number of iterations r
Result: An approximate factorization A ≈ UV
 1  initialize U^1 to a random matrix;
 2  for t = 1 to r do
 3     solve for V^t the equation (U^t)'U^t V^t = (U^t)'A;
 4     set all negative elements in V^t to 0;
 5     if t < r then
 6        solve for U^{t+1} the equation V^t(V^t)'(U^{t+1})' = V^t A';
 7        set all negative elements in U^{t+1} to 0;
 8     end
 9  end
10  return U^r and V^r;
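A bare-bones MATLAB sketch of this alternating least-squares iteration is given below (added for illustration); the toy matrix, the inner dimension p, and the iteration count r are assumptions of the example.

   A = abs(rand(20,10));                     % toy non-negative matrix
   p = 3; r = 50;                            % inner dimension and number of iterations
   U = rand(size(A,1), p);                   % random start
   for t = 1:r
       V = (U'*U) \ (U'*A);                  % least-squares solve for V
       V(V < 0) = 0;                         % project onto the non-negative orthant
       U = ((V*V') \ (V*A'))';               % least-squares solve for U
       U(U < 0) = 0;
   end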


The factorization technique for non-negative matrices that we discuss next was developed by Lee and Seung [105] and consists of constructing iteratively a sequence of pairs of matrices (U^0, V^0), (U^1, V^1), ... such that the products U^tV^t offer increasingly better approximations of a non-negative matrix A. We need the following preliminary notion.

Definition 13.6. Let G : (R^k)² −→ R and F : R^k −→ R be two functions. G is an auxiliary function for F if G(x, y) ≥ F(x) and G(x, x) = F(x) for x, y ∈ R^k.

Consider a sequence of vectors (u^t)_{t∈N} in R^k defined by

   u^{t+1} = arg min_u G(u, u^t)    (13.12)

for t ∈ N. Then,

   F(u^{t+1}) ≤ G(u^{t+1}, u^t)   (by the first condition of Definition 13.6)
             ≤ G(u^t, u^t)        (by Equality (13.12))
             = F(u^t)             (by the second condition of Definition 13.6)

for t ∈ N. Thus, the sequence (F(u^t))_{t∈N} is non-increasing. The inequalities F(u^{t+1}) ≤ G(u^t, u^t) ≤ F(u^t) show that if F(u^{t+1}) = F(u^t), then G(u^t, u^t) = F(u^t), so u^t is a local minimum of G(u, u^t). If the derivatives of F exist and are continuous in a neighborhood of u^t, then ∇F(u^t) = 0. Thus, the sequence (u^t)_{t∈N} converges to a local minimum u_min of F.

Let A = (a_1 · · · a_m) be the matrix to be factored, where a_j ∈ R^n for 1 ≤ j ≤ m, and let F_A : R^{n×p} × R^{p×m} −→ R_{≥0} be the objective


function that we seek to minimize in NNMF:

   F_A(U, V) = (1/2)‖A − UV‖²_F.

If we write A = (a_1 · · · a_m) and V = (v_1 · · · v_m), then the matrix A − UV can be written as A − UV = (a_1 − Uv_1 · · · a_m − Uv_m), so

   ‖A − UV‖²_F = Σ_{j=1}^m ‖a_j − Uv_j‖²_2.

Let F_a(U, v) = (1/2)‖a − Uv‖²_2. The last equality can be written as

   F_A(U, V) = Σ_{j=1}^m F_{a_j}(U, v_j).

Lemma 13.3. Let v ∈ R^k_{>0}, let d_i(v) = (U'Uv)_i/v_i for 1 ≤ i ≤ k, where U is the non-negative factor with k columns, and let K(v) = diag(d_1(v), ..., d_k(v)). The matrix K(v) − U'U is positive semidefinite.

Proof. Define the matrix M(v) ∈ R^{k×k} by M(v)_{ij} = v_i(K(v) − U'U)_{ij}v_j for 1 ≤ i, j ≤ k. In Supplement 96 of Chapter 6, we proved that M(v) is positive semidefinite if and only if K(v) − U'U is positive semidefinite. Note that

   M(v)_{ii} = (K(v) − U'U)_{ii}v_i² = (U'Uv)_i v_i − (U'U)_{ii}v_i² = Σ_{j≠i} (U'U)_{ij}v_jv_i

for 1 ≤ i ≤ k. Also, if i ≠ j, we have M(v)_{ij} = −v_i(U'U)_{ij}v_j. We have

   z'M(v)z = Σ_i M(v)_{ii}z_i² + Σ_{i≠j} z_iM(v)_{ij}z_j
           = Σ_{i≠j} (U'U)_{ij}v_iv_jz_i² − Σ_{i≠j} (U'U)_{ij}v_iv_jz_iz_j
           = Σ_{i≠j} (U'U)_{ij}v_iv_j((1/2)z_i² + (1/2)z_j² − z_iz_j)
           = (1/2) Σ_{i,j} (U'U)_{ij}v_iv_j(z_i − z_j)² ≥ 0,

since the entries of U'U are non-negative, which shows the positive semidefiniteness of M(v).

Lemma 13.4. The function G : (R^k)² −→ R defined by

   G(s, v) = F(v) + (s − v)'∇F(v) + (1/2)(s − v)'K(v)(s − v)

is an auxiliary function for the quadratic function F : R^k −→ R given by

   F(s) = F(v) + (s − v)'∇F(v) + (1/2)(s − v)'U'U(s − v).

Proof. The inequality G(s, v) ≥ F(s) amounts to

   (s − v)'K(v)(s − v) ≥ (s − v)'U'U(s − v),

which follows immediately from Lemma 13.3. Thus, the first condition of Definition 13.6 is satisfied. It is immediate that G(v, v) = F(v).

Choose F = F_a. We have

   F(s) = F_a(s) = (1/2)‖a − Us‖²_2 = (1/2) Σ_i (a − Us)²_i = (1/2) Σ_i (a_i − Σ_j U_{ij}s_j)².

Therefore,

   ∂F/∂s_ℓ = − Σ_i (a_i − Σ_j U_{ij}s_j) U_{iℓ} = − Σ_i a_iU_{iℓ} + Σ_i Σ_j U_{ij}U_{iℓ}s_j


for 1 ≤ ℓ ≤ k. Thus, ∇F(s) = −U'a + U'Us. The corresponding auxiliary function is now

   G(s, v) = F(v) + (s − v)'(−U'a + U'Uv) + (1/2)(s − v)'K(v)(s − v).

To obtain the update rule, we need to find v such that

   G(v, v^t) = F(v^t) + (v − v^t)'(−U'a + U'Uv^t) + (1/2)(v − v^t)'K(v^t)(v − v^t)

is minimal. Regarding G(v, v^t) as a function of v, we obtain

   ∇_v G(v, v^t) = −U'a + U'Uv^t + K(v^t)(v − v^t),

by applying the differentiation rules ∇(a'x) = ∇(x'a) = a and ∇(x'Ax) = 2Ax (for a symmetric matrix A). Therefore, the necessary extremal condition ∇_v G(v, v^t) = 0 implies

   v = v^t − K(v^t)^{-1}(U'Uv^t − U'a),

which is the update rule for the columns of the matrix V if U is kept constant. Note that

   K(v^t)^{-1} = diag(1/d_1(v^t), ..., 1/d_k(v^t)) = diag(v^t_1/(U'Uv^t)_1, ..., v^t_k/(U'Uv^t)_k).

Thus, we obtain the update rule for the ith component of a column of the matrix V for a fixed matrix U:

   v^{t+1}_i = v^t_i − v^t_i((U'Uv^t)_i − (U'a)_i)/(U'Uv^t)_i = v^t_i (U'a)_i/(U'Uv^t)_i.


The components of the updated matrix V can be written as

   v^{t+1}_{ij} = v^t_{ij}(U'A)_{ij}/(U'UV^t)_{ij}.

Similarly, the updated components of the matrix U, when V is fixed, are given by

   u^{t+1}_{ij} = u^t_{ij}(AV')_{ij}/(U^tVV')_{ij}.

The Lee–Seung algorithm [105] consists of applying alternating updates of the matrices U and V (Algorithm 13.8.2). Using the Hadamard product ⊙ and quotient ⊘ of matrices introduced in Definition 3.45, the computations involving the sequences V^1, ..., V^t, ... and U^1, ..., U^t, ... can be written in the more succinct form

   V^{t+1} = (V^t ⊙ ((U^t)'A)) ⊘ ((U^t)'U^tV^t),    (13.13)
   U^{t+1} = (U^t ⊙ (A(V^t)')) ⊘ (U^tV^t(V^t)').    (13.14)

The algorithm is given next; the terms 10^{-10}J_{p,m} and 10^{-10}J_{n,p} are added to avoid division by 0. Equalities (13.13) and (13.14) show that if U^1 and V^1 are positive, the matrices U^t and V^t remain positive throughout the iterations.

Algorithm 13.8.2: Lee–Seung Factorization Algorithm
Data: A non-negative matrix A ∈ R^{n×m} and a maximum number of iterations r
Result: An approximate factorization A ≈ UV
1  initialize U^1 to a random matrix;
2  initialize V^1 to a random matrix;
3  for t = 1 to r do
4     V^{t+1} = (V^t ⊙ ((U^t)'A)) ⊘ ((U^t)'U^tV^t + 10^{-10}J_{p,m});
5     U^{t+1} = (U^t ⊙ (A(V^t)')) ⊘ (U^tV^t(V^t)' + 10^{-10}J_{n,p});
6  end
7  return U^r and V^r;
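The multiplicative updates (13.13) and (13.14) take only a few lines in MATLAB. The sketch below is added for illustration; the toy matrix, the inner dimension p, the iteration count r, and the use of a scalar 1e-10 in place of the matrices 10^{-10}J are assumptions of the example.

   A = abs(rand(20,10));                             % toy non-negative matrix
   p = 3; r = 200;                                   % inner dimension and iterations
   U = rand(size(A,1), p); V = rand(p, size(A,2));   % positive random start
   for t = 1:r
       V = V .* (U'*A) ./ (U'*U*V   + 1e-10);        % Equality (13.13)
       U = U .* (A*V') ./ (U*(V*V') + 1e-10);        % Equality (13.14)
   end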


MATLAB uses the function nnmf of the Statistics Toolbox to compute a factorization of a non-negative matrix. When using its simplest syntax, [U,V] = nnmf(A,k), this function computes an approximate decomposition of the non-negative matrix A ∈ R^{m×n} into the non-negative factors U ∈ R^{m×k} and V ∈ R^{k×n}, where the rows of V have unit length. The product UV is an approximation of A that minimizes the quantity d = ‖A − UV‖_F/√(mn), known as the root-mean-squared residual. To add d to the results, we can write [U,V,d] = nnmf(A,k). The function nnmf can be called with a number of options detailed in the MATLAB documentation. Such a call has the form

[U,V] = nnmf(A,k,param1,val1,param2,val2,...)

For example, for the option 'algorithm' the possible values are 'mult' for the multiplicative algorithm and 'als' for the alternating least-squares algorithm.

Example 13.8. The IRIS dataset from the UCI machine learning repository consists of 150 records that describe characteristics of three varieties of the iris flower: Iris Setosa, Iris Versicolor, and Iris Virginica. Each record has four components that correspond to the sepal length, sepal width, petal length, and petal width (in centimeters). To compute a non-negative matrix factorization of the matrix D ∈ R^{150×4} as D ≈ UV, where U ∈ R^{150×2} and V ∈ R^{2×4}, we write

>> [U,V] = nnmf(D,2)

This produces the matrix U (which we omit) and the matrix V given by

V =
    0.6942    0.2855    0.6223    0.2221
    0.8027    0.5675    0.1829    0.0142

corresponding to the four variables mentioned above. The factorization of D allows us to express each row d_i = (sl_i, sw_i, pl_i, pw_i) approximately as

   sl_i = 0.6942 u_{i1} + 0.8027 u_{i2},
   sw_i = 0.2855 u_{i1} + 0.5675 u_{i2},
   pl_i = 0.6223 u_{i1} + 0.1829 u_{i2},
   pw_i = 0.2221 u_{i1} + 0.0142 u_{i2}.


Fig. 13.8  Biplot of the IRIS dataset (axes: COLUMN 1 and COLUMN 2; variables: seplen, sepwidth, petallen, petalwidth).

The first column of U has a strong influence on the sepal length sl_i and the petal length pl_i; in contrast, the second column of U has little influence on the petal width pw_i. To visualize these influences, a biplot is constructed:

>> biplot(V','scores',U,'varlabels',{'seplen','sepwidth',...
   'petallen','petalwidth'});
>> axis([0 1.1 0 1.1])
>> xlabel('COLUMN 1')
>> ylabel('COLUMN 2')

The result is shown in Figure 13.8.

Exercises and Supplements

(1) Let X = (x_1 · · · x_n) ∈ R^{m×n} be a matrix. A Karhunen–Loève basis for X is a sequence of k pairwise orthogonal unit vectors (k < n) u_1, ..., u_k in R^m such that for the orthogonal


projections proj_S(x_1), ..., proj_S(x_n) on the subspace S generated by u_1, ..., u_k, the sum Σ_{i=1}^n ‖x_i − proj_S(x_i)‖² is minimal. Prove that the first k columns of the orthogonal matrix U involved in the SVD of the matrix X, X = UDV', form a Karhunen–Loève basis for X.

(2) Let X ∈ R^{m×n} be a centered data matrix and let X = UDV' be the thin SVD of X, where U ∈ R^{m×r}, D ∈ R^{r×r}, and V ∈ R^{n×r}, and each of U and V has orthonormal columns. If S = UD = (s_1 · · · s_r) = (d_1u_1 · · · d_ru_r) ∈ R^{m×r} is the matrix of scores and V ∈ R^{n×r} is the matrix of loadings, prove that
  (a) the variance of a score vector s_i is var(s_i) = d²_i/(m − 1);
  (b) X = s_1v_1' + · · · + s_rv_r';
  (c) if X_k = s_1v_1' + · · · + s_kv_k', where k ≤ r, then

      tvar(X_k)/tvar(X) = (Σ_{i=1}^k d²_i)/(Σ_{i=1}^r d²_i).

  In other words, (Σ_{i=1}^k d²_i)/(Σ_{i=1}^r d²_i) indicates the portion of the total variance of X explained by the first k scores.

(3) Let X ∈ R^{m×n} be a centered data sample matrix whose covariance matrix is cov(X) = aI_n + bJ_{n,n}. Prove that a ≥ 0 and b ≥ 0.

(4) The orthogonal factor model [82] seeks to express a centered data sample matrix X = (v_1 · · · v_n) ∈ R^{m×n} as X = FL + S, where L ∈ R^{q×n} is the matrix of factor loadings, F ∈ R^{m×q} is the matrix of factors, and S ∈ R^{m×n} is the matrix of specific factors. The model assumes that F is a centered matrix, cov(F) = I_q, S is a centered matrix, cov(S) is a diagonal matrix, and F'S = O_{q,n}. Prove that
  (a) cov(X) = (q − 1)cov(L) + cov(S) and F'X = L;
  (b) var(v_i) = (q − 1)var(l_i) + var(s_i) for 1 ≤ i ≤ n.


(5) Let X ∈ R^{m×n} be a centered data sample matrix and let p, q ∈ N be such that p + q = n and X = (U V), where U ∈ R^{m×p} and V ∈ R^{m×q}. Define the mixed covariance of U and V as cov(U, V) = (1/(m−1)) U'V ∈ R^{p×q}. Prove that

   cov(X) = | cov(U)      cov(U, V) |
            | cov(U, V)'  cov(V)    |.

(6) Using the same notations as in Exercise 5, let u = Ua and v = Vb be two vectors in R^m that aggregate the columns of U and V, respectively. We assume here that a ∈ R^p and b ∈ R^q. Prove that
  (a) var(u) = a'cov(U)a, var(v) = b'cov(V)b, and cov(u, v) = a'cov(U, V)b;
  (b) if the correlation of u and v is defined as

      corr(u, v) = cov(u, v)/(√var(u) √var(v)),

      then the largest value of corr(u, v) is the largest eigenvalue of cov(U)^{-1/2} cov(U, V) cov(V)^{-1} cov(V, U) cov(U)^{-1/2}.

(7) Let A ∈ R^{n×m} be a positive matrix. Define F_A : R^{n×p}_{>0} × R^{p×m}_{>0} −→ R by F_A(U, V) = D(A, UV), where D is the matrix divergence introduced in Supplement 108 of Chapter 3. If A = (a_1 · · · a_m) and V = (v_1 · · · v_m), prove that

   D(A, UV) = Σ_{j=1}^m D(a_j, Uv_j).

(8) Let F_a : R^k_{>0} −→ R be the function defined by

   F_a(s) = Σ_i (Σ_j u_{ij}s_j − a_i + a_i ln(a_i/Σ_j u_{ij}s_j))


for s ∈ R^k_{>0}, where a ∈ R^k_{>0} and U = (u_{ij}) is a non-negative matrix with k columns. Prove that the function G_a : (R^k_{>0})² −→ R given by

   G_a(s, v) = Σ_i (a_i ln a_i − a_i) + Σ_i Σ_j u_{ij}s_j
               − Σ_i Σ_j a_i (u_{ij}v_j/Σ_p u_{ip}v_p)(ln(u_{ij}s_j) − ln(u_{ij}v_j/Σ_p u_{ip}v_p))

is an auxiliary function for F_a.

Solution: We have

   G_a(s, s) = Σ_i (a_i ln a_i − a_i) + Σ_i Σ_j u_{ij}s_j
               − Σ_i Σ_j a_i (u_{ij}s_j/Σ_p u_{ip}s_p)(ln(u_{ij}s_j) − ln(u_{ij}s_j/Σ_p u_{ip}s_p))
             = Σ_i (a_i ln a_i − a_i) + Σ_i Σ_j u_{ij}s_j − Σ_i a_i (Σ_j u_{ij}s_j/Σ_p u_{ip}s_p) ln(Σ_p u_{ip}s_p)
             = Σ_i (a_i ln a_i − a_i + Σ_j u_{ij}s_j − a_i ln Σ_p u_{ip}s_p)
             = F_a(s).

The remaining condition of Definition 13.6, G_a(s, v) ≥ F_a(s), can be proven starting from the convexity of the function ℓ(x) = − ln x. By Jensen's Theorem, we have

   − ln(Σ_{j=1}^k u_{ij}s_j) = − ln(Σ_{j=1}^k t_j (u_{ij}s_j/t_j)) ≤ − Σ_{j=1}^k t_j ln(u_{ij}s_j/t_j)


for any set of non-negative numbers t_1, ..., t_k such that Σ_{j=1}^k t_j = 1. Choosing t_j = u_{ij}v_j/Σ_{p=1}^k u_{ip}v_p yields the inequality

   − ln(Σ_{j=1}^k u_{ij}s_j) ≤ − Σ_{j=1}^k (u_{ij}v_j/Σ_{p=1}^k u_{ip}v_p)(ln(u_{ij}s_j) − ln(u_{ij}v_j/Σ_{p=1}^k u_{ip}v_p)).

This inequality implies

   G_a(s, v) ≥ Σ_i (a_i ln a_i − a_i) + Σ_i Σ_j u_{ij}s_j − Σ_i a_i ln(Σ_{j=1}^k u_{ij}s_j) = F_a(s).

Thus, G_a is indeed an auxiliary function for F_a.

(9) Let X be a non-negative matrix and let X = UV be a non-negative matrix factorization of X. Prove that all rows of X are non-negative combinations of the rows of V, that is, they lie in the simplicial cone generated by the rows of V.

(10) Let X = (x_1, ..., x_p), Y = (y_1, ..., y_q) be two sequences of vectors in R^n. The matrix D ∈ R^{p×q} of squared distances between these sequences is D_{X,Y} = (d_{ij}), where d_{ij} = ‖x_i − y_j‖²_2 for 1 ≤ i ≤ p and 1 ≤ j ≤ q. Prove that

   D_{X,Y} = (x_1'x_1, ..., x_p'x_p)' 1_q' + 1_p (y_1'y_1, ..., y_q'y_q) − 2X'Y.


The analogue of multidimensional metric scaling for two sets of vectors has been introduced and studied as multidimensional unfolding (MDU) by Schönemann in [147]. For MDU, the issue is to reconstitute two finite vector sets given their mutual distances. Let X = (x_1 · · · x_p) ∈ R^{n×p} and let t_z be a translation of R^n. Denote by t_z(X) the matrix (t_z(x_1) · · · t_z(x_p)).

(11) Let X = (x_1 · · · x_p) ∈ R^{n×p}, x_0 ∈ R^n, and let c ∈ R^p be a vector such that x_0 = Xc. Prove that if X̃ = t_{−x_0}(X), then X̃ = XQ_c', where Q_c = I_p − 1_p c' is the projection matrix introduced in Supplement 10 of Chapter 3.

Solution: We have

   t_{−x_0}(X)' = X' − 1_p x_0' = X' − 1_p c'X' = (I_p − 1_p c')X' = Q_c X',

which is equivalent to t_{−x_0}(X) = XQ_c'.

(12) Let X = (x_1 · · · x_p) ∈ R^{n×p}, Y = (y_1 · · · y_q) ∈ R^{n×q}, and x_0, y_0 ∈ R^n. Let X̃ and Ỹ be the matrices obtained by a translation by −x_0 and −y_0 of the columns of X and Y, respectively. If c ∈ R^p and d ∈ R^q are such that x_0 = Xc and y_0 = Yd, prove that

   X̃'Ỹ = −(1/2) Q_c D_{X,Y} Q_d'.

Solution: Starting from the equality given in Exercise 10, we have

   Q_c D_{X,Y} Q_d' = Q_c (x_1'x_1, ..., x_p'x_p)' 1_q' Q_d' + Q_c 1_p (y_1'y_1, ..., y_q'y_q) Q_d' − 2 Q_c X'Y Q_d'
                    = −2 Q_c X'Y Q_d' = −2 X̃'Ỹ,

because Q_d 1_q = 0_q and Q_c 1_p = 0_p.


Let C_{X,Y} = X̃'Ỹ and let C_{X,Y} = GH be a full-rank decomposition of C_{X,Y} (see Theorem 3.35). Note that such a decomposition is not unique for, if T is a non-singular matrix, C_{X,Y} can be written as C_{X,Y} = (X̃'T^{-1})(TỸ), which is yet another full-rank factorization of C_{X,Y}.

(13) Let C = GH be a full-rank decomposition of the matrix C_{X,Y}. Prove that
  (a) starting from G = (g_1 · · · g_p) and H = (h_1 · · · h_q), if x_i = T'g_i + x_0 and y_j = T^{-1}h_j + y_0 for 1 ≤ i ≤ p and 1 ≤ j ≤ q, then the matrices X = (x_1 · · · x_p) and Y = (y_1 · · · y_q) give a solution of the metric unfolding problem;
  (b) with the previous choice for X and Y and with M = TT', we have

      d²_{ij} = g_i'Mg_i + h_j'M^{-1}h_j + (x_0 − y_0)'(x_0 − y_0) + 2g_i'T(x_0 − y_0) − 2h_j'(T^{-1})'(x_0 − y_0) − 2g_i'h_j

      for 1 ≤ i ≤ p and 1 ≤ j ≤ q.

Let S be the finite set of numbers S = {−1, 0, 1}. A semidiscrete decomposition (SDD) of a matrix A ∈ R^{m×n} is a sum of the form

   A = Σ_{p=1}^k d_p x_p y_p',

where x_p ∈ S^m, y_p ∈ S^n, and d_1, ..., d_k are real numbers. Note that every matrix A can be expressed as a sum of mn rank-1 matrices, A = Σ_{i=1}^m Σ_{j=1}^n a_{ij} e_i e_j'; the purpose of the SDD is to develop a sum of fewer terms that approximates the matrix A.

(14) Prove that for a fixed A, x, and y, the least value of ‖A − dxy'‖²_F is achieved for d_* = (x'Ay)/(‖x‖²_F ‖y‖²_F) and it equals ‖A‖²_F − (x'Ay)²/(‖x‖²_F ‖y‖²_F).


Solution: Let R be the residual R = A − dxy' obtained by approximating A by dxy', and let r_A(d, x, y) = ‖R‖²_F. We have

   r_A(d, x, y) = trace((A − dxy')'(A − dxy'))
                = trace(A'A) − d trace(yx'A) − d trace(A'xy') + d² trace(yx'xy')
                = ‖A‖²_F − 2d x'Ay + d² ‖x‖²_F ‖y‖²_F.

The optimal solution for d is therefore d_* = (x'Ay)/(‖x‖²_F ‖y‖²_F). The final part follows from the substitution of d_* in r_A(d, x, y).

(15) Suppose that A ∈ R^{m×n} is a matrix and let y ∈ S^n. Determine d and x ∈ S^m such that ‖A − dxy'‖²_F is minimal.

Solution: In principle, we need to examine 3^m possible values for x, but we will show that it is possible to limit the search to m values. Let z_i = x_i Σ_{j=1}^n a_{ij}y_j. We have

   r_A(d, x, y) = Σ_{i=1}^m Σ_{j=1}^n (a_{ij} − dx_iy_j)²
                = Σ_{i=1}^m Σ_{j=1}^n (a²_{ij} − 2da_{ij}x_iy_j + d²x_i²y_j²)
                = Σ_{i=1}^m Σ_{j=1}^n a²_{ij} − 2d Σ_{i=1}^m z_i + d² Σ_{i=1}^m x_i² Σ_{j=1}^n y_j².

The signs of the components of x do not affect the last term. Thus, to minimize r_A(d, x, y), the signs of the x_i must be chosen such that all z_i have the same sign: positive if d > 0, and negative otherwise. Let J be the number of non-zero components of x. Clearly, we have 1 ≤ J ≤ m. Then,

   r_A(d, x, y) = Σ_{i=1}^m Σ_{j=1}^n a²_{ij} − 2d Σ_{i=1}^m z_i + d²J Σ_{j=1}^n y_j².


For a particular choice of x, the d that minimizes r_A(d, x, y) is

   d_* = (Σ_{i=1}^m z_i)/(J Σ_{j=1}^n y_j²),    (13.15)

and the minimum value of r_A(d, x, y) is

   r_A(d_*, x, y) = Σ_{i=1}^m Σ_{j=1}^n a²_{ij} − d²_* J Σ_{j=1}^n y_j².

Thus, minimizing r_A(d, x, y) amounts to maximizing d²_* J. For a given J, the non-zero components x_i, with sign equal to sign(Σ_j a_{ij}y_j), correspond to the J largest sums |Σ_j a_{ij}y_j|. This shows that there are m choices of x to check: for each J, set the x_i's corresponding to the J largest sums |Σ_j a_{ij}y_j| to 1 or to −1 such that all z_i are positive, find d from Equality (13.15), and choose the x that maximizes d²_* J among the m choices for J.

The O'Leary–Peleg algorithm [95, 122] for SDD starts with a matrix A ∈ R^{m×n} and with a number k and consists of alternating steps that determine two sequences of vectors x_1, ..., x_k and y_1, ..., y_k that allow the construction of a sequence of approximations A_0, A_1, ..., A_k of A. The squares of the norms of the residuals of successive approximations are r_1, r_2, .... The desired accuracy of the approximation is r_min. Also, ℓ_max is the maximum allowable number of inner iterations, and α_min is the minimum relative improvement.

(16) Let A ∈ R^{m×n} and let A_0, A_1, ... and R_1, R_2, ... be two sequences of matrices defined by A_0 = O_{m,n}, R_1 = A, A_k = A_{k−1} + d_kx_ky_k', and R_{k+1} = R_k − d_kx_ky_k', where the sequences x_1, x_2, ... and y_1, y_2, ... are defined as in Algorithm 13.8.3. Prove that
  (a) ‖R_{k+1}‖_F < ‖R_k‖_F, which ensures the convergence of the O'Leary–Peleg algorithm;

  (b) ‖R_{k+1}‖²_F ≤ (1 − 1/(mn))^k ‖R_1‖²_F, which proves that lim_{k→∞} ‖R_k‖_F = 0 and that the rate of convergence is at least linear.


Algorithm 13.8.3: O'Leary–Peleg Factorization Algorithm
Data: A matrix A ∈ R^{m×n} and a number k_max of approximating terms
Result: An SDD approximate factorization of A
 1  initialize R_1 = A; initialize r_1 = ‖R_1‖²_F;
 2  for k = 1 to k_max do
 3     while r_k > r_min do
 4        choose y = e_j if the jth column of R_k contains the largest magnitude entry of R_k;
 5        for ℓ = 1 to ℓ_max do
 6           while α > α_min do
 7              z = (1/‖y‖²_F) R_k y;
 8              find x ∈ S^m that maximizes (1/‖x‖²_F)(x'z)²;
 9              z = (1/‖x‖²_F) R_k'x;
10              find y ∈ S^n that maximizes (1/‖y‖²_F)(y'z)²;
11              β = (x'R_ky)²/(‖x‖²_F ‖y‖²_F);
12              if ℓ > 1 then
13                 α = (β − β̄)/β̄
14              end
15              β̄ = β;
16           end
17        end
18        x_k = x; y_k = y;
19        d_k = x_k'R_ky_k/(‖x_k‖²_F ‖y_k‖²_F);
20        A_k = A_{k−1} + d_kx_ky_k';
21        R_{k+1} = R_k − d_kx_ky_k';
22        r_{k+1} = r_k − β;
23     end
24  end


Bibliographical Comments

Spectral methods for dimensionality reduction are surveyed in [145]. A main reference for multidimensional scaling is [32]. The application of SVDs to recommender systems was initiated in [19]. Schoenberg's paper on metric spaces [146] is one of the earliest contributions to metric scaling. Theorem 13.6 was obtained by Berge in [15] and Sibson in [150]. The connection between multidimensional metric scaling and PCA was studied by Gower in [66].

The non-negative factorization problem attracted a huge amount of interest after Lee and Seung's paper [105] was published in 1999. Research on this type of problem was initiated earlier by Paatero in [127], followed by [125, 126]. Applications in text mining and document clustering are discussed in several important publications [17, 26, 131, 149] on which this chapter is based. The semidiscrete matrix decomposition of matrices was introduced in [122] and further explored in [94, 95].


Chapter 14

Tensors and Exterior Algebras

14.1  Introduction

In many machine learning-oriented literature sources, tensors are regarded as multidimensional arrays. While the sets of values of tensors over finite-dimensional spaces are indeed such arrays, this point of view misses fundamental properties of tensors, especially their behavior relative to basis changes in the underlying linear spaces.

Recall that the elements of a linear space V are referred to as contravariant vectors, while the elements of the dual space V* are referred to as covariant vectors (see Section 2.8). The components of contravariant vectors will be denoted by letters with superscripts, while the components of covariant vectors will be designated by letters with subscripts. The opposite convention is applied to the vectors themselves: vectors in V are denoted by letters with subscripts, while vectors in V* are denoted by letters with superscripts.

14.2  The Summation Convention

The summation convention, a way of simplifying expressions with multiple indices that is widely used in presenting tensors, was introduced by Albert Einstein. This convention stipulates that when an index variable appears twice in a single term and is not otherwise defined, it implies summation of that term over all the values of the index.


Indices that are not summation indices are referred to as free indices.

Example 14.1. If i ranges over the set {1, 2, 3, 4}, the expression x_iy_i stands for x_1y_1 + x_2y_2 + x_3y_3 + x_4y_4. The summation index is a dummy index, in the sense that it can be replaced by any other index ranging over the same set without modifying the value of the expression. For instance, we have x_ky_k = x_iy_i if k ranges over the set {1, 2, 3, 4}.

Example 14.2. Using the summation convention, the definition of a determinant (Definition 5.1) can be written using the Levi-Civita symbols (introduced in Definition 1.11) as det(A) = ε_{i_1···i_n} a_{1i_1} · · · a_{ni_n}.

Example 14.3. The inner product of two vectors x, y ∈ R^n can be written as x'y = x_iy_i.

Example 14.4. Let y_i = y_i(x_1, ..., x_n) be n functions that have continuous partial derivatives relative to x_1, ..., x_n. The Jacobian matrix J is the n × n matrix

   J = (∂y_i/∂x_j),   1 ≤ i, j ≤ n.

The determinant det(J) is the Jacobian of the transformation of R^n defined by the functions y_1, ..., y_n. The transformation is locally bijective on an open subset U of R^n if and only if det(J) ≠ 0 at each point of U. When det(J) ≠ 0 and the functions y_1, ..., y_n have continuous partial derivatives of second order in U, then the transformation is called an admissible change of coordinates.

If z_k = z_k(y_1, ..., y_n) are n functions (for 1 ≤ k ≤ n) that have continuous partial derivatives relative to y_1, ..., y_n, we derive


the following link between partial derivatives, written using the summation convention:

   ∂z_k/∂x_i = (∂z_k/∂y_j)(∂y_j/∂x_i).

14.3 Tensor Products of Linear Spaces

Recall that multilinear functions were introduced in Definition 2.27.

Definition 14.1. Let V and W be finite-dimensional linear spaces. Their tensor product is a linear space V ⊗ W equipped with a bilinear map ⊗ : V × W −→ V ⊗ W such that for any linear space U and bilinear mapping f : V × W −→ U, there exists a unique linear mapping g : V ⊗ W −→ U with f = g ∘ ⊗, that is, the diagram formed by f, ⊗, and g is commutative.

The linear space V ⊗ W is the tensor product of the linear spaces V and W, its elements are referred to as tensors, and the elements in Im(⊗), the image of the bilinear function ⊗, are said to be the decomposable tensors. The existence of the mapping g is the universal property of the tensor product. It means that any bilinear function f : V × W −→ U can be obtained by applying a linear function g to the special bilinear function ⊗. We defined the tensor product of two linear spaces as a new linear space that satisfies a certain property. As we shall see, if such a linear space exists, then it is unique. However, it is incumbent on us to show the existence of the tensor product. We will do that later in this section (see Theorem 14.3).
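For finite-dimensional spaces the universal property can be made concrete: once V ⊗ W is identified with the space of m × n component arrays (an identification established later in this section), a bilinear f is determined by its values on pairs of basis vectors, and the corresponding linear g acts on the array v w^⊤. The following MATLAB sketch, in which the bilinear form and the dimensions are chosen only for illustration, checks this factorization numerically:

m = 3; n = 2;
M = [1 2; 0 -1; 4 3];                 % defines the bilinear map f(v,w) = v' * M * w
f = @(v, w) v' * M * w;
G = zeros(m, n);                      % values of f on pairs of basis vectors
for i = 1:m
    ei = zeros(m, 1); ei(i) = 1;
    for j = 1:n
        ej = zeros(n, 1); ej(j) = 1;
        G(i, j) = f(ei, ej);
    end
end
g = @(T) sum(sum(G .* T));            % linear map on component arrays
v = randn(m, 1); w = randn(n, 1);
difference = f(v, w) - g(v * w');     % zero up to rounding: f factors through v*w'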


The bilinearity of the ⊗ mapping implies the following axioms for the tensor product:

• Distributivity of the tensor product: for any vectors v, v_1, v_2 in V, any vectors w, w_1, w_2 in W, and any scalars a, b, we have
  v ⊗ (a w_1 + b w_2) = a(v ⊗ w_1) + b(v ⊗ w_2),   (14.1)
  (a v_1 + b v_2) ⊗ w = a(v_1 ⊗ w) + b(v_2 ⊗ w);   (14.2)
• Existence of a basis in the tensor product: if {v_1, . . . , v_m} is a basis in V and {w_1, . . . , w_n} is a basis in W, then {u_{ij} = v_i ⊗ w_j | 1 ≤ i ≤ m, 1 ≤ j ≤ n} is a basis in U = V ⊗ W.

If v = 0_V or w = 0_W, then v ⊗ w = 0_{V⊗W}. By taking a = 1, b = −1, and w_1 = w_2 in Equality (14.1), we obtain v ⊗ 0_W = 0_{V⊗W} for every v ∈ V. Similarly, 0_V ⊗ w = 0_{V⊗W}. Thus, if V, W are finite-dimensional linear spaces having the bases {v_1, . . . , v_m} and {w_1, . . . , w_n}, respectively, and {u_{ij} = v_i ⊗ w_j | 1 ≤ i ≤ m, 1 ≤ j ≤ n} is a basis in U = V ⊗ W, then, for x = x^i v_i ∈ V and y = y^j w_j ∈ W, the unique tensor product that satisfies the axioms has the components x^i y^j in the basis {u_{ij} | 1 ≤ i ≤ m, 1 ≤ j ≤ n}.

The tensor space V ⊗ W is the set of all sums of the form ∑_{i,j} a_{ij} v_i ⊗ w_j which have a finite number of non-zero coefficients. We will refer to the set of coefficients a_{ij} as a tensor.

Example 14.5. Suppose that V and W are both the one-dimensional linear space R. The direct product R × R consists of all pairs (u, v) ∈ R × R. Addition of (u_1, v_1) and (u_2, v_2) in R × R is defined by (u_1, v_1) + (u_2, v_2) = (u_1 + u_2, v_1 + v_2), and scalar multiplication is defined as a(u, v) = (au, av) for a ∈ R and (u, v) ∈ R × R. The tensor product R ⊗ R also consists of pairs (u, v), where u, v ∈ R, denoted as u ⊗ v. Scalar multiplication in the tensor space is defined as a(u ⊗ v) = (au) ⊗ v = u ⊗ (av) for a, u, v ∈ R. Addition of two pairs u_1 ⊗ v_1 and u_2 ⊗ v_2 in R ⊗ R is denoted as u_1 ⊗ v_1 + u_2 ⊗ v_2, and these pairs interact only when one of their components is the same, that is,
  u_1 ⊗ v + u_2 ⊗ v = (u_1 + u_2) ⊗ v,
  u ⊗ v_1 + u ⊗ v_2 = u ⊗ (v_1 + v_2).


In other words, addition in R ⊗ R is symbolic unless one of the components is the same in both terms. For instance, we can write (u ⊗ 5v) + (2u ⊗ v) = 5(u ⊗ v) + 2(u ⊗ v) = 7(u ⊗ v).

Next, we prove that the tensor product of linear spaces introduced above exists and is unique up to an isomorphism; this will be done in several steps.

Theorem 14.1. Let V and W be two finite-dimensional linear spaces. If V ⊗ W exists, then for every linear space U the linear spaces Hom(V ⊗ W, U) and M(V × W, U) are isomorphic.

Proof. Let h : V ⊗ W −→ U be a linear mapping in Hom(V ⊗ W, U). Then h ∘ ⊗ : V × W −→ U is bilinear. Conversely, if f ∈ M(V × W, U), there is a unique linear map g : V ⊗ W −→ U such that g(v ⊗ w) = f(v, w) for all (v, w) ∈ V × W. □

Theorem 14.2. If the tensor products V ⊗_1 W and V ⊗_2 W of two linear spaces V and W exist, then they are isomorphic.

Proof. Suppose that V ⊗_1 W and V ⊗_2 W are both tensor products of the linear spaces V and W. By the universal property of the tensor product of linear spaces, there exist unique linear maps g_1 : V ⊗_2 W −→ V ⊗_1 W and g_2 : V ⊗_1 W −→ V ⊗_2 W such that ⊗_1 = g_1 ∘ ⊗_2 and ⊗_2 = g_2 ∘ ⊗_1, that is, the corresponding diagrams are commutative. Define the linear mappings h_1 = g_1 g_2 : V ⊗_1 W −→ V ⊗_1 W and h_2 = g_2 g_1 : V ⊗_2 W −→ V ⊗_2 W, which make the corresponding diagrams

commute. The identity maps i1 : V ⊗1 W −→ V ⊗1 W and i2 : V ⊗2 W −→ V ⊗2 W are linear and also make the diagrams commute. By the uniqueness hypothesis, we have h1 = i1 and h2 = i2 , so g1 and g2 are inverses to each other and this proves that V ⊗1 W and  V ⊗2 W are isomorphic. Theorem 14.3. Let V, W, and U be real finite-dimensional linear spaces. There exists a finite-dimensional real linear space T and a bilinear mapping t : V × W −→ T denoted by t(u, v) = u ⊗ v satisfying the following properties: (i) for every bilinear mapping f : V × W −→ U, there exists a unique linear mapping g : T −→ U such that f (v, w) = g(t(v, w)); (ii) if {v 1 , . . . , v m } is a basis in V and {w 1 , . . . , wn } is a basis in W, then {(v i ⊗ wj ) | 1  i  m, 1  j  n} is a basis in T, hence dim(T ) = dim(V ) dim(W ). Proof. In this proof, we will use the summation convention. For each pair (i, j) ∈ {1, . . . , m} × {1, . . . , n}, let tij be a symbol. Define T to be the real linear space that consists of all formal linear combinations with real coefficients aij tij of the symbols tij . Define the bilinear mapping t : V × W −→ T by t(v i , w j ) = tij and denote t(v i , w j ) as v i ⊗ wj . This mapping is extended to V × W as a bilinear mapping. Thus, if v = ai v i and w = bj wj , we define v ⊗ w as being the element of T given by v ⊗ w = t(v, w) = ai bj tij . Suppose now that f : V × W −→ U is an arbitrary bilinear map. Since every element of T is a linear combination of tij , we can define a unique linear transformation g : T −→ U as g(tij ) = f (vi , wj ).


The bilinearity of f and the linearity of g imply

f(v, w) = f(a^i v_i, b^j w_j) = a^i b^j f(v_i, w_j) = a^i b^j g(t_{ij}) = g(a^i b^j t_{ij}) = g(v ⊗ w) = g(t(v, w)).

This proves the existence and uniqueness of g such that f = gt, as claimed in the first part of the statement.

We show now that for arbitrary bases {v̂_1, . . . , v̂_m} in V and {ŵ_1, . . . , ŵ_n} in W, the set {v̂_i ⊗ ŵ_j | 1 ≤ i ≤ m, 1 ≤ j ≤ n} is a basis in T. For v = â^i v̂_i and w = b̂^j ŵ_j, by the bilinearity of ⊗, we have v ⊗ w = â^i b̂^j (v̂_i ⊗ ŵ_j), which shows that the mn elements v̂_i ⊗ ŵ_j span T. If these elements were linearly dependent, we would have dim(T) < mn, which is a contradiction. Thus, {v̂_i ⊗ ŵ_j | 1 ≤ i ≤ m, 1 ≤ j ≤ n} is a basis for T. □

The space T defined above is the tensor product of V and W and is denoted by V ⊗ W.

Corollary 14.1. Let V, W be two real linear spaces. If v ≠ 0_V and w ≠ 0_W, then v ⊗ w ≠ 0_{V⊗W}.

Proof. Let B_V and B_W be bases of V and W that contain v and w, respectively. Then v ⊗ w is a member of a basis of V ⊗ W, hence v ⊗ w ≠ 0_{V⊗W}. □

Let V, W be two real linear spaces with dim(V) = m and dim(W) = n having the bases {v_i | 1 ≤ i ≤ m} and {w_j | 1 ≤ j ≤ n}, respectively. The equalities

∑_{i=1}^{m} ∑_{j=1}^{n} a_{ij}(v_i ⊗ w_j) = ∑_{i=1}^{m} (v_i ⊗ (∑_{j=1}^{n} a_{ij} w_j)) = ∑_{j=1}^{n} ((∑_{i=1}^{m} a_{ij} v_i) ⊗ w_j),


imply that an element of V ⊗ W can be expressed in many ways as sums of tensor products of vectors in V and W.

Theorem 14.4. Let V, W be two real linear spaces. The linear spaces V ⊗ W and W ⊗ V are isomorphic.

Proof. Define the bilinear mapping h : V × W −→ W ⊗ V as h(v, w) = w ⊗ v for v ∈ V and w ∈ W. By the universal property, there exists a unique linear mapping g : V ⊗ W −→ W ⊗ V such that h = g ∘ ⊗, that is, g(v ⊗ w) = w ⊗ v. Since the set {w ⊗ v | v ∈ V, w ∈ W} spans the space W ⊗ V, the function g is surjective and, therefore, it is an isomorphism. □

Theorem 14.5. Let V, W be two real linear spaces. The linear spaces V^∗ ⊗ W and Hom(V, W) are isomorphic.

Proof. Let h : V^∗ × W −→ Hom(V, W) be defined as h(f, w)(v) = f(v)w for f ∈ V^∗, v ∈ V, and w ∈ W. Since h is bilinear, it induces, by the universal property, a linear mapping g : V^∗ ⊗ W −→ Hom(V, W) such that g(f ⊗ w) = h(f, w) for f ∈ V^∗ and w ∈ W; in other words, h = g ∘ ⊗.


By Supplement 23 of Chapter 2, mappings of the form h(f, w) span Hom(V, W), which implies that g is an isomorphism. □

Theorem 14.6. Let V, W, X, Y be R-linear spaces and let f ∈ Hom(V, W) and h ∈ Hom(X, Y). There is a unique linear map g = f ⊗ h : V ⊗ X −→ W ⊗ Y such that (f ⊗ h)(v ⊗ x) = f(v) ⊗ h(x) for v ∈ V and x ∈ X.

Proof. Let ℓ : V × X −→ W ⊗ Y be the bilinear mapping defined by ℓ(v, x) = f(v) ⊗ h(x) for v ∈ V and x ∈ X. By the universal property of the tensor product, there is a unique linear mapping g : V ⊗ X −→ W ⊗ Y such that g(v ⊗ x) = ℓ(v, x) = f(v) ⊗ h(x) for v ∈ V and x ∈ X. Thus, g is the linear mapping f ⊗ h. □

Example 14.6. Let V, W, X, Y be four R-linear spaces such that dim(V) = m, dim(W) = n, dim(X) = p, dim(Y) = q. If f ∈ Hom(V, W) and h ∈ Hom(X, Y), let A_f ∈ R^{n×m} and A_h ∈ R^{q×p} be the matrices associated with these linear transformations and the bases {v_1, . . . , v_m}, {w_1, . . . , w_n}, {x_1, . . . , x_p}, and {y_1, . . . , y_q} in V, W, X, and Y, respectively. The matrices A_f and A_h are given by (A_f)_{ji} = (f(v_i))_j for 1 ≤ i ≤ m and 1 ≤ j ≤ n, and (A_h)_{sr} = (h(x_r))_s for 1 ≤ r ≤ p and 1 ≤ s ≤ q. By Theorem 14.3, a basis in the linear space V ⊗ X consists of the mp tensors of the form v_i ⊗ x_j, while a basis in W ⊗ Y consists of the nq tensors of the form w_r ⊗ y_s. The matrix A_{f⊗h} ∈ R^{nq×mp} associated with the linear transformation g = f ⊗ h : V ⊗ X −→ W ⊗ Y is given by (A_{f⊗h})_{(rs),(ij)} = ((f ⊗ h)(v_i ⊗ x_j))_{rs}, which is the Kronecker product A_f ⊗ A_h of the matrices A_f and A_h.

The next theorem establishes the associativity of the tensor product of linear spaces.

Theorem 14.7. Let V, W, and U be linear spaces. The tensor spaces V ⊗ (W ⊗ U) and (V ⊗ W) ⊗ U are isomorphic.


Proof. For a fixed u ∈ U, the mapping f_u : V × W −→ V ⊗ (W ⊗ U) given by f_u(v, w) = v ⊗ (w ⊗ u) is bilinear because

f_u(v, w_1 + w_2) = v ⊗ (w_1 ⊗ u + w_2 ⊗ u) = v ⊗ (w_1 ⊗ u) + v ⊗ (w_2 ⊗ u) = f_u(v, w_1) + f_u(v, w_2),

and similarly in the first argument. By the universal property of tensor products, there is a unique linear mapping f'_u : V ⊗ W −→ V ⊗ (W ⊗ U) such that f_u = f'_u ∘ h, where h : V × W −→ V ⊗ W is the bilinear mapping h(v, w) = v ⊗ w for v ∈ V and w ∈ W.

For (v, w) ∈ V × W, we have f_u(v, w) = f'_u(h(v, w)) = f'_u(v ⊗ w), which implies f'_u(v ⊗ w) = v ⊗ (w ⊗ u) due to the definition of f_u. Since u is arbitrary, there are morphisms f'_{u+t} and f'_{au} such that for any (v, w) ∈ V × W, we have

f'_{u+t}(v ⊗ w) = v ⊗ (w ⊗ (u + t)) = v ⊗ (w ⊗ u + w ⊗ t) = v ⊗ (w ⊗ u) + v ⊗ (w ⊗ t),

which amounts to f'_{u+t}(v ⊗ w) = f'_u(v ⊗ w) + f'_t(v ⊗ w), and

f'_{au}(v ⊗ w) = v ⊗ (w ⊗ au) = v ⊗ (a(w ⊗ u)) = a(v ⊗ (w ⊗ u)),

that is, f'_{au}(v ⊗ w) = a f'_u(v ⊗ w).

Therefore, the function φ : (V ⊗ W) × U −→ V ⊗ (W ⊗ U) given by φ(s, u) = f'_u(s) is bilinear. By the universal property of the tensor product (V ⊗ W) ⊗ U, φ induces a unique morphism φ' : (V ⊗ W) ⊗ U −→ V ⊗ (W ⊗ U) such that φ'((v ⊗ w) ⊗ u) = v ⊗ (w ⊗ u).

Similarly, there is a unique morphism ψ' : V ⊗ (W ⊗ U) −→ (V ⊗ W) ⊗ U such that ψ'(v ⊗ (w ⊗ u)) = (v ⊗ w) ⊗ u. It is immediate that ψ'φ' and 1_{(V⊗W)⊗U} coincide on the generators of (V ⊗ W) ⊗ U, that is, ψ'φ' = 1_{(V⊗W)⊗U}. Similarly, we have φ'ψ' = 1_{V⊗(W⊗U)}. Thus, φ' and ψ' are inverses of each other, which means that they are isomorphisms. □

The associativity of the tensor product of linear spaces allows us to write V ⊗ W ⊗ U instead of V ⊗ (W ⊗ U) or (V ⊗ W) ⊗ U for


any linear spaces V, W, U. Also, taking into account Theorem 14.4, we can freely change the order in which we list the linear spaces in any tensor product.

Example 14.7. Let {e_1, e_2, e_3} be the standard basis for R^3 and let {e_1, e_2} be the standard basis for R^2. A basis for R^3 ⊗ R^2 consists of the tensors e_1 ⊗ e_1, e_1 ⊗ e_2, e_2 ⊗ e_1, e_2 ⊗ e_2, e_3 ⊗ e_1, e_3 ⊗ e_2. Let U be an R-linear space and let f : R^3 × R^2 −→ U be a bilinear mapping. Define the mapping g : R^3 ⊗ R^2 −→ U by g(e_i ⊗ e_j) = f(e_i, e_j) for i ∈ {1, 2, 3} and j ∈ {1, 2}. Since f is a bilinear mapping, g is well-defined and g(u ⊗ v) = f(u, v).

Theorem 14.3 can be extended to the case of m linear spaces as follows. If V_1, . . . , V_m are m real and finite-dimensional spaces, there exists a linear space V_1 ⊗ · · · ⊗ V_m and a mapping ⊗ : V_1 × · · · × V_m −→ V_1 ⊗ · · · ⊗ V_m such that for any linear space U and multilinear mapping f : V_1 × · · · × V_m −→ U, there exists a unique linear mapping g : V_1 ⊗ · · · ⊗ V_m −→ U with f = g ∘ ⊗, that is, the corresponding diagram is commutative.

Recall the definition of tensors that we introduced in Chapter 2:

Definition 14.2. A tensor t over the linear space V is a multilinear function t : V^{⊗p_1} ⊗ V^{∗⊗q_1} ⊗ · · · ⊗ V^{⊗p_k} ⊗ V^{∗⊗q_ℓ} −→ R. In this case, we say that the pair (p, q) = (p_1 + · · · + p_k, q_1 + · · · + q_ℓ) is the type of t, while p + q is the valence of t.


Furthermore, we say that t is contravariant of order p and covariant of order q. The valence of a tensor is the total number of arguments of the tensor regarded as a multilinear function.

Example 14.8. The tensors t : V ⊗ V^∗ ⊗ V^∗ −→ R have valence 3 and type (1, 2). In other words, such tensors are 1-contravariant and 2-covariant. If dim(V) = m, a tensor t in this space can be written as t = t^i_{jk} e_i ⊗ f^j ⊗ f^k, where each of the indices i, j, k varies between 1 and m. This tensor is once contravariant and twice covariant. It is desirable to avoid writing the tensor indices directly underneath each other, as in t^i_{jk}, because when these indices are lowered or raised (as we will see is sometimes necessary), it is difficult to put back these indices in the places they left. Note that

t = t^{i_1 ··· i_{p_1} ··· h_1 ··· h_{p_r}}_{j_1 ··· j_{q_1} ··· ℓ_1 ··· ℓ_{q_k}} e_{i_1} ⊗ · · · ⊗ e_{i_{p_1}} ⊗ f^{j_1} ⊗ · · · ⊗ f^{j_{q_1}} ⊗ · · · ⊗ f^{ℓ_1} ⊗ · · · ⊗ f^{ℓ_{q_k}} ⊗ e_{h_1} ⊗ · · · ⊗ e_{h_{p_r}}

is a tensor in the product space V^{⊗p_1} ⊗ V^{∗⊗q_1} ⊗ · · · ⊗ V^{∗⊗q_k} ⊗ V^{⊗p_r}. Taking into account the commutativity and associativity of tensor products of linear spaces, we can reformulate the definition of tensors.

Definition 14.3. Let V be a real linear space and let p, q ∈ N. A tensor of order (p, q) on V is a multilinear mapping t : V × · · · × V × V^∗ × · · · × V^∗ −→ R, with p factors equal to V and q factors equal to V^∗.

Also, we refer to a tensor t defined as above as a p-contravariant and q-covariant tensor. If p = q = 0, t is a member of R. If q = 0, t is a p-contravariant tensor, and if p = 0, t is a q-covariant tensor. If p ≠ 0 and q ≠ 0, then t is a mixed tensor. The set of tensors of order (p, q) on V is denoted as T^p_q(V). Similarly, the set of p-contravariant tensors will be denoted by T^p(V) and the set of q-covariant tensors will be denoted by T_q(V).
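When dim(V) = n is finite, a tensor of order (p, q) is determined by its n^{p+q} values on tuples of basis vectors, so it can be stored as a (p + q)-dimensional array; evaluating the tensor is then a full contraction of this array with the component vectors of its arguments. A small MATLAB sketch for a tensor of order (1, 2), with an illustrative component array and arguments:

n = 3;
Tcomp = randn(n, n, n);        % values of t on triples of basis elements of V x V* x V*
x = randn(n, 1);               % argument from V (components)
phi = randn(n, 1);             % argument from V* (components)
psi = randn(n, 1);             % argument from V* (components)
val = 0;                       % multilinear evaluation t(x, phi, psi)
for i = 1:n
    for j = 1:n
        for k = 1:n
            val = val + Tcomp(i, j, k) * x(i) * phi(j) * psi(k);
        end
    end
end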


If dim(V) = n, then dim(T^p_q) = n^{p+q} because the dimension of a tensor product is the product of the dimensions of the factors. If B = {e_1, . . . , e_n} is a basis of V and {f^1, . . . , f^n} is its dual basis in V^∗, then the set {e_{i_1} ⊗ · · · ⊗ e_{i_p} ⊗ f^{j_1} ⊗ · · · ⊗ f^{j_q} | 1 ≤ i_ℓ ≤ n, 1 ≤ j_m ≤ n} is a basis of T^p_q called the standard basis corresponding to B. If the element e_{i_1} ⊗ · · · ⊗ e_{i_p} ⊗ f^{j_1} ⊗ · · · ⊗ f^{j_q} of the standard basis is denoted by t^{i_1 ··· i_p}_{j_1 ··· j_q}, then a tensor z ∈ T^p_q can be expressed uniquely as

z = z^{i_1 ··· i_p}_{j_1 ··· j_q} t^{i_1 ··· i_p}_{j_1 ··· j_q}.

The n^{p+q} numbers z^{i_1 ··· i_p}_{j_1 ··· j_q} are the components of z relative to the standard basis. The upper indices of the components of z correspond to V and the lower indices correspond to V^∗.

Example 14.9. Let V be a real linear space and let f^1, . . . , f^n ∈ V^∗. Define g_{f^1,...,f^n} : V^n −→ R as the multilinear function g_{f^1,...,f^n}(x_1, . . . , x_n) = f^1(x_1) · · · f^n(x_n) for x_i ∈ V and 1 ≤ i ≤ n. The multilinear function g_{f^1,...,f^n} : V^{×n} −→ R is denoted by f^1 ⊗ · · · ⊗ f^n, and is an n-contravariant tensor.

Example 14.10. If x_i ∈ V for 1 ≤ i ≤ n, then the multilinear function h_{x_1,...,x_n} : (V^∗)^n −→ R, defined by h_{x_1,...,x_n}(f^1, . . . , f^n) = f^1(x_1) · · · f^n(x_n) for f^i ∈ V^∗, 1 ≤ i ≤ n, and belonging to M(V^∗, . . . , V^∗; R), is denoted by x_1 ⊗ · · · ⊗ x_n, and is an n-covariant tensor.

Let B = {e_1, . . . , e_n} be a basis in the linear space V and let B̃ = {f^1, . . . , f^n} be its dual basis in V^∗. Define the new bases B' = {e_{1'}, . . . , e_{n'}} and B̃' = {f^{1'}, . . . , f^{n'}} in V and V^∗, respectively, using the equalities

e_{i'} = a^i_{i'} e_i,   e_i = a^{i'}_i e_{i'},
f^{j'} = a^{j'}_j f^j,   f^j = a^j_{j'} f^{j'}.

Of course, summation indices can be changed consistently without affecting the correctness of these formulas. Recall that if V is a real linear space and V^∗ is its dual, the vectors of V are said to be contravariant, while those of V^∗ are said to be covariant, for reasons that were discussed in Section 3.7. If {e_1, . . . , e_n} is a basis in the linear space V, a contravariant vector x ∈ V can be written (with the summation convention) as x = x^i e_i. If {f^1, . . . , f^n} is a basis in V^∗, a covariant vector y ∈ V^∗ can be written as y = y_j f^j. Let t be the tensor

t = t^{i_1 ··· i_h}_{j_1 ··· j_k} e_{i_1} ⊗ · · · ⊗ e_{i_h} ⊗ f^{j_1} ⊗ · · · ⊗ f^{j_k}.

Applying the change of bases, we have

t = t^{i_1 ··· i_h}_{j_1 ··· j_k} a^{i_1'}_{i_1} · · · a^{i_h'}_{i_h} a^{j_1}_{j_1'} · · · a^{j_k}_{j_k'} e_{i_1'} ⊗ · · · ⊗ e_{i_h'} ⊗ f^{j_1'} ⊗ · · · ⊗ f^{j_k'}.

Therefore, the components t̃^{i_1' ··· i_h'}_{j_1' ··· j_k'} of t relative to the bases {e_{1'}, . . . , e_{n'}} and {f^{1'}, . . . , f^{n'}} are given by

t̃^{i_1' ··· i_h'}_{j_1' ··· j_k'} = t^{i_1 ··· i_h}_{j_1 ··· j_k} a^{i_1'}_{i_1} · · · a^{i_h'}_{i_h} a^{j_1}_{j_1'} · · · a^{j_k}_{j_k'}.   (14.3)

Similarly, we obtain the formula

t^{i_1 ··· i_h}_{j_1 ··· j_k} = a^{i_1}_{i_1'} · · · a^{i_h}_{i_h'} a^{j_1'}_{j_1} · · · a^{j_k'}_{j_k} t̃^{i_1' ··· i_h'}_{j_1' ··· j_k'}.   (14.4)
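The change-of-basis formulas (14.3) and (14.4) can be checked numerically. For a tensor with two covariant indices, formula (14.3) reads, in matrix form, T̃ = A^⊤ T A, where the columns of A contain the coordinates of the new basis vectors in the old basis. A small MATLAB sketch, with data chosen only for illustration:

n = 3;
T = randn(n);                   % components t_{ij} of a 2-covariant tensor in the old basis
A = randn(n);                   % columns: new basis vectors expressed in the old basis
while abs(det(A)) < 1e-6, A = randn(n); end
Tnew = A' * T * A;              % formula (14.3) for two covariant indices
x_old = randn(n, 1); y_old = randn(n, 1);
x_new = A \ x_old; y_new = A \ y_old;               % contravariant components use A^{-1}
err = x_old' * T * y_old - x_new' * Tnew * y_new;   % zero up to rounding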

The operation of tensor products can be extended from vectors and covectors to tensors.


Definition 14.4. Let u ∈ T^p_q(V) and w ∈ T^r_s(V) be two tensors. Their product is the tensor u ⊗ w ∈ T^{p+r}_{q+s}(V) defined as

(u ⊗ w)(v^1, . . . , v^{p+r}, v_1, . . . , v_{q+s}) = u(v^1, . . . , v^p, v_1, . . . , v_q) w(v^{p+1}, . . . , v^{p+r}, v_{q+1}, . . . , v_{q+s})

for v^1, . . . , v^{p+r} ∈ V^∗ and v_1, . . . , v_{q+s} ∈ V. Thus, the tensor product is an operation ⊗ : T^p_q(V) × T^r_s(V) −→ T^{p+r}_{q+s}(V). The tensor product can be extended to any finite number of tensor spaces,

⊗ : T^{p_1}_{q_1}(V) × · · · × T^{p_k}_{q_k}(V) −→ T^{p_1+···+p_k}_{q_1+···+q_k}(V),

such that ⊗(t_1, . . . , t_k) = t_1 ⊗ · · · ⊗ t_k, where t_ℓ ∈ T^{p_ℓ}_{q_ℓ}(V) for 1 ≤ ℓ ≤ k. The extended operation is multilinear, associative, and distributive.
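In components, the product of two tensors is the outer product of their component arrays. A small MATLAB sketch for two tensors given by their component arrays (the arrays below are illustrative):

U = randn(3, 3);              % components of a tensor u with two indices
W = randn(3, 3, 3);           % components of a tensor w with three indices
P = reshape(U(:) * reshape(W, 1, []), [size(U), size(W)]);
% P(i,j,k,l,m) = U(i,j) * W(k,l,m), the components of u ⊗ w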

14.4 Tensors on Inner Product Spaces

Let V be an n-dimensional real inner product space and let B = {e1 , . . . , en } be a basis in V. In Section 6.15, we examined the link between the contravariant components xi of a vector x ∈ V relative to a basis B and the covariant components xi of the same vector x using the fundamental matrix G of the basis B. Let t be a tensor t = x1 ⊗ · · · ⊗ xq . If xirr is a contravariant compoi nent of xr , then we have ti1 ···iq = xi11 · · · xqq . Taking into account the links between contravariant and covariant components of vectors, we can write t

i1

i2 · · · iq

i

= xi11 xi22 · · · xqq i

= gi1 j xj1 xi22 · · · xqq (because xi11 = gi1 j xj1 ) = gi1 j t

ji2 · · · iq

.

Tensors and Exterior Algebras

843

Similarly, one obtains t

i1 · · · iq

= g i1 j t

j

i2 · · · iq

.

Repeated applications of the previous transformations allow us to write the following: t t

i1 i2 · · · iq i1 i2 · · · iq

= gi1 j1 gi2 j2 · · · giq jq t = g i1 j 1 g i2 j 2 · · · g iq j q t

j1 j2 · · · jq

j1 j2 · · · jq

, .

We can formulate now a tensoriality criterion for tensors in Euclidean spaces. Theorem 14.8. A collection of numbers ti1 i2 ···ip depending on the choice of basis forms a tensor if and only if ti1 i2 ···ip = gi1 i1 gi2 i2 · · · gip ip ti1 i2 ···ip , where B = {e1 , . . . , en }, gij = (ei , ej ) for 1  i, j  n. Proof. Suppose that ti1 i2 ···ip are the components of a tensor t. Then, the numbers ti1 i2 ···ip are the coefficients of some multilinear form f , that is, ti1 i2 ···ip = f (ei1 , ei2 , . . . , eip ). The coefficients ti1 i2 ···ip in the new basis e1 , . . . , em are ti1 i2 ···ip = f (ei1 , ei2 , . . . , eip ). Since ei1 = gi1 i1 ei1 , . . . , eim = gim im eim , we can write ti1 i2 ···ip = f (gi1 i1 ei1 , gi2 i2 ei2 , . . . , gip ,ip ep ) = gi1 i1 gi2 i2 · · · gip ip ti1 i2 ···ip , because f is a multilinear form.

Linear Algebra Tools for Data Mining (Second Edition)

844

Conversely, suppose that ti1 i2 ···ip transform as indicated when switching to a new basis. Let x1 , . . . , xp be p vectors such that xi = xij ej for 1  i  p. We must show that ti1 i2 ···ip xi1 j1 · · · xip jp is a multilinear form in x1 , . . . , xp , which means that it depends only on the vectors x1 , . . . , xp and not on the choice of the basis. In a basis B  = {e1 , . . . , en }, the previous expression is ti1 i2 ···ip xi1 j1 · · · xip jp . The hypothesis of the theorem allows us to write the following: ti1 i2 ···ip xi1 j1 · · · xip jp = gi1 i1 gi2 i2 · · · gip ip ti1 i2 ···ip xi1 j1 · · · xip jp = gi1 i1 gi2 i2 · · · gip ip ti1 i2 ···ip xi1 j1 · · · xip jp = gi1 i1 gi2 i2 · · · gip ip ti1 i2 ···ip gi1 k1 xk1 j1 · · · gip kp xkp jp = gi1 i1 gi1 k1 gi2 i2 gi2 k2 · · · gip ip gip kp ti1 i2 ···ip xk1 j1 · · · xkp jp = δi1 k1 δi2 k2 · · · δip kp ti1 i2 ···ip xk1 j1 · · · xkp jp = ti1 i2 ···ip xi1 j1 · · · xip jp

because xi1 j1 = gi1 k1 xk1 j1 , . . . , xip jp = gip kp xkp jp .

14.5



Contractions

Theorem 14.9. Let t = v 1 ⊗ · · · ⊗ v p ⊗ f 1 ⊗ · · · ⊗ f q be a simple tensor in T pq , where p, q  1. There exists a unique linear mapping c : Tpq −→ Tp−1 q−1 (referred to as contraction) such that c(t) = f j (v i )v 1 ⊗ · · · ⊗ v i−1 ⊗ v i+1 · · · ⊗ v p ⊗f 1 ⊗ · · · ⊗ f j−1 · · · f j+1 ⊗ · · · ⊗ f q .


845

Define the multilinear mapping f : V p × (V ∗ )q −→ Tp−1 q−1 as f (v1 , . . . , v p , f 1 , . . . , f q ) = f j (v i )v 1 ⊗ · · · ⊗ v i−1 ⊗ v i+1 · · · ⊗ v p ⊗f 1 ⊗ · · · ⊗ f j−1 · · · f j+1 ⊗ · · · ⊗ f q .

Let c : Tpq −→ Tp−1 q−1 be a linear mapping defined by its values on the basis of Tp as c(ei1 ⊗ · · · ⊗ eip ⊗ f j1 ⊗ f jq ) = f (e1 , . . . , ep , f 1 , . . . , f q ), and denote its extension to a linear mapping defined on Tpq also by c. Since c⊗ and f agree on the basis of V p × (V ∗ )q , the diagram f

Tp−1 q−1

V p × (V ∗ )q

⊗ Tpq

c

is commutative. The universal property of the tensor product implies the uniqueness of this extension. □

Example 14.11. Let V be a real linear space and let v ∈ V and f ∈ V^∗. Define the bilinear mapping φ : V × V^∗ −→ R as φ(v, f) = f(v). By the universal property of the tensor product, φ can be factored as φ = c⊗, where c : T^1_1 −→ R = T^0_0 is a linear mapping. In other words, we have φ(v, f) = f(v) = c(v ⊗ f) for all v ∈ V and f ∈ V^∗.
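For a finite-dimensional V, a tensor in T^1_1 can be identified with the matrix of its components, and the contraction c of Example 14.11 then becomes the trace of that matrix. A small MATLAB check under this identification, which is an assumption made here only for illustration:

n = 4;
v = randn(n, 1);                 % a vector in V, identified with its components
f = randn(1, n);                 % a covector in V*, identified with its components
M = v * f;                       % components of the decomposable (1,1)-tensor v ⊗ f
difference = trace(M) - f * v;   % contraction equals f(v), up to rounding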

14.6 Symmetric and Skew-Symmetric Tensors

Theorem 14.10. Let φ be a permutation in PERMr , and let V be a linear space.


There exists a unique linear mapping P φ : V ⊗r −→ V ⊗r such that P φ (v 1 ⊗ · · · ⊗ v r ) = v φ

−1 (1)

⊗ · · · ⊗ vφ

−1 (r)

.

The mapping P φ is an isomorphism of V ⊗r and P ψ P φ = P ψφ for every ψ, φ ∈ PERMr . Proof.

The mapping f : V · · × V −→ V ⊗r defined by  × · r

f (v1 , . . . , v r ) = v φ

−1 (1)

⊗ · · · ⊗ vφ

−1 (r)

is an r-multilinear mapping. By the universal property of the tensor products, there exists a unique linear mapping P φ that makes the diagram V × ··· × V   

f V ⊗r

r

⊗ Pφ V ⊗r

commutative, that is, P φ (v 1 ⊗ · · · ⊗ v r ) = v φ

−1 (1)

⊗ · · · vφ

−1 (r)

for v 1 , . . . , v r . Since P φ induces a permutation of the standard basis of V ⊗r , it is immediate that P φ is an isomorphism. We have P ψ P φ (v 1 ⊗ · · · ⊗ v r ) −1 (1)

⊗ · · · ⊗ vφ

−1 ψ −1 (1)

⊗ · · · ⊗ vφ

= P ψ (v φ = (v φ

= P ψφ (v 1 ⊗ · · · ⊗ v r ).

−1 (r)

)

−1 ψ −1 (r)

) 


Example 14.12. Let φ, ψ ∈ PERM4 be the permutations given by     1 2 3 4 1 2 3 4 φ: and ψ : . 2 4 1 3 4 2 1 3 This implies φ−1 :



1 2 3 4 3 1 4 2



and ψ −1 :



 1 2 3 4 3 2 4 1

and P φ (v 1 ⊗ v 2 ⊗ v3 ⊗ v 4 ) = v 3 ⊗ v1 ⊗ v 4 ⊗ v 2 , P ψ (v 1 ⊗ v 2 ⊗ v3 ⊗ v 4 ) = v 3 ⊗ v2 ⊗ v 4 ⊗ v 1 . Therefore, P ψ P φ (v 1 ⊗ v 2 ⊗ v 3 ⊗ v4 ) = P ψ (v 3 ⊗ v 1 ⊗ v 4 ⊗ v 2 ) = v 4 ⊗ v 1 ⊗ v2 ⊗ v 3 . Since ψφ :

    1 2 3 4 1 2 3 4 , , and (ψφ)−1 : 4 1 2 3 2 3 4 1

it follows that P ψφ (v 1 ⊗ v 2 ⊗ v 3 ⊗ v4 ) = (v 4 ⊗ v1 ⊗ v 2 ⊗ v 3 ) = P ψ P φ (v 1 ⊗ v 2 ⊗ v 3 ⊗ v 4 ). Definition 14.5. A tensor t ∈ V ⊗r is symmetric if P φ (t) = t for every φ ∈ PERMr . A tensor t ∈ V ⊗r is skew-symmetric or alternating if P φ (t) = sign(φ)t for every permutation φ ∈ PERMr . The set of symmetric tensors in V ⊗r is denoted as SYMV,r . The set of skew-symmetric tensors in V ⊗r is denoted as SKSV,r . Both SYMV,r and SKSV,r are subspaces of V ⊗r . Definition 14.6. Let V, W be F-linear spaces and k be a positive integer. A multilinear map f : V k −→ W is alternating if it vanishes


whenever two arguments are equal, that is, f (. . . , x, . . . , x, . . .) = 0W for every x ∈ V . A multilinear mapping f : V k −→ W is skew-symmetric if the sign of f changes when two arguments are permuted, that is, f (. . . , x, . . . , y, . . .) = −f (. . . , y, . . . , x, . . .) for x, y ∈ V . The notions of alternating multilinear function and skewsymmetric function are identical, as the next statement shows. Theorem 14.11. Let V, W be F-linear spaces and k be a positive integer such that k  2. A multilinear mapping f : V k −→ W is alternating if and only if it is skew-symmetric. Proof. Let f be an alternating multilinear mapping. By the multilinearity of f , we can write f (. . . , x + y, . . . , x + y, . . .) = f (. . . , x, . . . , x, . . .) + f (. . . , x, . . . , y, . . .) +f (. . . , y, . . . , x, . . .) + f (. . . , y, . . . , y, . . .) = f (. . . , x, . . . , y, . . .) + f (. . . , y, . . . , x, . . .) = 0W , hence f is skew-symmetric. Conversely, if f is skew-symmetric, we have f (. . . , x, . . . , x, . . .) = −f (. . . , x, . . . , x, . . .), hence f (. . . , x, . . . , x, . . .) = 0W , which shows that f is alternating.



The set of alternating multilinear mappings from V k to W is a subspace of M(V, . . . , V ; W ). Theorem 14.12. Let V, W be F-linear spaces. A multilinear mapping f : V k −→ W is skew-symmetric if and only if f (xφ(1) , . . . , xφ(k) ) = sign(φ)f (x1 , . . . , xk ) for every φ ∈ PERMk and x1 , . . . , xk ∈ V .


Proof. Suppose that the condition of the theorem is satisfied. If φ is a transposition,   1 ··· i ··· j ··· k φ: , 1 ··· j ··· i··· k the equality of the theorem amounts to f (x1 , . . . , xi , . . . , xj , . . . , xk ) = −f (x1 , . . . , xj , . . . , xi , . . . , xk ), which shows that f is skew-symmetric. Conversely, if f is skew-symmetric, for each transposition ψ we have f (xψ(1) , . . . , xψ(k) ) = −f (x1 , . . . , xk ). If φ is a product of transpositions φ = ψ1 · · · ψr , then the k-tuple (xφ(1) , . . . , xφ(k) ) can be obtained from the k-tuple (x1 , . . . , xk ) by applying successively the transpositions ψ1 , . . . ψr . This implies f (xφ(1) , . . . , xφ(k) ) = sign(ψ1 ) · · · sign(ψr )f (x1 , . . . , xk ) = sign(φ)f (x1 , . . . , xk ).



Theorem 14.13. If v 1 ⊗ · · · ⊗ v r ∈ SKSV,r , then v 1 ⊗ · · · ⊗ v r = 0V,r if any two of the vectors v 1 , . . . , v r are equal. Proof.

For the transposition   1 ··· i ··· j ··· r φ: , 1 · · · j · · · i · · · r.

we have inv(φ) = 1. Suppose that the ith argument and the jth argument of t are both equal to v. Then, by the skew-symmetry of t, we have v 1 ⊗ · · · ⊗ v ⊗ · · · ⊗ v, · · · , v r ) = −v1 ⊗ · · · ⊗ v ⊗ · · · ⊗ v ⊗ · · · ⊗ v r , hence v 1 ⊗ · · · ⊗ v ⊗ · · · ⊗ v ⊗ · · · ⊗ v r = 0V,r .



Corollary 14.2. If {v 1 , . . . , v r } is a linearly dependent set of vectors in SKSV,r , then v 1 ⊗ · · · ⊗ v r = 0V,r .


Proof. Since {v 1 , . . . , v r } is a linearly dependent set, there exists a vector v i that can be expressed as a linear combination of the other vectors. Without loss of generality, we may assume that v 1 = a2 v 2 + · · · + ar v r . This implies v 1 ⊗ · · · ⊗ v r = (a2 v 2 + · · · + ar v r ) ⊗ v 2 ⊗ · · · ⊗ vr =

r

ai (v i ⊗ v 2 ⊗ · · · ⊗ v r ) = 0V,r .

i=2



Corollary 14.3. Let V be a real linear space with dim(V ) = n. If r > n, then V ⊗r = 0V,r . Proof. Since every subset of V that contains more than n = dim(V ) vectors is linearly dependent, the statement follows from  Corollary 14.2. Define the linear mappings S V,r : V ⊗r −→ V ⊗r and AV,r : −→ V ⊗r as V 1 S V,r (t) = P φ (t), (14.5) r! ⊗r

φ∈PERMr

AV,r (t) =

1 r!



sign(φ)P φ (t).

(14.6)

φ∈PERMr

These mappings are referred to as the symmetrizer and the alternator on V ⊗r for reasons that will become apparent immediately. If V is clear from the context, these mappings will be denoted as S r and Ar , respectively. Theorem 14.14. For every permutation ψ ∈ PERMr we have P ψ S r = S r P ψ = S r and P ψ Ar = Ar P ψ = sign(ψ)Ar . Proof.

We have



P ψ S r (t) =

1 Pψ ⎝ r! ⎛

=

1 ⎝ r!

φ∈PERMr



φ∈PERMr

⎞ P φ (t)⎠ ⎞

P ψ P φ (t)⎠

Tensors and Exterior Algebras

⎛ =





1 ⎝ r!

851

P ψφ (t)⎠ .

φ∈PERMr

As φ runs through PERMr , the same holds for τ = ψφ, hence, P ψ S r (t) =

1 r!



P τ (t) = S r (t).

τ ∈PERMr

For the second equality of the theorem, we have ⎛ P ψ Ar (t) =

1 Pψ ⎝ r!



⎞ sign(φ)P φ (t)⎠

φ∈PERMr

=

=

=

1 r! 1 r! 1 r!



sign(φ)P ψφ (t)

φ∈PERMr



sign(φ)P ψφ (t)

φ∈PERMr



sign(ψ)sign(ψφ)P ψφ (t)

φ∈PERMr

= sign(ψ)

1 r!



sign(τ )P τ (t)

τ ∈PERMr

because sign(φ) = sign(ψ)sign(ψφ) and τ = ψφ runs through PERMr . The remaining equalities, S r P ψ = S r and Ar P ψ = sign(ψ)Ar ,  have similar arguments. Theorem 14.15. If t ∈ SKSV,r , then Ar (t) = t.

852

Linear Algebra Tools for Data Mining (Second Edition)

Proof. Since t ∈ SKSV,r , we have P φ (t) = sign(φ)t for every φ ∈ PERMr . This allows us to write 1 sign(φ)P φ (t) Ar (t) = r! φ∈PERMr

1 r!

=



sign(φ)2 t = t

φ∈PERMr

because the last sum contains r! terms and sign(φ)2 = 1.



Theorem 14.16. Ar (t) is a skew-symmetric tensor for every tensor t ∈ V ⊗r . Proof. We need to show that P ψ (Ar (t)) = sign(ψ)Ar (t) for every permutation ψ ∈ PERMr . By the definition of Ar , we have P ψ (Ar (t)) = =

=

1 r! 1 r! 1 r!



sign(φ)Pψ Pφ (t)

φ∈PERMr



sign(φ)Pψφ (t)

φ∈PERMr



sign(φ)Pψφ (t).

φ∈PERMr

Since sign(ψ)2 = 1, the last equality can be written as P ψ (Ar (t)) =

1 r!



sign(ψ)sign(ψφ)P ψφ (t)

φ∈PERMr

= sign(ψ)

1 r!



sign(ψφ)P ψφ (t).

φ∈PERMr

By Theorem 1.4, for a fixed ψ we have the equality {ψφ | φ ∈ PERMr } = PERMr , which allows us to write P ψ (Ar (t)) = sign(ψ)

1 r!

φ∈PERMr

sign(φ)P φ (t) = sign(ψ)Ar (t). 

Tensors and Exterior Algebras

853

Theorem 14.17. The mappings S V,r and AV,r defined on V ⊗r are projections of V ⊗r onto the subspaces SYM(V, r) and SKSV,r of V ⊗r , respectively. Proof. We begin by proving that both S V,r and AV,r are idempotent. We have S V,r (S V,r (t)) = =

1 r! 1 r!



P φ S V,r (t)

φ∈PERMr



S V,r (t)

φ∈PERMr

(by Theorem 14.14) = S V,r (t) (because the previous sum contains r! terms), hence S V,r S V,r = S V,r . A similar computation leads to the same conclusion for AV,r : AV,r (AV,r (t)) = =

1 r! 1 r!



sign(φ)P φ AV,r (t)

φ∈PERMr



sign(φ)2 AV,r (t)

φ∈PERMr

(by Theorem 14.14) = AV,r (t) (because the previous sum contains r! terms and sign(φ)2 = 1). for all φ ∈ PERMr . Note that t ∈ SYMV,r implies P φ (t) = t  Therefore, if t ∈ SYMV,r , we have S V,r (t) = r!1 φ∈PERMr P φ (t) = t. Conversely, if S V,r (t) = t, we have P φ (t) = P φ (S V,r (t)) = S V,r (t) = t for all σ ∈ PERMr and t ∈ SYMV,r .

Linear Algebra Tools for Data Mining (Second Edition)

854

The membership t ∈ SKSV,r is equivalent to P φ (t) = sign(φ)t for every φ ∈ PERMr . Thus, if t ∈ SKSV,r , we have AV,r (t) =

1 r!



sign(φ)P φ (t) =

φ∈PERMr

1 r!



sign(φ)2 t = t.

φ∈PERMr

Conversely, if AV,r (t) = t, we have P φ (t) = P φ AV,r (t) = sign(φ)AV,r (t) = sign(φ)t for all φ ∈ PERMr and t ∈ SKSV,r . Furthermore, we have S V,r (AV,r (t)) =

1 r!

1 = r!



P σ (AV,r (t))

σ∈PERMr





sign(φ) AV,r (t) = 0V ⊗r

σ∈PERMr



because σ∈PERMr sign(φ) = 0. A similar computation yields AV,r (S V,r (t)) = 0V ⊗r . Let t ∈ SYMV,r ∩ SKSV,r , we have t = S V,r (t), hence AV,r (k) = AV,r (S V,r (t)) = 0V ⊗r , hence AV,r S V,r = 0. Similarly, S V,r AV,r =  0V ⊗r . Example 14.13. Let V be a linear space and V 2 . We have     1 2 1 , φ1 : PERM2 = φ0 : 1 2 2

let t be a tensor in  2 1

and, therefore, we can write AV,2 (t) =

1 (P φ0 (t) − P φ1 (t)) . 2

Thus, if t is a simple tensor t = u ⊗ w ∈ V ∗2 , we have 1 AV,2 = (u ⊗ w − w ⊗ u). 2 Let V be a linear space with dim(V ) = n. For ei1 , . . . , eik ∈ V with i1 , . . . , ik ∈ {1, . . . , n}, following [30], we denote by ei1 · · · eik

Tensors and Exterior Algebras

the tensor S V,k (ei1 ⊗ · · · ⊗ eik ) =

1 k!



855

eiφ(1) ⊗ · · · ⊗ eiφ(k) .

φ∈PERMk

The factor ei1 · · · eik depends only on the number of times each ei enters this product, and we may write ei1 · · · eik = ep11 · · · epnn , where pi is the multiplicity of occurrence of ei in ei1 · · · eik (which may also be 0). Thus, the numbers p1 , . . . , pn are non-negative integers and p1 + · · · + pn = k. Theorem 14.18. Let {e1 , . . . , en } be a basis of a linear space V. Then {S V,k (ei1 ⊗ · · · ⊗ eik ) | 1  i1  · · ·  ik  n}

. is a basis of SV,k . Furthermore, dim(SV,k ) = n+k−1 k Proof. Since B = {ei1 ⊗ · · · ⊗ eik | 1  i1  n, . . . , 1  ik  n} is a basis for Tk (V ) and S V n ,k maps Tk (V n ) into SV,k , the set S(B) = {ei1 · · · eik | 1  i1  · · ·  ik  n} = {ep11 · · · epnn | p1 + · · · + pn = k} spans SV,k . Vectors in SV,k are linearly independent because, if (p1 , . . . , pn ) = (q1 , . . . , qn ), then the tensors ep11 · · · epnn and eq11 · · · eqnn are linear combinations of two non-intersecting subsets of the basic elements of Tk (V ). By Supplement 12 of Chapter 1, the cardinality of S(B) is the number of combinations with repetition of n objects  taken k at a time. Theorem 14.19. Let ti1 ,...,ip be a standard basis element of a tensor space Tp = V ⊗p . If ik = i for some 1  k <   p, then AV,p (ti1 ,...,ip ) = 0V ⊗p . Proof. Let φ ∈ PERMp be the transposition that inverts the places of k and . Since ik = i , we have P φ (ti1 ,...,ip ) = ti1 ,...,ip . This implies AV,p (ti1 ,...,ip ) = AV,p (Pφ (ti1 ,...,ip )) = sign(φ)AV,p (ti1 ,...,ip ) = −AV,p (ti1 ,...,ip ), hence AV,p (ti1 ,...,ip ) = 0V ⊗p .



856

Linear Algebra Tools for Data Mining (Second Edition)

Corollary 14.4. If p > n, where n = dim(V ), then dim(V ⊗p ) = 0. Proof.



This follows from Theorem 14.19.

Theorem 14.20. Let V be an n-dimensional linear space. If p  n, then {AV,p (ti1 ,...,ip ) | i1 < i2 < · · · < ip } is a basis of the linear space SKSV,p and dim(SKSV,p ) = np . Proof. By Theorem 14.19, we need to consider only those ti1 ,...,ip having all indices distinct. For φ, ψPERMp , we have tiφ(1) ,...,iφ(p) = tiψ(1) ,...,iψ(p) if and only if φ = ψ. Thus, we have AV,p (ti1 ,...,ip ) = 0V ⊗p . If the sets {i1 , . . . , ip } and {k1 , . . . , kp } contain the same elements, then AV,p (ti1 ,...,ip ) = ±AV,p (tk1 ,...,kp ). If the sets {i1 , . . . , ip } and {k1 , . . . , kp } are distinct, there are no common elements of the basis when we expend AV,p (ti1 ,...,ip ) and AV,p (tk1 ,...,kp ) as linear combinations of the elements of the basis. Therefore, the set of elements of the form AV,p (ti1 ,...,ip ) is linearly  independent. Corollary 14.5. We have dim(SKSV,n ) = 1 and dim(SKSV,p ) = dim(SKSV,n−p ). Proof.



These equalities follow from Theorem 14.20.

Let t ∈ V ⊗p . We have ⎛ 1 t − S V,p (t) = ⎝p!t − p!

φ∈PERMp

⎞ 1 P φ (t)⎠ = p!



(t − P φ (t)) .

φ∈PERMp

If t ∈ ker S V,p , this implies t=

1 p!



(t − P φ (t)) .

φ∈PERMp

Let z = t − P φ (t). Then S V,p (z) = S V,p (t) − S V,p (t) = 0V ⊗p , so z ∈ ker(S V,p ). For φ ∈ PERMp and t ∈ V ⊗p , let w(t, φ) = t − P φ (t). We have S V,p (w(t, φ)) = S V,p (t) − S V,p (P φ (t)) = 0V ⊗p , hence w(t, φ) ∈ ker(S V,p ). Thus, if W is the subspace generated by {t − P φ | t ∈ V ⊗p and φ ∈ PERMp }, we have ker(S V,p ) = W .


14.7 Exterior Algebras

In this section, we introduce exterior algebra also known as Grassmann1 algebra, a construct that makes use of a new concept known as wedge product. Definition 14.7. Let V be a linear space and let t = v 1 ⊗ · · · ⊗ v r be a simple tensor in Tr (V ). The wedge product of the vectors v 1 , . . . , v r is the tensor r!AV,r (t) denoted as v1 ∧ · · · ∧ v r . Example 14.14. The wedge product of the vectors v 1 , v 2 ∈ V is the tensor v 1 ∧ v 2 = 2!AV,r (v 1 ⊗ v2 ) = v1 ⊗ v2 − v2 ⊗ v1. The following properties are immediate: (i) v 1 ∧ v2 = −v 2 ∧ v 1 , (ii) (av 1 ) ∧ v 2 = a(v 1 ∧ v2 ), (iii) (u1 + u2 ) ∧ v = (u1 ∧ v) + (u2 ∧ v), for every u1 , u2 , v 1 ,v 2 , and v ∈ V . By Property (i), we have u ∧ u = 0V ⊗2 , for every u ∈ L. A tensor of the form u ∧ v is also referred to as a bivector. In general, we have v 1 ∧ · · · ∧ vr =



sign(φ)P φ (v φ

−1 (1)

⊗ · · · ⊗ vφ

−1 (r)

)

φ∈PERMr

=



sign(φ)P φ (v φ(1) ⊗ · · · ⊗ v φ(r) ),

φ∈PERMr

because sign(φ) = sign(φ−1 ). The next statement presents a universal property of the wedge product. 1 Hermann G¨ unter Grassmann was born in Stettin in 1809 and died in the same place in 1877. He was a mathematician and linguist and he made important contributions to an algebraic approach to geometry.

Linear Algebra Tools for Data Mining (Second Edition)

858

Theorem 14.21. Let V and W be two linear spaces and let f : V · · × V −→ W be a skew-symmetric multilinear function. There  × · r

exists a unique linear transformation d : SKSV,r −→ W such that the diagram V × ··· × V   

f W

r

A V,r d

SKSV,r

is commutative. Proof. By the universal property of tensor products, there exists a linear mapping c : V ⊗r −→ W such that f = c⊗. Since f is skew-symmetric, it follows that c(v φ(1) ⊗ · · · ⊗ v φ(r) ) = f (v φ(1) , . . . , v φ(r) ) = sign(φ)f (v 1 , . . . , v k ) (by Theorem 14.12) = sign(φ)c(v 1 ⊗ · · · ⊗ v r ) for all simple tensors v 1 ⊗ · · · ⊗ v r ∈ V ⊗r and all φ ∈ PERMr . Therefore, c(v 1 ⊗ · · · ⊗ v r ) = sign(φ)c(v φ(1) ⊗ · · · ⊗ v φ(r) ). Summing up over all permutations in PERMr , we obtain r!c(v 1 ⊗ · · · ⊗ v r ) =



sign(φ)c(v φ(1) ⊗ · · · ⊗ v φ(r) )

φ∈PERMr

= r!f (v1 , . . . , v r ).

Tensors and Exterior Algebras

V × ··· × V   

f

W

r



859

c

V ⊗r d

A V,r SKSV,r

Define the mapping d : SKSV,r −→ W as d(t) = c(t) for t ∈ SKSV,r , that is, the restriction of c to the subspace SKSV,r , that is, d = r!1 c SKSV,r . It is clear that d makes commutative the above diagram. Simple tensors of the form v 1 ∧ · · · ∧ v r ∈ SKSV,r generate the subspace SKSV,r due to the fact that Tr (V ) is generated by tensors of the form v1 ⊗ · · · ⊗ v r and AV,r is the projection of Tr (V ) on SKSV,r . Thus, d is unique because its values on SKSV,r are uniquely determined by the condition d(v 1 ∧ · · · ∧ v r ) = f (v1 , . . . , v r ).  Let t1 ∈ SKSV,p1 and t2 ∈ SKSV,p2 , the tensor t1 ∧ t2 is defined as  t1 ∧ t2 =

 p1 + p2 AV,p1 +p2 (t1 ⊗ t2 ). p1

(14.7)

Equivalently, we have  t1 ∧ t2 = =

 p1 + p2 1 p1 (p1 + p2 )!

1 p1 !p2 !





sign(φ)P φ (t1 ⊗ t2 )

φ∈PERMp1 +p2

sign(φ)P φ (t1 ⊗ t2 ).

φ∈PERMp1 +p2

Definition 14.8. The Grassmann algebra  or the exterior algebra of order n of a linear space V is the algebra (V ) defined as a direct sum of the subspaces SKSV,p for 0  p  n, where the product t1 ∧ t2 of t1 ∈ SKSV,p1 and t2 ∈  SKSV,p2 is defined as AV,p1+p2 and this operation is extended to (V ) by linearity.

Linear Algebra Tools for Data Mining (Second Edition)

860

If t1 = u1 ⊗ · · · ⊗ up1 and t2 = v 1 ⊗ · · · ⊗ v p2 , we have t1 ⊗ t2 = u1 ⊗ · · · ⊗ up1 ⊗ v 1 ⊗ · · · ⊗ v p2 , t2 ⊗ t1 = v 1 ⊗ · · · ⊗ v p2 ⊗ u1 ⊗ · · · ⊗ up1 . Theorem 14.22. Let t ∈ SKSV,p and s ∈ SKSV,q . We have: s ∧ t = (−1)pq t ∧ s. Proof. Let θ ∈ PERMp+q be the permutation introduced in Exercise 7 of Chapter 1,   1 2 ··· p p + 1 ··· p + q θ: . p + 1 p + 2 ··· p + q 1 ··· q Recall that inv(θ) = (−1)pq . Let t = ti1 ···ip ei1 ⊗ · · · ⊗ eip and s = sj1···jq ej1 ⊗ · · · ⊗ ejq . We have P θ (t ⊗ s) = s ⊗ t. Therefore, t∧s = s∧t = =

=

1 p1 !p2 ! 1 p1 !p2 ! 1 p1 !p2 ! 1 p1 !p2 !

= signθ

= signθ



sign(φ)P φ (t ⊗ s)

φ∈PERMp+q



sign(φ)P φ (s ⊗ t)

φ∈PERMp+q



sign(φ)P φ P θ (t ⊗ s)

φ∈PERMp+q



sign(φ)signθ 2 P φθ (t ⊗ s)

φ∈PERMp+q

1 p1 !p2 ! 1 p1 !p2 !

= (−1)pq t ∧ s.



sign(φθ)P φθ (t ⊗ s)

φ∈PERMp+q



sign(ψ)P ψ (t ⊗ s)

ψ∈PERMp+q



Tensors and Exterior Algebras

861

Corollary 14.6. If t ∈ SKSV,p and p is an odd number, then t ∧ t = 0V,p . 2

Proof. By Theorem 14.22, we have t ∧ t = (−1)p t ∧ t. Since p is  an odd number, so is p2 , hence t ∧ t = 0V,p . Let ti ∈ SKSV,pi for 1  i  n be n tensors. Define the tensor t1 ∧ t2 ∧ · · · ∧ tn as t1 ∧ t2 ∧ · · · ∧ tn =

(p1 + p2 + · · · + pn )! AV,p1 +p2 +···pn (t1 ⊗ t2 ⊗ · · · ⊗ tn ). p1 !p2 ! · · · pn !

The wedge product of tensors is multilinear. For simple tensors, we have the associativity property: (v 1 ∧ · · · ∧ v p ) ∧ ((u1 ∧ · · · ∧ uq ) ∧ (w 1 ∧ · · · wr )) = ((v 1 ∧ · · · ∧ vp ) ∧ (u1 ∧ · · · ∧ uq )) ∧ (w1 ∧ · · · wr ), Since SKSV,r is a subspace of Tr and the simple tensors ei1 · · · ⊗ · · · eir form a base of Tr , it follows that every t ∈ SKSV,r can be written as t = ti1 ···ir ei1 · · · ⊗ · · · eir . This implies t = AV,r (t) = ti1 ···ir AV,r (ei1 ⊗ · · · eir ) =

1 ti ···i ei1 ∧ · · · ∧ eir . r! 1 r

These equalities show that the set of simple skew-symmetric tensors of the form ei1 ∧· · ·∧eir span the subspace SKSV,r . However, the set of simple skew-symmetric tensors of SKSV,r does not form a basis of this space because it is not linearly independent. Indeed, if φ ∈ PERMr , we have eiφ(1) ∧ · · · ∧ eiφ(1) = sign(φ)ei1 ∧ · · · ∧ eir . To identify a basis of SKSV,r , we introduce the notation t[i1 ···ir ] for the component ti1 ···ir that satisfies the restriction i1 < · · · < ir ; similarly, we denote by c[i1 ···ir ] the fact that i1 < · · · < ir . Theorem 14.23. If V is a linear space that has a d-element spanning subset, then SKSV,k = {0V,k } for k > d. Proof. Indeed, suppose that {x1 , . . . , xd } spans V. When k > d and xi1 ∧ · · · ∧ xik contains two equal factors, so it is zero. Thus,  SKSV,k is spanned by zero, so SKSV,k = {0SKSV,k }.

862

Linear Algebra Tools for Data Mining (Second Edition)

Next, we present a universal property for SKSV,2 , where V is a linear space. Theorem 14.24. Let V be a real linear space. There exists a linear space W and an alternating multilinear mapping f : V × V −→ W such that if ψ is the exterior multiplication ψ : V × V −→ SKSV,2 , then f factors uniquely as a composition f = gψ, where g : SKSV,2 −→ W is a linear mapping. In other words, there exists a unique linear mapping g that makes the diagram f V ×V

W

ψ g SKSV,2

commutative. Proof. Recall that FREE(V × V ) is the set of all maps over V × V with values in R that are 0 everywhere with the exception of a finite number of pairs in V × V . For x, y ∈ V , define λx,y : V × V −→ R as  1 if (x, y) = (r, s) λx,y (r, s) = 0 if (x, y) = (r, s). The set {λx,y | x, y ∈ V } is a basis in the linear space F(V × V −→ R) of real-valued functions defined on FREE(V × V ) which have finite supports. Define the set of functions G in F(V × V −→ R) as consisting of the functions of the form λax+by,cz+dt + 12 (acλz,x + adλt,x + bcλz,y + bdλt,y ) − 12 (acλx,z + adλx,t + bcλy,z + bdλy,t ) , for x, y, z, t ∈ V and a, b, c, d ∈ R. The subspace of F(V × V −→ R) generated by G is denoted by F. The relation ρ in FREE(V × V ) consists of those pairs f, g such that f ρg if f − g ∈ G. The linear space W is defined as W = FREE(V × V )/ρ.

Tensors and Exterior Algebras

863

Let J be the canonical surjection of FREE(V × V ) in FREE(V × V )/ρ, which has G as a kernel. Then the map f = Jλ defined on FREE(V × V ) is an alternate bilinear map. Indeed, we have    1 1 λy,x − J λx,y J(λx,y ) + J 2 2   1 1 = J λx,y + λy,x − λx,y 2 2 

(due to the linearity of J) = 0, because J is linear with kernel G. Thus, the function 1 1 λx,y + λy,x − λx,y 2 2 belongs to G (with (x, y, z, t) = (x, 0, y, 0) and (a, b, c, d) = (1, 0, 1, 0)). Consequently, we have J(λx,y + λy,x ) = 0, hence f (x, y) = −f (y, x), which shows that the function f , is skewsymmetric. To prove the bilinearity of f observe that  1 1 J λx+ay,z + (λz,x + aλz,y ) − (λx,z + aλx,z ) = 0 2 2 

because the argument of J has the form prescribed for the subspace FREE(V ×V ) with (1, a, 1, 0) taken for (a, b, c, d). Thus, f (x+ay, z)+ f (z, x)+af (z, y) = 0 amounts to f (x+ay, z) = −f (z, x)−af (z, y), which implies the bilinearity of f by applying the skew-symmetry previously shown. We show now that the space W = FREE(V × V )/ρ and the mapping f satisfy the conditions of the theorem. Let ψ be a skewsymmetric map of V × V into SKSV,2 . Consider a basis that consists of elements λx,y , where x, y ∈ V , and consider in SKSV,2 the family {ψ(x, y) | x, y ∈ V }. There exists a unique linear map Λ : FREE(V × V ) −→ W such that Λ(λx,y ) = ψ(x, y).

Linear Algebra Tools for Data Mining (Second Edition)

864

If f, g ∈ FREE(V × V ) are such that f ρg, then Λ(f ) = Λ(g). Indeed, note that any function in G is null for Λ. Now, we have    1 Λ λax+by,cz+dt + (acλz,x + adλt,x + bcλz,y + bdλt,y ) 2  1 − (acλx,z + adλx,t + bcλx,z + bdλy,t ) 2  = ψ(ax + by, cz, dt) 1 + (acψ(z, x) + adψ(t, x(+bcψ(z, y) + bdψ(t, y)) 2 1 − (acψ(x, z) + adψ(x, t) + bcψ(y, z) + bdψ(y, t)) , 2 which is 0 due to the bilinearity and skew symmetry of ψ. Starting from Λ, define a unique mapping g : SKSV,2 −→ W such that for every v ∈ SKSV,2 we have g(v) = Λ(λx,y ), where λx,y is a representative of v in FREE(V × V ). Then, we have g(J(λx,y )) = ψ(x, y), which shows that gJ = ψ. The map g is unique. Indeed, the family {J(λx,y ) | x, y ∈ L} is a generating set of W since J is a surjective mapping from FREE(V ×V ) into W and {λx,y | x, y ∈ L} is a basis of FREE(V × V ). This family contains a basis {J(λxi ,yi ) | i ∈ I} of W . Then, we have g(f (x, y)) = ψ(x, y) for x, y ∈ V , which implies g(f (xi , y i )) = ψ(xi , y i ). Thus, there exists a unique mapping g : SKSV,2 −→ W such that f = gψ, where  W = FREE(V × V )/ρ. A more general result having a similar argument is Theorem 14.25. Let V be a real linear space. There exists a linear space W and an alternating multilinear mapping f : V · · × V −→  × · k

W such that if ψ is the exterior multiplication ψ : SKSV,k −→ W , then f factors uniquely as a composition f = gψ, where g : SKSV,k −→ M is a linear map. Corollary 14.7. Let V be a linear space with dim(V ) = n and let ψ : V × · · · × V −→ SKSV,k . If k > n, we have SKSV,k = {0V }. k

Tensors and Exterior Algebras

865

Proof. This follows from the fact that if k > dim(V ), every k-blade must contain two identical factors and, therefore, must equal 0V .  Also, observe that if dim(V ) = n, we have dim(SKSV,r ) = 1. Theorem 14.26. Let V be a linear space. We have v 1 ∧v2 ∧· · ·∧v k = 0V,k if and only if {v 1 , v 2 , . . . , v k } is a linearly dependent set in V. Proof. Suppose that {v 1 , v 2 , . . . , v k } is a linearly dependent set in V. Without loss of generality, assume that v 1 = a2 v2 + · · · + ak vk . Then, v1 ∧ v2 ∧ · · · ∧ vk = (a2 v 2 + · · · + ak vk ) ∧ v 2 ∧ · · · ∧ v k =

k

(ai v i ∧ v 2 · · · ∧ v i ∧ · · · ∧ v k ) = 0V,k

i=2

because each wedge in the last sum has a repeated vector. Conversely, suppose that {v 1 , v 2 , . . . , v k } is linearly independent and let {v 1 , v 2 , . . . , v k , . . . , v n } be its extension to a basis of V. As we saw above, the collection {v i1 ∧ · · · ∧ v ik | 1  i1 < · · · < ik  n} is a basis of SKSV,k . Since v1 ∧ · · · ∧ v k belongs to this basis, it cannot be 0V,k .  Theorem 14.27. Let V be a linear space and let v 1 ∧ · · · ∧ v k and w1 ∧ · · · ∧ wk be two wedges in SKSV,k . There exists c ∈ R − {0} such that v 1 ∧ · · · ∧ v k = c w1 ∧ · · · ∧ wk if and only if v1 , . . . , v k  = w1 , . . . , w k . Proof. Let R = v 1 , . . . , v k  and S = w1 , . . . , w k . If R = S, every v i is a linear combination of {w1 , . . . , w k }, v i = ci1 w1 + . . . + cik wk . After substituting the expressions of vi in v1 ∧ · · · ∧ v k , applying multilinearity and the alternating property, we are left with an expression of the form cw 1 ∧ · · · ∧ wk for some non-zero c. Note that c = 0 because a wedge cannot be 0V . If R = S, then let  = dim(R ∩ S) < k. Without loss of generality, we may assume that the first elements (u1 , . . . , u ) of both lists (v 1 , . . . , v k ) and (w1 , . . . , wk ) form a basis for R ∩ S. Thus, the previous lists can be written as (u1 , . . . , u , v 1 , . . . , v k− )

866

Linear Algebra Tools for Data Mining (Second Edition)

and (u1 , . . . , u , w 1 , . . . , wk− ). Then, the members of the list (u1 , . . . , u , v 1 , . . . , v k− , w1 , . . . , w k− ) form a linearly independent set, which can be extended to a basis of V. This implies that the vectors u1 ∧ · · · ∧ u ∧ v 1 ∧ · · · ∧ v k− and u1 ∧ · · · ∧ u ∧ w1 ∧ · · · ∧ wk− belong to the same basis, which means that they cannot differ by a constant multiple.  Lemma 14.1. Let V be an n-dimensional linear space and let B = {e1 , . . . , en } be a basis for V. The set of wedge products ei1 ∧ · · · ∧ eik with 1  i1 < · · · < ik  n spans SKSV,k . Proof. We start from the fact that SKSV,k is spanned by k-wedges w1 ∧ · · · ∧ wk . Therefore, it suffices to prove that any such wedge is in the span of Bk . Since B is a basis for V, each wi is a linear combination of the basis vectors e1 , . . . , ek . If w1 ∧ · · · ∧ wk is expanded using distributivity, each term in the resulting sum is a multiple of a k-wedge of the form ei1 ∧ · · · ∧ eik . This suffices to conclude that SKSV,k is generated by the nk blades of the form ei1 ∧ ei2 ∧ · · · ∧ eik . Among the blades of the form ei1 ∧ ei2 ∧ · · · ∧ eik those which contain two equal indices are 0V,k , and the blades can always be rearranged in the increasing order of the indices. Therefore, Bk spans SKSV,k .  Theorem 14.28. Let B = {e1 , . . . , en } be a basis in the linear space V. Then the set Bk = {ei1 ∧ ei2 ∧ · · · ∧ eik | 1  i1 < i3 < · · · < ik  n}.} is a basis of SKSV,k . Proof. For k = 0, there is nothing to prove, so we may assume that k  1. The idea is to embed SKSV,k as a subspace of the tensor power ⊗k V . Let f : V k −→ V ⊗k be the multilinear function defined as (−1)inv(φ) eφ(1) ⊗ · · · ⊗ eφ(k) . f (e1 , . . . , ek ) = φ∈PERMk

The multilinearity of f follows from the fact that each term of the sum contains each ei only once. It is easy to see that f is also alternating. By the universal property of the exterior power, there is a

Tensors and Exterior Algebras

867

linear map gk,V : SKSV,k −→ V ⊗k such that (−1)inv(φ) eφ(1) ⊗ · · · ⊗ eφ(k) . gk,V (e1 ∧ · · · ek ) = φ∈PERMk

Injectivity of gk,V is clear for k > n. Therefore, suppose that 1  k  n. We may assume k  2 because the statement is immediate for k = 1. Since B = {e1 , . . . , en } is a basis for V, the wedge products ei1 ∧ · · · ∧ eik with 1  i1 < · · · < ik  n spans SKSV,k by Lemma 14.1. Since V ⊗k has a basis, ei1 ⊗ · · · ⊗ eik , where 1  i1 , . .  . , ik  n. Suppose t ∈ SKSV,k satisfies gk,V (t) = 0, where t = {ci1 ···ik ei1 ∧ · · · ∧ eik |1  i1 < · · · < ik  n}, where ci1 ···ik ∈ R. Then gk,V = 0 implies ci1 ···ik (−1)inv(φ) eφ(1) ⊗ · · · ⊗ eφ(k) = 0, 1i1 > A(:,:,2)=[1 0;0 0] A(:,:,1) = 1 0

     0     1
A(:,:,2) =
     1     0
     0     1

Fig. 15.4. A 2 × 2 × 2 mda with entries t111 = 1, t211 = 0, t121 = 0, t221 = 1, t112 = 1, t212 = 0, t122 = 0, t222 = 1.

>> T=tensor(A)
T is a tensor of size 2 x 2 x 2
T(:,:,1) =
     1     0
     0     1
T(:,:,2) =
     1     0
     0     1

The three unfoldings of the mda T are obtained using the functions tensor_as_matrix of the package 862 described in [4]. For example, to obtain the unfolding after the first dimension, we write tensor_as_matrix(T,1,’fc’). The unfoldings produced on the first dimension in this manner are >> tensor_as_matrix(T,1,’fc’) ans is a matrix corresponding to a tensor of size 2 x 2 x 2 ans.rindices = [1] (modes of tensor corresponding to rows) ans.cindices = [2, 3] (modes of tensor corresponding to columns) ans.data = 1 0 0 1

1 0

0 1

Similar unfoldings yield >> tensor_as_matrix(T,2,’fc’) ans.data = 1 1 0 0

0 1

0 1

>> tensor_as_matrix(T,3,’fc’) ans.data = 1 0 1 0

0 0

1 1


Clearly, we have the ranks R1 = R2 = 2 and R3 = 1, so the n-ranks can be distinct and, in turn, can be distinct from the rank of the mda. Since T = e1 ⊗ e1 ⊗ (e1 + e2) + e2 ⊗ e2 ⊗ (e1 + e2), the rank of T cannot be larger than 2, which shows that T is indeed of rank 2.

Example 15.9. Let T be the 2 × 2 × 2 mda defined in Figure 15.5. The n-rank is 2 for 1 ≤ n ≤ 3. Since T = e2 ⊗ e1 ⊗ e1 + e1 ⊗ e2 ⊗ e1 + e1 ⊗ e1 ⊗ e2,

(15.5)

it follows that the rank of T is not larger than 3. To prove that the rank of T equals 3, we need to show that the decomposition 15.5 is minimal. Suppose that T would have rank 2, that is, T = x1 ⊗ y 1 ⊗ z 1 + x2 ⊗ y 2 ⊗ z 2 , and let X, Y , Z, D1 , and D2 be defined as X = (x1 x2 ), Y = (y 1 y 2 ), Z = (z 1 z 2 ), and D1 = diag(z11 , z12 ) and D2 = diag(z21 , z22 ). Note that

    z11 0 y1 XD1 Y = (x1 x2 ) 0 z12 y 2   0 1   . = x1 y 1 z11 + x2 y 2 z12 = 1 0 

Fig. 15.5. A 2 × 2 × 2 mda with entries t111 = 0, t211 = 1, t121 = 1, t221 = 0, t112 = 1, t212 = 0, t122 = 0, t222 = 0.

(15.6)


Similarly, 

xD2 y =

x1 y 1 z21

+

x2 y 2 z22

 =

 1 0 , 0 0

which means that Equality (15.6) is equivalent to

X D1 Y^⊤ = [0 1; 1 0]  and  X D2 Y^⊤ = [1 0; 0 0].

The last equalities imply that rank(D1) = 2 and rank(D2) = 1; in particular, the matrices X and Y have full rank. Without loss of generality, we may assume that z22 ≠ 0. Observe now that

(X D2 Y^⊤)(X D1 Y^⊤)^{-1} = X (D2 D1^{-1}) X^{-1} = [0 1; 0 0].

This equality implies that the first column of X is proportional to (1, 0)^⊤, while the first row is proportional to (0 1). This is possible only when X is of rank 1, which conflicts with the initial assumption (R1 = 2).

15.5 Matricization and Vectorization

Matricization [92] of an mda is the rearrangement of the elements of an mda into a matrix. Let T ∈ R^{I_1×I_2×···×I_N} be an order-N mda whose set of modes is N = {1, 2, . . . , N}. For a subset D = {d_1, . . . , d_{|D|}} of N, define the function ψ_D : ∏_{ℓ=1}^{N} {1, . . . , I_ℓ} −→ N as

ψ_D(i_1, . . . , i_N) = 1 + ∑_{p=1}^{|D|} [(i_{d_p} − 1) ∏_{j=1}^{p−1} I_{d_j}],

Multidimensional Array and Tensors

901

indices that will be mapped into the column indices of the resulting matrix.   Let J = n∈R In and K = n∈C In and let π be a bijection π : {1, . . . , I1 } × · · · × {1, . . . , IN } −→ {1, . . . , J} × {1, . . . , K} defined by π(i1 , . . . , iN ) = (ψR (i1 , . . . , iN ), ψC (i1 , . . . , iN )). The matrix T(R,C:IN ) ∈ RJ×K is defined as (T(R,C:IN ) )jk = Ti1 i2 ···iN ,

(15.7)

with j = ψR (i1 , . . . , in ) and k = ψC (i1 , . . . , in ). Definition 15.9. The matricized mda is a matrix T(R,C:IN ) in RJ×K defined by Equation (15.7). This form of matricization was introduced in [50–52]. A special case of matricization occurs when R = {n} and C = {1, . . . , N } − {n}. In this case, we use the term n-unfolding. The matrix unfolding T(n) ∈ RIn ×(In+1 ···IN I1 I2 ···In−1 ) of T contains the element Ti1 i2 ···iN at the position with row  number in and column number ψ{dn } (i1 , . . . , iN )1 + N m=1,m=n   m−1 (im − 1) p=1,p=n Ip . Example 15.10. Let T be an mda with the format 3 × 4 × 2 having the following frontal slices: ⎛ ⎞ ⎛ ⎞ 1 4 7 10 13 16 19 22 ⎜ ⎟ ⎜ ⎟ T1 = ⎝2 5 8 11⎠ and T2 = ⎝14 17 20 23⎠. 3 6 9 12 15 18 21 24 For R = {1, 2} and C = {3}, the matrix TR;C ∈ R12×2 is given by T12;3 =

4  2 3   i1 =1 i2 =1 i3 =1



Ti1 i2 i3 e3i1 ⊗ e4i2 ⊗ e2i3 .

902

Linear Algebra Tools for Data Mining (Second Edition) 

Consider the term T231 e32 ⊗ e43 ⊗ e21 from the above sum. Since ⎛ ⎞ 0 0 ⎜ ⎟ ⎜0 0⎟ ⎜ ⎟ ⎜0 0⎟ ⎜ ⎟ ⎜0 0⎟ ⎜ ⎟ ⎜ ⎟ ⎛ ⎞ ⎜0 0⎟ ⎛ ⎞ 0 ⎜ ⎟ 0 ⎜0⎟ 1 ⎜0 0⎟  ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ e32 ⊗ e43 ⊗ e21 = ⎝1⎠ ⊗ ⎜ ⎟ ⊗ =⎜ ⎟, ⎜1 0⎟ ⎝1⎠ 0 ⎜ ⎟ 0 ⎜0 0⎟ 0 ⎜ ⎟ ⎜ ⎟ ⎜0 0⎟ ⎜ ⎟ ⎜0 0⎟ ⎜ ⎟ ⎜ ⎟ ⎝0 0⎠ 0 0 it follows that T231 occupies the position at the 7th row and the first column of the matrix TR,C . Example 15.11. We present an example of the matricization of an mda given in [93, 139]. Let T be an mda with the format 3 × 4 × 2 introduced in Example 15.10 having the following frontal slices: ⎛ ⎞ ⎛ ⎞ 1 4 7 10 13 16 19 22 ⎜ ⎟ ⎜ ⎟ T1 = ⎝2 5 8 11⎠ and T2 = ⎝14 17 20 23⎠ 3 6 9 12 15 18 21 24 The three unfoldings are ⎛ 1 4 7 ⎜ T(1) = ⎝2 5 8 3 6 9 ⎛ 1 2 ⎜4 5 ⎜ T(2) = ⎜ ⎝7 8 10 11

⎞ 10 13 16 19 22 ⎟ 11 14 17 20 23⎠, 12 15 18 21 24 ⎞ 3 13 14 15 6 16 17 18⎟ ⎟ ⎟, 9 19 20 21⎠ 12 22 23 24

Multidimensional Array and Tensors

 T(3) =

903

 1 2 3 4 · · · 9 10 11 12 . 13 14 15 16 · · · 21 22 23 24

The fibers of n-mode of the mda T can be retrieved as the columns of the unfolding T(n) . Definition 15.10. Let T and S be two mdas having the set of modes I ∪ J and I ∪ K, respectively, where |I| = M , |J| = N , and |K| = P and I, J, K are pairwise disjoint. We assume that I = {i1 , . . . , iM }. The contraction of T and S over I is the mda T  I S given by (T  I S)j1 ···jN k1 ···kP =

I1  i1 =1

···

Im 

Ti1 ···im j1 ···jn Si1 ···im k1 ···kp .

im =1

Definition 15.11. Let T be an (I1 × I2 × · · · × IN )-mda and let A ∈ RJn ×In be a matrix. The n-mode product of T and A is an mda denoted by T ×n A ∈ RI1 ×···×In−1 ×Jn ×In+1 ×IN given by (T ×n A)i1 ···in−1 jn in+1 ···iN =

In 

Ti1 i2 ···iN Ajn in .

in =1

The result of the n-mode product T ×n A is an mda of size I1 × I2 · · · × In−1 × Jn × In+1 · · · × IN -mda. Each of the mode-n fibers of T is multiplied by the matrix A. Example 15.12. Let A ∈ RI1 ×I2 be a matrix, which we regard as an mda. The product A ×1 B can be considered for a matrix B ∈ J × I1 and yields a matrix A ×1 B ∈ RJ×I2 where (A ×1 B)ji2 =

I1 

Ai1 i2 Bji1 =

i1 =1

I1  i1 =1

Ai1 i2 Bi1 j .

In terms of matrix products, this is A ×1 B = B  A. If C ∈ RJ×I2 , the product A ×2 C ∈ RI1 ×J is (A ×2 C)i1 j =

I2  i2 =1

We have A ×2 C = AC  .

Ai1 i2 Cj i2 =

I2  i2 =1

Ai1 i2 Ci2 j .

Linear Algebra Tools for Data Mining (Second Edition)

904

Theorem 15.5. {r1 , . . . , rL } and {1, . . . , N }. Also, A(n) ∈ RIn ×Jn for The equality

Let N = {1, . . . , N }, and let the sets R = C = {c1 , . . . , cM } define a partition of N = let Y be an (J1 × J2 × · · · × JN )-mda and let n ∈ N be N matrices.

X = Y ×1 A(1) ×2 A(2) · · · ×N A(N )

(15.8)

holds if and only if we have the following matrix equality: X(R×C:IN ) = (A(rL ) ⊗· · ·⊗A(r1 ) )Y(R×C:IN ) (A(cM ) ⊗· · ·⊗A(c1 ) ) . (15.9) Proof.

Equality (15.8) is equivalent to I1 

Xi1 i2 ···iN =

···

j1 =1

IN  jN =1

(1)

(N ) jN .

Yj1 ···jn Ai1 j1 · · · AiN

Observe that Xi1 i2 ···iN = (X(R,C:IN ) )ψR (i1 ,i2 ,··· ,iN ),ψC (i1 ,i2 ,··· ,iN ) , and Yi1 i2 ···iN = (Y(R,C:IN ) )ψR (i1 ,i2 ,··· ,iN ),ψC (i1 ,i2 ,··· ,iN ) . By the definition of matricization, Equality (15.9) implies X(R×C:IN ) =

I1 

···

i1 =1

=

I1 

iN =1

···

i1 =1



=

IN 

   (I ) (I )  Xi1 ···iN ⊗n∈R einn ⊗n∈C einn

IN  I1 

···

iN =1 j1 =1

IN  jN =1

(1)

(N ) jN

Yj1 ···jn Ai1 j1 · · · AiN

  (I ) (I )  ⊗n∈R einn ⊗n∈C einn

I1  i1 =1 (1)

···

IN  I1 

···

iN =1 j1 =1 (N ) jN

Ai1 j1 · · · AiN



IN  jN =1

(Y(R,C:IN ) )ψR (j1 ,j2 ,··· ,jN ),ψC (j1 ,j2,··· ,jN ) (I )

⊗n∈R einn



 (I ) 

⊗n∈C einn

Multidimensional Array and Tensors

=

I1  i1 =1 (1)

···

IN  I1  iN =1 j1 =1

···

IN  jN =1

905

(Y(R,C:IN ) )ψR (j1 ,j2 ,··· ,jN ),ψC (j1 ,j2,··· ,jN )

j=1 i=1 (N ) M cj L ri e e jN ψR (i1 ,...,iN ) ψC (j1 ,...,jn )

Ai1 j1 · · · AiN

= (A(rL ) ⊗ · · · ⊗ A(r1 ) )Y(R×C:IN ) (A(cM ) ⊗ · · · ⊗ A(c1 ) ) . Corollary 15.1. Let X be an I1 × · · · × IN -mda and let Ji × Ii -matrix for 1  i  N . We have



A(i)

be a

Y = X ×1 A(1) ×2 A(2) · · · ×N A(N ) if and only if we have the equality Y(n) = A(n) X(n) (A(N ) ⊗ · · · ⊗ A(n+1) ⊗ A(n−1) ⊗ · · · ⊗ A(1) ) between the unfoldings Y(n) and X(n) for 1  n  N . Let X be an (I1 ×I2 ×· · ·×IN )-mda , let A be a matrix, A ∈ RIn ×Jn , and let Y ∈ RI1 ×I2 ×···Jn ×···×In . We have Y = X ×n A if and only if Y(n) = AX(n) . Proof.

These statements follow from Theorem 15.5.



Theorem 15.6. Let T ∈ RI1 ×I2 ×···×IN and let A ∈ RJn ×In and B ∈ RJm ×Im be two matrices with n = m. We have (T ×n A) ×m B = (T ×m B) ×n A, and their common value is denoted by T ×n A ×m B. Furthermore, if C ∈ RJn ×In and D ∈ RKn ×Jn , then (T ×n C) ×n D = T ×n (DC). Proof. ×n .

(15.10)

Both equalities follow immediately from the definition of 

Next we introduce the n-mode product ×n of an mda with a vector. Definition 15.12. Let T be an (I1 × I2 × · · · × IN )-mda and let v ∈ RIn be a matrix. The n-mode product of T and v is an mda of order N − 1 of size I1 × · · · × In−1 × In+1 × · · · × IN denoted by T ×n v given by (T ×n v)i1 ···in−1 in+1 ···iN =

In  in =1

Ti1 i2 ···in vin .

906

Linear Algebra Tools for Data Mining (Second Edition)

Clearly, T ×n v computes the inner product of each n-fiber of T with v. Example 15.13. Let T be a tensor of size 4 × 3 × 2 having the 3rd dimension slices ⎛ ⎞ 9 4 6 ⎜5 9 1⎟ ⎜ ⎟ T1 = ⎜ ⎟ ⎝8 8 8⎠ 2 9 9 and



7 ⎜7 ⎜ T2 = ⎜ ⎝7 4 If

6 2 7 1

⎞ 3 1⎟ ⎟ ⎟. 1⎠ 8



⎞ 10 ⎜ ⎟ u = ⎝20⎠, 30

the product T ×2 u is the mda of size 4 × 2, ⎛ ⎞ 350 280 ⎜260 140⎟ ⎜ ⎟ T ×1 u = ⎜ ⎟. ⎝480 240⎠ 470 300 The actual implementation of this computation is further discussed in Example 15.25. Theorem 15.7. Let T and S be two mdas such that T ∈ RI1 ×···×In−1 ×J×In+1 ×···×IN S ∈ RI1 ×···×In−1 ×K×In+1 ×···×IN and let A be a matrix, A ∈ RJ×K . We have (T, S×n A) = (T ×n A , S).

Multidimensional Array and Tensors

Proof.

907

We have 

(T ×n A )i1 ···in−1 kin+1 ···IN =

J 

Ti1 ···in−1 jin+1 ···iN akj ,

j=1

(S ×n A)i1 ···in−1 jin+1 ···iN =

K 

Si1 ···in−1 kin+1 ···iN ajk ,

k=1

and (T, S ×n A) =

=

I1  i1 =1

in =1

I1 

IN 

···

i1 =1

=

IN 

···

I1 

Ti1 i2 ...iN (S ×n A)i1 i2 ···iN

Ti1 i2 ...iN

iN =1 IN 

···

i1 =1

in =1

In 

Si1 ···in ···iN ain i

i=1

Ti1 i2 ...in aiin Si1 ···in ···iN

= (T ×n A , S).

15.6



Inner Product and Norms

Example 15.14. Let T and S be two mdas in RI1 ×I2 ×···×IN . Their inner product is given by (T, S) =

I1  I2 

···

i1 =1 i2 =1

IN 

Ti1 i2 ···iN Si1 i2 ···iN ,

iN =1

which is actually their contraction T  M S over the entire set of modes M = {1, . . . , N }. Therefore, it is natural to define the norm of an mda T as 2

T = (T, T ) =

I1  i1 =1

···

IN  iN =1

Ti21 ···iN .

Linear Algebra Tools for Data Mining (Second Edition)

908

The definition of the norm also implies that for two mdas T, S ∈

RI1 ×···×IN , we have

T − S 2 = T 2 − 2(T, S) + S 2 . The norm of an mda can be expressed as the norm of a matrix, as the following statement shows. Theorem 15.8. Let X be an mda, X ∈ RI1 ×I2 ×···×IN , where the set of modes of X is N = {1, . . . , N }. We have X = XR,calc:IN F . Proof. The square of Frobenius norm of the matrix XR,calc:IN F is the sum of all squares of entries of this matrix. Since these entries form a rearrangement of the entries of X, the result follows immedi ately. 15.7

Evaluation of a Set of Bilinear Forms

Let X, Y be two F-linear spaces. Following   [23], we discuss the evaluation of several bilinear forms pi = j k hijk xj yk , where 1  i  m, x = (xj ) ∈ X, and y = (yk ) ∈ Y . Let Gi be the matrix Gi = (hijk ), where⎛1  ⎞ i  m that cors1 ⎜ . ⎟ ⎟ responds to the linear form pi . For s = ⎜ ⎝ .. ⎠, define G(s) = sm m i=1 si Gi . The matrix G(s) is the characteristic matrix of the problem; the matrices Gi are called the basis matrices. The scalar H(s, x, y) =

p  q m  

hijk si xj yk = x G(s)y

i=1 j=1 k=1

is the defining function of the problem. The number of linear forms m = dim s is the index of the problem. When we evaluate the bilinear forms p1 , . . . , pm at (x, y), we are computing the set of inner products x (Gi y). Following [23], we assume that F contains a subset K such that the computation of K-linear forms is cheap compared with the computations of products of such linear forms. Furthermore, we assume

Multidimensional Array and Tensors

909

that the additive and multiplicative identities in F hold in K, and that all elements hijk belong to K. The general algorithm for evaluating K-linear forms proposed in [23] has the following form (Algorithm 15.7.1): Algorithm 15.7.1: Brokett–Dobkin Algorithm Data: A set of elements hijk of K  Result: A set of products of the form pi = j,k hijk xi yj 1 Compute the K-linear forms ci , x and bi , y for 1  i  d; 2 Compute the products ri = ci , xbi , y for 1  i  d; d 3 Compute pi as pi = j=1 aij rj for 1  i  m, where aij ∈ K; return pi for 1  i  m; Example 15.15. Let x = x1 + ix2 and y = y1 + iy2 be two complex numbers. Their product is xy = x1 y1 − x2 y2 + i(x2 y1 + x1 y2 ), which entails the evaluation of the bilinear forms x1 y1 − x2 y2 and x2 y1 + x1 y2 . These forms correspond to the basis matrices     1 0 0 1 and G2 = G1 = 0 −1 1 0 because

   1 0 y1 x1 y1 − x2 y2 = (x1 x2 ) 0 −1 y2

and

   0 1 y1 . x2 y1 + x1 y2 = (x1 x2 ) 1 0 y2

The characteristic matrix is G(s) =



s1 s2 s2 −s1



and the defining function is

   s1 s2 y1 H(s, x, y) = (x1 x2 ) s2 −s1 y2 = x1 y 1 s 1 + x1 y 2 s 2 + x2 y 1 s 2 − x2 y 2 s 1 .

910

Linear Algebra Tools for Data Mining (Second Edition)

Example 15.16. The computation of the product of matrices  X=

x1 x2 x3 x4



 and Y =

y1 y2 y3 y4



corresponds to the defining function H(s, x, y) = x1 y1 s1 + x2 y3 s1 + x1 y2 s2 + x2 y4 s2 +x3 y1 s3 + x4 y3 s3 + x3 y2 s4 + x4 y4 s4 and the characteristic matrix ⎛

s1 ⎜0 ⎜ G(s) = ⎜ ⎝s3 0

s2 0 s4 0

0 s1 0 s3

⎞ 0 s2 ⎟ ⎟ ⎟. 0⎠ s4

Example 15.17. Let x(t) and y(t) be two polynomials of degree m − 1 and n − 1, respectively, x(t) =

m 

xj t

j−1

and y(t) =

j=1

n 

yk tk−1 .

k=1

The defining function is H(s, x, y) =

m  n 

sj+k−1 xj yk .

j=1 k=1

The characteristic matrix of this problem is the following Hankel matrix (introduced in Definition 3.22): ⎛

s1 s2 s3 ⎜ s2 s3 s4 ⎜ ⎜ s s4 s5 Gmn (s) = ⎜ ⎜ 3 ⎜ .. .. .. ⎝ . . . sm sm+1 sm+2

··· ··· ···

sn sn+1 sn+2 .. .

··· · · · sm+n−1

⎞ ⎟ ⎟ ⎟ ⎟. ⎟ ⎟ ⎠

Multidimensional Array and Tensors

15.8

911

Matrix Multiplications and Arrays

The role of bilinear forms is studied in [97] using a special matrix representation. Let U, V ∈ R2×2 be two matrices and let W = U V . Index the elements of each matrix by a single index as follows: 

u1 u2 u3 u4



v1 v2 v3 v4



 =

 w1 w2 . w3 w4

  Denote by xijk the coefficient of ui vj in wk = 2i=1 2j=1 xijk ui vj ; each xijk is either 0 or 1. The corresponding matrices (xijk ) for wk and 1  k  4 are ⎛

1 ⎜0 ⎜ w1 : ⎜ ⎝0 0 ⎛ 0 ⎜0 ⎜ w3 : ⎜ ⎝1 0

0 0 0 0 0 0 0 0

0 1 0 0 0 0 0 1

⎞ 0 0⎟ ⎟ ⎟w 0⎠ 2 0 ⎞ 0 0⎟ ⎟ ⎟w 0⎠ 4 0



0 ⎜0 ⎜ :⎜ ⎝0 0 ⎛ 0 ⎜0 ⎜ :⎜ ⎝0 0

1 0 0 0 0 0 1 0

0 0 0 0 0 0 0 0

⎞ 0 1⎟ ⎟ ⎟ 0⎠ 0 ⎞. 0 0⎟ ⎟ ⎟ 0⎠ 1

A more general situation involves computing a set of K bilinear forms using a three-way array of coefficients xijk as wk =

J I  

xijk ui vj ,

i=1 j=1

where 1  k  K. Example 15.18. Supposethat for a three-mode array X we have the decomposition xijk = R r=1 air bjr ckr for 1  i  I, 1  j  J, and 1  k  K. If wk =

I  J  i=1 j=1

xijk ui vj

912

Linear Algebra Tools for Data Mining (Second Edition)

for 1  k  K, then wk can be written as wk =

I  J 

xijk ui vj

i=1 j=1

=

J  R I  

air bjr ckr ui vj

i=1 j=1 r=1

=

 I R   r=1

=

R 

air ui

⎛ ⎝

i=1

J 

⎞ bjr vj ⎠ ckr

j=1

fr (air )gr (bjr ),

r=1

  where fr (air ) = Ii=1 air ui and gr (bjr ) = Jj=1 bjr vj for 1  r  R. This entails computing 2R linear combinations and multiplying these combinations pairwise to form R products. Then, we compute linear combinations of these products with the coefficients ckr : w1 = u1 v1 + u2 v3 w2 = u1 v2 + u2 v4 , w3 = u3 v1 + u4 v3 w4 = u3 v2 + u4 v4 . For the standard matrix multiplication (using Kruskal’s one-index notation), we have R = 8 bilinear forms, h1 , . . . , h8 : h1 = u1 v1 h2 = u2 v3 h3 = u1 v2 h4 = u2 v4 , h5 = u3 v1 h6 = u4 v3 h7 = u3 v2 h8 = u4 v4 , then w1 = h1 + h2 w2 = h3 + h4 , w3 = h5 + h6 w4 = h7 + h8 . In tabular form, this algorithm is shown in Figure 15.6. Example 15.19. Here we used the notation of Strassen (introduced in Exercise 120 of Chapter 3): I = f1 g1 , II = f2 g2 , III = f3 g3 , IV = f4 g4 , V = f5 g5 , V I = f6 g6 , V II = f7 g7 .

Multidimensional Array and Tensors

913

f1

f2

f3

f4

f5

f6

f7

f8

u1

1

0

1

0

0

0

0

1

u2

0

1

0

1

0

0

0

0

u3

0

0

0

0

1

0

1

0

u4

0

0

0

0

0

1

0

1

g1

g2

g3

g4

g5

g6

g7

g8

v1

1

0

0

0

1

0

0

0

v2

0

0

1

0

0

0

1

0

v3

0

1

0

0

0

1

0

0

v4

0

0

0

1

0

0

0

1

h1

h2

h3

h4

h5

h6

h7

h8

w1

1

1

0

0

0

0

0

0

w2

0

0

1

1

0

0

0

0

w3

0

0

0

0

1

1

0

0

w4

0

0

0

0

0

0

1

1

Fig. 15.6

Bilinear forms in matrix product computation.

Linear Algebra Tools for Data Mining (Second Edition)

914

15.9

MATLAB Computations

Following the practice of Computer Science literature, we will use the terms mda and tensor loosely and interchangeably. The Tensor Toolbox for MATLAB consists of a collection of tools for working with mdas. It is an open-source package constructed by researchers at Sandia National Laboratory [4, 5]. Every MATLAB array has at least two dimensions: a scalar is an object of size 1 × 1, a column vector is an array of size n × 1, etc. MATLAB drops trailing singleton dimensions. Thus, a 3 × 4 × 1 object has a reported size of 3 × 4. The MATLAB tensor class explicitly stores trailing singleton dimensions. Example 15.20. To create a tensor T starting from an array A, we write (Figure 15.7) A= rand(3,4,2) A(:,:,1) = 0.8147 0.9058 0.1270

0.9134 0.6324 0.0975

0.2785 0.5469 0.9575

0.9649 0.1576 0.9706

0.1419 0.4218 0.9157

0.7922 0.9595 0.6557

0.0357 0.8491 0.9340

A(:,:,2) = 0.9572 0.4854 0.8003

>> T=tensor(A) T is a tensor of size 3 x 4 x 2 T(:,:,1) = 0.8147 0.9134 0.2785 0.9058 0.6324 0.5469 0.1270 0.0975 0.9575 T(:,:,2) = 0.9572 0.1419 0.7922 0.4854 0.4218 0.9595 0.8003 0.9157 0.6557

0.9649 0.1576 0.9706 0.0357 0.8491 0.9340

Multidimensional Array and Tensors

915

f1

f2

f3

f4

f5

f6

f7

u1

1

0

1

0

1

−1

0

u2

0

0

0

0

1

0

1

u3

0

1

0

0

0

1

0

u4

1

1

0

1

0

0

−1

g1

g2

g3

g4

g5

g6

g7

v1

1

1

0

−1

0

1

0

v2

0

0

1

0

0

1

0

v3

0

0

0

1

0

0

1

v4

1

0

−1

0

1

0

0

I

II

III

IV

V

VI

VII

w1

1

0

0

1

−1

0

1

w2

0

1

0

1

0

0

0

w3

0

0

1

0

1

0

0

w4

1

−1

1

0

0

0

1

Fig. 15.7

Bilinear forms in Strassen’s matrix product computation

Linear Algebra Tools for Data Mining (Second Edition)

916

The tensor class explicitly tracks singleton dimensions. Example 15.21. Creating an mda of size 4 × 3 × 1 with A=rand(4,3,1) ignores trailing singleton dimensions resulting in A = 0.6787 0.7577 0.7431 0.3922

0.6555 0.1712 0.7060 0.0318

0.2769 0.0462 0.0971 0.8235

and >> ndims(A) ans = 2 >> size(A) ans = 4 3

In contrast, using the tensor constructor T = tensor(A,[4,3,1]) results in T is a tensor of size 4 x 3 x 1 T(:,:,1) = 0.6787 0.6555 0.2769 0.7577 0.1712 0.0462 0.7431 0.7060 0.0971 0.3922 0.0318 0.8235

and >> ndims(T) ans = 3 >> size(T) ans = 4 3

1

A vector can be stored as a tensor by writing T = tensor(rand(4,1),[4]).

Multidimensional Array and Tensors

917

Accessors and assignments for tensors work in the same way as for mdas. Example 15.22. A 2 × 2 × 2 tensor is created with T = tensor(rand(2,2,2))

To reassign a 2 × 2 identity matrix, we write A(:,1,:) = eye(2)

Three types of mda multiplications are discussed: multiplication of an mda with a matrix (denoted by ttm), with a vector (denoted by ttv), and with another mda (denoted by ttt). The first two variants can be regarded as special cases of the third. When mda is multiplied with a matrix, it is necessary to specify the mode of the mda involved in the multiplication. In [101], the n-mode multiplication of an mda by a matrix is discussed. This type of multiplication is a generalization of the matrix product. Let A = U BV  be a matrix, A ∈ Rj1 ×j2 , B ∈ Ri1 ×i2 , U ∈ Rj1 ×i1 , and V ∈ Rj2 ×i2 , we adopt the notation proposed in [101] and denote this product by A = B ×1 U ×2 V which suggests that the matrix U creates linear combinations of the rows of B (the mode 1 of B), and the matrix V creates linear combinations of the columns of B (the mode 2 of B). More generally, if T ∈ RJ1 ×J2 ×···Jn ×···×JN and A ∈ RIn ×Jn , the mda T ×n A ∈ RJ1 ×···×Jn−1 ×In ×Jn+1 ×···JN is defined as (T ×n A)j1 ···jn−1 ijn+1 ···jN =

Jn  jn =1

Tj1 ···jn ···jN aijn ,

Linear Algebra Tools for Data Mining (Second Edition)

918

and is obtained by having A act on the nth fibers of T . The MATLAB implementation of the n-mode multiplication is discussed in Example 15.23. The mda-matrix n-mode product of the mda T and the matrix A is specified by ttm(T, A, n). Example 15.23. We create a tensor T starting from a 3 × 4 × 2 mda of integers named A. To this end, we define a 3 × 4-matrix A using > A = [1 4 7 10;2 5 8 11;3 6 9 12] A = 1 2 3

4 5 6

7 8 9

10 11 12

Then, the second page of this array is added using A(:,:,2) = [13 16 19 22;14 17 20 23; 15 18 21 24]

which results in A(:,:,1) = 1 2 3

4 5 6

7 8 9

10 11 12

16 17 18

19 20 21

22 23 24

A(:,:,2) = 13 14 15

The tensor T is obtained by writing >> T = tensor(A) T is a tensor of size 3 x 4 x 2 T(:,:,1) = 1 4 7 10 2 5 8 11 3 6 9 12 T(:,:,2) = 13 16 19 22

Multidimensional Array and Tensors

14 15

17 18

20 21

919

23 24

Let B be the 2 × 3 matrix defined by >> B=[1 2 3;4 5 6] B = 1 4

2 5

3 6

To compute the tensor–matrix product P = ttm(T, B, 1) between the 3×4×2 tensor T and the matrix B ∈ R2×3 resulting in, a 2×4×2 tensor, we execute the following code: >> P=ttm(T,B,1) P is a tensor of size 2 x 4 x 2 P(:,:,1) = 14 32 50 68 32 77 122 167 P(:,:,2) = 86 104 122 140 212 257 302 347

corresponding to the following equalities: 14 = 1 · 1 + 2 · 2 + 3 · 3 .. . 68 = 1 · 10 + 2 · 11 + 3 · 12 32 = 4 · 1 + 5 · 2 + 6 · 3 .. . 167 = 4 · 10 + 5 · 11 + 6 · 12 and 86 = 1 · 13 + 2 · 14 + 3 · 15 .. . 140 = 1 · 22 + 2 · 23 + 3 · 24

920

Linear Algebra Tools for Data Mining (Second Edition)

212 = 4 · 13 + 5 · 14 + 6 · 15 .. . 347 = 4 · 22 + 5 · 23 + 6 · 24. If C is the 2 × 4 matrix, C=

  3 2 3 1 , 0 3 0 4

the product of T by C along the second dimension of T is obtained as >> Q=ttm(T,C,2) Q is a tensor of size 3 x 2 x 2 Q(:,:,1) = 42 52 51 59 60 66 Q(:,:,2) = 150 136 159 143 168 150

Finally, if we multiply T along its third dimension by the matrix D ∈ R2×2 given by  D=

 7 3 , 5 2

the resulting mda R = ttm(T, D, 3) is >> R=ttm(T,D,3) R is a tensor of size 3 x 4 x 2 R(:,:,1) = 46 76 106 136 56 86 116 146 66 96 126 156 R(:,:,2) = 31 52 73 94 38 59 80 101 45 66 87 108

Multidimensional Array and Tensors

921

4

2

C

2

2

2

D

2 T B

3

3

4

Fig. 15.8

Possible mda-matrix products.

An image of various possible mda-matrix products is shown in Figure 15.8. This image suggests an alternative notation for matrixmda products proposed by Kruskal in [97], where the products ttm(T,B,1), ttm(T,C,2), ttm(T,D,3) C

are denoted as BT , T , and T D, respectively. An alternate way of computing n-mode products using cell arrays is shown next. Example 15.24. Consider the mda T ∈ R4×3×2 produced by >> T = tensor(randi(9,4,3,2)) T is a tensor of size 4 x 3 x 2 T(:,:,1) = 9 4 6 5 9 1 8 8 8 2 9 9 T(:,:,2) = 7 6 3 7 2 1 7 7 1 4 1 8

Linear Algebra Tools for Data Mining (Second Edition)

922

and the matrices A and B generated by >> A = randi(9,2,4) A = 7 3

9 1

4 4

7 8

and >> B=randi(9,3,2) B = 2 5 5

6 7 7

Matrices A and B are stored in the cell array W defined next: >> W{1}=A W = 1 x 1 cell array {2 x 4 double} >> W{2}=B W = 1 x 2 cell array {2 x 4 double}

{3 x 2 double}

Finally, using the cell array W the product T ×3 B ×1 A is computed as >> S=ttm(T,W,[1,3]) S is a tensor of size 2 x 3 x 3 S(:,:,1) = 1316 978 688 586 S(:,:,2) = 1946 1685 1016 1017 S(:,:,3) = 1946 1685 1016 1017

832 714 1360 1161 1360 1161

Multidimensional Array and Tensors

923

If T is an I1 × I2 × · · · × IN tensor, A ∈ RJm ×Im , and B ∈ RJn ×In , then (T ×m A) ×n B = (T ×n B) ×m A. Let T be an I1 × I2 × · · · × IN tensor and let (U (1) , . . . , U (N ) ) be a sequence of matrices, where U (i) has the format Ji × Ii . Then T ×1 U (1) ×2 U (2) ×N U (N ) is of size J1 × J2 × · · · × JN . Let T be an mda of format I1 ×I2 ×· · · In · · · IN and let v be a vector of size In . The contracted n-mode product of T and v introduced in [4] builds an mda T ×n v of format I1 × · · · × In−1 × In+1 × · · · In given by (T ×n v)(i1 , . . . , in−1 , in+1 , . . . , iN ) =

In 

T (i1 , . . . , iN )v(in ).

in =1

Example 15.25. Let T be the mda introduced in Example 15.13 and let ⎛ ⎞ 10 ⎜ ⎟ u = ⎝20⎠. 30 T ∈ R4×3×2 can be multiplied by the vectors u along its second dimension by writing >> ttv(T,u,2) ans is a tensor of size 4 x 2 ans(:,:) = 350 280 260 140 480 240 470 300

Yet another type of mda multiplication involves two tensors. This product comes in three flavors: the outer product, the inner product, and the contracted product. Definition 15.13. Let T be an mda of size I1 × · · · × IM and let S be an mda of size J1 × · · · × JN . The outer product T oS is of size

924

Linear Algebra Tools for Data Mining (Second Edition)

I1 × · · · × IM × J1 × · · · × JN and is given by (T oS)(i1 , . . . , iM , j1 . . . , jN ) = T (i1 , . . . , iM )S(j1 , . . . , jN ). The MATLAB command ttt is an abbreviation of “tensor times tensor” and is given by Z = ttt(T,S). Example 15.26. Let T and S be the mdas produced by >> T = tensor(randi(9,2,2,2)) >> s = tensor(randi(5,2,1,3))

and given by T is a tensor of size 2 x 2 x 2 T(:,:,1) = 9 1 6 3 T(:,:,2) = 5 9 9 2 >> S=tensor(randi(5,2,1,3)) S is a tensor of size 2 x 1 x 3 S(:,:,1) = 5 4 S(:,:,2) = 5 4 S(:,:,3) = 1 5

Their outer product obtained with Z = ttt(T,S) is Z is a tensor of size 2 x 2 x 2 x 2 x 1 x 3 Z(:,:,1,1,1,1) = 45 5 30 15 Z(:,:,2,1,1,1) = 25 45 45 10 Z(:,:,1,2,1,1) = 36 4 24 12

Multidimensional Array and Tensors

Z(:,:,2,2,1,1) 20 36 36 8 Z(:,:,1,1,1,2) 45 5 30 15 Z(:,:,2,1,1,2) 25 45 45 10 Z(:,:,1,2,1,2) 36 4 24 12 Z(:,:,2,2,1,2) 20 36 36 8 Z(:,:,1,1,1,3) 9 1 6 3 Z(:,:,2,1,1,3) 5 9 9 2 Z(:,:,1,2,1,3) 45 5 30 15 Z(:,:,2,2,1,3) 25 45 45 10

=

=

=

=

=

=

=

=

=

Example 15.27. Let T and S be the mdas T = tensor(randi(9,2,2,2)) S = tensor(randi[10 20],2,2,2) >> T= tensor(randi(9,2,2,2)) T is a tensor of size 2 x 2 x 2 T(:,:,1) = 8 5 2 5 T(:,:,2) = 6 7 7 3 >> S= tensor(randi([10 20],2,2,2))

925

Linear Algebra Tools for Data Mining (Second Edition)

926

S is a tensor of size 2 x 2 x 2 S(:,:,1) = 14 11 17 17 S(:,:,2) = 10 10 13 11

The inner product of T and S computed with ttt(T,S), or with [1,2] is ans = 540

The contracted product of two mdas is a generalization of the mda vector product. Definition 15.14. Let T be an mda of size I1 × · · · × IM × J1 × · · · JN and let S be an mda of size I1 × · · · × IM × K1 × · · · × KP . The contracted product along the first M modes is the mda of size J1 × · · · × JN × K1 × KP given by (T, S)1,...,M ;1...,M (j1 , . . . , jn , k1 , . . . , kp ) =

I1  i1 =1

···

IM 

T (i1 , . . . , iM , j1 , . . . , jN )S(i1 , . . . , iM , k1 , . . . , kp ).

iM =1

In MATLAB , the command for mda contracted product is U = ttt(T,S,[1:M],[1:M])

The contracted product may involve distinct lists of dimensions of the two mdas of equal lengths provided that the sizes of the involved dimensions are identical. The next example illustrates this point. Example 15.28. Let T and S be the mdas defined as >> T=tensor(randi(4,3,4,2)) T is a tensor of size 3 x 4 x 2 T(:,:,1) = 2 2 3 3 1 3 2 3 4 1 3 2

Multidimensional Array and Tensors

927

T(:,:,2) = 1 1 4 1 1 4 1 4 4 3 2 1 >> S=tensor(randi(5,4,3,2)) S is a tensor of size 4 x 3 x 2 S(:,:,1) = 4 2 5 5 2 1 5 5 2 1 3 1 S(:,:,2) = 1 1 3 5 5 3 3 4 1 3 2 2

The lists of dimensions of T and S are [13] and [23], and the corresponding sizes are 3 and 2. Therefore, the contracted product ttt(T,S,[1 3],[2 3]) returns >> ttt(T,S,[1 3],[2 3]) ans is a tensor of size 4 x 4 ans(:,:) = 44 38 34 22 33 51 49 29 42 53 49 30 36 51 54 27

This means that the dimensions involved in the contracted product are the first and the third for the mda T and the second and the third for the mda S. Note that the size of the first dimension of T is 3, the same as the size of the second dimension of S; also, the size of the third dimension of T is 2, which is also the size of the third dimension of S. The MATLAB function reshape given by B = reshape(A,sz)

reformats the array A using the size vector sz to define size(B). The vector sz must contain at least 2 elements, and the product of the components of sz must equal numel(A), that gives the number of elements in the array A.

Linear Algebra Tools for Data Mining (Second Edition)

928

Example 15.29. For the array A=randi([1,10],[2 3 2]) given by A(:,:,1) = 9 10

2 10

7 1

A(:,:,2) = 3 6

10 10

2 10

the effect of the statement B=reshape(A,[2 6]) is to produce the array B = 9 10

2 10

7 1

3 6

10 10

2 10

The alternative format B = reshape(A,sz1,...,szN) reformats A into an array having sz1,...,szN as sizes of each dimension. The vector A = 1 : 10 can be reshaped into a 5 × 2 matrix by B = reshape(A,[5 2])

which yields A = 1

2

3

4

>> B = reshape(A,[5 2]) B = 1 2 3 4 5

6 7 8 9 10

5

6

7

8

9

10

Multidimensional Array and Tensors

Finally, to reshape an array of integers A given by A = randi([1 12],[3 2 4])

into a 6 × 4 matrix M , we write M = reshape(A,6,4)

and obtain >> A = randi([1 12],[3,2,4]) A(:,:,1) = 12 6 10

2 6 11

A(:,:,2) = 10 12 8

1 11 12

A(:,:,3) = 9 10 9

5 8 3

A(:,:,4) = 9 1 4

1 2 10

12 6

10 12

and M = 9 10

9 1

929

Linear Algebra Tools for Data Mining (Second Edition)

930

10 2 6 11

8 1 11 12

9 5 8 3

4 1 2 10

The function permute applied as B = permute(A,dimorder) rearranges the dimensions of an array in the order specified by the vector dimorder. For example, permute(A,[2 1]) switches the row and column dimensions of a matrix A. In general, the ith dimension of the output array is the dimension dimorder(i) from the input array. Example 15.30. To create a 3 × 4 × 2 array and permute it so that the first and third dimensions are switched, resulting in a 2 × 4 × 3 array, we could use the following MATLAB code: A = rand(3,4,2) A(:,:,1) = 0.8147 0.9058 0.1270

0.9134 0.6324 0.0975

0.2785 0.5469 0.9575

0.9649 0.1576 0.9706

0.1419 0.4218 0.9157

0.7922 0.9595 0.6557

0.0357 0.8491 0.9340

0.9134 0.1419

0.2785 0.7922

0.9649 0.0357

0.6324 0.4218

0.5469 0.9595

0.1576 0.8491

A(:,:,2) = 0.9572 0.4854 0.8003

B = permute(A,[3 2 1])

B(:,:,1) = 0.8147 0.9572

B(:,:,2) = 0.9058 0.4854

Multidimensional Array and Tensors

931

B(:,:,3) = 0.1270 0.8003

0.0975 0.9157

0.9575 0.6557

0.9706 0.9340

Example 15.31. We start by creating an mda of random integers using X=randi(10,4,6,4,2): X(:,:,1,1) = 1 3 9 1 10

8 5 6 3 5

10 6 6 3 5

7 7 4 4 10

1 9 10 8 1

3 4 7 2 8

1 8 6 5 10

7 7 9 9 6

2 3 9 1 5

2 10 8 6 5

7 6 10 7 9

5 5 9 1 2

2 4 9 9 1

4 6 5 7 7

10 10 1 8 3

5 6 10 5 10

4 8 7 6 7

7 2 2 10 2

X(:,:,2,1) = 2 7 5 8 8

10 9 4 7 2

X(:,:,3,1) = 1 7 1 1 6

1 9 9 8 2

X(:,:,4,1) = 3 5 1 10 2

2 4 2 5 4

Linear Algebra Tools for Data Mining (Second Edition)

932

X(:,:,1,2) = 1 6 9 7 2

4 5 10 2 9

7 4 2 5 5

2 6 3 4 6

3 3 7 3 9

10 8 4 6 2

5 1 6 5 7

7 7 1 1 4

6 7 5 9 8

10 6 4 2 7

6 10 7 10 3

7 3 7 7 1

3 3 7 9 4

8 7 1 7 4

2 8 5 2 4

7 2 8 3 10

3 8 2 3 1

6 7 6 5 7

X(:,:,2,2) = 10 9 9 3 6

1 5 4 2 2

X(:,:,3,2) = 8 5 1 3 2

3 5 6 5 9

X(:,:,4,2) = 10 1 5 5 5

8 4 8 5 1

Next, we create two arrays R =[2 3] and C = [4 1] that define a partitioning of the set of modes of the array X. The array I = size(X) defines the sizes of the modes of X; J = prod(I(R)) and K = prod(I(C)) define the dimensions of the target matrix Y:

Multidimensional Array and Tensors

933

>> J = prod(I(R)) J = 24 >> K=prod(I(C)) K = 10

The resulting matrix Y will have 24 rows and 10 columns and can be obtained by Y = reshape(permute(X,[R C]),J,K)

as follows: Y = 1 8 10 7 1 3 2 10 1 7 2 2 1 1 7 5 2 4 3 2 10 5 4 7

1 4 7 2 3 10 10 1 5 7 6 10 8 3 6 7 3 8 10 8 2 7 3 6

3 5 6 7 9 4 7 9 8 7 3 10 7 9 6 5 4 6 5 4 10 6 8 2

6 5 4 6 3 8 9 5 1 7 7 6 5 5 10 3 3 7 1 4 8 2 8 7

9 6 6 4 10 7 5 4 6 9 9 8 1 9 10 9 9 5 1 2 1 10 7 2

9 10 2 3 7 4 9 4 6 1 5 4 1 6 7 7 7 1 5 8 5 8 2 6

1 3 3 4 8 2 8 7 5 9 1 6 1 8 7 1 9 7 10 5 8 5 6 10

7 2 5 4 3 6 3 2 5 1 9 2 3 5 10 7 9 7 5 5 2 3 3 5

10 5 5 10 1 8 8 2 10 6 5 5 6 2 9 2 1 7 2 4 3 10 7 2

2 9 5 6 9 2 6 2 7 4 8 7 2 9 3 1 4 4 5 1 4 10 1 7

934

Linear Algebra Tools for Data Mining (Second Edition)

To convert back the matrix Y to an mda, we could use the function ipermute(B,dimorder) that rearranges the dimensions of an array B in the order specified by the vector dimorder as in Z = ipermute(reshape(Y,[I(R) I(C)]),[R C]).

Vectorization of an mda is a special case of matricization that converts an mda to a vector. Let V1 , . . . , Vp be p F-linear spaces, where dim(Vi ) = ni for 1  i  p. If π ∈ PERMp , the following linear spaces: V1 ⊗ V2 ⊗ · · · ⊗ Vp Fn1 ⊗ Fn2 ⊗ · · · Fnp .. . Fπ(n1 ) ⊗ Fπ(n2 ) ⊗ · · · Fπ(np )

.. . Fn1 n2 ···np

are isomorphic. The isomorphisms between these linear spaces make it possible to interpret a p-dimensional array as an mda of order , where  p. Let {ej1 , . . . , ejnj } be a basis in Fj for 1  j  p. For m ∈ N, define [m] = {1, . . . , m}. A bijective map μ : [n1 ] × · · · × [np ] −→ [n1 · · · np ] generates a linear space isomorphism between Fn1 ⊗ · · · ⊗ Fnp and p Fn1 ···np , which maps e1j1 ⊗ · · · ⊗ ejp into eμ(j1 ,...,jp ) . The isomorphism V1 ⊗ V2 ⊗ · · · ⊗ Vp ∼ = Fn1 ⊗ Fn2 ⊗ · · · Fnp allows the interpretation of an mda as a vector in Rn1 n2 ...np . This interpretation is known as a vectorization, and it is a generalization of matrix vectorization that was introduced in Definition 3.16. Namely, the vectorization vec(T ) of the mda T is ⎞ ⎛ t11···1 ⎟ ⎜ .. ⎟ ⎜ . ⎟ ⎜ ⎟ ⎜ ⎜ tn1 1···1 ⎟ ⎟. ⎜ vec(T ) = ⎜ ⎟ ⎜ t12···1 ⎟ ⎟ ⎜ .. ⎟ ⎜ . ⎠ ⎝ tn1 n2 ···np .

Multidimensional Array and Tensors

935

Example 15.32. The vectorization mapping introduced in Definition 3.16 is implemented in MATLAB using the function reshape. If A ∈ R3×2 is the matrix defined as >> A = [1 2;3 4;5 6] A = 1 2 3 4 5 6

then v = reshape(A,6,1) will return v = 1 3 5 2 4 6

which is vec(A). An alternative way to compute A is v = A(:). To reverse this reshaping, we can write A = reshape(v,3,2). Example 15.33. Let T ∈ R3×2×3 be the mda defined as a111 a213 a223 a133

= a112 = a311 = a321 = a312

= a211 = a313 = a323 = a123

= −a212 = 1, = a121 = a122 = a221 = −a222 = 2, = 4, = a322 = 0.

The matrix unfolding T(1) ∈ R3×6 is I 1 ⎝1 2 ⎛

II III IV 1 0 2 −1 2 2 0 2 4

V VI ⎞ 2 0 −2 4 ⎠. 0 4

The columns of T(1) , numbered with roman numerals from I to VI, are shown in Figure 15.9.

Linear Algebra Tools for Data Mining (Second Edition)

936

I

IV

111

121 2

1

II

V 112

122 2

1 III

VI

113

211

123

221

0

1

0

2

212

222 311

-1

-2

2

4 223

213 2

312

4

322 0

0

313

323

2

4

Fig. 15.9

15.10

321

The columns of the unfolding T(1) .

Hyperdeterminants

Hyperdeterminants extend the notion of determinant defined for matrices and offer information about solutions of multilinear systems. Example 15.34. to slice this array  t000 t010

Let T be an mda of format 3 × 2 × 2. It is possible as three 2 × 2 matrices:      t001 t100 t101 t200 t201 , , and t011 t110 t111 t210 t211

Multidimensional Array and Tensors

937

or as two 3 × 2 matrices: ⎛ ⎞ ⎛ ⎞ t000 t001 t010 t011 ⎜ ⎟ ⎜ ⎟ ⎝t100 t101 ⎠ and ⎝t110 t111 ⎠. t200 t201 t210 t211 The homogeneous multilinear system T (x ⊗ y) = 0, where x, y ∈ R2 , is t000 x0 y0 + t001 x0 y1 + t010 x1 y0 + t011 x1 y1 = 0 t100 x0 y0 + t101 x0 y1 + t110 x1 y0 + t111 x1 y1 = 0 t200 x0 y0 + t201 x0 y1 + t210 x1 y0 + t211 x1 y1 = 0. Notethat  this system has a solution x0 = x1 = 0 for any value of y0 y= . y1 Example 15.35. Consider the multilinear system x0 y0 = z0 , x0 y1 = z1 , x1 y0 = z2 , x1 y1 = z3 . Observe that if the system has a solution, then z0 z3 − z1 z2 = x0 y0 x1 y1 − x0 y1 x1 y0 = 0. The system has a non-trivial solution if and only if z0 z3 −z1 z2 = 0 and (z0 , z1 , x2 , z3 ) = 04 . Indeed, suppose that at least one number, say z0 , is distinct from 0 and z0 z3 − z1 z2 = 0. This implies x0 = 0 and y0 = 0, which, in turn, imply y1 = xz10 and x1 = yz20 . Therefore, x1 y1 = xz10zy20 = z1z0z2 = z3 , hence the system has a non-trivial solution. The converse implication is immediate. Theorem 15.9. Let T be a 2 × 2 × 2 mda that defines a system in the variables x0 , x1 , y0 , y1 as follows: t000 x0 y0 + t001 x0 y1 + t010 x1 y0 + t011 x1 y1 = 0, t100 x0 y0 + t101 x0 y1 + t110 x1 y0 + t111 x1 y1 = 0.

Linear Algebra Tools for Data Mining (Second Edition)

938

This system has non-trivial solutions up to a multiplicative constant. Also, the system has a unique solution up to a multiplicative constant if and only if the    t000 x0 + t010 x1 t001 x0 + t011 x1    t x + t X t x + t x  = 0. 100 0 110 1 101 0 111 1 Proof.

In matrix form, the system can be written as      y0 0 t000 x0 + t010 x1 t001 x0 + t011 x1 = . t100 x0 + t110 X1 t101 x0 + t111 x1 y1 0

Therefore, non-trivial solutions exist for values of x0 , x1 such that    t000 x0 + t010 x1 t001 x0 + t011 x1    (15.11) t x + t X t x + t x  = 0. 100 0 110 1 101 0 111 1



The determinant in the left member of Equality (15.11) equals         t000 t011  t010 t001  2 t000 t001      +x0 x1  + x0  t100 t101  t100 t111  t110 t101      2 t010 t011  , +x1  t110 t111  and is a homogeneous polynomial. Its discriminant,        t000 t001  t010 t011  t000 t011  t010 t001  2        t t  + t t  − 4 t t  t t  100 111 110 101 100 101 110 111 is referred to as the hyperdeterminant of T . Let T be an mda of format 3 × 2 × 2 and let AT be the matrix defined as ⎛ ⎞ t000 t001 t010 t011 ⎜ ⎟ (15.12) AT = ⎝t100 t101 t110 t111 ⎠. t200 t201 t210 t211 Denote by T00 , T01 , T10 , and T11 the matrices obtained from AT by eliminating the first column (00), the second column (01), the third column (10), and the fourth column (11).

Multidimensional Array and Tensors

939

If rank(AT ) < 3, then one of the equations of the system is a linear combination of the other two and we obtain a 2 × 2 × 2 system. For example, if the third row of AT is a linear combination of the first two rows, the system t000 x0 y0 + t001 x0 y1 + t010 x1 y0 + t011 x1 y1 = 0, t100 x0 y0 + t101 x0 y1 + t110 x1 y0 + t111 x1 y1 = 0, can be written as (t000 y0 + t001 y1 )x0 + (t010 y0 + t011 y1 )x1 = 0, (t100 y0 + t101 y1 )x0 + (t110 y0 + t111 y1 )x1 = 0. This system has a non-trivial solution in x0 , x1 if   t000 y0 + t001 y1 t010 y0 + t011 y1    t y + t y t y + t y  = 0. 100 0

101 1 110 0

111 1

Theorem 15.10. Suppose that rank(AT ) = 3, where AT is the matrix defined in Equality (15.12). The multilinear system t000 x0 y0 + t001 x0 y1 + t010 x1 y0 + t011 x1 y1 = 0, t100 x0 y0 + t101 x0 y1 + t110 x1 y0 + t111 x1 y1 = 0, t200 x0 y0 + t201 x0 y1 + t210 x1 y0 + t211 x1 y1 = 0. has a non-trivial solution if and only if det(T01 ) det(T10 ) − det(T00 ) det(T11 ) = 0. Proof.

Let z0 = x0 y0 , z1 = x0 y1 , z2 = x1 y0 , and z3 = x1 y1 ,

and consider the new multilinear system t000 z0 + t001 z1 + t010 z2 + t011 z3 = 0, t100 z0 + t101 z1 + t110 z2 + t111 z3 = 0, t200 z0 + t201 z1 + t210 z2 + t211 z3 = 0, x0 y0 = z0 , x0 y1 = z1 , x1 y0 = z2 , x1 y1 = z3 .

940

Linear Algebra Tools for Data Mining (Second Edition)

By Supplement 47 of Chapter 5, if the linear system that consists of the first three equations has a nontrivial solution (z1 , z2 , z3 , z4 ), then z1 z2 z3 z0 = = = , det(T00 ) − det(T01 ) det(T10 ) − det(T11 ) where the matrices Tk are obtained from AT by eliminating the kth column, where 1  k  4. Furthermore, Example 15.35 shows that the full multilinear system having a non-trivial solution implies that  det(T01 ) det(T10 ) − det(T00 ) det(T11 ) = 0. The notion of hyperdeterminant has been extended to mdas having boundary format, that is, to mdas having the format (k0 + 1) ×  (k1 + 1) × · · · × (kd + 1), where k0 = di=1 ki . In this special situation, it is shown in [63] that the hyperdeterminant can be expressed as the determinant of a block matrix. Note that an mda of format 3 × 2 × 2 has this boundary format. 15.11

Eigenvalues and Singular Values

Let T be an mth order, n-dimensional mda. The mth degree homogeneous polynomial fT defined by T is given by  Ti1 ···im xi1 · · · xim . fT (x) = i1 ···im

If xm is the mth order n-dimensional mda with entries xi1 · · · xim , then T xm = fT (x). The mda T is positive definite if fT is positive definite. For a vector x ∈ Rn and m ∈ N, x = (x1 , . . . , xn ), denote by x[ m] the vector in Rn whose components are (x[m])i = xm i for 1  i  n. Starting from the k-mode product of an mda T = (Ti1 ···im ) and a matrix P ∈ Rp×n , we consider the mda P m (T ) given by m

(P (T ))j1 ···jm =

n  i1 ···im =1

Ti1 ···im pj1 i1 · · · pjm im .

Multidimensional Array and Tensors

941

If P is a row vector, x = (x1 , . . . , xn ), the following notations introduced in [138] which involve homogeneous polynomials are used: T xm−2 = T ×3 x ×4 · · · ×m x =

n 

Tiji3 ···im xi3 · · · xim

i3 ,...,im =1

T xm−1 = T ×2 x ×3 · · · ×m x n  = Tii2 ···im xi2 · · · xim i2 ,...,im =1

Tx

m

= T × 1 x × 2 · · · × m x =

n 

Tii1 ···im xi3 · · · xim .

i1 ,...,im =1

The right-hand side of the last equality is the full contraction of T and x. In general, one could define T xm−k as T xm−k = T ×k+1 x ×k+2 · · · ×m x as having the components the homogeneous polynomials n 

Tj1 ...jk ik+1 ···im xik+1 · · · xim .

ik+1 ···im =1

Definition 15.15. Let T ∈ Mm,n . A complex number λ is an eigenvalue of T if there exists x ∈ Cn − {0n } such that the following equalities involving homogeneous polynomials are satisfied: (T xm−1 )i = λxm−1 i for 1  i  n.

(15.13)

942

Linear Algebra Tools for Data Mining (Second Edition)

The vector x is an eigenvector of T associated with λ and (λ, x) is an eigenpair of T . for 1  If x[m−1] is a vector in Cn defined by (x[m−1] )i = xm−1 i i  n, then Equality (15.13) can be written as T xm−1 = λx[m−1] . The spectrum of T is the set of eigenvalues of T . This set is denoted as spec(T ); the spectral radius of T is the number ρ(T ) = max{|λ| | λ ∈ spec(T )}. The term H-eigenvalue was introduced in [136]. Definition 15.16. An eigenvalue λ of an mda T is an H-eigenvalue if λ is real and there is a real eigenvector x for λ. In this case, we refer to x as an H-eigenvector. If λ ∈ R, x ∈ Rn , and Axm−1 = λx and x x = 1, the (λ, x) is said to be a Z-pair (cf. [136, 138]). Theorem 15.11. Let T ∈ Tm,n be an mda, where m is an even number. Then T has H-eigenvalues. Proof. Consider the minimization of the continuous function T xm n subjected to the restriction  i=1 xm i = 1. Since m is a positive even n m = 1} is compact and the minix integer, the set {x ∈ Rn | i=1 i mizer x∗ exists. Using the Lagrangian L(x, λ) = T xm −λ( x m −1), the following Karush–Kuhn–Tucker optimality conditions: ∂L(x, λ) ∂L(x, λ) = 0 and =0 ∂x ∂λ amount to mT xm−1 − λmx[m−1] = 0 and 1 − x m m = 0, which imply T xm−1 = λx[m−1] , which is the equality that defines eigenvalues,  where x = x∗ . The notions of positive definiteness and semidefiniteness for matrices can be extended to mdas in Mm,n . Definition 15.17. An mda T ∈ Mm,n is positive semidefinite if T xm  0 for all x ∈ Rn . If T xm > 0 for x ∈ Rn − {0n }, then T is positive definite.

Multidimensional Array and Tensors

943

Let λH min (T ) be the smallest eigenvalue of T , where T ∈ Tm,n and m is an even number. Then T is positive definite (positive semidefinite) if and only if λH min (T ) > 0 (λH min (T )  0, respectively). By the definition of x∗ given in Theorem 15.11, we have   x T  T (x∗ )m .  1/m ( ni=1 xm ) i m m > 0 when x ∈ Rn − {0n }, it follows that Since i=1 xi λH min (T ) > 0. A generalization of Gershgorin’s Theorem (Theorem 7.21) for mdas follows. Theorem 15.12. Let T ∈ Mm,n be an mda and let ri be the sum of the absolute values of off-diagonal entries of T . The eigenvalues of T are contained in the union of n disks having the diagonal components of T as their centers and the numbers ri as their radii. Proof.

Suppose that (λ, x) is an eigenpair of T and that xi is |xi | = max{|xj | | 1  j  n}.

We have n 

= λxm−1 i

aii2 ···im xi2 · · · xim ,

i2 ,...,im =1

which implies n 

= (λ − ai···i )xm−1 i

aii2 ···im xi2 · · · xim .

i2 , . . . , im = 1 δi1 ···im = 0 This implies |λ − ai···i | 

n 

i2 , . . . , im = 1 δi1 ···im = 0

|aii2 ···im

|xi2 | |xi | ··· m |xi | |xi |

Linear Algebra Tools for Data Mining (Second Edition)

944

n 



|aii2 ···im |,

i2 , . . . , im = 1 δi1 ···im = 0 

which concludes the argument.

The notion of submatrix introduced in Definition 3.5 can be extended to mdas. Definition 15.18. A submda (or a subtensor) of an mda T ∈ RI1 ×···×IN is an mda Tin =a ∈ RI1 ×···×In−1 ×In+1 ×···×IN obtained by fixing the nth index to a. Example 15.36. Let T be an mda with the format 3 × 4 × 2 having the following frontal slices: ⎛ ⎞ ⎛ ⎞ 1 4 7 10 13 16 19 22 ⎜ ⎟ ⎜ ⎟ T1 = ⎝2 5 8 11⎠ and T2 = ⎝14 17 20 23⎠. 3 6 9 12 15 18 21 24 T1 and T2 are the subtensors Ti3 =1 and Ti3 =1 . The subtensors Ti1 =1 , Ti1 =2 , and Ti1 =3 in R4×2 are   1 4 7 10 , Ti1 =1 = 13 16 19 22   2 5 8 11 , Ti1 =2 = 14 17 20 23   3 6 9 12 . Ti1 =3 = 15 18 21 24 We reformulate the Singular Value Decomposition Theorem for matrices (Theorem 9.1) in preparation for a similar result for mdas. Namely, for a matrix A ∈ CI1 ×I2 , there is a decomposition 



A = U (1) SV (2) = S ×1 U (1) ×2 V (2) = S ×1 U (1) ×2 U (2) , (1)

(1)

(1)

such that U (1) = (u1 u2 · · · uI1 is a unitary (I1 × I1 )-matrix, 

(2)

(2)

(2)

V (2) = U (2) = (u1 u2 · · · uI2 is a unitary (I2 × I2 )-matrix, S is an (I1 × I2 )-matrix such that S = diag(σ1 , σ2 , . . . , σmin{I1 ,I2} ), and σ1  σ2  · · ·  σmin{I1 ,I2 }  0.

Multidimensional Array and Tensors

945

The generalization of this result to mdas that follows is known as the Higher Order Singular Value Decomposition (HOSVD). The presentation follows [100, 102]. Theorem 15.13 (HOSVD theorem). Every mda T ∈ RI1 ×···×N can be written as T = S ×1 U (1) ×2 · · · ×N U (N ) , such that (n)

(n)

(n)

(i) U (n) = (u1 u2 · · · uIn ) is an orthogonal (In × In )-matrix; (ii) S ∈ RI1 ×···×IN is an mda such that (a) any two distinct mdas Sin =a and Sin =b with a = b are orthogonal (the general orthogonality property), and (b)

Sin =1  Sin =2  · · ·  Sin =In

for all possible values of n, α, β, with α = β. (n)

The ith column ui of the matrix U (n) is the ith n-singular vector. (n) The numbers Sin =i denoted by σi are the n-mode singular values. Proof.

Consider the mdas T, S ∈ RI1 ×···×IN such that 



S = T ×1 U (1) ×2 · · · ×N U (N ) , where U (1) , . . . , U (N ) are orthogonal matrices. In matrix form, this equality becomes   T(n) = U (n) S(n) U (n+1) ⊗ · · · U (N ) ⊗ U (1) ⊗ · · · ⊗ U (n−1) . 

If U (n) is obtained from the SVD of T(n) as T(n) = U (n) Σ(n) V (n) , (n)

(n)

(n)

where V (n) is orthogonal and Σ(n) = diag(σ1 , σ2 , . . . , σIn ), where (n)

σ1

(n)

 σ2

(n)

 · · ·  σIn  0, we denote by rn the highest index

(n)

for which σrn > 0. Taking into account that the matrix U (n+1) ⊗ · · · U (N ) ⊗ U (1) ⊗ · · · ⊗ U (n−1) is orthogonal (by Supplement 117 of Chapter 6), it follows that   S(n) = Σ(n) V (n) U (n+1) ⊗ U (n+2) · · · ⊗ U (N ) ⊗ U (1)  ⊗U (2) · · · ⊗ U (n−1) .

Linear Algebra Tools for Data Mining (Second Edition)

946

This implies for arbitrary orthogonal matrices U (1) , . . . , U (n−1) , U (n+1) , . . . , U (N ) that if α = β, then (Sin =α , Sin =β ) = 0 and (n)

Sin =1 = σ1

(n)

 Sin =2 = σ2

(n)

 · · ·  Sin =In = σIn  0,

and, if rn < In , (n)

(n)

Sin =rn +1 = σrn +1 = · · · = Sin =In = σIn = 0. The matrices U (1) , U (n−1) , U (n+1) , U (N ) can be constructed in the same manner as U (n) such that S and these matrices satisfy the conditions of the theorem. On the other hand, all matrices U (1) , . . . , U (N ) and S that satisfy the theorem can be found from the singular value decompositions  of A(n) . Note that the matrix S that occurs in the SVD is replaced by the core mda S and the diagonal character of the matrix S in the SVD Theorem is replaced by the orthogonality of the mdas of the form Sin =a . The role played by the singular values in the SVD theorem is played by the norms of the mdas Sin =k . Ample details concerning the spectral theory of mdas can be found in the monograph [138]. 15.12

Decomposition of Tensors

The notion of singular value defined for matrices was extended to tensors in [101]. This is the basis of the CANDECOMP/PARAFAC technique (abbreviated as CP), which decomposes a tensor as a sum of rank-1 tensors, a concept developed in the psychometric [25, 46] and chemometric publications [89]. A construction introduced by Kruskal [97] starts with the matrices A ∈ RI×R , B ∈ RJ×R , and C ∈ RK×R and defines the mda T as Tijk =

R 

air bjr ckr .

r=1

The mda T is denoted by A, B, C.

Multidimensional Array and Tensors

947

A more general decomposition applicable to mdas of order N allows an mda T ∈ RI1 ×···×In to be expressed as Ti1 i2 ···ir =

R 

λr ai1 r ai2 r · · · ain r ,

r=1

where λ ∈ RR , Aip ∈ RIp ×R for 1  p  r. In this case, T is denoted by λ; A, B, C. Another modality of saving space when storing an mda is the Tucker operator (see [92] and [164]). A three-way mda T ∈ RI×J×K is stored as the product of a core mda G of size R × S × Z with corresponding matrices A ∈ RI×R , B ∈ RJ×S , and C ∈ RK×Z such that Tijk =

R  S  Z 

Grst air bjs cks .

r=1 s=1 t=1

If I, J, K are much larger than R, S, Z, then forming T explicitly requires more memory than it is required to store its components, namely RSZ + IR + JS + KZ. Definition 15.19. Let T ∈ RJ1 ×···×JN , N = {1, . . . , N }, and let N matrices A(n) ∈ RIn ×Jn for n ∈ N. The Tucker operator applied to T produces the mda T ; A(1) , . . . , A(N )  defined by T ; A(1) , . . . , A(N )  = T ×1 A(1) ×2 · · · ×n A(N ) ∈ RI1 ×···×IN . T is referred to as the core mda. Theorem 15.14. Let T ∈ RJ1 ×···×JN and let N = {1, . . . , N }. If A(n) ∈ RIn ×Jn and B (n) ∈ RKn ×In for 1  n  N , then T ; A(1) , . . . , A(N ) ; B (1) , . . . , B (N )  = T ; B (1) A(1) , . . . , B (N ) A(N ) . Proof.

By the definition of Tucker operator, we have T ; A(1) , . . . , A(N ) ; B (1) , . . . , B (N )  = (T ×1 A(1) ×2 · · · ×N A(N ) ); B (1) , . . . , B (N )  = (T ×1 A(1) ×2 · · · ×N A(N ) ) ×1 B (1) · · · ×N B (N ) .

Linear Algebra Tools for Data Mining (Second Edition)

948

Note that (T ×1 A(1) · · · ×N A(N ) )i1 ···iN =

 j1 ···jn

(1)

(N )

Tj1 ···jN Ai1 j1 · · · AiN jN .

Furthermore, ((T ×1 A(1) · · · ×n A(N ) ) ×1 B (1) · · · ×N B (N ) )k1 ...kN   (1) (N ) (1) (N ) = Tj1 ···jN Ai1 j1 · · · AiN jN Bk1 i1 · · · BkN iN i1 ···iN j1 ···jN

=

 

i1 ···iN j1 ···jN

=

 j1 ···jn

(1)

(1)

(N )

(N )

Tj1 ···jN Bk1 i1 Ai1 j1 · · · BkN iN AiN jN

Tj1 ···jN



 i1

 (1)

(1)

Bk1 i1 Ai1 j1

⎛ ⎞  (N ) (N ) BkN iN AiN jN ⎠ ···⎝ iN

= (T ; B (1) A(1) , . . . , B (N ) A(N ) )k1 ...kN , which concludes the argument.



The next statement allows us to express the Tucker operator in terms of matricized mdas. Theorem 15.15. Let T ∈ RJ1 ×···×JN be an mda and let N = {1, · · · , N }. If A(n) ∈ RIn ×Jn for n ∈ N, and R = {r1 , . . . , rL }, C = {c1 , . . . , cM } form a partition of N, then S = T ; A(1) , . . . , A(N )  if and only if S(R×C):JN = (A(rL ) ⊗ · · · ⊗ A(r1 ) )T(R×C:IN ) (A(cM ) ⊗ · · · ⊗ A(c1 ) ) . Here IN = {I1 , . . . , IN } and JN = {J1 , . . . , JN }. Proof.

This is a direct consequence of Theorem 15.5.



The calculation of the norm of a large mda can be reduced to the calculation of the norm of a smaller mda using the thin QR factorization of rectangular matrices discussed in Theorem 6.65. Theorem 15.16. Let T ∈ RJ1 ×J2 ×···×JN be an mda having the set of modes N = {1, . . . , N }, and let A(n) be N rectangular matrices

Multidimensional Array and Tensors

949

having the QR decompositions A(n) = Q(n) R(n) for n ∈ N, where the columns of Q(n) constitute an orthonormal basis for range(A), and R(n) is an upper triangular invertible matrix such that its diagonal elements are real non-negative numbers for n ∈ N. We have         T ; A(1) , . . . , A(N )  = T ; R(1) , . . . , R(N ) . Proof. By the definition of QR decomposition of matrices and Theorem 15.14, we have T ; A(1) , . . . , A(N )  = T ; Q(1) R(1) , . . . , Q(N ) R(N )  = T ; R(1) , . . . , R(N ) ; Q(1) , . . . , Q(n) . Supplement 15.13 further implies         T ; A(1) , . . . , A(N )  = T ; R(1) , . . . , R(N ) .



Definition 15.20. The Tucker decomposition of an mda T ∈ RI1 ×I2 ×···×IN is given by T = G; A(1) , A(2) , . . . , A(N ) , where A(n) ∈ RIn ×Jn and G ∈ RJ1 ×I2 ×···×JN . The Tucker decomposition allows the generation of an approximation G of T when the ranks of G(n) are smaller than the corresponding ranks of the matrices T(n) . In general, the Tucker decomposition is not unique (see Supplement 15.13).

15.13

Approximation of mdas

The next theorem appears in [92, 102]. Theorem 15.17. Let T ∈ RI2 ×I2 ×···×IN and let A(n) ∈ RIn ×Jn be matrices with orthonormal sets of columns for 1  n  N .   The optimal mda G ∈ RJ1 ×J2 ×···×JN that minimizes T −   G; A(1) , . . . , A(N )  for fixed matrices A(1) , . . . , A(N ) is given by 



G = T ; A(1) , . . . , A(n) .

Linear Algebra Tools for Data Mining (Second Edition)

950

matrices A(i) for this problem maximize   Furthermore, the optimal     T ; A(1) , . . . , A(N ) . Choosing R = N in Theorem 15.15 allows us to write       T − G; A(1) , . . . , A(N )        = T − G ×1 A(1) ×2 · · · ×n A(N )      = vec(T ) − (A(N ) ⊗ . . . ⊗ A(1) )vec(G).     The minimization of the norm vec(T ) − (A(N ) ⊗ . . . ⊗ A(1) )vec(G) can be treated as a least square problem and the solution is †  vec(G) = A(N ) ⊗ · · · ⊗ A(1) vec(T ). Proof.

The solution can be written as vec(G) = (A(N ) ⊗ · · · ⊗ A(1) )† vec(T ) = (A(N )† ⊗ · · · ⊗ A(1)† )vec(T ) (by Supplement 27 of Chapter 9) 



= (A(N ) ⊗ · · · ⊗ A(1) )vec(T ) (because the matrices A(i) have orthonormal sets of columns). Next, we have     T − G; A(1) , . . . , A(N )  = T 2 − 2(T, G; A(1) , . . . , A(N ) ) + G; A(1) , . . . , A(n)  2 . We have (T, G; A(1) , . . . , A(N ) ) 



= (T ; A(1) , . . . , A(N ) , G) (by Theorem 15.7) = G 2 .

Multidimensional Array and Tensors

951

By Supplement 17, we have    (1) (N )  G; A , . . . , A  = G , which implies  2   T − G; A(1) , . . . , A(N )  = T 2 − G 2 .     Thus, minimizing T − G; A(1) , . . . , A(N )  amounts to maximizing 

G . Next we discuss approximations of mdas by mdas of rank 1. In other words, given an mda T ∈ RI1 ×···×IN , we seek the determination of λ ∈ R and of unit-vectors u(1) , . . . , u(N ) that define a rank-1 mda S defined as (1)

(N )

Si1 ...iN = λui1 · · · ◦ uiN that minimizes the least-square function f (S) = T − S 2 .

(15.14)

This is a constrained optimization problem solved in [101] by the method of Lagrange multipliers. Consider the objective function f˜ defined as      (n)   (1) (N ) 2 (n) 2 Ti1 ···iN − λui1 · · · uiN + λ (uin ) − 1 . n

i1 ···iN

The equality λ

∂ f˜ (n) ∂uin



(n)

+λ2 uin

= 0 implies (1)

i1 ···in−1 in+1 ···iN

in

(n−1) (n+1) (N )

(n)

Ti1 ···iN ui1 · · · uin−1 uin+1 uiN = λ(n) uin

 i1 ···in−1 in+1 ···iN



 (1) 2

ui1

(15.15)

     2 (n−1) 2 (n+1) 2 · · · uin−1 · · · uN . uin+1 iN

Linear Algebra Tools for Data Mining (Second Edition)

952

The equality

∂ f˜ ∂λ(n)

= 0 implies   (n) 2 = 1. uin

(15.16)

in

Finally,

∂ f˜ ∂λ

= 0 yields

 i1 ···iN

(1)

(N )

Ti1 ···iN ui1 · · · uiN = λ

  i1 ···iN

 (1) 2

ui1

  (N ) 2 · · · uiN .

The previous equalities imply  (1) (N ) Ti1 ···iN ui1 · · · uiN = λ,

(15.17)

i1 ···iN

and λ



i1 ···in−1 in+1 ···iN

(1)

(n−1) (n+1)

(N )

(n)

Ti1 ···iN ui1 · · · uin−1 uin+1 · · · UiN = (λ2 + λ(n) )uin .

(15.18) Taking into account Equality (15.16), we have  (1) (n−1) (n+1) (N ) (N ) Ti1 ···iN ui1 · · · uin−1 uin+1 · · · uiN = λuiN . (15.19) i1 ···in−1 in+1 ···iN

These necessary conditions correspond to 







T ×1 u(1) · · ·×n−1 u(n−1) ×n+1 u(n+1) · · ·×N u(N ) = λu(N ) , (15.20) 



T ×1 u(1) ×2 · · · ×N u(N ) = λ,

(15.21)

u(n) = 1

(15.22)

and for 1  n  N . Minimizing the function f defined by Equality (15.14) is equivalent to maximizing  2  (1) (2) (N ) (1) (2) (N )  ×2 u ×3 · · · ×N u g(u , u , . . . , u ) = T ×1 u  , over u(1) , u(2) , . . . , u(N ) . Moreover, if λ is chosen as in Equality (15.21), then the functions f and g satisfy the equality f (S) =

T 2 − g(S).

Multidimensional Array and Tensors

953

Indeed, note that f (S) = T − S 2 = T 2 − 2(T, S) + S 2 . The definition of λ and Equality (15.17) imply  (1) (N ) Ti1 ···iN λui1 · · · uiN = λ2 . (T, S) = i1 ···iN

Since u(1) = u(2) = · · · = u(N ) = 1, we also have S 2 = λ2 . This implies f (S) = T 2 − g(S). An alternating least squares algorithm for computing the best rank-1 approximation of an mda is presented in [101]. Exercises and Supplements (I )

(1) Let e(inn) be a vector in

RI n

(I )

for 1  n  N . Prove that (I )

e(i11) ⊗ · · · ⊗ e(iNN ) = eIi , where I = I1 · . . . · IN , and i = (i1 − 1)

N  p=2

Ip + (i2 − 1)

N 

Ip + · · · + iN .

p=3

(2) Let T ∈ RJ1 ×J2 ×···×JN be an mda. Prove that if A ∈ RIm ×Jm and B ∈ RIn ×Jn and m = n, then (T ×m A) ×n B = (T ×n B) ×m A. The common value of these arrays is denoted by T ×m A ×n B. Solution: Without loss of generality, assume that m < n. The definition of the mda-matrix mode product allows us to write (T ×m A)j1 ,...,jm−1 ,im ,jm+1 ,...,jN =

Im 

Tj1 j2 ···jN Aim jm .

im =1

Therefore, ((T ×m A) ×n B)j1 ,...,jm−1 ,im ,jm+1 ,...,jn−1 ,in ,jn+1 ,...,jN  = Iinn=1 (T ×m A)j1 ,...,jm−1 ,im ,jm+1 ,...,jN Bin jn  m = Iinn=1 Iim =1 Tj1 j2 ···jn ···jm ···jN Aim jm Bin jn . The same expression is obtained by computing ((T ×n B) ×m A)j1 ···jm−1 im jm+1 ···jn−1 in jn+1 ···jN .

Linear Algebra Tools for Data Mining (Second Edition)

954

(3) Recall that the contracted n-mode product ×n drops the nth singleton dimension. Prove that the contracted n-mode product is not commutative, that is, in general (T ×n x)×n y = (T ×n y)×n x. (4) Prove that if T is an mda having the set of modes M, then vec(T ) = T(M×∅:M) . (5) Let T, S be two mdas of rank 1, where T = a(1) ⊗a(2) ⊗· · ·⊗a(n) and S = b(1) ⊗ b(2) ⊗ · · · ⊗ b(n) . Prove that (T, S) =

n  (a(i) ⊗ b(i) ). i=1

(6) Let T ∈ RI1 ×I2 ×···×IN be an mda, and let A ∈ RJn ×In and B ∈ RJm ×Im be two matrices, where m = m. Prove that (T ×n A) ×m B = (T ×m B) ×n A. (7) Let A ∈ Rm×n and B ∈ Rp×q be two matrices, and let x ∈ Rqn be a vector. Compute the vector y = (A ⊗ B)x without computing A ⊗ B. This is important if the size of the Kronecker product is very large. Solution: The following MATLAB program is a solution: [m,n] = size(A); [p,q] = size(B); X = reshape(x,q,n); Y = B*X*A’; y = reshape(Y,m*p,1);

(8) Let A(1) , . . . , A(N ) be N matrices, where A(n) ∈ RIn ×Jn for n ∈ {1, . . . , N }. Prove that X = Y ; A(1) , . . . , A(N )  if and only if X(n) = A(n) Y(n) (A(N ) ⊗ · · · ⊗ A(n+1) ⊗ A(n−1) ⊗ · · · ⊗ A(1) ) . (9) Let T be an mda in Tm,n . Prove that if (λ, x) is an eigenpair of T , then (aλ + b, x) is an eigenpair of aT + bIm,n . (10) Let X = [A, B, C] be a representation of a three-way array X, and let M, N be two sets of matrices such that M ⊆ N.

Multidimensional Array and Tensors

955

Prove that the rank R of X satisfies the following inequality: R  min rank(N X) + max (number of zero columns in M A). N ∈N

M ∈M

Solution: For any M ∈ M, we have minN ∈N rank(N X)  rank(M X) = rank([M A, B, C])  R − (number of zero columns in M A). If we take the minimum over all M ∈ M, we obtain the statement. (11) Let X be a three-way array, that is 1-non-degenerate, and let N = {u | u = 0}. Prove that rank(X)  min rank(uX) + dim1 (X) − 1. u∈N

Solution: Let [A, B, C] be a representation of X and let M = {u | uA = 0}. Since X is 1-non-degenerate, we have N = {u | uX = 0} ⊆ M ⊆ N, hence, N = M. Then max(number of zero columns in uA = max(number of zero columns in A which are orthogonal to u)  rank(A) − 1 (because we can pick some rank(A) independent columns of A and select u orthogonal to rank(A) − 1 of them)  dim1 (X) − 1. Now the statement follows from the previous supplement. (12) Let X be a 1-non-degenerate array and let N consist of all M ×1matrices with full row rank. Prove that rank(X)  min rank(AX) + dim1 (X) − M. A∈N

(13) Prove that the multilinear system

x_0 y_0 + x_1 y_1 = x_0 y_1 − x_1 y_0 = 0

has nontrivial solutions in C.
(14) Prove that the multilinear system

x_0 y_0 = x_0 y_1 = x_1 y_0 = x_1 y_1 = 0

has only trivial solutions.
(15) Let T ∈ M_{m,n} be a hypercubic mda and let a, b ∈ R. If (λ, x) is an eigenpair of T, prove that (aλ + b, x) is an eigenpair of aT + bI_{m,n}.
(16) Consider the odd-order symmetric mda T ∈ T_{m,n} defined in [27] or [138] as follows:

T_{111} = 10,      T_{112} = −√3,
T_{121} = −√3,    T_{122} = √3,
T_{211} = −√3,    T_{212} = √3,
T_{221} = √3,      T_{222} = 4.

Prove the following:
(a) if u, v ∈ R_{>0}, then 2u³ + v³ ≥ 3u²v;
(b) Tx³ ≥ 0 for x ∈ R²_{≥0};
(c) there is no H-eigenvalue for T.
Solution: The inequality of part (a) is equivalent to 2z³ + 1 ≥ 3z² for z > 0. The function φ(z) = 2z³ − 3z² + 1 has a minimum at z = 1, hence 2z³ − 3z² + 1 ≥ 0 for z ≥ 0. Replacing z by u/v gives the desired inequality. Therefore, since Tx³ = 10x₁³ − 3√3 x₁²x₂ + 3√3 x₁x₂² + 4x₂³ and

10x₁³ + 4x₂³ = (1/3)(2(15^{1/3}x₁)³ + (12^{1/3}x₂)³) ≥ (15^{1/3}x₁)²(12^{1/3}x₂) ≥ 3√3 x₁²x₂,

it follows that Tx³ ≥ 0 for x ∈ R²_{≥0}.
Suppose now that (λ, x) ∈ R × (R² \ {0}) is an eigenpair, that is,

10x₁² − 2√3 x₁x₂ + √3 x₂² = λx₁²,
−√3 x₁² + 2√3 x₁x₂ + 4x₂² = λx₂².

If x₂ = 0, then x₁ = 0, which is impossible. If x₂ ≠ 0, let z = x₁/x₂. The previous system becomes

(10 − λ)z² − 2√3 z + √3 = 0,
−√3 z² + 2√3 z + 4 − λ = 0.

Since z is real, the discriminants of these trinomials must be non-negative, that is,

12 − 4√3(10 − λ) ≥ 0,
12 + 4√3(4 − λ) ≥ 0.

These inequalities lead to 10 − √3 ≤ λ ≤ 4 + √3, which is contradictory.
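Part (b) can be illustrated numerically; the sketch below (our own check, not part of the original solution) stores T as a 2 × 2 × 2 array and evaluates Tx³ = Σ_{i,j,k} T_{ijk} x_i x_j x_k on random nonnegative vectors:

T = zeros(2,2,2);
T(1,1,1) = 10;        T(1,1,2) = -sqrt(3);
T(1,2,1) = -sqrt(3);  T(1,2,2) =  sqrt(3);
T(2,1,1) = -sqrt(3);  T(2,1,2) =  sqrt(3);
T(2,2,1) =  sqrt(3);  T(2,2,2) =  4;
vals = zeros(1000,1);
for t = 1:1000
  x = rand(2,1);                 % a random nonnegative vector
  s = 0;
  for i = 1:2, for j = 1:2, for k = 1:2
    s = s + T(i,j,k)*x(i)*x(j)*x(k);
  end, end, end
  vals(t) = s;
end
min(vals)                        % observed to be nonnegative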

(17) Let T ∈ R^{I_1×···×I_{n−1}×J×I_{n+1}×···×I_N} be an mda and let A ∈ R^{I_n×J}. If A has orthonormal columns, prove that ‖T ×_n A‖ = ‖T‖.
Solution: By Equality (15.10), we have

‖T ×_n A‖² = (T ×_n A, T ×_n A)
= ((T ×_n A) ×_n A^⊤, T)   (by Exercise 20)
= (T ×_n (A^⊤A), T)
= (T, T)   (by hypothesis)
= ‖T‖²,

because A^⊤A is the unit matrix due to the orthonormality of the columns of A.
(18) Let N = {1, . . . , N} and let A^{(n)} ∈ R^{I_n×J_n} for n ∈ N be N matrices having orthonormal sets of columns. Prove that T = ⟦S; A^{(1)}, . . . , A^{(N)}⟧ implies S = ⟦T; (A^{(1)})^⊤, . . . , (A^{(N)})^⊤⟧.
Hint: Apply Theorem 15.14.
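A quick numerical check of the norm identity in Supplement 17 for n = 1, written directly with reshape (the code is ours; orth produces a matrix with orthonormal columns):

T = randn(4,5,6);
A = orth(randn(7,4));            % 7 x 4 with A'*A = eye(4)
T1 = reshape(T, 4, []);          % mode-1 unfolding of T
Y  = reshape(A*T1, [7, 5, 6]);   % Y = T x_1 A
[norm(T(:)), norm(Y(:))]         % the two norms agree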

(19) Let T = ⟦G; A^{(1)}, A^{(2)}, . . . , A^{(N)}⟧ and let B ∈ R^{J_1×J_1} be an orthogonal matrix. Prove that T = ⟦G ×_1 B; A^{(1)}B^⊤, A^{(2)}, . . . , A^{(N)}⟧.
Solution: The hypothesis amounts to T = G ×_1 A^{(1)} ×_2 ··· ×_N A^{(N)}. Since B^⊤B = I_{J_1}, we have

T = G ×_1 (A^{(1)}B^⊤B) ×_2 ··· ×_N A^{(N)} = (G ×_1 B) ×_1 (A^{(1)}B^⊤) ×_2 ··· ×_N A^{(N)},

so T = ⟦G ×_1 B; A^{(1)}B^⊤, A^{(2)}, . . . , A^{(N)}⟧.
(20) Let T ∈ R^{I_1×···×I_{n−1}×J×I_{n+1}×···×I_N} and S ∈ R^{I_1×···×I_{n−1}×K×I_{n+1}×···×I_N} be two mdas, and let A ∈ R^{J×K}. Prove that (T, S ×_n A) = (T ×_n A^⊤, S).
Solution: Let S_{i_1···i_{n−1} k i_{n+1}···i_N} be the components of S. By the definition of S ×_n A, we have

(S ×_n A)_{i_1···i_{n−1} j i_{n+1}···i_N} = Σ_{k=1}^{K} S_{i_1···i_{n−1} k i_{n+1}···i_N} A_{jk}

and

(T, S ×_n A) = Σ_{i_1,...,i_{n−1},j,i_{n+1},...,i_N} T_{i_1···i_{n−1} j i_{n+1}···i_N} (S ×_n A)_{i_1···i_{n−1} j i_{n+1}···i_N}
= Σ_{i_1,...,i_{n−1},j,i_{n+1},...,i_N} T_{i_1···i_{n−1} j i_{n+1}···i_N} Σ_{k=1}^{K} S_{i_1···i_{n−1} k i_{n+1}···i_N} A_{jk}
= Σ_{i_1,...,i_{n−1},k,i_{n+1},...,i_N} ( Σ_{j=1}^{J} T_{i_1···i_{n−1} j i_{n+1}···i_N} (A^⊤)_{kj} ) S_{i_1···i_{n−1} k i_{n+1}···i_N}
= (T ×_n A^⊤, S).
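The adjoint-like identity of Supplement 20 can also be verified numerically; the check below (ours, for n = 1) writes the two mode-1 products directly with reshape:

T = randn(3,5,4);                % mode-1 dimension J = 3
S = randn(7,5,4);                % mode-1 dimension K = 7
A = randn(3,7);                  % A in R^{J x K}
SA  = reshape(A*reshape(S,7,[]),  [3,5,4]);   % S x_1 A
TAt = reshape(A'*reshape(T,3,[]), [7,5,4]);   % T x_1 A'
[dot(T(:), SA(:)), dot(TAt(:), S(:))]         % the two inner products coincide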

The Kruskal operator is a special case of the Tucker operator. For N = {1, . . . , N} and A^{(n)} ∈ R^{I_n×R} for n ∈ N, the value of the Kruskal operator applied to the matrices A^{(1)}, . . . , A^{(N)} is defined as ⟦A^{(1)}, . . . , A^{(N)}⟧. Note that the matrices A^{(1)}, . . . , A^{(N)} must have the same number of columns. If I is the identity mda (whose diagonal components equal 1 and which is 0 elsewhere), the Kruskal operator can be written as ⟦I; A^{(1)}, . . . , A^{(N)}⟧. If S = ⟦A^{(1)}, . . . , A^{(N)}⟧, then

S_{i_1, i_2, ..., i_N} = Σ_{j=1}^{R} a^{(1)}_{i_1 j} a^{(2)}_{i_2 j} ··· a^{(N)}_{i_N j}.

The Kruskal operator is used to define the PARAFAC decomposition of an mda T as T = ⟦A^{(1)}, A^{(2)}, . . . , A^{(N)}⟧. The interested reader should consult [47, 92].
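For a third-order array, the Kruskal operator can be formed in MATLAB either from the component formula or, equivalently, through a Khatri–Rao (columnwise Kronecker) product of the factor matrices; the sketch below is our own illustration and uses no toolbox functions:

I1 = 4; I2 = 3; I3 = 5; R = 2;
A1 = randn(I1,R); A2 = randn(I2,R); A3 = randn(I3,R);
% component formula: S(i1,i2,i3) = sum_j A1(i1,j)*A2(i2,j)*A3(i3,j)
S = zeros(I1,I2,I3);
for j = 1:R
  S = S + reshape(kron(A3(:,j), kron(A2(:,j), A1(:,j))), [I1, I2, I3]);
end
% the mode-1 unfolding of S equals A1 times the transposed Khatri-Rao product of A3 and A2
KR = zeros(I2*I3, R);
for j = 1:R
  KR(:,j) = kron(A3(:,j), A2(:,j));
end
norm(A1*KR' - reshape(S, I1, []), 'fro')   % negligible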

(21) Let N = {1, . . . , N}, and let A^{(n)} ∈ R^{I_n×R} and B^{(n)} ∈ R^{K_n×I_n} for n ∈ N. Prove that

⟦⟦A^{(1)}, . . . , A^{(N)}⟧; B^{(1)}, . . . , B^{(N)}⟧ = ⟦B^{(1)}A^{(1)}, . . . , B^{(N)}A^{(N)}⟧.

(22) Let N = {1, . . . , N} and A^{(n)} ∈ R^{I_n×R} for n ∈ N. Prove that

‖⟦A^{(1)}, . . . , A^{(N)}⟧‖² = Σ_{j=1}^{R} Σ_{k=1}^{R} (((A^{(1)})^⊤A^{(1)}) ∘ ((A^{(2)})^⊤A^{(2)}) ∘ ··· ∘ ((A^{(N)})^⊤A^{(N)}))_{jk}.

Recall that "∘" is the notation for the Hadamard product introduced in Definition 3.45.
Solution: We have

‖⟦A^{(1)}, . . . , A^{(N)}⟧‖² = (⟦A^{(1)}, . . . , A^{(N)}⟧, ⟦A^{(1)}, . . . , A^{(N)}⟧)
= Σ_{i_1···i_N} ( Σ_{j=1}^{R} a^{(1)}_{i_1 j} a^{(2)}_{i_2 j} ··· a^{(N)}_{i_N j} ) ( Σ_{k=1}^{R} a^{(1)}_{i_1 k} a^{(2)}_{i_2 k} ··· a^{(N)}_{i_N k} )
= Σ_{j=1}^{R} Σ_{k=1}^{R} Σ_{i_1···i_N} a^{(1)}_{i_1 j} a^{(2)}_{i_2 j} ··· a^{(N)}_{i_N j} a^{(1)}_{i_1 k} a^{(2)}_{i_2 k} ··· a^{(N)}_{i_N k}
= Σ_{j=1}^{R} Σ_{k=1}^{R} ( Σ_{i_1} a^{(1)}_{i_1 j} a^{(1)}_{i_1 k} ) ( Σ_{i_2} a^{(2)}_{i_2 j} a^{(2)}_{i_2 k} ) ··· ( Σ_{i_N} a^{(N)}_{i_N j} a^{(N)}_{i_N k} )
= Σ_{j=1}^{R} Σ_{k=1}^{R} ((A^{(1)})^⊤A^{(1)})_{jk} ((A^{(2)})^⊤A^{(2)})_{jk} ··· ((A^{(N)})^⊤A^{(N)})_{jk}.
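The formula of Supplement 22 is easy to confirm numerically for N = 3; the following ad hoc check (ours) builds the Kruskal array from the component formula and compares the two sides:

I1 = 4; I2 = 3; I3 = 5; R = 3;
A1 = randn(I1,R); A2 = randn(I2,R); A3 = randn(I3,R);
S = zeros(I1,I2,I3);
for j = 1:R
  S = S + reshape(kron(A3(:,j), kron(A2(:,j), A1(:,j))), [I1, I2, I3]);
end
lhs = norm(S(:))^2;
rhs = sum(sum((A1'*A1).*(A2'*A2).*(A3'*A3)));   % Hadamard products of the Gram matrices
[lhs, rhs]                                      % the two values agree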

A fundamental result of complexity theory [62] is that the satisfiability problem stated in what follows is NP-complete.
3-SATISFIABILITY or 3-SAT:
Input: a collection of m clauses C = {C_1, . . . , C_m} on a finite set of n variables U such that each clause contains 3 literals.
Problem: is there a truth assignment for U that satisfies all clauses in C?
The NP-completeness of 3-SAT was used in [74] to establish the NP-hardness of determining the rank of a tensor over the field of rational numbers. This new problem can be formulated as follows:
TENSOR RANK:
Input: a three-dimensional tensor given by the numbers T_{ijk}, where 1 ≤ i ≤ n_1, 1 ≤ j ≤ n_2, 1 ≤ k ≤ n_3, and a number r ∈ N.
Problem: are there vectors v_ℓ^{(e)} for 1 ≤ ℓ ≤ r and 1 ≤ e ≤ 3 such that T_{ijk} = Σ_{ℓ=1}^{r} v_ℓ^{(1)}(i) v_ℓ^{(2)}(j) v_ℓ^{(3)}(k) for all i, j, k?
(23) Give an encoding of an arbitrary 3-SAT Boolean formula in n variables and m clauses as an mda T ∈ Q^{(n+2m+2)×3n×(3n+m)} with the property that the 3-SAT formula is satisfiable if and only if rank(T) ≤ 4n + 2m.
Hint: See [74].

Bibliographical Comments

The main sources for this chapter are [4, 5, 91–93, 95]. For tensor ranks in various fields, the reader should consult [58, 115]. A useful survey of tensor decompositions is [139]. The works of L. Qi and his collaborators [80, 136–138] are a main reference for the spectral theory of tensors. Supplement 10 is a result of Kruskal [97]. The treatment of hyperdeterminants follows [123].


Bibliography [1] Abdi, H. (2003). Partial least squares regression (PLS-regression), in Encyclopedia for Research Methods for the Social Sciences (Sage, Thousand Oaks, CA), pp. 792–795. [2] Akivis, M. A. and Goldberg, V. V. (1977). An Introduction to Linear Algebra and Tensors (Dover Publications, New York). [3] Artin, M. (1991). Algebra, 1st edn. (Prentice-Hall, Englewood Cliffs, NJ), a second edition was published in 2010. [4] Bader, B. W. and Kolda, T. G. (2006). Algorithm 862: Matlab tensor classes for fast algorithm prototyping, ACM Transactions on Mathematical Software 32, 4, pp. 635–653. [5] Bader, B. W. and Kolda, T. G. (2008). Efficient Matlab computations with sparse and factored tensors, SIAM Journal on Scientific Computing 30, 1, pp. 205–231. [6] Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval (Addisson-Wesley, Harlow, UK). [7] Bartle, R. G. and Sherbery, D. R. (1999). Introduction to Real Analysis, 3rd edn. (Wiley, New York). [8] Bauer, F. L. and Fike, C. T. (1960). Norms and exclusion theorems, Numerische Mathematik 2, pp. 137–141. [9] Belkin, M. and Niyogi, P. (2001). Laplacian eigenmaps and spectral techniques for embedding and clustering, in Advances in Neural Information Processing Systems 14 (MIT Press), pp. 585–591. [10] Ben-Israel, A. (1992). A volume associated with m × n matrices, Linear Algebra and Its Applications 167, pp. 87–111. [11] Ben-Israel, A. and Greville, T. N. E. (1974). Generalized Inverses — Theory and Applications (Wiley-Interscience, New York). [12] Berge, J. M. F. T. (1983). A generalization of Kristof’s theorem on the trace of certain matrix products, Psychometrika 48, pp. 519–523.


[13] Berge, J. M. F. T. (1991). Kruskal’s polynomial for 2× 2× 2 arrays and a generalization to 2× n× n arrays, Psychometrika 56, 4, pp. 631–636. [14] Berge, J. M. F. T. (1993). Least Squares Optimization in Multivariate Analysis (DSWO Press, Leiden, The Netherlands). [15] Berge, T. (1977). Orthogonal Procrustes rotation for two or more matrices, Psychometrika 42, pp. 267–276. [16] Berkhin, P. and Becher, J. (2002). Learning simple relations: Theory and applications, in Proceedings of the Second SIAM International Conference on Data Mining (Arlington, VA). [17] Berry, M. W., Browne, M., Langville, A. N., Pauca, V. P. and Plemmons, R. J. (2007). Algorithms and applications for approximate nonnegative matrix factorization, Computational Statistics & Data Analysis 52, 1, pp. 155–173. [18] Bhatia, R. (1997). Matrix Analysis (Springer, New York). [19] Billsus, D. and Pazzani, M. (1998). Learning collaborative information filters, in J. W. Shavlik (ed.), Proceedings of the 15th International Conference on Machine Learning, pp. 46–54. [20] Birkhoff, G. (1973). Lattice Theory, 3rd edn. (American Mathematical Society, Providence, RI). [21] Bj¨ork, A. (1996). Numerical Methods for Least Squares Problems (SIAM, Philadelphia, PA). [22] Brand, M. (2003). Fast online SVD revisions for lightweight recommender systems, in D. Barbar´a and C. Kamath (eds.), Proceedings of the Third SIAM International Conference on Data Mining (SIAM, Philadelphia, PA), pp. 37–46. [23] Brokett, R. W. and Dobkin, D. (1978). On the optimal evaluation of a set of bilinear forms, Linear Algebra and Its Applications 19, pp. 207–235. [24] Burdick, D. S. (1995). An introduction to tensor product with applications to multiway data analysis, Chemometrics and Intelligent Laboratory Systems 28, pp. 229–237. [25] Carroll, J. D. and Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart-Young decomposition, Psychometrika 35, 3, pp. 283–319. [26] Catral, M., Han, L., Neumann, M. and Plemmons, R. (2004). On reduced rank nonnegative (non-negative) matrix factorizations for symmetric matrices, Linear Algebra and Applications 393, pp. 107–127. [27] Chen, H., Chen, Y., Li, G. and Qi, L. (2015). Finding the maximum eigenvalue of a class of tensors with applications in copositivity test and hypergraphs, arXiv preprint arXiv:1511.02328.


[28] City of Boston (2008). Public Health Commission Research Office: The health of Boston 2008, Tech. rep., Boston, MA. [29] Cline, R. E. and Funderlink, R. E. (1979). The rank of a difference of matrices and associated generalized inverses, Linear Algebra and Its Applications 24, pp. 185–215. [30] Comon, P., Golub, G., Lim, L.-H. and Mourrain, B. (2008). Symmetric tensors and symmetric tensor rank, SIAM Journal on Matrix Analysis and Applications 30, 3, pp. 1254–1279. [31] Cox, D. A., Little, J. and O’Shea, D. (2005). Using Algebraic Geometry, 2nd edn. (Springer, New York). [32] Cox, T. F. and Cox, M. (2001). Multidimensional Scaling, 2nd edn. (Chapman & Hall, Boca Raton). [33] Davis, C. and Kahan, W. M. (1969). Some new bounds on perturbation of subspaces, Bulletin of the American Mathematical Society 75, pp. 863–868. [34] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R. (1990). Indexing by latent semantic analysis, Journal of the American Society for Information Science 41, pp. 391–407. [35] Ding, C. and He, X. (2004). K-means clustering via principal component analysis, in Proceedings of the 21st International Conference on Machine Learning (Banff, Canada), pp. 225–232. [36] Ding, C., He, X. and Simon, H. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering, in Proceedings of the SIAM International Conference on Data Mining (Newport Beach, CA), pp. 606–610. [37] Dodson, C. T. J. and Poston, T. (1997). Tensor Geometry, 2nd edn. (Springer, Berlin). [38] Donoho, D. and Grimes, G. (2003). Hessian eigenmaps: New locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences of the United States of America 5, pp. 5591–5596. [39] Drineas, P., Frieze, A., Kannan, R., Vampala, S. and Vinay, V. (2004). Clustering large graphs via the singular value decomposition, Machine Learning 56, pp. 9–33. [40] Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S. and Harshman, R. (1988). Using latent semantic analysis to improve access to textual information, in CHI ’88: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (ACM, New York, NY), ISBN 0-201-14237-6, pp. 281–285. [41] Eckhart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank, Psychometrica 1, pp. 211–218.


[42] Egerváry, E. (1960). On rank-diminishing operators and their application to the solution of linear systems, Zeitschrift für Angewandte Mathematik und Physik 11, pp. 376–386. [43] Elad, M. (2010). Sparse and Redundant Representations (Springer, New York). [44] Eldén, L. (2007). Matrix Methods in Data Mining and Pattern Recognition (SIAM, Philadelphia, PA). [45] Elkan, C. (2003). Using the triangle inequality to accelerate k-means, in Proceedings of the 20th International Conference on Machine Learning, pp. 147–153. [46] Harshman, R. A. (1970). Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multimodal factor analysis, UCLA Working Papers in Phonetics 16, pp. 1–84. [47] Faber, N. M., Bro, R. and Hopke, P. K. (2003). Recent developments in candecomp/parafac algorithms: A critical review, Chemometrics and Intelligent Laboratory Systems 65, pp. 119–137. [48] Fan, K. (1949). On a theorem of Weyl concerning eigenvalues of linear transformations I, Proceedings of the National Academy of Sciences of the United States of America 35, pp. 652–655. [49] Fan, K. (1951). Maximum properties and inequalities for the eigenvalues of completely continuous operators, Proceedings of the National Academy of Sciences of the United States of America 37, pp. 760–766. [50] Favier, G. (2019). From Algebraic Structure to Tensors, Vol. 1 (Wiley, Hoboken, NJ). [51] Favier, G. (2021). Matrix and Tensor Decompositions in Signal Processing, Vol. 2 (Wiley, Hoboken, NJ). [52] Favier, G. and de Almeida, A. (2014). Overview of constrained parafac models, EURASIP Journal on Advances in Signal Processing 2014, 1, pp. 1–25. [53] Fejer, P. A. and Simovici, D. A. (1991). Mathematical Foundations of Computer Science, Vol. 1 (Springer Verlag, New York). [54] Fletcher, R. and Sorensen, D. (1983). An algorithmic derivation of the Jordan canonical form, American Mathematical Monthly 90, pp. 12–16. [55] Food and Agriculture Organization (2009). FAO Statistical Yearbook (Statistics Division FAO, Rome, Italy). [56] Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications, Biometrics 21, p. 768. [57] Francis, J. G. F. (1961). The QR transformation – A unitary analogue to the LR transformation – Part I, The Computer Journal 4, pp. 265–271.


[58] Friedland, S. (2012). On the generic and typical rank of 3-tensors, Linear Algebra and Its Applications 436, pp. 478–497. [59] Frieze, A. and Kannan, R. (1999). Quick approximation to matrices and applications, Combinatorica 19, pp. 175–200. [60] Fuglede, B. (1950). A commutativity theorem for normal operators, Proceedings of the National Academy of Sciences of the United States of America 36, 1, p. 35. [61] Gabriel, K. R. (1971). The biplot graph display of matrices with application to principal component analysis, Biometrika 58, pp. 453–467. [62] Garey, M. R. and Johnson, D. S. (1979). Computers and Intractability — A Guide to the Theory of NP-Completeness (W. H. Freeman, New York). [63] Gelfand, I. M., Kapranov, M. and Zelevinsky, A. (1994). Discriminants, Resultants, and Multidimensional Determinants (Birkh¨auser, Boston, MA). [64] Gilat, A. (2011). MATLAB — An Introduction with Applications, 4 edn. (John Wiley, New York). [65] Golub, G. H. and Loan, C. F. V. (1989). Matrix Computations, 2nd edn. (The Johns Hopkins University Press, Baltimore, MD). [66] Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis, Biometrika 53, pp. 325–338. [67] Gower, J. C. and Hand, D. J. (1995). Biplots (Chapman & Hall, London). [68] Gower, J. C., Lubbe, S. G. and Le Roux, N. (2011). Understanding Biplots (Wiley, New York). [69] Greenacre, M. (2010). Biplots in Practice (Fundacion BBVA, Madrid, Spain). [70] Greub, W. (1981). Linear Algebra, 4th edn. (Springer-Verlag, New York). [71] Greub, W. H. (1978). Multilinear Algebra, 2nd edn. (Springer-Verlag, New York). [72] Hackbusch, W. (2012). Tensor Spaces and Numerical Tensor Calculus, 2nd edn. (Springer, Cham, Switzerland). [73] Hartigan, J. A. and Wong, M. (1979). A k-means clustering algorithm, Applied Statistics, pp. 100–108. [74] H˚ astad, J. (1990). Tensor rank is NP-complete, Journal of Algorithms 4, 11, pp. 644–654. [75] Higham, D. J. and Higham, N. J. (2000). Matlab Guide (SIAM, Philadelphia, PA).


[76] Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms, 2nd edn. (SIAM, Philadelphia, PA). [77] Hoffman, A. J. and Wielandt, H. W. (1953). The variation of the spectrum of a normal matrix, Duke Mathematical Journal 20, pp. 37–39. [78] Horn, R. A. and Johnson, C. R. (1996). Matrix Analysis (Cambridge University Press, Cambridge, UK). [79] Horn, R. A. and Johnson, C. R. (2008). Topics in Matrix Analysis (Cambridge University Press, Cambridge, UK). [80] Hu, S., Huang, Z.-H., Ling, C. and Qi, L. (2013). On determinants and eigenvalue theory of tensors, Journal of Symbolic Computation 50, pp. 508–531. [81] Inaba, M., Katoh, N. and Imai, H. (1994). Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering, in 10th Computational Geometry (ACM, New York), pp. 331–339. [82] Johnson, R. A. and Wichern, D. W. (2002). Applied Multivaried Statistical Analysis, 5 edn. (Prentice Hall, Upper Saddle River, NJ). [83] Johnson, R. K. (2011). The Elements of MATLAB Style (Cambridge University Press, Cambridge, UK). [84] Jolliffe, I. T. (2002). Principal Component Analysis, 2nd edn. (Springer, New York). [85] Jordan, P. and Neumann, J. V. (1935). On inner products in linear, metric spaces, The Annals of Mathematics 36, pp. 719–723. [86] Juh´ asz, F. (1978). On the spectrum of a random graph, in Colloquia Mathematica Societatis J´ anos Bolyai, Vol. 25 (Szeged), pp. 313–316. [87] Kannan, R. (2010). Spectral methods for matrices and tensors, in Proceedings of the ACM Symposium on the Theory of Computing (STOC) (ACM, New York), pp. 1–12. [88] Kanungo, T., Mount, D. M., Netanyahu, N. S. and Piatko, C. D. (2004). A local search approximation algorithm for k-means clustering, Computational Geometry: Theory and Applications 28, pp. 89–112. [89] Kiers, H. A. (1998). A three-step algorithm for CANDECOMP/PARAFAC analysis of large data sets with multicollinearity, Journal of Chemometrics: A Journal of the Chemometrics Society 12, 3, pp. 155–171. [90] Knyazev, A. V. and Argentati, M. E. (2006). On proximity of Rayleigh quotients for different vectors and Ritz values generated by different trial subspaces, Linear Algebra and Its Applications 415, pp. 82–95.


[91] Kolda, T. G. (2001). Orthogonal tensor decompositions, SIAM Journal of Matrix Analysis and Applications 23, pp. 243–255. [92] Kolda, T. G. (2006). Multilinear operators for higher-order decompositions, Tech. Rep. SAND2006-2081, Sandia. [93] Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications, SIAM Review 51, 3, pp. 455–500. [94] Kolda, T. G. and O’Leary, D. P. (1998). A semidiscrete matrix decomposition for latent semantic indexing in information retrieval, ACM Transactions on Information Systems 16, pp. 322–346. [95] Kolda, T. G. and O’Leary, D. P. (1999). Computation and uses of the semidiscrete matrix decomposition, Tech. Rep. ORNL-TM-13766, Oak Ridge National Laboratory, Oak Ridge, TN. [96] Kroonenberg, P. M. (2008). Applied Multiway Data Analysis (Wiley, Hoboken, NJ). [97] Kruskal, J. B. (1977). Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics, Linear Algebra and Its Applications 18, 2, pp. 95–138. [98] Lanczos, C. (1958). Linear systems in self-adjoint form, American Mathematical Monthly 65, pp. 665–679. [99] Langville, A. N. and Stewart, W. J. (2004). The Kronecker product and stochastic automata networks, Journal of Computational and Applied Mathematics 167, 2, pp. 429–447. [100] Lathauwer, L. D. (1997). Signal processing based on multilinear algebra, Ph.D. thesis, Katholieke Universiteit Leuven. [101] Lathauwer, L. D., Moor, B. D. and Vandewalle, J. (2000). A multilinear singular value decomposition, SIAM Journal on Matrix Analysis and Applications 21, 4, pp. 1253–1278. [102] Lathauwer, L. D., Moor, B. D. and Vandewalle, J. (2000). On the best rank-1 and rank-(r1 , r2 , . . . , rn ) approximation of higher-order tensors, SIAM Journal on Matrix Analysis and Applications 21, 4, pp. 1324–1342. [103] Lawson, C. L. and Hanson, R. J. (1995). Solving Least Square Problems (SIAM, Philadelphia, PA). [104] Lax, P. D. (2007). Linear Algebra and Its Applications (WileyInternational, Hoboken, NJ). [105] Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization, Nature 401, pp. 515–521. [106] Lichnerowicz, A. (1956). Alg`ebre et Analyse Lin´eaires (Masson et Co., Paris). [107] Lloyd, S. P. (1982). Least square quantization in PCM, IEEE Transactions on Information Theory 28, pp. 129–137.


[108] MacLane, S. and Birkhoff, G. (1993). Algebra, 3rd edn. (Chelsea Publishing Company, New York). [109] Magnus, J. R. and Neudecker, H. (1979). The commutation matrix: Some properties and applications, The Annals of Statistics 7, 2, pp. 381–394. [110] Marcus, M. (1973). Finite Dimensional Multilinear Algebra — Part I (Marcel Dekker, New York). [111] Marcus, M. and Minc, H. (2010). A Survey of Matrix Theory and Matrix Inequalities (Dover Publications, New York). [112] Marsaglia, G. and Styan, G. (1974). Equalities and inequalities for ranks of matrices, Linear and Multilinear Algebra 2, pp. 269–292. [113] Meyer, C. (2000). Matrix Analysis and Applied Algebra (SIAM, Philadelphia, PA). [114] Mirsky, L. (1975). A trace inequality of john von neumann, Monatshefte f¨ ur Mathematik 79, pp. 303–306. [115] Moitra, A. (2018). Algorithmic Aspects of Machine Learning (Cambridge University Press, Cambridge, UK). [116] Moler, C. B. (2004). Numerical Computing with MATLAB (SIAM, Philadelphia, PA). [117] Morozov, A. Y. and Shakirov, S. R. (2010). New and old results in resultant theory, Theoretical and Mathematical Physics 163, 2, pp. 587–617. [118] Neudecker, H. (1968). The Kronecker matrix product and some of its applications in econometrics, Statistica Neerlandica 22, 1, pp. 69–82. [119] Ninio, F. (1976). A simple proof of the Perron-Frobenius theorem for positive matrices, Journal of Physics A 9, pp. 1281–1282. [120] Oldenburger, R. (1940). Infinite powers of matrices and characteristic roots, Duke Mathematical Journal 6, pp. 357–361. [121] O’Leary, D. P. (2006). Matrix factorization for information retrieval, Tech. Rep., University of Maryland, College Park, MD. [122] O’Leary, D. P. and Peleg, S. (1983). Digital image compression, IEEE Transactions on Communications 31, pp. 441–444. [123] Ottaviani, G. (2012). Introduzione all’iperdeterminante, La Matematica nella Societ` a e nella Cultura. Rivista dell’Unione Matematica Italiana 5, 2, pp. 169–195. [124] Overton, M. L. and Womersley, R. S. (1991). On the sum of the largest eigenvalues of a symmetric matrix, Tech. Rep. 550, Courant Institute of Mathematical Sciences, NYU. [125] Paatero, P. (1997). Least squares formulation of robust non-negative factor analysis, Chemometrics and Intelligent Laboratory Systems 37, pp. 23–35.


[126] Paatero, P. (1999). The multilinear engine — A table driven least squares program for solving multilinear problems, including the nway parallel factor analysis model, Journal of Computational and Graphical Statistics 8, pp. 1–35. [127] Paatero, P. and Tapper, U. (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values, Environmetrics 5, pp. 111–126. [128] Paige, C. and Saunders, M. (1981). Toward a generalized singular value decomposition, SIAM Journal on Numerical Analysis 18, pp. 398–405. [129] Paige, C. C. and Wei, M. (1994). History and generality of the CS decomposition, Linear Algebra and Its Applications 208, pp. 303–326. [130] Papalexakis, E. E., Faloutsos, C. and Sidiropoulos, N. D. (2016). Tensors for data mining and data fusion: Models, applications, and scalable algorithms, ACM Transactions on Intelligent Systems and Technology (TIST) 8, 2, pp. 1–44. [131] Pauca, V. P., Shahnaz, F., Berry, M. W. and Plemmons, R. J. (2004). Text mining using non-negative matrix factorizations, in SIAM Data Mining, pp. 452–456. [132] Pferschy, U. and Rudolf, R. (1994). Some geometric clustering problems, Nordic Journal of Computing 1, pp. 246–263. [133] Pollock, D. S. G. (1979). The Algebra of Econometrics (John Wiley & Sons, Chichester). [134] Pollock, D. S. G. (2021). Multidimensional arrays, indices and Kronecker products, Econometrics 9, p. 18. [135] Prasolov, V. V. (2000). Problems and Theorems in Linear Algebra (American Mathematical Society, Providence, RI). [136] Qi, L. (2005). Eigenvalues of a real supersymmetric tensor, Journal of Symbolic Computation 40, 6, pp. 1302–1324. [137] Qi, L. (2007). Eigenvalues and invariants of tensors, Journal of Mathematical Analysis and Applications 325, 2, pp. 1363–1377. [138] Qi, L. and Luo, Z. (2017). Tensor Analysis — Spectral Theory and Special Tensors (SIAM, Philadelphia, PA). [139] Rabanser, S., Shchur, O. and G¨ unnemann, S. (2017). Introduction to tensor decompositions and their applications in machine learning, arXiv preprint arXiv:1711.10781. [140] Roweis, S. T. and Saul, L. K. (2000). Locally linear embedding, http: //www.cs.nyu.edu/~roweis/lle/. [141] Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding, Science 290, pp. 2323–2326.


[142] Sarwar, B., Karypis, G., Konstan, J. and Riedl, J. (2002). Incremental singular value decomposition algorithms for highly scalable recommender systems, in 5th International Conference on Computer and Information Technology (ICCIT), pp. 27–28. [143] Saul, L. K. and Roweis, S. T. (2000). An introduction to locally linear embedding, Tech. rep., AT&T Labs-Research. [144] Saul, L. K. and Roweis, S. T. (2003). Think globally, fit locally: Unsupervised learning of low dimensional manifolds, Journal of Machine Learning Research 4, pp. 119–155. [145] Saul, L. K., Weinberger, K., Sha, F., Ham, J. and Lee, D. D. (2006). Spectral methods for dimensionality reduction, in Semi-Supervised Learning (The MIT Press, Cambridge, MA), pp. 293–308. [146] Schoenberg, I. J. (1935). Remarks to Maurice Fr´echet’s article “sur la d´efinition axiomatique d’une classe d’espaces distanci´es vectoriellement applicable sur l’espace de Hilbert”, Annals of Mathematics 36, pp. 724–732. [147] Sch¨onemann, P. H. (1970). On metric multidimensional unfolding, Psychometrika 35, pp. 349–366. [148] Schwerdtfeger, H. (1960). Direct proof of Lanczos’ decomposition theorem, American Mathematical Monthly 67, pp. 855–860. [149] Shahnaz, F., Berry, M. W., Pauca, V. P. and Plemmons, R. J. (2006). Document clustering using nonnegative matrix factorization, Information Processing Management 42, 2, pp. 373–386. [150] Sibson, R. (1978). Studies in the robustness of multidimensional scaling: Procrustes statistics, Journal of the Royal Statistical Society, Series B 40, pp. 234–238. [151] Sidiropoulos, N., Lathauwer, L. D., Fu, X., Huang, K., Papalexakis, E. and Faloutsos, C. (2017). Tensor decomposition for signal processing and machine learning, IEEE Transactions on Signal Processing 65, 13, pp. 3551–3582. [152] Simovici, D. A. and Djeraba, C. (2015). Mathematical Tools for Data Mining — Set Theory, Partially Ordered Sets, Combinatorics, 2nd edn. (Springer-Verlag, London). [153] Soules, G. W. (1983). Constructing symmetric nonnegative matrices, Linear and Multilinear Algebra 13, pp. 241–251. [154] Stanley, R. P. (1997). Enumerative Combinatorics, Vol. 1 (Cambridge University Press, Cambridge). [155] Stanley, R. P. (1999). Enumerative Combinatorics, Vol. 2 (Cambridge University Press, Cambridge). [156] Sternberg, S. (1983). Lectures on Differential Geometry, 2nd edn. (AMS Chelsea Publishing, Providence, RI).


[157] Stewart, G. W. (1973). Error and perturbation bounds for subspaces associated with certain eigenvalue problems, SIAM Review 15, pp. 727–764. [158] Stewart, G. W. (1977). On the perturbation of pseudo-inverses, projections, and linear least square problems, SIAM Review 19, pp. 634–662. [159] Stewart, G. W. (1993). On the early history of the singular value decomposition, SIAM Review 35, pp. 551–566. [160] Stewart, G. W. and Sun, J.-G. (1990). Matrix Perturbation Theory (Academic Press, Boston, MA). [161] Strassen, V. (1969). Gaussian elimination is not optimal, Numerische Mathematik 13, pp. 354–356. [162] Trefethen, L. N. and Bau, D., III (1997). Numerical Linear Algebra (SIAM, Philadelphia, PA). [163] Trotter, W. T. (1995). Partially ordered sets, in Handbook of Combinatorics (The MIT Press, Cambridge, MA), pp. 433–480. [164] Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis, Psychometrika 31, pp. 279–331. [165] van der Waerden, B. L. (1950). Modern Algebra (Vol. I and II) (F. Ungar Publ. Co., New York). [166] van Lint, J. H. and Wilson, R. M. (2002). A Course in Combinatorics, 2nd edn. (Cambridge University Press, Cambridge). [167] Van Loan, C. F. and Fan, K. Y. D. (2010). Insight through Computing (SIAM, Philadelphia, PA). [168] Gantmacher, F. R. (1977). The Theory of Matrices (2 vols.) (American Mathematical Society, Providence, RI). [169] Wedderburn, J. (1934). Lectures on Matrices (American Mathematical Society, Providence, RI). [170] Weyr, E. (1887). Note sur la théorie de quantités complexes formées avec n unités principales, Bulletin des Sciences Mathématiques II 11, pp. 205–215. [171] Wilkinson, J. H. (1965). The Algebraic Eigenvalue Problem (Clarendon Press, Oxford, London). [172] Winograd, S. (1968). A new algorithm for inner product, IEEE Transactions on Computers C-17, pp. 693–694. [173] Wu, X. and Kumar, V. (eds.) (2009). The Top Ten Algorithms in Data Mining (CRC Press, Boca Raton, FL). [174] Ye, J. (2007). Least square discriminant analysis, in Proceedings of the 24th International Conference on Machine Learning (Corvallis, OR), pp. 1087–1094. [175] Yokonuma, T. (1992). Tensor Spaces and Exterior Algebra (American Mathematical Society, Providence, RI).


[176] Zha, H., He, X., Ding, C., Simon, H. and Gu, M. (2001). Spectral relaxation for k-means clustering, Neural Information Processing Systems 14, pp. 1057–1064. [177] Zhang, F. (1999). Matrix Theory: Basic Results and Techniques (Springer Verlag, New York). [178] Zhang, F. (2005). The Schur Complement and Its Applications (Springer, New York).

Index

A

graded, 186 ideal of an, 185 morphism, 183 structural coefficients of an, 183 subalgebra of an, 185 unital, 183 associative operation, 16 attribute, 720 augmented matrix of a system of linear equations, 157 auxiliary function, 808 average dissimilarity of an object, 705

Abel’s equality, 213 absolute norm, 457 absolutely convergent matrix series, 371 addition of polynomials, 21 additive inverse of an element, 17 adjacent transposition, 8 adjoint matrix, 297, 385 affine mapping, 39 algebra exterior, 859 Grassman, 859 algebra of finite type, 17 algebra of type θ, 17 algebra type, 17 algebraic multiplicity of an eigenvalue, 486 alternator on V ⊗r , 850 angle between vectors, 386 angles between subspaces, 658 annihilating polynomial of a matrix, 557 annihilator of a subset, 73 anonymous function, 274 arity of an operation, 15 associative algebra, 182

B balanced clustering, 710 band matrix, 118 basis dual basis of a, 71 fundamental matrix of a, 379 ordered, 437 having a negative orientation, 437 having a positive orientation, 437 reciprocal set of a, 390 basis of a linear space, 40 Berge–Sibson theorem, 802 975

976

Linear Algebra Tools for Data Mining (Second Edition)

bilinear form, 80 skew-symmetric, 80 structural constants of a, 85 bilinear function structural constants of a, 85 binary operation, 15 binomial coefficient, 13 bivector, 857 bounded set in a metric space, 341 C canonical basis of a quadratic form, 591 canonical morphism, 56 canonical surjection, 56 carrier of an algebra, 17 Cauchy matrix of two real sequences, 320 Cauchy–Schwarz inequality, 338 Cauchy–Binet formula, 291 Cayley transform, 202 Cayley–Hamilton theorem, 545 center of a quadric, 594 centered data series, 722 centering matrix, 723 centroid, 693 characteristic matrix of a partition, 195 characteristic polynomial of a matrix, 484 Chebyshev metric, 346 Cholesky factor of a matrix, 412 Cholesky’s decomposition theorem, 410 circulant matrix, 120 city-block metric, 346 class scatter matrix, 782 closed set generated by a subset, 27 closed sphere, 340 closure operator, 25 closure system on a set S, 25 co-kernel of a linear mapping, 56 coefficients of a linear combination, 36 cofactor, 290 colon notation in MATLAB , 237

column subspace of a matrix, 137 combination with repetition, 32 commutative groups, 19 commutative operation, 16 commutative ring, 23 companion matrix of a polynomial, 490 complementary subspaces, 64 complete matrix, 102 complete symmetric polynomial, 23 complex linear space, 34 composition of permutations, 6 condensed graph of a graph, 221 condition number of a matrix, 434 conformant matrices, 105 congruent matrices, 156 conjugate linearity of an inner product, 378 conjugate norm of a norm, 374 consistent family of matrix norms, 362 consistent system of linear equations, 157 contiguous submatrix of a matrix, 101 continuity argument, 301 continuous clustering problem, 702 contracted product of an mda and a vector, 923 contraction, 844 convergent series of matrices, 371 corpus, 785 correlation coefficient, 726 Courant–Fischer theorem, 537 covariance coefficient, 726 covectors, 68 Cramer’s formula, 297 cross product, 437 CS decomposition, 652 cut norm, 460 cycle, 7 cycle of an element, 7 cyclic decomposition of the permutation, 8 cyclic permutation, 7

Index D φ-diagonal of a matrix, 188 data matrix principal directions of a, 769 Davis–Kahan SIN theorem, 673 defective matrix, 550 degenerate matrix, 141 degree of a λ-matrix, 552 degree of a monomial, 22 degree of a polynomial, 21 descent of a sequence, 9 determinant of a matrix, 281 diagonal matrix, 103 diagonalizable matrix, 155 diagonally dominant matrix, 158 diameter of a metric space, 341 diameter of a subset of a metric space, 341 direct sum external, 61 discrete metric, 339 discriminant, 309 discriminant function, 782 dissimilarity on a set, 342 dissimilarity space, 342 divergence of two matrices, 217 dominant eigenvalue, 503 doubly stochastic matrix, 120 dual bases, 71 dual of a linear mapping, 74 dual of a linear space, 68 dual PCA, 797 E Eckhart–Young theorem, 643 eigenpair, 479 eigenvalue of a matrix, 479 eigenvector, 479 eigenvector of a matrix pencil, 583 eigenvectors of a pencil, 584 elementary divisors of a λ-matrix, 565 elementary matrix, 199 elementary symmetric polynomials, 22

977 elementary transformations matrices, 168 endomorphism of a linear space, 52 equalizer of two linear mappings, 52 equivalent λ-matrices, 560 equivalent linear systems, 168 equivalent norms, 355 error of a cut decomposition, 461 error sequence, 372 Euclidean metric, 340, 346 Euclidean norm, 345 extended matrix of a quadric, 592 F F -closed subset of a set, 28 F-linear space, 33 factor algebra, 186 feature, 720 field, 23 field of values of a matrix, 623 finite algebra, 17 finite algebra type, 17 first Hirsch’s theorem, 500 Fisher linear discriminant, 783 Forgy–Lloyd algorithm, 696 format of a matrix, 98 Fourier expansion of an element with respect to an orthonormal set, 389 free indices, 828 frequency matrix of the corpus, 785 Frobenius inequality, 145 Frobenius norm, 363 Frobenius rank inequality, 145 full QR factorization, 423 full singular value decomposition of a matrix, 631 full-rank matrix, 141 G general linear group, 109 generalized Kronecker symbol, 95 generalized Rayleigh–Ritz quotient, 585 generalized sample variance, 737

978

Linear Algebra Tools for Data Mining (Second Edition)

geometric multiplicity of an eigenvalue, 482 Gershgorin disk, 500 Gershgorin’s theorem, 499 Gram matrix of a sequence of vectors, 409 Gramian of a sequence of vectors, 409 greatest common divisor, 18 group, 19 groupoid, 17 H Hadamard matrix, 194 Hadamard product, 180 Hadamard quotient, 180 Hahn–Banach theorem, 376 Hankel matrix, 120 Hilbert matrix, 322 Hoffman–Wielandt theorem, 542 homogeneous linear system, 157, 332 homogeneous system trivial solution of a, 332 homotety on linear space, 53 Householder matrix of a vector, 398 Huygens’s inertia theorem, 725 hypermatrix, 883 hyperplane, 391 I I-open subsets of a set, 29 idempotent matrix, 115 idempotent morphism of a field, 54 idempotent operation, 16 identity permutation, 6 ill-conditioned linear system, 435 incidence matrix of a collection of sets, 468 independent set extension corollary, 42 index of a square matrix, 152 indexing of matrices, 264 inertia matrix, 534 inertia of a matrix, 533 inertia of a sequence of vectors, 724

inner product, 378 inner product of two vectors, 112 inner product space, 378 instance of the least square problem, 741 interior operator, 28 interior system, 28 interlacing theorem, 539 intra-class scatter matrix, 782 invariant factor of a λ-matrix, 559 invariant subspace of a matrix for an eigenvalue, 481 inverse of a matrix, 108 inverse of an element, 19 inverse of an element relative to an operation, 17 inverse power method, 506 inversion of a sequence of distinct numbers, 4 involutive matrix, 203 isometric invariant, 592 J Jacobi’s identity, 477 Jordan block associated to a complex number, 566 Jordan matrix, 567 Jordan segment, 566 K K-closed subsets of a set, 26 k-combination, 12 k-means algorithm, 693 Karhunen–Loeve basis for a matrix, 815 Kronecker δ, 51 Kronecker difference, 179 Kronecker product, 174 Kronecker sum, 179 Kruskal operator, 958 Kullback–Leibler divergence of two stochastic matrices, 217 Ky Fan’s norm, 685 Ky Fan’s theorem, 494

Index L Lagrange interpolating polynomial, 275 Lagrange interpolation polynomial, 328 Lagrange’s identity, 293, 477 Lanczos decomposition of a matrix, 622 Laplace expansion of a determinant by a column, 289 Laplace expansion of a determinant by a row, 289 leading principal minor, 287 leading submatrix of a matrix, 101 left distributivity laws in a ring, 20 left eigenvector of a matrix, 485 left inverse of a matrix, 148 left quotient of λ-matrices, 555 left remainder of λ-matrices, 555 left singular vector, 631 left value of a λ-matrix, 554 Leibniz formula, 281 Levi-Civita symbol, 11 Lie bracket, 607 linear combination, 36 linear data mapping for a data set, 721 linear dimensionality-reduction mapping, 721 linear feature selection mapping, 721 linear form, 40, 80 linear mapping, 39 induced mapping by a, 869 nullity of a, 47 rank of a, 47 spark of a, 47 linear morphism, 39 linear operator associated to a matrix, 137 linear operator on a linear space, 52 linear regression, 739 linear space dimension of a, 43 free over a set, 51 graded, 186

979 identity morphism of a, 53 of infinite type, 43 quotient space of a, 55 subspace of a, 35 vectors in a, 33 zero endomorphism of a, 53 linear space of finite type, 40 linear space symmetric relative to a norm, 466 linear spaces direct product of, 61 isomorphic, 48 isomorphism of, 48 linear spaces of finite type, 42 linear system basic variables of a, 162 non-basic variables of a, 162 non-principal variables of a, 162 principal variables of a, 162 linear systems back substitution in, 161 linearly dependent set, 37 linearly independent set, 37 loadings in PCA, 775 local Gram matrix, 750 locally linear embedding, 748 logical indexing in MATLAB , 265 lower bandwidth of a matrix, 118 lower Hessenberg matrix, 118 lower triangular matrix, 106 M λ-matrix of type p × q, 552 M-diagonalizable matrix, 155 Mahalanobis metric, 736 main diagonal of a matrix, 99 matrices commuting family of, 511 triple product of, 889 matricization, 900 matrix, 98 LU -decomposition of a, 171 commutation, 208 complex, 121 degree of reducibility of a, 222

980

Linear Algebra Tools for Data Mining (Second Edition)

directed graph of a, 221 Givens, 397 Hermitian, 121 Hermitian adjoint of a, 121 invertible, 108 irreducible, 222 Moore–Penrose pseudoinverse of a, 134 normal, 121 orthogonal, 122, 393 orthonormal, 122, 393 pivot of a row of a, 160 reducible, 222 row echelon form of a, 160 skew-Hermitian, 121 spark of a, 153 spectral radius of a, 498 submatrix, 100 transpose conjugate of a, 121 unitary, 121 matrix column, 98 matrix gallery, 322 matrix of a bilinear form, 587 matrix of loadings, 771 matrix of scores, 771 matrix of weights for locally linear embedding, 748 matrix pencil, 582 matrix product Khatri–Rao, 181 matrix row, 98 matrix series, 371 matrix similarity, 153 mda H-eigenvalue of an, 942 H-eigenvector of an, 942 m-order, n-dimensional, 884 n-mode singular values of an, 945 n-mode vectors of an, 888 n-rank of an, 888 n-singular vector of an, 945 diagonal, 884 eigenpair of an, 942 eigenvalue of an, 941 eigenvector of an, 942

fibers of an, 885 homogeneous polynomial defined by, 940 hypercubic, 884 identity, 884 matricized, 901 norm of an, 907 order of, 884 positive definite, 940, 942 positive semidefinte, 942 set of modes of an, 884 slice of an, 886 spectrum of an, 942 Tucker decomposition of an, 949 unfolding of an, 901 mdas contraction of, 903 inner product of, 907 outer product of, 887 outer product, 182 metric, 339 metric induced by a norm on a linear space, 346 metric space, 339 minimal polynomial of a matrix, 556 minimax inequality for real numbers, 194 Minkowski metric, 346 Minkowski’s inequality, 338 minor of a matrix, 287 mixed covariance of two data sample matrices, 817 monoid, 18 monomial, 22, 302 monotone norm on Cn , 457 multidimensional array, 883–884 dimension vector of a, 884 symmetric, 886 multilinear function, 80 multilinear mapping alternating, 847 skew-symmetric, 848 multiplicative inverse of an element relative to an operation, 17

Index N n-ary operation, 15 n-dimensional ellipsoid, 594 n-mode multiplication of an mda by a matrix, 917 n-mode product of a mda and a matrix, 903 n-mode product of a mda and a vector, 905 negative closed half-space, 391 negative open half-space, 391 negative semidefinite matrix, 406 Newton’s binomial, 191 Newton’s binomial formula, 14 nilpotency of a matrix, 114 nilpotent matrix, 114 NIPALS algorithm, 780 Noether’s first isomorphism theorem, 79 Noether’s second isomorphism theorem, 79 non-defective matrix, 550 non-derogatory matrix, 550 non-negative matrix, 118 non-singular matrix, 141 norm of a linear function, 359 normed linear space, 343 null space of a matrix, 137 numerical rank of a matrix, 646 numerical stability of algorithms, 417 O oblique projection of a vector on a subspace along another subspace, 398 Oldenburger’s theorem, 574 open sphere, 340 operation, 15 opposite element of an element, 17 opposite of a matrix, 104 order of a matrix, 99 order of a minor of a matrix, 287 orthogonal complement of a subspace, 387

981 orthogonal projection of a vector on a subspace, 399 orthogonal set of vectors, 379 orthogonal subspaces, 387 orthonormal set of vectors, 380 outer product, 887 P Paige–Wei CS decomposition theorem, 652 parallelepiped k-dimensional, 871 parallelogram, 871 parallelogram equality, 381 Parseval’s equality, 390 partial least square regression, 746 partitioning of a matrix, 127 Pauli matrices, 218 PCA-guided clustering, 718 PERMn set of permutations of {1, . . . , n}, 6 permutation, 5 permutation diagonal of a matrix, 188 Perron number of a matrix, 598 Perron vector of a matrix, 598 Perron–Frobenius theorem, 597 perturbation, 434 polar form of a matrix, 650 polar form theorem, 650 polynomial discriminant of, 331 homogeneous, 302 polynomials associated, 302 polysemy, 785 poset dual of a, 28 positive closed half-space, 391 positive definite matrix, 405–406 positive matrix, 118 positive open half-space, 391 positive semidefinite matrix, 406 power method for computing eigenvalues, 503 predictor variables, 746

982

Linear Algebra Tools for Data Mining (Second Edition)

principal axes of a quadric, 594 principal cofactor, 290 principal components of W , 769 principal matrix of a pencil, 584 principal minor, 287 principal submatrix of a matrix, 101 problem 3-sat, 960 3-satisfiability, 960 product of matrices, 104 product of permutations, 6 product of polynomials, 21 product of tensors, 842 projection matrix of a subspace, 400 projection of a subspace along another subspace, 667 projections on closed sets theorem, 347 Ptolemy inequality, 466 Pythagora’s theorem, 388 Q QR iterative algorithm, 506 quadratic form, 587 quadric, 592 quaternion, 185 conjugate, 185 quotient of two matrix polynomials, 556 R range of a matrix, 137 rank of a matrix, 138 rational matrix function, 556 Rayleigh–Ritz function, 535 Rayleigh–Ritz theorem, 535 real ellipsoid, 595 real linear space, 34 reciprocal set of vectors, 472 reduced form of a subspace with respect to a unitary matrix, 665 reflection matrix, 395 regression, 739 regressor, 740 regular λ-matrix, 552

regular pencil, 583 representation of a matrix on an invariant subspace, 483 representation of a matrix relative to a basis of an invariant subspace, 664 residual, 766 residual vector, 756 response variables, 746 resultant, 304 right distributivity laws in a ring, 20 right eigenvector of a matrix, 485 right inverse of a matrix, 148 right quotient of λ-matrices, 555 right remainder of λ-matrices, 555 right singular vector, 631 right value of a λ-matrix, 554 ring, 20 ring addition, 20 ring multiplication, 20 root-mean-squared residual, 814 rotation matrix, 395 rotation with a given axis, 471 S S-sequence, 3 sample correlation matrix, 727 sample covariance matrix, 726 sample matrix, 719–720 sample mean, 722 sample standard deviation of a vector, 722 scalar, 33 scalar multiplication, 33 scalar triple product, 438 Schatten norms, 685 Schur’s complement, 323 Schur’s triangularization theorem, 524 second Hirsch’s theorem, 501 Segr`e sequence of a Jordan segment, 566 self-adjoint matrix, 385 semi-simple eigenvalue, 550

Index semidiscrete decomposition of a matrix, 821 semigroup, 17 semimetric, 339 semimetric space, 339 seminorm, 342 semiorthogonal matrix, 463 semiunitary matrix, 463 separate sets in a metric space, 341 separation between the means of the classes, 782 separation of two matrices, 548 sequence length of a, 3 strict, 4 set of generalized eigenvalues of a matrix pencil, 582 set of permutations, 6 set spanned by a subset in a linear space, 36 signature of a matrix, 533 silhouette of a clustering, 705 similarity on a set, 342 similarity space, 342 simple eigenvalue, 486 simple invariant space, 665 simple matrix, 488 simultaneously diagonalizable matrices, 520 singular matrix, 141 singular value of a matrix, 631 skew-symmetric matrix, 99 sparse matrix, 247 spectral decomposition of a normal matrix, 531 spectral decomposition of a square matrix, 484 spectral decomposition of a unitarily diagonalizable matrix, 630 spectral norm of a matrix, 577 spectral resolution of a matrix relative to two subspaces, 667 spectral theorem for Hermitian matrices, 532 spectral theorem for normal matrices, 530

983 spectrum of a matrix, 480 square matrix, 99 square root of a diagonalizable complex matrix, 522 standard deviation of sample matrix, 722 Stewart–Sun CS decomposition, 655 stochastic matrix, 120 strictly lower triangular matrix, 106 strictly upper triangular matrix, 106 strongly non-singular matrix, 327 subadditivity property of norms, 362 subharmonic vector for a matrix, 514 submultiplicative property of matrix norms, 362 suborthogonal matrix, 463 subset closed under a set of operations, 28 subspace equivalence generated by a, 55 quotient set of a, 55 zero, 35 subspaces direct sum of, 62 internal sum of, 61 linearly independent, 89 subunitary matrix, 463 sum of matrices, 103 summation convention, 827 support, 61 support of a mapping defined on a field, 36 support of a sequence, 21 SVD theorem, 630 Sylvester determinant of two polynomials, 304 Sylvester operator of two matrices, 546 Sylvester’s identity, 318 Sylvester’s inertia theorem, 534 Sylvester’s rank theorem, 143 symmetric bilinear form, 587 symmetric gauge function, 458 symmetric matrix, 99 symmetric polynomial, 22 symmetrizer on V ⊗r , 850

984

Linear Algebra Tools for Data Mining (Second Edition)

synonimy, 784 system of linear equations, 157 system of normal equations, 742 T

F-topological linear space, 76–77 tensor, 839 of rank 1, 888 over a linear space, 80, 838 alternating, 847 contravariant, 839 covariant, 839 isotropic, 877 mixed, 839 order of a, 839 realization of a, 888 simple, 888 skew-symmetric, 847 symmetric, 847 type of a, 838 valence of a, 838 tensor on a linear space, 80 tensor product, 829 axioms of, 830 tensor product of linear mappings, 876 tensor rank, 960 tensor space standard basis of a, 840 tensors decomposable, 829 term vector, 785 the full QR factorization theorem, 425 the full-rank factorization theorem, 147 the least square method, 741 the replacement theorem, 42 the thin CS decomposition theorem, 654 the thin SVD decomposition corollary, 638 thin QR factorization, 423 thin SVD decomposition of a matrix, 638

three-way array k-nondegenerate, 889 k-slab of, 889 rank of, 889 Toeplitz matrix, 119 tolerance in MATLAB , 675 total variance, 727 trace norm, 685 trace of a matrix, 116 trailing submatrix of a matrix, 101 translation generated by an element of a linear space, 77 translation in a linear space, 53 transpose of a matrix, 100 transposition, 7 tridiagonal matrix, 118 trivial solution of a homogeneous linear system, 157 Tucker operator, 947 U unary operation, 15 unit matrix, 102 unit of a binary operation, 16 unitarily diagonalizable matrix, 155 unitarily equivalent, 632 unitarily invariant norms, 368 unitarily similar matrices, 153 unitary group of matrices, 122 unitary ring, 21 upper bandwidth of a matrix, 118 upper Hessenberg matrix, 118 upper triangular matrix, 106 V Vandermonde determinant, 290 variance of a vector, 726 vector model, 785 vector normal to a hyperplane, 391 vector triple product, 439 vectorial matrix norm, 361 vectorization, 116 vectors contravariant, 68 covariant, 68

Index

985

Vi´ete’s theorem, 23 volume of a matrix, 688

Woodbury–Sherman–Morrison identity, 196

W

Z

Wedderburn’s theorem, 151 wedge product, 857 well-conditioned linear system, 435 Weyl’s theorem, 616 Weyr’s theorem, 577 width of a cut decomposition, 461

zero linear space, 34 zero matrix, 102 zero of a binary operation, 16 zero polynomial, 21 zero-ary operation, 15 zero-extension of a linear form, 70