Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
Library of Congress Cataloging-in-Publication Data
Names: Simovici, Dan A., author.
Title: Linear algebra tools for data mining / Dan A. Simovici, University of Massachusetts Boston, USA, Dana-Farber Cancer Institute, USA.
Description: Second edition. | New Jersey : World Scientific Publishing Co. Pte. Ltd., [2023] | Includes bibliographical references and index.
Identifiers: LCCN 2022062233 | ISBN 9789811270338 (hardcover) | ISBN 9789811270345 (ebook for institutions) | ISBN 9789811270352 (ebook for individuals)
Subjects: LCSH: Data mining. | Parallel processing (Electronic computers) | Computer algorithms. | Linear programming.
Classification: LCC QA76.9.D343 S5947 2023 | DDC 006.3/12--dc23/eng/20230210
LC record available at https://lccn.loc.gov/2022062233
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
Copyright © 2023 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
For any available supplementary material, please visit https://www.worldscientific.com/worldscibooks/10.1142/13248#t=suppl Desk Editors: Balasubramanian/Steven Patt Typeset by Stallion Press Email: [email protected] Printed in Singapore
To my wife, Doina, and to the memory of my brother, Dr. George Simovici
Preface
Linear algebra plays an increasingly important role in data mining and pattern recognition research, either directly or through the applications of linear algebra in graph theory and optimization. Linear algebra-based algorithms are elegant and fast, are based on a common mathematical doctrine with its collection of basic ideas and techniques, and are easy to implement; they are especially suitable for parallel and distributed computation to approach large-scale challenging problems such as searching and extracting patterns from the entire web. Thus, the application of linear algebra-based techniques in data mining and machine learning research constitutes an increasingly attractive area. Many linear algebra results are important for their applications in biology, chemistry, psychology, and sociology. The standard undergraduate education of a computer scientist includes one or, rarely, two semesters of linear algebra, which is woefully inadequate for a researcher in data mining or pattern recognition. Even a casual review of publications in these disciplines convincingly demonstrates the use of quite sophisticated tools from linear algebra, optimization, probabilities, functional analysis, and other areas. Linear algebra and its field of applications are constantly growing, and this volume is a mere introduction to a life-long study. A mathematical background is essential for understanding current data mining and pattern recognition research and for conducting research in these disciplines. Therefore, this book was constructed to provide this background and to present a volume of applications
that will attract the reader to the study of their mathematical basis. We do not focus on the numerical aspects of the algorithms, particularly error sensitivity, because this extremely important topic has been treated in a vast body of literature in numerical analysis and is not specific to data mining applications. Among the data mining applications we discuss are the k-means algorithm and several of its relaxations, principal component analysis and singular value decomposition for data dimension reduction, biplots, non-negative matrix factorization for unsupervised and semisupervised learning, and latent semantic indexing. Preparing the second edition of this volume involved correcting the existing text, considerable rewriting, and introducing new major topics: tensors, exterior algebra, and multidimensional arrays. The intended readership consists of graduate students and researchers who work in data mining and pattern recognition. I strived to make this volume as self-contained as possible. The reader interested in applications will find in this volume most of the mathematical background that is currently needed. There are few routine exercises; most of them support the material presented in the main sections of each chapter, and there are more than 600 exercises and supplements. Special thanks are due to the librarians of the Joseph Healy Library at the University of Massachusetts Boston whose dedicated and timely help was essential in completing this project. I gratefully acknowledge the support provided by MathWorks Inc. from Natick, Massachusetts. Their book program provided the license for various components of MATLAB, the paramount current tool for linear algebra computations, which we use in this book. Last, but not least, I wish to thank my wife, Doina, a source of strength and loving support.
About the Author
Dan Simovici is a Professor of Computer Science and the Director of the Computer Science Graduate Program at the University of Massachusetts Boston. His main research interests are in the applications of algebraic and information-theoretical methods in data mining and machine learning. He has published over 200 research papers and several books whose subjects range from theoretical computer science to applications of mathematical methods in machine learning. Dr. Simovici served as a visiting professor at Tohoku University in Japan and the University of Lille, France, and he is currently the Editor-in-Chief of the Journal for Multiple-Valued Logic and Soft Computing.
Contents

Preface
About the Author

1. Preliminaries
   1.1 Introduction
   1.2 Functions
   1.3 Sequences
   1.4 Permutations
   1.5 Combinatorics
   1.6 Groups, Rings, and Fields
   1.7 Closure and Interior Systems
   Exercises and Supplements
   Bibliographical Comments

2. Linear Spaces
   2.1 Introduction
   2.2 Linear Spaces
   2.3 Linear Independence
   2.4 Linear Mappings
   2.5 Bases in Linear Spaces
   2.6 Isomorphisms of Linear Spaces
   2.7 Constructing Linear Spaces
   2.8 Dual Linear Spaces
   2.9 Topological Linear Spaces
   2.10 Isomorphism Theorems
   2.11 Multilinear Functions
   Exercises and Supplements
   Bibliographical Comments

3. Matrices
   3.1 Introduction
   3.2 Matrices with Arbitrary Elements
   3.3 Fields and Matrices
   3.4 Invertible Matrices
   3.5 Special Classes of Matrices
   3.6 Partitioned Matrices and Matrix Operations
   3.7 Change of Bases
   3.8 Matrices and Bilinear Forms
   3.9 Generalized Inverses of Matrices
   3.10 Matrices and Linear Transformations
   3.11 The Notion of Rank
   3.12 Matrix Similarity and Congruence
   3.13 Linear Systems and LU Decompositions
   3.14 The Row Echelon Form of Matrices
   3.15 The Kronecker and Other Matrix Products
   3.16 Outer Products
   3.17 Associative Algebras
   Exercises and Supplements
   Bibliographical Comments

4. MATLAB Environment
   4.1 Introduction
   4.2 The Interactive Environment of MATLAB
   4.3 Number Representation and Arithmetic Computations
   4.4 Matrices and Multidimensional Arrays
   4.5 Cell Arrays
   4.6 Solving Linear Systems
   4.7 Control Structures
   4.8 Indexing
   4.9 Functions
   4.10 Matrix Computations
   4.11 Matrices and Images in MATLAB
   Exercises and Supplements
   Bibliographical Comments

5. Determinants
   5.1 Introduction
   5.2 Determinants and Multilinear Forms
   5.3 Cramer's Formula
   5.4 Partitioned Matrices and Determinants
   5.5 Resultants
   5.6 MATLAB Computations
   Exercises and Supplements
   Bibliographical Comments

6. Norms and Inner Products
   6.1 Introduction
   6.2 Basic Inequalities
   6.3 Metric Spaces
   6.4 Norms
   6.5 The Topology of Normed Linear Spaces
   6.6 Norms for Matrices
   6.7 Matrix Sequences and Matrix Series
   6.8 Conjugate Norms
   6.9 Inner Products
   6.10 Hyperplanes in Rn
   6.11 Unitary and Orthogonal Matrices
   6.12 Projection on Subspaces
   6.13 Positive Definite and Positive Semidefinite Matrices
   6.14 The Gram–Schmidt Orthogonalization Algorithm
   6.15 Change of Bases Revisited
   6.16 The QR Factorization of Matrices
   6.17 Matrix Groups
   6.18 Condition Numbers for Matrices
   6.19 Linear Space Orientation
   6.20 MATLAB Computations
   Exercises and Supplements
   Bibliographical Comments

7. Eigenvalues
   7.1 Introduction
   7.2 Eigenvalues and Eigenvectors
   7.3 The Characteristic Polynomial of a Matrix
   7.4 Spectra of Hermitian Matrices
   7.5 Spectra of Special Matrices
   7.6 Geometry of Eigenvalues
   7.7 Spectra of Kronecker Products and Sums
   7.8 The Power Method for Eigenvalues
   7.9 The QR Iterative Algorithm
   7.10 MATLAB Computations
   Exercises and Supplements
   Bibliographical Comments

8. Similarity and Spectra
   8.1 Introduction
   8.2 Diagonalizable Matrices
   8.3 Matrix Similarity and Spectra
   8.4 The Sylvester Operator
   8.5 Geometric versus Algebraic Multiplicity
   8.6 λ-Matrices
   8.7 The Jordan Canonical Form
   8.8 Matrix Norms and Eigenvalues
   8.9 Matrix Pencils and Generalized Eigenvalues
   8.10 Quadratic Forms and Quadrics
   8.11 Spectra of Positive Matrices
   8.12 Spectra of Positive Semidefinite Matrices
   8.13 MATLAB Computations
   Exercises and Supplements
   Bibliographical Comments

9. Singular Values
   9.1 Introduction
   9.2 Singular Values and Singular Vectors
   9.3 Numerical Rank of Matrices
   9.4 Updating SVDs
   9.5 Polar Form of Matrices
   9.6 CS Decomposition
   9.7 Geometry of Subspaces
   9.8 Spectral Resolution of a Matrix
   9.9 MATLAB Computations
   Exercises and Supplements
   Bibliographical Comments

10. The k-Means Clustering
   10.1 Introduction
   10.2 The k-Means Algorithm and Convexity
   10.3 Relaxation of the k-Means Problem
   10.4 SVD and Clustering
   10.5 Evaluation of Clusterings
   10.6 MATLAB Computations
   Exercises and Supplements
   Bibliographical Comments

11. Data Sample Matrices
   11.1 Introduction
   11.2 The Sample Matrix
   11.3 Biplots
   Exercises and Supplements
   Bibliographical Comments

12. Least Squares Approximations and Data Mining
   12.1 Introduction
   12.2 Linear Regression
   12.3 The Least Square Approximation and QR Decomposition
   12.4 Partial Least Square Regression
   12.5 Locally Linear Embedding
   12.6 MATLAB Computations
   Exercises and Supplements
   Bibliographical Comments

13. Dimensionality Reduction Techniques
   13.1 Introduction
   13.2 Principal Component Analysis
   13.3 Linear Discriminant Analysis
   13.4 Latent Semantic Indexing
   13.5 Recommender Systems and SVD
   13.6 Metric Multidimensional Scaling
   13.7 Procrustes Analysis
   13.8 Non-negative Matrix Factorization
   Exercises and Supplements
   Bibliographical Comments

14. Tensors and Exterior Algebras
   14.1 Introduction
   14.2 The Summation Convention
   14.3 Tensor Products of Linear Spaces
   14.4 Tensors on Inner Product Spaces
   14.5 Contractions
   14.6 Symmetric and Skew-Symmetric Tensors
   14.7 Exterior Algebras
   14.8 Linear Mappings between Spaces SKS_{V,k}
   14.9 Determinants and Exterior Algebra
   Exercises and Supplements
   Bibliographical Comments

15. Multidimensional Array and Tensors
   15.1 Introduction
   15.2 Multidimensional Arrays
   15.3 Outer Products
   15.4 Tensor Rank
   15.5 Matricization and Vectorization
   15.6 Inner Product and Norms
   15.7 Evaluation of a Set of Bilinear Forms
   15.8 Matrix Multiplications and Arrays
   15.9 MATLAB Computations
   15.10 Hyperdeterminants
   15.11 Eigenvalues and Singular Values
   15.12 Decomposition of Tensors
   15.13 Approximation of mdas
   Exercises and Supplements
   Bibliographical Comments

Bibliography
Index
Chapter 1
Preliminaries
1.1 Introduction
We include in this chapter a presentation of basic set-theoretical notions: functions, permutations, and algebraic structures that are intended to make this work as self-contained as possible. We assume that the reader is familiar with certain important classes of relations such as equivalences and partial orders, as presented, for example, in [53] or [152]. Also, we assume familiarity with basic notions and results concerning directed and undirected graphs.
We use the following standard notations for several numerical sets:

C       the set of complex numbers
R       the set of real numbers
R≥0     the set of non-negative real numbers
R>0     the set of positive real numbers
R≤0     the set of non-positive real numbers
R̂≥0    the set R≥0 ∪ {+∞}
R̂>0    the set R>0 ∪ {+∞}
I       the set of irrational numbers
N       the set of natural numbers
1.2 Functions
In this section, we review several types of functions and discuss some results that are used in later chapters.
The set of functions defined on S and ranging on T is denoted by S −→ T. If f belongs to this set of functions, we write f : S −→ T.
The image of a subset P of S under a function f : S −→ T is the set f(P) = {t ∈ T | t = f(p) for some p ∈ P}. It is easy to verify that for any two subsets P, Q of S we have
f(P ∪ Q) = f(P) ∪ f(Q),
f(P ∩ Q) ⊆ f(P) ∩ f(Q).
The preimage of a subset U of T under a function f : S −→ T is the set f⁻¹(U) = {s ∈ S | f(s) ∈ U}. Again, it is easy to verify that for any two subsets U, V of T, we have
f⁻¹(U ∪ V) = f⁻¹(U) ∪ f⁻¹(V),
f⁻¹(U ∩ V) = f⁻¹(U) ∩ f⁻¹(V).

Definition 1.1. Let S, T be two sets. A function f : S −→ T is:
(i) an injection if f(s) = f(s′) implies s = s′;
(ii) a surjection if for every t ∈ T there exists s ∈ S such that f(s) = t;
(iii) a bijection, if it is both an injection and a surjection.

The set S −→ S contains the identity function 1S defined by 1S(s) = s for s ∈ S.

Definition 1.2. Let S, T, U be three sets and let f : S −→ T and g : T −→ U be two functions. The composition or the product of f and g is the function gf : S −→ U defined by (gf)(s) = g(f(s)) for s ∈ S.

Note that if f : S −→ T, then f 1S = f and 1T f = f. It is easy to verify that for four sets S, T, U, V and f : S −→ T, g : T −→ U, h : U −→ V, we have h(gf) = (hg)f (associativity of function composition).

Theorem 1.1. A function f : S −→ T is a surjection if and only if for any two functions g1 : T −→ U and g2 : T −→ U the equality g1 f = g2 f implies g1 = g2.
Proof. Let f : S −→ T be a surjection such that g1 f = g2 f, and let t ∈ T. There exists s ∈ S such that t = f(s). Then g1(t) = g1(f(s)) = (g1 f)(s) = (g2 f)(s) = g2(f(s)) = g2(t) for every t ∈ T, which implies g1 = g2.
Conversely, suppose that g1 f = g2 f implies g1 = g2 for any g1, g2, and assume that f is not a surjection. Then there exists t ∈ T such that t ≠ f(s) for every s ∈ S. Since t ∉ f(S), it is possible to define g1, g2 : T −→ U such that g1(t) ≠ g2(t), yet g1(f(s)) = g2(f(s)) for every s ∈ S. This contradicts the initial supposition, so f must be a surjection.
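As a quick numerical illustration of the image and preimage identities stated earlier in this section, the following MATLAB sketch (ours, not part of the book; the function f and the sets P, Q, U, V are arbitrary choices) represents a function f : {1, . . . , 5} −→ {1, . . . , 4} as the vector of its values and checks the two equalities and the inclusion:

    f = [2 2 3 1 3];                       % f(1) = 2, f(2) = 2, f(3) = 3, f(4) = 1, f(5) = 3
    P = [1 2 4]; Q = [2 3 5];              % two subsets of the domain
    imP = unique(f(P)); imQ = unique(f(Q));
    isequal(unique(f(union(P, Q))), union(imP, imQ))                 % f(P u Q) = f(P) u f(Q)
    all(ismember(unique(f(intersect(P, Q))), intersect(imP, imQ)))   % f(P n Q) included in f(P) n f(Q)
    preim = @(W) find(ismember(f, W));     % preimage of a subset W of the codomain
    U = [2 3]; V = [1 3];
    isequal(preim(intersect(U, V)), intersect(preim(U), preim(V)))   % preimages commute with intersection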
A similar result holds for injections:

Theorem 1.2. A function f : S −→ T is an injection if and only if for any two functions k1 : X −→ S and k2 : X −→ S the equality f k1 = f k2 implies k1 = k2.

Proof. The argument is left to the reader.

1.3 Sequences
Definition 1.3. Let S be a set. An S-sequence of length n is a function s : {0, . . . , n − 1} −→ S. The number n is the length of the sequence s and is denoted as |s|. The set of S-sequences of length n is denoted as Seqn(S); the set of all sequences on S is Seq(S) = ⋃_{n≥0} Seqn(S).

An S-sequence s of length n is denoted by (s(0), . . . , s(n − 1)). The element s(j) is the jth symbol of s.

Example 1.1. If S = {↑, ↓, −}, then (↑, ↑, −, ↓, −) is an S-sequence of length 5.
For a finite set S, with |S| = m, there exist m^n S-sequences of length n. In particular, taking n = 0 it follows that there exists a unique S-sequence of length 0 denoted as λ = () and referred to as the null sequence.
If u ∈ Seqp(S), v ∈ Seqq(S), u = (u0, . . . , u_{p−1}), and v = (v0, . . . , v_{q−1}), the concatenation of u and v is the sequence uv given by uv = (u0, . . . , u_{p−1}, v0, . . . , v_{q−1}). This means that |uv| = |u| + |v|.
It is easy to verify that sequence concatenation is associative. This means that if u, v, w ∈ Seq(S), we have (uv)w = u(vw), as the reader can easily verify. The null sequence plays the role of the unit element, that is, uλ = λu = u. If S contains more than one element, then sequence concatenation is not commutative, as the next example shows.

Example 1.2. Let S = {0, 1}, u = (0, 1, 1), and v = (1, 0, 0, 1). We have uv = (0, 1, 1, 1, 0, 0, 1) and vu = (1, 0, 0, 1, 0, 1, 1). It is clear that uv ≠ vu.

If s = (j1, . . . , jn) ∈ Seq(N), we refer to each pair (jp, jq) such that p < q and jp > jq as an inversion. The set of inversions of s is denoted by INV(s) and |INV(s)| is denoted by inv(s).

Definition 1.4. For S ⊆ N, a sequence s ∈ Seq(N) is strict if s = (a0, . . . , a_{n−1}) and a0 < a1 < · · · < a_{n−1}, where n ≥ 1.

In other words, a sequence s ∈ Seq(N) is strict if it consists of distinct elements and inv(s) = 0.
The sequence obtained from the sequence s ∈ Seq(N) by sorting its elements in increasing order is denoted as (s).

Example 1.3. If s = (4, 1, 8, 2, 1, 6), then (s) = (1, 1, 2, 4, 6, 8).
If s and t are two strict sequences in Seq(N), their concatenation is not strict, in general, even if they have no elements in common.

Example 1.4. Let s = (1, 5, 8) and t = (2, 7, 9, 11). The sequence st = (1, 5, 8, 2, 7, 9, 11) is clearly not strict and we have inv(st) = 3. The sequence (st) is (1, 2, 5, 7, 8, 9, 11).

If s and t are two strict sequences, the number of inversions in the sequence st is denoted by inv(s, t).

Theorem 1.3. Let s, t, u ∈ Seq(N). We have
inv((st), (u)) = inv(s, u) + inv(t, u),
inv((s), (tu)) = inv(s, t) + inv(s, u).

Proof. Since (st) and (u) are sorted sequences, there are no inversions in (st) or in (u). Therefore, inversions in (st)(u) may occur only because an inversion occurs between a component of s and a component of (u), or between a component of t and a component of (u). This justifies the first equality. The second equality has a similar argument.

1.4 Permutations
The notion of permutation that we discuss in this section is essential for the study of determinants and exterior algebras. Definition 1.5. A permutation of a set S is a bijection φ : S −→ S. A permutation φ of a finite set S = {s1 , . . . , sn } is completely described by the sequence (φ(s1 ), . . . , φ(sn )). No two distinct components of such a sequence may be equal because of the injectivity of φ, and all elements of the set S appear in this sequence because φ is surjective. Therefore, the number of permutations equals the number of such sequences, so there are n(n − 1) · · · 2 · 1 permutations of a finite set S with |S| = n. The number n(n − 1) · · · 2 · 1 is usually denoted by n!. This notation is extended by defining 0! = 1, which is consistent with the interpretation of n! as the number of bijections of a set that has n elements.
The set of permutations of the set Sn = {1, . . . , n} is denoted by PERMn. If φ ∈ PERMn is such a permutation, we write

    φ:  1   ···  i   ···  n
        a1  ···  ai  ···  an

where ai = φ(i) for 1 ≤ i ≤ n. To simplify the notation, we shall specify φ just by the sequence of distinct numbers s = (a1, . . . , ai, . . . , an). The permutation ιn ∈ PERMn is defined as ιn(j) = j for 1 ≤ j ≤ n and is known as the identity permutation.

Definition 1.6. Let φ, ψ ∈ PERMn be two permutations. Their composition or product is the permutation ψφ defined by (ψφ)(j) = ψ(φ(j)) for 1 ≤ j ≤ n.

It is clear that the composition of two permutations of Sn is a permutation of Sn.

Example 1.5. Let ψ, φ ∈ PERM4 be the permutations

    φ:  1 2 3 4        ψ:  1 2 3 4
        3 1 4 2            4 2 1 3

The permutations ψφ and φψ are given by

    ψφ:  1 2 3 4       φψ:  1 2 3 4
         1 4 3 2            2 1 3 4

Thus, the composition of permutations is not commutative.
Note that every permutation φ ∈ PERMn has an inverse φ⁻¹ because φ is a bijection.

Example 1.6. Let φ ∈ PERM4 be the permutation

    φ:  1 2 3 4
        3 1 4 2

Its inverse is

    φ⁻¹:  1 2 3 4
          2 4 1 3
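In MATLAB (the computational environment used later in the book), permutations in one-line notation can be composed and inverted by array indexing. The following sketch (ours, not from the book) reproduces Examples 1.5 and 1.6:

    phi = [3 1 4 2];  psi = [4 2 1 3];     % the permutations of Example 1.5
    psiphi = psi(phi)                      % (psi phi)(j) = psi(phi(j)) -> [1 4 3 2]
    phipsi = phi(psi)                      % -> [2 1 3 4], so composition is not commutative
    phiinv(phi) = 1:numel(phi);            % inverse: phiinv(phi(j)) = j
    phiinv                                 % -> [2 4 1 3], as in Example 1.6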
Theorem 1.4. Let PERMn = {φ1, . . . , φn!}. If ψ ∈ PERMn, then {ψφ1, . . . , ψφn!} = PERMn.

Proof. Note that ψφp = ψφq implies φp = φq for 1 ≤ p, q ≤ n!. The statement follows immediately.

Definition 1.7. Let S be a finite set, φ be a permutation of S, and x ∈ S. The cycle of x is the set of elements of the form Cφ,x = {φ^i(x) | i ∈ N}. The number |Cφ,x| is the length of the cycle. Cycles of length 1 are said to be trivial.

Theorem 1.5. The cycles of a permutation φ of a finite set S form a partition πφ of S.

Proof. Let S be a finite set. Since Cφ,x ⊆ S, it is clear that Cφ,x is a finite set. If |Cφ,x| = ℓ, then Cφ,x = {x, φ(x), . . . , φ^{ℓ−1}(x)}. Note that each pair of elements φ^i(x) and φ^j(x) is distinct for 0 ≤ i, j ≤ ℓ − 1 and i ≠ j, because otherwise we would have |Cφ,x| < ℓ. Moreover, φ^ℓ(x) = x.
If z ∈ Cφ,x, then z = φ^k(x) for some k, 0 ≤ k ≤ ℓ − 1, where ℓ = |Cφ,x|. Since x = φ^ℓ(x), it follows that x = φ^{ℓ−k}(z), which shows that x ∈ Cφ,z. Thus, Cφ,x = Cφ,z.

Definition 1.8. A k-cyclic permutation or a cycle of a finite set S is a permutation φ such that πφ consists of a cycle of length k and a number of |S| − k cycles of length 1. A transposition of S is a 2-cyclic permutation.

Note that if φ is a transposition of S, then φ² = 1S.

Theorem 1.6. Let S be a finite set, φ be a permutation, and πφ = {Cφ,x1, . . . , Cφ,xm} be the cycle partition associated with φ. Define the cyclic permutations ψ1, . . . , ψm of S as

    ψp(t) = φ(t)  if t ∈ Cφ,xp,
            t     otherwise.

Then, ψp ψq = ψq ψp for every p, q such that 1 ≤ p, q ≤ m.
Proof. Observe first that u ∈ Cφ,x if and only if φ(u) ∈ Cφ,x for any cycle Cφ,x. We can assume that p ≠ q. Then the cycles Cφ,xp and Cφ,xq are disjoint.
If u ∉ Cφ,xp ∪ Cφ,xq, then ψp(ψq(u)) = ψp(u) = u and ψq(ψp(u)) = ψq(u) = u.
Suppose now that u ∈ Cφ,xp − Cφ,xq. We have ψp(ψq(u)) = ψp(u) = φ(u). On the other hand, ψq(ψp(u)) = ψq(φ(u)) = φ(u) because φ(u) ∉ Cφ,xq. Thus, ψp(ψq(u)) = ψq(ψp(u)). The case where u ∈ Cφ,xq − Cφ,xp is treated similarly. Also, note that Cφ,xp ∩ Cφ,xq = ∅, so, in all cases, we have ψp(ψq(u)) = ψq(ψp(u)).

The set of cycles {ψ1, . . . , ψm} is the cyclic decomposition of the permutation φ.

Definition 1.9. An adjacent transposition is a transposition that changes the places of two adjacent elements.

Example 1.7. The permutation φ ∈ PERM5 given by

    φ:  1 2 3 4 5
        1 3 2 4 5

is an adjacent transposition of the set {1, 2, 3, 4, 5} because it changes the position of the elements 2 and 3. On the other hand, the permutation

    ψ:  1 2 3 4 5
        1 5 3 4 2

is a transposition but not an adjacent transposition of the same set because the pair of elements involved are not consecutive.

Theorem 1.7. Any transposition of a finite set is a product of adjacent transpositions.

Proof. Let φk,ℓ be a transposition that swaps k and ℓ, where k < ℓ. We can move k to ℓ one step at a time and then move ℓ to where k was:

    φk,ℓ = φ_{k,k+1} φ_{k+1,k+2} · · · φ_{ℓ−1,ℓ} φ_{ℓ−2,ℓ−1} · · · φ_{k,k+1}.
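The factorization in the proof of Theorem 1.7 can be checked numerically. The sketch below (ours; the values of n, k, and l are arbitrary) represents permutations in one-line notation, composes them by array indexing, and verifies that the product of adjacent transpositions equals the transposition that swaps k and l:

    n = 6; k = 2; l = 5;
    adj = @(i) [1:i-1, i+1, i, i+2:n];     % adjacent transposition (i, i+1) in one-line notation
    factors = [k:l-1, l-2:-1:k];           % indices of the adjacent transpositions in the product
    P = 1:n;                               % start from the identity
    for i = factors
        a = adj(i);
        P = P(a);                          % multiply on the right by adj(i)
    end
    target = 1:n; target([k l]) = [l k];   % the transposition that swaps k and l
    isequal(P, target)                     % true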
For a permutation φ ∈ PERMn specified by the sequence s = (j1, . . . , jn), the set INV(s) is denoted by INV(φ) and |INV(φ)| is denoted by inv(φ).

Definition 1.10. The sign of a permutation is the function sign : PERMn −→ {−1, 1} defined as sign(φ) = (−1)^{inv(φ)}.

If sign(φ) = 1, we say that φ is an even permutation; otherwise, that is, if sign(φ) = −1, we refer to φ as an odd permutation.

Theorem 1.8. If φ, ψ ∈ PERMn, then sign(φψ) = sign(φ)sign(ψ).

Proof. The theorem can be proven by showing that inv(φ) + inv(ψ) has the same parity as inv(φψ). Suppose that

    φ:   1      2      ···   n
         φ(1)   φ(2)   ···   φ(n)

and

    ψφ:  φ(1)       φ(2)       ···   φ(n)
         ψ(φ(1))    ψ(φ(2))    ···   ψ(φ(n))

and consider the following cases:
(i) if i < j, φ(i) < φ(j), and ψ(φ(i)) < ψ(φ(j)), then no inversions are generated in φ, ψ, and ψφ;
(ii) if i < j, φ(i) < φ(j), and ψ(φ(i)) > ψ(φ(j)), φ has no inversion, ψ has an inversion, and so does ψφ;
(iii) if i < j, φ(i) > φ(j), and ψ(φ(i)) > ψ(φ(j)), φ has an inversion, ψ has no inversion, and ψφ has an inversion;
(iv) if i < j, φ(i) > φ(j), and ψ(φ(i)) < ψ(φ(j)), then φ has an inversion, ψ has an inversion, and ψφ has no inversions.
In each of these cases, inv(φ) + inv(ψ) differs from inv(ψφ) by an even number.

A descent of a sequence with distinct numbers s = (j1, . . . , jn) is a number k such that 1 ≤ k ≤ n − 1 and jk > jk+1. The set of descents of s is denoted by D(s) and the set of descents of a permutation φ specified by s is denoted by D(φ).
Example 1.8. Let φ ∈ PERM6 be

    φ:  1 2 3 4 5 6
        4 2 5 1 6 3

We have INV(φ) = {(4, 2), (4, 1), (4, 3), (2, 1), (5, 1), (5, 3), (6, 3)}, and inv(φ) = 7. Furthermore, D(φ) = {1, 3, 5}.

It is easy to see that the following conditions are equivalent for a permutation φ of the finite set S:
(i) φ = 1S;
(ii) inv(φ) = 0;
(iii) D(φ) = ∅.

Theorem 1.9. Every permutation φ ∈ PERMn is a composition of transpositions.

Proof. If D(φ) = ∅, then φ = 1_{1,...,n} and the statement is vacuous. Suppose therefore that D(φ) ≠ ∅, and let k ∈ D(φ), which means that (jk, jk+1) is an inversion of φ. Let ψ be the adjacent transposition that exchanges jk and jk+1. It is clear that inv(ψφ) = inv(φ) − 1. Thus, if ψ1, . . . , ψp are the transpositions that correspond to all adjacent inversions of φ, where p = inv(φ), it follows that ψp · · · ψ1 φ has 0 inversions and, as observed above, ψp · · · ψ1 φ = 1S. Since ψ² = 1S for every transposition ψ, we have φ = ψ1 · · · ψp, which gives the desired conclusion.
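A direct count of inversions also yields the sign and the descent set of a permutation. The following MATLAB sketch (ours, not from the book) recomputes the quantities of Example 1.8:

    phi = [4 2 5 1 6 3];                   % the permutation of Example 1.8
    n = numel(phi);
    invCount = 0;
    for p = 1:n-1
        for q = p+1:n
            if phi(p) > phi(q)             % (phi(p), phi(q)) is an inversion
                invCount = invCount + 1;
            end
        end
    end
    invCount                               % inv(phi) = 7
    signPhi = (-1)^invCount                % sign(phi) = -1
    descents = find(phi(1:end-1) > phi(2:end))   % D(phi) = [1 3 5]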
This follows immediately from Theorem 1.9.
Theorem 1.10. If φ is a permutation of the finite set S, then inv(φ) is the least number of adjacent transpositions that constitute a factorization of φ, and the number of adjacent transpositions involved in any other factorization of φ as a product of adjacent transpositions differs from inv(φ) by an even number.
Preliminaries
11
Proof. Let φ = ψq · · · ψ1 be a factorization of φ as a product of adjacent transpositions. Then ψ1 · · · ψq φ = 1S , and we can define the sequence of permutations φ = ψ · · · ψ1 φ for 1 q. Since each ψi is an adjacent transposition, we have inv(φ+1 ) − inv(φ ) = 1 or inv(φ+1 ) − inv(φ ) = −1. If |{ | 1 q − 1 and inv(φ+1 ) − inv(φ ) = 1}| = r, then |{ | 1 q − 1 and inv(φ+1 ) − inv(φ ) = −1}| = q − r, so inv(φ) + r − (q − r) = 0, which means that q = inv(φ) + 2r. This implies the desired conclusion. Theorem 1.11. A transposition is an odd permutation. Proof.
Suppose that φ ∈ PERMn is the transposition 1 ··· i ··· j ··· n φ: , 1 ··· j ··· i··· n
so j > i. If p and q form an inversion in φ, it follows that we must have i p < q j. For i < p < q < j, the pair (p, q) has no contribution to inversions. If i < q ≤ j, the pair (i, q) contributes j − i inversions because φ(i) = j > φ(q). If i p < j and φ(q) > i = φ(j), there are (j − 1) − i inversions. Thus, the total number of inversions of φ is (j − i) + (j − 1) − i = 2(j − i) − 1 and this number is odd. Next, we introduce the Levi-Civita symbols that are useful in the study of permutations. Definition 1.11. Let (i1 · · · in ) be a permutation of the set {1, 2, . . . , n}; The Levi-Civita symbols1 i1 i2 ···in and i1 i2 ···in are defined as 1 Tullio Levi-Civita (March 29, 1873–December 29, 1941) was an Italian mathematician, well-known for his work on tensor calculus and its applications to the theory of relativity. His work included foundational papers in both pure and applied mathematics, celestial mechanics, analytic mechanics, and hydrodynamics. He was born in Padua, graduated in 1892 from the University of Padua Faculty of Mathematics where he became a professor in 1898. In 1918, be was appointed at the University of Rome.
(i) ε_{1 2 ··· n} = ε^{1 2 ··· n} = 1;
(ii) ε_{··· ip ··· iq ···} = −ε_{··· iq ··· ip ···} and ε^{··· ip ··· iq ···} = −ε^{··· iq ··· ip ···} (the antisymmetry property);
(iii) when all indices i1, . . . , in are distinct, we have ε_{i1 ··· in} = ε^{i1 ··· in} = (−1)^p, where p = inv(i1 · · · in).
When two indices ip and iq are equal, the antisymmetry of ε_{i1 i2 ··· in} and ε^{i1 i2 ··· in} implies ε_{i1 i2 ··· in} = ε^{i1 i2 ··· in} = 0.

Example 1.9. For the Levi-Civita symbols ε_{i1 i2 i3} with n = 3, we have 27 components. Since an equality of any of the indices implies that the number ε_{i1 i2 i3} is 0, it follows that only six of these components are non-zero. The non-zero values are
ε_{123} = ε_{231} = ε_{312} = 1,
ε_{132} = ε_{213} = ε_{321} = −1.

¹ Tullio Levi-Civita (March 29, 1873–December 29, 1941) was an Italian mathematician, well known for his work on tensor calculus and its applications to the theory of relativity. His work included foundational papers in both pure and applied mathematics, celestial mechanics, analytic mechanics, and hydrodynamics. He was born in Padua, graduated in 1892 from the University of Padua Faculty of Mathematics where he became a professor in 1898. In 1918, he was appointed at the University of Rome.
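The n = 3 Levi-Civita symbol can be tabulated directly from the inversion count of its index sequence. The sketch below (ours, not from the book) fills a 3 × 3 × 3 MATLAB array with the values listed in Example 1.9:

    levi = zeros(3, 3, 3);
    for i = 1:3
      for j = 1:3
        for k = 1:3
          idx = [i j k];
          if numel(unique(idx)) == 3           % zero when two indices coincide
            s = [idx(1) > idx(2), idx(1) > idx(3), idx(2) > idx(3)];
            levi(i, j, k) = (-1)^sum(s);       % (-1)^inv(i j k)
          end
        end
      end
    end
    levi(1, 2, 3), levi(2, 3, 1), levi(3, 1, 2)    % all equal to 1
    levi(1, 3, 2), levi(2, 1, 3), levi(3, 2, 1)    % all equal to -1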
1.5 Combinatorics
Let S be a finite nonempty set, S = {s1, . . . , sn}. We seek to count the sequences of S having length k without repetitions. Suppose initially that k ≥ 1. For the first place in a sequence s of length k, we have n choices. Once an element of S has been chosen for the first place, we have n − 1 choices for the second place because the sequence may not contain repetitions, etc. For the kth component of s, there are n − k + 1 choices. Thus, the number of sequences of length k without repetitions is given by n(n − 1) · · · (n − k + 1). We shall denote this number by A(n, k). There exists only one sequence of length 0, namely the empty sequence, so we extend the definition of A by A(n, 0) = 1 for every n ∈ N.
Let S be a finite set with |S| = n. A k-combination of S is a subset M of S such that |M| = k. Define the equivalence relation ∼ on Seq(S) by s ∼ t if there exists a bijection f such that s = tf. If T is a subset of S such that |T| = k, there exists a bijection t : {0, . . . , k − 1} −→ T; clearly, this is a sequence without repetitions
and there exist A(n, k) such sequences. Note that if u is an equivalent sequence (that is, if t ∼ u), then the range of this sequence is again the set T and there are k! such sequences (due to the existence of the k! permutations f) that correspond to the same set T. Therefore, we may conclude that P_k(S) contains A(n, k)/k! elements. We denote this number by \binom{n}{k} and we refer to it as the (n, k)-binomial coefficient. We can write \binom{n}{k} using factorials as follows:

\binom{n}{k} = A(n, k)/k! = n(n − 1) · · · (n − k + 1)/k!
             = n(n − 1) · · · (n − k + 1)(n − k) · · · 2 · 1 / (k!(n − k)!)
             = n!/(k!(n − k)!).

We mention the following useful identities:

k \binom{n}{k} = n \binom{n − 1}{k − 1},        (1.1)
\binom{n}{m} = (n/m) \binom{n − 1}{m − 1}.      (1.2)

Equality (1.1) can be extended as

k(k − 1) · · · (k − ℓ) \binom{n}{k} = n(n − 1) · · · (n − ℓ) \binom{n − ℓ − 1}{k − ℓ − 1}     (1.3)

for 0 ≤ ℓ ≤ k − 1.
The set of polynomials with complex (real) coefficients in the indeterminate x is denoted by C[x] (respectively, R[x]). Consider now the n-degree polynomial p ∈ R[x]:

p(x) = (x + a0) · · · (x + a_{n−2})(x + a_{n−1}).

The coefficient of x^{n−k} consists of the sum of all monomials of the form a_{i0} · · · a_{i_{k−1}}, where the subscripts i0, . . . , i_{k−1} are distinct. Thus, the coefficient of x^{n−k} contains \binom{n}{k} terms corresponding to the k-element subsets of the set {0, . . . , n − 1}. Consequently, the coefficient of x^{n−k} in the power (x + a)^n can be obtained from the similar
coefficient in p(x) by taking a0 = · · · = a_{n−1} = a; thus, the coefficient is \binom{n}{k} a^k. This allows us to write

(x + a)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n−k} a^k.     (1.4)
This equality is known as Newton's binomial formula and has numerous applications.

Example 1.10. If we take x = a = 1 in Formula (1.4), we obtain the identity

2^n = \sum_{k=0}^{n} \binom{n}{k}.     (1.5)
Note that this equality can be obtained directly by observing that the right member enumerates the subsets of a set having n elements by their cardinality k.
A similar interesting equality can be obtained by taking x = 1 and a = −1 in Formula (1.4). This yields

0 = \sum_{k=0}^{n} (−1)^k \binom{n}{k} = \binom{n}{0} + \binom{n}{2} + \binom{n}{4} + · · · − \binom{n}{1} − \binom{n}{3} − \binom{n}{5} − · · · .

This equality shows that each set contains an equal number of subsets having an even or odd number of elements.

Example 1.11. Consider the equality (x + a)^n = (x + a)^{n−1}(x + a). The coefficient of x^{n−k} a^k in the left member is \binom{n}{k}. In the right member, x^{n−k} a^k has the coefficient \binom{n−1}{k} + \binom{n−1}{k−1}, so we obtain the equality

\binom{n}{k} = \binom{n−1}{k} + \binom{n−1}{k−1},     (1.6)

for 0 ≤ k ≤ n − 1.
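Formulas (1.4), (1.5), and (1.6) are easy to spot-check numerically with MATLAB's nchoosek function. The values of n, x, and a in the sketch below are arbitrary (the sketch itself is ours, not from the book); k starts at 1 in the last check because nchoosek does not accept the argument k − 1 = −1:

    n = 7; x = 2; a = 3;
    lhs = (x + a)^n;
    rhs = sum(arrayfun(@(k) nchoosek(n, k) * x^(n-k) * a^k, 0:n));    % Formula (1.4)
    lhs == rhs
    sum(arrayfun(@(k) nchoosek(n, k), 0:n)) == 2^n                    % Formula (1.5)
    all(arrayfun(@(k) nchoosek(n, k) == nchoosek(n-1, k) + nchoosek(n-1, k-1), 1:n-1))   % Formula (1.6)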
Multinomial coefficients are generalizations of binomial coefficients that can be introduced as follows. The nth power of the sum x1 + · · · + xk can be written as

(x1 + · · · + xk)^n = \sum_{(r1,...,rk)} c(n, r1, . . . , rk) x1^{r1} · · · xk^{rk},

where the sum involves all (r1, . . . , rk) ∈ N^k such that \sum_{i=1}^{k} ri = n. By analogy with the binomial coefficients, we denote c(n, r1, . . . , rk) by \binom{n}{r1, . . . , rk}. As we did with binomial coefficients in Example 1.11, starting from the equality (x1 + · · · + xk)^n = (x1 + · · · + xk)^{n−1}(x1 + · · · + xk), the coefficient of the monomial x1^{r1} · · · xk^{rk} in the left member is \binom{n}{r1, . . . , rk}. In the right member, the same coefficient is

\sum_{i=1}^{k} \binom{n − 1}{r1, . . . , ri − 1, . . . , rk},

so we obtain the identity

\binom{n}{r1, . . . , rk} = \sum_{i=1}^{k} \binom{n − 1}{r1, . . . , ri − 1, . . . , rk},     (1.7)
a generalization of the identity (1.6).
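Identity (1.7) can also be verified numerically. The sketch below (ours) uses the standard factorial expression n!/(r1! · · · rk!) for the multinomial coefficient, which is not derived in the text above but follows from the definition; the values of n and r are arbitrary:

    multi = @(n, r) factorial(n) / prod(factorial(r));   % n! / (r1! ... rk!)
    n = 6; r = [2 3 1];                                  % r1 + ... + rk = n
    lhs = multi(n, r);
    rhs = 0;
    for i = 1:numel(r)
        ri = r; ri(i) = ri(i) - 1;
        if all(ri >= 0)
            rhs = rhs + multi(n - 1, ri);                % the ith term of the sum in (1.7)
        end
    end
    lhs == rhs                                           % true (both equal 60 here)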
1.6 Groups, Rings, and Fields
The notion of operation on a set is needed for introducing various algebraic structures on sets.

Definition 1.12. Let n ∈ N. An n-ary operation on a set S is a function f : S^n −→ S. The number n is the arity of the operation f.

If n = 0, we have the special case of zero-ary operations. A zero-ary operation is a function f : S^0 = {∅} −→ S, which is essentially a constant element of S, f(). Operations of arity 1 are referred to as unary operations.
Binary operations (of arity 2) are frequently used. For example, the union, intersection, and difference of subsets of a set S are binary operations on the set P(S).
If f is a binary operation on a set, we denote the result f(x, y) of the application of f to x, y by xf y rather than f(x, y). We now introduce certain important types of binary operations.

Definition 1.13. A binary operation f on a set S is
(i) associative if (xf y)f z = xf(yf z) for every x, y, z ∈ S,
(ii) commutative if xf y = yf x for every x, y ∈ S, and
(iii) idempotent if xf x = x for every x ∈ S.

Example 1.12. Set union and intersection are both associative, commutative, and idempotent operations on every set of the form P(S). The addition of real numbers "+" is an associative and commutative operation on R; however, "+" is not idempotent.
The binary operation g : R² −→ R given by g(x, y) = (x + y)/2 for x, y ∈ R is a commutative and idempotent operation on R that is not associative. Indeed, we have (xgy)gz = (x + y + 2z)/4 and xg(ygz) = (2x + y + z)/4.

Example 1.13. The binary operations max{x, y} and min{x, y} are associative, commutative, and idempotent operations on the set R.

Next, we introduce special elements relative to a binary operation on a set.
Definition 1.15. Let f be a binary associative operation on S such that f has the unit u. An element x has an inverse relative to f if there exists y ∈ S such that xf y = yf x = u. An element x of S has at most one inverse relative to f . Indeed, suppose that both y and y are inverses of x. Then we have y = yf u = yf (xf y ) = (yf x)f y = uf y = y , which shows that y coincides with y . If the operation f is denoted by “+”, then we refer to the inverse of x as the additive inverse of x, or the opposite element of x; similarly, when f is denoted by “·”, we refer to the inverse of x as the multiplicative inverse of x. The additive inverse of x is usually denoted by −x, while the multiplicative inverse of x is denoted by x−1 . Definition 1.16. An element x of a set S equipped with a binary operation “∗” is idempotent if x ∗ x = x. Observe that every unit and every zero of a binary operation is an idempotent element. Definition 1.17. Let I = {fi |i ∈ I} be a set of operations on a set S indexed by a set I. An algebra type is a mapping θ : I −→ N. An algebra of type θ is a pair A = (A, I) such that (i) A is a set, and (ii) the operation fi has arity θ(i) for every i ∈ I. The algebra A = (A, I) is finite if the set A is finite. The set A is referred to as the carrier of the algebra A. If the indexing set I is finite, we say that the type θ is a finite type and refer to A as an algebra of finite type. If θ : I −→ N is a finite algebra type, we assume, in general, that the indexing set I has the form (0, 1, . . . , n − 1). In this case, we denote θ by the sequence (θ(0), θ(1), . . . , θ(n − 1)). Definition 1.18. A groupoid is an algebra of type (2), A = (A, {f }). If f is an associative operation, then we refer to this algebra as a semigroup. In other words, a groupoid is a set equipped with a binary operation f .
Example 1.14. The algebra (R, {f }) where f (x, y) = x+y is a 2 groupoid. However, it is not a semigroup because f is not an associative operation. Example 1.15. Define the binary operation g on R by xgy = ln(ex + ey ) for x, y ∈ R. Since z
(xgy)gz = ln(exgy+e ) = ln(ex + ey + ez ), xg(ygz) = ln(x + eygz ) = ln(ex + ey + ez ), for every x, y, z ∈ R, it follows that g is an associative operation. Thus, (R, g) is a semigroup. It is easy to verify that this semigroup has no unit element. Definition 1.19. A monoid is an algebra of type (0, 2), A = (A, {e, f }), where e is a zero-ary operation, f is a binary operation, and e is the unit element for f . Example 1.16. Let gcd(m, n) be the greatest common divisor of the numbers m, n ∈ N. The algebras (N, {1, ·}) and (N, {0, gcd}) are monoids. In the first case, the binary operation is the multiplication of natural numbers, the unit element is 1, and the algebra is clearly a monoid. In the second case, the unit element is 0. We claim that gcd is an associative operation. Let m, n, p ∈ N. We need to verify that gcd(m, gcd(n, p)) = gcd(gcd(m, n), p). Let k = gcd(m, gcd(n, p)). Then (k, m) ∈ δ and (k, gcd(n, p)) ∈ δ, where δ is the divisibility relation. Since gcd(n, p) divides evenly both n and p, it follows that (k, n) ∈ δ and (k, p) ∈ δ. Thus, k divides gcd(m, n), and therefore k divides h = gcd(gcd(m, n), p). Conversely, h being gcd(gcd(m, n), p), it divides both gcd(m, n) and p. Since h divides gcd(m, n), it follows that it divides both m and p. Consequently, h divides gcd(n, p) and therefore divides k = gcd(m, gcd(n, p)). Since k and h are both natural numbers that divide each other evenly, it follows that k = h, which allows us to conclude that gcd is an associative operation. Since n divides 0 evenly, for any n ∈ N, it follows that gcd(0, n) = gcd(n, 0) = n, which shows that 0 is the unit for gcd.
Preliminaries
19
Definition 1.20. A group is an algebra of type (0, 2, 1), A = (A, {e, f, h}), where e is a zero-ary operation, f is a binary operation, e is the unit element for f , and h is a unary operation such that f (h(x), x) = f (x, h(x)) = e for every x ∈ A. Note that if we have xf y = yf x = e, then y = h(x). Indeed, we can write h(x) = h(x)f e = h(x)f (xf y) = (h(x)f x)f y = ef y = y. We refer to the unique element h(x) as the inverse of x. The usual notation for h(x) is x−1 . Definition 1.21. A group A = (A, {e, f, h}) is Abelian if xf y = yf x for all x, y ∈ A. Abelian groups are also known as commutative groups. Traditionally, the zero-ary operation of Abelian groups is denoted by 0, the binary operation by “+”, and the inverse of an element x is denoted by −x. Thus, we usually write an Abelian group A as (A, {0, +, −}). Example 1.17. The algebra (Z, {0, +, −}) is an Abelian group where “+” is the usual addition of integers, and the additive inverse of an integer n is −n. Example 1.18. The set of permutations of {1, . . . , n} is the symmetric group Sn = (PERMn , ιn , ·, ), where · stands for the permutation composition. Example 1.19. Let ≡n ⊆ Z × Z be the equivalence relation defined on Z by ≡n = {(p, q) ∈ Z × Z | n evenly divides p − q}. In each equivalence class [p] there exists a least non-negative element m such that 0 m n − 1, due to the properties of the division of integers. This allows us to consider the quotient set Z/ ≡n = {[0], . . . , [n − 1]}. For example, if n = 2, the quotient set contains the classes [0] and [1]. All even integers, belong to the class [0] and all odd integers, to the class [1]. We denote the quotient set Z/ ≡n by Zn . The sum of two equivalence classes [p] and [q] is the class [p + q]. It is easy to verify that
20
Linear Algebra Tools for Data Mining (Second Edition)
this addition is well-defined for if r ∈ [p] and s ∈ [q], then n divides evenly r − p and s − q and, therefore, it divides evenly r + s − (p + q). For example, Z4 consists of {[0], [1], [2], [3]} and the addition is defined by the table + [0] [1] [2] [3]
[0] [0] [1] [2] [3]
[1] [1] [2] [3] [0]
[2] [2] [3] [0] [1]
[3] [3] [0] [1] [2]
It is easy to verify that (Zn , {[0], +, −}) is an Abelian group, where −[p] is [n − p] for 0 p n − 1 and [n] = [0]. Definition 1.22. A ring is an algebra of type (0, 2, 1, 2), A = (A, {e, f, h, g}), such that A = (A, {e, f, h}) is an Abelian group and g is a binary associative operation such that xg(uf v) = (xgu)f (xgv), (uf v)gx = (ugx)f (vgx), for every x, u, v ∈ A. These equalities are known as left and right distributivity laws, respectively. The operation f is known as the ring addition, while · is known as the ring multiplication. Frequently, these operations are denoted by “+” and “·”, respectively. Example 1.20. The algebra (Z, {0, +, −, ·}) is a ring. The distributive laws amount to the well-known distributive properties p · (q + r) = (p · q) + (p · r), (q + r) · p = (q · p) + (r · p), for p, q, r ∈ Z, of integer addition and multiplication. Example 1.21. A more interesting type of ring is defined on the set √ of numbers of the form m + n 2, where m and n are integers. The ring operations are given by √ √ √ (m + n 2) + (p + q 2) = m + p + (n + q) 2, √ √ √ (m + n 2) · (p + q 2) = m · p + 2 · n · q + (m · q + n · p) 2.
Preliminaries
21
If the multiplicative operation of a ring has a unit element 1, then we say that the ring is a unitary ring. We consider a unitary ring as an algebra of type (0, 0, 2, 1, 2) by regarding the multiplicative unit as another zero-ary operation. Observe, for example, that the ring (Z, {0, 1, +, −, ·}) is a unitary ring. Also, note that the set of even numbers also generates a ring ({2k | k ∈ Z}, {0, +, −, ·}). However, no multiplicative unit exists in this ring. Example 1.22. Let (S, {0, 1, +, −, ·}) be a commutative ring and let s = (s0 , s1 , . . .) be a sequence of elements of S. The support of the sequence s is the set supp(s) = {i ∈ N | si = 0}. A polynomial over S is a sequence that has a finite support. If supp(s) = ∅, then s is the zero polynomial. The degree of a polynomial p is the number deg(p) = max supp(p). The degree of the zero polynomial is 0. The addition of the polynomials p = (p0 , p1 , . . .) and q = (q0 , q1 , . . .) produces the polynomial p + q = (p0 + q0 , p1 + q1 , . . .). The product of the polynomials p and q is
m pi qm−i , . . . . pq = p0 q0 , p0 q1 + p1 q0 , . . . , i=0
The “usual notation” for polynomials involves considering a symbol λ referred to as an indeterminate. Then the polynomial p = (p0 , p1 , . . .) is denoted as p(λ) = p0 + p1 λ + · · · + pn λn , where n = deg(p). The set of polynomials over a ring S in the indeterminate λ is denoted by S[λ]. The reader accustomed to the usual notation for polynomials will realize that the addition and multiplication defined above correspond to the usual addition and multiplication of polynomials. If is easy to verify that the set of polynomials in the indeterminate λ over S denoted by S[λ] is itself a ring, denoted by S[λ], with the addition and multiplication defined above.
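MATLAB handles polynomials as coefficient vectors, which mirrors the sequence-based description of Example 1.22, except that MATLAB stores coefficients from the highest power down. The sketch below (ours, not from the book; the two polynomials are arbitrary) adds and multiplies two polynomials and checks the product by evaluation:

    p = [3 0 2 1];                          % 3*x^3 + 2*x + 1, highest power first
    q = [2 5];                              % 2*x + 5
    prodPQ = conv(p, q)                     % coefficients of the product polynomial
    sumPQ = p + [zeros(1, numel(p) - numel(q)), q]        % pad q with leading zeros, then add
    polyval(prodPQ, 2) == polyval(p, 2) * polyval(q, 2)   % check the product at x = 2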
Example 1.23. Example 1.22 can be extended to polynomials of several variables. Let N^k = (z^0, z^1, . . .) be the set of k-tuples of natural numbers listed in a fixed order. A polynomial in k variables over S is a sequence p = (p_{z^0}, p_{z^1}, . . . , p_{z^n}, . . .) such that the support set supp(p) = {z ∈ N^k | p_z ≠ 0} is finite. If supp(p) = ∅, then p is the zero polynomial. The set of polynomials in k variables over S is denoted by S[λ1, . . . , λk]. If |supp(p)| = 1, then p is a monomial.
Let λ1, . . . , λk be k indeterminates. A monomial p such that supp(p) = {(a1, . . . , ak)} is written as p(λ1, . . . , λk) = λ1^{a1} λ2^{a2} · · · λk^{ak} and the number deg(p) = a1 + · · · + ak is the degree of the monomial p.
If p, q are two polynomials in k variables, p = (p_{z^0}, p_{z^1}, . . . , p_{z^n}, . . .) and q = (q_{z^0}, q_{z^1}, . . . , q_{z^n}, . . .), their sum is the polynomial p + q given by
(p + q)(λ1, . . . , λk) = (p_{z^0} + q_{z^0}, p_{z^1} + q_{z^1}, . . . , p_{z^n} + q_{z^n}, . . .).
The product of p and q is the polynomial pq, where
(pq)_z = \sum_{u+v=z} p_u q_v.
Definition 1.23. A polynomial p ∈ S[λ1, . . . , λk] is symmetric if p(λ1, . . . , λk) = p(λ_{φ(1)}, . . . , λ_{φ(k)}) for every permutation φ ∈ PERMk.

Example 1.24. The elementary symmetric polynomials in λ1, . . . , λn are defined by

s0(λ1, . . . , λn) = 1,
s1(λ1, . . . , λn) = \sum_{i=1}^{n} λi,
s2(λ1, . . . , λn) = \sum_{i<j} λi λj,
Exercises and Supplements

Solution: For the first part we compute the coefficient of t^b for b ∈ N in both sides of the equality. In the left member this coefficient is \sum {λ1^{a1} · · · λn^{an} | a1 + · · · + an = b} because ck is the sum of all monomials of degree k in λ1, . . . , λn. In the right member we have a product of n series of the form
1/(1 − λi t) = 1 + λi t + · · · + λi^p t^p + · · · .
Therefore, the coefficient of t^b in this product equals the value of the coefficient for the left member for every b, which establishes the equality.
The second part follows immediately from the first part. The equality Cn(t) Sn(−t) = 1 can be written as

(\sum_{i=0}^{n} si(λ1, . . . , λn)(−1)^i t^i) · (\sum_{k≥0} ck(λ1, . . . , λn) t^k) = 1.
The coefficient of t^n in the left member is
\sum_{j=0}^{n} (−1)^j sj(λ1, . . . , λn) p_{n−j}(λ1, . . . , λn),
while the coefficient of t^n in the right-hand member is 0.

(4) If π, σ ∈ PERMn, prove that sign(πσ) = sign(π) sign(σ).
(5) If σ ∈ PERMn, prove that sign(σ⁻¹) = sign(σ).
(6) Let π ∈ PERMn and σ ∈ PERMn+m. Define the permutation π̃ ∈ PERMn+m as
        π̃(k) = π(k)  if 1 ≤ k ≤ n,
               k     if n + 1 ≤ k ≤ n + m.
    Prove that sign(π̃σ) = sign(π) sign(σ).
(7) Let θ ∈ PERMp+q be defined as
        θ:  1      2      ···  p      p + 1  ···  p + q
            p + 1  p + 2  ···  p + q  1      ···  q
    Prove that sign(θ) = (−1)^{pq}.
(8) Prove that if θ ∈ PERMn can be written as products of transpositions, θ = τ1 τ2 · · · τr = τ1′ τ2′ · · · τs′, then r ≡ s (mod 2).
(9) Prove that
        ε_{i1 ··· in} =  1  if (i1, . . . , in) is an even permutation,
                        −1  if (i1, . . . , in) is an odd permutation,
                         0  otherwise.
(10) Prove that the Levi-Civita symbols satisfy the identity
        \sum_{k=1}^{3} ε_{ijk} ε_{ℓmk} = ε_{ij1} ε_{ℓm1} + ε_{ij2} ε_{ℓm2} + ε_{ij3} ε_{ℓm3}.
(11) Prove that
        \sum_{i=1}^{3} \sum_{j=1}^{3} \sum_{k=1}^{3} ε_{ijk} ε_{ijk} = 6.
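Using the levi array built in the sketch after Example 1.9, the identity in Supplement (11) reduces to a one-line check (again ours, not from the book):

    sum(levi(:).^2) == 6                   % six non-zero entries, each equal to 1 or -1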
Let X be a finite set, X = {x1, . . . , xn}. A selection of r objects from X where each object can be selected more than once is a combination of n objects taken r at a time with repetition. Two combinations taken r at a time with repetition are identical if they have the same elements repeated the same number of times regardless of the order. For instance, if X = {x1, x2, x3, x4}, then there are 20 combinations of the elements of X taken three at a time with repetition:
x1x1x1, x1x1x2, x1x1x3, x1x1x4, x1x2x2, x1x2x3, x1x2x4, x1x3x3, x1x3x4, x1x4x4, x2x2x2, x2x2x3, x2x2x4, x2x3x3, x2x3x4, x2x4x4, x3x3x3, x3x3x4, x3x4x4, x4x4x4.

(12) Show that the following numbers are equal:
    (a) the number of combinations of n objects taken r at a time with repetition;
    (b) the number of non-negative integer solutions of the equation x1 + x2 + · · · + xn = r;
    (c) the number \binom{n + r − 1}{n − 1} = \binom{n + r − 1}{r}.
    Solution: We prove only that the number of non-negative integer solutions of the equation x1 + x2 + · · · + xn = r equals \binom{n + r − 1}{n − 1}. A solution of the equation x1 + x2 + · · · + xn = r can be represented as a binary string of length n + r − 1 as

        \underbrace{1 · · · 1}_{x1}  0  \underbrace{1 · · · 1}_{x2}  0  · · ·  0  \underbrace{1 · · · 1}_{xn}

    containing x1 + · · · + xn = r digits equal to 1 and n − 1 digits equal to 0. Since there are \binom{n + r − 1}{n − 1} such binary strings, the conclusion follows. Note that in the example that precedes this supplement, the number of combinations of the elements of X taken three at a time with repetition is \binom{4 + 3 − 1}{3} = 20.

Bibliographical Comments

There are many sources for general algebra [3, 20, 108] and combinatorics [154, 155, 163, 166] that contain lots of material that would amply satisfy the needs of the reader.
Chapter 2
Linear Spaces
2.1 Introduction
This chapter is dedicated to the study of linear spaces, which are mathematical structures that play a central role in linear algebra. A linear space is defined in connection with a field and makes use of two operations: an additive operation between its elements, and an external multiplication operation that involves elements of the field and those of the linear space.

2.2 Linear Spaces
Definition 2.1. Let F = (F, {0, 1, +, −, ·}) be a field. An F-linear space is a pair (L, s) such that L = (V, {0V, +, −}) is an Abelian group and s : F × V −→ V is a function referred to as the scalar multiplication that satisfies the following conditions:
(1) s(a + b, x) = s(a, x) + s(b, x);
(2) s(a, x + y) = s(a, x) + s(a, y);
(3) s(ab, x) = s(a, s(b, x));
(4) s(1, x) = x
for every a, b ∈ F and x, y ∈ V.

The elements of the field F are referred to as scalars, while the elements of V are referred to as vectors.
The result of the scalar multiplication s(a, v) is denoted simply by av. This allows us to write the previous equalities as (1) (a + b)v = av + bv; (2) a(v + w) = av + aw; (3) (ab)v = a(bv); (4) 1v = v for a, b ∈ F and v, w ∈ L. We omit the explicit mention of the scalar multiplication function from the definition of an F-linear space V = (V, s) and we refer to V simply as V. If the field F is irrelevant, or it is clearly designated from the context, we refer to an F-linear space just as a linear space. On the other hand, if F is the real field R or the complex field C, we refer to an R-linear space as a real linear space and to a C-linear space as a complex linear space. Example 2.1. If F = (F, {0, 1, +, −, ·}) is a field, then the oneelement linear space V = {0V }, where a0V = 0V for every a ∈ F, is the zero F-linear space, or, for short, the zero linear space. The field F itself is an F-linear space, where the Abelian group is (F, {0, +, −}) and scalar multiplication coincides with the scalar multiplication of F. Example 2.2. The set of all sequences of real numbers, Seq(R), is a real linear space, where the sum of two sequences x = (x0 , x1 , . . .) and y = (y0 , y1 , . . .) is the sequence x + y defined by x + y = (x0 + y0 , x1 + y1 , . . .) and the multiplication of x by a scalar a is ax = (ax0 , ax1 , . . .). A related real linear space is the set Seqn (R) of all sequences of real numbers having length n, where the sum and the scalar multiplications are defined in a similar manner. Namely, if x = (x0 , x1 , . . . , xn−1 ) and y = (y0 , y1 , . . . , yn−1 ), the sequence x + y is defined by x + y = (x0 + y0 , x1 + y1 , . . . , xn−1 + yn−1 ) and the multiplication of x by a scalar a is ax = (ax0 , ax1 , . . . , axn−1 ). This linear space is denoted by Rn and its zero element is denoted by 0n . Example 2.3. If the real field R is replaced by the complex field C, we obtain the linear space Seq(C) of all sequences of complex numbers. Similarly, we have the complex linear space Cn , which consists of all sequences of length n of complex numbers.
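The spaces Rn and Cn of Examples 2.2 and 2.3 are the linear spaces most often met in data mining, and NumPy arrays model their operations directly. The following sketch (the array values are arbitrary illustrations, not taken from the text) checks the componentwise definitions of addition and scalar multiplication.

```python
import numpy as np

# Two vectors of R^4 and a scalar; the values are arbitrary.
x = np.array([1.0, -2.0, 0.5, 3.0])
y = np.array([4.0, 1.0, -1.5, 2.0])
a = 2.5

# Addition and scalar multiplication act componentwise,
# exactly as in the definition of the linear space R^n.
print(x + y)        # (x0 + y0, ..., x3 + y3)
print(a * x)        # (a*x0, ..., a*x3)

# The same operations work verbatim in C^n (Example 2.3).
z = np.array([1 + 2j, -1j, 0.5 + 0j])
w = np.array([2 - 1j, 3 + 3j, 1j])
print(z + w, (1 - 1j) * z)
```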
Example 2.4. Let V be an F-linear space and let S be a non-empty set. The set V^S that consists of all functions of the form f : S −→ V is an F-linear space. The addition of functions is defined by (f + g)(s) = f(s) + g(s), while the multiplication by a scalar is given by (af)(s) = af(s), for s ∈ S and a ∈ F. We leave to the reader the task of verifying that the definition of a linear space is satisfied.
Example 2.5. Let S be a set and let F2 be the two-element field defined in Example 1.27. Define the scalar multiplication of a subset T of S by an element of the field as 0 · T = ∅ and 1 · T = T for every T ∈ P(S). The sum of two subsets U and V is defined as their symmetric difference U + V = (U − V) ∪ (V − U). With these definitions the set of subsets of S is an F2-linear space, as the reader can easily verify.
Informally, a subspace of an F-linear space is a subset of the F-linear space that behaves exactly like an F-linear space.
Definition 2.2. A non-empty subset U of an F-linear space V is a subspace of V if
(i) x + y ∈ U for all x, y ∈ U,
(ii) ax ∈ U for a ∈ F and x ∈ U.
If U is a subspace of an F-linear space V, then U is itself an F-linear space.
Example 2.6. The subset {0V} of any F-linear space V is a subspace of V named the zero subspace. This is the smallest subspace of V.
Theorem 2.1. If 𝒱 = {Vi | i ∈ I} is a collection of subspaces of an F-linear space V, then ⋂𝒱 is a subspace of V.
Proof. Suppose that x, y ∈ ⋂𝒱. Then, x, y ∈ Vi, so x + y ∈ Vi and ax ∈ Vi for every i ∈ I. Thus, x + y ∈ ⋂𝒱 and ax ∈ ⋂𝒱, which allows us to conclude that ⋂𝒱 is a subspace of V.
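Example 2.5 can be made concrete by encoding a subset of a finite set S as a 0/1 vector and working modulo 2, so that the symmetric difference becomes addition over F2 (exclusive or). The sketch below, with a hypothetical set S and hypothetical subsets, is one possible encoding, not a construction from the text.

```python
import numpy as np

S = ['a', 'b', 'c', 'd']             # the underlying finite set (hypothetical)

def indicator(subset):
    """Encode a subset of S as a 0/1 vector over F2."""
    return np.array([1 if s in subset else 0 for s in S], dtype=int)

U = indicator({'a', 'b'})
V = indicator({'b', 'c'})

# Sum of subsets = symmetric difference = addition mod 2 (XOR).
print((U + V) % 2)                   # encodes {'a', 'c'}

# Scalar multiplication by the two elements of F2.
print((0 * U) % 2)                   # the zero vector encodes the empty set
print((1 * U) % 2)                   # 1 · U = U
```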
Since V itself is a subspace of V, it follows that the collection of subspaces of a linear space V is a closure system. If K^sub is the closure operator induced by this closure system, then for every subset X of V, K^sub(X) is the smallest subspace of V that contains X. If U is a subspace of a linear space V and x ∈ V, we denote the set {x + u | u ∈ U} by x + U. The following statements are immediate for an F-linear space V: (i) the sets V and {0V} are subspaces of V; (ii) each subspace U of V contains 0V.
2.3 Linear Independence
Let F be a field and let I be a non-empty set. If ϕ : I −→ F is a function, define the support of ϕ as the subset of I given by supp(ϕ) = {i ∈ I | ϕ(i) ≠ 0}.
Definition 2.3. Let F be a field, I be a non-empty set, and let ϕ : I −→ F be a function that has finite support. If V is an F-linear space, the linear combination determined by ϕ is an element wϕ of V that can be written as
wϕ = ∑_{i∈supp(ϕ)} ϕ(i) xi,
where xi ∈ V for i ∈ supp(ϕ). In the special case when supp(ϕ) = ∅, we define wϕ = 0V.
If supp(ϕ) = {1, . . . , n} and {x1, . . . , xn} ⊆ X, then an X-linear combination is an element x of V that can be written as x = a1x1 + · · · + anxn, where the ai = ϕ(i) for 1 ≤ i ≤ n are scalars called the coefficients of the linear combination x. The set of all X-linear combinations is denoted by ⟨X⟩ and is referred to as the set spanned by X.
Theorem 2.2. Let V be an F-linear space. If X ⊆ V, then ⟨X⟩ is the smallest subspace of V that contains the set X. In other words, we have
(i) ⟨X⟩ is a subspace of V;
(ii) X ⊆ ⟨X⟩;
(iii) if X ⊆ M, where M is a subspace of V, then ⟨X⟩ ⊆ M.
Proof. It is clear that if u and v are two X-linear combinations, then u + v and au are also X-linear combinations, so ⟨X⟩ is a subspace of V. Also, for x ∈ X, we can write 1x = x, so X ⊆ ⟨X⟩. Finally, suppose that X ⊆ M, where M is a subspace of V and a1x1 + · · · + anxn ∈ ⟨X⟩, where x1, . . . , xn ∈ X. Since X ⊆ M, we have x1, . . . , xn ∈ M, hence a1x1 + · · · + anxn ∈ M because M is a subspace. Thus, ⟨X⟩ ⊆ M.
Corollary 2.1. Let V be an F-linear space. If X ⊆ V, then ⟨X⟩ equals K^sub(X), the subspace of V generated by X.
Proof.
This statement follows from Theorem 2.2.
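For finitely many vectors of Rn, deciding whether a vector belongs to ⟨X⟩ amounts to checking whether a linear system is consistent: y ∈ ⟨X⟩ exactly when adjoining y to the vectors of X does not increase the rank. A minimal numerical sketch follows; the helper in_span and the sample vectors are illustrative, not from the text.

```python
import numpy as np

def in_span(X, y, tol=1e-10):
    """Return True if y lies in the subspace spanned by the vectors in X."""
    A = np.column_stack(X)
    return np.linalg.matrix_rank(A, tol) == np.linalg.matrix_rank(
        np.column_stack([A, y]), tol)

X = [np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
print(in_span(X, np.array([2.0, 3.0, 5.0])))   # True: equals 2*x1 + 3*x2
print(in_span(X, np.array([0.0, 0.0, 1.0])))   # False: not in the plane <X>
```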
From now on, we use the notation ⟨X⟩ instead of K^sub(X).
Definition 2.4. Let V be an F-linear space. A finite subset U = {x1, . . . , xn} of V is linearly dependent if a1x1 + · · · + anxn = 0V, where at least one of the scalars ai is not equal to 0. If this condition is not satisfied, then U is said to be linearly independent.
A set U that consists of one vector x ≠ 0V is linearly independent. The subset U = {x1, . . . , xn} of V is linearly independent if a1x1 + · · · + anxn = 0V implies a1 = · · · = an = 0. Also, note that a set U that is linearly independent does not contain 0V.
Example 2.7. Let V be an F-linear space. If u ∈ V, then the set Vu = {au | a ∈ F} is a linear subspace of V. Moreover, if u ≠ 0V, then the set {u} is linearly independent. Indeed, if au = 0V and a ≠ 0, then multiplying both sides of the above equality by a−1 we obtain (a−1a)u = a−1 0V, or equivalently, u = 0V, which contradicts the initial assumption. Thus, {u} is a linearly independent set.
Definition 2.4 is extended to arbitrary subsets of a linear space.
Definition 2.5. Let V be an F-linear space. A subset W of V is linearly dependent if it contains a finite subset U that is linearly dependent. A subset W is linearly independent if it is not linearly dependent.
Thus, W is linearly independent if every finite subset of W is linearly independent. Further, any subset of a linearly independent subset is linearly independent and any superset of a linearly dependent set is linearly dependent. Example 2.8. For every F-linear space V, the set {0V } is linearly dependent because we have 10V = 0V . Example 2.9. Consider the F-linear space SF(I, F) that consists of the set of functions that map the non-empty set I into F. For i ∈ I, define the function ei : I −→ F as 1 if j = i, ei (j) = 0 otherwise. The set E = {ei | i ∈ I} is linearly independent in the linear space SF(I, F). Indeed, let {eik | 1 k p} be a finite subset of E and assume that ai1 ei1 + · · · + aip eip = z, where z is the function defined by z(i) = 0 for i ∈ I. Thus, choosing i = ik , we have ai1 ei1 (ik ) + · · · + aip eip (ik ) = z(ik ). All terms in the left member equal 0 with the exception of aik eik (ik ) = aik = 0. Theorem 2.3. Let V be an F-linear space and let W be a linearly independent subset of V. If y ∈ W , that is y = a1 x1 + · · · + an xn , for some finite subset {x1 , . . . , xn } of W, then the coefficients a1 , . . . , an are uniquely determined. Proof.
Suppose that y can be alternatively written as y = b1 x1 + · · · + bn xn ,
for some b1 , . . . , bn ∈ F. Since W is linearly independent, this implies (a1 − b1 )x1 + · · · + (an − bn )xn = 0V , which, in turn, yields a1 − b1 = · · · = an − bn = 0. Thus, we have ai = bi for 1 i n.
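For finitely many vectors of Rn, linear independence can be tested by comparing the rank of the matrix having these vectors as columns with the number of vectors, and the unique coefficients guaranteed by Theorem 2.3 can be recovered by solving a linear system. A sketch with illustrative data (the vectors below are arbitrary):

```python
import numpy as np

W = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])          # columns x1, x2 of R^3

# x1, x2 are linearly independent iff the rank equals the number of columns.
print(np.linalg.matrix_rank(W) == W.shape[1])      # True

# y = 2*x1 - 1*x2 lies in <{x1, x2}>; its coefficients are unique.
y = 2 * W[:, 0] - 1 * W[:, 1]
coeffs, residual, rank, _ = np.linalg.lstsq(W, y, rcond=None)
print(coeffs)                                      # approximately [ 2., -1.]
```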
2.4 Linear Mappings
Linear mappings between linear spaces are functions that are compatible with the algebraic operations of linear spaces, as introduced next.
Definition 2.6. Let F be a field and let V and W be two F-linear spaces. A linear mapping is a function h : V −→ W such that h(ax + by) = ah(x) + bh(y) for all scalars a, b ∈ F and x, y ∈ V. An affine mapping is a function f : V −→ W such that there exists a linear mapping h : V −→ W and b ∈ W such that f(x) = h(x) + b for x ∈ V.
Linear mappings are also referred to as linear space homomorphisms, linear morphisms, or linear operators. The set of morphisms between two F-linear spaces V and W is denoted by Hom(V, W). The set of affine mappings between two F-linear spaces V and W is denoted by Aff(V, W).
Definition 2.7. Let h, g ∈ Hom(V, W) be two linear mappings between the F-linear spaces V and W. The sum of h and g is the mapping h + g defined by (h + g)(x) = h(x) + g(x) for x ∈ V. If a ∈ F, the product ah is defined as (ah)(x) = ah(x) for x ∈ V.
If V, W are two F-linear spaces, then the set Hom(V, W) is never empty because the zero morphism 0V,W : V −→ W defined as 0V,W(x) = 0W for x ∈ V is always an element of Hom(V, W). Note that for f, g ∈ Hom(V, W) we have
(f + g)(ax + by) = f(ax + by) + g(ax + by) = af(x) + bf(y) + ag(x) + bg(y) = a(f + g)(x) + b(f + g)(y),
for all a, b ∈ F and x, y ∈ V. This shows that the sum of two linear mappings is also a linear mapping.
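Every m × n real matrix A defines a linear mapping h(x) = Ax from Rn to Rm, so the defining identity h(ax + by) = ah(x) + bh(y) is easy to verify numerically; adding a fixed vector b gives an affine mapping. The sketch below uses randomly generated data purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))          # h(x) = A @ x is a linear mapping R^4 -> R^3
b = rng.normal(size=3)               # f(x) = A @ x + b is an affine mapping

def h(x): return A @ x
def f(x): return A @ x + b

x, y = rng.normal(size=4), rng.normal(size=4)
a1, a2 = 2.0, -3.0

# Linearity: h(a1*x + a2*y) == a1*h(x) + a2*h(y)  (up to rounding).
print(np.allclose(h(a1 * x + a2 * y), a1 * h(x) + a2 * h(y)))   # True

# The affine mapping f is not linear unless b = 0: f(0) = b, not 0.
print(np.allclose(f(np.zeros(4)), np.zeros(3)))                 # False in general
```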
Theorem 2.4. Hom(V, W) equipped with the sum and product defined above is an F-linear space.
Proof. The zero element of Hom(V, W) is the mapping 0V,W. We leave it to the reader to verify that Hom(V, W) satisfies the properties mentioned in Definition 2.1.
Definition 2.8. Let V be an F-linear space. A linear form on V is a morphism in Hom(V, F), where the field F is regarded as an F-linear space.
2.5 Bases in Linear Spaces
Definition 2.9. A basis of an F-linear space V is a linearly independent subset B such that ⟨B⟩ = V.
If an F-linear space V has a finite basis, then we say that V is a linear space of finite type.
Example 2.10. Let F be a field and let I be a non-empty set. In Example 2.9, we saw that the set E = {ei | i ∈ I} is a linearly independent set in the linear space SF(I, F). Let ϕ : I −→ F and suppose that supp(ϕ) = {j1, . . . , jp}. Then, ϕ(i) = ∑_{k=1}^{p} ϕ(jk) ejk(i) for i ∈ I. Therefore, E is a basis in SF(I, F).
Example 2.11. Let ei be the vector in Cn whose ith component equals 1 and whose remaining components equal 0. Note that for x ∈ Cn, we have x = x1e1 + · · · + xnen,
which means that Cn = {e1 , . . . , en }. If x1 e1 + · · · + xn en = 0n , it follows immediately that x1 = · · · = xn = 0, which implies that {e1 , . . . , en } is a basis for Cn . Theorem 2.5. Every non-zero F-linear space V has a basis. Proof. Let V be a non-zero F-linear space and let U be a set such that U = L. Note that at least one such set exists because V = V. The set U contains at least an element distinct from 0V because {0V } = {0V }. Let IU be the collection of linearly independent subsets of U . Since for every x ∈ U such that x = 0V the set {x} is linearly independent, it follows that IU is a non-empty collection. We claim that in the partially ordered set (IU , ⊆) every chain has an upper bound. Indeed, let {Ki | i ∈ J} be a chain of independent subsets in (IU , ⊆). We claim that K = {Ki | i ∈ J} is a linearly independent set. Indeed, suppose that {xk | 1 k n} is a finite subset of K. For every l, 1 k n there exists ik ∈ J such that xk ∈ Kik . Since {Ki | i ∈ J} is a chain, there exists a set K among the sets Ki1 , . . . , Kin that includes all others. Thus, {xk | 1 k n} ⊆ K , which implies that the set {xk | 1 k n} is linearly independent. Consequently, the set K is linearly independent and, therefore, it is an upper bound for the chain {Ki | i ∈ J}. By Zorn’s Lemma (see, for example, Section 4.10 of [152]), the partially ordered set (IU , ⊆) contains an element maximal B. Clearly, B is linearly independent. To prove that B is a basis, we need to show only that B spans the entire linear space V, that is, that B = V. To this end, it suffices to prove that the set U of generators of V is included in B. Let x ∈ U − B and let X = B ∪ {x}. Then, since B is maximal, it follows that X is linearly dependent. Therefore, there exists a linear combination ax + pi=1 ai xi = 0V such that a = 0. Since Fis a field, there exists an inverse a−1 of a. Therefore, x = −a−1 pi=1 ai xi , which implies x ∈ B. Thus, we may conclude that U ⊆ B, so V = U ⊆ B = V. Corollary 2.2. Let U be a subset of an F-linear space V such that U = V. If B ⊆ U is a linearly independent set, then there is a basis Z of V such that B ⊆ Z ⊆ U .
Proof. Let IU be the collection of linearly independent subsets of U . By Zorn’s Lemma, for every element B of IU there exists a maximal element Z of (IU , ⊆) such that B ⊆ Z. The desired basis is Z. Corollary 2.3 (Independent set extension corollary). Let V be an F-linear space. If S is a linearly independent set, then there exists a basis B of V such that S ⊆ B. Proof. Since S is a linearly independent set, if T = V, then S ∪T also generates V. The statement follows by Corollary 2.2. If an F-linear space V has a finite basis, then we say that V is a linear space of finite type. Lemma 2.1. Let V be a finite type F-linear space and let T be a finite subset of V that is not linearly independent. If k = |T | 2 and (t1 , . . . , tk ) is a list of the vectors in T, then there exists a number j such that 2 j m and tj is a linear combination of its predecessors in the sequence. Furthermore, we have T − {tj } = T . Proof. Suppose that T is linearly dependent. Then there exists a linear combination ki=1 ai ti = 0V such that some of the scalars a1 , . . . , ak are different from 0. Let j be the largest number such that 1 j k and aj = 0. The definition of j implies that a1 t1 + ai · · · + aj tj = 0V , so tj = − j−1 i=1 aj ti , which shows that tj is a linear combination of its predecessors in the list. Consequently, the set of linear combinations of the vectors in T − {tj } equals T . Theorem 2.6 (The Replacement theorem). Let V be a finitetype F-linear space such that the set S spans the linear space V and |S| = n. If U is a linearly independent set in V such that |U | = m, then m n and there exists a subset S of S such that S contains n − m vectors and U ∪ S spans the space V. Proof. Suppose that S = {w1 , . . . , w n } and U = {u1 , . . . , um }. The argument is by induction on m. The basis case, m = 0, is immediate. Suppose the statement holds for m and let U = {u1 , . . . , um , um+1 } be a linearly independent set that contains m+1 vectors.
The set {u1 , . . . , um } is linearly independent, so by the inductive hypothesis m n and there exists a subset S of S that contains n−m vectors such that {u1 , . . . , um } ∪ S spans the space V. Without loss of generality we may assume that S = {w1 , . . . , wn−m }. Thus, um+1 is a linear combination of the vectors of {u1 , . . . , um , w 1 , . . . , wn−m }, so we have um+1 = a1 u1 + · · · + am um + b1 w1 + · · · + bn−m wn−m . We have m+1 n because, otherwise, m+1 = n and um+1 would be a linear combination of u1 , . . . , um , thereby contradicting the linear independence of the set U . The set {u1 , . . . , um , um+1 , w1 , . . . , w n−m } is not linearly independent. Let v be the first member of the sequence (u1 , . . . , um , um+1 , w1 , . . . , wn−m ) that is a linear combination of its predecessors. Then, v cannot be one of the ui (with 1 i m) because this would contradict the linear independence of the set U . Therefore, there exists k such that w k is a linear combination of its predecessors and 1 k n − m. By Lemma 2.1, we can remove this element from the set {u1 , . . . , um , um+1 , w1 , . . . , w n−m } without affecting the set spanned. Corollary 2.4. Let V be a finite-type F-linear space and let B, C be two bases of L. Then |B| = |C|. Proof. Since B is a linearly independent set, and C = V, by Theorem 2.6 we have |B| |C|. The reverse inequality, |C| |B|, is obtained by asserting that C is linearly independent and C = V. Thus, |B| = |C|. Corollary 2.4 allows the introduction of the notion of dimension for a linear space. Definition 2.10. The dimension of a finite-type linear space V is the number of elements of any basis of V. The dimension of V is denoted by dim(V ). The dimension of the zero F-linear space {0} is 0. If a linear space V is not of finite type, then we say that dim(V ) is of infinite type.
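For a subspace of Rn given by a finite spanning set, Definition 2.10 can be applied numerically: the dimension is the size of any basis, which equals the rank of the matrix whose columns are the spanning vectors (by Corollary 2.4 the number does not depend on the basis chosen). A sketch with illustrative vectors:

```python
import numpy as np

# Three spanning vectors of a subspace of R^4; the third is the sum of
# the first two, so the spanned subspace has dimension 2, not 3.
v1 = np.array([1.0, 0.0, 2.0, 0.0])
v2 = np.array([0.0, 1.0, 1.0, 1.0])
v3 = v1 + v2

S = np.column_stack([v1, v2, v3])
print(np.linalg.matrix_rank(S))      # 2 = dim of the subspace spanned by v1, v2, v3
```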
Theorem 2.7. Let V be an F-linear space of finite type having the basis B = {x1, . . . , xn} and let Y = {y1, . . . , yn} be a subset of an F-linear space W. If f : B −→ Y is a function, then there exists a unique extension of f as a linear mapping f : V −→ W such that f(xi) = yi for 1 ≤ i ≤ n.
Proof. If x ∈ V, we have x = a1x1 + · · · + anxn because {x1, . . . , xn} is a basis of V. Define f(x) as f(x) = ∑_{i=1}^{n} ai yi. The uniqueness of the expression of x as a linear combination of the elements of B makes f well-defined. The linearity of f is immediate. For the uniqueness of the extension of f, observe that the value of f at any x is determined by the values f(xi).
Example 2.12. Let S be a non-empty, finite set. The linear space CS has dimension |S|. Indeed, for each t ∈ S, consider the function ft : S −→ C defined by ft(s) = 1 if s = t and ft(s) = 0 otherwise. If S = {t1, . . . , tn}, then the set of functions {ft1, . . . , ftn} is linearly independent, for if c1ft1(s) + · · · + cnftn(s) = 0, then taking s = tk we obtain ck = 0 for every k, 1 ≤ k ≤ n. Furthermore, if f : S −→ C is a function and f(ti) = ci, then f = ∑_{i=1}^{n} ci fti, so {ft1, . . . , ftn} is a basis for CS.
Example 2.13. Let V, W be two linear spaces of finite type with dim(V) = p and dim(W) = q. Then, dim(Hom(V, W)) = pq.
Suppose that {x1, . . . , xp} is a basis in V and {y1, . . . , yq} is a basis in W. As we have shown in Theorem 2.7, for every i such that 1 ≤ i ≤ p and j such that 1 ≤ j ≤ q, there exists a unique linear mapping fij : V −→ W such that fij(xk) = yj if i = k and fij(xk) = 0W otherwise, for 1 ≤ k ≤ p.
Note that if x = ∑_{k=1}^{p} ak xk, the linearity of fij implies
fij(x) = fij(∑_{k=1}^{p} ak xk) = ∑_{k=1}^{p} ak fij(xk) = ai fij(xi) = ai yj.
We claim that the set {fij | 1 ≤ i ≤ p, 1 ≤ j ≤ q} is a basis for Hom(V, W). Let f : V −→ W be a linear mapping. If x ∈ V, we can write x = ∑_{i=1}^{p} ai xi, so f(x) = ∑_{i=1}^{p} ai f(xi). In turn, since {y1, . . . , yq} is a basis in W, f(xi) = ∑_{j=1}^{q} bij yj for some bij ∈ F. This allows us to write
f(x) = ∑_{i=1}^{p} ai ∑_{j=1}^{q} bij yj = ∑_{i=1}^{p} ∑_{j=1}^{q} ai bij yj = ∑_{i=1}^{p} ∑_{j=1}^{q} bij fij(x),
which shows that each linear mapping in Hom(V, W) is a linear combination of the functions fij. Furthermore, the set {fij | 1 ≤ i ≤ p, 1 ≤ j ≤ q} is linearly independent in Hom(V, W). Indeed, suppose that ∑_{i=1}^{p} ∑_{j=1}^{q} cij fij(x) = 0W for every x ∈ V. Then, for x = xi we have ∑_{j=1}^{q} cij yj = 0W, which implies cij = 0. We may conclude that dim(Hom(V, W)) = dim(V) dim(W).
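When V = Rp and W = Rq, Example 2.13 has a familiar matrix reading: Hom(V, W) can be identified with the q × p real matrices, and the mappings fij correspond to the matrix units with a single 1 in row j and column i. The sketch below only illustrates this identification numerically; the identification itself is a standard fact assumed here, not the construction of the text.

```python
import numpy as np

p, q = 3, 2                      # dim(V) = p, dim(W) = q

def f_ij(i, j):
    """Matrix of the mapping f_ij: it sends the ith basis vector of R^p to
    the jth basis vector of R^q and maps the other basis vectors to 0."""
    E = np.zeros((q, p))
    E[j, i] = 1.0
    return E

# Any q x p matrix (i.e., any element of Hom(R^p, R^q)) is the linear
# combination of the pq matrices f_ij with coefficients given by its entries.
rng = np.random.default_rng(1)
A = rng.normal(size=(q, p))
B = sum(A[j, i] * f_ij(i, j) for i in range(p) for j in range(q))
print(np.allclose(A, B))         # True: the f_ij span Hom(R^p, R^q)
print(p * q)                     # 6 = dim(Hom(R^p, R^q))
```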
The notion of subspace is closely linked to the notion of linear mapping as we show next. Theorem 2.9. Let V, W be two F-linear spaces. If h : V −→ W is a linear mapping, then Im(h) is a subspace of W and Ker(h) is a subspace of V. Proof. Let w1 and w2 be two elements of Im(h). There exist v 1 , v 2 ∈ V such that w1 = h(v 1 ) and w2 = h(v 2 ). Since h is a linear mapping, we have w1 + w2 = h(v 1 ) + h(v 2 ) = h(v 1 + v 2 ). Thus, w1 +w2 ∈ Im(h). Further, if a ∈ F and w ∈ W , then w = h(v) for some v ∈ V and we have aw = ah(v) = h(av), so aw ∈ Im(h). Thus, Im(h) is indeed a subspace of W . Suppose now that s and t belong to Ker(h), that is h(s) = h(t) = 0W . Then h(s + t) = h(s) + h(t) = 0W , so s + t ∈ Ker(h). Also, h(as) = ah(s) = a0W = 0W , which allows us to conclude that Ker(h) is a subspace of W . Theorem 2.10. Let V and W be two linear spaces, where dim(V ) = n, and let h : V −→ W be a linear mapping. Then, we have dim(Ker(h)) + dim(Im(h)) = n. Proof. Suppose that {e1 , . . . , em } is a basis for the subspace Ker(h) of V. By Corollary 2.3, each such basis can be extended to a basis {e1 , . . . , em , em+1 , . . . , en } of the linear space V. Any v ∈ V can be written as v=
∑_{i=1}^{n} ai ei.
Since {e1, . . . , em} ⊆ Ker(h), we have h(ei) = 0W for 1 ≤ i ≤ m, so
h(v) = ∑_{i=m+1}^{n} ai h(ei).
This means that the set {h(em+1), . . . , h(en)} spans the subspace Im(h) of W. We show now that this set is linearly independent. Indeed, suppose that ∑_{i=m+1}^{n} bi h(ei) = 0W. This implies h(∑_{i=m+1}^{n} bi ei) = 0W, that is, ∑_{i=m+1}^{n} bi ei ∈ Ker(h). Since {e1, . . . , em} is a basis for Ker(h), there exist m scalars c1, . . . , cm such that
∑_{i=m+1}^{n} bi ei = c1e1 + · · · + cmem.
The fact that {e1 , . . . , em , em+1 , . . . , en } is a basis for V implies that c1 = · · · = cm = bm+1 = · · · = bn = 0, so the set {h(em+1 ), . . . , h(en )} is linearly independent and, therefore, a basis for Im(h). Thus, dim(Im(h)) = n − m, which concludes the argument. Definition 2.12. Let V and W be two F-linear spaces and let h ∈ Hom(V, W ). The rank of h is rank(h) = dim(Im(h)); the nullity of h is nullity(h) = dim(Ker(h)). The spark of a linear mapping h : V −→ W is the minimum size of a subset S of Im(h) such that 0W ∈ S. Theorem 2.10 can now be rephrased by saying that if h : V −→ W is a linear mapping and V is a linear space of finite type, then dim(V ) = rank(h) + nullity(h). Theorem 2.11. Let h : V −→ W be a linear mapping between the linear spaces V and W . Then rank(h) min{dim(V ), dim(W )}. Proof. From Theorem 2.10 it follows that rank(h) dim(V ). On the other hand, rank(h) = dim(Im(h)) dim(W ) because Im(h) is a subspace of W , so the inequality of the theorem follows. Example 2.14. Let V, W be two F-linear spaces. For h ∈ V ∗ and y ∈ W , define the mapping h,y : V −→ W as h,y (x) = h(x)y for x ∈ V. It is easy to verify that h,y is a linear mapping, that is, h,y ∈ Hom(V, W ). Furthermore, we have rank(h,y ) = 1 because Im(h,y ) consists of the multiples of the vector y. Let f ∈ Hom(V, W ) be a linear mapping of rank r, which means that dim(Im(f )) = r. There exists a basis {y 1 , . . . , y r } in Im(f ) such
that for every x ∈ V, f(x) can be uniquely written as
f(x) = ∑_{i=1}^{r} ai yi.
Let hi ∈ V∗ be the linear form defined as hi(x) = ai for 1 ≤ i ≤ r. Then, f(x) = ∑_{i=1}^{r} hi(x) yi, hence f is the sum of r linear mappings of rank 1.
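In matrix terms, Example 2.14 and the remark above say that a rank-r matrix is a sum of r rank-1 matrices (outer products), which is the idea behind the low-rank factorizations used in data mining. A numerical sketch of one such splitting; the particular factors below come from the singular value decomposition, one convenient choice among many, and are not the construction used in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
# A rank-2 matrix: product of a 4x2 and a 2x3 factor.
A = rng.normal(size=(4, 2)) @ rng.normal(size=(2, 3))
r = np.linalg.matrix_rank(A)
print(r)                                         # 2

# Split A into r rank-1 terms using its SVD (one convenient choice).
U, s, Vt = np.linalg.svd(A)
terms = [s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r)]
print(np.allclose(A, sum(terms)))                # True: A is a sum of r rank-1 maps
print([np.linalg.matrix_rank(t) for t in terms]) # [1, 1]
```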
2.6 Isomorphisms of Linear Spaces
Definition 2.13. Let V and W be two F-linear spaces. An isomorphism between these linear spaces is a linear mapping h : V −→ W , which is a bijection. If an isomorphism exists between two F-linear spaces V and W , we say that these linear spaces are isomorphic and we write V ∼ = W. Two F-linear spaces that are isomorphic are indiscernible from an algebraic point of view. Theorem 2.12. Let V and W be two F-linear spaces and let h ∈ Hom(V, W ). Then Im(h) ∼ = (V /Ker(h)). Proof. Define the mapping g : V /Ker(h) −→ Im(h) by g([x]) = h(x) for x ∈ V. We show that g is an isomorphism. The mapping g is well-defined since if u ∈ [x], then u ∼Ker(h) x, so u − x ∈ Ker(h). Therefore, h(u − x) = 0W , hence h(u) = h(x). We leave it to the reader to verify that g is a linear mapping. Further, it is clear that g is surjective. To prove that g is injective, suppose that g([x]) = g([y]). This amounts to h(x) = h(y), which is equivalent to h(x − y) = 0W . Thus, x − y ∈ Ker(h), which implies [x] = [y]. In other words, g is injective and, therefore, is a bijection. This shows that the linear spaces Im(h) and V /Ker(h) are isomorphic. Corollary 2.5. Let V and W be two F-linear spaces and let h : V −→ W be a surjective morphism of linear spaces. Then, W ∼ = V /Ker(h). Proof.
This is an immediate consequence of Theorem 2.12.
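For a matrix A representing a linear mapping h : Rn −→ Rm, the dimension statements of Theorem 2.10 and Theorem 2.12 can be checked numerically: rank(A) = dim(Im(h)), the nullity is n − rank(A) = dim(Ker(h)), and the quotient Rn/Ker(h) has dimension equal to the rank, in agreement with Im(h) ≅ V/Ker(h). A sketch with a random matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 5, 3
A = rng.normal(size=(m, n))          # h(x) = A @ x, a mapping R^5 -> R^3

rank = np.linalg.matrix_rank(A)      # dim Im(h)
nullity = n - rank                   # dim Ker(h), by Theorem 2.10

print(rank, nullity, rank + nullity == n)   # e.g. 3 2 True
# dim(V / Ker(h)) = dim V - dim Ker(h) = rank, which equals dim Im(h),
# matching the isomorphism Im(h) ≅ V / Ker(h) of Theorem 2.12.
print(n - nullity == rank)                  # True
```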
Theorem 2.13. Isomorphism is an equivalence relation between linear spaces. Proof. It is clear that every linear space V is isomorphic to itself, which follows from the fact that the identity map 1V is an isomorphism. Suppose now that the linear spaces V and W are isomorphic and let h : V −→ W be an isomorphism. It is easy to verify that the inverse mapping h−1 : W −→ V is a linear morphism and a bijection, so the existence of an isomorphism is symmetric. Finally, if h : V −→ W and g : W −→ U are isomorphisms, then gh is an isomorphism from V to U , so the existence of an isomorphism is transitive. Thus, isomorphism is an equivalence. Theorem 2.14. If V, W are two finite-dimensional F-linear spaces and V ∼ = W, then dim(V ) = dim(W ). Proof. Suppose that dim(V ) = n and that B = {x1 , . . . , xn } is a basis for V. We claim that if f : V −→ W is an isomorphism, then B = {f (x1 ), . . . , f (xn )} is a basis for W . Let y ∈ W . Since f is a surjection, there exists x ∈ V such that y = f (x). Then, x = a1 x1 + · · · + an xn for some a1 , . . . , an in F because B is a basis for V, hence y = f (x) = f (a1 x1 + · · · + an xn ) = a1 f (x1 ) + · · · + an f (xn ). This shows that B spans the space W . To prove that B is linearly independent, assume that 0W = c1 f (x1 ) + · · · + cn f (xn ) = f (c1 x1 + · · · + cn xn ). Since f is injective, we have c1 x1 + · · · + cn xn = 0V , which implies c1 = · · · = cn = 0. Thus, B is also linearly independent, hence B is a basis of W . We conclude that dim(V ) = |B| = |B | = dim(W ). Theorem 2.15. If V, W are two finite-dimensional F-linear spaces and dim(V ) = dim(W ), then V ∼ = W. Proof. To prove that V ∼ = W , it suffices to show that any of these two spaces is isomorphic to Fn . Suppose that dim(V ) = n and that B = {x1 , . . . , xn } is a basis for V. Define the mapping h : V −→ Fn as ⎛ ⎞ a1 ⎜.⎟ ⎟ h(x) = ⎜ ⎝ .. ⎠ , an
where x = a1 x1 + · · · + an xn . Since the expression of x in the basis B is unique, this function is a well-defined injective morphism. The function is also a surjection, so it is an isomorphism. Corollary 2.6. If V, W are two finite-dimensional F-linear spaces, then V ∼ = W if and only if dim(V ) = dim(W ). Proof.
This follows from Theorems 2.14 and 2.15.
Theorem 2.16. Let h : V −→ W be a linear mapping between the F-linear spaces V and W such that dim(V ) = dim(W ) = n. The following statements are equivalent: (i) h is surjective; (ii) h is an isomorphism; (iii) h is injective. Proof. (i) implies (ii): Suppose that h is surjective, that is, Im(h) = W . Since dim(Ker(h)) + dim(Im(h)) = n, it follows that dim(Ker(h)) = 0, so Ker(h) = {0V }. Therefore, by Theorem 2.23, h is an injection and, therefore, an isomorphism. (ii) implies (iii): This implication is immediate. (iii) implies (i): Suppose that h is injective. Then dim(Im(h)) = n, so Im(h) = W , which means that h is a surjection. Theorem 2.17. Let V be a subspace of the linear space W . Then dim(W/V ) = dim(W ) − dim(V ). Proof. We apply Theorem 2.10 to the mapping hV : W −→ W/V. We noted that Ker(hV ) = V and Im(hV ) = W/V. Therefore, dim(W ) = dim(V ) + dim(W/V ), which yields our equality. Let X be a non-empty set, F be the real or the complex field, and let FREEF (X) be a subset of FX that consists of the mappings f : X −→ F such that the set supp(f ) = {x ∈ X | f (x) = 0} is finite. Addition and scalar multiplication are defined as usual. If f, g ∈ FREEF (X), then supp(f + g) ⊆ supp(f ) ∪ supp(g) and supp(f ) if a = 0, supp(af ) = {0} if a = 0. This turns FREEF (X) into an F-linear space. The zero element of FREEF (X) denoted by 0∗ is given by 0∗ (x) = 0 for x ∈ X.
Definition 2.14. Let X be a non-empty set. The F-free linear space over X is the linear space FREEF(X) constructed above. If the field F is clear from context, the subscript F is omitted.
Example 2.15. The free R-linear space on the set {1, . . . , n} is isomorphic to Rn. For z ∈ X, let δz be the function defined by δz(x) = 1 if x = z and δz(x) = 0 otherwise. Since supp(δz) = {z}, it is clear that δz ∈ FREE(X). The notation δz is known as Kronecker delta.1 A related notation is δuv : X × X −→ {0, 1}, which denotes the function given by δuv = 1 if u = v and δuv = 0 otherwise.
The set {δz | z ∈ X} is a basis in FREE(X). Indeed, if supp(f) = {z1, . . . , zn}, then f(z) = ∑_{i=1}^{n} f(zi) δzi(z), or f = ∑_{i=1}^{n} f(zi) δzi. The set {δz | z ∈ X} is linearly independent because, if ∑_{j=1}^{n} aj δzj(x) = 0, then choosing x = zi we obtain ai = 0 for 1 ≤ i ≤ n.
Example 2.16. The free R-linear space FREE(N) consists of all formal sums r0n0 + r1n1 + · · ·, where the set {ri ∈ R | i ∈ N and ri ≠ 0} is finite. The zero element of this space is 0n0 + 0n1 + · · ·.
Theorem 2.18 (Universal property of hX). Let X be a set and let hX : X −→ FREE(X) be the mapping defined by hX(x) = δx for x ∈ X. If L is an F-linear space and φ : X −→ L is a mapping, there exists a unique linear mapping ψ : FREE(X) −→ L such that the following diagram is commutative:
Leopold Kronecker (December 7, 1823 in Liegnitz–December 29, 1891 in Berlin) was a German mathematician who worked on number theory, algebra, and logic. He studied at the Universities of Bonn, Breslau, Berlin, where he defended his dissertation in algebraic number theory in 1845. He was elected a member of the Berlin Academy in 1861. There are numerous concepts in mathematics named after Kronecker: Kronecker delta, Kronecker product, and many others.
[Commutative diagram: hX : X −→ FREE(X), ψ : FREE(X) −→ L, and φ = ψ hX : X −→ L.]
Proof. Define ψ as the function that maps the element δz of the basis of FREE(X) to φ(z). It is immediate that ψ is a linear mapping and that the diagram is commutative. To prove uniqueness, suppose that ψ1 is another linear mapping such that φ = ψhX = ψ1hX. Then for x ∈ X, we have φ(x) = ψ(hX(x)) = ψ1(hX(x)), which implies φ(x) = ψ(δx) = ψ1(δx). Since the set {δx | x ∈ X} is a basis in FREE(X), it follows that ψ = ψ1.
Theorem 2.19. Let V and W be two F-linear spaces and let g, h ∈ Hom(V, W) be two linear mappings. If X is a subset of V such that g(x) = h(x) for every x ∈ X, then g(x) = h(x) for every x ∈ ⟨X⟩.
Proof.
Consider the set EQ(g, h) = {u ∈ V | g(u) = h(u)}.
If u, v ∈ EQ(g, h), then g(au+bv) = ag(u)+bg(v) = ah(u)+bh(u) = h(au + bv), so au + bv ∈ EQ(g, h) for every a, b ∈ F, which implies that EQ(g, h) is a subspace of V. Since X ⊆ EQ(g, h), it follows that X ⊆ EQ(g, h), which yields the desired conclusion. We refer to EQ(g, h) as the equalizer of g and h. Theorem 2.20. Let V and W be two F-linear spaces. A morphism h ∈ Hom(V, W ) is injective if and only if h(x) = 0W implies x = 0V . Proof. Let h be a morphism such that h(x) = 0W implies x = 0V . If h(x) = h(y), by the linearity of h we have h(x − y) = 0W , which implies x − y = 0V , that is, x = y. Thus, h is injective. Conversely, suppose that h is injective. If x = 0V , then h(x) = h(0V ) = 0W . Thus, h(x) = 0W implies x = 0V . An endomorphism of an F-linear space V is a morphism h : V −→ V. The set of endomorphisms of V is denoted by End(V ). Often, we refer to endomorphisms of V as linear operators on V.
Let F be a field and let V be an F-linear space. Define the mapping ha : V −→ V by ha (x) = ax for x ∈ V. It is easy to verify that ha is a linear operator on L. This mapping is known as a homothety on V. If a = 1, then h1 is given by h1 (x) = x for x ∈ V ; this is the identity morphism of V, which is usually denoted by 1V . For a = 0, we obtain the zero endomorphism of V denoted by 0V and given by 0V (x) = 0V for x ∈ V. Example 2.17. Let V be an F-linear space and let z ∈ V. The translation generated by z ∈ V is the mapping tz : V −→ V defined by tz (x) = x + z for x ∈ V. A translation is a bijection but not a morphism unless z = 0V . Its inverse is t−z . Definition 2.15. Let V be an F-linear space and let U and Z be two subsets of L. Define the subset U + Z of V as U + Z = {u + z | u ∈ U and z ∈ Z}. For a ∈ F, the set aU is aU = {au | u ∈ U }. Theorem 2.21. Let V, W, U be three F-linear spaces. The following properties of compositions of linear mappings hold: (i) If f ∈ Hom(V, W ) and g ∈ Hom(W, U ), then gf ∈ Hom(V, U ). (iii) If f ∈ Hom(V, W ) and g0 , g1 ∈ Hom(W, U ), then f (g0 + g1 ) = f g0 + f g1 . (iii) If f0 , f1 ∈ Hom(V, W ) and g ∈ Hom(W, U ), then (f0 + f1 )g = f0 g + f1 g. Proof. We prove only the second part of the theorem and leave the proofs of the remaining parts to the reader. Let x ∈ V. Then, f (g0 + g1 )(x) = f ((g0 + g1 )(x)) = f (g0 (x) + g1 (x)) = f (g0 (x)) + f (g1 (x)) for x ∈ V, which yields the desired equality.
Corollary 2.7. Let V, W be two F-linear spaces. The algebra (Hom(V, W ), {h0 , +, −}) is an Abelian group that has the zero morphism 0V,W as its zero-ary operation and the addition of linear mappings as its binary operation; the opposite of a linear mapping h is the mapping −h. Moreover, (End(V ), {0V , 1V , +, −, ·}) is a unitary ring, where the multiplication is defined as the composition of linear mappings. Proof. The first part of the statement is a simple verification that the operations h0 , +, − satisfy the definition of Abelian groups. The second part of the corollary follows immediately from Theorem 2.21. An endomorphism h of a field is idempotent if h2 = h, that is, if h(h(x)) = h(x) for every x ∈ M . Corollary 2.8. If h is an idempotent endomorphism of the field (End(V ), {h0 , 1V , +, −, ·}), then 1 − h is also an idempotent endomorphism. Proof. This statement is a direct consequence of Corollary 2.7 and of Theorem 1.12. Definition 2.16. Let h be an endomorphism of linear space V. The mth iteration of h (for m ∈ N) is defined as (i) h0 = 1V ; (ii) hm+1 (x) = h(hm (x)) for m ∈ N. For every m 1, hm is an endomorphism of V ; this can be shown by a straightforward proof by induction on m. Example 2.18. Let F be a field and F[λ] be the set of polynomials in the indeterminate λ with coefficients in F. If p ∈ F[λ], we have p(λ) = p0 + p1 λ + · · · + pn λn . For an endomorphism of an F-linear space V, h : V −→ V define the function f = p(h) as f (x) = p0 + p1 h(x) + · · · + pn hn (x), for x ∈ V. Theorem 2.22. Let V be an F-linear space and let p ∈ F[λ] be a polynomial. If h : V −→ V is an endomorphism of V, then p(h) is an affine mapping on V.
Proof. Let p(λ) = p0 + p1λ + · · · + pnλn. Since hm is an endomorphism of V for every m ≥ 1 and the sum of endomorphisms is an endomorphism, it follows that q(h) is an endomorphism, where q(λ) = p1λ + · · · + pnλn. Thus, p(h) is an affine mapping because p(h)(x) = p0 + q(h)(x) for x ∈ V.
Note that if p(0) = 0 (that is, p0 = 0) and h is an endomorphism, then p(h) is also an endomorphism.
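For an endomorphism of Rn given by a matrix A, the iterates hm of Definition 2.16 are the matrix powers A^m, and the endomorphism q(h) = p1h + · · · + pnh^n appearing in the proof corresponds to the matrix p1A + · · · + pnA^n. A small sketch, assuming this matrix representation (the matrix and coefficients are arbitrary):

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-2.0, 3.0]])          # an endomorphism h of R^2 (illustrative)

def q_of_h(coeffs, A):
    """Matrix of q(h) = p1*h + p2*h^2 + ... for coeffs = [p1, p2, ...];
    the iterate h^m corresponds to the matrix power A^m."""
    result = np.zeros_like(A)
    power = np.eye(A.shape[0])
    for p in coeffs:
        power = power @ A            # A^1, A^2, ...
        result = result + p * power
    return result

Q = q_of_h([2.0, -1.0, 0.5], A)      # q(l) = 2l - l^2 + 0.5*l^3

# q(h) is again linear: it is given by a matrix, so q(h)(x) = Q @ x.
x, y = np.array([1.0, 2.0]), np.array([-1.0, 0.5])
print(np.allclose(Q @ (3 * x + 2 * y), 3 * (Q @ x) + 2 * (Q @ y)))   # True
```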
2.7 Constructing Linear Spaces
This section is concerned with methods of constructing new linear spaces. Definition 2.17. Let S be a subspace of an F-linear space V. The equivalence generated by S is the relation ∼S on the set V defined by x ∼S y if x − y ∈ S. The quotient set V /S is the set of equivalence classes of ∼S , V /S = {[x] | x ∈ V }. It is easy to verify that ∼S is indeed an equivalence relation on V. The quotient set has a natural structure of linear space when we define the addition of classes as [x] + [y] = [x + y] and the scalar multiplication as a[x] = [ax] for a ∈ F and x, y ∈ V. These operations are well-defined. Indeed, if u ∈ [x] and v ∈ [y], then x − u ∈ S and y − v ∈ S. Therefore, x + y − (u + v) ∈ S, so [u + v] = [x + y]. Similarly, ax − au = a(x − u) ∈ S, so [au] = [ax]. We leave it to the reader to check that they satisfy the definition of a linear space. Definition 2.18. The quotient space of an F-linear space V by a subspace S is the F-linear space defined on the set V /∼S whose operations are defined as above. The quotient of V by S is denoted by V /S.
Observe that the surjective mapping hS : V −→ V /S, also known as the canonical morphism (or canonical surjection) of S defined by hS (x) = [x] for x ∈ V, is a linear mapping. We have Ker(hS ) = S and Im(hS ) = V /S. We saw that for a linear mapping h : V −→ W , the set Im(h) is a subspace of W . The quotient linear space W/Im(h) is referred to as the co-kernel of h and is denoted by Coker(h). If h is a surjective mapping, Im(h) = W and, therefore, Coker(h) = {[0W ]}. The co-kernel of a linear mapping h is the zero subspace if and only if h is surjective. Theorem 2.23. Let V, W, Z be two F-linear spaces. If h ∈ Hom(V, W ), then the following three statements are equivalent: (i) h is an injection; (ii) if f, g : Z −→ V are two linear mappings such that hf = hg, then f = g; (iii) Ker(h) = {0V }. Proof. (i) implies (ii): Let h be an injection and suppose that hf = hg, where f, g : Z −→ V are two linear operators. For z ∈ Z, we have h(f (z)) = h(g(z)), which yields f (z) = g(z) for z ∈ Z because h is injective. Thus, f = g. (ii) implies (iii): Let i : Ker(h) −→ V be the linear application defined by i(x) = x for x ∈ Ker(h). Let k : Ker(h) −→ W be the linear function defined by k(v) = 0W . Note that h(i(z)) = h(k(z)) = 0W . Thus, i(z) = k(z) = 0W , which implies z = 0V , so Ker(h) = {0V }. (iii) implies (i): Suppose that Ker(h) = {0V } and that h(x) = h(y). Therefore, h(x − y) = 0W , so x − y ∈ Ker(h) = {0V }, which means that x = y, that is, h is an injection. A similar characterization exists for surjective linear mappings. Theorem 2.24. Let V, W be two F-linear spaces. If h ∈ Hom(V, W ), then the following statements are equivalent: (i) h is a surjection; (ii) if U is an F-linear space and f, g ∈ Hom(W, U ) are such that f h = gh, then f = g.
Proof. (i) implies (ii): Suppose that h is a surjective linear mapping. Then, for every y ∈ W there exists x such that y = h(x). Thus, we have f (y) = f (h(x)) = g(h(x)) = g(y) for every y ∈ W , so f = g. (ii) implies (i): If condition (ii) is satisfied, let U = Coker(h) and define f, g : W −→ Coker(h) as f (w) = [h(w)] in Coker(h) = W/Im(h) and g(w) = 0Coker(h) . Then f (h(v)) = g(h(v)) means that [h(v)] = 0Coker(h) , hence Coker(h) = [0W ], hence h is surjective. Theorem 2.25. Let V, W, U be three linear spaces and let h : V −→ W and g : V −→ U be two linear mappings. If Ker(h) ⊆ Ker(g) and h is surjective, then there exists a unique k ∈ Hom(W, U ) such that g = kh. Proof. The statement of the theorem asserts the existence and uniqueness of the morphism k : W −→ U that makes the diagram h
commutative. Note that if Ker(h) ⊆ Ker(g), then h(x) = 0W implies g(x) = 0U . Since h is surjective, if y ∈ W , there exists x ∈ V such that h(x) = y. Define the mapping k : W −→ U as k(y) = g(x), where h(x) = y. We verify first that k is well-defined. Indeed, suppose that x1 ∈ V is such that h(x1 ) = y. This implies h(x) = h(x1 ), which is equivalent to h(x − x1 ) = 0W . Therefore, g(x − x1 ) = 0U , which means that g(x1 ) = g(x), so k is well-defined. The mapping k is linear. Indeed, let a1 , a2 ∈ F and y 1 , y 2 ∈ W . Suppose that x1 , x2 ∈ V are such that h(xi ) = y i for i = 1, 2. Then h(a1 x1 + a2 x2 ) = a1 y 1 + a2 y 2 . This means that k(a1 y 1 + a2 y 2 ) = g(a1 x1 + a2 x2 ), so by the linearity of g, we have k(a1 y 1 + a2 y 2 ) = a1 g(x1 ) + a2 g(x2 ) = a1 k(y 1 ) + a2 k(y 2 ), which proves that k is linear. The definition of h implies that k(h(x)) = g(x) for every x ∈ V, so g = kh.
Suppose that k1 is a morphism in Hom(W, U) such that g = k1h, so k1h = kh. Since h is a surjection, by Theorem 2.24, we obtain k1 = k.
Theorem 2.26. Let h : V −→ W be a linear mapping between the F-spaces V and W and let S be a subspace of V. There exists a unique linear mapping h̄ : V/S −→ W such that h̄([x]) = h(x) (where [x] = x + S for all x ∈ V) if and only if h(s) = 0W for all s ∈ S.
[Diagram: hS : V −→ V/S, h : V −→ W, and h̄ : V/S −→ W with h = h̄ hS.]
If h̄ exists and is unique, then, conversely, every linear mapping g : V/S −→ W is generated in this manner by a unique linear mapping from V to W.
Proof. Suppose that h̄ exists such that h̄([x]) = h(x). Then, if s ∈ S, we have h(s) = h̄(s + S) = h̄(0V/S) = 0W. Conversely, suppose now that h(s) = 0W for all s ∈ S. Define h̄ as h̄([x]) = h(x). Note that h̄ is well-defined for, if [x] = [y], we have x − y ∈ S, hence h(x − y) = 0W, or h(x) = h(y). The final part of the theorem follows by noting that the equality h̄(x + S) = h(x) shows that either of the mappings h, h̄ determines the other.
Let SUBSP(V) be the collection of subspaces of a linear space V. If this set is equipped with the inclusion relation ⊆ (which is a partial order), then for any two subspaces K, L both sup{K, L} and inf{K, L} exist and are given by
sup{K, L} = {x + y | x ∈ K and y ∈ L}, (2.1)
inf{K, L} = K ∩ L. (2.2)
Let H = {x + y | x ∈ K and y ∈ L}. Observe that we have both K ⊆ H and L ⊆ H because 0 belongs to both K and L.
If u and v belong to H, then u = x1 + y1 and v = x2 + y2, where x1, x2 ∈ K and y1, y2 ∈ L. Since x1 − x2 ∈ K and y1 − y2 ∈ L (because K and L are subspaces), it follows that u − v = x1 + y1 − (x2 + y2) = (x1 − x2) + (y1 − y2) ∈ H. We also have au = ax1 + ay1 ∈ H because ax1 ∈ K and ay1 ∈ L. Thus, H is a subspace of V and is an upper bound of {K, L} in the partially ordered set (SUBSP(V), ⊆). If G is a subspace of V that contains both K and L, then x + y ∈ G for x ∈ K and y ∈ L, so H ⊆ G. Thus, H = sup{K, L}. We denote H = sup{K, L} by K + L.
The lattice of subspaces SUBSP(V) of a linear space V is actually a complete lattice because the collection of subspaces of a linear space is a closure system. Next, we prove the modularity of SUBSP(V).
Theorem 2.27. Let V be an F-linear space. For any P, Q, R ∈ SUBSP(V) such that Q ⊆ P, we have P ∩ (Q + R) = Q + (P ∩ R).
Proof. Note that Q ⊆ P ∩ (Q + R) and P ∩ R ⊆ P ∩ (Q + R). Therefore, we have the inclusion Q + (P ∩ R) ⊆ P ∩ (Q + R), which leaves us with the reverse inclusion to prove. Let z ∈ P ∩ (Q + R). This implies z ∈ P and z = x + y, where x ∈ Q ⊆ P and y ∈ R. Therefore, y = z − x ∈ P, so y ∈ P ∩ R. Consequently, z ∈ Q + (P ∩ R), so P ∩ (Q + R) ⊆ Q + (P ∩ R).
Theorem 2.35 can now be reformulated as
Corollary 2.9. Let K, L be two subspaces of a linear space V. We have dim(sup{K, L}) + dim(inf{K, L}) = dim(K) + dim(L).
An immediate consequence of Corollary 2.9 is the following inequality, valid for two linear subspaces K, L of a linear space V:
dim(K + L) ≤ dim(K) + dim(L). (2.3)
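For subspaces of Rn given by spanning sets, Corollary 2.9 can be checked numerically: dim(K + L) is the rank of the columns of K and L placed side by side, and dim(K ∩ L) then follows from the formula. The sketch below uses subspaces whose intersection is visible by construction; the vectors are arbitrary illustrations.

```python
import numpy as np

# K = span{k1, k2}, L = span{l1, l2} in R^4; they share the direction k1 = l1.
k1 = np.array([1.0, 0.0, 0.0, 0.0])
k2 = np.array([0.0, 1.0, 0.0, 0.0])
l1 = k1.copy()
l2 = np.array([0.0, 0.0, 1.0, 0.0])

K = np.column_stack([k1, k2])
L = np.column_stack([l1, l2])

dim_K = np.linalg.matrix_rank(K)                     # 2
dim_L = np.linalg.matrix_rank(L)                     # 2
dim_sum = np.linalg.matrix_rank(np.hstack([K, L]))   # dim(K + L) = 3

# Corollary 2.9: dim(K + L) + dim(K ∩ L) = dim(K) + dim(L).
dim_intersection = dim_K + dim_L - dim_sum
print(dim_sum, dim_intersection)                     # 3 1  (K ∩ L = span{k1})
```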
Theorem 2.28. Let V be an n-dimensional linear space and let W be a subspace of V. Then the space V /W is finite-dimensional and dim(V /W ) = dim(V ) − dim(W ).
Proof. Let {w 1 , . . . , w n } be a basis of W and let {w 1 , . . . , wn , v 1 , . . . , v k } be its extension to a basis for V, where dim(W ) = n and dim(V ) = n + k. The set {v 1 + W, . . . , v k + W } is a basis of V /W . Indeed, suppose that a1 (v 1 + W ) + · · · + ak (v k + W ) = W for a1 , . . . , ak ∈ F, where W is the zero element of V /W . This equality amounts to (a1 v 1 + · · · + ak v k ) + W = W, which implies a1 v 1 + · · · + ak v k ∈ W . Thus, we can write a1 v 1 + · · · + ak v k = b1 w1 + · · · + bn wn for some b1 , . . . , bn ∈ F. Since a1 v 1 + · · · + ak v k − b1 w1 − · · · − bn wn = 0V , and {w1 , . . . , w n , v 1 , . . . , v k } is a basis, it follows that a1 = · · · = ak = b1 = · · · = bn = 0. This shows that {v 1 + W, . . . , v k + W } is linearly independent. Let v + W ∈ V /W . Since the set {w 1 , . . . , wn , v 1 , . . . , v k } spans V, we can write v = a1 w 1 + · · · + an wn + b1 v 1 + · · · + bk v k for some a1 , . . . , an , b1 , . . . , bk ∈ F. Therefore, we can write v + W = (a1 w1 + · · · + an wn + b1 v1 + · · · + bk v k ) + W = (b1 v 1 + · · · + bk v k ) + W = b1 (v 1 + W ) + · · · + bk (v k + W ). This shows that the set {v 1 + W, . . . , v k + W } spans V /W , and we conclude that this set forms a basis of V /W . Therefore, dim(V /W ) = k = dim(V ) − dim(W ).
Definition 2.19. Let 𝒱 = {Vi | i ∈ I} be a family of F-linear spaces indexed by a set I. The direct product of 𝒱 is the linear space ∏_{i∈I} Vi that consists of all functions f : I −→ ⋃_{i∈I} Vi such that f(i) ∈ Vi for i ∈ I. The addition and scalar multiplication are defined by
(f + g)(i) = f(i) + g(i), (af)(i) = af(i)
for every f, g ∈ ∏_{i∈I} Vi and a ∈ F.
The support of f ∈ ∏_{i∈I} Vi is the set supp(f) = {i | f(i) ≠ 0Vi}. We have
supp(f + g) ⊆ supp(f) ∪ supp(g), supp(af) ⊆ supp(f)
for every f, g ∈ ∏_{i∈I} Vi and a ∈ F.
The notion of direct sum of linear spaces has different meanings depending on the nature of the linear spaces involved.
Definition 2.20. Let 𝒱 = {Vi | i ∈ I} be a family of F-linear spaces. The external direct sum of 𝒱 is the F-linear space ⨁_{i∈I} Vi that consists of all functions f : I −→ ⋃_{i∈I} Vi such that f(i) ∈ Vi for i ∈ I that have a finite support. The addition and scalar multiplication are defined exactly as in the case of the members of the direct product.
It is clear that the direct sum of a family of F-linear spaces is a subspace of their direct product. Also, for finite families of linear spaces, the direct product is identical to the direct sum.
If V1, V2 are subspaces of an F-linear space V, then their intersection is non-empty because 0V ∈ V1 ∩ V2. Moreover, it is easy to see that V1 ∩ V2 is also a subspace of V.
Let V1, V2 be two subspaces of a linear space V. Their internal sum, or simply their sum, is the subset V1 + V2 of V defined by V1 + V2 = {x + y | x ∈ V1 and y ∈ V2}. It is immediate to verify that V1 + V2 is a subspace of V and that 0V ∈ V1 ∩ V2.
Theorem 2.29. Let V1 , V2 be two subspaces of the F-linear space V. If V1 ∩ V2 = {0V }, then any vector x ∈ V1 + V2 can be uniquely written as x = x1 + x2 , where x1 ∈ V1 and x2 ∈ V2 . Proof. By the definition of the sum V1 + V2 , it is clear that any vector x ∈ V1 + V2 can be written as x = x1 + x2 , where v 1 ∈ V1 and v 2 ∈ V2 . We need to prove only the uniqueness of x1 and x2 . Suppose that x = x1 + x2 = y 1 + y 2 , where x1 , y 1 ∈ V1 and x2 , y 2 ∈ V2 . This implies x1 − y 1 = y 2 − x2 and, since x1 − y 1 ∈ V1 and y 2 − x2 ∈ V2 , it follows that x1 − y 1 = y 2 − x2 = 0V by hypothesis. Therefore, x1 = y 1 and x2 = y 2 . Theorem 2.30. Let V1 , V2 be two subspaces of the F-linear space V. If every vector x ∈ V1 + V2 can be uniquely written as x = x1 + x2 , then V1 ∩ V2 = 0V . Proof. Suppose that the uniqueness of the expression of x holds but z ∈ V1 ∩ V2 and z = 0V . If x = x1 + x2 , then we can also write x = (x1 + z) + (x2 − z), where x1 + z ∈ V1 and x2 − z ∈ V2 , x1 + z = x1 and x2 − z = x2 , and this contradicts the uniqueness property. Corollary 2.10. Let V be an F-linear space and let V1 , V2 be two subspaces of V. The decomposition of a vector x ∈ V as x = x1 + x2 , where x1 ∈ V1 and x2 ∈ V2 , is unique if and only if V1 ∩ V2 = {0V }. Proof.
This statement follows from Theorems 2.29 and 2.30.
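When V1 ∩ V2 = {0V}, the unique decomposition of Theorems 2.29 and 2.30 can be computed by solving one linear system: writing basis vectors of V1 and V2 as the columns of a matrix, the coordinates of x in that combined basis give the two components. A sketch in R3 with a plane and a line that intersect only at 0; the vectors are illustrative choices.

```python
import numpy as np

# V1 = span{a1, a2} (a plane), V2 = span{b1} (a line), with V1 ∩ V2 = {0}.
a1 = np.array([1.0, 0.0, 0.0])
a2 = np.array([0.0, 1.0, 0.0])
b1 = np.array([1.0, 1.0, 1.0])
B = np.column_stack([a1, a2, b1])      # invertible, since V1 + V2 = R^3

x = np.array([3.0, -1.0, 2.0])
c = np.linalg.solve(B, x)              # coordinates of x in the combined basis

x1 = c[0] * a1 + c[1] * a2             # component in V1
x2 = c[2] * b1                         # component in V2
print(x1, x2, np.allclose(x, x1 + x2)) # the decomposition x = x1 + x2 is unique
```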
If V1 and V2 are subspaces of the F-linear space V and V1 ∩ V2 = {0V }, we refer to V1 + V2 as the direct sum of the subspaces V1 and V2 . The direct sum V1 + V2 is denoted by V1 ⊕ V2 . Theorem 2.31. Let V be an F-linear space and U0 be a subspace of V. There exists a subspace U1 of V such that V = U0 ⊕ U1 . Proof. Let {ei | i ∈ I} be a basis of the subspace U0 and let {ei | i ∈ I ∪ J} be its completion to a basis of the entire linear space V, where I ∩ J = ∅. If U1 is the subspace generated by {ei | i ∈ J}, then V = U0 + U1 .
Subspaces of linear spaces are related to idempotent endomorphisms as shown next. Theorem 2.32. Let V be an F-linear space. If h ∈ End(V ) is an idempotent endomorphism of V, then V = Ker(h) ⊕ Im(h). Proof. Observe that both Ker(h) and Im(h) are subspaces of V. Furthermore, suppose that x ∈ Ker(h) ∩ Im(h). Since x ∈ Im(h), we have x = h(y) for some y ∈ V. On the other hand, x ∈ Ker(h) means that h(x) = 0V . Thus, h(h(y)) = h(y) implies h(x) = x, which yields x = 0V . This allows us to conclude that Ker(h) ∩ Im(h) = {0V }. If z ∈ V, we have z = (z−h(z))+h(z). Observe that h(z) ∈ Im(h) and h(z − h(z)) = h(z)− h(h(z)) = 0V , so z − h(z) ∈ Ker(h) because h is idempotent. We conclude that V = Ker(h) ⊕ Im(h). Theorem 2.33. Let V be an F-linear space. If U and W are two subspaces of V such that V = U ⊕ W, then there exists an idempotent endomorphism h of V such that U = Ker(h) and W = Im(h). Proof. Since V is a direct sum of U and W , each v ∈ V can be uniquely written as v = u + w, where u ∈ U and w ∈ W . Let π1 and π2 be the canonical projections and let h1 and h2 be the canonical injections of the subspaces U and W shown in what follows, where h1 (u) = u, h2 (w) = w and p1 (u + w) = u, p2 (u + w) = w for u ∈ U and w ∈ W . p2 p1 - V U W h1 h2 Define the endomorphism g1 ∈ End(V ) as g1 = h1 p1 . Note that v ∈ Ker(g1 ), where v = u + w for u ∈ U and w ∈ W if and only if g1 (v) = 0V . This, in turn, is equivalent to h1 (p1 (v)) = h1 (u) = u = 0V , so v = w ∈ W . This shows that Ker(g1 ) = W . On the other hand, z ∈ Im(g1 ) means that z = g1 (v) = h1 p1 (u + v) = h1 (u) = u, so Im(g1 ) = U . It is immediate to verify that g1 is idempotent. In Theorem 2.31, we saw that if U is a subspace of an F-linear space V, there exists another subspace W of V such that V = U ⊕W . By Theorem 2.33, there exists an idempotent endomorphism h of V such that U = Ker(h) and W = Im(h).
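Theorem 2.32 is the abstract statement behind projection matrices: if P is a square matrix with P² = P, then Rn decomposes as Ker(P) ⊕ Im(P), and every z splits as z = (z − Pz) + Pz with z − Pz ∈ Ker(P) and Pz ∈ Im(P). A numerical sketch with a simple idempotent matrix (chosen for illustration):

```python
import numpy as np

P = np.array([[1.0, 1.0],
              [0.0, 0.0]])            # idempotent: P @ P == P
print(np.allclose(P @ P, P))          # True

z = np.array([3.0, 2.0])
im_part = P @ z                       # lies in Im(P)
ker_part = z - P @ z                  # lies in Ker(P), since P @ (z - P @ z) = 0
print(np.allclose(P @ ker_part, 0))         # True
print(np.allclose(z, ker_part + im_part))   # True: R^2 = Ker(P) ⊕ Im(P)
```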
Definition 2.21. If V is an F-linear space and U, W are two subspaces of V such that V = U ⊕ W, then we say that U, W are complementary subspaces.
Theorem 2.34. Let V be an F-linear space and let U, W be two subspaces of V. The following statements are equivalent:
(i) U, W are complementary subspaces;
(ii) V = {u + w | u ∈ U and w ∈ W} and U ∩ W = {0V};
(iii) if BU and BW are two bases of U and W, respectively, then BU ∪ BW is a basis of V and BU ∩ BW = ∅.
Proof. (i) implies (ii): Suppose that V = U ⊕ W. Then, every v ∈ V can be uniquely written as a sum v = u + w with u ∈ U and w ∈ W, so V = {u + w | u ∈ U and w ∈ W}. If x = u + w, where u ∈ U and w ∈ W, and t ∈ U ∩ W with t ≠ 0V, then we can also write x = (u − t) + (t + w), which contradicts the uniqueness of the representation of x. Thus, t = 0V.
(ii) implies (iii): Since U ∩ W = {0V}, we have BU ∩ BW = ∅ because 0V does not belong to either BU or BW. Every v ∈ V can be written as v = u + w, where u ∈ U and w ∈ W. Thus, u = ∑_{i∈I} ai ui with ai ∈ F and ui ∈ BU, and w = ∑_{j∈J} bj wj with bj ∈ F and wj ∈ BW, so v = u + w can be written as a linear combination of BU ∪ BW.
(iii) implies (i): Suppose now that BU and BW are two bases of the subspaces U and W, respectively, such that BU ∪ BW is a basis of V and BU ∩ BW = ∅. Since BU ∪ BW is a basis of V, it is clear that every v ∈ V is a sum of the form u + w, where u ∈ U and w ∈ W. Suppose that t ∈ U ∩ W and t ≠ 0V. Then, t can be expressed both as a linear combination of BU and as a linear combination of BW, t = ∑_{i∈I} ai ui = ∑_{j∈J} bj wj, where not all ai and not all bj equal 0. This implies ∑_{i∈I} ai ui − ∑_{j∈J} bj wj = 0V, which contradicts the linear independence of the set BU ∪ BW.
A generalization of Theorem 2.34 is given next.
Theorem 2.35. Let V be an F-linear space and let U, W be two subspaces of V. The set T = {t ∈ V | t = u + w, u ∈ U, w ∈ W} is a subspace of V and dim(T) = dim(U) + dim(W) − dim(U ∩ W).
Proof. It is straightforward to verify that T is indeed a subspace of V. Suppose that BZ = {z 1 , . . . , z k } is a basis for the subspace Z = U ∩ W . By the Extension corollary (Corollary 2.3), B can be extended to a basis for U , BU = BZ ∪ {u1 , . . . , up } and to a basis for W , BW = BZ ∪ {w1 , . . . , w q }. It is clear that B = BU ∪ BW generates the subspace T . The set B is linear independent. Indeed, suppose that a1 z 1 + · · · + ak z k + b1 u1 + · · · + bp up + c1 w1 + · · · + cq wq = 0V , for some a1 , . . . , ak , b1 , . . . , bp , c1 , . . . , cq ∈ F, so c1 w1 + · · · + cq w q = − (a1 z 1 + · · · + ak z k + b1 u1 + · · · + bp up ) ∈ U and, of course, c1 w1 + · · · + cq wq ∈ W . Thus, c1 w1 + · · · + cq wq ∈ U ∩ W , so c1 w1 + · · · + cq wq = d1 z 1 + · · · + dk z k , which implies c1 = · · · = cq = d1 = · · · = dk = 0 because BW is a basis for W . Using this fact, we obtain a1 z 1 + · · · + ak z k + b1 u1 + · · · + bp up = 0V . Since BU is a basis, this implies a1 = · · · = ak = b1 = · · · = bp = 0. We conclude that BU ∪ BW is indeed a basis for T and dim(T ) = |B| = |BU | + |BV | − |BU ∩ BV | = dim(U ) + dim(W ) − dim(U ∩ W ).
Corollary 2.11. Let U, W be two subspaces of an F-linear space V with dim(V) = n. We have dim(U) + dim(W) − dim(U ∩ W) ≤ n.
Proof. The inequality follows from Theorem 2.35 by observing that dim(T) ≤ dim(V) = n.
Theorem 2.36. An F-linear space V is a direct sum of the family of subspaces {Vi | i ∈ I} if and only if V = ∑_{i∈I} Vi and for each i ∈ I we have
Vi ∩ ∑_{j∈I−{i}} Vj = {0V}. (2.4)
Proof. Suppose that V is the direct sum of the family of its subspaces {V | i ∈ I}. It is clear that V = i i∈I Vi . If z ∈ Vi ∩ j∈I−{i} Vj , then z = xi for some xi ∈ Vi and z = xj1 + · · · xj , where jp = i for 1 p . By the uniqueness of direct sum representation, we have xi = 0V , and z = 0V , hence Equality (2.4) holds. Conversely, suppose that V = i∈I Vi and Equality (2.4) holds. Suppose that a vector x ∈ V can be written as x = xj 1 + · · · + xj n and as x = tk1 + · · · + tkm , where xjp ∈ Vjp for 1 p n and tkq ∈ Vkq for 1 q m. Without loss of generality we may assume that n = m (by adding to the expressions of x an appropriate number of zero terms). Thus, we can assume that x = x i1 + · · · + x in = t i1 + · · · + t in , where xi and ti both belong to the subspace Vi for 1 n. Therefore, xi − ti ∈ Vi for 1 n. Since (xi1 − ti1 ) + · · · + (xin − tin ) = 0V , if follows that each difference xir − tir is a sum of vectors from other subspaces. This is possible if and only if xir − tir = 0V , hence, xir = tir for 1 r n. If {xi | i ∈ I} and {y j | j ∈ J} are bases in the linear spaces V and W , respectively, then the set {(xi , 0W ) | i ∈ I} ∪ {(0V , y j ) | j ∈ J} is a basis in V + W , as the reader can easily verify. If both V and W are of finite type, then so is V + W and dim(V + W ) = dim(V ) + dim(W ).
The direct sum of two linear spaces allows us to introduce four linear mappings: the injections i1 : V −→ V + W , i2 : W −→ V + W , and the projections p1 : V + W −→ V and p2 : V + W −→ W given by i1 (x) = (x, 0W ), i2 (y) = (0V , y), p2 (x, y) = y, p1 (x, y) = x, for x ∈ V and y ∈ W . It is immediate to verify that i1 , i2 are injective and p1 , p2 are surjective mappings. The definitions of i1 , i2 , p1 , and p2 imply immediately p1 i1 = 1V p1 i2 = 0W
p2 i2 = 1W , p2 i1 = 0V .
Additionally, we have (i1 p1 + i2 p2 )(x, y) = i1 p1 (x, y) + i2 p2 (x, y) = (x, 0W ) + (0V , y) = (x, y), for every (x, y) ∈ V × W . Theorem 2.37. Let V, W, U be three linear vector spaces for which there exist the linear mappings h1 : V −→ U , h2 : W −→ U and g1 : U −→ V, g2 : U −→ W such that

g1 h1 = 1V ,   g2 h2 = 1W ,
g1 h2 = 0W,V , g2 h1 = 0V,W ,

and h1 g1 + h2 g2 = 1U . Then, there exists an isomorphism h : V + W −→ U such that

h1 = hi1 ,  g1 = p1 h−1 ,
h2 = hi2 ,  g2 = p2 h−1 .
Proof. The linear mappings mentioned above are shown in the following commutative diagram.
[Commutative diagram omitted: it displays the spaces V , W , U , and V + W together with the mappings i1 , p1 , h1 , g1 , i2 , p2 , h2 , g2 , and the isomorphism h : V + W −→ U .]
Define the linear mappings k : U −→ V + W as k(z) = (g1 (z), g2 (z)) for z ∈ U and ℓ : V + W −→ U as ℓ(x, y) = h1 (x) + h2 (y) for x ∈ V and y ∈ W . Note that ℓ(k(z)) = ℓ(g1 (z), g2 (z)) = h1 (g1 (z)) + h2 (g2 (z)) = (h1 g1 + h2 g2 )(z) = z, and k(ℓ(x, y)) = k(h1 (x) + h2 (y)) = k(h1 (x)) + k(h2 (y)) = (g1 (h1 (x)), g2 (h1 (x))) + (g1 (h2 (y)), g2 (h2 (y))) = (x, 0W ) + (0V , y) = (x, y). This shows that the linear mappings ℓ and k are inverse isomorphisms.

2.8 Dual Linear Spaces
Let V be an F-linear space. The set of linear forms defined on V is denoted by V ∗ . This set has the natural structure of an F-linear space known as the dual of the space V . The elements of V ∗ are also referred to as covariant vectors or covectors. We will refer to the vectors of the original linear space as contravariant vectors. The reason for adopting the designation of
“covariant” and “contravariant” terms for the vectors of V ∗ and V, respectively, will be discussed in Section 3.7. Definition 2.22. Let B = {ui ∈ V | 1 ≤ i ≤ n} be a basis in an n-dimensional F-linear space V, and let f ∈ V ∗ be a covector. The numbers ai = f (ui ), where 1 ≤ i ≤ n, are the components of the covector f relative to the basis B of V. Theorem 2.38. Let B = {ui ∈ V | 1 ≤ i ≤ n} be a basis in an n-dimensional F-linear space V. If {ai ∈ F | 1 ≤ i ≤ n} is a set of scalars, then there is a unique covector f ∈ V ∗ such that f (ui ) = ai for 1 ≤ i ≤ n. Proof. Since B is a basis in V, we can write v = Σ_{i=1}^{n} ci ui for every v ∈ V. Thus,

f (v) = f ( Σ_{i=1}^{n} ci ui ) = Σ_{i=1}^{n} ci ai ,
which shows that the covector f is uniquely determined by the n-tuple of scalars a = (a1 , . . . , an ). Let B = {ui ∈ V | 1 ≤ i ≤ n} and B̃ = {ũi ∈ V | 1 ≤ i ≤ n} be two bases in the linear space V, where ũi = Σ_{j=1}^{n} cij uj for 1 ≤ i ≤ n. The components of a covector f ∈ V ∗ in the basis B̃ can be written as

f (ũi ) = f ( Σ_{j=1}^{n} cij uj ) = Σ_{j=1}^{n} cij f (uj ).
This equality shows that the components of a covector transform in the same manner as the basis. This justifies the use of the term covariant applied to these components. Theorem 2.39. Let V be an n-dimensional F-linear space. Then, its dual V ∗ is isomorphic to Fn , and, thus, dim(V ∗ ) = dim(V ) = n.
Proof. The function h : Fn −→ Hom(V, F) that maps the vector a = (a1 , . . . , an )
to the function f defined as in Theorem 2.38 is an isomorphism, as it can be checked easily. Theorem 2.38 shows that a linear form f ∈ V ∗ is uniquely determined by its values on the basis of the space V. This allows us to prove the following extension theorem. Theorem 2.40. Let U be a subspace of a finite-dimensional F-linear space V. A linear function g : U −→ F belongs to U ∗ if and only if there exists a linear form f ∈ V ∗ such that g is the restriction of f to U . Proof. If g is the restriction of f to U , then it is immediate that g ∈ U ∗ . Conversely, let g ∈ U ∗ and let B = {u1 , . . . , up } be a basis of U , where dim(U ) = p. Consider an extension of B to a basis of the entire space, B1 = {u1 , . . . , up , up+1 , . . . , un }, where n = dim(V ), and define the linear form f : V −→ F by f (ui ) = g(ui ) if i ≤ p, and f (ui ) = 0 if p + 1 ≤ i ≤ n. Since f and g coincide for all members of the basis of U , it follows that g is the restriction of f to U . We refer to f as the zero-extension of the linear form g defined on the subspace U . Theorem 2.41. If B = {v 1 , . . . , v n } is a basis of the F-linear space V, then the set of linear forms F = {f j | 1 ≤ j ≤ n} defined by f j (v i ) = 1 if i = j and f j (v i ) = 0 otherwise is a basis of the dual linear space V ∗ .
Proof. The set F = {f 1 , . . . , f n } spans the entire dual space V ∗ . Indeed, let f ∈ V ∗ be defined by f (v i ) = ai for 1 ≤ i ≤ n. If v = Σ_{i=1}^{n} ci v i , then

f (v) = f ( Σ_{i=1}^{n} ci v i ) = Σ_{i=1}^{n} ci f (v i ) = Σ_{i=1}^{n} ci ai .

On the other hand,

( Σ_{i=1}^{n} ai f i )(v) = Σ_{i=1}^{n} Σ_{j=1}^{n} ai cj f i (v j ) = Σ_{i=1}^{n} ai ci ,

due to the definition of the linear forms f 1 , . . . , f n . Therefore, f = a1 f 1 + · · · + an f n , which shows that F spans V ∗ . To prove that the set F is linearly independent in V ∗ , suppose that a1 f 1 + · · · + an f n = 0V ∗ . This implies a1 f 1 (v) + · · · + an f n (v) = 0 for every v ∈ V. Choosing v = v j , we obtain aj f j (v j ) = 0, hence aj = 0, and this can be shown for 1 ≤ j ≤ n, which implies the linear independence. The basis F = {f 1 , . . . , f n } of V ∗ constructed in Theorem 2.41 is the dual basis of the basis B = {v 1 , . . . , v n } of V. We refer to the pair (B, F ) as a pair of dual bases. In general, we will index vectors in a linear space V using subscripts and covectors in the dual linear space using superscripts. Corollary 2.12. The dual of an n-dimensional F-linear space V is an n-dimensional linear space. Proof. This statement follows immediately from Theorem 2.41.
Let E = {e1 , . . . , en } be a basis of V and let F = {f 1 , . . . , f n } be its dual basis in V ∗ . If x = a1 e1 + · · · + an en , then we have

f i (x) = f i (a1 e1 + · · · + an en ) = Σ_{j=1}^{n} aj f i (ej ) = ai (2.5)

for 1 ≤ i ≤ n. Thus, we have

x = Σ_{i=1}^{n} f i (x)ei . (2.6)
Similarly, if f ∈ V ∗ and f = b1 f 1 + · · · + bn f n , then (f , ei ) = bi , and

f = Σ_{i=1}^{n} (f , ei )f i . (2.7)
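For readers who wish to experiment with these formulas, the following short MATLAB sketch (not part of the text) computes the dual basis of a basis of R3 ; the basis E, the vector x, and all variable names are arbitrary choices. Since the covector f i acts on coordinate columns as the i th row of the inverse of the basis matrix, the computation mirrors Equalities (2.5) and (2.6).

% Sketch (not from the text): dual basis of a basis of R^3, where vectors
% are represented by their coordinate columns in the standard basis.
E = [1 1 0; 0 1 1; 1 0 1];   % columns e_1, e_2, e_3 form a basis of R^3
F = inv(E);                  % row i of F represents the covector f^i
disp(F * E)                  % identity matrix: f^i(e_j) = delta_{ij}
x = [2; -1; 3];
c = F * x;                   % c(i) = f^i(x), the coordinates of x in E
disp(E * c)                  % reconstructs x, as in Equality (2.6)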
Example 2.19. Let P2 [x] be the linear space of polynomials of degree 2 in x, that consists of polynomials of the form p(x) = ax^2 + bx + c. The set {p0 , p1 , p2 } given by p0 (x) = 1, p1 (x) = x, and p2 (x) = x^2 is a basis in P2 [x]. Note that we have c = p(0), b = (1/2)(p(1) − p(−1)), and a = (1/2)(p(1) + p(−1) − 2p(0)). If f : P2 [x] −→ R is a linear form, we have f (p) = af (x^2 ) + bf (x) + cf (1) = (1/2)(p(1) + p(−1) − 2p(0))f (x^2 ) + (1/2)(p(1) − p(−1))f (x) + p(0)f (1). Therefore, a basis in P2 [x]∗ consists of the functions f 0 (p) = p(0), f 1 (p) = (1/2)(p(1) − p(−1)), and f 2 (p) = (1/2)(p(1) + p(−1) − 2p(0)). Theorem 2.42. Let V be a finite-dimensional F-linear space and let v ∈ V − {0V }. There exists a linear form f v in V ∗ such that f v (v) = 1. Proof. Since v ≠ 0V , there is a basis B in V that includes v. Then f v can be defined as the element of the dual basis of B that corresponds to v.
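Continuing Example 2.19, here is a minimal MATLAB check (not part of the text) that the three functionals recover the coefficients of a quadratic polynomial; the coefficient values are arbitrary.

% Sketch (not from the text): the dual basis functionals of Example 2.19
a = 3; b = -2; c = 5;
p = @(x) a*x.^2 + b*x + c;
f0 = p(0);                          % recovers c
f1 = (p(1) - p(-1)) / 2;            % recovers b
f2 = (p(1) + p(-1) - 2*p(0)) / 2;   % recovers a
disp([f0 f1 f2])                    % displays [5 -2 3]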
We saw that the dual V ∗ of an F-linear space V is an F-linear space. The construction of the dual may be repeated, and V ∗∗ , the dual of the dual F-linear space V ∗ , is an F-linear space. In the case of finite-dimensional linear spaces, we have dim(V ∗∗ ) = dim(V ∗ ) = dim(V ), and all these spaces are isomorphic. Theorem 2.43. Let V be a finite-dimensional F-linear space. Then, the dual V ∗∗ of the dual V ∗ of V is an F-linear space isomorphic to V. Proof. The space V ∗∗ consists of linear functions of the form φ : V ∗ −→ F that map linear forms f ∈ V ∗ into scalars in F. Let Ψ : V −→ V ∗∗ be the mapping given by Ψ(v) = φ, where φ is the linear form defined by φ(f ) = f (v) for all v ∈ V and f ∈ V ∗ . The mapping Ψ is injective. Indeed, suppose that Ψ(u) = Ψ(v) = φ for u, v ∈ V. This implies φ(f ) = f (u) = f (v) for every f ∈ V ∗ , which, in turn, implies u = v. Suppose that Ψ(v) = 0V ∗∗ . Then, f (v) = 0 for every f ∈ V ∗ , which implies v = 0V (by Theorem 2.42). Thus, by Theorem 2.16, Ker(Ψ) = {0V }, which implies that Ψ is an isomorphism. Theorem 2.43 allows us to identify V ∗∗ with V when necessary. Definition 2.23. Let V be an F-linear space and let U be a subset of V. The annihilator of U is the subset Ann(U ) of the dual space V ∗ given by Ann(U ) = {f ∈ V ∗ | f (x) = 0 for every x ∈ U } = {f ∈ V ∗ | U ⊆ Ker(f )}. The annihilator of a subset T of the dual space V ∗ of a linear space V is the set Ann(T ) = {v ∈ V | f (v) = 0 for every f ∈ T } = ∩{Ker(f ) | f ∈ T }. Theorem 2.44. Let V be an F-linear space. For any subsets U ⊆ V and T ⊆ V ∗ , Ann(U ) and Ann(T ) are subspaces of V ∗ and V, respectively.
Proof. Let f , f̃ ∈ Ann(U ) and let a, b ∈ F. We have (af + bf̃ )(x) = af (x) + bf̃ (x) = 0 for every x ∈ U , so af + bf̃ ∈ Ann(U ). The argument for Ann(T ) is similar. Theorem 2.45. If V is a finite-dimensional F-linear space, let U ⊆ V and T ⊆ V ∗ be subspaces of V and V ∗ , respectively. The dual space U ∗ is isomorphic to the quotient space V ∗ /Ann(U ) and the dual space T ∗ is isomorphic to the quotient space V ∗∗ /Ann(T ). The quotient linear spaces (V /U )∗ and (V ∗ /T )∗ are isomorphic to the subspaces Ann(U ) and Ann(T ), respectively. Proof. Recall that the zero-extension of a linear form defined on a subspace U of V was introduced in Theorem 2.40. Let Φ : V ∗ −→ U ∗ be the mapping that sends each f ∈ V ∗ to its restriction to U (so Φ(f ) = g when f is the zero-extension of g for g ∈ U ∗ ). It is easy to verify that Φ is a surjective linear mapping. Furthermore, Ker(Φ) consists of those linear forms f in V ∗ such that Φ(f ) is the zero linear form on U , so f (x) = 0 for every x ∈ U . Consequently, Ker(Φ) = Ann(U ). By Corollary 2.5, the quotient space V ∗ /Ann(U ) is isomorphic to U ∗ . Proving that T ∗ is isomorphic to V ∗∗ /Ann(T ) is left to the reader. For the second part of the theorem, define a mapping Ψ : (V /U )∗ −→ Ann(U ) as follows. Consider a linear form φ ∈ Hom(V /U, F) and the canonical mapping hU : V −→ V /U , and define Ψ(φ) = φhU . Observe that φ(hU (x)) = 0 for x ∈ U , so φhU ∈ Ann(U ). It is immediate that Ψ is a linear mapping; moreover, Ψ is injective because hU is surjective, so φhU = 0 implies φ = 0. If g ∈ Ann(U ), this means that Ker(hU ) = U ⊆ Ker(g), so by Theorem 2.25, since hU is surjective, there exists a unique factorization g = khU ; taking φ = k yields g = Ψ(φ). Thus, Ψ is an isomorphism. We leave to the reader the proof of the isomorphism between (V ∗ /T )∗ and Ann(T ). Definition 2.24. Let V, W be two F-linear spaces and let h ∈ Hom(V, W ) be a linear mapping. The dual of h is the morphism h∗ ∈ Hom(W ∗ , V ∗ ) defined by h∗ (f ) = f h for f ∈ W ∗ , as shown in the commutative diagram that follows. [Commutative diagram omitted: h : V −→ W , f : W −→ F, and h∗ (f ) = f h : V −→ F.]
To verify the correctness of Definition 2.24, we need to show that h∗ is indeed a morphism between the linear spaces W ∗ and V ∗ . Let f , g ∈ W ∗ and let a, b ∈ F. The linear form φ = af + bg is defined by φ(v) = af (v) + bg(v) for v ∈ W . Then, h∗ (φ) = h∗ (af + bg) = (af + bg)h. Since (af + bg)h(u) = af (h(u)) + bg(h(u)) = a(h∗ f )(u) + b(h∗ g)(u) for u ∈ V, it follows that h∗ (af + bg) = ah∗ f + bh∗ g, which proves that h∗ is a morphism. Theorem 2.46. If V, W are two F-linear spaces, the mapping Φ : Hom(V, W ) −→ Hom(W ∗ , V ∗ ) defined by Φ(h) = h∗ for h ∈ Hom(V, W ) is linear. Proof. Let h0 , h1 ∈ Hom(V, W ) and let a, b ∈ F. We need to prove that (ah0 + bh1 )∗ = ah∗0 + bh∗1 . Let f : W −→ F be a linear form. By the definition of dual morphisms, we have (ah0 + bh1 )∗ (f ) = f (ah0 + bh1 ). Therefore, for every v ∈ V , we have (ah0 + bh1 )∗ (f )(v) = f (ah0 (v) + bh1 (v)) = af h0 (v) + bf h1 (v) = ah∗0 (f )(v) + bh∗1 (f )(v), which proves that Φ is a linear mapping.
We leave it to the reader to verify that the dual of 1V is 1V ∗ and that for any two linear mappings h0 ∈ Hom(U, V ) and h1 ∈ Hom(V, W ), we have (h1 h0 )∗ = h∗1 h∗0 . Theorem 2.47. Let V be a finite-dimensional F-linear space and let Z be a subspace of V. Then dim(Ann(Z)) = dim(V ) − dim(Z) and Ann(Ann(Z)) = Z. Proof. If dim(V ) = n and Z is a subspace of V, then dim(V /Z) = n − dim(Z) by Theorem 2.17. Since (V /Z)∗ ∼ = Ann(Z), it follows that dim(Ann(Z)) = dim(V /Z)∗ = n − dim(Z). Thus, dim(V ) = dim(Z) + dim(Ann(Z)).
For the second part of the theorem, observe that by Definition 2.23 we have Ann(Ann(Z)) = ∩{Ker(f ) | f ∈ Ann(Z)} = ∩{Ker(f ) | Z ⊆ Ker(f )}, which implies Z ⊆ Ann(Ann(Z)). Note that both Z and Ann(Ann(Z)) are subspaces of V. The first part of the theorem implies dim(Ann(Ann(Z))) = dim(V ) − dim(Ann(Z)). On the other hand, we saw that dim(Z) + dim(Ann(Z)) = dim(V ), so dim(Z) = dim(Ann(Ann(Z))). This implies Ann(Ann(Z)) = Z. Theorem 2.48. Let V and W be two finite-dimensional F-linear spaces. If h ∈ Hom(V, W ), then rank(h) = rank(h∗ ). Proof. Note that f ∈ Ker(h∗ ) is equivalent to h∗ f (u) = 0 for every u ∈ V, that is, to f (h(u)) = 0 for every u ∈ V, which means that f (Im(h)) = {0}. Thus, Ker(h∗ ) = Ann(Im(h)), so dim(Ker(h∗ )) = dim(Ann(Im(h))). Since dim(W ) = dim(Ker(h∗ )) + dim(Im(h∗ )) = dim(Ann(Im(h))) + dim(Im(h∗ )), it follows that dim(Im(h∗ )) = dim(W ) − dim(Ann(Im(h))) = dim(Im(h)), by Theorem 2.47. Thus, rank(h∗ ) = rank(h).

2.9 Topological Linear Spaces
We are examining now the interaction between the algebraic structure of linear spaces and topologies that can be defined on linear spaces that are compatible in a certain sense with the algebraic structure. Compatibility, in this case, is defined as the continuity of addition and scalar multiplication. Definition 2.25. Let F be the real field R or complex field C. An F-topological linear space is a topological space (V, O) such that
(i) V is an F-linear space; (ii) the vector addition is a continuous function between V 2 and V ; (iii) the scalar multiplication is a continuous function between F × V and V . Unless stated otherwise, we assume that the field F is either the real or the complex field. Theorem 2.49. Let (V, O) be an F-topological linear space and let z ∈ V. The translation mapping tz : V −→ V is a homeomorphism. Proof. It is immediate that tz is a bijection whose inverse is t−z . The continuity of both tz and t−z follows from the continuity of the vector addition of V. Definition 2.26. The mapping tz introduced in Theorem 2.49 is the translation generated by z. Example 2.20. If a ≠ 0, then each homothety ha of a topological linear space (V, O) is a homeomorphism. Indeed, the inverse of ha is ha−1 . The continuity of both ha and ha−1 follows from the continuity of the scalar multiplication of V. Theorem 2.50. Let (V, O) be a topological linear space. If W is a neighborhood of 0V , then tx (W ) is a neighborhood of x. Moreover, every neighborhood of x can be obtained by a translation of a neighborhood of 0V . Proof. Since W is a neighborhood of the origin, there exists an open subset L of V such that 0V ∈ L ⊆ W . This implies x = tx (0V ) ∈ tx (L) ⊆ tx (W ). It follows that tx (L) is an open set and this, in turn, implies that tx (W ) is a neighborhood of x. Conversely, let U be a neighborhood of x and let K be an open set such that x ∈ K ⊆ U . Then, we have 0V = t−x (x) ∈ t−x (K) ⊆ t−x (U ). Since t−x (K) is an open set, it follows that t−x (U ) is a neighborhood of 0V and the desired conclusion follows from the fact that U = tx (t−x (U )). Theorem 2.50 shows that in a topological linear space (V, O) the neighborhoods of any point are obtained by translating the neighborhoods of 0V .
Corollary 2.13. If Fx is a fundamental system of neighborhoods of x in the topological linear space (V, O), then Fx can be obtained by a translation of a fundamental system of neighborhoods F0 of 0V . Proof.
This statement follows immediately from Theorem 2.50.
The next theorem shows that a linear function between two topological linear spaces is continuous if and only if it is continuous in the zero element of the first space. Theorem 2.51. Let (V1 , O1 ) and (V2 , O2 ) be two topological F-linear spaces having 01 and 02 as zero elements, respectively. A linear operator f ∈ Hom(V1 , V2 ) is continuous in x ∈ V1 if and only if it is continuous in 01 ∈ V1 . Proof. Let f be a function that is continuous in a point x ∈ V1 . If U ∈ neigh02 (O2 ), then f (x) + U is a neighborhood of f (x). Since f is continuous, there exists a neighborhood W of x such that f (W ) ⊆ f (x) + U . Observe that the set −x + W is a neighborhood of 01 . Moreover, any neighborhood of 01 has this form. If t ∈ −x+W , then t+x ∈ W and, therefore, f (t) + f (x) = f (t + x) ∈ f (x) + U . This shows that f (t) ∈ U , which proves that f is continuous in 01 . Conversely, suppose that f is continuous in 01 . Let x ∈ V1 and let Z ∈ neighf (x) (O2 ). The set −f (x) + Z is a neighborhood of 02 in V2 . The continuity of f in 01 implies the existence of a neighborhood T of 01 such that f (T ) ⊆ −f (x)+Z. Note that x+T is a neighborhood of x in V1 and every neighborhood of x in V1 has this form. Since f (x + T ) ⊆ Z, it follows that f is continuous in x. Corollary 2.14. Let (V1 , O1 ) and (V2 , O2 ) be two topological F-linear spaces. A linear operator f ∈ Hom(V1 , V2 ) is either continuous on V1 or is discontinuous in every point of V1 . Proof.
This statement is a direct consequence of Theorem 2.51.
Theorem 2.52. Let C and D be two subsets of Rn such that C is compact and D is closed. Then the set C + D = {x + y | x ∈ C, y ∈ D} is closed.
Proof. Let x ∈ K(C + D). There exists a sequence (x0 , x1 , . . .) such that xi ∈ C + D and limn→∞ xn = x. The definition of C + D means that there is a sequence (u0 , u1 , . . .) ∈ Seq∞ (C) and a sequence (v 0 , v 1 , . . .) ∈ Seq∞ (D) such that xi = ui + v i for i ∈ N. Since C is compact, the sequence (u0 , u1 , . . .) contains a convergent subsequence (ui0 , ui1 , . . .). Let u = limm→∞ uim . Clearly, limm→∞ xim = x. Since D is a closed set, limm→∞ v im = x − u ∈ D. Therefore, x = u + (x − u) ∈ C + D, so K(C + D) = C + D, which means that C + D is closed.

2.10 Isomorphism Theorems
Theorem 2.53 (Noether's first isomorphism theorem). Let V be an F-linear space, and let K and L be two subspaces of V such that K ⊆ L. Then L/K is a linear subspace of V /K and V /L ≅ (V /K)/(L/K). Proof. Consider the canonical linear mappings hK : V −→ V /K and hL : V −→ V /L for which we have Ker(hK ) = K ⊆ L = Ker(hL ). Since K ⊆ L, by Theorem 2.25, there exists a linear mapping e : V /K −→ V /L such that hL = ehK , that is, e(hK (x)) = hL (x), which means that e([x]K ) = [x]L for every x ∈ V. We denoted the ∼K -class of x by [x]K and the ∼L -class of the same element by [x]L . Since Ker(e) = L/K and Im(e) = V /L, we obtain that V /L is isomorphic to (V /K)/(L/K) by Theorem 2.12. Theorem 2.54 (Noether's second isomorphism theorem). Let V be an F-linear space, and let K and L be two subspaces of V. Then (L + K)/L ≅ K/(L ∩ K). Proof. Consider the linear mapping h : K −→ K + L defined by h(x) = x for every x ∈ K and the canonical linear mapping g : K + L −→ (K + L)/L. Define the linear mapping e = gh. We have Ker(e) = K ∩ L and Im(e) = (K + L)/L, so by Theorem 2.12, the subspaces (L + K)/L and K/(L ∩ K) are isomorphic.
2.11 Multilinear Functions
The notion of linear mapping can be extended as follows. Definition 2.27. Let V1 , . . . , Vn , U be real linear spaces. A multilinear function is a mapping f : V1 × · · · × Vn −→ U that is linear in each of its arguments when the other arguments are held fixed. In other words, f satisfies the following conditions:

f (x1 , . . . , xi−1 , Σ_{j=1}^{k} aj xji , xi+1 , . . . , xn ) = Σ_{j=1}^{k} aj f (x1 , . . . , xi−1 , xji , xi+1 , . . . , xn ),

for every xi , xji ∈ Vi and a1 , . . . , ak ∈ R. If V1 = · · · = Vn = V and U = R, the mapping f is called an n-linear form on V or an n-tensor on V. The set of n-tensors on V is denoted as Tn (V ). When n = 2, V1 = V2 = Rm , and U = R, we refer to f as a bilinear form on Rm . A bilinear form f : Rm × Rm −→ R is skew-symmetric if f (x, y) = −f (y, x) for x, y ∈ Rm . More generally, a tensor t over the linear space V is a multilinear function

t : V ∗ × · · · × V ∗ × V × · · · × V −→ R,

with p copies of V ∗ and q copies of V .
In this case, we say t is covariant of order p and contravariant of order q. The set of real multilinear functions defined on the linear spaces V1 , . . . , Vn and ranging in the real linear space V is denoted by M(V1 , . . . , Vn ; V ). The set of real multilinear forms is M(V1 , . . . , Vn ; R). Multilinear functions are defined on the Cartesian product of the sets V1 , . . . , Vn , not on the direct product of the linear spaces V1 , . . . , Vn . Example 2.21. Let V be a linear space and let V ∗ be its dual. Consider the following four pair of vectors:
(i) u in V and v ∈ V ; (ii) u in V and g ∈ V ∗ ; (iii) f in V ∗ and v ∈ V ; (iv) f in V ∗ and g ∈ V ∗ . Each of these pairs generates a bilinear function as follows: (i) φu,v : V ∗ × V ∗ −→ R given by φu,v (h, l) = h(u)l(v); (ii) φu,g : V ∗ × V −→ R given by φu,g (h, t) = h(u)g(t); (iii) φf ,v : V × V ∗ −→ R given by φf ,v (x, h) = f (x)h(v); (iv) φf ,g : V × V −→ R given by φf ,g (u, v) = f (u)g(v). The function φf ,v : V × V ∗ −→ R defined as φf ,v (x, h) = f (x)h(v) for x ∈ V and h ∈ V ∗ is linear in x by the linearity of f and is linear in the second variable h by the definition of V ∗ . Therefore, φf ,v ∈ M(V, V ∗ ; R). If f : U × W −→ V is a bilinear function, we have f (0U , w) = 0V and f (u, 0W ) = 0V for every w ∈ W and u ∈ U . Definition 2.28. Let V, W be two complex linear spaces. A function f : V × W −→ C is said to be Hermitian bilinear if it is linear in the first variable and skew-linear in the second, that is, it satisfies the following equalities: f (a1 x1 + a2 x2 , y) = a1 f (x1 , y) + a2 f (x2 , y) and f (x, b1 y 1 + b2 y 2 ) = b̄1 f (x, y 1 ) + b̄2 f (x, y 2 ), for x1 , x2 , x ∈ V, y, y 1 , y 2 ∈ W , and a1 , a2 , b1 , b2 ∈ C, where b̄ denotes the conjugate of b.
Example 2.22. Multilinearity is distinct from the notion of linearity on a product of linear spaces. For instance, the mapping h : R2 −→ R defined by h(x, y) = x + y is linear but not bilinear. On the other hand, the mapping g : R2 −→ R given by g(x, y) = xy is bilinear but not linear. Definition 2.29. Let V1 , . . . , Vn , V be real linear spaces. If f, g ∈ M(V1 , . . . , Vn ; V ) are two multilinear functions, their sum is the function f + g defined by (f + g)(x1 , . . . , xn ) = f (x1 , . . . , xn ) + g(x1 , . . . , xn ), and the product af , where a ∈ R, is the function given by (af )(x1 , . . . , xn ) = af (x1 , . . . , xn ) for xi ∈ Vi and 1 ≤ i ≤ n. It is immediate to verify that M(V1 , . . . , Vn ; V ) is an R-linear space relative to these operations. Let f : V1 × V2 −→ V be a real bilinear function. Observe that for x ∈ V1 and y ∈ V2 , we have

f (x, 0V2 ) = f (x, 0y) = 0f (x, y) = 0V and f (0V1 , y) = f (0x, y) = 0f (x, y) = 0V . (2.8)
Example 2.23. Let V be an R-linear space and let ⟨·, ·⟩ : V ∗ × V −→ R be the function given by ⟨h, y⟩ = h(y) for h ∈ V ∗ and y ∈ V. It is immediate that ⟨·, ·⟩ is a bilinear function because ⟨ah + bg, y⟩ = a⟨h, y⟩ + b⟨g, y⟩ and ⟨h, ay + bz⟩ = a⟨h, y⟩ + b⟨h, z⟩, for a, b ∈ R, h, g ∈ V ∗ , and y, z ∈ V. Moreover, we have ⟨h, y⟩ = 0 for every y ∈ V if and only if h = 0V ∗ , and ⟨h, y⟩ = 0 for every h ∈ V ∗ if and only if y = 0V . Example 2.24. Let V1 , . . . , Vn , V be R-linear spaces, ai ∈ Vi for 1 ≤ i ≤ n, and let gi ∈ Vi∗ . Define the function G : V1 × · · · × Vn −→ R as G(a1 , . . . , an ) = g1 (a1 ) · · · gn (an ) for ai ∈ Vi and 1 ≤ i ≤ n.
The function G is multilinear. Indeed, if ai , bi ∈ Vi and a ∈ R, it is immediate to verify that G(a1 , . . . , ai + bi , . . . , an ) = G(a1 , . . . , ai , . . . , an ) + G(a1 , . . . , bi , . . . , an ), and G(a1 , . . . , aai , . . . , an ) = aG(a1 , . . . , ai , . . . , an ). Note, however, that G is not a linear function because G(aa1 , . . . , aan ) = a^n G(a1 , . . . , an ) for a ∈ R. Example 2.25. The function f : R2 −→ R defined by f (x1 , x2 ) = x1 x2 is bilinear because it is linear in each of its variables, separately, but is not linear in the ensemble of its arguments. Indeed, we have f (x1 + y1 , x2 ) = f (x1 , x2 ) + f (y1 , x2 ) and f (x1 , x2 + y2 ) = f (x1 , x2 ) + f (x1 , y2 ), for every x1 , x2 , y1 , y2 ∈ R, which shows the bilinearity of f . However, we have f (x1 + x2 , y1 + y2 ) = x1 y1 + x1 y2 + x2 y1 + x2 y2 ≠ f (x1 , y1 ) + f (x2 , y2 ) in general, which means that f is not a linear function. Let h : U × V −→ W be a bilinear mapping. For a ∈ U and b ∈ V, define ha : V −→ W and hb : U −→ W as ha (v) = h(a, v) for v ∈ V, and hb (u) = h(u, b) for u ∈ U. Then, we have ha ∈ Hom(V, W ) and hb ∈ Hom(U, W ) because of the bilinearity of the mapping h.
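A bilinear form on R2 × R2 is determined by a matrix C via f (x, y) = x⊤ Cy, a representation that reappears below in connection with structural constants. The following MATLAB sketch (not part of the text; the matrix C and the vectors are arbitrary choices) checks linearity in the first argument and shows that f is not linear on the product space, in the spirit of Example 2.25.

% Sketch (not from the text): a bilinear form on R^2 x R^2 given by a matrix C
C = [1 2; 0 -1];
f = @(x, y) x' * C * y;
x = [1; 2]; x2 = [3; -1]; y = [0; 4]; a = 5;
% linearity in the first argument (the second argument is analogous):
disp(f(a*x + x2, y) - (a*f(x, y) + f(x2, y)))   % 0
% f is not linear on the product space: f(x + x2, 2*y) differs from
% f(x, y) + f(x2, y) in general
disp(f(x + x2, 2*y) - (f(x, y) + f(x2, y)))     % nonzero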
Conversely, if f : U −→ Hom(V, W ) is a linear mapping, then the function h : U × V −→ W defined by h(u, v) = f (u)(v) is bilinear and hu = f (u). If W = R, then h is a bilinear form, ha ∈ V ∗ and hb ∈ U ∗ . Thus, we have a mapping Φ : M(U, V ; R) −→ Hom(U, V ∗ ) given by Φ(h)(a) = ha for a ∈ U , and a mapping Ψ : M(U, V ; R) −→ Hom(V, U ∗ ) given by Ψ(h)(b) = hb for b ∈ V. Theorem 2.55. Let U, V be two real linear spaces and let M(U, V ; R) be the linear space of bilinear forms defined on U × V. The linear spaces M(U, V ; R), Hom(U, V ∗ ), and Hom(V, U ∗ ) are isomorphic. Proof. It is immediate that Φ is a linear mapping because for c, d ∈ R and h1 , h2 ∈ M(U, V ; R), we have Φ(ch1 + dh2 )(a)(v) = ((ch1 + dh2 )a )(v) = (ch1 + dh2 )(a, v) = ch1 (a, v) + dh2 (a, v) = cha1 (v) + dha2 (v) = cΦ(h1 )(a)(v) + dΦ(h2 )(a)(v), or Φ(ch1 + dh2 ) = cΦ(h1 ) + dΦ(h2 ). Note that Φ maps h ∈ M(U, V ; R) into the linear mapping that transforms a ∈ U into ha . Thus, if Φ(h1 ) = Φ(h2 ), then ha1 = ha2 for every a ∈ U , so h1 (a, v) = h2 (a, v) for all a ∈ U and v ∈ V, that is, h1 = h2 , which proves the injectivity of Φ. Let f ∈ Hom(U, V ∗ ). For every a ∈ U there exists a linear form g : V −→ R such that f (a) = g, or f (a)(v) = g(v) for every v ∈ V. The mapping h : U × V −→ R defined by h(u, v) = f (u)(v) is bilinear and Φ(h)(u)(v) = hu (v) = h(u, v) = f (u)(v), which means that Φ(h) = f . Thus, Φ is also surjective and, therefore, it is an isomorphism between the linear spaces M(U, V ; R) and Hom(U, V ∗ ). The existence of an isomorphism targeted to Hom(V, U ∗ ) has a similar argument. Corollary 2.15. Let V, W be two R-linear spaces. Then, dim(M(V, W ; R)) = dim(V ) · dim(W ).
Proof. If dim(V ) = m, then since dim(W ∗ ) = dim(W ) = n, we have dim(Hom(V, W ∗ )) = mn. The result follows immediately from Theorem 2.55. Let U, V, W be three R-linear spaces of finite dimensions having the bases {u1 , . . . , um }, {v 1 , . . . , v n }, and {w 1 , . . . , wp }, respectively, and let f : U × V −→ W be a bilinear function. If u = Σ_{i=1}^{m} ai ui ∈ U and v = Σ_{j=1}^{n} bj v j ∈ V , then

f (u, v) = Σ_{i=1}^{m} Σ_{j=1}^{n} ai bj f (ui , v j ).

Since f (ui , v j ) ∈ W , there exist ckij such that f (ui , v j ) = Σ_{k=1}^{p} ckij wk , hence

f (u, v) = Σ_{i=1}^{m} Σ_{j=1}^{n} Σ_{k=1}^{p} ai bj ckij wk .

Thus, the set {ckij ∈ R | 1 ≤ i ≤ m, 1 ≤ j ≤ n, 1 ≤ k ≤ p} (which contains mnp elements) determines a bilinear function relative to the chosen bases in U, V, and W . The numbers ckij are known as the structural constants of f . In the special case of bilinear forms, when W = R, f : U × V −→ R is a bilinear form and can be written using structural constants cij as in

f (u, v) = Σ_{i=1}^{m} Σ_{j=1}^{n} ai bj cij , (2.9)

where u = Σ_{i=1}^{m} ai ui ∈ U , v = Σ_{j=1}^{n} bj v j ∈ V , and cij are the structural constants. The matrix of f is Cf = (cij ) and the form itself can be written as f (u, v) = u⊤ Cf v for u ∈ U and v ∈ V. Example 2.26. Unlike the case n = 1, the set of values of a multilinear function f : V1 × · · · × Vn −→ W is not a subspace of W in general. Indeed, consider a two-dimensional R-linear space U having a basis {u1 , u2 }, a four-dimensional R-linear space W having the basis {w 1 , w2 , w 3 , w4 }, and the bilinear function f : U × U −→ W defined as f (u, v) = u1 v1 w1 + u1 v2 w2 + u2 v1 w 3 + u2 v2 w4 , where u = u1 u1 + u2 u2 and v = v1 u1 + v2 u2 .
Let S be the set of all vectors of the form s = f (u, v). By the definition of S, there exist u, v ∈ U such that s1 = u1 v1 , s2 = u1 v2 , s3 = u2 v1 , s4 = u2 v2 , hence s1 s4 = s2 s3 for any s ∈ S. Define the vectors z, t in W as z = 2w1 + 2w 2 + w3 + w4 and t = w 1 + w3 . Note that we have both z ∈ S and t ∈ S. However, x = z − t = w1 + 2w2 + w4 does not belong to S because x1 x4 = 1 and x2 x3 = 0. This example given in [71] shows that the range of a bilinear function is not necessarily a subspace of the space of values of the function. Example 2.27. Let f : R2 × R2 −→ R be a bilinear form. Since the vectors e1 = (1, 0)⊤ and e2 = (0, 1)⊤ form a basis in R2 , f can be written as f (ae1 + be2 , ce1 + de2 ) = af (e1 , ce1 + de2 ) + bf (e2 , ce1 + de2 ) = acf (e1 , e1 ) + adf (e1 , e2 ) + bcf (e2 , e1 ) + bdf (e2 , e2 ) = αf (e1 , e1 ) + βf (e1 , e2 ) + γf (e2 , e1 ) + δf (e2 , e2 ), where α = ac, β = ad, γ = bc, δ = bd. Thus, the multilinearity of f implies αδ = βγ. Theorem 2.56. Let V, W, U be three finite-dimensional linear spaces having the bases {v i | 1 ≤ i ≤ ℓ}, {w j | 1 ≤ j ≤ m}, and {uk | 1 ≤ k ≤ p}, and the dual bases {v ∗i | 1 ≤ i ≤ ℓ}, {w∗j | 1 ≤ j ≤ m}, {u∗k | 1 ≤ k ≤ p}, respectively. Then, the functions φkij : V × W −→ U defined as φkij (v, w) = v ∗i (v)w∗j (w)uk are bilinear and form a basis of M(V × W, U ).
Proof. The bilinearity of the functions φkij is immediate. If f ∈ M(V × W, U ), v ∈ V, and w ∈ W , we have

f (v, w) = f ( Σ_i v ∗i (v)v i , Σ_j w∗j (w)wj )
= Σ_i Σ_j v ∗i (v)w∗j (w)f (v i , wj )
= Σ_i Σ_j Σ_k v ∗i (v)w∗j (w)u∗k (f (v i , wj ))uk
= Σ_i Σ_j Σ_k u∗k (f (v i , wj ))φkij (v, w).
Thus, the set of mappings φkij spans M(V × W, U ) and the set of numbers u∗k (f (vi , wj )) uniquely determines the bilinear function f , which shows that the set of functions φkij is linearly independent. Corollary 2.16. We have dim(M(V × W, U )) = dim(V ) dim(W ) dim(U ). Proof. The equality Theorem 2.56.
is
an
immediate
consequence
of
Exercises and Supplements (1) Let V be an R-linear space. Prove that the following three statements that concern a non-empty subset P of V are equivalent: (a) P is a subspace of V ; (b) if x, y ∈ P and a ∈ F, then x + y ∈ P and ax ∈ P ; (c) x, y ∈ P and a, b ∈ F imply ax + by ∈ P . (2) Let h be an endomorphism of an R-linear space V. Prove that hm is an endomorphism of V for every m ∈ nn and m 1. (3) Let V and W be two linear spaces such that B = {v 1 , . . . , v n } is a basis of V and let f ∈ Hom(V, W ). Prove that f is an injective linear mapping if and only if the set {f (v 1 ), . . . , f (v n )} is linearly independent.
Linear Algebra Tools for Data Mining (Second Edition)
88
Solution: Suppose that f is injective and that a1 f (v1 ) + · · · + an f (vn ) = 0W . Then, since f is linear, we have f (a1 v 1 + · · · + an vn ) = 0W . The injectivity of f implies a1 v 1 + · · · + an v n = 0V and this, in turn, implies a1 = · · · = an = 0. Thus, {f (v 1 ), . . . , f (v n )} is linearly independent. Conversely, suppose that the set {f (v 1 ), . . . , f (v n )} is linearly independent, and that f (u) = f (v) for some u, v ∈ V. Since B is a basis in V, we can write u = b1 v 1 + · · · + bn v n and v = c1 v 1 + · · · + cn v n , which, in turn, imply b1 f (v1 ) + · · · + bn f (v n ) = c1 f (v 1 ) + · · · + cn f (v n ), or (b1 − c1 )f (v 1 ) + · · · + (bn − cn )f (v n ) = 0W .
(4)
(5) (6)
(7)
The linear independence of {f (v 1 ), . . . , f (v n )} means that bi = ci for 1 i n and this implies u = v. Thus, f is injective. Let V be a linear space such that dim(V ) = n. Prove that no finite set U that contains n − 1 vectors may span V ; prove that no finite set of n + 1 vectors can be linearly independent. Prove that every subset of a linearly independent set of a linear space is independent. Let U and W be two finite, linearly independent subsets of a linear space V such that |U | |W |. Prove that there exists u ∈ U − W such that W ∪ {u} is a linearly independent set. Let U, V be two linear subspaces of Cn . If dim(U )+dim(V ) > n, prove that the subspace U ∩ V contains a vector x = 0n . Solution: We know that 0n ∈ U ∩ V for any two subspaces U and V. Suppose that dim(U ) + dim(V ) > n and that U ∩ V = {0n } and let u1 , . . . , up and v 1 , . . . , v q be two bases of U and V, respectively, where p = dim(U ), q = dim(V ), and p + q > n. The set {u1 , . . . , up , v 1 , . . . , v q } is linearly dependent, so there exist b1 , . . . , bp , c1 , . . . , cq such that not all these numbers are null and b1 u1 + · · · + bp up + c1 v1 + · · · + cq vq = 0n .
Linear Spaces
89
Let u = b1 u1 +· · ·+bp up ∈ U and v = c1 v 1 +· · ·+cq v q ∈ V. The previous equality means that u + v = 0n . Since u = −v, both u and v belong to U ∩ V, so u = v = 0n . This contradicts the linear independence of at least one of the bases, {u1 , . . . , up } and {v 1 , . . . , v q }. The notion of linear independence can be extended to subspaces of finite-dimensional linear spaces. Namely, if V1 , . . . , Vk are subspaces of a finite-dimensional linear space V, then we say that V1 , . . . , Vk are linearly independent if for xi ∈ Vi , 1 i k, the equality k i=1 xi = 0V implies xi = 0V for 1 i k. (8) Prove that the subspaces V1 , . . . , Vk of a linear space V are linearly independent if and only if any set {xi | xi ∈ Vi − {0V } for 1 i k} is linearly independent. (9) Let U be a subspace of a linear space V. Prove that there exists a non-zero linear form : V −→ R such that Ker() = U . (10) Let h be an endomorphism of Rn and let x be a vector in Rn − {0n } such that there exists a least integer p such that hp (x) = 0. Prove that the set {x, h(x), h2 (x), . . . , hp−1 (x)} is linearly independent. Solution: Suppose that a0 x+ a1 h(x)+ · · · + ap−1 hp−1 (x) = 0n . Then, we have a0 h(x) + a1 h2 (x) + · · · + ap−2 hp−1 (x) = 0n a0 h2 (x) + a1 h3 (x) + · · · + ap−3 hp−1 (x) = 0n .. . a0 hp−3 (x) + a1 hp−2 (x) + a2 hp−1 (x) = 0n a0 hp−2 (x) + a1 hp−1 (x) = 0n a0 hp−1 (x) = 0n . The last equality implies a0 = 0. Substituting a0 in the previous equality yields a1 = 0, etc. Eventually, we obtain a0 = a1 = · · · = ap−1 = 0, which proves that S is linearly independent. (11) Let (S, O) be a topological space and let U and W be two subsets of S. If U is open and U ∩W = ∅, then prove that U ∩K(W ) = ∅. (12) Let (S, O) be a topological space and let I be its interior operator. Prove that the poset of open sets (O, ⊆) is a complete lattice, where sup L = L and inf L = I ( L) for every family of open sets L.
90
Linear Algebra Tools for Data Mining (Second Edition)
(13) Let (S, O) be a topological space and let I be its interior operator. Prove that the poset of open sets (O, ⊆) is a complete lattice, where sup L = L and inf L = I ( L) for every family of open sets L. (14) Let (S, O) be a topological space, let K be its interior operator, and let K be its collection of closed sets. ⊆) is a Prove that (K, complete lattice, where sup L = K ( L) and inf L = L for every family of closed sets. (15) Let T be a subspace of the topological space (S, O). Let K S , I S , and ∂S be the closure, interior, and border operators associated to S and K T , I T , and ∂T be the corresponding operators associated to T . Prove that (a) K T (U ) = K S (U ) ∩ T , (b) I S (U ) ⊆ I T (U ), and (c) ∂T U ⊆ ∂S U for every subset U of T . (16) Let V be an R-linear space, S a subspace of V with dim(S) = k, and suppose that the quotient space V /S is of dimension p. Prove that dim(V ) = k + p and that a basis for V can be obtained by supplementing a basis of S with p representatives of the classes of the quotient space (one per class). (17) Let V be an R-linear space, and let S and T be complementary subspaces of V. Prove that the restriction of the function f : V −→ V /S given by f (x) = x + S to T is an isomorphism between T and S/V. (18) Let V1 , V2 , V be R-linear spaces. Prove that if f : V1 × V2 −→ V is both a linear and bilinear function, then f (x, y) = 0V for x ∈ V1 and y ∈ V2 . Solution: The bilinearity of f implies f (x, 0V2 ) = f (0V1 , y) = 0V by Equalities (2.8). On the other hand, since f is linear, we have f (x, y) = f (0V1 , y) + f (x, 0V2 ) = 0V . The next Supplement is a generalization of Corollary 2.15.
Linear Spaces
91
(19) Let V1 , . . . , Vm , V be m linear spaces with dim(Vi ) = ni for 1 i m and dim(V ) = n. Prove that dim(M(V1 , . . . , Vn ; V )) = n m i=1 ni . Solution: Let Γ(n1 , . . . , nm ) be the set of sequences of integers (a1 , . . . , am ) of length m such that 1 ai ni for 1 i m. Suppose that the linear space Vt has the base {et1 , . . . , etnt } and V has the base {u1 , . . . , un }. For each sequence α ∈ Γ(n1 , . . . , nm ), define n m t=1 nt functions:
m ξtα(t) uj , φα,j (v 1 , . . . , v m ) = t=1
n t
where v t = s=1 ξts ets . Each of these functions is in that if v k is replaced by M(V1 , . . . , Vn ; V ). Indeed, observe v k , then the product m ξ is replaced by cv k + d˜ tα(t) t=1 ξ1α(1) · · · ξ(k−1)α(k−1) (cξkα(k) + dξ˜kα(k) ) × ξ(k+1)α(k+1) · · · ξm α(m) = cξ1α(1) · · · ξ(k−1)α(k−1) ξkα(k) · · · ξmα(m) + dξ1α(1) · · · ξ(k−1)α(k−1) ξ˜kα(k) · · · ξmα(m) . Therefore, each of the n m t=1 nt functions φα,j belong to M(V1 , . . . , Vn ; V ). Note that for φ, θ ∈ M(V1 , . . . , Vn ; V ), we have φ = θ if and only if φ(ec ) = θ(ec ) for every c ∈ Γ(n1 , . . . , nm ) because of the multilinearity of φ and θ. Let φ be an arbitrary multilinear function such that φ(eγ ) =
n
cγj uj
j=1
for γ ∈ Γ(n1 , . . . , nm ). Define θ=
n
β∈Γ(n1 ,...,nm ) j=1
cβj φβ,j .
92
Linear Algebra Tools for Data Mining (Second Edition)
We have n
θ(eγ ) =
cβj φβ,j (eγ )
β∈Γ(n1 ,...,nm ) j=1 n
=
cβj δβ,γ uj
β∈Γ(n1 ,...,nm ) j=1
=
n
cβj δβ,γ uj
j=1 β∈Γ(n1 ,...,nm )
=
n
cγ j uj = φ(eγ ).
j=1
Thus, an arbitrary φ is expressed as a linear combination of φβj . This means that the subspaces generated by the functions φαj are M(V 1 , . . . , Vn ; V ).n If θ = β∈Γ(n1 ,...,nm ) j=1 dβj φβ,j = 0, then
θ(eγ ) =
n
dβj δβγ uj
β∈Γ(n1 ,...,nm ) j=1
=
n
dγj uj = 0,
j=1
which implies dγj = 0 for 1 j n. (20) Let f ∈ M(R2 , R2 ; R). Prove that f (u1 + u2 , v 1 + v 2 ) = f (u1 , v 1 ) + f (u1 , v 2 ) + f (u2 , v 1 ) + f (u2 , v 2 ). (21) Let f ∈ M(R2 , R2 ; R) and let (s, u) and (t, v) be two distinct bases in R2 . Suppose that w, x, y, z are vectors in R2 such that w = as + bu, x = ct + dv, y = es + gu, z = ht + kv, for a, b, c, d, e, g, h, k ∈ R. Prove that if adgh = bcek, then f (s, v) and f (u, t) can be determined from f (s, t), f (u, v), f (w, x), and f (y, z).
Linear Spaces
93
(22) Prove the following equalities involving Levi-Civita and Kronecker symbols: (a) 3
ijk mk = δi δjm − δim δjl ,
k=1
where all indices vary in the set {1, 2, 3}; (b) 3 3
ijk jk = 2δi ,
j=1 k=1
where all indices vary in the set {1, 2, 3}. (23) Let V, W be two finite-dimensional linear spaces having the bases {v i | 1 i m} and {w j | 1 i n}, respectively. If {f i | 1 i } is the dual basis in V ∗ of {v i | 1 i }, then the set of morphisms {hij : V −→ W | hij (v) = f i (v)wj for 1 i , 1 j n} is a basis in Hom(V, W ). Solution: It is immediate that hij are homomorphisms in Hom(V, W ) for 1 i m and 1 j n. Let h be an arbitrary homomorphism in Hom(V, W ). If v ∈ V, then v = a1 v 1 + · · · + am v m , hence h(v) = m i=1 ai h(v i ). Since h(v i ) ∈ W , we can write: h(v i ) = nj=1 bij wj , for 1 i m. Therefore, n m m m n ai h(v i ) = ai bij wj = ai bij wj . h(v) = i=1
i=1
j=1
i=1 j=1
i
Since {f | 1 i } is the dual basis of {v i | 1 i }, we have
m a v wj = ai wj . hij (v) = f i (v)w j = f i Thus, h(v) =
m n i=1
=1
i j=1 bij hj (v),
hence
{hij : V −→ W | hij (v) = f i (v)wj for 1 i , 1 j n} spans the linear space Hom(V, W ).
94
Linear Algebra Tools for Data Mining (Second Edition)
m n If c hi is the zero morphism of Hom(V, W ), i=1 m j=1 nij j i cij hj (v i ) = 0W , which amounts to then m ni=1 j=1 i i=1 j=1 cij f (v)w j = 0W . Since {w j | 1 i n} is m i a basis in W , we obtain i=1 cij f (v) = 0V , and therefore, m i i=1 cij vi v i = 0V , which implies cij = 0. Thus, {hj : V −→ W | hij (v) = f i (v)w j for 1 i , 1 j n} is a basis in Hom(V, W ). (24) Let V, W, U be three linear spaces and let f : V × W −→ U be a bilinear mapping. Define the mapping φ : V ×W −→ U as φ(z) = f (π1 (z), π2 (z)), where π1 and π2 are the canonical projections of V × W on V and W , respectively. Prove that φ(z 1 + z 2 ) + φ(z 1 − z 2 ) = 2(φ(z1 ) + φ(z2 )) and φ(az) = a2 φ(z) for every a ∈ R. (25) Let f : V 3 −→ R be a real multilinear form. Prove that if f is symmetric in the first two arguments and is skew-symmetric in the last two arguments, then f (x, y, z) = 0 for x, y, z ∈ V. Solution: We have f (x, y, z) = f (y, x, z) (by the symmetry in the first two arguments) = −f (y, z, x) (by the skew-symmetry in the last two arguments) = −f (z, y, x) (by the symmetry in the first two arguments) = f (z, x, y) (by the skew-symmetry in the last two arguments) = f (x, z, y) (by the symmetry in the first two arguments) = −f (x, y, z) (by the skew-symmetry in the last two arguments),
hence 2f (x, y, z) = 0. (26) Prove that every bilinear form f over R or C is uniquely expressible as a sum f = f1 + f2 , where f1 is symmetric and f2 is skew-symmetric.
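In matrix terms, if a bilinear form on Rn is given by f (x, y) = x⊤ Cy, the decomposition asked for in the preceding supplement corresponds to splitting C into its symmetric and skew-symmetric parts. A small MATLAB illustration (not part of the text; the matrix C is an arbitrary choice):

% Sketch (not from the text): symmetric + skew-symmetric decomposition
C  = [1 4 -2; 0 3 5; 7 -1 2];
C1 = (C + C') / 2;          % symmetric part
C2 = (C - C') / 2;          % skew-symmetric part
disp(norm(C - (C1 + C2)))   % 0: C = C1 + C2
disp(norm(C1 - C1'))        % 0: C1 is symmetric
disp(norm(C2 + C2'))        % 0: C2 is skew-symmetric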
(27) Prove that every symmetric bilinear form f over R or C is determined by its values of the form f (v, v). Solution: For any v and w, we can write 1 (f (v + w, v + w) − f (v, v) − f (w, w)) 2 1 = (f (v, w) + f (w, v)) = f (v, w), 2 which leads to the desired conclusion. i ···i
The generalized Kronecker symbol δj11 ···jpp is defined as
i ···i
δj11 ···jpp =
⎧ ⎪ ⎪ ⎪ ⎪ 1 ⎪ ⎪ ⎪ ⎨
if
i1 · · · ip
j · · · jp
1 i1 · · · ip
is an even permutation
⎪ is an odd permutation −1 if ⎪ ⎪ ⎪ j · · · j ⎪ 1 p ⎪ ⎪ ⎩0 in all other cases.
Thus, if {i1 , . . . , ip } or {j1 , . . . , jp } does not consist of distinct integers or {i1 , . . . , ip } = {j1 , . . . , jp }, we have i ···i
δj11 ···jpp = 0. i1 i2 ···in and i1 i2 ···in = δi12···n . (28) Verify that i1 i2 ···in = δ12···n 1 i2 ···in (29) Prove that n i1 =1
···
n ik =1
···ik δii11···i = k
n! . (n − k)!
(30) Prove that i ···i
δj11 ···jpp = =
φ∈PERMp
φ∈PERMp
i ···i
(−1)inv(φ) δj1φ(1)p···jφ(p) i
···iφ(p)
(−1)inv(φ) δjφ(1) 1 ···jp
.
96
Linear Algebra Tools for Data Mining (Second Edition)
Bibliographical Comments The books of MacLane and Birkhoff [108], Artin [3], and van der Waerden [165] contain a vast amount of material from many areas of algebra presented in a lucid and readable manner. Supplement 2.11 appears in [110].
Chapter 3
Matrices
3.1
Introduction
Matrices are rectangular arrays and their elements belong typically to a field. Numeric matrices (that is, matrices whose elements belong to R or C) serve as representations of linear transformations between linear spaces in the context of certain linear bases in these spaces. Historically, determinants (discussed in Chapter 5) preceded matrices. The term matrix was introduced by Sylvester,1 who regarded matrices as generators of determinants and therefore adopted the Latin word matrix (in Latin, a neutral noun matrix, matricis) (which signifies womb) to designate arrays of numbers. We begin with matrices whose entries belong to arbitrary sets and then focus on matrices whose components belong to fields. Then, we present several classes of matrices, discuss matrix partitioning, and the notion of invertible matrix. The fundamental relationship between matrices and linear transformations between finite-dimensional linear spaces is explored and properties of linear mappings are transferred to matrices.
1
James Joseph Sylvester was born on September 3, 1814 in London and died on March 15, 1897 in the same city. Sylvester made fundamental contributions in algebra, number theory, and Combinatorics. He studied at the University of London and at St. John’s College in Cambridge. Sylvester taught at the University College London, the Royal Military Academy, Johns Hopkins University in Baltimore, and Oxford. 97
98
Linear Algebra Tools for Data Mining (Second Edition)
The notion of matrix rank is introduced starting from the dimension of the range of linear mappings attached to matrices. We discuss linear systems and the application of matrices in solving such systems. We conclude with a presentation of Kronecker, Hadamard, and Khatri–Rao matrix products.
3.2
Matrices with Arbitrary Elements
We define a class of two-argument finite functions that is ubiquitous in mathematics and is central for linear algebra and its applications. Definition 3.1. A matrix on C is a function A : {1, . . . , m} × {1, . . . , n} −→ C. The pair (m, n) is the format of the matrix A. If A(i, j) ∈ R for 1 i m and 1 j n, then we say that A is a matrix on R. Matrices can be conceptualized as two-dimensional arrays as follows: ⎛
⎞ A(1, 1) A(1, 2) . . . A(1, n) ⎜A(2, 1) A(2, 2) . . . A(2, n)⎟ ⎜ ⎟ ⎜ ⎟ .. ⎟ . .. ⎜ .. . ⎠ ⎝ . . ... A(i, 1) A(i, 2) . . . A(i, n) If A : {1, . . . , m} × {1, . . . , n} −→ C, we say that A is an (m × n)matrix on C. The set of all such matrices are denoted by Cm×n . The element A(i, j) of the matrix A ∈ Cm×n is denoted either by Aij or by aij , for 1 i m and 1 j n. Definition 3.2. A row in a matrix is C1×n ; a column in a matrix is Cn×1 . Rows or columns are denoted by small bold-faced letters: r, s, etc.
Matrices
99
A matrix A ∈ Cm×n can be regarded as consisting of m rows, where the i th row is a sequence of the form (ai1 , ai2 , . . . , ain ), for 1 i n, or as a collection of n columns, where the j th column has the form ⎞ ⎛ a1j ⎜ a2j ⎟ ⎟ ⎜ ⎜ .. ⎟ ⎝ . ⎠ amj for 1 j m. The i th row of a matrix is denoted by A(i, :); similarly, the j th column of A is denoted by A(:, j). The main diagonal of a matrix A ∈ Cm×n is the set {aii | 1 i ≤ min{m, n}}. Note that the set {aij | i − j = k} for 1 ≤ k n − 1 consists of elements located on the k th diagonal above the main diagonal of the matrix A. Similarly, the set {aij | j − i = k} for 1 k n − 1 consists of elements located on the k th diagonal below the main diagonal. Definition 3.3. A square matrix on C is an (n × n)-matrix on the set C for some n 1. The number n is referred to as the order of the matrix A. An (n × n)-square matrix A = (aij ) on C is symmetric if aij = aji for every i, j such that 1 i, j n. A is skew-symmetric if aij = −aij for 1 i, j n. Example 3.1. The (3 × 3)-matrix ⎛ ⎞ 1 0.5 1 ⎜ ⎟ 2⎠ A = ⎝0.5 1 1 2 0.3 over the set of reals R is symmetric. The matrix 0 1 −2 B= −1 0 32 −3 0 is skew-symmetric.
100
Linear Algebra Tools for Data Mining (Second Edition)
Example 3.2. Let f : Rm × Rm −→ R be a skew-symmetric form defined by f (x, y) = x Ay, where A ∈ Rm×m . We have m
aij xi yj = −
i,j=1
m
aij yi xj .
i,j=1
Changing the summation indices in the second term of the above equality yields m
(aij + aji )xi yj = 0
i,j=1
for every x, y ∈ Rm , which implies aij + aji = 0, hence aij = −aji . This shows that the matrix A that defines f is skew-symmetric. Theorem 3.1. Let A ∈ Rn×n . We have x Ax = 0 for every x ∈ Rn if and only if A is a skew-symmetric matrix. n n Proof. Since x Ax = i=1 j=1 xi aij xj = i i) =0 (because j i < k implies hjk = 0). Therefore, LH is a lower triangular matrix. The argument for upper triangular matrices is similar.
Matrices
107
A simple and useful observation is contained in the next theorem. Theorem 3.5. Let A and B be two matrices in Cn×n . If A = BR, where R is an upper triangular matrix, then the first q columns of A are linear combinations of the first q columns of B for 1 q n. If A = LB, where L is a lower triangular matrix, then the first q rows of A are linear combinations of the first q rows of B for 1 q n. Proof.
Indeed, the equality A = BR can be written as ⎛ ⎞ r11 r12 · · · r1n ⎜0 r ⎟ 22 · · · r2n ⎟ ⎜ ⎜ ⎟ (a1 a2 · · · an ) = (b1 b2 · · · bn ) ⎜ . .. .. ⎟ , . . ··· . ⎠ ⎝ . 0 0 · · · rrr
so a1 = r11 b1 , a2 = r12 b1 + r22 b2 .. . an = r1n b1 + r2n b2 + · · · + rnn bn . By applying a transposition it is easy to see that if A = LB, where L is a lower triangular matrix, the first q rows of A are linear com binations of the first q rows of B. Let SA,q be the subspace generated by the first q columns of a matrix A ∈ Cm×n . This notation is used in the next statement. 3.4
Invertible Matrices
Let A ∈ Cn×n be a square matrix. Suppose that there exist two matrices U and V such that AU = In and V A = In . This implies V = V In = V (AU ) = (V A)U = In U = U. Thus, if AU = V A = In , the two matrices involved, U and V, must be equal.
Linear Algebra Tools for Data Mining (Second Edition)
108
Definition 3.12. A matrix A ∈ Cn×n is invertible if there exists a matrix B ∈ Cn×n such that AB = BA = In . Suppose that C is another matrix such that AC = CA = In . By the associativity of the matrix product we have C = CIn = C(AB) = (CA)B = In B = B. Therefore, if A is invertible, there is exactly one matrix B such that AB = BA = In . We denote the matrix B by A−1 and we refer to it as the inverse of the matrix, A. Note that A ∈ Cn×n is a unitary matrix if and only if A−1 = AH . Theorem 3.6. If A, B ∈ Cn×n are two invertible matrices, then the product AB is invertible and (AB)−1 = B −1 A−1 . Proof.
Applying the definition of the inverse of a matrix, we obtain
(AB)(B −1 A−1 ) = A(BB −1 )A−1 = AIn A−1 = AA−1 = In , which implies (AB)−1 = B −1 A−1 .
Theorem 3.7. If A ∈ Cn×n is invertible, then AH is invertible and (AH )−1 = (A−1 )H . Proof. Since AA−1 = In , we have (A−1 )H AH = In , which shows that (A−1 )H is the inverse of AH and (AH )−1 = (A−1 )H . Example 3.9. Let
A=
a11 a12 a21 a22
be a matrix in R2×2 . We seek to determine conditions under which A is invertible. Suppose that x11 x12 X= x21 x22 is a matrix in R2×2 such that AX = I2 . This matrix equality amounts to four scalar equalities: a11 x11 + a12 x21 a11 x12 + a12 x22 a21 x11 + a22 x21 a21 x12 + a22 x22
= 1, = 0, = 1, = 0,
Matrices
109
which, under certain conditions, can be solved with respect to x11 , x12 , x21 , x22 . By multiplying the first equality by a22 and the third by −a12 and adding the resulting equalities, we obtain (a11 a22 − a12 a21 )x11 = −a22 . Thus, if a11 a22 − a12 a21 = 0, we have x11 = −
a22 . a11 a22 − a12 a21
The same condition, a11 a22 − a12 a21 = 0, suffices to allow us to obtain the value of the remaining components of X, as the reader can easily verify. Thus, A is an invertible matrix if and only if a11 a22 − a12 a21 = 0. Example 3.10. Let
φ:
1 ··· k ··· n , a1 · · · ak · · · an
be a permutation of the set {1, . . . , n}, where ak = φ(k) for 1 k n. The matrix of this permutation is the square matrix Pφ = (pij ) ∈ {0, 1}n×n , where 1 if j = φ(i), (3.1) pij = 0 otherwise for 1 i, j n. The set of invertible matrices in Rn×n is a group relative to matrix multiplication known as the general linear group GL(n, R). Similarly, the set of invertible matrices in Cn×n forms the group GL(n, C). Note that the matrix of the permutation ιn is In . Also, if φ, ψ are two permutations of nthe set {1, . . . , n}, then Pψφ = Pφ Pψ . Indeed, since (Pφ Pψ )ij = k=1 (Pφ )ik (Pψ )kj , observe that only the term (Pφ )ik (Pψ )kj in which k = φ(i) and j = ψ(k) is different from 0. Thus, (Pφ Pψ )ij = 0 if and only if j = ψ(φ(i)), which means that Pψφ = Pφ Pψ . Thus, if φ and φ−1 are two inverse permutations in PERMn , we have Pφ Pφ−1 = In , so Pφ is invertible and Pφ−1 = Pφ−1 .
110
Linear Algebra Tools for Data Mining (Second Edition)
For instance, if φ ∈ PERM4 is the permutation considered in Example 1.6, 1 2 3 4 φ: , 3 1 4 2 then
⎛ 0 ⎜1 ⎜ Pφ = ⎜ ⎝0 0
Its inverse is
0 0 0 1 ⎛
Pφ−1 = Pφ−1
1 0 0 0
0 ⎜0 ⎜ =⎜ ⎝1 0
⎞ 0 0⎟ ⎟ ⎟. 1⎠ 0
1 0 0 0
0 0 0 1
⎞ 0 1⎟ ⎟ ⎟. 0⎠ 0
It is easy to verify that the inverse of a permutation matrix Pφ coincides with its transpose (Pφ ) . Observe that if A ∈ Rn×n having the rows r 1 , . . . , r n and Pφ is a permutation matrix, then Pφ A is the matrix whose rows are r φ(1) , r φ(2) , . . . , r φ(n) . Similarly, if the columns of A are c1 , . . . , cn , the columns of the matrix APφ are cφ(1) , . . . , cφ(n) . In other words, Pφ A is obtained from A be permuting its rows according to the permutation φ and APφ is obtained from A by permuting the columns according to the same permutation. Since every column and row of a permutation matrix contains exactly one 1, it follows that each such matrix is also a doubly stochastic matrix. Theorem 3.8. Let A ∈ Rn×n be a lower (upper) triangular matrix such that aii = 0 for 1 i n. The matrix A is invertible and its inverse is a lower (upper) triangular matrix having diagonal elements equal to the reciprocal of the diagonal elements of A.
Matrices
Proof.
111
Let A be a lower triangular matrix ⎛ ⎞ 0 ··· 0 a11 0 ⎜a ⎟ ⎜ 21 a22 0 · · · 0 ⎟ ⎟ A=⎜ .. .. . ⎟, ⎜ .. ⎝ . . . · · · .. ⎠ an1 an2 an3 · · · ann
where aii = 0 for 1 i n. The proof is by induction on n 1. The base case, n = 1, is immediate, since the inverse of the matrix 1 (a11 ) is a11 . Suppose that the statement holds for matrices in R(n−1)×(n−1) . Then A can be written as 0n−1 B , A= an1 an2 · · · an n−1 ann where B ∈ R(n−1)×(n−1) is a lower triangular matrix. By the inductive hypothesis, this matrix is invertible, its inverse B −1 is also lower triangular, and the diagonal elements of B −1 are the reciprocal elements of the corresponding diagonal elements of B. The matrix B −1 0n−1 v
1 ann
1 a B −1 , and a = is the inverse of A, where v = − ann (an1 , an2 , . . . , an n−1 ), as the reader can easily verify. A similar argument can be used for upper triangular matrices.
Theorem 3.9. Let A ∈ Rn×n be an invertible matrix, Then, its transpose A is invertible and (A )−1 = (A−1 ) . Proof.
Observe that A (A−1 ) = (A−1 A) = In = In .
Therefore, (A )−1 = (A−1 ) .
Linear Algebra Tools for Data Mining (Second Edition)
112
If A ∈ Rn×n is an invertible matrix, we have AA−1 = In , so trace(A)trace(A−1 ) = trace(In ) = n, by Theorem 3.14. This implies n . (3.2) trace(A−1 ) = trace(A) The trace of a matrix A ∈ Cn×n can be obtained as n
ei Aei . trace(A) = i=1
Theorem 3.10. Let A and B be two matrices in Cn×n . If A = BR, where R is an invertible upper triangular matrix, then SA,q = SB,q for every q, 1 q n. Proof. By Theorem 3.5, A = BR implies SA,q ⊆ SB,q for every q, 1 q n. Since R is invertible, B = AR−1 , so SB,q ⊆ SA,q for every q, 1 q n by the same theorem. This implies the desired equality. It is interesting to compute two matrix products that can be formed starting from the columns u and v given by ⎛ ⎞ ⎛ ⎞ v1 u1 ⎜ v2 ⎟ ⎜ u2 ⎟ ⎜ ⎟ ⎜ ⎟ u = ⎜ . ⎟ and v = ⎜ . ⎟ . ⎝ .. ⎠ ⎝ .. ⎠ un vn Note that u v ∈ F1×1 , that is, u v = u1 v1 + u2 v2 + · · · + un vn . This product is known as the inner product of u and v. Theorem 3.11. If u ∈ Rn×1 and u u = 0, then u = 0. Proof.
Let
⎞ u1 ⎜ u2 ⎟ ⎜ ⎟ u = ⎜ . ⎟. ⎝ .. ⎠ un ⎛
We have u u = u21 + u22 + · · · + u2n , so u u = 0 implies u1 = u2 = · · · = un = 0, that is, u = 0.
Matrices
113
The mth power of a square matrix A ∈ Fn×n (where m ∈ N) can be defined inductively as follows. Definition 3.13. The 0th power of any matrix A ∈ Fn×n is A0 = In . The (m+1)st power of A is the matrix Am+1 = Am A for m 0. Example 3.11. Let
A=
a b . −b a
We have A0 = I2 , A1 = A0 A = I2 A = A, 2 a b a b a − b2 2ab = . A2 = A1 A = −ba −ba −2ab a2 − b2 Example 3.12. Let En ∈ Cn×n be the matrix 0n−1 In−1 . En = 0 0n−1 We claim that the pth power of this matrix is given by 0n−1 · · · 0n−1 In−p p . (En ) = 0 ··· 0 0n−p Note that
⎛
⎞ e2 ⎜ .⎟ ⎜ .. ⎟ ⎟ En = (0 e1 · · · en−1 ) = ⎜ ⎜ ⎟. ⎝en ⎠ 0
The equality that we need to prove is equivalent to ⎛ ⎞ ep+1 ⎜ . ⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ en ⎟ ⎟ (En )p = ⎜ ⎜ 0 ⎟ . ⎜ ⎟ ⎜ . ⎟ ⎜ . ⎟ ⎝ . ⎠ 0
114
Linear Algebra Tools for Data Mining (Second Edition)
The proof is by induction on p 1. The base case, p = 1, is immediate. Suppose that the equality holds for p − 1. Since 1 if i = j, ei ej = 0 otherwise for 1 i, j n, we have
⎛
⎞ ep ⎜ .⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜en ⎟ p p−1 ⎟ (En ) = (En ) En = ⎜ ⎜ 0 ⎟ (0 e1 · · · en−1 ) ⎜ ⎟ ⎜ .⎟ ⎜ .⎟ ⎝ .⎠ 0 (by the inductive hypothesis) ⎛ ⎞ ep+1 ⎜ . ⎟ ⎜ .. ⎟ ⎟ ⎜ ⎜ ⎟ ⎜ en ⎟ ⎟ =⎜ ⎜ 0 ⎟ . ⎟ ⎜ ⎜ . ⎟ ⎜ . ⎟ ⎝ . ⎠ 0
The previous equality implies immediately (En )n = On,n . Theorem 3.12. Let T ∈ Cn×n be an upper (a lower) triangular matrix. Then T k is an upper (a lower) triangular matrix. Proof. By Theorem 3.4, the product of two upper (lower) triangular matrices is an upper (lower) triangular matrix. The current statement follows immediately. Definition 3.14. A matrix A ∈ Cn×n is nilpotent if there is m ∈ N such that Am = On,n . The nilpotency of A is the number nilp(A) = min{m ∈ N | Am = On,n }. If A ∈ Cn×n is a nilpotent matrix, we have nilp(A) = m if and only if Am = On,n but Am−1 = On,n .
Matrices
115
Example 3.13. Let a and b be two positive numbers in R. The matrix A ∈ R3×3 given by ⎛ ⎞ 0 a 0 ⎜ ⎟ A = ⎝0 0 b ⎠ 0 0 0 is nilpotent because ⎛
⎞ ⎛ ⎞ 0 0 ab 0 0 0 ⎜ ⎟ ⎜ ⎟ A2 = ⎝0 0 0 ⎠. and A3 = ⎝0 0 0⎠. 0 0 0 0 0 0
Thus, nilp(A) = 3. Definition 3.15. A matrix A ∈ Fn×n is idempotent if A2 = A. Example 3.14. The matrix
A=
0.5 1 0.25 0.5
is idempotent, as the reader can easily verify. Matrix product is distributive with respect to matrix addition. Theorem 3.13. Let A ∈ Fm×n , B, C ∈ Fn×p , and D ∈ Fp×q . We have A(B + C) = AB + AC, (B + C)D = BD + CD. Proof.
We have (A(B + C))ik =
n
aij (bjk + cjk )
j=1
=
n
j=1
aij bjk +
n
aij cjk
j=1
= (AB)ik + (AC)ik , for 1 i m and 1 k p, which proves the first equality. The proof of the second equality is equally straightforward and is left to the reader.
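A quick numerical check of Theorem 3.13, together with the idempotent matrix of Example 3.14, can be carried out in MATLAB as follows; the matrices A, B, C, D are arbitrary choices with compatible formats, and this sketch is not part of the text.

% Sketch (not from the text): distributivity of the matrix product
A = [1 2 0; -1 3 1];        % 2 x 3
B = [1 0; 2 1; 0 -1];       % 3 x 2
C = [0 1; 1 1; 2 0];        % 3 x 2
D = [1 -1; 0 2];            % 2 x 2
disp(norm(A*(B + C) - (A*B + A*C)))   % 0
disp(norm((B + C)*D - (B*D + C*D)))   % 0
M = [0.5 1; 0.25 0.5];                % Example 3.14
disp(norm(M*M - M))                   % 0: M is idempotent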
116
Linear Algebra Tools for Data Mining (Second Edition)
Definition 3.16. The (m×n)-vectorization mapping is the mapping vec : Cm×n −→ Cmn defined by ⎛ ⎞ a11 ⎜ . ⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ am1 ⎟ ⎜ ⎟ ⎜ ⎟ vec(A) = ⎜ ... ⎟ , ⎜ ⎟ ⎜a ⎟ ⎜ 1n ⎟ ⎜ ⎟ ⎜ .. ⎟ ⎝ . ⎠ amn obtained by reading A column-wise. The following equality is immediate for a matrix A ∈ Cm×n : ⎛ ⎞ Ae1 ⎜ Ae ⎟ ⎜ 2⎟ ⎟ vec(A) = ⎜ (3.3) ⎜ .. ⎟ . ⎝ . ⎠ Aen The vectorization mapping vec is an isomorphism between the linear space Cm×n and the linear space Cmn , as the reader can easily verify. Example 3.15. For the matrix In , we have ⎛ ⎞ e1 ⎜e ⎟ ⎜ 2⎟ ⎟ vec(In ) = ⎜ ⎜ .. ⎟ . ⎝. ⎠ en The MATLAB implementation of vec is discussed in Example 15.32. Definition 3.17. Let A = (aij ) ∈ Cn×n be a square matrix. The trace of A is the number trace(A) given by trace(A) = a11 + a22 + · · · + ann .
Matrices
117
Theorem 3.14. Let A and B be two square matrices in Cn×n . We have (i) trace(aA) = a trace(A), (ii) trace(A + B) = trace(A) + trace(B), and (iii) trace(AB) = trace(BA). Proof. The first two parts are direct consequences of the definition of the trace. For the last part, we can write n n
n
(AB)ii = aij bji . trace(AB) = i=1
i=1 j=1
Exchanging the subscripts i and j and, then the order of the summations, we have n
n
i=1 j=1
aij bji =
n
n
aji bij =
j=1 i=1
n
n
i=1 j=1
n
bij aji = (BA)ii , i=1
which proves the desired equality.
Note that the elements on the diagonal of a skew-symmetric matrix are 0 and, therefore, its trace equals 0. Let A, B, C be three matrices in Cn×n . We have trace(ABC) = trace((AB)C) = trace(C(AB)) = trace(CAB), and trace(ABC) = trace(A(BC)) = trace((BC)A) = trace(BCA). However, it is important to notice that the third part of Theorem 3.14 cannot be extended to arbitrary permutations of a product of matrices. Consider, for example, the matrices 1 1 1 1 1 0 ,B = , and C = . A= 1 1 1 0 0 1 We have
ABC =
1 2 2 1 and ACB = , 2 3 3 1
so trace(ABC) = 4 and trace(ACB) = 3.
118
Linear Algebra Tools for Data Mining (Second Edition)
Definition 3.18. A matrix A ∈ Rm×n is non-negative if aij 0 for 1 i m and 1 j n. This is denoted by A Om,n . A is positive if aij > 0 for 1 i m and 1 j n. This is denoted by A > 0m,n . If B, C ∈ Rm×n , we write B C (B > C) if B − C Om,n (B − C > Om,n , respectively). The sets of non-negative (non-positive, positive, negative) m × nm×n m×n (Rm×n matrices is denoted by Rm×n 0 0 , R>0 , R j + p implies aij = 0. A has upper bandwidth q if j > i + q implies aij = 0. A is a (p, q)-band matrix if it has lower bandwidth p and upper bandwidth q. A tridiagonal matrix is a (1, 1)-band matrix. A lower Hessenberg matrix is an (m − 1, 1)-band matrix, while an upper Hessenberg matrix is a (1, n − 1)-band matrix. Example 3.17. The matrix A ∈ R4×6 defined by ⎛ ⎞ 1 2 0 0 0 0 ⎜1 2 3 0 0 0⎟ ⎜ ⎟ ⎜ ⎟ ⎝2 1 3 5 0 0⎠ 0 2 1 4 3 0 is a (2, 1)-band matrix. The matrix B ∈ R4×6 given by ⎛
1 ⎜1 ⎜ ⎜ ⎝0 0 is a tridiagonal matrix.
2 2 1 0
0 2 3 1
0 0 2 4
0 0 0 2
⎞ 0 0⎟ ⎟ ⎟ 0⎠ 0
Matrices
The matrix
⎛
1 ⎜1 ⎜ ⎜ ⎝0 0
2 2 1 0
is an upper Hessenberg matrix. ⎛ 1 2 ⎜1 2 ⎜ ⎜ ⎝2 1 3 2
3 2 3 1
4 3 2 4
119
5 4 5 2
⎞ 6 4⎟ ⎟ ⎟ 0⎠ 0
On the other hand, the matrix ⎞ 0 0 0 0 3 0 0 0⎟ ⎟ ⎟ 3 5 0 0⎠ 1 4 3 0
is a lower Hessenberg matrix. Note that a matrix L ∈ Fm×n is lower triangular if it is an (m − 1, 0)-band matrix. Similarly, U is upper triangular if it is a (0, n − 1)-band matrix. In other words, L is lower triangular if its upper bandwidth is 0, that is, if j > i implies lij = 0; U is upper triangular if its lower bandwidth is 0, that is, i > j implies uij = 0. Next, we define several classes of matrices whose components are real numbers. Definition 3.20. A Toeplitz matrix is a matrix T ∈ Rn×n such that the elements located in any line parallel to the main diagonal of T (including, of course, the main diagonal) are equal. Example 3.18. Let the matrix ⎛ 1 ⎜6 ⎜ ⎜ T = ⎜7 ⎜ ⎝8 9
2 1 6 7 8
3 2 1 6 7
4 3 2 1 6
⎞ 5 4⎟ ⎟ ⎟ 3⎟ ⎟ 2⎠ 1
be a 5 × 5 Toeplitz matrix. The elements of a Toeplitz matrix T ∈ Rn×n are completely determined by its first row and its first column (which must have their
120
Linear Algebra Tools for Data Mining (Second Edition)
first components equal). Therefore, such a matrix is fully defined by a set of 2n − 1 numbers. Definition 3.21. A circulant form ⎛ c1 ⎜ cn ⎜ ⎜ c C=⎜ ⎜ n−1 ⎜ .. ⎝ . c2
matrix is a Toeplitz matrix C of the c2 c1 cn .. . c3
c3 c2 c1 .. . c4
⎞ · · · cn · · · cn−1 ⎟ ⎟ ⎟ · · · cn−2 ⎟ . ⎟ .. ⎟ ··· . ⎠ ···
c1
Note that if C = (cij ), then cij = c(j−i+1)
mod n .
Definition 3.22. A Hankel matrix is a matrix H ∈ Rn×n such that the elements located in any line parallel to the skew-diagonal of H (including, of course, the skew-diagonal) are equal. Example 3.19. Let T be the matrix ⎛ 1 ⎜2 ⎜ ⎜ T =⎜ ⎜3 ⎜4 ⎝ 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
⎞ 5 6⎟ ⎟ ⎟ 7⎟ ⎟ 8⎟ ⎠ 9
be a 5 × 5 Hankel matrix. Next, we introduce a class of matrices that is essential in the study of the theory of Markov chains. Definition 3.23. A stochastic matrix is a matrix A ∈ Rn×n such n that aij 0 for 1 i, j n and j=1 aij = 1 for every i, 1 i n. A doubly stochastic matrix is a matrix A ∈ Rn×n such that both A and A are stochastic. The rows of a stochastic matrix can be regarded as discrete probability distributions.
Matrices
121
Example 3.20. The matrix A ∈ R3×3 defined by ⎛1 1⎞ 2 0 2 ⎜1 1 1⎟ A = ⎝3 2 6⎠ 0 23 31 is a stochastic matrix. Let C be the field of complex numbers. A complex matrix is a matrix A ∈ Cm×n . Definition 3.24. The conjugate of a matrix A ∈ Cm×n is the matrix A¯ ∈ Cm×n , where A(i, j) = A(i, j) for 1 i m and 1 j n. The notion of symmetry is extended to accommodate complex matrices. Definition 3.25. The transpose conjugate of the matrix A ∈ Cm×n or its Hermitian adjoint is the matrix B ∈ Cn×m given by B = A¯ = (A ). The transpose conjugate of A is denoted by AH . Example 3.21. Let A ∈ C3×2 be the matrix ⎛ ⎞ 1+i 2 ⎜ ⎟ i ⎠. A = ⎝2 − i 0 1 − 2i The matrix AH is given by 1−i 2+i 0 H . A = 2 −i 1 + 2i Using Hermitian conjugates, several important classes of matrices are defined. Definition 3.26. The matrix A ∈ Cn×n is (i) Hermitian if A = AH ; (ii) skew-Hermitian if AH = −A; (iii) normal if AAH = AH A; (iv) unitary if AAH = AH A = In .
122
Linear Algebra Tools for Data Mining (Second Edition)
The set of unitary matrices in Cn×n constitutes the unitary group UG(n, C). It is immediate that all unitary, Hermitian, and skew-Hermitian matrices are normal. However, there are normal matrices outside these three classes. Example 3.22. The matrix A=
1 −1 1 1
is not Hermitian or skew-Hermitian, and 2 H H AA = A A = 0
it is not unitary because 0 . 2
However, A is normal. Example 3.23. Let α, β, γ, δ, and θ be five real numbers such that α − β − γ + δ is a multiple of 2π. The matrix eiα cos θ −eiβ sin θ Mα,β,γ,δ (θ) = eiγ sin θ eiδ cos θ introduced in [57] is unitary because Mα,β,γ,δ (θ)H Mα,β,γ,δ (θ) e−iα cos θ e−iγ sin θ eiα cos θ −eiβ sin θ 1 0 = . = 0 1 −e−iβ sin θ e−iδ cos θ eiγ sin θ eiδ cos θ It is immediate that all unitary matrices are invertible, and their inverse is equal to their Hermitian conjugate. Furthermore, the product of two unitary matrices is a unitary matrix. Indeed, suppose that A, B ∈ Cn×n are unitary matrices, that is, AAH = BB H = In . Then (AB)(AB)H = ABB H AH = AAH = In , hence AB is a unitary matrix. If A ∈ Rn×n is a unitary real matrix, we refer to A as an orthogonal matrix or an orthonormal matrix for reasons that we will discuss in Section 6.11. If A ∈ Rn×n is a matrix with real entries, then its Hermitian adjoint coincides with the transposed matrix A . Thus, a real matrix is Hermitian if and only if it is symmetric.
Matrices
123
Observe that if z ∈ Cn and
⎛ ⎞ z1 ⎜.⎟ ⎟ z=⎜ ⎝ .. ⎠ , zn
then z H z = z 1 z1 + · · · + z n zn =
n
2 i=1 |zi | .
Theorem 3.15. Let A ∈ Cn×n . The following statements hold: (i) the matrices A + AH , AAH , and AH A are Hermitian and A − AH is skew-Hermitian; (ii) if A is a Hermitian matrix, then so is Ak for k ∈ N; (iii) if A is Hermitian and invertible, then so is A−1 ; (iv) if A is Hermitian, then aii are real numbers for 1 i ≤ n. Proof. All statements follow directly from the definition of Hermi tian matrices. Theorem 3.16. If A ∈ Cn×n there exists a unique pair of Hermitian matrices (H1 , H2 ) such that A = H1 + iH2 . Proof.
Let 1 i H1 = (A + AH ) and H2 = − (A − AH ). 2 2
It is immediate that both H1 and H2 are Hermitian and that H1 + iH2 = A. Suppose that A = H3 + iH4 , where H3 and H4 are Hermitian. Then, we have 2H1 = A + AH = H3 + iH4 + H3H − iH4H = 2H3 , so H1 = H3 . Therefore, H2 = H4 , so the matrices H1 and H2 are uniquely determined. Theorem 3.17. If A ∈ Cn×n , there exists a unique pair of matrices (H, S) such that H is Hermitian, S is skew-Hermitian, and A = H + S.
Linear Algebra Tools for Data Mining (Second Edition)
124
Proof. By Theorem 3.16, A can be written as A = H1 +iH2 , where H1 and H2 are Hermitian matrices. Choose H = H1 and S = iH2 . By Exercise 3.17, S is skew-Hermitian. The uniqueness of the pair (H, S) is immediate. Next, we discuss a characterization of Hermitian matrices. Theorem 3.18. A matrix A ∈ Cn×n is Hermitian if and only if xH Ax is a real number for every x ∈ Cn . Proof.
Suppose that A is Hermitian. Then xH Ax = xH AH x = x A x = x A (xH ) = xH Ax,
so xH Ax is a real number because it is equal to its conjugate. Conversely, suppose that xH Ax is a real number for every x ∈ Cn . This implies that (x + y)H A(x + y) = xH Ax + xH Ay + y H Ax + y H Ay is a real number, so xH Ay + y H Ax is real for every x, y ∈ Cn . Let x = ep and y = eq . Then apq + aqp is a real number. If we choose x = −iep and y = ej , it follows that −iapq + iaqp is a real number. Thus, (apq ) = −(aqp ) and (apq ) = (aqp ), which leads to apq = aqp for 1 p, q n. These equalities are equivalent to A = AH , so A is Hermitian. Observe that for any matrix B ∈ Cm×n , the matrix B H B is normal since (B H B)H B H B = B H BB H B and B H B(B H B)H = B H BB H B. Example 3.24. All Hermitian or skew-Hermitian matrices are normal. Indeed, if A is Hermitian, then A = AH and the normality condition is obviously satisfied. If A is skew-Hermitian, then AH = −A and the normality follows from (−A)A = A(−A) = −A2 . If A is a real, symmetric matrix, then A is obviously normal. Theorem 3.19. A matrix A ∈ Cn×n is normal and upper triangular (or lower triangular) if and only if A is a diagonal matrix. Proof. Clearly, any diagonal matrix is both normal and upper (and lower) triangular. Therefore, we need to show only that if A is both triangular and normal, then A is diagonal. We make the argument for the case when A is upper triangular.
Matrices
125
Since AH A = AAH , we have the equality (AH A)pp = (AAH )pp for 1 p n. We prove by induction on p that the non-diagonal elements of A are 0. For the base case, p = 1, the conditions of the theorem imply a ¯11 a11 =
n
a1j a ¯1j = a11 a ¯11 +
j=1
n
a1j a ¯1j .
j=2
n ¯11 = |a11 |2 , it follows that ¯1j = Since a ¯11 a11 = a11 a j=2 a1j a n 2 j=2 |a1j | = 0, so a1j = 0 for 2 j n. Thus, all non diagonal element located on the first line of A (or the first column of AH ) are zero. Suppose now that all non-diagonal elements of the first p − 1 lines of A are 0. For the pth diagonal element of (AH A)pp , we have (AH A)pp =
n
a ¯ip aip
i=1
=
n
a ¯ip aip
i=p
(by the inductive hypothesis) =a ¯pp app (because A is an upper diagonal matrix). This allows us to write n n
apj a ¯pj = app a ¯pp + apj a ¯pj , a ¯pp app = j=p
j=p+1
so n
j=p+1
apj a ¯pj =
n
|apj |2 = 0,
j=p+1
which implies ap p+1 = · · · = apn = 0. Thus, all non-diagonal ele ments on the line p are 0. Theorem 3.20. If U ∈ Cn×n is a unitary matrix, then the matrix Z ∈ Rn×n given by zij = |uij |2 for 1 i, j n is a doubly stochastic matrix.
126
Proof. Thus,
Linear Algebra Tools for Data Mining (Second Edition)
Since U is a unitary matrix, we have U U H = U H U = In . n
zij =
j=1
n
|uij |2 =
j=1
=
n
n
uij uij
j=1
uij (U H )ji = (U U H )ii = 1.
j=1
n
The equality i=1 zij = 1 can be established in a similar manner, so Z is indeed a doubly stochastic matrix. Let A ∈ Cm×n be a matrix. The matrix of the absolute values of A is the matrix abs(A) ∈ Rm×n defined by (abs(A))ij = |aij | for 1 i m and 1 j n. In particular, if x ∈ Cn , we have (abs(x))j = |xj |. Theorem 3.21. Let A ∈ Cm×n and B ∈ Cn×p be two matrices. We have abs(AB) abs(A)abs(B). Proof. Since (AB)ik = nj=1 aij bjk , it follows that n n n
aij bjk ≤ |aij bjk | = |aij | |bjk |, |(AB)ik | = j=1
j=1
j=1
for 1 i m and 1 k p. This amounts to abs(AB) abs(A)abs(B). Theorem 3.22. For A ∈ Cn×n we have abs(Ak ) (abs(A))k for every k ∈ N. Proof. The proof is by induction on k. The base case, k = 0, is immediate. Suppose that the inequality holds for k. We have abs(Ak+1 ) = abs(Ak A) ≤ abs(Ak )abs(A) (by Theorem 3.21)
Matrices
127
≤ (abs(A))k abs(A) (by the inductive hypothesis) = (abs(A))k+1 , which completes the induction case.The factor of S’*A*S tends to be sparser than the factor of A. 3.6
Partitioned Matrices and Matrix Operations
Let A ∈ Fm×n be a matrix and suppose that m = m1 + · · · + mp and n = n1 + · · · + nq , where F is the real or the complex field. A partitioning of A is a collection of matrices Ahk ∈ Fmh ×nk such that Ahk is the contiguous submatrix
m1 + · · · + mh−1 + 1, . . . , m1 + · · · + mh−1 + mh , A n1 + · · · + nk−1 + 1, . . . , n1 + · · · + nk for 1 h p and 1 k q. If {Ahk | 1 h p and 1 k q} written as ⎛ A11 A12 · · · ⎜A ⎜ 21 A22 · · · A=⎜ .. ⎜ .. ⎝ . . ··· Ap1 Ap2 · · ·
is a partitioning of A, A is ⎞ A1q A2q ⎟ ⎟ ⎟ .. ⎟ . . ⎠ Apq
The matrices Ahk are referred to as the blocks of the partitioning. All blocks located in a column must have the same number of columns; all blocks located in a row must have the same number of rows. Example 3.25. The matrix A ∈ F5×6 given by ⎞ ⎛ a11 a12 a13 a14 a15 a16 ⎟ ⎜a ⎜ 21 a22 a23 a24 a25 a26 ⎟ ⎟ ⎜ ⎟ A=⎜ ⎜a31 a32 a33 a34 a35 a36 ⎟ ⎟ ⎜a ⎝ 41 a42 a43 a44 a45 a46 ⎠ a51 a52 a53 a54 a55 a56
128
Linear Algebra Tools for Data Mining (Second Edition)
can be partitioned as ⎛
a11 ⎜a ⎜ 21 ⎜ ⎜a31 ⎜ ⎜a ⎝ 41 a51
Thus, if we introduce ⎛ a11 a12 ⎜ A11 = ⎝a21 a22 a31 a32 a41 a42 A21 = a51 a52
a12 a22 a32 a42 a52
a13 a23 a33 a43 a53
a14 a24 a34 a44 a54
a15 a25 a35 a45 a55
⎞ a16 a26 ⎟ ⎟ ⎟ a36 ⎟. ⎟ a46 ⎟ ⎠ a56
the matrices ⎛ ⎞ ⎛ ⎞ a13 a14 a15 ⎜ ⎟ ⎜ ⎟ a23 ⎠ , A12 = ⎝a24 ⎠ , A13 = ⎝a25 a33 a34 a35 a45 a45 a43 , A22 = , A23 = a53 a55 a55
⎞ a16 ⎟ a26 ⎠ , a36 a46 , a56
the matrix A can be written as A11 A12 A13 . A= A21 A22 A23 Partitioning matrices is useful because matrix operations can be performed on block submatrices in a manner similar to scalar operations as we show next. Theorem 3.23. Let A ∈ Fm×n and B ∈ Fn×p be two matrices. Suppose that the matrices A, B are partitioned as ⎞ ⎛ ⎞ ⎛ B11 · · · B1 A11 · · · A1k ⎜ . ⎜ . . ⎟ . ⎟ ⎟ ⎜ ⎟ A=⎜ ⎝ .. · · · .. ⎠ and B = ⎝ .. · · · .. ⎠ , Ah1 · · · Ahk Bk1 · · · Bk where Ars ∈ Fmr ×ns , Bst ∈ Fns ×pt for 1 r h, 1 s k and 1 t . Then, the product C = AB can be partitioned as ⎞ ⎛ C11 . . . C1 ⎜ . .. ⎟ ⎟ . C=⎜ ⎝ . ··· . ⎠, Ch1 · · · Chl where Cuv = kt=1 Aut Btv , 1 u h, and 1 v .
Matrices
129
Proof. Note that m1 + · · · + mh = m and p1 + · · · + p = p. For a pair (i, j) such that 1 i m and 1 j n, let u be the least number such that i m1 + · · · + mu and let v be the least number such that j p1 + · · · + pv . The definition of u and v implies m1 + · · · + mu−1 + 1 i ≤ m1 + · · · + mu and p1 + · · · + pv−1 + 1 j p1 + · · · + pv . This implies that the cij element of the product is located in the submatrix Cuv = kt=1 Aut Btv of C. By the definition of the matrix product, we have cij =
n
aig bgj
g=1
=
n1
g=1
aig bgj +
n
1 +n2 g=n1 +1
aig bgj + · · · +
n1 +···+n
s
aig bgj .
g=n1 +···+nk−1 +1
Observe that the vectors (ai1 , . . . , ain1 ) and (b1j , . . . , bn1 j ) represent the line number i − (m1 + · · · + mu−1 + 1) and the column number j − (p1 + · · · + pv−1 + 1) of the matrix Au1 and B1v , etc. Similarly, (ai,n1 +···+nk−1 +1 , . . . , ai,n1 +···+ns ) and (bn1 +···+nk−1 +1,j , . . . , bn1 +···+ns ,j ) represent the line number i − (m1 + · · · + mu−1 + 1) and the column number j − (p1 + · · · + pv−1 + 1) of the matrix Auk and Bkv , which shows that cij is computed correctly as an element of the block Cuv . 3.7
Change of Bases
˜ = {˜ ˜n } be two bases in Cn . e1 , . . . , e Let B = {e1 , . . . , en } and B These bases define the matrices ˜n ) e1 · · · e MB = (e1 · · · en ) and MB˜ = (˜ ˜ in Cn×n whose columns are the vectors of the bases B and B, respectively.
Linear Algebra Tools for Data Mining (Second Edition)
130
As vectors of each base can be expressed in terms of the other base, we can write ei =
n
˜j , for 1 i n, cij e
j=1
˜h = e
n
dhk ek , for 1 h n.
k=1
In matrix formulation the previous equalities can be written as MB = CMB˜ and MB˜ = DMB . Theorem 3.24. The matrices C = (cij ) and D = (dhk ) defined above are inverse to each other. Proof.
We have n n n n
n
˜j = cij e cij djk ek = cij djk ek ei = j=1
j=1
j=1 k=1
k=1
for 1 i n. Equating the coefficients of vectors of B yields the following equalities: n
cij dji = 1, and
j=1
n
cij djk = 0 if k = i.
j=1
If P = CD, these equalities amount to pii = 0 and pik = n j=1 cij djk = 0 if i = k, which means that P = In . Thus, the matrices C and D are inverse to each other. n ˜ be the expression of the vector x in the basis Let x = i=1 x ˜e n i i ˜ ˜ B. Since ei = k=1 dik ek , it follows that x=
n
˜i = x ˜i e
i=1
n
j=1
xj ej =
n
xj
j=1
n
i=1
˜i = cji e
n n
˜i , xj cji e
i=1 j=1
which implies x ˜i =
n
cji xj
(3.4)
j=1
˜ x for 1 i n. The components of x in the new basis B, ˜i can be expressed via the components of x in the old basis B using the
Matrices
131
matrix C (that was used to express the old basis B in terms of the ˜ as MB = CM ˜ ). This justifies the term contravariant new basis B B components applied to the components xk because the bases and the components transform in opposite directions. Similarly, we can write x=
n
j=1
xj ej =
n
˜i = x˜i e
i=1
n
x˜i
i=1
n
dij ej =
j=1
n
n
x˜i dij ej ,
j=1 i=1
hence xj =
n
x˜i dij ej
(3.5)
i=1
for 1 j n. The effects of a change of basis on covectors have been discussed in Section 2.8.
3.8
Matrices and Bilinear Forms
The link between matrices and bilinear form is discussed next. Let C ∈ Rm×n be a matrix. If x ∈ Rm and y ∈ Rn , the function fC : L × M −→ R defined by fC (x, y) = x Cy can be easily seen to be bilinear. The next theorem shows that all bilinear functions between two finite-dimensional spaces can be defined in this manner. Theorem 3.25. Let V, W be two finite-dimensional R-linear spaces. If f : V × W −→ R is a bilinear form, then there is a matrix Cf ∈ Rm×n such that f (x, y) = x Cf y for all x ∈ V and y ∈ W . Proof. Suppose that B = {x1 , . . . , xm } and B = {y 1 , . . . , y n } are bases in V and W , respectively. Let x = a1 x1 + · · · + am xm be the expression of x ∈ V in the base B. Similarly, let y = b1 y 1 +· · ·+bn y n
132
Linear Algebra Tools for Data Mining (Second Edition)
be the expression of y ∈ W in B . The bilinearity of f implies ⎞ ⎛ m n
ai x i , bj y j ⎠ f (x, y) = f ⎝ i=1
=
n m
j=1
ai bj f (xi , y j ).
i=1 j=1
If Cf is the matrix
⎛
⎞ f (x1 , y 1 ) · · · f (x1 , y n ) ⎜ ⎟ .. .. .. ⎟, Cf = ⎜ . . . ⎝ ⎠ f (xm , y 1 ) · · · f (xm , y n )
then f (x, y) = x Cf y, where ⎛ ⎞ ⎛ ⎞ b1 a1 ⎜.⎟ ⎜ . ⎟ ⎟ ⎜ ⎟ x=⎜ ⎝ .. ⎠ and y = ⎝ .. ⎠ . am bn
Rn×n
Let A ∈ be a symmetric matrix. The quadratic form associated to the matrix A is the function fA : Rn −→ R defined as fA (x) = x Ax for x ∈ Rn . The polar form of the quadratic form fA is the bilinear form f˜A defined by f˜A (x, y) = x Ay for x, y ∈ Rn . Since x Ay and y Ax are scalars, they are equal and we have fA (x + y) = (x + y) A(x + y) = x Ax + y Ay + x Ay + y Ax = fA (x) + fA (y) + 2f˜A (x, y), which allows us to express the polar form of fA as 1 f˜A (x, y) = (fA (x + y) − fA (x) − fA (y)). 2 If A ∈ Cn×n is a Hermitian matrix, the quadratic Hermitian form associated to A is the function fA : Cn −→ C defined as
Matrices
133
fA (x) = xH Ax for x ∈ Cn . Note that fA (x) = xsH AH x = xsH Ax = fA (x) for x ∈ Cn . This allows us to conclude that values of the quadratic Hermitian form associated to a Hermitian matrix are real numbers. Let us consider the values of a quadratic Hermitian form fA (x) when x = 1. Since fA (x) is a continuous function and the set {x | x = 1} is compact, the function fA attains its supremum. Thus, there exists z 1 ∈ Cn with z 1 = 1 such that fA (z 1 ) attains its maximum M1 . Consider again the maximization of fA (x) subjected to the restrictions x = 1 and (x, z 1 ) = 0. By the same reasoning as above, there exists a unit vector z 2 such that z 2 = 1 and z 2 ⊥ z 2 where the maximum M2 is attained, etc. We obtain an orthonormal sequence of vectors z 1 , z 2 , . . . , z n known as the sequence of principal directions of A. The matrix ⎛ ⎞ z1 ⎜ . ⎟ ⎜ . ⎟ ⎝ . ⎠ z n is obviously unitary. 3.9
Generalized Inverses of Matrices
The next theorem involves a generalization of the notion of inverse of a matrix that is applicable to rectangular matrices. Theorem 3.26. If A ∈ Cm×n , there exists at most one matrix M ∈ Cn×m such that the matrices M A and AM are Hermitian, AM A = A, and M AM = M . Proof. Suppose that both M and P satisfy the equalities of the theorem. We have P = P AP = P (AP ) = P (AP )H = P P H AH = P P H (AM A)H
134
Linear Algebra Tools for Data Mining (Second Edition)
= P P H AH M H AH = P P H AH (AM )H = P (AP )H (AM )H = P AP AM = P (AP A)M = P AM = P AM AM = (P A)H (M A)H M = AH P H AH M H M = (AP A)H M H M = AH M H M = (M A)H M = M AM = M. Thus, P = M and the uniqueness of M is proven.
The matrix M introduced by Theorem 3.26 is known as the Moore-Penrose pseudoinverse of A and is denoted by A† . Theorem 3.27. Let A ∈ Cn×n . If A is invertible, then A† exists and equals A−1 . Proof. If A is invertible, it is clear that the matrices AA−1 and A−1 A are Hermitian because both are equal to In . Furthermore, AA−1 A = A and A−1 AA−1 = A−1 , so A−1 = A† by Theorem 3.26. The Moore–Penrose pseudoinverse of a matrix may exist even if † = On,m , as it can the inverse does not. For example, we have Om,n be easily seen. However, On,n is not invertible. The symmetry of the equalities defining the Moore Penrose pseudoinverse of a matrix show that (A† )† = A. 3.10
Matrices and Linear Transformations
Let h ∈ Hom(Cm , Cn ) be a linear transformation between the linear spaces Cm and Cn . Consider a basis in Cm , R = {r 1 , . . . , r m }, and a basis in Cn , S = {s1 , . . . , sn }. The function h is completely determined by the images of the elements of the basis R, that is, by the set {h(r 1 ), . . . , h(r m )}. If h(r j ) = a1j s1 + a2j s2 + · · · + anj sn =
n
i=1
aij si ,
Matrices
135
then, for x = x1 r 1 + · · · + xm rm , we have by linearity h(x) = x1 h(r 1 ) + · · · + xm h(r m ) =
m
xj h(r j )
j=1
=
n m
xj aij si .
j=1 i=1
In other words, we have aij = (h(r j ))i for 1 i n and 1 j m. In a more compact form, we can write ⎛ ⎞⎛ ⎞ a11 · · · a1m x1 ⎜ . ⎟ ⎜ . ⎟⎜ . ⎟ ⎟ h(x) = (s1 · · · sn ) ⎜ ⎝ .. · · · .. ⎠ ⎝ .. ⎠ . an1 · · · anm xm The image of a vector x = x1 r 1 + · · · + xm r m under h equals Ah x, where Ah is ⎛ ⎞ a11 · · · a1m ⎜ . ⎟ .. · · · ... ⎟ ∈ Cn×m . Ah = ⎜ ⎝ ⎠ an1 · · · anm Clearly, the matrix Ah attached to h : Cm −→ Cn depends on the bases chosen for the linear spaces Cm and Cn and (Ah )ij equals (h(r j ))i , the i th component of image h(r j ) of the basis vector r j . Let now h : Cn −→ Cn be an endomorphism of Cn and let R = {r 1 , . . . , r n } and S = {s1 , . . . , sn } be two bases of Cn . The vectors si can be expressed as linear combinations of the vectors r 1 , . . . , r n as follows: si = pi1 r 1 + · · · + pin r n
(3.6)
for 1 i n, which implies h(si ) = pi1 h(r 1 ) + · · · + pin h(r n )
(3.7)
136
Linear Algebra Tools for Data Mining (Second Edition)
for 1 i n. Therefore, the matrix associated to a linear form h : Cm −→ C is a column vector r. In this case we can write h(x) = r H x for x ∈ Rn . Theorem 3.28. Let h ∈ Hom(Cm , Cn ). The matrix Ah∗ ∈ Cm×n is the transpose of the matrix Ah , that is, we have Ah∗ = Ah . Proof. By the previous discussion, if 1 , . . . , n is a basis of the space (Cn )∗ , then the j th column of the matrix Ah∗ ∈ Cm×n is obtained by expressing the linear form h∗ (j ) = j h in terms of a basis in the dual space (Cm )∗ . Therefore, we need to evaluate the linear form j h ∈ (Cm )∗ . Let {p1 , . . . , pm } be a basis in Cm and let {g1 , . . . , gm } be its dual in (Cm )∗ . Also, let {q 1 , . . . , q n } be a basis in Cn , and let {1 , . . . , n } be its dual (Cn )∗ . Observe that if v ∈ Cm can be expressed as v = m j=1 vj pj , then ⎛ gp (v) = gp ⎝
m
⎞ vj pj ⎠ =
j=1
m
vj gp (pj ) = vp ,
j=1
because {g1 , . . . , gm } is the dual of {p1 , . . . , pm } in (Cm )∗ . On the other hand, we can write ⎞ ⎛ ⎛ ⎞ m m n
vp h(pp )⎠ = j ⎝ vp aip q i ⎠ j (h(v)) = j ⎝ p=1
p=1
i=1
⎛ ⎞ n n m
m
vp aip qi ⎠ = vp aip j (q i ) = j ⎝ p=1 i=1
=
m
p=1
vp ajp =
p=1 i=1 m
ajp gp (v).
p=1
Thus, h∗ (j ) = m p=1 ajp gp for every j, 1 ≤ j m. This means that the j th column of the matrix Ah∗ is the transposed j th row of the matrix Ah , so Ah∗ = (Ah ) . Matrix multiplication corresponds to the composition of linear mappings, as we show next.
Matrices
137
Theorem 3.29. Let h ∈ Hom(Cm , Cn ) and g ∈ Hom(Cn , Cp ). Then, Agh = Ag Ah . Proof. If p1 , . . . , pm is a basis for Cm , then Agh (pi ) = gh(pi ) = g(h(pi )) = g(Ah pi ) = Ag (Ah (pi )) for every i, where 1 i n. This proves that Agh = Ag Ah . Thus, if h is an idempotent endomorphism of a linear space, the matrix Ah is idempotent. The inverse direction, from matrices to linear operators, is introduced next. Definition 3.27. Let A ∈ Cn×m be a matrix. The linear operator associated to A is the mapping hA : Cm −→ Cn given by hA (x) = Ax for x ∈ Cm . If {e1 , . . . , em } is a basis for Cm , then hA (ei ) is the i th column of the matrix A. It is immediate that AhA = A and hAh = h. Attributes of a matrix A are usually transferred to the linear operator hA . For example, if A is Hermitian, we say that hA is Hermitian. Definition 3.28. Let A ∈ Cn×m be a matrix. The range of A is the subspace Im(hA ) of Cn . The null space of A is the subspace Ker(hA ). The range of A and the null space of A are denoted by range(A) and null(A), respectively. Clearly, CA,n = range(A). The null space of A ∈ Cm×n consists of those x ∈ Cn such that Ax = 0. Let {p1 , . . . , pm } be a basis of Cm . Since range(A) = Im(hA ), it follows that this subspace is generated by the set {hA (p1 ), . . . , hA (pm )}, that is, by the columns of the matrix A. For this reason, the subspace range(A) is also known as the column subspace of A. Theorem 3.30. Let A, B ∈ Cm×n be two matrices. Then range(A + B) ⊆ range(A) + range(B). Proof. Let u ∈ range(A + B). There exists v ∈ Cn such that u = (A + B)v = Av + Bv. If x = Av and y = Bv, we have x ∈ range(A) and y ∈ range(B), so u = x + y ∈ range(A) + range(B).
138
Linear Algebra Tools for Data Mining (Second Edition)
Several facts concerning idempotent endomorphisms that were presented in Chapter 2 can now be formulated in terms of matrices. For example, Theorem 2.32 applied to Cn states that if A is an idempotent matrix, then Cn = null(A) range(A). Conversely, by Theorem 2.33, if U and W are two subspaces of Cn such that Cn = U W , then there exists an idempotent matrix A ∈ Cn×n such that U = null(A) and W = range(A). Multilinear functions can be represented by generalizations of matrices. Let V, W , and U be finite-dimensional linear spaces having the bases {v 1 , . . . , v n }, {w 1 , . . . , wm }, and {u1 , . . . , up }, respectively, and let f : V × W −→ U be a multilinear function. For v = a1 v 1 + · · · + an v n and w = b1 w1 + · · · + bm wm , by the multilinearity of f we can write f (v, w) = f a1 v 1 + · · · + an v n , b1 w1 + · · · + bm wm =
n
m
ai bj f (vi , w j ).
i=1 j=1
Since f (vi , w j ) ∈ U , we can further write f (v i , wj ) =
p
ckij uk ,
k=1
for 1 i n and 1 j m. Thus, f (v, w) =
p n
m
ai bj ckij uk .
i=1 j=1 k=1
Thus, f is completely specified by nmp numbers ckij . Conversely, every set of nmp numbers specifies a multilinear function. 3.11
The Notion of Rank
Definition 3.29. The rank of a matrix A is the same number denoted by rank(A) given by rank(A) = dim(range(A)) = dim(Im(hA )).
Matrices
139
Thus, the rank of A is the maximal size of a set of linearly independent columns of A. Theorem 2.10 applied to the linear mapping hA : Cm −→ Cn means that for A ∈ Cn×m , we have dim(null(A)) + rank(A) = m.
(3.8)
Observe that if A ∈ Cm×m is non-singular, then Ax = 0m implies x = 0m . Thus, if x ∈ null(A) ∩ range(A), it follows that Ax = 0, so the subspaces null(A) and range(A) are complementary. Example 3.26. For the matrix ⎛
⎞ 1 0 2 ⎜1 −1 1⎟ ⎜ ⎟ A=⎜ ⎟, ⎝2 1 5⎠ 1 2 4 we have rank(A) = 2. Indeed, if c1 , c2 , c3 are its columns, then it is easy to see that {c1 , c2 } is a linearly independent set, and c3 = 2c1 + c2 . Thus, the maximal size of a set of linearly independent columns of A is 2. Example 3.27. Let A ∈ Cn×m and B ∈ Cp×q . For the matrix C ∈ C(n+p)×(m+q) defined by C=
A On,q , Op,m B
we have rank(C) = rank(A) + rank(B). Suppose that rank(C) = and let c1 , . . . , c be a maximal set of linearly independent columns of C. Without loss of generality we may assume that the first k columns are among the first m columns of A and the remaining − k columns are among the last q columns of C. The first k columns of C correspond to k linearly independent columns of A, while the last −k columns correspond to −k linearly independent columns of B. Thus, rank(C) = k rank(A) + rank(B). Conversely, suppose that rank(A) = s and rank(B) = t. Let ai1 , . . . , ais be a maximal set of linearly independent columns of A
Linear Algebra Tools for Data Mining (Second Edition)
140
and let bj1 , . . . , bjt be a maximal set of linearly independent columns of B. Then, it is easy to see that the vectors ais 0n 0n a i1 ,··· , ,..., ,..., 0n 0n bj1 b jt constitute a linearly independent set of columns of C, so rank(A) + rank(B) rank(C). Thus, rank(C) = rank(A) + rank(B). Example 3.28. Let x and y be two vectors in Cn − {0}. The matrix xy H has rank 1. Indeed, if y H = (y1 , y2 , . . . , yn ), then we can write xy H = (y1 x y2 x · · · yn x), which implies that the maximum number of linearly independent columns of xy H is 1. Example 3.29. Let A, B ∈ Cn×m . We have rank(A + B) rank(A) + rank(B). Let A = (a1 a2 · · · am ) and B = (b1 b2 · · · bm ) be two matrices, where a1 , . . . , am , b1 , . . . , bm ∈ Cn . Clearly, we have A + B = (a1 + b1 a2 + b2 · · · am + bm ). If x ∈ Im(A + B), we can write x = x1 (a1 + b1 ) + x2 (a2 + b2 ) + · · · + xm (am + bm ) = y + z, where y = x1 a1 + · · · + xm am ∈ Im(A), z = x1 b1 + · · · + xm bm ∈ Im(B). Thus, Im(A + B) ⊆ Im(A) + Im(B). Since the dimension of the sum of two subspaces of a linear space is less or equal to the dimension of the sum of these subspaces, the result follows. For a matrix A ∈ Cn×m the range of the matrix A ∈ Cm×n is the subspace of Cn generated by the rows of the original matrix A and coincides with the subspace Im(h∗A ) of the dual mapping h∗A .
Matrices
141
By Theorem 2.48, we have rank(h∗A ) = rank(hA ), so the rank of the transposed matrix A is equal to rank(A). Thus, dim(null(A )) + rank(A) = n,
(3.9)
and the maximal size of a set of linearly independent rows of A coincides with the rank of A. The above discussion also shows that if A ∈ Cn×m , then rank(A) min{m, n}. Theorem 3.31. Let A ∈ Cm×n be a matrix. We have rank(A) = rank(A). Proof. Suppose that A = (a1 , . . . , an ) and that the set {ai1 , . . . , aip } is a set of linearly independent columns of A. Then, the set {ai1 , . . . , aip } is a set of linearly independent columns of A. This implies rank(A) = rank(A). Corollary 3.1. We have rank(A) = rank(AH ) for every matrix A ∈ Cm×n . Proof.
Since AH = A , the statement follows immediately.
Definition 3.30. A matrix A ∈ Cn×m is a full-rank matrix if rank(A) = min{m, n}. If A ∈ Cm×n is a full-rank matrix and m n, then the n columns of the matrix are linearly independent; similarly, if n m, the m rows of the matrix are linearly independent. A matrix that is not a full-rank is said to be degenerate. A degenerate square matrix is said to be singular. A non-singular matrix A ∈ Cn×n is a matrix that is not singular and, therefore, has rank(A) = n. Theorem 3.32. A matrix A ∈ Cn×n is non-singular if and only if it is invertible. Proof. Suppose that A is non-singular, that is, rank(A) = n. In other words, the set of columns {c1 , . . . , cn } of A is linearly independent, and therefore, is a basis of Cn . Then, each of the vectors ei can
142
Linear Algebra Tools for Data Mining (Second Edition)
be expressed as a unique combination of the columns of A, that is ei = b1i c1 + b2i c2 + · · · + bni cn , for 1 i n. These equalities can ⎛ b11 ⎜b ⎜ 21 (c1 · · · cn ) ⎜ ⎜ .. ⎝ . bn1
be written as ⎞ · · · b1n · · · b2n ⎟ ⎟ ⎟ .. ⎟ = In . ··· . ⎠ · · · bnn
Consequently, the matrix A is invertible and ⎛ ⎞ b11 · · · b1n ⎜b ⎟ ⎜ 21 · · · b2n ⎟ −1 ⎜ ⎟ A =⎜ . .. ⎟ . . ⎝ . ··· . ⎠ bn1 · · · bnn Suppose now that A is invertible and that d1 c1 + · · · + dn cn = 0. This is equivalent to
⎞ d1 ⎜.⎟ ⎟ A⎜ ⎝ .. ⎠ = 0. dn ⎛
Multiplying both sides by A−1 implies ⎛ ⎞ d1 ⎜.⎟ ⎜ . ⎟ = 0, ⎝.⎠ dn so d1 = · · · = dn = 0, which means that the set of columns of A is linearly independent, so rank(A) = n. Corollary 3.2. A matrix A ∈ Cn×n is non-singular if and only if Ax = 0 implies x = 0 for x ∈ Cn .
Matrices
143
Proof. If A is non-singular then, by Theorem 3.32, A is invertible. Therefore, Ax = 0 implies A−1 (Ax) = A−1 0, so x = 0. Conversely, suppose that Ax = 0 implies x = 0. If A = (c1 · · · cn ) and x = (x1 , . . . , xn ) , the previous implication means that x1 c1 + · · · + xn cn = 0 implies x1 = · · · = xn = 0, so {c1 , . . . , cn } is linearly independent. Therefore, rank(A) = n, so A is non-singular. Let A ∈ Cn×m be a matrix. It is easy to see that the square matrix B = AH A ∈ Cm×m is symmetric. Theorem 3.33. Let A ∈ Cn×m be a matrix and let B = AH A. The matrices A and B have the same rank. Proof. We prove that null(A) = null(B). If Au = 0, then Bu = AH (A0) = 0, so null(A) ⊆ null(B). If v ∈ null(B), then AH Av = 0, which implies that v H AH Av = 0. This, in turn can be written as (Av)H (Av) = 0, so, by Theorem 3.11, we have Av = 0, which means that v ∈ null(A). We conclude that null(A) = null(AH A). The equalities dim(null(A)) + rank(A) = m, dim(null(A)) + rank(AH A) = m, imply that rank(AH A) = m.
Corollary 3.3. Let A ∈ Cn×m be a matrix of full-rank. If m n, then the matrix AH A is non-singular; if n m, then AAH is nonsingular. Proof. Suppose that m n. Then, rank(AH A) = rank(A) = m because A is a full-rank matrix. Thus, AH A ∈ Cm×m is non-singular. The argument for the second part of the corollary is similar. Example 3.30. Let A = (a1 · · · am ) ∈ Cn×m . Since AAH = a1 aH1 + · · · + am aHm , it follows that the rank of the matrix a1 aH1 + · · · + am aHm equals the rank of the matrix A and, therefore, it cannot exceed m. Theorem 3.34 (Sylvester’s rank theorem). Let A ∈ Cm×n and B ∈ Cn×p be two matrices. We have rank(AB) = rank(B) − dim(null(A) ∩ range(B)).
144
Linear Algebra Tools for Data Mining (Second Edition)
Proof. Both null(A) and range(B) are subspaces of Cn . Therefore, null(A) ∩ range(B) is a subspace of Cn . If u1 , . . . , uk is a basis of the subspace null(A) ∩ range(B), then there exists a basis u1 , . . . , uk , uk+1 , . . . , ul of the subspace range(B). The set {Auk+1 , . . . , Aul } is linearly independent. Indeed, suppose that there exists a linear combination a1 Auk+1 + · · · + al−k Aul = 0. Then, A(a1 uk+1 +· · ·+al−k ul ) = 0, so a1 uk+1 +· · ·+al−k ul ∈ null(A). Since uk+1 , . . . , ul ∈ range(B), it follows that a1 uk+1 +· · ·+al−k ul ∈ null(A) ∩ range(B). Since u1 , . . . , uk is a basis of the subspace null(A) ∩ range(B), we have a1 uk+1 + · · · + al−k ul = d1 u1 + · · · + dk uk for some d1 , . . . , dk ∈ C, which implies a1 uk+1 + · · · + al−k ul − d1 u1 − · · · − dk uk = 0. Since u1 , . . . , uk , uk+1 , · · · , ul is a basis of range(B), it follows that a1 = · · · = al−k = d1 = · · · = dk = 0, so Auk+1 , . . . , Aul is indeed linear independent. Next, we show that Auk+1 , . . . , Aul spans the subspace range(AB). Since uj ∈ range(B), it is clear that Auj ∈ range(AB) for k + 1 j l. If w ∈ range(AB), then w = ABx for some x ∈ Cp . Since Bx ∈ range(B), we can write Bx = b1 u1 + · · · + bk uk + bk+1 uk+1 + · · · + bl ul , which implies w = ABx = bk+1 Auk+1 + · · · + bl Aul , because Au1 = · · · = Auk = 0, as u1 , . . . , uk belong to null(A). Thus, Auk+1 , . . . , Aul spans the subspace range(AB), which allows us to conclude that this linearly independent set is a basis for this subspace that contains l−k elements. This allows us to conclude that rank(AB) = dim(range(AB)) = rank(B) − dim(null(A) ∩ range(B)). Corollary 3.4. Let A ∈ Cm×n . If R ∈ Cm×m and Q ∈ Cn×n are invertible matrices, then rank(A) = rank(RA) = rank(AQ) = rank(RAQ).
Matrices
145
Proof. Note that rank(R) = m and rank(Q) = n. Thus, null(R) = {0m } and null(Q) = {0n }. By Sylvester’s rank theorem we have rank(RA) = rank(A) − dim(null(R) ∩ range(A)) = rank(A) − dim({0}) = rank(A). On the other hand, we have rank(AQ) = rank(Q) − dim(null(A) ∩ range(Q)) = n − dim(null(A)) = rank(A), because range(Q) = Cn . The last equality of the theorem follows from the first two.
Corollary 3.5 (Frobenius2 inequality). Let A ∈ Cm×n , B ∈ Cn×p , and C ∈ Cp×q be three matrices. Then, rank(AB) + rank(BC) rank(B) + rank(ABC). Proof.
By Sylvester’s rank theorem, we have rank(ABC) = rank(BC) − dim(null(A) ∩ range(BC)), rank(AB) = rank(B) − dim(null(A) ∩ range(B)).
These equalities imply rank(ABC) + rank(B) = rank(AB) + rank(BC) − dim(null(A) ∩ range(BC)) + dim(null(A) ∩ range(B)). Since null(A) ∩ range(BC) ⊆ null(A) ∩ range(B), we have dim(null(A) ∩ range(BC)) dim(null(A) ∩ range(B)), which implies the desired inequality. 2
Ferdinand Georg Frobenius was born on October 26, 1849 in CharlottenburgBerlin and died on August 3, 1917 in Berlin. He studied at the University of G¨ ottingen and Berlin and taught at the University of Berlin and ETH in Z¨ urich. Frobenius has contributed to the study of elliptic functions, algebra, and mathematical physics.
Linear Algebra Tools for Data Mining (Second Edition)
146
Corollary 3.6. Let A ∈ Cm×n and B ∈ Cn×p be two matrices. We have dim(null(AB)) = dim(null(B)) + dim(null(A) ∩ range(B)). Proof.
By Equality (3.8), we have dim(null(AB)) + rank(AB) = p, dim(null(B)) + rank(B) = p.
An application of Sylvester’s rank theorem implies dim(null(AB)) = dim(null(B)) + dim(null(A) ∩ range(B)).
Corollary 3.7. Let A ∈ Cm×n and B ∈ Cn×p be two matrices. We have rank(A)+rank(B)−n rank(AB) min{rank(A), rank(B)}, (3.10) and max{dim(null(A)), dim(null(B))} ≤ dim(null(AB)) ≤ dim(null(A)) + dim(null(B)). Proof. Since dim(null(A)∩rank(B)) dim(null(A)) = n−rank(A), it follows that rank(AB) rank(B) − (n − rank(A)) = rank(A) + rank(B) − n. For the second inequality, observe that Sylvester’s rank Theorem, implies immediately rank(AB) rank(B). Also, rank(AB) = rank((AB) ) = rank(B A ) rank(A ) = rank(A), so rank(AB) ≤ min{rank(A), rank(B)}. The second part of the Corollary follows from the first part. Inequality (3.10) is also known as the Sylvester rank inequality. Corollary 3.8. If A ∈ Cm×n is a full-rank matrix with m n, then rank(AB) = rank(B) for any B ∈ Cn×p . Proof. Since m n, we have rank(A) = n; therefore, the n columns of A are linearly independent so null(A) = {0}. By Sylvester’s Rank theorem, we have rank(AB) = rank(B).
Matrices
147
Theorem 3.35 (The full-rank factorization theorem). Let A ∈ Cm×n be a matrix with rank(A) = r > 0. There exists B ∈ Cm×r and C ∈ Cr×n such that A = BC. Furthermore, if A = DE, where D ∈ Cm×r , E ∈ Cr×n , then both D and E are full-rank matrices, that is, we have rank(D) = rank(E) = r. Proof. Let {b1 , . . . , br } ⊆ Cm be a basis for the range(A). Define B = (b1 · · · br ) ∈ Cm×r . The columns of A, a1 , . . . , an can be written as ai = c1i b1 + · · · cri br for 1 i n, which amounts to ⎛ ⎞ c11 · · · c1r ⎜ . . ⎟ ⎟ A = (a1 · · · an ) = (b1 · · · br ) ⎜ ⎝ .. · · · .. ⎠ . cr1 · · · cr Thus, A = BC, where
⎞ c11 · · · c1r ⎜ . . ⎟ ⎟ C=⎜ ⎝ .. · · · .. ⎠ . cr1 · · · cr ⎛
Suppose now that A = DE, where D ∈ Cm×r , E ∈ Cr×n . It is clear that we have both rank(D) r and rank(E) r. On the other hand, by Corollary 3.7, r = rank(A) = rank(DE) min{rank(D), rank(E)} implies r rank(D) and r ≤ rank(E), so rank(D) = rank(E) = r. Corollary 3.9. Let A ∈ Cm×n be a matrix such that rank(A) = r > 0, and let A = BC be a full-rank factorization of A. If the columns of B constitute a basis of the column space of A, then C is uniquely determined. Furthermore, if the rows of C constitute a basis of the row space of A, then B is uniquely determined. Proof. This statement is an immediate consequence of the full rank factorization theorem. Corollary 3.10. If A ∈ Cm×n is a matrix with rank(A) = r > 0, then A can be written as A = b1 c1 + · · · + br cr , where {b1 , . . . , br } ⊆ Cm and {c1 , . . . , cr } ⊆ Cn are linearly independent sets.
148
Linear Algebra Tools for Data Mining (Second Edition)
Proof. The corollary follows from Theorem 3.35 by adopting the set of columns of B as {b1 , . . . , br } and the transposed rows of C as {c1 , . . . , cr }. Theorem 3.36. Let A ∈ Cm×n be a full-rank matrix. If m n, then there exists a matrix D ∈ Cn×m such that DA = In . If n m, then there exists a matrix E ∈ Cn×m such that AE = Im . Proof. Suppose that A = (a1 · · · an ) ∈ Cm×n is a full-rank matrix and m n. Then, the n columns of A are linearly independent and, by Corollary 2.3, we can extend the set of columns to a basis of Cm , {a1 , . . . , an , d1 , . . . , dm−n }. The matrix T = (a1 · · · an d1 · · · dm−n ) is invertible, so there exists ⎛ ⎞ t1 ⎜ . ⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ −1 t T = ⎜ n ⎟, ⎜ ⎟ ⎜ ..⎟ ⎝tn+1 .⎠ tm such that T −1 T = Im . If we define ⎛ ⎞ t1 ⎜.⎟ ⎟ D=⎜ ⎝ .. ⎠ , tn it is immediate that DA = In . The argument for the second part is similar.
Definition 3.31. Let A ∈ Cm×n . A left inverse of A is a matrix D ∈ Cn×m such that DA = In . A right inverse of A is a matrix E ∈ Cn×m such that AE = Im . Theorem 3.36 can now be restated as follows. Let A ∈ Cm×n be a full-rank matrix. If m n, then A has a left inverse; if n m, then A has a right inverse. Corollary 3.11. Let A ∈ Cn×n be a square matrix. The following statements are equivalent.
Matrices
149
(i) A has a left inverse; (ii) A has a right inverse; (iii) A has an inverse. Proof. It is clear that (iii) implies both (i) and (ii). Suppose now that A has a left inverse, so DA = In . Then, the columns of A, c1 , . . . , cn are linearly independent, for if a1 c1 + · · · an cn = 0, we have a1 Dc1 + · · · + an Dcn = a1 e1 + · · · + an cn = 0, which implies a1 = · · · = an = 0. Thus, rank(A) = n, so A has an inverse. In a similar manner (using the rows of A), we can show that (ii) implies (iii). Theorem 3.37. Let A ∈ Cm×n be a matrix with rank(A) = r > 0. There exists a non-singular matrix G ∈ Cm×m and a non-singular matrix H ∈ Cn×n such that Or,n−r Ir H. A=G Om−r,r Om−r,n−r Proof. By the Full-Rank factorization theorem (Theorem 3.35), there are two full-rank matrices B ∈ Cm×r and C ∈ Cr×n such that A = BC. Let {b1 , . . . , br } be the columns of B and let c1 , . . . , cr be the rows of C. It is clear that both sets of vectors are linearly independent and, therefore, for the first set, there exist br+1 , . . . , bm such that {b1 , . . . , bm } is a basis of Cm ; for the second set, we have the vectors cr+1 , . . . , cn such that {c1 , . . . , cn } is a basis for Rn . Define G = (b1 , . . . , bm ) and ⎛ ⎞ c1 ⎜.⎟ ⎟ H=⎜ ⎝ .. ⎠ . cn Clearly, both G and H are non-singular and Or,n−r Ir H. A=G Om−r,r Om−r,n−r
Next we examine the relationships between full-rank factorization and the Moore–Penrose pseudoinverse.
Linear Algebra Tools for Data Mining (Second Edition)
150
Theorem 3.38. Let A ∈ Cm×n be a matrix with rank(A) = r > 0 and let A = BC be a full-rank factorization of A, where B ∈ Cm×r and C ∈ Cr×n are full-rank matrices. The following statements hold: (1) the matrices B H B ∈ Cr×r and CC H ∈ Cr×r are non-singular; (2) the matrix B H AC H is non-singular; (3) the Moore–Penrose pseudoinverse of A is given by A† = C H (CC H )−1 (B H B)−1 B H . Proof. The first part of the theorem follows from Corollary 3.3. For Part (ii), note that B H AC H = B H (BC)C H = (B H B)(CC H ), so B H AC H is non-singular as a product of two non-singular matrices. For Part (iii), observe that (B H AC H )−1 = ((B H B)(CC H ))−1 = (CC H )−1 (B H B)−1 . Therefore, C H (B H AC H )−1 B H = C H (CC H )−1 (B H B)−1 B H , and it is easy to verify that the matrix C H (CC H )−1 (B H B)−1 B H satis fies the conditions of Theorem 3.26. Lemma 3.1. If A ∈ Cm×n is a matrix and x ∈ Cm , y ∈ Cn are two vectors such that xH Ay = 0, then rank(AyxH A) = 1. Proof. By the associative property of matrix product, we have AyxH A = A(yxH )A, so rank(AyxH A) min{rank(yxH , rank(A)} = 1, by Corollary 3.7. We claim that AyxH A = Om,n . Suppose that AyxH A = Om,n . This implies xH AyxH Ay = 0. If z = xH Ay, the previous equality amounts to z 2 = 0, which yields z = xH Ay = 0. This contradicts the hypothesis of the lemma, so AyxH A = Om,n , which implies rank(AyxH A) 1. This allows us to conclude that rank(AyxH A) = 1. The rank-1 matrix AyxH A discussed in Lemma 3.1 plays a central role in the next statement.
Matrices
151
Theorem 3.39 (Wedderburn’s theorem). Let A ∈ Cm×n be a matrix. If x ∈ Cm and y ∈ Cn are two vectors such that xH Ay = 0 and B is the matrix B =A−
1 AyxH A, x Ay H
then rank(B) = rank(A) − 1. Proof. have
Observe that if z ∈ null(A), then Az = 0. Therefore, we
Bz = −
1 AyxH Az = 0, x Ay H
so null(A) ⊆ null(B). Conversely, if z ∈ null(B), we have Az −
1 AyxH Az = 0, xH Ay
which can be written as Az = =
1 Ay(xH Az) x Ay H
xH Az Ay. xH Ay
Thus, we obtain A(z − ky) = 0, where k=
xH Az . xH Ay
Since Ay = 0, this shows that a basis of null(B) can be obtained by adding y to a basis of null(A). Therefore, dim(null(B)) = dim(null(A)) + 1, so rank(B) = rank(A) − 1.
Linear Algebra Tools for Data Mining (Second Edition)
152
Theorem 3.40. A square matrix A ∈ Cn×n generates an increasing sequence of null spaces {0} = null(A0 ) ⊆ null(A1 ) ⊆ · · · ⊆ null(Ak ) ⊆ · · · and a decreasing sequence of subspaces Cn = range(A0 ) ⊇ range(A1 ) ⊇ · · · ⊇ range(Ak ) ⊇ · · ·
Furthermore, there exists a number such that null(A0 ) ⊂ null(A1 ) ⊂ · · · ⊂ null(A ) = null(A+1 ) = · · · and range(A0 ) ⊃ range(A1 ) ⊃ · · · ⊃ range(A ) = range(A+1 ) = · · · Proof. The proof of the existence of the increasing sequence of null subspaces and the decreasing sequence of ranges is immediate. Since null(Ak ) ⊆ Cn for every k, there exists a least number p such that range(Ap ) = range(Ap+1 ). Therefore, range(Ap+i ) = Ai range(Ap ) = Ai range(Ap+1 ) = range(Ap+i+1 ) for every i ∈ N. Thus, once two consecutive subspaces range(A ) and range(A+1 ) are equal, the sequence of range subspaces stops growing. By Equality (3.8), we have dim(range(Ak )) + dim(null(Ak )) = n, so the sequence of null spaces stabilizes at the same number . Definition 3.32. The index of a square matrix A ∈ Cn×n is the number defined in Theorem 3.40. We denote the index of a matrix A ∈ Cn×n by index(A). Observe that if A ∈ Cn×n is a non-singular matrix, then index(A) = 0 because in this case Cn = range(A0 ) = range(A). Theorem 3.41. Let A ∈ Cn×n be a square matrix. The following statements are equivalent: (i) range(Ak ) ∩ null(Ak ) = {0}; (ii) Cn = range(Ak ) null(Ak ); (iii) k index(A). Proof. We prove this theorem by showing that (i) and (ii) are equivalent, (i) implies (iii), and (iii) implies (ii).
Matrices
153
Suppose that the first statement holds. By Theorem 2.35, the set T = {t ∈ V | t = u + v, u ∈ range(Ak ), v ∈ null(Ak )} is a subspace of Cn and dim(T ) = dim(range(Ak )) + dim(null(Ak )) = n. Therefore, T = Cn , so Cn = range(Ak ) null(Ak ). The second statement clearly implies the first. Suppose now that Cn = range(Ak ) null(Ak ). Then range(Ak ) = Ak Cn = Arange(Ak ) = range(Ak+1 ), so k index(A). Conversely, if k index(A) and x ∈ range(Ak ) ∩ null(Ak ), then x = Ak y and Ak x = 0, so A2k y = 0. Thus, y ∈ null(A2k ) = null(Ak ), which means that x = Ak y = 0. Thus, the first statement holds. The notion of spark of linear mappings can be transferred to matrices. Definition 3.33. Let A ∈ Cm×n be a matrix. The spark of A is the minimum size of a set of columns that is linearly dependent. If the set of columns of A is linearly independent, then spark(A) = n + 1. Note that for A ∈ Cm×n , we have 1 spark(A) n + 1. 3.12
Matrix Similarity and Congruence
Define the similarity relation “∼” on the set of square matrices Cn×n by A ∼ B if there exists an invertible matrix X such that A = XBX −1 . If X is a unitary matrix, then we say that A and B are unitarily similar and we write A ∼u B, so ∼u is a subset of ∼. In this case, we have A = XBX H . Theorem 3.42. The relations “∼” and “∼u ” are equivalence relations. Proof. We have A ∼ A because A = In A(In )−1 , so ∼ is a reflexive relation. To prove that ∼ is symmetric suppose that A = XBX −1 . Then, B = X −1 AX and, since X −1 is invertible, we have B ∼ A.
154
Linear Algebra Tools for Data Mining (Second Edition)
Finally, to verify the transitivity, let A, B, C be such that A = XBX −1 and B = Y CY −1 , where X and Y are two invertible matrices. This allows us to write A = XBX −1 = XY CY −1 X −1 = (XY )C(XY )−1 , which proves that A ∼ C. We leave to the reader the similar proof concerning ∼u .
Theorem 3.43. If A ∼u B, where A, B ∈ Cn×n , then AH A ∼u B H B. Proof. Since A ∼u B, there exists a unitary matrix X such that A = XBX −1 = XBX H . Then, AH = XB H X H , so AH A = XB H X H XBX H = XB H BX H . Thus, AH A is unitarily similar to B H B.
The similarity relation can be extended to sets of rectangular matrices as follows. Definition 3.34. Let A, B be two matrices in Cm×n . Then, A and B are similar (written A ∼ B) if there exist two non-singular matrices G ∈ Cm×m and H ∈ Cn×n such that A = GBH. It is easy to verify that ∼ is an equivalence relation on Cm×n . Theorem 3.44. Let A and B be two matrices in Cm×n . We have A ∼ B if and only if rank(A) = rank(B). Proof. By Theorem 3.37, if A ∈ Cm×n is a matrix with rank(A) = r > 0, then Or,n−r Ir . A∼ Om−r,r Om−r,n−r Thus, for every two matrices A, B ∈ Cn×m of rank r, we have A ∼ B because both are similar to Or,n−r Ir . Om−r,r Om−r,n−r
Matrices
155
Conversely, suppose that A ∼ B, that is, A = GBH, where G ∈ Cm×m and H ∈ Cn×n are non-singular matrices. By Corollary 3.4, we have rank(A) = rank(B). Definition 3.35. A matrix A ∈ Cn×n is diagonalizable if there exists a diagonal matrix D such that A ∼ D. Let M be a class of matrices. A is M-diagonalizable if there exists a matrix M ∈ M such that A = M DM −1 . For example, if A is M-diagonalizable and M is the class of unitary matrices, we say that A is unitarily diagonalizable. Let f : Cn −→ C be a polynomial given by f (z) = a0 z n + a1 z n−1 + · · · + an , where a0 , a1 , . . . , an ∈ C. If A ∈ Cm×m , then the matrix f (A) is defined by f (A) = a0 An + a1 An−1 + · · · + an Im . Theorem 3.45. If T ∈ Cm×m is an upper (a lower) triangular matrix and f is a polynomial, then f (T ) is an upper (a lower) triangular matrix. Furthermore, if the diagonal elements of T are t11 , t22 , . . . , tmm , then the diagonal elements of f (T ) are f (t11 ), f (t22 ), . . . , f (tmm ), respectively. Proof. By Theorem 3.12, any power T k of T is an upper (a lower) triangular matrix. Since the sum of upper (lower) triangular matrices is upper (lower) triangular, if follows that f (T ) is an upper triangular (a lower triangular) matrix. An easy argument by induction on k (left to the reader) shows that if the diagonal elements of T are t11 , t22 , . . . , tmm , then the diagonal elements of T k are tk11 , tk22 , . . . , tkmm . The second part of the theorem follows immediately. Theorem 3.46. Let A, B ∈ Cm×m . If A ∼ B and f is a polynomial, Then f (A) ∼ f (B).
156
Linear Algebra Tools for Data Mining (Second Edition)
Proof. Let X be an invertible matrix such that A = XBX −1 . It is straightforward to verify that Ak = XB k X −1 for k ∈ N. This implies that f (A) = Xf (B)X −1 , so f (A) ∼ f (B). Then f (A) ∼ f (B). Definition 3.36. Let A and B be two matrices in Cn×n . The matrices A and B are congruent if there exists an invertible matrix X ∈ Cn×n such that B = XAX H . This is denoted by A ∼H B. The relation ∼H is an equivalence on Cn×n . We have A ∼H A because A = In AInH . If A ∼H B, then B = XAX H , so A = X −1 B(X H )−1 = X −1 B(X −1 )H , which implies B ∼H A. Finally, ∼H is transitive because if B = XAX H and C = Y BY H , where X and Y are invertible matrices, then C = (Y X)A(Y X)H and Y X is an invertible matrix. It is immediate that any two congruent matrices have the same rank. To recapitulate the definitions of the important similarity relations discussed in this section, consider the following list: (i) A and B are similar matrices, A ∼ B, if there exists an invertible matrix X such that A = XBX −1 ; (ii) A and B are congruent matrices, A ∼H B, if there exists an invertible matrix X ∈ Cn×n such that B = XAX H ; (iii) A and B are unitarily similar, A ∼u B, if there exists a unitary matrix U such that A = U BU −1 . Since every unitary matrix is invertible and its inverse equals its conjugate Hermitian matrix, it follows that ∼u is a subset of both ∼ and ∼H . 3.13
Linear Systems and LU Decompositions
Consider the following set of linear equalities a11 x1 + . . . + a1n xn = b1 , a21 x1 + . . . + a2n xn = b2 , .. .. . . am1 x1 + . . . + amn xn = bm ,
Matrices
157
where aij and bi belong to a field F . This set constitutes a system of linear equations. Solving this system means finding x1 , . . . , xn that satisfy all equalities. The system can be written succinctly in a matrix form as Ax = b, where ⎞ ⎛ ⎛ ⎞ a11 · · · a1n b1 ⎟ ⎜a ⎜ b2 ⎟ ⎜ 21 · · · a2n ⎟ ⎜ ⎟ ⎟ A=⎜ . ⎟, .. ⎟ , b = ⎜ ⎜ .. ⎝ .. ⎠ ⎝ . ··· . ⎠ bm ··· a a m1
and
mn
⎞ x1 ⎜ x2 ⎟ ⎜ ⎟ x = ⎜ .. ⎟ . ⎝ . ⎠ ⎛
xn If the set of solutions of a system Ax = b is not empty, we say that the system is consistent. Note that Ax = b is consistent if and only if b ∈ range(A). Let Ax = b be a linear system in matrix form, where A ∈ Cm×n . The matrix [A b] ∈ Cm×(n+1) is the augmented matrix of the system Ax = b. Theorem 3.47. Let A ∈ Cm×n be a matrix and let b ∈ Cn×1 . The linear system Ax = b is consistent if and only if rank(A b) = rank(A). Proof. If Ax = b is consistent and x = (x1 , . . . , xn ) is a solution of this system, then b = x1 c1 + · · · + xn cn , where c1 , . . . , cn are the columns of A. This implies rank([A b]) = rank(A). Conversely, if rank(A b) = rank(A), the vector b is a linear combination of the columns of A, which means that Ax = b is a consistent system. Definition 3.37. A homogeneous linear system is a linear system of the form Ax = 0m , where A ∈ Cm×n , x ∈ Cn,1 , and 0 ∈ Cm×1 . Clearly, any homogeneous system Ax = 0m has the solution x = 0n . This solution is referred to as the trivial solution. The set of solutions of such a system is null(A), the null space of the matrix A.
158
Linear Algebra Tools for Data Mining (Second Edition)
Let u and v be two solutions of the system Ax = b. Then A(u − v) = 0m , so z = u − v is a solution of the homogeneous system Ax = 0m , or z ∈ null(A). Thus, the set of solutions of Ax = b can be obtained as a “translation” of the null space of A by any particular solution of Ax = b. In other words, the set of solutions of Ax = b is {x + z | z ∈ null(A)}. Thus, for A ∈ Cm×n , the system Ax = b has a unique solution if and only if null(A) = {0n }, that is, according to Equality (3.8), if rank(A) = n. Theorem 3.48. Let A ∈ Cn×n . Then, A is invertible (which is to say that rank(A) = n) if and only if the system Ax = b has a unique solution for every b ∈ Cn . Proof. If A is invertible, then x = A−1 b, so the system Ax = b has a unique solution. Conversely, if the system Ax = b has a unique solution for every b ∈ Cn , let c1 , . . . , cn be the solution of the systems Ax = e1 , . . . , Ax = en , respectively. Then, we have A(c1 | · · · |cn ) = In , which shows that A is invertible and A−1 = (c1 | · · · |cn ).
Corollary 3.12. A homogeneous linear system Ax = 0, where A ∈ Cn×n has a non-trivial solution if and only if A is a singular matrix. Proof.
This statement follows from Theorem 3.48.
Thus, by calculating the inverse of A we can solve any linear system of the form Ax = b. In Chapter 5, we discuss this type of calculation in detail. Definition 3.38. A matrix A ∈ Cn×n is diagonally dominant if |aii | > {|aik | | 1 k n and k = i}. Theorem 3.49. A diagonally dominant matrix is non-singular. Proof. Suppose that A ∈ Cn×n is a diagonally dominant matrix that is singular. By Corollary 3.12, the homogeneous system Ax = 0 has a non-trivial solution x = 0. Let xk be a component of x that
Matrices
159
has the largest absolute value. Since x = 0, we have |xk | > 0. We can write
{akj xj | 1 j n and j = k}, akk xk = − which implies
{akj xj | 1 j n and j = k} |akk | |xk | =
{|akj | |xj | | 1 j n and j = k}
{|akj | | 1 j n and j = k}. |xk |
Thus, we obtain |akk |
{|akj | | 1 j n and j = k},
which contradicts the fact that A is diagonally dominant.
Definition 3.39. The sparsity of a vector v ∈ Cn is the number sparse(v) of components of v equal to 0. Let ν0 : Rn −→ R be the function defined by ν0 (x) = {i | 1 i n, xi = 0}, which gives the number of non-zero components of x. It is immediate to verify that ν0 (x + y) ν0 (x) + ν0 (y) for x, y ∈ Rn . Note that ν0 (v) = n − sparse(v). If x ∈ null(A), then ν0 (x) spark(A) because vectors in this subspace are linear combinations of columns of A that equal 0n and at least spark(A) columns are needed to produce the vector 0n . The following result was obtained in [43]: Theorem 3.50. If a linear system Ax = b has a solution x such that ν0 (x) < 12 spark(A), then x is a sparsest solution of the system. Proof. Let y be a solution of the same system. We have Ax−Ay = A(x −y) = 0n . By the definition of spark(A) we have ν0 (x)+ν0 (y) ν0 (x − y) spark(A). Thus, the number of non-zero components of the vector x − y cannot exceed the sum of the number of non-zero components within each of the vectors x and y. Since x0 satisfies ν0 (x0 ) < spark(A)/2, it follows that any other solution has more than spark(A)/2 non-zero components.
3.14 The Row Echelon Form of Matrices
We begin with a class of linear systems that can easily be solved.
Definition 3.40. A matrix C ∈ Cm×n is in row echelon form if the following conditions are satisfied:
(i) rows that contain nonzero elements precede zero rows (that is, rows that contain only zeros);
(ii) if cij is the first nonzero element of the row i, all elements in the jth column located below cij, that is, entries of the form ckj with k > i, are zero (see Figure 3.1);
(iii) if i < ℓ, the first non-zero element of the row i is in column ji, and the first non-zero element of the row ℓ is in column jℓ, then ji < jℓ (see Figure 3.2).
The first non-zero element of a row i (if it exists) is called the pivot of the row i.
Fig. 3.1 Condition (ii) of Definition 3.40.
Fig. 3.2 Condition (iii) of Definition 3.40.
Example 3.31. Let C ∈ R4×5 be the matrix
C = [1 2 0  2 0]
    [0 2 3  0 1]
    [0 0 0 −1 2]
    [0 0 0  0 0].
It is clear that C is in row echelon form; the pivots of the first, second, and third rows are c11 = 1, c22 = 2, and c34 = −1. Theorem 3.51. Let C ∈ Cm×n be a matrix in row echelon form such that the rows that contain non-zero elements are the first r rows. Then, rank(C) = r. Proof. Let c1 , . . . , cr be the non-zero rows of C. Suppose that the row ci has the first non-zero element in the column ji for 1 i r. By the definition of the echelon form, we have j1 < j2 < · · · < jr . Suppose that a1 c1 + · · · + ar cr = 0. This equality can be written as a1 c1j1 = 0, a1 c1j2 + a2 c2j2 = 0, .. . a1 c1n + a2 c2n + . . . + ar crn = 0. Since c1j1 = 0, we have a1 = 0. Substituting a1 by 0 in the second equality implies a2 = 0 because c2j2 = 0, etc. Thus, we obtain a1 = a2 = . . . = ar = 0, which proves that the rows c1 , . . . , cr are linearly independent. Since this is a maximal set of rows of C that is linearly independent, it follows that rank(C) = r. Linear systems whose augmented matrices are in row echelon form can be easily solved using a process called back substitution. Consider the following augmented matrix in row echelon form of a system with
m equations and n unknowns:
[0 · · · 0  a1j1  · · ·  · · ·   · · ·      a1n | b1  ]
[0 · · · 0   0    · · ·  a2j2    · · ·      a2n | b2  ]
[                 · · ·                         | ·   ]
[0 · · · 0   0    · · ·  · · ·   arjr · · · arn | br  ]
[0 · · · 0   0    · · ·  · · ·    0   · · ·  0  | br+1]
[                 · · ·                         | ·   ]
[0 · · · 0   0    · · ·  · · ·    0   · · ·  0  | bm ].
The system of equations has the following form:
a1j1 xj1 + · · · + a1n xn = b1
a2j2 xj2 + · · · + a2n xn = b2
· · ·
arjr xjr + · · · + arn xn = br
0 = br+1
· · ·
0 = bm.
The variables xj1, xj2, . . . , xjr that correspond to the columns where the pivot elements occur are referred to as the basic variables or principal variables. The remaining variables are non-basic or non-principal. Note that we have r ≤ min{m, n}.
If r < m and there exists bℓ ≠ 0 for some ℓ with r < ℓ ≤ m, then the system is inconsistent and no solutions exist. If r = m, or bℓ = 0 for r < ℓ ≤ m, one can choose the variables that do not correspond to the pivot elements, {xi | i ∉ {j1, j2, . . . , jr}}, as parameters and express the basic variables as functions of these parameters. The process starts with the last basic variable, xjr (because every other variable in the equation arjr xjr + · · · + arn xn = br is a parameter), and then substitutes this
variable in the previous equality. This allows us to express xjr−1 as a function of parameters, etc. This explains the term back substitution previously introduced. If r = n, then no parameters exist.
To conclude, if r < m, the system has a solution if and only if bj = 0 for j > r. If r = m, the system has a solution. This solution is unique if r = n.
Definition 3.41. A linear system Ax = b, where A ∈ Rm×n, b ∈ Rm, and m ≤ n, is said to be in explicit form if A contains the columns of the matrix Im.
Example 3.32. The linear system Ax = b is defined by
A = [1 0  1 −1]   and   b = [2]
    [0 1 −1  1]             [1].
Note that A is a full-rank matrix because rank(A) = 2. The system is already in explicit form relative to the variables x1 and x2 because the first two columns of the matrix [A|b] are the columns of I2. There are six possible explicit forms of this system corresponding to the (4 choose 2) = 6 two-element subsets of the set of variables {x1, x2, x3, x4}. Note, however, that if the columns chosen form a submatrix of rank less than 2, the explicit form does not exist. This is the case when we choose the explicit form relative to x3 and x4. Therefore, this system has five explicit forms relative to the sets of variables {x1, x2}, {x1, x3}, {x1, x4}, {x2, x3}, and {x2, x4}. In general, a system with m equations and n variables has at most (n choose m) explicit forms.
Example 3.33. Consider the system
x1 + 2x2 + 2x4 = b1
2x2 + 3x3 + x5 = b2
−x4 + 2x5 = b3
0 = b4.
The matrix A ∈ R4×5 is not of full rank because rank(A) = 3.
The augmented matrix of this system is
[1 2 0  2 0 | b1]
[0 2 3  0 1 | b2]
[0 0 0 −1 2 | b3]
[0 0 0  0 0 | b4].
The basic variables are x1, x2, and x4. If b4 = 0, the system is consistent. Under this assumption we can choose x3 and x5 as parameters. Let x3 = p and x5 = q. The third equation yields x4 = 2q − b3. Similarly, the second equation implies x2 = 0.5(b2 − 3p − q). Substituting these values in the first equation allows us to write x1 = b1 − b2 + 2b3 + 3p − 3q.
Further transformations of this system allow us to construct an equivalent linear system whose matrix contains the columns of the matrix I3. Subtracting the second row from the first yields
[1 0 −3  2 −1 | b1 − b2]
[0 2  3  0  1 | b2]
[0 0  0 −1  2 | b3]
[0 0  0  0  0 | b4].
Then, dividing the second row by 2 will produce
[1 0 −3   2  −1  | b1 − b2]
[0 1 3/2  0  1/2 | b2/2]
[0 0  0  −1   2  | b3]
[0 0  0   0   0  | b4],
which creates the second column of I3. Next, multiply the third row by −1:
[1 0 −3   2  −1  | b1 − b2]
[0 1 3/2  0  1/2 | b2/2]
[0 0  0   1  −2  | −b3]
[0 0  0   0   0  | b4].
Finally, multiply the third row by −2 and add it to the first row:
[1 0 −3   0   3  | b1 − b2 + 2b3]
[0 1 3/2  0  1/2 | b2/2]
[0 0  0   1  −2  | −b3]
[0 0  0   0   0  | b4].
Thus, if we choose for the basic variables
x1 = b1 − b2 + 2b3, x2 = b2/2, and x4 = −b3,
and for the non-basic variables x3 = x5 = 0, we obtain a solution of the system.
The extended echelon form of a system can be achieved by applying certain transformations on the rows of the augmented matrix of the system (which amount to transformations involving the equations of the system). In preparation, a few special invertible matrices are introduced in the next examples.
Example 3.34. Consider the matrix T(i)↔(j) that coincides with the identity matrix except for its rows i and j,
where line i contains exactly one 1 in position j and line j contains exactly one 1 in position i. If T (i)↔(j) ∈ Cp×p and A ∈ Cp×q , it is easy to see that the matrix T (i)↔(j) A is obtained from the matrix A by permuting the lines i and j.
For instance, consider the matrix T(2)↔(4) ∈ C4×4 defined by
T(2)↔(4) = [1 0 0 0]
           [0 0 0 1]
           [0 0 1 0]
           [0 1 0 0]
and the matrix A = (aij) ∈ F4×5. We have
T(2)↔(4) A = [a11 a12 a13 a14 a15]
             [a41 a42 a43 a44 a45]
             [a31 a32 a33 a34 a35]
             [a21 a22 a23 a24 a25].
The inverse of T(i)↔(j) is T(i)↔(j) itself.
Example 3.35. Let Ta(i) ∈ Cp×p be the diagonal matrix
Ta(i) = diag(1, . . . , 1, a, 1, . . . , 1)
that has a ∈ F − {0} as its ith diagonal element, 1 on the remaining diagonal elements, and 0 everywhere else. The product Ta(i) A is obtained from A by multiplying the ith row by a. The inverse of this matrix is T(1/a)(i).
As an example, consider the matrix T3(2) ∈ C4×4 given by
T3(2) = diag(1, 3, 1, 1).
If A ∈ C4×5, the matrix T3(2) A is obtained from A by multiplying its second line by 3. We have
T3(2) A = [a11  a12  a13  a14  a15]
          [3a21 3a22 3a23 3a24 3a25]
          [a31  a32  a33  a34  a35]
          [a41  a42  a43  a44  a45].
Example 3.36. Let T(i)+a(j) ∈ Cp×p be the matrix whose entries are identical to those of the matrix Ip with the exception of the element located in row i and column j, which equals a. The result of the multiplication T(i)+a(j) A is a matrix that can be obtained from A by adding the jth line of A multiplied by a to the ith line of A. The inverse of the matrix T(i)+a(j) is T(i)−a(j).
For example, we have
T(4)+2(2) A = [1 0 0 0] [a11 a12 a13 a14 a15]
              [0 1 0 0] [a21 a22 a23 a24 a25]
              [0 0 1 0] [a31 a32 a33 a34 a35]
              [0 2 0 1] [a41 a42 a43 a44 a45]
            = [a11        a12        a13        a14        a15       ]
              [a21        a22        a23        a24        a25       ]
              [a31        a32        a33        a34        a35       ]
              [a41 + 2a21 a42 + 2a22 a43 + 2a23 a44 + 2a24 a45 + 2a25].
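The three types of matrices in Examples 3.34–3.36 are easy to generate and test numerically. The following is a minimal sketch, assuming NumPy (0-based indices and constructor names are ours, not the text's):

import numpy as np

def t_swap(p, i, j):
    """T with rows i and j of the identity exchanged."""
    T = np.eye(p)
    T[[i, j]] = T[[j, i]]
    return T

def t_scale(p, i, a):
    """Identity with the i-th diagonal entry replaced by a."""
    T = np.eye(p)
    T[i, i] = a
    return T

def t_add(p, i, j, a):
    """Identity with an extra entry a in row i, column j."""
    T = np.eye(p)
    T[i, j] = a
    return T

A = np.arange(20.0).reshape(4, 5)
assert np.allclose(t_swap(4, 1, 3) @ A, A[[0, 3, 2, 1], :])              # rows 2 and 4 exchanged
assert np.allclose(t_scale(4, 1, 3.0) @ A, np.diag([1, 3, 1, 1]) @ A)    # row 2 multiplied by 3
assert np.allclose((t_add(4, 3, 1, 2.0) @ A)[3], A[3] + 2.0 * A[1])      # row 4 plus 2 times row 2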
It is easy to see that if one multiplies a matrix A at the right by T(i)↔(j), Ta(i), and T(i)+a(j), the effect on A consists of exchanging the columns i and j, multiplying the ith column by a, and adding the jth column multiplied by a to the ith column, respectively.
Definition 3.42. Let F be a field, A, C ∈ Fm×n and let b, d ∈ Fm×1. Two systems of linear equations Ax = b and Cx = d are equivalent if they have the same set of solutions.
If Ax = b is a system of linear equations in matrix form, where A ∈ Cm×n and b ∈ Cm×1, and T ∈ Cm×m is a matrix that has an inverse, then the systems Ax = b and (T A)x = (T b) are equivalent. Indeed, any solution of Ax = b satisfies the system (T A)x = (T b). Conversely, if (T A)x = (T b), by multiplying this equality by T −1 to the left, we get (T −1 T )Ax = (T −1 T )b, that is, Ax = b.
The matrices T(i)↔(j), Ta(i), and T(i)+a(j) introduced in Examples 3.34–3.36 play a special role in Algorithm 3.14.1 that transforms a linear system Ax = b into an equivalent system in row echelon form. These transformations are known as elementary transformation matrices.
Example 3.37. Consider the linear system
x1 + 2x2 + 3x3 = 4
x1 + 2x2 + x3 = 3
x1 + 3x2 + x3 = 1.
Algorithm 3.14.1: Algorithm for the Row Echelon Form of a Matrix
Data: A matrix A ∈ Fp×q
Result: A row echelon form of A
r = 1; c = 1;
while r ≤ p and c ≤ q do
    while A(∗, c) = 0 do
        c = c + 1
    end
    j = r;
    while A(j, c) = 0 do
        j = j + 1
    end
    if j ≠ r then
        exchange line r with line j
    end
    multiply line r by 1/A(r, c);
    for each k = r + 1 to p do
        add line r multiplied by −A(k, c) to line k
    end
    r = r + 1; c = c + 1;
end

The augmented matrix of this system is
[A|b] = [1 2 3 | 4]
        [1 2 1 | 3]
        [1 3 1 | 1].
By subtracting the first row from the second and the third, we obtain the matrix
T(3)−1(1) T(2)−1(1) [A|b] = [1 2  3 |  4]
                            [0 0 −2 | −1]
                            [0 1 −2 | −3].
Next, the second and third rows are exchanged, yielding the matrix
T(2)↔(3) T(3)−1(1) T(2)−1(1) [A|b] = [1 2  3 |  4]
                                     [0 1 −2 | −3]
                                     [0 0 −2 | −1].
To obtain a 1 in the pivot of the third row, we multiply the third row by −1/2:
T−0.5(3) T(2)↔(3) T(3)−1(1) T(2)−1(1) [A|b] = [1 2  3 |  4 ]
                                              [0 1 −2 | −3 ]
                                              [0 0  1 | 0.5],
which is the row echelon form of the matrix [A|b]. To achieve the row echelon form, we needed to multiply the matrix [A|b] by the matrix T = T−0.5(3) T(2)↔(3) T(3)−1(1) T(2)−1(1). The solutions of the system can now be obtained by back substitution from the linear system
x1 + 2x2 + 3x3 = 4,
x2 − 2x3 = −3,
x3 = 0.5.
The last equation yields x3 = 0.5. Substituting x3 in the second equation implies x2 = −2; finally, from the first equality we have x1 = 6.5.
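A compact variant of Algorithm 3.14.1 followed by back substitution reproduces this solution. The sketch below is ours (it assumes NumPy and, unlike the algorithm above, it uses a tolerance to decide which entries are zero); it takes the first usable entry of a column as the pivot:

import numpy as np

def row_echelon(M, tol=1e-12):
    """Reduce M to row echelon form with unit pivots."""
    M = M.astype(float).copy()
    rows, cols = M.shape
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot_rows = np.nonzero(np.abs(M[r:, c]) > tol)[0]
        if pivot_rows.size == 0:
            continue                      # the column has no usable pivot
        j = r + pivot_rows[0]
        M[[r, j]] = M[[j, r]]             # exchange rows r and j
        M[r] /= M[r, c]                   # make the pivot equal to 1
        for k in range(r + 1, rows):
            M[k] -= M[k, c] * M[r]        # annihilate the entries below the pivot
        r += 1
    return M

Ab = np.array([[1.0, 2.0, 3.0, 4.0],
               [1.0, 2.0, 1.0, 3.0],
               [1.0, 3.0, 1.0, 1.0]])
R = row_echelon(Ab)
x = np.zeros(3)
for i in range(2, -1, -1):                # back substitution (pivots lie on the diagonal here)
    x[i] = R[i, 3] - R[i, i + 1:3] @ x[i + 1:]
print(x)                                  # approximately [ 6.5 -2.   0.5]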
Theorem 3.52. Let Ta(i), T(p)↔(q), and T(i)+a(j) be the matrices in Rm×m that correspond to the row transformations applied to matrices in Rm×n, where i ≠ j and p ≠ q. We have
Ta(i) T(p)↔(q) = T(p)↔(q) Ta(i)  if i ∉ {p, q},
Ta(i) T(p)↔(q) = T(p)↔(q) Ta(q)  if i = p,
Ta(i) T(p)↔(q) = T(p)↔(q) Ta(p)  if i = q,
and
T(i)+a(j) T(p)↔(q) = T(p)↔(q) T(i)+a(j)  if {i, j} ∩ {p, q} = ∅,
T(i)+a(j) T(p)↔(q) = T(p)↔(q) T(q)+a(j)  if i = p and j ≠ q,
T(i)+a(j) T(p)↔(q) = T(p)↔(q) T(i)+a(p)  if i ≠ p and j = q,
T(i)+a(j) T(p)↔(q) = T(p)↔(q) T(q)+a(p)  if i = p and j = q.
Proof. The equalities of the theorem follow immediately from the definitions of the matrices.
The matrices that describe elementary transformations are of two types: lower triangular matrices of the form Ta(i) or T(i)+a(j) (with i > j), or permutation matrices of the form T(p)↔(q). If all pivots encountered in the construction of the row echelon form of the matrix A are non-zero, then there is no need to use any permutation matrix T(p)↔(q) among the matrices that multiply A at the left. Thus, there is a lower triangular matrix T and an upper triangular matrix U such that T A = U. The matrix T is a product of invertible matrices and, therefore, it is invertible. Since the inverse L = T −1 of a lower triangular matrix is lower triangular, as we saw in Theorem 3.8, it follows that A = LU; in other words, A can be decomposed into a product of a lower triangular and an upper triangular matrix. This factorization of matrices is known as an LU-decomposition of A.
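Numerical libraries compute such factorizations directly. As a sketch (it assumes SciPy, which the text does not use), the call below factors the matrix of Example 3.38 that follows; note that library routines pivot for stability, so the permutation need not be the identity and L, U may differ from a hand computation without row exchanges:

import numpy as np
from scipy.linalg import lu

A = np.array([[1.0, 0.0, 1.0],
              [2.0, 1.0, 1.0],
              [1.0, -1.0, 2.0]])
P, L, U = lu(A)                      # SciPy returns matrices with A = P @ L @ U
assert np.allclose(A, P @ L @ U)
assert np.allclose(L, np.tril(L))    # L is lower triangular with unit diagonal
assert np.allclose(U, np.triu(U))    # U is upper triangular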
the matrix ⎞ 0 1 ⎟ 1 1⎠ . −1 2
Initially, we add the first row multiplied by −2 to the second row, and the same first row, multiplied by −1, to the third row. This amounts to ⎛ ⎞ 1 0 1 ⎜ ⎟ T (3),−(1) T (2,−2(1) A = ⎝0 1 −1⎠ . 0 −1 1 Next, we add the second row to the third to produce the matrix ⎛ ⎞ 1 0 1 ⎜ ⎟ T (3)+(2) T (3),−(1) T (2,−2(1) A = ⎝0 1 −1⎠ , 0 0 0
172
Linear Algebra Tools for Data Mining (Second Edition)
which is an upper triangular matrix. By Theorem 3.51, we can conclude that rank(A) = 2. We can write ⎛ ⎞ 1 0 1 ⎜ ⎟ A = (T (2,−2(1) )−1 (T (3),−(1) )−1 (T (3)+(2) )−1 ⎝0 1 −1⎠ . 0 0 0 Thus, the lower triangular matrix we are seeking is L = (T (2)−2(1) )−1 (T (3),−(1) )−1 (T (3)+(2) )−1 = T (2)+2(1) T (3)+(1) T (3)−(2) ⎛ ⎞⎛ ⎞⎛ ⎞ ⎛ ⎞ 1 0 0 1 0 0 1 0 0 1 0 0 ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟ = ⎝2 1 0⎠ ⎝0 1 0⎠ ⎝0 1 0⎠ = ⎝2 1 0⎠ , 0 0 1 1 0 1 0 −1 1 1 −1 1 which shows that A can be ⎛ 1 ⎜ A = ⎝2 1
written as ⎞ ⎞⎛ 0 0 1 0 1 ⎟ ⎟⎜ 1 0⎠ ⎝0 1 −1⎠ . −1 1 0 0 0
Suppose that during the construction of the matrix U some of the elementary transformation matrices are permutation matrices of the form T (p)↔(q) . By Theorem 3.52, matrices of the form T (p)↔(q) can be shifted to the right. Therefore, instead of the previous factorization of the matrix A, we have a lower triangular matrix T and a permutation matrix, which results as a product of all permutation matrices of the form T (p)↔(q) used in the algorithm such that T P A = U . In this case, we obtain an LU -factorization of P A instead of A. We give now an alternative characterization of the notion of matrix rank. Note that if T and S are matrices of any of the elementary transformations, then rank(T A) = rank(A) and rank(AS) = rank(A) by Corollary 3.4. Theorem 3.53. Let A ∈ Rm×n be a matrix. If rank(A) = k, then the largest non-singular square submatrix B of A is a k × k-matrix.
Proof. Since rank(A) = k, there is a maximal linearly independent set of k columns of A, {ci1, . . . , cik}, and a maximal linearly independent set of k rows, {rj1, . . . , rjk}. Let K be the submatrix of A defined by
K = A(j1, . . . , jk ; i1, . . . , ik),
that is, the submatrix determined by the rows j1, . . . , jk and the columns i1, . . . , ik.
We claim that K is non-singular, that is, rank(K) = k. Using right multiplications by elementary transformation matrices, we can produce a matrix D that has the columns ci1, . . . , cik on the first k positions. Since the remaining columns are linear combinations of these first k columns, using the same type of column transformations we can transform each of the remaining columns into 0m. Thus, there exists a non-singular matrix S such that D = AS = (ci1 · · · cik | Om,n−k) and rank(D) = rank(A) = k. The matrix D has the rows numbered j1, . . . , jk as a maximal set of linearly independent rows. Now, by applying elementary row transformations, these rows can be brought into the first k places and the remaining m − k rows can be nullified. Thus, there is an invertible matrix T such that
E = T AS = [K O]
           [O O],
where K is a non-singular k × k-matrix because rank(E) = rank(A). Thus, A contains a submatrix of rank k.
Suppose now that G is a non-singular square submatrix of A. By a series of elementary row and column transformations (described by the matrices P and Q, respectively) we have
P AQ = [G H]
       [L M]
and rank(P AQ) = rank(A) = k. Consider now the invertible matrices P1 and Q1 defined by
P1 = [I      O]   and   Q1 = [I  −G−1 H]
     [−LG−1  I]              [O      I ].
Since P1 is lower triangular and Q1 is upper triangular, both having all diagonal elements equal to 1, both P1 and Q1 are invertible. Therefore, rank(P1 P AQQ1) = rank(A). On the other hand,
P1 P AQQ1 = [I      O] [G H] [I  −G−1 H]   [G       O      ]
            [−LG−1  I] [L M] [O      I ] = [O  M − LG−1 H].
Therefore, rank(A) = rank(G) + rank(M − LG−1 H) ≥ rank(G), so k is the maximal rank of a non-singular submatrix of A.

3.15 The Kronecker and Other Matrix Products
Definition 3.43. Let A ∈ Cm×n and B ∈ Cp×q be two matrices. The Kronecker product of these matrices is the matrix A ⊗ B ∈ Cmp×nq defined by
A ⊗ B = [a11 B  a12 B  · · ·  a1n B]
        [a21 B  a22 B  · · ·  a2n B]
        [ · · ·  · · ·  · · ·  · · ·]
        [am1 B  am2 B  · · ·  amn B].
The Kronecker product A ⊗ B creates mn copies of the matrix B and multiplies each copy by the corresponding element of A. Note that the Kronecker product can be defined between any two matrices regardless of their format. If x ∈ Cm and y ∈ Cn, we have
x ⊗ y = vec(y xᵀ),   (3.11)
and
x ⊗ yᵀ = x yᵀ = yᵀ ⊗ x.   (3.12)
Example 3.39. Consider the matrices
A = [a11 a12]   and   B = [b11 b12 b13]
    [a21 a22]             [b21 b22 b23]
                          [b31 b32 b33].
Their Kronecker product is
A ⊗ B = [a11 b11  a11 b12  a11 b13  a12 b11  a12 b12  a12 b13]
        [a11 b21  a11 b22  a11 b23  a12 b21  a12 b22  a12 b23]
        [a11 b31  a11 b32  a11 b33  a12 b31  a12 b32  a12 b33]
        [a21 b11  a21 b12  a21 b13  a22 b11  a22 b12  a22 b13]
        [a21 b21  a21 b22  a21 b23  a22 b21  a22 b22  a22 b23]
        [a21 b31  a21 b32  a21 b33  a22 b31  a22 b32  a22 b33].
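NumPy provides the Kronecker product directly as np.kron. The following sketch (our illustration, not the text's code) checks the block structure of Definition 3.43 on small random matrices:

import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-3, 4, size=(2, 2)).astype(float)
B = rng.integers(-3, 4, size=(3, 3)).astype(float)

K = np.kron(A, B)                                   # K is a 6 x 6 matrix
for i in range(2):
    for j in range(2):
        block = K[3 * i:3 * (i + 1), 3 * j:3 * (j + 1)]
        assert np.allclose(block, A[i, j] * B)      # the (i, j) block equals a_ij * B
print(K.shape)                                      # (6, 6)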
Let C ∈ Cmp×nq be the Kronecker product of the matrices A ∈ Cm×n and B ∈ Cp×q. We seek to express the value of cij, where 1 ≤ i ≤ mp and 1 ≤ j ≤ nq. It is easy to see that
cij = a_{⌈i/p⌉, ⌈j/q⌉} b_{i−p(⌈i/p⌉−1), j−q(⌈j/q⌉−1)}.   (3.13)
Conversely, we have
ars bvw = (A ⊗ B)_{p(r−1)+v, q(s−1)+w}   (3.14)
for 1 ≤ r ≤ m, 1 ≤ s ≤ n and 1 ≤ v ≤ p, 1 ≤ w ≤ q.
Theorem 3.54. The Kronecker product is associative. In other words, if A ∈ CK×L, B ∈ CM×N, and C ∈ CR×S, then (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C).
Proof. The product akℓ bmn is the entry ((k − 1)M + m, (ℓ − 1)N + n) of A ⊗ B. Therefore, the product (akℓ bmn)crs is the entry
(((k − 1)M + m − 1)R + r, ((ℓ − 1)N + n − 1)S + s)
of (A ⊗ B) ⊗ C. On the other hand, the product akℓ(bmn crs) is the entry of A ⊗ (B ⊗ C) that occupies the position
((k − 1)MR + (m − 1)R + r, (ℓ − 1)NS + (n − 1)S + s),
which is identical to (((k − 1)M + m − 1)R + r, ((ℓ − 1)N + n − 1)S + s). Thus, the product akℓ bmn crs occupies the same position in (A ⊗ B) ⊗ C and in A ⊗ (B ⊗ C). Therefore, (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C).
Theorem 3.55. Let A1 , . . . , An and B1 , . . . , Bn be matrices. Then, we have (A1 ⊗ B1 )(A2 ⊗ B2 ) · · · (An ⊗ Bn ) = (A1 A2 · · · An ) ⊗ (B1 B2 · · · Bn ). Proof.
Note that
(A ⊗ B)(C ⊗ D) = [a11 B  · · ·  a1p B] [c11 D  · · ·  c1q D]
                 [ · · ·  · · ·  · · ·] [ · · ·  · · ·  · · ·]
                 [am1 B  · · ·  amp B] [cp1 D  · · ·  cpq D]
               = [Σj a1j cj1 BD  · · ·  Σj a1j cjq BD]
                 [    · · ·      · · ·       · · ·   ]
                 [Σj amj cj1 BD  · · ·  Σj amj cjq BD],
which is the Kronecker product of the matrix with entries Σj aij cjk (that is, the matrix AC) and the matrix BD; in other words, (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD).
Repeated multiplication yields the desired equality.
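Theorem 3.55 (and Theorem 3.57 below on inverses) are easy to verify numerically. The sketch below assumes NumPy and random factors that are invertible with probability one:

import numpy as np

rng = np.random.default_rng(1)
A, C = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
B, D = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))

# mixed-product property: the Kronecker product of AC and BD
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# inverse of a Kronecker product: the Kronecker product of the inverses
assert np.allclose(np.linalg.inv(np.kron(A, B)),
                   np.kron(np.linalg.inv(A), np.linalg.inv(B)))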
The next theorem contains a few other elementary properties of Kronecker’s product.
Theorem 3.56. For any complex matrices A, B, C, D, we have the following:
(i) (A ⊗ B)ᵀ = Aᵀ ⊗ Bᵀ,
(ii) (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C),
(iii) A ⊗ B + A ⊗ C = A ⊗ (B + C),
(iv) A ⊗ D + B ⊗ D = (A + B) ⊗ D,
(v) the conjugate of A ⊗ B is the Kronecker product of the conjugates of A and B,
(vi) (A ⊗ B)H = AH ⊗ BH,
when the usual matrix sum and multiplication are well-defined in each of the above equalities.
Proof.
The proof is straightforward and is left to the reader.
Example 3.40. Let x ∈ Cn and y ∈ Cm. We have
x ⊗ y = [x1 y]                [y1 x]
        [ ·  ]   and  y ⊗ x = [ ·  ]
        [xn y]                [ym x].
Note that the Kronecker product is not commutative; in general, x ⊗ y ≠ y ⊗ x. For example, we have
[1]   [4]   [ 4]            [4]   [1]   [ 4]
[2] ⊗ [5] = [ 5]    but     [5] ⊗ [2] = [ 8]
[3]         [ 8]                  [3]   [12]
            [10]                        [ 5]
            [12]                        [10]
            [15]                        [15].
Example 3.41. Let D ∈ Cp×p. The Kronecker product C = Im ⊗ D is given by
cij = I_{⌈i/p⌉, ⌈j/p⌉} d_{i−p(⌈i/p⌉−1), j−p(⌈j/p⌉−1)}
    = d_{i−p(k−1), j−p(k−1)}  if ⌈i/p⌉ = ⌈j/p⌉ = k,
      0                       otherwise,
for 1 ≤ i, j ≤ mp. Let now E ∈ Cm×m. The Kronecker product L = E ⊗ Ip is given by
lij = e_{⌈i/p⌉, ⌈j/p⌉}  if p⌈i/p⌉ − i = p⌈j/p⌉ − j,
      0                 otherwise,
for 1 ≤ i, j ≤ mp.
Theorem 3.57. If A ∈ Cn×n and B ∈ Cm×m are two invertible matrices, then A ⊗ B is invertible and (A ⊗ B)−1 = A−1 ⊗ B −1 . Proof.
Since (A ⊗ B)(A−1 ⊗ B −1 ) = (AA−1 ⊗ BB −1 ) = In ⊗ Im ,
the theorem follows by observing that In ⊗ Im = Inm .
In a sequence of Kronecker products of column and row vectors, a column vector can be permuted with a row vector without altering the final result. This is formalized in the next statement.
Theorem 3.58. Let u ∈ Cm and let v ∈ Cn. We have u ⊗ vᵀ = vᵀ ⊗ u.
Proof.
Suppose that u = (u1, . . . , um)ᵀ and v = (v1, . . . , vn)ᵀ. We have
u ⊗ vᵀ = [u1 vᵀ]   [u1 v1  · · ·  u1 vn]
         [  ·   ] = [  ·    · · ·    ·  ]
         [um vᵀ]   [um v1  · · ·  um vn].
Also,
vᵀ ⊗ u = (v1 u  · · ·  vn u) = [v1 u1  · · ·  vn u1]
                               [  ·    · · ·    ·  ]
                               [v1 um  · · ·  vn um],
which establishes our equality.
Example 3.42. Suppose that u1 ∈ Cm, u2 ∈ Cp, and v ∈ Cn. Then, we have
u1 ⊗ u2 ⊗ vᵀ = u1 ⊗ vᵀ ⊗ u2 = vᵀ ⊗ u1 ⊗ u2.
Note that such transformations can be applied to Kronecker products of column and row vectors provided we do not change the order of the column and the row vectors.
Theorem 3.59. Let A ∈ Cn×n and B ∈ Cm×m be two normal (unitary) matrices. Their Kronecker product A ⊗ B is also a normal (a unitary) matrix. Proof.
By Theorem 3.56, we can write
(A ⊗ B)H (A ⊗ B) = (AH ⊗ BH)(A ⊗ B) = (AH A ⊗ BH B) = (AAH ⊗ BBH) (because both A and B are normal) = (A ⊗ B)(A ⊗ B)H, which implies that A ⊗ B is normal.
Definition 3.44. Let A ∈ Cm×m and B ∈ Cn×n be two square matrices. Their Kronecker sum is the matrix A ⊕ B ∈ Cmn×mn defined by
A ⊕ B = (A ⊗ In) + (Im ⊗ B).
The Kronecker difference is the matrix A ⊖ B ∈ Cmn×mn defined by
A ⊖ B = (A ⊗ In) − (Im ⊗ B).
The element (A ⊕ B)ij is given by (A ⊕ B)ij = (A ⊗ In)ij + (Im ⊗ B)ij. Thus, A ⊕ B can be computed by applying the formulas developed in Example 3.41:
(A ⊗ In)ij = a_{⌈i/n⌉, ⌈j/n⌉}  if n⌈i/n⌉ − i = n⌈j/n⌉ − j,
             0                  otherwise,
(Im ⊗ B)ij = b_{i−n(k−1), j−n(k−1)}  if ⌈i/n⌉ = ⌈j/n⌉ = k,
             0                        otherwise,
for 1 ≤ i, j ≤ mn. A similar fact can be proven about the Kronecker difference by replacing B by −B in the formula involving the Kronecker sum.
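The Kronecker sum can be built directly from the Kronecker product. A minimal sketch (assuming NumPy; the function name is ours):

import numpy as np

def kronecker_sum(A, B):
    """Kronecker sum of A (m x m) and B (n x n): A kron I_n + I_m kron B."""
    m, n = A.shape[0], B.shape[0]
    return np.kron(A, np.eye(n)) + np.kron(np.eye(m), B)

A = np.array([[1.0, 2.0], [0.0, 3.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
S = kronecker_sum(A, B)
print(S.shape)       # (4, 4)
print(np.trace(S))   # n * trace(A) + m * trace(B) = 8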
Definition 3.45. Let A, B ∈ Cm×n. The Hadamard product of A and B is the matrix A ⊙ B ∈ Cm×n defined by
A ⊙ B = [a11 b11  a12 b12  · · ·  a1n b1n]
        [a21 b21  a22 b22  · · ·  a2n b2n]
        [  · · ·    · · ·  · · ·    · · ·]
        [am1 bm1  am2 bm2  · · ·  amn bmn].
The Hadamard quotient A ⊘ B is defined only if bij ≠ 0 for 1 ≤ i ≤ m and 1 ≤ j ≤ n. In this case,
A ⊘ B = [a11/b11  a12/b12  · · ·  a1n/b1n]
        [a21/b21  a22/b22  · · ·  a2n/b2n]
        [  · · ·    · · ·  · · ·    · · ·]
        [am1/bm1  am2/bm2  · · ·  amn/bmn].
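In NumPy the Hadamard product and quotient are simply the elementwise array operations; a short sketch (ours, not the text's):

import numpy as np

A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
B = np.array([[2.0, 2.0, 2.0], [1.0, 5.0, 3.0]])

hadamard_product = A * B      # elementwise product
hadamard_quotient = A / B     # defined because no entry of B is zero
print(hadamard_product)
print(hadamard_quotient)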
Theorem 3.60. If A, B, C ∈ Cm×n and c ∈ C, we have
(i) A ⊙ B = B ⊙ A;
(ii) A ⊙ Jm,n = Jm,n ⊙ A = A;
(iii) A ⊙ (B + C) = A ⊙ B + A ⊙ C;
(iv) A ⊙ (cB) = c(A ⊙ B).
Proof.
The proof is straightforward and is left to the reader.
Note that the Hadamard product of two matrices A, B ∈ Cm×n is a submatrix of the Kronecker product A ⊗ B.
Example 3.43. Let A, B ∈ C2×3 be the matrices
A = [a11 a12 a13]   and   B = [b11 b12 b13]
    [a21 a22 a23]             [b21 b22 b23].
The Kronecker product of these matrices is A ⊗ B ∈ C4×9 given by
A ⊗ B =
[a11 b11  a11 b12  a11 b13  a12 b11  a12 b12  a12 b13  a13 b11  a13 b12  a13 b13]
[a11 b21  a11 b22  a11 b23  a12 b21  a12 b22  a12 b23  a13 b21  a13 b22  a13 b23]
[a21 b11  a21 b12  a21 b13  a22 b11  a22 b12  a22 b13  a23 b11  a23 b12  a23 b13]
[a21 b21  a21 b22  a21 b23  a22 b21  a22 b22  a22 b23  a23 b21  a23 b22  a23 b23].
The Hadamard product of the same matrices is
A ⊙ B = [a11 b11  a12 b12  a13 b13]
        [a21 b21  a22 b22  a23 b23],
and we can regard the Hadamard product as a submatrix of the Kronecker product A ⊗ B, namely the submatrix of A ⊗ B determined by the rows 1, 4 and the columns 1, 5, 9.
Another matrix product involves matrices that have the same number of columns.
Definition 3.46. Let A ∈ Cm×n and B ∈ Cp×n be two matrices that have the same number n of columns, A = (a1 · · · an) and B = (b1 · · · bn). The Khatri–Rao product of A and B (or the column-wise Kronecker product) is the matrix
A ∗ B = (a1 ⊗ b1  a2 ⊗ b2  · · ·  an ⊗ bn).
Example 3.44. The Khatri–Rao product of the matrices
A = [1 2 3]   and   B = [ 1 0 2]
    [4 5 6]             [ 2 1 3]
                        [−1 2 1]
is the matrix (a1 ⊗ b1  a2 ⊗ b2  a3 ⊗ b3), which equals
[ 1  0  6]
[ 2  2  9]
[−1  4  3]
[ 4  0 12]
[ 8  5 18]
[−4 10  6].
Note that for any vectors a, b, we have a ⊗ b = a ∗ b.
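The column-wise Kronecker product is a one-line NumPy construction. The sketch below reproduces Example 3.44 (the function name is ours):

import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A and B (same number of columns)."""
    return np.column_stack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])])

A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
B = np.array([[1.0, 0.0, 2.0], [2.0, 1.0, 3.0], [-1.0, 2.0, 1.0]])
print(khatri_rao(A, B))   # the 6 x 3 matrix computed in Example 3.44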
3.16 Outer Products
Definition 3.47. Let u ∈ Cm and v ∈ Cn. The outer product of the vectors u and v is the matrix u ◦ v ∈ Cm×n defined by u ◦ v = uvH.
As we saw in Example 3.28, the outer product of two vectors is a matrix of rank 1. For u ∈ Cm and v ∈ Cn, we have v ◦ u = vuH = (uvH)H = (u ◦ v)H. Therefore, the outer product is not commutative because for u ∈ Cm and v ∈ Cn, we have u ◦ v ∈ Cm×n and v ◦ u ∈ Cn×m. Note that when m = n, we have vH u = trace(u ◦ v).
Example 3.45. Let
u = [u1]             v = [v1]
    [u2]   and           [v2].
    [u3]
We have
u ◦ v = [u1 v1  u1 v2]                 [v1 u1  v1 u2  v1 u3]
        [u2 v1  u2 v2]   and   v ◦ u = [v2 u1  v2 u2  v2 u3].
        [u3 v1  u3 v2]
Contrast this with the Kronecker products:
u ⊗ v = [u1 v1]                 [v1 u1]
        [u1 v2]                 [v1 u2]
        [u2 v1]   and   v ⊗ u = [v1 u3]
        [u2 v2]                 [v2 u1]
        [u3 v1]                 [v2 u2]
        [u3 v2]                 [v2 u3].
Note that the entries of the Kronecker product u ⊗ v can be obtained by reading the entries of u ◦ v row-wise.
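In NumPy the outer product u vH is np.outer(u, v.conj()). The following sketch (ours; v is taken real so that conjugation plays no role in the last check) verifies the rank, the trace identity, and the row-wise relation to u ⊗ v noted above:

import numpy as np

u = np.array([1.0 + 1.0j, 2.0, -1.0j])
v = np.array([2.0, 1.0, 3.0])

outer = np.outer(u, v.conj())                          # u o v = u v^H, a rank-1 matrix
assert np.linalg.matrix_rank(outer) == 1
assert np.isclose(np.trace(outer), v.conj() @ u)       # trace(u o v) = v^H u
assert np.allclose(np.kron(u, v), outer.reshape(-1))   # reading u o v row-wise gives u kron v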
3.17 Associative Algebras
Definition 3.48. An F-associative algebra is a pair (V, m), where V is an F-linear space and m : V × V −→ V is a bilinear mapping that is associative (which means that m(x, m(y, z)) = m(m(x, y), z) for all x, y, z ∈ V ).
If there exists u ∈ V such that m(v, u) = v for every v ∈ V, then u is referred to as the unit element and (V, m) is a unital associative algebra. The dimension of an F-associative algebra (V, m) is dim(V). We denote m(v1, v2) as v1 v2. In a unital associative algebra there exists a unique unit element.
Example 3.46. The set Rn×n of square matrices with real elements is a unital associative algebra, where m(A, B) = AB. The unit element is the matrix In.
Example 3.47. Let Hom(V, V) be the linear space of homomorphisms of an F-linear space V. For h, k ∈ Hom(V, V), define m(h, k) as the composition of mappings hk. It is easy to verify that (Hom(V, V), m) is a unital associative algebra having 1V as its unit element.
Definition 3.49. An associative algebra morphism between the associative algebras (V, m) and (W, m′) is a linear mapping h : V −→ W such that h(m(u, v)) = m′(h(u), h(v)) for every u, v ∈ V. If (V, m) and (W, m′) have unit elements u and u′, respectively, then h(u) = u′. If h is a bijective morphism, we refer to h as an isomorphism of associative algebras.
Let B = {e1, . . . , en} be a basis in V, where (V, m) is an associative algebra. If x, y ∈ V, x = Σ_{i=1}^{n} a^i ei, and y = Σ_{j=1}^{n} b^j ej, then
xy = Σ_{i=1}^{n} Σ_{j=1}^{n} a^i b^j ei ej.
Since ei ej ∈ V, it is possible to write
ei ej = Σ_{k=1}^{n} γ_{ij}^k ek,
where the γ_{ij}^k are n³ scalars in F. We refer to these numbers as the structural coefficients of the associative algebra (V, m).
Theorem 3.61. Let γ_{ij}^k be the structural coefficients of the associative algebra (V, m) having the basis {e1, . . . , en}. These coefficients satisfy the following n⁴ equalities:
Σ_q γ_{jk}^q γ_{iq}^r = Σ_p γ_{ij}^p γ_{pk}^r
for i, j, k, r between 1 and n.
Proof. Since ei ej = Σ_p γ_{ij}^p ep and ej ek = Σ_q γ_{jk}^q eq, we can write
ei (ej ek) = Σ_q γ_{jk}^q ei eq = Σ_q γ_{jk}^q Σ_r γ_{iq}^r er = Σ_r (Σ_q γ_{jk}^q γ_{iq}^r) er,
(ei ej) ek = Σ_p γ_{ij}^p ep ek = Σ_p γ_{ij}^p Σ_r γ_{pk}^r er = Σ_r (Σ_p γ_{ij}^p γ_{pk}^r) er.
The associativity property implies the equalities
Σ_q γ_{jk}^q γ_{iq}^r = Σ_p γ_{ij}^p γ_{pk}^r
for i, j, k, r between 1 and n.
An associative algebra (V, m) over a finite-dimensional linear space V is completely defined by its multiplicative table. Starting from a basis {e1, . . . , en}, place at the intersection of the ith row and the jth column the expansion of the product ei ej in terms of the basis; that is, the (i, j) entry of the table is
ei ej = Σ_k γ_{ij}^k ek
for 1 ≤ i, j ≤ n.
Example 3.48. Let V be a four-dimensional real linear space having the basis {1, i, j, k} whose elements have the form v = a1 + bi + cj + dk, where a, b, c, d ∈ R. The multiplication table of this algebra is given by 1 i j k
1 1 i j k
i i −1 −k j
j j k −1 −i
k k −j i −1
If q ∈ V , we can write q = a1 + bi + cj + dk for some a, b, c, d ∈
R. Such an element is said to be a quaternion. Its conjugate is the
quaternion q̄ = a1 − bi − cj − dk. It is easy to see that qq̄ = (a² + b² + c² + d²)1.
Definition 3.50. A subalgebra of an associative algebra (V, m) is a subspace U of V such that m(u1, u2) ∈ U for every u1, u2 ∈ U.
If U is a subalgebra of (V, m) and m′ is the restriction of m to U, then (U, m′) can be regarded as an associative algebra. The intersection of any family of subalgebras of an associative algebra is a subalgebra.
An ideal of an associative algebra (V, m) is a subspace I of V such that m(v, i) ∈ I and m(i, v) ∈ I for every v ∈ V and i ∈ I.
Let {Ij | j ∈ J} be a family of ideals of an associative algebra (V, m). It is easy to see that ⋂_{j∈J} Ij is an ideal of (V, m). Moreover, if S ⊆ V, the intersection of all ideals that contain S is again an ideal denoted by IS.
If I is an ideal of an associative algebra (V, m), we can consider the quotient linear space V/I. Define the multiplication m̄ on V/I as m̄([x], [y]) = [m(x, y)]. Note that m̄ is well-defined because if x̃ ∈ [x] and ỹ ∈ [y], we have x̃ − x ∈ I and ỹ − y ∈ I. In other words, there exist u, v ∈ I such that x̃ = x + u and ỹ = y + v. This implies
m(x̃, ỹ) = m(x + u, y + v) = m(x, y) + m(x, v) + m(u, y) + m(u, v) = m(x, y) + z,
where z ∈ I. Therefore, [m(x̃, ỹ)] = [m(x, y)] and
m̄([x̃], [ỹ]) = [m(x̃, ỹ)] = [m(x, y)],
which shows that m̄ is well-defined. The associative algebra (V/I, m̄) is the factor algebra of V by I.
Definition 3.51. A graded linear space is a vector space V that can be written as a direct sum of the form V = ⊕_{n∈N} Vn, where each Vn is a vector space. The linear space Vn is the set of elements of degree n.
Example 3.49. Let R[x, y] be the linear space of polynomials with real coefficients in the indeterminates x and y. If R[x, y]n is the set of all homogeneous polynomials of degree n in x and y, then R[x, y] is a graded linear space.
Generalizing Example 3.49, if V is a direct sum, V = ⊕_{n∈N} Vn, we refer to members of the summand Vn as homogeneous elements of degree n. If v = Σ_{n∈N} vn, where vn ∈ Vn, then vn is the homogeneous component of v of degree n. A graded associative algebra is a graded vector space V = ⊕_{n∈N} Vn such that m(Vp, Vq) ⊆ Vp+q for p, q ∈ N. A subalgebra W of the graded associative algebra V is a graded subalgebra if W = ⊕_{n∈N} (W ∩ Vn).
Exercises and Supplements
(1) Let A ∈ Cm×n and B ∈ Cn×p be two conformant matrices. Prove that:
(a) computing the matrix G = AB using standard matrix multiplication requires mnp number multiplications;
(b) if C ∈ Cp×q and we compute the matrix D = (AB)C = A(BC) by the standard method, the first modality, D = (AB)C, requires mp(n + q) multiplications, while the second, D = A(BC), requires nq(m + p) multiplications.
(2) Let {i1, . . . , ik} be a subset of {1, . . . , n} and let {j1, . . . , jq} = {1, . . . , n} − {i1, . . . , ik}, where j1 < · · · < jq and k + q = n.
i1 · · · ik be a Let A ∈ be a matrix and let B = A i1 · · · ik principal submatrix of A. Prove that if y ∈ Ck , then y H By = xH Ax, where yi if i = jr , xr = 0 otherwise Cn×n
for 1 r n. (3) Let X = (x1 · · · xn ) be a matrix in Cn×n and let C = diag(c1 , . . . , cn ) ∈ Cn×n . Prove that XC = c1 x1 e1 + · · · + c1 xn en . (4) Let X = (x1 · · · xn ) and Y = (y 1 · · · y n ) be two matrices in Cn×n , and let C = diag(c1 , . . . , cn ), D = diag(d1 , . . . , dn ). Prove that XCY D =
n
n
ci dj yji xi ej ,
i=1 j=1
and that trace(XCY D) = ni=1 nj=1 ci dj yji xji . (5) Let a1 , . . . , an be n complex numbers. Prove that n
diag(1, . . . , 1, ai , 1, . . . , 1) = diag(a1 , . . . , an ).
i=1
(6) Let D = diag(d1 , . . . , dn ) ∈ Cn×n and let A ∈ Cn×n . Prove that (DAD)ij = di aij dj for 1 i, j n. (7) Let S n×n be the set of n×n symmetric matrices in Rn×n . Prove . that S n×n is a subspace of Rn×n and dim(S n×n ) = n(n+1) 2 n×n n×n (8) Let A ∈ R be a symmetric be a matrix and let B ∈ R matrix such that x Bx = ni=1 nj=1 aij (xi − xj )2 for every x ∈ Rn . Prove that: (a) for 1 k n, we have bkk = 2 {aik | 1 i n and i = k}; n n (b) i=1 j=1 bij = 0. (9) Let ψ ∈ PERMn and let A be a square matrix, A ∈ Cn×n . Prove that (Pψ A)ij = aψ(i)j and (APψ )ij = aiψ−1 (j) for 1 i, j n.
Let φ ∈ PERMn be a permutation and let A ∈ Cn×n be a square matrix. A φ-diagonal of A is a set of n elements Dφ (A) = {a1φ(1) , a2φ(2) , . . . , anφ(n) }. A permutation diagonal of A is a φdiagonal of A for some φ ∈ PERMn . (10) Let A ∈ Cn×n be a square matrix and let ψ ∈ PERMn be a permutation. Prove that if Dφ (A) is a permutation diagonal of A, the same set is a permutation diagonal Dψ−1 φ (B) of B = APψ . n Solution: Note that (APψ )ij = k=1 aik pkj = aiψ−1 (j) . Therefore, we have Dφ (APψ ) = {(APψ )1φ(1) , . . . , (APψ )nφ(n) } = {a1ψ−1 (φ(1)) , . . . , anψ−1 (φ(n)) }, which allows us to conclude that Dφ (A) coincides with the permutation diagonal Dψ−1 φ (B) of B = APψ . (11) Let A ∈ Cn×n be an upper-Hessenberg matrix and let U ∈ Cn×n be an upper-triangular matrix. Prove that both AU and U A are upper-Hessenberg matrices. Solution: Let B = AU . For i > j + 1, we have bij =
bij = Σ_{k=1}^{n} aik ukj = Σ_{k=1}^{i−2} aik ukj + Σ_{k=i−1}^{n} aik ukj.
Since A is upper-Hessenberg, we have Σ_{k=1}^{i−2} aik ukj = 0; since U is upper triangular, we have Σ_{k=i−1}^{n} aik ukj = 0 because j < i − 1. Thus, bij = 0, so B is indeed upper-Hessenberg. The argument for U A is similar.
(12) Let A, B ∈ Cn×n be two Hermitian matrices. Prove that AB is a Hermitian matrix if and only if AB = BA.
Solution: Suppose that AB = BA. Then, (AB)H = (BA)H = AH BH = AB, hence AB is a Hermitian matrix. Conversely, if AB is Hermitian, we have AB = (AB)H = BH AH = BA.
(13) Let A, B be two matrices in Cn×n. Suppose that B = C + D, where C is a Hermitian matrix and D is a skew-Hermitian matrix. Prove that if A is Hermitian and AB = BA, then AC = CA and AD = DA.
(14) Prove that if A = diag(A1 , . . . , Ak ) is a block-diagonal matrix, then A is Hermitian if and only if every block Ai is Hermitian. (15) Let A ∈ Cn×n be a matrix. Prove that each φ-diagonal of A contains a 0 if and only if A contains a zero submatrix B ∈ Cp×q such that p + q = n + 1. This fact is known as the Frobenius– K¨ onig Theorem. Solution: Suppose that A contains the submatrix
i1 , . . . , ip = Op,q , B=A j1 , . . . , jq where p + q = n + 1 and Dφ = {a1φ(1) , a2φ(2) , . . . , anφ(n) } = {aφ−1 (1)1 , . . . , . . . , aφ−1 (n)n } is a φ-diagonal without zero entries. Then none of the components aφ−1 (j1 )j1 , . . . , . . . , aφ−1 (jq )jq are 0, so they must be located in the rows {1, . . . , n} − {i1 , . . . , ip }. Therefore, q + p n and this contradicts the fact that p + q = n + 1. Conversely, suppose that every diagonal of A contains a 0. We show by induction on n 1 that A contains a zero submatrix. The base case, n = 1 is immediate. Suppose that the implication holds for matrices of size less than n. If A = On,n , the reverse implication is trivial. Therefore, we can assume that A contains a non-zero component. Without loss of generality we may assume that ann = 0. Every diagonal of the submatrix
1, · · · , n − 1 S=A 1, . . . , n − 1 contains a 0 and, therefore, by inductive hypothesis, S contains a zero submatrix of format r × s, where r + s = n. It is obvious that this submatrix of S is also a zero-submatrix of A. There exist two permutations σ, τ ∈ PERMn such that Pσ APτ =
C Or,s , D E
where C ∈ Cr×r and E ∈ Cs×s . Let η ∈ PERMn . An η-diagonal of A has the form Dη (A) = {a1η(1) , . . . , anη(n) }. Since Pσ permutes the rows of A and Pτ permutes the columns of A, the set Dη corresponds to the set {aσ(1)τ (η(1)) , . . . , aσ(n)τ (η(n)) } = {a1σ−1 (τ (η(1))) , . . . , anσ−1 (τ (η(n))) }, which is a diagonal set of Pσ APτ . Thus, the collection of diagonal sets of A and Pσ APτ are the same. If all elements of the diagonal set of Pσ APτ that correspond to C are non-zero, then the remaining elements of a diagonal are located in E and must contain a 0. Therefore, if all elements of a diagonal of C are non-zero, then each a diagonal of E must contain a 0. Thus, either all diagonals of C contain a 0, or all diagonals of E contain a 0. In the first case, by inductive hypothesis, C contains a k ×h zero submatrix with k +h = r +1. This implies that the first r rows of Pσ APτ contain a k × (h + s) zero submatrix and k + h + s = r + 1 + s = n + 1. Similarly, in the second case, E contains a p × q zero-matrix with p + q = s + 1. Therefore, the last s columns of Pσ APτ contain a zero submatrix of format (r + p) × q and r + p + q = r + s + 1 = n + 1. (16) Prove that row exchanges for matrices in Rm×n can be expressed as a sequence of row multiplications and additions of a row multiplied by a constant to another row by showing that T (i)↔(j) = T (i)−(j) T −(j) T (j)−(i) T ((i)+(j) for 1 i, j m. (17) Let A ∈ Cm×n be a matrix. If v = vec(A) ∈ Cmn , prove that vj = aj−m j−1 , j m
m
for 1 j mn. (18) Prove that the matrices A, B given by 1 0 1 1 A= and B = 0 1 0 1 are not similar.
(19) Let F be the set of functions f : C − {−d/c} −→ C given by
f(z) = (az + b)/(cz + d).
Denote by Mf the matrix
Mf = [a b]
     [c d].
Prove that (a) if f, g ∈ F, then Mf g = Mf Mg; (b) f(f(z)) − (a + d)f(z) + ad − bc = 0 for every z ∈ C.
(20) Recall that Jn,n ∈ Rn×n is the complete n × n matrix, that is, the matrix having all components equal to 1. Prove that for every number m ∈ N and m ≥ 1, we have
J^m_{n,n} = n^{m−1} Jn,n.
(21) Let A, B ∈ Cn×n be two matrices. Prove that if AB = BA, then we have the following equality known as Newton's binomial:
(A + B)^n = Σ_{k=0}^{n} (n choose k) A^{n−k} B^k.
Give an example of matrices A, B ∈ C2×2 such that AB ≠ BA for which the above formula does not hold.
(22) Let A ∈ Cn×n be a circulant matrix whose first row is (c1, . . . , cn) and let P ∈ Cn×n be the circulant matrix whose first row is (0, 1, 0, . . . , 0). Prove that A = c1 P^0 + c2 P^1 + · · · + cn P^{n−1}.
(23) Let A ∈ Rn×n and B ∈ Rn×n be two matrices. Prove that if AB is invertible, then both A and B are invertible.
Solution: Since AB is invertible, we have AB(AB)−1 = In. Thus, A is invertible and A−1 = B(AB)−1. Similarly, since (AB)−1 AB = In, it follows that B is invertible and B−1 = (AB)−1 A.
(24) Let c ∈ (0, 1] and let A ∈ Rn×n be the matrix A = cIn + (1 − c)Jn. Prove that A is a non-singular matrix.
Solution: Suppose that x ∈ Rn is a vector such that Ax = 0n . This amounts to cx + (1 − c)Jn x = cx + (1 − c)ξ1n , where ξ = ni=1 xi . Thus, x = − (1−c)ξ c 1n , which means that all components of x must equal − (1−c)ξ c . This implies ξ = (1−c)ξ −n c , so ξ = 0 which implies x = 0n . Thus, A is nonsingular. (25) Let A, B ∈ Cn×n be two invertible matrices. Prove that B −1 − A−1 = B −1 (A − B)A−1 . Conclude that rank(A − B) = rank(B −1 − A−1 ). (26) Let A ∈ Rn×n be a non-singular matrix. Prove that the set {x1 , . . . , xm } ⊆ Rn is a linearly independent set if and only if {Ax1 , . . . , Axm } is a linearly independent set. (27) Prove that the matrix T (i)+a(j) ∈ Rn×n can be expressed as T (i)+a(j) = In + aei ej . (28) Let U=
Ok,k · · · On−k,k W
be a matrix in Cn×n , where W is an upper triangular matrix, and let Y be an upper triangular matrix such that yk+1,k+1 = 0. Prove that (U Y )ij = O for 1 i, j k + 1. Solution: Note that Y can be written as Y12 Y11 , Y = On−k,k Y22 where Y11 and Y22 are upper triangular matrices. Thus, Ok,k BY22 . UY = On−k,k W Y22 Thus, the first k columns of the product U Y consist only of Os.
The k + 1st column of U Y is ⎛
⎞ y1 k+1 ⎜ . ⎟ ⎜ .. ⎟ ⎟ ⎜ ⎟ ⎜ ⎜yk k+1 ⎟ ⎟ ⎜ ⎟ U⎜ ⎜ 0 ⎟ ⎜ 0 ⎟ ⎟ ⎜ ⎟ ⎜ ⎜ .. ⎟ ⎝ . ⎠ 0
because Y is an upper triangular, matrix. If p < k, then (U Y )p k+1 = 0 because the first k columns of U contain zeros. (29) Let m, n be two positive natural numbers such that m = pq and n = rs, where p, q, r, s ∈ N. Prove that there exists a bijection between the set Fm×n and the set (Fp×r )q×s . Interpret the existence of this bijection in terms of matrices. (30) Let A ∈ Cn×n be a strictly upper triangular matrix. Prove that A is nilpotent. Solution: Since A is strictly upper triangular, we have aij = 0 for 1 j i and every i, 1 i n. We prove by induction on (m) (m) m that for Am = (aij ), we have aij = 0 for 1 j i+m−1 and every i, 1 i n. The base case m = 1 is immediate. Suppose that the statement (m+1) = 0 for 1 j i + m. We holds for m. We show that aij have (m+1)
aij
=
n
(m)
aik akj
k=1
=
n
(m)
aik akj = 0
k=i+m
because akj = 0 when j i + m k. Thus, An = 0 and A is nilpotent. (31) Let A ∈ Cn×n be a nilpotent matrix. Prove that rank(A) n nilp(A)−1 nilp(A) .
A Hadamard matrix is a matrix H ∈ Rn×n such that hij ∈ {−1, 1} for 1 ≤ i, j ≤ n and H Hᵀ = nIn.
(32) Verify that the matrices
[1  1]
[1 −1]
and
[1  1  1  1]
[1  1 −1 −1]
[1 −1 −1  1]
[1 −1  1 −1]
are Hadamard matrices.
(33) Let H = (h1 · · · hj · · · hn) ∈ Rn×n be a Hadamard matrix. Prove that the matrix (h1 · · · −hj · · · hn) is also a Hadamard matrix.
(34) Let A = (aij) be an (m × n)-matrix of real numbers. Prove that
max_j min_i aij ≤ min_i max_j aij
(the minimax inequality). Solution: Observe that aij0 maxj aij for every i and j0 , so mini aij0 mini maxj aij , again for every j0 . Thus, maxj mini aij mini maxj aij . (35) Let A ∈ Rn×n and let ΦA : Rn×n −→ Rn× be the function defined by ΦA (X) = AX − XA, for X ∈ Rn×n . Prove that: (a) ΦA is linear, that is, for any a, b ∈ R and X, Y ∈ Rn×n , we have ΦA (aX + bY ) = aΦA (X) + bΦA (Y ); (b) ΦA (XY ) = ΦA (X)Y + XΦA (Y ); (c) if X is an invertible matrix, then ΦA (X) = −XΦA (X −1 )X; (d) ΦA (ΦB (X)) − ΦB (ΦA (X)) = ΦAB (X) − ΦBA (X); (e) trace(ΦA (X)) = 0 for every X ∈ Rn×n .
(36) Let A be a matrix in Cm×n such that rank(A) = r. Prove that A can be factored as A = P C, where P ∈ Cm×r , C ∈ Cr×n , and rank(P ) = r and as A = DQ, where D ∈ Cm×r , Q ∈ Cr×n and rank(Q) = r. Solution: Let p1 , . . . , pr be a basis for range(A), where A = (a1 · · · an ). Every column ai of A can be written as a linear combination ai = ci1 p1 + · · · + cir pr for 1 i n. In matrix form these equalities amount to A = P C. The argument for the second part is similar and involves the rows of A. (37) Let {Ai ∈ Rm×ni | 1 i k} be a set of matrices and let = ki=1 ni . Prove that the subspaces {range(Ai ) | 1 i k} are linearly independent if and only if rank(A1 |A2 | · · · |Ak ) =
k
rank(Ai ).
i=1
= dim(range(Ai )) and Hint: Note that rank(Ai ) rank(A1 |A2 | · · · |Ak ) = dim(range(A1 ) + · · · + range(Ak )). (38) Let π = {B1 , . . . , Bk } be a partition of a set S = {s1 , . . . , sn }. Define the characteristic matrix of π, B ∈ Rn×k as 1 if si ∈ Bj bij = 0 otherwise for 1 i n and 1 j k. Prove that B B is a diagonal matrix, (B B)jj = |Bj | for 1 j k and that B B is invertible. (39) Let A and B be two matrices in Cp×q . Prove that rank(A+B) rank(A B) rank(A) + rank(B). Solution: We have rank(A B) = rank(A A+B) rank(A+B) because adding the first q columns of the matrix (A B) to the last q columns does not change the rank of a matrix, and the rank of a submatrix is not larger than the rank of the matrix. On the other hand, rank(A B) = rank((A Op,q ) + (Op,q B)) rank(A Op,q ) + rank(Op,q B) = rank(A) + rank(B). (40) If A ∈ Cp×q and B ∈ Cq×r , then prove that rank(AB) min{rank(A), rank(B)}.
(41) Let A ∈ Rn×k be a matrix, where n k. If A A = Ik , prove that rank(AA ) = rank(A ) = rank(A) k. Solution: By Sylvester’s rank theorem, we have k = rank(A A) = rank(A) − dim(null(A ) ∩ range(A)), rank(AA ) = rank(A ) − dim(null(A) ∩ range(A )). The first equality implies rank(A) k. Let t ∈ null(A) ∩ range(A ). We have At = 0 and t = A z for some z ∈ Rn . Therefore, t t = t A z = (At) z = 0, so t = 0k . Thus, dim(null(A)∩range(A )) = 0, so rank(AA ) = rank(A ) = rank(A) k. (42) Let A ∈ R3×3 be a matrix such that A2 = O3,3 . Prove that rank(A) 1. (43) Let Z ∈ Rn×n , W ∈ Rm×m and let U, V ∈ Rn×m be four matrices such that each of the matrices W , Z, Z + U W V , and W −1 + U W V has an inverse. Prove that (Z + U W V )−1 = Z −1 − Z −1 U (W −1 + V Z −1 U )−1 V Z −1 (the Woodbury–Sherman–Morrison identity). Solution: Consider the following system of matrix equations: ZX + U Y = In , V X − W −1 Y = Om,n , where X ∈ Rn×n and Y ∈ Rm×n . The second equation implies V X = W −1 Y , so U W V X = U Y . Substituting U Y in the first equation yields ZX + U W V X = In . Therefore, we have (Z + U W V )X = In , which implies X = (Z + U W V )−1 .
(3.15)
On the other hand, we have X = Z −1 (In − U Y ) from the first equation. Substituting X in the second equation yields V Z −1 (In − U Y ) = W −1 Y,
which is equivalent to V Z −1 = +W −1 Y + V Z −1 U Y = (W −1 + V Z −1 U )Y. Thus, we have Y = (W −1 + V Z −1 U )−1 V Z −1 . Substituting the values of X and Y in the first equality implies ZX + U (W −1 + V Z −1 U )−1 V Z −1 = In . Therefore, ZX = In − U (W −1 + V Z −1 U )−1 V Z −1 ,
(3.16)
which implies X = Z −1 − Z −1 U (W −1 + V Z −1 U )−1 V Z −1 .
(3.17)
The Woodbury–Sherman–Morrison identity follows immediately from Equalities (3.15) and (3.17). (44) None of the following matrices A, B, C are invertible: 0 0 1 0 1 1 A= , B= , C= . 0 0 1 1 1 1 However, their pseudoinverses exist and are given by 1 1 1 1 0 0 , B † = 2 2 , C † = 41 14 , A† = 0 0 0 0 4 4 respectively. (45) Let A ∈ Rm×n be a matrix whose columns form an orthonormal set (that is, A A = In ). Prove that the Moore–Penrose pseudoinverse of A is A† = A . Solution: Let M = A . Note that M A = A A = In and AM = AA . Clearly, M A is a symmetric matrix and so is AM . Furthermore, we have AM A = A(A A) = A and M AM = A AA = A = M , so A† = A .
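A quick numerical check of Supplement 45 is possible with NumPy (a sketch of ours; np.linalg.qr is used only to manufacture a matrix with orthonormal columns):

import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 3))
Q, _ = np.linalg.qr(M)                        # Q has orthonormal columns, so Q'Q = I_3
assert np.allclose(Q.T @ Q, np.eye(3))
assert np.allclose(np.linalg.pinv(Q), Q.T)    # the Moore-Penrose pseudoinverse equals Q'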
(46) Let A ∈ Rn×n be an invertible matrix, and let c, d ∈ Rn be two vectors such that d A−1 c = −1. Prove that (A + cd )−1 = A−1 −
1 A−1 cd A−1 . 1 + d A−1 c
Hint: Apply the Woodbury–Sherman–Morrison identity. (47) Let M ∈ Rn×n be a partitioned matrix, M=
A B , C D
where A ∈ Rm×m and m < n. Prove that, if all inverse matrices mentioned in what follows exist and Q = (A−BD −1 C)−1 , then M
−1
=
−QBD −1 . −D −1 CQ D −1 + D −1 CQBD −1 Q
Hint: Multiply M by M −1 given above. (48) Let c1 , . . . , cp be p vectors in Rn and k1 , . . . , kp be p numbers. Find sufficient conditions for the existence of a symmetric matrix X, where X ∈ Rn×n such that ci Xci = ki for 1 i p. (49) Let A ∈ Cn×n be a Hermitian matrix such that rank(A) = r. Prove that there is a permutation matrix P and a principal submatrix B of A such that B ∈ Cr×r Ir H B(Ir F H ) P AP = F and F ∈ C(n−r)×r . Solution: Since rank(A) = r, there is a sequence of r linearly independent columns; since A is Hermitian, the rows that have the same numbers are also linearly independent. The elements at the intersection of those rows yield an r × r submatrix B of
A having rank R. There exists a permutation matrix P such that B CH H . P AP = C D Since P H AP has rank r and so does (B C H ), it follows that (B C H ) generates the rows of P H AP . Consequently, C = F B for some matrix F , so D = F C H , which implies P AP = H
B B HF H F B F B HF H
Ir = B(Ir F H ). F
(50) Let A ∈ Cn×n be a matrix such that rank(A) = r. Prove that A has a submatrix B ∈ Cr×r such that rank(B) = r. If we choose A = −I in the Equality of Exercise 3.17, we have (I − 1 n cd )−1 = I + 1−d cd , for c, d ∈ R , such that d c = 1. Matrices of c the form I − cd , with d c = 1, are known as elementary matrices. (51) Prove that (a) T (i)↔(j) is an elementary matrix obtained by taking c = d = ei − ej . (b) T a(i) , where a = 0, is an elementary matrix obtained by taking c = (1 − a)ei and d = ei . (c) T (i)+a(j) is an elementary matrix obtained by taking c = −aei and d = ej . (52) Prove that if S, T ∈ C2×2 are two Toeplitz matrices, then T S = ST . Does this property hold for matrices T, S ∈ Cn×n , where n 3? (53) Prove that every permutation matrix is a unitary matrix. (54) Prove that if S ∈ Cn×n is a skew-Hermitian matrix, then In +S is an invertible matrix and the matrix (In − S)(In + S)−1 is a unitary matrix. (55) Prove that if A ∈ Cn×n is a Hermitian matrix, then z = xH Ax is a real number for every vector x ∈ Cn . (56) Let A ∈ Cn×n be a Hermitian matrix and let B ∈ Cn×n be a skew-Hermitian matrix. Prove that iA is skew-Hermitian, and iB is Hermitian.
(57) Prove that if A ∈ A = On,n .
Cn×n ,
then trace(AAH ) = 0 if and only if
Solution: By Equality (6.12), we have A 2F = trace(AAH ). Therefore, if trace(AAH ) = 0, it follows that A 2F = 0, so A = On,n . The reverse implication is immediate. (58) Let A ∈ Cn×n . Prove that the following conditions are equivalent: (a) A is Hermitian; (b) xH Ax ∈ R for any x ∈ Cn ; (c) A2 = AH A. (59) Let R(p,q) ∈ Rn×n be the matrix defined by R = (0 · · · 0 ep 0 · · · 0), where ep occurs in the qth position. (a) Prove that if p = q, we have (I + aR(p,q))−1 = I − aR(p,q) for every a ∈ R. (b) Prove that if T ∈ Rn×n and p < q, then ⎞ ⎛ 0 ··· 0 ··· 0 ··· 0 ⎟ ⎜. ⎜ .. · · · ... · · · ... · · · ... ⎟ ⎟ ⎜ ⎟ ⎜ R(p,q)T = ⎜0 · · · 0 · · · tqq · · · tqn ⎟ , ⎟ ⎜ ⎜ .. . . . ⎟ ⎝ . · · · .. · · · .. · · · .. ⎠ 0 ··· 0 ···
0
where the elements tqq , . . . , tqn occur ⎛ 0 · · · t1p ⎜. ⎜ .. · · · ... ⎜ ⎜ ⎜0 · · · tpp (p,q) =⎜ TR ⎜0 · · · 0 ⎜ ⎜. . ⎜. ⎝ . · · · .. 0 · · · t1p
···
0
in line p, and ⎞ ··· 0 .⎟ · · · .. ⎟ ⎟ ⎟ · · · 0⎟ ⎟, · · · 0⎟ ⎟ .. ⎟ ⎟ · · · .⎠ ··· 0
where t1p , . . . , tpp occur in the qth column. (c) Prove that if T ∈ Rn×n and p < q, then R(p,q)T R(p,q) = O. (d) Prove that if T ∈ Rn×n is an upper triangular matrix and p < q, then ((I + aRp,q )−1 T (I + aRp,q ))ij = tij unless i = p and j q or j = q and i p; moreover, prove that ((I + aRp,q )−1 T (I + aRp,q ))pq = tpq + a(tpp − tqq ).
Solution: We discuss only the last part of this supplement. Observe that (I + aR(p,q))−1 T (I + aR(p,q) ) = (I − aR(p,q) )T (I + aR(p,q) ) = T − aR(p,q) T + aT R(p,q). Taking into account the first three parts, the claims made in the last part are immediately justified. (60) Let Z (k) ∈ Cn×n be the matrix defined by 1 if i = j and i = k (k) (Z )ij = 0 otherwise, for 1 i, j n. If A ∈ Cn×n , prove that AZ (k) is obtained from A by replacing the k th column by zeros; similarly, Z (k) A is obtained from A by replacing the k th row by zeros. (61) Let K be a subset of {1, . . . , n} and let k = |K|. Define the matrix C (n,K) as the n × k matrix obtained from In by eliminating all columns whose numbers do not occur in K; the matrix R(n,K) is the k × n matrix obtained from In by eliminating all rows whose numbers do not occur in K. (a) If A ∈ Cn×n , prove that AC (n,K) is obtained from A by eliminating all columns whose numbers do not occur in K; similarly, R(n,K) A is obtained from A by erasing all rows whose numbers do not
occur in K. i1 · · · ik be a submatrix of A ∈ Cm×n . Prove (b) Let B = A j1 · · · jh that B = R(m,{i1 ,...,ik } AC (n,{j1 ,...,jh }) . (62) Prove that rank(AH A) = rank(AAH ) = rank(A) for any matrix A ∈ Cm×n using Sylvester’s Theorem. (63) Prove that if A ∈ Cm×n , then AH A is singular if and only if m < n. (64) Prove or disprove: (a) If 0 ∈ W , where W = {w 1 , . . . , wn }, then W is linearly independent. (b) If W = {w1 , . . . , w n } is linearly independent and w is not a linear combination of the vectors of W , then W ∪ {w} is linearly independent.
(c) If W = {w1 , . . . , w n } is linearly dependent, then any of wi is a linear combination of the others. (d) If y is not a linear combination of {w 1 , . . . , w n }, then {y, w1 , . . . , w n } is linearly independent. (e) If any n − 1 vectors of the set W = {w 1 , . . . , wn } are linearly dependent, then W is linearly independent. (65) Let A ∈ Cn×n be a matrix such that I + A is invertible. The Cayley transform of A is the matrix C(A) = (I − A)(I + A)−1 . (a) Prove that if C(A) exists, then C(A) is a unitary matrix if and only if A is skew-Hermitian matrix. (b) Prove that C(C(A)) = A if all transforms exist. (c) Prove that 0 tan α cos 2α − sin 2α C = . − tan α 0 sin 2α cos 2α (66) Prove that if A ∈ Cn×n is an idempotent matrix, then range(A) ∩ null(A) = {0}. (67) Let A ∈ Cn×n be an idempotent matrix. Prove that I − A is an idempotent matrix and that rank(A) + rank(I − A) = n. Solution: We leave to the reader the proof of the idempotency of I − A. To prove the desired equality, start from Equality (3.8), that is, from rank(A) + dim(null(A)) = n. Thus, it suffices to show that rank(I − A) = dim(null(A)). In turn, it is sufficient to show that range(I − A) = null(A). Let u ∈ range(I −A). We have u = (I −A)x for some x, which implies Au = (A − A2 )x = 0. Thus, u ∈ null(A). Conversely, if u ∈ null(A), then Au = 0, and u = Iu − Au = (I − A)u, so u ∈ range(I − A). (68) Let c ∈ Cn be a vector such that 1n c = 1 and let Qc ∈ Cn×n be the matrix defined by Qc = In − 1n c . Prove that Qc 1n = 0n and Qc is idempotent. Solution: It is clear that c 1n = 1. Therefore, Q2c = (In −1n c )(In −1n c ) = In −21n c +1n c 1n c = In −1n c = Qc .
(69) Let c be a vector in idempotent matrix.
Cn .
Prove that if 1n c = 1, then 1n c is an
An involutive matrix is a matrix A ∈ Cn×n such that A2 = In . (70) Prove that if B ∈ Cn×n is an idempotent matrix, then A = 2B − In is an involutive matrix. (71) Let A ∈ Cn×n be an involutive matrix and let S = {x ∈ Cn | Ax = x} and T = {x ∈ Cn | Ax = −x}. Prove that both S and T are subspaces of Cn and Cn = S T . (72) Let A ∈ Rn×n be the matrix obtained from In by swapping the i th column ei with the j th column ej . Prove that A is an involutive matrix and XA is obtained from X by swapping the i th column xi with the j th column xj . (73) Let A ∈ Cn×m be a matrix, and let c ∈ Cn and d ∈ Cm . Prove that rank(A) − 1 ≤ rank(A) − rank(cdH ) rank(A − cdH ) ≤ rank(A) + rank(cdH ) rank(A) + 1. (74) Let A ∈ Cm×n be a matrix, u ∈ Cm and v ∈ Cn be two vectors, and let a ∈ C. Prove that if rank(A − auv H ) < rank(A), then u ∈ range(A) and v ∈ range(AH ). Solution: Observe that we can write ⎛
⎞ au1 vH ⎜ . ⎟ ⎟ auv H = (av1 u · · · avn u) = ⎜ ⎝ .. ⎠ . aum v H Thus, if w1 , . . . , w m are the rows of the matrix A, we have ⎛ ⎞ w1 − au1 v H ⎜ ⎟ .. ⎟. A − auv H = ⎜ . ⎝ ⎠ wm − aum v H Suppose that rank(A) = r and let wi1 , . . . , w ir be a maximal set of linearly independent rows. Since rank(A − auv H ) < r,
Linear Algebra Tools for Data Mining (Second Edition)
204
there is a linear combination α1 (w i1 − aui1 v H ) + · · · + αr (w ir − auir v H ) = 0, such that not all coefficients αi are 0. The previous equality can be written as α1 wi1 + · · · + αr wir = (aα1 ui1 + · · · + aαr uir )v H ,
(75) (76) (77) (78)
which shows that aα1 ui1 + · · · + aαr uir = 0 and that vH is a linear combination of the rows of A. Thus, v ∈ range(AH ). Proving that u ∈ range(A) is entirely similar. Prove that if A ∈ Cn×n is a normal matrix and U ∈ Cn×n is a unitary matrix, then U H AU is a normal matrix. Let A ∈ Cn×n be a normal matrix and let a, b ∈ C. Prove that the matrix B = aA + bIn is also normal. Let A, B ∈ Cn×n be two matrices such that AB = pA + qB, where p, q ∈ C − {0}. Prove that AB = BA. Let A ∈ Cm×n , B ∈ Cp×q , C ∈ Cn×r , and D ∈ Cq×s . Prove that (A ⊗ B)(C ⊗ D) = AC ⊗ BD. A ∈ Cn×n and B ∈ Cm×n . Prove that (A ⊗ B)H = AH ⊗ B H ; if A and B are Hermitian, then so is A ⊗ B; if A and B are invertible, then so is A⊗B and (A⊗B)−1 = A−1 ⊗ B −1 . Prove that if A and B are two square submatrices in Cn×n , then A ∗ B is a principal submatrix of A ⊗ B. Prove that trace(A ⊗ B) = trace(A)trace(B), for any matrices A and B. Prove that rank(A ⊗ B) = rank(A)rank(B) for any matrices A and B. Let A ∈ Cm×n and B ∈ Cp×q . Prove that if A and B are Hermitian, then so is A ⊗ B.
(79) Let (a) (b) (c) (80) (81) (82) (83)
Solution: By the last part of Theorem 3.56 we have (A⊗B)H = AH ⊗ B H = A ⊗ B, which gives the desired conclusion.
Matrices
205
(84) Let ψ : {1, . . . , m1 } × {1, . . . , mk } −→ {1, . . . , m1 · · · mk } be the function defined by ψ(i1 , . . . , ik ) = 1 +
p−1 k
(ip − 1) mj , p=1
j=1
where 1 i m and 1 k. Prove that ψ is a bijection. Solution: It suffices to show that ψ is injective. Suppose that ψ(i1 , . . . , ik ) = ψ(i1 , . . . , ik ). By the definition of ψ, we have (i1 − 1) + (i2 − 1)m1 + (i3 − 1)m1 m2 + · · · + (ik − 1)m1 · · · mk = (i1 − 1) + (i2 − 1)m1 + (i3 − 1)m1 m2 + · · · + (ik − 1)m1 · · · mk .
Since i1 m1 , by reducing the previous equality modulo m1 it follows that i1 = i1 and (i2 − 1) + (i3 − 1)m2 + · · · + (ik − 1) · · · mk = (i2 − 1) + (i3 − 1)m2 + · · · + (ik − 1)m2 · · · mk . The same argument involving m2 implies i2 = i2 , etc. (n) (85) Let ei be the vector in Rn whose unique non-zero component (m) (n) is in position i, where 1 i n. Prove that ei ⊗ ej = (mn)
em(i−1)+j . Extend this result by proving that (m1 )
e i1
(m2 )
⊗ ei2
(mk )
⊗ · · · ⊗ e ik
k
m
r = eψ(ir=1 , 1 ,...,ik )
where ψ is the function defined in Supplement 3.17. (86) Prove that: (m) (n) (m) (n) (m,n) (m,n) , where Eij is the (a) ei (ej ) = ei ⊗ (ej ) = Eij m×n matrix in R which has a unique non-zero entry in position (i, j); (b) if A = (aij ) is a matrix in Rm×n , then A=
n m
i=1 j=1
(m)
aij ei
(ej ) = (n)
n m
i=1 j=1
(m)
aij ei
⊗ (ej ) . (n)
206
Linear Algebra Tools for Data Mining (Second Edition)
(87) Let A ∈ Rm×n and B ∈ Rp×q . Prove that A⊗B =
p
q m
n
i=1 j=1 h=1 k=1
aij bhk em(i−1)+h ⊗ (e(q(k−1)+j ) . (mp)
(qn)
Solution: We can write A⊗B = =
m
n
(m) aij ei
i=1 j=1 p
q n
m
⊗
(n) (ej )
p
q
bhk eh ⊗ (ek ) (p)
(q)
h=1 k=1 (m)
⊗ (ej ) ⊗ eh ⊗ (ek )
(m)
⊗ eh ⊗ (ej ) ⊗ (ek )
aij bhk ei
(n)
(p)
(q)
i=1 j=1 h=1 k=1
=
=
p
q m
n
i=1 j=1 h=1 k=1 p
q n
m
i=1 j=1 h=1 k=1
aij bhk ei
(p)
(n)
(q)
aij bhk em(i−1)+h ⊗ (eq(k−1)+j ) . (mp)
(qn)
(88) Let A(r) ∈ Rmr ×pr for 1 r N be N matrices. Extend Supplement 3.17 for the matrix A(1) ⊗ · · · ⊗ A(N ) ∈
N
N R r=1 mr × r=1 pr . (89) Let A, X, B be three matrices such that A ∈ Rm×p , X ∈ Rp×q , and B ∈ Rq×n . Prove that (a) we have vec(AXB) = (B ⊗ A)vec(X); (b) if the matrix B ⊗A is invertible, then the equation AXB = C is solvable in X. Solution: Suppose that
⎞ b11 · · · b1n ⎜ . .. .. ⎟ ⎟ . B=⎜ . . . ⎠. ⎝ bq1 · · · bqn ⎛
By the definition of Kronecker products, we have ⎞ ⎛ b11 A · · · bq1 A ⎜ . .. .. ⎟ ⎟ B ⊗ A = ⎜ . . ⎠. ⎝ .. b1n A · · · bqn A
Matrices
207
By writing X columnwise, X = (x1 · · · xq ), we further have ⎞ x1 ⎜ . ⎟ ⎟ vec(X) = ⎜ ⎝ .. ⎠ xq ⎛
and ⎛
⎞ b11 Ax1 + · · · + bq1 Axq ⎜ ⎟ .. ⎟. (B ⊗ A)vec(X) = ⎜ . ⎝ ⎠ b1n Ax1 + · · · + bqn Axq On the other hand, we have XB = (b11 x1 + · · · + bq1 xq , . . . , b1n x1 + · · · + bqn xq ), which implies AXB = (b11 Ax1 + · · · + bq1 Axq , . . . , b1n Ax1 + · · · + bqn Axq ). This shows that vec(AXB) = (B ⊗A)vec(X). The second part of the supplement is immediate. (90) Let U ∈ Cm×n and V ∈ Cn×p be two matrices. Prove that: (a) vec(U V ) = (Ip ⊗ U )vec(V ); (b) vec(U V ) = (V ⊗ Im )vec(U ); (c) vec(U V ) = diag(U, . . . , U )vecV . Solution: This follows immediately from Supplement 3.17. (91) Prove that x ∈ Cm and y ∈ Cn , then x ⊗ y = vec(yx ) and x ⊗ y = xy = y ⊗ x. (92) Let A, B ∈ Rn×n . Prove that trace(AB) = (vec(A )) vec(B). Solution: Since trace(AB) = from Equality (3.3).
n
i=1 ei ABei ,
the result follows
208
Linear Algebra Tools for Data Mining (Second Edition)
Let di ∈ Cm be the i th column of the matrix Im and ej be the j th column of the matrix In . Define the matrix Hij = di ej ∈ Cm×n , which has an 1 in its (i, j) position and 0 elsewhere. Then, any matrix A ∈ Cm×n can be written as A=
m
n
aij Hij .
(3.18)
aij Hij .
(3.19)
i=1 j=1
This implies A =
n m
i=1 j=1
(93) Let ei be the i th column of the matrix In for 1 i n and let E[ij] = ei ej . Prove that: (a) the single nonzero entry of E[ij] is (E[ij] )ij = 1; ej = ei ; (b) E [ij] n = In ; (c) i=1 E[ii] (d) vec(In ) = ni=1 ei ⊗ ei ; E[kl] = vec(E[kl] )(vec(E[ij] ) ; (e) E[ij] ⊗ n n (f) i=1 j=1 (E[ij] ⊗ E[ij] ) = (vec(In ))(vec(In )) . m m n n m n (94) Prove that if ei ∈ C and ej ∈ C , then ei ⊗ ej = emn (i−1)n+j . Define the matrix Kmn ∈ Rmn×mn as Kmn =
m
n
(Hij ⊗ Hij ). i=1 j=1
Kmn is a square mn-dimensional matrix partitioned into mn submatrices in Rm×n such that the (i, j) submatrix has a 1 in position (j, i) and 0 elsewhere. Kmn is referred to as the mn-commutation matrix (see [109], which contains a detailed study of these matrices). For example, K23 is a 6×6 matrix partitioned into six submatrices, as follows: K23 =
2
3
(Hij ⊗ Hij ) i=1 j=1
Matrices
⎛ 1 ⎜0 ⎜ ⎜ ⎜0 =⎜ ⎜0 ⎜ ⎜ ⎝0 0
0 0 0 1 0 0
0 1 0 0 0 0
209
0 0 0 0 1 0
0 0 1 0 0 0
⎞ 0 0⎟ ⎟ ⎟ 0⎟ ⎟. 0⎟ ⎟ ⎟ 0⎠ 1
(95) Prove that if A ∈ Rm×n , we have Kmn vec(A) = vec(A ). Solution: We have ⎛ ⎞ ⎞ ⎛
⎠ aij Hij = vec ⎝ (di Aej )(ej di )⎠ vec(A ) = vec ⎝ ⎛ = vec ⎝ =
ij
⎞ ej di Aej di ⎠ =
ij
ij
vec(Hij AHij )
ij
Hij Hij vec(A)
ij
(by Supplement 3.17) = Kmn vec(A). Here di ∈ Cm is the i th column of the matrix Im and ej is the j th column of the matrix In . (96) Prove the following properties of the commutation matrix Kmn defined above: (a) Kmn = nj=1 (ej ⊗ Im ⊗ ej ) = m i=1 (di ⊗ In ⊗ di ); = Knm ; (b) Kmn −1 = K (c) Kmn nm ; (d) K1n = Kn1 = In ; (e) trace(Kmn ) = 1 + gcd(m − 1, n − 1); (f) if A ∈ Cn×s and B ∈ Cm×t , then Kmn (A⊗B)Kst = B ⊗A. Solution: Part (a): of Kmn implies Kmn = The definition )= (H ⊗ H (d e ⊗ e d ). By applying Exercise 91, ij i j i ij j ij ij
210
Linear Algebra Tools for Data Mining (Second Edition)
we have
(di ej ⊗ ej di ) =
ij
(ej ⊗ di ⊗ di ⊗ ej ) ij
=
j
=
ej
⊗
ej ⊗
i
j
=
di ⊗ di
⊗ ej
di di
⊗ ej
i
(ej ⊗ Im ⊗ ej ). j
Similarly,
(di ej ⊗ ej di ) = (di ⊗ ej ⊗ ej ⊗ di ) ij
ij
=
i
⎛ ⎞ ⎞
⎝di ⊗ ⎝ ej ⊗ ej ⎠ ⊗ di ⎠ ⎛
j
⎛ ⎞ ⎞
⎝di ⊗ ⎝ ej ej ⎠ di ⎠ = ⎛
i
j
(di ⊗ In ⊗ di ). = i
Part (b): We have
= (di ej ⊗ ej di ) = (ej di ⊗ di ej ) = Kmn . Kmn ij
ji
K Part (c): We need to verify that Kmn mn = Imn . This equality follows from
Kmn = (ej di ⊗ di ej ) (ds et ⊗ et ds ) Kmn ji
st
(ej di ds et ⊗ di ej et ss ) = jist
⎛ ⎞
(ej ej ⊗ di di ) = ⎝ ej ej ⎠ di di = Imn . = ji
j
i
Matrices
211
Part (d): We have
(ej ⊗ ej ) = In = (ej ⊗ ej ) = Kn1 . K1n = j
j
Part (e): Note that ⎛
⎞
trace(Kmn ) = trace ⎝ (di ej ⊗ ej di )⎠ =
ij
trace((di ⊗ ej )(ej ⊗ di ))
ij
(ej ⊗ di ) (di ⊗ ej ). = ij
The mn-vector ej ⊗ di has a unique 1 in its ((j-1)m +i)th position and 0s elsewhere, and the vector di ⊕ ej has a unique 1 in its ((i − 1)n + j)th position. Therefore, we have (ej ⊗ di ) (di ⊗ ej ) =
1 if(j − 1) + i = (i − 1)n + j, 0 otherwise.
Let {E1 , . . . , Ep } be a set of equalities and let B[Ei | 1 i p] be the number of valid equalities in this set. We have trace(Kmn ) =
(ej ⊗ di ) (di ⊗ ej ) ij
= B[(j − 1)m + i = (i − 1)n + j | 1 i m, 1 j n] = B[(j − 1)(m − 1) = (i − 1)(n − 1) | 1 i m, 1 j n] = 1 + B[(j − 1)(m − 1) = (i − 1)(n − 1) | 2 i m, 2 j n] = 1 + B[j(m − 1) = i(n − 1) | 1 i m − 1,
Linear Algebra Tools for Data Mining (Second Edition)
212
1 j n − 1] i m−1 = | 1 i m − 1, = 1+B n−1 j 1 j n − 1] . Let m =
m n and n = . gcd(m, n) gcd(m, n)
Any pair (i, j) such that ji = m n where 1 i m and 1 j n must be of the form i = αm and j = αn , where α is a positive rational number smaller or equal to gcd(m, n). We can write α = pq , where p, q are positive integers and gcd(p, q) = 1. Then,
pn i = pm q and j = q . The numbers i and j are integers if and only if pm and pn are both divisible by q. Since gcd(p, q) = 1, the only common divisor of m and n is 1, which implies q = 1. This, in turn, implies i = pm and j = pn , where 1 p gcd(m, n). Thus, there are gcd(m, n) pairs (i, j) such that ji = m n , that is,
m − 1 i = | 1 i m − 1, 1 j n − 1 = gcd(m, n). B n−1 j
Part (f): Let X ∈ Rs×t . We have Kmn (A ⊗ B)Kst vec(X) = Kmn (A ⊗ B)vec(X ) = (by Supplement 3.17) = Kmn vec(BX A ) = vec(AXB ) = (B ⊗ A)vec(X), which gives the desired result. (97) Prove that for every A ∈ Cm×n , rank(A) = rank(AH ). ¯In ) for any If m = n, prove that rank(A − aIn ) = rank(AH − a a ∈ C.
Matrices
213
(98) Let U ∈ Rn×n be the upper triangular matrix ⎛ ⎞ 1 1 ··· 1 1 ⎜0 1 · · · 1 1⎟ ⎜ ⎟ ⎟ U =⎜ .. .. ⎟ ⎜ .. .. ⎝. . · · · . .⎠ 0 0 ··· 0 1 and let V ∈ Rn×n be the ⎛ 1 ⎜0 ⎜ ⎜. V =⎜ ⎜ .. ⎜ ⎝0 0
matrix −1 0 1 −1 .. .. . . 0 0 0 0
··· ···
0 1 . · · · .. · · · −1 ··· 0
⎞ 0 1⎟ ⎟ .. ⎟ ⎟ .⎟ . ⎟ 1⎠ 1
(a) Verify that U V = In . (b) Let a, b ∈ Rn and let sk = ki=1 ai . Using Part (a), prove Abel’s equality: n
i=1
ai bi =
n−1
si (bi − bi+1 ) + sn bn .
i=1
Solution: Since Part (a) ⎛ is immediate, we discuss only ⎞ s1 ⎜ ⎟ Part (b). Note that if s = ⎝ ... ⎠, then s = U a. Also, we sn have ⎞ ⎛ b1 − b2 ⎜ b2 − b3 ⎟ ⎟ ⎜ ⎟ ⎜ . ⎟. ⎜ .. Vb=⎜ ⎟ ⎟ ⎜ ⎝bn−1 − bn ⎠ bn This allows us to write n−1
si (bi − bi−1 ) + sn bn = s (V b) = a U V b = a b, i=1
which yields Abel’s equality.
214
Linear Algebra Tools for Data Mining (Second Edition)
(99) Let x, y, z ∈ Rn such that x1 x2 · · · xn 0 and i i j=1 yj j=1 zj for 1 i n. Prove that x y x z. Solution: Let a ∈ Rn be defined by ai = zi − yi .To apply i Abel’s equality to the vectors a and x, let si = j=1 zj − i j=1 yj 0 for 1 i n. We have x z − x y =
n
ai x i =
i=1
n−1
si (xi − xi+1 ) + sn xn 0,
i=1
which gives the desired inequality. (100) Let a, b ∈ Rn be two vectors such that a1 a2 · · · an 0 and b1 b2 · · · bn 0 and let P On,n be a matrix in Rn×n such that P 1n 1n and P 1n 1n . Prove that a P b a b. Solution: Observe that n
k
pij =
j=1 i=1
because
n
j=1 pij
k
n
pij k,
i=1 j=1
1 for every i. This allows us to write
k
k
k n
pij +
j=1 i=1
pij k,
j=k+1 i=1
hence, k n
pij k −
j=k+1 i=1
k
k
pij
j=1 i=1
=
k
j=1
1−
k
pij
.
i=1
Since b1 b2 · · · bn 0, we get a stronger inequality by introducing the factors b1 , . . . , bn as follows: k n k k
pij bj ≤ bj 1 − pij . (3.20) j=k+1 i=1
j=1
i=1
Matrices
215
Let c = P b. We have k
ci =
i=1
n k
n
pij bj =
i=1 j=1
=
k
j=1
bj
bj
j=1
k
i=1
pij
i=1
n
pij +
k
bj
k
j=k+1
pij .
i=1
Combining the last inequality and Inequality (3.20), we have k
ci
i=1
k
bj
k
pij +
n
j=1
i=1
j=k+1
k
k
k
j=1
bj
i=1
pij +
bj
bj
k
pij
i=1
1−
j=1
k
i=1
pij
=
k
bj
j=1
for 1 k n. The result follows from Supplement 3.17. (101) Let A ∈ Cm×n be a matrix. If X ∈ Cn×k and Y ∈ Cm×k are two matrices such that the matrix R = Y H AX ∈ Ck×k is invertible and B = A − AXR−1 Y H A, then rank(B) = rank(A) − rank(AXR−1 Y H A). (102) Let A(α) =
1 1 , 1 α
where α ∈ R. Define the function f : R −→ R as f (α) = rank(Aα ). Prove that f is not continuous in 1. (103) Let A ∈ Rn×n be a matrix such that A On,n . Prove that A is a stochastic matrix if and only if A1n = 1; prove that A is doubly stochastic if and only if A1n = 1 and 1n A1n . (104) Prove that if A, B ∈ Rn×n are (doubly) stochastic matrices, then AB is a (doubly) stochastic matrix. Also, prove that for every c ∈ [0, 1], the matrix cA+(1−c)B is a (doubly) stochastic matrix. (105) Prove that if A ∈ Rn×n is a (doubly) stochastic matrix, then Am is a (doubly) stochastic matrix for m ∈ N.
216
Linear Algebra Tools for Data Mining (Second Edition)
(106) Let A ∈ Rn×n be a doubly stochastic matrix and let x, y ∈ Rn be two vectors such that x1 x2 · · · xn 0 and y1 y2 · · · yn 0. Prove that x Ay x y. Solution: Note that there exist n nonnegative numbers a1 , . . . , an and n nonnegative numbers b1 , . . . , bn such that xp = ar + · · · + an and yp = br + · · · + bn for 1 r n. We have x y − x Ay = x(I − A)y = a H (I − A)AHb, because x = Ha and y = Hb, where ⎛ 1 1 1 ··· ⎜0 1 1 · · · ⎜ H=⎜ ⎜ .. .. .. .. ⎝. . . . 0 0 0 ···
⎞ 1 1⎟ ⎟ ⎟ .. ⎟ . .⎠ 1
Thus, it suffices to show that H (I − A)H 0n,n . We have (H AH)ij =
n
n
hpi (I − A)pq hqj
p=1 q=1
=
j i
(I − A)pq . p=1 q=1
If i j, we have
(H AH)ij = i −
j i
apq 0,
p=1 q=1
because no sum jq=1 apq can exceed 1 and there are i such sums (1 p i). The case when j i is similar. (107) Let A ∈ Rn×n be a matrix such that AP = P A for every permutation matrix P . Prove that there exist a, b ∈ R such that A = aIn + bJn . Solution: Suppose that P = Pφ , where φ ∈ PERMn . Taking into account Equality (3.1), (AP )ij = (P A)ij is equivalent to
Matrices
217
aiφ−1 (j) = aφ(i)j for every permutation φ ∈ PERMn . If φ is the transposition that exchanges i and j, then the last equality amounts to aii = ajj for 1 i, j n, so all elements on the main diagonal are equal. Let now aik and aj be two elements outside the main diagonal, so i = k and = j. Let φ be a permutation such that φ(k) = j and φ(i) = . Since φ−1 (j) = k, it follows that aik = aj , so any off-diagonal elements of A are also equal. Thus, A has the desired form. (108) Let A, B ∈ Rn×n be two matrices such that A > 0 and B > 0 and let D(A, B) be the divergence defined by D(A, B) =
n n
i=1 j=1
aij − aij + bij . aij ln bij
Prove that: (a) D(A, B) 0 and that D(A, B) = 0 implies A = B. (b) If A and B are stochastic matrices, then D(A, B) = both aij n n i=1 j=1 aij ln bij . (In this case D(A, B) is the Kullback–Leibler divergence, denoted by KL(A, B).) Solution: Let f : R>0 −→ R be the function defined by f (x) = x − 1 − ln x. We have f (1) = 0, f (x) = 1 − x1 , and f (x) = x12 , so f has a minimum for x = 1. Thus, f (x) f (1) = 0, which allows us to conclude that x − 1 − ln x 0 for x > 0; also, x − 1 − ln x = 0 if and only if x = 1. Therefore, we have bij bij − 1 − ln 0, aij aij a
or bij − aij + aij ln bijij 0, where the equality takes place only when aij = bij . This implies immediately the inequality of the first part. The second part follows immediately from the definition of stochastic matrices. (109) Give an example of a non-invertible matrix A ∈ Cn×n that satisfies the condition |aii | {|aik | | 1 k n and k = i} for 1 i n. Note that this condition is weaker than the condition of diagonal dominance used in Theorem 3.49. (110) Prove that rank(A† ) = rank(A) for every A ∈ Cm×n .
218
Linear Algebra Tools for Data Mining (Second Edition)
(111) Let A ∈ Cm×n and B ∈ Cn×p . If rank(A) = rank(B) = n, then prove that (AB)† = B † A† . (112) Let Qr,p be the collection of r-element subsets of the set {1, . . . , p} and let A ∈ Rm×n be a matrix such that rank(A) = r. Denote by I(A) = {I ∈ Qr,m | rank(A(I, :)) = r}, J(A) = {J ∈ Qr,n | rank(A(:, J))
= r}, and N(A) = {(I, J) ∈ I = r. Prove that: Qr,m × Qr,n | rank A J (a) I(A), J(A) and N(A) denote the maximal sets of linearly independent rows and columns, and of maximal nonsingular matrices, respectively; (b) N(A) = I(A) × J(A); (c) if 0 k r2 , then the intersection of any r − k linearly independent rows and r − k linearly independent columns is a matrix of rank at least equal to r − 2k. Solution: The inclusion N(A) ⊆ I(A) × J(A) is immediate. By the Full-rank factorization theorem (Theorem 3.35) there r×n such that are two full-rank matrices B ∈ C m×r
and C ∈ C I A = BC. Then, every matrix A can be factored as J I A = C(I, :)B(:, J), J which implies I(A) × J(A) ⊆ N(A). For the third part, note that dropping a row and a column of a matrix decreases the rank by at most 2. The Pauli matrices are the matrices P1 , P2 , P3 ∈ C2×2 defined by 0 1 0 −i 1 0 , P2 = , P3 = . P1 = 1 0 i 0 0 −1 (113) Prove that Pk2 = I2 and Pk Ph + Ph Pk = 2δhk I2 for k, k ∈ {1, 2, 3}. (114) Prove that the set {I2 , P1 , P2 , P3 } is a basis in C2×2 . (115) Prove that P12 = P22 = P32 = −iP1 P2 P3 = I2 , and trace(Pk ) = 0 for 1 k 3.
Matrices
219
(116) Let u ∈ Cm and v ∈ Cn . Prove that det(u ∗ v) = 0. (117) Prove that if u, v, w ∈ Cn , then (a) (u ∗ v)H = v ∗ u; (b) (v + w) ∗ u = v ∗ u + w ∗ u; (c) u ∗ (v + w) = u ∗ v + u ∗ w; (d) c(u ∗ v) = (cu) ∗ v = u ∗ (cv). Let x, y ∈ Rn . To compute the product x y = ni=1 xi yi , a number of n multiplications and n − 1 additions of real numbers are required. The main component of this time is given by the number of multiplications, so minimizing this number is important in reducing computing time. (118) For x, y ∈ Rn , define ξ and η as n/2
ξ=
j=1
n/2
x2j−1 x2j , η =
y2j−1 y2j .
j=1
Note that the computation of ξ and η requires 2n/2 products. Prove that x y is given by x y =
⎧ n/2 ⎪ ⎨ j=1 (x2j−1 + y2j )(x2j + y2j−1 ) − ξ − η
if n is even,
⎪ ⎩n/2 (x
if n is odd.
j=1
2j−1
+ y2j )(x2j + y2j−1 ) − ξ − η + xn yn
Thus, the total number of products needed to compute x y is upper bounded by 2n/2 + (n + 1)/2. (119) Let A ∈ Rm×n and B ∈ Rn×p . If we write ⎛ ⎞ a1 ⎜ . ⎟ ⎟ A=⎜ ⎝ .. ⎠ and B = (b1 , . . . , bp ), am computing the matrix product AB using the standard computation is equivalent to performing mp inner products of ndimensional vectors that require mpn multiplications. Prove that computing these mp inner products using the technique described in Exercise 3.17 requires (m + p)n + (mp − m − p)(n + 1)/2 multiplications. For large values of m, n, p, this number is roughly half of mnp.
Linear Algebra Tools for Data Mining (Second Edition)
220
(120) Let A=
a11 a12 a21 a22
and B =
b11 b12 b21 b22
be two matrices in R2×2 . The standard method for computing their product requires eight multiplications. Define the following seven numbers: I = (a11 + a22 )(b11 + b22 ) II = (a21 + a22 )b11 III = a11 (b12 − b22 ) IV = a22 (−b11 + b21 ) V = (a11 + a12 )b22 V I = (−a11 + a21 )(b11 + b12 ) V II = (a12 − a22 )(b21 + b22 ). Verify that their product C = AB can be written as c11 c21 c12 c22
= I + IV − V + V II = II + IV = III + V = I + III − II + V I,
using a total of seven multiplications and 18 additions. (121) Let A, B ∈ Rn×n be two matrices, where n = 2k , and B11 B12 A11 A12 and B = , A= A21 A22 B21 B22 k−1
k−1
and Aij , Bij ∈ R2 ×2 . Using multiplications of blockmatrices, the product C = AB ∈ Rn×n can be written as C11 C12 , C= C21 C22 , where Cij ∈ Rn/2×n/2 . Using the result contained in Exercise 3.17, prove that to compute C the number of multiplications required is O(nlog 7 ).
Matrices
221
Solution: Denote by τ (n) the number of operations required to multiply two n × n matrices, where n = 2k . We have τ (n) = 2 7τ n2 + 18 n2 ) for n 2. This implies τ (n) = O(7log n ) = O(nlog 7 ). Let A ∈ Cn×n be a square matrix. Its directed graph is the graph GA having {1, . . . , n} as its set of vertices. An edge (i, j) exists in GA if and only if aij = 0. (122) Draw the graph of the matrix ⎛
0 ⎜2 ⎜ A=⎜ ⎝0 0
1 1 2 0
2 1 0 0
⎞ 0 2⎟ ⎟ ⎟. 1⎠ 2
(123) Let G = (V, E) be a graph having V as a set of vertices and E as a set of edges. Prove that the relation γG on V that consists of all pairs of vertices (x, y) such that there is a path that joins x to y is an equivalence on V. Let G = (V, E) be a directed graph and let V1 , . . . , Vk be the set of equivalence classes relative to the equivalence γG . The condensed graph of G is the digraph C(G) = ({V1 , . . . , VE }, K), having strong components as its vertices such that (Vi , Vj ) ∈ K if and only if there exist vi ∈ Vi and vj ∈ Vj such that (vi , vj ) ∈ E. (124) Let G = (V, E) be a directed graph. Prove that the condensed graph C(G) is acyclic. Solution: Suppose that (Vi1 , . . . , Vi , Vi1 ) is a cycle in the graph C(G) that consists of distinct vertices. By the definition of C(G) we have a sequence of vertices (vi1 , . . . , vi , vi1 ) in G such that the (vip , vip+1 ) ∈ E for 1 p − 1 and (vi , vi1 ) ∈ E. Therefore, (vi1 , . . . , vi , vi1 ) is a cycle, and for any two vertices u, v of this cycle, there is a path from u to v and a path from v to u. In other words, for any pair of vertices (u, v) of this cycle we have (u, v) ∈ γG , so Vi1 = · · · = Vi , which contradicts our initial assumption.
222
Linear Algebra Tools for Data Mining (Second Edition)
(125) Let φ ∈ PERMn and let Pφ be its permutation matrix. Prove that if A ∈ Cn×n and the vertices of the graph GA are renumbered by replacing each number j by φ(j), the resulting graph corresponds to the matrix B = Pφ APφ . A matrix A ∈ Cn×n isreducible if there exists a permutation matrix U V Pφ , where U ∈ Cp×p , V ∈ Cp×q , and Pφ such that A = Pφ Op,q W W ∈ Cq×q . Otherwise, A is irreducible. (126) Prove that a matrix A ∈ Cn×n is irreducible if and only if there exists no partition {I, J} of the set {1, . . . , n} such that aij = 0 when i ∈ I and j ∈ J. Let A ∈ Cn×n . The degree of reducibility of A is k, where 0 k n−1 if there exists a partition {I1 , . . . , Ik+1 } ∈ PARTn such that any I submatrix A i is irreducible for 1 i k+1 and apq = 0 whenever Ii p ∈ Ii , q ∈ Ij , and i = j. The degree of reducibility of A is denoted by red(A). (127) Let A ∈ Rn×n be a matrix. Prove that GA is a strongly connected digraph if and only if A is irreducible. Solution: Let A ∈ Rn×n be a reducible matrix. There exists a partition {I, J} of the set {1, . . . , n} such that aij = 0 when i ∈ I and j ∈ J. Therefore, there is no edge from a vertex in J to a vertex in I, which implies that there exists a vertex i ∈ I and a vertex j ∈ J such that there is no path leading from j to i. Thus, GA is not strongly connected. This shows that if GA is strongly connected, then A is irreducible. Conversely, suppose that GA is not strongly connected and let V1 , . . . , Vk be the strong connected components of GA , where k > 1. Since the condensed digraph C(GA ) is acyclic (by Supplement 3.17), we may assume without loss of generality that its vertices V1 , . . . , Vk are numbered in topological order. In other words, the existence of an edge (Vi , Vj ) in C(G) implies i < j.
Matrices
223
Assume initially that the vertices of the strong component Vi are vp , vp+1 , . . . , vp+|Vi |−1 , where p = 1 + i−1 j=1 |Vj | for 1 i j k. Under this assumption we have apq = 0 if vp ∈ Vi , vq ∈ V , and i > . In other words, the matrix A has the form ⎞ ⎛ A11 A12 · · · A1k ⎟ ⎜ O A 22 · · · A2k ⎟ ⎜ ⎟ A=⎜ .. . ⎟, ⎜ .. ⎝ . . · · · .. ⎠ O O · · · Akk where Aii is the incidence matrix of the subgraph induced by Vi . Thus, A is not irreducible. If the vertices of GA are not numbered according to the previous assumptions, let φ be a permutation that rearranges the vertices in the needed order. Then Pφ APφ has the necessary form and, again, A is not irreducible. (128) Prove that a matrix A ∈ Cn×n is irreducible if and only if its transpose is irreducible. (129) Let A ∈ Rn×n be a non-negative matrix. For m 1, we have (Am )ij > 0 if and only if there exists a path of length m in the graph GA from i to j. Solution: The argument is by induction on m ≥ 1. The base case, m = 1, is immediate. Suppose that the statement holds for numbers less than m. Then (Am )ij = nk=1 (Am−1 )ik Akj . (Am )ij > 0 if and only if there is a positive term (Am−1 )ik Akj in the right-hand sum because all terms are non-negative. By the inductive hypothesis, this is the case if and only if there exists a path of length m − 1 joining i to k and an edge joining k to j, that is, a path of length m joining i to j. (130) Let A ∈ Rn×n be an irreducible matrix such that A On,n . If i ki > 0 for 1 i n − 1, then n−1 i=0 ki A > On,n . Solution: Since A is an irreducible matrix, the graph GA is strongly connected. Thus, there exists a path of length no larger than n − 1 that joins any two distinct vertices i and
Linear Algebra Tools for Data Mining (Second Edition)
224
j of the graph GA . This implies that for some m n − 1 we have (Am )ij > 0. Since n−1 n−1
i ki A = ki (Am )ij , (3.21) i=0
ij
i=0
and all numbers that occur in this equality are non-negative, n−1 i > 0. If i = j, it follows that for i = j we have i=0 ki A ij
the same inequality follows from the fact that k0 In > On,n . (131) Let A ∈ Rn×n be an irreducible matrix such that A On,n . Prove that (In + A)n−1 > On,n . Solution: If we choose ki = ni for 0 i n − 1 in Inequality (3.21), the desired inequality follows. (132) Let A ∈ Rp× , B ∈ Rq× , C ∈ Rr× be three matrices having the same number of columns. Prove that: (a) the Khatri–Rao product is associative, that is, (A ∗ B) ∗ C = A ∗ (B ∗ C); (b) (A ∗ B) (A ∗ B) = (A A) ∗ (B B); (c) (A∗)† = ((A A) ∗ (B B))† (A ∗ B) . (133) Let A ∈ Rm×n and B ∈ Rp×q be two matrices. Prove that if x ∈ Rnq , y ∈ Rmp , then we have y = (A ⊗ B)x if and only if Y = BXA , where X = (x1 x2 · · · xn ) ∈ Rq×n and Y = (y 1 y 2 · · · y m ) ∈ Rp×m . Solution: We begin by partitioning x and y as ⎛ ⎞ ⎛ ⎞ y1 x1 ⎜ . ⎟ ⎜ . ⎟ ⎟ ⎜ ⎟ x=⎜ ⎝ .. ⎠ and y = ⎝ .. ⎠ , xn ym
Matrices
225
where x1 , . . . , xn ∈ Rq and y 1 , . . . , y m ∈ Rp . Then we have ⎞⎛ ⎞ ⎛ x1 a11 B a12 B · · · a1n B ⎜ a B a B · · · a B ⎟ ⎜x ⎟ 22 2n ⎟ ⎜ 2 ⎟ ⎜ 21 ⎟⎜ ⎟ y = (A ⊗ B)x = ⎜ .. .. ⎟ ⎜ .. ⎟ , ⎜ .. .. . ⎝ . . . ⎠⎝ . ⎠ am1 B am2 B · · · amn B
xn
which implies y i = ai1 Bx1 + · · · a1n Bxn ⎛ ⎞ ai1 ⎜ . ⎟ ⎟ = (Bx1 · · · Bxn ) ⎜ ⎝ .. ⎠ ain ⎛
⎞ ai1 ⎜ . ⎟ ⎟ = B(x1 · · · xn ) ⎜ ⎝ .. ⎠ ain = BXai , where ai = (ai1 · · · ain ). By the definition of Y , we have Y = (BXa1 · · · BXam ) = BX(a1 · · · bf am ) = BXA . (134) Let u ∈ CI , v ∈ CJ , and w ∈ CK . Prove that x = u ⊗ v ⊗ w if and only if ui vj wk = xk+(j−1)K+(i−1)JK for 1 i I, 1 j J, and 1 k K. Bibliographical Comments Wedderburn’s Theorem (Theorem 3.39) was obtained in [169]. In Supplement 23 of Chapter 5, we discuss a converse of this statement, a result of Egerv´ ary [42]. Theorem 3.35 (the Full-Rank factorization theorem) was formulated in [112]. Supplement 3.17 is an extension of Wedderburn’s theorem obtained in [29]. Supplement 3.17 is a result of Ky Fan [49]. A comprehensive reference of the Moore–Penrose pseudoinverse of a matrix is the monograph [11].
226
Linear Algebra Tools for Data Mining (Second Edition)
Fundamental references to linear algebra include the two-volume book by Gantmacher [168]. More advanced topics are treated in [18] and [78, 79]. Concentrated and useful sources are [111] and [177]. Properties of Kronecker products are studied in [109, 118, 133, 134]. The computation of x y described in Exercise 3.17 was discovered in [172] and initiated a research stream for more economical ways for multiplying matrices. The reduction of the number of necessary multiplications of two 2×2 matrices from eight to seven shown in Exercise 3.17 was obtained by Strassen in [161].
Chapter 4 MATLAB
4.1
Environment
Introduction
MATLAB ,
which stands for “matrix laboratory” [116], is a formidable tool for anybody interested in linear algebra and its applications.
4.2
The Interactive Environment of MATLAB
MATLAB
is an interactive system. Commands can be entered at the prompt >> in the command window shown in Figure 4.1. It is important to remember that • MATLAB is case sensitive; • variables are not typed; • typing the name of a variable causes the value of the variable to be printed; • ending a command with a semicolon will suppress the screen display of the results of the command. In Figure 4.1 we show the command window of MATLAB . Commas and semicolons separate statements that are placed on the same line as in‘
x = 1 ; y =2; z =3;
As we mentioned above, semicolons suppress the output. To continue a line on the next line, one ends the line with three periods (...). 227
228
Linear Algebra Tools for Data Mining (Second Edition)
Fig. 4.1
The command window of MATLAB .
MATLAB is equipped with extensive help and documentation facilities. To display a help topic, it suffices to type in the command help followed by the name of the topic or to access the MATLAB documentation using the command doc.
4.3
Number Representation and Arithmetic Computations
In MATLAB , we deal with two representations of integers: the unsigned integers and the signed integers. The classes of unsigned integers are denoted by unitk, where k may assume the values 8, 16, 32, and 64 corresponding to representations over one, two, four, and eight bytes, respectively. Example 4.1. To create a one-byte unsigned integer having the value 77, we write >> x = uint8(77) x = 77
To obtain information on x, we type whos(’x’). MATLAB returns the main characteristics of x
matlab Environment
>> whos(’x’) Name Size x 1x1
Bytes 1
Class uint8
229
Attributes
Since x is a scalar, MATLAB regards x as a 1 × 1-matrix. The function size returns the sizes of each dimension of an array. Example 4.2. For the matrix A defined by A = [1,2,3;0,5,1]
we obtain d = 2
3
The function unique applied to an array A produces an array that contains the same data as A with no repetitions and in sorted order. Example 4.3. The application of unique to the array has the following effect: A = [1 2 1 1 0]; >> unique(A) ans = 0 1 2
The call [m,n] = size(A)
returns the size of the matrix A in separate variables m and n. More generally, [d1,d2,...,dn]=size(A)
returns the sizes of the first n dimensions of the array A in separate variables. If we need to return the size of the dimension of A specified by the scalar dim into the variable m, we write m = size(A,dim)
The function isa determines if its first argument belongs to the type specified by its second argument.
230
Linear Algebra Tools for Data Mining (Second Edition)
Example 4.4. The type of x can be verified by typing >> isa(x,’integer’) ans = 1 >> isa(x,’uint8’) ans = 1 >> isa(x,’unit32’) ans = 0
The limits of any of the above integer types can be determined by using the functions intmin and intmax, as follows: Example 4.5. >> intmin(’uint32’) ans = 0 >> intmax(’uint32’) ans = 4294967295
For any of these unsigned types, a number larger than intmax is mapped to intmax and a number lower than intmin is mapped to intmin: >> uint8(1000) ans = 255 >> uint8(-1000) ans = 0
Signed integers are represented in MATLAB using the types of the form intk, where k ranges among 8,16,32, and 64. The leftmost digit is reserved for the sign (1 for a negative integer and 0 for a positive integer). Thus, we have >> intmax(’int8’) ans = 127 >> intmin(’int8’) ans = -128
matlab Environment sign R
exponent q digits
- ···
s
Fig. 4.2
231
mantissa p digits
-
··· p−1
2 1 0
Representation of floating point numbers.
MATLAB
makes use of the floating point representation of real numbers. This representation uses a number of s+1 binary digits and comprises three parts: the sign bit, the sequence of exponent bits, and the sequence of mantissa (or significant) bits, as shown in Figure 4.2. The bits are numbered from right to left starting with 0. We assume that the mantissa uses p bits and the exponent uses q bits, where p + q = s. Depending on the desired precision, MATLAB supports a single-precision format or a double-precision format and the latter format (as specified by the IEEE Standard 754) is the default representation. For double-precision numbers, we have s = 63. The 64th bit is the bit sign b63 defined as 0 if x ≥ 0, b63 = 1 if x < 0, q = 11 bits are reserved for the exponent, and p = 52 bits are used by the significant. The number represented by the sequence of bits (b63 , b62 , . . . , b0 ) is 52 b63 −i b52−i 2 · 2y−1034 , x = (−1) · 1 + i=
where y is the equivalent of (b1 · · · b11 )2 . For single-precision numbers, we have s = 32, the exponent uses q = 8 bits (biased by 127), and the significant uses the remaining p = 23 bits. Accordingly, single-precision numbers are given by 22 b22−i 2−i · 2y−127 , x = (−1)b31 · 1 + i=
where y is the equivalent of (b1 · · · b8 )2 . Single-precision numbers require less memory than double-precision numbers, but they are represented to less precision.
232
Linear Algebra Tools for Data Mining (Second Edition)
Double-precision numbers are created with assignments such as x = 19.43, because the default format in MATLAB is double precision. To represent the same number in single precision, we need to write y = single(19.43). It is important to realize that the set of real numbers that are representable in any of these formats is finite. For example, in the case of the double-precision format, we have at our disposal 64 bits, which means that only 264 real numbers have an exact representation as double-precision numbers. For single-precision numbers, the set of real numbers that have exact representations consists of 232 numbers. The remaining real numbers can be represented only with a degree of approximation, which has considerable consequences for numerical computing. Since the set of reals that can be represented exactly on any floating-point system is finite, it follows that there exists a small gap between each double-precision number and the next larger doubleprecision number. The gap, which limits the precision of computations, can be determined using the eps function. If x is a doubleprecision number, there are no other double-precision numbers in the interval (x, x + eps(x)). Example 4.6. The distance between 7 and the next double-precision number can be determined as follows: >> eps(7) ans = 8.8818e-016
As x increases so does eps(x). We have >> eps(70) ans = 1.4211e-014 >> eps(700) ans = 1.1369e-013 >> eps(7000) ans = 9.0949e-013
matlab Environment
233
Entering eps without arguments is equivalent to eps(1): >> eps ans = 2.2204e-16
Similar considerations hold for single-precision numbers. Here the gaps between numbers are wider because there are fewer exactly representable numbers. Example 4.7. If we define y by y = single(7), then eps(y) returns 4.7684e-007, a value larger than eps(7) computed above. Example 4.8. Let x = 123/1256. This number is not a sum of powers of 2, so it cannot be represented exactly as a double-precision number. Consider the following MATLAB dialog: >> x =123/1256 x = 0.0979 >> y = 123 -1256 * x y = -1.4211e-014
The approximative representation of x means that y is computed, in turn, approximatively, which explains the fact that MATLAB does not return 0 for y. Rounding of decimal numbers can cause unexpected results. This phenomenon is known as roundoff error and it can be seen in the next example. Example 4.9. Define x as x = 0.1. Using the equality test == we have the following results: >> x + x == 0.2 ans = 1 >> x + x + x == 0.3 ans = 0 >> x + x + x + x == 0.4 ans = 1
234
Linear Algebra Tools for Data Mining (Second Edition)
>> x + x + x + x + x + x + x + x == 0.8 ans = 0
Another risk in performing floating point computations is the inadvertent subtraction of two large and close numbers. This may result in a catastrophic cancellation as we show next. Example 4.10. Consider the following MATLAB computation. >> a = 5 a = 5 >> b = 5e24 b = 5.0000e+024 >> c = a + b - b c = 0
returns 0 c rather than 5 since the numbers a + b and b have the same floating-point representation.
MATLAB
The range of double-precision numbers is determined using the functions realmin and realmax: >> rangeDouble = ’Double-precision numbers range between %g and %g’; >> sprintf(rangeDouble,realmin,realmax) ans = Double-precision numbers range between 2.22507e-308 and 1.79769e+308
When these functions are called with the argument ‘single’, the corresponding values for the single-precision type are returned: >> rangeSingle = ‘Single-precision numbers range between %g and %g’; >> sprintf(rangeSingle,realmin(‘single’), realmax(‘single’)) ans = Single-precision numbers range between 1.17549e-038 and 3.40282e+038 MATLAB operates with an extended set of reals; the values Inf and -Inf represent real numbers outside the representation ranges, as shown next.
matlab Environment
235
realmax(‘single’) + .0001e+038 ans = Inf -realmax(‘single’) - .0001e+038 ans = -Inf
Values that are not real or complex numbers are represented by the symbol NaN, an acronym for “Not a Number.” Expressions like 0/0 and Inf/Inf yield NaN, as do any arithmetic operations involving a NaN: x = 0/0 x = NaN
The imaginary unit of complex numbers is represented in by either of two letters: i or j. A complex number can be created in MATLAB by writing
MATLAB
z = 3 + 4i
or, equivalently, by using the complex function: >> z = complex(3,4) z = 3.0000 + 4.0000i
The real and imaginary parts of a complex number can be obtained using the real and imag functions: >> x = real(z) x = 3 >> y = imag(z) y = 4
The functions complex, real, and imag can be applied to arrays, where they act componentwise: >> x= [1 2 3]; >> y = [4 5,6]; >> z=complex(x,y) z = 1.0000 + 4.0000i >> zreal = real(z)
2.0000 + 5.0000i
3.0000 + 6.0000i
Linear Algebra Tools for Data Mining (Second Edition)
236
zreal = 1 2 3 >> zimag = imag(z) zimag = 4 5 6
Conversions from types are possible using built-in MATLAB functions named after the target type. For example, to convert other numeric data to double precision, we can use the MATLAB function double. Example 4.11. A signed integer created by y = int64 (-123456789122) is converted to double-precision floating point by x = double(y). Arithmetic operators involve the usual arithmetic operators. In these operators are overloaded, which means that they can be applied both to numbers and to matrices whose formats accommodate the requirements of these operations. These operators include:
MATLAB
Operator + * ^ \ / ’ .’ .* .\^ .\ ./
Significance Addition or unary plus Subtraction or unary minus Numeric or matrix multiplication Numeric or matrix power Backslash or left matrix division Slash or right matrix division Transpose (for real matrices) or Hermitian conjugate (for complex matrices) Nonconjugated transpose Array multiplication (element-wise) Array power (element-wise) Left array divide (element-wise) Right array divide (element-wise)
The slash or right matrix division B/A is equivalent to BA−1 , while the backslash or left matrix division A\B is equivalent to A−1 B. 4.4
Matrices and Multidimensional Arrays
MATLAB
accommodates both real and complex matrices. Matrices can be entered row-wise from the MATLAB console making sure that
matlab Environment
237
(i) elements of a row are separated by spaces or commas; (ii) rows are separated by a semi-colon; (iii) the matrix is enclosed between square brackets. For example, the matrix A ∈ C2×3 1+i 1−i i A= 2 i 3 − 2i is entered as >>A = [1+i MATLAB
A
1-i
i; 2 i
3-2i]
prints the content of the matrix as
= 1. + i 2.
1. - i i
i 3. - 2.i
To inspect the content of the matrix, we type its name at the prompt >>A
and, again, MATLAB prints the matrix A, as above. Lines can be continued by placing ... at the end of the line. For example, we can write >>B = [1 >> 2 >> 3
-11 22 44
22;... 33;... 55]
to define the matrix
⎛
⎞ 1 −11 22 B = ⎝2 22 33⎠. 3 44 55
Scalars can be entered in a rather straightforward manner. To enter x = 1 + 3i, one writes >>x = 1+3i
The complex unit may be denoted either by i or by j. MATLAB has a special notation for designating contiguous submatrices known as the colon notation. To designate the ith row of a matrix A, we can use the notation A(i,:). Similarly, the jth column is designated by A(:,j).
238
Linear Algebra Tools for Data Mining (Second Edition)
The submatrix of A that consists of all the rows between the ith row and the kth row is specified by A(i:k,:). Similarly, the submatrix that consists of the columns of A between the jth column and the hth column is A(:,j:h). Example 4.12. Let A ∈ R3×4 be the matrix ⎛ ⎞ 1 2 3 4 A = ⎝5 6 7 8 ⎠. 9 10 11 12 Then A(:,2:3) returns ans = 2 6 10
3 7 11
The colon notation is also used for specifying sequences. A vector that consists of the members of the arithmetic progression i, i + r, i + 2r, . . . , i + pr, where i + pr > 1:2:8 ans = 1 3 >> 1:2:9 ans = 1 3 >> 5:-0.6:1 ans = 5.0000 2.0000 >> 1:5 ans = 1 2
5
7
5
7
4.4000 1.4000
3
9
3.8000
4
3.2000
2.6000
5
Special matrix can be generated using built-in functions as shown in what follows. For example, the function eye(m,n) generates a
matlab Environment
239
matrix that contains a unit submatrix Ip , where p = min{m, n}, as shown next. >>I = eye(3,3) I = 1. 0. 0. 1. 0. 0.
0. 0. 1.
>>I = eye(4,3) I = 1. 0. 0. 1. 0. 0. 0. 0.
0. 0. 1. 0.
>>I = eye(3,4) I = 1. 0. 0. 1. 0. 0.
0. 0. 1.
0. 0. 0.
Random matrices having elements uniformly distributed in the interval (0, 1) can be generated using the function rand. For example, >>A = rand(p,q)
creates a random matrix A ∈ Rp×q . For example, ->A = rand(2,3)
produces the result A
= 0.3616361 0.2922267
0.5664249 0.4826472
0.3321719 0.5935095
Random number producing functions can be used to produce multidimensional arrays whose entries belong to certain intervals. Example 4.14. To create a 3 × 4 × 2 3-dimensional array whose entries belong to the (0, 1) interval, we can write -> A = rand(3,4,2)
resulting in the array
Linear Algebra Tools for Data Mining (Second Edition)
240
A(:,:,1) = 0.8147 0.9058 0.1270
0.9134 0.6324 0.0975
0.2785 0.5469 0.9575
0.9649 0.1576 0.9706
0.1419 0.4218 0.9157
0.7922 0.9595 0.6557
0.0357 0.8491 0.9340
A(:,:,2) = 0.9572 0.4854 0.8003
The number of elements of an array can be determined using the function numel. which returns the number of elements of an array A. Its effect is equivalent to that of \prod(size(A)). Experimentation in data mining is greatly helped by the capability of MATLAB for generating pseudorandom numbers having a variety of distributions. The function rand produces uniformly distributed pseudorandom numbers in the interval (0, 1). Depending on the number of arguments, this function can return an n × n matrix of such numbers (for rand(n)), or an m × n matrix (when using the call rand(m,n)). Calls of the form rand(n,’double’) or rand(n,’single’) return matrices of numbers that belong to the specified types. Example 4.15. To generate a 3 × 3 matrix of pseudorandom numbers in (0, 1), we write A = rand(3). This may yield A = 0.4505 0.0838 0.2290
0.9133 0.1524 0.8258
0.5383 0.9961 0.0782
The function randi generates a matrix in Rm×n whose entries are pseudorandom integers from a uniform discrete distribution on an interval [h, k] if called as randi([h k],[m,n]). Example 4.16. The call A = randi([5 10],[3 3]) returns a 3× 3 matrix with integer components in the interval [5, 10]:
matlab Environment
241
A = 7 5 10
5 9 9
10 5 7
If interval [h,k] is replaced with a single argument l, then the randi returns a matrix with pseudorandom components in the interval [1,l]. For instance, the call B = randi(10,2,4) returns a 2 × 4 matrix of numbers in the interval [1, 10]. The function permute rearranges the dimensions of an array in the order specified by its second argument that is a vector. Example 4.17. Let A be a three-dimensional array whose entries belong to the interval [0, 15] produced by the statement A = randi([0 15],[3 4 2]): A(:,:,1) = 10 12 11
6 10 2
11 0 4
0 1 13
12 12 2
7 7 10
A(:,:,2) = 11 5 15
0 7 6
To permute the first and third dimensions of the three-dimensional array A defined above, we write B=permute(A, [3 2 1])
resulting in the array B having the dimension vector [2 4 3]: B(:,:,1) = 10 11
6 0
11 12
0 7
10 7
0 12
1 7
B(:,:,2) = 12 5
Linear Algebra Tools for Data Mining (Second Edition)
242
B(:,:,3) = 11 15
2 6
4 2
13 10
The repositioning of the elements of array A is shown in Figure 4.3. To generate pseudorandom numbers following a normal distribution, one can use the function randn. The call randn(m) returns an m × m-matrix following the normal distribution N (0, 1); using a syntax similar to rand and randi, it is possible to generate matrices of other formats. An additional parameter for the functions of this family located last on their list of parameters allow the generation of values that belong to a more restricted type. For example, a call like randi(10,100,1,’uint32’) returns an array of 100 4-byte integers, while randn(10,’double’) returns an array of double numbers. The sequence of pseudorandom numbers produced by any of the generating functions is determined by the internal state of an internal uniform pseudorandom number generator. Resetting this generator to the same state allows computations to be repeated. The function normrand can be used to produce randomly distributed arrays as normally distributed. If M and S are arrays, then normrand returns an array of random numbers chosen from a normal distribution with mean M and standard deviation S having the same format as M and S. If either M or S are scalars, the result is an array having the format of the other parameter. The function normrnd(M,S,[p,q]) (that we use next) returns a p × q array. B(:, :, 3) 2 4
11
13
15 6 2 10 B(:, :, 2) 12 10 0 1 A(:, :, 2)
5 7 12 B(:, :, 1) 10 6 11 0 11
Fig. 4.3
0
12
A(:, :, 1)
7
7
Rearranging of the elements of an array.
matlab Environment
243
Example 4.18. We begin by generating 15 random points in R2 using U = normrnd(0,1,[15 2])
Starting from U we produce another set of 15 points V applying a rotation by 30 degrees, a scaling by 0.5, and a translation by 2 and we add some noise. To this end, we write >> S = [sqrt(3)/2 -1/2; 1/2 sqrt(3)/2]; >> V = normrnd(0.5*U*S+2,0.05,[15 2])
Submatrices can be extracted by indicating ranges of indices or by using the placeholder :, which stands for all rows or columns, respectively. To extract the second column of the matrix A, we write >>A(:,2) ans = 0.5664249 0.4826472
The following is a list of several built-in functions that create or process matrices. Function ones(m, n) zeros(m, n) eye(m, n) toeplitz(u) diag(u) diag(A) triu(A) tril(A) linspace(a, b, n) kron(A, B) rank(A) A A∗B A+B
Description Creates a matrix in Rm×n containing 1s Creates a matrix in Rm×n containing 0s Creates a matrix in Rm×n having 1s on its main diagonal Creates a Toeplitz matrix whose first row is u Creates a diagonal matrix having u on its main diagonal Yields the diagonal of matrix A Gives the upper part of A Gives the lower part of A Creates a vector in Rn whose components divide [a, b] in n − 1 equal subintervals Computes the Kronecker product of A and B Rank of a matrix A The Hermitian adjoint AH of A The product of matrices A and B The sum of matrices A and B
Operations introduced by a dot apply elementwise. As an example, consider the following operations:
Linear Algebra Tools for Data Mining (Second Edition)
244 A.2 A. ∗ B A./B
A matrix that contains the squares of the components of A The Hadamard product of A and B A matrix that contains the ratios of corresponding components of A and B
If A is a real matrix, then A is the transpose of A; for complex matrices, A denotes the Hermitian adjoint AH . Occasionally, we need to make sure that numerical representation of a symmetric matrix A ∈ Rn×n does not affect its symmetry. This can be achieved by the statement A = (A + A’)/2. The function diag mentioned above has several variants. For example, diag(V,k), where V is an n-dimensional vector and k is an integer returns a square matrix of size n + |k| having the components of V on its kth diagonal. The value k = 0 corresponds to the main diagonal, positive values correspond to diagonals above the main diagonal, and negative values of k refer to diagonals below the main diagonal. The effect of diag(V) is identical to the effect of diag(V,0). If X is a matrix, diag(X,k) is a vector whose components are the elements of the k th diagonal of X. Example 4.19. The expression diag(ones(4,1)) + diag(ones(3,1),1) + diag(ones(3,1),-1)
generates the tridiagonal matrix ans = 1 1 0 0
1 1 1 0
0 1 1 1
0 0 1 1
The function blkdiag produces a block-diagonal matrix from its matrix input arguments. Example 4.20. If we start with the matrices defined by A = [1 2;3 4]; B = [5 6 7; 8 9 10; 11 12 13]; C = [14 15;16 17];
the call blkdiag(A,B,C) will produce the matrix
matlab Environment
ans = 1 3 0 0 0 0 0
2 4 0 0 0 0 0
0 0 5 8 11 0 0
0 0 6 9 12 0 0
0 0 7 10 13 0 0
0 0 0 0 0 14 16
245
0 0 0 0 0 15 17
Block matrices can be formed using vertical or horizontal concatenation of matrices. To concatenate vertically two matrices A and B, the number of columns of A must equal the number of columns of B and the operation is realized as C = [A ; B]. Horizontal concatenation requires equality of the numbers of rows and can be obtained as E = [A D]. Example 4.21. Let A, B, and D be the matrices >> A=[1 2 3; 4 5 6] A = 1 4
2 5
3 6
>> B= [7 8 9; 10 11, 12; 13 14 15] B = 7 10 13
8 11 14
9 12 15
>> D = [17 18; 19 20] D = 17 19
18 20
The vertical concatenation of A and B and the horizontal concatenation of A and D are shown as follows: >> C = [A ; B]
Linear Algebra Tools for Data Mining (Second Edition)
246
C = 1 4 7 10 13
2 5 8 11 14
3 6 9 12 15
>> E = [A D] E = 1 4
2 5
3 6
17 19
18 20
The function repmat replicates a matrix A (referred to as a tile) and creates a new matrix B, which consists of an m × n tiling of copies of A when we use the call B = repmat(A,m,n). The format of B is (pm) × (qn), when the format of A is p × q. The statement repmat(A,n) creates an n × n tiling. Example 4.22. Let A be the matrix A = 1 4
2 5
3 6
To create a 3×2 tiling using A as a tile, we write B = repmat(A,3,2), which results in B = 1 4 1 4 1 4
2 5 2 5 2 5
3 6 3 6 3 6
1 4 1 4 1 4
2 5 2 5 2 5
3 6 3 6 3 6
Real matrices can be sorted in ascending or descending order using the function sort. If v is a vector, sort(v) sorts the elements of v in ascending order. If X is a matrix, sort(X) sorts each column of X in ascending order. This function is polymorphic: if X is an array of strings, then sort(X) sorts the strings in ASCII dictionary order. Another variant of sort, sort(X,d,m), sorts X on the dimension d, in either ascending or descending order, as specified by
matlab Environment
247
the third parameter m, which can assume the values ‘ascend’ or ‘descend’. If this function is called with two output parameters as in [Y,I] = sort(X,d,m), then the function returns additionally an index matrix. Example 4.23. Starting from the unidimensional array >> X=[ 9 2 8 5 11 3 7] X = 9 2 8 5
11
3
7
the function call [Y,I] = sort(X) returns the matrices Y = 2
3
5
7
8
9
11
2
6
4
7
3
1
5
I =
Clearly, Y contains the sorted element of X, while I gives the position of each of the elements of Y in the original matrix X. If X is a complex matrix, the elements are sorted in the order of their absolute values and elements that tie for the absolute value are sorted in the order of their angle. A sparse matrix is a matrix whose elements are, to a large extent, equal to 0. Such a matrix can be represented by the position and value of its non-zero elements using the MATLAB function sparse. Example 4.24. Staring from the matrix A = 1 3
0 1
0 0
2 0
the function call B = sparse(A) returns B, the representation of A as a sparse matrix: B = (1,1) (2,1) (2,2) (1,4)
1 3 1 2
The format of B is the same as the format of A.
Linear Algebra Tools for Data Mining (Second Edition)
248
All matrix operations can be applied to sparse matrices, or to mixtures of sparse and full matrices. Operations on sparse matrices return sparse matrices and operations on full matrices return full matrices. In most cases, operations on mixtures of sparse and full matrices return full matrices. The exceptions include situations where the result of a mixed operation is structurally sparse. For instance, the Hadamard product A .* S is at least as sparse as S. Example 4.25. Consider the matrix D = 5 0 0 1
2 1 0 0
4 2 1 3
Its sparse form is E = (1,1) (4,1) (1,2) (2,2) (1,3) (2,3) (3,3) (4,3)
5 1 2 1 4 2 1 3
The sparse form of the matrix A*D can be computed as sparse(A*D) and is (1,1) (2,1) (1,2) (2,2) (1,3) (2,3)
7 15 2 7 10 14
The same result can be obtained by multiplying the sparse forms of the matrices A and D, that is, by writing B*E. The function call S = sparse(i,j,s,m,n,nzmax) uses three vectors of equal length i, j, and s to generate an m × n sparse matrix S such that S(i(k),j(k)) = s(k), with space allocated for nzmax
matlab Environment
249
non-zeros. Any elements of s that are zero are ignored, along with the corresponding values of i and j. Any elements of s that have duplicate values of i and j are added together. To convert a sparse matrix to a full representation, we can use the function full. Its use is illustrated in Example 4.27. The function call [B,d] = spdiags(A) starts with a rectangular matrix A of format m × n and returns a sparse matrix containing all non-zero diagonals of A. The components of the vector d indicate the position of these diagonals. Example 4.26. If A is the 3 × 4 matrix A = 1 5 9
2 6 10
3 7 11
4 8 12
that has 6 diagonals, the call [B,d] = spdiags(A) produces the results B = 0 0 9
0 5 10
1 6 11
2 7 12
3 8 0
4 0 0
d = -2 -1 0 1 2 3
Note that the first column of B is the −2nd diagonal of A that consists of 9, the second column is the −1st diagonal of A, etc. Other useful formats of spdiags exist. We mention just one, A = spdiags(B,d,m,n) that creates an m × n sparse matrix A from the columns of B and places them along the diagonals specified by the vector d.
Linear Algebra Tools for Data Mining (Second Edition)
250
Example 4.27. Starting with the matrix B and the vector d given by B = 1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
-2
-1
0
1
d =
the call A = spdiags(B,d,5,4) produces the sparse matrix A = (1,1) (2,1) (3,1) (1,2) (2,2) (3,2) (4,2) (2,3) (3,3) (4,3) (5,3) (3,4) (4,4) (5,4)
3 2 1 8 7 6 5 12 11 10 9 16 15 14
The normal representation of matrices can be obtained by applying the function full. Thus, C = full(A) produces C = 3 2 1 0 0
8 7 6 5 0
0 12 11 10 9
0 0 16 15 14
Note that if the length of the diagonals of C is insufficient to accommodate the full columns of B, these columns are truncated. Multidimensional arrays generalize matrices. Such arrays can be created starting with matrices and then extending these matrices.
matlab Environment
251
Example 4.28. After creating the matrix A as >> A = [1 0 2 1; -1 1 0 3; 1 2 3 4] A = 1 -1 1
0 1 2
2 0 3
1 3 4
a second matrix with the same format as the initial matrix can be added by writing: >> A(:,:,2) = [5 1 6 7; 8 9 0,2;-1 3 1 0]
This results in an multidimensional array having the format 3 × 4 × 2 that is displayed as A(:,:,1) = 1 -1 1
0 1 2
2 0 3
1 3 4
1 9 3
6 0 1
7 2 0
A(:,:,2) = 5 8 -1
Multidimensional arrays can be created using the concatenation function cat. Example 4.29. To add a third page to the previously created array we write
A,
A = cat(3,A,[6 7 0 8; -1 -2 -3 0; 4 5 6 3])
resulting in A(:,:,1) = 1 -1 1
0 1 2
2 0 3
1 3 4
Linear Algebra Tools for Data Mining (Second Edition)
252
A(:,:,2) = 5 8 -1
1 9 3
6 0 1
7 2 0
7 -2 5
0 -3 6
8 0 3
A(:,:,3) = 6 -1 4
A fourth page of this array that consists of repeated occurrences of 8 can be added by writing A(:,:,4) = 8
The functions prod and sum return the product of the elements and the sum of the elements of an array, respectively. The sum of the elements of a matrix A can be computed along any of its dimensions by using the function sum(A,d), where d is the dimension. The function sum(A) computes the sum of all elements of A. For prod, similar conventions apply. Example 4.30. For the matrix X = 5 6
8 2
1 1
4 3
we can execute the following computations: >> sum(X) ans = 11 10 >> sum(X,1) ans = 11 10 >> sum(X,2) ans = 18 12
2
7
2
7
matlab Environment
253
For the three-dimensional array created in Example 4.28, prod(A,[1,2]) returns a 1 × 1 × 3 array whose elements are the products of each page of A; similarly, —sum(A,[1,2])— returns an array whose elements are the sums of each page of A: >> sum(A,[1,2]) ans(:,:,1) = 17
ans(:,:,2) = 41
ans(:,:,3) = 33
The repmat function can be used to produce multidimensional arrays. Its format in this case is repmat(A,r), where A is an array and r is a vector that specifies the repetition scheme. Example 4.31. Let A be the matrix defined by A = [1 2 3; 4 5 6]
Copies of the matrix A are repeated in a 2 × 3 × 2 multiarray by writing B = repmat(A,[2 3 2])
This results in B(:,:,1) = 1 4 1 4
2 5 2 5
3 6 3 6
1 4 1 4
2 5 2 5
3 6 3 6
1 4 1 4
2 5 2 5
3 6 3 6
Linear Algebra Tools for Data Mining (Second Edition)
254
B(:,:,2) = 1 4 1 4
2 5 2 5
3 6 3 6
1 4 1 4
2 5 2 5
3 6 3 6
1 4 1 4
2 5 2 5
3 6 3 6
2 5 2 5
3 6 3 6
1 4 1 4
2 5 2 5
3 6 3 6
If, instead, we have written C = repmat(A,[2 3 1])
this would have resulted in C = 1 4 1 4
2 5 2 5
3 6 3 6
1 4 1 4
Obviously, the last dimension (1) of C is useless and can be eliminated using the function squeeze which removes dimensions of length 1 of an array: D = squeeze(C)
4.5
Cell Arrays
A cell array is an array whose components called cells can contain any type of data. A cell array can be created using the constructor {}. Example 4.32. To create a 4 × 2-cell array containing numbers, strings, and arrays, we can write cellAr = {1, 9, 4, 3; ’Febr’, randi(2,3), {11; 12; 13}, randi(2,2)}
resulting in cellAr = 2 x 4 cell array {[ 1]} {[ 9]} {’Febr’} {3 X 3 double} {2 x 2 double}
{[ 4]} {[ {3 x 1 cell}
3]}
matlab Environment
255
An empty cell array can be created using C = {}; to create an empty cell array of format 2 × 3 × 4, we can write D = cell(2,3,4). Example 4.33. A cell array of text and data can be created by writing A = {’one’,’two’,’three’;1,2,3}
resulting in A = 2 x 3 cell array {’one’} {’two’} {[ 1]} {[ 2]}
{’three’} {[ 3]}
Components of a cell array can be accessed individually using curly braces, or as a set of cells using small parentheses. Example 4.34. Here A refers to the cell array introduced in Example 4.33. To access the component (2,3) of A, we write A{2,3}. To access the leftmost four cells of the cell array A, we write A(1:2,1:2) and obtain 2 x 2 cell array {’one’} {[ 1]}
4.6
{’two’} {[ 2]}
Solving Linear Systems
The function inv computes the inverse of an invertible square matrix. Example 4.35. Let A be the invertible matrix A = 1 2
2 3
Its inverse is given by >> inv(A) ans = -3.0000 2.0000
2.0000 -1.0000
Linear Algebra Tools for Data Mining (Second Edition)
256
On the other hand, if inv is applied to a singular matrix A = 1 2
2 4
an error message is posted: Warning: Matrix is singular to working precision.
If A is nonsingular, the function inv can be used to solve the system Ax = b by writing x = inv(A)*b, although a better method is described as follows: Example 4.36. For A=
1 2 2 3
and b =
13 , 23
the solution of the system Ax = b is x = inv(A)*b x = 7.0000 3.0000
This is not the best way for solving a system of linear equations. In certain circumstances (which we discuss in Section 6.18), this method produces errors and has a poor time performance. A better approach for solving a linear system Ax = b is to use the backslash operator x = A \ b or x = mldivide(A,b). The term mldivide is related to the position of the matrix A at the left of x. Example 4.37. Define A and b as >> A = [5 11 2; A = 5 11 10 6 -2 9 >> b=[53;26;48] b = 53 26 48
10 6 -4; -2 9 7] 2 -4 7
matlab Environment
257
Then either x=A\b or x=mldivide(A,b) produces x = 1.0000 4.0000 2.0000
The system xA = c, where A ∈ Rn×m and c ∈ Rm can be solved using either x = A / b or x = mrdivide(A,b). It is easy to see that these operations are related by A\b = (A /b ) .
4.7
Control Structures
Relational expressions allow comparisons between numbers using the following relational operators: Relational operator < > = == ~=
Significance Less than Greater than Less or equal Greater or equal Equal to Not equal
The result of a comparison is a logical value defined as one of the numbers 1 or 0, where 1 is equivalent to true and 0 is equivalent to false. If two scalars are compared, the result is a logical value; if two arrays having the same format are compared, the result is an array of logical values (having the same format as the numerical arrays), which contains logical values that result from the componentwise comparisons of the numerical arrays. Example 4.38. The result of a comparison between the arrays x and y is the following array ans: >> x = [1 5 2 4 9 6] x = 1 5 2 >> y = [7 3 1 3 2 9] y = 7 3 1 >> x > y
4
9
6
3
2
9
258
Linear Algebra Tools for Data Mining (Second Edition)
ans = 0
1
1
1
1
0
Example 4.39. A scalar a can be compared with an array B. The result is an array of logical values having the same format as B. a = 5 >> B = [1 5 2; 4 9 6; 6 1 7; 7 3 1] B = 1 5 2 4 9 6 6 1 7 7 3 1 >> a > A = [1 2 3; 4 5 6; 7 8 9] A = 1 2 3 4 5 6 7 8 9 >> A(M)
matlab Environment
259
ans = 1 5 9
The find function applied to an array produces a list of the positions where the components of the array are non-zero. For a one-dimensional array X, find(X) returns a one-dimensional array I, as in the following example: X = 2 0 >> I = find(X) I = 1 3
-1
0
0
3
6
If X is a matrix, I contains the places of the non-zero components of X where X is regarded as an array obtained by concatenating its columns vertically. For example, we have X = 1 0 0 0 >> I = find(X) I = 1 5 6 8 9 10
3 2
0 1
2 1
The call [I,J] = find(X) returns the row and column indices of the non zero components of X, as follows: I = 1 1 2 2 1 2 J = 1 3
Linear Algebra Tools for Data Mining (Second Edition)
260
3 4 5 5
Example 4.40. Let A and B be two matrices: >> A = [1 2 5; 3 0 2; 6 1 7] A = 1 2 5 3 0 2 6 1 7 >> B = [5 9 1; 4 3 8; 5 6 7] B = 5 9 1 4 3 8 5 6 7
The answer to the comparison A < B is the logical matrix 1 1 0
1 1 1
0 1 0
If a matrix is regarded as an array obtained by concatenating its columns vertically, then C = find(A < B) will return C = 1 2 4 5 6 8 MATLAB has four basic control structures: if-then-else, for, while and switch. Their semantics, discussed next, is somewhat different from similar control structures in common programming languages. The syntactic diagram of the if-then-else structure is shown in Figure 4.4.
matlab Environment
-
-
if ?
-
Logical expression elseif
Fig. 4.4
Statements
?
-
Logical Statements expression
? 6 -
261
?
else
-
-
Statements
end
-
Syntactic diagram of the if-then-else structure.
Example 4.41. The following piece of MATLAB code if(sum(sum(A < B))== size(A,1)*size(A,2)) disp(’A is less than B’) elseif(sum(sum(A > B))== size(A,1)*size(A,2)) disp(’A is greater than B’) elseif(sum(sum(A == B))==size(A,1)*size(A,2)) disp(’A and B are equal’) else disp(’A and B are incomparable’) end
applied to two matrices A and B having the same format will return A and B are incomparable
when
A=
1 2 3 4
and B =
2 1 . 6 6
The function sum when applied to a matrix returns a vector that contains the sum of the columns of the matrix. Therefore, sum(sum(A)) returns the sum of elements of A. The calls size(A,1) and size(A,2) return the dimensions of A. Note that a condition like sum(sum(A < B))== size (A,1)*size(A,2) is satisfied if all entries of the matrix A < B are
Linear Algebra Tools for Data Mining (Second Edition)
262
-
for
-
Variable
Fig. 4.5
-
-
Statements
= - List of values -
end
-
Syntactic diagram of the for structure.
while
-
Condition ? end Statements
Fig. 4.6
Syntactic diagram of the while structure.
equal to 1, that is, if every element of A is less than the corresponding element of B. The syntax of the for structure is shown in Figure 4.5. The list of values can be entered as a vector or as a colon expression. Example 4.42. Either >> for i=[5 7 9 11] disp(sqrt(i)) end
or >> for i=5:2:12 disp(sqrt(i)) end
displays the square roots of the odd numbers between 5 and 12: 2.2361 2.6458 3 3.3166
The syntax diagram of the while structure is shown in Figure 4.6.
matlab Environment
263
Example 4.43. The sum of the terms of the harmonic series that are not smaller than 0.01 is computed by the following code fragment: s = 0; n = 1; while(1/n >= 0.01) s = s + 1/n; n = n + 1; end; fprintf(’Sum is %g\n’,s)
which returns Sum is 5.18738
The switch structure has the following syntactic diagram: The statements following the first case where the switch expression matches the case expression are executed; the second list of statements is executed when the second case expression matches the switch expression. If no case expression is a match for the switch expression, then the statements that follow otherwise are executed, if this option exists. Unlike the similar C structure, only one case is executed. The switch expression can be a scalar or a string. Example 4.44. The following code fragment will display cat (Figure 4.7): >> animal = cougar; >> switch lower(animal) case {’ostrich’,’emu’,’turkey’} disp(’bird’); case {’tiger’,’lion’,’cougar’,’leopard’} disp(’cat’); case {’alligator’,’crocodile’,’frog’} disp(’reptile’); end
Linear Algebra Tools for Data Mining (Second Edition)
264
- statement
- otherwise ? switch
? ,
switch expression
-
? case
?- end -
- case expression
- statement ? ,
Fig. 4.7
4.8
Syntactic diagram of the switch structure.
Indexing
Access to individual components of vectors and matrices can be done by indexing. For example, to access the component aij of matrix A, we write A(i,j). The index of a vector can be another vector. Example 4.45. Let v = [1:3:22] and let u = [1 4 5]. Then, we have >> v=1:3:22 v = 1 4 >> u=[1 4 5] u = 1 4 >> v(u) ans = 1 10
7
10
13
16
19
22
5
13
This technique can be applied to swapping the two halves of v by writing >> v([5:8 1:4]) ans = 13 16 19
22
1
4
7
10
matlab Environment
265
The special operator end allows us to access the final position of a vector as follows: Example 4.46. For the vector v defined in Example 4.45, the expression v(1:2:end) returns ans = 1
7
13
19
that is, the components of v located in odd-numbered positions. A popular technique in MATLAB is the logical indexing. If the indexing expression is of logical type, an indexing expression will extract those elements of the array that make that expression true. Example 4.47. To extract the even number components of the vector v defined in Example 4.45, we can write v(mod(v,2)==0). This will result in ans = 4
10
16
22
Example 4.48. Suppose that we have a matrix that has entries that are not defined; such entries can be represented by the special value NaN (not a number) and can be recognized by the logical function isnan that returns true if its input is NaN. Starting from the matrix >> X = [1 NaN 3 4; 8 -2 NaN NaN; 0 1 NaN 5] X = 1 NaN 3 4 8 -2 NaN NaN 0 1 NaN 5
the statement X(isnan(X))=0 results in the substitution of all NaN values by 0: X = 1 8 0
0 -2 1
3 0 0
4 0 5
266
4.9
Linear Algebra Tools for Data Mining (Second Edition)
Functions
Functions can be created in files having the extension .m which have the same name as the functions they define. In Example 4.49, we define the function kronsum; thus, the file is named kronsum.m. The first line of the file has the form function [Y1 , . . . , Yq ] = f (X1 , . . . , Xp ) where X1 , . . . , Xp are the input arguments, f is the name of the function, and Y1 , . . . , Yq are the values computed by the function. Example 4.49. Kronecker’s sum, A⊕B, is computed by the function kronsum as follows: function [S] = kronsum(A,B) %KRONSUM computes the Kronecker sum of matrices A and B if (ndims(A) ~= 2 || ndims(B) ~= 2) return; end [rowsA,colsA] = size(A); [rowsB,colsB] = size(B); if (rowsA ~= colsA) || (rowsB ~= colsB) return; end S = kron(eye(colsB),A) + kron(B,eye(colsA)); end
We ensure that the arguments presented to kronsum are matrices by using the function ndims that returns the number of dimensions of the arguments. The formats of the arguments are computed using the function size and the computation proceeds only if the two arguments are square matrices. Finally, the last line of the function computes effectively the value of the result. Note the presence of comment lines introduced by %. The first comment line is returned when we use the command lookfor or request help. Example 4.50. The function datagen serves as a generator of datasets in R2 that contain a prescribed number of points that are grouped around given centers.
matlab Environment
function %DATAGEN % % % % %
267
[T] = datagen(spec) produces a set of points in R^2 starting from a k x 4 matrix called spec. Points are grouped in k clusters; the centers of the j-th cluster has coordinates [spec(j,1) spec(j,2)]; the j-th cluster contains spec(j,3) points and the diameter is spec(j,4)
% determine the number of clusters noc = size(spec,1); % npgen gives the number of points currently generated npgen = 0; for j=1:noc centr = [spec(j,1) spec(j,2)]; T(npgen+1:npgen+spec(j,3),:)=... spec(j,4)*randn(spec(j,3),2)+... repmat(centr,spec(j,3),1); npgen = npgen+spec(j,3); end
Starting from the matrix spec spec = 1.0000 1.0000 5.0000
1.0000 8.0000 4.0000
50.0000 40.0000 50.0000
1.0000 1.3000 1.0000
and calling the function defined above T=datagen(spec), we obtain the set of objects shown in Figure 4.8. 4.10
Matrix Computations
As a “matrix laboratory,” MATLAB offers built-in functions for a wide variety of matrix computations. We offer a few examples that involve commonly used functions The function trace can be applied only to a square matrix A and trace(A) returns trace(A).
Linear Algebra Tools for Data Mining (Second Edition)
268 12
10
8
6
4
2
0
−2 −2
−1
0
Fig. 4.8
1
2
3
4
5
6
7
Dataset obtained with the function datagen.
The function abs applied to a matrix A returns the matrix of absolute values of the elements of A. Example 4.51. Let A be the matrix A = 1.0000 + 1.0000i 2.0000 - 5.0000i
3.0000 + 4.0000i -7.0000
We obtain >> B = abs(A) B = 1.4142 5.3852
5.0000 7.0000
The function rref produces the reduced row echelon form of a matrix A, when called as R = rref(A). A variant of this function, [R,r] = rref(A), also yields a vector r so that r indicates the nonzero pivots, length(r) is the rank of A, and A(:,r) is a basis for
matlab Environment
269
the range of A. Roundoff errors may cause this algorithm to produce a rank for A that is different from the actual rank. A pivot tolerance tol used by the algorithm to determine negligible columns can be specified using rref(A,tol). Example 4.52. Starting from the matrix A = 1 7 1
2 8 3
3 9 5
4 10 7
5 11 9
6 12 11
the function call [R,r]=rref(A) returns R = 1 0 0
0 1 0
1
2
-1 2 0
-2 3 0
-3 4 0
-4 5 0
r =
showing that the rank of A is 2. Example 4.53. The echelon form of a matrix A can be used to construct a basis of the null space of the matrix. To this end, consider the following function nullspbasis, where colpiv and colnonpiv designate the pivot and non-pivot columns of A, respectively. function B = nullspbasis(A) tol = sqrt(eps); [R, colpiv] = rref(A,tol); [m, n] = size(A); r = length(colpiv); colnonpiv = 1:n; colnonpiv(colpiv) = []; B = zeros(n, n-r); B(colnonpiv,:) = eye(n-r); B(colpiv,:) = -R(1:r, colnonpiv);
For the matrix A from Example 4.52, we obtain the basis B of null(A) given by B = 1 -2 1
2 -3 0
3 -4 0
4 -5 0
Linear Algebra Tools for Data Mining (Second Edition)
270
0 0 0
1 0 0
0 1 0
0 0 1
The LU decomposition of a matrix A is computed by the function call [L,U,P] = lu(A), which returns a lower triangular matrix L, an upper triangular matrix U , and a permutation matrix P such that P A = LU . Example 4.54. Starting from the matrix >> A=[1 0 1; 2 1 1; 1 -1 2] A = 1 0 1 2 1 1 1 -1 2
considered in Example 3.38 and applying the function lu, we obtain >> [L,U,P]=lu(A) L = 1.0000 0 0.5000 1.0000 0.5000 0.3333 U = 2.0000 1.0000 0 -1.5000 0 0 P = 0 1 0 0 0 1 1 0 0
0 0 1.0000 1.0000 1.5000 0
When the function lu is called with two output arguments, the first argument contains the matrix P L, where P is the permutation matrix discussed before. Of course, the first output argument is not a lower triangular matrix, as shown next. >> [L,U] = lu(A) L = 0.5000 0.3333 1.0000 0 0.5000 1.0000
1.0000 0 0
matlab Environment
271
U = 2.0000 0 0
4.11
1.0000 -1.5000 0
1.0000 1.5000 0
Matrices and Images in MATLAB
Binary matrices can be used to encode black and white images in the binary format using the Image Processing toolbox. The function imshow(A) displays the binary image of a matrix A in a figure such that pixels with the value 0 (zero) are displayed as black and pixels with value 1, as white. Example 4.55. Consider the matrix A ∈ {0, 1}8×8 defined as A = [1 0 1 0 1 0 1 0; 0 1 0 1 0 1 0 1; 1 0 1 0 1 0 1 0; 0 1 0 1 0 1 0 1; 1 0 1 0 1 0 1 0; 0 1 0 1 0 1 0 1; 1 0 1 0 1 0 1 0; 0 1 0 1 0 1 0 1]; \end{pgmdisplay} which represents the pattern of squares of a chess board. The squares themselves are specified using the matrix \begin{PGMdisplay} >> B=ones(50,50);
and the board is generated as the Kronecker product of the matrices A and B: >> C=kron(A,B);
The matrix C has the format 400 × 400. Using the function imshow of the MATLAB image processing toolbox as in >> imshow(C,’Border’,’tight’)
the result can be visualized and saved as a pdf file shown in Figure 4.9.
272
Linear Algebra Tools for Data Mining (Second Edition)
Fig. 4.9
Chessboard generated in MATLAB .
Example 4.56. The pgm format (an acronym of “Portable Gray Map”) is the simplest gray scale graphic image representation. The content of this file is shown in Figure 4.10(b). On its first row, P2 identifies the file type. The digit image has been discretized in a 10 × 10 matrix and the format of this matrix is shown in the third line; the fourth line contains the maximum gray value and the rest of the file contains the matrix A ∈ R10×10 (having non-negative integers as entries), which we subject to SVD analysis in Chapter 9 (Figure 4.10). Exercises and Supplements (1) Write a MATLAB function that when applied to a matrix A returns bases for both the null space of A and the range of A. (2) Write a MATLAB function that starts with three real numbers x, y, z and an integer n and returns a tridiagonal n × n-matrix having all elements on the main diagonal equal to a, all elements immediately located under a diagonal equal to b, and all elements immediately above the diagonal equal to c. (3) Write a MATLAB function that returns a true value if the number of non-zero components of a vector of integers is odd.
matlab Environment
273
(a)
P2 # four.pgm 10 10 16 16 16 16 16 16 16 16 16 13 16 14 1 16 8 0 16 3 0 16 16 16 16 16 16 16 16 16 16 16 16
16 14 1 2 2 0 16 16 16 16
16 3 1 15 0 0 3 5 5 16
4 3 3 4 0 0 3 5 9 16
14 12 4 8 5 0 16 16 16 16
16 16 16 16 16 6 16 16 16 16
16 16 16 16 16 12 16 16 16 16
16 16 16 16 16 16 16 16 16 16
(b)
Fig. 4.10
Digit 4 representation (a) and the corresponding matrix (b).
To measure the time used by MATLAB operations, one can use a pair of MATLAB functions named tic and toc. The function tic (with no argument) starts a stopwatch; toc reads the stopwatch and displays the elapsed time in seconds since the most recent invocation of toc. A similar role can be played by the function cputime, which returns the CPU time in seconds that was used by the MATLAB computation since this computation began. For example, t = cputime; . . . cputime -t
returns the cpu time elapsed between these consecutive calls of cputime.
Linear Algebra Tools for Data Mining (Second Edition)
274
(4) Write a MATLAB function that starts from two integer parameters n and k with k ≤ n and returns a matrix S ∈ Rk×m whose columns contain the distinct subsets of the set {1, . . . , n} with n n no more than k elements, so m = 1+ 1 +· · ·+ k . For example, if n = 4 and k = 3, the function should return the matrix 012341112231112 000002343442233 000000000003444 Compute and plot the time used by the algorithm for various values of n and k. (5) Write a MATLAB function that tests if two integer matrices commute; examine the difficulties of writing such a function that deals with real-number matrices. (6) Write a MATLAB script that picks n points at random on the circle of radius 1 and draws the corresponding polygonal contour. Hint: Generate uniformly n angles α1 , . . . , αn in the interval [0, 2π] and then graph the points (cos αi , sin αi ). The MATLAB function fzero can be used to determine the zeros z of a nonlinear single-argument function f : R −→ R located near a number u ∈ R using two arguments: a string s describing the function to investigate and the number u. For example, to find the zeros of the function f given by f (x) = x3 − 6x2 + 11x − 6, we write >> z = fzero(’x^3 -6*x^2 + 11*x -6’,1.3)
which results in z = 1.0000
An alternative technique is to use an anonymous function; such a function can be written as F = @(x)x^3 -6*x^2 + 11*x -6
Then, we could write z = fzero(F,4)
which results in z = 3
matlab Environment
275
(7) Experiment with the family of functions fa : R −→ R defined by f (x) = x3 − 10x2 + 21x + a for a ∈ R and determine the number of zeros for various values of a. (8) Consider the complex row vector: >> v = [1+i 2-i 1.5+2*i]
(9)
(10) (11)
(12)
(13)
Which MATLAB expression will compute v H correctly, v’ or v.’ and why? The least element of a matrix may occur in several positions. Write a MATLAB function that has a matrix A as an argument and replaces its minimal element in each position of the matrix. Write a MATLAB function that generates a random m × n matrix with integer entries distributed uniformly between a and b. Write a MATLAB function that will start with a vector of complex numbers and return a vector having its components in the set {1, 2, 3, 4} corresponding to the orthants where the image of each complex number is placed. Write a MATLAB function that will start with a vector of complex numbers and circularly shift its components one position to the left or to the right, as specified by a parameter of the function. Write a MATLAB function that accepts four square matrices A1 , A2 , A3 , and A4 in Rn×n and outputs the minimum and the maximum trace of a product Ai1 Ai2 Ai3 Ai4 , where 1 2 3 4 i1 i2 i3 i4
is a permutation. (14) Write a MATLAB function that starts with two vectors that represent the sequences of coefficients of two polynomials and generates the sequence of coefficients of the product of these polynomials. (15) The Lagrange interpolating polynomial is a polynomial of degree no larger than n − 1 whose graph passes through the points (x1 , y1 ), . . . , (xn , yn ) in R2 and is given by n {(x − xj ) | 1 ≤ j ≤ n and j = i} . yi p(x) = {(xi − xj ) | 1 ≤ j ≤ n and j = i} i=1
Given two vectors x, y ∈ Rn write a MATLAB function that computes the coefficients of the Lagrange interpolating polynomial.
276
Linear Algebra Tools for Data Mining (Second Edition)
Bibliographical Comments MATLAB ’s
popularity in the technical and research communities has generated a substantial literature. We mention, as especially useful, such titles as [64, 75, 116, 167] and [83]. Space limitations did not allow us to present the outstanding visualization capabilities of MATLAB for scientific data. The reader should consult [75].
Chapter 5
Determinants
5.1
Introduction
Determinants are a class of numerical multilinear functions defined on the set of square matrices. They play an important role in theoretical considerations of linear algebra and are useful for symbolic computations. As we shall see, determinants can be used to solve certain small and well-behaved linear systems; however, they are of limited use for large or numerically difficult linear systems. Historically, determinants appeared long before matrices related to solving linear systems. In modern times, determinants were introduced by Leibniz at the end of the 17th century and Cramer formula appeared in 1750. The term “determinant” was introduced by Gauss in 1801. 5.2
Determinants and Multilinear Forms
Theorem 5.1. Let F be a field, M be an F-linear space, and let f : M n −→ F be a skew-symmetric F-multilinear form. If two arguments of f are interchanged, then the value of f is multiplied by −1, that is, f (x1 , . . . , xi , . . . , xj , . . . , xn ) = −f (x1 , . . . , xj , . . . , xi , . . . , xn ) for x1 , . . . , xn ∈ M .
277
Linear Algebra Tools for Data Mining (Second Edition)
278
Proof.
Since f is a multilinear form, we have
f (x1 , . . . , xi + xj , . . . , xi + xj , . . . , xn ) = f (x1 , . . . , xi , . . . , xi , . . . , xn ) + f (x1 , . . . , xi , . . . , xj , . . . , xn ) +f (x1 , . . . , xj , . . . , xi , . . . , xn ) + f (x1 , . . . , xj , . . . , xj , . . . , xn ) = f (x1 , . . . , xi , . . . , xj , . . . , xn ) + f (x1 , . . . , xj , . . . , xj , . . . , xn ). By the defining property of skew-symmetry, we have the equalities f (x1 , . . . , xi + xj , . . . , xi + xj , . . . , xn ) = 0, f (x1 , . . . , xi , . . . , xi , . . . , xn ) = 0, f (x1 , . . . , xj , . . . , xj , . . . , xn ) = 0, which yield f (x1 , . . . , xi , . . . , xj , . . . , xn ) = −f (x1 , . . . , xj , . . . , xi , . . . , xn ), for x1 , . . . , xn ∈ M .
Corollary 5.1. Let V be an F-linear space and let f : V n −→ F be a skew-symmetric multilinear form. If xi = xj for i = j, then f (x1 , . . . , xi , . . . , xj , . . . , xn ) = 0. Proof.
This follows immediately from Theorem 5.1.
Theorem 5.1 has the following useful extension. Theorem 5.2. Let V be an F-linear space and let f : V n −→ F be a skew-symmetric F-multilinear form. If φ ∈ PERMn is a permutation given by 1 ··· i ··· n , φ: j1 · · · ji · · · jn then f (xj1 , . . . , xjn ) = (−1)inv(φ) f (x1 , . . . , xn ) for x1 , . . . , xn ∈ M . Proof. The argument is by induction on p = inv(φ). The basis case, p = 0 is immediate because in this case, φ is the identity mapping. Suppose that the argument holds for permutations that have no more than p inversions and let φ be a permutation that has p + 1 inversions. Then, as we saw in the proof of Theorem 1.9, there exists an adjacent transposition ψ such that for the permutation φ defined
Determinants
279
as φ = ψφ we have inv(φ ) = inv(φ) − 1. Suppose that φ is the permutation 1 2 ··· + 1 ··· n φ : j1 j2 · · · j j+1 · · · jn and ψ is the adjacent transposition that exchanges j and j+1 , so 1 2 ··· + 1 ··· n . φ: j1 j2 · · · j+1 j · · · jn By the inductive hypothesis,
f (xj1 , . . . , xj , xj+1 , . . . , xjn ) = (−1)inv(φ ) f (x1 , . . . , xn ) and f (xj1 , . . . , xj+1 , xj , . . . , xjn ) = −f (xj1 , . . . , xj , xj+1 , . . . , xjn )
= −(−1)inv(φ ) f (x1 , . . . , xn ) = (−1)inv(φ) f (x1 , . . . , xn ), which concludes the argument.
Theorem 5.3. Let F be a field, V be an F-linear space, f : V n −→ F be a skew-symmetric F-multilinear form, and let a ∈ F. If i = j and x1 , . . . , xn ∈ V n , then f (x1 , . . . , xn ) = f (x1 , . . . , xi + axj , . . . , xn ). Proof.
Suppose that i < j. Then, by the linearity of f , we have
f (x1 , . . . , xi + axj , . . . , xn ) = f (x1 , . . . , xi , . . . , xn ) + af (x1 , . . . , xj , . . . , xj , . . . , xn ) = f (x1 , . . . , xi , . . . , xn ), by Corollary 5.1.
Theorem 5.4. Let V be a linear space and let f : V n −→ R be a skew-symmetric linear form on V. If x1 , . . . , xn are linearly dependent, then f (x1 , . . . , xn ) = 0.
Linear Algebra Tools for Data Mining (Second Edition)
280
Proof. Suppose that x1 , . . . , xn are linearly dependent, that is, one of the vectors can be expressed as a linear combination of the remaining vectors. Suppose that xn = a1 x1 + · · · + an−1 xn−1 . Then, f (x1 , . . . , xn−1 , xn ) = f (x1 , . . . , xn−1 , a1 x1 + · · · + an−1 xn−1 ) n−1 ai f (x1 , . . . , xi , . . . , xn−1 , xi ) = 0, = i=1
by Corollary 5.1.
Theorem 5.5. Let V be an n-dimensional F-linear space and let {e1 , . . . , en } be a basis in V. There exists a unique, skew-symmetric multilinear form dn : V n −→ R such that dn (e1 , . . . , en ) = 1. Proof.
Let u1 , . . . , un be n vectors such that ui = a1i e1 + a2i e2 + · · · + ani en
for 1 i n. If dn is a skew symmetric multilinear form, dn : V n −→ R, then dn (u1 , u2 , . . . , un ) ⎛ ⎞ n n n aj11 ej1 , aj22 ej2 , . . . , ajnn ejn ⎠ = dn ⎝ j1 =1
=
n
n
j1 =1 j2 =1
···
j2 =1
n jn =1
jn =1
aj11 aj22 · · · ajnn dn (ej1 , ej2 , . . . , ejn ).
We need to retain only the terms of this sum in which the arguments of dn (ej1 , ej2 , . . . , ejn ) are pairwise distinct (because the term where jp = jq for p = q is zero, by Corollary 5.1). In other words, only terms in which the list (j1 , . . . , jn ) is a permutation of (1, . . . , n) have a non-zero contribution to the sum. By Theorem 5.2, we can write dn (u1 , u2 , . . . , un ) = dn (e1 , e2 , . . . , en )
j1 ,...,jn
(−1)inv(j1 ,...,jn) aj11 aj22 · · · ajnn .
Determinants
281
where the sum extends to all n! permutations (j1 , . . . , jn ) of (1, . . . , n). Since dn (e1 , . . . , en ) = 1, it follows that (−1)inv(j1 ,...,jn ) aj11 aj22 · · · ajnn . dn (u1 , u2 , . . . , un ) = j1 ,...,jn
Note that dn (u1 , u2 , . . . , un ) is the matrix A, where ⎛ 1 2 a1 a1 ⎜ a1 a2 ⎜ 2 2 A=⎜ .. ⎜ .. ⎝. . 1 an a2n
expressed using the elements of ⎞ · · · an1 · · · an2 ⎟ ⎟ ⎟ .. ⎟. ··· . ⎠ · · · ann
Definition 5.1. Let A = (aji ) ∈ Cn×n be a square matrix. The determinant of A is the number det(A) defined as (−1)inv(j1 ,...,jn ) aj11 aj22 · · · ajnn . (5.1) det(A) = j1 ,...,jn
The determinant of A is denoted either by det(A) or by 1 2 a1 a1 · · · an1 a1 a2 · · · an 2 2 2 .. . .. .. . . ··· . a1 a2 · · · an n n n Equality (5.1) is known as the Leibniz formula. Note that det(A) can be written using the Levi-Civita symbols introduced in Definition 1.11 as j1 ,...,jn aj11 aj22 · · · ajnn . (5.2) det(A) = j1 ,...,jn
Theorem 5.6. Let A ∈ Cn×n be a matrix. We have det(A ) = det(A).
Linear Algebra Tools for Data Mining (Second Edition)
282
Proof.
The definition of A allows us to write (−1)inv(j1 ,...,jn) a1j1 a2j2 · · · anjn , det(A ) = j1 ,...,jn
where the sum extends to all permutations of (1, . . . , n). Due to the commutativity of numeric multiplication, we can rearrange the term a1j1 a2j2 · · · anjn as ak11 ak22 · · · aknn , where 1 2 ··· n 1 2 ··· n and ψ : φ: k1 k2 · · · kn j1 j2 · · · jn are inverse permutations. Since both φ and ψ have the same parity, it follows that (−1)inv(j1 ,...,jn ) a1j1 a2j2 · · · anjn = (−1)inv(k1 ,...,kn ) ak11 ak22 · · · ajnn , which implies det(A ) = det(A).
Corollary 5.2. If A ∈ Cn×n , then det(AH ) = det(A). Furthermore, if A is a Hermitian matrix, det(A) is a real number. Proof. Let A¯ be the matrix obtained from A by replacing each aij by its conjugate. Since conjugation of complex numbers permutes with both the sum and product of complex numbers, it follows that ¯ = det(A) ¯ = det(A). ¯ = det(A). Thus, det(AH ) = det(A) det(A) The second part of the corollary follows from the equality det(A) = det(A). Corollary 5.3. If A ∈ Cn×n is a unitary matrix, then | det(A)| = 1. Proof. Since A is unitary, we have AH A = AAH = In . By Theorem 5.8, det(AAH ) = det(A) det(AH ) = det(A)det(A) = | det(A)|2 = 1. Thus, | det(A)| = 1. Example 5.1. Let A ∈ R3×3 be the matrix ⎛ 1 2 3⎞ a1 a1 a1 ⎜ 1 2 3⎟ A = ⎝a2 a2 a2 ⎠. a13 a23 a33 The number det(A) is the sum of six terms corresponding to the six permutations of the set {1, 2, 3}, as follows:
Determinants Permutation φ (1, 2, 3) (3, 1, 2) (2, 3, 1) (2, 1, 3) (3, 2, 1) (1, 3, 2)
inv(φ) 0 2 2 1 3 1
283 Term a11 a22 a33 a31 a12 a23 a21 a32 a13 −a21 a12 a33 −a31 a22 a13 −a11 a32 a23
Thus, we have det(A) = a11 a22 a33 + a31 a12 a23 + a21 a32 a13 = −a21 a12 a33 − a31 a22 a13 − a11 a32 a23 . The number of terms n! grows very fast with the size n of the determinant. For instance, for n = 10, we have 10! = 3, 682, 800 terms. Thus, direct computations of determinants are very expensive. The definition of the determinant that makes use of skewsymmetric multilinear forms has the advantage of yielding quite simple proofs for many elementary properties of determinants. Theorem 5.7. The following properties of det(A) hold for any A ∈ Cn×n : (i) det(A) is a linear function of the rows of A (of the columns of A); (ii) if A has two equal rows (two equal columns), then det(A) = 0; (iii) if two rows (columns) are permuted, then det(A) is changing signs; (iv) if a row of a matrix, multiplied by a constant, is added to another row, then det(A) remains unchanged; the same holds if instead of rows we consider columns; (v) if a row (column) equals 0, then det(A) = 0. Proof. We begin with the above statements that involve rows of A. Let A = (aji ) and let xi = (a1i , . . . , ani ) ∈ Cn be the ith row of the matrix A. We saw that det(A) = f (x1 , . . . , xn ), where f is the skew-symmetric multilinear form defined by f (e1 , . . . , en ) = 1. The linearity in each argument follows immediately from the linearity of f .
Linear Algebra Tools for Data Mining (Second Edition)
284
Part (ii) follows from Corollary 5.1. The third part follows from skew-symmetry of f . Theorem 5.3 implies Part (iv). Finally, the last statement is immediate. The corresponding statements concerning columns of A follow from Theorem 5.6 because the columns of A are the transposed rows of A . Theorem 5.8. Let A, B ∈ Cn×n be two matrices. We have det(AB) = det(A) det(B). Proof. Let a1 , . . . , an and b1 , . . . , bn be the rows of the matrices A and B, respectively. We assume that ai = (a1i , . . . , ani ) for 1 i n. Then, the rows c1 , . . . , cn of the matrix C = AB are given by ci = aji bj , where 1 j n, as it can be easily seen. If dn : (Cn )n −→ C is the skew-symmetric multilinear function that defines the determinant whose existence and uniqueness were shown in Theorem 5.5, then we have det(AB) = dn (c1 , . . . , ci , . . . , cn ) ⎛ ⎞ n n n aj11 bj1 , . . . , aji i bji , . . . , ajnn bjn ⎠ = dn ⎝ j1 =1
=
n
···
j1 =1
j=1
n j1 =1
n
···
j1 =1
jn =1
aj11 · · · aji i · · · ajnn dn (bj1 , . . . ,
bji , . . . , bjn ), due to the linearity of dn . Observe now that only the sequences (j1 , . . . , jn ) that represent permutations of the set {1, . . . , n} contribute to the sum because dn is skew-symmetric. Furthermore, if (j1 , . . . , jn ) represents a permutation φ, then dn (bj1 , . . . , bji , . . . , bjn ) = (−1)inv(φ) dn (b1 , . . . , bn ). Thus, we can write det(AB) =
n j1 =1
···
n j1 =1
···
n j1 =1
aj11 · · · aji i · · · ajnn dn (bj1 , . . . , bji , . . . , bjn )
Determinants
⎛ =⎝
n j1 =1
···
n
···
j1 =1
n
285
⎞ (−1)inv(j1 ,...,jn ) aj11 · · · aji i · · · ajnn ⎠
j1 =1
×dn (b1 , . . . , bn ) = det(A) det(B).
Lemma 5.1. Let B ∈ R(n+1)×(n+1) be the matrix ⎞ ⎛ 1 0 0 ··· 0 ⎜0 a1 a2 · · · an ⎟ ⎜ 1 1 1⎟ ⎟ B=⎜ .. ⎟. ⎜ .. .. .. ⎝. . . · · · . ⎠ 0 a1n a2n · · · ann We have det(B) = det(A), where ⎞ ⎛ 1 2 a1 a1 · · · an1 ⎟ ⎜ A = ⎝ ... ... · · · ... ⎠. a1n a2n · · · ann Proof.
Note that if B = (bji ), then 1 if j = 1 bj1 = 0 otherwise,
and
b1i
=
1 if i = 1 0 otherwise.
Also, if i > 1 and j > 1, then bji = aj−1 i−1 for 2 i, j n + 1. By the definition of the determinant, each term of the sum that defines det(B) must include an element of the first row. However, only the first element of this row is non-zero, so
286
Linear Algebra Tools for Data Mining (Second Edition)
det(B) =
j
n+1 (−1)inv(j1 ,j2 ,...,jn+1 ) bj11 bj22 . . . bn+1 ,
(j1 ,j2 ,...,jn+1 )
=
jn −1 jn+1 −1 (−1)inv(1,j2 ,...,jn+1 ) aj12 −1 · · · an−1 an ,
(j2 ,...,jn+1 )
where (j2 , . . . , jn+1 ) is a permutation of the set {2, . . . , n + 1}. Since inv(1, j2 , . . . , jn+1 ) = inv(j2 , . . . , jn+1 ), it follows that det(B) =
(−1)inv(j2 ,...,jn+1 ) aj12 −1 aj23 −1 . . . ajnn+1 −1 .
(j2 ,...,jn+1 )
Observe now that if (j2 , . . . , jn+1 ) is a permutation of the set {2, . . . , n + 1}, then (k1 , . . . , kn ), where ki = ji+1 − 1 for 1 i n is a permutation of (1, . . . , n) that has the same number of inversions as (j2 , . . . , jn+1 ). Therefore, (−1)inv(k1 ,...,kn ) ak11 ak22 . . . aknn = det(A). det(B) = (k1 ,...,kn )
Lemma 5.2. Let A ∈ ⎛
Rn×n
a11 ⎜ . ⎜ .. ⎜ ⎜ a1 ⎜ A = ⎜ 1p ⎜ ap+1 ⎜ ⎜ .. ⎝ . a1n
be a matrix partitioned as ⎞ n · · · aq1 aq+1 · · · a 1 1 .. .. .. ⎟ ··· . . ··· . ⎟ ⎟ q q+1 · · · ap ap · · · anp ⎟ ⎟ n ⎟ · · · aqp+1 aq+1 · · · a p+1 ⎟ p+1 ⎟ .. .. . ⎟ ··· . . · · · .. ⎠ · · · aqn aq+1 · · · ann n
and let B ∈ R(n+1)×(n+1) be defined by ⎛ a11 · · · aq1 0 aq+1 1 ⎜ . ⎜ .. · · · ... ... ... ⎜ ⎜ a1 · · · aq 0 aq+1 p p ⎜ p ⎜ B = ⎜ 0 ··· 0 1 0 ⎜ 1 ⎜ ap+1 · · · aqp+1 0 aq+1 p+1 ⎜ .. .. .. ⎜ .. ⎝ . ··· . . . a1n · · · aqn 0 aq+1 n Then det(B) = (−1)p+q det(A).
⎞ · · · an1 . ⎟ · · · .. ⎟ ⎟ · · · anp ⎟ ⎟ ⎟ ··· 0 ⎟. ⎟ · · · anp+1 ⎟ ⎟ . ⎟ · · · .. ⎠ · · · ann
Determinants
287
Proof. By permuting the (p+1)st row of B with each of the p rows preceding it in the matrix B and, then, by permuting the (q+1)st column with each of the q columns preceding it, we obtain the matrix C given by ⎛
1 0 ⎜ 0 a11 ⎜ ⎜ .. .. ⎜. . ⎜ 1 C=⎜ ⎜ 0 ap ⎜ 0 a1 ⎜ p+1 ⎜. . ⎝ .. .. 0 a1n
0 0 0 q q+1 · · · a1 a1 . .. · · · .. . · · · aqp aq+1 p · · · aqp+1 aq+1 p+1 .. .. ··· . .
0 0 · · · an1 . · · · ..
⎞
⎟ ⎟ ⎟ ⎟ ⎟ · · · anp ⎟ ⎟. · · · anp+1 ⎟ ⎟ .. ⎟ ··· . ⎠
· · · aqn aq+1 · · · ann n
By the third part of Theorem 5.7, each of these row or column permutations multiplies det(B) by −1, so det(C) = (−1)p+q det(B). By Lemma 5.1, we have det(C) = det(A), so det(B) = (−1)p+q det(A). Definition 5.2. Let A ∈ Cm×n . A minor of order k of A is a determinant of the form i1 · · · ik . det A j1 · · · kk A principal minor of order k of A is a determinant of the form i1 · · · ik . det A i1 · · · ik The leading principal minor of order k is the determinant 1 ··· k det A . 1 ··· k For A ∈ Cn×n , det(A) is the unique principal minor of order n, and the principal minors of order 1 of A are just the diagonal entries of A: a11 , . . . , ann .
288
Linear Algebra Tools for Data Mining (Second Edition)
Theorem 5.9. Let A ∈ Cn×n . Define the matrix Aji ∈ C(n−1)×(n−1) as
1 ··· i − 1 i + 1 ··· n j , Ai = A 1 ··· j − 1 j + 1 ··· h that is, the matrix obtained from A by removing the ith row and the jth column. Then, we have n det(A) if i = , j i+j j (−1) ai det(A ) = 0 otherwise, j=1 for every i, , 1 i, n. Proof.
Let xi be the ith row of A, which can be expressed as xi =
n
aji ej ,
j=1
where e1 , . . . , en is a basis of Rn such that dn (e1 , . . . , en ) = 1. By the linearity of dn , we have dn (A) = dn (x1 , . . . , xn ) ⎛ = dn ⎝x1 , . . . , xi−1 ,
n
⎞ aji ej , xi+1 , . . . , xn ⎠
j=1
=
n
aji dn (x1 , . . . , xi−1 , ej , xi+1 , . . . , xn ).
j=1
The determinant dn (x1 , . . . , xi−1 , ej , xi+1 , . . . , xn ) corresponds to a matrix D (i,j) obtained from A by replacing the ith row by the sequence (0, . . . , 0, 1, 0, . . . , 0), whose unique nonzero component is on the jth position. Next, by multiplying the ith row by −ajk and adding the result to the kth row for 1 k i − 1 and i + 1 k n, we obtain a matrix E (i,j) that coincides with the matrix A with the following exceptions:
Determinants
289
(i) the elements of row i are 0 with the exception of the jth element of this row that equals 1, and (ii) the elements of column j are 0 with the exception of the element mentioned above. Clearly, det(D (i,j) ) = det(E (i,j) ). By applying Lemma 5.2, we obtain det(E (i,j) ) = (−1)i+j det(Aji ), so dn (A) =
n
aji dn (E (i,j) ) =
j=1
n (−1)i+j aji det(Aji ), j=1
which is the first case of the desired formula. Suppose now that i = . The same determinant could be computed by using an expansion on the th row, as follows: dn (A) =
n
(−1)i+j aj det(Aj ).
j=1
Then nj=1 (−1)i+j aji det(Aj ) is the determinant of a matrix obtained from A by replacing the th row by the ith row and such a determinant is 0 because the new matrix has two identical rows. This proves the second case of the equality of the theorem. The equality of the theorem is known as the Laplace expansion of det(A) by row i. Since the determinant of a matrix A equals the determinant of A , det(A) can be expanded by the jth row as det(A) =
n (−1)i+j aji det(Aji ) i=1
for every 1 j n. Thus, we have n (−1)i+j aji det(Aj ) = i=1
det(A) 0
if i = , if i = .
This formula is the Laplace expansion of det(A) by column j.
290
Linear Algebra Tools for Data Mining (Second Edition)
The number cof(aji ) = (−1)i+j det(Aji ) is the cofactor of aij in either kind of Laplace expansion. Thus, both types of Laplace expansions can be succinctly expressed by the equalities det(A) =
n j=1
aji cof(aji )
=
n
aji cof(aji )
(5.3)
i=1
for all i, j ∈ {1, . . . , n}. Cofactors of the form cof(aii ) are known as principal cofactors of A. Example 5.2. Let a = (a1 , . . . , an ) be a sequence of n real numbers. The Vandermonde determinant Va contains on its ith row the successive powers of ai , namely a0i = 1, a1i = ai , . . . , ani : 1 a1 (a1 )2 · · · (a1 )n−1 1 a2 (a2 )2 · · · (a2 )n−1 Va = .. .. . .. . . · · · . 1 an (an )2 · · · (an )n−1 By subtracting the first line from the remaining lines, we have n−1 2 1 a1 a · · · a 1 1 0 a2 − a1 (a2 )2 − (a1 )2 · · · (a2 )n−1 − (a1 )n−1 Va = . .. .. .. . ··· . 2 2 n−1 n−1 0 an − a1 (an ) − (a1 ) · · · (an ) − (a1 ) a2 − a1 (a2 )2 − (a1 )2 · · · (a2 )n−1 − (a1 )n−1 .. . . . . = . . . ··· . an − a1 (an )2 − (a1 )2 · · · (an )n−1 − (a1 )n−1 Factoring now ai+1 − a1 from the ith line of the new determinant for 1 i n yields 1 a2 + a1 · · · n−2 an−2−i ai 1 i=0 2 .. .. Va = (a2 − a1 ) · · · (an − a1 ) ... . . · · · . 1 an + a1 · · · n−2 an−2−i ai 1 i=0 n
Determinants
291
Consider two successive columns of this determinant: ⎛k−1 k−1−i i ⎞ ⎛ k ⎞ k−i i a1 i=0 a2 i=0 a2 a1 ⎜ ⎜ ⎟ ⎟ .. .. ck = ⎝ ⎠ and ck+1 = ⎝ ⎠. . . k−1 k−1−i i k k−i i a1 i=0 an i=0 an a1 Observe that ⎛
ck+1
⎞ ak2 ⎜ ⎟ = ⎝ ... ⎠ + a1 ck , akn
it follows that by subtracting from each column ck+1 from the previous column multiplied by a1 (from right to left), we obtain 1 a2 · · · an−2 2 Va = (a2 − a1 ) · · · (an − a1 ) ... ... · · · ... 1 an · · · an−2 n = (a2 − a1 ) · · · (an − a1 )V(a2 ,...,an ) . By applying repeatedly this formula, we obtain (ap − aq ), Va = p>q
where 1 p, q n. Theorem 5.8 can be extended to products of rectangular matrices. Theorem 5.10. Let A ∈ Cm×n and B ∈ Cn×m be two matrices, where m n. We have det(AB) k1 · · · km 1 ··· m det B = det A 1 ··· m k1 · · · km | 1 k1 < k2 < · · · < km n . This equality is known as the Cauchy–Binet formula.
292
Linear Algebra Tools for Data Mining (Second Edition)
Proof. Let a1 , . . . , an be the rows of the matrix A and let C = AB. The first column of the matrix AB equals nk1 =1 ak1 bk1 1 . Since det(C) is linear, we can write 1 2 a c1 · · · cn1 k1 n a2 c2 · · · cn 2 2 k1 bk1 1 . . det(C) = . .. .. · · · ... k1 =1 m 2 a cm · · · cnm k1 n Similarly, the second row of C equals k2 =1 ak2 bk2 1 . A further decomposition yields the sum 1 1 a a · · · cn1 k1 k2 n n a2 a2 · · · cn 2 k k 1 2 bk1 1 bk2 2 . . det(C) = , .. .. · · · ... k1 =1 k2 =1 m 2 a a · · · cnm k1 k2 and so on. Eventually, we can write
k a 1 1 . ··· b1k1 · · · bm det(C) = km .. k1 =1 km =1 ak1 m n
n
· · · ak1m .. .. , . . · · · akm m
due to the multilinearity of the determinants. Only terms involving distinct numbers k1 , . . . , km are retained in this sum because any term with kp = kq equals 0. Suppose that {k1 , . . . , km } = {h1 , . . . , hm }, where h1 < · · · < hm and φ is the bijection defined by ki = φ(hi ) for 1 i m. Then h k a 1 · · · ahm a 1 · · · akm 1 1 1 1 .. .. .. . . . = (−1)inv(k1 ,...,km ) ... ... ... , ah1 · · · ahm ak1 · · · akm m m m m which allows us to write h 1 a1 det(C) = ... h1 A = [1 2 3; 4 5 6; 7 8 9] A = 1 2 3 4 5 6 7 8 9
314
Linear Algebra Tools for Data Mining (Second Edition)
>> d = det(A) d = 0
A = 1 5 4 1 3 1 >> d=det(A) d = 10 >> B=inv(A) B = -1/5 1/5 1/10
2 6 4
-9/5 -1/5 7/5
14/5 1/5 -19/10.
The Symbolic Math Toolbox of MATLAB provides functions for solving, plotting, and manipulating mathematics. The function syms creates symbolic variables and functions. Example 5.16. To create the variables x and y, one could write syms x y
Polynomials can be defined as shown in what follows. For example, to re-create the polynomials defined in Example 5.11, one could write f = x^3 - 2x^2 - x + 2 g = x^2-5x +6
The resultant of these polynomials in the variable x is obtained with res = resultant(f,g,x)
and, as expected, the result is 0. Suppose now that f and g are two polynomials in x and the parameter m as in f = x^3 - (m+1)x^2 - mx + 2 g = x^2-(2m+3) x + 2m+4
Clearly, the previous polynomials are obtained by taking m = 1.
Determinants
315
We begin by computing the resultant as a polynomial in m: >> syms x m >> f = x^3 -(m+1)* x^2 - m*x + 2 f = x^3 + (- m - 1)*x^2 - m*x + 2 >> g=x^2 -(2*m +3)*x + 2*m + 4 g = x^2 + (- 2*m - 3)*x + 2*m + 4 >> res = resultant(f,g,x) res = - 8*m^4 - 24*m^3 - 8*m^2 + 24*m + 16
To determine the values of m that make the resultant equal to 0, we write >> mRoots = solve(res)
which returns: mRoots = -2 -1 -1 1
Exercises and Supplements (1) Let φ ∈ PERMn be
φ:
1 ··· i ··· n , a1 · · · ai · · · an
and let vp (φ) = |{(ik , il ) | il = p, k < l, ik > il } be the number of inversions of φ that have p as their second component, for 1 p n. Prove that
316
Linear Algebra Tools for Data Mining (Second Edition)
(a) vp n − p for 1 p n; (b) for every sequence of numbers (v1 , . . . , vn ) ∈ Nn such that vp n − p for 1 p n there exists a unique permutation φ that has (v1 , . . . , vn ) ∈ Nn as its sequence of inversions. (2) Let p be the polynomial defined as p(x1 , . . . , xn ) = (xi − xj ). i 0. (37) Let A, B ∈ Cn×n be two matrices and let p be the polynomial p(x) = det(A + xB). Prove that the degree of f does not exceed the rank of B. Solution: By Theorem 3.37, if rank(B) = r, there exists the non-singular matrices G, H ∈ Cn×n such that B = GLr H and Ir Or,n−r . Lr = On−r,r On−r,n−r Therefore, p(x) = det(A + xGLr H) = det(G(G−1 AH −1 + xLr )H) = det(G) det(H) det(M + xLr ), where M = G−1 AH −1 . The highest degree of a term in det(M + xLr ) cannot exceed r and the result follows immediately.
Determinants
331
(38) Let f (x1 , . . . , xr ) be a homogeneous polynomial of degree n. Prove that
xi
∂f = nf. ∂xi
(This result is known as Euler’s theorem). Solution: The homogeneity of F means that F (tx1 , . . . , txr ) = tn F (x1 , . . . , xr ). Differentiating with respect to t yields
xi
∂F = ntn−1 f (x1 , . . . , xr ), ∂xi
and the desired result follows taking t = 1. (39) Let f1 , . . . , fk be k polynomials in R[x]. Define the polynomials F (x) = a1 f1 (x) + a2 f2 (x) + · · · + ak fk (x), G(x) = b1 f1 (x) + b2 f2 (x) + · · · + bk fk (x). Prove that the polynomials have a common root if and only if RF,G = 0 for all a1 , . . . , ak , b1 , . . . , bk . (40) The discriminant of a polynomial f ∈ R[x] is Df = a10 Rf,f , where a0 is the leading coefficient of f . Prove that f has roots of multiplicity at least 2 if and only if Df = 0. (41) Let f1 , f2 be two polynomials in R[x] of degrees k1 and k2 , respectively. Prove that Df1 f2 = (−1)k1 k2 Df1 Df2 Rf21 ,f2 . (42) Let g, h be two polynomials of degrees k1 and k2 , respectively. Prove that 2 . Dgh = (−1)k1 k2 Dg Dh Rg,h
(43) Prove that the polynomial f ∈ C[x] defined by f (x) = ax3 + bX 2 + cx + d has the expression b2 c2 − 4ac3 − 4b3 d − 27a2 d2 + 18abcd as its discriminant. (44) Let f ∈ C[x] be a polynomial. Prove that Df +a = Df for any constant a ∈ C. If g(x) = f (ax), prove that Dg = an(n−1) Df
332
Linear Algebra Tools for Data Mining (Second Edition)
A homogeneous linear system has the form a11 x1 + a12 x2 + · · · a1n xn = 0, .. . =0 am1 x1 + am2 x2 + · · · amn xn = 0, or Ax = 0m , where A = (aij ) ∈ Rm×n and x ∈ Rn . Observe that each such system has the trivial solution x1 = x2 = · · · = xn = 0. (45) Prove that the set of solutions of the homogeneous system Ax = 0m is a subspace of Rn . (46) Let A ∈ Rn×n be a matrix. Prove that the system Ax = 0n has a non-trivial solution if and only if det(A) = 0. (47) Let A ∈ R3×4 and consider the system Ax = 03 , where x ∈ R4 . Denote by Ak the matrix in R3×3 obtained from A by eliminating the kth column, where 1 k 4. Prove that if the system Ax = 03 has a non-trivial solution x ∈ R4 , then x2 x3 x4 x1 = = = . det(A1 ) − det(A2 ) det(A3 ) − det(A4 ) Solution: The system Au = 03 has the explicit form a11 x1 + a12 x2 + a13 x3 + a14 x4 = 0 a21 x1 + a22 x2 + a23 x3 + a24 x4 = 0 a31 x1 + a32 x2 + a33 x3 + a34 x4 = 0. Equivalently, we can write a11 x1 + a12 x2 + a13 x3 = −a14 x4 a21 x1 + a22 x2 + a23 x3 = −a24 x4 a31 x1 + a32 x2 + a33 x3 = −a34 x4 . Thus, we have
a14 a12 a13 a24 a22 a23 a34 a32 a33 det(A1 ) = −x4 . x1 = −x4 det(A4 ) det(A4 )
Determinants
333
Similar formulas are x2 = x4
det(A2 ) , det(A4 )
and x3 = −x4
det(A3 ) . det(A4 )
Thus, x2 x3 x4 x1 = = = . det(A1 ) − det(A2 ) det(A3 ) − det(A4 ) Bibliographical Comments An encyclopedic reference on Schur’s complement and its applications in numerical analysis, probabilities, and statistics and other areas can be found in [178]. The result contained in Supplement 22 appears in [42]. Supplement 35 is a result of [10].
This page intentionally left blank
Chapter 6
Norms and Inner Products
6.1
Introduction
A norm is a real-valued function defined on a linear space intended to model the “length” of the vector. Norms generate metrics on linear spaces and these metrics, in turn, generate topologies that are useful in constructing algorithms on these spaces. Elementary notions of set topology used in this chapter can be found in [152]. The other fundamental notion discussed in this chapter is the notion of inner product spaces. Inner products are capable of generating norms and allow the introduction of the concept of orthogonality and of unitary and orthogonal (or orthonormal) matrices. Also, we introduce positive definite matrices and discuss two decomposition results for matrices: the Cholesky decomposition for positive definite matrices and the QR factorizations. 6.2
Basic Inequalities
Lemma 6.1. Let p, q ∈ R − {0, 1} such that p1 + 1q = 1. Then we have p > 1 if and only if q > 1. Furthermore, one of the numbers p, q belongs to the interval (0, 1) if and only if the other number is negative. Proof. The statement follows immediately from the equality p . q = p−1 335
336
Linear Algebra Tools for Data Mining (Second Edition)
Lemma 6.2. Let p, q ∈ R −{0, 1} be two numbers such that 1p + 1q = 1 and p > 1. Then, for every a, b ∈ R0 , we have ab
ap bq + , p q 1
where the equality holds if and only if a = b− 1−p . Proof. By Lemma 6.1, we have q > 1. Consider the function p f (x) = xp + 1q − x for x 0. We have f (x) = xp−1 − 1, so the minimum is achieved when x = 1 and f (1) = 0. Thus, 1 f ab− p−1 f (1) = 0, which amounts to p
1 ap b− p−1 − 1 + − ab p−1 0. p q p
By multiplying both sides of this inequality by b p−1 , we obtain the desired inequality. Observe that if 1p + 1q = 1 and p < 1, then q < 0. In this case, we have the reverse inequality ab
ap bq + , p q
(6.1)
which can be shown by observing that the function f has a maximum in x = 1. The same inequality holds when q < 1 and therefore p < 0. Theorem 6.1 (The H¨ older inequality). Let a1 , . . . , an and b1 , . . . , bn be 2n nonnegative numbers, and let p and q be two numbers such that 1p + 1q = 1 and p > 1. We have n i=1
ai bi
n i=1
1 api
p
·
n
1 bqi
q
.
i=1
Proof. If a1 = · · · = an = 0 or if b1 = · · · = bn = 0, then the inequality is clearly satisfied. Therefore, we may assume that at least
Norms and Inner Products
337
one of a1 , . . . , an and at least one of b1 , . . . , bn is non-zero. Define the numbers ai bi xi = 1 and yi = 1 p ( ni=1 ai ) p ( ni=1 bqi ) q for 1 i n. Lemma 6.2 applied to xi , yi yields ai bi 1 api 1 bpi + . 1 1 p ni=1 api q ni=1 bpi ( ni=1 api ) p ( ni=1 bqi ) q Adding these inequalities, we obtain n
ai bi
i=1
because
1 p
+
1 q
n
1 p
api
i=1
n
1 bqi
q
i=1
= 1.
The nonnegativity of the numbers a1 , . . . , an , b1 , . . . , bn can be relaxed by using absolute values. Indeed, we can easily prove the following variant of Theorem 6.1. Theorem 6.2. Let a1 , . . . , an and b1 , . . . , bn be 2n numbers and let p and q be two numbers such that p1 + 1q = 1 and p > 1. We have 1 n 1 n n p q p q a b |a | · |b | . i i i i i=1
Proof.
i=1
i=1
By Theorem 6.1, we have n
|ai ||bi |
i=1
n
1 |ai |p
i=1
p
·
n
1 |bi |q
.
i=1
The needed equality follows from the fact that n n ai bi |ai ||bi |. i=1
q
i=1
Linear Algebra Tools for Data Mining (Second Edition)
338
Corollary 6.1 (The Cauchy–Schwarz inequality for Rn ). Let a1 , . . . , an and b1 , . . . , bn be 2n real numbers. We have
n n n
2 ai bi ai · b2i . i=1
i=1
i=1
Proof. The inequality follows immediately from Theorem 6.2 by taking p = q = 2. Theorem 6.3 (Minkowski’s inequality). Let a1 , . . . , an b1 , . . . , bn be 2n nonnegative real numbers. If p 1, we have 1 1 n 1 n n p p p p p (ai + bi )p ai + bi . i=1
i=1
and
i=1
If p < 1, the inequality sign is reversed. Proof. For p = 1, the inequality is immediate. Therefore, we can assume that p > 1. Note that n n n (ai + bi )p = ai (ai + bi )p−1 + bi (ai + bi )p−1 . i=1
i=1
i=1
By H¨older’s inequality for p, q such that p > 1 and 1p + 1q = 1, we have n 1 n 1 n q p p ai (ai + bi )p−1 ai (ai + bi )(p−1)q i=1
=
i=1 n
1 p
api
i=1
i=1 n (ai + bi )p
1 q
.
i=1
Similarly, we can write n
p−1
bi (ai + bi )
i=1
n
1 bpi
p
i=1
n (ai + bi )p
1 q
.
i=1
Adding the last two inequalities yields ⎛ 1 n 1 ⎞ n 1 n n p p q ⎠ (ai + bi )p ⎝ api + bpi (ai + bi )p , i=1
i=1
i=1
i=1
Norms and Inner Products
339
which is equivalent to the inequality
n (ai + bi )p i=1
6.3
1
p
n
1 api
p
+
i=1
n i=1
1 bpi
p
.
Metric Spaces
Definition 6.1. A function d : S 2 −→ R0 is a metric if it has the following properties: (i) d(x, y) = 0 if and only if x = y for x, y ∈ S; (ii) d(x, y) = d(y, x) for x, y ∈ S; (iii) d(x, y) d(x, z) + d(z, y) for x, y, z ∈ S. The pair (S, d) will be referred to as a metric space. If property (i) is replaced by the weaker requirement that d(x, x) = 0 for x ∈ S, then we refer to d as a semimetric on S. Thus, if d is a semimetric, d(x, y) = 0 does not necessarily imply x = y and we can have for two distinct elements x, y of S, d(x, y) = 0. If d is a semimetric, then we refer to the pair (S, d) as a semimetric space. Example 6.1. Let S be a nonempty set. Define the mapping d : S 2 −→ R0 by 1 if u = v, d(u, v) = 0 otherwise for x, y ∈ S. It is easy to see that d satisfies the definiteness property. To prove that d satisfies the triangular inequality, we need to show that d(x, y) d(x, z) + d(z, y) for all x, y, z ∈ S. This is clearly the case if x = y. Suppose that x = y, so d(x, y) = 1. Then, for every z ∈ S, we have at least one of the inequalities x = z or z = y, so at least one of the numbers d(x, z) or d(z, y) equals 1. Thus, d satisfies the triangular inequality. The metric d introduced here is the discrete metric on S.
340
Linear Algebra Tools for Data Mining (Second Edition)
Example 6.2. Consider the mapping d : (Seqn (S))2 −→ R0 defined by d(p, q) = |{i | 0 i n − 1 and p(i) = q(i)}| for all sequences p, q of length n on the set S. It is easy to see that d is a metric. We justify here only the triangular inequality. Let p, q, r be three sequences of length n on the set S. If p(i) = q(i), then r(i) must be distinct from at least one of p(i) and q(i). Therefore, {i | 0 i n − 1 and p(i) = q(i)} ⊆ {i | 0 i n − 1 and p(i) = r(i)} ∪ {i | 0 i n − 1 and r(i) = q(i)}, which implies the triangular inequality. Example 6.3. For x ∈ Rn and y ∈ Rn , the Euclidean metric is the mapping
n
d2 (x, y) = (xi − yi )2 . i=1
The first two conditions of Definition 6.1 are obviously satisfied. To prove the third inequality, let x, y, z ∈ Rn . Choosing ai = xi − yi and bi = yi − zi for 1 i n in Minkowski’s inequality implies
n
n
n
(xi − zi )2 (xi − yi )2 + (yi − zi )2 , i=1
i=1
i=1
which amounts to d(x, z) d(x, y)+d(y, z). Thus, we conclude that d is indeed a metric on Rn . We frequently use the notions of closed sphere and open sphere. Definition 6.2. Let (S, d) be a metric space. The closed sphere centered in x ∈ S of radius r is the set Bd [x, r] = {y ∈ S|d(x, y) r}. The open sphere centered in x ∈ S of radius r is the set Bd (x, r) = {y ∈ S|d(x, y) < r}.
Norms and Inner Products
341
Definition 6.3. Let (S, d) be a metric space. The diameter of a subset U of S is the number diamS,d (U ) = sup{d(x, y) | x, y ∈ U }. The set U is bounded if diamS,d (U ) is finite. The diameter of the metric space (S, d) is the number diamS,d = sup{d(x, y) | x, y ∈ S}. If the metric space is clear from the context, then we denote the diameter of a subset U just by diam(U ). If (S, d) is a finite metric space, then diamS,d = max{d(x, y) | x, y ∈ S}. ˆ 0 can be extended to the set of subsets A mapping d : S ×S −→ R of S by defining d(U, V ) as d(U, V ) = inf{d(u, v) | u ∈ U and v ∈ V }
(6.2)
for U, V ∈ P(S). Observe that, even if d is a metric, then its extension is not, in general, a metric on P(S) because it does not satisfy the triangular inequality. Instead, we can show that for every U, V, W we have d(U, W ) d(U, V ) + diam(V ) + d(V, W ). Indeed, by the definition of d(U, V ) and d(V, W ), for every > 0, there exist u ∈ U , v, v ∈ V , and w ∈ W such that d(U, V ) d(u, v) d(U, V ) + 2 , d(V, W ) d(v , w) d(V, W ) + 2 . By the triangular axiom, we have d(u, w) d(u, v) + d(v, v ) + d(v , w). Hence, d(u, w) d(U, V ) + diam(V ) + d(V, W ) + , which implies d(U, W ) d(U, V ) + diam(V ) + d(V, W ) + for every > 0. This yields the needed inequality. Definition 6.4. Let (S, d) be a metric space. The sets U, V ∈ P(S) are separate if d(U, V ) > 0.
Linear Algebra Tools for Data Mining (Second Edition)
342
We denote the number d({u}, V ) = inf{d(u, v) | v ∈ V } by d(u, V ). It is clear that u ∈ V implies d(u, V ) = 0. The notion of dissimilarity is a generalization of the notion of metric. Definition 6.5. A dissimilarity on a set S is a function d : S 2 −→ R0 satisfying the following conditions: (i) d(x, x) = 0 for all x ∈ S; (ii) d(x, y) = d(y, x) for all x, y ∈ S. The pair (S, d) is a dissimilarity space. A related concept is the notion of similarity. Definition 6.6. A similarity on a set S is a function s : S 2 −→ R0 satisfying the following conditions: (i) s(x, y) s(x, x) = 1 for all x, y ∈ S; (ii) s(x, y) = s(y, x) for all x, y ∈ S. The pair (S, s) is a similarity space. Example 6.4. Let d : S 2 −→ R0 be a metric on the set S. Then s : S 2 −→ R0 defined by s(x, y) = 2−d(x,y) for x, y ∈ S is a dissimilarity, such that s(x, x) = 1 for every x, y ∈ S.
6.4
Norms
In this chapter, we study norms on real or complex linear spaces. Definition 6.7. A seminorm on an F-linear space V is a mapping ν : V −→ R that satisfies the following conditions: (i) ν(x + y) ν(x) + ν(y) (subadditivity), and (ii) ν(ax) = |a|ν(x) (positive homogeneity), for x, y ∈ V and a ∈ F . By taking a = 0 in the second condition of the definition, we have ν(0) = 0 for every seminorm on a real or complex space. A seminorm can be defined on every linear space. Indeed, if B is a basis of V, B = {v i | i ∈ I}, J is a finite subset of I, and
Norms and Inner Products
x=
i∈I
xi v i , define νJ (x) as 0 νJ (x) =
343
if x = 0, j∈J |aj | otherwise
for x ∈ V . We leave to the reader the verification of the fact that νJ is indeed a seminorm. Theorem 6.4. If V is a real or complex linear space and ν : V −→ R is a seminorm on V, then ν(x − y) |ν(x) − ν(y)|, for x, y ∈ V . Proof.
We have ν(x) ν(x − y) + ν(y), so ν(x) − ν(y) ν(x − y).
(6.3)
Since ν(x − y) = | − 1|ν(y − x) ν(y) − ν(x), we have −(ν(x) − ν(y)) ν(x) − ν(y). Inequalities (6.3) and (6.4) give the desired inequality.
(6.4)
Corollary 6.2. If p : V −→ R is a seminorm on V, then p(x) 0 for x ∈ V . Proof. By choosing y = 0 in the inequality of Theorem 6.4, we have ν(x) |ν(x)| 0. Definition 6.8. A norm on an F-linear space V is a seminorm ν : V −→ R such that ν(x) = 0 implies x = 0 for x ∈ V . The pair (V, ν) is referred to as a normed linear space. Example 6.5. The set of real-valued continuous functions defined on the interval [−1, 1] is a real linear space. The addition of two such functions f, g, is defined by (f + g)(x) = f (x) + g(x) for x ∈ [−1, 1]; the multiplication of f by a scalar a ∈ R is (af )(x) = af (x) for x ∈ [−1, 1]. Define ν(f ) = sup{|f (x)| | x ∈ [−1, 1]}. Since |f (x)| ν(f ) and |g(x)| ν(g) for x ∈ [−1, 1]}, it follows that |(f + g)(x)| |f (x)| + |g(x)| ν(f ) + ν(g). Thus, ν(f + g) ν(f ) + ν(g). We leave to the reader the verification of the remaining properties of Definition 6.7. We denote ν(f ) by f .
Linear Algebra Tools for Data Mining (Second Edition)
344
Theorem 6.5. For p 1, the function νp : Rn −→ R0 defined by νp (x) =
n
1
p
|xi |p
,
i=1
⎞ x1 ⎜ ⎟ where x = ⎝ ... ⎠ ∈ Rn , is a norm on Rn . xn ⎛
Proof. We must prove that νp satisfies the conditions of Definition 6.7 and that νp (x) = 0 implies x = 0. Let ⎛ ⎞ ⎛ ⎞ x1 y1 ⎜ .. ⎟ ⎜ .. ⎟ x = ⎝ . ⎠ and y = ⎝ . ⎠. xn yn Minkowski’s inequality applied to the nonnegative numbers ai = |xi | and bi = |yi | amounts to
n
1 p
p
(|xi | + |yi |)
n
i=1
1 p
p
|xi |
+
i=1
n
1 p
p
|yi |
.
i=1
Since |xi + yi | |xi | + |yi | for every i, we have
n (|xi + yi |)p i=1
1
p
n i=1
1 |xi |p
p
+
n
1 |yi |p
p
,
i=1
that is, νp (x + y) νp (x) + νp (y). We leave to the reader the verification of the remaining conditions. Thus, νp is a norm on Rn . Example 6.6. The mapping ν1 : Rn −→ R given by ν1 (x) = |x1 | + |x2 | + · · · + |xn |,
⎞ x1 ⎜ ⎟ for x = ⎝ ... ⎠, is a norm on Rn . xn ⎛
Norms and Inner Products
345
Example 6.7. A special norm on Rn is the function ν∞ : Rn −→ R0 given by ν∞ (x) = max{|xi | | 1 i n}.
(6.5)
We verify here that ν∞ satisfies the first condition of Definition 6.7. We start from the inequality |xi + yi | |xi | + |yi | ν∞ (x) + ν∞ (y) for every i, 1 i n. This in turn implies ν∞ (x + y) = max{|xi + yi | | 1 i n} ν∞ (x) + ν∞ (y), which gives the desired inequality. This norm can be regarded as a limit case of the norms νp . Indeed, let x ∈ Rn and let M = max{|xi | | 1 i n} = |x1 | = · · · = |xk | for some 1 , . . . , k , where 1 1 , . . . , k n. Here x1 , . . . , xk are the components of x that have the maximal absolute value and k 1. We can write n 1 |xi | p p 1 = lim M (k) p = M, lim νp (x) = lim M p→∞ p→∞ p→∞ M i=1
which justifies the notation ν∞ . We use the alternative notation xp for νp (x). We refer to x2 as the Euclidean norm of x and we denote this norm simply by x when there is no risk of confusion. of sequences Example 6.8. For p 1, let p be the set that consists p of real numbers x = (x0 , x1 , . . .) such that the series ∞ i=0 |xi | is convergent. We can show that p is a linear space. Let x, y ∈ p be two sequences in p . Using Minkowski’s inequality, we have n i=0
|xi + yi |p
n n n (|xi | + |yi |)p |xi |p + |yi |p , i=0
i=0
i=0
which shows that x + y ∈ p . It is immediate that x ∈ p implies ax ∈ p for every a ∈ R and x ∈ p .
346
Linear Algebra Tools for Data Mining (Second Edition)
The following statement shows that any norm defined on a linear space generates a metric on the space. Theorem 6.6. Each norm ν : V −→ R0 on a real linear space V generates a metric on the set V defined by dν (x, y) = ν(x − y) for x, y ∈ V . Proof. Note that if dν (x, y) = ν(x − y) = 0, it follows that x − y = 0; that is, x = y. The symmetry of dν is obvious and so we need to verify only the triangular axiom. Let x, y, z ∈ L. Applying the subadditivity of norms, we have ν(x − z) = ν(x − y + y − z) ν(x − y) + ν(y − z), or, equivalently, dν (x, z) dν (x, y) + dν (y, z), for every x, y, z ∈ L, which concludes the argument. We refer to dν as the metric induced by the norm ν on the linear space V. Observe that the norm ν can be expressed using dν as ν(x) = dν (x, 0)
(6.6)
for x ∈ V . For p 1, then dp denotes the metric dνp induced by the norm νp on the linear space Rn known as the Minkowski metric on Rn . If p = 2, we have the Euclidean metric on Rn given by
n
n
2 |xi − yi | = (xi − yi )2 . d2 (x, y) = i=1
i=1
For p = 1, we have d1 (x, y) =
n
|xi − yi |.
i=1
This metric is known also as the city-block metric. The norm ν∞ generates the metric d∞ given by d∞ (x, y) = max{|xi − yi | | 1 i n}, also known as the Chebyshev metric.
Norms and Inner Products
6
x = (x0 , x1 ) Fig. 6.1
347
y = (y0 , y1 )
(y0 , x1 ) -
The distances d1 (x, y) and d2 (x, y).
A representation of these metrics can be seen in Figure 6.1 for the special case of R2 . If x = (x0 , x1 ) and y = (y0 , y1 ), then d2 (x, y) is the length of the hypotenuse of the right triangle and d1 (x, y) is the sum of the lengths of the two legs of the triangle. We can reformulate now the notion of bounded set introduced in Definition 6.3 for general metric spaces in terms of norms, defined on linear spaces. Theorem 6.7. Let ν be a norm on a linear space V. A subset U of V is bounded in the metric space (V, dν ) if and only if there is b ∈ R0 such that ν(u) b for u ∈ U . Proof. Suppose that U is a bounded set in the sense of Definition 6.3, that is, there exists c ∈ R0 such that sup{dν (x, y) | x, y ∈ U } = c. Let x0 be a fixed element of U . Since ν(x) = dν (x, 0) dν (x, x0 ) + d(x0 , 0) = dν (x, x0 ) + ν(x0 ), it is immediate that ν(x) c + ν(x0 ), so we can define b as b = c + ν(x0 ). Conversely, suppose that ν(u) b for u ∈ U . Then dν (x, y) dν (x, 0) + dν (0, y) 2b, for every x, y ∈ U . Thus, U is bounded in the metric space (V, dν ). Theorem 6.8 (Projections on closed sets theorem). Let U be a closed subset of Rn such that U = ∅ and let x0 ∈ Rn − U . Then there exists x1 ∈ U such that x − x0 2 x1 − x0 2 for every x ∈ U. Proof. Let d = inf{x − x0 2 | x ∈ U } and let Un = U ∩ B x0 , d + n1 . Note that the sets form a descending sequence of bounded and closed sets U1 ⊇ U2 ⊇ · · · ⊇ Un ⊇ · · · . Since U1 is
Linear Algebra Tools for Data Mining (Second Edition)
348
compact, n1 Un = ∅. Let x1 ∈ n1 Un . Since Un ⊆ U for every n, it follows that x1 ∈ U . 1 Note 1 − x0 2 d + n for every n because x1 ∈ Un = that x U ∩ B x0 , d + n1 . This implies x1 − x0 2 d x − x0 2 for every x ∈ U. Theorem 6.9 to follow allows us to compare the norms νp (and the metrics of the form dp ) that were introduced on Rn . We begin with a preliminary result. Lemma 6.3. Let a1 , . . . , an be n positive numbers. If p and q are two positive numbers such that p q, then 1
1
(ap1 + · · · + apn ) p (aq1 + · · · + aqn ) q . Proof.
Let f : R>0 −→ R be the function defined by 1
f (r) = (ar1 + · · · + arn ) r . Since ln f (r) =
ln (ar1 + · · · + arn ) , r
it follows that 1 1 ar ln a1 + · · · + arn ln ar f (r) = − 2 ln (ar1 + · · · + arn ) + · 1 . f (r) r r ar1 + · · · + arn To prove that f (r) < 0, it suffices to show that ln (ar1 + · · · + arn ) ar1 ln a1 + · · · + arn ln ar . ar1 + · · · + arn r This last inequality is easily seen to be equivalent to n
ar i=1 1
ari ari ln r 0, r + · · · + an a1 + · · · + arn
which holds because ari 1 ar1 + · · · + arn for 1 i n.
Norms and Inner Products
349
Theorem 6.9. Let p and q be two positive numbers such that p q. For every u ∈ Rn , we have up uq . Proof.
This statement follows immediately from Lemma 6.3.
Corollary 6.3. Let p, q be two positive numbers such that p q. For every x, y ∈ Rn , we have dp (x, y) dq (x, y). Proof.
This statement follows immediately from Theorem 6.9.
Example 6.9. For p = 1 and q = 2, the inequality of Theorem 6.9 becomes
n n
|ui | |ui |2 , i=1
i=1
which is equivalent to n
n 2 |u | i i=1 i=1 |ui | . n n Theorem 6.10. Let p 1. For every x ∈ Rn , we have
(6.7)
x∞ xp nx∞ . Proof.
Starting from the definition of νp , we have n 1 p 1 1 |xi |p n p max |xi | = n p x∞ . xp = i=1
1in
The first inequality is immediate.
Corollary 6.4. Let p and q be two numbers such that p, q 1. There exist two constants c, d ∈ R>0 such that cxq xp dxq for x ∈ Rn . Proof. Since x∞ xp and xq nx∞ , it follows that xq nxp . Exchanging the roles of p and q, we have xp nxq , so
for every x ∈ Rn .
1 xq xp nxq n
350
Linear Algebra Tools for Data Mining (Second Edition)
Corollary 6.5. For every x, y ∈ Rn and p 1, we have d∞ (x, y) dp (x, y) nd∞ (x, y). Further, for p, q > 1, there exist c, d ∈ R>0 such that cdq (x, y) dp (x, y) cdq (x, y) for x, y ∈ Rn . Proof.
This statement follows from Theorem 6.10.
Corollary 6.3 implies that if p q, then the closed sphere Bdp (x, r) is included in the closed sphere Bdq (x, r). For example, we have Bd1 (0, 1) ⊆ Bd2 (0, 1) ⊆ Bd∞ (0, 1). In Figures 6.2 (a)–(c), we represent the closed spheres Bd1 (0, 1), Bd2 (0, 1), and Bd∞ (0, 1). A useful consequence of Theorem 6.1 is the following statement: y1 , . . . , ym be 2m nonnegative Theorem 6.11. Let 1 , . . . , xm and x m m x = y numbers such that i=1 i i=1 i = 1 and let p and q be two 1 1 positive numbers such that p + q = 1. We have m
1
1
xjp yjq 1.
j=1 1
1
1
1
p q and y1q , . . . , ym Proof. The H¨older inequality applied to x1p , . . . , xm yields the needed inequality
m
1
j=1
@ @
Fig. 6.2
m
xj
j=1
6 @ @-
(a)
1
xjp yjq
m
yj = 1.
j=1
6
6
(b)
(c)
Spheres Bdp (0, 1) for p = 1, 2, ∞.
Norms and Inner Products
351
Theorem 6.11 allows the formulation of a generalization of the H¨older inequality. Theorem 6.12. Let A be an n×m matrix, A = (aij ), having positive entries such that m n. If p = (p1 , . . . , pn ) is j=1 aij = 1 for 1 i an n-tuple of positive numbers such that ni=1 pi = 1, then n m
apiji 1.
j=1 i=1
Proof. The argument is by induction on n 2. The basis case, n = 2, follows immediately from Theorem 6.11 by choosing p = p11 , q = p12 , xj = a1j , and yj = a2j for 1 j m. Suppose that the statement holds for n, let A be an (n + 1) × mmatrix having positive entries such that m j=1 aij = 1 for 1 i n+ 1, and let p = (p1 , . . . , pn , pn+1 ) be such that p1 + · · · + pn + pn+1 = 1. It is easy to see that m n+1 j=1 i=1
apiji
m
p
n−1 pn +pn+1 ap1j1 an−1 . j (anj + an+1 j )
j=1
By applying the inductive hypothesis, we have
m n+1 j=1
i=1
apiji 1.
A more general form of Theorem 6.12 is given next. Theorem 6.13. Let A be an n×m matrix, A = (aij ), having positive entries. If p = (p1 , . . . , pn ) is an n-tuple of positive numbers such that n p i=1 i = 1, then n m j=1 i=1
Proof.
apiji
n i=1
⎛ ⎝
m
⎞pi aij ⎠ .
j=1
Let B = (bij ) be the matrix defined by aij bij = m
j=1 aij
Linear Algebra Tools for Data Mining (Second Edition)
352
for 1 i n and 1 j m. Since m j=1 bij = 1, we can apply Theorem 6.12 to this matrix. Thus, we can write p i n n m m a m ij bpiji = j=1 aij j=1 i=1
j=1 i=1 m n
apiji pi = m a j=1 i=1 j=1 ij m n pi j=1 i=1 aij pi 1. = m n i=1 j=1 aij
We now give a generalization of Minkowski’s inequality (Theorem 6.3). First, we need a preliminary result. Lemma 6.4. If a1 , . . . , an and b1 , . . . , bn are positive numbers and r < 0, then n r n 1−r n r 1−r ai bi ai · bi . i=1
i=1
i=1
Proof. Let cn1 , . . . , cn , d1 , . . . , dn be 2n positive numbers such that n c = i=1 i i=1 di = 1. Inequality (6.1) applied to the numbers 1
1
a = cip and b = di q yields 1
1
cip diq
ci di + . p q
Summing these inequalities produces the inequality n
1
1
cip diq 1,
i=1
or n
cri d1−r 1, i
i=1
where r = 1p < 0. Choosing ci = the desired inequality.
nai
i=1
ai
and di =
nbi
i=1 bi
, we obtain
Norms and Inner Products
353
Theorem 6.14. Let A be an n×m matrix, A = (aij ), having positive entries, and let p and q be two numbers such that p > q and p = 0, q = 0. We have ⎛ ⎝
n m j=1
Proof.
⎛ ⎛ ⎞ p ⎞ p1 q ⎞ 1q q n m p ⎜ p q ⎠ ⎟ ⎠ ⎝ aij ⎝ aij ⎠ .
i=1
i=1
j=1
Define ⎛ n q ⎞ 1q m p p ⎠ , aij E=⎝ j=1
i=1
⎛ ⎛ ⎞ p ⎞ p1 q n m ⎜ ⎝ q ⎠ ⎟ aij F =⎝ ⎠ , i=1
j=1
q and ui = m j=1 aij for 1 i n. There are three distinct cases to consider related to the position of 0 relative to p and q. Suppose initially that p > q > 0. We have Fp = =
n
p
uiq
i=1 n m
= p
aqij uiq
−1
=
i=1 j=1
n
p
ui uiq
i=1 m n
−1
p
aqij uiq
−1
.
j=1 i=1
By applying the H¨older inequality, we have n q n 1− q n p p q p p pq −1 p q q −1 q p−q aij ui (aij ) · (ui ) i=1
=
i=1 n i=1
q apij
p
·
i=1 n
p q
ui
(6.8)
1− q
p
,
i=1
which implies F p E q F p−q . This, in turn, gives F q E q , which implies the generalized Minkowski inequality.
Linear Algebra Tools for Data Mining (Second Edition)
354
Suppose now that 0 > p > q, so 0 < −p < −q. Applying the generalized Minkowski inequality to the positive numbers bij = a1ij gives the inequality ⎛ ⎛ ⎞ q ⎞− 1q ⎛ n p ⎞− p1 p m n m q −q ⎜ ⎝ −p ⎠ ⎟ ⎠ ⎝ bij ⎝ bij ⎠ , j=1
i=1
i=1
j=1
which is equivalent to ⎛ ⎝
m
j=1
n
⎛ ⎛ ⎞ q ⎞− 1q p ⎞− p1 p n m q p ⎜ ⎟ q ⎝ aij ⎠ ⎝ aij ⎠ ⎠ .
i=1
i=1
j=1
A last transformation gives ⎛ ⎝
m j=1
n
⎛ ⎛ ⎞ q ⎞ 1q p n m q p ⎟ ⎠ ⎜ ⎝ ⎠ aij ⎝ ⎠ ,
p ⎞ 1p aqij
i=1
i=1
j=1
which is the inequality to be proven. Finally, suppose that p > 0 > q. Since pq < 0, Inequality (6.9) is replaced by the opposite inequality through the application of Lemma 6.4: n q n 1− q n p p p p pq q q −1 aij ui aij · ui . i=1
i=1
i=1
This leads to F p E q F p−q or F q E q . Since q < 0, this implies F E. 6.5
The Topology of Normed Linear Spaces
Every norm ν defined on a linear space V generates a metric d : V 2 −→ R0 given by d(x, y) = ν(x − y). Therefore, any normed space can be equipped with the topology of a metric space, using the metric defined by the norm. Since this topology is induced by a
Norms and Inner Products
355
metric, any normed space is a Hausdorff space. Further, if v ∈ V , then the collection of subsets {Bd (v, r) | r > 0} is a fundamental system of neighborhoods for v. By specializing the definition of local continuity of functions between metric spaces, a function f : V −→ W between two normed spaces (V, ν) and (W, ν ) is continuous in x0 ∈ V if for every > 0 there exists δ > 0 such that ν(x − x0 ) < δ implies ν (f (x) − f (x0 )) < . A sequence (x0 , x1 , . . .) of elements of V converges to x if for every > 0 there exists n ∈ N such that n n implies ν(xn − x) < . Theorem 6.15. In a normed linear space (V, ν), the norm, the multiplication by scalars, and the vector addition are continuous functions. Proof. By Theorem 6.4, we have ν(x − y) |ν(x) − ν(y)| for every x, y ∈ V . Therefore, if limn→∞ xn = x, we have ν(xn − x) |ν(xn ) − ν(x)|, which implies limn→∞ ν(xn ) = ν(x). Thus, the norm is continuous. Suppose now that limn→∞ an = a and limn→∞ xn = x, where (an ) is a sequence of scalars. Since the sequence (xn ) is bounded, we have ν(ax − an xn ) ν(ax − an x) + ν(an x − an xn ) |a − an |ν(x) + an ν(x − xn ), which implies that limn→∞ an xn = ax. This shows that the multiplication by scalars is a continuous function. To prove that the vector addition is continuous, let (xn ) and (y n ) be two sequences in V such that limn→∞ xn = x and limn→∞ y n = y. Note that ν ((x + y) − (xn + y n )) ν(x − xn ) + ν(y − y n ), which implies that limn→∞ (xn + y n ) = x + y. Thus, the vector addition is continuous. Definition 6.9. Two norms ν and ν on a linear space V are equivalent if they generate the same topology.
356
Linear Algebra Tools for Data Mining (Second Edition)
Theorem 6.16. Let V be a linear space and let ν : V −→ R0 and ν : V −→ R0 be two norms on V that generate the topologies O and O on V, respectively. The topology O is finer than the topology O (that is, O ⊆ O ) if and only if there exists c ∈ R>0 such that ν(v) cν (v) for every v ∈V. Proof. Suppose that O ⊆ O . Then, any open sphere Bν (0, r0 ) = {x ∈ V | ν(x) < r0 } (in O) must be an open set in O . Therefore, there exists an open sphere Bν (0, r1 ) such that Bν (0, r1 ) ⊆ Bν (0, r0 ). This means that for r0 ∈ R0 and v ∈ V there exists r1 ∈ R0 such that ν (v) < r1 implies ν(v) < r0 for every u ∈ V . In particular, for r0 = 1, there is k > 0 such that ν (v) < k implies ν(v) < 1, which is equivalent to cν (v) < 1 implies ν(v) < 1, for every v ∈ V and c = k1 . v 1 For w = c+ ν (v) , where > 0, it follows that
cν (w) = cν so
ν(w) = ν
v 1 c + ν (v)
v 1 c + ν (v)
=
=
c < 1, c+
1 ν(v) < 1. c + ν (v)
Since this inequality holds for every > 0, it follows that ν(v) cν (v). Conversely, suppose that there exists c ∈ R>0 such that ν(v) cν (v) for every v ∈ V . Since r ⊆ {v | ν(v) r} v | ν (v) c for v ∈ V and r > 0, it follows that O ⊆ O .
Corollary 6.6. Let V be a linear space and let ν : V −→ R0 and ν : V −→ R0 be two norms on V. Then ν and ν are equivalent norms if and only if there exist a, b ∈ R>0 such that aν(v) ν (v) bν(v) for v ∈ V .
Norms and Inner Products
Proof.
This statement follows directly from Theorem 6.16.
357
Example 6.10. By Corollary 6.4, any two norms νp and νq on Rn (with p, q 1) are equivalent. Continuous linear operators between normed spaces have a simple characterization. Theorem 6.17. Let (V, ν) and (V , ν ) be two normed F-linear spaces where F is either R or C. A linear operator f : V −→ V is continuous if and only if there exists M ∈ R>0 such that ν (f (x)) M ν(x) for every x ∈ V . Proof. Suppose that f : V −→ V satisfies the condition of the theorem. Then r f Bν 0, ⊆ Bν (0, r) M for every r > 0, which means that f is continuous in 0 and, therefore, it is continuous everywhere (by Theorem 2.51). Conversely, suppose that f is continuous. Then there exists δ > 0 such that f (Bν (0, δ)) ⊆ Bν (f (x), 1), which is equivalent to ν(x) < δ, implies ν (f (x)) < 1. Let > 0 and let z ∈ V be defined by z= We have ν(z) = equivalent to
δν(x) ν(x)+
δ x. ν(x) +
< δ. This implies ν (f (z)) < 1, which is
δ ν (f (x)) < 1 ν(x) + because of the linearity of f . This means that ν (f (x))
0, so ν (f (x)) 1δ ν(x).
Lemma 6.5. Let (V, ν) and (V , ν ) be two normed F-linear spaces where F is either R or C. A linear function f : V −→ V is not injective if and only if there exists u ∈ V − {0V } such that f (u) = 0V .
358
Linear Algebra Tools for Data Mining (Second Edition)
Proof. It is clear that the condition of the lemma is sufficient for failing injectivity. Conversely, suppose that f is not injective. There exist t, v ∈ V such that t = v and f (t) = f (v). The linearity of f implies f (t − v) = 0V . By defining u = t − v = 0V , we have the desired element u. Theorem 6.18. Let (V, ν) and (V , ν ) be two normed F-linear spaces where F is either R or C. A linear function f : V −→ V is injective if and only if there exists m ∈ R>0 such that ν (f (x)) mν(x) for every x ∈ V . Proof. Suppose that f is not injective. By Lemma 6.5, there exists u ∈ V − {0V } such that f (u) = 0V , so ν (f (u)) < mν(u) for any m > 0. Thus, the condition of the theorem is sufficient for injectivity. Suppose that f is injective, so the inverse function f −1 : V −→ V is a linear function. By Theorem 6.17, there exists M > 0 such that ν(f −1 (y)) M ν (y) for every y ∈ V . Choosing y = f (x) yields ν(x) M ν (f (x), so 1 , which concludes the argument. ν (f (x)) mν(x) for m = M Corollary 6.7. Every linear function f : Cm −→ Cn is continuous. Proof. Suppose that both Cm and Cn are equipped with the norm ν1 . If x ∈ Cm , we can write x = x1 e1 + · · · + xm xm and the linearity of f implies ν1 (f (x)) = ν1
f
m
xi ei
= ν1
i=1
m
|xi |ν1 (f (ei )) M
i=1
where M = follows.
m
i=1 ν1 (f (ei )).
m
xi f (ei )
i=1 m
|xi | = M ν1 (x),
i=1
By Theorem 6.17, the continuity of f
Next, we introduce a norm on the linear space Hom(Cm , Cn ) of linear functions from Cm to Cn . Recall that if f : Cm −→ Cn is a linear function and ν, ν are norms on Cm and Cn , respectively, then
Norms and Inner Products
359
there exists a non-negative constant m such that ν (f (x)) M ν(x) for every x ∈ Cm . Define the norm of f , μ(f ), as μ(f ) = inf{M ∈ R0 | ν (f (x)) M ν(x) for every x ∈ Cm }. (6.9) Theorem 6.19. The mapping μ defined by Equality (6.9) is a norm on the linear space of linear functions Hom(Cm , Cn ). Proof. Let f, g be two functions in Hom(Cm , Cn ). There exist Mf and Mg in R0 such that ν (f (x)) Mf ν(x) and ν (g(x)) Mg ν(x) for every x ∈ V . Thus, ν ((f + g)(x)) = ν (f (x) + g(x)) ν (f (x)) + ν (g(x)) (Mf + Mg )ν(x),
so Mf + Mg ∈ {M ∈ R0 | ν ((f + g)(x)) M ν(x) for every x ∈ V }. Therefore, μ(f + g) μ(f ) + μ(g). We leave to the reader the verification of the remaining norm prop erties of μ. Since the norm μ defined by Equality (6.9) depends on the norms ν and ν , we denote it by N (ν, ν ). Theorem 6.20. Let f : Cm −→ Cn and g : Cn −→ Cp and let μ = N (ν, ν ), μ = N (ν , ν ) and μ = N (μ, μ ), where ν, ν , ν are norms on Cm , Cn , and Cp , respectively. We have μ (gf ) μ(f )μ (g). Proof.
Let x ∈ Cm . We have ν (f (x)) (μ(f ) + )ν(x)
for every > 0. Similarly, for y ∈ Cn , ν (g(y)) (μ (g) + )ν (y) for every > 0. These inequalities imply ν (g(f (x)) (μ (g) + )ν (f (x)) (μ (g) + )μ(f ) + )ν(x). Thus, we have μ (gf ) (μ (g) + )μ(f ) + ) for every and . This allows us to conclude that μ (f g) μ(f )μ (g).
360
Linear Algebra Tools for Data Mining (Second Edition)
Equivalent definitions of the norm μ = N (ν, ν ) are given next. Theorem 6.21. Let f : Cm −→ Cn and let ν and ν be two norms defined on Cm and Cn , respectively. If μ = N (ν, ν ), we have (i) μ(f ) = inf{M ∈ R0 | ν (f (x)) M ν(x) for every x ∈ Cm }; (ii) μ(f ) = sup{ν (f (x)) | ν(x) 1}; (iii) μ(f ) = max{ν (f (x)) | ν(x) 1}; (iv) μ(f ) = max{ν (f (x)) | ν(x) = 1}; ν (f (x)) m − {0 } . (v) μ(f ) = sup | x ∈ C m ν(x) Proof. The first equality is the definition of μ(f ). Let be a positive number. By the definition of the infimum, there exists M such that ν (f (x)) M ν(x) for every x ∈ Cm and M μ(f )+. Thus, for any x such that ν(x) 1, we have ν (f (x)) M μ(f ) + . Since this inequality holds for every , it follows that ν (f (x)) μ(f ) for every x ∈ Cm with ν(x) 1. Furthermore, if is a positive number, we claim that there exists x0 ∈ Cm such that ν(x0 ) 1 and μ(f ) − ν (f (x0 )) μ(f ). Suppose that this is not the case. Then, for every z ∈Cn with ν(z) 1 n 1, we have ν (f (z)) μ(f ) − . If x ∈ C , then ν ν(x) x = 1, so ν (f (x)) (μ(f ) − )ν(x), which contradicts the definition of μ(f ). This allows us to conclude that μ(f ) = sup{ν (f (x)) | ν(x) 1}, which proves the second equality. Observe that the third equality (where we replaced sup by max) holds because the closed sphere B(0, 1) is a compact set in Rn . Thus, we have μ(A) = max{ν (f (x)) | ν(x) 1}.
(6.10)
For the fourth equality, since {x | ν(x) = 1} ⊆ {x | ν(x) 1}, it follows that max{ν (f (x)) | ν(x) = 1} max{ν (f (x)) | ν(x) 1} = μ(f ). By the third equality, there exists a vector z ∈ Rn − {0} such that ν(z) 1 and ν (f (z)) = μ(f ). Thus, we have z z ν f . μ(f ) = ν(z)ν f ν(z) ν(z)
Norms and Inner Products
361
z Since ν ν(z) = 1, it follows that μ(A) max{ν (f (x)) | ν(x) = 1}. This yields the desired conclusion. Finally, to prove the last equality observe that for every x ∈ 1 1 m C − {0m }, ν(x) x is a unit vector. Thus, ν (f ( ν(x) x) μ(f ), by the fourth equality. On the other hand, by the third equality, there exists x0 such that ν(x0 ) = 1 and ν (f (x0 )) = μ(f ). This concludes the argument. 6.6
Norms for Matrices
In Chapter 3, we saw that the set Cm×n is a linear space. Therefore, it is natural to consider norms defined on matrices. We discuss two basic methods for defining norms for matrices. The first approach treats matrices as vectors (through the vec mapping). The second regards matrices as representations of linear operators, and defines norms for matrices starting from operator norms. The vectorization mapping vec was introduced in Definition 3.16. Its use allows us to treat a matrix A ∈ Cm×n as a vector from Cmn . Using vector norms on Cmn , we can define vectorial norms of matrices. Definition 6.10. Let ν be a vector norm on the space Rmn . The vectorial matrix norm μ(m,n) on Rm×n is the mapping μ(m,n) : Rm×n −→ R0 defined by μ(m,n) (A) = ν(vec(A)) for A ∈ Rm×n . Vectorial norms of matrices are defined without regard for matrix products. The link between linear transformations of finite-dimensional linear spaces and Theorem 6.20 suggests the introduction of an additional condition. Since every matrix A ∈ Cm×n corresponds to a linear transformation hA : Cm −→ Cn , if ν and ν are norms on Cm and Cn , respectively, it is natural to define a norm on Cm×n as μ(A) = μ(hA ), where μ = N (ν, ν ) is a norm on the space of linear transformations between Cm and Cn . Suppose that ν, ν , and ν are vector norms defined on Cm , Cn , and Cp , respectively. By Theorem 6.20, μ (gf ) μ(f )μ (g), where μ = N (ν, ν ), μ = N (ν , ν ), and μ = N (μ, μ ), so μ (AB) μ(A)μ (B). This suggests the following definition.
Linear Algebra Tools for Data Mining (Second Edition)
362
Definition 6.11. A consistent family of matrix norms is a family of functions μ(m,n) : Cm×n −→ R0 , where m, n ∈ P, that satisfies the following conditions: (i) μ(m,n) (A) = 0 if and only if A = Om,n ; (ii) μ(m,n) (A+B) μ(m,n) (A)+μ(m,n) (B) (the subadditivity property); (iii) μ(m,n) (aA) = |a|μ(m,n) (A); (iv) μ(m,p) (AB) μ(m,n) (A)μ(n,p) (B) for every matrix A ∈ Rm×n and B ∈ Rn×p (the submultiplicative property). If the format of the matrix A is clear from context or is irrelevant, then we shall write μ(A) instead of μ(m,n) (A). Example 6.11. Let P ∈ Cn×n be an idempotent matrix. If μ is a matrix norm, then either μ(P ) = 0 or μ(P ) 1. Indeed, since P is idempotent, we have μ(P ) = μ(P 2 ). By the submultiplicative property, μ(P 2 ) (μ(P ))2 , so μ(P ) (μ(P ))2 . Consequently, if μ(P ) = 0, then μ(P ) 1. Some vectorial matrix norms turn out to be actual matrix norms; others fail to be matrix norms. This point is illustrated by the next two examples. by Example 6.12. Consider the vectorial matrix norm μ1 induced m×n . |a | for A ∈ R the vector norm ν1 . We have μ1 (A) = ni=1 m ij j=1 Actually, this is a matrix norm. To prove this fact, consider the matrices A ∈ Rm×p and B ∈ Rp×n . We have p p n n m m aik bkj |aik bkj | μ1 (AB) =
i=1 j=1 k=1 p m n
p
k =1
k =1
i=1 j=1
i=1 j=1 k=1
|aik ||bk j |
(because we added extra non-negative terms to the sums) ⎞ ⎛ n p m p |aik | · ⎝ |bk j |⎠ = i=1 k =1
= μ1 (A)μ1 (B).
j=1 k =1
Norms and Inner Products
363
We denote this vectorial matrix norm by the same notation as the corresponding vector norm, that is, by A1 . The vectorial matrix norm μ2 induced by the vector norm ν2 is also a matrix norm. Indeed, using the notations as above, we have p 2 m n aik bkj (μ2 (AB)) = i=1 j=1 k=1 p p n m 2 2 |aik | |blj | 2
i=1 j=1
k=1
l=1
(by Cauchy–Schwarz inequality) (μ2 (A))2 (μ2 (B))2 . The vectorial norm of A ∈ Cm×n , ⎛ ⎞1 2 n m 2⎠ ⎝ |aij | , μ2 (A) = i=1 j=1
denoted also by AF , is known as the Frobenius norm. For A ∈ Rm×n , we have
m
n a2ij . AF = i=1 j=1
It is easy to see that for real matrices we have A2F = trace(AA ) = trace(A A).
(6.11)
For complex matrices, the corresponding equality is A2F = trace(AAH ) = trace(AH A). Note that AH 2F = A2F for every A.
(6.12)
364
Linear Algebra Tools for Data Mining (Second Edition)
Example 6.13. The vectorial norm μ∞ induced by the vector norm ν∞ is denoted by A∞ and is given by A∞ = max |aij | i,j
for A ∈ Cn×n . This is not a matrix norm. Indeed, let a, b be two positive numbers and consider the matrices A=
a a a a
and B =
b b . b b
We have A∞ = a and B∞ = b. However, since AB =
2ab 2ab , 2ab 2ab
we have AB∞ = 2ab and the submultiplicative property of matrix norms is violated. A technique that always produces matrix norms starting from vector norms is introduced in the next theorem. Definition 6.12. Let νm be a norm on Cm and νn be a norm on Cn and let A ∈ Cn×m be a matrix. The operator norm of A is the number μ(n,m) (A) = μ(n,m) (hA ), where μ(n,m) = N (νm , νn ). Theorem 6.22. Let {νn | n 1} be a family of vector norms, where νn is a vector norm on Cn . The family of norms {μ(n,m) | n, m 1} is consistent. Proof. It is easy to see that the family of norms {μ(n,m) | n, m 1} satisfies the first three conditions of Definition 6.11 because the corresponding operator norms satisfy similar conditions. For example, if μ(n,m) (A) = 0 for A ∈ Cn×m , this means that μ(n,m) (hA ) = 0, so νn (Ax) = 0 for every x ∈ Cm , such that νm (x) 1. This implies Ax = 0n for every x ∈ Rm , which, in turn, implies A = On,m . Since μ(Om,n ) = 0, the first condition is satisfied.
Norms and Inner Products
365
For the fourth condition of Definition 6.11 and A ∈ Cn×m and B ∈ Cm×p , we have μ(n,p) (AB) = sup{νn ((AB)x) | νp (x) 1} = sup{νn (A(Bx)) | νp (x) 1} Bx νm (Bx)νp (x) 1 = sup νn A νm (Bx) μ(n,m) (A) sup{νm (Bx)νp (x) 1} Bx (because νm ν(Bx) = 1) = μn,m (A)μm,p (B).
Theorem 6.21 implies the following equivalent definitions of μ(n,m) (A). Theorem 6.23. Let νn be a norm on Cn for n 1. The following equalities hold for μ(n,m) (A), where A ∈ C(n,m) : μ(n,m) (A) = inf{M ∈ R0 | νn (Ax) M νm (x) for every x ∈ Cm } = sup{νn (Ax) | νm (x) 1} = max{νn (Ax) | νm (x) 1} = max{ν (f (x)) | ν(x) = 1} ν (f (x)) m | x ∈ C − {0m } . = sup ν(x) Proof.
The theorem is simply a reformulation of Theorem 6.21.
Corollary 6.8. Let μ be the matrix norm on Cn×n induced by the vector norm ν. We have ν(Au) μ(A)ν(u) for every u ∈ Cn . Proof. The inequality is obviously satisfied when u = 0n . There1 u. Clearly, fore, we may assume that u = 0n and let x = ν(u) ν(x) = 1 and Equality (6.10) implies that 1 u μ(A) ν A ν(u) for every u ∈ Cn − {0n }. This implies immediately the desired inequality.
366
Linear Algebra Tools for Data Mining (Second Edition)
If μ is a matrix norm induced by a vector norm on Rn , then μ(In ) = sup{ν(In x) | ν(x) 1} = 1. This necessary condition can be used for identifying matrix norms that are not induced by vector norms. The operator matrix norm induced by the vector norm · p is denoted by ||| · |||p . Example 6.14. To compute |||A|||1 = sup{Ax1 | x1 1}, where A ∈ Rn×n , suppose that the columns of A are the vectors a1 , . . . , an , that is ⎛ ⎞ a1j ⎜ a2j ⎟ ⎜ ⎟ aj = ⎜ .. ⎟. ⎝ . ⎠ anj Let x ∈ Rn be a vector whose components are x1 , . . . , xn . Then, Ax = x1 a1 + · · · + xn an , so Ax1 = x1 a1 + · · · + xn an 1 n |xj |aj 1 j=1
max aj 1 j
n
|xj |
j=1
= max aj 1 · x1 . j
Thus, |||A|||1 maxj aj 1 . Let ej be the vector whose components are 0 with the exception of its jth component that is equal to 1. Clearly, we have ej 1 = 1 and aj = Aej . This, in turn implies aj 1 = Aej 1 |||A|||1 for 1 j n. Therefore, maxj aj 1 |||A|||1 , so |||A|||1 = max aj 1 = max j
j
n
|aij |.
i=1
In other words, |||A|||1 equals the maximum column sum of the absolute values.
Norms and Inner Products
367
Example 6.15. Consider now a matrix A ∈ Rn×n . We have n aij xj Ax∞ = max 1in j=1 max
1in
n
|aij xj |
j=1
max x∞ 1in
n
|aij |.
j=1
Consequently, if x∞ 1, we have Ax∞ max1in nj=1 |aij |. Thus, |||A|||∞ max1in nj=1 |aij |. The converse inequality is immediate if A = On,n . Therefore, assume that A = On×n , and let (ap1 , . . . , apn ) be any row of A that has at least one element distinct from 0. Define the vector z ∈ Rn by |a | pj if apj = 0, zj = apj 1 otherwise for 1 j n. It is clear that zj ∈ {−1, 1} for every j, 1 j n and, therefore, z∞ = 1. Moreover, we have |apj | = apj zj for 1 j n. Therefore, we can write n n n |apj | = apj zj apj zj j=1 j=1 j=1 n aij zj max 1in j=1 = Az∞ max{Ax∞ | x∞ 1} = |||A|||∞ . Since n this holds for every row of A, it follows that max1in j=1 |aij | |||A|||∞ , which proves that |||A|||∞ = max
1in
n
|aij |.
j=1
In other words, |||A|||∞ equals the maximum row sum of the absolute values.
Linear Algebra Tools for Data Mining (Second Edition)
368
Example 6.16. Let D = diag(d1 , . . . , dn ) ∈ Cn×n be a diagonal matrix. If x ∈ Cn , we have ⎞ ⎛ d1 x1 ⎟ ⎜ Dx = ⎝ ... ⎠, dn xn so |||D|||2 = max{Dx2 | x2 = 1} = max{ (d1 x1 )2 + · · · + (dn xn )2 | x21 + · · · + x2n = 1} = max{|di | | 1 1 n}. The next result shows that certain norms are invariant with respect to multiplication by unitary matrices. We refer to these norms as unitarily invariant norms. Theorem 6.24. Let U ∈ Cn×n be a unitary matrix. The following statements hold: (i) U x2 = x2 for every x ∈ Cn ; (ii) |||U A|||2 = |||A|||2 for every A ∈ Cn×p ; (iii) U AF = AF for every A ∈ Cn×p . Proof.
For the first part of the theorem, note that U x22 = (U x)H U x = xH U H U x = xH x = x22 ,
because U H A = In . The second part of the theorem is as follows: |||U A|||2 = max{(U A)x2 | x2 = 1} = max{U (Ax)2 | x2 = 1} = max{Ax2 | x2 = 1} (by Part (i)) = |||A|||2 . For the Frobenius norm, note that U AF = trace((U A)H U A) = trace(AH U H U A) = trace(AH A) = AF , by Equality (6.11).
Norms and Inner Products
369
Corollary 6.9. If U ∈ Cn×n is a unitary matrix, then |||U |||2 = 1. Proof. Since |||U |||2 = sup{U x2 | x2 1}, by Part (ii) of Theorem 6.24, |||U |||2 = sup{x2 | x2 1} = 1.
Corollary 6.10. Let A, U ∈ Cn×n . If U is a unitary matrix, then U H AU F = AF . Proof. Since U is a unitary matrix, so is U H . By Part (iii) of Theorem 6.24, U H AU F = AU F = U H AH 2F = AH 2F = A2F ,
which proves the corollary.
Example 6.17. Let S = {x ∈ Rn | x2 = 1} be the surface of the sphere in Rn . The image of S under the linear transformation hU that corresponds to the unitary matrix U is S itself. Indeed, by Theorem 6.24, hU (x)2 = x2 = 1, so hU (x) ∈ S for every x ∈ S. Also, note that hU restricted to S is a bijection because hU H (hU (x)) = x for every x ∈ Rn . More details on transformations of S are given in Supplement 19 of Chapter 9. Theorem 6.25. Let A ∈ Rn×n . We have |||A|||2 AF . Proof.
Let x ∈
Rn .
We have
⎞ r1 x ⎟ ⎜ Ax = ⎝ ... ⎠, ⎛
rn x where r 1 , . . . , r n are the rows of the matrix A. Thus, n 2 Ax2 i=1 (r i x) = . x2 x2 By Cauchy–Schwarz inequality, we have (r i x)2 r i 22 x22 , so
n Ax2
r i 22 = AF . x2 i=1
This implies |||A|||2 AF .
370
Linear Algebra Tools for Data Mining (Second Edition)
We shall prove in Chapter 9 (in Corollary 9.4) that for every A ∈ Rn×n , we have √ AF n|||A|||2 . (6.13)
6.7
Matrix Sequences and Matrix Series
In this section, we make use of the notion of series of complex numbers and we extend this concept ∞ to series of matrices. Recall (see [7], for example) that a series i=1 ai converges absolutely if the series ∞ |a | converges. Absolute convergence implies convergence. The i=1 i terms of an absolutely convergent series can be rearranged by altering its sum. The set of matrices Cm×p is a C-linear space, and the set of matrices Rm×p is an R-linear space. Using matrix norms, these spaces can be equipped with a topological structure, as we indicated above. We focus now on the normed linear space (Rp×p , ||| · |||), where ||| · |||) is a matrix norm. Let A ∈ Rp×p . We can show by induction on n that |||An ||| (|||A|||)n .
(6.14)
The base step, n = 0, is immediate. Suppose that the inequality holds for n. We have |||An+1 ||| = |||An A||| |||An ||||||A||| (because ||| · ||| is a matrix norm) (|||A|||)n |||A||| (by the inductive hypothesis) = (|||A|||)n+1 , which concludes our argument. If |||A||| < 1, the sequence of matrices (A, A2 , . . . , An , . . .) converges toward the zero matrix Op,p . Indeed, limn→∞ |||An − Op,p ||| = limn→∞ |||An ||| limn→∞ (|||A|||)n = 0, which shows that limn→∞ An = Op,p .
Norms and Inner Products
371
Definition 6.13. Let A = (A0 , A1 , . . . , An , . . .) be a sequence of matrices in Rp×p . A matrix series having A as its sequenceof terms is the sequence of matrices (S0 , S1 , . . . , Sn , . . .), where Si = ik=0 Ak . The series (S0 , S1 , . . . , Sn , . . .) is denoted also by A0 + A1 + · · · + An + · · · . We say that the series A0 + A1 + · · · + An + · · · converges to a matrix S if limn→∞ Sn = S. This is also denoted by A0 + A1 + · · · + An + · · · = S. The series A0 + A1 + · · · + An + · · · converges absolutely if each of the series (A0 )ij + (A1 )ij + · · · + (An )ij + · · · converges absolutely for 1 i, j n. The subadditivity property of the norm can be generalized to a series of matrices. Namely, if the series A0 + A1 + · · · + An + · · · converges to S, then ∞ |||Ai |||. |||S||| i=0
Indeed, by the usual subadditivity property, n |||Ai |||. |||A0 + A1 + · · · + An ||| i=0
This implies |||A0 + A1 + · · · + An ||| for every n ∈ N, so |||S||| matrix norm.
∞
i=0 |||Ai |||,
∞
|||Ai |||,
i=0
due to the continuity of the
Example 6.18. Let A ∈ Rp×p be a matrix such that |||A||| < 1. We claim that the matrix I −A is invertible and A0 +A1 +· · ·+An +· · · = (I − A)−1 . Suppose that I−A is not invertible. Then, the system (I−A)x = 0 has a non-trivial solution. This implies x = Ax, so |||A||| 1, which contradicts the hypothesis. Thus, I − A is an invertible matrix. Observe that (A0 + A1 + · · · + An )(I − A)−1 = I − An+1 .
Linear Algebra Tools for Data Mining (Second Edition)
372
Therefore, lim (A0 + A1 + · · · + An ) (I − A)−1 = lim (I − An+1 ) = I, n→∞
n→∞
since limn→∞ An+1 = O. This shows that the series A0 + A1 + · · · + An + · · · converges to the inverse of the matrix I − A, so (I − A)−1 =
∞
Ai .
i=0
Moreover, we have −1
|||(I − A)
∞ ∞ i ||| = A |||Ai ||| i=0
i=0
∞ (|||A|||)i = = i=0
1 , 1 − |||A|||
because |||A||| 1. Example 6.19. Let (x0 , x1 , . . . , xn , . . .) be a sequence of vectors defined inductively by xn+1 = Axn + b
(6.15)
for n ∈ N, where A ∈ Cn×n and b ∈ Cn . It is easy to verify that xn = An x0 + (An−1 + · · · + A + I)b. If |||A||| < 1, limn→∞ An = O, limn→∞ (An−1 + · · · + A + I) = (I − A)−1 , so limn→∞ xn exists. If limn→∞ xn = x, it follows from Equality (6.15) that x = Ax + b. Let en = xn − x be the error sequence. Clearly, if |||A||| < 1, limn→∞ en = 0. , . . . , Am , . . . be matrices in Cn×n . If ∞ Lemma 6.6. Let A0 m=0 Am ∞ is convergent, then U A V is convergent for every U, V and m m=0 ∞ ∞ m=0 U Am V = U ( m=0 Am ) V . ∞ Proof. Suppose that m=0 Am is convergent and let S = ∞ A be its sum. We can write m=0 m
Norms and Inner Products ∞ U Am V − U SV m=0
∞
373
∞ = U Am − S V m=0 ∞ ∞ n2 U ∞ Am − S V ∞ . m=0
∞
∞
Therefore, the convergence of m=0 A m implies the convergence of ∞ ∞ ∞ U A V and U A V = U ( m m m=0 m=0 m=0 Am ) V . n×n and let Lemma 6.7. Let μ be a vectorial norm on ∞C n×n A0 , . . . , Am , . . . be matrices in C . The m=0 Am is absoseries ∞ lutely convergent if and only if the series m=0 μ(Am ) is convergent. Proof. If the series ∞ m=0 μ(Am ) is convergent, then for every i, j such that 1 i, j n we have |(Am )ij | kμ(Am ) for some positive constant k(see Supplement 6.20) which implies the absolute convergence of ∞ m=0 Am . ∞ Conversely, suppose that m=0 Amis absolutely convergent. Then, there exists a number c such that pm=0 |(Am )ij | c for every p ∈ N and 1 i, j n. Therefore, we can write p m=0
μ(Am )
p n
|(Am )ij | n2 c,
m=0 i,j=1
which allows us to conclude that
∞
m=0 μ(Am )
is convergent.
n×n . If Theorem 6.26. Let A0 , . . . , Am , . . . be ∞matrices in C ∞ m=0 Am is absolutely convergent, then m=0 U Am V is absolutely convergent for every U, V .
Proof.
The second part of Supplement 6.20 implies
U Am V ∞ n2 P ∞ Am ∞ V ∞ cAm ∞ , ∞ where c does not depend on m. By Lemma 6.7, m=0 Am ∞ is ∞ for every U, V , which convergent, so m=0 U Am V ∞ is convergent U A V. implies the absolute convergence of ∞ m m=0
Linear Algebra Tools for Data Mining (Second Edition)
374
6.8
Conjugate Norms
Let ν : Cn −→ R0 . Consider the function ν ∗ : Cn −→ R0 defined by ν ∗ (y) = max{|y H x| | ν(x) = 1} for y ∈ Cn . Theorem 6.27. The mapping ν ∗ is a norm on Cn . Proof.
Let y, z ∈ Cn . For ν(x) = 1, we have |(y + z)H x| |y H x| + |z H x| ν ∗ (y) + ν ∗ (z),
which implies ν ∗ (y + z) ν ∗ (y) + ν ∗ (z). Thus, ν ∗ is subadditive. The positive homogeneity is immediate. Suppose that ν ∗ (y) = 0 but y = 0. Since y = 1, ν ν(y) this implies ν ∗ (y) |y H
y22 y |= . ν(y) ν(y)
Therefore, if y = 0, then ν ∗ (y) > 0. Consequently, ν ∗ (y) = 0 implies y = 0, which allows us to conclude that ν ∗ is a norm. Definition 6.14. The norm ν ∗ is the conjugate norm of the norm ν. Example 6.20. Let νp be the norm introduced in Theorem 6.5, n 1 p |xi |p νp (x) = i=1
for x ∈ Rn . To compute its dual νp∗ , we need to compute νp∗ (y) = max{|y H x | νp (x) = 1} = max{|y1 x1 + · · · + yn xn |
n i=1
for y ∈
Cn .
|xi |p = 1}
Norms and Inner Products
375
Without loss of generality, we can assume that y1 , . . . , yn , x1 , . . . , xn belong to R0 . By introducing a Lagrange multiplier λ, we consider the function n p xi − 1 . Φ(y1 , . . . , yn , x1 , . . . , xn , λ) = y1 x1 + · · · + yn xn − λ i=1
The necessary extremum conditions are ∂Φ = yi − λpxp−1 =0 i ∂xi for 1 i n. Thus, we must have y1
xp−1 1
= ··· =
yn
xp−1 n
= pλ.
Equivalently, we have p
p
p ynp−1 y1p−1 p−1 . p = ··· = p = (pλ) x1 xn
For q =
p p−1 ,
the last equality can be written as νq (y)q ynq y1q q = · · · = = (pλ) = = νq (y)q , νp (x)p xp1 xpn
since νp (x) = 1. Thus, pλ = νq (y) and, therefore, yi xi = xpi νq (y) for 1 i n. The maximum of y1 x1 + · · · + yn xn for νp (x) = 1 is therefore n i=1
y i xi =
n
xpi νq (y) = νq (y).
i=1
This shows that the conjugate of the norm νp is the norm νq , where 1 1 + = 1. p q Observe that the conjugate of the Euclidean norm ν2 is ν2 itself.
376
Linear Algebra Tools for Data Mining (Second Edition)
Let U be a subspace of Cn and let h : U −→ C be a linear functional defined on U . We observed that there exists y ∈ Cn such that h can be expressed as h(x) = y H x for x ∈ U . Note that the vector y is not unique because h(x) = (y + z)H x for x ∈ U for every vector z ∈ U ⊥ . Theorem 6.28 (Hahn–Banach theorem). Let h be a linear functional defined on a subspace U of Cn and let ν be a norm on C such that max{|h(x)| | x ∈ U and ν(x) = 1} = M. ˆ of h to Cn such that There exists an extension h ˆ max{|h(x)| | x ∈ Cn and ν(x) = 1} = M. Proof. If the subspace U coincides with Cn , then there is nothing to prove. Suppose that U ⊂ Cn and let w be a vector in Cn − U . Clearly, we have w = 0. If u ∈ U − {0}, we have |h(u)| M ν(u). The linearity of h implies h(u1 )−h(u2 ) = h(u1 −u2 ) M ν(u1 −u2 ) M (ν(u1 +w)+ν(u2 +w)) which yields h(u1 ) − M ν(u1 + w) h(u2 ) + M ν(u2 + w) for any u1 , u2 ∈ U . Therefore, every number in the set A = {h(u1 ) − M ν(u1 + w) | u1 ∈ U } is less than or equal to any number of the set B = {h(u2 ) + M ν(u2 + w) | u2 ∈ U }. We aim to linearly extend h to a linear functional hw defined on the subspace W = U ∪{w}. To this end, define hw (w) = −a, where a is a number located between the sets A and B. Since hw (u) = h(u) for every u ∈ U , we have hw (u) − M ν(u + w) −hw (w) hw (u) + M ν(u + w), which implies hw (u + w) − M ν(u + w) 0 hw (u + w) + M ν(u + w).
Norms and Inner Products
377
This is equivalent to |hw (u + w)| M ν(u + w). For α ∈ C − {0}, define hw (u + αw) = h(u) + αhw (w). We have 1 1 |hw (u + αw)| = |α|hw u + w |α|M ν u+w |α| |α| = M ν(u + αw). If W = Cn , this extension can be repeated a finite number of times because Cn is of finite dimension. Eventually, we obtain a linear functional defined on Cn with the preservation of the boundedness condition. Corollary 6.11. The conjugate ν ∗∗ of the conjugate ν ∗ of a norm ν on Cn equals ν. Proof.
Since ν ∗ (y) = max{|y H x| | ν(x) = 1}, it follows that |y H x| ν ∗ (y)ν(x)
(6.16)
for x, y ∈ Cn , an inequality that is a generalization of the Cauchy– Schwarz inequality. Therefore, we have ν ∗∗ (x) max{|xH y | ν ∗ (y) = 1} ν(x). To prove the converse inequality, we need to use the Hahn–Banach Theorem. Consider the linear functional g defined on the subspace x by g(u) = aν(x) for u = ax and a ∈ C. If ν(u) = 1, we have 1 |a| = ν(x) , so |g(u)| = |a|ν(x) = 1. Thus, by the Hahn–Banach Theorem, g can be extended to gˆ : C∗ −→ C such that |ˆ(g)(v)| 1 if ν(v) = 1 for v ∈ C. If gˆ(v) = z H v, then
ν ∗ (z) = max{|z H v| | ν(v) = 1} = 1 and |z H v| = ν(v). Therefore, ν ∗∗ (x) = max{|xH z| | ν ∗ (z) = 1} ν(x), so ν ∗∗ (x) = ν(x).
378
6.9
Linear Algebra Tools for Data Mining (Second Edition)
Inner Products
Definition 6.15. Let V be a C-linear space. An inner product on V is a function f : V × V −→ C that has the following properties: (i) f (ax + by, z) = af (x, z) + bf (y, z) (linearity in the first argument); (ii) f (x, y) = f (y, x) for y, x ∈ V (conjugate symmetry); (iii) if x = 0V , then f (x, x) is a positive real number (positivity); (iv) f (x, x) = 0 if and only if x = 0V (definiteness); for every x, y, z ∈ V and a, b ∈ C. The pair (V, f ) is called an inner product space. An alternative terminology [106] for real inner product spaces is Euclidean spaces, and Hermitian spaces for complex inner product spaces. For the second argument of an inner product on a C-linear space, we have the property of conjugate linearity, that is, f (z, ax + by) = a ¯f (z, x) + ¯bf (z, y) for every x, y, z ∈ V and a, b ∈ C. Indeed, by the conjugate symmetry property, we can write f (z, ax + by) = f (ax + by, z) = af (x, z) + bf (y, z) =a ¯f (x, z) + ¯bf (y, z) =a ¯f (z, x) + ¯bf (z, y). To simplify notations, if there is no risk of confusion, we denote the inner product f (u, v) as (u, v). Observe that the conjugate symmetry property on inner products implies that for x ∈ V , (x, x) is a real number because (x, x) = (x, x). When V is a real linear space, the definition of the inner product becomes simpler because the conjugate of a real number a is a itself. Namely, for real linear spaces, the conjugate symmetry is replaced by the plain symmetry property, (x, y) = (y, x) for x, y ∈ V . Thus, in case of real linear spaces, an inner product is linear in both arguments.
Norms and Inner Products
379
Let W = {w 1 , . . . , wn } be a basis in the complex n-dimensional inner product space V. If x = ni=1 xi wi and y = nj=1 y j wj , then (x, y) =
n n
xi y j (wi , wj ),
i=1 j=1
due to the bilinearity of the inner product. If we denote (wi , w j ) by gij , then (x, y) can be written as (x, y) =
n n
xi y j gij
(6.17)
i=1 j=1
for x, y ∈ V , when V is a complex inner product linear space. If V is a real inner product space, then (x, y) =
n n
xi y j gij .
i=1 j=1
Note that in this case (gij ) is a symmetric matrix. The matrix G = (gij ) is actually the Gram matrix of the basis W , which will be referred to as the fundamental matrix of the basis B. Definition 6.16. Two vectors u, v ∈ Cn are said to be orthogonal with respect to an inner product if (u, v) = 0. This is denoted by x ⊥ y. An orthogonal set of vectors in an inner product space V equipped with an inner product is a subset W of V such that for every u, v ∈ W , we have u ⊥ v. Theorem 6.29. Any inner product on a linear space V generates a norm on that space defined by x = (x, x) for x ∈ V . Proof. Let V be a C-linear space. We need to verify that the norm satisfies the conditions of Definition 6.7. Applying the properties of the inner product, we have x + y2 = (x + y, x + y) = (x, x) + 2(x, y) + (y, y) = x2 + 2(x, y) + y2
Linear Algebra Tools for Data Mining (Second Edition)
380
x2 + 2xy + y2 = (x + y)2 . Because x 0, it follows that x + y x + y, which is the subadditivity property. a(x, x) = |a|2 (x, x) = If a ∈ C, then ax = (ax, ax) = a¯ |a| (x, x) = |a|x. Finally, from the definiteness property of the inner product, it follows that x = 0 if and only if x = 0V , which allows us to conclude that · is indeed a norm. The norm induced by the inner product f (x, y) = xi y j gij introduced in Equality (6.17) is x2 = f (x, x) = xi xj gij . Theorem 6.30. If W is a set of orthogonal vectors in an ndimensional C-linear space V and 0V ∈ W , then W is linearly independent. Proof. Let c = a1 w1 + · · · + an wn be a linear combination in V such that a1 w1 + · · · + an wn = 0V . Since (c, w i ) = ai w i 2 = 0, we have ai = 0 because w i 2 = 0, and this holds for every i, where 1 i n. Thus, W is linearly independent. Definition 6.17. An orthonormal set of vectors in an inner product space V equipped with an inner product is an orthogonal subset W of V such that for every u we have u = 1, where the norm is induced by the inner product. Corollary 6.12. If W is an orthonormal set of vectors in an ndimensional C-linear space V and |W | = n, then W is a basis in L. Proof.
This statement follows immediately from Theorem 6.30.
If W = {w 1 , . . . , w n } is an orthonormal basis in Cn , we have 0 if i = j, gij = (w i , wj ) = 1 if i = j, which means that the inner product of the vectors x = xi wi and y = y j wj is given by (x, y) = xi y j (wi , wj ) = xi y i . Consequently, x2 = ni=1 |xi |2 .
(6.18)
Norms and Inner Products
381
The inner product of x, y ∈ Rn is (x, y) = xi y j (wi , wj ) = xi y i .
(6.19)
Next, we present a direct proof for the Cauchy–Schwarz inequality. Theorem 6.31 (Cauchy–Schwarz inequality). Let C-linear space. Then, for every x, y ∈ V, we have
V
be
a
|(x, y)| xy. Moreover, the equality takes place if and only if {x, y} is a linear dependent set of vectors. Proof. If y = 0V , the inequality obviously holds. So, suppose that y = 0V . For every t ∈ R, we have (x + ty, x + ty) 0. In particular, , we have taking t = (x,y) y2 0 x − ty2 = (x, x) − t(x, y) − t(y, x) + tt(y, y) = x2 − t(x, y) − t(x, y) + tty2 |(x, y)|2 |(x, y)|2 = x2 − 2 + y2 y2 |(x, y)|2 = x2 − , y2 which implies |(x, y)| xy. If the equality |(x, y)| = xy takes place, then x − ty = 0, so x − ty = 0, which means that {x, y} is linearly dependent. Not every norm can be induced by an inner product. A characterization of this type of norms in linear spaces, obtained in [85], is presented next. This equality shown in the next theorem is known as the parallelogram equality. Theorem 6.32. Let V be a real linear space. A norm · is induced by an inner product if and only if x + y2 + x − y2 = 2(x2 + y2 ) for every x, y ∈ V .
Linear Algebra Tools for Data Mining (Second Edition)
382
Proof. Suppose that the norm is induced by an inner product. In this case, we can write the following for every x and y: (x + y, x + y) = (x, x) + 2(x, y) + (y, y), (x − y, x − y) = (x, x) − 2(x, y) + (y, y). Thus, (x + y, x + y) + (x − y, x − y) = 2(x, x) + 2(y, y), which can be written in terms of the norm generated as the inner product as x + y2 + x − y2 = 2(x2 + y2 ). Conversely, suppose that the condition of the theorem is satisfied by the norm · . Consider the function f : V × V −→ R defined by 1 x + y2 − x − y2 (6.20) f (x, y) = 4 for x, y ∈ V . The symmetry of f is immediate, that is, f (x, y) = f (y, x) for x, y ∈ V . The definition of f implies 1 y2 − − y2 = 0. (6.21) f (0, y) = 4 We prove that f is a bilinear form that satisfies the conditions of Definition 6.15. Starting from the parallelogram equality, we can write u + v + y2 + u + v − y2 = 2(u + v2 + y2 ), u − v + y2 + u − v − y2 = 2(u − v2 + y2 ). Subtracting these equalities yields u + v + y2 + u + v − y2 − u − v + y2 − u − v − y2 = 2(u + v2 − u − v2 ). This equality can be written as f (u + y, v) + f (u − y, v) = 2f (u, v). Choosing y = u implies f (2u, v) = 2f (u, v), due to Equality (6.21).
(6.22)
Norms and Inner Products
383
Let t = u + y and s = u − y. Since u = 12 (t + s) and y = 12 (t − s), we have 1 (t + s), v = f (t + s, v), f (t, v) + f (s, v) = 2f 2 by Equality (6.22). Next, we show that f (ax, y) = af (x, y) for a ∈ R and x, y ∈ V . Consider the function φ : R −→ R defined by φ(a) = f (ax + y). The basic properties of norms imply that ax + y − bx + y (a − b)x for every a, b ∈ R and x, y ∈ V . Therefore, the function φ, ψ : R −→ R given by φ(a) = ax + y and ψ(a) = ax − y for a ∈ R are continuous. The continuity of these functions implies that the function f defined by Equality (6.20) is continuous relative to a. Define the set S = {a ∈ R | f (ax, y) = af (x, y)}. Clearly, we have 1 ∈ S. Further, if a, b ∈ S, then a + b ∈ S and a − b ∈ S, which implies Z ⊆ S. If b = 0 and b ∈ S, then, by substituting x by 1b x in the equality f (bx, y) = bf (x, y), we have f (x, y) = bf ( 1b x, y), so 1b f (x, y) = f ( 1b x, y). Thus, if a, b ∈ S and b = 0, we have f ( ab x, y) = ab f (x, y), so Q ⊆ S. Consequently, S = R. This allows us to conclude that f is linear in its first argument. The symmetry of f implies the linearity in its second argument, so f is bilinear. Observe that f (x, x) = x2 . The definition of norms implies that f (x, x) = 0 if and only if x = 0, and if x = 0, then f (x, x) > 0. Thus, f is indeed an inner product and x = f (x, x). Theorem 6.33. Let x, y ∈ Rn be two vectors such that x1 x2 · · · xn , y 1 y 2 · · · y n . For every permutation matrix P , we have x y x (P y). For every permutation matrix P , we have x y x (P y). Proof. Let φ be the permutation that corresponds to the permutation matrix P and suppose that φ = ψp . . . ψ1 , where p = inv(φ)
384
Linear Algebra Tools for Data Mining (Second Edition)
and ψ1 , . . . , ψp are standard transpositions that correspond to all standard inversions of φ (see Theorems 1.9 and 1.10). Let ψ be a standard transposition of {1, . . . , n}, 1 ··· i i + 1 ··· n ψ: . 1 ··· i + 1 i ··· n We have x (P y) = x1 y1 + · · · + xi−1 yi−1 + xi yi+1 + xi+1 yi + · · · + xn yn , so the inequality x y x (P y) is equivalent to xi yi + xi+1 yi+1 xi yi+1 + xi+1 yi . This, in turn is equivalent to (xi+1 − xi )(yi+1 − yi ) 0, which obviously holds in view of the hypothesis. As we observed previously, Pφ = Pψ1 · · · Pψp , so x y x (Pψp y) x (Pψp−1 Pψp y) · · · x (Pψ1 · · · Pψp y) = x(P y), which concludes the proof of the first part of the theorem. To prove the second part of the theorem, apply the first part to the vectors x and −y. Corollary 6.13. Let x, y ∈ Rn be two vectors such that x1 x1 · · · xn , y1 y2 · · · yn . For every permutation matrix P , we have x − yF x − P yF . If x1 x1 · · · xn and y1 y2 · · · yn , then for every permutation matrix P , we have x − yF x − P yF . Proof.
Note that x − y2F = x2F + y2F − 2x y, x − P y2F = x2F + P y2F − 2x (P y) = x2F + y2F − 2x (P y)
because P y2F = y2F . Then, by Theorem 6.33, x − yF x − P yF . The argument for the second part of the corollary is similar.
Norms and Inner Products
385
Corollary 6.14. Let V be an n-dimensional linear space. If W is an orthogonal (orthonormal) set and |W | = n, then W is an orthogonal (orthonormal) basis of L. Proof. This Theorem 6.30.
statement
For every A ∈
is
an
immediate
consequence
of
Cn×n
and x, y ∈
Cn ,
we have
(Ax, y) = (x, AH y),
(6.23)
which follows from the equalities n (Ax, y) = (Ax)i y¯i i=1
=
=
=
n n
aij xj y¯i i=1 j=1 n n xj
j=1 n j=1
xj
aij y¯i
i=1 n
a ¯ij yi
i=1
= (x, AH y). More generally, we have the following definition: Definition 6.18. A matrix B ∈ Cn×n is the adjoint of a matrix A ∈ Cn×n relative to the inner product (·, ·) if (Ax, y) = (x, By) for every x, y ∈ Cn . A matrix is self-adjoint if it equals its own adjoint, that is if (Ax, y) = (x, Ay) for every x, y ∈ Cn . Thus, a Hermitian matrix is self-adjoint relative to the inner product (x, y) = xH y for x, y ∈ Cn . If we use the Euclidean inner product, we omit the reference to this product and refer to the adjoint of A relative to this product simply as the adjoint of A. Example 6.21. An inner product on Cn×n , the linear space of matrices of format n × n, can be defined as (X, Y ) = trace(XY H ) for X, Y ∈ Cn×n . Note that X2F = (X, X H ) for every X ∈ Cn×n .
386
Linear Algebra Tools for Data Mining (Second Edition)
By Theorem 2.38, a linear form f defined on an Rn can be uniquely written as f (v) =
n
ci ai ,
i=1
where {e1 , . . . , en } is a basis of Rn , v = {ci ei | 1 i n}, and ai = f (ei ) for 1 i n. Thus, using the Euclidean inner product, we can write f (v) = (v, a) for v ∈ Rn , where ⎛ ⎞ a1 ⎜ .. ⎟ a = ⎝ . ⎠. an The Cauchy–Schwarz inequality implies that |(x, y)| x2 y2 . Equivalently, this means that −1
(x, y) 1. x2 y2
This double inequality allows us to introduce the notion of angle between two vectors x, y of a real linear space V. Definition 6.19. The angle between the vectors x and y is the number α ∈ [0, π] defined by cos α =
(x, y) . x2 y2
This angle will be denoted by ∠(x, y). Example 6.22. Let u = (u1 , u2 ) ∈ R2 be a unit vector. Since u21 + u22 = 1, there exists α ∈ [0, 2π] such that u1 = cos α and u2 = sin α. Thus, for any two unit vectors in R2 , u = (cos α, sin α) and v = (cos β, sin β) we have (u, v) = cos α cos β + sin α sin β = cos(α − β), where α, β ∈ [0, 2π]. Consequently, ∠(u, v) is the angle in the interval [0, π] that has the same cosine as α − β.
Norms and Inner Products
387
Theorem 6.34 (The Cosine theorem). Let x and y be two vectors in Rn equipped with the Euclidean inner product. We have x − y2 = x2 + y2 − 2xy cos α, where α = ∠(x, y). Proof.
Since the norm is induced by the inner product, we have x − y2 = (x − y, x − y) = (x, x) − 2(x, y) + (y, y) = x2 − 2xy cos α + y2 ,
which is the desired equality. If T ⊆ V , then the set
T⊥
is defined by
T ⊥ = {v ∈ V | v ⊥ t for every t ∈ T }. Note that T ⊆ U implies U ⊥ ⊆ T ⊥ . If S, T are two subspaces of an inner product space, then S and T are orthogonal if s ⊥ t for every s ∈ S and every t ∈ T . This is denoted as S ⊥ T . Theorem 6.35. Let V be an inner product space and let T ⊆ V . The set T ⊥ is a subspace of V. Furthermore, T ⊥ = T ⊥ . Proof. Let x and y be two members of T ⊥ . We have (x, t) = (y, t) = 0 for every t ∈ T . Therefore, for every a, b ∈ F , by the linearity of the inner product, we have (ax + by, t) = a(x, t) + b(y, t) = 0 for t ∈ T , so ax + bt ∈ T ⊥ . Thus, T ⊥ is a subspace of V. By a previous observation, since T ⊆ T , we have T ⊥ ⊆ T ⊥ . To prove the converse inclusion, let z ∈ T ⊥ . If y ∈ T , y is a linear combination of vectors of T , y = a1 t1 + · · · + am tm , so (y, z) = a1 (t1 , z) + · · · + am (tm , z) = 0. Therefore, z ⊥ y, which implies z ∈ T ⊥ . This allows us to conclude that T ⊥ = T ⊥ . We refer to T ⊥ as the orthogonal complement of T . Note that T ∩ T ⊥ ⊆ {0}. If T is a subspace, then this inclusion becomes an equality, that is, T ∩ T ⊥ = {0}.
388
Linear Algebra Tools for Data Mining (Second Edition)
If x and y are orthogonal, by Theorem 6.34, we have x − y2 = x2 + y2 , which is the well-known Pythagorean Theorem. Theorem 6.36. Let T be a subspace of Cn . We have (T ⊥ )⊥ = T . Proof. Observe that T ⊆ (T ⊥ )⊥ . Indeed, if t ∈ T , then (t, z) = 0 for every z ∈ T ⊥ , so t ∈ (T ⊥ )⊥ . To prove the reverse inclusion, let x ∈ (T ⊥ )⊥ . Theorem 6.37 implies that we can write x = u + v, where u ∈ T and v ∈ T ⊥ , so x − u = v ∈ T ⊥ . Since T ⊆ (T ⊥ )⊥ , we have u ∈ (T ⊥ )⊥ , so x − u ∈ (T ⊥ )⊥ . Consequently, x−u ∈ T ⊥ ∩(T ⊥ )⊥ = {0}, so x = u ∈ T . Thus, (T ⊥ )⊥ ⊆ T , which concludes the argument. Corollary 6.15. Let Z be a subset of Cn . We have (Z ⊥ )⊥ = Z. Proof. Let Z be a subset of Cn . Since Z ⊆ Z, it follows that Z⊥ ⊆ Z ⊥ . Let now y ∈ Z ⊥ and let z = a1 z 1 + · · · + ap z p ∈ Z, where z 1 , . . . , z p ∈ Z. Since (y, z) = a1 (y, z 1 ) + · · · + ap (y, z p ) = 0, it follows that y ∈ Z⊥ . Thus, we have Z ⊥ = Z⊥ . This allows us to write (Z ⊥ )⊥ = (Z⊥ )⊥ . Since Z is a subspace of Cn , by Theorem 6.36, we have (Z⊥ )⊥ = Z, so (Z ⊥ )⊥ = Z. Theorem 6.37. Let U be a subspace of Cn . Then, Cn = U ⊕ U ⊥ . Proof. If U = {0}, then U ⊥ = Cn and the statement is immediate. Therefore, we can assume that U = {0}. In Theorem 6.35 we saw that U ⊥ is a subspace of Cn . Thus, we need to show that Cn is the direct sum of the subspaces U and U ⊥ . By Corollary 2.10, we need to verify only that every x ∈ Cn can be uniquely written as a sum x = u + v, where u ∈ U and v ∈ U ⊥ . Let u1 , . . . , um be an orthonormal basis of U , that is, a basis such that 1 if i = j, (ui , uj ) = 0 otherwise
Norms and Inner Products
389
for 1 i, j m. Define the vector u as u = (x, u1 )u1 + · · · + (x, um )um and the vector v as v = x − u. The vector v is orthogonal to every vector ui because (v, ui ) = (x − u, ui ) = (x, ui ) − (u, ui ) = 0. Therefore v ∈ U ⊥ and x has the necessary decomposition. To prove that the decomposition is unique, suppose that x = s + t, where s ∈ U and t ∈ U⊥ . Since s + t = u + v, we have s − u = v − t ∈ U ∩ U ⊥ = {0}, which implies s = u and t = v. Theorem 6.38. Let S be a subspace of Cn such that dim(S) = k. There exists a matrix A ∈ Cn×k having orthonormal columns such that S = range(A). Proof. Let v 1 , . . . , v k be an orthonormal basis of S. Define the if x ⎞ = matrix A as A = (v 1 , . . . , v k ). We have x ∈ S, if and only ⎛ a1 ⎜ ⎟ a1 v 1 + · · · + ak v k , which is equivalent to x = Aa, where a = ⎝ ... ⎠. ak This amounts to x ∈ range(A). Thus, we have S = range(A). Clearly, for an orthonormal basis in an n-dimensional space, the associated matrix is the diagonal matrix In . In this case, we have (x, y) = x1 y1 + x2 y2 + · · · + xn yn for x, y ∈ V . Observe that if W = {w1 , . . . , w n } is an orthonormal set and x ∈ W , which means that x = a1 w1 +· · ·+an wn , then ai = (x, w i ) for 1 i n. Definition 6.20. Let W = {w 1 , . . . , wn } be an orthonormal set and let x ∈ W . The equality x = (x, w 1 )w 1 + · · · + (x, wn )w n
(6.24)
is the Fourier expansion of x with respect to the orthonormal set W .
390
Linear Algebra Tools for Data Mining (Second Edition)
Furthermore, we have Parseval’s equality: n (x, w i )2 . x2 = (x, x) =
(6.25)
i=1
Thus, if 1 q n, we have q (x, wi )2 x2 .
(6.26)
i=1
It is easy to see that a square matrix C ∈ Cn×n is unitary if and only if its set of columns is an orthonormal set in Cn . Definition 6.21. Let V be an inner product C-linear n-dimensional space and let B = {e1 , . . . , en } be a basis of V. ˜ = {˜ ˜n } is a reciprocal set of B if A set of vectors B e1 , . . . , e ˜j ) = δij for 1 i, j n. (ei , e Theorem 6.39. Let V be an inner product C-linear n-dimensional ˜ of a basis B = {e1 , . . . , en } exists and is space. The reciprocal set B unique. Proof. Let U = e2 , . . . , en and let U ⊥ be the orthogonal complement of U . Then, dim(U ⊥ ) = 1 and there exists a non-zero vector w ∈ U ⊥ . Note that (e1 , w) = 0, which allows us to define ˜1 = (e11,w) w. Thus, (˜ e e1 , e1 ) = 1 and (˜ e1 , ej ) = 0 for j = 1. In a similar manner, we can prove the existence of the remaining vectors ˜ ˜n of B. ˜2 , . . . , e e ˜ suppose that there exist two recipTo prove the uniqueness of B, ˜1, . . . , d ˜ n }. Then, (˜ ˜ = {d ˜ ˜n } and D ek − rocal sets of B, B = {˜ e1 , . . . , e ˜ k is orthogonal to ˜ k , ek ) = 0 for 1 k n, which means that e ˜k − d d ˜ k , v) = 0 for all v ∈ V , which implies B. This is equivalent to (˜ ek − d ˜ ˜k = dk . This proves that the reciprocal set of B is unique. e Theorem 6.40. The reciprocal set of a basis in an inner product space V is a basis of V. ˜ = {˜ ˜n } be the reciprocal set of a basis B = Proof. Let B e1 , . . . , e ˜1 + · · · + an e ˜n = 0V . Then, {e1 , . . . , en } in V. Suppose that a1 e ˜1 + · · · + an e ˜n , ek ) = ak = 0 (a1 e for 1 k n, which means that the reciprocal set is linearly inde˜ = n, it follows that B ˜ is a basis in V. pendent. Since |B|
Norms and Inner Products
6.10
391
Hyperplanes in Rn
Definition 6.22. Let w ∈ Rn − {0} and let a ∈ R. The hyperplane determined by w and a is the set Hw,a = {x ∈ Rn | w x = a}. If x0 ∈ Hw,a , then w x0 = a, so Hw,a is also described by the equality Hw,a = {x ∈ Rn | w (x − x0 ) = 0}. Any hyperplane Hw,a partitions Rn into three sets: > Hw,a = {x ∈ Rn | w x > a}, 0 = Hw,a , Hw,a
< = {x ∈ Rn | w x < a}. Hw,a
> and H < are the positive and negative open half-spaces The sets Hw,a w,a determined by Hw,a , respectively. The sets = {x ∈ Rn | w x a}, Hw,a = {x ∈ Rn | w x a} Hw,a
are the positive and negative closed half-spaces determined by Hw,a , respectively. If x1 , x2 ∈ Hw,a , we say that the vector x1 − x2 is located in the hyperplane Hw,a . In this case, w ⊥ x1 −x2 . This justifies referring to w as the normal to the hyperplane Hw,a . Observe that a hyperplane is fully determined by a vector x0 ∈ Hw,a and by w. Let x0 ∈ Rn and let Hw,a be a hyperplane. We seek x ∈ Hw,a such that x − x0 2 is minimal. Finding x amounts to minimizing the function f (x) = x − x0 22 = ni=1 (xi − x0i )2 subjected to the constraint w1 x1 + · · · + wn xn − a = 0. Using the Lagrangian Λ(x) = f (x) + λ(w x − a) and the multiplier λ, we impose the conditions ∂Λ = 0 for 1 i n, ∂xi which amount to ∂f + λwi = 0 ∂xi for 1 i n. These equalities yield 2(xi − x0i ) + λw i = 0, so we have xi = x0i − 12 λwi . Consequently, we have x = x0 − 12 λw.
392
Linear Algebra Tools for Data Mining (Second Edition)
Since x ∈ Hw,a , this implies 1 w x = w x0 − λw w = a. 2 Thus, λ=2
w x0 − a w x0 − a = 2 . w w w22
We conclude that the closest point in Hw,a to x0 is x = x0 −
w x0 − a w. w22
The smallest distance between x0 and a point in the hyperplane Hw,a is given by x0 − x =
w x0 − a . w2
If we define the distance d(Hw,a , x0 ) between x0 and Hw,a as this smallest distance, we have d(Hw,a , x0 ) = 6.11
w x0 − a . w2
(6.27)
Unitary and Orthogonal Matrices
Lemma 6.8. Let A ∈ Cn×n . If xH Ax = 0 for every x ∈ Cn , then A = On,n . Proof. If x = ek , then xH Ax = akk for every k, 1 k n, so all diagonal entries of A equal 0. Choose now x = ek + ej . Then (ek + ej )H A(ek + ej ) = eHk Aek + eHk Aej + eHj Aek + eHj Aej = eHk Aej + eHj Aek = akj + ajk = 0.
Norms and Inner Products
393
Similarly, if we choose x = ek + iej , we obtain (ek + iej )H A(ek + iej ) = (eHk − ieHj )A(ek + iej ) = eHk Aek − ieHj Aek + ieHk Aej + eHj Aej = −iajk + iakj = 0. The equalities akj + ajk = 0 and −ajk + akj = 0 imply akj = ajk = 0. Thus, all off-diagonal elements of A are also 0, hence A = On,n . Theorem 6.41. A matrix U ∈ Cn×n is unitary if U x2 = x2 for every x ∈ Cn . Proof.
If U is unitary, we have U x22 = (U x)H U x = xH U H U x = x22
because U H U = In . Thus, U x2 = x2 . Conversely, let U be a matrix such that U x2 = x2 for every x ∈ Cn . This implies xH U H U x = xH x, hence xH (U H U − In )x = 0 for x ∈ Cn . By Lemma 6.8, this implies U H U = In , so U is a unitary matrix. Corollary 6.16. The following statements that concern a matrix U ∈ Cn×n are equivalent: (i) U is unitary; (ii) U x − U y2 = x − y2 for x, y ∈ Cn ; (iii) (U x, U y) = (x, y) for x, y ∈ Cn . Proof.
This statement is a direct consequence of Theorem 6.41.
The counterpart of unitary matrices in the set of real matrices are introduced next. Definition 6.23. A matrix A ∈ Rn×n is orthogonal or orthonormal if it is unitary. In other words, a real matrix A ∈ Rn×n is orthogonal if and only if A A = AA = In . Clearly, A is orthogonal if and only if A is orthogonal.
394
Linear Algebra Tools for Data Mining (Second Edition)
Theorem 6.42. If A ∈ Rn×n is an orthogonal matrix, then det(A) ∈ {−1, 1}. Proof. By Corollary 5.3, | det(A)| = 1. Since det(A) is a real num ber, it follows that det(A) ∈ {−1, 1}. Corollary 6.17. Let A be a matrix in Rn×n . The following statements are equivalent: (i) A is orthogonal; (ii) A is invertible and A−1 = A ; (iii) A is invertible and (A )−1 = A; (iv) A is orthogonal. Proof. The equivalence between these statements is an immediate consequence of definitions. Corollary 6.17 implies that the columns of a square matrix form an orthonormal set of vectors if and only if the set of rows of the matrix is an orthonormal set. Theorem 6.41 specialized to orthogonal matrices shows that a matrix A is orthogonal if and only if it preserves the length of vectors. Theorem 6.43. Let S be an r-dimensional subspace of Rn and let {u1 , . . . , ur }, {v 1 , . . . , v r } be two orthonormal bases of the space S. The orthogonal matrices B = (u1 · · · ur ) ∈ Rn×r and C = (v 1 · · · v r ) ∈ Rn×r of any two such bases are related by the equality B = CT, where T = C B ∈ Cr×r is an orthogonal matrix. Proof. Since the columns of B form a basis for S, each vector v i can be written as v i = v 1 t1i + · · · + v r tri for 1 i r. Thus, B = CT . Since B and C are orthogonal, we have B H B = T H C H CT = T H T = Ir , so T is an orthogonal matrix and because it is a square matrix, it is also a unitary matrix. Furthermore, we have C H B = C H CT = T , which concludes the argument.
Norms and Inner Products
395
Definition 6.24. A rotation matrix is an orthogonal matrix R ∈ Rn×n such that det(R) = 1. A reflection matrix is an orthogonal matrix R ∈ Rn×n such that det(R) = −1. Example 6.23. In the two-dimensional case, n = 2, a rotation is a matrix R ∈ R2×2 , r11 r12 R= , r21 r22 such that 2 2 + r21 = 1, r11 2 2 r12 + r22 = 1, r11 r12 + r21 r22 = 0
and r11 r22 − r12 r21 = 1. This implies r22 (r11 r12 + r21 r22 ) − r12 (r11 r22 − r12 r21 ) = −r12 , or 2 2 + r12 ) = −r12 , r21 (r22
so r21 = −r12 . If r21 = −r21 = 0, the above equalities imply that either r11 = r22 = 1 or r11 = r22 = −1. Otherwise, the equality r11 r12 +r21 r22 = 0 implies r11 = r22 . 2 1, it follows that there exists θ such that r Since r11 11 = cos θ. This implies that R has the form cos θ − sin θ R= . sin θ cos θ Its effect on a vector
x=
x1 x2
∈ R2
396
Linear Algebra Tools for Data Mining (Second Edition)
is to produce the vector y = Rx, where x1 cos θ − x2 sin θ , y= x1 sin θ + x2 cos θ which is obtained from x by a counterclockwise rotation by angle θ. It is easy to see that det(R) = 1, so the term “rotation matrix” is clearly justified for R. To mark the dependency of R on θ, we will use the notation cos θ − sin θ R(θ) = . sin θ cos θ The same conclusion canbereached by examining Figure 6.3. If the x1 angle of the vector x = with the x1 axis is α and x is rotated x2 counterclockwise by θ to yield the vector y = y1 e1 + y2 e2 , then x1 = r cos α, x2 = r sin α, and y1 = r cos(α + θ) = r cos α cos θ − r sin α sin θ = x1 cos θ − x2 sin θ, y2 = r sin(α + θ) = r sin α cos θ + r cos α sin θ = x1 sin θ + x2 cos θ, which are the formulas that describe the transformation of x into y. x2 y
x θ α x1 Fig. 6.3
y is obtained by a counterclockwise rotation by an angle θ from x.
Norms and Inner Products
397
A extension of Example 6.23 is the Givens matrix G(p, q, θ) ∈
Rn×n , defined as
⎛
1 ⎜ .. ⎜. ⎜ ⎜0 ⎜. ⎜. ⎜. ⎜0 ⎜ ⎜. ⎝ ..
··· ··· ···
··· .. . cos θ .. .
··· ··· ···
··· ··· · · · − sin θ · · · .. ··· . ··· 0 ··· ··· ··· p q
⎞ ··· 0 . .. ⎟ · · · .⎟ ⎟ sin θ · · · 0⎟ p .. .. ⎟ ⎟ . · · · .⎟ cos θ · · · 0⎟ ⎟q .. .. ⎟ . · · · .⎠ ··· ··· 1 ··· .. .
We can write G(p, q, θ) = (e1 · · · cos θep − sin θeq · · · sin θep + cos θ eq · · · en ). It is easy to verify that G(p, q, θ) is a rotation matrix since it is orthogonal and det(G(p, q, θ)) = 1. Since ⎞ ⎛ ⎞ ⎛ v1 v1 .. ⎟ ⎜ .. ⎟ ⎜ . ⎟ ⎜.⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎜ vp ⎟ ⎜ cos θvp + sin θvq ⎟ ⎟ ⎜.⎟ ⎜ .. ⎟=⎜ ⎟, . G(p, q, θ)⎜ . . ⎟ ⎜ ⎟ ⎜ ⎜ v ⎟ ⎜− sin θv + cos θv ⎟ ⎜ q⎟ ⎜ p q⎟ ⎟ ⎜.⎟ ⎜ .. ⎠ ⎝ .. ⎠ ⎝ . vn
vn
the multiplication of a vector v be a Givens matrix amounts to a clockwise rotation by θ in the plane of the coordinates (vp , vq ). If vp = 0, then the rotation described by the Givens matrix can be used to zero the qth component of the resulting vector by taking v θ such that tan θ = vpq . It is easy to see that R(θ)−1 = R(−θ) and that R(θ2 )R(θ1 ) = R(θ1 + θ2 ).
Linear Algebra Tools for Data Mining (Second Edition)
398
Example 6.24. Let v ∈ Cn −{0n } be a unit vector. The Householder matrix Hv ∈ Cn×n is defined by Hv = In − 2vv H . The matrix Hv is clearly Hermitian. Moreover, we have HH H = HH = (In − 2vv H )2 = In − 4vvH + 4v(v H v)v H = In , so Hv is unitary and involutive. Supplement 14 of Chapter 5 implies that det(Hv ) = −1, so Hv is a reflection. For a unit vector v ∈ Rn , Hv is an orthogonal and involutive matrix. In this case the vector Hv x is a reflection of x relative to the hyperplane Hv,0 defined by v x = 0, which follows from the fact that vector t = x − Hv x = (In − Hv )x = 2v(v x) is orthogonal to the hyperplane v x = 0. 6.12
Projection on Subspaces
We have shown in Theorem 2.33 that if U, W are two complementary subspaces of Cn , then there exist idempotent endomorphisms g and h of Cn such that W = Ker(g), U = Im(g), and U = Ker(h) and W = Im(h). The vector g(x) is the oblique projection of x on U along the subspace W and h(x) is the oblique projection of x on W along the subspace U . If g and h are represented by the matrices BU and BW , respectively, it follows that these matrices are idempotent, g(x) = BU x ∈ U , and h(x) = BW x ∈ W for x ∈ Cn . Also, BU BW = BW BU = On,n . Let U and W be two complementary subspaces of Cn , {u1 , . . . , up } be a basis for U , and let {w1 , . . . , w q } be a basis for W , where p + q = n. Clearly, {u1 , . . . , up , w 1 , . . . , wq } is a basis for Cn and every x ∈ Cn can be written as x = x1 u1 + · · · + xp up + xp+1 w1 + · · · + xp+q wq . Let B ∈ Cn×n be the matrix B = (u1 · · · up w1 · · · wq ),
Norms and Inner Products
399
which is clearly invertible. Note that BU ui = ui BU wj = 0n , BW ui = 0n BW wj = wj for 1 i p and 1 j q. Therefore, we have Ip Op,n−p BU B = (u1 · · · up 0n · · · 0n ) = B , On−p,p On−p,n−p so
BU = B
Ip
On−p,p
Op,n−p B −1 . On−p,n−p
Similarly, we can show that On−q,n−q On−q,q B −1 . BW = B Oq,n−q Iq Note that BU + BW = BIn B −1 = In . Thus, the oblique projection on U along W is given by Ip Op,n−p B −1 x. g(x) = BU x = B On−p,p On−p,n−p The similar oblique projection on W along U is On−q,n−q On−q,q B −1 x h(x) = BU x = B Oq,n−q Iq for x ∈ Cn . Observe that g(x) + h(x) = x, so the projection on W along U is h(x) = x − g(x) = (In − BU )x. A special important type of projections involves pairs of orthogonal subspaces. Let U be a subspace of Cn with dim U = p and let BU = {u1 , . . . , up } be an orthonormal basis of U . Taking into account that (U ⊥ )⊥ = U , by Theorem 2.33 there exists an idempotent endomorphism g of Cn such that U = Im(g) and U ⊥ = Ker(g). The proof of Theorem 6.37 shows that this endomorphism is defined by g(x) = (x, u1 )u1 + · · · + (x, um )um . Definition 6.25. Let U be an m-dimensional subspace of Cn and let {u1 , . . . , um } be an orthonormal basis of this subspace. The orthogonal projection of the vector x ∈ Cn on the subspace U is the vector projU (x) = (x, u1 )u1 + · · · + (x, um )um .
Linear Algebra Tools for Data Mining (Second Edition)
400
Theorem 6.44. Let U be an m-dimensional subspace of Rn and let x ∈ Rn . The vector y = x − projU (x) belongs to the subspace U ⊥ . Proof. that
Let BU = {u1 , . . . , um } be an orthonormal basis of U . Note (y, uj ) = (x, uj ) −
m (x, ui )ui , uj
i=1 m (x, ui )(ui , uj ) = 0, = (x, uj ) − i=1
due to the orthogonality of the basis BU . Therefore, y is orthogonal on every linear combination of BU , that is on the subspace U . Theorem 6.45. Let U be an m-dimensional subspace of Cn having the orthonormal basis {u1 , . . . , um }. The orthogonal projection projU is given by projU (x) = BU BUH x for x ∈ Cn , where BU ∈ Rn×m is the matrix BU = (u1 · · · um ) ∈ Cn×m . Proof.
We can write
⎛
⎞ uH1 ⎜ ⎟ ui (uHi x) = (u1 · · · um )⎝ ... ⎠x = BU BUH x. projU (x) = i=1 uHm m
Since the basis {u1 , . . . , um } is orthonormal, we have BUH BU = Im . Observe that the matrix BU BUH ∈ Cn×n is symmetric and idempotent because (BU BUH )(BU BUH ) = BU (BUH BU )BUH = BU BUH . For an m-dimensional subspace U of Cn , we denote by PU = BU BUH ∈ Cn×n , where BU is a matrix of an orthonormal basis of U as defined before. PU is the projection matrix of the subspace U . Corollary 6.18. For every non-zero subspace U , the matrix PU is a Hermitian matrix, and therefore, a self-adjoint matrix. Proof. Since PU = BU BUH where BU is a matrix of an orthonormal basis of the subspace S, it is immediate that PUH = PU .
Norms and Inner Products
401
The self-adjointness of PU means that (x, PU y) = (PU x, y) for every x, y ∈ Cn . Corollary 6.19. Let U be an m-dimensional subspace of Cn having the orthonormal basis {u1 , . . . , um }. If BU = (u1 · · · um ) ∈ Cn×m , then for every x ∈ C we have the decomposition x = PU x + QU x, where PU = BU BUH and QU = In − PU , PU x ∈ U and QU x ∈ U ⊥ . Proof.
This statement follows immediately from Theorem 6.45.
Observe that Q2U = (In − PU PUH )(In − PU PUH ) = In − PU PUH − PU PUH + PU PUH PU PUH = QU , so QU is an idempotent matrix. The matrix QU is the projection matrix on the subspace U ⊥ . Clearly, we have PU ⊥ = QU = In − PU .
(6.28)
It is possible to give a direct argument for the independence of the projection matrix PU relative to the choice of orthonormal basis in U. Theorem 6.46. Let U be an m-dimensional subspace of Cn having the orthonormal bases {u1 , . . . , um } and {v 1 , . . . , v m } and let BU = ˜U = (v 1 · · · v m ) ∈ Cn×m . The matrix (u1 · · · um ) ∈ Cn×m and B H ˜ m×m ˜ H = BU B H . ˜U B is unitary and B BU BU ∈ C U U ˜U are bases for U , Proof. Since both sets of columns of BU and B m×m ˜U Q. such that BU = B there exists a unique square matrix Q ∈ C H H ˜ ˜U implies B BU = B ˜ B The orthonormality of BU and B U U U = Im . Thus, we can write H ˜H B ˜ Im = BUH BU = QH B U U Q = Q Q,
˜ U = QH B ˜H B ˜ which shows that Q is unitary. Furthermore, BUH B U U = H Q is unitary and ˜U QQH B ˜U B ˜UH = B ˜UH . BU BUH = B
402
Linear Algebra Tools for Data Mining (Second Edition)
In Example 6.11, we have shown that if P is an idempotent matrix, then for every matrix norm μ we have μ(P ) = 0 or μ(P ) 1. For orthogonal projection matrices of the form PU , where U is a non-zero subspace, we have |||PU |||2 = 1. Indeed, we can write x = (x − projU (x)) + projU (x), so x22 = x − projU (x)22 + projU (x)22 projU (x)22 . Thus, x2 PU (x) for any x ∈ Cn , which implies PU x2 n |||PU |||2 = sup x ∈ C − {0}} 1. x2 This implies |||PU |||2 = 1. The next theorem shows that the best approximation of a vector x by a vector in a subspace U (in the sense of Euclidean distance) is the orthogonal projection of x on U . Theorem 6.47. Let U be an m-dimensional subspace of Cn and let x ∈ Cn . The minimal value of d2 (x, u), the Euclidean distance between x and an element u of the subspace U , is achieved when u = projU (x). Proof. We saw that x can be uniquely written as x = y+projU (x), where y ∈ U ⊥ . Let now u be an arbitrary member of U . We have d2 (x, u)2 = x − u22 = (x − projU (x)) + (projU (x) − u)22 . Since x − projU (x) ∈ U ⊥ and projU (x) − u ∈ U , it follows that these vectors are orthogonal. Thus, we can write d2 (x, u)2 = (x − projU (x))22 + (projU (x) − u)22 , which implies that d2 (x, u) d2 (x − projU (x)).
The orthogonal projections associated with subspaces allow us to define a metric on the collection of subspaces of Cn . Indeed, if S and T are two subspaces of Cn , we define dF (S, T ) = PS − PT F . When using the vector norm · 2 and the metric induced by this norm on Cn , we denote the corresponding metric on subspaces by d2 .
Norms and Inner Products
403
Example 6.25. Let u, w be two distinct unit vectors in the linear space V. The orthogonal projection matrices of u and w are uu and ww , respectively. Thus, dF (u, w) = uu − vv F . Suppose now that V = R2 . Since u and w are unit vectors in R2 , there exist α, β ∈ [0, 2π] such that cos α cos β u= and w = . sin α sin β Thus, we can write cos2 α − cos2 β cos α sin α − cos β sin β uu − vv = cos α sin α − cos β sin β sin2 α − sin2 β and dF (u, w) =
√ 2| sin(α − β)|.
We could use any matrix norm in the definition of the distance between subspaces. For example, we could replace the Frobenius norm by ||| · |||1 or by ||| · |||2 . Let S be a subspace of Cn and let x ∈ Cn . The distance between x and S defined by the norm · is d(x, S) = x − projS (x) = x − PS x = (I − PS )x. Theorem 6.48. Let S and T be two non-zero subspaces of Cn and let δS = max{d2 (x, T ) | x ∈ S, x2 = 1}, δT = max{d2 (x, S) | x ∈ T, x2 = 1}. We have d2 (S, T ) = max{δS , δT }. Proof.
If x ∈ S and x2 = 1, we have d2 (x, T ) = x − PT x2 = PS x − PT x2 = (PS − PT )x2 |||PS − PT |||2 .
Therefore, δS |||PS − PT |||2 . Similarly, δT |||PS − PT |||2 , so max{δS , δT } d2 (S, T ).
Linear Algebra Tools for Data Mining (Second Edition)
404
Note that δS = max{(I − PT )x2 | x ∈ S, x2 = 1}, δT = max{(I − PS )x2 | x ∈ T, x2 = 1}, so, taking into account that PS x ∈ S and PT x ∈ T for every x ∈ Cn , we have (I − PS )PT x2 δS PT x2 , (I − PT )PS x2 δT PS x2 . We have PT (I − PS )x22 = (PT (I − PS )x, PT (I − PS )x) = ((PT )2 (I − PS )x, (I − PS )x) = (PT (I − PS )x, (I − PS )x) = (PT (I − PS )x, (I − PS )2 x) = ((I − PS )PT (I − PS )x, (I − PS )x) because both PS and I − PS are idempotent and self-adjoint. Therefore, PT (I − PS )x22 (I − PS )PT (I − PS )x2 (I − PS )x2 δT PT (I − PS )x2 (I − PS )x2. This allows us to infer that PT (I − PS )x2 δT (I − PS )x2.
We discuss now four fundamental subspaces associated to a matrix A ∈ Cm×n . The range and the null space of A, range(A) ⊆ Cm and null(A) ⊆ Cn , have been already discussed in Chapter 3. We add now two new subspaces: range(AH ) ⊆ Cn and null(AH ) ⊆ Cm . Theorem 6.49. For every matrix A (range(A))⊥ = null(AH ).
∈
Cm×n ,
we have
Proof. The statement follows from the equivalence of the following statements: (i) x ∈ (range(A))⊥ ;
Norms and Inner Products
(ii) (iii) (iv) (v) (vi)
(x, Ay) = 0 for all y ∈ Cn ; xH Ay = 0 for all y ∈ Rn ; y H AH x = 0 for all y ∈ Rn ; AH x = 0; x ∈ null(AH ).
Corollary 6.20. For every matrix A (range(AH ))⊥ = null(A). Proof. by AH .
405
∈
Cm×n ,
we have
This statement follows from Theorem 6.49 by replacing A
Corollary 6.21. For every matrix A ∈ Cm×n , we have Cm = range(A) ⊕ null(AH ), Cn = null(A) ⊕ range(AH ).
Proof. By Theorem 6.37, we have Cm = range(A) ⊕ range(A)⊥ and Cn = null(A) ⊕ null(A)⊥ . Taking into account Theorem 6.49 and Corollary 6.21, we obtain the desired equalities. 6.13
Positive Definite and Positive Semidefinite Matrices
Definition 6.26. A matrix A ∈ Cn×n is positive definite if xH Ax is a real positive number for every x ∈ Cn − {0}. Theorem 6.50. If A ∈ Cn×n is positive definite, then A is Hermitian. Proof. Let A ∈ Cn×n be a matrix. Since xH Ax is a real number, it follows that it equals its conjugate, so xH Ax = xH AH x for every x ∈ Cn . By Theorem 3.16, there exists a unique pair of Hermitian matrices H1 and H2 such that A = H1 + iH2 , which implies AH = H1H − iH2H . Thus, we have xH (H1 + iH2 )x = xH (H1H − iH2H )x = xH (H1 − iH2 )x, because H1 and H2 are Hermitian. This implies xH H2 x = 0 for every x ∈ Cn , which, in turn, implies H2 = On,n . Consequently, A = H1 , so A is indeed Hermitian.
406
Linear Algebra Tools for Data Mining (Second Edition)
Definition 6.27. A matrix A ∈ Cn×n is positive semidefinite if xH Ax is a non-negative real number for every x ∈ Cn − {0}. If xH Ax is a non-positive real number for every x ∈ Cn − {0}, then we say that A is a positive definite matrix. Positive definiteness (positive semidefiniteness) is denoted by A 0 (A 0, respectively). The definition of positive definite (semidefinite) matrix can be specialized for real matrices as follows. Definition 6.28. A symmetric matrix A ∈ Rn×n is positive definite if x Ax > 0 for every x ∈ Rn − {0}. If A satisfies the weaker inequality x Ax 0 for every x ∈ Rn − {0}, then we say that A is positive semidefinite. A 0 denotes that A is positive definite and A 0 means that A is positive semidefinite. Note that in the case of real-valued matrices, we need to require explicitly the symmetry of the matrix because, unlike the complex case, the inequality x Ax > 0 for x ∈ Rn − {0n } does not imply the symmetry of A. For example, consider the matrix a b A= , −b a where a, b ∈ R and a > 0. We have a b x1 = a(x21 + x22 ) > 0 x Ax = (x1 x2 ) −b a x2 if x = 02 . Example 6.26. The symmetric real matrix a b A= b c is positive definite if and only if a > 0 and b2 −ac < 0. Indeed, we have x Ax > 0 for every x ∈ R2 −{0} if and only if ax21 +2bx1 x2 +cx22 > 0, where x = (x1 x2 ); elementary algebra considerations lead to a > 0 and b2 − ac < 0.
Norms and Inner Products
407
A positive definite matrix is non-singular. Indeed, if Ax = 0, where A ∈ Rn×n is positive definite, then xH Ax = 0, so x = 0. By Corollary 3.2, A is non-singular. Example 6.27. If A ∈ Cm×n , then the matrices AH A ∈ Cn×n and AAH ∈ Cm×m are positive semidefinite. For x ∈ Cn , we have xH (AH A)x = (xH AH )(Ax) = (Ax)H (Ax) = Ax22 0. The argument for AAH is similar. If rank(A) = n, then the matrix AH A is positive definite because H x (AH A)x = 0 implies Ax = 0, which, in turn, implies x = 0. Theorem 6.51. If A ∈ Cn×n is a positive definite matrix, then any ! i1 · · · ik is a positive definite matrix. principal submatrix B = A i1 · · · ik Proof. Let x ∈ Cn − {0} be a vector such that all components located on ! positions other than i1 , . . . , ik equal 0 and let i1 · · · ik ∈ Ck be the vector obtained from x by retaining y=x 1 only the components located on positions i1 , . . . , ik . Since y H By = xH Ax > 0, it follows that B 0. Corollary 6.22. If A ∈ Cn×n is a positive definite matrix, then any diagonal element aii is a real positive number for 1 i n. Proof. This statement follows immediately from Theorem 6.51 by observing that every diagonal element of A is a 1 × 1 principal sub matrix of A. Theorem 6.52. If A, B ∈ Cn×n are two positive semidefinite matrices and a, b are two non-negative numbers, then aA + bB 0. Proof. The statement holds because xH (aA + bB)x = axH Ax + bxH Bx 0, due to the fact that A and B are positive semidefinite.
Theorem 6.53. Let A ∈ Cn×n be a positive definite matrix and let S ∈ Cn×m . The matrix S H AS is positive semidefinite and has the same rank as S. Moreover, if rank(S) = m, then S H AS is positive definite.
408
Linear Algebra Tools for Data Mining (Second Edition)
Proof. Since A is positive definite, it is Hermitian and (S H AS)H = S H AS implies that S H AS is a Hermitian matrix. Let x ∈ Cm . We have xH S H ASx = (Sx)H A(Sx) 0 because A is positive definite. Thus, the matrix S H AS is positive semidefinite. If Sx = 0, then S H ASx = 0; conversely, if S H ASx = 0, then xH S H ASx = 0, so Sx = 0. This allows us to conclude that null(S) = null(S H AS). Therefore, by Equality (3.8), we have rank(S) = rank(S H AS). Suppose now that rank(S) = m and that xH S H ASx = 0. Since A is positive definite, we have Sx = 0 and this implies x = 0, because of the assumption made relative to rank(S). Consequently, S H AS is positive definite. Corollary 6.23. Let A ∈ Cn×n be a positive definite matrix and let S ∈ Cn×n . If S is non-singular, then so is S H AS. Proof.
This is an immediate consequence of Theorem 6.53.
Theorem 6.54. A Hermitian matrix B ∈ Cn×n is positive definite if and only if the mapping f : Cn × Cn −→ C given by f (x, y) = xH By for x, y ∈ Cn defines an inner product on Cn . Proof. Suppose that B defines an inner product on Cn . Then, by Property (iii) of Definition 6.15 we have f (x, x) > 0 for x = 0, which amounts to the positive definiteness of B. Conversely, if B is positive definite, then f satisfies the condition from Definition 6.15. We show here only that f has the conjugate symmetry property. We can write f (y, x) = y H Bx = y Bx for x, y ∈ Cn . Since B is Hermitian, B = B H = B , so f (y, x) = y B x. Observe that y B x is a number (that is, a 1 × 1 matrix), so (y B x) = xH By = f (x, y).
Corollary 6.24. A symmetric matrix B ∈ Rn×n is positive definite if and only if the mapping f : Rn × Rn −→ R given by f (x, y) = x By for x, y ∈ Rn defines an inner product on Rn .
Norms and Inner Products
Proof.
This follows immediately from Theorem 6.54.
409
Definition 6.29. Let S = (v 1 , . . . , v m ) be a sequence of vectors in Rn . The Gram matrix of S is the matrix GS = (gij ) ∈ Rm×m defined by gij = v i v j for 1 i, j m. Note that if AS = (v 1 · · · v m ) ∈ Rn×m , then GS = AS AS . Also, note that GS is a symmetric matrix. Example 6.28. Let ⎛
⎞ ⎛ ⎞ ⎛ ⎞ 1 1 2 v 1 = ⎝ 0 ⎠, v 2 = ⎝2⎠, v 3 = ⎝1⎠. −1 2 0 The Gram matrix of the set S = {v 1 , v 2 , v 3 } is ⎛ ⎞ 2 −1 2 GS = ⎝−1 9 4⎠. 2 4 5 Note that det(GS ) = 1. The Gram matrix of an arbitrary sequence of vectors is positive semidefinite, as the reader can easily verify. Theorem 6.55. Let S = (v 1 , . . . , v m ) be a sequence of m vectors in Rn , where m n. If S is linearly independent, then the Gram matrix GS is positive definite. Proof. Let S be a linearly independent sequence of vectors, S = (v 1 , . . . , v m ), and let x ∈ Rm . We have x GS x = x AS AS x = (AS x) AS x = AS x22 . Therefore, if x GS x = 0, we have AS x = 0, which is equivalent to x1 v 1 + · · · + xn vn = 0. Since {v 1 , . . . , v m } is linearly independent, it follows that x1 = · · · = xm = 0, so x = 0n . Thus, GS is indeed positive definite. Definition 6.30. Let S = (v 1 , . . . , v m ) be a sequence of m vectors in Rn , where m n. The Gramian of S is the number det(GS ). Theorem 6.56. If S = (v 1 , . . . , v m ) is a sequence of m vectors in Rn , then S is linearly independent if and only if det(GS ) = 0.
410
Linear Algebra Tools for Data Mining (Second Edition)
Proof. Suppose that det(GS ) = 0 and that S is not linearly independent. In other words, the numbers a1 , . . . , am exist such that at least one of them is not 0 and a1 v 1 + · · · + am v m = 0Rn . This implies the equalities a1 (v 1 , v j ) + · · · + am (v m , v j ) = 0 for 1 j m, so the system GS a = 0n has a non-trivial solution in a1 , . . . , am . This implies det(GS ) = 0, which contradicts the initial assumption. Conversely, suppose that S is linearly independent and det(GS ) = 0. Then the linear system a1 (v 1 , v j ) + · · · + am (v m , v j ) = 0 for 1 j m, has a non-trivial solution in a1 , . . . , am . Let w = a1 v1 +· · · am v m . We have (w, v i ) = 0 for 1 i n. This, in turn, implies (w, w) = w22 = 0, so w = 0, which contradicts the linear independence of S. Theorem 6.57 (Cholesky’s decomposition theorem). Let A ∈ Cn×n be a Hermitian positive definite matrix. There exists a unique upper triangular matrix R with real, positive diagonal elements such that A = RH R. Proof. The argument is by induction on n 1. The base step, n = 1, is immediate. Suppose that a decomposition exists for all Hermitian positive matrices of order n, and let A ∈ C(n+1)×(n+1) be a symmetric and positive definite matrix. We can write A=
a11 aH , a B
where B ∈ Cn×n . By Theorem 6.51, a11 > 0 and B ∈ Cn×n is a Hermitian positive definite matrix. It is easy to verify the following
Norms and Inner Products
identity: A=
√
a11 0 √ 1 a In a11
1 0 0 B − a111 aaH
411
√ a11 0
√ 1 aH a11
In
.
(6.29)
Let R1 ∈ Cn×n be the upper triangular non-singular matrix √ a11 √a111 aH . R1 = 0 In This allows us to write
A = R1 H
where A1 = B −
1 0 R1 , 0 A1
1 H a11 aa .
Since 1 0 = (R1−1 )H AR1−1 , 0 A1
by Theorem 6.53, the matrix
1 0 0 A1
is positive definite, which allows us to conclude that the matrix A1 = B − a111 aaH ∈ Cn×n is a Hermitian positive definite matrix. By the inductive hypothesis, A1 can be factored as A1 = P H P, where P is an upper triangular matrix. This allows us to write 1 0 1 0 1 0 . = 0 A1 0 PH 0 P Thus,
A = R1H
1 0 0 PH
If R is defined as √ a11 1 0 1 0 R1 = R= 0 P 0 P 0
1 0 R1 . 0 P
√ 1 aH a11
In
√ a11 = 0
√ 1 aH a11
then A = RH R and R is clearly an upper triangular matrix.
P
,
Linear Algebra Tools for Data Mining (Second Edition)
412
We refer to the matrix R as the Cholesky factor of A. Corollary 6.25. If A ∈ Cn×n is a Hermitian positive definite matrix, then det(A) > 0. Proof. By Corollary 5.2, det(A) is a real number. By Theorem 6.57, A = RH R, where R is an upper triangular matrix with real, positive diagonal elements, so det(A) = det(RH ) det(R) = (det(R))2 . Since det(R) is the product of its diagonal elements, det(R) is a real, positive number, which implies det(A) > 0. Example 6.29. Let A be the symmetric matrix ⎛ ⎞ 3 0 2 A = ⎝0 2 1⎠. 2 1 2 We leave it to the reader to verify that this matrix is indeed positive definite starting from Definition 6.26. By Equality (6.29), the matrix A can be written as ⎞ ⎞⎛ ⎞⎛√ ⎛√ 3 0 0 3 0 √23 1 0 0 A = ⎝ 0 1 0⎠⎝0 2 1 ⎠⎝ 0 1 0 ⎠, √2 0 1 0 1 23 0 0 1 3 because
A1 =
1 0 2 1 2 1 0 2 = − . 1 2 1 23 3 2
Applying the same equality to A1 , we have √ √ 2 0 2 1 0 A1 = √1 1 1 0 6 0 2
√1 2
1
.
Since the matrix ( 16 ) can be factored directly, we have √ √ 1 0 1 0 2 0 2 √12 A1 = √1 0 √16 0 √16 1 0 1 2 √ √ 1 2 √2 2 0 . = √1 √1 0 √16 2 6
Norms and Inner Products
In turn, this ⎛ 1 ⎝0 0
implies ⎞ ⎛ 1 √0 0 0 2 2 1 ⎠ = ⎝0 0 √12 1 23
⎞⎛ 1 √0 0 ⎜ 2 0 ⎠⎝0 √1 0 0 6
413
0
⎞
√1 ⎟, 2⎠ √1 6
which produces the following Cholesky final decomposition of A: ⎞⎛√ ⎞⎛ ⎞⎛ ⎞ ⎛√ 1 √0 0 1 √0 0 3 0 0 3 0 √23 2 √12 ⎟ 2 0 ⎠⎜ A = ⎝ 0 1 0⎠⎝0 ⎝0 ⎠⎝ 0 1 0 ⎠ 1 2 1 1 √ 0 √2 √6 0 1 0 0 √6 0 0 1 3 √ ⎛ ⎞ ⎞ ⎛√ 3 0 √23 3 √0 0 √ 2 0 ⎠⎜ 2 √12 ⎟ =⎝ 0 ⎝ 0 ⎠. √2 √1 √1 √1 0 0 3 2 6 6 Cholesky’s decomposition theorem can be extended to positive semi-definite matrices. Theorem 6.58 (Cholesky’s decomposition theorem for positive semidefinite matrices). Let A ∈ Cn×n be a Hermitian positive semidefinite matrix. There exists an upper triangular matrix R with real, non-negative diagonal elements such that A = RH R. Proof. The argument is similar to the one used for Theorem 6.57 and is omitted. Observe that for positive semidefinite matrices, the diagonal elements of R are non-negative numbers and the uniqueness of R does not longer holds. 1 −1 Example 6.30. Let A = . Since x Ax = (x1 − x2 ), it −1 1 is clear that A is a positive semidefinite but not a positive definite matrix. Let R be a matrix of the form r1 r R= 0 r2 such that A = R R. It is easy to see that the last equality is equivalent to r12 = r22 = 1 and rr1 = −1. Thus, we have for distinct Cholesky factors, the following matrices: 1 −1 1 −1 −1 1 −1 1 , , , . 0 1 0 −1 0 1 0 −1
414
Linear Algebra Tools for Data Mining (Second Edition)
Theorem 6.59. A Hermitian matrix A ∈ Cn×n is positive definite if and only if all its leading principal minors are positive. Proof. By Theorem 6.51, if A is positive definite, then every principal submatrix is positive definite, so by Corollary 6.25, each principal minor of A is positive. Conversely, suppose that A ∈ Cn×n is a Hermitian matrix having positive leading principal minors. We prove by induction on n that A is positive definite. The base case, n = 1, is immediate. Suppose that the statement holds for matrices in C(n−1)×(n−1) . Note that A can be written as B b , A= bH a where B ∈ C(n−1)×(n−1) is a Hermitian matrix. Since the leading minors of B are the first n − 1 leading minors of A, it follows, by the inductive hypothesis, that B is positive definite. Thus, there exists a Cholesky decomposition B = RH R, where R is an upper triangular matrix with real, positive diagonal elements. Since R is invertible, let w = (RH )−1 b. The matrix B is invertible. Therefore, by Theorem 5.14, we have det(A) = det(B)(a − bH B −1 b) > 0. Since det(B) > 0, it follows that a bH B −1 b. We observed that if B is positive definite, then so is B −1 . Therefore, a 0 is and we can write a = c2 for some positive c. This allows us to write H 0 R R w = C H C, A= 0H c wsH c where C is the upper triangular matrix with positive e R w . C= 0H c This implies immediately the positive definiteness of A.
Cn×n .
Let A, B ∈ We write A B if A − B 0, that is, if A − B is a positive definite matrix. Similarly, we write A B if A − B O, that is, if A − B is positive semidefinite. Theorem 6.60. Let A0 , A1 , . . . , Am be m + 1 matrices in Cn×n such that A0 is positive definite and all matrices are Hermitian. There
Norms and Inner Products
415
exists a > 0 such that for any t ∈ [−a, a] the matrix Bm (t) = A0 + A1 t + · · · + Am tm is positive definite. Proof. Since all matrices A0 , . . . , Am are Hermitian, note that xH Ai x are real numbers for 0 i m. Therefore, pm (t) = xH Bm (t)x is a polynomial in t with real coefficients and pm (0) = xH A0 x is a positive number if x = 0. Since pm is a continuous function, there exists an interval [−a, a] such that t ∈ [−a, a] implies pm (t) > 0 if x = 0. This shows that Bm (t) is positive definite. 6.14
The Gram–Schmidt Orthogonalization Algorithm
The Gram–Schmidt algorithm (Algorithm 6.14.1) constructs an orthonormal basis for a subspace U of Cn , starting from an arbitrary basis of {u1 , . . . , um } of U . The orthonormal basis is constructed sequentially such that w1 , . . . , w k = u1 , . . . , uk for 1 k m. Algorithm 6.14.1: Gram–Schmidt Orthogonalization Algorithm Data: A basis {u1 , . . . , um } for a subspace U of Cn Result: An orthonormal basis {w1 , . . . , w m } for U 1 W = On,m ; 1 2 W (:, 1) = W (:, 1) + U (:,1)2 U (:, 1); 3 for k = 2 to m do 4 P = In − W (:, 1 : (k − 1))W (:, 1 : (k − 1))H ; 1 5 W (:, k) = W (:, k) + P U (:,k) P U (:, k); 2 6 end 7 return W = (w 1 · · · w m ); Next, we prove the correctness of the Gram–Schmidt algorithm. Theorem 6.61. Let (w1 , . . . , w m ) be the sequence of vectors constructed by the Gram–Schmidt algorithm starting from the basis {u1 , . . . , um } of an m-dimensional subspace U of Cn . The set {w 1 , . . . , wm } is an orthogonal basis of U and w1 , . . . , wk = u1 , . . . uk for 1 k m.
416
Linear Algebra Tools for Data Mining (Second Edition)
Proof. In the algorithm the matrix W is initialized as On,m . Its columns will contain eventually the vectors of the orthonormal basis w1 , . . . , w m . The argument is by induction on k 1. The base case, k = 1, is immediate. Suppose that the statement of the theorem holds for k, that is, the set {w1 , . . . , w k } is an orthonormal basis for Uk = u1 , . . . , uk and constitutes the set of the initial k columns of the matrix W , that is, Wk = W (:, 1 : k). Then, Pk = In − Wk WkH is the projection matrix on the subspace Uk⊥ , so Pk uk is orthogonal on every wi , where 1 i k. Therefore, wk+1 = W (:, (k + 1)) is a unit vector orthogonal on all its predecessors w 1 , . . . , wk , so {w1 , . . . , wm } is an orthonormal set. The equality u1 , . . . , uk = w1 , . . . , w k clearly holds for k = 1. Suppose that it holds for k. Then, we have 1 (uk+1 − Wk WkH uk+1 ) Pk uk+1 2 1 (uk+1 − (w 1 · · · wk )WkH uk+1 ) . = Pk uk+1 2
wk+1 =
Since w1 , . . . , wk belong to the subspace u1 , . . . , uk (by inductive hypothesis), it follows that wk+1 ∈ u1 , . . . , uk , uk+1 , so w1 , . . . , w k+1 ⊆ u1 , . . . , uk . For the converse inclusion, since uk+1 = Pk uk+1 2 w k+1 + (w1 · · · wk )WkH uk+1 , it follows that uk+1 ∈ w1 , . . . , w k , w k+1 . Thus, u1 , . . . , uk , uk+1 ⊆ w1 , . . . , wk , wk+1 . Example 6.31. Let A ∈ R3×2 be the matrix ⎛
⎞ 1 1 A = ⎝0 0⎠. 1 3 It is easy to see that rank(A) = 2. We have {u1 , u2 } ⊆ R3 and we construct an orthogonal basis for the subspace generated by these columns. The matrix W is initialized to O3,2
Norms and Inner Products
417
By Algorithm 6.14.1, we begin by defining ⎛√ ⎞ 2
1 ⎜ 2 ⎟ u1 = ⎝ √0 ⎠, w1 = u1 2 2 2
so
⎛√
2 2
⎜ W = ⎝ √0
2 2
⎞ 0 ⎟ 0⎠. 0
The projection matrix is
⎛
1 2
P = I3 − W (:, 1)W (:, 1) = I3 − w1 w1 = ⎝ 0 − 12 The projection of u2 is
⎞ 0 − 12 1 0 ⎠. 0 12
⎛
⎞ −1 P u2 = ⎝ 0 ⎠ 1
and the second column of W becomes
⎛
√ ⎞ − 22 P u2 2 ⎜ ⎟ u2 = ⎝ √0 ⎠. w k = W (:, 2) = P 2 2
Thus, the orthonormal basis we are seeking consists of the vectors ⎛ √ ⎞ ⎛√ ⎞ 2 − 2 ⎜ 2 ⎟ ⎜ 2 ⎟ ⎝ √0 ⎠ and ⎝ √0 ⎠. 2 2
2 2
To decrease the sensitivity of the Gram–Schmidt algorithm to rounding errors, a goal often referred to as numerical stability, a modified Gram–Schmidt algorithm has been proposed. In the modified variant, lines 4 and 5 of Algorithm 6.14.1 (in which projection on the subspace w1 , . . . , w k is computed) are replaced by a computation of this projection using the equality In − Wk WkH = (In − wk wHk ) · · · (In − w1 w H1 ) of Supplement 6.20. The modified Gram–Schmid algorithm is Algorithm 6.14.2.
418
Linear Algebra Tools for Data Mining (Second Edition)
Algorithm 6.14.2: The Modified Gram–Schmidt Algorithm Data: A basis {u1 , . . . , um } for a subspace U of Cn Result: An orthonormal basis {w1 , . . . , w m } for U 1 W = On,m ; 1 2 W (:, 1) = W (:, 1) + U (:,1)2 U (:, 1); 3 for k = 2 to m do 4 t = U (:, k); 5 for j = 1 to k do 6 t = (In − W (:, j)W (:, j)H )t; 7 end 1 8 W (:, k) = W (:, k) + t t; 2 9 end 10 return W = (w 1 · · · w m ); Theorem 6.62. If L = (v 1 , . . . , v m ) is a sequence of m vectors in Rn , we have m v j 22 . det(GL ) j=1
The equality takes place only if the vectors of L are pairwise orthogonal. Proof. Suppose that L is linearly independent and construct the orthonormal set {y 1 , . . . , y m } as y j = bj1 v 1 + · · · + bjj v j for 1 j m, using Gram–Schmidt algorithm. Since bjj = 0, it follows that we can write v j = cj1 y 1 + · · · + cjj y j for 1 j m so that (v j , y p ) = 0 if j < p and (v j , y p ) = cjp if p j. Thus, we have ⎞ ⎛ (v 1 , y 1 ) (v 2 , y 1 ) · · · (v m , y 1 ) ⎜ 0 (v 2 , y 2 ) · · · (v m , y 2 ) ⎟ ⎟ ⎜ ⎜ 0 0 · · · (v m , y 3 ) ⎟ (v 1 , . . . , v m ) = (y 1 , . . . , y m )⎜ ⎟. ⎟ ⎜ .. .. .. .. ⎠ ⎝ . . . . 0 0 · · · (v m , y m )
Norms and Inner Products
This implies
419
⎞ ⎛ ⎞ v1 (v 1 , v 1 ) · · · (v 1 , v m ) ⎟ ⎜ .. ⎟ ⎜ .. .. ⎠ = ⎝ . ⎠(v 1 , . . . , v m ) ⎝ . ··· . vm (v m , v 1 ) · · · (v m , v m ) ⎞ ⎛ (v 1 , y 1 ) 0 0 ⎟ ⎜ (v 2 , y ) (v 2 , y ) 0 1 2 ⎟ ⎜ =⎜ ⎟ .. .. .. ⎠ ⎝ . . . (v m , y 1 ) (v m , y 2 ) (v m , y m ) ⎞ ⎛ (v 1 , y 1 ) (v 2 , y 1 ) · · · (v m , y 1 ) ⎜ 0 (v 2 , y 2 ) · · · (v m , y 2 ) ⎟ ⎟ ⎜ ×⎜ ⎟. .. .. .. .. ⎠ ⎝ . . . . 0 0 · · · (v m , y m ) ⎛
Therefore, we have m m 2 (v i , y i ) (v i , v i )2 , det(GL ) = i=1
i=1
v i )2 (y i , y i )2 and (y i , y i ) = 1 for 1 i m. because (v i , y i )2 (v i , 2 To have det(GL ) = m i=1 (v i , v i ) , we must have v i = ki y i , that is, the vectors v i must be pairwise orthogonal. Corollary 6.26. Let A ∈ Rn×n be a matrix such that |aij | 1 for n 1 i, j n. Then | det(A)| n 2 and the equality holds only if A is a Hadamard matrix. Proof. √Let ai = (ai1 , . . . , ain ) be the ith row of A. We have ai 2 n, so |(ai , aj )| n by the Cauchy–Schwartz inequality, for 1 i, j n. 2 Note that L = A A, where L is the set of rows of A, so det(A) = G m 2 det(GL ) j=1 v j 2 , so | det(A)|
m
n
v j 2 n 2 .
j=1
The equality takes place only when the vectors of A are mutually orthogonal, so when A is a Hadamard matrix.
420
Linear Algebra Tools for Data Mining (Second Edition)
We saw that orthonormal sets of vectors that do not contain 0 are linearly independent. The Extension Corollary (Corollary 2.3) can now be specialized for orthonormal sets. Theorem 6.63. Let V be a finite-dimensional linear space. If U is an orthonormal set of vectors, then there exists a basis T of V that consists of orthonormal vectors such that U ⊆ T . Proof. Let U = {u1 , . . . , um } be an orthonormal set of vectors in V. There is an extension of U , Z = {u1 , . . . , um , um+1 , . . . , un } to a basis of V, where n = dim(V ), by the Extension Corollary. Now, apply the Gram–Schmidt algorithm to the set U to produce an orthonormal basis W = {w 1 , . . . , w n } for the entire space V. It is easy to see that wi = ui for 1 i m, so U ⊆ W and W is the orthonormal basis of V that extends the set U . Corollary 6.27. If A ∈ Cm×n is a matrix with m n having orthonormal set of columns, then there exists a matrix B ∈ Cm×(m−n) such that (A B) is an orthogonal (unitary) square matrix. Proof.
This follows directly from Theorem 6.63.
Corollary 6.28. Let U be a subspace of an n-dimensional linear space V such that dim(U ) = m, where m < n. Then dim(U ⊥ ) = n − m. Proof.
Let u1 , . . . , um be an orthonormal basis of U , and let u1 , . . . , um , um+1 , . . . , un
be its completion to an orthonormal basis for V, which exists by Theorem 6.63. Then um+1 , . . . , un is a basis of the orthogonal complement U ⊥ , so dim(U ⊥ ) = n − m. Theorem 6.64. A subspace U of Rn is m-dimensional if and only if it is the set of solutions of a homogeneous linear system Ax = 0, where A ∈ R(n−m)×n is a full-rank matrix. Proof. Suppose that U is an m-dimensional subspace of Rn . If v 1 , . . . , v n−m is a basis of the orthogonal complement of U , then
Norms and Inner Products
421
v i x = 0 for every x ∈ U and 1 i n − m. These conditions are equivalent to the equality (v 1 v2 · · · v n−m )x = 0, which shows that U is the set of solutions of a homogeneous linear Ax = 0, where A = (v 1 v 2 · · · v n−m ). Conversely, if A ∈ R(n−m)×n is a full-rank matrix, then the set of solutions of the homogeneous system Ax = b is the null subspace of A and, therefore, it is an m-dimensional subspace. 6.15
Change of Bases Revisited
Next, we examine the behavior of vector components relative to change of bases in real linear spaces equipped with inner products. ˜ = {˜ ˜n } e1 , . . . , e In Section 3.7, we saw that if B = {e1 , . . . , en } and B n are two bases of C , then the contravariant components xk of a vector x relative to the basis B are linkedto the contravariant components of the form xk = ni=1 x ˜i dik , where D is the matrix x ˜k by equalities defined by e˜h = nk=1 dhk ek . If the inner product of two vectors of the basis B isgij = (ei , ej ) for 1 i, j n, the inner product of the vectors u = ni=1 ui ei and v = nj=1 vj ej can be written as (u, v) =
n n i=1 j=1
ui vj (ei , ej ) =
n n
ui vj gij ,
i=1 j=1
where gij = (ei , ej ) for 1 i, j n. The matrix G = (gij ) is the Gram matrix of vectors of the basis B and, therefore, is a positive definite matrix. Next, we define the covariant and contravariant components of a vector x in a real n-dimensional linear space L equipped with an inner product. Definition 6.31. Let L be an n-dimensional inner product real linear space and let B = {e1 , . . . , en } be a basis of L. If x ∈ L, the covariant components of x are the numbers xi = (x, ei ) for 1 i n. The contravariant components of x are the numbers xi defined by x = x1 e1 + · · · + xn en .
Linear Algebra Tools for Data Mining (Second Edition)
422
The application of Definition 6.31 yields ⎛ ⎞ n xj ej , ei ⎠ xi = (x, ei ) = ⎝ j=1
=
n
j
x (ej , ei ) =
j=1
n
gij xj
j=1
for 1 i n. Conversely, the contravariant components of x can be computed from the covariant components xi by solving the system ⎛ 1⎞ ⎛ ⎞ x x1 ⎜ .. ⎟ ⎜ .. ⎟ ⎝ . ⎠ = G⎝ . ⎠, xn xn using Cramer’s formula
j
det G ← xj = Since
det(G)
x1
.. .
xn
.
⎛
⎞⎞ x1 n ⎜ j ⎜ ⎟⎟ gji xi , det ⎝G ← ⎝ ... ⎠⎠ = i=1 xn ⎛
n
(6.30)
g x
it follows that xj = i=1g ji i , where g = det(G) and gji is defined by Equality (6.30). When the basis {ei | 1 i n} is orthonormal, we have 1 if i = j, gij = 0 if i = j, where 1 i, j n. In this case, xi = xi for 1 i n, which means that the contravariant components of x coincide with the covariant components of x.
Norms and Inner Products
6.16
423
The QR Factorization of Matrices
We describe two variants of a factorization algorithm for rectangular matrices. The first algorithm allows us to express a matrix as a product of a rectangular matrix with orthogonal columns and an upper triangular invertible matrix (the thin QR factorization). The second algorithm uses Householder matrices and factorizes a matrix as a product of a square orthogonal matrix and an upper triangular matrix with non-negative diagonal entries (the full QR factorization). Theorem 6.65 (The Thin QR factorization theorem). Let A ∈ Cm×n be a full-rank matrix such that m n. Then A can be factored as A = QR, where Q ∈ Cm×n , R ∈ Cn×n such that (i) the columns of Q constitute an orthonormal basis for range(A), and (ii) R = (rij ) is an upper triangular invertible matrix such that its diagonal elements are real non-negative numbers, that is, rii 0 for 1 i n. Proof. Let u1 , . . . , un be the columns of A. Since rank(A) = n, these columns constitute a basis for range(A). Starting from this set of columns construct an orthonormal basis w1 , . . . , w n for the subspace range(A) using the Gram–Schmidt algorithm. Define Q as the orthogonal matrix Q = (w 1 · · · wn ). By the properties of the Gram–Schmidt algorithm, we have u1 , . . . , uk = w1 , . . . , w k for 1 k n, so it is possible to write uk = r1k w1 + · · · + rkk wk ⎛ ⎞ ⎛ ⎞ r1k r1k ⎜ .. ⎟ ⎜ .. ⎟ ⎜ . ⎟ ⎜ . ⎟ ⎜ ⎟ ⎜ ⎟ ⎜r ⎟ ⎜rkk ⎟ = (w 1 · · · wn )⎜ ⎟ = Q⎜ kk ⎟. 0 ⎜ 0 ⎟ ⎜ ⎟ ⎜ . ⎟ ⎜ . ⎟ ⎝ .. ⎠ ⎝ .. ⎠ 0 0 We can assume that rkk 0; otherwise, that is, if rkk < 0, replace wk by −wk . Clearly, this does not affect the orthonormality of the set {w 1 , . . . , wn }.
424
Linear Algebra Tools for Data Mining (Second Edition)
It is clear that rank(Q) = n. Therefore, by Corollary 3.7, since rank(A) min{rank(Q), rank(R)}, it follows that rank(R) = n, so R is an invertible matrix. Therefore, we have rkk > 0 for 1 k n. Example 6.32. Let us determine a QR factorization for the matrix introduced in Example 6.31. We constructed an orthonormal basis for range(A) that consists of the vectors ⎛ ⎞ ⎛ ⎞ √1 − √12 2 ⎜ ⎟ ⎜ ⎟ w1 = ⎝ 0 ⎠ and w2 = ⎝ 0 ⎠. √1 2
√1 2
Thus, the orthogonal matrix Q is ⎛ 1
√ 2
⎜ Q=⎝ 0
√1 2
⎞ − √12 ⎟ 0 ⎠. √1 2
To compute R, we need to express u1 and u2 as linear combinations of w1 and w 2 . Since √ u1 = 2w1 , √ √ u2 = 2 2w1 + 2w2 , the matrix R is
√ 2 2√ 2 . 2 0
√ R=
Note that the matrix Q obtained by the thin QR decomposition is not a square matrix in general. However, its columns form an orthonormal set. Another variant of the QR decomposition algorithm that produces a first factor Q that is an orthogonal matrix can be obtained using Householder matrices introduced in Example 6.24. We need the following preliminary result. Theorem 6.66. Let x, y ∈ Cp be two vectors such that x = y. There exists a Householder matrix Hv ∈ Cp×p such that Hv x = y. Proof. Let Hv = Ip − 2vv , where v ∈ Cp is a unit vector. The equality (Ip − 2vv )x = y is equivalent to x − y = 2vv x = 2av,
Norms and Inner Products
where a = v x. Then, v = a = 12 x − y, so v=
425
− y). Since v = 1, this implies
1 2a (x
1 (x − y). x − y
Corollary 6.29. If u, w ∈ C1×n are two row vectors such that u2 = w2 = 1, there exists a Householder matrix Hv such that u = wHv . Proof. Clearly, we have uH 2 = w H . Thus, by Theorem 6.66, there exists a matrix Hv such that uH = Hv wH , which implies u = wHvH = wHv . Theorem 6.66 is used for describing an iterative process that leads to a QR decomposition of a matrix A ∈ Cm×n , where m n and rank(A) = n. Theorem 6.67 (The Full QR factorization theorem). Let A ∈ Cm×n be a matrix such that m n and rank(A) = n. Then A can be factored as A=Q
R
Om−n,n
,
where Q ∈ Cm×m and R ∈ Cn×n such that (i) Q is a unitary matrix, and (ii) R = (rij ) is an upper triangular matrix having non-negative diagonal entries. Proof. A vector x has the same norm as the vector y = xe1 (which has only one non-zero component which is non-negative). Therefore, by Theorem 6.66, there exists a Householder matrix Hv such that y = Hv x. Let A = (c1 c2 · · · cn ) ∈ Cm×n . If we multiply A by the Householder matrix Hv1 ∈ Cm×m that zeroes all components of the first column c1 located below the first element, then we obtain a matrix
426
Linear Algebra Tools for Data Mining (Second Edition)
B1 ∈ Cm×n having the form ⎛
⎞ ∗ ··· ∗ · · ·⎟ ⎟ .. .. ⎟, . . ⎠ 0 ∗ ···
∗ ⎜0 ⎜ B1 = Hv1 A = ⎜ .. ⎝.
where the asterisks represents components that are not necessarily 0. Next, let ! 2 ··· m ∈ C(m−1)×(n−1) A1 = B1 2 ··· n and let Hv2 ∈ C(m−1)×(m−1) be a Householder matrix that zeroes the elements of the first column of A1 , that is, the subdiagonal elements of the second column of A: ⎛ ⎞ ∗ ∗ ··· ⎜0 ∗ · · · ⎟ ⎜ ⎟ Hv2 A1 = ⎜ .. .. .. ⎟ ∈ C(m−1)×(n−1) . ⎝. . . ⎠ 0 ∗ ··· Observe that by multiplying B1 by the matrix 1 0 ∈ Cm×m , 0 Hv 2 the first line and the first row of B1 are left intact and the submatrix ! 2 ··· m 1 0 B1 0 Hv 2 2 ··· n coincides with Hv2 A1 . Define ⎛ ∗ ⎜0 ⎜ 1 0 1 0 ⎜ B1 = Rv1 A = ⎜0 B2 = 0 Hv 2 0 Hv 2 ⎜ .. ⎝.
⎞ ··· ∗ · · · ∗⎟ ⎟ · · · ∗⎟ ⎟ ∈ Cm×n . ⎟ · · · ∗⎠ 0 0 ··· ∗ ∗ ∗ 0 .. .
Norms and Inner Products
427
Continuing this construction, we obtain a sequence of matrices B1 , B2 , . . . , Bn−1 in Cm×n such that I O Bk , Bk+1 = k O Hvk+1 where Hvk+1 zeroes the subdiagonal elements of the (k + 1)st column of the matrix Bk , where 1 k n − 1. Thus, we have Bn = QA, where Bn is an upper triangular matrix of the form ⎛ ⎞ ∗ ∗ ··· ∗ ⎜0 ∗ · · · ∗⎟ ⎜ ⎟ ⎜0 0 · · · ∗⎟ ⎜. . . ⎟ ⎜. . . ⎟, ⎜ . . . 0⎟ ⎜. . ⎟ ⎝ .. .. · · · 0⎠ 0 0 ··· 0 and Q ∈ Cm×m is a unitary matrix. This allows us to write A = Q Bn , which is the desired decomposition of the matrix A. Note that Bn ∈ Cm×n can be written as R , Bn = Om−n,n where R is an upper triangular matrix in Cn×n having non-negative diagonal entries. Observe that by writing Q = (Q0 Q1 ), where Q0 ∈ Cm×n and Q1 ∈ Cm×(m−n) , we have A = Q0 R, which is the thin QR decomposition discussed in Theorem 6.65. Corollary 6.30. Let A ∈ Cm×n be a matrix such that n m and rank(A) = m. Then, A can be factored as A = (L Om,n−m )Q where Q ∈ Cn×n and L ∈ Cm×m such that (i) L = (lij ) is a lower triangular matrix having non-negative diagonal entries, and (ii) Q is a unitary matrix.
428
Linear Algebra Tools for Data Mining (Second Edition)
Proof. The statement is obtained by applying Theorem 6.67 to the matrix AH ∈ Cn×m.

By choosing carefully the sequence of Householder matrices, one can prove similar results.

Theorem 6.68. Let A ∈ Cm×n be a matrix such that m ≥ n and rank(A) = n. Then, A can be factored as

A = \begin{pmatrix} R \\ O_{m-n,n} \end{pmatrix} S,

where S ∈ Cn×n and R ∈ Cn×n such that (i) S is a unitary matrix, and (ii) R is an upper triangular matrix having real non-negative elements on its diagonal ending in the bottom right corner.

Also, if rank(A) = m, then A can be factored as A = T (L Om,n−m), where T ∈ Cm×m and L ∈ Cm×m such that (i) L is a lower triangular matrix with real non-negative elements on its diagonal ending in the bottom right corner, and (ii) T is a unitary matrix.

Proof. We prove only the first part of the theorem, which is similar to the proof of the full QR Decomposition Theorem. Let r1, . . . , rm be the rows of the matrix A. Clearly, the row vector ‖rm‖2 eHn has the same norm as rm, so, by Corollary 6.29, there exists a unit vector vm ∈ Rn such that rm Hvm = ‖rm‖2 eHn. Therefore, we have

B_1 = A H_{v_m} = \begin{pmatrix} * & * & \cdots & * & * \\ * & * & \cdots & * & * \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & 0 & * \end{pmatrix}.

Let

A_1 = B_1 \begin{bmatrix} 1 & \cdots & (m-1) \\ 1 & \cdots & (n-1) \end{bmatrix} \in \mathbb{C}^{(m-1)\times(n-1)}
and let Hv_{m−1} ∈ C(n−1)×(n−1) be a Householder matrix that zeroes the first n − 2 elements of the last row of A1:

A_1 H_{v_{m-1}} = \begin{pmatrix} * & \cdots & * & * \\ \vdots & & \vdots & \vdots \\ * & \cdots & * & * \\ 0 & \cdots & 0 & * \end{pmatrix} \in \mathbb{C}^{(m-1)\times(n-1)}.

By multiplying B1 = A Hv_m by

\begin{pmatrix} H_{v_{m-1}} & 0_{n-1} \\ 0_{n-1} & 1 \end{pmatrix},

the last row and the last column of B1 are left intact and the submatrix

\left( B_1 \begin{pmatrix} H_{v_{m-1}} & 0_{n-1} \\ 0_{n-1} & 1 \end{pmatrix} \right) \begin{bmatrix} 1 & \cdots & m-1 \\ 1 & \cdots & n-1 \end{bmatrix}

coincides with A1 Hv_{m−1}. Continuing this construction, we obtain a sequence of matrices B1, B2, . . . , Bm in Cm×n such that

B_{k+1} = B_k \begin{pmatrix} H_{v_{m-(k+1)}} & 0_{n-1} \\ 0_{n-1} & 1 \end{pmatrix},
where Hv m−(k+1) zeroes the elements of the (k+1)st row of the matrix Bk placed at the left of the main diagonal, where 1 k m − 1. Thus, for the last matrix Bm we have Bm = AS, where Bm is an upper triangular matrix and S is a unitary matrix (as a product of unitary matrices). The second part follows from the first part. The subspaces associated with symmetric and idempotent matrices enjoy a special relationship that we discuss next. Theorem 6.69. Let A ∈ Cn×n be an idempotent matrix. Then, null(A) ⊥ range(A) if and only if the matrix A is Hermitian.
Proof. Let A be an idempotent matrix such that null(A) ⊥ range(A). For every u ∈ Cn we have Au ∈ range(A). Also, we have A(In − A)v = (A − A2)v = 0, so (In − A)v ∈ null(A). This implies Au ⊥ (In − A)v for every u, v ∈ Cn. Thus, (Au)H(In − A)v = uH AH(In − A)v = 0 for every u, v ∈ Cn, so AH(In − A) = O. Equivalently, AH = AH A. Taking the conjugate transpose of this equality yields A = AH A, so A = AH. Conversely, if A is both idempotent and Hermitian, then (Au)H(In − A)v = uH AH(In − A)v = uH A(In − A)v = uH(A − A2)v = 0 for every u, v ∈ Cn, so null(A) ⊥ range(A).
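As a quick numerical illustration of Theorem 6.69, one can build the projection matrix onto the column space of a real matrix and check that it is Hermitian, idempotent, and that its range and null space are orthogonal; the fragment below is only a sketch and the data and variable names are arbitrary.

B = randn(6,3);              % a matrix with linearly independent columns
[Q,~] = qr(B,0);             % orthonormal basis of range(B)
A = Q*Q';                    % Hermitian and idempotent
norm(A*A - A)                % idempotency: at round-off level
norm(A - A')                 % symmetry: 0
u = randn(6,1); v = randn(6,1);
(A*u)'*((eye(6) - A)*v)      % A*u in range(A), (I - A)*v in null(A): ~ 0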
Note that if A is Hermitian and idempotent, then the mapping that transforms a vector x into Ax projects Cn onto the subspace range(A).

Next we examine the possibility of constructing a unitary matrix P for a matrix A ∈ Cn×n such that the matrix PH AP is an upper Hessenberg matrix. To this end, we define a sequence of Householder matrices as follows. Let A0 = A and write

A_0 = \begin{pmatrix} a_{11} & b_0 \\ a_0 & B_0 \end{pmatrix},

where a0, b0 ∈ Cn−1 and B0 ∈ C(n−1)×(n−1). Consider the Householder matrix Hv0 (where v0 ∈ Cn−1) such that

H_{v_0} a_0 = \begin{pmatrix} * \\ 0 \\ \vdots \\ 0 \end{pmatrix}

and define

P_1 = \begin{pmatrix} 1 & 0 \\ 0 & H_{v_0} \end{pmatrix}.
Then

P_1 A_0 P_1 = \begin{pmatrix} 1 & 0 \\ 0 & H_{v_0} \end{pmatrix} \begin{pmatrix} a_{11} & b_0 \\ a_0 & B_0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & H_{v_0} \end{pmatrix} = \begin{pmatrix} a_{11} & b_0 H_{v_0} \\ H_{v_0} a_0 & H_{v_0} B_0 H_{v_0} \end{pmatrix},

and, by the choice of Hv0, the first column of this matrix is (a11, ∗, 0, . . . , 0)^T.

Define A1 = Hv0 B0 Hv0 and repeat the process on this matrix to obtain a unitary matrix Hv1 such that the elements of the first column located under the first subdiagonal element in Hv1 A1 Hv1 are 0s. The matrix P2 is given by

P_2 = \begin{pmatrix} I_2 & O \\ O & H_{v_1} \end{pmatrix}

and it is clearly a unitary matrix. We have

P_2^H P_1^H A P_1 P_2 = \begin{pmatrix} * & * & * & \cdots & * \\ * & * & * & \cdots & * \\ 0 & * & * & \cdots & * \\ 0 & 0 & * & \cdots & * \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & * & \cdots & * \end{pmatrix}.

Continuing in the same manner, we build the matrices P1, P2, . . . , Pn−2 such that C = P^H_{n−2} · · · P^H_1 A P1 · · · Pn−2 is an upper Hessenberg matrix. If A is a symmetric matrix, then so is C. Therefore, C is actually a tridiagonal matrix, that is, a matrix of the form

\begin{pmatrix} * & * & 0 & 0 & \cdots & 0 & 0 & 0 \\ * & * & * & 0 & \cdots & 0 & 0 & 0 \\ 0 & * & * & * & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & * & * & * \\ 0 & 0 & 0 & 0 & \cdots & 0 & * & * \end{pmatrix}.
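The Householder constructions above translate directly into code. The following MATLAB function is a minimal sketch of the QR factorization by Householder reflections, restricted to real matrices; the name householder_qr and the final sign adjustment are arbitrary choices, and in practice the built-in qr should be used.

function [Q,R] = householder_qr(A)
%HOUSEHOLDER_QR QR factorization of a real matrix by Householder reflections
[m,n] = size(A);
Q = eye(m); R = A;
for k = 1:min(n,m-1)
    x = R(k:m,k);
    s = 1; if x(1) < 0, s = -1; end
    v = x; v(1) = v(1) + s*norm(x);    % reflection vector for column k
    if norm(v) > 0
        v = v/norm(v);
        H = eye(m-k+1) - 2*(v*v');     % Householder matrix H_v
        R(k:m,k:n) = H*R(k:m,k:n);     % zero the entries below position (k,k)
        Q(:,k:m) = Q(:,k:m)*H;         % accumulate the unitary factor
    end
end
for i = 1:min(m,n)                     % enforce non-negative diagonal entries of R
    if R(i,i) < 0, R(i,:) = -R(i,:); Q(:,i) = -Q(:,i); end
end
end

For A ∈ Rm×n with m ≥ n, norm(Q*R - A) and norm(Q'*Q - eye(m)) should be at round-off level, and R has non-negative diagonal entries.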
6.17 Matrix Groups
One can define on the set of invertible square matrices Cn×n a group structure, where the group operation is the matrix multiplication, the unit of the group is In , and the inverse of a matrix A is A−1 . This is the general linear group GLn (C) and we shall examine several of its subgroups. Similar linear group of invertible square matrices in Rn×n will be denoted by GLn (R). Since every matrix with real entries also belongs to Cn , it is clear that GLn (R) is a subgroup of GLn (C). Subgroups of GLn (C) are known as linear groups. Definition 6.32. An isometry of Rn is a mapping f : Rn −→ Rn such that x − y2 = f (x) − f (y)2 for x, y ∈ Cn (for x, y ∈ Rn ). Example 6.33. Let z ∈ Rn and let tz be the translation defined by tz (x) = x + z for x ∈ Rn . Since tz (x) − tz (y) = x − y, it is clear that any translation is an isometry. It is straightforward to verify that every isometry is a bijection having as inverse an isometry and, therefore, the set of isometries of Rn is a group with respect to function composition. We denote this group by ISO(Rn ). Theorem 6.70. Let f : Rn −→ Rn . The following statements are equivalent: (i) f is an isometry such that f (0n ) = 0n ; (ii) f preserves the Euclidean inner product, that is f (x y) = f (x) f (y) for every x, y ∈ Rn ; (iii) f (x) = Ax, where A is an orthogonal matrix. Proof. (i) implies (ii): Let f be an isometry such that f (0n ) = 0n . Since f preserves distances, we have (x − y) (x − y) = (f (x) − f (y) )(f (x) − f (y)).
(6.31)
Taking y = 0n in the above equality, we obtain x x = f (x) f (x)
(6.32)
for x ∈ Rn . Equality (6.31) is equivalent to x x − 2y x + y y = f (x) f (x) − 2f (y) f (x) + f (y) f (y) and taking into account Equality (6.32), we have y x = f (y) f (x), which means that f preserves the Euclidean inner product. (ii) implies (iii): Suppose that f : Rn −→ Rn is a mapping that preserves the inner product. Then, 1 = ei ei = f (ei ) f (ei ) and 0 = ei ej = f (ei ) f (ej ) for i = j, 1 i, j n. Thus, {f (e1 ), . . . , f (en )} is an orthonormal set and the matrix A = (f (e1 ) · · · f (en )) is orthogonal. Since the set of orthogonal matrices is a subgroup of GLn (R), the matrix A−1 is also orthogonal and, by Corollary 6.16, it preserves the inner product. Thus, the mapping g : Rn −→ Rn given by g(x) = A−1 f (x) also preserves the inner product and, in addition, g(ei ) = A−1 f (ei ) = ei for 1 i n. Since g preserves the inner product, we have xi = x ei = g(x) g(ei ) = g(x) ei = (g(x))i for every x ∈ Rn and 1 i n, which means that g is the identity mapping on Rn . Consequently, f (x) = Ax. (iii) implies (i): This implication follows immediately from Corollary 6.16. Theorem 6.71. An isometry h of Rn has the form h(x) = Ax + b for x ∈ Rn . Proof. Let b = h(0n ). The mapping f : Rn −→ Rn given by f (x) = h(x) − h(0n ) has the property f (0n ) = 0n . Moreover, f is an isometry because f (x) − f (y) = h(x) − h(y). Therefore, by Theorem 6.70, there exists an orthogonal matrix A ∈ Rn×n such that f (x) = Ax for x ∈ Rn . This implies that h(x) = Ax + b for x ∈ Rn , where b = h(0n ). Theorem 6.71 means that every isometry is the composition of a multiplication by an orthogonal matrix followed by a translation. An isometry f (x) = Ax + b for x ∈ Rn is orientation-preserving if the orthogonal matrix A is a rotation matrix, and is orientationreversing if A is a reflection matrix. Definition 6.33. A dilation is a mapping ha : Rn −→ Rn such that ha (x) = ax for x ∈ Rn , where a ∈ R is the ratio of the dilation.
A dilation does not preserve distances (so it is not an isometry), but it preserves angles. Indeed, for u, v ∈ Rn, we have

\cos\angle(h_a(u), h_a(v)) = \frac{(h_a(u), h_a(v))}{\|h_a(u)\|_2 \, \|h_a(v)\|_2} = \frac{(u, v)}{\|u\|_2 \, \|v\|_2} = \cos\angle(u, v).

The set of dilations having a non-zero ratio is easily seen to be a group, where (h_a)^{-1} = h_{1/a}.
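The two facts just discussed are easy to check numerically; the fragment below is only a sketch with arbitrary data, verifying that a map x → Ax + b with A orthogonal preserves distances and that a dilation preserves angles.

[A,~] = qr(randn(3));                          % a random orthogonal matrix
b = randn(3,1); x = randn(3,1); y = randn(3,1);
abs(norm((A*x+b) - (A*y+b)) - norm(x-y))       % ~ 0: distances preserved
a = 2.5;                                       % dilation ratio
ang = @(u,v) acos(dot(u,v)/(norm(u)*norm(v)));
abs(ang(a*x,a*y) - ang(x,y))                   % ~ 0: angles preserved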
6.18 Condition Numbers for Matrices
Let Au = b be a linear system, where A ∈ Cn×n is a non-singular matrix and b ∈ Rn. We examine the sensitivity of the solution of this system to small variations of b. So, together with the original system, we work with a system of the form Av = b + h, where h ∈ Rn is the perturbation of b. Note that A(v − u) = h, so v − u = A−1h. Using a vector norm ‖·‖ and its corresponding matrix norm |||·|||, we have ‖v − u‖ = ‖A−1h‖ ≤ |||A−1||| ‖h‖. Since ‖b‖ = ‖Au‖ ≤ |||A||| ‖u‖, it follows that

\frac{\|v - u\|}{\|u\|} \le \frac{|||A^{-1}|||\,\|h\|}{\|b\| / |||A|||} = \frac{|||A|||\,|||A^{-1}|||\,\|h\|}{\|b\|}. \tag{6.33}

Thus, the relative variation of the solution, ‖v − u‖/‖u‖, is upper bounded by the number |||A||| |||A−1||| ‖h‖/‖b‖. These considerations justify the following definition.
Definition 6.34. Let A ∈ Cn×n be a non-singular matrix. The condition number of A relative to the matrix norm ||| · ||| is the number cond(A) = |||A||||||A−1 |||. Equality (6.33) implies that if the condition number is large, then small variations in b may generate large variations in the solution of the system Au = b, especially when b is close to 0. When this is the
case, we say that the system Au = b is ill-conditioned. Otherwise, the system Au = b is well-conditioned.

Theorem 6.72. Let A ∈ Cn×n be a non-singular matrix. The following statements hold for every matrix norm induced by a vector norm: (i) cond(A) = cond(A−1); (ii) cond(cA) = cond(A) for every c ≠ 0; (iii) cond(A) ≥ 1.

Proof. We prove here only Part (iii). Since AA−1 = I, by the properties of a matrix norm induced by a vector norm, we have cond(A) = |||A||| |||A−1||| ≥ |||AA−1||| = |||In||| = 1.

Let A, B be two non-singular matrices in Cn×n such that B = aA, where a ∈ C − {0}. We have B−1 = a−1A−1, |||B||| = |a| |||A|||, and |||B−1||| = |a|−1 |||A−1|||, so cond(B) = cond(A). On the other hand, det(B) = an det(A). Thus, if n is large enough and |a| < 1, then det(B) can be quite close to 0, while the condition number of B remains the same as that of A. This shows that the determinant and the condition number are relatively independent.

Example 6.34. Let A ∈ C2×2 be the matrix

A = \begin{pmatrix} a & a+\alpha \\ a+\alpha & a+2\alpha \end{pmatrix},

where a > 0 and α < 0. We have

A^{-1} = \begin{pmatrix} -\frac{a+2\alpha}{\alpha^2} & \frac{a+\alpha}{\alpha^2} \\[2pt] \frac{a+\alpha}{\alpha^2} & -\frac{a}{\alpha^2} \end{pmatrix},

so |||A|||1 ≥ a and |||A−1|||1 ≥ a/α2. Thus, cond1(A) ≥ (a/α)2 and, if |α| is small, a system of the form Au = b may be ill-conditioned.

Ill-conditioned linear systems Au = b may occur when large differences in scale exist among the columns of A, or among the rows of A.

Theorem 6.73. Let A = (a1 · · · an) be an invertible matrix in
Cn×n , where a1 , . . . , an are the columns of A. Then
\mathrm{cond}(A) \ge \max\left\{ \frac{\|a_i\|}{\|a_j\|} \;\middle|\; 1 \le i, j \le n \right\}.
Proof.
Since cond(A) = |||A||| |||A−1|||, we have

\mathrm{cond}(A) = \frac{\max\{\|Ax\| \mid \|x\| = 1\}}{\min\{\|Ax\| \mid \|x\| = 1\}},

by Supplement 6.20. Note that Aek = ak, where ak is the kth column of A, and that ‖ek‖ = 1. Therefore, max{‖Ax‖ | ‖x‖ = 1} ≥ ‖ai‖ and min{‖Ax‖ | ‖x‖ = 1} ≤ ‖aj‖, which implies

\mathrm{cond}(A) \ge \frac{\|a_i\|}{\|a_j\|}

for all 1 ≤ i, j ≤ n. This yields the inequality of the theorem.

Example 6.35. Let

A = \begin{pmatrix} 1 & 0 \\ 1 & \alpha \end{pmatrix},

where α ∈ R and α > 0. The matrix A is invertible and

A^{-1} = \begin{pmatrix} 1 & 0 \\ -\frac{1}{\alpha} & \frac{1}{\alpha} \end{pmatrix}.

It is easy to see that the condition number of A relative to the Frobenius norm is

\mathrm{cond}(A) = \frac{2 + \alpha^2}{\alpha}.

Thus, if α is sufficiently close to 0, the condition number can reach arbitrarily large values.
In general, we use as matrix norms the norms of the form |||·|||p. The corresponding condition number of a matrix A is denoted by condp(A).

Example 6.36. Let A = diag(a1, . . . , an) be a diagonal matrix. Then, |||A|||2 = max1≤i≤n |ai|. Since A−1 = diag(1/a1, . . . , 1/an), it follows that

|||A^{-1}|||_2 = \frac{1}{\min_{1\le i\le n} |a_i|},

so

\mathrm{cond}_2(A) = \frac{\max_{1\le i\le n} |a_i|}{\min_{1\le i\le n} |a_i|}.
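The behavior described in Examples 6.35 and 6.36 can be observed directly in MATLAB; the lines below are only an illustration, and the value of alpha and the diagonal entries are arbitrary.

alpha = 1e-3;
A = [1 0; 1 alpha];
norm(A,'fro')*norm(inv(A),'fro')      % Frobenius condition number of A
(2 + alpha^2)/alpha                   % the value predicted in Example 6.35
D = diag([5 -2 0.1]);
cond(D)                               % equals max|d_i|/min|d_i| = 50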
6.19 Linear Space Orientation
Definition 6.35. Let V be an n-dimensional linear space. An ordered basis for V is a sequence B = (u1, . . . , un) such that B = {u1, . . . , un} is a basis for V.

Since B is a basis, note that its Gram matrix GB is positive definite by Theorem 6.55; also, det(GB) ≠ 0.

Definition 6.36. The ordered basis B has a positive orientation if det(GB) > 0, and a negative orientation if det(GB) < 0.

Example 6.37. Let {e1, e2, e3} be the standard basis in R3. The ordered basis B = (e1, e2, e3) has a positive orientation because

\det(G_B) = \det\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} = 1 > 0.

On the other hand, B̃ = (e1, e3, e2) has a negative orientation because

\det(G_{\tilde{B}}) = \det\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} = -1 < 0.

We extend this definition to an arbitrary sequence of vectors T = (t1, . . . , tn) in Rn by saying that T has a positive orientation if det(GT) > 0 and a negative orientation if det(GT) < 0.

Definition 6.37. Let u, v be two vectors in R3. The cross-product of u and v is the vector u × v defined as

u \times v = (u_2 v_3 - u_3 v_2) e_1 + (u_3 v_1 - u_1 v_3) e_2 + (u_1 v_2 - u_2 v_1) e_3.

Note that u × v can be written as a determinant, that is,

u \times v = \begin{vmatrix} e_1 & e_2 & e_3 \\ u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \end{vmatrix}.

The vector w = u × v is orthogonal to both u and v. Indeed, it is easy to see that

(w, u) = (u_2 v_3 - u_3 v_2) u_1 + (u_3 v_1 - u_1 v_3) u_2 + (u_1 v_2 - u_2 v_1) u_3 = 0,

and, similarly, (w, v) = 0.
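A small numerical check of these facts; the vectors chosen below are arbitrary.

u = [1; 2; 0]; v = [0; 1; 3];
c = cross(u,v);                              % u x v via the built-in function
[u(2)*v(3)-u(3)*v(2); ...
 u(3)*v(1)-u(1)*v(3); ...
 u(1)*v(2)-u(2)*v(1)]                        % same vector, from Definition 6.37
[dot(c,u), dot(c,v)]                         % both inner products are 0
det([u, v, c])                               % equals norm(c)^2 > 0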
Note that the triple T = (u, v, u × v) has a positive orientation when u × v ≠ 03, that is, when u and v are not collinear. Indeed, since

G_T = \begin{pmatrix} \|u\|^2 & (u, v) & 0 \\ (v, u) & \|v\|^2 & 0 \\ 0 & 0 & \|u \times v\|^2 \end{pmatrix},
we have det(GT ) = (u2 v2 − (u, v)2 )u × v2 > 0. Finally, w2 = (u2 v3 − u3 v2 )2 + (u3 v1 − u1 v3 )2 + (u1 v2 − u2 v1 )2 = (u21 + u22 + u23 )(v12 + v22 + v32 ) −(u1 v1 + u2 v2 + u3 v3 )2 = u2 v2 (1 − cos2 (∠(u, v))). Thus, w = uv sin α, where α = ∠(u, v). We note that w equals the area of the parallelogram formed by the vectors u and v. The following properties are immediate: (i) u × v = −(v × u), (ii) (au) × v = a(u × v), (iii) (u + v) × w = u × w + v × w for u, v, w ∈ R3 , and a ∈ R. Definition 6.38. Let u, v, w ∈ R3 . The scalar triple product of these vectors is the real number ((u × v), w) denoted as (u, v, w). It is easy to verify the following properties: (i) (u, v, w) = −(v, u, w), (ii) (u, v, w) = (v, w, u) = (w, u, v), (iii) (au, v, w) = a(u, v, w), (iv) (u + t, v, w) = (u, v, w) + (t, v, w). We claim that (u, v, w) equals the volume of the parallelepiped constructed on the sequence of vectors (u, v, w). Indeed, since u×v is the area of the parallelogram formed by the vectors u and v, ((u × v), w) is the product of this area with the projection of w on a vector perpendicular to the parallelogram determined by u and v.
Note that

(u, v, w) = \left( \begin{vmatrix} e_1 & e_2 & e_3 \\ u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \end{vmatrix}, \; w_1 e_1 + w_2 e_2 + w_3 e_3 \right) = \begin{vmatrix} w_1 & w_2 & w_3 \\ u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \end{vmatrix} = \begin{vmatrix} u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \\ w_1 & w_2 & w_3 \end{vmatrix}.
Definition 6.39. Let u, v, w ∈ R3. The vector triple product of these vectors is the vector u × (v × w).

Note that

(u \times (v \times w))_1 = u_2 (v_1 w_2 - v_2 w_1) - u_3 (v_3 w_1 - v_1 w_3)
= v_1 (u_2 w_2 + u_3 w_3) - w_1 (u_2 v_2 + u_3 v_3)
= v_1 (u_1 w_1 + u_2 w_2 + u_3 w_3) - w_1 (u_1 v_1 + u_2 v_2 + u_3 v_3)
= v_1 (u, w) - w_1 (u, v).

Similarly, we have

(u \times (v \times w))_2 = v_2 (u, w) - w_2 (u, v), \qquad (u \times (v \times w))_3 = v_3 (u, w) - w_3 (u, v),

which allows us to write u × (v × w) = v (u, w) − w (u, v).

6.20 MATLAB Computations
Vector norms can be computed using the function norm which comes in two signatures: norm(v) and norm(v,p). The first variant computes v2 ; the second computes vp for any p, 1 p ∞. In addition, norm(v,inf) computes v∞ = max{|vi | | 1 i n}, where v ∈ Rn . If one uses −∞ as the second parameter, then norm(v,-inf) returns min{|vi | | 1 i n}. Example 6.38. For the vector v = [2 -3 5 -4]
the computation norms = [norm(v,1),norm(v,2),norm(v,2.5),norm(v,inf), norm(v,-inf)]
returns

norms =
   14.0000    7.3485    6.5344    5.0000    2.0000
If the first argument of norm is a matrix A, norm(A) returns A2 . For the two-parameter format, the second parameter is restricted to the values 1, 2, inf, and ’fro’. Then, norm(A,2) is the same as norm(A), norm(A,1) is |||A|||1 , (A,inf) is |||A|||∞ , and norm(A,’fro’) yields the Frobenius norm of A. Example 6.39. For the matrix A = [1 -1 2; 3 2 -1; 5 4 2]
the following computation is performed:

norms = [norm(A,1),norm(A,2),norm(A,inf),norm(A,'fro')]

norms =
    9.0000    7.4783   11.0000    8.0623
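These values can be cross-checked against the explicit formulas for |||A|||1 (largest absolute column sum), |||A|||∞ (largest absolute row sum), and the Frobenius norm:

A = [1 -1 2; 3 2 -1; 5 4 2];
max(sum(abs(A),1))           % |||A|||_1   = 9
max(sum(abs(A),2))           % |||A|||_inf = 11
sqrt(sum(abs(A(:)).^2))      % Frobenius norm = 8.0623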
For matrices whose norm is expensive to compute, an approximate estimate of ‖A‖2 can be obtained using the function normest(A), or normest(A,r), where r is the relative error; the default for r is 10−6. The following function implements the Gram–Schmidt algorithm:

function [W] = gram(U)
%GRAM implements the classical Gram-Schmidt algorithm
[n,m] = size(U);
W = zeros(n,m);
W(:,1) = (1/norm(U(:,1)))*U(:,1);
for k = 2:1:m
  P = eye(n) - W*W';
  W(:,k) = W(:,k) + (1/norm(P*U(:,k)))*P*U(:,k);
end
end
An implementation of the modified Gram–Schmidt algorithm is given next.

function [W] = modgram(U)
%MODGRAM implements the modified Gram-Schmidt algorithm
[n,m] = size(U);
W = zeros(n,m);
W(:,1) = (1/norm(U(:,1)))*U(:,1);
for k = 2:1:m
  t = U(:,k);
  for j = 1:1:k
    t = (eye(n) - W(:,j)*W(:,j)')*t;
  end
  W(:,k) = W(:,k) + (1/norm(t))*t;
end
end
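Assuming the two functions above are saved as gram.m and modgram.m, a short usage sketch is:

U = randn(6,4);                 % columns assumed linearly independent
W1 = gram(U); W2 = modgram(U);
norm(W1'*W1 - eye(4))           % departure from orthonormality (classical)
norm(W2'*W2 - eye(4))           % departure from orthonormality (modified)

Both errors should be small; for nearly dependent columns the modified version typically behaves better numerically.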
The Cholesky decomposition of a Hermitian positive definite matrix is computed in MATLAB using the function chol. The function call R = chol(A) returns an upper triangular matrix R satisfying the equation RHR = A. If A is not positive definite, an error message is generated. The matrix R is computed using the diagonal and the upper triangle of A, and the computation makes sense only if A is Hermitian.

Example 6.40. Let A be the symmetric positive definite matrix considered in Example 6.29,

A = \begin{pmatrix} 3 & 0 & 2 \\ 0 & 2 & 1 \\ 2 & 1 & 2 \end{pmatrix}.

Then R = chol(A) yields

R =
    1.7321         0    1.1547
         0    1.4142    0.7071
         0         0    0.4082

The call L = chol(A,'lower') returns a lower triangular matrix L from the diagonal and lower triangle of matrix A, satisfying the equation LLH = A. When A is sparse, this syntax of chol is faster.
Example 6.41. For the same matrix A as in Example 6.40, L = chol(A,'lower') returns

L =
    1.7321         0         0
         0    1.4142         0
    1.1547    0.7071    0.4082
For added flexibility, [R,p] = chol(A) and [L,p] = chol(A,'lower') set p to 0 if A is positive definite and to a positive number otherwise, without returning an error message.

The full QR decomposition of a matrix A ∈ Cm×n is obtained using the function qr as in

[Q R] = qr(A)

To obtain the thin (economy-size) decomposition, we write

[Q R] = qr(A,0)
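For a tall matrix the two calls differ only in the sizes of the factors, as the following sketch (with arbitrary data) shows:

A = randn(5,3);
[Q, R]  = qr(A);      % full:    Q is 5-by-5, R is 5-by-3
[Q0,R0] = qr(A,0);    % economy: Q0 is 5-by-3, R0 is 3-by-3
norm(Q*R - A), norm(Q0*R0 - A)    % both residuals are at round-off level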
The Hessenberg form of a matrix is computed using the function hess. To produce a Hessenberg matrix H and a unitary matrix P such that PHAP = H (or, equivalently, A = PHPH), one can use

[P,H] = hess(A)
For example, if

A =
     1     2     3     4
     5     6     2     5
    -2     2     4     1
     2    -4     1     2

then [P,H] = hess(A) will return

P =
    1.0000         0         0         0
         0   -0.8704   -0.4694    0.1486
         0    0.3482   -0.3733    0.8599
         0   -0.3482    0.8002    0.4883

H =
    1.0000   -2.0889    1.1421    4.8303
   -5.7446    4.1212   -2.0302   -3.3597
         0    5.7089    2.8879   -2.9554
         0         0    0.1780    4.9909
If only the Hessenberg form is desired, one could use the function call H = hess(A). Note that the matrix

B =
     1     2     3     4
     2     2     5     6
     3     5     3     7
     4     6     7     4

is symmetric, so its Hessenberg form H1, obtained with [P1,H1] = hess(B),

P1 =
    0.6931   -0.6010   -0.3980         0
   -0.6931   -0.4039   -0.5970         0
    0.1980    0.6897   -0.6965         0
         0         0         0    1.0000

H1 =
   -0.9118   -0.8868         0         0
   -0.8868   -2.1872    0.1000         0
         0    0.1000    9.0990  -10.0499
         0         0  -10.0499    4.0000
is a tridiagonal matrix.

The condition number of a matrix A is computed using the function cond(A,p), which returns the p-norm condition number of the matrix A. When used with a single parameter, as in cond(A), the 2-norm condition number of A is returned.

Example 6.42. Let A be the matrix

>> A=[10.1 6.2; 5.1 3.1]
A =
   10.1000    6.2000
    5.1000    3.1000
The condition number cond(A) is 567.966, which is quite large, indicating significant sensitivity to inverse calculations. The inverse of A is

>> inv(A)
ans =
  -10.0000   20.0000
   16.4516  -32.5806
If we make a small change in A yielding the matrix

>> B=[10.2 6.3;5.1 3.1]
B =
   10.2000    6.3000
    5.1000    3.1000
the inverse of B changes completely:

>> inv(B)
ans =
   -6.0784   12.3529
   10.0000  -20.0000
Values of the condition number close to 1 indicate a well-conditioned matrix, and the opposite is true for large values of the condition number.

Example 6.43. Consider the linear systems:

10.1x_1 + 6.2x_2 = 12
 5.1x_1 + 3.1x_2 = 6

and

10.2x_1 + 6.3x_2 = 12
 5.1x_1 + 3.1x_2 = 6

that correspond to Ax = b and Bx = b, where b = \begin{pmatrix} 12 \\ 6 \end{pmatrix}. In view of the resemblance of A and B, one would expect their solutions to be close. However, this is not the case. The solution of Ax = b is

>> x=inv(A)*b
x =
         0
    1.9355
while the solution of Bx = b is

>> x=inv(B)*b
x =
    1.1765
         0
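The discrepancy in Example 6.43 is a symptom of the large condition number of A; the following fragment (illustrative only) recomputes both solutions with the backslash operator and displays the condition numbers involved.

A = [10.1 6.2; 5.1 3.1];  B = [10.2 6.3; 5.1 3.1];  b = [12; 6];
xA = A\b                  % [0; 1.9355]
xB = B\b                  % [1.1765; 0]
cond(A), cond(B)          % both condition numbers are large
norm(xB - xA)/norm(xA)    % large relative change for a small change in the matrix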
Exercises and Supplements

(1) Let ν be a norm on Cn. Prove that there exists a number k ∈ R such that for any vector x ∈ Cn, we have ν(x) ≤ k \sum_{i=1}^{n} |x_i|.
Solution: Starting from the equality x = \sum_{i=1}^{n} x_i e_i, we have ν(x) ≤ \sum_{i=1}^{n} ν(x_i e_i) = \sum_{i=1}^{n} |x_i| ν(e_i) ≤ k \sum_{i=1}^{n} |x_i|, where k = max{ν(ei) | 1 ≤ i ≤ n}.
(2) Prove that the mapping s : Rn × Rn −→ R defined by s(x, y) = d2(x, y)^2 is not a metric itself. Show that s(x, y) ≤ 2(s(x, z) + s(z, y)) for x, y, z ∈ Rn.
(3) Prove that if x, y, z are three vectors in Rn and ν is a norm on Rn, then ν(x − y) ≤ ν(x − z) + ν(z − y).
(4) Prove that for any vector norm ν on Rn, we have ν(x + y)^2 + ν(x − y)^2 ≤ 4(ν(x)^2 + ν(y)^2) for every x, y ∈ Rn.
Solution: Note that ν(x + y) ≤ ν(x) + ν(y); similarly, ν(x − y) ≤ ν(x) + ν(y). Thus, ν(x + y)^2 ≤ ν(x)^2 + ν(y)^2 + 2ν(x)ν(y) and ν(x − y)^2 ≤ ν(x)^2 + ν(y)^2 + 2ν(x)ν(y), hence ν(x + y)^2 + ν(x − y)^2 ≤ 2ν(x)^2 + 2ν(y)^2 + 4ν(x)ν(y). The desired inequality follows from 2ν(x)ν(y) ≤ ν(x)^2 + ν(y)^2.
(5) Let x ∈ Rn. Prove that for every ε > 0 there exists y ∈ Rn such that the components of the vector x + y are distinct and ‖y‖2 < ε.
Solution: Partition the set {1, . . . , n} into the blocks B1, . . . , Bk such that all components of x that have an index in Bj have a common value cj. Suppose that |Bj| = pj. Then \sum_{j=1}^{k} p_j = n and the numbers {c1, c2, . . . , ck} are pairwise distinct. Let d = min_{i \ne j} |c_i − c_j|. The vector y can be defined as follows. If Bj = {i1, . . . , i_{p_j}}, then y_{i_1} = η·2^{−1}, y_{i_2} = η·2^{−2}, . . . , y_{i_{p_j}} = η·2^{−p_j}, where η > 0, which makes the numbers c_j + y_{i_1}, c_j + y_{i_2}, . . . , c_j + y_{i_{p_j}} pairwise distinct. It suffices to take η < d to ensure that
the components of x + y are pairwise distinct. Also, note that k η2 nη2 y22 j=1 pj 4 = 4 . It suffices to choose η such that η < min{d, 2 n } to ensure that y2 < . (m,n) : Cm×n −→ R0 be a vectorial matrix norm. Prove (6) Let μ that for every A ∈ Cm×n there exists a constant k ∈ R such n that μ(m,n) (A) k m i=1 j=1 |aij |. Hint: Apply Supplement 6.20. (7) Prove that a matrix A ∈ Cn×n is normal if and only if Ax = AH x for all x ∈ Cn . (8) Let {μ(m,n) | m, n ∈ N>0 } be a consistent family of matrix norms and let a ∈ Cn be a vector. Prove that for each m ∈ N, the function νm : Rm ‘R0 defined by νm (x) = μ(m,n) (xaH ) is a vector norm that is consistent with the family of matrix norms. (9) Prove that a matrix A ∈ Cn×n is normal if and only if Ax2 = AH x2 for every x ∈ Cn . Solution: Suppose that A is normal. Then Ax22 = (Ax, Ax) = (x, AH Ax) = (x, AAH x) = (AH x, AH x) = AH x22 . Conversely, suppose that Ax2 = AH x2 for every x ∈ Let λ ∈ C be such that |λ| = 1. We have
Cn .
A(λx + y)22 = λAx + Ay22 = (λxH AH + y H AH , λAx + Ay) = (λxH AH , λAx) + (λxH AH , Ay) +(y H AH , λAx) + (y H AH , Ay) = (xH AH , Ax) + (y H AH , Ay) +(λxH AH , Ay) + (y H AH , λAx) = Ax22 + Ay22 + 2(λ(Ax, Ay)), because (λxH AH , Ay) + (y H AH , λAx) = 2(λ(Ax, Ay)). Similarly, AH (λx + y)22 = AH x22 + AH y22 + 2(λ(AH x, AH y)),
hence, Ax22 + Ay22 + 2(λ(Ax, Ay)) = AH x22 +AH y22 + 2(λ(AH x, AH y)). Since Ax22 = AH x22 and Ay22 = AH y22 , we obtain (λ(Ax, Ay) − λ(AH x, AH y)) = (λ(x, AH Ay) − λ(x, AAH y)) = (λ(x, (AH A − AAH )y)) = 0, hence, |x, (AH A − AAH )y)| = 0 for every x ∈ Cn . Therefore, (AH A−AAH )y = 0 for every y ∈ Cn . This implies AH A−AAH = On,n , hence, AH A = AAH . (10) Prove that for every matrix A ∈ Cn×n , we have |||A|||2 = |||AH |||2 . (11) Let U ∈ Cn×n be a matrix whose set of columns is orthonormal and let V ∈ Cn×n be a matrix whose set of rows is orthonormal. Prove that |||U |||2 = |||V |||2 = 1. Solution: Since U H U = In , we have |||U |||2 = max{U x2 | x ∈ Cn andx2 = 1}. For x ∈ Cn such that x2 = 1, we have U x22 = xH U H U x = xH x = x22 = 1, which implies |||U |||2 = 1. The similar argument for V follows from Exercise 6.20. (12) Let U ∈ Cn×n be a matrix whose set of columns is orthonormal and let V ∈ Cn×n be a matrix whose set of rows is orthonormal. Prove that |||U A|||2 = |||AV |||2 = |||A|||2 and, therefore, |||U AV |||2 = |||A|||2 . Solution: By hypothesis, we have U H U = In and V V H = 1. Therefore, |||U A|||22 = max{U Ax22 | x2 = 1} = max{xH AH U H U Ax | x2 = 1} = max{xH AH Ax | x2 = 1} = max{Ax22 | x2 = 1} = |||A|||22 . This allows us to conclude that |||U A|||2 = |||A|||2 . The second equality follows immediately from the first. (13) Let A ∈ Rm×n . Prove that there exists i, 1 i n such that Aei 22 n1 A2F .
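A numerical illustration of the normality criterion in Supplement (9) above: for a normal matrix, ‖Ax‖2 = ‖AHx‖2 for every x, while a non-normal matrix fails this for most vectors; the matrices below are arbitrary examples.

N = [2 1i; 1i 2];                    % normal: N*N' equals N'*N
M = [1 1; 0 1];                      % not normal
x = randn(2,1) + 1i*randn(2,1);
abs(norm(N*x) - norm(N'*x))          % ~ 0
abs(norm(M*x) - norm(M'*x))          % generally > 0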
(14) Let A, B ∈ Cn×n be two matrices such that AB = BA. Prove that if A is a normal matrix, then AH B = BAH . Solution: Let C = AH B − BAH . We have C H = B H A − AB H and, therefore, trace(CC H ) = trace((AH B − BAH )(B H A − AB H )) = trace(AH BB H A) − trace(AH BAB H ) −trace(BAH B H A) + trace(BAH AB H ). Since trace(AH BB H A) = trace(AAH BB H ) = trace(AH ABB H ) = trace(AH BAB H ), trace(BAH AB H ) = trace(ABAH B H ) = trace(BAAH B H ) = trace(BAH AB H ), it follows that trace(CC H ) = 0, so C = On,n . Thus, AH B = BAH by Supplement 57 of Chapter 3. This result is known as Fuglede’s theorem [60]. (15) Let A, B ∈ Cn×n be two normal matrices such that AB = BA. Prove that AB is a normal matrix. Solution: Since both A and B are normal, we have (AB)(AB)H = ABB H AH = AB H BAH = B H AAH B = B H AH AB = (AB)H (AB), which proves that AB is normal. (16) Let U ∈ Cn×n be a matrix whose set of columns is orthonormal and let V ∈ Cn×n be a matrix whose set of rows is orthonormal. Prove that |||U AV |||F = |||A|||F . Solution: Since√ U 2F = trace(U H U ) √= n, it follows n. Similarly, V F = n. Consequently, that U F = U AV 2F = trace(V H AH U H U AV ) = trace(V H AH AV ) = trace(V V H AH A) = trace(AH A) = A2F , which yields the needed equality. (17) Let H ∈ Cn×n be a non-singular matrix. Prove that the function f : Cn×n −→ R0 defined by f (X) = HXH −1 2 for X ∈ Cn×n is a matrix norm.
(18) Let A ∈ Cm×n and B ∈ Cp×q be two matrices. Prove that A ⊗ B2F = trace(A A ⊗ B B). (19) Let x0 , x, and y be three members of Rn . Prove that if t ∈ [0, 1] and u = tx + (1 − t)y, then x0 − u2 max{x0 − x2 , x0 − y2 }. Solution: Since x0 − u = x0 − y − t(x − y), we have x − u22 = x0 − y22 − 2t(x0 − y, x − y) + t2 x − y22 . The graph of the function f : [0, 1] −→ R0 given by f (t) = x − u22 is a segment of a convex parabola. Therefore, maxt∈[0,1] f (t) = max{f (0), f (1)} = max{x0 − y22 , x0 − x22 }, which leads to the desired conclusion. (20) Let u1 , . . . , um be m unit vectors in R2 , such that ui − uj = 1. Prove that m 6. (21) Prove that if A ∈ Cn×n is an invertible matrix, then μ(A) 1 for any matrix norm μ. μ(A−1 ) (22) Let A ∈ Rm×n and B ∈ Rn×p be two rectangular matrices that have orthonormal sets of columns. Prove that the matrix AB ∈ Rm×p also has an orthonormal set of columns. Solution: By hypothesis, we have A A = In and B B = Ip . Therefore, (AB) (AB) = B A AB = B In B = B B = Ip , which shows that AB has an orthonormal set of columns. (23) Let Y ∈ Cn×p be a matrix that has an orthonormal set of columns, that is, Y H Y = Ip . Prove the following: (a) Y F = p; (b) for every matrix R ∈ Cp×q we have Y RF = RF . H Y = I , so Y = trace(Y H Y ) = Solution: We have Y p F √ trace(Ip ) = p. For the second part, we can write Y R2F = trace((Y R)H Y R) = trace(RH Y H Y R) = trace(RH R) = R2F , which gives the desired equality.
(24) Let μ : Cn×n −→ R0 be a matrix norm. Prove that there exists a vector norm ν : Cn −→ R0 such that ν(Ax) μ(A)ν(x) for A ∈ Cn×n and x ∈ Cn . Solution: Let b ∈ Cn − {0n }. It is easy to see that the mapping ν : Cn −→ R0 defined by ν(x) = μ(xb ) for x ∈ Cn is a vector norm. Furthermore, we have ν(Ax) = μ(Axb ) μ(A)μ(xb ) = μ(A)ν(x). (25) Let ΦH : Cm×n −→ Cm×n be the function defined by Φ(X) = XHX, where H ∈ Cn×m . Prove that ΦH (X) − ΦH (Y )F 2HF max{XF , Y F } X − Y F Solution: We have ΦH (X) − ΦH (Y )F = XHX − XHY + XHY − Y HY F XHX − XHY F + XHY − Y HY F XF HX − HY F + XH − Y HF Y F max{XF , Y F }(HX − HY F + XH − Y HF ) 2HF max{XF , Y F }X − Y F . (26) Let x, y ∈ Cn − {0}. Prove the following: (a) xy H F = x2 y2 ; (b) |||xy H |||1 = x1 y∞ ; (c) |||xy H |||∞ = x∞ y1 . ˆ B ˆ be four matrices in Cm×n such that none of the (27) Let A, B, A, matrices A, B, or A + B equals Om,n . Define ΔA = Aˆ − A and ˆ − B. Prove that for any matrix norm μ, we have ΔB = B ˆ − (A + B)) μ(A) + μ(B) μ(ΔA ) μ(ΔB ) μ(Aˆ + B max , . μ(A + B) μ(A + B) μ(A) μ(B) Solution: By the triangular property of norms, we have μ(ΔA + ΔB ) μ(ΔA ) + μ(ΔB ) μ(ΔB ) μ(ΔA ) + μ(B) · μ(A) · μ(A) μ(B) μ(ΔA ) μ(ΔB ) , . (μ(A) + μ(B)) max μ(A) μ(B)
ˆ ∈ Cn×p such that none of the (28) Let A, Aˆ ∈ Cm×n and B, B matrices A, B or AB is a zero matrix. Define, as above, ΔA = ˆ − B. Aˆ − A and ΔB = B ˆ − AB) μ(A)μ(B) μ(ΔA ) μ(ΔB ) μ(AˆB + μ(AB) μ(AB) μ(A) μ(B) μ(ΔA ) μ(ΔB ) . + μ(A) μ(B) Solution: We have ˆ − AB = (A + ΔA )(B + ΔB ) − AB AˆB = AΔB + BΔA + ΔA ΔB . By the triangle inequality, ˆ − AB) = μ(AΔB + BΔA + ΔA ΔB ) μ(AˆB μ(AΔB ) + μ(BΔA ) + μ(ΔA ΔB ) μ(A)μ(ΔB ) + μ(B)μ(ΔA ) + μ(ΔA )μ(ΔB ) (since μ is a matrix norm). (29) Prove that for every matrix norm μ induced by a vector norm, we have μ(I) = 1. (30) Let A ∈ Cn×n be a matrix and let μ be a matrix norm induced by a vector norm ν. Prove that if μ(A) < 1, then the matrix In + A is non-singular and 1 1 μ((In + A)−1 ) . 1 + μ(A) 1 − μ(A) Solution: The matrix In + A is non-singular because, otherwise, the system (In + A)x = 0 would have a non-zero solution u. This would imply u = −Au, so ν(u) = ν(Au) μ(A)ν(u), which would imply μ(A) 1, contradicting the hypothesis of the statement. Since In = (In + A)(In + A)−1 , we have 1 = μ(In ) μ(In + 1 μ((In + A)μ((In +A)−1 ) (1+μ(A))μ((In +A)−1 ), so 1+μ(A) −1 A) ).
The equality In = (In + A)(In + A)−1 can be written as In = (In + A)−1 + A(In + A)−1 . This implies 1 μ((In +A)−1 )−μ(A(In +A)−1 ) (1−μ(A))μ((In +A)−1 ), so μ((In + A)−1 )
1 1−μ(A) . in Cn×n )
be a non-singular matrix. Prove (31) Let A ∈ Rn×n (or that if ν is a norm on Rn (on Cn , respectively), then νA defined by νA (x) = ν(Ax) is a norm on Rn (on Cn ). (32) Prove that the function ν0 : Rn −→ R0 , where ν0 (x) is the number of non-zero components of x, is not a norm, although it satisfies the inequality ν(x + y) ν(x) + ν(y) for x, y ∈ Rn . (33) Let A ∈ Rm×n and let ν be a vector norm on Rn . Prove that if A ∈ Rm×n , then we have the following equalities: μ(A) = sup{ν(Ax) | ν(x) = 1} ν(Ax) n = sup x ∈ R − {0} ν(x) = inf{k | ν(Ax) kν(x), for every x ∈ Rn }.
Solution: To prove the first equality, note that {ν(Ax) | ν(x) = 1} ⊆ {ν(Ax) | ν(x) 1}. This implies sup{ν(Ax) | ν(x) = 1} sup{ν(Ax) | ν(x) 1} = μ(A). On the other hand, let x be a vector such that ν(x) 1. We have x = 0 if and only if ν(x) = 0 because ν is a norm. 1 x we have ν(y) = 1 and Otherwise, x = 0, and for y = ν(x) ν(Ax) = ν(A(ν(x)y) = ν(x)ν(Ay) ν(Ay). Therefore, in
either case we have ν(Ax) sup{ν(Ay) | ν(y) = 1} for ν(x) 1. Thus, we have the reverse inequality, sup{ν(Ax) | ν(x) 1} sup{ν(Ay) | ν(y) = 1}, so μ(A) = sup{ν(Ax) | ν(x) = 1}. To prove the second equality, observe that x ν(Ax) x = 0 = sup ν(A x = 0 sup ν(x) ν(x) = sup{ν(Ay) | ν(y) = 1} = μ(A), x = 1 and every vector y with ν(y) = 1 can because ν ν(x) 1 x for some x = 0. be written as y = ν(x) We leave the third equality to the reader. (34) Let x and y be two vectors in Rn . Prove that if ax + by = cx + dy, then
(a2 − c2 )x2 + (b2 − d2 )y2 + (ab − cd)(y x + x y) = 0. (35) Let U ∈ Cn×n be a unitary matrix. Prove that cond2 (U ) = 1. (36) Let A and B be two matrices in Cn×n such that A ∼ B. If A = XBX −1 , prove that 1 B2 A2 cond2 (X)B2 . cond2 (X) (37) Prove that |||C (n,K) |||p 1 and |||R(n,K) |||p 1 for any p 1, where C (n,K) and R(n,K) are the matrices defined in Exercise 61. ! i · · · ik be a (38) Let A ∈ Cn×n be a matrix and let B = A 1 j1 · · · jh submatrix of A. Prove that |||B|||p |||A|||p for any p 1. Hint: Apply Part (b) of Exercise 61 of Chapter 3.
(39) Prove that if D = diag(d1 , . . . , dn ) is a diagonal matrix, then |||D|||p = max{|di | | 1 i n} for every p 1. (40) Let A ∈ Cn×n . We have seen that if A is a unitary matrix, then Ax2 = x2 for every vector x ∈ Cn (see Theorem 6.24). Prove the inverse statement, that is, if Ax2 = x2 for every vector x ∈ Cn , then A is a unitary matrix. Solution: Observe that the condition satisfied by A implies A(x + y)2 = x + y2 for every x, y ∈ Cn . This, in turn, implies (Ax + Ay, Ax + Ay) = (x + y, x + y), which is equivalent to (Ax, Ay) = (x, y), or (Ax)H Ay = xH y. Choosing x = ei and y = ej , the last condition amounts to 1 if i = j (A A)ij = 0 otherwise H
for 1 i, j n. Thus, AH A = In . (41) Let · be a unitarily invariant norm. Prove that for every Hermitian matrix A ∈ Cn×n and every unitary matrix U , we have A − In A − U A + In . (42) Let · be a unitarily invariant norm. Prove that A O A B O D C D for all conforming matrices A, B, C, D. (43) Let A ∈ Cn×n be an invertible matrix and let · be a norm on Cn . Prove that |||A−1 ||| =
1 , min{Ax | x = 1}
where ||| · ||| is the matrix norm generated by · .
Norms and Inner Products
Solution: We claim that −1
{A
t | t = 1} =
Let a = A−1 t for some t ∈ as x=
Cn
455
1 | x = 1 . Ax
such that t = 1. Define x
1 A−1 t
A−1 t.
Clearly, we have x = 1. In addition, 1 1 t = = , −1 −1 A t A t a | x = 1 . Thus, Ax =
so a ∈
1 Ax
−1
{A
t | t = 1} ⊆
1 | x = 1 . Ax
The reverse inclusion can be shown in a similar way. Therefore, |||A−1 ||| = max{A−1 t | t = 1} =
1 . min{Ax | x = 1}
(44) Prove that if μ1 , . . . , μk are matrix norms on Rn×n , then μ : Rn×n R0 defined by μ(A) = max{μ1 (A), . . . , μk (A)} for A ∈ Rn×n is a matrix norm. Solution: Let A, B be two matrices in
Rn×n .
We have
μi (AB) μi (A)μi (B) max{μ1 (A), . . . , μk (A)} × max{μ1 (B), . . . , μk (B)} for every i, 1 i k. Therefore, max μ1 (AB), . . . , μk (AB) max{μ1 (A), . . . , μk (A)} × max{μ1 (B), . . . , μk (B)}, so μ(AB) μ(A)μ(B). We leave to the reader the verification of the remaining properties of matrix norms.
456
Linear Algebra Tools for Data Mining (Second Edition)
(45) Let S ∈ Cn×n be a matrix such that 0 if j i, sij = j−i if j > i, tij δ where tij ∈ C for 1 i, j n and i < j and δ < 1. Prove that there exists a positive number c such that |||S|||2 cδ. Solution: Let t = max{|tij | | 1 i, j n, i = j} and let x ∈ Cn be a vector such that x2 = 1. We have n n sij xj |tij |δi−j |xj | |(Sx)i | = j=1
t(n − i)δ
j=i+1 n
|xj | tnδx2 .
j+1
This implies |||S|||2 tnδ, so we obtain the desired inequality with c = tn. (46) Let A ∈ Cn×n , a ∈ C, b ∈ Cn−1 , and C ∈ C(n−1)×(n−1) be such that a bH . A= b C Prove that b2 |||A|||2 . Solution: Since |||A|||2 = sup{Ax2 | x = 1}, by substitut 1 , the desired inequality follows immediately. ing 0n−1 (47) In Example 6.13 we saw that A∞ fails to be a matrix norm. However, A∞ can be useful due to its simplicity. Let A ∈ Cn×n . Prove the following: (a) if B ∈ Rn×n is a matrix such that abs(A) B, then A∞ (A) A∞ ; (b) if A1 , . . . , Ak ∈ Cn×n , then A∞ (A1 · · · Ak ) nk−1 ki=1 A∞ (Ai ). Solution: If abs(A) B, we have |aij | bij for 1 i, j n. Therefore, the elements of B are non-negative and we have A∞ = maxi,j |aij | maxi,j bij = B∞ .
Norms and Inner Products
457
For the second part, note that (A1 · · · Ak )ij = {(A1 )ii1 (A2 )i1 i2 · · · (Ak )ik−1 j | (i1 , . . . , ik−1 ) ∈ {1, . . . , n}k−1 }. Thus, |(A1 · · · Ak )ij | {|(A1 )ii1 | |(A2 )i1 i2 | · · · (Ak )ik−1 j . (i1 ,...,ik−1 )
Since the last sum contains nk−1 terms and each term is less or equal to ki=1 A∞ (Ai ), the desired inequality follows immediately. (48) Prove that if A, B ∈ Rn×n and abs(A) abs(B), then AF BF . (49) Let A and B be two matrices in Cn×n and let ||| · ||| be a matrix norm on Cn×n . Prove that if |||AB − In ||| 1, then both A and B are invertible. For x, y ∈ Cn , we write abs(x) abs(y) if |xi | |yi | for 1 i n. A norm ν : Cn −→ R0 is monotone if abs(x) abs(y) implies ν(x) ν(y) for x, y ∈ Cn ; ν is said to be absolute if ν(x) = ν(abs(x)) for x ∈ Cn . (50) Prove that a norm ν on absolute.
Cn
is monotone if and only if it is
Solution: If ν is monotone, then it is clearly absolute. Conversely, let ν be an absolute norm and let x, y ∈ Cn such that x = diag(1, . . . , 1, a, 1, . . . , 1)y, where a ∈ [0, 1]. Note that ⎛
⎛ ⎞ ⎞ y1 y1 ⎜ .. ⎟ ⎜ .. ⎟ .⎟ ⎜ . ⎟ 1 − a 1 + a⎜ ⎜ ⎟ ⎜ ⎟ x= ⎜ yk ⎟ + ⎜−yk ⎟. 2 ⎜.⎟ 2 ⎜ . ⎟ ⎝ .. ⎠ ⎝ .. ⎠ yn yn
458
Linear Algebra Tools for Data Mining (Second Edition)
The definition of norms implies that ν(x) 1−a y ), where 2 ν(˜ ⎞ ⎛ y1 ⎜ .. ⎟ ⎜ . ⎟ ⎟ ⎜ ˜ = ⎜−yk ⎟. y ⎜ . ⎟ ⎝ .. ⎠ yn
1+a 2 ν(y) +
Since ν is absolute, this implies ν(x) ν(y). If a1 , . . . , an ∈ [0, 1] and x = diag(a1 , . . . , an )y, then by Exercise 5 of Chapter 3 and the previous argument, we have ν(x) ν(y). Let x, y ∈ Cn such that abs(x) abs(y), which is equivalent to |xi | |yi |. This allows finding a1 , . . . , an ∈ [0, 1] such that |xi | = ai |yi |, which implies ν(abs(x)) ν(abs(y)). Since ν is absolute, this implies ν(x) ν(y). A function g : Rn −→ R0 is a symmetric gauge function if it satisfies the following conditions: (i) g is a norm on Rn ; (ii) g(Pφ u) = g(u) for every permutation φ ∈ PERMn and every u ∈ Rn ; (iii) g(diag(b1 , . . . , bn )u) = g(u) for every (b1 , . . . , bn ) ∈ {−1, 1}n and u ∈ Rn . (51) Prove that for any p 1 the function νp : Rn −→ R0 is a symmetric gauge function on Rn . (52) Let · be a norm defined on Rn . If f : Rn −→ R is a homogeneous function, that is, a function such that f (ax) = af (x) for a 0, then max{f (x) | x = 1} = max{f (x) | x 1}. Solution: It is clear that max{f (x) | x = 1} max{f (x) | x 1}, because the set from the left member is a subset of the set that occurs in the right member. To prove the reverse inequality, let x be a vector such that x 1. Then x = 1. x
Norms and Inner Products
459
By the defining property of f , we also have 1 x = f (x) f (x). f x x Thus, max{f (x) | x 1} max{f (x) | x = 1}. (53) Let A ∈ Rm×p . Prove that AF = vec(A)2 . (54) Let U ∈ Cm×n , V ∈ Cn×p , and let A ∈ Cm×p . Prove that A − U V F = vec(A) − (V ⊗ Im )vec(U )2 = vec(A) − (In ⊗ U )vec(V )2 . Solution: We have A − U V F = vec(A − U V )2 (by Exercise 6.20) = vec(A) − vec(U V )2 = vec(A) − (V ⊗ Im )vec(U )2 = vec(A) − (In ⊗ U )vec(V )2 , by Supplement 90 of Chapter 3. (55) Prove that if A ∈ Cn×n , then |||A|||2 = max{|y H Ax| | x2 = y2 = 1}. Solution: By the Cauchy–Schwarz inequality, we have |y H Ax| y2 Ax2 , so max{|y H Ax| | x2 = y2 = 1} max{Ax2 | x2 } = |||A|||2 .
(6.34)
˜ be ˜ be a unit vector such that A˜ Let x x2 = |||A|||2 and let y ˜ = A˜1x2 A˜ x. We have the unit vector y ˜ H A˜ x= y
1 A˜ x22 ˜ H AH Ax = x = A˜ x2 = |||A|||2 . A˜ x2 A˜ x2
Thus, the Inequality (6.34) can be replaced by an equality.
460
Linear Algebra Tools for Data Mining (Second Edition)
(56) Prove that every symmetric gauge function g : an absolute norm on Rn .
Rn
−→
R0
is
Solution: Let u ∈ Rn . Define bi = 1 if ui 0 and bi = −1 if ui < 0. Then |ui | = bi ui for 1 i n, so abs(u) = diag(b1 , . . . , bn )u. By the definition of gauge functions, g(u) = g(diag(b1 , . . . , bn )u) = g(abs(u)), which implies that g is an absolute norm. (57) For x ∈ Rn , let gk (x) be the sum of the largest k absolute values of components of x. Prove that (a) gk is a gauge symmetric function for 1 k n; (b) if g : Rn −→ R0 is a gauge symmetric function and gk (x) gk (y) for 1 k n, then g(x) g(y). (58) Let A ∈ Rm×n and let I ⊆ {1, . . . , m} and J ⊆ {1, . . .}. Define A(I, J) = {aij | i ∈ I, j ∈ J}, and AC = maxI,J |A(I, J)|. Prove that · C is a norm on C m×n . This norm will be referred to as the cut norm. (59) A (b, I, J)-cut matrix is a matrix B ∈ Rm×n such that there exist a number b ∈ R and two sets I, J, such that I ⊆ {1, . . . , m} and J ⊆ {1, . . .} and b if i ∈ I and j ∈ J, bij = 0 otherwise. Prove that every cut matrix has rank 1 and that there are 2m+n distinct cut matrices having b as their non-zero entry. (60) Let A ∈ Rm×n be such that there exists a pair of sets (I, J) such that A(I, J) |I||J| for some > 0 and let b = A(I,J) |I||J| . If B is 2 the (b, I, J)-cut matrix, prove that A−BF A2F −2 |I||J|. Solution: Note that in A −!B, one subtracts from every eleS ment of the submatrix A their average. Thus, T {a2ij | i ∈ I or j ∈ J} A − B2F = " A(I, J) 2 + aij − i ∈ I, j ∈ J |I||J| = A2F −
A(I, J)2 A2F − 2 |I||J|. |I||J|
Norms and Inner Products
461
(61) Let A ∈ Rm×n be such that |aij | 1. Prove that for every > 0 it is possible to construct a sequence of cut matrices . , Bp such that A − BC mn and p 12 , where B1 , . . B = pk=1 Bk . Solution: If AC mn, we take p = 1 and B1 = Om,n . Otherwise, AC > mn and we can apply repeatedly the process described in Supplement 6.20. In this manner, we construct two sequences of matrices A1 , . . . , Ap and B1 , . . . , Bp such that A1 = A − B1 , A2 = A1 − B2 , . . . , Ap = Ap−1 − Bp . By Supplement 6.20, we have Ai+1 2F Ai 2F − 2 mn. Since · 2F is non-negative, it is clear that this process can be repeated no more than 12 times until we obtain A − BC mn, where B = B1 + · · · + Bp . (62) Prove the following extension of the result of Supplement 6.20: if A ∈ Rm×n , then for every > 0 there exists a sequence of cut matrices B1 , . . . , Bp such that A = B1 + · · · + Bp + R, where RC mn. The number p is the width of this decomposition, while RC is the error. Solution: Let A be a matrix in Rm×n and let A˜ be the matrix 1 A. We have |˜ aij | 1. Choose now 1 = A ∞ . By A˜ = A ∞ Supplement 6.20, there is a sequence of cut matrices E1 , . . . , Ep such that A˜ = E1 + · · · + Ep + Q and QC 1 mn. Therefore, A = A∞ A˜ = A∞ E1 + · · · + A∞ Ep + A∞ Q and we define Bi as the cut matrix Bi = A∞ Ei for 1 i p and R = A∞ Q. Note that RC = A∞ QC A∞ 1 mn mn. (63) Let D ∈ Rn×n be a diagonal matrix such that dii 0 for 1 i n. Prove that if X is an orthogonal matrix, then trace(XD) trace(D). Solution: Since D is a diagonal matrix, we have trace(XD) = n x d i=1 ii ii . The orthogonality of X implies xii 1, so xii dii dii because dii 0. Thus, trace(XD) =
n i=1
xii dii
n i=1
dii = trace(D).
462
Linear Algebra Tools for Data Mining (Second Edition)
(64) Let · be a norm on Rn , Dn = Rn × Rn − {(0, 0)} −→ R and let f : Dn −→ R be defined by f (x, y) =
x + y2 + x − y2 . 2(x2 + y2 )
Prove that 1 for (x, y) ∈ Dn ; (a) f (x + y, x − y) = f (x,y) (b) if a = inf{f (x, y) | (x, y) ∈ Dn } and b = sup{f (x, y) | (x, y) ∈ Dn }, then 1 a 1 b 2, 2 and ab = 1; (c) the norm · is generated by an inner product if and only if a = b = 1. Solution: The first part is immediate. It is clear that a b. The definition of the norm implies f (x, y) 2, so b 2. By the first part we have a = 1b , which implies a 12 , and a b implies a 1 b. The last part follows from Theorem 6.32. (65) Prove that if x, y ∈ Cn are such that x2 = y2 , then x+y ⊥ x − y. (66) Let f : V × V −→ R be a bilinear form on a real inner product space such that x ⊥ y implies f (x, y) = 0. Prove that f (x, y) = c(x, y) for some c ∈ R. (67) Let f : V × V R be a bilinear form on the R-linear space V. Define x ⊥f y to mean that f (x, y) = 0. Prove the following: (a) x ⊥f y and x ⊥f z imply x ⊥f (ay + bz) for a, b ∈ R; (b) x1 ⊥f y and x2 ⊥f y imply (ax1 + bx2 ) ⊥f y for a, b ∈ R; (c) Let f : R2 × R2 −→ R be the bilinear form defined by x x = xx + xy − x y − yy . f , y y Prove that there exist x, y ∈ R2 such that x ⊥f y but y ⊥f x. (68) If S is a subspace of Cn , prove that (S ⊥ )⊥ = s. (69) Prove that every permutation matrix Pφ is orthogonal.
A matrix A ∈ Cm×n is subunitary if A is a submatrix of a unitary matrix U . If A ∈ Rm×n , and A is subunitary, then A is a suborthogonal matrix (see [14], where suborthogonal matrices were introduced and referred to as suborthonormal matrices). A matrix A ∈ Cm×n is semiunitary if it is rowwise or columnwise unitary; if A ∈ Rm×n , then A is said to be semiorthogonal. Clearly, every unitary (orthogonal) matrix is a semiunitary (semiorthogonal) matrix and every semiunitary (semiorthogonal) matrix is a subunitary (suborthogonal) matrix. (70) Prove that every submatrix of a subunitary (suborthogonal) matrix is subunitary (suborthogonal). (71) Prove that every suborthogonal matrix can be augmented to a semiorthogonal matrix by adding only rows and by adding only columns to it. (72) Let A ∈ Cn×n be an upper Hessenberg matrix and let A = QR be its QR decomposition. If R is a non-singular matrix, prove that both matrices Q and RQ are upper Hessenberg matrices. Solution: By Theorem 3.8, R−1 is an upper triangular matrix. Since Q = AR−1 , it follows from Supplement 11 of Chapter 3 that Q is an upper Hessenberg matrix. From the same supplement, it follows that RQ is an upper Hessenberg matrix. (73) Let A ∈ Rn×n be a symmetric matrix. Prove that (x, Ax) − (y, Ay) = (A(x − y), x + y) for every x, y ∈ Rn . (74) Let x, y ∈ Rn be two unit vectors. Prove that | sin ∠(x, y)| =
x + y2 x − y2 . 2
(75) Let u and v be two unit vectors in ∠(u, v), then
Rn .
Prove that if α =
u − v cos α2 = sin α. Further, prove that v cos α is the closest vector in v to u.
Solution: By the definition of the Euclidean norm, we have u − v cos α22 = (u − v cos α) (u − v cos α) = 1 − 2u v cos α + cos2 α = 1 − cos2 α = sin2 α, which justifies the equality that we need to prove. For the second part, let w = av be a vector in v. We have u − av22 = (u − av) (u − av) = 1 − 2au v + a2 = 1 − 2a cos α + a2 . The least value of the function in a is achieved when a = cos α. (76) Let {v 1 , . . . , v p } ⊆ Rn be a collection of p unit vectors such that ∠(vi , v j ) = θ, where 0 < θ π2 for every pair (v i , v j ) . such that 1 i, j p and i = j. Prove that p n(n+1) 2 Solution: We shall prove that under the assumptions made above, the set {A1 , . . . , Ap } of symmetric matrices Ai = v i v i ∈ Rn×n is linearly independent. Suppose that a1 A1 + · · · + ap Ap = On×n . This is equivalent to a1 a1 a1 + · · · + ap ap ap = On×n . Therefore, by multiplying the last equality by ai to the left and by ai to the right, we obtain a1 ai a1 a1 ai + · · · + ap ai ap ap ai = 0, which amounts to a1 cos2 θ + · · · + ap−1 cos2 θ = ap + ap+1 cos2 θ + · · · + ap cos2 θ = 0. This equality and the similar p − 1 equalities can be expressed in matrix form as (Ip (1 − cos2 θ) + Jp cos2 θ)a = Op,p , where
⎛ ⎞ a1 ⎜ .. ⎟ a = ⎝ . ⎠. ap
By Supplement 24 of Chapter 3, a = 0p , so A1 , . . . , Ap are linearly independent. By Supplement 7 of the same chapter, . p n(n+1) 2
(77) Let A ∈ Cn×n be a unitary matrix. If A = (u1 , . . . , un ), prove that the set {u1 , . . . , un } is orthonormal. (78) Let {w 1 , . . . , wk } ⊆ Cn be a set of unit vectors such that wi ⊥ wj for i = j and 1 i, j k. If Wk = (w 1 · · · wk ) ∈ Cn×k , then prove that In − Wk WkH = (In − wk wHk ) · · · (In − w1 wH1 ). Solution: The straightforward proof is by induction on k. (79) This exercise refines the result presented in Supplement 36 of Chapter 3. Let A be a matrix in Cm×n such that rank(A) = r. Prove that A can be factored as A = P C, where P ∈ Cm×r is a matrix having an orthonormal set of columns (that is, P H P = Ir ), C ∈ Cr×n , and rank(P ) = r. Also, prove that A can be factored as A = DQ, where D ∈ Cm×r , Q ∈ Cr×n has an orthonormal set of rows (that is, QQH = Ir , and rank(Q) = r. (80) Let Cn×n be the linear space of complex matrices. Prove the following: (a) the set of Hermitian matrices H and the set of skewHermitian matrices K in Cn×n are subspaces of Cn×n ; (b) if Cn×n is equipped with the inner product defined in Example 6.21, then K = H⊥ . (81) Give an example of a matrix that has positive elements but is not positive definite. (82) Prove that if A ∈ Rn×n is a positive definite matrix, then A is invertible and A−1 is also positive definite. (83) Let A ∈ Cn×n be a positive definite Hermitian matrix. If A = B + iC, where B, C ∈ Rn×n , prove that the real matrix B −C D= C B is positive definite. (84) Let A ∈ R2×2 be a matrix such that x Ax > 0 for every x ∈ R2 − {0}. Does it follow that uH Au > 0 for every x ∈ C2×2 − {0}? i . Hint: Consider A = I2 and x = 0
(85) Let x, y ∈ Rn be two vectors such that x > 0. Prove that there exists a number > 0 such that y2 implies x + y > 0. (86) Let U ∈ Rn×k be a matrix such that U U = Ik , where k n. Prove that for every x ∈ Rn , we have x2 U x2 . Solution: Let u1 , . . . , uk be the columns of the matrix U . We have ⎛ ⎞ u1 ⎜ .. ⎟ U U = ⎝ . ⎠(u1 · · · uk ) = Ik , uk which shows that {u1 , . . . , uk } is an orthonormal set of vectors completion of this in Rn . Let {u1 , . . . , uk , uk+1 , . . . , un } be the n 2 Rn . If x = set to an orthonormal set of i=1 ai ui , then x2 = n 2 i=1 ai . On the other hand, ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ a1 u1 u1 x ⎜ .. ⎟ ⎜ .. ⎟ ⎜ .. ⎟ U x = ⎝ . ⎠x = ⎝ . ⎠ = ⎝ . ⎠, uk ak uk x because of the orthonormality of the set {u1 , . . . , uk }, so Y
x22
=
k i=1
a2i
n
a2i = x22 .
i=1
(87) Let V be a real linear space and let · be a norm generated by an inner product defined on V. V is said to be symmetric relative to the norm if ax − y = x − ay for a ∈ R and x, y ∈ V such that x = y = 1. (a) Prove that if a norm on a linear vector space V is induced by an inner product, then V is symmetric relative to that norm. (b) Prove that V satisfies the Ptolemy inequality x − yz y − zx + z − xy, for x, y, z ∈ V if and only if V is symmetric.
Solution: Suppose that x2 = (x, x) for x ∈ V . Then, if x = y = 1, we have ax − y2 = (ax − y, ax − y) = a2 (x, x) − 2a(x, y) + (y, y) = a2 + 1 + 2a(x, y). It is easy to see that we also have x − ay = a2 + 1 + 2a(x, y), which implies that V is symmetric relative to the norm. For the second part, suppose that V is symmetric relative to the norm. Observe that Ptolemy inequality is immediate if ˆ, y ˆ, z ˆ be three non-zero any of x, y, or z is 0. Therefore, let x vectors defined by ˆ= x
1 1 1 ˆ= x, y y, zˆ = z. 2 2 x y z2
We have ˆ 2 = ˆ x−y
1 2(x, y) 1 x − y2 − + = . x2 x2 y2 y2 x2 y2
ˆ ˆ ˆ + ˆ ˆ , the Ptolemy inequality Since ˆ x−y x−z z−y follows immediately. (88) Let Hu be the Householder matrix corresponding to the unit vector u ∈ Rn . If x ∈ Rn is written as x = y + z, where y = au and z ⊥ u, then Hu x is obtained by a reflection of x relative to the hyperplane that is perpendicular to u, that is, Hu x = −u + v. Solution: We have Hu x = Hu (y + z) = Hu y + Hu z. Then, Hu y = (In − 2uu )(au) = au − 2auu u = au − 2au = −au = −y. Also, Hu z = (In − 2uu )z = z − 2uu z = z because u ⊥ z. (89) Let x ∈ Rn be a unit vector such that x ∈ {e1 , −e1 }, and let v=
1 1 (x + e1 ) and w = (x − e1 ). 2(1 + x1 ) 2(1 − x1 )
Prove that (a) v and w are unit vectors; (b) we have Hv x = −e1 and Hw x = e1 .
Solution: Observe that 1 (x + e1 )(x + e1 ) 2(1 + x1 ) 1 (x2 + 2x1 + e1 2 ) = 1. = 2(1 + x1 )
v2 =
The computation for w is similar. Solving the second part is straightforward and is omitted. (90) Let a b A= ∈ R2×2 . b a Prove that if A is positive semidefinite but not positive definite, then |a| = |b|. (91) Let S = {s1 , . . . , sn } be a finite set. For a subset U of S, define cU ∈ Rn , the characteristic vector of U , by 1 if si ∈ U, (cU )i = 0 otherwise. (a) Prove that (cU , cW ) = |U ∩ V | for any two subsets U, W of S. (b) Let U = (U1 , . . . , Um ) be a sequence of subsets of S. The incidence matrix of U is the matrix CU = (cU1 , . . . , cUn ). If AU ∈ Rm×m is defined by (AU )ij = |Ui ∩ Uj | for 1 i, j m, prove that AU = CU CU , and therefore, AU is a positive semidefinite matrix. (92) Let Y ∈ Rn×k be a matrix such that Y Y = Ik , where k n. Prove that the matrix In − Y Y is positive semidefinite. Solution: Let x ∈ Rn . We have x (In − Y Y )x = x x − (Y x) (Y x) = x22 − Y x22 . The desired inequality follows immediately from Supplement 6.20. (93) Let A and B be two matrices in Cn×n , where A is Hermitian and B is positive semidefinite. Prove that xH Ax < 0 for all x ∈ Cn such that Bx = 0 and x = 0 if and only if there exists a ∈ R such that a > 0 and A − aB ≺ 0.
Solution: Suppose that there exists a ∈ R such that a > 0 and A − aB ≺ 0. This means that xH (A − aB)x < 0. Then, if Bx = 0, it is clear that axH Bx = 0, so xH Ax < 0. Conversely, let A be Hermitian and let B be positive semidefinite such that xH Ax < 0 for all x ∈ Cn such that Bx = 0 and x = 0. Suppose that for every a > 0 there exists x = 0 such that xH Ax axH Bx. In this case, Bx = 0 implies xH Ax 0. This contradiction yields the desired implication. (94) Let {w1 , . . . , w k } ⊆ Cn be an orthonormal set of vectors in Cn and let x ∈ Cn . If ai = (x, wi ) for 1 i k, prove that x−
k
ai w i , x −
i=1
k
ai w i
(x, x) −
i=1
k
ai ai (x, x).
i=1
Prove that (x, x) = ki=1 ai ai if and only if x = ki=1 ai wi . . , An be n + 1 Hermitian matrices and let B(t) = (95) Let n A0 , . . n−k A t . Prove that if An is positive definite, there exists k k=0 a positive number such that B(t) is positive definite for every t ∈ [−, ]. Solution: The definition of B(t) allows us to write
x B(t)x =
n
x Ak xtn−k .
k=0
Since the function g(t) = nk=0 x Ak xtn−k is a polynomial in t and g(0) = x An x > 0, the continuity of the polynomial g(t) implies that there exists > 0 such that t ∈ [, ] implies g(t) > 0. This produces the desired conclusion. (96) Let A ∈ Rn×n and let b ∈ Rn . Define the matrix C ∈ Rn×n by cij = bi aij bj for 1 i, j n. Prove that A is positive semidefinite if and only if C is positive semidefinite. Solution: Let x be a vector in x Cx =
n n i=1 j=1
Rn .
xi cij xj =
We have n n i=1 j=1
xi bi aij bj xj ,
which is equivalent to x Cx = z Az, where ⎞ ⎛ x 1 b1 ⎟ ⎜ z = ⎝ ... ⎠. x n bn We leave the remainder of the argument to the reader. (97) Prove that if A, B ∈ R2×2 are two rotation matrices, then AB = BA. ˜2 , e˜3 } be two orthonormal bases in R3 . We e1 , e Let {e1 , e2 , e3 } and {˜ ˜ ) = δij . have (ei , ej ) = δij and (˜ e ,e i j ˜j = 3k=1 rjk ek and ej = 3k=1 rˆjk e˜k . Since both Suppose that e ˜ ). bases are orthonormal, we obtain rj = (e˜j , e ) and rˆj = (ej , e Clearly, we have: rj = rˆj . ˜j = ˜ , and ej = (98) Prove that rj = rˆj , e k, rjk rk e r r e . This shows that the matrix R = (rj ) is k, kj k orthogonal. (99) A vector v ∈ R3 is isotropic if its components are the same relative to any basis. Prove that the single isotropic vector in R3 is 03 . Solution: Suppose that v ∈ R3 is isotropic. Then, we have ˜j = ˜k , vj ej = vj rkj e v˜k e v= j
and v=
j
j,k
˜j = v˜j e
k
v˜j rjk ek =
j,k
k
vk ek =
If v is isotropic, then v˜j = vj = k rjk vk for any rotation matrix R. Choosing R as 0 10 R = −1 0 0 0 01 we obtain v1 = v2 = 0. Similarly, choosing 1 0 0 R=0 0 1 0 −1 0 we obtain v2 = v3 = 0, hence, v = 03 .
(100) Let u ∈ Rn be a unit vector. A rotation with axis u is an orthogonal matrix A such that Au = u. Prove that if v ⊥ u, then Av ⊥ u and A v ⊥ u. (101) Let u, v, and w be three unit vectors in R2 − {0}. Prove that ∠(u, v) ∠(u, w) + ∠(w, v). Solution: The hypothesis implies the existence of α, β, γ ∈ (0, 2π) such that cos α cos β cos γ u= ,v = , and w = . sin α sin β sin γ Thus, ∠(u, v) = arccos(cos(α − β)), ∠(u, w) = arccos(cos(α − γ)), ∠(w, v) = arccos(cos(γ − β)). Without loss of generality, we may assume that α γ β. Thus, α−β if α − β π, arccos(cos(α − β)) = 2π − α + β if α − β > π and similar equalities can be written for arccos(cos(α−γ)) and arccos(cos(γ − β)). If α − β π, then α − γ π and γ − β π, so ∠(u, v) ∠(u, w) + ∠(w, v). Otherwise, α − β > π and several cases may occur. Since α − β = (α − γ) + (γ − β), at most one of α − γ and γ − β can be greater than π. If we have α − γ π and γ − β π, the inequality to be shown amounts to 2π − α + β α − γ + γ − β, which clearly holds. If α − γ π and γ − β π, then we have 2π − α + β α − γ + 2π − γ + β, which amounts to the inequality α γ, which holds according to the initial assumption.
(102) Let T be a subspace of R^n and let u and v be two unit vectors in R^n such that v ∈ T. If t = proj_T(u), prove that ∠(u, v) ≥ ∠(t, v).
Solution: Suppose that T is an m-dimensional space, where m ≤ n. Let {v, v_1, ..., v_{m−1}} be the extension of the set {v} to an orthonormal basis of T. Since t = proj_T(u), we have $t = (u, v)v + (u, v_1)v_1 + \cdots + (u, v_{m-1})v_{m-1}$, so $(t, v) = (u, v)$. Thus, $\|t\|_2 \cos\angle(t, v) = \cos\angle(u, v)$, which implies $\cos\angle(u, v) \leq \cos\angle(t, v)$.
(103) We now extend the result of Supplement 6.20 to R^n as follows. Let u, v, and w be three unit vectors in R^n − {0}. Prove that ∠(u, v) ≤ ∠(u, w) + ∠(w, v).
(104) This supplement formulates a reciprocal to Theorem 6.36. Prove that if A ∈ C^{n×n} is a Hermitian and idempotent matrix, then there exists a subspace S of C^n such that A is the projection matrix P_S.
Solution: Let x ∈ C^n be a vector, and let S = range(A). Then, u = Ax ∈ S. If z = x − u, we have
\[ (z, u) = z^{\mathsf H}u = (x^{\mathsf H} - u^{\mathsf H})u = (x^{\mathsf H} - x^{\mathsf H}A^{\mathsf H})Ax = x^{\mathsf H}Ax - x^{\mathsf H}A^{\mathsf H}Ax = 0, \]
because A is Hermitian and idempotent. Thus, z ⊥ u and z ∈ S^⊥. By Theorem 6.37, the decomposition x = u + z is unique, so u = Ax = proj_S x.
Let V be an n-dimensional linear space equipped with an inner product (·, ·). The subsets B = {b_1, ..., b_n} and C = {c_1, ..., c_n} of V are reciprocal if $(b_i, c_j) = 1$ if $i = j$ and $(b_i, c_j) = 0$ if $i \neq j$, for 1 ≤ i, j ≤ n.
(105) Let V be an n-dimensional linear space equipped with an inner product (·, ·). If B = {b_1, ..., b_n} is a basis of V, then there exists a unique reciprocal set of B.
Solution: Let U_i be the subspace of V generated by the set B − {b_i} and let $U_i^{\perp}$ be its orthogonal complement. By Corollary 6.28, $\dim(U_i^{\perp}) = 1$ because $\dim(U_i) = n - 1$. Thus, there exists a vector $t \neq 0$ in $U_i^{\perp}$. Note that $(t, b_i) \neq 0$ because $b_i \notin U_i$. Define
\[ c_i = \frac{1}{(t, b_i)}\, t. \]
Then $(b_i, c_i) = 1$ and $(b_j, c_i) = 0$ if $j \neq i$. This construction can be applied to all i, where 1 ≤ i ≤ n, and this yields a set C = {c_1, ..., c_n}, which is reciprocal to B. To prove the uniqueness of the set C, assume that D = {d_1, ..., d_n} is another reciprocal set of the basis B. Then, since $(b_i, c_j) = (b_i, d_j)$, it follows that $(b_i, c_j - d_j) = 0$ for every i, j. Since $c_j - d_j$ is orthogonal on all vectors of B, it follows that $c_j - d_j = 0$, so $c_j = d_j$. Thus, D = C.
(106) Let V be an n-dimensional linear space equipped with an inner product (·, ·). If B = {b_1, ..., b_n} is a basis of V, then the reciprocal set C of B is also a basis of V.
(107) Let L = (v_1, ..., v_n) be a sequence of vectors, where n ≥ 2. Prove that the volume $V_n$ of the parallelepiped constructed on these vectors equals the square root of the Gramian of the sequence (v_1, ..., v_n).
Solution: For the base case n = 2 (writing u = v_1 and v = v_2), the area A of the parallelogram is given by $A = \|u\|_2\|v\|_2\sin\alpha$, where α = ∠(u, v). In other words,
\[ \det(G_{u,v}) = \det\begin{pmatrix} (u,u) & (u,v) \\ (u,v) & (v,v) \end{pmatrix} = \|u\|_2^2\|v\|_2^2 - \|u\|_2^2\|v\|_2^2\cos^2\alpha = \|u\|_2^2\|v\|_2^2\sin^2\alpha = V_2^2. \]
Suppose that the statement holds for sequences of n vectors and let L = (v_1, ..., v_n, v_{n+1}) be a sequence of n + 1 vectors. Let $v_{n+1} = x + y$ be the orthogonal decomposition of $v_{n+1}$ on the subspace $U_n = \langle v_1, \ldots, v_n\rangle$, where x ∈ U_n and y ⊥ U_n. Since x ∈ U_n, there exist $a_1, \ldots, a_n \in \mathbb{R}$ such that $x = a_1 v_1 + \cdots + a_n v_n$. Let
\[ \det(G_L) = \begin{vmatrix} (v_1,v_1) & (v_1,v_2) & \cdots & (v_1,v_n) & (v_1,v_{n+1}) \\ (v_2,v_1) & (v_2,v_2) & \cdots & (v_2,v_n) & (v_2,v_{n+1}) \\ \vdots & \vdots & \cdots & \vdots & \vdots \\ (v_n,v_1) & (v_n,v_2) & \cdots & (v_n,v_n) & (v_n,v_{n+1}) \\ (v_{n+1},v_1) & (v_{n+1},v_2) & \cdots & (v_{n+1},v_n) & (v_{n+1},v_{n+1}) \end{vmatrix}. \]
By subtracting from the last row the first row multiplied by $a_1$, the second row multiplied by $a_2$, etc., the value of the determinant remains the same, and we obtain
\[ \det(G_L) = \begin{vmatrix} (v_1,v_1) & (v_1,v_2) & \cdots & (v_1,v_n) & (v_1,v_{n+1}) \\ (v_2,v_1) & (v_2,v_2) & \cdots & (v_2,v_n) & (v_2,v_{n+1}) \\ \vdots & \vdots & \cdots & \vdots & \vdots \\ (v_n,v_1) & (v_n,v_2) & \cdots & (v_n,v_n) & (v_n,v_{n+1}) \\ (y,v_1) & (y,v_2) & \cdots & (y,v_n) & (y,v_{n+1}) \end{vmatrix}. \]
Note that $(y,v_1) = (y,v_2) = \cdots = (y,v_n) = 0$ because $y \perp U_n$ and $(y,v_{n+1}) = (y, x+y) = \|y\|_2^2$, which allows us to further write
\[ \det(G_L) = \begin{vmatrix} (v_1,v_1) & (v_1,v_2) & \cdots & (v_1,v_n) & (v_1,v_{n+1}) \\ (v_2,v_1) & (v_2,v_2) & \cdots & (v_2,v_n) & (v_2,v_{n+1}) \\ \vdots & \vdots & \cdots & \vdots & \vdots \\ (v_n,v_1) & (v_n,v_2) & \cdots & (v_n,v_n) & (v_n,v_{n+1}) \\ 0 & 0 & \cdots & 0 & \|y\|_2^2 \end{vmatrix} = V_n^2\|y\|_2^2 = V_{n+1}^2. \]
(108) Let A ∈ R^{m×n} be a matrix such that rank(A) = n. Prove that the R-factor of the QR decomposition A = QR has positive diagonal elements, that it equals the Cholesky factor of $A^{\top}A$, and that it is therefore uniquely determined.
Solution: If rank(A) = n, then, by Theorem 6.57, there exists a unique Cholesky factor of the matrix $A^{\top}A$. Suppose that A has the full QR decomposition
\[ A = Q \begin{pmatrix} R \\ O_{m-n,n} \end{pmatrix}, \]
where Q ∈ R^{m×m} and R ∈ R^{n×n}. Then
\[ A^{\top}A = \begin{pmatrix} R^{\top} & O_{n,m-n} \end{pmatrix} Q^{\top} Q \begin{pmatrix} R \\ O_{m-n,n} \end{pmatrix} = R^{\top}R. \]
(109) Let A ∈ R^{n×n} be a symmetric positive definite matrix and let $d_A : \mathbb{R}^n \times \mathbb{R}^n \longrightarrow \mathbb{R}$ be the function defined by $d_A(x,y) = \sqrt{(x-y)^{\top}A(x-y)}$ for x, y ∈ R^n. Prove that $d_A$ is a metric on R^n.
Solution: By Cholesky’s Decomposition Theorem, we can factor A as $A = R^{\top}R$, where R is an upper triangular matrix with positive diagonal elements. Therefore,
\[ d_A(x,y) = \sqrt{(R(x-y))^{\top}(R(x-y))} = \|R(x-y)\|_2 \]
and the desired conclusion follows immediately.
(110) Let A ∈ C^{n×m} be a full-rank matrix such that n ≤ m. Prove that A can be factored as A = LQ, where L ∈ C^{n×n} and Q ∈ C^{n×m}, such that (a) the columns of Q constitute an orthonormal basis for range(A^H), and (b) L = (ℓ_{ij}) is a lower triangular invertible matrix such that its diagonal elements are real non-negative numbers, that is, ℓ_{ii} ≥ 0 for 1 ≤ i ≤ n.
(111) Let $f_n : \mathbb{R}^n \times \mathbb{R}^n \longrightarrow \mathbb{R}$ be the function defined by $f_n(x,y) = \prod_{j=1}^{n}(1 + x_j y_j)$ for x, y ∈ R^n. Prove that if x, y ∈ C(0_n, 1), then $|f_n(x,y)| \leq e$.
Solution: It is easy to verify the elementary inequality $\ln(1 + |t|) \leq |t|$ for t ∈ R. This allows us to write
\[ |f_n(x,y)| \leq e^{\sum_{j=1}^{n}\ln(1+|x_j y_j|)} \leq e^{\sum_{j=1}^{n}|x_j y_j|} \leq e^{\|x\|_2\|y\|_2} \leq e, \]
because $\|x\|_2 \leq 1$ and $\|y\|_2 \leq 1$.
(112) Prove that if a matrix A ∈ C^{n×n} is normal, then range(A) ⊥ null(A).
(113) We saw that for every matrix A, the matrix $A^{\mathsf H}A$ is positive semidefinite. Prove that if A = QR is the full QR decomposition of A, then the Cholesky decomposition of $A^{\mathsf H}A$ is $R^{\mathsf H}R$.
(114) Let U ∈ C^{n×n} be a unitary matrix. If U = (X Y), where X ∈ C^{n×p} and Y ∈ C^{n×(n−p)}, prove that range(X) = range(Y)^{\perp}.
(115) Let A ∈ R^{n×n} be a skew-symmetric matrix. Prove that $A^2$ is a symmetric negative semidefinite matrix.
Solution: We have $(A^2)^{\top} = (A^{\top})^2 = (-A)^2 = A^2$, so $A^2$ is symmetric. Furthermore, $x^{\top}A^2x = -x^{\top}A^{\top}Ax = -(Ax)^{\top}Ax \leq 0$ for x ∈ R^n.
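Supplement 113 can be checked numerically in MATLAB; the sketch below uses a random real matrix and normalizes the signs of the R-factor, since qr may return an R with negative diagonal entries, while chol produces a positive diagonal.

% Sketch: the sign-normalized R-factor of A coincides with the
% Cholesky factor of A'*A (real case, full column rank).
A = randn(6, 3);
[Q, R] = qr(A, 0);            % economy-size QR decomposition
S = diag(sign(diag(R)));      % makes the diagonal of R positive
norm(S*R - chol(A'*A))        % close to zero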
(116) Let B = {b_1, ..., b_n} and C = {c_1, ..., c_n} be two orthonormal bases of R^n. The coherence of B and C is the number
\[ \mathrm{coh}(B,C) = \max_{1 \leq i,j \leq n} |b_i^{\top} c_j|. \]
Prove that
\[ \frac{1}{\sqrt{n}} \leq \mathrm{coh}(B,C) \leq 1. \]
Solution: Note that the matrix $D = (b_i^{\top}c_j) \in \mathbb{R}^{n\times n}$ is an orthonormal matrix and $\|d_k\|_2 = 1$ for each of the columns $d_k$ of D. Thus, not all entries of D can be less than $\frac{1}{\sqrt{n}}$, so $\max_{1\leq i,j\leq n} |b_i^{\top}c_j| \geq \frac{1}{\sqrt{n}}$.
(117) Let A ∈ R^{m×m} and B ∈ R^{n×n} be two orthogonal matrices. Prove that their Kronecker product A ⊗ B ∈ R^{mn×mn} is also an orthogonal matrix.
Solution: We have
\[ (A \otimes B)(A \otimes B)^{\top} = (A \otimes B)(A^{\top} \otimes B^{\top}) \ \text{(by Theorem 3.56)} = (AA^{\top}) \otimes (BB^{\top}) \ \text{(by Theorem 3.55)} = I_m \otimes I_n = I_{mn}. \]
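The following MATLAB fragment (an illustration only, with random orthogonal factors obtained from QR) verifies Supplement 117 numerically:

% The Kronecker product of orthogonal matrices is orthogonal.
[A, ~] = qr(randn(3));        % random 3x3 orthogonal matrix
[B, ~] = qr(randn(4));        % random 4x4 orthogonal matrix
K = kron(A, B);
norm(K*K' - eye(12))          % close to zero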
(118) Let U, V ∈ R^{n×n} be two orthonormal matrices, where U = (u_1, ..., u_n) and V = (v_1, ..., v_n). Since the columns of U and V are bases for R^n, for every x ∈ R^n with $\|x\|_2 = 1$ there exist a, b ∈ R^n such that x = Ua = Vb. Prove the inequality
\[ \|a\|_1 + \|b\|_1 \geq \frac{2}{\sqrt{\mathrm{coh}(U,V)}}, \]
known as the uncertainty principle (see [43]).
Solution: Since both U and V are orthonormal matrices, we have $\|x\|_2 = \|a\|_2 = \|b\|_2$. Therefore,
\[ 1 = \|x\|_2^2 = x^{\top}x = a^{\top}U^{\top}Vb \leq \sum_{i=1}^{n}\sum_{j=1}^{n} |a_i|\,|b_j|\,|u_i^{\top}v_j| \leq \mathrm{coh}(U,V)\,\|a\|_1\|b\|_1. \]
This implies $\|a\|_1\|b\|_1 \geq \frac{1}{\mathrm{coh}(U,V)}$, which in turn yields $\|a\|_1 + \|b\|_1 \geq \frac{2}{\sqrt{\mathrm{coh}(U,V)}}$.
(119) Let u, v, w, z ∈ R3 . Prove that
\[ ((u \times v), (w \times z)) = \begin{vmatrix} (u,w) & (u,z) \\ (v,w) & (v,z) \end{vmatrix}. \]
This is Lagrange’s identity.
(120) Let a, b, c, d ∈ R^3 be such that a ⊥ b and c ⊥ d. Prove that the vectors x and y are orthogonal, where x = (b × c) × (a × d) and y = (a × c) × (b × d).
(121) Let u, v, w ∈ R^3. Prove that u × (v × w) + v × (w × u) + w × (u × v) = 0_{R^3}. This is Jacobi’s identity.
(122) Prove that if u, v ∈ R^3, then $(u \times v)_i = \sum_{j,k=1}^{3} \epsilon_{ijk} u_j v_k$ for 1 ≤ i ≤ 3.
(123) Let a, b, c ∈ R^3 be three vectors that do not belong to the same plane. Prove that the vectors a + αb, b + βc, and c + γa are coplanar if and only if αβγ = −1.
Solution: The scalar triple product of a + αb, b + βc, and c + γa can be rewritten as
\[ (a + \alpha b, b + \beta c, c + \gamma a) = (a, b + \beta c, c + \gamma a) + \alpha(b, b + \beta c, c + \gamma a) = (a, b + \beta c, c + \gamma a) + \alpha\beta(b, c, c + \gamma a) = (a, b + \beta c, c + \gamma a) + \alpha\beta\gamma(b, c, a). \]
Similarly, $(a, b + \beta c, c + \gamma a) = (a, b + \beta c, c) + (a, b + \beta c, \gamma a) = (a, b, c)$. Thus,
\[ (a + \alpha b, b + \beta c, c + \gamma a) = (a, b, c)(1 + \alpha\beta\gamma). \]
Consequently, the vectors a + αb, b + βc, and c + γa are coplanar if and only if their scalar triple product is 0, that is, if and only if αβγ = −1, because the fact that a, b, c are not coplanar implies (a, b, c) ≠ 0.
Bibliographical Comments
Supplements 58–61 contain concepts and results developed in [59, 87]. Supplement 52 is stated in [113].
Chapter 7
Eigenvalues
7.1 Introduction
The existence of directions that are preserved by linear transformations (which are referred to as eigenvectors) was discovered by Euler in his study of movements of rigid bodies. This work was continued by Lagrange, Cauchy, Fourier, and Hermite. The theme of eigenvectors and eigenvalues acquired increasing significance through its applications in heat propagation and stability theory. Later, Hilbert initiated the study of eigenvalues in functional analysis (in the theory of integral operators). He introduced the terms “eigenvalue” and “eigenvector”.1
7.2 Eigenvalues and Eigenvectors
Definition 7.1. Let A ∈ Cn×n be a square matrix. An eigenvector of A is a vector v ∈ Cn − {0n } such that Av = λv for some λ ∈ C. The complex number λ that satisfies the previous equality is known as an eigenvalue of the matrix A and the pair (λ, v) is known as an eigenpair of A.
1 The term eigenvalue is a German–English hybrid formed from the German word eigen, which means “own”, and the English word “value.” Its use is common in the literature and we adopt it here.
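Definition 7.1 can be illustrated numerically in MATLAB (the matrix below is chosen arbitrarily): eig returns eigenvalues and eigenvectors, and the residual Av − λv is negligible for every eigenpair.

% An eigenpair (lambda, v) of A satisfies A*v = lambda*v.
A = [2 1; 1 3];
[V, D] = eig(A);
lambda = D(1,1); v = V(:,1);
norm(A*v - lambda*v)          % close to zero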
The set of eigenvalues of a matrix A will be referred to as the spectrum of A and will be denoted by spec(A). Example 7.1. Let A ∈ C2×2 be the matrix A=
\[ A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}. \]
The vector $v = (v_1, v_2)^{\top} \neq 0_2$ is an eigenvector of A if Av = λv, a system equivalent to $av_1 + bv_2 = \lambda v_1$, $cv_1 + dv_2 = \lambda v_2$. This, in turn, is equivalent to the homogeneous system
\[ (a - \lambda)v_1 + bv_2 = 0, \qquad cv_1 + (d - \lambda)v_2 = 0. \]
A non-trivial solution exists if and only if $\det(A - \lambda I_2) = 0$, which is equivalent to $(a - \lambda)(d - \lambda) - bc = 0$ or $\lambda^2 - (a + d)\lambda + ad - bc = 0$. The roots of this equation are the eigenvalues of A:
\[ \lambda_{1,2} = \frac{a + d \pm \sqrt{(a - d)^2 + 4bc}}{2}. \]
Theorem 7.1. Let A ∈ C^{n×n} be a matrix. If A has n distinct eigenvalues and $v_1, \ldots, v_n$ are corresponding eigenvectors, then $\{v_1, \ldots, v_n\}$ is a linearly independent set.
Proof. Suppose that $\lambda_1, \ldots, \lambda_n$ are distinct eigenvalues of A and $v_1, \ldots, v_n$ are eigenvectors that correspond to these values.
If $\{v_1, \ldots, v_n\}$ were a linearly dependent set, we would have a linear combination of these vectors
\[ a_1 v_{p_1} + \cdots + a_k v_{p_k} = 0_n, \qquad (7.1) \]
containing a minimal number k of vectors such that not every number $a_1, \ldots, a_k$ is 0. This would yield
\[ a_1 A v_{p_1} + \cdots + a_k A v_{p_k} = a_1 \lambda_{p_1} v_{p_1} + \cdots + a_k \lambda_{p_k} v_{p_k} = 0_n. \]
Taking into account Equality (7.1) multiplied by $\lambda_{p_k}$, we would have
\[ a_1(\lambda_{p_1} - \lambda_{p_k})v_{p_1} + \cdots + a_{k-1}(\lambda_{p_{k-1}} - \lambda_{p_k})v_{p_{k-1}} = 0_n, \]
which would contradict the minimality of the number of terms in Equality (7.1). Thus, $\{v_1, \ldots, v_n\}$ is a linearly independent set.
Corollary 7.1. A matrix A ∈ C^{n×n} has at most n distinct eigenvalues.
Proof. Since the maximum size of a linearly independent set in C^n is n, it follows from Theorem 7.1 that A cannot have more than n distinct eigenvalues.
The set of eigenvectors $S_{A,\lambda}$ that correspond to an eigenvalue λ of A is a subspace of C^n because $u, v \in S_{A,\lambda}$ implies $A(au + bv) = aAu + bAv = a\lambda u + b\lambda v = \lambda(au + bv)$. The subspace $S_{A,\lambda}$ is known as the invariant subspace of A for the eigenvalue λ. Clearly, the invariant subspace of A ∈ C^{n×n} for λ coincides with the null space of the matrix $\lambda I_n - A$. Not every real matrix A ∈ R^{n×n} has a non-zero invariant subspace in R^n.
Example 7.2. Let
\[ A = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \in \mathbb{R}^{2\times 2}. \]
If Ax = λx for x ∈ R^2, then we have $x_2 = \lambda x_1$ and $-x_1 = \lambda x_2$, which implies $x_1^2 + x_2^2 = 0$. This is equivalent to $x = 0_2$. Thus, A has no non-zero invariant subspace.
The situation is different if we regard A as a matrix in C^{2×2}. Under this assumption, the equalities $x_2 = \lambda x_1$ and $-x_1 = \lambda x_2$
imply $\lambda^2 + 1 = 0$ if $x \neq 0_2$. Thus, we have $\lambda_1 = i$ and $\lambda_2 = -i$. In the first case, we have the invariant subspace
\[ \left\{ \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \in \mathbb{C}^2 \;\middle|\; x_2 = i x_1 \right\}, \]
while in the second case the invariant subspace is
\[ \left\{ \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \in \mathbb{C}^2 \;\middle|\; x_2 = -i x_1 \right\}. \]
Observe that if λ is an eigenvalue for A ∈ C^{n×n}, then aλ is an eigenvalue of the matrix aA for every a ∈ C. Thus, a spec(A) = spec(aA).
If (λ, x) is an eigenpair of A, then $x^{\mathsf H}Ax = \lambda x^{\mathsf H}x$, so
\[ \lambda = \frac{x^{\mathsf H}Ax}{x^{\mathsf H}x}. \qquad (7.2) \]
Equality (7.2) can be specialized to the real case by replacing $x^{\mathsf H}$ by $x^{\top}$. Namely, if A ∈ R^{n×n}, λ is an eigenvalue and x is an eigenvector that corresponds to λ, then
\[ \lambda = \frac{x^{\top}Ax}{x^{\top}x}. \qquad (7.3) \]
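As a small MATLAB illustration (not part of the original text), the matrix of Example 7.2 has only the complex eigenvalues ±i, and for a real symmetric matrix the Rayleigh quotient of Equality (7.3) recovers the corresponding eigenvalue:

A = [0 1; -1 0];
eig(A)                        % returns i and -i (up to ordering)
S = [4 1; 1 2];
[V, D] = eig(S);
x = V(:,1);
(x'*S*x)/(x'*x)               % equals D(1,1) up to round-off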
The geometric multiplicity of an eigenvalue λ of a matrix A ∈ R^{n×n} is denoted by geomm(A, λ) and is equal to $\dim(S_{A,\lambda})$. Equivalently, the geometric multiplicity of λ is
geomm(A, λ) = dim(null(A − λIn )) = n − rank(A − λIn )
(7.4)
by Equality (3.8).
Theorem 7.2. Let A ∈ R^{n×n}. We have 0 ∈ spec(A) if and only if A is a singular matrix. Moreover, in this case, geomm(A, 0) = n − rank(A) = dim(null(A)).
Proof. The statement is an immediate consequence of Equality (7.4).
Corollary 7.2. Let A ∈ R^{n×n}. If 0 ∈ spec(A) and algm(A, 0) = 1, then rank(A) = n − 1.
Proof. Clearly, we have geomm(A, 0) = 1, so rank(A) = n − 1.
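Equality (7.4) translates directly into MATLAB; in the sketch below (an illustrative example only), the eigenvalue 2 of an upper triangular matrix has algebraic multiplicity 2 but geometric multiplicity 1.

% geomm(A, lambda) = n - rank(A - lambda*I), as in Equality (7.4).
A = [2 1 0; 0 2 0; 0 0 3];
n = size(A, 1);
lambda = 2;
n - rank(A - lambda*eye(n))   % geometric multiplicity, here 1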
Theorem 7.3. Let A ∈ C^{n×n} and let S ⊆ C^n be an invariant subspace of A. If the columns of a matrix X ∈ C^{n×p} constitute a basis of S, then there exists a unique matrix L ∈ C^{p×p} such that AX = XL.
Proof. Let $X = (x_1 \cdots x_p)$. Since $Ax_j \in S$, it follows that $Ax_j$ can be uniquely expressed as a linear combination of the columns of X, that is,
\[ Ax_j = x_1 \ell_{1j} + \cdots + x_p \ell_{pj} \quad\text{for } 1 \leq j \leq p. \]
Thus,
\[ Ax_j = X \begin{pmatrix} \ell_{1j} \\ \vdots \\ \ell_{pj} \end{pmatrix}. \]
The matrix L is defined by $L = (\ell_{ij})$.
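A short MATLAB sketch of Theorem 7.3 (with an arbitrarily chosen symmetric matrix): the columns of X span an invariant subspace, and the matrix L obtained by solving AX = XL satisfies the equality up to round-off.

A = [3 1 0; 1 3 0; 0 0 5];
[V, ~] = eig(A);
X = V(:, 1:2);                % basis of an invariant subspace of A
L = X \ (A*X);                % the unique L with A*X = X*L
norm(A*X - X*L)               % close to zero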
Corollary 7.3. Using the notations of Theorem 7.3, the pair (λ, v) is an eigenpair of the matrix L if and only if (λ, Xv) is an eigenpair of A. Proof. The statement is an immediate consequence of Theo rem 7.3. The matrix L introduced in Theorem 7.3 will be referred to as a representation of A on the invariant subspace S. Clearly, L depends on the basis chosen for S, so this representation is not unique. Furthermore, we have spec(L) ⊆ spec(A). Theorem 7.4. Let A ∈ Cm×n be a matrix with rank(A) = n and let B ∈ Cp×q be a matrix such that range(B) = range(A)⊥ . Then range(A) is an invariant subspace of a matrix X ∈ Cp×m if and only if B H XA = Oq,n . Proof. The following statements are easily seen to be equivalent: (i) the subspace range(A) is an invariant subspace of X; (ii) Xrange(A) ⊆ range(A); (iii) Xrange(A) ⊥ range(A)⊥ ; (iv) Xrange(A) ⊥ range(B). The last statement is equivalent to B H XA = Oq,n .
Let A ∈ Cn×n be a matrix having the eigenvalues λ1 , . . . , λn . If x1 , . . . , xn are n eigenvectors corresponding to these values, then we have Ax1 = λ1 x1 , . . . , Axn = λn xn . By introducing the matrix X = (x1 · · · xn ) ∈ Cn×n , these equalities can be written in a concentrated form as AX = Xdiag(λ1 , . . . , λn ).
(7.5)
Obviously, since the eigenvalues can be listed in several ways, this equality is not unique. Suppose now that $x_1, \ldots, x_n$ are pairwise orthogonal unit vectors (as can be arranged, for instance, when A is Hermitian or, more generally, normal) and that the eigenvalues $\lambda_1, \ldots, \lambda_n$ are distinct. Then X is a unitary matrix, $X^{-1} = X^{\mathsf H}$, and we obtain the following equality:
\[ A = X\,\mathrm{diag}(\lambda_1, \ldots, \lambda_n)\,X^{\mathsf H} = \lambda_1 x_1 x_1^{\mathsf H} + \cdots + \lambda_n x_n x_n^{\mathsf H}, \]
(7.6)
known as the spectral decomposition of the matrix A. We will discuss later (in Chapter 9) a far-reaching extension of the spectral decomposition.
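For a real symmetric matrix, whose unit eigenvectors can be chosen orthonormal, the spectral decomposition (7.6) can be verified in MATLAB as follows (an illustrative sketch with an arbitrary symmetric matrix):

A = [4 1 2; 1 3 0; 2 0 1];
[X, D] = eig(A);              % X is orthogonal since A is symmetric
norm(A - X*D*X')              % close to zero
B = zeros(size(A));           % rank-one form of (7.6)
for k = 1:3
    B = B + D(k,k) * X(:,k) * X(:,k)';
end
norm(A - B)                   % close to zero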
7.3 The Characteristic Polynomial of a Matrix
If λ is an eigenvalue of the matrix A ∈ Cn×n , there exists a non-zero eigenvector x ∈ Cn such that Ax = λx. Therefore, the homogeneous linear system (λIn − A)x = 0n
(7.7)
has a non-trivial solution. This is possible if and only if det(λIn −A) = 0, so eigenvalues are the solutions of the equation det(λIn − A) = 0. Note that det(λIn −A) is a polynomial of degree n in λ, known as the characteristic polynomial of the matrix A. We denote this polynomial by pA .
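In MATLAB, the coefficients of the characteristic polynomial are returned by poly, and its roots coincide with the eigenvalues; the sketch below uses the matrix that appears in Example 7.5 later in this chapter.

A = [1 1 1; 0 1 2; 2 1 0];
c = poly(A)                   % approximately [1 -2 -3 0]
sort(roots(c))                % -1, 0, 3
sort(eig(A))                  % the same values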
Example 7.3. Let
\[ A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \]
be a matrix in C^{3×3}. Its characteristic polynomial is
\[ p_A(\lambda) = \begin{vmatrix} \lambda - a_{11} & -a_{12} & -a_{13} \\ -a_{21} & \lambda - a_{22} & -a_{23} \\ -a_{31} & -a_{32} & \lambda - a_{33} \end{vmatrix} = \lambda^3 - (a_{11} + a_{22} + a_{33})\lambda^2 + (a_{11}a_{22} + a_{22}a_{33} + a_{33}a_{11} - a_{12}a_{21} - a_{23}a_{32} - a_{13}a_{31})\lambda - (a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{32}a_{21} - a_{12}a_{21}a_{33} - a_{23}a_{32}a_{11} - a_{13}a_{31}a_{22}). \]
Theorem 7.5. Let A ∈ C^{n×n}. Then $\mathrm{spec}(A) = \mathrm{spec}(A^{\top})$ and $\mathrm{spec}(A^{\mathsf H}) = \{\bar{\lambda} \mid \lambda \in \mathrm{spec}(A)\}$.
Proof.
We have
\[ p_{A^{\top}}(\lambda) = \det(\lambda I_n - A^{\top}) = \det((\lambda I_n - A)^{\top}) = \det(\lambda I_n - A) = p_A(\lambda). \]
Definition 7.2. Let λ be an eigenvalue of a matrix A ∈ Cn×n . A left eigenvector of the matrix A is a vector v ∈ Cn − {0} such that v H A = λv H . We could have used the obvious term right eigenvectors for the eigenvectors of a matrix A. Since in the vast majority of cases we deal with right eigenvectors, we prefer to use the simpler term eigenvectors for the right eigenvectors. Note that if λ is an eigenvalue of A ∈ Cn×n , the set of solutions of the homogeneous system (λIn −A)v = 0 is non-empty and consists of
486
Linear Algebra Tools for Data Mining (Second Edition)
the non-zero invariant space corresponding to the eigenvalue λ. This is also equivalent to saying that rank(λIn − A) < n. Since rank(λIn − A) = rank(λIn − AH ), it follows that the linear system (λIn − AH )v = 0n has a non-zero solution, so AH v = λv, which is equivalent to v H A = λv H . Thus, every eigenvalue has both an eigenvector and a left eigenvector. Equality of spectra of A and A does not imply that the eigenvectors or the invariant subspaces of the corresponding eigenvalues are identical, as it can be seen from the following example. Example 7.4. Consider the matrix a A= c
A ∈ C2×2 defined by 0 , b
where a = b and c = 0. It is immediate that spec(A) = spec(A ) = {a, b}. For λ1 = a, we have the distinct invariant subspaces: a−b SA,a = k k ∈ C , c 1 SA ,a = k k ∈ C , 0 as the reader can easily verify. If λ1 , λ2 are two distinct eigenvalues of A, u is a right eigenvector that corresponds to λ1 , and v is a left eigenvector that corresponds to λ2 , then u ⊥ v. Indeed, we have λ1 v H u = vH Au = λ2 v H u, so v H u = 0. The leading term of the characteristic polynomial of A is generated by (λ − a11 )(λ − a22 ) · · · (λ − ann ) and equals λn . The fundamental theorem of algebra implies that pA has n complex roots, not necessarily distinct. Observe also that, if A is a matrix with real entries, the roots are paired as conjugate complex numbers. Definition 7.3. The algebraic multiplicity of an eigenvalue λ of a matrix A ∈ Cn×n , algm(A, λ), equals k if λ is a root of order k of the equation pA (λ) = 0. If algm(A, λ) = 1, we refer to λ as a simple eigenvalue.
Eigenvalues
487
Example 7.5. Let A ∈ R3×3 be the matrix ⎛ ⎞ 1 1 1 ⎜ ⎟ A = ⎝0 1 2⎠. 2 1 0 The characteristic polynomial of A is λ − 1 −1 −1 pA (λ) = 0 λ − 1 −2 = λ3 − 2λ2 − 3λ. −2 −1 λ Therefore, the eigenvalues of A are 3, 0, and −1. The eigenvalues of I3 are obtained from the equation λ − 1 0 0 det(λI3 − I3 ) = 0 λ − 1 0 = (λ − 1)3 = 0. 0 0 λ − 1 Thus, I3 has one eigenvalue, 1, and algm(I3 , 1) = 3. Example 7.6. Let P (a) ∈ Cn×n be the matrix ⎛ ⎞ a 1 ··· 1 ⎜1 a · · · 1⎟ ⎜ ⎟ P (a) = ⎜ .. .. . ⎟. ⎝ . . · · · .. ⎠ 1 1 ··· a
To find the eigenvalues of P (a), we need to solve the equation λ − a −1 · · · −1 −1 λ − a · · · −1 .. .. = 0. .. . . ··· . −1 −1 · · · λ − a By adding the first n − 1 columns to the last and factoring out λ − (a + n − 1), we obtain the equivalent equation λ − a −1 · · · 1 −1 λ − a · · · 1 = 0. (λ − (a + n − 1)) . . . .. · · · .. .. −1 −1 · · · 1
488
Linear Algebra Tools for Data Mining (Second Edition)
Adding the last column from the first n − 1 columns and expanding the determinant yields the equation (λ − (a + n − 1))(λ − a + 1)n−1 = 0, which allows us to conclude that P (a) has the eigenvalue a + n − 1 with algm(P (a), a + n − 1) = 1 and the eigenvalue a − 1 with algm(P (a), a − 1) = n − 1. In the special case when a = 1, we have P (1) = Jn,n . Thus, Jn,n has the eigenvalue λ1 = n with algebraic multiplicity 1 and the eigenvalue 0 with algebraic multiplicity n − 1. Definition 7.4. A matrix A ∈ Cn×n is simple if there exists a linearly independent set of n eigenvectors. If A ∈ Cn×n has n distinct eigenvalues, then, by Theorem 7.1, A is a simple matrix. The reverse of this statement is false because there exist simple matrices for which not all eigenvalues are distinct. For example, spec(In ) = {1}, but {e1 , . . . , en } is a linearly independent set of distinct eigenvectors. Theorem 7.6. Let A ∈ Rn×n be a matrix and let λ ∈ spec(A). Then, for any k ∈ P, λk ∈ spec(Ak ). Proof. The proof is by induction on k 1. The base step, k = 1, is immediate. Suppose that λk ∈ spec(Ak ), that is Ak x = λk x for some x ∈ V − {0}. Then Ak+1 x = A(Ak x) = A(λk x) = λk Ax = λk+1 x, so λk+1 ∈ spec(Ak+1 ). Theorem 7.7. Let A ∈ Rn×n be a non-singular matrix and let λ ∈ spec(A). We have λ1 ∈ spec(A−1 ) and the sets of eigenvectors of A and A−1 are equal. Proof. Since λ ∈ spec(A) and A is non-singular, we have λ = 0 and Ax = λx for some x ∈ V − {0}. Therefore, we have A−1 (Ax) = λA−1 x, which is equivalent to λ−1 x = A−1 x, which implies λ1 ∈ spec(A−1 ). In addition, this implies that the set of eigenvectors of A and A−1 are identical. Theorem 7.8. Let pA (λ) = λn +c1 λn−1 +· · ·+cn−1 λ+cn be the characteristic polynomial of the matrix A. Then we have ci = (−1)i Si (A) for 1 i n, where Si (A) is the sum of all principal minors of order i of A.
Eigenvalues
489
Proof. Since pA (λ) = λn + c1 λn−1 + · · · + cn−1 λ + cn , it is easy to see that the derivatives of pA (λ) are given by (1)
pA (λ) = nλn−1 + (n − 1)c1 λn−2 + · · · + cn−1 , (2)
pA (λ) = n(n − 1)λn−2 + (n − 1)(n − 2)c1 λn−3 + · · · + 2cn−2 , .. . (k)
pA (λ) = n(n − 1) · · · (n − k + 1)λn−k + · · · + k!cn−k , .. . (n)
pA (λ) = n!c0 . This implies (k)
cn−k = k!pA (0) for 0 k n. On the other hand, the derivatives of pA (λ) can be computed using Theorem 5.11, and taking into account the formula given in Exercise 17 of Chapter 5, we have 1 (−1)k k!Sn−k (A) = (−1)n−k Sn−k (A), k! which implies the statement of the theorem. cn−k =
By Vi´ete’s Theorem, taking into account Theorem 7.8, we have λ1 + · · · + λn = a11 + a22 + · · · + ann = trace(A) = −c1 .
(7.8)
Another interesting fact that follows immediately from Theorem 7.8 is λ1 · · · λn = det(A). Theorem 7.9. nomial in C[x],
(7.9)
Let p(λ) = λn + a1 λn−1 + · · · + an−1 λ + an be a polywhere n 2. Then the matrix Ap ∈ Cn×n defined by ⎛
0 1 0 ⎜ 0 0 1 ⎜ Ap = ⎜ .. .. ⎜ .. ⎝ . . . −an −an−1 −an−2 has p as its characteristic polynomial.
⎞ 0 0 ⎟ ⎟ ⎟ .. ⎟ ··· . ⎠ · · · −a1 ··· ···
490
Linear Algebra Tools for Data Mining (Second Edition)
Proof. The proof is by induction on n 2. For the base case, n = 2, we have 0 −1 A= , −a2 −a1 the characteristic polynomial of A is λ 1 = λ2 + a1 λ + a2 , pA (λ) = a2 λ + a1 as claimed. Suppose now that the statement holds for polynomials of degree less than n and let p be a polynomial of degree n. Define q as q(λ) = λn−1 + a1 λn−2 + · · · + an−1 . By the inductive hypothesis, the characteristic matrix ⎛ 0 1 0 ··· ⎜ 0 0 1 ··· ⎜ Aq = ⎜ .. .. ⎜ .. ⎝ . . . ··· −an−1 −an−2 −an−3 · · ·
polynomial of the ⎞ 0 0 ⎟ ⎟ ⎟ .. ⎟ . ⎠ −a1
is q(λ). Observe that det(λIn − Ap ) = λ det(Aq ) + an , as follows by expanding det(λIn − Ap ) by its first column. Thus, the characteristic polynomial of Ap is pAp = λpAq (λ) + an = λq(λ) + an = p(λ), which concludes the argument.
The matrix Ap will be referred to as the companion matrix of the polynomial p. Theorem 7.10. Let A ∈ Cm×n and B ∈ Cn×m be two matrices. Then the set of non-zero eigenvalues of the matrices AB ∈ Cm×m and BA ∈ Cn×n are the same and algm(AB, λ) = algm(BA, λ) for each such eigenvalue.
Eigenvalues
Proof.
Consider the following straightforward equalities: λIm A λIm − AB Om,n Im −A = , On,m λIn B In −λB λIn λIm A −λIm −A −Im Om,n = . B In On,m λIn − BA −B λIn
Observe that
det
491
Im
−A
On,m
λIn
λIm
A
B
In
= det
−Im
Om,n
−B
λIn
λIm
A
B
In
,
and therefore, −λIm −A λIm − AB Om,n = det det . On,m λIn − BA −λB λIn The last equality amounts to λn pAB (λ) = λm pBA (λ). Thus, for λ = 0, we have pAB (λ) = pBA (λ), which gives the desired conclusion. Corollary 7.4. Let
⎞ a1 ⎜.⎟ ⎟ a=⎜ ⎝ .. ⎠ an ⎛
be a vector in Cn − {0}. Then, the matrix aaH ∈ Cn×n has one eigenvalue distinct from 0, and this eigenvalue is equal to a2 . Proof. By Theorem 7.10, the matrix aaH has the same non-zero eigenvalues as the matrix aH a ∈ C1×1 and the single eigenvalue of aH a is aH a = a2 . Theorem 7.11. Let A ∈ C(m+n)×(m+n) be a matrix partitioned as B C , A= On,m D where B ∈ Cm×m , C ∈ Cm×n , and D ∈ Cn×n . Then spec(A) = spec(B) ∪ spec(D).
492
Linear Algebra Tools for Data Mining (Second Edition)
Proof. Let λ ∈ spec(A) and let x ∈ Cm+n be an eigenvector that corresponds to λ. If u x= , v where u ∈ Cm and v ∈ Cn , then we have B C u Bu + Cv u Ax = = =λ . On,m D v Dv v This implies Bu + Cv = λu and Dv = λv. If v = 0, then λ ∈ spec(D); otherwise, Bu = λu, which yields λ ∈ spec(B), so λ ∈ spec(B) ∪ spec(D). Thus, spec(A) ⊆ spec(B) ∪ spec(D). To prove the converse inclusion, note that if λ ∈ spec(B) and u is an eigenvector of λ, then Bu = λu, which means that u u A =λ , 0 0 so spec(B) ⊆ spec(A). Similarly, spec(D) ⊆ spec(A), which implies the equality of the theorem. 7.4
Spectra of Hermitian Matrices
Theorem 7.12. All eigenvalues of a Hermitian matrix A ∈ Cn×n are real numbers. All eigenvalues of a skew-Hermitian matrix are purely imaginary numbers. Proof. Theorem 3.18 implies that xH x is a real number for every x ∈ Cn . Then, by Equality (7.2), λ is a real number. Suppose now that B is a skew-Hermitian matrix. Then, as above, xH Ax = −xH Ax, which implies that the real part of xH Ax is 0. Thus, xH Ax is a purely imaginary number and, by the same Equality (7.3), λ is a purely imaginary number.
Eigenvalues
493
Corollary 7.5. If A ∈ Rn×n and A is a symmetric matrix, then all its eigenvalues are real numbers. Proof. This statement follows from Theorem 7.12 by observing that the Hermitian adjoint AH of a matrix A ∈ Rn×n coincides with its transposed matrix A . Example 7.7. Let A ∈ Rn×n be a symmetric and orthogonal matrix. By Corollary 7.5, all its eigenvalues are real numbers. Let λ ∈ spec(A) and let x be an eigenvector that corresponds to λ. Since Ax = λx, we have x A = λx , hence λ2 x x = (Ax ) Axx A Ax = x x. Therefore, λ2 = 1, hence λ ∈ {−1, 1}. Corollary 7.6. Let A ∈ Cm×n be a matrix. The non-zero eigenvalues of the matrices AAH and AH A are positive numbers and they have the same algebraic multiplicities for the matrices AAH and AH A. Proof. By Theorem 7.10, we need to verify only that if λ is a nonzero eigenvalue of AH A, then λ is a positive number. Since AH A is a Hermitian matrix, by Theorem 7.12, λ is a real number. The equality AH Ax = λx for some eigenvector x = 0 implies λx22 = λxH x = (Ax)H Ax = Ax22 , so λ > 0.
Corollary 7.7. Let A ∈ Cm×n be a matrix. The eigenvalues of the matrix B = AH A ∈ Cn×n are real non-negative numbers. Proof. The matrix B defined above is clearly Hermitian and, therefore, its eigenvalues are real numbers by Theorem 7.12. Next, if λ is an eigenvalue of B, then by Equality (7.2), we have λ=
(Ax)H Ax Ax xH AH Ax = = 0, H H xx xx x
where x is an eigenvector that corresponds to λ.
494
Linear Algebra Tools for Data Mining (Second Edition)
Note that if A is a Hermitian matrix, then AH A = A2 , hence the spectrum of AH A is {λ2 | λ ∈ spec(A)}. Theorem 7.13. If A ∈ Cn×n is a Hermitian matrix and u, v are two eigenvectors that correspond to two distinct eigenvalues λ1 and λ2 , then u ⊥ v. Proof. We have Au = λ1 u and Av = λ2 v. This allows us to write v H Au = λ1 v H u. Since A is Hermitian, we have λ1 v H u = v H Au = vH AH u = (Av)H u = λ2 v H u, which implies v H u = 0, that is, u ⊥ v.
Theorem 7.14 (Ky Fan’s Theorem). Let A ∈ Cn×n be a Hermitian matrix such that spec(A) = {λ1 , . . . , λn }, where λ1 · · · λn and let V = (v 1 · · · v n ) be the matrix whose columns consist of the corresponding unit eigenvectors of A. n Let {x1 , . . . , xn } be an orthonormal q in C . For q set of vectors any positive integer q n, the sums i=1 λi and i=1 λn+1−i are, respectively, the maximum and minimum of qj=1 xj Axj . Namely, the maximum is obtained by choosing the vectors x1 , . . . , xq as the first q columns of V ; the minimum is obtained by assigning to x1 , . . . , xq the last q columns of V. n i p i Proof. Let xi = p=1 bp v be the expressions of x relative to the basis V for 1 i m. In matrix form, these equalities can be written as ⎛ 1 ⎞ b1 · · · bn1 ⎜. .⎟ ⎟ (x1 . . . xn ) = (v 1 . . . v n ) ⎜ ⎝ .. · · · .. ⎠, b1n · · · bnn where bpi = (v p ) xi = (xi ) v p . Thus, for X = (x1 . . . xn ) and V = (v 1 . . . v n ), we have X = V B, where B is the orthonormal matrix ⎞ ⎛ 1 b1 · · · bn1 ⎜. .⎟ ⎟ B=⎜ ⎝ .. · · · .. ⎠. b1n · · · bnn
Eigenvalues
495
We have
xj Axj = xj Abjp v p = bjp xj Av p = bjp (xj ) λp v p = (bjp )2 λp q n n j 2 j 2 y = λq (bp ) + (λp − λq )(bp ) + (λj − λq )(bjp )2 . p=1
p=1
j=q+1
By Inequality (6.26), this implies (xj ) Axj λq +
q (λp − λq )(bjp )2 . p=1
Therefore, q i=1
λi −
q j=1
⎞ ⎛ q q (xj ) Axj (λi − λq ) ⎝1 − (bji )2 ⎠. i=1
(7.10)
j=1
q j 2 2 Again, by Inequality (6.26), we have j=1 (bi ) xi = 1, so q q j 2 0. The left member of Inequali=1 (λi − λq ) 1 − j=1 (bi ) i i ity (7.10) becomes 0, when x = v , so qj=1 (xj ) Axj qi=1 λi . The maximum of qj=1 (xj ) Axj is obtained when xj = v j for 1 j q, that is, when X consists of the first q columns of V that correspond to eigenvectors of the top k largest eigenvalues. The argument for the minimum is similar. An equivalent form of Ky Fan’s Theorem can be obtained by observing that the orthonormality condition of the set {x1 , . . . , xq } where X ∈ Cn×q is the matrix can be expressed as X X = Iq , X = (x1 · · · xq ). Also, the sum qj=1 xj Axj equals trace(X AX). Thus, q Theorem is equivalent to the fact that the sums n Ky Fan’s λ and i=1 λn+1−i are, respectively, the maximum and mini=1 i imum of trace(X AX), where X X = Iq . In this form, Ky Fan’s Theorem is useful for the discussion of principal component analysis in Chapter 13.
496
7.5
Linear Algebra Tools for Data Mining (Second Edition)
Spectra of Special Matrices
In this section, we examine the spectra of special classes of matrices. We begin with spectra of block upper triangular and block lower triangular matrices. Theorem 7.15. Let A be a block upper triangular partitioned matrix given by ⎞ ⎛ A11 A12 · · · A1m ⎟ ⎜ O A 22 · · · A2m ⎟ ⎜ ⎟ ⎜ A=⎜ . .. .. ⎟, . ··· . ⎠ ⎝ .. O O · · · Amm where Aii ∈ Rpi ×pi for 1 i m. Then, spec(A) = ni=1 spec(Aii ). If A is a block lower triangular matrix ⎞ ⎛ A11 O · · · O ⎜ A21 A22 · · · O ⎟ ⎟ ⎜ A = ⎜ .. .. .. ⎟, ⎝ . . ··· . ⎠ Am1 Am2 · · · Amm
the same equality holds. Proof. Let A be a block upper triangular matrix. Its characteristic equation is det(λIn − A) = 0. Observe that the matrix λIn − A is also a block upper triangular matrix: ⎛ ⎞ λIp1 − A11 O ··· O ⎜ −A21 ⎟ λIp2 − A22 · · · O ⎜ ⎟ λIn − A = ⎜ ⎟. .. .. .. ⎝ ⎠ . . ··· . −Am1
−Am2
· · · λIpm − Amm
By Theorem 5.13, the characteristic polynomial of A can be written as m m det(λIpi − Aii ) = pAii (λ). pA (λ) = i=1
n
i=1
Therefore, spec(A) = i=1 spec(Aii ). The argument for block lower triangular matrices is similar.
Eigenvalues
Corollary 7.8. Let A ∈ Rn×n be a ⎛ A11 O ⎜ O A22 ⎜ A = ⎜ .. .. ⎝ . . O O
497
block diagonal matrix given by ⎞ ··· O ··· O ⎟ ⎟ .. ⎟, ··· . ⎠ · · · Amm
for 1 i m. We have spec(A) = ni=1 spec(Aii ) where Aii ∈ Rni ×ni and algm(A, λ) = m i=1 algm(Ai , λ). Moreover, v = 0n is an eigenvector of A if and only if we can write ⎛ ⎞ v1 ⎜ .. ⎟ v = ⎝ . ⎠, vm where each vector v i is either an eigenvector of Ai or 0ni for 1 i m, and there exists i such that v i = 0ni . Proof.
This statement follows immediately from Theorem 7.15.
Theorem 7.16. Let A = (aji ) ∈ Cn×n be an upper (lower) triangular matrix. Then, spec(A) = {aii | 1 i n}. Proof. It is easy to see that the characteristic polynomial of A is pA (λ) = (λ−a11 ) · · · (λ−ann ), which implies immediately the theorem.
Corollary 7.9. If A ∈ Cn×n is an upper triangular matrix and λ is an eigenvalue such that the diagonal entries of A that equal λ occur i in aii11 , . . . , aipp , then SA,λ is a p-dimensional subspace of Cn generated by ei1 , . . . , eip . Proof.
This statement is immediate.
Corollary 7.10. We have spec(diag(d1 , . . . , dn )) = {d1 , . . . , dn }. Proof.
This statement is a direct consequence of Theorem 7.16.
Theorem 7.17. If A spec(A) = {0}.
∈
Cn×n is a nilpotent matrix, then
498
Linear Algebra Tools for Data Mining (Second Edition)
Proof. Let A ∈ Cn×n be a nilpotent matrix such that nilp(A) = k. By Theorem 7.6, if λ ∈ spec(A), then λk ∈ spec(Ak ) = spec(O) = {0}. Thus, λ = 0. Theorem 7.18. If A ∈ Cn×n is an idempotent matrix, then spec(A) ⊆ {0, 1}. Proof. Let A ∈ Cn×n be an idempotent matrix, λ be an eigenvalue of A, and let x be an eigenvector of λ. We have P 2 x = P x = λx; on the other hand, P 2 x = P (P x) = P (λx) = λP (x) = λ2 x, so λ2 = λ, which means that λ ∈ {0, 1}. Definition 7.5. Let A ∈ Cp×p . The spectral radius of A is the number ρ(A) = max{|λ| | λ ∈ spec(A)}. The next statement shows that for a stochastic matrix, the spectral radius is equal to 1. Theorem 7.19. At least one eigenvalue of a stochastic matrix is equal to 1 and all eigenvalues lie on or inside the unit circle. Proof. Let A ∈ Rn×n be a stochastic matrix. Then 1 ∈ spec(A) and 1 is an eigenvector that corresponds to the eigenvalue 1 as the reader can easily verify. If λ is an eigenvalue of A and Ax = λx, then λxi = αji xj for 1 n n, which implies |λ||xi | aji |xj |. Since x = 0n , let xp be a component of x such that |xp | = max{|xi | | 1 i n}. Choosing i = p, we have |λ|
n i=1
aji
n
|xj | j ai = 1, |xp i=1
which shows that all eigenvalues of A lie on or inside the unit circle.
Theorem 7.20. All eigenvalues of a unitary matrix are located on the unit circle.
Eigenvalues
499
Proof. Let A ∈ Rn×n be a unitary matrix and let λ be an eigenvalue of A. By Theorem 6.24, if x is an eigenvector that corresponds to λ, we have x = Ax = λx = |λ|x, which implies |λ| = 1. 7.6
Geometry of Eigenvalues
Any matrix norm of a matrix A ∈ Cn×n provides an upper bound for the absolute value of any eigenvalue. Indeed, if μ is a matrix norm, and (λ, x) is an eigenpair, then |λ|μ(x) = (μ(Ax) μ(A)μ(x), so |λ| μ(A). The next result allows finding a more precise location of the eigenvalues of a square matrix A in the complex plane. n×n be a Theorem 7.21 (Gershgorin’s theorem). Let A ∈ R {|aij | | 1 j n and j = i}, for square matrix and let ri = 1 i n. Then we have spec(A) ⊆
n
{z ∈ C | |z − aii | ri }.
i=1
Proof. Let λ ∈ spec(A) and let us suppose that Ax = λx, where x = 0. nLet p be such that |xp | = max{|xi | |1n i n}. Then j=1 apj xj = λxp , which is equivalent to j=1,j=p apj xj = (λ − app )xp . This, in turn, implies n n apj xj |apj ||xj | |xp ||λ − app | = j=1,j=p j=1,j=p |xp |
n
|apj | = |xp |rp .
j=1,j=p
Therefore, |λ − app | rp for some p. This yields the desired conclusion.
500
Linear Algebra Tools for Data Mining (Second Edition)
Definition 7.6. Let A ∈ Rn×n be a square matrix and let ri = {aij | 1 j n and j = i}, for 1 i n. A disk of the form Di (A) = {z ∈ C | |z − aii | ri } is called a Gershgorin disk. By Theorem 3.17, if A ∈ Cn×n , then there exists a Hermitian matrix HA and a skew-Hermitian matrix SA such that A = HA + SA . Let μA = A∞ = max{|aij | 1 i, j n} be the generalized ∞norm of A and let μHA and μSA be the generalized ∞-norms of HA and SA , respectively. Let λ1 , . . . , λn be the eigenvalues of A, where |λ1 | · · · |λn |. The real eigenvalues of HA and SA are denoted as η1 , . . . , ηn and σ1 , . . . , σn , where η1 · · · ηn and σ1 · · · σn . These notations are needed for the next two theorems. Theorem 7.22 (Hirsch’s first theorem). If A ∈ Cn×n , then we have λk nμA , (λk ) nμHA , and (λk ) nμSA for 1 k n. Proof. Let x be a unit eigenvector that corresponds to an eigenvalue λk . We have Ax = λk x, which implies (Ax, x) = λk and ¯ k . Thus, we obtain (AH x, x) = (x, Ax) = (x, λk x) = λ ¯k λk + λ 2 (Ax, x) + (AH x, x) = 2 H A+A x, x = 2
(λk ) =
= HA x, and ¯k λk − λ 2i (Ax, x) − (AH x, x) = 2i H A−A x, x = 2i
(λk ) =
= SA x.
Eigenvalues
501
Since λk = (Ax, x), we have n n n n aij xi x ¯j |aij ||xi ||¯ xj | |λk | = i=1 j=1
μA
i=1 j=1
n n
|xi ||¯ xj | = μ A
i=1 j=1
n
2 |xi |
.
i=1
By Inequality (6.7), we have |λk | nμA .
A related theorem is next. Theorem 7.23 (Hirsch’s second theorem). If A ∈ Cn×n and HA is a real matrix, then n(n − 1) . | (λk )| μSA 2 is real, then (aij + aji ) = 0 for 1 i, j n. Proof. If HA = A+A 2 z−¯ z Since (z) = 2i , it follows that H
¯ij − a ¯ji = 0, aij + aji − a so aij − a ¯ji = −(aji − a ¯ij ) for 1 i, j n. Therefore, by the proof of Theorem 7.22, λ − λ k ¯k | (λk )| = 2i (Ax, x) − (AH x, x) = 2i A − AH x, x = 2i ⎛ ⎞ n n aij − a ¯ji ⎠ ⎝ xj x = ¯i 2i i=1
j=1
a −a ¯ji xj x ¯ i − xi x ¯j ij · 2 i i |λ2 | · · · |λn |. Define a sequence of vectors x0 , x1 , . . . as follows. The initial vector x0 ∈ Cn is any unit vector that is not orthogonal on any eigenvector corresponding to λ1 . Then xk+1 is the unit vector given by xk+1 =
Axk Axk
for k ∈ N. The unit vector xk can be written as xk =
Ak x0 Ak x0
Linear Algebra Tools for Data Mining (Second Edition)
504
for k 1. Indeed, in the base case (k = 1), this equality clearly holds. Suppose that it holds for k. We have k
xk+1
A x0 A A kx Ak+1 x0 Axk 0 = = . = k A x0 Axk Ak+1 x0 A A kx 0
We claim that limk→∞ xk = v 1 . By Theorem 7.1, the set {v 1 , . . . , v n } is linearly independent and this allows us to write x0 = a1 v 1 + · · · + an vn . By the assumption made concerning x0 (as not being orthogonal on any eigenvector corresponding to λ1 ), we have a1 = 0. Thus, Ax0 = a1 Av 1 + · · · + an Av n = a1 λ1 v 1 + · · · + an λn v n . A straightforward induction argument on k 1 shows that Ak x0 = a1 λk1 v 1 + · · · + an λkn v n k k a a λ λn 2 2 n = a1 λk1 v1 + v2 + · · · + vn . a1 λ 1 a1 λ1 |λj | |λ1 |
< 1 for 2 j n, it follows that a2 λ2 k an λn k Ak x0 = lim v 1 + v2 + · · · + vn = v1 , lim k→∞ a1 λk k→∞ a1 λ 1 a1 λ1 1
Since
and the speed of convergence of implies
Ak x 0 a1 λk1
to v 1 is determined by
λ2 λ1 .
This
Ak x0 = v 1 = 1. k→∞ |a1 λk 1| lim
Therefore, Ak x0 Ak x0 |a1 λk1 | Ak x0 · = v1. = lim = lim k k→∞ Ak x0 k→∞ |a1 λk k→∞ |a1 λk 1 | A x0 1|
lim xk = lim
k→∞
Since the limit of the sequence x0 , . . . , xk , . . . is v 1 , we have a method to approximatively determine a unit vector corresponding to the dominating eigenvalue λ1 of A.
Eigenvalues
505
Example 7.8. Let us apply the iteration method to the symmetric matrix ⎛ ⎞ 3 2 1 ⎜ ⎟ A = ⎝2 4 2⎠, 1 2 6 beginning with the vector x0 = 13 . The following MATLAB code computes ten vectors of the sequence (x1 , x2 , . . .): x = ones(3,1); sequence(:,1) = x; for i=1:10 x = A*x/norm(A*x); sequence(:,i+1)=x; end sequence
and prints the sequence as 1.0000 0.4015 0.3844 0.3772 0.3740 0.3726 0.3719 0.3716 0.3715 0.3714 0.3714 1.0000 0.5789 0.5678 0.5621 0.5594 0.5581 0.5576 0.5573 0.5572 0.5571 0.5571 1.0000 0.7097 0.7279 0.7361 0.7398 0.7414 0.7422 0.7425 0.7427 0.7427 0.7428
The last vector in the sequence is very close to an eigenvector that corresponds to the eigenvalue λ = 8. A variant of the power method considered in [113] replaces the scaling factor Axk by m(Axk ), where m(v) is the first component of v that has the largest absolute value. For example, if v = (−2, −4, 3, −4), then m(v) = v2 = −4. We have m(av) = a m(v) for every a ∈ R. The power method is limited to the dominant eigenvalue. However, if a close approximative value is known for an eigenvalue λi , the power method can be applied to compute an eigenvector associated to this eigenvalue. Observe that if |λi − a| < |λj − a| for every λj ∈ spec(A) − {λi }, then λi1−a is the dominant eigenvalue of the matrix B = (A − aIn )−1 . The justification of the algorithm follows from the next statement. Theorem 7.26. Let A ∈ Cn×n and let a be a number such that a ∈ spec(A). Then v is an eigenvector of A that corresponds to the eigenvalue λ if and only if v is an eigenvector of the matrix 1 ∈ spec(B). B = (A − aI)−1 that corresponds to the eigenvalue λ−a
Linear Algebra Tools for Data Mining (Second Edition)
506
Proof. If v is an eigenvector for A that corresponds to the eigenvalue λ, then Av = λv, so (A − aI)v = (λ − a)v. Since a ∈ spec(A), 1 v = (A − aI)−1 v, which proves the matrix A − aI is invertible, so λ−a 1 . that v is an eigenvector of B that corresponds to the eigenvalue λ−a The reverse implication is immediate. By Theorem 7.26, if a is a good approximation of λ, the power method applied to B will yield an eigenvector v of A that corresponds to λ. This technique is known as the inverse power method. 7.9
The QR Iterative Algorithm
Another technique for computing eigenvalues is the QR iterative algorithm. The approach involves decomposing a matrix A ∈ Rn×n into a product A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix. If A ∈ Cn×n is a complex matrix, then we seek Q as a unitary matrix. A sequence of matrices (A0 , A1 , . . .) is defined, where A0 = A ∈ n×n R . Begin by computing a QR factorization of A0 as A0 = Q0 R0 . If Ai is factored as Ai = Qi Ri , define Ai+1 as Ai+1 = Ri Qi for i ∈ N. Let Pi = Q0 · · · Qi and Si = Ri · · · R0 for i ∈ N. Clearly, Pi is an orthonormal matrix and Si is an upper triangular matrix for every i ∈ N. We claim that Ai+1 = Pi APi .
(7.11)
Indeed, for i = 0, we have P0 AP0 = Q0 Q0 R0 Q0 = R0 Q0 = Q1 . Suppose that the equality holds for i. Then APi+1 = Qi+1 Pi APi Qi+1 = Qi+1 Ai+1 Qi+1 Pi+1 = Qi+1 Qi+1 Ri+1 Qi+1 = Ai+2 ,
which concludes the proof of the claim. Therefore, each of the matrices Ai is orthonormally similar to A and spec(Ai ) = spec(A) for i ∈ N. Equality (7.11) can also be written as Pi Ai+1 = APi for i ∈ N.
(7.12)
Eigenvalues
507
Note that Pi Ri = Pi−1 Qi Ri = Pi−1 Ai = APi−1 , taking into account Equality (7.12). This implies that the subspace CPi ,q generated by the first q columns of Pi equals CAPi−1 ,q . If limn→∞ Pn exists and it equals P , since Pn−1 Qn = Pn , it follows that limn→∞ Qn = I. Therefore, since An = Qn Rn , we have limn→∞ An = limn→∞ Rn = R. The matrix R is an upper triangular matrix, as the limit of a sequence of upper triangular matrices, has non-negative entries on its main diagonal, which equal the eigenvalues of the matrix A. Therefore, a necessary condition for the algorithm to work is that all eigenvalues of A be real, non-negative numbers. Thus, the value of this initial version of the QR algorithm is limited. In practical terms, it is desirable to replace the initial matrix A with its upper Hessenberg equivalent H1 (which is a tridiagonal matrix when A is symmetric). As we saw in Supplement 72 of Chapter 6, if we factor H1 = Q1 R1 such that R1 is non-singular, Q1 and R1 Q1 are both upper Hessenberg matrices, so the next matrix H2 = R1 Q1 is again an upper Hessenberg matrix, and the process continues in the same manner as in the initial QR algorithm. Other improvements of the QR involve the double shifting of the matrices Hn by a multiple of In . Thus, instead of factoring Hk , one factors H − aIn , where a is an approximation of a real eigenvalue, Hk − aIn = Qk Rk . Then, the next matrix is Hk+1 = Rk Qk + aIn . Therefore, Hk+1 = Rk Qk + aIn = Qk (Hk − aIn )Qk + aIn = Qk Hk Qk for k 1. Thus, spec(Hk+1 ) = spec(Hk ). 7.10
MATLAB
Computations
To compute the eigenvalues of a matrix A ∈ Cn×n , we can use the command eig(A), which returns an n-dimensional vector containing the eigenvalues of A. The variant [V,D] = eig(A) produces a diagonal matrix D of eigenvalues and a full matrix V ∈ Cn×n whose columns are the corresponding eigenvectors so that AV = V D.
508
Linear Algebra Tools for Data Mining (Second Edition)
A faster computation of eigenvalues is obtained using the eigs. If A is a large, sparse, and square matrix, eigs(A) returns a vector that consists of the six largest magnitude eigenvalues. The function call [V,D] = eigs(A) returns a diagonal matrix D that contains six eigenvalues of A having the largest absolute values and a matrix V whose columns are the corresponding eigenvectors. If a parameter flag is added, [V,D,flag] = eigs(A), and flag is 0, then all the eigenvalues converge; otherwise not all converge. The call eigs(A,k) computes the k largest magnitude eigenvalues. Various options can be set by specifying additional parameters. For instance, if A is a symmetric matrix, then the call [V,D] = eigs(A,k,’SA’)
will compute the k smallest eigenvalues of A, and return a diagonal matrix D that contains the eigenvalues and a matrix V that contains the corresponding eigenvectors. Exercises and Supplements (1) Prove that a matrix A ∈ Cn×n is non-singular if and only if 0 ∈ spec(A). (2) Prove that spec(Jn,n ) = {0, n} and that algm(A, 0) = n − 1. (3) Determine the spectrum of the matrix A = aIn + bJn,n . (4) Let X, Y ∈ Cn×n be two non-singular matrices. Prove that λ is an eigenvalue of one of the matrices XY −1 , Y X −1 if and only if λ−1 is an eigenvalue of the other. (5) Let A ∈ Cn×n be a matrix and let a, b ∈ C such that a = 0. Prove that spec(aA + bIn ) = {aλ + b | λ ∈ A}. Solution: Let B = aA + bIn . For the characteristic polynomial pB , we can write pB (λ) = det(λIn − B) = det(λIn − (aA + bIn )) λ−b n In − A , = det((λ − b)In − aA) = a det a which shows that spec(B) has the desired form. (6) Prove that the non-zero eigenvalues of a real skew-symmetric matrix are purely imaginary numbers.
Eigenvalues
509
Solution: Let λ be a non-zero eigenvalue of a real skewsymmetric matrix A ∈ Rn×n . Since Ax = λx for some x ∈ Rn , it follows that A2 x = λ2 x. By Supplement 115 of Chapter 6, we have λ2 0. (7) Let S be a k-dimensional invariant subspace for a matrix A ∈ Cn×n and let the columns of the matrix X ∈ Cn×k form a basis for S. Prove that there is a unique matrix C ∈ Ck×k such that AX = XC and that spec(C) ⊆ spec(A). Solution: Suppose that X = (x1 · · · xk ). Axi is a unique linear combination of the columns of X because Axi ∈ S, so there exists a unique vector ci such that Axi = Xci for 1 i k. The matrix C = (c1 · · · ck ) is the matrix we are seeking. If λ ∈ spec(C), then Cu = λu for some vector u, which implies A(Xu) = λ(Xu). Thus, spec(C) ⊆ spec(A). (8) Let sin α cos α sinh t cosh t and B = . A= − cos α sin α cosh t sinh t Prove that spec(A) = {sin α} and spec(B) = {sinh t + cosh t, sinh t − cosh t}. (9) Let a11 a12 , A= a21 a22 be a matrix (that is, a12 = a21 ) in R2×2 . Prove that there exists a rotation matrix cos θ sin θ R(θ) = , − sin θ cos θ such that R(−θ)AR(θ) is a diagonal matrix having the eigenvalue λ1 and λ2 as its diagonal elements. (10) Let A ∈ R2×2 be a matrix and let cos α x= ∈ R2 . sin α As α varies between 0 and 2π, x rotates around the origin. Prove that Ax moves on an ellipse centered in the origin. Compute α such that the vectors x and Ax are collinear.
510
Linear Algebra Tools for Data Mining (Second Edition)
(11) Let A ∈ Cn×n be a matrix such that spec(A) = {λ1 , . . . , λn }. Prove that spec(A + cIn ) = {λ1 + c, . . . , λn + c}. (12) Let A ∈ Rn×n be a matrix. Prove that if A has a dominating eigenvalue λ1 , then λ1 is a real number. (13) Prove that if A ∈ Cn×n has the spectrum spec(A) = {λ1 , . . . , λn }, then spec(A − aIn ) = {λ1 − a, . . . , λn − a}. (14) Let C(A) be the Cayley transform of the matrix A ∈ Cn×n introduced in Exercise 65 of Chapter 3. (a) Prove that if λ ∈ spec(C(A)), then 1−λ 1+λ ∈ spec(A). (b) Prove that if C(A) exists, then −1 ∈ spec(C(A)). (15) Let A ∈ Cm×m and B ∈ Cn×n . Prove that trace(A ⊗ B) = trace(A)trace(B) and det(A ⊗ B) = det(A)n det(B)m . (16) Let A ∈ Cn×n be a Hermitian matrix such that spec(A) = {λ1 , . . . , λn } consists of n distinct eigenvalues and let v 1 , . . . , v n be n unit eigenvectors that correspond to the eigenvalues λ1 , . . . , λn , respectively. is a unit vector and ci = wH ui for 1 i n, (a) If w ∈ Cn prove that ni=1 c2i = 1. (b) Prove that |λi − w H Aw| maxj {|λi − λj |w − ui 22 . (17) Prove that the matrix A ∈ Cn×n has a set of pairwise orthonormal eigenvectors {u1 , . . . , un } if and only if A is a normal matrix. (18) Prove that the set of simple matrices is dense in the metric space (Cn×n , d), where d(A, B) = |||A − B|||2 for A, B ∈ Cn×n . (19) Let A ∈ Cm×n and B ∈ Cn×m . Prove the following: (a) the matrices
On,m On,n AB Om,m and B 0n,m B BA
are similar; (b) if m n, then BA ∈ Cn×n has the same eigenvalues as AB ∈ Cm×m together with n − m zero eigenvalues; (c) if m = n and at least one of the matrices A or B is nonsingular, then AB and BA are similar. Solution: For the first part, note that the matrix Im A M= On,m In
Eigenvalues
511
has all its eigenvalues equal to 1 and, therefore, is non-singular. It is easy to see that Im −A −1 M = On,m In and that M
−1
AB Om,m On,m On,n , M= B 0n,m B BA
which proves the desired similarity. The last two parts are left to the reader. A commuting family of matrices is a collection of matrices M ⊆ Cn×n such that for each A, B ∈ M, we have AB = BA. An invariant subspace for a commuting family of matrices M ⊆ Cn×n is a subspace W of Cn that is invariant for each matrix in M. (20) Let A ∈ Cn×n and let S be an invariant subspace of A with dim(S) 1. Prove that S contains an eigenvector of A. Solution: Suppose that dim(S) = m 1 and let v 1 , . . . , v m ∈ be a set of vectors that form a basis of S. Let V = (v 1 v 2 · · · v m ) ∈ Cn×m be the matrix whose columns are these vectors. Note that rank(V ) = m so the nullspace of V consists of 0m . Then we have Av i ∈ S for every i, 1 i m, because S is an invariant subspace of A. Thus, we have AV = V U for some U ∈ Cm×m . If x = 0m is an eigenvector of U , we have U x = λx, hence V U x = λV x, or AV x = λV x. Thus, V x is an eigenvector of A that is contained in S. (21) Prove that if M is a commuting family of matrices in Cn×n , then there exists x ∈ Cn such that x is an eigenvector of every A ∈ M. Cn
Solution: Note that Cn is an invariant subspace for each matrix in M. Thus, we can assume that there exists an invariant subspace W for F such that dim(W ) is minimal and positive. We claim that every w ∈ W − {0n } is an eigenvector of F. Suppose that this is not the case, so not every non-zero vector in W is an eigenvector of A. Since W is M-invariant, it is also A-invariant and, by Supplement 7.10, there exists an eigenvector of A in W such that x = 0n and Ax = λx.
512
Linear Algebra Tools for Data Mining (Second Edition)
Let W0 be the subspace W0 = {y ∈ W | Ay = λy}. Clearly, x ∈ W0 . Since we assumed that not every non-zero vector in W is an eigenvector of A, W0 = W , so dim(W0 ) < dim(W ). If B ∈ M, x ∈ W0 implies Bx ∈ W because W0 ⊆ W and W is M-invariant. Since M is a commuting family, we have A(Bx) = (AB)x = (BA)x = B(Ax) = B(λx) = λ(Bx), hence Bx ∈ W0 . Therefore, W0 is M-invariant, which results in a contradiction, since we assumed that dim(W ) is minimal. Certain spectral properties of Hermitian matrices can be extended to matrices that are self-adjoint with respect to an inner product f : Cn × Cn −→ R. These extensions are discussed next. (22) Prove that a matrix A ∈ Cn×n that is self-adjoint relative to an inner product f : Cn × Cn −→ R has real eigenvalues. Furthermore, if λ, μ ∈ spec(A) are two distinct eigenvalues of A and u, v are eigenvectors of A that correspond to λ and μ, respectively, then u and v are orthogonal relative to f , that is, f (u, v) = 0. (23) Let A ∈ Cn×n be a matrix such that ni=1 aij = 1 for every j, 1 j n and let x ∈ Cn be a eigenvector such that 1 x = 0. Prove that the eigenvalue that corresponds to x is 1. (24) Let u, v ∈ Cn . Prove that the matrix uv H + vuH has at most two non-zero eigenvalues. H u n×2 and B = ∈ C2×n . By Solution: Let A = (v u) ∈ C vH Theorem 7.10, the set of non-zero eigenvalues of the matrices AB = vuH + uv H ∈ Cn×n and BA ∈ C2×2 are the same and algm(AB, λ) = algm(BA, λ) for each such eigenvalue. (25) Let A ∈ Cn×n be a matrix such that ni=1 |aij | 1 for every j, 1 j n. Prove that for every eigenvalue λ of A we have |λ| 1. (26) For any matrix A ∈ Cn×m , prove that there exists c ∈ C such that the matrices A + cIn and A − cIn are invertible. Conclude that every matrix is the sum of two invertible matrices. (27) Let A, B, and E be three matrices in Cn×n such that B = A + E and let μ ∈ spec(B) − spec(A). Prove that if Q ∈ Cn×n is an
Eigenvalues
513
invertible matrix, then Q−1 (A − μIn )Q Q−1 EQ. Solution: Since μ ∈ spec(B) − spec(A), the matrix B − μIn is singular, while A − μIn is nonsingular. We have Q−1 (B − μIn )Q = Q−1 (A − μIn + E)Q = Q−1 (A − μIn )Q[In + Q−1 (A − μIn )−1 Q] × (Q−1 EQ). Since B − μIn is singular, the matrix [In + Q−1 (A − μIn )−1 Q](Q−1 EQ) must be singular, which implies [Q−1 (A − μIn )−1 Q](Q−1 EQ) 1, so Q−1 (A − μIn )−1 QQ−1 EQ 1. This implies the desired inequality. (28) Let v be an eigenvector of the matrix A ∈ Cn×n and let 0 aH , B= a A
(29)
(30)
(31)
(32)
Cn . Prove
that there exists an eigenvector u of B x such that u = for some x ∈ C if and only if aH v = 0. v Let B ∈ Rn×n be a matrix and let A = diag(b11 , . . . , bnn ). If μ is an eigenvalue of B, prove that there exists a number bii such that |μ − bii | E. Let A, B be two matrices in Cn×n . Prove that for every > 0, there exists δ > 0 such that if |aij − bij | < δ and λ ∈ spec(A), then there exists θ ∈ spec(B) such that |λ − θ| < . Let A ∈ Cm×m and B ∈ Cn×n be two matrices. Prove that trace(A ⊗ B) = trace(A)trace(B) and det(A ⊗ B) = det(A)n det(B)m . Let A ∈ Cn×n be a matrix whose eigenvalues are λ1 , . . . , λn . k k Define the matrix A[k] = A ⊗ A ⊗ · · · ⊗ A ∈ Cn ×n which is the k th power of A in the sense of the Kronecker product. Prove that the eigenvalues of A[k] have the form λi1 λi2 · · · λik , where 1 i1 , . . . , ik n. where a ∈
514
Linear Algebra Tools for Data Mining (Second Edition)
(33) Let A ∈ Cn×n be a matrix such that spec(A) = {λ1 , . . . , λn }. Prove that det(In − A) =
n
n (1 − λi ) and det(In + A) = (1 + λi ).
i=1
i=1
(34) Let P ∈ Cn×n be a projection matrix. Prove that (a) if rank(P ) = r, then algm(P, 1) = r and algm(P, 0) = n − r; (b) we have rank(P ) = trace(P ). (35) Let A ∈ Rn×n be a matrix such that λ 0 for every λ ∈ spec(A). Prove that trace(A) n . det(A) n (36) Let A ∈ Cm×n and B = Cn×m . Prove that λn pAB (λ) = λm pBA (λ). (37) Let φ : Rn −→ R be a quadratic form defined by φ(x) = x Ax for x ∈ Rn , where A ∈ Rn×n is a symmetric matrix. Prove that if the unit vector x is an extreme for φ, then x is an eigenvector for A. Solution: Since x is a unit vector, we have x x − 1 = 0, so the Lagrangian of this problem is Λ(x, λ) = x Ax + λ(x x − 1). We have (∇Λ)(x) = 2Ax + 2λx = 0m , which shows that x must be a unit eigenvector of A. (38) Let Pφ be a permutation matrix, where φ ∈ PERMn is an ncyclic permutation. If z is a root of order n of 1 (that is, z n = 1), prove that (1, z, z 2 , . . . , z n−1 ) is an eigenvector of Pφ . Further, prove that the eigenvalues of the matrix Pφ + Pφ−1 have the form 2 cos 2πk n for 1 k n. (39) Let A ∈ Rn×n be a matrix. A vector x is said to be subharmonic for A if x 0n and Ax ax for some a ∈ R. Prove that if x is an eigenvector of A, then abs(x) is a subharmonic vector for abs(A).
Eigenvalues
515
(40) Let A ∈ Rn×n be a matrix that can be written as A = ri=1 v i v i , where v i ∈ Rn − {0n } and the vectors v 1 , . . . , v r are pairwise orthogonal. Prove that (a) A is a symmetric matrix of rank r; (b) the eigenpairs of A that involve non-zero eigenvalues are ( v i , v i ) for 1 i r. n×n be a Hermitian matrix and let r (41) Let A = n ∈ C max{ i=1 |aij | | 1 i n}. Prove that spec(A) ⊆ [−r, r]. 2 2 (42) Let Knn ∈ Rn ×n be the commutation matrix introduced in Section 3.17. Prove that spec(Knn ) = {−1, 1}, algm(Knn , 1) = n(n+1) , and algm(Knn , −1) = n(n−1) . Also, show 2 2 that det(Knn ) = (−1)
n(n−1) 2
.
Solution: Since the real matrix Knn is orthogonal and symmetric, spec(Knn ) = {−1, 1} by Example 7.7. Therefore, algm(Knn , 1) + algm(Knn , −1) = n2 and det(Knn ) = (−1)algm(Knn ,−1) . On the other hand, by Part (e) of Supplement 96, trace(Knn ) = n = algm(Knn , 1) − algm(Knn , −1) = n2 − 2algm(A, −1). Thereand algm(Knn , 1) = n(n+1) . fore, algm(A, −1) = n(n−1) 2 2 Since det(Knn ) equals the product of eigenvalues, the last equality follows immediately. Bibliographical Comments Lanczos decomposition presented in Supplement 40 was obtained from [98]; the elementary solution given was developed in [148]. The result discussed in Supplement 7.10 was obtained from [8].
This page intentionally left blank
Chapter 8
Similarity and Spectra
8.1
Introduction
This chapter presents the links between spectral properties of matrices and conditions that ensure the existence of diagonal matrices which are equivalent to matrices that enjoy these properties. Further spectral properties are discussed such as the relationships between the geometric and algebraic multiplicities of eigenvalues. The standard Jordan form of matrices is given in the context of λ-matrices and the link between matrix norms and eigenvalues is also examined. 8.2
Diagonalizable Matrices
Diagonalizable matrices were defined in Section 3.12 as square matrices that are similar to diagonal matrices. In the context of spectral theory, we have the following characterization of diagonalizable matrices. Theorem 8.1. A matrix A ∈ Cn×n is diagonalizable if and only if there exists a linearly independent set {v 1 , . . . , v n } of n eigenvectors of A. Proof. Let A ∈ Cn×n be such that there exists a set {v 1 , . . . , v n } of n eigenvectors of A that is linearly independent and let P be the
matrix (v_1 v_2 · · · v_n), which is clearly invertible. We have
P^{−1}AP = P^{−1}(Av_1 Av_2 · · · Av_n) = P^{−1}(λ_1 v_1 λ_2 v_2 · · · λ_n v_n) = P^{−1}P diag(λ_1, λ_2, . . . , λ_n) = diag(λ_1, λ_2, . . . , λ_n).
Therefore, we have A = PDP^{−1}, where D = diag(λ_1, λ_2, . . . , λ_n),
so A ∼ D. Conversely, suppose that A is diagonalizable, so AP = PD, where D is a diagonal matrix and P is an invertible matrix, and let v_1, . . . , v_n be the columns of the matrix P. We have Av_i = d_{ii}v_i for 1 ≤ i ≤ n, so each v_i is an eigenvector of A. Since P is invertible, its columns are linearly independent.
Theorem 8.1 can be restated by saying that a matrix is diagonalizable if and only if it is non-defective.
Corollary 8.1. If A ∈ C^{n×n} is diagonalizable, then the columns of any matrix P such that D = P^{−1}AP is a diagonal matrix are eigenvectors of A. Furthermore, the diagonal entries of D are the eigenvalues that correspond to the columns of P.
Proof. This statement follows from the proof of Theorem 8.1.
Corollary 8.2. If A ∈ C^{n×n} is a matrix such that Σ_{λ∈spec(A)} geomm(A, λ) = n, then A is diagonalizable.
Proof. Suppose that spec(A) = {λ_1, . . . , λ_k} and, for 1 ≤ j ≤ k, let B_j = {v_1^j, . . . , v_{p_j}^j} be a basis of the invariant subspace S_{A,λ_j}, where Σ_{j=1}^k p_j = n. Then B = B_1 ∪ · · · ∪ B_k is a linearly independent set of eigenvectors, so A is diagonalizable.
Corollary 8.3. If the eigenvalues of the matrix A ∈ C^{n×n} are distinct, then A is diagonalizable.
Proof. By Theorem 7.1, the set of eigenvectors of A is linearly independent. The statement follows immediately from Theorem 8.1.
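The proof of Theorem 8.1 is constructive: the matrix P whose columns are eigenvectors diagonalizes A whenever those columns are linearly independent. The following NumPy sketch (an illustration added here, not part of the original text; the matrix A is arbitrary) carries out this check numerically.

import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
evals, P = np.linalg.eig(A)              # columns of P are eigenvectors of A
# A is diagonalizable iff its n eigenvectors are linearly independent,
# that is, iff P has rank n (is invertible).
assert np.linalg.matrix_rank(P) == A.shape[0]
D = np.linalg.inv(P) @ A @ P             # P^{-1} A P
print(np.allclose(D, np.diag(evals)))    # True: A = P D P^{-1}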
Theorem 8.2. Let A ∈ C^{n×n} be a block diagonal matrix, A = diag(A_{11}, A_{22}, . . . , A_{mm}). A is diagonalizable if and only if every matrix A_{ii} is diagonalizable for 1 ≤ i ≤ m.
Proof. Suppose that A is a block diagonal matrix which is diagonalizable. Furthermore, suppose that A_{ii} ∈ C^{n_i×n_i} and Σ_{i=1}^m n_i = n. There exists an invertible matrix P ∈ C^{n×n} such that P^{−1}AP is a diagonal matrix D = diag(λ_1, . . . , λ_n). Let p_1, . . . , p_n be the columns of P, which are eigenvectors of A. Each vector p_i is divided into m blocks p_i^j with 1 ≤ j ≤ m, where p_i^j ∈ C^{n_j}. Thus, P can be written as a block matrix whose block rows are M^j = (p_1^j p_2^j · · · p_n^j) ∈ C^{n_j×n} for 1 ≤ j ≤ m. The equality Ap_i = λ_i p_i can be expressed blockwise as A_{jj} p_i^j = λ_i p_i^j for 1 ≤ j ≤ m. We claim that rank(M^j) = n_j. Indeed, if rank(M^j) were less than n_j, we would have fewer than n linearly independent rows among the matrices M^1, . . . , M^m. This, however, would imply that the
rank of P is less than n, which contradicts the invertibility of P. Since there are n_j linearly independent eigenvectors of A_{jj}, it follows that each block A_{jj} is diagonalizable.
Conversely, suppose that each A_{jj} is diagonalizable, that is, there exists an invertible matrix Q_j such that Q_j^{−1} A_{jj} Q_j is a diagonal matrix. Then, it is immediate to verify that the block diagonal matrix Q = diag(Q_1, Q_2, . . . , Q_m) is invertible and Q^{−1}AQ is a diagonal matrix.
Definition 8.1. Two matrices A, B ∈ Cn×n are simultaneously diagonalizable if there exists a matrix T ∈ Cn×n such that A = T DT −1 and B = T ET −1 , where D and E are diagonal matrices in Cn×n . Lemma 8.1. If C ∈ Cn×n is a diagonalizable matrix and T ∈ Cn×n is an invertible matrix, then T CT −1 is also diagonalizable. Proof. Since C is diagonalizable, there exists a diagonal matrix D and an invertible matrix S such that SCS −1 = D, so C = S −1 DS. Then, T CT −1 = TS−1 DST −1 = (TS−1 )D(T S −1 )−1 . Therefore, (TS−1 )−1 (T CT −1 )(TS−1 ) = D, which shows that T CT −1 is diagonalizable.
Theorem 8.3. Let A, B ∈ Cn×n be two diagonalizable matrices. Then A and B are simultaneously diagonalizable if and only if AB = BA. Proof. Suppose that A and B are simultaneously diagonalizable, so A = TDT−1 and B = TET−1 , where D and E are diagonal matrices. Then AB = TDT−1 TET−1 = TDET−1 and BA = TET−1 TDT−1 = TEDT−1 . Since any two diagonal matrices commute, we have AB = BA.
Conversely, suppose that AB = BA. Since A is diagonalizable, there exists a matrix T such that
TAT^{−1} = diag(λ_1 I_{k_1}, λ_2 I_{k_2}, . . . , λ_m I_{k_m}),
where k_i is the multiplicity of λ_i for 1 ≤ i ≤ m. It is easy to see that if A and B commute, then TAT^{−1} and TBT^{−1} also commute. If we write TBT^{−1} as a block matrix TBT^{−1} = (B_{pq}), where B_{pq} ∈ C^{k_p×k_q} for 1 ≤ p, q ≤ m, then the fact that the matrices TAT^{−1} and TBT^{−1} commute translates into the equalities λ_i B_{ij} = λ_j B_{ij} for 1 ≤ i, j ≤ m. Thus, if λ_i ≠ λ_j we have B_{ij} = O_{k_i,k_j}, which shows that the matrix TBT^{−1} is a block diagonal matrix, TBT^{−1} = diag(B_{11}, B_{22}, . . . , B_{mm}).
Since B is diagonalizable, it follows that TBT^{−1} is diagonalizable (by Lemma 8.1), so each matrix B_{jj} is diagonalizable. Let W_i be a matrix such that W_i^{−1} B_{ii} W_i is diagonal and let W be the block diagonal matrix W = diag(W_1, W_2, . . . , W_m). Then, both W^{−1}TAT^{−1}W and W^{−1}TBT^{−1}W are diagonal matrices, and this implies that A and B are simultaneously diagonalizable.
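Theorem 8.3 can be checked numerically. In the sketch below (an illustration, not taken from the text), two commuting diagonalizable matrices are built from a common eigenvector matrix; the eigenvectors of A then diagonalize B as well.

import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((4, 4))              # invertible with probability 1
D = np.diag([1.0, 2.0, 3.0, 4.0])
E = np.diag([5.0, -1.0, 0.5, 2.0])
A = T @ D @ np.linalg.inv(T)
B = T @ E @ np.linalg.inv(T)

print(np.allclose(A @ B, B @ A))             # True: A and B commute
_, V = np.linalg.eig(A)                      # eigenvectors of A
# Since the eigenvalues of A are distinct, V^{-1} B V must be diagonal.
M = np.linalg.inv(V) @ B @ V
print(np.allclose(M, np.diag(np.diag(M))))   # True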
It is interesting to observe that if A ∈ C^{n×n} is a diagonalizable matrix, then there exists a matrix B ∈ C^{n×n} such that B² = A. Indeed, suppose that A ∼ D, where D is a diagonal matrix and, therefore, A = PDP^{−1}, where P is an invertible matrix and D = diag(d_1, . . . , d_n). Consider the matrix E ∈ C^{n×n} given by E = diag(√d_1, . . . , √d_n), for which we have E² = D, and define B = PEP^{−1}. Then, we have B² = PEP^{−1}PEP^{−1} = PE²P^{−1} = PDP^{−1} = A. We refer to B as a square root of A. Note that for the existence of the square root of a diagonalizable matrix it is essential that we deal with matrices with complex entries. Under certain conditions, square roots exist also for matrices with real entries, as we shall see in Section 8.12.
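The construction of a square root of a diagonalizable matrix described above is easy to carry out numerically; the sketch below (illustrative, with an arbitrarily chosen A) follows it literally.

import numpy as np

A = np.array([[5.0, 4.0],
              [1.0, 2.0]])                  # diagonalizable, eigenvalues 6 and 1
d, P = np.linalg.eig(A)                     # A = P D P^{-1}
E = np.diag(np.sqrt(d.astype(complex)))     # E^2 = D (complex sqrt in general)
B = P @ E @ np.linalg.inv(P)                # B = P E P^{-1}
print(np.allclose(B @ B, A))                # True: B is a square root of A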
8.3
Matrix Similarity and Spectra
Theorem 8.4. If A, B ∈ Cn×n and A ∼ B, then the two matrices have the same characteristic polynomials and, therefore, spec(A) = spec(B). Proof. Since A ∼ B, there exists an invertible matrix X such that A = XBX−1 . Then, the characteristic polynomial det(A − λIn ) can be rewritten as det(A − λIn ) = det(XBX−1 − λXIn X −1 ) = det(X(B − λIn )X −1 ) = det(X) det(B − λIn ) det(X −1 ) = det(B − λIn ), which implies spec(A) = spec(B).
Theorem 8.5. If A, B ∈ Cn×n and A ∼ B, then trace(A) = trace(B). Proof. Since the two matrices are similar, they have the same characteristic polynomials, so both trace(A) and trace(B) equal −c1 , where c1 is the coefficient of λn−1 in both pA (λ) and pB (λ).
Theorem 8.6. If A ∼u B, where A, B ∈ C^{n×n}, then the Frobenius norms of these matrices are equal, that is, ‖A‖_F = ‖B‖_F.
Proof. We have shown in Theorem 3.43 that A^H A ∼u B^H B. Therefore, these matrices have the same characteristic polynomials, which allows us to infer that trace(A^H A) = trace(B^H B). By the analogue of Equality (6.11) for complex matrices, we obtain the desired conclusion.
Theorem 8.7. Let A ∈ C^{n×n} and B ∈ C^{k×k} be two matrices. If there exists a matrix U ∈ C^{n×k} having an orthonormal set of columns such that AU = UB, then there exists V ∈ C^{n×(n−k)} such that (U V) ∈ C^{n×n} is a unitary matrix and
(U V)^H A (U V) = \begin{pmatrix} B & U^H A V \\ O & V^H A V \end{pmatrix}.
Proof. Since U has an orthonormal set of columns, by Corollary 6.27, there exists V ∈ C^{n×(n−k)} such that (U V) is a unitary matrix. We have U^H AU = U^H UB = I_k B = B and V^H AU = V^H UB = OB = O, which allows us to write
(U V)^H A (U V) = (U V)^H (AU AV) = \begin{pmatrix} U^H AU & U^H AV \\ V^H AU & V^H AV \end{pmatrix} = \begin{pmatrix} B & U^H AV \\ O & V^H AV \end{pmatrix}.
Corollary 8.4. Let A ∈ C^{n×n} be a Hermitian matrix and B ∈ C^{k×k} be a matrix. If there exists a matrix U ∈ C^{n×k} having an orthonormal set of columns such that AU = UB, then there exists V ∈ C^{n×(n−k)} such that (U V) is a unitary matrix and
(U V)^H A (U V) = \begin{pmatrix} B & O \\ O & V^H A V \end{pmatrix}.
Proof. Since A is Hermitian, we have U H AV = U H AH V = (V H AU )H = O, which, by Theorem 8.7 produces the desired result.
Corollary 8.5. Let A ∈ C^{n×n}, λ be an eigenvalue of A, and let u be an eigenvector of A with ‖u‖ = 1 that corresponds to λ. There exists V ∈ C^{n×(n−1)} such that (u V) ∈ C^{n×n} is a unitary matrix and
(u V)^H A (u V) = \begin{pmatrix} λ & u^H A V \\ 0 & V^H A V \end{pmatrix}.
If A is a Hermitian matrix, then
(u V)^H A (u V) = \begin{pmatrix} λ & 0 \\ 0 & V^H A V \end{pmatrix}.
Proof. This statement follows from Theorem 8.7 by taking k = 1.
Theorem 8.8 (Schur’s triangularization theorem). Let A ∈ Cn×n be a square matrix. There exists a unitary matrix U ∈ Cn×n and an upper-triangular matrix T ∈ Cn×n such that A = U H T U and the diagonal elements of T are the eigenvalues of A. Moreover, each eigenvalue λ occurs in the sequence of diagonal values a number of algm(A, λ) times. Proof. The argument is by induction on n 1. The base case, n = 1, is trivial. So, suppose that the statement is true for matrices in C(n−1)×(n−1) . Let λ1 ∈ C be an eigenvalue of A, and let u be an eigenvector that corresponds to this eigenvalue. By Corollary 8.5, we have
U_1^H A U_1 = \begin{pmatrix} λ_1 & u^H A V \\ 0 & V^H A V \end{pmatrix},
where U_1 = (u | V) is a unitary matrix. By the inductive hypothesis, since V^H AV ∈ C^{(n−1)×(n−1)}, there exists a unitary matrix S ∈ C^{(n−1)×(n−1)} such that V^H AV = S^H W S, where W is an upper-triangular matrix. Then, for the unitary matrix U = U_1 \begin{pmatrix} 1 & 0 \\ 0 & S^H \end{pmatrix}, we have
U^H A U = \begin{pmatrix} λ_1 & u^H A V S^H \\ 0 & S V^H A V S^H \end{pmatrix} = \begin{pmatrix} λ_1 & u^H A V S^H \\ 0 & W \end{pmatrix},
which shows that an upper triangular matrix T that is unitarily similar to A can be defined as
T = \begin{pmatrix} λ_1 & u^H A V S^H \\ 0 & W \end{pmatrix}.
Since T ∼u A, it follows that the two matrices have the same characteristic polynomials and, therefore, the same spectra and algebraic multiplicities for each eigenvalue.
Corollary 8.6. If A ∈ R^{n×n} is a matrix such that spec(A) = {0}, then A is nilpotent.
Proof. By Schur's Triangularization Theorem, A is unitarily similar to a strictly upper triangular matrix, A = U^H TU, so A^n = U^H T^n U. By Supplement 30 of Chapter 3, we have T^n = O, so A^n = O.
Example 8.1. Let A ∈ R^{3×3} be the symmetric matrix
A = \begin{pmatrix} 14 & −10 & −2 \\ −10 & −5 & 5 \\ −2 & 5 & 11 \end{pmatrix}
whose characteristic polynomial is p_A(λ) = λ³ − 20λ² − 100λ + 2000. The eigenvalues of A are λ_1 = 20, λ_2 = 10, and λ_3 = −10. It is easy to see that
v_1 = (−2, 1, 1)^T, v_2 = (1, 0, 2)^T, v_3 = (−2, −5, 1)^T
are eigenvectors that correspond to the eigenvalues λ_1, λ_2, and λ_3, respectively. The corresponding unit vectors are
u_1 = (1/√6)(−2, 1, 1)^T, u_2 = (1/√5)(1, 0, 2)^T, u_3 = (1/√30)(−2, −5, 1)^T.
For U = (u_1 u_2 u_3), we have U^T AU = U^T(20u_1 10u_2 −10u_3) = diag(20, 10, −10).
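Example 8.1 can be verified directly; the following NumPy fragment (added here as an illustration) recomputes the spectrum and checks the orthogonal diagonalization.

import numpy as np

A = np.array([[14.0, -10.0, -2.0],
              [-10.0, -5.0,  5.0],
              [ -2.0,  5.0, 11.0]])
evals, U = np.linalg.eigh(A)                     # ascending order: -10, 10, 20
print(np.round(evals, 6))                        # [-10. 10. 20.]
print(np.allclose(U.T @ A @ U, np.diag(evals)))  # True: U^T A U is diagonal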
Corollary 8.7. Let A ∈ C^{n×n} and let f be a polynomial. If spec(A) = {λ_1, . . . , λ_n} (including multiplicities), then spec(f(A)) = {f(λ_1), . . . , f(λ_n)}.
Proof. By Schur's Triangularization Theorem, there exists a unitary matrix U ∈ C^{n×n} and an upper triangular matrix T ∈ C^{n×n} such that A = U^H TU and the diagonal elements of T are the eigenvalues of A, λ_1, . . . , λ_n. Therefore, U f(A) U^{−1} = f(T), and by Theorem 3.45, the diagonal elements of f(T) are f(λ_1), . . . , f(λ_n). Since f(A) ∼ f(T), we obtain the desired conclusion because two similar matrices have the same eigenvalues with the same algebraic multiplicities.
Let r, s ∈ N be two numbers such that 1 ≤ r, s ≤ n, r ≠ s, and let E_{rs} ∈ C^{n×n} be the matrix whose unique non-zero entry is e_{rs} = 1. In other words, E_{rs} is defined by (E_{rs})_{ij} = 1 if i = r and j = s, and (E_{rs})_{ij} = 0 otherwise, for 1 ≤ i, j ≤ n. Note that E_{rs} e_p = e_r if p = s and E_{rs} e_p = 0 otherwise, for 1 ≤ p ≤ n; similarly, e_q^T E_{rs} = e_s^T if q = r and e_q^T E_{rs} = 0^T otherwise, for 1 ≤ q ≤ n.
Lemma 8.2. The matrix I_n + cE_{rs} is nonsingular, and (I_n + cE_{rs})^{−1} = I_n − cE_{rs}. Furthermore, for every A ∈ C^{n×n}, if B = (I_n − cE_{rs})A(I_n + cE_{rs}), then B = A − cE_{rs}A + cAE_{rs} − c²a_{sr}E_{rs}.
Proof. Since r ≠ s, we have (E_{rs})² = O. Therefore, (I_n + cE_{rs})(I_n − cE_{rs}) = I_n, so (I_n + cE_{rs})^{−1} = I_n − cE_{rs}. Also, E_{rs}AE_{rs} is the matrix
that has a unique non-zero element, that is, Ers AErs = asr Ers . Since B = (In − cErs )A(In + cErs ), we have B = A − cErs A + cAErs − c2 Ers AErs = A − cErs A + cAErs − c2 asr Ers .
Lemma 8.3. If A ∈ C^{n×n} is an upper triangular matrix, 1 ≤ r < s ≤ n, and B = (I_n − cE_{rs})A(I_n + cE_{rs}), then b_{ij} ≠ a_{ij} only if 1 ≤ i ≤ r and j = s, or if i = r and s ≤ j ≤ n (see Figure 8.1, where the elements of B that differ from the corresponding elements of A are represented by thick lines), and b_{rs} = a_{rs} + c(a_{rr} − a_{ss}).
Proof. Since A is an upper triangular matrix and r < s, we have a_{sr} = 0, so by Lemma 8.2, B = A − cE_{rs}A + cAE_{rs}. Since A is an upper triangular matrix, the product E_{rs}A has non-zero entries only in its r-th row,
Fig. 8.1 Elements of B that differ from the corresponding elements of A.
where a_{ss}, . . . , a_{sn} occur in the r-th row of E_{rs}A (in columns s, . . . , n); similarly, the product AE_{rs} has non-zero entries only in its s-th column, where a_{1r}, . . . , a_{rr} occur in the s-th column (in rows 1, . . . , r). Thus, the elements of B that differ from those of A are located on the r-th row to the right of the s-th column and on the s-th column above the r-th row. Also, b_{rs} = a_{rs} + c(a_{rr} − a_{ss}).
Lemma 8.2 implies that if a_{rr} ≠ a_{ss}, by taking c = a_{rs}/(a_{ss} − a_{rr}) we have b_{rs} = 0.
Theorem 8.9. Let A ∈ C^{n×n} be a square matrix having k distinct eigenvalues λ_1, . . . , λ_k such that algm(A, λ_i) = n_i for 1 ≤ i ≤ k and Σ_{i=1}^k n_i = n. There exists a block-triangular matrix T ∈ C^{n×n} such that A ∼ T, T = diag(T_1, T_2, . . . , T_k), where each T_i ∈ C^{n_i×n_i} is an upper triangular matrix having all its diagonal elements equal to λ_i for 1 ≤ i ≤ k.
Proof. By Schur's Triangularization Theorem, there exists an upper-triangular matrix V ∈ C^{n×n} such that A ∼ V and the diagonal elements of V are the eigenvalues of A such that each eigenvalue λ_i occurs in the sequence of diagonal values a number of algm(A, λ_i) times. By Lemmas 8.2 and 8.3, we can construct a matrix T, similar to V and, therefore, similar to A, such that T is a block triangular matrix.
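The effect of the similarity in Lemma 8.3, together with the choice c = a_rs/(a_ss − a_rr) that annihilates the (r, s) entry, can be observed numerically. The sketch below (an illustration; the matrix is arbitrary) zeroes the (1, 3) entry of an upper triangular matrix with distinct diagonal elements.

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 4.0, 5.0],
              [0.0, 0.0, 6.0]])
r, s = 0, 2                                   # zero-based indices of the (1, 3) entry
E = np.zeros_like(A); E[r, s] = 1.0           # the matrix E_rs
c = A[r, s] / (A[s, s] - A[r, r])             # c = a_rs / (a_ss - a_rr)
B = (np.eye(3) - c * E) @ A @ (np.eye(3) + c * E)
print(np.round(B, 10))                        # B is similar to A and B[r, s] == 0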
To build the matrix T, we construct a sequence of matrices starting from V by zeroing elements situated above the main diagonal, outside the triangular blocks that correspond to each eigenvalue. This process must be conducted such that the creation of a new zero component will not disturb the already zeroed elements. This can be easily done by numbering the positions to be zeroed in such a manner that all numbers located at the right of and above the position numbered k are greater than k, as we show in Example 8.2.
Example 8.2. Let A ∈ C^{8×8} be a matrix whose distinct eigenvalues are λ_1, λ_2, and λ_3 such that algm(A, λ_1) = algm(A, λ_2) = 3 and algm(A, λ_3) = 2. In Figure 8.2, we show the order of zeroing the desired elements of the upper triangular matrix in order to construct a block triangular matrix.
Theorem 8.8 shows that a Schur factorization A = U^H TU exists for every complex matrix A. The next statement presents a property of real matrices that admit real Schur factorizations.
Theorem 8.10. Let A ∈ R^{n×n} be a real square matrix. If there exists an orthogonal matrix U ∈ R^{n×n} and an upper triangular matrix T ∈ R^{n×n} such that A = U^{−1}TU, that is, a real Schur factorization, then the eigenvalues of A are real numbers.
Proof. If the above factorization exists, we have T = UAU^{−1}. Thus, the eigenvalues of A are the diagonal components of T and, therefore, they are real numbers.
Fig. 8.2 Order of nullification of elements of a block triangular matrix.
Theorem 8.11. Let A ∈ R^{n×n} be a matrix having n real eigenvalues. Then A has a real Schur factorization, that is, A = UTU^{−1}, where U ∈ R^{n×n} is an orthogonal matrix and T ∈ R^{n×n} is an upper triangular matrix that has the eigenvalues of A on its diagonal.
Proof. The argument is a paraphrase of the argument of Theorem 8.8, taking into account that the eigenvalues are real numbers.
Corollary 8.8. If A ∈ R^{n×n} and A is a symmetric matrix, then A is orthogonally diagonalizable.
Proof. By Corollary 7.5, A has real eigenvalues. Theorem 8.11 implies the existence of a real Schur factorization of A, A = UTU^{−1}, where U is an orthogonal matrix and T is both upper triangular and symmetric and, therefore, a diagonal matrix.
Corollary 8.9. If A ∈ R^{n×n} is a symmetric matrix, then ‖A‖_F = (Σ_{i=1}^n λ_i²)^{1/2}, where spec(A) = {λ_1, . . . , λ_n}.
Proof. By Corollary 8.8, A ∼u D, where D = diag(λ_1, . . . , λ_n). Therefore, by Theorem 8.6 we have ‖A‖_F = ‖D‖_F = (Σ_{i=1}^n λ_i²)^{1/2}.
Theorem 8.12. If A ∈ R^{n×n} is a symmetric matrix, then geomm(A, λ) = algm(A, λ) for every eigenvalue λ ∈ spec(A).
Proof. By Corollary 8.8, A is orthonormally diagonalizable, that is, there exists an orthogonal matrix P such that A = PDP^T, where D is a diagonal matrix. Therefore, det(λI_n − A) = det(λI_n − D). In other words, the matrices λI_n − A and λI_n − D have the same characteristic polynomial. Thus, if λ ∈ spec(A) and algm(A, λ) = k, λ occurs in k diagonal positions of D, p_1, . . . , p_k. Thus, e_{p_1}, . . . , e_{p_k} are k linearly independent eigenvectors in the invariant subspace S_{D,λ}, which implies that the vectors Pe_{p_1}, . . . , Pe_{p_k} are k linearly independent eigenvectors in S_{A,λ}.
Next, we show that unitary diagonalizability is a characteristic property of normal matrices.
Theorem 8.13 (Spectral theorem for normal matrices). A matrix A ∈ C^{n×n} is normal if and only if there exists a unitary matrix U and a diagonal matrix D such that A = U^H DU, the columns of U^H are unit eigenvectors, and the diagonal elements of D are the eigenvalues of A that correspond to these eigenvectors.
Proof. Suppose that A is a normal matrix. By Schur’s triangularization theorem, there exists a unitary matrix U ∈ Cn×n and an upper triangular matrix T ∈ Cn×n such that A = U −1 T U . Thus, T = UAU−1 = UAUH and T H = UAH U H . Therefore, T H T = UAH U H U AU H = UAH AU H (because U is unitary) = UAAH U H (because A is normal) = UAUH U AH U H = T T H, so T is a normal matrix. By Theorem 3.19, T is a diagonal matrix, so D’s role is played by T . We leave the proof of the converse implication to the reader. Corollary 8.10. Let A ∈ Cn×n be a normal matrix and let A = U H DU be the factorization whose existence was established in Theorem 8.13. If U H = (u1 , . . . , un ) and D = diag(λ1 , . . . , λn ), where λ1 , . . . , λn are the eigenvalues of A, then each column vector ui is an eigenvector of A that corresponds to the eigenvalue λi for 1 i n. Proof.
We have
A = U^H DU = (u_1 · · · u_n) diag(λ_1, . . . , λ_n) (u_1 · · · u_n)^H = λ_1 u_1 u_1^H + · · · + λ_n u_n u_n^H.   (8.1)
Therefore, Aui = λi ui for 1 i n, which proves the statement. Equality (8.1) is referred to as spectral decomposition of the normal matrix A.
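The spectral decomposition (8.1) is easy to reproduce numerically for a Hermitian (hence normal) matrix; the following NumPy sketch is only an illustration and uses an arbitrary symmetric matrix.

import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])              # real symmetric, hence normal
evals, U = np.linalg.eigh(A)                 # columns of U are orthonormal eigenvectors
# Rebuild A as the sum of rank-one terms lambda_i * u_i u_i^H.
S = sum(lam * np.outer(u, u.conj()) for lam, u in zip(evals, U.T))
print(np.allclose(S, A))                     # True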
Theorem 8.14 (Spectral theorem for Hermitian matrices). If the matrix A ∈ C^{n×n} is Hermitian or skew-Hermitian, A can be written as A = U^H DU, where U is a unitary matrix and D is a diagonal matrix having the eigenvalues of A as its diagonal elements.
Proof. This statement follows from Theorem 8.13 because any Hermitian or skew-Hermitian matrix is normal.
Corollary 8.11. The rank of a Hermitian matrix is equal to the number of its non-zero eigenvalues.
Proof. The statement of the corollary obviously holds for any diagonal matrix. If A is a Hermitian matrix, by Theorems 3.44 and 8.14, we have rank(A) = rank(D), where D is a diagonal matrix having the eigenvalues of A as its diagonal elements. This implies the statement of the corollary.
Let A ∈ C^{n×n} be a Hermitian matrix of rank p. By Theorem 8.14, A can be written as A = U^H DU, where U is a unitary matrix and D = diag(λ_1, . . . , λ_p, 0, . . . , 0), where λ_1, . . . , λ_p are the non-zero eigenvalues of A. Thus, if W ∈ C^{p×n} is the matrix that consists of the first p rows of U, we can write
A = W^H diag(λ_1, . . . , λ_p) W = (u_1 . . . u_p) diag(λ_1, . . . , λ_p) (u_1 . . . u_p)^H = λ_1 u_1 u_1^H + · · · + λ_p u_p u_p^H.
Note that if A is not Hermitian, rank(A) may differ from the number of non-zero eigenvalues. For example, the matrix
A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}
has no non-zero eigenvalues. However, its rank is 1.
The spectral decomposition (8.1) of Hermitian matrices, A = λ_1 u_1 u_1^H + · · · + λ_n u_n u_n^H, allows us to extend functions of the form f : R −→ R to Hermitian matrices. Since the eigenvalues of a Hermitian matrix are real numbers, it makes sense to define f(A) as
f(A) = f(λ_1) u_1 u_1^H + · · · + f(λ_n) u_n u_n^H.
In particular, if A is positive semi-definite, we have λ_i ≥ 0 for 1 ≤ i ≤ n and we can define
√A = √λ_1 u_1 u_1^H + · · · + √λ_n u_n u_n^H.
Theorem 8.15. Let A, B ∈ C^{n×n} be two Hermitian matrices, where A is positive definite. There exists a matrix P such that P^H AP = I_n and P^H BP is a diagonal matrix.
Proof. By Cholesky's Decomposition Theorem (Theorem 6.57), A can be factored as A = R^H R, where R is an invertible matrix, so (R^H)^{−1} A R^{−1} = I_n. The matrix C = (R^H)^{−1} B R^{−1} is Hermitian and, therefore, it is unitarily diagonalizable, that is, there exists a unitary matrix U such that U^H CU is diagonal. Since U^H I_n U = I_n, we have P = R^{−1}U.
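Theorem 8.15 translates directly into a computation: a Cholesky factor of the positive definite matrix A and an eigendecomposition of the transformed B produce the matrix P. The NumPy sketch below (illustrative; A and B are arbitrary Hermitian matrices with A positive definite) follows the proof step by step.

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                   # Hermitian, positive definite
B = np.array([[1.0, 2.0],
              [2.0, -1.0]])                  # Hermitian
R = np.linalg.cholesky(A).T.conj()           # A = R^H R with R upper triangular
Rinv = np.linalg.inv(R)
C = Rinv.T.conj() @ B @ Rinv                 # C = (R^H)^{-1} B R^{-1} is Hermitian
_, Uc = np.linalg.eigh(C)                    # unitary Uc with Uc^H C Uc diagonal
P = Rinv @ Uc                                # the matrix P of Theorem 8.15
print(np.allclose(P.T.conj() @ A @ P, np.eye(2)))   # True
D = P.T.conj() @ B @ P
print(np.allclose(D, np.diag(np.diag(D))))          # True: P^H B P is diagonal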
534
Linear Algebra Tools for Data Mining (Second Edition)
Example 8.3. If A = diag(4, −1, 0, 0, 1), then I(A) = (2, 1, 2) and sig(A) = 1. Let A ∈ Cn×n be a Hermitian matrix. By Theorem 8.14, A can be written as A = U H DU , where U is a unitary matrix and D = diag(λ1 , . . . , λn ) is a diagonal matrix having the eigenvalues of A (which are real numbers) as its diagonal elements. Without loss of generality, we may assume that the positive eigenvalues of A are λ1 , . . . , λn+ , followed by the negative values λn+ +1 , . . . , λn+ +n− , and the zero eigenvalues λn+ +n− +1 , . . . , λn . Let θj be the numbers defined by ⎧ ⎪ λ if 1 j n+ , ⎪ ⎨ j θj = −λj if n+ + 1 j n+ + n− , ⎪ ⎪ ⎩1 if n + n + 1 j n +
−
for 1 j n. If T = diag(θ1 , . . . , θn ), then we can write D = T H GT , where G is a diagonal matrix, G = (g1 , . . . , gn ) defined by ⎧ ⎪ 1 if λj > 0, ⎪ ⎨ gj = −1 if λj < 0, ⎪ ⎪ ⎩0 if λ = 0, j
for 1 j n. This allows us to write A = U H DU = U H T H GT U = (T U )H G(T U ). The matrix T U is clearly nonsingular, so A ∼H G. The matrix G defined above is the inertia matrix of A and these definitions show that any Hermitian matrix is congruent to its inertia matrix. For a Hermitian matrix A ∈ Cn×n , let S+ (A) be the subspace of n C generated by n+ (A) orthonormal eigenvectors that correspond to the positive eigenvalues of A. Clearly, we have dim(S+ (A)) = n+ (A). This notation is used in the proof of the next theorem. Theorem 8.16 (Sylvester’s inertia theorem). Let A, B be two Hermitian matrices, A, B ∈ Cn×n . The matrices A and B are congruent if and only if I(A) = I(B). Proof. If I(A) = I(B), then we have A = S H GS and B = T H GT , where both S and T are nonsingular matrices. Since A ∼H G and B ∼H G, we have A ∼H B.
Similarity and Spectra
535
Conversely, suppose that A ∼H B, that is, A = S H BS, where S is a nonsingular matrix. We have rank(A) = rank(B), so n0 (A) = n0 (B). To prove that I(A) = I(B), it suffices to show that n+ (A) = n+ (B). Let m = n+ (A) and let v 1 , . . . , v m be m orthonormal eigenvectors of A that correspond to the m positive eigenvalues of this matrix, and let S+ (A) be the subspace generated by these vectors. If v ∈ S+ (A) − {0}, then we have v = a1 v 1 + · · · + am v m , so ⎛ v H Av = ⎝
m j=1
⎞H
⎛
aj v j ⎠ A ⎝
m j=1
⎞ aj v j ⎠ =
m
|aj |2 > 0.
j=1
Therefore, v^H S^H BS v > 0, so if y = Sv, then y^H By > 0, which means that y ∈ S_+(B). This shows that S_+(A) is isomorphic to a subspace of S_+(B), so n_+(A) ≤ n_+(B). The reverse inequality can be shown in the same manner, so n_+(A) = n_+(B).
We can add an interesting detail to the full-rank decomposition of a matrix.
Corollary 8.13. If A ∈ C^{m×n} and A = CR is the full-rank decomposition of A with rank(A) = k, C ∈ C^{m×k}, and R ∈ C^{k×n}, then C may be chosen to have orthogonal columns and R to have orthogonal rows.
Proof. Since the matrix A^H A ∈ C^{n×n} is Hermitian, by Theorem 8.14, there exists a matrix U ∈ C^{k×n} with orthonormal rows such that A^H A = U^H DU, where D ∈ C^{k×k} is a non-negative diagonal matrix. Let C = AU^H ∈ C^{m×k} and R = U. Clearly, CR = A, and R has orthogonal rows because the rows of U are orthonormal. Let c_p, c_q be two columns of C, where 1 ≤ p, q ≤ k and p ≠ q. Since c_p = Au_p and c_q = Au_q, where u_p, u_q are the corresponding columns of U^H, we have c_p^H c_q = u_p^H A^H A u_q = u_p^H U^H DU u_q = e_p^H De_q = 0, because p ≠ q.
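The inertia of Definition 8.2 is obtained by counting the signs of the eigenvalues and, by Sylvester's inertia theorem, it is unchanged by congruence. A small illustrative NumPy sketch (the matrices are arbitrary, and the tolerance is a numerical convenience):

import numpy as np

def inertia(A, tol=1e-10):
    evals = np.linalg.eigvalsh(A)               # real eigenvalues of a Hermitian matrix
    return (int(np.sum(evals > tol)),           # n_+(A)
            int(np.sum(evals < -tol)),          # n_-(A)
            int(np.sum(np.abs(evals) <= tol)))  # n_0(A)

A = np.diag([4.0, -1.0, 0.0, 0.0, 1.0])
print(inertia(A))                               # (2, 1, 2), as in Example 8.3
S = np.array([[1.0, 2.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0, 0.0, 1.0]])       # a nonsingular matrix S
print(inertia(S.T @ A @ S))                     # (2, 1, 2): congruence preserves inertia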
Theorem 8.17 (Rayleigh–Ritz theorem). Let A ∈ C^{n×n} be a Hermitian matrix and let λ_1, λ_2, . . . , λ_n be its eigenvalues, where λ_1 ≥ λ_2 ≥ · · · ≥ λ_n. Define the Rayleigh–Ritz function
ral_A : C^n − {0_n} −→ R as
ral_A(x) = (x^H Ax)/(x^H x).
Then λ_1 ≥ ral_A(x) ≥ λ_n for x ∈ C^n − {0_n}.
Proof. By Theorem 8.13, since A is Hermitian, there exists a unitary matrix P and a diagonal matrix T such that A = P^H TP and the diagonal elements of T are the eigenvalues of A, that is, T = diag(λ_1, λ_2, . . . , λ_n). This allows us to write
x^H Ax = x^H P^H TPx = (Px)^H T(Px) = Σ_{i=1}^n λ_i |(Px)_i|²,
which implies λ1 P x2 xH Ax λn P x2 . Since P is unitary, we also have P x2 = xH P H P x = xH x, which implies λ1 xH x xH Ax λn xH x, for x ∈ Cn .
Corollary 8.14. Let A ∈ Cn×n be a Hermitian matrix and let λ1 , λ2 , . . . , λn be its eigenvalues, where λ1 λ2 · · · λn . We have λ1 = max{xH Ax | xH x = 1}, λn = min{xH Ax | xH x = 1}. Proof. Note that if x is an eigenvector that corresponds to λ1 , then Ax = λ1 x, so xH Ax = λ1 xH x; in particular, if xH x = 1, we have λ1 = xH Ax, so λ1 = max{xH Ax | xH x = 1}. The equality for λn can be shown in a similar manner.
Similarity and Spectra
537
We discuss next an important result that is a generalization of the Rayleigh–Ritz theorem (Theorem 8.17). Denote by Sk (L) the collection of all subspaces of dimension k of an F-linear space L. Theorem 8.18 (Courant-Fischer theorem). Let A ∈ Cn×n be a Hermitian matrix having the eigenvalues λ1 · · · λn . We have λk =
min
max{xH Ax | x2 = 1},
S∈Sn−k+1 (Cn ) x∈S
and λk =
max min{xH Ax | x2 = 1}.
S∈Sk (Cn ) x∈S
Proof. By Theorem 8.13, there exists a unitary matrix U and a diagonal matrix D such that A = U H DU and the diagonal elements of D are the eigenvalues of A, that is, D = diag(λ1 , λ2 , . . . , λn ). We prove only that λk =
min
max{xH Dx | x2 = 1}.
S∈Sn−k+1 (Cn ) x∈S
The proof of the second part of the theorem is entirely similar. Let S be a subspace of Cn with dim(S) = n − k + 1. We have S ∩ e1 , . . . , ek = {0n } because otherwise the equality S ∩e1 , . . . , ek = {0n } would imply dim(S) n − k. Define S˜ = {y ∈ S | y = 1}, Sˆ = {y ∈ S˜ | y ∈ e1 , . . . , ek }. Therefore, Sˆ consists of vectors of S˜ having the form ⎛ ⎞ y1 ⎜.⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜yk ⎟ ⎟ y=⎜ ⎜0⎟ ⎜ ⎟ ⎜.⎟ ⎜.⎟ ⎝.⎠ 0
Linear Algebra Tools for Data Mining (Second Edition)
538
such that
k
2 i=1 yi
ˆ we have = 1. Thus, for all y ∈ S,
y H Dy =
k
λi |yi |2 λk
i=1
so
k
|yi |2 = λk .
i=1
˜ it follows that max ˜ y H Dy max ˆ y H Dy λk , Since Sˆ ⊆ S, y∈S y∈S min
dim(S)=n−k+1
max{xH Dx | x ∈ S and x2 = 1} λk . x
Let now S be S = e1 , . . . , ek−1 ⊥ . Clearly, dim(S) = n − k + 1. A vector y ∈ S has the form ⎛ ⎞ 0 ⎜.⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜0⎟ ⎟ y=⎜ ⎜y ⎟. ⎜ k⎟ ⎜.⎟ ⎜.⎟ ⎝.⎠ yn Therefore, y Dy = H
n i=k
2
λi |yi | λi
n
|yi |2 = λi
i=k
for all y ∈ {y ∈ S | y2 = 1}. This implies min
dim(S)=n−k+1
max{xH Dx | x ∈ S and x2 = 1} λk , x
which yields the desired equality. The matrices A and D have the same eigenvalues. Also xH Ax = xH Ax = xH U H DU x = (U x)H D(U x) and U x2 = x2 , because U is a unitary matrix. This yields the first equality of the theorem. Another form of the Courant–Fisher theorem can be obtained by observing that every p-dimensional subspace S of Cn is the orthogonal space of an (n − p)-dimensional subspace. Therefore, for
Similarity and Spectra
539
each p-dimensional subspace S there is a sequence of n − p vectors w1 , . . . , w n−p (which is a basis of S ⊥ ) such that S = {x ∈ Cn | x ⊥ w1 , . . . , x ⊥ wn−p }. Theorem 8.19. Let A ∈ Cn×n be a Hermitian matrix having the eigenvalues λ1 · · · λn . We have λk = =
min
max{xH Ax | x ⊥ w1 , . . . , x ⊥ wk−1 and x2 = 1},
max
min{xH Ax | x ⊥ w1 , . . . , x ⊥ w n−k , and x2 = 1}.
w1 ,...,w k−1
x
w1 ,...,w n−k
x
Proof. This statement is equivalent to theorem 8.18, as we observed before. Theorem 8.20 (Interlacing theorem). Let A ∈ Cn×n be a Her i · · · ik mitian matrix and let B = A 1 be a principal submatrix of i1 · · · ik A, B ∈ Ck×k . If spec(A) = {λ1 , . . . , λn } and spec(B) = {μ1 , . . . , μk }, where λ1 · · · λn and μ1 · · · μk , then λj μj λn−k+j for 1 j k. Proof. Let {j1 , . . . , jq } = {1, . . . , n} − {i1 , . . . , ik }, where j1 < · · · < jq and k + q = n. By the Courant–Fisher theorem, we have λj = min max{xH Ax | x2 = 1 and x ∈ W ⊥ }, W
x
where W ranges over sets of non-zero vectors in Cn containing j − 1 vectors. Therefore, λj min max{xH Ax | x2 = 1 and x ∈ W ⊥ W
x
and x ∈ ej1 , . . . , ejq ⊥ } = min max{y H By | y2 = 1andy ∈ U ⊥ = μj U
y
(by Exercise 2 of Chapter 3), where U ranges over sets of non-zero vectors in Ck containing j − 1 vectors.
540
Linear Algebra Tools for Data Mining (Second Edition)
Again, by the Courant–Fisher theorem, λn−k+j = max min{xH Ax | x2 = 1 and x ∈ Z⊥ }, x
Z
where Z ranges over sets containing k − j non-zero vectors in Cn . Consequently, λn−k+j max min{xH Ax | x2 = 1 and x ∈ Z⊥ x
Z
and x ∈ ej1 , . . . , ejq ⊥ } = max min{y H By | y2 = 1 and y ∈ S⊥ } = μj , y
S
where S ranges over the sets of non-zero vectors in Ck containing n − j vectors. n×n be a Hermitian matrix and let B = Corollary 8.15. Let A ∈ C i · · · ik A 1 be a principal submatrix of A, B ∈ Ck×k . The set spec(B) i1 · · · ik contains no more positive eigenvalues than the number of positive eigenvalues of A and no more negative eigenvalues than the number of negative eigenvalues of A.
Proof. This observation is a direct consequence of the Interlacing Theorem. Theorem 8.21. Let A ∈ Cn×n be a Hermitian matrix having the eigenvalues λ1 · · · λn . If u1 , . . . , un are eigenvectors that correspond to λ1 , . . . , λn , respectively, W = {u1 , . . . , uk } and Z = {uk+2 , . . . , un }, then we have λk+1 = max{xH Ax | x2 = 1 and x ∈ W ⊥ } x
= min{xH Ax | x2 = 1 and x ∈ Z⊥ }. x
Proof. If A = U H DU , where U is a unitary matrix and D is a diagonal matrix, then ui , the i th column of U H , can be written as ui = U H ei . Therefore, by the second part of the proof of the Courant– Fisher theorem, we have xAx λk+1 if x belongs to the subspace orthogonal to the subspace generated by the first k eigenvectors of A. Consequently, the Courant–Fisher theorem implies the first equality of this theorem. The second equality can be obtained in a similar manner.
Similarity and Spectra
541
Corollary 8.16. Let A ∈ Cn×n be a Hermitian matrix having the eigenvalues λ1 · · · λn . If u1 , . . . , uk are eigenvectors that correspond to λ1 , . . . , λk , respectively, then a unit vector x that maximizes xH Ax and belongs to the subspace orthogonal to the subspace generated by the first k eigenvectors of A is an eigenvector that corresponds to λk+1 . Proof. Let u1 , . . . , un be the eigenvectors ofA and let x ∈ n u1 , . . . , uk ⊥ be a unit vector. We have x = j=k+1 aj uj , and n 2 j=k+1 aj = 1, which implies x Ax = H
n
λj a2j = λk+1 .
j=k+1
This, in turn, implies ak+1 = 1 and ak+2 = · · · = an = 0, so x = uk+1 . Theorem 8.22. Let A, B ∈ Cn× be two Hermitian matrices and let E = B − A. Suppose that the eigenvalues of A, B, E are α1 · · · αn , β1 · · · βn , and 1 · · · n , respectively. Then we have n βi − αi 1 . Proof. Note that E is also Hermitian, so all matrices involved have real eigenvalues. By the Courant–Fisher theorem, βk = min max{xH Bx | x2 = 1 and wHi x = 0 for 1 i k − 1}, W
x
where W = {w1 , . . . , wk−1 }. Thus, βk max xH Bx = max(xH Ax + xH Ex). x
x
(8.2)
Let U be a unitary matrix such that U H AU = diag(α1 , . . . , αn ). Choose wi = U ei for 1 i k − 1. We have wHi x = eHi U H x = 0 for 1 i k − 1. = x2 = 1. Define y = U H x. Since U is a unitary matrix, y2 n 2 Observe that eHi y = yi = 0 for 1 i k. Therefore, i=k yi = 1. n H H H 2 This, in turn, implies x Ax = y U AU y = i=k αi yi αk . From the Inequality (8.2), it follows that βk αk + max xH Ex αk + n . x
Since A = B − E, by inverting the roles of A and B we have αk βk − 1 , or 1 βk − αk , which completes the argument.
542
Linear Algebra Tools for Data Mining (Second Edition)
Theorem 8.23 (Hoffman–Wielandt theorem). Let A, B ∈ Cn×n be two normal matrices having the eigenvalues α1 , . . . , αn and β1 , . . . , βn , respectively. Then there exist permutations φ and ψ in PERMn such that n
2
|αi − βψ(i) | A −
B2F
i=1
n
|αi − βφ(i) |2 .
i=1
Proof. Since A and B are normal matrices, they can be diagonalized as A = U H DA U and B = W H DB W, where U and W are unitary matrices and C, D are diagonal matrices, C = diag(α1 , . . . , αn ) and D = diag(β1 , . . . , βn ). Then we can write A − B2F = U H CU − W H DW 2F = trace(E H E), where E = U H CU − W H DW . Note that E H E = (U H C H U − W H D H W )(U H CU − W H DW ) = U H C H CU + W H D H DW − W H D H W U H CU − U H C H U W H DW = U H C H CU + W H D H DW − U H C H U W H DW − (U H C H U W H DW )H = U H C H CU + W H D H DW − 2 (U H C H U W H DW ). Observe that trace( (U H C H U W H DW )) = (trace(U H C H U W H DW )) = (trace(C H U W H DW U H )). Thus, if Z is the unitary matrix Z = W U H , we have trace( (U H C H U W H DW )) = (trace(C H Z H DZ)). Since C2F = ni=1 α2i and D2F = ni=1 βi2 , we have trace(E H E) = C2F + D2F − 2 (trace(C H Z H DZ)) ⎛ ⎞ n n n n α2i + βi2 − 2 ⎝ a ¯i |zij |2 βj ⎠. = i=1
i=1
i=1 j=1
Theorem 3.20 implies that the matrix S that has the elements |zij |2 is doubly stochastic because Z is a unitary matrix. This allows us to
Similarity and Spectra
543
write A − B2F = trace(E H E)
n
α2i +
i=1
n
⎛ βi2 − max ⎝ S
i=1
n n
⎞ a ¯i sij βj ⎠,
i=1 j=1
and A − B2F = trace(E H E)
n
α2i +
i=1
n
⎞ ⎛ n n βi2 − min ⎝ a ¯i sij βj ⎠, S
i=1
i=1 j=1
where the maximum and the minimum are taken over the set of all doubly stochastic matrices. The Birkhoff–von Neumann Theorem states that the polyhedron of doubly–stochastic matrices has the permutation matrices as its vertices. Therefore, the extremes of the linear function ⎞ ⎛ n n α ¯ i sij βj ⎠ f (S) = ⎝ i=1 j=1
are achieved when S is a permutation matrix. Let Pφ be the permutation matrix that gives the maximum of f and let Pψ be the permutation matrixthat gives the minimum. If S = Pφ , then nj=1 sij βj = βφ(i) , so ⎞ ⎛ n n n n α2i + βi2 − ⎝ α ¯ i βφ(j) ⎠ A − B2F i=1
=
n
i=1
i=1 j=1
|αi − βφ(i) |2 .
i=1
In the last equality, we used the elementary calculation a − ¯b) |a − b|2 = (a − b)(a − b) = (a − b)(¯ ¯b = a¯ a + b¯b − a ¯b − a ab), = |a|2 + |b|2 − 2 (¯ for a, b ∈ C. Similarly, if S = Pψ , we obtain the other inequality.
Linear Algebra Tools for Data Mining (Second Edition)
544
Corollary 8.17. Let A, B ∈ Cn×n be two Hermitian matrices having the eigenvalues α1 , . . . , αn and β1 , . . . , βn , respectively, where α1 · · · αn and β1 · · · βn . Then, n
|αi − βi |2 A − B2F .
i=1
If α1 · · · αn and β1 · · · βn , then n
|αi − βi |2 A − B2F .
i=1
Proof. Since A and B are Hermitian, their eigenvalues are real numbers. By the Hoffman–Wielandt Theorem, there exist two permutations φ, ψ ∈ PERMn such that n
2
|αi − βψ(i) | A −
i=1
B2F
n
|αi − βφ(i) |2 .
i=1
We have n
|αi − βφ(i) |2 = a − Pφ b2F
i=1
and n
|αi − βψ(i) |2 = a − Pψ b2F ,
i=1
where
⎞ ⎛ ⎞ β1 α1 ⎜ . ⎟ ⎜ . ⎟ ⎟ ⎜ ⎟ a=⎜ ⎝ .. ⎠ and b = ⎝ .. ⎠, αn βn ⎛
so a − Pψ b2F A − B2F a − Pφ b2F . By Corollary 6.13, since the components of a and b are placed in decreasing order, we have a − Pφ bF a − bF ,
Similarity and Spectra
545
so a −
b2F
=
n
|αi − βi |2 a − Pφ bF A − B2F ,
i=1
which proves the first inequality of the corollary. For the second part, by Corollary 6.13, we have A − BF a − Pψ bF a − bF ,
By Theorem 7.16, the characteristic polynomial of an upper triangular matrix T is pT (λ) = (λ − λ1 ) · · · (λ − λn ), where λ1 , . . . , λn are the eigenvalues of T and, at the same time, the diagonal elements of T . Lemma 8.4. Let T ∈ Cn×n be an upper triangular matrix and let pT (λ) = λn +a1 λn−1 +· · ·+an−1 λ+an be its characteristic polynomial. Then pT (T ) = T n + a1 T n−1 + · · · + an−1 T + an In = On,n . Proof.
We have pT (T ) = (T − λ1 In ) · · · (T − λn In ).
Observe that for any matrix A ∈ Cn×n , λj , λk ∈ spec(A), and every eigenvector v of A in SA,λk , we have (λj In − A)v = (λj − λk )v. Therefore, for v ∈ ST,λk , we have pT (T )v = (λ1 In − T ) · · · (λn In − T )v = 0, because (λk I − T )v = 0. By Corollary 7.9, pT (T )ei = 0 for 1 i n, so pT (T ) = On,n , so pT (T ) = On,n . Theorem 8.24 (Cayley–Hamilton theorem). If A ∈ Cn×n is a matrix, then pA (A) = On,n .
Linear Algebra Tools for Data Mining (Second Edition)
546
Proof. By Schur’s Triangularization Theorem, there exists a unitary matrix U ∈ Cn×n and an upper–triangular matrix T ∈ Cn×n such that A = U H T U and the diagonal elements of T are the eigenvalues of A. Taking into account that U H U = U U H = In , we can write pA (A) = (λ1 In − A)(λ2 In − A) · · · (λn In − A) = (λ1 U H U − U H T U ) · · · (λn U H U − U H T U ) = U H (λ1 In − T )U U H (λ2 In − T )U · · · U H (λn In − T )U = U H (λ1 In − T )(λ2 In − T ) · · · (λn In − T )U = U H pT (T )U = On,n ,
by Lemma 8.4.
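The Cayley–Hamilton theorem can be checked numerically by evaluating the characteristic polynomial at the matrix itself; np.poly returns the coefficients of det(λI − A). The following short sketch is only an illustration on an arbitrary matrix.

import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [3.0, 1.0, 1.0],
              [0.0, 1.0, 2.0]])
coeffs = np.poly(A)                      # coefficients of p_A(lambda) = det(lambda I - A)
P = np.zeros_like(A)
for c in coeffs:                         # Horner evaluation of p_A at the matrix A
    P = P @ A + c * np.eye(3)
print(np.allclose(P, np.zeros((3, 3))))  # True: p_A(A) = O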
8.4
The Sylvester Operator
Schur’s Theorem allows us to examine the solvability of the matrix equation AX − XB = C, where A ∈
Cm×m ,
B∈
Cn×n ,
and C ∈ Cm×n .
Theorem 8.25. Let A ∈ Cm×m and B ∈ Cn×n be two matrices. The Sylvester operator defined by A and B is the mapping S A,B : Cm×n −→ Cm×n given by S A,B (X) = AX − XB. The linear mapping SA,B has an inverse if and only if spec(A) ∩ spec(B) = ∅. Proof. The linearity of SA,B is immediate. Suppose that λ ∈ spec(A) ∩ spec(B), so Au = λu and v H B = λv H for some u ∈ Cm×1 and v ∈ Cn×1 . Thus, u is an eigenvector for A and v is a left eigenvector for B that corresponds to the same eigenvalue λ. Define the matrix X = uvH ∈ Cm×n . We have AX − XB = Auv H − uv H B = (Au)v H − u(v H B) = 0, which means that S A,B is not invertible. Thus, if SA,B is invertible, the spectra of A and B are disjoint.
Similarity and Spectra
547
Conversely, suppose that S A,B is invertible, so the equation AX − XB = C is solvable for any C ∈ Cm×n . If B = U H T U is a Schur decomposition of B, where U ∈ Cn×n is a unitary matrix and T ∈ Cn×n
⎛ t11 t12 ⎜0 t 22 ⎜ T =⎜ . .. ⎜ . ⎝ . . 0 0
⎞ · · · t1n · · · t2n ⎟ ⎟ ⎟ .. ⎟ ··· . ⎠ · · · tnn
is an upper triangular matrix having as diagonal elements the eigenvalues t11 , t22 , . . . , tnn of B, then AX − XU H T U = C, which implies AXU H − XU H T = CU H . If Z = XU H ∈ Cm×n and D = CU H , we have AZ − ZT = D and this system has the form A(z 1 · · · z n ) − (z 1 · · · z n )T = (d1 · · · dn ). Equivalently, we have Az 1 − t11 z 1 = d1 Az 2 − t12 z 1 − t22 z 2 = d2 .. . Az n − t1n z 1 − t2n z 2 − · · · − tnn z n = dn . Note that the first equation of this system (A − t11 In )z 1 = d1 can be solved for z 1 because t11 , as an eigenvalue of B, does not belong to spec(A). Thus, the second equation Az 2 − t22 z 2 = d2 + t12 z 1 can be resolved with respect to z 2 , because t22 ∈ spec(A), etc. Thus, the matrix Z can be found, so X = ZU . As observed in [76], special cases of this equation occur in many important problems in linear algebra. For instance, if n = 1 and B = (0), then solving SA,0 (x) = c amounts to solving the linear system Ax = c. Solving the equation SA,A (X) = On,n amounts to finding the matrices X that commute with A, etc. Theorem 8.26. Let A ∈ Cm×m and B ∈ Cn×n be two matrices such that spec(B) ⊂ {z ∈ C | |z| < r} and spec(A) ⊂ {z ∈ C | |z| > r}
548
Linear Algebra Tools for Data Mining (Second Edition)
for some r > 0. Then, the solution of the equation AX − XB = C is given by the series (A−1 )n+1 CB n . X= n∈N
Proof. Since spec(A) and spec(B) are finite sets, there exist r1 and r2 such that 0 < r1 < r < r2 , spec(B) ⊂ {z ∈ C | |z| < r1 }, and spec(A) ⊂ {z ∈ C | |z| > r2 }. Then spec(A−1 ) ⊂ {z ∈ C | |z| < 1/r2 }. By Theorem 8.48, there exists a positive integer n0 such that −1 n if n n0 , |||B n ||| r1n and |||A ||| r2 . Thus, for n n0 , we n
have |||A−n−1 CB n ||| rr12 |||A−1 C|||, which shows that the series −1 n+1 CB n is convergent. It is immediate that for this n∈N (A ) choice of X we have AX − XB = C.
Corollary 8.18. Let A ∈ Cm×m and B ∈ Cn×n be two normal matrices such that spec(B) ⊂ {z ∈ C | |z| < r} and spec(A) ⊂ {z ∈ C | |z| > r + a} for some r > 0 and a > 0. If X is the solution of the Sylvester equation AX − XB = C, and ||| · ||| is a unitarily invariant matrix norm, then |||X||| a1 |||C|||. Proof.
From Theorem 8.26, we have |||X|||
∞
|||A−1 |||n+1 |||C||||||B|||n
n=0
|||C|||
∞ n=0
(r + a)−n−1 an =
1 |||C|||. a
The Sylvester operator can be seen as a linear transformation of the linear space Cm×n (which is isomorphic to Cmn ). The separation of two matrices A and B, denoted by sep(A, B), was introduced in [157]. This quantity is very useful in studying relationships between invariant spaces of matrices and is defined as sep(A, B) = min{S A,B (X) | X = 1}.
(8.3)
Obviously, the number is dependent on the norm used in its definition. If we wish to specify this norm, we will add a suggestive
Similarity and Spectra
549
subscript; for example, if we use the Frobenius norm, we will denote the corresponding separation by sepF (A, B); sep2 (A, B) will be used when we deal with ||| · |||2 . Theorem 8.27. If A ∈ Cm×m and B ∈ Cn×n are two matrices, then the spectrum of the Sylvester operator S A,B is spec(S A,B ) = {λ − μ | λ ∈ spec(A) and μ ∈ spec(B)}. Proof.
Suppose that θ ∈ spec(S A,B ). There exists a matrix X ∈
Cm×n − {Om,n } such that AX − XB = θX, which amounts to
(A − θIm )X − XB = O. In other words, S A−θIm ,B is singular, which implies that spec(A−θIm )∩spec(B) = ∅. This means that there exist λ ∈ spec(A) and μ ∈ spec(B) such that λ − θ = μ, which implies that θ = λ − μ. Thus, spec(S A,B ) ⊆ {λ − μ | λ ∈ spec(A) and μ ∈ spec(B)}. The reverse inclusion can be shown by reversing the above implications. Corollary 8.19. If A ∈ Cm×m and B ∈ Cn×n are two diagonalizable matrices such that spec(A) = {λ1 , . . . , λm }, and spec(B) = {μ1 , . . . , μm }, and if ui is an eigenvector of A associated to λi and v j is an eigenvector of B H associated with μj , then the matrix Xij = ui v Hj is an eigenvector of S A,B associated to λi − μj . Proof.
We have
S A,B (Xij ) = Aui v Hj − ui v Hj B = λi ui vj − ui (B H v j )H = λi ui vj − μj ui v Hj = (λi − μj )ui v Hj = (λi − μj )Xij .
Corollary 8.20. Let A ∈ Cm×m and B ∈ Cn×n be two matrices. We have sep(A, B) min{|λ − μ| | λ ∈ spec(A), μ ∈ spec(B)}. Proof. The inequality holds if sep(A, B) = 0. Therefore, suppose that sep(A, B) > 0. This implies that S A,B is nonsingular and by Supplement 43 of Chapter 6 we have |||S −1 A,B |||2 = −1 1 1 min{S A,B (X)2 | X2 =1} = sep(A,B) . Since the spectral radius of S A,B is less than |||S −1 A,B |||, it follows that
eigenvalues of μ ∈ spec(B).
S −1 A,B ,
1 sep(A,B)
is larger or equal to any
so sep(A, B) |λ − μ| for every λ ∈ spec(A) and
Linear Algebra Tools for Data Mining (Second Edition)
550
8.5
Geometric versus Algebraic Multiplicity
Theorem 8.28. Let A ∈ Cn×n be a square matrix and let λ ∈ spec(A). The geometric multiplicity geomm(A, λ) is less than or equal to the algebraic multiplicity algm(A, λ). Proof.
By Equality (7.4),
geomm(A, λ) = dim(null(A − λIn )) = n − rank(A − λIn ). Let m = geomm(A, λ). Starting from an orthonormal basis u1 , . . . , um of the subspace null(A − λIn ), define the matrix U = (u1 · · · um ). We have (A − λIn )U = O, so AU = λU = U (λIm ). Thus, by Theorem 8.7, we have
λIm U H AV , A∼ O V H AV where U ∈ Cn×m and V ∈ Cn×(n−m) . Therefore, A has the same characteristic polynomial as
λIm U H AV , B= O V H AV which implies algm(A, λ) = algm(B, λ). Since the algebraic multiplicity of λ in B is at least equal to m, it follows that algm(A, λ) m = geomm(A, λ).
Definition 8.3. If λ is an eigenvalue of A and geomm(A, λ) = algm(A, λ), then we refer to λ as a semi-simple eigenvalue. The matrix A is defective if there exist at least one eigenvalue that is not semi-simple. Otherwise, A is said to be non-defective. A is a non-derogatory matrix if geomm(A, λ) = 1 for every eigenvalue λ. If λ is a simple eigenvalue of A, we have algm(A, λ) = 1, and, since algm(A, λ) geomm(A, λ) 1, it follows that algm(A, λ) = geomm(A, λ) = 1, so λ is semi-simple. Thus, the notion of semisimplicity of an eigenvalue generalizes the notion of simplicity.
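The gap between the two multiplicities is easy to exhibit numerically: for the matrix used in Example 8.4 below, algm(A, 2) = 3 while geomm(A, 2) = dim(null(A − 2I)) = 1. A small illustrative NumPy sketch:

import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 2.0]])               # the matrix of Example 8.4 below
lam = 2.0
# A loose tolerance: the computed eigenvalues of a defective matrix are sensitive to rounding.
algm = int(np.sum(np.isclose(np.linalg.eigvals(A), lam, atol=1e-4)))
geomm = A.shape[0] - np.linalg.matrix_rank(A - lam * np.eye(3))  # dim null(A - lam I)
print(algm, geomm)                            # 3 1: the eigenvalue 2 is not semi-simple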
Example 8.4. Let A be the matrix ⎛ ⎞ 2 1 0 ⎜ ⎟ A = ⎝0 2 1⎠. 0 0 2 Its characteristic polynomial is 2 − λ 1 0 2−λ 1 = (2 − λ)3 , pA (λ) = det(A − λI3 ) = 0 0 0 2 − λ which means that A has the unique value 2 with algebraic multiplicity 3. ⎛ ⎞ 1 ⎝ If v = 0⎠ is an eigenvector, then Av = 2v, which amounts to 0 v1 = 1, v2 = 0, and v3 = 0. Thus, the geometric multiplicity of 2 is equal to 1, hence the eigenvalue 2 is not semi-simple. Note that if λ is a simple eigenvalue of A, then geomm(A, λ) = algm(A, λ) = 1, so λ is semi-simple. Example 8.5. Let A ∈ R2×2 be the matrix
A=
a b , b a
where a, b ∈ R are such that a = 0 and b = 0. The characteristic polynomial of A is pA (λ) = (a − λ)2 − b2 , so spec(A) = {a + b, a − b} and algm(A, a + b) = algm(A, a − b) = 1. Thus, both a + b and a − b are simple eigenvalues of A, and, therefore, they are also semi-simple. Theorem 8.29. If A ∈ Rn×n is a symmetric matrix, each of its eigenvalues is semi-simple. Proof. We saw that each symmetric matrix has real eigenvalues and is orthonormally diagonalizable (by Corollary 8.8). Starting from the real Schur factorization A = U T U −1 , where U is an orthogonal matrix and T = diag(t11 , . . . , tnn ) is a diagonal matrix, we can write
552
Linear Algebra Tools for Data Mining (Second Edition)
AU = U T . If we denote the columns of U by u1 , . . . , un , then we can write (Au1 , . . . , Aun ) = (t11 u1 , . . . , tnn un ), so Aui = tii ui for 1 i n. Thus, the diagonal elements of T are the eigenvalues of A and the columns of U are corresponding eigenvectors. Since these eigenvectors are pairwise orthogonal, the dimension of the invariant subspace that corresponds to an eigenvalue equals the algebraic multiplicity of the eigenvalue, so each eigenvalue is semi-simple. 8.6
λ-Matrices
The set of polynomials over the complex field depending on a nondeterminate λ is denoted by C[λ]. If f (λ) ∈ C[λ], we denote its degree by deg(f ). Definition 8.4. A λ-matrix is a polynomial of the form G(λ) = A0 λm + A1 λm−1 + · · · + Am , where A0 , A1 , . . . , Am are matrices in Cp×q . If A0 = Op,q , then we say that m is the degree of the λ-matrix G and we write deg(G) = m. A square λ-matrix, that is, a λ-matrix of type n × n, is regular if its leading coefficient A0 is a nonsingular matrix. The set of λ-matrices of format p × q over a field F is denoted by F [λ]p×q . In the special case p = q = 1, G(λ) is a usual polynomial in λ. Note that a λ-matrix of type p × q can also be regarded as a matrix whose entries are polynomials in λ, that is, as a member of C[λ]p×q ; the converse is also true, that is, every matrix whose entries are polynomials in λ is a λ-matrix. Example 8.6. Consider the λ-matrix G(λ) = A0 λ3 + A1 λ2 + A2 λ + A3 ,
Similarity and Spectra
where
and
553
⎛
⎞ ⎛ ⎞ 0 0 1 1 1 1 ⎜ ⎟ ⎜ ⎟ A0 = ⎝1 −1 0⎠, A1 = ⎝2 0 0⎠, 2 0 0 0 −2 0 ⎛ ⎞ ⎛ ⎞ 2 3 4 0 0 0 ⎜ ⎟ ⎜ ⎟ A2 = ⎝1 9 4⎠, A3 = ⎝1 0 0⎠. 3 0 0 0 0 1
G(λ) is the matrix ⎛
⎞ λ2 + 3λ λ3 + λ2 + 4λ λ2 + 2λ ⎟ ⎜ 4λ G(λ) = ⎝λ3 + 2λ2 + λ + 1 −λ3 + 9λ ⎠. −2λ2 1 2λ3 + 3λ
Theorem 8.30. A λ-matrix G(λ) is invertible if and only if det(G(λ)) is a non-zero constant. Proof. Suppose that G(λ) is invertible, that is, there exists a λmatrix H(λ) such that G(λ)H(λ) = In . Then det(G(λ)) det(H(λ)) = 1, so det(G(λ)) is a constant that cannot be 0. Conversely, suppose that det(G(λ)) is a non-zero constant. If we construct G−1 (λ) using the standard approach, it is immediate that G−1 (λ) is a λ-matrix, so G(λ) is invertible. The sum and product of λ-matrices are component-wise defined exactly as the general sum and products of matrices. Note that if G(λ) and H(λ) are two λ-matrices, the degree of their product is not necessarily equal to the sum of the degrees of the factors. In general, the degree of G(λ)H(λ) is not larger than the sum of the degrees of the factors. Example 8.7. Let G(λ), H(λ) be the λ-matrices
1 2 1 0 1 G(λ) = λ+ , and H(λ) = 0 0 −2 1 0
defined by
1 1 3 λ+ . 0 3 2
Linear Algebra Tools for Data Mining (Second Edition)
554
We have
GH =
7 7 4 3 λ+ . −2 −2 1 −4
Let G(λ) = A0 λm + A1 λm−1 + · · · + Am be a λ-matrix. The same ˜ matrix can also be written as G(λ) = λm A0 + λm−1 A1 + · · · + Am . ˜ If the matrix H is substituted for λ in G(λ) and in G(λ), the results G(B) = A0 B m + A1 B m−1 + · · · + Am , ˜ G(B) = B m A0 + B m−1 A1 + · · · + Am are distinct, in general, because matrices of the form B j do not commute with A0 , A1 , . . . , Am . Thus, G(A) is referred to as the right ˜ value of G in A and G(A) is referred to as the left value of G in A. Theorem 8.31. Let G(λ) and H(λ) be two n × n λ-matrices such that H(λ) is a regular matrix. There are two unique n × n λ-matrices Q(λ) and R(λ) such that G(λ) = H(λ)Q(λ)+ R(λ) such that R(λ) = On,n or deg(R(λ)) < deg(H(λ)). Proof.
Suppose that G(λ) = A0 λm + A1 λm−1 + · · · + Am , F (λ) = B0 λq + B1 λq−1 + · · · + Bq ,
where A0 = On,n and det(B0 ) = 0. We assume that m q; otherwise, we can take Q(λ) = On,n and R(λ) = G(λ). We define a sequence of λ-matrices G(1) (λ), G(2) (λ) . . . as having a non-increasing sequence of degrees m1 m2 · · · . The (p) coefficients of G(p) (λ) are denoted by Aj . We define G(1) (λ) = G(λ) − H(λ)B0−1 A0 λm−q and (k−1) mk−1 −q
G(k) (λ) = G(k−1) (λ) − H(λ)B0−1 A0
λ
,
as long as mk q. If mk+1 < q, the computation halts and (k)
G(k+1) (λ) = G(k) (λ) − H(λ)B0−1 A0 . This allows us to write (1)
(k)
G(λ) = H(λ)[B0−1 A0 λm−q + B0−1 A0 λm1 −q + · · · + B0−1 A0 ] + G(k+1) (λ).
Similarity and Spectra
555
Therefore, we can take (1)
(k)
Q(λ) = B0−1 A0 λm−q + B0−1 A0 λm1 −q + · · · + B0−1 A0 , and R(λ) = G(k+1) (λ); clearly, we have deg(R(λ)) < deg(H(λ). Suppose now that we have both G(λ) = H(λ)Q(λ) + R(λ) and G(λ) = H(λ)Q1 (λ) + R1 (λ), where deg(R(λ)) < deg(H(λ)) and deg(R1 (λ)) < deg(H(λ)). These equalities imply H(λ)(Q(λ) − Q1 (λ)) = R1 (λ) − R(λ). If Q(λ) − Q1 (λ) = On,n , we have deg(H(λ)(Q(λ) − Q1 (λ))) deg(H(λ)) because H(λ) is regular. However, deg(R1 (λ) − R(λ)) < deg(H(λ)), which leads to a contradiction. Therefore, we have Q(λ) = Q1 (λ), which implies R1 (λ) = R(λ). We refer to the λ-matrices Q(λ) and R(λ) defined by the equality G(λ) = H(λ)Q(λ) + R(λ) of Theorem 8.31 as the left quotient and left remainder of the division of G(λ) by H(λ). It is possible to prove in an entirely similar manner, under the same assumptions as the ones of Theorem 8.31, that the matrices P (λ) and S(λ), such that G(λ) = B(λ)H(λ) + S(λ) and S(λ) = On,n or deg(S(λ)) < deg(H(λ)), are uniquely determined. In this case we refer to the matrices P (λ) and S(λ) as the right quotient and the right remainder of the division of G(λ) by H(λ). Corollary 8.21. The left remainder of the division of the λ-matrix G(λ) = A0 λm + A1 λm−1 + · · · + Am by H(λ) = λI − C is the matrix G(C), where G(C) = C m A0 + C m−1 A1 + · · · + Am . The right remainder of the division of G(λ) by H(λ) is the matrix ˜ G(C) = A0 C m + A1 C m−1 + · · · + Am . Proof.
Starting from the equality A0 λm + A1 λm−1 + · · · + Am = (λI − C)Q(λ) + R,
where Q(λ) = Q0 λm−1 + · · · + Qm−1 and identifying the coefficients of the same powers of λ, we obtain the equalities A0 = Q0 , A1 = Q1 − CQ0 , A2 = Q2 − CQ1 , . . . , Am−1 = Qm−1 − CQm−2 , Am = R − CQm−1 . These equalities imply R = C m A0 + C m−1 A1 + · · · + CAm−1 + Am = A(C). The argument for the right remainder is similar.
556
Linear Algebra Tools for Data Mining (Second Edition)
Polynomials with complex coefficients, that is, polynomials in C[λ], are 1 × 1 λ-matrices.
Definition 8.5. Let G, H ∈ C[λ] be two polynomials and let A ∈ Cn×n be a matrix such that det(H(A)) = 0. The matrix Q(A) =
G(A)(H(A))−1 is the quotient of G(A) by H(A). A rational matrix function of A is a quotient of two polynomials H(A)/G(A).
Since G(A)H(A) = H(A)G(A) for every A ∈ Cn×n , if H(A) is invertible, then (H(A))−1 G(A) = G(A)(H(A))−1 . Theorem 8.32. Let A ∈ Cn×n . There exists a unique polynomial mA of minimal degree whose leading coefficient is 1 such that mA (A) = On×n . Proof. Theorem 8.24 involving the characteristic polynomial pA of A shows that the set of polynomials whose leading coefficient is 1 and that has A as a root is non-empty. Thus, we need to show the uniqueness of a polynomial of minimal degree. Suppose that f and g are two distinct polynomials with leading coefficient 1, of minimal degree k, such that f (A) = g(A) = On,n . Then (f − g)(A) = On,n and the degree of the polynomial f − g is less than k. This contradicts the minimality of the degrees of f and g. Thus, f = g. Definition 8.6. Let A ∈ Cn×n . The polynomial mA of minimal degree whose leading coefficient is 1 such that mA (A) = On×n is referred to as the minimal polynomial of A. Theorem 8.33. A ∈ Cn×n is an invertible matrix if and only if mA (0) = 0. Proof. Let mA (λ) = λk + a1 λk−1 + · · · + ak−1 λ + ak be the minimal polynomial of A. Suppose that A is an invertible matrix. If ak = 0, then mA (λ) = λ(λk−1 + a1 λk−2 + · · · + ak−1 ), so A(Ak−1 + a1 Ak−2 + · · · + ak−1 In ) = On,n . By multiplying the last equality by A−1 to the left, we obtain Ak−1 + a1 Ak−2 + · · · + ak−1 In = On,n , which contradicts the minimality of the degree of mA . Thus, ak = 0.
Similarity and Spectra
557
Conversely, suppose that ak = 0. Since Ak + a1 Ak−1 + · · · + ak−1 A + ak In = On,n , it follows that A(Ak−1 + a1 Ak−2 + · · · + ak−1 In ) = −ak In Thus, A is invertible and its inverse matrix is A−1 = −
1 (Ak−1 + a1 Ak−2 + · · · + ak−1 In ). ak
Definition 8.7. An annihilating polynomial of A ∈ Cn×n is a polynomial f such that f (A) = On,n . Theorem 8.34. If f is an annihilating polynomial for A ∈ Cn×n , than mA divides evenly f . Proof. Suppose that under the hypothesis of the theorem, mA does not evenly divide f . Then we can write f (λ) = mA (λ)q(λ) + r(λ), where r is a polynomial of degree smaller than the degree of mA . Note, however, that r(A) = f (A) − mA (A)q(A) = On,n , which con tradicts the minimality of the degree of mA . Let A ∈ Cn×n be a matrix and let B(λ) be the matrix whose transpose consists of the cofactors of the elements of the matrix λIn − A. Then, B(λ)(λIn − A) = pA (λ)In .
(8.4)
By substituting λ = A, we obtain an alternative proof of the Cayley– Hamilton equality pA (A) = On,n . The matrix B(λ) introduced above allows us to give an explicit form of the minimal polynomial of a matrix A ∈ Cn×n . Theorem 8.35. Let A ∈ Cn×n be a matrix. Its minimal polynomial is given by mA (λ) =
pA (λ)/d(λ),
where d(λ) is the greatest common divisor of the elements of the matrix B(λ).
Proof. The definition of d(λ) means that we can write B(λ) = d(λ)C(λ), where the entries of C(λ) are pairwise relatively prime polynomials. Thus, Equality (8.4) can be written as C(λ)(λIn − A) = (pA (λ)/d(λ)) In , which shows that pA (λ)/d(λ) is an annihilating polynomial for A. Therefore, pA (λ)/d(λ) is divisible by the minimal polynomial mA (λ), which allows us to write pA (λ)/d(λ) = mA (λ)r(λ). Since C(λ)(λIn − A) = mA (λ)r(λ)In , taking into account that mA (λ)In is an annihilator for A and, therefore, divisible by λIn − A, it follows that mA (λ)In = M (λ)(λIn − A). This implies C(λ)(λIn − A) = (pA (λ)/d(λ)) In = mA (λ)r(λ)In = r(λ)M (λ)(λIn − A).
Therefore, C(λ) = r(λ)M (λ), so r(λ) is a common divisor of the elements of C(λ). Since entries of C(λ) are pairwise relatively prime polynomials, it follows that r(λ) is a constant r0 . By the definition of the polynomials pA and mA , it follows that r0 = 1, which implies A (λ) . mA (λ) = pd(λ) Theorem 8.36. If A and B are similar matrices in Cn×n , then their minimal polynomials are equal. Proof. Suppose that A ∼ B, that is A = P −1 BP , where P is an invertible matrix and let mA and mB be the two minimal polynomials of A and B, respectively. We have mB (A) = mB (P −1 BP ) = P −1 mB (B)P = On,n , so mA divides mB . In a similar manner, we can prove that mB divides mA , so mA = mB because both polynomials have 1 as leading coefficient.
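Theorem 8.35 gives a direct recipe for computing minimal polynomials. A sketch of it in sympy (not from the text; the 3 × 3 matrix is a toy example with the single eigenvalue 2):

    import sympy as sp
    from functools import reduce

    lam = sp.symbols('lambda')
    A = sp.Matrix([[2, 1, 0],
                   [0, 2, 0],
                   [0, 0, 2]])
    M = lam * sp.eye(3) - A
    p_A = sp.expand(M.det())              # characteristic polynomial p_A(lambda)
    B = M.adjugate()                      # B(lambda): transpose of the cofactor matrix
    d = sp.monic(reduce(sp.gcd, B), lam)  # gcd of the entries of B(lambda)
    m_A = sp.cancel(p_A / d)              # minimal polynomial, by Theorem 8.35

    print(sp.factor(p_A))   # (lambda - 2)**3
    print(sp.factor(m_A))   # (lambda - 2)**2
    # m_A is annihilating: (A - 2 I)^2 = O
    assert (A - 2 * sp.eye(3))**2 == sp.zeros(3, 3)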
Let C(λ) ∈ C[λ]n×n be a λ-matrix such that rank(C) = r. Thus, C(λ) has at least one non-null minor of order r and all minors of order greater than r are null. Denote by δk (λ) the polynomial having 1 as leading coefficient that is the greatest common divisor of all minors of order k of C(λ) for 1 k r. The polynomial δ0 (λ) is defined as equal to 1. Since every minor of order k is a linear combination of minors of order k − 1, it follows that δk−1 (λ) divides δk (λ) for 1 k r. Definition 8.8. The invariant factors of the matrix C(λ) with rank(C) = r are the polynomials t0 (λ), . . . , tr−1 (λ), where tr−k (λ) =
δk (λ)/δk−1 (λ)
for 1 k r. Next, we introduce the notion of elementary transformation matrices for λ-matrices. Definition 8.9. The row elementary transformation matrices of λmatrices are the matrices T a(i) , T (p)↔(q) , and T (i)+p(λ)(j) , where T a(i) and T (p)↔(q) were introduced in Examples 3.34 and 3.35, and T (i)+p(λ)(j) is defined as ⎛ ⎞ 1 0 ··· ··· ··· 0 ···0 ⎜0 1 · · · · · · · · · 0 · · · 0⎟ ⎜ ⎟ ⎜. . ⎟ . . ⎜ ⎟ .. · · · .. · · · 0⎟, T (i)+p(λ)(j) = ⎜ .. .. · · · ⎜ ⎟ ⎜0 0 · · · p(λ) · · · 1 · · · 0⎟ ⎝ ⎠ .. .. .. .. . . ··· . ··· . ···1 where the polynomial p(λ) occurs in row i and column j. Note that all elementary transformation matrices are invertible. The effect of a left multiplication of a matrix G(λ) by any of these matrices is clearly identical to the effect of left multiplication of a matrix by the usual elementary transformation matrices that affect the rows of a matrix. Similarly, if one multiplies a matrix G(λ) at
the right by T (i)↔(j) , T a(i) , and T (i)+p(λ)(j) , the effect is to exchange columns i and j, multiply the i th column by a, and add the j th column multiplied by a to the i th column, respectively. Definition 8.10. To λ-matrices G(λ), H(λ) are equivalent if one of them can be obtained from the other my multiplications by elementary transformation matrices. It is easy to verify that the relation introduced in Definition 8.10 is indeed an equivalence relation. We denote the equivalence of two λ-matrices G(λ) and H(λ) by A ∼λ B. Theorem 8.37. Let G(λ) ∈ C[λ]n×n be a λ-matrix having rank r. There exists an equivalent diagonal λ-matrix D(λ) = diag(tr−1 (λ), tr−2 (λ), . . . , t0 (λ)), where t0 (λ), . . . , tr−2 (λ), tr−1 (λ) are the invariant factors of G(λ). Furthermore, each invariant factor tr−j is a divisor of the invariant factor tr−j+1 for 1 j r. Proof. Let B (0) (λ) ∈ C[λ]n×n be a λ-matrix that is equivalent to G(λ) such that the polynomial b11 (λ) has minimal degree among all polynomials bij (λ). We claim that (B (0) (λ)) can be chosen such that (B (0) (λ))11 divides all polynomials located on the first row. Indeed, suppose that this is not the case and (B (0) (λ))11 does not divide (B (0) (λ))1j . Since deg((B (0) (λ))1j ) deg((B (0) (λ))11 ), we can write deg((B (0) (λ))1j ) = deg((B (0) (λ))1j )Q(λ) + R(λ), where deg(R(λ)) < deg((B (0) (λ))1j ). Then by subtracting from the j th column the first column multiplied by Q(λ), we obtain a matrix equivalent to B (0) (λ) (and, therefore, with G(λ)) that has an element in the position (1, j) of degree smaller than the element in position (1, 1). This contradicts the definition of B (0) (λ). A similar argument shows that (B (0) (λ))11 divides all polynomials located on the first column.
Since the entries in the first row and the first column are divisible by (B (0) (λ))11 , starting from the matrix B (0) (λ) we construct, by applying elementary transformations, the matrix B (1) (λ) that has the form ⎞ ⎛ (1) 0 ··· 0 b11 (λ) ⎟ ⎜ (1) (1) ⎜ 0 b22 (λ) · · · b2n (λ)⎟ ⎟ ⎜ ⎜ . .. .. ⎟ ⎜ .. . ··· . ⎟ ⎠ ⎝ (1)
(1)
bn2 (λ) · · · bnn (λ)
0
and we have B (0) (λ) ∼λ B (1)(λ) . Note that the degrees of the components of B (1)(λ) outside the first row and the first columns are at least equal to the degree of (B (1) (λ))11 . Repeating this argument for the matrix B (1) (λ) and involving the second row and the second column of this matrix, we obtain the matrix ⎞ ⎛ (1) b11 (λ) 0 0 0 ⎟ ⎜ (2) ⎟ ⎜ 0 (λ) 0 0 b 22 ⎟ ⎜ , B (2) (λ) = ⎜ . .. .. .. ⎟ ⎟ ⎜ .. . . . ⎠ ⎝ 0
0
(1)
· · · bnn (λ)
which is equivalent to B (1) (λ) and therefore with G(λ). We can assume that the degrees of all polynomials of B (2) located below the first row and at the right of the first columns are at least equal (1) to deg(b11 (λ)). (1) Without loss of generality we may assume that b11 (λ) divides the (2) polynomial b22 (λ). Indeed, if this is not the case, we can write (2)
(1)
b22 (λ) = b11 (λ)u(λ) + v(λ), (1)
where deg(v(λ)) < deg(b11 (λ)). By multiplying the first column of B (2) (λ) by −u(λ) and adding it to the second column, we have the equivalent matrix
⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝
(1)
b11 (λ) −b11(1) (λ)u(λ) (2)
0 .. .
b22 (λ) .. .
0
0
0
0
0 .. .
0 .. . (1)
⎞ ⎟ ⎟ ⎟ ⎟. ⎟ ⎠
· · · bnn (λ)
By adding the first row to the second, we have another equivalent matrix ⎞ ⎛ (1) 0 b11 (λ) −b11(1) (λ)u(λ) 0 ⎟ ⎜ ⎜ 0 v(λ) 0 0 ⎟ ⎟ ⎜ ⎟. ⎜ . . . . .. .. .. ⎟ ⎜ .. ⎠ ⎝ 0
0
(1)
· · · bnn (λ)
Finally, by adding the first column multiplied by u(λ) to the second column, we obtain yet another equivalent matrix ⎞ ⎛ (1) 0 0 b11 (λ) 0 ⎟ ⎜ ⎟ ⎜ 0 v(λ) 0 0 ⎟ ⎜ . ⎜ . .. .. .. ⎟ ⎟ ⎜ .. . . . ⎠ ⎝ (1) 0 0 · · · bnn (λ) The existence of this matrix contradicts the assumption made about (1) (1) the matrix B (2) (λ) because deg(v(λ)) < deg(b11 (λ). Thus, b11 (λ) (2) divides the polynomial b22 (λ). Eventually, we are left with the matrix ⎛ (1) ⎞ b11 (λ) 0 0 ··· 0 ⎜ ⎟ (2) ⎜ 0 0 · · · 0⎟ b22 (λ) ⎜ ⎟ ⎜ . ⎟ .. .. ⎜ . ⎟ . . · · · 0⎟ ⎜ . D(λ) = B (r) (λ) = ⎜ ⎟ (r) ⎜ 0 ⎟ (λ) · · · 0 0 b rr ⎜ ⎟ ⎜ ⎟ .. .. ⎜ .. ⎟ ⎝ . . . · · · 0⎠ 0 0 0 ··· 0 which is equivalent to G(λ). The degrees of the polynomials that occur on the diagonal of this matrix are increasing and each poly(j) (j+1) nomial bjj (λ) divides its successor bj+1 j+1 (λ) for 1 j r − 1.
Note that the divisibility properties obtained above imply that
δr (λ) = b^(1)_{11}(λ) b^(2)_{22}(λ) · · · b^(r)_{rr}(λ),
δr−1 (λ) = b^(1)_{11}(λ) · · · b^(r−1)_{r−1 r−1}(λ),
. . .
δ1 (λ) = b^(1)_{11}(λ).
Thus, the invariant factors of this matrix are
t0 (λ) = δr (λ)/δr−1 (λ) = b^(r)_{rr}(λ),
. . .
tj (λ) = δr−j (λ)/δr−j−1 (λ) = b^(r−j)_{r−j r−j}(λ),
. . .
tr−1 (λ) = δ1 (λ)/δ0 (λ) = b^(1)_{11}(λ).
Theorem 8.38. Elementary transformations applied to λ-matrices preserve the invariant factors. Proof. It is clear that by multiplying a row (or a column) by a constant or by permuting two rows (or columns), the value of the polynomials δk (λ) is not affected. Suppose that we add the j th row multiplied by the polynomial p to the i th row. Minors of order k can be classified in the following categories: (i) minors that contain both the i th row and the j th row; (ii) minors that contain neither the i th row nor the j th row; (iii) minors that contain the j th row but do not contain the i th row, and (iv) minors that contain the i th row but do not contain the j th row. It is clear that minors in the first three categories are not affected by this elementary transformation. If M is a minor of order k of the matrix G(λ) that contains the i th row but does not contain the j th
row and we add the elements of the j th row, then M is replaced by the sum of two minors of order k. Thus, δk (λ) is not affected by this elementary transformation. Theorem 8.39. If two λ-matrices have the same invariant factors, then they are equivalent. Proof. Let G(λ) and H(λ) be two matrices that have the same invariant factors t0 (λ), . . . , tr−1 (λ). Both matrices are equivalent to the matrix ⎞ ⎛ 0 0 ··· 0 t0 (λ) ⎜ 0 t1 (λ) 0 · · · 0⎟ ⎟ ⎜ ⎟ ⎜ .. .. ⎟ ⎜ .. ⎜ . . . · · · 0⎟ ⎟ ⎜ D(λ) = ⎜ 0 tr−1 (λ) · · · 0⎟ ⎟ ⎜ 0 ⎟ ⎜ . . . ⎟ ⎜ . .. .. · · · 0⎠ ⎝ . 0 0 0 ··· 0 that has their common invariant factors on its main diagonal and, therefore, are equivalent. We saw that a λ-matrix G(λ) is invertible if and only if det(G(λ)) is a non-zero constant. Lemma 8.5. Every invertible λ-matrix G(λ) ∈ C[λ]n×n is a product of elementary transformation matrices. Proof. Since G(λ) is an invertible λ-matrix, det(A) is a non-zero constant, so δn (λ) is a non-zero constant c. Therefore, all polynomials δk (λ) are equal to c, so all invariant factors of G(λ) equal 1. Thus, G(λ) is equivalent to matrix In , which amounts to the having G(λ) equal to a product of elementary transformation matrices. Theorem 8.40. The λ-matrices G(λ) and H(λ) are equivalent if and only if there exist two invertible λ-matrices P (λ) and Q(λ) such that G(λ) = P (λ)H(λ)Q(λ). Proof. If G(λ) and H(λ) are equivalent, then H(λ) can be obtained from G(λ) by applying elementary transformations (multiplying by elementary λ-matrices for row transformations and multiplying by
elementary λ-matrices for column transformations). Each of these matrices is invertible and this leads to the desired equality. Conversely, suppose that the equality G(λ) = P (λ)H(λ)Q(λ) holds for two invertible λ-matrices P (λ) and Q(λ). By Lemma 8.5, both P (λ) and Q(λ) are products of elementary transformation matrices, so G(λ) ∼λ H(λ). Example 8.8. Let ⎞ λ 2λ 2λ ⎟ ⎜ A(λ) = ⎝λ2 + λ 3λ2 + λ 2λ2 + 2λ⎠ λ 2λ λ3 + λ ⎛
be a λ-matrix. This matrix is equivalent to the matrix ⎛ ⎞ λ 0 0 ⎜ ⎟ 0 ⎠. ⎝ 0 λ2 − λ 0 0 λ3 − λ The invariant factors of A are t0 (λ) = λ, t1 (λ) = λ2 − λ, and t2 (λ) = λ3 − λ, Let tr−1 (λ) be the invariant factor of A having the highest degree and let tr−1 (λ) = p1 (λ) · · · pk (λ) be the factorization of tr−1 as a product of irreducible polynomials over the field F that contains the elements of A. Definition 8.11. The elementary divisors of the λ-matrix G(λ) are the irreducible polynomials p1 (λ), . . . , pk (λ) that occur in the factorization of the invariant factor of the highest degree of G(λ). Since i j implies that ti (λ) divides tj (λ) for 1 i, j r − 1, it follows that the factorization of any of the polynomials ti (λ) may contain only these elementary divisors. Example 8.9. The elementary divisors of the matrix A(λ) considered in Example 8.8 are λ, λ + 1, and λ − 1.
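The quantities δk (λ) of Definition 8.8 can be computed mechanically, which gives an independent check of Example 8.8. The following sketch (not from the text) uses sympy; the successive quotients δk (λ)/δk−1 (λ) reproduce the diagonal entries λ, λ2 − λ, λ3 − λ of the equivalent matrix above.

    import sympy as sp
    from functools import reduce
    from itertools import combinations

    lam = sp.symbols('lambda')
    A = sp.Matrix([[lam,          2*lam,          2*lam],
                   [lam**2 + lam, 3*lam**2 + lam, 2*lam**2 + 2*lam],
                   [lam,          2*lam,          lam**3 + lam]])

    def delta(M, k):
        # monic gcd of all k x k minors of M (Definition 8.8)
        minors = [M.extract(list(r), list(c)).det()
                  for r in combinations(range(M.rows), k)
                  for c in combinations(range(M.cols), k)]
        return sp.monic(reduce(sp.gcd, minors), lam)

    d0, d1, d2, d3 = sp.Integer(1), delta(A, 1), delta(A, 2), delta(A, 3)
    print([sp.expand(sp.cancel(q)) for q in (d1/d0, d2/d1, d3/d2)])
    # [lambda, lambda**2 - lambda, lambda**3 - lambda]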
8.7 The Jordan Canonical Form
We begin by introducing a type of matrix which is as defective as possible.
Definition 8.12. Let λ be a complex number. An r-Jordan block associated with λ is the matrix Br (λ) ∈ Cr×r that has λ in every diagonal position, 1 in every superdiagonal position, and 0 elsewhere:
Br (λ) =
( λ 1 0 · · · 0 )
( 0 λ 1 · · · 0 )
( . . . . . . . )
( 0 0 0 · · · 1 )
( 0 0 0 · · · λ ).
The matrix Br (λ) is diagonal if and only if r = 1; otherwise, that is, if r > 1, the matrix Br (λ) is not even diagonalizable. Also, a block Br (λ) can be written as Br (λ) = λIr + Er , where, in block form,
Er = ( 0r−1 Ir−1 ; 0 0r−1 )
(a zero column of size r − 1, the identity Ir−1 , a zero scalar, and a zero row).
The unique eigenvalue of the Jordan block Br (λ) is λ1 = λ and algm(Br , λ) = n. On the other hand, the invariant subspace corresponding to λ is the subspace generated by e1 , so geomm(Br , λ) = 1. Thus, every r-Jordan block (for r > 1) is a defective matrix. Definition 8.13. A Jordan segment associated with the number λ is a block-diagonal matrix Jr1 ,...,rk (λ) given by ⎞ ⎛ Br1 (λ) O ··· O ⎜ O O ⎟ Br2 (λ) · · · ⎟ ⎜ ⎜ Jr1 ,...,rk (λ) = ⎜ .. .. .. ⎟ ⎟, . ··· . ⎠ ⎝ . O O · · · Brk (λ) where Br1 (λ), . . . , Brk (λ) are Jordan blocks and r1 r2 · · · rk . The sequence (r1 , r2 , . . . , rk ) is the Segr`e sequence of the segment Jr1 ,...,rk (a).
Given a Segr`e sequence (r1 , r2 , . . . , rk ) and a number λ, the Jordan segment Jr1 ,...,rk (λ) is completely determined. A Jordan segment Jr1 ,...,rk (λ) is a diagonal matrix if and only if each of its Jordan blocks is unidimensional. Otherwise, a Jordan segment is not even diagonalizable. Indeed, if Jr1 ,...,rk (λ) = XDX −1 , where D is a diagonal matrix and X is an invertible matrix, then D = diag(λ, . . . , λ) = λI, so Jr1 ,...,rk (λ) = λI, which contradicts the fact that Jr1 ,...,rk (λ) contains a Jordan block of size larger than 1. By Theorem 7.15, the spectrum of a Jordan segment Jr1 ,...,rk (λ) ∈ Cn×n consists of a single eigenvalue λ of algebraic multiplicity n. The geometric multiplicity of λ is k and the eigenvectors that generate the invariant subspace of λ are e1 , er1 +1 , . . . , er1 +···+rk−1 +1 . Definition 8.14. A Jordan matrix R ∈ Cn×n is a block diagonal matrix, whose blocks are Jordan segments, R = (Jr1,1 ,...,r1,k1 (λ1 ), . . . , Jrp,1 ,...,rp,kp (λp )). We shall prove that every matrix A ∈ Cn×n is similar to a Jordan matrix R such that: (i) for each eigenvalue λ of A we have a Jordan segment Jr,1 ,...,r,k (λ ) in R; (ii) for the Jordan segment Jr,1 ,...,r,k (λ ) that corresponds to the eigenvalue λ , the number k of Jordan blocks equals geomm(A, λ ); (iii) the algebraic multiplicity of λ equals the size of the Jordan segment that corresponds to this eigenvalue, that is, we have algm(A, λ ) =
k
r,i .
i=1
Next we give an algorithmic proof of the fact that for every square matrix A ∈ Cn×n there exists a similar Jordan matrix. This proof was obtained in [54]. Theorem 8.41. Let T ∈ Cn×n be an upper triangular matrix. There exists a nonsingular matrix X ∈ Cn×n such that X −1 T X = diag(K1 , . . . , Km ), where Ki = λi Ipi + Li , Li is strictly upper triangular and each λi is distinct, for 1 i m.
Proof. The argument is by induction on n. The base case, n = 1, is immediate. Suppose that the theorem holds for matrices of size less than n and let T ∈ Cn×n be an upper triangular matrix. Without loss of generality we may assume that
T1 S T = , O T2 where T1 and T2 have no eigenvalues in common and T1 = λ1 I + E, where E is strictly upper triangular. Since
I −Y T1 −T1 Y + S + Y T2 I Y T1 S = , O I O T2 O I O T2 there exists a matrix Y such that
I Y I −Y T1 O T = O T2 O I O I if and only if S = T1 Y − Y T2 . By Theorem 8.25, this equation has a solution Y if and only if spec(T1 ) ∩ spec(T2 ) = ∅, which is the case by the assumption we made about T1 and T2 . By the induction hypothesis, T2 may be reduced to block diagonal form, that is, there is a matrix Z such that Z −1 T2 Z = diag(H1 , . . . , Hp ). Therefore, we have
O I O T1 I O T1 O . = O T2 O Z −1 T2 Z O Z O Z −1 The last matrix is clearly in the block diagonal form, which also shows that the matrix X is given by
I −Y I O I −Y Z X= = . O I O Z O Z Lemma 8.6. Let E ∈ Ck×k be a matrix of the form 0k−1 Ik−1 E= . 0 0k−1 We have
E E = ( 0 0k−1 ; 0k−1 Ik−1 ) (in block form), E ei+1 = ei , and (Ik − E E)x = (e1 x)e1 for x ∈ Ck .
Proof. The proofs of the equalities of the lemma are straightforward. Theorem 8.42. Let W ∈ Cn×n be a strictly upper triangular matrix. Then there is a nonsingular matrix X such that X −1 W X = G, where G = diag(E1 , . . . , Em ) with each Ej given by
0 Ik j Ej = 0 0 such that kj+1 kj for 1 j m − 1. Proof. The proof is by induction on n 1. The statement clearly holds in the base case, n = 1. Assume that the result holds for strictly upper triangular matrices of format (n − 1) × (n − 1) and let W ∈ Cn×n be an upper triangular matrix. We can write
0 u , W = 0 V where V ∈ C(n−1)×(n−1) is an upper triangular matrix. By the inductive hypothesis, there exists a nonsingular matrix Y such that
E1 O −1 , Y VY = O H where H = diag(E2 , . . . , Em ) and the order of E1 is at least equal to the size of any of the matrices Ei , where 2 i m. Then
0 u Y 1 0 1 0 . W = 0 Y 0 Y −1 V Y 0 Y −1 The matrix
0 u Y 0 Y −1 V Y
can be written as
u Y
0 0 Y −1 V Y
⎛
⎞ 0 u1 u2 ⎜ ⎟ = ⎝0 E1 O ⎠ 0 O H
by partitioning the vector u Y as u Y = (u1 u2 ). This allows us to write ⎞⎛ ⎞⎛ ⎞ ⎛ 1 u1 E1 0 0 u1 u2 1 −u1 E1 0 ⎟⎜ ⎟⎜ ⎟ ⎜ I 0⎠ ⎝0 E1 O ⎠ ⎝0 I 0⎠ ⎝0 0 O H I 0 0 0 0 I ⎞ ⎛ 0 u1 (I − E1 E1 ) u2 ⎟ ⎜ E1 O ⎠. = ⎝0 0 O H By Lemma 8.6, we have u1 (I − E1 E1 ) = ((I − E1 E1 )u1 ) = (e1 u1 )e1 , so
⎞ ⎛ ⎞ ⎛ 0 (e1 u1 )e1 u2 0 u1 (I − E1 E1 ) u2 ⎟ ⎜ ⎟ ⎜ E1 O ⎠ = ⎝0 E1 O ⎠. ⎝0 0 O H 0 O H
We need to consider two cases depending on whether the number e1 u1 is 0 or not. Case 1: Suppose that e1 u1 = 0. We have ⎞ ⎛ 0 (e1 u1 )e1 u2
E e1 u2 ⎟ ⎜ 0 E O ∼ , ⎠ ⎝ 1 O H 0 O H where
E=
0 e1 . 0 E1
This follows from the equality ⎞ ⎞⎛ ⎞⎛ ⎛ 1 0 0 (e1 u1 )e1 u2 e1 u1 0 0 e1 u1 0 ⎟ ⎜ ⎟⎜ ⎜ 0 I 0 ⎟ E1 O⎠⎝ 0 I 0 ⎠ ⎠ ⎝0 ⎝ 0 0 e 1u1 I 0 O H 0 0 (e1 u1 )I 1 ⎞ ⎛ 0 e1 u2 ⎟ ⎜ = ⎝0 E1 0 ⎠ 0 0 H
because e1 u2
=
u2 . O
Note that the order k of E is strictly greater than the order of and diagonal block of H, so H k−1 = O. Define si = u2 H i−1 for 1 i k. Then
E ei si I −ei+1 si I ei+1 si E ei+1 si+1 = O H 0 I 0 I 0 H for 1 i k − 1. We have sk = 0 because H k−1 = O, and it follows that W is similar to the matrix
E O . O H Case 2: If e1 u1 = 0, by permuting the rows and columns, the matrix W is similar to the matrix ⎞ ⎛ E1 0 0 ⎟ ⎜ ⎝ 0 0 u2 ⎠. 0 0 H Then, by the inductive hypothesis, there is a nonsingular matrix Z such that
0 u2 Z = L, Z −1 0 H where L has the desired block diagonal form. Thus, W is similar to the matrix
E1 0 . 0 L By applying a permutation of the blocks, we obtain a matrix in the proper form, which completes the proof.
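Theorem 8.43 below assembles these reductions into the Jordan canonical form. As a quick illustration (a sketch, not the algorithm of the proofs above), sympy's jordan_form returns a Jordan matrix similar to a small defective matrix:

    import sympy as sp

    # A defective matrix: eigenvalue 2 with algebraic multiplicity 3 and
    # geometric multiplicity 2, plus the simple eigenvalue 5.
    A = sp.Matrix([[2, 1, 0, 0],
                   [0, 2, 0, 0],
                   [0, 0, 2, 0],
                   [0, 0, 0, 5]])
    P, J = A.jordan_form()                 # A = P J P^{-1}
    assert sp.simplify(P.inv() * A * P - J) == sp.zeros(4, 4)
    print(J)   # Jordan blocks B2(2), B1(2), B1(5), possibly in a different order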
Theorem 8.43. Let A ∈ Cn×n be a matrix such that spec(A) = {λ1 , . . . , λp }. The matrix A is similar to a Jordan matrix R = (Jr1,1 ,...,r1,k1 (λ1 ), . . . , Jrp,1 ,...,rp,kp (λp )), where algm(A, λi ) = Σ_{h=1}^{ki} ri,h and geomm(A, λi ) = ki for 1 ≤ i ≤ p.
Proof. By Schur’s Triangularization Theorem (Theorem 8.8), there exists a unitary matrix U ∈ Cn×n and an upper triangular matrix T ∈ Cn×n such that A = U H T U and the diagonal elements of T are the eigenvalues of A; furthermore, each eigenvalue λ occurs in the sequence of diagonal values a number of algm(A, λ) times. By Theorem 8.41, the upper triangular matrix T is similar to a block upper triangular matrix diag(K1 , . . . , Km ), where Ki = λi Ipi + Li , Li is strictly upper triangular, and each λi is distinct for 1 i m. Theorem 8.42 implies that each triangular block is similar to a matrix of the required form because for each diagonal block Ki there is a nonsingular Xi such that Xi−1 (λi I+)Xi = λi I + diag(E1 , . . . , Emi ). Theorem 8.44. For an eigenvalue λ of a square matrix A ∈ Cn×n , we have algm(A, λ) = 1 if and only if geomm(A, λ) = geomm(AH , λ) = 1, and for any eigenvector u that corresponds to λ in A and any eigenvector v that corresponds to the same eigenvalue in AH , we have v H u = 0. Proof. Suppose that λ is an eigenvalue of A such that geomm(A, λ) = geomm(AH , λ) = 1 and for any eigenvector u that corresponds to λ in A and any eigenvector v that corresponds to the same eigenvalue in AH we have v H u = 0. Let B ∈ Cn×n be a matrix that is similar to A. There exists an invertible matrix X such that B = XAX −1 . Then, A − λI ∼ B − λI, so dim(SA,λ ) = dim(SB,λ ), which allows us to conclude that λ has geometric multiplicity 1 in both A and B. Note that
BXu = XAu = λXu, so Xu is an eigenvector of B. Similarly, B H (X H )−1 v = (X H )−1 AH X H (X H )−1 v = (X H )−1 AH v = λ(X H )−1 v, so (X H )−1 v is an eigenvector for B H that corresponds to the same eigenvalue λ. Furthermore, we have ((X H )−1 v)H Xu = v H X −1 Xu = v H u = 0. Thus, if A is a matrix that satisfies the conditions of the theorem, then any matrix similar to A satisfies the same conditions. If λ is a simple eigenvalue of A, then λ is also a simple eigenvalue of the Jordan normal form of A. Therefore, a Jordan segment that corresponds to an eigenvalue λ of geometric multiplicity 1 consists of a single Jordan block B1 (λ) of order 1 that has (1) as an eigenvector. The transposed segment has also (1) as an eigenvector. The eigenvector u of A that corresponds to λ can be obtained from (1) by adding zeros corresponding to the remaining components and the eigenvector v can be obtained from (1) in a similar manner. Since (1)H (1) = 0, we have vH u = 0. Conversely, if A and λ satisfy the conditions of the theorem, then the same conditions are satisfied by the Jordan normal form C of A. Therefore, the Jordan segment that corresponds to λ in C consists of a single block Bm (λ). Suppose that m > 1. Then e1 ∈ Cm is an eigenvector of Bm (λ), em ∈ Cm is an eigenvector of (Bm (λ))H , and eHm e1 = 0, which would imply v H u = 0. Therefore, m = 1 and λ is a simple eigenvalue of A. 8.8
Matrix Norms and Eigenvalues
There exists a simple relationship between the spectral radius and any matrix norm |||·|||. Namely, if x is an eigenvector that corresponds to λ ∈ spec(A), then |||Ax||| = |λ||||x||| |||A||||||x|||, which implies λ |||A|||. Therefore, ρ(A) |||A|||.
(8.5)
It is easy to see that if A ∈ Cn×n and a ∈ R>0 , then ρ(aA) = aρ(A).
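Inequality (8.5) is easy to probe numerically. A small numpy sketch (not from the text) checks that the spectral radius of a random complex matrix is bounded by its induced 1-, 2-, and ∞-norms:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 6)) + 1j * rng.standard_normal((6, 6))

    rho = max(abs(np.linalg.eigvals(A)))       # spectral radius rho(A)
    for p in (1, 2, np.inf):                   # induced matrix norms |||A|||_p
        assert rho <= np.linalg.norm(A, p) + 1e-12
    print(rho, [np.linalg.norm(A, p) for p in (1, 2, np.inf)])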
In Section 6.7, we have seen that for A ∈ Cp×p , |||A||| < 1 implies limn→∞ An = Op,p . Using the spectral radius we can prove a stronger result: Theorem 8.45 (Oldenburger’s theorem). If A ∈ Cp×p , then limn→∞ An = Op,p if and only if ρ(A) < 1. Proof. Let B be a matrix in Jordan normal form that is similar to A. There exists a non-singular matrix U such that U AU −1 = B, and, therefore, U An U −1 = B n . Therefore, if limn→∞ An = Op,p we have limn→∞ B n = Op,p , so limn→∞ λn = O for any λ ∈ spec(A) = spec(B). This is possible only if |λ| < 1 for λ ∈ spec(A), so ρ(A) < 1. Conversely, if ρ(A) < 1, then |λ| < 1 for λ ∈ spec(A), so for any block Br (λ) of B we have limn→∞ (Br (λ))n = Or,r by Exercise 8.13. Therefore, limn→∞ B n = Op,p , so limn→∞ An = Op,p . Theorem 8.46. If A, B ∈ Rn×n and abs(A) B, then ρ(A) ρ(B). Proof. Suppose that abs(A) B and ρ(B) < ρ(A). There exists α ∈ R such that 0 ρ(B) < α < ρ(A). If C = α1 A and D = α1 B, then ρ(C) = α1 ρ(A) > 1 and ρ(D) = α1 ρ(B) < 1. Thus, by Theorem 8.45, limk→∞ D k = O. Since abs(A) B, we have abs(C) = α1 abs(A) α1 B = D. Therefore, abs(C k ) (abs(C))k < D k , which implies limk→∞ C k = 0. This contradicts the fact that ρ(C) > 1. Corollary 8.22. If A ∈ Rn×n , then ρ(A) ρ(abs(A)). Proof. abs(A).
This follows immediately from Theorem 8.46 by taking B =
Theorem 8.47. Let A ∈ Rn×n be a matrix and x ∈ Rn be a vector such that A On,n and x 0n . If Ax > ax, it follows that a < ρ(A). Proof. Since ρ(A) 0 we assume that a > 0. Also, since Ax > ax, we have x = 0. The strict inequality Ax > ax implies that Ax 1 A, we have Bx x, so (a + )x for some > 0. Therefore, if B = a+ x Bx · · · B k x. Thus, ρ(B) 1, so a < a + < ρ(A).
Inequality (8.5) applied to matrix Ak implies ρ(Ak ) |||Ak |||. By Theorem 7.6, we have (ρ(A))k |||Ak ||| for every k ∈ N. Actually, the following statement holds. Theorem 8.48. For every A ∈ Cp×p , we have 1 k lim |||Ak |||2 = ρ(A). k→∞
1 Proof. We need to prove only that limk→∞ |||Ak |||2 k ρ(A), because the reverse inequality was already proven. Let T be an upper triangular matrix and let D = diag(d1 , . . . , dp ) be a diagonal matrix, where di = 0 for 1 i p. Note that
1 1 −1 , ,..., D = diag d1 dp d
hence (D −1 T D)ij = tij dji for 1 i, j p. If di = δi for 1 i p, where δ > 0, denote the upper triangular matrix D −1 T D by Tδ . It follows that ⎧ ⎪ t if i = j, ⎪ ⎨ ii (Tδ )ij = tij δj−i if j > i, ⎪ ⎪ ⎩0 if j < i. The matrix Tδ can now be written as Tδ = E + S, where E is a diagonal matrix such that E = diag(t11 , . . . , tnn ) and 0 if j i, sij = tij δj−i if j > i. If δ < 1, there exists a positive number c such that |||S|||2 cδ. Therefore, |||Tδ |||2 |||E|||2 + |||S|||2 = max{|tii | | 1 i n} + |||S|||2 max{|tii | | 1 i n} + cδ (by Supplement 45). By Schur’s Triangularization Theorem (Theorem 8.8), A can be factored as A = U H T U , where U is a unitary matrix and T is an
upper triangular matrix T ∈ Cp×p such that elements of T are the eigenvalues of A and each such eigenvalue occurs in the sequence of diagonal values a number of algm(A, λ) times. If we construct the matrix Tδ as above, starting from the upper triangular matrix T that results from Schur’s decomposition, we have |||Tδ |||2 ρ(A) + cδ. Since T = DTδ D −1 , it follows that A = U H DTδ D −1 U , so A = W Tδ W −1 , where W = U H D. This implies Ak = W Tδk (W −1 )k , so |||Ak |||2 |||Tδk |||2 |||W |||2 |||W −1 |||2 (|||Tδ |||2 )k |||W |||2 |||W −1 |||2 . Consequently,
|||Ak |||2^{1/k} ≤ |||Tδ |||2 (|||W |||2 |||W −1 |||2 )^{1/k} ≤ (ρ(A) + cδ)(|||W |||2 |||W −1 |||2 )^{1/k} .
1 Thus, limk→∞ |||Ak |||2 k ρ(A) + cδ, and since this equality holds 1 for any δ > 0, it follows that limk→∞ |||Ak |||2 k ρ(A). Another interesting connection exists between the matrix norm |||A|||2 and the spectral radius of the matrix A A. Theorem 8.49. Let A ∈ Cm×n . We have √ |||A|||2 = max{ λ | λ ∈ spec(AH A)}. Proof. Observe that if λ ∈ spec(AH A), then AH Ax = λx for x = 0, and therefore, xH AH Ax = λxH x, or (Ax)H (Ax) = λxH x. The last equality amounts to Ax22 = λx22 , which allows us to conclude that all eigenvalues of AH A are real and non-negative. By the definition of |||A|||2 , we have |||A|||2 = max{Ax2 | x2 = 1}. This implies that |||A|||22 is the maximum of Ax22 under the restriction x22 = 1. We apply Lagrange’s multiplier method to determine conditions that are necessary for the existence of the maximum. Let B = (bij ) = AH A. Then, consider the function n n n xi bij xj − x2i − 1 . g(x1 , . . . , xn ) = i=1 j=1
i=1
For the maximum, we need to have n
∂g ∂xi
= 0, which implies
bij xj − xi = 0,
j=1
for 1 i n. This is equivalent to Bx = x, which means that is an eigenvalue of B = AH A. The value of |||Ax|||22 is xH AH Ax = xH x = x22 = . Thus, the maximum of |||Ax|||22 for x2 = 1 is the largest eigenvalue of AH A, which completes the argument. Theorem 8.49 states that |||A|||2 =
√(ρ(AH A)),
(8.6)
which explains why |||A|||2 is also known as the spectral norm of the matrix A. It is interesting to recall that we also have
AF = trace(AH A), as we saw in Equality (6.12). Corollary 8.23. For A ∈ Rm×n , we have |||A|||22 |||A|||1 |||A|||∞ . Proof. In Theorem 8.49, we have shown that |||A|||22 is an eigenvalue of AH A. Therefore, there exists x = 0 such that AH Ax = |||A|||22 x, which implies |||AH Ax|||1 = |||A|||22 |||x|||1 . Note that |||AH Ax|||1 |||AH |||1 |||A|||1 |||x|||1 , which implies |||A|||22 |||x|||1 |||AH |||1 |||A|||1 |||x|||1 . Thus, we have |||A|||22 |||AH |||1 |||A|||1 . The desired inequality follows by observing that |||AH |||1 = |||A|||∞ . ∞ Theorem 8.50 (Weyr’s theorem). Let f (z) = m=0 cm z m be a n×n power series having the convergence radius r. If A ∈ C ∞ is such that |λ| < r for every λ ∈ spec(A), then the power series m=0 cm Am converges absolutely. If there exists λ ∈ spec(A) such that |λ| > r, m then the series ∞ m=0 cm A is divergent. Proof. Suppose that spec(A) = {λ1 , . . . , λn }. By Schur’s Triangularization Theorem (Theorem 8.8), there exists a unitary matrix
U ∈ Cn×n and an upper triangular matrix T ∈ Cn×n such that A = U H T U and the diagonal elements of T are theeigenvalues of A. p m are elements of the matrix m=0 cm T pThe diagonal m there exists an eigenvalue λi such that |λi | > r, m=0 cm λi . If ∞ m diverges, which implies that the series then the series m=0 cm T ∞ m m=0 cm A diverges. Suppose now that |λi | < r for 1 i n. Let ρ1 , . . . , ρn , ρ be n + 1 distinct numbers such that |λi | ρi ρ < r and let S an upper triangular matrix such that abs(T ) S such that S has ρ1 , . . . , ρn as its diagonal elements. Let U be a matrix such that U −1 SU = diag(ρ1 , . . . , ρn ). Then abs(T m ) U diag(ρ1 , . . . , ρn )m U −1 , so abs(cm T m ) |cm |U diag(ρ1 , . . . , ρn )m U −1 for m 0. Consequently, cm T m ∞ |cm |U diag(ρ1 , . . . , ρn )m U −1 ∞ |cm |n2 U ∞ diag(ρ1 , . . . , ρn )m ∞ U −1 ∞ C|cm |ρm , where C is a constant that does not dependon m. Therefore, the m m converges series ∞ series ∞ m=0 cm T converges, so the m=0 cm T ∞ m absolutely. This implies that the series m=0 cm A converges abso lutely. Theorem 8.51. Let f (x) be a rational function, f (z) = p(z)/q(z), where p and q belong to C[z], p(z)= p0 + p1 z + · · · + pk z k , q(z) = j q0 + q1 z + · · · + qh z h , and f (z) = ∞ j=0 cj z . ∞ j If the series k=0 cj z converges to z, such that |z| < r and n×n is a matrix such that ⊆ {z ∈ C | |z| < r}, then A∈C spec(A) j. c A f (A) = (q(A))−1 p(A) equals ∞ j=0 j Proof. By hypothesis, if |z| < r, then q(z) = 0. If spec(A) = {λ1 , . . . , λn }, then det(q(A)) = q(λ1 ) · · · q(λn ) = 0, which means that q(A) is an invertible matrix. This shows that f (A) = (q(A))−1 p(A), or q(A)f (A) = p(A). The definition of f implies h
(q0 + q1 z + · · · + qh z )
∞ k=0
cj z j = p0 + p1 z + · · · + pk z k .
If l ≤ k, we have q0 cl + q1 cl−1 + · · · + qh cl−h = pl ; otherwise, if k < l, we have q0 cl + q1 cl−1 + · · · + qh cl−h = 0. This implies
q(A) Σ_{j=0}^{∞} cj Aj = (q0 In + q1 A + · · · + qh Ah ) Σ_{j=0}^{∞} cj Aj
= Σ_{j=0}^{∞} q0 cj Aj + Σ_{j=0}^{∞} q1 cj Aj+1 + · · · + Σ_{j=0}^{∞} qh cj Aj+h
= Σ_{j=0}^{∞} q0 cj Aj + Σ_{j=0}^{∞} q1 cj−1 Aj + · · · + Σ_{j=0}^{∞} qh cj−h Aj
= Σ_{j=0}^{∞} (q0 cj + q1 cj−1 + · · · + qh cj−h )Aj = Σ_{j=0}^{k} pj Aj = p(A),
where cl = 0 if l < 0. Consequently, Σ_{j=0}^{∞} cj Aj = q(A)−1 p(A) = f (A).
Definition 8.15. Let A ∈ Cn×n be a matrix such that |λ| < r for every λ ∈ spec(A), where r is the convergence radius ofthe series ∞ j j f (z) = ∞ j=0 cj z . The matrix f (A) is defined as f (A) = j=0 cj A . Theorem 8.51 shows that if f : C −→ C is a rational func(A) ∈ Cn×n can tion, f (z) = p(z) q(z) , the definition of the matrix f ∞ j be given either as q(A)−1 p(A) or as the sum j=0 cj A , where ∞ f (A) = j=0 cj Aj . zj Example 8.10. The series ∞ j=0 j! is convergent in C and its sum is ez . Thus, following Definition 8.15, eA is defined as eA =
Σ_{j=0}^{∞} (1/j!) Aj .
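As a numerical aside (not from the text), the truncated partial sums of this series can be compared with scipy's expm, which computes eA directly:

    import numpy as np
    from scipy.linalg import expm

    A = np.array([[0.0, 1.0],
                  [-1.0, 0.0]])

    # Truncated series  sum_{j=0}^{29} A^j / j!
    S, term = np.zeros_like(A), np.eye(2)
    for j in range(1, 30):
        S += term
        term = term @ A / j
    S += term

    print(np.allclose(S, expm(A)))   # True: the partial sums converge to e^A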
Similarly, considering the series 1 1 1 1 sin z = z − z 3 + z 5 − · · · , cos z = 1 − z 2 + z 4 − · · · , 3! 5! 2! 4! 1 5 1 2 1 1 3 sinh z = z + z + z + · · · , cosh z = 1 + z + z 4 + · · · , 3! 5! 2! 4! which are convergent everywhere, we can define 1 1 1 1 sin A = A − A3 + A5 − · · · , cos A = 1 − A2 + A4 − · · · , 3! 5! 2! 4! 1 5 1 2 1 1 3 sinh A = A + A + A + · · · , cosh A = 1 + A + A4 + · · · . 3! 5! 2! 4! Therefore, we obtain the equalities eiA = cos A + i sin A and eA = cosh A + sinh A, and 1 1 sin A = (eiA − e−iA ), cos A = (eiA + e−iA ), 2 2 1 1 A sinh A = (e − e−A ), cosh A = (eA + e−A ), 2 2 n×n . for every A ∈ C Example 8.11. Let A ∈ C2×2 be a matrix such that spec(A) = 1 {− π4 , π4 }. Its characteristic polynomial is pA (λ) = λ2 − 16 . By the 1 2 Cayley–Hamilton Theorem, we have A − 16 I2 = O3,3 . Thus, ∞
sin A = Σ_{n=0}^{∞} (−1)^n A^{2n+1}/(2n + 1)! = A Σ_{n=0}^{∞} (−1)^n (π2 /16)^n /(2n + 1)! = A Σ_{n=0}^{∞} (−1)^n π^{2n} /(2^{4n} (2n + 1)!).
Theorem 8.52. Let
f (z) = Σ_{k=0}^{∞} ak z^k , g(z) = Σ_{k=0}^{∞} bk z^k , h(z) = Σ_{k=0}^{∞} ck z^k
be the functions defined by the given series that are convergent for |z| < r and let A ∈ Cn×n be a matrix such that spec(A) ⊆ {z ∈ C |
|z| < r}. If h(z) = f (z)g(z) for |z| < r, then h(A) = f (A)g(A); if h(z) = f (z) + g(z) for |z| < r, then h(A) = f (A) + g(A). Proof. We discuss only the case when h(z) = f (z)g(z). The equality h(z) = f (z)g(z) implies ck = a0 bk + a1 bk−1 + · · · + ak−1 b1 + ak bk0 for k ∈ N. The series with real non-negative coefficients, ∞ k=0 dk z , k k ∈ N is convergent for |z| < r, where dk = j=0 |aj | |bk−j | for ∞ k which implies that the series k=0 dk A is absolutely convergent, by Weyr’s Theorem (Theorem 8.50). Lemma 6.7, this In turn, by k . Thus, the series |d |A is equivalent to the convergence of ∞ k k=0 ∞ k k k=0 j=0 aj bk−j A is absolutely convergent and its terms can be permuted. Therefore, ⎛ ⎞ ∞ ∞ k ⎝ ck Ak = aj bk−j ⎠ Ak h(A) = k=0
= Σ_{j=0}^{∞} aj Σ_{k=j}^{∞} bk−j Ak = Σ_{j=0}^{∞} aj Σ_{l=0}^{∞} bl Aj+l = Σ_{j=0}^{∞} aj Aj Σ_{l=0}^{∞} bl Al = f (A)g(A).
Some properties of scalar functions are transmitted to the corresponding matrix functions; others fail to do so, as we show in the following. Example 8.12. Since ez e−z = 1, we have eA e−A = In for A ∈ Cn×n , by Theorem 8.52. However, despite the fact that ez ew = ez+w , the corresponding equality does not hold. For example, let
1 1 0 1 A= and B = , 0 1 1 0 where a = 0. By the Cayley–Hamiltion Theorem, we have A2 = 2A − I2 and B 2 = I2 . It is easy to see that An = nA − (n − 1)I2 and B n = I2 if n is even and B n = B if n is odd. Thus, the exponentials of these matrices are ∞ 1 (nA − (n − 1)I2 ) e = n! n=0 A
∞ ∞ ∞ 1 1 1 nA − nI2 + I2 = n! n! n! n=0
n=0
n=0
= (1 + e)A − (1 + e)I2 + I2 = (1 + e)A − eI2 =
1 1+e , 0 1
and 1 1 1 B + I2 + B + · · · 1! 2! 3!
1 e2 + 1 e2 − 1 = I2 cosh 1 + B sinh 1 = 2 . 2e e2 − 1 e2 + 1
eB = 1 +
Thus, eA eB = ((1 + e)A − eI2 )(I2 cosh 1 + B sinh 1)
1 e2 + 2e2 − e e3 + 2e2 + e . = 2e e2 − 1 e2 + 1 The sum of the matrices is
C =A+B =
1 2 , 1 1
and we have C 2 = 2C + I2 . We leave it to the reader to verify that eC ≠ eA eB .

8.9 Matrix Pencils and Generalized Eigenvalues
Definition 8.16. Let A, B ∈ Cn×n . The matrix pencil determined by A and B is the one-parameter set of matrices Pen(A, B) = {A − tB | t ∈ C}. The set of generalized eigenvalues of Pen(A, B) is spec(A, B) = {λ ∈ C | det(A − λB) = 0}.
If λ ∈ spec(A, B) and Ax = λBx, then x is an eigenvector of Pen(A, B). Note that in this case x ∈ null(A − λB). When B = In , then an (A, In )-generalized eigenvalue is an eigenvalue of A and any eigenvector of the pair (A, In ) is just an eigenvector of A. Theorem 8.53. Let A and B be two matrices in Cn×n . If rank(B) = n, then spec(A, B) consists of n eigenvalues (taking into account their multiplicities). Proof. Suppose that rank(B) = n, that is, B is an invertible matrix. Then, λ ∈ spec(A, B) is equivalent with the existence of an eigenvector x = 0n such that Ax = λBx, so B −1 Ax = λx. Since rank(B) = n, by Corollary 3.4, rank(B −1 A) = n and spec(A, B) con sists of n eigenvalues. Corollary 8.24. If B is a nonsingular matrix, then an (A, B)generalized eigenvalue is simply an eigenvalue of the matrix B −1 A. Proof. The corollary follows immediately from the proof of Theorem 8.53. When null(A)∩null(B) = {0n }, the existence of (A, B)-generalized eigenvalues becomes trivial. If z ∈ (null(A) ∩ null(B)) − {0}, then we have both Az = 0 and Bz = 0. Thus, (A−tB)z = 0n for every t ∈ C, so any complex number is an (A, B)-generalized eigenvalue and any non-zero vector z ∈ (null(A) ∩ null(B)) − {0} is an eigenvector of the pair (A, B). On the other hand, if
A = ( a 0 ; 0 b ) and B = ( 0 c ; 0 0 ) (rows separated by semicolons), where ab ≠ 0, the pencil Pen(A, B) has no generalized eigenvalues because det(A − λB) = ab ≠ 0.
Definition 8.17. Let A, B ∈ Cn×n be two matrices. We refer to Pen(A, B) as a regular pencil when det(A − λB) is not identically zero. A regular pencil (A, B) can have only a finite number of eigenvalues because the polynomial det(A − λB) is of degree not larger than n.
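Generalized eigenvalues of a pencil can be computed with scipy's eig(A, B). The sketch below (not from the text; the matrices are toy examples with B nonsingular) also confirms Corollary 8.24, namely that they coincide with the eigenvalues of B−1 A:

    import numpy as np
    from scipy.linalg import eig

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])
    B = np.array([[1.0, 0.0],
                  [0.0, 2.0]])   # nonsingular, so the pencil has 2 eigenvalues

    lam, X = eig(A, B)           # solves A x = lambda B x
    for l, x in zip(lam, X.T):
        print(l, np.allclose(A @ x, l * (B @ x)))

    # Same values (up to ordering) as the eigenvalues of inv(B) @ A
    print(np.sort(lam), np.sort(np.linalg.eigvals(np.linalg.inv(B) @ A)))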
We saw in Theorem 6.54 that if B is a Hermitian and positive definite matrix, then the mapping fB : Cn × Cn −→ R given by fB (x, y) = xH By for x, y ∈ Cn defines an inner product on Cn . If Pen(A, B) is a regular matrix pencil and B is Hermitian, the matrix D = B −1 A is self-adjoint with respect to this inner product. Indeed, we have (B −1 Ax, y) = (B −1 Ax)H By = xH AH (B H )−1 By = xH Ax, (x, (B −1 A)y) = xH BB −1 Ay = xH Ax, which justifies our claim. The self-adjoint matrix D = B −1 A has real eigenvalues {λ1 , . . . , λn } and a linearly independent orthonormal family of eigenvectors {z 1 , . . . , z n } (by the inner product fB ). Thus, we have B −1 Az i = λi z i and 1 if i = j, H fB (z i , z j ) = (z i ) Bz j = 0 if i = j, for 1 i j (see Exercise 22 of Chapter 7). Definition 8.18. Let Pen(A, B) be a regular matrix pencil, where A, B ∈ Cn×n . The eigenvalues of this pencil are the eigenvalues of the matrix D = B −1 A. The vectors z i are referred to as the eigenvectors of the pencil. For the eigenvectors of a regular pencil Pen(A, B), we have Az i = λi Bzi for 1 i n. The matrix Z = (z 1 · · · z n ) ∈ Cn×n is nonsingular because the set of vectors is linearly independent; we refer to Z as a principal matrix of Pen(A, B). Using this matrix, we have Z H BZ = In and Z H AZ = Z H diag(λ1 , . . . , λn )BZ = diag(λ1 , . . . , λn ). For w ∈ Cn defined by w = Z −1 x, we have xH Ax = wH Z H AZw == wH diag(λ1 , . . . , λn )w =
Σ_{i=1}^{n} λi |wi |2   (8.7)
and
xH Bx = wH Z H BZw = wH w = Σ_{i=1}^{n} |wi |2 .   (8.8)
Let λ1 λ2 · · · λn be the eigenvalues of a regular pencil Pen(A, B), where A, B ∈ Rn×n . Consider the generalized Rayleigh– Ritz quotient ralA,B : Rn −→ R defined by x Ax x Bx for x ∈ Rn such that x Bx = 0. If Z is a principal matrix of Pen(A, B), by Equalities (8.7) and (8.7) we can write n λi wi2 ralA,B (x) = i=1 n 2 . i=1 wi ralA,B (x) =
It is clear that λ1 ralA,B (x) λn . It is easy to see that we have ralA,B (x) = λ1 if and only if w2 = · · · = wn = 0, which means that x = Ze1 = z 1 . Similarly, ralA,B (x) = λn if and only if x = z n . In other words, the minimum of ralA,B (x) is achieved when x is an eigenvector that corresponds to the least eigenvalue of the pencil, while the maximum is achieved when x is an eigenvector that corresponds to the largest eigenvalue. Similar results hold for the second smallest or the second largest eigenvalues of a regular pencil. Note that x = ni=1 wi z i . Thus, if x is orthogonal on z 1 (in the sense of the inner product fB ), we have x Bz1 = w1 = 0, so n λi wi2 ralA,B (x) = i=2 n 2 i=2 wi and min{ralA,B (x) | xBz 1 = 0} = λ2 , max{ralA,B (x) | xBz n = 0} = λn−1 . It is not difficult to show that, in general, x Ax λp = min x Bz i = 0 for 1 i p − 1 x Bx and
λn−p = max { x Ax/x Bx | x Bz i = 0 for n − p + 1 ≤ i ≤ n } .
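These extremal properties can be tested numerically. The sketch below (not from the text) uses scipy's symmetric-definite solver eigh(A, B), whose eigenvectors are orthonormal with respect to the inner product fB , and checks that the generalized Rayleigh–Ritz quotient stays between the smallest and largest eigenvalues of the pencil:

    import numpy as np
    from scipy.linalg import eigh

    rng = np.random.default_rng(1)
    n = 5
    M = rng.standard_normal((n, n))
    A = (M + M.T) / 2                    # symmetric
    N = rng.standard_normal((n, n))
    B = N @ N.T + n * np.eye(n)          # symmetric positive definite

    lam, Z = eigh(A, B)                  # pencil eigenvalues (ascending), Z' B Z = I
    ral = lambda x: (x @ A @ x) / (x @ B @ x)

    for _ in range(1000):
        x = rng.standard_normal(n)
        assert lam[0] - 1e-10 <= ral(x) <= lam[-1] + 1e-10

    # The bounds are attained at the corresponding eigenvectors of the pencil.
    print(np.isclose(ral(Z[:, 0]), lam[0]), np.isclose(ral(Z[:, -1]), lam[-1]))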
Next, we present an intrinsic characterization of the eigenvalues of a regular pencil, that is, a characterization that avoids the use of eigenvectors.
Let C ∈ Rk×n be a matrix. Let us examine the variation of the generalized Rayleigh–Ritz quotient ralA,B : Rn −→ R when x Bx = 0 and Cx = 0k . In other words, we apply k linear constraints of the form r i x = 0 to x, where r 1 , . . . , r k are the rows of the matrix C. Choose k = n − 1 and let C(p) be ⎞ ⎛ r1 ⎜ . ⎟ ⎜ .. ⎟ ⎟ ⎜ ⎟ ⎜ ⎜ r p−1 ⎟ ⎟. ⎜ C(p) = ⎜ ⎟ ⎜z p+1 B ⎟ ⎜ . ⎟ ⎜ . ⎟ ⎝ . ⎠ z n B It is clear that there exists x such that C(p) x = 0n−1 because the system C(p) x = 0n−1 has n − 1 equations. Note that any solution x satisfies z k Bx = 0, which shows that x is orthogonal on the last n − p eigenvectors of Pen(A, B). Therefore, for the vectors x that satisfy these restrictions, we have min
x Ax/x Bx ≤ λp .
Consequently, λp is the maximum of the minimum of the ratio x Ax/x Bx when x satisfies p − 1 arbitrary linear restrictions.
8.10 Quadratic Forms and Quadrics
Let g : V × W −→ R be a bilinear form, where V and W are finitedimensional real linear spaces, dim(V ) = n, and dim(W ) = m. If e1 , . . . , en and f 1 , . . . , f m are two bases in V and W , respectively, then for x ∈ V and y ∈ W we can write x = x1 e1 + · · · + xn en and y = y1 f 1 + · · · + ym f m . By the definition of bilinear forms, we have g(x, y) =
Σ_{i=1}^{n} Σ_{j=1}^{m} xi yj g(ei , f j ) = x Ay,
(8.9)
where A ∈ Rn×m is the matrix A = (g(ei , f j )). This matrix will be referred to as the matrix of the bilinear form. The bilinear form f is completely defined by the matrix A as introduced above. Let V be a real vector space and let f : V × V −→ R be a bilinear form. If f (x, y) = f (y, x) for every x, y ∈ V , we say that f is a symmetric bilinear form. Theorem 8.54. A bilinear form f : Rn × Rn −→ R is symmetric if and only if its matrix is symmetric. Proof. Let A be the matrix of f . Since f is symmetric, we have x Ay = y Ax for every x, y ∈ Rn . For x = ei and y = ej , this implies aij = aji for every 1 i, j n. Thus, A = A. Conversely, if A = A, we have (x Ay) = (x A y) = y Ax, so (f (x, y)) = f (y, x). Since f (, y) ∈ R, the symmetry of f follows. Let f : V × V be a symmetric bilinear form. Define the function φ : V −→ R by φ(x) = f (x, x) for x ∈ V . We have φ(u + v) = f (u + v, u + v) = f (u, u) + 2f (u, v) + f (v, v), (8.10) φ(u − v) = f (u − v, u − v) = f (u, u) − 2f (u, v) + f (v, v), (8.11) because f is bilinear and symmetric. Therefore, we obtain the equality φ(u + v) + φ(u − v) = 2(φ(u) + φ(v)),
(8.12)
which is a generalization of the parallelogram equality from Theorem 6.32. Definition 8.19. A quadratic form defined on a real vector space V is a function φ : V −→ R for which there exists a bilinear form f : V × V −→ R such that φ(x) = f (x, x) for x ∈ V . Equality (8.10) implies f (u, v) = 12 (φ(u + v) − φ(u) − φ(v)), which shows that the bilinear form f is completely determined by its corresponding quadratic form.
Unlike bilinear forms, a quadratic form may be defined by several matrices. Theorem 8.55. Let φ be a quadratic form. We have φ(x) = x Ax = x Bx for every x ∈ Rn if and only if A + A = B + B . Proof. Suppose that φ(x) = x Ax = x Bx for every x ∈ Rn . If x = ei this amounts to aii = bii for 1 i n. If x = ei + ej , this yields aii + aij + aji + ajj = bii + bij + bji + bjj , which means that aij + aji = bij + bji for 1 i, j n. Thus, A + A = B + B . Conversely, suppose that A + A = B + B . Note that φ(x) = 1 2 (φ(x) + φ(x) ), because φ(x) ∈ R. Thus, 1 φ(x) = (x Ax + x Ax) 2
A+A x. =x 2 Thus, it is clear that either A or B defines the quadratic form.
Among the set of matrices that define a quadratic form, there is only one symmetric matrix. This is stated formally in the next theorem. Corollary 8.25. If φ : Rn −→ R is a quadratic form, then there exists a unique symmetric matrix C ∈ Rn×n such that φ(x) = x Cx for x ∈ Rn . Proof. Suppose that φ(x) = x Ax for x ∈ Rn . We saw that φ(x) = x 21 (A + A ) x and the matrix 12 (A + A ) is clearly symmetric. On the another hand, if A + A = B + B and B is symmetric, then B = 12 (A + A ).
Since a quadratic form φ can be uniquely written as φ(x) = x Cx for x ∈ Rn , properties of C can be transferred to φ. For example, if C is positive definite (semidefinite), then we say that φ is positive definite (semidefinite). Example 8.13. Let x ∈ Rn and let φ(x) = ni=1 (xi+1 − xi )2 . It is easy to verify that φ(x) = x Ax,
where A is the matrix ⎛ ⎞ 1 −1 0 0 · · · 0 ⎜0 −1 2 −1 · · · 0⎟ ⎜ ⎟ ⎜ ⎟ .. .. .. ⎟ ⎜ .. .. .. .. ⎜. . . . · · · . ⎟ . . A=⎜ ⎟ ⎜0 0 0 −0 · · · 2 −1 0 ⎟ ⎜ ⎟ ⎜ ⎟ ⎝0 0 0 −0 · · · −1 2 −1⎠ 0 0 0 −0 · · · 0 −1 1 and x ∈ Rn . The definition of φ shows that x = 0n implies φ(x) > 0. Thus, both A and φ are positive definite. Observe that the bilinear symmetric function f can be recovered from the associated quadratic form φ because f (u, v) =
1 (φ(u + v) − φ(u) − φ(v)), 2
for u, v ∈ V . Example 8.14. Let φ : R3 −→ R be the quadratic form defined by φ(x) = 4x21 + 8x22 − x23 − 6x1 x2 + 4x1 x3 + x2 x3 , for
x = (x1 , x2 , x3 ) ∈ R3 .
The symmetric matrix C that defines φ is ⎛ ⎞ 4 −3 2 ⎜ ⎟ C = ⎝−3 8 0.5⎠. 2 0.5 −1 The off-diagonal terms of C are the halves of the coefficients of the corresponding linear form coefficients; for instance, since the coefficient of the cross-product term x1 x2 is −6, we have c12 = c21 = −3.
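The symmetric matrix of a quadratic form can also be recovered as one half of its Hessian. A sympy sketch (not from the text) for Example 8.14:

    import sympy as sp

    x1, x2, x3 = sp.symbols('x1 x2 x3')
    x = sp.Matrix([x1, x2, x3])
    phi = 4*x1**2 + 8*x2**2 - x3**2 - 6*x1*x2 + 4*x1*x3 + x2*x3

    # C[i, j] = (1/2) d^2 phi / (dx_i dx_j) gives the unique symmetric matrix of phi
    C = sp.Matrix(3, 3, lambda i, j: sp.Rational(1, 2) * sp.diff(phi, x[i], x[j]))
    print(C)                                        # [[4, -3, 2], [-3, 8, 1/2], [2, 1/2, -1]]
    assert sp.expand((x.T * C * x)[0] - phi) == 0   # phi(x) = x' C x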
Theorem 8.56. A continuous function φ : V −→ R is a quadratic form if and only if it satisfies the Parallelogram Equality (8.12) for every u, v ∈ V . Proof. The necessity of the condition was shown already. Therefore, we need to show only that if a continuous φ satisfies Equality (8.12), then it is a quadratic form. Let φ be a function such that φ(u+v)+φ(u−v) = 2(φ(u)+φ(v)) for every u, v ∈ V . If u = v = 0, it follows that φ(0) = 0. Choosing u = 0 in Equality (8.12) implies φ(v) + φ(−v) = 2φ(v), so φ(−v) = φ(v) for v ∈ V , so φ is an even function. Define the symmetric function f : V × V −→ R by f (u, v) =
1 (φ(u + v) − φ(u) − φ(v)), 2
for u, v ∈ V . We need to show that f is bilinear. In view of the symmetry of f , it suffices to prove that f is linear in its first argument. The definition of f allows us to write 2f (u1 + u2 , v) = φ(u1 + u2 + v) − φ(u1 + u2 ) − φ(v), 2f (u1 , v) = φ(u1 + v) − φ(u1 ) − φ(v), 2f (u2 , v) = φ(u2 + v) − φ(u2 ) − φ(v), which implies 2(f (u1 + u2 , v) − f (u1 , v) − f (u2 , v)) = (φ(u1 + u2 + v) + φ(v)) − (φ(u1 + v) + φ(u2 + v)) −(φ(u1 + u2 ) − φ(u1 ) − φ(u2 )).
(8.13)
Since φ satisfies the parallelogram equality, we have
φ(u1 + u2 + v) + φ(v) = (1/2)(φ(u1 + u2 + 2v) + φ(u1 + u2 )),   (8.14)
φ(u1 + v) + φ(u2 + v) = (1/2)(φ(u1 + u2 + 2v) + φ(u1 − u2 )).   (8.15)
The last two equalities imply
φ(u1 + u2 + v) + φ(v) − φ(u1 + v) − φ(u2 + v)
= (1/2)(φ(u1 + u2 ) − φ(u1 − u2 ))
= −φ(u1 ) − φ(u2 ) + φ(u1 + u2 ),
(8.16)
using again the parallelogram’s equality. The last equality combined with Equality (8.13) implies f (u1 + u2 , v) = f (u1 , v) + f (u2 , v).
(8.17)
We prove now that f (au, v) = af (u, v)
(8.18)
for a ∈ R and u, v ∈ V . Choosing u1 = u and u2 = −u, we have f (u, v) = −f (−u, v), so Equality (8.18) holds for a = −1. It is easy to verify by induction on k ∈ N that Equality (8.17) implies that f (ku, v) = kf (u, v) for every k ∈ N. Let k, l be two integers, l = 0. We have lf ( kl u, v) = f (ku, v) = kf (u, v), which implies f ( kl u, v) = kl f (u, v). Thus, f (ru, v) = rf (u, v) for any rational number r. The continuity of f allows us to conclude that f (au, v) = af (u, v) for any real number a. Thus, f is a bilinear form. Since φ(x) = f (x, x), it follows that φ is indeed a quadratic form.
Let φ : V −→ R be a quadratic form and let v 1 , . . . , v n be a basis n of V, where V is an n-dimensional real linear space. Since x = i=1 xi v i , the quadratic form can be written as
φ(x) = x Cx = Σ_{i=1}^{n} Σ_{j=1}^{n} xi xj v i Cv j ,
where C is the matrix of φ. If i = j implies v i Cv j = 0, then the basis v 1 , . . . , v n is referred to as a canonical basis of φ. Relative to this basis we can write n ci x2i , φ(x) = i=1
where ci = v i Cv i .
Definition 8.20. A quadric in Rn is a set of the form Q = {x ∈ Rn | x Ax + 2b x + c = 0}, where A ∈ Rn×n is a symmetric matrix, b ∈ Rn , and c ∈ R. We will refer both to a quadric Q and to the equation that describes it as a quadric. The extended matrix of the quadric Q is the symmetric matrix Q ∈ R(n+1)×(n+1) given by ⎞ ⎛ a11 · · · a1n b1 ⎜ .
.. ⎟ ⎟ ⎜ .. · · · ... A b . ⎟= . Q=⎜ ⎟ ⎜ b c ⎝an1 · · · ann bn ⎠ b1 · · · bn c Consider the quadric x Ax + 2b x + c = 0,
(8.19)
where A ∈ Rn×n is a symmetric matrix, b ∈ Rn , and c ∈ R. By the Spectral Theorem for Hermitian Matrices (applied, in this case, to real, symmetric matrices) there exists an orthogonal matrix U such that A = U DU and D is a diagonal matrix having the eigenvalues of A as its diagonal elements. Consider an isometry defined by x = U y + r. Equation (8.19) can be written as (y U + r )A(U y + r) + 2b (U y + r) + c = 0. An equivalent form of this equality is y U AU y + r AU y + y U Ar + r Ar + 2b U y + 2b r + c = y U AU y + 2(r A + b )U y + r Ar + 2b r + c = 0. We used here the fact that both r AU y and y U Ar are scalars and, therefore, they coincide with their transposed form. Definition 8.21. Let x Ax + 2b x + c = 0 be a quadric, x = U y + r ˜ + be an isometry (where U is an orthogonal matrix), and let y Ay ˜ 2b y + c˜ = 0 be the transformed equation under the isometry. A function Φ : Rn×n × Rn × R −→ R is an isometric invariant if for any quadric we have ˜ c˜). ˜ b, Φ(A, b, c) = Φ(A,
Observe that A˜ = U AU, ˜ = (b + Ar)U, b c˜ = r Ar + 2b r + c. Since A and A˜ are similar matrices, they have the same characteristic polynomial, which implies that any function of the form Φ(A, b, c) = ai , where ai is the i th coefficient of the characteristic polynomial of A, is an isometric invariant. Theorem 8.57. Let
Q = ( A b ; b c )
be the extended matrix of the quadric x Ax + 2b x + c = 0. Then det(Q) is an isometric invariant.
Proof.
We have to show the equality
A˜ ˜b A b = det det(Q) = det ˜ c˜ b c b
(b + Ar)U U AU . = det U (b + r A) r Ar + 2b r + c
Taking into account the result shown in Supplement 33 of Chapter 5, it follows that
(b + Ar)U U AU det U (b + r A) r Ar + 2b r + c
AU (b + Ar)U = det (b + r A) r Ar + 2b r + c
A (b + Ar) , = det (b + r A) r Ar + 2b r + c
because U is an orthogonal matrix. Further elementary transformation yields the following equalities:
A b + Ar A b A (b + Ar) = det = det , det b b r+c b c (b + r A) r Ar + 2b r + c ˜ = det(Q). which show that det(Q)
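Theorem 8.57 can also be verified numerically. The sketch below (not from the text) builds a random quadric, applies a random isometry x = Uy + r, and compares det(Q) before and after the transformation:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 3
    M = rng.standard_normal((n, n))
    A = (M + M.T) / 2                       # symmetric matrix of the quadric
    b = rng.standard_normal(n)
    c = rng.standard_normal()

    U, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix
    r = rng.standard_normal(n)

    # Transformed quadric under x = U y + r
    A_t = U.T @ A @ U
    b_t = U.T @ (b + A @ r)
    c_t = r @ A @ r + 2 * b @ r + c

    Q   = np.block([[A,   b[:, None]],   [b[None, :],   np.array([[c]])]])
    Q_t = np.block([[A_t, b_t[:, None]], [b_t[None, :], np.array([[c_t]])]])
    print(np.isclose(np.linalg.det(Q), np.linalg.det(Q_t)))   # True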
If the equation r A + b = 0 has a solution in r, we refer to r as a center of the quadric. A quadric may have a unique center, a set of centers, or no center, depending on the number of solutions of the equation r A + b = 0 , which is equivalent to Ar + b = 0, since A is a symmetric matrix. If A is an invertible matrix, then Q has a unique center r0 = −A−1 b. Choosing r = r 0 , the quadric is defined by the equation y Dy − r 0 Ar 0 + c = 0,
(8.20)
because r 0 Ar 0 + 2b r 0 = −r0 Ar 0 . Note that if a center r 0 exists, then y ∈ Q if and only if −y ∈ Q. Equivalently, r 0 + U x ∈ Q if and only if r 0 − U x ∈ Q and the middle of the line segment determined by r 0 + U x and r 0 − U x is r 0 . Thus, r 0 is a center of symmetry for the quadric. The set of orthonormal columns of the matrix U for which A = U DU (or D = U AU , where D is a diagonal matrix) are called the principal axes of the quadric Q. Quadrics are classified based on the non-nullity and the sign of the eigenvalues of the symmetric matrix A. Thus, we can use for the classification either the diagonal matrix D whose entries are the eigenvalues of A or, equivalently, the inertia of the matrix A. Let x Ax + 2b x + c = 0 be a quadric having a unique center. Since A is a nonsingular matrix, there are no null eigenvalues of A, so I(A) = (n+ (A), n− (A), 0). There are n + 1 possible classes since we can have 0 n+ n and n+ + n− = n. When n+ = n, the quadric is an n-dimensional ellipsoid. The equality y Dy − r 0 Ar 0 + c = 0
can be written as
Σ_{i=1}^{n} λi yi^2 − r 0 Ar 0 + c = 0.
If r 0 Ar 0 − c > 0, the last equality can be written as
Σ_{i=1}^{n} yi^2 /ai^2 = 1,
for 1 i n and the quadric is a real ellipsoid where a2i = r Ar−c λi having a1 , . . . , an as its semiaxes. At the other extreme, suppose that n− = n, which means that spec(A) consists of negative numbers. If k = r Ar − c < 0, then Q = ∅. If k > 0, by multiplying by −1 we revert back to the previous case. Example 8.15. In this example, we consider the case of quadrics x Ax + 2b x + c = 0 in R3 that have a unique center, that is, for which det(A) = 0. If n+ = 3, we have an ellipsoid, as we saw before. For n+ < 3, the classification of the quadric depends on det(Q). If det(Q) = 0, n+ = 2, and n− = 1, we have a hyperboloid of one sheet, while in the case when det(Q) = 0, n+ = 1, and n− = 2, we have a hyperboloid of two sheets. If det(Q) = 0, we have a cone. After transforming the equation of the quadric x Ax+2b x+c = 0 to the canonical form given by Equality (8.20): y Dy − k = 0, where k = r 0 Ar 0 − c. Note that ⎛ λ1 0 0 ⎜0 λ 0 2 ⎜ det(Q) = det ⎜ ⎝ 0 0 λ3 0 0 0
⎞ 0 0⎟ ⎟ ⎟ = λ1 λ2 λ3 k. 0⎠ k
Now, the class of the quadric can be easily established, based on the possible combinations of n+ , n− , and det(Q) discussed in this example.
det(Q) = 0 =0 O. Observe that its spectral radius ρ(A) is positive because ρ(A) = 0 implies that spec(A) = {0}, which implies that A is nilpotent by Corollary 8.6. Since A > O, this is impossible, so ρ(A) > 0. Theorem 8.58 (Perron–Frobenius theorem). Let A ∈ Rn×n be a symmetric matrix with positive elements and let λ be its largest eigenvalue. The following statements hold: (i) λ is a positive number; (ii) there exists an eigenvector x that corresponds to λ such that x > 0n ; (iii) geomm(A, λ) = 1; (iv) if θ is any other eigenvalue of A, then |θ| < λ. Proof. Since the eigenvalues of A are real and their sum equals trace(A) > 0, it follows that its largest value λ is positive, which proves Part (i). Let u be a real unit eigenvector that belongs to λ, so nj=1 aij uj = λui . We have Au = λu, so u Au = λu u = λu22 = λ. This allows us to write λ=
Σ_{i=1}^{n} Σ_{j=1}^{n} aij ui uj ,
so
λ = | Σ_{i=1}^{n} Σ_{j=1}^{n} aij ui uj | ≤ Σ_{i=1}^{n} Σ_{j=1}^{n} aij |ui | |uj |,
because λ > 0. We claim that the vector x = abs(u) is an eigenvector that corresponds to λ and has positive components. Note that n n n n aij ui uj aij xi xj . λ= i=1 i=1
i=1 i=1
By Corollary 8.14, since λ is the largest eigenvalue of A, we have n n a xi xj λ; the equality takes place only if x is an eigenij i=1 i=1 vector of λ.
nSince x 0, if xi = 0 for some i, then the equality λxi = j=1 aij xj implies that xj = 0 for 1 j n, because all numbers aij are positive. This means that x = 0, which is impossible because x is an eigenvector. Thus, we conclude that all components of x are positive. This completes the proof of Part (ii). For Part (iii), suppose that the geometric multiplicity of λ is greater than 1 and let u and v be two real unit vectors of the invariant subspace SA,λ such that u ⊥ v. Note that the vectors abs(u) and abs(v) are also eigenvectors corresponding to λ. n a u and Suppose that ui < 0 for some i. Since λui = n n j=1 ij j λ|ui | = j=1 aij |uj |, we have λ(ui + |ui |) = 0 = j=1 aij (uj + |uj |), and this implies uj + |uj | = 0 for 1 j n. In other words, we have j, or uj = −|uj | < 0 for every j. The either uj = |uj | > 0 for every same applies to v, so v u = ni=1 vi ui = 0, which contradicts the orthogonality of u and v. Thus, geomm(A, λ) = 1. For Part (iv), let w be a unit eigenvector that corresponds to n the eigenvalue θ, so i=1 aij wj = θwi for 1 i n. Again, by Corollary 8.14, n n n n aij |wi ||wj | aij wi wj = |θ|. λ i=1 j=1
i=1 j=1
If θ = −λ, these inequalities show that |wj | = xj for all j and, thereλxi = fore, there exists i such that n wi = xi . Adding the equalities n n a x and −λw = a w yields 0 = a (x w i j=0 ij j j=0 ij j j=1 ij + j ) aii (xi + wi ), which contradicts the fact that aii > 0 and wi = xi > 0. Thus, θ = −λ. Definition 8.22. Let A ∈ Rn×n be a symmetric matrix with positive elements. The number ρ(A) is the Perron number of A; the positive vector x with x2 = 1 that corresponds to the eigenvalue ρ(A) is the Perron vector of A. The Perron–Frobenius Theorem states that the spectral radius ρ(A) of a positive matrix A is always an eigenvalue of A. This property holds also for non-negative matrices. Lemma 8.7. Let A ∈ Rn×n be a non-negative matrix that is irreducible, where n > 1. Its spectral radius ρ(A) is an eigenvalue and there exists a positive eigenvector that corresponds to
this eigenvalue. No non-negative eigenvector corresponds to any other eigenvalue of A.

Proof. Since A is an irreducible matrix, the matrix (I_n + A)^{n−1} is positive by Supplement 131 of Chapter 3. This implies ((I_n + A)^{n−1})⊤ = (I_n + A⊤)^{n−1} > O_{n,n}, so by the Perron–Frobenius theorem, there exists a positive vector y such that (I + A⊤)^{n−1} y = ρ((I + A⊤)^{n−1}) y. Equivalently, we have

y⊤(I + A)^{n−1} = ρ((I + A)^{n−1}) y⊤.   (8.21)
Let λ ∈ spec(A) be an eigenvalue of A such that |λ| = ρ(A) and let x be an eigenvector associated to λ. Since λx = Ax, it follows that |λ| abs(x) = abs(Ax) ≤ abs(A) abs(x), so ρ(A) abs(x) ≤ A abs(x) because A is a non-negative matrix. It is immediate that ρ(A)^p abs(x) ≤ A^p abs(x) for p ∈ N. In turn, this implies (1 + ρ(A))^{n−1} abs(x) ≤ (I_n + A)^{n−1} abs(x), so, by Equality (8.21), we have

(1 + ρ(A))^{n−1} (y⊤ abs(x)) ≤ y⊤(I_n + A)^{n−1} abs(x) = ρ((I + A)^{n−1}) y⊤ abs(x).

Since y > 0, we have y⊤ abs(x) > 0 and, therefore,

(1 + ρ(A))^{n−1} ≤ ρ((I + A)^{n−1}).   (8.22)
By Corollary 8.7, the eigenvalues of (I + A)^{n−1} have the form (1 + λ)^{n−1}, where λ ∈ spec(A). Therefore, there exists an eigenvalue μ of A such that |(1 + μ)^{n−1}| = ρ((I + A)^{n−1}). Since |μ| ≤ ρ(A), Inequality (8.22) implies

(1 + |μ|)^{n−1} ≤ (1 + ρ(A))^{n−1} ≤ ρ((I + A)^{n−1}) = |(1 + μ)^{n−1}|,

so 1 + |μ| ≤ |1 + μ|. This implies that μ is a positive real number and, therefore, μ = ρ(A). Consequently, (ρ(A))^k abs(x) ≤ A^k abs(x) for k ≥ 1. In particular, for k = 1, we have ρ(A) abs(x) ≤ A abs(x) = μ abs(x).
Since (I + A)^{n−1} abs(x) = (1 + μ)^{n−1} abs(x) = ρ((I + A)^{n−1}) abs(x), and abs(x) > 0 (by Theorem 8.58), it follows that there is only one linearly independent eigenvector associated with μ. Indeed, suppose that we had two linearly independent eigenvectors, u and v, associated with the eigenvalue μ. Since v ≠ 0, there exists v_i ≠ 0, so the vector w = u − (u_i/v_i) v is also an eigenvector of the eigenvalue μ because w ≠ 0. But w_i = 0, which contradicts the fact that abs(x) > 0 for any eigenvector x of μ.
Since A ≠ O_{n,n}, we have ρ(A) > 0. Moreover, ρ(A) is a simple eigenvalue of A. Indeed, there is only one linearly independent eigenvector u of A associated with ρ(A), and u > 0. Since A⊤ is also irreducible (by Exercise 128 of Chapter 3), there exists only one linearly independent eigenvector v of A⊤ associated to ρ(A⊤), and v > 0. Since v⊤u > 0, it follows that ρ(A) is a simple eigenvalue of A by Theorem 8.44.
Suppose that z is an eigenvector of an eigenvalue ζ of A and z > 0, where ζ ≠ ρ(A). We have shown that A⊤ has an eigenvector w > 0 such that A⊤w = ρ(A)w. This implies w⊤Az = ζ w⊤z. Since w⊤Az = (A⊤w)⊤z = ρ(A) w⊤z, we have ζ w⊤z = ρ(A) w⊤z. Since w⊤z > 0, it follows that ζ = ρ(A), which is a contradiction. Thus, no eigenvalue other than ρ(A) has a positive eigenvector.

Now we prove the existence of a non-negative eigenvector for ρ(A) without the assumption of irreducibility of A made in Lemma 8.7.

Theorem 8.59. Let A ∈ R^{n×n} be a non-negative matrix such that n > 1. Its spectral radius ρ(A) is an eigenvalue of A and there exists a non-negative eigenvector that corresponds to this eigenvalue.

Proof. Suppose that red(A) = k. As we saw in the proof of Supplement 127 of Chapter 3, there exists a permutation matrix P such that

P⊤AP = [ B_{11}  B_{12}  ⋯  B_{1k} ;
          O      B_{22}  ⋯  B_{2k} ;
          ⋮      ⋮       ⋱  ⋮     ;
          O      O       ⋯  B_{kk} ],
where the matrices B_{11}, …, B_{kk} are irreducible. By Theorem 7.15, spec(A) = ⋃_{i=1}^k spec(B_{ii}). We have ρ(A) = ρ(P⊤AP) = max{ρ(B_{ii}) | 1 ≤ i ≤ k}. Let j be the smallest number such that ρ(A) = ρ(B_{jj}). The matrix P⊤AP can now be written as

P⊤AP = [ C  E      F ;
         O  B_{jj}  G ;
         O  O      D ],

where C ∈ C^{p×p} and D ∈ C^{q×q} are upper triangular matrices, or are missing when B_{jj} is the first diagonal block or the last diagonal block. Since B_{jj} is irreducible, Lemma 8.7 implies that there exists y_j such that B_{jj} y_j = ρ(A) y_j. If j = 1, that is, if B_{jj} is the first diagonal block, then

( y_1 )
( 0   )

is a non-negative eigenvector of P⊤AP. If j > 1, there exists z ≥ 0 such that

( z   )
( y_j )
( 0   )

is an eigenvector of P⊤AP. Since

P⊤AP ( z ; y_j ; 0 ) = ( Cz + E y_j ; B_{jj} y_j ; 0 ),

it suffices to find z ≥ 0 such that Cz + E y_j = ρ(A) z. For the matrix Z = (1/ρ(A)) C we have ρ(C) < ρ(A), which means that ρ(Z) < 1. Thus, I + Z + Z² + ⋯ converges to (I − Z)^{−1}. Since Z ≥ O, the terms of the series I + Z + Z² + ⋯ are non-negative matrices, so (I − Z)^{−1} is a non-negative matrix. The desired vector z is

z = (1/ρ(A)) (I − Z)^{−1} E y_j

and z ≥ 0.
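The following MATLAB fragment is a minimal numerical illustration of these results (a hedged sketch; the positive matrix A below is an arbitrary example, not one taken from the text). The Perron root is the eigenvalue of largest modulus returned by eig, and the corresponding eigenvector can be normalized to be positive.

A = [2 1 1; 1 3 2; 1 1 4];       % arbitrary matrix with positive entries
[V, D] = eig(A);
[rho, j] = max(abs(diag(D)));    % spectral radius = Perron root
rho
x = real(V(:, j));               % eigenvector for the Perron root
x = x / norm(x);
x = sign(sum(x)) * x             % Perron vector: all components positive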
8.12 Spectra of Positive Semidefinite Matrices
In Theorem 7.12, we saw that the eigenvalues of Hermitian matrices are real numbers. The next theorem links these eigenvalues to positive definiteness.

Theorem 8.60. Let A ∈ C^{n×n} be a Hermitian matrix. If A is positive semidefinite, then all its eigenvalues are non-negative; if A is positive definite, then its eigenvalues are positive.

Proof. Since A is Hermitian, all its eigenvalues are real numbers. Suppose that A is positive semidefinite, that is, x^H A x ≥ 0 for x ∈ C^n. If λ ∈ spec(A), then Av = λv for some eigenvector v ≠ 0. The positive semidefiniteness of A implies v^H A v = λ v^H v = λ‖v‖₂² ≥ 0, which implies λ ≥ 0. It is easy to see that if A is positive definite, then λ > 0.

Theorem 8.61. Let A ∈ C^{n×n} be a Hermitian matrix. If A is positive semidefinite, then all its principal minors are non-negative real numbers. If A is positive definite, then all its principal minors are positive real numbers.

Proof. Since A is positive semidefinite, every sub-matrix A(i_1, …, i_k ; i_1, …, i_k) is a Hermitian positive semidefinite matrix by Theorem 6.51, so every principal minor is a non-negative real number. The second part of the theorem is proven similarly.

Corollary 8.26. Let A ∈ C^{n×n} be a Hermitian matrix. The following statements are equivalent.
(i) A is positive semidefinite;
(ii) all eigenvalues of A are non-negative numbers;
(iii) there exists a Hermitian matrix C ∈ C^{n×n} such that C² = A;
(iv) A is the Gram matrix of a sequence of vectors, that is, A = B^H B for some B ∈ C^{n×n}.

Proof. (i) implies (ii): This was shown in Theorem 8.60.
(ii) implies (iii): Suppose that A is a matrix such that all its eigenvalues are the non-negative numbers λ₁, …, λ_n. By Theorem 8.14,
A can be written as A = U^H D U, where U is a unitary matrix and

D = diag(λ₁, λ₂, …, λ_n).

Define the matrix √D as

√D = diag(√λ₁, √λ₂, …, √λ_n).

Clearly, we have (√D)² = D. Now we can write A = (U^H √D U)(U^H √D U), which allows us to define the desired matrix C as C = U^H √D U.
(iii) implies (iv): Since C is itself a Hermitian matrix, this implication is obvious.
(iv) implies (i): Suppose that A = B^H B for some matrix B ∈ C^{n×k}. Then, for x ∈ C^n, we have x^H A x = x^H B^H B x = (Bx)^H (Bx) = ‖Bx‖₂² ≥ 0, so A is positive semidefinite.
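A quick numerical illustration of Corollary 8.26 (a minimal MATLAB sketch; the matrix B below is an arbitrary example, not one from the text): a Gram matrix B'*B has non-negative eigenvalues, and a Hermitian square root can be built from its spectral decomposition.

B = [1 2 0; 0 1 1; 1 0 1];      % arbitrary matrix
A = B' * B;                     % Gram matrix, hence positive semidefinite
eig(A)                          % all eigenvalues are non-negative
[U, D] = eig(A);                % spectral decomposition A = U*D*U'
C = U * sqrt(D) * U';           % Hermitian square root of A
norm(C*C - A)                   % close to 0 up to rounding errors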
8.13 MATLAB Computations
The function eigalgmult computes the eigenvalues of a matrix and their algebraic multiplicities:

function [eigvalues, repeats] = eigalgmult(A)
tol = sqrt(eps);
v = sort(eig(A));
v = round(v/tol) * tol;           % round eigenvalues so repeated values coincide
eigvalues = flipud(unique(v));
n = length(v);
d = length(eigvalues);
B = v * ones(1, d);
C = ones(n,1) * eigvalues';
% the last two lines reconstruct the tail of the function, which was truncated
% in this copy: m(i,j) is true when v(i) equals eigvalues(j)
m = abs(C - B) < tol;
repeats = sum(m)';

For example:

>> A = [3 1 1; 1 3 1; 1 1 3]
A =
     3     1     1
     1     3     1
     1     1     3
>> [e, mult] = eigalgmult(A)
e =
     5
     2
mult =
     1
     2
The function eiggeommult(A) computes eigenvectors of A and their geometric multiplicity. Its outputs are two matrices V and G. The diagonal matrix G has on its diagonal the eigenvalues of A. Each value is repeated a number of times equal to the dimensionality of its invariant space, which equals its geometric multiplicity. The matrix V contains eigenvectors corresponding to these eigenvalues.

function [V, G] = eiggeommult(A)
[m, n] = size(A);
[eigvalues, r] = eigalgmult(A);
V = [];
d = [];
for k = 1 : length(eigvalues)
    v = nullspbasis(A - eigvalues(k)*eye(n));
    [ms, ns] = size(v);
    V = [V v];
    temp = ones(ns, 1) * eigvalues(k);
    d = [d; temp];
end
G = diag(d);
Example 8.18. The matrix

A = [ 2 1 0 ;
      0 2 1 ;
      0 0 2 ]

has one eigenvalue 2 with algebraic multiplicity 3 and geometric multiplicity 1, as follows from the following MATLAB computation:

>> A = [2 1 0; 0 2 1; 0 0 2]
A =
     2     1     0
     0     2     1
     0     0     2
>> [V, G] = eiggeommult(A)
V =
     1
     0
     0
G =
     2
On the other hand, the next matrix has two non-defective eigenvalues, 5 and 2, as follows from the next computation:

>> A = [3 1 1; 1 3 1; 1 1 3]
A =
     3     1     1
     1     3     1
     1     1     3
>> [V, G] = eiggeommult(A)
V =
     1    -1    -1
     1     1     0
     1     0     1
G =
     5     0     0
     0     2     0
     0     0     2
Let A and B be two square matrices, A, B ∈ Cn×n . To compute the generalized eigenvalues of A and B, we can use eig(A,B), which returns a vector containing these values. Also, [V,D] = eig(A,B) yields a diagonal matrix D of generalized eigenvalues and a matrix V ∈ Cn×n whose columns are the corresponding eigenvectors so that AV = BV D. The exponential of a matrix can be computed using the function expm. Example 8.19. Let A and B be the matrices considered in Example 8.12,
A = [ 1 1 ; 0 1 ],   B = [ 0 1 ; 1 0 ],   and   C = A + B = [ 1 2 ; 1 1 ].

The exponentials of these matrices obtained with the function expm are

>> expm(A)
ans =
    2.7183    2.7183
         0    2.7183
>> expm(B)
ans =
    1.5431    1.1752
    1.1752    1.5431
>> expm(C)
ans =
    5.9209    7.4388
    3.7194    5.9209

Furthermore, the product e^A e^B computed as

>> expm(A)*expm(B)
ans =
    7.3891    7.3891
    3.1945    4.1945

is clearly distinct from e^C.
Similar functions in MATLAB exist for computing the principal logarithm log A of a matrix A, namely logm, and for computing the principal square root of a matrix, sqrtm(A), namely, the unique square root for which every eigenvalue has a non-negative real part.
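As a small hedged illustration (the matrix M below is an arbitrary symmetric positive definite example, not one from the text), sqrtm and logm invert squaring and expm up to rounding error:

M = [4 1; 1 3];        % arbitrary symmetric positive definite matrix
S = sqrtm(M);          % principal square root: S*S is close to M
norm(S*S - M)
L = logm(M);           % principal logarithm: expm(L) is close to M
norm(expm(L) - M)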
Exercises and Supplements

(1) Let A, B ∈ C^{n×n} be two matrices. Prove that if B is nonsingular, then AB^{−1} ∼ B^{−1}A.
(2) Prove that if A ∈ C^{m×n} and B ∈ C^{n×m}, then the matrices

C = [ AB  O_{m,n} ; B  O_{n,n} ]   and   D = [ O_{m,m}  O_{m,n} ; B  BA ]

are similar.
(3) Let A ∈ C^{m×m} and B ∈ C^{n×n} be two matrices. Prove that S_{A,B}(X) = (S_{−B^H,−A^H}(X^H))^H for X ∈ C^{m×n}.
(4) The Lie bracket of C^{n×n} is the mapping [·,·] : C^{n×n} × C^{n×n} → C^{n×n} given by [X, Y] = S_{X,X}(Y). Prove that [X, Y] = −[Y, X], [X, X] = O_{n,n} and [X, [Y, Z]] + [Y, [Z, X]] + [Z, [X, Y]] = O_{n,n} for any matrices X, Y, Z ∈ C^{n×n}.
(5) Let A ∈ R^{n×n} be a symmetric matrix having the eigenvalues λ₁ ≥ ⋯ ≥ λ_n. Prove that

|||A − ((λ₁ + λ_n)/2) I_n|||₂ = (λ₁ − λ_n)/2.

(6) Let A ∈ R^{n×n} be a matrix and let c ∈ R. Prove that for x, y ∈ R^n − {0_n}, we have ral_A(x) − ral_A(y) = ral_B(x) − ral_B(y), where B = A + cI_n.
(7) Let A ∈ R^{n×n} be a symmetric matrix having the eigenvalues λ₁ ≥ ⋯ ≥ λ_n and let x and y be two vectors in R^n − {0_n}.
Prove that |ral_A(x) − ral_A(y)| ≤ (λ₁ − λ_n) sin ∠(x, y).
Solution: Assume that ‖x‖ = ‖y‖ = 1 and let B = A − ((λ₁ + λ_n)/2) I_n. By Exercise 8.13, we have

|ral_A(x) − ral_A(y)| = |ral_B(x) − ral_B(y)| = |x⊤Bx − y⊤By| = |(B(x − y))⊤(x + y)|.

By the Cauchy–Schwarz Inequality, we have

|(B(x − y))⊤(x + y)| ≤ |||B|||₂ ‖x − y‖₂ ‖x + y‖₂ = (λ₁ − λ_n) sin ∠(x, y).
(8) Prove that if A is a unitary matrix and −1 ∉ spec(A), then there exists a skew-Hermitian matrix S such that A = (I_n − S)(I_n + S)^{−1}.
(9) Let f : R^{n×n} → R be a function such that f(AB) = f(BA) for A, B ∈ R^{n×n}. Prove that if A ∼ B, then f(A) = f(B).
(10) Let B_r(λ, a) ∈ C^{r×r} be the matrix defined by

B_r(λ, a) = [ λ a 0 ⋯ 0 ;
              0 λ a ⋯ 0 ;
              ⋮ ⋮ ⋱ ⋱ ⋮ ;
              0 0 0 ⋯ a ;
              0 0 0 ⋯ λ ] ∈ C^{r×r}.

Note that B_r(λ, 1) is a Jordan block. Let a, b ∈ C − {0} be two nonzero numbers. Prove that B_n(λ, a) ∼ B_n(λ, b).
(11) Let λ be a complex number. Prove that the kth power of B_r(λ) is given by

(B_r(λ))^k = [ λ^k  C(k,1)λ^{k−1}  C(k,2)λ^{k−2}  ⋯  C(k,r−1)λ^{k−r+1} ;
               0    λ^k            C(k,1)λ^{k−1}  ⋯  C(k,r−2)λ^{k−r+2} ;
               ⋮    ⋮              ⋮              ⋱  ⋮ ;
               0    0              0              ⋯  λ^k ],

where C(k, j) denotes the binomial coefficient "k choose j".
(12) Prove that if |λ| < 1, then lim_{k→∞} B_r(λ)^k = O.
(13) Let A ∈ R^{n×n} be a symmetric real matrix. Prove that

∇ral_A(x) = (2/(x⊤x)) (Ax − ral_A(x) x).
Also, show that the eigenvectors of A are the stationary points of the function ral_A(x).
(14) Let A, B ∈ C^{n×n} be two Hermitian matrices. Prove that AB is a Hermitian matrix if and only if AB = BA.
(15) Let A ∈ R^{3×3} be a symmetric matrix. Prove that if trace(A) ≠ 0, the sum of principal minors of order 2 equals 0, and det(A) = 0, then rank(A) = 1.
Solution: The characteristic polynomial of A is p_A(λ) = λ³ − trace(A)λ². Thus, spec(A) = {trace(A), 0}, where algm(A, 0) = 2, so rank(A) = 1.
(16) Let A ∈ R^{3×3} be a symmetric matrix. Prove that if the sum of principal minors of order 2 does not equal 0 but det(A) = 0, then rank(A) = 2.
(17) Let A ∈ C^{n×n} be a Hermitian matrix, u ∈ C^n be a vector, and let a be a complex number. Define the Hermitian matrix B as
B = [ A  u ; u^H  a ].

Let α₁ ≤ ⋯ ≤ α_n be the eigenvalues of A and let β₁ ≤ ⋯ ≤ β_n ≤ β_{n+1} be the eigenvalues of B. Prove that

β₁ ≤ α₁ ≤ β₂ ≤ ⋯ ≤ β_n ≤ α_n ≤ β_{n+1}.

Solution: Since B ∈ C^{(n+1)×(n+1)}, by the Courant–Fisher theorem we have

β_{k+1} = min_W max_x {x^H B x | ‖x‖₂ = 1 and x ∈ W^⊥}
        = max_Z min_x {x^H B x | ‖x‖₂ = 1 and x ∈ Z^⊥},

where W ranges over sets of k non-zero arbitrary vectors in C^{n+1} and Z ranges over sets of n − k non-zero arbitrary vectors in C^{n+1}.
Let U be a set of k non-zero vectors in C^n and let Y be a set of n − k − 1 vectors in C^n. Define the subsets W_U and Z_Y of C^{n+1} as

W_U = { (u ; 0) | u ∈ U }   and   Z_Y = { (y ; 0) | y ∈ Y } ∪ {e_{n+1}}.

By restricting the sets W and Z to sets of the form W_U and Z_Y, we obtain the double inequality

max_{Z_Y} min_x {x^H B x | ‖x‖₂ = 1 and x ∈ Z_Y^⊥} ≤ β_{k+1} ≤ min_{W_U} max_x {x^H B x | ‖x‖₂ = 1 and x ∈ W_U^⊥}.

Note that, if x ∈ Z_Y^⊥, then we have x ⊥ e_{n+1}, so x_{n+1} = 0. Therefore, writing x = (y ; 0),

x^H B x = (y^H  0) [ A  u ; u^H  a ] (y ; 0) = y^H A y.

Consequently,

max_{Z_Y} min_x {x^H B x | ‖x‖₂ = 1, x ∈ Z_Y^⊥} = max_Y min_y {y^H A y | ‖y‖₂ = 1, y ∈ Y^⊥} = α_k.

This allows us to conclude that α_k ≤ β_{k+1} for 1 ≤ k ≤ n. On the other hand, if x ∈ W_U^⊥ and x = (u ; 0), then x^H B x = u^H A u and ‖x‖₂ = ‖u‖₂. Now we can write

min_{W_U} max_x {x^H B x | ‖x‖₂ = 1, x ∈ W_U^⊥} = min_U max_u {u^H A u | ‖u‖₂ = 1, u ∈ U^⊥} = α_{k+1},

so β_{k+1} ≤ α_{k+1} for 1 ≤ k ≤ n − 1.
(18) Let A, B ∈ C^{n×n} be two matrices such that AB = BA. Prove that A and B have a common eigenvector.
Solution: Let λ ∈ spec(A) and let {x₁, …, x_k} be a basis for null(A − λI_n). Observe that the matrices A − λI_n and B commute because (A − λI_n)B = AB − λB and B(A − λI_n) = BA − λB. Therefore, we have (A − λI_n)Bx_i = B(A − λI_n)x_i = 0, so (A − λI_n)BX = O_{n,n}, where X = (x₁, …, x_k). Consequently, ABX = λBX. Let y₁, …, y_m be the columns of the matrix BX. The last equality implies that Ay_i = λy_i, so y_i ∈ null(A − λI_n). Since X is a basis of null(A − λI_n), it follows that each y_i is a linear combination of the columns of X, so there exists a matrix P such that (y₁ ⋯ y_m) = (x₁ ⋯ x_k)P, which is equivalent to BX = XP. Let w be an eigenvector of P, so Pw = μw. Consequently, BXw = XPw = μXw, which proves that Xw is an eigenvector of B. Also, A(Xw) = λ(Xw) because AX = λX, so Xw is also an eigenvector of A.
(19) Let A ∈ C^{n×n} be a matrix and let spec(A) = {λ₁, …, λ_n}. Prove that
(a) ∑_{p=1}^n |λ_p|² ≤ ‖A‖²_F;
(b) the equality ∑_{p=1}^n |λ_p|² = ‖A‖²_F holds if and only if A is normal.
Solution: By Schur's Triangularization Theorem (Theorem 8.8), there exists a unitary matrix U ∈ C^{n×n} and an upper triangular matrix T ∈ C^{n×n} such that A = U^H T U and the diagonal elements of T are the eigenvalues of A. Thus,

‖A‖²_F = ‖T‖²_F = ∑_{p=1}^n |λ_p|² + ∑_{i<j} |t_{ij}|²,
i ≤ k, we have W = λ₁u₁u₁⊤ + ⋯ + λ_r u_r u_r⊤. Starting from the r matrices u_i u_i⊤, we can form C(r, k) matrices of rank k of the form ∑_{i∈I} u_i u_i⊤ by considering all subsets I of {1, …, r} that contain k elements. We have

W = ∑_{j=1}^r λ_j u_j u_j⊤ = ∑_{I, |I|=k} α_I ∑_{i∈I} u_i u_i⊤.

If we match the coefficients of u_i u_i⊤, we have λ_i = ∑_{I: i∈I, |I|=k} α_I. If we add these equalities, we obtain

k = ∑_{i=1}^r ∑_{I: i∈I, |I|=k} α_I.

We choose α_I to depend on the cardinality of I and take into account that each α_I occurs k times in the previous sum. This implies ∑_{I, |I|=k} α_I = 1, so each W is a convex combination of matrices of rank k, so K_conv(M₁) = M₂. No matrix of rank greater than k can be an extreme point. Since every convex and compact set has extreme elements, only matrices of rank k can play this role. Since the definition of M₂ makes no distinction between the k-rank matrices, it follows that the set of extreme elements coincides with M₁.
(38) Prove that Ky Fan’s Theorem can be derived from Supplement 8.13. (39) Prove that the Jordan block Br (a) can be written as ⎛ ⎞ ⎛ ⎞ e1 e2 ⎜ e ⎟ ⎜e ⎟ ⎜ 2 ⎟ ⎜ 3⎟ ⎜ . ⎟ ⎜.⎟ ⎟ ⎜ ⎟ Br (a) = a ⎜ ⎜ .. ⎟. + ⎜ .. ⎟. ⎜ ⎟ ⎜ ⎟ ⎝er−1 ⎠ ⎝er ⎠ er
0
(40) Let A ∈ Rm×n be a matrix such that rank(A) = r. Prove that A can be factored as A = U DV , where U ∈ Rm×r and V ∈ Rn×r are matrices with orthonormal columns and D ∈ Rr×r is a diagonal matrix. This factorization is known as the Lanczos decomposition of A. Solution: By Theorem 3.33, the symmetric and square matrix A A ∈ Rn×n has the same rank r as the matrix A. Therefore, by the Spectral Theorem for Hermitian Matrices (Theorem 8.14), there exists an orthogonal matrix V ∈ Rn×n such that A A = V SV , where S is a diagonal matrix D = (σ1 , . . . , σr , 0, . . . , 0) and σ1 σ2 · · · σr > 0. This is equivalent to V A AV = (AV ) AV = S. The matrix V can be written as V = (V1 V0 ), where V1 = (v 1 · · · v r ) ∈ Rn×r consists of the first r columns of the matrix V, that is, from eigenvectors of A A that correspond to σ1 , . . . , σr . The last n − r columns of V (the columns of V0 ) are eigenvectors that correspond to the eigenvalue 0. Since (AV ) (AV ) = S, it follows that (Av 1 · · · Av r AV0 ) (Av 1 · · · Av r AV0 ) = S, which means that for the vectors z i = Av i (1 i r), we have σi if i = j, zizj = 0 otherwise. √ Let λi = σi and let ui = λ1i z i for 1 i r. The set of vectors {u1 , . . . , ur } is orthonormal and AV = (λ1 u1 , . . . , λr ur ) = U D, where U = (u1 · · · ur ) and D = diag(λ1 , . . . , λr ).
Note that V ∈ Rn×r and that A Av = 0 for any column v of V 0 . This implies AV0 = O, so AV = A(V1 V0 ) = (AV1 O) = (U D O) and
V A = (U A O)V = (U A O) 0 = U DV . V1 The field of values of a matrix A ∈ Cn×n is the set of numbers F (A) = {xAxH | x ∈ Cn and x2 = 1}. (41) Prove that F (A) is a convex set. (42) Prove that spec(A) ⊆ F (A) for any A ∈ Cn×n . (43) If U ∈ Cn×n is a unitary matrix and A ∈ Cn×n , prove that F (U AU H ) = F (A). (44) Prove that for a normal matrix A ∈ Cn×n , F (A) equals the convex hull of spec(A). Infer that A is Hermitian if and only if F (A) is an interval of R. Solution: Since A is normal, by the Spectral Theorem for Normal Matrices (Theorem 8.13), there exists a unitary matrix U and a diagonal matrix D such that A = U H DU and the diagonal elements of D are the eigenvalues of A. Then, by Exercise 8.13, F (A) = F (D). Therefore, z ∈ F (A) if z = xDxH for some x ∈ Cn such that x2 = 1, so z = nk=1 |xk |2 λk , where spec(A) = spec(D) = {λ1 , . . . , λn }, which proves that F (A) is included in the convex closure of spec(A). The reverse inclusion is immediate. (45) Let
A B M= O C be a block matrix, where A and C are square matrices. Prove that if spec(A) ∩ spec(C) = ∅, then there exists a matrix X such that
−1
I X I X M O I O I is a block diagonal matrix.
Solution: The matrix
I X O I
is invertible and we have
−1
I −X I X = . O I O I Thus, we need to find a matrix X such that
I X A B I −X A −AX + B + XC = . O I O C O I O C By Theorem 8.25, if spec(A) ∩ spec(C) = ∅, there exists X such that AX − XC = B. (46) Prove that A ∼ B implies f (A) ∼ f (B) for every A, B ∈ Cn×n and every polynomial f . (47) Let A ∈ Cn×n be a matrix such that spec(A) = {λ1 , . . . , λn }. Prove that (a) n
2
|λi |
i=1
n n
|aij |2 ;
i=1 j=1
(b) the matrix A is normal if and only if n i=1
|λi |2 =
n n
|aij |2 .
i=1 j=1
(48) Let A ∈ Cn×n such that A On,n . Prove that if 1n is an eigenvector of A, then ρ(A) = |||A|||∞ and if 1n is an eigenvector of A , then ρ(A) = |||A|||1 . (49) Let A be a non-negative matrix in Cn×n and let u = A1n and v = A 1n . Prove that max{min ui , min vj } ρ(A) min{max ui , max vj }. (50) Let A ∈ Cn×n be a matrix such that A On,n . Prove that if there exists k ∈ N such that Ak > On,n , then ρ(A) > 0.
(51) Let A ∈ Cn×n be a matrix such that A On,n . If A = On,n and there exists an eigenvector x of A such that x > 0n , prove that ρ(A) > 0. (52) Prove that A ∈ Cn×n is positive semidefinite if and only n if there H is a set U = {v 1 , . . . , v n } ⊆ Cn such that A = i=1 v i v i . Furthermore, prove that A is positive definite if and only if there exists a linearly independent set U as above. (53) Prove that A is positive definite if and only if A−1 is positive definite. (54) Prove that if A ∈ Cn×n is a positive semidefinite matrix, then Ak is positive semidefinite for every k 1. (55) Let A ∈ Cn×n be a Hermitian matrix and let pA (λ) = λn + c1 λn−1 + · · · + cm λn−m be its characteristic polynomial, where cm = 0. Then, A is positive semidefinite if and only if ci = 0 for 0 k m (where c0 = 1) and cj cj+1 < 0 for 0 j m − 1. (56) Corollary 8.26 can be extended as follows. Let A ∈ Cn×n be a positive semidefinite matrix. Prove that for every k 1 there exists a positive semidefinite matrix B having the same rank as A such that (a) B k = A; (b) AB = BA; (c) B can be expressed as a polynomial in A. Solution: Since A is Hermitian, its eigenvalues are real nonnegative numbers and, by the Spectral Theorem for Hermitian matrices, there exists a unitary matrix U ∈ Cn×n such that A = 1
1
1
U H diag(λ1 , . . . , λn )U . Let B = U H diag(λ1k , . . . , λnk )U , where λik is a non-negative root of order k of λi . Thus, B k = A, B is clearly positive semidefinite, rank(B) = rank(A), and AB = BA. Let p(x) =
n j=1
1
λjk
n k=1,k =j
x − λk λj − λk 1
be a Lagrange interpolation polynomial such that p(λj ) = λjk (see Exercise 31). Then, 1
1
p(diag(λ1 , . . . , λn )) = diag(λ1k , . . . , λnk ),
Linear Algebra Tools for Data Mining (Second Edition)
626
so p(A) = p(U H diag(λ1 , . . . , λn )U ) = U H p(diag(λ1 , . . . , λn ))U 1
1
= U H diag(λ1k , . . . , λnk )U = B. (57) Let A ∈ Rn×n be a symmetric matrix. Prove that there exists b ∈ R such that A + b(11 − In ) is positive semidefinite, where 1 ∈ Rn . Solution: We need to find b such that for every x ∈ Rn we will have x (A + b(11 − In ))x 0. We have x (A + b(11 − In ))x = x Ax + bx 11 x − bx x 0, which amounts to
⎛
x Ax + b ⎝
n
2 xi
⎞ − x22 ⎠ 0.
i=1
Since A is symmetric, by the Rayleigh–Ritz theorem, we have x Ax λ1 x22 , where λ1 is the least eigenvalue of A. Therefore, it suffices to take b λ1 to satisfy the equality for every x. (58) If A ∈ Rn×n is a positive definite matrix, prove that there exist c, d > 0 such that cx22 x Ax dx22 , for every x ∈ Rn . (59) Let A = diag(A1 , . . . , Ap ) and B = (B1 , . . . , Bq ) be two block diagonal matrices. Prove that sepF (A, B) = min{sepF (Ai , Bj ) | 1 i p and 1 j q}.
(60) Let A ∈ Cn×n be a Hermitian matrix. Prove that if for any λ ∈ spec(A) we have λ > −a, then the matrix A+aI is positivesemidefinite. (61) Let A ∈ Cm×m and B ∈ Cn×n be two matrices that have the eigenvalues λ1 , . . . , λm and μ1 , . . . , μn , respectively. Prove that: (a) if A and B are positive definite, then so is A ⊗ B; (b) if m = n and A, B are symmetric positive definite, the Hadamard product A B is positive definite. Solution: The first part follows immediately from Theorem 7.24. For the second part, recall that the Hadamard product A B of two square matrices of the same format is a principal submatrix of A ⊗ B. Then, apply Theorem 8.20. (62) Let A ∈ Rn×n be a real matrix that is symmetric and positive semidefinite) such that A1n = 0n . Prove that n
2 max
1in
√ √ aii ajj . j=1
Solution: By Corollary 8.26, A is the Gram matrix of a sequence of vectors B = (b1 , . . . , bn ), so A = B B. Since A1n =0n , it follows that (B1n ) (B1n ) = 0, so B1n = 0n . Thus, ni=1 bi = 0n . Then we have uj |||2 uj 2 , ui 2 = ||| − j =i
j =i
which implies 2 max ui 2 1in
n
uj 2 .
j=1
This is immediately equivalent to the inequality to be shown. (63) Let A, B ∈ Rn . Prove that the function ψ : Rn −→ R defined by ψ(x) = Ax22 − Bx22 is a quadratic form and find the symmetric matrix that represents ψ.
Bibliographical Comments

The reader should consult [162]. The solution of Supplement 11 is given in [78]. The proof of the Perron–Frobenius Theorem that we present was obtained in [119]. The proof of Theorem 8.22 is given in [171]. The Hoffman–Wielandt theorem was shown in [77]. The proof of Theorem 8.48 is given in [104]. Ky Fan's Theorem appeared in [48]. The proof of Supplement 8.13 was obtained from Overton and Womersley [124], where the reader can find the solution of Exercise 8.13. Theorem 8.45 was obtained in [120]; the result of Theorem 8.50 appeared in [170]. The treatment of spectral resolution of matrices follows [160]. Supplement 8.13 is a result of Juhász, which appears in [86]; Supplement 8.13 originated in [90]. Lemma 9.4 is a specialization of a result proved in [160] for Banach spaces.
Chapter 9
Singular Values
9.1 Introduction
The singular value decomposition has been described as the “Swiss Army knife of matrix decompositions” [121] due to its many applications in the study of matrices; from our point of view, singular value decomposition is relevant for dimensionality reduction techniques in data mining.

9.2 Singular Values and Singular Vectors
The notion of singular value introduced in this section allows us to formulate the singular value decomposition (SVD) theorem, which extends a certain property of unitarily diagonalizable matrices.
Let A ∈ C^{n×n} be a square matrix which is unitarily diagonalizable. There exists a diagonal matrix D = diag(d₁, …, d_n) ∈ C^{n×n} and a unitary matrix X ∈ C^{n×n} such that A = XDX^H; equivalently, we have AX = XD. If we denote the columns of X by x₁, …, x_n, then Ax_i = d_i x_i, which shows that x_i is a unit eigenvector that corresponds to the eigenvalue d_i for 1 ≤ i ≤ n. Also, we have

A = (x₁ ⋯ x_n) diag(d₁, …, d_n) (x₁^H ; ⋯ ; x_n^H) = d₁ x₁ x₁^H + ⋯ + d_n x_n x_n^H.
This is the spectral decomposition of A, which we already discussed in Chapter 7. Note that rank(x_i x_i^H) = 1 for 1 ≤ i ≤ n. The SVD theorem extends this decomposition to rectangular matrices.

Theorem 9.1 (SVD Theorem). If A ∈ C^{m×n} is a matrix and rank(A) = r, then A can be factored as A = UDV^H, where U ∈ C^{m×m} and V ∈ C^{n×n} are unitary matrices,

D = [ diag(σ₁, …, σ_r)  O ;
      O                 O ] ∈ C^{m×n},

and σ₁ ≥ ⋯ ≥ σ_r are real positive numbers.

Proof. By Theorem 3.33 and Example 6.27, the square matrix A^H A ∈ C^{n×n} has the same rank as the matrix A and is positive semidefinite. Therefore, there are r positive eigenvalues of this matrix, denoted by σ₁², …, σ_r², where σ₁ ≥ σ₂ ≥ ⋯ ≥ σ_r > 0. Let v₁, …, v_r be the corresponding pairwise orthogonal unit eigenvectors in C^n. We have A^H A v_i = σ_i² v_i for 1 ≤ i ≤ r. Let V be the matrix V = (v₁ ⋯ v_r v_{r+1} ⋯ v_n) obtained by completing the set {v₁, …, v_r} to an orthogonal basis for C^n. If V₁ = (v₁ ⋯ v_r) and V₂ = (v_{r+1} ⋯ v_n), we can write V = (V₁ V₂). The equalities involving the eigenvectors can now be written as A^H A V₁ = V₁ E², where E = diag(σ₁, …, σ_r). Define U₁ = AV₁E^{−1} ∈ C^{m×r}. We have U₁^H = E^{−1} V₁^H A^H, so

U₁^H U₁ = E^{−1} V₁^H A^H A V₁ E^{−1} = E^{−1} V₁^H V₁ E² E^{−1} = I_r,

which shows that the columns of U₁ are pairwise orthogonal unit vectors. Consequently, U₁^H A V₁ E^{−1} = I_r, so U₁^H A V₁ = E.
If U₁ = (u₁, …, u_r), let U₂ = (u_{r+1}, …, u_m) be the matrix whose columns constitute the extension of the set {u₁, …, u_r} to an orthogonal basis of C^m. Define U ∈ C^{m×m} as U = (U₁ U₂). Note that

U^H A V = [ U₁^H ; U₂^H ] A (V₁ V₂) = [ U₁^H A V₁  U₁^H A V₂ ; U₂^H A V₁  U₂^H A V₂ ] = [ E  O ; O  O ],

which is the desired decomposition.
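A minimal MATLAB illustration of Theorem 9.1 (a hedged sketch; the matrix A below is an arbitrary example): svd returns the factors of the full SVD and the reconstruction error is at the level of rounding errors.

A = [0 1; 1 1; 1 0];          % arbitrary example matrix
[U, S, V] = svd(A);           % full SVD: U is 3x3, S is 3x2, V is 2x2
norm(U*S*V' - A)              % approximately 0
norm(U'*U - eye(3))           % U has orthonormal columns
norm(V'*V - eye(2))           % V has orthonormal columns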
Observe that in the SVD described by Theorem 9.1, known as the full SVD of A, the diagonal matrix D has the same format as A, while both U and V are square unitary matrices.

Definition 9.1. Let A ∈ C^{m×n} be a matrix. A number σ ∈ R_{>0} is a singular value of A if there exists a pair of vectors (u, v) ∈ C^m × C^n such that

Av = σu and A^H u = σv.   (9.1)

The vector u is the left singular vector and v is the right singular vector associated to the singular value σ. Note that if (u, v) is a pair of vectors associated to σ, then (au, av) is also a pair of vectors associated with σ for every a ∈ C.
Let A ∈ C^{m×n} and let A = UDV^H, where U ∈ C^{m×m}, D = diag(σ₁, …, σ_r, 0, …, 0) ∈ C^{m×n}, and V ∈ C^{n×n}. Note that

A v_j = UDV^H v_j = UD e_j (because V is a unitary matrix) = σ_j U e_j = σ_j u_j

and

A^H u_j = V D^H U^H u_j = V D U^H u_j = V D e_j (because U is a unitary matrix) = σ_j V e_j = σ_j v_j.
Thus, the jth column u_j of the matrix U and the jth column v_j of the matrix V are left and right singular vectors, respectively, associated to the singular value σ_j.

Corollary 9.1. Let A ∈ C^{m×n} be a matrix and let A = UDV^H be the singular value decomposition of A. If ‖·‖ is a unitarily invariant norm, then ‖A‖ = ‖D‖ = ‖diag(σ₁, …, σ_r, 0, …, 0)‖.

Proof. This statement is a direct consequence of Theorem 9.1 because the matrices U ∈ C^{m×m} and V ∈ C^{n×n} are unitary.

In other words, the value of a unitarily invariant norm of a matrix depends only on its singular values. As we saw in Theorem 6.24, |||·|||₂ and ‖·‖_F are unitarily invariant. Therefore, the Frobenius norm can be written as

‖A‖_F = √(∑_{i=1}^r σ_i²).
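The following hedged MATLAB sketch (arbitrary example matrix) checks these relations numerically: the 2-norm equals the largest singular value and the Frobenius norm equals the Euclidean norm of the vector of singular values.

A = randn(5, 3);              % arbitrary example matrix
s = svd(A);                   % singular values, in non-increasing order
abs(norm(A, 2) - s(1))        % approximately 0
abs(norm(A, 'fro') - norm(s)) % approximately 0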
Definition 9.2. Two matrices A, B ∈ C^{m×n} are unitarily equivalent (denoted by A ≡_u B) if there exist two unitary matrices W₁ and W₂ such that A = W₁^H B W₂.

Clearly, if A ∼_u B, then A ≡_u B.

Theorem 9.2. Let A and B be two matrices in C^{m×n}. If A and B are unitarily equivalent, then they have the same singular values.

Proof. Suppose that A ≡_u B, that is, A = W₁^H B W₂ for some unitary matrices W₁ and W₂. If A has the SVD A = U^H diag(σ₁, …, σ_r, 0, …, 0) V, then B = W₁ A W₂^H = (W₁ U^H) diag(σ₁, …, σ_r, 0, …, 0)(V W₂^H). Since W₁U^H and VW₂^H are both unitary matrices, this is an SVD of B, so A and B have the same singular values.

Let v ∈ C^n be an eigenvector of the matrix A^H A that corresponds to a non-zero, positive eigenvalue σ², that is, A^H A v = σ² v.
Define u = (1/σ)Av. We have Av = σu. Also,

A^H u = A^H ((1/σ) A v) = σ v.

This implies AA^H u = σ²u, so u is an eigenvector of AA^H that corresponds to the same eigenvalue σ². Conversely, if u ∈ C^m is an eigenvector of the matrix AA^H that corresponds to a non-zero, positive eigenvalue σ², we have AA^H u = σ²u. Thus, if v = (1/σ)A^H u, we have Av = σu and v is an eigenvector of A^H A for the eigenvalue σ².
The Courant–Fisher theorem (Theorem 8.18) allows the formulation of a similar result for singular values.

Theorem 9.3. Let A be a matrix, A ∈ C^{m×n}. If σ₁ ≥ σ₂ ≥ ⋯ ≥ σ_k ≥ ⋯ is the non-increasing sequence of singular values of A, then

σ_k = min_{dim(S)=n−k+1} max {‖Ax‖₂ | x ∈ S and ‖x‖₂ = 1}
    = max_{dim(T)=k} min {‖Ax‖₂ | x ∈ T and ‖x‖₂ = 1},

where S and T range over subspaces of C^n.

Proof. We give the argument only for the second equality of the theorem; the first can be shown in a similar manner. We saw that σ_k equals the square root of the kth largest eigenvalue λ_k of the matrix A^H A. By the Courant–Fisher theorem, we have

λ_k = max_{dim(T)=k} min_x {x^H A^H A x | x ∈ T and ‖x‖₂ = 1}
    = max_{dim(T)=k} min_x {‖Ax‖₂² | x ∈ T and ‖x‖₂ = 1},
which implies the second equality of the theorem.
Theorem 9.3 can be restated as follows; Theorem 9.4. Let A be a matrix, A ∈ Cm×n . If σ1 σ2 · · · σk · · · is the non-increasing sequence of singular values of A, then σk = =
min
max{Ax2 | x ⊥ w1 , . . . , x ⊥ wk−1 and x2 = 1}
max
min{Ax2 | x ⊥ w1 , . . . , x ⊥ wn−k and x2 = 1}.
w 1 ,...,w k−1 w 1 ,...,w n−k
Proof.
The argument is similar to the one used in Theorem 8.19.
Corollary 9.2. The smallest singular value of a matrix A ∈ Cm×n equals min{Ax2 | x ∈ Cn and x2 = 1}. The largest singular value of a matrix A ∈ Cm×n equals max{Ax2 | x ∈ Cn and x2 = 1}. Proof.
The corollary is a direct consequence of Theorem 9.3.
The SVD theorem can also be proven by induction on q = min{m, n}. In the base case, q = 1, we have A ∈ C1×1 , or A ∈ Cm×1 , or A ∈ C1×n . Suppose, for example, that A = a ∈ Cm×1 , where ⎛ ⎞ a1 ⎜ . ⎟ ⎟ a=⎜ ⎝ .. ⎠ am and let a = a2 . We seek U ∈ Cm×m , V = (v) ∈ C1×1 such that a = U diag(a)v, where
⎛ ⎞ a ⎜ 0⎟ ⎜ ⎟ m×1 ⎟ . diag(a) = ⎜ ⎜ .. ⎟ ∈ C ⎝.⎠ 0
The role of the matrix U is played by any unitary matrix which has the first column equal to ⎛ a1 ⎞ a
⎜ a2 ⎟ ⎜a⎟ ⎜ ⎟, ⎜ .. ⎟ ⎝ . ⎠ an a
and we can adopt v = 1. The remaining base subcases can be treated in a similar manner.
Singular Values
635
Suppose now that the statement holds when at least one of the numbers m and n is less than q and let us prove the assertion when at least one of m and n is less than q + 1. Let u1 be a unit eigenvector of AAH that corresponds to the eigenvalue σ12 and let v 1 = σ11 AH u1 . We have v 1 2 = 1 and Av 1 =
1 AAH u1 = σ1 u1 , σ1
which shows that (v 1 , u1 ) is a pair of singular vectors corresponding to the singular value σ1 . We have also uH1 AH v 1 =
1 H u AAH u1 = σ1 . σ1 1
Define U = (u1 U1 ) and V = (v 1 V1 ) as unitary matrices having u1 and v 1 as their first columns, respectively. Then, H u1 H H AH v 1 V1 U AV = H U1 H H u1 A v 1 V1 = H H U1 A H H u1 A v 1 uH1 AV1 . = U1H Av 1 U1H AV1 Since U is a unitary matrix, every column of U1 is orthogonal to u1 . Therefore, U1H Av 1 =
1 H U AAH u1 = σ1 U1H u1 = 0, σ1 1
and, similarly, uH1 AH V1 = σ1 v H1 V1 = 0 , because v 1 is orthogonal on all columns of V1 . Thus, σ1 0 H . U AV = 0 U1H AV1
636
Linear Algebra Tools for Data Mining (Second Edition)
The matrix U1H AV1 has fewer rows and columns than U H AV , so we can apply the inductive hypothesis to B = U1H AV1 . Therefore, by the inductive hypothesis, B can be written as B = XDY H , where X and Y are unitary matrices and D is a diagonal matrix. This allows us to write 0 σ1 1 0 σ1 0 1 0 H = . U AV = 0 XDY H 0 X 0 D 0 YH Since the matrices
1 0 0 X
and
1 0 0 YH
are unitary, we obtain the desired conclusion. If A ∈ Cn×n is an invertible matrix and σ is a singular value of A, then σ1 is a singular value of the matrix A−1 . Example 9.1. Let ⎞ a1 ⎜.⎟ ⎟ a=⎜ ⎝ .. ⎠ an ⎛
be a non-zero vector in Cn , which can also be regarded as a matrix in Cn×1 . The square of a singular value of A is an eigenvalue of the matrix ⎛ ⎞ a ¯ 1 a1 · · · a ¯ n a1 ⎜a ¯ n a2 ⎟ ⎜ ¯ 1 a2 · · · a ⎟ H ⎜ ⎟ A A=⎜ . . ⎟ . . ⎝ . ··· . ⎠ ¯ n an a ¯ 1 an · · · a and we have seen (in Corollary 7.4) that the unique non-zero eigenvalue of this matrix is a22 . Thus, the unique singular value of a is a2 .
637
Singular Values
Example 9.2. Let A ∈ R3×2 be the matrix ⎛ ⎞ 0 1 ⎜ ⎟ A = ⎝1 1⎠ . 1 0 The matrices AH A and AH A ⎛ 1 1 AAH = ⎝1 2 0 1
are given by ⎞ 0 2 1 1⎠ and AH A = . 1 2 1
The eigenvalues of AH A are the roots of the polynomial λ2 − 4λ + 3, and therefore, they are λ1 = 3 and λ2 = 1. The eigenvalues of AAH are 3, 1, and 0. Unit eigenvectors of AH A that correspond to 3 and 1 are √ √ v 1 = α1
2 √2 2 2
and v 2 = α2
−
2 2√
2 2
,
respectively, where αi ∈ {−1, 1} for i = 1, 2. Unit eigenvectors of AH A that correspond to 3, 1, and 0 are ⎛√ ⎞ ⎛ √ ⎞ ⎛ √ ⎞ 6
3
2
2 ⎜ √6 ⎟ ⎜ 3√ ⎟ ⎜ ⎟ 6 3⎟ ⎜ ⎟ u1 = β1 ⎝ 3 ⎠ , u2 = β2 ⎝ 0 ⎠ , u3 = β3 ⎜ ⎝−√ 3 ⎠ , √ √ 6 3 − 22 6 3
respectively, where βi ∈ {−1, 1} for i = 1, 2, 3. The choice of the columns of the matrices U and V must be done such that for a pair of eigenvectors (u, v) that correspond to a singular value σ, we have v = σ1 AH u or, equivalently, u = σ1 Av, as we saw in the proof of Theorem 9.1. For instance, if we choose α1 = α2 = 1, then √ √ v1 =
2 √2 2 2
, v2 =
−
2 2√
2 2
,
638
and u1 =
Linear Algebra Tools for Data Mining (Second Edition) √1 Av 1 3
and u2 = Av 2 , that is ⎛√ ⎞ u1 =
6 ⎜ √6 ⎟ ⎜ 6 ⎟ , u2 ⎝ √3 ⎠ 6 6
⎛
√ ⎞ − 22 ⎜ ⎟ = ⎝ 0 ⎠, √
2 2
which means that β1 = 1 and β2 = −1; the value of β3 that corresponds to the eigenvalue of 0 can be chosen arbitrarily. Thus, an SVD of A is √ ⎛√ √ ⎞ ⎛√ ⎞ 6 2 3 − 3 0 √2 √2 6 2 3 √ ⎟⎜ ⎜√ ⎟ √2 6 2√ . A=⎜ 0 − 33 ⎟ ⎠ ⎝ 0 1⎠ ⎝ √3 2 √ √ − 22 2 6 2 3 0 0 6
2
3
The singular values of a matrix A ∈ Cm×n are uniquely determined. However, the matrices U and V of the SVD of A are not unique, as we saw in Example 9.2. Once we choose a column of the matrix V for a singular value σ, the corresponding column of U is determined by u = σ1 Av. A variant of the SVD Decomposition Theorem is given next. Corollary 9.3 (The Thin SVD Decomposition Corollary). Let A ∈ Cm×n be a matrix having non-zero singular values σ1 , σ2 , . . . , σr , where σ1 σ2 · · · σr > 0 and r min{m, n}. Then, A can be factored as A = U DV H , where U ∈ Cm×r and V ∈ Cn×r are matrices having orthonormal sets of columns and D is the diagonal matrix ⎞ ⎛ σ1 0 · · · 0 ⎜ 0 σ ··· 0 ⎟ 2 ⎟ ⎜ ⎟ D=⎜ .. ⎟ . ⎜ .. .. ⎝ . . ··· . ⎠ 0 0 · · · σr Proof. The Theorem 9.1.
statement
is
an
immediate
consequence
of
The decomposition described in Corollary 9.3 is known as a thin SVD decomposition of the matrix A.
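In MATLAB, the thin decomposition can be requested directly (a hedged sketch with an arbitrary tall example matrix; svd(A,'econ') returns the economy-size factors):

A = randn(6, 3);              % arbitrary tall example matrix
[U, S, V] = svd(A, 'econ');   % U is 6x3, S is 3x3, V is 3x3
norm(U*S*V' - A)              % approximately 0
norm(U'*U - eye(3))           % columns of U are orthonormal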
639
Singular Values
Example 9.3. The thin SVD decomposition of the matrix A introduced in Example 9.2, ⎛ ⎞ 0 1 ⎜ ⎟ A = ⎝1 1⎠ . 1 0 is
⎛√ A=
6 ⎜ √6 ⎜ 6 ⎝ √3 6 6
−
√
2 2
⎞
⎟ 0 ⎟ ⎠ √
√
2 2
3 0 0 1
√
2 √2 2 2
√
2 2√ − 22
.
Since U and V in the thin SVD have orthonormal columns, it is easy to see that U H U = V H V = Ip .
(9.2)
Lemma 9.1. Let D ∈ Rn×n be a diagonal matrix, where D = . . , σr ) and σ1 · · · σr . Then, we have |||D|||2 = σ1 , and diag(σ1 , . r 2 DF = i=1 σi . Proof.
By the definition of |||D|||2 , we have |||D|||2 = max{Dx2 | x = 1} ⎫ ⎧
n r ⎬ ⎨ σi2 |xi |2 |xi |2 = 1 . = max ⎭ ⎩ i=1 i=1
Since r
because
n
σi2 |xi |2
i=1
σ12
r
2
|xi |
σ12 ,
i=1
2 i=1 |xi |
= 1, it follows that ⎫ ⎧
n r ⎬ ⎨ σi2 |xi |2 |xi |2 = 1 = σ1 . max ⎭ ⎩ i=1
The second part is immediate.
i=1
640
Linear Algebra Tools for Data Mining (Second Edition)
Theorem 9.5. Let A ∈ Cm×n be a matrix whose singular values are r 2 σ1 · · · σr . Then |||A|||2 = σ1 , and AF = i=1 σi . Proof. Suppose that the SVD of A is A = U DV H , where U and V are unitary matrices. Then, by Theorem 6.24 and Lemma 9.1, we have |||A|||2 = |||U DV H |||2 = |||D|||2 = σ1 ,
r H σi2 . AF = U DV F = DF = i=1
Corollary 9.4. If A ∈ Cm×n is a matrix, then |||A|||2 AF √ n|||A|||2 . Suppose that σ 1 (A) is the largest of the singular values of r 2 A. Then, since AF = i=1 σi , we have Proof.
σ1 (A) AF
√ n max σj (A)2 = σ1 (A) n, i
which is the desired double inequality.
Theorem 9.6. Let A ∈ Cn×n be an invertible matrix. If the singular values of A are σ1 · · · σn > 0, then cond(A) =
σ1 . σn
Proof. We have shown in Theorem 9.5 that |||A|||2 = σ1 . Since the singular values of A−1 are 1 1 ··· , σn σ1 it follows that |||A−1 |||2 = ately.
1 σn .
The desired equality follows immedi
Corollary 9.5. Let A ∈ Cn×n be an invertible matrix. We have cond(AH A) = (cond(A))2 .
Singular Values
641
Proof. Let σ be a singular value of A and let u, v be two left and right singular vectors corresponding to σ, respectively. We have Av = σu and AH u = σv. This implies AH Av = σAH u = σ 2 v, which shows that the singular values of the matrix AH A are the squares of the singular values of A, which produces the desired conclusion. Let A = U DV H be an SVD of A. If we write U and V using their columns as U = (u1 · · · um ), V = (v 1 · · · v n ), then A can be written as A = U DV H
⎛
σ1 0 ⎜0 σ ⎜ 2 ⎜ ⎜ .. .. ⎜. . = (u1 · · · un ) ⎜ ⎜0 0 ⎜ ⎜. . ⎜. . ⎝. . 0 0 ⎛ ⎞ σ1 v H1 ⎜ ⎟ = (u1 · · · um ) ⎝ ... ⎠ σr v Hp
⎞ ··· ··· 0 · · · · · · 0⎟ ⎟⎛ ⎞ H ⎟ .. ⎟ v1 ⎜ ⎟ · · · · · · .⎟ ⎟ ⎜ .. ⎟ ⎟ · · · σr 0 ⎟ ⎝ . ⎠ H .⎟ ⎟ vm · · · · · · .. ⎠ 0 0 0
= σ1 u1 v H1 + · · · + σr ur v Hp .
(9.3)
Since ui ∈ Cm and v i ∈ Cn , each of the matrices ui v Hi is an m × n matrix of rank 1. Thus, the SVD yields an expression of A as a sum of r matrices of rank 1, where r is the number of non-zero singular values of A. Theorem 9.7. The rank-1 matrices of the form ui v Hi , where 1 i r that occur in Equality (9.3), are pairwise orthogonal. Moreover, ui v Hi F = 1 for 1 i r.
642
Proof.
Linear Algebra Tools for Data Mining (Second Edition)
For i = j and 1 i, j r, we have trace ui v Hi (uj v Hj )H = trace (ui v Hi v j uj ) = 0,
because the vectors v i and v j are orthogonal. Thus, (ui v Hi , uj v Hj ) = 0. By Equality (6.12), we have ui v Hi 2F = trace((ui v Hi )H ui v Hi ) = trace(v i uHi ui v Hi ) = 1, because the matrices U and V are unitary.
Theorem 9.7 shows that Equality (9.3) is similar to a Fourier expansion of A. Namely, the matrix is extended in terms of the orthonormal set {ui v Hi | 1 i r} of the linear space of all matrices of rank no larger than r. Theorem 9.8. Let A ∈ Cm×n be a matrix that has the singular value decomposition A = U DV H . If rank(A) = r, then the first r columns of U form an orthonormal basis for range(A), and the last n−r columns of V constitute an orthonormal basis for null(A). Proof. Since both U and V are unitary matrices, it is clear that {u1 , . . . , ur }, the set of the first r columns of U , and {v r+1 , . . . , v n }, the set of the last n − r columns of V, are linearly independent sets. Thus, we only need to show that u1 , . . . , ur = range(A) and v r+1 , . . . , v n = null(A). By Equality (9.3), we have A = σ1 u1 v H1 + · · · + σr ur vHr . If t ∈ range(A), then t = As for some s ∈ Cn . Therefore, t = σ1 u1 (v H1 s) + · · · + σr ur (v Hr s), and, since every product v Hj s is a scalar for 1 j r, it follows that t ∈ u1 , . . . , ur , so range(A) ⊆ u1 , . . . , ur . To prove the reverse inclusion, note that 1 v i = ui , A σi for 1 i r, due to the orthogonality of the columns of V. Thus, u1 , . . . , ur = range(A).
Singular Values
643
Note that Equality (9.3) implies that Av j = 0 for r + 1 j n, so v r+1 , . . . , v n ⊆ null(A). Conversely, suppose that Ar = 0. Since the columns of V form a basis of Cn , we have r = a1 v 1 +· · ·+an v n , so Ar = a1 Av 1 +· · ·+ar v r = 0. The linear independence of {v 1 , . . . , v r } implies a1 = · · · = ar = 0, so r = ar+1 vr+1 + · · · + an v n , which shows that null(A) ⊆ v r+1 , . . . , v n . Thus, null(A) = v r+1 , . . . , v n . Corollary 9.6. Let A ∈ Cm×n be a matrix that has the singular value decomposition A = U DV H . If rank(A) = r, then the first r transposed columns of V form an orthonormal basis for the subspace of Rn generated by the rows of A. Proof. This statement follows immediately from Theorem 9.8 applied to AH . 9.3
Numerical Rank of Matrices
The SVD allows us to find the best approximation of a matrix by a matrix of limited rank. The central result of this section is Theorem 9.9. Lemma 9.2. Let A = σ1 u1 v H1 + · · · + σr ur v Hr be the SVD of a matrix A ∈ Rm×n , where k σ1 · · ·H σr > 0. For every k, 1 k r, the matrix B(k) = i=1 σi ui vi has rank k. Proof. The null space of the matrix B(k) consists of those vectors x such that ki=1 σi ui vHi x = 0. The linear independence of the vectors ui , and the fact that σi > 0 for 1 i r, implies the equalities v Hi x = 0 for 1 i r. Thus, null(B(k)) = null ((v 1 · · · v k )) . Since v 1 , . . . , v k are linearly independent, it follows that dim (null(B(k)) = n − k, which implies rank(B(k)) = k for 1 k r. Theorem 9.9 (Eckhart–Young Theorem). Let A ∈ Cm×n be a matrix whose sequence of nonzero singular values is (σ1 , . . . , σr ). Assume that σ1 · · · σr > 0 and that A can be written as A = σ1 u1 v H1 + · · · + σr ur vHr .
644
Linear Algebra Tools for Data Mining (Second Edition)
Let B(k) ∈ Cm×n be the matrix defined by B(k) =
k
σi ui v Hi .
i=1
If rk = inf{|||A − X|||2 | X ∈ Cm×n and rank(X) k}, then |||A − B(k)|||2 = rk = σk+1 , for 1 k r, where σr+1 = 0 and B(k) is the best approximation of A among the matrices of rank no larger than k in the sense of the norm ||| · |||2 . Proof.
Observe that A − B(k) =
r
σi ui vHi ,
i=k+1
and the largest singular value of the matrix Therefore, by Theorem 9.5,
r
i=k+1 σi ui v i
H
is σk+1 .
|||A − B(k)|||2 = σk+1 . for 1 k r. We prove now that for every matrix X ∈ Cm×n such that rank(X) k, we have |||A − X|||2 σk+1 . Since dim(null(X)) = n − rank(X), it follows that dim(null(X)) n − k. If T is the subspace of Rn spanned by v 1 , . . . , v k+1 , we have dim(T ) = k + 1. Since dim(null(X)) + dim(T ) > n, the intersection of these subspaces contains a non-zero vector and, without loss of generality, we can assume that this vector is a unit vector x. We have x = a1 v 1 + · · · ak v k + ak+1 v k+1 because x ∈ T . The 2 orthogonality of v 1 , . . . , v k , v k+1 implies x22 = k+1 i=1 |ai | = 1. Since x ∈ null(X), we have Xx = 0, so (A − X)x = Ax =
k+1
ai Av i =
i=1
k+1
ai σ i u i .
i=1
Thus, we have |||(A − X)x|||22 =
k+1
i=1
2 |σi ai |2 σk+1
k+1
2 |ai |2 = σk+1 ,
i=1
because u1 , . . . , uk are also orthonormal. This implies |||A − X|||2 σk+1 = |||A − B(k)|||2 .
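A hedged MATLAB sketch of the approximation described by this theorem (the matrix and the rank k are arbitrary choices, not from the text): truncating the SVD to its k leading terms gives B(k), and the 2-norm error equals σ_{k+1}.

A = randn(8, 6);                             % arbitrary example matrix
k = 2;
[U, S, V] = svd(A);
Bk = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';   % rank-k truncation B(k)
s = diag(S);
abs(norm(A - Bk, 2) - s(k+1))                % approximately 0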
645
Singular Values
It is interesting to observe that the matrix B(k) provides an optimal approximation of A not only with respect to |||·|||2 but also relative to the Frobenius norm. Theorem 9.10. Using the notations introduced in Theorem 9.9, B(k) is the best approximation of A among matrices of rank no larger than k in the sense of the Frobenius norm. Proof. Note that A − B(k)2F = A2F − ki=1 σi2 . Let X be a matrix of rank k, which can be written as X = ki=1 xi y Hi . Without loss of generality we may assume that the vectors x1 , . . . , xk are orthonormal. If this is not the case, we can use the Gram–Schmidt algorithm to express them as linear combinations of orthonormal vectors, replace these expressions in ki=1 xi y Hi , and rearrange the terms. Now, the Frobenius norm of A − X can be written as A − X2F A−
= trace
k
H xi y H
A−
i=1
= trace A A + H
k
k
xi y H
i=1
(y i − A xi )(y i − A xi ) − H
H
H
k
i=1
A xi xi A . H
H
i=1
Taking into account that ki=1 (y i − AH xi )(y i − AH xi )H is a real non negative number and that ki=1 AH xi xHi A = Axi 2F , we have A −
X2F
trace A A − H
k
A xi xi A H
H
i=1
=
A2F
− trace
k
A xi xi A . H
H
i=1
Let A = U diag(σ1 , . . . , σn )V H be the singular value decomposition of A. If V = (V1 V2 ), where V1 has k columns v 1 , . . . , v k , D1 = diag(σ1 , . . . , σk ), and D2 = diag(σk+1 , . . . , σn ), then we can
646
Linear Algebra Tools for Data Mining (Second Edition)
write
D12 O A A = V D U U DV = (V1 V2 ) O D22 H
H
H
H
V1H V2H
= V1 D12 V1H + V2 D22 V2H . and AH A = V D 2 V H . These equalities allow us to write Axi 2F = trace(xHi AH Axi ) = trace xHi V1 D12 V1H xi + xHi V2 D22 V2H xi = D1 V1H xi 2F + D2 V2H xi 2F = σk2 + D1 V1H xi 2F − σk2 V1H xi 2F − σk2 V2H xi 2F − D2 V2H xi 2F ) − σk2 (1 − V H xi ). Since V H xi 1F = 1 (because xi is a unit vector and V is a unitary matrix) and σk2 V2H xi 2F − D2 V2H xi 2F 0, it follows that Axi 2F σk2 + D1 V1H xi 2F − σk2 V1H xi 2F . Consequently, k
Axi 2F kσk2 +
i=1
k
i=1
=
=
kσk2 k
+
k k
σk2
+
(σj2 − σk2 )|v Hj xi |2
i=1 j=1
j=1
D1 V1H xi 2F − σk2 V1H xi 2F
(σj2
−
σk2 )
k
2
|v j xi |
i=1
k k
(σk2 + (σj2 − σk2 )) = σj2 , j=1
j=1
which concludes the argument.
Definition 9.3. Let A ∈ C^{m×n}. The numerical rank of A is the function numrank_A : [0, ∞) → N given by numrank_A(d) = min{rank(B) | |||A − B|||₂ ≤ d} for d ≥ 0.
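A hedged MATLAB sketch of this notion (the matrix A and tolerance d are arbitrary choices): by Theorem 9.11 below, numrank_A(d) is the number of singular values exceeding d, which is also how MATLAB's rank(A, tol) counts.

A = [1 0 0; 0 1e-3 0; 0 0 1e-8];     % arbitrary example with small singular values
s = svd(A);
d = 1e-6;
numrank = sum(s > d)                 % numerical rank at tolerance d (here 2)
rank(A, d)                           % built-in rank with tolerance agrees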
Singular Values
647
Theorem 9.11. Let A ∈ Cm×n be a matrix having the sequence of non-zero singular values σ1 σ2 · · · σr . Then numrankA (d) = k < r if and only if σk > d σk+1 . Proof. Let d be a number such that σk > d σk+1 . Equivalently, by the Eckhart–Young Theorem, we have |||A − B(k − 1)|||2 > d |||A − B(k)|||2 , Since |||A − B(k − 1)|||2 = min{|||A − X|||2 | rank(X) = k − 1} > d, it follows that min{rank(B) | |||A−B|||2 d} = k, so numrankA (d) = k. Conversely, suppose that numrankA (d) = k. This means that the minimal rank of a matrix B such that |||A − B|||2 d is k. Therefore, |||A − B(k − 1)|||2 > d. On the other hand, d |||A − B(k)|||2 because there exists a matrix C of rank k such that d |||A − C|||2 , so d |||A − B(k)|||2 = σk+1 . Thus, σk > d σk+1 . Recapitulating facts that we have previously established, we have the following equivalent statements concerning a matrix A ∈ Rm×n and its rank r = rank(A): (i) r = dim(rank(A)); (ii) r = dim(rank(A )); (iii) r is the maximal number of linearly independent rows of A; (iv) r is the maximal number of linearly independent columns of A; r (v) r is minimal with the property A = i=1 xi y i , where m n xi ∈ R and y i ∈ R ; (vi) r is maximal with the property that there exists a nonsingular r × r submatrix of A; (vii) r is the number of singular positive values of A. 9.4
Updating SVDs
Efficient algorithms for updating the SVD of matrices are particularly useful when we deal with large matrices that cannot be entirely accommodated in the main memory of computers. Let A ∈ Cm×n be a matrix having the thin SVD decomposition A = U DV H , where rank(A) = r, U ∈ Cm×r , D = diag(σ1 , . . . , σr ), and V ∈ Cn×r . As we observed in Equality (9.2), U H U = V H V = Ir .
648
Linear Algebra Tools for Data Mining (Second Edition)
Suppose that A is modified by adding to A a matrix of the form XY H , where X ∈ Cm×c and Y ∈ Cn×c . The matrix D Or,c H (V Y )H (9.4) B = A + XY = (U X) Oc,r Ic has a new SVD, B = U1 D1 V1H . We discuss a technique (due to Brand [22]) for producing the factors U1 , D1 , and V1 of the new SVD starting from the factors of the initial SVD rather than using the matrix A. Let PX ∈ Cm×b be an orthonormal basis for range((Im − U U H )X) and let RX = PXH (Im − U U H )X ∈ Cb×c . We have
Ir U H X . (U X) = (U PX ) Ob,r RX Similarly, let PY ∈ Cn×d be an orthonormal basis for range((In − V V H )Y ), and let RY = PYH (In − V V H )Y ∈ Cd×c . Again, we have
Ir V H Y (V Y ) = (V PY ) Od,r RY
.
Thus, the Equality (9.4) can be written as D Or,c (V Y )H B = (U X) Oc,r Ic H Ir U H X D Or,c Ir Or,d Y = (U PX ) . H H Ob,r RX Oc,r Ic Y V RY PYH
Furthermore, we have D Or,c Ir Or,d Ir U H X Oc,r Ic Y H V RYH Ob,r RX D + U H XY H V U H XRYH = RX Y H V RX RYH H H H U X V Y D Or,c . + = Oc,r Oc,c RX RY
649
Singular Values
The matrix K=
D Or,c Oc,r Oc,c
U HX + RX
V HY RY
H
is sparse and rather small. By diagonalizing K as K = W EZ H , we obtain H Y H H A + XY = (U PX )W EZ PYH H H Z V Y . = (U W PX W )E Z H PYH An important special case occurs when c = 1, that is, when X and Y are just vectors in Cm and Cn , respectively. Thus, the matrix A is modified by adding a rank-1 matrix xy H as follows: D 0r H (V y)H . B = A + xy = (U x) 0r 1 The previous matrices PX and PY are now replaced by the vectors px =
1 (I − U U H )x, (I − U U H )x
py =
1 (I − V V H )y, (I − V V H )y
and RX , RY are now reduced to two vector numbers: r x = pHx (I − U U H )x ∈ Cb and r y = pHy (I − V V H )y ∈ Cd . This allows us to write
Ir U H x (U x) = (U px ) Ob,r r x
and
I V Hy . (V y) = (Y py ) Od,r r y
650
Linear Algebra Tools for Data Mining (Second Edition)
In this special case, Equality (9.4) amounts to D 0r (V y)H B = (U x) 0r 1 Ir Or,d YH D 0r Ir U H x . = (U px ) Y H V r Hy pHy 0r 1 Ob,r r x Furthermore, we have Ir Or,d D + U H xy H V U H xr Hy D 0r Ir U H x = Y H V r Hy rxyH V rx r Hy 0r 1 Ob,r r x H H H V y U x D 0r . + = rx ry 0r 0 Finally, by diagonalizing K = W EZ H , we have yH H H A + xy = (U px )W EZ pHy Z HyH . = (U W px W )E Z H pHy 9.5
Polar Form of Matrices
The SVD allows us to define the polar form of a matrix, which as an extension of the polar representation of a complex number z = |z|eiθ , where |z| is the modulus of z and θ, is the argument of z. Theorem 9.12 (Polar Form Theorem). Let A ∈ Cm×n be a matrix with singular values σ1 , σ2 , . . . , σr , where σ1 σ2 · · · σr > 0 and r min{m, n}. If m n, then A can be factored as A = P Y, where P ∈ Cm×m is a positive semidefinite matrix such that P 2 = AAH and Y ∈ Cm×n is a matrix having a set of orthonormal rows (that is, it satisfies the equality Y Y H = Im ). If m n, then A can be factored as A = W Q, where W ∈ Cm×n has orthonormal columns (that is, W H W = In and Q ∈ Cn×n is a positive semidefinite matrix.
Singular Values
651
Proof. By the Thin SVD Decomposition Corollary (Corollary 9.3), A can be factored as A = U DV H , where U ∈ Cm×r and V ∈ Cn×r are matrices having orthonormal columns and D is the diagonal matrix ⎞ ⎛ σ1 0 · · · 0 ⎜ 0 σ ··· 0 ⎟ 2 ⎟ ⎜ ⎟. D=⎜ . . . ⎟ ⎜. . ⎝ . . · · · .. ⎠ 0 0 · · · σr Since U has orthonormal columns, we have U H U = Ir , which allows us to write A = (U DU H )(U V H ). Define P = U DU H ∈ Cm×m . It is easy to see that P is a positive semidefinite matrix. Also, P 2 = U DU H U DU H = U DDU H = U DV H V DU H = AAH . On the other hand, if we define Y = U V H , we have Y Y H = U V H V U H = Im , which means that P and Y satisfy the conditions of the theorem. To prove the second part of the theorem, we apply the first part to the matrix AH ∈ Cn×m . Since AH = V D H U H = V D U H , the polar decomposition of AH is AH = P1 Y1 , where P1 = V D V H is positive semidefinite and Y1 = V U H is a matrix such that Y1 Y1H = In . Thus, A = Y1H P1H , which allows us to define W = Y1H and Q = P1H . Note that W H W = Y1 Y1H = In and that Q is positive semidefinite. Corollary 9.7. If A ∈ Cn×n , then A can be factored as A = P Y = W Q, where P, Q ∈ Cn×n are positive semidefinite matrices, and Y, W ∈ Cn×n are unitary matrices. Proof.
This statement follows immediately from Theorem 9.12.
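A hedged MATLAB sketch of the polar factorization built from the SVD (the square matrix A is an arbitrary example): P is positive semidefinite, Y is unitary, and A = PY.

A = randn(4);                 % arbitrary square example matrix
[U, D, V] = svd(A);
P = U * D * U';               % positive semidefinite factor, P^2 = A*A'
Y = U * V';                   % unitary factor
norm(P*Y - A)                 % approximately 0
norm(Y'*Y - eye(4))           % approximately 0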
Note that the positive semidefinite matrix P in the Polar Form Theorem is uniquely determined by the matrix A. Also, the spectrum of P equals the set of singular values of A. 9.6
CS Decomposition
The cosine–sine decomposition of unitary matrices was introduced in Stewart’s paper [158] although it was implicit in an article by Davis and Kahan [33]. As we show in this section and in the next, this
652
Linear Algebra Tools for Data Mining (Second Edition)
factorization of unitary matrices referred to as the CS decomposition plays an essential role in the study of the geometry of subspaces. Theorem 9.13 (Paige–Wei CS Decomposition Theorem). Let A ∈ Cn×n be a unitary matrix. For any partitioning A11 A12 A= , A21 A22 where A11 ∈ Cr1 ×c1 , A12 ∈ Cr1 ×c2 , A21 ∈ Cr2 ×c1 , A22 ∈ Cr2 ×c2 , and r1 + r2 = c1 + c2 = n, there exist unitary matrices U1 , U2 , V1 , V2 such that (i) U = diag(U1 , U2 ) and V = diag(V1 , V2 ); (ii) we have D11 D12 H , (9.5) U AV = D = D21 D22 where Dij ∈ Rri ×cj are real matrices for 1 i, j 2, and (iii) the matrices Dij have the form ˜ ˆ H , S, I), D12 = diag(O D11 = diag(I, C, O), ˆ S, I), D22 = diag(I, −C, O ˜ H ), D21 = diag(O, where C = diag(γ1 , . . . , γs ), S = diag(σ1 , . . . , σs ), C 2 + S 2 = I, 1 > γ1 · · · γs > 0, and 1 > σs · · · σ1 > 0. ˆ and O ˜ matrices are zero matrices and, depending on A and The O its partitioning, may have no rows or no columns. Some of the unit matrices could be nonexistent and no two need to be equal. The four C and S submatrices are square matrices with the same format and could be nonexistent. Proof. Let A11 = U1 D11 V1H be the SVD of A11 . Since D is a unitary matrix, no singular value of D11 can exceed 1 and D11 = U1H A11 V1 ˜ We have has the prescribed form D11 = diag(I, C, O). H A11 A12 U1 O V1 O H diag(U1 , I)Adiag(V1 , I) = A21 A22 O I O I H U1 A11 V1 U1H A12 = B, = A21 V1 A22
653
Singular Values
where B is the unitary matrix D11 B12 B= . B21 B22 By Theorem 6.68, there exist unitary matrices U2 and V2 such that the matrix (U1H A12 )V2 = B12 V2 is upper triangular and U2H (A21 V1 ) = U2H B21 is lower triangular, both having real non-negative elements on their diagonal ending in the bottom right corner. This implies D11 B12 I O I O = C, B21 B22 O V2 O U2H where C is the matrix C=
B12 V2 D11 . U2H B21 U2H B22 V2
By the choice of U2 and V2 , C has the form ⎛ I O O ⎞ ⎜ ⎛ I OO ⎜ O C O ⎟ ⎜ ⎜O C O ˜ O O O C12 ⎟ ⎜ ⎜ C=⎜ ⎟=⎜ ˆ ˜ ⎠ ⎜ ⎝O O O ⎜ O O O ⎜ C21 C22 ⎝ F21 S O F31 F32 I
⎞ O E12 E13 O S E23 ⎟ ⎟ ⎟ O O I ⎟ ⎟. G11 G12 G13 ⎟ ⎟ ⎟ G21 G22 G23 ⎠ G31 G32 G33
H S+ The orthonormality of the first three block columns implies F21 H H F31 F32 = O, F31 = O, and F32 = O, so F21 , F31 , and F32 are zero matrices. Similarly, the orthonormality of the first three block rows implies that E12 , E12 , E23 are zero matrices. Thus, the matrix becomes ⎞ ⎛ I O O O O O ⎜O C O O S O ⎟ ⎟ ⎜ ˜ O O I ⎟ ⎜O O O ⎟. ⎜ C=⎜ ˆ ⎟ ⎜ O O O G11 G12 G13 ⎟ ⎝ O S O G21 G22 G23 ⎠ O O I G31 G32 G33
Orthogonality of the third column block on the fourth, fifth, and sixth column blocks implies that G31 , G32 , and G33 are zero matrices;
654
Linear Algebra Tools for Data Mining (Second Edition)
orthogonality of the third row block on the fourth and fifth row blocks implies that G13 and G23 are zero matrices. Thus, C is actually ⎞ ⎛ I O O O O O ⎜ O C O O S O⎟ ⎟ ⎜ ⎟ ⎜ ˜ O O I⎟ ⎜O O O ⎟. ⎜ C=⎜ ˆ ⎟ ⎜ O O O G11 G12 O ⎟ ⎟ ⎜ ⎝ O S O G21 G22 O ⎠ O O I O O O Orthogonality of the second and fourth blocks of columns implies SG21 = O. Since S is non-singular, G21 is a zero matrix. Similarly, the orthogonality of the second and fourth blocks of rows shows that G12 is a zero matrix. Orthogonality of the fifth and the second blocks of rows yields CS + SG22 = O, so G22 = −C (because C and S are diagonal matrices). Finally, we obtain G11 GH11 = GH11 G11 = I, so G11 is unitary. The matrix G11 can be transformed to I without affecting the rest of the matrix by replacing U2H by diag(GH11 , I, I)U2H . The unitary character of the matrix implies C 2 + S 2 = I. Corollary 9.8 (The Thin CS Decomposition Theorem). Let A ∈ Cr×c be a matrix having orthonormal columns. For any partitioning A1 , A= A2 where A1 ∈ Cr1 ×c , A2 ∈ Cr2 ×c , r1 c, r2 c, there exist unitary matrices U1 ∈ Cr1 ×r1 , U2 ∈ Cr2 ×r2 , and V ∈ Cc×c such that (i) U = diag(U1 , U2 ); (ii) we have C H , U AV = D = S where C ∈ Cri ×n and S ∈ Cr2 ×n are real matrices for 1 i, j 2, and
655
Singular Values
(iii) the matrices C and S have the form C = diag(cos θ1 , . . . , cos θn ) and S = diag(sin θ1 , . . . , sin θn ), where 0 θ1 θ2 · · · θn π2 . Proof.
This statement follows from the Theorem 9.13 for c2 = 0.
Starting from Equality (9.5), U AV = D = H
D11 D12 , D21 D22
it is possible to obtain other variants of the CS decomposition by multiplying both sides of the equality by appropriate unitary matrices that transform D to the desired aspect. Recall that all permutation matrices are unitary matrices. Permuting the first r1 rows or the last r2 rows in D or changing the sign of any row or column preserves the almost diagonality of the blocks Dij of D. Theorem 9.14 (Stewart–Sun CS Decomposition). Let A ∈ Cn×n be a unitary matrix. For any partitioning A11 A12 , A= A21 A22 where 2 n, A11 ∈ C× , A12 ∈ C×(n−) , A21 ∈ C(n−)× , A22 ∈ C(n−)×(n−) there exist unitary matrices U, V ∈ Cn×n such that (i) U = diag(U1 , U2 ) and V = diag(V1 , V2 ), where U1 , V1 ∈ C× ; (ii) we have ⎛ ⎞ C˜ −Sˆ O ⎜ ⎟ (9.6) U H AV = ⎝ Sˆ C˜ O ⎠, O
O
In−2
where C˜ = diag(γ1 , . . . , γ ), Sˆ = diag(σ1 , . . . , σ ) are two diagonal matrices having non-negative elements and C˜ 2 + Sˆ2 = I . Proof. Since 2 n, we have = r1 = c1 r2 = n − in ˆˆ ˆ ˜ ˜ and O ˆ H = (O, ˜ = (O ˜ , O) O ), where Equality (9.5). We can write O
656
Linear Algebra Tools for Data Mining (Second Edition)
ˆ are square matrices. The matrix D can now be written ˜ and O O as ⎞ ⎛ ˆˆ ˆ O O O I O O OO ⎟ ⎛ ⎞ ⎜ ⎜ O C O OO O S O ⎟ ˆH O O I OOO ⎟ ⎜ ⎟ ⎜O C O O S O ⎟ ⎜ ˜ ˜ ˜ ⎜ ⎜ ⎟ ⎜ O OO OO O O O ⎟ ⎟ ⎜ ⎟ H ⎟ ˆ ˜ O O O⎟ ⎜ ⎜O O O ˆ ⎟ ⎜ O O O I O O O O ⎟=⎜ D=⎜ ⎟. ⎜O ⎟ ⎜ H ˆ O O I O O ˆ O O OO I O O ⎟ ⎜ ⎟ ⎜O ⎟ ⎜ ⎟ ⎟ ⎝ O S O O −C O ⎠ ⎜ ⎜ O S O O O O −C O ⎟ ⎟ ⎜ ˜H ⎜ O O I OO O O O OO I O O O ˜ H ⎟ ⎠ ⎝ H ˜ O O O I O O O ˜O By multiplying the last two columns of the last matrix by −1 yields the matrix ⎞ ⎛ ˆ ˆO ˆ O O I O O OO ⎟ ⎜ ⎜ O C O O O O −S O ⎟ ⎟ ⎜ ⎟ ⎜ ˜ ⎜ O OO ˜O O O O ⎟ ˜ O ⎟ ⎜ H ⎜ ˆ ˆ O O O I O O O ⎟ ⎟ ⎜ O ⎟ ⎜ ⎟ ⎜O ⎜ ˆ H O O O O I O O ⎟ ⎟ ⎜ ⎜ O S O OO O C O ⎟ ⎟ ⎜ ⎜ O O I OO O O O ˜ H ⎟ ⎠ ⎝ H ˜ O O O I O O O ˜O Let now C˜ and Sˆ be the square matrices ˆ , S, I). ˜ ) and Sˆ = diag(O C˜ = diag(I, C, O Permuting rows and columns within the main block, we obtain another variant of the CS decomposition: ⎞ ⎛ C˜ O −Sˆ O ⎜O I O O⎟ ⎟ ⎜ (9.7) U H AV = ⎜ ⎟, ⎝ Sˆ O C˜ O ⎠ OO O I
Singular Values
as shown in [129]. If r1 = c1 , one obtains the matrix ⎛ ⎞ C˜ −Sˆ O ⎜ ⎟ U H AV = ⎝ Sˆ C˜ O ⎠ . O O In−2
9.7
657
(9.8)
Geometry of Subspaces
Many of the results discussed in this section and in the next appear in [160]. Let S and T be two subspaces of Cn having the same dimension, m, and let {u1 , . . . , um } and {v 1 , . . . , v m } be two orthonormal bases for S and T , respectively. Consider the matrices B = (u1 · · · um ) ∈ Cn×m and C = (v 1 · · · v m ) ∈ Cn×m . The projection matrices of these subspaces are PS = BB H and PT = CC H , respectively, and we saw that these Hermitian matrices do not depend on the particular choices of the orthonormal bases of the subspaces. Theorem 9.15. Let S and T be two subspaces of Cn such that dim(S) = dim(T ) = m, and let {u1 , . . . , um } and {v 1 , . . . , v m } be two orthonormal bases for S and T, respectively. Consider the matrices B = (u1 · · · um ) ∈ Cn×m and C = (v 1 · · · v m ) ∈ Cn×m . The singular values γ1 , . . . , γm of the matrix B H C do not depend on the choices of the orthonormal bases in the subspaces S and T . ˜ and C˜ be matrices that correspond to a different Proof. Let B choice of bases in the two spaces. By Theorem 9.2, to prove the independence of the singular values on the choice of bases, it suffices ˜ H C˜ are unitarily equivalent. to show that the matrices B H C and B By Theorem 6.46, there exists a unitary matrix Q and a unitary ˜ and C = CP ˜ . This allows us to write matrix P such that B H = QH B H H ˜H ˜ H ˜ H C. ˜ B C = Q B CP , which shows that B C ≡u B Since B and C have orthonormal columns, we have B H B = Im ˜ 1. This implies |||B H C|||2 and C H C = Im , so |||B|||2 1 and |||C||| |||B H |||2 |||C|||2 1, so γi ∈ [0, 1]. Definition 9.4. Let S and T be two subspaces of Cn such that dim(S) = dim(T ) = m and denote the singular values of the matrix
658
Linear Algebra Tools for Data Mining (Second Edition)
B H C by γ1 , . . . , γm , where B and C are matrices that define bases of S and T , respectively. The angles between S and T are θ1 , . . . , θm ∈ 0, π2 , where cos θi = γi for 1 i k. The squares of the singular values of B H C, cos2 θi for 1 i m, are the eigenvalues of the matrices C H BB H C = C H PB C or B H CC H B = B H PC B. If we denote a matrix of a basis of the subspace that is orthogonal to range(B) by B⊥ , then the squares of the singular values of (B⊥ )H C H C = C H (In − PB )C = are the eigenvalues of the matrices C H B⊥ B⊥ H In − C PB C. These squares correspond to sin2 θ1 , . . . , sin2 θn . Thus, for S = range(B) and T = range(C), the vector SIN(S, T ) =
sin θ1 . . . sin θp
of sinuses of angles between the subspaces S and T consists of the H C. singular values of the matrix B⊥ The minimal angle between the subspaces S and T is given by cos θ1 = σ1 . Theorem 9.16. Let S and T be two subspaces of Rn . The least angle between S and T is determined by the equality cos θ1 = |||PS PT |||2 , where PS and PT are the projection matrices of S and T, respectively. Proof.
We have cos θ1 = min{u v | u ∈ S, v ∈ T, u2 = v2 1}.
For u ∈ S and v ∈ T , we have PS u = u and PT v = v, which implies cos θ1 = min{u PS PT v | u ∈ S, v ∈ T, u2 = v2 1} = |||PS PT |||2 .
Theorem 9.17. Let A ∈ Cn× and B ∈ Cn× be two matrices both having orthonormal columns. If 2 n, there are unitary matrices Q ∈ Cn×n , U1 ∈ C× , and V1 ∈ C× such that ⎛ ⎛ ⎞ ⎞ I C ⎜ ⎜ ⎟ ⎟ QAU1 = ⎝ O, ⎠ ∈ Cn× and QBV1 = ⎝ S ⎠ ∈ Cn× , On−2, On−2, where C = diag(γ1 , . . . , γ ), S = diag(σ1 , . . . , σ ), 0 γ1 · · · γ , 1 σ1 · · · σ 0, and γi2 + σi2 = 1 for 1 i .
Singular Values
659
Proof. Expand the matrices A and B to two unitary matrices A˜ = ˜ = (B B1 ) ∈ Cn×n , respectively, and define the (A A1 ) ∈ Cn×n and B n×n matrix D ∈ C as H H B A B A 1 H ˜= . D = A˜ B AH1 B AH1 B1 We have A1 ∈ Cn×(n−) and B1 ∈ Cn×(n−) . The matrix D is unitary ˜ By applying the CS as a product of two unitary matrices, A˜H and B. decomposition, we obtain the existence of the unitary block diagonal matrices U = diag(U1 , U2 ) ∈ Cn×n and V = diag(V1 , V2 ) ∈ Cn×n such that U1 , V1 ∈ C× , U2 , V2 ∈ C(n−)×(n−) , and ⎛ ⎞ C −S O ⎜ ⎟ O ⎠, U H DV = ⎝ S C O O In−2 where C, S ∈ C× . In turn, this yields U1H AH BV1 = C ∈ C× ,
U1H AH B1 V2 = (−S O) ∈ C×(n−) , S H H ∈ C(n−)× , U2 A1 BV1 = O C O H H ∈ C(n−)×(n−) . U2 A1 B1 V2 = O In−2 Let Q=
U1H AH . U2H AH1
We have
H H U1H AH U1 A AU1 AU1 = QAU1 = U2H AH1 U2H AH1 AU1 ⎞ ⎛ I I = ⎝ O, ⎠ ∈ Cn× , = On−, On−2,
(9.9)
660
Linear Algebra Tools for Data Mining (Second Edition)
because A has orthonormal columns and U1 is unitary, and ⎛ ⎞ ⎛ H H⎞ C H H U1 A U1 A BV1 ⎜ ⎟ H H = ⎝ S ⎠ ∈ Cn× . QBV1 = ⎝U2 A1 x⎠ BV1 = H H U A BV 1 2 1 y O
Further results can be derived from Theorem 9.17 and its proof. Observe that the Equalities (9.9) imply C O S H H H H (U2 A1 B1 V2 U2 A1 BV1 ) = O I O so
U2 A1 (B1 V2 BV1 ) = H
H
C O S O I O
∈ Cn× ,
which is equivalent to
V2 B 1 V1H B H H
H
⎞ C O ⎟ ⎜ A1 U2 = ⎝ O I ⎠ ∈ C×n . S O ⎛
Also,
In− V2H B1H B 1 V2 = ∈ Cn×(n−) V1H B H On,n−
˜ is the because B1 has orthonormal columns and V2 is unitary. If Q matrix H H V2 B 1 ˜ , Q= V1H B H then we have
⎞ ⎛ C O I ⎟ ⎜ ˜ ˜ . QA1 U2 == ⎝ O I ⎠ and QB1 V2 = O S O
A twin of Theorem 9.17 follows.
Singular Values
661
Theorem 9.18. Let A ∈ Cn× and B ∈ Cn× be two matrices both having orthonormal columns. If n < 2, there are unitary matrices Q ∈ Cn×n , U1 ∈ C× , and V1 ∈ C× such that ⎛ ⎞ On−,2−n In− ⎜ ⎟ I2−n ⎠ ∈ Cn× QAU1 = ⎝O2−n,n− On−,n− On−,2−n and ⎛
C1
⎜ QBV1 = ⎝O2−n,n− S1
⎞ On−,2−n ⎟ I2−n ⎠ ∈ Cn× , On−,2−n
where C1 = diag(γ1 , . . . , γn− ), S1 = diag(σ1 , . . . , σn− ), 0 γ1 · · · γn− , 1 σ1 · · · σn− 0, and γi2 + σi2 = 1 for 1 i n − . Proof. We are using the same notations as in Theorem 9.17 and the discussion that precedes the current theorem. Since A ∈ Cn× , B ∈ Cn× , and n < 2, there exist two matrices A0 ∈ Cn×n− ˜ = (B0 , B) and B0 ∈ Cn− such that the matrices A˜ = (A0 , A) and B are both unitary. For = n − we have 0 < n. Thus, we can apply Theorem 9.17 to the matrices A0 and B0 ; the roles played by the matrices A1 and B1 in the previous proof are played now by A and B, and the roles played by U1 and V1 are played by U2 and V2 , respectively. Theorem 9.19. Let S and T be two subspaces of Rn such that dim(S) = dim(T ) = where 2 n and let σ1 , . . . , σ be the sinuses of the angles between S and T . If PS and PT are the projections on S and T, then the singular values of the matrix PS (In − PT ) are σ1 , . . . , σ , 0, . . . , 0 and the singular values of the matrix PS − PT are σ1 , σ1 , . . . , σ , σ , 0, . . . , 0. Proof. Let B = (u1 · · · u ) ∈ Cn× and C = (v 1 · · · v ) ∈ Cn× be matrices whose columns are orthonormal bases of S and T , respectively. The projection matrices of these subspaces are the Hermitian matrices PS = BB H and PT = CC H . By Theorem 9.17,
662
Linear Algebra Tools for Data Mining (Second Edition)
there are unitary matrices Q ∈ Cn×n , U1 ∈ C× , and V1 ∈ C× such that ⎞ ⎞ ⎛ ⎛ I C ⎟ ⎟ ⎜ ⎜ QBU1 = ⎝ O, ⎠ ∈ Cn× and QCV1 = ⎝ S ⎠ ∈ Cn× , On−2, On−2, where C = diag(γ1 , . . . , γ ), S = diag(σ1 , . . . , σ ), 0 γ1 · · · γ , 1 σ1 · · · σ 0, and γi2 + σi2 = 1 for 1 i . Therefore, we have QPS (In − PT )QH = QBB H (In − CC H )QH = QBB H QH − QBB H CC H QH = (QBU1 )(U1H B H QH ) − QBU1 U1H B H QH (QCV1 )(V1H C H QH ) = (QBU1 )(QBU1 )H − (QBU1 )(QBU1 )H (QCV1 )(QCV1 )H = (QBU1 )(QBU1 )H (In − (QCV1 )(QCV1 )H ). Since
and
⎛
⎞ I ⎜ ⎟ (QBU1 )(QBU1 )H = ⎝ O, ⎠ (I O, O,n−2 ) On−2, ⎞ ⎛ O, O,n−2 I ⎟ ⎜ O, O,n−2 ⎠ = ⎝ O, On−2, On−2, On−2,n−2 ⎛ ⎜ (QCV1 )(QCV1 )H = ⎝
C S On−2,
⎞ ⎟ ⎠ (C S O,n−2 )
⎞ CS O,n−2 C2 ⎟ ⎜ S2 O,n−2 ⎠ , = ⎝ SC On−2, On−2, On−2,n−2 ⎛
Singular Values
663
it follows that QPS (In − PT )QH ⎛ ⎞⎛ ⎞ O, O,n−2 I I − C 2 −CS O,n−2 ⎜ ⎟⎜ ⎟ O, O,n−2 ⎠ ⎝ −SC I − S 2 O,n−2 ⎠ = ⎝ O, On−2, On−2, On−2,n−2 On−2, On−2, In−2 ⎛ ⎞ ⎛ ⎞ O,n−2 I − C 2 −CS S ⎜ ⎟ ⎜ ⎟ O, O,n−2 ⎠ = ⎝ O, ⎠ (S − C O,n−2 ) = ⎝ O, On−2, On−2, On−2,n−2 On−2, because I − C 2 = S 2 . Since the rows of the matrix (S − C O,n−2 ) are orthonormal, it follows that the singular values of PS (In − PT ) are σ1 , . . . , σ , 0, . . . , 0. For the matrix PS − PT , we can write Q(PS − PT )QH = QBB H QH − QCC H )QH = QBU1 U1H B H QH − QCV1 V1H C H )QH = (QBU1 )(QBU1 )H − (QCV1 )(QCV1 )H . Substituting QBU1 and QCV1 yields the equality ⎛ ⎞ O, O,n−2 I ⎜ ⎟ O, O,n−2 ⎠ Q(PS − PT )QH = ⎝ O, On−2, On−2, On−2,n−2 ⎞ ⎛ CS O,n−2 C2 ⎟ ⎜ S2 O,n−2 ⎠ − ⎝ SC On−2, On−2, On−2,n−2 ⎞ ⎛ O,n−2 I − C 2 −CS ⎟ ⎜ −S 2 O,n−2 ⎠ = ⎝ −SC On−2, On−2, On−2,n−2 ⎞ ⎛ −CS O,n−2 S2 ⎟ ⎜ −S 2 O,n−2 ⎠ . = ⎝ −SC On−2, On−2, On−2,n−2 The desired conclusion follows from Supplement 9.9.
664
Linear Algebra Tools for Data Mining (Second Edition)
Theorem 9.20. Let S and T be two subspaces of Rn such that dim(S) = dim(T ) = where n < 2 and let σ1 , . . . , σn− be the sinuses of the angles between S and T . If PS and PT are the projections on S and T, then the singular values of the matrix PS (In − PT ) are σ1 , . . . , σn− , 0, . . . , 0 and the singular values of the matrix PS − PT are σ1 , σ1 , . . . , σn− , σn− , 0, . . . , 0. Proof. This statement can be obtained from Theorem 9.18 using an argument similar to the one used in the proof of Theorem 9.19. 9.8
Spectral Resolution of a Matrix
In Theorem 7.3, we have shown that if S is an invariant subspace of a matrix A ∈ Cn×n such that dim(S) = p, and X ∈ Cn×p is a matrix whose set of columns is a basis of S, then there exists a unique matrix L ∈ Cp×p that depends only on the basis S such that AX = XL. Definition 9.5. The representation of a matrix A ∈ Cn×n relative to a basis B = {x1 , . . . , xp } of an invariant subspace S with dim(S) = p is the matrix L ∈ Cp×p that satisfies the equality AX = XL, where X = (x1 · · · xp ). Note that if p = 1, S is a unidimensional subspace and L = (λ) has as its unique entry the eigenvalue of that corresponding to this space. Lemma 9.3. Let A ∈ Cn×n and let X ∈ Cn×p be a full-rank matrix such that rank(X) = p n. If Y ∈ Cn×(n−p) is a matrix such that range(Y ) = range(X)⊥ , then range(X) is an invariant subspace of A if and only if Y H AX = On−p,p . In this case, range(Y ) is an invariant subspace of AH . Proof. Let X and Y be matrices that satisfy the conditions formulated above. The subspace range(X) is an invariant subspace of A if and only if Ax ∈ range(X), which is equivalent to Ax ⊥ range(Y ) for every column x of X. This, in turn, is equivalent to Y H Ax = 0n−p for every column x of X and, therefore, to Y H AX = On−p,p . The last equality can be written as X H AH Y = Op,n−p , which implies that range(Y ) is an invariant subspace of AH .
665
Singular Values
Theorem 9.21. Let S be an invariant subspace for a matrix A ∈ Cn×n such that dim(S) = p and let X ∈ Cn×p be a matrix whose columns constitute an orthonormal basis for S. Let U = (X Y ) be a unitary matrix, where Y ∈ Cn×(n−p) . We have U AU =
K
H
On−p,p
L M
,
(9.10)
where K ∈ Cp×p , L ∈ Cp×(n−p), and M ∈ C(n−p)×(n−p) . Proof.
We have (X Y ) A(X Y ) = H
X H AX X H AY Y H AX Y H AY
.
Since (X Y ) is a unitary matrix, it is clear that range(Y ) = range(X)⊥ . By Lemma 9.3, it follows that Y H AX = On−p,p . Thus, we can define K = X H AX, M = Y H AY, and L = X H AY. The Equality (9.10) is known as the reduced form of S with respect to the unitary matrix U = (X Y ). By Theorem 7.15, spec(A) = spec(K) ∪ spec(M ). If spec(K) ∩ spec(M ) = ∅, we say that S = range(X) is a simple invariant space. Theorem 9.22. Let S be a simple invariant subspace for a matrix A ∈ Cn×n of rank p that has the reduced form U AU = H
K On−p,p
L M
,
(9.11)
with respect to the unitary matrix U = (X Y ), where X ∈ Cn×p and Y ∈ Cn×(n−p) . There are matrices W ∈ Cn×(n−p) and Z ∈ Cn×p such that (X W )−1 = (Z Y )H and A = XKZ H + W M Y H , where K = Z H AX ∈ Cp×p and M = Y H AW ∈ C(n−p)×(n−p) .
(9.12)
666
Linear Algebra Tools for Data Mining (Second Edition)
Proof.
Ip On−p,p
We claim that there exists a matrix Q ∈ Cp×(n−p) such that −Q In−p
K
L M
On−p,p
Ip
Q
On−p,p In−p
=
K On−p,p
Op,n−p . M
Indeed, the above equality is equivalent to
K On−p,p
KQ − QM + L M
=
K On−p,p
Op,n−p M
and the matrix Q that satisfies the equality KQ − QM = −L exists by Theorem 8.25, because S is a simple invariant space (so spec(K)∩ spec(M ) = ∅). This allows us to write
K On−p,p
Op,n−p M
H X Ip −Q Q = A(X Y ) YH On−p,p In−p On−p,p In−p H X − QY H A(X XQ + Y ). = YH Ip
Observe that −1
(X XQ + Y )
=
X H − QY H , YH
because
X H − QY H (X XQ + Y ) YH
= XX H + Y Y H = In .
This follows from the fact that (X Y ) is a unitary matrix. Define W = XQ + Y and Z = X − Y QH . We have (X W )−1 = (Z Y )H ; also, Z H W = Op,n−p . This allows us to write A = (X XQ + Y )
K On−p,p
Op,n−p M
X H − QY H , YH
which is equivalent to A = XKZ H + W M Y H .
(9.13)
Singular Values
Note that we have
667
I O ZH (X W ) = , YH O I
so Z H X = Ip , Z H W = Op,n−p , Y H X = On−p,p , and Y H W = In−p . (9.14) Equality (9.12) implies AX = XKZ H X + W M Y H X = X(KZ H X), because Y H X = On,n and AW = XKZ H W + W M Y H W = W (M Y H W ) because Z H W = Op,n−p . The last equality implies that range(W ) is an invariant space of A. Since (X W ) is non-singular, range(X W ) = Cn . The Equality (9.12) and its equivalent (9.13) are known as the spectral resolution equalities of A relative to the subspaces S = range(X) and T = range(W ). Note that range(W ) is a complementary invariant subspace of the subspace range(X). Let P1 = XZ H ∈ Cn×n and P2 = W Y H ∈ Cn×n . The Equalities (9.14) imply P12 = XZ H XZ H = P1 , P22 = W Y H W Y H = P2 and P1 P2 = XZ H W Y H = On,n , P2 P1 = W Y H XZ H = On,n . Also, we have A = P1 AP1 + P2 AP2 . It is clear that P1 s ∈ S for s ∈ S and P2 t ∈ T for t ∈ T , where S = range(X) and T = range(W ). Also P1 t = 0n and P2 s = 0n for s ∈ S and t ∈ T . A vector x ∈ Cn can be written as x = s + t, where s = P1 x and t = P2 x. This justifies referring to P1 as the projection of S along T . Recall that the separation of two matrices was introduced in Equality 8.3.
668
Linear Algebra Tools for Data Mining (Second Edition)
Lemma 9.4. Let K ∈ Cp×p , H ∈ Cp×(n−p) , G ∈ C(n−p)×p , and M ∈ C(n−p)×(n−p) be four matrices such that spec(K) ∩ spec(M ) = ∅. Define γ, η, and δ as γ = GF , η = HF , and δ = sep(A, B). 2
If γη < δ4 , then there exists a unique X ∈ C(n−p)×p such that SM,K (X) = M X − XK = XHX − G and XF
2γ γ > svd(A) ans = 4.3674 1.2034 0.0000
Singular Values
675
It is interesting to note that rank(A) = 2 since the last column of A is the sum of the first two columns. Thus, we would expect to see two non-zero singular values. To compute |||A|||2 , which equals the largest singular value of A, we can use max(svd(A)). MATLAB has the function rank that computes the numerical rank of a matrix numrankA (d). By default, the value of d is set to a value known as tolerance, which is max(size(A)*eps(max(s)), where s is the vector of the singular values of A. Otherwise, we can enter our own tolerance tol level using a call rank(A,tol). Small variation in the values of the elements of a matrix can trigger sudden changes in rank. These variations may occur due to rounding errors as rational numbers (such as 13 ) may be affected by these rounding errors. Example 9.4. When we enter the matrix A considered above, the system responds with A = 0.3333 2.2000 0.1429 0.1111
1.3333 0.8000 0.5714 0.4444
1.6667 3.0000 0.7143 0.5556
Calls of the rank function with several levels of tolerance return results such as >> rank(A,10^(-9)) ans = 2 >> rank(A,10^(-15)) ans = 2 >> rank(A,10^(-16)) ans = 2 >> rank(A,10^(-17)) ans = 3
If d of numrankA (d) is sufficiently low, the numerical rank of the representation of A (affected by rounding errors) becomes 3.
676
Linear Algebra Tools for Data Mining (Second Edition)
Another variant of the svd function, [U,S,V] = svd(A), yields a diagonal matrix S, of the same format as A and with nonnegative diagonal elements in decreasing order, and unitary matrices U and V so that A = U SV H . For the matrix A shown above, we obtain >> [U,S,V] = svd(A) U = -0.4487 0.7557 -0.8599 -0.5105 -0.1923 0.3239 -0.1496 0.2519 S = 4.3674 0 0 1.2034 0 0 0 0 V = -0.4775 -0.3349 -0.8123
-0.6623 0.7447 0.0823
0.4769 0.0000 -0.6726 -0.5658
0.0161 -0.0000 -0.6370 0.7707
0 0 0.0000 0
0.5774 0.5774 -0.5774
The “economical form” of the svd function is [U,S,V] = svd(A,’econ’)
If A ∈ Rm×n and m > n, only the first n columns of U are computed and S ∈ Rn×n . If m < n, only the first m columns of V are computed. Example 9.5. Starting from the matrix ⎛ ⎞ 18 8 20 ⎜−4 20 1 ⎟ ⎜ ⎟ A=⎜ ⎟ ∈ R4×3 ⎝ 25 8 27⎠ 9 4 10 a call to the economical variant of the svd function yields >> [U,D,V] = svd(A,’econ’) U = -0.5717 -0.0211 0.8095 -0.0721 -0.9933 -0.0669
Singular Values
-0.7656 -0.2859
0.1133 -0.0105
-0.4685 -0.3474
49.0923 0 0
0 20.2471 0
0 0 0.0000
-0.6461 -0.2706 -0.7137
0.3127 -0.9468 0.0760
0.6963 0.1741 -0.6963
677
S =
V =
Example 9.6. The function svapprox given in what follows com putes the successive approximations B(k) = ki=1 σi ui vHi of a matrix A ∈ Rm×n having the SVD A = U DV H and produces a threedimensional array C ∈ Rm×n×r , where r is the numerical rank of A and C(:, :, k) = B(k) for 1 k r. function [C] = svapprox(A) %SVAPPROX computes the successive approximations % of A using the singular component decomposition. % The number of approximations equals the % numerical rank of A. % determine the format of A and its numerical rank [m, n] = size(A); r = rank(A,10^(-5)); % compute the SVD of A [U,D,V] = svd(A); C = zeros(m,n,r); C(:,:,1) = D(1,1) * U(:,1) * (V(:,1))’; for k=2:r C(:,:,k) = D(k,k) * U(:,k) * (V(:,k))’ + C(:,:,k-1); end;
This function is used in Example 9.7. Example 9.7. In Figure 4.10(a), we have an image of the digit 4 created from a pgm file that contains the representation of this digit, as discussed in Example 4.56. The numerical rank of the matrix A introduced in the example mentioned above is 8. Therefore, the array C computed by
678
Linear Algebra Tools for Data Mining (Second Edition)
(a)
Fig. 9.1
(b)
(c)
(d)
Successive approximations of A.
C = svapprox(A) consists of 8 matrices. To represent these matrices in the pgm format, we cast the components of C to integers of the type uint8 using D = min(16,uint8(C)). Thus, D(:, :, j) contains the rounded jth approximation of A and we represent the images for the first four approximations in Figure 9.1. Note that the digit four is easily recognizable beginning with the second approximation. The use of SVD in approximative recognition of digits is discussed extensively in [44].
Exercises and Supplements (1) Prove that if all singular values of a matrix A ∈ Cn×n are equal, then there exists a unitary matrix U ∈ Cn×n and a real number a such that A = aU . (2) Prove that if A ∈ Cm×n has an orthonormal set of columns (an orthonormal set of rows), then its singular values equal 0 or 1. (3) Let c1 , . . . , c , s1 , . . . , s be 2 real numbers such that s2i + c2i = 1 and si 0 for 1 i n. Prove that the singular values of the matrix diag(−c1 s1 , . . . , −c s ) diag(s21 , . . . , s2 ) X= −diag(s21 , . . . , s2 ) diag(−c1 s1 , . . . , −c s ) are s1 , s1 , . . . , s , s . Solution: Note that
XX =
E O , O E
where E = diag(s21 , . . . , s2 )2 + diag(−c1 s1 , . . . , −c s )2 = diag(s41 + c21 s21 , . . . , s4 + c2 s2 ) = diag(s21 , . . . , s2 ).
Singular Values
679
Thus, XX has the eigenvalues s21 , . . . , s2 each with algebraic multiplicity 2, which implies that X has the eigenvalues s1 , s1 , . . . , s , s . (4) Prove that if Pφ is a permutation matrix, where φ ∈ PERMm , then the set of singular values of the matrix Pφ A is the same as the set of singular values of the matrix A for every A ∈ Cm×n . (5) Prove that 0 ∈ spec(A), where A ∈ Cn×n if and only if 0 is a singular value of A. (6) Compute the singular values of the matrices A=
1 0 0 1 0 0 0 0 ,B = ,C = ,D = . 0 0 0 0 0 1 1 0
Let X, Y be any two of these four matrices. Verify that XY and Y X have 0 as an eigenvalue with multiplicity 2. However, the sets of singular values of each of the matrices in the pairs (AB, BA), (AD, DA), (BC, CB), and (CD, DC) are distinct. (7) Let A ∈ Cn×n be a Hermitian matrix, and let A = U H DU be the decomposition given by the spectral theorem for Hermitian matrices. (a) Prove that if spec(A) ⊆ R0 , then A = U H DU is an SVD of A. (b) If spec(A) = {λ1 , . . . , λn } ⊆ R0 , let S be diagonal matrix such that sii = 1 if λi 0 and sii = −1 if λi < 0. Prove that A = (U H S)(SD)U is an SVD of A. (8) Let A ∈ Cn×n be a Hermitian matrix such that spec(A) = {λ1 , . . . , λn } ⊆ R. Prove that the singular values of A are |λ1 |, . . . , |λn |. Solution: The singular values of A equal the square roots of the non-zero eigenvalues of the matrix AH A. Since A is Her2 mitian, these eigenvalues equal the non-zero eigenvalues of A , which means that the singular values of A are λ21 , . . . , λ2n , that is, |λ1 |, . . . , |λn |. (9) Next we give a stronger form of Corollary 8.20 that holds for Hermitian matrices. Let A ∈ Cn×n and B ∈ Cm×m be two matrices. Prove that if A and B are Hermitian, then sepF (A, B) = min{|λ − μ| | λ ∈ spec(A), μ ∈ spec(B)}.
680
Linear Algebra Tools for Data Mining (Second Edition)
Solution: The Sylvester operator S A,B (X) = AX − XB can be written as Kronecker operations as vec(S A,B (X)) = vec(AX) − vec(XB) = (In ⊗ A)vec(X) − (B ⊗ Im )vec(X) = (In ⊗ A − B ⊗ Im )vec(X) = (A B)vec(X) by applying the identities of Supplement 90 of Chapter 3. Since both A and B are Hermitian, by Supplement 83 of the same chapter, the matrix A B is Hermitian and, by Theorem 7.25, spec(A B) = {λ − μ | λ ∈ spec(A), μ ∈ spec(B)}. Also, note that vec(X)2 = XF , so S A,B (X)F = (A B)vec(X)2 . Since sepF (A, B) = min{S A,B (X)F | XF = 1} = min{(A B)vec(X)2 | vecXF = 1, it follows that sepF (A, B) equals the smallest singular value of the Hermitian matrix A B. By Corollary 9.2 and by supplement 9.9, this equals min{|λ − μ| | λ ∈ spec(A), μ ∈ spec(B)}. (10) Let A ∈ Cm×n be a matrix having full rank r and let σ1 · · · σr > 0 be its non-zero singular values. Prove that if B ∈ Cm×n is a matrix such that |||A − B|||2 < σr , then B has full rank. (11) Let A ∈ Cm×n be a matrix. Prove that (a) for every > 0 there exists a matrix B having distinct singular values such that A − BF < ; (b) if rank(A) = r < min{m, n}, then for every > 0 there exists a matrix B ∈ Rm×n having full rank such that A − BF . Solution: We discuss only the first part of this Supplement. Let A = U DV H be an SVD of the matrix A, where D = diag(σ1 , . . . , σp ) ∈ Cm×n and σ1 σ2 · · · σp . Suppose that m n and consider the matrix Dδ = diag(d11 + δ, d22 + 2δ, . . . , dnn + nδ) ∈ Cm×n , where δ > 0. Clearly, dii = σi for 1 i p and dii = 0 for p + 1 i n. If σ1 = · · · = σp , then the diagonal elements of Dδ are all distinct. Otherwise, choose δ such that 0 < δ < min1ip−1 (σi − σi+1 ); this results again in a matrix Dδ with distinct diagonal entries. Define Aδ = U Dδ V H . Since U and V are unitary
681
Singular Values
matrices, it follows that A − Aδ F = D − Dδ F = δ Choosing δ such as √ 2 δ < min min (σi − σi+1 ), , 1ip−1 n(n + 1)
n(n+1) . 2
implies A − Aδ F < . (12) Let A ∈ Cn×n be a non-singular matrix such that the matrix AH A can be unitarily diagonalized as AH A = U S 2 U H , where U is a unitary matrix and S = diag(σ1 , . . . , σp ). Prove that the matrix V = AH U S −1 is unitary and that A has the singular value decomposition A = U SV H . Let A ∈ Cm×n be a matrix of rank r. By Corollary 6.21, we have Cm = range(A) null(AH ), Cn = range(AH ) null(A).
Let Brange(A) = {u1 , . . . , ur }, Bnull(AH ) = {ur+1 , . . . , um }, Brange(AH ) = {v 1 , . . . , v r }, Bnull(A) = {v r , . . . , r n }, be orthonormal bases of the subspaces range(A), null(AH ), range(AH ), and null(A), respectively. (13) Let A ∈ Cm×n be a matrix of rank r. Define the unitary matrices U = (u1 · · · ur ur+1 · · · um ) ∈ Cm×m , V = (v 1 · · · v r v r+1 · · · v n ) ∈ Cn×n . An alternative method for obtaining the singular value decomposition of a matrix is as follows. (a) Prove that U AV = H
C Om−r,r
Or,n−r , Om−r,n−r
where C ∈ Cr×r is a matrix of rank r.
682
Linear Algebra Tools for Data Mining (Second Edition)
(b) Prove that there are two unitary matrices U1 ∈ Cm×m and V1 ∈ Cn×n such that U1
C Om−r,r
D Or,n−r Or,n−r H V1 = , Om−r,n−r Om−r,r Om−r,n−r
where D ∈ Cr×r is a diagonal matrix. (14) Let A ∈ Cn×n be a matrix whose singular values are σ1 , . . . , σn . k k Define the matrix A[k] = A ⊗ A ⊗ · · · ⊗ A ∈ Cn ×n as in Exercise 32. Prove that the singular values of A[k] have the form σi1 σi2 · · · σik , where 1 i1 , . . . , ik n. (15) Let A ∈ Cn×n be a square matrix having the eigenvalues λ1 , λ2 , . . . , λn and the singular values σ1 , σ2 , . . . , σn , where |λ1 | |λ2 | · · · |λn | and σ1 σ2 · · · σn . Prove that |λ1 · · · λk | σ1 · · · σk for every k, 1 k n − 1 and that |λ1 · · · λn | = σ1 · · · σn . (16) Let A = P Y be the polar decomposition of the matrix A, where P ∈ Cm×m is a positive semidefinite matrix such that P 2 = AAH , and Y ∈ Cm×n is such that Y Y H = Im . Obtain the SVD of A by applying Theorem 8.14 to the Hermitian matrix P 2 . n (17) Prove that if A ∈ Cn×n , then | det(A)| = i=1 σi , where σ1 , . . . , σn are the singular values of A. (18) Prove that A ∈ Cn×n , then the singular values of the matrix adj(A) are {σj | j ∈ {1, . . . , n} − {i}} for 1 i n. (19) Let S = {x ∈ Rn | x2 = 1} be the surface of the unit sphere in Rn and let A ∈ Rn×n be an invertible matrix whose SVD is A = U DV , where U, V are orthogonal matrices and D = diag(σ1 , . . . , σn ). Prove that (a) the image of S under left multiplication by A, that is, the set {Ax | x ∈ S}, is an ellipsoid having the semiaxes σ1 , . . . , σn ; (b) The columns of U give the directions of the semiaxes of the ellipsoid. Solution: Since A = U DV , it follows immediately that −1 A = V D −1 U . Thus, if y = Ax for some x ∈ S, then, by
683
Singular Values
defining w = U y, we have w2 w12 w22 + 2 + · · · + 2n = D −1 w22 = D −1 U y22 2 σn σ1 σ2 = V D −1 U u22 = A−1 y22 = A−1 Ax22 = x22 = 1. Thus, {U Ax | x ∈ S} is an ellipsoid having σ1 , . . . , σn as semiaxes. Since multiplication by an orthogonal matrices is isometric, it follows that the set {Ax | x ∈ S} is obtained by rotating the previous ellipsoid, so it is again an ellipsoid having the same semiaxes. (20) Let A = σ1 u1 v H1 + · · · + σr ur v Hr be the SVD of a matrix A ∈ Rm×n , where σ1 · · · σr > 0. Prove that the matrix B(k) = k H i=1 σi ui v i from Lemma 9.2 can be written as B(k) =
k
i=1
ui (ui )H A =
k
Av i (v i )H .
i=1
In other words, B(k) is the projection of A on the subspace generated by the first k left singular vectors of A. Solution: The equalities follow immediately from Theorem 9.7. (21) Let A ∈ Cm×n be a matrix. Define the square matrix A˜ ∈ C(m+n)×(m+n) as Om,m A ˜ . A= AH On,n Prove that: (a) A˜ is a Hermitian matrix; (b) the matrix A has the singular values σ1 · · · σk , where ˜ consists of the numk = min{m, n} if and only if spec(A) bers −σ1 · · · −σk σk . . . σ1 and an additional number of 0s. Solution: The first part is immediate.
684
Linear Algebra Tools for Data Mining (Second Edition)
Suppose that m n and let A = U DV H be the singular value decomposition of A, where S , D= Om−n,n U ∈ Cm×m , V ∈ Cn×n , and S = diag(σ1 , . . . , σn ). Since m n, we can write U = (Y Z), where Y ∈ Cm×n and Z ∈ Cm×(m−n) . Define the matrix 1 √ Y − √1 Y √1 Z 2 2 2 ∈ C(m+n)×(m+n) . W = √1 Z √1 Z O n,m−n 2 2 This matrix is unitary because WWH = =
√1 Y 2 √1 Z 2
− √12 Y √1 Z 2
Im Om,n On,m In
√1 Z 2
On,m−n
⎛
√1 Y H 2 ⎜− √1 Y H ⎝ 2 √1 Z H 2
√1 Z H 2 √1 Z H 2
Om−n,n
⎞ ⎟ ⎠
= Im+n .
Then, A˜ can be factored as ⎞ ⎛ On,m−n S On,n ⎟ ⎜ −S On,m−n ⎠ W H , A˜ = W ⎝ On,n Om−n,n Om−n,n Om−n,m−n which shows that the spectrum of A˜ has the form claimed above. (22) Let A and E be two matrices in Cm×n and let p = min{m, n}. Denote by σ1 (A) · · · σp (A) the singular values of A arranged in decreasing order and by σ1 (E) · · · σp (E) and σ1 (A + E) · · · σp (A + E) the similar sequences for E and A + E, respectively, where p = min{m, n}. Prove that |σi (A + E) − σi (A)| |||E|||2 for 1 i p and pi=1 (σi (A + E) − σi (A))2 E2F . ˜ E ˜ and A Solution: Consider the matrices A, + E defined in ˜ Supplement 9.9 . It is immediate to verify that A + E = A˜ + E.
685
Singular Values
The eigenvalues of the matrix A˜ arranged in increasing order are −σ1 (A) · · · −σk (A) σk (A) · · · σ1 (A), ˜ in the same order are and the eigenvalues of E −σ1 (E) · · · −σk (E) σk (E) · · · σ1 (E). By Weyl’s Theorem (Supplement 30 of Chapter 8), we have −σ1 (E) σi (A + E) − σi (A) σ1 (E) = |||E|||2 , by Theorem 9.5, so |σi (A + E) − σi (A)| |||E|||2 for 1 i p. (23) Let S, T be two subspaces of Rn such that dim(S) = dim(T ) = r and let {u1 , . . . , ur } and {v 1 , . . . , v r } be two orthonormal bases of S and of T represented by the matrices BS = (u1 . . . ur ) ∈ Cn×r and BT = (v 1 . . . v r ) ∈ Cn×r , respectively. Prove that the singular values of the matrix BT BS are located in the interval [0, 1]. (24) Let · be a unitarily invariant norm on Cm×n , where m n, z ∈ Cm . Define the matrix Z = (diag(z1 , . . . , zm ) Om,n−m ∈ Cm×n . Prove that (a) the set of singular values of Z is {|z1 |2 , . . . , |zm |2 }; (b) the function g : Cm −→ R0 defined by g(z) = Z is a symmetric gauge function. (25) Let g : Rn −→ R0 be a symmetric gauge function and let A g be defined by Ag = g(s), where A ∈ Cm×n , and z =
σ1 . . . σn
is
a vector whose components are the singular values of A. Prove that · g is a unitarily invariant norm. In Exercise 51 of Chapter 6, we noted that νp is a symmetric gauge function on Rn . The norms · νp for p 1 are known as the Schatten norms. Note that for p = 2 the Schatten norm coincides with the Frobenius norm. For p = 1, we obtain Aν1 = σ1 +· · ·+σr , where σ1 , . . . , σr are the non-zero singular values of A. Since Aν1 = trace(D), where D is the diagonal matrix that occurs in the SVD of A, this Schatten norm is also known as the trace norm or Ky Fan’s norm.
686
Linear Algebra Tools for Data Mining (Second Edition)
(26) Let A ∈ Cm×n be a matrix of rank r such that there exists a factorization R Or,n−r A=U V H, Om−r,r Om−r,n−r where R ∈ Cr×r is a matrix of rank r, and U ∈ Cm×m and V ∈ Cn×n are unitary matrices. Note that the SVD of A is a special case of this factorization. Prove that the Moore–Penrose pseudoinverse of A is given by −1 R Ok,m−k † U H. A =V On−k,k On−k,m−k Solution: By Theorem 3.26, it suffices to verify that AA† A = A and A† AA† = A† . (27) Prove that for the matrices A(1) , . . . , A(n) , we have (A(1) ⊗ A(2) ⊗ · · · ⊗ A(n) )† = A(1)† ⊗ A(2)† ⊗ · · · A(n)† , where A(k) ∈ Rmk ×nk for 1 k n. Solution: Note that the matrices A(i) can be factored as Ori ,ni −ri Ri (i) A = Ui Vi , Omi −ri ,ri Omi −ri ,ni −ri where Ui ∈ Rmi ×mi , Vi ∈ Rni ×ni are orthonormal matrices, Ri ∈ Rri ×ri , and ri = rank(A(i) ). The argument is by induction on n. For the base case n = 2, observe that the pseudoinverses A(i)† are −1 R O i Ui A(i)† = Vi 0 O for i = 1, 2. Therefore, A(1)† ⊗ A(2)† −1 O R1−1 O R 2 = V1 ⊗ U1 ⊗ V2 U2 0 O O O R1−1 ⊗ R2−1 O (U1 ⊗ U2 ). = (V1 ⊗ V2 ) O O
Singular Values
687
On the other hand, we have "† ! A(1) ⊗ A(2) † R1 O R1 O V1 ⊗ U1 V2 = U1 O O O O † R1 O R2 O = (U1 ⊗ U2 ) ⊗ (V1 ⊗ V2 ) O O O O (R1 ⊗ R2 )−1 O = (V1 ⊗ V2 ) (U1 ⊗ U2 ) O O (R1−1 ⊗ R2−1 O (U1 ⊗ U2 ) , = (V1 ⊗ V2 ) O O
which concludes the base step. The proof of the induction step is straightforward. (28) Let A, B ∈ Rm×p and let Q ∈ Rp×p be an orthogonal matrix. Prove that (a) A − BQF is minimal if and only if trace(Q B A) is maximal; (b) if the SVD of the matrix B A is B A = U diag (σ1 , . . . , σp )V . The number trace(Q B A) is maximal if Q = U V . Solution: As we saw in Example 6.21, A − BQ2F = trace((A − BQ) (A − BQ)) = trace((A − Q B )(A − BQ) = trace(A A − Q B A − A BQ + Q B BQ) = trace(A A) − trace(Q B A) − trace(A BQ) + trace(Q B BQ) = A2F − 2trace(Q B A) + trace(BQQ B ) = A2F − 2trace(Q B A) + trace(BB )
688
Linear Algebra Tools for Data Mining (Second Edition)
(because the orthogonality of Q means that QQ = I) = A2F + B2F − 2trace(Q B A), which justifies the claim made in the first part. For the second part, we start from the equality trace(Q B A) = trace(Q U diag(σ1 , . . . , σp )V ) = trace(V Q U diag(σ1 , . . . , σp )). Note that the matrix T = V Q U is orthogonal and trace(T diag(σ1 , . . . , σp ) =
p
tii σi
i=1
p
σi .
i=1
When Q = U V , we have T = I, so the maximum is achieved. (29) Let A ∈ Cm×n be a matrix having σ1 , . . . , σr as its non-zero singular values. The volume of A is the number vol(A) = σ1 · · · σr . Using the notations of Supplement 35 of Chapter 5, prove that vol(A) = Dr2 (A). Solution: Let A = U DV H be the thin SVD decomposition of A (see Corollary 9.3), where U ∈ Cm×r and V ∈ Cn×r are matrices having orthonormal sets of columns and D = diag(σ1 , . . . , σr ). Then, vol(D) = σ1 · · · σr . Note that matrices U and DV H constitute a full-rank decomposition of A, so by Supplement 35 of Chapter 5, Dr2 (A) = Dr2 (U )Dr2 (DV H ), In turn, D and V H are full-rank matrices, so Dr2 (DV H ) = Dr2 = σ12 · · · σr2 , which concludes the argument. # $ 1, . . . , p m×n be a matrix and let B = A . If σ1 (30) Let A ∈ C 1, . . . , n · · · σn are the singular values of A and τ1 · · · τn are the singular values of B, prove that σi τi for 1 i n. Solution: We can write
B A= , C
where C ∈ C(m−p)×n . We have AH A = B H B + C H C. The singular values of A are the eigenvalues of AH A, τ1 , . . . , τn are the
Singular Values
689
singular values B H B. The singular values of C, γ1 , . . . , γn are non-negative. By applying Theorem 8.22, we obtain σi τi for 1 i n. (31) Let A ∈ Cm×n be a matrix such that rank(A) = r. Prove that there exists a set of r(m + n + 1) numbers that determines the matrix A. Solution: The matrix A has mn components and a thin SVD decomposition of the form A = σ1 u1 v 1 + · · · + σr ur v r , where ui ∈ Cm and v i ∈ Cn . Thus, if we select r singular values, mr components for the vectors ui and nr components for the vectors v i , the matrix A is determined by r + mr + nr = r(m + n + 1) numbers. (32) Let A = U DV H be the thin SVD decomposition of the matrix A ∈ Cm×n . Prove that A(AH A)−1/2 = U V H . Solution: Note that AH A = V DU H U DV H = V D 2 V H because U is a Hermitian matrix. Thus, (AH A)−1/2 = V D −1 V H , which yields A(AH A)−1/2 = U DV H V D −1 V H = U V H . (33) Let A ∈ Cm×n be a matrix with m n and rank(A) = r n. Prove that A can be factored as A = U1 D1 V H , where U1 ∈ Cm×n matrix, D ∈ Cn×n is a diagonal matrix, and V ∈ Cn×n such that U H U = V V H = V H V = In . Solution: Starting from the full SVD of A, A = U DV H let U1 ∈ Cm×n be the matrix that consists of the first n columns of the matrix U , and let D1 be the matrix that consists of the first n rows of D. Then, A = U1 D1 V H is the desired decomposition. (34) Prove that a matrix A is subunitary if and only if its singular values belong to [0, 1]. Solution: Let A ∈ Cn×k with n k and let A = U1 D1 V H be the SVD decomposition of A as established in Supplement 9.9. Suppose that for the largest singular value we have σ1 1 and let Z be the unitary completion of U1 to a unitary matrix (U1 Z) ∈ Cm×m . For the matrix Y = (A U1 (In − D12 )1/2 Z) ∈ Cm×(m+n) , we have Y Y H = AAH + ZZ H = Im , so Y is a semiunitary matrix, which implies that A is a subunitary matrix. Conversely, suppose that A is a subunitary matrix and let W be a semiunitary matrix such that W = (A T ) and whose rows are orthonormal. If w ∈ Cm is a vector with wH w = 1m , we
690
Linear Algebra Tools for Data Mining (Second Edition)
have wH w = wH Im w = wH (AAH + T T H )w wH AAH w, which implies that A has no singular value greater than 1. (35) The product of two subunitary (suborthogonal) matrices is a subunitary (suborthogonal) matrix. Solution: Let A ∈ Cm×k and B ∈ Ck×n be two subunitary matrices. Let A+ be the expansion of A to a matrix with orthonormal columns and B+ be the expansion of B to a matrix with orthonormal rows. Since the non-zero singular values of the H AH = A+ AH+ are the same as the eigenvalues matrix A+ B+ B+ of AH+ A+ = I, it follows that A+ B+ is subunitary. Since AB is a submatrix of A+ B+ , it follows that AB is subunitary. (36) Let A and B be two matrices in Cn×n having the singular values σ1 · · · σn and τ1 · · · τn , respectively. Prove that n σk τ k . |trace(AB)| σk=1
Solution: Let A = U DV H and B = W CZ H be the singular value decompositions of A and B, where U, V, W, Z are unitary matrices and D = diag(σ1 , . . . , σn ) and C = diag(τ1 , . . . , τn ). We have trace(AB) = trace(U DV H W CZ H) = trace(Z H U DV H W C) = trace(SDT C), where S = Z H U and T = V H W are unitary matrices. Therefore, trace(AB) =
n
spq tpq σp τq
p,q=1
n n 1 2 1 2 spq σp τq + tpq σp τq . 2 2 p,q=1
p,q=1
Since S and T are unitary matrices, it follows that the matrices S1 = (|spq |2 ) and T1 = (|t2pq |) are doubly stochastic, and the desired conclusion follows immediately from Supplement 106 of Chapter 3.
691
Singular Values
(37) Let A ∈ Rn×n be a suborthogonal matrix with rank(A) = r and let C ∈ Rn×n be a diagonal matrix, C = diag(c1 , . . . , cn ), where c1 · · · cn 0. Prove that trace(AC) c1 + · · · + cr . Hint: Apply Supplement 9.9. (38) Let A ∈ Rm×n be a rectangular matrix. Extend the Rayleigh– Ritz function ralA defined for square matrices in Theorem 8.17 to rectangular matrices as ralA =
x Ay xy
for x ∈ Rm and y ∈ Rn . Prove that σ is a singular value corresponding to singular vectors x and y when (x, y, σ) is a critical point of ralA . Solution: Consider the Lagrangian function L : R −→ R given by
Rm
× Rn ×
L(x, y, σ) = x Ay − σ(x y − 1) that is continuously differentiable for x = 0m and y = 0n . The first-order optimality conditions yield ˜ ˜ ˜ x A x y A˜ y =σ and =σ . ˜ y ˜ x ˜ x ˜ y By defining u = A u = λv.
1 ˜ ˜ x x
and v =
1 ˜, ˜ y y
we obtain Av = λu and
Bibliographical Comments Theorem 9.9 was obtained in [41]. An important historical presentation of the singular value decomposition can be found in [159]. The proof of Theorem 9.10 is given in this article. The proof of the CS Decomposition Theorem was given in [129] and [128]. Theorem 9.22 is a result of [160]. Supplements 33 and 35 are results of ten Berge that appear in [12]. In Supplement 36, we give a result of von Neumann [85]; the solution belongs to Mirsky [114]. For Supplement 37, see [12]. The proof of Supplement 27 appears in [99].
This page intentionally left blank
Chapter 10
The k-Means Clustering
10.1
Introduction
The k-means algorithm is one of the best-known clustering algorithms and has been in existence for a long time [73]. In a recent publication [173], the k-means algorithm was listed among the top ten algorithms in data mining. This algorithm computes a partition of a set of points in Rn that consists of k blocks (clusters) such that the objects that belong to the same block have a high degree of similarity, and the objects that belong to distinct blocks are dissimilar. The algorithm requires the specification of the number of clusters k as an input. The set of objects to be clustered S = {u1 , . . . , um } is a subset of Rn . Due to its simplicity and to its many implementations, it is a very popular algorithm despite this requirement. 10.2
The k-Means Algorithm and Convexity
The k-means algorithm begins with a randomly chosen set of k points c1 , . . . , ck in Rn called centroids. An initial partition of the set S of objects is computed by assigning each object ui to its closest centroid cj and adopting a rule for breaking ties when there are several centroids that are equally distanced from ui (e.g., assigning ui to the centroid with the lowest index). As we shall see, the algorithm alternates between assigning cluster membership for each object and computing the center of each cluster.
693
694
Linear Algebra Tools for Data Mining (Second Edition)
Let Cj be the set of points assigned to the centroid cj . The assignments of objects to centroids are expressed by a matrix B = (bij ) ∈ Rm×k , where 1 if ui ∈ Cj , bij = 0 otherwise. Since m one cluster, we have k each object is assigned to exactly b = 1. On the other hand, ij j=1 i=1 bij equals the number of objects assigned to the centroid cj . After these assignments, expressed by the matrix B, the centroid cj is recomputed using the following formula: m bij ui (10.1) cj = i=1 m i=1 bij for 1 ≤ j ≤ k. The matrix of the centroids, C = (c1 · · · ck ) ∈ Rn×k can be written as
C = (u1 · · · um )Bdiag
1 1 ,..., m1 mk
,
where mj = m i=1 bij is the number of objects of cluster Cj . The sum of squared errors of a partition π = {C1 , . . . , Ck } of a set of objects S is intended to measure the quality of the assignment of objects to centroids and is defined as sse(π) =
k
d2 (u, cj ),
(10.2)
j=1 u∈Cj
where cj is the centroid of Cj for 1 ≤ j ≤ k. It is clear that sse(π) is unaffected if the data are centered.
The k-Means Clustering
695
We can write sse(π) as sse(π) =
m k
bij xi − cj 22
i=1 j=1
=
m k i=1 j=1
bij
n
(xip − cjp )2 .
p=1
The nk necessary conditions for a local minimum of this function, m
∂sse(π) = bij (−2(xip − cjp )) = 0 ∂cjp i=1
for 1 ≤ p ≤ n and 1 ≤ j ≤ k, can be written as m
bij xip =
i=1
m
bij cjp = cjp
i=1
m
bij ,
i=1
or, as cjp
m i=1 bij xip = m i=1 bij
for 1 ≤ p ≤ n. In vectorial form, these conditions amount to m bij xi , cj = i=1 m i=1 bij which is exactly the formula (10.1) that is used to update the centroids. Thus, the choice of the centroids can be justified by the goal of obtaining local minima of the sum of squared errors of the clusterings. Since we have new centroids, objects must be reassigned, which means that the values of bij must be re-computed, which, in turn, will affect the values of the centroids, etc. The halting criterion of the algorithm depends on particular implementations and it may involve
696
Linear Algebra Tools for Data Mining (Second Edition)
(i) performing a certain number of iterations; (ii) lowering the sum of squared errors sse(π) below a certain limit; (iii) the current partition coincides with the previous partition. This variant of the k-means algorithm is known as Forgy–Lloyd algorithm (Algorithm 10.2.1) [56, 107]: Algorithm 10.2.1: Forgy–Lloyd Algorithm Data: set of objects to be clustered S = {x1 , . . . , xm } ⊆ Rn and the number of clusters k. Result: a collection of k clusters 1 generate a randomly chosen collection of k points c1 , . . . , ck in Rn ; 2 assign each object xi to the closest centroid cj ; 3 let π = {C1 , . . . , Ck } be the partition defined by c1 , . . . , ck ; 4 recompute the centroids of the clusters C1 , . . . , Ck ; 5 while halting criterion is not met do 6 compute the new value of the partition π using the current centroids; 7 recompute the centroids of the blocks of π; 8 end The popularity of the k-means algorithm stems from its simplicity and its low time complexity which is O(km), where m is the number of objects to be clustered and is the number of iterations that the algorithm is performing. Another variant of the k-means algorithm redistributes objects to clusters based on the effect of such a re-assignment on the objective function. If sse(π) decreases, the object is moved and the two centroids of the affected clusters are recomputed. This variant is carefully analyzed in [16]. As the iterations of the k-means algorithm develop, the algorithm may be trapped into a local minimum, which is the serious limitation of this clustering. The next theorem shows another limitation of the k-means algorithm because this algorithm produces only clusters whose convex closures may intersect only at the points of S.
The k-Means Clustering
697
Theorem 10.1. Let S = {x1 , . . . , xm } ⊆ Rn be a set of m objects. If C1 , . . . , Ck is the set of clusters computed by the k-means algorithm in any step, then the convex closure of each cluster Ci , K conv (Ci ) is included in a polytope Pi that contains ci for 1 ≤ i ≤ k. Proof. Suppose that the centroids of the partition {C1 , . . . , Ck } are c1 , . . . , ck . Let mij = 12 (ci + cj ) be the midpoint of the segment ci cj and let Hij be the hyperplane (ci − cj ) (x − mij ) = 0 that is the perpendicular bisector of the segment ci cj . Equivalently, 1 Hij = x ∈ Rm | (ci − cj ) x = (ci − cj ) (ci + cj ) . 2 The halfspaces determined by Hij are described by the following inequalities: 1 + : (ci − cj ) x ≤ (ci 22 − cj 22 ), Hij 2 1 − : (ci − cj ) x ≥ (ci 22 − cj 22 ). Hij 2 + − and cj ∈ Hij . Moreover, if d2 (ci , x) < It is easy to see that ci ∈ Hij + − d2 (cj , x), then x ∈ Hij , and if d2 (ci , x) > d2 (cj , x), then x ∈ Hij . Indeed, suppose that d2 (ci , x) < d2 (cj , x), which amounts to ci − x22 < cj − x22 . This is equivalent to
(ci − x) (ci − x) < (cj − x) (cj − x). The last inequality is equivalent to ci 22 − 2ci x < cj 22 − 2cj x, + . In other words, x is located in the same which implies that x ∈ Hij half-space as the closest centroid of the set {ci , cj }. Note also that if + − ∩ Hij = Hij , that is, d2 (ci , x) = d2 (cj , x), then x is located in Hij on the hyperplane shared by Pi and Pj . Let Pi be the closed polytope defined by + | j ∈ {1, . . . , k} − {i}}. Pi = {Hij
Objects that are closer to ci than to any other centroid cj are located in the closed polytope Pi . Thus, Ci ⊆ Pi and this implies K conv (Ci ) ⊆ Pi .
Linear Algebra Tools for Data Mining (Second Edition)
698
10.3
Relaxation of the k-Means Problem
The idea of relaxing the requirements of the k-means algorithm in order to apply ideas that originate in principal component analysis belongs to Ding and He [35]. Let X ∈ Rm×n be the data matrix of the set S = {x1 , . . . , xm } ⊆ n R of m objects to be clustered, ⎛ ⎞ x1 ⎜ . ⎟ ⎟ X=⎜ ⎝ .. ⎠ . xm Note that ⎛
⎛ ⎞ ⎞ x1 x1 x1 · · · x1 xm ⎜ . ⎟ ⎜ . .. ⎟ ⎜ ⎟ ⎟ XX = ⎜ . ⎠, ⎝ .. ⎠ (x1 · · · xm ) = ⎝ .. · · · xm xm x1 · · · xm xm which implies trace(XX ) = j=1 xj xj . To simplify the arguments, we shall assume that X is a centered matrix. Let π = {C1 , . . . , Ck } be a clustering of the set S and let mj = |Cj | for 1 ≤ j ≤ k. Clearly, we have kj=1 mj = m. There exists a permutation matrix P that allows us to rearrange the rows of the matrix X such that the rows that belong to a cluster Cj are located in contiguous columns. In other words, we can write P X as a block matrix ⎛ ⎞ X(1) ⎜ . ⎟ ⎟ PX = ⎜ ⎝ .. ⎠ , X(k)) where X(j) ∈ Rmj ×n . The centroid cj of Cj can be written as cj = 1 mj 1mj X(j). Furthermore, we have sse(π) =
k j=1 x∈Cj
d2 (x, cj ) =
k j=1 x∈Cj
x − cj 22
The k-Means Clustering
=
k
(x − cj ) (x − cj ) =
j=1 x∈Cj
=
=
=
x∈S
(x x − 2cj x + cj cj )
cj x +
k
k
cj cj
j=1 x∈Cj
mj cj cj
j=1
x∈S
k j=1 x∈Cj
x x −
k j=1 x∈Cj
x x − 2
x∈S
699
k 1 xx− 1 X(j)X(j) 1mj . mj mj
j=1
Let Rk = (r 1 · · · r k ) ∈ Rm×k be the matrix defined by ⎛ ⎞ 0m1 ⎜ . ⎟ ⎜ .. ⎟ ⎜ ⎟ ⎜ ⎟ ⎜0mj−1 ⎟ ⎟ 1 ⎜ ⎜ 1m ⎟ rj = √ j ⎟ ⎜ mj ⎜ ⎟ ⎜0mj+1 ⎟ ⎜ ⎟ ⎜ .. ⎟ ⎝ . ⎠ 0mk for 1 ≤ j ≤ k. The columns of Rkare multiples of the characteristic √ vectors of the clusters, we have kj=1 mj r j = 1m and the set of columns of Rk is orthonormal, that is, Rk Rk = Ik . Note that the 1 X rj . centroids of the clusters can be written now as cj = √m j Now sse(π) can be written as sse(π) = trace(XX ) − trace(Rk XX Rk ). The first term of sse(π) is constant; therefore, to minimize sse(π), we need to maximize trace(Rk XX Rk ). Let T ∈ Rk×k be an orthonormal matrix (T T = T T = Ik ) having the last column ⎛ m ⎞ 1
m
⎜ . ⎟ . ⎟ tk = ⎜ ⎝ . ⎠. m k
m
(10.3)
Linear Algebra Tools for Data Mining (Second Edition)
700
Define the matrix Qk = (q 1 · · · q k ) ∈ Rm×k as Qk = Rk T . We have q i = r 1 t1i + · · · + r k tki for 1 ≤ i ≤ k, so the last column of Qk is qk = r1
m1 + · · · + rk m
1 mk = √ 1k . m m
Also, Qk Qk = T Rk Rk T = T T = Ik . Let Qk−1 be the matrix that consists of the first k − 1 columns of Qk . Clearly, Qk−1 has orthogonal columns and qi 1m = 0 for 1 ≤ i ≤ k − 1. Since Qk XX Qk = T Rk XX Rk T , we have trace(Qk XX Qk ) = trace(T Rk XX Rk T ) = X Rk T 2F = X Rk 2F = trace(Rk XX Rk ). Therefore, sse(π) = trace(XX ) − trace(Qk XX Qk ). This allows us to write sse(π) = trace(XX ) − trace(Qk XX Qk ) 1 = trace(XX ) − √ 1m XX 1m − trace(Qk−1 XX Qk−1 ) m = trace(XX ) − trace(Qk−1 XX Qk−1 ), because X being a centered matrix, we have X 1m = 0m . The above argument reduces the k-means clustering to determining a matrix Qk−1 such that trace(Qk−1 XX Qk−1 ) is maximal, subjected to the following restrictions: (i) Qk−1 Qk−1 = Ik−1 ; (ii) q j 1m = 0 for 1 ≤ j ≤ k; (iii) vectors of the form q j are obtained from vectors of the form r j via a linear transformation because Qk is defined as Qk = Rk T . Recall that directions and the principal components of the centered data matrix X are the vectors dj and qj , respectively, such that Xdj = σj q j and X q j = σj dj for 1 ≤ j ≤ k.
The k-Means Clustering
701
In [35], the idea of relaxing this formulation of k-means by dropping the last requirement is introduced. By Ky Fan’s Theorem (Theorem 7.14), we have max{trace(Qk−1 XX Qk−1 ) | Qk−1 Qk−1 = Ik−1 } = λ1 + · · · + λk−1 , where λ1 ≥ λ2 ≥ · · · are the eigenvalues of the matrix XX , that is, the squares of the singular values of X. Moreover, the columns of the optimal Qk−1 are the k − 1 eigenvectors that correspond to the top k − 1 eigenvalues of the matrix XX , that is, the top q − 1 principal components of X. These columns represent the continuous solutions for the transformed discrete cluster membership problem. Theorem 10.2. Let c1 , . . . , ck be the centroids of the clusters obtained by applying the k-means algorithm to the centered data matrix X ∈ Rm×n . For the optimal solution of the relaxed problem, the subspace c1 , . . . , ck is spanned by the first k − 1 principal directions of the centered data matrix X. Proof. If q 1 , . . . , q k−1 be the vectors that correspond to the optimal solution, that is, the principal components of the centered matrix X. Since Rk = Qk T , we can write 1 1 cj = √ X r j = √ X (q 1 t1j + · · · + q k tkj ) mj mj 1 = √ X (q 1 tj1 + · · · + qk tjk ) mj 1 = √ σ1 d1 tj1 + · · · + σk dk tjk ) mj for 1 ≤ j ≤ k. Thus, every centroid belongs to the subspace generated by the principal directions of the matrix X. Conversely, let dj be a principal direction of X. Since dj = =
1 1 X qj = X (r 1 t1j + · · · + r k tkj ) σj σj √ 1 √ ( m1 c1 t1j + · · · + mk ck tkj ), σj
which shows that every principal direction of X belongs to the sub space generated by an optimal set of centroids.
702
Linear Algebra Tools for Data Mining (Second Edition)
Note that the vectors q j may have both positive and negative components. The recovery of the characteristic vectors r 1 , . . . , r k−1 , r k of the clusters cannot proceed from the equality Qk = Rk T because the matrix T itself can be determined only after obtaining the results of the clustering. Instead, the idea proposed in [35] takes advantage of Theorem10.2. Let C ∈ Rm×m be the matrix C = ki=1 q j q j . Recall that q j = r j T , so q j qj = r j T T r j = r j r j . The rank-1 matrix r j r j consists of a block surrounded by 0s, so C has a block-diagonal structure. Thus, if cij = 0, xi and xj belong to the same cluster. The matrix C computed starting from q 1 , . . . , q k may have negative elements. In general, entries of C below a certain value are ignored to eliminate noise. 10.4
SVD and Clustering
We refer to the current approach for seeking a k-means clustering as the discrete clustering problem (DCP). The DCP problem can n be mrelaxed by 2seeking a subspace V of R so that dim(V ) ≤ k and i=1 d(xi , V ) is minimal. This new problem will be referred to as the continuous clustering problem (CCP). It was shown in [39], using the SVD, that the relaxation of the problem because it can be used to obtain a 2-approximation of the optimum for the original problem and the approximative solution is interesting in its own right. Furthermore, the CCP problem can be solved in polynomial time because it can be obtained from the SVD of a matrix constructed on the data points. It is easy to see that sse(π) ≥ gS (V ) for any k-clustering π. Indeed, if U = c1 , . . . , ck , then dim(U ) ≤ k and sse(π) ≥ gS (V ). Observe that for each cluster center cj , the set Cj = {x ∈ S | d(x, cj ) ≤ d(x, ck ), where k = j} that consists of points whose closest point in C is cj , which is a polyhedron. Thetotal number of faces of the polyhedra C1 , . . . , Ck does not exceed k2 because each face is determined by at least one pair of points. The hyperplanes that define the faces of the polyhedra can be moved without modifying the partition κ such that each face contains
The k-Means Clustering
703
at least n points of the set S. If the points of S are in general position and 0n ∈ S, each face contains n affinely independent points of S. Observe that there are m k n (10.4) N (k, m, n) = k ≤ t ≤ t 2 hyperplanes each of which contains n affinely independent points of M . The following is an enumerative algorithm for resolving the discrete clustering problem (Algorithm 10.4.1). Algorithm 10.4.1: Enumerative Algorithm Data: set of objects to be clustered S = {x1 , . . . , xm } ⊆ Rn and the number of clusters k. Result: a collection of k clusters 1 enumerate all N (k, m, n) sets of hyperplanes each of which contains n affinely independent points; 2 retain only family of hyperplanes that partition S into k cells; tn choices as to which cell to assign each 3 make one of the 2 point of S located on a hyperplane; 4 find the centroid of each set in the partition π and compute sse(π); Let V be a k-dimensional subspace of Rn and let z 1 , . . . , z m be the orthogonal projections of the vectors x1 , . . . , xm on V. Consider the matrices ⎛ ⎞ ⎛ ⎞ z1 x1 ⎜ . ⎟ ⎜ . ⎟ ⎟ ⎜ ⎟ A=⎜ ⎝ .. ⎠ and B = ⎝ .. ⎠ . xm z m Clearly, we have rank(B) ≤ k. We have A −
B2F
=
m i=1
xi −
z i 22
=
m
d(xi , V )2 .
i=1
Let (σ1 , . . . , σr ) be the sequence of non-zero singular values of A, where σ1 ≥ · · · ≥ σr > 0. The SVD Theorem implies that A can be
704
Linear Algebra Tools for Data Mining (Second Edition)
written as A = σ1 u1 v H1 + · · · + σr ur vHr . Let B(k) ∈ Cm×n be the matrix defined by B(k) =
k
σi ui v Hi .
i=1
By Theorem 9.10, B(k) is the best approximation of A among the matrices of rank no larger than k in the sense of Frobenius norm and A − B(k)2F = A2F −
k
σi2 .
i=1
To solve the CCP, we choose the subspace V as the subspace generated by the first k left singular vectors, u1 , . . . , uk of the matrix A. We saw in Supplement 20 of Chapter 9 that B(k) is the projection of A on this subspace. Combining the DCP Algorithm 10.4.1 with the CCP approach produces a 2-approximation of the DCP (Algorithm 10.4.2): Algorithm 10.4.2: 2-Approximation Algorithm Data: set of objects to be clustered S = {x1 , . . . , xm } ⊆ Rn and the number of clusters k. Result: a set D of cluster centers 1 compute the k-dimensional subspace generated by the first k left singular vectors, u1 , . . . , uk , of the matrix A; 2 solve the DCP with input A to obtain the set of centers D = {d1 , . . . , dk }; 3 output the set of centers {d1 , . . . , dk }. The optimal value Z obtained from the DCP algorithm satisfies the inequality Z≥
m i=1
d(xi , z i )2 .
(10.5)
The k-Means Clustering
705
If C = {c1 , . . . , ck } is an optimal center set for the DCP, and d1 , . . . , dk are the projections of these centers on W , then Z=
m
d(xi , C) ≥ 2
i=1
m
d(z i , D)2 .
(10.6)
i=1
The Inequalities (10.5) and (10.6) imply 2Z ≥
m
d(xi , z i )2 +
i=1
=
m
d(z i , D)2
i=1
i = 1m d(xi , D)2 ,
which shows that the Algorithm 10.4.2 produces indeed a 2approximative solution of the DCP. 10.5
Evaluation of Clusterings
The silhouette method is an unsupervised method for evaluation of clusterings that computes certain coefficients for each object. The set of these coefficients allows an evaluation of the quality of the clustering. Let S = {x1 , . . . , xn } be a collection of objects, d : S × S −→ R≥0 a dissimilarity on S, and let {C1 , . . . , Ck } be a clustering of S. Suppose that xi ∈ C . The average dissimilarity of xi is given by {d(xi , u) | u ∈ C − {xi }} , a(xi ) = |C | that is, the average dissimilarity between xi to all other objects of C , the cluster to which xi is assigned. For xi and a cluster C = C , let {d(xi , u) | f (u) = C} , d(xi , C) = |C| be the average dissimilarity between xi and the objects of the cluster C. Definition 10.1. Let {C1 , . . . , Ck } be a clustering. A neighbor of xi is a cluster C = C for which d(xi , C) is minimal.
706
Linear Algebra Tools for Data Mining (Second Edition)
In other words, a neighbor of an object xi is “the second best choice” for a cluster for xi . Let b : S −→ R≥0 be the function defined by b(xi ) = min{d(xi , C) | C = C }. Definition 10.2. The silhouette of is the number sil(xi ) given by ⎧ i) ⎪ 1 − a(x ⎪ b(xi ) ⎨ sil(xi ) = 0 ⎪ ⎪ ⎩ b(xi ) − 1 a(xi )
the object xi for which |C | ≥ 2, if a(xi ) < b(xi ) if a(xi ) = b(xi ) if a(xi ) > b(xi ).
Equivalently, we have sil(xi ) =
b(xi ) − a(xi ) max{a(xi ), b(xi )}
for xi ∈ O. Observe that −1 ≤ sil(xi ) ≤ 1 (see Exercise 10.6). When sil(xi ) is close to 1, this means that a(xi ) is much smaller than b(xi ) and we may conclude that xi is well-classified. When sil(xi ) is near 0, it is not clear which is the best cluster for xi . Finally, if sil(xi ) is close to −1, the average distance from u to its neighbor(s) is much smaller than the average distance between xi and other objects that belong to the same cluster f (xi ). In this case, it is clear that xi is poorly classified. Definition 10.3. The average silhouette width of a cluster C is {sil(u) | u ∈ C} . sil(C) = |C| The average silhouette width of a clustering κ is {sil(u) | u ∈ O} . sil(κ) = |O| The silhouette of a clustering can be used for determining the “optimal” number of clusters. If the average silhouette of the clustering is above 0.7, we have a strong clustering.
The k-Means Clustering
10.6
707
MATLAB Computations
MATLAB implements the k-means algorithm using the function kmeans(T,k), where T is a data matrix in Rm×n and k is the number of clusters. The function considers the rows of T as the points to be clustered. A vector idx ∈ Rm is returned indicating that the object xi belongs to the cluster idxi . Non-numeric values in data are treated by kmeans as missing data and rows that contain such data are ignored. By default, kmeans uses the Euclidean distance between vectors. The function kmeans(S,k) has several return formats as indicated in the following table: Return format [IDX,C] [IDX, C, SUMD] [IDX, C, SUMD, D]
Effect Returns the k cluster centroid locations in the k × n matrix C Returns, additionally, the within-cluster sums of point-to-centroid distances in the row vector sumD D is an m × k matrix that contains the distances from each point to every centroid
A more general format [ ... ] = kmeans(..., ’par1’,val1, ’par2’,val2, ...)
specifies optional parameter/value pairs to direct the algorithm. The following is a partial list of these choices: Parameter ‘Distance’
Value ‘sqEuclidean’ ‘cityblock’ ‘cosine’ ‘correlation’
‘Start’ ‘sample’ ‘uniform’ ‘cluster’
‘Replicates’
Meaning Distance between objects Squared Euclidean distance (default) d1 distance 1 − cos α, where α = ∠(u, v) 1 − corr(u, v) method used to choose initial cluster centroid positions; choose k vectors from S at random (default); choose k vectors uniformly from S; perform preliminary clustering phase on a random 10% subsample of S; this first phase is itself initialized using ‘sample’. number of times to repeat the clustering, each with a new set of initial centroids; must be a positive integer, default is 1.
Linear Algebra Tools for Data Mining (Second Edition)
708
‘EmptyAction’ ‘error’ ‘drop’ ‘singleton’ ‘Options’ ‘Display’ ‘MaxIter’
Action to take if a cluster loses all of its members; treat an empty cluster as an error (default); remove any clusters that become empty; create a new cluster consisting of the one observation furthest from its centroid. Options for the iterative algorithm used to minimize the fitting criterion; level of display output. Choices are ‘off’, (default), ‘iter’, and ‘final’. maximum number of iterations allowed; default is 100.
Example 10.1. We begin by generating a dataset containing three clusters using the function datagen introduced in Example 4.50. >> opts=statset(’Display’,’final’); >> [idclust,centers]=kmeans(T,3,’Replicates’,4,’Options’,opts); >> plot(T(idclust==1,1),T(idclust==1,2),’+’,... T(idclust==2,1),T(idclust==2,2),’*’,... T(idclust==3,1),T(idclust==3,2),’x’,... centers(:,1),centers(:,2),’o’);
The cluster centers are represented by small circles in Figure 10.1. Note the use of the MATLAB function statset that creates an option structure such that the named parameters have the specified values. ∈ Rm×n is a data matrix whose m rows are the then silhouette(X, CLUST) plots cluster silhouettes for the matrix X, with clusters defined by CLUST. By default, silhouette uses the squared Euclidean distance between points; the function call [S,H] = silhouette(X,CLUST) plots the silhouettes, and returns the silhouette values in the vector S and the figure handle in H. Other inter-point distances can be used by calling silhouette(X,CLUST,d), where d specifies the distance (’Euclidean’,’cityblock’,’cosine’, etc). The silhouettes of the 140 points in R2 generated above using If X
x1 , . . . , xm ,
[S,H]=silhouette(T,idclust)
are shown in Figure 10.2. As expected, the values of these silhouettes are quite high, since the points were grouped tightly around the centers and we used the same number of clusters (k = 3) as the number used in the generation process.
The k-Means Clustering
709
12
10
8
6
4
2
0
−2 −2
−1
0
1
Fig. 10.1
2
3
4
5
6
7
Clustering of a dataset.
Cluster
1
2
3
0
0.2
Fig. 10.2
0.4
0.6 Silhouette Value
0.8
Silhouettes of the 3-clustering.
1
710
Linear Algebra Tools for Data Mining (Second Edition)
Exercises and Supplements (1) Let c0 , c1 , and x be three vectors in Rn . Prove the following: (a) if d(c0 , c1 ) > 2d(x, c1 ), then d(x, c0 ) ≥ d(x, c1 ); (b) d(x, c1 ) ≥ max{0, d(x, c0 ) − d(c0 , c1 )}. Examine ways to use the above inequalities to improve the performance of the k-means algorithm. (2) Let X = {x1 , . . . , x2n } be a set that consists of 2n real numbers. A balanced clustering of X is a partition π = {C, D} of X such that |C| = |D| = n. Prove that if π is such that x,y∈C d(x, y)+ x,y∈D d(x, y) is minimal, then one of the clusters consists of the first n points detected by scanning the line from left to right and the other cluster consists of the remaining points. Solution: Let z ∈ R − X be such that there are n points of X at its left and an equal number of points at its right. For an arbitrary balanced clustering {C, D} of X, define the sets Cl,z = {x ∈ C | x < z}, Dl,z = {x ∈ D | x < z},
Cr,z = {x ∈ C | x > z}, Dr,z = {x ∈ D | x > z}.
Note that |Cl,z |+|Dl,z | = |Cr,z |+|Dr,z | = n, so {Cl,z ∪Dl,z , Cr,z ∪ Dr,z } is again a balanced clustering. Suppose that Cl,z = {u1 , . . . , uk }, Dl,z = {vk+1 , . . . , vn },
Cr,z = {uk+1 , . . . , yn }, Dr,z = {v1 , . . . , vk }.
Clearly, C = Cl,z ∪ Cr,z and D = Dl,z ∪ Dr,z . If the distance between two finite subsets A, B of R is defined as d(A, B) = {|a − b| | a ∈ A, b ∈ B}, then we claim that d(Cl,z , Cr,z ) + d(Dl,z , Dr,z ) ≥ d(Cl,z , Dl,z ) + d(Cr,z , Dr,z ). We have d(Cl,z , Cr,z ) =
k n
(uj − ui )
i=1 j=k+1
=
k i=1
⎛ ⎝
n
j=k+1
⎞ uj ⎠ − (n − k)ui
The k-Means Clustering n
=k
711
uj − (n − k)
k
d(Dl,z , Dr,z ) =
k n
(vi − vj )
j=k+1 i=1
=
n
ui ,
i=1
j=k+1
k
vi
− kvj
i=1
j=k+1
= (n − k)
k
n
vi − k
i=1
vj ,
j=k+1
and d(Cl,z , Dl,z ) =
n k
|ui − vj |,
i=1 j=k+1
≤
k n
(|z − ui | + |vj − z|),
i=1 j=k+1
= (n − k)
k
|z − ui | + k
i=1
d(Cr,z , Dr,z ) =
k n
n
|vj − z|,
j=k+1
|uj − vi |
j=k+1 i=1
≤
k n
|uj − z| + |z − vi |
j=k+1 i=1
=k
n j=k+1
|uj − z| + (n − k)
k
|z − vi |.
i=1
The desired equality follows after some elementary algebra. (3) Let g be the centroid of a finite, non-empty subset S of Rn . ˆ Define the set of centered vectors S={y ∈ Rn | y=x−g, x ∈ S}.
712
Linear Algebra Tools for Data Mining (Second Edition)
Prove that ˆ = {y22 | y ∈ S} {x22 | x ∈ S} − |S|g22 . (4) Let S be a finite, non-empty subset of Rn , π = {C1 , . . . , Ck } be a partition of S, g be the centroid of S, and ci be the centroid of Ci for 1 ≤ i ≤ k. Prove that g is the convex combination g=
k |Ci | i=1
|S|
ci .
(5) This exercise, based on an example given in [88], shows that the Forgy–Lloyd k-means algorithm may converge to a locally minimal solution that is arbitrarily bad compared to the optimal solution. Let S = {x1 , x2 , x3 , x4 } ⊆ R be the set to be clustered using the Forgy–Lloyd k-means algorithm with k = 3. Suppose that x1 < x2 < x3 < x4 and that x4 − x3 < x2 − x1 < x3 − x2 . Prove that: (a) there are only three possible partition outcomes of the Forgy–Lloyd algorithm: π1 = {{x1 }, {x2 }, {x3 , x4 }}, π2 = {{x1 , x2 }, {x3 }, {x4 }}, and π3 = {{x1 }, {x2 , x3 }, {x4 }}; 2 2 3) 1) , sse(π2 ) = (x2 −x , and sse(π3 ) = (b) sse(π1 ) = (x4 −x 2 2 (x3 −x2 )2 . 2
(c) Show that π1 is the optimal partition and the ratio sse(π2 ) sse(π1 ) (where π2 is a locally minimal partition and, therefore, a possible outcome of the Forgy–Lloyd algorithm) can be arbitrarily large if the distance between x2 and x3 is sufficiently large. (6) Let π = {C1 , . . . , Ck } be a clustering of a finite set of objects S = {x1 , . . . , xm }. Prove that k 1 sse(π) = 2mj j=1
where mj = |Cj | for 1 ≤ j ≤ k.
xp ,xq ∈Cj
xp − xq 22 ,
The k-Means Clustering
713
Solution: To prove this equality, it suffices to show that
1 2mj
x − cj 22 =
x∈Cj
xp − xq 22 .
xp ,xq ∈Cj
This follows by writing 1 2mj =
xp − xq 22
xp ,xq ∈Cj
1 2mj
(xp xp + xq xq − 2xp xq )
xp ,xq ∈Cj
⎛
=
1 ⎝ xp xp + mj xq xq mj 2mj xp ∈Cj
−2
⎞
xq ∈Cj
xp xq ⎠
xp ∈Cj xq ∈Cj
=
x∈Cj
=
x x −
1 mj
xp xq
xp ∈Cj xq ∈Cj
x x − mj cj cj ,
x∈Cj
and
x − cj 22 =
x∈Cj
(x x − 2x cj + cj cj )
x∈Cj
=
x x − mj c cj ,
x∈Cj
which confirms the needed equality. Let π = {C1 , . . . , Ck } be a partition of a finite, non-empty set S ⊆ Rn and let α be a non-negative number. Define sseα (π) as sseα (π) = kj=1 |Cj |α x∈Cj d2 (x, cj ). Note that for α = 0, we have sse0 (π) = sse(π). This idea was introduced in [81].
Linear Algebra Tools for Data Mining (Second Edition)
714
(7) Examine the effect of using sseα with α > 0 instead of sse0 on the relative size of the clusters resulting from an application of the k-means algorithm. (8) Prove that the number of hyperplanes given in Formula (10.4) m k n k ≤ t ≤ t 2 2 nk is in O m 2 . (9) Let C, D ⊆ Rn be two finite, non-empty, and disjoint subsets of Rn and let δ(C, D) = {x − y22 | x ∈ C, y ∈ D}. Prove that δ(C,C) δ(D,D) + c − d22 , where m = |C|, p = (a) δ(C,D) mp = m2 + p2 |D| and c,d are the centroids of C and D, respectively. 2 (b) δ(C, D) + δ(C, C) + δ(D, D) = |S| y∈Sˆ y2 . Solution: For the first part, we have δ(C, C) = {xi − xj 22 | xi , xj ∈ C} = {xi xi − 2xi xj + xj xj | xi , xj ∈ C} = (m − 1) {xi xi | xi ∈ C} {xi xj | i < j}. −2 xi ∈C xj ∈C
m
= mc, the norm of the centroid of C is {xi xi | xi ∈ C} + 2 {xi xj | i < j}, m2 c c =
Since
i=1 xi
xi ∈C xj ∈C
so δ(C, C) + m2 c c = m Similarly, δ(D, D) + p2 d d = p For δ(C, D), we have δ(C, D) = xi ∈C,xj ∈D
{xi xi | xi ∈ C}.
{xj xj | xj ∈ D}.
(xi xi + xj xj − 2xi xj )
The k-Means Clustering
=p
xi xi + m
xi ∈C
=p
xj xj − 2
xj ∈D
xi xi + m
xi ∈C
715
xi xj
xi ∈C xj ∈D
xj xj − 2mp c d.
xj ∈D
These equalities yield δ(C, D) δ(C, C) δ(D, D) − − mp m2 p2 1 1 = xi xi + xj xj − 2 c d m p xi ∈C
+ c c − −
xj ∈D
1 {xi xi | xi ∈ C} + d d m
1 {xj xj | xj ∈ D} p
+ c c + d d − 2c d = c − d22 , which concludes the argument for the first part. For the second part, using the expressions derived above, we obtain δ(C, D) + δ(C, C) + δ(D, D) xi xi + m xj xj − 2mp c d =p xi ∈C
xj ∈D
+m {xi xi | xi ∈ C} − m2 c c +p {xj xj | xj ∈ D} − p2 d d x x − 2mp c d − m2 c c − p2 d d = |S| x∈S
= |S|
x x − mc + pd22
x∈S
= |S|
x∈S
x x − |S|2 g22
Linear Algebra Tools for Data Mining (Second Edition)
716
= |S|
y22 ,
y∈Sˆ
taking into account Exercise 3. (10) Using the notations introduced in Supplement 9, prove that for any two finite, non-empty, and disjoint subsets C, D of Rn , we have 2
δ(C, C) δ(D, D) δ(C, D) + . ≥ mp m2 p2
Hint: The inequality follows immediately from Supplement 9. (11) Let π = {C, D} be a clustering of the set S ⊆ Rn , where C = {x1 , . . . , xm } and D = {xm+1 , . . . , xm+p }, m + p = |S|, and let T = (tij ) ∈ Rs×s be the matrix defined by tij = xi − xj 22 for 1 ≤ i, j ≤ s, where s = |S| = m + p. Define δ(C, D) δ(C, C) δ(D, D) mp 2 − − . J(π) = |S| mp m2 p2 Let q ∈ Rs be the vector given by ⎧ ⎨ p if 1 ≤ i ≤ m, ms qi = m ⎩ − if m + 1 ≤ i ≤ m + p. ps Prove that (a) 1| S| q = 0 and q2 = 1; (b) q T q = −J(π); (c) the previous two parts hold if the vectors of S are not arranged in any particular order. Solution: The first part is immediate. Applying the definition of T , we obtain
q Tq =
m+p m+p
q i tij qj
i=1 j=1
=
m m i=1 j=1
q i tij qj +
m m+p i=1 j=m+1
q i tij qj
The k-Means Clustering m+p
m
qi tij qj +
i=m+1 j=1
m+p
717 m+p
qi tij qj
i=m+1 j=m+1
m m m m+p p 1 2 = xi − xj 2 − xi − xj 22 ms s i=1 j=1
−
i=1 j=m+1
m+p m 1 xi − xj 22 s i=m+1 j=1
+
m+p m ps
m+p
xi − xj 22
i=m+1 j=m+1
p 2 m δ(C, C) − δ(C, D) + δ(D, D) ms s ps mp 2δ(C, D) δ(C, C) δ(D, D) − − = −J(π). =− s mp m2 p2 =
(12) Prove that for 2-means clustering π = {C, D} of a finite, nonempty set S ⊆ Rn , we have sse(π)+ 12 J(π) = 12 y∈Sˆ y y, where Sˆ is the set of centered vectors that correspond to the vectors of S. Solution: The sum sse(π) + 12 J(π) can be expressed using the function δ introduced earlier as follows: 1 sse(π) + J(π) 2 1 1 1 δ(C, C) + δ(D, D) + δ(C, D) = 2m 2p 2|S| −
pδ(C, C) mδ(D, D) − 2m|S| 2p|S|
(by Supplements 10.6 and 10.6) =
1 1 (δ(C, D) + δ(C, C) + δ(D, D)) = y22 , 2|S| 2 y∈Sˆ
(by the second part of Supplement 9), where m = |C| and p = |D|.
718
Linear Algebra Tools for Data Mining (Second Edition)
Supplement 12 suggests a way for obtaining optimal 2-means clustering. Since sse(π)+ 12 J(π) does not depend on the partition π, minimizing sse(π) is tantamount to maximizing J(π). Note that the effect of maximizing J(π) is to ensure that the average inter-cluster distance δ(C,D) mp is maximal, while the average intra-
and δ(D,D) are minimal. This ensures cluster distances δ(C,C) m2 p2 well-separated, compact clusters. In [35], a relaxation of this problem is considered by allowing the components of q to assume values in the interval [−1, 1] rather than two discrete values, but still requires that q q = 1. Then, to maximize J(π), that is, to minimize −J(π) amounts to finding q as the unit eigenvector that corresponds to the least eigenvalue of T . An alternative relaxation method is discussed next. (13) Let S be a finite, non-empty subset of Rn with |S| = m, and let Tˆ = Hm T Hm ∈ Rm×m be the centered distance matrix, where T is the matrix of squared distances between the vectors of S. Prove that (a) for every vector q such that 1m q = 0, we have q Tˆq = q T q; (b) every eigenvector of Tˆ that corresponds to a non-zero eigenvalue is orthogonal on 1m . Solution: Observe that q Tˆq = q Hm T Hm q = q T q because Hm q = q Hm = q. Since 1m is an eigenvector of Tˆ that corresponds to the eigenvalue 0, every eigenvector of Tˆ that belongs to a different eigenvalue is orthogonal on 1m . (14) Prove that −1 ≤ sil(x) ≤ 1 for every object x ∈ S, where S is a set equipped with a clustering. Bibliographical Comments The spectral relaxation of the k-means algorithm was obtained in [176]. The link between k-means clustering and non-negative matrix factorization was established in [36]. For a solution to Exercise 1, see [45]. Supplement 2 is a result obtained in [132]. Supplements 9–13 are based on Ding and He’s paper where the PCA-guided clustering is introduced.
Chapter 11
Data Sample Matrices
11.1
Introduction
Matrices are natural tools for organizing datasets. Let such a dataset consist of a sequence E of m vectors of Rn , (u1 , . . . , um ). The jth components (ui )j of these vectors correspond to the values of a random variable Vj , where 1 ≤ j ≤ n. This data series will be represented as a matrix having m rows u1 , . . . , um and n columns v1 , . . . , v n . We refer to matrices obtained in this manner as sample matrices. The number m is the size of the sample. In this chapter, we present algebraic properties of vectors and matrices associated with a sample matrix: the mean vector and the covariance matrix. Biplots, a technique for exploring and visualizing sample matrices, are also introduced. 11.2
The Sample Matrix
Each row vector ui corresponds to an experiment Ei in the series of experiments E = (E1 , . . . , Em ); the experiment Ei consists of measuring the n components of ui = (xi1 , . . . , xin ), as follows. v1 u1 x11 u2 x21 .. .. . . um xm1
· · · vn · · · x1n · · · x2n .. .. . . . · · · xmn
719
720
Linear Algebra Tools for Data Mining (Second Edition)
The column vector
⎛
⎞ x1j ⎜ x2j ⎟ ⎜ ⎟ v j = ⎜ .. ⎟ ⎝ . ⎠ xmj represents the measurements of the jth variable Vj of the experiment, for 1 ≤ j ≤ n, as shown in the following. These variables are usually referred to as attributes or features of the series E. Definition 11.1. The sample matrix of E is the matrix X ∈ Cm×n given by ⎛ ⎞ u1 ⎜ .. ⎟ X = ⎝ . ⎠= (v 1 · · · v n ). um
Clearly, we have (v j )i = (ui )j = xij for 1 ≤ i ≤ m and 1 ≤ j ≤ n. If E is clear from the context, the subscript E is omitted. We will use both representations of the sample matrix and will write ⎛ ⎞ u1 ⎜ .. ⎟ X = ⎝ . ⎠= (v 1 · · · v n ), um when we are interested in the vectors that represent results of experiments and X = (v 1 , . . . , v n ), when we need to work with vectors that represent the values of variables. Pairwise distances between the row vectors of the sample matrix X ∈ Rm×n can be computed with the MATLAB function pdist(X). comThis form of the function returns a vector D having m(m−1) 2 m ponents corresponding to 2 pairs of observations arranged in the order d2 (u2 , u1 ), d2 (u3 , u1 ), d2 (u3 , u2 ), . . ., that is the order of the lower triangle of the distance matrix.
Data Sample Matrices
721
Example 11.1. Let X be the data matrix ⎛ ⎞ 1 4 5 ⎜2 3 7⎟ ⎟ X=⎜ ⎝5 1 4⎠. 6 2 4 The function call D = pdist(X) returns D = 2.4495
6.0000
5.4772
7.3485
5.0990
5.0990
Equivalently, a distance matrix can be obtained using the auxiliary function squareform, by writing E = squareform(D), which yields E = 0 2.4495 6.0000 5.4772
2.4495 0 7.3485 5.0990
6.0000 7.3485 0 5.0990
5.4772 5.0990 5.0990 0
There are versions of pdist that can return other distances by using a second string parameter. For instance, pdist(X, ‘cityblock’) computes d1 (xi , xj ) and pdist(X,‘cebyshev’) computes d∞ (xi , xj ). In general, Minkowski’s distance dp can be computed using D = pdist(X,‘minkowski’,p). A linear data mapping for a data sequence (u1 , . . . , um ) ∈ Seqm (Rn ) is the morphism r : Rn −→ Rq . If R ∈ Rn×q is the matrix that represents this mapping, then r(ui ) = Rui for 1 ≤ i ≤ m. If q < n, we refer to r as a linear dimensionality-reduction mapping. The reduced data matrix is given by ⎛ ⎞ ⎛ ⎞ r(u1 ) (Ru1 ) ⎜ . ⎟ ⎜ . ⎟ m×q ⎟ ⎜ ⎟ . r(XE ) = ⎜ ⎝ .. ⎠ = ⎝ .. ⎠ = XE R ∈ R r(um ) (Rum ) The reduced dataset r(XE ) has new variables Y1 , . . . , Yq . We denote this by writing (Y1 , . . . , Yq ) = r(V1 , . . . , Vn ). The mapping r is a linear feature selection mapping if R ∈ {0, 1}q×n is a 0/1-matrix having exactly one unit in every row and at most one unit in every column.
722
Linear Algebra Tools for Data Mining (Second Edition)
Definition 11.2. Let (u1 , . . . , um ) be a series of observations in Rn . The sample mean of this sequence is the vector m 1
˜= u i ∈ Rn . (11.1) u m i=1
˜ = 0n . The series is centered if u ˜ , . . . , um − u ˜ ) is always centered. Also, Note that the series (u1 − u observe that 1 ˜ = (u1 · · · um )1m . (11.2) u m If n = 1, the series of observations is reduced to a vector v ∈ Rm . Definition 11.3. The standard deviation of a vector v ∈ Rm is the number m 1
(vi − v)2 , sv = m−1 i=1
where v is the mean of the components of v. The standard deviation of sample matrix X ∈ Rm×n , where X = (v 1 · · · v n ), is the row s = (sv1 , . . . , svn ). If the measurement scale for the variables V1 , . . . , Vn involved in the experiment are very different due to different measurement units, some variables may inappropriately influence the analysis process. Therefore, the columns of the data sample matrix need to be scaled in order to make their values comparable. To scale a matrix, we need to replace each column v i by s1v v i . This will yield a matrix having i the standard deviation of each column equal to 1. Next, we examine the effect of centering on a sample matrix. Theorem 11.1. Let X ∈ Rm×n be a sample matrix ⎛ ⎞ u1 ⎜ . ⎟ ⎟ X=⎜ ⎝ .. ⎠ . um The sample matrix that corresponds to the centered sequence is 1 ˆ X = Im − 1m 1m X. m
Data Sample Matrices
Proof.
723
The matrix that corresponds to the centered sequence is ⎞ ⎛ ˜ u1 − u ⎟ ⎜ .. ⎟ = X − 1m u ˆ =⎜ ˜ . X . ⎠ ⎝ ˜ um − u
By Equality (11.2), it follows that 1 ˆ = X − 1m u ˜ = X − 1m 1m X = X m
1 Im − 1m 1m X, m
which yields the desired equality.
Theorem 11.1 shows that to center a data matrix X ∈ Rm×n , we need to multiply it at the left by the centering matrix H m = Im −
1 1m 1m ∈ Rm×m , m
ˆ = Hm X. Note that Hm = Im − 1 Jm . It is easy to see that that is, X m Hm is both symmetric and idempotent. Since Hm 1m = 1m −
1 1m 1m 1m = 0, m
it follows that Hm has the eigenvalue 0. If X ∈ Rm×n is a matrix, the standard deviations are computed in MATLAB using the function std(X), which returns an n-dimensional row s containing the square roots of the sample variances of the columns of U , that is, their standard deviations. The means of the columns of X is computed in MATLAB using the function mean(X). The MATLAB function Z = zscore(X) computes a centered and scaled version of a data sample matrix having the same format as X. If X is a matrix, then z-scores are computed using the mean and standard deviation along each column of X. The columns of Z have sample mean zero and sample standard deviation one (unless a column of X is constant, in which case that column of Z is constant at 0). If we use the format [Z,mu,sigma] = zscore(X),
the mean vector is returned to mu and the vector of standard deviations, to sigma.
Linear Algebra Tools for Data Mining (Second Edition)
724
Example 11.2. Let X be the matrix X = 1 3 2 5
12 15 15 18
77 80 75 98
The means and the standard deviations of the columns of X are obtained as follows. >> m = mean(X) m = 2.7500
15.0000
82.5000
2.4495
10.5357
>> s=std(X) s = 1.7078
Finally, to compute together the mean, the standard deviation, and the matrix Z, we write >> [Z,m,s]=zscore(A) Z = -1.0247 0.1464 -0.4392 1.3175
-1.2247 0 0 1.2247
-0.5220 -0.2373 -0.7119 1.4712
2.7500
15.0000
82.5000
1.7078
2.4495
10.5357
m =
s =
Definition 11.4. Let u = (u1 , . . . , um ) be a sequence of vectors in Rn . The inertia of this sequence relative to a vector z ∈ Rn is the
Data Sample Matrices
725
number Iz (u) =
m
j=1
uj − z22 .
Theorem 11.2 (Huygens’s Inertia Theorem). Let u = (u1 , . . . , um ) ∈ Seqm (Rn ). We have Iz (u) − Iu˜ (u) = m˜ u − z22 , for every z ∈ Rn . Proof.
˜ is The inertia of u relative to u Iu˜ (u) =
m
˜ 22 uj − u
j=1 m
˜ ) (uj − u ˜) = (uj − u j=1
=
m
˜ uj − uj u ˜ +u ˜ u ˜ ). (uj uj − u j=1
Similarly, we have Iz (u) =
m
j=1
(uj uj − z uj − uj z + z z).
This allows us to write Iz (u) − Iu˜ (u) =
m m
˜ u ˜ (˜ u − z) uj + uj (˜ u − z) + z z − u j=1
= (˜ u − z)
j=1
m
i=1
⎛ ⎞ m
˜ u ˜) uj + ⎝ uj ⎠ (˜ u − z) + m(z z − u j=1
˜ u ˜ + m˜ ˜) u − z) + m(z z − u u (˜ = m(˜ u − z) u = m˜ u − z22 , which is the equality of the theorem.
Linear Algebra Tools for Data Mining (Second Edition)
726
Corollary 11.1. Let u = (u1 , . . . , um ) ∈ Seqm (Rn ). The minimal ˜. value of the inertia Iz (u) is achieved for z = u Proof.
This statement follows immediately from Theorem 11.2.
Let u and w be two vectors in Rm , where m > 1, having the means u and w, and the standard deviations su and sv , respectively. Definition 11.5. The covariance coefficient of u and w is the number cov(u, w) =
m−1 1
(ui − u)(wi − w). m−1 i=1
The correlation coefficient of u and w is the number ρ(u, w) =
cov(u, w) . su sw
By the Cauchy–Schwarz Inequality (Corollary 6.1), we have m m m
(ui − u)2 · (wi − w)2 , (ui − u)(wi − w) ≤ i=1
i=1
i=1
which implies −1 ≤ ρ(u, w) ≤ 1. ˆ be Definition 11.6. Let X ∈ Rm×n be a sample matrix and let X the centered sample matrix corresponding to X. The sample covariance matrix is the matrix cov(X) =
1 ˆ ˆ X X ∈ Rn×n . m−1
1 X X. Note that if X is centered, cov(X) = m−1 If n = 1, the matrix is reduced to one column X = (v) and
cov(v) =
1 v v ∈ R. m−1
In this case, we refer to cov(v) as the variance of v; this number is denoted by var(v).
Data Sample Matrices
727
If X = (v 1 · · · v n ), then (cov(X))ij = cov(v i , v j ) for 1 ≤ i, j ≤ n. The covariance matrix can be written also as cov(X) =
1 1 X Hm Hm X = X Hm X. m−1 m−1
The sample correlation matrix is the matrix corr(X) given by (corr(X))ij = ρ(v i , v j ) for 1 ≤ i, j ≤ n. 1 If X is centered, then cov(X) = m−1 X X. Clearly, the covariance matrix is a symmetric, positive semidefinite matrix. Furthermore, by ˆ and, Theorem 3.33, the rank of cov(X) is the same as the rank of X since m, the size of the sample, is usually much larger than n, we are often justified in assuming that rank(cov(X)) = n. Let X = (v 1 · · · v n ) ∈ Rm×n be a sample matrix. Note that 1 1 Hm v p = Im − 1m 1m v p = v p − 1m 1m v p = v p − ap 1m , m m 1 ˜ = (a1 , . . . , an ). 1m v p = ap for 1 ≤ p ≤ n, where u because m The covariance matrix can be written as
1 Hm (v 1 · · · vn ) (v 1 · · · v n ) Hm m−1 1 (Hm v 1 · · · Hm vn ) (Hm v1 · · · Hm v n ), = m−1
cov(X) =
which implies that the (p, q)-entry of this matrix is cov(X)pq =
1 1 (Hm vp ) (Hm v q ) = (v p −ap 1m ) (v q −aq 1m ). m−1 m−1
For a diagonal element, we have m
cov(X)pp
1
= (v q − aq 1m )2i , m−1 i=1
which shows that cov(X)pp measures the scattering of the values of the pth variable around the corresponding component ai of the mean sample. This quantity is known as the pth variance and is denoted by σp2 for 1 ≤ p ≤ n. The total variance tvar(X) of X is trace(cov(X)).
728
Linear Algebra Tools for Data Mining (Second Edition)
For p = q, the element cpq of the matrix C = cov(X) is referred to as the (p, q)-covariance. We have 1 (v p − ap 1m ) (v q − aq 1m ) m−1 1 v p v q − ap 1m v q − aq v p 1m + map aq = m = vp v q − ap aq .
(cov(X))pq =
If cov(X)pq = 0, then we say that the variables Vp and Vq are uncorrelated. The behavior of the covariance matrix with respect to multiplication by orthogonal matrices is discussed next. Theorem 11.3. Let ⎞ x1 ⎜ . ⎟ ⎟ X=⎜ ⎝ .. ⎠ xm ⎛
be a centered sample matrix and let R ∈ Rn×n be an orthogonal matrix. If Z ∈ Rm×n is a matrix such that Z = XR, then Z is centered, cov(Z) = R cov(X)R and tvar(Z) = tvar(X). Proof.
By writing explicitly the rows of the matrix Z, ⎛ ⎞ z1 ⎜ . ⎟ ⎟ Z=⎜ ⎝ .. ⎠ , zm
we have z i = xi R for 1 ≤ i ≤ m because Z = XR. Note that the sample mean of Z is 1 1 ˜ Z˜ = 1m Z = 1m XR = XR, m m ˜ is the sample mean of X. Since X is centered, we have where X ˜ ˜ Z = X = 0n , so Z is centered as well.
Data Sample Matrices
729
The covariance matrix of Z is cov(Z) =
1 1 Z Z = R X XR = R cov(X)R. m−1 m−1
Since the trace of two similar matrices are equal (by Theorem 8.5) and cov(Z) is similar to cov(X), the total variance of Z equals the total variance of X, that is, tvar(Z) = trace(cov(Z)) = trace(cov(X)) = tvar(X).
Since the covariance matrix of a centered matrix X, cov(X) = ∈ Rn×n is symmetric, by Corollary 8.8, cov(X) is orthonormally diagonalizable, so there exists an orthogonal matrix R ∈ Rn×n such that R cov(X)R = D, which corresponds to a sample matrix Z = XR. Let cov(Z) = D = diag(d1 , . . . , dn ). The number dp is the sample variance of the pth variable of the data matrix, and the covariances of the form cov(Z)pq with p = q are 0. From a statistical point of view, this means that the components p and q are uncorrelated. Without loss of generality we can assume that d1 ≥ · · · ≥ dn . The columns of the matrix Z correspond to the new variables Z1 , . . . , Zn . Often the variables of a data sample matrix are not expressed using different units. In this case, the components of the covariance have no meaning because variables that have large numerical values have a disproportionate influence compared to variables that have small numerical value. For example, if a spatial variable is measured in millimeters, its values are three orders of magnitude larger than the values of a variable expressed in meters. 1 m−1 X X
11.3
Biplots
Biplots introduced by Gabriel in [61] offer a way of representing graphically the elements sets of vectors (hence, the term biplot). Let A ∈ Rm×n be a matrix that can be A = LR, where L ∈ Rm×r , R ∈ Rr×n are
succinct and powerful of a matrix using two written as a product, the left and the right
730
Linear Algebra Tools for Data Mining (Second Edition)
factors, respectively. Suppose that ⎛ ⎞ l1 ⎜ . ⎟ ⎟ L=⎜ ⎝ .. ⎠ and R = (r 1 · · · r m ), lm where l1 , . . . , lm , r 1 , . . . , r n are m + n vectors in Rr . Then, each element aij of A can be regarded as a inner product of two vectors aij = li r j
(11.3)
for 1 ≤ i ≤ m and 1 ≤ j ≤ n. Such matrix factorizations are common in linear algebra and we have already discussed a number of factorization techniques (full-rank decompositions, QR decompositions, etc.) Starting from the factorization A = LR, new factorizations of A can be built as A = (LK )(R K −1 ) for every invertible matrix K ∈ Rr×r . Therefore, the above representation for A is not unique in general. Thus, to use the biplot for a representation of the relations between the rows w 1 , . . . , wn of A, one could choose R such that RR = Ir , which yields AA = LL . This implies wi wj = li lj for 1 ≤ i, j ≤ n. Taking i = j, we have w i = li , which , in turn, implies ∠(wi , wj ) = ∠(li , lj ). A similar choice can be made for the columns of A by imposing the requirement L L = Ir , which implies A A = R R. The case when the rank r of the matrix A is 2 is especially interesting because we can draw the vectors l1 , . . . , lm , r 1 , . . . , r n to obtain an exact two-dimensional representation of A, as we show in the next example. Example 11.3. Let
⎛
18 ⎜−4 ⎜ A=⎜ ⎝ 25 9
8 20 8 4
⎞ 20 1⎟ ⎟ ⎟ 27⎠ 10
be a matrix of rank 2 in R4×3 that can be written as A = LR, where ⎛ ⎞ 2 4 ⎜−2 3⎟ 5 −4 4 ⎜ ⎟ L=⎜ . ⎟ and R = ⎝ 3 5⎠ 2 4 3 1 2
Data Sample Matrices
731
The vectors that help us with the representation of A are 2 −2 3 1 , l2 = , l3 = , l4 = l1 = 3 5 2 4 and
−4 4 5 r1 = , r2 = , r3 = . 2 4 3
For example, the a32 element of A can be written as −4 a32 = l3 r 2 = (3 5) = 8. 4 Equality (11.3) shows that each vector li corresponds to a row of A and each vector r j , to a column of A. When we can factor a sample data matrix X as X = LR, a column of the right factor r j is referred to as the biplot axis and corresponds to a variable Vj (Figure 11.1). Each vector li represents an observation in the sample matrix. It is interesting to observe that Equality (11.3) implies that the magnitude of projection of li on the biplot axis r j is li 2 cos ∠(li , r j ) =
li r j aij = . r j 2 r j 2
Therefore, if we choose the unit of measure on the axis r j as the number r1j 2 , we can read the values of the entries aij directly on 6 r2
Fig. 11.1
+
l3
+l 1
r3 l4 + 1r 1 @
l2 @+ @ @
-
Representation of the vectors li and rj .
732
Linear Algebra Tools for Data Mining (Second Edition)
the axis r j . For instance, the unit along the biplot axis is r13 2 = 0.2. It is also clear that if two axis of the biplot point roughly in the same direction, the corresponding variables will show a strong correlation. In general, the rank of the data matrix A is larger than 2. In this case, approximative representations of A can be obtained by using the thin singular value decomposition of matrices (Corollary 9.3). Let A be a matrix of rank r and let
A = U DV =
r
i=1
σi ui vi
be the thin SVD, where U ∈ Rm×r and V ∈ Rn×r are matrices of rank r (and, therefore, full-rank matrices) having orthonormal sets of columns. Here U = (u1 · · · ur ) and V = (v 1 · · · v r ). The matrix D containing √ can be split between U √ singular values and V by defining L = U D and R = DV . The usefulness of the SVD for biplots is based on the Eckhart–Young Theorem (Theorem 9.9), which stipulates that the best approximation of A in the sense of the matrix norm ||| · |||2 in the class of matrix of rank k is the matrix defined by B(k) =
k
σi ui v i .
i=1
According to Theorem 9.10, the same matrix B(k) is the best approximation of A in the sense of Frobenius norm. The extent of the deficiency of this approximation is measured by A − B(k)2F = 2 + · · · + σr2 . Since A2F = σ12 + · · · + σr2 , an absolute measure of σk+1 the quality of the approximation of A by B(k) is qk = 1 −
σ12 + · · · + σk2 A − B(k)2F = . A2F σ12 + · · · + σr2
In the special case, k = 2, the quality of the approximation is q2 =
σ12 + σ22 σ12 + · · · + σr2
and it is desirable that this number be as close to one as possible. The rank-2 approximation of A is useful because we can apply biplots to the visualization of A.
Data Sample Matrices
Example 11.4. Let A ∈ R5×3 be ⎛ 1 ⎜0 ⎜ ⎜ A = ⎜1 ⎜ ⎝1 0
733
the matrix defined by ⎞ 0 0 1 0⎟ ⎟ ⎟ 1 1⎟ . ⎟ 1 0⎠ 0 1
It is easy to see that the rank of this matrix is 3 and, using MATLAB , a singular value decomposition can be obtained as U =
[
0.2787 0.2787 0.7138 0.5573 0.1565
-0.2176 -0.2176 0.3398 -0.4352 0.7749
-0.7071 0.7071 -0.0000 -0.0000 0.0000
2.3583 0 0 0 0
0 1.1994 0 0 0
0 0 1.0000 0 0
0.6572 0.6572 0.3690
-0.2610 -0.2610 0.9294
-0.7071 0.7071 0.0000
-0.2996 -0.2996 -0.4037 0.7033 0.4037
-0.5341 -0.5341 0.4605 0.0736 -0.4605
S =
V =
The rank-2 approximation of this matrix is B(2) = σ1 u1 v H1 + σ2 u2 v H2 , and is computed in MATLAB using >> B2 = 2.3583* U(:,1) * V(:,1)’ + 1.1994 * U(:,2) * V(:,2)’ B2 = 0.5000 0.5000 1.0000 1.0000 -0.0000
0.5000 0.5000 1.0000 1.0000 -0.0000
-0.0000 -0.0000 1.0000 -0.0000 1.0000
Linear Algebra Tools for Data Mining (Second Edition)
734
If we split the singular values as √ √ √ √ B(2) = ( σ1 u1 )( σ1 v 1 )H + ( σ2 u2 )( σ2 v 2 )H , then B(2) can be written as ⎛ ⎞ 0.4280 −0.2383 ⎜0.4280 −0.2383⎟ ⎜ ⎟ 1.0092 1.0092 0.5667 ⎜ ⎟ B(2) = ⎜1.0962 0.3721 ⎟ . ⎜ ⎟ −0.2858 −0.2858 1.0179 ⎝0.8559 −0.4766⎠ 0.2403 0.8487 The biplot that represents matrix A is shown in Figure 11.2. The quality of the approximation of A is q2 =
2.35832 + 1.19942 = 0.875. 2.35832 + 1.19942 + 1
The “allocation” of singular values among the columns of the matrices U and V may lead to biplots that have distinct properties. 1.2 1
l5
r
3
0.8 0.6 0.4
l3
0.2 0 −0.2
r1,r2
l1,l2
−0.4
l
4
−0.6 −1
−0.5
Fig. 11.2
0
0.5
1
Biplot of the rank-2 approximation of A.
Data Sample Matrices
735
For example, we could write B(2) = (σ1 u1 )v H1 + (σ2 u2 )v H2 ,
(11.4)
B(2) = u1 (σ1 v 1 )H + u2 (σ2 v 2 )H .
(11.5)
or
The first allocation leads to the factorization B(2) = LR, where ⎛
⎞ 0.6572 −0.2610 ⎜0.6572 −0.2610⎟ ⎜ ⎟ ⎜ ⎟ L = ⎜1.6834 0.4075 ⎟ ⎜ ⎟ ⎝1.3144 −0.5219⎠ 0.3690 0.9294 and R = 0.65720.65720.3690 − 0.2610 − 0.26100.9294 , while the second yields the factors L = 0.2787
−0.21760.2787
−0.21760.7138
0.33980.5573
−0.43520.1565
0.7749
and R=
1.5499 1.5499 0.8703 . −0.3130 −0.3130 1.1147
The first variant (Equality 11.4) leads to a representation, where the distances between the vectors li approximates the Euclidean distances between rows, while for the second variant (Equality 11.5), the cosine of angles between the vectors r j approximates the correlations between variables. Exercises and Supplements (1) Verify that the centering matrix Hm is both symmetric and idempotent. (2) Compute the spectrum of a centering matrix Hm .
736
Linear Algebra Tools for Data Mining (Second Edition)
(3) Let X ∈ Rm×n be a matrix and let Qc ∈ Rm×m be the matrix introduced in Supplement 10 of Chapter 3. Prove that (a) if c ∈ Rm , then Qc X = X − 1m c X; if c is the mean of the 1 rows of X, that is if c = m X 1m , then Qc X = Hm X; n (b) if c ∈ R , then XQc = X − X1m c ; if c is the mean of the columns of X, that is if c = n1 X1n , then XQc = XHn . (4) Usually, for a data sample matrix X ∈ Rm×n , we can assume that m ≥ n. Examine the possibility that rank(X) < n. (5) Let (u1 , . . . , um ) be a sequence of m vectors in Rn . Prove that the covariance matrix C of this sequence can be written as ⎞ ⎛ ˜ u1 − u ⎟ ⎜ 1 .. ⎟. ˜ · · · un − u ˜) ⎜ (u1 − u C= . ⎠ ⎝ m−1 ˜ un − u (6) Let q ∈ Rn be a vector such that i = 1n qi = 0. Prove that Hn q = q Hn = q. (7) Let A ∈ Rn×n and c ∈ Rn . Define the mapping f : Rn −→ r n as f (u) = Au + c for u ∈ Rn and let wi = f (ui ) for 1 ≤ i ≤ m. Prove that the covariance matrix D of the sequence (w1 , . . . , w m ) is D = ACA . (8) Prove that the function dC : {u1 , . . . , um }2 −→ R defined by ˜ for 1 ≤ i, j ≤ m is a metric ˜ ) C −1 (uj − u) dC (ui , uj ) = (ui − u on {u1 , . . . , um }. Furthermore, prove that dD (f (ui ), f (uj )) = d(ui , uj ) for 1 ≤ i, j ≤ m. The function dC is known as the Mahalanobis metric. (9) Let X ∈ Rn×p be a data sample matrix. Prove that 1 ˆ ∗ 1n X ⊗ 1n . X=X− n ˆ starting from X. Write a MATLAB function that computes X m (10) Prove that if x ∈ R , then Hm x = x − x1m , where x is the mean of x. ⎛ ⎞ x1 ⎜ ⎟ m 1 ⎜ .. ⎟ (11) Prove that x Hm x = m i=1 (xi − x), where x = ⎝ . ⎠ and x xm is the mean of x.
Data Sample Matrices
737
Let X ∈ Rm×n be a sample matrix. Its correlation matrix, corr(X) ∈ Rn×n , is defined by (corr(X))ij = ρ(v i , v j ) for 1 ≤ i, j ≤ n, where X = (v 1 · · · v n ). (12) Prove that corr(X) = D −1 cov(X)D −1 , where D = diag(sv1 , . . . , svn ). (13) Justify the claims made at the end of Example 11.4 that refer to the allocation variants for singular values. The generalized sample variance of a data sample matrix X is the determinant det(cov(X)), which offers a succinct summary of the variances and covariances of X. (14) Let X ∈ Rm×n be a centered data sample matrix, X = (v 1 , . . . , v n ). Prove that the generalized sample variance of X 1 Vn (X)2 , where Vn (X) is the volume of the paralequals n−1 lelepiped constructed on the vectors v 1 , . . . , v n . Hint: See Supplement 107 of Chapter 6. Bibliographical Comments There are several excellent sources for biplots [67, 68, 96]. A recent readable introduction to biplots is [69].
This page intentionally left blank
Chapter 12
Least Squares Approximations and Data Mining
12.1
Introduction
The least square method is used in data mining as a method of estimating the parameters of a model by adopting the values that minimize the sum of the squared differences between the predicted and the observed values of data. This estimation process is also known as regression, and several types of regression exist depending on the nature of the assumed model of dependency between the predicted and the observed data. 12.2
Linear Regression
The aim of linear regression is to explore the existence of a linear relationship between the outcome of an experiment and values of variables that are measured during the experiment. As we saw in Chapter 11, experimental data often are presented as a data sample matrix B ∈ Rm×n , where m is the number of experiments and n is the number of variables measured. The results of the experiments are
739
Linear Algebra Tools for Data Mining (Second Edition)
740 3800
PT
3600 RO
NO
calories per day
3400 CY
3200
3000
FI
NL
BA SK
2800 YU 2600
2400
0
Fig. 12.1
1
2
3 4 5 6 gdp per person in $10K units
7
8
9
Calories vs. GDP in 10K units per person in Europe.
the components of a vector ⎞ b1 ⎜ ⎟ b = ⎝ ... ⎠ . ⎛
bm Linear regression amounts to determining r ∈ Rn such that Br = b. Knowing the components of r allows us to express the value of the result as a linear combination of the values of the variables. Unfortunately, since m is usually much larger than n, this system is overdetermined and, in general, is inconsistent. The columns v1 , . . . , v n of the matrix B are referred to as the regressors; the linear combination r1 v 1 + · · · + rn v n is the regression of b onto the regressors v 1 , . . . , v n . Example 12.1. In Figure 12.1, we represent (using the function plot of MATLAB), the number of calories consumed by a person per day vs. the gross national product per person in European countries starting from the following table.
Least Squares Approximations and Data Mining ccode ‘AL’ ‘AT’ ‘BY’ ‘BE’ ‘BA’ ‘BG’ ‘HR’ ‘CY’ ‘CZ’ ‘DK’ ‘EE’ ‘FI’ ‘FR’ ‘GE’ ‘DE’ ‘GR’ ‘HU’ ‘IS’ ‘IE’
gdp 0.74 4.03 1.34 3.79 0.66 1.28 1.75 1.21 2.56 3.67 1.90 3.53 3.33 0.48 3.59 3.02 1.90 3.67 3.76
cal 2824.00 3651.00 2895.00 3698.00 2950.00 2813.00 2937.00 3208.00 3346.00 3391.00 3086.00 3195.00 3602.00 2475.00 3491.00 3694.00 3420.00 3279.00 3685.00
ccode ‘IT’ ‘LV’ ‘LT’ ‘LU’ ‘MK’ ‘MT’ ‘MD’ ‘NL’ ‘NO’ ‘PL’ ‘PT’ ‘RO’ ‘RU’ ‘YU’ ‘SK’ ‘SI’ ‘ES’ ‘CH’
gdp 3.07 1.43 1.59 8.18 0.94 2.51 0.25 4.05 5.91 1.88 2.30 1.15 1.59 1.10 2.22 2.84 2.95 4.29
741
cal 3685.00 3029.00 3397.00 3778.00 2881.00 3535.00 2841.00 3240.00 3448.00 3375.00 3593.00 3474.00 3100.00 2689.00 2825.00 3271.00 3329.00 3400.00
This data set was extracted from [55]. We seek to approximate the calorie intake as a linear function of the gdp of the form cal = r1 + r2 gdp. This amounts to solving a linear system that consists of 37 equations and two unknowns: r1 + 0.74r2 = 2824 .. . r1 + 4.29r2 = 3400 and, clearly such a system is inconsistent. If the linear system Br = b has no solution, the “next best thing” is to find a vector c ∈ Rn such that Bc − b2 ≤ Bw − b2 for every w ∈ Rn , an approach known as the least square method. We will refer to the triple (B, r, b) as an instance of the least square problem. Note that Br ∈ range(B) for any r ∈ Rn . Thus, solving this problem amounts to finding a vector Br in the subspace range(B) such that Br is as close to b as possible.
742
Linear Algebra Tools for Data Mining (Second Edition)
Let B ∈ Rm×n be a full-rank matrix such that m > n, so rank(B) = n. The symmetric square matrix B B ∈ Rn×n has the same rank n as the matrix B, as we saw in Theorem 3.33. Therefore, the system (B B)r = B b has a unique solution s. Moreover, B B is positive definite because r B Br = (Br) Br = Br22 > 0 for r = 0. Theorem 12.1. Let B ∈ Rm×n be a full-rank matrix such that m > n and let b ∈ Rm . The unique solution of the system (B B)r = B b equals the projection of the vector b on the subspace range(B). Proof. The n columns of the matrix B = (v 1 · · · v n ) constitute a basis of the subspace range(B). Therefore, we seek the projection c of b on range(B) as a linear combination c = Bt, which allows us to reduce this problem to a minimization of the function f (t) = Bt − b22 = (Bt − b) (Bt − b) = (t B − b )(Bt − b) = t B Bt − b Bt − t B b + b b. The necessary condition for the minimum is (∇f )(t) = 2B Bt − 2B b = 0, which implies B Bt = B b.
The linear system (B B)t = B b is known as the system of normal equations of B and b. Example 12.2. We augment the data sample matrix by a column that consists of 1s to accommodate a constant term r1 ; thus, we work with the data sample matrix B ∈ R37×2 given by ⎛ ⎞ 1 0.74 ⎜. .. ⎟ ⎟ . B=⎜ . ⎠ ⎝. 1 4.29 whose second column consists of the countries’ gross domestic products in $10K units. The matrix C = B B is 37.0000 94.4600 C= . 94.4600 333.6592
Least Squares Approximations and Data Mining
743
4200 4000 3800
calories per day
3600 3400 3200 3000 2800 2600 2400
0
1
2
3 4 5 gdp per person in $10K units
Fig. 12.2
6
7
8
Regression line.
Solving the normal system using the MATLAB statement r = C\(B ∗b) yields 2894.2 r= , 142.3 so the regression line is cal = 142.3 ∗ gdp + 2894.2, shown in Figure 12.2. Suppose now that B ∈ Rm×n has rank k, where k < min{m, n}, and U ∈ Rm×m , V ∈ Rn×n are orthonormal matrices such that B can be factored as B = U M V , where R Ok,n−k M= ∈ Rm×n , Om−k,k Om−k,n−k R ∈ Rk×k , and rank(R) = k.
c1 define c = ∈ and let c = , where c1 ∈ Rk For b ∈ c2 and c2 ∈ Rm−k . Since rank(R) = k, the linear system Rz = c1 has a unique solution z 1 . Rm
U b
Rm
744
Linear Algebra Tools for Data Mining (Second Edition)
Theorem 12.2. All vectors r that minimize Br − b2 have the form z r=V w for an arbitrary w. Proof.
We have
Br − b22 = U M V r − U U b22 = U (M V r − U b)22 = M V r − U b22 (because multiplication by an orthonormal matrix is norm-preserving) = M V r − c22 = M y − c22 = Rz − c1 22 + c2 22 , where z consists of the first r components of y. This shows that the minimal value of Br − b22 is achieved by the solution of Therefore, the vectors the system Rz = c1 and is equal to c2 22 . z for an arbitrary r that minimize Br − b22 have the form w w ∈ Rn−r . Instead of the Euclidean norm we can use the · ∞ . Note that we have t = Br−b∞ if and only if −t1 ≤ Br−b ≤ t1, so finding r that minimizes · ∞ amounts to solving a linear programming problem: minimize t subjected to the restrictions −t1 ≤ Br − b ≤ t1. Similarly, we can use the norm · p . If y = Br − b, then we need to minimize ypp = |y1 |p + · · · + |ym |p , subjected to the restrictions −y ≤ Ar − b ≤ y. 12.3
The Least Square Approximation and QR Decomposition
Solving the system of normal equation presents numeric difficulties because, by Corollary 9.5 the condition number of the matrix B B is the square of the condition number of B. An alternative approach
Least Squares Approximations and Data Mining
745
to finding r ∈ Rn that minimizes f (u) = Bu − b22 is to use a full QR decomposition of the matrix B, where B ∈ Rm×n is a full-rank matrix and m > n, as described in Theorem 6.67. Suppose that R , B=Q Om−n,n where Q ∈ Rm×m is an orthonormal matrix and R ∈ Rn×n is an upper triangular matrix such that R ∈ Rm×n . Om−n,n We have
Bu − b = Q =Q
R
Om−n,n R
Om−n,n
u−b
u − QQ b
(because Q is orthonormal and therefore QQ = Im ) R =Q u−Qb . Om−n,n By Theorem 6.24, multiplication by an orthogonal matrix preserves the Euclidean norm of vectors. Thus,
2 R
u − Q b
. Bu − b22 =
Om−n,n 2 If we write Q = (L1 L2 ), where L1 ∈ Rm×n and L2 ∈ Rm×(m−n) , then
R L1 b
2
2 u− Bu − b2 =
Om−n,n L2 b 2
Ru − L1 b
2 =
−L2 b 2 = Ru − L1 b22 + L2 b22 . Observe that the system Ru = L1 b can be solved and its solution minimizes Bu − b2 .
746
12.4
Linear Algebra Tools for Data Mining (Second Edition)
Partial Least Square Regression
When the number of variables of an experiment is large, multiple output variables exist, and some or all these variables are correlated, the least square regression is replaced by a dimensionality reduction technique known as partial least square regression (PLS). Its inventor, the Swedish mathematician Herman Wold, observed that the acronym PLS is also consistent with “projection on latent structures” and this is a more accurate description of this data analysis method. In this section, we assume that the set of variables of an experiment E is partitioned into two disjoint sets {X1 , . . . , Xp } referred to as the set of predictor variables and {Y1 , . . . , Yr } named the set of response variables. Thus, the sample matrix XE ∈ Rm×(p+r) can be written as XE = (X, Y ), where X ∈ Rm×p and Y ∈ Rm×r . The basic model of PLS starts from the assumption that the matrices X and Y can be written as X = T P + E, Y = T Q + F, where the matrices T, U, P, Q, E, F are described in the following table. Matrix Notation T P Q E F
Format Rn×s Rp×s Rr×s Rn×p Rn×r
Matrix Designation Matrix of score vectors Matrix of loadings Matrix of loadings Matrix of residuals Matrix of residuals
The iterative process introduced in Algorithm 12.4.1 begins with the matrices X and Y and proceeds to construct two sequences of matrices X0 , . . . , Xk and Y0 , . . . , Yk , where X0 = X and Y0 = Y , as follows. In the repeat loop that extends between lines 4 and 12, we compute four sequence of vectors w ∈ Rp , t ∈ Rn , c ∈ Rr , and u ∈ Rn .
Least Squares Approximations and Data Mining
Algorithm 12.4.1: Iterative Algorithm for PLS Data: Matrices X ∈ Rn×p and Y ∈ Rn×r Result: Sequences of matrices X0 , . . . , Xk and Y0 , . . . , Yk 1 for i = 1 to k do 2 = 0; 3 initialize u0 as the first column of Yi−1 ; 4 repeat u 5 w = X u u ; 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
1 w w ; t = wXw w ; c = Yt tt ; c ← c1 c ; u+1 = cY cc ;
w ←
6
← + 1; until until convergence condition is met; t(i) = t ; u(i) = u ; c(i) = c ; t(i) ; p = (t(i)1) t(i) Xi−1 1 Y u(i) ; (u(i) ) u(i) i−1 (t(i) ) t(i) bi = (u (i) ) t(i) ; Xi ← Xi−1 − t(i) (p(i) ) ; Yi ← Yi−1 − bi t(i) (c(i) ) ;
q=
end
Observe that u+1 = Y c /(c c ) = Y Y t /(c c )(t t ) = Y Y Xw /(c c )(t t )(w w ) = Y Y XX u /(c c )(t t )(w w )(u u ).
747
Linear Algebra Tools for Data Mining (Second Edition)
748
Similarly, we have w+1 = X u+1 /(u+1 u+1 ) = X Y Y XX u /(c c )(t t )(w w )(u u )(u+1 u+1 ) = X Y Y Xw /(c c )(t t )(w w )(u+1 u+1 ), t+1 = Xw +1 /(w +1 w+1 = XX Y Y Xw /(c c )(t t )(w w )(u+1 u+1 )(w +1 w+1 = XX Y Y t /(c c )(t t )(u+1 u+1 )(w +1 w+1 ), and c+1 = Y t+1 /(t+1 t+1 ) = Y XX Y Y t /(c c )(t t )(u+1 u+1 )(w +1 w+1 )(t+1 t+1 ) = Y XX Y c /(c c )(u+1 u+1 )(w +1 w+1 )(t+1 t+1 ). 12.5
Locally Linear Embedding
Often data that results from observations obey mathematical relationships that are not immediately apparent because of the high dimensionality of the space where data resides. These relationships, once identified, could point to the location of the data vectors on a manifold whose intrinsic dimensionality could be much lower than the dimension of the ambient space. The locally linear embedding (LLE) introduced by Roweis and Saul in [141] is based on the basic idea that a point and its immediate neighbors are approximatively located on a linear manifold that is locally close to the real underlying manifold. Suppose that we have a sample data set {x1 · · · xn } ⊆ Rm that consists of n vectors in a high-dimension space Rm . The matrix X ∈ Rm×n is defined by X = (x1 · · · xn ). The LLE algorithm computes the neighbors Ni = {xj | j ∈ Ji } of each data vector xi (as, for example, the nearest k data points in the sense of Euclidean distance). Then, it seeks to compute a matrix of weights W ∈ Rn×n such that (i) if j ∈ Ji , then wij = 0; n (ii) i=1 wij = 1 for every i, 1 ≤ i ≤ n;
Least Squares Approximations and Data Mining
749
2 n
n
(iii) the error err = i=1
xi − j=1 wij xj
is minimal for 1 ≤ 2 i ≤ n. Observe that the second condition can be equivalently written as W 1n = 1n . Using the weights obtained in the second phase, we seek to determine a set of low-dimensional vectors y 1 , . . . , y n in Rd , where d is much lower than m such that the local linear structure defined by the matrix W is preserved. An important property of the weights wij is their invariance with respect to rotations, rescalings, and translations of the data points and of their neighbors. Indeed, if a rotation is applied to xi and to its neighbors in Ni , the value of err remains the same. Also, if both xi and the vectors in Ni are rescaled by the same factor, the values of the coefficients remain unchanged. Finally, if a translation by a vector t is applied to xi and to all xj ∈ Ji , we have xi + t −
n
wij (xj + t) = xi −
j=1
n
wij xj ,
j=1
n
because j=1 wij = 1. Thus, the weights wij characterize the intrinsic geometric properties of the neighborhood of xi . In its first phase, the Saul–Roweis algorithm identifies the neighbors of each data vector either by seeking the k closest vectors or by identifying the points within a closed sphere of fixed radius. The number k of neighbors of a point xi is, in general, not larger than the input dimensionality m; if this is not the case, special terms need to be added to the reconstruction costs (see Supplement 6). The second phase of the algorithm involves computing the weights wij such that err is minimal and nj=1 wij = 1. Note that err =
n
n n
n
2
2
wij xj
= wij (xi − xj )
xi −
i=1
=
n i=1
=
j=1
2
i=1
j=1
⎞ ⎛ n n ⎝ wij (xi − xj )⎠ wik (xi − xk ) j=1
n n n i=1 j=1 k=1
k=1
wij (xi − xj ) wik (xi − xk ).
2
750
Linear Algebra Tools for Data Mining (Second Edition)
The optimal weights are determined using the Lagrangian L(W, λ1 , . . . , λn ) =
n n n
wij (xi − xj ) wik (xi − xk )
i=1 j=1 k=1
−
n
⎛ ⎞ n λi ⎝ wij − 1⎠.
i=1
j=1
The necessary extremal conditions 2
n
∂L ∂wpq (W, λ1 , . . . , λn )
= 0 imply
wpk (xp − xq ) (xp − xk ) − λp = 0
k=1
for 1 ≤ p, q ≤ n. Let k be the chosen number of neighbors of points and let G(p) ∈ Rk×k be the local Gram matrix in xp , (G(p))qr = (xp − xq ) (xp − xr ) for 1 ≤ q, r ≤ k. The matrix G(p) is symmetric and positive semidefinite. The extremum condition can be written as 2
k
wp G(p)q = λp
=1
or, as G(p)w p = 12 λp 1n , where w p is the pth row of W (involving the coefficients that correspond to xp ). Therefore, wp = 12 λp G(p)−1 1k . Using the condition 1k wp = 1, we have λp 1k G(p)−1 1k = 2, so λp =
2
. 1k G(p)−1 1k
Consequently, wi is given by wp =
1 G(p)−1 1k . 1k G(p)−1 1k
For the final phase of the algorithm we seek a set of vectors {y 1 , . . . , y n } in the lower-dimensional space Rd , where d < m, such that the matrix Y = (y 1 · · · y n ) ∈ Rd×n minimizes the cost function
2
Φ(Y ) = ni=1
y i − nj=1 wij y j
. Observe that in the absence of 2
Least Squares Approximations and Data Mining
751
other conditions Φ(Y ) is optimal for Y = Od,n . Therefore, to make this problem well-posed it is necessary to require that the covariance matrix cov(Y ) equals In . This will compel the vectors y i to be different from 0d and pairwise orthogonal. Let M ∈ Rn×n be the symmetric and positive semidefinite matrix defined by M = (In − W ) (In − W ). The cost function can be written as Φ(Y ) =
n n n (y i − wij y j ) (y i − wih y h ) i=1
=
n
j=1
y i y i −
i=1
+
h=1
n n
wij y j y i −
i=1 j=1
n n n
n n
wih y i y h
i=1 h=1
wij wih y j y h .
i=1 j=1 h=1
Note that Y M Y = Y (In − W − W + W W )Y = Y Y − Y W Y − Y W Y + Y W W Y and we have the obvious equalities trace(Y Y ) =
n
y i y i
i=1
trace(Y W Y ) =
n n
y i wij y j
i=1 j=1
trace(Y W Y ) =
n n
y i (W )ij y j =
i=1 j=1
trace(Y W W Y ) =
n n n
n n i=1 j=1
y i (W )ij Wjh y h
i=1 j=1 h=1
=
n n n i=1 j=1 h=1
so Φ(Y ) = trace(Y M Y ).
y i wji wjh y h ,
y i wji y j
752
Linear Algebra Tools for Data Mining (Second Edition)
It is important to observe that the matrix M has (1n , 0) as an eigenpair since M 1n = (In − W ) (In − W )1n = (In − W ) 0n = 0n , and all its eigenvalues are non-negative. The cost function is invariant under translations. This makes it necessary to add the second condition that fixes the center of the set of vectors {y 1 , . . . , y n } in 0n , which is equivalent to asking M 1n = 0. For the case of dimension d, we can write the matrix Y as a collection of n-dimensional rows as follows: ⎛ ⎞ z1 ⎜ ⎟ . Y = ⎝ . .⎠ . zd Now the cost function becomes ⎛⎛
⎞ ⎞ z1 Φ(Y ) = trace ⎝⎝. . .⎠ M (z 1 · · · z d )⎠ zd =
d p=1
trace(z p M z p ) =
d
z p M z p .
p=1
Let λ1 ≥ · · · ≥ λd ≥ λd+1 = 0 be the least d + 1 eigenvalues of M . To minimize Φ(Y ) we need to minimize each of the non-negative numbers z p M z p , which means that we need to adopt for z 1 , . . . , z d the unit eigenvectors that correspond to the eigenvalues λ1 , . . . , λd . The matrix Y is given by ⎛ ⎞ z1 ⎜.⎟ ⎟ Y =⎜ ⎝ .. ⎠ = (y 1 · · · y n ). zd The eigenvectors that correspond to the smallest d + 1 eigenvalues provide a matrix (y 1 , . . . , y d , y d+1 ) ∈ Rn×(d+1) . The last eigenvector y d+1 = 1n corresponds to the eigenvalue 0. The d-dimensional rows z 1 , . . . , z n of the matrix S = (y 1 , . . . , y d ) ∈ Rn×d yield the lowdimensional representation z 1 , . . . , z n of {x1 , . . . , xn }.
Least Squares Approximations and Data Mining
753
Example 12.3. We discuss the MATLAB algorithm developed for LLE in [140]. We begin by generating a data set in R2 using the MATLAB code i = 1; for x = -pi:0.1:pi X(1,i) = x*sqrt(3)/2 -1 -0.5*sin(x - pi); X(2,i) = 0.5*x + sqrt(3) + sqrt(3)*sin(x-pi)/2; i=i+1; end
The 63 columns of matrix X ∈ R2×63 have been plotted in Figure 12.3 and they are located on a curve in R2 which is a one-dimensional manifold. Next we determined the nearest two points for each point. For this computation we need to compute the matrix of distances dist ∈ R63×63 between points for i=1:1:63 for j=1:1:63 dist(i,j)= norm(X(:,i)-X(:,j)); end end [sorted,index] = sort(dist); neighborhood = index(2:3,:); 3.5
3
2.5
2
1.5
1
0.5
0 −4
−3
Fig. 12.3
−2
−1
0
1
Representation of the data set in R2 .
2
754
Linear Algebra Tools for Data Mining (Second Edition)
The matrix dist is sorted yielding the matrix sorted having each of its columns arranged in ascending order. Thus, the jth column of sorted contains the distances between xj and the remaining points. The matrix index contains in its jth column the indices of the points that correspond to the sorted distances. This allows us to find the next two neighbors of xj by simply extracting the second and the third elements of dist. W = zeros(2,63); for ii = 1:63 z = X(:,neighborhood(:,ii))-repmat(X(:,ii),1,2); % next we compute the local Gram matrix G = z’*z; % we solve Gw=1 W(:,ii) = G\ones(2,1); % enforce sum(w)=1 W(:,ii) = W(:,ii)/sum(W(:,ii)); end;
In the next phase the eigenvectors of the cost matrix are computed. % M is a sparse matrix with storage for 4*2*63 nonzero elements M = sparse(1:63,1:63,ones(1,63),63,63,4*2*63); for ii=1:63 w = W(:,ii); jj = neighborhood(:,ii); M(ii,jj) = M(ii,jj) - w’; M(jj,ii) = M(jj,ii) - w; M(jj,jj) = M(jj,jj) + w*w’; end; options.disp = 0; options.isreal = 1; options.issym = 1; [Y,eigenvals] = eigs(M,2,0,options); Y = Y(:,2:2)’*sqrt(63);
The resulting matrix Y is a row vector containing the representation of the 63 points located on the curve. 12.6
MATLAB
Computations
The MATLAB function lsqr attempts to compute the least square solution x to the linear system of equations Ax = b by finding x that
Least Squares Approximations and Data Mining
755
minimizes Ax − b. If the system is consistent, the linear squares solution is also a solution of the linear system and a corresponding message is displayed. Example 12.4. Let A and b ⎛ 1 1 ⎜ A = ⎝2 −1 3 1
be given by ⎞ ⎛ ⎞ −2 −2 ⎟ ⎜ ⎟ 0 ⎠ and b = ⎝ 1 ⎠. −1 2
The following MATLAB code solves the system Ax = b: >> A=[1 1 -2;2 -1 0;3 1 -1] A = 1 1 2 -1 3 1 >> b=[-2;1;2] b = -2 1 2 >> x=lsqr(A,b) x = 1.0000 1.0000 2.0000
-2 0 -1
Example 12.5. Let us expand the matrix A by adding one more row: >> A=[1 1 -2;2 -1 0;3 1 -1;1 -1 3]
and setting \bfb = [-2; 1; 2; 8]
The system Ax = b is now incompatible and the result returned is >> x=lsqr(A,b) lsqr converged at iteration 3 to a solution with relative residual 0.096. x = 1.1905 1.3810 2.6190
756
Linear Algebra Tools for Data Mining (Second Edition)
The result includes the relative residual error ation number at which the algorithm halted.
b−Ax x
and the iter-
If A ∈ Rn×n and B is a column vector with n components, or a matrix with several such columns, then X = A\B is the solution to the equation AX = B in the least square sense. A warning message is displayed if A is badly scaled or nearly singular.
Exercises and Supplements (1) Let X ∈ Rm×n , where m ≥ n, be a full-rank matrix, b be a vector in Rm , and let r = Xx − b be the residual vector corresponding to x. Prove that r is the solution of the normal system X Xr = X b if and only if r is orthogonal to the columns of X. (2) Let a, b ∈ Rm be two vectors and let ax = b be a oneindeterminate linear system. The system is, in general, incompatible, if m > 1. (a) Prove that the best approximation of the solution in the sense defined in Section 12.1 is x=
(a, b) . a22
Replace the norm · 2 with · 1 ; in other words, define the best approximation of the solution as x such that Ax − b1 is minimal. Prove that if a = (a, a, a) ∈ R3 with a > 0, then the optimum is achieved for x = b2 . (3) Let b1 , . . . , bn be n vectors in Rm such that ni=1 bi = 0m and let Hw,a be a hyperplane in Rm , where w is a unit vector. Let B = (b1 · · · bn ). Prove that if ni=1 d(Hw,a , bi )2 is minimal (where d(Hw,a , bi ) is the distance from bi to Hw,a ), then w is an eigenvector of the matrix BB that corresponds to the least eigenvalue of this matrix and a = 0. Solution: We need to minimize ni=1 (w bi − a)2 = w B − a1 22 subjected to the restriction w2 = 1. Consider the
Least Squares Approximations and Data Mining
757
Lagrangian
n n 2 2 F (w1 , . . . , wm , λ) = (w bi − a) − λ w − 1 i=1
=1
⎛ ⎞2
n n m ⎝ wj bji − a⎠ − λ w2 − 1 . = i=1
j=1
=1
Then, we can write ⎛ ⎞ n m ∂F = 2⎝ wj bji − a⎠ bki − 2λwk = 0, ∂wk i=1
j=1
which is equivalent to n m
wj bji bki − a
i=1 j=1
m
bkj − λwk = 0.
j=1
Since ni=1 bi = 0m , it follows that nj=1 bkj = 0 for every k so the necessary extremal conditions are n m
wj bji bki = λwk = 0
i=1 j=1
for 1 ≤ k ≤ m. This equality amounts to w BB = λw , or BB w = λw, which shows that w must be an eigenvector of BB . The total sum of the squared distances is D = w B − a1 22 = (w B − a1 )(w B − a1 ) = (w B − a1 )(B w − a1) = w BB w − 2a1 B w + a2 1 1 = λw w + ma2 = λ + ma2 . Thus, w must be a unit eigenvector of BB that corresponds to the least eigenvalue. The minimal value of D is obtained when a = 0.
758
Linear Algebra Tools for Data Mining (Second Edition)
(4) Formulate and prove a result similar to the one in Supplement 3 without the condition ni=1 bi = 0m . Hint: Consider the ci = bi − b0 , where b0 = n1 sumni=1 bi vectors n and observe that i=1 ci = 0. (5) Let W ∈ Rn×n be a matrix. Prove that (In − W ) (In − W )1n = 0n if and only if W 1n = 0n . (6) Prove that when k, the number of nearest k neighbors exceeds the dimension m of the input space of the locally linear embedding algorithm, the local Gram matrix can be non-invertible. Prove that this problem can be corrected by replacing G(p) by G(p) + c/kIn , where c > 0 is a small number. (7) Let A ∈ Cm×n be a matrix with rank(A) = k such that A = U RV H . Here we assume that U ∈ Cm×n , V ∈ Cn×n are unitary matrices, and R ∈ Cm×n can be written as R=
Q Om−k,k
Ok,n−k , Om−k,n−k
where Q ∈ Ck×k and rank(Q) = k. Let b ∈ Cm , x ∈ Cn , g = U H b ∈ Cm , and y = V H x. Write g=
g1 g2
and y =
y1 , y2
where g 1 ∈ Ck , g 2 ∈ Cm−k , y 1 ∈ Ck , and y 2 ∈ Cn−k . Prove that (a) if z is the unique solution of Qz = g 1 , then any vector x that minimizes Ax − b has the form ˆ=V x
z ; y2
(b) any optimal solution gives the residual vector r = b − Aˆ x=U where r = g 2 ;
0k , g2
Least Squares Approximations and Data Mining
759
(c) the unique solution of minimal Euclidean norm is z ˆ=V , x 0n−k where z was defined above; (d) the solution of minimum Euclidean length, the minimal value of b − Ax, and set of all solutions are unique. Solution: The hypothesis implies Ax − b2 = U RV H x − b2 = U RV H x − U U H b2 = U (RV H x − U H b)2 = RV H x − U H b2 (because U is a unitary matrix)
Qy 1 − g 1
2 = Ry − g =
g2 = Qy 1 − g 1 2 + g 2 2 . The minimum value of Ax − b2 is achieved when Qy 1 = g 1 and it equals g 2 2 . Thus, if z is the unique solution of the equation Qz = g 1 , then any solution that minimizes Ax − b has the form −1 Q g1 z ˆ=V =V , x g2 y2 where y 2 is arbitrary. The solution that minimizes Ax − b and has the minimum Euclidean norm is z ˆ=V . x 0n−k ˆ1 y H ˆ = ˆ=V x . The residual vector of an optimal soluLet y ˆ2 y tion is ˆ r = b − Aˆ x = U U H b − U RV H x ˆ ) = U (U H b − Rˆ y) = U = U (U b − RV x H
H
The last part follows immediately.
0k . g2
Linear Algebra Tools for Data Mining (Second Edition)
760
(8) Prove that the unique minimum Euclidean length solution x of the minimization problem has the form ˆ=V x
Q−1 On−k,k
Ok,m−k U H b, On−k,m−k
where the notations are the same as in Supplement 7. Solution: The conclusion is immediate from the equalities −1 Q g1 Ok,m−k Q−1 g 1 ˆ=V V x 0n−k On−k,k On−k,m−k g2 −1 Ok,m−k Q U H b. =V On−k,k On−k,m−k
(9) Let A ∈ Cm×n be a matrix and let z i ∈ Rm be the solution of minimal Euclidean length of the least square problem instance (A, z, ei ), where ei ∈ Cm and z i ∈ Cn for 1 ≤ i ≤ m. Prove that the Moore–Penrose pseudoinverse of A is A† = (z 1 · · · z m ). Solution: In Supplement 26 of Chapter 9, we have shown that the pseudoinverse of A has the form †
A =V
R−1 On−k,k
Ok,m−k U H, On−k,m−k
R
Ok,n−k V H, Om−k,n−k
where A=U
Om−k,k
R ∈ Ck×k is a matrix of rank k, and U ∈ Cm×m and V ∈ Cn×n are unitary matrices. The columns of A† are given by †
z i = A ei = V
R−1 On−k,k
Ok,m−k U H ei , On−k,m−k
and z i is precisely the solution of minimal Euclidean length of the least square problem min{ei − Az}.
Least Squares Approximations and Data Mining
761
(10) Let A ∈ Rm×n and b ∈ Rn . Define d = A b, B = (A b), and A A d , H =BB= d a2 where a2 = b b. Prove that if H has the Cholesky factorization H = U U , where W y U= , 0 ρ ˆ that minimizes b − Ax. then |ρ| = b − Aˆ x for every x (11) Let A ∈ Rm×n and C ∈ Rm×p be two matrices and let X ∈ Rn×p be a matrix such that AX − CF is minimal. Prove that the equation A AX = A C is always solvable in X and that its solution minimizes AX − CF . Solution: Let C = {c1 · · · cp }. We saw that the equation A Ax = A ci is always solvable with respect to x; let xi be its solution, where 1 ≤ i ≤ p. Then, we have A A(x1 · · · xp ) = (c1 . . . cp ) = C, so X = (x1 · · · xp ) is a solution of A AX = A C. Note that ⎛ ⎞ x1 A − c1 ⎜ ⎟ .. ⎟(Ax1 − c1 · · · Axp − cp ), (X A − C )(AX − C) =⎜ . ⎝ ⎠ xp − c p which implies AX − C2F = trace(X A − C )(AX − C) =
p
(xi A − ci )(Axi − ci )
i=1
=
p i=1
Axi − ci 22 .
This implies the second part of this supplement.
762
Linear Algebra Tools for Data Mining (Second Edition)
(12) Let A ∈ Rm×n be a matrix whose columns form an orthonormal set of vectors and let c ∈ Rm . Prove that x = A c minimizes Ax − c. Note that in this case A = A† by Supplement 45 of Chapter 3. (13) Let A ∈ Rm×n be a matrix with rank(A) = r < n. If X ⊆ Rn is the set of vectors x that minimize Ax − b, prove that this set is convex. Therefore, X contains a unique element having minimum 2-norm, denoted by xLS (the subscript suggests the words “least square”). Solution: Suppose that both u and v minimize Ax−b. Then, for λ ∈ [0, 1], we have A(λu + (1 − λ)v − b λAu − b + (1 − λ)Av − b = Ax − b. (14) Let A ∈ Rm×n be a matrix and rank(A) = r, and let A = U DV be the SVD decomposition of A, where U = (u1 · · · um ) ∈ Rm×m , V = (v 1 · · · v n ) ∈ Rn×n are orthogonal matrices and D = diag(σ1 , . . . , σr , 0, . . . , 0). u b If b ∈ Rm , prove that xLS = ri=1 σii v i minimizes Ax − b has the smallest norm of all minimizers. Solution: We have Ax − b2 = (U AV )(V x) − U b2 (because U and V are orthogonal) =
r i=1
σi (V x)i − ui b
2
+
m
(ui b)2 .
i=r+1
If xLS solves the least square problem, this implies (V x)i = ui b σi . (15) Let A ∈ Rm×n be a matrix and rank(A) = r, and let A = U DV be the SVD decomposition of A. Prove that the matrix 1 1 † , . . . , , 0, . . . , 0 U A = V diag σ1 σr is the pseudoinverse of A and xLS = A† b is the least square solution.
Least Squares Approximations and Data Mining
763
Bibliographical Comments Exercise 2 is based on an observation made in [65]. Monographs dedicated to the least squares problems are [21] and [103]. Partial square regression is discussed in [1]. Locally linear embedding is due to S. T. Roweis and L. K. Saul [141]. Further improvements and variations on the LLE theme can be found in [9, 38, 144]. Supplement 6 is discussed in [143]. Supplements 7–10 are results of [103].
This page intentionally left blank
Chapter 13
Dimensionality Reduction Techniques
13.1
Introduction
Physical and biological data as well as economic and demographic data often have high dimensionality. Intelligent data-mining algorithms work best in interpretation and decision-making based on this data when we are able to simplify their tasks by reducing the high dimensionality of the data. Dimensionality reduction refers to the extraction of the relevant information for a specific objective, while ignoring the unnecessary information and is a key concept in pattern recognition, data mining, feature processing, and machine learning. Dimensionality reduction requires tuning in terms of the expected number of dimensions, or the parameters of the learning algorithms. 13.2
Principal Component Analysis
Principal component analysis (PCA) is a dimensionality reduction technique that aims to create a few new, uncorrelated linear combinations of the variables of experiments that “explain” the major parts of the data variability. Let W ∈ Rm×n be a data matrix given by ⎛ ⎞ u1 ⎜ . ⎟ ⎟ W =⎜ ⎝ .. ⎠ . um 765
Linear Algebra Tools for Data Mining (Second Edition)
766
Definition 13.1. Let w ∈ Rn be a unit vector. The residual of ui relative to w is the number r(ui ) = ui −(ui w)w2 and it represents the error committed when the vector ui is replaced by its projection on w. Theorem 13.1. If r(ui ) is the residual of the vector ui of a data matrix W relative to the vector w with w = 1, then r(ui ) = ui 2 − (ui w)2 . Proof.
We have
r(ui ) = ui − (ui w)w2 = (ui − (ui w)w) (ui − (ui w)w) = (ui − (ui w)w )(ui − (ui w)w) = ui ui − (ui w)w ui − ui (ui w)w + (ui w)w (ui w)w = ui 2 − 2(ui w)2 + (ui w)2 = ui 2 − (ui w)2 because w w = 1.
Definition 13.2. The mean square error MSE(W, w) of the projections of the experiments u1 , . . . , um of the data matrix W ∈ Rm×n on the unit vector w ∈ Rn is the sum of the residuals m
MSE(W, w) =
1 r(ui ). m i=1
The average of the projections of the experiment vectors on the unit vector w is m
uw =
1 ˜ w, ui w = u m i=1
˜ is the sample mean defined by Equality (11.1). where u The variance of the projections of the experiment vectors on w is m
1 (ui w − uw )2 . V (W, w) = m i=1
Dimensionality Reduction Techniques
767
˜ = 0n and, therefore, we Note that if W is centered, we have u have uw = 0. Theorem 13.2. We have the following equalities: m
V (W, w) =
1 2 (ui w) − u2w , m i=1
and m
1 ui 2 − u2w − V (W, w). MSE(W, w) = m i=1
Proof. The first equality is well known from elementary statistics. For the second equality, we can write m
MSE(W, w) =
1 r(ui ) m i=1 m
1 ui 2 − (ui w)2 = m i=1 m
m
i=1
i=1
1 2 1 ui 2 − (ui w) = m m =
1 m
m i=1
ui 2 − u2w − V (W, w).
Corollary 13.1. If W is a centered data matrix, then minimizing MSE(W, w) amounts to maximizing the variance of the projections of the vectors of the experiments. Proof. Since W is centered, we have uw = 0. Therefore, the equality involving MSE(W, w) from Theorem 13.2 becomes m
MSE(W, w) =
1 ui 2 − V (W, w). m i=1
The first term does not depend on w. Therefore, to minimize the mean square error, we need to maximize the variance of the projec tions of the vectors of the experiments.
768
Linear Algebra Tools for Data Mining (Second Edition)
If W is a centered data matrix, we have uw = 0 and the variance of the data matrix reduces to m 1 2 (ui w) . V (W, w) = m i=1
This expression can be transformed as V (W, w) =
1 1 (W w) (W w) = w W W w = w Zw, m m
1 W W . where Z = m We need to choose the unit vector w to maximize V (W, w). In other words, we need to maximize V (W, w) subjected to the restriction w w − 1 = 0. This can be resolved using a Lagrange multiplier λ to optimize the function
L(w, λ) =
1 w W W w − λ(w w − 1). m
Since ∂L = w w − 1, ∂λ ∂L = 2Z w − 2λv, ∂w which implies w w = 1 and Zw = λw. The last equality amounts to 1 W W w = λw, m which means that w must be an eigenvector of the covariance matrix cov(W ). This is an n × n symmetric matrix, so its eigenvectors are mutually orthogonal and all its eigenvalues are non-negative. These eigenvectors are the principal components of the data. Definition 13.3. Let ⎛
⎞ u1 ⎜ . ⎟ m×n ⎟ W =⎜ ⎝ .. ⎠ = (v 1 , . . . , v n ) ∈ R um
ˆ = Hm W be the corresponding be a data sample matrix and let W centered data matrix.
Dimensionality Reduction Techniques
769
The principal directions of W are the eigenvectors of the covariance matrix 1 1 ˆˆ 1 WW = W Hm W Hm W ∈ Rn×n. cov(W ) = Hm W m−1 m−1 m−1 The principal components of W are the eigenvectors of the matrix ˆW ˆ . W Note that the covariance matrix cov(W ) is a scalar multiple of the ˆ of the columns v ˆ W ˆ1, . . . , v ˆ n of the centered data Gram matrix W ˆ matrix W . If R ∈ Rn×n is the orthogonal matrix that diagonalizes cov(W ), then the principal directions of W are the columns of R because R cov(W )R = D, or equivalently, cov(W )R = RD. Without loss of generality, we assume in this section that D = diag(d1 , d2 , . . . , dn ) and that d1 d2 · · · dn . The first eigenvector of cov(W ) (which corresponds to d1 ) is the first principal direction of the data matrix W ; in general, the kth eigenvector r k is called the kth principal direction of W . There exists an immediate link between PCA and the SVD decomˆ . Namely, if W ˆ s = σr and position of a centered data matrix W ˆ and s is the correˆ r = σs, then r is a principal component of W W ˆ. sponding principal direction of W If ⎛ ⎞ ˆ1 u ⎜ . ⎟ ˆ = ⎜ . ⎟, W ⎝ . ⎠ ˆ m u we have ˆ m r = sm , ˆ 1 r = σs1 , . . . , u u which shows that the principal components are the projections of the centered data points on the principal directions. As we saw in Theorem 11.3, the sum of the elements of D’s main diagonal equals the total variance tvar(W ). The principal directions “explain” the sources of the total variance: sample vectors grouped around r 1 explain the largest portion of the variance; sample vectors grouped around r2 explain the second largest portion of the variance, etc.
770
Linear Algebra Tools for Data Mining (Second Edition)
Let Q ∈ Rn× be a matrix having orthogonal columns. Starting from a sample matrix X ∈ Rm×n , we can construct a new sample matrix W ∈ Rm× having variables. Each experiment Ei is represented now by a row wi that is linked by ui by the equality wi = ui Q. This means that the component (w i )k that corresponds to the new variable Wk is obtained as n (ui )p qpk , (wi )k = p=1
a linear combination of the values that correspond to the previous variables. Theorem 13.3. Let W ∈ Rm×n be a centered sample matrix and let R ∈ Rn×n be an orthogonal matrix such that R cov(W )R = D, where D ∈ Rn×n is a diagonal matrix D = diag(d1 , . . . , dn ) and d1 · · · dn . Let Q ∈ Rn× be a matrix having orthogonal columns and let X = W Q ∈ Rm× . Then, trace(cov(X)) is maximized when Q consists of the first columns of R and is minimized when Q consists of the last columns of R. Proof. This result follows from Ky Fan’s Theorem (Theorem 7.14) applied to the symmetric covariance matrix of the transformed dataset. ˆ = U DV be the thin SVD of the centered data matrix Let W m×n ˆ , where U ∈ Rm×r and V ∈ Rr×n are matrices having W ∈ R orthogonal columns and ⎞ ⎛ σ1 0 · · · 0 ⎜ 0 σ ··· 0 ⎟ 2 ⎟ ⎜ ⎟ D=⎜ .. ⎟ , ⎜ .. .. ⎝ . . ··· . ⎠ 0 0 · · · σr where σ1 · · · σr > 0 are the singular values of A. For the covariance matrix of cov(W ), we have 1 1 ˆˆ WW = V D U U DV cov(W ) = m−1 m−1 1 1 V D DV = V D2 V , = m−1 m−1
Dimensionality Reduction Techniques
771
due to the orthogonality of the columns of U . As we saw before, the columns of V are the eigenvectors of cov(W ). The matrix V is known as the matrix of loadings. The matrix S = U D ∈ Rm×r is known as the matrix of scores. It ˆ. ˆ W is clear that σ12 , . . . , σr2 coincide with the eigenvalues of W ˆ = SV , where S is the scores matrix and V is Observe that W the loadings matrix. Since the columns of V are orthogonal, we also ˆ V. have S = W ˆ can be written as The SVD of W ˆ = W
r
σi ui v i .
i=1
ˆ v i = σ 2 v i . Since u W ˆ = σi v , it follows that v ˆ W This implies W i i i i ˆ . Similarly, ui are is a weighted sum of the rows of the matrix W ˆ. weighted sums of the columns of W As observed in [61], if ˆ = (u1 · · · ur )(σ1 v 1 . . . σr v r ) , W then Ir = (u1 · · · ur ) (u1 · · · ur ), ˆ = (u1 · · · ur )(u1 · · · ur ) . ˆ X (n − 1)X Example 13.1. We use the FAO dataset introduced in Example 12.1 showing the protein and fat consumption for 37 European countries. The sample matrix X ∈ R37×2 is obtained from the second and third columns of this table that correspond to the variables prot and fat. The vector of the sample variances of the two columns is s = (15.5213 28.9541). Since the magnitudes of the sample variances are substantial and quite distinct, we normalize the data by dividing the columns of X by their respective sample variances. The normalization is done by using the function zscore; namely, zscore(X) returns a centered and scaled version of X having the same format as X such that the columns of the result have sample mean 0 and sample variance 1.
Linear Algebra Tools for Data Mining (Second Edition)
772
The loading matrix or the coefficient matrix is given by
0.7071 −0.7071 . 0.7071 0.7071
Both coefficients in the first column (which represents the first principal component) are equal and positive, which means that the first principal component is a weighted average of the two variables. The second principal component corresponds to a weighted difference of the original variables. The coordinates of the data in the new coordinate system is defined by the matrix scores. These scores have been plotted in Figure 13.1. Principal component analysis in MATLAB is done using the function pca of the statistics toolbox. There are several signatures of this function which we review next. The statement coeff = pca(A) performs principal components analysis (PCA) on the matrix A ∈ Rm×n , and returns the principal component coefficients, also known as loadings. Rows of A correspond to observations, and columns to variables. The columns of the matrix coeff (an n × n matrix) contain coefficients for one principal 1.5
1
SK MK
BE
HR
0.5 second pc
CH HU
YU
AT ES IT FR
BG
0
LU GR MD
−0.5
RU
GE BA
AL
−1
−1.5 −3
Fig. 13.1
−2
−1
RO LT
0 first pc
IS MT
1
2
3
The first two principal components of the FAO dataset.
Dimensionality Reduction Techniques
773
component and these columns are in order of decreasing component variance. The function pca computes the principal components of a sample matrix X. The are several incarnations of the function pca, described as follows. (i) [coeff,score] = pca(X) returns the matrix score, the principal component scores, that is, the representation of X in the principal component space. The rows of score correspond to observations, columns to components. (ii) [coeff,score,latent] = pca(X) returns the vector latent which contains the eigenvalues of the covariance matrix of X. The matrix score contains the data formed by transforming the original data into the space of the principal components. The values of the vector latent are the variance of the columns of score. The function pca centers X by subtracting off column variance means, but does not rescale the columns of X. To perform principal components analysis with standardized variables, we need to use pca(zscore(X)). Example 13.2. The dataset that we are about to analyze originates in a study of the health condition of Boston neighborhoods [28] produced by the Health Department of the City of Boston. The data include incidence of various diseases and health events that occur in the 16 neighborhoods of the city identified as Neighborhood Allston/Brighton Back Bay Charlestown East Boston Fenway Hyde Park Jamaica Plain Mattapan
Code AB BB CH EB FW HP JP MT
Neighborhood North Dorchester North End Roslindale Roxbury South Boston South End South Dorchester West Roxbury
Code ND NE RO RX SB SE SD WR
This is entered in MATLAB as neighborhoods = [’AB’;’BB’;’CH’;’EB’;’FW’;’HP’;’JP’;... ’MT’;’ND’;’NE’;’RS’;’RX’;’SB’;’SD’;’SE’;’WR’]
774
Linear Algebra Tools for Data Mining (Second Edition)
The diseases and the health conditions are listed as the vector categories: Category Hepatitis B Hepatitis C HIV/AIDS Chlamydia Syphilis Gonorrhea
Code HepB HepC HIVA CHLA SYPH GONO
Category Tuberculosis Live Births Low weight at birth Infant Mortality Children with Elevated Lead Subst. Abuse Treat. Admissions
Code TBCD B154 LBWE INFM CELL SATA
This is entered in MATLAB as categories = [’HepB’;’HepC’;’HIVA’;’CHLA’;’SYPH’;’GONO’;... ’TBCD’;’B154’;’LBWE’;’INFM’;’CELL’;’SATA’]
The data itself are contained and have the form 38 57 13 168 5 22 17 24 16 179 11 52 10 13 0 46 0 8 12 46 11 150 10 16 18 19 8 163 9 44 11 18 8 179 9 25 6 32 18 213 10 46 15 24 11 264 14 56 42 76 22 611 15 135 0 0 0 0 0 0 13 22 0 115 6 31 21 50 17 477 8 72 9 52 0 85 7 25 68 78 23 760 24 176 51 35 35 124 31 61 9 10 0 17 0 0
in the matrix diseaseinc in R16×12 20 8 0 17 7 9 6 8 29 0 0 8 5 24 11 0
607 306 284 718 125 487 420 285 1350 89 488 829 403 656 439 419
40 25 25 43 10 46 35 26 168 5 39 87 25 67 34 34
7 0 0 10 0 12 5 7 28 0 6 27 0 9 0 0
13 0 0 43 0 19 10 25 88 0 28 30 18 63 0 11
624 497 489 1009 272 2781 1071 390 1492 130 330 2075 1335 1464 6064 179
The array containing the sample variances of columns is computed by applying the function std: stdinc = std(diseaseinc)
Next, by using the function repmat as in si = diseaseinc./repmat(stdinc,16,1)
Dimensionality Reduction Techniques
775
we create a 16 × 12 matrix consisting of 16 copies of stdinc and compute the normalized matrix si that is subjected to PCA in [loadings,scores,variances]=pca(si)
This is one of several formats of the function pca. This function is applied to a data matrix and it centers the matrix by subtracting off column means. In the format that we use here, the function returns the matrices loadings, scores, and variances that contain the following data: (i) The columns of the matrix loadings contain the principal components. The entries of this matrix are, of course, known as loadings. In our case, loadings is a 12 × 12 matrix, where each column represents one principal component. The columns are in order of decreasing component variance. We reproduce the first three columns of this matrix as follows: 0.2914 0.3207 0.2666 0.3267 0.2426 0.3209 0.3215 0.3055 0.3061 0.2655 0.3026 0.1423
0.2732 −0.0568 0.3848 −0.0800 0.4650 0.0668 −0.0161 −0.2594 −0.2659 −0.3215 −0.2903 0.4702
−0.2641 −0.1593 0.1427 −0.2671 −0.0270 −0.3463 −0.1444 0.3184 0.2735 0.3832 −0.0815 0.5848
(ii) The matrix scores in R16×12 contain the principal component scores, that is, the representation of si in the principal component space. Rows of scores correspond to neighborhoods and columns, to components. (iii) The matrix variances contain the principal component variances, that is, the eigenvalues of the covariance matrix of si. The first two columns of the matrix scores contain the projections of data on the first two principal components. This is done by running plot(scores(:,1),scores(:,2),’*’)
Linear Algebra Tools for Data Mining (Second Edition)
776 5
SE
Second Principal Component
4 3 2 1
SD
BB
0
AB
NE RS
−1
EB RX
−2 −3 −4
Fig. 13.2
ND
−2
0 2 4 First Principal Component
6
8
Projections on the first two principal components.
After the plot is created, labels can be added to the axes using xlabel(’First Principal Component’) ylabel(’Second Principal Component’)
The resulting plot is shown in Figure 13.2. The neighborhood codes are applied to this plot by running gname(neighborhoods). An inspection of the figure shows that the health issues are different for neighborhoods like South End (SE), South Dorchester (SD), and North Dorchester (ND). The matrix variances allows us to examine the percentage of the total variability explained by each principal component. Initially, we compute the matrix percent_explained as percent_explained=100*variances/sum(variances)
and using the function pareto we write pareto(percent_explained) xlabel(’Principal Component’) ylabel(’Variance Explained’)
This code produces the histogram as shown in Figure 13.3.
Variance Explained
Dimensionality Reduction Techniques
777
90
90%
80
80%
70
70%
60
60%
50
50%
40
40%
30
30%
20
20%
10
10%
0
Fig. 13.3
1
2
3 Principal Component
4
5
0%
Percentage of variability explained by principal components.
To visualize the results, one can use the biplot function as in biplot(loadings(:,1:2),’scores’,scores(:,1:2),’varlabels’, categories)
resulting in Figure 13.4. Each of the 12 variables is represented by a vector in this figure. Since the first principal component has positive coefficients, all vectors are located in the right half-plane. On the other hand, the signs of the coefficients of the second principal component are varying. These components distinguish between neighborhoods where there is a high incidence of substance abuse treatment admissions (SATA), syphilis (SYPH), HIV/Aids (HIVA), Hepatitis B (HepB), and Gonorrhea (GONO) and low incidence of the others and neighborhoods where the opposite situation occurs. As observed in [84], the conclusions of a PCA analysis of data are mainly qualitative. The numerical precision (4 decimal digits) is not especially relevant for the PCA. Next, we present a geometric point of view of principal component analysis.
Linear Algebra Tools for Data Mining (Second Edition)
778 0.5
SATA
0.4
SYPH HIVA
0.3
HepB
Component 2
0.2 0.1
GONO
0
TBCD HepC
−0.1
CHLA
−0.2
B154 LBWE
−0.3
CELL INFM
−0.4 −0.5 −0.5
−0.4
−0.3
Fig. 13.4
−0.2
−0.1
0 0.1 Component 1
0.2
0.3
0.4
0.5
Representation of the 12 variables.
Let t ∈ Rn be a unit vector. The projection of a vector w ∈ Rn on the subspace t generated by t is given by projt (w) = tt w. To simplify the notation, we shall write projt instead of projt . Let ˆ ∈ Rm×n be a centered sample matrix that corresponds to a W sequence of experiments (u1 , . . . , um ), that is ⎞ u1 ⎜ ⎟ ˆ = ⎜ .. ⎟ . W ⎝ . ⎠ um ⎛
ˆ )) on the subspace genWe seek to evaluate the inertia I0 (projt (W n ˆ erated by the unit vector t ∈ R . Since W = (u1 · · · um ), by the
Dimensionality Reduction Techniques
779
definition of inertia, we have ˆ )) = I0 (projt (W
m j=1
=
m
tt uj 22 uj tt tt uj
j=1
=
m
uj tt uj
j=1
(because t t = 1) =
m
t uj uj t
j=1
(because both uj t and t uj are scalars) = t X Xt. The necessary condition for the existence of extreme values of this inertia as a function of t is
ˆ W ˆ t + λ(1 − t t) ˆ )) + λ(1 − t t) = grad t W grad I0 (projt (W ˆ u − 2λt = 0, ˆ W = 2W ˆ t = λt. In other ˆ W where λ is a Lagrange multiplier. This implies W ˆ )), t must words, to achieve extreme values of the inertia I0 (projt (W ˆ , that is, be chosen as an eigenvector of the covariance matrix of W ˆ as a principal direction of W . ˆ ∈ Rm×n can The principal directions of a data sample matrix W be obtained directly from the data sample matrix W by applying Corollary 8.16, a consequence of the Courant–Fisher theorem. ˆ are the numbers λ1 · · · ˆ W Suppose that the eigenvalues of W λn . The first principal direction t1 of W , which corresponds to the ˆ , is ˆ W largest eigenvalue of W ˆW ˆ t | t ∈ Rn , t2 = 1 t1 = arg max t W t ˆ t2 | t2 = 1 . = arg max W 2 t
780
Linear Algebra Tools for Data Mining (Second Edition)
ˆ. Suppose that we computed the principal directions t1 , . . . , tk of W n Then, by Corollary 8.16, tk+1 ∈ R is a unit vector t that maximizes ˆ t2 ˆW ˆ t = W t W 2 and belongs to the subspace orthogonal to the subspace generated ˆ , that is, by the first k principal directions of W ˆ t22 | t ∈ Rn , t2 = 1, t ∈ t1 , . . . , tk ⊥ . tk+1 = arg max W t
Note that for every vector z ∈ Rn , we have ⎞ ⎛ k ⎝I − tj tj ⎠ z = z − projt1 ,...,tk z ∈ t1 , . . . , tk ⊥ . j=1
Therefore, x ∈ t1 , . . . , tk ⊥ is equivalent to x = (I − kj=1 tj tj )x. Thus, we can write ⎫ ⎧ ⎛ ⎞ k ⎬ ⎨ ˆ ⎝ ⎠ tj tj t | t2 = 1 , tk+1 = arg max W I − t ⎩ ⎭ j=1
2
for 0 k n − 1. This technique allows finding the principal direcˆ by solving a sequence of optimization problems involving tions of W ˆ. the matrix W Next, we discuss an algorithm known as the Nonlinear Iterative Partial Least Squares Algorithm (NIPALS), which can be used for computing the principal components of a centered data matrix (Algorithm 13.2.1). The algorithm computes a sequence of matrices ˆ and needs a parameX1 , X2 , . . . beginning with the matrix X1 = W ter to impose an iteration limitation. ˆ ), we found the principal directions of W . In this If c = rank(W ˆ can be written as W ˆ = T V , where T = (t1 · · · tc ) and case, W V = (v 1 · · · v c ). Let j = Xj t2 . When the algorithm completes the repeat loop, we have ˘t = Xj v j is approximatively equal to tj , so Xj Xj v j = j v j , which implies that j is close to an eigenvalue and v j is close to an eigenvector of Xj Xj . Also, we have t t = v j Xj Xj v j = v j (Xj Xj v j ) = j v j v j = j , since v j is a unit vector.
Dimensionality Reduction Techniques
781
Algorithm 13.2.1: The NIPALS Algorithm ˆ ∈ Rm×n Data: A centered sample matrix W ˆ , where Result: The first c principal directions of W 1 c rank(A) ˆ 1 X1 = W ; 2 for j = 1 to c do 3 choose tj as any column of Xj ; 4 repeat 5 6 7 8 9
X t
v j = X jt2 ; j ˘t = Xj v j until ˘t − tj 2 < ; Xj+1 = Xj − tj v j ; end ˆ = X1 = t1 v + X2 . We have After the first run, we have W 1 ˆ v 1 − v1 λ1 = 0. ˆ t1 − v 1 t t1 = W ˆ W ˆ − t1 v ) t1 = W (W 1 1
Since t2 is initially a column of X2 , t2 it is orthogonal to t1 and remains so to the end of the loop. ˆ = t1 v + After the second run through the for loop, we have W 1 t2 v2 + X3 , etc. When the loop is completed, we have ˆ = t1 v + t2 v + · · · + tc vc + Xc+1 . W 1 2 ˆ ), then Xc+1 = O. If c = rank(W 13.3
Linear Discriminant Analysis
Linear discriminant analysis (LDA) is a supervised dimensionreduction technique. If a dataset contains data that belong to k distinct classes, LDA aims to find a low-dimensional subspace such that the projections of the data vectors on this subspace are well-separated clusters. Let (u1 , . . . , um ) ∈ Seq(Rn ) be a sequence of vectors that belong to two classes C1 and C2 . Denote by mi the mean of the vectors that
782
Linear Algebra Tools for Data Mining (Second Edition)
belong to the class Ci for i = 1, 2 and by m the global mean m |C1 | |C2 | 1 ui = m1 + m2 . m= m m m i=1
We seek a unit vector w such that the projections on the subspace w of the vectors that belong to the two classes are as separated as possible. The function φ : Rn −→ R defined by φ(u) = w u is a discriminant function and the set of projections consists of {φ(ui ) = w ui | 1 i m}. The means of the projections on w are 1 {w u ∈ Ci } = w mi mi,w = |Ci | for i = 1, 2. The class scatter matrix for Ci is the matrix Si ∈ Rn×n defined by {(u − mi )(u − mi ) | u ∈ Ci } Si = for i = 1, 2. The intra-class scatter matrix is Sintra = S1 + S2 . The scatter of Ci relative to w is defined as the number s2i,w = w Sintra w and the total intra-scatter relative to w is s2w = s21,w + s22,w = w Sintra w.
(13.1)
Note that the matrix Sintra is symmetric and positive semidefinite. Let Sinter be the inter-class scatter matrix defined by Sinter = |C1 |(m1 − m)(m1 − m) + |C2 |(m2 − m)(m2 − m) . By substituting the value of m in the definition of Sinter , we obtain |C1 ||C2 | (13.2) (m1 − m2 )(m1 − m2 ) . m Thus, the inter-class scatter matrix is a matrix of rank at most 1 (when m1 = m2 ). Sinter =
Lemma 13.1. The separation between the means of the classes is given by |m1,w − m2,w |2 = w Sinter w.
(13.3)
Dimensionality Reduction Techniques
Proof.
783
By applying the definitions of m1,w and m2,w , we have |m1,w − m2,w |2 = (w m1 − w m2 )2 = w (m1 − m2 )(m1 − m2 ) w m = w Sinter w. |C1 | |C2 |
Definition 13.4. The Fisher linear discriminant is the generalized Rayleigh–Ritz quotient of the inter-class scattering matrix and the intra-class scattering matrix: w Sinter w F(w) = w Sintra w for w ∈ Rn − {0n }. Equalities (13.1) and (13.1) imply the equality |C1 | |C2 | |m1,w − m2,w |2 . (13.4) m s2w To obtain well-separated projections of the vectors of the classes, we need to find w that maximizes Fisher’s linear discriminant. As observed in Section 8.9, the largest value of the generalized Rayleigh– Ritz quotient (and, therefore the largest value of the Fisher linear discriminant) is obtained when w is a generalized eigenvector that corresponds to the largest generalized eigenvalue of the matrix pencil (Sinter , Sintra ). When the vectors of the sequence (u1 , . . . , um ) ∈ Seq(Rn ) belong to several classes C1 , . . . , Ck , we need to consider the projection on a (k − 1)-dimensional subspace. If mi = |C1i | {u | u ∈ Ci }, then, as before, the intra-class scatter matrix is Sintra = ki=1 Si , where {(u − mi ) (u − mi ) | u ∈ Ci } ∈ Rn×n Si = F(w) =
for 1 i k. The inter-class scatter matrix is now given by Sinter =
k
|Ci |(mi − m)(mi − m) ,
j=1
where m is the global mean of the sequence of vectors. Thus, Sinter is the sum of k matrices of rank 1. Since ki=1 (mi − m) = 0, the actual rank of Sinter is k − 1.
784
Linear Algebra Tools for Data Mining (Second Edition)
If Sintra is a non-singular matrix, then the eigenvalues of the pencil −1 (Sinter , Sintra ) coincide with the eigenvalues of the matrix Sintra Sinter . n×n be the projection matrix on the desired subspace. Let P ∈ R The projection of the mean of the class Ci is ˘i= m
1 {P u | u ∈ Ci } = P mi |Ci |
and the projection of the global mean is ˘ = m
|Ci | ˜ i = P m. m k i=1 |Ci |
Thus, the intra-scatter matrix of the projections of class Ci is {(P u − P mi )(P u − P mi ) | u ∈ Ci } S˘i = =P {(u − mi )(u − mi ) P | u ∈ Ci } = P Si P , which implies that the intra-scatter matrix of the projections is S˘intra = P Sintra P . Similarly, the inter-scatter matrix of the projections is S˘inter = P Sinter P . The projection matrix P is chosen in this case such that −1 ˘ Sinter ) is maximal. trace(S˘intra An interesting connection between LDA and the least square regression is shown in [174]. 13.4
Latent Semantic Indexing
Latent semantic indexing (LSI) is an information-retrieval technique presented in [34, 40] based on SVD. The central problem in information retrieval (IR) is the computation of sets of documents that contain terms specified by queries submitted by users of collections of documents. The problem is quite challenging because any such retrieval needs to take into account that a concept can be expressed by many equivalent words (synonimy) and
Dimensionality Reduction Techniques
785
the same word may mean different things in various contexts (polysemy). This can lead the retrieval technique to return documents that are irrelevant to the query (false positive) or to omit documents that may be relevant (false negatives). Several models of information retrieval exist. We refer the interested reader to [6] for a comprehensive presentation. We are discussing here the vector model and examine an application of SVD to this model. The next definition establishes the framework for discussing the LSI. Definition 13.5. A corpus is a pair K = (T, D), where T = {t1 , . . . , tm } is a finite set whose elements are referred to as terms, and D = (D1 , . . . , Dn ) is a set of documents. Each document Di is a finite sequence of terms, Dj = (tj1 , . . . , tjk , . . . , tjj ). Let K be a corpus such that |T | = m terms and D contains n documents. If ti is a term and Dj is a document of K, the frequency of ti in Dj is the number of occurrences of ti in dj , that is, aij = |{p | tjp = ti }|. The frequency matrix of the corpus K is the matrix A ∈ Rm×n defined by A = (aij ). Each term ti generates a row vector (ai1 , ai2 , . . . , ain ) referred to as a term vector and each document dj generates a column vector ⎛ ⎞ a1j ⎜ . ⎟ ⎟ dj = ⎜ ⎝ .. ⎠ . amj A query is a sequence of terms q ∈ Seq(T ) and it is also represented as a vector ⎛ ⎞ q1 ⎜ . ⎟ ⎟ q=⎜ ⎝ .. ⎠ , qm where qi = 1 if the term ti occurs in q, and 0 otherwise.
786
Linear Algebra Tools for Data Mining (Second Edition)
When a query q is applied to a corpus K, an IR system that is based on the vector model computes the similarity between the query and the documents of the corpus by evaluating the cosine of the angle between the query vector q and the vectors of the documents of the corpus. For the angle αj between q and dj , we have cos αj =
(q, dj ) . q2 dj 2
The IR system returns those documents Dj for which this angle is small, that is cos αj t, where t is a parameter provided by the user. The LSI method aims to capture relationships between documents motivated by the underlying structure of the documents. This structure is obscured by synonimy, polysemy, the use of insignificant syntactic-sugar words, and plain noise, which is caused my misspelled words or counting errors. By Theorem 9.8, if A ∈ Rm×n is the matrix of a corpus K, A = U DV H is an SVD of A, and rank(A) = p, then the first p columns of U form an orthonormal basis for range(A), the subspace generated by the vector documents of K; and the last n−p columns of V constitute an orthonormal basis for null(A). Also, by Corollary 9.6, the first p transposed columns of V form an orthonormal basis for the subspace of Rn generated by the term vectors of K. Example 13.3. Consider a tiny corpus K = (T, D), where T = {t1 , . . . , t5 } and D = {D1 , D2 , D3 }. Suppose that the matrix of the corpus is t1 t A= 2 t3 t4 t5
D1 1 0 1 1 0
D2 0 1 1 1 0
D3 0 0 1 0 1
A matrix of this type may occur when documents D1 and D2 are fairly similar (they contain two common terms, t3 and t4 ) and when t1 and t2 are synonyms. A query that seeks documents that contain t1 returns the singledocument D1 . However, since t1 and t2 are synonyms, it would be
Dimensionality Reduction Techniques
787
desirable to have both D1 and D2 returned. Of course, the matrix representation cannot directly account for the equivalence of the terms t1 and t2 . The fact that t1 and t2 are synonymous is consistent with the fact that both these terms appear in the common context {t3 , t4 } in D1 and D2 . The successive approximations of A are B(1) = σ1 ∗ u1 ∗ v H1 ⎛ 0.4319 0.4319 ⎜0.4319 0.4319 ⎜ ⎜ =⎜ ⎜1.1063 1.1063 ⎜0.8638 0.8638 ⎝ 0.2425 0.2425
⎞ 0.2425 0.2425⎟ ⎟ ⎟ 0.6213⎟ ⎟, 0.4851⎟ ⎠ 0.1362
B(2) = σ1 ∗ u1 ∗ v H1 + σ2 ∗ u2 ∗ v H2 ⎛ ⎞ 0.5000 0.5000 0.0000 ⎜0.5000 0.5000 −0.0000⎟ ⎜ ⎟ ⎜ ⎟ 1.0000 1.0000 1.0000 =⎜ ⎟. ⎜ ⎟ ⎝1.0000 1.0000 0.0000 ⎠ 0.0000 −0.0000
1.0000
Another property of SVDs that is useful for LSI was shown in Theorem 9.7, namely that the rank-1 matrices ui vHi are pairwise orthogonal, and their Frobenius norms are all equal to 1. If we regard the noise as distributed with relative uniformity with respect to the p orthogonal components of the SVD, then by omitting several such components that correspond to relatively small singular values, we eliminate a substantial part of the noise and we obtain a matrix that better reflects the underlying hidden structure of the corpus. Example 13.4. Suppose that we apply to the miniature corpus described in Example 13.3 a query whose vector is ⎛ ⎞ 1 ⎜0⎟ ⎜ ⎟ ⎜ ⎟ q = ⎜0⎟ . ⎜ ⎟ ⎝1⎠ 0
Linear Algebra Tools for Data Mining (Second Edition)
788
The similarity between q and the document vectors di , 1 i 3, that constitute the columns of A is cos(q, d1 ) = 0.8165, cos(q, d2 ) = 0.482, and cos(q, d3 ) = 0, suggesting that d1 is by far the most relevant document for q. However, if we compute the same value of cosine for q and the columns b1 , b2 , and b3 of the matrix B(2), we have cos(q, b1 ) = cos(q, b2 ) = 0.6708, and cos(q, b3 ) = 0. This approximation of A uncovers the hidden similarity of d1 and d2 , a fact that is quite apparent from the structure of the matrix B(2). 13.5
Recommender Systems and SVD
Recommender systems use data analysis techniques to help customers find products they would like to buy, visit websites they would be interested to see, choose entertainment that is appropriate for their tastes, etc. Frequently, these systems are based on collaborative filtering, a process that involves collaboration among multiple agents having different viewpoints. Collaborative filtering requires collecting data concerning the interests or tests from many users and is based on the assumption that users who made similar choices in the past will agree in the future. Thus, collaborative recommender systems recommend items to users based on ratings of these items awarded by other users. Formally, we can regard a recommender system as a bipartite weighted graph G = (B ∪ T, E, r), where {B, T } is a partition of the set of vertices of the graph. The sets B = {b1 , . . . , bm } and T = {t1 , . . . , tn } are the set of buyers and the set of items. An edge (bi , tj ) ∈ B×T exists in E if buyer bi has rated item tj with the rating aij . In this case, r(bi , tj ) = aij . For large recommender systems, the sizes of the sets B and C can be of the order of millions, an aspect that raises issues of performance and scalability for RS designers. SVDs can be used for recommender systems in the same way that they are used for latent semantic indexing (see Section 13.4), that is, by using lower-rank approximations of the recommender system matrix, that are likely to filter out the noise effect of smaller singular value components.
Dimensionality Reduction Techniques
789
Generally, each buyer rates a rather small number of items. Thus, the set of ratings is sparse. To remedy this sparsity, one replaces each missing rating of an item by the average of the existing ratings for that item as proposed in [142]. We obtain a matrix A ∈ Rm×n . The vector r ∈ Rm of average ratings for each buyer is r = n1 A1n and serves the normalization of A. This normalization process consists in subtracting the average rating of each buyer from each rating generated by the buyer. The normalized matrix N is given by N = A − diag(r1 , . . . , rm )Jm,n . The SVD decomposition of the resulting matrix is computed and a low rank approximation of the rating matrix is considered (Algorithm 13.5.1). The new rating matrix Aˆ is obtained by adding the average ratings of customers to the rank k approximation of the normalized rating matrix, that is Aˆ = diag(r1 , · · · , rm )Jm,m + U (:, 1 : k)D(1 : k, 1 : k)V (:, 1 : k)) . Algorithm 13.5.1: Algorithm for Computing a Ratings Matrix Data: The bipartite graph G of the recommender system Result: A matrix of ratings 1 compute vector of average ratings r for items; 2 construct the matrix A by adopting average ratings of items for missing ratings; 3 compute the normalized matrix N by subtracting the average ratings of buyers from each rating in A; 4 compute the SVD, (U, D, V ) of the normalized matrix N ; 5 adopt a rank k; 6 compute a new ratings matrix as diag(r1 , . . . , rm )Jm,m +U (:, 1 : k)D(1 : k, 1 : k)V (:, 1 : k))
Example 13.5. Suppose we have a collaborative system having the graph shown in Figure 13.5. The average ratings of the items are as
Linear Algebra Tools for Data Mining (Second Edition)
790
b1 b2 b3 b4 Fig. 13.5
s @ @
s t1
2
2
3 s @ 4 3 t2 @ s 4@ @ 1 s @s b " t3 b 2 b 4"" b" " b " 3 b " b t4
Bipartite graph of a recommender system.
follows: Item t1 t2 t3 t4 Average rating 2 3 3.25 2.5 Thus, the rating matrix is ⎛
2 ⎜2 ⎜ A=⎜ ⎝2 2
2 3 3 3
⎞ 4 2.5 4 2.5⎟ ⎟ ⎟. 1 2⎠ 4 3
The average ratings are given by r = 14 A14 and are obtained in MATLAB using the expression 0.25*A*ones(4,1). We have the vector of the averages 2.6250 2.8750 2.0000 3.0000
The normalized matrix N is -0.6250 -0.8750 0 -1.0000
-0.6250 0.1250 1.0000 0
1.3750 1.1250 -1.0000 1.0000
-0.1250 -0.3750 0 0
Dimensionality Reduction Techniques
The SVD decomposition [U,D,V]=svd(N):
N
=
U DV
791
is
computed
by
U = -0.5962 -0.4995 0.4036 -0.4818
0.2111 -0.4647 -0.7549 -0.4118
0.0004 0.6840 -0.0232 -0.7291
-0.7746 0.2582 -0.5164 0.2582
2.7177 0 0 0
0 1.1825 0 0
0 0 0.3013 0
0 0 0 0.0000
0.4752 0.2626 -0.8342 0.0963
0.5805 -0.7991 0.0936 0.1250
0.4326 0.2060 0.2129 -0.8515
0.5000 0.5000 0.5000 0.5000
D =
V =
The approximation√B(2) of rank A is computed as the product √ 2 of of the matrices U2 D2 and D2 V2 , where U2 consists of the first two columns of U , V2 consists of the first two column of V, and D2 is the matrix 2.7177 0 , D2 = 0 1.1825 which gives
13.6
D2 =
1.6485 0 . 0 1.0874
Metric Multidimensional Scaling
Multidimensional scaling (MDS) is a process that allows us to represent a dissimilarity space using a low-dimensional Euclidean space. Scaling is important for visualizing the result of data explorations. Two basic types of scaling algorithms exist. The metric multidimensional scaling starts with a finite dissimilarity space and produces a set of vectors that optimizes a function known as strain. The non-metric multidimensional scaling seeks a monotonic relationship
792
Linear Algebra Tools for Data Mining (Second Edition)
between the values of a finite dissimilarity and the distances between the vectors that represent the elements of the dissimilarity space. We discuss only the metric multidimensional scaling. Let (x1 , . . . , xm ) be a sequence of m vectors in Rn . The corresponding matrix is X = (x1 , . . . , xm ) ∈ Rn×m . Note that X is the transpose of the sample data matrix previously considered (which had x1 , . . . , xm as its rows). Given the matrix of Euclidean distances D = (d2ij ) ∈ Rm×m , where d2ij = xi − xj 22 = (xi − xj ) (xi − xj ) for 1 i, j m, we need to retrieve the vectors x1 , . . . , xm . Clearly, this problem does not have a unique solution because the matrix D is the same for (x1 , . . . , xm ) and for (x1 + c, . . . , xm + c) for every c ∈ Rn . Let G ∈ Rm×m be the Gram matrix of X, G = GX = X X. Since gpq = xp xq , we have d2ij = (xi − xj ) (xi − xj ) = gii + gjj − 2gij
(13.5)
for 1 i, j m. Suppose now that F ∈ Rm×m is the Gram matrix of another sequence of vectors (y 1 , . . . , y m ), that is, F = GY = Y Y , where Y = (y 1 , . . . , y m ) ∈ Rn×m such that d2ij = fii + fjj − 2fij . Then gii + gjj − 2gij = fii + fjj − 2fij for 1 i, j ≤ m. Let W = G − F . Then W is a symmetric matrix and wii + wjj − 2wij = 0, so wij = 1 2 (wii + wjj ). Let ⎛ ⎞ w11 1⎜ . ⎟ . ⎟ w= ⎜ 2⎝ . ⎠ wmm and note that the matrix W can now be written as W = w1m +1m w , which proves that W is a special, rank-2 matrix. Consequently, G = F + w1m + 1m w .
(13.6)
Thus, the set of vectors that correspond to a distance matrix is not unique, and the Gram matrices of any two such sequences differ by a symmetric matrix of rank 2.
Dimensionality Reduction Techniques
793
It is possible to construct X starting from D if we assume that the centroid of the vectors of X is 0n , that is, if m i=1 xi = 0n . Let A ∈ Rm×m be the matrix defined by A = − 12 D. Elementwise, this means that aij = − 12 d2ij for 1 i, j m. Consider the averages defined by m
ai· =
1 aij , m j=1 m
1 a·j = aij , m i=1
m m 1 aij . a·· = 2 m i=1 j=1
The components of the Gram matrix G ∈ Cm×m , gij = xi xj for 1 i, j m, can be expressed using these averages, assuming that the set of columns of X is centered in 0n . Theorem 13.4. Let X = (x1 , . . . , xm ) ∈ Rn×m be a matrix such that m 1 2 i=1 xi = 0n and let A be the matrix defined by aij = − 2 xi − xj 2 for 1 i, j m. The components of the Gram matrix G, gij = xi xj are given by gij = aij − ai· − a·j + a·· for 1 i, j m. Proof.
By Equality (13.5), we have −2aij = gii + gjj − 2gij
for 1 i, j m. Note that m i=1
gij =
m j=1
gij = 0.
Linear Algebra Tools for Data Mining (Second Edition)
794
The averages introduced earlier can be written as m m 1 2 1 dij = (gii + gjj − 2gij ) −2a·j = m m i=1 i=1 m m 1 1 gii + gjj − 2 xi xj = m m i=1
= gjj + because
m
i=1 xi
1 m
i=1
m
gii ,
i=1
= 0n . Similarly, we have m 1 −2ai· = gii + gjj . m
(13.7)
j=1
Therefore, m m m 2 1 2 d = gii . ij n2 m i=1 j=1
i=1
m
m
j=1
j=1
(13.8)
Thus, we have gii = gjj
1 2 1 dij − gjj , m m
1 = m
m
d2ij
j=1
m
1 − gii . m i=1
Equality (13.5) yields
1 gij = − d2ij − gii − gjj 2 ⎛ ⎞ m m m m 1 1 1 1 1 d2ij + xj xj − d2ij + xi xi ⎠ = − ⎝d2ij − 2 m m m m ⎛
j=1
j=1
j=1
⎞
i=1
m m m m 1 2 1 2 1 2 ⎠ 1 dij − dij + 2 dij = − ⎝d2ij − 2 m m r j=1
j=1
i=1 j=1
= aij − ai· − a·j + a·· , which completes the proof.
Dimensionality Reduction Techniques
795
Corollary 13.2. The Gram matrix G = X X of the sequence of n R 1vectors X 2=
(x1 , . . . , xm ) can be obtained from the matrix A = − 2 xi − xj 2 as G = Hm AHm , where Hm is the centering matrix 1 H m = Im − m 1m 1m . Proof.
The matrix Hm AHm can be written as
Hm AHm
1 1 = Im − 1m 1m A Im − 1m 1m m m 1 1 A − A1m 1m = Im − 1m 1m m m
1 1 1 1m A − A1m 1 + 2 1m 1m A1m 1m . = A − 1m m m m
The terms of the above sum correspond to aij , a· j , ai · , and a· · , respectively. The desired conclusion then follows from Theorem 13.4.
The rank of the matrix G = X X ∈ Rm×m is equal to the rank of X, namely rank(G) = n, as we saw in Theorem 3.33. Since G is symmetric, positive semidefinite, and of rank n, it follows that G has n non-negative eigenvalues and m − n zero eigenvalues. By the Spectral Theorem for Hermitian matrices (Theorem 8.14), we have G = U DU , where U is an orthogonal matrix, U = (u1 · · · um ), D = (λ1 , . . . , λn , 0, . . . , 0), and λ1 · · · λn > 0. Taking into account that the last m−n elements of the diagonal of , where V ∈ Rn×m . By G are 0, we can write G = √1 , . . . , λn )Vn×m √ V diag(λ , we have G = X X defining X as X = diag( λ1 , . . . , λn )V ∈ R and the m columns of X yield the desired vectors in Rn . A more general problem begins with a matrix of dissimilarities Δ = (δij ) ∈ Rm×m and seeks to determine whether there exists a sequence of vectors (x1 , . . . , xm ) in Rn such that d(xi , xj ) = δij for 1 i, j m. Lemma 13.2. Let A, G ∈ Rm×m be two matrices such that G = Hm AHm , where Hm is the centering matrix Hm = Im − n1 1m 1m . Then gii + gjj − 2gij = aii + ajj − 2aij for 1 i, j m.
796
Linear Algebra Tools for Data Mining (Second Edition)
Proof. We saw that if G = Hm AHm , then gij = aij − ai· − a·j + a·· . Therefore, we have gii = aii − 2ai· + a·· , gjj = ajj − 2a·j + a·· . This allows us to write gii + gjj − 2gij = aii − 2ai· + a·· + ajj − 2a·j + a·· −2(aij − ai· − a·j + a·· ) = aii + ajj − 2aij .
Theorem 13.5. Let Δ ∈ Rm×m be a matrix of dissimilarities, A ∈ 2 for 1 i, j m, and let Rm×m be the matrix defined by aij = − 12 δij G be the centered matrix G = Hm AHm . If G is a positive semidefinite matrix and rank(G) = n, then there exists a sequence (x1 , . . . , xm ) of vectors in Rn such that d(xi , xj ) = δij for 1 i, j m. Proof. Since G is a symmetric, positive semidefinite matrix having rank n, by Theorem 8.14, it is possible to write G = V DV , where Rn×m , D √ = (λ1 , . . . , λn ), and λ1 · · · λn > 0. V = (v 1 · · · v m ) ∈ √ Let X = diag( λ1 , . . . , λn )V ∈ Rn×m . We claim that the distances between √ x1 , . . . , xm equal the prescribed dissimilarities. Indeed, since xi = λi v i , we have d(xi , xj )2 = (xi − xj ) (xi − xj ) = xi xi + xj xj − 2xi xj = λi vi vi + λj v j v j − 2λi λj v i vj = gii + gjj − 2gij (by Lemma 13.2) 2 , = aii + ajj − 2aij = −2aij = δij
which is the desired conclusion.
Since G = Hm AHm and Hm has an eigenvalue equal to 0, it is clear that G also has such an eigenvalue. Therefore, rank(G) ≤ m−1, so there exist m vectors of dimensionality not larger than m − 1 such that their distances are equal to the given dissimilarities.
Dimensionality Reduction Techniques
797
We saw that the matrices XX and G = X X have the same rank and their non-zero eigenvalues are positive numbers and have the same algebraic multiplicities for both matrices (by Corollary 7.6). Let w be a principal component of the matrix X ∈ Rn×m , that is, an eigenvector of the matrix XX . Suppose that rank(G) = r and let X = U DV be the thin SVD decomposition of the matrix X , where D = (σ1 , . . . , σr ) and U, V ∈ Rm×r (see Corollary 9.3). The matrices U and V have orthogonal columns, so U U = V V = Ir . Since the numbers σ1 , . . . , σr are positive, D is invertible and we obtain U = X V D −1 . Thus, MDS involves a process that is dual to the usual PCA; some authors refer to it as the dual PCA. MATLAB deals with metric MDS using the function cmdscale. The function call X = cmdscale(D) is applied to a distance matrix D ∈ Rm×n , and returns a matrix X ∈ Rm×n . The rows of X are the coordinates of m points in n-dimensional space for some n, where n < m. When D is a Euclidean distance matrix, the distances between those points are given by D. The number n is the smallest dimension of the subspace in which the m points whose inter-point distances are given by D can be embedded. [X,e] = cmdscale(D) also returns the eigenvalues of XX as components of the vector e. When D is Euclidean, the first n elements of e are positive, the rest zero. If the first k elements of e are much larger than the remaining n − k, then it is possible to use the first k columns of X to produce k-dimensional vectors whose interpoint distances approximate D. This can provide a useful dimension reduction for visualization, e.g., for k = 2. D need not be a Euclidean distance matrix. If it is non-Euclidean or a more general dissimilarity matrix, then some elements of e are negative, and cmdscale chooses n as the number of positive eigenvalues. In this case, the reduction to n or fewer dimensions provides a reasonable approximation to D only if the negative elements of e are small in magnitude. D can be specified as either a full dissimilarity matrix, or in uppertriangle vector form such as is output by pdist. A full dissimilarity matrix must be real and symmetric, and have zeros along the diagonal and positive elements everywhere else. A dissimilarity matrix in upper triangle form must have real, positive entries. D can be specified as a full similarity matrix, with ones along the diagonal and all other elements less than one. The function cmdscale transforms a
798
Linear Algebra Tools for Data Mining (Second Edition)
similarity matrix to a dissimilarity matrix in such a way that distances between the points √ returned in Y are equal to or approximate the distances given by I − D. Example 13.6. We start from a matrix of driving distances between five northeastern cities: Boston, Providence, Hartford, New York, and Concord. dist = 0 41.9000 92.8800 189.9000 63.4700
41.9000 0 65.3600 154.8400 95.7800
92.8800 65.3600 0 99.7600 115.5900
189.9000 154.8400 99.7600 0 213.7800
63.4700 95.7800 115.5900 213.7800 0
Applying cmdscale to this matrix, [X,e]=cmdscale(dist) produces the matrix X = 58.1439 19.3304 -29.8485 -129.6169 81.9911
-20.4773 -34.2586 8.8070 7.7975 38.1313
-4.2664 3.4664 1.1787 -1.1686 0.7899
and the matrix of eigenvalues e = 1.0e+004 * 2.8168 0.3185 0.0034 -0.0000 -0.0006
Next, by defining the matrix >> cities = [’BOS’;’PRO’;’HAR’;’NYC’;’CON’]
the results are displayed using >> plot(X(:,1),X(:,2),’+’) >> gname(cities)
which results in the representation contained in Figure 13.6.
Dimensionality Reduction Techniques
799
40
CON
30 20 10
HAR
NYC
0 −10 BOS
−20 −30 PRO −40 −150
−100
Fig. 13.6
−50
0
50
100
Representation of the five cities.
Of course, the representation is approximative, but the relative positions of the cities is reasonably close to their real placement.
13.7
Procrustes Analysis
Procrustes1 analysis (PRA) starts with two sets of vectors U = {u1 , . . . , um } ⊆ Rn , and V = {v 1 , . . . , v m } ⊆ Rp , where p < n, and tries to evaluate and improve the quality of the correspondence between these sets, as specified by the subscripts of the elements of the two sets. In general, we have p n. The dimensionality of the vectors of the two sets can be equalized by appending n − p zeroes as the last components of the vectors of V. 1
Damastes, known as Procrustes (the stretcher), was a mythological Greek character who had the annoying habit of fitting his guests to the bed he offered them. If the guest was too short, he stretched the guest; if the guest was too long, he cut off a part of the body to make the guest fit the bed. He was killed by hero Theseus, the founder of Athens, who treated him to his own bed.
800
Linear Algebra Tools for Data Mining (Second Edition)
PRA tries to construct a linear transformation (through the application of dilations, rotations, reflections, and translations) that maps the point of V to be as close as possible to the points of U . The evaluation criterion for the quality of the matching is d(V, U ) =
m
(v i − ui ) (v i − ui ),
i=1
which we need to minimize. Define the transformation h : Rn −→ Rn as h(v) = bT v + c, where b ∈ R, T ∈ Rn×n is an orthogonal matrix and c ∈ Rn . Clearly, h is the composition of a rotation (or a reflection), a dilation, and a translation. The quality of the matching between the displaced set of vectors h(V ) and U is d(h(V ), U ) =
m
(aA vi + b − ui ) (aA v i + b − ui ).
i=1
˜ and v ˜ be the means of the vectors of U and V, respectively. Let u Since ˜ + c − (ui − u ˜) − u ˜ + bT v ˜ ), bT v i + c − ui = bT (v i − v and m ˜ ) − (ui − u ˜ )) = 0n , (bT (v i − v i=1
it follows that m ˜ ) − ui + u ˜ ) (bT (v i − v ˜ ) − ui + u ˜) (bT (v i − v d(h(V ), U ) = i=1
˜+c−u ˜ ) (bT v ˜+c−u ˜ ). + m(bT v ˜. ˜ − bT v To ensure the minimum of d(h(V ), U ), we need to take c = u ˜) + u ˜. Thus, the new set of vectors is defined by h(v i ) = bT (v i − v This implies that the mean of the set {h(v i ) | 1 i m} coincides
Dimensionality Reduction Techniques
801
˜ , that is, the set of vectors {v i | 1 ≤ i m} is transformed with u such that the means of the two groups of vectors, h(V ) and U , coincide. Suppose now that both means of the sets U and V coincide with 0n . In this case, m (bT v i − ui ) (bT v i − ui ) d(h(V ), U ) = i=1
=b
2
m
v i T T v i
+
i=1
= b2
m
m
ui ui
− 2b
m
i=1
v i v i +
i=1 2
m
i=1
ui ui − 2b
i=1
m
ui T v i
i=1
= b trace(V V ) + trace(U U ) − 2btrace(V T U ), (13.9) where
ui T v i
⎞ ⎛ ⎞ v1 u1 ⎜ . ⎟ ⎜ . ⎟ ⎟ ⎜ ⎟ U =⎜ ⎝ .. ⎠ and V = ⎝ .. ⎠. um v m ⎛
The ambiguity introduced by denoting the above matrices by the same letter as the set of vectors is harmless. The minimum of this quadratic function of b is achieved when b=
trace(T U V ) trace(V T U ) = trace(V V ) trace(V V )
and it equals dmin =
trace(V V )trace(U U ) − trace(V T U )2 . trace(V V )
Once the translation (c) and the dilation ratio (b) have been chosen, we need to focus on the rotation matrix T .
Linear Algebra Tools for Data Mining (Second Edition)
802
Theorem 13.6 (Berge–Sibson Theorem). Let M ∈ Rn×n be a matrix having the singular value decomposition M = P DQ , where P, Q ∈ Rn×n are orthogonal matrices and D = diag(σ1 , . . . , σn ), where σi 0 for 1 i n. Then, for every orthogonal matrix T ∈ Rn×n , we have trace(T M ) trace((M M )1/2 ) and the equality takes place when T = QP . Proof.
Starting from the SVD M = P DQ , we have
trace(T M ) = trace(T P DQ ) = trace(Q T P D) = trace(LD), where L = Q T P ∈ Rn×n is an orthogonal matrix (as a product of orthogonal matrices). Since the elements of an orthogonal matrix cannot exceed 1, it follows that trace(T M )
n
1
σi = trace(D) = trace((D D) 2 ).
i=1
Taking into account that D = P M Q (and D = Q M P ), we have 1
1
trace((D D) 2 ) = trace((Q M P P M Q) 2 ) 1
1
= trace((Q M M Q) 2 ) = trace((M QQ M ) 2 ) (by Corollary 7.6) 1
= trace((M M ) 2 ), which is the desired inequality. If T = QP , we can write 1
T M = QP M = QP P DQ = (QDQ QDQ ) 2 1
1
= (QDDQ ) 2 = (QDP P DQ ) 2 1
= (A A) 2 , which concludes the argument.
Berge–Sibson Theorem allows us to choose the optimal orthogonal matrix in Equality (13.9) by maximizing trace(V T U ) = trace(T U V ). To this end, we need to take T = QP , where U V has the SVD U V = P DQ .
Dimensionality Reduction Techniques
803
Since T = QP , it follows that Q T P D = D because P and Q are orthogonal. In turn, this implies QQ T P DQ = QDQ , so T P DQ = QDQ . Therefore, 1
1
T U V = T P DQ = QDQ = (QD 2 Q ) 2 = (QDP P DQ ) 2 1
= ((U V ) (U V )) 2 . The optimal dilation ratio can be written now as 1
trace((U V ) (U V )) 2 . b= trace(V V ) Since trace(T U V ) = trace(V T U ), the minimal value of d(h(V ), U ) is dmin =
trace(V V )trace(U U ) − trace(V T U )2 trace(V V )
= trace(U U ) −
trace(V T U )2 trace(V V ) 1
trace(((U V ) (U V )) 2 )2 . = trace(U U ) − trace(V V )
A normalized version of this criterion is 1
trace(((U V ) (U V )) 2 )2 dmin = 1 − , trace(U U ) trace(V V )trace(U U ) known as the Procrustes criterion. MATLAB performs PRA starting from two sample data matrices U and V, which have the same number of rows. The rows of U and V are obtained by transposing the vectors of U and V, respectively. The PRA matches the ith row in V to the ith row in U . Rows in U can have smaller dimensions (number of columns) than those in V ; in this case columns of zeros are added to U to equalize the number of columns of the two matrices. Calling the function [d, Z] = procrustes(V,U)
also returns Z = (h(u1 ) · · · h(um ) ) = bU T + c containing the transformed U vectors. The transformation that maps U into Z is returned by the function call
Linear Algebra Tools for Data Mining (Second Edition)
804
[d, Z, transform] = procrustes(V,U)
The variable transform is a structure that consists of the following components: • c , that is the translation component; • T , the orthogonal matrix involved; • b, the dilation factor. Other further interesting variants of procrustes are described as follows: procrustes(..., ’Scaling’,false) procrustes(..., ’Scaling’,true) procrustes(..., ’Reflection’,false) procrustes(..., ’Reflection’,’best’) procrustes(..., ’Reflection’,true)
no scale component (b = 1) includes a scale component (default) no reflection component (det(T ) = 1) best fit solution (default) solution includes a reflection (det(T ) = −1)
Example 13.7. We apply the Procrustes analysis to the sets of points X and Y in R2 produced as in Example 4.18: U =
V = 0.5377 1.8339 -2.2588 0.8622 0.3188 -1.3077 -0.4336 0.3426 3.5784 2.7694 -1.3499 3.0349 0.7254 -0.0631 0.7147
-0.2050 -0.1241 1.4897 1.4090 1.4172 0.6715 -1.2075 0.7172 1.6302 0.4889 1.0347 0.7269 -0.3034 0.2939 -0.7873
2.2260 2.7057 1.3409 2.6851 2.3451 1.6735 1.5266 2.2899 4.0256 3.2358 1.6690 3.4838 2.2542 2.0618 2.0694
1.7753 1.4795 3.2412 2.4492 2.5894 2.5745 1.5894 2.1642 1.7556 1.5190 2.8621 1.5175 1.7058 2.1317 1.5363
Dimensionality Reduction Techniques
805
The function call >> [d,Z,tr] = procrustes(U,V)
returns d = 0.0059 Z = 0.6563 1.7723 -2.3302 0.7597 0.0388 -1.0910 -0.3485 0.3733 3.7441 2.6363 -1.3887 3.0605 0.7745 0.0172 0.6304
-0.1488 -0.1696 1.4579 1.4628 1.3591 0.6567 -1.1707 0.5786 1.6315 0.4321 1.1426 0.6795 -0.2389 0.2933 -0.7142
tr = T: [2x2 double] b: 1.9805 c: [15x2 double]
The results can be represented graphically using >> plot(U(:,1),X(:,2),’bx’,V(:,1),Y(:,2),’b+’,Z(:,1),Z(:,2), ’bo’)
and as shown in Figure 13.7. A summary inspection of Figure 13.7 shows that the transformation of the points of V (marked by a symbol ’+’) into the points of Z (marked by ’o’) yields points that are quite close to the points of X (marked by ’x’).
Linear Algebra Tools for Data Mining (Second Edition)
806 3.5 3 2.5 2 1.5 1 0.5 0 −0.5 −1 −1.5 −3
−2
−1
Fig. 13.7
13.8
0
1
2
3
4
5
Procrustes analysis of the sets X and Y .
Non-negative Matrix Factorization
Throughout this book we presented several factorization techniques for matrices that satisfy specific conditions involving the matrix to be factored or the factor matrices. In this chapter, we are concerned with a special type of approximative factoring of a matrix. Namely, starting from a matrix A ∈ Rn×m and a positive integer p < min{m, n}, we seek to find a pair of matrices (U, V ) such that U ∈ Rn×p , V ∈ Rp×m , and A − U V F is minimal. If A is a nonnegative matrix and we impose the non-negativity condition on the factors U, V , then we have the non-negative factorization of A. Note that rank(U V ) p. We begin by examining the existence of an approximative factorization for an arbitrary matrix A ∈ Rm×n . The minimization of A − U V F is equivalent to minimizing the function f (U, V ) = 1 2 2 A − U V F . We have grad U A − U V 2F = U V V − AV
(13.10)
Dimensionality Reduction Techniques
807
and grad
V
A − U V 2F = U U V − U A.
(13.11)
Thus, the necessary conditions for finding a stationary point are U V V = AV and U U V = U A. Also, if (U, V ) is a solution of the problem, then so is (aD, a1 V ), where a is a positive number. Thus, if the problem has a solution, this solution is not unique, in general. The minimization of A − U V F when we seek non-negative factors U and V is equivalent to minimizing the function f (U, V ) = 1 2 2 A − U V F subjected to np + pm restrictions: uij 0 for 1 i n and 1 j p and vjk 0 for 1 j p and 1 k m, respectively. We apply the Karush–Kuhn–Tucker Theorem to the function f (U, V ) that depends on np + pm components uij and vjk of the matrices U ∈ Rn×p and V ∈ Rp×m. The functions g and h that capture the constraints are g : Rn×p × n×p R −→ Rn×p and h : Rp×m × Rn×p −→ Rp×m , which are defined by U U g = −U and h = −V. V V a×b be an a × b matrix whose single non-zero component is Let Eqr equal to 1 and is located at the intersection of line q and column r. For the constraint functions, we have n×p , grad grad U gij = −Eij
grad U hjk = On,p , grad
V
V gij
= Op,m ,
p×m hjk = −Ejk
for 1 i n, 1 j p, and 1 k m. We have U U = −V Op,m . g = −U On,p and h V V We introduce two matrices of Lagrange multipliers, B ∈ Rn×p and C ∈ Rp×m .
Linear Algebra Tools for Data Mining (Second Edition)
808
The first Karush–Kuhn–Tucker condition amounts to the equalities grad
U A
p n
− U V 2F +
bij grad
U gij
+
i=1 j=1
grad
2 V A − U V F +
m n i=1 j=p
p m
cjk grad
U hjk
cjk grad
V
= On,p ,
j=1 k=1
bij grad
V gij +
p m
gjk = Op,m ,
j=1 k=1
which are equivalent to U V V − AV − B = On,p , U U V − U A − C = Op,m . The remaining conditions can be written as bij uij = 0 and cjk vjk = 0 for 1 i n, 1 j p, and 1 k m, and B On,p , C Op,m . If both U > On,p and V > Op,m , then we have B = On,p and C = Op,m , and the Karush–Kuhn–Tucker conditions imply U V V = AV and U U V = U A. If we seek matrices U and V of full rank, then, by Corollary 3.3, V V and U U are non-singular matrices and we obtain U = AV (V V )−1 and V = (U U )−1 U A. These arguments suggest a nonnegative matrix factorization known as the alternate least-square (ALS) algorithm (see [17, 127]). This algorithm takes advantage of the fact that although the function F : Rm×p × Rp×n −→ R defined by F (U, V ) = A − U V F is not convex with respect to the pair (U, V ), it is convex on either U or V. Its basic form is Algorithm 13.8.1. The factorization technique for non-negative matrices that we discuss next was developed by Lee and Seung [105] and consists of constructing iteratively a sequence of pairs of matrices (U 0 , V 0 ), (U 1 , V 1 ), . . . such that the products U t V t offer increasingly better approximations of a non-negative matrix A. We need the following preliminary notion. Definition 13.6. Let G : (Rk )2 −→ R and F : Rk −→ R be two functions. G is an auxiliary function for F if G(x, y) ≥ F (x) and G(x, x) = F (x) for x, y ∈ Rk .
Dimensionality Reduction Techniques
809
Algorithm 13.8.1: Alternate Least-Square Factorization Algorithm Data: A nonnegative matrix A ∈ Rm×n and a maximum number of iterations r Result: An approximate factorization A = U V 1 1 initialize U to a random matrix; 2 for t = 1 to r do 3 solve for V t the equation (U t ) U t V t = (U t ) A; 4 set all negative elements in V t to 0; 5 if t < r then 6 solve for U t+1 the equation V t (V t ) U t+1 = (V t ) At ; 7 set all negative elements in U t+1 to 0; 8 end 9 end r r 10 return U and V ; Consider a sequence of vectors (ut )t∈N in Rk defined by ut+1 = arg min G(u, ut ) u
(13.12)
for t ∈ N. Then, F (ut+1 ) G(ut+1 , ut ) (by the first condition of Definition 13.6) G(ut , ut ) (by Equality (13.12)) (by the second condition of Definition 13.6) = F (ut ) for t ∈ N. Thus, the sequence (F (ut ))t∈N is non-increasing. The inequalities F (ut+1 ) G(ut , ut ) F (ut ) show that if F (ut+1 ) = F (ut ), then G(ut , ut ) = F (ut ) so ut is a local minimum of G(u, ut ). If the derivatives of F exist and are continuous in a neighborhood of ut , then ∇F (ut ) = 0. Thus, the sequence (ut )t∈N converges to a local minimum umin of F . Let A = (a1 · · · am ) be the matrix to be factored, where aj ∈ Rn for 1 j m, and let FA : Rn×p × Rp×m −→ R0 be the objective
Linear Algebra Tools for Data Mining (Second Edition)
810
function that we seek to minimize in NNMF, as follows: FA (U, V ) =
1 A − U V 2F . 2
If we write A = (a1 · · · am ) and V = (v 1 · · · v m ), then the matrix A − U V can be written as A − U V = (a1 − U v 1 · · · am − U v m ), so A − U V 2F =
n
aj − U v j 2F .
j=1
Let Fa (U, v) = a − U v2F . The last equality can be written as FA (U, V ) =
n
Faj (U, v j ).
j=1
Lemma 13.3. Let v ∈ Rk and let di (v) be defined by di (v) = (U vUi v)i for 1 i k and K(v) = diag(d1 (v), . . . , dk (v)). The matrix K(v)− U U is positive semidefinite. Proof. Define the matrix M (v) ∈ Rk×k as M (v)ij = vi (K(v) − U U )ij vj for 1 i, j k. In Supplement 96 of Chapter 6, we proved that M is positive semidefinite if and only if K(v) − U U is positive semidefinite. Note that (U U )ij vj vi M (v)ii = (K(v) − U U )ii vi2 = (U U v)i vi = j
for 1 i k. Also, if i = j, we have M (v)ij = −vi (U U )ij vj . We have z M (v)z =
i
=
i
zi M (v)ij zj
j
M (v)ii zi2 +
i=j
zi M (v)ij zj
Dimensionality Reduction Techniques
=
i
(U U )ij vj vi −
j
= (U U )ij vi vj i,j
811
zi vi (U U )ij vj zj
i=j
1 2 1 2 z + zj − zi zj 2 i 2
1 (U U )ij vi vj (zi − zj )2 , 2
=
i,j
which shows the positive semidefiniteness of M (v).
Lemma 13.4. The function G : (Rk )2 −→ R defined by 1 G(s, v) = F (v) + (s − v) ∇F (v) + (s − v) K(v)(s − v) 2 is an auxiliary function for the quadratic function F : Rk −→ R given by 1 F (s) = F (v) + (s − v) ∇F (v) + (s − v) U U (s − v) 2 Proof.
The inequality G(s, v) F (s) amounts to (s − v) K(v)(s − v) (s − v) U U (s − v),
which follows immediately from Lemma 13.3. Thus, the first condition of Definition 13.6 is satisfied. It is immediate that G(v, v) = F (v). Choose F = Fa . We have 1 F (s) = Fa (s) = a − U s2F 2
⎛ ⎞2 1 1 ⎝ ai − = (a − U s)2i = Uij sj ⎠ 2 2 i
i
j
Therefore, ∂F =− ∂s
i
⎛ ⎝ai −
j
⎞ Uij sj ⎠ Ui = −
i
ai Ui +
i
j
Uij Ui sj
812
Linear Algebra Tools for Data Mining (Second Edition)
for 1 i k. Thus, ∇F (s) = −U a + U U s. The corresponding auxiliary function is now 1 G(s, v) = F (v) + (s − v) (−U a + U U v)) + (s − v) K(v)(s − v). 2 To obtain the update rule, we need to find v such that 1 G(v, v t ) = F (v t )+(v −vt ) (−U a+U U v t )+ (v −vt ) K(v t )(v −vt ) 2 is minimal. Regarding G(v, v t ) as a function of v, we obtain ∇v G(v, v t ) = −U a + U U v t + K(v t )(v − vt ), by applying the differentiation rules ∇(a x) = ∇(x a) and ∇(x Ax) = 2Ax (for a symmetric matrix). Therefore, the necessary extremal condition ∇v G(v, v t ) = 0 implies v = v t − K(v t )−1 (U U vt − U a), which is the update rule for the columns of the matrix V if U is kept constant. Note that 1 1 t −1 ,..., K(v ) = diag d1 (v t ) dk (v t ) vkt v1t , . . . , = diag . (U U v t )1 (U U v t )k Thus, we obtain the update rule for the ith column of the matrix V for a fixed matrix U vit+1 = vit − =
vit ((U U v t )i − (U a)i ) (U U v t )i
vit (U a)i . (U U v t )i
Dimensionality Reduction Techniques
813
The components of the updated matrix V can be written as t+1 = vij
t (U A) vij ij . (U U V t )ij
Similarly, the updated components of the matrix U , when V is fixed, are given by ut+1 ij =
utij (AV )ij . (U t V V )ij
Lee and Seung algorithm [105] consists of applying alternating updates of the matrices U and V (Algorithm 13.8.2). Algorithm 13.8.2: Lee–Seung Factorization Algorithm Data: A nonnegative matrix A ∈ Rm×n and a maximum number of iterations r Result: An approximate factorization A = U V 1 1 initialize U to a random matrix; 1 to a random matrix; 2 initialize V 3 for t = 1 to r do 4 V t+1 = (V t ((U t ) A)) ((U t ) U t V t + 10−10 Jp,m ; 5 U t+1 = (U t (A(V t ) )) (U t V t (V t ) + 10−10 Jn,p ; 6 end t t 7 return U and V ; Using the Hadamard product and quotient of matrices introduced in Definition 3.45, the computations involving the sequences V 1 , . . . , V t , . . . and U 1 , . . . , U t , . . . can be written in a more succinct form as V t+1 = (V t ((U t ) A)) ((U t ) U t V t ),
(13.13)
U t+1 = (U t (A(V t ) )) (U t V t (V t ) ).
(13.14)
The algorithm is given next. The terms 10−10 Jp,m and 10−10 Jn,p are added to avoid division by 0. The Equalities (13.13) and (13.14) show that if U 1 and V 1 are positive, the matrices U t and V t remain positive throughout the iterations.
Linear Algebra Tools for Data Mining (Second Edition)
814
MATLAB uses the function nnmf of the Statistics toolbox to compute a factorization of a nonnegative matrix. When using its simplest syntax, [U,V] = nnmf(A,k), this function computes an approximate decomposition of the non-negative matrix A ∈ Rm×n into the nonnegative factors U ∈ Rm×k and V ∈ Rk×n , where the rows of V have unit length. Actually, the product U V is an approximation of A that minimizes the quantity d = A − U V F /mn known as the root-mean-squared residual. To add d to the results, we can write [U,V,d] = nnmf(A,k). The function nnmf can be called with the number of options detailed in the documentation of MATLAB . Such a call has the form
[U,V] = nnmf(A,k,param1,val1,param2,val2,...)
For example, for the option ‘algorithm’ the possible values are ‘mult’ for the multiplicative algorithm or ‘als’ for an alternating least-squares algorithm. Example 13.8. The IRIS dataset from the UCI machine learning repository consists of 150 records that describe characteristics of three varieties of the iris flower: Iris Setosa, Iris Versicolor, and Iris Virginica. Each record has four components that correspond to the sepal length, sepal width, petal length, and petal width (in centimeters). To compute a nonnegative matrix factorization of the matrix D ∈ R150×4 as D = U V , where U ∈ R150,2 and V ∈ R2,4 , we write >> [U,V] = nnmf(D,2)
This produces the matrix U (which we omit) and the matrix V given by V = 0.6942 0.8027
0.2855 0.5675
0.6223 0.1829
0.2221 0.0142
corresponding to the four variables mentioned above. The factorization of D allows us to express each row di = (sli , swi , pli , pwi ) as sli = 0.6942ui1 + 0.807ui2 swi = 0.2855ui1 + 0.5675ui2 , pli = 0.6223ui1 + 0.1829ui2 pwi = 0.2221ui1 + 0.0142ui2 .
Dimensionality Reduction Techniques
815
1
seplen
COLUMN 2
0.8
0.6
sepwidth
0.4
0.2
0
petallen
0
0.2
petalwidth 0.4
Fig. 13.8
0.6 COLUMN 1
0.8
1
Biplot of the IRIS dataset.
The first column of U has a strong influence on the sepal length sli and petal length pli ; in contrast, the second column of U has little influence on the petal width pwi . To visualize these influences, a biplot is constructed. >> biplot(V’,’scores’,U,’varlabels’,{’seplen’,’sepwidth’,... ’petallen’,’petalwidth’}); >> axis([0 1.1 0 1.1]) >> xlabel(’COLUMN 1’) >> ylabel(’COLUMN 2’)
The result is shown in Figure 13.8. Exercises and Supplements (1) Let X = (x1 · · · xn ) ∈ Rm×n be a matrix. A Karhunen–Loeve basis for X is a sequence of pairwise orthogonal k unit vectors (k < n) u1 , . . . , uk in Rm such that for the orthogonal
816
Linear Algebra Tools for Data Mining (Second Edition)
projections projS (u1 ), . . . , projS (un ) on the subspace S generated by u1 , . . . , uk , ni=1 xi − projS (xi )2 is minimal. Prove that the first k columns of the orthogonal matrix U involved in the SVD of the matrix X, X = UDV is a Karhunen–Loeve basis for X. (2) Let X ∈ Rm×n be a centered data matrix and let X = UDV be the thin SVD of X, where U ∈ Rm×r , D ∈ Rr×r , and V ∈ Rn×r and each U and V has orthonormal columns. If S = U D = (s1 · · · sr ) = (d1 u1 · · · dr ur ) ∈ Rm×r is the matrix of scores and V ∈ Rn×r is the matrix of loadings, prove that 1 d2i ; (a) the variance of a score vector si is var(si ) = m−1 (b) X = s1 v 1 + · · · + sr vr ; (c) if Xk = s1 v1 + · · · + sk v k , where k r, then k d2i tvar(Xk ) = i=1 r 2. tvar(X) i=1 di k
d2
i In other words, i=1 r 2 indicates the portion of the total varii=1 di ance of X explained by the first k scores. (3) Let X ∈ Rm×n be a centered data sample matrix whose covariance matrix is cov(X) = aIn + bJn,n . Prove that a 0 and b 0. (4) The orthogonal factor model [82] seeks to express a centered data sample matrix X ∈ Rm×n = (v 1 · · · v n ) as X = F L + S, where L ∈ Rq×n is the matrix of factor loadings, F ∈ Rm×q is the matrix of factors, and S ∈ Rm×n is the matrix of specific factors. The model assumes that F is a centered matrix, cov(F ) = In , S is a centered matrix, cov(S) is a diagonal matrix, and F S = On,n . Prove that (a) cov(X) = (q − 1)cov(L) + cov(S) and F X = L; (b) var(v i ) = (q − 1)var(li ) + var(si ) for 1 i n.
Dimensionality Reduction Techniques
817
(5) Let X ∈ Rm×n be a centered data sample matrix and let p, q ∈ N be such that p + q = n and X = (U V ), where U ∈ Rm×p and V ∈ Rm×q . Define the mixed covariance of U and V as 1 cov(U, V ) = m−1 U V ∈ Rp×q . Prove that cov(X) =
cov(U ) cov(U, V ) . cov(U, V ) cov(V )
(6) Using the same notations as in Exercise 5, let u = U a and v = V b be two vectors in Rm that aggregate the columns of U and V, respectively. We assume here that a ∈ Rp and v ∈ Rq . Prove that (a) var(u) = a cov(U )a, var(v) = b cov(V )b, and cov(u, v) = a cov(U, V )b; (b) if the correlation of u and v is defined as corr(u, v) =
cov(u, v) , var(u) var(v)
prove that the largest value of corr(u, v) is the largest eigenvalue of cov(U )−1/2 cov(U, V )cov(V )−1 cov(V, U )cov(U )−1/2 . (7) Let A ∈ Rn×m be a positive matrix. Define FA : Rn×p >0 × p×m R>0 −→ R by FA (U, V ) = D(A, U V ), where D is the matrix divergence introduced in Supplement 108 of Chapter 3. If A = (a1 · · · am ) and V = (v 1 · · · v m ), prove that D(A, U V ) =
n
D(aj , U v j ).
j=1
(8) Let Fa : Rk>0 −→ R be the function defined by ⎛ ⎞ ai ⎝ ⎠ uij sj − ai + ai ln Fa (s) = j uij sj i
j
818
Linear Algebra Tools for Data Mining (Second Edition)
for s ∈ Rk>0 , where a ∈ Rk>0 . Prove that the function Ga : (Rk>0 )2 −→ R given by Ga (s, v) = (ai ln ai − ai ) + uij si i
−
i
i
uij vj ai p uip sp
j
j
uij vj ln uij sj − ln p uip sp
is an auxiliary function for Fa . Solution: We have (ai ln ai − ai ) + uij si Ga (s, s) = i
i
j
uij sj uij sj − ai ln uij sj − ln p uip sp p uip sp i j (ai ln ai − ai ) + uij si = i
−
i
=
⎛
j
i
j
uij sj ai ln uip sp p uip sp p
⎝ai ln ai − ai +
i
uij sj − ai ln
⎞ uip sp ⎠
p
j
= Fa (s). The second condition of Definition 13.6, Ga (s, v) Fa (s), can be proven starting from the convexity of the function (x) = − ln x. By Jensen’s Theorem, we have ⎞ ⎛ k k uij sj ⎠ ⎝ uij sj − tj ln − ln tj j=1
j=1
Dimensionality Reduction Techniques
819
for any set of non-negative numbers t1 , . . . , tk such that uij vj k yields the inequality j=1 tj = 1. Choosing tj = k p=1
⎛ − ln ⎝
k
⎞ uij sj ⎠ ≤ −
j=1
k j=1
=−
k j=1
×
uip vp
uij sj
uij vj ln k p=1 uip vp
uij vj k p=1 uip vp
uij vj k p=1 uip vp uij vj
ln uij sj − ln k
p=1 uip vp
.
This inequality implies (ai ln ai − ai ) + uij si Ga (s, v) i
−
i
⎛ ai ln ⎝
i k
⎞
j
uij sj ⎠ = FA (s).
j=1
Thus, Ga is indeed an auxiliary function for Fa . (9) Let X be a non-negative matrix and let X = U V be a nonnegative matrix factorization of X. Prove that all rows of X are non-negative combinations of the rows of U , that is, they lie in the simplicial cone generated by the rows of U . (10) Let X = (x1 , . . . , xp ), Y = (y 1 , . . . , y q ) be two sequences of vectors in Rn . The matrix D ∈ Rp×q of squared distances between these sequences is DX,Y = (dij ), where dij = xi − y j 22 for 1 i p and 1 j q. Prove that ⎛ ⎞ x1 x1 ⎜ .. ⎟ DX,Y = ⎝ . ⎠ 1q + 1p (y 1 y 1 , . . . , y q y q ) − 2 X Y. xp xp
820
Linear Algebra Tools for Data Mining (Second Edition)
The analogue of multidimensional metric scaling for two sets of vectors has been introduced and studied as multidimensional unfolding (MDU) by Sch¨ onemann in [147]. For MDU, the issue is to reconstitute two finite vector sets given their mutual distances. Let X = (x1 · · · xp ) ∈ Rn×p and let tz be a translation of Rn . Denote by tz (X) the matrix (tz (x1 ) · · · tz (xp )). (11) Let X = (x1 · · · xp ) ∈ Rn×p , x0 ∈ Rn , and let c ∈ Rp be a ˜ = t−x (X), then X ˜ = vector such that x0 = Xc. Prove that if X 0 XQc , where Qc = Ip − 1p c is the projection matrix introduced in Supplement 10 of Chapter 3. Solution: We have
⎛ ⎞ x1 − x0 ⎜ ⎟ .. t−x0 (X ) = ⎝ ⎠ . xp − x0 = X − 1p x0 = X − 1p c X = (Ip − 1p c )X = Qc X ,
which is equivalent to t−x0 (X) = XQc . (12) Let X = (x1 · · · xp ) ∈ Rn×p , Y = (y 1 · · · y q ) ∈ Rn×q , x0 , y 0 ∈ ˜ and Y˜ be matrices obtained by a translation by −x0 Rn . Let X and −y 0 of the columns of X and Y , respectively. If c ∈ Rp and d ∈ Rq are such that x0 = Xc and y 0 = Y d, prove that ˜ Y˜ = − 1 Qc DX,Y Q . X d 2 Solution: Starting from the equality given in Exercise 10, we have Qc DX,Y Qd ⎛ ⎞ x1 x1 ⎜ .. ⎟ = Qc ⎝ . ⎠ 1q Qd + Qc 1p (y 1 y 1 , . . . , y q y q )Qd xp xp − 2 Qc X Y Qd ˜ Y˜ , = −2 Qc X Y Qd = −2X because Qd 1q = 0q and Qc 1p = 0p .
Dimensionality Reduction Techniques
821
˜ Y˜ and let CX,Y = GH be a full-rank decomposiLet CX,Y = X tion of CX,Y (see, Theorem 3.35). Note that such a decomposition is not unique for, if T is a non-singular matrix, CX,Y can be writ˜ T −1 )(T Y˜ ) is yet another full-rank factorization ten as CX,Y = (X of CX,Y . (13) Let C = GH be a full-rank decomposition of the matrix CX,Y . Prove that (a) starting from G = (g 1 · · · g p ) and H = (h1 · · · hq ), if xi = T g i + x0 and y j = T −1 hj + y 0 for 1 i p and 1 j q, then the matrices X = (x1 · · · xp ) and Y = (y 1 · · · y q ) give a solution of the metric unfolding problem; (b) with the previous choice for X and Y and with M = T T , we have d2ij = g i M g i + hj M −1 hj + (x0 − y 0 ) (x0 − y 0 ) + 2g i T (x0 − y 0 ) − 2hj (T −1 ) (x0 − y 0 ) − 2g i hj , for 1 i p and 1 j q. Let S be the finite set of numbers, S = {−1, 0, 1}. A semidiscrete decomposition (SDD) of a matrix A ∈ Rm×n is a sum of the form A=
k
dp xp y p ,
p=1
where xp ∈ S m , y p ∈ S n , and d1 , . . . , dk are real numbers. Note that every A can be expressed as a sum of mn rankmatrix m n 1 matrices A = i=1 j=1 aij ei ej ; the purpose of the SDD is to develop sum of fewer terms that approximate the matrix A. (14) Prove that for a fixed A, x, and y, the least value of A−dxy 2F (x Ay)2 (x Ay)2 2 is achieved for d∗ = x 2 y2 and it equals AF − x2 y2 . F
F
F
F
822
Linear Algebra Tools for Data Mining (Second Edition)
Solution: Let R be the residual R = A − dxy obtained by approximating A by dxy and let rA (d, x, y) = R2F . We have rA (d, x, y) = (A − dxy ) (A − dxy ) = (A − dyx )(A − dxy ) = A A − dyx A − dA xy + d2 yx xy = A2F − 2dx Ay + d2 x2F y2F . The optimal solution for d is therefore d =
x Ay x2F y2F
. The final
part follows from the substitution of d in rA (d, x, y). (15) Suppose that A ∈ Rm×n is a matrix and let y ∈ S n . Determine d and x ∈ S m such that A − dxy 2F is minimal. Solution: In principle, we need to examine 3m possible value for x, but we will show it is possible to limit the search to that n m values. Let zi = xi j=1 aij yj . We have rA (d, x, y) =
n m (aij − dxi yj )2 i=1 j=1
=
m n (a2ij − 2daij xi yj + d2 x2i yj2 ) i=1 j=1
=
=
n m
a2ij − 2d
n m
i=1 j=1
i=1 j=1
m n
m
a2ij − 2d
i=1 j=1
aij xi yj + d2
x2i yj2
i=1 j=1
zi + d2
i=1
m n
m i=1
x2i
n
yj2 .
j=1
The signs of the components of x do not affect the last term. Thus, to minimize rA (d, x, y), the signs of xi must be chosen such that all zi have the same sign: positive if d > 0 and negative, otherwise. Let J be the number of non-zero components of x. Clearly, we have 1 J n. Then, rA (d, x, y) =
n m i=1 j=1
a2ij − 2d
m i=1
zi + d2 J
n j=1
yj2 .
Dimensionality Reduction Techniques
823
For a particular choice of x, the d that minimizes rA (d, x, y) is m zi , (13.15) d∗ = i=1 J nj=1 yj2 and the minimum value of rA (d, x, y) is rA (d∗ , x, y) =
m n i=1 j=1
a2ij − d2∗ J
n
yj2 .
j=1
d2 J. For a Thus, minimizing rA (d, x, y) amounts to maximizing given J, xi must be chosen equal to sign (| j=1 aij yj |) for those i that correspond to the largest sums | j=1 aij yj |. This shows that there are m choicesof x to check: set the xi ’s corresponding to the J largest sums | j=1 aij yj | to 1 or to −1 such that all zi are positive, find d from Equality 13.15, and choose d2 J among the n choices for J. The O’Leary–Peleg algorithm [95, 122] for SDD starts with a matrix A ∈ Rm×n and with a number k and consists of alternative steps that determine two sequences of vectors x1 , . . . , xk and y 1 , . . . , y k that allow the construction of a sequence of approximations A0 , A1 , . . . , Ak , . . . of A. The squares of the norms of the residuals of successive approximations are r1 , r2 , . . .. The desired accuracy of the approximation is rmin . Also, max is the maximum allowable number of inner iterations, and αmin is the minimum relative improvement. (16) Let A ∈ Rm×n and let A0 , A1 , . . . and R1 , R2 , . . . be two sequences of matrices defined by A0 = Om,n , R1 = A, Ak = Ak−1 + dk xk y k , and Rk+1 = Rk − dk xk y k , where the sequences x1 , x2 , . . . and y 1 , y 2 , . . . are defined as in Algorithm 13.8.3. Prove that (a) Rk+1 F < Rk F , which ensures the convergence of the O’Leary–Peleg algorithm;
1 k ≤ 1 − mn R0 2F , which proves that (b) Rk+1 2F limk→∞ Rk F = 0 and the rate of convergence is at least linear.
824
Linear Algebra Tools for Data Mining (Second Edition)
Algorithm 13.8.3: O’Leary–Peleg Factorization Algorithm Data: A matrix A ∈ Rm×n and a number kmax of approximating terms Result: An SDD approximate factorization of A 2 1 initialize R1 = A; initialize r1 = R1 F ; 2 for k = 1 to kmax do 3 while rk > rmin do 4 choose y = ej if the jth column of Rk contains the largest magnitude entry in Rk ; 5 for = 1 to max do 6 while α > αmin do 1 7 z = y 2 Rk y; F
8
find x ∈ S m that maximizes
9
z=
1 x2F
Rk x;
10
find y ∈ S n that maximizes
11
β=
(x Rk y)2 x2F y2F
;
19
if > 1 then β¯ α = β− β¯ end β¯ = β; end end xk = x; y k = y;
20
dk =
12 13 14 15 16 17 18
21 22 23 24 25
xk Rk y k xk 2F y k 2F
;
Ak = Ak−1 + dk xk y k ; Rk+1 = Rk − dk xk y k ; rk+1 = rk − β; end end
1 x2F
1 y2F
(x z)2 ;
(y z)2 ;
Dimensionality Reduction Techniques
825
Bibliographical Comments Spectral methods for dimensionality reduction are surveyed in [145]. A main reference for multidimensional scaling is [32]. The application of SVDs to the recommender system was initiated in [19]. Schoenberg’s paper on metric spaces [146] in one of the earliest contributions to metric scaling. Theorem 13.6 was obtained by Berge in [15] and Sibson in [150]. The connection between multidimensional metric scaling and PCA was studied by Gower in [66]. The nonnegative factorization problem attracted a huge amount of interest after Lee and Seung’s paper [105] was published in 1999. The research in this type of problems was initiated earlier by Paatero in [127] followed by [125, 126]. Applications in text mining and document clustering are discussed in several important publications [17, 26, 131, 149] on which this chapter is based. The semidiscrete matrix decomposition of matrices was introduced in [122] and further explored in [94, 95].
This page intentionally left blank
Chapter 14
Tensors and Exterior Algebras
14.1
Introduction
In many machine learning-oriented literature sources, tensors are regarded as multidimensional arrays. While the sets of values of tensors over finite-dimensional spaces are indeed such arrays, this point of view misses fundamental properties of tensors, especially their behaviors relative to basis changes in the underlying linear spaces. Recall that the elements of a linear space V are referred to as contravariant vectors while the elements of the dual space V ∗ are referred to as covariant vectors (see Section 2.8). The components of contravariant vectors will be denoted with letters with superscripts, while the components of covariant vectors will be designated by letters with subscripts. The opposite convention is applied to vectors themselves: vectors in V are denoted by letters with subscripts, while vectors in V ∗ are denoted by letters with superscripts. 14.2
The Summation Convention
The summation convention, a modality of simplifying expression with multiple indices that is widely used in presenting tensors, was introduced by Albert Einstein. This convention stipulates that when an index variable appears twice in a single term and is not otherwise defined, it implies summation of that term over all the values of the index.
827
828
Linear Algebra Tools for Data Mining (Second Edition)
Indices that are not summation indices are referred to as free indices. Example 14.1. If i ranges over the set {1, 2, 3, 4}, the expression xi yi stands for x1 y1 + x2 y2 + x3 y3 + x4 y4 . The summation index is a dummy index, in the sense that it can be replaced by any other index ranging over the same set without modifying the value of the expression. For instance, we have xk yk = xi yi , if k ranges over the set {1, 2, 3, 4}. Example 14.2. Using the summation convention, the definition of a determinant (Definition 5.1) can be written using the summation convention and the Levi-Civita symbols (introduced in Definition 1.11) as det(A) = i1 ···in a1i1 · · · anin . Example 14.3. The inner product of two vectors x, y ∈ Rn can be written as x y = xi yi . Example 14.4. Let yi = yi (x1 , . . . , xn ) be n functions which have continuous partial derivatives relative to x1 , . . . , xn . The Jacobian matrix J is ⎞ ⎛ ∂y1 ∂y1 ∂y1 ∂x1 ∂x2 · · · ∂xn ⎜ ∂y2 ∂y2 ∂y2 ⎟ ⎟ ⎜ ∂x1 ∂x2 · · · ∂x n⎟ ⎜ J =⎜ . ⎟. . . . .. .. .. ⎟ ⎜ .. ⎠ ⎝ ∂yn ∂yn ∂yn ∂x1 ∂x2 · · · ∂xn The determinant det(J) is the Jacobian of the transformation of Rn defined by the functions y1 , . . . , yn . The transformation is locally bijective on an open subset U of Rn if and only if det(J) = 0 at each point of U . When det(J) = 0 and the functions y1 , . . . , yn have continuous partial derivatives of second order in U , then the transformation is called an admissible change of coordinates. If zk = zk (y1 , . . . , yn ) are n functions (for 1 k n) which have continuous partial derivatives relative to y1 , . . . , yn , we derive
Tensors and Exterior Algebras
829
the following link between partial derivatives written by using the summation convention: ∂zk ∂yj ∂zk = . ∂xi ∂yj ∂xi 14.3
Tensor Products of Linear Spaces
Recall that multilinear functions were introduced in Definition 2.27. Definition 14.1. Let V and W be finite-dimensional linear spaces. Their tensor product is a linear space V ⊗W equipped with a bilinear map ⊗ : V × W −→ V ⊗ W such that for any linear space U and bilinear mapping f : V × W −→ U , there exists a unique linear mapping g : V ⊗ W −→ U such that the diagram f V ×W
U
⊗ g V ⊗W
is commutative. The linear space V ⊗ W is the tensor product of the linear spaces V and W , its elements are referred to as tensors, and the elements in Im(⊗), the image of the bilinear function ⊗, are said to be the decomposable tensors. The existence of the mapping g is the universal property of the tensor product. It means that any bilinear function f : V × W −→ U can be obtained by applying a linear function g to the special bilinear function ⊗. We defined the tensor product of two linear spaces as a new linear space that satisfies a certain property. As we shall see, if such a linear space exists, then it is unique. However, it is incumbent on us to show the existence of the tensor product. We will do that later in this section (see Theorem 14.1).
830
Linear Algebra Tools for Data Mining (Second Edition)
The bilinearity of the ⊗ mapping implies the following axioms for the tensor product: • Distributivity of tensor product: for any vectors v, v 1 , v 2 in V and any vectors w, w 1 , w 2 in W , we have v ⊗ (aw1 + bw2 ) = a(v ⊗ w1 ) + b(v ⊗ w2 ),
(14.1)
(av 1 + bv 2 ) ⊗ w = a(v 1 ⊗ w) + b(v 2 ⊗ w);
(14.2)
• Existence of basis in tensor product: if {v 1 , . . . , v m } is a basis in V and {w 1 , . . . , w m } is a basis in W , then {uij = v i ⊗ wj | 1 i m, 1 j n} is a basis in U = V ⊗ W . If v = 0V or w = 0W , then v ⊗ w = 0V ⊗W . By taking a = 1, b = −1, and w1 = w2 in Equality (14.1), we obtain v ⊗ 0W = 0V ⊗W for every v ∈ V . Similarly, 0V ⊗ w = 0V ⊗W . Thus, if V, W are finite-dimensional vector spaces having the bases {v 1 , . . . , v m } and {w1 , . . . , w m }, respectively, and {uij = v i ⊗ wj | 1 i m, 1 j n} is a basis in U = V ⊗W , then, if x = xi v i ∈ V and y = y j wj , the unique tensor product that satisfies the axioms has the components xi y j in the basis {uij | 1 i m, 1 j n}. tensor space V ⊗ W is the set of all sums of the form The i j aij v i ⊗w j which have a finite number of non-zero coefficients. We will refer to the set of coefficients aij as a tensor. Example 14.5. Suppose that V and W are the unidimensional linear space R. The direct product R × R consists of all pairs (u, v) ∈ R × R. Addition of (u1 , v1 ) and (u2 , v2 ) in R × R is defined by (u1 , v1 ) + (u2 , v2 ) = (u1 + u2 , v1 + v2 ) and scalar multiplication is defined as a(u, v) = (au, av) for a ∈ R and (u, v) ∈ R × R. The tensor product R ⊗ R also consists of pairs (u, v), where u, v ∈ R are denoted as u ⊗ v. Scalar multiplication in the tensor space is defined as a(u ⊗ v) = (au) ⊗ v = u ⊗ (av) for a, u, v ∈ R. Addition of two pairs u1 ⊗ v1 and u2 ⊗ v2 in R ⊗ R is denoted as u1 ⊗ v1 + u2 ⊗ v2 and these pairs interact only when one of their components are the same, that is, u1 ⊗ v + u2 ⊗ v = (u1 + u2 ) ⊗ v, u ⊗ v1 + u ⊗ v2 = u ⊗ (v1 + v2 ).
Tensors and Exterior Algebras
831
In other words, addition in R ⊗ R is symbolic unless one of the components is the same in both terms. For instance, we can write (u ⊗ 5v) + (2u ⊗ v) = 5(u ⊗ v) + 2(u ⊗ v) = 7(u ⊗ v). Next, we prove that the tensor product of linear spaces introduced above exists and is unique up to an isomorphism and this will be done in several steps. Theorem 14.1. Let V and W be two finite-dimensional linear spaces. If V ⊗ W exists, then the linear spaces Hom(V ⊗ W, U ) and M(V × W, U ) are isomorphic. Proof. Let h : V ⊗ W −→ U be a linear mapping in Hom(V ⊗ W, U ). Then h⊗ : V × W −→ U is bilinear. Conversely, if f ∈ M(V × W, U ), there is a unique linear map g : V ⊗W −→ U such that g(v⊗w) = f (v, w) for all (v, w) ∈ V ×W .
Theorem 14.2. If the tensor products V ⊗1 W and V ⊗2 W of two linear spaces V and W exist, then they are isomorphic. Proof. Suppose that V ⊗1 W and V ⊗2 W are both tensor products of the linear spaces V and W . By the universal property of the tensor product of linear spaces, there exist unique linear maps g1 : V ⊗2 W −→ V ⊗1 W and g2 : V ⊗1 W −→ V ⊗2 W that make the diagrams
⊗1
V ⊗1W
V ×W
⊗2
⊗2
⊗1 g1
V ⊗2 W
V ⊗2W
V ×W
g2 V ⊗1 W
commutative. Define the linear mappings h1 = g1 g2 : V ⊗1 W −→ V ⊗ V1 and h2 = g2 g1 : V ⊗2 W −→ V ⊗ V2 which make the diagrams
Linear Algebra Tools for Data Mining (Second Edition)
832
⊗1
V ⊗1W
V ×W
⊗2 V ⊗2W
⊗2
V ⊗2W
V ×W
⊗1 h1
V ⊗1W
h2
commute. The identity maps i1 : V ⊗1 W −→ V ⊗1 W and i2 : V ⊗2 W −→ V ⊗2 W are linear and also make the diagrams commute. By the uniqueness hypothesis, we have h1 = i1 and h2 = i2 , so g1 and g2 are inverses to each other and this proves that V ⊗1 W and V ⊗2 W are isomorphic. Theorem 14.3. Let V, W, and U be real finite-dimensional linear spaces. There exists a finite-dimensional real linear space T and a bilinear mapping t : V × W −→ T denoted by t(u, v) = u ⊗ v satisfying the following properties: (i) for every bilinear mapping f : V × W −→ U, there exists a unique linear mapping g : T −→ U such that f (v, w) = g(t(v, w)); (ii) if {v 1 , . . . , v m } is a basis in V and {w 1 , . . . , wn } is a basis in W, then {(v i ⊗ wj ) | 1 i m, 1 j n} is a basis in T, hence dim(T ) = dim(V ) dim(W ). Proof. In this proof, we will use the summation convention. For each pair (i, j) ∈ {1, . . . , m} × {1, . . . , n}, let tij be a symbol. Define T to be the real linear space that consists of all formal linear combinations with real coefficients aij tij of the symbols tij . Define the bilinear mapping t : V × W −→ T by t(v i , w j ) = tij and denote t(v i , w j ) as v i ⊗ wj . This mapping is extended to V × W as a bilinear mapping. Thus, if v = ai v i and w = bj wj , we define v ⊗ w as being the element of T given by v ⊗ w = t(v, w) = ai bj tij . Suppose now that f : V × W −→ U is an arbitrary bilinear map. Since every element of T is a linear combination of tij , we can define a unique linear transformation g : T −→ U as g(tij ) = f (vi , wj ).
Tensors and Exterior Algebras
833
The bilinearity of f and the linearity of g imply
f (v, w) = f ai v j , bj wj = ai bj f (v i , w j ) = ai bj g(tij ) = g(ai bj tij ) = g(u ⊗ v) = g(t(u, v)). This proves the existence and uniqueness of g such that f = gt as claimed in the first part of the statement. ˆm} We show now that for arbitrary bases in V and W , {ˆ v1, . . . , v ˆ n }, the set {ˆ ˆ j | 1 i m, 1 j n} is a ˆ 1, . . . , w vi ⊗ w and {w basis in T . ˆ i and w = ˆbj w ˆ j , by the bilinearity of ⊗, we have For v = a ˆi v ˆ j ), vi ⊗ w v⊗w =a ˆiˆbj (ˆ ˆ j span T . If these elements ˆi ⊗ w which shows that the mn elements v were linearly dependent, we would have dim(T ) < mn, which is a ˆ j | 1 i m, 1 j n} is a basis contradiction. Thus, {ˆ vi ⊗ w for T . The space T defined above is the tensor product of V and W and is denoted by V ⊗ W . Corollary 14.1. Let V, W be two real linear spaces. If v = 0V and w = 0W , then v ⊗ w = 0V ⊗W . Proof. Let BV and BW be bases of V and W that contain v and w, respectively. Then v ⊗ w is a member of a basis of V ⊗ W , hence v ⊗ w = 0V ⊗W . Let V, W be two real linear spaces with dim(V ) = m and dim(W ) = n having the bases {v i | 1 i m} and {w j | 1 j n}, respectively. The equalities n m
aij (v i ⊗ wj )
i=1 j=1
=
n i=1
=
m j=1
⎛ ⎞⎞ m ⎝v i ⊗ ⎝ aij wj ⎠⎠ ⎛
j=1 n i=1
aij v i
⊗ wj
,
834
Linear Algebra Tools for Data Mining (Second Edition)
imply that an element of V ⊗ W can be expressed in many ways as sums of tensor products of vectors in V and W . Theorem 14.4. Let V, W be two real linear spaces. The linear spaces V ⊗ W and W ⊗ V are isomorphic. Proof. Define the bilinear mapping h : V ⊗ W −→ W ⊗ V as h(v, w) = w ⊗ v for v ∈ V and winW . By the universal property, there exists a unique linear mapping g : V ⊗ W −→ W ⊗ V that makes the diagram h
W ⊗V
V ×W
⊗ g V ⊗W
commutative. Since the set {w ⊗ v | v ∈ V, w ∈ W } spans the space W ⊗ V , the function g is surjective and, therefore, it is an isomorphism. Theorem 14.5. Let V, W be two real linear spaces. The linear spaces V ∗ ⊗ W and Hom(V, W ) are isomorphic. Proof. Let h : V ∗ × W −→ Hom(V, W ) be defined as h(f , w)(v) = f (v)w for f ∈ V ∗ , v ∈ V , and w ∈ W . Since h is bilinear, it induces a linear mapping g : V ∗ ⊗W −→ Hom(V, W ) by the universal property such that g(f ⊗ w) = h(f , w) for f ∈ V ∗ , and w ∈ W . In other words, the diagram h V∗×W
U
⊗ g V
is commutative.
∗
⊗W
Tensors and Exterior Algebras
835
By Supplement 23 of Chapter 2, mappings of the form h(f , w) span Hom(V, W ), which implies that g is an isomorphism. Theorem 14.6. Let V, W, X, Y be R-linear space and let f ∈ Hom(V, W ) and h ∈ Hom(X, Y ). There is a unique linear map g = f ⊗ h : V ⊗ X −→ W ⊗ Y such that (f ⊗ h)(v ⊗ x) = f (v) ⊗ h(x) for v ∈ V and x ∈ X. Proof. Let : V × X −→ W ⊗ Y be the bilinear mapping defined by (v, x) = f (v) ⊗ h(x) for v ∈ V and x ∈ X. By the universal property of the tensor product, there is a unique linear mapping g : V ⊗ X −→ W ⊗ Y such that g(v ⊗ x) = (v, x) = f (v) ⊗ h(x) for v ∈ V and x ∈ X. Thus, g is the linear mapping f ⊗ h. Example 14.6. Let V, W, X, Y be four R-linear spaces such that dim(V ) = m, dim(W ) = n, dim(X) = p, dim(Y ) = q. If f ∈ Hom(V, W ) and h ∈ Hom(X, Y ), let Af ∈ Rn×m and Ah ∈ Rq×p be the matrices associated to these linear transformations and the bases {v 1 , . . . , v m }, {w 1 , . . . , wn }, {x1 , . . . , xp }, and {y 1 , . . . , y q }, in V, W, X, and Y , respectively. The matrices Af and Ah associated with the linear transformations f and h, respectively, are given by (Af )ij = (f (v i ))j for 1 i m and 1 j n, and (Ah )rs = (h(xr )s for 1 r p and 1 s q. By Theorem 14.3, a basis in the linear space V ⊗ X consists of mp tensors of the form v i ⊗ xj , while a basis in W ⊗ Y consists of nq tensors of the form wr ⊗ y s . The matrix Af ⊗h ∈ Rnq×mp associated with the linear transformation g = f ⊗ h : V ⊗ X −→ W ⊗ Y is given by (Af ⊗h )(ij),(rs) = (((f ⊗ h)(v i ⊗ xj ))rs , which is the Kronecker product Af ⊗ Ah of the matrices Af and Ah . The next theorem establishes the associativity of the tensor product of linear spaces. Theorem 14.7. Let V, W , and U be linear spaces. The tensor spaces V ⊗ (W ⊗ U ) and (V ⊗ W ) ⊗ U are isomorphic.
836
Linear Algebra Tools for Data Mining (Second Edition)
Proof. For a fixed u ∈ U , the mapping fu : V × W −→ V ⊗ (W ⊗ U ) given by fu (v, w) = v ⊗ (w ⊗ u) is bilinear because fu (v, w 1 + w2 ) = v ⊗ (w1 ⊗ u + w2 ⊗ u) = v ⊗ (w1 ⊗ u) + v ⊗ (w2 ⊗ u) = fu (v, w 1 ) + fu (v, w 2 ). By the universal property of tensor products, there is a unique linear mapping fu : V ⊗ W −→ V ⊗ (W ⊗ U ) such that fu = fu h, where h : V ×W −→ V ⊗W is the bilinear mapping h(v, w) = v ⊗w for v ∈ V and w ∈ W . In other words, the diagram fu
V ⊗ (W ⊗ U )
V ×W h V ⊗W
fu
is commutative. For (v, w) ∈ V × W , we have fu (v, w) = fu h(v, w) = fu (v ⊗ w), which implies fu (v, w) = v ⊗ (w ⊗ u) due to the definition of fu . such that for and fau Since u is arbitrary, there are morphisms fu+t any (v, w) ∈ V × W , we have (v, w) = v ⊗ (w ⊗ (u + t)) fu+t
= v ⊗ (w ⊗ u + w ⊗ t) = v ⊗ (w ⊗ u) + v ⊗ (w ⊗ t), which amounts to (v, w) = fu (v, w) + ft (v, w) fu+t
Tensors and Exterior Algebras
837
and fau (v, w) = v ⊗ (w ⊗ au)
= v ⊗ (a(w ⊗ u)) = a(v ⊗ (w ⊗ u)), that is (v, w) = afu (v, w). fau
Therefore, the function φ : (V ⊗ W ) × U −→ V ⊗ (W ⊗ U ) given by φ((v, w), u) = fu (v, w) is bilinear. By the universal property of the tensor product (V ⊗ W ) ⊗ U , φ induces a unique morphism φ : (V ⊗ W ) ⊗ U −→ V ⊗ (W ⊗ U ) such that φ ((v ⊗ w) ⊗ u) = v ⊗ (w ⊗ u), which is reflected in the following commutative diagram: (V ⊗W ) × U
φ
V ⊗(W ⊗U )
⊗ (V ⊗ W )⊗ U
φ
Similarly, there is a unique morphism ψ : V ⊗ (W ⊗ U ) −→ (V ⊗ W ) ⊗ U such that ψ (v ⊗ (w ⊗ u)) = (v ⊗ w) ⊗ u. It is immediate that ψ φ and 1((V ⊗W )⊗U ) coincide on the generators of (V ⊗ W ) ⊗ U , that is ψ φ = 1((V ⊗W )⊗U . Similarly, we have φ ψ = 1(V ⊗(W ⊗U ) . Thus, φ and ψ are inverse of each other, which means that they are isomorphisms. The associativity of the tensor product of linear spaces allows us to write V ⊗ W ⊗ U instead of V ⊗ (W ⊗ U ) or (V ⊗ W ) ⊗ U for
838
Linear Algebra Tools for Data Mining (Second Edition)
any linear spaces V, W, U . Also, taking into account Theorem 14.4, we could freely change the order in which we list the linear spaces in any tensor product. Example 14.7. Let {e1 , e2 , e3 } be the standard basis for R3 and let {e1 , e2 } be the standard basis for R2 . A basis for R3 ⊗ R2 consists of the tensors e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e1 , e2 ⊗ e2 , e3 ⊗ e1 , e3 ⊗ e2 . Let U be an R-linear space and let f : R3 × R2 −→ U be a bilinear mapping. Define the mapping g : R3 ⊗ R2 −→ U by g(ei ⊗ ej ) = f (ei , ej ) for i ∈ {1, 2, 3} and j ∈ {1, 2}. Since f is a bilinear mapping, g is well-defined and g(u ⊗ v) = f (u, v). Theorem 14.3 can be extended to the case of m linear spaces as follows. If V1 , . . . , Vm are m real and finite-dimensional spaces, there exists a linear space V1 ⊗ · · · ⊗ Vm and a mapping ⊗ : V1 × · · · × Vm −→ V1 ⊗· · ·⊗Vm such that for any linear space U and multilinear mapping f : V1 ×· · ·×Vm −→ U , there exists a unique linear mapping g : V1 ⊗ · · · ⊗ Vm −→ U such that the diagram f V1 × · · · × Vm
U
⊗ g V1 ⊗ · · · ⊗ Vm
is commutative. Recall the definition of tensors that we introduced in Chapter 2: Definition 14.2. A tensor t over the linear space V is a multilinear function t : V ⊗p1 ⊗ V ∗⊗q1 ⊗ · · · ⊗ V ⊗pk ⊗ V ∗⊗q −→ R. In this case, we say that the pair (p, q) = (p1 + · · · + pk , q1 + · · · + q ) is the type t while p + q is the valence of t.
Tensors and Exterior Algebras
839
Furthermore, we say that t is contravariant of order p and covariant of order q. The valence of a tensor is the total number of arguments of a tensor regarded as a multilinear function. Example 14.8. The tensors in V ⊗ V ∗ ⊗ V ∗ −→ R have valence 3 and type (1, 2). In other words, such tensors are 1-contravariant and 2-covariant. If dim(V ) = m, a tensor t in this space can be written i as t = t e ⊗ f j ⊗ f k and each of the indices i, j, k varies between jk i 1 and m. This tensor is once contravariant and twice covariant. It is desirable to avoid writing the tensor indices directly underneath each other as in tijk because when these indices are lowered or raised (as we will see that it is necessary sometimes), then it is difficult to put back these indices in the places they left. Note that i · · · ip1 t=t 1
j1 · · · jq
h1 · · · hpr
1 . . . k
ei1 ⊗ · · · ⊗ eip1
⊗f j1 ⊗ · · · ⊗ f jq1 ⊗ · · · ⊗ f 1 ⊗ · · · ⊗ f qk ⊗ eh1 ⊗ · · · ⊗ ehpr . is a tensor in the product space V ⊗p1 ⊗ V ∗⊗q1 ⊗ V ∗⊗qk ⊗ V ⊗pr . Taking into account the commutativity and associativity of tensor products of linear spaces, we can reformulate the definition of tensors. Definition 14.3. Let V be a real linear space and let p, q ∈ N. A tensor of order (p, q) on V is a multilinear mapping t, where ∗ · · × V ∗ −→ R. t:V · · × V × V × · × · p
q
Also, we refer to a tensor t defined as above as a p-contravariant and q-covariant tensor. If p = q = 0, t is a member of R. If q = 0, t is a p-contravariant tensor, and if p = 0, t is a q-covariant tensor. If (p, q) = (0, 0), then t is a mixed tensor. The set of tensors of order (p, q) on V is denoted as Tpq (V ). Similarly, the set of p-contravariant tensors will denoted by Tp (V ) and the set of q-covariant tensors will be denoted by Tq (V ).
840
Linear Algebra Tools for Data Mining (Second Edition)
If dim(V ) = n, then dim(Tpq ) = np+q because the dimension of a tensor product is the product of the dimensions of the factors. If B = {e1 , . . . , ep } is a basis of V and {f 1 , . . . , f n } is its dual basis in V ∗ , then the set {ei1 ⊗ · · · ⊗ eip ⊗ f j1 ⊗ · · · ⊗ f jq | 1 i n, 1 jm n} is a basis of Tpq called the standard basis corresponding to B. If the element ei1 ⊗ · · · ⊗ eip ⊗ f j1 ⊗ · · · ⊗ f jq of the standard i · · · ip , then a tensor z ∈ Tpq can be basis is denoted by t 1 j1 · · · jq expressed uniquely as z=
j ···j q z 1
The np+q numbers z
i1 · · · ip
t
i1 · · · ip
j1 · · · jq
.
j1 · · · jq
are the components of z relai1 · · · ip tive to the standard basis. The upper indices of the components z correspond to V and the lower indices correspond to V ∗ .
Example 14.9. Let V be a real linear space and let f 1 , . . . , f n ∈ V ∗ . Define gf 1 ,...,f n : V n −→ R as the multilinear function gf 1 ,...,f n (x1 , . . . , xn ) = f 1 (x1 ) · · · f n (xn ) for xi ∈ V and 1 i n. The multilinear function gf 1 ,...,f n : V ×n −→ R is denoted by f 1 ⊗ · · · ⊗ f n , and is an n-contravariant tensor. Example 14.10. If xi ∈ Vi for 1 i n, then the multilinear function x1 ,...,xn : (V ∗ )n −→ R, defined by x1 ,...,xn (f 1 , . . . , f n ) = f 1 (x1 ) · · · f n (xn ) for f i ∈ V ∗ for 1 i n belonging to M(V ∗ , . . . , V ∗ ; R), is denoted by x1 ⊗ · · · ⊗ xn , and is a n-covariant tensor. Let B = {e1 , . . . , en } be a basis in the linear space V and let ˜ = {f 1 , . . . , f n } be its dual basis in V ∗ . Define the new bases B = B
Tensors and Exterior Algebras
841
˜ = {f 1 , . . . , f n } in V and V ∗ , respectively, {e1 , . . . , en } and B using the equalities ei = aji ej and f i = aij f j as follows:
ei = aii ei , ei = aii ei
f j = ajj f j , f j = ajj f j . Of course, summation indices can be changed consistently without affecting the correctness of these formulas. Recall that if V is a real linear space and V ∗ is its dual, the vectors of V are said to be contravariant, while those of V ∗ are said to be covariant for reasons that were discussed in Section 3.7 If {e1 , . . . , em } is a basis in the linear space V, a contravariant vector x ∈ V can be written (with the summation convention) as x = xi ei . If {f 1 , . . . , f n } is a basis in V ∗ , a covariant vector y ∈ V ∗ can be written as y = yj f j . Let t be the tensor t=t
i1 · · · ih
j1 · · · jk
ei1 ⊗ · · · ⊗ eih ⊗ f j1 ⊗ · · · f jk .
Applying the change of bases, we have t=t
i1 · · · ih
i
i
j1 · · · jk
ai11 · · · aihh ajj1 · · · ajjk ei1 ⊗· · ·⊗eih ⊗f j1 ⊗· · ·⊗f jk . 1
k
i · · · ih Therefore, the components t˜ 1
of t relative to the bases j1 · · · jk {e1 , . . . , en } and {f , . . . , f } are given by 1
i · · · ih t˜ 1
j1 · · · jk
=t
n
i1 · · · ih
i
j1 · · · jk
i
ai11 · · · aihh ajj1 · · · ajjk . 1
k
(14.3)
Similarly, we obtain the formula t
i1 · · · ih
j1 · · · jk
j i · · · ih j = aii1 · · · aiih aj11 · · · ajkk t˜ 1 1
h
j1 · · · jk
.
(14.4)
The operation of tensor products can be extended from vectors and covectors to tensors.
Linear Algebra Tools for Data Mining (Second Edition)
842
Definition 14.4. Let u ∈ Tpq (V ) and w ∈ Trs (V ) be two tensors. Their product is the tensor u ⊗ w ∈ Tp+r q+s (V ) defined as (u ⊗ w)(v 1 , . . . , v p+r , v 1 , . . . , v q+s ) = u(v 1 , . . . , v p , v 1 , . . . , v q )w(v p+1 , . . . , v p+r , v q+1 , . . . , v q+s ) for v 1 , . . . , v p+r ∈ V ∗ and v 1 , . . . , v q+s ∈ V . Thus, the tensor product is an operation ⊗ : Tpq (V ) × Trs (V ) −→ Tp+r q+s (V ). The tensor product can be extended to any finite number of tensor spaces, +···+pk (V ), ⊗ : Tpq11 (V ) × · · · × Tpq11+···+q k
such that ⊗(t1 , . . . , tk ) = t1 ⊗ · · · ⊗ tk , where t ∈ Tpq for 1 k. The extended operation is multilinear, associative, and distributive. 14.4
Tensors on Inner Product Spaces
Let V be an n-dimensional real inner product space and let B = {e1 , . . . , en } be a basis in V. In Section 6.15, we examined the link between the contravariant components xi of a vector x ∈ V relative to a basis B and the covariant components xi of the same vector x using the fundamental matrix G of the basis B. Let t be a tensor t = x1 ⊗ · · · ⊗ xq . If xirr is a contravariant compoi nent of xr , then we have ti1 ···iq = xi11 · · · xqq . Taking into account the links between contravariant and covariant components of vectors, we can write t
i1
i2 · · · iq
i
= xi11 xi22 · · · xqq i
= gi1 j xj1 xi22 · · · xqq (because xi11 = gi1 j xj1 ) = gi1 j t
ji2 · · · iq
.
Tensors and Exterior Algebras
843
Similarly, one obtains t
i1 · · · iq
= g i1 j t
j
i2 · · · iq
.
Repeated applications of the previous transformations allow us to write the following: t t
i1 i2 · · · iq i1 i2 · · · iq
= gi1 j1 gi2 j2 · · · giq jq t = g i1 j 1 g i2 j 2 · · · g iq j q t
j1 j2 · · · jq
j1 j2 · · · jq
, .
We can formulate now a tensoriality criterion for tensors in Euclidean spaces. Theorem 14.8. A collection of numbers ti1 i2 ···ip depending on the choice of basis forms a tensor if and only if ti1 i2 ···ip = gi1 i1 gi2 i2 · · · gip ip ti1 i2 ···ip , where B = {e1 , . . . , en }, gij = (ei , ej ) for 1 i, j n. Proof. Suppose that ti1 i2 ···ip are the components of a tensor t. Then, the numbers ti1 i2 ···ip are the coefficients of some multilinear form f , that is, ti1 i2 ···ip = f (ei1 , ei2 , . . . , eip ). The coefficients ti1 i2 ···ip in the new basis e1 , . . . , em are ti1 i2 ···ip = f (ei1 , ei2 , . . . , eip ). Since ei1 = gi1 i1 ei1 , . . . , eim = gim im eim , we can write ti1 i2 ···ip = f (gi1 i1 ei1 , gi2 i2 ei2 , . . . , gip ,ip ep ) = gi1 i1 gi2 i2 · · · gip ip ti1 i2 ···ip , because f is a multilinear form.
Linear Algebra Tools for Data Mining (Second Edition)
844
Conversely, suppose that ti1 i2 ···ip transform as indicated when switching to a new basis. Let x1 , . . . , xp be p vectors such that xi = xij ej for 1 i p. We must show that ti1 i2 ···ip xi1 j1 · · · xip jp is a multilinear form in x1 , . . . , xp , which means that it depends only on the vectors x1 , . . . , xp and not on the choice of the basis. In a basis B = {e1 , . . . , en }, the previous expression is ti1 i2 ···ip xi1 j1 · · · xip jp . The hypothesis of the theorem allows us to write the following: ti1 i2 ···ip xi1 j1 · · · xip jp = gi1 i1 gi2 i2 · · · gip ip ti1 i2 ···ip xi1 j1 · · · xip jp = gi1 i1 gi2 i2 · · · gip ip ti1 i2 ···ip xi1 j1 · · · xip jp = gi1 i1 gi2 i2 · · · gip ip ti1 i2 ···ip gi1 k1 xk1 j1 · · · gip kp xkp jp = gi1 i1 gi1 k1 gi2 i2 gi2 k2 · · · gip ip gip kp ti1 i2 ···ip xk1 j1 · · · xkp jp = δi1 k1 δi2 k2 · · · δip kp ti1 i2 ···ip xk1 j1 · · · xkp jp = ti1 i2 ···ip xi1 j1 · · · xip jp
because xi1 j1 = gi1 k1 xk1 j1 , . . . , xip jp = gip kp xkp jp .
14.5
Contractions
Theorem 14.9. Let t = v 1 ⊗ · · · ⊗ v p ⊗ f 1 ⊗ · · · ⊗ f q be a simple tensor in T pq , where p, q 1. There exists a unique linear mapping c : Tpq −→ Tp−1 q−1 (referred to as contraction) such that c(t) = f j (v i )v 1 ⊗ · · · ⊗ v i−1 ⊗ v i+1 · · · ⊗ v p ⊗f 1 ⊗ · · · ⊗ f j−1 · · · f j+1 ⊗ · · · ⊗ f q .
Tensors and Exterior Algebras
Proof.
845
Define the multilinear mapping f : V p × (V ∗ )q −→ Tp−1 q−1 as f (v1 , . . . , v p , f 1 , . . . , f q ) = f j (v i )v 1 ⊗ · · · ⊗ v i−1 ⊗ v i+1 · · · ⊗ v p ⊗f 1 ⊗ · · · ⊗ f j−1 · · · f j+1 ⊗ · · · ⊗ f q .
Let c : Tpq −→ Tp−1 q−1 be a linear mapping defined by its values on the basis of Tp as c(ei1 ⊗ · · · ⊗ eip ⊗ f j1 ⊗ f jq ) = f (e1 , . . . , ep , f 1 , . . . , f q ), and denote its extension to a linear mapping defined on Tpq also by c. Since c⊗ and f agree on the basis of V p × (V ∗ )q , the diagram f
Tp−1 q−1
V p × (V ∗ )q
⊗ Tpq
c
is commutative. The universal property of the tensor product implies the unique ness of this extension. Example 14.11. Let V be a real linear space and let v ∈ V and f ∈ V ∗ . Define the bilinear mapping φ : V × V ∗ −→ R as φ(v, f ) = f (v). By the universal property of a tensor product, φ can be factored as φ = c⊗, where c : T11 −→ R = T00 is a linear mapping. In other words, we have φ(v, f ) = f (v)v ⊗ f for all v ∈ V and f ∈ V ∗ . 14.6
Symmetric and Skew-Symmetric Tensors
Theorem 14.10. Let φ be a permutation in PERMr , and let V be a linear space.
846
Linear Algebra Tools for Data Mining (Second Edition)
There exists a unique linear mapping P φ : V ⊗r −→ V ⊗r such that P φ (v 1 ⊗ · · · ⊗ v r ) = v φ
−1 (1)
⊗ · · · ⊗ vφ
−1 (r)
.
The mapping P φ is an isomorphism of V ⊗r and P ψ P φ = P ψφ for every ψ, φ ∈ PERMr . Proof.
The mapping f : V · · × V −→ V ⊗r defined by × · r
f (v1 , . . . , v r ) = v φ
−1 (1)
⊗ · · · ⊗ vφ
−1 (r)
is an r-multilinear mapping. By the universal property of the tensor products, there exists a unique linear mapping P φ that makes the diagram V × ··· × V
f V ⊗r
r
⊗ Pφ V ⊗r
commutative, that is, P φ (v 1 ⊗ · · · ⊗ v r ) = v φ
−1 (1)
⊗ · · · vφ
−1 (r)
for v 1 , . . . , v r . Since P φ induces a permutation of the standard basis of V ⊗r , it is immediate that P φ is an isomorphism. We have P ψ P φ (v 1 ⊗ · · · ⊗ v r ) −1 (1)
⊗ · · · ⊗ vφ
−1 ψ −1 (1)
⊗ · · · ⊗ vφ
= P ψ (v φ = (v φ
= P ψφ (v 1 ⊗ · · · ⊗ v r ).
−1 (r)
)
−1 ψ −1 (r)
)
Tensors and Exterior Algebras
847
Example 14.12. Let φ, ψ ∈ PERM4 be the permutations given by 1 2 3 4 1 2 3 4 φ: and ψ : . 2 4 1 3 4 2 1 3 This implies φ−1 :
1 2 3 4 3 1 4 2
and ψ −1 :
1 2 3 4 3 2 4 1
and P φ (v 1 ⊗ v 2 ⊗ v3 ⊗ v 4 ) = v 3 ⊗ v1 ⊗ v 4 ⊗ v 2 , P ψ (v 1 ⊗ v 2 ⊗ v3 ⊗ v 4 ) = v 3 ⊗ v2 ⊗ v 4 ⊗ v 1 . Therefore, P ψ P φ (v 1 ⊗ v 2 ⊗ v 3 ⊗ v4 ) = P ψ (v 3 ⊗ v 1 ⊗ v 4 ⊗ v 2 ) = v 4 ⊗ v 1 ⊗ v2 ⊗ v 3 . Since ψφ :
1 2 3 4 1 2 3 4 , , and (ψφ)−1 : 4 1 2 3 2 3 4 1
it follows that P ψφ (v 1 ⊗ v 2 ⊗ v 3 ⊗ v4 ) = (v 4 ⊗ v1 ⊗ v 2 ⊗ v 3 ) = P ψ P φ (v 1 ⊗ v 2 ⊗ v 3 ⊗ v 4 ). Definition 14.5. A tensor t ∈ V ⊗r is symmetric if P φ (t) = t for every φ ∈ PERMr . A tensor t ∈ V ⊗r is skew-symmetric or alternating if P φ (t) = sign(φ)t for every permutation φ ∈ PERMr . The set of symmetric tensors in V ⊗r is denoted as SYMV,r . The set of skew-symmetric tensors in V ⊗r is denoted as SKSV,r . Both SYMV,r and SKSV,r are subspaces of V ⊗r . Definition 14.6. Let V, W be F-linear spaces and k be a positive integer. A multilinear map f : V k −→ W is alternating if it vanishes
848
Linear Algebra Tools for Data Mining (Second Edition)
whenever two arguments are equal, that is, f (. . . , x, . . . , x, . . .) = 0W for every x ∈ V . A multilinear mapping f : V k −→ W is skew-symmetric if the sign of f changes when two arguments are permuted, that is, f (. . . , x, . . . , y, . . .) = −f (. . . , y, . . . , x, . . .) for x, y ∈ V . The notions of alternating multilinear function and skewsymmetric function are identical, as the next statement shows. Theorem 14.11. Let V, W be F-linear spaces and k be a positive integer such that k 2. A multilinear mapping f : V k −→ W is alternating if and only if it is skew-symmetric. Proof. Let f be an alternating multilinear mapping. By the multilinearity of f , we can write f (. . . , x + y, . . . , x + y, . . .) = f (. . . , x, . . . , x, . . .) + f (. . . , x, . . . , y, . . .) +f (. . . , y, . . . , x, . . .) + f (. . . , y, . . . , y, . . .) = f (. . . , x, . . . , y, . . .) + f (. . . , y, . . . , x, . . .) = 0W , hence f is skew-symmetric. Conversely, if f is skew-symmetric, we have f (. . . , x, . . . , x, . . .) = −f (. . . , x, . . . , x, . . .), hence f (. . . , x, . . . , x, . . .) = 0W , which shows that f is alternating.
The set of alternating multilinear mappings from V k to W is a subspace of M(V, . . . , V ; W ). Theorem 14.12. Let V, W be F-linear spaces. A multilinear mapping f : V k −→ W is skew-symmetric if and only if f (xφ(1) , . . . , xφ(k) ) = sign(φ)f (x1 , . . . , xk ) for every φ ∈ PERMk and x1 , . . . , xk ∈ V .
Tensors and Exterior Algebras
849
Proof. Suppose that the condition of the theorem is satisfied. If φ is a transposition, 1 ··· i ··· j ··· k φ: , 1 ··· j ··· i··· k the equality of the theorem amounts to f (x1 , . . . , xi , . . . , xj , . . . , xk ) = −f (x1 , . . . , xj , . . . , xi , . . . , xk ), which shows that f is skew-symmetric. Conversely, if f is skew-symmetric, for each transposition ψ we have f (xψ(1) , . . . , xψ(k) ) = −f (x1 , . . . , xk ). If φ is a product of transpositions φ = ψ1 · · · ψr , then the k-tuple (xφ(1) , . . . , xφ(k) ) can be obtained from the k-tuple (x1 , . . . , xk ) by applying successively the transpositions ψ1 , . . . ψr . This implies f (xφ(1) , . . . , xφ(k) ) = sign(ψ1 ) · · · sign(ψr )f (x1 , . . . , xk ) = sign(φ)f (x1 , . . . , xk ).
Theorem 14.13. If v 1 ⊗ · · · ⊗ v r ∈ SKSV,r , then v 1 ⊗ · · · ⊗ v r = 0V,r if any two of the vectors v 1 , . . . , v r are equal. Proof.
For the transposition 1 ··· i ··· j ··· r φ: , 1 · · · j · · · i · · · r.
we have inv(φ) = 1. Suppose that the ith argument and the jth argument of t are both equal to v. Then, by the skew-symmetry of t, we have v 1 ⊗ · · · ⊗ v ⊗ · · · ⊗ v, · · · , v r ) = −v1 ⊗ · · · ⊗ v ⊗ · · · ⊗ v ⊗ · · · ⊗ v r , hence v 1 ⊗ · · · ⊗ v ⊗ · · · ⊗ v ⊗ · · · ⊗ v r = 0V,r .
Corollary 14.2. If {v 1 , . . . , v r } is a linearly dependent set of vectors in SKSV,r , then v 1 ⊗ · · · ⊗ v r = 0V,r .
850
Linear Algebra Tools for Data Mining (Second Edition)
Proof. Since {v 1 , . . . , v r } is a linearly dependent set, there exists a vector v i that can be expressed as a linear combination of the other vectors. Without loss of generality, we may assume that v 1 = a2 v 2 + · · · + ar v r . This implies v 1 ⊗ · · · ⊗ v r = (a2 v 2 + · · · + ar v r ) ⊗ v 2 ⊗ · · · ⊗ vr =
r
ai (v i ⊗ v 2 ⊗ · · · ⊗ v r ) = 0V,r .
i=2
Corollary 14.3. Let V be a real linear space with dim(V ) = n. If r > n, then V ⊗r = 0V,r . Proof. Since every subset of V that contains more than n = dim(V ) vectors is linearly dependent, the statement follows from Corollary 14.2. Define the linear mappings S V,r : V ⊗r −→ V ⊗r and AV,r : −→ V ⊗r as V 1 S V,r (t) = P φ (t), (14.5) r! ⊗r
φ∈PERMr
AV,r (t) =
1 r!
sign(φ)P φ (t).
(14.6)
φ∈PERMr
These mappings are referred to as the symmetrizer and the alternator on V ⊗r for reasons that will become apparent immediately. If V is clear from the context, these mappings will be denoted as S r and Ar , respectively. Theorem 14.14. For every permutation ψ ∈ PERMr we have P ψ S r = S r P ψ = S r and P ψ Ar = Ar P ψ = sign(ψ)Ar . Proof.
We have
⎛
P ψ S r (t) =
1 Pψ ⎝ r! ⎛
=
1 ⎝ r!
φ∈PERMr
φ∈PERMr
⎞ P φ (t)⎠ ⎞
P ψ P φ (t)⎠
Tensors and Exterior Algebras
⎛ =
⎞
1 ⎝ r!
851
P ψφ (t)⎠ .
φ∈PERMr
As φ runs through PERMr , the same holds for τ = ψφ, hence, P ψ S r (t) =
1 r!
P τ (t) = S r (t).
τ ∈PERMr
For the second equality of the theorem, we have ⎛ P ψ Ar (t) =
1 Pψ ⎝ r!
⎞ sign(φ)P φ (t)⎠
φ∈PERMr
=
=
=
1 r! 1 r! 1 r!
sign(φ)P ψφ (t)
φ∈PERMr
sign(φ)P ψφ (t)
φ∈PERMr
sign(ψ)sign(ψφ)P ψφ (t)
φ∈PERMr
= sign(ψ)
1 r!
sign(τ )P τ (t)
τ ∈PERMr
because sign(φ) = sign(ψ)sign(ψφ) and τ = ψφ runs through PERMr . The remaining equalities, S r P ψ = S r and Ar P ψ = sign(ψ)Ar , have similar arguments. Theorem 14.15. If t ∈ SKSV,r , then Ar (t) = t.
852
Linear Algebra Tools for Data Mining (Second Edition)
Proof. Since t ∈ SKSV,r , we have P φ (t) = sign(φ)t for every φ ∈ PERMr . This allows us to write 1 sign(φ)P φ (t) Ar (t) = r! φ∈PERMr
1 r!
=
sign(φ)2 t = t
φ∈PERMr
because the last sum contains r! terms and sign(φ)2 = 1.
Theorem 14.16. Ar (t) is a skew-symmetric tensor for every tensor t ∈ V ⊗r . Proof. We need to show that P ψ (Ar (t)) = sign(ψ)Ar (t) for every permutation ψ ∈ PERMr . By the definition of Ar , we have P ψ (Ar (t)) = =
=
1 r! 1 r! 1 r!
sign(φ)Pψ Pφ (t)
φ∈PERMr
sign(φ)Pψφ (t)
φ∈PERMr
sign(φ)Pψφ (t).
φ∈PERMr
Since sign(ψ)2 = 1, the last equality can be written as P ψ (Ar (t)) =
1 r!
sign(ψ)sign(ψφ)P ψφ (t)
φ∈PERMr
= sign(ψ)
1 r!
sign(ψφ)P ψφ (t).
φ∈PERMr
By Theorem 1.4, for a fixed ψ we have the equality {ψφ | φ ∈ PERMr } = PERMr , which allows us to write P ψ (Ar (t)) = sign(ψ)
1 r!
φ∈PERMr
sign(φ)P φ (t) = sign(ψ)Ar (t).
Tensors and Exterior Algebras
853
Theorem 14.17. The mappings S V,r and AV,r defined on V ⊗r are projections of V ⊗r onto the subspaces SYM(V, r) and SKSV,r of V ⊗r , respectively. Proof. We begin by proving that both S V,r and AV,r are idempotent. We have S V,r (S V,r (t)) = =
1 r! 1 r!
P φ S V,r (t)
φ∈PERMr
S V,r (t)
φ∈PERMr
(by Theorem 14.14) = S V,r (t) (because the previous sum contains r! terms), hence S V,r S V,r = S V,r . A similar computation leads to the same conclusion for AV,r : AV,r (AV,r (t)) = =
1 r! 1 r!
sign(φ)P φ AV,r (t)
φ∈PERMr
sign(φ)2 AV,r (t)
φ∈PERMr
(by Theorem 14.14) = AV,r (t) (because the previous sum contains r! terms and sign(φ)2 = 1). for all φ ∈ PERMr . Note that t ∈ SYMV,r implies P φ (t) = t Therefore, if t ∈ SYMV,r , we have S V,r (t) = r!1 φ∈PERMr P φ (t) = t. Conversely, if S V,r (t) = t, we have P φ (t) = P φ (S V,r (t)) = S V,r (t) = t for all σ ∈ PERMr and t ∈ SYMV,r .
Linear Algebra Tools for Data Mining (Second Edition)
854
The membership t ∈ SKSV,r is equivalent to P φ (t) = sign(φ)t for every φ ∈ PERMr . Thus, if t ∈ SKSV,r , we have AV,r (t) =
1 r!
sign(φ)P φ (t) =
φ∈PERMr
1 r!
sign(φ)2 t = t.
φ∈PERMr
Conversely, if AV,r (t) = t, we have P φ (t) = P φ AV,r (t) = sign(φ)AV,r (t) = sign(φ)t for all φ ∈ PERMr and t ∈ SKSV,r . Furthermore, we have S V,r (AV,r (t)) =
1 r!
1 = r!
P σ (AV,r (t))
σ∈PERMr
sign(φ) AV,r (t) = 0V ⊗r
σ∈PERMr
because σ∈PERMr sign(φ) = 0. A similar computation yields AV,r (S V,r (t)) = 0V ⊗r . Let t ∈ SYMV,r ∩ SKSV,r , we have t = S V,r (t), hence AV,r (k) = AV,r (S V,r (t)) = 0V ⊗r , hence AV,r S V,r = 0. Similarly, S V,r AV,r = 0V ⊗r . Example 14.13. Let V be a linear space and V 2 . We have 1 2 1 , φ1 : PERM2 = φ0 : 1 2 2
let t be a tensor in 2 1
and, therefore, we can write AV,2 (t) =
1 (P φ0 (t) − P φ1 (t)) . 2
Thus, if t is a simple tensor t = u ⊗ w ∈ V ∗2 , we have 1 AV,2 = (u ⊗ w − w ⊗ u). 2 Let V be a linear space with dim(V ) = n. For ei1 , . . . , eik ∈ V with i1 , . . . , ik ∈ {1, . . . , n}, following [30], we denote by ei1 · · · eik
Tensors and Exterior Algebras
the tensor S V,k (ei1 ⊗ · · · ⊗ eik ) =
1 k!
855
eiφ(1) ⊗ · · · ⊗ eiφ(k) .
φ∈PERMk
The factor ei1 · · · eik depends only on the number of times each ei enters this product, and we may write ei1 · · · eik = ep11 · · · epnn , where pi is the multiplicity of occurrence of ei in ei1 · · · eik (which may also be 0). Thus, the numbers p1 , . . . , pn are non-negative integers and p1 + · · · + pn = k. Theorem 14.18. Let {e1 , . . . , en } be a basis of a linear space V. Then {S V,k (ei1 ⊗ · · · ⊗ eik ) | 1 i1 · · · ik n}
. is a basis of SV,k . Furthermore, dim(SV,k ) = n+k−1 k Proof. Since B = {ei1 ⊗ · · · ⊗ eik | 1 i1 n, . . . , 1 ik n} is a basis for Tk (V ) and S V n ,k maps Tk (V n ) into SV,k , the set S(B) = {ei1 · · · eik | 1 i1 · · · ik n} = {ep11 · · · epnn | p1 + · · · + pn = k} spans SV,k . Vectors in SV,k are linearly independent because, if (p1 , . . . , pn ) = (q1 , . . . , qn ), then the tensors ep11 · · · epnn and eq11 · · · eqnn are linear combinations of two non-intersecting subsets of the basic elements of Tk (V ). By Supplement 12 of Chapter 1, the cardinality of S(B) is the number of combinations with repetition of n objects taken k at a time. Theorem 14.19. Let ti1 ,...,ip be a standard basis element of a tensor space Tp = V ⊗p . If ik = i for some 1 k < p, then AV,p (ti1 ,...,ip ) = 0V ⊗p . Proof. Let φ ∈ PERMp be the transposition that inverts the places of k and . Since ik = i , we have P φ (ti1 ,...,ip ) = ti1 ,...,ip . This implies AV,p (ti1 ,...,ip ) = AV,p (Pφ (ti1 ,...,ip )) = sign(φ)AV,p (ti1 ,...,ip ) = −AV,p (ti1 ,...,ip ), hence AV,p (ti1 ,...,ip ) = 0V ⊗p .
856
Linear Algebra Tools for Data Mining (Second Edition)
Corollary 14.4. If p > n, where n = dim(V ), then dim(V ⊗p ) = 0. Proof.
This follows from Theorem 14.19.
Theorem 14.20. Let V be an n-dimensional linear space. If p n, then {AV,p (ti1 ,...,ip ) | i1 < i2 < · · · < ip } is a basis of the linear space SKSV,p and dim(SKSV,p ) = np . Proof. By Theorem 14.19, we need to consider only those ti1 ,...,ip having all indices distinct. For φ, ψPERMp , we have tiφ(1) ,...,iφ(p) = tiψ(1) ,...,iψ(p) if and only if φ = ψ. Thus, we have AV,p (ti1 ,...,ip ) = 0V ⊗p . If the sets {i1 , . . . , ip } and {k1 , . . . , kp } contain the same elements, then AV,p (ti1 ,...,ip ) = ±AV,p (tk1 ,...,kp ). If the sets {i1 , . . . , ip } and {k1 , . . . , kp } are distinct, there are no common elements of the basis when we expend AV,p (ti1 ,...,ip ) and AV,p (tk1 ,...,kp ) as linear combinations of the elements of the basis. Therefore, the set of elements of the form AV,p (ti1 ,...,ip ) is linearly independent. Corollary 14.5. We have dim(SKSV,n ) = 1 and dim(SKSV,p ) = dim(SKSV,n−p ). Proof.
These equalities follow from Theorem 14.20.
Let t ∈ V ⊗p . We have ⎛ 1 t − S V,p (t) = ⎝p!t − p!
φ∈PERMp
⎞ 1 P φ (t)⎠ = p!
(t − P φ (t)) .
φ∈PERMp
If t ∈ ker S V,p , this implies t=
1 p!
(t − P φ (t)) .
φ∈PERMp
Let z = t − P φ (t). Then S V,p (z) = S V,p (t) − S V,p (t) = 0V ⊗p , so z ∈ ker(S V,p ). For φ ∈ PERMp and t ∈ V ⊗p , let w(t, φ) = t − P φ (t). We have S V,p (w(t, φ)) = S V,p (t) − S V,p (P φ (t)) = 0V ⊗p , hence w(t, φ) ∈ ker(S V,p ). Thus, if W is the subspace generated by {t − P φ | t ∈ V ⊗p and φ ∈ PERMp }, we have ker(S V,p ) = W .
Tensors and Exterior Algebras
14.7
857
Exterior Algebras
In this section, we introduce exterior algebra also known as Grassmann1 algebra, a construct that makes use of a new concept known as wedge product. Definition 14.7. Let V be a linear space and let t = v 1 ⊗ · · · ⊗ v r be a simple tensor in Tr (V ). The wedge product of the vectors v 1 , . . . , v r is the tensor r!AV,r (t) denoted as v1 ∧ · · · ∧ v r . Example 14.14. The wedge product of the vectors v 1 , v 2 ∈ V is the tensor v 1 ∧ v 2 = 2!AV,r (v 1 ⊗ v2 ) = v1 ⊗ v2 − v2 ⊗ v1. The following properties are immediate: (i) v 1 ∧ v2 = −v 2 ∧ v 1 , (ii) (av 1 ) ∧ v 2 = a(v 1 ∧ v2 ), (iii) (u1 + u2 ) ∧ v = (u1 ∧ v) + (u2 ∧ v), for every u1 , u2 , v 1 ,v 2 , and v ∈ V . By Property (i), we have u ∧ u = 0V ⊗2 , for every u ∈ L. A tensor of the form u ∧ v is also referred to as a bivector. In general, we have v 1 ∧ · · · ∧ vr =
sign(φ)P φ (v φ
−1 (1)
⊗ · · · ⊗ vφ
−1 (r)
)
φ∈PERMr
=
sign(φ)P φ (v φ(1) ⊗ · · · ⊗ v φ(r) ),
φ∈PERMr
because sign(φ) = sign(φ−1 ). The next statement presents a universal property of the wedge product. 1 Hermann G¨ unter Grassmann was born in Stettin in 1809 and died in the same place in 1877. He was a mathematician and linguist and he made important contributions to an algebraic approach to geometry.
Linear Algebra Tools for Data Mining (Second Edition)
858
Theorem 14.21. Let V and W be two linear spaces and let f : V · · × V −→ W be a skew-symmetric multilinear function. There × · r
exists a unique linear transformation d : SKSV,r −→ W such that the diagram V × ··· × V
f W
r
A V,r d
SKSV,r
is commutative. Proof. By the universal property of tensor products, there exists a linear mapping c : V ⊗r −→ W such that f = c⊗. Since f is skew-symmetric, it follows that c(v φ(1) ⊗ · · · ⊗ v φ(r) ) = f (v φ(1) , . . . , v φ(r) ) = sign(φ)f (v 1 , . . . , v k ) (by Theorem 14.12) = sign(φ)c(v 1 ⊗ · · · ⊗ v r ) for all simple tensors v 1 ⊗ · · · ⊗ v r ∈ V ⊗r and all φ ∈ PERMr . Therefore, c(v 1 ⊗ · · · ⊗ v r ) = sign(φ)c(v φ(1) ⊗ · · · ⊗ v φ(r) ). Summing up over all permutations in PERMr , we obtain r!c(v 1 ⊗ · · · ⊗ v r ) =
sign(φ)c(v φ(1) ⊗ · · · ⊗ v φ(r) )
φ∈PERMr
= r!f (v1 , . . . , v r ).
Tensors and Exterior Algebras
V × ··· × V
f
W
r
⊗
859
c
V ⊗r d
A V,r SKSV,r
Define the mapping d : SKSV,r −→ W as d(t) = c(t) for t ∈ SKSV,r , that is, the restriction of c to the subspace SKSV,r , that is, d = r!1 c SKSV,r . It is clear that d makes commutative the above diagram. Simple tensors of the form v 1 ∧ · · · ∧ v r ∈ SKSV,r generate the subspace SKSV,r due to the fact that Tr (V ) is generated by tensors of the form v1 ⊗ · · · ⊗ v r and AV,r is the projection of Tr (V ) on SKSV,r . Thus, d is unique because its values on SKSV,r are uniquely determined by the condition d(v 1 ∧ · · · ∧ v r ) = f (v1 , . . . , v r ). Let t1 ∈ SKSV,p1 and t2 ∈ SKSV,p2 , the tensor t1 ∧ t2 is defined as t1 ∧ t2 =
p1 + p2 AV,p1 +p2 (t1 ⊗ t2 ). p1
(14.7)
Equivalently, we have t1 ∧ t2 = =
p1 + p2 1 p1 (p1 + p2 )!
1 p1 !p2 !
sign(φ)P φ (t1 ⊗ t2 )
φ∈PERMp1 +p2
sign(φ)P φ (t1 ⊗ t2 ).
φ∈PERMp1 +p2
Definition 14.8. The Grassmann algebra or the exterior algebra of order n of a linear space V is the algebra (V ) defined as a direct sum of the subspaces SKSV,p for 0 p n, where the product t1 ∧ t2 of t1 ∈ SKSV,p1 and t2 ∈ SKSV,p2 is defined as AV,p1+p2 and this operation is extended to (V ) by linearity.
Linear Algebra Tools for Data Mining (Second Edition)
860
If t1 = u1 ⊗ · · · ⊗ up1 and t2 = v 1 ⊗ · · · ⊗ v p2 , we have t1 ⊗ t2 = u1 ⊗ · · · ⊗ up1 ⊗ v 1 ⊗ · · · ⊗ v p2 , t2 ⊗ t1 = v 1 ⊗ · · · ⊗ v p2 ⊗ u1 ⊗ · · · ⊗ up1 . Theorem 14.22. Let t ∈ SKSV,p and s ∈ SKSV,q . We have: s ∧ t = (−1)pq t ∧ s. Proof. Let θ ∈ PERMp+q be the permutation introduced in Exercise 7 of Chapter 1, 1 2 ··· p p + 1 ··· p + q θ: . p + 1 p + 2 ··· p + q 1 ··· q Recall that inv(θ) = (−1)pq . Let t = ti1 ···ip ei1 ⊗ · · · ⊗ eip and s = sj1···jq ej1 ⊗ · · · ⊗ ejq . We have P θ (t ⊗ s) = s ⊗ t. Therefore, t∧s = s∧t = =
=
1 p1 !p2 ! 1 p1 !p2 ! 1 p1 !p2 ! 1 p1 !p2 !
= signθ
= signθ
sign(φ)P φ (t ⊗ s)
φ∈PERMp+q
sign(φ)P φ (s ⊗ t)
φ∈PERMp+q
sign(φ)P φ P θ (t ⊗ s)
φ∈PERMp+q
sign(φ)signθ 2 P φθ (t ⊗ s)
φ∈PERMp+q
1 p1 !p2 ! 1 p1 !p2 !
= (−1)pq t ∧ s.
sign(φθ)P φθ (t ⊗ s)
φ∈PERMp+q
sign(ψ)P ψ (t ⊗ s)
ψ∈PERMp+q
Tensors and Exterior Algebras
861
Corollary 14.6. If t ∈ SKSV,p and p is an odd number, then t ∧ t = 0V,p . 2
Proof. By Theorem 14.22, we have t ∧ t = (−1)p t ∧ t. Since p is an odd number, so is p2 , hence t ∧ t = 0V,p . Let ti ∈ SKSV,pi for 1 i n be n tensors. Define the tensor t1 ∧ t2 ∧ · · · ∧ tn as t1 ∧ t2 ∧ · · · ∧ tn =
(p1 + p2 + · · · + pn )! AV,p1 +p2 +···pn (t1 ⊗ t2 ⊗ · · · ⊗ tn ). p1 !p2 ! · · · pn !
The wedge product of tensors is multilinear. For simple tensors, we have the associativity property: (v 1 ∧ · · · ∧ v p ) ∧ ((u1 ∧ · · · ∧ uq ) ∧ (w 1 ∧ · · · wr )) = ((v 1 ∧ · · · ∧ vp ) ∧ (u1 ∧ · · · ∧ uq )) ∧ (w1 ∧ · · · wr ), Since SKSV,r is a subspace of Tr and the simple tensors ei1 · · · ⊗ · · · eir form a base of Tr , it follows that every t ∈ SKSV,r can be written as t = ti1 ···ir ei1 · · · ⊗ · · · eir . This implies t = AV,r (t) = ti1 ···ir AV,r (ei1 ⊗ · · · eir ) =
1 ti ···i ei1 ∧ · · · ∧ eir . r! 1 r
These equalities show that the set of simple skew-symmetric tensors of the form ei1 ∧· · ·∧eir span the subspace SKSV,r . However, the set of simple skew-symmetric tensors of SKSV,r does not form a basis of this space because it is not linearly independent. Indeed, if φ ∈ PERMr , we have eiφ(1) ∧ · · · ∧ eiφ(1) = sign(φ)ei1 ∧ · · · ∧ eir . To identify a basis of SKSV,r , we introduce the notation t[i1 ···ir ] for the component ti1 ···ir that satisfies the restriction i1 < · · · < ir ; similarly, we denote by c[i1 ···ir ] the fact that i1 < · · · < ir . Theorem 14.23. If V is a linear space that has a d-element spanning subset, then SKSV,k = {0V,k } for k > d. Proof. Indeed, suppose that {x1 , . . . , xd } spans V. When k > d and xi1 ∧ · · · ∧ xik contains two equal factors, so it is zero. Thus, SKSV,k is spanned by zero, so SKSV,k = {0SKSV,k }.
862
Linear Algebra Tools for Data Mining (Second Edition)
Next, we present a universal property for SKSV,2 , where V is a linear space. Theorem 14.24. Let V be a real linear space. There exists a linear space W and an alternating multilinear mapping f : V × V −→ W such that if ψ is the exterior multiplication ψ : V × V −→ SKSV,2 , then f factors uniquely as a composition f = gψ, where g : SKSV,2 −→ W is a linear mapping. In other words, there exists a unique linear mapping g that makes the diagram f V ×V
W
ψ g SKSV,2
commutative. Proof. Recall that FREE(V × V ) is the set of all maps over V × V with values in R that are 0 everywhere with the exception of a finite number of pairs in V × V . For x, y ∈ V , define λx,y : V × V −→ R as 1 if (x, y) = (r, s) λx,y (r, s) = 0 if (x, y) = (r, s). The set {λx,y | x, y ∈ V } is a basis in the linear space F(V × V −→ R) of real-valued functions defined on FREE(V × V ) which have finite supports. Define the set of functions G in F(V × V −→ R) as consisting of the functions of the form λax+by,cz+dt + 12 (acλz,x + adλt,x + bcλz,y + bdλt,y ) − 12 (acλx,z + adλx,t + bcλy,z + bdλy,t ) , for x, y, z, t ∈ V and a, b, c, d ∈ R. The subspace of F(V × V −→ R) generated by G is denoted by F. The relation ρ in FREE(V × V ) consists of those pairs f, g such that f ρg if f − g ∈ G. The linear space W is defined as W = FREE(V × V )/ρ.
Tensors and Exterior Algebras
863
Let J be the canonical surjection of FREE(V × V ) in FREE(V × V )/ρ, which has G as a kernel. Then the map f = Jλ defined on FREE(V × V ) is an alternate bilinear map. Indeed, we have 1 1 λy,x − J λx,y J(λx,y ) + J 2 2 1 1 = J λx,y + λy,x − λx,y 2 2
(due to the linearity of J) = 0, because J is linear with kernel G. Thus, the function 1 1 λx,y + λy,x − λx,y 2 2 belongs to G (with (x, y, z, t) = (x, 0, y, 0) and (a, b, c, d) = (1, 0, 1, 0)). Consequently, we have J(λx,y + λy,x ) = 0, hence f (x, y) = −f (y, x), which shows that the function f , is skewsymmetric. To prove the bilinearity of f observe that 1 1 J λx+ay,z + (λz,x + aλz,y ) − (λx,z + aλx,z ) = 0 2 2
because the argument of J has the form prescribed for the subspace FREE(V ×V ) with (1, a, 1, 0) taken for (a, b, c, d). Thus, f (x+ay, z)+ f (z, x)+af (z, y) = 0 amounts to f (x+ay, z) = −f (z, x)−af (z, y), which implies the bilinearity of f by applying the skew-symmetry previously shown. We show now that the space W = FREE(V × V )/ρ and the mapping f satisfy the conditions of the theorem. Let ψ be a skewsymmetric map of V × V into SKSV,2 . Consider a basis that consists of elements λx,y , where x, y ∈ V , and consider in SKSV,2 the family {ψ(x, y) | x, y ∈ V }. There exists a unique linear map Λ : FREE(V × V ) −→ W such that Λ(λx,y ) = ψ(x, y).
Linear Algebra Tools for Data Mining (Second Edition)
864
If f, g ∈ FREE(V × V ) are such that f ρg, then Λ(f ) = Λ(g). Indeed, note that any function in G is null for Λ. Now, we have 1 Λ λax+by,cz+dt + (acλz,x + adλt,x + bcλz,y + bdλt,y ) 2 1 − (acλx,z + adλx,t + bcλx,z + bdλy,t ) 2 = ψ(ax + by, cz, dt) 1 + (acψ(z, x) + adψ(t, x(+bcψ(z, y) + bdψ(t, y)) 2 1 − (acψ(x, z) + adψ(x, t) + bcψ(y, z) + bdψ(y, t)) , 2 which is 0 due to the bilinearity and skew symmetry of ψ. Starting from Λ, define a unique mapping g : SKSV,2 −→ W such that for every v ∈ SKSV,2 we have g(v) = Λ(λx,y ), where λx,y is a representative of v in FREE(V × V ). Then, we have g(J(λx,y )) = ψ(x, y), which shows that gJ = ψ. The map g is unique. Indeed, the family {J(λx,y ) | x, y ∈ L} is a generating set of W since J is a surjective mapping from FREE(V ×V ) into W and {λx,y | x, y ∈ L} is a basis of FREE(V × V ). This family contains a basis {J(λxi ,yi ) | i ∈ I} of W . Then, we have g(f (x, y)) = ψ(x, y) for x, y ∈ V , which implies g(f (xi , y i )) = ψ(xi , y i ). Thus, there exists a unique mapping g : SKSV,2 −→ W such that f = gψ, where W = FREE(V × V )/ρ. A more general result having a similar argument is Theorem 14.25. Let V be a real linear space. There exists a linear space W and an alternating multilinear mapping f : V · · × V −→ × · k
W such that if ψ is the exterior multiplication ψ : SKSV,k −→ W , then f factors uniquely as a composition f = gψ, where g : SKSV,k −→ M is a linear map. Corollary 14.7. Let V be a linear space with dim(V ) = n and let ψ : V × · · · × V −→ SKSV,k . If k > n, we have SKSV,k = {0V }. k
Tensors and Exterior Algebras
865
Proof. This follows from the fact that if k > dim(V ), every k-blade must contain two identical factors and, therefore, must equal 0V . Also, observe that if dim(V ) = n, we have dim(SKSV,r ) = 1. Theorem 14.26. Let V be a linear space. We have v 1 ∧v2 ∧· · ·∧v k = 0V,k if and only if {v 1 , v 2 , . . . , v k } is a linearly dependent set in V. Proof. Suppose that {v 1 , v 2 , . . . , v k } is a linearly dependent set in V. Without loss of generality, assume that v 1 = a2 v2 + · · · + ak vk . Then, v1 ∧ v2 ∧ · · · ∧ vk = (a2 v 2 + · · · + ak vk ) ∧ v 2 ∧ · · · ∧ v k =
k
(ai v i ∧ v 2 · · · ∧ v i ∧ · · · ∧ v k ) = 0V,k
i=2
because each wedge in the last sum has a repeated vector. Conversely, suppose that {v 1 , v 2 , . . . , v k } is linearly independent and let {v 1 , v 2 , . . . , v k , . . . , v n } be its extension to a basis of V. As we saw above, the collection {v i1 ∧ · · · ∧ v ik | 1 i1 < · · · < ik n} is a basis of SKSV,k . Since v1 ∧ · · · ∧ v k belongs to this basis, it cannot be 0V,k . Theorem 14.27. Let V be a linear space and let v 1 ∧ · · · ∧ v k and w1 ∧ · · · ∧ wk be two wedges in SKSV,k . There exists c ∈ R − {0} such that v 1 ∧ · · · ∧ v k = c w1 ∧ · · · ∧ wk if and only if v1 , . . . , v k = w1 , . . . , w k . Proof. Let R = v 1 , . . . , v k and S = w1 , . . . , w k . If R = S, every v i is a linear combination of {w1 , . . . , w k }, v i = ci1 w1 + . . . + cik wk . After substituting the expressions of vi in v1 ∧ · · · ∧ v k , applying multilinearity and the alternating property, we are left with an expression of the form cw 1 ∧ · · · ∧ wk for some non-zero c. Note that c = 0 because a wedge cannot be 0V . If R = S, then let = dim(R ∩ S) < k. Without loss of generality, we may assume that the first elements (u1 , . . . , u ) of both lists (v 1 , . . . , v k ) and (w1 , . . . , wk ) form a basis for R ∩ S. Thus, the previous lists can be written as (u1 , . . . , u , v 1 , . . . , v k− )
866
Linear Algebra Tools for Data Mining (Second Edition)
and (u1 , . . . , u , w 1 , . . . , wk− ). Then, the members of the list (u1 , . . . , u , v 1 , . . . , v k− , w1 , . . . , w k− ) form a linearly independent set, which can be extended to a basis of V. This implies that the vectors u1 ∧ · · · ∧ u ∧ v 1 ∧ · · · ∧ v k− and u1 ∧ · · · ∧ u ∧ w1 ∧ · · · ∧ wk− belong to the same basis, which means that they cannot differ by a constant multiple. Lemma 14.1. Let V be an n-dimensional linear space and let B = {e1 , . . . , en } be a basis for V. The set of wedge products ei1 ∧ · · · ∧ eik with 1 i1 < · · · < ik n spans SKSV,k . Proof. We start from the fact that SKSV,k is spanned by k-wedges w1 ∧ · · · ∧ wk . Therefore, it suffices to prove that any such wedge is in the span of Bk . Since B is a basis for V, each wi is a linear combination of the basis vectors e1 , . . . , ek . If w1 ∧ · · · ∧ wk is expanded using distributivity, each term in the resulting sum is a multiple of a k-wedge of the form ei1 ∧ · · · ∧ eik . This suffices to conclude that SKSV,k is generated by the nk blades of the form ei1 ∧ ei2 ∧ · · · ∧ eik . Among the blades of the form ei1 ∧ ei2 ∧ · · · ∧ eik those which contain two equal indices are 0V,k , and the blades can always be rearranged in the increasing order of the indices. Therefore, Bk spans SKSV,k . Theorem 14.28. Let B = {e1 , . . . , en } be a basis in the linear space V. Then the set Bk = {ei1 ∧ ei2 ∧ · · · ∧ eik | 1 i1 < i3 < · · · < ik n}.} is a basis of SKSV,k . Proof. For k = 0, there is nothing to prove, so we may assume that k 1. The idea is to embed SKSV,k as a subspace of the tensor power ⊗k V . Let f : V k −→ V ⊗k be the multilinear function defined as (−1)inv(φ) eφ(1) ⊗ · · · ⊗ eφ(k) . f (e1 , . . . , ek ) = φ∈PERMk
The multilinearity of f follows from the fact that each term of the sum contains each ei only once. It is easy to see that f is also alternating. By the universal property of the exterior power, there is a
Tensors and Exterior Algebras
867
linear map gk,V : SKSV,k −→ V ⊗k such that (−1)inv(φ) eφ(1) ⊗ · · · ⊗ eφ(k) . gk,V (e1 ∧ · · · ek ) = φ∈PERMk
Injectivity of gk,V is clear for k > n. Therefore, suppose that 1 k n. We may assume k 2 because the statement is immediate for k = 1. Since B = {e1 , . . . , en } is a basis for V, the wedge products ei1 ∧ · · · ∧ eik with 1 i1 < · · · < ik n spans SKSV,k by Lemma 14.1. Since V ⊗k has a basis, ei1 ⊗ · · · ⊗ eik , where 1 i1 , . . . , ik n. Suppose t ∈ SKSV,k satisfies gk,V (t) = 0, where t = {ci1 ···ik ei1 ∧ · · · ∧ eik |1 i1 < · · · < ik n}, where ci1 ···ik ∈ R. Then gk,V = 0 implies ci1 ···ik (−1)inv(φ) eφ(1) ⊗ · · · ⊗ eφ(k) = 0, 1i1 > A(:,:,2)=[1 0;0 0] A(:,:,1) = 1 0
0 1 t112 = 1
t212 = 0
t122 = 0
t222 = 1
t111 = 1
t211 = 0
t121 = 0
t221 = 1
Fig. 15.4
A 2 × 2 × 2-mda.
Linear Algebra Tools for Data Mining (Second Edition)
898
A(:,:,2) = 1 0
0 1
>> T=tensor(A) T is a tensor of size 2 x 2 x 2 T(:,:,1) = 1 0 0 1 T(:,:,2) = 1 0 0 1
The three unfoldings of the mda T are obtained using the functions tensor_as_matrix of the package 862 described in [4]. For example, to obtain the unfolding after the first dimension, we write tensor_as_matrix(T,1,’fc’). The unfoldings produced on the first dimension in this manner are >> tensor_as_matrix(T,1,’fc’) ans is a matrix corresponding to a tensor of size 2 x 2 x 2 ans.rindices = [1] (modes of tensor corresponding to rows) ans.cindices = [2, 3] (modes of tensor corresponding to columns) ans.data = 1 0 0 1
1 0
0 1
Similar unfoldings yield >> tensor_as_matrix(T,2,’fc’) ans.data = 1 1 0 0
0 1
0 1
>> tensor_as_matrix(T,3,’fc’) ans.data = 1 0 1 0
0 0
1 1
Multidimensional Array and Tensors
899
Clearly, we have the ranks R1 = R2 = 2 and R3 = 1, so the ranks can be distinct, and, in turn, can be distinct from the rank of the mda. Since T = e1 ⊗ e2 ⊗ (e1 + e2 ) + e2 ⊗ e2 ⊗ (e1 + e2 ), the rank of T cannot be larger than 2, which shows that T is indeed of rank 2. Example 15.9. Let T be the 2 × 2 × 2 mda defined in Figure 15.5. The n-rank is 2 for 1 n 3. Since T = e2 ⊗ e1 ⊗ e1 + e1 ⊗ e2 ⊗ e1 + e1 ⊗ e1 ⊗ e2 ,
(15.5)
it follows that the rank of T is not larger than 3. To prove that the rank of T equals 3, we need to show that the decomposition 15.5 is minimal. Suppose that T would have rank 2, that is, T = x1 ⊗ y 1 ⊗ z 1 + x2 ⊗ y 2 ⊗ z 2 , and let X, Y , Z, D1 , and D2 be defined as X = (x1 x2 ), Y = (y 1 y 2 ), Z = (z 1 z 2 ), and D1 = diag(z11 , z12 ) and D2 = diag(z21 , z22 ). Note that
z11 0 y1 XD1 Y = (x1 x2 ) 0 z12 y 2 0 1 . = x1 y 1 z11 + x2 y 2 z12 = 1 0
t112 = 1
t212 = 0
t122 = 0
t222 = 0
t111 = 0
t211 = 1
t121 = 1
t221 = 0
Fig. 15.5
A 2 × 2 × 2-mda.
(15.6)
Linear Algebra Tools for Data Mining (Second Edition)
900
Similarly,
xD2 y =
x1 y 1 z21
+
x2 y 2 z22
=
1 0 , 0 0
which means that Equality (15.6) is equivalent to 0 1 1 0 and XD2 Y = . XD1 Y = 1 0 0 0 The last equalities imply that rank(D1 ) = 2 and rank(D2 ) = 1. The matrices X and Y have full rank. Without loss of generality, we may assume that z22 = 0. Observe now that 0 1 −1 −1 −1 . (XD2 Y ) · (XD1 Y ) = X(D2 D1 ) X = 0 0 1 while the first This equality implies that X1 is proportional to 0 row is proportional to (0 1). This is possible when X is of rank 1, which conflicts with the initial assumption (R1 = 2). 15.5
Matricization and Vectorization
Matricization [92] of an mda is the rearrangement of the elements of an mda into a matrix. Let T ∈ RI1 ×I2 ×···×IN be an order-N mda whose set of modes is N = {1, 2, . . . , N }. For a subset D = {d1 , . . . , d|D| } of N, define the function ψD : N =1 {1, . . . , I } −→ N as ⎡ ⎤ |D| p−1 ⎣(idp − 1) Idj ⎦, ψD (i1 , . . . , iN ) = 1 + p=1
j=1
where 1 ij Ij for 1 j |D|. Suppose that {R, C : IN } is a partition of N, where R = {r1 , . . . , rp }, C = {s1 , . . . , sq }, and p + q = N . The set R contains those indices that will be mapped into row indices of the resulting matrix, while the set C contains the
Multidimensional Array and Tensors
901
indices that will be mapped into the column indices of the resulting matrix. Let J = n∈R In and K = n∈C In and let π be a bijection π : {1, . . . , I1 } × · · · × {1, . . . , IN } −→ {1, . . . , J} × {1, . . . , K} defined by π(i1 , . . . , iN ) = (ψR (i1 , . . . , iN ), ψC (i1 , . . . , iN )). The matrix T(R,C:IN ) ∈ RJ×K is defined as (T(R,C:IN ) )jk = Ti1 i2 ···iN ,
(15.7)
with j = ψR (i1 , . . . , in ) and k = ψC (i1 , . . . , in ). Definition 15.9. The matricized mda is a matrix T(R,C:IN ) in RJ×K defined by Equation (15.7). This form of matricization was introduced in [50–52]. A special case of matricization occurs when R = {n} and C = {1, . . . , N } − {n}. In this case, we use the term n-unfolding. The matrix unfolding T(n) ∈ RIn ×(In+1 ···IN I1 I2 ···In−1 ) of T contains the element Ti1 i2 ···iN at the position with row number in and column number ψ{dn } (i1 , . . . , iN )1 + N m=1,m=n m−1 (im − 1) p=1,p=n Ip . Example 15.10. Let T be an mda with the format 3 × 4 × 2 having the following frontal slices: ⎛ ⎞ ⎛ ⎞ 1 4 7 10 13 16 19 22 ⎜ ⎟ ⎜ ⎟ T1 = ⎝2 5 8 11⎠ and T2 = ⎝14 17 20 23⎠. 3 6 9 12 15 18 21 24 For R = {1, 2} and C = {3}, the matrix TR;C ∈ R12×2 is given by T12;3 =
4 2 3 i1 =1 i2 =1 i3 =1
Ti1 i2 i3 e3i1 ⊗ e4i2 ⊗ e2i3 .
902
Linear Algebra Tools for Data Mining (Second Edition)
Consider the term T231 e32 ⊗ e43 ⊗ e21 from the above sum. Since ⎛ ⎞ 0 0 ⎜ ⎟ ⎜0 0⎟ ⎜ ⎟ ⎜0 0⎟ ⎜ ⎟ ⎜0 0⎟ ⎜ ⎟ ⎜ ⎟ ⎛ ⎞ ⎜0 0⎟ ⎛ ⎞ 0 ⎜ ⎟ 0 ⎜0⎟ 1 ⎜0 0⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ e32 ⊗ e43 ⊗ e21 = ⎝1⎠ ⊗ ⎜ ⎟ ⊗ =⎜ ⎟, ⎜1 0⎟ ⎝1⎠ 0 ⎜ ⎟ 0 ⎜0 0⎟ 0 ⎜ ⎟ ⎜ ⎟ ⎜0 0⎟ ⎜ ⎟ ⎜0 0⎟ ⎜ ⎟ ⎜ ⎟ ⎝0 0⎠ 0 0 it follows that T231 occupies the position at the 7th row and the first column of the matrix TR,C . Example 15.11. We present an example of the matricization of an mda given in [93, 139]. Let T be an mda with the format 3 × 4 × 2 introduced in Example 15.10 having the following frontal slices: ⎛ ⎞ ⎛ ⎞ 1 4 7 10 13 16 19 22 ⎜ ⎟ ⎜ ⎟ T1 = ⎝2 5 8 11⎠ and T2 = ⎝14 17 20 23⎠ 3 6 9 12 15 18 21 24 The three unfoldings are ⎛ 1 4 7 ⎜ T(1) = ⎝2 5 8 3 6 9 ⎛ 1 2 ⎜4 5 ⎜ T(2) = ⎜ ⎝7 8 10 11
⎞ 10 13 16 19 22 ⎟ 11 14 17 20 23⎠, 12 15 18 21 24 ⎞ 3 13 14 15 6 16 17 18⎟ ⎟ ⎟, 9 19 20 21⎠ 12 22 23 24
Multidimensional Array and Tensors
T(3) =
903
1 2 3 4 · · · 9 10 11 12 . 13 14 15 16 · · · 21 22 23 24
The fibers of n-mode of the mda T can be retrieved as the columns of the unfolding T(n) . Definition 15.10. Let T and S be two mdas having the set of modes I ∪ J and I ∪ K, respectively, where |I| = M , |J| = N , and |K| = P and I, J, K are pairwise disjoint. We assume that I = {i1 , . . . , iM }. The contraction of T and S over I is the mda T I S given by (T I S)j1 ···jN k1 ···kP =
I1 i1 =1
···
Im
Ti1 ···im j1 ···jn Si1 ···im k1 ···kp .
im =1
Definition 15.11. Let T be an (I1 × I2 × · · · × IN )-mda and let A ∈ RJn ×In be a matrix. The n-mode product of T and A is an mda denoted by T ×n A ∈ RI1 ×···×In−1 ×Jn ×In+1 ×IN given by (T ×n A)i1 ···in−1 jn in+1 ···iN =
In
Ti1 i2 ···iN Ajn in .
in =1
The result of the n-mode product T ×n A is an mda of size I1 × I2 · · · × In−1 × Jn × In+1 · · · × IN -mda. Each of the mode-n fibers of T is multiplied by the matrix A. Example 15.12. Let A ∈ RI1 ×I2 be a matrix, which we regard as an mda. The product A ×1 B can be considered for a matrix B ∈ J × I1 and yields a matrix A ×1 B ∈ RJ×I2 where (A ×1 B)ji2 =
I1
Ai1 i2 Bji1 =
i1 =1
I1 i1 =1
Ai1 i2 Bi1 j .
In terms of matrix products, this is A ×1 B = B A. If C ∈ RJ×I2 , the product A ×2 C ∈ RI1 ×J is (A ×2 C)i1 j =
I2 i2 =1
We have A ×2 C = AC .
Ai1 i2 Cj i2 =
I2 i2 =1
Ai1 i2 Ci2 j .
Linear Algebra Tools for Data Mining (Second Edition)
904
Theorem 15.5. {r1 , . . . , rL } and {1, . . . , N }. Also, A(n) ∈ RIn ×Jn for The equality
Let N = {1, . . . , N }, and let the sets R = C = {c1 , . . . , cM } define a partition of N = let Y be an (J1 × J2 × · · · × JN )-mda and let n ∈ N be N matrices.
X = Y ×1 A(1) ×2 A(2) · · · ×N A(N )
(15.8)
holds if and only if we have the following matrix equality: X(R×C:IN ) = (A(rL ) ⊗· · ·⊗A(r1 ) )Y(R×C:IN ) (A(cM ) ⊗· · ·⊗A(c1 ) ) . (15.9) Proof.
Equality (15.8) is equivalent to I1
Xi1 i2 ···iN =
···
j1 =1
IN jN =1
(1)
(N ) jN .
Yj1 ···jn Ai1 j1 · · · AiN
Observe that Xi1 i2 ···iN = (X(R,C:IN ) )ψR (i1 ,i2 ,··· ,iN ),ψC (i1 ,i2 ,··· ,iN ) , and Yi1 i2 ···iN = (Y(R,C:IN ) )ψR (i1 ,i2 ,··· ,iN ),ψC (i1 ,i2 ,··· ,iN ) . By the definition of matricization, Equality (15.9) implies X(R×C:IN ) =
I1
···
i1 =1
=
I1
iN =1
···
i1 =1
=
IN
(I ) (I ) Xi1 ···iN ⊗n∈R einn ⊗n∈C einn
IN I1
···
iN =1 j1 =1
IN jN =1
(1)
(N ) jN
Yj1 ···jn Ai1 j1 · · · AiN
(I ) (I ) ⊗n∈R einn ⊗n∈C einn
I1 i1 =1 (1)
···
IN I1
···
iN =1 j1 =1 (N ) jN
Ai1 j1 · · · AiN
IN jN =1
(Y(R,C:IN ) )ψR (j1 ,j2 ,··· ,jN ),ψC (j1 ,j2,··· ,jN ) (I )
⊗n∈R einn
(I )
⊗n∈C einn
Multidimensional Array and Tensors
=
I1 i1 =1 (1)
···
IN I1 iN =1 j1 =1
···
IN jN =1
905
(Y(R,C:IN ) )ψR (j1 ,j2 ,··· ,jN ),ψC (j1 ,j2,··· ,jN )
j=1 i=1 (N ) M cj L ri e e jN ψR (i1 ,...,iN ) ψC (j1 ,...,jn )
Ai1 j1 · · · AiN
= (A(rL ) ⊗ · · · ⊗ A(r1 ) )Y(R×C:IN ) (A(cM ) ⊗ · · · ⊗ A(c1 ) ) . Corollary 15.1. Let X be an I1 × · · · × IN -mda and let Ji × Ii -matrix for 1 i N . We have
A(i)
be a
Y = X ×1 A(1) ×2 A(2) · · · ×N A(N ) if and only if we have the equality Y(n) = A(n) X(n) (A(N ) ⊗ · · · ⊗ A(n+1) ⊗ A(n−1) ⊗ · · · ⊗ A(1) ) between the unfoldings Y(n) and X(n) for 1 n N . Let X be an (I1 ×I2 ×· · ·×IN )-mda , let A be a matrix, A ∈ RIn ×Jn , and let Y ∈ RI1 ×I2 ×···Jn ×···×In . We have Y = X ×n A if and only if Y(n) = AX(n) . Proof.
These statements follow from Theorem 15.5.
Theorem 15.6. Let T ∈ RI1 ×I2 ×···×IN and let A ∈ RJn ×In and B ∈ RJm ×Im be two matrices with n = m. We have (T ×n A) ×m B = (T ×m B) ×n A, and their common value is denoted by T ×n A ×m B. Furthermore, if C ∈ RJn ×In and D ∈ RKn ×Jn , then (T ×n C) ×n D = T ×n (DC). Proof. ×n .
(15.10)
Both equalities follow immediately from the definition of
Next we introduce the n-mode product ×n of an mda with a vector. Definition 15.12. Let T be an (I1 × I2 × · · · × IN )-mda and let v ∈ RIn be a matrix. The n-mode product of T and v is an mda of order N − 1 of size I1 × · · · × In−1 × In+1 × · · · × IN denoted by T ×n v given by (T ×n v)i1 ···in−1 in+1 ···iN =
In in =1
Ti1 i2 ···in vin .
906
Linear Algebra Tools for Data Mining (Second Edition)
Clearly, T ×n v computes the inner product of each n-fiber of T with v. Example 15.13. Let T be a tensor of size 4 × 3 × 2 having the 3rd dimension slices ⎛ ⎞ 9 4 6 ⎜5 9 1⎟ ⎜ ⎟ T1 = ⎜ ⎟ ⎝8 8 8⎠ 2 9 9 and
⎛
7 ⎜7 ⎜ T2 = ⎜ ⎝7 4 If
6 2 7 1
⎞ 3 1⎟ ⎟ ⎟. 1⎠ 8
⎛
⎞ 10 ⎜ ⎟ u = ⎝20⎠, 30
the product T ×2 u is the mda of size 4 × 2, ⎛ ⎞ 350 280 ⎜260 140⎟ ⎜ ⎟ T ×1 u = ⎜ ⎟. ⎝480 240⎠ 470 300 The actual implementation of this computation is further discussed in Example 15.25. Theorem 15.7. Let T and S be two mdas such that T ∈ RI1 ×···×In−1 ×J×In+1 ×···×IN S ∈ RI1 ×···×In−1 ×K×In+1 ×···×IN and let A be a matrix, A ∈ RJ×K . We have (T, S×n A) = (T ×n A , S).
Multidimensional Array and Tensors
Proof.
907
We have
(T ×n A )i1 ···in−1 kin+1 ···IN =
J
Ti1 ···in−1 jin+1 ···iN akj ,
j=1
(S ×n A)i1 ···in−1 jin+1 ···iN =
K
Si1 ···in−1 kin+1 ···iN ajk ,
k=1
and (T, S ×n A) =
=
I1 i1 =1
in =1
I1
IN
···
i1 =1
=
IN
···
I1
Ti1 i2 ...iN (S ×n A)i1 i2 ···iN
Ti1 i2 ...iN
iN =1 IN
···
i1 =1
in =1
In
Si1 ···in ···iN ain i
i=1
Ti1 i2 ...in aiin Si1 ···in ···iN
= (T ×n A , S).
15.6
Inner Product and Norms
Example 15.14. Let T and S be two mdas in RI1 ×I2 ×···×IN . Their inner product is given by (T, S) =
I1 I2
···
i1 =1 i2 =1
IN
Ti1 i2 ···iN Si1 i2 ···iN ,
iN =1
which is actually their contraction T M S over the entire set of modes M = {1, . . . , N }. Therefore, it is natural to define the norm of an mda T as 2
T = (T, T ) =
I1 i1 =1
···
IN iN =1
Ti21 ···iN .
Linear Algebra Tools for Data Mining (Second Edition)
908
The definition of the norm also implies that for two mdas T, S ∈
RI1 ×···×IN , we have
T − S 2 = T 2 − 2(T, S) + S 2 . The norm of an mda can be expressed as the norm of a matrix, as the following statement shows. Theorem 15.8. Let X be an mda, X ∈ RI1 ×I2 ×···×IN , where the set of modes of X is N = {1, . . . , N }. We have X = XR,calc:IN F . Proof. The square of Frobenius norm of the matrix XR,calc:IN F is the sum of all squares of entries of this matrix. Since these entries form a rearrangement of the entries of X, the result follows immedi ately. 15.7
Evaluation of a Set of Bilinear Forms
Let X, Y be two F-linear spaces. Following [23], we discuss the evaluation of several bilinear forms pi = j k hijk xj yk , where 1 i m, x = (xj ) ∈ X, and y = (yk ) ∈ Y . Let Gi be the matrix Gi = (hijk ), where⎛1 ⎞ i m that cors1 ⎜ . ⎟ ⎟ responds to the linear form pi . For s = ⎜ ⎝ .. ⎠, define G(s) = sm m i=1 si Gi . The matrix G(s) is the characteristic matrix of the problem; the matrices Gi are called the basis matrices. The scalar H(s, x, y) =
p q m
hijk si xj yk = x G(s)y
i=1 j=1 k=1
is the defining function of the problem. The number of linear forms m = dim s is the index of the problem. When we evaluate the bilinear forms p1 , . . . , pm at (x, y), we are computing the set of inner products x (Gi y). Following [23], we assume that F contains a subset K such that the computation of K-linear forms is cheap compared with the computations of products of such linear forms. Furthermore, we assume
Multidimensional Array and Tensors
909
that the additive and multiplicative identities in F hold in K, and that all elements hijk belong to K. The general algorithm for evaluating K-linear forms proposed in [23] has the following form (Algorithm 15.7.1): Algorithm 15.7.1: Brokett–Dobkin Algorithm Data: A set of elements hijk of K Result: A set of products of the form pi = j,k hijk xi yj 1 Compute the K-linear forms ci , x and bi , y for 1 i d; 2 Compute the products ri = ci , xbi , y for 1 i d; d 3 Compute pi as pi = j=1 aij rj for 1 i m, where aij ∈ K; return pi for 1 i m; Example 15.15. Let x = x1 + ix2 and y = y1 + iy2 be two complex numbers. Their product is xy = x1 y1 − x2 y2 + i(x2 y1 + x1 y2 ), which entails the evaluation of the bilinear forms x1 y1 − x2 y2 and x2 y1 + x1 y2 . These forms correspond to the basis matrices 1 0 0 1 and G2 = G1 = 0 −1 1 0 because
1 0 y1 x1 y1 − x2 y2 = (x1 x2 ) 0 −1 y2
and
0 1 y1 . x2 y1 + x1 y2 = (x1 x2 ) 1 0 y2
The characteristic matrix is G(s) =
s1 s2 s2 −s1
and the defining function is
s1 s2 y1 H(s, x, y) = (x1 x2 ) s2 −s1 y2 = x1 y 1 s 1 + x1 y 2 s 2 + x2 y 1 s 2 − x2 y 2 s 1 .
910
Linear Algebra Tools for Data Mining (Second Edition)
Example 15.16. The computation of the product of matrices X=
x1 x2 x3 x4
and Y =
y1 y2 y3 y4
corresponds to the defining function H(s, x, y) = x1 y1 s1 + x2 y3 s1 + x1 y2 s2 + x2 y4 s2 +x3 y1 s3 + x4 y3 s3 + x3 y2 s4 + x4 y4 s4 and the characteristic matrix ⎛
s1 ⎜0 ⎜ G(s) = ⎜ ⎝s3 0
s2 0 s4 0
0 s1 0 s3
⎞ 0 s2 ⎟ ⎟ ⎟. 0⎠ s4
Example 15.17. Let x(t) and y(t) be two polynomials of degree m − 1 and n − 1, respectively, x(t) =
m
xj t
j−1
and y(t) =
j=1
n
yk tk−1 .
k=1
The defining function is H(s, x, y) =
m n
sj+k−1 xj yk .
j=1 k=1
The characteristic matrix of this problem is the following Hankel matrix (introduced in Definition 3.22): ⎛
s1 s2 s3 ⎜ s2 s3 s4 ⎜ ⎜ s s4 s5 Gmn (s) = ⎜ ⎜ 3 ⎜ .. .. .. ⎝ . . . sm sm+1 sm+2
··· ··· ···
sn sn+1 sn+2 .. .
··· · · · sm+n−1
⎞ ⎟ ⎟ ⎟ ⎟. ⎟ ⎟ ⎠
Multidimensional Array and Tensors
15.8
911
Matrix Multiplications and Arrays
The role of bilinear forms is studied in [97] using a special matrix representation. Let U, V ∈ R2×2 be two matrices and let W = U V . Index the elements of each matrix by a single index as follows:
u1 u2 u3 u4
v1 v2 v3 v4
=
w1 w2 . w3 w4
Denote by xijk the coefficient of ui vj in wk = 2i=1 2j=1 xijk ui vj ; each xijk is either 0 or 1. The corresponding matrices (xijk ) for wk and 1 k 4 are ⎛
1 ⎜0 ⎜ w1 : ⎜ ⎝0 0 ⎛ 0 ⎜0 ⎜ w3 : ⎜ ⎝1 0
0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 1
⎞ 0 0⎟ ⎟ ⎟w 0⎠ 2 0 ⎞ 0 0⎟ ⎟ ⎟w 0⎠ 4 0
⎛
0 ⎜0 ⎜ :⎜ ⎝0 0 ⎛ 0 ⎜0 ⎜ :⎜ ⎝0 0
1 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0
⎞ 0 1⎟ ⎟ ⎟ 0⎠ 0 ⎞. 0 0⎟ ⎟ ⎟ 0⎠ 1
A more general situation involves computing a set of K bilinear forms using a three-way array of coefficients xijk as wk =
J I
xijk ui vj ,
i=1 j=1
where 1 k K. Example 15.18. Supposethat for a three-mode array X we have the decomposition xijk = R r=1 air bjr ckr for 1 i I, 1 j J, and 1 k K. If wk =
I J i=1 j=1
xijk ui vj
912
Linear Algebra Tools for Data Mining (Second Edition)
for 1 k K, then wk can be written as wk =
I J
xijk ui vj
i=1 j=1
=
J R I
air bjr ckr ui vj
i=1 j=1 r=1
=
I R r=1
=
R
air ui
⎛ ⎝
i=1
J
⎞ bjr vj ⎠ ckr
j=1
fr (air )gr (bjr ),
r=1
where fr (air ) = Ii=1 air ui and gr (bjr ) = Jj=1 bjr vj for 1 r R. This entails computing 2R linear combinations and multiplying these combinations pairwise to form R products. Then, we compute linear combinations of these products with the coefficients ckr : w1 = u1 v1 + u2 v3 w2 = u1 v2 + u2 v4 , w3 = u3 v1 + u4 v3 w4 = u3 v2 + u4 v4 . For the standard matrix multiplication (using Kruskal’s one-index notation), we have R = 8 bilinear forms, h1 , . . . , h8 : h1 = u1 v1 h2 = u2 v3 h3 = u1 v2 h4 = u2 v4 , h5 = u3 v1 h6 = u4 v3 h7 = u3 v2 h8 = u4 v4 , then w1 = h1 + h2 w2 = h3 + h4 , w3 = h5 + h6 w4 = h7 + h8 . In tabular form, this algorithm is shown in Figure 15.6. Example 15.19. Here we used the notation of Strassen (introduced in Exercise 120 of Chapter 3): I = f1 g1 , II = f2 g2 , III = f3 g3 , IV = f4 g4 , V = f5 g5 , V I = f6 g6 , V II = f7 g7 .
Multidimensional Array and Tensors
913
f1
f2
f3
f4
f5
f6
f7
f8
u1
1
0
1
0
0
0
0
1
u2
0
1
0
1
0
0
0
0
u3
0
0
0
0
1
0
1
0
u4
0
0
0
0
0
1
0
1
g1
g2
g3
g4
g5
g6
g7
g8
v1
1
0
0
0
1
0
0
0
v2
0
0
1
0
0
0
1
0
v3
0
1
0
0
0
1
0
0
v4
0
0
0
1
0
0
0
1
h1
h2
h3
h4
h5
h6
h7
h8
w1
1
1
0
0
0
0
0
0
w2
0
0
1
1
0
0
0
0
w3
0
0
0
0
1
1
0
0
w4
0
0
0
0
0
0
1
1
Fig. 15.6
Bilinear forms in matrix product computation.
Linear Algebra Tools for Data Mining (Second Edition)
914
15.9
MATLAB Computations
Following the practice of Computer Science literature, we will use the terms mda and tensor loosely and interchangeably. The Tensor Toolbox for MATLAB consists of a collection of tools for working with mdas. It is an open-source package constructed by researchers at Sandia National Laboratory [4, 5]. Every MATLAB array has at least two dimensions: a scalar is an object of size 1 × 1, a column vector is an array of size n × 1, etc. MATLAB drops trailing singleton dimensions. Thus, a 3 × 4 × 1 object has a reported size of 3 × 4. The MATLAB tensor class explicitly stores trailing singleton dimensions. Example 15.20. To create a tensor T starting from an array A, we write (Figure 15.7) A= rand(3,4,2) A(:,:,1) = 0.8147 0.9058 0.1270
0.9134 0.6324 0.0975
0.2785 0.5469 0.9575
0.9649 0.1576 0.9706
0.1419 0.4218 0.9157
0.7922 0.9595 0.6557
0.0357 0.8491 0.9340
A(:,:,2) = 0.9572 0.4854 0.8003
>> T=tensor(A) T is a tensor of size 3 x 4 x 2 T(:,:,1) = 0.8147 0.9134 0.2785 0.9058 0.6324 0.5469 0.1270 0.0975 0.9575 T(:,:,2) = 0.9572 0.1419 0.7922 0.4854 0.4218 0.9595 0.8003 0.9157 0.6557
0.9649 0.1576 0.9706 0.0357 0.8491 0.9340
Multidimensional Array and Tensors
915
f1
f2
f3
f4
f5
f6
f7
u1
1
0
1
0
1
−1
0
u2
0
0
0
0
1
0
1
u3
0
1
0
0
0
1
0
u4
1
1
0
1
0
0
−1
g1
g2
g3
g4
g5
g6
g7
v1
1
1
0
−1
0
1
0
v2
0
0
1
0
0
1
0
v3
0
0
0
1
0
0
1
v4
1
0
−1
0
1
0
0
I
II
III
IV
V
VI
VII
w1
1
0
0
1
−1
0
1
w2
0
1
0
1
0
0
0
w3
0
0
1
0
1
0
0
w4
1
−1
1
0
0
0
1
Fig. 15.7
Bilinear forms in Strassen’s matrix product computation
Linear Algebra Tools for Data Mining (Second Edition)
916
The tensor class explicitly tracks singleton dimensions. Example 15.21. Creating an mda of size 4 × 3 × 1 with A=rand(4,3,1) ignores trailing singleton dimensions resulting in A = 0.6787 0.7577 0.7431 0.3922
0.6555 0.1712 0.7060 0.0318
0.2769 0.0462 0.0971 0.8235
and >> ndims(A) ans = 2 >> size(A) ans = 4 3
In contrast, using the tensor constructor T = tensor(A,[4,3,1]) results in T is a tensor of size 4 x 3 x 1 T(:,:,1) = 0.6787 0.6555 0.2769 0.7577 0.1712 0.0462 0.7431 0.7060 0.0971 0.3922 0.0318 0.8235
and >> ndims(T) ans = 3 >> size(T) ans = 4 3
1
A vector can be stored as a tensor by writing T = tensor(rand(4,1),[4]).
Multidimensional Array and Tensors
917
Accessors and assignments for tensors work in the same way as for mdas. Example 15.22. A 2 × 2 × 2 tensor is created with T = tensor(rand(2,2,2))
To reassign a 2 × 2 identity matrix, we write A(:,1,:) = eye(2)
Three types of mda multiplications are discussed: multiplication of an mda with a matrix (denoted by ttm), with a vector (denoted by ttv), and with another mda (denoted by ttt). The first two variants can be regarded as special cases of the third. When mda is multiplied with a matrix, it is necessary to specify the mode of the mda involved in the multiplication. In [101], the n-mode multiplication of an mda by a matrix is discussed. This type of multiplication is a generalization of the matrix product. Let A = U BV be a matrix, A ∈ Rj1 ×j2 , B ∈ Ri1 ×i2 , U ∈ Rj1 ×i1 , and V ∈ Rj2 ×i2 , we adopt the notation proposed in [101] and denote this product by A = B ×1 U ×2 V which suggests that the matrix U creates linear combinations of the rows of B (the mode 1 of B), and the matrix V creates linear combinations of the columns of B (the mode 2 of B). More generally, if T ∈ RJ1 ×J2 ×···Jn ×···×JN and A ∈ RIn ×Jn , the mda T ×n A ∈ RJ1 ×···×Jn−1 ×In ×Jn+1 ×···JN is defined as (T ×n A)j1 ···jn−1 ijn+1 ···jN =
Jn jn =1
Tj1 ···jn ···jN aijn ,
Linear Algebra Tools for Data Mining (Second Edition)
918
and is obtained by having A act on the nth fibers of T . The MATLAB implementation of the n-mode multiplication is discussed in Example 15.23. The mda-matrix n-mode product of the mda T and the matrix A is specified by ttm(T, A, n). Example 15.23. We create a tensor T starting from a 3 × 4 × 2 mda of integers named A. To this end, we define a 3 × 4-matrix A using > A = [1 4 7 10;2 5 8 11;3 6 9 12] A = 1 2 3
4 5 6
7 8 9
10 11 12
Then, the second page of this array is added using A(:,:,2) = [13 16 19 22;14 17 20 23; 15 18 21 24]
which results in A(:,:,1) = 1 2 3
4 5 6
7 8 9
10 11 12
16 17 18
19 20 21
22 23 24
A(:,:,2) = 13 14 15
The tensor T is obtained by writing >> T = tensor(A) T is a tensor of size 3 x 4 x 2 T(:,:,1) = 1 4 7 10 2 5 8 11 3 6 9 12 T(:,:,2) = 13 16 19 22
Multidimensional Array and Tensors
14 15
17 18
20 21
919
23 24
Let B be the 2 × 3 matrix defined by >> B=[1 2 3;4 5 6] B = 1 4
2 5
3 6
To compute the tensor–matrix product P = ttm(T, B, 1) between the 3×4×2 tensor T and the matrix B ∈ R2×3 resulting in, a 2×4×2 tensor, we execute the following code: >> P=ttm(T,B,1) P is a tensor of size 2 x 4 x 2 P(:,:,1) = 14 32 50 68 32 77 122 167 P(:,:,2) = 86 104 122 140 212 257 302 347
corresponding to the following equalities: 14 = 1 · 1 + 2 · 2 + 3 · 3 .. . 68 = 1 · 10 + 2 · 11 + 3 · 12 32 = 4 · 1 + 5 · 2 + 6 · 3 .. . 167 = 4 · 10 + 5 · 11 + 6 · 12 and 86 = 1 · 13 + 2 · 14 + 3 · 15 .. . 140 = 1 · 22 + 2 · 23 + 3 · 24
920
Linear Algebra Tools for Data Mining (Second Edition)
212 = 4 · 13 + 5 · 14 + 6 · 15 .. . 347 = 4 · 22 + 5 · 23 + 6 · 24. If C is the 2 × 4 matrix, C=
3 2 3 1 , 0 3 0 4
the product of T by C along the second dimension of T is obtained as >> Q=ttm(T,C,2) Q is a tensor of size 3 x 2 x 2 Q(:,:,1) = 42 52 51 59 60 66 Q(:,:,2) = 150 136 159 143 168 150
Finally, if we multiply T along its third dimension by the matrix D ∈ R2×2 given by D=
7 3 , 5 2
the resulting mda R = ttm(T, D, 3) is >> R=ttm(T,D,3) R is a tensor of size 3 x 4 x 2 R(:,:,1) = 46 76 106 136 56 86 116 146 66 96 126 156 R(:,:,2) = 31 52 73 94 38 59 80 101 45 66 87 108
Multidimensional Array and Tensors
921
4
2
C
2
2
2
D
2 T B
3
3
4
Fig. 15.8
Possible mda-matrix products.
An image of various possible mda-matrix products is shown in Figure 15.8. This image suggests an alternative notation for matrixmda products proposed by Kruskal in [97], where the products ttm(T,B,1), ttm(T,C,2), ttm(T,D,3) C
are denoted as BT , T , and T D, respectively. An alternate way of computing n-mode products using cell arrays is shown next. Example 15.24. Consider the mda T ∈ R4×3×2 produced by >> T = tensor(randi(9,4,3,2)) T is a tensor of size 4 x 3 x 2 T(:,:,1) = 9 4 6 5 9 1 8 8 8 2 9 9 T(:,:,2) = 7 6 3 7 2 1 7 7 1 4 1 8
Linear Algebra Tools for Data Mining (Second Edition)
922
and the matrices A and B generated by >> A = randi(9,2,4) A = 7 3
9 1
4 4
7 8
and >> B=randi(9,3,2) B = 2 5 5
6 7 7
Matrices A and B are stored in the cell array W defined next: >> W{1}=A W = 1 x 1 cell array {2 x 4 double} >> W{2}=B W = 1 x 2 cell array {2 x 4 double}
{3 x 2 double}
Finally, using the cell array W the product T ×3 B ×1 A is computed as >> S=ttm(T,W,[1,3]) S is a tensor of size 2 x 3 x 3 S(:,:,1) = 1316 978 688 586 S(:,:,2) = 1946 1685 1016 1017 S(:,:,3) = 1946 1685 1016 1017
832 714 1360 1161 1360 1161
Multidimensional Array and Tensors
923
If T is an I1 × I2 × · · · × IN tensor, A ∈ RJm ×Im , and B ∈ RJn ×In , then (T ×m A) ×n B = (T ×n B) ×m A. Let T be an I1 × I2 × · · · × IN tensor and let (U (1) , . . . , U (N ) ) be a sequence of matrices, where U (i) has the format Ji × Ii . Then T ×1 U (1) ×2 U (2) ×N U (N ) is of size J1 × J2 × · · · × JN . Let T be an mda of format I1 ×I2 ×· · · In · · · IN and let v be a vector of size In . The contracted n-mode product of T and v introduced in [4] builds an mda T ×n v of format I1 × · · · × In−1 × In+1 × · · · In given by (T ×n v)(i1 , . . . , in−1 , in+1 , . . . , iN ) =
In
T (i1 , . . . , iN )v(in ).
in =1
Example 15.25. Let T be the mda introduced in Example 15.13 and let ⎛ ⎞ 10 ⎜ ⎟ u = ⎝20⎠. 30 T ∈ R4×3×2 can be multiplied by the vectors u along its second dimension by writing >> ttv(T,u,2) ans is a tensor of size 4 x 2 ans(:,:) = 350 280 260 140 480 240 470 300
Yet another type of mda multiplication involves two tensors. This product comes in three flavors: the outer product, the inner product, and the contracted product. Definition 15.13. Let T be an mda of size I1 × · · · × IM and let S be an mda of size J1 × · · · × JN . The outer product T oS is of size
924
Linear Algebra Tools for Data Mining (Second Edition)
I1 × · · · × IM × J1 × · · · × JN and is given by (T oS)(i1 , . . . , iM , j1 . . . , jN ) = T (i1 , . . . , iM )S(j1 , . . . , jN ). The MATLAB command ttt is an abbreviation of “tensor times tensor” and is given by Z = ttt(T,S). Example 15.26. Let T and S be the mdas produced by >> T = tensor(randi(9,2,2,2)) >> s = tensor(randi(5,2,1,3))
and given by T is a tensor of size 2 x 2 x 2 T(:,:,1) = 9 1 6 3 T(:,:,2) = 5 9 9 2 >> S=tensor(randi(5,2,1,3)) S is a tensor of size 2 x 1 x 3 S(:,:,1) = 5 4 S(:,:,2) = 5 4 S(:,:,3) = 1 5
Their outer product obtained with Z = ttt(T,S) is

Z is a tensor of size 2 x 2 x 2 x 2 x 1 x 3
Z(:,:,1,1,1,1) =
    45     5
    30    15
Z(:,:,2,1,1,1) =
    25    45
    45    10
Z(:,:,1,2,1,1) =
    36     4
    24    12
Z(:,:,2,2,1,1) =
    20    36
    36     8
Z(:,:,1,1,1,2) =
    45     5
    30    15
Z(:,:,2,1,1,2) =
    25    45
    45    10
Z(:,:,1,2,1,2) =
    36     4
    24    12
Z(:,:,2,2,1,2) =
    20    36
    36     8
Z(:,:,1,1,1,3) =
     9     1
     6     3
Z(:,:,2,1,1,3) =
     5     9
     9     2
Z(:,:,1,2,1,3) =
    45     5
    30    15
Z(:,:,2,2,1,3) =
    25    45
    45    10
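As a quick sanity check of Definition 15.13, a few entries of Z can be compared elementwise with the product of the corresponding entries of T and S; a minimal sketch, assuming the tensors T, S, and Z of this example:

% Sketch: check Z(i1,i2,i3,j1,1,j3) = T(i1,i2,i3) * S(j1,1,j3)
ZZ = double(Z); TT = double(T); SS = double(S);
err = 0;
for i1 = 1:2
  for i2 = 1:2
    for i3 = 1:2
      for j1 = 1:2
        for j3 = 1:3
          err = max(err, abs(ZZ(i1,i2,i3,j1,1,j3) - TT(i1,i2,i3)*SS(j1,1,j3)));
        end
      end
    end
  end
end
err   % should be 0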
Example 15.27. Let T and S be the mdas produced by T = tensor(randi(9,2,2,2)) and S = tensor(randi([10 20],2,2,2)):

>> T = tensor(randi(9,2,2,2))
T is a tensor of size 2 x 2 x 2
T(:,:,1) =
     8     5
     2     5
T(:,:,2) =
     6     7
     7     3
>> S = tensor(randi([10 20],2,2,2))
S is a tensor of size 2 x 2 x 2
S(:,:,1) =
    14    11
    17    17
S(:,:,2) =
    10    10
    13    11
The inner product of T and S, computed by contracting along all three modes with ttt(T,S,[1 2 3],[1 2 3]), is

ans = 540
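Because the inner product contracts all modes, it equals the sum of the elementwise products of the two arrays; a minimal sketch, assuming the tensors T and S above:

% Sketch: inner product as the sum of elementwise products
X  = double(T) .* double(S);
ip = sum(X(:));    % yields 540 for the data above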
The contracted product of two mdas is a generalization of the mda-vector product.

Definition 15.14. Let T be an mda of size I1 × · · · × IM × J1 × · · · × JN and let S be an mda of size I1 × · · · × IM × K1 × · · · × KP. The contracted product along the first M modes is the mda of size J1 × · · · × JN × K1 × · · · × KP given by
\[
(T, S)_{1,\ldots,M;1,\ldots,M}(j_1, \ldots, j_N, k_1, \ldots, k_P) = \sum_{i_1=1}^{I_1} \cdots \sum_{i_M=1}^{I_M} T(i_1, \ldots, i_M, j_1, \ldots, j_N)\, S(i_1, \ldots, i_M, k_1, \ldots, k_P).
\]
In MATLAB , the command for mda contracted product is U = ttt(T,S,[1:M],[1:M])
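For small mdas stored as ordinary arrays, the contracted product can be checked directly against Definition 15.14; a minimal sketch (the sizes below are illustrative):

% Sketch: contracted product over the first M = 2 modes of two arrays
% T (I1 x I2 x J1) and S (I1 x I2 x K1), computed from the definition.
I1 = 3; I2 = 2; J1 = 4; K1 = 5;
T = randn(I1,I2,J1); S = randn(I1,I2,K1);
U = zeros(J1,K1);
for j = 1:J1
  for k = 1:K1
    U(j,k) = sum(sum(T(:,:,j) .* S(:,:,k)));
  end
end
% With the Tensor Toolbox, U should agree with
% double(ttt(tensor(T), tensor(S), [1 2], [1 2])).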
The contracted product may involve distinct lists of dimensions of the two mdas, of equal lengths, provided that the sizes of the involved dimensions are identical. The next example illustrates this point.

Example 15.28. Let T and S be the mdas defined as

>> T = tensor(randi(4,3,4,2))
T is a tensor of size 3 x 4 x 2
T(:,:,1) =
     2     2     3     3
     1     3     2     3
     4     1     3     2
T(:,:,2) =
     1     1     4     1
     1     4     1     4
     4     3     2     1
>> S = tensor(randi(5,4,3,2))
S is a tensor of size 4 x 3 x 2
S(:,:,1) =
     4     2     5
     5     2     1
     5     5     2
     1     3     1
S(:,:,2) =
     1     1     3
     5     5     3
     3     4     1
     3     2     2
The lists of dimensions of T and S are [1 3] and [2 3], and the corresponding sizes are 3 and 2. Therefore, the contracted product ttt(T,S,[1 3],[2 3]) returns

>> ttt(T,S,[1 3],[2 3])
ans is a tensor of size 4 x 4
ans(:,:) =
    44    38    34    22
    33    51    49    29
    42    53    49    30
    36    51    54    27
This means that the dimensions involved in the contracted product are the first and the third for the mda T, and the second and the third for the mda S. Note that the size of the first dimension of T is 3, the same as the size of the second dimension of S; also, the size of the third dimension of T is 2, which is also the size of the third dimension of S.

The MATLAB function reshape, invoked as B = reshape(A,sz), reformats the array A using the size vector sz to define size(B). The vector sz must contain at least two elements, and the product of the components of sz must equal numel(A), which gives the number of elements in the array A.
Example 15.29. For the array A = randi([1,10],[2 3 2]) given by

A(:,:,1) =
     9     2     7
    10    10     1
A(:,:,2) =
     3    10     2
     6    10    10

the effect of the statement B = reshape(A,[2 6]) is to produce the array

B =
     9     2     7     3    10     2
    10    10     1     6    10    10
The alternative format B = reshape(A,sz1,...,szN) reformats A into an array having sz1,...,szN as the sizes of its dimensions. The vector A = 1:10 can be reshaped into a 5 × 2 matrix by B = reshape(A,[5 2]), which yields

A =
     1     2     3     4     5     6     7     8     9    10
>> B = reshape(A,[5 2])
B =
     1     6
     2     7
     3     8
     4     9
     5    10
Finally, to reshape an array of integers A given by A = randi([1 12],[3 2 4]) into a 6 × 4 matrix M, we write M = reshape(A,6,4) and obtain

>> A = randi([1 12],[3,2,4])
A(:,:,1) =
    12     2
     6     6
    10    11
A(:,:,2) =
    10     1
    12    11
     8    12
A(:,:,3) =
     9     5
    10     8
     9     3
A(:,:,4) =
     9     1
     1     2
     4    10

and

M =
    12    10     9     9
     6    12    10     1
    10     8     9     4
     2     1     5     1
     6    11     8     2
    11    12     3    10
The function permute applied as B = permute(A,dimorder) rearranges the dimensions of an array in the order specified by the vector dimorder. For example, permute(A,[2 1]) switches the row and column dimensions of a matrix A. In general, the ith dimension of the output array is the dimension dimorder(i) from the input array.

Example 15.30. To create a 3 × 4 × 2 array and permute it so that the first and third dimensions are switched, resulting in a 2 × 4 × 3 array, we could use the following MATLAB code:

A = rand(3,4,2)
A(:,:,1) =
    0.8147    0.9134    0.2785    0.9649
    0.9058    0.6324    0.5469    0.1576
    0.1270    0.0975    0.9575    0.9706
A(:,:,2) =
    0.9572    0.1419    0.7922    0.0357
    0.4854    0.4218    0.9595    0.8491
    0.8003    0.9157    0.6557    0.9340

B = permute(A,[3 2 1])
B(:,:,1) =
    0.8147    0.9134    0.2785    0.9649
    0.9572    0.1419    0.7922    0.0357
B(:,:,2) =
    0.9058    0.6324    0.5469    0.1576
    0.4854    0.4218    0.9595    0.8491
B(:,:,3) =
    0.1270    0.0975    0.9575    0.9706
    0.8003    0.9157    0.6557    0.9340
Example 15.31. We start by creating an mda of random integers using X = randi(10,5,6,4,2):

X(:,:,1,1) =
     1     8    10     7     1     3
     3     5     6     7     9     4
     9     6     6     4    10     7
     1     3     3     4     8     2
    10     5     5    10     1     8
X(:,:,2,1) =
     2    10     1     7     2     2
     7     9     8     7     3    10
     5     4     6     9     9     8
     8     7     5     9     1     6
     8     2    10     6     5     5
X(:,:,3,1) =
     1     1     7     5     2     4
     7     9     6     5     4     6
     1     9    10     9     9     5
     1     8     7     1     9     7
     6     2     9     2     1     7
X(:,:,4,1) =
     3     2    10     5     4     7
     5     4    10     6     8     2
     1     2     1    10     7     2
    10     5     8     5     6    10
     2     4     3    10     7     2
X(:,:,1,2) =
     1     4     7     2     3    10
     6     5     4     6     3     8
     9    10     2     3     7     4
     7     2     5     4     3     6
     2     9     5     6     9     2
X(:,:,2,2) =
    10     1     5     7     6    10
     9     5     1     7     7     6
     9     4     6     1     5     4
     3     2     5     1     9     2
     6     2     7     4     8     7
X(:,:,3,2) =
     8     3     6     7     3     8
     5     5    10     3     3     7
     1     6     7     7     7     1
     3     5    10     7     9     7
     2     9     3     1     4     4
X(:,:,4,2) =
    10     8     2     7     3     6
     1     4     8     2     8     7
     5     8     5     8     2     6
     5     5     2     3     3     5
     5     1     4    10     1     7
Next, we create two arrays R =[2 3] and C = [4 1] that define a partitioning of the set of modes of the array X. The array I = size(X) defines the sizes of the modes of X; J = prod(I(R)) and K = prod(I(C)) define the dimensions of the target matrix Y:
>> J = prod(I(R))
J =
    24
>> K = prod(I(C))
K =
    10
The resulting matrix Y will have 24 rows and 10 columns and can be obtained by Y = reshape(permute(X,[R C]),J,K)
as follows:

Y =
     1     1     3     6     9     9     1     7    10     2
     8     4     5     5     6    10     3     2     5     9
    10     7     6     4     6     2     3     5     5     5
     7     2     7     6     4     3     4     4    10     6
     1     3     9     3    10     7     8     3     1     9
     3    10     4     8     7     4     2     6     8     2
     2    10     7     9     5     9     8     3     8     6
    10     1     9     5     4     4     7     2     2     2
     1     5     8     1     6     6     5     5    10     7
     7     7     7     7     9     1     9     1     6     4
     2     6     3     7     9     5     1     9     5     8
     2    10    10     6     8     4     6     2     5     7
     1     8     7     5     1     1     1     3     6     2
     1     3     9     5     9     6     8     5     2     9
     7     6     6    10    10     7     7    10     9     3
     5     7     5     3     9     7     1     7     2     1
     2     3     4     3     9     7     9     9     1     4
     4     8     6     7     5     1     7     7     7     4
     3    10     5     1     1     5    10     5     2     5
     2     8     4     4     2     8     5     5     4     1
    10     2    10     8     1     5     8     2     3     4
     5     7     6     2    10     8     5     3    10    10
     4     3     8     8     7     2     6     3     7     1
     7     6     2     7     2     6    10     5     2     7
To convert the matrix Y back into an mda, we could use the function ipermute(B,dimorder), which undoes a permutation of the dimensions of an array B specified by the vector dimorder, as in Z = ipermute(reshape(Y,[I(R) I(C)]),[R C]).
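The matricization and its inverse can be checked end to end; a minimal sketch, under the assumptions of this example (X, R, C, I, J, and K as above):

% Sketch: matricize X along the modes in [R C], then undo the operation
Y = reshape(permute(X, [R C]), J, K);
Z = ipermute(reshape(Y, [I(R) I(C)]), [R C]);
isequal(Z, X)    % should return logical 1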
Vectorization of an mda is a special case of matricization that converts an mda to a vector. Let V1, . . . , Vp be p F-linear spaces, where dim(Vi) = ni for 1 ⩽ i ⩽ p. If π ∈ PERMp, the linear spaces
\[
V_1 \otimes V_2 \otimes \cdots \otimes V_p,\quad
\mathbb{F}^{n_1} \otimes \mathbb{F}^{n_2} \otimes \cdots \otimes \mathbb{F}^{n_p},\quad
\mathbb{F}^{n_{\pi(1)}} \otimes \mathbb{F}^{n_{\pi(2)}} \otimes \cdots \otimes \mathbb{F}^{n_{\pi(p)}},\quad
\mathbb{F}^{n_1 n_2 \cdots n_p}
\]
are isomorphic. The isomorphisms between these linear spaces make it possible to interpret a p-dimensional array as an mda of order ℓ, where ℓ ⩽ p.

Let {e_1^j, . . . , e_{n_j}^j} be a basis in F^{n_j} for 1 ⩽ j ⩽ p. For m ∈ N, define [m] = {1, . . . , m}. A bijective map μ : [n1] × · · · × [np] −→ [n1 · · · np] generates a linear space isomorphism between F^{n_1} ⊗ · · · ⊗ F^{n_p} and F^{n_1 ··· n_p}, which maps e_{j_1}^1 ⊗ · · · ⊗ e_{j_p}^p into e_{μ(j_1,...,j_p)}.

The isomorphism V1 ⊗ V2 ⊗ · · · ⊗ Vp ≅ F^{n_1} ⊗ F^{n_2} ⊗ · · · ⊗ F^{n_p} allows the interpretation of an mda as a vector in R^{n_1 n_2 ··· n_p}. This interpretation is known as a vectorization, and it is a generalization of the matrix vectorization that was introduced in Definition 3.16. Namely, the vectorization vec(T) of the mda T is
\[
\mathrm{vec}(T) = \begin{pmatrix} t_{11\cdots 1} \\ \vdots \\ t_{n_1 1\cdots 1} \\ t_{12\cdots 1} \\ \vdots \\ t_{n_1 n_2 \cdots n_p} \end{pmatrix}.
\]
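For a dense mda stored as an ordinary MATLAB array, this vectorization amounts to stacking the entries in column-major order; a minimal sketch:

% Sketch: vec of a multidimensional array via column-major reshaping
T = randi(9, 3, 2, 4);      % a 3 x 2 x 4 array
v = reshape(T, [], 1);      % equivalently v = T(:)
% the entry T(i1,i2,i3) appears in position i1 + 3*(i2-1) + 6*(i3-1) of v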
Example 15.32. The vectorization mapping introduced in Definition 3.16 is implemented in MATLAB using the function reshape. If A ∈ R3×2 is the matrix defined as

>> A = [1 2;3 4;5 6]
A =
     1     2
     3     4
     5     6
then v = reshape(A,6,1) will return

v =
     1
     3
     5
     2
     4
     6
which is vec(A). An alternative way to compute vec(A) is v = A(:). To reverse this reshaping, we can write A = reshape(v,3,2).

Example 15.33. Let T ∈ R3×2×3 be the mda defined as
\[
\begin{aligned}
a_{111} &= a_{112} = a_{211} = -a_{212} = 1,\\
a_{213} &= a_{311} = a_{313} = a_{121} = a_{122} = a_{221} = -a_{222} = 2,\\
a_{223} &= a_{321} = a_{323} = 4,\\
a_{113} &= a_{312} = a_{123} = a_{322} = 0.
\end{aligned}
\]
The matrix unfolding T(1) ∈ R3×6 is
\[
T_{(1)} =
\begin{pmatrix}
1 & 1 & 0 & 2 & 2 & 0\\
1 & -1 & 2 & 2 & -2 & 4\\
2 & 0 & 2 & 4 & 0 & 4
\end{pmatrix}.
\]
The columns of T(1) , numbered with roman numerals from I to VI, are shown in Figure 15.9.
Fig. 15.9  The columns of the unfolding T(1).

15.10 Hyperdeterminants
Hyperdeterminants extend the notion of determinant defined for matrices and offer information about solutions of multilinear systems.

Example 15.34. Let T be an mda of format 3 × 2 × 2. It is possible to slice this array as three 2 × 2 matrices:
\[
\begin{pmatrix} t_{000} & t_{001}\\ t_{010} & t_{011} \end{pmatrix},\quad
\begin{pmatrix} t_{100} & t_{101}\\ t_{110} & t_{111} \end{pmatrix},\quad\text{and}\quad
\begin{pmatrix} t_{200} & t_{201}\\ t_{210} & t_{211} \end{pmatrix}
\]
or as two 3 × 2 matrices:
\[
\begin{pmatrix} t_{000} & t_{001}\\ t_{100} & t_{101}\\ t_{200} & t_{201} \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} t_{010} & t_{011}\\ t_{110} & t_{111}\\ t_{210} & t_{211} \end{pmatrix}.
\]
The homogeneous multilinear system T(x ⊗ y) = 0, where x, y ∈ R2, is
\[
\begin{aligned}
t_{000}x_0y_0 + t_{001}x_0y_1 + t_{010}x_1y_0 + t_{011}x_1y_1 &= 0,\\
t_{100}x_0y_0 + t_{101}x_0y_1 + t_{110}x_1y_0 + t_{111}x_1y_1 &= 0,\\
t_{200}x_0y_0 + t_{201}x_0y_1 + t_{210}x_1y_0 + t_{211}x_1y_1 &= 0.
\end{aligned}
\]
Note that this system has a solution x_0 = x_1 = 0 for any value of y = (y_0, y_1)'.

Example 15.35. Consider the multilinear system x_0y_0 = z_0, x_0y_1 = z_1, x_1y_0 = z_2, x_1y_1 = z_3. Observe that if the system has a solution, then z_0z_3 − z_1z_2 = x_0y_0x_1y_1 − x_0y_1x_1y_0 = 0. The system has a non-trivial solution if and only if z_0z_3 − z_1z_2 = 0 and (z_0, z_1, z_2, z_3) ≠ 0_4. Indeed, suppose that at least one number, say z_0, is distinct from 0 and z_0z_3 − z_1z_2 = 0. This implies x_0 ≠ 0 and y_0 ≠ 0, which, in turn, imply y_1 = z_1/x_0 and x_1 = z_2/y_0. Therefore, x_1y_1 = (z_1z_2)/(x_0y_0) = (z_1z_2)/z_0 = z_3, hence the system has a non-trivial solution. The converse implication is immediate.

Theorem 15.9. Let T be a 2 × 2 × 2 mda that defines a system in the variables x_0, x_1, y_0, y_1 as follows:
\[
\begin{aligned}
t_{000}x_0y_0 + t_{001}x_0y_1 + t_{010}x_1y_0 + t_{011}x_1y_1 &= 0,\\
t_{100}x_0y_0 + t_{101}x_0y_1 + t_{110}x_1y_0 + t_{111}x_1y_1 &= 0.
\end{aligned}
\]
This system has non-trivial solutions up to a multiplicative constant. Also, the system has a unique solution up to a multiplicative constant if and only if
\[
\begin{vmatrix}
t_{000}x_0 + t_{010}x_1 & t_{001}x_0 + t_{011}x_1\\
t_{100}x_0 + t_{110}x_1 & t_{101}x_0 + t_{111}x_1
\end{vmatrix} = 0.
\]

Proof. In matrix form, the system can be written as
\[
\begin{pmatrix}
t_{000}x_0 + t_{010}x_1 & t_{001}x_0 + t_{011}x_1\\
t_{100}x_0 + t_{110}x_1 & t_{101}x_0 + t_{111}x_1
\end{pmatrix}
\begin{pmatrix} y_0\\ y_1 \end{pmatrix}
=
\begin{pmatrix} 0\\ 0 \end{pmatrix}.
\]
Therefore, non-trivial solutions exist for values of x_0, x_1 such that
\[
\begin{vmatrix}
t_{000}x_0 + t_{010}x_1 & t_{001}x_0 + t_{011}x_1\\
t_{100}x_0 + t_{110}x_1 & t_{101}x_0 + t_{111}x_1
\end{vmatrix} = 0.   \quad (15.11)
\]
The determinant in the left member of Equality (15.11) equals
\[
x_0^2 \begin{vmatrix} t_{000} & t_{001}\\ t_{100} & t_{101} \end{vmatrix}
+ x_0x_1 \left( \begin{vmatrix} t_{000} & t_{011}\\ t_{100} & t_{111} \end{vmatrix}
+ \begin{vmatrix} t_{010} & t_{001}\\ t_{110} & t_{101} \end{vmatrix} \right)
+ x_1^2 \begin{vmatrix} t_{010} & t_{011}\\ t_{110} & t_{111} \end{vmatrix},
\]
and is a homogeneous polynomial. Its discriminant,
\[
\left( \begin{vmatrix} t_{000} & t_{011}\\ t_{100} & t_{111} \end{vmatrix}
+ \begin{vmatrix} t_{010} & t_{001}\\ t_{110} & t_{101} \end{vmatrix} \right)^2
- 4 \begin{vmatrix} t_{000} & t_{001}\\ t_{100} & t_{101} \end{vmatrix}
\begin{vmatrix} t_{010} & t_{011}\\ t_{110} & t_{111} \end{vmatrix},
\]
is referred to as the hyperdeterminant of T.

Let T be an mda of format 3 × 2 × 2 and let A_T be the matrix defined as
\[
A_T = \begin{pmatrix}
t_{000} & t_{001} & t_{010} & t_{011}\\
t_{100} & t_{101} & t_{110} & t_{111}\\
t_{200} & t_{201} & t_{210} & t_{211}
\end{pmatrix}.   \quad (15.12)
\]
Denote by T_{00}, T_{01}, T_{10}, and T_{11} the matrices obtained from A_T by eliminating the first column (00), the second column (01), the third column (10), and the fourth column (11), respectively.
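The discriminant above is easy to evaluate numerically. The following sketch, saved as the hypothetical file hyperdet222.m, computes it for a 2 × 2 × 2 array T in which T(i+1,j+1,k+1) stores t_{ijk}:

% Sketch: hyperdeterminant (the discriminant above) of a 2 x 2 x 2 array T,
% where T(i+1,j+1,k+1) stores t_{ijk}
function d = hyperdet222(T)
  a  = @(i,j,k) T(i+1, j+1, k+1);
  d1 = a(0,0,0)*a(1,1,1) - a(1,0,0)*a(0,1,1);   % |t000 t011; t100 t111|
  d2 = a(0,1,0)*a(1,0,1) - a(1,1,0)*a(0,0,1);   % |t010 t001; t110 t101|
  d3 = a(0,0,0)*a(1,0,1) - a(1,0,0)*a(0,0,1);   % |t000 t001; t100 t101|
  d4 = a(0,1,0)*a(1,1,1) - a(1,1,0)*a(0,1,1);   % |t010 t011; t110 t111|
  d  = (d1 + d2)^2 - 4*d3*d4;
end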
If rank(A_T) < 3, then one of the equations of the system is a linear combination of the other two and we obtain a 2 × 2 × 2 system. For example, if the third row of A_T is a linear combination of the first two rows, the system
\[
\begin{aligned}
t_{000}x_0y_0 + t_{001}x_0y_1 + t_{010}x_1y_0 + t_{011}x_1y_1 &= 0,\\
t_{100}x_0y_0 + t_{101}x_0y_1 + t_{110}x_1y_0 + t_{111}x_1y_1 &= 0,
\end{aligned}
\]
can be written as
\[
\begin{aligned}
(t_{000}y_0 + t_{001}y_1)x_0 + (t_{010}y_0 + t_{011}y_1)x_1 &= 0,\\
(t_{100}y_0 + t_{101}y_1)x_0 + (t_{110}y_0 + t_{111}y_1)x_1 &= 0.
\end{aligned}
\]
This system has a non-trivial solution in x_0, x_1 if
\[
\begin{vmatrix}
t_{000}y_0 + t_{001}y_1 & t_{010}y_0 + t_{011}y_1\\
t_{100}y_0 + t_{101}y_1 & t_{110}y_0 + t_{111}y_1
\end{vmatrix} = 0.
\]
Theorem 15.10. Suppose that rank(AT ) = 3, where AT is the matrix defined in Equality (15.12). The multilinear system t000 x0 y0 + t001 x0 y1 + t010 x1 y0 + t011 x1 y1 = 0, t100 x0 y0 + t101 x0 y1 + t110 x1 y0 + t111 x1 y1 = 0, t200 x0 y0 + t201 x0 y1 + t210 x1 y0 + t211 x1 y1 = 0. has a non-trivial solution if and only if det(T01 ) det(T10 ) − det(T00 ) det(T11 ) = 0. Proof.
Let z0 = x0 y0 , z1 = x0 y1 , z2 = x1 y0 , and z3 = x1 y1 ,
and consider the new multilinear system t000 z0 + t001 z1 + t010 z2 + t011 z3 = 0, t100 z0 + t101 z1 + t110 z2 + t111 z3 = 0, t200 z0 + t201 z1 + t210 z2 + t211 z3 = 0, x0 y0 = z0 , x0 y1 = z1 , x1 y0 = z2 , x1 y1 = z3 .
By Supplement 47 of Chapter 5, if the linear system that consists of the first three equations has a nontrivial solution (z_0, z_1, z_2, z_3), then
\[
\frac{z_0}{\det(T_{00})} = \frac{z_1}{-\det(T_{01})} = \frac{z_2}{\det(T_{10})} = \frac{z_3}{-\det(T_{11})},
\]
where the matrices T_k are obtained from A_T by eliminating the kth column, where 1 ⩽ k ⩽ 4. Furthermore, Example 15.35 shows that the full multilinear system having a non-trivial solution implies that det(T_{01}) det(T_{10}) − det(T_{00}) det(T_{11}) = 0.

The notion of hyperdeterminant has been extended to mdas having boundary format, that is, to mdas having the format (k_0 + 1) × (k_1 + 1) × · · · × (k_d + 1), where k_0 = \sum_{i=1}^{d} k_i. In this special situation, it is shown in [63] that the hyperdeterminant can be expressed as the determinant of a block matrix. Note that an mda of format 3 × 2 × 2 has this boundary format.

15.11 Eigenvalues and Singular Values
Let T be an mth order, n-dimensional mda. The mth degree homogeneous polynomial f_T defined by T is given by
\[
f_T(x) = \sum_{i_1 \cdots i_m} T_{i_1\cdots i_m}\, x_{i_1} \cdots x_{i_m}.
\]
If x^m is the mth order n-dimensional mda with entries x_{i_1} · · · x_{i_m}, then T x^m = f_T(x). The mda T is positive definite if f_T is positive definite. For a vector x ∈ R^n, x = (x_1, . . . , x_n), and m ∈ N, denote by x^{[m]} the vector in R^n whose components are (x^{[m]})_i = x_i^m for 1 ⩽ i ⩽ n. Starting from the k-mode product of an mda T = (T_{i_1\cdots i_m}) and a matrix P ∈ R^{p×n}, we consider the mda P^m(T) given by
\[
(P^m(T))_{j_1\cdots j_m} = \sum_{i_1,\ldots,i_m=1}^{n} T_{i_1\cdots i_m}\, p_{j_1 i_1} \cdots p_{j_m i_m}.
\]
If P is a row vector x = (x_1, . . . , x_n), the following notations introduced in [138], which involve homogeneous polynomials, are used:
\[
\begin{aligned}
T x^{m-2} &= T \times_3 x \times_4 \cdots \times_m x, &\quad (T x^{m-2})_{ij} &= \sum_{i_3,\ldots,i_m=1}^{n} T_{i j i_3\cdots i_m}\, x_{i_3} \cdots x_{i_m},\\
T x^{m-1} &= T \times_2 x \times_3 \cdots \times_m x, &\quad (T x^{m-1})_{i} &= \sum_{i_2,\ldots,i_m=1}^{n} T_{i i_2\cdots i_m}\, x_{i_2} \cdots x_{i_m},\\
T x^{m} &= T \times_1 x \times_2 \cdots \times_m x &&= \sum_{i_1,\ldots,i_m=1}^{n} T_{i_1\cdots i_m}\, x_{i_1} \cdots x_{i_m}.
\end{aligned}
\]
The right-hand side of the last equality is the full contraction of T and x. In general, one could define T x^{m-k} as T x^{m-k} = T ×_{k+1} x ×_{k+2} · · · ×_m x, having as components the homogeneous polynomials
\[
\sum_{i_{k+1},\ldots,i_m=1}^{n} T_{j_1\ldots j_k i_{k+1}\cdots i_m}\, x_{i_{k+1}} \cdots x_{i_m}.
\]
Definition 15.15. Let T ∈ M_{m,n}. A complex number λ is an eigenvalue of T if there exists x ∈ C^n − {0_n} such that the following equalities involving homogeneous polynomials are satisfied:
\[
(T x^{m-1})_i = \lambda x_i^{m-1} \quad \text{for } 1 \leqslant i \leqslant n.   \quad (15.13)
\]
The vector x is an eigenvector of T associated with λ and (λ, x) is an eigenpair of T. If x^{[m−1]} is the vector in C^n defined by (x^{[m−1]})_i = x_i^{m−1} for 1 ⩽ i ⩽ n, then Equality (15.13) can be written as T x^{m−1} = λ x^{[m−1]}. The spectrum of T is the set of eigenvalues of T. This set is denoted as spec(T); the spectral radius of T is the number ρ(T) = max{|λ| | λ ∈ spec(T)}.

The term H-eigenvalue was introduced in [136].

Definition 15.16. An eigenvalue λ of an mda T is an H-eigenvalue if λ is real and there is a real eigenvector x for λ. In this case, we refer to x as an H-eigenvector.

If λ ∈ R, x ∈ R^n, T x^{m−1} = λx, and x'x = 1, then (λ, x) is said to be a Z-pair (cf. [136, 138]).

Theorem 15.11. Let T ∈ T_{m,n} be an mda, where m is an even number. Then T has H-eigenvalues.

Proof. Consider the minimization of the continuous function T x^m subject to the restriction \sum_{i=1}^{n} x_i^m = 1. Since m is a positive even integer, the set {x ∈ R^n | \sum_{i=1}^{n} x_i^m = 1} is compact and the minimizer x^* exists. Using the Lagrangian L(x, λ) = T x^m − λ(\sum_{i=1}^{n} x_i^m − 1), the Karush–Kuhn–Tucker optimality conditions
\[
\frac{\partial L(x,\lambda)}{\partial x} = 0 \quad\text{and}\quad \frac{\partial L(x,\lambda)}{\partial \lambda} = 0
\]
amount to m T x^{m−1} − λ m x^{[m−1]} = 0 and \sum_{i=1}^{n} x_i^m − 1 = 0, which imply T x^{m−1} = λ x^{[m−1]}, which is the equality that defines eigenvalues, where x = x^*.

The notions of positive definiteness and semidefiniteness for matrices can be extended to mdas in M_{m,n}.

Definition 15.17. An mda T ∈ M_{m,n} is positive semidefinite if T x^m ⩾ 0 for all x ∈ R^n. If T x^m > 0 for x ∈ R^n − {0_n}, then T is positive definite.
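The defining equalities can be checked numerically for small mdas; the sketch below (a hypothetical helper, say txm1.m) evaluates T x^{m−1} for an order-3 mda by direct summation, without relying on any toolbox:

% Sketch: evaluate (T x^{m-1})_i = sum_{i2,i3} T(i,i2,i3) x(i2) x(i3)
% for an order-3, n-dimensional mda stored as an n x n x n array T.
function y = txm1(T, x)
  n = size(T, 1);
  y = zeros(n, 1);
  for i = 1:n
    for i2 = 1:n
      for i3 = 1:n
        y(i) = y(i) + T(i, i2, i3) * x(i2) * x(i3);
      end
    end
  end
end
% (lambda, x) is an eigenpair when y equals lambda * x.^(m-1), here m = 3.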
Let λ_{H min}(T) be the smallest H-eigenvalue of T, where T ∈ T_{m,n} and m is an even number. Then T is positive definite (positive semidefinite) if and only if λ_{H min}(T) > 0 (λ_{H min}(T) ⩾ 0, respectively). Indeed, by the definition of x^* given in Theorem 15.11, for every x ∈ R^n − {0_n} we have
\[
T\!\left( \frac{x}{\bigl(\sum_{i=1}^{n} x_i^m\bigr)^{1/m}} \right)^{\!m} \geqslant T(x^*)^m = \lambda_{H\min}(T).
\]
Since \sum_{i=1}^{n} x_i^m > 0 when x ∈ R^n − {0_n}, it follows that T x^m > 0 for all x ∈ R^n − {0_n} exactly when λ_{H min}(T) > 0.

A generalization of Gershgorin's Theorem (Theorem 7.21) for mdas follows.

Theorem 15.12. Let T ∈ M_{m,n} be an mda and let r_i be the sum of the absolute values of the off-diagonal entries of T whose first index is i. The eigenvalues of T are contained in the union of n disks having the diagonal components of T as their centers and the numbers r_i as their radii.

Proof. Suppose that (λ, x) is an eigenpair of T and that x_i is a component of maximum absolute value, that is, |x_i| = max{|x_j| | 1 ⩽ j ⩽ n}. We have
\[
\lambda x_i^{m-1} = \sum_{i_2,\ldots,i_m=1}^{n} a_{i i_2\cdots i_m}\, x_{i_2} \cdots x_{i_m},
\]
which implies
\[
(\lambda - a_{i\cdots i})\, x_i^{m-1} = \sum_{\substack{i_2,\ldots,i_m=1\\ \delta_{i i_2\cdots i_m}=0}}^{n} a_{i i_2\cdots i_m}\, x_{i_2} \cdots x_{i_m}.
\]
This implies
\[
|\lambda - a_{i\cdots i}| \leqslant \sum_{\substack{i_2,\ldots,i_m=1\\ \delta_{i i_2\cdots i_m}=0}}^{n} |a_{i i_2\cdots i_m}|\, \frac{|x_{i_2}|}{|x_i|} \cdots \frac{|x_{i_m}|}{|x_i|}
\leqslant \sum_{\substack{i_2,\ldots,i_m=1\\ \delta_{i i_2\cdots i_m}=0}}^{n} |a_{i i_2\cdots i_m}|,
\]
which concludes the argument.
The notion of submatrix introduced in Definition 3.5 can be extended to mdas.

Definition 15.18. A submda (or a subtensor) of an mda T ∈ R^{I_1×···×I_N} is an mda T_{i_n=a} ∈ R^{I_1×···×I_{n−1}×I_{n+1}×···×I_N} obtained by fixing the nth index to a.

Example 15.36. Let T be an mda with the format 3 × 4 × 2 having the following frontal slices:
\[
T_1 = \begin{pmatrix} 1 & 4 & 7 & 10\\ 2 & 5 & 8 & 11\\ 3 & 6 & 9 & 12 \end{pmatrix}
\quad\text{and}\quad
T_2 = \begin{pmatrix} 13 & 16 & 19 & 22\\ 14 & 17 & 20 & 23\\ 15 & 18 & 21 & 24 \end{pmatrix}.
\]
T_1 and T_2 are the subtensors T_{i_3=1} and T_{i_3=2}. The subtensors T_{i_1=1}, T_{i_1=2}, and T_{i_1=3} in R^{4×2} are
\[
T_{i_1=1} = \begin{pmatrix} 1 & 13\\ 4 & 16\\ 7 & 19\\ 10 & 22 \end{pmatrix},\quad
T_{i_1=2} = \begin{pmatrix} 2 & 14\\ 5 & 17\\ 8 & 20\\ 11 & 23 \end{pmatrix},\quad
T_{i_1=3} = \begin{pmatrix} 3 & 15\\ 6 & 18\\ 9 & 21\\ 12 & 24 \end{pmatrix}.
\]
We reformulate the Singular Value Decomposition Theorem for matrices (Theorem 9.1) in preparation for a similar result for mdas. Namely, for a matrix A ∈ C^{I_1×I_2}, there is a decomposition
\[
A = U^{(1)} S \bigl(V^{(2)}\bigr)^{\mathsf{H}} = S \times_1 U^{(1)} \times_2 V^{(2)} = S \times_1 U^{(1)} \times_2 U^{(2)},
\]
such that U^{(1)} = (u_1^{(1)}\, u_2^{(1)} \cdots u_{I_1}^{(1)}) is a unitary (I_1 × I_1)-matrix, V^{(2)} = U^{(2)} = (u_1^{(2)}\, u_2^{(2)} \cdots u_{I_2}^{(2)}) is a unitary (I_2 × I_2)-matrix, S is an (I_1 × I_2)-matrix such that S = diag(σ_1, σ_2, . . . , σ_{min{I_1,I_2}}), and σ_1 ⩾ σ_2 ⩾ · · · ⩾ σ_{min{I_1,I_2}} ⩾ 0.
The generalization of this result to mdas that follows is known as the Higher Order Singular Value Decomposition (HOSVD). The presentation follows [100, 102].

Theorem 15.13 (HOSVD theorem). Every mda T ∈ R^{I_1×···×I_N} can be written as
\[
T = S \times_1 U^{(1)} \times_2 \cdots \times_N U^{(N)},
\]
such that

(i) U^{(n)} = (u_1^{(n)}\, u_2^{(n)} \cdots u_{I_n}^{(n)}) is an orthogonal (I_n × I_n)-matrix;
(ii) S ∈ R^{I_1×···×I_N} is an mda such that (a) any two distinct mdas S_{i_n=α} and S_{i_n=β} with α ≠ β are orthogonal (the general orthogonality property), and (b)
\[
\|S_{i_n=1}\| \geqslant \|S_{i_n=2}\| \geqslant \cdots \geqslant \|S_{i_n=I_n}\| \geqslant 0,
\]
for all possible values of n, α, β with α ≠ β.

The ith column u_i^{(n)} of the matrix U^{(n)} is the ith n-singular vector. The numbers \|S_{i_n=i}\|, denoted by σ_i^{(n)}, are the n-mode singular values.

Proof. Consider the mdas T, S ∈ R^{I_1×···×I_N} such that
\[
S = T \times_1 \bigl(U^{(1)}\bigr)^{\mathsf{T}} \times_2 \cdots \times_N \bigl(U^{(N)}\bigr)^{\mathsf{T}},
\]
where U^{(1)}, . . . , U^{(N)} are orthogonal matrices. In matrix form, this equality becomes
\[
T_{(n)} = U^{(n)} S_{(n)} \bigl( U^{(n+1)} \otimes \cdots \otimes U^{(N)} \otimes U^{(1)} \otimes \cdots \otimes U^{(n-1)} \bigr)^{\mathsf{T}}.
\]
If U^{(n)} is obtained from the SVD of T_{(n)} as T_{(n)} = U^{(n)} \Sigma^{(n)} \bigl(V^{(n)}\bigr)^{\mathsf{T}}, where V^{(n)} is orthogonal and \Sigma^{(n)} = diag(σ_1^{(n)}, σ_2^{(n)}, . . . , σ_{I_n}^{(n)}) with σ_1^{(n)} ⩾ σ_2^{(n)} ⩾ · · · ⩾ σ_{I_n}^{(n)} ⩾ 0, we denote by r_n the highest index for which σ_{r_n}^{(n)} > 0. Taking into account that the matrix U^{(n+1)} ⊗ · · · ⊗ U^{(N)} ⊗ U^{(1)} ⊗ · · · ⊗ U^{(n−1)} is orthogonal (by Supplement 117 of Chapter 6), it follows that
\[
S_{(n)} = \Sigma^{(n)} \bigl(V^{(n)}\bigr)^{\mathsf{T}} \bigl( U^{(n+1)} \otimes U^{(n+2)} \otimes \cdots \otimes U^{(N)} \otimes U^{(1)} \otimes U^{(2)} \otimes \cdots \otimes U^{(n-1)} \bigr).
\]
This implies for arbitrary orthogonal matrices U^{(1)}, . . . , U^{(n−1)}, U^{(n+1)}, . . . , U^{(N)} that if α ≠ β, then (S_{i_n=α}, S_{i_n=β}) = 0 and
\[
\|S_{i_n=1}\| = σ_1^{(n)} \geqslant \|S_{i_n=2}\| = σ_2^{(n)} \geqslant \cdots \geqslant \|S_{i_n=I_n}\| = σ_{I_n}^{(n)} \geqslant 0,
\]
and, if r_n < I_n,
\[
\|S_{i_n=r_n+1}\| = σ_{r_n+1}^{(n)} = \cdots = \|S_{i_n=I_n}\| = σ_{I_n}^{(n)} = 0.
\]
The matrices U^{(1)}, . . . , U^{(n−1)}, U^{(n+1)}, . . . , U^{(N)} can be constructed in the same manner as U^{(n)} such that S and these matrices satisfy the conditions of the theorem. On the other hand, all matrices U^{(1)}, . . . , U^{(N)} and S that satisfy the theorem can be found from the singular value decompositions of the matrices T_{(n)}.

Note that the matrix S that occurs in the SVD is replaced by the core mda S, and the diagonal character of the matrix S in the SVD Theorem is replaced by the orthogonality of the mdas of the form S_{i_n=a}. The role played by the singular values in the SVD theorem is played by the norms of the mdas S_{i_n=k}. Ample details concerning the spectral theory of mdas can be found in the monograph [138].

15.12 Decomposition of Tensors
The notion of singular value defined for matrices was extended to tensors in [101]. This is the basis of the CANDECOMP/PARAFAC technique (abbreviated as CP), which decomposes a tensor as a sum of rank-1 tensors, a concept developed in the psychometric [25, 46] and chemometric publications [89].

A construction introduced by Kruskal [97] starts with the matrices A ∈ R^{I×R}, B ∈ R^{J×R}, and C ∈ R^{K×R} and defines the mda T as
\[
T_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}.
\]
The mda T is denoted by ⟦A, B, C⟧.
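The Kruskal construction is easy to materialize explicitly with plain MATLAB arrays (no toolbox assumed); the sizes and factor matrices below are illustrative:

% Sketch: form T(i,j,k) = sum_r A(i,r)*B(j,r)*C(k,r) for given factors
I = 4; J = 3; K = 2; R = 2;
A = randn(I,R); B = randn(J,R); C = randn(K,R);
T = zeros(I,J,K);
for r = 1:R
  % kron builds the rank-1 term a_r o b_r o c_r in column-major order
  T = T + reshape(kron(C(:,r), kron(B(:,r), A(:,r))), I, J, K);
end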
A more general decomposition applicable to mdas of order N allows an mda T ∈ RI1 ×···×In to be expressed as Ti1 i2 ···ir =
R
λr ai1 r ai2 r · · · ain r ,
r=1
where λ ∈ RR , Aip ∈ RIp ×R for 1 p r. In this case, T is denoted by λ; A, B, C. Another modality of saving space when storing an mda is the Tucker operator (see [92] and [164]). A three-way mda T ∈ RI×J×K is stored as the product of a core mda G of size R × S × Z with corresponding matrices A ∈ RI×R , B ∈ RJ×S , and C ∈ RK×Z such that Tijk =
R S Z
Grst air bjs cks .
r=1 s=1 t=1
If I, J, K are much larger than R, S, Z, then forming T explicitly requires more memory than it is required to store its components, namely RSZ + IR + JS + KZ. Definition 15.19. Let T ∈ RJ1 ×···×JN , N = {1, . . . , N }, and let N matrices A(n) ∈ RIn ×Jn for n ∈ N. The Tucker operator applied to T produces the mda T ; A(1) , . . . , A(N ) defined by T ; A(1) , . . . , A(N ) = T ×1 A(1) ×2 · · · ×n A(N ) ∈ RI1 ×···×IN . T is referred to as the core mda. Theorem 15.14. Let T ∈ RJ1 ×···×JN and let N = {1, . . . , N }. If A(n) ∈ RIn ×Jn and B (n) ∈ RKn ×In for 1 n N , then T ; A(1) , . . . , A(N ) ; B (1) , . . . , B (N ) = T ; B (1) A(1) , . . . , B (N ) A(N ) . Proof.
By the definition of Tucker operator, we have T ; A(1) , . . . , A(N ) ; B (1) , . . . , B (N ) = (T ×1 A(1) ×2 · · · ×N A(N ) ); B (1) , . . . , B (N ) = (T ×1 A(1) ×2 · · · ×N A(N ) ) ×1 B (1) · · · ×N B (N ) .
Note that (T ×1 A(1) · · · ×N A(N ) )i1 ···iN =
j1 ···jn
(1)
(N )
Tj1 ···jN Ai1 j1 · · · AiN jN .
Furthermore, ((T ×1 A(1) · · · ×n A(N ) ) ×1 B (1) · · · ×N B (N ) )k1 ...kN (1) (N ) (1) (N ) = Tj1 ···jN Ai1 j1 · · · AiN jN Bk1 i1 · · · BkN iN i1 ···iN j1 ···jN
=
i1 ···iN j1 ···jN
=
j1 ···jn
(1)
(1)
(N )
(N )
Tj1 ···jN Bk1 i1 Ai1 j1 · · · BkN iN AiN jN
Tj1 ···jN
i1
(1)
(1)
Bk1 i1 Ai1 j1
⎛ ⎞ (N ) (N ) BkN iN AiN jN ⎠ ···⎝ iN
= (T ; B (1) A(1) , . . . , B (N ) A(N ) )k1 ...kN , which concludes the argument.
The next statement allows us to express the Tucker operator in terms of matricized mdas. Theorem 15.15. Let T ∈ RJ1 ×···×JN be an mda and let N = {1, · · · , N }. If A(n) ∈ RIn ×Jn for n ∈ N, and R = {r1 , . . . , rL }, C = {c1 , . . . , cM } form a partition of N, then S = T ; A(1) , . . . , A(N ) if and only if S(R×C):JN = (A(rL ) ⊗ · · · ⊗ A(r1 ) )T(R×C:IN ) (A(cM ) ⊗ · · · ⊗ A(c1 ) ) . Here IN = {I1 , . . . , IN } and JN = {J1 , . . . , JN }. Proof.
This is a direct consequence of Theorem 15.5.
The calculation of the norm of a large mda can be reduced to the calculation of the norm of a smaller mda using the thin QR factorization of rectangular matrices discussed in Theorem 6.65. Theorem 15.16. Let T ∈ RJ1 ×J2 ×···×JN be an mda having the set of modes N = {1, . . . , N }, and let A(n) be N rectangular matrices
having the QR decompositions A(n) = Q(n) R(n) for n ∈ N, where the columns of Q(n) constitute an orthonormal basis for range(A), and R(n) is an upper triangular invertible matrix such that its diagonal elements are real non-negative numbers for n ∈ N. We have T ; A(1) , . . . , A(N ) = T ; R(1) , . . . , R(N ) . Proof. By the definition of QR decomposition of matrices and Theorem 15.14, we have T ; A(1) , . . . , A(N ) = T ; Q(1) R(1) , . . . , Q(N ) R(N ) = T ; R(1) , . . . , R(N ) ; Q(1) , . . . , Q(n) . Supplement 15.13 further implies T ; A(1) , . . . , A(N ) = T ; R(1) , . . . , R(N ) .
Definition 15.20. The Tucker decomposition of an mda T ∈ R^{I_1×I_2×···×I_N} is given by T = ⟦G; A^{(1)}, A^{(2)}, . . . , A^{(N)}⟧, where A^{(n)} ∈ R^{I_n×J_n} and G ∈ R^{J_1×J_2×···×J_N}. The Tucker decomposition allows the generation of an approximation of T when the ranks of the matrices G_{(n)} are smaller than the corresponding ranks of the matrices T_{(n)}. In general, the Tucker decomposition is not unique (see Supplement 15.13).
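A Tucker-format tensor can likewise be expanded explicitly from its core and factor matrices; a minimal sketch with plain arrays and illustrative sizes:

% Sketch: expand T = G x_1 A x_2 B x_3 C for a three-way core G
R = 2; S = 3; Z = 2; I = 5; J = 4; K = 3;
G = randn(R,S,Z); A = randn(I,R); B = randn(J,S); C = randn(K,Z);
T = zeros(I,J,K);
for r = 1:R
  for s = 1:S
    for t = 1:Z
      % each core entry scales a rank-1 term built from the factor columns
      T = T + G(r,s,t) * reshape(kron(C(:,t), kron(B(:,s), A(:,r))), I, J, K);
    end
  end
end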
15.13
Approximation of mdas
The next theorem appears in [92, 102]. Theorem 15.17. Let T ∈ RI2 ×I2 ×···×IN and let A(n) ∈ RIn ×Jn be matrices with orthonormal sets of columns for 1 n N . The optimal mda G ∈ RJ1 ×J2 ×···×JN that minimizes T − G; A(1) , . . . , A(N ) for fixed matrices A(1) , . . . , A(N ) is given by
G = T ; A(1) , . . . , A(n) .
matrices A(i) for this problem maximize Furthermore, the optimal T ; A(1) , . . . , A(N ) . Choosing R = N in Theorem 15.15 allows us to write T − G; A(1) , . . . , A(N ) = T − G ×1 A(1) ×2 · · · ×n A(N ) = vec(T ) − (A(N ) ⊗ . . . ⊗ A(1) )vec(G). The minimization of the norm vec(T ) − (A(N ) ⊗ . . . ⊗ A(1) )vec(G) can be treated as a least square problem and the solution is † vec(G) = A(N ) ⊗ · · · ⊗ A(1) vec(T ). Proof.
The solution can be written as vec(G) = (A(N ) ⊗ · · · ⊗ A(1) )† vec(T ) = (A(N )† ⊗ · · · ⊗ A(1)† )vec(T ) (by Supplement 27 of Chapter 9)
= (A(N ) ⊗ · · · ⊗ A(1) )vec(T ) (because the matrices A(i) have orthonormal sets of columns). Next, we have T − G; A(1) , . . . , A(N ) = T 2 − 2(T, G; A(1) , . . . , A(N ) ) + G; A(1) , . . . , A(n) 2 . We have (T, G; A(1) , . . . , A(N ) )
= (T ; A(1) , . . . , A(N ) , G) (by Theorem 15.7) = G 2 .
Multidimensional Array and Tensors
951
By Supplement 17, we have (1) (N ) G; A , . . . , A = G , which implies 2 T − G; A(1) , . . . , A(N ) = T 2 − G 2 . Thus, minimizing T − G; A(1) , . . . , A(N ) amounts to maximizing
G . Next we discuss approximations of mdas by mdas of rank 1. In other words, given an mda T ∈ RI1 ×···×IN , we seek the determination of λ ∈ R and of unit-vectors u(1) , . . . , u(N ) that define a rank-1 mda S defined as (1)
(N )
Si1 ...iN = λui1 · · · ◦ uiN that minimizes the least-square function f (S) = T − S 2 .
(15.14)
This is a constrained optimization problem solved in [101] by the method of Lagrange multipliers. Consider the objective function f˜ defined as (n) (1) (N ) 2 (n) 2 Ti1 ···iN − λui1 · · · uiN + λ (uin ) − 1 . n
i1 ···iN
The equality λ
∂ f˜ (n) ∂uin
(n)
+λ2 uin
= 0 implies (1)
i1 ···in−1 in+1 ···iN
in
(n−1) (n+1) (N )
(n)
Ti1 ···iN ui1 · · · uin−1 uin+1 uiN = λ(n) uin
i1 ···in−1 in+1 ···iN
(1) 2
ui1
(15.15)
2 (n−1) 2 (n+1) 2 · · · uin−1 · · · uN . uin+1 iN
Linear Algebra Tools for Data Mining (Second Edition)
952
The equality
∂ f˜ ∂λ(n)
= 0 implies (n) 2 = 1. uin
(15.16)
in
Finally,
∂ f˜ ∂λ
= 0 yields
i1 ···iN
(1)
(N )
Ti1 ···iN ui1 · · · uiN = λ
i1 ···iN
(1) 2
ui1
(N ) 2 · · · uiN .
The previous equalities imply (1) (N ) Ti1 ···iN ui1 · · · uiN = λ,
(15.17)
i1 ···iN
and λ
i1 ···in−1 in+1 ···iN
(1)
(n−1) (n+1)
(N )
(n)
Ti1 ···iN ui1 · · · uin−1 uin+1 · · · UiN = (λ2 + λ(n) )uin .
(15.18) Taking into account Equality (15.16), we have (1) (n−1) (n+1) (N ) (N ) Ti1 ···iN ui1 · · · uin−1 uin+1 · · · uiN = λuiN . (15.19) i1 ···in−1 in+1 ···iN
These necessary conditions correspond to
T ×1 u(1) · · ·×n−1 u(n−1) ×n+1 u(n+1) · · ·×N u(N ) = λu(N ) , (15.20)
T ×1 u(1) ×2 · · · ×N u(N ) = λ,
(15.21)
u(n) = 1
(15.22)
and for 1 n N . Minimizing the function f defined by Equality (15.14) is equivalent to maximizing 2 (1) (2) (N ) (1) (2) (N ) ×2 u ×3 · · · ×N u g(u , u , . . . , u ) = T ×1 u , over u(1) , u(2) , . . . , u(N ) . Moreover, if λ is chosen as in Equality (15.21), then the functions f and g satisfy the equality f (S) =
T 2 − g(S).
Indeed, note that f (S) = T − S 2 = T 2 − 2(T, S) + S 2 . The definition of λ and Equality (15.17) imply (1) (N ) Ti1 ···iN λui1 · · · uiN = λ2 . (T, S) = i1 ···iN
Since u(1) = u(2) = · · · = u(N ) = 1, we also have S 2 = λ2 . This implies f (S) = T 2 − g(S). An alternating least squares algorithm for computing the best rank-1 approximation of an mda is presented in [101]. Exercises and Supplements (I )
(1) Let e(inn) be a vector in
RI n
(I )
for 1 n N . Prove that (I )
e(i11) ⊗ · · · ⊗ e(iNN ) = eIi , where I = I1 · . . . · IN , and i = (i1 − 1)
N p=2
Ip + (i2 − 1)
N
Ip + · · · + iN .
p=3
(2) Let T ∈ RJ1 ×J2 ×···×JN be an mda. Prove that if A ∈ RIm ×Jm and B ∈ RIn ×Jn and m = n, then (T ×m A) ×n B = (T ×n B) ×m A. The common value of these arrays is denoted by T ×m A ×n B. Solution: Without loss of generality, assume that m < n. The definition of the mda-matrix mode product allows us to write (T ×m A)j1 ,...,jm−1 ,im ,jm+1 ,...,jN =
Im
Tj1 j2 ···jN Aim jm .
im =1
Therefore, ((T ×m A) ×n B)j1 ,...,jm−1 ,im ,jm+1 ,...,jn−1 ,in ,jn+1 ,...,jN = Iinn=1 (T ×m A)j1 ,...,jm−1 ,im ,jm+1 ,...,jN Bin jn m = Iinn=1 Iim =1 Tj1 j2 ···jn ···jm ···jN Aim jm Bin jn . The same expression is obtained by computing ((T ×n B) ×m A)j1 ···jm−1 im jm+1 ···jn−1 in jn+1 ···jN .
(3) Recall that the contracted n-mode product ×n drops the nth singleton dimension. Prove that the contracted n-mode product is not commutative, that is, in general (T ×n x)×n y = (T ×n y)×n x. (4) Prove that if T is an mda having the set of modes M, then vec(T ) = T(M×∅:M) . (5) Let T, S be two mdas of rank 1, where T = a(1) ⊗a(2) ⊗· · ·⊗a(n) and S = b(1) ⊗ b(2) ⊗ · · · ⊗ b(n) . Prove that (T, S) =
n (a(i) ⊗ b(i) ). i=1
(6) Let T ∈ R^{I_1×I_2×···×I_N} be an mda, and let A ∈ R^{J_n×I_n} and B ∈ R^{J_m×I_m} be two matrices, where m ≠ n. Prove that (T ×_n A) ×_m B = (T ×_m B) ×_n A.
(7) Let A ∈ R^{m×n} and B ∈ R^{p×q} be two matrices, and let x ∈ R^{qn} be a vector. Compute the vector y = (A ⊗ B)x without computing A ⊗ B. This is important if the size of the Kronecker product is very large.
Solution: The following MATLAB program is a solution:

[m,n] = size(A);
[p,q] = size(B);
X = reshape(x,q,n);    % view x as a q x n matrix
Y = B*X*A';            % (A ⊗ B)x corresponds to B*X*A'
y = reshape(Y,m*p,1);  % stack the result back into a vector
(8) Let A(1) , . . . , A(N ) be N matrices, where A(n) ∈ RIn ×Jn for n ∈ {1, . . . , N }. Prove that X = Y ; A(1) , . . . , A(N ) if and only if X(n) = A(n) Y(n) (A(N ) ⊗ · · · ⊗ A(n+1) ⊗ A(n−1) ⊗ · · · ⊗ A(1) ) . (9) Let T be an mda in Tm,n . Prove that if (λ, x) is an eigenpair of T , then (aλ + b, x) is an eigenpair of aT + bIm,n . (10) Let X = [A, B, C] be a representation of a three-way array X, and let M, N be two sets of matrices such that M ⊆ N.
Prove that the rank R of X satisfies the following inequality: R min rank(N X) + max (number of zero columns in M A). N ∈N
M ∈M
Solution: For any M ∈ M, we have minN ∈N rank(N X) rank(M X) = rank([M A, B, C]) R − (number of zero columns in M A). If we take the minimum over all M ∈ M, we obtain the statement. (11) Let X be a three-way array, that is 1-non-degenerate, and let N = {u | u = 0}. Prove that rank(X) min rank(uX) + dim1 (X) − 1. u∈N
Solution: Let [A, B, C] be a representation of X and let M = {u | uA = 0}. Since X is 1-non-degenerate, we have N = {u | uX = 0} ⊆ M ⊆ N, hence, N = M. Then max(number of zero columns in uA = max(number of zero columns in A which are orthogonal to u) rank(A) − 1 (because we can pick some rank(A) independent columns of A and select u orthogonal to rank(A) − 1 of them) dim1 (X) − 1. Now the statement follows from the previous supplement. (12) Let X be a 1-non-degenerate array and let N consist of all M ×1matrices with full row rank. Prove that rank(X) min rank(AX) + dim1 (X) − M. A∈N
(13) Prove that the following multilinear system: x0 y0 + x1 y1 = x0 y1 − x1 y0 = 0, has nontrivial solutions in C. (14) Prove that the following multilinear system: x0 y0 = x0 y1 = x1 y0 = x1 y1 = 0, has only trivial solutions. (15) Let T ∈ Mm,n be a hypercubic mda and let a, b ∈ R. If (λ, x) is an eigenpair of T , prove that (aλ + b, x) is an eigenpair of aT + bIm,n . (16) Consider the odd-order symmetric mda T ∈ Tm,n defined in [27] or [138] as follows: √ T111 = 10 T112 = − 3 √ √ T121 = − 3 T122 = 3 √ √ T211 = − 3 T212 = 3 √ T221 = 3 T222 = 4. Prove the following: (a) if u, v ∈ R>0 , then 2u3 + v 3 3u2 v; (b) T x3 0 for x ∈ R20 ; (c) there is no H-eigenvalue for T . Solution: The inequality is equivalent to 2z 3 + 1 3z 2 for z > 0. The function φ(z) = 2z 3 − 3x2 + 1 has a minimum for z = 1, hence, 2z 3 − 3z 2 + 1 0 for z 0. Replacing z by uv gives the desired inequality. Therefore, 1 1 1 1 2 1 10x31 + 4x32 = (15 3 x1 )3 + (12 3 x2 )3 (15 3 x1 )2 (12 3 x2 ) 3 3 √ 2 3 3x1 x2 ,
which shows that T x3 0 for x ∈ R20 .
Suppose that (λ, x) ∈ R × R2=0 is an eigenpair, that is, √ √ 10x21 − 2 3x1 x2 + 3x22 = λx21 , √ √ − 3x21 + 2 3x1 x2 + 4x22 = λx22 . If x2 = 0, then x1 = 0, which is impossible. If x2 = 0, let z = The previous system becomes √ √ (10 − λ)z 2 − 2 3z + 3 = 0, √ √ − 3z 2 + 2 3z + 4 − λ = 0.
x1 x2 .
The non-negativity of the discriminants of these trinomials implies 12 − 4 3(10 − λ) 0, √ 12 + 4 3(4 − λ) 0. √ √ These inequalities lead to 4 + 3 λ 10 − 3, which is contradictory. (17) Let T ∈ RI1 ×···×In−1 ×J×In+1×IN and let A ∈ RJ×In . If A has orthonormal columns, prove that T = T ×n A . Solution: By Equality (15.10), we have
T ×n A 2 = (T ×n A, T ×n A) = ((T ×n A) × A , T ) (by Exercise 20) = (T ×n (A A)), X) = (T, T ) (by hypothesis) = T 2 , because A A is the unit matrix due to the orthonormality of its columns. (18) Let N = {1, . . . , N } and let A(n) ∈ RIn ×Jn for n ∈ N be N matrices having orthonormal sets of columns. Prove that T = S; A(1) , . . . ; A(N ) implies S = T ; A(1) , . . . ; A(N ) . Hint: Apply Theorem 15.14.
(19) Let T = G; A(1) , A(2) , . . . , A(N ) and let B ∈ RJ1 ×J1 be an orthogonal matrix. Prove that T = G ×1 B; A(1) B, A(2) , . . . , A(N ) . Solution: The hypothesis amounts to T = G ×1 A(1) ×2 · · · ×n A(N ) . Therefore, T = G ×1 A(1) BB ×2 · · · ×n A(N ) = G ×1 B; A(1) B ×2 · · · ×n A(N ) , so T = G ×1 B; A(1) B, A(2) , . . . , A(N ) . (20) Let T ∈ RI1 ×···×In−1 ×J×In+1 ×···×IN , S ∈ RI1 ×···×In−1 ×K×In+1×···×IN be two mdas, and let A ∈ RJ×K . Prove that (T, S ×n A) = (T ×n A , S). Solution: Let Si1 ···in−1 kin+1 ···iN be the components of S. By the definition of S ×n A, we have (S ×n A)i1 ···in−1 jin+1 ...iN =
|K|
Si1 ···in−1 kin+1 ···iN Ajk
k=1
and (T, S ×n A) =
Ti1 ···in−1 jin+1 ···iN (S × A)i1 ···in−1 jin+1 ···iN
i1 ,...,in−1 ,j,in+1 ,...,iN
=
i1 ,...,in−1 ,j,in+1 ,...,iN
=
Ti1 ···in−1 jin+1 ···iN
|K|
Si1 ···in−1 kin+1 ···iN Ajk
k=1
Ti1 ···in−1 jin+1 ···iN (A)kj Si1 ···in−1 kin+1 ···iN
i1 ,...,in−1 ,k,in+1 ,...,iN
= (T ×n A , S). The Kruskal operator is a special case of the Tucker operator. For N = {1, . . . , N } and A(n) ∈ RIn ×R for n ∈ N, the value of the Kruskal operator applied to matrices A(1) , . . . , A(N ) is defined as
A(1) , . . . , A(N ) . Note that the matrices A(1) , . . . , A(N ) must have the same number of columns. If I is the identity mda (whose diagonal components equal 1, and is 0 elsewhere), the Kruskal operator can be written as I; A(1) , . . . , A(N ) . If S = A(1) , . . . , A(N ) , then Si1 ,i2 ,...,iN =
R j=1
(1) (2)
(N )
ai1 j ai2 j · · · aiN j .
The Kruskal operator is used to define the PARAFAC decomposition of an mda T as T = A(1) , A(2) , . . . , A(n) . The interested reader should consult [47, 92]. (21) Let N = {1, . . . , N } and A(n) ∈ B (n) ∈ RKn ×In , prove that
RIn ×R .
If A(n) ∈
RIn ×R
and
A(1) , . . . , A(N ) ; B (1) , . . . , B (N ) = B (1) A(1) , . . . , B (N ) A(N ) . (22) Let N = {1, . . . , N } and A(n) ∈ RIn ×R for n ∈ N. Prove that 2 (1) A , . . . , A(N ) =
R R
((A(1) A(1) ) (A(2) A(2) ) · · · (A(N ) A(N ) ))jk .
j=1 k=1
Recall that “ ” is the notation for the Hadamard product introduced in Definition 3.45. Solution: We have 2 (1) A , . . . , A(N ) = (A(1) , . . . , A(N ) , A(1) , . . . , A(N ) ) ⎛ ⎞ R R (1) (2) (N ) (1) (2) (N ) ⎝ ai1 j ai2 j · · · aiN j , ai1 k ai2 k · · · aiN k ⎠ = i1 ···iN
=
j=1
R R j=1 k=1 i1 ···iN
k=1
(1) (2)
(N ) (1) (2)
(N )
ai1 j ai2 j · · · aiN j ai1 k ai2 k · · · aiN k
=
R R j=1 k=1 i1
=
R R
(1) (1)
ai 1 j ai 1 k
i2
(2) (2)
ai 2 j ai 2 k · · ·
iN
(1)
(1)
aiN j aiN k
(A(1) A(1) )jk · · · (A(N ) A(N ) )jk .
j=1 k=1
A fundamental result of complexity theory [62] is that the satisfiability problem stated in what follows is NP-complete. 3-SATISFIABILITY or 3-SAT: Input: a collection of m clauses C = {C1 , . . . , Cm } on a finite set of n variables U such that each clause contains 3 literals. Problem: is there a truth assignment for U that satisfies all clauses in C? The NP-completeness of 3-SAT was used in [74] to establish the NP-hardness of determining the rank of a tensor over the field of rational numbers. This new problem can be formulated as follows: TENSOR RANK: Input: a three-dimensional tensor Tijk Problem: given numbers Tijk where 1 i n1 , 1 j () n2 , 1 k n3 , and r ∈ N are there ve , 1 rr vectors r, 1 e 3 such that Tijk = =1 v1 (i)v2 (j)v3 (k) for all i, j, k? (23) Give an encoding of an arbitrary 3-SAT Boolean formula in n variables and m clauses as an mda T ∈ Q(n+2m+2)×3n×(3n+m) with the property that the 3-SAT formula is satisfiable if and only if rank(T ) 4n + 2m. Hint: See [74].
Bibliographical Comments

The main sources for this chapter are [4, 5, 91–93, 95]. For tensor ranks in various fields, the reader should consult [58, 115]. A useful survey of tensor decompositions is [139]. The works of L. Qi and his collaborators [80, 136–138] are a main reference for the spectral theory of tensors. Supplement 10 is a result of Kruskal [97]. The treatment of hyperdeterminants follows [123].
Bibliography [1] Abdi, H. (2003). Partial least squares regression (PLS-regression), in Encyclopedia for Research Methods for the Social Sciences (Sage, Thousand Oaks, CA), pp. 792–795. [2] Akivis, M. A. and Goldberg, V. V. (1977). An Introduction to Linear Algebra and Tensors (Dover Publications, New York). [3] Artin, M. (1991). Algebra, 1st edn. (Prentice-Hall, Englewood Cliffs, NJ), a second edition was published in 2010. [4] Bader, B. W. and Kolda, T. G. (2006). Algorithm 862: Matlab tensor classes for fast algorithm prototyping, ACM Transactions on Mathematical Software 32, 4, pp. 635–653. [5] Bader, B. W. and Kolda, T. G. (2008). Efficient Matlab computations with sparse and factored tensors, SIAM Journal on Scientific Computing 30, 1, pp. 205–231. [6] Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval (Addisson-Wesley, Harlow, UK). [7] Bartle, R. G. and Sherbery, D. R. (1999). Introduction to Real Analysis, 3rd edn. (Wiley, New York). [8] Bauer, F. L. and Fike, C. T. (1960). Norms and exclusion theorems, Numerische Mathematik 2, pp. 137–141. [9] Belkin, M. and Niyogi, P. (2001). Laplacian eigenmaps and spectral techniques for embedding and clustering, in Advances in Neural Information Processing Systems 14 (MIT Press), pp. 585–591. [10] Ben-Israel, A. (1992). A volume associated with m × n matrices, Linear Algebra and Its Applications 167, pp. 87–111. [11] Ben-Israel, A. and Greville, T. N. E. (1974). Generalized Inverses — Theory and Applications (Wiley-Interscience, New York). [12] Berge, J. M. F. T. (1983). A generalization of Kristof’s theorem on the trace of certain matrix products, Psychometrika 48, pp. 519–523.
[13] Berge, J. M. F. T. (1991). Kruskal’s polynomial for 2× 2× 2 arrays and a generalization to 2× n× n arrays, Psychometrika 56, 4, pp. 631–636. [14] Berge, J. M. F. T. (1993). Least Squares Optimization in Multivariate Analysis (DSWO Press, Leiden, The Netherlands). [15] Berge, T. (1977). Orthogonal Procrustes rotation for two or more matrices, Psychometrika 42, pp. 267–276. [16] Berkhin, P. and Becher, J. (2002). Learning simple relations: Theory and applications, in Proceedings of the Second SIAM International Conference on Data Mining (Arlington, VA). [17] Berry, M. W., Browne, M., Langville, A. N., Pauca, V. P. and Plemmons, R. J. (2007). Algorithms and applications for approximate nonnegative matrix factorization, Computational Statistics & Data Analysis 52, 1, pp. 155–173. [18] Bhatia, R. (1997). Matrix Analysis (Springer, New York). [19] Billsus, D. and Pazzani, M. (1998). Learning collaborative information filters, in J. W. Shavlik (ed.), Proceedings of the 15th International Conference on Machine Learning, pp. 46–54. [20] Birkhoff, G. (1973). Lattice Theory, 3rd edn. (American Mathematical Society, Providence, RI). [21] Bj¨ork, A. (1996). Numerical Methods for Least Squares Problems (SIAM, Philadelphia, PA). [22] Brand, M. (2003). Fast online SVD revisions for lightweight recommender systems, in D. Barbar´a and C. Kamath (eds.), Proceedings of the Third SIAM International Conference on Data Mining (SIAM, Philadelphia, PA), pp. 37–46. [23] Brokett, R. W. and Dobkin, D. (1978). On the optimal evaluation of a set of bilinear forms, Linear Algebra and Its Applications 19, pp. 207–235. [24] Burdick, D. S. (1995). An introduction to tensor product with applications to multiway data analysis, Chemometrics and Intelligent Laboratory Systems 28, pp. 229–237. [25] Carroll, J. D. and Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart-Young decomposition, Psychometrika 35, 3, pp. 283–319. [26] Catral, M., Han, L., Neumann, M. and Plemmons, R. (2004). On reduced rank nonnegative (non-negative) matrix factorizations for symmetric matrices, Linear Algebra and Applications 393, pp. 107–127. [27] Chen, H., Chen, Y., Li, G. and Qi, L. (2015). Finding the maximum eigenvalue of a class of tensors with applications in copositivity test and hypergraphs, arXiv preprint arXiv:1511.02328.
[28] City of Boston (2008). Public Health Commission Research Office: The health of Boston 2008, Tech. rep., Boston, MA. [29] Cline, R. E. and Funderlink, R. E. (1979). The rank of a difference of matrices and associated generalized inverses, Linear Algebra and Its Applications 24, pp. 185–215. [30] Comon, P., Golub, G., Lim, L.-H. and Mourrain, B. (2008). Symmetric tensors and symmetric tensor rank, SIAM Journal on Matrix Analysis and Applications 30, 3, pp. 1254–1279. [31] Cox, D. A., Little, J. and O’Shea, D. (2005). Using Algebraic Geometry, 2nd edn. (Springer, New York). [32] Cox, T. F. and Cox, M. (2001). Multidimensional Scaling, 2nd edn. (Chapman & Hall, Boca Raton). [33] Davis, C. and Kahan, W. M. (1969). Some new bounds on perturbation of subspaces, Bulletin of the American Mathematical Society 75, pp. 863–868. [34] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R. (1990). Indexing by latent semantic analysis, Journal of the American Society for Information Science 41, pp. 391–407. [35] Ding, C. and He, X. (2004). K-means clustering via principal component analysis, in Proceedings of the 21st International Conference on Machine Learning (Banff, Canada), pp. 225–232. [36] Ding, C., He, X. and Simon, H. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering, in Proceedings of the SIAM International Conference on Data Mining (Newport Beach, CA), pp. 606–610. [37] Dodson, C. T. J. and Poston, T. (1997). Tensor Geometry, 2nd edn. (Springer, Berlin). [38] Donoho, D. and Grimes, G. (2003). Hessian eigenmaps: New locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences of the United States of America 5, pp. 5591–5596. [39] Drineas, P., Frieze, A., Kannan, R., Vampala, S. and Vinay, V. (2004). Clustering large graphs via the singular value decomposition, Machine Learning 56, pp. 9–33. [40] Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S. and Harshman, R. (1988). Using latent semantic analysis to improve access to textual information, in CHI ’88: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (ACM, New York, NY), ISBN 0-201-14237-6, pp. 281–285. [41] Eckhart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank, Psychometrica 1, pp. 211–218.
[42] Egerv´ary, E. (1960). On rank-diminishing operators and their application to the solution of linear systems, Zeitschrift f¨ ur Angewandte Mathematik und Physik 11, pp. 376–386. [43] Elad, M. (2010). Sparse and Redundant Representations (Springer, New York). [44] Eld´en, L. (2007). Matrix Methods in Data Mining and Pattern Recognition (SIAM, Philadelphia, PA). [45] Elkan, C. (2003). Using the triangle inequality to accelerate k-means, in Proceedings of the 20th International Conference on Machine Learning, pp. 147–153. [46] A. H. et al. (1970). Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multimodal factor analysis. [47] Faber, N. M., Bro, R. and Hopke, P. K. (2003). Recent developments in candecomp/parafac algorithms: A critical review, Chemometrics and Intelligent Laboratory Systems 65, pp. 119–137. [48] Fan, K. (1949). On a theorem of Weil concerning eigenvalues of linear transformations -I, Proceedings of the National Academy of Sciences of the United States of America 35, pp. 652–655. [49] Fan, K. (1951). Maximum properties and inequalities for the eigenvalues of completely continuous operators, Proceedings of the National Academy of Sciences of the United States of America 37, pp. 760–766. [50] Favier, G. (2019). From Algebraic Structure to Tensors, Vol. 1 (Wiley, Hoboken, NJ). [51] Favier, G. (2021). Matrix and Tensor Decompositions in Signal Processing, Vol. 2 (Wiley, Hoboken, NJ). [52] Favier, G. and de Almeida, A. (2014). Overview of constrained parafac models, EURASIP Journal on Advances in Signal Processing 2014, 1, pp. 1–25. [53] Fejer, P. A. and Simovici, D. A. (1991). Mathematical Foundations of Computer Science, Vol. 1 (Springer Verlag, New York). [54] Fletcher, R. and Sorensen, D. (1983). An algorithmic derivation of the Jordan canonical form, American Mathematical Monthly 90, pp. 12–16. [55] Food and Organization, A. (2009). FAO Statistical Yearbook (Statistics Division FAO, Rome, Italy). [56] Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency vs. interpretabiity of classifications, Biometrics 21, p. 768. [57] Francis, J. G. F. (1961). The QR transformation — A unitary analogue to the LR transformation — Part I, The Computer Journal 4, pp. 265–271.
[58] Friedland, S. (2012). On the generic and typical rank of 3-tensors, Linear Algebra and Its Applications 436, pp. 478–497. [59] Frieze, A. and Kannan, R. (1999). Quick approximation to matrices and applications, Combinatorica 19, pp. 175–200. [60] Fuglede, B. (1950). A commutativity theorem for normal operators, Proceedings of the National Academy of Sciences of the United States of America 36, 1, p. 35. [61] Gabriel, K. R. (1971). The biplot graph display of matrices with application to principal component analysis, Biometrika 58, pp. 453–467. [62] Garey, M. R. and Johnson, D. S. (1979). Computers and Intractability — A Guide to the Theory of NP-Completeness (W. H. Freeman, New York). [63] Gelfand, I. M., Kapranov, M. and Zelevinsky, A. (1994). Discriminants, Resultants, and Multidimensional Determinants (Birkh¨auser, Boston, MA). [64] Gilat, A. (2011). MATLAB — An Introduction with Applications, 4 edn. (John Wiley, New York). [65] Golub, G. H. and Loan, C. F. V. (1989). Matrix Computations, 2nd edn. (The Johns Hopkins University Press, Baltimore, MD). [66] Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis, Biometrika 53, pp. 325–338. [67] Gower, J. C. and Hand, D. J. (1995). Biplots (Chapman & Hall, London). [68] Gower, J. C., Lubbe, S. G. and Le Roux, N. (2011). Understanding Biplots (Wiley, New York). [69] Greenacre, M. (2010). Biplots in Practice (Fundacion BBVA, Madrid, Spain). [70] Greub, W. (1981). Linear Algebra, 4th edn. (Springer-Verlag, New York). [71] Greub, W. H. (1978). Multilinear Algebra, 2nd edn. (Springer-Verlag, New York). [72] Hackbusch, W. (2012). Tensor Spaces and Numerical Tensor Calculus, 2nd edn. (Springer, Cham, Switzerland). [73] Hartigan, J. A. and Wong, M. (1979). A k-means clustering algorithm, Applied Statistics, pp. 100–108. [74] H˚ astad, J. (1990). Tensor rank is NP-complete, Journal of Algorithms 4, 11, pp. 644–654. [75] Higham, D. J. and Higham, N. J. (2000). Matlab Guide (SIAM, Philadelphia, PA).
[76] Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms, 2nd edn. (SIAM, Philadelphia, PA). [77] Hoffman, A. J. and Wielandt, H. W. (1953). The variation of the spectrum of a normal matrix, Duke Mathematical Journal 20, pp. 37–39. [78] Horn, R. A. and Johnson, C. R. (1996). Matrix Analysis (Cambridge University Press, Cambridge, UK). [79] Horn, R. A. and Johnson, C. R. (2008). Topics in Matrix Analysis (Cambridge University Press, Cambridge, UK). [80] Hu, S., Huang, Z.-H., Ling, C. and Qi, L. (2013). On determinants and eigenvalue theory of tensors, Journal of Symbolic Computation 50, pp. 508–531. [81] Inaba, M., Katoh, N. and Imai, H. (1994). Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering, in 10th Computational Geometry (ACM, New York), pp. 331–339. [82] Johnson, R. A. and Wichern, D. W. (2002). Applied Multivaried Statistical Analysis, 5 edn. (Prentice Hall, Upper Saddle River, NJ). [83] Johnson, R. K. (2011). The Elements of MATLAB Style (Cambridge University Press, Cambridge, UK). [84] Jolliffe, I. T. (2002). Principal Component Analysis, 2nd edn. (Springer, New York). [85] Jordan, P. and Neumann, J. V. (1935). On inner products in linear, metric spaces, The Annals of Mathematics 36, pp. 719–723. [86] Juh´ asz, F. (1978). On the spectrum of a random graph, in Colloquia Mathematica Societatis J´ anos Bolyai, Vol. 25 (Szeged), pp. 313–316. [87] Kannan, R. (2010). Spectral methods for matrices and tensors, in Proceedings of the ACM Symposium on the Theory of Computing (STOC) (ACM, New York), pp. 1–12. [88] Kanungo, T., Mount, D. M., Netanyahu, N. S. and Piatko, C. D. (2004). A local search approximation algorithm for k-means clustering, Computational Geometry: Theory and Applications 28, pp. 89–112. [89] Kiers, H. A. (1998). A three-step algorithm for CANDECOMP/PARAFAC analysis of large data sets with multicollinearity, Journal of Chemometrics: A Journal of the Chemometrics Society 12, 3, pp. 155–171. [90] Knyazev, A. V. and Argentati, M. E. (2006). On proximity of Rayleigh quotients for different vectors and Ritz values generated by different trial subspaces, Linear Algebra and Its Applications 415, pp. 82–95.
[91] Kolda, T. G. (2001). Orthogonal tensor decompositions, SIAM Journal of Matrix Analysis and Applications 23, pp. 243–255. [92] Kolda, T. G. (2006). Multilinear operators for higher-order decompositions, Tech. Rep. SAND2006-2081, Sandia. [93] Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications, SIAM Review 51, 3, pp. 455–500. [94] Kolda, T. G. and O’Leary, D. P. (1998). A semidiscrete matrix decomposition for latent semantic indexing in information retrieval, ACM Transactions on Information Systems 16, pp. 322–346. [95] Kolda, T. G. and O’Leary, D. P. (1999). Computation and uses of the semidiscrete matrix decomposition, Tech. Rep. ORNL-TM-13766, Oak Ridge National Laboratory, Oak Ridge, TN. [96] Kroonenberg, P. M. (2008). Applied Multiway Data Analysis (Wiley, Hoboken, NJ). [97] Kruskal, J. B. (1977). Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics, Linear Algebra and Its Applications 18, 2, pp. 95–138. [98] Lanczos, C. (1958). Linear systems in self-adjoint form, American Mathematical Monthly 65, pp. 665–679. [99] Langville, A. N. and Stewart, W. J. (2004). The Kronecker product and stochastic automata networks, Journal of Computational and Applied Mathematics 167, 2, pp. 429–447. [100] Lathauwer, L. D. (1997). Signal processing based on multilinear algebra, Ph.D. thesis, Katholieke Universiteit Leuven. [101] Lathauwer, L. D., Moor, B. D. and Vandewalle, J. (2000). A multilinear singular value decomposition, SIAM Journal on Matrix Analysis and Applications 21, 4, pp. 1253–1278. [102] Lathauwer, L. D., Moor, B. D. and Vandewalle, J. (2000). On the best rank-1 and rank-(r1 , r2 , . . . , rn ) approximation of higher-order tensors, SIAM Journal on Matrix Analysis and Applications 21, 4, pp. 1324–1342. [103] Lawson, C. L. and Hanson, R. J. (1995). Solving Least Square Problems (SIAM, Philadelphia, PA). [104] Lax, P. D. (2007). Linear Algebra and Its Applications (WileyInternational, Hoboken, NJ). [105] Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization, Nature 401, pp. 515–521. [106] Lichnerowicz, A. (1956). Alg`ebre et Analyse Lin´eaires (Masson et Co., Paris). [107] Lloyd, S. P. (1982). Least square quantization in PCM, IEEE Transactions on Information Theory 28, pp. 129–137.
[108] MacLane, S. and Birkhoff, G. (1993). Algebra, 3rd edn. (Chelsea Publishing Company, New York). [109] Magnus, J. R. and Neudecker, H. (1979). The commutation matrix: Some properties and applications, The Annals of Statistics 7, 2, pp. 381–394. [110] Marcus, M. (1973). Finite Dimensional Multilinear Algebra — Part I (Marcel Dekker, New York). [111] Marcus, M. and Minc, H. (2010). A Survey of Matrix Theory and Matrix Inequalities (Dover Publications, New York). [112] Marsaglia, G. and Styan, G. (1974). Equalities and inequalities for ranks of matrices, Linear and Multilinear Algebra 2, pp. 269–292. [113] Meyer, C. (2000). Matrix Analysis and Applied Algebra (SIAM, Philadelphia, PA). [114] Mirsky, L. (1975). A trace inequality of john von neumann, Monatshefte f¨ ur Mathematik 79, pp. 303–306. [115] Moitra, A. (2018). Algorithmic Aspects of Machine Learning (Cambridge University Press, Cambridge, UK). [116] Moler, C. B. (2004). Numerical Computing with MATLAB (SIAM, Philadelphia, PA). [117] Morozov, A. Y. and Shakirov, S. R. (2010). New and old results in resultant theory, Theoretical and Mathematical Physics 163, 2, pp. 587–617. [118] Neudecker, H. (1968). The Kronecker matrix product and some of its applications in econometrics, Statistica Neerlandica 22, 1, pp. 69–82. [119] Ninio, F. (1976). A simple proof of the Perron-Frobenius theorem for positive matrices, Journal of Physics A 9, pp. 1281–1282. [120] Oldenburger, R. (1940). Infinite powers of matrices and characteristic roots, Duke Mathematical Journal 6, pp. 357–361. [121] O’Leary, D. P. (2006). Matrix factorization for information retrieval, Tech. Rep., University of Maryland, College Park, MD. [122] O’Leary, D. P. and Peleg, S. (1983). Digital image compression, IEEE Transactions on Communications 31, pp. 441–444. [123] Ottaviani, G. (2012). Introduzione all’iperdeterminante, La Matematica nella Societ` a e nella Cultura. Rivista dell’Unione Matematica Italiana 5, 2, pp. 169–195. [124] Overton, M. L. and Womersley, R. S. (1991). On the sum of the largest eigenvalues of a symmetric matrix, Tech. Rep. 550, Courant Institute of Mathematical Sciences, NYU. [125] Paatero, P. (1997). Least squares formulation of robust non-negative factor analysis, Chemometrics and Intelligent Laboratory Systems 37, pp. 23–35.
Bibliography
971
[126] Paatero, P. (1999). The multilinear engine — A table driven least squares program for solving multilinear problems, including the nway parallel factor analysis model, Journal of Computational and Graphical Statistics 8, pp. 1–35. [127] Paatero, P. and Tapper, U. (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values, Environmetrics 5, pp. 111–126. [128] Paige, C. and Saunders, M. (1981). Toward a generalized singular value decomposition, SIAM Journal on Numerical Analysis 18, pp. 398–405. [129] Paige, C. C. and Wei, M. (1994). History and generality of the CS decomposition, Linear Algebra and Its Applications 208, pp. 303–326. [130] Papalexakis, E. E., Faloutsos, C. and Sidiropoulos, N. D. (2016). Tensors for data mining and data fusion: Models, applications, and scalable algorithms, ACM Transactions on Intelligent Systems and Technology (TIST) 8, 2, pp. 1–44. [131] Pauca, V. P., Shahnaz, F., Berry, M. W. and Plemmons, R. J. (2004). Text mining using non-negative matrix factorizations, in SIAM Data Mining, pp. 452–456. [132] Pferschy, U. and Rudolf, R. (1994). Some geometric clustering problems, Nordic Journal of Computing 1, pp. 246–263. [133] Pollock, D. S. G. (1979). The Algebra of Econometrics (John Wiley & Sons, Chichester). [134] Pollock, D. S. G. (2021). Multidimensional arrays, indices and Kronecker products, Econometrics 9, p. 18. [135] Prasolov, V. V. (2000). Problems and Theorems in Linear Algebra (American Mathematical Society, Providence, RI). [136] Qi, L. (2005). Eigenvalues of a real supersymmetric tensor, Journal of Symbolic Computation 40, 6, pp. 1302–1324. [137] Qi, L. (2007). Eigenvalues and invariants of tensors, Journal of Mathematical Analysis and Applications 325, 2, pp. 1363–1377. [138] Qi, L. and Luo, Z. (2017). Tensor Analysis — Spectral Theory and Special Tensors (SIAM, Philadelphia, PA). [139] Rabanser, S., Shchur, O. and G¨ unnemann, S. (2017). Introduction to tensor decompositions and their applications in machine learning, arXiv preprint arXiv:1711.10781. [140] Roweis, S. T. and Saul, L. K. (2000). Locally linear embedding, http: //www.cs.nyu.edu/~roweis/lle/. [141] Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding, Science 290, pp. 2323–2326.
972
Linear Algebra Tools for Data Mining (Second Edition)
[142] Sarwar, B., Karypis, G., Konstan, J. and Riedl, J. (2002). Incremental singular value decomposition algorithms for highly scalable recommender systems, in 5th International Conference on Computer and Information Technology (ICCIT), pp. 27–28. [143] Saul, L. K. and Roweis, S. T. (2000). An introduction to locally linear embedding, Tech. rep., AT&T Labs-Research. [144] Saul, L. K. and Roweis, S. T. (2003). Think globally, fit locally: Unsupervised learning of low dimensional manifolds, Journal of Machine Learning Research 4, pp. 119–155. [145] Saul, L. K., Weinberger, K., Sha, F., Ham, J. and Lee, D. D. (2006). Spectral methods for dimensionality reduction, in Semi-Supervised Learning (The MIT Press, Cambridge, MA), pp. 293–308. [146] Schoenberg, I. J. (1935). Remarks to Maurice Fr´echet’s article “sur la d´efinition axiomatique d’une classe d’espaces distanci´es vectoriellement applicable sur l’espace de Hilbert”, Annals of Mathematics 36, pp. 724–732. [147] Sch¨onemann, P. H. (1970). On metric multidimensional unfolding, Psychometrika 35, pp. 349–366. [148] Schwerdtfeger, H. (1960). Direct proof of Lanczos’ decomposition theorem, American Mathematical Monthly 67, pp. 855–860. [149] Shahnaz, F., Berry, M. W., Pauca, V. P. and Plemmons, R. J. (2006). Document clustering using nonnegative matrix factorization, Information Processing Management 42, 2, pp. 373–386. [150] Sibson, R. (1978). Studies in the robustness of multidimensional scaling: Procrustes statistics, Journal of the Royal Statistical Society, Series B 40, pp. 234–238. [151] Sidiropoulos, N., Lathauwer, L. D., Fu, X., Huang, K., Papalexakis, E. and Faloutsos, C. (2017). Tensor decomposition for signal processing and machine learning, IEEE Transactions on Signal Processing 65, 13, pp. 3551–3582. [152] Simovici, D. A. and Djeraba, C. (2015). Mathematical Tools for Data Mining — Set Theory, Partially Ordered Sets, Combinatorics, 2nd edn. (Springer-Verlag, London). [153] Soules, G. W. (1983). Constructing symmetric nonnegative matrices, Linear and Multilinear Algebra 13, pp. 241–251. [154] Stanley, R. P. (1997). Enumerative Combinatorics, Vol. 1 (Cambridge University Press, Cambridge). [155] Stanley, R. P. (1999). Enumerative Combinatorics, Vol. 2 (Cambridge University Press, Cambridge). [156] Sternberg, S. (1983). Lectures on Differential Geometry, 2nd edn. (AMS Chelsea Publishing, Providence, RI).
Bibliography
973
[157] Stewart, G. W. (1973). Error and perturbation bounds for subspaces associated with certain eigenvalue problems, SIAM Review 15, pp. 727–764. [158] Stewart, G. W. (1977). On the perturbation of pseudo-inverses, projections, and linear least square problems, SIAM Review 19, pp. 634–662. [159] Stewart, G. W. (1993). On the early history of the singular value decomposition, SIAM Review 35, pp. 551–566. [160] Stewart, G. W. and Sun, J.-G. (1990). Matrix Perturbation Theory (Academic Press, Boston, MA). [161] Strassen, V. (1969). Gaussian elimination is not optimal, Numerische Mathematik 13, pp. 354–356. [162] Trefethen, L. N. and III, D. B. (1997). Numerical Linear Algebra (SIAM, Philadelphia, PA). [163] Trotter, W. T. (1995). Partially ordered sets, in Handbook of Combinatorics (The MIT Press, Cambridge, MA), pp. 433–480. [164] Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis, Psychometrica 31, pp. 279–331. [165] van der Waerden, B. L. (1950). Modern Algebra (Vol. I and II) (F. Ungar Publ. vo, New York). [166] van Lint, J. H. and Wilson, R. M. (2002). A Course in Combinatorics, 2nd edn. (Cambridge University Press, Cambridge). [167] Van Loan, C. F. and Fan, K. Y. D. (2010). Insight through Computing (SIAM, Philadelphia, PA). [168] vols.), F. R. G. . (1977). The Theory of Matrices (American Mathematical Society, Providence, RI). [169] Wedderburn, J. (1934). Lectures on Matrices (American Mathematical Society, Providence, RI). [170] Weyr, E. (1887). Note sur la th´eorie de quantit´es complexes form´ees avec n unites principales, Bulletin des Sciences Math´ematiques II 11, pp. 205–215. [171] Wilkerson, J. H. (1965). The Algebraic Eigenvalue Problem (Clarendon Press, Oxford, London). [172] Winograd, S. (1968). A new algorithm for inner product, IEEE Transactions on Computers C-17, pp. 693–694. [173] Wu, X. and (editors), V. K. (2009). The Top Ten Algorithms in Data Mining (CRC Press, Boca Raton, FL). [174] Ye, J. (2007). Least square discriminant analysis, in Proceedings of the 24th International Conference on Machine Learning (Corvallis, OR), pp. 1087–1094. [175] Yokonuma, T. (1992). Tensor Spaces and Exterior Algebra (American Mathematical Society, Providence, RI).
974
Linear Algebra Tools for Data Mining (Second Edition)
[176] Zha, H., He, X., Ding, C., Simon, H. and Gu, M. (2001). Spectral relaxation for k-means clustering, Neural Information Processing Systems 14, pp. 1057–1064. [177] Zhang, F. (1999). Matrix Theory: Basic Results and Techniques (Springer Verlag, New York). [178] Zhang, F. (2005). The Schur Complement and Its Applications (Springer, New York).
Index
A
Abel's equality, 213 absolute norm, 457 absolutely convergent matrix series, 371 addition of polynomials, 21 additive inverse of an element, 17 adjacent transposition, 8 adjoint matrix, 297, 385 affine mapping, 39 algebra exterior, 859 Grassmann, 859 algebra of finite type, 17 algebra of type θ, 17 algebraic multiplicity of an eigenvalue, 486 alternator on V⊗r, 850 angle between vectors, 386 angles between subspaces, 658 annihilating polynomial of a matrix, 557 annihilator of a subset, 73 anonymous function, 274 arity of an operation, 15 associative algebra, 182 graded, 186 ideal of an, 185 morphism, 183 structural coefficients of an, 183 subalgebra of an, 185 unital, 183 associative operation, 16 attribute, 720 augmented matrix of a system of linear equations, 157 auxiliary function, 808 average dissimilarity of an object, 705
B
balanced clustering, 710 band matrix, 118 basis dual basis of a, 71 fundamental matrix of a, 379 ordered, 437 having a negative orientation, 437 having a positive orientation, 437 reciprocal set of a, 390 basis of a linear space, 40 Berge–Sibson theorem, 802
bilinear form, 80 skew-symmetric, 80 structural constants of a, 85 bilinear function structural constants of a, 85 binary operation, 15 binomial coefficient, 13 bivector, 857 bounded set in a metric space, 341 C canonical basis of a quadratic form, 591 canonical morphism, 56 canonical surjection, 56 carrier of an algebra, 17 Cauchy matrix of two real sequences, 320 Cauchy–Schwarz inequality, 338 Cauchy–Binet formula, 291 Cayley transform, 202 Cayley–Hamilton theorem, 545 center of a quadric, 594 centered data series, 722 centering matrix, 723 centroid, 693 characteristic matrix of a partition, 195 characteristic polynomial of a matrix, 484 Chebyshev metric, 346 Cholesky factor of a matrix, 412 Cholesky’s decomposition theorem, 410 circulant matrix, 120 city-block metric, 346 class scatter matrix, 782 closed set generated by a subset, 27 closed sphere, 340 closure operator, 25 closure system on a set S, 25 co-kernel of a linear mapping, 56 coefficients of a linear combination, 36 cofactor, 290 colon notation in MATLAB , 237
column subspace of a matrix, 137 combination with repetition, 32 commutative groups, 19 commutative operation, 16 commutative ring, 23 companion matrix of a polynomial, 490 complementary subspaces, 64 complete matrix, 102 complete symmetric polynomial, 23 complex linear space, 34 composition of permutations, 6 condensed graph of a graph, 221 condition number of a matrix, 434 conformant matrices, 105 congruent matrices, 156 conjugate linearity of an inner product, 378 conjugate norm of a norm, 374 consistent family of matrix norms, 362 consistent system of linear equations, 157 contiguous submatrix of a matrix, 101 continuity argument, 301 continuous clustering problem, 702 contracted product of an mda and a vector, 923 contraction, 844 convergent series of matrices, 371 corpus, 785 correlation coefficient, 726 Courant–Fischer theorem, 537 covariance coefficient, 726 covectors, 68 Cramer’s formula, 297 cross product, 437 CS decomposition, 652 cut norm, 460 cycle, 7 cycle of an element, 7 cyclic decomposition of the permutation, 8 cyclic permutation, 7
D
φ-diagonal of a matrix, 188 data matrix principal directions of a, 769 Davis–Kahan SIN theorem, 673 defective matrix, 550 degenerate matrix, 141 degree of a λ-matrix, 552 degree of a monomial, 22 degree of a polynomial, 21 descent of a sequence, 9 determinant of a matrix, 281 diagonal matrix, 103 diagonalizable matrix, 155 diagonally dominant matrix, 158 diameter of a metric space, 341 diameter of a subset of a metric space, 341 direct sum external, 61 discrete metric, 339 discriminant, 309 discriminant function, 782 dissimilarity on a set, 342 dissimilarity space, 342 divergence of two matrices, 217 dominant eigenvalue, 503 doubly stochastic matrix, 120 dual bases, 71 dual of a linear mapping, 74 dual of a linear space, 68 dual PCA, 797
E
Eckart–Young theorem, 643 eigenpair, 479 eigenvalue of a matrix, 479 eigenvector, 479 eigenvector of a matrix pencil, 583 eigenvectors of a pencil, 584 elementary divisors of a λ-matrix, 565 elementary matrix, 199 elementary symmetric polynomials, 22
elementary transformation matrices, 168 endomorphism of a linear space, 52 equalizer of two linear mappings, 52 equivalent λ-matrices, 560 equivalent linear systems, 168 equivalent norms, 355 error of a cut decomposition, 461 error sequence, 372 Euclidean metric, 340, 346 Euclidean norm, 345 extended matrix of a quadric, 592
F
F-closed subset of a set, 28 F-linear space, 33 factor algebra, 186 feature, 720 field, 23 field of values of a matrix, 623 finite algebra, 17 finite algebra type, 17 first Hirsch's theorem, 500 Fisher linear discriminant, 783 Forgy–Lloyd algorithm, 696 format of a matrix, 98 Fourier expansion of an element with respect to an orthonormal set, 389 free indices, 828 frequency matrix of the corpus, 785 Frobenius inequality, 145 Frobenius norm, 363 Frobenius rank inequality, 145 full QR factorization, 423 full singular value decomposition of a matrix, 631 full-rank matrix, 141
G
general linear group, 109 generalized Kronecker symbol, 95 generalized Rayleigh–Ritz quotient, 585 generalized sample variance, 737
geometric multiplicity of an eigenvalue, 482 Gershgorin disk, 500 Gershgorin’s theorem, 499 Gram matrix of a sequence of vectors, 409 Gramian of a sequence of vectors, 409 greatest common divisor, 18 group, 19 groupoid, 17 H Hadamard matrix, 194 Hadamard product, 180 Hadamard quotient, 180 Hahn–Banach theorem, 376 Hankel matrix, 120 Hilbert matrix, 322 Hoffman–Wielandt theorem, 542 homogeneous linear system, 157, 332 homogeneous system trivial solution of a, 332 homotety on linear space, 53 Householder matrix of a vector, 398 Huygens’s inertia theorem, 725 hypermatrix, 883 hyperplane, 391 I I-open subsets of a set, 29 idempotent matrix, 115 idempotent morphism of a field, 54 idempotent operation, 16 identity permutation, 6 ill-conditioned linear system, 435 incidence matrix of a collection of sets, 468 independent set extension corollary, 42 index of a square matrix, 152 indexing of matrices, 264 inertia matrix, 534 inertia of a matrix, 533 inertia of a sequence of vectors, 724
inner product, 378 inner product of two vectors, 112 inner product space, 378 instance of the least square problem, 741 interior operator, 28 interior system, 28 interlacing theorem, 539 intra-class scatter matrix, 782 invariant factor of a λ-matrix, 559 invariant subspace of a matrix for an eigenvalue, 481 inverse of a matrix, 108 inverse of an element, 19 inverse of an element relative to an operation, 17 inverse power method, 506 inversion of a sequence of distinct numbers, 4 involutive matrix, 203 isometric invariant, 592 J Jacobi’s identity, 477 Jordan block associated to a complex number, 566 Jordan matrix, 567 Jordan segment, 566 K K-closed subsets of a set, 26 k-combination, 12 k-means algorithm, 693 Karhunen–Loeve basis for a matrix, 815 Kronecker δ, 51 Kronecker difference, 179 Kronecker product, 174 Kronecker sum, 179 Kruskal operator, 958 Kullback–Leibler divergence of two stochastic matrices, 217 Ky Fan’s norm, 685 Ky Fan’s theorem, 494
Index L Lagrange interpolating polynomial, 275 Lagrange interpolation polynomial, 328 Lagrange’s identity, 293, 477 Lanczos decomposition of a matrix, 622 Laplace expansion of a determinant by a column, 289 Laplace expansion of a determinant by a row, 289 leading principal minor, 287 leading submatrix of a matrix, 101 left distributivity laws in a ring, 20 left eigenvector of a matrix, 485 left inverse of a matrix, 148 left quotient of λ-matrices, 555 left remainder of λ-matrices, 555 left singular vector, 631 left value of a λ-matrix, 554 Leibniz formula, 281 Levi-Civita symbol, 11 Lie bracket, 607 linear combination, 36 linear data mapping for a data set, 721 linear dimensionality-reduction mapping, 721 linear feature selection mapping, 721 linear form, 40, 80 linear mapping, 39 induced mapping by a, 869 nullity of a, 47 rank of a, 47 spark of a, 47 linear morphism, 39 linear operator associated to a matrix, 137 linear operator on a linear space, 52 linear regression, 739 linear space dimension of a, 43 free over a set, 51 graded, 186
979 identity morphism of a, 53 of infinite type, 43 quotient space of a, 55 subspace of a, 35 vectors in a, 33 zero endomorphism of a, 53 linear space of finite type, 40 linear space symmetric relative to a norm, 466 linear spaces direct product of, 61 isomorphic, 48 isomorphism of, 48 linear spaces of finite type, 42 linear system basic variables of a, 162 non-basic variables of a, 162 non-principal variables of a, 162 principal variables of a, 162 linear systems back substitution in, 161 linearly dependent set, 37 linearly independent set, 37 loadings in PCA, 775 local Gram matrix, 750 locally linear embedding, 748 logical indexing in MATLAB , 265 lower bandwidth of a matrix, 118 lower Hessenberg matrix, 118 lower triangular matrix, 106 M λ-matrix of type p × q, 552 M-diagonalizable matrix, 155 Mahalanobis metric, 736 main diagonal of a matrix, 99 matrices commuting family of, 511 triple product of, 889 matricization, 900 matrix, 98 LU -decomposition of a, 171 commutation, 208 complex, 121 degree of reducibility of a, 222
directed graph of a, 221 Givens, 397 Hermitian, 121 Hermitian adjoint of a, 121 invertible, 108 irreducible, 222 Moore–Penrose pseudoinverse of a, 134 normal, 121 orthogonal, 122, 393 orthonormal, 122, 393 pivot of a row of a, 160 reducible, 222 row echelon form of a, 160 skew-Hermitian, 121 spark of a, 153 spectral radius of a, 498 submatrix, 100 transpose conjugate of a, 121 unitary, 121 matrix column, 98 matrix gallery, 322 matrix of a bilinear form, 587 matrix of loadings, 771 matrix of scores, 771 matrix of weights for locally linear embedding, 748 matrix pencil, 582 matrix product Khatri–Rao, 181 matrix row, 98 matrix series, 371 matrix similarity, 153 mda H-eigenvalue of an, 942 H-eigenvector of an, 942 m-order, n-dimensional, 884 n-mode singular values of an, 945 n-mode vectors of an, 888 n-rank of an, 888 n-singular vector of an, 945 diagonal, 884 eigenpair of an, 942 eigenvalue of an, 941 eigenvector of an, 942
fibers of an, 885 homogeneous polynomial defined by, 940 hypercubic, 884 identity, 884 matricized, 901 norm of an, 907 order of, 884 positive definite, 940, 942 positive semidefinte, 942 set of modes of an, 884 slice of an, 886 spectrum of an, 942 Tucker decomposition of an, 949 unfolding of an, 901 mdas contraction of, 903 inner product of, 907 outer product of, 887 outer product, 182 metric, 339 metric induced by a norm on a linear space, 346 metric space, 339 minimal polynomial of a matrix, 556 minimax inequality for real numbers, 194 Minkowski metric, 346 Minkowski’s inequality, 338 minor of a matrix, 287 mixed covariance of two data sample matrices, 817 monoid, 18 monomial, 22, 302 monotone norm on Cn , 457 multidimensional array, 883–884 dimension vector of a, 884 symmetric, 886 multilinear function, 80 multilinear mapping alternating, 847 skew-symmetric, 848 multiplicative inverse of an element relative to an operation, 17
Index N n-ary operation, 15 n-dimensional ellipsoid, 594 n-mode multiplication of an mda by a matrix, 917 n-mode product of a mda and a matrix, 903 n-mode product of a mda and a vector, 905 negative closed half-space, 391 negative open half-space, 391 negative semidefinite matrix, 406 Newton’s binomial, 191 Newton’s binomial formula, 14 nilpotency of a matrix, 114 nilpotent matrix, 114 NIPALS algorithm, 780 Noether’s first isomorphism theorem, 79 Noether’s second isomorphism theorem, 79 non-defective matrix, 550 non-derogatory matrix, 550 non-negative matrix, 118 non-singular matrix, 141 norm of a linear function, 359 normed linear space, 343 null space of a matrix, 137 numerical rank of a matrix, 646 numerical stability of algorithms, 417 O oblique projection of a vector on a subspace along another subspace, 398 Oldenburger’s theorem, 574 open sphere, 340 operation, 15 opposite element of an element, 17 opposite of a matrix, 104 order of a matrix, 99 order of a minor of a matrix, 287 orthogonal complement of a subspace, 387
981 orthogonal projection of a vector on a subspace, 399 orthogonal set of vectors, 379 orthogonal subspaces, 387 orthonormal set of vectors, 380 outer product, 887 P Paige–Wei CS decomposition theorem, 652 parallelepiped k-dimensional, 871 parallelogram, 871 parallelogram equality, 381 Parseval’s equality, 390 partial least square regression, 746 partitioning of a matrix, 127 Pauli matrices, 218 PCA-guided clustering, 718 PERMn set of permutations of {1, . . . , n}, 6 permutation, 5 permutation diagonal of a matrix, 188 Perron number of a matrix, 598 Perron vector of a matrix, 598 Perron–Frobenius theorem, 597 perturbation, 434 polar form of a matrix, 650 polar form theorem, 650 polynomial discriminant of, 331 homogeneous, 302 polynomials associated, 302 polysemy, 785 poset dual of a, 28 positive closed half-space, 391 positive definite matrix, 405–406 positive matrix, 118 positive open half-space, 391 positive semidefinite matrix, 406 power method for computing eigenvalues, 503 predictor variables, 746
principal axes of a quadric, 594 principal cofactor, 290 principal components of W , 769 principal matrix of a pencil, 584 principal minor, 287 principal submatrix of a matrix, 101 problem 3-sat, 960 3-satisfiability, 960 product of matrices, 104 product of permutations, 6 product of polynomials, 21 product of tensors, 842 projection matrix of a subspace, 400 projection of a subspace along another subspace, 667 projections on closed sets theorem, 347 Ptolemy inequality, 466 Pythagora’s theorem, 388 Q QR iterative algorithm, 506 quadratic form, 587 quadric, 592 quaternion, 185 conjugate, 185 quotient of two matrix polynomials, 556 R range of a matrix, 137 rank of a matrix, 138 rational matrix function, 556 Rayleigh–Ritz function, 535 Rayleigh–Ritz theorem, 535 real ellipsoid, 595 real linear space, 34 reciprocal set of vectors, 472 reduced form of a subspace with respect to a unitary matrix, 665 reflection matrix, 395 regression, 739 regressor, 740 regular λ-matrix, 552
regular pencil, 583 representation of a matrix on an invariant subspace, 483 representation of a matrix relative to a basis of an invariant subspace, 664 residual, 766 residual vector, 756 response variables, 746 resultant, 304 right distributivity laws in a ring, 20 right eigenvector of a matrix, 485 right inverse of a matrix, 148 right quotient of λ-matrices, 555 right remainder of λ-matrices, 555 right singular vector, 631 right value of a λ-matrix, 554 ring, 20 ring addition, 20 ring multiplication, 20 root-mean-squared residual, 814 rotation matrix, 395 rotation with a given axis, 471 S S-sequence, 3 sample correlation matrix, 727 sample covariance matrix, 726 sample matrix, 719–720 sample mean, 722 sample standard deviation of a vector, 722 scalar, 33 scalar multiplication, 33 scalar triple product, 438 Schatten norms, 685 Schur’s complement, 323 Schur’s triangularization theorem, 524 second Hirsch’s theorem, 501 Segr`e sequence of a Jordan segment, 566 self-adjoint matrix, 385 semi-simple eigenvalue, 550
Index semidiscrete decomposition of a matrix, 821 semigroup, 17 semimetric, 339 semimetric space, 339 seminorm, 342 semiorthogonal matrix, 463 semiunitary matrix, 463 separate sets in a metric space, 341 separation between the means of the classes, 782 separation of two matrices, 548 sequence length of a, 3 strict, 4 set of generalized eigenvalues of a matrix pencil, 582 set of permutations, 6 set spanned by a subset in a linear space, 36 signature of a matrix, 533 silhouette of a clustering, 705 similarity on a set, 342 similarity space, 342 simple eigenvalue, 486 simple invariant space, 665 simple matrix, 488 simultaneously diagonalizable matrices, 520 singular matrix, 141 singular value of a matrix, 631 skew-symmetric matrix, 99 sparse matrix, 247 spectral decomposition of a normal matrix, 531 spectral decomposition of a square matrix, 484 spectral decomposition of a unitarily diagonalizable matrix, 630 spectral norm of a matrix, 577 spectral resolution of a matrix relative to two subspaces, 667 spectral theorem for Hermitian matrices, 532 spectral theorem for normal matrices, 530
983 spectrum of a matrix, 480 square matrix, 99 square root of a diagonalizable complex matrix, 522 standard deviation of sample matrix, 722 Stewart–Sun CS decomposition, 655 stochastic matrix, 120 strictly lower triangular matrix, 106 strictly upper triangular matrix, 106 strongly non-singular matrix, 327 subadditivity property of norms, 362 subharmonic vector for a matrix, 514 submultiplicative property of matrix norms, 362 suborthogonal matrix, 463 subset closed under a set of operations, 28 subspace equivalence generated by a, 55 quotient set of a, 55 zero, 35 subspaces direct sum of, 62 internal sum of, 61 linearly independent, 89 subunitary matrix, 463 sum of matrices, 103 summation convention, 827 support, 61 support of a mapping defined on a field, 36 support of a sequence, 21 SVD theorem, 630 Sylvester determinant of two polynomials, 304 Sylvester operator of two matrices, 546 Sylvester’s identity, 318 Sylvester’s inertia theorem, 534 Sylvester’s rank theorem, 143 symmetric bilinear form, 587 symmetric gauge function, 458 symmetric matrix, 99 symmetric polynomial, 22 symmetrizer on V ⊗r , 850
synonimy, 784 system of linear equations, 157 system of normal equations, 742 T
F-topological linear space, 76–77 tensor, 839 of rank 1, 888 over a linear space, 80, 838 alternating, 847 contravariant, 839 covariant, 839 isotropic, 877 mixed, 839 order of a, 839 realization of a, 888 simple, 888 skew-symmetric, 847 symmetric, 847 type of a, 838 valence of a, 838 tensor on a linear space, 80 tensor product, 829 axioms of, 830 tensor product of linear mappings, 876 tensor rank, 960 tensor space standard basis of a, 840 tensors decomposable, 829 term vector, 785 the full QR factorization theorem, 425 the full-rank factorization theorem, 147 the least square method, 741 the replacement theorem, 42 the thin CS decomposition theorem, 654 the thin SVD decomposition corollary, 638 thin QR factorization, 423 thin SVD decomposition of a matrix, 638
three-way array k-nondegenerate, 889 k-slab of, 889 rank of, 889 Toeplitz matrix, 119 tolerance in MATLAB , 675 total variance, 727 trace norm, 685 trace of a matrix, 116 trailing submatrix of a matrix, 101 translation generated by an element of a linear space, 77 translation in a linear space, 53 transpose of a matrix, 100 transposition, 7 tridiagonal matrix, 118 trivial solution of a homogeneous linear system, 157 Tucker operator, 947 U unary operation, 15 unit matrix, 102 unit of a binary operation, 16 unitarily diagonalizable matrix, 155 unitarily equivalent, 632 unitarily invariant norms, 368 unitarily similar matrices, 153 unitary group of matrices, 122 unitary ring, 21 upper bandwidth of a matrix, 118 upper Hessenberg matrix, 118 upper triangular matrix, 106 V Vandermonde determinant, 290 variance of a vector, 726 vector model, 785 vector normal to a hyperplane, 391 vector triple product, 439 vectorial matrix norm, 361 vectorization, 116 vectors contravariant, 68 covariant, 68
Viète's theorem, 23 volume of a matrix, 688
W
Wedderburn's theorem, 151 wedge product, 857 well-conditioned linear system, 435 Weyl's theorem, 616 Weyr's theorem, 577 width of a cut decomposition, 461 Woodbury–Sherman–Morrison identity, 196
Z
zero linear space, 34 zero matrix, 102 zero of a binary operation, 16 zero polynomial, 21 zero-ary operation, 15 zero-extension of a linear form, 70