A First Course in Linear Model Theory


CHAPMAN & HALL/CRC Texts in Statistical Science Series

Joseph K. Blitzstein, Harvard University, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada

Recently Published Titles

Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in R
Paul Roback, Julie Legler

Bayesian Thinking in Biostatistics
Gary L. Rosner, Purushottam W. Laud, and Wesley O. Johnson

Linear Models with Python
Julian J. Faraway

Modern Data Science with R, Second Edition
Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton

Probability and Statistical Inference: From Basic Principles to Advanced Models
Miltiadis Mavrakakis and Jeremy Penzer

Bayesian Networks: With Examples in R, Second Edition
Marco Scutari and Jean-Baptiste Denis

Time Series Modeling, Computation, and Inference, Second Edition
Raquel Prado, Marco A. R. Ferreira and Mike West

A First Course in Linear Model Theory, Second Edition
Nalini Ravishanker, Zhiyi Chi, Dipak K. Dey

Foundations of Statistics for Data Scientists: With R and Python
Alan Agresti and Maria Kateri

Fundamentals of Causal Inference: With R
Babette A. Brumback

Sampling Design and Analysis, Third Edition
Sharon L. Lohr

Theory of Statistical Inference
Anthony Almudevar

Probability, Statistics, and Data: A Fresh Approach Using R
Darrin Speegle and Bryan Clair

For more information about this series, please visit: https://www.crcpress.com/Chapman--Hall/CRC-Texts-in-Statistical-Science/book-series/CHTEXSTASCI


A First Course in Linear Model Theory
Second Edition

Nalini Ravishanker, Zhiyi Chi, Dipak K. Dey


Second edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2022 Taylor & Francis Group, LLC

First edition published by CRC Press 2001

CRC Press is an imprint of Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Names: Ravishanker, Nalini, author. | Chi, Zhiyi, author. | Dey, Dipak, author.
Title: A first course in linear model theory / Nalini Ravishanker, Zhiyi Chi, Dipak K. Dey.
Description: Second edition. | Boca Raton : CRC Press, 2021. | Series: Chapman & Hall/CRC texts in statistical science | Includes bibliographical references and index.
Identifiers: LCCN 2021019402 (print) | LCCN 2021019403 (ebook) | ISBN 9781439858059 (hardback) | ISBN 9781032101392 (paperback) | ISBN 9781315156651 (ebook)
Subjects: LCSH: Linear models (Statistics)
Classification: LCC QA276 .R38 2021 (print) | LCC QA276 (ebook) | DDC 519.5/35--dc23
LC record available at https://lccn.loc.gov/2021019402
LC ebook record available at https://lccn.loc.gov/2021019403

ISBN: 978-1-439-85805-9 (hbk) ISBN: 978-1-032-10139-2 (pbk) ISBN: 978-1-315-15665-1 (ebk) DOI: 10.1201/9781315156651 Typeset in CMR10 font by KnowledgeWorks Global Ltd.


To my family and friends.

N.R.

To my family.

Z.C.

To Rita and Debosri.

D.K.D.


Contents

Preface to the First Edition  xiii

Preface to the Second Edition  xv

1 Review of Vector and Matrix Algebra  1
  1.1 Notation  1
  1.2 Basic properties of vectors  3
  1.3 Basic properties of matrices  9
  1.4 R Code  27
  Exercises  30

2 Properties of Special Matrices  33
  2.1 Partitioned matrices  33
  2.2 Algorithms for matrix factorization  40
  2.3 Symmetric and idempotent matrices  43
  2.4 Nonnegative definite quadratic forms and matrices  46
  2.5 Simultaneous diagonalization of matrices  52
  2.6 Geometrical perspectives  53
  2.7 Vector and matrix differentiation  56
  2.8 Special operations on matrices  60
  2.9 R Code  62
  Exercises  63

3 Generalized Inverses and Solutions to Linear Systems  67
  3.1 Generalized inverses  67
  3.2 Solutions to linear systems  74
  3.3 Linear optimization  79
    3.3.1 Unconstrained minimization  79
    3.3.2 Constrained minimization  80
  3.4 R Code  82
  Exercises  83

4 General Linear Model  87
  4.1 Model definition and examples  87
  4.2 Least squares approach  91
  4.3 Estimable functions  109
  4.4 Gauss–Markov theorem  114
  4.5 Generalized least squares  115
  4.6 Estimation subject to linear constraints  120
    4.6.1 Method of Lagrangian multipliers  120
      4.6.1.1 Case I: A′β is estimable  120
      4.6.1.2 Case II: A′β is not estimable  122
    4.6.2 Method of orthogonal projections  124
  Exercises  125

5 Multivariate Normal and Related Distributions  129
  5.1 Integral evaluation theorems  129
  5.2 Multivariate normal distribution and properties  131
  5.3 Some noncentral distributions  148
  5.4 Distributions of quadratic forms  155
  5.5 Remedies for non-normality  162
    5.5.1 Transformations to normality  162
      5.5.1.1 Univariate transformations  162
      5.5.1.2 Multivariate transformations  164
    5.5.2 Alternatives to the normal distribution  165
      5.5.2.1 Mixture of normals  165
      5.5.2.2 Spherical distributions  167
      5.5.2.3 Elliptical distributions  169
  5.6 R Code  172
  Exercises  175

6 Sampling from the Multivariate Normal Distribution  181
  6.1 Distribution of sample mean and covariance  181
  6.2 Distributions related to correlation coefficients  184
  6.3 Assessing the normality assumption  188
  6.4 R Code  191
  Exercises  195

7 Inference for the General Linear Model-I  197
  7.1 Properties of least squares solutions  197
  7.2 General linear hypotheses  200
    7.2.1 Derivation of and motivation for the F-test  201
    7.2.2 Power of the F-test  209
    7.2.3 Testing independent and orthogonal contrasts  209
  7.3 Restricted and reduced models  210
    7.3.1 Estimation space and estimability under constraints  211
    7.3.2 Nested sequence of models or hypotheses  215
  7.4 Confidence intervals  229
    7.4.1 Joint and marginal confidence intervals  229
    7.4.2 Simultaneous confidence intervals  231
      7.4.2.1 Scheffé intervals  232
      7.4.2.2 Bonferroni t-intervals  233
  Exercises  234

8 Inference for the General Linear Model-II  237
  8.1 Likelihood-based approaches  237
    8.1.1 Maximum likelihood estimation under normality  238
    8.1.2 Model selection criteria  239
  8.2 Departures from model assumptions  243
    8.2.1 Graphical procedures  243
    8.2.2 Heteroscedasticity  244
    8.2.3 Serial correlation  248
  8.3 Diagnostics for the GLM  253
    8.3.1 Further properties of the projection matrix  253
    8.3.2 Types of residuals  255
    8.3.3 Outliers and high leverage observations  258
    8.3.4 Diagnostic measures based on influence functions  259
  8.4 Prediction intervals and calibration  266
  Exercises  271

9 Multiple Linear Regression Models  275
  9.1 Variable selection in regression  275
    9.1.1 Graphical assessment of variables  275
    9.1.2 Criteria-based variable selection  277
    9.1.3 Variable selection based on significance tests  280
      9.1.3.1 Sequential and partial F-tests  280
      9.1.3.2 Stepwise regression and variants  283
  9.2 Orthogonal and collinear predictors  288
    9.2.1 Orthogonality in regression  288
    9.2.2 Multicollinearity  291
    9.2.3 Ridge regression  293
    9.2.4 Principal components regression  296
  9.3 Dummy variables in regression  298
  Exercises  302

10 Fixed-Effects Linear Models  305
  10.1 Inference for unbalanced ANOVA models  305
    10.1.1 One-way cell means model  306
    10.1.2 Higher-order overparametrized models  309
      10.1.2.1 Two-factor additive models  309
      10.1.2.2 Two-factor models with interaction  312
      10.1.2.3 Nested or hierarchical models  315
  10.2 Nonparametric procedures  315
  10.3 Analysis of covariance  318
  10.4 Multiple hypothesis testing  324
    10.4.1 Error rates  325
    10.4.2 Procedures for Controlling Type I Errors  327
      10.4.2.1 FWER control  327
      10.4.2.2 FDR control  328
      10.4.2.3 Plug-in approach to estimate the FDR  329
    10.4.3 Multiple comparison procedures  329
  10.5 Generalized Gauss–Markov theorem  335
  Exercises  337

11 Random- and Mixed-Effects Models  341
  11.1 Setup and examples of mixed-effects linear models  341
  11.2 Inference for mixed-effects linear models  343
    11.2.1 Extended Gauss–Markov theorem  343
    11.2.2 GLS estimation of fixed effects  346
    11.2.3 ANOVA method for estimation  347
    11.2.4 Method of maximum likelihood  347
    11.2.5 REML estimation  348
    11.2.6 MINQUE estimation  349
  11.3 One-factor random-effects model  352
  11.4 Two-factor random- and mixed-effects models  362
  Exercises  365

12 Generalized Linear Models  367
  12.1 Components of GLIM  367
  12.2 Estimation approaches  370
    12.2.1 Score and Fisher information for GLIM  370
    12.2.2 Maximum likelihood estimation – Fisher scoring  372
    12.2.3 Iteratively reweighted least squares  373
    12.2.4 Quasi-likelihood estimation  375
  12.3 Residuals and model checking  377
    12.3.1 GLIM residuals  377
    12.3.2 Goodness of fit measures  377
    12.3.3 Hypothesis testing and model comparisons  379
      12.3.3.1 Wald Test  379
      12.3.3.2 Likelihood ratio test  379
      12.3.3.3 Drop-in-deviance test  380
  12.4 Binary and binomial response models  380
  12.5 Count Models  384
  Exercises  387

13 Special Topics  391
  13.1 Multivariate general linear models  391
    13.1.1 Model definition  391
    13.1.2 Least squares estimation  393
    13.1.3 Estimable functions and Gauss–Markov theorem  396
    13.1.4 Maximum likelihood estimation  399
    13.1.5 Likelihood ratio tests for linear hypotheses  401
    13.1.6 Confidence region and Hotelling T² distribution  402
    13.1.7 One-factor multivariate analysis of variance model  403
  13.2 Longitudinal models  406
    13.2.1 Multivariate model for longitudinal data  406
    13.2.2 Two-stage random-effects models  409
  13.3 Elliptically contoured linear model  412
  13.4 Bayesian linear models  417
    13.4.1 Bayesian normal linear model  418
    13.4.2 Hierarchical normal linear model  420
    13.4.3 Bayesian model assessment and selection  422
  13.5 Dynamic linear models  424
    13.5.1 Kalman filter equations  425
    13.5.2 Kalman smoothing equations  427
  Exercises  430

14 Miscellaneous Topics  433
  14.1 Robust regression  433
    14.1.1 Least absolute deviations regression  433
    14.1.2 M-regression  436
  14.2 Nonparametric regression methods  438
    14.2.1 Regression splines  439
    14.2.2 Additive and generalized additive models  442
    14.2.3 Projection pursuit regression  445
    14.2.4 Multivariate adaptive regression splines  447
    14.2.5 Neural networks regression  449
  14.3 Regularized regression  451
    14.3.1 L0-regularization  452
    14.3.2 L1-regularization and Lasso  454
      14.3.2.1 Geometry of Lasso  454
      14.3.2.2 Fast forward stagewise selection  455
      14.3.2.3 Least angle regression  456
    14.3.3 Elastic net  457
  14.4 Missing data analysis  460
    14.4.1 EM algorithm  461

A Multivariate Probability Distributions  465

B Common Families of Distributions  471

C Some Useful Statistical Notions  477

D Solutions to Selected Exercises  483

Bibliography  493

Author Index  505

Subject Index  509

Preface to the First Edition

Linear Model theory plays a fundamental role in the foundation of mathematical and applied statistics. It has a base in distribution theory and statistical inference, and finds application in many advanced areas in statistics including univariate and multivariate regression, analysis of designed experiments, longitudinal and time series analysis, spatial analysis, multivariate analysis, wavelet methods, etc. Most statistics departments offer at least one course on linear model theory at the graduate level. There are several excellent books on the subject, such as “Linear Statistical Inference and its Applications” by C.R. Rao, “Linear Models” by S.R. Searle, “Theory and Applications of the Linear Model” by F.A. Graybill, “Plane Answers to Complex Questions: The Theory of Linear Models” by R. Christensen and “The Theory of Linear Models” by B. Jorgensen. Our motivation has been to incorporate general principles of inference in linear models to the fundamental statistical education of students at the graduate level, while our treatment of contemporary topics in a systematic way will serve the needs of professionals in various industries.

The three salient features of this book are: (1) developing standard theory of linear models with numerous applications in simple and multiple regression, as well as fixed, random and mixed-effects models, (2) introducing generalized linear models with examples, and (3) presenting some current topics including Bayesian linear models, general additive models, dynamic linear models and longitudinal models. The first two chapters introduce to the reader requisite linear and matrix algebra. This book is therefore a self-contained exposition of the theory of linear models, including motivational and practical aspects. We have tried to achieve a healthy compromise between theory and practice, by providing a sound theoretical basis, and indicating how the theory works in important special cases in practice. There are several examples throughout the text. In addition, we provide summaries of many numerical examples in different chapters, while a more comprehensive description of these is available in the first author's web site (http://www.stat.uconn.edu/~nalini). There are several exercises at the end of each chapter that should serve to reinforce the methods.

Our entire book is intended for a two semester graduate course in linear models. For a one semester course, we recommend essentially the first eight chapters, omitting a few subsections, if necessary, and supplementing a few selected topics from chapters 9-11, if time permits. For instance, section 5.5, section 6.4, sections 7.5.2-7.5.4, and sections 8.5, 8.7 and 8.8 may be omitted in a one semester course. The first two chapters, which present a review on vectors and matrices specifically as they pertain to linear model theory, may also be assigned as background reading if the students had previous exposure to these topics. Our book requires some knowledge of statistics; in particular, a knowledge of elementary sampling distributions, basic estimation theory and hypothesis testing at an undergraduate level is definitely required. Occasionally, more advanced concepts of statistical inference are invoked in this book, for which suitable references are provided.


The plan of this book follows. The first two chapters develop basic concepts of linear and matrix algebra with a view towards application in linear models. Chapter 3 describes generalized inverses and solutions to systems of linear equations. We develop the notion of a general linear model in Chapter 4. An attractive feature of our book is that we unify full-rank and non full-rank models in the development of least squares inference and optimality via the Gauss-Markov theorem. Results for the full-rank (regression) case are provided as special cases. We also introduce via examples, balanced ANOVA models that are widely used in practice. Chapter 5 deals with multivariate normal and related distributions, as well as distributions of quadratic forms that are at the heart of inference. We also introduce the class of elliptical distributions that can serve as error distributions for linear models. Sampling from multivariate normal distributions is the topic of Chapter 6, together with assessment of and transformations to multivariate normality. This is followed by inference for the general linear model in Chapter 7. Inference under normal and elliptical errors is developed and illustrated on examples from regression and balanced ANOVA models. In Chapter 8, topics in multiple regression models such as model checking, variable selection, regression diagnostics, robust regression and nonparametric regression are presented. Chapter 9 is devoted to the study of unbalanced designs in fixed-effects ANOVA models, the analysis of covariance (ANACOVA) and some nonparametric test procedures. Random-effects models and mixed-effects models are discussed in detail in Chapter 10. Finally in Chapter 11, we introduce several special topics including Bayesian linear models, dynamic linear models, linear longitudinal models and generalized linear models (GLIM). The purpose of this chapter is to introduce to the reader some new frontiers of linear models theory; several references are provided so that the reader may explore further in these directions. Given the exploding nature of our subject area, it is impossible to be exhaustive in a text, and cover everything that should ideally be covered. We hope that our judgment in choice of material is appropriate and useful.

Most of our book was developed in the form of lecture notes for a sequence of two courses on linear models which both of us have taught for several years in the Department of Statistics at the University of Connecticut. The numerical examples in the text and in the web site were developed by NR over many years. In the text, we have acknowledged published work, wherever appropriate, for the use of data in our numerical examples, as well as for some of the exercise problems. We are indeed grateful for their use, and apologize for any inadvertent omission in this regard.

In writing this text, discussions with many colleagues were invaluable. In particular, we thank Malay Ghosh, for several suggestions that vastly improved the structure and content of this book. We deeply appreciate his time and goodwill. We thank Chris Chatfield and Jim Lindsey for their review and for the suggestion about including numerical examples in the text. We are also very grateful for the support and encouragement of our statistical colleagues, in particular Joe Glaz, Bani Mallick, Alan Gelfand and Yazhen Wang. We thank Ming-Hui Chen for all his technical help with LaTeX. Many graduate students helped in proofreading the typed manuscript; we are especially grateful to Junfeng Liu, Madhuja Mallick and Prashni Paliwal. We also thank Karen Houle, a graduate student in Statistics, who helped with “polishing-up” the numerical examples in NR's web site. We appreciate all the help we received from people at Chapman & Hall/CRC – Bob Stern, Helena Redshaw, Gail Renard and Sean Davey.

Nalini Ravishanker and Dipak K. Dey
Department of Statistics
University of Connecticut
Storrs, CT


Preface to the Second Edition

Linear Model theory plays a fundamental role in the foundation of mathematical and applied statistics. Our motivation in writing the first edition of the book was to incorporate general principles of inference in linear models to the fundamental statistical education of students at the graduate level, while our treatment of contemporary topics in a systematic way will serve the needs of professionals in various industries.

The attractive features of the second edition are: (1) developing standard theory of linear models with numerous applications in simple and multiple regression, as well as fixed, random and mixed-effects models in the first few chapters, (2) devoting a chapter to the topic of generalized linear models with examples, (3) including two new chapters which contain detailed presentations on current and interesting topics such as multivariate linear models, Bayesian linear models, general additive models, dynamic linear models, longitudinal models, robust regression, regularized regression, etc., and (4) providing numerical examples in R for several methods, whose details are available here: https://github.com/nravishanker/FCILM-2.

As in the first edition, here too, we have tried to achieve a healthy compromise between theory and practice, by providing a sound theoretical basis, and indicating how the theory works in important special cases in practice. There are several exercises at the end of each chapter that should serve to reinforce the methods.

Our entire book is intended for a two semester graduate course in linear models. For a one semester course, we recommend essentially the first eight chapters, omitting a few subsections, if necessary, and including a few selected topics from chapters 9-13, if time permits. For instance, section 5.5, section 6.4, sections 7.5.2-7.5.4, and sections 8.5, 8.7 and 8.8 may be omitted in a one semester course. The first two chapters, which present a review on vectors and matrices specifically as they pertain to linear model theory, may also be assigned as background reading if the students had previous exposure to these topics. Our book requires some knowledge of statistics; in particular, a knowledge of elementary sampling distributions, basic estimation theory and hypothesis testing at an undergraduate level is definitely required. When more advanced concepts of statistical inference are invoked in this book, suitable references are provided.

The plan of this book follows. The first two chapters develop basic concepts of linear and matrix algebra with a view towards application in linear models. Chapter 3 describes generalized inverses and solutions to systems of linear equations. We develop the notion of a general linear model in Chapter 4. An attractive feature of our book is that we unify full-rank and non full-rank models in the development of least squares inference and optimality via the Gauss–Markov theorem. Results for the full-rank (regression) case are provided as special cases. We also introduce via examples, balanced ANOVA models that are widely used in practice. Chapter 5 deals with multivariate normal and related distributions, as well as distributions of quadratic forms that are at the heart of inference. We also introduce the class of elliptical distributions that can serve as error distributions for linear models. Sampling from multivariate normal distributions is the topic of Chapter 6, together with assessment of and transformations to multivariate normality. In the second edition, we describe inference for the general linear model in Chapters 7 and 8. Inference under normal and elliptical errors is developed and illustrated on examples from regression and balanced ANOVA models. Topics in multiple regression models such as model checking, variable selection, regression diagnostics, robust regression and nonparametric regression are presented in Chapter 9. Chapter 10 is devoted to the study of unbalanced designs in fixed-effects ANOVA models, the analysis of covariance (ANACOVA) and some nonparametric procedures. In this edition, we have included a new section on multiple comparisons. Random-effects models and mixed-effects models are discussed in detail in Chapter 11. Chapter 12 is devoted to generalized linear models, whereas in the first edition, we discussed this in just one section. In Chapter 13, we introduce several special topics including Bayesian linear models, dynamic linear models, linear longitudinal models and generalized linear models (GLIM). The purpose of this chapter is to introduce to the reader some new frontiers of linear models theory; several references are provided so that the reader may explore further in these directions. In Chapter 14, we have discussed the theory and applications for robust and regularized regressions, nonparametric methods for regression analysis and some ideas in missing data analysis in linear models.

Given the exploding nature of our subject area, it is impossible to be exhaustive in a text, and cover everything that should ideally be covered. The second edition of the book considerably enhances the early parts of the book and further, also includes an in-depth treatment of a few useful and currently relevant additional topics. We hope that our judgment in choice of material for the second edition is useful.

We have acknowledged published work, wherever appropriate, for the use of data and for some of the exercises. We are grateful for their use, and also grateful for the use of R and RStudio in the numerical examples. We apologize for any inadvertent omission in regard to acknowledgments.

We are very grateful to our colleagues, Haim Bar, Kun Chen, Ming-Hui Chen, Yuwen Gu and Elizabeth Schifano, for useful discussions about the material. We also thank our PhD students, Sreeram Anantharaman, Surya Eada, Namitha Pais, and Patrick Toman for their help with the numerical examples and linking to GitHub. We appreciate all the help we received from David Grubbs and Robin Lloyd Starkes at Chapman & Hall/CRC. Last, but not least, a very big thank you to our families and friends for their constant support.

Nalini Ravishanker, Zhiyi Chi, and Dipak K. Dey
Department of Statistics
University of Connecticut
Storrs, CT


1 Review of Vector and Matrix Algebra

In this chapter, we introduce basic results dealing with vector spaces and matrices, which are essential for an understanding of linear statistical methods. We provide several numerical and geometrical illustrations of these concepts. The material presented in this chapter is found in most textbooks that deal with matrix theory pertaining to linear models, including Graybill (1983), Harville (1997), Rao (1973), and Searle (1982). Unless stated otherwise, all vectors and matrices are assumed to be real, i.e., they have real numbers as elements.

1.1 Notation

An m × n matrix A is a rectangular array of real numbers of the form
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} = \{a_{ij}\}$$
with row dimension m, column dimension n, and (i, j)th element aij. For example,
$$A = \begin{pmatrix} 5 & 4 & 1 \\ -3 & 2 & 6 \end{pmatrix}$$
is a 2 × 3 matrix. We sometimes use A ∈ R^(m×n) to denote that A is an m × n matrix of real numbers. An n-dimensional column vector
$$a = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix}$$
can be thought of as a matrix with n rows and one column. For example,
$$a = \begin{pmatrix} 3 \\ -1 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 1 \\ 5 \end{pmatrix}, \quad \text{and} \quad c = \begin{pmatrix} 0.25 \\ 0.50 \\ 0.75 \\ 1.00 \end{pmatrix}$$
are respectively 2-dimensional, 3-dimensional, and 4-dimensional vectors. An n-dimensional column vector with each of its n elements equal to unity is denoted by 1n, while a column vector whose elements are all zero is called the null vector or the zero vector and is denoted by 0n. When the dimension is obvious, we will drop the subscript. For any integer n ≥ 1, we can write an n-dimensional column vector as a = (a1, · · · , an)′, i.e., as the transpose of the n-dimensional (row) vector with components a1, · · · , an. In this book, a vector denotes a column vector, unless stated otherwise. We use a ∈ R^n to denote that a is an n-dimensional (column) vector.

An m × n matrix A with the same row and column dimensions, i.e., with m = n, is called a square matrix of order n. An n × n identity matrix is denoted by In; each of its n diagonal elements is unity while each off-diagonal element is zero. An m × n matrix Jmn has each element equal to unity. An n × n unit matrix is denoted by Jn. For example, we have
$$I_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad J_{23} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}, \quad \text{and} \quad J_3 = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}.$$
An n × n permutation matrix R is obtained by permuting the rows of In. Each row and column of R contains exactly one 1 and has zeroes elsewhere. For example, when n = 3,
$$R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$
is a permutation of I3.

An n × n matrix whose elements are zero except on the diagonal, where the elements are nonzero, is called a diagonal matrix. We will denote a diagonal matrix by D = diag(d1, · · · , dn). Note that In is an n × n diagonal matrix, written as In = diag(1, · · · , 1). An m × n matrix all of whose elements are equal to zero is called the null matrix or the zero matrix O. An n × n matrix is said to be an upper triangular matrix if all the elements below and to the left of the main diagonal are zero. Similarly, if all the elements located above and to the right of the main diagonal are zero, then the n × n matrix is said to be lower triangular. For example,
$$U = \begin{pmatrix} 5 & 4 & 3 \\ 0 & 2 & -6 \\ 0 & 0 & 5 \end{pmatrix} \quad \text{and} \quad L = \begin{pmatrix} 5 & 0 & 0 \\ 4 & 3 & 0 \\ 2 & -6 & 5 \end{pmatrix}$$
are respectively upper triangular and lower triangular matrices. A square matrix is triangular if it is either upper triangular or lower triangular. A triangular matrix is said to be a unit triangular matrix if aij = 1 whenever i = j. Unless explicitly stated, we assume that vectors and matrices are non-null.

A submatrix of a matrix A is obtained by deleting certain rows and/or columns of A. For example, let
$$A = \begin{pmatrix} 1 & 3 & 5 & 7 \\ 5 & 4 & 1 & -9 \\ -3 & 2 & 6 & 4 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 5 & 4 & 1 \\ -3 & 2 & 6 \end{pmatrix}.$$
The 2 × 3 submatrix B is obtained by deleting row 1 and column 4 of the 3 × 4 matrix A. Any matrix can be considered to be a submatrix of itself. We call a submatrix obtained by deleting the same rows and columns from A a principal submatrix of A. For r = 1, 2, · · · , n, the r × r leading principal submatrix of A is obtained by deleting the last (n − r) rows and columns from A. The 2 × 2 leading principal submatrix of the matrix A shown above is
$$C = \begin{pmatrix} 1 & 3 \\ 5 & 4 \end{pmatrix}.$$
It may be easily verified that a principal submatrix of a diagonal, upper triangular or lower triangular matrix is respectively diagonal, upper triangular or lower triangular. Some elementary properties of vectors and matrices are given in the following two sections. Familiarity with this material is recommended before a further study of properties of special matrices that are described in the following two chapters.
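The objects defined in this section are easy to experiment with in R, the language used in the R Code sections of this book. The sketch below is our own illustration, not the listing in Section 1.4; all object names (I3, J23, U, and so on) are chosen here for readability.

    I3  <- diag(3)                          # 3 x 3 identity matrix I_3
    J23 <- matrix(1, nrow = 2, ncol = 3)    # 2 x 3 matrix of ones J_23
    J3  <- matrix(1, 3, 3)                  # 3 x 3 unit matrix J_3
    R   <- diag(3)[c(1, 3, 2), ]            # permutation matrix: rows 2 and 3 of I_3 swapped
    D   <- diag(c(5, 2, 7))                 # a diagonal matrix
    U   <- matrix(c(5, 4, 3, 0, 2, -6, 0, 0, 5), nrow = 3, byrow = TRUE)   # upper triangular U
    A   <- matrix(c(1, 3, 5, 7, 5, 4, 1, -9, -3, 2, 6, 4), nrow = 3, byrow = TRUE)
    B   <- A[-1, -4]                        # submatrix: delete row 1 and column 4 of A
    C   <- A[1:2, 1:2]                      # 2 x 2 leading principal submatrix of A

Indexing with negative subscripts, as in A[-1, -4], is the idiomatic way in R to delete rows and columns.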


FIGURE 1.2.1. Geometric representation of 2- and 3-dimensional vectors.
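A plot in the spirit of Figure 1.2.1(a) can be sketched with base R graphics; the few lines below are our own illustration rather than the code used to produce the book's figure.

    plot(NA, xlim = c(-3, 2), ylim = c(0, 3.5), xlab = "x1", ylab = "x2", asp = 1)   # empty plotting region
    arrows(0, 0, x1 = c(1, -2), y1 = c(1, 3), length = 0.1)                          # vectors (1, 1) and (-2, 3) as arrows from the origin
    text(c(1, -2), c(1, 3), labels = c("(1, 1)", "(-2, 3)"), pos = 3)                # label the arrow tips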

1.2 Basic properties of vectors

An n-dimensional vector a is an ordered set of measurements, which can be represented geometrically as a directed line in n-dimensional space R^n with component a1 along the first axis, component a2 along the second axis, etc., and component an along the nth axis. We can represent 2-dimensional and 3-dimensional vectors respectively as points in the plane R^2 and in 3-dimensional space R^3. In this book, we always assume Cartesian coordinates for a Euclidean space, such as a plane or a 3-dimensional space. Any 2-dimensional vector a = (a1, a2)′ can be graphically represented by the point with coordinates (a1, a2) in the Cartesian coordinate plane, or as the arrow starting from the origin (0, 0), whose tip is the point with coordinates (a1, a2). For n = 2, Figure 1.2.1 (a) shows the vectors (1, 1) and (−2, 3) as arrows starting from the origin. For n = 3, Figure 1.2.1 (b) shows a vector b = (b1, b2, b3)′ in R^3.

Two vectors can be added (or subtracted) only if they have the same dimension, in which case the sum (or difference) of the two vectors is the vector of sums (or differences) of their elements, i.e., a ± b = (a1 ± b1, · · · , an ± bn)′. The sum of two vectors emanating from the origin is the diagonal of the parallelogram which has the vectors a and b as adjacent sides. Vector addition is commutative and associative, i.e., a + b = b + a, and a + (b + c) = (a + b) + c. The scalar multiple ca of a vector a is obtained by multiplying each element of a by the scalar c, i.e., ca = (ca1, · · · , can)′.

Definition 1.2.1. Inner product of vectors. The inner product of two n-dimensional vectors a and b is denoted by a • b, a′b, or ⟨a, b⟩, and is the scalar
$$a'b = (a_1, \cdots, a_n)\begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix} = a_1 b_1 + \cdots + a_n b_n = \sum_{i=1}^{n} a_i b_i.$$

FIGURE 1.2.2. Distance and angle between two vectors.

The inner product of a vector a with itself is a′a. The positive square root of this quantity is called the Euclidean norm, or length, or magnitude of the vector, and is ‖a‖ = (a1² + · · · + an²)^(1/2). Geometrically, the length of a vector a = (a1, a2)′ in two dimensions may be viewed as the hypotenuse of a right triangle, whose other two sides are given by the vector components, a1 and a2. Scalar multiplication of a vector a changes its length,
$$\|ca\| = (c^2 a_1^2 + \cdots + c^2 a_n^2)^{1/2} = |c|\,(a_1^2 + \cdots + a_n^2)^{1/2} = |c|\,\|a\|.$$
If |c| > 1, a is expanded by scalar multiplication, while if |c| < 1, a is contracted. If c = 1/‖a‖, the resulting vector is defined to be b = a/‖a‖, the n-dimensional unit vector with length 1. A vector has both length and direction. If c > 0, scalar multiplication does not change the direction of a vector a. However, if c < 0, the direction of the vector ca is in opposite direction to the vector a. The unit vector a/‖a‖ has the same direction as a. The Euclidean distance between two vectors a and b is defined by
$$d(a, b) = d(b, a) = \|a - b\| = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}.$$
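In R, these quantities are one-line computations. The following sketch is ours (it is not the Section 1.4 listing) and uses the vectors a = (2, 3, 4)′ and b = (6, 7, 9)′ that reappear in Example 1.2.1 below.

    a <- c(2, 3, 4)
    b <- c(6, 7, 9)
    sum(a * b)                               # inner product a'b = 69
    sqrt(sum(a^2))                           # Euclidean norm of a
    sqrt(sum((a - b)^2))                     # Euclidean distance d(a, b)
    sum(a * b) / sqrt(sum(a^2) * sum(b^2))   # cosine of the angle between a and b, defined next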

The angle θ between two vectors a and b is defined in terms of their inner product as √ √ cos θ = a0 b/kakkbk = a0 b/( a0 a b0 b) (see Figure 1.2.2). Since cos θ = 0 only if a0 b = 0, a and b are perpendicular (or orthogonal) when a0 b = 0. Result 1.2.1. Properties of inner product. and let d be a scalar. Then,

Let a, b, and c be n-dimensional vectors

1. a • b = b • a. 2. a • (b + c) = a • b + a • c. 3. d(a • b) = (da) • b = a • (db). 4. a • a ≥ 0, with equality if and only if a = 0. 5. ka ± bk2 = kak2 + kbk2 ± 2a • b. 6. |a • b| ≤ kakkbk, with equality if and only if a = 0 or b = ca for some scalar c.

ISTUDY

Basic properties of vectors

5

7. ka + bk ≤ kak + kbk, with equality if and only if a = 0 or b = ca for some scalar c ≥ 0. The last two inequalities in Result 1.2.1 are respectively the Cauchy–Schwarz inequality and the triangle inequality, which we ask the reader to verify in Exercise 1.2. Geometrically, the triangle inequality states that the length of one side of a triangle does not exceed the sum of the lengths of the other two sides. Definition 1.2.2. Outer product of vectors. The outer product of two vectors a and b is denoted by a ∧ b or ab0 and is obtained by post-multiplying the column vector a by the row vector b0 . There is no restriction on the dimensions of a and b; if a is an m × 1 vector and b is an n × 1 vector, the outer product ab0 is an m × n matrix. Example 1.2.1.

We illustrate all these vector operations by an example. Let       2 6 10     a = 3 , b = 7 , and d = . 20 4 9

Then,   8 a + b = 10 , 13



 −4 a − b = −4 , −5

  60 10b = 70 , 90

a0 b = 2 × 6 + 3 × 7 + 4 × 9 = 69,   2 ab0 = 3 6 4

7

14 21 28

 18 27 , 36

 20  20 = 30 40

 40 60 . 80

 12  9 = 18 24

  2 ad0 = 3 10 4 However, a + d, a0 d, and b0 d are undefined.

and



Definition 1.2.3. Vector space. A vector space is a set V of vectors v ∈ Rn , which is closed under addition and multiplication by a scalar, i.e., for any u, v ∈ V and c ∈ R, u + v ∈ V and cv ∈ V. For example, Rn is a vector space for any positive integer n = 1, 2, · · · . As another example, consider k linear equations in n variables x1 , · · · , xn : ci1 x1 + · · · + cin xn = 0, i = 1, · · · , k, where cij are real constants. The totality of solutions x = (x1 , · · · , xn )0 is a vector space. We discuss solutions of linear equations in Chapter 3. Definition 1.2.4. Vector subspace and affine subspace. Let S be a subset of a vector space V. If S is also a vector space, it is called a vector subspace, or simply, a subspace of V. Further, for any v ∈ V, v + S = {v + u: u ∈ S} is called an affine subspace of V.

ISTUDY

6

Review of Vector and Matrix Algebra

For example, {0} and V are (trivially) subspaces of V. Any plane through the origin is a subspace of R3 , and any plane is an affine subspace of R3 . More generally, any vector space of n-dimensional vectors is a subspace of Rn . Definition 1.2.5. Linear span. Let S be a subset Pn of a vector space V. The set of all finite linear combinations of the vectors in S, i.e., { i=1 ai vi : ai ∈ R, vi ∈ S, n = 1, 2, · · · } is called the linear span, or simply, span of S and is denoted by Span(S). The span of a set of vectors in a vector space is the smallest subspace that contains the set; it is also the intersection of all subspaces that contain the set. We are mostly interested in the case where S is a finite set of vectors {v1 , · · · , vl }. For example, the vectors 0, v1 , v2 , v1 + v2 , 10v1 , 5v1 − 3v2 all belong to Span{v1 , v2 }. Definition 1.2.6. Linear dependence and independence of vectors. Let v1 , · · · , vm be vectors in V. The vectors are said toPbe linearly dependent if and only if Pm m there exist scalars c1 , · · · , cm , not all zero, such that i=1 ci vi = 0. If i=1 ci vi 6= 0 unless all ci are zero, then v1 , · · · , vm are said to be linearly independent (LIN) vectors. For example, {0} is a linearly dependent set, as is any set of vectors containing 0. P2 Example 1.2.2. Let v1 = (1, −1, 3)0 and v2 = (1, 1, 1)0 . Now, i=1 ci vi = 0 =⇒ c1 + c2 = 0, −c1 + c2 = 0, and 3c1 + c2 = 0, for which the only solution is c1 = c2 = 0. Hence, v1 and v2 are LIN vectors.  Example 1.2.3. The vectors v1 = (1, −1)0 , v2 = (1, 2)0 , and v3 = (2, 1)0 are linearly P3 dependent, which is verified by setting c1 = 1, c2 = 1, and c3 = −1; we see that i=1 ci vi = 0.  Result 1.2.2.

For n-dimensional vectors v1 , · · · , vm , the following properties hold.

1. If the vectors are linearly dependent, we can express at least one of them as a linear combination of the others. 2. If s ≤ m of the vectors are linearly dependent, then all the vectors are linearly dependent. 3. If m > n, then the vectors are linearly dependent. Proof. We show property 3 by induction on n. The proof of the other properties is left as Exercise 1.8. If n = 1, then we have scalars v1 , · · · , vm , m > 1. If v1 = 0, then 1 · v1 + 0 · v2 + · · · + 0 · vm = 0. If v1 6= 0, then c · v1 + 1 · v2 + · · · + 1 · vm = 0 with c = −(v2 + · · · + vm )/v1 . In either case, v1 , · · · , vm are linearly dependent. Suppose the property holds for (n − 1)-dimensional vectors for some n ≥ 2. We prove that the property must hold for n-dimensional vectors. Let v1 , · · · , vm be n-dimensional vectors, where m > n. If v1 , · · · , vn are linearly dependent, then by property 2, v1 , · · · , vm , m > n, are linearly dependent. Now suppose v1 , · · · , vn are LIN. Let vi = (vi1 , . . . , vin )0 . Define the (n − 1)dimensional vector ui by removing the first component of vi , i.e., ui = (vi2 , . . . , vin )0 . By the induction hypothesis, Pnu1 , · · · , un are linearly Pn dependent, so there are scalars c1 , · · · , cn not all zero, such that j=1 cj uj = 0, i.e., j=1 vij cj = 0 for i > 1. Then, since v1 , · · · , vn Pn Pn are LIN, d = j=1 v1j cj must be nonzero. Let c˜j = cj /d. Then j=1 vij c˜j = 0 for i > 1 Pn Pn and j=1 v1j c˜j = 1. In other words, j=1 c˜j vj = e1 , where ei is the vector with 1 for its ith component and zeros elsewhere. Likewise, every ei is a linear combination of v1 , . . . , vn . On the other hand, it is easy to see that every vector in Rn is a linear combination of e1 , · · · , en . As a result, every vector vi , i > n, is a linear combination of v1 , · · · , vn , so v1 , · · · , vm are linearly dependent. By induction, the proof of property 3 is complete. 

ISTUDY

Basic properties of vectors

7

Definition 1.2.7. Basis and dimension. Let V be a vector space. If {v_1, · · · , v_m} ⊂ V is a set of LIN vectors that spans V, it is called a basis (Hamel basis) for V and m is called the dimension of V, denoted by dim(V) = m. If S is a subspace of V, then the dimension of any affine subspace v + S, v ∈ V, is defined to be dim(S).

Every vector in V has a unique representation as a linear combination of vectors in a basis {v_1, · · · , v_m} of the space. First, by Definition 1.2.7, such a representation exists. Second, if a vector x is equal to both $\sum_{i=1}^{m} c_i v_i$ and $\sum_{i=1}^{m} d_i v_i$, then $\sum_{i=1}^{m}(c_i - d_i)v_i = 0$, which, by the linear independence of v_1, · · · , v_m, is possible only if c_i = d_i for all i.

Every non-null vector space V of n-dimensional vectors has a basis, which is seen as follows. Choose any non-null v_1 ∈ V; v_1 alone is LIN. Suppose we had chosen v_1, · · · , v_i ∈ V that are LIN. If these vectors span V, then stop. Otherwise, choose any v_{i+1} ∈ V not in Span{v_1, · · · , v_i}; it can be seen that v_1, · · · , v_{i+1} are LIN. By property 3 of Result 1.2.2, this process must stop after the choice of some m ≤ n vectors v_1, · · · , v_m, which form a basis of V. Consequently, if V consists of n-dimensional vectors, dim(V) is strictly smaller than n unless V = R^n.

However, note that V does not have a unique basis. If {v_1, · · · , v_m} and {u_1, · · · , u_k} are two choices for a basis, then m = k, and so dim(V) is well-defined. To see this, suppose k ≥ m. Since u_1 ∈ Span{v_1, · · · , v_m}, $u_1 = \sum_{i=1}^{m} a_i v_i$ for some a_1, . . . , a_m not all 0. Without loss of generality, suppose a_1 ≠ 0. Since $v_1 = u_1/a_1 - \sum_{i=2}^{m}(a_i/a_1)v_i$, V = Span{u_1, v_2, · · · , v_m}. Next, $u_2 = b_1 u_1 + \sum_{i=2}^{m} b_i v_i$ for some b_1, . . . , b_m not all 0. Since u_1, u_2 are LIN, not all b_2, . . . , b_m are 0. Without loss of generality, suppose b_2 ≠ 0. Then as above, V = Span{u_1, u_2, v_3, . . . , v_m}. Continuing this process, it follows that V = Span{u_1, · · · , u_m}. This implies k = m. Since dim(V) does not depend on the choice of a basis, if V = Span{w_1, · · · , w_s}, then dim(V) is the maximum number of LIN vectors in {w_1, · · · , w_s}.

Definition 1.2.8. Sum and direct sum of subspaces. Let V_1, · · · , V_k be subspaces of V.

1. The sum of the subspaces is defined to be {v_1 + · · · + v_k: v_i ∈ V_i, i = 1, · · · , k} and denoted by V_1 + · · · + V_k or $\sum_{i=1}^{k} V_i$.

2. The sum is said to be a direct sum, denoted by V_1 ⊕ · · · ⊕ V_k or $\bigoplus_{i=1}^{k} V_i$, if for any v_i, w_i ∈ V_i, i = 1, . . . , k, v_1 + · · · + v_k = w_1 + · · · + w_k if and only if v_i = w_i for all i; or alternately, if for every v ∈ V_1 + · · · + V_k, there are unique vectors v_i ∈ V_i, i = 1, · · · , k, such that v = v_1 + · · · + v_k.

From Definition 1.2.8, V_1 + · · · + V_k is the smallest subspace of V that contains ∪_{i=1}^{k} V_i. In the case of a direct sum, $V_i \cap (\sum_{j \neq i} V_j) = \{0\}$ for all i, $\dim(V_1 \oplus \cdots \oplus V_k) = \sum_{i=1}^{k} \dim(V_i)$, and if {v_{i1}, . . . , v_{im_i}} is a basis of V_i for each i, where m_i = dim(V_i), then the combined set {v_{ij}, j = 1, . . . , m_i, i = 1, . . . , k} is a basis of V_1 ⊕ · · · ⊕ V_k.

The vectors e_1, · · · , e_n in the proof of property 3 in Result 1.2.2 are called the standard basis vectors of R^n. For example, the standard basis vectors in R^2 are e_1 = (1, 0)' and e_2 = (0, 1)', while those in R^3 are e_1 = (1, 0, 0)', e_2 = (0, 1, 0)', and e_3 = (0, 0, 1)'. Any x = (x_1, · · · , x_n)' ∈ R^n can be written as x = x_1 e_1 + · · · + x_n e_n.

Definition 1.2.9. Orthogonal vectors and orthogonal subspaces. Two vectors v_1 and v_2 in V are orthogonal or perpendicular to each other if and only if v_1 • v_2 = v_1'v_2 = v_2'v_1 = 0, denoted by v_1 ⊥ v_2. Two subspaces V_1, V_2 of V are orthogonal or perpendicular to each other if and only if every vector in V_1 is orthogonal to every vector in V_2, denoted by V_1 ⊥ V_2.


FIGURE 1.2.3. Pythagoras's theorem. [The figure shows the vectors v_1, v_2, and v_1 + v_2, with lengths ‖v_1‖, ‖v_2‖, and ‖v_1 + v_2‖.]

Pythagoras's Theorem states that for n-dimensional vectors v_1 and v_2, v_1 ⊥ v_2 if and only if
$$\|v_1 + v_2\|^2 = \|v_1\|^2 + \|v_2\|^2; \tag{1.2.1}$$

this is illustrated in Figure 1.2.3.

Result 1.2.3.

1. If v_1, · · · , v_m are nonzero vectors which are mutually orthogonal, i.e., v_i'v_j = 0, i ≠ j, then these vectors are LIN.

2. If V_1, · · · , V_m are subspaces of V, and V_i ⊥ V_j for any i ≠ j, then their sum is a direct sum, i.e., every v ∈ V_1 + · · · + V_m has a unique decomposition as v = v_1 + · · · + v_m with v_i ∈ V_i, i = 1, · · · , m.

The proof of the result is left as Exercise 1.9.

Definition 1.2.10. Orthonormal basis. A basis {v_1, · · · , v_m} of a vector space V such that v_i'v_j = 0 for all i ≠ j is called an orthogonal basis. If further, v_i'v_i = 1 for i = 1, · · · , m, it is called an orthonormal basis of V.

Result 1.2.4. Gram–Schmidt orthogonalization. Let {v_1, · · · , v_m} denote an arbitrary basis of V. To construct an orthonormal basis of V starting from {v_1, · · · , v_m}, we define
$$y_1 = v_1, \qquad y_k = v_k - \sum_{i=1}^{k-1} \frac{y_i' v_k}{\|y_i\|^2}\, y_i, \quad k = 2, \cdots, m,$$
$$z_k = \frac{y_k}{\|y_k\|}, \quad k = 1, \cdots, m.$$
Then {z_1, · · · , z_m} is an orthonormal basis of V. The proof of the result is left as Exercise 1.10. The stages in this process for a basis {v_1, v_2, v_3} are shown in Figure 1.2.4.

Example 1.2.4. We use Result 1.2.4 to find an orthonormal basis starting from the basis vectors v_1 = (1, −1, 1)', v_2 = (−2, 3, −1)', and v_3 = (1, 2, −4)'. Let y_1 = v_1. We compute y_1'v_2 = −6 and y_1'y_1 = 3, so that y_2 = (0, 1, 1)'. Next, y_1'v_3 = −5, y_2'v_3 = −2,

and y_2'y_2 = 2, so that y_3 = (8/3, 4/3, −4/3)'. It is easily verified that {y_1, y_2, y_3} is an orthogonal basis and also that z_1 = (1/√3, −1/√3, 1/√3)', z_2 = (0, 1/√2, 1/√2)', and z_3 = (2/√6, 1/√6, −1/√6)' form a set of orthonormal basis vectors.

FIGURE 1.2.4. Gram–Schmidt orthogonalization in R^3. [The panels show y_2 constructed from v_2 and y_1, y_3 constructed from v_3 and the plane V spanned by v_1 and v_2, and the resulting orthonormal vectors z_1, z_2, z_3.]
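The calculation in Example 1.2.4 is easy to reproduce numerically. The sketch below defines a small helper, gram_schmidt(), written only for illustration (it is not part of this chapter's R code section); pracma's gramSchmidt() offers a similar factorization.

## Gram-Schmidt orthogonalization (Result 1.2.4), applied to Example 1.2.4
gram_schmidt <- function(V) {
  # V: matrix whose columns are the basis vectors v1, ..., vm
  Y <- V
  for (k in seq_len(ncol(V))[-1]) {
    for (i in 1:(k - 1)) {
      Y[, k] <- Y[, k] - as.numeric(crossprod(Y[, i], V[, k]) / crossprod(Y[, i])) * Y[, i]
    }
  }
  Z <- sweep(Y, 2, sqrt(colSums(Y^2)), "/")   # normalize each column
  list(orthogonal = Y, orthonormal = Z)
}
V <- cbind(c(1, -1, 1), c(-2, 3, -1), c(1, 2, -4))
gs <- gram_schmidt(V)
gs$orthogonal                          # columns y1, y2, y3; y3 = (8/3, 4/3, -4/3)'
round(crossprod(gs$orthonormal), 10)   # identity matrix: the z's are orthonormal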

Definition 1.2.11. Orthogonal complement of a subspace. Let W be a vector subspace of a Euclidean space Rn . Its orthogonal complement, written as W ⊥ , is the subspace consisting of all vectors in Rn that are orthogonal to every vector in W. It is easy to verify that W ⊥ is indeed a vector subspace. If V is any subspace containing W, then W ⊥ ∩ V, sometimes referred to as the orthogonal complement of W relative to V, consists of all vectors in V that are orthogonal to every vector in W. The proof of the next result is left as Exercise 1.11. Result 1.2.5.

If W and V are two subspaces of Rn and W ⊂ V, then

1. V = W ⊕ (W ⊥ ∩ V) and dim(V) = dim(W) + dim(W ⊥ ∩ V); and 2. (W ⊥ ∩ V)⊥ ∩ V = W.

1.3 Basic properties of matrices

We describe some elementary properties of matrices and provide illustrations. More detailed properties of special matrices that are relevant to linear model theory are given in Chapter 2.

Definition 1.3.1. Matrix addition and subtraction. For arbitrary m × n matrices A and B, each of the same dimension, C = A ± B is an m × n matrix whose (i, j)th element is c_{ij} = a_{ij} ± b_{ij}. For example,
$$\begin{pmatrix} -5 & 4 & 1 \\ -3 & 2 & 6 \end{pmatrix} + \begin{pmatrix} 7 & -9 & 10 \\ 2 & 6 & -1 \end{pmatrix} = \begin{pmatrix} 2 & -5 & 11 \\ -1 & 8 & 5 \end{pmatrix}.$$

Definition 1.3.2. Multiplication of a matrix by a scalar. For an arbitrary m × n matrix A, and an arbitrary real scalar c, B = cA = Ac is an m × n matrix whose (i, j)th element is b_{ij} = c a_{ij}. For example,
$$5\begin{pmatrix} -5 & 4 & 1 \\ -3 & 2 & 6 \end{pmatrix} = \begin{pmatrix} -25 & 20 & 5 \\ -15 & 10 & 30 \end{pmatrix}.$$
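Both operations are elementwise in R; the short snippet below reproduces the two examples just shown (base R only).

## Definitions 1.3.1-1.3.2: matrix addition and scalar multiplication
A <- matrix(c(-5, -3, 4, 2, 1, 6), nrow = 2)   # 2 x 3, filled by column
B <- matrix(c(7, 2, -9, 6, 10, -1), nrow = 2)
A + B      # rows (2, -5, 11) and (-1, 8, 5), as in the example
A - B
5 * A      # rows (-25, 20, 5) and (-15, 10, 30)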


When c = −1, we denote (−1)A as −A, the negative of the matrix A.

Result 1.3.1. Laws of addition and scalar multiplication. Let A, B, C be any m × n matrices and let a, b, c be any scalars. The following results hold:

1. (A + B) + C = A + (B + C)
2. A + B = B + A
3. A + (−A) = (−A) + A = O
4. A + O = O + A = A
5. c(A + B) = cA + cB
6. (a + b)C = aC + bC
7. (ab)C = a(bC) = b(aC)
8. 0A = O
9. 1A = A.

Definition 1.3.3. Matrix multiplication. For arbitrary matrices A and B of respective dimensions m × n and n × p, C = AB is an m × p matrix whose (i, j)th element is $c_{ij} = \sum_{l=1}^{n} a_{il} b_{lj}$. The product AB is undefined when the column dimension of A is not equal to the row dimension of B. For example,
$$\begin{pmatrix} 5 & 4 & 1 \\ -3 & 2 & 6 \end{pmatrix}\begin{pmatrix} 7 \\ -3 \\ 2 \end{pmatrix} = \begin{pmatrix} 25 \\ -15 \end{pmatrix}.$$
In referring to the matrix product AB, we say that B is pre-multiplied by A, and A is post-multiplied by B. Provided all the matrices are conformal under multiplication, the following properties hold:

Result 1.3.2. Laws of matrix multiplication. Let a be a scalar, let A be an m × n matrix and let the matrices B and C have appropriate dimensions so that the operations below are defined. Then,

1. (AB)C = A(BC)
2. A(B + C) = AB + AC
3. (A + B)C = AC + BC
4. a(BC) = (aB)C = B(aC)
5. I_m A = A I_n = A
6. OA = O and AO = O.

In general, matrix multiplication is not commutative, i.e., AB is not necessarily equal to BA. Note that depending on the row and column dimensions of A and B, it is possible that (i) only AB is defined and BA is not, or (ii) both AB and BA are defined, but do not have the same dimensions, or (iii) AB and BA are defined and have the same dimensions, but AB ≠ BA. Two n × n matrices A and B are said to commute under multiplication if AB = BA. A collection of n × n matrices A_1, · · · , A_k is said to be pairwise commutative if A_iA_j = A_jA_i for j > i, i, j = 1, · · · , k. Note that the product A^k = A · · · A (k times) is defined only if A is a square matrix. It is easy to verify that J_{mn}J_{np} = nJ_{mp}.

Result 1.3.3. If A and B are n × n lower (resp. upper) triangular matrices with diagonal elements a_1, · · · , a_n and b_1, · · · , b_n, respectively, then AB is also a lower (resp. upper) triangular matrix with diagonal elements a_1b_1, · · · , a_nb_n.

The proof of Result 1.3.3 is left as Exercise 1.18.

Definition 1.3.4. Matrix transpose. The transpose of an m × n matrix A is an n × m matrix whose columns are the rows of A in the same order. The transpose of A is denoted by A'. For example,
$$A = \begin{pmatrix} 2 & 1 & 6 \\ 4 & 3 & 5 \end{pmatrix} \implies A' = \begin{pmatrix} 2 & 4 \\ 1 & 3 \\ 6 & 5 \end{pmatrix}, \qquad B = \begin{pmatrix} 6 & 7 \\ 8 & 9 \end{pmatrix} \implies B' = \begin{pmatrix} 6 & 8 \\ 7 & 9 \end{pmatrix}.$$

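In R, %*% is the matrix product and t() the transpose; the brief sketch below illustrates Definition 1.3.3 and Definition 1.3.4, including the fact that a product need not be defined in both orders (base R, using the matrices from the examples above).

## Definition 1.3.3 (product) and Definition 1.3.4 (transpose)
A <- matrix(c(5, -3, 4, 2, 1, 6), nrow = 2)   # the 2 x 3 matrix from the example
b <- c(7, -3, 2)
A %*% b                       # (25, -15)'
A %*% t(A)                    # conformable: 2 x 2
# t(A) %*% t(A) would fail: a 3 x 2 matrix cannot premultiply a 3 x 2 matrix
t(A %*% t(A)) - A %*% t(A)    # zero matrix, since AA' is symmetric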

As we saw earlier, the transpose of an n-dimensional column vector with components a_1, · · · , a_n is the row vector (a_1, · · · , a_n). It is often convenient to write a column vector in this transposed form. The transpose of an upper (lower) triangular matrix is a lower (upper) triangular matrix. It may be easily verified that J_{mn} = 1_m 1_n'.

Result 1.3.4. Laws of transposition. Let A and B conform under addition, and let A and C conform under multiplication. Let a, b, and c denote scalars and let k ≥ 2 denote a positive integer. Then,

1. (A')' = A
2. (aA + bB)' = aA' + bB'
3. (cA)' = cA'
4. A' = B' if and only if A = B
5. (AC)' = C'A'
6. (A_1 · · · A_k)' = A_k' · · · A_1'.

Definition 1.3.5. Symmetric matrix. A matrix A is said to be symmetric if A' = A. For example,
$$A = \begin{pmatrix} 1 & 2 & -3 \\ 2 & 4 & 5 \\ -3 & 5 & 9 \end{pmatrix}$$

is a symmetric matrix. Note that a symmetric matrix is always a square matrix. Any diagonal matrix, written as D = diag(d_1, · · · , d_n), is symmetric. Other examples of symmetric matrices include the variance-covariance matrix and the correlation matrix of any random vector, the identity matrix I_n and the unit matrix J_n. A matrix A is said to be skew-symmetric if A' = −A.

Definition 1.3.6. Trace of a matrix. Let A be an n × n matrix. The trace of A is a scalar given by the sum of the diagonal elements of A, i.e., $\operatorname{tr}(A) = \sum_{i=1}^{n} a_{ii}$. For example, if
$$A = \begin{pmatrix} 2 & -4 & 5 \\ 6 & -7 & 0 \\ 3 & 9 & 7 \end{pmatrix},$$
then tr(A) = 2 − 7 + 7 = 2.

Result 1.3.5. Properties of trace. Provided the matrices are conformable, and given scalars a and b,

1. tr(I_n) = n
2. tr(aA ± bB) = a tr(A) ± b tr(B)
3. tr(AB) = tr(BA)
4. tr(ABC) = tr(CAB) = tr(BCA)
5. tr(A) = 0 if A = O
6. tr(A') = tr(A)
7. $\operatorname{tr}(AA') = \operatorname{tr}(A'A) = \sum_{i,j=1}^{n} a_{ij}^2$
8. $\operatorname{tr}(aa') = a'a = \|a\|^2 = \sum_{i=1}^{n} a_i^2$.
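In R the trace is simply sum(diag(·)); the sketch below reproduces the worked example and spot-checks properties 3 and 7 of Result 1.3.5 (the random matrix B is only for illustration).

## Trace (Definition 1.3.6) and spot checks of Result 1.3.5
A <- matrix(c(2, 6, 3, -4, -7, 9, 5, 0, 7), nrow = 3)   # the 3 x 3 example matrix
sum(diag(A))                                       # tr(A) = 2
B <- matrix(rnorm(9), nrow = 3)
all.equal(sum(diag(A %*% B)), sum(diag(B %*% A)))  # property 3: tr(AB) = tr(BA)
sum(diag(tcrossprod(A)))                           # property 7: tr(AA') ...
sum(A^2)                                           # ... equals the sum of squared entries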

The trace operation in property 4 is valid under cyclic permutations only.

Definition 1.3.7. Determinant of a matrix. Let A be an n × n matrix. The determinant of A is a scalar given by
$$|A| = \sum_{j=1}^{n} a_{ij}(-1)^{i+j}|M_{ij}|, \ \text{for any fixed } i, \quad \text{or} \quad |A| = \sum_{i=1}^{n} a_{ij}(-1)^{i+j}|M_{ij}|, \ \text{for any fixed } j,$$


where M_{ij} is the (n − 1) × (n − 1) submatrix of A after deleting the ith row and the jth column from A. We call |M_{ij}| the minor corresponding to a_{ij} and the signed minor, F_{ij} = (−1)^{i+j}|M_{ij}|, the cofactor of a_{ij}. An alternative notation of the determinant is det(A), which is used in computing environments like R (R Core Team, 2018). We will use |A| and det(A) interchangeably in the text. We consider two special cases:

1. Suppose n = 2. Then |A| = a_{11}a_{22} − a_{12}a_{21}.

2. Suppose n = 3. Fix i = 1 (row 1). Then
$$F_{11} = (-1)^{1+1}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix}, \quad F_{12} = (-1)^{1+2}\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix}, \quad F_{13} = (-1)^{1+3}\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix},$$
and |A| = a_{11}F_{11} + a_{12}F_{12} + a_{13}F_{13}. For example, if
$$A = \begin{pmatrix} 2 & -4 & 5 \\ 6 & -7 & 0 \\ 3 & 9 & 7 \end{pmatrix},$$
then
$$|A| = 2(-1)^{1+1}\begin{vmatrix} -7 & 0 \\ 9 & 7 \end{vmatrix} - 4(-1)^{1+2}\begin{vmatrix} 6 & 0 \\ 3 & 7 \end{vmatrix} + 5(-1)^{1+3}\begin{vmatrix} 6 & -7 \\ 3 & 9 \end{vmatrix} = 2(-49) + 4(42) + 5(75) = 445.$$

Result 1.3.6. Properties of determinants. Let A and B be n × n matrices and let k be any integer. Then

1. |A| = |A'|.
2. |cA| = c^n |A|.
3. |AB| = |A||B|.
4. If A is a diagonal matrix or an upper (or lower) triangular matrix, the determinant of A is equal to the product of its diagonal elements, i.e., $|A| = \prod_{i=1}^{n} a_{ii}$.
5. If two rows (or columns) of a matrix A are equal, then |A| = 0.
6. If A has a row (or column) of zeroes, then |A| = 0.
7. If A has rows (or columns) that are multiples of each other, then |A| = 0.
8. If a row (or column) of A is the sum of multiples of two other rows (or columns), then |A| = 0.
9. Let B be obtained from A by multiplying one of its rows (or columns) by a nonzero constant c. Then, |B| = c|A|.
10. Let B be obtained from A by interchanging any two rows (or columns). Then, |B| = −|A|.
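R's det() computes the determinant; the sketch below verifies the cofactor expansion above and spot-checks properties 1–3 of Result 1.3.6 numerically (the random matrix B is illustrative only).

## Determinants (Definition 1.3.7) and spot checks of Result 1.3.6
A <- matrix(c(2, 6, 3, -4, -7, 9, 5, 0, 7), nrow = 3)
det(A)                                      # 445, as in the cofactor expansion
B <- matrix(rnorm(9), nrow = 3)
all.equal(det(t(A)), det(A))                # property 1
det(3 * A) / det(A)                         # property 2: 3^3 = 27
all.equal(det(A %*% B), det(A) * det(B))    # property 3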


11. Let B be obtained from A by adding a multiple of one row (or column) to another row (or column). Then, |B| = |A|.

12. If A is an m × n matrix and B is an n × m matrix, then |I_m + AB| = |I_n + BA|.

Properties 1–11 in Result 1.3.6 are standard results. The proof of property 12 is deferred to Example 2.1.4.

Example 1.3.1. Vandermonde matrix. An n × n matrix A is a Vandermonde matrix if there are scalars a_1, · · · , a_n, such that
$$A = \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ a_1 & a_2 & a_3 & \cdots & a_n \\ a_1^2 & a_2^2 & a_3^2 & \cdots & a_n^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_1^{n-1} & a_2^{n-1} & a_3^{n-1} & \cdots & a_n^{n-1} \end{pmatrix}.$$
The determinant of A has a simple form:
$$|A| = \prod_{1 \le i < j \le n} (a_j - a_i).$$

… > 0, and C = [(1 − ρ)I + ρJ], so that C⁻¹ = [(1 − ρ)I + ρJ]⁻¹. In property 6, set A = (1 − ρ)I, a = ρ1_n, and b = 1_n. Then A⁻¹ = (1 − ρ)⁻¹I_n, A⁻¹a = (1 − ρ)⁻¹ρ1_n, b'A⁻¹ = (1 − ρ)⁻¹1_n', and b'A⁻¹a = nρ(1 − ρ)⁻¹, giving
$$C^{-1} = \frac{1}{1-\rho}\,I - \frac{(1-\rho)^{-2}\rho}{1 + (1-\rho)^{-1}n\rho}\,J = \frac{1}{1-\rho}\left[I - \frac{\rho}{1 + (n-1)\rho}\,J\right].$$

Example 1.3.7. Toeplitz matrix. Consider the n × n Toeplitz matrix A which has the form
$$A = \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1 \end{pmatrix}.$$


Note that all the elements on the jth subdiagonal and the jth superdiagonal coincide, for j ≥ 1. It is easy to verify that, for |ρ| < 1, the inverse of A is
$$A^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho & 0 & 0 & \cdots & 0 & 0 & 0 \\ -\rho & 1+\rho^2 & -\rho & 0 & \cdots & 0 & 0 & 0 \\ \vdots & \ddots & \ddots & \ddots & \cdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & -\rho & 1+\rho^2 & -\rho \\ 0 & 0 & 0 & 0 & \cdots & 0 & -\rho & 1 \end{pmatrix},$$
which has a simpler, banded (tridiagonal) form than A.

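A quick numerical check of this inverse, and of the equicorrelation inverse derived above, can be done in base R; the values n = 5 and ρ = 0.4 below are arbitrary illustrative choices, not from the text.

## Check the Toeplitz and equicorrelation inverse formulas numerically
n <- 5; rho <- 0.4
A <- toeplitz(rho^(0:(n - 1)))            # stats::toeplitz builds the matrix above
Ainv <- matrix(0, n, n)
diag(Ainv) <- c(1, rep(1 + rho^2, n - 2), 1)
Ainv[cbind(1:(n - 1), 2:n)] <- -rho       # superdiagonal
Ainv[cbind(2:n, 1:(n - 1))] <- -rho       # subdiagonal
Ainv <- Ainv / (1 - rho^2)
max(abs(Ainv - solve(A)))                 # essentially zero
C <- (1 - rho) * diag(n) + rho * matrix(1, n, n)
Cinv <- (diag(n) - rho / (1 + (n - 1) * rho) * matrix(1, n, n)) / (1 - rho)
max(abs(Cinv - solve(C)))                 # essentially zero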
Definition 1.3.11. Orthogonal matrix. An n × n matrix A is orthogonal if AA' = A'A = I_n. For example,
$$\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
is a 2 × 2 orthogonal matrix. A direct consequence of Definition 1.3.11 is that, for an orthogonal matrix A, A' = A⁻¹. Suppose a_i' denotes the ith row of A; then AA' = I_n implies that a_i'a_i = 1, and a_i'a_j = 0 for i ≠ j, so the rows of A have unit length and are mutually perpendicular (or orthogonal). Since A'A = I_n, the columns of A have this property as well. If A is orthogonal, clearly, |A| = ±1. It is also easy to show that the product of two orthogonal matrices A and B is itself orthogonal. Usually, orthogonal matrices are used to represent a change of basis or rotation.

Example 1.3.8. Helmert matrix. An n × n Helmert matrix H_n is an example of an orthogonal matrix and is defined by
$$H_n = \begin{pmatrix} \tfrac{1}{\sqrt{n}}1_n' \\ H_0 \end{pmatrix},$$
where H_0 is an (n − 1) × n matrix such that for i = 1, . . . , n − 1, its ith row is (1_i', −i, 0, · · · , 0)/√λ_i with λ_i = i(i + 1). For example, when n = 4, we have
$$H_4 = \begin{pmatrix} 1/\sqrt{4} & 1/\sqrt{4} & 1/\sqrt{4} & 1/\sqrt{4} \\ 1/\sqrt{2} & -1/\sqrt{2} & 0 & 0 \\ 1/\sqrt{6} & 1/\sqrt{6} & -2/\sqrt{6} & 0 \\ 1/\sqrt{12} & 1/\sqrt{12} & 1/\sqrt{12} & -3/\sqrt{12} \end{pmatrix}.$$

We next define three important vector spaces associated with any matrix, viz., the null space, the column space and the row space. These concepts are closely related to properties of systems of linear equations, which are discussed in Chapter 3. A system of homogeneous linear equations is denoted by Ax = 0, while Ax = b denotes a system of nonhomogeneous linear equations.

Definition 1.3.12. Null space of a matrix. The null space N(A) of an m × n matrix A consists of all n-dimensional vectors x that are solutions to the homogeneous linear system Ax = 0, i.e., N(A) = {x ∈ R^n: Ax = 0}.


N(A) is a subspace of R^n, and its dimension is called the nullity of A. For example, the vector x = (1, 2)' belongs to the null space of the matrix $A = \begin{pmatrix} 2 & -1 \\ -4 & 2 \end{pmatrix}$, since
$$\begin{pmatrix} 2 & -1 \\ -4 & 2 \end{pmatrix}\begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$
We may use RREF(A) to find a basis of the null space of A. We add or delete zero rows until RREF(A) is square. We then rearrange the rows to place the leading ones on the main diagonal to obtain H̃, which is the Hermite form of RREF(A). The nonzero columns of H̃ − I are a basis for N(A). In general, an n × n matrix H̃ is in Hermite form if (i) each diagonal element of H̃ is either 0 or 1; (ii) if h̃_{ii} = 1, the rest of column i has all zeroes; and (iii) if h̃_{ii} = 0, the ith row of H̃ is a vector of zeroes.

Example 1.3.9. We first find a basis for the null space of the matrix
$$A = \begin{pmatrix} 1 & 0 & -5 & 1 \\ 0 & 1 & 2 & -3 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},$$
which is in RREF, as we can verify. It is easy to see that the general solution to Ax = 0 is the vector
$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = s\begin{pmatrix} 5 \\ -2 \\ 1 \\ 0 \end{pmatrix} + t\begin{pmatrix} -1 \\ 3 \\ 0 \\ 1 \end{pmatrix},$$
so that the vectors (5, −2, 1, 0)' and (−1, 3, 0, 1)' form a basis for N(A). This basis can also be obtained in an alternate way from RREF(A), which in this example coincides with A. Computing
$$I - \mathrm{RREF}(A) = \begin{pmatrix} 0 & 0 & 5 & -1 \\ 0 & 0 & -2 & 3 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},$$
we see that the last two nonzero columns form a basis for N(A).

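Example 1.3.9 can be checked in R with pracma's rref(); the sketch below assumes the pracma package is installed and follows the I − RREF(A) recipe described above.

## Example 1.3.9: a basis for the null space via I - RREF(A)
library(pracma)
A <- rbind(c(1, 0, -5, 1),
           c(0, 1, 2, -3),
           c(0, 0, 0, 0),
           c(0, 0, 0, 0))
R <- rref(A)                   # here A is already in RREF
N <- diag(4) - R
N[, colSums(abs(N)) > 0]       # nonzero columns: (5, -2, 1, 0)' and (-1, 3, 0, 1)'
A %*% N[, 3:4]                 # both columns are annihilated by A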
Definition 1.3.13. Column space and row space of a matrix. Let A be an m × n matrix whose columns are the m-dimensional vectors a_1, a_2, · · · , a_n. The vector space Span{a_1, · · · , a_n} is called the column space (or range space) of A, and is denoted by C(A). That is, the column space of A is the set consisting of all m-dimensional vectors that can be expressed as linear combinations of the n columns of A of the form x_1a_1 + x_2a_2 + · · · + x_na_n, where x_1, · · · , x_n are scalars. The dimension of the column space of A is the number of LIN columns of A, and it is called the column rank of A. For example, given $A = \begin{pmatrix} 1 & -2 \\ 2 & -4 \end{pmatrix}$, the vector x_1 = (−2, 2)' is not in C(A), whereas the vector x_2 = (3, 6)' is, because
$$\begin{pmatrix} 1 & -2 & -2 \\ 2 & -4 & 2 \end{pmatrix} \sim \begin{pmatrix} 1 & -2 & -2 \\ 0 & 0 & 6 \end{pmatrix}, \quad \text{while} \quad \begin{pmatrix} 1 & -2 & 3 \\ 2 & -4 & 6 \end{pmatrix} \sim \begin{pmatrix} 1 & -2 & 3 \\ 0 & 0 & 0 \end{pmatrix}.$$


Likewise, if the rows of A are b_1', · · · , b_m', i.e., A = (b_1, . . . , b_m)', then the vector space Span{b_1, · · · , b_m} is called the row space of A, and is denoted by R(A). The row space of A is the set consisting of all n-dimensional vectors that can be expressed as linear combinations of the m rows of A of the form x_1b_1' + x_2b_2' + · · · + x_mb_m', where x_1, · · · , x_m are scalars. The dimension of the row space is called the row rank of A. Concisely, C(A) = {Ax: x ∈ R^n}, and R(A) = C(A') = {A'x: x ∈ R^m}. The column space C(A) and the row space R(A) of any m × n matrix A are subspaces of R^m and R^n, respectively. The symbol C⊥(A) or {C(A)}⊥ represents the orthogonal complement of C(A) in R^m. Likewise, R⊥(A) or {R(A)}⊥ represents the orthogonal complement of R(A) in R^n.

To find a basis of the column space of A, we first find RREF(A). We select the columns of A which correspond to the columns of RREF(A) with leading ones. These are called the leading columns of A and form a basis for C(A). The nonzero rows of RREF(A) are a basis for R(A).

Example 1.3.10. We find a basis for C(A), where the matrix A and B = RREF(A) are shown below:
$$A = \begin{pmatrix} 1 & -2 & 2 & 1 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ 2 & -4 & 6 & 4 & 0 \\ 3 & -6 & 8 & 5 & 1 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 1 & -2 & 0 & -1 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$$

We see that columns 1, 3 and 5 of B have leading ones. Then columns 1, 3 and 5 of A form a basis for C(A).

Result 1.3.10. Let C(A) and N(A) respectively denote the column space and null space of an m × n matrix A. Then,

1. dim[C(A)] = n − dim[N(A)].
2. N(A) = {C(A')}⊥ = {R(A)}⊥.
3. dim[C(A)] = dim[R(A)].
4. C(A'A) = C(A'), and R(A'A) = R(A).
5. C(A) ⊆ C(B) if and only if A = BC for some matrix C. Also, R(A) ⊆ R(B) if and only if A = CB for some matrix C.
6. C(ACB) = C(AC) if r(CB) = r(C).

Proof. 1. Suppose dim[C(A)] = r and without loss of generality, let the first r columns of A, i.e., a_1, · · · , a_r, be LIN. Then for each j = r + 1, . . . , n, we can write $a_j = \sum_{i=1}^{r} c_{ij} a_i$, so for any x = (x_1, · · · , x_n)',
$$Ax = \sum_{i=1}^{n} x_i a_i = \sum_{i=1}^{r} x_i a_i + \sum_{j=r+1}^{n} x_j \sum_{i=1}^{r} c_{ij} a_i = \sum_{i=1}^{r}\Big(x_i + \sum_{j=r+1}^{n} c_{ij} x_j\Big) a_i.$$
Then Ax = 0 ⟺ $x_i = -\sum_{j=r+1}^{n} c_{ij} x_j$, i = 1, . . . , r. In other words, any solution to Ax = 0 is completely determined by x_{r+1}, . . . , x_n, which can take any real values. As a result, dim[N(A)] = n − r.


2. x ∈ N (A) if and only if a0 x = 0, i.e., x ⊥ a, for every row a0 of A. The latter is equivalent to x ⊥ R(A). Then N (A) = {R(A)}⊥ . 3. By property 2, dim[N (A)] = dim[{R(A)}⊥ ] = n−dim[R(A)]. By comparing to property 1, this gives dim[C(A)] = dim[R(A)]. 4. By property 2 and (A0 A)0 = A0 A, to show C(A0 A) = C(A0 ), it is enough to show that N (A0 A) = N (A), or A0 Ax = 0 ⇐⇒ Ax = 0. The ⇐= part is clear. On the other hand, A0 Ax = 0 =⇒ x0 A0 Ax = 0 =⇒ kAxk2 = 0 =⇒ Ax = 0. The identity for the row spaces can be similarly proved. 5. Let the columns of A be a1 , · · · , an . Suppose A = BC for some k×n matrix C with entries cij . Let c1 , · · · , cn be the columns of C, and let aj = Bcj ∈ C(B) for each j = 1, · · · , n. Thus C(A) ⊆ C(B). Conversely, if C(A) ⊆ C(B), then every column vector of A is a linear combination of the column vectors of B, say b1 , · · · , bk , in other words, for each j = 1, · · · , n, aj = b1 c1j + · · · + bk ckj for some cij . Then A = BC. The result on the row spaces can be proved similarly. 6. From property 5, C(ACB) ⊆ C(AC) and C(CB) ⊆ C(C). If r(CB) = r(C), then the dimensions of the two column spaces are equal, so with the first one being a subspace of the second one, they must be equal. Then by property 5 again, C = (CB)D for some matrix D. Then C(AC) = C(ACBD) ⊆ C(ACB). Thus C(ACB) = C(AC).  Definition 1.3.14. Rank of a matrix. Let A be an m × n matrix. From property 3 of Result 1.3.10, dim[C(A)] = dim[R(A)]. We call dim[C(A)] the rank of A, denoted by r(A). We say that A has full row rank if r(A) = m, which is possible only if m ≤ n, and has full column rank if r(A) = n, which is possible only if n ≤ m. A nonsingular matrix has full row rank and full column rank. To find the rank of A, we find RREF(A). We count the number of leading ones, which is then equal to r(A). Example 1.3.11.

Consider the  1 2 2 1 3 1  1 1 3 A=  0 1 −1 1 2 2

matrices   1 2 −1 0 1 −2    0  and B = 0 0 0 0 −1 0 0 −1

2 −1 0 0 0

 −1 −1  0 , 0 0

where B = RREF(A) has two nonzero rows. Hence, r(A) = 2.



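As an illustrative check (not part of the text), the rank and the leading columns can be computed directly in R; the code below reuses the matrix of Example 1.3.10 and assumes pracma is available for rref().

## Rank and leading columns (Definition 1.3.14), using the matrix of Example 1.3.10
library(pracma)
A <- rbind(c(1, -2, 2, 1, 0),
           c(-1, 2, -1, 0, 0),
           c(2, -4, 6, 4, 0),
           c(3, -6, 8, 5, 1))
rref(A)             # leading ones in columns 1, 3 and 5
qr(A)$rank          # r(A) = 3, the number of leading ones
A[, c(1, 3, 5)]     # these columns form a basis for C(A)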
Result 1.3.11. Properties of rank. 1. An m × n matrix A has rank r if the largest nonsingular square submatrix of A has size r. 2. For an m × n matrix A, r(A) ≤ min(m, n). 3. r(A + B) ≤ r(A) + r(B). 4. r(AB) ≤ min{r(A), r(B)}, where A and B are conformal under multiplication. 5. For nonsingular matrices A, B, and an arbitrary matrix C, r(C) = r(AC) = r(CB) = r(ACB).


6. r(A) = r(A0 ) = r(A0 A) = r(AA0 ). 7. For any n × n matrix A, |A| = 0 if and only if r(A) < n. 8. r(A, b) ≥ r(A), i.e., inclusion of a column vector cannot decrease the rank of a matrix. Result 1.3.12. Let A and B be m × n matrices. Let C be a p × m matrix with r(C) = m, and let D be an n × p matrix with r(D) = n. 1. If CA = CB, then A = B. 2. If AD = BD, then A = B. 3. If CAD = CBD, then A = B. Proof. We prove only property 1 here; the proofs of the other two properties are similar. Let the column vectors of C be c1 , · · · , cm . Since r(C) = m, the column vectors are LIN. Let A = {aij } and B = {bij }. The jth column vector of CA is c1 a1j + · · · + cm amj and that of CB is c1 b1j + · · · + cm bmj . Since these two column vectors are equal and ci are LIN, aij = bij . Then A = B.  Result 1.3.13.

Let A be an m × n matrix.

1. For n × p matrices B and C, AB = AC if and only if A'AB = A'AC.

2. For p × n matrices E and F, EA' = FA' if and only if EA'A = FA'A.

Proof. To prove property 1, the "only if" part is obvious. On the other hand, if A'AB = A'AC, then O = (B − C)'(A'AB − A'AC) = (AB − AC)'(AB − AC), which implies that AB − AC = O (see Exercise 1.16). The proof of property 2 follows directly by transposing relevant matrices in property 1.

Definition 1.3.15. Equivalent matrices. Two matrices that have the same dimension and the same rank are said to be equivalent matrices.

Result 1.3.14. Equivalent canonical form of a matrix. An m × n matrix A with r(A) = r is equivalent to
$$PAQ = \begin{pmatrix} I_r & O \\ O & O \end{pmatrix},$$
where P and Q are respectively m × m and n × n matrices, and are obtained as products of elementary matrices, i.e., matrices obtained from the identity matrix using elementary transformations. The matrices P and Q always exist, but need not be unique. Elementary transformations include

1. interchange of two rows (columns) of I, or
2. multiplication of elements of a row (column) of I by a nonzero scalar c, or
3. adding to row j (column j) of I, c times row i (column i).

Definition 1.3.16. Eigenvalue, eigenvector, and eigenspace of a matrix. A real or complex number λ is an eigenvalue (or characteristic root) of an n × n matrix A if A − λI_n is singular, i.e., |A − λI_n| = 0. The space N(A − λI_n), containing vectors with possibly complex-valued components, is


called the eigenspace corresponding to λ. Any non-null vector in the eigenspace is called an eigenvector of A corresponding to λ and satisfies (A − λIn )v = 0. The dimension of the eigenspace, g = dim[N (A − λIn )], is called the geometric multiplicity of λ. The eigenvalues of A are solutions to the characteristic polynomial equation P (λ) = |A − λI| = 0, which is a polynomial in λ of degree n. Note that the n eigenvalues of A are not necessarily all distinct or real-valued. Since |A − λj I| = 0, A − λj I is a singular matrix, for j = 1, · · · , n, and there exists a nonzero n-dimensional vector vj which satisfies (A − λj I)vj = 0, i.e., Avj = λj vj . The eigenvectors of A are thus obtained by substituting each λj into Avj = λj vj , j = 1, · · · , n, and solving the resulting n equations. We say that an eigenvector vj is normalized if its length is 1. If λj is complex-valued, then vj may have complex elements. If some of the eigenvalues of the real matrix A are complex, then they must clearly √ be conjugate complex (a conjugate complex pair is defined as a + ιb and a − ιb, where ι = −1). If an eigenvalue λ is real, there is a corresponding real eigenvector v. Also, if we multiply v by any complex scalar c = a+ιb, then cv satisfies Acv = λcv, so that cv is an eigenvector of A corresponding to λ. Likewise, if λ is complex and v is a corresponding eigenvector while u is an eigenvector corresponding to the complex conjugate of λ, then v and u need not be conjugate, although there is a complex eigenvector w, say, corresponding to λ, such that v = c1 w and u = c2 w∗ for some scalars c1 and c2 , and where w∗ denotes the complex conjugate of w. Suppose vj1 and vj2 are nonzero eigenvectors of A corresponding to λj , it is easy to see that α1 vj1 + α2 vj2 is also an eigenvector corresponding to λj , where α1 and α2 are real numbers. That is, we must have A(α1 vj1 + α2 vj2 ) = λj (α1 vj1 + α2 vj2 ). The eigenvectors corresponding to any eigenvalue λj span a vector space, called the eigenspace of A for λj . To find the eigenvectors of A corresponding to an eigenvalue λ, we find the basis of the e − I are a basis for N (A − λI), where H e null space of A − λI. The nonzero columns of H denotes the Hermite form of RREF(A − λI) (see Example 1.3.14). Result 1.3.15. Let A and B be n × n matrices such that A = M−1 BM, where M is nonsingular. Then the characteristic polynomials of A and B coincide. Proof. |A − λI| = |M−1 BM − λI| = |M−1 ||BM − λM| = |BMM−1 − λMM−1 | = |B − λI|.  The eigenspace N (A − λI) corresponding to an eigenvalue λ has the property that for any x in the eigenspace, Ax = λx is still in the space. This property is formalized below. Definition 1.3.17. Let A be an n × n matrix. A subspace V of Rn is called an invariant subspace with respect to A if {Ax: x ∈ V} ⊂ V. That is, V gets mapped to itself under A. Result 1.3.16. If a non-null space V is an invariant subspace with respect to A, then it contains at least one eigenvector of A. Proof. Let {v1 , · · · , vk } be a basis of V. Put V = (v1 , · · · , vk ) so that V = C(V). Since Avi ∈ V for every i, then C(AV) ⊂ C(V), so by property 5 of Result 1.3.10, AV = VB for some k × k matrix B. Now B has at least one eigenvalue, say λ, and a corresponding eigenvector z. Let x = Vz. Then x ∈ C(V) = V. Since V has full column rank and z 6= 0, we see that x 6= 0. Then, Ax = AVz = VBz = Vλz = λx, so x is an eigenvector in V. 


Definition 1.3.18. Spectrum of a matrix. The spectrum of an n × n matrix A is the set of its distinct (real or complex) eigenvalues {λ1 , λ2 , · · · , λk }, so that the characteristic polynomial of A has factorization P (λ) = (−1)n (λ − λ1 )a1 · · · (λ − λk )ak , where aj are positive integers with a1 + · · · + ak = n. The algebraic multiplicity of λj is defined to be aj . For each eigenvalue, its geometric multiplicity is at most as large as its algebraic multiplicity, i.e., gj ≤ aj , the proof of which is left as Exercise 1.34. However, if A is symmetric, then gj = aj (see Exercise 2.13). Result 1.3.17. Suppose the spectrum of an n × n matrix A is {λ1 , · · · , λk }. Then eigenvectors corresponding to different λj ’s are LIN. Proof. Suppose xj ∈ N (A − λj In ), j = 1, . . . , k, and x1 + · · · + xk = 0. We must show that all xj = 0. For any scalar c, (A − cIn )xj = (λj − c)xj . Then for any scalars c1 , · · · , cs , (A − c1 In ) · · · (A − cs In )xj = (λj − c1 ) · · · (λj − cs )xj . P For each i = 1, . . . , k,Qxi = − j6=i xj . From the display, pre-multiplying both sides by Q  l6=i (A − λl In ) gives j6=i (λj − λi )xi = 0, and hence xi = 0.  1 . Then, P (λ) = (λ − 1)2 , with solutions λ1 = 1 1   0 1 (repeated twice), so that a1 = 2. Since A − λ1 I2 = , with rank 1, the geometric 0 0 multiplicity of λ1 is g1 = 2 − 1 = 1 < a1 .    0 −1 Example 1.3.13. Let A = . This matrix has no real eigenvalues, since, P (λ) = 1 0 √ λ2 + 1, with solutions λ1 = ι and λ2 = −ι, where ι = −1. The corresponding eigenvectors are complex, and are (ι, 1)0 and (−ι, 1)0 .  Example 1.3.12.

Example 1.3.14.

Let A =

 1 0

Let  −1 A= 1 0

 2 0 2 1 . 2 −1

Then |A − λI| = −(λ + 1)(λ − 3)(λ + 2) = 0, yielding solutions λ1 = −1, λ2 = 3, and λ3 = −2, which are the distinct eigenvalues of A. To obtain the eigenvectors corresponding to λi , we must solve the homogeneous linear system (A − λi I)vi = 0, or in other words, identify the null space of the matrix (A − λi I), by completely reducing the augmented matrix (A − λi I : 0). Corresponding to λ1 = −1, we see that     0 2 0 1 0 1 A − (−1)I = 1 3 1 , RREF(A + I) = 0 1 0 , 0 2 0 0 0 0 e The nonzero column of which is in Hermite form, i.e., RREF(A + I) = H.   0 0 1 e − I = 0 0 0 , H 0 0 −1 i.e., (1, 0, −1)0 is a basis of N (A − (−1)I) and therefore of the eigenspace corresponding to


λ_1 = −1. Using a similar approach, we find that a basis of the eigenspace corresponding to λ_2 = 3 is (−1, −2, −1)' and corresponding to λ_3 = −2 is (−1, 1/2, −1)'. These three vectors are the eigenvectors of the matrix A.

Result 1.3.18. Let A be an n × n matrix with n eigenvalues λ_1, · · · , λ_n, counting multiplicities (i.e., the eigenvalues are not necessarily distinct), and let c be a real scalar. Then, counting multiplicities, the following properties hold for the eigenvalues of transformations of A.

1. λ_1^k, · · · , λ_n^k are the eigenvalues of A^k, for any positive integer k.
2. cλ_1, · · · , cλ_n are the eigenvalues of cA.
3. λ_1 + c, · · · , λ_n + c are the eigenvalues of A + cI, while the eigenvectors of A + cI coincide with the eigenvectors of A.
4. If A is invertible, then 1/λ_1, · · · , 1/λ_n are the eigenvalues of A⁻¹, while the eigenvectors of A⁻¹ coincide with the eigenvectors of A.
5. f(λ_1), · · · , f(λ_n) are the eigenvalues of f(A), where f(.) is any polynomial.

The proof of Result 1.3.18 is quite simple by using the results in Chapter 2 and is left as Exercise 2.9. The result offers more than saying that if λ is an eigenvalue of A, then f(λ) is an eigenvalue of f(A) (see Exercise 1.36), as the latter does not consider the multiplicity of the eigenvalue.

Result 1.3.19. Sum and product of eigenvalues. Let A be an n × n matrix with eigenvalues λ_1, · · · , λ_n, counting multiplicities. Then,

1. $\operatorname{tr}(A) = \sum_{i=1}^{n} \lambda_i$.
2. $|A| = \prod_{i=1}^{n} \lambda_i$.
3. $|I_n \pm A| = \prod_{i=1}^{n} (1 \pm \lambda_i)$.
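Result 1.3.19 is easy to check numerically with eigen(); the sketch below uses the matrix of Example 1.3.14 (base R only).

## Eigenvalues of the matrix in Example 1.3.14, and a check of Result 1.3.19
A <- matrix(c(-1, 1, 0, 2, 2, 2, 0, 1, -1), nrow = 3)  # columns (-1,1,0), (2,2,2), (0,1,-1)
ev <- eigen(A)
ev$values                      # -1, 3, -2 in some order
sum(diag(A)); sum(ev$values)   # tr(A) equals the sum of the eigenvalues (both 0)
det(A); prod(ev$values)        # |A| equals the product of the eigenvalues (both 6)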

The proof for the above result is left as Exercise 1.37.

Let a_{ij}^{(k)} represent the (i, j)th element of A_k ∈ R^{m×n}, k = 1, 2, · · · . For every i = 1, · · · , m and j = 1, · · · , n, suppose there exists a scalar a_{ij} which is the limit of the sequence of numbers a_{ij}^{(1)}, a_{ij}^{(2)}, · · · , and suppose A = {a_{ij}}. We say that the m × n matrix A is the limit of the sequence of matrices A_k, k = 1, 2, · · · , or that the sequence A_k, k = 1, 2, · · · converges to the matrix A, which we denote by lim_{k→∞} A_k = A. If this limit exists, the sequence of matrices converges, otherwise it diverges. The following infinite series representation of the inverse of the matrix I − A is used in Chapter 5.

Result 1.3.20. For a square matrix A, the infinite series $\sum_{k=0}^{\infty} A^k$, with A^0 defined to be I, converges if and only if lim_{k→∞} A^k = O, in which case I − A is nonsingular and
$$(I - A)^{-1} = \sum_{k=0}^{\infty} A^k.$$

As an application of the result, we derive the Sherman–Morrison–Woodbury formula in property 5 of Result 1.3.8. Put M = BCDA⁻¹. Then (A + BCD)⁻¹ = (A + MA)⁻¹ = A⁻¹(I + M)⁻¹. Provided M^k → O as k → ∞,
$$(A + BCD)^{-1} = A^{-1} + A^{-1}\sum_{k=1}^{\infty}(-1)^k M^k.$$


Now, A⁻¹M = A⁻¹BCDA⁻¹ = (A⁻¹BC)(DA⁻¹), A⁻¹M² = A⁻¹BCDA⁻¹BCDA⁻¹ = (A⁻¹BC)(DA⁻¹BC)(DA⁻¹), A⁻¹M³ = A⁻¹BCDA⁻¹BCDA⁻¹BCDA⁻¹ = (A⁻¹BC)(DA⁻¹BC)²(DA⁻¹), and so on. Then
$$A^{-1}\sum_{k=1}^{\infty}(-1)^k M^k = (A^{-1}BC)\sum_{k=1}^{\infty}(-1)^k (DA^{-1}BC)^{k-1}(DA^{-1}) = -(A^{-1}BC)(I + DA^{-1}BC)^{-1}(DA^{-1}),$$
provided (DA⁻¹BC)^k → O. Together, these formulas yield the Sherman–Morrison–Woodbury formula. A direct verification shows that it holds without assuming M^k → O or (DA⁻¹BC)^k → O.

Definition 1.3.19. Exponential matrix. For any n × n matrix A, we define the matrix e^A to be the n × n matrix given by:
$$e^A = \sum_{i=0}^{\infty} \frac{A^i}{i!},$$

when the expression on the right is a convergent series, i.e., all the n × n series $\sum_{i=0}^{\infty} a_{jk}^{(i)}$, j = 1, · · · , n, k = 1, · · · , n, are convergent, a_{jk}^{(i)} being the (j, k)th element of A^i and A^0 = I.

We conclude this section with some definitions of vector and matrix norms.

Definition 1.3.20. Vector norm. A vector norm on R^n is a function f: R^n → R, denoted by ‖v‖, such that for every vector v ∈ R^n, and every c ∈ R, we have

1. f(v) ≥ 0, with equality if and only if v = 0,
2. f(cv) = |c| f(v), and
3. f(u + v) ≤ f(u) + f(v) for every u ∈ R^n.

Specifically, the L_p-norm of a vector v = (v_1, · · · , v_n)' ∈ R^n, which is also known in the literature as the Minkowski metric, is defined by
$$\|v\|_p = \{|v_1|^p + \cdots + |v_n|^p\}^{1/p}, \quad p \ge 1.$$
We mention two special cases. The L_1-norm is defined by
$$\|v\|_1 = \sum_{i=1}^{n} |v_i|$$
and forms the basis for the definition of LAD regression (see Chapter 9). The L_2-norm, which is also known as the Euclidean norm or the spectral norm, is defined by
$$\|v\|_2 = \{v_1^2 + \cdots + v_n^2\}^{1/2} = (v'v)^{1/2},$$
and is the basis for least squares techniques (see the discussion below Definition 1.2.1). In this book, we will denote ‖v‖_2 simply as ‖v‖. An extension is to define the L_2-norm with respect to a nonsingular matrix A by ‖v‖_A = (v'Av)^{1/2}.

Definition 1.3.21. Matrix norm. A function f: R^{m×n} → R is called a matrix norm on R^{m×n}, denoted by ‖A‖, if

1. f(A) ≥ 0 for all m × n real matrices A, with equality if and only if A = O,
2. f(cA) = |c| f(A) for all c ∈ R, and for all m × n matrices A, and
3. f(A + B) ≤ f(A) + f(B) for all m × n matrices A and B.

The Frobenius norm of an m × n matrix A = {a_{ij}} with respect to the usual inner product is defined by
$$\|A\| = [\operatorname{tr}(A'A)]^{1/2} = \left(\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2\right)^{1/2}.$$
It is easy to verify that ‖A‖ ≥ 0, with equality holding only if A = O. Also, ‖cA‖ = |c|‖A‖.
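These norms are directly available in base R; the short sketch below computes the L1, L2, and Frobenius norms for an illustrative vector and matrix.

## Vector and matrix norms (Definitions 1.3.20-1.3.21)
v <- c(3, -4, 12)
sum(abs(v))                      # L1-norm = 19
sqrt(sum(v^2))                   # L2 (Euclidean) norm = 13
A <- matrix(c(2, 6, 3, -4, -7, 9, 5, 0, 7), nrow = 3)
norm(A, type = "F")              # Frobenius norm
sqrt(sum(diag(crossprod(A))))    # the same, via [tr(A'A)]^{1/2}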

1.4 R Code

There are R packages such as pracma and Matrix that enable computations on vectors and matrices. We show simple code that can be used for the notions defined in this chapter. Results from running the code are straightforward and we do not show them here.

library(pracma)
library(Matrix)

## Def. 1.3.1. Matrix addition and subtraction
# A and B must have the same order
(A

… 0 small enough,
$$M_{x'Ax + a'x}(t) = \int_{\mathbb{R}^k} \frac{1}{(2\pi)^{k/2}} \exp\left\{t(z'Az + a'z) - \frac{1}{2}z'z\right\} dz = |I_k - 2tA|^{-1/2} \exp\left\{\frac{1}{2}t^2 a'(I - 2tA)^{-1}a\right\}. \tag{5.4.6}$$
The cumulant generating function of x'Ax + a'x is $K(t) = \sum_{r=1}^{\infty} \kappa_r t^r/r! = \log[M_{x'Ax + a'x}(t)]$, so that
$$\sum_{r=1}^{\infty} \frac{\kappa_r t^r}{r!} = -\frac{1}{2}\log|I - 2tA| + \frac{1}{2}t^2 a'(I - 2tA)^{-1}a. \tag{5.4.7}$$
Let δ_j, j = 1, · · · , k, denote the eigenvalues of A. Then 1 − 2tδ_j are the eigenvalues of I − 2tA. For |t| < min(1/|2δ_1|, · · · , 1/|2δ_k|),
$$-\frac{1}{2}\log|I - 2tA| = -\frac{1}{2}\sum_{j=1}^{k}\log\{1 - 2t\delta_j\} = -\frac{1}{2}\sum_{j=1}^{k}\sum_{r=1}^{\infty}\frac{-(2t\delta_j)^r}{r} = \sum_{r=1}^{\infty}\frac{2^{r-1}t^r}{r}\sum_{j=1}^{k}\delta_j^r = \sum_{r=1}^{\infty}\frac{2^{r-1}t^r}{r}\operatorname{tr}(A^r). \tag{5.4.8}$$


Also, for |t| small enough, (2t)^k A^k → O as k → ∞. By Result 1.3.20, (I − 2tA) is nonsingular and
$$(I - 2tA)^{-1} = \sum_{r=0}^{\infty} 2^r t^r A^r. \tag{5.4.9}$$

Substituting from (5.4.8) and (5.4.9) into the right side of (5.4.7), and equating coefficients of like powers of t^r on both sides, gives the required result for the cumulants for the standard normal case.

For the general case, from Section 5.1, let x = Γz + µ, where z ∼ N_r(0, I) and Γ is a matrix with ΓΓ' = Σ. Then
$$x'Ax + a'x = (\Gamma z + \mu)'A(\Gamma z + \mu) + a'(\Gamma z + \mu) = z'(\Gamma'A\Gamma)z + (2\mu'A\Gamma + a'\Gamma)z + \mu'A\mu + a'\mu = z'\tilde{A}z + \tilde{a}'z + a_0,$$
where Ã = Γ'AΓ, ã = 2Γ'Aµ + Γ'a, and a_0 = µ'Aµ + a'µ. Then the cumulant generating function of x'Ax + a'x is $\log[M_{z'\tilde{A}z + \tilde{a}'z}(t)] + a_0 t$, so that from the result for the standard normal case,
$$\kappa_r = \begin{cases} \operatorname{tr}(\tilde{A}) + a_0 & \text{if } r = 1, \\ 2^{r-1}(r-1)!\left[\operatorname{tr}(\tilde{A}^r) + r\,\tilde{a}'\tilde{A}^{r-2}\tilde{a}/4\right] & \text{if } r \ge 2. \end{cases}$$
Since for r ≥ 0, tr(Ã^r) = tr{(Γ'AΓ)^r} = tr{Γ'(AΣ)^{r−1}AΓ} = tr{(AΣ)^r}, and ã'Ã^r ã/4 = (2Aµ + a)'Γ(Γ'AΓ)^r Γ'(2Aµ + a)/4 = (Aµ + a/2)'Σ(AΣ)^r(Aµ + a/2), the proof for the general case follows.

From Result 5.4.2, we can show directly (see Exercise 5.26) that E(x'Ax) = tr(AΣ) + µ'Aµ, Var(x'Ax) = 2 tr(AΣAΣ) + 4µ'AΣAµ, and Cov(x, x'Ax) = 2ΣAµ. These properties are useful to characterize the first two moments of the distribution of x'Ax. We sometimes encounter the need to employ the following result on the moments of quadratic forms, which we state without proof (see Magnus and Neudecker, 1988).

Result 5.4.3. Let A, B, and C be symmetric k × k matrices and let x ∼ N_k(0, Σ). Then, letting A_1 = AΣ, B_1 = BΣ, and C_1 = CΣ,
$$\mathrm{E}[(x'Ax)(x'Bx)] = \operatorname{tr}(A_1)\operatorname{tr}(B_1) + 2\operatorname{tr}(A_1B_1),$$
$$\mathrm{E}[(x'Ax)(x'Bx)(x'Cx)] = \operatorname{tr}(A_1)\operatorname{tr}(B_1)\operatorname{tr}(C_1) + 2[\operatorname{tr}(A_1)][\operatorname{tr}(B_1C_1)] + 2[\operatorname{tr}(B_1)][\operatorname{tr}(A_1C_1)] + 2[\operatorname{tr}(C_1)][\operatorname{tr}(A_1B_1)] + 8\operatorname{tr}(A_1B_1C_1).$$

We next state and prove an important result which gives a condition under which a quadratic form in a normal random vector has a noncentral chi-square distribution. This result is fundamental for a discussion of linear model inference under normality.

Result 5.4.4. Let x ∼ N_k(µ, Σ), Σ being p.d., and let A ≠ O be a symmetric matrix of rank r. Then the quadratic form U = x'Ax has a noncentral chi-square distribution if and only if AΣ is idempotent, in which case U ∼ χ²(r, λ) with λ = µ'Aµ/2.


Proof. We present a proof which only uses simple calculus and matrix algebra (see Khuri, 1999; Driscoll, 1999). Necessity. We assume that U = x0 Ax ∼ χ2 (r, λ), and must show that this implies AΣ is idempotent of rank r. By Result 2.4.5, we can write Σ = PP0 , for nonsingular P. Let y = P−1 x. Clearly, U = y0 P0 APy. Since P0 AP is symmetric, by Result 2.3.4, there exists an orthogonal matrix Q such that P0 AP = QDQ0 , where D = diag(O, d1 Im1 , · · · , dp Imp ), with d1 < · · · < dp denoting the distinct nonzero eigenvalues of P0 AP, with multiplicities m1 , . . . , mp . Then z = Q0 y ∼ Nk (ν, I) with ν = Q0 P−1 µ. Partition z as (z00 , z01 , · · · , z0p )0 , so that zi has dimension mi , i ≥ 1. Partition ν conformably as (ν 00 , ν 01 , · · · , ν 0p )0 . Then U = z0 Dz =

$$\sum_{i=1}^{p} d_i z_i'z_i. \tag{5.4.10}$$

Since z0i zi ∼ χ2 (mi , θi ) are independent, with θi = ν 0i ν i /2 ≥ 0, MU (t) =

$$\prod_{i=1}^{p} M_{z_i'z_i}(d_i t), \tag{5.4.11}$$

where, each m.g.f. on the right side has the form in (5.3.4) and is positive, i.e., greater than zero. From (5.3.4), we also see that MU (t) < ∞ if and only if t < 1/2, and Mz0i zi (di t) < ∞ if and only if di t < 1/2. If possible, let d1 < 0. Then, for all t < 1/(2d1 ) < 0, the left side of (5.4.11) is finite, while the right side is ∞, which is a contradiction. This implies that each di must be positive. Next, let us consider three possible cases, i.e., dp > 1, dp = 1 and dp < 1. Case (i). If possible, let dp > 1. Then, for any t ∈ (1/(2dp ), 1/2), the left side of (5.4.11) is finite, while the right side is ∞, which is again a contradiction. Thus, we must have di ∈ (0, 1] for i = 1, . . . , p. Case (ii). Suppose dp = 1. From (5.4.11), p−1 Y MU (t) = Mz0i zi (di t). Mz0p zp (t) i=1

(5.4.12)

From (5.3.4), for t < 1/2, MU (t) (1 − 2t)−r/2 exp{2λt/(1 − 2t)} = Mz0p zp (t) (1 − 2t)−mp /2 exp{2θp t/(1 − 2t)} = (1 − 2t)−(r−mp )/2 exp{2(λ − θp )t/(1 − 2t)}. If r − mp 6= 0 or λ − θp 6= 0, then, depending on their signs, as t ↑ 1/2 (i.e., increases from 0 to 1/2), the limit of the left side of (5.4.12) is either 0 or ∞. On the other hand, since each di on the right side of (5.4.12) (if there are any), lies in (0, 1), the limit of the right side is strictly positive and finite. This contradiction implies that r − mp = λ − θp = 0, so that both sides of (5.4.12) are constant and equal to 1, implying that p = 1. As a result, the only nonzero eigenvalue of P0 AP is 1. That is, P0 AP, and hence AΣ is idempotent with rank equal to the multiplicity of 1, which is r. Case (iii). Finally, suppose dp < 1. A similar argument as above shows that as t ↑ 1/2, the left side of (5.4.11) tends to either 0 or ∞, while the right side of (5.4.11) tends to a positive finite number. This contradiction then rules out the possibility that dp < 1 and completes the proof of necessity.


Sufficiency. We must show that if AΣ is idempotent of rank r, then U has a χ2 (r, λ) distribution. Since AΣ is idempotent, in (5.4.10), p = 1, d1 = 1, and m1 = r. Hence, x0 Ax has a χ2 (r, λ) distribution, where λ = θ1 . An alternate proof of sufficiency amounts to showing that the m.g.f. of x0 Ax coincides with the m.g.f. of a χ2 (r, λ) random variable, which is obvious from (5.3.5) and (5.4.6).  We leave the following consequence of Result 5.4.4 as Exercise 5.24. Result 5.4.5. Let x ∼ Nk (µ, Σ) with r(Σ) = k. Then x0 Ax follows a noncentral chisquare distribution if and only if any one of the following three conditions is met: 1. AΣ is an idempotent matrix of rank m, 2. ΣA is an idempotent matrix of rank m, 3. Σ is a g-inverse of A with r(A) = m. Under any of the conditions, x0 Ax ∼ χ2 (m, λ) with λ = µ0 Aµ/2. Example 5.4.1. Suppose x ∼ Nk (µ, Σ), where Σ is p.d. We show that x0 Ax is distributed as a linear combination of independent noncentral chi-square variables. First, since Σ is p.d., there is a nonsingular matrix P such that Σ = PP0 . Since P0 AP is symmetric, there exists an orthogonal matrix Q such that Q0 P0 APQ = D = diag(λ1 , · · · , λr , 0, · · · , 0), these being the distinct eigenvalues of P0 AP. Let x = PQz. From Result 5.2.5, z ∼ N (Q0 P−1 µ, I); Pk also x0 Ax = z0 Dz = i=1 λi Zi2 . The required result follows from Result 5.3.2.  When x may have a singular normal distribution, the following result on a quadratic form holds (Ogasawara and Takahashi, 1951). Its proof is left as Exercise 5.28. Result 5.4.6. Let x ∼ Nk (0, Σ). The quadratic form x0 Ax has a chi-square distribution if and only if ΣAΣAΣ = ΣAΣ, in which case x0 Ax ∼ χ2p with p = r(ΣAΣ). The next result is often referred to as Laha’s theorem (Laha, 1956). It is a generalization of Craig’s theorem which deals with the independence of x0 Ax and x0 Bx when x ∼ Nk (0, I) (Craig, 1943; Shanbhag, 1966). Result 5.4.7. Let x ∼ Nk (µ, Σ), where Σ is p.d. For x0 Ax + a0 x and x0 Bx + b0 x to be independent, it is necessary and sufficient that (i) AΣB = O, (ii) a0 ΣB = 0, (iii) b0 ΣA = 0, and (iv) a0 Σb = 0. Proof. The proof presented below uses the fact that two random variables W1 and W2 with bounded c.g.f.’s in a neighborhood of the origin are independent if and only if their cumulants satisfy κr (yW1 + zW2 ) = κr (yW1 ) + κr (zW2 ) for all real y and z and for all positive integers r. Sufficiency. Since A0 = A, x0 Ax = (Ax)0 (A0 A)− A0 (Ax) is a function of Ax. Likewise x0 Bx is a function of Bx. If (i)–(iv) are all satisfied, then by Result 5.2.10, each of a0 x and Ax is independent of both b0 x and Bx, and hence x0 Ax+a0 x is independent of x0 Bx+b0 x. Necessity. We first consider the case where x ∼ Nk (0, I). For any y, z, ξyz = y(x0 Ax + a0 x) + z(x0 Bx + b0 x) = x0 (yA + zB)x + (ya + zb)0 x. By Result 5.4.2, for r ≥ 2, κr (ξyz ) r−1 2 (r − 1)!

= tr{(yA + zB)r } + r(ya + zb)0 (yA + zB)r−2 (ya + zb)/4.


On the other hand,
$$\frac{\kappa_r(y(x'Ax + a'x))}{2^{r-1}(r-1)!} = y^r \operatorname{tr}(A^r) + r y^r a'A^{r-2}a/4, \qquad \frac{\kappa_r(z(x'Bx + b'x))}{2^{r-1}(r-1)!} = z^r \operatorname{tr}(B^r) + r z^r b'B^{r-2}b/4.$$
By independence,
$$\operatorname{tr}\{(yA + zB)^r\} + r(ya + zb)'(yA + zB)^{r-2}(ya + zb)/4 = y^r \operatorname{tr}(A^r) + r y^r a'A^{r-2}a/4 + z^r \operatorname{tr}(B^r) + r z^r b'B^{r-2}b/4. \tag{5.4.13}$$

Let λ1 , · · · , λp denote all the distinct nonzero elements in the spectra (set of eigenvalues) of yA, zB, and yA + zB (so that the λi are distinct). Let C represent one of the three matrices. From the spectral decomposition of a symmetric matrix (see Result 2.3.4), C = Pp 0 λ P i=1 i i,C Pi,C , where Pi,C consists of mi,C orthonormal eigenvectors of C corresponding to λi , with mi,C being the multiplicity of λi as an eigenvalue of C; if λi is not an eigenvalue of aA, take mi,C = 0, and Pi,C = O. Let uC be one of the vectors ya, zb, and ya + zb corresponding to C. Let µi,C = u0C Pi,C P0i,C uC . Then for any positive integer r ≥ 2, tr(Cr ) =

$$\sum_{i=1}^{p} \lambda_i^r m_{i,C}, \qquad r\,u_C'C^{r-2}u_C = r\sum_{i=1}^{p} \lambda_i^{r-2}\mu_{i,C}, \tag{5.4.14}$$

so from (5.4.13),
$$\sum_{i=1}^{p} \lambda_i^r \underbrace{(m_{i,yA+zB} - m_{i,yA} - m_{i,zB})}_{\nu_i} + \sum_{i=1}^{p} r\lambda_i^{r-2}\underbrace{(\mu_{i,yA+zB} - \mu_{i,yA} - \mu_{i,zB})}_{\nu_{p+i}} = 0. \tag{5.4.15}$$

Recall that all λ1 , · · · , λp are distinct and nonzero. Suppose λ1 < · · · < λp and let ρ = max(|λ1 |, |λp |). Divide both sides of (5.4.15) by rρr−2 and let r → ∞. There are three cases, (a) |λ1 | = |λp |, so that −λ1 = λp = ρ, (b) |λ1 | < λp = ρ, and (c) |λp | < −λ1 = ρ. Consider case (a). If r stays even as r → ∞, then ν1 + νp = 0, while if r stays odd as r → ∞, then −ν1 + νp = 0. As a result, ν1 = νp = 0. Now divide both sides of (5.4.15) by ρr . By a similar argument, µ1 = µp = 0. The other two cases can be treated similarly. Thus, the coefficients corresponding to the eigenvalue(s) with the largest absolute value are all zero. We then can reduce the number of terms of the summations in (5.4.15) by two in case (a), and by one in cases (b) and (c). Then, by the same argument, the coefficients corresponding to the eigenvalue(s) with the second largest absolute value are all zero. Repeating the argument, it follows that all the νi ’s in (5.4.15) are zero, giving mi,yA+zB − mi,yA − mi,zB = 0

and µi,yA+zB − µi,yA − µi,zB = 0.

Then from (5.4.14),
$$\operatorname{tr}\{(yA + zB)^r\} = y^r \operatorname{tr}(A^r) + z^r \operatorname{tr}(B^r), \tag{5.4.16}$$
$$(ya + zb)'(yA + zB)^{r-2}(ya + zb) = y^r a'A^{r-2}a + z^r b'B^{r-2}b. \tag{5.4.17}$$

Since the equation holds for all y and z, for any integers r ≥ 2, m ≥ 0, and n ≥ 0,


the coefficients of y^m z^n on both sides are equal. In particular, for r = 4 and m = n = 2, comparing the coefficients of y²z² in (5.4.16) yields tr(A²B² + B²A² + (AB + BA)²) = tr((AB)'(AB) + (BA)'(BA) + (AB + BA)'(AB + BA)) = 0. It then follows that AB = BA = O. Plugging this into (5.4.17) and comparing the coefficients of y²z² therein when r = 4 yields b'A²b + a'B²a = (Ab)'(Ab) + (Ba)'(Ba) = 0, so Ab = Ba = 0. Finally, letting r = 2 in (5.4.17) and comparing the coefficients of yz on both sides gives a'b = 0. Then the necessity is proved for the case x ∼ N_k(0, I). For the general case, x = Γz + µ, where Γ is a nonsingular matrix with ΓΓ' = Σ. Then
$$x'Ax + a'x = (\Gamma z + \mu)'A(\Gamma z + \mu) + a'(\Gamma z + \mu) = z'(\Gamma'A\Gamma)z + (2\Gamma'A\mu + \Gamma'a)'z + (\mu'A\mu + a'\mu). \tag{5.4.18}$$

Likewise x0 Bx + b0 x can be written in terms of z. Then the necessity of (i)–(iv) in the general case follows from the standard normal case. Further details of the proof are left as a part of Exercise 5.34.  Result 5.4.7 has the following consequence. Result 5.4.8. Let x ∼ Nk (µ, Σ) with Σ being p.d. Let B be an m × k matrix. For Bx and x0 Ax + a0 x to be independent, it is necessary and sufficient that (i) BΣA = O and (ii) BΣa = 0. Proof. Denote the rows of B by b01 , · · · , b0m . Then Bx and x0 Ax+a0 x are independent if and only if each b0i x is independent of x0 Ax+a0 x. The result then follows from Result 5.4.7. A less technical proof of the necessity of (i) and (ii), which does not rely on the cumulant generating function argument used in Result 5.4.7, is sketched in Exercise 5.29.  For more details, see Ogawa (1950, 1993), Laha (1956), and Reid and Driscoll (1988). Using only linear algebra and calculus, Driscoll and Krasnicka (1995) proved the general case where x ∼ Nk (µ, Σ) with Σ possibly being singular. In this case, they showed that a necessary and sufficient condition for x0 Ax + a0 x and x0 Bx + b0 x to be independently distributed is that ΣAΣBΣ = O, ΣAΣ(Bµ + b) = 0, ΣBΣ(Aµ + a) = 0, and (Aµ + a/2)0 Σ(Bµ + b/2) = 0.

(5.4.19)

The proof is left as a part of Exercise 5.34. Example 5.4.2. We show independence between the mean and sum of squares. Let x = (X1 , · · · , Xn )0 ∼ Nn (0, I) so that we can think of X1 , · · · , Xn as a random sample 0 x/n has a N (0, 1/n) distribution from a N (0, 1) population. The sample mean X = 1P n 2 (using Result 5.2.9), and the sample sum of squares i=1 (Xi − X) has a central chisquare distribution with n − 1 degrees of freedom (by Result 5.4.1). It is easily verified Pn 2 2 by expressing XP and i=1 (Xi − X)2 as quadratic forms in x that X , and hence X is n 2 independent of i=1 (Xi − X) . 


Example 5.4.3. Let x ∼ N_k(µ, I) and suppose a k × k orthogonal matrix T is partitioned as T = (T_1', T_2')', where T_i is a k_i × k matrix, i = 1, 2, such that k_1 + k_2 = k. It is easy to verify that T_1T_1' = I_{k_1}, T_2T_2' = I_{k_2}, T_1T_2' = O, T_2T_1' = O, and T'T = I_k. Also, T_i'T_i is idempotent of rank k_i, i = 1, 2. By Result 5.4.5, x'T_i'T_i x ∼ χ²(k_i, µ'T_i'T_iµ/2), i = 1, 2. By Result 5.4.7, these two quadratic forms are independently distributed.

Note that Result 5.4.7 applies whether or not the quadratic forms x'Ax and x'Bx have chi-square distributions. We next discuss without proof a more general theorem dealing with quadratic forms in normal random vectors. The basic result is due to Cochran (1934), which was later modified by James (1952). A proof is sketched in Exercises 5.37 and 5.38.

Result 5.4.9. Cochran's theorem. Let x ∼ N_k(µ, I). Let Q_j = x'A_jx, j = 1, · · · , L, where A_j are symmetric matrices with r(A_j) = r_j and $\sum_{j=1}^{L} A_j = I$. Then, in order for the Q_j's to be independent noncentral chi-square variables, it is necessary and sufficient that $\sum_{j=1}^{L} r_j = k$, in which case, Q_j ∼ χ²(r_j, µ'A_jµ/2).
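A quick way to see Cochran's theorem at work is the decomposition behind Example 5.4.2. The simulation sketch below (with arbitrary n, seed, and replication size, none of which come from the text) checks the rank condition and the implied independence and chi-square behavior.

## Cochran's theorem for I = (1/n)J + (I - (1/n)J), x ~ N_n(0, I)
set.seed(1)
n <- 10
A1 <- matrix(1, n, n) / n           # rank 1, idempotent
A2 <- diag(n) - A1                  # rank n - 1, idempotent
c(qr(A1)$rank, qr(A2)$rank)         # ranks add to n
Q <- replicate(5000, {
  x <- rnorm(n)
  c(Q1 = sum(x %*% A1 %*% x), Q2 = sum(x %*% A2 %*% x))
})
cor(Q[1, ], Q[2, ])                 # near 0: Q1 and Q2 are independent
c(mean(Q[2, ]), var(Q[2, ]))        # approx n - 1 and 2(n - 1): chi-square(n - 1)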

5.5 Remedies for non-normality

Section 5.5.1 discusses transformations to normality, while Section 5.5.2 describes a few alternate distributions for the responses in the GLM.

5.5.1 Transformations to normality

A general family of univariate and multivariate transformations to normality is introduced in this section.

5.5.1.1 Univariate transformations

A transformation of X is a function T which replaces X by a new "transformed" variable T(X). The simplest and most widely used transformations belong to the family of power transformations, which are defined below.

Definition 5.5.1. Power transformation. The family of power transformations have the form
$$T_P(X) = \begin{cases} aX^p + b & \text{if } p \neq 0, \\ c\log X + d & \text{if } p = 0, \end{cases} \tag{5.5.1}$$
where a, b, c, d, and p are arbitrary real scalars.

The power transformation is useful for bringing skewed distributions of random variables closer to symmetry, and thence, to normality. The square root transformation or the logarithmic transformation, for instance, have the effect of "pulling in" one tail of the distribution. Any power transformation is either concave or convex throughout its domain of positive numbers, i.e., there is no point of inflection. This implies that a power transformation either compresses the scale for larger X values more than it does for smaller X values (for example, T_P(X) = log X), or it does the reverse (for example, T_P(X) = X²). We cannot, however, use a power transformation to expand the scale of X for large and small values, while compressing it for values in between! Tukey (1957) considered the family


of transformations
$$X^{(\lambda)} = \begin{cases} X^\lambda & \text{if } \lambda \neq 0, \\ \log X & \text{if } \lambda = 0 \text{ and } X > 0, \end{cases} \tag{5.5.2}$$
which is a special case of (5.5.1). This family of transformations, indexed by λ, includes the well-known square root (when λ = 1/2), logarithmic (when λ = 0), and reciprocal (when λ = −1) transformations. Notice that (5.5.2) has a discontinuity at λ = 0. Box and Cox (1964) offered a remedy to this problem which is defined below.

Definition 5.5.2. Box–Cox transformation.
$$X^{(\lambda)} = \begin{cases} (X^\lambda - 1)/\lambda & \text{if } \lambda \neq 0, \\ \log X & \text{if } \lambda = 0 \text{ and } X > 0, \end{cases} \tag{5.5.3}$$
which has been widely used in practice to achieve transformation to normality. If some of the X_i's assume negative values, a positive constant may be added to all the variables to make them positive. Box and Cox (1964) also proposed the shifted power transformation defined as follows.

Definition 5.5.3. Box–Cox shifted power transformation.
$$X^{(\lambda)} = \begin{cases} [(X + \delta)^\lambda - 1]/\lambda & \text{if } \lambda \neq 0, \\ \log[X + \delta] & \text{if } \lambda = 0, \end{cases} \tag{5.5.4}$$
where the parameter δ is chosen such that X > −δ.

Several modifications of the Box–Cox transformation exist in the literature, of which a few are mentioned here. Manly (1976) proposed a modification which allows the incorporation of negative observations and is an effective tool to transform skewed unimodal distributions to approximate normality. This is given by
$$X^{(\lambda)} = \begin{cases} [\exp(\lambda X) - 1]/\lambda & \text{if } \lambda \neq 0, \\ X & \text{if } \lambda = 0. \end{cases} \tag{5.5.5}$$
Bickel and Doksum (1981) proposed a modification which incorporates unbounded support for X^{(λ)}:
$$X^{(\lambda)} = \{|X|^\lambda \operatorname{sign}(X) - 1\}/\lambda, \tag{5.5.6}$$
where
$$\operatorname{sign}(u) = \begin{cases} -1 & u < 0, \\ 0 & u = 0, \\ 1 & u > 0. \end{cases} \tag{5.5.7}$$


Definition 5.5.4. Modulus transformation. The following transformation was suggested by John and Draper (1980): ( sign(X)[(|X| + 1)λ − 1]/λ if λ 6= 0 (λ) X (5.5.8) = sign(X)[log(|X| + 1)] if λ = 0. The modulus transformation works best to achieve approximate normality when the distribution of X is already approximately symmetric about some location. It alters each half of the distribution about this central value via the same power transformation in order to bring the distribution closer to a normal distribution. It is not difficult to see that when X > 0, (5.5.8) is equivalent to the power transformation. Given a random sample X1 , · · · , XN , estimation of λ using the maximum likelihood approach and the Bayesian framework was proposed by Box and Cox (1964), while Carroll and Ruppert (1988) discussed several robust adaptations. The maximum likelihood approach for estimating λ in the context of linear regression is described in Section 8.2.1. A generalization of the Box– Cox transformation to symmetric distributions was considered by Hinkley (1975), while Solomon (1985) extended it to random-effects models. In a general linear model, if normality is suspect, a possible remedy is a transformation of the data. In general, the parameter of the transformation, i.e., λ, is unknown and must be estimated from the data. We give a brief description of the maximum likelihood approach for estimating λ (Box and Cox, 1964). Suppose we fit the model (4.1.2) to the data (xi , Yi ), i = 1, · · · , N . The estimation procedure consists of the following steps. We first choose a set of λ values in a pre-selected real interval, such as (−5, 5). For each chosen λ, we compute (λ) (λ) the vector of transformed variables y(λ) = (Y1 , · · · , YN ) using (5.5.3), say. We then fit (λ) the normal linear model (4.1.2) to (xi , Yi ), and compute SSE(λ) based on the maximum likelihood estimates (which coincide with the OLS estimates under normality). In the plot of SSE(λ) versus λ, we locate the value of λ which corresponds to the minimum value of SSE(λ). This is the MLE of λ. 5.5.1.2

5.5.1.2 Multivariate transformations

Andrews et al. (1971) proposed a multivariate generalization of the Box–Cox transformation. A transformation of a k-dimensional random vector x may be defined either with the objective of transforming each component X_j, j = 1, · · · , k, marginally to normality, or to achieve joint normality. They defined the simple family of "marginal" transformations as follows.

Definition 5.5.5. Transformation to marginal normality. The transformation for the jth component X_j is defined for j = 1, · · · , k by

X_j^{(λ_j)} = \begin{cases} (X_j^{λ_j} - 1)/λ_j & \text{if } λ_j \neq 0, \\ \log X_j & \text{if } λ_j = 0 \text{ and } X_j > 0, \end{cases}   (5.5.9)

where the λ_j are chosen to improve the marginal normality of X_j^{(λ_j)} for j = 1, · · · , k, via maximum likelihood estimation.

It is well known that marginal normality of the components does not imply multivariate normality; in using the marginal transformations above, it is hoped that the marginal normality of the X_j^{(λ_j)} might lead to a transformed x vector which is more amenable to procedures assuming multivariate normality. The following set of marginal transformations was proposed by Andrews et al. (1971) in order to achieve joint normality of the transformed data.


Definition 5.5.6. Transformation to joint normality. Suppose X_j^{(λ_j)}, j = 1, · · · , k, denote the marginal transformations, and suppose the vector λ = (λ_1, · · · , λ_k)' denotes the set of parameters that yields joint normality of the vector x^{(λ)} = (X_1^{(λ_1)}, · · · , X_k^{(λ_k)})'. Suppose the mean and covariance of this multivariate normal distribution are µ and Σ. The joint p.d.f. of x is

f(x; µ, Σ, λ) = \frac{|J|}{(2π)^{k/2}|Σ|^{1/2}} \exp\left\{-\frac{1}{2}\,(x^{(λ)} - µ)'Σ^{-1}(x^{(λ)} - µ)\right\},   (5.5.10)

where the Jacobian of the transformation is given by J = \prod_{j=1}^{k} X_j^{λ_j - 1}.

Given N independent observations x_1, · · · , x_N, the estimate of λ, along with µ and Σ, is obtained by numerically maximizing the likelihood function which has the form in (5.5.10). There are some situations where nonnormality is manifest only in some directions in the k-dimensional space. Andrews et al. (1971) suggested an approach (a) to identify these directions, and (b) to then estimate a power transformation of the projections of x_1, · · · , x_N onto these selected directions in order to improve the normal approximation.
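In R, the car package (loaded in Section 5.6) provides maximum likelihood estimation of such power-transformation parameters via powerTransform(). The sketch below is ours and assumes a hypothetical data frame mydata containing positive variables X1 and X2.

library(car)
pt <- powerTransform(cbind(X1, X2) ~ 1, data = mydata)   # ML estimates of (lambda_1, lambda_2)
summary(pt)                                              # estimates, standard errors, LR tests
z  <- bcPower(with(mydata, cbind(X1, X2)), coef(pt))     # transformed data x^(lambda)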

5.5.2 Alternatives to multivariate normal distribution

The multivariate normal distributions constitute a very useful family of symmetric distributions that have found widespread use in the classical theory of linear models and multivariate analysis. However, the normal distribution is not the only choice to characterize the response variable in many situations. Particularly, in robustness studies, where interest lies in assessing sensitivity of procedures to the assumption of normality, interest has centered on a more general class of multivariate distributions (see Kotz et al. (2000)). Of special interest are distributions whose contours of equal density have elliptical shapes (see Section 5.2), and whose tail behavior differs from that of the normal. We begin the discussion by introducing a finite mixture distribution of multivariate normal distributions, as well as scale mixtures of multivariate normals. We then extend these to a more general class of spherically symmetric distributions, and finally to the class of elliptically symmetric distributions.

5.5.2.1 Mixture of normals

A finite parametric mixture of normal distributions is useful in several practical applications. We give a definition and some examples.

Definition 5.5.7. We say that x has an L-component mixture of k-variate normal distributions if its p.d.f. is

f(x; µ_1, Σ_1, p_1, · · · , µ_L, Σ_L, p_L) = \sum_{j=1}^{L} p_j (2π)^{-k/2} |Σ_j|^{-1/2} \exp\left\{-\frac{1}{2}(x - µ_j)'Σ_j^{-1}(x - µ_j)\right\},  x ∈ R^k.

This p.d.f. exhibits multi-modality with up to L distinct peaks.

Example 5.5.1. We saw the form of the bivariate normal p.d.f. in Example 5.2.3. Mixtures of L bivariate normal distributions enable us to generate a rich class of bivariate densities which have up to L distinct peaks. Consider two mixands. Let µ_1 = µ_2 = 0, σ²_{1,j} = σ²_{2,j} = 1 for j = 1, 2, ρ_1 = 1/2, and ρ_2 = −1/2. With mixing proportions p_1 = p_2 = 1/2, the mixture p.d.f. of (X_1, X_2)' is

f(x; µ_1, Σ_1, p_1, µ_2, Σ_2, p_2) = \frac{1}{2π\sqrt{3}}\, e^{-2(x_1^2 + x_2^2)/3}\left(e^{-2x_1x_2/3} + e^{2x_1x_2/3}\right),  x = (x_1, x_2)' ∈ R².
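The sketch below (ours) evaluates this two-component mixture density on a grid using dmvnorm() from the mvtnorm package, which is loaded in Section 5.6, and draws its contours.

library(mvtnorm)
S1 <- matrix(c(1, 0.5, 0.5, 1), 2, 2)      # correlation  1/2
S2 <- matrix(c(1, -0.5, -0.5, 1), 2, 2)    # correlation -1/2
mix_pdf <- function(x1, x2) {
  x <- cbind(x1, x2)
  0.5 * dmvnorm(x, sigma = S1) + 0.5 * dmvnorm(x, sigma = S2)
}
g <- seq(-3, 3, length.out = 101)
z <- outer(g, g, mix_pdf)
contour(g, g, z, xlab = "x1", ylab = "x2")  # contours of the mixture p.d.f.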

A plot of this p.d.f. reveals regions where one of the two components dominates, and there are also regions of transition where the p.d.f. does not appear to be "normal". A well-known property of a bivariate normal mixture is that all conditional and marginal distributions are univariate normal mixtures.

Example 5.5.2. Consider the following ε-contaminated normal distribution, which is a mixture of a N(0, I) distribution and a N(0, σ²I) distribution, with 0 ≤ ε ≤ 1. Its p.d.f. is

f(x; σ², ε) = \frac{1 - ε}{(2π)^{k/2}} \exp\left\{-\frac{x'x}{2}\right\} + \frac{ε}{(2π)^{k/2}σ^{k}} \exp\left\{-\frac{x'x}{2σ^2}\right\}.

The mixture of normals accommodates modeling in situations where the data exhibit multi-modality. Suppose λ denotes a discrete random variable, assuming two distinct positive values λ_1 and λ_2, with respective probabilities p_1 and p_2, where p_1 + p_2 = 1. Let x be a k-dimensional random vector which is defined as follows: conditionally on λ = λ_j, x ∼ N_k(0, λ_j I), j = 1, 2. The "conditional" p.d.f. of x (conditional on λ) is

f(x | λ_j) = (2π)^{-k/2} λ_j^{-k/2} \exp\{-x'x/2λ_j\},  x ∈ R^k.

The unconditional distribution of x has the mixture p.d.f.

f(x) = (2π)^{-k/2} \{p_1 λ_1^{-k/2} \exp(-x'x/2λ_1) + p_2 λ_2^{-k/2} \exp(-x'x/2λ_2)\}.

This distribution is called a scale mixture of multivariate normals. In general, we can include L mixands, L ≥ 2. By varying the mixing proportions p_j and the values λ_j, we can generate a flexible class of distributions that are useful in modeling a variety of multivariate data. It can be shown that all marginal distributions of this scale mixture are themselves scale mixtures of normals of appropriate dimensions, a property which this distribution shares with the multivariate normal (see Result 5.2.11). If we wish to maintain unimodality while allowing for heavy-tailed behavior, we can assume that the mixing random variable λ has a continuous distribution with p.d.f. π(λ). We define the resulting flexible class of distributions and show several examples which have useful applications in modeling multivariate data.

Definition 5.5.8. Multivariate scale mixture of normals (SMN distribution). A k-dimensional random vector x has a multivariate SMN distribution with mean vector θ and covariance matrix Σ if its p.d.f. has a "mixture" form

f(x; θ, Σ) = \int_{R^{+}} N_k(x; θ, κ(λ)Σ)\, dF(λ),   (5.5.11)

where κ(·) is a positive function defined on R^{+}, and F(·) is a c.d.f., which may be either discrete or continuous. The scalar λ is called the mixing parameter, and F(·) is the mixing distribution.

Example 5.5.3. Suppose we set κ(λ) = 1/λ in (5.5.11), and assume that the parameter λ ∼ Gamma(ν/2, ν/2), i.e.,

π(λ) = \frac{(ν/2)^{ν/2}}{\Gamma(ν/2)}\, λ^{ν/2-1} \exp\{-νλ/2\},  λ > 0.


The resulting multivariate t-distribution is a special example of the scale mixtures of normals family with ν degrees of freedom and p.d.f.

f(z; θ, Σ, ν) = \frac{\Gamma\{\frac{1}{2}(k + ν)\}}{\Gamma(ν/2)(νπ)^{k/2}|Σ|^{1/2}} \left[1 + \frac{1}{ν}(z - θ)'Σ^{-1}(z - θ)\right]^{-(ν+k)/2}   (5.5.12)

for z ∈ R^k. It can be shown that if

z = \frac{x}{\sqrt{ξ/ν}} + θ,  with x ∼ N_k(0, Σ) and ξ ∼ χ²_ν independent,

then z ∼ f(z; θ, Σ, ν) and, furthermore, (z - θ)'Σ^{-1}(z - θ)/k ∼ F_{k,ν}. When ν → ∞, the multivariate t-distribution approaches the multivariate normal distribution. By setting θ = 0 and Σ = I_k in (5.5.12), we get the standard distribution, usually denoted by f(z). When ν = 1, the distribution corresponds to a k-variate Cauchy distribution. In particular, when k = 2, let z = (Z_1, Z_2) denote a random vector with a Cauchy distribution. The p.d.f. of z is

f(z) = \frac{1}{2π}(1 + z'z)^{-3/2},  z ∈ R²,

and corresponds to a (standard) bivariate Cauchy distribution, which is a simple example of a bivariate scale mixture of normal distributions.

Example 5.5.4. If we assume κ(λ) = 4λ², where λ follows an asymptotic Kolmogorov distribution with p.d.f.

π(λ) = 8 \sum_{j=1}^{∞} (-1)^{j+1} j^2 λ \exp\{-2j^2λ^2\},

the resulting multivariate logistic distribution is a special case of the scale mixture of normals family. This distribution finds use in modeling multivariate binary data.

Example 5.5.5. If we set κ(λ) = 2λ, and assume that π(λ) is a positive stable p.d.f. S^P(α, 1) (see item 12 in Appendix B) whose polar form of the p.d.f. is given by (Samorodnitsky and Taqqu, 1994)

π_{S^P}(λ; α, 1) = \{α/(1 - α)\}\, λ^{-\{α/(1-α)+1\}} \int_0^1 s(u) \exp\{-s(u)/λ^{α/(1-α)}\}\, du,

for 0 < α < 1, and s(u) = \{\sin(απu)/\sin(πu)\}^{α/(1-α)} \{\sin[(1 - α)πu]/\sin(πu)\}, the resulting scale mixture of normals distribution is called the multivariate symmetric stable distribution.
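The construction in Example 5.5.3 is easy to simulate. The sketch below (ours) draws from a multivariate t distribution by mixing a normal vector over a gamma-distributed λ with κ(λ) = 1/λ; the function name rmvt_mix is hypothetical.

library(mvtnorm)
rmvt_mix <- function(n, theta, Sigma, nu) {
  k   <- length(theta)
  lam <- rgamma(n, shape = nu / 2, rate = nu / 2)   # mixing variable lambda
  x   <- rmvnorm(n, sigma = Sigma)                  # N_k(0, Sigma) draws
  sweep(x, 1, sqrt(lam), "/") + matrix(theta, n, k, byrow = TRUE)
}
z <- rmvt_mix(1000, theta = c(0, 0), Sigma = diag(2), nu = 5)
# compare with mvtnorm::rmvt(1000, sigma = diag(2), df = 5)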

5.5.2.2 Spherical distributions

In this section, we define the class of spherical (or radial) distributions.

Definition 5.5.9. A k-dimensional random vector z = (Z_1, · · · , Z_k)' is said to have a spherical (or spherically symmetric) distribution if its distribution does not change under rotations of the coordinate system, i.e., if the distribution of the vector Az is the same as the distribution of z for any orthogonal k × k matrix A. If the p.d.f. of z exists in R^k, it depends on z only through z'z = \sum_{i=1}^{k} Z_i^2; for any function h (called the density generator function),

f(z) ∝ h(z'z),  i.e.,  f(z) = c_k h(z'z),   (5.5.13)

where c_k is a constant. The mean and covariance of z, provided they exist, are

E(z) = 0  and  Cov(z) = cI_k,

where c ≥ 0 is some constant. Different choices of the function h give rise to different examples of the spherical distributions (Muirhead, 1982; Fang et al., 1990). Contours of constant density of a spherical random vector z are circles when k = 2, or spheres for k > 2, which are centered at the origin. The spherical normal distribution shown in the following example is a popular member of this class.

Example 5.5.6. Let z have a k-variate normal distribution with mean 0 and covariance σ²I_k. We say z has a spherical normal distribution with p.d.f.

f(z; σ²) = \frac{1}{(2π)^{k/2}σ^{k}} \exp\left\{-\frac{1}{2σ^2}\, z'z\right\},  z ∈ R^k.

The density generator function is clearly h(u) = c\exp\{-u/2\}, u ≥ 0.



The ε-contaminated normal distribution shown in Example 5.5.2 is also an example of a spherical distribution, as is the standard multivariate t-distribution defined in Example 5.5.3. The following example generalizes the well-known double-exponential (Laplace) distribution to the multivariate case; this distribution is useful for modeling data with outliers.

Example 5.5.7. Consider the bivariate generalization of the standard double-exponential distribution to a vector z = (Z_1, Z_2) with p.d.f.

f(z) = \frac{1}{2π} \exp\{-(z'z)^{1/2}\},  z ∈ R².

This is an example of a spherical distribution; notice the similarity of this p.d.f. to that of the bivariate standard normal vector.

Definition 5.5.10. The squared radial random variable T = ‖z‖² has p.d.f.

\frac{π^{k/2}}{\Gamma(k/2)}\, t^{(k/2)-1} h(t),  t > 0.   (5.5.14)

We say T has a radial-squared distribution with k d.f. and density generator h, i.e., T ∼ R²_k(h).

The main appeal of spherical distributions lies in the fact that many results that we have seen for the multivariate normal hold for the general class of spherical distributions. For example, if z = (Z_1, Z_2)' is a bivariate spherically distributed random vector, the ratio V = Z_1/Z_2 has a Cauchy distribution provided P(Z_2 = 0) = 0. If z = (Z_1, · · · , Z_k)' has a k-variate spherical distribution, k ≥ 2, with P(z = 0) = 0, we can show that V = Z_1/\{(Z_2^2 + · · · + Z_k^2)/(k - 1)\}^{1/2} ∼ t_{k-1}. In many cases, we wish to extend the definition of a spherical distribution to include random vectors with a nonzero mean µ and a general covariance matrix Σ. This generalization leads us from spherical distributions to elliptical (or elliptically contoured) distributions, which form the topic of the next subsection.
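To make Definition 5.5.10 concrete, the sketch below (ours) simulates the bivariate double-exponential law of Example 5.5.7 through its radial representation: a uniform direction on the unit circle scaled by a radius whose density is proportional to r^{k-1}h(r²). For k = 2 and h(u) = exp{-u^{1/2}} this radius is Gamma(2, 1).

n     <- 2000
theta <- runif(n, 0, 2 * pi)                # uniform direction on the unit circle
u     <- cbind(cos(theta), sin(theta))
r     <- rgamma(n, shape = 2, rate = 1)     # radius density proportional to r * exp(-r)
z     <- r * u                              # spherical (bivariate Laplace) draws
plot(z, asp = 1, pch = ".", xlab = "z1", ylab = "z2")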

5.5.2.3 Elliptical distributions

The family of elliptical or elliptically contoured distributions is the most general family that we will consider as alternatives to the multivariate normal distribution. We derive results on the forms of the corresponding marginal and conditional distributions. There is a vast literature on spherical and elliptical distributions (Kelker, 1970; Devlin et al., 1976; Chmielewski, 1981; Fang et al., 1990; Fang and Anderson, 1990), and the reader is referred to these for more details on this interesting and useful class of distributions.

Definition 5.5.11. Let z ∈ R^k follow a spherical distribution, µ ∈ R^k be a fixed vector, and Γ be a k × k matrix. The random vector x = µ + Γz is said to have an elliptical, or elliptically contoured, or elliptically symmetric distribution. Provided they exist, the mean and covariance of x are

E(x) = µ  and  Cov(x) = cΓΓ' = cV,

where c ≥ 0 is a constant. The m.g.f. of the distribution, if it exists, has the form

M_x(t) = ψ(t'Vt) \exp\{t'µ\}   (5.5.15)

for some function ψ. In case the m.g.f. does not exist, we invoke the characteristic function of the distribution for proof of distributional properties. In order for an elliptically contoured random vector to admit a density (with respect to Lebesgue measure), the matrix V must be p.d. and the density generator function h(·) in (5.5.13) must satisfy the condition

\int_0^{∞} \frac{π^{k/2}}{\Gamma(k/2)}\, t^{(k/2)-1} h(t)\, dt = 1.

If the p.d.f. of x exists, it will be a function only of the norm ‖x‖ = (x'x)^{1/2}. We denote the class of elliptical distributions by E_k(µ, V, h). If a random vector x ∼ E_k(0, I_k, h), then x has a spherical distribution. Suppose µ is a fixed k-dimensional vector, and y = µ + Px, where P is a nonsingular k × k matrix. Then, y ∼ E_k(µ, V, h), with V = PP'.

Result 5.5.1. Let z denote a spherically distributed random vector with p.d.f. f(z), and let x = µ + Γz have an elliptical distribution, where Γ is a k × k nonsingular matrix. Let V = ΓΓ', and note that z = Γ^{-1}(x - µ). Then the p.d.f. of x has the form

f(x) = c_k |V|^{-1/2} h[(x - µ)'V^{-1}(x - µ)],  x ∈ R^k,   (5.5.16)

for some function h(·) which can be independent of k, and such that r^{k-1}h(r²) is integrable over [0, ∞).

Proof. The transformation from z to x = µ + Γz has Jacobian J = |Γ^{-1}|. By Result A.2, we have for x ∈ R^k,

f(x) = |Γ|^{-1} f\{Γ^{-1}(x - µ)\} = c_k [|Γ|^{-1}|Γ|^{-1}]^{1/2} h[(x - µ)'Γ'^{-1}Γ^{-1}(x - µ)] = c_k |V|^{-1/2} h[(x - µ)'V^{-1}(x - µ)].

Note that the same steps are used in the derivation of the multivariate normal p.d.f. in Section 5.2. The relation between the spherical distribution and the (corresponding) elliptical distribution is the same as the relationship between the multivariate standard normal distribution (Definition 5.2.1) and the corresponding normal distribution with nonzero mean and covariance Σ (Definition 5.2.2).
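Result 5.5.1 also gives a recipe for simulation: draw a spherical z and set x = µ + Γz with ΓΓ' = V. The sketch below (ours) does this for the bivariate Laplace distribution of Example 5.5.7, using a Cholesky factor of a chosen V.

theta <- runif(2000, 0, 2 * pi)
z     <- rgamma(2000, 2, 1) * cbind(cos(theta), sin(theta))  # spherical Laplace draws
V     <- matrix(c(2, 0.8, 0.8, 1), 2, 2)                     # a chosen p.d. scale matrix
Gam   <- t(chol(V))                                          # Gamma with Gamma %*% t(Gamma) = V
x     <- t(c(1, -1) + Gam %*% t(z))                          # x = mu + Gamma z, elliptical draws
plot(x, asp = 1, pch = ".", xlab = "x1", ylab = "x2")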


The distribution of x will have m moments provided the function r^{m+k-1}h(r²) is integrable on [0, ∞). We show two examples.

Example 5.5.8. Let x have a k-variate normal distribution with mean µ and covariance σ²I. Then x has an elliptical distribution. A rotation about µ is given by y = P(x - µ) + µ, where P is an orthogonal matrix. We see that y ∼ N_k(µ, σ²I) (see Exercise 5.9), so that the distribution is unchanged under rotations about µ. We say that the distribution is spherically symmetric about µ. In fact, the normal distribution is the only multivariate distribution with independent components X_j, j = 1, · · · , k, that is spherically symmetric.

Example 5.5.9. Suppose x = µ + Γz, where z was defined in Example 5.5.3, µ = (µ_1, µ_2)' is a fixed vector, and Γ is a nonsingular 2 × 2 matrix. Let A = ΓΓ'; the p.d.f. of (X_1, X_2)' is

f(x; µ, A) = (2π)^{-1}|A|^{-1/2}[1 + (x - µ)'A^{-1}(x - µ)]^{-3/2},  x ∈ R².

This is the multivariate Cauchy distribution, which is a special case of the multivariate t-distribution. The density generator for the k-variate Cauchy distribution is h(u) = c\{1 + u\}^{-(k+1)/2}, while the density generator for the k-variate t-distribution with ν degrees of freedom is h(u) = c\{1 + u/ν\}^{-(k+ν)/2}. In terms of its use in linear model theory, the Cauchy distribution and the Student's t-distribution with small ν are considered useful as robust alternatives to the multivariate normal distribution in terms of error distribution specification.

Example 5.5.10. Let z be the standard double exponential variable specified in Example 5.5.7, and suppose we define x = µ + Γz, where µ = (µ_1, µ_2)' is a fixed vector and Γ is a nonsingular 2 × 2 matrix. Let A = ΓΓ'; the p.d.f. of (X_1, X_2)' is

f(x; µ, A) = (2π)^{-1}|A|^{-1/2}\exp\{-[(x - µ)'A^{-1}(x - µ)]^{1/2}\},  x ∈ R².

A comparison of the contours of this distribution with those of a bivariate normal distribution having the same location and spread shows that this distribution is more peaked at the center and has heavier tails.

The next result specifies the marginal distributions and the conditional distributions. Result 5.5.3 characterizes the class of normal distributions within the family of elliptically symmetric distributions. Let x = (X_1, · · · , X_k)' ∼ E_k(µ, V, h).

Result 5.5.2. Suppose we partition x as x = (x_1', x_2')', where x_1 and x_2 are respectively q-dimensional and (k - q)-dimensional vectors. Suppose µ and V are partitioned conformably (similar to Result 5.2.11).

1. The marginal distribution of x_i is elliptical, i.e., x_1 ∼ E_q(µ_1, V_{11}) and x_2 ∼ E_{k-q}(µ_2, V_{22}). Unless f(x) has an atom of weight at the origin, the p.d.f. of each marginal distribution exists.

2. The conditional distribution of x_1 given x_2 = c_2 is q-variate elliptical with mean

E(x_1 | x_2 = c_2) = µ_1 + V_{12}V_{22}^{-1}(c_2 - µ_2),   (5.5.17)

while the conditional covariance of x_1 given x_2 = c_2 depends on c_2 only through the quadratic form (c_2 - µ_2)'V_{22}^{-1}(c_2 - µ_2). The distribution of x_2 given x_1 = c_1 is derived similarly.


Proof. The m.g.f. of x_i is ψ(t_i'V_{ii}t_i)\exp\{t_i'µ_i\}, i = 1, 2. As a result, x_1 ∼ E_q(µ_1, V_{11}) and x_2 ∼ E_{k-q}(µ_2, V_{22}). The p.d.f. of x_1, if it exists, has the form

f_1(x_1) = c_q |V_{11}|^{-1/2} h_q[(x_1 - µ_1)'V_{11}^{-1}(x_1 - µ_1)],

where the function h_q depends only on h and q, and is independent of µ and V. This completes the proof of property 1. To show property 2, we see that by definition, the conditional mean is

E(x_1 | x_2 = c_2) = \int x_1\, dF_{x_1 | c_2}(x_1).

Substituting y = x_1 - µ_1 - V_{12}V_{22}^{-1}(c_2 - µ_2) and simplifying, we get

E(x_1 | x_2 = c_2) = \int y\, dF_{y | c_2}(y) + µ_1 + V_{12}V_{22}^{-1}(c_2 - µ_2).

Since it can be verified that the joint m.g.f. of y and x_2, when it exists, satisfies M_{y,x_2}(-t_1, t_2) = M_{y,x_2}(t_1, t_2), we see that \int y\, dF_{y | c_2}(y) = 0, proving (5.5.17). The conditional covariance is

Cov(x_1 | x_2 = c_2) = \int [x_1 - E(x_1 | x_2 = c_2)][x_1 - E(x_1 | x_2 = c_2)]'\, f_{x_1 | x_2}(x_1)\, dx_1
  = \frac{c_k}{f_{x_2}(c_2)\,|V|^{1/2}} \int zz'\, h[z'V_{11.2}^{-1}z + (c_2 - µ_2)'V_{22}^{-1}(c_2 - µ_2)]\, dz,

where V_{11.2} = V_{11} - V_{12}V_{22}^{-1}V_{21}. The result follows since f_{x_2}(c_2) is a function of the quadratic form (c_2 - µ_2)'V_{22}^{-1}(c_2 - µ_2).

Result 5.5.3.

Suppose x, µ, and V are partitioned as in Result 5.5.2.

1. If any marginal p.d.f. of a random vector x which has an E_k(µ, V, h) distribution is normal, then x must have a normal distribution.

2. If the conditional distribution of x_1 given x_2 = c_2 is normal for any q, q = 1, · · · , k - 1, then x has a normal distribution.

3. Let k > 2, and assume that the p.d.f. of x exists. The conditional covariance of x_1 given x_2 = c_2 is independent of x_2 only if x has a normal distribution.

4. If x ∼ E_k(µ, V, h), and V is diagonal, then the components of x are independent only if the distribution of x is normal.

Proof. By Result 5.5.2, the m.g.f. (or characteristic function) of x has the same form as the m.g.f. (or characteristic function) of x_1, from which property 1 follows. Without loss of generality, let µ = 0 and V = I, so that by Result 5.5.2, the conditional mean of x_1 given x_2 = c_2 is 0, and its conditional covariance has the form φ(c_2)I_{k-q}. Also, g(x_1 | x_2 = c_2) = c_k h(x_1'x_1 + x_2'x_2)/f_2(c_2) is a function of x_1'x_1. If the conditional distribution is normal,

c_k h(x_1'x_1 + x_2'x_2) = \{2πσ(c_2)\}^{-(k-q)/2} f_2(c_2) \exp\left\{-\frac{x_1'x_1}{2σ(c_2)}\right\},

so that

c_k h(x_1'x_1) = \{2πσ(c_2)\}^{-(k-q)/2} f_2(c_2) \exp\left\{\frac{x_2'x_2}{2σ(c_2)}\right\} \exp\left\{-\frac{x_1'x_1}{2σ(c_2)}\right\}.

This implies that f(x) = c_k h(x'x), i.e., x has a normal distribution, proving property 2. The proof of properties 3 and 4 is left as an exercise.

FIGURE 5.5.1. Contours of a bivariate normal density versus contours of kernel density estimates using a sample from the density; ρ = 0.5 in the left two panels and ρ = −0.5 in the right two panels.

We defer discussion of results on distributions of quadratic forms in elliptical random vectors (some of which are analogous to the theory for normal distributions) to Chapter 13, where we describe elliptically contoured linear models.

5.6 R Code

library(car)
library(mvtnorm)
library(MASS)

## Example 5.2.5. Bivariate normal
# simulate bivariate normal with rho=0.5
set.seed(1); n

A value of X²_p much larger than N − p shows evidence of overdispersion; to estimate φ, equate the X²_p statistic to its expectation and solve for φ:

\hat{φ} = \frac{X_p^2}{N - p}.   (12.5.6)
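In R, X²_p and the resulting estimate of φ can be read off a fitted Poisson GLM. The sketch below is generic, with fit denoting a hypothetical glm(..., family = poisson) object.

X2  <- sum(residuals(fit, type = "pearson")^2)   # Pearson chi-square statistic X_p^2
phi <- X2 / df.residual(fit)                     # estimate (12.5.6), with df = N - p
# refitting with family = quasipoisson reports this same dispersion estimate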

Since X²_p is usually a goodness of fit measure, treating it as pure error in order to estimate φ requires that we are very confident about the systematic part of the GLIM. That is, we are confident that the lack of fit is due to overdispersion, and not due to misspecification in the systematic part.

Negative binomial log-linear model. We include a multiplicative random effect R in the Poisson log-linear model to represent unobserved heterogeneity, leading to the negative binomial log-linear model. Suppose

p(Y | R) ∼ Poisson(µR),   (12.5.7)

so that E(Y | R) = µR = Var(Y | R). If R were known, then Y would have a Poisson distribution. Since R is unknown, we usually set E(R) = 1, in which case µ represents the expected outcome for the average individual given covariates X. Assume that R ∼ Gamma(α, β), and let α = β = 1/γ², so that E(R) = 1 and Var(R) = γ². By integrating out R from the joint distribution p(Y | R) × p(R), we get the marginal distribution of Y to be negative binomial with p.m.f.

p(y; α, β, µ) = \frac{\Gamma(y + α)}{y!\,\Gamma(α)}\, \frac{β^{α} µ^{y}}{(µ + β)^{α + y}},   (12.5.8)

so that E(Y) = µ and Var(Y) = µ(1 + γ²µ). If γ² is estimated to be zero, we have the Poisson log-linear model. If γ² is estimated to be positive, we have an overdispersed model. Most software can fit these models to count data.

Numerical Example 12.2. Count model. The dataset "aids" from the R package catdata has 2376 observations on 8 variables and was based on a survey of 369 men who were infected with HIV. The response Y is cd4 (number of CD4 cells). The predictors are time (years since seroconversion), drugs (recreational drug use, yes=1/no=0), partners (number of sexual partners), packs (packs of cigarettes a day), cesd (mental illness score), age (age centered around 30), and person (ID number).


data("aids", package = "catdata") full_pois |z|) (Intercept) 6.583e+00 1.711e-03 3847.187 < 2e-16 *** time -1.159e-01 4.279e-04 -270.938 < 2e-16 *** drugs 6.707e-02 1.863e-03 36.008 < 2e-16 *** partners -6.351e-04 2.171e-04 -2.925 0.00345 ** packs 7.497e-02 5.007e-04 149.741 < 2e-16 *** cesd -2.607e-03 7.928e-05 -32.888 < 2e-16 *** age 1.046e-03 1.011e-04 10.345 < 2e-16 *** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 467303 Residual deviance: 352915 AIC: 372732

on 2375 on 2369

degrees of freedom degrees of freedom

Number of Fisher Scoring iterations: 4 We check the dispersion parameter as the Residual Deviance/df, which must be approximately 1 for the Poisson case. Here, the dispersion parameter is estimated as 352915/2369 = 149, which is considerably greater than 1, indicating overdispersion. We fit a negative binomial model or a quasi poisson regression model as shown below. full_nb |z|) (Intercept) 6.5805718 0.0213683 307.959 < 2e-16 *** time -0.1126975 0.0053359 -21.121 < 2e-16 *** drugs 0.0731919 0.0231311 3.164 0.00156 ** partners -0.0006407 0.0028137 -0.228 0.81987 packs 0.0725523 0.0067755 10.708 < 2e-16 *** cesd -0.0026719 0.0010129 -2.638 0.00834 ** age 0.0004858 0.0012973 0.374 0.70806 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Negative Binomial(4.65) family taken to be 1) Null deviance: 3149.1 Residual deviance: 2465.8 AIC: 34173

on 2375 on 2369

degrees of freedom degrees of freedom

Number of Fisher Scoring iterations: 1 Theta: 4.650 Std. Err.: 0.132 2 x log-likelihood:

-34156.670
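The quasi-Poisson alternative mentioned above can be fit with the same linear predictor; the sketch below is ours (the object name full_qp is not from the text). It reproduces the Poisson point estimates with standard errors inflated by the estimated dispersion.

full_qp <- glm(cd4 ~ time + drugs + partners + packs + cesd + age,
               family = quasipoisson(link = "log"), data = aids)
summary(full_qp)   # same coefficients as the Poisson fit, larger standard errors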

We can do a deviance difference test based on the reduced model. The data prefer the reduced model.

summary(red_nb

Exercises

12.1. … α > 0 and β > 0, find the marginal distribution of Y, and show that it does not belong to the exponential family. Find E(Y) and Var(Y).

12.2. For i = 1, · · · , N, let Y_i follow a gamma distribution with shape parameter α > 0, scale parameter s_i > 0 and p.d.f. f(y; α, s) given by (B.10).


(a) Let θ = −1/µ = −1/(αs) and φ = 1/α. Show that V(µ) = µ² and the p.d.f. can be written as

f(y; θ, φ) = \exp\left\{\frac{θy + \log(-θ)}{φ} + \frac{1}{φ}\log(y/φ) - \log y - \log\Gamma(1/φ)\right\},

so that b(θ) = −\log(-θ), c(y, φ) = (1/φ)\log(y/φ) - \log y - \log\Gamma(1/φ).

(b) By parametrizing the distribution of Y in terms of its mean µ, show that the canonical link is g(µ) = −1/µ.

(c) Given x_i'β for i = 1, · · · , N, write the p.d.f. f(y; β, φ) under the (i) inverse link, (ii) log link and (iii) identity link.

12.3. Assume that Y_t, the number of fatal accidents in a year t, is modeled by a Poisson distribution with mean λ_t = β_0 + β_1 t. Given data for 20 years, derive the IRLS estimates for β_0 and β_1.

12.4. Let Y_1 denote the number of cars that cross a check-point in one hour, and let Y_2 denote the number of other vehicles (except cars) that cross the point. Consider two models. Under Model 1, assume that Y_1 and Y_2 have independent Poisson distributions with means λ_1 and λ_2 respectively, both unknown. Under Model 2, assume a Binomial specification for Y_1 with unknown probability p and sample size Y_1 + Y_2.

(a) If we define p = λ_1/(λ_1 + λ_2), show that Model 1 and Model 2 have the same likelihood. (Hint: Use Definition 12.1.1.)

(b) What is the relationship between the canonical link functions?

(c) How can we incorporate a covariate X, which is a highway characteristic, into the modeling?

12.5. Consider the binary response model defined in Section 12.4. Write down in detail the steps in the IRLS algorithm for estimating β under the probit and complementary log-log links, showing details on the weight matrix W and the vector of adjusted dependent responses z.

12.6. Consider the Poisson log-linear model defined in Example 12.1.2 and discussed in Section 12.5. Write down in detail the steps in the IRLS algorithm for estimating β and show that the weight matrix W and the vector of adjusted dependent responses z are given by (12.5.1) and (12.5.2), respectively.

12.7. Consider the Poisson model. For i = 1, · · · , N, show that

(a) the Anscombe residuals are given by

r_{i,A} = \frac{3(Y_i^{2/3} - \hat{µ}_i^{2/3})}{2\hat{µ}_i^{1/6}}.

(b) the deviance residuals are given by

r_{i,D} = \mathrm{sign}(Y_i - \hat{µ}_i)[2(Y_i \log(Y_i/\hat{µ}_i) - Y_i + \hat{µ}_i)]^{1/2}.

12.8. [Cameron and Trivedi (2013)] Let Y follow a negative binomial distribution with p.d.f.

f(y; µ, α) = \frac{\Gamma(y + α^{-1})}{\Gamma(y + 1)\Gamma(α^{-1})} \left(\frac{α^{-1}}{α^{-1} + µ}\right)^{α^{-1}} \left(\frac{µ}{α^{-1} + µ}\right)^{y},

for y = 0, 1, 2, . . . , where α ≥ 0. Let µ_i = \exp(x_i'β).


(a) Show that

\log L(α, β) = \sum_{i=1}^{N}\left\{\left(\sum_{j=0}^{y_i-1}\log(j + α^{-1})\right) - \log\Gamma(y_i + 1) - (y_i + α^{-1})\log(1 + α\exp(x_i'β)) + y_i\log α + y_i x_i'β\right\}.

(b) For fixed α, write down the steps in the IRLS algorithm for estimating β under the log link.

12.9. [Baker (1994)] Consider a categorical variable with J mutually exclusive levels. For i = 1, · · · , N, and j = 1, · · · , J, suppose that Y_{ij} are non-negative integers with \sum_{j=1}^{J} Y_{ij} = Y_{i·}. Given Y_{i·}, suppose that the random vector y_i = (Y_{i1}, · · · , Y_{iJ})' has a multinomial distribution, Multinom(Y_{i·}, π_{i1}, · · · , π_{iJ}), with p.m.f.

p(y_i; π_i) = \frac{y_{i·}!}{\prod_{j=1}^{J} y_{ij}!} \prod_{j=1}^{J} π_{ij}^{y_{ij}},

where π_i = (π_{i1}, · · · , π_{iJ})', with 0 ≤ π_{ij} ≤ 1 and \sum_{j=1}^{J} π_{ij} = 1. Assume that π_{ij} ∝ \exp(β_j), where β_j is a k-dimensional vector of unknown regression coefficients. Discuss the "Poisson trick" for modeling the categorical response data.


13 Special Topics

This chapter describes a few special topics which extend the general linear model in useful directions. Section 13.1 introduces multivariate linear models for vector-valued responses. Longitudinal models are described in Section 13.2. Section 13.3 discusses a non-normal alternative to the GLM via an elliptically contoured linear model. Section 13.4 discusses the hierarchical Bayesian linear model framework of Lindley and Smith, while Section 13.5 presents the dynamic linear model estimated by Kalman filtering and smoothing.

13.1 Multivariate general linear models

The theory of linear models in Chapters 4 and 7 can be extended to the situation where each subject produces a fixed number of responses that are typically correlated. This Section provides an introduction to multivariate linear models for such responses. For a more comprehensive discussion of the theory and application of multivariate linear models, we refer the reader to Mardia (1970) and Johnson and Wichern (2007).

13.1.1 Model definition

The multivariate general linear model (MV GLM) has the form

Y = XB + E,   (13.1.1)

where

Y = \begin{pmatrix} Y_{11} & Y_{12} & \cdots & Y_{1m} \\ Y_{21} & Y_{22} & \cdots & Y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ Y_{N1} & Y_{N2} & \cdots & Y_{Nm} \end{pmatrix}  and  E = \begin{pmatrix} ε_{11} & ε_{12} & \cdots & ε_{1m} \\ ε_{21} & ε_{22} & \cdots & ε_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ ε_{N1} & ε_{N2} & \cdots & ε_{Nm} \end{pmatrix}

are N × m matrices, with Y consisting of m responses on each of N subjects and E consisting of corresponding errors,

X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1k} \\ 1 & X_{21} & \cdots & X_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & X_{N1} & \cdots & X_{Nk} \end{pmatrix}

is an N × (k + 1) matrix whose first column is a vector of 1's (corresponding to an intercept) and whose remaining columns consist of observations on k explanatory variables,


and

B = \begin{pmatrix} β_{01} & β_{02} & \cdots & β_{0m} \\ β_{11} & β_{12} & \cdots & β_{1m} \\ \vdots & \vdots & & \vdots \\ β_{k1} & β_{k2} & \cdots & β_{km} \end{pmatrix}

is a (k + 1) × m matrix of regression coefficients. Let N > p = k + 1 and r = r(X). This accommodates the full-rank multivariate linear regression model when r(X) = p, as well as the less than full-rank multivariate ANOVA (MANOVA) model useful for designed experiments when r(X) < p. We can also consider a model without an intercept, by omitting the first column of X and the first row of B. In the rest of the section, we will denote the rows of a matrix by x_i', y_j', etc., and the columns by x_{(i)}, y_{(j)}, etc. First,

X = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix} = \begin{pmatrix} 1_N & x_{(1)} & \cdots & x_{(k)} \end{pmatrix},

with x_i' the vector of observations from the ith subject on all the explanatory variables, and x_{(j)} the vector of observations from all the subjects on the jth explanatory variable. Next, let

Y = \begin{pmatrix} y_1' \\ \vdots \\ y_N' \end{pmatrix}  and  y_i' = x_i'B + ε_i',  i = 1, · · · , N,   (13.1.2)

with y_i being the m-dimensional response vector for the ith subject, and ε_i denoting the m-dimensional vector of unobserved errors for the ith subject. Clearly, if m = 1, the model (13.1.1) is reduced to the model (4.1.1). We can also write

Y = (y_{(1)}  y_{(2)}  · · ·  y_{(m)}),  E = (ε_{(1)}  ε_{(2)}  · · ·  ε_{(m)}),  B = (β_{(1)}  β_{(2)}  · · ·  β_{(m)}).

Then the MV GLM can be written as m univariate linear models,

y_{(1)} = β_{01} 1_N + β_{11} x_{(1)} + · · · + β_{k1} x_{(k)} + ε_{(1)}
y_{(2)} = β_{02} 1_N + β_{12} x_{(1)} + · · · + β_{k2} x_{(k)} + ε_{(2)}
  ⋮
y_{(m)} = β_{0m} 1_N + β_{1m} x_{(1)} + · · · + β_{km} x_{(k)} + ε_{(m)}.

That is, each column of Y can be written as the GLM in (4.1.1):

y_{(j)} = Xβ_{(j)} + ε_{(j)}.   (13.1.3)

While (13.1.1) can be expressed as m univariate GLMs, the reason for carrying out a multivariate analysis rather than m separate univariate analyses is to accommodate the dependence in the errors, which is important for carrying out hypothesis testing and confidence region construction. To extend the basic assumption (4.1.4) for the GLM, suppose


that in (13.1.1), the error vectors of different subjects are uncorrelated, each with zero mean and the same unknown covariance Σ. That is,

E(ε_i) = 0,  Cov(ε_i, ε_j) = \begin{cases} Σ, & \text{if } i = j, \\ O & \text{otherwise,} \end{cases}  i, j = 1, . . . , N.   (13.1.4)

Column-wise, if Σ = {σ_{ij}}, this can then be written as

E(ε_{(i)}) = 0,  Cov(ε_{(i)}, ε_{(j)}) = σ_{ij} I_N,  i, j = 1, . . . , m.   (13.1.5)

Alternately, we can write (13.1.3) and (13.1.5) as

\begin{pmatrix} y_{(1)} \\ y_{(2)} \\ \vdots \\ y_{(m)} \end{pmatrix} = \begin{pmatrix} X & O & \cdots & O \\ O & X & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & X \end{pmatrix} \begin{pmatrix} β_{(1)} \\ β_{(2)} \\ \vdots \\ β_{(m)} \end{pmatrix} + \begin{pmatrix} ε_{(1)} \\ ε_{(2)} \\ \vdots \\ ε_{(m)} \end{pmatrix}   (13.1.6)

and

Cov\begin{pmatrix} ε_{(1)} \\ ε_{(2)} \\ \vdots \\ ε_{(m)} \end{pmatrix} = \begin{pmatrix} σ_{11} I_N & σ_{12} I_N & \cdots & σ_{1m} I_N \\ σ_{21} I_N & σ_{22} I_N & \cdots & σ_{2m} I_N \\ \vdots & \vdots & \ddots & \vdots \\ σ_{m1} I_N & σ_{m2} I_N & \cdots & σ_{mm} I_N \end{pmatrix},   (13.1.7)

respectively. Using the notation in Section 2.8, the above expressions can be compactly written as

vec(Y) = (I_m ⊗ X) vec(B) + vec(E),   (13.1.8)

E(vec(E)) = 0,  Cov(vec(E)) = Σ ⊗ I_N.   (13.1.9)

13.1.2 Least squares estimation

Similar to the GLM, least squares (LS) solutions for (13.1.1) are values of B that minimize

S(B) = ‖Y - XB‖² = tr((Y - XB)'(Y - XB)) = \sum_{i=1}^{N}\sum_{j=1}^{m} (y_{ij} - x_i'β_{(j)})²,

where ‖ · ‖ is the Frobenius norm of a matrix (see Section 1.3). Let G = (X'X)^{-}.

Result 13.1.1. Multivariate LS solution. Let

B^0 = GX'Y,   (13.1.10)

or equivalently,

β^0_{(j)} = GX'y_{(j)},  j = 1, · · · , m.   (13.1.11)

Then B^0 minimizes S(B). We will refer to B^0 as a LS solution.

Proof. Minimizing S(B) = \sum_{i=1}^{m} S_i(β_{(i)}) is equivalent to minimizing each term S_i(β) = (y_{(i)} - Xβ)'(y_{(i)} - Xβ). By (4.2.4), β^0_{(i)} minimizes S_i(β_{(i)}).
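For a full-rank X, the LS solution is easy to verify numerically. The sketch below (ours) compares the closed form with R's lm() applied to a matrix response, using the mtcars data that appear in Numerical Example 13.1.

data(mtcars)
Y   <- as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")])
X   <- model.matrix(~ cyl + am + carb, data = mtcars)
B0  <- solve(crossprod(X), crossprod(X, Y))      # (X'X)^{-1} X'Y
fit <- lm(Y ~ cyl + am + carb, data = mtcars)    # multivariate ("mlm") fit
all.equal(unname(B0), unname(coef(fit)))         # TRUE: identical LS estimates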


As a passing remark, (13.1.11) is just one of many ways to construct linear functions of Y to minimize S(B). If β^0_{(j)} = G_j X'y_{(j)}, j = 1, . . . , m, where G_j denotes a different g-inverse of X'X, then B^0 minimizes S(B) as well. However, in this book, by an LS solution, we specifically mean B^0 of the form (13.1.10).

Definition 13.1.1. Fitted values and residuals. The fitted value and residual of the ith subject's jth response are \hat{y}_{ij} = x_i'β^0_{(j)} and \hat{ε}_{ij} = y_{ij} - \hat{y}_{ij}, respectively. The matrix of fitted values is

\hat{Y} = \{\hat{y}_{ij}\} = XB^0,  i.e.,  \hat{y}_{(j)} = Xβ^0_{(j)},  j = 1, . . . , m.   (13.1.12)

The matrix of residuals is

\hat{E} = \{\hat{ε}_{ij}\} = Y - \hat{Y} = Y - XB^0,  i.e.,  \hat{ε}_{(j)} = y_{(j)} - \hat{y}_{(j)}.   (13.1.13)

Let P be the orthogonal projection matrix onto C(X). As we saw in Chapter 4,

\hat{Y} = PY,  \hat{E} = (I_N - P)Y = (I_N - P)E.   (13.1.14)

Corresponding to Result 4.2.4 for the GLM, we consider the mean and covariance for the multivariate LS solution. Note that with \hat{Y} and \hat{E} being matrix-valued random variables, their covariances are characterized by covariances of their vectorized versions, or covariances between their individual columns.

Result 13.1.2.

1. Let H = GX'X. For B^0 in (13.1.10), E(B^0) = HB and Cov(vec(B^0)) = Σ ⊗ (GX'XG'), i.e., Cov(β^0_{(i)}, β^0_{(j)}) = σ_{ij} GX'XG', i, j = 1, · · · , m.

2. E(\hat{Y}) = XB, E(\hat{E}) = O, Cov(vec(\hat{Y})) = Σ ⊗ P, and Cov(vec(\hat{E})) = Σ ⊗ (I - P), i.e., Cov(\hat{y}_{(i)}, \hat{y}_{(j)}) = σ_{ij} P and Cov(\hat{ε}_{(i)}, \hat{ε}_{(j)}) = σ_{ij}(I - P), i, j = 1, · · · , m.

3. B^0 and \hat{E} are uncorrelated, i.e., Cov(β^0_{(i)}, \hat{ε}_{(j)}) = O for all i, j = 1, . . . , m. In particular, \hat{Y} = XB^0 and \hat{E} are uncorrelated.

4. Let

\hat{Σ} = \frac{1}{N - r}\, \hat{E}'\hat{E}.   (13.1.15)

Then \hat{Σ} is an unbiased estimator of Σ, i.e., E(\hat{Σ}) = Σ. Equivalently,

\hat{σ}_{ij} = \frac{1}{N - r}\, \hat{ε}_{(i)}'\hat{ε}_{(j)}

is an unbiased estimator of σ_{ij}.

Proof. The proof of properties 1–3 closely follows the proof of the corresponding properties in Result 4.2.4; we therefore leave it as Exercise 13.1. To show property 4, it suffices to show that for each 1 ≤ i ≤ j ≤ m, E(\hat{ε}_{(i)}'\hat{ε}_{(j)}) = (N - r)σ_{ij}. From (13.1.14), \hat{ε}_{(i)}'\hat{ε}_{(j)} = ε_{(i)}'(I_N - P)ε_{(j)}. Then E(\hat{ε}_{(i)}'\hat{ε}_{(j)}) = E(ε_{(i)}'(I_N - P)ε_{(j)}) = tr(E[(I_N - P)ε_{(j)}ε_{(i)}']) = tr(σ_{ij}(I_N - P)) = σ_{ij}(N - r). The following corollary is an immediate consequence of Result 13.1.2.

Multivariate general linear models Corollary 13.1.1.

395

If r(X) = p, the following properties hold.

1. (X0 X)−1 exists and the least squares estimator of B is given by b = (X0 X)−1 X0 Y. B

(13.1.16)

Equivalently, the least squares estimator of β (i) is b = (X0 X)−1 X0 y(i) , i = 1, · · · , m. β (i)

(13.1.17)

b = B. 2. E(B) b ,β b ) = σij (X0 X)−1 for i, j = 1, · · · , m. 3. Cov(β (i) (j) Corresponding to the ANOVA decomposition of sum of squares (SS) for the GLM, the Multivariate ANOVA (MANOVA) decomposition of sum of squares and cross-products (SSCP) can be established for the MV GLM. Define 0 Total SSCP = y1 y10 + · · · + yN yN = Y0 Y, b 0 Y, b b1 y b0 + · · · + y bN y b0 = Y Model SSCP = y

Error SSCP =

1 0 b ε1 b ε1 +

···

N 0 +b εN b εN =

b 0 E. b E

b 0E b = Y0 P(IN − P)E = O. Since Y = Y b + E, b we see that Y0 Y = From (13.1.14), Y 0 0 b Y b +E b E, b i.e., Y Total SSCP = Model SSCP + Error SSCP. The MANOVA decomposition can also be established for the corrected SSCP’s. Let PN y = N −1 i=1 yi . The (mean-)corrected version of the total SSCP is then Corrected Total SSCP =

N X

(yi − y)(yi − y)0 =

i=1

N X

yi yi0 − N y y0 .

i=1

b 0 Y− b The mean-corrected versions of the Model SSCP and the Error SSCP are presumably Y 0 0b 0 b b b b N y y and E E − Nb ε ε , respectively. However, suppose 1N ∈ C(X), which holds for any b 0 1N /N = Y0 P1N /N = Y0 1N /N = y and b b = Y ε = model with an intercept, then y 0 0 b E 1N /N = Y (IN − P)1N /N = 0, so that b 0Y b − N y y0 , Corrected Model SSCP = Y while the Error SSCP needs no correction. Then Corrected Total SSCP = Corrected Model SSCP + Error SSCP. Note that the (i, j)th element of the Corrected Total SSCP matrix is proportional to the sample covariance between the ith and jth response variables, while the diagonal elements are the corrected (univariate) SST for the corresponding univariate response. Similarly, the diagonal elements of the Corrected Model SSCP and Error SSCP are respectively SSR and SSE for the corresponding univariate response.

ISTUDY

396

13.1.3

Special Topics

Estimable functions and Gauss–Markov theorem

Estimable of the linear form P Pm functions of B in the MV GLM (13.1.1) are functions p constants. Let C = {cij } ∈ Rp×m . Then, the linear form i=1 j=1 cij βij , where cij are Pm can be written as tr(C0 B) = j=1 c0(j) β (j) . This function is said to be estimable if it is PN Pm equal to E( i=1 j=1 tij yij ) = E(tr(T0 Y)) for some T = {tij } ∈ RN ×m . Result 13.1.3. properties hold.

The function tr(C0 B) is estimable if and only if any one of the following

1. Each c(j) belongs to R(X), or equivalently, C0 = T0 X for some T ∈ RN ×m . 2. C0 = C0 H, where H = GX0 X. 3. tr(C0 B0 ) is invariant to the LS solution B0 . Proof. Since tr(C0 B) = vec(C)0 vec(B), by applying Result 4.3.1 to the vectorized expression (13.1.8) of the model, tr(C0 B) is estimable if and only if vec(C) ∈ R(Im ⊗ X) = R(X) ⊕ · · · ⊕ R(X). It follows that tr(C0 B) is estimable if and only if c(i) ∈ R(X) for every i, proving property 1. That properties 1 and 2 are equivalent follows from property 4 of Result 4.3.1. To prove property 3, observe that vec(B0 ) is a LS solution to (13.1.8). Then, from property 5 of Result 4.3.1, if tr(C0 B) is estimable, then tr(C0 B0 ) = vec(C)0 vec(B0 ) is invariant to B0 . However, the proof of the converse is somewhat subtle. This is because it follows from the remark below the proof of Result 13.1.1 that solutions of the form vec(B0 ) do not exhaust the LS solutions for (13.1.8); hence, we cannot prove the converse by directly applying property 5 of Result 4.3.1 to (13.1.8). Instead, we have C = C0 + D, where C(C0 ) ⊂ R(X) and C(D) ⊥ R(X). From what was just shown, tr(C00 B0 ) is invariant to B0 . Thus, if tr(C0 B0 ) is invariant to B0 = GX0 Y, then so is tr(D0 B0 ), yielding the invariance of E(tr(D0 B0 )) = tr(D0 GX0 XB) to G. Since XD = O, for any p × m matrix M, G + DM0 is also a g-inverse of X0 X, so the invariance yields tr(D0 DM0 X0 XB) = 0, which holds for all B. As a result, D0 DM0 X0 X = O. Given any u ∈ Rp and v ∈ Rm , let M = uv0 . Then (D0 Dv)(X0 Xu)0 = O. Since X 6= O, there is u such that X0 Xu 6= 0, which then leads to D0 Dv = 0. Since v is arbitrary, D0 D = O, so D = O. As a result, C0 = C00 = T0 X. Then from property 1, tr(C0 B) is estimable, completing the proof of property 3.  We now generalize the Gauss–Markov theorem to the MV GLM. In the multivariate setting, a linear estimator is any estimator of the form d0 + tr(D0 Y), where d0 is a constant, and D is an N × m matrix of constants. Result 13.1.4. Gauss–Markov theorem for the MV GLM. Let the assumptions in (13.1.4) hold for nonsingular Σ. Let tr(C0 B) be an estimable function of B and let B0 be any LS solution. Then, tr(C0 B0 ) is the unique b.l.u.e. of tr(C0 B), with variance Var(tr(C0 B0 )) = tr(ΣC0 GC). Finally, for two estimable functions tr(Ci B), i = 1, 2, Cov(tr(C01 B0 ), tr(C02 B0 )) = tr(ΣC01 GC2 ). Proof. From Result 13.1.3, C0 = T0 X for some matrix T. Then E(tr(C0 B0 )) = b which by property 2 is equal to tr(T0 XB) = tr(C0 B). ThereE(tr(T0 XB0 )) = E(tr(T0 Y)), fore C0 B0 is an unbiased estimator of tr(C0 B). Let d0 + tr(D0 Y) be any unbiased estimator of the function. Taking expectation yields d0 + tr(D0 XB) = tr(C0 B). Since the equality b + tr(D0 E). b From property 3, tr(D0 Y) b holds for all B, C0 = D0 X. Then tr(D0 Y) = tr(D0 Y) 0 0 0b 0b 0 and tr(D E) are uncorrelated. Meanwhile, tr(D Y) = tr(D XB ) = tr(CB ). As a result,

ISTUDY

Multivariate general linear models

397

b ≥ Var(tr(CB0 )) and equality holds if and Var(tr(D0 Y)) = Var(tr(CB0 )) + Var(tr(D0 E) b = 0. Since E(tr(D0 E)) b = 0, the latter is equivalent to tr(D0 E) b = 0, or only if Var(tr(D0 E)) 0 0 0b equivalently, tr(D E) = tr(CB ), so that tr(CB ) is the unique b.l.u.e. Next, since tr(C0 B0 )) = vec(C)0 vec(B0 ), we have Var(tr(C0 B0 )) = vec(C)0 Cov(vec(B0 )) vec(C). From property 1 of Result 13.1.2, the right side is equal to vec(C)0 (Σ ⊗ (GX0 XG0 )) vec(C) =

m X

c0(i) (σij GX0 XG0 )c(j) ,

i,j=1

Pm 0 which, according to property 2 of Result 13.1.3, is equal to i,j=1 σij c(i) Gc(j) = 0 tr(ΣC GC). Finally, the statement on the covariance of two estimable functions follows by a completely similar argument.  The result below holds as an application of the above results. Result 13.1.5.

Let c ∈ R(X).

1. Each component of B0 c is estimable. 2. E(B00 c) = B0 c and Cov(B00 c) = (c0 Gc)Σ. Proof. The components of B0 c are c0 β (i) = c0 Bei = tr(ei c0 B), i = 1, . . . , m. Let c0 = t0 X. Then ei c0 = T0 X with T = te0i , so from Result 13.1.3, c0 β (i) is estimable, proving property 1. From Result 13.1.4, the first equality in property 2 is clear, and Cov(c0 β 0(i) , c0 β 0(j) ) = Cov(ei c0 B0 , ej c0 B0 ) = tr(Σei c0 Gce0j ) = tr(e0j Σei c0 Gc) = (c0 Gc)σij , so the second equality holds as well.  Numerical Example 13.1. The data “mtcars” was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). The data set consists of 11 variables: mpg (miles/gallon), cyl (number of cylinders), disp (displacement in cu.in.), hp (gross horsepower), drat (rear axle ratio), wt (weight in 1000 lbs), qsec (1/4 mile time), vs (engine: 0 = v-shaped, 1 = straight), am (transmission: 0 = automatic, 1 = manual), gear (number of forward gears), and carb (number of carburetors). data(mtcars) Y |t|) (Intercept) -110.123 47.973 -2.296 0.0294 * cyl 59.001 8.275 7.130 9.3e-08 *** am -35.850 25.213 -1.422 0.1661 carb -3.434 7.814 -0.440 0.6637 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 53.04 on 28 degrees of freedom Multiple R-squared: 0.8346,Adjusted R-squared: 0.8168 F-statistic: 47.08 on 3 and 28 DF, p-value: 4.588e-11 Response hp : Call: lm(formula = hp ~ cyl + am + carb, data = mtcars) Residuals: Min 1Q -58.504 -15.061

Median -1.157

3Q 23.414

Max 48.516

Coefficients: Estimate Std. Error t value (Intercept) -63.981 26.816 -2.386 cyl 25.726 4.626 5.561 am 11.604 14.094 0.823 carb 16.632 4.368 3.808 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01

Pr(>|t|) 0.024041 * 6e-06 *** 0.417291 0.000702 *** ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

ISTUDY

Multivariate general linear models

399

Residual standard error: 29.65 on 28 degrees of freedom Multiple R-squared: 0.8311,Adjusted R-squared: 0.813 F-statistic: 45.92 on 3 and 28 DF, p-value: 6.129e-11 Response wt : Call: lm(formula = wt ~ cyl + am + carb, data = mtcars) Residuals: Min 1Q Median -0.61492 -0.32981 -0.06998

3Q 0.12396

Max 1.23908

Coefficients: Estimate Std. Error t value (Intercept) 1.88716 0.45627 4.136 cyl 0.21007 0.07871 2.669 am -0.99370 0.23980 -4.144 carb 0.15429 0.07432 2.076 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01

Pr(>|t|) 0.000291 0.012512 0.000285 0.047186

*** * *** *

‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5045 on 28 degrees of freedom Multiple R-squared: 0.7599,Adjusted R-squared: 0.7341 F-statistic: 29.54 on 3 and 28 DF, p-value: 8.088e-09 b can be pulled out as follows. The estimated coefficient matrix B coef(mvmod) mpg disp hp wt (Intercept) 32.173065 -110.12333 -63.98118 1.8871637 cyl -1.717492 59.00099 25.72551 0.2100744 am 4.242978 -35.85045 11.60366 -0.9936961 carb -1.130370 -3.43438 16.63221 0.1542897

13.1.4

N

Maximum likelihood estimation

We now consider maximum likelihood (ML) estimation for B and Σ and describe the sampling distributions of the estimators under the following normality assumption on the errors: E = (ε1 , · · · , εN )0 with εi being i.i.d. ∼ Nm (0, Σ), Σ is p.d.

(13.1.18)

As before, we write B = (β (1) , · · · , β (m) ). Result 13.1.6.

Suppose E satisfies (13.1.18).

1. Let Q be an N × N orthogonal projection matrix of rank r. Then, E0 QE ∼ Wm (Σ, r). 2. Let M1 and M2 be two matrices of constants, each having N rows, such that M01 M2 = O. Then, M01 E and M02 E are independent.

ISTUDY

400

Special Topics

Proof. To prove property 1, Q = P0 diag(1r , O)P for some orthogonal matrix P. Then 0 E QE = (PE)0 diag(1r , O)(PE). From the normality assumption (13.1.18), for 1 ≤ i, j ≤ m, Cov(Pε(i) , Pε(j) ) = P Cov(ε(i) , ε(j) )P0 = P(σij IN )P0 = σij IN = Cov(ε(i) , ε(j) ). The joint distribution of the Pε(i) ’s is the same as the ε(i) ’s, soPPE ∼ E. Then, E0 QE = r E0 P diag(1r , O)PE ∼ E0 diag(1r , O)E = E0 diag(1r , O)E = i=1 εi ε0i ∼ Wm (Σ, r). The proof of property 2 is left as Exercise 13.2.  b M L , is equal to its LS Result 13.1.7. Suppose r(X) = p. The MLE of B, denoted by B b estimate B given in (13.1.16) and the MLE of Σ is b 0E b = (1 − 1/N )Σ. b b ML = 1 E Σ N

(13.1.19)

The maximized likelihood is b M L, Σ b M L ; Y) = L(B

(2πe)−N m/2 . b M L |N/2 |Σ

(13.1.20)

Proof. We follow the proof of Result 6.1.4. From the assumption, y1 , · · · , yN are independent with yi ∼ Nm (B0 xi , Σ). The likelihood function is L(B, Σ; Y) =

(2π)−N m/2 −V /2 e , |Σ|N/2

PN where V = i=1 (yi −B0 xi )0 Σ−1 (yi −B0 xi ) = tr((Y −XB)0 Σ−1 (Y −XB)). From property 3 of Result 1.3.5, V = tr(Σ−1 (Y − XB)0 (Y − XB)) b − X(B − B)] b 0 [E b − X(B − B)]). b = tr(Σ−1 [E b 0 X = E0 (I − P)X = O, Since E b 0E b + (B b − B)0 X0 X(B b − B)]) = tr(Σ−1 E b 0 E) b + tr(D), V = tr(Σ−1 [E b − B)Σ−1 (B b − B)0 X0 ). Then, where D = X(B   1 (2π)−N m/2 1 −1 b 0 b exp − L(B, Σ; Y) = tr(Σ E E) − tr(D) . 2 2 |Σ|N/2 Since D is n.n.d., tr(D) ≥ 0 with tr(D) = 0 if and only if D = O. Since Σ−1 is p.d. b Hence B b ML = B b and it only and X has full column rank, D = O if and only if B = B. b 0 E)}. b From Example 2.4.4, (13.1.19) follows. remains to maximize |Σ|−N/2 exp{− 12 tr(Σ−1 E b M L and Σ = Σ b M L in the above display. With tr(Σ b −1 E b 0 E) b = tr(N Im ) = N m Let B = B ML and D = O, the maximized likelihood is obtained as in (13.1.20).  Result 13.1.8. Under the condition in Result 13.1.7, the following properties hold: b ∼ Np (β , σii (X0 X)−1 ) and Cov(β b ,β b ) = σi` (X0 X)−1 . We may also write this 1. β (i) (i) (i) (`) 0 −1 b ∼ Npm (vec(B), Σ ⊗ (X X) )). as vec(B) b 0E b ∼ Wm (Σ, N − r), and B b and E b are independently distributed. 2. If r = r(X), then E Proof. Property 1 directly follows from property 1 of Result 13.1.2. From (13.1.14), b 0E b = E0 (IN − P)E. Since IN − P is symmetric and idempotent with rank N − r, its E b = GX0 Y = HB + GX0 E, distribution follows from property 1 of Result 13.1.6. Since B 0 b = (IN − P)E, and GX (IN − P) = O, from property 2 of Result 13.1.6, B b and E b are E independent, completing the proof of property 2. 

ISTUDY

Multivariate general linear models

13.1.5

401

Likelihood ratio tests for linear hypotheses

The likelihood ratio test (LRT) is a standard tool for testing a linear hypothesis on the MV GLM (13.1.1). To illustrate the fundamental idea, we will only consider a particular example. More comprehensive accounts of the LRT for hypotheses on the MV GLM can be found in Mardia (1970), chapter 6. Suppose that the errors εi satisfy (13.1.18) and we partition the p × m matrix B as   B1 B= , B2 where B1 ∈ Rq×m and B2 ∈ R(p−q)×m . Partition the N × p matrix X as X = (X1 , X2 ), where X1 ∈ RN ×q and X2 ∈ RN ×(p−q) . Then, (13.1.1) can be written as Y = XB + E = X1 B1 + X2 B2 + E. Suppose r(X) = p. From Result 13.1.3, B2 = (O, Ip−q )B is estimable. Suppose we wish to test H0 : B2 = O, i.e., the responses are not affected by Xq+1 , · · · , Xp . The reduced model under H0 is Y = X1 B1 + E.

(13.1.21)

The LRT statistic for H0 is defined by Λ=

supB1 ,Σ L(B1 , Σ; Y) , supB,Σ L(B, Σ; Y)

where the likelihood in the denominator is evaluated under the full model and the likelihood in the numerator is evaluated under the reduced model (13.1.21). The statistic Λ2/N is known as Wilks’ Lambda. Closely related to the statistic is the Wilks’ Lambda distribution defined below. Definition 13.1.2. If Wi ∼ Wm (Im , νi ), i = 1, 2, are independent and ν1 ≥ m, denote by Λ(m, ν1 , ν2 ) the distribution of |W1 | = |Im + W1−1 W2 |−1 , |W1 + W2 | known as Wilks’ Lambda distribution, with parameters m, ν1 , and ν2 . Result 13.1.9. 1. For ν1 , ν2 ≥ 1, the Wilks’ Lambda distribution Λ(1, ν1 , ν2 ) is identical to the Beta(ν1 /2, ν2 /2) distribution. 2. Let Σ be a k × k p.d. matrix. Let Wi ∼ Wm (Σ, νi ), i = 1, 2, be independent and let ν1 ≥ m. Then |W1 |/|W1 + W2 | ∼ Λ(m, ν1 , ν2 ). Proof. First, Λ(1, ν1 , ν2 ) is the distribution of X/(X +Y ), where X ∼ W1 (1, ν1 ) and Y ∼ W1 (1, ν2 ) are independent. For any n ≥ 1, W1 (1, n) is identical to χ2n , i.e., Gamma(n/2, 2). Then X/(X +Y ) ∼ Beta(ν1 /2, ν2 /2), proving property 1. To prove property 2, let Γ = Σ1/2 . From Exercise 6.2, Γ−1 Wi Γ−1 ∼ Wm (Im , νi ) are independent, giving |W1 |/|W1 + W2 | = |Γ−1 W1 Γ−1 |/|Γ−1 W1 Γ−1 + Γ−1 W2 Γ−1 | ∼ Λ(m, ν1 , ν2 ). 

ISTUDY

402

Special Topics

b M L be the MLE of Σ under the full model (13.1.1), and Σ b 0,M L Result 13.1.10. Let Σ be the MLE under the reduced model (13.1.21). 1. The LRT statistic for H0 : B2 = O has the form Λ=

b ML |N/2 |Σ . b 0,M L |N/2 |Σ

(13.1.22)

2. Under H0 , the distribution of the Wilks’ Lambda statistic is Λ2/N =

b ML | |Σ ∼ Λ(m, N − p, p − q). b 0,M L | |Σ

(13.1.23)

3. For large samples, the modified LRT statistic   b ML | |Σ m+p+q+1 log − N− b 0,M L | 2 |Σ has an approximate χ2m(p−q) distribution under H0 . This is known as Bartlett’s chisquare approximation (Bartlett, 1938; Mardia et al., 1979). Proof. Property 1 directly follows from (13.1.20) in Result 13.1.7, where the formula is applied twice, once under the full model, and once again under the reduced model. From Result 13.1.7 and (13.1.14), Λ2/N =

|E0 (IN − P)E| |E0 (IN − P)E| = , |E0 (IN − P1 )E| |E0 (IN − P)E + E0 (P − P1 )E|

where P and P1 are respectively the orthogonal projections onto C(X) and C(X1 ). Both IN − P and P − P1 are orthogonal projection matrices, with ranks N − p and p − q, respectively. Also, (IN − P)(P − P1 ) = O. Then, by Result 13.1.6, E0 (IN − P)E ∼ Wm (Σ, N − p), E0 (P − P1 )E ∼ Wm (Σ, p − q), and the two random matrices are independent. Property 2 then follows from property 2 of Result 13.1.9. The proof of property 3 is beyond the scope of this book but can be found in Bartlett (1938). 

13.1.6

Confidence region and Hotelling T 2 distribution

Suppose we wish to estimate the mean response vector y0 = B0 x0 for a vector of predictors x0 . Since we have assumed that r(X) = p, from Result 13.1.5, B0 x0 is estimable and the b 0 x0 . Suppose the errors are i.i.d. N (0, Σ) random vectors, where Σ is b0 = B estimate is y p.d. From property 1 in Result 13.1.8, and property 2 of Result 13.1.5, b0 ∼ Nm (y0 , (x00 Gx0 )Σ). y

(13.1.24)

From Result 5.4.5, (x00 Gx0 )−1 (b y0 −y0 )0 Σ−1 (b y0 −y0 ) ∼ χ2m . If Σ were known, a 100(1−α)% 0 confidence region for B x0 would be  y0 ∈ Rm : (x00 Gx0 )−1 (b y0 − y0 )0 Σ−1 (b y0 − y0 ) ≤ χ2m,α , where χ2m,α is the upper (100α)th percentile from the χ2m distribution. Since Σ is usually b given in (13.1.15) unknown, a natural idea is to replace it with the unbiased estimate Σ and replace χ2m,α with the corresponding critical value, as discussed below.

ISTUDY

Multivariate general linear models

403

Definition 13.1.3. Let z ∼ Nm (0, Im ) and W ∼ Wm (Im , ν) be independent of z. Denote by T 2 (m, ν) the distribution of z0 (W/ν)−1 z = νz0 W−1 z, known as Hotelling T 2 distribution with parameters m and ν. Result 13.1.11. Let z ∼ Nm (0, Σ) and W ∼ Wm (Σ, ν) be independent of z. Then νz0 W−1 z ∼ T 2 (m, ν). Proof. Let R = Σ1/2 ; see property 2 in Result 2.4.5. Then y = R−1 z ∼ Nm (0, Im ) and M = R−1 WR−1 ∼ Wm (Im , ν) are independent. As a result, νz0 W−1 z = ν(Ry)0 (RMR)−1 (Ry) = νy0 M−1 y ∼ T 2 (m, ν), yielding the proof.  Result 13.1.12.

Recall that X ∈ RN ×p has full column rank.

b −1 (b 1. (x00 Gx0 )−1 (b y0 − y0 )0 Σ y0 − y0 ) ∼ T 2 (m, N − p). 2. For α ∈ (0, 1), a 100(1 − α)% confidence region for B0 x0 is n o b −1 (b y0 ∈ Rm : (x00 Gx0 )−1 (b y0 − y0 )0 Σ y0 − y0 ) ≤ T 2 (m, N − p, α) , where T 2 (m, N − r, α) is the upper (100α)th percentile of the T 2 (m, N − p) distribution. b0 − y0 ∼ Nm (0, (x00 Gx0 )Σ). From property 2 in Result 13.1.8, Proof. From (13.1.24), y b 0E b ∼ Wm ((x0 Gx0 )Σ, N − p). Since Σ b =E b 0 E/(N b b0 − y0 is independent of (x00 Gx0 )E y − p), 0 property 1 follows from Result 13.1.11. Property 2 follows immediately from property 1. 

13.1.7

One-factor multivariate analysis of variance model

Suppose we wish to compare the mean vectors from a(> 2) populations. Suppose we have a independent samples, with the `th sample consisting of i.i.d. observations y`,1 , · · · , y`,n` from a Nm (µ` , Σ` ) population with sample mean y` and sample variance-covariance matrix S` . Let the overall sample mean of all the observations be y and N = n1 + · · · + na . The one-factor multivariate analysis of variance (MANOVA) model is y`,j = µ + τ ` + ε`,j , j = 1, · · · , n` , ` = 1, · · · , a,

(13.1.25)

where µ is an m-dimensional vector denoting the overall mean, τ ` is an m-dimensional vector denoting the `th treatment effect, such that a X

n` τ ` = 0.

(13.1.26)

`=1

Also assume that ε`,j are i.i.d. ∼ Nm (0, Σ). Under the restricted one-way MANOVA model, the hypothesis H0 : τ 1 = · · · = τ a = 0 is testable. In Section 13.1.5, LRT was applied to test a linear hypothesis for a full-rank model. The LRT applies equally well here, although the model (13.1.25) is not of full-rank.

ISTUDY

404

Special Topics Source Treatment Residual Total (corrected)

d.f. a−1 N −a N −1

SSCP Pa 0 Θ = P`=1 nP ` (y` − y)(y` − y) a n` W = `=1 j=1 (y`,j − y` )(y`,j − y` )0 Θ+W

TABLE 13.1.1. One-way MANOVA table. b M L = y, τb `,M L = y` − y, ` = 1, · · · , a, and We see that under the restriction (13.1.26), µ b M L = W/N , where Σ W=

n` a X a X X (y`,j − y` )(y`,j − y` )0 = (n` − 1)S` . `=1 j=1

`=1

b 0,M L = y and On the other hand, under the model reduced by H0 , µ b 0,M L = NΣ

n` a X X (y`,j − y)(y`,j − y)0 `=1 j=1

n` a a X X X n` (y` − y)(y` − y)0 = (y`,j − y` )(y`,j − y` )0 + `=1 j=1

`=1

= W + Θ. Note that Θ denotes the Between Groups SSCP, while W denotes the Within Groups SSCP. Under H0 , for each ` = 1, · · · , a, (n` − 1)S` ∼ Wm (Σ, n` − 1), and so W ∼ Wm (Σ, N − a). Meanwhile, Θ ∼ Wm (Σ, a − 1) and is independent of W. These results give the MANOVA decomposition which can be expressed as Total Corrected SSCP = Within Groups SSCP + Between Groups SSCP, with d.f. N − 1, N − a, and a − 1, respectively. The MANOVA decomposition is summarized in Table 13.1.1. Now, as in property 1 of Result 13.1.10, the LRT statistic for H0 is Λ = b M L |N/2 /|Σ b 0,M L |N/2 , and the Wilk’s Lambda statistic is |Σ Λ∗ =

|W| . |Θ + W|

From properties 2–3 of Result 13.1.10, under H0 , Λ∗ ∼ Λ(m, N − a, a − 1). If N is large, then   m+a+2 log Λ∗ ∼ χ2p(a−1) approx. WL = − N − 2 Approximately, at level α, a LRT rejects H0 if W L > χ2m(a−1),α . The statistic Λ∗ can also be written as Λ∗ =

s Y

1

i=1

bi 1+λ

,

b1 , · · · , λ bs are the nonzero eigenvalues of W−1 Θ, with s = r(Θ) = min(p, a − 1). where, λ

ISTUDY

Multivariate general linear models

405

bi ’s and related eigenvalues are shown below. Let Other statistics that are functions of the λ ηb1 , · · · , ηbs be the nonzero eigenvalues of W(B + W)−1 , and let ri2 =

bi λ bi ) (1 + λ

be the canonical correlations. 1. The Hotelling–Lawley Trace is defined as HLT = tr(W−1 B) =

s X

bi = λ

i=1

s X i=1

ri2 . 1 − ri2

2. Pillai’s trace is defined as P T = tr{B(B + W)−1 } =

s X

bi λ

i=1

bi 1+λ

=

s X

ri2 .

i=1

3. Wilk’s statistic, which is based on the LRT statistic is defined as s

WS =

s

Y X |W| bi )−1 = = (1 + λ (1 − ri2 ). |B + W| i=1 i=1

4. Roy’s Maximum Root statistic is defined as RM RT = max ηbi . i=1,··· ,s

For large samples, these statistics are essentially equivalent. It has been suggested that P T is more robust to non-normality than the others. None of these statistics performs uniformly better than the others for all situations, and the choice of statistic is usually left to the user; see Johnson and Wichern (2007) for a discussion. Suppose we reject H0 at level α. Then we can construct Bonferroni intervals for contrasts τ k − τ ` , for ` 6= k = 1, · · · , a. We have τb k = yk − y, i.e., τbki = y ki − y i . Also, τbki − τb`i = y ki − y `i and Var(b τki − τb`i ) = Var(y ki − y `i ) = (1/nk + 1/n` )σii , σii being the ith diagonal element of Σ, and is estimated by Pawii /(N − a), where wii is the ith diagonal element of the Within SSCP matrix and N = `=1 n` . As a result, the 100(1 − α)% Bonferroni intervals for τki − τ`i are given by  1/2 (1/nk + 1/n` )wii y ki − y `i ± tN −a,α/ma(a−1) N −a for all i = 1, · · · , m and for all differences 1 ≤ ` < k ≤ a. Numerical Example 13.2. The data set “Plastic” with 20 observations from the R package heplots describes an experiment to determine the optimum conditions for extruding plastic film. Three responses were measured tear (tear resistance), gloss (film gloss), and opacity (film opacity). The responses were measured in relation to two factors, rate of extrusion (a factor representing change in the rate of extrusion with levels, Low: −10% or High: 10%), and amount of an additive (a factor with levels, Low: 1.0% or High: 1.5%). We model the response as a function of a single factor, rate using the lm() function and the manova() function. The pairs() function shows interesting scatterplots (the figure is not shown here).

ISTUDY

406

Special Topics

library(heplots) data(Plastic) model F) rate 1 0.58638 7.561 3 16 0.002273 ** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 pairs(model) # manova() function fit F) rate 1 0.58638 7.561 3 16 0.002273 ** Residuals 18 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

13.2

N

Longitudinal models

In longitudinal studies, measurements on subjects are obtained repeatedly over time. Measurements obtained by following the subjects forward in time are repeated measures or longitudinal data and are collected prospectively, unlike extraction of data over time from available records which constitutes a retrospective study. Longitudinal studies enable us to estimate changes in the response variable over time and relate them to changes in predictors over time. In this section, we describe general linear models for longitudinal data on m subjects over several time periods. For example, Diggle et al. (1994) in a study of the effect of ozone pollution on growth of Sitka spruce trees, data for 79 trees over two growing seasons were obtained in normal and ozone-enriched chambers. A total of 54 trees were grown with ozone exposure at 70 ppb, while 25 trees were grown in normal (control) atmospheres. The response variable was Y = log(hd2 ), where h is the height and d, the diameter of a tree. For each subject i (i = 1, · · · , m), we observe a response Yij at time j (for j = 1, · · · , ni ). In general, in longitudinal studies, the natural experimental unit is not the individual measurement Yij , but the sequence yi ∈ Rni of repeated measurements over time on the ith subject.

13.2.1

Multivariate model for longitudinal data

Let Yij = β0 + β1 Xij1 + β2 Xij2 + · · · + βk Xijk + εij = x0ij β + εij ,

(13.2.1)

where Yij denotes the response variable, and x0ij = (1, Xij1 , · · · , Xijk ), all observed at times tij , for j = 1, · · · , ni , and for subjects i = 1, · · · , m. In (13.2.1), β = (β0 , β1 , · · · , βk )0

ISTUDY

Longitudinal models

407

is a p-dimensional vector of unknown regression coefficients and εij denotes the random error component with Eεij = 0, and Cov(εij , εij 0 ) 6= 0. In matrix notation, let yi = (Yi1 , · · · , Yini )0 denote the ni -dimensional vector of repeated outcomes on subject i with mean vector E(yi ) = µi = (µi1 , · · · , µini )0 and Cov(yi ) = σ 2 Vi P = σ 2 {vijk }. Let m 0 0 0 y = (y1 , · · · , ym ) denote the N -dimensional response vector, where N = i=1 ni , so that 2 2 Cov(y) = σ V = σ diag(V1 , · · · , Vm ). We can write the model in (13.2.1) as yi = Xi β + εi ,

(13.2.2)

where Xi is an ni × p matrix with xij in the ith row and εi = (εi1 , · · · , εini ). Clearly, y ∼ N (Xβ, σ 2 V), where X = (X01 , · · · , X0m )0 is an N ×p matrix. Although such multivariate models for longitudinal data with a general covariance structure are straightforward to implement with balanced data as shown in Result 13.2.1 (McCulloch et al., 2011), they may be cumbersome when subjects are measured at arbitrary times, or when the dimension of V is large. We consider the case where Vi = V0 for i = 1, · · · , m. Result 13.2.1. Balanced longitudinal data model. For i = 1, · · · , m, assume that ni = n, Xi = X0 ∈ Rn×p , and Vi = V0 ∈ Rn×n . Let µ0 = X0 β and Σ0 = σ 2 V0 . 1. Irrespective of what Σ0 is, the MLE of µ0 is m

b0 = µ

1 X yi , m i=1

(13.2.3)

so that µ b0,j = Y ·,j . b 0 = {b 2. The MLE of Σ0 is Σ σ0,j,` }nj,`=1 with m

σ b0,j,` =

1 X (Yij − Y ·j )(Yi` − Y ·` ). m i=1

(13.2.4)

Proof. Under the assumption, y1 , · · · , ym are i.i.d. ∼ N (µ0 , Σ0 ). Then the result is an immediate consequence of Result 6.1.4.  Remark 1. We see that in Result 13.2.1, y ∼ N (µ, σ 2 V), where µ = 1m ⊗ µ0 and V = Im ⊗ V 0 . Remark 2. (see (4.5.5)):

In general, if V is known, then the MLE of β is the same as its GLS estimator b β(V) = (X0 V−1 X)−1 X0 V−1 y.

(13.2.5)

0 −1 b b SSE(V) = {y − Xβ(V)} V {y − Xβ(V)}.

(13.2.6)

Also,

Remark 3. If V is unknown, but is not a function of β, the MLE of β is obtained by b ML for V in (13.2.5). substituting the ML estimator V Remark 4. For a symmetric weight matrix W, it follows from (4.5.5) that the WLS estimator of β is 0 −1 0 b β X Wy, W LS = (X WX)

Note that W = I yields the OLS estimate of β, while setting W = V−1 leads to the estimator shown in (13.2.5).

ISTUDY

408

Special Topics

Remark 5. It may be possible to consider a structured V instead of letting it be a general symmetric p.d. matrix. We show examples with two possible assumptions on the block-diagonal structure of σ 2 V, each using only two parameters. Example 13.2.1. Uniform correlation model. For each subject i = 1, · · · , m, we assume that Vi = V0 = (1 − ρ)I + ρJ, i.e., we assume a positive correlation ρ between any two measurements on the same subject. This is often called the compound symmetry assumption. One interpretation of this model is that we introduce a random subject effect Ui , which are mutually independent variables with variance ν 2 between subjects, so that Yij = µij + Ui + Zij , j = 1, · · · , n, i = 1, · · · , m, where Ui are N (0, ν 2 ) variables, and Zij are mutually independent N (0, τ 2 ) variables, which are independent of Ui . In this case, ρ = ν 2 /(ν 2 + τ 2 ), and σ 2 = ν 2 + τ 2 .  Example 13.2.2. Exponential correlation model. We assume that the correlation between any two measurements on the same subject decays towards zero as the time separation between the measurements increases, i.e., the (j, k)th element of V0 is Cov(Yij , Yik ) = σ 2 exp{−φ|tij − tik |}. A special case corresponds to equal spacing between observations times; i.e., if tij − tik = d, for k = j −1, and letting ρ = exp(−φd) denote the correlation between successive responses, we can write Cov(Yij , Yik ) = σ 2 ρ|j − k|. One justification for this model is to write Yij = µij + Wij , j = 1, · · · , n, i = 1, · · · , m, where Wij = ρWi,j−1 + Zij , Zij are independently distributed as N (0, σ 2 {1 − ρ2 }) variables. This is often referred to as an AR(1) correlation specification.  Mixed-effects modeling (see Chapter 11) is useful for longitudinal or repeated measures designs which are employed in experiments or surveys where more than one measurement of the same response variable is obtained on each subject, and we wish to examine changes (over time) in measurements taken on each subject (called within-subjects effects). In such cases, all the subjects may either belong to a single homogeneous population, or to a populations which we then compare. Example 13.2.3. Repeated measures design. model

Consider the balanced mixed-effects

Yijk = µ + τi + θj + (τ θ)ij + γk(i) + εijk ,

(13.2.7)

where i = 1, · · · , a, j = 1, · · · , b and k = 1, · · · , c, the parameter µ denotes an overall mean effect, τi is the (fixed) effect due to the ith level of a treatment (Factor A), θj denotes the mean (fixed) effect due to the jth time, (τ θ)ij denotes the (fixed) interaction effect between the ith treatment level and the jth time, γk(i) denotes a random effect of the kth subject Pa Pb (replicate) nested in the ith treatment level. We assume that i=1 τi = 0, j=1 θj = 0, Pa Pb 2 j=1 (τ θ)ij = 0, and γk(i) ∼ N (0, σγ ), distributed independently of εijk ∼ i=1 (τ θ)ij = 0, 2 N (0, σε ). The variance structure in the responses follow. Var(Yijk ) = σε2 + σγ2 , Cov(Yijk , Yij 0 k ) = σγ2 , so that Corr(Yijk , Yij 0 k ) = ρ =

σγ2 . σε2 + σγ2

ISTUDY

Longitudinal models Source Treatment Time

409 d.f. a−1 b−1

SS Pa SSA = bc i=1 (Y i·· − Y ··· )2 Pb SSB = ac j=1 (Y ·j· − Y ··· )2 Pa Pb 2 Treat.×Time (a − 1)(b − 1) SSAB = c i=1 j=1 Rij· (see ∗) Pa Pc Subj. w/i treat. a(c − 1) SSC = c i=1 k=1 (Y i·k − Y i·· )2 Pa Pb Pc 2 Residual a(b − 1)(c − 1) SSE = i=1 j=1 k=1 Rijk (see ∗) Pa Pb Pc Total abc − 1 SSTc = i=1 j=1 k=1 (Yijk − Y ··· )2 ∗ Rij· = Y ij· − Y i·· − Y ·j· + Y ··· , Rijk = Yijk − Y ij· − Y i·k + Y i·· TABLE 13.2.1. ANOVA table for Example 13.2.3. An ANOVA table consists of the following sums of squares with corresponding d.f. Following details in Chapter 11, we test H0 : τi ’s equal (or test H0 : στ2 = 0 if Factor A is a random factor) using F = M SA /M SC with (a − 1, a(c − 1)) d.f. We test H0 : θj ’s equal using F = M SB /M SE with [b − 1, a(b − 1)(c − 1)] d.f. We can also test H0 : (τ θ)ij = 0 (or στ2β = 0) using F = M SAB /M SE with ((a − 1)(b − 1), a(b − 1)(c − 1)) d.f.  Remark 1. The assumption of compound symmetry in a repeated measures design implies an assumption of equal correlation between responses pertaining to the same subject over different time periods. This assumption may not be reasonable in all examples, and we usually test for the validity of the assumption. Remark 2. Greenhouse and Geisser (1959) discussed situations with possibly unequal correlations among the responses at different time periods (lack of compound symmetry), and suggested use of the same F -statistics as in Example 13.2.3, but with reduced d.f., given by f (a − 1) and f (a − 1)(b − 1) respectively for SSB and SSE (see Table 13.2.1), where the factor f depends on the covariances among the Yij ’s. The value of the factor f is not known in practice. One approach is to choose the smallest possible value of f = 1/(a − 1), such that the resulting F -statistic has (1, b − 1) d.f. The actual significance level of the test in this case is smaller than the stated level α, and the test is conservative. However, in cases where b is small, this procedure makes it extremely difficult to reject the null hypothesis. Most software use an F -statistic with fb(a − 1) and fb(a − 1)(b − 1) d.f., where fb is obtained by an approach described in Geisser and Greenhouse (1958).

13.2.2

Two-stage random-effects models

Laird and Ware (1982) described two-stage random-effects models for longitudinal data, which are based on explicit identification of population and individual characteristics. The probability distributions for the response vectors of different subjects are assumed to belong to a single family, but some random-effects parameters vary across subjects, with a distribution specified at the second stage. The two-stage random-effects model for longitudinal data is given below (see (11.1.1)). Stage 1. For each individual i = 1, · · · , m, yi = Xi τ + Zi γ i + εi , i = 1, · · · , m,

(13.2.8)

where τ is a p-dimensional vector of unknown population parameters, Xi is an ni × p known design matrix, γ i is a q-dimensional vector of unknown individual effects, and Zi

ISTUDY

410

Special Topics is a known ni ×q design matrix. The errors εi are independently distributed as N (0, Ri ) vectors, while τ are fixed parameter vectors. Stage 2. The γ i are i.i.d. ∼ N (0, Dγ ), and Cov(γ i , εi ) = 0. The population parameters τ are fixed effects.

For i = 1, · · · , m, yi are independent N (Xi τ , Vi ), where Vi = Ri + Zi Dγ Z0i . Let Wi = Vi−1 . Let θ denote an vector of parameters determining Ri , i = 1, · · · , m, and Dγ . Inference can be carried out using least squares, maximum likelihood or REML approaches. Estimation of mean effects The classical approach to inference derives estimates of τ and θ based on the marginal distri0 bution of y0 = (y10 , · · · , ym ), while an estimate for γ is obtained by use of Harville’s extended Gauss–Markov theorem for mixed-effects models (Harville, 1976) (see Result 11.2.1). Let γ = (γ 01 , · · · , γ 0m )0 . Result 13.2.2 discusses estimation for two cases, when θ is known and when it is unknown. Result 13.2.2. 1. Suppose θ is known. Assuming that the necessary matrix inversions exist, let !−1 m m X X 0 X0i Wi yi , and (13.2.9) τb = Xi Wi Xi i=1

i=1

b i = Dγ Z0i Wi (yi − Xi τb ). γ

(13.2.10)

Then τbPis the MLE of τ , for any vectors of constants c0P ∈ Rp and c1 , · · · , cm ∈ Rq , m m 0b 0b 0 c0 τ + i=1 ci γ i is an essentially-unique b.l.u.e. of c0 τ + i=1 c0i γ i , and !−1 m X Var(b τ) = X0i Wi Xi , and (13.2.11) i=1

  Var(b γ i ) = Dγ Z0i Wi − Wi Xi 

m X

!−1 X0i Wi Xi

X0i Wi

 

Zi D γ .

(13.2.12)



i=1

b τb ) denote the MLEs 2. Suppose, as is usually the case in practice, θ is unknown. Let (θ, of (θ, τ ), respectively. Then, !−1 m m X X 0c c i yi , τb = Xi Wi Xi X0i W (13.2.13) i=1

i=1

b i.e., the value of Wi under θ. b The MLE of Cov(b c i = Wi (θ), where W τ ) follows by c i in (13.2.11). replacing Wi by W Proof. Recall that we estimate τ and θ simultaneously by maximizing their joint likelihood based on the marginal distribution of y. The proof is left as an exercise.  A better assessment of the error of estimation is provided by Var(b γ i − γ i ) = Dγ − Dγ Z0i Wi Zi Dγ +

Dγ Z0i Wi Xi

m X i=1

!−1 X0i Wi Xi

X0i Wi Zi Dγ ,

(13.2.14)

ISTUDY

Longitudinal models

411

which incorporates the variation in γ i . Under normality, the estimator τb is the MVUE, and b is the essentially-unique MVUE of γ (see Result 11.2.1). The estimate τb also maximizes γ the likelihood based on the marginal normal distribution of y. We can write b = E(γ | y, τb , θ), γ b i is the empirical Bayes estimator of γ i . so that γ Estimation of covariance structure Popular approaches for estimating θ are the method of maximum likelihood and the method of restricted maximum likelihood (REML). In balanced ANOVA models, MLE’s of variance components fail to account for the degrees of freedom in estimating the fixed effects τ , and are therefore negatively biased. The REML estimates on the other hand are unbiased; the REML estimate of θ is obtained by maximizing the likelihood of θ based on any full rank set of error contrasts, z = B0 y. Numerical Example 13.3. Repeated measures analysis. The dataset “emotion” in the R package psycho consists of emotional ratings of neutral and negative pictures by healthy participants and consists of 912 rows on 11 variables. library(psycho); library(tidyverse) df % select(Participant_ID, Participant_Sex, Emotion_Condition, Subjective_Valence, Recall) summary(aov(Subjective_Valence ~ Emotion_Condition + Error(Participant_ID/Emotion_Condition), data=df)) Error: Participant_ID Df Sum Sq Mean Sq F value Pr(>F) Residuals 18 115474 6415 Error: Participant_ID:Emotion_Condition Df Sum Sq Mean Sq F value Pr(>F) Emotion_Condition 1 1278417 1278417 245.9 6.11e-12 *** Residuals 18 93573 5198 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Error: Within Df Sum Sq Mean Sq F value Pr(>F) Residuals 874 935646 1070 The ANOVA output from a linear mixed model (LMM) is shown below. library(lmerTest) fit F) Emotion_Condition 1278417 1278417 1 892 1108 0}. Then, y ∼ EN (Xβ, σ 2 IN , h). b and σ Result 13.3.3. Let β bh2 denote the ML estimators of β and σ 2 under the elliptical h linear model (13.3.3). 1. Then, 0 −1 0 b =β b β X y, h M L = (X X) 1 b )0 (y − Xβ b ), (y − Xβ σ bh2 = h h u0

(13.3.4) (13.3.5)

where u0 = arg supu≥0 uN/2 h(u) and may be obtained as a solution to h0 (u) + (N/2u)h(u) = 0. b is 2. The sampling distribution of β h b ∼ Ep (β, σ 2 (X0 X)−1 , h). β h

(13.3.6)

3. An unbiased estimator of σ 2 is σ b∗2 = Proof.

u0 σ bh2 . N −p

(13.3.7)

Property 1 follows from maximizing the logarithm of the likelihood function L(β, σ 2 ; y) ∝ (σ 2 )−N/2 h{σ −2 (y − Xβ)0 (y − Xβ)}.

(13.3.8)

b using (5.5.15) as To show property 2, we derive the MGF of β h M (X(X0 X)−1 X0 t) = ψ(t0 X(X0 X)−1 X0 (σ 2 IN )(X0 X)−1 X0 X0 t) × exp(t0 X(X0 X)−1 X0 Xβ) = ψ(t0 X(σ 2 (X0 X)−1 )X0 t) exp(t0 Xβ) b )u) exp(u0 β), where u = X0 t, = ψ(u0 Cov(β h

b ∼ Ep (β, σ 2 (X0 X)−1 , h). We leave the reader to prove property from which it follows that β h 3.  b also coincides with the LS estimate of β and is the same for any density The estimator β h generator h(.) in the elliptical family. The fitted and residual vectors are also the same and have the forms shown in Section 4.2 for the full-rank model. Note that the distribution b is the same for all h(.) within the elliptical family. However, the distribution of of β h 2 σ bh is affected by departures from normality within the family of elliptical distributions. When h(.) corresponds to the normal or t-distributions, u0 = N , while for other elliptical distributions, u0 must be solved numerically. It is straightforward to extend Result 13.3.3 to the case where y ∼ EN (Xβ, σ 2 V, h), V being a known symmetric, p.d. matrix. The details are left as Exercise 13.9. The next result describes testing a linear hypothesis H: C0 β = d on β under the model (13.3.3). Similar to Section 7.2 for the normal GLM, we assume that C is a p × s matrix with known coefficients, r(C) = s and d = (d1 , · · · , ds )0 is a vector of known constants.

ISTUDY

414

Special Topics

Result 13.3.4. 1. Irrespective of h(.), the LRT statistic for testing H: C0 β = d is  −N/2 Q , Λ= 1+ SSE where SSE is from the LS fit and Q is given in Result 7.2.1. 2. Under H, F (H) =

Q/s ∼ Fs,N −p SSE/(N − p)

Proof. Let b ε denote the vector of residuals of the LS fit and b εH the one subject to the constraint by H. Then from (13.3.8), Λ= = =

supσ2 ,β: C0 β=d (σ 2 )−N/2 h{σ −2 ky − Xβk2 } supσ2 ,β (σ 2 )−N/2 h{σ −2 ky − Xβk2 } supσ2 (σ 2 )−N/2 h{σ −2 inf β: C0 β=d ky − Xβk2 } supσ2 (σ 2 )−N/2 h{σ −2 k inf β y − Xβk2 } supu≥0 uN/2 h(u · b ε0H b εH ) supu≥0 uN/2 h(u · b ε0 b ε)

. N/2

Since for any a > 0, supu≥0 uN/2 h(ua) = a−N/2 u0 h(u0 ), with u0 as in property 1 of Result 13.3.3, then Λ = (b ε0H b εH /b ε0 b ε)−N/2 , so that property 1 follows from Result 7.2.2. 0 0 Let C = T X. Denote by P the projection onto C(X). Then from Result 7.2.1, we see that under H, Q = ε0 Mε, where M = PT[C0 (X0 X)−1 C]−1 T0 P is symmetric and n.n.d. On the other hand, SSE = ε0 Pε. Then F (H) =

u0 Mu/s ε0 Mε/s = , ε0 Pε/(N − p) u0 Pu/(N − p)

where u = ε/kεk. Note that the above identity holds for any distribution of ε as long as P(ε 6= 0) = 1. Since the distribution of F (H) only depends on the distribution of ε/kεk, it is the same for all spherical distributions with zero mass at the origin. Since N (0, Ip ) is one such spherical distribution, then from Result 7.2.1, property 2 follows.  By using the relationship between the likelihood ratio test statistic and the F -statistic, it can be shown that b − d)0 {C0 (X0 X)−1 C}−1 (C0 β b − d) (N − p) (C0 β F0 = (13.3.9) s (N − p)b σ2 has an Fs,N −p distribution under H (see Result 8.1.2). Result 13.3.5. is

The 100(1 − α)% joint confidence region for β under the model (13.3.3) b − β)0 X0 X(β b − β) ≤ pb {β ∈ Rp : (β σ∗2 Fp,N −p,1−α }. h h

(13.3.10)

The proof of the result follows by inverting the F -statistic in (13.3.9). These distributional results facilitate inference in linear models with elliptical errors. For more details, the reader is referred to Fang and Anderson (1990) and references given there. We define the LS (or ML) residuals as well as internally and externally Studentized residuals from fitting the model (13.3.3).

ISTUDY

Elliptically contoured linear model

415

b = Definition 13.3.1. Residuals. 1. The usual LS (or ML) residuals are b ε = y−y (I − P)y, where P = X(X0 X)−1 X0 , and b ε ∼ EN (0, σ 2 (I − P), h), εbi ∼ E(0, σ 2 (1 − pii ), h), i = 1, . . . , N.

(13.3.11)

2. The ith internally Studentized residual is ri =

εb √ i , σ b∗ 1 − pii

(13.3.12)

where σ b∗2 was defined in (13.3.7). 3. The ith externally Studentized residual is ri∗ = 2 σ b∗(i) =

εb √i , where, σ b∗(i) 1 − pii 2 u∗0 σ b(i)

N −1−p

(13.3.13)

,

(13.3.14)

where 2 b )0 (y(i) − X(i) β b ) σ b(i) = (y(i) − X(i) β h(i) h(i) 2 and u∗0 = max(u(N −1)/2 h(u), σ b(i) ).

4. Independent of h(.), ri2 ∼ Beta N −p



1 N −1−p , 2 2

 ,

ri∗ ∼ tN −1−p , similar to their distributions under the normal GLM. Properties of these residuals and details on other case deletion diagnostics including likelihood displacement (Cook and Weisberg, 1982) have been discussed in Galea et al. (2000). We show an example next. Numerical Example 13.4. The dataset “luzdat” from the R package gwer with 150 observations on four variables is part of a study development by the nutritional department of S˜ ao Paulo University. library(gwer) data(luzdat); attach(luzdat) z1 0) i=1

" ×

N X

# + 0b 0b b I(x0t β LAD > 0)I(xt β LAD < Yt ≤ xt β LAD + h) ,

i=1

where h > 0 is a fixed parameter. Under certain conditions (Rao and Zhao, 1993), 4fb(0)LRT  has a limiting χ2s distribution under H0 .

14.1.2

M -regression

The M -estimator, which was introduced by Huber (1964), is a robust alternative to the LS estimator in the linear model. It is attractive because it is relatively simple to compute, and it offers good performance and flexibility. In the model (4.1.2), we assume that εi are i.i.d. random variables with c.d.f. F , which is symmetric about 0. The distribution of εi can be fairly general; in fact we need not assume existence of the mean and variance of the distribution. Recall from Section 8.1.1 that the MLE of β is obtained as a solution to the likelihood equations N X i=1

xi {f 0 (Yi − x0i β)/f (Yi − x0i β)} = 0,

(14.1.7)

ISTUDY

Robust regression

437

where f denotes the p.d.f. of εi , and f 0 denotes its first derivative. Suppose (14.1.7) cannot be solved explicitly; by replacing f 0 /f by a suitable function ψ, we obtain a “pseudo maximum likelihood” estimator of β, which is called its M -estimator. The function ψ is generally b which is robust to alternative specifications chosen such that it leads to an estimator β M of f . Suppose ρ is a convex, continuously differentiable function on R, and suppose ψ = ρ0 . b of β in the model (4.1.2) is obtained by solving the minimization The M -estimator β M problem min β

N X

ρ(Yi − x0i β),

(14.1.8)

i=1

or, equivalently, by solving the set of p equations N X

xi ψ(Yi − x0i β) = 0.

(14.1.9)

i=1

When (a) the function ρ is not convex, (b) the first derivative of ρ is not continuous, or (c) the first derivative exists everywhere except at a finite or countably infinite number of b to be a solution of (14.1.8), since (14.1.2) might not have a solution, points, we consider β M or might lead to an incorrect solution. By a suitable choice of ρ and ψ, we can obtain an estimator which is robust to possible heavy-tailed behavior in f . b is scale-invariant, we generalize (14.1.8) and (14.1.2) respectively to To ensure that β M min β

N X

N X

ρ{(Yi − x0i β)/σ}, and

(14.1.10)

i=1

xi ψ{(Yi − x0i β)/σ} = 0.

(14.1.11)

i=1

In practice, σ is unknown, and is replaced by a robust estimate. One such estimate is the median of the absolute residuals from LAD regression. As we have seen, M -regression generalizes LS estimation by allowing a choice of objective functions. LS regression corresponds to ρ(t) = t2 . In general, we have considerable flexibility in choosing the function ρ, and this choice in turn determines the properties of βbM . The choice involves a balance between efficiency and robustness. Example 14.1.4. Suppose the true regression function is the conditional median of y given X. The choice of ρ(t) is |t|, and the corresponding problem is called median regression.  Example 14.1.5. W -regression is an alternative form of M -regression; βbW is a solution to the p simultaneous equations  N  X Yi − x0i β wi x0i = 0, σ i=1

(14.1.12)

where wi = w{(Yi − x0i β)/σ}. The equations in (14.1.12) are obtained by replacing ψ(t) by tw(t) in (14.1.8); w(t) is called a weight function.  Numerical Example 14.1. The dataset is “stackloss” in the R package MASS. Mregression using Huber, Hampel and bisquare ψ functions are shown below.

ISTUDY

438

Miscellaneous Topics

data(stackloss, package = MASS) summary(rlm(stack.loss ~ ., stackloss)) rlm(stack.loss ~ ., stackloss, psi = psi.huber, init = "lts") Call: rlm(formula = stack.loss ~ ., data = stackloss, psi = psi.huber, init = "lts") Converged in 12 iterations Coefficients: (Intercept) Air.Flow -41.0263481 0.8293999

Water.Temp 0.9259955

Acid.Conc. -0.1278426

Degrees of freedom: 21 total; 17 residual Scale estimate: 2.44 rlm(stack.loss ~ ., stackloss, psi = psi.hampel, init = "lts") Call: rlm(formula = stack.loss ~ ., data = stackloss, psi = psi.hampel, init = "lts") Converged in 9 iterations Coefficients: (Intercept) Air.Flow -40.4747671 0.7410846

Water.Temp 1.2250749

Acid.Conc. -0.1455246

Degrees of freedom: 21 total; 17 residual Scale estimate: 3.09 rlm(stack.loss ~ ., stackloss, psi = psi.bisquare) Call: rlm(formula = stack.loss ~ ., data = stackloss, psi = psi.bisquare) Converged in 11 iterations Coefficients: (Intercept) Air.Flow -42.2852537 0.9275471

Water.Temp 0.6507322

Acid.Conc. -0.1123310

Degrees of freedom: 21 total; 17 residual Scale estimate: 2.28

14.2

N

Nonparametric regression methods

Parametric regression models, such as GLMs and GLIMs, assume specific forms of the regression function E(Y | x) that can be completely determined by a finite number of parameters, so that model inference reduces to estimation and inference pertaining to the parameters (under some distributional assumption on the errors). In contrast, nonparamet-

ISTUDY

Nonparametric regression methods

439

ric methods only make a few general assumptions about the regression function (Hastie and Tibshirani, 1990; Hastie et al., 2009; Thisted, 1988). In this section, we give a brief introduction to computer-intensive regression methods that are useful for nonparametric fitting, when the predictor x is multidimensional, or when a parametric linear model does not satisfactorily explain the relationship between Y and x. Additive models and projection pursuit regression are described in the next two subsections. The additive model is a generalization of the multiple linear regression (MLR) model, where the usual linear function of observed predictors is replaced by a sum of unspecified smooth functions of these predictors. Projection pursuit (Friedman and Tukey, 1974) is a general approach that seeks to unravel structure within a high dimensional predictor by finding interesting projections of the data onto a low dimensional linear subspace, such as a line or a plane. We also discuss multivariate adaptive regression splines (MARS) and neural nets regression.

14.2.1

Regression splines

The MLR model can provide an inadequate fit when a linear function is unsuitable to explain the dependence of Y on the explanatory variables. One remedy is to add polynomial terms in some or all of the explanatory variables, resulting in a polynomial regression model. These models may however have limited scope, because, it is often difficult to guess what the appropriate order of the polynomials should be, and also because of the global nature of its fit. In many cases, a piecewise polynomial model is more suitable. Consider a functional relation between a response Y and a single predictor X: Y = f (X) + ε with E(ε) = 0, Var(ε) = σ 2 .

(14.2.1)

Definition 14.2.1. Piecewise polynomial regression. Fix q ≥ 1. An order-q piecewise polynomial with fixed knots ξ1 < · · · < ξL is a function such that, within each of the intervals I0 = (−∞, ξ1 ), I` = [ξ` , ξ`+1 ), ` = 1, . . . , L − 1, and IL = [ξL , ∞), it is a polynomial of degree at most q − 1 with real-valued coefficients f`,q (x) =

q X

c`,j xj−1 ,

x ∈ I` , ` = 0, 1, · · · , L.

(14.2.2)

j=1

Piecewise polynomial regression consists of fitting the data by a regression model (14.2.1), with f assumed to be a piecewise polynomial with no constraints on its coefficients. Although piecewise polynomial functions are more flexible than polynomial functions, one disadvantage is that they are not necessarily smooth functions. In many cases, they can be discontinuous at the knots. Regression splines retain the flexibility of piecewise polynomials while incorporating a degree of smoothness. Definition 14.2.2. Polynomial spline. An order-q polynomial spline with fixed knots ξ1 < · · · < ξL is a piecewise polynomial f with those knots and with continuous derivatives up to order q − 2, that is f (k) (ξ` −) = f (k) (ξ` +),

k = 0, · · · , q − 2

(14.2.3)

for each ` = 1, · · · , L. Polynomial splines in the context of the regression model (14.2.1) are known as regression splines. They facilitate flexible fitting of the data with a degree of smoothness. Usually,

ISTUDY

440

Miscellaneous Topics

q = 1, 2, or 4 are used. When q = 4, we call these cubic splines. If an order-q polynomial spline with L knots is expressed as in (14.2.2), there are (L + 1)q coefficients in total. (k) (k) However, from (14.2.3) f`−1,q (ξ` ) = f`,q (ξ` ), 0 ≤ k ≤ q − 2, so there are q − 1 linear constraints on the coefficients at each of the L knots. As a result, for q > 1, direct estimation of the coefficients is a constrained optimization problem. In practice, since the dimension of the space of order-q splines with L knots is (L + 1)q − L(q − 1) = L + q, regression spline fitting is carried out by unconstrained optimization using a set of basis functions h1 , · · · , hL+q of the space, so that the data is fit with f (x) =

L+q X

a` h` (x),

(14.2.4)

`=1

with a1 , · · · , aL+q being unconstrained coefficients. Example 14.2.1. Truncated power basis. The space of order-q regression splines with knots ξ1 < · · · < ξL has a basis consisting of the functions hj (x) = xj−1 , j = 1, · · · , q, and hq+` (x) = (x − ξ` )q−1  + , ` = 1, · · · , L, where as before, t+ = max(0, t). A widely used class of spline basis is the class of B-splines. Suppose all the Xi ’s are in [a, b] and a < ξ1 < · · · < ξL < b. Fix m ≥ 1. Define an augmented sequence of knots τ1 ≤ · · · ≤ τL+2m , such that   if ` ≤ m a τ` = ξ`−m if ` = m + 1, · · · , m + L   b if ` > m + L. Denote by Bi,j (x) the ith B-spline basis function of order j for the augmented knot sequence, where i = 1, · · · , L + 2m − j and j = 1, · · · , q. To start with, ( 1 if τi ≤ x < τi+1 Bi,1 (x) = 0 otherwise. For j = 2, . . . , q, recursively define Bi,j (x) =

τi+j − x x − τi Bi,j−1 (x) + Bi,j (x), τi+j−1 − τi τi+j − τi+1

where the convention 0/0 = 0 is used. Then, Bi,j , i = m − j + 1, · · · , m + L, constitutes a basis of splines of order j with the original knots ξ1 , · · · , ξL . Definition 14.2.3. Smoothing spline. Given data (Yi , Xi ), for i = 1, · · · , N , a smoothing spline minimizes the following expression, as a compromise between the accuracy of fit and the degree of smoothness: Z N X [Yi − f (Xi )]2 + λ {f 00 (x)}2 dx i=1

over all twice-continuously differentiable functions f , where λ ≥ 0 is a fixed smoothing parameter. It is a remarkable fact that given λ > 0, a minimizer of the expression above not only exists, but is also a unique natural cubic spline with knots at the unique values of

ISTUDY

Nonparametric regression methods

441

X1 , · · · , XN , i.e., it is linear on (−∞, min Xi ) and (max Xi , ∞), respectively. This is the essence of the definition. For details, see Green and Silverman (1994). For larger λ, the spline has less curvature, while for smaller λ, it becomes rougher, yielding more of an interpolation to the data. Regression spline fitting is a very rich area. For more details, see Eubank (1999). We show an example to illustrate cubic spline and smoothing spline fitting. Numerical Example 14.2. We use the dataset “Wage” from the R package ISLR, which consists of data on 3000 male workers in the Mid-Atlantic region of the U.S. The response variable is wage (worker’s raw income) which we model as a function of the age of the worker. Note that age ranges from A scatterplot of wage versus age (figure not shown here) suggests age values of 25, 35 and 65 as possible knots. Figure 14.2.1 shows the cubic splines and smoothing spline fits to the data. library(ISLR) data(Wage) library(splines) (agelims 0, denoted X ∼ IG(a, b), if 1/X ∼ Gamma(α, 1/β). The p.d.f. of X is   β β α −(α+1) x exp − , x > 0. (B.25) f (x; α, β) = Γ(α) x We can verify that E(X) = β/(α − 1) if α > 1, and Var(X) = β 2 /{(α − 1)2 (α − 2)} if α > 2.

ISTUDY

C Some Useful Statistical Notions

1. Kullback–Leibler divergence. The Kullback–Leibler (KL) divergence, which is also called relative entropy, is a measure of how one probability distribution is different from a second probability distribution, usually a reference distribution. The KL-divergence of a p.d.f. g(x) from another p.d.f. f (x) is defined as   Z g(x) KL(g k f ) = g(x) log dx. (C.1) f (x) If f and g are p.m.f.’s, we replace the integral by a sum. 2. Jensen’s inequality. Let X be a random variable and let g(.) be a convex function. Then, g(E(X)) ≤ E(g(X)).

(C.2)

The difference between the two sides of the inequality is called the Jensen gap. For concave functions, the direction of the equality is opposite. 3. Markov’s inequality. Markov’s inequality gives an upper bound for the probability that a nonnegative random variable is greater than or equal to some positive constant. Let X be a nonnegative random variable and a > 0. Then, P(X ≥ a) ≤

E(X) . a

4. Laplace approximation. Consider the ratio of integrals given by R w(θ) e`(θ) dθ R , π(θ) e`(θ) dθ

(C.3)

(C.4)

where θ is an m × 1 vector of parameters and `(θ) is the log-likelihood function of θ based on n observations y = (y1 , · · · , yn ) coming from the probability model p(y | θ), that is, `(θ) =

n X

log p(yi | θ).

i=1

The quantities w(θ) and π(θ) are functions of θ which may need to satisfy certain conditions depending on the context. In Bayesian formulation π(θ) is the prior and w(θ) = u(θ)π(θ) where u(θ) is some function of θ which is of interest. Thus, the ratio represents the posterior expectation of u(θ), that is, E[u(θ) | y]. For example, if u(θ) = θ then the ratio in (C.4) gives us

DOI: 10.1201/9781315156651-C

477

ISTUDY

478

Appendix the posterior mean of θ. Similarly, if u(θ) = p(y | θ), it yields the posterior predictive distribution value at y. Alternatively, we can write the above ratio as R u(θ)eΛ(θ) dθ R , (C.5) eΛ(θ) dθ where Λ(θ) = `(θ) + log π(θ). Lindley (1980) develops asymptotic expansions for the above ratio of integrals in (C.5) as the sample size n gets large. The idea is to obtain a Taylor series expansion of all the b the posterior mode. Lindley’s approximation to E[u(θ) | y] above functions of θ about θ, is given by:   m m X X b +1 E[u(θ) | y] ≈ u(θ) ui,j σi,j + Λi,j,k ul σi,j σk,l  , (C.6) 2 i,j=1 i,j,k,l=1

where ui ≡

∂u(θ) , ∂θi θ=θˆ

ui,j ≡

∂ 2 u(θ) , ∂θi ∂θj θ=θˆ

Λi,j,k ≡

∂ 3 Λ(θ) , ∂θi ∂θj ∂θk θ=θb

b and σi,j are the elements in the negative inverse Hessian of Λ at θ. Lindley’s approximation (C.6) involves third order differentiation and therefore is computationally cumbersome in highly parameterized cases. Tierney and Kadane (1986) propose an alternative approximation which involves only the first and second order derivatives. This is achieved by using the mode of the product u(θ)eΛ(θ) rather than the mode of the posterior eΛ(θ) and evaluating the second derivatives at this mode. Tierney–Kadane approximation approximates E[u(θ) | y] by !1/2 b |Σ∗ (θ)| b − Λ(θ)]}, b E[u(θ) | y] ≈ exp{n[Λ∗ (θ) (C.7) b |Σ(θ)| b and Σ(θ) b are the corresponding negative inverse where Λ∗ (θ) = log u(θ)+Λ(θ) and Σ∗ (θ) ∗ b Hessians of Λ and Λ evaluated at θ. 5. Stirling’s formula. As n → ∞, √ n! = (n/e)n 2πn[1 + o(1)]. 6. Newton–Raphson algorithm in optimization. The Newton–Raphson algorithm is used on a twice-differentiable function f (x) : R → R in order to find solutions to f 0 (x) = 0, which are stationary points of f (x). These solutions may be minima, maxima, or saddle points of f (x). Let x0 ∈ R be an initial value and let xk ∈ R be the value at the kth iteration. Using a sequence of second-order Taylor series approximations of f (x) around the current iterate xk , k = 0, 1, 2, · · · , the Newton–Raphson method updates xk to xk+1 by xk+1 = xk −

f 0 (xk ) . f 00 (xk )

7. Convergence in probability. If the sequence of estimators θˆN based on N observations converges in probability to a constant θ, we say that θ is the limit in probability of the sequence θbN , and write plim θbN = θ. N →∞

ISTUDY

Appendix

479

8. Hampel’s influence function. Suppose (z1 , · · · , zN ) denotes a large random sample from a population with c.d.f. F . Let FbN = FN (z1 , · · · , zN ) denote the empirical c.d.f. and let TN = T (z1 , · · · , zN ) be a (scalar or vector-valued) statistic of interest. The study of influence consists of assessing the change in TN when some specific aspect of the problem is slightly changed. The first step is to find a statistical functional T which maps (a subset of) the set of all c.d.f.’s onto Rp , so that T (FbN ) = TN . We assume that FN converges to F and that TRN converges to T . For example, when TN = Z, the R corresponding functional is T (F ) = z dF (z), and T (FbN ) = z dFbN = Z. The influence of an estimator TN is said to be unbounded if it is sensitive to extreme observations, in which case, TN is said to be nonrobust. To assess this, one more observation z is added to the large sample, and we monitor the change in TN , and the conclusions based on TN . Let z denote one observation that is added to the large sample (z1 , · · · , zN ) drawn from a population with c.d.f. F , and let T denote a functional of interest. The influence function is defined by 1 ψ(z, F, T ) = lim [T {(1 − ε)F + εδz } − T {F }], ε→0 ε

(C.8)

provided the limit exists for every z ∈ R, and where δz = 1 at z and zero otherwise. The influence curve is the ordinary right-hand derivative, evaluated at ε = 0, of the unction T [(1 − ε)F + εδz ] with respect to ε. The influence curve is useful for studying asymptotic properties of an estimator as well R as for comparing estimators. For example, if T = µ = z dF , then ψ[z, F, T ] = lim {[(1 − ε)µ + εz] − µ}/ε = z − µ, ε→0

(C.9)

which is “unbounded”, so that TN = Z is nonrobust. To use the influence function in the regression context, we must first construct appropriate functionals corresponding to β and σ 2 . 9. Multivariate completion of square. Let x ∈ Rp , a ∈ Rp , and A ∈ Rp×p be a symmetric, p.d. matrix. Then, x0 Ax − 2a0 x = (x − A−1 a)0 A(x − A−1 a) − a0 A−1 a.

(C.10)

This is useful in deriving expressions in a Bayesian framework. 10. Bayes’ theorem. Bayes’ theorem, is a time-honored result dating back to the late 18th century. Let B = (B1 , · · · , Bk ) denote a partition of a sample space S, and let A denote an event with P(A) > 0. By the definition of conditional probability (Casella and Berger, 1990), we have P(Bj | A) = P(Bj ∩ A)/ P(A) = P(A | Bj ) P(Bj )/ P(A). By substituting for P(A) from the law of total probability, i.e., P(A) =

k X

P(A | Bi ) P(Bi ),

i=1

it follows that for any event Bj in B, P(Bj | A) = P(A | Bj ) P(Bj )

k .X i=1

P(A | Bi ) P(Bi ).

ISTUDY

480

Appendix When the partition B represents all possible mutually exclusive states of nature or hypotheses, we refer to P(Bj ) as the prior probability of an event Bj . An event A is then observed, and this modifies the probabilities of the events in B. We call P(Bj | A) the posterior probability of Bj . Now consider the setup in terms of continuous random vectors. Let x be a k-dimensional random vector with joint pdf f (x; θ), where θ is a q-dimensional parameter vector. We assume that θ is also a continuous random vector with p.d.f. π(θ), which we refer to as the prior density of θ. Given the likelihood function is L(θ; x) = f (x; θ), an application of Bayes’ theorem gives the posterior density of θ as .Z π(θ | x) = L(θ; x)π(θ) L(θ; x)π(θ) dθ, where the term in the denominator is called the marginal distribution or likelihood of x and is usually denoted by m(x).

11. Posterior summaries. Let π(θ | x) denote the posterior distribution of θ given data x. The expectations below are taken with respect to π(θ | x). (a) The posterior mean is the Bayes estimator (and also the minimum mean squared estimator, MMSE) of θ and is defined as Z E(θ | x) = θπ(θ | x)dθ. Θ

(b) The posterior variance matrix is defined as Var(θ | x) = E[(θ − E(θ | x))(θ − E(θ | x)0 ]. (c) The largest posterior mode or the generalized MLE of θ and is b | x) = sup π(θ | x). π(θ θ∈Θ

12. Credible set and interval. A 100(1 − α)% credible set for θ is a subset C ⊂ Θ of the parameter space satisfying the condition Z 1 − α ≤ P(C | x) = π(θ | x)dθ. (C.11) C

If the posterior distribution of a scalar θ is continuous, symmetric, and unimodal, the credible interval can be written as (θ(L) , θ(U ) ), where Z

θ (L)

Z



π(θ | x) = −∞

π(θ | x) = α/2.

(C.12)

θ (U )

13. HPD Interval. The 100(1 − α)% highest posterior density (HPD) credible set for θ is a subset C ⊂ Θ given by C = C(k(α)) = {θ ∈ Θ : π(θ | x) ≥ k(α)},

(C.13)

where k(α) is the largest constant satisfying P(C(k(α) | x)) ≥ 1 − α.

(C.14)

ISTUDY

Appendix

481

14. Bayes Factor. Consider a model selection problem in which we must choose between two models, M1 and M2 , parametrized by θ 1 and θ 2 respectively, based on observed data x. The Bayes factor is defined as the ratio of the marginal likelihoods BF12 = p(x | M1 )/p(x | M2 ) =

p(M1 | x) p(M2 ) × , p(M2 | x) p(M1 )

(C.15)

which can also be expressed as the ratio of the posterior odds of M1 to M2 divided by their priors odds.

ISTUDY

ISTUDY

D Solutions to Selected Exercises

Chapter 1 1.1 Verify that |a • b| ≤ kak · kbk. √ √ 1.3 a = ±1/ 2 and b = ±1/ 2.       2 1 3 1.7 Solve the system = c1 + c2 to get c1 = −1 and c2 = 1, showing that u is 3 2 5 in Span{v1 , v2 }. 1.12 Computing the product on both sides, and equate them to get the conditions. 1.14 Show that C = Ak−1 + Ak−2 B + · · · + ABk−2 + Bk−1 . 1.17 Use Result 1.3.5 1.22 ∆n = (1 + a2 + a4 + · · · + a2n ) = [1 − a2(n+1) ]/(1 − a4 ) if a 6= 1, and ∆n = n + 1 if a = 1. 1.30 For (a), use the definition of orthogonality and Result 1.3.6. For (b), use Definition 1.2.9 and Definition 1.3.11. 1.33 r(A) = 2. 1.35 Use Definition 1.2.8 and proof by contradiction. 1.39 The eigenvector corresponding to λ = 3 is v = t(1, 1, 1)0 , for arbitrary t 6= 0. Chapter 2 2.3 Use Result 2.1.3 to write the determinant of A as |1| · |P − xx0 | = |P| · |1 − x0 P−1 x|. Simplify. 2.5 Use Example 2.1.3 and note that when A2 is an n × 1 matrix, the expression A02 (I − P1 )A2 is a scalar. 2.7 Show that Aa = aa0 a = (a0 a)a, and use Definition 1.3.16. 2.12 For (a), use Result 2.3.4 to show that r(A) = r(D), which is equal to the number of nonzero elements of the diagonal matrix D. To show (b), kAk2 = tr(A0 A) = tr(A2 ) = tr(APP0 APP0 ) = tr(P0 APP0 AP) = tr(D2 ). 2.16 Construct an n × (n − k) matrix V such that the columns of (U, V) form an orthogonal basis for Rn . 2.21 Given that QAQ−1 = D, invert both sides. 2.23 Suppose on the contrary, that r(C) < q, and use Exercise 2.10 to contradict this assumption. DOI: 10.1201/9781315156651-D

483

ISTUDY

484

Solutions

2.26 −1/(n − 1) < a < 1 2.30 For (a), use property 5 of Result 1.3.8 and simplify. Substitute for (A−1 + C0 B−1 C)−1 from (a) into the LHS of (b) and simplify. 2.34 If possible, let P1 and P2 be two such matrices. Since u is unique, (P1 − P2 )y = 0 for all y ∈ Rn , so that P1 must equal P2 . Chapter 3  3.2 G is a g-inverse of A if and only if it has the form vectors in R

n−1

u0

In−1

 a , where u, v are any v

and a ∈ R.

3.5 For (a), use properties 1 and 2 of Result 3.1.9. For (b), transpose both sides of (a). Use (3.1.7) to obtain (c). For (d), use property 3 of Result 3.1.9. (e) follows directly from (d). 3.8 Since G is a g-inverse of A, r(A) ≤ r(G). If A is a g-inverse of G, then likewise r(G) ≤ r(A), so r(A) = r(G). Conversely, from Result 3.1.3, if A = BDC, where B −1 −1 and C are nonsingular and   D = diag(Ir , 0) with r = r(A), then G = C EB , where Ir K E has the form . Use Result 2.1.3. L M 3.14 Use Result 2.6.1 and property 2 of Result 1.3.10, followed by Result 3.1.13. 3.17 (a) Let G = HA−1 . If H = B− , then ABGAB = ABHA−1 AB = ABHB = AB. Conversely, let ABGAB = AB. Then BGAB = B, i.e., BHB = B. The solution for (b) is similar. 3.20 The unique solution is (−1, −4, 2, −1)0 . 3.22 C(c) ⊂ C(A) and R(b) ⊂ R(A); use Result 3.2.8. Chapter4 4.1 Use property 8 of Result 1.3.11 to show that r(X0 X, X0 y) ≥ r(X0 X). Use properties 4 and 6 of Result 1.3.11 to show that r(X0 X, X0 y) = r(X0 X). 4.6 n1 = 2n2 . 4.9 (i) βb1 = (Y2 + Y4 + Y6 − Y1 − Y3 − Y5 )/6, with variance σ 2 /6. (ii) βb1 = {5(Y2 − Y5 ) + 8(Y4 − Y3 ) + 11(Y6 − Y1 )}/48, with variance 420σ 2 /(48)2 . The ratio of variances is 32/35. 4.19 For (a), equate (4.5.5) with (4.2.3) and set y = Xz. Part (b) follows since V−1 = (1 − ρ)−1 I − ρ(1 − ρ)−1 [1 + (N − 1)ρ]−1 J, and 10 y = 0. 4.23 From (4.5.7), WX = KPC(LX) LX = KLX = X as L = K−1 . From (4.5.7), C(W) = C(X(X0 V−1 X)− X0 V−1 ) ⊂ C(X). Finally, W2 = KPC(LX) LKPC(LX) = KPC(LX) PC(LX) = W. 4.24 θbi = Yi − Y + 60◦ , i = 1, 2, 3. b = y, and β b = [IN − VC(C0 VC)−1 C0 ]y. 4.26 β r

ISTUDY

Solutions

485

4.29 E(SSEr ) = σ 2 (N − r) + σ 2 tr(Iq ) + +(A0 β − b)0 (A0 GA)−1 (Aβ − b) = σ 2 (N − r + q) + (A0 β − b)0 (A0 GA)−1 (Aβ − b).

Chapter 5 √ 5.2 (a) Use Result 5.1.1. (b) π/ 3. 5.3 a = 3.056.    4 5 5.5 µ = −2 and Σ = −1 1 2

 −1 2 3 1. 1 6

5.7 1/3. 5.10 Use the singular value decomposition. 5.13 See Result 5.2.14; show that the b.l.u.p. of x1 based on x2 is exactly x1 if and only if x1 = Ax2 + a for some matrix A and vector a of constants. 5.14 For (a), X2 ∼ N (µ2 , σ 2 ) and X3 ∼ N (µ3 , σ 2 ). For (b), (X1 | X2 , X3 ) is normal with mean µ1 + ρ(x2 − µ2 )/(1 − ρ2 ) − ρ2 (x3 − µ3 )/(1 − ρ2 ) and variance σ 2 (1 − 2ρ2 )/(1 − ρ2 ); it reduces to the marginal distribution of X1 when ρ = 0. For (c), ρ = −1/2. 5.20 Show that the product of the m.g.f.’s of the two random variables on the right side gives the m.g.f. of the random variable on the left side. 5.23 For (a), by Result 5.4.5 and Result 5.3.4, E(U ) = k+2λ and Var(U ) = 2(k+4λ). For (b), using Result 5.4.2, U ∼ χ2 (k, λ) with λ = µ0 Σ−1 µ. For (c), x0 Ax ∼ χ2 (k − 1, µ0 aµ/2).     1 −1 1 1 5.35 Verify that Q1 = x0 A1 x, and Q2 = x0 A2 x, with A1 = and A2 = . −1 1 1 1 Use idempotency of A1 Σ and that A1 ΣA2 = O. 5.37 Use independence of x0 Aj x to show necessity of Result 5.4.9. From Result 5.4.7, for i 6= j, A0i Aj = O, so C(Ai ) ⊥ C(Aj ) and it is easy to show the result. 5.39 We show property 3. See that −1 f (x2 ) = ck−q |V22 |−1/2 hk−q (x02 V22 x2 ),

and the conditional covariance is Z −1 −1 (x1 − V12 V22 x2 )(x1 − V12 V22 x2 )0 dFx1 |x2 (x1 ). −1 Let V11.2 = D0 D, and let y = (Y1 , · · · , Yq )0 = D−1 (x1 − V12 V22 x2 ). The covariance matrix is Z yy0 dFy,x2 (y, x2 )/f (x2 ) = C = {cij }, say,

and we are given that C does not depend on x2 . Z ∞ c11 = Y12 dFy,x2 (y, x2 )/f (x2 ), −∞

ISTUDY

486

Solutions and we can write −1 c11 ck−q |V22 |−1/2 hk−q (x02 V22 x2 ) Z ∞ −1 Y12 hk−q+1 (Y12 + x02 V22 x2 ) dy1 . = ck−q+1 |V22 |−1/2 −∞

−1 Set Z = x02 V22 x2 , and b = 2ck−q+1 /ck−q .

Chapter 6 Pm 0 0 0 0 0 0 6.2 B0 WB ∼ j=1 B xj (B xj ) . Since B x1 , · · · , B xm are i.i.d. ∼ Nq (0, B ΣB), then 0 0 B WB ∼ Wq (B ΣB, m). b M L = {(N − 1)/N }SN , |Σ b M L |N/2 = (1 − 1 )kN/2 |SN |N/2 ; substituting this into 6.4 Since Σ N the expression (6.1.9), b M L) = L(b µM L , Σ

exp(− N2k ) . (2π)N k/2 (1 − N1 )kN/2 |SN |N/2

6.5 The first identity is immediate from the definition of the sample mean, and some algebra. To show the second identity, write Sm+1 =

m+1 X

(xi − xm+1 )(xi − xm+1 )0

i=1

=

m X

(xi − xm + xm − xm+1 )(xi − xm + xm − xm+1 )0

i=1

+ (xm+1 − xm+1 )(xm+1 − xm+1 )0 = Sm + m(xm − xm+1 )(xm − xm+1 )0 + (xm+1 − xm+1 )(xm+1 − xm+1 )0 , use the first identity for xm+1 and simplify. 6.9 Use the proof of Result 6.2.3 to show that S11.2 is independent of Z, hence independent of S22 = Z0 AZ. Show the conditional independence of S11.2 and S21 given Z. Argue that the conditional distribution of S11.2 is the same as its unconditional distribution, from which the (unconditional) independence follows. 6.12 Beta(k/2, (N − k − 1)/2) distribution, which follows from (6.2.11) and (B.13). PN PN 6.14 For (a), note that j=1 Z(j) = j=1 Zj = N Z N . Also, Z(i) − Z N is a function of (Z1 − Z N , · · · , ZN − Z N ), which is independent of Z N . (b) follows directly. Chapter 7 7.1 Property 1 of Corollary 7.1.1 is a direct consequence of Result 5.2.5, while property 2 follows from Result 5.2.4. Property 3 follows from the orthogonality of X and I − P. Pb 7.2 (c) The test statistic F0 = (N − 3)Q/SSE ∼ F1,9 under H0 , where Q = β1 Xi (2βb0 + P βb1 Xi + 2βb2 Xi2 ) and SSE = (Yi − βb0 − βb1 Xi − βb2 Xi2 ).

ISTUDY

Solutions

487 P3

P3

Pni

2 2 7.4 Let N = n1 + n2 + n3 . SSEH = i=1 (ni Y i· ) /N and SSE = j=1 Yij − i=1 P P3 Pni 3 2 2 j=1 Yij − i=1 Yi· /ni , so that F (H) = (N −3)(SSEH −SSE)/(2SSE) ∼ F2,N −3 i=1 under H.

b1 = [(1 + c2 )Yk − c3 Yk+1 − c2 Yk+2 ]/(1 + c2 + c4 ) and 7.8 (a) Let β 0 = (λ1 , λ2 ). Then, λ 2 b2 = [cYk + Yk+1 − c(1 + c )Yk+2 ]/(1 + c2 + c4 ), with λ b = Var(β)

σ2 (1 + c2 + c4 )



1 + c2 c

 c . 1 + c2

(b) Under the restriction λ1 = −λ2 = λ, say, using results from section 4.6.1, we get b = [Yk − (1 + c)Yk+1 + cYk+2 ]/2(1 + c + c2 ), with Var(λ) b = σ 2 /2(1 + c + c2 ). λ 7.10 E(SSEH ) − E(SSE) = σ 2 (r − r2 ) + β 01 X01 (I − P2 )X1 β 1 . 7.14 H0 : X ∗ = 0 implies H0 : β1 = 0. The test statistic F0 = (N − 3)Q/SSE ∼ F1,9 under P P H0 , where Q = βb1 Xi (2βb0 + βb1 Xi + 2βb2 Xi2 ) and SSE = (Yi − βb0 − βb1 Xi − βb2 Xi2 ). 7.16 Due to orthogonality, the least squares estimates of β0 and β1 are unchanged, and are βb0 = (Y1 + Y2 + Y3 )/3, and βb1 = (Y3 − Y1 )/2 7.18 The hypothesis H can be writen as C0 β = d, where C0 is an (a − 1) × a matrix   1 −2 0 0 ··· 0 0 0 2 −3 0 ··· 0 0   0 0 3 −4 · · · 0 0 ,    ..   . 0 0 0 0 · · · a − 1 −a β = (µ1 , · · · , µa )0 , and d is an (a − 1)-dimensional vector of zeroes. The resulting F statistic is obtained from (7.2.9) and has an Fa−1,a(n−1) distribution under H. 1 2 7.19 F (H) = (3n − 3)Q/SSE ∼ F1,3n−3 under H, where Q = 2n 3 [Y 2· − 2 (Y 1· + Y 1· )] , and P3 Pn 2 SSE = i=1 j=1 (Yij − Y i· ) . p √ 7.23 The 95% C.I. for β1 is βb1 ± 4.73 11/4. For β2 , it is βb2 ± 4.73 3. For β3 , it is βb3 ± 4.73. p p For β1 − β2 , it is βb1 − βb2 ± 4.73 59/4. For β1 + β3 , it is βb1 + βb3 ± 4.73 27/4.

7.25 Scheff´e’s simultaneous confidence set has the form {β: |c0 β 0GLS − c0 β| ≤ σ bGLS (dFd,N −r,α )1/2 [c(X0 Σ−1 X)− c]1/2 ∀ c ∈ L}.

Chapter 8 8.1 The MLE is PN θb =

t=1

Yt Yt−1

PN

Yt2

t=1

.

ISTUDY

488

Solutions

8.3 It can be verified that Var(βe1 ) = σ 2

(N X Xi1 i=1

Var(βe2 ) = σ

2

Xi2

(N X Xi2 i=1

Xi1

N2

)−1

− PN

, and

Xi2 i=1 Xi1

N2

)−1

− PN

Xi1 i=1 Xi2

.

These may be compared with Var(βb1 ) and Var(βb2 ) respectively. 8.7 By Example 2.1.3, PZ = PX + {(I − PX )yy0 (I − PX )}/y0 (I − PX )y = PX + b εb ε0 /b ε0 b ε. b −β b ). Simplify to get the result. 8.8 SICi = (N − 1){T (FbN ) − T (Fb(i) )} = (N − 1)(β (i) 8.11 (a) Since PZ = PX +

b εb ε0 0 , b εb ε

ε b2

pZii ≤ 1 for all i, implying that pii + bε0ibε . (b) These follow directly from the relations in properties 2 and 3 of Result 8.3.4. 8.14 Use Exercise 8.7. The second result follows directly. Chapter 9 PN 2 9.1 Var(βb1 ) = σ 2 / i=1 (Xi1 − X 1 )2 (1 − r12 ). The width of the 95% confidence interval for β1 is 2tN −3,.025 s. e.(βb1 ). 9.3 M SE(b σ 2 ) = 2σ 4 /(N − p), while M SE(y0 A1 y) = 2σ 4 /(N − p + 2). (1) (2) (1) (2) 9.6 (a) α bi = Y i , i = 1, 2, βb = [SXY + SXY ]/[SXX + SXX ]. The vertical distance between b 1 − X 2 ). Then, b = (Y 1 − Y 2 ) − β(X the lines is D = (α1 − α2 ) + β(X 1 − X 2 ), with D b = D. (b) A 95% symmetric C.I. for D is D b ±σ b E(D) b[F1,n1 +n2 −3,.05 ]1/2 s. e.(D).

9.9 (a) From Exercise 4.14, X1 β 1 is estimable under the expanded model. Since X1 has full column rank, β 1 is estimable. (b) Use (4.2.36) to show that e ) = σ 2 [X0 (I − P2 )X1 ]−1 . Cov(β 1 1 b ) = σ 2 (X0 X1 )−1 . Use the Sherman–Morrison–Woodbury On the other hand, Cov(β 1 1 formula. Chapter 10 Pa Pa Pb Pa 10.2 U1a = j=1 Uij = Uii + j6=i Uij = Ni· − l=1 j=1 nij nlj /N·j = Ni· − Ni· = 0, i = 1, · · · , a, implying r(U) ≤ a − 1. Pa Pb Pa Pb Pa Pb 10.3 (b) In order that E( i=1 j=1 cij Yij ) = µ i=1 j=1 cij + i=1 ( j=1 cij )τi + Pb Pa Pa Pb cij = 0, j=1 ( i=1 cij )βj is to be function only of β’s, we must have i=1 Pb Pb Pa Pb j=1 c = 0, i = 1, · · · , a. Then, RHS = ( c )β = d β ij j=1 i=1 ij j j=1 j j , where Pj=1 Pb Pa b j=1 dj = j=1 i=1 cij = 0.

ISTUDY

Solutions

489 Pb

Pa

10.6 (a) For i = 1, · · · , n, write the function as j=1 {nij (µ + τi + θj + γij ) − τk + θj + γkj )}, which shows it is estimable. The proof of (b) is similar.

k=1

nij nkj n·j (µ +

10.7 (a) Yes, (b) No. 0 10.11 θj(i) = Y /n = Y ij· , j = 1, · · · , bi ; i = 1, · · · , a. The corresponding g-inverse is  ij·  ij 0 0 G= , where D = diag(1/n11 , · · · , 1/naa ). 0 D Pa P = i=1P (Yi·2 /ni ) − 10.15 (a) SS(β0 , β1 ) = [Y··2 /N ] + βb12 i ni (Zi − Z)2 . (b) SS(µ, τ1 , · · · , τa ) P a a 2 2 2 YP 1 ) − SS(µ, τ1 , · · · , τa ) = A/B, where A = [ ·· /N . Hence, SS(β0 , βP i=1 di − i=1 ci √ √ a a 2 2 ( i=1 ci di ) ] and B = [ i=1 ni (Zi −Z) ], ci = ni (Y i· −Y ·· ) and di = ni (Zi −Z). The difference in SS is always nonnegative (by Cauchy–Schwarz inequality), with equality only when ci ∝ di , i = 1, · · · , a, i.e., only when Y i· − Y ·· = w(Zi − Z), where w is a constant.

10.19 Write c0l = √

1 (0, 1, · · · l(l+1)

, 1, −l, 0, · · · , 0). For l = 1, · · · , a − 1, the 95% marginal C.I.

for c0l β is c0l β 0 ± and the 95% Scheff´e intervals are r c0l β 0 ±

M SE

σ b ta(n−1),.025 n (a − 1) Fa−1,N −a,0.05 . n

Chapter 11 11.3 Let the N × (a + 1) matrix X = (1N , x1 , · · · , xa ), where the N -dimensional vector Pa 2 xi = ei ⊗ 1a , with ei the ith standard basis vector. Then, SSA = i=1 Yi·2 /n − N Y ·· = 1 1 y0 { n ⊕ai=1 xi x0i − N 1N 10N }y, which gives the result. 11.4 Suppose SSA = y0 My, then, E(SSA) = tr(MV), where V denotes Cov(y); simplify. 11.5 Find the first and second partial derivatives of the log-likelihood function with respect to τ and σj2 . Use the results E(y − Xτ ) = 0 and E(y − Xτ )0 A(y − Xτ ) = tr(AV). P P P 11.6 Let SSA = i nb(Y i·· − Y ··· )2 , SSB(A) = i,j n(Y ij· − Y i·· )2 , SSE = i,j,k (Yijk − Y ij· )2 with respective d.f. (a − 1), a(b − 1), and ab(n − 1). The ANOVA estimators are σ bτ2 = {M SA − M SB(A)}/nb, σ bβ2 = {M SB(A) − M SE}/n, and σ bε2 = M SE. The ML solutions are σ eτ2 = {(1 − 1/a)M SA − M SB(A)}/nb, while σ eβ2 and σ eε2 coincide with their ANOVA estimators. P P 11.9 Since AX = O, y0 Ay = ε0 Aε = i aii ε2i + 2 i