Festschrift in Honor of R. Dennis Cook: Fifty Years of Contribution to Statistical Science 3030690083, 9783030690083

In honor of professor and renowned statistician R. Dennis Cook, this festschrift explores his influential contributions


English Pages 205 [200] Year 2021


Table of contents :
Foreword
A Tribute to Professor R. Dennis Cook
Contents
Using Mutual Information to Measure the Predictive Power of Principal Components
1 Introduction
2 Overview of Previous Results
3 Conditional Mutual Information
3.1 Under the Linear Model
3.2 Beyond the Linear Regression Model
3.3 Beyond the Normal Distribution
4 Discussion
References
A Robust Estimation Approach for Mean-Shift and Variance-Inflation Outliers
1 Introduction
2 Our Proposal and Some Background
2.1 A Generalized Setting
2.2 Some Technical Background
2.3 Our Proposal
2.4 Graphical Diagnostics
3 Simulation Study
4 Real-Data Examples
5 Final Remarks
References
Estimating Sufficient Dimension Reduction Spaces by Invariant Linear Operators
1 Introduction
2 Invariant Linear Operators
3 Invariant Linear Operator and Its Eigenvectors
4 Some Important Members of T Y|X
4.1 Sliced Average Variance Estimation
4.2 SIR-II
4.3 Contour Regression
4.4 Directional Regression
5 Two Estimation Methods Based on Invariant Operators
5.1 Iterative Invariant Transformations (IIT)
5.2 Nonparametrically Boosted Inverse Regression (NBIR)
6 Numerical Study
7 Concluding Remarks
References
Testing Model Utility for Single Index Models Under High Dimension
1 Introduction
2 Generalized SNR for Single Index Models
2.1 Notation
2.2 A Brief Review of the Sliced Inverse Regression (SIR)
2.3 Generalized Signal-to-Noise Ratio of Single Index Models
2.4 Global Testing for Single Index Models
3 The Optimal Test for Single Index Models
3.1 The Detection Boundary of Linear Regression
3.2 Single Index Models
3.3 Optimal Test for SIMa
3.4 Computationally Efficient Test
3.5 Practical Issues
4 Numerical Studies
5 Discussion
Appendix: Proofs
Assisting Lemmas
Proof of Theorems
References
Sliced Inverse Regression for Spatial Data
1 Introduction
2 SIR for iid Data
3 SIR for Time Series Data
4 SIR for Spatial Data
5 Performance Evaluation of SSIR
6 Discussion
References
Model-Based Inverse Regression and Its Applications
1 Introduction
1.1 Model-Based Inverse Reduction
1.2 Sufficient Reduction in Applications
2 Inverse Reduction for Multivariate Count Data
2.1 Multinomial Inverse Regression in Text Analysis
2.2 Predictive Learning in Metagenomics via Inverse Regression
2.3 Poisson Graphical Inverse Regression
3 Inverse Reduction and Its Dual
3.1 Reduction via Principal Coordinate Analysis
3.2 A Supervised Inverse Regression Model
4 Adaptive Independence Test via Inverse Regression
5 Cook's Contributions on Model-Based Sufficient Reduction
References
Sufficient Dimension Folding with Categorical Predictors
1 Introduction
2 Review on Sufficient Dimension Folding
3 Sufficient Dimension Folding with Categorical Predictors
4 Estimation Methods
4.1 Individual Direction Ensemble Method
4.2 Least Squares Folding Approach (LSFA)
4.3 Objective Function Optimization Method
5 Estimation of Structural Dimensions
6 Numerical Analysis
6.1 Simulation Studies
6.1.1 Part I (Continuous Y, Forward Model)
6.1.2 Part II (Discrete Y, Inverse Model)
6.2 Application
7 Discussion
8 Appendix
8.1 Proofs
8.2 Additional Simulation and Data Analysis
Three Histograms for the Real Data
The Bootstrap Confidence Interval Plots for Real Data
References
Sufficient Dimension Reduction Through Independence and Conditional Mean Independence Measures
1 Introduction
2 Estimating SY|X Through α-Distance Covariance
2.1 α-Distance Covariance
2.2 Estimation of the Central Space
3 Estimating SE(Y|X) Through α-Martingale Difference Divergence
3.1 α-Martingale Difference Divergence
3.2 Estimation of the Central Mean Space
4 Simulation Studies
4.1 Model Setup
4.2 Comparisons of Estimating the Central Space
4.3 Comparisons of Estimating the Central Mean Space
5 Analysis of the Iris Data
6 Conclusion
Appendix
References
Cook's Fisher Lectureship Revisited for Semi-supervised Data Reduction
1 Introduction
2 Dimension Reduction by Isotonic Models
2.1 Construction of Isotonic Model
2.2 Maximum Likelihood Estimation of Γ
3 Numerical Examples
4 Real Data Example
5 Discussion
References

Efstathia Bura • Bing Li Editors

Festschrift in Honor of R. Dennis Cook Fifty Years of Contribution to Statistical Science

Editors Efstathia Bura Applied Statistics Vienna University of Technology Vienna, Wien, Austria

Bing Li Department of Statistics The Pennsylvania State University University Park, PA, USA

ISBN 978-3-030-69008-3    ISBN 978-3-030-69009-0 (eBook)
https://doi.org/10.1007/978-3-030-69009-0

© Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Foreword

This volume is dedicated to Professor R. Dennis Cook. For more than 45 years, Professor Cook has made outstanding and highly influential contributions to several areas of statistics ranging from experimental design and population genetics to statistical diagnostics and all areas of regression-related inference and analysis. Cook’s distance, a standard influence regression diagnostic tool, is representative of the significance of Professor Cook’s earlier contributions to our profession. Since the early 1990s, Professor Cook has been leading the development of dimension reduction methodology in three distinct but related regression contexts: envelope models, sufficient dimension reduction (SDR), and regression graphics. Professor Cook has made fundamental and pioneering contributions to sufficient dimension reduction. He invented or co-invented many popular dimension reduction methods, such as sliced average variance estimation, the minimum discrepancy approach, model-free variable selection, sufficient dimension reduction subspaces, and sufficient dimension reduction for conditional mean. Sufficient dimension reduction is a powerful set of theories and methods for handling high-dimensional data and is playing an increasingly important role in our age of big data. Professor Cook has been both a visionary in understanding the potential of this approach and the main driving force in unleashing the potential and popularizing the approach. He recently initiated the research on the envelope model, which is a parsimonious way of conducting multivariate analysis and has the potential to refine and redefine many methods in multivariate analysis and regression. During the relatively short period since it appeared in 2010, this method has already undergone momentous development and is being generalized into many directions. Professor Cook has authored over 200 research articles and three research monographs—Influence and Residuals in Regression, Regression Graphics: Ideas for Studying Regressions through Graphics, and An Introduction to Envelopes. He has co-authored three textbooks—An Introduction to Regression Graphics, Applied Regression Including Computing and Graphics, and Residuals and Influence in Regression. He has received many professional statistics and university awards,

including the 2005 COPSS Fisher Lecture and Award, the highest honor conferred by the statistics profession. Professor Cook has also been a prolific mentor of 31 PhD and 10 MS students, many of whom have gone on to successful careers in academia and industry. In addition, he has been a mentor and source of inspiration to junior colleagues. Among the main features of Professor Cook’s work are his exceptional taste and ability to identify research problems in statistics that are both challenging and important, his deep appreciation of the applied side of statistics, and the quality of his insights. The contributions to this volume, from Cook’s collaborators, colleagues, friends, and former students, reflect his broad interests as they range from regression diagnostics to dimension reduction under diverse regression settings. It is a pleasure and privilege for us to present this Festschrift in honor of Professor Cook’s 75th birthday. Many of the papers included in this volume were presented at the conference Cook’s Distance and Beyond: A Conference Celebrating the Contributions of R. Dennis Cook to Statistical Science that took place in Minneapolis, Minnesota, on March 22–23, 2019. We would like to take this opportunity to extend our most sincere thanks to the contributors of this volume. On behalf of the participants of the aforementioned conference in which the seeds of this volume were sown, we would like to acknowledge the support of the University of Minnesota and the local organizers in Minneapolis, especially Christopher Nachtsheim, Liliana Forzani, the School of Statistics Chair, Galin Jones, and Amanda Schwarz. We also thank Allegra Hoheisel and Magdalena Mayr, both at TU Wien, for their administrative help with the refereeing and production work that was needed to put the volume together.

A Tribute to Professor R. Dennis Cook

Kofi Placid Adragni, PhD
Senior Research Scientist
Eli Lilly & Company
Indianapolis, Indiana

R. Dennis Cook was my PhD adviser at the School of Statistics of the University of Minnesota, Twin Cities. He is my mentor, my inspiration, and my hero. Dennis has fundamentally influenced my life and career, and I owe him much of my success today. To me, Dennis Cook is one of the finest statisticians of all time, one of the greatest instructors I have known, and my favorite teacher. He is a visionary, a trendsetter, a thought leader, a compassionate individual with a great sense of humor, and a devoted family man. Professor Cook went from driving tanks in the Army to developing comprehensive tools and concepts of fundamental use in statistics. From Cook's distance in regression diagnostics to envelope methodology, Dennis has pioneered original work in genetics, design of experiments, Bayesian statistics, and residual analysis, among others. He has opened new subfields in statistics including regression graphics, sufficient dimension reduction, and envelope methodology. With a 48-year career at the University of Minnesota, Dennis has supervised over 30 PhD dissertations, including those of Christopher Nachtsheim of the University of Minnesota; Weng Kee Wong and Robert Weiss of the University of California at Los Angeles; Zhihua Su of the University of Florida; Lexin Li of the University of California, Berkeley; Liliana Forzani of the Universidad Nacional del Litoral in Argentina; Francesca Chiaromonte of the Pennsylvania State University; Efstathia Bura of Vienna University of Technology; and Xiangrong Yin of the University of Kentucky, just to name a few who are prominent and successful in their own right. Dennis has also collaborated with many researchers, including Sandy Weisberg of UMN, Inge S. Helland of the University of Oslo, and Bing Li of the Pennsylvania State University, just to name a few.

My gratitude to Dennis is grand. As a statistician, I stand today because I was lucky enough to have had Dennis Cook as my teacher and my mentor. For a quick story, whenever I am traveling with my colleagues Dr. Bimal Sinha or Dr. Thomas Mathew, or when we had visitors in the Mathematics and Statistics Department at the University of Maryland, Baltimore County, they almost always introduced me like this: "This is my colleague Kofi. He was a student of Dennis Cook of Minnesota!" Everyone I met who knew Dennis talks about him with admiration and fondness. And the conversation is almost always about Dennis' work and his influence in statistics. And I glow in Dennis' light like a proud son, knowing that in many ways, he helped shape who I have become. While it may sound routine, as countless people have worked with faculty advisers, my gratitude is incommensurate for many reasons. My story is that of an African immigrant who came to the USA through the Diversity Visa Lottery program, who could not speak English, with no grand plan other than surviving. For a while, my dream was to become a truck driver. For some reason, I was refused the CDL training. Serendipitous events led me to Minneapolis, where I was introduced to Professors Glen Meeden and Gary Oehlert, then Director of the School of Statistics and Director of Graduate Studies, respectively. They offered me the opportunity for graduate studies in statistics. As Glen said to me at the time, "We will let you take the two core PhD courses. If you do well, we will let you continue." That was the key to getting into the graduate program through the "back door." Of course, the financial cost was fully mine to face. I took the chance. For 2 years, it was brutal. Yet, I made it. I will be forever grateful for the guidance, generosity, patience, and wisdom Dennis has bestowed on me since I was a student in his class. In the spring of 2006, when he started teaching an advanced topics course in sufficient dimension reduction, I decided to sit in. Although I was not a registered student that semester, he allowed me to work on assignments and also on the final class project. I was fascinated with the concept and methodology. The class project I worked on later became the first chapter of my PhD dissertation. Not only was Dennis my PhD adviser, but he also offered me a research assistantship during my PhD work. I did not fully appreciate the value and importance of working with Dennis until after I graduated. During the time I was his RA, I learned so much through our casual conversations, and that knowledge later helped my research work during my tenure-track years, leading to my being granted tenure at UMBC. Recently, I decided on a career change from academia to industry. Upon application to a position of interest with my current employer, I was surprised by the prompt offer. When I asked the hiring officer about it, he simply said, "I spoke to Dennis Cook." This tribute is also to the School of Statistics and all its faculty members I learned from during those formative years. I am grateful to the few naysayers in the School who pointedly belittled my modest ambition of getting a PhD, as they strengthened my resolve to fully embrace my potential. A quote from Mark Twain helped me get over them: "Keep away from people who try to belittle your ambitions. Small people always do that, but the really great make you feel that you, too, can become

great." I was lucky to have listened to the truly great people like Professor Galin Jones, Professor Emeritus Douglas Hawkins, Professor Gary Oehlert, and Professor Glen Meeden. They all saw potential in me and gave me a chance. Along with Dennis Cook, they have helped sculpt my career, and I am deeply grateful for their influence in my life. The unlikely path of my life through the School of Statistics of UMN is also the story of four other young Africans. We all came from the same city in Togo; went to the same university; got college degrees in Mathematics, Physics, or Engineering; and left the country in pursuit of opportunities in America. We all came to the School of Statistics of UMN and found our way to a PhD in Statistics under the mentorship of the brightest statisticians on the face of the Earth. It is the story of Claude Setodji, now at RAND Corporation, of Gideon Zamba at the University of Iowa, of Edgard Maboudou at the University of Central Florida, and of Vincent Agboto. It is also a story of big struggles, because nothing came easy for us. None of us was admitted to the graduate program with financial support or a teaching assistantship. It was at times nearly impossible to survive the grinding and grueling statistics courses while laboring for self-survival. Ultimately, we all overcame these seemingly insurmountable struggles, with the help of our mentors and advisers. Getting a PhD in Statistics at the School of Statistics of the University of Minnesota was not an easy ride. It took a serious amount of stubbornness, determination, and hard work. Of course, each of us, Vincent, Claude, Gideon, Edgard, and I, did our part. Someone may say that we were lucky. Perhaps we were. Isn't it said that luck is what happens when preparation meets opportunity? We were prepared. The opportunity came. We are better off today. And we are grateful to our teachers. I say thanks to R. Dennis Cook.

Contents

Using Mutual Information to Measure the Predictive Power of Principal Components
Andreas Artemiou, 1

A Robust Estimation Approach for Mean-Shift and Variance-Inflation Outliers
Luca Insolia, Francesca Chiaromonte, and Marco Riani, 17

Estimating Sufficient Dimension Reduction Spaces by Invariant Linear Operators
Bing Li, 43

Testing Model Utility for Single Index Models Under High Dimension
Qian Lin, Zhigen Zhao, and Jun S. Liu, 65

Sliced Inverse Regression for Spatial Data
Christoph Muehlmann, Hannu Oja, and Klaus Nordhausen, 87

Model-Based Inverse Regression and Its Applications
Tao Wang and Lixing Zhu, 109

Sufficient Dimension Folding with Categorical Predictors
Yuanwen Wang, Yuan Xue, Qingcong Yuan, and Xiangrong Yin, 127

Sufficient Dimension Reduction Through Independence and Conditional Mean Independence Measures
Yuexiao Dong, 167

Cook's Fisher Lectureship Revisited for Semi-supervised Data Reduction
Jae Keun Yoo, 181

Using Mutual Information to Measure the Predictive Power of Principal Components

Andreas Artemiou

1 Introduction

Principal component analysis (PCA) (Hotelling, 1933; Joliffe, 2002; Pearson, 1901) is probably the most well-known feature extraction technique and the most widely used by practitioners as well as scientists in many fields. The use of PCA in a regression context has been proposed in an effort to reduce the dimensionality of a p-dimensional predictor vector X as well as a way to eliminate multicollinearity between the predictors. This practice has been questioned over the years in the literature due to the unsupervised nature of the PCA algorithm and the fact that it was applied to a supervised technique like regression which has a response/label variable Y. To find the principal components, one needs to do an eigenvalue decomposition of the covariance matrix of X, obtaining the eigenvalues λ_1 > · · · > λ_p and the eigenvectors v_1, . . . , v_p. Throughout the paper we assume the eigenvectors to be ordered in the sense that v_1 corresponds to the largest eigenvalue and v_p corresponds to the smallest eigenvalue. Using the eigenvectors one can obtain the principal components. The first principal component is w_1 = v_1^T X, the second is w_2 = v_2^T X, and so on. As one can see, this process does not have any direct or indirect involvement of the response variable Y. This fact leads to a number of researchers questioning the appropriateness of the method as a dimension reduction tool in supervised situations and especially in regression settings (see Cox, 1968). This is due to the fact that there is no way to ensure that the first few principal components are the ones more correlated with the response, although these are the ones which will be selected as the new variables to be fitted in the regression model.
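The computation just described is easy to reproduce. The following minimal sketch (Python with NumPy, simulated data, all names hypothetical) extracts the principal components from the covariance matrix of X alone and only afterwards asks how strongly each component correlates with a response y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.standard_normal((n, p)) * np.array([3.0, 2.0, 1.5, 1.0, 0.5])  # unequal column scales
beta = rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

# Eigenvalue decomposition of the covariance matrix of X; the response y is not used here.
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]          # order eigenvalues lambda_1 > ... > lambda_p
evals, evecs = evals[order], evecs[:, order]
W = Xc @ evecs                           # principal components w_1, ..., w_p (columns)

# Only now does the response enter: which component is most correlated with y?
r2 = [np.corrcoef(W[:, k], y)[0, 1] ** 2 for k in range(p)]
print(np.round(evals, 2), np.round(r2, 3))
```

Because y enters only in the last step, nothing in the construction guarantees that the leading components are the ones most correlated with the response, which is exactly the concern discussed above.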

A. Artemiou
School of Mathematics, Cardiff University, Cardiff, UK
e-mail: [email protected]

© Springer Nature Switzerland AG 2021
E. Bura, B. Li (eds.), Festschrift in Honor of R. Dennis Cook, https://doi.org/10.1007/978-3-030-69009-0_1

Actually Hadi and Ling (1998) and Joliffe (1982) gave real data examples where the least important principal components were the ones more correlated with the response. Other researchers (see Mosteller and Tukey 1977) supported the use of the method, claiming that "nature is fair" and therefore most of the time the first few principal components will be the ones more correlated with the response. Cook (2007), in his published version of the Fisher lecture, gave a very nice overview of the debate on the appropriateness of principal components as a dimension reduction tool in a supervised/regression setting. Li (2007) formulated the problem in a mathematical conjecture about the probability of the first principal component being the most important extracted component. Cook's Fisher lecture and Li's conjecture motivated a number of interesting results over the past few years on the predictive power of principal components in a regression setting. Artemiou and Li (2009) gave a lower bound on the probability of the higher-order principal components having higher correlation with the response than lower-order ones. They showed that, under mild assumptions on the regression coefficients β and the covariance matrix var(X) = Σ, in a linear regression setting where Y = β^T X + ε with ε ∼ N(0, 1), when i < j,

P(ρ(Y, w_i | β, Σ) > ρ(Y, w_j | β, Σ)) > 1/2,

where ρ denotes the squared correlation. This work was extended by Ni (2011), who proved that this probability is exactly equal to (2/π) E[arctan(√(λ_i/λ_j))], where λ_i denotes the i-th eigenvalue of Σ. Artemiou and Li (2013) showed that this can be extended to more general regression models like the conditional independence model Y ⊥⊥ X | β^T X and the conditional mean independence model Y ⊥⊥ E(Y|X) | β^T X. More recently, Jones and Artemiou (2020) have shown that similar relationships hold in a regression with Hilbertian predictors. In a different direction, Hall and Yang (2010) took a minimax approach, showing that the maximum mean square error (MSE) between the fitted values and the observed responses is minimized when one selects the first few principal components (i.e., the ones associated with larger eigenvalues) as opposed to any other subset of principal components. Finally, Jones et al. (2020) and Jones and Artemiou (2021) investigated the predictive potential of kernel principal components.

In this work we use the mutual information to measure the relationship between the response variable and the principal components. Interestingly, we demonstrate that under the linear model as it was assumed in Artemiou and Li (2009), as well as the conditional independence and conditional mean models as they were assumed in Artemiou and Li (2013), with the extra assumption of normality of all the distributions, the results are exactly the same as the results we had in the previous papers. In the rest of the paper, we give an overview of the previous results in Sect. 2, and then we present the results using mutual information in Sect. 3. Finally, we close with a discussion in Sect. 4.

2 Overview of Previous Results

In this section we give a brief overview of the most relevant results from the previous papers. We give some supporting definitions and results as well as the main theorems that are useful for the developments in this work. First, we define the orientationally uniform distribution for a random covariance matrix as it was defined in Artemiou and Li (2009).

Definition 1 We say that a p × p positive definite random matrix Σ has an orientationally uniform distribution if Σ = σ_1^2 v_1 v_1^T + · · · + σ_p^2 v_p v_p^T, where each (σ_i^2, v_i) is a pair of random elements in which σ_i^2 is a positive random variable and v_i is a p-dimensional random vector, such that:
1. (σ_1^2, . . . , σ_p^2) are exchangeable, and their joint distribution is dominated by the Lebesgue measure,
2. (v_1, . . . , v_p) are exchangeable, and {v_1, . . . , v_p} is an orthonormal set,
3. (σ_1^2, . . . , σ_p^2) and (v_1, . . . , v_p) are independent.

Artemiou and Li (2009) proved the following lemma, which gives the conditions for having a unique median, something which ensures a strict inequality in their main theorem.

Lemma 1 Suppose β and v_1, v_2 are p-dimensional random vectors such that:
1. β ⊥⊥ (v_1, v_2);
2. P(β ∈ G) > 0 for any nonempty open set G;
3. v_1 and v_2 are linearly independent and exchangeable.
Then (β^T v_2)^2 / (β^T v_1)^2 has a unique median, which equals 1.

The main result in Artemiou and Li (2009) proved that, under randomness of the covariance matrix and the coefficients, in a linear regression model the probability that a higher-order principal component will be more correlated with the response than a lower-ranked principal component is greater than 1/2. This means that, most of the time, using principal component analysis as a dimension reduction tool in linear regression gives meaningful results.

Theorem 1 Suppose:
1. Σ is a p × p orientationally uniform random matrix,
2. X is a p-dimensional random vector with E(X|Σ) = 0 and var(X|Σ) = Σ,
3. Y = β^T X + δ, where β is a p-dimensional random vector and δ is a random variable such that β ⊥⊥ (X, Σ), δ ⊥⊥ (X, β, Σ), E(δ) = 0, and var(δ) < ∞,
4. P(β ∈ G) > 0 for any nonempty open set G ⊆ R^p.

Let w_1, . . . , w_p be the 1st, . . . , pth principal components of X, and let ρ_i = ρ_i(β, Σ) = corr^2(Y, w_i | β, Σ). Then, whenever i < j, P(ρ_i ≥ ρ_j) > 1/2.

Ni (2011) showed that P(ρ_i ≥ ρ_j) can be computed exactly. The main theorem of that work is given below.

Theorem 2 Suppose:
1. Σ is a p × p orientationally uniform random matrix,
2. X is a p-dimensional random vector with E(X|Σ) = 0 and var(X|Σ) = Σ,
3. Y = β^T X + ε, where β is a p-dimensional random vector and ε is a random variable such that β ⊥⊥ (X, Σ), ε ⊥⊥ (X, β, Σ), E(ε) = 0, and var(ε) < ∞,
4. P(β ∈ G) > 0 for any nonempty open set G ⊆ R^p.
Let σ_1^2, . . . , σ_p^2 be the ordered eigenvalues of the covariance matrix Σ and w_1, . . . , w_p be the 1st, . . . , pth principal components of X, and let ρ_i = ρ_i(β, Σ) = corr^2(Y, w_i | β, Σ). Then, whenever i < j,

P(ρ_i ≥ ρ_j) = (2/π) E[arctan(√(σ_i^2/σ_j^2))].    (1)

Moreover they showed that, under specific extra assumptions, these results can be extended to cases where either β or Σ is fixed and not random. We discuss these assumptions below, as they were discussed and stated more clearly in Artemiou and Li (2013). Let U_{p×p} be the space of p × p orthogonal matrices. First we state two assumptions that we frequently use in the sequel: the first on the regression coefficients β and the second on the covariance matrix Σ.

Assumption 1 The distribution of the random vector β is spherically symmetric; that is, for any A ∈ U_{p×p}, β and Aβ have the same distribution. As Artemiou and Li (2013) clarified, a necessary and sufficient condition for the above assumption to hold is that the density of β depends only on ‖β‖.

The next assumption is for the covariance matrix Σ.

Assumption 2 The random matrix Σ is symmetric and invariant under orthogonal transformation; that is, for any A ∈ U_{p×p}, Σ and AΣA^T have the same distribution. Moreover, all the eigenvalues of Σ are distinct and positive.

The above two assumptions are key in the results of Artemiou and Li (2013). In a series of results in their paper, they demonstrate the necessary conditions for ensuring the spherical distribution of V β, which is crucial in proving the main results in their work, as well as in this work. We summarize their results in the following lemma.

Lemma 2 If any of the following four statements holds:
• β satisfies Assumption 1 and Σ is a nonrandom matrix with spectral decomposition V Λ V^T,

• Σ satisfies Assumption 2 with spectral decomposition V Λ V^T and β is a nonrandom vector,
• β satisfies Assumption 1 and Σ is a random matrix such that β ⊥⊥ Σ,
• Σ satisfies Assumption 2 and β is a random vector such that β ⊥⊥ Σ,
then V β is spherically symmetric, where V is the orthogonal matrix whose columns are the eigenvectors obtained from the spectral decomposition of Σ and Λ is the diagonal matrix with the eigenvalues on the main diagonal.

Using these assumptions and results, Artemiou and Li (2013) demonstrated the following main results. The first one is under the conditional mean independence model, which was discussed in the sufficient dimension reduction (SDR) framework by Cook and Li (2002).

Theorem 3 Suppose E(Y | X, β, Σ) = E(Y | β^T X, β, Σ) holds with var(X | Σ) = Σ and var(Y | β, Σ) < ∞ almost surely and:
1. β ⊥⊥ (X, Σ);
2. E(Y | β^T X, β, Σ) is a linear function of β^T X;
3. Either Assumption 1 or Assumption 2 is satisfied;
4. cov(Y, β^T X | β, Σ) ≠ 0 almost surely.
Then, for i < j,

P(corr^2(Y, v_i^T X | β, Σ) > corr^2(Y, v_j^T X | β, Σ)) = (2/π) E[arctan(√(σ_i^2/σ_j^2))].

In addition to the above theorem, Artemiou and Li (2013) discussed the following result, which uses the conditional independence model Y ⊥⊥ X | (β^T X, β, Σ), another model used in the SDR framework.

Theorem 4 Suppose Y ⊥⊥ X | (β^T X, β, Σ) holds with var(X | Σ) = Σ. Then, under conditions 1, 2, and 3 in Theorem 3, for i < j and any f : R^p → R satisfying var(f(Y) | β, Σ) < ∞ and cov(f(Y), v_i^T X | β, Σ) ≠ 0 almost surely, we have

P(corr^2(f(Y), v_i^T X | β, Σ) > corr^2(f(Y), v_j^T X | β, Σ)) = (2/π) E[arctan(√(σ_i^2/σ_j^2))].

Finally, Artemiou and Li (2013) discussed a number of other extensions of the results, including the extension to multivariate responses Y as well as some weaker results.
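Equation (1) (and the matching expressions in Theorems 3 and 4) is easy to check by simulation. The sketch below is an illustrative Monte Carlo experiment, not taken from the cited papers: it fixes the eigenvalues of Σ, takes the eigenvectors to be the coordinate axes (so comparing ρ_i and ρ_j reduces to comparing σ_i^2 (v_i^T β)^2 with σ_j^2 (v_j^T β)^2), and draws β from a standard normal distribution, which satisfies Assumption 1.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = np.array([4.0, 2.0, 1.0, 0.5])   # eigenvalues sigma_1^2 > ... > sigma_p^2 (V = I assumed)
i, j = 0, 2                            # compare the 1st and 3rd principal components
reps = 200_000

beta = rng.standard_normal((reps, lam.size))        # spherically symmetric beta draws
freq = np.mean(lam[i] * beta[:, i] ** 2 > lam[j] * beta[:, j] ** 2)
exact = 2 / np.pi * np.arctan(np.sqrt(lam[i] / lam[j]))
print(freq, exact)    # the two numbers should agree up to Monte Carlo error
```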

3 Conditional Mutual Information

The main purpose of this work is to examine the use of mutual information as a measure of the strength of the relation between Y and X in a regression setting and to demonstrate the equivalence of the new results with the results discussed in the previous section, which use the squared correlation.

We start with the definition and some helpful results, and then we list the main results of this work.

Definition 2 The conditional mutual information between two random variables X and Y conditional on a random variable W is defined to be

I(X, Y | W) = E[ log ( f(X, Y | W) / ( f(Y | W) f(X | W) ) ) | W ].

Before proving the main results of this section, we discuss some properties of the mutual information. First, one can use Jensen's inequality to show I(X, Y | W) ≥ 0, with equality holding if and only if the conditional random variables X|W and Y|W are independent, because in that case f(X, Y | W) = f(Y | W) f(X | W) and therefore log 1 = 0. Moreover, it has been shown that mutual information is related to entropy (see Paninski (2003)) through the following relationship:

I(X, Y | W) = H(X, Y | W) − H(X | Y, W) − H(Y | X, W),

where H(X, Y | W) is the joint conditional entropy and H(X | Y, W) and H(Y | X, W) are conditional entropies. Using this definition one can define a sample version of the mutual information as

Î(X, Y | W) = Ĥ(X, Y | W) − Ĥ(X | Y, W) − Ĥ(Y | X, W),

where Ĥ is any estimator of the entropy (see Paninski (2003) for different estimators of entropy). We start by showing the following lemma, which is useful in proving results which require the assumption of normality to hold.

Lemma 3 Suppose Z ∼ N_r(0, Δ) with pdf f. Then,

E(log f(Z)) = −(r/2) log(2π) − (1/2) log det(Δ) − r/2.

Proof Using the pdf of Z, we have

log f(z) = −(r/2) log(2π) − (1/2) log det(Δ) − (1/2) z^T Δ^{-1} z.

So,

E(log f(Z)) = E[ −(r/2) log(2π) − (1/2) log det(Δ) − (1/2) Z^T Δ^{-1} Z ]
            = −(r/2) log(2π) − (1/2) log det(Δ) − (1/2) E(Z^T Δ^{-1} Z)
            = −(r/2) log(2π) − (1/2) log det(Δ) − r/2,

since E(Z^T Δ^{-1} Z) = r.
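Lemma 3 is also easy to confirm numerically. The snippet below is an illustrative check (the particular covariance Δ and the use of SciPy's multivariate normal are choices of the sketch, not of the paper): it compares the Monte Carlo average of log f(Z) with the closed-form expression.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
r = 3
A = rng.standard_normal((r, r))
Delta = A @ A.T + np.eye(r)                 # an arbitrary r x r covariance matrix

Z = rng.multivariate_normal(np.zeros(r), Delta, size=200_000)
mc_value = multivariate_normal(np.zeros(r), Delta).logpdf(Z).mean()
closed_form = -r / 2 * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Delta)) - r / 2
print(mc_value, closed_form)                # agree up to Monte Carlo error
```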



In the rest of this section, we prove the main results of this work. We first show that the mutual information criterion can be used for the linear model under the assumption that X comes from a normal distribution. Then we discuss what happens when we remove the assumption of a linear model. The main difference is the requirement that the joint distribution of Y and v_i^T X be Gaussian for every i = 1, . . . , p. A general remark before proceeding with the proofs of the results is that mutual information depends on the joint and marginal conditional densities. Although here we focus on measuring the predictive power of principal components in a regression setting, it is known that it can be used in a much wider context. This is due to the fact that it contains much more information than the correlation coefficient does. Correlation measures the strength of linear relationships, while we believe this criterion has the ability to measure the strength of relationships beyond linear ones.
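As an informal illustration of that last remark (a toy example, not part of the paper's development), the snippet below uses a naive histogram plug-in estimate of mutual information: when Y depends on X only through X^2, the correlation is essentially zero while the estimated mutual information is clearly positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)
y = x ** 2 + 0.1 * rng.standard_normal(n)   # strong dependence, but no linear relationship

def plugin_mi(x, y, bins=40):
    """Naive histogram plug-in estimate of I(X, Y) in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

print("correlation       :", np.corrcoef(x, y)[0, 1])   # approximately 0
print("mutual information:", plugin_mi(x, y))            # clearly positive
```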

3.1 Under the Linear Model

We now have the necessary tools to prove the main theorem of this section. The following theorem shows that using the mutual information gives results equivalent to the ones proven in Artemiou and Li (2009).

Theorem 5 Suppose:
1. β satisfies Assumption 1 and Σ is a nonrandom matrix with spectral decomposition V Λ V^T;
2. X is a p-dimensional random vector such that X ∼ N(0, Σ);
3. β ⊥⊥ X and P(β ∈ G) > 0 for any nonempty open set G;
4. ε ∼ N(0, τ^2), ε ⊥⊥ X;
5. Y = β^T X + ε.
Then, for i < j,

P(I(Y, v_i^T X | β) > I(Y, v_j^T X | β)) > 1/2.    (2)

Proof First, we use the definition of mutual information, that is, Definition 2, to expand the statement on the left-hand side of inequality (2):

I(Y, v_i^T X | β) = E[ log ( f(Y, v_i^T X | β) / ( f(Y | β) f(v_i^T X | β) ) ) | β ]
                  = E(log f(Y, v_i^T X | β)) − E(log f(Y | β)) − E(log f(v_i^T X | β)).    (3)

Now we know that v_i^T X ∼ N(0, v_i^T Σ v_i); then from Lemma 3 we have that

E(log f(v_i^T X | β)) = −(1/2) log 2π − (1/2) log(v_i^T Σ v_i) − 1/2.    (4)

Then, the conditional joint distribution of Y and v_i^T X is the following:

(Y, v_i^T X)^T | β ∼ N( 0, [[β^T Σ β + τ^2, σ_i^2 v_i^T β], [σ_i^2 v_i^T β, σ_i^2]] ),

that is, var(Y | β) = β^T Σ β + τ^2, var(v_i^T X | β) = σ_i^2, and cov(Y, v_i^T X | β) = σ_i^2 v_i^T β.

Using Lemma 3 and the fact that det var(Y, v_i^T X) = σ_i^2 (β^T Σ β + τ^2) − σ_i^4 (v_i^T β)^2, we obtain

E(log f(Y, v_i^T X | β)) = − log 2π − (1/2) log( σ_i^2 (β^T Σ β + τ^2) − σ_i^4 (v_i^T β)^2 ) − 1.    (5)

Combining Eqs. (3), (4), and (5), we have the following equation:

I(Y, v_i^T X | β) = − log 2π − (1/2) log( σ_i^2 (β^T Σ β + τ^2) − σ_i^4 (v_i^T β)^2 ) − 1
                    − E(log f(Y | β)) + (1/2) log 2π + (1/2) log(v_i^T Σ v_i) + 1/2.

Similarly,

I(Y, v_j^T X | β) = − log 2π − (1/2) log( σ_j^2 (β^T Σ β + τ^2) − σ_j^4 (v_j^T β)^2 ) − 1
                    − E(log f(Y | β)) + (1/2) log 2π + (1/2) log(v_j^T Σ v_j) + 1/2.

So, using the facts that v_i^T Σ = σ_i^2 v_i^T and v_i^T v_i = 1, and by canceling similar terms, the left-hand side of inequality (2) reduces to

P(I(Y, v_i^T X | β) > I(Y, v_j^T X | β))
  = P( − log(β^T Σ β + τ^2 − σ_i^2 (v_i^T β)^2) > − log(β^T Σ β + τ^2 − σ_j^2 (v_j^T β)^2) )
  = P( log(β^T Σ β + τ^2 − σ_j^2 (v_j^T β)^2) > log(β^T Σ β + τ^2 − σ_i^2 (v_i^T β)^2) )
  = P( β^T Σ β + τ^2 − σ_j^2 (v_j^T β)^2 > β^T Σ β + τ^2 − σ_i^2 (v_i^T β)^2 )
  = P( −σ_j^2 (v_j^T β)^2 > −σ_i^2 (v_i^T β)^2 )
  = P( σ_i^2 (v_i^T β)^2 > σ_j^2 (v_j^T β)^2 ),

which is simplified to

P( (v_i^T β)^2 / (v_j^T β)^2 > σ_j^2 / σ_i^2 ),

which is greater than 1/2 because of the spherical distribution of V β and the fact that σ_j^2 / σ_i^2 < 1. ∎
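The cancellations in this proof reflect a standard identity for jointly Gaussian variables, which makes the equivalence with the correlation-based results explicit. The short derivation below is a routine consequence of Lemma 3 and is not stated in the paper in this form; write ρ_i = corr^2(Y, v_i^T X | β).

```latex
\begin{aligned}
I(Y, v_i^{\top}X \mid \beta)
  &= \tfrac{1}{2}\log \operatorname{var}(Y \mid \beta)
   + \tfrac{1}{2}\log \operatorname{var}(v_i^{\top}X \mid \beta)
   - \tfrac{1}{2}\log \det \operatorname{var}\bigl((Y, v_i^{\top}X)^{\top} \mid \beta\bigr) \\
  &= -\tfrac{1}{2}\log\bigl(1 - \rho_i\bigr).
\end{aligned}
```

Thus, under joint normality the conditional mutual information is a strictly increasing function of the squared correlation, so ordering the principal components by either quantity gives the same ranking; this is why Theorem 5 and Corollary 1 recover the probabilities of Artemiou and Li (2009) and Ni (2011).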

The first assumption in Theorem 5 is the first condition in Lemma 2. One can similarly prove results under any of the other three conditions in Lemma 2. Furthermore, one can show that the result in Theorem 5 can be extended to show that the equality of Ni (2011) holds in this case as well, that is, P(I(Y, v_i^T X | β) > I(Y, v_j^T X | β)) = (2/π) E[arctan(√(σ_i^2/σ_j^2))]. A similar equality can be proved under any assumption of Lemma 2. For economy of space, we state only the corollary extending Theorem 5 to the result by Ni (2011). The rest of the results can be derived similarly.

Corollary 1 Suppose:
1. β satisfies Assumption 1 and Σ is a nonrandom matrix with spectral decomposition V Λ V^T;
2. X is a p-dimensional random vector such that X ∼ N(0, Σ);
3. β ⊥⊥ X and P(β ∈ G) > 0 for any nonempty open set G;
4. ε ∼ N(0, τ^2), ε ⊥⊥ X;
5. Y = β^T X + ε.
Then, for i < j,

P(I(Y, v_i^T X | β) > I(Y, v_j^T X | β)) = (2/π) E[arctan(√(σ_i^2/σ_j^2))].    (6)

The proof is straightforward, as Ni (2011) showed that the last probability statement of the proof of Theorem 5 is equal to the right-hand side of Eq. (6).

3.2 Beyond the Linear Regression Model

In this section we discuss the fact that similar results (to the linear regression case) hold if we assume the conditional independence and conditional mean models used in Artemiou and Li (2013). These models are not necessarily linear and are being used in the sufficient dimension reduction (SDR) framework (see, e.g., Li (2018) for details). Assume that the predictor X ∈ R^p and that the response Y is univariate (without loss of generality). In SDR we try to estimate the p × d matrix β such that

Y ⊥⊥ X | β^T X    (7)

without losing information on the conditional distribution Y | X. If d < p, then dimension reduction is achieved. If d = 1, then this is known as the single index model. The space spanned by the column vectors of β is called the dimension reduction subspace. There are many β's that satisfy this relationship, and we are always looking for the one that minimizes the dimension d. Such a space is called the central dimension reduction subspace (CDRS) or simply the central subspace (CS) and is denoted by S_{Y|X}. Another model in the SDR framework that is being used extensively is the conditional mean independence model (see Cook and Li (2002)), which is expressed in mathematical notation as

Y ⊥⊥ E(Y | X) | β^T X,    (8)

where one is interested in estimating β without losing information on the regression of Y on X, or in other words without losing information on E(Y | X). The space spanned by the columns of β under model (8) is called the central mean subspace (CMS). We now state the theorem under which we want to prove the equivalent result for the CMS as in Artemiou and Li (2013), using the mutual information. Throughout this section we assume that d = 1, and therefore we work with single index models.

Theorem 6 Suppose E(Y | X, β, Σ) = E(Y | β^T X, β, Σ) holds with var(X | Σ) = Σ and:
1. Σ is a p × p matrix that has an orientationally uniform distribution;
2. X is a p-dimensional random vector such that X | Σ ∼ N_p(0, Σ);
3. β ⊥⊥ X and P(β ∈ G) > 0 for any nonempty open set G;
4. β ⊥⊥ (X, Σ);
5. E(X | β^T X, β, Σ) is a linear function of β^T X;
6. var(Y | β, Σ) = τ^2 < ∞ and cov(Y, X | β, Σ) ≠ 0 almost surely;
7. f(Y, v_k^T X | β, Σ) is Gaussian centered at 0 for all k = 1, . . . , p.
Then, for i < j,

P(I(Y, v_i^T X | β, Σ) > I(Y, v_j^T X | β, Σ)) > 1/2.

Proof First of all we note that, as before,

I(Y, v_i^T X | β, Σ) = E log f(Y, v_i^T X | β, Σ) − E log f(Y | β, Σ) − E log f(v_i^T X | β, Σ).

Now, we note that the last two terms have exactly the same form as in the linear model case (see the proof of Theorem 5). Finally, we need to work with the first term, which involves the joint distribution of Y and v_i^T X. Using the assumption that this is a normal distribution, we have that

(Y, v_i^T X)^T | β, Σ ∼ N( 0, [[τ^2, cov(Y, v_i^T X | β, Σ)], [cov(Y, v_i^T X | β, Σ), σ_i^2]] ),

with var(Y | β, Σ) = τ^2 and var(v_i^T X | β, Σ) = σ_i^2.



To calculate E(log f(Y, v_i^T X | β, Σ)), we need the determinant of the covariance matrix of the normal distribution above. Therefore we have to calculate cov(Y, v_i^T X | β, Σ):

cov(Y, v_i^T X | β, Σ) = cov(Y, E(v_i^T X | X, β, Σ) | β, Σ) = cov(E(Y | X, β, Σ), v_i^T X | β, Σ),

where the first equality is from iterated expectation and the second from a property of the conditional expectation that, if X, Y, and Z are random variables, then

E(E(X | Z) Y) = E(X E(Y | Z)).    (9)

Under the conditional mean independence model, that is, E(Y | X, β, Σ) = E(Y | β^T X, β, Σ), we have that

cov(E(Y | X, β, Σ), v_i^T X | β, Σ) = cov(E(Y | β^T X, β, Σ), v_i^T X | β, Σ) = cov(Y, v_i^T E(X | β^T X, β, Σ) | β, Σ),

by reapplying the property of the conditional expectation in (9) above. Condition 5 in the statement of the theorem, about the linearity of E(X | β^T X, β, Σ), implies that

E(X | β^T X, β, Σ) = P_Σ^T(β) X = Σ β (β^T Σ β)^{-1} β^T X,    (10)

where P_Σ(β) = β (β^T Σ β)^{-1} β^T Σ is the projection matrix with respect to the Σ inner product (see Li (2018) for details). Using this, one can show that

cov(Y, v_i^T E(X | β^T X, β, Σ) | β, Σ) = cov(Y, v_i^T P_Σ^T(β) X | β, Σ) = v_i^T P_Σ^T(β) cov(Y, X | β, Σ) = σ_i^2 v_i^T β (β^T Σ β)^{-1} β^T cov(Y, X | β, Σ),

where the last equality holds because v_i^T Σ = σ_i^2 v_i^T. Combining these with the result in Lemma 3, we have that

E(log f(Y, v_i^T X | β, Σ)) = − log 2π − 1 − (1/2) log( σ_i^2 τ^2 − σ_i^4 (v_i^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2 ).    (11)

Combining this with Eq. (4), which gives E log f(v_i^T X | β), we have that

I(Y, v_i^T X | β, Σ) = − log 2π − (1/2) log( σ_i^2 τ^2 − σ_i^4 (v_i^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2 ) − 1
                       − E(log f(Y | β, Σ)) + (1/2) log 2π + (1/2) log σ_i^2 + 1/2.    (12)

To calculate P(I(Y, v_i^T X | β, Σ) > I(Y, v_j^T X | β, Σ)) for i < j, we first take the inequality and simplify it, that is,

I(Y, v_i^T X | β, Σ) > I(Y, v_j^T X | β, Σ)
⇔ −(1/2) log( σ_i^2 τ^2 − σ_i^4 (v_i^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2 ) + (1/2) log σ_i^2
   > −(1/2) log( σ_j^2 τ^2 − σ_j^4 (v_j^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2 ) + (1/2) log σ_j^2,

where we used (12), ignoring any term that does not depend on i. Then it is simple algebra to further simplify it:

− log( τ^2 − σ_i^2 (v_i^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2 ) > − log( τ^2 − σ_j^2 (v_j^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2 )
⇔ log( τ^2 − σ_i^2 (v_i^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2 ) < log( τ^2 − σ_j^2 (v_j^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2 )
⇔ τ^2 − σ_i^2 (v_i^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2 < τ^2 − σ_j^2 (v_j^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2
⇔ −σ_i^2 (v_i^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2 < −σ_j^2 (v_j^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2
⇔ σ_i^2 (v_i^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2 > σ_j^2 (v_j^T β)^2 (β^T Σ β)^{-2} (β^T cov(Y, X | β, Σ))^2
⇔ σ_i^2 (v_i^T β)^2 > σ_j^2 (v_j^T β)^2.

Using the above derivation, we have shown that

P(I(Y, v_i^T X | β, Σ) > I(Y, v_j^T X | β, Σ)) = P( σ_i^2 (v_i^T β)^2 > σ_j^2 (v_j^T β)^2 ) = P( (v_i^T β)^2 / (v_j^T β)^2 > σ_j^2 / σ_i^2 ) > 1/2,

due to Lemma 1, which was proved in Artemiou and Li (2009), and the fact that σ_j^2 / σ_i^2 < 1. ∎
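The linearity property invoked in Eq. (10) can also be checked numerically when X is Gaussian. The sketch below (simulated data; the particular Σ, β, and sample size are arbitrary choices of the illustration) regresses each coordinate of X on the single index β^T X without an intercept and compares the fitted slopes with Σβ(β^T Σ β)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 500_000
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)
beta = rng.standard_normal(p)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
t = X @ beta                                   # the single index beta^T X

# No-intercept least squares of each coordinate of X on t estimates E(X | beta^T X).
slope_hat = (X * t[:, None]).sum(axis=0) / (t @ t)
slope_theory = Sigma @ beta / (beta @ Sigma @ beta)
print(np.round(slope_hat, 3))
print(np.round(slope_theory, 3))
```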

The following corollary shows that, as with the previous results, this result can also be extended to find the exact probability, as was calculated in Ni (2011). The way this result is stated uses the last two conditions in Lemma 2. One can rephrase the conditions to make it applicable to the case that either β or Σ is nonrandom, to cover the first two cases in Lemma 2.

Corollary 2 Suppose E(Y | X, β, Σ) = E(Y | β^T X, β, Σ) holds with var(X | Σ) = Σ and:
1. β ⊥⊥ (X, Σ);
2. E(X | β^T X, β, Σ) is a linear function of β^T X;
3. Either Assumption 1 or Assumption 2 is satisfied;
4. var(Y | β, Σ) = τ^2 < ∞ and cov(Y, X | β, Σ) ≠ 0 almost surely;
5. f(Y, v_k^T X | β, Σ) is Gaussian centered at 0 for all k = 1, . . . , p.
Then, for i < j,

P(I(Y, v_i^T X | β, Σ) > I(Y, v_j^T X | β, Σ)) = (2/π) E[arctan(√(σ_i^2/σ_j^2))].

A similar result to the one shown in Artemiou and Li (2013) for the conditional independence model can be shown to hold for the mutual information we use in this work. We state the results below, both for the inequality and for the equality.

Theorem 7 Suppose Y ⊥⊥ X | (β^T X, β, Σ) holds with var(X | Σ) = Σ, f : R^p → R, and:
1. Σ is a p × p matrix that has an orientationally uniform distribution;
2. X is a p-dimensional random vector such that X | Σ ∼ N_p(0, Σ);
3. β ⊥⊥ X and P(β ∈ G) > 0 for any nonempty open set G;
4. β ⊥⊥ (X, Σ);
5. E(X | β^T X, β, Σ) is a linear function of β^T X;
6. var(f(Y) | β, Σ) = τ^2 < ∞ and cov(f(Y), X | β, Σ) ≠ 0 almost surely;
7. f(Y, v_k^T X | β, Σ) is Gaussian centered at 0 for all k = 1, . . . , p.
Then, for i < j,

P(I(f(Y), v_i^T X | β, Σ) > I(f(Y), v_j^T X | β, Σ)) > 1/2.

Corollary 3 Suppose Y ⊥⊥ X | (β^T X, β, Σ) holds with var(X | Σ) = Σ, f : R^p → R, and:
1. β ⊥⊥ (X, Σ);
2. E(X | β^T X, β, Σ) is a linear function of β^T X;
3. Either Assumption 1 or Assumption 2 is satisfied;
4. var(f(Y) | β, Σ) = τ^2 < ∞ and cov(f(Y), X | β, Σ) ≠ 0 almost surely;
5. f(Y, v_k^T X | β, Σ) is Gaussian centered at 0 for all k = 1, . . . , p.
Then, for i < j,

P(I(f(Y), v_i^T X | β, Σ) > I(f(Y), v_j^T X | β, Σ)) = (2/π) E[arctan(√(σ_i^2/σ_j^2))].

Using the first two conditions of Lemma 2, we can expand the above results to the case that either β or Σ is nonrandom. The key idea in proving the theorem and the corollary is the fact that Y ⊥⊥ X | (β^T X, β, Σ) implies E(f(Y) | X, β, Σ) = E(f(Y) | β^T X, β, Σ), which was a critical assumption in proving Theorem 6 and Corollary 2.

3.3 Beyond the Normal Distribution

In this section we discuss what happens if the predictor vector X does not follow a normal distribution. To address the non-normality issue, we can use a result in Diaconis and Freedman (1984), who showed that under some extra assumptions most projections are normally distributed. According to Diaconis and Freedman (1984) (in their simpler result), if there is a set of n p-dimensional vectors, which depend on an index ν, and most are orthogonal and have squared length near pω^2 (where ω^2 is positive and finite), then one can show that the empirical distribution of the projections on the p-dimensional unit sphere tends weakly to N(0, ω^2) as ν tends to infinity. Note that Diaconis and Freedman (1984) give an example in their work where they demonstrate that samples with iid coordinates satisfy their condition. In the linear regression case, where the model is Y = β^T X + ε with ε ∼ N(0, τ^2), one can then remove the assumption that X comes from a Gaussian distribution in Theorem 5 and the rest of the results in Sect. 3.1, as long as the assumptions in Diaconis and Freedman (1984) are met. In the conditional independence and the conditional mean models and the results shown in Sect. 3.2, we can similarly remove the assumption that X comes from a Gaussian distribution, but we cannot remove the assumption that the joint distribution of Y and the principal component is Gaussian. Below we state only a modified version of Theorem 5, where we generalize it to allow for vectors coming from a non-Gaussian distribution.

Theorem 8 Suppose:
1. β satisfies Assumption 1 and Σ is a nonrandom matrix with spectral decomposition V Λ V^T;
2. X is a p-dimensional random vector such that E(X) = 0 and var(X) = Σ;
3. β ⊥⊥ X and P(β ∈ G) > 0 for any nonempty open set G;
4. ε is a random variable with E(ε) = 0, var(ε) = τ^2, and ε ⊥⊥ X;
5. Y = β^T X + ε.
Then, for i < j,

P(I(Y, v_i^T X | β) > I(Y, v_j^T X | β)) > 1/2.    (13)
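As an informal empirical companion to Theorem 8 (an illustration, not a result from the paper), the simulation below generates X with iid centered exponential coordinates scaled so that var(X) = diag(λ), in which case the principal component directions are the coordinate axes. It records how often the leading component has the larger sample squared correlation with Y; the squared correlation is used here only as a cheap stand-in for an estimate of I(Y, v_i^T X | β), and the resulting frequency is compared with the Gaussian-case value (2/π) arctan(√(λ_1/λ_3)).

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([4.0, 2.0, 1.0, 0.5])        # eigenvalues; eigenvectors taken as the coordinate axes
p, n, reps = lam.size, 2_000, 2_000

hits = 0
for _ in range(reps):
    beta = rng.standard_normal(p)                            # spherically symmetric coefficients
    X = (rng.exponential(size=(n, p)) - 1.0) * np.sqrt(lam)  # non-Gaussian, var(X_k) = lam_k
    y = X @ beta + rng.standard_normal(n)
    r2 = [np.corrcoef(X[:, k], y)[0, 1] ** 2 for k in (0, 2)]
    hits += r2[0] > r2[1]

print("empirical frequency:", hits / reps)
print("Gaussian-case value:", 2 / np.pi * np.arctan(np.sqrt(lam[0] / lam[2])))
```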

Using Mutual Information to Measure the Predictive Power of Principal Components

15
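The "greater than 1/2" statements above are easy to probe numerically. The following is a minimal MATLAB sketch, in the spirit of Theorem 8 but not from the paper: it draws a random direction for β, simulates the linear model with non-Gaussian (uniform) predictors whose covariance is a fixed diagonal Σ, and estimates how often the first population principal component is more correlated with Y than the second; squared sample correlations are used as a working surrogate for the mutual informations, in light of the Gaussian identity displayed earlier. All variable names are ours.

% Monte Carlo probe of P(I(Y, v_1'X) > I(Y, v_2'X)) > 1/2 under a linear
% model with non-Gaussian predictors; squared correlations stand in for MI.
rng(1);                                  % reproducibility
p = 5; n = 500; nrep = 2000;             % dimension, sample size, replications
Sigma = diag([5 4 3 2 1]);               % diagonal, so v_k are the coordinate axes
A = sqrtm(Sigma);
wins = 0;
for r = 1:nrep
    beta = randn(p,1); beta = beta/norm(beta);   % a random direction for beta
    U = (rand(n,p) - 0.5)*sqrt(12);              % iid uniform coordinates, unit variance
    X = U*A;                                     % var(X) = Sigma, non-Gaussian
    Y = X*beta + randn(n,1);                     % linear model with tau^2 = 1
    C1 = corrcoef(Y, X(:,1));                    % v_1'X is the first coordinate here
    C2 = corrcoef(Y, X(:,2));                    % v_2'X is the second coordinate
    wins = wins + (C1(1,2)^2 > C2(1,2)^2);       % leading PC "wins" this replication
end
fprintf('fraction with v_1''X more correlated with Y: %.3f\n', wins/nrep);

In runs of this kind the reported fraction should exceed 1/2, which is the qualitative behavior the theorem describes.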

4 Discussion

In this paper we propose the use of mutual information to measure the predictive potential of principal components in a regression setting, and we demonstrate that, under normality of the predictors, it is equivalent to correlation for this purpose. We show that this equivalence holds in the linear regression model as well as in the conditional independence and conditional mean models, which are commonly used in sufficient dimension reduction. Furthermore, we demonstrate that the assumption of a normal distribution on the predictors can be removed and similar results still hold. Although the assumption of normality can be sidestepped in the linear regression model by using the assumptions in Diaconis and Freedman (1984), one cannot eliminate it completely from the conditional independence and the conditional mean models, where we assume the joint distribution of Y and any principal component to be normal. One can further investigate whether these results can be extended to more general distribution families, such as the exponential family. Finally, we demonstrated in Sect. 3 that there is a connection between mutual information and entropy, and hence with the Kullback-Leibler divergence. It will be interesting to investigate further whether these or other measures can provide similar or stronger results than the ones mutual information gives us.

Acknowledgments The author would like to thank the editors and two anonymous reviewers for their comments. I would also like to thank Prof. Bing Li for an insightful discussion a few years back. Finally, special thanks to Prof. R. D. Cook for his contributions and his kindness. Back then I was still a PhD student, and after publishing the first paper on this topic (Artemiou and Li (2009), my MSc thesis paper), I met him for the first time in the summer of 2009 at the JSM conference in Washington DC. When I introduced myself, his immediate reaction was "The principal component guy." It has stuck with me since then, and anytime I talk about principal components his reaction comes to mind. It is that reaction that convinced me to contribute something on the predictive potential of PCA for this volume. I also want to thank him because, although he was surrounded by a number of his students and collaborators, he took a couple of steps away from them and a couple of minutes to talk with me. I found it very kind of him.

References

A. Artemiou, B. Li, On principal components and regression: a statistical explanation of a natural phenomenon. Stat. Sin. 19, 1557–1565 (2009)
A. Artemiou, B. Li, Predictive power of principal components for single-index model and sufficient dimension reduction. J. Multivar. Anal. 119, 176–184 (2013)
R.D. Cook, Fisher lecture: dimension reduction in regression. Stat. Sci. 22, 1–40 (2007)
R.D. Cook, B. Li, Dimension reduction for the conditional mean. Ann. Stat. 30, 455–474 (2002)
D.R. Cox, Notes on some aspects of regression analysis. J. R. Stat. Soc. Ser. A 131, 265–279 (1968)
P. Diaconis, D. Freedman, Asymptotics of graphical projection pursuit. Ann. Stat. 12, 793–815 (1984)
A.S. Hadi, R.F. Ling, Some cautionary notes on the use of principal components in regression. Am. Stat. 52, 15–19 (1998)
P. Hall, Y.J. Yang, Ordering and selecting components in multivariate or functional data linear prediction. J. R. Stat. Soc. Ser. B Stat. Methodol. 72, 93–110 (2010)
H. Hotelling, Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417–441 (1933)
I.T. Jolliffe, A note on the use of principal components in regression. Appl. Stat. 31, 300–303 (1982)
I.T. Jolliffe, Principal Component Analysis, 2nd edn. (Springer, New York, 2002)
B. Jones, A. Artemiou, On principal component regression with Hilbertian predictors. Ann. Inst. Stat. Math. 72, 627–644 (2020)
B. Jones, A. Artemiou, Revisiting the predictive potential of kernel principal components. Stat. Probab. Lett. 171 (2021). https://doi.org/10.1016/j.spl.2020.109019
B. Jones, A. Artemiou, B. Li, On the predictive potential of kernel principal components. Electron. J. Stat. 14, 1–23 (2020)
B. Li, Comment: Fisher lecture: dimension reduction in regression. Stat. Sci. 22, 32–35 (2007)
B. Li, Sufficient Dimension Reduction: Methods and Applications with R, 1st edn. (CRC Press, Boca Raton, 2018)
F. Mosteller, J.W. Tukey, Data Analysis and Regression (Addison-Wesley, Reading, MA, 1977)
L. Ni, Principal component regression revisited. Stat. Sin. 21, 741–747 (2011)
L. Paninski, Estimation of entropy and mutual information. Neural Comput. 15(6), 1191–1253 (2003)
K. Pearson, On lines and planes of closest fit to systems of points in space. Philos. Mag. (6) 2, 559–572 (1901)

A Robust Estimation Approach for Mean-Shift and Variance-Inflation Outliers Luca Insolia, Francesca Chiaromonte, and Marco Riani

1 Introduction

We consider procedures for detecting and treating outliers in a regression setting. By outliers we mean observations that affect estimation and inference on model parameters—a notion also known as influence (Cook and Weisberg, 1982), which depends on how the position of an observation in the predictor space and its response value combine to make it "extreme" relative to the bulk of the data. Critically, this notion depends also on the presence of other observations whose influence may mask or swamp that of the observation under consideration. Simple and, where possible, automated and computationally inexpensive approaches to detect and treat outliers are very important in the practice of regression. These approaches often rely on postulating an outlier generating mechanism and provide a weighting system for the observations—which may include removing or trimming observations (attributing a weight of 0), keeping them in the analysis as they are (attributing a weight of 1), and, in between, down-weighting them to control their influence. The literature on these subjects is extensive (Atkinson, 1985; Barnett and Lewis, 1974; Beckman and Cook, 1983;

L. Insolia () Faculty of Sciences, Scuola Normale Superiore, Pisa, Italy e-mail: [email protected] F. Chiaromonte Department of Statistics, Pennsylvania State University, University Park, PA, USA Institute of Economics & EMbeDS, Sant’Anna School of Advanced Studies, Pisa, Italy e-mail: [email protected] M. Riani Department of Economics and Management, University of Parma, Parma, Italy e-mail: [email protected] © Springer Nature Switzerland AG 2021 E. Bura, B. Li (eds.), Festschrift in Honor of R. Dennis Cook, https://doi.org/10.1007/978-3-030-69009-0_2


Belsley et al., 2004; Chatterjee and Hadi, 1988; Hampel et al., 1986; Huber and Ronchetti, 2009; Maronna et al., 2006). In particular, two main frameworks have been utilized: the mean-shift outlier model (MSOM) and the variance-inflation outlier model (VIOM).

MSOM, which assumes that outliers are generated by shifts in mean (Cook and Weisberg, 1982), has been the historically predominant framework. Traditionally, individual mean-shift outliers were detected through deletion-based approaches, computing prediction residuals (Cook, 1977). Using maximum likelihood estimation (MLE), the individual outlier position corresponds to the unit with the largest Studentized residual based on an ordinary least squares (OLS) fit. Notably, MLE for a MSOM can be reformulated as OLS for the underlying regression model augmented by a dummy for the presence/absence of the observation under evaluation. The estimated coefficient for the dummy is the prediction residual, and the corresponding t-statistic is the externally Studentized residual (also called deletion residual), which tests the "outlying-ness" of an individual observation (Atkinson, 1985). Because of masking and swamping effects, individual deletion residuals can fail to detect multiple MSOM outliers and thus lead to sub-optimal regression estimation and inference. A simple extension of deletion-based approaches to groups of observations implies a combinatorial increase in computation. Much like best subset feature selection, this was computationally intractable in the 1980s and 1990s for realistically large problems and triggered the development of proposals based on penalized fits of a model augmented by n dummies (one for each observation) (McCann et al., 2006; Menjoge and Welsch, 2010; She and Owen, 2011). MSOM detection is customarily followed by outlier removal; that is, outliers are attributed a weight of 0 in estimation and inference on the regression parameters.

VIOM is generally considered as an alternative to the MSOM framework, where outliers are generated by an inflation in the error variance (Cook et al., 1982; Thompson, 1985). In a way, VIOM is the "random effect" version of MSOM; instead of being generated by a fixed effect (mean shift, to be estimated), the outlier is generated by a random effect with a certain variance (again, to be estimated). In fact, an equivalent parametrization of the regression problem in the presence of variance-inflation outliers can be given in the form of a mixed-effects linear model. In the VIOM framework, outliers are not removed; they are retained in a weighted fit, where the weight for each observation is inversely proportional to the variance of its random effect. In general, because it uses down-weighting instead of discarding (or failing to discard) observations, VIOM can achieve higher accuracy than MSOM in estimation and inference on the regression parameters. If the data comprise (at most) a single variance-inflated outlier, its detection and the estimation of its variance (and thus its weight) can be performed through a closed-form MLE. Notably, on a given dataset, the outlier identified through individual deletions in MSOM and the outlier identified through MLE in VIOM need not coincide (unless this observation has both the largest absolute residual and the largest absolute Studentized residual). This illustrates how the statistical handling of outliers can depend on our assumptions concerning the mechanism that generates them.
However, if restricted maximum likelihood estimation (REMLE) (Harville, 1977) is used instead of MLE to detect


the outlier and estimate its variance, the flagged observation coincides with the MSOM outlier identified through individual deletions. The extension of MLE or REMLE approaches to multiple VIOM outliers also poses computational issues, because the closed-form expressions used in the case of individual outliers cannot be straightforwardly generalized (Gumedze, 2019). In summary, to date, penalization approaches have offered progress toward the computationally viable detection of multiple MSOM outliers—which are then removed from the regression. On the other hand, the computational viability of state-of-the-art techniques for detecting and down-weighting multiple VIOM outliers is still a concern.

The literature on outliers is closely related to that on robust estimation (Hampel et al., 1986; Huber and Ronchetti, 2009; Maronna et al., 2006); outliers can be thought of as a form of (adversarial) perturbation of the data, due to either errors in data recording or contamination with statistical units belonging to a population different from the one of interest. The mechanism generating contamination is critical for studying the properties of robust estimators and has an important role also in our developments. The traditional paradigm is the case- (or row-)wise contamination mechanism, also known as the Tukey-Huber mixture model. An outlier is thought of as comprising values that do not conform with the bulk of the data in all its dimensions. In full generality, one assumes that the mixture data distribution is Z ∼ (1 − ε)F + εC, so that each individual observation Z_i = (y_i, X_i) is drawn from the "true" distribution F with probability (1 − ε) and from a contaminating distribution C with probability ε (the scheme can be extended to multiple contaminating components). In a way, difficulties in dealing with multiple outliers motivated the development of high-breakdown point (BdP) robust estimators, which produce good estimates without assumptions on the nature of the outliers in the data (Donoho and Huber, 1983; Rousseeuw, 1984). Robust estimation is characterized by a trade-off between the reduction of biases due to outlier removal and the increase in estimates variability, or inefficiency, due to (possibly) not leveraging the entire information contained in the data. State-of-the-art robust methods achieve a compromise employing a preliminary high-BdP estimator (i.e., a possibly inefficient estimator that can withstand high contamination) and then refining its outcome with a second high-efficiency estimator to retain in the fit as much "uncontaminated" information as possible (Maronna et al., 2006; Rousseeuw and Leroy, 1987). So-called soft-trimming methods down-weight all units and implicitly account for both VIOM and MSOM outliers, while hard-trimming methods provide binary weights and account only for MSOM outliers (Cerioli et al., 2016). Relatedly, outlier problems formulated as mixture contamination models were studied also in the Bayesian literature. De Finetti (1961) investigated a general framework, and Box and Tiao (1968) focused on a VIOM where both the total fraction of contamination and the inflation parameter were assumed to be known constants.

We combine robust estimation and mixed models techniques into a novel approach that detects and treats multiple outliers in an effective and computationally viable fashion. Our approach can be fully automated; however, since it utilizes an iteration, we describe criteria to monitor its progression through a graphical diagnostic tool.
Importantly, our approach is applicable also to scenarios with a


mix of MSOM and VIOM outliers and comprises a step that, under reasonable assumptions, can separate the corresponding observations. Specifically, we rely on the forward search (FS) (Atkinson and Riani, 2000) for outlier detection and use REMLE to perform down-weighting. FS is an adaptive hard-trimming method based on an iterative algorithm. It starts from a clean subset of observations identified with a (possibly inefficient) high-BdP estimator. At each iteration, it uses OLS to fit the regression (which is fully efficient on the current subset) and extends the subset recovering the observation that is closest to the fit in terms of (robust) residuals. If the contamination fraction is lower than the BdP of the initial estimator, FS provides consistent estimates (Cerioli et al., 2014) and an "optimal" ranking of the observations in terms of their distance from the model (Johansen et al., 2016). Outliers are generally recovered in the last iterations, and we exploit this fact to detect MSOM and VIOM outliers. Moreover, FS allows one to monitor the influence exerted on regression parameter estimates by the observations retrieved in each iteration. This can be used to adaptively settle on a final clean subset of observations, which in many practical scenarios guarantees a better trade-off between BdP and efficiency (Riani et al., 2014).

The remainder of the article is organized as follows. Section 2 introduces the classical and contaminated regression models, some technical background, and the details of our proposal. Section 3 presents simulations comparing accuracy and computing time of various techniques under different scenarios (our graphical diagnostics are illustrated on one of the more complex simulation scenarios). Section 4 applies our approach to real-world data, in both its automated and its "monitored" versions. Section 5 provides final remarks and pointers for future extensions.

2 Our Proposal and Some Background

2.1 A Generalized Setting

Consider the classical linear regression model of the form

y = Xβ + ε,   (1)

where y ∈ R^n is a vector of observable responses, X ∈ R^(n×p) is a full rank design matrix with n > p containing observable predictors (these are customarily considered as given, even when they comprise randomness), β ∈ R^p is an unknown parameter vector, and ε ∈ R^n is a vector of unobservable random errors. Classical assumptions specify that such errors are uncorrelated, homoscedastic, and Gaussian; ε ∼ N(0, σ²I_n), with σ² > 0. Under these assumptions the MLE for β corresponds to the OLS, which is the uniformly minimum variance unbiased


estimator (UMVUE). The MLE for σ² is biased by the factor n/(n − p), while REMLE provides a UMVUE also for σ². The absence of any (systematic or stochastic) deviation from (1) is an implicit assumption. We relax it through a parametric outlier model affecting both means and variances. In particular, we allow the presence of two distinct groups of outliers: m_V observations generated from a VIOM and m_M observations generated from a MSOM. We index the two groups as I_V and I_M, respectively, but we remark that the outliers' labels, i.e., which indexes belong to these two sets, as well as their cardinalities, are unknown. In symbols, we have

ε_i ∼ N(0, σ²w_i)  for all i ∈ I_V,
ε_i ∼ N(λ_i, σ²)   for all i ∈ I_M,
ε_i ∼ N(0, σ²)     otherwise,   (2)

where w_i > 1 for i ∈ I_V and λ_i ≠ 0 for i ∈ I_M. An equivalent parameterization of the contaminated model is

y = Xβ + D_IV δ + D_IM λ + ε,

where D_IV (n × m_V) and D_IM (n × m_M) are matrices composed of dummy column vectors indexing the outliers belonging to the two groups, δ ∈ R^(m_V×1) is a random vector distributed as N(0, σ² Diag_mV(w_i − 1)) (Diag_k(·) stands for a k × k diagonal matrix), λ ∈ R^(m_M×1) is a non-stochastic vector, and the random error vector is again ε ∼ N(0, σ²I_n). This parameterization highlights that MSOM and VIOM outliers can be thought of, respectively, as fixed and random effects in a mixed-effects linear model. As noted by Cook et al. (1982), one could envision outliers compounding a mean shift and a variance inflation—but this leads to an over-parametrization in which these compounded outliers are equivalent to MSOMs.

The fact that we focus on an unlabeled problem, where not only the identity but also the number and nature (MSOM vs. VIOM) of multiple outliers is unknown, complicates matters because it makes masking and swamping effects more likely. As customary (especially in the robust statistics literature), we assume that MSOM outliers can also be affected by shifts in the predictors (which contaminate entries of the design matrix, affecting leverage) (Maronna et al., 2006), but that the VIOM outliers are not (Cook et al., 1982). Correspondingly, when generating predictors in our simulation experiments, we introduce mean shifts in their distribution; we thus use λ_X to indicate predictor shifts and λ_ε to indicate error shifts. We also restrict ourselves to settings in which n is substantially larger than p and rely on two key additional assumptions, namely, that:

A1 The total fraction of contaminated observations (MSOM or VIOM) is smaller than 50%.


A2 Systematic contaminations, which induce shifts in means (MSOM), have larger influence on the regression compared to stochastic contaminations, which inflate variances (VIOM). Thus, under the uncontaminated model, MSOM outliers are expected to have larger residuals than VIOM outliers.

(A1) allows us to safely rely on the properties of high-BdP equivariant estimators and is fairly standard (Maronna et al., 2006). (A2) allows us to take advantage of the FS algorithm to discriminate between the two types of outliers and may not be appropriate in all applications. However, it reflects an intuitive logic in differentiating the two types of contaminations; e.g., shifts in means may be due to the inclusion in the sample of units that do not belong to the target population, while inflation in variances may be due to inaccuracies in measurements on units that do. Also intuitively, from the perspective of the remedies taken, a shift in mean, resulting in the deletion of an observation, cannot be less consequential than an inflation in variance—resulting in a down-weighting.
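As a concrete illustration of model (2) and assumptions A1-A2, the following minimal MATLAB sketch generates one contaminated dataset in the spirit of the simulation design of Sect. 3; the specific values (w = 10, λ_ε = −3, λ_X = 3) are those used there, but the variable names are ours and nothing here relies on the FSDA toolbox.

% Generate y = X*beta + eps with mV VIOM and mM MSOM outliers, as in model (2).
rng(2);
n = 200; p = 2;                         % intercept plus one predictor
beta = [2; 2]; sigma = 1;
w = 10;                                 % variance-inflation factor (VIOM)
lambda_eps = -3; lambda_X = 3;          % error and predictor mean shifts (MSOM)
mV = 25; mM = 25;                       % contaminated units (A1: (mV+mM)/n < 0.5)
X = [ones(n,1) randn(n,1)];
eps = sigma*randn(n,1);
idx = randperm(n, mV+mM);               % outlier positions, drawn without overlap
iV = idx(1:mV); iM = idx(mV+1:end);
eps(iV) = sqrt(w)*sigma*randn(mV,1);    % VIOM: inflated error variance
eps(iM) = eps(iM) + lambda_eps;         % MSOM: shift in the error mean
X(iM,2) = X(iM,2) + lambda_X;           % MSOM: shift in the predictor (bad leverage)
y = X*beta + eps;

With opposite signs for λ_ε and λ_X, the MSOM units act as bad leverage points pulling against the true positive slope, consistent with the simulation settings described later.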

2.2 Some Technical Background

Our proposal utilizes the FS (Atkinson and Riani, 2000), an adaptive hard-trimming method based on an iterative algorithm. The FS algorithm starts from a "clean" subset of observations of size, say, b0. This is identified with a high-BdP estimator, often setting b0 = p in order to reduce the probability of including outliers. However, unlike in the case of an MM-estimator (see below), the robustness of the FS does not depend on the choice of high-BdP estimator (as long as it unmasks outliers) but on its inclusion strategy. Indeed, computationally fast high-BdP estimators are generally used in the FS, e.g., least median of squares (LMS) or least trimmed squares (LTS) (Rousseeuw and Leroy, 1987; Rousseeuw and Van Driessen, 2006). At each iteration b, with b0 ≤ b ≤ n, the FS operates on a current subset of observations S(b) of size b. The OLS estimate β̂(b) is computed on observations i ∈ S(b), and residuals are produced for all observations i = 1, . . . , n:

e_i(b) = y_i − x_i^T β̂(b).   (3)

In the subsequent iteration of the FS, S(b + 1) will comprise the b + 1 observations with smallest absolute residuals in (3). Importantly, the FS strategy for recovering and sometimes removing observations from the current subset (removals can happen especially as outliers are included in the fit late in the process) provides a natural ordering of all observations at each iteration—because the OLS is fully efficient under the uncontaminated null model. Once all n observations have been included in the process, the FS reaches the full OLS fit. Indeed, the FS comprises a collection of least squares estimators carrying information on a sequence of model fits—from a very robust one to the classical OLS. The next objective is to establish a satisfactory compromise between BdP and efficiency along this sequence, pinpointing an iteration where the inclusion of outliers “breaks down” the OLS (Riani and Atkinson, 2007). For a generic iteration,


consider the deletion residuals of the n − b observations i ∉ S(b), defined as

r_i(b) = (y_i − x_i^T β̂(b)) / √( s²(b){1 + h_i(b)} ) = e_i(b) / √( s²(b){1 + h_i(b)} ),   (4)

where s²(b) estimates σ² on b − p degrees of freedom, h_i(b) = x_i^T [X(b)^T X(b)]^{-1} x_i, and X(b) indicates the design matrix restricted to the rows i ∈ S(b). Let i_min = arg min_{i ∉ S(b)} |r_i(b)| be the index of the observation that is closest to S(b) in terms of deletion residuals. The idea is that if the absolute value of r_{i_min}(b) is sufficiently large, i_min (and a fortiori all other observations ∉ S(b)) are outliers. The deletion residuals in (4) follow a Student's t distribution under the uncontaminated null model if the estimates are based on all n − 1 observations (Cook and Weisberg, 1982). But this fact is not directly applicable for assessing the inclusion of some outliers in an FS iteration, because here they depend on order statistics. However, the assessment can be performed, e.g., by bootstrapping. In particular, we utilize an approach proposed by Riani et al. (2009) for multivariate analysis and adapted by Atkinson et al. (2016) to regression problems. It relies on theoretical results from (symmetrically) truncated distributions and order statistics to provide fast and accurate point-wise bounds that approximate bootstrap envelopes. Multiple testing is handled controlling the sample-wise level at around 1%. When applied to regression settings, the FS coupled with this "automated" strategy to identify a good trade-off between BdP and efficiency is referred to as FSR (Riani et al., 2012) (see Fig. 5 for an example).

In more detail, starting from any iteration of the FS, FSR implements a two-stage procedure based on consecutive single outlier testing of the values r_{i_min}(b), which adaptively trims outliers. A first stage detects outliers using all n observations, testing consecutive triplets, couples, or single extreme values (see Atkinson et al. (2016) for more details). If a first signal is detected (i.e., a test of outlying-ness coming out significant) at a given iteration, say b1, all observations not belonging to S(b1) are flagged as possible outliers. A second stage attempts to validate this signal and, if it does, trims a subset of the observations flagged in the first stage. This is performed through a superimposition of (forward) confidence bands, starting from the signal potentially detected at iteration b1 in the first stage, and continuing until the trajectory at some iteration b2 is no longer contained inside the forward confidence bands. All observations not belonging to the subset S(b2) are those which are finally trimmed. As an example, Fig. 5 (right panel) illustrates the forward plot of the minimum absolute deletion residuals. Monitoring this statistic, a first signal is detected in correspondence of the purple vertical line, where the statistic exceeds the pre-specified quantile. A stopping signal is detected at iteration 675, where the statistic of interest exceeds for the first time its confidence bands (not shown in the plot), indicating the presence of more influential outliers.

We remark that, in principle, we could develop our proposal using soft-trimming approaches such as the MM-estimator (Yohai, 1987) in place of the FS. However, while these estimators have appealing theoretical and empirical properties (Maronna


et al., 2006; Riani et al., 2014), they also have substantial drawbacks with respect to our purposes. In particular, they (i) generally rely on a computationally expensive preliminary high-BdP estimator (e.g., a soft estimator of scale) (Riani et al., 2014); (ii) down-weight all observations, possibly trimming the most extreme ones, and thus do not separate a subset of “clean” observations from the rest; (iii) estimate weights through a loss function (e.g., the Tukey bisquare), without explicitly relying on a variance-inflation model; (iv) require nontrivial choices (e.g., the preliminary estimator and the loss function); (v) comprise tuning parameters which are often prespecified (e.g., the efficiency level of the MM-estimator); and (vi) can complicate statistical inferences (Cerioli et al., 2016).
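To fix ideas on the recursion in (3)-(4), here is a bare-bones MATLAB sketch of the forward search loop. It is a simplification for illustration only (a fixed starting subset, no envelope testing, no interchange bookkeeping), not the FSR implementation of the FSDA toolbox, and all names are ours.

% Bare-bones forward search: grow the subset by smallest absolute residual,
% recording the minimum absolute deletion residual (4) at each step.
% S0 must contain more than p distinct indices (e.g., a clean subset from LMS).
function [rmin, closest] = toy_forward_search(y, X, S0)
    [n, p] = size(X);
    S = false(n,1); S(S0) = true;          % initial "clean" subset
    rmin = nan(n,1); closest = nan(n,1);
    for b = sum(S):n-1
        Xb = X(S,:); yb = y(S);
        bhat = Xb \ yb;                    % OLS on the current subset, eq. (3)
        e = y - X*bhat;                    % residuals for all n observations
        s2 = sum((yb - Xb*bhat).^2)/(b - p);
        h = sum((X/(Xb'*Xb)).*X, 2);       % x_i'(X(b)'X(b))^{-1}x_i for all i
        r = e ./ sqrt(s2*(1 + h));         % deletion residuals, eq. (4)
        out = find(~S);
        [rmin(b), k] = min(abs(r(out)));   % closest excluded observation
        closest(b) = out(k);
        [~, idx] = sort(abs(e));           % FS rule: keep the b+1 smallest |e|
        S = false(n,1); S(idx(1:b+1)) = true;
    end
end

The trajectory rmin is the quantity whose exceedances of pre-specified envelopes FSR monitors to declare "signals" (Riani et al., 2012).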

2.3 Our Proposal

Our proposal estimates model parameters, identifies outliers arising from either a VIOM or a MSOM, separates them, and estimates the weights with which they participate in the regression (these are forced to 0 for MSOM outliers). In using a hard-trimming approach, we rely on the fact that the VIOM can be viewed as a generalization of the MSOM. An asymptotic equivalence can be drawn (Cook et al., 1982), as an increasing variance inflation pushes weights to 0. Moreover, based on REMLE, VIOM provides a ranking of outliers equivalent to that of MSOM (Thompson, 1985). Consequently, to assess the presence of either VIOM or MSOM outliers, we can simply compare externally Studentized residuals (or any monotonic function of them, e.g., Studentized residuals; Cook and Weisberg 1982). Of course such residuals must be computed from a robust fit in order to avoid masking problems. In this setting the FS ranking is meaningful: VIOM outliers are recovered by the iterations right after the clean units and before MSOM outliers.

In addition to utilizing the FS ranking, our proposal also combines trimming and REMLE weighting. In a way, we create a straightforward generalization of the procedure proposed by Thompson (1985) in the presence of a single VIOM outlier, namely, (i) find the largest squared Studentized residual, (ii) estimate w and σ² with REMLE, and (iii) estimate β using weighted least squares. Relying on assumption A2, which postulates that MSOM outliers are more extreme than VIOM outliers, we adapt this procedure as follows:

• We run FSR with standard settings and take its first detected signal as our "weak" signal, pinpointing the iteration where VIOM outliers start to be included in the fit.
• We increase the standard quantile thresholds (since FSR only aims to trim observations, its default settings can be too weak to separate coexisting VIOM and MSOM outliers, as we wish to do here) and take the second signal detected by FSR as our "strong" signal, pinpointing the iteration where MSOM outliers start being included in the fit.


• We label the group of observations recovered by the FS iterations between the two signals as VIOM outliers, and those excluded from the FS at the second signal as MSOM outliers.
• We use REMLE to estimate the weights of the observations labeled as VIOMs and trim out the observations labeled as MSOMs (i.e., set their weights to 0).

This procedure, which we refer to as FSRw, adaptively identifies both VIOMs and MSOMs, generating a data-driven estimate of the fraction(s) of outliers and without fixing a priori a trade-off between BdP and efficiency. Furthermore, given that FSR relies only on consecutive exceedances of single-unit outlying tests, it tackles multiple outliers without resorting to more complex calculations. In the current implementation, we use REMLE to estimate (sub-optimal) single weights; thus, from now onward we will refer to our procedure more specifically as FSRws. However, we point out that in principle one can use REMLE to estimate multiple weights jointly, based on recent proposals in Gumedze (2019). In this article we do illustrate the excellent performance and low computational burden of FSRws (see Sect. 3), but we do not compare it to its FSRwj "joint" extension. This is of course of interest, as it may lead to substantially better performance and, with appropriate computational implementations, to affordable increases in running time—but it is left for future work. Evidence from preliminary comparisons on small simulated datasets (not shown) suggests that, at least at low contamination levels, FSRwj does not produce marked performance gains.
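Once the two groups are labeled and the VIOM weights are available (their REMLE estimation is not reproduced here), the final estimation step is an ordinary weighted least squares fit. The following is a minimal MATLAB sketch of that last step only; the index sets iV, iM and the vector wV of estimated variance-inflation factors are hypothetical inputs assumed to come from the labeling and REMLE stages described above.

% Final FSRws-style fit: weight 1 for clean units, 1/w_i for labeled VIOM
% outliers, 0 for labeled MSOM outliers, then weighted least squares.
v = ones(n,1);
v(iV) = 1 ./ wV;                    % down-weight VIOM outliers (wV > 1)
v(iM) = 0;                          % trim MSOM outliers
W = spdiags(v, 0, n, n);
beta_hat = (X'*W*X) \ (X'*W*y);     % weighted least squares estimate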

2.4 Graphical Diagnostics

If one fixes the thresholds used to pinpoint "weak" and "strong" signals along the FS, FSRws is computationally very efficient and fully automated. Full automation is particularly useful when multiple outlier detection/treatment and estimation of model parameters must be accomplished rapidly and without human intervention (e.g., fraud detection in international trade data as described in Perrotta and Torti 2010). When full automation is not necessary, graphical diagnostic tools can aid decisions by allowing a user to monitor the FS process, especially when combined with interactive graphical tools (e.g., using brushing and linking techniques as in Riani et al. 2012). Indeed, the FS algorithm embeds information about the influence of every point, at each iteration, on any parameter (or test statistic) of interest. We propose to profile (single) REMLE weights for all the observations not included in the FS at each iteration, creating what we call a cascade plot (see Fig. 6 for an example).

The rationale for the cascade plot diagnostic is the following. As iterations proceed and one moves along the plot horizontally, estimated weights at the time of inclusion in the FS should be (i) approximately constant for uncontaminated observations (except for some short and mild dip due to the FS inclusion rule, especially in the last iterations), (ii) markedly decreasing when VIOM outliers start


to be included, and (iii) sharply increasing when MSOM outliers start to be included (due to masking effects).

Unlike diagnostics that inform us on the quality or pitfalls of a specific regression fit (Atkinson, 1985; Belsley et al., 2004; Chatterjee and Hadi, 1988; Cook and Weisberg, 1982), this and other types of monitoring diagnostics provide information about a sequence of fits. This often reveals the empirical properties of a robust estimator and provides useful insights about the structure of the data. Indeed, the "philosophy" of monitoring, which is very natural in the spirit of the FS, can be generalized to other classes of robust estimation procedures, e.g., creating graphical diagnostic plots that profile residuals (or their correlations) along a sequence of BdP or efficiency values—moving from one extreme to the other (Cerioli et al., 2016, 2018; Riani et al., 2014). For instance, when we compare procedures in Sect. 3, we also implement an MM-weights plot (see left panel of Fig. 7) which is similar in spirit to a cascade plot. MM-estimators are often used with a pre-specified efficiency level (e.g., 0.85, 0.95, or 0.99)—relying on asymptotic results that hold only for data where contaminated and uncontaminated observations are well separated and predictors are orthogonal (Maronna et al., 2006, p. 141). In contrast, an MM-weights plot allows us to monitor estimated weights as a function of efficiency, again keeping track of a sequence of fits—from very high BdP to very high efficiency. Of course, the choice of a preliminary high-BdP estimator affects the solution. One of its clear effects is that of shifting the efficiency level required to (possibly) break down the MM-estimator. Monitoring weights derivatives is also informative and can be logically related to the infinitesimal approach to robustness (Hampel et al., 1986) and to local influence (Cook, 1986). We implement this in an MM-weights derivatives plot (see right panel of Fig. 7).

3 Simulation Study

Here we present the general simulation framework we created to evaluate our proposal and then focus on selected simulation results illustrating accuracy and computational burden. Documented MATLAB code to reproduce all results in this article is available on GitHub (www.github.com/LucaIns/VIOM_MSOM). Our implementation relies on the FSDA MATLAB Toolbox (FSDA) (Riani et al., 2012), which is downloadable from the MathWorks File Exchange (www.mathworks.com/matlabcentral/fileexchange/72999-fsda).² Importantly, the fsdaR package (Todorov and Sordini, 2020) allows one to call most of the FSDA routines directly from R.

²Full documentation of all the functions inside the FSDA Toolbox can be found at rosa.unipr.it/FSDA/guide.html. For example, to obtain the documentation of our VIOM function, it is necessary to type rosa.unipr.it/FSDA/VIOM.html.


First, we generate data following the uncontaminated model (1). The n × p design matrix X is drawn from a standard p-variate normal. The p-dimensional coefficient vector β is fixed; note that the size of the coefficients is irrelevant as long as we consider regression and affine equivariant estimators (Maronna et al., 2006, p. 142). The errors are drawn independently from a N(0, σ²_SNR), where σ²_SNR is chosen as a function of the signal-to-noise ratio with which we want to characterize an experiment; SNR = var(Xβ)/σ²_SNR. Next, following (2), we independently contaminate (uniformly at random, without repetitions) m_M observations with a MSOM (here mean shifts are introduced also in the predictors) and m_V with a VIOM (here predictors are uncontaminated).

To create an Oracle benchmark, we run a weighted least squares (WLS) fit using the true weights (i.e., 0 or w_i^{-1} for observations contaminated with a MSOM or a VIOM, respectively, and 1 for uncontaminated observations). In the figures reported in this section, this optimal benchmark is indicated with "opt." We compare it with the following procedures:

• The ordinary least squares (OLS).
• The least median of squares (LMS), a hard-trimming estimator with asymptotic BdP of 50% (the highest achievable for equivariant estimators) (Rousseeuw, 1984).
• The MM-estimator (MM), using LMS as preliminary estimator and the Tukey bisquare loss function, with tuning constant fixed so as to achieve 85% nominal efficiency (Maronna et al., 2006). Note that using a preliminary hard-trimming estimator such as LMS is sub-optimal in terms of efficiency in MM, but very convenient in terms of reducing computational burden.
• Forward search regression (FSR), the adaptive trimming procedure described in Sect. 2.2, with the initial clean subset created again with LMS (based on our assumption A1, we force FSR to search for a signal only after having included 50% of the sample).
• Our FSRws, which utilizes a variant of FSR and single REMLE weights as described in Sect. 2.3.

Performance of the procedures is compared across different sample sizes n and fractions of (total) contamination (m_M + m_V)/n. We generate an equal number of VIOM and MSOM outliers (m_V = m_M) without overlaps between the two groups. Each simulation scenario is replicated t times and results are averaged. In terms of performance metrics, when p = 1, we consider the mean squared error (MSE) of β̂ partitioned into variance and squared bias:

MSE(β̂) = (1/t) Σ_{i=1}^{t} (β̂_i − β)² = (1/t) Σ_{i=1}^{t} (β̂_i − β̄)² + (β̄ − β)²,   (5)

where β̄ = Σ_{i=1}^{t} β̂_i / t is the average of the estimates across the t replications. When p > 1, we average the MSE across the coordinates of β and take MSE(β̂) = Σ_{j=1}^{p} MSE(β̂_j)/p.

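The bias-variance split in (5) is just the usual algebraic identity; a two-line MATLAB check on a vector bhat of replicated estimates and a true scalar beta (both hypothetical names introduced here for illustration):

% Verify MSE = variance + squared bias for replicated estimates bhat of beta.
mse   = mean((bhat - beta).^2);
decmp = mean((bhat - mean(bhat)).^2) + (mean(bhat) - beta)^2;
% mse and decmp agree up to floating-point error.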

We also consider the MSE of proxy estimates of weights (v̂_i = ŵ_i^{-1}, i = 1, . . . , n) and error variance (ŝ²). Note that comparing weights estimates poses some issues because outlier labeling varies from replication to replication, VIOM outliers may sometimes not carry sizeable residuals, and experiments with larger sample sizes may contain more outlier-like uncontaminated observations by chance (these are especially hard to distinguish from VIOM outliers). For the variance we take a WLS-like proxy estimate of the form

ŝ² = [1/(n − p)] Σ_{i=1}^{n} v̂_i e_i² / ( Σ_{i=1}^{n} v̂_i / n ),   (6)

where the e_i's are estimation residuals.³ This captures the effectiveness of weights estimates, taking into account the outlying-ness of observations regardless of whether they are in fact contaminated. The MSE decomposition for ŝ² is computed as in (5), where σ²_SNR and ŝ² replace β and β̂, respectively. Finally, and importantly, we compare procedures in terms of average computing time.

In the following, for simplicity, we focus on results for a simulation scenario where the uncontaminated model contains an intercept and a single predictor (p = 2), setting β = (2, 2)^T. The signal-to-noise ratio is set to SNR = 3. VIOM outliers are all generated with variance-inflation parameter w = 10. MSOM outliers are all generated with error mean shift λ_ε = −3 and predictor mean shift λ_X = 3. We use shifting parameters with opposite signs in order to create bad leverage points, which are more likely to disrupt the true positive slope relating response and predictor. We consider increasing sample sizes n ranging from 100 to 1000 (with a step size of 50) and total contamination fractions (m_V + m_M)/n of 0, 0.25, and 0.5. Data for each setting are generated t = 500 times, and results are averaged over these replications.

Figure 1 shows results for the MSEs of β̂ (left panel) and ŝ² (right panel) across procedures and sample sizes when there is no contamination; (m_V + m_M)/n = 0. Both FSR and FSRws do not detect any signal and lead to optimal OLS estimates. On the other hand, MM and LMS must sacrifice some efficiency under the null uncontaminated model. In particular, the MM β̂ estimates are close to optimal, but its ŝ² estimates are biased. LMS has a lower convergence rate for β̂ and even larger biases for ŝ².

Figure 2 has the same format as Fig. 1, but here the total fraction of contamination is set to (m_V + m_M)/n = 0.25 (m_V/n = m_M/n = 0.125). The MM outperforms other procedures—and, notably, it often outperforms also the Oracle benchmark in terms of ŝ². This may be due to the fact that some VIOM outliers do not need to be down-weighted because they lie along the bulk of the data. However, FSR and FSRws perform almost on par with MM and markedly better than LMS (and of course OLS) in terms of β̂.

³Note that here we are not using any consistency factor in ŝ². Consistency factors are often used in robust estimation (Maronna et al., 2006).
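For completeness, the proxy in (6) is immediate to compute from residuals e and estimated weights v (hypothetical MATLAB variables standing for the quantities defined above):

% WLS-like proxy variance estimate of eq. (6), without consistency factors.
s2_hat = (sum(v .* e.^2) / (n - p)) / (sum(v) / n);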

[Figure 1 (two panels): "MSE beta (SNR: 3, MSOM: 0.000, VIOM: 0.000)" and "MSE sigma with WLS res (SNR: 3, MSOM: 0.000, VIOM: 0.000)"; MSE, with its variance and squared-bias components, plotted against sample size n for opt, OLS, LMS, MM, FSR, and FSRws.]

Fig. 1 MSE comparisons, across procedures and sample sizes, for β̂ (left panel) and ŝ² (right panel) in the absence of contamination

[Figure 2 (two panels): "MSE beta (SNR: 3, MSOM: 0.125, VIOM: 0.125)" and "MSE sigma with WLS res (SNR: 3, MSOM: 0.125, VIOM: 0.125)"; same layout as Fig. 1.]

Fig. 2 MSE comparisons, across procedures and sample sizes, for β̂ (left panel) and ŝ² (right panel) in the presence of an intermediate level of contamination

LMS performance is in fact similar to the case with no contamination, due to its high BdP. The OLS breaks down due to the presence of MSOM outliers, which induce strong biases (and very low variances) in its estimates. Note that, for our FSRws, the MSE of ŝ² shows a slight increase for large n values. This is due to the fact that FSRws tends to detect more outliers as the sample size increases, including "false positives" (especially "false-positive" VIOMs). Consequently, FSRws may down-weight (or even trim out) more observations than needed when the sample size is very large.⁴

⁴This is a feature "inherited" from our use of FSR. As the sample size increases, FSR becomes better (likely due to its strong consistency) at detecting and thus trimming all outliers (both VIOM and MSOM). Hence, when n is large, the first signal (which is the same both for FSR and FSRws) can occur while still including clean observations. In the strong contamination scenario (50% total fraction of contamination) which we consider next, we do not notice such a phenomenon because we are forcing FSR to find signals in the second half of the search (i.e., we force 50% of the weights to be equal to 1, motivated by our assumption A1).

[Figure 3 (two panels): "MSE beta (SNR: 3, MSOM: 0.250, VIOM: 0.250)" and "MSE sigma with WLS res (SNR: 3, MSOM: 0.250, VIOM: 0.250)"; same layout as Fig. 1.]

Fig. 3 MSE comparisons, across procedures and sample sizes, for β̂ (left panel) and ŝ² (right panel) in the presence of a high level of contamination

Nevertheless, the ŝ² produced by FSRws shows smaller bias than that produced by FSR across sample sizes.

Figure 3 has again the same format, but here the total fraction of contamination is increased to (m_V + m_M)/n = 0.50 (m_V/n = m_M/n = 0.25). FSRws, FSR, and LMS perform comparably well in terms of β̂. But at this high contamination level, we see a breakdown of the MM, not just the OLS. The situation is similar for ŝ²; LMS provides (nearly) optimal estimates, FSRws does well (better than LMS as n increases) and improves upon FSR (especially in terms of bias), and MM and OLS do poorly. These results highlight how the need to use a pre-specified efficiency can seriously hinder the MM-estimator; having fixed efficiency at 85%, we observe a breakdown in the MM only when raising the contamination to a total fraction as high as 50%. However, when efficiency is set at higher levels (95% or 99% is often used), MM performance can seriously deteriorate also with milder contaminations.

Figure 4 (left panel) shows average computing times across procedures and sample sizes, in the high contamination setting. All the procedures we compare here run reasonably fast on our simple and (relatively) small simulated data. For all, on average, the running time is less than 1 second for n = 1000 (using MATLAB R2018a on an Intel Core i7-7700HQ CPU at 2.8 GHz × 4 processors and 16 GB RAM). In general, the computational cost of an MM-estimator is nearly all due to obtaining a preliminary high-BdP estimator. This can in fact be very expensive with standard choices (e.g., soft-scale estimators). However, in MM we use the LMS, which is inexpensive. After the LMS is computed, the M-estimation phase of MM takes a negligible amount of time (as it runs iteratively reweighted OLS). Running FSR from a clean subset produced by LMS is more expensive, but, importantly, it produces information about a sequence of fits. Running our FSRws adds somewhat to the computational burden, but not a lot. In our current implementation, FSRws runs fairly inexpensively based on the FSR solution—and with further code optimization, the added cost on top of that of FSR should be nearly equivalent to that of running a WLS.

[Figure 4: left panel, computing time (in sec.) versus n for opt, OLS, LMS, MM, FSR, and FSRws; right panel, scatterplot of Y versus X with points labeled clean/VIOM/MSOM and the fitted lines of the compared procedures superimposed.]

Fig. 4 Left panel: average computing time comparisons, across procedures and sample sizes, in the presence of a high level of contamination. Right panel: scatterplot of a simulation example comparing different fits in the presence of a high level of contamination. Here n = 1000 and the points are labeled according to the true generating mechanism

As in many statistical settings, what it means for a sample to be small or large depends on the underlying structure of the problem—for instance, crucially on the "signal-to-noise" ratio discussed above. However, a general rationale can be articulated. Small sample sizes are those around 10–15 (for instance, the size of our example from Cook et al. 1982). In general, Rousseeuw and Van Zomeren (1990) suggest that a robust estimator can be successfully applied when the ratio n/p > 5. Large/very large sample sizes are those in the hundreds, thousands (for instance, the size of our loyalty cards example), or higher (tens or hundreds of thousands, millions). For these, asymptotic results guarantee that FSR provides consistent estimates (Cerioli et al., 2014). To gauge things in practice, in regressions with SNRs around 3 or 5, sample sizes around 100 or 200 should already guarantee excellent performance for FSR. Notably, the computational burden of our FSRws remains very manageable also for large and very large sample sizes (with n in the thousands, the procedure still takes only a few seconds to run). Importantly, all procedures considered here, including the FSRws (which effectively tackles multiple MSOM and VIOM outliers), are hugely cheaper than any approach for outlier treatment that relies on combinatorial enumeration—especially in the case of VIOM outliers.

Next, we illustrate graphical diagnostics using a simulation with high contamination and n = 1000. Figure 4 (right panel) compares different fits on this dataset—red and green points represent the true MSOM and VIOM outliers, respectively. FSR, FSRws, and LMS here provide fits much closer to the Oracle than MM, which breaks down because efficiency is set at 85%, and of course OLS. Figure 5 shows the corresponding residual forward plot (left panel) and absolute minimum deletion residual forward plot (right panel), which are commonly used as graphical diagnostics for the FS (Atkinson and Riani, 2000; Atkinson et al., 2016). The residual forward plot tracks residual trajectories along the FS iterations.

[Figure 5: left panel, residual forward plot (residual trajectories versus iteration b); right panel, absolute minimum deletion residual versus iteration b, with pointwise envelopes at the 1%, 50%, 99%, 99.9%, 99.99%, and 99.999% levels.]

Fig. 5 Residual forward plot (left panel) and minimum absolute deletion residual forward plot (right panel) for the simulated dataset in the right panel of Fig. 4

On our simulated dataset, it clearly indicates that MSOM residuals start being included around iteration 720, where large residuals (in dark blue) shrink to zero and start masking each other. However, as in this example, a residual forward plot might become too complex to diagnose the presence of VIOM residuals in large samples. The absolute minimum deletion residual forward plot tracks a single statistic, which depends on the observations excluded from the FS at each iteration, providing a meaningful, simple summary of the information contained in the residual forward plot. Dashed lines represent different quantiles for the point-wise distribution of the absolute minimum deletion residuals (as described in Sect. 2.2). On our simulated dataset, absolute minimum deletion residuals rapidly increase after iteration ≈ 550 and abruptly fall at iteration ≈ 720. The purple vertical line marks the first "weak" signal identified by FSR at iteration 551; in our FSRws this is when VIOM outliers start being included.

In contrast to these diagnostics, our cascade plot, which is shown in Fig. 6, highlights both local information (the influence of each observation at every iteration) and global information (the overall estimator performance). The strong decrease in estimated weights between iterations ≈ 550 and ≈ 720 indicates the inclusion of VIOM outliers, and their abrupt increase after iteration ≈ 720 indicates the inclusion of MSOM outliers. Notably, after iteration ≈ 720, estimated weights increase due to masking effects, followed by a large number of observation interchanges in the FS subset due to swamping effects. The swamped units (represented by dark blue trajectories in the final part of the cascade plot) were included in earlier iterations of the FS and exit the subset as MSOM outliers begin entering it.

Finally, we monitor the performance of the MM using our MM-weights plot and MM-weights derivatives plot, which are shown in the left and right panels of Fig. 7, respectively. We track estimated weights along efficiency levels ranging from 0.5 to 0.99. As in the right panel of Fig. 4, red and green denote the true MSOM and VIOM outliers. Both plots clearly indicate that the MM breaks down at an efficiency level of ≈ 0.82, where it produces a fit very similar to the OLS.


Fig. 6 Cascade plot for the simulated dataset in the right panel of Fig. 4

Fig. 7 MM-weights plot (left panel) and MM-weights derivatives plot (right panel) for the simulated dataset in the right panel of Fig. 4. The points are labeled according to the true generating mechanism

Before this efficiency value, clean units (as well as "non-outlying" VIOM outliers) have large and stable weights, and MSOM outliers have very small weights; these abruptly increase after the threshold. VIOM outliers are in between these two extremes. Intuitively, while trajectories for influential observations that create masking tend to be convex-shaped, the ones for swamping observations are concave-shaped. These effects become even clearer in the MM-weights derivatives plot. Right before the estimator breaks down, one can see a steep increase in derivatives corresponding to outliers that are masking each other (typically MSOMs) and a steep decrease in derivatives corresponding to swamped observations (e.g., good leverage points). In contrast, uncontaminated non-swamped observations have "flat" and small derivatives.

[Figure 8: four panels. Top left, scatterplot of destructive versus non-destructive thickness measure with OLS, MM, and FSRws fits. Top right, scaled residuals versus iteration b (residual forward plot). Bottom left, MM weights estimates versus efficiency. Bottom right, FSRws weights estimates versus iteration b (cascade plot).]

Fig. 8 Scatterplot of a small dataset (n = 11) on coating thickness from Cook et al. (1982) with OLS, MM, and FSRws fits superimposed (top left panel). Corresponding graphical diagnostics: residual forward plot (top right panel), MM-weights plot (bottom left panel), and cascade plot (bottom right panel)

4 Real-Data Examples

We now apply our FSRws and graphical diagnostics to two real-world datasets which pose different levels of challenge, both in terms of sample size and in terms of contamination mechanisms. The first dataset is very small (n = 11) and was used by Cook et al. (1982). It contains measurements of the thickness of nonmagnetic coatings of galvanized zinc on iron and steel, obtained with two different procedures: an expensive one (response) and a cheaper one (predictor). Figure 8 (top left panel) shows a scatterplot along with OLS, MM, and FSRws fits for the regression, which includes an intercept. This simple example motivates the use of robust estimation procedures to deal with multiple outliers arising from a VIOM and/or MSOM. Assuming a single possible VIOM outlier and using MLE, Cook et al. (1982) flagged observation 9, which has the largest absolute residual. Based on REMLE, Thompson (1985) flagged observation 11, which has the largest absolute Studentized residual.


The order of the last observations to enter the FS is 5, 6, 9, and 7. For this small dataset, the residual forward plot (Atkinson and Riani, 2000) shown in Fig. 8 (top right panel) provides very clear information. Observations 9 and 7 have a similar behavior, and so do observations 5 and 6. The inclusion of unit 9 at iteration 10 causes a masking of observation 7 and swamping of observations 11 and 10. These effects become stronger as observation 7 is included in the last iteration, i.e., in the OLS fit. The cascade plot in Fig. 8 (bottom right panel) tells the same story; with this small sample, it does not provide a diagnostic advantage with respect to the residual forward plot. The weights estimates for the observations not included in the FS subset remain approximately constant between iterations 5 and 8. At iteration 9, observations 7 and 9 have very small weights, but as observation 9 enters the FS at iteration 10, the weight of observation 7 increases abruptly due to masking; both observations are outliers. A similar behavior, though less marked, can be seen for observations 5 and 6 at iterations 7 and 8. Notice also that there are no interchanges of observations in FS; the lines tracking the weights do not cross. The MM-weights plot in Fig. 8 (bottom left panel) shows that also the MMestimator is strongly affected by observations 7 and 9. Indeed, their weights abruptly increase after the 0.85 efficiency level, where the weight for observation 11 strongly decreases due to swamping. Figure 8 (top left panel) shows that the MM fit nearly overlaps with the one for FRSws. But as the efficiency level increases to 86%, MM becomes indistinguishable from the OLS. This demonstrates the importance of having a right balance between BdP and efficiency, independently of the choice of a specific preliminary fit for MM-estimators. Based on the diagnostics discussed above, three strategies could be used for these data: (i) trim observations 7 and 9 and down-weight 5 and 6; (ii) down-weight observations 5, 6, 7, and 9; or (iii) down-weight only observations 7 and 9. The FSRws fit shown in Fig. 8 (top left panel) corresponds to (iii).5 As we showed before, OLS is influenced by observations 7 and 9 jointly, and for this reason our solution differs from Cook et al. (1982) and Thompson (1985). Indeed, the most outlying unit here is observation 7, which cannot be detected using single outlier methods. The second dataset we consider is larger, with n = 509. It contains loyalty cards information on customers of a supermaket chain in Northern Italy (this data was introduced by Atkinson and Riani (2006) and is available in the FSDA MATLAB Toolbox). The response is the amount spent over a 6-month period (in Euros), and the predictor is the number of visits to the supermarket in the same period of time. Figure 9 (top left panel) shows a scatterplot along with OLS, MM, and FSRws fits for the regression, which does not include an intercept (the expenditure corresponding to 0 visits is reasonably assumed to be ≈ 0). Here robust fits behave

5 Here, due to the small sample size, FSR does not detect any signal. The FSRws fit shown in the figure is based on "manual" detection; the two down-weighted observations also correspond to the two residuals exceeding 90% confidence intervals in an LMS fit.



Fig. 9 Scatterplot of a larger dataset (n = 509) on loyalty cards data with OLS, MM, and FSRws fits superimposed, where the points are labeled according to the FSRws solution (top left panel). Corresponding graphical diagnostics: residual forward plot (top right panel) and cascade plot (bottom panel)

very differently from the OLS, which is affected by multiple outliers. Red and green points represent, respectively, observations trimmed (48) and down-weighted (64) by FSRws. The residual forward plot in Fig. 9 (top right panel) shows that some observations have very large residuals during most of the FS—which decrease in absolute terms after iteration 470 due to the inclusion of more extreme outliers. However, this plot is not very informative in terms of diagnosing the joint presence of multiple VIOM and MSOM outliers. The cascade plot in Fig. 9 (bottom panel) appears to provide more insight: estimated weights decrease markedly after iteration ≈ 400, and the decrease further accelerates after iteration ≈ 450, suggesting the presence of more extreme outliers which are likely to be MSOMs. We also notice that the inclusion of such outliers does not cause any interchanges of observations in the last portion of the FS, indicating that they are not as disruptive as the outliers in our simulation example in Sect. 3 (see the right panel of Fig. 4 for a comparison).
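To make the mechanics of the residual forward plot concrete, the following minimal Python sketch implements a simplified forward search skeleton: it refits least squares on a growing subset, ranks all units by squared residual at each step, and records the scaled residual trajectories that such a plot displays. It is only an illustration under simplifying assumptions (a non-robust starting subset and plain OLS refits, unlike the calibrated FS/FSR machinery of the FSDA toolbox); all function and variable names are hypothetical.

    import numpy as np

    def forward_search_residuals(X, y, m0=None):
        """Minimal forward-search skeleton: at each step fit OLS on the current
        subset, rank all observations by squared residual, and grow the subset
        by one unit.  Returns, for every step, the scaled residuals of all units
        (the raw material of a residual forward plot)."""
        n, p = X.shape
        m0 = m0 if m0 is not None else p + 1
        # Non-robust start for illustration: the m0 units with the smallest
        # absolute residuals from a full-data OLS fit (a real FS would use a
        # robust start, e.g. least median of squares).
        beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
        subset = np.sort(np.argsort(np.abs(y - X @ beta_full))[:m0])
        paths, orders = [], []
        for m in range(m0, n + 1):
            beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
            res = y - X @ beta
            scale = np.sqrt(np.sum(res[subset] ** 2) / max(len(subset) - p, 1))
            paths.append(res / scale)
            orders.append(subset.copy())
            if m < n:
                subset = np.sort(np.argsort(res ** 2)[: m + 1])
        return np.array(paths), orders

    # Tiny illustration with two planted mean-shift outliers
    rng = np.random.default_rng(0)
    n = 40
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 2.0]) + 0.3 * rng.normal(size=n)
    y[[5, 7]] += 4.0
    paths, orders = forward_search_residuals(X, y)
    print("units entering last:", [i for i in orders[-1] if i not in orders[-2]])

Plotting the columns of paths against the iteration number gives a rough analogue of the residual forward plot; masking shows up as trajectories that collapse once the corresponding units enter the subset.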


Fig. 10 MM-weights plot (left panel) and MM-weights derivatives plot (right panel) for loyalty cards data in the top left panel of Fig. 9. The points are labeled according to the FSRws solution

Figure 10 highlights the two classes of outliers detected by FSRws in the MM-weights and MM-weights derivatives plots (left and right panel, respectively). These indicate that the MM-estimator is strongly influenced by outliers for efficiency levels higher than 95%. In particular, the MM-weights plot shows that for most efficiency values, trajectories are convex-like for observations labeled as outliers and concave-like for observations labeled as clean (color coding corresponds to FSRws labeling, but by and large this behavior would be visually appreciable even without it). Moreover, the MM-weights derivatives plot shows that units flagged as MSOM have flat derivatives which bump up right before the estimator "breaks down." Units flagged as VIOM have steadily increasing derivatives, and they too accelerate before the breakdown. Non-outlying units have small and constant derivatives for most efficiency values, which eventually become negative for swamped good leverage points.
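The idea of monitoring weights across efficiency levels can be sketched outside the FSDA environment as well. The snippet below is only a rough stand-in, not the chapter's implementation: it refits a redescending M-estimator (statsmodels' RLM with Tukey's biweight, used here in place of a genuine MM fit) over a small grid of tuning constants corresponding to nominal efficiency levels and records each observation's final IRLS weight. The efficiency-to-constant mapping is an approximate tabulation and the data are simulated.

    import numpy as np
    import statsmodels.api as sm

    # Approximate Tukey-biweight tuning constants for a few nominal efficiencies
    # (values as commonly tabulated in the robust-regression literature).
    EFF_TO_C = {0.85: 3.44, 0.90: 3.88, 0.95: 4.685, 0.99: 7.04}

    def weight_trajectories(x, y, eff_to_c=EFF_TO_C):
        """For each efficiency level, fit a redescending M-estimator (a stand-in
        for the MM step) and record the final IRLS weight of every observation."""
        Xc = sm.add_constant(x)
        traj = {}
        for eff, c in sorted(eff_to_c.items()):
            fit = sm.RLM(y, Xc, M=sm.robust.norms.TukeyBiweight(c=c)).fit()
            traj[eff] = np.asarray(fit.weights)
        return traj

    # Toy data with two moderate outliers: their weights stay near 0 at lower
    # efficiencies (small tuning constant) and creep back up as c grows.
    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=60)
    y = 2 + 0.5 * x + rng.normal(scale=0.5, size=60)
    y[:2] += 3.0
    for eff, w in weight_trajectories(x, y).items():
        print(f"eff={eff:.2f}  weight of unit 0: {w[0]:.3f}  median weight: {np.median(w):.3f}")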

5 Final Remarks

Our proposal builds upon different approaches and tools. We use high-BdP and efficient techniques from the robust estimation literature to design a novel procedure that can identify multiple outliers arising from either a MSOM or a VIOM and provide a way to distinguish between the two. In practice, both soft and hard estimation procedures can deal effectively with VIOM and MSOM outliers. However, soft-trimming procedures can be harder to interpret, because the link between each observation and its influence is blurred by a general down-weighting. Furthermore, choosing the preliminary high-BdP estimator and setting tuning parameters is nontrivial. Thus, we prefer to focus on hard-trimming procedures, which provide a clear link between each observation and its outlying-ness. In particular, we consider the adaptive hard-trimming approach in FSR and build upon it to construct our FSRws.


This provides a meaningful ranking of the observations and a way to detect both a "weak" and a "strong" signal—which we then use to separate VIOM and MSOM outliers. After this phase, we blend into the mix REMLE techniques from the VIOM literature, which allow us to move from sheer trimming to a more general scheme where some observations are trimmed (those identified as MSOMs) and some down-weighted (those identified as VIOMs). This, in a way, "softens" back the trimming. Quoting Beckman and Cook (1983):

    There is a stormy history behind the rejection of outliers. In the past, as is the case today, the lines were fairly well drawn between those who discarded discordant observations, those who gave each observation a different weight, and those who used simple, unweighted averages.

Combining robust estimation and REMLE techniques, these three seemingly separate takes—classical estimation, outlier removal, and down-weighting—can be effectively joined in a single, principled framework.

In addition to our FSRws, we introduce novel graphical diagnostics. These are monitoring tools that provide information about a sequence of fits—and are similar in spirit to other diagnostics utilized by the FS and FSR. For FSRws we propose the cascade plot, which tracks estimated weights along the FS iterations as they include/exclude observations. The plot aids in the discrimination of VIOM and MSOM outliers and can complement or replace the automated detection of "weak" and "strong" signals performed as part of FSRws. In a way, it extends existing diagnostic tools depicting both local and global information on the FS process. This can provide critical insights on the structure of the data being analyzed, especially for large sample sizes and in combination with interactive tools (e.g., brushing and linking techniques; Riani et al. 2012). We also propose the MM-weights and MM-weights derivatives plots. These, switching back to soft-trimming and in particular the MM-estimator, allow one to monitor performance as a function of efficiency values and thus to flag poor decisions leading to the breakdown of the estimator.

Our general approach to the identification of multiple VIOM and MSOM outliers could, in principle, be used with robust estimation procedures other than FSR. This may be particularly useful when the sample size is very small (FSR may not be able to detect signals) or very large (FSR may become computationally demanding). Of course, monitoring diagnostics similar to the MM-weights and MM-weights derivatives plots mentioned above ought to be used here as well to achieve a good balance between BdP and efficiency. This is akin to monitoring residuals (or their correlations) for different efficiency levels or BdP as proposed in Riani et al. (2014) and Cerioli et al. (2016, 2018) and could lead to a rigorous procedure for data-driven tuning of the MM and other soft-trimming estimators.

Our work is expanding in several other directions. In parallel to investigating theoretical properties, we are analyzing more simulation settings—including high-dimensional scenarios with various degrees of predictor collinearity, different ratios between VIOM and MSOM outliers, observation-specific contamination parameters, and VIOM outliers with contaminated predictors. We also plan to extend the comparisons to additional robust estimation procedures which are computationally more expensive than the ones we considered to date, e.g., preliminary S-estimators for MM-estimators and Tau estimation (Rousseeuw and Yohai, 1984; Yohai and Zamar, 1988). Relatedly, we note that algorithmic advances in mixed integer programming, which have been recently utilized in the feature selection arena (Bertsimas et al., 2016), may offer interesting opportunities also for the computationally viable detection of multiple MSOM outliers. These advances have already been exploited in hard-trimming estimation (e.g., to compute the LMS solution; Bertsimas and Mazumder 2014), and we are investigating their use for performing selection among the dummy features introduced to reparametrize the MSOM (Menjoge and Welsch, 2010; She and Owen, 2011). In our context, which comprises both MSOM and VIOM outliers, we can also add a regularization component to "shrink" the latter. Looking ahead in a different direction, while for the time being FSRws utilizes single REML weights estimation, an extension is under development to utilize joint REML estimation. As an alternative, the MLE approach proposed in Cook et al. (1982) could be used in place of REMLE. However, this would require a substantial change in the FS algorithm. In fact, since deletion residuals are not necessarily informative, one would need to evaluate a likelihood for each observation excluded in the FS at every iteration.

Acknowledgments This research benefitted from the High Performance Computing facility of the University of Parma. M.R. acknowledges financial support from the project "Statistics for fraud detection, with applications to trade data and financial statement" of the University of Parma. F.C. acknowledges financial support from the Huck Institutes of the Life Sciences of the Pennsylvania State University.

References

A.C. Atkinson, Plots, Transformations, and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis (Clarendon Press, Oxford, 1985)
A.C. Atkinson, M. Riani, Robust Diagnostic Regression Analysis (Springer, New York, 2000)
A.C. Atkinson, M. Riani, Distribution theory and simulations for tests of outliers in regression. J. Comput. Graph. Stat. 15(2), 460–476 (2006)
A.C. Atkinson, M. Riani, F. Torti, Robust methods for heteroskedastic regression. Comput. Stat. Data Anal. 104, 209–222 (2016)
V. Barnett, T. Lewis, Outliers in Statistical Data (Wiley, 1974)
R.J. Beckman, R.D. Cook, Outlier..........s. Technometrics 25(2), 119–149 (1983)
D.A. Belsley, E. Kuh, R.E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (Wiley-Interscience, New York, 2004)
D. Bertsimas, R. Mazumder, Least quantile regression via modern optimization. Ann. Stat., 2494–2525 (2014)
D. Bertsimas, A. King, R. Mazumder, Best subset selection via a modern optimization lens. Ann. Stat. 44(2), 813–852 (2016)
G.E. Box, G.C. Tiao, A Bayesian approach to some outlier problems. Biometrika 55(1), 119–129 (1968)
A. Cerioli, A. Farcomeni, M. Riani, Strong consistency and robustness of the forward search estimator of multivariate location and scatter. J. Multivariate Anal. 126, 167–183 (2014)


A. Cerioli, A.C. Atkinson, M. Riani, How to marry robustness and applied statistics, in Topics on Methodological and Applied Statistical Inference (Springer, 2016), pp. 51–64
A. Cerioli, M. Riani, A.C. Atkinson, A. Corbellini, The power of monitoring: how to make the most of a contaminated multivariate sample. Stat. Methods Appl., 1–29 (2018)
S. Chatterjee, A.S. Hadi, Sensitivity Analysis in Linear Regression (Wiley, New York, 1988)
R.D. Cook, Detection of influential observation in linear regression. Technometrics 19(1), 15–18 (1977)
R.D. Cook, Assessment of local influence. J. Roy. Stat. Soc. B (Methodological) 48(2), 133–155 (1986)
R.D. Cook, S. Weisberg, Residuals and Influence in Regression (Chapman and Hall, New York, 1982)
R.D. Cook, N. Holschuh, S. Weisberg, A note on an alternative outlier model. J. Roy. Stat. Soc. B (Methodological) 44(3), 370–376 (1982)
B. De Finetti, The Bayesian approach to the rejection of outliers, in Proceedings of the Fourth Berkeley Symposium on Probability and Statistics, vol. 1 (University of California Press, Berkeley, 1961), pp. 199–210
D.L. Donoho, P.J. Huber, The notion of breakdown point, in A Festschrift for Erich L. Lehmann, ed. by P. Bickel, K.A. Doksum, J.L. Hodges (Wadsworth, Belmont, California, 1983), pp. 157–184
F.N. Gumedze, Use of likelihood ratio tests to detect outliers under the variance shift outlier model. J. Appl. Stat. 46(4), 598–620 (2019)
F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, W.A. Stahel, Robust Statistics: The Approach Based on Influence Functions (Wiley, New York, 1986)
D.A. Harville, Maximum likelihood approaches to variance component estimation and to related problems. J. Am. Stat. Assoc. 72(358), 320–338 (1977)
P.J. Huber, E.M. Ronchetti, Robust Statistics (Wiley, New Jersey, 2009)
S. Johansen, B. Nielsen, Analysis of the forward search using some new results for martingales and empirical processes. Bernoulli 22(2), 1131–1183 (2016)
R.A. Maronna, R.D. Martin, V.J. Yohai, Robust Statistics: Theory and Methods (Wiley, 2006)
L. McCann, Robust model selection and outlier detection in linear regressions. Ph.D. thesis, Massachusetts Institute of Technology (2006)
R.S. Menjoge, R.E. Welsch, A diagnostic method for simultaneous feature selection and outlier identification in linear regression. Comput. Stat. Data Anal. 54(12), 3181–3193 (2010)
D. Perrotta, F. Torti, Detecting price outliers in European trade data with the forward search, in Data Analysis and Classification (Springer, 2010), pp. 415–423
M. Riani, A.C. Atkinson, Fast calibrations of the forward search for testing multiple outliers in regression. Adv. Data Anal. Classif. 1(2), 123–141 (2007)
M. Riani, A.C. Atkinson, A. Cerioli, Finding an unknown number of multivariate outliers. J. Roy. Stat. Soc. B (Stat. Methodol.) 71(2), 447–466 (2009)
M. Riani, D. Perrotta, F. Torti, FSDA: A MATLAB toolbox for robust analysis and interactive data exploration. Chemom. Intell. Lab. Syst. 116, 17–32 (2012)
M. Riani, A. Cerioli, A.C. Atkinson, D. Perrotta, Monitoring robust regression. Electron. J. Stat. 8(1), 646–677 (2014)
P.J. Rousseeuw, Least median of squares regression. J. Am. Stat. Assoc. 79(388), 871–880 (1984)
P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection (Wiley, New York, 1987)
P.J. Rousseeuw, K. Van Driessen, Computing LTS regression for large data sets. Data Min. Knowl. Disc. 12(1), 29–45 (2006)
P.J. Rousseeuw, B.C. Van Zomeren, Unmasking multivariate outliers and leverage points. J. Am. Stat. Assoc. 85(411), 633–639 (1990)
P.J. Rousseeuw, V.J. Yohai, Robust regression by means of S-estimators, in Robust and Nonlinear Time Series Analysis (Springer, 1984), pp. 256–272
Y. She, A.B. Owen, Outlier detection using nonconvex penalized regression. J. Am. Stat. Assoc. 106(494), 626–639 (2011)
R. Thompson, A note on restricted maximum likelihood estimation with an alternative outlier model. J. Roy. Stat. Soc. B (Methodological) 47(1), 53–55 (1985)


V. Todorov, E. Sordini, fsdaR: Robust Data Analysis Through Monitoring and Dynamic Visualization. https://CRAN.R-project.org/package=fsdaR, R package version 0.4-9 (2020)
V.J. Yohai, High breakdown-point and high efficiency robust estimates for regression. Ann. Stat., 642–656 (1987)
V.J. Yohai, R. Zamar, High breakdown-point estimates of regression by means of the minimization of an efficient scale. J. Am. Stat. Assoc. 83(402), 406–413 (1988)

Estimating Sufficient Dimension Reduction Spaces by Invariant Linear Operators

Bing Li

1 Introduction

In this paper we develop two new classes of estimates for sufficient dimension reduction (Li, 1991; Cook, 1998b; Li, 2018). The idea is derived from the iterative Hessian transformation developed by Cook and Li (2002, 2004), as well as the iterative SAVE transformation developed by Wang (2005). These estimates include a method that is guaranteed to be a √n-consistent and exhaustive estimate of the central dimension reduction subspace.

Suppose X is a p-dimensional random vector and Y is a random variable defined on a probability space (Ω, F, P). That is, X is a mapping from Ω to Rp measurable with respect to F/Rp, Rp being the Borel σ-field defined on Rp, and Y is a mapping from Ω to R measurable with respect to F/R, R being the Borel σ-field in R. The problem of sufficient dimension reduction is to find a matrix β ∈ Rp×d, with d < p, such that Y and X are conditionally independent given βT X, that is,

Y ⊥⊥ X | βT X.

This relation only depends on the subspace spanned by the columns of β, which is called a sufficient dimension reduction subspace. Under some mild conditions (Cook, 1994; Yin et al., 2008), the intersection of all sufficient dimension reduction subspaces is itself a sufficient dimension reduction subspace. This intersection is then the smallest sufficient dimension reduction subspace and is called the central subspace, denoted by SY |X. The dimension of SY |X is called the structural dimension and is written as d. Let ΣX = var(X) be the covariance matrix of X, μX the mean

B. Li () The Pennsylvania State University, State College, PA, USA e-mail: [email protected] © Springer Nature Switzerland AG 2021 E. Bura, B. Li (eds.), Festschrift in Honor of R. Dennis Cook, https://doi.org/10.1007/978-3-030-69009-0_3


of X, and Z = ΣX−1/2(X − μX) the standardized version of X. It is often easy to first estimate the central subspace SY |Z and then estimate SY |X using the relation SY |X = ΣX−1/2 SY |Z (see Li, 2018, Theorem 2.2). Similarly, if β satisfies E(Y |X) = E(Y |βT X), then the subspace spanned by the columns of β is called a sufficient dimension reduction subspace for the conditional mean, and the intersection of all such subspaces is called the central mean subspace, denoted by SE(Y |X). Professor Dennis Cook and I introduced the concept of sufficient dimension reduction in the regression function (Cook and Li, 2002), which marked the beginning of our long and very active collaboration. The central mean subspace is always contained in the central subspace; it focuses on the estimation of the mean function E(Y |X). As in the case for the central subspace, we often first estimate SE(Y |Z) and then estimate SE(Y |X) by the relation SE(Y |X) = ΣX−1/2 SE(Y |Z).

An important class of estimates of the central subspace are the inverse regressions, which include first- and second-order inverse regressions among other methods. The first-order inverse regressions involve the conditional moment E(Z|Y), or moments of the form E[Zf(Y)], where f(Y) is an arbitrary function. The term "first-order" comes from the fact that only the first-order polynomial of Z appears in the moments; the term "inverse" comes from the fact that in estimating E(Z|Y), we perform a regression of X on Y, whereas in a usual regression analysis problem, we perform regression of Y on X. Examples of the first-order inverse regressions include the ordinary least squares (OLS; Li and Duan, 1989), the sliced inverse regression (SIR; Li, 1991), and the parametric inverse regression (PIR; Bura and Cook, 2001). See Li (2018) for more information about this class. The first-order inverse regressions require the linear conditional mean assumption. That is, E(Z|γT Z) is a linear function of γT Z, where γ ∈ Rp×d is any basis matrix of SY |Z; that is, span(γ) = SY |Z.

The second-order inverse regressions involve conditional moments such as var(Z|Y) and E(ZZT |Y), or moments of the form E[ZZT f(Y)], where f(Y) is an arbitrary function. The term "second-order" comes from the fact that second-order polynomials of Z appear in these moments. Examples of second-order inverse regressions include principal Hessian directions (pHd; Cook, 1998a; Li, 1992), sliced average variance estimate (SAVE; Cook and Weisberg, 1991), sliced inverse regression-II (SIR-II; Li, 1991), contour regression (CR; Li et al., 2005), and directional regression (DR; Li and Wang, 2007). In addition to the above linear conditional mean assumption, the second-order inverse regressions require the constant conditional variance assumption: var(Z|γT Z) is almost surely constant.
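As a concrete illustration of a first-order inverse regression, the following minimal sketch implements the sample version of SIR: standardize X, slice Y, average the standardized predictors within slices, eigen-decompose the weighted covariance of the slice means, and map the leading eigenvectors back through Σ̂X−1/2. It is a bare-bones illustration (simple equal-count slicing, no tuning), with illustrative names.

    import numpy as np

    def sir_directions(X, Y, d, n_slices=5):
        """Minimal sliced inverse regression: returns d estimated directions
        spanning (an estimate of) the central subspace S_{Y|X}."""
        n, p = X.shape
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False)
        # Symmetric inverse square root of the sample covariance
        evals, evecs = np.linalg.eigh(Sigma)
        Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
        Z = (X - mu) @ Sigma_inv_sqrt
        # Slice Y into roughly equal-sized groups and average Z within slices
        M = np.zeros((p, p))
        for idx in np.array_split(np.argsort(Y), n_slices):
            zbar = Z[idx].mean(axis=0)
            M += (len(idx) / n) * np.outer(zbar, zbar)   # candidate matrix for SIR
        # Leading eigenvectors of M estimate a basis of S_{Y|Z}; premultiplying
        # by the inverse square root of Sigma maps them back to S_{Y|X}.
        w, V = np.linalg.eigh(M)
        return Sigma_inv_sqrt @ V[:, np.argsort(w)[::-1][:d]]

    # Toy single-index example: the true direction is e1
    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 6))
    Y = X[:, 0] + 0.25 * X[:, 0] ** 3 + 0.1 * rng.normal(size=500)
    print(sir_directions(X, Y, d=1).ravel().round(2))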


It is well known that, in some cases, the first-order inverse regressions cannot recover the entire central subspace SY |X and the second-order methods can recover the part missed by the first-order methods. In particular, when the conditional mean E(Y |X) has a component that is symmetric about 0, the first-order methods will miss this component, but the second-order methods will recover it. However, the added condition of constant conditional variance is rather restrictive and limits the application of the second-order methods.

Both the first- and the second-order inverse regressions can be formulated as an eigenvalue problem. That is, at the population level, the eigenvectors of a matrix M for the nonzero eigenvalues are members of SY |Z. The matrix M is called a candidate matrix (a term due to Ye and Weiss, 2003). Each inverse regression method has such a matrix. A method is unbiased if span(M) ⊆ SY |Z, exhaustive if span(M) ⊇ SY |Z, and Fisher consistent if it is both unbiased and exhaustive (i.e., span(M) = SY |Z). Using this notation, span(M) ⊆ SY |Z holds for the first-order inverse regressions under the linear conditional mean assumption and for the second-order inverse regressions under both the linear conditional mean and the constant conditional variance assumptions.

Cook and Li (2002, 2004) discovered a special property of the pHd: if the linear conditional mean assumption is satisfied, then the central mean subspace is an invariant subspace of the candidate matrix for the pHd; that is, M SE(Y |Z) ⊆ SE(Y |Z), where M = E(Y ZZT) is the candidate matrix for the pHd. This means that, without the constant conditional variance assumption, we can identify the central mean subspace as a subspace spanned by a set of eigenvectors of the candidate matrix, though we do not know which subset it is. This incomplete information about the central mean subspace motivated Cook and Li (2002, 2004) to introduce the iterative Hessian transformation estimator of the central mean subspace. Pursuing further along this line, Wang (2005), in his dissertation advised by the present author, showed that the central subspace SY |Z is an invariant subspace of the candidate matrix M for SAVE; that is, M SY |Z ⊆ SY |Z, where M is the candidate matrix of the SAVE. Based on this relation, Wang (2005) developed the iterative SAVE transformation to estimate the central subspace which is unbiased under the linear conditional mean assumption.

In the present paper, we carry this idea further to consider the class of all linear operators T : Rp → Rp such that SY |X is invariant with respect to T. We call this class of linear operators the invariant class of operators for the central subspace. We develop some theoretical properties of the invariant class and use these properties to show that the candidate matrices of many second-order inverse regression methods are members of this class under the linear conditional mean assumption. We further show that, for a linear operator in this class, the central subspace is spanned by a set


of eigenvectors of the linear operator. Thus, given a linear operator in this class, all that is left to do is to identify the set of eigenvectors that span the central subspace. Based on this fact, we develop two classes of estimates of the central subspace.

The first class is constructed by iteratively transforming a set of seed vectors known to be in the central subspace under the linear conditional mean assumption. This is a direct extension of the iterative Hessian transformation (Cook and Li, 2002, 2004) and iterative SAVE transformation (Wang, 2005) to any linear operator in the invariant class for the central subspace. Thus we call this estimator the iterative invariant transformation (IIT) method.

The second class of estimates is based on an initial nonparametric and exhaustive estimate of the central subspace. Let S ⊆ {1, . . . , p} be the subset such that the eigenvectors indexed by it span the subspace ΣX SY |X. Since identifying a subset of a set is an estimation problem with a finite parameter space, any consistent estimator of S converges to S infinitely fast (in the sense that an(Sn − S) = oP(1) for any an → ∞, where Sn is a consistent estimate of S). This means we can start with a nonparametric, and perhaps sub-√n-convergent, estimate of the central subspace to estimate S and then obtain an updated estimate of the central subspace by the eigenvectors of the invariant operator that are indexed by Sn. Since the convergence rate of Sn is infinitely fast, the sub-√n convergence rate of the nonparametric initial estimate of the central subspace does not slow down the final estimate of the central subspace based on the invariant operator, which is the √n-rate. Using this idea we arrive at an exhaustive and √n-consistent estimate of the central subspace under the linear conditional mean assumption. Since this method is essentially an inverse regression method, but with its eigenvectors identified by an initial nonparametric and exhaustive method, we call it the nonparametrically boosted inverse regression (NBIR) method.

We will also touch upon a parallel theory for the invariant class of linear operators for the central mean subspace, though in this paper the emphasis is given to the central subspace.

The rest of the paper is organized as follows. In Sect. 2 we introduce the class of invariant linear operators for the central subspace and the central mean subspace and develop the theoretical properties of the invariant class for the central subspace. In Sect. 3, we show that the central subspace is spanned by a set of eigenvectors of a linear operator in the invariant class for the central subspace. In Sect. 4 we use the properties of the invariant class for the central subspace to show that the candidate matrices of the four well-known second-order inverse regression methods, SAVE, SIR-II, CR, and DR, are in fact members of the invariant class for the central subspace. In Sect. 5 we develop the two classes of methods, IIT and NBIR, for the central subspace. In Sect. 6 we conduct a simulation study to compare the NBIR methods with their inverse regression counterparts. Some concluding discussions are given in Sect. 7.


2 Invariant Linear Operators

Let L(Rp, Rp) be the class of all linear operators from Rp to Rp. Since the discussions of this paper are restricted to Euclidean spaces Rp, we will identify a linear operator with a matrix with real entries. That is, a member T ∈ L(Rp, Rp) is identified with a member A ∈ Rp×p such that

T(v) = Av,   v ∈ Rp.

In fact, we will also use the symbol T for both T and A. In the same vein, we identify a self-adjoint operator T with a symmetric matrix A. Although in the context of the present paper it is technically possible to refer to linear operators as matrices, it is beneficial to use the former term for two reasons. First, the fundamental idea of our methodology relies on regarding a matrix as a linear operator, and it is more straightforward and transparent to proceed with this term. Second, most of the methods developed here can be extended to the setting of functional sufficient dimension reduction, and linear operator would be the only option in that context. For a linear operator T, we use T∗ to denote the adjoint operator of T.

Recall that a subspace S of Rp is an invariant subspace of a linear operator T ∈ L(Rp, Rp) if T S ⊆ S. For our purpose it is convenient to turn this terminology around and say that T is an invariant linear operator for the subspace S if T S ⊆ S. In addition, we will require T to be self-adjoint. The following is a rigorous definition of a class of invariant linear operators for a subspace.

Definition 1 Let S be a subspace of Rp. We say that a member T of L(Rp, Rp) is an invariant operator for S if:
1. S is an invariant linear subspace of T—that is, T S ⊆ S;
2. T is self-adjoint.

The class of all invariant linear operators is written as TS; that is, TS = {T ∈ L(Rp, Rp) : T S ⊆ S, T = T∗}. We give special notations and names to the collections of invariant linear operators for the central subspace and the central mean subspace.

Definition 2 If, in Definition 1, S is the central subspace SY |Z, then TS is called the central invariant class, written as TY |X; if S is the central mean subspace SE(Y |Z), then TS is called the central mean invariant class, written as TE(Y |X).

In the rest of this paper, we will focus on the central invariant class; the development of the properties of the central mean invariant class and the related estimates can be carried out in a similar fashion. The following proposition gives equivalent conditions for an invariant operator, which will be useful for later discussions.


Proposition 1 A linear operator T ∈ L(Rp, Rp) is a member of TY |X if and only if:
1. For any v ∈ SY |X and w ⊥ SY |X, ⟨w, T v⟩ = 0;
2. T is self-adjoint.

Proof It is sufficient to show that T SY |X ⊆ SY |X is equivalent to the first statement in this proposition. Let v be a member of SY |X. Then T v is a member of SY |X if and only if T v is orthogonal to SY⊥|X; that is, ⟨w, T v⟩ = 0 for all w ⊥ SY |X.
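Definition 1 and Proposition 1 can be checked numerically: T leaves S = span(B) invariant exactly when (Ip − PS)T PS = 0, where PS is the orthogonal projection onto S, and self-adjointness amounts to symmetry of the matrix. The short sketch below verifies both conditions on a constructed example; it is illustrative only and the names are arbitrary.

    import numpy as np

    def is_invariant_operator(T, B, tol=1e-10):
        """Numerical check of Definition 1: T is self-adjoint and T S is a
        subset of S, where S = span(columns of B)."""
        P = B @ np.linalg.pinv(B)                 # orthogonal projection onto S
        leaves_invariant = np.linalg.norm((np.eye(len(T)) - P) @ T @ P) < tol
        self_adjoint = np.linalg.norm(T - T.T) < tol
        return leaves_invariant and self_adjoint

    # An operator that is invariant for S = span(e1, e2) by construction:
    # block-diagonal in a basis adapted to S.
    p = 5
    B = np.eye(p)[:, :2]
    T = np.zeros((p, p))
    T[:2, :2] = np.array([[2.0, 1.0], [1.0, 3.0]])   # action on S
    T[2:, 2:] = np.diag([0.5, 0.7, 0.9])             # action on the orthogonal complement
    print(is_invariant_operator(T, B))                                          # True
    print(is_invariant_operator(np.random.default_rng(0).normal(size=(p, p)), B))  # False (generically)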

We now develop some properties of TY |X.

Theorem 1
1. If T1, T2 ∈ TY |X and λ1, λ2 ∈ R, then λ1T1 + λ2T2 ∈ TY |X;
2. If T1, T2 ∈ TY |X and T1 and T2 commute, then T2T1 ∈ TY |X;
3. If I : Rp → Rp is the identity mapping, then I ∈ TY |X.

Proof
1. Let v be a member of SY |X. Then (λ1T1 + λ2T2)(v) = λ1T1(v) + λ2T2(v). Because T1 and T2 are invariant, T1(v) and T2(v) are members of SY |X. Because SY |X is a linear subspace, any linear combination of T1(v) and T2(v) is a member of SY |X. Because T1 and T2 are self-adjoint, so is any linear combination of them.
2. If v is a member of SY |X, then T1(v) is a member of SY |X because T1 is invariant. Hence T2(T1(v)) is a member of SY |X because T2 is invariant. Because T1 and T2 are self-adjoint and commute with one another, (T2T1)∗ = T1∗T2∗ = T1T2 = T2T1. Hence T2T1 is self-adjoint.
3. If v is a member of SY |X, then I(v) = v is a member of SY |X. Because I is also self-adjoint, it is a member of TY |X.

Sometimes we will deal with a random linear operator. Let T : Ω → L(Rp, Rp) be a random element in L(Rp, Rp). That is, for each ω ∈ Ω, T(ω) is a linear operator in L(Rp, Rp). We now generalize the notion of an invariant linear operator to a random invariant linear operator. Let L∼(Rp, Rp) denote the collection of all random elements in L(Rp, Rp). We will use the equivalent conditions in Proposition 1 to carry out this generalization.

Definition 3 A random linear operator A ∈ L∼(Rp, Rp) is said to be invariant if:
1. Whenever v ∈ SY |X and w ⊥ SY |X, we have ⟨w, Av⟩ = 0 almost surely P;
2. A(ω) is self-adjoint for each ω ∈ Ω.


The collection of all invariant random linear operators is written as TY∼|X. The class of invariant (nonrandom) linear operators TY |X can be viewed as a subset of the class of invariant random linear operators TY∼|X (i.e., TY |X ⊆ TY∼|X), because a nonrandom linear operator can be viewed as a degenerate random linear operator. We now extend Theorem 1 to random linear operators.

Theorem 2
1. If A1, A2 ∈ TY∼|X and λ1, λ2 ∈ R, then λ1A1 + λ2A2 ∈ TY∼|X;
2. If A1, A2 ∈ TY∼|X and A1(ω) commutes with A2(ω) for each ω ∈ Ω, then A2A1 ∈ TY∼|X.

Proof
1. Let w ⊥ SY |X and v ∈ SY |X. Then ⟨w, (λ1A1 + λ2A2)v⟩ = λ1⟨w, A1v⟩ + λ2⟨w, A2v⟩, where the right-hand side is equal to 0 almost surely P because ⟨w, A1v⟩ and ⟨w, A2v⟩ are equal to 0 almost surely P. Because A1(ω) and A2(ω) are self-adjoint, so are their linear combinations. Thus the conditions in Definition 3 are satisfied for λ1A1 + λ2A2.
2. Let w ⊥ SY |X, v ∈ SY |X, and A1, A2 ∈ TY∼|X. Let PY |X be the projection onto SY |X and QY |X = Ip − PY |X, which is the projection onto SY⊥|X. Then ⟨w, A2A1v⟩ = ⟨w, A2(PY |X + QY |X)A1v⟩ = wT A2 PY |X A1 v + wT A2 QY |X A1 v. Because A2 ∈ TY∼|X, wT A2 PY |X = 0 almost surely P; because A1 ∈ TY∼|X, QY |X A1 v = 0 almost surely P. Hence ⟨w, A2A1v⟩ = 0 almost surely P. Because, for each ω ∈ Ω, A1(ω) and A2(ω) are self-adjoint and they commute with each other, we have [A2(ω)A1(ω)]∗ = A1∗(ω)A2∗(ω) = A1(ω)A2(ω) = A2(ω)A1(ω). The conditions in Definition 3 are satisfied for A2A1.

The next theorem says that the expectation of a member of TY∼|X is a member of TY |X, provided the expectation in question is well defined. For this purpose, we need to define the expectation of a linear operator. Let ‖·‖ be the operator norm of a linear operator. Suppose that E‖A‖ < ∞. Then, by the Cauchy-Schwarz inequality, for any u, v ∈ Rp, we have |E(⟨u, Av⟩)| ≤ ‖u‖ ‖v‖ E(‖A‖). Hence the bilinear form


b : Rp × Rp → R,   b(u, v) = E(⟨u, Av⟩)

is bounded. Then, there exists a linear operator B : Rp → Rp such that ⟨u, Bv⟩ = E(⟨u, Av⟩). See, for example, Conway (1990, Theorem 2.2) and Eaton (2007, Section 1.5). The expectation of the random linear operator is defined to be the nonrandom linear operator B and is written as E(A). In other words, the expectation E(A) is defined via the relation ⟨u, E(A)v⟩ = E(⟨u, Av⟩), under the condition E‖A‖ < ∞.

Theorem 3 If A ∈ TY∼|X and E‖A‖ < ∞, then E(A) ∈ TY |X.

Proof Let v ∈ SY |X and w ⊥ SY |X. Then ⟨w, E(A)v⟩ = E(⟨w, Av⟩). Because ⟨w, Av⟩ = 0 almost surely P and ⟨w, Av⟩ is P-integrable by the condition E‖A‖ < ∞, the right-hand side is 0, and consequently E(A) is a member of TY |X.



3 Invariant Linear Operator and Its Eigenvectors

In this section we develop a key property of an invariant operator in TY |X, which will lead to the construction of a new class of estimators for the central subspace. Note that, if v is an eigenvector of a self-adjoint linear operator T in L(Rp, Rp), then span(v) is an invariant subspace of T. Moreover, if {v1, . . . , vd} is a set of eigenvectors of T, then it is easy to see that span{v1, . . . , vd} is an invariant subspace of T. It is then natural to ask this question: if S is an invariant subspace of a self-adjoint linear operator T, then, is S necessarily spanned by a set of eigenvectors of T? If this is true, then an invariant linear operator reveals much about the central subspace: even though it does not tell us what it exactly is, it tells us that the central subspace is spanned by d of the p eigenvectors of the linear operator, d being the dimension of the central subspace. The next theorem establishes this as a fact.

Theorem 4 If T is a member of TY |X, then SY |X is spanned by d eigenvectors of T.

Proof Let {u1, . . . , ud} be an orthonormal basis of SY |X and A the matrix (u1, . . . , ud). We want to show that there exist (λ1, v1), . . . , (λd, vd), where λ1, . . . , λd are real numbers and {v1, . . . , vd} is an orthonormal set in SY |X, such that


T vi − λi vi = 0,   i = 1, . . . , d.

Because a vector v in SY |X can be written as Ac, where c is a vector in Rd, we need to show that there exists an orthonormal subset {c1, . . . , cd} of Rd such that

T Aci − λi Aci = 0,   i = 1, . . . , d.

Because T ∈ TY |X, span(T A) is contained in SY |X. Hence there is a matrix B ∈ Rd×d such that T A = AB. Because A has full column rank, B is nonsingular. So the above equation can be rewritten as

ABci − λi Aci = 0.   (1)

Multiplying both sides of T A = AB from the left by AT, noticing that AT A = Id, we have AT T A = B. Because T is symmetric, so is B. Next, for a generic vector c in Rd and a generic real number λ, multiplying both sides of the equation ABc − λAc = 0 from the left by AT, we have Bc − λc = 0. Since B is symmetric, it has d eigenvalues and eigenvectors (λ1, c1), . . . , (λd, cd), which satisfy (1).

We will make a further assumption that the central subspace SY |X can be identified by a set of d eigenvectors. That is, the eigenvalues outside of {λ1, . . . , λd} are different from them. This is to avoid the case where some eigenvectors between SY |X and SY⊥|X are arbitrarily defined. For example, the extreme case would be T = Ip, where any orthonormal set {v1, . . . , vd} is a set of eigenvectors of T and obviously cannot identify SY |X.

Assumption 1 If T ∈ TY |X and {v1, . . . , vd} are eigenvectors of T that span SY |X, then λi ≠ λj whenever i = 1, . . . , d and j = d + 1, . . . , p. Furthermore, we assume that the eigenvalues for v1, . . . , vd are distinct.

The part of the assumption that says the eigenvalues for v1, . . . , vd are distinct is made for convenience. It can be removed, but doing so would make the proof of Theorem 9 in Sect. 5.2 more complicated. The next corollary follows immediately from Theorem 4 and Assumption 1.

Corollary 1 Suppose T is a member of TY |X and Assumption 1 holds. A linearly independent set of vectors w1, . . . , wd spans the central subspace SY |X if and only if {w1, . . . , wd} ⊥ {vd+1, . . . , vp}.

We will use this fact to construct an estimate of the central subspace.


4 Some Important Members of TY |X

In this section, we work with (Y, Z) rather than (Y, X), which is more convenient. We show that some well-known candidate matrices for estimating SY |Z based on the second-order inverse regressions, such as the sliced average variance estimator (Cook and Weisberg, 1991), the SIR-II (Li, 1991), contour regression (Li et al., 2005), and directional regression (Li and Wang, 2007), are, in fact, members of TY |Z. We first restate the linear conditional mean assumption in the Introduction below for easy reference.

Assumption 2 E(Z|γT Z) is a linear function of γT Z, where γ is any basis matrix for SY |Z.

This condition implies that E(Z|γT Z) = Pγ Z, where Pγ = γ(γT γ)−1γT is the projection onto span(γ) with respect to the standard inner product in Rp. See, for example, (Li, 2018, Lemma 3.1). As mentioned in the Introduction, in order for these candidate matrices to span a subspace of SY |Z, we need the constant conditional variance assumption: that var(Z|γT Z) is a nonrandom matrix. See, for example, Cook and Weisberg (1991) and Li (1992). Thus, a candidate matrix for a second-order inverse regression needs a weaker assumption to be invariant than to be unbiased.

4.1 Sliced Average Variance Estimation

Proposed by Cook and Weisberg (1991), the candidate matrix for the SAVE is defined by

MSAVE = E[Ip − var(Z|Y)]2.

We now show that, under the linear conditional mean assumption, MSAVE is a member of TY |Z. This result was proved as a stand-alone theorem (Wang, 2005). We reproduce it here as a special case of the general framework described in Sect. 2. We first prove a lemma.

Lemma 1 If U and V are random vectors that are almost surely in SY |Z, that is, ⟨w, U⟩ = 0 and ⟨w, V⟩ = 0 almost surely P for any w ⊥ SY |Z, then U VT + V UT is a member of TY∼|Z.

Proof If v is a member of SY |Z, then U VT v is a member of SY |Z almost surely P. Similarly, V UT v is a member of SY |Z almost surely P. Furthermore, because U VT + V UT is self-adjoint, it is a member of TY∼|Z.

We now show that the candidate matrix for SAVE is an invariant linear operator for the central subspace.


Theorem 5 If Assumption 2 is satisfied and E‖Ip − var(Z|Y)‖2 < ∞, then MSAVE is a member of TY |Z.

Proof Since E‖[Ip − var(Z|Y)]2‖ = E‖Ip − var(Z|Y)‖2 < ∞, by Theorem 3, it suffices to show that [Ip − var(Z|Y)]2 is a member of TY∼|Z. Because Ip − var(Z|Y) is a member of TY∼|Z and it commutes with itself, by Theorem 2, part 2, it suffices to show that Ip − var(Z|Y) is a member of TY∼|Z, for which, by Theorem 2, part 1, it suffices to show that var(Z|Y) is a member of TY∼|Z. Note that

var(Z|Y) = E(ZZT |Y) − E(Z|Y)E(ZT |Y).

It is well known that, under Assumption 2, E(Z|Y) is a member of SY |Z almost surely P. Hence, by Lemma 1, E(Z|Y)E(ZT |Y) is a member of TY∼|Z. By Theorem 2, in order for var(Z|Y) to be a member of TY∼|Z, it suffices to show that E(ZZT |Y) is a member of TY∼|Z. Let v be a vector in SY |Z and β a basis matrix of SY |Z. Then

E(ZZT v|Y) = E[E(ZZT v|βT Z, Y)|Y] = E[ZT v E(Z|βT Z)|Y] = Pβ E(ZZT v|Y).

Hence E(ZZT v|Y) is a member of SY |Z.
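At the sample level, MSAVE is typically estimated by slicing Y and replacing var(Z|Y) with within-slice covariance matrices. The following minimal sketch shows this construction for an already standardized predictor matrix Z; the slicing details (number of slices, equal counts) are illustrative choices rather than prescriptions from the text.

    import numpy as np

    def save_candidate_matrix(Z, Y, n_slices=5):
        """Sample version of M_SAVE = E[I_p - var(Z|Y)]^2: slice Y, compute the
        within-slice covariance of the standardized predictor Z, and average
        (I - cov_slice)^2 with slice-proportion weights."""
        n, p = Z.shape
        M = np.zeros((p, p))
        for idx in np.array_split(np.argsort(Y), n_slices):
            V = np.cov(Z[idx], rowvar=False)
            D = np.eye(p) - V
            M += (len(idx) / n) * (D @ D)
        return M

The resulting matrix can serve either as a candidate matrix (under both assumptions) or as the estimated invariant operator T̂ used by the IIT and NBIR procedures of Sect. 5.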



4.2 SIR-II

We now turn to an estimator proposed by Li (1991) in his rejoinder to the discussants. The candidate matrix for SIR-II takes the form

MSIRII = E{var(Z|Y) − E[var(Z|Y)]}2.

This is similar to SIR except that the conditional mean E(Z|Y) is replaced by the conditional variance var(Z|Y).

Theorem 6 If Assumption 2 is satisfied and E‖var(Z|Y) − E[var(Z|Y)]‖2 < ∞, then MSIRII is a member of TY |Z.

Proof We have already shown in the proof of Theorem 5 that var(Z|Y) is a member of TY∼|Z. Hence, by Theorem 3, E[var(Z|Y)] is a member of TY |Z. By Theorem 1, part 1, var(Z|Y) − E[var(Z|Y)] is a member of TY∼|Z. By Theorem 1, part 2, {var(Z|Y) − E[var(Z|Y)]}2 is a member of TY∼|Z. By Theorem 3 again, MSIRII is a member of TY |Z.
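The sample version of MSIRII uses the same slice-wise covariances as the SAVE construction; a minimal sketch (again with illustrative slicing choices and assuming Z is already standardized) is:

    import numpy as np

    def sirii_candidate_matrix(Z, Y, n_slices=5):
        """Sample version of M_SIRII = E{var(Z|Y) - E[var(Z|Y)]}^2 via slicing Y."""
        n, p = Z.shape
        idx_slices = np.array_split(np.argsort(Y), n_slices)
        weights = np.array([len(idx) / n for idx in idx_slices])
        covs = np.array([np.cov(Z[idx], rowvar=False) for idx in idx_slices])
        mean_cov = np.tensordot(weights, covs, axes=1)      # E[var(Z|Y)]
        return sum(w * (C - mean_cov) @ (C - mean_cov) for w, C in zip(weights, covs))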

4.3 Contour Regression

The contour regression was introduced by Li et al. (2005) as a dimension reduction method that overcomes some of the difficulties with SIR and SAVE. Let (Z̃, Ỹ) be an independent copy of (Z, Y). The candidate matrix for contour regression takes the form

MCR = [2Ip − A(ε)]2,   where   A(ε) = E[(Z − Z̃)(Z − Z̃)T | |Y − Ỹ| < ε].

Theorem 7 If Assumption 2 is satisfied and

E‖E[(Z − Z̃)(Z − Z̃)T |Y, Ỹ]‖ < ∞,   (2)

then MCR is a member of TY |Z.

Proof Let

B = E[(Z − Z̃)(Z − Z̃)T f(Y, Ỹ)],

where f(Y, Ỹ) is a bounded measurable function of (Y, Ỹ). The matrix A(ε) is a special case of B with

f(Y, Ỹ) = I(|Y − Ỹ| < ε) / P(|Y − Ỹ| < ε).

We first show that B is a member of TY |Z. Since B can be rewritten as

E{E[(Z − Z̃)(Z − Z̃)T |Y, Ỹ] f(Y, Ỹ)},

by condition (2) and Theorem 3, it suffices to show that E[(Z − Z̃)(Z − Z̃)T |Y, Ỹ] f(Y, Ỹ) is a member of TY∼|Z. Because f(Y, Ỹ) is a scalar, E[(Z − Z̃)(Z − Z̃)T |Y, Ỹ] f(Y, Ỹ) is a member of TY∼|Z if and only if E[(Z − Z̃)(Z − Z̃)T |Y, Ỹ] is a member of TY∼|Z. Note that


E[(Z − Z̃)(Z − Z̃)T |Y, Ỹ] = E(ZZT |Y, Ỹ) − E(Z Z̃T |Y, Ỹ) − E(Z̃ ZT |Y, Ỹ) + E(Z̃ Z̃T |Y, Ỹ).   (3)

By the properties of conditional independence developed in Dawid (1979) (see also Li, 2018, Corollary 2.2), because (Z, Y) ⊥⊥ Ỹ, we have Ỹ ⊥⊥ Z | Y. Hence the first term on the right-hand side of (4) is E(ZZT |Y). Similarly, the fourth term is equal to E(Z̃ Z̃T |Ỹ). Because (Y, Z) ⊥⊥ (Ỹ, Z̃), we have (Y, Z) ⊥⊥ Z̃ | Ỹ, which implies Z̃ ⊥⊥ Z | Y, Ỹ. Thus the second term is equal to E(Z|Y, Ỹ)E(Z̃T |Y, Ỹ). But we have just seen that Z ⊥⊥ Ỹ | Y and similarly Z̃ ⊥⊥ Y | Ỹ. Hence this term can be further reduced to E(Z|Y)E(Z̃T |Ỹ). By the same argument, the third term on the right-hand side of (4) is E(Z̃|Ỹ)E(ZT |Y). To sum up, we have

E[(Z − Z̃)(Z − Z̃)T |Y, Ỹ] = E(ZZT |Y) − E(Z|Y)E(Z̃T |Ỹ) − E(Z̃|Ỹ)E(ZT |Y) + E(Z̃ Z̃T |Ỹ).   (4)

We have shown in the proof of Theorem 5 that, under Assumption 2 and the condition E‖E(ZZT |Y)‖ < ∞, E(ZZT |Y) is a member of TY∼|Z. Similarly, E(Z̃ Z̃T |Ỹ) is a member of TY∼|Z. Also, it is well known that, under Assumption 2, E(Z|Y) and E(Z̃|Ỹ) are members of SY |Z almost surely P (see Li, 1991), so, by Lemma 1, under Assumption 2 and condition (2), the second and third terms in (4) combined are a member of TY∼|Z. Hence E[(Z − Z̃)(Z − Z̃)T |Y, Ỹ] is a member of TY∼|Z. By Theorem 3, B is a member of TY |Z, and hence so is A(ε). By Theorem 1, then, MCR is a member of TY |Z.



4.4 Directional Regression

Directional regression (DR) was introduced by Li and Wang (2007) as a method that performs similarly to the contour regression but with substantially reduced computing time. Its candidate matrix is as follows:

MDR = E[2Ip − A(Y, Ỹ)]2,   where   A(Y, Ỹ) = E[(Z − Z̃)(Z − Z̃)T |Y, Ỹ].

Theorem 8 If Assumption 2 is satisfied and

E{‖E[(Z − Z̃)(Z − Z̃)T |Y, Ỹ]‖2} < ∞,   (5)

then MDR is a member of TY |Z.

Proof Because


‖[2Ip − A(Y, Ỹ)]2‖ ≤ 4‖Ip‖ + 4‖A(Y, Ỹ)‖ + ‖A(Y, Ỹ)‖2,

the expectation of the left-hand side is finite if condition (5) holds. By Theorem 3, for MDR to be a member of TY |X, it suffices to show that [2Ip − A(Y, Ỹ)]2 is a member of TY∼|Z. By Theorem 2, it suffices to show that A(Y, Ỹ) is a member of TY∼|Z, but this has already been shown in the proof of Theorem 7.



5 Two Estimation Methods Based on Invariant Operators

In this section we propose two methods for estimating the central subspace SY |Z based on invariant operators.

5.1 Iterative Invariant Transformations (IIT)

Suppose we are given √n-consistent estimates of a set of vectors in SY |Z—let us call these estimates the seed vectors—and an estimated invariant linear operator in TY |Z. Then we can use the estimated invariant operator to repeatedly transform the seed vectors. By definition of TY |Z, the transformed seed vectors are estimates of some (other) vectors in the central subspace. Thus, by iteratively transforming the seed vectors, we bring out more vectors in the central subspace. This idea was used by Cook and Li (2002, 2004) to construct the iterative Hessian transformation to estimate SE(Y |Z) and by Wang (2005) to construct the iterative SAVE transformation to estimate SY |Z.

Specifically, suppose (X1, Y1), . . . , (Xn, Yn) are an i.i.d. sample of (X, Y) and, for some integer r ≤ d, û1, . . . , ûr are √n-consistent estimates of a set of vectors u1, . . . , ur in SY |Z. Suppose T ∈ TY |X and T̂ is a √n-consistent estimate of T. Then, for any positive integer m, the set of vectors

{T̂ i ûj : j = 1, . . . , r, i = 0, . . . , m − 1}   (6)

is a √n-consistent estimate of the set of nonrandom vectors

{T i uj : j = 1, . . . , r, i = 0, 1, . . . , m − 1}   (7)

in the central subspace SY |Z. Thus, we can "grow" a small set of seed vectors into a larger set of vectors of SY |Z. The point of this process is that we can usually produce the seed vectors û1, . . . , ûr under the linear conditional mean assumption alone (Assumption 2) without invoking the constant conditional variance assumption. However, in some circumstances, such as when the regression function is partly symmetric about the


mean of X, the seed vectors are insufficient to span the entire central subspace SY |Z. For a detailed discussion of this point, see Cook and Li (2002, 2004). In the meantime, as the results in the last section show, for the operators such as MSAVE, MSIRII, MCR, and MDR to be invariant operators, we also only need the linear conditional mean assumption. As a result, the IIT procedure relies only on the linear conditional mean assumption. In comparison, in order for these operators to be unbiased candidate matrices of SY |Z (i.e., span(M) ⊆ SY |Z), we need both the linear conditional mean assumption and the constant variance assumption.

The set of vectors in (7) has mr vectors, which is usually a much larger number than d. Since this set is a subset of the central subspace, there are at most d linearly independent vectors in this set. The next question is then how to estimate these linearly independent vectors from the vectors in the set in (6). There are several options here. The simplest method is to use principal component analysis. Let B̂j = (ûj, T̂ ûj, T̂ 2ûj, . . . , T̂ m−1ûj). Let B̂ = (B̂1, . . . , B̂r). Let v̂1, . . . , v̂d be the first d eigenvectors of B̂ B̂T. We use span(v̂1, . . . , v̂d) as the estimate of the central subspace SY |Z and use span(Σ̂X−1/2 v̂1, . . . , Σ̂X−1/2 v̂d) as the estimate of the central subspace SY |X, where Σ̂X is the sample covariance matrix based on the sample X1, . . . , Xn. Another option is to combine the vectors in (6) optimally using the minimum discrepancy approach by Cook and Ni (2005).
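A compact sketch of the IIT estimator just described is given below: it repeatedly applies an estimated invariant operator T̂ to a set of seed vectors and extracts the first d principal directions of the resulting collection, i.e., the leading eigenvectors of B̂B̂T. The seed vectors and T̂ are inputs (for instance, an OLS or SIR direction and the SAVE candidate matrix sketched earlier); all names are illustrative.

    import numpy as np

    def iit_subspace(T_hat, seeds, d, m=5):
        """Iterative invariant transformation (sketch): grow the seed vectors by
        repeated application of the estimated invariant operator T_hat, stack the
        results, and return the first d principal directions of the collection."""
        cols = []
        for u in seeds:                      # seeds: list of p-vectors in S_{Y|Z}
            v = np.asarray(u, dtype=float)
            for _ in range(m):
                cols.append(v)
                v = T_hat @ v                # T_hat maps S_{Y|Z} into S_{Y|Z}
        B = np.column_stack(cols)            # p x (m * r) matrix of transformed seeds
        # First d eigenvectors of B B^T, i.e. the leading left singular vectors of B
        U, _, _ = np.linalg.svd(B, full_matrices=False)
        return U[:, :d]                      # estimated basis of S_{Y|Z}

The X-scale basis is then obtained by premultiplying the returned vectors by Σ̂X−1/2, as in the text.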

5.2 Nonparametrically Boosted Inverse Regression (NBIR)

In the IIT procedure, there is no guarantee that the enlarged vector set (7) spans the entire central subspace SY |Z. This is because, first, the set of seed vectors may not be exhaustive and, second, the iterative transformations by T may not bring out all the linearly independent vectors in SY |Z, though it may well bring out additional linearly independent vectors in SY |Z. However, using the result in Sect. 3, we can start with any consistent (but not necessarily √n-consistent) and exhaustive estimate of SY |Z and combine it with an invariant linear operator to construct a √n-consistent estimate of the basis of SY |Z. There do exist consistent nonparametric estimates of the central subspace that are exhaustive. For example, the ensemble outer product of gradients (eOPG) introduced in Li (2018) is both consistent and exhaustive. The eOPG was an extension of the outer product of gradients (OPG) introduced by Xia et al. (2002).

Once we have an invariant operator, we know that the central subspace SY |Z is spanned by a set of eigenvectors of T. What remains to be done is to


determine which subset of vectors converges to a basis of SY |Z. Identifying a subset of {1, . . . , p} is an estimation problem where the parameter space is finite, and any consistent estimate of the set is √n-consistent—in fact, faster than any rate an → ∞. Thus, even if the consistent and exhaustive estimate we start with converges to its target at a slower nonparametric rate (sub-√n rate), the eigenvectors of T̂ it identifies always converge at the √n-rate provided that T̂ converges to T at the √n-rate.

To be more specific, suppose, at the population level, we know that {w1, . . . , wd} is a basis of SY |Z. Then, by Theorem 4, an eigenvector vi of T is either orthogonal to {w1, . . . , wd} or belongs to the subspace spanned by w1, . . . , wd. It follows that SY |Z = span{vi : viT Gvi > 0}, where G is the Gram matrix of {w1, . . . , wd}; that is, G is the d × d matrix whose (i, j)th entry is ⟨wi, wj⟩.

Mimicking the above process at the sample level, let T̂ be a √n-consistent estimate of T, and let v̂1, . . . , v̂p be the eigenvectors of T̂. Then v̂1, . . . , v̂p converge at the √n-rate to some eigenvectors v1, . . . , vp of T, which form an orthonormal basis of Rp. Now suppose ŵ1, . . . , ŵd are consistent estimates of some basis {w1, . . . , wd} of SY |Z. Let Ŵ = (ŵ1, . . . , ŵd), and let Ĝ be the Gram matrix ŴT Ŵ. We order the p numbers

α̂i = v̂iT Ĝ v̂i,   i = 1, . . . , p,

from the largest to the smallest. Let v̂(1), . . . , v̂(p) be the eigenvectors ordered according to the values of α̂i; that is, v̂(1) corresponds to the largest α̂i; v̂(2) corresponds to the second largest α̂i; and so on. Then we use v̂(1), . . . , v̂(d) as an estimate of a basis of the central subspace SY |Z. The next theorem shows that v̂(1), . . . , v̂(d) form a √n-consistent estimate of a basis of SY |Z regardless of the convergence rates of ŵ1, . . . , ŵp.

Theorem 9 Suppose that Assumption 1 holds, T ∈ TY |Z, and T̂ is a √n-consistent estimate of T. If {ŵ1, . . . , ŵd} is a consistent estimate of a basis {w1, . . . , wd} of SY |Z, then {v̂(1), . . . , v̂(d)} is a √n-consistent estimate of a basis of SY |Z.

Proof By perturbation theory (Kato, 1980), v̂1, . . . , v̂p converge to a set of eigenvectors v1, . . . , vp of T at the √n-rate. Let S be a subset of {1, . . . , p} with cardinality d such that {vi : i ∈ S} spans the central subspace SY |Z. By Assumption 1, the eigenvalues λi of T for i ∈ S are distinct. Then we can speak of v(1), . . . , v(d), which are ordered by the values of viT Gvi. Also by Assumption 1, all the eigenvectors vi with i outside S are orthogonal to SY |Z, and all the eigenvectors vi with i in S belong to SY |Z. As a result, S = {i : viT Gvi > 0}. Because, by assumption, v̂i and Ĝ are consistent, v̂iT Ĝ v̂i converges in probability to 0 for any i ∉ S, and v̂iT Ĝ v̂i converges in probability to the positive number viT Gvi for i ∈ S. Let δ be a constant such that 0 < δ < min{viT Gvi : i ∈ S}. Then

P(v̂iT Ĝ v̂i < δ for i ∉ S and v̂iT Ĝ v̂i > δ for i ∈ S) → 1,

which implies

P(max{v̂iT Ĝ v̂i : i ∉ S} < min{v̂iT Ĝ v̂i : i ∈ S}) → 1,

which implies

P({v̂i : i ∈ S} = {v̂(i) : i = 1, . . . , d}) → 1.

Now let ε > 0 and let K be a positive constant such that P(√n|v̂i − vi| > K) < ε for each i = 1, . . . , p. This is possible because the v̂i are √n-consistent. Then, for each i = 1, . . . , d,

P(√n|v̂(i) − v(i)| > K) ≤ P(√n|v̂i − vi| > K for some i) ≤ pε.

Hence v̂(i) is a √n-consistent estimate of v(i), and consequently {v̂(1), . . . , v̂(d)} is a √n-consistent estimate of a basis of SY |Z.

This theorem shows that we can always find a √n-consistent and exhaustive estimate of the central subspace based on a √n-consistent estimate of an invariant linear operator and a consistent and exhaustive estimate of the central subspace SY |Z. We call this method the nonparametrically boosted inverse regression, because typically the initial consistent and exhaustive estimate is a nonparametric method like eOPG. Correspondingly, if M̂SAVE is used as the invariant operator, then we call the method the nonparametrically boosted SAVE, or NB-SAVE. Similarly, we have NB-SIRII, NB-CR, and NB-DR. Besides eOPG, other nonparametric methods, such as dOPG, dMAVE, or eMAVE (Li, 2018; Xia, 2007), can also be used as the initial estimates. Note that the nonparametrically boosted method does cost more computing time than the corresponding inverse regression method, because we need to devote extra computing time to the nonparametric SDR method (such as the eOPG) that identifies the set of eigenvectors of the invariant linear operator that span the central subspace.
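The eigenvector-selection step of NBIR can be sketched as follows. The function takes an estimated invariant operator T̂ and an initial exhaustive basis estimate Ŵ (e.g., from eOPG, which is not implemented here) and keeps the d eigenvectors of T̂ best aligned with span(Ŵ). Alignment is computed as ‖ŴT v̂i‖2, which corresponds to the α̂i criterion with the p × p matrix ŴŴT used in place of Ĝ so that the dimensions match; this is the author's selection rule only in that hedged, approximate sense, and the names are illustrative.

    import numpy as np

    def nbir_basis(T_hat, W_hat, d):
        """NBIR selection step (sketch): take the eigenvectors of the estimated
        invariant operator T_hat and keep the d of them that are best aligned
        with the initial consistent-and-exhaustive estimate W_hat (p x d)."""
        _, V = np.linalg.eigh((T_hat + T_hat.T) / 2)   # eigenvectors of T_hat
        align = np.sum((W_hat.T @ V) ** 2, axis=0)     # alignment score for each eigenvector
        keep = np.argsort(align)[::-1][:d]
        return V[:, keep]                              # final basis estimate of S_{Y|Z}

For instance, T_hat could be the SAVE candidate matrix sketched in Sect. 4.1 and W_hat the (p x d) matrix of directions returned by an initial nonparametric estimator.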


Fig. 1 Comparison between inverse regressions and nonparametrically boosted inverse regressions

6 Numerical Study

In this section we conduct a simulation study to compare the nonparametrically boosted methods such as NB-SAVE, NB-SIRII, NB-CR, and NB-DR with their inverse regression counterparts. We consider a distribution of X that satisfies the linear conditional mean assumption but not the constant variance assumption, in which case the second-order inverse regression methods are not guaranteed to be consistent, but the nonparametrically boosted inverse regression methods are guaranteed to be √n-consistent and exhaustive, provided that a consistent and exhaustive initial estimate like eOPG is used.

We choose the distribution of X to be a truncated multivariate normal random vector. For sample size n = 200 and predictor dimension p = 10, we first generate X1, . . . , X5n from N(0, Ip). Then we compute the 20th sample quantile τ of ‖X1‖, . . . , ‖X5n‖. We take the n observations whose norm is no greater than τ. Finally, we rescale the obtained sample of n = 200 observations by the maximum norm among the retained observations, so that the maximum norm in the sample (of size n = 200) is 1. The distribution thus generated is spherical (therefore satisfies the linear conditional mean assumption) and is rather close to the uniform distribution on the unit ball in Rp. Denoting the ith component of the random vector Xa by Xai, the response Ya is generated by the model

Ya = sin(Xa1) + 4Xa2² + 0.05(1 + 20|Xa3|)εa,   a = 1, . . . , n,

where ε1, . . . , εn are i.i.d. N(0, 1). In this model, d = 3 and the central subspace is spanned by e1, e2, e3, where ei is the p-dimensional vector with its ith component equal to 1 and all the other components equal to 0. The model has a monotone component in Xa1, a symmetric component in Xa2, and a heteroscedastic component in Xa3. The coefficients such as 4, 0.05, and 20 are chosen so that the signal stands out from the noise but is not too dominant. The scatter plots of Ya versus Xa1, Xa2, and Xa3 are shown in Fig. 1.
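For reference, the simulation design just described can be reproduced along the following lines (a sketch, with the norm-based truncation and rescaling implemented as described above; the seed and helper names are arbitrary).

    import numpy as np

    def generate_data(n=200, p=10, seed=0):
        """Simulation design of Sect. 6 (sketch): draw 5n standard normal vectors,
        keep the n with the smallest norms (roughly the 20th percentile), rescale
        so the largest retained norm is 1, and generate Y from the three-component
        model with directions e1, e2, e3."""
        rng = np.random.default_rng(seed)
        X0 = rng.normal(size=(5 * n, p))
        norms = np.linalg.norm(X0, axis=1)
        X = X0[np.argsort(norms)[:n]]          # observations inside the norm threshold
        X /= np.linalg.norm(X, axis=1).max()   # maximum norm in the sample is 1
        eps = rng.normal(size=n)
        Y = np.sin(X[:, 0]) + 4 * X[:, 1] ** 2 + 0.05 * (1 + 20 * np.abs(X[:, 2])) * eps
        return X, Y

    X, Y = generate_data()
    print(X.shape, Y.shape)   # (200, 10) (200,)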


Fig. 2 Comparison between inverse regressions and nonparametrically boosted inverse regressions

For SAVE, SIR-II, and DR, we take the number of slices to be h = 5. For CR, we take the 10% contour vectors (see Li et al., 2005, for details). The boosting nonparametric estimate is the eOPG, and the ensemble functions are the Box-Cox family with equally spaced points in [−2, 2] (see Li, 2018, page 181 for the choice of bandwidth of eOPG and other details). For the comparison we use the distance between dimension reduction spaces proposed in Li et al. (2005), which is the Frobenius norm of the difference of the two projection matrices. That is, for two d-dimensional spaces S1 and S2 in Rp , their distance is d(S1 , S2 ) = PS1 − PS2 , where  ·  is the Frobenius norm. This distance depends on p and d. To give a benchmark, we randomly generate β1 , β2 ∈ Rp×d from the i.i.d. normal variables and calculated the expected distance. If p = 10 and d = 3, this expected distance is calculated by simulation as EPβ1 − Pβ2  ≈ 2.044. We generated 100 simulation samples each of sample size n = 200, and for each sample, we computed Pγˆ − Pγ , where γˆ is computed by SAVE, NB-SAVE, SIRII, NB-SIRII, CR, NB-CR, DR, and NB-DR. Figure 2 shows the boxplots based on 100 distances between the estimated and the true space by the eight methods. We see that the nonparametric boosting brings significant improvement to SAVE and SIR-II and visible improvement for DR. The nonparametrically boosted CR performs somewhat worse than CR, which turns out to be especially accurate for this example. Overall, nonparametric boosting brings significant improvement to the second-order inverse regression method when the constant conditional variance assumption is violated.

62

B. Li

7 Concluding Remarks In concluding this paper, we discuss some potentials for further developing the proposed framework and provide some intuitions about when the proposed methods work the best. In this paper we have focused on four invariant linear operators for the central subspace, but the algebraic properties of TY |X developed in Sect. 2 open up wide possibilities for constructing new invariant linear operators and the corresponding dimension reduction estimates by linear combination and composition. For example, matrices such as λ1 MCR + λ2 MCR MDR MCR + λ3 MSAVE are members of TY |X and can be used in IIT and NBIR. √ It would also be interesting to develop new n-consistent and exhaustive estimates for the central mean subspace SE(Y |Z) by combining invariant linear operators for the central mean subspace and consistent and exhaustive estimates of the central subspace. For example, the y-based or r-based pHd (Cook, 1998a) can be used as the invariant linear operators for the central mean subspace, and nonparametric methods such as the OPG and MAVE can be used as the consistent and exhaustive initial estimate of the central mean subspace. The proposed methods can also be extended to functional sufficient dimension reduction with the invariant operators in Euclidean space replaced by their counterparts in Hilbert spaces of functions. See, for example, Ferré and Yao (2003, 2005), Lian and Li (2014), and Li and Song (2017) for the development of functional sufficient dimension reduction. An important question that we have not touched upon in this paper is that of order determination—that is, the estimation of the structural dimension d of the central subspace. For NBIR, we can estimate the structural dimension by determining the order of the nonparametric method—for example, the eOPG method—using an existing order estimator such as the ladle estimator introduced by Luo and Li (2016). For IIT, order determination is an open problem that deserves to be further studied, which we leave to future research. Finally, we provide some intuitions as to when the advantage of the NBIR is most pronounced. Let λˆ 1 , . . . , λˆ p be the eigenvalues of the candidate matrix of an ˆ and S the index set of the eigenvectors that span estimated invariant operator, say M, SY |Z . Let λˆ in = min{λˆ i : i ∈ S},

λˆ out = max{λˆ i : i ∈ / S}.

ˆ is stochastically large (e.g., if /sd() ˆ is significantly larger ˆ = λˆ in − λˆ out . If  Let  ˆ being the standard deviation of ), ˆ then the inverse regression method than 0, sd() corresponding to Mˆ can identify the central subspace well, and the nonparametric ˆ is stochastically small or boost can offer little extra help. On the other hand, if 

Estimating Sufficient Dimension Reduction Spaces by Invariant Linear Operators

63

even takes negative values with a large probability, then there is a good chance that the inverse regression method will include one or a few wrong eigenvectors in its estimated basis of the central subspace, causing a large error. In this case the nonparametric boost will help greatly to identify the correct set of eigenvectors. Indeed, the role played by constant conditional variance assumption is nothing but ˆ converges in probability to a positive number, and when it does to guarantee that  not hold, the second-order inverse regression tends to misidentify the eigenvectors, which is corrected by NBIR. Acknowledgments The author would like to thank two referees and Professor Efstathia Bura for their thoughtful and helpful comments and suggestions. The author’s research is supported in part by the National Science Foundation grant DMS-1713078, which he gratefully acknowledges.

References E. Bura, R.D. Cook, Estimating the structural dimension of regressions via parametric inverse. J. Roy. Stat. Soc. B 63, 393–410 (2001) J.B. Conway, A Course in Functional Analysis, 2nd edn. (Springer, 1990) R.D. Cook, Using dimension-reduction subspaces to identify important inputs in models of physical systems, in 1994 Proceedings of the Section on Physical and Engineering Sciences (American Statistical Association, Alexandria, VA, 1994), pp. 18–25 R.D. Cook, Principal Hessian directions revisited. J. Am. Stat. Assoc. 93, 84–94 (1998a) R.D. Cook, Regression Graphics: Ideas for Studying Regressions Through Graphics (Wiley, New York, 1998b) R.D. Cook, B. Li, Dimension reduction for conditional mean in regression. Ann. Stat. 30, 455–474 (2002) R.D. Cook, B. Li, Determining the dimension of iterative Hessian transformation. Ann. Stat. 32, 2501–2531 (2004) R.D. Cook, L. Ni, Sufficient dimension reduction via inverse regression a minimum discrepancy approach. J. Am. Stat. Assoc. 108, 410–428 (2005) R.D. Cook, S. Weisberg, Sliced inverse regression for dimension reduction: Comment. J. Am. Stat. Assoc. 86, 328–332 (1991) A.P. Dawid, Conditional independence in statistical theory. J. Roy. Stat. Soc. B (Methodological) 1–31 (1979) M.L. Eaton, Multivariate Statistics: A Vector Space Approach (Institute of Mathematical Statistics, 2007) L. Ferré, A.F. Yao, Functional sliced inverse regression analysis. Stat. J. Theor. Appl. Stat. 37, 475–488 (2003) L. Ferré, A.F. Yao, Smoothed functional inverse regression. Statistica Sinica 15, 665–683 (2005) T. Kato, Perturbation Theory for Linear Operators (Springer, 1980) B. Li, Sufficient Dimension Reduction: Methods and Applications with R (CRC Press/Chapman & Hall, 2018) B. Li, J. Song, Nonlinear sufficient dimension reduction for functional data. Ann. Stat. 45, 1059– 1095 (2017) B. Li, S. Wang, On directional regression for dimension reduction. J. Am. Stat. Assoc. 35, 2143– 2172 (2007) B. Li, H. Zha, F. Chiaromonte, Contour regression: A general approach to dimension reduction. Ann. Stat. 33, 1580–1616 (2005) K.-C. Li, Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 86, 316–327(1991)

64

B. Li

K.-C. Li, On principal hessian directions for data visualization and dimension reduction: Another application of Stein’s lemma. J. Am. Stat. Assoc. 87, 1025–1039 (1992) K.-C. Li, N. Duan, Regression analysis under link violation. Ann. Stat. 17, 1009–1052 (1989) H. Lian, G. Li, Series expansion for functional sufficient dimension reduction. J. Multivariate Anal. 124, 150–165 (2014) W. Luo, B. Li, Combining eigenvalues and variation of eigenvectors for order determination. Biometrika 103, 875–887 (2016) S. Wang, Dimension Reduction in Regression, Ph.D. Thesis, Pennsylvania State University (2005) Y. Xia, A constructive approach to the estimation of dimension reduction directions. Ann. Stat. 35, 2654–2690 (2007) Y. Xia, H. Tong, W.K. Li, L.-X. Zhu, An adaptive estimation of dimension reduction space. J. Roy. Stat. Soc. B 64, 363–410 (2002) Z. Ye, R.E. Weiss, Using the bootstrap to select one of a new class of dimension reduction methods. J. Am. Stat. Assoc. 98, 968–979 (2003) X. Yin, B. Li, R. Cook, Successive direction extraction for estimating the central subspace in a multiple-index regression. J. Multivariate Anal. 99, 1733–1757 (2008)

Testing Model Utility for Single Index Models Under High Dimension Qian Lin, Zhigen Zhao, and Jun S. Liu

1 Introduction Testing whether a quantitative response is dependent/independent of a subset of covariates is one of the central problems in statistical analyses. Most existing literature focuses on linear relationships. For instance, Arias-Castro et al. (2011b) considered the linear model y = Xβ + ,

(1)

Lin’s research was supported in part by National Key R&D Program of China (2020AAA0105200), the National Natural Science Foundation of China (Grant 11971257), Beijing Natural Science Foundation (Grant Z190001) and Beijing Academy of Artificial Intelligence. Zhao’s research was supported in part by the NSF Grant IIS-1633283. Liu’s research was supported in part by the NSF Grants NSF DMS-1712714 and NSF DMS-2015411. The authors “Qian Lin” and “Zhigen Zhao” contributed equally. Q. Lin Center for Statistical Science and Department of Industrial Engineering, Tsinghua University, Beijing, China e-mail: [email protected] Z. Zhao Department of Statistical Science, Temple University, Philadelphia, PA, USA e-mail: [email protected] J. S. Liu () Department of Statistics, Harvard University, Cambridge, MA, USA Center for Statistical Science, Tsinghua University, Beijing, China e-mail: [email protected] © Springer Nature Switzerland AG 2021 E. Bura, B. Li (eds.), Festschrift in Honor of R. Dennis Cook, https://doi.org/10.1007/978-3-030-69009-0_4

65

66

Q. Lin et al.

where  ∼ N (0, σ 2 I), to test whether all the βi ’s are zero. This can be formulated as the following null and alternative hypotheses:  β1 = . . . = βp = 0 H0 : (2) p Hs,r : β ∈ s (r)  {β ∈ Rs | β22 ≥ r 2 } p

where Rs denotes the set of s-sparse vector in Rp with the number of nonzero entries being no greater than s. Arias-Castro et al. (2011b) and Ingster et al. (2010) 1/2 showed that one can detect the signal if and only if r 2  s log(p) ∧ p n ∧ √1n . The n upper bound is guaranteed by an asymptotically most powerful test based on higher criticism (Donoho and Jin, 2004). The linearity or other functional form assumption is often too restrictive in practice. Theoretical and methodological developments beyond parametric models are important, urgent, and extremely challenging. As a first step toward nonparametric testing of the independence, we here study the single index model y = f (β τ x, ), where f (·) is an unknown function. Our goal is to test the global null hypothesis that all the βi ’s are zero. The first challenge is to find an appropriate formulation of alternative hypotheses because β22 used in (2) is not even identifiable in single index models. When rank(var(E[x | y])) is nonzero in a single index model, the unique nonzero eigenvalue λ of var(E[x | y]) can be viewed as the generalized signalto-noise ratio (gSNR) (Lin et al., 2019). In Sect. 2, we show that for the linear regression model, this λ is almost proportional to β2 when it is small. The alternative hypotheses in (2) can be rewritten as gSN R > r 2 . Because of this connection, we can treat λ as the separation quantity for the single index model and consider the following contrasting hypotheses:  H0 : gSNR = 0, Ha : gSNR ≥ λ0 . We show that, under certain regularity conditions, one can detect a nonzero gSNR if 1/2 ∧ p n ∧ √1n for the single index model with additive noise. and only if λ0  s log(p) n This is a strong and surprising result because this detection boundary is the same as that for the linear model. Using the idea from the sliced inverse regression (SIR) (Li, 1991), we show that this boundary can be achieved by the proposed spectral test statistics using SIR (SSS) and SSS with ANOVA test assisted (SSSa). Although SIR has been advocated as an effective alternative to linear multivariate analysis (Chen and Li, 1998), the existing literature has not provided satisfactory theoretical foundations for high dimensions until recently (Lin et al., 2018a,b, 2019). We believe that the results in this paper provide further supporting evidence to the speculation that “SIR can be used to take the same role as linear regression in model building, residual analysis, regression diagnoses, etc” (Chen and Li, 1998). In Sect. 2, after briefly reviewing the SIR and related results in linear regression, we state the optimal detection problem and a lower bound for single index models.

SIM-Detection

67

In Sect. 3, we first show that the correlation-based higher criticism (Cor-HC) developed for linear models fails for single index models and then propose a test to achieve the lower bound stated in Sect. 2. Some numerical studies are included in Sect. 4. We list several interesting implications and future directions in Sect. 5. Additional proofs and lemmas are included in appendices.

2 Generalized SNR for Single Index Models 2.1 Notation The following notations are adopted throughout the paper. For a matrix V , we call the space generated by its column vectors the column space and denote it by col(V ). The i-th row and j -th column of the matrix are denoted by V i,∗ and V ∗,j , respectively. For vectors x and β ∈ Rp , we denote their inner product x, β by x(β), and the k-th entry of x by x(k). For two positive numbers a, b, we use a ∨ b and a ∧ b to denote max{a, b} and min{a, b}, respectively. Throughout the paper, we use C, C  , C1 , and C2 to denote generic absolute constants, though the actual value may vary from case to case. For two sequences {an } and {bn }, we denote an  bn (resp. an  bn ) if there exists positive constant C (resp. C  ) such that an ≥ Cbn (resp. an ≤ C  bn ). We denote an  bn if both an  bn and an  bn hold. We denote an ≺ bn (resp. an  bn ) if an = o(bn ) (resp. bn = o(an )). The (1,∞) norm and (∞, ∞) norm of matrix A are defined as p A1,∞ = max1≤j ≤p i=1 |Ai,j | and max1≤i,j ≤n Ai,j , respectively. For a finite subset S, we denote by |S| its cardinality. We also write AS,T for the |S| × |T | submatrix with elements (Ai,j )i∈S,j ∈T and AS for AS,S . For any squared matrix A, we define λmin (A) and λmax (A) as the smallest and largest eigenvalues of A, respectively. When y and x are independent, it is denoted as y ⊥ ⊥ x.

2.2 A Brief Review of the Sliced Inverse Regression (SIR) SIR was first proposed by Li (1991) to estimate the central space spanned by β 1 , . . . , β d based on n i.i.d. observations (yi , x i ), i = 1, · · · , n, from the multiple index model y = f (β τ1 x, . . . , β τd x, ), under the assumption that x follows an elliptical distribution and  is Gaussian. SIR starts by dividing the data into H equalsized slices according to the order statistics y(i) . To ease notations and arguments, we assume that n = cH and E[x] = 0 and re-express the data as yh,j and x h,j , where h refers to the slice number and j refers to the order number within the slice, i.e., yh,j ← y(c(h−1)+j ) , x h,j ← x (c(h−1)+j ) . Here x (k) is the concomitant of y(k) . Let the sample mean in the h-th slice be denoted by x h,· ; then   var(E[x|y]) can be estimated by

68

Q. Lin et al. H  1 H = 1  x¯ h,· x¯ τh,· = XτH XH H H

(3)

h=1

where XH denotes the p × H matrix formed by the H sample means, i.e., XH = (x 1,· , . . . , x H,· ).  H is the matrix formed by the  H ), where V Thus, col() is estimated by col(V H . The col(V  H ) is d eigenvectors associated with the largest d eigenvalues of  a consistent estimator of col() under certain technical conditions (Duan and Li, 1991; Hsing and Carroll, 1992; Li, 1991; Lin et al., 2018b; Zhu et al., 2006). It is shown in Lin et al. (2018a,b) that, for single index models (d = 1), H can be chosen as a fixed number not depending on λ(), n, and p for the asymptotic results to hold. Throughout this paper, we assume the following mild conditions: (A1) x ∼ N (0, ), and there exist two positive constants Cmin < Cmax , such that Cmin < λmin () ≤ λmax () < Cmax . (A2) Sliced stable condition. For 0 < a1 < 1 < a2 , let AH (a1 , a2 ) denote all partitions {−∞ = a0 ≤ a2 ≤ . . . ≤ aH = +∞} of R satisfying that a1 a2 ≤ P(ah ≤ Y ≤ ah+1 ) ≤ . H H A curve m(y) is ϑ-sliced stable with respect to y, if there exist positive constants a1 , a2 , a3 and large enough H0 such that for any H > H0 , for any partition in AH (a1 , a2 ) and any γ ∈ Rp , one has H      a3 1  var γ τ m(y)ah−1 ≤ y < ah ≤ ϑ var γ τ m(y) . H H

(4)

h=1

A curve is sliced stable if it is ϑ-sliced stable for some positive constant ϑ. The sliced stable condition is introduced in Lin et al. (2018b) to study the phase transition of SIR. The sliced stable condition is a mild condition. Neykov et al. (2016) derived the sliced stable condition from a modification of the regularity condition proposed in Hsing and Carroll (1992). For this paper, we modified it for single index models.

2.3 Generalized Signal-to-Noise Ratio of Single Index Models We consider the following single index model: y = f (β τ x, ), x ∼ N(0, ),  ∼ N(0, σ 2 ),

(5)

SIM-Detection

69

where f (·) is an unknown function. What we want to know is whether the coefficient vector β, when viewed as a whole, is zero. This can be formulated as a global testing problem as H0 : β = 0 versus

Ha : β = 0.

When assuming the linear model y = β τ x +, whether we can separate the null and alternative depends on the interplay between σ 2 and the norm of β. More precisely, it depends on the signal-to-noise ratio (SNR) defined as SNR =

β22 β τ0 β 0 E[(β τ x)2 ] = E[y 2 ] σ 2 + β22 β τ0 β 0

when β = 0 and β 0 = β/β2 (Janson et al., 2017). Here ||β||2 is useful for benchmarking prediction accuracy for various model selection techniques such as AIC, BIC, or the Lasso. However, since there is an unknown link function f (·) in the single index model, the norm ||β||2 becomes non-identifiable. Without loss of generality, we restrict ||β||2 = 1 and have to find another quantity to describe the separability. For the single index model (5), to simplify the notation, use λ to denote λmax (var(E[x|y])). For linear models, we can easily show that var(E[x|y]) =

β τ0 β 0 β22 ββ τ  and λ = . β τ0 β 0 β22 + σ 2 β τ0 β 0 β22 + σ 2 β τ β

Consequently, λ/SNR = β0τ β 0 . When assuming condition (A2), such a ratio is 0 0 bounded by two finite limits. Thus, λ can be treated as an equivalent quantity to the SNR for linear models and is therefore named as the generalized signal-to-noise ratio (gSNR) for single index models. Remark 1 To the best of our knowledge, although SIR uses the estimation of λ to determine the structural dimension (Li, 1991), few investigations have been made toward theoretical properties of this procedure in high dimensions. The only work that uses λ as a parameter to quantify the estimation error when estimating the direction of β is Lin et al. (2018a), which, however, does not indicate explicitly what role λ plays. The aforementioned observation about λ for single index models provides a useful intuition: λ is a generalized notion of the SNR, and condition (A2) merely requires that gSNR is nonzero.

70

Q. Lin et al.

2.4 Global Testing for Single Index Models As we have discussed, Arias-Castro et al. (2011b) and Ingster et al. (2010) considered the testing problem (2), which can be viewed as the determination of the detection boundary of gSNR. Through the whole paper, we consider the following testing problem: 

H0 :

gSN R = 0,

Ha :

λ(= gSN R)

is nonzero,

(6)

based on i.i.d. samples {(yi , x i ), i = 1, . . . , n}. Two models are considered: (i) the general single index model (SIM) defined in (5) and (ii) the single index model with additive noise (SIMa) defined as y = f (β τ x) + , x ∼ N(0, ),  ∼ N(0, σ 2 ).

(7)

We assume that conditions (A1) and (A2) hold for both models.

3 The Optimal Test for Single Index Models 3.1 The Detection Boundary of Linear Regression To set the goal and scope, we briefly review some related results on the detection boundary for linear models (Arias-Castro et al., 2011b; Ingster et al., 2010). Proposition 1 Assume that x i ∼ N(0, Ip ), i = 1, · · · , n, and that β has at most s nonzero entries. There is a test with both type I and II errors converging to zero for the testing problem in (2) if and only if r2 

1 s log(p) p1/2 ∧ ∧√ . n n n

(8)

Assuming x ∼ N(0, Ip ) and the variance of the noise is known, Ingster et al. (2010) obtained the sharp detection boundary (i.e., with exact asymptotic constant) for the above problem. Since linear models are special cases of SIMa, which is a special subset of SIM, the following statement about the lower bound of detectability is a direct corollary of Proposition 1. Corollary 1 i) If s 2 log2 (p) ∧ p ≺ n, then any test fails to separate the null and the alternative hypothesis asymptotically for SIM when

SIM-Detection

71

λ≺

s log(p) p1/2 ∧ . n n

(9)

ii) Any test fails to separate the null and the alternative hypothesis asymptotically for SIMa when λ≺

s log(p) p1/2 1 ∧ ∧√ . n n n

(10)

3.2 Single Index Models Moving from linear models to single index models is a big step. A natural and reasonable start is to consider tests based on the marginal correlation used for linear models (Arias-Castro et al., 2011b; Ingster et al., 2010). However, the following example shows that the marginal correlation fails for the single index models, indicating that we need to look for some other statistics to approximate the gSNR. Example 1 Suppose that x ∼ N(0, Ip ),  ∼ N(0, 1), and we have n samples from the following model: y = (x1 + . . . + xl ) − (x1 + . . . + xl )3 /3l + .

(11)

Simple calculation shows that E[xy] = 0. Thus, correlation-based methods do not work for this simple model. On the other hand, since the link function f (t) = t − t 3 /3l is monotone when |t| is sufficiently large, we know that E[x | y] is not a constant and var(E[x | y]) = 0. Let λ0 and λa0 be two sequences such that λ0 

s log(p) p1/2 s log(p) p1/2 1 ∧ , λa0  ∧ ∧√ . n n n n n

For a p × p symmetric matrix A and a positive constant k such that ks < p, we define λ(ks) max (A) = max λmax (AS ). |S|=ks

(12)

For model y = f (β τ x, ), in addition to the condition that λ0 ≺ λ, we further assume that s 2 log2 (p) ∧ p ≺ n. H be the estimate of var(E[x|y]) based on SIR. Let τn , τn , and τn be three Let  quantities satisfying

72

Q. Lin et al.



p s log(p) 1 ≺ τn ≺ λ0 , ≺ τn ≺ λ0 , √ ≺ τn ≺ λa0 . n n n

(13)

We introduce the following two assistance tests: 1. Define H ) > ψ1 (τn ) = 1(λmax (

tr() + τn ). n

2. Define   ψ2 (τn ) = 1(λ(ks) max (H ) > τn ).

Finally, the spectral test statistic based on SIR, abbreviated as SSS, is defined as SSS = max{ψ1 (τn ), ψ2 (τn )}.

(14)

To show the theoretical properties of SSS, we impose the following condition on the covariance matrix : (A3) There are at most k nonzero entries in each row of . This assumption is first explicitly proposed in Lin et al. (2018b), which is partially motivated by the separable after screening (SAS) properties in Ji and Jin (2012). In this paper, we assume such a relative strong condition and focus on establishing the detection boundary. This condition can be possibly relaxed by considering a larger class of covariance matrices  S(γ , ) = | j l | ≤ 1 − (log(p))−1 ,

|{l |  j l > γ }| ≤  ,

which is used in Arias-Castro et al. (2011a) for analyzing linear models. Our condition contains S(0, ) for some positive constant , and we could relax our constraint to some S(γ , ). However, the technical details will be much more involved, which masks the importance of the main results. We thus leave it for a future investigation. Theorem 1 Assume that s 2 log2 (p) ∧ p ≺ n, λ  λ0 , and conditions (A1)−(A3) hold. Two sequences τn and τn satisfy the conditions in (13). Then, type I and type II errors of the test SSS (τn , τn ) converge to zero for the testing problem under SIM. Comparing with the test proposed in Ingster et al. (2010), our test statistics is a H . It is adaptive in the spectral statistics and depends on the first eigenvalue of  moderate-sparsity scenario. In the high-sparsity scenario when s 2 log2 (p) ≺ p, the SSS relies on ψ2 (τn ), which depends on the sparsity s of the vector β. Therefore, SSS is not adaptive to the sparsity level. Both Arias-Castro et al. (2011a) and Ingster et al. (2010) introduced an (adaptive) asymptotically powerful test based

SIM-Detection

73

on the higher criticism (HC) for the testing problem under linear models. It is an interesting research problem to develop an adaptive test using the idea of higher criticism for (6).

3.3 Optimal Test for SIMa When the noise is assumed additive as in SIMa (7), the detection boundary can be further improved. In addition to conditions (A1)–(A3), f is further assumed to satisfy the following condition: (B) f (z) is sub-Gaussian, E[f (z)] = 0, and var(f (z)) > Cvar(E[z | f (z) + ]) iid

for some constant C, where z,  ∼ N(0, 1). Note that for any fixed function f such that var(E[z | f (z) + ]) = 0, there exists a positive constant C such that var(f (z)) > C. var(E[z | f (z) + ])

(15)

By continuity, we know that (15) holds in a small neighborhood of f , i.e., if C is sufficiently small, condition (B) holds for a large class of functions. First, we adopt the test SSS (τn , τn ) described in the previous subsection. Since the noise is additive, we include the ANOVA test: ψ3 (τn ) = 1(t > τn )  where t = n1 nj=1 (yj2 − 1) and τn is a sequence satisfying the condition (13). Combing this test with the test SSS (τn , τn ), we can introduce SSS assisted by ANOVA test (SSSa) as SSSa (τn , τn , τn ) = max{SSS (τn , τn ), ψ3 (τn )}.

(16)

We then have the following result. Theorem 2 Assume that λ  λa0 and the conditions (A1)−(A3) and (B) hold. Assume that the sequences τn , τn , and τn satisfy condition (13); then type I and type II errors of the test SSSa (τn , τn , τn ) converge to zero for the testing problem under SIMa. Example Continued. For the example in (11), we calculated the test statistic ψSSS defined by (14) under both the null and alternative hypotheses. Figure 1 shows the histograms of such a statistic under both hypotheses, demonstrating a perfect  separation between the null and alternative. For this example, λks max (H ) has more H ). discrimination power than λmax (

74

Q. Lin et al.

Fig. 1 The histograms of ˆ λks max (H ) for the model (11). The top panel corresponds to the scores under the null, and the bottom one corresponds to the scores under the alternative. The “black” vertical line is the 95% quantile under the null

0

20

40

60

SSS, p=2000,n=1000,rho=0

0.3

0.4

0.5 Null

0.6

0.7

0.8

0.2

0.3

0.4 0.5 0.6 Alternative

0.7

0.8

0

5

10

15

0.2

3.4 Computationally Efficient Test Although the test SSS (and SSSa ) is rate optimal, it is computationally inefficient.  Here we propose an efficient algorithm to approximate λ(ks) max (H ) via a convex relaxation, which is similar to the convex relaxation method for estimating the top eigenvector of a semi-definite matrix (Adamczak et al., 2008; Berthet and Rigollet, H of 2013b; d’Aspremont et al., 2005, 2014). To be precise, given the SIR estimate  var(E[x | y]), consider the following semi-definite programming (SDP) problem:  ! λ(ks) max (H )  max

H M), tr(

subject to tr(M) = 1,

|M|1 ≤ ks,

(17)

M is semi-definite positive.

τn

  With ! λ(ks) max (H ), for a sequence τn satisfying the condition in (13), i.e., ≺ λ0 , a computationally feasible test is

s log(p) n



  !2 (τn ) = 1(! ψ λ(ks) max (H ) > τn ).

Then, for any sequence τn satisfying the inequality in (13), we define the following computationally feasible alternative of SSS :

SIM-Detection

75

!SSS = max{ψ1 (τn ), ψ !2 (τn )}. 

(18)

Theorem 3 Assume that s 2 log2 (p) ∧ p ≺ n, λ  λ0 , and conditions (A1)−(A3) !SSS (τn , τn ) converge to zero for the hold. Then, type I and type II errors of the test  testing problem under SIMa. Similarly, if we introduce the test !SSSa (τn , τn , τn ) = max{ !SSS , ψ3 (τn )}, 

(19)

for three sequences τn , τn , and τn , then we have: Theorem 4 Assume that λ  λa0 and conditions (A1)−(A3) and (B) hold. The test !SSSa (τn , τn , τn ) is asymptotically powerful for the testing problem under SIMa.  Theorems 2 and 4 not only establish the detection boundary of gSNR for single index models but also open a door of thorough understanding of semi-parametric regression with a Gaussian design. It is shown in Lin et al. (2018a) that if we denoted the single index models satisfying conditions (A1), (A3), and rank(var(x|y)) > 0, one has sup inf Em Pβ − Pβ 2F  1 ∧  m∈M β

s log(ep/s) , nλ

(20)

ˆ βˆ T β) ˆ −1 βˆ T and Pβ = β(β T β)−1 β T are the projection operators where Pβˆ = β( with respect to βˆ and β, respectively, and the space M is defined in Equation (14) of Lin et al. (2018a). This implies that the necessary and sufficient condition for obtaining a consistent estimate of the projection operator Pβ is s log(ep/s) ≺ λ. On n the other hand, Theorems 2 and 4 state that, for single index models with additive 1/2 ∧ p n ∧ √1n ≺ λ, then one can detect the existence of gSNR (aka noise, if s log(p) n nontrivial direction β). Our results thus imply for SIMa that, if s log(p) , n

p1/2 n



√1 n

≺λ≺

one can detect the existence of nonzero β, but cannot provide a consistent estimation of its direction. To estimate the location of nonzero coefficient especially when focusing on the almost recovery region (Ji and Jin, 2012), we must tolerate a certain error rate such as the false discovery rate (Benjamini and Hochberg, 1995). For example, the knockoff procedure (Barber and Candès, 2015), SLOPE (Su and Candes, 2016), and UPT (Ji and Zhao, 2014) might be extended to single index models.

76

Q. Lin et al.

3.5 Practical Issues In practice, we do not know whether the noise is additive or not. Therefore, we !SSS . Condition (13) provides us a theoretical basis only consider the test statistic  for choosing the sequences τn and τn . In practice, however, we determine these H ) and λ˜ (ks)  thresholds by simulating the null distribution of λmax ( max (H ). Our final algorithm is as follows. Algorithm 1 Spectral test statistic based on SIR (SSS) algorithm H ) and ! H ) for the given input (x, y) 1. Calculate λmax ( λmax ( (ks)

iid

2. Generate z = (z1 , · · · , zn ), where zi ∼ N (0, 1) (ks)  H ) and ! 3. Calculate λmax ( λmax ( H ) based on (x, z)  4. Repeat Steps 2 and 3 N (= 100) times to get two sequences of λmax and ! λ(ks) max . Let τn and τn be the 95% quantile of these two simulated sequences  H ) > τn and/or λ˜ (ks)  5. Reject the null if λmax ( max (H ) > τn

4 Numerical Studies Let β be the vector of coefficients, and let S be the active set, S = {i : βi = 0}, iid

for which we simulated βi ∼ N(0, 1). Let x be the random design matrix with each row following N(0, ). We consider two types of covariance matrices: (i)  = (σij ) with σii = 1 and σij = ρ |i−j | and (ii) σii = 1, σij = ρ when i, j ∈ S or i, j ∈ S c and σij = σj i = 0.1 when i ∈ S, j ∈ S c . The first one represents a covariance matrix which is essentially sparse, and we choose ρ among 0, 0.3, 0.5, and 0.8. The second one represents a dense covariance matrix with ρ chosen as 0.2. In all the simulations, n = 1000, p varies among 100, 500, 1000, and 2000 and the number of replication is 100. The random error  follows N(0, In ). We consider the following models: I. II. III. IV.

y y y y

= 0.02 ∗ (16xβ − exp(xβ)) + , where |S| = 7; = 0.2 ∗ sin(xβ/2) ∗ exp(xβ/2) + , where |S| = 10;   = 0.8 ∗ xβ − (xβ)3 /15 + , where |S| = 5; = sin(xβ) ∗ exp(xβ/10) ∗ , where |S| = 10.

We choose H = 20 in the Algorithm 1 and assume the oracle information of the sparsity in the numerical studies because the goal of the numerical investigation is to demonstrate the theoretical detection boundary. A data-driven choice of such a tuning parameter is challenging to get and unnecessarily obscures the theoretical pattern. If we calculate N(= 100) test statistics for each replication, it will take an

SIM-Detection

77

extremely long time. Therefore, in the simulation, we calculate τn and τn slightly different from Algorithm 1. For each generated data set, we simulated only one (ks)  H ) and ! vector z where z ∼ N(0, In ) and calculate the statistic λmax ( λmax ( H ).  The τn and τn are chosen as 95% quantile from the corresponding sequence for all the replications. For each generated data, we also calculated Cor-HC scores according to AriasCastro et al. (2012). The threshold chc is chosen according to the same scheme as choosing the thresholds τn and τn . Namely, we calculated the Cor-HC scores based on z where z ∼ N(0, In ). The threshold chc is the 95% quantile of these simulated scores. The hypothesis is rejected if the Cor-HC score is greater than chc . The power for both methods is calculated as the average number of rejections out of 100 replications. These numbers are reported in Table 1. It is clearly seen that the power of SSS decreases when the dimension p increases. Nevertheless, the power of SSS is better than the one based on Cor-HC except for (ks)  one case. In Fig. 2, we plot the histogram of the statistic ! λmax ( H ) under the null in the top-left panel and the histogram of this statistic under the alternative in the bottom-left panel for Model III when p = 500 and ρ = 0.3 for type (i) covariance matrix. It is clearly seen that the test statistic SSS is well separated under the null and alternative. However, Cor-HC fails to distinguish between the null and alternative as shown in the two panels on the right side. To see how the performance of Cor-HC varies, we consider the following model: V. y = κxβ − exp(xβ) + , where |S| = 7, κ = 1, 3, 5, · · · , 19. Set n = 1000, p = 1000, and ρ = 0.3 for type (i) covariance matrix, and the power of both methods are displayed in Fig. 3. The coefficient κ determines the magnitude of the marginal correlation between the active predictors and the response. It is seen that when κ is close to 16, representing the case of diminishing marginal correlation, the power of Cor-HC dropped to the lowest. Under all the models, SSS is more powerful in detecting the existence of the signal. To observe the influence of the signal-to-noise ratio on the power of the tests, we consider the following two models: VI. y = (15xβ − exp(xβ)) ∗ κ + 4, where |S| = 7; VII. y = sin(xβ) ∗ exp(10xβκ) ∗ , where |S| = 10. Here κ = 0.01, 0.02, . . . , 0.10. Set n = 1000, p = 1000, and ρ = 0.3; we plot the power of both methods against the coefficient κ in Fig. 4. It is clearly seen that for both examples there is a sharp “phase transition” for the power of SSS as the signal strength increases, validating our theory about the detection boundary. In both examples SSS is much more powerful than Cor-HC.

78

Q. Lin et al.

Table 1 Power comparison of SSS and HC for four models I–IV for different parameter settings. Symbol “∗” indicates the type (ii) covariance matrix Model

Dim 100

500

I

1000

2000

100

500

III

1000

2000

ρ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗

SSS 1.00 1.00 0.99 1.00 0.90 0.98 0.99 0.97 0.98 0.52 0.89 0.88 0.91 0.96 0.37 0.92 0.86 0.83 0.90 0.43 1.00 1.00 1.00 1.00 0.98 0.99 1.00 0.98 0.99 0.62 0.99 0.97 0.97 0.92 0.60 0.96 0.97 0.93 0.88 0.59

HC 0.16 0.29 0.54 0.93 0.35 0.16 0.18 0.34 0.71 0.25 0.19 0.16 0.33 0.53 0.30 0.18 0.25 0.43 0.60 0.17 0.21 0.25 0.63 1.00 0.78 0.11 0.12 0.11 0.22 0.72 0.11 0.06 0.18 0.10 0.59 0.16 0.19 0.15 0.10 0.58

Model

Dim 100

500

II

1000

2000

100

500

IV

1000

2000

ρ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗ 0 0.3 0.5 0.8 0.2∗

SSS 0.98 0.97 0.96 1.00 0.96 0.87 0.80 0.82 0.83 0.77 0.81 0.74 0.77 0.84 0.69 0.75 0.68 0.68 0.81 0.63 0.89 0.91 0.89 1.00 0.94 0.70 0.57 0.57 0.69 0.45 0.55 0.56 0.51 0.73 0.44 0.58 0.47 0.45 0.61 0.40

HC 0.12 0.16 0.24 0.37 0.56 0.06 0.09 0.13 0.14 0.32 0.09 0.06 0.08 0.11 0.25 0.11 0.12 0.13 0.10 0.41 0.01 0.03 0.04 0.10 0.07 0.03 0.04 0.07 0.09 0.08 0.07 0.04 0.09 0.06 0.08 0.07 0.07 0.09 0.02 0.08

79 Cor−HC: p=500,n=1000,rho=0.3 0.30

SSS, p=500,n=1000,rho=0.3

0

0.00

20

0.15

40

60

80

SIM-Detection

0.10 0.12 0.14 0.16 0.18 0.20 0.22 Null

0

1

2

3

4

5

4

5

0

0.0

5 10

0.2

20

0.4

Null

0

0.10 0.12 0.14 0.16 0.18 0.20 0.22 Alternative

1

2 3 Alternative

Fig. 2 Model III, n = 1000, p = 500, type (i) covariance matrix, ρ = 0.3 Fig. 3 Power: Model V, n = 1000, p = 1000, ρ = 0.3 for type (i) covariance matrix

0.8

1.0

Model V, p=1000,n=1000,rho=0.3

0.0

0.2

0.4

Power

0.6

Cor−HC SSS

5

10 kappa

15

80

Q. Lin et al. Model VI, p=1000,n=1000,rho=0.3

Power 0.4 0.6

Power 0.4 0.6

0.8

0.8

1.0

1.0

Model VII, p=1000,n=1000,rho=0.3

Cor−HC SSS

0.0

0.0

0.2

0.2

Cor−HC SSS

0.02

0.04

0.06 kappa

0.08

0.10

0.02

0.04

0.06 kappa

0.08

0.10

Fig. 4 Power: Models VI and VII, n = 1000, p = 1000, ρ = 0.3 for the type (i) covariance matrix

5 Discussion Assuming that var(E[x | y]) is nonvanishing, we show in this paper that λ, the unique nonzero eigenvalue of var(E[x | y]) associated with the single index model, is a generalization of the SNR. We demonstrate a surprising similarity between linear regression and single index models with Gaussian design: the detection boundary of gSNR for the testing problem (6) under SIMa matches that of SNR for linear models (2). This similarity provides an additional support to the speculation that “the rich theories developed for linear regression can be extended to the single/multiple index models” (Chen and Li, 1998; Lin et al., 2019). Besides the gap we explicitly depicted between detection and estimation boundaries, we provide here several other directions which might be of interests to researchers. First, although this paper only deals with single index models, the results obtained here are very likely extendable to multiple index models. Assume that the noise is additive, and let 0 < λd ≤ . . . ≤ λ1 be the nonzero eigenvalues associated with the matrix var(E[x|y]) of a multiple index model. Similar argu√ p ments can show that the i-th direction is detectable if λi  n ∧ s log(p) ∧ √1n . New n thoughts and technical preparations might be needed for a rigorous argument for determining the lower bound of the detection boundary. Second, the framework can be extended to study theoretical properties of other sufficient dimension reduction algorithms such as SAVE and directional regression (Lin et al., 2018a,b, 2019).

SIM-Detection

81

Acknowledgments We thank Dr. Zhisu Zhu for his generous help with SDP.

Appendix: Proofs Assisting Lemmas Since our approaches are based on the technical tools developed in Lin et al. (2018a,b, 2019), we briefly recollect the necessary (modified) statements without proofs below. iid

Lemma 1 Let zj ∼ N(0, 1), j = 1, . . . , p. Let σ1 , . . . , σp be p positive constants  satisfying σ1 ≤ . . . ≤ σp . Then for any 0 < α ≤ σ12 j σj4 , we have p

⎛ P⎝

 j



σj2

  2 α 2 zj − 1 > α ⎠ ≤ exp −  4 . 4 σj

(21)

Lemma 2 Suppose that a p ×H matrix X formed by H i.i.d. p dimensional vector x j ∼ N (0, ), j = 1, . . . , H where 0 < C1 ≤ λmin () ≤ λmax () ≤ C2 for some constants C1 and C2 . We have " " "1 τ " " X X − tr() IH " > α (22) "p " p F   2 for some positive constant C. In with probability at most 4H 2 exp − Cpα 2 H particular, we know that     λmax XXτ /p = λmax X τ X/p ≤ tr()/p + α   2 happens with probability at least 1 − 4H 2 exp − Cpα . H2

(23)

⎞ B1 0 Lemma 3 Assume that p1/2 ≺ nλ. Let M = ⎝ B2 B3 ⎠ be a p × H matrix, where 0 B4 B1 and B2 are scalar, B3 is a 1 × (H − 1) vector, and B4 is a (p − 2) × (H − 1) matrix satisfying ⎛

82

Q. Lin et al.

    1 1 2 1− λ ≤ B1 ≤ 1 + λ 2ν 2ν " " 2  √ . " B pα A " B2 B3 2 " " " B τ B2 B τ B3 + B τ B4 − n I H " ≤ n 3 3 4 F for a constant ν > 1 where α ≺

nλ . p1/2

(24)

Then we have

  A λmax MM τ > − n

  √ pα 1 + 1− λ. n 2ν

(25)

Sliced Approximation Inequality The next result is referred to as “key lemma” in Lin et al. (2018a,b, 2019), which depends on the following sliced stable condition stated as Assumption A2. Lemma 4 Assume that Condition (A1) and the sliced stable condition A2 (for some ϑ > 0) hold in the single index model y = f (β τ x, ). Further assume that H be the SIR estimate of  = var(E[x | y]), and let rank(var(x|y)) > 0. Let  P be the projection matrix associated the column spaceof . For any vector   with    τ  1 τ p β ∈ R and any ν > 1, let Eβ (ν) = β P H P −  β  ≤ 2ν β β . There exist positive constants C1 , C2 , C3 , and C4 such that for any ν > 1 and H satisfying that H ϑ > C4 ν, one has ⎞ ⎛   # nλmax () ⎠ ⎝ Eβ ≥ 1 − C1 exp −C2 + C3 log(H ) . (26) P H ν2 β

Proof of Theorems Proof of Theorem 1 Theorem 1 follows from Lemmas 5 and 6. Lemma 5 Assume that p 1/2 ≺ nλ0 , and let τn be a sequence such that λ0 . Then, as n → ∞, we have:

√ p n

≺ τn ≺

H ) < tr() + τn with probability i) Under H0 , i.e., if y ⊥ ⊥ x, then λmax ( n converging to 1; H ) > tr() + τn with probability converging ii) Under H1 , if λ  λ0 , then λmax ( n to 1.

SIM-Detection

83

Proof i) If y ⊥ ⊥ x, we know that N (0,

1 n ).

√1  −1/2 X H H

is a p × H matrix with entries i.i.d. to

From Lemma 2, we know that  λmax

1 τ X XH H H

 ≤

tr() + τn n

(27)

  Cn2 τ 2 with probability at least 1 − 4H 2 exp − H 2 pn which → 1 as n → ∞. ii) For any event ω, there exist p × p orthogonal matrix S and H × H orthogonal matrix T such that ⎛

⎞ Z1 0 SX H (ω)T = ⎝ Z2 Z3 ⎠ 0 Z4

(28)

where Z1 , Z2 are two scalars, Z3 is a 1 × (H − 1) vector, and Z4 is a (p − 2) × (H − 1) matrix. Lemmas 4 and 2 imply that there exist a constant A and an events set , such that P (c ) → 0 as n → ∞. For any ω ∈ , one has 

   1 1 2 1− λ ≤ Z1 ≤ 1 + λ, 2ν 2ν " " τ  √ " Z Z2 pα tr() " Z2τ Z3 2 " " " Z τ Z2 Z τ Z3 + Z τ Z4 − n IH " ≤ n . 3 3 4 F

(29)

Lemma 3 implies that  λmax

1 XH XH H



  √ p 1 tr() tr() − α+ 1− λ + τn . ≥ n n 2ν n (30)



Lemma 6 Assume that s log(p) ≺ λ0 . Let τn be a sequence such that s log(p) ≺ τn ≺ n n λ0 . Then, as n → ∞, we have: (ks)   i) If y ⊥ ⊥ x, then λmax  H < τn with probability converging to 1; (ks)   ii) If λ  λ0 , then λmax H > τn with probability converging to 1. Proof i) If y ⊥ ⊥ x, we know that i.i.d. to N (0,

1 n ).

√1 E H H

=

√1  −1/2 X H H

is a p × H matrix with entries

Thus

  (ks)  λ(ks) max H = λmax



1 1/2  E H E τH  1/2 H



  1/2 and λ(ks) E H E τH D 1/2 max D

84

Q. Lin et al.

are identically distributed where D is diagonal matrix consisting of the eigen1/2 values of . For any subset S ⊂ [p], let X H,S = D S E S,H where E S,H is a submatrix of E H consisting of the rows in S. Note that      1 1/2 τ 1/2 τ X H,S XH,S . = λmax λmax D E H E H D S H

(31)

Thus, by Lemma 3, we have  λmax

1 XH,S XτH,S H

 < tr(D S )/n + α ≤

ksλmax () +α n

(32)

  2 α2 with probability at least 1 − 4H 2 exp − Cn . Let α = C s log(p) for some 2 n H s  p   ep ks (ks)  sufficiently large constant C. Since ks ≤ ks , we know that λmax (H ) ≤ C s log(p) ≺ τn with probability converging to 1. n ii) Let η be the eigenvector associated with the largest eigenvalue of . Thus |supp(η)| = ks. From Lemma 4, we know that   1 τ   λ λ(ks)  (  ) ≥ η η ≥ 1 − H H max 2ν

(33)

(ks)   with probability converging to 1. Thus, λmax  > τn with probability H converging to 1.



Proof of Theorem 2 Theorem 2 follows from the Theorem 1 and the following Lemma 7.

Lemma 7 Assume that τn → 0. Then we have:

√1 n

≺ λa0 . Let τn be a sequence such that

√1 n

≺ τn ≺ λa0 ,

i) If y ⊥ ⊥ x, then t < τn with probability converging to 1. ii) If λ  λa0 , then t > τn with probability converging to 1. Proof (i) Since y ⊥ ⊥ x ,we know that E[t] = 0. Let zj = yj2 − 1; then we have ⎛

⎞    1 P⎝ zj > τn ⎠ ≤ exp −Cnτn2 n

(34)

j

for some constant C. In other words, the probability of t > τn converges to 0 as n → ∞.

SIM-Detection

85

(ii) If λ  λa0 , we have var(f (z)) ≥ Cλ and E[y 2 − 1] ≥ Cλ for some constant C. Let zj = yj2 − 1, j = 1, . . . , n. Since f (xj ), j = 1, . . . , n are sub-Gaussian, we know that ⎞ ⎛    1 (35) zj > E[y 2 − 1] + δ ⎠ ≤ exp −Cnδ 2 . P⎝ n j

By choosing δ = CE[y 2 −1] for some constant C, we know that the probability of t ≥ (C + 1)λ  τn converges to 1.

Proof of Theorem 3 Theorems 3 and 4 follow from the following lemma, the Theorems 1 and 2.

Lemma 8 Assume that λ0 . Then we have:

s log(p) n

≺ λ0 . Let τn be a sequence such that

s log(p) n

≺ τn ≺

(ks)   i) If y ⊥ ⊥ x, then ! λmax  H < τn with probability converging to 1; (ks)   ii) If λ  λ0 , then ! λmax  H > τn with probability converging to 1.

Proof i) Under H0 , i.e., y ⊥ ⊥ x, the entries of as N (0,

1 n ).

Thus, if 1 ≺ α ≺

nτn s log(p) ,

√1  −1/2 X H H

are identically distributed

we have

 α log(p)  H (i, j ) ≤ max  (i,j ) n

(36)

  with probability at least 1 − p2 exp −Cα 2 log(p)2 for some constant C which converges to 1 as n → ∞. Since (see, e.g., Lemma 6.1 in Berthet and Rigollet (2013a))      α log(p)   ! ≺ τn λ(ks) + ks max H ≤ λmax st α log(p) H n n where stz (A)i,j = sign(Ai,j )(Ai,j − z)+ , we know that (i) holds. (ks)  (ks)  ii) Follows from that ! λmax ( H ) ≥ λmax ( H ).

(37)



References R. Adamczak, O. Guédon, A. Litvak, A. Pajor, N. Tomczak-Jaegermann, Smallest singular value of random matrices with independent columns. Compt. Rendus Math. 346(15), 853–856 (2008) E. Arias-Castro, E.J. Candès, A. Durand, Detection of an anomalous cluster in a network. Ann. Stat. 39(1), 278–304 (2011a)

86

Q. Lin et al.

E. Arias-Castro, E.J. Candès, Y. Plan, Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism. Ann. Stat. 39(5), 2533–2556 (2011b) E. Arias-Castro, S. Bubeck, G. Lugosi, Detection of correlations. Ann. Stat. 40(1), 412–435 (2012) R.F. Barber, E.J. Candès, Controlling the false discovery rate via knockoffs. Ann. Stat. 43(5), 2055–2085 (2015) Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57(1), 289–300 (1995) Q. Berthet, P. Rigollet, Optimal detection of sparse principal components in high dimension. Ann. Stat. 41(4), 1780–1815 (2013a) Q. Berthet, P. Rigollet, Optimal detection of sparse principal components in high dimension. Ann. Stat. 41(4), 1780–1815 (2013b). https://doi.org/10.1214/13-AOS1127 C.H. Chen, K.C. Li, Can SIR be as popular as multiple linear regression? Stat. Sin. 8(2), 289–316 (1998) A. d’Aspremont, L.E. Ghaoui, M.I. Jordan, G.R. Lanckriet, A direct formulation for sparse PCA using semidefinite programming, in Advances in Neural Information Processing Systems (2005), pp. 41–48 A. d’Aspremont, F. Bach, L. El Ghaoui, Approximation bounds for sparse principal component analysis. Math. Program. 148(1–2), 89–110 (2014) D. Donoho, J. Jin, Higher criticism for detecting sparse heterogeneous mixtures. Ann. Stat. 32(3), 962–994 (2004) N. Duan, K.C. Li, Slicing regression: a link-free regression method. Ann. Stat. 19(2), 505–530 (1991) T. Hsing, R.J. Carroll, An asymptotic theory for sliced inverse regression. Ann. Stat. 20(2), 1040– 1061 (1992) Y.I. Ingster, A.B. Tsybakov, N. Verzelen, Detection boundary in sparse regression. Electron. J. Stat. 4, 1476–1526 (2010) L. Janson, R.F. Barber, E. Candes, Eigenprism: inference for high dimensional signal-to-noise ratios. J. R. Stat. Soc. Ser. B 79(4), 1037–1065 (2017) P. Ji, J. Jin, UPS delivers optimal phase diagram in high-dimensional variable selection. Ann. Stat. 40(1), 73–103 (2012) P. Ji, Z. Zhao, Rate optimal multiple testing procedure in high-dimensional regression. arXiv preprint arXiv:1404.2961 (2014) K.C. Li, Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 86(414), 316–327 (1991) Q. Lin, X. Li, D. Huang, J.S. Liu, On the optimality of sliced inverse regression for high dimension (2018a) Q. Lin, Z. Zhao, J.S. Liu, On consistency and sparsity for sliced inverse regression in high dimensions. Ann. Stat. 46(2), 580–610 (2018b) Q. Lin, Z. Zhao, J.S. Liu, Sparse sliced inverse regression via lasso. J. Am. Stat. Assoc. 114(528), 1726–1739 (2019). PMID: 32952234. https://doi.org/10.1080/01621459.2018.1520115 M. Neykov, Q. Lin, J.S. Liu, Signed support recovery for single index models in high-dimensions. Ann. Math. Sci. Appl. 1(2), 379–426 (2016) W. Su, E.J. Candes, SLOPE is adaptive to unknown sparsity and asymptotically minimax. Ann. Stat. 44(3), 1038–1068 (2016) L. Zhu, B. Miao, H. Peng, On sliced inverse regression with high-dimensional covariates. J. Am. Stat. Assoc. 101(474), 640–643 (2006)

Sliced Inverse Regression for Spatial Data Christoph Muehlmann, Hannu Oja, and Klaus Nordhausen

1 Introduction Data that is recorded at spatial locations is increasingly common. For such spatial data, it is natural to assume that measurements which are close to each other are more similar than measurements taken far apart. In a regression context, it is also natural to consider that the response is not only a function of the explaining variables measured at the same location but might also depend on explaining variables in the vicinity. For example, if the response is some measurement of pollution at a given location, it might depend also on environmental factors in neighboring areas which could be carried over by the wind. Or, if one is interested in the value of houses in one district, the value of the houses in the neighborhood as well as their crime statistics might be of relevance. There are many spatial regression models taking into consideration spatial proximity; see, for example, Kelejian and Piras (2017), LeSage and Pace (2009), and references therein. The amount of possible explaining variables measured is however increasing tremendously; hence, it would be beneficial to identify a sufficient lower-dimensional subspace of the data prior to building the actual regression model. Reducing the dimension of the explanatory variables without losing information on the response is known as sufficient dimension reduction

C. Muehlmann · K. Nordhausen () Institute of Statistics & Mathematical Methods in Economics, Vienna University of Technology, Vienna, Austria e-mail: [email protected]; [email protected] H. Oja Department of Mathematics and Statistics, University of Turku, Turku, Finland e-mail: [email protected] © Springer Nature Switzerland AG 2021 E. Bura, B. Li (eds.), Festschrift in Honor of R. Dennis Cook, https://doi.org/10.1007/978-3-030-69009-0_5

87

88

C. Muehlmann et al.

(SDR); it is well established for iid data (see Li (2018); Ma and Zhu (2013) for some recent reviews). The most popular SDR methods are sliced inverse regression (SIR) (Li, 1991) and sliced average variance estimation (SAVE) (Cook, 2000; Cook and Weisberg, 1991). SIR and SAVE have been recently extended to the time series case (Matilainen et al. , 2017, 2019). SDR methods for spatial data are not yet much investigated. Guan and Wang (2010) suggested several SDR methods for spatial point processes, which means that the locations by themselves are stochastic and of interest and thus considered as a response to be modeled using explanatory covariates. Another type of spatial data is often referred to as geostatistical data where at fixed locations random phenomena are observed and these should be modeled. In general for geostatistical data locations are usually irregularly selected from the domain of interested. A special case is when the locations are lying on a regular grid. For such grid data, (Affossogbe et al., 2019; Loubes and Yao, 2013) considered kernel SIR and SAVE methods under the assumption that the response at a location is a function of the covariates at this same location. In this paper we extend SIR to grid data where we assume that the response at a given location might also depend on covariates measured at different locations. Our approach follows ideas from blind source separation and is an extension of the time series SIR method suggested in Matilainen et al. (2017). The structure of the paper is as follows: In Sect. 2 we review SIR in a blind source separation framework for iid data. Section 3 is devoted to SIR for time series data as our extension is build on these ideas. Then, in Sect. 4 we suggest our extension of these methods for spatial data. Lastly, we present several simulation studies of our spatial SIR method in Sect. 5.

2 SIR for iid Data For the purpose of this paper, we follow the concepts as in Matilainen et al. (2017, 2019) and introduce SIR in a blind source separation model context in which it is assumed that the response y is univariate and the p-variate vector x of explaining variables has the representation x = Ωz + μ = Ω

 (1)  z + μ, z(2)

where μ is a p-variate location vector and the p × p matrix Ω is called the mixing matrix having the only restriction to be full rank. Regarding the latent unobservable p-variate random vector z, the following assumptions are made: Assumption 1 The random vector z can be partitioned into the d-variate subvector z(1) and the p − d-variate subvector z(2) , and together they satisfy:

Spatial SIR

89

(A1) E(z) = 0 and Cov(z) = Ip , and  " (A2) y, z(1)" ⊥⊥ z(2) . The dimension d and the partitioning are minimal in the sense that, for projection matrices satisfying (y, Pz)⊥⊥(I − P)z, the rank of P is larger than or equal to d and  "  " " " P z(1) , 0" = z(1) , 0" . Note however that these assumptions do not specify z completely; both subvectors are defined only up to rotation by an orthogonal matrix of corresponding dimension. Assumption (A2) is slightly different than the one usually stated in the SIR literature where it is required that (A2 ) z(2) ⊥⊥y|z(1) and E(z(2) |z(1) ) = 0 (a.s.). Thus, in (A2 ) some weak dependence between z(2) and z(1) is allowed. To extend the theory to stochastic processes like time series or random fields, we make the stronger assumption (A2) which makes it easier to apply tools from blind source separation methods where independence assumptions are frequent. This stronger assumption was also used in the iid case in Nordhausen et al. (2016) to construct asymptotic and bootstrap tests for d, something we also plan to extend in future work to the time series and spatial case described in the subsequent sections. Naturally (A2) implies (A2 ) and both also have as consequence that 

 Cov(E(z(1) |y)) 0 Cov(E(z|y)) = . 0 0 This matrix is the key to SIR which is defined as: Definition 1 The sliced inverse regression functional Γ (x, y) at the joint distribution of x and y is obtained as follows: 1. Whiten the explanatory vector: xst = Cov(x)−1/2 (x − E(x)). 2. Find the d × p matrix U with orthonormal rows u1 , . . . , ud which maximizes d    2  st ||diag UCov(E(xst |y))U" ||2 = u" Cov(E(x |y))u . c c c=1

3. Γ (x, y) = UCov(x)−1/2 . The question is then naturally how to estimate Cov(E(xst |y)) if y is not discrete; that is where the term sliced comes into play. In that case y is discretized and “sliced” into H disjoint intervals yielding y sl , and then in Definition 1 the sliced y sl is used rather than y itself. It was shown that SIR is quite robust with respect

90

C. Muehlmann et al.

to slicing when at least ten observations are in each slice. Hence, for sample sizes larger than 100, H = 10 seems to be a reasonable choice. The optimization problem from Definition 1 can be solved by performing an eigenvector decomposition of Cov(E(xst |y)) or Cov(E(xst |y sl )), respectively. The vectors uc consist of the eigenvectors of this decomposition which have nonzero eigenvalues. Note that, given that slicing does not cause loss of information, there should be exactly d nonzero eigenvalues and the magnitude of these eigenvalues reflects the relevance of the corresponding direction for the response. Hence, the eigenvalues of Cov(E(xst |y sl )) based on a real sample can give an idea about the usually unknown value of d. Inferential theory for the order determination based on these eigenvalues is, for example, mentioned in Bura and Cook (2001), Luo and Li (2016), Nordhausen et al. (2016), and references therein.

3 SIR for Time Series Data

As the next step, we introduce SIR in the time series context following Matilainen et al. (2017), as we will extend this approach to spatially dependent data. Thus, we consider a univariate response time series y = y[t], t = 0, ±1, ±2, ..., and a p-variate time series x = x[t], t = 0, ±1, ±2, ..., that is used to explain y. As in the spatial data case, it is natural to assume that the dependence of y on x may be lagged in time, and therefore the time structure is taken into account when performing the dimension reduction. Becker and Fried (2003) suggested simply adding the lag-shifted time series as new variables to the process x[t], yielding x*[t] = (x[t]^⊤, x[t − 1]^⊤, ..., x[t − K]^⊤)^⊤, and applying the iid SIR to the pair y[t] and x*[t]. The disadvantage of this approach is that if K lags are of interest, then the dimension of x*[t] is (K + 1)p, while at the same time the sample size is reduced by K. Another approach for a time series version of SIR was suggested recently by Matilainen et al. (2017); the main idea is to incorporate serial information in Cov(E(x | y)) by defining Σ_τ(x) = Cov(E(x[t] | y[t + τ])). For their method to work, Matilainen et al. (2017) formulate the time series SIR blind source separation model
\[
x[t] = \Omega z[t] + \mu = \Omega \begin{pmatrix} z^{(1)}[t] \\ z^{(2)}[t] \end{pmatrix} + \mu,
\]
where Ω and μ are the full-rank p × p mixing matrix and the p-variate location vector, respectively. For the latent unobservable p-variate random process z = z[t] = (z^{(1)}[t]^⊤, z^{(2)}[t]^⊤)^⊤, the following assumptions are made:


Assumption 2 The stationary random process z = z[t] can be partitioned into the d-variate subprocess z^{(1)} = z^{(1)}[t] and the (p − d)-variate subprocess z^{(2)} = z^{(2)}[t], and together they satisfy

(A3) E(z) = 0 and Cov(z) = I_p, and
(A4) (y, z^{(1)⊤})^⊤ ⊥⊥ z^{(2)},

where d is minimal in the sense specified in Assumption 1. Thus, all necessary information required to model y[t] goes through z^{(1)}[t], as Assumption (A4) also implies that
\[
\bigl(y[t_1], z^{(1)}[t_1]^\top\bigr)^\top \perp\!\!\!\perp z^{(2)}[t_2]
\quad \text{or} \quad
\bigl(y[t_1 + \tau], z^{(1)}[t_1]^\top\bigr)^\top \perp\!\!\!\perp z^{(2)}[t_2]
\]
for all t_1, t_2, τ ∈ Z. It then holds that
\[
\Sigma_\tau(z) = \begin{pmatrix} \operatorname{Cov}\bigl(E(z^{(1)}[t] \mid y[t + \tau])\bigr) & 0 \\ 0 & 0 \end{pmatrix}
\quad \text{for all } \tau \in \mathbb{Z}.
\]

The idea of Matilainen et al. (2017) is to jointly diagonalize several matrices Σ_τ(x) using a set of different lags T = {τ_1, ..., τ_K} rather than diagonalizing only one matrix Σ_τ(x). This leads to the following definition of the time series SIR (TSIR) functional.

Definition 2 The TSIR functional Γ(x; y) for a stationary time series (y, x^⊤)^⊤ is obtained as follows:

1. Standardize x and write x^{st} := Cov(x)^{−1/2}(x − E(x)).
2. Find the d × p matrix U = (u_1, ..., u_d)^⊤ with orthonormal rows u_1, ..., u_d that maximizes
\[
\sum_{\tau \in \mathcal{T}} \bigl\|\operatorname{diag}\bigl(U \operatorname{Cov}(E(x^{st}[t] \mid y[t + \tau])) U^\top\bigr)\bigr\|^2
= \sum_{c=1}^{d} \sum_{\tau \in \mathcal{T}} \bigl(u_c^\top \operatorname{Cov}(E(x^{st}[t] \mid y[t + \tau])) u_c\bigr)^2
\tag{1}
\]
for a set of chosen lags T ⊂ Z^{+}.
3. The value of the functional is then Γ(x; y) = U Cov(x)^{−1/2}.

Again, if the response time series is continuous, slicing is performed to obtain y^{sl}[t], which leads to a feasible computation of the matrices Σ_τ. However, now the optimization problem is no longer solved by a simple eigenvector-eigenvalue decomposition.


Thus, TSIR is based on the joint diagonalization of all matrices Cov(E(x^{st}[t] | y[t + τ])) for τ ∈ T. There are many available algorithms for the joint diagonalization of several matrices; for this paper we use the one based on Givens rotations; see Cardoso and Souloumiac (1996) for details. For other options we refer to Illner et al. (2015). To choose the relevant vectors from the joint diagonalization, Matilainen et al. (2017) define
\[
\lambda_{c\tau} = \bigl(u_c^\top \operatorname{Cov}(E(x^{st}[t] \mid y[t + \tau])) u_c\bigr)^2, \quad c = 1, \ldots, p; \ \tau \in \mathcal{T},
\]
and then choose those vectors u_c from the joint diagonalization for which λ_{c·} = Σ_{τ ∈ T} λ_{cτ}, c = 1, ..., d, is larger than zero.
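As an illustration of the quantities entering TSIR, the following R sketch estimates one lagged matrix Cov(E(x^{st}[t] | y^{sl}[t + τ])) from a finite sample; the helper name tsir_lag_matrix and the rank-based slicing are our own choices, and the joint diagonalization step itself (carried out, e.g., with the Givens-rotation algorithm mentioned above) is not reproduced here.

# Sketch of one lagged SIR matrix used by TSIR (illustrative only).
# x: n x p matrix of the observed time series, y: length-n response series,
# tau: a single non-negative lag, H: number of slices.
tsir_lag_matrix <- function(x, y, tau, H = 10) {
  n <- nrow(x)
  p <- ncol(x)
  # Whiten the predictor series (marginally).
  S <- cov(x)
  e <- eigen(S, symmetric = TRUE)
  S_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
  x_st <- scale(x, center = TRUE, scale = FALSE) %*% S_inv_sqrt

  # Pair x_st[t] with the lagged, sliced response y^sl[t + tau].
  idx_x <- 1:(n - tau)
  y_lag <- y[(1 + tau):n]
  y_sl <- cut(rank(y_lag, ties.method = "first"), breaks = H, labels = FALSE)

  M <- matrix(0, p, p)
  for (h in unique(y_sl)) {
    rows <- idx_x[y_sl == h]
    m_h <- colMeans(x_st[rows, , drop = FALSE])
    M <- M + (length(rows) / length(idx_x)) * tcrossprod(m_h)
  }
  M
}

# The collection handed to the joint diagonalization would then be, e.g.,
# lapply(0:5, function(tau) tsir_lag_matrix(x, y, tau)).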

4 SIR for Spatial Data

In the following, as is quite common in spatial regression, we assume that the spatial data are measured on a regular grid indexed by [i, j] with i, j ∈ Z. We formulate a blind source separation model for the joint distribution of a p-variate explanatory random field x = x[i, j] and the univariate response random field y = y[i, j] by assuming that
\[
x[i, j] = \Omega z[i, j] + \mu = \Omega \begin{pmatrix} z^{(1)}[i, j] \\ z^{(2)}[i, j] \end{pmatrix} + \mu.
\]
Ω and μ denote the mixing matrix and the location vector, as before. For the latent field z = z[i, j], we assume:

Assumption 3 The stationary random field z[i, j] can be partitioned into the d-variate random subfield z^{(1)} = z^{(1)}[i, j] and the (p − d)-variate random subfield z^{(2)} = z^{(2)}[i, j]. The random fields then satisfy:

(A5) E(z) = 0 and Cov(z) = I_p, and
(A6) (y, z^{(1)⊤})^⊤ ⊥⊥ z^{(2)},

where d is minimal in the sense specified in Assumption 1. Assumption (A6) implies that (y[i, j], z^{(1)}[i, j]^⊤)^⊤ ⊥⊥ z^{(2)}[i′, j′], which can also be expressed as (y[i − k, j − l], z^{(1)}[i, j]^⊤)^⊤ ⊥⊥ z^{(2)}[i′, j′] for all i, i′, j, j′, k, l ∈ Z. Additionally, Assumption (A6) can also be interpreted as saying that there exists a full-rank p × p matrix Γ = (Γ_1^⊤, Γ_2^⊤)^⊤ such that Cov(Γx[i, j]) = I_p and (y, (Γ_1 x)^⊤)^⊤ ⊥⊥ Γ_2 x. Based on Assumption 3, the iid sliced inverse regression (SIR) operating on the marginal distributions of (y[i, j], x[i, j]^⊤)^⊤ could also be used to identify z^{(1)}, but it would use only the cross-sectional information while ignoring spatial dependencies, a relevant source of information that we would not like to ignore.


Using the idea of Becker and Fried (2003) and adding the neighboring cells as additional variables to x[i, j] is of course again possible, but even more costly. For example, assuming a square n × n grid and including just all directly connected neighbors increases the dimension eightfold and reduces the sample size to an n* × n* grid with n* = n − 2, meaning 4n − 4 observations are discarded. As in the time series case, this model formulation does not separate between independent and dependent explanatory fields for explaining the y[i, j] field. All the dependence between the x and y fields, as a whole, goes through z^{(1)}, and the aim is simply to separate the signal field z^{(1)} from the noise field z^{(2)}. Again there are indeterminacies in this model formulation; the fields z^{(1)} and z^{(2)} are identifiable only up to pre-multiplication by orthogonal matrices. We define the matrices
\[
\Sigma_{(k,l)}(x[i, j]) = \operatorname{Cov}\bigl(E(x[i, j] \mid y[i - k, j - l])\bigr)
\]
for all k ∈ Z and l ∈ Z, and the following holds:

Result 1 For all random fields fulfilling (A5) and (A6),
\[
\Sigma_{(k,l)}(z[i, j]) = \begin{pmatrix} \operatorname{Cov}\bigl(E(z^{(1)}[i, j] \mid y[i - k, j - l])\bigr) & 0 \\ 0 & 0 \end{pmatrix}
\]

for all k, l ∈ Z. Finally, we have all we need to define SIR in a spatial data context, which we denote by SSIR.

Definition 3 The SSIR functional Γ(x; y) for a stationary random field (y, x^⊤)^⊤ is obtained as follows:

1. Standardize x and write x^{st} := Cov(x)^{−1/2}(x − E(x)).
2. Find the d × p matrix U = (u_1, ..., u_d)^⊤ with orthonormal rows u_1, ..., u_d that maximizes
\[
\sum_{(k,l) \in \mathcal{S}} \bigl\|\operatorname{diag}\bigl(U \operatorname{Cov}(E(x^{st}[i, j] \mid y[i - k, j - l])) U^\top\bigr)\bigr\|^2
= \sum_{c=1}^{d} \sum_{(k,l) \in \mathcal{S}} \bigl(u_c^\top \operatorname{Cov}(E(x^{st}[i, j] \mid y[i - k, j - l])) u_c\bigr)^2
\tag{2}
\]
for a set of chosen spatial lags S = {(k_1, l_1), ..., (k_K, l_K)}.
3. The value of the functional is then Γ(x; y) = U Cov(x)^{−1/2}.


Again, if the response field is continuous, slicing is required for a practical computation of the inverse regression matrices Σ_{(k,l)}. Here there are K pairs of spatial lags as specified in the set S; thus the solution to the maximization problem can be obtained by joint diagonalization of these matrices. To extract the correct directions, we define, in analogy to TSIR,
\[
\lambda_{c(k,l)} = \bigl(u_c^\top \operatorname{Cov}(E(x^{st}[i, j] \mid y[i - k, j - l])) u_c\bigr)^2, \quad c = 1, \ldots, p; \ (k, l) \in \mathcal{S}.
\]
If the slicing did not cause any loss of information, then there should be d values λ_{c(·,·)} = Σ_{(k,l) ∈ S} λ_{c(k,l)} larger than zero. In practice these sums can therefore give an idea about the number of directions to retain; they could be used, for example, in a scree plot. Furthermore, the λ_{c(k,l)}-values also contain information about which spatial lags might be relevant: nonzero values indicate that the corresponding spatial lag and direction are of interest. The difficulty is that if the spatial correlation in a field is large, the dependence across successive λ_{c(k,l)} might not vanish quickly. To make the different λ_{c(k,l)}-values more comparable, we assume in the following that the values are standardized such that they add up to one. Following Matilainen et al. (2017), a first option to select the number of directions and the spatial lags of interest is as follows. Fix a proportion P ∈ (0, 1), and determine the smallest number of descending sorted λ_{c(k,l)}-values whose cumulative sum exceeds P. The lags and directions belonging to these values are then selected as the directions and spatial lags of relevance. This can be compared to the strategy in principal component analysis of retaining the smallest number of components that explain 100 · P% of the variation of the data; here it should instead explain the dependence with respect to the response field y. Matilainen et al. (2017) also discuss other strategies to select the number of directions and lags in a time series context based on the λ_{c(k,l)}-values. However, this approach seems to be the most promising one so far, and therefore we focus on this strategy. Naturally, the above strategies require that all relevant spatial lags are included in the set S. In practice this choice should be based on expert knowledge, but in case of doubt the set of lags can be adapted freely, as illustrated by the following examples. One could, for example, have an isotropic model in mind and include neighbors of first order, or neighbors of first as well as second order. In contrast, if one models data where the response is believed to depend only on certain neighboring predictors, then only lags in certain directions need to be considered; an example would be data where the response depends on the wind direction. Figure 1 visualizes these three different options: in the left panel, only neighbors of first order are considered, highlighted by the light gray cells; the middle grid shows neighbors of first and second order; and the right panel shows a structure based on South-East neighbors of first order. Generally, increasing the number of spatial lags leads to a higher number of Σ_{(k,l)} matrices to be jointly diagonalized.
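To make this selection rule concrete, the following R sketch standardizes a matrix of λ̂_{c(k,l)}-values and applies the cumulative proportion rule; the function name select_lambda and the returned data frame layout are our own illustrative choices.

# Sketch of the lag/direction selection based on standardized
# lambda_{c(k,l)} values (illustrative, not the authors' implementation).
# lambda: a K x p matrix, rows = spatial lags in S, columns = directions u_c.
# P: proportion of the total dependence to be explained.
select_lambda <- function(lambda, P = 0.8) {
  lambda_std <- lambda / sum(lambda)            # standardize to sum to one
  ord <- order(lambda_std, decreasing = TRUE)   # sort all entries
  keep <- ord[seq_len(which(cumsum(lambda_std[ord]) >= P)[1])]
  # Translate the linear indices back to (lag, direction) pairs.
  data.frame(
    lag_index       = ((keep - 1) %% nrow(lambda)) + 1,
    direction_index = ((keep - 1) %/% nrow(lambda)) + 1,
    lambda          = lambda_std[keep]
  )
}

# The number of distinct direction_index values gives the estimated d, and the
# distinct lag_index values are the spatial lags deemed relevant.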


Fig. 1 Illustration of different neighborhood relationships depicted in light gray relative to the dark gray cells. Left: Neighbors of first order. Middle: Neighbors of first and second order. Right: South-East neighbors

Although theoretically adding matrices with no information has no impact, in a finite-sample setting it increases the computational burden and adds noise to the joint diagonalization algorithm. The use of spatially lagged inverse regression curves is what distinguishes our approach from the spatial SDR methods for grid data described in Affossogbe et al. (2019) and Loubes and Yao (2013), which concentrate only on on-site information. Another parameter to be chosen in SSIR is the number of slices H used to estimate the inverse regression curve. We follow the usual SIR guidelines and use H = 10 in the following simulations; this is a common choice for iid SIR and has been shown to be quite robust (Li, 1991). Similarly, for TSIR, Matilainen et al. (2019) carried out an extensive simulation study on the influence of H on the prediction power, comparing the values H = 2, 5, 10, 20, 40. It was found that using more than H = 10 slices does not provide meaningful improvements and that a lower number might be more appropriate when the sample size decreases. In the case of a nominal response, the number of slices is naturally limited by the number of classes; for example, for a binary response variable, only H = 2 slices can be used.
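For completeness, a minimal R sketch of how one spatially lagged matrix Cov(E(x^{st}[i, j] | y^{sl}[i − k, j − l])) can be estimated on a regular grid is given below; the function name ssir_lag_matrix, the array layout of the predictor field, and the rank-based slicing are our own illustrative choices rather than the implementation used for the simulations in Sect. 5.

# Sketch of one spatially lagged SIR matrix for SSIR (illustrative only).
# x_field: n1 x n2 x p array of predictors on the grid, y_field: n1 x n2
# response field, k, l: spatial lag, H: number of slices.
ssir_lag_matrix <- function(x_field, y_field, k, l, H = 10) {
  n1 <- dim(x_field)[1]; n2 <- dim(x_field)[2]; p <- dim(x_field)[3]
  # Flatten the field to an n x p matrix and whiten it.
  x <- matrix(x_field, nrow = n1 * n2, ncol = p)
  S <- cov(x)
  e <- eigen(S, symmetric = TRUE)
  x_st <- scale(x, center = TRUE, scale = FALSE) %*%
    (e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors))
  x_st_field <- array(x_st, dim = c(n1, n2, p))

  # Keep only cells [i, j] for which [i - k, j - l] lies inside the grid.
  i_rng <- max(1, 1 + k):min(n1, n1 + k)
  j_rng <- max(1, 1 + l):min(n2, n2 + l)
  y_lag <- y_field[i_rng - k, j_rng - l]
  y_sl <- cut(rank(y_lag, ties.method = "first"), breaks = H, labels = FALSE)

  xs <- matrix(x_st_field[i_rng, j_rng, ], nrow = length(i_rng) * length(j_rng))
  M <- matrix(0, p, p)
  for (h in unique(y_sl)) {
    m_h <- colMeans(xs[y_sl == h, , drop = FALSE])
    M <- M + mean(y_sl == h) * tcrossprod(m_h)
  }
  M
}

Computing this matrix for every (k, l) in S yields the collection that SSIR jointly diagonalizes.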

5 Performance Evaluation of SSIR

In this section we want to evaluate how well SSIR performs. For that purpose we adapt the three models used in a time series context in Matilainen et al. (2017) to the spatial setting. The three models are:

A: y[i, j] = 2 z_1[i + 1, j] + 3 z_2[i + 1, j] + ε[i, j],
B: y[i, j] = 2 z_1[i + 1, j] + 3 z_2[i + 2, j − 2] + ε[i, j],
C: y[i, j] = z_1[i + 1, j] / (0.5 + (z_2[i + 1, j] + 1.5)^2) + ε[i, j].


Thus z^{(1)}[i, j] = (z_1[i, j], z_2[i, j])^⊤, and we also define a two-dimensional noise part z^{(2)}[i, j] = (z_3[i, j], z_4[i, j])^⊤. For simplicity we choose all four random fields z_1, ..., z_4 as mutually independent Gaussian random fields with mean zero and the following isotropic and homogeneous exponential covariance function:
\[
C(h) = C_0 \exp(-h / h_0) + C_1 \mathbb{1}_{\{0\}}(h),
\]

Fig. 2 Visualization of the exponential covariance functions used to simulate the random fields: choosing h_0 = 0.25 yields the weak dependence setting and h_0 = 15 the strong dependence setting (y-axis: C(h) from 0 to 2; x-axis: distance h from 0 to 20).

with C_0 = C_1 = 1, where 1_{{0}} is the indicator function and h denotes the distance between the points under consideration. The second term in the covariance function above represents a nugget effect, which is an on-site variance term; for details see van Lieshout (2019). In the following simulations, we sample from such a field on a square grid of side length 100 with 400 spatial locations in each direction, yielding a total of 400 × 400 observations. This can be thought of as having a square of 100 × 100 meters and taking a measurement every 25 centimeters. Similarly to Matilainen et al. (2017), we consider two cases: in the first case there is strong spatial dependence within each of the four fields, and in the second case the spatial dependence is weak. For that purpose we fix the scale parameter at h_0 = 0.25 for the weak spatial dependence case and at h_0 = 15 for the strong spatial dependence case. Figure 2 visualizes the two covariance functions; the discontinuity at zero represents the nugget effect. In the weak dependence case there is essentially no spatial dependence when measurements are taken three units apart, whereas in the strong dependence case there is still considerable dependence even when measurements are 20 units apart. Additionally, for the independent error ε[i, j], we add Gaussian white noise with σ² = 1 in all three models.
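As a small, self-contained illustration of this simulation setting, one latent field with the above covariance function can be generated in base R via a dense Cholesky factorization; the function name, the reduced grid size (the full 400 × 400 grid of the study is far too large for this naive approach), and the default parameter values are our own choices, whereas the actual simulations rely on the RandomFields package cited below.

# Illustrative simulation of one Gaussian random field with exponential
# covariance plus nugget on a small regular grid (dense Cholesky; a sketch,
# not the RandomFields-based setup used in the paper).
sim_exp_field <- function(n_side = 30, spacing = 0.25, h0 = 0.25,
                          C0 = 1, C1 = 1) {
  coords <- expand.grid(x = (1:n_side) * spacing, y = (1:n_side) * spacing)
  h <- as.matrix(dist(coords))                           # pairwise distances
  Sigma <- C0 * exp(-h / h0) + C1 * diag(nrow(coords))   # nugget on the diagonal
  L <- t(chol(Sigma))                                    # Sigma = L %*% t(L)
  z <- drop(L %*% rnorm(nrow(coords)))
  matrix(z, n_side, n_side)                              # field on the grid
}

# Example: the weak dependence setting uses h0 = 0.25, the strong one h0 = 15.
field_weak <- sim_exp_field(h0 = 0.25)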


As our SSIR method is affine equivariant, we use without loss of generality Ω = I_4 as the mixing matrix. Thus x[i, j] = z[i, j], and the directions of interest in the three models are:

A: ((2, 3, 0, 0)^⊤ x)[i + 1, j],
B: ((2, 0, 0, 0)^⊤ x)[i + 1, j] and ((0, 3, 0, 0)^⊤ x)[i + 2, j − 2],
C: ((1, 0, 0, 0)^⊤ x)[i + 1, j] and ((0, 1, 0, 0)^⊤ x)[i + 1, j].

To fit SSIR we use either all first-order neighbors or all first- and second-order neighbors, as visualized in Fig. 1, and set the number of slices H to 10. For the computation of the following simulations, we use R 3.5.1 (R Core Team, 2018) with the packages JADE (Miettinen et al., 2017), raster (Hijmans, 2019), RandomFields (Schlather et al., 2015), and LDRTools (Liski et al., 2018). In the simulations we would like to evaluate how well SSIR estimates the directions of interest. For that purpose we consider both the case where d is known and the case where d is estimated by applying the rule described above with P = 0.5 and P = 0.8, respectively. Thus, when estimating Γ, the rank of the estimated matrix might differ from the true rank d; this, together with the indeterminacy that the results might be rotated by an orthogonal matrix, must be taken into account when choosing the performance criterion. Therefore, we do not compare Γ and Γ̂ directly but rather their projection matrices P_Γ and P_Γ̂. Following Liski et al. (2016), we measure the distance between the projection matrices using the Frobenius norm after weighting them by their rank, i.e.,
\[
D_w^2(P_{\hat{\Gamma}}, P_{\Gamma}) = \tfrac{1}{2}\,\bigl\| w(\hat{d})\, P_{\hat{\Gamma}} - w(d)\, P_{\Gamma} \bigr\|^2,
\]
where for the weights we consider two weight functions: inverse, w(d) = 1/d, and inverse square root, w(d) = 1/√d. Both weight functions ensure that projectors of different ranks are more comparable; they differ in their image set and interpretation in special cases, as described in Liski et al. (2016). The following results are presented only for the inverse weight function, as the qualitative results are the same for both options. Figures 3, 4, and 5 show the distances for all three models in the low and high dependence settings, considering first- as well as first- and second-order neighbors, for the inverse weight function based on 2000 repetitions. Furthermore, Fig. 6 depicts the percentages of chosen directions d̂ for P = 0.5 and P = 0.8 for all model and dependence settings. The distances clearly show that SSIR works as expected when the true number of directions is known, as the distance to the true subspace is small. If the dimension is unknown, then the rule with P = 0.5 is not advisable in Models B and C. For Models B and C with low dependence, P = 0.8 seems to work well, except for Model B when considering only first-order neighbors.
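The following R sketch computes this weighted projector distance from two basis matrices; the function names proj and dist_w are our own, and the example call at the end uses hypothetical objects for illustration only.

# Weighted distance between the projectors onto two estimated subspaces,
# D_w^2 = 0.5 * || w(d_hat) P_hat - w(d) P ||_F^2 (illustrative sketch).
proj <- function(B) {                       # projection matrix onto span(B)
  B <- as.matrix(B)
  B %*% solve(crossprod(B), t(B))
}
dist_w <- function(B_hat, B, weight = c("inverse", "inverse_sqrt")) {
  weight <- match.arg(weight)
  w <- function(d) if (weight == "inverse") 1 / d else 1 / sqrt(d)
  P_hat <- proj(B_hat); P <- proj(B)
  d_hat <- ncol(as.matrix(B_hat)); d <- ncol(as.matrix(B))
  0.5 * sum((w(d_hat) * P_hat - w(d) * P)^2)
}

# Example with hypothetical objects, e.g. for Model A where the true
# direction is (2, 3, 0, 0)':
# dist_w(t(Gamma_hat), matrix(c(2, 3, 0, 0), ncol = 1))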

Fig. 3 Inverse weighted deviations between the estimated and the true direction projector matrices, based on 2000 repetitions of Model A in the low and high dependence settings; neighbors of first order as well as of first and second order are considered. (The box plots show the distance by method: d known, P = 0.5, and P = 0.8.)

This effect is explained by the fact that the response in Model B also depends on second-order neighbors; hence the true second direction cannot be found. If the dependence in the field is large, the performance of SSIR worsens, especially in Model B, and in about half of the simulations in Model C directions are missed. The problem of missing directions in the high dependence case is linked to the fact that the λ_{c(k,l)}-values do not vanish quickly, and therefore the spatial lag selection is also challenging in that case. This is also observed in Fig. 6, as the number of chosen directions never exceeds the true number d. To illustrate this effect further, we present one case from the simulation study for each model and dependence setting. Figure 7 visualizes the latent fields and the three response fields in these six settings. Tables 1, 2, 3, 4, 5, and 6 show the standardized λ_{c(k,l)}-values when S consists of all neighbors of first order as well as of first and second order.

Fig. 4 Inverse weighted deviations between the estimated and the true direction projector matrices, based on 2000 repetitions of Model B in the low and high dependence settings; neighbors of first order as well as of first and second order are considered.

In the tables, gray cells highlight the largest λ_{c(k,l)}-values needed to exceed the threshold P = 0.8. The tables thus also show that in the low dependence setting not only is the number of directions chosen correctly, but the method also always indicates the correct lags to be used. In the high dependence setting, however, one direction often dominates and almost all spatial lags are considered informative; this effect is again in accordance with the findings depicted in Fig. 6.

Fig. 5 Inverse weighted deviations between the estimated and the true direction projector matrices, based on 2000 repetitions of Model C in the low and high dependence settings; neighbors of first order as well as of first and second order are considered.

6 Discussion

SIR is perhaps one of the most popular supervised dimension reduction methods. It is included, among other and more sophisticated methods, in the more general framework of sufficient dimension reduction (SDR), to which R.D. Cook contributed seminal work; see, for example, Adragni and Cook (2009), Cook (2007, 2018), Li (2018), and references therein for further details. Moreover, seminal work by R.D. Cook placed SDR in the larger framework of envelopes (see, e.g., Cook (2018) for details).

Fig. 6 Percentage of estimated directions d̂ for the simulations presented in Figs. 3, 4, and 5, for different values of P, when using first-order as well as first- and second-order neighbors.

SDR and envelope methods mainly focus on vector-valued iid data. Recently, Matilainen et al. (2017) combined ideas from temporal blind source separation, such as second-order blind identification (SOBI) (Belouchrani et al., 1997; Miettinen et al., 2016), with SIR principles and introduced time series SIR (TSIR), which we summarize in Sect. 3. TSIR is the starting point of our contribution, where we similarly combine principles from spatial blind source separation, as developed in Bachoc et al. (2020) and Nordhausen et al. (2015), with the SIR principle and introduce spatial SIR (SSIR) in Sect. 4. Based on our simulation study (Sect. 5), SSIR works well if the number of directions of interest is known. However, the identification of important spatial lags seems to depend heavily on the strength of the spatial dependence in the explanatory random field; higher spatial dependence worsens the ability to decide on the important spatial lags. Therefore, further research is needed to help decide on the number of directions and spatial lags to be considered. In future work, spatial bootstrap algorithms can be used to derive tests for the directions of importance, and studying the large-sample behavior of the estimators might also be useful. We formulated SSIR by restricting the spatial domain to be a two-dimensional regular grid. On the one hand, the dimensional restriction of the spatial domain can easily be relaxed to a higher-dimensional regular grid, to consider, for example, voxels rather than pixels. On the other hand, the regular grid restriction can be relaxed to the case of areal unit data where the exact distance between units is of minor importance.


Fig. 7 Illustration of the field z and the responses of the three different models for low dependence (upper chart) and high dependence (lower chart)

This is usually the case in spatial econometric applications, where the neighborhood definition still follows principles such as nearest or next-nearest neighbors. In contrast, if the spatial locations are irregularly spaced and the distances between locations are important, then the adaptation of SSIR is not straightforward. In this case, principles of spatial blind source separation (Bachoc et al., 2020; Nordhausen et al., 2015) could be applied by drawing rings of different radii around each location and averaging over them to define a neighborhood, assuming isotropic data. SSIR is meant to be a first approach to supervised dimension reduction for spatial regression. In further research we aim to follow R.D. Cook's ideas in order to extend and embed it in more general concepts such as spatial SDR and spatial envelopes, as mentioned above.


Table 1 Estimated dependencies λ̂_{c(k,l)} between y[i, j] and (u_c^⊤ x^{st})[i + k, j + l] for Model A, for low dependence (upper table) and high dependence (lower table); first-order neighbors are considered. Small deviations in the sums arise from rounding to four digits.

Low dependence:
k    l    u1⊤xst   u2⊤xst   u3⊤xst   u4⊤xst   Sum
1    0    0.9033   0.0001   0.0001   0.0001   0.9035
−1   0    0.0039   0.0001   0.0000   0.0000   0.0041
0    1    0.0134   0.0001   0.0001   0.0000   0.0136
0    −1   0.0129   0.0001   0.0001   0.0001   0.0131
1    1    0.0300   0.0000   0.0000   0.0000   0.0301
−1   1    0.0029   0.0002   0.0000   0.0000   0.0031
1    −1   0.0295   0.0001   0.0001   0.0000   0.0297
−1   −1   0.0025   0.0000   0.0001   0.0001   0.0028
Sum       0.9984   0.0007   0.0005   0.0004   1

High dependence:
k    l    u1⊤xst   u2⊤xst   u3⊤xst   u4⊤xst   Sum
1    0    0.4396   0.0006   0.0002   0.0001   0.4404
−1   0    0.0767   0.0016   0.0002   0.0000   0.0785
0    1    0.0790   0.0015   0.0002   0.0000   0.0808
0    −1   0.0782   0.0019   0.0002   0.0001   0.0804
1    1    0.0796   0.0015   0.0002   0.0001   0.0814
−1   1    0.0770   0.0015   0.0002   0.0001   0.0788
1    −1   0.0793   0.0018   0.0003   0.0000   0.0815
−1   −1   0.0760   0.0018   0.0002   0.0000   0.0781
Sum       0.9853   0.0124   0.0018   0.0005   1

Table 2 Estimated dependencies λˆ c(k,l) between y[i, j ] and (uc xst )(i+k,j +l) for Model A for low dependence (left table) and high dependence (right table). Considering first- and second-order neighbors. Small deviations in the sums arise from considering only four digits. u1 xst u2 xst u3 xst u4 xst 1 0 0.8366 0.0000 0.0001 0.0000 −1 0 0.0036 0.0001 0.0001 0.0000 0 1 0.0124 0.0000 0.0001 0.0001 0 −1 0.0119 0.0001 0.0001 0.0001 1 1 0.0278 0.0000 0.0000 0.0000 −1 1 0.0026 0.0000 0.0000 0.0002 1 −1 0.0273 0.0000 0.0001 0.0001 −1 −1 0.0024 0.0001 0.0001 0.0000 2 0 0.0262 0.0001 0.0000 0.0001 2 1 0.0121 0.0001 0.0000 0.0001 2 2 0.0025 0.0000 0.0001 0.0001 2 −1 0.0120 0.0001 0.0000 0.0000 2 −2 0.0022 0.0001 0.0001 0.0000 −2 0 0.0004 0.0001 0.0001 0.0000 −2 1 0.0004 0.0002 0.0001 0.0001 −2 2 0.0002 0.0001 0.0001 0.0001 −2 −1 0.0002 0.0001 0.0001 0.0000 −2 −2 0.0002 0.0000 0.0000 0.0001 1 2 0.0042 0.0001 0.0000 0.0001 −1 2 0.0006 0.0002 0.0000 0.0001 1 −2 0.0041 0.0001 0.0000 0.0000 −1 −2 0.0009 0.0000 0.0001 0.0000 0 2 0.0023 0.0001 0.0000 0.0001 0 −2 0.0023 0.0001 0.0001 0.0001 Sum 0.9953 0.0017 0.0015 0.0015 k

l

Sum 0.8368 0.0037 0.0126 0.0121 0.0279 0.0029 0.0274 0.0026 0.0264 0.0122 0.0027 0.0123 0.0024 0.0006 0.0007 0.0005 0.0004 0.0004 0.0044 0.0009 0.0042 0.0011 0.0024 0.0025 1

u1 xst u2 xst u3 xst u4 xst 1 0 0.1952 0.0011 0.0001 0.0000 −1 0 0.0346 0.0004 0.0001 0.0000 0 1 0.0357 0.0004 0.0001 0.0000 0 −1 0.0354 0.0005 0.0001 0.0000 1 1 0.0360 0.0004 0.0001 0.0000 −1 1 0.0348 0.0004 0.0001 0.0000 1 −1 0.0359 0.0005 0.0001 0.0000 −1 −1 0.0345 0.0005 0.0001 0.0000 2 0 0.0356 0.0004 0.0001 0.0000 2 1 0.0355 0.0004 0.0001 0.0000 2 2 0.0343 0.0005 0.0001 0.0000 2 −1 0.0354 0.0005 0.0001 0.0000 2 −2 0.0341 0.0005 0.0001 0.0000 −2 0 0.0330 0.0005 0.0001 0.0000 −2 1 0.0333 0.0005 0.0001 0.0000 −2 2 0.0325 0.0004 0.0001 0.0000 −2 −1 0.0324 0.0005 0.0001 0.0000 −2 −2 0.0325 0.0006 0.0001 0.0000 1 2 0.0350 0.0004 0.0001 0.0000 −1 2 0.0332 0.0004 0.0001 0.0001 1 −2 0.0349 0.0005 0.0001 0.0000 −1 −2 0.0337 0.0005 0.0001 0.0000 0 2 0.0342 0.0003 0.0001 0.0001 0 −2 0.0341 0.0006 0.0001 0.0000 Sum 0.9857 0.0115 0.0021 0.0007

k

l

Sum 0.1964 0.0352 0.0362 0.0360 0.0365 0.0353 0.0365 0.0351 0.0361 0.0360 0.0349 0.0360 0.0347 0.0336 0.0339 0.0330 0.0330 0.0332 0.0355 0.0337 0.0355 0.0343 0.0347 0.0348 1


Table 3 Estimated dependencies λ̂_{c(k,l)} between y[i, j] and (u_c^⊤ x^{st})[i + k, j + l] for Model B, for low dependence (upper table) and high dependence (lower table); first-order neighbors are considered. Small deviations in the sums arise from rounding to four digits.

Low dependence:
k    l    u1⊤xst   u2⊤xst   u3⊤xst   u4⊤xst   Sum
1    0    0.8687   0.0001   0.0003   0.0001   0.8692
−1   0    0.0049   0.0002   0.0002   0.0002   0.0055
0    1    0.0141   0.0002   0.0002   0.0001   0.0146
0    −1   0.0147   0.0035   0.0002   0.0001   0.0184
1    1    0.0281   0.0005   0.0001   0.0003   0.0289
−1   1    0.0033   0.0003   0.0005   0.0001   0.0041
1    −1   0.0322   0.0221   0.0001   0.0004   0.0547
−1   −1   0.0033   0.0006   0.0003   0.0003   0.0044
Sum       0.9694   0.0274   0.0017   0.0015   1

High dependence:
k    l    u1⊤xst   u2⊤xst   u3⊤xst   u4⊤xst   Sum
1    0    0.2363   0.0153   0.0003   0.0001   0.2520
−1   0    0.1016   0.0037   0.0003   0.0000   0.1055
0    1    0.1020   0.0036   0.0002   0.0000   0.1058
0    −1   0.1049   0.0039   0.0002   0.0000   0.1091
1    1    0.1014   0.0038   0.0003   0.0001   0.1056
−1   1    0.1001   0.0037   0.0003   0.0000   0.1041
1    −1   0.1070   0.0043   0.0002   0.0001   0.1116
−1   −1   0.1021   0.0038   0.0003   0.0001   0.1063
Sum       0.9554   0.0422   0.0021   0.0003   1

Table 4 Estimated dependencies λˆ c(k,l) between y[i, j ] and (uc xst )(i+k,j +l) for Model B for low dependence (left table) and high dependence (right table); first- and second-order neighbors are considered. Small deviations in the sums arise from considering only four digits u1 xst u2 xst u3 xst u4 xst 1 0 0.0026 0.2693 0.0000 0.0001 −1 0 0.0001 0.0015 0.0001 0.0000 0 1 0.0002 0.0043 0.0000 0.0000 0 −1 0.0016 0.0041 0.0000 0.0001 1 1 0.0004 0.0085 0.0001 0.0000 −1 1 0.0001 0.0010 0.0000 0.0001 1 −1 0.0085 0.0084 0.0001 0.0000 −1 −1 0.0002 0.0010 0.0001 0.0001 2 0 0.0031 0.0092 0.0001 0.0001 2 1 0.0007 0.0039 0.0001 0.0001 2 2 0.0002 0.0011 0.0001 0.0001 2 −1 0.0202 0.0039 0.0001 0.0001 2 −2 0.6141 0.0003 0.0001 0.0001 −2 0 0.0001 0.0003 0.0001 0.0000 −2 1 0.0001 0.0002 0.0000 0.0000 −2 2 0.0000 0.0001 0.0001 0.0001 −2 −1 0.0001 0.0002 0.0000 0.0001 −2 −2 0.0001 0.0002 0.0001 0.0001 1 2 0.0001 0.0015 0.0001 0.0001 −1 2 0.0000 0.0005 0.0000 0.0001 1 −2 0.0186 0.0014 0.0001 0.0000 −1 −2 0.0003 0.0003 0.0001 0.0000 0 2 0.0001 0.0009 0.0001 0.0001 0 −2 0.0021 0.0011 0.0001 0.0000 Sum 0.6737 0.3233 0.0016 0.0014 k

l

Sum 0.2720 0.0017 0.0046 0.0057 0.0091 0.0013 0.0171 0.0014 0.0125 0.0047 0.0014 0.0243 0.6146 0.0005 0.0004 0.0003 0.0004 0.0004 0.0018 0.0006 0.0202 0.0008 0.0011 0.0033 1

u1 xst u2 xst u3 xst u4 xst 1 0 0.0632 0.0200 0.0001 0.0001 −1 0 0.0342 0.0004 0.0002 0.0000 0 1 0.0345 0.0004 0.0002 0.0000 0 −1 0.0355 0.0004 0.0002 0.0000 1 1 0.0344 0.0003 0.0002 0.0001 −1 1 0.0338 0.0003 0.0002 0.0001 1 −1 0.0363 0.0003 0.0002 0.0000 −1 −1 0.0344 0.0004 0.0003 0.0000 2 0 0.0362 0.0004 0.0002 0.0001 2 1 0.0351 0.0003 0.0002 0.0001 2 2 0.0339 0.0003 0.0002 0.0001 2 −1 0.0369 0.0003 0.0002 0.0000 2 −2 0.1438 0.0052 0.0010 0.0001 −2 0 0.0331 0.0004 0.0003 0.0001 −2 1 0.0326 0.0003 0.0003 0.0001 −2 2 0.0312 0.0003 0.0003 0.0001 −2 −1 0.0327 0.0004 0.0003 0.0000 −2 −2 0.0331 0.0004 0.0003 0.0001 1 2 0.0337 0.0003 0.0002 0.0001 −1 2 0.0326 0.0003 0.0002 0.0001 1 −2 0.0361 0.0003 0.0002 0.0001 −1 −2 0.0342 0.0004 0.0003 0.0000 0 2 0.0331 0.0003 0.0002 0.0001 0 −2 0.0351 0.0004 0.0002 0.0000 Sum 0.9596 0.0329 0.0061 0.0015

k

l

Sum 0.0833 0.0349 0.0351 0.0361 0.0350 0.0344 0.0369 0.0352 0.0368 0.0356 0.0345 0.0374 0.1501 0.0338 0.0332 0.0318 0.0334 0.0338 0.0343 0.0333 0.0367 0.0349 0.0337 0.0357 1


Table 5 Estimated dependencies λ̂_{c(k,l)} between y[i, j] and (u_c^⊤ x^{st})[i + k, j + l] for Model C, for low dependence (upper table) and high dependence (lower table); first-order neighbors are considered. Small deviations in the sums arise from rounding to four digits.

Low dependence:
k    l    u1⊤xst   u2⊤xst   u3⊤xst   u4⊤xst   Sum
1    0    0.7106   0.1878   0.0001   0.0001   0.8986
−1   0    0.0041   0.0007   0.0001   0.0002   0.0051
0    1    0.0113   0.0030   0.0002   0.0001   0.0146
0    −1   0.0102   0.0038   0.0002   0.0002   0.0144
1    1    0.0229   0.0064   0.0001   0.0001   0.0296
−1   1    0.0027   0.0004   0.0001   0.0001   0.0032
1    −1   0.0241   0.0070   0.0002   0.0001   0.0315
−1   −1   0.0023   0.0005   0.0002   0.0001   0.0030
Sum       0.7883   0.2095   0.0012   0.0010   1

High dependence:
k    l    u1⊤xst   u2⊤xst   u3⊤xst   u4⊤xst   Sum
1    0    0.3613   0.0796   0.0003   0.0002   0.4413
−1   0    0.0636   0.0143   0.0009   0.0003   0.0791
0    1    0.0642   0.0156   0.0007   0.0003   0.0808
0    −1   0.0633   0.0157   0.0009   0.0002   0.0801
1    1    0.0638   0.0148   0.0007   0.0003   0.0795
−1   1    0.0630   0.0142   0.0009   0.0003   0.0784
1    −1   0.0640   0.0163   0.0012   0.0003   0.0818
−1   −1   0.0628   0.0149   0.0009   0.0003   0.0789
Sum       0.8060   0.1853   0.0064   0.0023   1

Table 6 Estimated dependencies λˆ c(k,l) between y[i, j ] and (uc xst )(i+k,j +l) for Model C for low dependence (left table) and high dependence (right table); first- and second-order neighbors are considered. Small deviations in the sums arise from considering only four digits u1 xst u2 xst u3 xst u4 xst 1 0 0.6469 0.1711 0.0001 0.0001 −1 0 0.0038 0.0006 0.0001 0.0001 0 1 0.0104 0.0027 0.0002 0.0001 0 −1 0.0093 0.0034 0.0001 0.0001 1 1 0.0209 0.0058 0.0001 0.0002 −1 1 0.0024 0.0003 0.0001 0.0001 1 −1 0.0222 0.0064 0.0002 0.0001 −1 −1 0.0021 0.0004 0.0002 0.0000 2 0 0.0235 0.0059 0.0001 0.0004 2 1 0.0100 0.0023 0.0001 0.0002 2 2 0.0027 0.0007 0.0002 0.0001 2 −1 0.0119 0.0021 0.0002 0.0001 2 −2 0.0024 0.0010 0.0002 0.0001 −2 0 0.0006 0.0001 0.0002 0.0001 −2 1 0.0006 0.0001 0.0001 0.0001 −2 2 0.0002 0.0001 0.0001 0.0002 −2 −1 0.0004 0.0002 0.0001 0.0002 −2 −2 0.0005 0.0001 0.0002 0.0002 1 2 0.0039 0.0011 0.0002 0.0001 −1 2 0.0005 0.0001 0.0004 0.0002 1 −2 0.0045 0.0009 0.0001 0.0001 −1 −2 0.0009 0.0005 0.0002 0.0001 0 2 0.0024 0.0007 0.0004 0.0000 0 −2 0.0022 0.0011 0.0001 0.0002 Sum 0.7852 0.2078 0.0038 0.0032 k

l

Sum 0.8182 0.0046 0.0133 0.0130 0.0270 0.0029 0.0289 0.0027 0.0299 0.0126 0.0037 0.0143 0.0037 0.0010 0.0009 0.0005 0.0008 0.0010 0.0054 0.0011 0.0056 0.0018 0.0035 0.0036 1

u1 xst u2 xst u3 xst u4 xst 1 0 0.1604 0.0352 0.0005 0.0002 −1 0 0.0286 0.0063 0.0002 0.0001 0 1 0.0288 0.0069 0.0002 0.0001 0 −1 0.0284 0.0069 0.0002 0.0001 1 1 0.0287 0.0065 0.0002 0.0001 −1 1 0.0282 0.0063 0.0002 0.0001 1 −1 0.0288 0.0072 0.0003 0.0001 −1 −1 0.0282 0.0066 0.0003 0.0001 2 0 0.0293 0.0071 0.0003 0.0001 2 1 0.0286 0.0068 0.0002 0.0000 2 2 0.0283 0.0070 0.0003 0.0001 2 −1 0.0295 0.0069 0.0003 0.0000 2 −2 0.0277 0.0068 0.0002 0.0001 −2 0 0.0268 0.0064 0.0003 0.0001 −2 1 0.0270 0.0061 0.0002 0.0002 −2 2 0.0259 0.0062 0.0002 0.0001 −2 −1 0.0265 0.0064 0.0003 0.0001 −2 −2 0.0272 0.0063 0.0004 0.0001 1 2 0.0282 0.0071 0.0002 0.0001 −1 2 0.0266 0.0064 0.0001 0.0001 1 −2 0.0285 0.0068 0.0002 0.0001 −1 −2 0.0273 0.0068 0.0003 0.0001 0 2 0.0281 0.0061 0.0002 0.0001 0 −2 0.0282 0.0071 0.0002 0.0000 Sum 0.8037 0.1882 0.0061 0.0021

k

l

Sum 0.1963 0.0353 0.0360 0.0356 0.0355 0.0349 0.0365 0.0352 0.0367 0.0357 0.0357 0.0368 0.0348 0.0335 0.0335 0.0324 0.0332 0.0339 0.0356 0.0331 0.0355 0.0345 0.0345 0.0355 1


Acknowledgments The work of CM and KN was supported by the Austrian Science Fund (FWF) Grant number P31881-N32, and we are grateful for the comments from the referees and from the editors which helped to improve the paper.

References

P. Adragni, R.D. Cook, Sufficient dimension reduction and prediction in regression. Philos. Trans. R. Soc. A 367, 4385–4405 (2009)
M.M.R. Affossogbe, G. Martial Nkiet, C. Ogouyandjou, Dimension reduction in spatial regression with kernel SAVE method (2019). arXiv:1909.09996
F. Bachoc, M.G. Genton, K. Nordhausen, A. Ruiz-Gazen, J. Virta, Spatial blind source separation. Biometrika 107, 627–646 (2020)
C. Becker, R. Fried, Sliced inverse regression for high-dimensional time series, in Exploratory Data Analysis in Empirical Research, ed. by M. Schwaiger, O. Opitz. Studies in Classification, Data Analysis, and Knowledge Organization (Springer, Berlin, Heidelberg, 2003), pp. 3–11
A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, E. Moulines, A blind source separation technique based on second order statistics. IEEE Trans. Signal Process. 45, 434–444 (1997)
E. Bura, R.D. Cook, Extending sliced inverse regression: the weighted chi-squared test. J. Am. Stat. Assoc. 96, 996–1003 (2001)
J.-F. Cardoso, A. Souloumiac, Jacobi angles for simultaneous diagonalization. SIAM J. Matrix Anal. Appl. 17, 161–164 (1996)
R.D. Cook, SAVE: a method for dimension reduction and graphics in regression. Commun. Stat. Theory Methods 29, 2109–2121 (2000)
R.D. Cook, Fisher lecture: dimension reduction in regression. Stat. Sci. 22, 1–26 (2007)
R.D. Cook, An Introduction to Envelopes: Dimension Reduction for Efficient Estimation in Multivariate Statistics (Wiley, Hoboken, NJ, 2018)
R.D. Cook, Principal components, sufficient dimension reduction, and envelopes. Annu. Rev. Stat. Appl. 5, 533–559 (2018)
R.D. Cook, S. Weisberg, Sliced inverse regression for dimension reduction: comment. J. Am. Stat. Assoc. 86, 328–332 (1991)
Y. Guan, H. Wang, Sufficient dimension reduction for spatial point processes directed by Gaussian random fields. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 72, 367–387 (2010)
R.J. Hijmans, raster: Geographic Data Analysis and Modeling (2019). R package version 2.8-19
K. Illner, J. Miettinen, C. Fuchs, S. Taskinen, K. Nordhausen, H. Oja, F.J. Theis, Model selection using limiting distributions of second-order blind source separation algorithms. Signal Process. 113, 95–103 (2015)
H. Kelejian, G. Piras, Spatial Econometrics (Academic, New York, 2017)
J. LeSage, R.K. Pace, Introduction to Spatial Econometrics. Statistics: A Series of Textbooks and Monographs (Chapman & Hall/CRC, Boca Raton, FL, 2009)
K.-C. Li, Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 86, 316–327 (1991)
B. Li, Sufficient Dimension Reduction: Methods and Applications with R (CRC Press, Boca Raton, 2018)
E. Liski, K. Nordhausen, H. Oja, A. Ruiz-Gazen, Combining linear dimension reduction subspaces, in Recent Advances in Robust Statistics: Theory and Applications, ed. by C. Agostinelli, A. Basu, P. Filzmoser, D. Mukherjee (Springer, New Delhi, 2016), pp. 131–149
E. Liski, K. Nordhausen, H. Oja, A. Ruiz-Gazen, LDRTools: Tools for Linear Dimension Reduction (2018). R package version 0.2-1
J.-M. Loubes, A.-F. Yao, Kernel inverse regression for spatial random fields. Int. J. Appl. Math. Stat. 32, 1–26 (2013)
W. Luo, B. Li, Combining eigenvalues and variation of eigenvectors for order determination. Biometrika 103, 875–887 (2016)


Y. Ma, L. Zhu, A review on dimension reduction. Int. Stat. Rev. 81, 134–150 (2013)
M. Matilainen, C. Croux, K. Nordhausen, H. Oja, Supervised dimension reduction for multivariate time series. Econ. Stat. 4, 57–69 (2017)
M. Matilainen, C. Croux, K. Nordhausen, H. Oja, Sliced average variance estimation for multivariate time series. Statistics 53, 630–655 (2019)
J. Miettinen, K. Illner, K. Nordhausen, H. Oja, S. Taskinen, F. Theis, Separation of uncorrelated stationary time series using autocovariance matrices. J. Time Ser. Anal. 37(3), 337–354 (2016)
J. Miettinen, K. Nordhausen, S. Taskinen, Blind source separation based on joint diagonalization in R: the packages JADE and BSSasymp. J. Stat. Softw. 76, 1–31 (2017)
K. Nordhausen, H. Oja, P. Filzmoser, C. Reimann, Blind source separation for spatial compositional data. Math. Geosci. 47, 753–770 (2015)
K. Nordhausen, H. Oja, D.E. Tyler, Asymptotic and bootstrap tests for subspace dimension (2016). arXiv:1611.04908
R Core Team, R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, 2018)
M. Schlather, A. Malinowski, P.J. Menck, M. Oesting, K. Strokorb, Analysis, simulation and prediction of multivariate random fields with package RandomFields. J. Stat. Softw. 63, 1–25 (2015)
M. van Lieshout, Theory of Spatial Statistics (Chapman & Hall/CRC, New York, 2019)

Model-Based Inverse Regression and Its Applications
Tao Wang and Lixing Zhu

1 Introduction

Consider the regression of a univariate response Y ∈ R on a p-dimensional predictor vector X = (X_1, ..., X_p)^⊤ ∈ R^p. In full generality, dimension reduction in regression aims to seek a function R : R^p → R^d, d ≤ p, such that the conditional distribution of Y | X depends on X only through R(X). More formally, if Y and X are independent given R(X), then R(X) is called a sufficient reduction for the regression of Y onto X. Many methods for dimension reduction in regression restrict attention to linear reductions: R(X) = η^⊤ X for some p × d matrix η. The column space of η is called a dimension-reduction subspace. The inferential target is then the central subspace S_{Y|X}, defined as the intersection of all dimension-reduction subspaces (Cook, 1998).

Tao Wang was supported in part by the National Natural Science Foundation of China (11971017), the National Key R&D Program of China (2018YFC0910500), the Shanghai Municipal Science and Technology Major Project (2017SHZDZX01), SJTU Trans-med Awards Research Young Faculty Grants (YG2019QNA26, YG2019QNA37), and Neil Shen's SJTU Medical Research Fund. Lixing Zhu was supported by a grant from the University Grants Council of Hong Kong.

T. Wang
Department of Statistics, Shanghai Jiao Tong University, Shanghai, China
SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China

L. Zhu
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
Center for Statistics and Data Science, Beijing Normal University at Zhuhai, Guangdong, China
e-mail: [email protected]

© Springer Nature Switzerland AG 2021
E. Bura, B. Li (eds.), Festschrift in Honor of R. Dennis Cook, https://doi.org/10.1007/978-3-030-69009-0_6


Classical methods for estimating SY |X include sliced inverse regression (Li, 1991), sliced average variance estimation (Cook and Weisberg, 1991), principal Hessian directions (Li, 1992), parametric inverse regression (Bura and Cook, 2001), minimum average variance estimation (Xia et al., 2002), minimum discrepancy estimation (Cook and Ni, 2005), contour regression (Li et al., 2005), directional regression (Li and Wang, 2007), likelihood acquired directions (Cook and Forzani, 2009), discretization-expectation estimation (Zhu et al., 2010), and so on. Recent novel methods include semiparametric estimation (Ma and Zhu, 2012), kernel dimension reduction (Lee et al., 2013), transformation-based dimension reduction (Wang and Zhu, 2018; Wang et al., 2014), and dimension reduction for censored outcomes (Sun et al., 2019). See Ma and Zhu (2013) and Li (2018a) for an excellent review.

1.1 Model-Based Inverse Reduction

Assuming that Y and X are jointly distributed, Cook (2007) introduces three equivalent paradigms for dimension reduction in regression:

(i) Forward reduction, Y | X ∼ Y | R(X),
(ii) Inverse reduction, X | {Y, R(X)} ∼ X | R(X),
(iii) Joint reduction, Y ⊥⊥ X | R(X).

Here, ∼ stands for identically distributed, and ⊥⊥ indicates independence. Without requiring a pre-specified model for Y | X, inverse reduction is promising in regressions with many predictors. There are two basic contexts for determining a sufficient reduction inversely: inverse moment-based reduction and inverse model-based reduction (Adragni and Cook, 2009). Most inverse regression methods are developed in the first context by exploiting properties of the conditional moments of X | Y. Two prominent examples are sliced inverse regression and sliced average variance estimation. However, nearly all of these methods require the predictors to be continuous. In the second context, a model is specified for the inverse regression of X onto Y. For example, Cook and Forzani (2008) propose principal fitted components, and Cook and Forzani (2009) introduce likelihood-based dimension reduction, both of which are based on normal models for the conditional distribution of X | Y. When some predictors are discrete, methods that are free of the Gaussian assumption are in high demand. As an important first step, Cook and Li (2009) propose a method for inverse reduction based on a conditional independence model. Their method allows for all-discrete or mixed types of predictors. However, the conditional independence assumption is restrictive when the predictors are conditionally dependent given the response. Using the multinomial distribution, Taddy (2013) develops an inverse regression method for count data that arise naturally in text analysis. The main drawback of this method is that the multinomial model is a trivial generalization of the Poisson independence model.

Model-Based Inverse Regression and Its Applications

111

More recently, Bura et al. (2016) propose a general framework for inverse reduction in regressions with exponential family predictors, which allows for multivariate nontrivial distributions and includes regressions where the predictors are mixtures of categorical and continuous variables. The price paid for this flexibility is that the sufficient reductions are not necessarily linear, but they are exhaustive. See also Forzani et al. (2018), Wang et al. (2019), and Wang (2020).

1.2 Sufficient Reduction in Applications

Sufficient reduction has a long history of successful applications and is still an active research area (Cook, 2018; Li, 2018b). For example, Naik et al. (2000) describe how to use sliced inverse regression in marketing studies, and Roley and Newman (2008) apply inverse regression estimation as a preprocessor for predicting Eurasian watermilfoil invasions in Minnesota. There have also been many applications in bioinformatics (Antoniadis et al., 2003; Bura and Pfeiffer, 2003; Chiaromonte and Martinelli, 2002; Li, 2006). In particular, Bura and Pfeiffer (2003) apply sliced average variance estimation, and Antoniadis et al. (2003) apply minimum average variance estimation, both to gene expression data for tumor classification. Zhong et al. (2005) propose regularized sliced inverse regression for identifying transcription factor binding motifs, and Li and Yin (2008) employ a similar strategy to model and predict survival time given gene expression profiles. More recently, Taddy (2013, 2015) applies multinomial inverse regression for text analysis, Wang et al. (2019) and Wang (2020) propose model-based inverse regression for predictive learning in metagenomics, and Tomassi et al. (2019) develop likelihood-based sufficient reduction methods for compositional data. The rest of the paper is organized as follows. In Sects. 2.1–2.3, we review model-based inverse reduction for multivariate count data in different contexts, using the multinomial distribution and its generalizations. Section 3 takes a different perspective on model-based inverse reduction: sufficient reduction is achieved in the dual sample-based space rather than in the primal predictor-based space. In Sect. 4, we consider an application of inverse modeling to testing the independence between the microbial community composition and a continuous or many-valued outcome; an adaptive test is presented based on a dynamic slicing technique. Section 5 briefly reviews R. Dennis Cook's contributions on model-based sufficient reduction in regression.

2 Inverse Reduction for Multivariate Count Data

Suppose X = (X_1, ..., X_p)^⊤ is a random vector of counts and Y is a scalar random variable. For example, Y could be a review rating, with X representing the word counts over a dictionary for the review document. The classical distribution for characterizing X is the multinomial distribution with probability mass function

\[
f_{\mathrm{MN}}(x; \pi) = \frac{\bigl(\sum_{j=1}^{p} x_j\bigr)!}{x_1!\, x_2! \cdots x_p!} \prod_{j=1}^{p} \pi_j^{x_j},
\tag{1}
\]

where x = (x_1, ..., x_p)^⊤ is an instance of X and π = (π_1, ..., π_p)^⊤ is a vector of p probabilities with Σ_{j=1}^{p} π_j = 1. Note that inferences under this model condition on the total count.

2.1 Multinomial Inverse Regression in Text Analysis

To link phrase counts X with document annotations Y in text mining, Taddy (2013) follows the inverse modeling approach of Cook (2007) and Cook and Li (2009) and proposes multinomial inverse regression (MNIR) by specifying π_j as
\[
\pi_j = \frac{\exp(\alpha_j + \gamma_j^\top h)}{\sum_{k=1}^{p} \exp(\alpha_k + \gamma_k^\top h)},
\tag{2}
\]

where α_j ∈ R, γ_j ∈ R^G, and h = (h_1, ..., h_G)^⊤ is a known vector of functions of Y = y. Many possibilities for h are available in the literature. For example, one might use polynomial functions, h_g = y^g. Alternatively, h_g could be constructed by partitioning the range of Y into G bins or slices and defining h_g as the gth indicator function. This slicing construction for h is simple and widely used (Adragni and Cook, 2009). MNIR provides not only a tool for studying the relationship between counts and an outcome but also a basis for supervised dimension reduction. Specifically, Taddy (2013) shows that, under (1) and (2), Y ⊥⊥ X | (Φ^⊤ X, Σ_{j=1}^{p} X_j), where Φ = (γ_1, ..., γ_p)^⊤ ∈ R^{p×G}. As shown in the following proposition, MNIR can have a computational advantage in some settings. A proof can be found in Wang (2020) and is omitted here.

Proposition 2.1 Suppose that {x_{ij}, i = 1, ..., n, j = 1, ..., p} are independent observations of Poisson random variables with intensity parameters
\[
\lambda_{ij} = \exp\Bigl\{ c_i + \sum_{g=1}^{G} \alpha_{jg}\, I(y_i = g) \Bigr\}.
\]

If we set c_i = log(Σ_{j=1}^{p} x_{ij}), then the estimates that maximize the log-likelihood of the Poisson independence model coincide with the maximum likelihood estimates of the corresponding multinomial model with
\[
\pi_{ij} = \frac{\lambda_{ij}}{\sum_{k=1}^{p} \lambda_{ik}}
= \frac{\exp\bigl\{\sum_{g=1}^{G} \alpha_{jg}\, I(y_i = g)\bigr\}}{\sum_{k=1}^{p} \exp\bigl\{\sum_{g=1}^{G} \alpha_{kg}\, I(y_i = g)\bigr\}}.
\]

In the context of MNIR, the slicing construction of h enables distributed estimation of the model parameters in large-scale multinomial regression (Taddy, 2015). On the other hand, MNIR inherits a weakness of the multinomial model. Suppose that the X_j are independent Poisson random variables. It is well known that the joint distribution of X = (X_1, ..., X_p)^⊤ then factors into the product of two distributions: a Poisson distribution for the total Σ_{j=1}^{p} X_j and a multinomial distribution for X given the total. In other words, the multinomial distribution is almost identical to the Poisson independence model, and so the conditional dependence among counts permitted by MNIR is very restrictive.
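Proposition 2.1 is what licenses fitting the model column by column. A minimal base-R sketch of this idea, one Poisson regression per count variable with the log total as an offset and sliced-response indicators as covariates, is given below; the helper name mnir_columnwise and the use of unpenalized glm() are our own simplifications of the penalized, distributed estimation in Taddy (2013, 2015).

# Illustrative column-wise Poisson fit in the spirit of Proposition 2.1
# (a simplification, not Taddy's MNIR implementation).
# x: n x p matrix of counts, y_sl: factor of slice labels (length n).
mnir_columnwise <- function(x, y_sl) {
  total <- rowSums(x)
  stopifnot(all(total > 0))   # the log-total offset requires nonzero totals
  lapply(seq_len(ncol(x)), function(j) {
    glm(x[, j] ~ 0 + y_sl + offset(log(total)), family = poisson())
  })
}

# The fitted slice coefficients play the role of the alpha_{jg}; converting
# them to multinomial probabilities is exactly the normalization displayed in
# the proposition.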

2.2 Predictive Learning in Metagenomics via Inverse Regression

Count data, such as sequencing reads, are often over-dispersed. The standard way of modeling over-dispersion is to assume that π in the multinomial distribution is random with some prior distribution. Denote by S^{p−1} the (p − 1)-dimensional simplex and note that π ∈ S^{p−1}. One popular prior for π is the additive logistic normal distribution (Xia et al., 2013). Define the transformation of z = (z_1, ..., z_p) ∈ S^{p−1} to R^{p−1} as
\[
\phi(z) = \Bigl( \log\frac{z_1}{z_p}, \ldots, \log\frac{z_{p-1}}{z_p} \Bigr).
\]
This transformation, called the additive log-ratio transformation (Aitchison, 1986), is a bijection. The additive logistic normal distribution assumes that W = φ(π) has a multivariate normal distribution. The resulting distribution for the counts, which combines this prior with the multinomial distribution, is called the additive logistic normal multinomial distribution (Billheimer et al., 2001). To explore the relationship between microbial counts and a phenotype of interest, Wang et al. (2019) exploit the idea of Cook (2007) and Cook and Li (2009) by assuming
\[
\log\Bigl(\frac{\pi_j}{\pi_p}\Bigr) = a_j + \gamma_j^\top \beta h, \quad j = 1, \ldots, p - 1,
\tag{3}
\]

where a_j ∈ R, γ_j ∈ R^d, and β ∈ R^{d×G} has rank d ≤ min(p, G). Again, h ∈ R^G is a known vector-valued function of Y = y. Let a = (a_1, ..., a_{p−1})^⊤ ∈ R^{p−1}. To account for over-dispersion, a is assumed to be a realization of A = (A_1, ..., A_{p−1})^⊤ ∼ N(μ, Σ), and A is independent of Y. Let ξ = A − μ and Γ = (γ_1, ..., γ_{p−1})^⊤ ∈ R^{(p−1)×d}. Then
\[
W = \mu + \Gamma \beta h + \xi.
\tag{4}
\]

Clearly, under (4), Y ⊥⊥ W | Γ^⊤ Σ^{−1} W. However, unlike in the standard framework of sufficient reduction, W is unobservable here. To predict a future observation of Y associated with a newly observed vector X, one can use the forward regression mean function E(Y | X). It is easy to see that
\[
E(Y \mid X) = E\{ E(Y \mid W) \mid X \}.
\tag{5}
\]

Wang et al. (2019) construct a prediction rule by utilizing this observation. In essence the method consists of two parts: estimation of E(Y | X) when E(Y | W ) is known and estimation of E(Y | W ). A Monte Carlo expectation-maximization algorithm is proposed for maximum likelihood estimation and prediction learning. As outlined in Wang et al. (2019), there are several advantages to the approach of reversing the roles of the outcome and the counts. First, the inverse regression model for X | Y deals directly with over-dispersion. Furthermore, it can be inverted to provide a prediction rule without specifying a forward model for Y | X or a joint model for (X, Y ). Second, using raw counts in microbiome analysis is problematic, especially when the sequence depth varies greatly between samples. To address this issue, Eq. (3) directly relates the response to the unobserved proportions. It also provides a seamless way of handling zero counts. Rather than taking the logarithm of absolute or relative counts, it operates on estimated proportions which are almost surely nonzero. While the multinomial distribution seems a natural distribution for multivariate count data, it is almost an independent Poisson model. This partly justifies why MNIR can be applied on a very large scale. The above methodology relaxes the independence assumption, allowing for a more flexible dependence structure among counts. However, it pays the price of heavy computation. Unlike MNIR, it is not scalable for large p.
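For reference, the additive log-ratio transformation φ used in this construction and its inverse can be written in a few lines of R; the helper names alr and alr_inv are our own.

# Additive log-ratio transformation and its inverse (illustrative helpers).
alr <- function(z) log(z[-length(z)] / z[length(z)])   # simplex -> R^{p-1}
alr_inv <- function(w) {                               # R^{p-1} -> simplex
  e <- c(exp(w), 1)
  e / sum(e)
}

# Example: alr_inv(alr(c(0.2, 0.3, 0.5))) recovers the original composition.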

2.3 Poisson Graphical Inverse Regression

To relax the independence assumption of the multinomial distribution, while at the same time preserving the parallel computing framework, Wang (2020) proposes a new inverse regression approach based on a Poisson graphical model and applies it to a gut microbiome data set. Given an undirected graph G with vertex set V and edge set E ⊆ V × V, an undirected graphical model is a way of expressing the conditional dependence structure between a set of random variables indexed by the vertices in V (Lauritzen, 1996).

An important special case is a pairwise graphical model. For discrete random variables, the probability mass function over a pairwise Markov graph can be represented as
\[
f(x) = \exp\Bigl\{ \sum_{j \in V} \psi_j(x_j) + \sum_{(j,k) \in E} \psi_{jk}(x_j, x_k) - D \Bigr\},
\]

⎧ ⎨ ⎩

θj xj +

j ∈V



θj k xj xk −

(j,k)∈E

 j ∈V

⎫ ⎬

log(xj !) − D(θ, ) . (6) ⎭

Here θ = {θj , j ∈ V},  = {θj k , (j, k) ∈ E}, and D(θ, ) is the log normalization constant. Let Nj be the set of neighbors of node j according to the graph G. One attractive property of model (6) is that each node-conditional  distribution is a univariate Poisson distribution, with log intensity parameter θj + k∈Nj θj k xk . Furthermore, it can be shown that this model encapsulates the independent Poisson model of Cook and Li (2009) and the MNIR model of Taddy (2013) as special cases. To connect a vector of counts X and an outcome Y , Wang (2020) adopts the inverse modeling approach by considering the Poisson graphical inverse regression model ⎧ ⎫ ⎨ ⎬   fP oG (x; y, G, , ) ∝ exp θj (y)xj + θj k xj xk − log(xj !) , (7) ⎩ ⎭ j ∈V

(j,k)∈E

j ∈V

where G = (V, E) is an undirected graph over the set of nodes V = {1, . . . , p} corresponding to the p variables X1 , . . . , Xp , θj (y) = γ " j h, and  = {θj k , (j, k) ∈ E}. Again, h = (h1 , . . . , hG )" is a pre-specified vector of functions of Y = y, " p×G . It is easily verified γ j = (γj 1 , . . . , γj G )" ∈ RG , and = (γ " 1 ,...,γp) ∈ R that, under (7), Y ⊥⊥X | " X. In practice we do not know which edges to omit from the graph G and so would like to discover this from the data. To recover the graph structure, as well as to estimate the parameters, Wang (2020) proposes a neighborhood selection approach (Meinshausen and Bühlmann, 2006), using the fact that each node-conditional  distribution is univariate Poisson, with log intensity parameter ηj = G g=1 γjg hg +  k=j θj k xk . Given n independent and identically distributed samples {x i , yi } from the joint distribution of (X, Y ), Wang (2020) proposes to maximize penalized conditional log-likelihoods node by node:

116

T. Wang and L. Zhu

⎡ max ⎣

γ j ,j

n 

log{fP o (xij ; xik , k = j, γ j , j )} − λ



⎤ |θj k |⎦ ,

(8)

k=j

i=1

where λ is a tuning parameter to be determined. The optimization problem (8) can be implemented efficiently on a parallel computer.

3 Inverse Reduction and Its Dual We note that this section covers a more general setting than multivariate count data as in Sect. 2. Linear reduction methods aim to construct a few linear combinations of the original predictors that are useful for subsequent analyses. So far nearly all methods estimate a subspace in the primal predictor-based space and then obtain the set of reduced predictors by projecting the original predictor vector onto this subspace (Cook et al., 2012; Wang et al., 2018). Motivated by the well-known duality between principal component analysis and principal coordinate analysis (Gower, 1966), Zhang et al. (2012) calculate the projection coordinates by applying principal coordinate analysis to slice means and then interpolate the projection of a new predictor vector using these coordinates. This method bypasses estimating the directions and can be thought of as a dual version of sliced inverse regression, despite the lack of theoretical support. More recently, Wang and Xu (2019) propose a principled reduction method in the dual sample-based space, on the basis of a supervised inverse regression model. Let Y denote the sample space of Y and SE(X|Y ) = span{E(X | Y = y) − E(X), y ∈ Y} the subspace spanned by the centered inverse regression curves. To motivate the inverse regression model, Wang and Xu (2019) introduce the following proposition. Proposition 3.1 Assume (C1) SE(X|Y ) ⊆ Var(X)SY |X and (C2) Var(X | Y ) is positive definite and is nonrandom; then Var(X | Y )SY |X = Var(X)SY |X . Condition (C1), known as the linearity condition, holds if E(X | η" X) is a linear function of η" X, where η is a basis matrix for SY |X . A slightly stronger condition is (C1 ) SE(X|Y ) = Var(X)SY |X . Condition (C2) is related to the covariance condition that (C3) Var(X | η" X) is constant (Cook et al., 2012). The distinction between (C2) and (C3) sheds light on the difference between model-based and momentbased inverse regression methods. Let  = Var(X | Y ). By Proposition 3.1, conditions (C1 ) and (C2) imply that SE(X|Y ) = SY |X . Consequently, SY |X = span(−1 ), where ∈ Rp×d is a basis matrix for SE(X|Y ) . Let Xy denote a random vector distributed as X | (Y = y). The above argument leads naturally to the inverse regression model

Model-Based Inverse Regression and Its Applications

Xy = μ + v y + ,

117

(9)

where μ = (μ1 , . . . , μp )" ∈ Rp , v y ∈ Rd is an unknown vector-valued function of y,  ∈ Rp is random with mean vector zero and covariance matrix , and  is independent of Y . Since is not identifiable, assume that −1/2 is a p × d matrix with orthonormal columns. This model is the same as the PC regression model (Cook, 2007), except that the latter assumes a normal error vector.

3.1 Reduction via Principal Coordinate Analysis Without loss of generality, assume for the moment that  = Ip , the p × p identity matrix. This implies that is a semi-orthogonal matrix and SY |X = span( ). Suppose that the data consist of n independent observations, x y1 , . . . , x yn . For two observations indexed by y and y  , define dyy  = μy − μy  22 . Then " " dyy  =  v y − v y  22 = v " y v y − 2v y v y  + v y  v y  .

Let D = (dyy  ) ∈ Rn×n and V = (v y1 , . . . , v yn )" ∈ Rn×d . Denote by 1n the nvector of ones and Pn = In − n−1 1n 1" n the centering matrix. A simple calculation shows that 1 VV" = − Pn DPn . 2

(10)

This is principal coordinate analysis, or classical multidimensional scaling, at the population level. One can view v y as the coordinate vector of μy with respect to the basis . At the sample level, one can replace dyy  by an unbiased estimate dˆyy  = x y − ˆ = (dˆyy  ) and X = (x y , . . . , x yn )" ∈ Rn×p . Then, x y  22 − 2p. Let D 1 1 ˆ " VV" ≈ − Pn DP n = Pn XX Pn . 2

(11)

 Write the eigen-decomposition of Pn XX" Pn as Pn XX" Pn = ni=1 λi α i α " i , where λ1 ≥ · · · ≥ λn ≥ 0 are the eigenvalues and α 1 , . . . , α n are the corresponding ˜ = eigenvectors. By the Eckart–Young theorem, a solution for V is given by V 1/2 1/2 " ˜ (λ1 α 1 , . . . , λd α d ). Write V = (˜v y1 , . . . , v˜ yn ) . Similarly, v˜ y represents the vector of coordinates of x y with respect to the basis . Instead of estimating the directions in the primal predictor-based space, dimension reduction is achieved in the dual sample-based space. This mimics the duality between principal component analysis and principal coordinate analysis. To conduct dimension reduction in the original predictor space, one can estimate ˜ " 2 , where  · F by minimizing the residual sum-of-squares Pn X − V F

118

T. Wang and L. Zhu

˜ V ˜ " V) ˜ −1 . Write denotes the Frobenius matrix norm. The minimizer is ˜ = X" V( ˜ = ( ˜ 1 , . . . , ˜ d ). A simple calculation shows that ˜ j equals the j th principal component direction of Pn X, and thus the first d principal component score vectors of Pn X produce a sufficient reduction. In other words, the above method coincides with the method of maximum likelihood of Cook (2007) under the PC regression model. To investigate the relationship between the response and the vector of coordinates, it remains to get the coordinates of a new observation, x y ∗ , y ∗ ∈ Y. This can be done by the classical method of adding a point to vector diagrams (Gower, 1968; Zhang et al., 2012). For each i ∈ {1, . . . , n}, define s˜i = ˜v yi 22 − x y ∗ − x yi 22 . Let s˜ = (˜s1 , . . . , s˜n )" ∈ Rn . The predicted coordinate v˜ y ∗ of x y ∗ is then v˜ y ∗ =

1 ˜ " ˜ −1 ˜ " (V V) V s˜ . 2

(12)

In the classical sufficient dimension reduction, one is interested mainly in the matrix or the subspace SY |X spanned by it. The above procedure operates in the space of coordinates of x y with respect to the orthonormal basis . It achieves dimension reduction while at the same time avoiding the estimation of .

3.2 A Supervised Inverse Regression Model In many applications the response is expected to play an important role in supervising our reduction. However, aside from the subscript y, nothing on the right-hand side of (9) is observable. To facilitate supervised reduction, Wang and Xu (2019) propose to specify the coordinate vectors as v y = βhy , where β ∈ Rd×G has rank d ≤ min(G, p) and hy ∈ RG is a known vector-valued function of y. This parameterization has been widely used in inverse reduction; see Bura and Cook (2001), Cook and Forzani (2008), Cook et al. (2012), Wang and Zhu (2013), Wang et al. (2019), and Wang (2020). Replacing v y in model (9) by βhy leads to the following model: Xy = μ + βhy + .

(13)

The process of dimension reduction based on principal coordinate analysis is essentially the same as before. See Wang and Xu (2019) for details on the methodology and its theoretical and numerical behavior. The above methodology is related to a nonparametric multivariate analysis procedure in ecological studies (Mcardle and Anderson, 2001). This procedure, known as permutation multivariate analysis of variance, partitions the variability

Model-Based Inverse Regression and Its Applications

119

in multivariate ecological data according to factors in an experimental design. The underlying intuition is the duality between X" X, an inner product matrix in the primal space, and XX" , an outer product matrix in the dual space, in the sense that trace(X" X) = trace(XX" ). This equivalence is important because an outer product matrix can be obtained from any symmetric distance matrix D = (dij ) ∈ Rn×n (Gower, 1966). In particular, for a p × p positive definite matrix B, if we let dij (B) = (x i − x j )" B(x i − x j ), then XBX" = −Pn DPn /2, where Pn is the centering matrix.

4 Adaptive Independence Test via Inverse Regression A major goal of microbiome studies is to link the overall microbiome composition with clinical or environmental variables. La Rosa et al. (2012) propose a parametric test for comparing microbiome populations between two or more groups of subjects. However, this method is not applicable for testing the association between the community composition and a continuous or many-valued outcome. Although multivariate nonparametric methods based on permutations are widely used in ecology studies (Mcardle and Anderson, 2001), they lack interpretability and can be inefficient for analyzing microbiome data. By discretizing the range of the outcome into several slices, Song et al. (2020) recast the problem in the classical framework of comparing microbiome compositions between multiple groups, with each group represented by a slice, and propose an adaptive likelihood-ratio test. Consider a set of microbiome samples measured on n subjects, with p taxa at a fixed taxonomic level. Let xij be the number of sequence preads for taxon j in subject i and x i = (xi1 , xi2 , . . . , xip )" . Denote by Ni. = total number of j =1 xij the  p  sequence reads in subject i. Similarly, N.j = ni=1 xij and N = ni=1 j =1 xij . Assume for the moment that the vector of bacterial counts X follows a multinomial distribution. We first consider the problem of comparing microbiome samples between K ≥ 2 groups. To examine whether the taxa frequencies are the same in all groups, La Rosa et al. (2012) develop a generalized Wald test. Let Y ∈ {1, 2, . . . , K} represent the group indicator. Equivalently, this method tests the hypothesis that the conditional distribution of X given Y equals the marginal distribution of X. In other words, one can view the Wald-type test as an independence test between a categorical variable and the community composition. Then, an alternative method is the likelihood ratio test. One major disadvantage of the generalized Wald test is that when Y has too many levels, it can lose power, and it is not applicable for a continuous variable. In these situations, a simple but approximate solution is to divide the range of Y into a few bins or slices and use the sliced version of Y . This amounts to comparing the distributions of X between different groups, with each group represented by a slice. Denote by S a slicing scheme and |S| the size

120

T. Wang and L. Zhu

of S. Write S = {Sh , h = 1, . . . , |S|}. Consider the following hypotheses: H0 : the distributions of X given Y ∈ Sh (1 ≤ h ≤ |S|) are the same, versus H1 : the distributions of X given Y ∈ Sh (1 ≤ h ≤ |S|) are not all the same. Under H0 , Xi ∼ fMN (x i ; π ). Under H1 , X i | Y ∈ Sh ∼ fMN (x i ; π (h) ), (h) (h) (h) where π (h) = (π1 , . . . , πp )" ∈ Rp and πj are slice-specific taxa proportions, p (h) (h) = 1. For a fixed slicing scheme S (namely, K = |S|), let N.j be the j =1 πj total count of the j th taxon in Sh and N (h) the total count of all taxa in Sh . The likelihood ratio test statistic has the form DMN =

p |S |  

(h) N.j log

h=1 j =1

(h)

N.j

N (h)



p  j =1



N.j N.j log N

 .

(14)

In practice, the choice of slicing scheme is critical in detecting the dependence between X and Y . One general approach is to use equally spaced slices by assigning the same number of observations into each slice. Then we need to choose the number of slices. Intuitively, if H0 is true, then slicing should not have a large effect on the test statistic. Under H1 , if we divide the samples into too many slices, then we will not have enough power to detect the dependence. On the contrary, we may not be able to reject the null hypothesis when the number of slices is small; for example, DMN = 0 when there is one single slice. In summary, the slicing scheme behaves like a tuning parameter that controls the flexibility of the resulting test procedure. However, searching among all possible slicing schemes is computationally prohibitive, especially when the sample size is large. To avoid over-slicing, Song et al. (2020) introduce a penalty factor for the number of slices. Following Jiang et al. (2015) and Heller et al. (2016), they consider the regularized test statistic Dˆ MN = max{DMN − λn (|S| − 1)}, S

(15)

where the maximum is taken over all possible slicing schemes, λn = λ log(n), and λ > 0 is a tuning parameter to be specified. Rather than considering all possible slicing schemes, a local optimal solution can be obtained by a dynamic programming algorithm: Step 1: Rank yi , and then rank x i accordingly. Step 2: Let d0 = 0. Fill in entries of two tables [di ]0≤i≤n and [bi ]1≤i≤n recursively as follows:

Model-Based Inverse Regression and Its Applications

121

(l:i) ( ) p  N.j (l:i) di = max dl−1 + N.j log − λn , l∈{1,...,i} N (l:i) j =1

( bi = arg max dl−1 + l∈{1,...,i}

p 

(l:i) N.j log

j =1

(l:i)

N.j

N (l:i)

) − λn ,

  p (l:i) where N.j = im=l xmj and N (l:i) = im=l j =1 xmj . Step 3: Let e0 = n. Compute eh = beh−1 − 1 recursively for h ≥ 1 until eH = 0 for some H . The optimal slicing scheme is given by Sh = {i : eh + 1 ≤ i ≤ eh−1 }, h = 1, . . . , H. Notice that in Step 2, di stores a partially maximized test statistic until the ith ranked observation, and bi records the first cut position of the last slice. Furthermore, Dˆ MN = dn −

p  j =1

 N.j log

N.j N

 + λn .

(16)

When all of the observed values of Y are distinct, adaptive slicing reduces the computational complexity from O(2n p) to O(n2 p). One can further speed up the computation by pre-assigning observations into bins and restricting slicing on them. The dynamic slicing algorithm is general and thus not restricted to independence tests. In the framework of sufficient reduction, Wang (2019, 2020) adopt this approach for learning the best slicing scheme from the data. The former is momentbased, while the latter is model-based. Controlling the type I error of the test procedure requires a proper choice of the penalty parameter λ. Instead of resorting to an asymptotic analysis, Song et al. (2020) use a permutation-based method. First, construct a fine grid of λ values, and for each λ, simulate the null distribution of Dˆ by permutations. This leads to a numerical relationship between P r(Dˆ > 0) and λ0 under H0 . Second, choose the smallest λ, denoted by λα , so that the estimated significance level is less than or equal to α. As mentioned before, the multinomial distribution may not be suitable for modeling microbiome count data due to over-dispersion. To account for overdispersion, we typically assume that there exists some variation in taxa proportions across samples. For computational tractability, one may take as a prior distribution for π a Dirichlet distribution p $ (α+ ) α fD (π; α) = .p πj j −1 , j =1 (αj ) j =1

(17)

p where αj > 0, α+ = j =1 αj , and (·) is the gamma function. This leads to the probability mass function of the Dirichlet-multinomial distribution

122

T. Wang and L. Zhu

fDMN (x i ; α) =

p (Ni. + 1)(α+ ) $ (xij + αj ) . (Ni. + α+ ) (xij + 1)(αj )

(18)

j =1

In a similar fashion, one can derive the adaptive test; see Song et al. (2020) for details.

5 Cook’s Contributions on Model-Based Sufficient Reduction Cook has made fundamental contributions on sufficient reduction in regression that stimulate the research in this field. In particular, theoretical foundations of sufficient reduction can be found in Cook (1994, 1998). One limitation of sliced inverse regression is that it cannot recover any direction in the central subspace if the regression function is symmetric about zero. The same applies to methods that are based on the conditional mean E(X | Y ). To overcome the problem, methods based on the second-order conditional moments, E(X | Y ) and Var(X | Y ), have been developed. They work well in regressions with strong curvature where first-order methods fail or have low sensitivity. Cook and Weisberg (1991) introduce the first such method, sliced average variance estimation. Sliced inverse regression and sliced average variance estimation are the first two choices for sufficient reduction. Cook and Ni (2005) propose a systematic way to optimally combine different sufficient reduction estimators. The methodology is similar in spirit to the generalized method of moments. Sufficient reduction methods are largely model-free. In the inverse reduction framework, a model could be postulated for the regression of X on Y (Adragni and Cook, 2009). Under the assumption that X | Y is multivariate normal, Cook (2007) and Cook and Forzani (2008) propose principal fitted components and investigate its theoretical properties. One implication is that sliced inverse regression provides the maximum likelihood estimator of SY |X when the response is categorical. Cook and Forzani (2009) consider the general situation where X | Y is normally distributed and both E(Y | X) and Var(Y | X) are allowed to vary with the response and derive the asymptotically efficient estimator of SY |X . Cook and Li (2009) assume that Xj | Y follows a one-parameter exponential family distribution for each coordinate Xj of X and extend principal fitted components to continuous, categorical, or mixtures of continuous and categorical predictors. However, this method requires the components be independent given the response. The minimal sufficient reductions in regressions with inverse predictors in the elliptically contoured family are derived by Bura and Forzani (2015) and in the exponential family of distributions by Bura et al. (2016).

Model-Based Inverse Regression and Its Applications

123

References K.P. Adragni, R.D. Cook, Sufficient dimension reduction and prediction in regression. Philos. Trans. R. Soc. A 367(1906), 4385–4405 (2009) J. Aitchison, The Statistical Analysis of Compositional Data (Chapman and Hall, London, 1986) A. Antoniadis, S. Lambertlacroix, F. Leblanc, Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics 19(5), 563–570 (2003) D. Billheimer, P. Guttorp, W.F. Fagan, Statistical interpretation of species composition. J. Am. Stat. Assoc. 96(456), 1205–1214 (2001) E. Bura, R.D. Cook, Estimating the structural dimension of regressions via parametric inverse regression. J. R. Stat. Soc. Ser. B 63(2), 393–410 (2001) E. Bura, L. Forzani, Sufficient reductions in regressions with elliptically contoured inverse predictors. J. Am. Stat. Assoc. 110(509), 420–434 (2015) E. Bura, R.M. Pfeiffer, Graphical methods for class prediction using dimension reduction techniques on DNA microarray data. Bioinformatics 19(10), 1252–1258 (2003) E. Bura, S. Duarte, L. Forzani, Sufficient reductions in regressions with exponential family inverse predictors. J. Am. Stat. Assoc. 111(515), 1313–1329 (2016) F. Chiaromonte, J. Martinelli, Dimension reduction strategies for analyzing global gene expression data with a response. Bellman Prize Math. Biosci. 176(1), 123–144 (2002) P. Clifford, Markov random fields in statistics, in Disorder in Physical Systems: A Volume in Honour of John M. Hammersley (Clarendon Press, Oxford, 1990) R.D. Cook, Using dimension-reduction subspaces to identify important inputs in models of physical systems, in Proceedings of the section on Physical and Engineering Sciences (American Statistical Association, Alexandria, VA, 1994), pp. 18–25 R.D. Cook, Regression Graphics: Ideas for Studying Regressions Through Graphics (Wiley, New York, 1998) R.D. Cook, Fisher lecture: dimension reduction in regression. Stat. Sci. 22(1), 1–26 (2007) R.D. Cook, Principal components, sufficient dimension reduction, and envelopes. Annu. Rev. Stat. Appl. 5, 533–559 (2018) R.D. Cook, L. Forzani, Likelihood-based sufficient dimension reduction. J. Am. Stat. Assoc. 104(485), 197–208 (2009) R.D. Cook, L. Li, Dimension reduction in regressions with exponential family predictors. J. Comput. Graph. Stat. 18(3), 774–791 (2009) R.D. Cook, L. Ni, Sufficient dimension reduction via inverse regression: a minimum discrepancy approach. J. Am. Stat. Assoc. 100(470), 410–428 (2005) R.D. Cook, L. Orzani, Principal fitted components for dimension reduction in regression. Stat. Sci. 23(4), 485–501 (2008) R.D. Cook, S. Weisberg, Comment. J. Am. Stat. Assoc. 86(414), 328–332 (1991) R.D. Cook, L. Forzani, A.J. Rothman, Estimating sufficient reductions of the predictors in abundant high-dimensional regressions. Ann. Stat. 40(1), 353–384 (2012) L. Forzani, R.G. Arancibia, P. Llop, D. Tomassi, Supervised dimension reduction for ordinal predictors. Comput. Stat. Data Anal. 125, 136–155 (2018) J.C. Gower, Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325–338 (1966) J.C. Gower, Adding a point to vector diagrams in multivariate analysis. Biometrika 55(3), 582–585 (1968) R. Heller, Y. Heller, S. Kaufman, B. Brill, M. Gorfine, Consistent distribution-free K-sample and independence tests for univariate random variables. J. Mach. Learn. Res. 17(29), 1–54 (2016) B. Jiang, C. Ye, J.S. Liu, Nonparametric K-sample tests via dynamic slicing. J. Am. Stat. Assoc. 110(510), 642–653 (2015) P.S. La Rosa, J.P. 
Brooks, E. Deych, E.L. Boone, D.J. Edwards, Q. Wang, et al., Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLoS One 7(12), e52078 (2012)

124

T. Wang and L. Zhu

S.L. Lauritzen, Graphical Models (Clarendon Press, Oxford, 1996) K.-Y. Lee, B. Li, F. Chiaromonte, A general theory for nonlinear sufficient dimension reduction: formulation and estimation. Ann. Stat. 41(1), 221–249 (2013) K.-C. Li, Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 86(414), 316–327 (1991) K.-C. Li, On principal Hessian directions for data visualization and dimension reduction: another application of Stein’s lemma. J. Am. Stat. Assoc. 87(420), 1025–1039 (1992) L. Li, Survival prediction of diffuse large-b-cell lymphoma based on both clinical and gene expression information. Bioinformatics 22(4), 466–471 (2006) B. Li, Sufficient Dimension Reduction: Methods and Applications with R (CRC Press, Boca Raton, 2018a) L. Li, Sufficient Dimension Reduction. Wiley StatsRef: Statistics Reference Online (2018b) L. Li, X. Yin, Sliced inverse regression with regularizations. Biometrics 64(1), 124–131 (2008) B. Li, S. Wang, On directional regression for dimension reduction. J. Am. Stat. Assoc. 102(479), 997–1008 (2007) B. Li, H. Zha, F. Chiaromonte, Contour regression: a general approach to dimension reduction. Ann. Stat. 33(4), 1580–1616 (2005) Y. Ma, L. Zhu, A semiparametric approach to dimension reduction. J. Am. Stat. Assoc. 107(497), 168–179 (2012) Y. Ma, L. Zhu, A review on dimension reduction. Int. Stat. Rev. 81(1), 134–150 (2013) B.H. Mcardle, M.J. Anderson, Fitting multivariate models to community data: a comment on ˇ distance Rbased redundancy analysis. Ecology 82(1), 290–297 (2001) N. Meinshausen, P. Bühlmann, High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34(3), 1436–1462 (2006) P.A. Naik, M.R. Hagerty, C. Tsai, A new dimension reduction approach for data-rich marketing environments: sliced inverse regression. J. Market. Res. 37(1), 88–101 (2000) S.S. Roley, R.M. Newman, Predicting Eurasian watermilfoil invasions in Minnesota. Lake Reserv. Manage. 24(4), 361–369 (2008) Y. Song, H. Zhao, T. Wang, An adaptive independence test for microbiome community data. Biometrics 76(2), 414–426 (2020) Q. Sun, R. Zhu, T. Wang, D. Zeng, Counting process-based dimension reduction methods for censored outcomes. Biometrika 106(1), 181–196 (2019) M. Taddy, Multinomial inverse regression for text analysis. J. Am. Stat. Assoc. 108(503), 755–770 (2013) M. Taddy, Distributed multinomial regression. Ann. Appl. Stat. 9(3), 1394–1414 (2015) D. Tomassi, L. Forzani, S. Duarte, R. Pfeiffer, Sufficient dimension reduction for compositional data. Biostatistics (2019). https://doi.org/10.1093/biostatistics/kxz060 T. Wang, Dimension reduction via adaptive slicing. Stat. Sin. (2019) https://doi.org/10.5705/ss. 202019.0102 T. Wang, Graph-assisted inverse regression for count data and its application to sequencing data. J. Comput. Graph. Stat. 29(3), 444–454 (2020) T. Wang, P. Xu, On supervised reduction and its dual. Stat. Sin. (2019) https://doi.org/10.5705/ss. 202017.0532 T. Wang, L. Zhu, Sparse sufficient dimension reduction using optimal scoring. Comput. Stat. Data Anal. 57(1), 223–232 (2013) T. Wang, L. Zhu, Flexible dimension reduction in regression. Stat. Sin. 28(2), 1009–1029 (2018) T. Wang, X. Guo, L. Zhu, P. Xu, Transformed sufficient dimension reduction. Biometrika 101(4), 815–829 (2014) T. Wang, M. Chen, H. Zhao, L. Zhu, Estimating a sparse reduction for general regression in high dimensions. Stat. Comput. 28(1), 33–46 (2018) T. Wang, C. Yang, H. Zhao, Prediction analysis for microbiome sequencing data. Biometrics 75, 875–884 (2019) Y. Xia, H. 
Tong, W.K. Li, L. Zhu, An adaptive estimation of dimension reduction space. J. R. Stat. Soc. Ser. B 64(3), 363–410 (2002)

Model-Based Inverse Regression and Its Applications

125

F. Xia, J. Chen, W.K. Fung, H. Li, A logistic normal multinomial regression model for microbiome compositional data analysis. Biometrics 69(4), 1053–1063 (2013) Z. Zhang, D. Yeung, J.T. Kwok, E.Y. Chang, Sliced coordinate analysis for effective dimension reduction and nonlinear extensions. J. Comput. Graph. Stat. 17(1), 225–242 (2012) W. Zhong, P. Zeng, P. Ma, J.S. Liu, Y. Zhu, RSIR: regularized sliced inverse regression for motif discovery. Bioinformatics 21(22), 4169–4175 (2005) L. Zhu, T. Wang, L. Zhu, L. Ferré, Sufficient dimension reduction through discretizationexpectation estimation. Biometrika 97(2), 295–304 (2010)

Sufficient Dimension Folding with Categorical Predictors Yuanwen Wang, Yuan Xue, Qingcong Yuan, and Xiangrong Yin

1 Introduction Data often come with the characters of high dimensions and complicated structures in the age of “Big Data” (IBM, 2014). For complex data such as images or audios, each observation can be a large sized 2D matrix, or even higher dimensional array. Effectively reducing the number of dimensions while preserving the data structure remains a challenging task, since most current dimension reduction methods are well suited for regular data structure where each observation is a vector. Li et al. (2010) proposed the framework of sufficient dimension folding that extended the moment-based dimension reduction methods such as sliced inverse regression (SIR; Li 1991), sliced average variance estimation (SAVE; Cook and Weisberg 1991), and directional regression (DR; Li and Wang 2007) to folded SIR, folded SAVE, and folded DR for reducing the dimensions of matrix-/array-valued objects while keeping the data structure at the same time.

Y. Wang Department of Statistics, University of Georgia, Athens, GA, USA e-mail: [email protected] Y. Xue School of Statistics, University of International Business and Economics, Beijing, China e-mail: [email protected] Q. Yuan Department of Statistics, Miami University, Oxford, OH, USA e-mail: [email protected] X. Yin () Dr. Bing Zhang Department of Statistics, University of Kentucky, Lexington, KY, USA e-mail: [email protected] © Springer Nature Switzerland AG 2021 E. Bura, B. Li (eds.), Festschrift in Honor of R. Dennis Cook, https://doi.org/10.1007/978-3-030-69009-0_7

127

128

Y. Wang et al.

Nevertheless, real data usually come with both matrix predictors and traditional categorical variables. Based on Chiaromonte et al. (2002) and Li et al. (2003) who developed sufficient dimension reduction methods with categorical predictors in regression context, we extend sufficient dimension folding with consideration of categorical predictor variables. We define three types of reduced spaces including marginal folding subspace that does not involve categorical variable information; conditional folding subspace that corresponds to the central folding subspace within each category; and partial folding subspace that combines different conditional folding subspaces. We bridge the gap among these spaces by theoretical justifications, followed by three numerical algorithms that estimate the desired partial folding subspace. An empirical maximal eigenvalue ratio criterion is used to further determine the structural dimensions for the partial folding subspace. Simulations, as well as an application to a longitudinal data set, indicate that the proposed partial folding subspace provides better insights by including both a matrix predictor and a categorical predictor in the same model. The rest of the paper is organized as follows. Section 2 reviews the concepts of sufficient dimension folding for matrix/array predictors. Section 3 is devoted to the theoretical foundations of sufficient dimension folding with categorical predictors. Section 4 proceeds with three estimation methods to estimate the desired partial folding subspace. Section 5 provides an estimation method for the structural dimensions. Section 6 explores different simulation settings including forward and inverse models, as well as an application to a longitudinal data set. Section 7 has a short summary. Proofs are provided in the Appendix.

2 Review on Sufficient Dimension Folding Suppose that B is a p×q matrix and span(B) is the space spanned by the columns of B. The notation PB = B(B T B)† B T is the projection matrix onto span(B), where “†” denotes the Moore-Penrose inverse. We consider a regression or a classification problem involving a univariate response variable Y with a pl × pr matrix predictor X, and the components of X are assumed to be continuous. For random vectors, U, V, and W, notation U V|W means that given W, U, and V are independent. According to Li et al. (2010), if there are two matrices α ∈ Rpl ×dl and β ∈ p r R ×dr (dl < pl and dr < pr ) satisfying Y

X|α T Xβ,

(2.1)

then Y depends on X only through the transformed matrix α T Xβ. The spaces span(α) and span(β) are defined as left dimension folding subspace and right dimension folding subspace. Denote SY |◦X and SY |X◦ as the intersections of all left dimension folding subspaces and all right dimension folding subspaces, respectively. If they satisfy (2.1), then the central folding subspace SY |◦X◦ is defined as

Sufficient Dimension Folding with Categorical Predictors

129

SY |◦X◦ = SY |X◦ ⊗ SY |◦X = span(β0 ⊗ α0 ) = span(β0 ) ⊗ span(α0 ), where ⊗ is the Kronecker product, and α0 and β0 are basis matrices of SY |◦X and SY |X◦ , respectively. The relationship between the central folding subspace and the central subspace (Cook, 1994, 1996) is captured through the concept of Kronecker envelope. The Kronecker envelope (Li et al., 2010) of a random matrix U ∈ R(rR rL )×k is  ⊗ (U) = SU◦ ⊗ S◦U , where SU◦ ∈ RrR and S◦U ∈ RrL , and it satisfies the following conditions: 1. span(U) ⊆ SU◦ ⊗ S◦U almost surely. 2. If there exists another pair of subspaces SR ∈ RrR and SL ∈ RrL satisfying 1, then SU◦ ⊗ S◦U ⊆ SR ⊗ SL . By specifying a matrix U through traditional sufficient dimension reduction methods such as SIR (Li, 1991), SAVE (Cook and Weisberg, 1991), and DR (Li and Wang, 2007) which target the central subspace for vector predictors, the central folding subspace SY |◦X◦ can be estimated by the Kronecker envelope of matrix U (Li et al., 2010). Existing works with regard to dimension folding include those of Pfeiffer et al. (2012), Ding and Cook (2014, 2015), Xue and Yin (2014, 2015), and Xue et al. (2016).

3 Sufficient Dimension Folding with Categorical Predictors We now introduce the marginal, conditional, and partial folding subspaces for dimension folding for matrix/array predictors. For a continuous matrix predictor X, the notation SY |◦X◦ is used to denote the marginal central folding subspace (Li et al., 2010). We assume there is an additional categorical predictor variable W with possible values w = 1, 2, . . . , C. Definition 1 For each fixed W = w, the intersection of all spaces SL and the intersection of all spaces SR satisfying X|(PSL XPSR , W = w)

Y

are defined as the conditional left-folding subspace (SYw |◦Xw ) and the conditional right-folding subspace (SYw |Xw ◦ ), respectively. Their Kronecker product SYw |Xw ◦ ⊗ SYw |◦Xw = span(βw ) ⊗ span(αw ) is defined as the conditional folding subspace and denoted by SYw |◦Xw ◦ , where αw ∈ Rpl ×dl and βw ∈ Rpr ×dr are basis matrices of SYw |◦Xw and SYw |Xw ◦ , respectively. Definition 2 For a discrete random variable W , the intersection of all spaces SL and the intersection of all spaces SR satisfying Y

X|(PSL XPSR , W )

130

Y. Wang et al. (W )

are defined as the partial left-folding subspace (SY |◦X ) and the partial right-

) ) ) folding subspace (SY(W|X◦ ), respectively. Their Kronecker product SY(W|X◦ ⊗ SY(W|◦X = span(β (W ) ) ⊗ span(α (W ) ) is defined as the partial folding subspace and denoted by (W ) (W ) SY |◦X◦ , where α (W ) ∈ Rpl ×dl and β (W ) ∈ Rpr ×dr are basis matrices of SY |◦X and (W )

SY |X◦ , respectively. The following equivalent relationships hold: Y Y

X|(PSY |◦X XPSY |X◦ ) ⇔ Y

X|(PSYw |◦Xw XPSYw |Xw ◦ , W = w) ⇔ Y Y

X|(PS (W ) XPS (W ) , W ) ⇔ Y Y |◦X

Y |X◦

X|([PSY |◦X◦ ]vec(X)), X|([PSYw |◦Xw ◦ ]vec(X), W = w), X|([PS (W ) ]vec(X), W ). Y |◦X◦

) Partial folding subspace SY(W|◦X◦ is the desired space to estimate since it achieves dimension folding with the presence of both continuous and categorical predictors. Let ⊕ be the direct sum. The following proposition illustrates the relationships among marginal, conditional, and partial folding subspaces and serves as the theoretical foundations for the proposed numerical estimation algorithms.

Proposition 1 (a) If W (W )

X|PS (W ) XPS (W ) or W Y |◦X

Y |X◦

(W )

Y |PS (W ) XPS (W ) , then SY |◦X ⊆ SY |◦X , SY |X◦ ⊆

(W )

Y |◦X

Y |X◦

SY |X◦ , and thus SY |◦X◦ ⊆ SY |◦X◦ . (b) If W (c)

) Y |X, then SY(W|◦X◦ ⊆ SY |◦X◦ .

(W )

SY |◦X = ⊕C w=1 SYw |◦Xw , (W )

SY |X◦ = ⊕C w=1 SYw |Xw ◦ and therefore, ) C SY(W|◦X◦ = (⊕C w=1 SYw |Xw ◦ ) ⊗ (⊕w=1 SYw |◦Xw ).

(d) (W )

C ⊕C w=1 SYw |◦Xw ◦ = ⊕w=1 (SYw |Xw ◦ ⊗ SYw |◦Xw ) ⊆ SY |◦X◦ .

(e) If span(Uw ) ⊆ SYw |vec(Xw ) almost surely, and the basis matrix of space (W ) ∗ ⊗ ∗ ⊕C w=1 span(Uw ) is denoted by U , we have  (U ) ⊆ SY |◦X◦ almost surely, and consequently, C ∗ S◦U∗ ⊆ ⊕C w=1 SYw |◦Xw and SU ◦ ⊆ ⊕w=1 SYw |Xw ◦ .

Sufficient Dimension Folding with Categorical Predictors

131

(f) (W )

SY |◦X ⊆ SW |◦X ⊕ SY |◦X , (W )

SY |X◦ ⊆ SW |X◦ ⊕ SY |X◦ and therefore, ) ) SY |◦X◦ ⊆ (SW |X◦ ⊕ SY(W|X◦ ) ⊗ (SW |◦X ⊕ SY(W|◦X ).

In Proposition 1, part (a) and part (b) discuss the general relationship between the ) defined partial folding subspace SY(W|◦X◦ and the marginal folding subspace SY |◦X◦ , which was proposed in Li et al. (2010). That is, one space does not always contain the other space. Part (a) describes one scenario when marginal folding subspace is a subspace of partial folding subspace, where part (b) describes the scenario vice versa. By targeting only marginal folding subspace SY |◦X◦ as partial folding ) subspace SY(W|◦X◦ and completely ignores categorical variable W , one could be estimating either a larger or a smaller space. Part (c) indicates that partial folding subspace equals to the Kronecker product of the left and right partial folding subspaces, which in turn is equal to the corresponding direct sum of conditional left-folding (right-folding) subspaces. We propose an individual direction ensemble method which estimates conditional left-folding C subspace ⊕C w=1 SYw |◦Xw and conditional right-folding subspace ⊕w=1 SYw |Xw ◦ and leverages the Kronecker product of the estimated basis as an estimate for the partial folding subspace. (W ) Part (d) concludes that ⊕C w=1 (SYw |Xw ◦ ⊗SYw |◦Xw ) is a subspace of SY |◦X◦ , based on which an ordinary least squares method is presented to estimate this space. Part (e) suggests us to propose an objective function optimization method, (W ) which estimates SY |◦X◦ through a direct ordinary least squares formulation. While not directly related to our proposed estimation algorithms targeting for partial folding subspace, Part (f) illustrates further connections between the left-folding ) subspace, right-folding subspace, and partial folding subspace SY(W|◦X◦ . The proof of Proposition 1 is arranged in the Appendix.

4 Estimation Methods In this section, we propose three algorithms based upon folded SIR to estimate partial folding subspace. We name algorithms involving the adoption of folded SIR in general as partial folded sliced inverse regression (partial folded SIR). However, the three algorithms themselves are distinguished by their names: individual direction ensemble method; least squares folding approach (LSFA); and objective function optimization method. Extensions based on folded SAVE and folded DR can be developed similarly. We assume that the random sample contains n observations

132

Y. Wang et al.

with nw observations for each category W = 1, · · · , C, and n1 + · · · + nC = n. At this stage, we assume that dimensions dl and dr are known; estimation for dl and dr will be provided later. And the number of parameters estimated in total for each approach is pl dl + pr dr .

4.1 Individual Direction Ensemble Method ) C Part (c) of Proposition 1 concludes that SY(W|◦X◦ = (⊕C w=1 SYw |Xw ◦ )⊗(⊕w=1 SYw |◦Xw ). Based on this equivalence, we propose an individual direction ensemble method (W ) C that estimates ⊕C w=1 SYw |Xw ◦ and ⊕w=1 SYw |◦Xw , respectively, and thus SY |◦X◦ . The idea of the individual direction ensemble method is similar to the outer product of the gradient method (OPG; Xia et al., 2002), where conditional left-folding (right-folding) directions are estimated first. A basis matrix α (β) is estimated by the ensemble of all the conditional left-folding (right-folding) directions αw (βw ). Then β ⊗ α forms an estimator for the partial folding subspace. The algorithm is as follows:

1. For each category W = w, estimate a basis matrix of conditional left-folding subspace SYw |◦Xw as αˆ w ∈ Rpl ×dl and a basis matrix of conditional right-folding subspace SYw |Xw ◦ as βˆw ∈ Rpr ×dr using folded SIR algorithm (Li et al., 2010). 2. The ensemble of all the conditional left-folding (right-folding) SIR directions ˆα = αˆ w (βˆw ) forms the partialleft-folding (right-folding) SIR directions as   C C nw nw ˆ ˆ T T ˆ ˆ w αˆ w (β = w=1 n α w=1 n βw βw ). Apply eigenvalue decomposition to ˆ β ) and extract the first dl (dr ) eigenvectors as the partial left-folding (rightˆ α (  ˆ folding) SIR directions αˆ (β). ) 3. An estimator of the partial folding subspace SY(W|◦X◦ is βˆ ⊗ α. ˆ

4.2 Least Squares Folding Approach (LSFA) (W ) Part (d) of Proposition 1 concludes that ⊕C w=1 (SYw |Xw ◦ ⊗ SYw |◦Xw ) ⊆ SY |◦X◦ , which motivates a least squares folding method following the reformulation of SIR with OLS (Cook, 2004) to estimate ⊕C w=1 (SYw |Xw ◦ ⊗ SYw |◦Xw ). First observe that

 (W ) C (W ) ⊗ α (W ) = SY |◦X◦ , ⊕C w=1 SYw |◦Xw ◦ = ⊕w=1 span {βw ⊗ αw } ⊆ span β which means that, for ∀w = 1, . . . , C, there is at least one matrix f (w) ∈ dr dl × pr pl such that βw ⊗ αw = (β (W ) ⊗ α (W ) )f (w). Minimizing EW ||βW ⊗ αW − (β ⊗ α)f (W )||2 ,

(4.1)

Sufficient Dimension Folding with Categorical Predictors

133

over α, β, and f (W ) with constraints that α T α = Idl and β T β = Idr recovers (W ) an estimate of the partial folding subspace SY |◦X◦ in population, where ||.|| is the Frobenius norm of a matrix. It has been established that vec(β ⊗ α) = [vec(β) ⊗ vec(α)] (Li et al., 2010), where  = Idr ⊗ [(Idl ⊗ Kpr ,pl )Kpl dl ,pr ] and Kpr ,pl and Kpl dl ,pr are commutation matrices (Magnus and Neudecker, 1999). Magnus and Neudecker (1999) showed that vec(β) ⊗ vec(α) = vec[vec(α)vecT (β)] = [Ipr dr ⊗ vec(α)]vec(β).

(4.2)

The objective function (4.1) is equivalent to E||vec(βW ⊗ αW ) − vec[(β ⊗ α)f (W )]||2 . Using (4.2), (4.1) can then be rewritten as E||vec(βW ⊗ αW ) − (f T (W ) ⊗ Ipl pr )[Ipr dr ⊗ vec(α)]vec(β)]||2 . Let V1 = V1 (W ) = vec(βW ⊗ αW ) and

V2 = V2 (W ) = (f T (W ) ⊗ Ipl pr )[vec(α) ⊗ Ipr dr ],

then the minimizer of (4.1) over β ∈ Rpr ×dr , for fixed f ∈ L2dl dr ,c and α ∈ Rpl ×dl , is vec(β) = [E(VT2 V2 )]−1 E(VT2 V1 ). Similarly, the minimizer of (4.1) over α ∈ Rpl ×dl , for fixed f ∈ L2dl dr ,c and β ∈ Rpr ×dr , is vec(α) = [E(VT2 V2 )]−1 E(VT2 V1 ), where V1 = V1 (W ) = vec(βW ⊗ αW ) and V2 = V2 (W ) = (f T (W ) ⊗ Ipl pr )[Ipl dl ⊗ vec(β)]. The algorithm is detailed as follows: 1. For each category W = w, estimate the directions for conditional left-folding space SYw |◦Xw as αˆ w and directions for conditional right-folding space SYw |Xw ◦ as βˆw using folded SIR algorithm (Li et al., 2010).  2. Generate initial value of αˆ (0) ∈ Rpl ×dl and fˆ0 (w) : w = 1, . . . , C ∈ Rdl dr ×dl dr (each element) from standard normal distribution N(0, 1). 3. For w = 1, . . . , C, k = 0, 1, · · · , in the (k + 1)-th iteration, compute pˆ w = and ˆ 1 (w) = vec(βˆw ⊗ αˆ w ), V ˆ 2 (w) = (fˆT (w) ⊗ Ip pr )[vec(α) ˆ (k) ⊗ Ipr dr ]. V l k

nw n

134

Y. Wang et al.

ˆ (k+1) by Then compute vec(β) /

C 

0−1 / ˆ 2 (w) ˆ T (w)V pˆ w V 2

w=1

C 

0 ˆ 1 (w) ˆ T (w)V pˆ w V 2

.

w=1

ˆ 2 (w) as 4. For w = 1, . . . , C, recompute V ˆ 2 (w) = (fˆT (w) ⊗ Ip pr )[Ip d ⊗ vec(β) ˆ (k+1) ]. V l l l k ˆ (k+1) but with an Then compute vec(α) ˆ (k+1) by the same equation as vec(β) (k+1) ˆ updated V2 (w), that is, vec(α) ˆ is /

C 

0−1 / ˆ 2 (w) ˆ T (w)V pˆ w V 2

w=1

C 

0 T ˆ ˆ pˆ w V2 (w)V1 (w) .

w=1

ˆ 2 (w) as 5. For w = 1, . . . , C, recompute V ˆ 2 (w) = Id dr ⊗ [βˆ (k+1) ⊗ αˆ (k+1) ], V l and ˆ 2 (w)T V ˆ 2 (w)T V ˆ 2 (w)]−1 [V ˆ 1 (w)]. fˆk+1 (w) = [V 6. Iterate between step 3 and step 5, until C 

pˆ w ||βˆw ⊗ αˆ w − (βˆ (k) ⊗ αˆ (k) )fˆk (w)||2

w=1

is smaller than some pre-specified threshold , for example, 10−6 . 7. An estimated basis matrix for the partial folding subspace is βˆ (k) ⊗ αˆ (k) .

4.3 Objective Function Optimization Method Instead of applying decomposition to the ensemble of the directions of conditional left-folding (right-folding) subspace, one could estimate β ⊗ α directly through an OLS type formulation which can help to reduce the aggregated errors in estimating basis matrices βw and αw for their corresponding conditional folding subspaces, respectively. Part (e) of Proposition 1 concludes that we can estimate partial folding

Sufficient Dimension Folding with Categorical Predictors

135

subspace SY |◦X◦ by targeting on the Kronecker envelope of U∗ which is a basis matrix of ⊕C w=1 span(Uw ). First we clarify the notations as follows. Random matrix UW ∈ R(pl pr )×kW follows the previous definition, and discrete variable W has finite possible values with distribution 0 < P (W = w) = pw < 1. Let α0 ∈ Rpl ×dl and β0 ∈ Rpr ×dr be basis matrices of S◦U∗ and SU∗ ◦ which span the Kronecker envelope of U∗ with respect to integer pair (pl , pr ). For positive integers k1 and k2 , and a random vector Z defined on Z , let L2 k1 ×k2 (Z ) be the class of functions f : Z → Rk1 ×k2 such that E||f (Z)||2 < ∞. The theorem below lays out the theoretical foundations for our third estimation method. (W )

Theorem 1 Suppose that for each W = w, elements of Uw have finite variances and are measurable with respect to a random vector Z and that A is a pl pr × pl pr nonrandom and nonsingular matrix. Let (α ∗ , β ∗ , f1∗ , . . . , fC∗ ) be the minimizer of E||AUW − A(β ⊗ α)fW (Z)||2 ,

(4.3)

overall α ∈ Rpl ×dl , β ∈ Rpr ×dr , and fw ∈ L2 dl dr ×kw (Z ) for each W = w. Then span(β ∗ ⊗ α ∗ ) =  ⊗ (U∗ ). Theorem 1 holds for any nonrandom and nonsingular matrix A with dimension pl pr × pl pr . In practice such as in numerical analysis and real data analysis,  1/2 T (X − X). ¯ ¯ We connect ˆ pool = n−1 ni=1 vec(Xi − X)vec A = pool , where  i the solution for minimizing Eq. (4.3) to the results of Li et al. (2010). By using the following notation U∗ =

2 1√ 1√ 2 √ √ √ p1 U1 , p2 U2 , . . . , pC UC , f (Z)∗ = p1 f1 (Z), . . . , pC fC (Z) ,

we can rewrite (4.3), (or (8.4) in the Appendix) as E||AU∗ − A(β ⊗ α)f (Z)∗ ||2 . Then following Li et al. (2010), we implement their algorithm to estimate β and α directly from the objective function. Note that Li et al. (2010) proposed a simplified estimation method for folded SIR when the dimensions of f (Y ) is dl dr × 1. However, in our case, since fW (Y ) is a dl dr × C matrix, such simplification could not be applied here. We give detailed algorithm for partial folded SIR below. The following notations will be used in the algorithm. First, the pooled covariance matrix pool is estimated through vectorized data ˆ pool = n−1 

n 

T ¯ ¯ vec(Xi − X)vec (Xi − X).

i=1

Similarly, the covariance matrix w for each category W = w, w = 1, · · · , C can be estimated by

136

Y. Wang et al.

ˆ w = n−1  w

nw 

T ¯ ¯ vec(Xi − X)vec (Xi − X).

i=1

We discretize response variable Y into s categories, and let J1 , . . . , Js be the partition of Y and D = δ(Y ) be the discrete random variable defined by δ(Y ) = 

if

Y ∈ J ,  = 1, . . . , s.

−1 For n a function h(·) of (X, Y ), let En h(X, Y ) denote the sample average n × i=1 h(Xi , Yi ). The objective function optimization algorithm is described as follows:

1. Use solution in Sect. 4 for the initial  value of αˆ (0) ∈ Rpl ×dl . Generate each element of the random matrices fˆ(0) () ∈ Rdl dr ×C :  = 1, . . . , s from N (0, 1). 2. For  = 1, . . . , s, in (k + 1)-th iteration, compute pˆ  = En [I (D = )] and 1  ˆ 1 () = pˆ −1 vec{ ˆ 2 [ P (W = 1) ˆ −1 En [vec(XW =1 )I (D = )], V pool  1  −1 ˆ En [vec(XW =C )I (D = )]]}, . . . , P (W = C) C 1

ˆ 2 () = (fˆT ()(k) ⊗  ˆ 2 )[Ipr dr ⊗ vec(α) ˆ (k) ]. V pool ˆ (k+1) by Then compute vec(β) / s 

0−1 / s 

ˆ 2 () ˆ T ()V pˆ  V 2

=1

0 ˆ 1 () ˆ T ()V pˆ  V 2

(4.4)

.

=1

ˆ 2 () as 3. For  = 1, . . . , s, recompute V 1

ˆ 2 () = (fˆT ()(k) ⊗  ˆ 2 )[vec(β) ˆ (k+1) ) ⊗ Ipl dl ]. V pool ˆ (k+1) , but with an Then compute vec(α) ˆ (k+1) by the same equation as vec(β) ˆ 2 (). That is, updated V vec(α) ˆ

(k+1)

=

/ s  =1

0−1 / s 

ˆ 2 () ˆ T ()V pˆ  V 2

0 ˆ 1 () ˆ T ()V pˆ  V 2

.

=1

To compute the inverse of the covariance matrix, we use the same technique as in Li et al. (2010), which is Moore-Penrose generalized inverse. See

Sufficient Dimension Folding with Categorical Predictors

137

https://www.rdocumentation.org/packages/MASS/versions/7.3-51.5/topics/ ginv. We also tried to use ridge type; it does not make a big difference. ˆ 2 as 4. For  = 1, . . . , s, recompute V 1

ˆ 2 = Ic ⊗ [ ˆ 2 βˆ (k+1) ⊗ αˆ (k+1) ] V pool and further ˆTV ˆ −1 ˆ T ˆ fˆ(l)(k+1) = (V 2 2 ) [V2 V1 ()]. 5. Iterate between step 2 and step 4, until ||Pβˆ (k+1) ⊗αˆ (k+1) − Pβˆ (k) ⊗αˆ (k) || is smaller than some pre-specified threshold , such as 10−6 . 6. An estimated basis matrix for partial folding subspace is βˆ (k) ⊗ αˆ (k) . The objective function optimization method considers regression when there are categorical variables and continuous matrix predictor; it does not simply focus on improving the individual methods such as “individual direction ensemble” and the “OLS” method. In fact, we develop this method, similar to what was developed by Chiaromonte et al. (2002) for dimension reduction by introducing an additional categorical variable W . It is an extension from Li et al. (2010), where there is no categorical variable at all. That is, Li et al. (2010) focused on SY |◦X◦ , while we focus ) ) on SY(W|◦X◦ . Relationships between SY |◦X◦ and SY(W|◦X◦ are described in Proposition 1 part (a) and part (b). In this manuscript, the proposed three algorithms are the first methods to estimate (W ) SY |◦X◦ which performs dimension reduction when both categorical variable W and continuous matrix predictor X exist. Thus, in the simulation settings, comparison with the folding method in Li et al. (2010) is not directly recommended, due to model setup. Also, comparison with partial SIR in Chiaromonte et al. (2002) which targets at the partial central subspace SY |vec(X) is also not ideal as it vectorizes X and simplifies estimation, losing its efficacy as the reason why Li et al. (2010) developed (W ) folding approach. The estimated space is also a proposer subspace of SY |◦X◦ as indicated in the settings of numerical studies in Sect. 6.1, where we, nevertheless, add some numerical evidence. Nevertheless, we understand intuitively and based on our limited empirical evidences from our simulations in numerical study and in the Appendix that: 1. When all the individual central subspaces are the same (however, models may be different) across the categorical variables, then our proposed three methods are the best.

138

Y. Wang et al.

2. When all individual central subspaces are completely different across the categorical variables, then individual method may be better, due to combined sample size and its target. 3. When not all directions across the individual subspaces are the same, or completely different, perhaps neither of the two methods (ours and individual) has advantage, depending on different target and how much the individual subspaces are overlap.

5 Estimation of Structural Dimensions In Sect. 4, we assume that the structural dimensions dl (W ) and dr (W ) are known. In reality, these dimensions need to be inferred from the data. One can utilize the individual direction ensemble method to construct an empirical method to estimate structural dimensions dl (W ) and dr (W ) . Suppose that for each category W = w, the basis matrices for conditional leftfolding and right-folding subspaces (SYw |◦Xw and SYw |Xw ◦ ) estimated by folded SIR (Li et al., 2010) are αˆ w and βˆw , respectively, and the ensemble directions are ˆα = 

C C   nw nw T ˆβ = αˆ w αˆ w βˆw βˆwT . and  n n

w=1

w=1

ˆ β are (λˆ 1 , . . . , λˆ pl ) and (φˆ 1 , . . . ., φˆ pr ) in descending ˆ α and  The eigenvalues for  order. Numerically, we can use the maximal eigenvalue ratio criterion (MERC; Li and Yin 2009; Luo et al. 2009) to estimate the dimensions. The ratio of two consecutive eigenvalues can then be computed as rˆi = λˆ i /λˆ i+1 for i = 1, . . . , pl − 1 and qˆj = φˆ j /φˆ j +1 for j = 1, . . . , pr − 1. The structural dimensions are estimated by using MERC as (W ) dˆl = arg

(W ) max {ˆri }, and dˆr = arg

1≤i≤dl max

max

1≤j ≤dr max

{qˆj }.

As suggested by Luo et al. (2009), one can usually set the maximum possible structural dimensions as 5. Graphically, we can plot the eigenvalues versus its order and use the “elbow plot” to estimate the dimensions. The elbow plot is the same as practice for PCA. See Ye and Weiss (2003) and Zhu and Zeng (2006) for the use of “elbow plot” in determining structural dimensions in sufficient dimension reduction. Recently developed ladle plot (Luo and Li, 2016) is very handy as well.

Sufficient Dimension Folding with Categorical Predictors

139

6 Numerical Analysis 6.1 Simulation Studies The numerical studies consist of two parts. Part I includes Examples 1 and 2, where response variable Y is continuous and is generated from forward models. For comparison, the response variable Y in Examples 3 and 4 in Part II is discrete with two levels and is generated from inverse models. The categorical variable W = w ∈ {0, 1} is independent of X and Y and follows a binomial distribution with probability 0.5 throughout the simulation studies. For brevity, the proposed estimation methods are referred as individual ensemble; LSFA; objective, where the pooled covariance matrix in Sect. 4.3 is replaced by individual covariance matrix in each category w; and objective function optimization method as objective (pooled), which is described in Sect. 4.3. We summarize the results in the respective table, where the numbers given in the columns are the mean of matrix norm between estimated projection and true projection onto dimension folding subspaces. And the numbers in parentheses are the standard errors of such discrepancy. The smaller the two values, the better the estimation method is.

6.1.1

Part I (Continuous Y, Forward Model)

In this part, examples are modified based on the examples in Xue and Yin (2014) but with an additional categorical variable W . In Examples 1 and 2, two conditional central subspaces are proper subspaces of the conditional folding subspaces correspondingly, i.e., for w = 0, 1, SYw |vec(X)w  SYw |◦Xw ◦ . (W )

Therefore, the desired partial central subspace SY |vec(X) is also a proper subspace of

) partial folding subspace SY(W|◦X◦ in that, (W )

(W )

SY |vec(X) = ⊕1w=0 SYw |vec(X)w  ⊕1w=0 SYw |◦Xw ◦ ⊆ SY |◦X◦ . For such simulation settings, methods such as partial sliced inverse regression (partial SIR; Chiaromonte et al. 2002) which targets the conditional central subspace (W ) (W ) SY |vec(X) of the vectorized predictor vec(X) may not recover SY |◦X◦ exhaustively. Therefore, making comparison between the proposed three estimation methods and partial SIR is not recommended. Examples 1 and 2 differ by how the two conditional folding subspaces overlap with each other. In Example 1, the two subspaces are exactly the same, but the two subspaces are orthogonal to each other in Example 2. Let predictor X be a 5×5 matrix, and Xij stand for the element on the ith row and j th column. Its vectorized version follows a standard normal distribution such that

140

Y. Wang et al.

Table 1 Example 1, accuracy of estimates on partial folding subspace n Benchmark Individual ensemble 200 1.9928 (0.3540) 400 1.4318 (0.4069) 600 2.5894 1.0036 (0.2966) 1000 0.7105 (0.2046) 1600 0.4584 (0.1110)

LSFA 2.0090 (0.3845) 1.4370 (0.4235) 0.9965 (0.2981) 0.7103 (0.2057) 0.4580 (0.1110)

Objective 2.0267 (0.3735) 1.5484 (0.3922) 1.1935 (0.3407) 0.9018 (0.2675) 0.7025 (0.1779)

Objective (pooled) 1.9949 (0.3112) 1.5503 (0.4013) 1.2176 (0.3119) 0.9605 (0.2626) 0.7186 (0.1881)

Table 2 Example 2, accuracy of estimates on partial folding subspace n Benchmark Individual ensemble 200 2.8420 (0.6133) 400 2.2741 (0.7187) 600 3.3844 1.7783 (0.7106) 1000 1.0698 (0.3741) 1600 0.7958 (0.3014)

LSFA 2.8857 (0.5835) 2.3071 (0.7305) 1.7465 (0.6960) 1.0919 (0.4277) 0.7961 (0.3023)

Objective 3.1280 (0.4645) 3.0424 (0.5441) 2.9843 (0.5206) 2.8106 (0.6115) 2.6598 (0.7274)

Objective (pooled) 3.1580 (0.4780) 3.1002 (0.5327) 3.0103 (0.5324) 2.7481 (0.6649) 2.6990 (0.7179)

vec(X) ∼ MV Npl pr (0, Ipl pr ). The error term  is independent of X and follows a standard normal distribution. To compute the accuracy of the estimates, we use the Frobenius norm (Li et al., 2005) between estimated projection matrix and the underlying true projection matrix, ||Pβ⊗ ˆ αˆ − Pβ⊗α ||. The smaller value indicates better estimate. Tables 1 and 2 summarize the results of average estimation errors and their standard errors based on 100 replicates for sample sizes n = 200, 400, 600, 800, 1000, and 1600, respectively. The results are also compared with the benchmark performance. The benchmark distance is defined as the Frobenius norm between two independent spaces with the same dimensions (Li et al., 2008). The benchmark distance is used to help understand the accuracy of the estimates. It is the average of matrix norm between two random projections with the same dimensions. It is to measure how “close” the distance for two random subspaces. In practice, the benchmark distance can be calculated by E(||Pβ0 ⊗α0 − Pβ⊗α ||) where α0 and β0 are obtained by randomly and independently generating vec(α0 ) and vec(β0 ) from multivariate standard normal distribution. We estimate the benchmark distance by running 10,000 simulations. The details of the two examples are as follows. Example 1 The conditional distribution of Y is as follows: Y = X11 × (X12 + X21 + 1) + 0.2 × , for W = 0, Y = X22 × (X12 + X21 + 1) + 0.2 × , for W = 1. In this example, partial folding subspace is exactly the same as two corresponding conditional folding subspace. That is,

Sufficient Dimension Folding with Categorical Predictors

141

(W )

SY |◦X◦ = SYw=0 |◦Xw=0 ◦ = SYw=1 |◦Xw=1 ◦ = SYw=0 |◦Xw=0 ◦ ⊕ SYw=1 |◦Xw=1 ◦ = span(e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e1 , e2 ⊗ e2 ). (W )

Therefore, all three proposed estimation methods can recover SY |◦X◦ exhaustively. However, for vectorized data, its corresponding partial central subspace is a proper subset of partial folding subspace, which means SYw=0 |vec(X)w=0 = span(e1 ⊗ e1 , e1 ⊗ e2 + e2 ⊗ e1 )  SYw=0 |◦Xw=0 ◦ and SYw=1 |vec(X)w=1 = span(e2 ⊗ e2 , e1 ⊗ e2 + e2 ⊗ e1 )  SYw=1 |◦Xw=1 ◦ . Thus (W )

(W )

SY |vec(X) = span(e1 ⊗ e1 , e1 ⊗ e2 + e2 ⊗ e1 , e2 ⊗ e2 )  SY |◦X◦ . The results in Table 1 indicate that the individual ensemble method and LSFA method both perform much better than the benchmark measure, as well as the objective function method. As sample size increases, all three methods steadily improve their performance. Objective function method with or without pooled variance provide similar accuracy compared with the other two methods when sample size is small but improve slower when sample size increases. Example 2 In this example, we modify the previous example so that the two conditional folding subspaces are completely orthogonal to each other. The conditional distribution of Y given X and W is: Y = X11 × (X12 + X21 + 1) + 0.2 ×  for W = 0, Y = X33 × (X34 + X43 + 1) + 0.2 ×  for W = 1. In this case, SYw=0 |◦Xw=0 ◦ = span(e1 , e2 ) ⊗ span(e1 , e2 ) = span(e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e1 , e2 ⊗ e2 ) and SYw=1 |◦Xw=1 ◦ = span(e3 , e4 ) ⊗ span(e3 , e4 ) = span(e3 ⊗ e3 , e3 ⊗ e4 , e4 ⊗ e3 , e4 ⊗ e4 ).

Based on part (c) of Proposition 1, we have (W )

SY |◦X◦ = (span(e1 , e2 ) ⊕ span(e3 , e4 )) ⊗ (span(e1 , e2 ) ⊕ span(e3 , e4 )) = span(ei ⊗ ej ) , i, j = 1, . . . , 4. On the other hand, based on part (d) of Proposition 1,

142

Y. Wang et al.

SYw=0 |◦Xw=0 ◦ ⊕ SYw=1 |◦Xw=1 ◦ = (span(e1 , e2 ) ⊗ span(e1 , e2 )) ⊕ (span(e3 , e4 ) ⊕ span(e3 , e4 )) = span(e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e1 , e2 ⊗ e2 , e3 ⊗ e3 , e3 ⊗ e4 , e4 ⊗ e3 , e4 ⊗ e4 ) (W )

 SY |◦X◦ . Therefore, the LSFA method only targets SY |◦Xw=0 ◦ ⊕ SY |◦Xw=1 ◦ which is a proper (W ) subspace of our desired space SY |◦X◦ . It can be established that the conditional central subspace and partial central subspace of the vectorized predictor are proper subspaces of the corresponding conditional folding subspace and partial folding subspace, which means that SYw=0 |vec(X)w=0 = span(e1 ⊗ e1 , e1 ⊗ e2 + e2 ⊗ e1 )  SYw=0 |◦Xw=0 ◦ and SYw=1 |vec(X)w=1 = span(e3 ⊗ e3 , e3 ⊗ e4 + e4 ⊗ e3 )  SYw=1 |◦Xw=1 ◦ . Therefore, (W )

SY |vec(X) = span(e1 ⊗ e1 , e1 ⊗ e2 + e2 ⊗ e1 , e3 ⊗ e3 , e3 ⊗ e4 + e4 ⊗ e3 ) )  SYw=0 |◦Xw=0 ◦ ⊕ SYw=1 |◦Xw=1 ◦  SY(W|◦X◦ .

Table 2 shows the results. Although the LSFA method could not estimate partial folding subspace exhaustively in theory, it still provides similar results compared with individual ensemble method, assuming known dimensions. Obviously, the objective and objective (pooled) are not good, as the two conditional spaces are orthogonal.

6.1.2

Part II (Discrete Y, Inverse Model)

In this part, the response variable Y is set to be discrete and generated from inverse model. These two examples are modified from Li et al. (2010), with an additional binary variable W . Example 3 Let dl = dr = 2 and pl = pr = p = 5. The response Y follows Bernoulli distribution with success probability π = 0.5. The conditional distribution of vectorized data vec(X) given Y follows a multivariate normal distribution. When W = 0,  E(X|Y = 0, W = 0)=0p×p ,

E(X|Y = 1, W = 0)=

μI2

02×(p−2)

0(p−2)×2 0(p−2)×(p−2)

 ,

Sufficient Dimension Folding with Categorical Predictors

143

where μ = 1.5 and 0r×s is an r × s matrix with all elements equal to 0. The conditional variance of each element of X is set to be ( var(Xij |Y = 0, W = 0) = ( var(Xij |Y = 1, W = 0) =

σ 2 (i, j ) ∈ A , and 1 (i, j ) ∈ /A τ 2 (i, j ) ∈ A , 1 (i, j ) ∈ /A

where σ = 0.1, τ = 1.5, and A is the index set {(1, 2), (2, 1)}. We assume that cov(Xij , Xi  j  ) = 0 in the covariance of vec(X), when (i, j ) = (i  , j  ). For W = 1, the conditional mean and covariance of X follow the same settings as for W = 0. In such a situation, the categorical variable W is useless. By Bayes theorem, it can be derived that the conditional posterior probability P (Y = 0|X, W = 0) (or 2 and X 2 , and equivalently, P (Y = 1|X, W = 0)) is a function of X11 + X22 , X12 21 so does conditional posterior probability P (Y = 1|X, W = 1). The smallest submatrix containing the elements is 

X11 X12 X21 X22

 .

Therefore, the two conditional folding subspaces, as well as the desired partial folding subspace, are all the same. That is, (W )

SY |◦X◦ = SYw=0 |◦Xw=0 ◦ = SYw=1 |◦Xw=1 ◦ = SYw=0 |◦Xw=0 ◦ ⊕ SYw=1 |◦Xw=1 ◦ = span(e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e1 , e2 ⊗ e2 ). ) In this case, all three estimation methods can recover SY(W|◦X◦ exhaustively. However, for vectorized data,

SYw=0 |vec(X)w=0 = span(e1 ⊗ e1 + e2 ⊗ e2 , e1 ⊗ e2 , e2 ⊗ e1 )  SYw=0 |◦Xw=0 ◦ and SYw=1 |vec(X)w=1 = span(e1 ⊗ e1 + e2 ⊗ e2 , e1 ⊗ e2 , e2 ⊗ e1 )  SYw=1 |◦Xw=1 ◦ , which lead to (W )

(W )

SY |vec(X) = span(e1 ⊗ e1 + e2 ⊗ e2 , e1 ⊗ e2 , e2 ⊗ e1 )  SY |◦X◦ . The results listed in Table 3 show that the individual ensemble method and LSFA method outperform the two objective function methods when sample size n is relatively small. As sample size n increases, all these methods have similar performance.

144

Y. Wang et al.

Table 3 Example 3, accuracy of estimates on partial folding subspace n Benchmark Individual ensemble 200 0.9905 (0.2339) 400 0.6349 (0.1404) 600 2.5894 0.5012 (0.1015) 1000 0.3763 (0.0749) 1600 0.2906 (0.0542)

LSFA 0.9913 (0.2353) 0.6349 (0.1402) 0.5012 (0.1015) 0.3763 (0.0749) 0.2906 (0.0542 )

Objective 1.6270 (0.4409) 0.8305 (0.1969) 0.5858 (0.1245) 0.4175 (0.0897) 0.3039 (0.0624)

Objective (pooled) 1.4212 (0.4283) 0.8082 (0.2120) 0.6031 (0.1391) 0.4277 (0.0865) 0.3134 (0.0660)

Table 4 Example 3, accuracy of estimates on partial folding subspace with pl = pr = 10, 30 n Pl , Pr 1600 10 1600 30

Individual ensemble LSFA Objective Objective (pooled) 0.4980 (0.0538) 0.4980 (0.0538) 0.5880 (0.0750) 0.5316 (0.06616) 1.0283 (0.0620) 1.0281 (0.0620) 2.8194 (0.0128) 1.2233 (0.1087)

For Example 3, we include a simulation result for pl = pr = 10 and pl = pr = 30, respectively, to match the largest dimensions in Li et al. (2010) and Ding and Cook (2015). Table 4 shows the results. With pl = pr = 30, the biggest dimension so far in literature, which is equivalently dimension 900 in vector space and our algorithms are able to scale to this dimension without problems. In general, comparisons among the three proposed algorithms however are legitimate, and our results show that individual ensemble method and least squares methods usually outperform objective method when sample size n is small but are on parity when sample size n is large. All proposed algorithms are based on iterations since the original folding method is also based on iterations. In our additional numerical analysis, we found out that the convergence speed of our algorithms largely depends on how overlap the conditional folding subspaces are. The more overlap the subspaces are, the faster the convergence. Example 4 In this example, the two conditional folding subspaces for different assignments to W (W = w and w ∈ {0, 1}) are completely orthogonal to each other, which can generate the largest partial folding subspace overall. For W = 0, the model specifications are the same as in Example 3. However, for W = 1, the condition mean of X given Y is modified as ⎛

⎞ 02×2 02×2 02×(p−4) E(X|Y = 1, W = 1) = ⎝ 02×2 μI2 02×(p−4) ⎠ . 0(p−4)×2 0(p−4)×2 0(p−4)×(p−4) ) is the same as in It can be verified that the partial folding subspace SY(W|◦X◦ Example 2. However, for vectorized predictor vec(X),

SYw=0 |vec(X)w=0 = span(e1 ⊗ e1 + e2 ⊗ e2 , e1 ⊗ e2 , e2 ⊗ e1 )  SYw=0 |◦Xw=0 ◦ and SYw=1 |vec(X)w=1 = span(e3 ⊗ e3 + e4 ⊗ e4 , e3 ⊗ e4 , e4 ⊗ e3 )  SYw=1 |◦Xw=1 ◦ ,

Sufficient Dimension Folding with Categorical Predictors

145

Table 5 Example 4, accuracy of estimates on partial folding subspace n Benchmark Individual ensemble 200 1.5467 (0.3854) 400 0.9871 (0.2605) 600 3.3844 0.7996 (0.2024) 1000 0.6191 (0.1555) 1600 0.5013 (0.1407)

LSFA 1.5464 (0.3853) 0.9873 (0.2605) 0.7996 (0.2024) 0.6191 (0.1555) 0.5013 (0.1407)

Objective 1.8733 (0.5051) 1.1736 (0.2996) 0.9048 (0.2343) 0.6961 (0.1759) 0.5607 (0.1427)

Objective (pooled) 1.1086 (0.2908) 0.7344 (0.1787) 0.5818 (0.1352) 0.4617 (0.1066) 0.3725 (0.1071)

and thus (W )

(W )

SY |vec(X)  SY |◦X◦ . The results listed in Table 5 indicate that the individual ensemble method and LSFA method outperform the objective function method without pooled covariance. However, the objective function method with pooled covariance performs the best. From these simulations and additional ones in the Appendix, in summary, the individual direction ensemble method and the LSFA method outperform objective function method in forward and continuous models, but the objective function method with pooled variance performs better than the other methods in inverse models with discrete response variable. It is probably because that folded SIR (Li et al., 2010) itself is developed as an inverse model and discrete response provides a way on how to slice the data naturally, and each slice contains observations belonging to the same category. The estimation accuracy of all three methods is influenced by how overlap the two conditional folding subspaces are; the more they overlap, the fewer parameters need to be estimated, thus higher accuracy. When response variable Y is continuous and model is forward model, individual ensemble method and OLS method usually outperform objective function methods, both in terms of estimation accuracy and variability. As sample size n increases, the discrepancy of accuracy becomes smaller. When response variable Y is discrete and model is inverse model, objective function method with pooled variance performs much better than individual direction ensemble method and OLS methods. This is probably because that folded SIR (Li et al., 2010) itself is developed as an inverse model and the fact that discrete response also ease the difficulty of choosing appropriate hyper-parameters: number of slices. The smaller the overlap the two spaces are, the better the objective function method with pooled variance is. We infer that when two conditional folding subspaces are quite different, the errors of estimating their corresponding basis matrices will aggregate when applying individual ensemble or least square methods. However, for the objective function method with pooled variance, we do not need to estimate conditional folding subspaces at all; thus no errors are aggregated at this stage. Instead, we form the optimization function directly which directly aims at estimating partial folding subspace.

146

Y. Wang et al.

6.2 Application In this section, we apply the individual ensemble, LSFA, and objective (pooled) methods to estimate the partial folding subspace for a longitudinal primary biliary cirrhosis (PBC) data set provided by Mayo Clinic. The data set was collected in a follow-up experiment conducted between 1974 and 1984 with 312 patients participated. The sample size is 128 for female and 16 for male. The detailed description can be found in Fleming and Harrington (1991) and Murtaugh et al. (1994). The data set is named as “pbcseq” in the “survival” package in R software. PBC is a progressive cholesteric liver disease on adults which may result in severe liver failure, need of transplantation, and even death (Talwalkar and Lindor, 2003). This disease is related to various biomarkers including bilirubin, albumin, and prothrombin time. In this PBC data set, repeated measurements of these biomarkers and other demographic information are collected at different visit time points for each subject. We are particularly interested in how these biomarkers, correlated with time and other demographic information, influence patients’ transplantation time (or death time). The histograms of the distributions of biomarkers are in Appendix (Fig. 4), indicating a bimodal for the markers. We denote the response and predictor variables in this analysis as follows. The response variable is the time between registration and the earlier of transplanting or death in years. The predictor variable is a 3 × 4 matrix predictor whose column variables are the discrete visit times from subjects. Similar to Xue and Yin (2014), we regard visits between 90 days and 270 days as 6 months, 270–550 days as 1 year, 550–910 days as 2 years, and 910–1275 as 3 years. The row variables are three types of biomarkers including bilirubin, albumin, and prothrombin time. Gender is a categorical variable including male and female. Note that several sufficient dimension reduction and sufficient dimension folding methods have been devoted to analyze longitudinal data. Li and Yin (2009) developed sufficient dimension reduction framework where time is regarded as discrete random variable. A longitudinal sliced inverse regression method (LSIR; Pfeiffer et al. 2012) was developed to extend sliced inverse regression in the longitudinal data context. Xue and Yin (2014) applied folded MAVE and folded OPG under the framework of sufficient dimension folding, which treats time as one fold dimension of the variable and the biomarkers quantities as the other fold. In this way, correlation across different time points is taken into consideration. However, information carried in categorical predictor such as gender was overlooked. Our methods can simultaneously reduce the dimensions of a matrix predictor as well as investigate categorical variable information. We begin our analysis by estimating the structural dimensions dl (W ) and dr (W ) for the associated partial folding subspace. The eigenvalue plots (Fig. 1) in determining structural dimensions show that the first eigenvalues dominate the second and third eigenvalues which are close to zero. Based on the “elbow plot,” rather than the MERC criterion, we infer that the (W ) (W ) structural dimensions for the partial folding subspace are dl = dr = 1. Typi-

Sufficient Dimension Folding with Categorical Predictors

147

Fig. 1 Eigenvalue plots for partial structural dimensions

Table 6 Estimated directions for the partial, conditional (denoted by male and female), and marginal folding subspace Direction αˆ 1 αˆ 2 αˆ 3 βˆ1 βˆ2 βˆ3 βˆ4

Individual ensemble 0.0253 −0.9943 −0.1029 −0.1127 −0.1301 −0.3294 −0.9283

LSFA 0.0498 −0.9955 −0.0801 −0.1067 −0.1127 −0.3319 −0.9304

Objective (pooled) 0.0203 −0.9934 −0.1124 −0.0220 −0.0406 −0.3479 −0.9363

Male 0.9319 −0.3577 −0.0589 0.7553 −0.4679 0.2123 −0.4067

Female 0.0530 −0.9955 −0.0797 −0.1064 −0.1116 −0.3321 −0.9305

Marginal 0.0530 −0.9948 −0.0868 −0.0598 −0.0631 −0.2657 −0.9601

cally, in “elbow plot,” big eigenvalues indicated strong associations of the respective eigenvectors to the response. Therefore we use dl w = dr w = 1, w ∈ {0, 1}, for the associated conditional folding subspace within each gender subpopulation. Incidentally, Xue and Yin (2014) demonstrated that the structural dimensions for marginal folding subspace are also dl = dr = 1. Notations α(w) and β(w) , w ∈ {0, 1}, denote basis matrices of the conditional folding subspace, and α (W ) and β (W ) are basis matrices of the partial folding subspace. Their corresponding i-th coefficient is denoted by αi or βi discriminated by which subspace is of interest. Table 6 summarizes the estimated directions from our proposed algorithms. It is observed that the individual ensemble and LSFA methods provide very similar estimation results on the partial folding subspace, but the objective (pooled) method yields a slightly different estimate of β. The estimated partial folding subspace is essentially very similar to the marginal folding subspace without considering gender information, or simply the conditional folding subspace for the female subgroup. This is probably due to the fact that female subjects are the majority experiment subjects (128 females vs. 16 males). We notice that the conditional folding subspace for the male subgroup is very different from the one for female subgroup (Fig. 2). It could be due to small sample and overfit. Our results agree with those in Xue and Yin (2014) in which folded OPG and folded MAVE were utilized to estimate the marginal folding subspace.

148

Y. Wang et al.

Fig. 2 Smoothing splines fits and residual plots with reduced predictors

We compute the confidence intervals of the estimated directions using bootstrap method with 200 replications. The 95% confidence intervals (in Fig. 5 in Appendix). show that coefficients α2 and β4 are significantly smaller than 0 but α1 is significantly larger than 0 across all three methods. The coefficient β3 appears not significant in individual ensemble method and LSFA method; however, it is estimated to be significantly smaller than 0 by the objective method. The variabilities of the estimates show similar pattern between the individual ensemble method and the LSFA method. By contrast, estimates by the objective method demonstrate smaller variability. It is obvious especially for the coefficient α1 . Finally, based on the estimated partial folding subspace and two conditional foldT ing subspaces, we can compute the reduced predictor. For example, α (W ) Xβ (W ) ∈ 1×1 R is the reduced predictor for partial folding subspace. In addition, the univariate variable αw=0 T Xβw=0 ∈ R1×1 is the reduced predictor for male group, and αw=1 T Xβw=1 ∈ R1×1 is for female group. The smoothing spline method is applied to analyze the relationship between the reduced predictors and the response variable. A leave-one-out cross-validation is adopted to select tuning parameter λ. Figure 3 provides the plots for fitted models and residual patterns. The first row in Fig. 3 shows the fitted lines. Notice that the fitted lines for both female group and whole data set are similar as that in Xue and Yin (2014), but our models provide smaller residuals in magnitude. In contrast, the male group exhibits an increasing but not the same pattern as that of the female group. Due to limited samples, the fitted smoothing spline might not represent a correct pattern for the male group. More male data points are needed in order to study the pattern of the reduced predictor against the response variable for the male group. We believe that the results indicate better analysis and conclusion when using the reduced predictor from the partial folding subspace which simultaneously reduce dimensions for a matrix predictor, as well as taking the categorical variable gender into consideration.

Sufficient Dimension Folding with Categorical Predictors

149

7 Discussion In this paper, we mainly discuss how to implement dimension folding for a complex matrix/array predictor with the existence of categorical variables. We define partial folding subspace for a matrix/array predictor in parallel to partial central subspace. Three types of estimators are proposed to efficiently estimate partial folding subspace. Both simulations and real data application indicate that the proposed partial folding subspace estimation methods provide better insights and summary on how a matrix predictor is associated with response variable, combining the additional information from categorical variables. More recently, Pan et al. (2019) proposed an effective method for classification problem (i.e., Y is categorical variable only) and, under (inverse) normal assumptions (W |Y ), differing from ours where W is categorical. Their estimation method is more as a model approach, while ours is model-free. In addition, the goals of our approaches and theirs are also different. However, their approach is a very effective method in reducing (folding) dimensions in classification using both covariate information. It might be very interesting on how to extend ours into their content or theirs into our content. Acknowledgments Yin’s work is supported in part by an NSF grant CIF-1813330. Xue’s work is supported in part by the Fundamental Research Funds for the Central Universities in University of International Business and Economics (CXTD11-05).

8 Appendix 8.1 Proofs The following equivalent relationship will be repeatedly used in the proof of Proposition 1. For generic random variables V1 , V2 , V3 , and V4 , Cook (1998) showed that V1

V2 |(V3 , V4 ) and V1

V4 |V3

⇔ V1

V4 |(V2 , V3 ) and V1

⇔ V1

(V2 , V4 )|V3 .

Proof of Proposition 1 part (a) In Eq. (8.1), let ⎧ ⎪ ⎪ ⎨

V1 = vec(X), V2 = W, ⎪ = P V SR ⊗SL vec(X), ⎪ ⎩ 3 V4 = Y,

V2 |V3

(8.1)

150

Y. Wang et al.

and apply the first part of Eq. (8.1) and the equivalent relationship that Y X|α T Xβ ⇔ Y vec(X)|(β ⊗ α)T vec(X), we have W |(PSL XPSR , Y ) and X

X

Y |PSL XPSR

⇔ vec(X)

W |(PSR ⊗SL vec(X), Y ) and vec(X)

Y |PSR ⊗SL vec(X)

⇔ vec(X)

Y |(W, PSR ⊗SL vex(X)) and vec(X)

W |PSR ⊗SL vec(X)

⇔X

Y |(W, PSL XPSR ) and X

W |PSL XPSR .

Therefore, under the assumption that W Y

X|PS (W ) XPS (W ) and Y |◦X

Y |X◦

X|(W, PS (W ) XPS (W ) ) ⇒ Y Y |◦X

(W )

Y |X◦

X|PS (W ) XPS (W ) , Y |◦X

(W )

Y |X◦

(W )

we have SY |◦X ⊆ SY |◦X , SY |X◦ ⊆ SY |X◦ and SY |◦X◦ ⊆ SY |◦X◦ . Now in Eq. (8.1), let ⎧ ⎪ ⎪ ⎨

V1 = Y, V2 = vec(X), ⎪ V = PSR ⊗SL vec(X), ⎪ ⎩ 3 V4 = W, and again apply the first part of Eq. (8.1) and the equivalent relationship that Y X|α T Xβ ⇔ Y vec(X)|(β ⊗ α)T vec(X), we have Y

W |(PSL XPSR , Y ) and Y

X|PSL XPSR

⇔Y

W |(PSR ⊗SL vec(X), Y ) and Y

⇔Y

vec(X)|(PSR ⊗SL vec(X), W ) and Y

⇔Y

X|(PSL XPSR , W ) and Y

vec(X)|PSR ⊗SL X W |PSR ⊗SL vec(X)

W |PSL XPSR .

Therefore, under the assumption that Y Y

W |PS (W ) XPS (W ) and Y |◦X

Y |X◦

X|(PS (W ) XPS (W ) , W ) ⇒ Y Y |◦X

(W )

Y |X◦

(W )

X|PS (W ) XPS (W ) , Y |◦X

(W )

Y |X◦

there exist SY |◦X ⊆ SY |◦X , SY |X◦ ⊆ SY |X◦ and SY |◦X◦ ⊆ SY |◦X◦ . Proof of Proposition 1 part (b) In Eq. (8.1), let



Sufficient Dimension Folding with Categorical Predictors

151

⎧ ⎪ ⎪ ⎨

V1 = Y, V2 = W, ⎪ = P V 3 SR ⊗SL vec(X), ⎪ ⎩ V4 = vec(X), and apply the first part of Eq. (8.1) and the equivalent relationship that Y X|α T Xβ ⇔ Y vec(X)|(β ⊗ α)T vec(X), we have W |(PSL XPSR , X) and Y

Y

X|PSL XPSR

⇔Y

W |(PSR ⊗SL vec(X), vec(X)) and Y

vec(X)|PSR ⊗SL vec(X)

⇔Y

vec(X)|(W, PSR ⊗SL vex(X)) and Y

W |PSR ⊗SL vec(X)

⇔Y

X|(W, PSL XPSR ) and Y

Therefore, under the assumption that W X). Thus, Y

X|PSY |◦X XPSY |X◦ ⇒ Y (W )

W |PSL XPSR . Y |X, we also have Y

W |(PSY |◦X XPSY |X◦ ,

X|(W, PSY |◦X XPSY |X◦ ),

(W )

(W )

and further we have SY |◦X ⊆ SY |◦X , SY |X◦ ⊆ SY |X◦ and SY |◦X◦ ⊆ SY |◦X◦ .



Proof of Proposition 1 part (c) For generic subspace SL and SR , we have Y

X|(PSL XPSR , W ) ⇔ Y ⇔Y

(W )

(W )

(W )

X|(PSL XPSR , W = w),

∀w = 1, . . . , C,

vec(X)|P(SR ⊗SL ) vec(X), W =w),

(W )

∀w=1, . . . , C. (8.2)

(W )

Since SY |◦X◦ = SY |X◦ ⊗SY |◦X , SY |◦X , and SY |X◦ satisfy the left-hand side of Eq. (8.2) by their definitions, they also satisfy Y

X|(PS (W ) XPS (W ) , W = w), Y |◦X

Y |X◦

∀w = 1, . . . , C. (W )

(W )

This implies that, for ∀w = 1, . . . , C, SYw |◦Xw ⊆ SY |◦X , SYw |Xw ◦ ⊆ SY |X◦ and thus (W )

(W )

C ⊕C w=1 SYw |◦Xw ⊆ SY |◦X , ⊕w=1 SYw |Xw ◦ ⊆ SY |X◦ . Therefore, (W )

(W )

(W )

C (⊕C w=1 SYw |Xw ◦ ) ⊗ (⊕w=1 SYw |◦Xw ) ⊆ SY |X◦ ⊗ SY |◦X = SY |◦X◦ . C Because that SYw |◦Xw ⊆ (⊕C w=1 SYw |◦Xw ) and SYw |Xw ◦ ⊆ (⊕w=1 SYw |Xw ◦ ), for ∀w = 1, . . . , C, the two direct sum spaces also satisfy the right-hand side of Eq. (8.2). Therefore, we have

152

Y. Wang et al.

X|(P⊕C

Y

w=1 SYw |◦Xw

XP⊕C

w=1 SYw |Xw ◦

, W ).

This implies the other containing relationship (W )

C SY |◦X◦ ⊆ (⊕C w=1 SYw |Xw ◦ ) ⊗ (⊕w=1 SYw |◦Xw ). ) C = (⊕C We then conclude that SY(W|◦X◦ w=1 SYw |Xw ◦ ) ⊗ (⊕w=1 SYw |◦Xw ).



Proof of Proposition 1 part (d) For generic subspace SL and SR , we have Y

X|(PSL XPSR , W ) ⇔ Y ⇔Y

(W )

(W )

(W )

X|(PSL XPSR , W = w),

∀w = 1, . . . , C,

vec(X)|P(SR ⊗SL ) vec(X), W =w),

(W )

∀w=1, . . . , C. (8.3)

(W )

Since SY |◦X◦ = SY |X◦ ⊗SY |◦X , SY |◦X , and SY |X◦ satisfy the left-hand side of Eq. (8.3) by their definitions, they also satisfy Y

X|(PS (W ) XPS (W ) , W = w), Y |◦X

Y |X◦

∀w = 1, . . . , C. (W )

(W )

This implies that, for ∀w = 1, . . . , C, SYw |◦Xw ⊆ SY |◦X , SYw |Xw ◦ ⊆ SY |X◦ and thus (W )

(W )

(W )

SYw |Xw ◦ ⊗ SYw |◦Xw ⊆ SY |X◦ ⊗ SY |◦X = SY |◦X◦ . Therefore, (W )

⊕C w=1 (SYw |Xw ◦ ⊗ SYw |◦Xw ) ⊆ SY |◦X◦ . Because that SYw |Xw ◦ ⊗SYw |◦Xw ⊆ ⊕C w=1 (SYw |Xw ◦ ⊗SYw |◦Xw ) and SYw |Xw ◦ ⊗SYw |◦Xw satisfy the second right-hand equation (8.3), for ∀w = 1, . . . , C, we have (W ) (W ) (W ) ∗ ⊕C w=1 (SYw |Xw ◦ ⊗ SYw |◦Xw ) = span(U ) ⊆ SY |◦X◦ = SY |X◦ ⊗ SY |◦X , pl pr ×k . where U∗ is a random basis matrix of space ⊕C w=1 span(βw ⊗ αw ) in R Therefore, by the definition of Kronecker envelope in Li et al. (2010), the Kronecker envelope of U∗ with respect to integer pl and pr , that is p⊗l ,pr (U∗ ) = SU∗ ◦ ⊗ S◦U∗ , satisfies the following conditions: ∗ ∗ 1. span(U∗ ) = ⊕C w=1 (SYw |Xw ◦ ⊗ SYw |◦Xw ) ⊆ SU ◦ ⊗ S◦U almost surely. 2. If there is another pair of subspaces SR ∈ Rpr and SL ∈ Rpl that satisfies condition 1, then SU∗ ◦ ⊗ S◦U∗ ⊆ SR ⊗ SL .

However, from the previous proof span(U∗ ) = ⊕C w=1 (SYw |Xw ◦ ⊗ SYw |◦Xw ) ⊆ SY |X◦ ⊗ SY |◦X = SY |◦X◦ , (W )

(W )

(W )

Sufficient Dimension Folding with Categorical Predictors (W )

153

(W )

and by definition, SY |X◦ ∈ Rpr and SY |◦X ∈ Rpl . Therefore, p⊗l ,pr (U∗ ) = SU∗ ◦ ⊗ S◦U∗ ⊆ SY |X◦ ⊗ SY |◦X = SY |◦X◦ . (W )

(W )

(W )

On the other hand, for ∀w = 1, . . . , C, ∗ ⊗ ∗ SYw |Xw ◦ ⊗SYw |◦Xw ⊆ ⊕C w=1 (SYw |Xw ◦ ⊗SYw |◦Xw ) = span(U ) ⊆ SU∗ ◦ ⊗S◦U∗ =pl ,pr (U ).

Therefore, S◦U∗ and SU∗ ◦ satisfy the second right-hand side of Eq. (8.3). And for the left-hand side of Eq. (8.3), we have X|(PS◦U∗ XPSU∗ ◦ , W ).

Y (W )

(W )

Thus SY |◦X ⊆ S◦U∗ and SY |X◦ ⊆ SU∗ ◦ , which implies the relationship SY |◦X◦ ⊆ SU∗ ◦ ⊗ S◦U∗ = p⊗l ,pr (U∗ ). (W )

Therefore, SY |◦X◦ = SU∗ ◦ ⊗ S◦U∗ = p⊗l ,pr (U∗ ). (W )

(W )

This concludes that SY |◦X◦ equals to the Kronecker envelope of ⊕C w=1 (SYw |Xw ◦ ⊗ SYw |◦Xw ). Thus by estimating ⊕C (S ⊗ S ), we are targeting a proper Yw |◦Xw w=1 Yw |Xw ◦ (W ) C subspace of SY |◦X◦ . For example, estimation on ⊕w=1 (SYw |Xw ◦ ⊗ SYw |◦Xw ) does not (W )



recover SY |◦X◦ exhaustively.

Proof of Proposition 1 part (e) First note that if for each W = w, span(Uw ) ⊆ SYw |vec(X)w almost surely, then from part (d) of Proposition 1, we have (W ) c ⊕C w=1 span(Uw ) ⊆ ⊕w=1 SYw |vec(X)w = SY |vec(X)

⊆ ⊕cw=1 SYw |Xw ◦ ⊗ SYw |Xw ◦ (W )

⊆ SY |◦X◦ = (⊕cw=1 SYw |Xw ◦ ) ⊗ (⊕cw=1 SYw |◦Xw ) almost surely. Therefore, by the definition of Kronecker product, we have S◦Unew ⊆ ⊕cw=1 SYw |◦Xw , SUnew ◦ ⊆ ⊕cw=1 SYw |Xw ◦ and

 ⊗ (Unew ) = SUnew ◦ ⊗ S◦Unew ⊆ SY |◦X◦ = (⊕cw=1 SYw |Xw ◦ ) ⊗ (⊕cw=1 SYw |◦Xw ). (W )

154

Y. Wang et al.

Proof of Theorem 1 Using double expectation formula, we can further write the objective function as EW (EUw ||AUw − A(β ⊗ α)fw (Z)|W = w||2 ), where the inside expectation is with respect to random matrices U1 , . . . , UC and the outside expectation is with respect to categorical variable W . This is equivalent to C 

pw (E||AUw − A(β ⊗ α)fw (Z)|W = w||2 ).

(8.4)

w=1

Assume  ⊗ (U∗ ) = span(β0 ⊗ α0 ). Because that, for each W = w, Uw ⊆ ⊗ ∗ ⊕C w=1 span(Uw ) ⊆  (U ) = span(β0 ⊗ α0 ) and the elements of Uw are measurable with respect to Z, there exists a random projection matrix φw (Z) ∈ Ldl dr ×kw such that Uw = (β0 ⊗ α0 )φw (Z), which is equivalent to AUw = A(β0 ⊗ α0 )φw (Z). Thus (4.3), or equivalently (8.4), reaches its minimum 0 within the range of (α, β, f1 , . . . , fC ) given in the theorem. This implies that any minimizer (α ∗ , β ∗ , f1∗ , . . . , fC∗ ) of (4.3) must satisfy A(β ∗ ⊗ α ∗ )fw∗ (Z) = AUw almost surely for every W = w and, consequently, (β0 ⊗ α0 )φw (Z) = (β ∗ ⊗ α ∗ )fw ∗ (Z) almost surely. But this means that span(β ∗ ⊗ α ∗ ) contains each Uw almost surely; ∗ ∗ ∗ ∗ thus we have ⊕C w=1 span(Uw ) ⊆ span(β ⊗ α ). Since span(β ⊗ α ) has the ⊗ ∗ same dimensions as  (U ), the theorem now follows from the uniqueness of the Kronecker envelope.



8.2 Additional Simulation and Data Analysis The following six examples are related to the examples in Sect. 6.1 of simulation studies, showing how the results changes when overlap information changes among individual subspaces. Example A1 Example A1 almost keep the same experimental setting as in Example 1 but slightly change the conditional distribution of Y given X and W , so that the two conditional central subspaces are overlapped but not identical. Y = X11 × (X12 + X21 + 1) + 0.2 ×  f or W = 0, Y = X12 × (X13 + X22 + 1) + 0.2 ×  f or W = 1. In this example,

Sufficient Dimension Folding with Categorical Predictors

155

SYw=0 |◦Xw=0 ◦ = span(e1 , e2 ) ⊗ span(e1 , e2 ) = span(e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e1 , e2 ⊗ e2 ), SYw=1 |◦Xw=1 ◦ = span(e2 , e3 ) ⊗ span(e1 , e2 ) = span(e2 ⊗ e1 , e2 ⊗ e2 , e3 ⊗ e1 , e3 ⊗ e2 ).

The two conditional folding subspace are overlapped, because their left conditional folding subspaces are the same and their right conditional folding subspaces also share one same direction. By part (c) of Proposition 1, we have: (W )

C SY |◦X◦ = (⊕C w=1 SYw |Xw ◦ ) ⊗ (⊕w=1 SYw |◦Xw )

= (span(e1 , e2 ) ⊕ span(e2 , e3 )) ⊗ (span(e1 , e2 ) ⊕ span(e1 , e2 )) = span(e1 , e2 , e3 ) ⊗ span(e1 , e2 ) = span(e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e1 , e2 ⊗ e2 , e3 ⊗ e1 , e3 ⊗ e2 ). On the other hand, based on part (d) of Proposition 1, we have: SY |◦Xw=0 ◦ ⊕ SY |◦Xw=1 ◦ = (span(e1 , e2 ) ⊗ span(e1 , e2 )) ⊕ (span(e2 , e3 ) ⊗ span(e1 , e2 )) = span(e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e1 , e2 ⊗ e2 , e3 ⊗ e1 , e3 ⊗ e2 ) (W )

= SY |◦X◦ . ) Therefore, all three methods can still recover SY(W|◦X◦ exhaustively. Again, for vectorized data,

SYw=0 |vec(X)w=0 = span(e1 ⊗ e1 , e1 ⊗ e2 + e2 ⊗ e1 )  SYw=0 |◦Xw=0 ◦ , SYw=1 |vec(X)w=1 = span(e2 ⊗ e1 , e2 ⊗ e2 + e3 ⊗ e1 )  SYw=1 |◦Xw=1 ◦ , thus (W )

SY |vec(X) = span(e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e1 , e2 ⊗ e2 + e3 ⊗ e1 ) (W )

 SY |◦Xw=0 ◦ ⊕ SY |◦Xw=1 ◦ = SY |◦X◦ . Table 7 summarizes the simulation results for Example A1. Note that all three methods perform worse than Example 1 in terms of accuracy and variability. This is due to the fact that they are estimating a bigger partial folding subspace than in Example 1. We observe that individual ensemble method and LSFA method still provide similar accuracy and they outperform objective function method. Example A2 Example A2 keeps the two conditional central subspaces overlapped but to a smaller extent. This can be achieved by setting conditional distribution as

156

Y. Wang et al.

Table 7 Example A1, accuracy of estimates on partial folding subspace n Bench mark 200 400 600 3.0136 1000 1600

Individual ensemble 2.3420 (0.4424) 1.8432 (0.4548) 1.2344 (0.3994) 0.8635 (0.2493) 0.6033 (0.1571)

LSFA 2.3746 (0.4098) 1.8461 (0.4765) 1.2496 (0.4333) 0.8705 (0.2787) 0.6030 (0.1560)

Objective 2.6465 (0.2962) 2.3987 (0.3391) 2.2887 (0.4549) 2.0916 (0.4682) 1.9512 (0.5517)

Objective (pooled) 2.6467 (0.2515) 2.4733 (0.3234) 2.2826 (0.4296) 2.1298 (0.4816) 1.9884 (0.5127)

Y = X11 × (X12 + X21 + 1) + 0.2 ×  f or W = 0, Y = X22 × (X23 + X32 + 1) + 0.2 ×  f or W = 1. In this example, SYw=0 |◦Xw=0 ◦ = span(e1 , e2 ) ⊗ span(e1 , e2 ) = span(e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e1 , e2 ⊗ e2 ) SYw=1 |◦Xw=1 ◦ = span(e2 , e3 ) ⊗ span(e2 , e3 ) = span(e2 ⊗ e2 , e2 ⊗ e3 , e3 ⊗ e2 , e3 ⊗ e3 ),

The two conditional folding subspace are slightly overlapped, but none of their left (right) conditional folding subspaces are the same. By part (c) of Proposition 1, we have: ) C SY(W|◦X◦ = (⊕C w=1 SYw |Xw ◦ ) ⊗ (⊕w=1 SYw |◦Xw )

= (span(e1 , e2 ) ⊕ span(e2 , e3 )) ⊗ (span(e1 , e2 ) ⊕ span(e2 , e3 )) = span(e1 , e2 , e3 ) ⊗ span(e1 , e2 , e3 ) = span(ei ⊗ ej ) i, j = 1, . . . , 3. On the other hand, based on part (d) of Proposition 1, we have: SY |◦Xw=0 ◦ ⊕ SY |◦Xw=1 ◦ = (span(e1 , e2 ) ⊗ span(e1 , e2 )) ⊕ (span(e2 , e3 ) ⊗ span(e2 , e3 )) = span(e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e1 , e2 ⊗ e2 , e2 ⊗ e3 , e3 ⊗ e2 , e3 ⊗ e3 ) (W )

 SY |◦X◦ .

In this case, only individual ensemble method and objective function method (W ) recover SY |◦X◦ exhaustively. Since LSFA method is targeting on space SY |◦Xw=0 ◦ ⊕ SY |◦Xw=1 ◦ , it is a smaller subspace according to part (d) of Proposition 1 and the experiment setting above. Therefore, LSFA method estimates a smaller subspace (W ) than the desired partial folding subspace SY |◦X◦ . In practice, the accuracy of LSFA method may not be defected since we use the results from individual ensemble method as initial values. Again, for vectorized data,

Sufficient Dimension Folding with Categorical Predictors

157

Table 8 Example A2, accuracy of estimates on partial folding subspace n Bench mark 200 400 600 3.3838 1000 1600

Individual ensemble 2.7658 (0.5124) 2.0669 (0.5702) 1.4614 (0.4660) 0.9939 (0.2985) 0.7236 (0.2075)

LSFA 2.7475 (0.4856) 2.0919 (0.5553) 1.4753 (0.4872) 0.9945 (0.2976) 0.7234 (0.2075)

Objective 2.9935 (0.3805) 2.6864 (0.4324) 2.5285 (0.5177) 2.4001 (0.5814) 2.1148 (0.6121)

Objective (pooled) 2.9462 (0.4306) 2.6790 (0.4294) 2.5496 (0.4665) 2.3995 (0.5745) 2.1708 (0.6180)

SYw=0 |vec(X)w=0 = span(e1 ⊗ e1 , e1 ⊗ e2 + e2 ⊗ e1 )  SYw=0 |◦Xw=0 ◦ , SYw=1 |vec(X)w=1 = span(e2 ⊗ e2 , e2 ⊗ e3 + e3 ⊗ e2 )  SYw=1 |◦Xw=1 ◦ , thus (W )

SY |vec(X) = span(e1 ⊗ e1 , e1 ⊗ e2 + e2 ⊗ e1 , e2 ⊗ e3 + e3 ⊗ e2 ) (W )

 SY |◦Xw=0 ◦ ⊕ SY |◦Xw=1 ◦  SY |◦X◦ . Table 8 displays the simulation results for Example A2, where individual ensemble method and LSFA method still outperform objective function method. Example A3 Example A3 constrains that the two conditional central subspaces are as the same as the conditional folding subspaces, i.e., for any w = 0, 1, SYw |vec(X)w = SYw |◦Xw ◦ . However, the partial central subspace is still a proper subspace of partial folding subspace. We can achieve this by constraining the two partial central subspaces to be orthogonal with each other. The conditional distribution of Y given X and W , that is: Y = X11 × (X21 + 1) + 0.2 ×  f or W = 0, Y = X32 × (X42 + 1) + 0.2 ×  f or W = 1. In this case, SYw=0 |vec(X)w=0 = SYw=0 |◦Xw=0 ◦ = span(e1 ) ⊗ span(e1 , e2 ), SYw=1 |vec(X)w=1 = SYw=1 |◦Xw=1 ◦ = span(e2 ) ⊗ span(e3 , e4 ). Thus, (W )

SY |◦Xw=0 ◦ ⊕ SY |◦Xw=1 ◦ = SY |vec(X) = span(e1 ⊗ e1 , e1 ⊗ e2 , e2 ⊗ e3 , e2 ⊗ e4 ). Based on part (c) of Proposition 1, we have:

158

Y. Wang et al.

Table 9 Example A3, accuracy of estimates on partial folding subspace n Bench mark 200 400 600 3.2834 1000 1600

Individual ensemble 1.2022 (0.5059) 0.7100 (0.3220) 0.5246 (0.1436) 0.3977 (0.0924) 0.3007 (0.0680)

LSFA 1.2067 (0.5036) 0.6828 (0.2071) 0.5246 (0.1437) 0.3976 (0.0923) 0.3007 (0.0680)

Objective 2.2931 (0.4596) 1.8820 (0.3156) 1.7107 (0.3883) 1.7082 (0.3474) 1.6253 (0.4211)

Objective (pooled) 2.2329 (0.4736) 1.9276 (0.3190) 1.7484 (0.3846) 1.7279 (0.3612) 1.6443 (0.4064)

(W )

SY |◦X◦ = (span(e1 ) ⊕ span(e2 )) ⊗ (span(e1 , e2 ) ⊕ span(e3 , e4 )) = span(e1 ⊗ e1 , e1 ⊗ e2 , e1 ⊗ e3 , e1 ⊗ e4 , e2 ⊗ e1 , e2 ⊗ e2 , e2 ⊗ e3 , e2 ⊗ e4 ) (W )

 SY |◦Xw=0 ◦ ⊕ SY |◦Xw=1 ◦ = SY |vec(X) .

Still, the LSFA method only targets at SY |◦Xw=0 ◦ ⊕ SY |◦Xw=1 ◦ , which is a proper (W ) subspace of our desired space SY |◦X◦ . Simulation results for this example in Table 9 indicate that both individual ensemble method and LSFA method performs similarly. Example A4 In Example A4, we constrain that both the conditional central subspaces and partial central subspaces are as the same as conditional folding subspace and partial folding subspace, respectively. Since estimating partial folding subspace will greatly reduce number of parameters especially when the dimension is larger, here in this example, we are specifically interested in whether folding-based method can achieve higher accuracy than traditional methods such as partial SIR. We modify the conditional distribution as: Y = X11 × (X21 + 1) + 0.2 ×  f or W = 0, Y = X21 × (X31 + 1) + 0.2 ×  f or W = 1. In this case, SYw=0 |vec(X)w=0 = SYw=0 |◦Xw=0 ◦ = span(e1 ) ⊗ span(e1 , e2 ), SYw=1 |vec(X)w=1 = SYw=1 |◦Xw=1 ◦ = span(e1 ) ⊗ span(e2 , e3 ). And most importantly, (W )

(W )

SY |◦X◦ = SY |◦Xw=0 ◦ ⊕ SY |◦Xw=1 ◦ = SY |vec(X) = span(e1 ⊗ e1 , e1 ⊗ e2 , e1 ⊗ e3 ). Again, the results from Table 10 reflects individual and LSFA perform better than the other two approaches. Figure 3 summarizes the first two examples in the paper and Examples A1– A4. The three estimation methods can be interpreted as this: Since partial folding ) subspace SY(W|◦X◦ with its basis matrix must be presented as a Kronecker product,

Sufficient Dimension Folding with Categorical Predictors

159

Table 10 Example A4, accuracy of estimates on partial folding subspace n Bench mark 200 400 600 2.2925 1000 1600

Individual ensemble 0.8150 (0.3521) 0.4462 (0.1385) 0.3424 (0.1106) 0.2552 (0.0633) 0.1922 (0.0488)

LSFA 0.7827 (0.3055) 0.4462 (0.1383) 0.3424 (0.1106) 0.2552 (0.0633) 0.1922 (0.0488)

Objective 1.2986 (0.1844) 1.1472 (0.0820) 1.0889 (0.0509) 1.0488 (0.0255) 1.0314 (0.0175)

Objective (pooled) 1.2864 (0.1458) 1.1595 (0.0841) 1.1004 (0.0575) 1.0519 (0.0283) 1.0353 (0.0164)

Fig. 3 Visualization of conditional and partial folding subspaces: top left panel (Example 1, Completely overlap); top right panel (Example A1, moderately overlap); middle left panel (Example A2 , slightly overlap); middle right panel (Example 2, completely orthogonal); bottom left panel (Example A3, orthogonal, folding=vectorize); bottom right panel (Example A4, overlap, folding=vectorize)

160

Y. Wang et al.

it can only be covered by a “rectangle space.” Therefore, exhaustive methods including individual ensemble method and objective function method attempt to find one minimal “rectangle space” that covers both of the conditional folding subspaces. On the other hand, LSFA method estimates ⊕SY |◦Xw ◦ , which look for two minimal “rectangles” which cover all the conditional folding subspaces, thus can be smaller (W ) than partial folding subspace. Traditional partial central subspace SY |vec(X) , which stack the columns together, and its estimation method partial SIR look for “blocks” which cover all the conditional central subspaces. Example A5 Example A5 follows closely from Example 3, which intends to construct corresponding partial folding subspaces to be exactly the same as that of Examples 2 and A1. We illustrate the details of the experiment setting as follows: For W = 0, it follows exact same setting as in Example 3. For W = 1, however, the condition mean of X given Y is changed to:  E(X|Y = 1, W = 1) =

02×1

μI2

02×(p−3)

0(p−2)×1 0(p−2)×2 0(p−2)×(p−3)

 .

Correspondingly, the conditional covariance structure stay the same as in Example 3 except the index set A = {(1, 3), (2, 2)}. We can easily verify that the desired partial ) folding subspace SY(W|◦X◦ is the same as in Example A1. But for vectorized data vec(X), SYw=0 |vec(X)w=0 = span(e1 ⊗ e1 + e2 ⊗ e2 , e1 ⊗ e2 , e2 ⊗ e1 )  SYw=0 |◦Xw=0 ◦ , and SYw=1 |vec(X)w=1 = span(e2 ⊗ e1 + e3 ⊗ e2 , e3 ⊗ e1 , e2 ⊗ e2 )  SYw=1 |◦Xw=1 ◦ , thus (W )

(W )

SY |vec(X)  SY |◦X◦ . From Table 11, it appears that the objective function method with pooled variance provides smallest errors and smallest variability across all different sample size n. The individual direction ensemble method and LSFA method produce similar accuracy and stableness in Example A5. Example A6 Example A6 follows closely from Example 3, which intends to construct corresponding partial folding subspaces to be exactly the same as that of Example 2 and Example A1. In this example, its corresponding two conditional folding subspaces are less overlapped, leading to a larger partial folding subspace. For W = 0, it follows exact same setting as in Examples 3 and A5.

Sufficient Dimension Folding with Categorical Predictors

161

Table 11 Example A5, accuracy of estimates on partial folding subspace n Bench mark 200 400 600 3.0136 1000 1600

Individual ensemble 1.2225 (0.2671) 0.7793 (0.1704) 0.6428 (0.1284) 0.4660 (0.1034) 0.3701 (0.0822)

LSFA 1.2205 (0.2655) 0.7792 (0.1704) 0.6428 (0.1284) 0.4660 (0.1034) 0.3701 (0.0822 )

Objective 1.7997 (0.4134) 1.4358 (0.4458) 1.3478 (0.4272) 1.3212 (0.4528) 1.2234 (0.4680)

Objective (pooled) 0.8207 (0.1525) 0.5541 (0.1218) 0.4601 (0.0945) 0.3427 (0.0740) 0.2837 (0.0594)

Table 12 Example A6, accuracy of estimates on partial folding subspace

LSFA 1.5344 (0.4109)

Objective 2.5681 (0.4846)

1.0235 (0.2273)

1.0235 (0.2273)

1.8443 (0.6332)

0.7849 (0.1772)

0.7849 (0.1772)

1.5548 (0.6027)

1000

0.6363 (0.1286)

0.6363 (0.1286)

0.9386 (0.3634)

1600

0.4881 (0.1002)

0.4881 (0.1002)

0.6495 (0.1702)

n 200

Bench mark Individual ensemble 1.5339 (0.4108)

400 600

3.3838

Objective (pooled covariance) 1.0414 (0.2400) 0.7078 (0.1306) 0.5672 (0.1239) 0.4511 (0.0904) 0.3297 (0.0706)

For W = 1, however, the condition mean of X given response Y is changed to: ⎛

⎞ 01×1 01×2 01×(p−3) E(X|Y = 1, W = 1) = ⎝ 02×1 μI2 02×(p−3) ⎠ . 0(p−3)×1 0(p−3)×2 0(p−3)×(p−3) Correspondingly, the conditional covariance structure stay the same as in Example 3 except the index set A = {(2, 3), (3, 2)}. We can easily verify that the desired partial (W ) folding space SY |◦X◦ is the same as in Example A1. But for vectorized data vec(X), SYw=0 |vec(X)w=0 = span(e1 ⊗ e1 + e2 ⊗ e2 , e1 ⊗ e2 , e2 ⊗ e1 )  SYw=0 |◦Xw=0 ◦ , and SYw=1 |vec(X)w=1 = span(e2 ⊗ e2 + e3 ⊗ e3 , e2 ⊗ e3 , e3 ⊗ e2 )  SYw=1 |◦Xw=1 ◦ , thus (W )

(W )

SY |vec(X)  SY |◦X◦ .

162

Y. Wang et al.

300 200 0

100

Frequency

400

500

Hitogram of markers

0

5

10 15 20 Values of markers

25

30

300 200 0

100

Frequency

400

500

Hitogram of markers for female

0

5

10 15 20 Values of markers

25

30

40 30 0

10

20

Frequency

50

60

Hitogram of markers for male

0

Fig. 4 Histogram of the data

5

10 15 Values of markers

20

Sufficient Dimension Folding with Categorical Predictors

163

Fig. 5 Confidence intervals for estimated directions

The results are listed in Table 12; similarly, the proposed individual ensemble method and LSFA method outperform the third estimation method objective function optimization method, but the objective function optimization method with pooled covariance yields smallest error and standard deviations.

164

Y. Wang et al.

Three Histograms for the Real Data See Fig. 4.

The Bootstrap Confidence Interval Plots for Real Data See Fig. 5.

References F. Chiaromonte, R.D. Cook, B. Li, Sufficient dimension reduction in regressions with categorical predictors. Ann. Stat. 30, 475–497 (2002) R.D. Cook, On the interpretation of regression plots. J. Am. Stat. Assoc. 89, 177–189 (1994) R.D. Cook, Graphics for regressions with a binary response. J. Am. Stat. Assoc. 91, 983–992 (1996) R.D. Cook, Regression Graphics: Ideas for Studying Regressions Through Graphics (Wiley, New York, 1998) R.D. Cook, Testing predictor contribution in sufficient dimension reduction. Ann. Stat. 32, 1062– 1092 (2004) R.D. Cook, S. Weisberg, Discussion of “Sliced inverse regression for dimension reduction”. J. Am. Stat. Assoc. 86, 328–332 (1991) S. Ding, R.D. Cook, Dimension folding PCA and PFC for matrix-valued predictors. Stat. Sin. 24, 463–492 (2014) S. Ding, R.D. Cook, Tensor sliced inverse regression. J. Multivar. Anal. 133, 216–231 (2015) T.R. Fleming, D.P. Harrington, Counting process and survival analysis (Wiley, New York, 1991) IBM Big Data and Analytics Hub. The Four V’s of Big Data (2014). http://www.ibmbigdatahub. com/infographic/four-vs-big-data K.-C. Li, Sliced inverse regression for dimension reduction (with discussion). J. Am. Stat. Assoc. 86, 316–342 (1991) B. Li, S. Wang, On directional regression for dimension reduction. J. Am. Stat. Assoc. 102, 997– 1008 (2007) L. Li, X. Yin, Longitudinal data analysis using sufficient dimension reduction. Comput. Stat. Data Anal. 53, 4106–4115 (2009) B. Li, R.D. Cook, F. Chiaromonte, Dimension reduction for the conditional mean in regression with categorical predictors. Ann. Stat. 31, 1636–1668 (2003) B. Li, H. Zha, C. Chairomonte, Contour regression: a general approach to dimension reduction. Ann. Stat. 33, 1580–1616 (2005) B. Li, S. Wen, L. Zhu, On a projective resampling method for dimension reduction with multivariate responses. J. Am. Stat. Assoc. 103, 1177–1186 (2008) B. Li, M. Kim, N. Altman, On dimension folding of matrix- or array-valued statistical objects. Ann. Stat. 38, 1094–1121 (2010) W. Luo, B. Li, Combining eigenvalues and variation of eigenvectors for order determination. Biometrika 103, 875–887 (2016) R. Luo, H. Wang, C.L. Tsai, Contour projected dimension reduction. Ann. Stat. 37, 3743–3778 (2009)

Sufficient Dimension Folding with Categorical Predictors

165

J.R. Magnus, H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, 2nd edn. (Wiley, New York, 1999) P.A. Murtaugh, E.R. Dickson, G.M. Van Dam, M. Malinchoc, P.M. Grambsch, A.L. Langworthy, C.H. Gips, Primary biliary cirrhosis: prediction of short-term survival based on repeated patient visits. Hepatology 20, 126–134 (1994) Y. Pan, Q. Mai, X. Zhang, Covariate-adjusted tensor classification in high dimensions. J. Am. Stat. Assoc. 114, 1305–1319 (2019) R.M. Pfeiffer, L. Forzani, E. Bura, Sufficient dimension reduction for longitudinally measured predictors. Stat. Med. 31, 2414–2427 (2012) J.A. Talwalkar, K.D. Lindor, Primary biliary cirrhosis. Lancet 362, 53–61 (2003) Y. Xia, H. Tong, W. Li, L. Zhu, An adaptive estimation of dimension reduction. J. R. Stat. Soc. Ser. B 64, 363–410 (2002) Y. Xue, X. Yin, Sufficient dimension folding for regression mean function. J. Comput. Graph. Stat. 23, 1028–1043 (2014) Y. Xue, X. Yin, Sufficient dimension folding for a functional of conditional distribution of matrixor array-valued objects. J. Nonparametr. Stat. 27, 253–269 (2015) Y. Xue, X. Yin, X. Jiang, Ensemble sufficient dimension folding methods for analyzing matrixvalued data. Comput. Stat. Data Anal. 103, 193–205 (2016) Z. Ye, R.E. Weiss, Using the bootstrap to select one of a new class of dimension reduction methods. J. Am. Stat. Assoc. 98, 968–979 (2003) Y. Zhu, P. Zeng, Fourier methods for estimating the central subspace and the central mean subspace in regression. J. Am. Stat. Assoc. 101, 1638–1651 (2006)

Sufficient Dimension Reduction Through Independence and Conditional Mean Independence Measures Yuexiao Dong

1 Introduction As a useful tool in multivariate analysis, the goal of sufficient dimension reduction (SDR) (Cook, 1998) is to find relevant information between univariate response Y and p-dimensional predictor X through linear combinations of X. To facilitate making inference about the conditional distribution of Y | X, SDR looks for β0 ∈ Rp×d with the smallest possible column space such that Y

X | β0T X,

(1)

where “ " means independence. The corresponding column space is known as the central space and is denoted as SY |X . On the other hand, if the focus is on the conditional mean E(Y | X), SDR looks for η0 ∈ Rp×r with the smallest possible column space such that Y

E(Y | X) | η0T X.

(2)

The corresponding column space is known as the central mean space (Cook and Li, 2002) and is denoted as SE(Y |X) . The dimension of the column space of β0 (or η0 ) is referred to as the structural dimension of the central space (or the central mean space). Since the introduction of sliced inverse regression (SIR) (Li, 1991), many moment-based SDR methods have been introduced in the literature, such as sliced average variance estimation (SAVE) (Cook, 1991), directional regression (Li

Y. Dong () Department of Statistical Science, Temple University, Philadelphia, PA, USA e-mail: [email protected] © Springer Nature Switzerland AG 2021 E. Bura, B. Li (eds.), Festschrift in Honor of R. Dennis Cook, https://doi.org/10.1007/978-3-030-69009-0_8

167

168

Y. Dong

and Wang, 2007), cumulative slicing estimation (Zhu et al., 2010), and inverse conditional cumulants (Dong, 2016). All of the above methods need the linearity condition or the constant covariance condition for the predictor. To relax the predictor distribution assumptions of the moment-based methods, estimating equation-based SDR methods and semiparametric SDR methods are studied in Li and Dong (2009), Dong and Li (2010), Ma and Zhu (2012), and Dong et al. (2018). Some recent reviews of SDR methods can be found in Ma and Zhu (2013), Li (2018), and Dong (2021). As a measure of independence, distance covariance (Székely et al., 2007) gains great popularity in recent years as it measures all types of dependence between random vectors of arbitrary dimensions. The connection between distance covariance and SDR was first revealed by Sheng and Yin (2013), where distance covariance was used to recover the direction in single-index models. The extension to multi-index models was discussed in Sheng and Yin (2016). SDR through distance covariance (DCOV) enjoys many advantages over existing methods. Compared with the moment-based methods, DCOV does not require the linearity condition or the constant covariance condition for the predictor. Compared with the semiparametric methods, DCOV does not involve any nonparametric estimation. However, SDR through distance covariance has two limitations. First, the original distance covariance may not work well with heavy-tailed distributions. We expect DCOV for SDR to inherit this limitation. Second, DCOV only targets the central space but not the central mean space, as distance covariance is a measure of independence but not the conditional mean independence. In this paper, we propose a unified framework for SDR through a general family of independence and conditional mean independence measures. To address the first limitation of DCOV for SDR, we study α-distance covariance for SDR. The α-distance covariance is proposed to deal with heavy-tailed distributions in Székely and Rizzo (2013), and it includes the original distance covariance as a special case when α = 1. Our development then follows directly from the ideas of Sheng and Yin (2016). To address the second limitation of DCOV for SDR, we introduce the α-martingale difference divergence to measure the conditional mean independence and then develop estimators of the central mean space based on the newly proposed conditional mean independence measure. In the case of α = 1, the α-martingale difference divergence becomes the original martingale difference divergence introduced by Shao and Zhang (2014). Although our focus is sufficient dimension reduction, the generalized martingale difference divergence may be of independent interest as a robust measure of conditional mean independence. The rest of the paper is organized as follows. Estimation of the central space through α-distance covariance is studied in Sect. 2. In Sect. 3, we propose to estimate the central mean space through α-martingale difference divergence. Comprehensive simulations studies are provided in Sect. 4. The iris data is analyzed in Sect. 5, and we conclude the paper with some discussions in Sect. 6. For a complex-valued function f (·), the complex conjugate of f is denoted as f¯ and f 2 = f f¯. For a matrix A, A2 denotes its maximum singular value. For a pdimensional vector a, |a|p denotes its Euclidean norm. We use |a| instead of |a|1

Sufficient Dimension Reduction Through Independence and Conditional Mean. . .

169

when p = 1. For two vectors a and b of the same dimensionality, a, b = a T b denotes their inner product. We assume the structural dimension of the central space (or the central mean space) is known throughout the paper.

2 Estimating SY |X Through α-Distance Covariance 2.1 α-Distance Covariance For random vectors V ∈ Rq and U ∈ Rp and vectors t ∈ Rq and s ∈ Rp , denote the characteristic functions fV ,U (t, s) = E{exp(it, V +is, U )}, fV (t) = E{exp(it, V )}, and fU (s) = E{exp(is, U )}. Furthermore, let 2π p/2 (1 − α/2) 2π q/2 (1 − α/2) , C(q, α) = , α α2 ((p + α)/2) α2α ((q + α)/2)

C(p, α) =

α+p

α+q −1 } .

and wq,p (t, s) = {C(q, α)C(p, α)|s|p |t|q squared α-distance covariance is defined as

Then for 0 < α < 2, the

3 2 α (V , U )

=

Rp+q

fV ,U (t, s) − fV (t)fU (s)2 wq,p (t, s)dtds.

(3)

The integral in (3) is finite if E(|U |αp ) < ∞ and E(|V |αq ) < ∞. The properties of α (V , U ) are summarized in the next result. Proposition 1 For 0 < α < 2, suppose E(|U |αp ) < ∞, E(|V |αq ) < ∞, and E(|U |αp |V |αq ) < ∞. Then (i) α (V , U ) = 0 if and only if V U . (ii) Let (V , U ), (V  , U  ) and (V  , U  ) be i.i.d. copies. Then 2 α (V , U )

=E(|U − U  |αp |V − V  |αq ) + E(|U − U  |αp )E(|V − V  |αq ) − E(|U − U  |αp |V − V  |αq ) − E(|U − U  |αp |V − V  |αq ).

(4)

The first part of Proposition 1 states that α (V , U ) is a measure of independence. The second part provides an alternative formula for 2α (V , U ), which will be used later for the sample level estimation. When α = 1, α (V , U ) becomes the original distance covariance in Székely et al. (2007). Extension to α-distance covariance with 0 < α < 2 is discussed in Székely and Rizzo (2009). Székely and Rizzo (2013) suggested that α-distance covariance can be used to deal with heavy-tailed distributions.

170

Y. Dong

2.2 Estimation of the Central Space Now we turn to sufficient dimension reduction with X ∈ Rp and Y ∈ R. Denote  = Var(X) and let Id be the d × d identity matrix. For β ∈ Rp×d , let P (β) = β(β T β)−1 β T  and Q (β) = Ip −P (β). Recall that β0 ∈ Rp×d denotes a basis for the central space SY |X . We consider β ∗ = arg max β∈Rp×d

2 T α (Y, β X)

subject to β T β = Id .

(5)

If we set α = 1, (5) becomes essentially the same optimization problem considered in Sheng and Yin (2016). Let Span(·) denote the column space. The next result justifies using Span(β ∗ ) to recover the central space. Proposition 2 Let β0 ∈ Rp×d be a basis of the central space satisfying β0T β0 = Id . Suppose P (β0 )T X Q (β0 )T X. Then Span(β ∗ ) = SY |X . As discussed in Sect. 3.5 of Sheng and Yin (2013), the independence condition P (β0 )T X Q (β0 )T X is not as strong as it seems to be, and it could be satisfied asymptotically when p is reasonably large. At the sample level, let {(Xi , Yi ), i = 1, . . . , n} be i.i.d. samples. Parallel to (5), we consider βˆ ∗ = arg max ˆ 2α (Y, β T X) subject to β T β = Id ,

(6)

β∈Rp×d

where ˆ 2α (Y, β T X) is computed as the sample version of

2 (Y, β T X) α

in (4),

n 1  T |β (Xk − X )|αd |Yk − Y |α n2 k,=1 ⎛ ⎞⎛ ⎞ n n   1 1 +⎝ 2 |β T (Xk − X )|αd ⎠ ⎝ 2 |Yk − Y |α ⎠ n n

ˆ 2α (Y, β T X) =

k,=1



2 n3

n 

k,=1

|β T (Xk − X )|αd |Yk − Ym |α .

k,,m=1

The next result characterizes the consistency of βˆ ∗ up to a rotation transformation of β ∗ . Proposition 3 Let β0 ∈ Rp×d be a basis of the central space satisfying β0T β0 = Q (β0 )T X, E(|Y |α ) < ∞, and the support of X is Id . Suppose P (β0 )T X compact. Then there exists a rotation matrix M ∈ Rd×d such that M T M = Id and P βˆ ∗ −→ β ∗ M.

Sufficient Dimension Reduction Through Independence and Conditional Mean. . .

171

3 Estimating SE(Y |X) Through α-Martingale Difference Divergence 3.1 α-Martingale Difference Divergence For random vectors V ∈ Rq and U ∈ Rp and vector s ∈ Rp , recall that fU (s) = E{exp(is, U )} denotes the characteristic function of U . Denote gV ,U (s) = E{V exp(is, U )} and gV = E(V ). For 0 < α < 2, the squared α-martingale difference divergence is defined as 3 !2α (V | U ) =

Rp

gV ,U (s) − gV fU (s)2 ρp (s)ds,

(7)

where ρp (s) = {C(p, α)|s|p }−1 . When we compare !2α (V | U ) in (7) with 2 (V , U ) in (3), we see that they bare close resemblance. The difference is that α fV ,U and fV in (3) are replaced by gV ,U and gV in (7), and the weight function is modified accordingly. The properties of α-martingale difference divergence are summarized in the next result. α+p

Proposition 4 For 0 < α < 2, suppose E(|U |αp ) < ∞, E(|V |2q ) < ∞, and E(|U |αp |V |2q ) < ∞. Then (i) !α (V | U ) = 0 if and only if E(V | U ) = E(V ) almost surely. (ii) Let (V , U ) and (V  , U  ) be i.i.d. copies. Then !2α (V | U ) = −E{(V − E(V ))T (V  − E(V  ))|U − U  |αp }.

(8)

The first part of Proposition 4 states that !α (V | U ) is a measure of conditional mean independence. The second part provides an easy formula for approximation at the sample level. When α = 1, !α (V | U ) becomes the original martingale difference divergence introduced in Shao and Zhang (2014).

3.2 Estimation of the Central Mean Space Now we turn to sufficient dimension reduction with X ∈ Rp and Y ∈ R. Recall that η0 ∈ Rp×r denotes a basis for the central mean space SE(Y |X) . We consider η∗ = arg max !2α (Y | ηT X) subject to ηT η = Ir .

(9)

η∈Rp×r

The optimization in Eq. (9) is parallel to the optimization problem in Eq. (5). Instead of maximizing the α-distance covariance between Y and β T X, we now maximize

172

Y. Dong

the α-martingale difference divergence of Y conditional on ηT X. While the column space of the optimal β ∗ recovers the central space, the next result states that the column space of the optimal η∗ coincides with the central mean space. Proposition 5 Let η0 ∈ Rp×r be a basis of the central mean space satisfying η0T η0 = Ir . Suppose P (η0 )T X Q (η0 )T X. Then Span(η∗ ) = SE(Y |X) .  Given i.i.d. sample {(Xi , Yi ), i = 1, . . . , n}, denote Y¯ = n−1 nk=1 Yk . The sample version of !2α (Y | ηT X) in (8) is calculated as ˆ 2α (Y | ηT X) = − !

n 1  (Yk − Y¯ )(Y − Y¯ )|ηT (Xk − X )|αr . n2 k,=1

Then we estimate η∗ in (9) by ˆ 2α (Y | ηT X) subject to ηT η = Ir . ηˆ ∗ = arg max !

(10)

η∈Rp×r

When α = 1, (10) leads to the estimator in Zhang et al. (2019).

4 Simulation Studies 4.1 Model Setup We use synthetic data to examine the performances of the proposed procedures in this section. Four cases are used to generate X = (x1 , x2 , . . . , xp )T . In case (i), X is multivariate normal with mean 0 and covariance matrix  = {σij }, where σij = .5|i−j | for 1 ≤ i, j ≤ p. In case (ii), X is multivariate t5 with mean 0 and the same covariance matrix  as in case (i). In case (iii), xi is independent Poisson(1) for i = 1, . . . , p. In case (iv), xi is independent normal mixture τ N (0, 1) + (1 − τ )N (1, 1) with τ = 0.95. We fix p = 6 in all simulation settings. Let β1 = (1, 0, 0, 0, 0, 0)T , β2 = (0, 1, 0, 0, 0, 0)T , β3 = (1, 1, 0, 0, 0, 0)T , and β4 = (0, 0, 1, 1, 0, 0)T . The following three models are used to generate Y , Model 1: Y = (β1T X)2 + β2T X + 0.1 Model 2: Y = exp(β3T X) Model 3: Y = exp(β3T X − 1) + c(β4T X + 1) where  is standard normal independent of X. The constant c is taken to be either 1 or 5. In addition to α-distance covariance (DCOV) and α-martingale difference divergence (MDD), we include the following methods for the comparison: SIR


Table 1 Model 1 results of estimating the central space. Mean and standard error of δ_1 = ‖P_β̂ − P_β0‖_2 are reported based on 100 repetitions

Model 1     SIR              SAVE             DR               LAD              DCOV α = 0.5     α = 1            α = 1.5
Case (i)    0.3055 (0.1081)  0.7000 (0.2400)  0.3442 (0.1144)  0.3245 (0.1197)  0.2645 (0.0940)  0.2978 (0.0972)  0.5347 (0.1758)
Case (ii)   0.2979 (0.1005)  0.9114 (0.1336)  0.4679 (0.1842)  0.3213 (0.1218)  0.2596 (0.1116)  0.5067 (0.2125)  0.7290 (0.1693)
Case (iii)  0.2395 (0.0847)  0.8687 (0.1704)  0.3633 (0.1533)  --               0.0324 (0.0889)  0.1917 (0.1366)  0.4112 (0.1638)
Case (iv)   0.2320 (0.0652)  0.6587 (0.2548)  0.2804 (0.1008)  0.2436 (0.0776)  0.1942 (0.0534)  0.2126 (0.0706)  0.4372 (0.1785)

To recover the central space by maximizing the sample version of 𝒱_α^2(Y, β^T X) over β ∈ R^{p×d}, we use the MATLAB function fmincon, which can implement optimization under the constraint in (6). An initial value is needed for the numerical optimization. We use the SIR, SAVE, and DR estimators as initial values and choose the one with the largest sample 𝒱_α^2 value as the final initial value. In the case of estimating the central mean space, we maximize Δ̂_α^2(Y | η^T X) over η ∈ R^{p×r}, and we choose the initial value with the largest Δ̂_α^2 value analogously. Denote P(β̂) = β̂(β̂^T β̂)^{-1} β̂^T and P(β_0) = β_0(β_0^T β_0)^{-1} β_0^T. To measure the performance of a central space estimator β̂, we use δ_1 = ‖P_β̂ − P_β0‖_2, where ‖ · ‖_2 is the maximum singular value of a matrix. Similarly, for a central mean space estimator η̂, we use δ_2 = ‖P_η̂ − P_η0‖_2. We use the true structural dimensions of β_0 and η_0 in all the simulation studies.
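To make the simulation design and the criterion δ_1 concrete, here is a small NumPy sketch (my own, with made-up function names; not the authors' code) that generates case (i) predictors and Model 1 responses and evaluates δ_1 for a candidate basis.

```python
import numpy as np

def generate_case_i(n, p=6, rho=0.5, seed=0):
    """Case (i): X ~ N(0, Sigma) with Sigma_ij = 0.5^|i-j|."""
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return rng.normal(size=(n, p)) @ np.linalg.cholesky(Sigma).T

def model1(X, seed=0):
    """Model 1: Y = (beta1' X)^2 + beta2' X + 0.1 * eps."""
    rng = np.random.default_rng(seed)
    beta1 = np.array([1.0, 0, 0, 0, 0, 0])
    beta2 = np.array([0, 1.0, 0, 0, 0, 0])
    return (X @ beta1) ** 2 + X @ beta2 + 0.1 * rng.normal(size=X.shape[0])

def delta1(B_hat, B0):
    """delta_1 = || P_{B_hat} - P_{B0} ||_2, the maximum singular value
    of the difference between the two projection matrices."""
    def proj(B):
        B = np.asarray(B, dtype=float)
        if B.ndim == 1:
            B = B[:, None]
        return B @ np.linalg.solve(B.T @ B, B.T)
    return np.linalg.norm(proj(B_hat) - proj(B0), ord=2)

X = generate_case_i(n=100)
Y = model1(X)
B0 = np.array([[1.0, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0]])      # true basis, d = 2
B_try = B0 + 0.1 * np.random.default_rng(2).normal(size=B0.shape)      # perturbed candidate
print(round(delta1(B_try, B0), 3))
```

The same criterion, computed for the estimators listed above, is what is reported in Tables 1 through 3.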

4.2 Comparisons of Estimating the Central Space

Based on 100 repetitions from Model 1, we report the mean and the standard error of δ_1 in Table 1. The sample size is fixed at n = 100, and three α values, 0.5, 1, and 1.5, are specified for the α-distance covariance. The structural dimension of β_0 is d = 2 for Model 1. The α-distance covariance estimator with α = 0.5 consistently enjoys the best performance across all four cases of predictor distribution. As expected, when X changes from the normal distribution in case (i) to the t distribution in case (ii), the performance of DCOV with α = 1 or α = 1.5 drops significantly, while DCOV with α = 0.5 is not sensitive to this change. LAD does not work with discrete predictors and is therefore not included in the comparison when X is Poisson as in case (iii).

The results of Model 2 and Model 3 are summarized in Tables 2 and 3, respectively. We report the mean and the standard error of δ_1 based on 100 repetitions. The sample size is fixed at n = 100, and we set c = 1 for Model 3.


Table 2 Model 2 results of estimating the central space. Mean and standard error of δ_1 = ‖P_β̂ − P_β0‖_2 are reported based on 100 repetitions

Model 2     SIR              SAVE             DR               LAD              DCOV α = 0.5     α = 1            α = 1.5
Case (i)    0.8293 (0.1720)  0.5688 (0.2513)  0.3752 (0.1183)  0.3442 (0.1297)  0.1839 (0.0771)  0.2275 (0.0829)  0.2836 (0.1007)
Case (ii)   0.8681 (0.1514)  0.8018 (0.1911)  0.5564 (0.1876)  0.4267 (0.1761)  0.1974 (0.0755)  0.2905 (0.1278)  0.4727 (0.2270)
Case (iii)  0.5317 (0.2026)  0.4113 (0.2117)  0.3180 (0.1504)  --               0.0000 (0.0000)  0.0017 (0.0069)  0.1889 (0.1928)
Case (iv)   0.8888 (0.1577)  0.4986 (0.2250)  0.3251 (0.0969)  0.3012 (0.1092)  0.1628 (0.0501)  0.2044 (0.0633)  0.2459 (0.0756)

Table 3 Model 3 results of estimating the central space with c = 1. Mean and standard error of δ_1 = ‖P_β̂ − P_β0‖_2 are reported based on 100 repetitions

Model 3     SIR              SAVE             DR               LAD              DCOV α = 0.5     α = 1            α = 1.5
Case (i)    0.8429 (0.1376)  0.9102 (0.1204)  0.8044 (0.1644)  0.7640 (0.1751)  0.6472 (0.2006)  0.7258 (0.1936)  0.8070 (0.1723)
Case (ii)   0.8920 (0.1386)  0.9513 (0.0623)  0.8678 (0.1243)  0.8149 (0.1835)  0.7085 (0.2219)  0.8431 (0.1480)  0.9018 (0.1070)
Case (iii)  0.7403 (0.2069)  0.9417 (0.0768)  0.8871 (0.1385)  --               0.7925 (0.2067)  0.8345 (0.1737)  0.8834 (0.1276)
Case (iv)   0.6958 (0.2014)  0.8606 (0.1423)  0.7406 (0.2173)  0.6781 (0.2179)  0.4620 (0.2363)  0.5555 (0.2408)  0.6646 (0.2238)

Similar to the results in Table 1, we see that DCOV with α = 0.5 leads to the best central space estimator across all four cases in both Model 2 and Model 3.

4.3 Comparisons of Estimating the Central Mean Space

We now examine the performance of our proposals for estimating the central mean space. We revisit Model 3 and now include MDD in the comparison. We set α = 1 for both DCOV and MDD. The sample size is set to n = 50, 100, or 200. The mean and the standard error of δ_2 based on 100 repetitions are summarized in Table 4. While the results in Table 3 are based on the structural dimension d = 2 of the central space, the results in Table 4 are based on the structural dimension r = 1 of the central mean space. We consider c = 1 and c = 5 for the variance component in Model 3. The performances of MDD and DCOV are similar when c = 1, and both are better than the other competitors. For c = 5, the larger conditional variance of Y given X makes it more difficult to accurately recover the conditional mean of Y given X.


Table 4 Model 3 results of estimating the central mean space. Mean and standard error of δ_2 = ‖P_η̂ − P_η0‖_2 are reported based on 100 repetitions

Model 3          SIR              SAVE             DR               LAD              DCOV             MDD
c = 1  n = 50    0.6734 (0.1903)  0.9366 (0.0783)  0.7217 (0.1840)  0.7756 (0.1924)  0.5640 (0.1706)  0.5655 (0.1761)
       n = 100   0.5163 (0.1513)  0.8922 (0.1318)  0.5966 (0.1933)  0.6052 (0.1882)  0.4238 (0.1481)  0.4379 (0.1524)
       n = 200   0.3940 (0.1145)  0.8206 (0.1861)  0.4184 (0.1343)  0.4205 (0.1306)  0.3074 (0.1013)  0.3004 (0.0989)
c = 5  n = 50    0.9167 (0.1040)  0.9359 (0.1007)  0.9163 (0.1053)  0.9224 (0.0946)  0.8799 (0.1276)  0.8579 (0.1389)
       n = 100   0.8949 (0.1205)  0.9362 (0.0763)  0.8941 (0.1336)  0.9293 (0.1111)  0.8633 (0.1595)  0.8337 (0.1306)
       n = 200   0.8381 (0.1489)  0.9573 (0.0594)  0.9426 (0.0734)  0.9505 (0.0938)  0.8951 (0.1032)  0.7098 (0.1693)

In this challenging setting, MDD enjoys the best performance for all three sample sizes. While MDD always improves as the sample size increases, DCOV is not a consistent estimator of the central mean space and may deteriorate with increasing sample size. We remark that existing central mean space estimators include ordinary least squares (Li and Duan, 1989), principal Hessian directions (Li, 1992), iterative Hessian transformation (Cook and Li, 2002), minimum average variance estimation (Xia et al., 2002), the Fourier transformation method (Zhu and Zeng, 2006), and the semiparametric approach (Ma and Zhu, 2014). Please refer to Zhang et al. (2019) for a more comprehensive comparison between MDD and the aforementioned central mean space estimators.

5 Analysis of the Iris Data

For the real data analysis, we use the iris data that can be downloaded from the UCI machine learning repository. The archive contains three classes of irises. We consider the first two classes (setosa and versicolor) with a total of 100 observations. The response is 0 if the iris class is setosa and 1 if the class is versicolor. The four predictors are sepal length, sepal width, petal length, and petal width.

First we set the structural dimension d = 1 and implement α-distance covariance with α = 0.5, 1, and 1.5. Denote the resulting estimators as β̂_i, i = 1, 2, 3. The Pearson correlation between X^T β̂_i and X^T β̂_j for any i ≠ j is at least 0.9998, suggesting that all three estimators agree with each other for this data set. Next we randomly pick 30% of the observations and switch their responses. While the predictors are the same as in the original data, the response is now artificially contaminated. This process is repeated 100 times. In each repetition, we rerun the analysis for the three methods based on the perturbed data and denote the corresponding estimators as β̂_i^(b), i = 1, 2, 3, b = 1, 2, . . . , 100. We summarize the sample Pearson correlation ρ between X^T β̂_i and X^T β̂_i^(b) in Fig. 1. From the boxplots, we see that the DCOV estimator with α = 0.5 is the most resistant to data contamination.

Fig. 1 Boxplots of sample Pearson correlation ρ based on 100 repetitions (vertical axis: ρ; one boxplot each for DCOV with α = 1.5, α = 1, and α = 0.5)
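The contamination experiment can be reproduced in outline as follows. This is my own sketch, not the chapter's code: it uses scikit-learn's bundled copy of the iris data, a V-statistic version of the sample α-distance covariance, and a generic optimizer for the d = 1 direction, so the numbers will not match the chapter's exactly. All function names are made up.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import load_iris

def alpha_dcov_sq(V, U, alpha=1.0):
    """V-statistic version of the squared alpha-distance covariance,
    computed from double-centered pairwise distance matrices."""
    def centered_dist(W):
        W = np.asarray(W, dtype=float)
        if W.ndim == 1:
            W = W[:, None]
        D = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2) ** alpha
        return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()
    return (centered_dist(V) * centered_dist(U)).mean()

def dcov_direction(X, Y, alpha=0.5, seed=0):
    """d = 1 DCOV direction: maximize alpha-dcov^2(Y, beta'X) over unit-length beta."""
    p = X.shape[1]
    def neg_obj(b):
        b = b / np.linalg.norm(b)
        return -alpha_dcov_sq(Y, X @ b, alpha=alpha)
    b0 = np.random.default_rng(seed).normal(size=p)
    res = minimize(neg_obj, b0, method="Nelder-Mead", options={"maxiter": 4000})
    b = res.x / np.linalg.norm(res.x)
    return b if b[np.argmax(np.abs(b))] > 0 else -b     # fix the sign for comparability

iris = load_iris()
keep = iris.target < 2                                  # setosa and versicolor only
X, y = iris.data[keep], iris.target[keep].astype(float)

beta_hat = dcov_direction(X, y, alpha=0.5)

rng = np.random.default_rng(1)
flip = rng.choice(len(y), size=int(0.3 * len(y)), replace=False)
y_pert = y.copy()
y_pert[flip] = 1 - y_pert[flip]                         # switch 30% of the responses
beta_pert = dcov_direction(X, y_pert, alpha=0.5, seed=1)

rho = np.corrcoef(X @ beta_hat, X @ beta_pert)[0, 1]
print(round(abs(rho), 4))                               # one replication of the Fig. 1 experiment
```

Repeating the contamination step 100 times and collecting ρ for α = 0.5, 1, and 1.5 gives boxplots of the kind summarized in Fig. 1.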

6 Conclusion

In this paper, we make two extensions of the DCOV-based sufficient dimension reduction of Sheng and Yin (2016). Through the α-distance covariance, we obtain central space estimators that are more accurate in the presence of heavy-tailed predictor distributions and data contamination. Through the α-martingale difference divergence, we obtain central mean space estimators; the central mean space may be a proper subspace of the central space. Chen et al. (2015) proposed using conditional distance covariance for diagnostic studies in sufficient dimension reduction. Their idea can be applied in our setting to estimate the structural dimension of the central space through a sequential test procedure. Furthermore, the same diagnostic tool can be used to determine α in the α-distance covariance in a data-driven manner. Chen et al. (2018) proposed a sparse dimension reduction estimator via penalized distance covariance. Adding a penalty to the α-distance covariance or the α-martingale difference divergence is worth future investigation.


Acknowledgments The author sincerely thanks the editor and two anonymous referees for useful comments that led to a much improved presentation of the paper.

Appendix

Proof of Proposition 1 For part (i), since the weight function w_{q,p}(t, s) is positive, 𝒱_α(V, U) = 0 if and only if f_{V,U}(t, s) = f_V(t) f_U(s) for almost all s and t. Thus, as long as it is well defined, 𝒱_α(V, U) is zero if and only if V and U are independent. The proof of part (ii) follows directly from the proof of Theorem 7 in Székely and Rizzo (2009) and is thus omitted.

The following lemma is needed before we prove Proposition 2. Its proof follows directly from the proof of Theorem 3 in Székely and Rizzo (2009) and is thus omitted.

Lemma 1 For random vectors V_1, V_2 ∈ R^q and U_1, U_2 ∈ R^p, assume E(|U_1|_p^α) < ∞, E(|U_2|_p^α) < ∞, E(|V_1|_q^α) < ∞, and E(|V_2|_q^α) < ∞. Denote 𝒱_α as the square root of 𝒱_α^2. If [V_1^T, U_1^T]^T is independent of [V_2^T, U_2^T]^T, then

𝒱_α(V_1 + V_2, U_1 + U_2) ≤ 𝒱_α(V_1, U_1) + 𝒱_α(V_2, U_2).

Equality holds if and only if U_1 and V_1 are both constants, or U_2 and V_2 are both constants, or U_1, U_2, V_1, V_2 are mutually independent.

Proof of Proposition 2 We follow the proof of Proposition 2 in Sheng and Yin (2016). For any β ∈ R^{p×d}, there exists a rotation matrix M ∈ R^{d×d} such that βM = [β_a, β_b], where Span(β_a) ⊆ Span(β_0) and Span(β_b) ⊆ Span(β_0)^⊥, with Span(β_0)^⊥ denoting the orthogonal complement of Span(β_0). From (1), we have Y ⊥⊥ β_b^T X | β_0^T X. Together with P(β_0)^T X ⊥⊥ Q(β_0)^T X, we have (Y, X^T β_0) ⊥⊥ β_b^T X. It follows that (Y, X^T β_a) ⊥⊥ β_b^T X. Let U_1 = [X^T β_a, 0]^T, U_2 = [0, X^T β_b]^T, V_1 = Y, and V_2 = 0. Then [V_1, U_1^T]^T is independent of [V_2, U_2^T]^T. According to Lemma 1,

𝒱_α(Y, M^T β^T X) = 𝒱_α(Y, [X^T β_a, X^T β_b]^T) = 𝒱_α(V_1 + V_2, U_1 + U_2)
  ≤ 𝒱_α(V_1, U_1) + 𝒱_α(V_2, U_2) = 𝒱_α(Y, β_a^T X).    (11)

On the other hand, M being a rotation matrix implies that MM^T = M^T M = I_d and |M^T β^T(X − X′)|_d = |β^T(X − X′)|_d. It follows from Proposition 1 that

𝒱_α(Y, M^T β^T X) = 𝒱_α(Y, β^T X).    (12)

Similarly, Span(β_a) ⊆ Span(β_0) implies |β_a^T(X − X′)|_{d_a} ≤ |β_0^T(X − X′)|_d, where d_a is the number of columns of β_a. Applying Proposition 1, we have

𝒱_α(Y, β_a^T X) ≤ 𝒱_α(Y, β_0^T X).    (13)

(11), (12), and (13) together lead to 𝒱_α(Y, β^T X) ≤ 𝒱_α(Y, β_0^T X). We get equality if and only if Span(β_a) = Span(β_0), in which case β_b vanishes. Since β* maximizes 𝒱_α^2(Y, β^T X) over β ∈ R^{p×d}, we must have Span(β*) = Span(β_0) = S_{Y|X}. □

Proof of Proposition 3 The proof follows directly from the proof of Proposition 3 in Sheng and Yin (2016) and is thus omitted.

Proof of Proposition 4 For part (i), note that Δ_α(V | U) = 0 if and only if g_{V,U}(s) = g_V f_U(s) for almost all s, that is, E[{V − E(V)} exp(i⟨s, U⟩)] = 0 for almost all s. Thus Δ_α(V | U) = 0 if and only if E(V) = E(V | U) almost surely. For part (ii), the proof of Theorem 1 in Shao and Zhang (2014) can be followed directly.

Proof of Proposition 5 Denote η_{0⊥} ∈ R^{p×(p−r)} as a basis for the orthogonal complement of Span(η_0). We choose η_0 and η_{0⊥} such that η_0^T η_0 = I_r, η_{0⊥}^T η_0 = 0, and η_{0⊥}^T η_{0⊥} = I_{p−r}. For any η ∈ R^{p×r} satisfying η^T η = I_r, there exist A ∈ R^{r×r} and C ∈ R^{(p−r)×r} such that η = η_0 A + η_{0⊥} C. Then

I_r = η^T η = (η_0 A + η_{0⊥} C)^T (η_0 A + η_{0⊥} C) = A^T A + C^T C.    (14)

For s ∈ R^r, because P(η_0)^T X ⊥⊥ Q(η_0)^T X, we have

E(Y) E(e^{i⟨s, η^T X⟩}) = E(Y) E(e^{i⟨s, A^T η_0^T X⟩} e^{i⟨s, C^T η_{0⊥}^T X⟩}) = E(Y) E(e^{i⟨s, A^T η_0^T X⟩}) E(e^{i⟨s, C^T η_{0⊥}^T X⟩}).    (15)

Note that (2) implies E(Y | X) = E(Y | η_0^T X), and we have

E(Y e^{i⟨s, η^T X⟩}) = E{E(Y | X) e^{i⟨s, η^T X⟩}}
  = E{E(Y | η_0^T X) e^{i⟨s, A^T η_0^T X⟩} e^{i⟨s, C^T η_{0⊥}^T X⟩}}
  = E{E(Y | η_0^T X) e^{i⟨s, A^T η_0^T X⟩}} E(e^{i⟨s, C^T η_{0⊥}^T X⟩})
  = E(Y e^{i⟨s, A^T η_0^T X⟩}) E(e^{i⟨s, C^T η_{0⊥}^T X⟩}).    (16)

(15), (16), and the definition of Δ_α^2 in (7) together lead to

Δ_α^2(Y | η^T X) = ∫_{R^r} |E(Y e^{i⟨s, η^T X⟩}) − E(Y) E(e^{i⟨s, η^T X⟩})|^2 ρ_r(s) ds
  = ∫_{R^r} |E(Y e^{i⟨s, A^T η_0^T X⟩}) − E(Y) E(e^{i⟨s, A^T η_0^T X⟩})|^2 |E(e^{i⟨s, C^T η_{0⊥}^T X⟩})|^2 ρ_r(s) ds
  ≤ ∫_{R^r} |E(Y e^{i⟨s, A^T η_0^T X⟩}) − E(Y) E(e^{i⟨s, A^T η_0^T X⟩})|^2 ρ_r(s) ds
  = Δ_α^2(Y | A^T η_0^T X).    (17)


On the other hand, it follows from (14) that

|A^T η_0^T(X − X′)|_r^2 = (X − X′)^T η_0 A A^T η_0^T (X − X′) ≤ (X − X′)^T η_0 I_r η_0^T (X − X′) = |η_0^T(X − X′)|_r^2.    (18)

(17), (18), and equation (8) from Proposition 4 together lead to

Δ_α^2(Y | η^T X) ≤ Δ_α^2(Y | A^T η_0^T X) = −E{(Y − E(Y))(Y′ − E(Y′)) |A^T η_0^T(X − X′)|_r^α}
  ≤ −E{(Y − E(Y))(Y′ − E(Y′)) |η_0^T(X − X′)|_r^α} = Δ_α^2(Y | η_0^T X).

We get equality if and only if A^T A = I_r, in which case C vanishes and η = η_0 A. Since η* maximizes Δ_α^2(Y | η^T X) over η ∈ R^{p×r}, we must have Span(η*) = Span(η_0) = S_{E(Y|X)}. □

References

X. Chen, R.D. Cook, C. Zou, Diagnostic studies in sufficient dimension reduction. Biometrika 102, 545–558 (2015)
X. Chen, W. Sheng, X. Yin, Efficient sparse estimate of sufficient dimension reduction in high dimension. Technometrics 60, 161–168 (2018)
R.D. Cook, Regression Graphics: Ideas for Studying Regressions Through Graphics (Wiley, New York, 1998)
R.D. Cook, L. Forzani, Likelihood-based sufficient dimension reduction. J. Am. Stat. Assoc. 104, 197–208 (2009)
R.D. Cook, B. Li, Dimension reduction for the conditional mean. Ann. Stat. 30, 455–474 (2002)
R.D. Cook, S. Weisberg, Discussion of sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 86, 28–33 (1991)
Y. Dong, A note on moment-based sufficient dimension reduction estimators. Stat. Interface 9, 141–145 (2016)
Y. Dong, A brief review of linear sufficient dimension reduction through optimization. J. Stat. Plann. Inference 211, 154–161 (2021)
Y. Dong, B. Li, Dimension reduction for non-elliptically distributed predictors: second order methods. Biometrika 97, 279–294 (2010)
Y. Dong, Q. Xia, C. Tang, Z. Li, On sufficient dimension reduction with missing responses through estimating equations. Comput. Stat. Data Anal. 126, 67–77 (2018)
B. Li, Sufficient Dimension Reduction: Methods and Applications with R (CRC Press, 2018)
B. Li, Y. Dong, Dimension reduction for non-elliptically distributed predictors. Ann. Stat. 37, 1272–1298 (2009)
B. Li, S. Wang, On directional regression for dimension reduction. J. Am. Stat. Assoc. 479, 997–1008 (2007)
K.C. Li, Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 86, 316–327 (1991)
K.C. Li, On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma. J. Am. Stat. Assoc. 87, 1025–1039 (1992)


K.C. Li, N. Duan, Regression analysis under link violation. Ann. Stat. 17, 1009–1052 (1989)
Y. Ma, L.P. Zhu, A semiparametric approach to dimension reduction. J. Am. Stat. Assoc. 107, 168–179 (2012)
Y. Ma, L. Zhu, A review on dimension reduction. Int. Stat. Rev. 81, 134–150 (2013)
Y. Ma, L.P. Zhu, On estimation efficiency of the central mean subspace. J. R. Stat. Soc. Ser. B 76, 885–901 (2014)
X. Shao, J. Zhang, Martingale difference correlation and its use in high dimensional variable screening. J. Am. Stat. Assoc. 109, 1302–1318 (2014)
W. Sheng, X. Yin, Direction estimation in single-index models via distance covariance. J. Multivar. Anal. 122, 148–161 (2013)
W. Sheng, X. Yin, Sufficient dimension reduction via distance covariance. J. Comput. Graph. Stat. 25, 91–104 (2016)
G.J. Székely, M.L. Rizzo, Brownian distance covariance. Ann. Appl. Stat. 3, 1236–1265 (2009)
G.J. Székely, M.L. Rizzo, Energy statistics: a class of statistics based on distances. J. Stat. Plann. Inference 143, 1249–1272 (2013)
G.J. Székely, M.L. Rizzo, N.K. Bakirov, Measuring and testing dependence by correlation of distances. Ann. Stat. 35, 2769–2794 (2007)
Y. Xia, H. Tong, W. Li, L. Zhu, An adaptive estimation of dimension reduction space. J. R. Stat. Soc. Ser. B 64, 363–410 (2002)
Y. Zhang, J. Liu, Y. Wu, X. Fang, A martingale-difference-divergence-based estimation of central mean subspace. Stat. Interface 12, 489–500 (2019)
Y. Zhu, P. Zeng, Fourier methods for estimating the central subspace and the central mean subspace in regression. J. Am. Stat. Assoc. 101, 1638–1651 (2006)
L.P. Zhu, L.X. Zhu, Z.H. Feng, Dimension reduction in regressions through cumulative slicing estimation. J. Am. Stat. Assoc. 105, 1455–1466 (2010)

Cook's Fisher Lectureship Revisited for Semi-supervised Data Reduction

Jae Keun Yoo

1 Introduction

Sufficient dimension reduction (SDR) aims to reduce the dimension of the predictors X ∈ R^p without loss of information on the regression of y|X (Cook, 1998). From now on, without any specific mention, the dimension r of the response is assumed to be equal to one. The foundations and related methods of sufficient dimension reduction are well described in Cook (1998) and Li (2018). Usual SDR methods estimate the related dimension reduction subspaces nonparametrically under a few assumptions, mainly focused on the marginal distribution of X.

R. D. Cook's Fisher lectureship (Cook, 2007) opened a new and seminal paradigm in the sufficient dimension reduction literature, and various SDR methods (Adragni and Cook, 2009; Bura and Forzani, 2015; Bura et al., 2016; Chen et al., 2010; Cook, 2018; Cook and Li, 2009; Cook and Forzani, 2009; Ding and Cook, 2013) have followed from Cook (2007). In the lecture, a parametric inverse regression model is defined as follows:

X = μ + Γ ν_y + ε,    (1)

where X ∈ R^p, Γ ∈ R^{p×d} with Γ^T Γ = I_d and d ≤ p, ε ∼ N(0, Δ), and cov(ν_y, ε) = 0. In model (1), ν_y is a d-dimensional unknown function of y with a positive definite sample covariance, centered so that ∑_y ν_y = 0. The model in (1) is called a principal component model.

J. K. Yoo, Department of Statistics, Ewha Womans University, Seoul, South Korea. e-mail: [email protected]


If the unknown ν_y is replaced with βf_y, a model called the principal fitted component (PFC) model is defined as follows:

X = μ + Γ βf_y + ε,    (2)

where β is an unknown d × q matrix and f_y ∈ R^q is a known vector-valued function of y with ∑_y f_y = 0. According to Cook and Forzani (2009), under mild conditions, the maximum likelihood estimation of Γ under model (2) is robust to misspecification of f_y and to non-normality of ε. More details regarding these two models are left to the reader; see Cook (2007) and Cook and Forzani (2009).

Let {X_i, y_i}_{i=1}^n represent n i.i.d. realizations of (y, X). A pair {X_i, y_i} is said to be supervised if y_i is observed, and unsupervised otherwise. If a data set contains both supervised and unsupervised observations of y, it is called semi-supervised. The response y is often not observed due to the difficulty of labeling. Semi-supervised data arise in many recent scientific fields, for example, speech recognition, spam email filtering, artificial intelligence, video surveillance, and so on. These areas have grown rapidly, and hence the analysis of semi-supervised data is increasingly in demand.

Two approaches, distributional and margin-based, have been developed in the semi-supervised data analysis literature. Popular methods of the distributional approach are co-training (Blum and Mitchell, 1998), the EM method (Nigam et al., 1998), the bootstrap method (Collins and Singer, 1999), Gaussian random fields (Zhu et al., 2003), and structure learning models (Ando and Zhang, 2005). These methods critically depend on an assumption relating the class probability given the input, p(x) = P(y = 1|X = x), to the marginal distribution of X for an improvement to occur. However, this assumption is often neither easy to verify nor guaranteed to hold in practice. For the margin-based approach, the concept of regularized separation is critical. The transductive SVM (TSVM; Chapelle and Zien 2005; Vapnik 1998; Wang et al. 2007) and the large margin method in Wang and Shen (2007) were proposed for this approach. These methods employ a notion of separation to extract information from the unlabeled part in order to improve classification. This depends on the assumption (Chapelle and Zien, 2005) that a clustering boundary can precisely estimate the Bayes decision boundary, which is the focus of classification.

The current methodologies of both approaches for semi-supervised data commonly face the following three weaknesses. First, the methods heavily depend on underlying assumptions, mainly normality, and failure of these assumptions is a cause for concern in practice. Second, there is no unified approach for classification and regression, even though both study the conditional distribution of the response given the predictors. Third, the methods for binary responses cannot be directly applied to multi-category responses; that is, different methods are needed depending on the number of response categories.

The main goal of this paper is to propose an effective dimension reduction methodology for semi-supervised data by adopting the philosophy of Cook's Fisher lectureship. In this paper, the dimension reduction method is developed under


an isotonic covariance structure, and the three obstacles commonly faced in semi-supervised data analysis can potentially be avoided. The details of the proposed method are given in the next section.

2 Dimension Reduction by Isotonic Models

2.1 Construction of Isotonic Model

Consider semi-supervised data with a p-dimensional predictor sample consisting of n = n_u + n_l observations: n_l labeled pairs {X_i ∈ R^p, y_i ∈ R^1}_{i=1}^{n_l} and n_u unlabeled pairs {X_i, y_i}_{i=n_l+1}^{n}. Let U and L be the sets of unlabeled and labeled responses, respectively. To denote the observations in a unified way, a labeling indicator is defined such that I_y = 1 for unlabeled responses and I_y = 0 otherwise. We start with a modified principal component model in Cook (2007) with an isotonic error structure for semi-supervised data:

X_y = μ + Γ ν_y + σε = μ + I_y Γ ν_y + (1 − I_y) Γ ν_y + σε,    (3)

where μ = E(X), Γ ∈ R^{p×d} with Γ^T Γ = I_d, d < p, σ > 0, and ε ∼ N(0, I_p). The coordinate vector ν_y ∈ R^d is an unknown function of y that is assumed to have a positive definite sample covariance matrix and is centered to have mean 0, ∑_y ν_y = 0.

According to Cook (2007), under model (3), the estimate of Γ is given by the first d principal components of Σ̂. All observations of X are then utilized to construct Σ̂, regardless of labeling status. This indicates that a usual PCA application for dimension reduction of X turns out to be effective for model (3).

Now we use the separation of the data by the labeling indicator I_y. This separation is quite simple and straightforward, but it is fundamental for further methodological developments. The separation in (3) enables us to set ν_y = βf_y for the labeled data:

X_y = μ + I_y Γ ν_y + (1 − I_y) Γ βf_y + σε,    (4)

where β ∈ R^{d×r} is a matrix with d ≤ min(p, r), and f_y ∈ R^r is a known vector-valued function of the response with ∑_y f_y = 0. According to Cook (2007), the PFC often outperforms principal components in many regression problems, so it is expected that estimation of Γ through model (4) yields better results than estimation through (3).

For labeled cases, each coordinate X_yj, j = 1, . . . , p, of X_y follows a linear model with predictor vector f_y in model (4). Therefore, to obtain useful information about the unknown f_y, one can consider inverse response plots (Cook 1998, Chapter 10) of


X_yj versus y. For instance, if log(y) reasonably fits all the inverse response plots, one can consider f_y = log(y). If y is categorical with h levels, a suitable choice for f_y is a vector of indicators J(y ∈ C_k), where the C_k indicate the categories of y.

Proposition 1 Under model (4), y ⊥⊥ X | Γ^T X.

The proof of the proposition comes directly from Proposition 1 of Cook (2007). Therefore, the d-dimensional linearly transformed predictor Γ^T X can replace the original p-dimensional predictor X without loss of information on the regression y|X. Dimension reduction of X through model (4) will be called combined principal and principal fitted components (CPPFC). In the next section, we estimate the unknown parameters by maximizing the related likelihood function.
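To illustrate the construction of f_y discussed above, here is a small sketch (my own, with made-up data and function names; not part of the chapter): for a continuous response one stacks candidate basis functions of y, and for a categorical response one uses centered category indicators.

```python
import numpy as np

def make_F_continuous(y, terms=("y", "y2", "y3")):
    """Stack candidate basis functions of a continuous y and center them
    so that sum_y f_y = 0."""
    cols = {"y": y, "y2": y**2, "y3": y**3, "logy": np.log(y), "expy": np.exp(y)}
    F = np.column_stack([cols[t] for t in terms])
    return F - F.mean(axis=0)

def make_F_categorical(y):
    """Centered indicators J(y in C_k) for the first h-1 categories
    (dropping one level keeps the centered F of full column rank)."""
    levels = np.unique(y)[:-1]
    F = (np.asarray(y)[:, None] == levels[None, :]).astype(float)
    return F - F.mean(axis=0)

y_cont = np.random.default_rng(0).exponential(size=8)
y_cat = np.array([0, 1, 2, 1, 0, 2, 3, 3])
print(make_F_continuous(y_cont).shape)    # (8, 3)
print(make_F_categorical(y_cat).shape)    # (8, 3): four levels give three indicators
```

In practice the candidate functions would be chosen after inspecting the inverse response plots of each X_yj versus y, as suggested above.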

2.2 Maximum Likelihood Estimation of Γ

For notational convenience, let X be the n × p sample matrix of X centered at the sample mean X̄ = n^{-1} ∑_{i=1}^n X_i, obtained by stacking the rows (X_i − X̄)^T. Using this notation, define X_U = {X | y ∈ U} and X_L = {X | y ∈ L}. The log-likelihood for (4) is

d(μ, ν_y, β, Γ) = −(np/2) log σ^2 − (1/2σ^2) ∑_y {X_y − μ − I_y Γ ν_y − (1 − I_y) Γ βf_y}^T {X_y − μ − I_y Γ ν_y − (1 − I_y) Γ βf_y}
  = −(np/2) log σ^2 − (1/2σ^2) { ∑_{y∈U} (X_y − μ − Γ ν_y)^T (X_y − μ − Γ ν_y) + ∑_{y∈L} (X_y − μ − Γ βf_y)^T (X_y − μ − Γ βf_y) }.    (5)

From (5), with ν_y, β, Γ fixed, the MLE of μ is X̄, by the centering assumption on ν_y and f_y. Next, replacing μ by X̄ and keeping β, Γ fixed, (5) is partially maximized at ν_y = Γ^T(X_y − X̄) for y ∈ U, which is the MLE ν̂_y. The MLE for β is obtained by plugging ν̂_y into (5) with Γ fixed, and it is β̂ = Γ^T X_L^T F(F^T F)^{-1}, where F is constructed by stacking the f_y^T. Then the updated log-likelihood of (5) is

d(Γ) = −(np/2) log σ^2 − (1/2σ^2) {trace(X_U^T X_U Q_Γ) + trace(X_L^T P_F X_L Q_Γ)}
  = −(np/2) log σ^2 − (n/2σ^2) trace{((n_u/n) Σ̂_U + (n_l/n) Σ̂_fit(L)) Q_Γ},    (6)

where P_F = F(F^T F)^{-1} F^T, Q_Γ = I − ΓΓ^T, Σ̂_U = n_u^{-1} X_U^T X_U, Σ̂_L = n_l^{-1} X_L^T X_L, Σ̂_fit(L) = n_l^{-1} X_L^T P_F X_L, and Σ̂_res(L) = Σ̂_L − Σ̂_fit(L). By this, the following equality can be easily noted:

Σ̂ = (n_u/n) Σ̂_U + (n_l/n) Σ̂_L = (n_u/n) Σ̂_U + (n_l/n) Σ̂_fit(L) + (n_l/n) Σ̂_res(L).    (7)

Then (6) is maximized by taking Γ to be the eigenvectors (γ̂_1, . . . , γ̂_d) corresponding to the d largest eigenvalues of (n_u/n) Σ̂_U + (n_l/n) Σ̂_fit(L). The maximum likelihood estimator Γ̂ of Γ is therefore (γ̂_1, . . . , γ̂_d).
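A minimal NumPy sketch of this estimator follows (my own illustration with hypothetical names, not the author's code): build Σ̂_U from the unlabeled block and Σ̂_fit(L) from the labeled block, check the decomposition in (7) numerically, and take Γ̂ as the leading eigenvectors of the weighted sum.

```python
import numpy as np

def cppfc(X_unlabeled, X_labeled, Fy, d=1):
    """Combined principal and principal fitted components (a sketch).
    X_unlabeled: (n_u, p), X_labeled: (n_l, p), Fy: (n_l, q) matrix of f_y values.
    Returns the p x d estimate Gamma_hat."""
    n_u, p = X_unlabeled.shape
    n_l = X_labeled.shape[0]
    n = n_u + n_l
    xbar = np.vstack([X_unlabeled, X_labeled]).mean(axis=0)
    XU = X_unlabeled - xbar                       # unlabeled block, centered at overall mean
    XL = X_labeled - xbar                         # labeled block, centered at overall mean
    F = Fy - Fy.mean(axis=0)                      # enforce sum_y f_y = 0

    Sigma_U = XU.T @ XU / n_u
    P_F = F @ np.linalg.solve(F.T @ F, F.T)       # projection onto the column space of F
    Sigma_fit_L = XL.T @ P_F @ XL / n_l
    Sigma_L = XL.T @ XL / n_l

    # numerical check of decomposition (7)
    Sigma_hat = np.vstack([XU, XL]).T @ np.vstack([XU, XL]) / n
    assert np.allclose(Sigma_hat, n_u / n * Sigma_U + n_l / n * Sigma_fit_L
                       + n_l / n * (Sigma_L - Sigma_fit_L))

    M = n_u / n * Sigma_U + n_l / n * Sigma_fit_L
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:d]]    # leading d eigenvectors

# toy data from the isotonic model with Gamma = e_1 and f_y = y
rng = np.random.default_rng(0)
p, n_l, n_u = 10, 50, 50
Gamma = np.zeros((p, 1)); Gamma[0, 0] = 1.0
y = rng.normal(size=n_l)
X_lab = y[:, None] @ Gamma.T + rng.normal(size=(n_l, p))
X_unl = rng.normal(size=n_u)[:, None] @ Gamma.T + rng.normal(size=(n_u, p))
print(np.round(cppfc(X_unl, X_lab, y[:, None], d=1).ravel(), 2))  # loads mainly on coordinate 1
```

Setting n_u = 0 reduces the weighted matrix to the PFC fit on the labeled data alone, and dropping the fitted part recovers plain PCA, which is how the two benchmark methods in the next section arise.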

3 Numerical Examples

To investigate the isotonic model for semi-supervised data, the following numerical model was considered: X_y = Γ ν_y + ε, where Γ = (1, 0, . . . , 0)^T ∈ R^p, ν_y = y, y ∼ N(0, σ_y^2), and ε ∼ N(0, σ^2 I_p) with y ⊥⊥ ε. To investigate the effects of non-normal predictors, y was also generated from Exponential(1). This model was generated 500 times under various settings of f_y, n_u, p, σ^2, and σ_y^2, with n = 100.

To evaluate the estimation performance for Γ, the angle r_A between Γ and Γ̂ was computed. If Γ and Γ̂ are p × m matrices with full column rank, the angle is computed as r_A = 1 − (1/m) trace{Γ̂(Γ̂^T Γ̂)^{-1} Γ̂^T Γ(Γ^T Γ)^{-1} Γ^T}. The quantity (1/m) trace{Γ̂(Γ̂^T Γ̂)^{-1} Γ̂^T Γ(Γ^T Γ)^{-1} Γ^T} is a trace correlation (Hooper, 1959), which measures how close the spaces spanned by the columns of Γ and Γ̂ are to each other; higher values imply that the two subspaces are closer. To obtain a measure for which smaller values indicate closer subspaces, however, the trace correlation is subtracted from one. If r_A equals zero, the two subspaces are equivalent, while r_A equals one if the two subspaces are orthogonal. Therefore, smaller values of r_A indicate that Γ̂ is a better estimate of Γ.

Unless otherwise mentioned, we set f_y = y, n_u = 50 or 80, σ^2 = σ_y^2 = 1, and p = 10. After generating the data from the simulation model, n_u of the observed responses were randomly removed. We considered three dimension reduction methods: the CPPFC, principal component analysis (PCA), and the PFC with the labeled data alone (PFC_L).
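As a concrete companion to the evaluation criterion, here is a small sketch (my own, not the author's code) that computes the angle r_A in the trace-correlation form reconstructed above and applies it to one replication of the numerical model using the plain PCA estimator.

```python
import numpy as np

def angle_rA(G_hat, G):
    """r_A = 1 - (1/m) trace{ P_{G_hat} P_G }: 0 for identical subspaces,
    1 for orthogonal subspaces."""
    def proj(B):
        B = np.asarray(B, dtype=float)
        if B.ndim == 1:
            B = B[:, None]
        return B @ np.linalg.solve(B.T @ B, B.T)
    m = G.shape[1] if np.ndim(G) > 1 else 1
    return 1.0 - np.trace(proj(G_hat) @ proj(G)) / m

# one replication of the numerical model X_y = Gamma * nu_y + eps with Gamma = e_1,
# evaluated for the plain PCA estimator (which ignores the labels entirely)
rng = np.random.default_rng(0)
p, n = 10, 100
Gamma = np.zeros((p, 1)); Gamma[0, 0] = 1.0
y = rng.normal(size=n)                                # sigma_y = 1
X = y[:, None] @ Gamma.T + rng.normal(size=(n, p))    # sigma = 1
Xc = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(Xc.T @ Xc / n)
Gamma_pca = vecs[:, [np.argmax(vals)]]
print(round(angle_rA(Gamma_pca, Gamma), 3))           # small value = good recovery
```

Averaging r_A over 500 such replications for CPPFC, PCA, and PFC_L under the various settings of f_y, n_u, p, σ^2, and σ_y^2 produces the curves summarized in Figs. 1 through 6.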

Fig. 1 The effects of the choices of f_y (horizontal axis: candidate f_y 1–6; vertical axis: angle r_A); CPPFC: solid, PFC_L: dashed. (a) n_u = 50. (b) n_u = 80

First, the impact of the choice of f_y was studied. The following six candidates of f_y were considered: (1) f_y = y; (2) f_y = (y, y^2); (3) f_y = (y, y^2, y^3); (4) f_y = exp(y); (5) f_y = (y, exp(y)); and (6) the categorization of y into four levels. The first three and the fifth candidates contain the true ν_y, and the other two do not. Naturally, it is expected that the first choice should yield the best results and that the four choices containing y should provide better estimation performance than the other two. According to Fig. 1, for both CPPFC and PFC_L, the choice f_y = y gives the best estimation results, as expected. Although the CPPFC is a little sensitive to the six choices, it is more robust than the PFC_L, especially for larger values of n_u. This is mainly because the construction of Σ̂_U does not require any information about f_y.

Next, the estimation of Γ was investigated for (n_u, n_l) = (10, 90), (20, 80), (30, 70), . . . , (90, 10) with p = 10 and p = 30. A natural expectation is that the estimation performance of PFC_L worsens with larger values of n_u and that all three methods struggle with higher values of p. According to Fig. 2, this expectation was met. It should be observed that the CPPFC showed better estimation results than PCA for all values of n_u. This confirms the potential advantages of CPPFC over PCA and PFC_L.

To examine the effects of non-normal predictors, y was randomly generated from Exponential(1) instead of N(0, 1) under the default setting. The summary figures are reported in Fig. 3. With p = 10, there is no notable difference between normal and non-normal predictors in the estimation of Γ. However, with p = 30, PFC_L struggles more for the non-normal predictors, and this leads to slightly worse estimation by CPPFC, although the CPPFC still provides better results than PCA for all values of n_u.

We studied the impact of the number of predictors on the estimation of Γ by varying p = 10, 20, 30, . . . , 90 for n_u = 50 and n_u = 80. According to Fig. 4, the CPPFC often shows the best performance and is the most robust to n_u, although all three methods perform worse with larger p for both n_u = 50 and n_u = 80. Compared with the results obtained by varying n_u, the number of predictors is more critical than n_u for the estimation of Γ.

The effect of σ^2 on the estimation of Γ was investigated for σ = 0.5, 1, 2, 3, 4, 5, and the results are summarized in Fig. 5. All three methods get worse for larger values of σ, with the PFC_L performing best.

Fig. 2 The effects of the sizes of n_u (horizontal axis: n_u; vertical axis: angle r_A); CPPFC: solid, PCA: dotted, PFC_L: dashed. (a) p = 10. (b) p = 30

Fig. 3 The effects of non-normality (horizontal axis: n_u; vertical axis: angle r_A); CPPFC: solid, PCA: dotted, PFC_L: dashed. (a) p = 10. (b) p = 30

With smaller values of σ, the CPPFC can compete with the other two. Under the same model in Cook (2007), PCA showed much poorer estimation of Γ than the PFC. So the poor estimation performance of PCA worsens that of the CPPFC, which results in better estimation performance for the PFC_L than for the CPPFC.

Finally, the impact of σ_y^2, which measures the variability of the response y, was examined for σ_y = 0.5, 1, 2, 3, 4, 5. The summary is reported in Fig. 6.


Fig. 4 The effects of the number of predictors (horizontal axis: number of predictors p; vertical axis: angle r_A); CPPFC: solid, PCA: dotted, PFC_L: dashed. (a) n_u = 50. (b) n_u = 80

Fig. 5 The effects of the sizes of σ^2 (horizontal axis: σ; vertical axis: angle r_A); CPPFC: solid, PCA: dotted, PFC_L: dashed. (a) n_u = 50. (b) n_u = 80

The figure shows that higher variability of the response leads to better estimation of Γ. There are no clear differences among the three methods in this case.

In summary, the numerical studies show that the proposed CPPFC has potential advantages in the estimation of Γ, under various settings of f_y, n_u, p, σ^2, and σ_y^2, over the existing PCA and PFC_L.

Fig. 6 The effects of the sizes of σ_y^2 (horizontal axis: σ_y; vertical axis: angle r_A); CPPFC: solid, PCA: dotted, PFC_L: dashed. (a) n_u = 50. (b) n_u = 80

4 Real Data Example

In survival regression, the true survival time is not fully observed due to censoring. Censoring inevitably occurs in survival regression, and this aspect has to be properly reflected in dimension reduction and model building. The application of sufficient dimension reduction in survival regression is well described in Cook (2003). Here, we treat the censored survival times as unlabeled observations, although a more rigorous theoretical treatment of this matter remains to be established. Since the Cox proportional hazards (CPH) model, one of the most popular methods in survival regression, uses the event-time observations alone, this approach would be acceptable in practice. The survival data then become semi-supervised data.

For illustration, the primary biliary cirrhosis (PBC) data (Yoo, 2008; Yoo and Lee, 2011) collected at the Mayo Clinic between 1974 and 1986 are used. The data consist of 19 variables with 276 observations after removing all missing values from the full dataset. Two of the 19 variables are the survival time and the censoring status, which equals 1 if the event occurs and zero otherwise (censored). The following 10 many-valued variables among the remaining 17 are considered as predictors: age in years; serum bilirubin in mg/dl; serum cholesterol in mg/dl; albumin in g/dl; urine copper in μg/day; alkaline phosphatase in U/liter; SGOT in U/ml; triglycerides in mg/dl; platelet count per cubic ml of blood divided by 1000; and prothrombin time in seconds. The predictors are standardized to have zero sample mean and unit standard deviation, to minimize the impact of the different measurement units. The predictors in the survival regression are reduced through PCA, CPPFC, and PFC_L, and a CPH model is then fitted for comparison.


The data are divided into two subgroups depending on the value of the censoring status. As discussed above, all observations corresponding to events become labeled data, and the others are unlabeled. In these data, the censoring percentage is about 60%, so the unlabeled percentage is about 10% larger than the labeled percentage. For CPPFC and PFC_L, we set f_y = (y, y^2, y^3), where y indicates the event survival time in the PBC data.

With the original ten predictors, the CPH model is fitted, which is denoted CPH_10. Let β̂_CPH ∈ R^{10×1} be the estimated coefficient vector in CPH_10, and let Γ̂_PCA ∈ R^{10×1}, Γ̂_CPPFC ∈ R^{10×1}, and Γ̂_PFC_L ∈ R^{10×1} stand for the estimates from PCA, CPPFC, and PFC_L with d = 1, respectively. We call X_s β̂_CPH, X_s Γ̂_PCA, X_s Γ̂_CPPFC, and X_s Γ̂_PFC_L the CPH, PCA, CPPFC, and PFC_L directions, respectively, where X_s ∈ R^{276×10} is the standardized predictor matrix. We compare the relations among the CPH, PCA, CPPFC, and PFC_L directions through a scatterplot matrix, reported in Fig. 7. In addition, three CPH models are fitted with the PCA, CPPFC, and PFC_L directions, and the AIC (Akaike information criterion) is compared for the four CPH models, as reported in Table 1.

According to Fig. 7, all four directions are highly correlated. The minimum correlation (0.904) is between the CPH and PFC_L directions, while the maximum (0.961) is between the CPPFC and CPH directions. Table 1 shows that the CPPFC direction is the best among the four, although the difference from CPH_10 is narrow. This real data example shows the potential usefulness of CPPFC in practice.
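As a rough outline of this analysis (my own sketch, not the author's code), the snippet below uses a synthetic stand-in with the same layout as the PBC data, ten predictors plus a survival time and a censoring status, splits it by the censoring status, builds f_y = (y, y^2, y^3) on the event times, and extracts the d = 1 CPPFC and PCA directions. All data values and names here are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same layout as the PBC data (ten predictors, a
# survival time, and a censoring status with 1 = event, 0 = censored); the real
# Mayo Clinic PBC file would be used in its place.
rng = np.random.default_rng(0)
n, p = 276, 10
X_raw = rng.lognormal(size=(n, p))
time = rng.exponential(scale=np.exp(-0.5 * X_raw[:, 0]))          # toy event/censoring times
status = (rng.uniform(size=n) < 0.4).astype(int)                  # roughly 60% censoring
pbc = pd.DataFrame(X_raw, columns=[f"x{j + 1}" for j in range(p)])
pbc["time"], pbc["status"] = time, status

pred_cols = [f"x{j + 1}" for j in range(p)]
Xs = ((pbc[pred_cols] - pbc[pred_cols].mean()) / pbc[pred_cols].std()).to_numpy()

labeled = pbc["status"].to_numpy() == 1          # events play the role of labeled data
y_L = pbc["time"].to_numpy()[labeled]
X_L, X_U = Xs[labeled], Xs[~labeled]
n_l, n_u = X_L.shape[0], X_U.shape[0]

F = np.column_stack([y_L, y_L**2, y_L**3])       # f_y = (y, y^2, y^3)
F = (F - F.mean(axis=0)) / F.std(axis=0)         # center (and scale for conditioning)

xbar = Xs.mean(axis=0)
XL, XU = X_L - xbar, X_U - xbar
Sigma_U = XU.T @ XU / n_u
P_F = F @ np.linalg.solve(F.T @ F, F.T)
Sigma_fit_L = XL.T @ P_F @ XL / n_l

M = n_u / n * Sigma_U + n_l / n * Sigma_fit_L
gamma_cppfc = np.linalg.eigh(M)[1][:, [-1]]                              # CPPFC direction
gamma_pca = np.linalg.eigh((Xs - xbar).T @ (Xs - xbar) / n)[1][:, [-1]]  # PCA direction

rho = np.corrcoef((Xs @ gamma_cppfc).ravel(), (Xs @ gamma_pca).ravel())[0, 1]
print(round(abs(rho), 3))
```

With the real PBC file in place of the synthetic block, the same bookkeeping yields the directions compared in Fig. 7, and a Cox model could then be fitted on each one-dimensional direction with a survival package to reproduce the AIC comparison in Table 1.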

5 Discussion

Cook's Fisher lectureship (Cook, 2007) proposed a model-based dimension reduction approach and established a new paradigm for methodological development in the sufficient dimension reduction literature. In the lectureship, the models developed in Cook (2007) were extended to exponential family predictors, so categorical and continuous predictors can be handled simultaneously. To cover semi-supervised data, which have become common as the artificial intelligence literature grows, Cook's Fisher lectureship is revisited here, and a combined approach is newly proposed. For the proposed method, the theory is established, and a basis matrix for the dimension reduction is estimated by maximum likelihood. The estimator is constructed as a weighted mean of the sample covariance matrix of the predictors for the unlabeled part and the sample fitted covariance matrix for the labeled part. Numerical studies confirm the potential usefulness of the proposed combined approach for semi-supervised data. To generalize the results, the proposed combined model has to be extended to more general settings for the covariance of the random error. The theoretical development for this is in progress.

Fig. 7 Scatterplot matrix of the directions from CPPFC, PCA, PFC_L, and CPH in Sect. 4

Table 1 AIC from the Cox proportional hazards models fitted with the original ten predictors and with the first directions from PCA, CPPFC, and PFC_L in Sect. 4

        10 predictors   CPPFC    PCA       PFC_L
AIC     971.72          971.17   1001.10   978.65

Acknowledgments I sincerely thank Professor R. D. Cook for raising me from an academic kid, and for inspiring and supporting me all the time. Without you, I could not be who I am today. You make me try to be a better person, father, and professor. For Jae Keun Yoo, this work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Korean Ministry of Education (NRF-2019R1F1A1050715/2019R1A6A1A11051177).


References

K. Adragni, R.D. Cook, Sufficient dimension reduction and prediction in regression. Philos. Trans. R. Soc. A 367, 4385–4405 (2009)
R. Ando, T. Zhang, A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 6, 1817–1853 (2005)
A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in Proceedings of the Eleventh Annual Conference on Computational Learning Theory (1998), pp. 92–100
E. Bura, L. Forzani, Sufficient reductions in regressions with elliptically contoured inverse predictors. J. Am. Stat. Assoc. 110, 420–434 (2015)
E. Bura, S. Duarte, L. Forzani, Sufficient reductions in regressions with exponential family inverse predictors. J. Am. Stat. Assoc. 111, 1313–1329 (2016)
O. Chapelle, A. Zien, Semi-supervised classification by low density separation, in Proceedings of the International Workshop on Artificial Intelligence and Statistics (2005), pp. 57–64
X. Chen, F. Zou, R.D. Cook, Coordinate-independent sparse sufficient dimension reduction and variable selection. Ann. Stat. 38, 3696–3723 (2010)
M. Collins, Y. Singer, Unsupervised models for named entity classification, in Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (1999), pp. 100–110
R.D. Cook, Regression Graphics (Wiley, New York, 1998)
R.D. Cook, Dimension reduction and graphical exploration in regression including survival analysis. Stat. Med. 22, 1399–1413 (2003)
R.D. Cook, Fisher lecture: dimension reduction in regression. Stat. Sci. 22, 1–26 (2007)
R.D. Cook, Principal components, sufficient dimension reduction and envelopes. Annu. Rev. Stat. Appl. 5, 533–559 (2018)
R.D. Cook, L. Li, Dimension reduction in regressions with exponential family predictors. J. Comput. Graph. Stat. 18, 774–791 (2009)
R.D. Cook, L. Forzani, Principal fitted components for dimension reduction in regression. Stat. Sci. 485, 485–501 (2009)
R.D. Cook, L. Forzani, Likelihood-based sufficient dimension reduction. J. Am. Stat. Assoc. 104, 197–208 (2009)
S. Ding, R.D. Cook, Dimensional folding PCA and PFC for matrix-valued predictors. Stat. Sinica 24, 463–492 (2013)
J. Hooper, Simultaneous equations and canonical correlation theory. Econometrika 27, 245–256 (1959)
B. Li, Sufficient Dimension Reduction: Methods and Applications with R (Chapman and Hall/CRC, New York, 2018)
K. Nigam, A. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM. Mach. Learn. 39, 103–134 (1998)
V. Vapnik, Statistical Learning Theory (Wiley, New York, 1998)
J. Wang, X. Shen, Large margin semi-supervised learning. J. Mach. Learn. Res. 8, 1867–1891 (2007)
J. Wang, X. Shen, W. Pan, On transductive support vector machine. Contemp. Math. 43, 7–19 (2007)
J.K. Yoo, Fused sliced inverse regression in survival analysis. Commun. Stat. Appl. 24, 533–541 (2008)
J.K. Yoo, R.D. Cook, Response dimension reduction for the conditional mean in multivariate regression. Comput. Stat. Data Anal. 53, 334–343 (2008)
J.K. Yoo, K. Lee, Model-free predictor tests in survival regression through sufficient dimension reduction. Lifetime Data Anal. 17, 433–444 (2011)
X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in Proceedings of the Twentieth International Conference on Machine Learning (2003), pp. 912–919