English Pages 496 [498] Year 2020
Handbook of Statistics Volume 43
Principles and Methods for Data Science
Handbook of Statistics Series Editor Arni S.R. Srinivasa Rao Medical College of Georgia, Augusta University, United States C.R. Rao C.R. Rao AIMSCS, University of Hyderabad Campus, Hyderabad, India
Edited by
Arni S.R. Srinivasa Rao Medical College of Georgia, Augusta University, Augusta, Georgia, United States
C.R. Rao AIMSCS, University of Hyderabad Campus, Hyderabad, India
North-Holland is an imprint of Elsevier
Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2020 Elsevier B.V. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

ISBN: 978-0-444-64211-0
ISSN: 0169-7161

For information on all North-Holland publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Zoe Kruze
Acquisition’s Editor: Sam Mahfoudh
Editorial Project Manager: Peter Llewellyn
Production Project Manager: Abdulla Sait
Cover Designer: Alan Studholme
Typeset by SPi Global, India
Contents

Contributors
Preface

1. Markov chain Monte Carlo methods: Theory and practice
David A. Spade
  1 Introduction
  2 Introduction to Bayesian statistical analysis
    2.1 Noninformative prior distributions
    2.2 Informative prior distributions
    2.3 Bayesian estimation
  3 Markov chain Monte Carlo background
    3.1 Discrete-state Markov chains
    3.2 General state space Markov chain theory
  4 Common MCMC algorithms
    4.1 The Metropolis–Hastings algorithm
    4.2 Multivariate Metropolis–Hastings
    4.3 The Gibbs sampler
    4.4 Slice sampling
    4.5 Reversible jump MCMC
  5 Markov chain Monte Carlo in practice
    5.1 MCMC in regression models
    5.2 Random effects models
    5.3 Bayesian generalized linear models
    5.4 Hierarchical models
  6 Assessing Markov chain behavior
    6.1 Using the theory to bound the mixing time
    6.2 Output-based convergence diagnostics
    6.3 Using auxiliary simulations to bound mixing time
    6.4 Examining sampling frequency
  7 Conclusion
  References
  Further reading

2. An information and statistical analysis pipeline for microbial metagenomic sequencing data
Shinji Nakaoka and Keisuke H. Ota
  1 Introduction
  2 A brief overview of shotgun metagenomic sequencing analysis
    2.1 Sequence assembly and contig binning
    2.2 Annotation of taxonomy, protein, metabolic, and biological functions
    2.3 Statistical analysis and machine learning
    2.4 Reconstruction of pseudo-dynamics and mathematical modeling
    2.5 Construction of analysis pipeline with reproducibility and portability
  3 Computational tools and resources
    3.1 Tools and software
    3.2 Public resources and databases
    3.3 Do-It-Yourself information analysis pipeline for metagenomic sequences
  4 Notes
  Acknowledgments
  References

3. Machine learning algorithms, applications, and practices in data science
Kalidas Yeturu
  1 Introduction
  2 Supervised methods
    2.1 Data sets
    2.2 Linear regression
    2.3 Logistic regression
    2.4 Support vector machine—Linear kernel
    2.5 Decision tree
      Outline of decision tree
    2.6 Ensemble methods
    2.7 Bias-variance trade off
    2.8 Cross validation and model selection
    2.9 Multiclass and multivariate scenarios
    2.10 Regularization
    2.11 Metrics in machine learning
  3 Practical considerations in model building
    3.1 Noise in the data
    3.2 Missing values
    3.3 Class imbalance
    3.4 Model maintenance
  4 Unsupervised methods
    4.1 Clustering
    4.2 Comparison of clustering algorithms over data sets
    4.3 Matrix factorization
    4.4 Principal component analysis
    4.5 Understanding the SVD algorithm
    4.6 Data distributions and visualization
  5 Graphical methods
    5.1 Naive Bayes algorithm
    5.2 Expectation maximization
      Example of email spam and nonspam problem—Posing as graphical model
    5.3 Markovian networks
      Topic modeling of audio data
      Topic modeling of image data
  6 Deep learning
    6.1 Neural network
    6.2 Encoder
    6.3 Convolutional neural network
    6.4 Recurrent neural network
    6.5 Generative adversarial network
  7 Optimization
  8 Artificial intelligence
    8.1 Notion of state space and search
    8.2 State space—Search algorithms
    8.3 Planning algorithms
    8.4 Formal logic
    8.5 Resolution by refutation method
    8.6 AI framework adaptability issues
  9 Applications and laboratory exercises
    9.1 Automatic differentiation
    9.2 Machine learning exercises
    9.3 Clustering exercises
    9.4 Graphical model exercises
    9.5 Data visualization exercises
    9.6 Deep learning exercises
  References

4. Bayesian model selection for high-dimensional data
Naveen Naidu Narisetty
  1 Introduction
  2 Classical variable selection methods
    2.1 Best subset selection
    2.2 Stepwise selection methods
    2.3 Criterion functions
  3 The penalization framework
    3.1 LASSO and generalizations
    3.2 Nonconvex penalization
    3.3 Variable screening
  4 The Bayesian framework for model selection
  5 Spike and slab priors
    5.1 Point mass spike prior
    5.2 Continuous spike priors
    5.3 Spike and slab LASSO
  6 Continuous shrinkage priors
    6.1 Bayesian LASSO
    6.2 Horseshoe prior
    6.3 Global-local shrinkage priors
    6.4 Regularization of Bayesian priors
    6.5 Prior elicitation—Hyperparameter selection
  7 Computation
    7.1 Direct exploration of the model space
    7.2 Gibbs sampling
    7.3 EM algorithm
    7.4 Approximate algorithms
  8 Theoretical properties
    8.1 Consistency properties of the posterior mode
    8.2 Posterior concentration
    8.3 Pairwise model comparison consistency
    8.4 Strong model selection consistency
  9 Implementation
  10 An example
  11 Discussion
  Acknowledgments
  References

5. Competing risks: Aims and methods
Ronald B. Geskus
  1 Introduction
  2 Research aim: Explanation vs prediction
    2.1 In-hospital infection and discharge
    2.2 Causes of death after HIV infection
    2.3 AIDS and pre-AIDS death
  3 Basic quantities and their estimators
    3.1 Definitions and notation
    3.2 Data setup
    3.3 Nonparametric estimation
    3.4 Standard errors and confidence intervals
    3.5 Regression models
    3.6 Software
  4 Time-varying covariables and the subdistribution hazard
    4.1 Overall survival
    4.2 Spectrum in causes of death
    4.3 Summary
  5 Confusion
    5.1 What is the appropriate analysis?
    5.2 Is a marginal analysis feasible in practice?
    5.3 If we fit a Cox model, do we need to assume that the competing risks are independent?
    5.4 Is a regression model for the subdistribution hazard (such as a Fine and Gray model) the only truly competing risks regression model?
    5.5 Is the subdistribution hazard a quantity that can be given an interpretation?
  Acknowledgment
  References

6. High-dimensional statistical inference: Theoretical development to data analytics
Deepak Nag Ayyala
  1 Introduction
  2 Mean vector testing
    2.1 Independent observations
    2.2 Projection-based tests
    2.3 Random projections
    2.4 Other approaches
    2.5 Dependent observations
  3 Covariance matrix
    3.1 Estimation
    3.2 Hypothesis testing
  4 Discrete multivariate models
    4.1 Multinomial distribution
    4.2 Compound multinomial models
    4.3 Other distributions
  5 Conclusion
  References

7. Big data challenges in genomics
Hongyan Xu
  1 Introduction
  2 Next-generation sequencing
  3 Data integration
  4 High dimensionality
  5 Computing infrastructure
  6 Dimension reduction
  7 Data smoothing
  8 Data security
  9 Example
  10 Conclusion
  References

8. Analysis of microarray gene expression data using information theory and stochastic algorithm
Narayan Behera
  1 Introduction
    1.1 Gene clustering algorithms
  2 Methodology
    2.1 Discretization
    2.2 Genetic algorithm
    2.3 The evolutionary clustering algorithm
  3 Results
    3.1 Synthetic data
    3.2 Real data
  4 Section A: Studies on gastric cancer dataset (GDS1210)
    4.1 Comparison of the algorithms based on the classification accuracy of samples
    4.2 Analysis of classificatory genes
    4.3 Comparison of algorithms based on the representative genes
    4.4 Study of gene distribution in clusters
  5 Section B: A brief study on colon cancer dataset
    5.1 Comparison of the algorithms based on classification accuracy
  6 Section C: A brief study on brain cancer (medulloblastoma metastasis) dataset (GDS232)
    6.1 Comparison of the algorithms based on the classification accuracy
  7 Conclusion
  Appendices
    Appendix A: A brief overview of the OCDD algorithm
    Appendix B: Smoothing and Chi-square test method
  References
  Further reading

9. Human life expectancy is computed from an incomplete sets of data: Modeling and analysis
Arni S.R. Srinivasa Rao and James R. Carey
  1 Introduction
  2 Life expectancy of newly born babies
  3 Numerical examples
  Acknowledgments
  Appendix. Analysis of the life expectancy function
  References

10. Support vector machines: A robust prediction method with applications in bioinformatics
Arnout Van Messem
  1 Introduction
  2 Mathematical prerequisites
    2.1 Topology
    2.2 Probability and measure theory
    2.3 Functional and convex analysis
    2.4 Derivatives in normed spaces
    2.5 (†) Convex programs, Lagrange multipliers and duality
  3 An introduction to support vector machines
    3.1 (†) The generalized portrait algorithm
    3.2 (†) The hard margin SVM
    3.3 (†) The soft margin SVM
    3.4 (†) Empirical risk minimization and support vector machines
    3.5 Kernels and the reproducing kernel Hilbert space
    3.6 (†) Loss functions
    3.7 Bouligand-derivatives of loss functions
    3.8 Shifting the loss function
  4 (†) An introduction to robustness and influence functions
  5 Properties of SVMs
    5.1 Existence, uniqueness and consistency of SVMs
    5.2 Robustness of SVMs
  6 (†) Applications
    6.1 Predicting blood pressure through BMI in the presence of outliers
    6.2 Breast cancer distant metastasis through gene expression
    6.3 Splice site detection
  References

Index
Contributors

Numbers in Parentheses indicate the pages on which the author’s contributions begin.

Deepak Nag Ayyala (289), Department of Population Health Sciences, Medical College of Georgia, Augusta University, Augusta, GA, United States
Narayan Behera (349), Department of Physics, Dayananda Sagar University; Institute of Bioinformatics and Applied Biotechnology; Complex systems and computing group, Sandeb Tech Pvt Ltd, Bengaluru, Karnataka, India; Department of Applied Physics, Adama Science and Technology University, Adama, Ethiopia
James R. Carey (379), Department of Entomology, University of California, Davis; Center for the Economics and Demography of Aging, University of California, Berkeley, CA, United States
Ronald B. Geskus (249), Centre for Tropical Medicine, Oxford University Clinical Research Unit, Ho Chi Minh City, Viet Nam; Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
Arnout Van Messem (391), Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium; Center for Biotech Data Science, Ghent University Global Campus, Incheon, Republic of Korea
Shinji Nakaoka (67), Faculty of Advanced Life Science, Hokkaido University, Sapporo, Japan
Naveen Naidu Narisetty (207), Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, United States
Keisuke H. Ota (67), Faculty of Advanced Life Science, Hokkaido University, Sapporo, Japan
David A. Spade (1), Department of Mathematical Sciences, University of Wisconsin–Milwaukee, Milwaukee, WI, United States
Arni S.R. Srinivasa Rao (379), Division of Health Economics and Modeling, Department of Population Health Sciences; Laboratory for Theory and Mathematical Modeling, Department of Medicine, Division of Infectious Diseases, Medical College of Georgia; Department of Mathematics, Augusta University, Augusta, GA, United States
Hongyan Xu (337), Division of Biostatistics and Data Science, Department of Population Health Sciences, Medical College of Georgia, Augusta University, Augusta, GA, United States
Kalidas Yeturu (81), Indian Institute of Technology Tirupati, Tirupati, India
Preface

Principles and Methods for Data Science, Volume 43 in the Handbook of Statistics series, highlights new advances in the field. The volume presents interesting and timely topics, including competing risks: aims and methods; data analysis and mining of microbial community dynamics; support vector machines as a robust prediction method with applications in bioinformatics; Bayesian model selection for high-dimensional data; high-dimensional statistical inference from theoretical development to data analytics; big data challenges in genomics; analysis of microarray gene expression data using information theory and a stochastic algorithm; hybrid models; Markov chain Monte Carlo methods: theory and practice; and more.

Principles and Methods for Data Science has been developed with brilliantly written chapters by authors from various areas of data science. All the authors took the utmost care to make their chapters accessible to data-oriented scientists, theoreticians, and method developers whose work interfaces with societal issues using data. The authors have experience in one or more aspects of handling data: designing courses on data science, developing theory for data ensemble techniques, designing methods for high-dimensional data for biotechnology companies, teaching machine learning to computer science graduates, or developing theory to be applied by others. The utmost care has been taken to make the chapters accessible to students and equally engaging for subject experts. Students can use the volume to learn a new topic that has not been taught in their curricula, or use some of the chapters as additional reading material. Faculty can use the volume as a quick reference book on the subject matter. The authors' expertise ranges from computer science to mathematics, from statistics to bioinformatics, and from pure data science to computational physics.

We strongly feel that we have brought together compelling material for data science enthusiasts with both applied and purely theoretical interests. We have divided the 10 chapters of Volume 43 into three sections:

Section I: Methods, Machine Intelligence and Simulations
Section II: Statistical Approaches, High-Dimensions, Genomics and Health
Section III: Simple Data Models to Advanced Modern Support Vector Machines
Section I contains three chapters. The first chapter, by D.A. Spade, introduces Markov chain Monte Carlo methods from their origins to recent advances. It offers a clear description of the methods, their uses in practical modeling, and their stage-by-stage development from the standpoint of deterministic and continuous approaches. The second chapter, by S. Nakaoka and K.H. Ota, describes microbial metagenomic sequencing data and related statistical analysis approaches, and brings together readily available approaches for analyzing microbial dynamics data. The third chapter, by K. Yeturu, is a comprehensive set of classroom-type lecture notes written for readers ranging from computational graduate students to advanced faculty with expertise in statistical theory who want to learn machine learning techniques for data science, from the basics to advanced topics.

Section II contains five chapters. The first chapter, by N.N. Narisetty, gives a detailed account of model selection for high-dimensional data from a Bayesian perspective. The chapter also provides an overview and comparison with the frequentist approach to the research question in regression and graphical modeling settings. The second chapter, by R. Geskus, is for data scientists who work on medical statistics to improve patient care and inferential science. This chapter can be readily adapted for classroom lectures on data science methods for biostatisticians. The third chapter, by D.N. Ayyala, provides a comprehensive account of high-dimensional statistical inference with applications in data analytics. The author presents several computational tools that give readers an up-to-date overview of various challenges in big data. The fourth chapter, by H. Xu, gives a detailed description of newer challenges in big data genomics with a nontechnical reader in mind. The practical technical skills required of a data scientist that are described in the chapter are supplemented by the methods described in other chapters of the volume. The fifth chapter, by N. Behera, provides a very practical treatment of the stochastic process principles and methods involved in bioinformatics, with a special focus on microarray data. It also discusses several models for better accuracy in picking up the top-ranking candidate genes.

Section III contains two chapters. In the first chapter, A.S.R. Srinivasa Rao and J.R. Carey apply newer techniques to computing human life expectancy when limited data are available. This chapter describes a framework that can be used by scientists who are not aware of standard methods and do not have enough data to obtain life expectancy through standard methods. The second chapter, by A. Van Messem, gives a textbook-style account of support vector machines (SVMs) for a very advanced reader, covering basic to advanced theory. It also discusses high-dimensional dependency structures in theory. The ideas of this chapter are intended for scientists who require SVM theory for their bioinformatics data.

Our sincere thanks go to Mr. Sam Mahfoudh, acquisition editor (Elsevier and North-Holland), for his overall administrative support throughout the preparation of this volume. His valuable involvement in the timely handling of
authors’ queries is highly appreciated. We also thank Ms. Hilal Johnson, Editorial Project Manager (Elsevier), and Mr. Peter J. Llewellyn, who worked as the Editorial Project Manager (Elsevier) during the initial stage of the development of this volume, for their excellent assistance to the editors and authors in various technical editorial aspects throughout the preparation. Ms. Johnson provided valuable assistance with the proofs and production. Our thanks also go to Md. Sait Abdulla, the Project Manager, Book Production, Chennai, India, RELX India Private Limited, for leading the production and printing activities and for assisting the authors. Our sincere thanks and gratitude go to all the authors for writing brilliant chapters in keeping with our requirements for the volume. We firmly believe that this volume is very timely, and we are convinced that this collection will be a valuable resource for beginners and advanced scientists in data science.

Arni S.R. Srinivasa Rao
C.R. Rao
Chapter 1
Markov chain Monte Carlo methods: Theory and practice
David A. Spade*
Department of Mathematical Sciences, University of Wisconsin–Milwaukee, Milwaukee, WI, United States
*Corresponding author: e-mail: [email protected]
Abstract
In many situations, especially in Bayesian statistical analysis, it is required to draw samples from intractable probability distributions. A common way to obtain approximate samples from such distributions is to make use of Markov chain Monte Carlo (MCMC) algorithms. Two questions arise when using MCMC algorithms. The first of these is how long the underlying Markov chain must run before it can be used to draw approximate samples from the desired distribution. The second question is that of how often states of the chain can be used as approximate samples from the desired distribution so that the samples are roughly independent. This chapter provides insight into how to answer both of these questions in the course of describing how MCMC algorithms are used in practice. In this chapter, common types of MCMC algorithms are described, and Bayesian estimation using the output of the chain is also discussed.

Keywords: Bayesian statistics, Mixing time, Convergence, Metropolis–Hastings, Lyapunov conditions
1 Introduction
In many settings, particularly in Bayesian statistics, it is necessary to draw samples from probability distributions from which it is difficult to draw samples directly. In these situations, the best that can be done is to obtain approximate samples from these distributions. A common way to do this is to construct a process, known as a Markov chain, that converges (in a sense described in Section 6) to the desired distribution. This distribution is known as the target distribution. Once the Markov chain has approximately reached the target distribution, at selected increments, the states of the chain can be used as an approximate random sample from the target distribution. These samples can then be used to construct estimates of unknown quantities of interest. These estimates are then used to make inferences about phenomena that take place in the world. The process of drawing these approximate samples from the target distribution has come to be known as Markov chain Monte Carlo (MCMC). The Markov chain part of the name comes from the Markov chain that is used to approximate the target distribution, while the Monte Carlo part of the name refers to the use of states of the chain as approximate random samples from the target distribution.

Handbook of Statistics, Vol. 43. https://doi.org/10.1016/bs.host.2019.06.001 © 2020 Elsevier B.V. All rights reserved.

With the use of any MCMC algorithm come two questions. The first of these questions is that of how long the chain must run before subsequent states of the chain can be used as approximate samples from the target distribution. This is often called the burn-in time. This question can be answered in several ways. Rosenthal (1995) presents a technique that requires verifying that the chain satisfies a drift condition and a minorization condition and then using these conditions to bound the time the chain takes to approach its stationary distribution. This can be difficult to do analytically, but for some specific Markov chains, computational approaches to verifying these conditions and estimating their corresponding coefficients are available. Cowles and Rosenthal (1998) do this for general MCMC algorithms, but their approach suffers from severe computational limitations. Spade (2016) presents an approach that works well for a hybrid Metropolis–Hastings sampler. The estimates that come out of these techniques can be used to approximate an upper bound on the mixing time of the underlying chain. Due to the complexities inherent to these approaches, convergence questions are often handled on a case-by-case basis using the output of the chain. For example, Gelman and Rubin (1992) present a convergence assessment method that relies on running multiple independent chains from various points in the state space and comparing variances between chains and within chains.
Other output-based methods are presented by Heidelberger and Welch (1983), Geweke (1992), Raftery and Lewis (1992), Roberts (1992), Ritter and Tanner (1992), Yu (1994), Yu and Mykland (1994), Zellner and Min (1995), and Roberts (1996). Cowles and Carlin (1996) provide a comparative review of all of these techniques. A discussion of some of these techniques and their advantages and limitations is presented in Section 6. The second question that comes up is that of how often samples should be taken from the output of the chain. In the output, there is likely to be autocorrelation among states that are close to each other. In order to determine how far apart samples should be taken, it is common to examine autocorrelation plots to see how many steps need to be taken between samples so that they can be viewed as roughly independent. This process will also be described in Section 6. The remainder of this chapter is organized as follows. Section 2 presents an introduction to Bayesian statistical analysis. This is followed in Section 3 by an introduction to Markov chains. In Section 4, a discussion of several common MCMC methods is presented. Section 5 details common practical
uses of various MCMC algorithms, and Section 6 provides a description of several methods of answering the two questions posed above. This chapter concludes with a discussion of the concepts presented in earlier sections.
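The two questions posed above, burn-in and sampling frequency, can be illustrated on a toy problem. The following is a minimal sketch, not a method from this chapter: it assumes a standard normal target, a random-walk Metropolis sampler, and hand-picked values for the burn-in length and the thinning interval. All function names and settings are hypothetical choices made for illustration.

```python
import math
import random


def log_target(x):
    """Log of an unnormalized standard normal density (assumed toy target)."""
    return -0.5 * x * x


def random_walk_metropolis(n_steps, start, step_size, rng):
    """Run a random-walk Metropolis chain and return every visited state."""
    chain = [start]
    current = start
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_size)
        # Accept with probability min(1, target(proposal) / target(current)).
        log_ratio = log_target(proposal) - log_target(current)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = proposal
        chain.append(current)
    return chain


def autocorrelation(xs, lag):
    """Sample autocorrelation of the sequence xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var


rng = random.Random(0)
# Start deliberately far from the bulk of the target to make burn-in visible.
chain = random_walk_metropolis(n_steps=20000, start=10.0, step_size=1.0, rng=rng)

burn_in = 1000  # assumed burn-in length for this toy target
thin = 20       # assumed number of steps between retained states
kept = chain[burn_in:]
thinned = kept[::thin]

print("lag-1 autocorrelation, all post-burn-in states:", round(autocorrelation(kept, 1), 3))
print("lag-1 autocorrelation, thinned states:", round(autocorrelation(thinned, 1), 3))
print("mean of thinned sample:", round(sum(thinned) / len(thinned), 3))
```

Consecutive states are strongly correlated, while states taken every `thin` steps are much closer to independent. In practice the burn-in length and thinning interval would be chosen with the diagnostics discussed in Section 6 rather than fixed in advance.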
2 Introduction to Bayesian statistical analysis
Bayesian statistical methodology is fundamentally different from methods that are commonly used in classical statistics because it treats unknown quantities as random variables as opposed to considering them as fixed quantities that need to be estimated. In Bayesian statistics, we estimate a parameter vector $\theta$ by drawing samples from the distribution of $\theta$ given the observed data $y$. By doing this, it is possible to obtain a measure of variability in $\theta$ while also obtaining a point estimate of $\theta$. This distribution, say $p(\theta \mid y)$, is known as the posterior distribution of $\theta$ given $y$. In order to obtain this distribution, it is necessary to follow two steps. The first is to construct a joint probability distribution of all observed and unobservable quantities in the model. This distribution needs to be reflective of all available knowledge of $\theta$ and $y$. The second step is to condition on the observed data. This leads to the posterior density of $\theta$ being given by
\[
p(\theta \mid y) = \frac{p(\theta, y)}{m(y)},
\]
where $p(\theta, y)$ is the joint density of $\theta$ and $y$, and $m(y)$ is the marginal probability density of the observed data. In other words,
\[
m(y) = \int_{\Theta} p(\theta, y)\, d\theta,
\]
where $\Theta$ is the set of all possible values of $\theta$. In practice, $m(y)$ may be very difficult to obtain due to the intractability of integrating $p(\theta, y)$. There are several methods of obtaining estimates of $m(y)$ in the literature (Chib et al., 1998; Ishwaran et al., 2001; Liang, 2007; Neal, 1998; Oh and Berger, 1989; Wei and Tanner, 1990). The interested reader is encouraged to consult these resources for further information on estimating marginal densities.

What is typically available initially is the joint conditional density $f(y \mid \theta)$ of the data given a value of $\theta$. This density, viewed as a function of $\theta$ with $y$ fixed, is known as the likelihood function of $\theta$ based on $y$. In order to obtain a joint probability density of $\theta$ and $y$, it is necessary to construct a parameter model, more commonly known as a prior probability distribution on $\theta$. In this chapter, the prior density will be denoted by $\pi(\cdot)$. Choosing a prior distribution is something of a delicate matter. Ideally, one would choose a prior density that reflects any scientific knowledge that is available about $\theta$. If there is a great deal that is known about $\theta$, the prior distribution should be chosen in a way that reflects this. This ensures that both prior knowledge of $\theta$ and the
information about $\theta$ contained in the data are incorporated in the posterior distribution. If very little is known about $\theta$, then one may reasonably choose a prior distribution that has little impact on the posterior density. In many cases, prior densities that do convey the available information about $\theta$ result in a joint density of $\theta$ and $y$ that is very difficult to integrate. We can write the posterior density as
\[
p(\theta \mid y) = \frac{f(y \mid \theta)\,\pi(\theta)}{\int_{\Theta} f(y \mid \theta)\,\pi(\theta)\, d\theta}.
\]
It is common to make an effort to avoid this difficulty by choosing a prior density that conveys enough knowledge about $\theta$ while still ensuring that the joint density can still be integrated. This is typically done through specification of conjugate prior distributions. In cases where this is not done and the integral of $p(\theta, y)$ is intractable, the MCMC methods that are the focus of this chapter are often used to obtain approximate samples from the posterior density. If no scientific knowledge is available about $\theta$, the most common approach to specifying the prior density is to specify a density that conveys no information about $\theta$. Prior densities that do this are known as noninformative prior densities.
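As a small numerical illustration of the posterior formula, the sketch below evaluates the numerator and the marginal density on a grid for a one-parameter model. The model is an assumption made for this example only: Bernoulli data with a Beta prior, which is a conjugate pair, so the grid result can be compared with the known Beta posterior. The data values and hyperparameters are hypothetical.

```python
import math

# Hypothetical data: 7 successes in 10 Bernoulli trials.
successes, trials = 7, 10
a, b = 2.0, 2.0  # Beta(a, b) prior hyperparameters (illustrative values)


def beta_function(p, q):
    """Beta function B(p, q) expressed through the gamma function."""
    return math.gamma(p) * math.gamma(q) / math.gamma(p + q)


def likelihood(theta):
    """f(y | theta) for one particular sequence of Bernoulli outcomes."""
    return theta ** successes * (1.0 - theta) ** (trials - successes)


def prior(theta):
    """Normalized Beta(a, b) prior density pi(theta)."""
    return theta ** (a - 1.0) * (1.0 - theta) ** (b - 1.0) / beta_function(a, b)


# Midpoint rule on (0, 1).
n_grid = 10000
width = 1.0 / n_grid
grid = [(i + 0.5) * width for i in range(n_grid)]

joint = [likelihood(t) * prior(t) for t in grid]  # p(theta, y) on the grid
marginal = sum(joint) * width                     # m(y), the integral over Theta
posterior = [j / marginal for j in joint]         # p(theta | y) on the grid

grid_mean = sum(t * p for t, p in zip(grid, posterior)) * width

# Conjugacy: the posterior is Beta(a + successes, b + trials - successes).
exact_marginal = beta_function(a + successes, b + trials - successes) / beta_function(a, b)
exact_mean = (a + successes) / (a + b + trials)

print("m(y): grid", marginal, "closed form", exact_marginal)
print("posterior mean: grid", grid_mean, "closed form", exact_mean)
```

Grid integration of this kind is only practical in very low dimensions. When the parameter vector is large or the integral is intractable, sampling-based approaches take its place.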
2.1 Noninformative prior distributions
A noninformative prior distribution is a distribution that gives no information about $\theta$. One way to do this is to construct a prior density that is improper, but that leads to a proper posterior distribution.

Definition 1. A prior density $\pi(\cdot)$ is proper if there exists $k < \infty$ such that
\[
\int_{\Theta} \pi(\theta)\, d\theta = k.
\]
In a case where $k \neq 1$, $\pi(\theta)$ is proper, but unnormalized. In these cases, $\pi(\theta)$ is rescaled by a factor of $1/k$ to give a normalized density
\[
\pi^{*}(\theta) = \frac{1}{k}\,\pi(\theta).
\]
This makes $\pi^{*}(\theta)$ a probability density as it is simply a renormalized version of $\pi(\theta)$. This is demonstrated in the next example.

Example 1. A common way to model the variance of a $N(\mu, \sigma^2)$ distribution is to assume, a priori, that $\sigma^2$ follows an inverse gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$. Then
\[
\pi(\sigma^2) \propto (\sigma^2)^{-\alpha-1} e^{-\beta/\sigma^2}\, \mathbf{1}_{(0,\infty)}(\sigma^2),
\]
where
\[
\mathbf{1}_{A}(x) =
\begin{cases}
1 & \text{if } x \in A, \\
0 & \text{otherwise.}
\end{cases}
\]
The function $\mathbf{1}_{A}(\cdot)$ is known as the indicator function for the set $A$. Integrating the prior density gives
\[
\int_{0}^{\infty} \pi(\sigma^2)\, d\sigma^2 = \frac{\Gamma(\alpha)}{\beta^{\alpha}},
\]
where $\Gamma(\cdot)$ denotes the gamma function. Unless $\beta = [\Gamma(\alpha)]^{1/\alpha}$, this integral is not equal to 1, so $\pi(\sigma^2)$ is a proper, unnormalized prior density. By writing
\[
\pi^{*}(\sigma^2) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, (\sigma^2)^{-\alpha-1} e^{-\beta/\sigma^2}\, \mathbf{1}_{(0,\infty)}(\sigma^2)
\]
one obtains a proper, normalized prior probability density.

Definition 2. An improper probability density $\pi(\cdot)$ is one for which
\[
\int_{\Theta} \pi(\theta)\, d\theta = \infty.
\]
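These densities cannot be normalized in a way that ensures their propriety. However, as mentioned above, they can still lead to proper posterior distributions.

Example 2. Suppose that, given $\theta$, $y_1, \ldots, y_n$ are independent and follow an exponential distribution with rate parameter $\theta$, so that
$$f(y|\theta) = \theta^{n} e^{-\theta \sum_{i=1}^{n} y_i} \prod_{i=1}^{n} \mathbf{1}_{(0,\infty)}(y_i),$$
and that
$$\pi(\theta) = \mathbf{1}_{(0,\infty)}(\theta).$$
Then
$$p(\theta, y) = \theta^{n} e^{-\theta \sum_{i=1}^{n} y_i}\, \mathbf{1}_{(0,\infty)}(\theta) \prod_{i=1}^{n} \mathbf{1}_{(0,\infty)}(y_i),$$
and
$$\begin{aligned}
m(y) &= \prod_{i=1}^{n} \mathbf{1}_{(0,\infty)}(y_i) \int_{0}^{\infty} \theta^{n} e^{-\theta \sum_{i=1}^{n} y_i}\, d\theta \\
&= \prod_{i=1}^{n} \mathbf{1}_{(0,\infty)}(y_i) \int_{0}^{\infty} \theta^{(n+1)-1} e^{-\theta \sum_{i=1}^{n} y_i}\, d\theta.
\end{aligned}$$
Observe that the integrand is the kernel of a gamma distribution with shape parameter $n+1$ and rate parameter $\sum_{i=1}^{n} y_i = n\bar{y}$. Thus,
$$m(y) = \frac{\Gamma(n+1)}{(n\bar{y})^{n+1}} \prod_{i=1}^{n} \mathbf{1}_{(0,\infty)}(y_i),$$
so
$$p(\theta|y) = \frac{(n\bar{y})^{n+1}}{\Gamma(n+1)}\, \theta^{n} e^{-\theta n \bar{y}}\, \mathbf{1}_{(0,\infty)}(\theta).$$
Therefore, $p(\theta|y)$ is a proper posterior density despite the use of an improper prior distribution.

While noninformative prior densities are often specified by the researcher subjectively, one way to define a noninformative prior distribution for the one-parameter case is to consider injective functions, say $\gamma = g(\theta)$, of the parameter. Consider the prior density $\pi(\theta)$. By transformation of variables,
$$\pi(\gamma) = \pi(\theta) \left| \frac{d\theta}{d\gamma} \right| = \pi(\theta)\, \frac{1}{|g'(\theta)|}. \qquad (1)$$
Jeffreys' invariance principle states that determination of the prior density $\pi(\theta)$ should yield an equivalent result if applied to $\gamma$, so that a version of $\pi(\gamma)$ that is determined directly using $\pi(\gamma) f(y|\gamma)$ should match $\pi(\gamma)$ determined from (1). To that end, let $J(\theta)$ denote the Fisher information for $\theta$. In other words,
$$J(\theta) = \mathbb{E}\left[ \left( \frac{d \log f(y|\theta)}{d\theta} \right)^{2} \Big|\, \theta \right] = -\mathbb{E}\left[ \frac{d^{2} \log f(y|\theta)}{d\theta^{2}} \Big|\, \theta \right].$$
The prior distribution is then specified by
$$\pi(\theta) \propto [J(\theta)]^{1/2}.$$
To see why this accomplishes the goal, let $\theta = g^{-1}(\gamma)$. Then
$$\begin{aligned}
J(\gamma) &= -\mathbb{E}\left[ \frac{d^{2} \log f(y|\gamma)}{d\gamma^{2}} \right] \\
&= -\mathbb{E}\left[ \frac{d^{2} \log f(y|\theta = g^{-1}(\gamma))}{d\theta^{2}} \left( \frac{d\theta}{d\gamma} \right)^{2} \right] \\
&= J(\theta) \left( \frac{d\theta}{d\gamma} \right)^{2}.
\end{aligned}$$
Thus,
$$[J(\gamma)]^{1/2} = [J(\theta)]^{1/2} \left| \frac{d\theta}{d\gamma} \right|.$$
With multiparameter models, this process is not quite as straightforward, so we typically assume independent noninformative marginal prior distributions on the components of $\theta$.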
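As a brief worked illustration of this construction (a sketch added for concreteness, assuming a single observation $y$ from the exponential model with rate $\theta$ used in Example 2), Jeffreys' rule gives a prior proportional to $1/\theta$. Applying the same rule directly to the reparameterization $\gamma = \log \theta$ returns the flat prior that (1) predicts:

```latex
\begin{align*}
\log f(y|\theta) &= \log\theta - \theta y, \qquad
\frac{d^{2} \log f(y|\theta)}{d\theta^{2}} = -\frac{1}{\theta^{2}}, \\
J(\theta) &= \frac{1}{\theta^{2}}, \qquad
\pi(\theta) \propto [J(\theta)]^{1/2} = \frac{1}{\theta}, \\
\gamma = \log\theta:\quad
\pi(\gamma) &= \pi(\theta)\left|\frac{d\theta}{d\gamma}\right|
  \propto \frac{1}{\theta}\cdot\theta = 1, \\
J(\gamma) &= -\mathbb{E}\left[\frac{d^{2}}{d\gamma^{2}}\bigl(\gamma - e^{\gamma}y\bigr)\,\Big|\,\gamma\right]
  = e^{\gamma}\,\mathbb{E}[y|\gamma] = e^{\gamma}e^{-\gamma} = 1 .
\end{align*}
```

Both routes give a prior that is constant in $\gamma$, as the invariance principle requires.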
2.2 Informative prior distributions

If information is available about θ prior to the beginning of the study, it is natural to include this information in the specification of the prior distribution. For example, if the goal is to carry out Bayesian estimation of the mean vector of a normal distribution, it may be natural to model the mean vector with another normal distribution. In general, the prior distribution should have a support that covers all plausible values of θ. The prior distribution does not, however, need to have the probability concentrated around the true value of θ, as the prior information is often heavily outweighed by the information about θ that is contained in the data. A common approach to specifying prior distributions is to set a prior density that is conjugate to the likelihood function.
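2.2.1 Conjugate prior distributions

To begin, it is necessary to define formally a conjugate prior distribution.

Definition 3. Let $\mathcal{F}$ be a family of distributions $f(y|\theta)$, and let $\Pi$ be a family of distributions for $\theta$. Then $\Pi$ is conjugate for $\mathcal{F}$ if $p(\theta|y) \in \Pi$ for all $f(\cdot|\theta) \in \mathcal{F}$ and $\pi(\cdot) \in \Pi$.

Distributions that are members of an exponential family have natural conjugate prior distributions.

Definition 4. The class $\mathcal{F}$ is an exponential family if for all $f(\cdot|\theta) \in \mathcal{F}$, $f(y|\theta)$ can be written in the form
$$f(y|\theta) = h(y)\, c(\theta)\, \exp\left\{ \sum_{j=1}^{k} [g_j(\theta)]^{T} T_j(y) \right\} \qquad (2)$$
for some positive integer $k$.

Example 3. Consider the multivariate $N(\theta, \Sigma)$ distribution, where $\Sigma$ is fixed and known. Then
$$f(y|\theta, \Sigma) = |\Sigma|^{-1/2} e^{-\frac{1}{2}(y-\theta)^{T} \Sigma^{-1} (y-\theta)} = |\Sigma|^{-1/2} e^{-\frac{1}{2}\left( y^{T}\Sigma^{-1}y - 2\theta^{T}\Sigma^{-1}y + \theta^{T}\Sigma^{-1}\theta \right)}.$$
Let
$$c(\theta) = e^{-\frac{1}{2}\theta^{T}\Sigma^{-1}\theta}, \qquad h(y) = |\Sigma|^{-1/2} e^{-\frac{1}{2} y^{T}\Sigma^{-1}y}, \qquad T(y) = \Sigma^{-1}y, \qquad g(\theta) = \theta.$$
Then
$$f(y|\theta) = h(y)\, c(\theta)\, e^{[g(\theta)]^{T} T(y)},$$
so $f(y|\theta)$ is a member of an exponential family.

At this point, we provide an example of a conjugate prior distribution in the one-parameter case.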
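Example 4. Suppose $y$ follows a binomial distribution with $n$ trials, where $n$ is known, and unknown success probability $\theta$. Then
$$f(y|\theta) = \binom{n}{y} \theta^{y} (1-\theta)^{n-y}.$$
Assume that $\theta$ follows a Beta$(\alpha, \beta)$ distribution a priori. Then
$$\pi(\theta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1}\, \mathbf{1}_{(0,1)}(\theta),$$
so
$$p(\theta, y) \propto \theta^{y+\alpha-1} (1-\theta)^{n+\beta-y-1}\, \mathbf{1}_{(0,1)}(\theta).$$
Since $m(y)$ is a constant in $\theta$, and since $p(\theta, y)$ gives the kernel of a Beta$(\alpha + y, n + \beta - y)$ distribution, $\theta|y$ follows a Beta$(\alpha + y, n + \beta - y)$ distribution a posteriori. Thus, the posterior distribution of $\theta$ given $y$ is a beta distribution, and the beta distribution is conjugate for the binomial class of likelihoods.

Example 5. Suppose $y|\theta$ follows a $N(\theta, \Sigma)$ distribution, where $\Sigma$ is known. Assume that $\theta$ follows a $N(\theta_0, \Sigma_0)$ distribution a priori. Then
$$\begin{aligned}
f(y|\theta) &\propto e^{-\frac{1}{2}(y-\theta)^{T}\Sigma^{-1}(y-\theta)}, \\
\pi(\theta) &\propto e^{-\frac{1}{2}(\theta-\theta_0)^{T}\Sigma_0^{-1}(\theta-\theta_0)}, \\
p(\theta|y) &\propto e^{-\frac{1}{2}\theta^{T}(\Sigma^{-1}+\Sigma_0^{-1})\theta + \theta^{T}(\Sigma^{-1}y + \Sigma_0^{-1}\theta_0)}.
\end{aligned}$$
Completing the quadratic form yields the density of a $N(\theta^{*}, \Sigma^{*})$ distribution, where $\theta^{*} = (\Sigma^{-1} + \Sigma_0^{-1})^{-1}(\Sigma^{-1}y + \Sigma_0^{-1}\theta_0)$ and $\Sigma^{*} = (\Sigma_0^{-1} + \Sigma^{-1})^{-1}$. Therefore, the normal prior density is conjugate for the normal likelihood.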
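The two conjugate updates above reduce to simple parameter arithmetic. The following is a minimal Python sketch of that arithmetic; the function names and the numbers in the usage line are illustrative assumptions, and the normal case is written for a scalar observation rather than the vector form of Example 5.

```python
def beta_binomial_update(alpha, beta, y, n):
    """Posterior Beta parameters after y successes in n binomial trials."""
    if not 0 <= y <= n:
        raise ValueError("y must satisfy 0 <= y <= n")
    return alpha + y, beta + n - y


def normal_normal_update(y, sigma2, theta0, sigma2_0):
    """Scalar analogue of Example 5.

    One observation y ~ N(theta, sigma2) with prior theta ~ N(theta0, sigma2_0).
    Returns (posterior mean, posterior variance).
    """
    post_precision = 1.0 / sigma2 + 1.0 / sigma2_0
    post_var = 1.0 / post_precision
    post_mean = post_var * (y / sigma2 + theta0 / sigma2_0)
    return post_mean, post_var


if __name__ == "__main__":
    # Hypothetical numbers: Beta(2, 3) prior, 7 successes in 10 trials.
    print(beta_binomial_update(2, 3, y=7, n=10))  # (9, 6)
```

In both cases the posterior stays in the prior's family, so no integration is needed to obtain it.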
2.2.2 Nonconjugate prior distributions

While the ease of interpretation and the analytic tractability of using conjugate prior distributions make this type of prior specification appealing, it is not always justified. In most cases, the use of a nonconjugate prior distribution gives a joint density that is not analytically tractable, but that does not create any new conceptual problems in terms of carrying out a Bayesian analysis. An example of a posterior distribution that arises from the specification of a nonconjugate prior distribution is given below.

Example 6. Suppose $y_i$, $i = 1, \ldots, n$, are independent and identically distributed normal random variables with mean $\theta$ and variance $\sigma^2$, where $\sigma^2$ is known. The likelihood is given by
$$f(y|\theta) \propto e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta)^2}.$$
Assume also that, a priori, $\theta$ follows a Cauchy(0,1) distribution. The prior density for $\theta$ is
$$\pi(\theta) = \frac{1}{\pi (1+\theta^2)}\, \mathbf{1}_{\mathbb{R}}(\theta),$$
where $\pi$ in the denominator is the usual mathematical constant. Therefore, the posterior density of $\theta|y$ is
$$p(\theta|y) \propto e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta)^2}\, \frac{1}{1+\theta^2}\, \mathbf{1}_{\mathbb{R}}(\theta).$$
This is not the density of a Cauchy distribution in $\theta$, so the Cauchy distribution is nonconjugate for the normal likelihood.
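Because this posterior has no closed form, one simple way to look at it in one dimension is to evaluate the unnormalized density on a grid and normalize numerically. The sketch below does this in plain Python; the grid limits, grid size, and function names are assumptions chosen for illustration, and a grid is only practical in low dimensions.

```python
import math


def trapezoid(values, step):
    """Trapezoidal rule on an evenly spaced grid."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def cauchy_normal_posterior(y, sigma2, lo=-20.0, hi=20.0, m=4001):
    """Grid approximation to p(theta | y) for a N(theta, sigma2) likelihood
    with known sigma2 and a Cauchy(0, 1) prior.

    Returns (grid, density), with density normalized to integrate to 1.
    """
    step = (hi - lo) / (m - 1)
    grid = [lo + i * step for i in range(m)]
    # Log of the unnormalized posterior: log-likelihood plus log-prior kernel.
    log_post = [
        -sum((yi - t) ** 2 for yi in y) / (2.0 * sigma2) - math.log1p(t * t)
        for t in grid
    ]
    top = max(log_post)  # subtract the maximum to avoid underflow
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = trapezoid(unnorm, step)
    return grid, [u / total for u in unnorm]


def posterior_mean(grid, density):
    step = grid[1] - grid[0]
    return trapezoid([t * d for t, d in zip(grid, density)], step)


if __name__ == "__main__":
    grid, dens = cauchy_normal_posterior([1.8, 2.4, 2.1, 1.6, 2.3], sigma2=1.0)
    print(posterior_mean(grid, dens))
```

The MCMC methods discussed later replace this grid when the parameter has more than a few components.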
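2.3 Bayesian estimation

Bayesian estimation differs from frequentist estimation in the sense that a Bayesian estimator $\hat{\theta}(y)$ of $\theta$ is one that minimizes a chosen expected loss with respect to the posterior distribution. Here, we see how this is done. We define four key quantities now.

Definition 5. The loss function $L(\hat{\theta}(y), \theta)$ is a selected function that reflects the consequences, or loss, of selecting $\hat{\theta}(y)$ as an estimator of $\theta$.

When data $y$ are observed, it is possible to determine a risk function $R(\hat{\theta}(y), \theta)$ for $\hat{\theta}(y)$.

Definition 6. The risk function $R(\hat{\theta}(y), \theta)$ of $\hat{\theta}(y)$ is given by
$$R(\hat{\theta}(y), \theta) = \mathbb{E}_{y}\left[ L(\hat{\theta}(y), \theta) \right] = \int_{\mathcal{Y}} L(\hat{\theta}(y), \theta)\, f(y|\theta)\, dy,$$
where $f(y|\theta)$ denotes the likelihood function.

Now let $p(\theta|y)$ denote the posterior density of $\theta$. Then an estimator of $\theta$ can be chosen so that it minimizes the posterior expected loss.

Definition 7. The posterior expected loss $\rho(\hat{\theta}(y))$ of an estimator $\hat{\theta}(y)$ is given by
$$\rho(\hat{\theta}(y)) = \mathbb{E}_{\theta|y}\left[ L(\hat{\theta}(y), \theta) \right] = \int_{\Theta} L(\hat{\theta}(y), \theta)\, p(\theta|y)\, d\theta.$$

The most common way to choose an estimator $\hat{\theta}(y)$ of $\theta$ in the Bayesian setting is to choose $\hat{\theta}(y)$ to minimize the Bayes risk $BR(\hat{\theta}(y))$.

Definition 8. The Bayes risk $BR(\hat{\theta}(y))$ of an estimator $\hat{\theta}(y)$ is defined by
$$\begin{aligned}
BR(\hat{\theta}(y)) &= \mathbb{E}_{\theta|y}\left[ R(\hat{\theta}(y), \theta) \right] \\
&= \mathbb{E}_{\theta|y}\left[ \mathbb{E}_{y}\left[ L(\hat{\theta}(y), \theta) \right] \right] \\
&= \int_{\Theta} \int_{\mathcal{Y}} L(\hat{\theta}(y), \theta)\, f(y|\theta)\, p(\theta|y)\, dy\, d\theta.
\end{aligned}$$

Bayesian estimators are chosen in such a way that they minimize either the posterior expected loss or the Bayes risk. There are several loss functions for which it is known which estimator minimizes the Bayes risk. Consider squared error loss. In other words, let
$$L(\hat{\theta}(y), \theta) = (\hat{\theta}(y) - \theta)^{T} (\hat{\theta}(y) - \theta).$$
The Bayesian estimator here is
$$\hat{\theta}(y) = \mathbb{E}_{\theta|y}[\theta|y] = \int_{\Theta} \theta\, p(\theta|y)\, d\theta.$$
In other words, the posterior mean minimizes the Bayes risk under squared error loss. We now show why this is the case in the one-parameter setting, writing $x$ for the observed data. We have, in the one-parameter case, $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$, so
$$\rho(\theta, \hat{\theta}) = \mathbb{E}\left[ L(\theta, \hat{\theta}) \,|\, x \right] = \int_{\Theta} L(\theta, \hat{\theta})\, p(\theta|x)\, d\theta = \int_{\Theta} (\theta - \hat{\theta})^2\, p(\theta|x)\, d\theta.$$
The goal is to minimize this with respect to $\hat{\theta}$. Then
$$\hat{\theta}_{Bayes} = \arg\min_{\hat{\theta}} \left\{ \int_{\Theta} \theta^2 p(\theta|x)\, d\theta - 2\hat{\theta} \int_{\Theta} \theta\, p(\theta|x)\, d\theta + \hat{\theta}^2 \int_{\Theta} p(\theta|x)\, d\theta \right\}.$$
Setting the first derivative to zero gives
$$\frac{\partial \rho(\theta, \hat{\theta})}{\partial \hat{\theta}} = -2\,\mathbb{E}[\theta|x] + 2\hat{\theta} = 0, \quad \text{so} \quad \hat{\theta}_{Bayes} = \mathbb{E}[\theta|x].$$
Moreover,
$$\frac{\partial^2 \rho(\theta, \hat{\theta})}{\partial \hat{\theta}^2} = 2 > 0,$$
so the posterior mean minimizes the posterior expected loss, and so it is the Bayes estimator under squared error loss.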
Example 7. Suppose $y_1, \ldots, y_n$ are independent and identically distributed Poisson random variables with rate parameter $\theta$, and assume a priori that $\theta$ follows an exponential distribution with rate parameter $\theta_0$. Define for an estimator $\hat{\theta}(y)$
$$L(\hat{\theta}(y), \theta) = (\hat{\theta} - \theta)^2.$$
The likelihood function and prior density are
$$f(y|\theta) = \prod_{i=1}^{n} \frac{\theta^{y_i} e^{-\theta}}{y_i!} = e^{-n\theta}\, \theta^{\sum_{i=1}^{n} y_i} \prod_{i=1}^{n} \frac{1}{y_i!}, \qquad \pi(\theta) = \theta_0 e^{-\theta \theta_0}.$$
Therefore, the posterior density is given by
$$p(\theta|y) \propto e^{-\theta(n+\theta_0)}\, \theta^{\sum_{i=1}^{n} y_i},$$
so the posterior distribution of $\theta|y$ is a gamma distribution with shape parameter $n\bar{y} + 1$ and rate parameter $n + \theta_0$. Therefore,
$$\hat{\theta}(y) = \mathbb{E}_{\theta|y}[\theta|y] = \frac{n\bar{y} + 1}{n + \theta_0}$$
is the Bayesian estimator of $\theta$ based on squared error loss.

Another common loss function is absolute error loss (in the one-parameter case), which is given by
$$L(\hat{\theta}(y), \theta) = |\hat{\theta}(y) - \theta|.$$
Under absolute error loss, the posterior median minimizes the Bayes risk. In multiparameter settings, this loss function becomes considerably more complicated to use.

Example 8. Suppose $y_1, \ldots, y_n$ are independent and identically distributed normal random variables with mean $\mu$ and variance 1. Assume a priori that $\mu$ follows a normal distribution with mean $\mu_0$ and variance 4. The loss function for an estimator $\hat{\mu}(y)$ is
$$L(\hat{\mu}(y), \mu) = |\hat{\mu}(y) - \mu|.$$
The likelihood function and prior density are given by
$$f(y|\mu) \propto e^{-\frac{1}{2} \sum_{i=1}^{n} (y_i - \mu)^2}, \qquad \pi(\mu) \propto e^{-\frac{1}{8}(\mu - \mu_0)^2}.$$
Therefore,
$$\begin{aligned}
p(\mu|y) &\propto e^{-\frac{1}{2}\sum_{i=1}^{n} y_i^2 + \mu n\bar{y} - \frac{n}{2}\mu^2 - \frac{1}{8}\mu^2 + \frac{1}{4}\mu\mu_0 - \frac{1}{8}\mu_0^2} \\
&\propto e^{-\frac{1}{2}\left(n + \frac{1}{4}\right)\mu^2 + \mu\left(n\bar{y} + \frac{1}{4}\mu_0\right)}.
\end{aligned}$$
Thus, $\mu|y$ follows a normal distribution with mean $\frac{4}{4n+1}\left(n\bar{y} + \frac{1}{4}\mu_0\right)$ and variance $\frac{4}{4n+1}$. Since the normal distribution is symmetric, the posterior median is equal to the posterior mean. Therefore, the Bayesian estimator $\hat{\mu}(y)$ of $\mu$ is given by
$$\hat{\mu}(y) = \frac{4}{4n+1}\left(n\bar{y} + \frac{1}{4}\mu_0\right).$$

A Bayesian estimator can be found based on any loss function by following the process detailed above. The squared error loss and the absolute error loss are used frequently because the Bayesian estimators based on these loss functions are known, so analysis of the posterior density does not need to be done. Furthermore, the squared error loss has the property that its associated risk function is precisely a commonly used measure of the quality of an estimator. It is a straightforward step to see that under squared error loss,
$$R(\hat{\theta}(y), \theta) = \mathbb{E}_{y}\left[ (\hat{\theta}(y) - \theta)^{T} (\hat{\theta}(y) - \theta) \right] = \mathrm{MSE}(\hat{\theta}(y)),$$
where $\mathrm{MSE}(\cdot)$ denotes the mean-squared error of the estimator.
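The closed-form estimators of Examples 7 and 8 can be checked numerically. The following Python sketch computes both estimators and then evaluates the posterior expected squared error loss of Example 7 by midpoint integration, to confirm that it is smallest at the posterior mean. The data values, integration limit, and function names are illustrative assumptions.

```python
import math


def poisson_exponential_bayes(y, theta0):
    """Example 7: posterior mean (n * ybar + 1) / (n + theta0)."""
    return (sum(y) + 1) / (len(y) + theta0)


def normal_normal_bayes(y, mu0):
    """Example 8: posterior median (= mean), unit data variance, prior variance 4."""
    n = len(y)
    return 4.0 / (4 * n + 1) * (sum(y) + mu0 / 4.0)


def gamma_pdf(t, shape, rate):
    """Gamma density in the shape/rate parameterization, for t > 0."""
    return math.exp(shape * math.log(rate) - math.lgamma(shape)
                    + (shape - 1) * math.log(t) - rate * t)


def expected_sq_loss(a, shape, rate, hi=50.0, m=20000):
    """Midpoint-rule approximation of E[(theta - a)^2] under a gamma posterior."""
    step = hi / m
    total = 0.0
    for i in range(m):
        t = (i + 0.5) * step
        total += (t - a) ** 2 * gamma_pdf(t, shape, rate)
    return total * step


if __name__ == "__main__":
    y = [3, 1, 4, 1, 5]   # hypothetical Poisson counts
    theta0 = 2.0          # hypothetical prior rate
    est = poisson_exponential_bayes(y, theta0)
    shape, rate = sum(y) + 1, len(y) + theta0
    for a in (est - 0.1, est, est + 0.1):
        print(round(a, 4), expected_sq_loss(a, shape, rate))
```

Moving the estimate away from the posterior mean in either direction increases the expected loss by the square of the shift, as the derivation above implies.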
3 Markov chain Monte Carlo background

This section introduces the background on Markov chains that is required for an understanding of the concepts related to MCMC methods. This introduction includes ideas related to the limiting behavior of Markov chains. We begin with a discussion of Markov chains on discrete-state spaces in order to develop comfort with Markov chains before extending these concepts to general state spaces. First, we define a Markov chain.

Definition 9. A Markov chain $(X_t)_{t \ge 0}$ is a discrete-time stochastic process $\{X_0, X_1, \ldots\}$ with the property that, given $X_0, X_1, \ldots, X_{t-1}$, the distribution of $X_t$ depends only on $X_{t-1}$. Formally, $(X_t)_{t \ge 0}$ is a Markov chain if for all $A \subseteq S$, where $S$ is the state space,
$$P(X_t \in A \,|\, X_0, \ldots, X_{t-1}) = P(X_t \in A \,|\, X_{t-1}).$$
3.1 Discrete-state Markov chains

Let $S = \{x_1, x_2, \ldots\}$ be a discrete-state space. Then transition probabilities are of the form $P_{ij}(t) = P[X_t = j \,|\, X_{t-1} = i]$. If $(X_t)_{t \ge 0}$ is to converge to a stationary distribution, $(X_t)_{t \ge 0}$ has to satisfy three conditions. First, the chain must be irreducible, which means that any state $j$ can be reached from any state $i$ in a finite number of steps. The chain must be positive recurrent, meaning that, on average, the chain starting in state $i$ returns to state $i$ in a finite number of steps for all $i \in S$. The chain must also be aperiodic, which means it is not expected to make regular oscillations between states. These terms are formalized below.

Definition 10. $(X_t)_{t \ge 0}$ is irreducible if for all $i, j$, there exists an integer $t > 0$ such that $P_{ij}(t) > 0$.

Definition 11. An irreducible Markov chain $(X_t)_{t \ge 0}$ is recurrent if the first return time $\tau_{ii} = \min\{t > 0 : X_t = i \,|\, X_0 = i\}$ to state $i$ has the property that for all $i$, $P(\tau_{ii} < \infty) = 1$.

Definition 12. An irreducible, recurrent Markov chain is positive recurrent if for all $i$, $\mathbb{E}[\tau_{ii}] < \infty$.

Definition 13. A Markov chain $(X_t)_{t \ge 0}$ has stationary distribution $\pi(\cdot)$ if for all $j$ and for all $t \ge 0$,
$$\sum_{i} \pi(i) P_{ij}(t) = \pi(j).$$
The existence of a stationary distribution for the chain is equivalent to that chain being positive recurrent.

Definition 14. An irreducible Markov chain is aperiodic if for all $i$,
$$\gcd\{t > 0 : P_{ii}(t) > 0\} = 1.$$

Definition 15. $(X_t)_{t \ge 0}$ is reversible if it is positive recurrent with stationary distribution $\pi(\cdot)$ and, for all $i, j$, $\pi(i) P_{ij} = \pi(j) P_{ji}$.

The discrete-state Markov chain $(X_t)_{t \ge 0}$ has a unique stationary distribution if it is irreducible, aperiodic, and reversible. The next example illustrates some of these properties.
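Example 9. Consider the Markov chain $(X_t)_{t \ge 0}$ with state space $S = \{0, 1, 2\}$. Let the transition probability matrix be given by
$$P = \begin{bmatrix} \frac{1}{8} & \frac{3}{8} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{8} & \frac{3}{8} \\ \frac{3}{8} & \frac{1}{2} & \frac{1}{8} \end{bmatrix},$$
so that $P_{ij} = P(X_t = j \,|\, X_{t-1} = i)$. Since $P_{00} = P_{11} = P_{22} = \frac{1}{8} > 0$, the chain is aperiodic, and since all transition probabilities are positive, the chain is clearly irreducible. Now we try to find a stationary distribution. To do this, we solve the following system of equations.
$$\begin{aligned}
\pi(0) &= \pi(0)P_{00} + \pi(1)P_{10} + \pi(2)P_{20} = \tfrac{1}{8}\pi(0) + \tfrac{1}{2}\pi(1) + \tfrac{3}{8}\pi(2), \\
\pi(1) &= \pi(0)P_{01} + \pi(1)P_{11} + \pi(2)P_{21} = \tfrac{3}{8}\pi(0) + \tfrac{1}{8}\pi(1) + \tfrac{1}{2}\pi(2), \\
\pi(2) &= \pi(0)P_{02} + \pi(1)P_{12} + \pi(2)P_{22} = \tfrac{1}{2}\pi(0) + \tfrac{3}{8}\pi(1) + \tfrac{1}{8}\pi(2).
\end{aligned}$$
Solving this system, together with the constraint $\pi(0) + \pi(1) + \pi(2) = 1$, gives
$$\pi(0) = \pi(1) = \pi(2) = \frac{1}{3}.$$
The uniform distribution is stationary here because each column of $P$ sums to 1. Note that $P$ is not symmetric: $\pi(0)P_{01} = \frac{1}{8}$ while $\pi(1)P_{10} = \frac{1}{6}$, so the condition in Definition 15 fails and $(X_t)_{t \ge 0}$ is not reversible. The stationary distribution is nevertheless unique, because an irreducible chain on a finite state space has exactly one stationary distribution. Furthermore, the existence of the stationary distribution ensures that the chain is positive recurrent.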
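The stationary distribution in Example 9 can be verified mechanically. The sketch below, in plain Python with exact fractions, checks that the uniform distribution satisfies $\pi P = \pi$ and then recovers the same distribution by repeatedly applying $P$ to an arbitrary starting distribution. The helper names and the number of iterations are assumptions made for illustration.

```python
from fractions import Fraction as F

# Transition matrix of Example 9; row i holds P(X_t = j | X_{t-1} = i).
P = [
    [F(1, 8), F(3, 8), F(1, 2)],
    [F(1, 2), F(1, 8), F(3, 8)],
    [F(3, 8), F(1, 2), F(1, 8)],
]


def step(dist, matrix):
    """One step of the chain on distributions: returns dist * matrix."""
    k = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(k)) for j in range(k)]


def stationary_by_iteration(matrix, n_steps=200):
    """Approximate the limiting distribution by iterating from a point mass."""
    as_float = [[float(p) for p in row] for row in matrix]
    dist = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(n_steps):
        dist = step(dist, as_float)
    return dist


if __name__ == "__main__":
    uniform = [F(1, 3)] * 3
    print(step(uniform, P) == uniform)   # exact check of pi P = pi
    print(stationary_by_iteration(P))
```

Starting the iteration from any other initial distribution gives the same limit, which is the practical meaning of convergence to the stationary distribution.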
3.2 General state space Markov chain theory

Here, we generalize the concepts presented in Section 3.1 to cover a Markov chain that explores a general state space $\Omega$. We begin with the notion of a transition kernel.

Definition 16. A function $K : \Omega \times \mathcal{B}(\Omega) \to [0, 1]$ is a transition kernel if
1. For all $A \in \mathcal{B}(\Omega)$, $K(\cdot, A)$ is a nonnegative measurable function on $\Omega$, and
2. For all $x \in \Omega$, $K(x, \cdot)$ is a probability measure on $\mathcal{B}(\Omega)$.

Definition 17. If $(X_t)_{t \ge 0}$ is a Markov chain on a state space $\Omega$, its transition kernel is given by
$$K(x, A) = P(X_{t+1} \in A \,|\, X_t = x)$$
for $x \in \Omega$ and $A \in \mathcal{B}(\Omega)$. The $k$-step transition kernel for $(X_t)_{t \ge 0}$ is
$$K^{k}(x, A) = P(X_{t+k} \in A \,|\, X_t = x).$$

Now we address the notion of $\phi$-irreducibility.

Definition 18. A Markov chain $(X_t)_{t \ge 0}$ with transition kernel $K(\cdot, \cdot)$ on a state space $\Omega$ is $\phi$-irreducible if there exists a nontrivial measure $\phi$ on $\mathcal{B}(\Omega)$ such that if $\phi(A) > 0$, then for every $x \in \Omega$ there exists an integer $k_x$ such that $K^{k_x}(x, A) > 0$.

Next, we discuss the idea of stationary distributions.

Definition 19. A probability measure $\pi(\cdot)$ on $\mathcal{B}(\Omega)$ is called a stationary measure for the Markov chain $(X_t)_{t \ge 0}$ having state space $\Omega$ and transition kernel $K(\cdot, \cdot)$ if for all $A \in \mathcal{B}(\Omega)$,
$$\pi(A) = \int_{\Omega} \pi(dx)\, K(x, A).$$

In order to address uniqueness of the stationary measure, we need the concepts of recurrence and aperiodicity. In order to develop these, let $\eta_A$ denote the expected number of visits the chain makes to the set $A \in \mathcal{B}(\Omega)$. Then the set $A$ is called a recurrent set if $\eta_A = \infty$.

Definition 20. The Markov chain $(X_t)_{t \ge 0}$ with state space $\Omega$ is recurrent if for all $A \in \mathcal{B}(\Omega)$, $A$ is recurrent.

In order to discuss aperiodicity, we need the notion of small sets.

Definition 21. A set $C \in \mathcal{B}(\Omega)$ is a small set if there exists a nontrivial measure $\nu_m$ on $\mathcal{B}(\Omega)$ and an integer $m > 0$ such that for all $x \in C$ and $A \in \mathcal{B}(\Omega)$, $K^{m}(x, A) \ge \nu_m(A)$.

Now we examine $d$-cycles. Let
$$E_C = \{n \ge 1 : \text{the set } C \text{ is small with respect to } \nu_n, \text{ with } \nu_n = \delta_n \nu \text{ for some } \delta_n > 0\}.$$

Definition 22. Let $(X_t)_{t \ge 0}$ be a $\phi$-irreducible Markov chain on $\Omega$, and let $C \in \mathcal{B}(\Omega)$, where $C$ is a small set. Let $\{D_i\}$, $i = 1, \ldots, d$, partition $\Omega$. If for each $i = 1, \ldots, d-1$ and $x \in D_i$, $K(x, D_{i+1}) = 1$, and for $x \in D_d$, $K(x, D_1) = 1$, then $\{D_i\}$ is a $d$-cycle. The largest value of $d$ for which a $d$-cycle for $(X_t)_{t \ge 0}$ occurs is the period of $(X_t)_{t \ge 0}$. If $d = 1$, the chain is aperiodic, and if there exists a nontrivial set $A$ that is small with respect to $\nu_1$, the chain is strongly aperiodic.
If a $\phi$-irreducible Markov chain is recurrent and strongly aperiodic, its stationary distribution is unique (Meyn and Tweedie, 2005). One final property to discuss is that of reversibility.

Definition 23. A Markov chain $(X_t)_{t \ge 0}$ on a state space $\Omega$ with transition kernel $K$ that admits a density $k(\cdot, \cdot)$ and stationary measure $\pi$ is reversible if for all $x, y \in \Omega$,
$$k(x, y)\, \pi(x) = k(y, x)\, \pi(y).$$

Reversibility is helpful because strong aperiodicity and recurrence can be difficult to verify. Uniqueness of the stationary distribution can also be verified by showing that $(X_t)_{t \ge 0}$ is $\phi$-irreducible, aperiodic, and reversible.
4 Common MCMC algorithms

This section presents several common versions of MCMC algorithms along with illustrative examples. The section begins with a description of several versions of the Metropolis–Hastings (Hastings, 1970; Metropolis et al., 1953) algorithm and the Gibbs sampler (Geman and Geman, 1984). This section also describes algorithms such as the slice sampler (Neal, 2003) and reversible jump MCMC (Green, 1995).
4.1 The Metropolis–Hastings algorithm

Here, we describe the Metropolis–Hastings algorithm. We describe and illustrate the general form of the algorithm, and then we discuss several common special cases. Let $p(\cdot)$ denote the posterior distribution of a random vector $X$. Then for a Metropolis–Hastings chain $(X_t)_{t \ge 0}$, $p(\cdot)$ is the target distribution. To begin, choose an initial value $x_0$ from some distribution. The prior distribution is a common choice of distribution from which to draw the initial state. At iteration $t = 1, 2, \ldots$, propose a new value, say $x^{*}$, from a density $q(\cdot|x_t)$ that can be used as a candidate for the next value of $x$. Let
$$\alpha(x^{*}, x_t) = \min\left\{ \frac{p(x^{*})\, q(x_t|x^{*})}{p(x_t)\, q(x^{*}|x_t)},\ 1 \right\}.$$
Then $x^{*}$ is selected as the value of $x_{t+1}$ with probability $\alpha(x^{*}, x_t)$, and $x_{t+1} = x_t$ with probability $1 - \alpha(x^{*}, x_t)$.

Example 10. Let $y_1, \ldots, y_{50}$ be independent and identically distributed Poisson random variables with rate parameter $\lambda$, and assume a priori that $\lambda$ follows an exponential distribution with scale parameter 20. For this example, the data are drawn from a Poisson distribution with rate parameter 20. The posterior density of $\lambda$ is
$$p(\lambda) \propto f(y|\lambda)\, \pi(\lambda) = \left( \prod_{i=1}^{50} \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} \right) \frac{1}{20} e^{-\lambda/20} \propto \lambda^{\sum_{i=1}^{50} y_i} e^{-50.05\lambda},$$
so that $\lambda|y$ follows a gamma distribution with shape parameter $\sum_{i=1}^{50} y_i + 1$ and scale parameter $1/50.05$. Clearly, it is possible to sample directly from this distribution, but we shall use it to illustrate and evaluate the behavior of the Metropolis–Hastings algorithm. Let $\lambda^{*}$ be chosen from an exponential distribution with scale parameter $\lambda_t$, so that
$$q(\lambda^{*}|\lambda_t) = \frac{1}{\lambda_t} e^{-\lambda^{*}/\lambda_t}.$$
The chain is run for 50,000 iterations. Fig. 1 shows a trace plot of the output of the chain with a yellow line overlaid at $\lambda = 20$ and a white line overlaid at the posterior mean, and Fig. 2 shows a histogram with the posterior density overlaid.

At this point, we turn our attention to the Metropolis et al. (1953) algorithm. This algorithm is a special case of the Metropolis–Hastings algorithm in which, for a given state $x_t$ and a proposal $x^{*}$, $q(x_t|x^{*}) = q(x^{*}|x_t)$.

Example 11. Suppose that $y_1, \ldots, y_{30}$ are independent and identically normally distributed random variables with unknown mean $\mu$ and variance 9. Assume a priori that $\mu$ follows a normal distribution with mean 0 and variance 1. For this example, the data are simulated from a normal distribution with mean 2 and variance 9. Then
$$\begin{aligned}
p(\mu) &\propto f(y|\mu)\, \pi(\mu) \\
&\propto e^{-\frac{1}{18} \sum_{i=1}^{30} (y_i - \mu)^2 - \frac{1}{2}\mu^2} \\
&\propto e^{-\frac{13}{6}\mu^2 + \frac{10}{3}\mu\bar{y}}.
\end{aligned}$$
Thus, a posteriori, $\mu$ follows a normal distribution with mean $\frac{10}{13}\bar{y}$ and variance $\frac{3}{13}$. Suppose that at time $t$, $\mu^{*}$ is proposed from a normal distribution with mean $\mu_t$ and variance 4. Then
$$\frac{q(\mu_t|\mu^{*})}{q(\mu^{*}|\mu_t)} = \frac{e^{-\frac{1}{8}(\mu_t - \mu^{*})^2}}{e^{-\frac{1}{8}(\mu^{*} - \mu_t)^2}} = 1.$$
This is, therefore, an example of a Metropolis algorithm. The chain in this example is also run for 50,000 iterations. Fig. 3 shows a trace plot of the output of the chain with a yellow line overlaid at $\mu = 2$ and a white line overlaid at the posterior mean. Fig. 4 shows a histogram with the posterior density overlaid.

FIG. 1 Trace plot of the output of the Metropolis–Hastings algorithm for estimating $\lambda$. A yellow line is overlaid at $\lambda = 20$ in order to facilitate evaluation of the behavior of the chain. It can be seen that the chain slightly underestimates the true value of $\lambda$, but that the chain still settles quite closely to 20. A white line is overlaid to represent the posterior mean. It can be seen that the chain settles around the posterior mean.
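The Metropolis–Hastings scheme of Example 10 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the code behind the figures: the Poisson generator (Knuth's multiplication method), the starting value, the seed, and all function names are assumptions. The target is evaluated on the log scale, and the Hastings correction $q(\lambda_t|\lambda^{*})/q(\lambda^{*}|\lambda_t)$ is kept because the exponential proposal is not symmetric.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for moderate rates."""
    limit = math.exp(-rate)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def log_target(lam, sum_y, n, prior_scale=20.0):
    """Log posterior kernel: sum_y * log(lam) - (n + 1/prior_scale) * lam."""
    return sum_y * math.log(lam) - (n + 1.0 / prior_scale) * lam


def log_q(x, scale):
    """Log density at x of an exponential distribution with the given scale."""
    return -math.log(scale) - x / scale


def metropolis_hastings(sum_y, n, n_iter, lam0, rng):
    lam, chain = lam0, []
    for _ in range(n_iter):
        prop = rng.expovariate(1.0 / lam)  # exponential with mean (scale) lam
        if prop > 0.0:
            log_alpha = (log_target(prop, sum_y, n) + log_q(lam, prop)
                         - log_target(lam, sum_y, n) - log_q(prop, lam))
            if rng.random() < math.exp(min(0.0, log_alpha)):
                lam = prop
        chain.append(lam)
    return chain


if __name__ == "__main__":
    rng = random.Random(2020)
    y = [sample_poisson(20.0, rng) for _ in range(50)]
    chain = metropolis_hastings(sum(y), len(y), 50_000, lam0=10.0, rng=rng)
    kept = chain[5_000:]   # discard an arbitrary burn-in
    print("chain mean:", sum(kept) / len(kept))
    print("exact posterior mean:", (sum(y) + 1) / 50.05)
```

For the Metropolis algorithm of Example 11, the two `log_q` terms cancel and can be dropped, since the normal proposal is symmetric.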
4.2 Multivariate Metropolis–Hastings

A common approach to multivariate problems, and univariate problems, for that matter, is to employ what is known as a random-walk Metropolis (Roberts et al., 1997) algorithm. Random-walk Metropolis algorithms are among the most common (Sherlock et al., 2010) of all the versions of the Metropolis–Hastings algorithm. The random-walk Metropolis algorithm is a special case of the Metropolis algorithm in which an increment $\varepsilon$ is chosen from a given density $q(\cdot)$. The proposal for $x_{t+1}$ is $x^{*} = x_t + \varepsilon$.
FIG. 2 Histogram of the output of the Metropolis–Hastings chain with the posterior density overlaid. It can be seen that the chain does a good job of approximating the posterior distribution.
FIG. 3 Trace plot of the output of the Metropolis algorithm for estimating μ with a yellow line overlaid to highlight where the true value of μ is and a white line overlaid to highlight the posterior mean. The yellow and white lines are so close together that it is difficult to see both. The posterior mean is 1.997. This indicates that the chain is approximating both the true mean and the posterior mean well.
FIG. 4 Histogram of the output of the Metropolis algorithm for estimating μ with the posterior density overlaid. The posterior density matches the histogram very well, indicating that the chain is performing well in approximating the posterior distribution.
Example 12. For this example, assume that $y_1, \ldots, y_{30}$ are independent and identically distributed according to a normal distribution with mean vector $\mu$ and covariance matrix $I_4$, where $I_4$ is the $4 \times 4$ identity matrix. For data generation, the mean vector is chosen as $\mu = [2, 5, 10, 6]^{T}$. Assume a priori that $\mu$ follows a normal distribution with mean vector $\tau = [1, 7, 8, 2]^{T}$ and covariance matrix $4I_4$. The posterior density then is given by
$$\begin{aligned}
p(\mu) &\propto e^{-\frac{1}{2} \sum_{i=1}^{30} (y_i - \mu)^{T}(y_i - \mu) - \frac{1}{8}(\mu - \tau)^{T}(\mu - \tau)} \\
&\propto e^{-\frac{1}{2}(30.25)\,\mu^{T}\mu + \left(30\bar{y} + \frac{1}{4}\tau\right)^{T}\mu},
\end{aligned}$$
so that, a posteriori, $\mu$ follows a normal distribution with mean vector
$$\mu^{*} = [2.344,\ 5.221,\ 9.943,\ 6.690]^{T}$$
and covariance matrix $0.033\, I_4$. The increment density is the four-dimensional independent uniform density on $[-0.5, 0.5]$. The chain was run for 50,000 iterations. Fig. 5 shows trace plots of each component of $\mu$ with a yellow line overlaid for the true value of $\mu_i$ and a white line overlaid to represent the posterior mean value of $\mu_i$. Fig. 6 provides histograms of the marginal empirical posterior distributions of $\mu_i$ with the corresponding marginal density overlaid.

FIG. 5 Trace plots of $\mu_1, \ldots, \mu_4$ with yellow lines overlaid to represent the true value of $\mu_i$ and white lines overlaid to represent the posterior mean value of $\mu_i$. It can be seen that the chain approximates each posterior mean well, and estimates the true values of $\mu_2$ and $\mu_3$ quite well. The algorithm does slightly overestimate $\mu_1$ and $\mu_4$, however.

A second approach in the multivariate case is to employ a random-walk Metropolis algorithm that updates one component of the vector of interest at a time, where the component to be updated is selected at random. This approach is known as a random-scan, random-walk Metropolis algorithm (Fort et al., 2003). This approach often results in more rapid coverage of the state space because it allows larger steps in one direction than does the random-walk Metropolis algorithm. To illustrate it, we reconfigure the previous example as a random-scan, random-walk Metropolis sampler.
FIG. 6 Histograms of μ1, …, μ4 with the marginal posterior densities overlaid. It can be seen that the empirical marginal posterior distributions perform well in approximating the posterior distribution of each component.
Example 13. In this example, assume that the proposed increment $\varepsilon_i$ to $\mu_i$ is selected from a uniform distribution on $[-2, 2]$ for $i = 1, 2, 3, 4$. Fig. 7 presents trace plots of $\mu_1, \ldots, \mu_4$, with a yellow line overlaid on each to represent the true value of $\mu_i$ and a white line overlaid to represent the posterior mean value of $\mu_i$. The chain was run for 50,000 iterations. We will discuss better ways to choose how long to run the chain in Section 6. Fig. 8 presents histograms representing the empirical marginal posterior distributions of each component of $\mu$, with the theoretical marginal posterior density overlaid on each.
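A random-scan, random-walk Metropolis sampler of the kind used in Example 13 can be sketched as follows in plain Python. To keep the sketch self-contained, the target is taken to be the normal posterior reported in Example 12 (mean vector as printed there, covariance $I_4/30.25$) instead of being recomputed from simulated data. The starting point, seed, burn-in length, and function names are assumptions.

```python
import math
import random

POST_MEAN = [2.344, 5.221, 9.943, 6.690]  # posterior mean reported in Example 12
POST_VAR = 1.0 / 30.25                    # common posterior variance of each component


def log_target(mu):
    """Log kernel of the N(POST_MEAN, POST_VAR * I) posterior."""
    return -sum((m - c) ** 2 for m, c in zip(mu, POST_MEAN)) / (2.0 * POST_VAR)


def random_scan_rwm(n_iter, start, half_width, rng):
    """Update one randomly chosen component per iteration with a uniform increment."""
    mu = list(start)
    current = log_target(mu)
    chain = []
    for _ in range(n_iter):
        i = rng.randrange(len(mu))                 # component chosen at random
        proposal = list(mu)
        proposal[i] += rng.uniform(-half_width, half_width)
        candidate = log_target(proposal)
        # Symmetric increment, so the acceptance ratio is p(proposal) / p(current).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        chain.append(list(mu))
    return chain


if __name__ == "__main__":
    rng = random.Random(13)
    chain = random_scan_rwm(50_000, start=[1.0, 7.0, 8.0, 2.0], half_width=2.0, rng=rng)
    kept = chain[5_000:]
    print([round(sum(s[i] for s in kept) / len(kept), 3) for i in range(4)])
```

Replacing the single-component update with an increment added to all four components at once gives the full random-walk Metropolis sampler of Example 12.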
FIG. 7 Trace plots of $\mu_1, \ldots, \mu_4$ with a yellow line overlaid on the plot of $\mu_i$ to denote the true value of $\mu_i$ and a white line overlaid on the plot of $\mu_i$ to indicate the marginal posterior mean value of $\mu_i$. The chain settles around the posterior mean in all four components and approximates well the true values of $\mu_2$ and $\mu_3$. The algorithm does overestimate $\mu_1$ and $\mu_4$.

4.3 The Gibbs sampler

The Gibbs sampler is an MCMC algorithm that requires sampling from what are known as full conditional distributions. Let $\theta$ be a vector of parameters from whose posterior density $p(\theta|y)$ it is desired to obtain samples, where $y$ is a data vector and $\theta = [\theta^{(1)}, \ldots, \theta^{(p)}]^{T}$. The Gibbs sampler works by cycling through $\theta^{(1)}, \ldots, \theta^{(p)}$ and sampling from each of their full conditional distributions. The full conditional density of $\theta^{(i)}$ is given by
$$p(\theta^{(i)} \,|\, \theta^{(-i)}, y) = \frac{p(\theta|y)}{\displaystyle\int p(\theta|y)\, d\theta^{(i)}},$$
where $\theta^{(-i)} = [\theta^{(1)}, \ldots, \theta^{(i-1)}, \theta^{(i+1)}, \ldots, \theta^{(p)}]^{T}$ denotes the vector $\theta$ with the $i$th entry removed.

The Gibbs sampler works in the following way. Assume that at iteration $t$, the value of $\theta$ is $\theta_t$. Then $\theta_{t+1}$ is obtained by sampling $\theta_{t+1}^{(1)}$ from $p(\theta^{(1)} \,|\, \theta_t^{(2)}, \ldots, \theta_t^{(p)}, y)$, sampling $\theta_{t+1}^{(2)}$ from $p(\theta^{(2)} \,|\, \theta_{t+1}^{(1)}, \theta_t^{(3)}, \ldots, \theta_t^{(p)}, y)$, and cycling through to $\theta^{(p)}$ so that $\theta_{t+1}^{(p)}$ is chosen from the density $p(\theta^{(p)} \,|\, \theta_{t+1}^{(1)}, \ldots, \theta_{t+1}^{(p-1)}, y)$. Once this is done, $\theta_{t+1}$ is a vector of the values $\theta_{t+1}^{(1)}, \ldots, \theta_{t+1}^{(p)}$ sampled from the full conditional distributions.
FIG. 8 Histograms of the empirical marginal posterior distributions of μ1, …, μ4 with the theoretical marginal posterior densities overlaid. It is clear that the algorithm performs well in approximating the marginal posterior densities of each of the four components of μ.
In this way, the Gibbs sampler is a special case of the Metropolis–Hastings algorithm in which all proposals are accepted. Gibbs sampling is useful when samples can be obtained from the full conditional distributions. If the full conditional distributions are intractable, then a typical Metropolis–Hastings algorithm is necessary. The following example illustrates the Gibbs sampler at work.

Example 14. Let $y_1, \ldots, y_{30}$ be independent and identically normally distributed random variables with mean $\mu$ and variance $\sigma^2$. Assume a priori that $\mu$ follows a normal distribution with mean 1 and variance 25, and that $\sigma^2$ follows an improper uniform prior distribution on $(0, \infty)$. For this simulation study, the data are generated according to a standard normal distribution. The posterior density is given by
$$p(\mu, \sigma^2 | y) \propto (\sigma^2)^{-15}\, e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{30} (y_i - \mu)^2 - \frac{1}{50}(\mu - 1)^2}.$$
Thus, the full conditional densities are given by
$$p(\mu | \sigma^2, y) \propto e^{-\frac{1}{2\sigma^2}\left(30 + \frac{\sigma^2}{25}\right)\mu^2 + \frac{1}{\sigma^2}\left(30\bar{y} + \frac{\sigma^2}{25}\right)\mu}$$
and
$$p(\sigma^2 | \mu, y) \propto (\sigma^2)^{-15}\, e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{30} (y_i - \mu)^2}.$$
Thus, the full conditional distribution of $\mu$ is normal with mean
$$\frac{30\bar{y} + \frac{\sigma^2}{25}}{30 + \frac{\sigma^2}{25}}$$
and variance
$$\frac{\sigma^2}{30 + \frac{\sigma^2}{25}}.$$
The full conditional distribution of $\sigma^2$ is an inverse gamma distribution with shape parameter 14 and scale parameter $\frac{1}{2} \sum_{i=1}^{30} (y_i - \mu)^2$. In this example, the Gibbs sampler is run for 25,000 iterations. Trace plots of $\mu$ and $\sigma^2$ are shown in Fig. 9 to demonstrate the behavior of the chain.

FIG. 9 Trace plots of the values of $\mu$ and $\sigma^2$ that come from the Gibbs sampler.
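A Gibbs sampler for the model of Example 14 can be sketched in plain Python as below. The full conditionals are taken directly from the joint posterior: $\mu$ given $\sigma^2$ is normal with precision $30/\sigma^2 + 1/25$, and $\sigma^2$ given $\mu$ is inverse gamma with shape $n/2 - 1 = 14$ and scale $\frac{1}{2}\sum (y_i - \mu)^2$. The starting values, seed, and function name are assumptions, and the inverse gamma draw is obtained as the reciprocal of a gamma draw.

```python
import math
import random


def gibbs_normal(y, n_iter, rng, mu0=1.0, tau2=25.0, mu_init=0.0, sigma2_init=1.0):
    """Gibbs sampler for y_i ~ N(mu, sigma2), mu ~ N(mu0, tau2), flat prior on sigma2."""
    n = len(y)
    ybar = sum(y) / n
    mu, sigma2 = mu_init, sigma2_init
    draws = []
    for _ in range(n_iter):
        # mu | sigma2, y is normal with precision n / sigma2 + 1 / tau2.
        precision = n / sigma2 + 1.0 / tau2
        mean = (n * ybar / sigma2 + mu0 / tau2) / precision
        mu = rng.gauss(mean, math.sqrt(1.0 / precision))
        # sigma2 | mu, y is inverse gamma(shape, scale); if G ~ Gamma(shape, rate=scale)
        # then 1 / G has that distribution. gammavariate takes (shape, scale = 1 / rate).
        shape = n / 2.0 - 1.0
        scale = 0.5 * sum((yi - mu) ** 2 for yi in y)
        sigma2 = 1.0 / rng.gammavariate(shape, 1.0 / scale)
        draws.append((mu, sigma2))
    return draws


if __name__ == "__main__":
    rng = random.Random(14)
    y = [rng.gauss(0.0, 1.0) for _ in range(30)]   # simulated standard normal data
    draws = gibbs_normal(y, 25_000, rng)
    kept = draws[1_000:]
    print("mean of mu draws:", sum(d[0] for d in kept) / len(kept))
    print("mean of sigma2 draws:", sum(d[1] for d in kept) / len(kept))
```

Every draw is kept, since each update samples exactly from a full conditional and there is no accept/reject step.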
4.3.1 Sampling from intractable full conditional distributions

In many cases, full conditional distributions can be intractable in terms of sampling from them. In these cases, several approaches exist for working around this issue. The first, as was mentioned before, is to make use of a Metropolis–Hastings algorithm to sample from the distribution. Here, we describe rejection sampling (Dieter and Ahrens, 1974) and adaptive rejection sampling (Gilks, 1992; Gilks and Wild, 1992).
4.3.2 Rejection sampling

In order to carry out rejection sampling, consider a full conditional density $g(\theta)$ of some unknown parameter vector $\theta$, and let $G(\theta)$ be a function, called an envelope function, that has the property that for all $\theta$, $G(\theta) \ge g(\theta)$. Samples are drawn from a density that is proportional to $G(\theta)$, and these samples $\theta^{*}$ are accepted with probability $g(\theta^{*})/G(\theta^{*})$. For the purposes of implementation, draw a random sample $u$ from a Uniform[0,1] distribution. If
$$u \le \frac{g(\theta^{*})}{G(\theta^{*})},$$
$\theta^{*}$ is accepted. It is important to note that there may be many rejections before a proposal is accepted. For a Gibbs sampler, however, it is only necessary to accept one proposal. Observe that the marginal probability of an acceptance is
$$P(\text{accept}) = \frac{\displaystyle\int_{\Theta} g(\theta)\, d\theta}{\displaystyle\int_{\Theta} G(\theta)\, d\theta}. \qquad (3)$$
From (3), it can be seen that if $G(\cdot)$ is far from $g(\cdot)$, the marginal probability of acceptance is small. Thus, it is important to find an envelope function that is close to the full conditional density in order to keep the marginal acceptance probability large enough to avoid major computational slowdowns. One last point to make here is that, unlike the Metropolis–Hastings algorithm, rejection sampling produces independent samples from its target distribution (Ripley, 1987).

Example 15. In this example, rejection sampling is employed as a means of sampling from a Beta(3,5) distribution. The target density is
$$f_{\Theta}(\theta) = \frac{\Gamma(8)}{\Gamma(3)\Gamma(5)}\, \theta^{2} (1-\theta)^{4}\, \mathbf{1}_{(0,1)}(\theta).$$
The envelope function is chosen to be
$$G(\theta) = \max_{\theta} f_{\Theta}(\theta)\, \mathbf{1}_{(0,1)}(\theta),$$
so that $G(\theta)$ never falls below the target density. 1000 samples are obtained from the target density in this example. In obtaining these samples, 44.19% of proposals were accepted, which could suggest a need for a tighter envelope function. A relative frequency histogram of the sampled values is presented in Fig. 10. A blue curve is overlaid to indicate the target density, and a black line is placed in the plot to represent the envelope function. The histogram suggests that the rejection sampling algorithm performs well in approximating the target density.

FIG. 10 Relative frequency histogram of the values of $\theta$ obtained via rejection sampling. A blue curve is overlaid to indicate the target density, and the black line represents the envelope function.
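The rejection sampler of Example 15 can be sketched in plain Python as follows. The constant envelope is the maximum of the Beta(3,5) density, attained at the mode $\theta = 1/3$, so proposals are simply Uniform(0,1) draws. The seed and function names are assumptions. By (3), the long-run acceptance rate is the reciprocal of the envelope height, $729/1680 \approx 0.434$, which is in line with the 44.19% reported in the example.

```python
import random


def beta35_pdf(t):
    """Beta(3, 5) density; Gamma(8) / (Gamma(3) * Gamma(5)) = 5040 / 48 = 105."""
    if not 0.0 < t < 1.0:
        return 0.0
    return 105.0 * t ** 2 * (1.0 - t) ** 4


# Constant envelope: the density evaluated at its mode (3 - 1) / (3 + 5 - 2) = 1/3.
ENVELOPE = beta35_pdf(1.0 / 3.0)


def rejection_sample(n_samples, rng):
    """Return (accepted samples, observed acceptance rate)."""
    samples, proposals = [], 0
    while len(samples) < n_samples:
        theta = rng.random()   # proposal from the normalized envelope, Uniform(0, 1)
        proposals += 1
        # Strict inequality; it differs from u <= ratio only on an event of probability zero.
        if rng.random() < beta35_pdf(theta) / ENVELOPE:
            samples.append(theta)
    return samples, n_samples / proposals


if __name__ == "__main__":
    rng = random.Random(15)
    samples, rate = rejection_sample(1000, rng)
    print("acceptance rate:", rate)
    print("sample mean:", sum(samples) / len(samples), "(exact mean is 3/8)")
```

A tighter envelope, such as a piecewise bound that follows the shape of the density, would raise the acceptance rate at the cost of a more involved proposal step.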
4.3.3 Adaptive rejection sampling

Given the limitation in implementing rejection sampling, it is important to have an approach that will limit the computational slowdown in sampling from intractable full conditional distributions. If the full conditional density is a log-concave univariate density, efficient methods are available to construct envelope functions (Bennett et al., 1995; Carlin and Gelfand, 1991; Zeger and Karim, 1991). One idea aside from these is to allow the envelope function to be constructed adaptively through adaptive rejection sampling. This technique allows the envelope function to be adjusted as the algorithm proceeds. This adjustment brings the envelope function closer to the density from which sampling is desired. The algorithm begins with a set $S$ of values, sorted in ascending order, of a parameter $\theta$ for log-concave univariate densities. For each $\theta$ in $S$, the process continues in one of two ways.
Gilks and Wild (1992) propose drawing tangents to log g(·) at each point in S. The envelope between adjacent points in S is constructed from the tangents at either end of the interval with the given points as endpoints. Gilks (1992) proposes drawing secants through log g(·) at adjacent points in S and then constructing the envelope between any two adjacent points in S from the secants immediately to the left and right of the interval. We will go into further detail about the tangent method. The secant method is carried out in a very similar fashion.
4.3.4 The tangent method
This approach consists of three key steps: the initialization step, the sampling step, and the updating step. Let $h(\theta) = \log g(\theta)$. To begin, let $S_k = \{\theta_i,\ i = 1, \ldots, k\}$ be the set of $k$ starting points, and let $D$ be the domain of g(·). If $D$ is unbounded on the left, choose $\theta_1$ such that $h'(\theta_1) > 0$. If $D$ is unbounded on the right, choose $\theta_k$ such that $h'(\theta_k) < 0$. This is tantamount to choosing points on each side of the mode if g(·) is unimodal. The next step is to compute several functions for each of the $k$ starting points. Let $u_k(\theta)$ be the piecewise linear upper bound formed from the tangents to $h(\theta)$ at each point in $S_k$. Since the envelope can be shown to be piecewise exponential, compute
$$s_k(\theta) = \frac{e^{u_k(\theta)}}{\int_{\Theta} e^{u_k(\theta')}\,d\theta'}.$$
Compute also $l_k(\theta)$, the piecewise linear lower bound formed from the chords between adjacent points in $S_k$. Expressions for $u_k(\theta)$ and $l_k(\theta)$ can be obtained directly. We know that the tangents at $\theta_i$ and $\theta_{i+1}$ intersect at
$$z_i = \frac{h(\theta_{i+1}) - h(\theta_i) - \theta_{i+1}h'(\theta_{i+1}) + \theta_i h'(\theta_i)}{h'(\theta_i) - h'(\theta_{i+1})} \quad \text{for } i = 1, 2, \ldots, k-1,$$
with $z_0$ and $z_k$ defined to be the lower and upper bounds of $D$, respectively. If $D$ is unbounded on the left, $z_0 = -\infty$, and if $D$ is unbounded on the right, $z_k = \infty$. Then we can write
$$u_k(\theta) = h(\theta_i) + (\theta - \theta_i)h'(\theta_i)$$
for $\theta \in [z_{i-1}, z_i]$, where $i = 1, 2, \ldots, k$. The piecewise lower bound is obtained in a similar way, and
$$l_k(\theta) = \frac{(\theta_{i+1} - \theta)h(\theta_i) + (\theta - \theta_i)h(\theta_{i+1})}{\theta_{i+1} - \theta_i}$$
for $\theta \in [\theta_i, \theta_{i+1}]$ and $i = 1, 2, \ldots, k-1$. If $\theta < \theta_1$ or $\theta > \theta_k$, $l_k(\theta) = -\infty$. This completes the initialization step.
The sampling step begins with the choice of a value $\theta^*$ from the density $s_k(\theta)$ and, independently, a value $u$ from a Uniform[0,1] distribution. First, a squeezing test is performed. If
$$u \le e^{\,l_k(\theta^*) - u_k(\theta^*)},$$
$\theta^*$ is accepted and the sampling step is complete. Otherwise, we evaluate $h(\theta^*)$ and $h'(\theta^*)$ and move on to a rejection test. In this test, if
$$u \le e^{\,h(\theta^*) - u_k(\theta^*)},$$
then $\theta^*$ is accepted. Otherwise, $\theta^*$ is rejected. This completes the sampling step.
In the updating step, if $\theta^*$ was not accepted as a result of the squeezing test in the sampling step, add $\theta^*$ to $S_k$ to form $S_{k+1}$. Then relabel the elements of $S_{k+1}$ in ascending order and construct $u_{k+1}(\theta)$, $l_{k+1}(\theta)$, and $s_{k+1}(\theta)$.
Example 16. In this example, adaptive rejection sampling is used to obtain 1000 independent samples from a Beta(3,5) distribution. A relative frequency histogram of the sampled values is presented in Fig. 11. A blue curve is overlaid to show the true density function. The adaptive rejection sampling algorithm appears to perform well in approximating the Beta(3,5) distribution.
FIG. 11 A relative frequency histogram of the sampled values obtained via adaptive rejection sampling. A blue curve is overlaid to show the true Beta(3,5) probability density function.
The algorithm was carried out using the “ars” package in R (Perez Rodriguez, 2018). This package does not return rejection rates, so information on the rejection rate is unavailable for this example.
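For readers who want to see the three steps of the tangent method in code, the following is a rough Python sketch for the Beta(3,5) target on the bounded domain (0, 1). It is not the implementation in the "ars" package; the starting points, function names, and the decision to rebuild the whole hull after every update (wasteful, but short) are assumptions made for illustration. Unlike the package, it also counts how often h had to be evaluated after a failed squeezing test.

```python
import bisect
import math
import random

def h(t):
    """Log of the unnormalized Beta(3,5) density."""
    return 2.0 * math.log(t) + 4.0 * math.log(1.0 - t)

def dh(t):
    """Derivative of h."""
    return 2.0 / t - 4.0 / (1.0 - t)

def build_hull(S):
    """Tangent intersections z_0..z_k and the mass of each piece of exp(u_k)."""
    hs, ds = [h(t) for t in S], [dh(t) for t in S]
    z = [0.0]  # z_0: lower bound of the domain
    for i in range(len(S) - 1):
        z.append((hs[i + 1] - hs[i] - S[i + 1] * ds[i + 1] + S[i] * ds[i])
                 / (ds[i] - ds[i + 1]))
    z.append(1.0)  # z_k: upper bound of the domain
    mass = []
    for i in range(len(S)):
        lo = hs[i] + (z[i] - S[i]) * ds[i]        # u_k at the left end of piece i
        hi = hs[i] + (z[i + 1] - S[i]) * ds[i]    # u_k at the right end of piece i
        if abs(ds[i]) < 1e-12:
            mass.append(math.exp(hs[i]) * (z[i + 1] - z[i]))
        else:
            mass.append((math.exp(hi) - math.exp(lo)) / ds[i])
    return hs, ds, z, mass

def ars_beta35(n, start=(0.1, 0.5, 0.9), seed=0):
    """Return n draws and the number of times h was evaluated after a failed squeeze."""
    rng = random.Random(seed)
    S = sorted(start)
    hs, ds, z, mass = build_hull(S)
    out, evaluations = [], 0
    while len(out) < n:
        # Sampling step: pick a piece, then invert its exponential CDF.
        i = rng.choices(range(len(S)), weights=mass)[0]
        width, v = z[i + 1] - z[i], rng.random()
        if abs(ds[i]) < 1e-12:
            theta = z[i] + v * width
        else:
            theta = z[i] + math.log1p(v * math.expm1(ds[i] * width)) / ds[i]
        if not 0.0 < theta < 1.0 or theta in S:
            continue  # guard against boundary or duplicate points
        u = rng.random()
        upper = hs[i] + (theta - S[i]) * ds[i]
        j = bisect.bisect_right(S, theta) - 1
        if 0 <= j < len(S) - 1:
            lower = ((S[j + 1] - theta) * hs[j] + (theta - S[j]) * hs[j + 1]) / (S[j + 1] - S[j])
            if u <= math.exp(lower - upper):   # squeezing test
                out.append(theta)
                continue
        evaluations += 1
        if u <= math.exp(h(theta) - upper):    # rejection test
            out.append(theta)
        # Updating step: add theta to S and rebuild the hull.
        bisect.insort(S, theta)
        hs, ds, z, mass = build_hull(S)
    return out, evaluations

draws, n_eval = ars_beta35(1000)
print(f"sample mean: {sum(draws) / len(draws):.3f}, h evaluated {n_eval} times")
```

The count of h evaluations is typically far smaller than the number of draws, which is the computational saving the adaptive envelope is meant to deliver.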
4.4 Slice sampling
A key limitation of the Metropolis–Hastings algorithm is the dependence of the mixing behavior of the underlying Markov chain on the proposal density. This issue can lead to acceptance or rejection rates that are too high to ensure good mixing (Gelman et al., 1997). Consequently, the parameters of the proposal distribution need to be tuned to avoid such problems. Slice sampling (Neal, 2003) is an MCMC algorithm that aims to avoid this problem by sampling from slices of the posterior distribution. The idea is that drawing a sample from the density p(·) is the same thing as sampling uniformly from the points under the curve of p(·). In other words, we sample uniformly among all points $(\theta, u)$ such that $0 \le u \le p(\theta)$, where $p(\theta)$ may or may not be normalized. In order to sample $(\theta, u)$ in such a way that $(\theta, u)$ is uniformly distributed, $p(\theta)$ is augmented with a random variable $u$ so that
$$p(\theta, u) = \begin{cases} \dfrac{1}{m_p} & \text{if } 0 \le u \le \hat{p}(\theta) \\ 0 & \text{otherwise,} \end{cases} \qquad \text{where } m_p = \int_{\Theta} \hat{p}(\theta)\,d\theta$$
and $\hat{p}(\theta)$ is an unnormalized version of $p(\theta)$. One then obtains a sample from $p(\theta)$ by dropping the $u$ value. This works due to the relation
$$\int p(\theta, u)\,du = \int_0^{\hat{p}(\theta)} \frac{1}{m_p}\,du = \frac{\hat{p}(\theta)}{m_p} = p(\theta).$$
This process can be carried out by using a Gibbs sampler and alternately sampling $\theta$ and $u$ from their full conditional distributions. Sampling $u$ from $p(u \mid \theta)$ is done by drawing $u \sim U[0, \hat{p}(\theta)]$. Then $\theta$ is sampled uniformly among all points for which $u < \hat{p}(\theta)$. This set of points is known as a "slice." Constructing the slice can be very difficult in practice. If the posterior density is unimodal, constructing the slice is easy to do and can be done exactly as described. However, if the posterior density has multiple modes, several approaches are available to handle them. For example, Neal (2003) presents a novel approach to dealing with multimodality in the posterior distribution. As this chapter is designed as an introduction to data analysis using MCMC, we shall not delve into these cases here.
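The two alternating Gibbs updates can be sketched as follows for an unnormalized Beta(3,5) density. This is an illustrative sketch, not the code behind any example in this chapter. To draw θ given u, it uses the shrinkage procedure of Neal (2003) on the bounded support (0, 1), which leaves the uniform distribution on the slice invariant without requiring the slice endpoints; that choice, the starting value, and the function names are assumptions made here.

```python
import random

def p_hat(theta):
    """Unnormalized Beta(3,5) density on (0, 1)."""
    return theta**2 * (1.0 - theta)**4 if 0.0 < theta < 1.0 else 0.0

def slice_sample(n, theta0=0.5, seed=0):
    rng = random.Random(seed)
    theta, draws = theta0, []
    for _ in range(n):
        # Update 1: u | theta ~ U[0, p_hat(theta)]
        u = rng.uniform(0.0, p_hat(theta))
        # Update 2: theta | u, targeting the uniform distribution on the slice.
        lo, hi = 0.0, 1.0
        while True:
            proposal = rng.uniform(lo, hi)
            if p_hat(proposal) > u:
                theta = proposal
                break
            # Shrink the bracket toward the current point, which lies in the slice.
            if proposal < theta:
                lo = proposal
            else:
                hi = proposal
        draws.append(theta)
    return draws

draws = slice_sample(5000)
print(f"sample mean: {sum(draws) / len(draws):.3f} (Beta(3,5) mean is 0.375)")
```

Note that, in contrast to rejection sampling, successive draws here form a Markov chain and are therefore dependent.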
FIG. 12 A relative frequency histogram of the values of θ obtained from the slice sampler. A curve is overlaid to illustrate the target density.
Example 17. In this example, a slice sampler is employed as a way to obtain 1000 samples from a Beta(3,5) distribution. The target density is
$$f_\Theta(\theta) = \frac{\Gamma(8)}{\Gamma(3)\Gamma(5)}\,\theta^{2}(1-\theta)^{4}\,\mathbb{1}_{(0,1)}(\theta),$$
which is maximized at $\theta = 1/3$. The density is then unnormalized by dividing $f_\Theta(\theta)$ by $\max_\theta f_\Theta(\theta)$ to give $\tilde{f}_\Theta(\theta)$. A sample $u$ is drawn from a U[0,1] distribution, and a proposal $\theta^*$ is also drawn from a U[0,1] density. If $\tilde{f}_\Theta(\theta^*) > u$, then $\theta^*$ is accepted as a sample from $f_\Theta(\theta)$. A relative frequency histogram of the sampled values of θ is provided in Fig. 12, and a curve is overlaid to indicate the target density. It can be seen that the empirical distribution matches the target distribution well, suggesting that the slice sampler performs well here in approximating the target density.
4.5 Reversible jump MCMC
All of the MCMC algorithms described previously are limited by the fact that they do not allow for changes in the dimensionality of the state space. The reversible jump MCMC algorithm avoids this limitation. This is an important advantage in several areas of statistics, including finite mixture modeling (Zhang et al., 2014), variable selection (Green and Hastie, 2009; Pan et al., 2017), and time series models (Troughton and Godsill, 1998). To see how the algorithm works, let $\mathcal{M} = \{M_1, M_2, \ldots\}$ be a class of candidate models, where the index $k$ can be viewed as an auxiliary indicator variable. Each model $M_k$ has a $p_k$-dimensional vector of unknown parameters $\theta_k \in \mathbb{R}^{p_k}$. The joint posterior distribution of $(k, \theta_k)$ given a data vector $y$ is given by
$$p(k, \theta_k \mid y) = \frac{f(y \mid k, \theta_k)\,\pi(\theta_k \mid k)\,\pi(k)}{\sum_{k} \int_{\mathbb{R}^{p_k}} f(y \mid \theta_k, k)\,\pi(\theta_k \mid k)\,\pi(k)\,d\theta_k}.$$
This is the target density for the reversible jump algorithm, where $(k, \theta_k)$ are the states of the chain and the dimension of the state space is allowed to vary. This means that the output of a single chain can provide a full characterization of the posterior probability of each model, as well as the posterior distribution of the parameters under each model. The algorithm works in a manner similar to the Metropolis–Hastings algorithm in the sense that, given the current state $\theta = (k, \theta_k)$ of the chain, a new state $\theta^* = (k^*, \theta^*_{k^*})$ is proposed from a density $q(\theta^* \mid \theta)$ and is accepted with probability
$$\alpha(\theta^*, \theta) = \min\left\{1, \frac{p(\theta^* \mid y)\,q(\theta \mid \theta^*)}{p(\theta \mid y)\,q(\theta^* \mid \theta)}\right\}.$$
Naturally, if $k$ is fixed, this is the standard Metropolis–Hastings algorithm.
The reversible jump MCMC algorithm is implemented in practice through what is known as "dimension matching." The idea is that in general Bayesian modeling, if we are at $\theta = (k, \theta_k)$ and want to propose a move to $\theta^* = (k^*, \theta^*_{k^*})$, where $p_{k^*} > p_k$, we match dimensions by proposing a random vector $u$ of length $d_{k \to k^*} = p_{k^*} - p_k$ from a known density $q_{d_{k \to k^*}}(u)$. The current state $\theta_k$ and $u$ are mapped to $\theta^*_{k^*} = g_{k \to k^*}(\theta_k, u)$ via a bijective mapping function $g_{k \to k^*}: \mathbb{R}^{p_k} \times \mathbb{R}^{d_{k \to k^*}} \mapsto \mathbb{R}^{p_{k^*}}$. The proposal is accepted with probability
$$\alpha([k, \theta_k], [k^*, \theta^*_{k^*}]) = \min\left\{1, \frac{p(k^*, \theta^*_{k^*})\,q(k \mid k^*)}{p(k, \theta_k)\,q(k^* \mid k)\,q_{d_{k \to k^*}}(u)} \left|\frac{\partial g_{k \to k^*}(\theta_k, u)}{\partial(\theta_k, u)}\right|\right\},$$
where $q(k^* \mid k)$ is the probability of proposing a move from model $M_k$ to model $M_{k^*}$. The reverse move from model $M_{k^*}$ to $M_k$ is made deterministically and accepted with probability
$$\alpha([k^*, \theta^*_{k^*}], [k, \theta_k]) = \frac{1}{\alpha([k, \theta_k], [k^*, \theta^*_{k^*}])}.$$
While there are some generalities related to the dimension of the auxiliary vector u, we shall not pursue them here.
Example 18. This example is one that comes from Green and Hastie (2009). They use the goals scored over three seasons of English Premier League (EPL) soccer to make a determination as to whether the counts of goals per game are overdispersed with respect to a Poisson distribution. In any EPL season, there are 380 games, so that the number of games examined in total is N = 1,140 games. An observation $y_i$, $i = 1, \ldots, N$, is the number of goals scored in a single game. Two models, $M_1$ and $M_2$, are proposed. The first specifies that the $y_i$s are independent and identically distributed according to a Poisson(λ) distribution. Thus, under $M_1$, the likelihood function is
$$L(\lambda \mid y) = \prod_{i=1}^{N} \frac{\lambda^{y_i}}{y_i!}\,e^{-\lambda}.$$
Under $M_2$, the $y_i$s are assumed to be independent and identically distributed according to a negative binomial distribution with parameters λ > 0 and κ > 0, so that under $M_2$, the likelihood function is
$$L(y \mid \lambda, \kappa) = \prod_{i=1}^{N} \frac{\lambda^{y_i}}{y_i!}\,\frac{\Gamma\left(\frac{1}{\kappa} + y_i\right)}{\Gamma\left(\frac{1}{\kappa}\right)\left(\frac{1}{\kappa} + \lambda\right)^{y_i}}\,(1 + \kappa\lambda)^{-1/\kappa}.$$
As is done in Green and Hastie (2009), λ is assigned a gamma prior distribution with shape parameter 25 and rate parameter 10, and κ is assigned a Gamma(1,10) prior distribution. At this point, we need to define the bijective mappings $g_k$ between the space of Ψ and the parameter set for each model. Barker and Link (2013) prescribe that $\dim(\Psi) = \dim(\theta^{(k)}, u^{(k)})$ for all $k$, where $u^{(k)}$ is an augmenting variable. Here, we have $\dim(\Psi) = \max_k \dim(\theta^{(k)}) = 2$. Thus, an augmenting variable is only necessary for $M_1$. Let $\psi_1$ be associated with λ in both $M_1$ and $M_2$, and let $\psi_2$ be associated with κ in $M_2$. We need a variable $u$ for $M_1$ since no second parameter exists. Assume a priori that $u \sim N(0, \sigma^2)$, then take $\psi_2 = \mu e^{u}$ for fairly small μ and reasonable $\sigma^2$. In this example, μ is set to be 0.015 and $\sigma^2$ is set to be 2.25. Observe that the value of $u$ does not impact inference, so neither do the choices of μ and $\sigma^2$. The choice to keep μ fairly small is made for the sake of computational efficiency and not for any reason pertaining to inference procedures. Now we compute Ψ. We have
$$\Psi = g_1^{-1}\begin{pmatrix} \lambda \\ u \end{pmatrix} = \begin{pmatrix} \lambda \\ \mu e^{u} \end{pmatrix}.$$
Therefore,
$$\theta = \begin{pmatrix} \lambda \\ u \end{pmatrix} = \begin{pmatrix} \psi_1 \\ \log\dfrac{\psi_2}{\mu} \end{pmatrix}.$$
Under $M_2$, the bijective map is the identity, so that $g_2(\Psi) = \Psi$ and $g_2^{-1}(\theta) = \theta$. The next step is to obtain prior distributions for Ψ. Using standard change of variables techniques,
$$p(\Psi \mid M_1) = \text{Gamma}(25, 10) \times N(0, \sigma^2) \times |J_1|,$$
where $J_k$ is the Jacobian under $M_k$. The algorithm implementation in the "rjmcmc" package (Gelling et al., 2018) in R computes the Jacobian for us, so we need not find this analytically. Under $M_2$,
$$p(\Psi \mid M_2) = \text{Gamma}(25, 10) \times \text{Gamma}(1, 10) \times |J_2|.$$
The rest of the implementation can be done automatically in R. In this example, the chain is run for 10,000 iterations. A histogram of the data is provided in Fig. 13. In this histogram, a blue curve is overlaid to indicate the negative binomial distribution with the resulting Bayes estimates of the parameters, λ = 2.523 and κ = 0.02. A red curve is also overlaid to show the Poisson distribution with the Bayes estimate of λ under $M_1$, which is 2.522. These estimates are given by the posterior median. Both models appear to approximate the distribution of the data reasonably well, but the reversible jump MCMC algorithm prefers $M_1$ over $M_2$ with probability 0.71.
FIG. 13 Histogram of the number of goals scored per game. The blue curve indicates the negative binomial distribution with parameters λ = 2.523 and κ = 0.02, and the red curve indicates the Poisson distribution with λ = 2.522.
Now that we have described several common MCMC algorithms, it is time to see how they are used in practice. This is explored in Section 5.
5 Markov chain Monte Carlo in practice
This section provides examples of settings in which some of the methods described in Section 4 are used in practice. Section 5.1 shows MCMC in use during Bayesian inference related to regression models, and Section 5.2 demonstrates the use of MCMC in dealing with random effects models. In Section 5.3, we discuss how MCMC methods are used in analyzing generalized linear models, and in Section 5.4, we discuss hierarchical models. This section provides only an introduction to the myriad uses of MCMC methods and should by no means be taken to be exhaustive.
5.1 MCMC in regression models
A common problem in Bayesian analysis is that of Bayesian linear regression. Consider the linear regression model
$$y = X\beta + \varepsilon,$$
where $y$ is a data vector, $X$ is a design matrix of predictors, β is a vector of regression coefficients, and ε is a random error vector with the property that $\varepsilon \sim N(0, \sigma^2 I)$, where $I$ is the identity matrix. The goal is typically to estimate the coefficient vector. In order to do this, a prior distribution π(β) is typically placed on β, and another prior distribution $\pi(\sigma^2)$ is placed on $\sigma^2$. We have that $y \mid \beta, \sigma^2 \sim N(X\beta, \sigma^2 I)$, and the posterior density is given by
$$p(\beta, \sigma^2 \mid y) \propto f(y \mid \beta, \sigma^2)\,\pi(\beta)\,\pi(\sigma^2).$$
Once the posterior density is obtained, sampling from it is often done via MCMC procedures. Here we describe an example related to hospital stays.
Example 19. A group of researchers is attempting to model infection risk based on the average length of a stay, the frequency of X-ray use, and a set of three indicators $X_3$, $X_4$, and $X_5$ for the region of the country (north-central, south, west) in which the hospital is located. The model here is
$$y = X\beta + \varepsilon,$$
where $y$ denotes the vector of infection risks for each selected hospital and $\varepsilon \sim N(0, \sigma^2 I)$. This regression model does not include interactions. In this example, improper noninformative prior distributions are placed on β and $\sigma^2$ so that $\pi(\beta) \propto 1$ and $\pi(\sigma^2) \propto 1$. Consequently, the posterior density is given by
$$p(\beta, \sigma^2 \mid y) \propto f(y \mid \beta, \sigma^2)\,\pi(\beta)\,\pi(\sigma^2) \propto (\sigma^2)^{-n/2}\,e^{-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta)}.$$
Since full conditional distributions are available, we will use a Gibbs sampler here. We have that
$$\beta \mid \sigma^2, y \sim N\left((X^TX)^{-1}X^Ty,\ \sigma^2(X^TX)^{-1}\right), \quad \text{and} \quad \sigma^2 \mid \beta, y \sim IG\left(\frac{n}{2} - 1,\ \frac{1}{2}(y - X\beta)^T(y - X\beta)\right).$$
The Gibbs sampler is run for 1000 iterations, and we compare the estimates to the maximum likelihood estimate $\hat{\beta}_{MLE}$.
$$\hat{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \end{pmatrix} = \begin{pmatrix} 28.39 \\ 0.07 \\ 0.005 \\ 3.885 \\ 0.053 \\ 3.791 \end{pmatrix} \quad \text{and} \quad \hat{\beta}_{MLE} = \begin{pmatrix} 28.37 \\ 0.064 \\ 0.022 \\ 3.891 \\ 0.053 \\ 3.763 \end{pmatrix}.$$
Our Bayesian estimates are quite close to the maximum likelihood estimates, which is expected since the prior densities do not provide any information about β. Thus, the likelihood function and the posterior density are the same. Under squared error loss, we know that the Bayes estimator is the posterior mean, and since we assumed the normal distribution on the error terms, the mode of the likelihood function is the same as the posterior mean. In this process, σ 2 is treated as a nuisance parameter.
5.2 Random effects models
Another common setting to which Bayesian approaches lend themselves well is that of random effects models. Such models are often represented as follows:
$$Y = \mu + s + \varepsilon,$$
where μ is an overall mean, $s$ is a random effect, and ε is an error term. In many ways, the random effects model is handled in the Bayesian framework in a manner similar to how estimation of coefficients was done in the fixed effects setting of Section 5.1. Here, we detail an example pertaining to mathematics performance in schools.
Example 20. Here, we look at the performance of mathematics students from J = 5 different schools. For each school, I = 10 standardized test scores in mathematics are randomly selected. For the sake of illustration, these scores are simulated according to five different normal distributions, each having variance 100. The means are 69, 75, 81, 72, and 66. The model here is
$$Y_{ij} = \mu_j + \varepsilon_{ij},$$
where the $\varepsilon_{ij}$ are independent and identically distributed $N(0, \sigma^2)$ random variables. Let $Y_{ij}$ denote the $i$th score from school $j$, and $\mu_j = \mu + s_j$, where μ is an overall mean and $s_j$ is a random effect of school $j$. Here, μ and $\sigma^2$ are nuisance parameters, and $\mu_j$ is our main interest. Noninformative prior distributions are placed on μ and $\sigma^2$, and $\mu_j$ follows a $N(\mu, 9)$ distribution by specification. Writing $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_J)^T$, the posterior density then is
$$p(\boldsymbol{\mu}, \mu, \sigma^2 \mid y) \propto f(y \mid \boldsymbol{\mu}, \mu, \sigma^2)\,\pi(\boldsymbol{\mu})\,\pi(\mu)\,\pi(\sigma^2) \propto (\sigma^2)^{-IJ/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{I}\sum_{j=1}^{J}(y_{ij} - \mu_j)^2 - \frac{1}{18}\sum_{j=1}^{J}(\mu_j - \mu)^2\right\}.$$
Full conditional distributions are available and easy to sample from, so a Gibbs sampler is appropriate here. This analysis is done under squared error loss, so that the posterior mean of the $\boldsymbol{\mu}$ vector is the Bayesian estimator. The full conditional distributions are as follows:
$$\mu \mid \sigma^2, \boldsymbol{\mu}, y \sim N\left(\bar{\mu}, \frac{9}{J}\right), \quad \text{where } \bar{\mu} = \frac{1}{J}\sum_{j=1}^{J}\mu_j,$$
$$\sigma^2 \mid \mu, \boldsymbol{\mu}, y \sim IG\left(\frac{IJ}{2} - 1,\ \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J}(y_{ij} - \mu_j)^2\right), \quad \text{and}$$
$$\mu_j \mid \mu, \sigma^2, y \sim N(\theta_j, \tau^2), \quad \text{where } \tau^2 = \left(\frac{I}{\sigma^2} + \frac{1}{9}\right)^{-1}, \quad \theta_j = \tau^2\left(\frac{I}{\sigma^2}\bar{y}_j + \frac{1}{9}\mu\right), \quad \bar{y}_j = \frac{1}{I}\sum_{i=1}^{I} y_{ij},$$
and the $\mu_j$ are independent. The chain is run for 10,000 iterations and gives an estimate
$$\hat{\boldsymbol{\mu}} = \begin{pmatrix} 63.53 \\ 73.59 \\ 82.88 \\ 72.14 \\ 66.06 \end{pmatrix},$$
which, aside from the first entry, approximates $\boldsymbol{\mu}$ reasonably well. Unlike the example in Section 5.1, informative prior knowledge of $\boldsymbol{\mu}$ plays a role in determination of the Bayes estimate.
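A Gibbs sampler built from the three full conditional distributions of Example 20 might look like the following Python sketch. It is a sketch under stated assumptions, not the code that produced the estimates above: the simulated data, seeds, starting values, burn-in length, and function names are all choices made here, so the numerical output will differ from the values reported in the example. An inverse gamma draw with shape a and scale b is obtained as the reciprocal of a gamma draw with shape a and rate b.

```python
import math
import random

def gibbs_random_effects(y, n_iter=5000, burn_in=500, prior_var=9.0, seed=0):
    """y[j] holds the I scores for school j; returns posterior means of the mu_j."""
    rng = random.Random(seed)
    J, I = len(y), len(y[0])
    ybar = [sum(scores) / I for scores in y]
    mu_j = ybar[:]                      # starting values (an arbitrary choice)
    totals = [0.0] * J
    for it in range(n_iter):
        # sigma^2 | rest ~ IG(IJ/2 - 1, SS/2); gammavariate takes (shape, scale).
        ss = sum((obs - mu_j[j]) ** 2 for j in range(J) for obs in y[j])
        sigma2 = 1.0 / rng.gammavariate(I * J / 2.0 - 1.0, 2.0 / ss)
        # mu | rest ~ N(mean of the mu_j, 9 / J)
        mu = rng.gauss(sum(mu_j) / J, math.sqrt(prior_var / J))
        # mu_j | rest ~ N(theta_j, tau^2), independently over j
        tau2 = 1.0 / (I / sigma2 + 1.0 / prior_var)
        for j in range(J):
            theta_j = tau2 * (I * ybar[j] / sigma2 + mu / prior_var)
            mu_j[j] = rng.gauss(theta_j, math.sqrt(tau2))
        if it >= burn_in:
            for j in range(J):
                totals[j] += mu_j[j]
    return [t / (n_iter - burn_in) for t in totals]

# Simulated scores: five schools, ten scores each, standard deviation 10.
data_rng = random.Random(1)
true_means = [69, 75, 81, 72, 66]
scores = [[data_rng.gauss(m, 10.0) for _ in range(10)] for m in true_means]
estimates = gibbs_random_effects(scores)
print([round(e, 2) for e in estimates])
```

Because the N(μ, 9) prior is fairly tight relative to the sampling variance of a school mean, the estimates are pulled from the school sample means toward the overall mean, which illustrates the role of the informative prior noted in the example.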
5.3 Bayesian generalized linear models
Many problems in Bayesian analysis involve modeling categorical responses. This is where generalized linear models often come into play. Examples of their use are myriad, so we detail one here.
Example 21. Suppose that we have a setting in which three independent binomial random variables are observed. Let $z_1$, $z_2$, and $z_3$ be three independent 0-1 predictors, and suppose $y_1$, $y_2$, and $y_3$ are independent binomial random variables with, respectively, $n_1 = 98$, $n_2 = 18$, and $n_3 = 2$ trials and success probability $\Phi(z_i^T\beta)$, $i = 1, 2, 3$, where Φ(·) is the standard normal cumulative distribution function. We observe
$$y = \begin{pmatrix} 11 \\ 1 \\ 0 \end{pmatrix} \quad \text{and} \quad Z = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}.$$
Our goal is to estimate β. A priori, assume that $\beta \sim N(0, I_3)$. The target density is
$$p(\beta \mid y) \propto f(y \mid \beta)\,\pi(\beta) \propto e^{-\frac{1}{2}\sum_{i=1}^{3}\beta_{(i)}^2}\prod_{i=1}^{3}\binom{n_i}{y_i}\Phi(z_i^T\beta)^{y_i}\left(1 - \Phi(z_i^T\beta)\right)^{n_i - y_i}\,\mathbb{1}_{\mathbb{R}^3}(\beta).$$
This is a version of the Bayesian probit model. The full conditional densities here are not easy densities from which to sample directly. Therefore, we make use of a RWM sampler. In this procedure, increments to $\beta_{(i)}$ are drawn from a U[−0.04, 0.04] distribution for $i = 1, 2, 3$. The RWM sampler accepts around 26% of proposals, and it is run for 10,000 iterations. The resulting estimates of $\beta_{(1)}$, $\beta_{(2)}$, and $\beta_{(3)}$ are
$$\hat{\beta} = \begin{pmatrix} -0.120 \\ -0.498 \\ -0.565 \end{pmatrix}.$$
Therefore, the estimated success probabilities are
$$\Phi(z_1^T\hat{\beta}) = \Phi(-1.183) = 0.118, \quad \Phi(z_2^T\hat{\beta}) = \Phi(-1.063) = 0.144, \quad \text{and} \quad \Phi(z_3^T\hat{\beta}) = \Phi(-0.565) = 0.286.$$
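The random-walk Metropolis sampler of Example 21 can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the starting point, seed, burn-in, run length, and in particular the proposal half-width (set to 0.4 here so that a short run mixes) are assumptions that differ from the settings stated in the example, so the output will not reproduce the reported estimates exactly.

```python
import math
import random

Y = [11, 1, 0]                             # observed successes
N = [98, 18, 2]                            # numbers of trials
Z = [[1, 1, 1], [0, 1, 1], [0, 0, 1]]      # rows are z_1, z_2, z_3

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_post(beta):
    """Log of the unnormalized posterior: N(0, I_3) prior times binomial probit likelihood."""
    lp = -0.5 * sum(b * b for b in beta)
    for y, n, z in zip(Y, N, Z):
        p = Phi(sum(zi * bi for zi, bi in zip(z, beta)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)   # keep the logarithms finite
        lp += y * math.log(p) + (n - y) * math.log(1.0 - p)
    return lp

def rwm(n_iter=20000, burn_in=2000, half_width=0.4, seed=0):
    """Return posterior means of beta and the acceptance rate."""
    rng = random.Random(seed)
    beta = [0.0, 0.0, 0.0]
    lp = log_post(beta)
    accepted, totals = 0, [0.0, 0.0, 0.0]
    for it in range(n_iter):
        proposal = [b + rng.uniform(-half_width, half_width) for b in beta]
        lp_prop = log_post(proposal)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            beta, lp = proposal, lp_prop
            accepted += 1
        if it >= burn_in:
            for i in range(3):
                totals[i] += beta[i]
    return [t / (n_iter - burn_in) for t in totals], accepted / n_iter

beta_hat, acc_rate = rwm()
probs = [Phi(sum(zi * bi for zi, bi in zip(z, beta_hat))) for z in Z]
print("beta_hat:", [round(b, 3) for b in beta_hat], "acceptance:", round(acc_rate, 3))
print("success probabilities:", [round(p, 3) for p in probs])
```

The half-width acts as the tuning parameter of the sampler: smaller values raise the acceptance rate but slow the exploration of the posterior.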
5.4 Hierarchical models
In Bayesian statistics, the standard approach is to construct a data model for observations y that depend on a parameter vector θ. A prior distribution is then placed on θ, and this distribution is one component of a posterior distribution that is sampled in some way in order to estimate θ. In many analyses, the θ vector is assigned either a noninformative prior distribution or an informative prior distribution with known parameter values. These types of informative prior distributions often assume greater knowledge of θ than what is actually available. One way to reduce the amount of information that is assumed to be available about θ is to place what are known as hyperpriors on the parameters of the distribution of θ. Such a model is known as a hierarchical model, and it comes up often in Bayesian statistical analysis. We detail an example here.
Example 22. Consider data $y_{ij}$, $i = 1, 2, 3, 4$; $j = 1, \ldots, 10$, that consist of ten observations from each of four different groups. Assume that
$$y_{ij} = \mu_i + \varepsilon_{ij},$$
where the $\varepsilon_{ij}$ are independent and identically normally distributed random variables with mean 0 and variance $\sigma_y^2$. Assume further that the $\mu_i$ are independent and identically normally distributed random variables with mean μ and variance $\sigma_\mu^2$. We could write, equivalently, that
$$\mu_i = \mu + b_i,$$
where the $b_i$ are independent and identically normally distributed random variables with mean 0 and variance $\sigma_b^2$. We make the following prior assumptions.
$$\mu \sim N(0, 10{,}000), \quad \ln\sigma_y \sim U[-100, 100], \quad \ln\sigma_b \sim U[-100, 100], \quad b \sim N(0, \sigma_b^2 I), \quad \mu_i = \mu + b_i.$$
Furthermore, $y_{ij} \mid \mu, \sigma_y^2, \mu_i, \sigma_b^2 \sim N(\mu_i, \sigma_y^2)$. Let $l_y = \ln(\sigma_y)$ and let $l_b = \ln(\sigma_b)$. Then the posterior density is given by
$$p(\mu, b, l_y, l_b, \mu_i \mid y) \propto f(y \mid \mu, l_y)\,\pi(\mu)\,\pi(l_y)\,\pi(b \mid \sigma_b^2) \propto e^{-20 l_y}\,e^{-2 l_b}\exp\left\{-\frac{1}{2}e^{-2 l_y}\sum_{i=1}^{4}\sum_{j=1}^{10}(y_{ij} - \mu_i)^2 - \frac{1}{2}e^{-2 l_b}\sum_{i=1}^{4} b_i^2\right\}\mathbb{1}_{[-100, 100]}(l_y)\,\mathbb{1}_{[-100, 100]}(l_b).$$
Here, we use a random-walk Metropolis algorithm to sample from this distribution. For this example, the $y_{ij}$ are simulated from a multivariate normal distribution with $\mu_1 = 700$, $\mu_2 = 800$, $\mu_3 = 500$, and $\mu_4 = 650$, with variance $\sigma_y^2 = 10{,}000$. Following prediction of the random effects, the estimates of the group means are $\hat{\mu}_1 = 697.83$, $\hat{\mu}_2 = 817.24$, $\hat{\mu}_3 = 522.39$, and $\hat{\mu}_4 = 638.77$. The variance is estimated as $\hat{\sigma}_y^2 = 8{,}429.735$. The parameter $\sigma_b^2$ is a nuisance parameter that must also be sampled throughout the analysis. Overall, the algorithm performs well in estimating the group means, despite what looks to be poor prior specification on them.
This section has provided several examples of how MCMC algorithms are used in practice and some highly informal examination of how they perform in terms of parameter estimation. We have not yet investigated how the chain itself performs. This will be taken up in Section 6.
6 Assessing Markov chain behavior
In this section, we describe ways to answer the two primary questions surrounding the use of MCMC algorithms. These are the questions of (1) how long the chain takes to approach its stationary distribution and (2) how often samples should be taken from the output of the chain in order to ensure that samples are roughly independent. We examine the first question in three different ways. Section 6.1 details a theory-based approach presented by Rosenthal (1995). In Section 6.2, we describe five methods of ad hoc convergence assessment, and in Section 6.3, we detail two computational approaches to bounding what is known as the mixing time of the underlying Markov chain. These methods serve as an intermediary between the theory-based approach and the output-based methods. Section 6.4 describes a common way to answer the second question.
6.1 Using the theory to bound the mixing time
Rosenthal (1995) presents an approach to bounding the time a Markov chain takes to approach its stationary distribution. In order to describe this, we need a few concepts. First, let $(X_t)_{t \ge 0}$ be a Markov chain with stationary measure π(·) and transition kernel K(·, ·). The total variation distance $\delta(K^n, \pi)$ between $K^n(\cdot, \cdot)$ and π(·) is given by
$$\delta(K^n, \pi) = \sup_{x \in \mathbb{R}^m}\ \sup_{A \in \mathcal{B}(\mathbb{R}^m)} |K^n(x, A) - \pi(A)|,$$
so that $\delta(K^n, \pi)$ is the largest difference between the probabilities that $K^n(\cdot, \cdot)$ and π(·) assign to the same event. We now define the ε-mixing time. Choose a fixed constant ε.
Definition 24. The ε-mixing time $\tau_\varepsilon$ is given by
$$\tau_\varepsilon = \min_n\,\{n : \delta(K^n, \pi) \le \varepsilon\}.$$
Finally, we need the notion of geometric ergodicity.
Definition 25. The Markov chain $(X_t)_{t \ge 0}$ is geometrically ergodic if there exist $\rho \in (0, 1)$ and some $M(\cdot) \ge 0$ such that for all $n \in \mathbb{N}$,
$$\delta(K^n, \pi) \le M(x)\rho^n.$$
Furthermore, if $(X_t)_{t \ge 0}$ is geometrically ergodic and $g : \mathbb{R}^m \to \mathbb{R}$ is such that $E_\pi\left[|g(X)|^{2+\delta}\right] < \infty$ for some δ > 0, the central limit theorem holds for ergodic averages (Roberts and Tweedie, 1996). Rosenthal (1995) shows that $(X_t)_{t \ge 0}$ is geometrically ergodic if $(X_t)_{t \ge 0}$ satisfies a minorization condition and an associated drift condition. In order to define these conditions, we need the notion of a small set.
Definition 26. A set $C \subseteq \mathbb{R}^m$ is a small set if there exist an integer n > 0 and a probability measure ν such that for all $x \in C$ and for all $A \in \mathcal{B}(\mathbb{R}^m)$,
$$K^n(x, A) \ge \nu(A).$$
Definition 27. A Markov chain $(X_t)_{t \ge 0}$ with transition kernel K(x, ·) on a state space $\mathbb{R}^m$ is said to satisfy a minorization condition if there exist a probability measure Q(·) on $\mathbb{R}^m$, a positive integer $k_0$, a fixed ε > 0, and a small set $C$ such that
$$K^{k_0}(x, A) \ge \varepsilon Q(A)$$
for all $x \in C$ and for all $A \in \mathcal{B}(\mathbb{R}^m)$.
Definition 28. A Markov chain $(X_t)_{t \ge 0}$ satisfies a drift condition if there exist a function $V : \mathbb{R}^m \mapsto [1, \infty)$, constants $\lambda \in (0, 1)$ and $b < \infty$, and a small set $C \in \mathcal{B}(\mathbb{R}^m)$ such that for all $x \in \mathbb{R}^m$,
$$E[V(X_{t+1}) \mid X_t = x] \le \lambda V(x) + b\,\mathbb{1}_C(x). \qquad (4)$$
This leads to the following theorem proven by Rosenthal (1995).
Theorem 1. (Rosenthal, 1995) Suppose that for a function $V : \mathbb{R}^m \mapsto [1, \infty)$ and constants $\lambda \in (0, 1)$ and $b < \infty$, $(X_t)_{t \ge 0}$ satisfies
$$E[V(X_{t+1}) \mid X_t = x] \le \lambda V(x) + b\,\mathbb{1}_C(x)$$
for all $x \in \mathbb{R}^m$, where $C = \{x : V(x) \le d\}$ and $d > 2b/(1 - \lambda) - 1$. Suppose also that for some ε > 0 and some probability measure Q(·) on $\mathcal{B}(\mathbb{R}^m)$,
$$K(x, A) \ge \varepsilon Q(A)$$
for all $A \in \mathcal{B}(\mathbb{R}^m)$ and for all $x \in C$. Then for any $r \in (0, 1)$, with $(X_t)_{t \ge 0}$ beginning in the initial distribution Ψ,
$$\delta(K^k, \pi) \le (1 - \varepsilon)^{rk} + \left(\alpha^{-(1-r)}A^{r}\right)^{k}\left(1 + \frac{b}{1 - \lambda} + E_\Psi[V(X_0)]\right),$$
where
$$\alpha^{-1} = \lambda + \frac{b + (1 - \lambda)}{1 + \frac{d - 1}{2}} \quad \text{and} \quad A = 1 + (\lambda d + b).$$
Here, we illustrate the use of Theorem 1 in bounding the mixing time of a Gibbs sampler.
Example 23. Suppose that we have a Gibbs sampler that explores the density
$$\pi(x, y) \propto y^{3/2}\,e^{-y\left(\frac{x^2}{2} + 2\right)}\,\mathbb{1}_{(0, \infty)}(y).$$
Then the full conditional distributions are
$$X \mid Y \sim N\left(0, \frac{1}{Y}\right), \qquad Y \mid X \sim \text{Gamma}\left(\frac{5}{2}, \frac{x^2}{2} + 2\right).$$
Our goal is to sample from the density
$$\pi(x) = \frac{3}{8}\left(1 + \frac{x^2}{4}\right)^{-5/2}.$$
Consider the drift function $V(x) = x^2$. Then
$$E[V(X_1) \mid X_0 = x] = E[X_1^2 \mid X_0 = x] = \int_{\mathcal{Y}}\left[\int_{\mathcal{X}} u^2\,\pi_{X \mid Y}(u \mid y)\,du\right]\pi_{Y \mid X}(y \mid x)\,dy$$
$$= \int_{\mathcal{Y}}\left[\int_{\mathcal{X}} u^2\sqrt{\frac{y}{2\pi}}\,e^{-\frac{1}{2}u^2 y}\,du\right]\frac{\left(\frac{x^2}{2} + 2\right)^{5/2}}{\Gamma\left(\frac{5}{2}\right)}\,y^{3/2}\,e^{-y\left(\frac{x^2}{2} + 2\right)}\,dy.$$
The inner integral is equal to $E[U^2 \mid y]$, where $U \mid y \sim N(0, 1/y)$. Therefore, $E[U^2 \mid y] = \text{Var}(U \mid y) = 1/y$. Consequently,
$$E[V(X_1) \mid X_0 = x] = \frac{\left(\frac{x^2}{2} + 2\right)^{5/2}}{\Gamma\left(\frac{5}{2}\right)}\int_{\mathcal{Y}} y^{1/2}\,e^{-y\left(\frac{x^2}{2} + 2\right)}\,dy.$$
The integrand is the kernel of a Gamma$\left(\frac{3}{2}, \frac{x^2}{2} + 2\right)$ density, so the integral is equal to $\Gamma\left(\frac{3}{2}\right)\left(\frac{x^2}{2} + 2\right)^{-3/2}$, and
$$E[V(X_1) \mid X_0 = x] = \frac{\Gamma\left(\frac{3}{2}\right)}{\Gamma\left(\frac{5}{2}\right)}\left(\frac{x^2}{2} + 2\right) = \frac{x^2}{3} + \frac{4}{3}.$$
Suppose that our small set is
$$C = \{x \in \mathbb{R} : x^2 \le 10\}.$$
Then we can show that ε = 0.343 works as a minorization coefficient. Putting all of this together in Theorem 1 gives
$$\delta(K^n, \pi) \le (x^2 + 4)(0.943)^n.$$
Thus, the chain is geometrically ergodic, and if the chain begins at x = 95 and we choose 0.01 as a threshold for mixing, then
$$\delta(K^n, \pi) \le 9{,}029\,(0.943)^n.$$
The first time this upper bound on the total variation distance falls below 0.01 is at 233 steps, so the 0.01-mixing time is no larger than 233 steps. As can be seen, the math can be fairly involved even for simple settings. In practice, this technique can be very complicated. Furthermore, this upper bound on the mixing time is known to be very conservative, so using it could result in running a chain for a longer burn-in than is necessary. The result of this pair of issues has been a set of ad hoc convergence diagnostics that are described in Section 6.2.
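The drift calculation in Example 23 can be checked numerically. The following Python sketch, with function names, seeds, and sample sizes chosen here purely for illustration, performs one Gibbs update from a fixed starting value x many times and compares the Monte Carlo average of $X_1^2$ with the closed form $x^2/3 + 4/3$ derived above. Note that `random.gammavariate` is parameterized by shape and scale, so the rate $x^2/2 + 2$ enters as its reciprocal.

```python
import math
import random

def gibbs_step(x, rng):
    """One Gibbs update of x: draw Y | X = x, then X | Y."""
    y = rng.gammavariate(2.5, 1.0 / (x * x / 2.0 + 2.0))   # Gamma(5/2, rate x^2/2 + 2)
    return rng.gauss(0.0, 1.0 / math.sqrt(y))              # N(0, 1/y)

def drift_check(x0, n=100_000, seed=0):
    """Monte Carlo estimate of E[V(X_1) | X_0 = x0] with V(x) = x^2."""
    rng = random.Random(seed)
    return sum(gibbs_step(x0, rng) ** 2 for _ in range(n)) / n

for x0 in (0.0, 2.0, 5.0):
    exact = x0 * x0 / 3.0 + 4.0 / 3.0
    print(f"x0 = {x0}: simulated {drift_check(x0):.3f}, closed form {exact:.3f}")
```

Agreement between the two columns supports the drift condition with λ = 1/3 and b = 4/3 for this sampler; it says nothing about the minorization constant, which has to be established separately.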
6.2 Output-based convergence diagnostics
In this section, we describe and illustrate several ways to assess the convergence behavior of a Markov chain using the chain's output. We describe how to make use of a trace plot and how to carry out four different convergence diagnostics, and we provide examples that use each of these techniques. We also discuss the advantages and limitations of assessing convergence in this way. The working example in this section is the one analyzed in Section 6.1.
6.2.1 Trace plots
A trace plot is, in effect, a time series plot where the y-values correspond to the values that are output from the chain. Trace plots are quick tools that are used to assess whether or not the chain has reached its target distribution. If the chain has "burned in," the trace plots should show rapid oscillation around a central value. Fig. 14 shows examples of differently behaved trace plots. The left panel shows a trace plot for a Markov chain that exhibits good mixing behavior. The two plots on the right show examples of how trace plots from poorly mixing Markov chains appear. The Gibbs sampler from Section 6.1 gives a trace plot in Fig. 15 that indicates no lack of convergence after 233 steps. Trace plots can be misleading, however: they often show no sign of a lack of convergence well before the chain is actually close to its stationary distribution.
6.2.2 Heidelberger and Welch (1983) diagnostic
Heidelberger and Welch (1983) construct a method for generating a confidence interval of a prespecified width for the mean when the chain does not begin in its stationary distribution. The procedure is applied to a single Markov chain. The Heidelberger and Welch (1983) diagnostic extends the methods of Schruben (1982) and Schruben et al. (1983). Convergence here is diagnosed based on the theory of the Brownian bridge. This technique is useful for geometrically ergodic chains, a requirement that is satisfied by many common MCMC algorithms. The null hypothesis is that the sequence of output is from the stationary measure. Let $Y_j$ be the $j$th entry in the output sequence, $S(0)$ be the spectral density of the sequence evaluated at 0, [·] be the greatest integer function, and $n$ be the total number of iterations, and define the following quantities:
$$T_0 = 0, \qquad T_k = \sum_{j=1}^{k} Y_j,\ k \ge 1, \qquad \bar{Y} = \frac{1}{n}\sum_{j=1}^{n} Y_j, \qquad B_n(t) = \frac{T_{[nt]} - [nt]\bar{Y}}{\sqrt{nS(0)}},\ 0 \le t \le 1.$$
For large $n$, $B_n = \{B_n(t),\ 0 \le t \le 1\}$ is approximately a Brownian bridge. Therefore, the test statistic
$$\int_0^1 B_n(t)^2\,dt$$
can be used to test the hypothesis. In order to carry out the test for stationarity, we estimate $S(0)$ and specify $j_{max}$, the maximum number of iterations that can be run, and τ, the desired confidence interval half width. Let $j_1 = 0.1 j_{max}$. Since the chain does not begin in its stationary distribution, the estimate of $S(0)$ would tend to be too large if it were based on the initial part of the chain, so $S(0)$ is estimated based on the second half of the run. If the null hypothesis is rejected, the first 10% of iterations are discarded and the test is repeated. This method is repeated until either a portion of the output of length at least $0.5 j_1$ is found for which the chain passes the stationarity test or 50% of the iterations are discarded and the chain still fails the stationarity test.
In the first case, $S(0)$ is reestimated from the portion of the output for which the stationarity test was passed, and the standard error of the mean is estimated as $\sqrt{\hat{S}(0)/n}$, where $n$ is the length of the output that is not discarded. If the half width of the generated interval is less than τ, the process ends and the sample mean and confidence interval are reported. In the latter case, we return the discarded output to the sequence and run more iterations to obtain a run length of $j_2 = 1.5 j_1$. We then carry out the test for stationarity in the same way as before, with no regard for the results of the stationarity test based on the first sequence. If the stationarity test is failed, we continue the process for longer sequences. These sequences are of length $j_k$, where $j_k = \min(1.5 j_{k-1}, j_{max})$, until a stationarity test is passed or the run length exceeds $j_{max}$. The primary limitation of the Heidelberger and Welch (1983) diagnostic is that if the chain stays out of the stationary distribution initially for a long time, the diagnostic has little power to detect this.
FIG. 14 Left panel: Trace plot for a Markov chain that demonstrates good mixing behavior. Center panel: Poor mixing behavior as the chain never settles around a center. Right panel: Poor mixing behavior where the chain stays for long periods of time in different areas of the target distribution. [Panel titles: Good mixing, Bad mixing, Bad mixing; horizontal axis: Index.]
FIG. 15 Trace plot for the Gibbs sampler of Section 6.1. The plot shows rapid oscillation around a central value before 233 steps, which does not indicate a lack of convergence at this point. [Plot title: Gibbs sampler trace plot; horizontal axis: Iteration; vertical axis: X.]
The Gibbs sampler in Section 6.1 passes the Heidelberger and Welch (1983) diagnostic after two iterations, far fewer than the 238 prescribed by the theoretical approach of Rosenthal (1995).
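The run-length schedule used by the Heidelberger and Welch (1983) procedure, in which each new run is 1.5 times the previous one and is capped at the maximum run length, can be sketched in a few lines. The function name `run_lengths` and the rounding up to whole iterations are assumptions made for illustration; the source gives only the recursion itself.

```python
import math

def run_lengths(j1, j_max):
    """Successive run lengths j_1, j_2, ... with j_k = min(1.5 * j_{k-1}, j_max).

    Rounding up to an integer number of iterations is an assumption;
    the text states the recursion without a rounding rule.
    """
    if j1 < 1:
        raise ValueError("j1 must be at least 1")
    lengths = [int(j1)]
    while lengths[-1] < j_max:
        nxt = min(math.ceil(1.5 * lengths[-1]), int(j_max))
        lengths.append(nxt)
    return lengths

print(run_lengths(1000, 5000))  # [1000, 1500, 2250, 3375, 5000]
```

In practice the stationarity test would be rerun at each length in the list, stopping at the first length for which it is passed.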
6.2.3 Geweke (1992) Spectral density diagnostic
The Geweke (1992) spectral density diagnostic is designed for settings in which a Gibbs sampler is used to estimate the mean of some function $g(\cdot)$ of the parameters θ. If values of $g(\theta_j)$ are computed after each iteration, the resulting sequence can be viewed as a time series. The premise here is that the nature of the MCMC algorithm and $g(\cdot)$ imply that a spectral density $S_g(\cdot)$ exists for this time series and has no discontinuity at frequency 0. If this is true, then $E[g(\theta)]$ is estimated using
\[ \bar{g}_n = \frac{1}{n} \sum_{i=1}^{n} g(\theta_i), \]
and the asymptotic variance is $S_g(0)/n$. To carry out this diagnostic, choose values $n_A$ and $n_B$, where $n_A$ is the number of iterations from the first part of the chain to be used in estimating $E[g(\theta)]$ and $n_B$ is the number of iterations from the end of the chain to be used in estimating $E[g(\theta)]$. Let $\bar{g}_A(\theta)$ and $\bar{g}_B(\theta)$ be the corresponding estimates of this expectation. If the ratios $n_A/n$ and $n_B/n$ are held fixed and $n_A + n_B < n$, then the central limit theorem implies that the statistic
\[ Z_n = \frac{\bar{g}_A(\theta) - \bar{g}_B(\theta)}{\sqrt{\hat{S}_g(0)/n}} \]
follows an approximate standard normal distribution. This result is used to carry out the test of stationarity. Geweke (1992) suggests using $n_A = 0.1n$ and $n_B = 0.5n$. Geweke (1992) also suggests that this diagnostic can be used to determine how many initial iterations to discard. This method only requires a single chain and, though it is designed for a Gibbs sampler, it can be used to investigate convergence of any MCMC method. However, this diagnostic is highly sensitive to the specification of the spectral window, and Geweke (1992) does not specify a procedure for applying the diagnostic; this choice is left to the statistician. In the example from Section 6.1, the Z-score after two iterations is Z = 0.9646, thus suggesting no lack of convergence after two iterations.
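A minimal sketch of a Geweke-type Z-score is given below. Several choices are assumptions rather than part of the source: the spectral density at zero is estimated with a Bartlett window truncated at the cube root of the segment length, and the variances of the two segment means are estimated separately and added, which is the form in which Geweke's statistic is usually stated. The function names are hypothetical.

```python
import numpy as np

def spectral_var_at_zero(x, max_lag=None):
    """Bartlett-window estimate of the spectral density at frequency 0
    (long-run variance): gamma_0 + 2 * sum_k w_k * gamma_k."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if max_lag is None:
        max_lag = int(np.floor(n ** (1 / 3)))  # assumed truncation rule
    xc = x - x.mean()
    s0 = np.dot(xc, xc) / n
    for k in range(1, max_lag + 1):
        gamma_k = np.dot(xc[:-k], xc[k:]) / n
        s0 += 2.0 * (1.0 - k / (max_lag + 1)) * gamma_k
    return s0

def geweke_z(g, frac_a=0.1, frac_b=0.5):
    """Z-score comparing the mean of the first frac_a of the sequence
    with the mean of the last frac_b of the sequence."""
    g = np.asarray(g, dtype=float)
    n = g.size
    a = g[: int(frac_a * n)]
    b = g[n - int(frac_b * n):]
    var = spectral_var_at_zero(a) / a.size + spectral_var_at_zero(b) / b.size
    return (a.mean() - b.mean()) / np.sqrt(var)

rng = np.random.default_rng(0)
stationary = rng.normal(size=5000)
drifting = stationary + np.linspace(5.0, 0.0, 5000)  # slowly decaying transient
z_stationary = geweke_z(stationary)
z_drifting = geweke_z(drifting)
```

A sequence with a long initial transient, such as `drifting`, gives a Z-score far outside the range expected under a standard normal distribution, whereas the stationary sequence does not. Changing the window or truncation lag can change the score noticeably, which reflects the sensitivity noted in the text.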
6.2.4 Gelman and Rubin (1992) diagnostic
The Gelman and Rubin (1992) diagnostic is based on normal approximations to exact Bayesian posterior inference. The procedure consists of two steps. The first step takes place before any Markov chains are run. In order to carry out this step, choose a number, say m, of initial values that are overdispersed
with respect to the target distribution. The second step is carried out for each parameter of interest. We run the algorithm from each initial state for some number, say 2n, of iterations. The last n iterations are used to estimate the distribution of the quantities of interest as a conservative t distribution, where the scale parameter incorporates both between-chain variance and within-chain variance. Convergence is monitored by estimating the factor by which the scale parameter would shrink if sampling were continued indefinitely. Let B be the variance between means from the m chains, W be the average of the m within-chain variances, and ν be the degrees of freedom for the approximating t density. Then the potential scale reduction factor (PSRF) $\sqrt{\hat{R}}$ is given by
\[ \sqrt{\hat{R}} = \sqrt{\left(\frac{n-1}{n} + \frac{m+1}{mn}\,\frac{B}{W}\right)\frac{\nu}{\nu-2}}. \]
If the chain has roughly mixed, then $\sqrt{\hat{R}}$ should be near 1 for all quantities of interest, where "nearness" is determined in practice by declaring that the chain has mixed adequately if $\sqrt{\hat{R}} \le 1.1$. This procedure was designed for Gibbs samplers, but can be applied to any MCMC algorithm. The idea behind it is that the PSRF should approach one when the chain has mixed. Gelman and Rubin (1992) arrive at this conclusion by reasoning that when the pooled within-chain variance dominates the between-chain variance, all chains have outrun the influence of their initial distributions and have explored the entire target distribution. They claim that there is no way to make such a determination based on a run of a single chain.
Critics of the Gelman and Rubin (1992) diagnostic point out that it relies heavily on the ability to find an initial distribution that is overdispersed with respect to the target distribution. Furthermore, Gibbs samplers are used most commonly when the normal approximation to the posterior distribution is inadequate for estimation and inference purposes, so relying on the normal approximation for diagnosing convergence of a Gibbs sampler may be a dubious approach. In addition, the approach is basically univariate, though Gelman and Rubin (1992) suggest applying the process to −2 times the log of the posterior distribution as a way of summarizing convergence.
Many researchers advocate running a single long chain because they consider it inefficient to run multiple chains and discard a large number of early iterations from each. Their rationale is this. If one compares a single long run of 10,000 iterations with 10 independent chains run for 1000 iterations, the last 9000 iterations from the long chain are all drawn from distributions that are likely to be closer to the target distribution than those drawn from any of the shorter chains.
The example from Section 6.1 passes the Gelman and Rubin (1992) diagnostic with PSRF 1.01 after 12 iterations, where 10 independent chains are used. In situations where the target distribution has multiple modes, it is necessary to run a larger number of independent chains in order to assess convergence well.
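The PSRF calculation can be sketched as follows for a single scalar quantity. Two points are assumptions: B is computed as n times the sample variance of the m chain means, which is the convention of Gelman and Rubin (1992) for the displayed formula, and the degrees-of-freedom correction ν/(ν − 2) is applied only when a value of ν is supplied, since the text does not give a formula for ν. The function name `psrf` is hypothetical.

```python
import numpy as np

def psrf(chains, nu=None):
    """Potential scale reduction factor sqrt(R-hat) for one scalar quantity.

    chains: array of shape (m, n), one row per chain, holding the last n
            iterations of each of the m chains.
    nu:     optional degrees of freedom of the approximating t density;
            if None, the factor nu / (nu - 2) is omitted (an assumption).
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)         # between-chain variance (times n)
    W = chains.var(axis=1, ddof=1).mean()   # average within-chain variance
    r_hat = (n - 1) / n + (m + 1) / (m * n) * B / W
    if nu is not None:
        r_hat *= nu / (nu - 2)
    return np.sqrt(r_hat)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(10, 1000))              # ten chains, same distribution
stuck = mixed + np.arange(10)[:, None]           # ten chains around different centers
psrf_mixed = psrf(mixed)
psrf_stuck = psrf(stuck)
```

Chains that sample the same distribution give a value close to 1, while chains stuck around different centers give a value well above the 1.1 threshold.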
6.2.5 Yu and Mykland (1994) CUSUM plot diagnostic
Yu and Mykland (1994) propose a graphical method based on CUSUM plots for a univariate summary statistic from a single Markov chain. To do this, decide via trace plot examination or by some other means the number of burn-in iterations $n_0$ to discard. The CUSUM path plots are then constructed for the remaining iterations $n_0 + 1$ to n in the following way. Let T(θ) be a summary statistic for which the plot is to be constructed. Let $\hat{\mu}$ be the estimate of $E[T(\theta)]$. In other words,
\[ \hat{\mu} = \frac{1}{n - n_0} \sum_{j=n_0+1}^{n} T(\theta_j). \]
The observed CUSUM at time point t is then
\[ \hat{S}_t = \sum_{j=n_0+1}^{t} \left[ T(\theta_j) - \hat{\mu} \right] \quad \text{for } t = n_0 + 1, \ldots, n. \]
The plot is constructed by plotting $\hat{S}_t$ against t. Observe that the plot always begins and ends at 0. If the chain mixes slowly, the CUSUM path will be smooth and will wander far from zero. If the chain mixes quickly, the CUSUM path will be hairy. Yu and Mykland (1994) suggest plotting the CUSUM path against an idealized CUSUM path from independent and identically distributed normal random variables with mean and variance equal to the sample mean and variance of the output of the chain. Yu and Mykland (1994) claim that the CUSUM plot may remove the need for additional information in diagnosing convergence beyond the information that is contained in the output of a single chain. They do acknowledge, however, that their method may fail when some regions of the state space mix more slowly than others. This approach does not stand on its own, as we can see from the fact that we need to look at other diagnostics for determining burn-in time. Where it can be useful is in detecting mixing that is occurring slowly enough that another MCMC algorithm or alternate parameterization is required. Because the Yu and Mykland (1994) diagnostic assesses dependence between iterations, it addresses indirectly the variance and the bias in estimation. The CUSUM path approach to convergence assessment can be applied to any MCMC algorithm and is straightforward to code. The CUSUM path from the example in Section 6.1 is in Fig. 16. The first 238 iterations are discarded. The plot is not particularly smooth, and the CUSUM path does not wander far from zero, so there is no evidence of a lack of convergence here after 238 steps. The inherent subjectivity in assessing "hairiness" is another clear limitation of this approach.
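As the text notes, the CUSUM path is straightforward to code. The sketch below computes the observed path after discarding the burn-in and also generates the idealized benchmark path from independent normal draws with matching mean and variance. The function names and the toy input are hypothetical, and plotting is left out.

```python
import numpy as np

def cusum_path(t_values, n0):
    """Observed CUSUM path for iterations n0 + 1, ..., n of a summary statistic."""
    t = np.asarray(t_values, dtype=float)[n0:]   # drop the n0 burn-in iterations
    mu_hat = t.mean()
    return np.cumsum(t - mu_hat)

def iid_benchmark_path(t_values, n0, rng):
    """CUSUM path of i.i.d. normal draws with the same sample mean and
    variance as the retained output, for visual comparison."""
    t = np.asarray(t_values, dtype=float)[n0:]
    draws = rng.normal(t.mean(), t.std(ddof=1), size=t.size)
    return np.cumsum(draws - draws.mean())

rng = np.random.default_rng(0)
chain = rng.normal(size=2000)        # toy stand-in for T(theta_j)
path = cusum_path(chain, n0=238)
benchmark = iid_benchmark_path(chain, n0=238, rng=rng)
```

Because the path is built from deviations about the sample mean of the retained iterations, its final value is zero up to floating-point error. Plotting `path` and `benchmark` on the same axes gives the comparison suggested by Yu and Mykland (1994).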
6.3 Using auxiliary simulations to bound mixing time
The approach described in Section 6.1 to bounding the mixing time is theoretically important, but it can be intractable in practice. The output-based convergence diagnostics described in Section 6.2 are easy to use in practice,
FIG. 16 The CUSUM plot for the Gibbs sampler in Section 6.1. The plot is not smooth and does not take major excursions from zero, thus suggesting that no evidence of a lack of convergence is present after 233 iterations.
but they suffer from their own limitations. One of these limitations pertains specifically to the diagnostics that are carried out based on the output of a single chain. Due to the use of only one chain, these types of diagnostics are sensitive to the initial state of the chain and will give different diagnoses of convergence based on this. The Gelman and Rubin (1992) diagnostic is fairly robust to this issue because of its use of independent chains that are initialized at points that are overdispersed with respect to the target density. The ad hoc diagnostics also suffer from the limitation that different diagnostics can give different assessments of convergence. Furthermore, all of them require a chain to be run, perhaps several times, before the chain can be concluded to have approximately reached its stationary distribution. On the other side of this, it is possible that one may run a chain for much longer than what is needed for mixing because a long chain was run in the first place and the diagnostic that was used failed to detect a lack of convergence. The issues with the Rosenthal (1995) method and with the output-based convergence diagnostics suggest the need for an intermediary approach. This section describes three such methods. Section 6.3.1 describes an approach developed by Cowles and Rosenthal (1998) that works for any MCMC algorithm. Section 6.3.2
describes a method presented by Spade (2016) that works for random-scan random-walk Metropolis algorithms, and Section 6.3.3 details an approach given by Spade (2020) that works for random-walk Metropolis samplers but can be extended to other full-updating Metropolis–Hastings samplers.
6.3.1 Cowles and Rosenthal (1998) auxiliary simulation approach
Cowles and Rosenthal (1998) present a technique that uses auxiliary simulations to estimate drift and minorization coefficients. The first step, naturally, is to choose a drift function $V(\cdot)$. This, in itself, is a nontrivial task. There are two things to keep in mind while choosing $V(\cdot)$. First, if the chain is far away from the target distribution, the value of $V(\cdot)$ should decrease on average over the next iteration. Second, the transition probabilities $K(\cdot, \cdot)$ should have fairly large overlap from all points x with $V(x) \le d$. The approach begins with the estimation of a lower bound on b in (4). The idea is to find all points x such that $V(x) = 1$. Then, from each of these points, $N_0$ one-step chains are run, where $N_0$ is chosen such that the standard error of the estimates of $E[V(X_1) \mid X_0]$ is less than or equal to some prespecified value. The maximum value of these estimates is chosen as the estimated lower bound on b. Given the estimate $\hat{b}$ of b, generate $N_1$ different initial values in such a way that they cover all potentially bad parts of the state space. For each initial value, run $N_2$ one-step chains and estimate $e(x) = E[V(X_1) \mid X_0 = x]$. Then λ is estimated by taking the maximum of
\[ \frac{e(x) - \hat{b}}{V(x)} \]
over the different choices of x. If the estimates of λ are unstable, choose $N_2$ larger and try again. If $\hat{\lambda} < 1$, we have evidence of a useful drift function. Estimating the minorization coefficient relies on a result presented by Rosenthal (1995).
Lemma 1. Suppose the Markov chain $(X_t)_{t \ge 0}$ has transition density $k(\cdot \mid x)$. Then there exists a probability measure $Q(\cdot)$ such that, for some $R \in \mathcal{B}(\mathbb{R}^m)$, $K(x, \cdot) \ge \varepsilon Q(\cdot)$ for all $x \in R$, where
\[ \varepsilon = \int_{\mathbb{R}^m} \inf_{x \in R} k(y \mid x) \, dy. \tag{5} \]
The estimation of the minorization coefficient relies on estimating the integral in (5). To do this, Cowles and Rosenthal (1998) propose chopping the state space into bins that are small enough that the transition probabilities are roughly constant over each bin. A set of initial values in the set $C = \{x : V(x) \le d\}$ is chosen so that the transition probabilities from the
different initial values have minimal overlap among all choices of $x \in C$. For each initial value, $N_3$ one-step chains are run, and the fractions of them that land in each bin are tabulated. An estimate of ε is obtained by taking, for each bin, the minimum over the different choices of x of the fraction of samples landing in that bin, and then summing these minima over all bins. The mixing time is bounded by using the estimates $\hat{b}$, $\hat{\lambda}$, and $\hat{\varepsilon}$ in Theorem 1.
Example 24. Rosenthal (1996) uses this approach in a model related to James–Stein estimators. Consider $Y_i \mid \theta_i \sim N(\theta_i, v)$ for $1 \le i \le K$, and $\theta_i \mid \mu, A \sim N(\mu, A)$ for $1 \le i \le K$. A flat prior is assumed on μ and v is an estimated constant. A is assumed to follow an IG(a, b) distribution. A Gibbs sampler is run in order to estimate $A, \mu, \theta_1, \ldots, \theta_K$. The full conditional distributions are given by
\[ A \mid \mu, \theta, Y \sim IG\left(a + \frac{K-1}{2},\; b + \frac{1}{2}\sum_{i=1}^{K} (\theta_i - \bar{\theta})^2\right), \]
\[ \mu \mid A, \theta, Y \sim N\left(\bar{\theta}, \frac{A}{K}\right), \]
\[ \theta_i \mid A, \mu, Y \sim N\left(\frac{\mu v + Y_i A}{v + A}, \frac{A v}{A + v}\right). \]
In the example, K = 18 and v = 0.00434. The drift function is
\[ V(A, \mu, \theta) = \sum_{i=1}^{K} (\theta_i - \bar{Y})^2, \]
and a and b are assumed to be 1 and 2. The resulting estimates of λ, b, and ε are $\hat{\lambda} = 0.000289$, $\hat{b} = 0.161$, and $\hat{\varepsilon} = 0.0656$. This gives an estimated upper bound of 140 steps on the mixing time of the Gibbs sampler.
This procedure works well in low dimensions, but due to the need for binning up the state space, the procedure can be computationally intractable in higher dimensions. Not only does the number of bins increase exponentially in the dimension of the state space, but the number of chains that need to be run from each initial value also increases exponentially in order to ensure adequate coverage of the bins. It is important to note that in this process, we are not restricted to running only one-step chains from each initial value. Cowles and Rosenthal (1998) carry out several examples using chains of length longer than one.
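The bin-based estimate of the minorization coefficient can be illustrated in one dimension. The sketch below is not the James–Stein example: it uses a hypothetical toy kernel, $X_1 \mid X_0 = x \sim N(0.5x, 1)$, together with assumed starting points, bin edges, and number of one-step chains, purely to show the "sum over bins of the minimum landing fraction" computation.

```python
import numpy as np

def one_step(x, size, rng):
    """Hypothetical toy kernel: X_1 | X_0 = x ~ N(0.5 * x, 1)."""
    return 0.5 * x + rng.normal(size=size)

def estimate_epsilon(starts, n3, bin_edges, rng):
    """Run n3 one-step chains from each start, tabulate the fraction landing
    in each bin, take the bin-wise minimum over starts, and sum over bins."""
    fractions = []
    for x in starts:
        draws = one_step(x, n3, rng)
        counts, _ = np.histogram(draws, bins=bin_edges)
        fractions.append(counts / n3)
    return np.min(np.array(fractions), axis=0).sum()

rng = np.random.default_rng(1)
edges = np.linspace(-8.0, 8.0, 161)   # bins of width 0.1 (assumed)
eps_close = estimate_epsilon([-1.0, 1.0], 20000, edges, rng)
eps_far = estimate_epsilon([-3.0, 3.0], 20000, edges, rng)
```

Starting points whose one-step distributions overlap heavily give a larger estimate than starting points that are far apart. In m dimensions the same computation needs a grid of bins in every coordinate, which is the source of the exponential cost described above.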
6.3.2 An auxiliary simulation approach for random-scan random-walk Metropolis samplers
While the choice of drift function is sometimes difficult, Fort et al. (2003) present a set of three conditions that ensure geometric ergodicity of a random-scan random-walk Metropolis (RSM) algorithm. They also provide a drift function for RSM algorithms that satisfy the conditions.
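Condition 1. The stationary distribution π is absolutely continuous with respect to $\lambda_m$, the m-dimensional Lebesgue measure, with positive and continuous density $p(\cdot)$ on $\mathbb{R}^m$.
Condition 2. Let $\{q_i\}_{i=1}^{m}$ be a family of symmetric increment densities with respect to $\lambda_1$. There exist constants $\eta_i > 0$ and $\delta_i < \infty$ for all $i \in \{1, 2, \ldots, m\}$ such that whenever $|y| \le \delta_i$, $q_i(y) \ge \eta_i$.
Condition 3. There exist constants δ and Δ with $0 \le \delta < \Delta \le \infty$ such that
\[ \xi = \inf_{1 \le i \le m} \int_{\delta}^{\Delta} q_i(y) \, \lambda_1(dy) > 0 \tag{6} \]
and for any sequence $x = \{x_n\}$ with $\lim_{n \to \infty} \|x_n\| = \infty$, one may extract a subsequence $\tilde{x} = \{\tilde{x}_n\}$ with the property that for some $i \in \{1, 2, \ldots, m\}$ and all $y \in [\delta, \Delta]$,
\[ \lim_{n \to \infty} \frac{p(\tilde{x}_n)}{p(\tilde{x}_n - \operatorname{sign}(\tilde{x}_n^{(i)}) y e_i)} = 0 \quad \text{and} \quad \lim_{n \to \infty} \frac{p(\tilde{x}_n + \operatorname{sign}(\tilde{x}_n^{(i)}) y e_i)}{p(\tilde{x}_n)} = 0. \]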
These conditions lead to the following theorem.
Theorem 2. (Fort et al., 2003) Assume that Conditions 1–3 hold, and let $s \in (0, 1)$ such that $s(1 - s)^{1/s - 1}$
$1 - \hat{\lambda}$. If d does not satisfy this requirement, choose a larger value of d and repeat the estimation of λ and b. Estimation of ε relies on estimating the integral in (5). In the RSM setting, this integral can be rewritten as
\[ \varepsilon = \int_{\mathbb{R}^m} \frac{1}{m} \sum_{i=1}^{m} \inf_{x_t \in C} k_i(x_{t+1} \mid x_t) \, dx_{t+1}. \]
In order to approximate this integral, choose a set $\tilde{C}$ of points x near the edges of C. Next, for each $i \in \{1, \ldots, m\}$, divide the support of $q_i(\cdot)$ into little bins whose width is chosen in such a way that the resulting estimates of ε are stable. For each $x \in \tilde{C}$, given $i \in \{1, 2, \ldots, m\}$, choose N values of y between $a_i^L$ and $a_i^U$. For each of the N values of y, use a Metropolis updating procedure to decide whether or not to accept the proposal $x_t + y e_i$. Keep track of the number of accepted proposals that fall into each bin. For each bin, compute over $x \in \tilde{C}$ the minimum number of accepted proposals that fall into that bin, and sum these minima over the bins. This leads to an estimate $\hat{\varepsilon}_i$ of $\varepsilon_i$, where $\varepsilon_i$ is the contribution to ε that comes from the ith conditional transition density. Then ε is estimated using
\[ \hat{\varepsilon} = \frac{1}{m} \sum_{i=1}^{m} \hat{\varepsilon}_i. \]
Since the binning is done at the univariate level, the number of bins increases only linearly in the dimension of the state space. This process will still suffer from the curse of dimensionality, but not as severely as the Cowles and Rosenthal (1998) method. This approach can be extended to other one-at-a-time updating Metropolis–Hastings algorithms, and if a suitable drift function can be found, it is not actually necessary to verify the conditions of Fort et al. (2003). This approach does not generalize well to full-updating schemes because of its inherent exploitation of the one-at-a-time updating process.
Example 25. Let $(X_t)_{t \ge 0}$ be an RSM chain with the $N\left(5\,\mathbf{1}_3, \frac{1}{2\pi} I_3\right)$ density as its stationary distribution, and $C = \{x : V_{0.01}(x) \le 10\}$. In estimating λ, 1000 initial values were chosen from outside C, and for each, the sum in (8) is computed based on 100 one-step chains. This gives an estimate of $\hat{\lambda} = 0.95059$. We obtain an estimate $\hat{b} = 0.0511$ of b in a similar way. The value of $2\hat{b}/(1 - \hat{\lambda})$ is 2.07, so d = 10 is a sufficiently large choice of d. The estimate of the minorization coefficient is obtained using 50 initial values. The estimate of ε is $\hat{\varepsilon} = 0.4817$, thus yielding an estimated upper bound of 983 steps on the mixing time. The trace plots in Fig. 17 indicate that 983 steps is a sufficient burn-in time.
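The check on d in Example 25 is simple arithmetic and can be reproduced directly from the reported estimates. The sketch below reads the requirement as d exceeding $2\hat{b}/(1 - \hat{\lambda})$, which is how the example uses it; the variable names are hypothetical.

```python
lam_hat = 0.95059   # reported estimate of lambda in Example 25
b_hat = 0.0511      # reported estimate of b in Example 25
d = 10              # level defining the set C in Example 25

# Quantity that d is compared against in the example.
threshold = 2 * b_hat / (1 - lam_hat)
d_is_large_enough = d > threshold

print(round(threshold, 2), d_is_large_enough)  # 2.07 True
```

If the comparison had failed, the procedure described in the text would call for a larger d followed by re-estimation of λ and b.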
6.3.3 Auxiliary simulation approach for full-updating Metropolis samplers
Spade (2020) extends the approach described in Section 6.3.2 to full-updating RWM processes. This method relies on the RWM sampler's satisfaction of three conditions.
Condition 1. The increment density $q(\cdot)$ is symmetric about 0, and there exist constants $\varepsilon_q$ and $\delta_q > 0$ with the property that for all y such that $\|y\| \le \varepsilon_q$,
\[ q(y) \ge \delta_q. \tag{9} \]
Furthermore, the stationary measure $\pi(\cdot)$ is absolutely continuous with respect to Lebesgue measure $\lambda_m$ with positive and continuous density $p(\cdot)$ over $\mathbb{R}^m$.
Condition 2. The target density $p(\cdot)$ is superexponential. In other words, let
\[ n(x) = \frac{x}{\|x\|}, \]
and let $\nabla f(x)$ denote the gradient of $f(x)$. Then
\[ \lim_{\|x\| \to \infty} n(x) \cdot \nabla \ln p(x) = -\infty, \]
where $a \cdot b$ denotes the inner product of a and b.
FIG. 17 Trace plots of the RSM sampler for X(1), X(2), and X(3). A blue vertical line is placed in each plot to indicate 983 steps. The plots indicate that a burn-in of 983 steps is sufficient.
Conditions 1 and 2 lead to the following theorem provided by Jarner and Hansen (2000).
Theorem 3. (Jarner and Hansen, 2000) If $p(\cdot)$ is superexponential, then the RWM chain $(X_t)_{t \ge 0}$ with transition kernel $K(\cdot, \cdot)$ on $(\mathbb{R}^m, \mathcal{B}(\mathbb{R}^m))$ and symmetric increment density satisfying (9) is geometrically ergodic if and only if
\[ \liminf_{\|x\| \to \infty} K(x, A(x)) > 0, \tag{10} \]
where $K(\cdot, \cdot)$ is the probability measure given by
\[ K(x, B) = \int_{B} k(x' \mid x) \, \lambda_m(dx') \]
and $A(x) = \{y : p(x + y) \ge p(x)\}$, where $k(\cdot \mid \cdot)$ is the probability density associated with the transition kernel. When (10) is satisfied, (4) is also satisfied with $V_s(x) = c[p(x)]^{-s}$ for some c > 0 and any $s \in (0, 1)$.
Verifying that (10) holds can be difficult to do directly, but Jarner and Hansen (2000) provide a sufficient condition for satisfaction of (10).
Condition 3. The target density $p(\cdot)$ is such that
\[ \limsup_{\|x\| \to \infty} m(x) \cdot n(x) < 0, \]
where
\[ m(x) = \frac{\nabla p(x)}{\|\nabla p(x)\|}. \]
For RWM samplers that satisfy Conditions 1–3, bounding the mixing time begins with the estimation of λ. Letting $KV_s(x) = E[V_s(X_{t+1}) \mid X_t = x]$, we estimate for $x \notin C$, where C is chosen as in Section 6.3.2,
\[ \frac{KV_s(x)}{V_s(x)} = \int_{A(x)} \frac{V_s(x + y)}{V_s(x)} q(y) \, dy + \int_{R(x)} \frac{V_s(x + y)}{V_s(x)} \frac{p(x + y)}{p(x)} q(y) \, dy + \int_{R(x)} \left(1 - \frac{p(x + y)}{p(x)}\right) q(y) \, dy, \]
where $R(x) = A^c(x)$. In order to do this, choose $N_{C_0}$ values outside of C to form the set $\hat{C}_0$, and then take $N_0$ samples from q(y) for each $x \in \hat{C}_0$. Then for given $x_i$, $i = 1, 2, \ldots, N_{C_0}$, compute the sum
\[ \hat{\lambda}_{x_i} = \frac{\hat{K}V_s(x_i)}{V_s(x_i)} = \frac{1}{N_0} \sum_{y \in \hat{A}(x_i)} \frac{V_s(x_i + y)}{V_s(x_i)} + \frac{1}{N_0} \sum_{y \in \hat{R}(x_i)} \frac{V_s(x_i + y)}{V_s(x_i)} \frac{p(x_i + y)}{p(x_i)} + \frac{1}{N_0} \sum_{y \in \hat{R}(x_i)} \left(1 - \frac{p(x_i + y)}{p(x_i)}\right). \]
Then the estimate of λ is given by
\[ \hat{\lambda} = \max_{i = 1, \ldots, N_{C_0}} \hat{\lambda}_{x_i}. \]
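The three-sum Monte Carlo estimate of $KV_s(x)/V_s(x)$ can be sketched as follows. For concreteness the sketch borrows the two-component target density and the uniform increment density on $[-1.5, 1.5]^2$ from Example 26 and takes $V_s(x) = [p(x)]^{-s}$ with c = 1; the choice c = 1, the function names, and the evaluation points are assumptions. Working on the log scale avoids underflow far from the modes.

```python
import numpy as np

def log_p(x):
    """Log of the two-component target density of Example 26."""
    x1, x2 = x[..., 0], x[..., 1]
    return np.log(0.5) + np.logaddexp(-4 * x1**2 - x2**2, -x1**2 - 4 * x2**2)

def lambda_hat_at(x, s=0.02, n0=100, half_width=1.5, rng=None):
    """Monte Carlo estimate of K V_s(x) / V_s(x) with V_s = p^(-s) (c = 1 assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    y = rng.uniform(-half_width, half_width, size=(n0, 2))  # increments from q
    log_ratio = log_p(x + y) - log_p(x)       # log of p(x + y) / p(x)
    in_A = log_ratio >= 0                     # increments in A-hat(x)
    v_ratio = np.exp(-s * log_ratio)          # V_s(x + y) / V_s(x)
    p_ratio = np.exp(np.minimum(log_ratio, 0.0))
    sum_A = v_ratio[in_A].sum()
    sum_R_move = (v_ratio[~in_A] * p_ratio[~in_A]).sum()
    sum_R_stay = (1.0 - p_ratio[~in_A]).sum()
    return (sum_A + sum_R_move + sum_R_stay) / n0

def lambda_hat(points, **kwargs):
    """Maximum of the pointwise estimates over the sampled initial values."""
    return max(lambda_hat_at(x, **kwargs) for x in points)

rng = np.random.default_rng(0)
far_out = lambda_hat_at(np.array([7.0, 7.0]), n0=20000, rng=rng)
at_mode = lambda_hat_at(np.array([0.0, 0.0]), n0=20000, rng=rng)
```

Far from the modes the ratio falls below 1, which is the contraction the drift condition asks for, while at the mode it exceeds 1. This is why the initial values for estimating λ are drawn from outside C.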
The estimation of b is carried out in a manner very similar to the estimation of λ. We again rely on the estimated drift function. We choose $N_C$ points from inside C, and from each $x_i$ in the sampled set $\hat{C}$, we draw $N_1$ samples from $q(\cdot)$. The sets $\hat{A}(x_i)$ and $\hat{R}(x_i)$ are constructed in the same way as has been done before, and to compute an intermediate estimate of b, we use the sum
\[ \hat{K}\hat{V}_s(x_i) = \frac{1}{N_1} \sum_{y \in \hat{A}(x_i)} \hat{V}_s(x_i + y) + \frac{1}{N_1} \sum_{y \in \hat{R}(x_i)} \hat{V}_s(x_i + y) \frac{p(x_i + y)}{p(x_i)} + \hat{V}_s(x_i) \frac{1}{N_1} \sum_{y \in \hat{R}(x_i)} \left(1 - \frac{p(x_i + y)}{p(x_i)}\right), \]
and the estimate of b based on $x_i$ is given by
\[ \hat{b}_{x_i} = \hat{K}\hat{V}_s(x_i) - \hat{\lambda} \hat{V}_s(x_i). \]
The overall estimate of b is given by $\hat{b} = \max_{i = 1, \ldots, N_C} \hat{b}_{x_i}$.
In order to obtain a conservative estimate of ε, we make use of the properties of infima to say that
\[ \inf_{x_t \in C} k(x_{t+1} \mid x_t) \ge \inf_{x_t \in C} \alpha(x_t, x_{t+1}) q(y) + \inf_{x_t \in C} \int_{\mathbb{R}^m} (1 - \alpha(x_t, x_{t+1})) q(y) \, dy \; \delta_{x_t}(X_{t+1}) \ge \inf_{x_t \in C} \int_{\mathbb{R}^m} (1 - \alpha(x_t, x_{t+1})) q(y) \, dy \; \delta_{x_t}(X_{t+1}). \tag{11} \]
Therefore, ε is no smaller than the infimum of the rejection probabilities from the initial states inside C. Thus, the approach to estimating ε begins with the selection of $N_{\tilde{C}}$ points inside C in the same manner as described in the estimation of λ and b and so that the estimates of ε are reasonably stable. Next, the rejection probability from each of the sampled points is estimated. To do this, run $N_2$ chains, where $N_2$ is also chosen to ensure stability in the estimates of the integral in (11) from each initial state $x_i$. Keep track of the proportion of proposals that are rejected. This proportion gives an estimate $\hat{\varepsilon}_{x_i}$ of ε from the initial state $x_i$. Repeat this process for each of the $N_{\tilde{C}}$ sampled initial values. An estimate of ε is then given by
\[ \hat{\varepsilon} = \min_{i = 1, \ldots, N_{\tilde{C}}} \hat{\varepsilon}_{x_i}. \]
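The rejection-proportion estimate of ε can be sketched as below, again borrowing the target density and the uniform increment density of Example 26 for concreteness. The way points inside C are drawn (uniformly over a box, keeping those with $V_s(x) \le d$ for $V_s = p^{-s}$ with c = 1) is a hypothetical scheme, as are the function names and the box size.

```python
import numpy as np

def log_p(x):
    """Log of the two-component target density of Example 26."""
    x1, x2 = x[..., 0], x[..., 1]
    return np.log(0.5) + np.logaddexp(-4 * x1**2 - x2**2, -x1**2 - 4 * x2**2)

def rejection_rate(x, n2, half_width, rng):
    """Proportion of n2 one-step RWM proposals from x that are rejected."""
    y = rng.uniform(-half_width, half_width, size=(n2, 2))
    alpha = np.exp(np.minimum(0.0, log_p(x + y) - log_p(x)))  # acceptance probabilities
    accepted = rng.uniform(size=n2) < alpha
    return 1.0 - accepted.mean()

def sample_inside_C(n_points, s, d, box, rng):
    """Hypothetical scheme: uniform draws on [-box, box]^2 kept if V_s(x) <= d."""
    cand = rng.uniform(-box, box, size=(n_points, 2))
    v_s = np.exp(-s * log_p(cand))            # V_s = p^(-s), c = 1 assumed
    return cand[v_s <= d]

def epsilon_hat(points, n2=100, half_width=1.5, rng=None):
    """Minimum estimated rejection probability over the sampled initial values."""
    rng = np.random.default_rng() if rng is None else rng
    return min(rejection_rate(x, n2, half_width, rng) for x in points)

rng = np.random.default_rng(0)
points = sample_inside_C(200, s=0.02, d=40, box=6.0, rng=rng)
eps_hat = epsilon_hat(points, n2=100, rng=rng)
at_mode = rejection_rate(np.array([0.0, 0.0]), 5000, 1.5, rng)
```

Because the estimate is a minimum over noisy proportions, it tends to decrease as more initial values are added, which is one reason the text asks for enough points and chains to make the estimate stable.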
The ability to use this procedure is not tied to the satisfaction of the three Jarner and Hansen (2000) conditions. These simply help us choose a suitable drift function for RWM samplers. If a suitable drift function can be found another way, this approach can be adapted in a straightforward manner to estimate drift and minorization coefficients for other full-updating Metropolis–Hastings samplers.
Example 26. Let $(X_t)_{t \ge 0}$ be a RWM chain with stationary measure π, where π has the density
\[ p(x) = \frac{1}{2} e^{-4x^{(1)2} - x^{(2)2}} + \frac{1}{2} e^{-x^{(1)2} - 4x^{(2)2}}. \]
In this example, d is set at 40, and s is set at 0.02. The proposed increment density is the independent bivariate uniform density, where the range for each component is [−1.5, 1.5], which gives an acceptance rate of 30.42%. This rate is nearly optimal for mixing. To carry out the estimation process, 100 points are chosen from outside C, 100 one-step chains are run from each initial state, and the resulting estimate of λ is $\hat{\lambda} = 0.8411$. In estimating b, 100 points are chosen from inside C and 100 one-step chains are run from each of these initial states. The resulting estimate of b is $\hat{b} = 0.1543$. To estimate ε, 200 points are chosen from inside C, and 100 one-step chains are run from each.
FIG. 18 Trace plots of X(1) and X(2) for the RWM sampler. Both plots indicate that a burn-in of 18,535 steps is extremely conservative.
The resulting estimate of ε is $\hat{\varepsilon} = 0.22$. The resulting upper bound on the mixing time is 18,535 steps, but the trace plots in Fig. 18 indicate that the chain burns in much more quickly than that. The value of the quantity $2\hat{b}/(1 - \hat{\lambda})$ is 1.949, so d = 40 is a sufficiently large choice for this value.
6.4 Examining sampling frequency
In order to carry out Bayesian inference, it is not enough just to be able to obtain samples from the posterior distribution. These samples also need to be roughly independent. By the structure of a Markov chain, successive states of the chain are not independent. In order to obtain independent samples from the posterior distribution, we need to determine how far apart observations should be in order for them to be considered roughly independent. In making such an assessment, we view the Markov chain as a stationary time series. For many MCMC algorithms, this is a reasonable assumption, as the underlying chain is typically time-homogeneous. Time homogeneity is, informally, a Markov chain analog to stationarity. Because the Markov chain can also be viewed as a stationary time series, a common way to assess dependence is
to examine the sample autocorrelation function. The idea is to see how highly correlated the observations spaced k time units apart are. The sample autocorrelation function at lag k is given by
\[ r_k = \frac{\sum_{t=1}^{n-k} (X_{t+k} - \bar{X})(X_t - \bar{X})}{\sum_{t=1}^{n} (X_t - \bar{X})^2}. \]
We then choose the factor by which to "thin" the output of the chain in order to have roughly independent observations by finding the smallest value of k for which we would not reject the hypothesis that the lag k autocorrelation is equal to zero. This is done by examining autocorrelation plots and determining the first time that the sample autocorrelation has a magnitude that falls below $2/\sqrt{n}$ (Box et al., 1994). Consider as an example the Gibbs sampler from Section 6.1. We examine the autocorrelation plots where the dashed lines represent the bounds of the nonrejection region for the hypothesis that the lag k autocorrelation is 0. For this example, the autocorrelation plot is in Fig. 19. We would select k = 2 as the thinning factor since this is the earliest
FIG. 19 Autocorrelation plot for X. The dashed lines represent the boundaries of the nonrejection region for the hypothesis that the lag k autocorrelation is zero. The first time that this hypothesis would not be rejected is at lag 2.
lag for which the sample autocorrelation falls in the nonrejection region. In multivariate settings, we would thin the output by a factor k, where k is the first lag for which all autocorrelations fall within the nonrejection region.
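The thinning rule described above can be sketched for a univariate chain: compute the sample autocorrelation at successive lags and return the first lag whose magnitude falls below $2/\sqrt{n}$. The function names and the autoregressive toy chain are hypothetical, and the square in the denominator follows the standard definition of the sample autocorrelation.

```python
import numpy as np

def sample_acf(x, k):
    """Sample autocorrelation at lag k."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    return np.dot(xc[k:], xc[: x.size - k]) / np.dot(xc, xc)

def thinning_factor(x, max_lag=None):
    """Smallest lag k >= 1 with |r_k| < 2 / sqrt(n), or None if no such lag
    is found up to max_lag."""
    n = len(x)
    max_lag = n - 1 if max_lag is None else max_lag
    bound = 2.0 / np.sqrt(n)
    for k in range(1, max_lag + 1):
        if abs(sample_acf(x, k)) < bound:
            return k
    return None

# Toy chain with strong positive dependence (an AR(1) series, for illustration).
rng = np.random.default_rng(0)
n, phi = 5000, 0.9
chain = np.empty(n)
chain[0] = rng.normal()
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + rng.normal()

k_thin = thinning_factor(chain)
thinned = chain[::k_thin] if k_thin is not None else chain
```

For a multivariate chain, the same function could be applied to each component and the largest of the resulting lags used, in line with the rule stated in the text.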
7 Conclusion
In this chapter, we have seen several settings where the use of MCMC methods comes up in data analysis. The goal of this chapter has been twofold. The first part of the goal is to demonstrate how MCMC methods are used to carry out Bayesian estimation of model parameters in common settings while providing enough detail to enable the reader to apply such methods in other settings. The second part of the goal is to provide the reader with the tools to use MCMC methods sensibly in sampling from the target density. It is important to have a sense of how long the chain should be run in order to obtain approximate samples from the target density as well as how often samples should be taken from the output of the chain in order to obtain a roughly random sample from the target distribution. While MCMC methods can be useful tools for obtaining approximate samples from intractable probability distributions, there are issues that can arise with their use. The biggest of these pertains to the first question. Some MCMC algorithms take a long time to approach their stationary distribution. If high autocorrelation is also an issue, then reasonably large random samples can take a long time to acquire. In these situations, computational expense becomes a problem, especially if the sampling scheme has to be carried out multiple times. Many times, this issue can be alleviated through transformation or reparameterization. A common issue with Metropolis–Hastings samplers is the choice of a suitable proposal density. Ideally, the proposal density should be shaped in a way that is similar to the target density. This density should be such that it results in proposals being accepted at a rate that helps ensure good mixing. This often requires some experimentation. In terms of convergence assessment, we have addressed the limitations of many of the techniques described in Section 6. One thing that has not been mentioned is a cautionary note in using the techniques described in Section 6.3. 
A particular issue of which one needs to be cognizant is the number of tuning parameters, specifically the numbers of initial values and the numbers of chains to be run from each initial value that need to be specified in order to ensure stability in estimating drift and minorization coefficients. If these are selected to be too large, the estimation procedure will not be efficient. Suitable choices of these values require some experimentation. Throughout the descriptions of the procedures in this chapter, we have made an effort to provide enough information to enable the reader to use these methods in a measured, nonnaïve way. For more information on the theory of Markov chains, the reader is encouraged to see Meyn and Tweedie (2005). For more on the use of MCMC algorithms in Bayesian analysis, many resources are available. Two particularly useful resources are Gilks et al. (1996) and Gelman et al. (2014).
References
Barker, R.J., Link, W.A., 2013. Bayesian multimodel inference by RJMCMC: a Gibbs sampling approach. Am. Stat. 67 (3), 150–156.
Bennett, J.E., Racine-Poon, A., Wakefield, J.C., 1995. MCMC for nonlinear hierarchical models. In: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (Eds.), Markov Chain Monte Carlo in Practice. Chapman and Hall, London, pp. 339–357.
Box, G.E.P., Jenkins, G.M., Reinsel, G.C., 1994. Time Series Analysis: Forecasting and Control. Prentice Hall, New York.
Carlin, B.P., Gelfand, A.E., 1991. An iterative Monte Carlo method for nonconjugate Bayesian analysis. Stat. Comput. 1, 119–128.
Chib, S., Nardari, F., Shephard, N., 1998. Markov chain Monte Carlo methods for generalized stochastic volatility models. J. Econ. 108, 281–316.
Cowles, M.K., Carlin, B.P., 1996. Markov chain Monte Carlo convergence diagnostics: a comparative review. J. Am. Stat. Assoc. 91, 883–904.
Cowles, M.K., Rosenthal, J.S., 1998. A simulation-based approach to convergence rates for Markov chain Monte Carlo algorithms. Stat. Comput. 8, 115–124.
Dieter, U., Ahrens, J., 1974. Acceptance-rejection techniques for sampling from the gamma and beta distributions. Technical Report 83, Office of Naval Research.
Fort, G., Moulines, E., Roberts, G.O., Rosenthal, J.S., 2003. On the geometric ergodicity of hybrid samplers. J. Appl. Probab. 40, 123–146.
Gelling, N., Schofield, M.R., Barker, R.J., 2018. RJMCMC: reversible jump MCMC using Post-Processing. R package version 0.4.3. https://cran.r-project.org/web/packages/rjmcmc/rjmcmc.pdf.
Gelman, A., Rubin, D.B., 1992. Inference from iterative simulation using multiple sequences. Stat. Sci. 7, 457–511.
Gelman, A., Gilks, W.R., Roberts, G.O., 1997. Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7 (1), 110–120.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B., 2014. Bayesian Data Analysis, third ed. CRC Press, Taylor and Francis Group, Boca Raton, FL.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741.
Geweke, J., 1992. Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In: Bernardo, J.M., Berger, J., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics 4. Oxford University Press, Oxford, UK, pp. 169–193.
Gilks, W.R., 1992. Derivative-free adaptive rejection sampling for Gibbs sampling. In: Bernardo, J.M., Berger, J., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics 4. Oxford University Press, Oxford, UK.
Gilks, W.R., Wild, P., 1992. Adaptive rejection sampling for Gibbs samplers. Appl. Stat. 41 (2), 337–348.
Gilks, W.R., Richardson, S., Spiegelhalter, D.J., 1996. Markov Chain Monte Carlo in Practice. CRC Press, Taylor and Francis Group, Boca Raton, FL.
Green, P.J., 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 (4), 711–732.
Green, P.J., Hastie, D.I., 2009. Reversible jump MCMC. Genetics 155 (3), 1391–1403.
Hastings, W., 1970. Monte Carlo sampling using Markov chains and their application. Biometrika 57, 97–109.
Heidelberger, P., Welch, P.D., 1983. Simulation run length control in the presence of an initial transient. Oper. Res. 31, 1109–1144.
Markov chain Monte Carlo methods Chapter
1
65
Ishwaran, H., James, L.F., Sun, J., 2001. Bayesian model selection in finite mixtures by marginal density decompositions. J. Am. Stat. Assoc. 96, 1316–1332. Jarner, S.F., Hansen, E., 2000. Geometric ergodicity of Metropolis algorithms. Stoch. Process. Appl. 85, 341–361. Liang, F., 2007. Continuous contour Monte Carlo for marginal density estimation with an application to a spatial statistical model. J. Comput. Graph. Stat. 16 (3), 608–632. Metropolis, N., Rosenbluth, A., Rosenbluth, B., Teller, A., Teller, E., 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21 (6). 1097-1092. Meyn, S.P., Tweedie, R.L., 2005. Markov Chains and Stochastic Stability, second ed. SpringerVerlag, London. Neal, R.M., 1998. Annealed importance sampling. University of Toronto Department of Statistics. Technical Report. Neal, R.M., 2003. Slice sampling. Ann. Stat. 31 (3), 705–767. Oh, M., Berger, J.O., 1989. Adaptive importance sampling in Monte Carlo integration. Purdue University Department of Statistics. Technical Report. Pan, J., Lee, M., Tsai, M., 2017. Reversible jump Markov chain Monte Carlo algorithm for Bayesian variable selection in logistic mixed models. Commun. Stat. Simul. Comput. 47 (8), 2234–2247. Perez Rodriguez, P., 2018. ars: adaptive rejection sampling-R package version 0.6. https://CRAN. R-project.org/package¼ars. Raftery, A.E., Lewis, S., 1992. How many iterations in the Gibbs sampler? In: Bernardo, J.M., Berger, J., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics 4. Oxford University Press, Oxford, U.K, pp. 763–773. Ripley, B.D., 1987. Stochastic Simulation. Wiley, New York. Ritter, C., Tanner, M.A., 1992. Facilitating the Gibbs sampler: the Gibbs stopper and the GriddyGibbs sampler. J. Am. Stat. Assoc. 82, 861–868. Roberts, G.O., 1992. Convergence diagnostics of the Gibbs sampler. In: Bernardo, J.M., Berger, J., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics 4. Oxford University Press, Oxford, UK, pp. 775–782. 
Roberts, G.O., 1996. Methods for estimating L2 convergence of Markov Chain Monte Carlo. In: Berry, D., Chaloner, I., Geweke, J. (Eds.), Bayesian Statistics and Econometrics: Essays in Honor of Arnold Zellner. North-Holland, Amsterdam, pp. 373–384. Roberts, G.O., Tweedie, R.L., 1996. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika 83 (1), 95–110. Roberts, G.O., Gelman, A., Gilks, W.R., 1997. Weak convergence and optimal scaling of randomwalk Metropolis algorithms. Ann. Appl. Probab. 7 (10), 110–120. Rosenthal, J.S., 1995. Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Am. Stat. Assoc. 90, 558–566. Rosenthal, J.S., 1996. Analysis of the Gibbs sampler for a model related to James-Stein estimators. Stat. Comput. 6, 269–275. Schruben, L.W., 1982. Detecting initialization bias in simulation output. Oper. Res. 30, 569–590. Schruben, L.W., Singh, H., Tierney, L., 1983. Optimal tests for initialization bias in simulation output. Oper. Res. 31, 1167–1178. Sherlock, C., Fearnhead, P., Roberts, G.O., 2010. The random walk Metropolis: linking theory and practice through a case study. Stat. Sci. 25 (2), 172–190. Spade, D.A., 2016. A computational procedure for efficient estimation of the mixing time of a random-scan Metropolis algorithm. Stat. Comput. 26 (4), 761–781.
66
Handbook of Statistics
Spade, D.A., 2020. Geometric ergodicity of a Metropolis-Hastings algorithm for Bayesian inference of phylogenetic branch lengths. Comput. Stat. (accepted). Troughton, P.T., Godsill, S.J., 1998. A reversible jump sampler for autoregressive time series. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing. Wei, G.C.G., Tanner, M.A., 1990. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. J. Am. Stat. Assoc. 85, 699–704. Yu, B., 1994. Monitoring the Convergence of Markov Samplers Based on Estimated L1 Error. University of California at Berkeley Department of Statistics. Yu, B., Mykland, P., 1994. Looking at Markov Samplers Through CUSUM Path Plots: A Simple Diagnostic Idea. University of California at Berkeley Department of Statistics. Zeger, S., Karim, M.R., 1991. Generalized linear models with random effects: a Gibbs sampling approach. J. Am. Stat. Assoc. 86, 79–86. Zellner, A., Min, C.K., 1995. Gibbs sampler convergence criteria. J. Am. Stat. Assoc. 90, 921–927. Zhang, Z., Chan, K.L., Wu, Y., Chen, C., 2014. Learning a multivariate Gaussian model with the reversible jump MCMC algorithm. Stat. Comput. 14 (4), 343–355.
Further reading Link, W.A., Barker, R.J., 2009. Bayesian Inference: With Ecological Applications. Academic Press. Mengersen, K.L., Tweedie, R.L., 1996. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat. 24 (1), 101–121. Robert, C.P., Casella, G., 2004. Monte Carlo Statistical Methods, second. Springer, New York. Roberts, G.O., Rosenthal, J.S., 1997. Geometric ergodicity and hybrid Markov chains. Electron. Commun. Probab. 2, 13–25. Paper 2. Roberts, G.O., Rosenthal, J.S., 1998. Two convergence properties of hybrid samplers. Ann. Appl. Probab. 8, 397–407. Schervish, M.J., Carlin, B.P., 1992. On the convergence of successive substitution sampling. J. Comput. Graph. Stat. 1, 111–127.
Chapter 2
An information and statistical analysis pipeline for microbial metagenomic sequencing data
Shinji Nakaoka∗ and Keisuke H. Ota
Faculty of Advanced Life Science, Hokkaido University, Sapporo, Japan
∗ Corresponding author: e-mail: [email protected]
Abstract
Metagenomic sequencing produces massive collections of genomic reads in an untargeted manner, which enables profiling of a given microbial community. Information processing of metagenomic data, followed by data analysis and mining, is indispensable for interpreting a microbial community properly. A variety of data analysis and mining techniques have helped to extract the essential associations that constitute a microbial community. This chapter gives brief overviews of popular methods, covering the procedures necessary to analyze microbial metagenomic data.
Keywords: Shotgun metagenomics, Information analysis pipeline, Genome microbiology

1 Introduction
Recent popularization of high-throughput sequencing by next-generation sequencers (NGS) has accelerated research and development of novel techniques that make use of NGS. Metagenomic sequencing refers to (de novo) DNA sequencing of a sample composed of different microbial species. Metagenomic sequencing has attracted attention because it is versatile and does not target specific sequences. Diverse applications of metagenomic sequencing have been reported in fields such as marine ecology, plant biology, agriculture, and clinical microbiology. Effective use of computational tools and resources is indispensable for extracting useful and interpretable information from raw metagenomic sequences, which are essentially strings over a four-letter alphabet representing the nucleotides of DNA/RNA molecules. A variety of computational tools and resources that support informatics, statistical, and mathematical analyses of metagenomic sequencing data have been released as open-source software. A prominent feature of informatics analysis for metagenomic data, compared with other omics-related data, is the broad flexibility in constructing an information analysis pipeline. Therefore, it is essential to design a specific pipeline suitable for the purpose, on the premise of Do-It-Yourself (DIY) construction. In this chapter, we introduce an information analysis pipeline for metagenomic sequencing data that combines state-of-the-art computational tools and resources. Note that, to support DIY construction, we emphasize brief descriptions of the essential ingredients of a pipeline rather than a detailed walk-through of one fixed pipeline with an example.
The organization of this chapter is as follows. In the next section, we give a brief overview of shotgun metagenomic sequencing analysis with a focus on informatics. Topics include the basic concepts of sequence assembly, contig binning, functional annotation, statistical and machine learning approaches, mathematical modeling, and tips on data reproducibility and portability. In Section 3, we introduce state-of-the-art computational tools and resources that are in frequent use in metagenomic sequencing studies.
Handbook of Statistics, Vol. 43. https://doi.org/10.1016/bs.host.2020.01.001 © 2020 Elsevier B.V. All rights reserved.
2 A brief overview of shotgun metagenomic sequencing analysis
In this section, we introduce basic strategies for processing metagenomic sequences to extract useful information and data from them.
2.1 Sequence assembly and contig binning
A next-generation sequencer (NGS) produces a collection of short-read sequences, which are much shorter than the typical length of a protein-coding gene. Shotgun DNA sequencing is often employed to determine large DNA sequences from short fragments (reads). Shotgun metagenomic sequencing focuses on a batch of DNA sequences that originate from different organisms. Sequence assembly is the informatic processing that aligns and merges short reads derived from a longer DNA sequence in order to reconstruct the original sequence. A contig is a collection of overlapping DNA fragments, or the longer DNA sequence reconstructed by aligning the overlapping fragments; it represents a consensus region of DNA. In metagenomic sequences, DNA fragments usually originate from multiple species. In metagenomic sequencing, contig binning refers to the informatic processing that assigns contigs to the whole genome of each original organism.
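The idea of merging overlapping reads into a contig can be illustrated with a toy example. The following is a minimal sketch, assuming error-free reads from a single strand and a naive greedy strategy; the function names and the three reads are hypothetical, and production assemblers use far more elaborate methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


# Hypothetical reads sampled from the sequence ATGCGTACGGA.
reads = ["ATGCGT", "GCGTAC", "TACGGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

With reads from several organisms, the same procedure would return several contigs, which is where contig binning takes over.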
2.2 Annotation of taxonomy, protein, metabolic, and biological functions
Raw metagenomic sequences by themselves tell us little about a microbial community unless associated knowledge is assigned to them. If we already have knowledge of a particular gene, counting the number of short reads that fall into the gene region becomes meaningful. In a more general setting, a collection of biologically meaningful pieces of knowledge helps to “annotate” the (processed) sequences in a sample with common knowledge. Hence databases that collect and store useful information are indispensable for annotating sequences. Most databases focus on collecting a specific valuable attribute of organisms or knowledge. For example, valuable attributes for proteins include family, domain, and functional site information. Recent major databases allow programmatic access and operations to extract the information necessary for annotation. Specific public databases that provide useful resources are summarized in Section 3.2.
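As an illustration of how annotation turns sequence hits into counts, the sketch below parses homology-search results in the 12-column tabular layout written by BLAST and DIAMOND (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The hit lines, the E-value cutoff, and the subject-to-term mapping (`TERM_A`, `TERM_B`) are hypothetical placeholders, not entries of any real database.

```python
from collections import Counter


def best_hits(tabular_lines, max_evalue=1e-5):
    """Keep, for each query, the highest-bit-score hit passing the E-value cutoff."""
    best = {}
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def count_annotations(hits, subject_to_term):
    """Count queries per annotation term; unmapped subjects are pooled."""
    counts = Counter()
    for subject in hits.values():
        counts[subject_to_term.get(subject, "unannotated")] += 1
    return counts


# Hypothetical tabular hits for three reads.
lines = [
    "read1\tprotA\t98.0\t50\t1\t0\t1\t50\t10\t59\t1e-20\t100.0",
    "read1\tprotB\t80.0\t50\t10\t0\t1\t50\t5\t54\t1e-8\t60.0",
    "read2\tprotB\t90.0\t50\t5\t0\t1\t50\t5\t54\t1e-12\t80.0",
    "read3\tprotC\t40.0\t30\t18\t0\t1\t30\t2\t31\t0.5\t25.0",
]
# Hypothetical mapping from reference proteins to annotation terms.
subject_to_term = {"protA": "TERM_A", "protB": "TERM_B"}

hits = best_hits(lines)
counts = count_annotations(hits, subject_to_term)
print(dict(counts))
```

Running the same counting for every sample yields one column of a read-count table per sample.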
2.3 Statistical analysis and machine learning
In general, processed sequences are summarized as a read-count table, which records for each sample the number of reads assigned to each item (gene or species). Hence applying standard statistical methods for count data is a primary step in analyzing a read-count table. Because read counts depend on the total number of reads in each sample, appropriate normalization is required to ensure a sound comparison between samples. A read-count table is mathematically represented by a high-dimensional (sparse) nonnegative matrix. Hence machine-learning approaches may be useful for extracting features that characterize the state of a microbial community from a read-count table, provided that a sufficient number of samples is available. For example, dimension reduction methods such as Principal Coordinate Analysis (PCoA), t-SNE as a popular manifold learning method, or non-negative matrix factorization and other methods from topic modeling may be helpful. If one has multiple read-count tables of different types, it is appropriate to analyze the set of tables simultaneously rather than each table independently. If the read-count tables share an attribute, for example, if the tables are generated from the same sample set, then methods for multidimensional arrays (tensors) will be more suitable. Note that different types of information, such as taxonomic classification and annotation via protein databases, can be extracted simultaneously from metagenomic sequencing data as read-count tables sharing the same attribute (sample information). Hence tensor-based methods may be more suitable for analyzing metagenomic sequencing data.
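The dependence of read counts on sequencing depth can be made concrete with the simplest normalization, total-sum scaling, in which each count is divided by the total count of its sample. The sketch below is one possible choice among many normalization schemes; the table layout (items as keys, one count per sample) and the numbers are hypothetical.

```python
def total_sum_scale(table, scale=1.0):
    """Divide each count by its sample (column) total.

    table maps item -> list of counts, one entry per sample.
    With scale=1e6 the result is counts per million.
    """
    n_samples = len(next(iter(table.values())))
    totals = [sum(row[j] for row in table.values()) for j in range(n_samples)]
    return {
        item: [
            scale * row[j] / totals[j] if totals[j] else 0.0
            for j in range(n_samples)
        ]
        for item, row in table.items()
    }


# Hypothetical table: the second sample was sequenced ten times deeper.
counts = {
    "geneA": [10, 100],
    "geneB": [30, 300],
    "geneC": [60, 600],
}
proportions = total_sum_scale(counts)
print(proportions)
```

After scaling, the two samples have identical profiles, which the raw counts obscure.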
2.4 Reconstruction of pseudo-dynamics and mathematical modeling
Compositional change of a given microbial community is often associated with an important ecological or clinical signature. In the gut microbiota, compositional change toward decreasing diversity is referred to as dysbiosis. Recent studies have demonstrated that dysbiosis is associated with the progression of various diseases. To characterize dysbiosis as microbial community dynamics, mathematical models may be useful for identifying causal mechanisms underlying community change. Although a variety of data types can be extracted from metagenomic sequences, the static nature of data acquisition limits the application of mathematical modeling. In most cases, information about the dynamical process of a microbial community is lacking, which makes validation of mathematical models difficult. One promising approach, which has not yet been fully established, is to reconstruct a pseudo-dynamical process from non-time-series datasets. Information about temporal change inferred from the pseudo-dynamical process can help validate mathematical models. This approach implicitly assumes that the datasets as a whole are collected so as to cover a sufficient number of snapshots of some time-evolving process. A concrete example of this approach is the pseudo-time reconstruction of a cell differentiation process from single-cell RNA-seq datasets. Although no pseudo-time reconstruction method applicable to metagenomic data has been reported, reconstruction of pseudo-dynamics and mathematical modeling would contribute in part to understanding microbial community dynamics.
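To make the notion of ordering snapshots along a pseudo-time axis concrete, the following deliberately simple sketch chains samples by nearest-neighbour distance between their normalized profiles, starting from a chosen root sample. It is an assumption-laden toy, not an established method for metagenomic data and not the algorithm of any published tool; the sample names, profiles, and the choice of root are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pseudo_order(profiles, root):
    """Order samples by repeatedly stepping to the nearest unvisited sample."""
    order = [root]
    remaining = set(profiles) - {root}
    while remaining:
        current = profiles[order[-1]]
        # sorted() makes tie-breaking deterministic
        nxt = min(sorted(remaining), key=lambda s: euclidean(current, profiles[s]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical relative abundances of two taxa in four unordered samples.
profiles = {
    "s1": [0.9, 0.1],
    "s2": [0.5, 0.5],
    "s3": [0.7, 0.3],
    "s4": [0.2, 0.8],
}
order = pseudo_order(profiles, root="s1")
print(order)
```

The resulting order is only meaningful under the assumption stated above, namely that the samples cover snapshots of one underlying time-evolving process.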
2.5 Construction of analysis pipeline with reproducibility and portability
A common information pipeline for metagenomic sequences is built by incorporating many different types of tools. It is often the case that a pipeline breaks when the version of one of its tools is updated. The reason is that a tool usually depends, explicitly or implicitly, on runtime libraries or on other tools. Building a pipeline on top of a traceable system is therefore highly recommended in order to manage the complicated explicit and implicit dependencies on libraries and other tools. Docker is open-source software for creating container-type virtual environments. A Docker container is a unit of software that packages up code and all its dependencies such as libraries (https://www.docker.com/resources/whatcontainer). Note that a Docker container is a standalone, executable package of software that includes everything needed to run an application. For these reasons, a Docker container image is suitable for building an analysis pipeline with reproducibility, portability, and ready-to-go deployment to other computers irrespective of the installed operating system. Furthermore, there exists a community, referred to as BioContainers, that maintains container images specific to biological applications. Hence the primary computational tools are likely to be readily available as Docker container images. If the tools are written in Python or as scripts for the statistical analysis software R, Anaconda may be a more convenient option for building a pipeline with reproducibility and portability.
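Independently of the container technology chosen, the versions a pipeline was validated with can be recorded and compared against what is actually installed. The sketch below shows that bookkeeping only; the tool names and version strings are hypothetical, and how the installed versions are collected (for example, from each tool's version flag) is left open.

```python
def find_version_drift(pinned, installed):
    """Return tools whose installed version differs from the pinned one.

    Both arguments map tool name -> version string. A tool that is pinned
    but not installed is reported with None as the found version.
    """
    drift = {}
    for tool, wanted in pinned.items():
        found = installed.get(tool)
        if found != wanted:
            drift[tool] = (wanted, found)
    return drift


# Hypothetical manifest recorded when the pipeline was last validated.
pinned = {"aligner": "1.2.0", "assembler": "0.9.1", "binner": "2.0.0"}
# Hypothetical state of the current machine.
installed = {"aligner": "1.2.0", "assembler": "1.0.0"}

drift = find_version_drift(pinned, installed)
for tool, (wanted, found) in sorted(drift.items()):
    print(f"{tool}: expected {wanted}, found {found}")
```

A container image makes such a check unnecessary at run time because the pinned versions travel with the image, which is the main argument for it in this setting.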
3 Computational tools and resources
Developing computational tools to process metagenomic sequences requires efficient algorithms that can deal with a massive number of sequences. Public resources stored in major nucleotide and amino acid databases support functional annotation of the predicted genes and proteins. In the present section, we briefly introduce major computational tools and resources that are in frequent use in metagenomic sequence analysis.
3.1 Tools and software
Combining different types of computational tools is indispensable for processing metagenomic datasets and extracting meaningful information from them. In this section, we introduce major software and tools that are employed in each step of an information analysis pipeline. Note that we present only one state-of-the-art tool for each essential step of the pipeline as a representative example. A more comprehensive treatment of constructing and utilizing a metagenomic analysis pipeline can be found in a review paper.
3.1.1 BLAST
BLAST, an abbreviation of Basic Local Alignment Search Tool, refers to an algorithm or software for aligning sequences of DNA (deoxyribonucleic acid) or amino acids (Altschul et al., 1990). BLAST helps to find a sequence in one species (for example, mouse) that is similar to sequences found in another species (such as Homo sapiens). More specifically, BLAST aligns a target sequence, given as a query, against a set of reference sequences in order to find those that are similar to the query. Reference sequences are composed of sequenced DNA or protein originating from one or multiple species. Databases of reference sequences are maintained by international organizations such as the National Center for Biotechnology Information (NCBI) GenBank, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ). These three organizations belong to the International Nucleotide Sequence Database Collaboration (INSDC) and update and share the reference sequences on a daily basis (see Section 3.2.1 for details of the database of reference sequences). A sequence similar to a query sequence is often referred to as a homologous sequence. Similarity is evaluated quantitatively against a given threshold. Among the available similarity measures, the E-value is a primary index of homology. Based on the assumption that the distribution of similarity scores expected by chance is described by the extreme value distribution, the E-value represents the expected number of times a given score would occur by chance (Pearson, 2013). Depending on the type of query sequence, a user can choose the program suited to the purpose. For example, to search for amino acid sequences homologous to a given protein sequence, blastp is an appropriate choice. The other programs include blastn (nucleotide to nucleotide), blastx (nucleotide to amino acid, with a translation step), and tblastn (amino acid to nucleotide). Variants of BLAST include BLAT (BLAST-Like Alignment Tool), which generally performs faster alignment at the expense of precision. Clustal is a tool dedicated to multiple sequence alignment. There exist several ultra-fast alignment tools for amino acid sequences that outperform blastp in speed (see Section 3.1.9 for an instance). More detailed explanations of applications and of the basic algorithmic concepts can be found in a standard textbook or review paper on bioinformatics (Pearson, 2013).
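The relation between an alignment score and its E-value can be written down using the standard Karlin-Altschul form, E = K m n exp(-lambda S), where S is the raw score, m and n are the query and database lengths, and K and lambda are statistical parameters that depend on the scoring system. The sketch below evaluates this formula; the numerical values of `lam`, `k`, the lengths, and the scores are illustrative assumptions rather than values reported by any particular BLAST run, and effective-length corrections are ignored.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized (bit) score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative (assumed) parameters and search-space sizes.
lam, k = 0.267, 0.041
query_len, db_len = 300, 1_000_000_000

for score in (50, 100, 150):
    e = e_value(score, query_len, db_len, lam, k)
    print(f"raw score {score:3d}  bit score {bit_score(score, lam, k):6.1f}  E-value {e:.3g}")
```

The E-value decreases exponentially with the score and grows linearly with the size of the database, which is why the same alignment can be significant against a small database and insignificant against a large one.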
3.1.2 BWA
BWA, an abbreviation of Burrows-Wheeler Aligner, is a tool for mapping (short-)read sequences to a large reference genome (Li and Durbin, 2010). BWA utilizes the Burrows-Wheeler transform, a reversible transform used to preprocess data for compression and string search. BWA performs efficient mapping by indexing the reference genome, which enables faster sequence search while reducing memory usage. In practice, BWA maps contig sequences constructed from a metagenomic sample to reference genome sequences. The corresponding read-count table for mapping multiple samples comprises count data with items (genes) in rows and samples in columns.

3.1.3 SAMtools
SAMtools is a versatile tool for processing files in the SAM (Sequence Alignment/Map) format or BAM format (the binary version of SAM) (Li et al., 2009). A SAM-formatted file describes the position of each sequence read in a genome. SAMtools provides several operations on alignments, including merging, sorting, format conversion, and indexing.

3.1.4 CD-HIT
CD-HIT performs clustering to merge similar nucleotide/amino acid sequences or contigs (Fu et al., 2012). CD-HIT is often employed to remove redundancy in a dataset of contigs or proteins. Non-redundant datasets are suitable for further analysis, such as annotating contig sequences with taxonomic information or annotating amino acid sequences with domains or other types of information that characterize protein function.

3.1.5 MEGAHIT
MEGAHIT is an ultra-fast and memory-efficient assembly tool for metagenomic sequences (Li et al., 2015). A straightforward way to formulate sequence assembly is to reduce it to a Hamiltonian path problem. Note that solving a Hamiltonian path problem is known to be computationally NP-hard. Most of the current major sequence assemblers instead treat sequence assembly as a classical Eulerian path problem in the de Bruijn graph (Pevzner et al., 2001). MEGAHIT employs a succinct de Bruijn graph with additional operations for solving sequence assembly, which achieves a fast implementation while using memory efficiently.
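The Eulerian path formulation can be illustrated on a toy input: every distinct k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and a path that uses each edge once spells a contig. The sketch below is a plain (not succinct) de Bruijn graph with Hierholzer's algorithm, assuming error-free reads, no repeated k-mers, and the existence of an Eulerian path; it is not MEGAHIT's implementation, and the reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a graph whose edges are the distinct k-mers of the reads."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (u for u in graph if len(graph[u]) - in_deg[u] == 1),
        next(iter(graph)),
    )
    edges = {u: list(vs) for u, vs in graph.items()}  # consumable copy
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate a node path into the sequence it represents."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical reads sampled from the sequence ATGGCGTA.
reads = ["ATGGC", "TGGCG", "GCGTA"]
graph = de_bruijn(reads, k=3)
contig = spell(eulerian_path(graph))
print(contig)
```

Finding an Eulerian path takes time linear in the number of edges, in contrast to the Hamiltonian path formulation mentioned above.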
3.1.6 MaxBin
MaxBin is a popular binning tool based on an Expectation-Maximization (EM) algorithm, which recovers genomes from assembled contig sequences (Wu et al., 2014). MaxBin2 is the updated version of MaxBin, which attempts to increase binning performance by co-assembling multiple metagenomic datasets (Wu et al., 2016). MaxBin takes a collection of contig sequences in a FASTA or FASTQ file as input and produces recovered genomes as FASTA files. Recovered individual genomes can be processed further to find detailed taxonomic information and associated genes and pathways via IMG/M, the Integrated Microbial Genomes and Microbiomes system (see Section 3.2.2 for details).

3.1.7 Prodigal
Prodigal, an abbreviation of PROkaryotic DYnamic programming Gene-finding ALgorithm, is a fast tool for predicting protein-coding genes in bacterial and archaeal genomes (Hyatt et al., 2010). Although Prodigal can handle both draft genomes and metagenomes, its user guide describes processing individual genomes recovered from metagenomic datasets by a binning tool as the ideal solution (https://github.com/hyattpd/prodigal/wiki/Advice-byInput-Type). Outputs of Prodigal can be processed further to perform functional annotation. Note that Prodigal does not support viral gene prediction, although it would run on a collection of short-read sequences that may contain segments of viral genomes. A specific tool for identifying viral sequences in a metagenomic sample helps to investigate the functional roles of viral components (see the next section).

3.1.8 VirFinder
Distinguishing viral from host sequences in a metagenomic sample is a crucial first step in investigating the functional roles of viral components. VirFinder is a tool for detecting viral sequences in metagenomic sequences (Ren et al., 2017). VirFinder employs a machine-learning approach to distinguish viral sequences from others, based on the finding that viruses and hosts exhibit distinct k-mer signatures. Screening of viral sequences followed by sequence assembly into contigs increases the quality of a BLAST homology search against a database of viral sequences. A read-count table is obtained from the screened viral sequences by mapping them to reference viral genomes.
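A k-mer signature is simply the vector of relative frequencies of all length-k words in a sequence. The sketch below computes such a profile, which is the kind of feature vector a classifier could be trained on; the classifier itself is omitted, the function name and the example sequence are hypothetical, and this is not VirFinder's code.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k):
    """Relative frequency of every k-mer over the alphabet ACGT.

    Windows containing other symbols (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


# Hypothetical contig fragment.
profile = kmer_profile("ACGTACGT", k=2)
top = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))[:4]
print(top)
```

Because the profile has a fixed length of 4^k regardless of sequence length, contigs of different sizes can be compared directly.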
3.1.9 DIAMOND
DIAMOND, an abbreviation of Double Index AlignMent Of Next-generation sequencing Data, is an ultra-fast aligner of protein sequences against a reference protein database (Buchfink et al., 2015). By virtue of double indexing, an approach that determines the list of all seeds and their locations in both the query and the reference sequences, DIAMOND runs 20,000 times faster than BLASTX on short-read sequences with a similar degree of sensitivity (Buchfink et al., 2015).

3.1.10 MEGAN
MEGAN, an abbreviation of MEtaGenome ANalyzer, offers various tools for annotating taxonomy, protein, metabolic, and biological functions (Huson et al., 2016). More specifically, MEGAN Community Edition (CE) is the latest release; it supports taxonomic binning using the LCA algorithm and the NCBI taxonomy database, and functional analyses with InterPro classified by Gene Ontology (InterPro2GO), SEED, eggNOG, and legacy KEGG (see the following sections for details of these databases). Results of functional analyses are visualized with the respective viewers. MEGAN CE can perform functional annotation analyses on homology search output produced by DIAMOND.

3.1.11 TSCAN
Trajectory inference refers to computational methods that attempt to infer the trajectories of a time-evolving process, such as cell differentiation, that is assumed to underlie the datasets. Trajectory inference is often referred to as pseudo-time reconstruction because a trajectory is reconstructed from non-time-series datasets. Although the technical details differ among existing methods, the basic idea of pseudo-time reconstruction is to cluster and order similar samples with respect to a particular feature so as to form an ordered path (with branches). TSCAN (Ji and Ji, 2016) is a popular tool for trajectory inference on single-cell RNA-sequencing data. Although no trajectory inference method designed specifically for metagenomic sequencing has appeared so far, the concept behind trajectory inference can be applied to align metagenomic samples along a pseudo-time axis. Inferred trajectories help to construct pseudo-dynamics: a dynamical process reconstructed from samples, each of which is a snapshot of some time-evolving process.

3.2 Public resources and databases
In this section, we introduce popular public resources and databases used in metagenomic sequencing studies.
3.2.1 NCBI reference database (RefSeq)
The Reference Sequence database (RefSeq) maintained by NCBI is a comprehensive collection of various databases, including nucleotide and protein databases (Pruitt et al., 2007). RefSeq provides “non-redundant curated data” derived from INSDC sequences (see RefSeq Frequently Asked Questions: https://www.ncbi.nlm.nih.gov/books/NBK50679/). The NCBI non-redundant nucleotide database (nt) is a comprehensive collection of curated nucleotide sequences. RefSeq non-redundant proteins (nr) is a comprehensive curated database for proteins. Genomic or metagenomic information is stored in the NCBI genome database, in which supplementary files and metadata associated with the genome information, such as gene annotation, are also available. Genomes in the NCBI genome database are not necessarily curated or completely sequenced. In fact, the database contains genomes whose sequencing is in progress. The NCBI microbial and viral genome databases are two subsets of the NCBI genome database that are often referred to in metagenomics studies. Taxonomic annotation often refers to the NCBI taxonomy, which stores comprehensive curated taxonomic information and sequences.

3.2.2 Integrated Microbial Genomes (IMG) and Genome OnLine Database (GOLD)
The DOE Joint Genome Institute (JGI) offers the Genomes OnLine Database (GOLD), a comprehensive database of genomic and metagenomic sequencing projects and their associated data (Mukherjee et al., 2019). The Integrated Microbial Genomes (IMG) system (Chen et al., 2016) and the microbiome (IMG/M) system (Chen et al., 2017) support annotation, analysis, and distribution of microbial genome and microbiome datasets sequenced at JGI or submitted by individual scientists.

3.2.3 UniProt
UniProt is a comprehensive database of protein sequences and their functions (UniProt Consortium, 2008).
UniProt is composed of three different databases: the UniProt Knowledgebase (UniProtKB), which contains manually curated entries (Swiss-Prot) and automatically annotated entries (TrEMBL); UniRef, which provides sequence clusters; and UniParc, which serves as a data archive. UniProtKB attempts to provide all relevant known information about a particular protein (UniProt Consortium, 2008). UniParc contains protein sequences from different sources, including the GenBank nucleotide sequence databases and RefSeq.

3.2.4 InterPro
InterPro is an integrated database that combines several protein databases covering protein families, domains, and functional sites (Jones et al., 2014).
The InterPro member databases number more than ten and include UniProt, PROSITE for protein families and domains, HAMAP for high-quality automated and manual annotation of microbial proteomes, and Gene3D for protein families and domain architectures in complete genomes. The InterPro web server provides functional analysis of proteins by classifying them into families and predicting domains and important sites. InterProScan is a tool, available stand-alone or as a web server, that searches for protein families, domains, and functional sites in given protein sequences.
3.2.5 KEGG
KEGG, an abbreviation of the Kyoto Encyclopedia of Genes and Genomes, is a comprehensive pathway database with particular strength in metabolic pathways (Kanehisa et al., 2017). The KEGG Orthology (KO) database is a collection of manually defined orthologous groups. KO groups are mapped to several kinds of objects, such as pathways, entries of other databases, genes, and enzymes. One useful application is functional annotation of the enzymes predicted in a metagenomic sample. More specifically, the results of a homology search of a query protein sequence by BLAST or DIAMOND are annotated with KO via mapping to KEGG genes or enzymes.

3.2.6 SEED
The SEED is an annotation environment that aims to provide accurate genome annotation. In response to the recent rapid growth in the number of sequenced genomes, the SEED project now integrates Rapid Annotation of microbial genomes using Subsystems Technology (RAST) (Overbeek et al., 2014). Rapid annotation relies on automated annotation, which is supported by the subsystems technology. A subsystem is a set of functional roles that make up a metabolic pathway, a complex, or a class of proteins (cf. http://www.theseed.org/wiki/Glossary). Details of the definition of functional roles can be found in Overbeek et al. (2005). Note that functional roles play a role equivalent to KEGG Orthology or Gene Ontology. Hence the results of a homology search of a query protein sequence are annotated with functional roles.

3.2.7 EggNOG
An ortholog is a homologous gene shared by different species that descended from the same ancestral species through a speciation event. An orthologous group is obtained when an orthologous relation is extended to include multiple species. The Clusters of Orthologous Groups (COGs) of proteins were originally generated by comparing the protein sequences of 21 complete genomes of bacteria, archaea, and eukaryotes (Tatusov et al., 2000). EggNOG, an abbreviation of evolutionary genealogy of genes: Non-supervised Orthologous Groups, is a database of Orthologous Groups (OGs) and functional annotations at different taxonomic levels (Huerta-Cepas et al., 2019). EggNOG extends the original idea of COGs, but its orthologous groups are constructed by unsupervised graph-based clustering. Note that both COGs and non-supervised orthologous groups can be used to annotate a protein sequence.
3.3 Do-It-Yourself information analysis pipeline for metagenomic sequences
We propose a way to combine all the computational tools and resources introduced in this section to produce different types of read-count tables. For example, selecting viral sequences with VirFinder and then binning them against a virus taxonomy database produces a read-count table of virus taxonomy (represented by “virus count” in Fig. 1). In addition, appropriate normalization may be required when processing a raw read-count table. Finally, a collection of different read-count tables forms a third-order nonnegative tensor, with samples along the first axis, items (genes or species) along the second, and table types (virus, bacteria, COGs, etc.) along the third. A variety of computational methods for tensor analysis, such as tensor decomposition, are then available for extracting useful information or implications from the metagenomic datasets of interest. Furthermore, trajectory inference is possible for each table type. Although a novel method is needed to integrate consistently the multiple trajectories inferred from multiple read-count tables, reconstruction of pseudo-dynamics from metagenomic datasets motivates the construction of an integrated data-driven mathematical model that describes various time-evolving biological events.
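The third-order structure described above can be sketched as a sparse tensor in coordinate form, where each nonzero count is stored under a (sample, item, table type) key. This representation is one assumption among several possible ones; it sidesteps the fact that different table types have different item sets. The table contents and names below are hypothetical.

```python
def build_tensor(tables):
    """Stack read-count tables into a sparse third-order tensor.

    tables maps table_type -> {item -> {sample -> count}}.
    The result maps (sample, item, table_type) -> count, nonzero entries only.
    """
    tensor = {}
    for table_type, table in tables.items():
        for item, row in table.items():
            for sample, count in row.items():
                if count:
                    tensor[(sample, item, table_type)] = count
    return tensor


def mode_sizes(tensor):
    """Number of distinct labels along each of the three axes."""
    return tuple(len({key[axis] for key in tensor}) for axis in range(3))


# Hypothetical read-count tables generated from the same two samples.
tables = {
    "virus": {"phageX": {"s1": 5, "s2": 0}},
    "bacteria": {
        "taxonA": {"s1": 12, "s2": 7},
        "taxonB": {"s1": 0, "s2": 3},
    },
}
tensor = build_tensor(tables)
print(mode_sizes(tensor))
print(sorted(tensor.items()))
```

A dense array for use with tensor decomposition libraries can be derived from this form by fixing an order of the labels on each axis and filling missing entries with zero.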
4 Notes
This chapter introduced a DIY information analysis pipeline for metagenomic sequences. We presented only a minimal set of computational tools and resources. There usually exist multiple tools for the same purpose; for example, both MaxBin and MEGAN can perform binning. Despite their importance, we skipped several steps that are usually required in preprocessing, such as quality checks for metagenomic sequences. Although it is outside the focus of this chapter, 16S rRNA amplicon sequencing is a popular method to investigate a bacterial community profile. Information processing via 16S rRNA sequences usually produces Operational Taxonomic Units (OTUs) as a count table of a bacterial community. There are also a variety of multivariate statistical analysis methods for OTUs. Some of the existing methods will be applicable to a read-count table generated from metagenomic sequencing data.
FIG. 1 Do-It-Yourself information analysis pipeline for metagenomic sequences, constructed by combining state-of-the-art computational tools (highlighted with orange boxes) and resources (highlighted with green boxes). Data processed via informatic analysis are further analyzed with computational and mathematical methods (highlighted with purple boxes).
Acknowledgments This work is supported by JST PRESTO Grant Number JPMJPR16E9, the Japan Society for the Promotion of Science (JSPS) Grant-in-Aid (C) JP16K05265 and (S) JP15H05707. The authors are grateful to Ms. Mai Suganami for typesetting references and corrections of typos.
References Altschul, S.F., et al., 1990. Basic local alignment search tool. J. Mol. Biol. 215 (3), 403–410. Buchfink, B., Xie, C., Huson, D.H., 2015. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12 (1), 59–60. Chen, I.M., et al., 2016. Supporting community annotation and user collaboration in the integrated microbial genomes (IMG) system. BMC Genomics 17, 307. Chen, I.A., et al., 2017. IMG/M: integrated genome and metagenome comparative data analysis system. Nucleic Acids Res. 45 (D1), D507–D516. Fu, L., et al., 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28 (23), 3150–3152. Huerta-Cepas, J., et al., 2019. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47 (D1), D309–D314. Huson, D.H., et al., 2016. MEGAN Community edition—interactive exploration and analysis of large-scale microbiome sequencing data. PLoS Comput. Biol. 12 (6), e1004957. Hyatt, D., et al., 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119. Ji, Z., Ji, H., 2016. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 44 (13), e117. Jones, P., et al., 2014. InterProScan 5: genome-scale protein function classification. Bioinformatics 30 (9), 1236–1240. Kanehisa, M., et al., 2017. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45 (D1), D353–D361. Li, H., Durbin, R., 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26 (5), 589–595. Li, H., et al., 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25 (16), 2078–2079. Li, D., et al., 2015. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31 (10), 1674–1676. Mukherjee, S., et al., 2019. 
Genomes OnLine database (GOLD) v.7: updates and new features. Nucleic Acids Res. 47 (D1), D649–D659. Overbeek, R., et al., 2005. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 33 (17), 5691–5702. Overbeek, R., et al., 2014. The SEED and the rapid annotation of microbial genomes using subsystems technology (RAST). Nucleic Acids Res. 42 (Database issue), D206–D214. Pearson, W.R., 2013. An introduction to sequence similarity (“homology”) searching. Curr. Protoc. Bioinformatics Chapter 3, Unit3 1. Pevzner, P.A., Tang, H., Waterman, M.S., 2001. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. U. S. A. 98 (17), 9748–9753.
Pruitt, K.D., Tatusova, T., Maglott, D.R., 2007. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35 (Database issue), D61–D65. Ren, J., et al., 2017. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5 (1), 69. Tatusov, R.L., et al., 2000. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28 (1), 33–36. UniProt Consortium, 2008. The universal protein resource (UniProt). Nucleic Acids Res. 36 (Database issue), D190–D195. Wu, Y.W., Simmons, B.A., Singer, S.W., 2016. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32 (4), 605–607. Wu, Y.W., et al., 2014. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome 2, 26.
Chapter 3
Machine learning algorithms, applications, and practices in data science Kalidas Yeturu* Indian Institute of Technology Tirupati, Tirupati, India * Corresponding author: e-mail: [email protected]
Abstract
Data science is an umbrella term referring to the concepts and practices of a subset of the topics under artificial intelligence (AI) methodologies. AI is a framework that defines the notion of intelligence in software systems or devices in terms of knowledge representation and reasoning methodologies. There are two main types of reasoning methods over data: deductive and inductive. The major classes of machine learning and deep learning methods come under inductive reasoning, where, essentially, missing pieces of information are interpolated from existing data through numerical transformations. However, today AI is mostly identified with deduction systems, while it is actually a comprehensive school of thought and formal framework. The AI framework offers rigor and robustness to the solutions developed, and there is still scope for onboarding today’s deep learning solutions to reap the benefits of sturdiness. Data science is about the end-to-end development of a smart solution, which involves creating pipelines of activities for data generation, business decision making, and solution maintenance with humans in the loop. Data generation is a cycle of activities involving collection, refinement, feature transformations, devising more insightful heuristic measures based on domain peculiarities, and iterations to enhance the quality of data-driven decisions. Business decision making is a pipeline of activities involving the design of mappers from data to business decisions. The mappers are typically machine learning methods that are fine-tuned to give the best possible performance in a given period of study, subject to business constraints. The mappers are fine-tuned based on the quality and magnitude of data and subdata. Solution maintenance is a critical component that involves setting up alarms to detect when a given decision maker model no longer works as desired. 
The maintenance work calls for repair actions such as identifying data to gather, comparative metrics of different models and monitoring the patterns and trends in the input data. Keywords: Machine learning, Supervised models, Gradient descent, SVD, Boosting, Graphical models, Automatic differentiation, Data science, Deep learning, Artificial intelligence Handbook of Statistics, Vol. 43. https://doi.org/10.1016/bs.host.2020.01.002 © 2020 Elsevier B.V. All rights reserved.
Terminology and abbreviations
AI: Artificial intelligence
AE: Auto encoder
ANN: Artificial neural network
AUC: Area under the curve
BFS: Breadth first search
CNN: Convolutional neural network
CV: Cross validation
DBSCAN: Density-based spatial clustering of applications with noise
DBN: Deep belief network
DFS: Depth first search
DT: Decision tree
EM: Expectation maximization
FN: False negative
FP: False positive
FPR: False positive rate
GA: Genetic algorithm
GAN: Generative adversarial network
HMM: Hidden Markov model
LDA: Latent Dirichlet allocation
LOO: Leave one out (cross validation)
LSTM: Long short-term memory
LU: Lower upper (triangular matrix decomposition)
MAE: Mean absolute error
MDS: Multidimensional scaling
MSE: Mean squared error
NB: Naive Bayes algorithm
PCA: Principal component analysis
PDDL: Planning domain definition language
PR Curve: Precision-recall curve
QR: Orthonormal matrix decomposition
RELU: Rectified linear unit (activation function)
RBM: Restricted Boltzmann machine
RL: Reinforcement learning
RMSD: Root mean squared deviation
RNN: Recurrent neural network
ROC: Receiver operating characteristic
SVD: Singular value decomposition
SVM: Support vector machine
TFIDF: Term frequency–inverse document frequency
TN: True negative
TP: True positive
TPR: True positive rate
VAE: Variational auto encoder
1 Introduction
Artificial intelligence (AI) is a formal framework of system representation and reasoning that encompasses inductive and deductive reasoning methodologies for problem formulation and solution design (Fig. 1) (Russel and Norvig, 2003). The newly emerged stream of study, data science, refers to the application of techniques and tools from statistics, optimization, and logic theories to ordered or unordered collections of data. The field has acquired a number of key terms, and the terminology is slowly growing based on which technique is prominently used and what type of data it is applied to. Coming to the basic definitions, data is a time-stamped fact, albeit noisy, recorded by a sensor in the context of a process of a system under study. Each datum is effectively represented as a finite number sequence corresponding to a semantic in the physical world. The science aspect of data science is all about the symbolic representation of data, mathematical and logical operations over the same, and relating the findings back to physical world scenarios. On the implementation front, the engineering systems that store and retrieve data at scale, both in terms of size and frequency, are referred to as Big Data systems. The nature of the input data is closely tied to its context, field of study, use case scenario, and discipline. The word data is highly analogous to the word signal in the field of signal processing, so that the discipline of data science inherits a majority of the techniques of signal processing, with one popular example being machine learning methodology.
Some of the characteristics of the data and the noise thereof pertain to set theoretic notions, the nature of the data sources, and popular domain categories. The characteristics include missing values, ordered or unordered lists of variable lengths, and homogeneous or heterogeneous groups of data elements. The sources that generate data include raw streams and carefully engineered feature transformations. Some of the popular domain categories of data include numeric, text, image, audio, and video. The word numeric is an umbrella term, and it includes any type of communication between state-of-the-art computer systems. The techniques that operate on data include statistical treatment, optimization formulation, and automatic logical or symbolic manipulation. For numeric data, the techniques essentially deduce a mapping function that optimally maps a given input to the output, both represented as number sequences, typically of different sizes. The statistical approaches used here mainly fall into probabilistic generative and discriminative methodologies. The optimization techniques used mainly involve discrete and continuous state space representation and error minimization. The logical manipulation techniques involve determining rules and deducing steps to prove or disprove assertions.
This chapter, as it belongs to a broader theme of practices and principles for data science, elucidates the process of mapping a given problem statement to a quantified assertion driven by data. The chapter focuses on aspects of machine learning algorithms, applications, and practices. The mapping process first identifies characteristics of the data and the noise, then defines and applies mathematical operations on the data, and finally relates the inferred findings back to the given problem scenario and its context.
FIG. 1 Artificial intelligence framework. A real-world problem or task is represented as a symbolic-world problem or computer program composed of entities (e.g., data structures) and operations (e.g., algorithms). Encoding is the first step: identifying and formulating a problem statement. Experiments, inferences, and metrics are iterated at much lower cost in the symbolic world. The inferences are related back to the real world through decoding. A real-world problem is itself defined recursively: it is decomposed into multiple subproblems, which may themselves be simulations, and their results are synthesized for a holistic inference. Diagram labels: a task in the real world; encoding; a task in the symbolic world; decoding; representation (entities and operators); problem decomposition and synthesis; real/simulation subtasks; learn/adapt; experiments (analytical, simulation); inferences; error metrics.
Most of the popular and much needed categories of techniques as of today are covered in good depth in the light of data science aspects within the scope of this chapter, while any domain-specific engineering aspects are referred to appropriate external content. The machine learning approaches covered here include discriminative and generative modeling methodologies such as supervised, unsupervised, and deep learning algorithms. The data characterization topics include practices for handling missing values, resolving class imbalance, vector encoding, and data transformations. The supervised learning algorithms covered include decision trees (DT), logistic regression, ensembles of classifiers including random forest and gradient boosted trees, neural networks, support vector machines (SVM), the naive Bayes classifier, and Bayesian logistic regression. The chapter includes standard model accuracy metrics, the ROC curve, the bias-variance trade-off, cross validation (CV), and regularization aspects. The deep learning algorithms covered include autoencoders, CNN, RNN, and LSTM methods. Unsupervised mechanisms such as different types of clustering, the special category of reinforcement learning methodologies, and learning using EM in probabilistic generative models including GMM, HMM, and LDA are also discussed. As industry places heavy emphasis on model maintenance, techniques involving setting alarms for data distribution differences and retraining, transfer learning, and active learning methodologies are described. A cursory description of symbolic representation and reasoning in the AI topics, including state space representation and search and first-order logic (unification and deduction), is also presented, with references to external texts that handle the topic in full depth. The workings of the algorithms are also explained through case studies on popular data sets as of today, with references to code implementations, libraries, and frameworks. A brief introduction to currently researched topics, including automatic machine learning model building, model explainability, and visualization, is also covered in the chapter. Finally, the chapter concludes with an overview of the Big Data frameworks for distributed data warehousing systems available today, with examples of how data science algorithms use the frameworks to work at scale.
2 Supervised methods
Supervised learning methods are a class of machine learning algorithms in which known correspondences between data and expected outcomes are provided as examples (Duda et al., 2001; Bishop, 2008; Ng, 2012). The examples constitute a data set, which is also referred to as ground truth or gold standard. The input and output are both in the form of vectors. The problem of supervised learning can be expressed mathematically as in Eq. (1), where a data point in $d$-dimensional space is mapped to a data point in $m$-dimensional space. The primary aspect of supervised learning is that a known set of such correspondences is given. The task is to learn the mapping function $M$. The mapping need not be one-to-one; it can be many-to-one as well. The known mappings are provided as a data set $S$ (Eq. 2).
$$M : \mathbb{R}^d \rightarrow \mathbb{R}^m \tag{1}$$
$$S = \{(x_1, y_1), \ldots, (x_N, y_N) \mid x_i \in \mathbb{R}^d,\ y_i \in \mathbb{R}^m\} \tag{2}$$
We fundamentally need a goodness metric for assessing the quality of the mapping function. The goodness metric is formulated as a loss (Eq. 3), where $L(\cdot, \cdot)$ is a loss function.
$$M = \arg\min_{M} L(M(x_i), y_i) \tag{3}$$
The task of determining a minimizing $M$ is computationally hard to express unless we bring in simplifying assumptions. We therefore parameterize the mapping function $M$ in terms of certain parameters $\Theta$ and minimize over the parameters (Eq. 4).
$$\Theta = \arg\min_{\Theta} L(M(\Theta, x_i), y_i) \tag{4}$$
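A minimal sketch of Eq. (4), assuming the simplest possible parameterization $M(\Theta, x) = \Theta$ (a constant prediction) and a brute-force search over a grid of candidate $\Theta$ values, may make the role of the loss function concrete. The target values, the grid, and the function names are illustrative assumptions, not part of the chapter.

```python
import numpy as np

# Toy targets y_i; the values are made up for illustration.
y = np.array([1.0, 2.0, 2.0, 3.0, 10.0])

def mse(theta):
    """Mean squared error of the constant model M(theta, x) = theta."""
    return np.mean((theta - y) ** 2)

def mae(theta):
    """Mean absolute error of the same constant model."""
    return np.mean(np.abs(theta - y))

# Exhaustive search over candidate parameters (a stand-in for arg min).
candidates = np.linspace(0.0, 10.0, 1001)
theta_mse = candidates[np.argmin([mse(t) for t in candidates])]
theta_mae = candidates[np.argmin([mae(t) for t in candidates])]
print(theta_mse, theta_mae)
```

On this toy data the squared-error minimizer sits at the mean of the targets and the absolute-error minimizer at their median, which shows that the choice of $L$ changes which $\Theta$ is considered optimal.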
Based on the semantics and form of the function $M$ and the parameters $\Theta$, different classes of learning algorithms have come into existence. The two primary types of supervised methods are those in which (i) $M(\cdot, \cdot)$ is a differentiable function and (ii) $M(\cdot, \cdot)$ is not a differentiable function. The $\Theta$ values may be learned over iterations by a gradient descent formulation or through a discrete mechanism of hill climbing. For instance, as we will see in the sections to come, parameter improvement in the case of a DT consists of choosing the right split every time a new subtree is constructed. In the case of parameter update by gradient descent, the function needs to be differentiable, and the parameters are changed in the direction in which the loss function decreases. The loss function $L(\cdot, \cdot)$ computes an error value for the deviation between the actual $y_i$ vector and the predicted $M(\Theta, x_i)$ vector. Examples of loss functions are tabulated in Table 1. The loss functions for single-dimensional points are illustrated in Fig. 2. The logistic and cross entropy loss functions require a probability of prediction; the probability that the $j$th element of the predicted output vector is on is given by
$$p_i[j] = \frac{e^{M(\Theta, x_i)[j]}}{\sum_{k=1}^{n} e^{M(\Theta, x_i)[k]}}$$
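The probability formula above is the familiar softmax over the elements of the predicted output vector. A minimal NumPy sketch follows; the score vector is hypothetical, and subtracting the maximum score before exponentiating is a standard numerical safeguard added here, not something the formula itself requires.

```python
import numpy as np

def softmax(scores):
    """Return p[j] = exp(scores[j]) / sum_k exp(scores[k])."""
    # Shifting by the maximum leaves the ratio unchanged and avoids overflow.
    shifted = scores - np.max(scores)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()

# Hypothetical output vector M(Theta, x_i) with n = 3 elements.
scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p, p.sum())
```

The entries of `p` are positive and sum to one, so they can be read as the probabilities of each output element being on.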
TABLE 1 Examples of loss functions.
Loss function: Description
Mean squared error (MSE): $\frac{1}{|X|} \sum_{x_i \in X} \| M(\Theta, x_i) - y_i \|^2$
Mean absolute error (MAE): $\frac{1}{|X|} \sum_{x_i \in X} \| M(\Theta, x_i) - y_i \|$
Hinge loss: $\frac{1}{|X|} \sum_{x_i \in X} \max\{ 1 - (2 y_i - 1) \cdot (2 M(\Theta, x_i) - 1),\ 0 \}$
Logistic loss: $\frac{1}{|X|} \sum_{x_i \in X} \left( (2 y_i - 1) \cdot p_i \right)$
Cross entropy loss: $-\frac{1}{|X|} \sum_{x_i \in X} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)$
Information gain loss: $\frac{1}{|X|} \sum_{x_i \in X} \left( y_i \log(y_i) - y_i \log(p_i) \right)$
FIG. 2 Five types of convex error curves: hinge, logistic, squared, absolute (L1), and a custom loss function. The horizontal axis is the predicted value ($y$) and the vertical axis is the loss value $L(t = 1, y)$. Let $t$ be the true value and $y$ be the predicted value. The curves are shown for the true value $t = 1$. The squared loss, $(t - y)^2$, shown in red, has a global minimum at 1. The hinge loss, $\max(0, 1 - y \cdot t)$, shown in green, has a bend at 1. The logistic loss, $\log(1 + e^{-t \cdot y})$, shown in blue, penalizes a mismatch in the sign of the prediction and the true value. The L1 loss, $|y - t|$, shown in light green, is convex but not differentiable at its kink at 1. The custom loss, $e^{|y - t|} - 1$, shown in pink, has a kink at 1 as well.
2.1 Data sets
The supervised learning methodology requires a data set to operate. Though unsupervised learning problems also require a data set, there is a stark difference: in supervised methods each example is an (input, output) pair, and the output is typically of fixed dimension. Input and output pairs also occur in sequence-to-sequence models; in that case, however, the input and output are not required to be of fixed size. There are a number of publicly available data sets for supervised learning methods, of varying complexity and characteristics. Each data set challenges particular methods: some methods work well on it and others do not. Some of the data sets are real and some are synthetic. The scikit-learn library provides a number of synthetic data sets, some of which are shown in Table 2.
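The synthetic data sets named in Table 2 correspond to generator functions in `sklearn.datasets`. The following sketch shows one way to call them; the sample sizes, noise levels, and random seeds are arbitrary choices made here for illustration.

```python
from sklearn.datasets import (
    make_blobs,
    make_circles,
    make_classification,
    make_moons,
)

# Concentric circles, one class per circle.
X_circles, y_circles = make_circles(n_samples=100, noise=0.05, factor=0.5, random_state=0)

# Two interleaving half circles ("moons"), one class per moon.
X_moons, y_moons = make_moons(n_samples=100, noise=0.1, random_state=0)

# Globular clusters of points around three centers.
X_blobs, y_blobs = make_blobs(n_samples=100, centers=3, n_features=2, random_state=0)

# Generic classification data in a multidimensional feature space.
X_clf, y_clf = make_classification(
    n_samples=100, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

for name, X in [("circles", X_circles), ("moons", X_moons),
                ("blobs", X_blobs), ("classification", X_clf)]:
    print(name, X.shape)
```

Each call returns a feature matrix and a label vector, that is, the (input, output) pairs that a supervised method expects.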
TABLE 2 Synthetic data sets.
Name: Description
Circles: Concentric circles, where each circle is of a particular class
Moons: Yin-yang type of semioverlapping concentric moons, where each moon is of a particular class (Fig. 3)
Blobs: Globular clusters of data
Classification: Classification type of data where points are generated in multidimensional space
FIG. 3 An example of the moons data set, shown here for 100 points, where 50% of them are of one class (red) and the remaining 50% are of the other class (blue). The points are generated along the two halves of a circle, and random Gaussian noise is added. The upper and the bottom half circles are filled with points of the two classes. The two half circles are then cut apart, shifted laterally, and pushed inwards, resulting in a complex intertwining of the points that makes a challenging data set for any classifier.
2.2 Linear regression
Linear regression is the fundamental regression algorithm, in which we need to predict the output $y$ coordinate from the input $x$. Imagine the scenario where there are $N$ data points in one dimension (i.e., the number of features is just one). Each data point has a corresponding $y$ coordinate. The task is to predict, for a given input $x$, what the $y$ coordinate could be. There are several possible ways in which the given task can be accomplished.
• Let the data set be $D = \{(x_i, y_i) \mid \forall i \in [1 \ldots N]\}$.
• Let $y = H(b, x) = b$, where $b$ is a constant number that we will learn.
• We need to define an error function and compute the error for each input and the cumulative error over all points in the data set.
• Let the error be a function of the model parameters, here $E(b) = \sum_{i=1}^{N} (y_i - H(b, x_i))^2$. This is also called the squared error.
The task is to identify a function H that minimizes the error. It is difficult to directly operate on the space of functions, although it is easier to determine the best operating parameters of a given function. Therefore the problem is recast as identifying the model parameters that minimize the error.
The solution is to start with some value of $b$ and, over iterations, move the value along the negative of the gradient direction. The gradient is computed as in Eq. (5).
$$E(b) = \sum_{i=1}^{N} (y_i - H(b, x_i))^2 = \sum_{i=1}^{N} (y_i - b)^2$$
$$b = \arg\min_{b} E(b)$$
$$b_{new} \leftarrow b_{old} - \nabla E(b)\big|_{b = b_{old}}$$
$$\nabla E(b) = \frac{\partial E(b)}{\partial b} = \sum_{i=1}^{N} 2\,(y_i - b) \cdot (-1) \tag{5}$$
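The update rule of Eq. (5) can be sketched in a few lines of Python. The data values are made up, and the step size `alpha` is an assumption introduced here: the update as written above takes a full gradient step, which would overshoot on this toy data, so the gradient is scaled before it is subtracted.

```python
import numpy as np

y = np.array([2.0, 4.0, 6.0, 8.0])  # made-up y coordinates

b = 0.0        # initial guess
alpha = 0.1    # step size (an assumption; see the note above)
for _ in range(200):
    # Gradient of E(b) = sum_i (y_i - b)^2, as in Eq. (5).
    gradient = np.sum(2.0 * (y - b) * (-1.0))
    b = b - alpha * gradient

print(b, y.mean())
```

On this data the iterate settles at the mean of the $y$ coordinates.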
Let us take a look at the analytical solution for determining $b$, although it is not practically feasible to write one down when dealing with thousands of dimensions. Setting the gradient (Eq. 5) to zero yields the optimal $b$ value analytically.
$$\nabla E(b) = 0 \;\Rightarrow\; \sum_{i=1}^{N} 2\,(y_i - b) \cdot (-1) = 0 \;\Rightarrow\; b = \frac{1}{N} \sum_{i=1}^{N} y_i$$
As we can see, it is simply the mean value of the $y$ coordinates when the error function is the squared error. The problem of determining an optimal $b$ is the same as finding the horizontal line that best passes through all the points. The problem complexity can be slightly increased by fitting a line that has a slope as well. The task of fitting a line with a slope reduces to identifying the pair of variables $(m, c)$ that together minimize the total error, as in the previous case. Minimizing corresponds to moving in the negative direction of the gradient vector (Eq. 6).
$$y = H(m, c, x) = m x + c$$
$$E(m, c) = \sum_{i=1}^{N} (y_i - H(m, c, x_i))^2 = \sum_{i=1}^{N} (y_i - (m x_i + c))^2$$
$$(m, c) = \arg\min_{(m', c')} E(m', c')$$
$$\nabla E(m, c) = \left( \frac{\partial E(m, c)}{\partial m},\ \frac{\partial E(m, c)}{\partial c} \right) \tag{6}$$
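The weight update equation for two variables now is
$$(m, c)_{new} = (m, c)_{old} - \nabla E(m, c)\big|_{(m, c) = (m, c)_{old}}$$
Though determining the analytical solution for inputs involving several hundreds or thousands of features is not practically feasible, for the purpose of understanding the concept we can take a look at the analytical solution. Equating the gradient (Eq. 6) to zero, we get the following.
$$\nabla E(m, c) = 0 \;\Rightarrow\; \left( \frac{\partial E(m, c)}{\partial m},\ \frac{\partial E(m, c)}{\partial c} \right) = 0$$
$$\Rightarrow\; \frac{\partial E(m, c)}{\partial m} = 0 \tag{7}$$
$$\Rightarrow\; \frac{\partial E(m, c)}{\partial c} = 0 \tag{8}$$
Consider Eq. (7),
$$\frac{\partial E(m, c)}{\partial m} = 0 \;\Rightarrow\; -2 \sum_{i=1}^{N} (y_i - (m x_i + c)) \cdot x_i = 0$$
$$\Rightarrow\; m = \frac{\sum_{i=1}^{N} (y_i - c) \cdot x_i}{\sum_{i=1}^{N} x_i \cdot x_i} \tag{9}$$
Consider Eq. (8),
$$\frac{\partial E(m, c)}{\partial c} = 0 \;\Rightarrow\; -2 \sum_{i=1}^{N} (y_i - (m x_i + c)) \cdot 1 = 0$$
$$\Rightarrow\; c = \frac{1}{N} \sum_{i=1}^{N} (y_i - m x_i) \tag{10}$$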
As we see, Eq. (9), which determines $m$, involves $c$, and vice versa, Eq. (10), which determines $c$, depends on $m$. This cyclic dependency is to be expected in the case of multivariate input, i.e., input having many features. An iterative process that starts with an initial setting of $m$ and $c$ and updates successive values converges to an optimal solution, provided the steps taken are sufficiently small. Gradient descent on the $(m, c)$ vector starts with initial values of $m$ and $c$ and converges to an optimal solution over successive iterations. If we consider the partial derivatives of the $\nabla E(m, c)$ vector with respect to $m$ and $c$, we can observe that they are constants that do not depend on $(m, c)$ and that form a positive semidefinite matrix (exercise to the reader). This asserts the fact that the squared error loss function is convex, so that an $(m, c)$ vector satisfying Eqs. (9) and (10) is a global minimum.
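The following sketch runs gradient descent on the $(m, c)$ vector and compares the result with a least-squares fit from `numpy.polyfit`. The synthetic data (a line with slope 3 and intercept 1 plus Gaussian noise), the step size `alpha`, and the iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.size)  # synthetic noisy line

m, c = 0.0, 0.0
alpha = 0.01  # step size, chosen small enough for this data (an assumption)
for _ in range(5000):
    residual = y - (m * x + c)
    grad_m = -2.0 * np.sum(residual * x)  # dE/dm
    grad_c = -2.0 * np.sum(residual)      # dE/dc
    m, c = m - alpha * grad_m, c - alpha * grad_c

# Reference: closed-form least-squares line (slope first, then intercept).
m_ls, c_ls = np.polyfit(x, y, 1)
print(m, c, m_ls, c_ls)
```

At convergence the pair $(m, c)$ also satisfies the mutually dependent relations in Eqs. (9) and (10), which can be checked by substituting one value into the other.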
2.2.1 Polynomial fitting
Linear regression may be applied to fit higher order curves. Consider the same data set of 2D points, $D$, as in the previous section. Convert each point to a degree $d$ polynomial by a feature transformation,
$$D_d = \{(x'_i, y_i) \mid x'_i = [x_i^0, \ldots, x_i^d]\}$$
Now the learning task becomes identifying the coefficients for each of the powers. Consider an array of coefficients (to be learned),
$$\Theta = [a_0, \ldots, a_d]$$
The predicted $\hat{y}_i$ is the product of these coefficients with the transformed input coordinate,
$$(\forall x'_i \in D_d) : \hat{y}_i = M(\Theta, x'_i) = \sum_{k=0}^{d} a_k x_i^k$$
The polynomial fitting problem now corresponds to reducing the loss between the actual and the predicted $y$ coordinates. Using the least squares loss function, the loss and its gradient with respect to $\Theta$ become
$$L(\Theta, D_d) = \sum_{x'_i \in D_d} \| y_i - \hat{y}_i \|^2 = \sum_{x'_i \in D_d} \| M(\Theta, x'_i) - y_i \|^2$$
$$\nabla_\Theta L = \sum_{x'_i \in D_d} \left[ 2\,(M(\Theta, x) - y_i)^T\, \nabla_\Theta M(\Theta, x) \right]_{x = x'_i}$$
$$\nabla_\Theta M(\Theta, x) = \begin{bmatrix} \frac{\partial M(\Theta, x)}{\partial a_0} \\ \vdots \\ \frac{\partial M(\Theta, x)}{\partial a_d} \end{bmatrix} = \begin{bmatrix} x^0 \\ \vdots \\ x^d \end{bmatrix}$$
An example of polynomial fitting with degrees 1, 5, and 15 on a sinusoidal function with added noise is presented in Fig. 4. The figure depicts the data points as green dots, the true function as a dashed green curve, and the fitted curves in blue, navy, and red. One can observe that as the degree increases, the curve becomes wiggly and overfits the points. When the degree is smaller than required, the curve exhibits a general shift in its position, also termed bias. There exists an optimal parameter setting for a given modeling procedure (here, polynomial fitting), and the parameter here is the degree.
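A minimal sketch of polynomial fitting by ridge regression is given below. It uses a noisy sinusoid as a simplified stand-in for the data of Fig. 4, rescales $x$ to $[-1, 1]$ before forming powers to keep the linear system well conditioned, and solves the regularized normal equations directly; these choices, and the helper name `ridge_polyfit`, are assumptions made for illustration.

```python
import numpy as np

def ridge_polyfit(x, y, degree, lam=1.0):
    """Fit coefficients a_0..a_d minimizing ||y - X a||^2 + lam * ||a||^2."""
    # Feature transformation: row i of X is [x_i^0, ..., x_i^degree].
    X = np.vander(x, degree + 1, increasing=True)
    a = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
    objective = np.sum((y - X @ a) ** 2) + lam * np.sum(a ** 2)
    return a, objective

rng = np.random.default_rng(0)
x_raw = np.linspace(0.0, 10.0, 60)
y = np.sin(x_raw) + rng.uniform(-0.5, 0.5, size=x_raw.size)  # noisy sinusoid
x = 2.0 * x_raw / 10.0 - 1.0  # rescale inputs to [-1, 1]

objectives = {d: ridge_polyfit(x, y, d)[1] for d in (1, 5, 15)}
for degree, value in objectives.items():
    print(degree, round(value, 3))
```

Because the lower-degree feature sets are nested inside the higher-degree ones, the regularized training objective cannot increase with the degree; whether the extra flexibility helps on unseen points is a separate question that the training objective does not answer.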
FIG. 4 Polynomial fitting. An underlying function $y = \sin(x) + e^{-x}$ is used for generation of the data. The original function is shown as a dashed (‘- -’) green curve. Random noise from a uniform distribution, $\xi \sim U(-0.5, 0.5)$, is added to the curve. A data set $\{(x, y + \xi)\}$ is constructed and is shown as green dots. Polynomials are fitted using ridge regression with the loss function $L(w) = \|y - X^T w\|^2 + \|w\|^2$ for degrees 1, 5, and 15 and are shown in blue, navy, and red, respectively. The illustration aims to show the overfitting nature of the higher degree polynomial, in red.
2.2.2 Thresholding and linear regression
Though linear regression is a plain real-valued attribute prediction problem, it can be converted to a classification problem by imposing a threshold. However, the threshold has to be learned from the data instead of being fixed manually. This would be the case when the set of points represents a boundary or gulf region between two (or more) classes of points. Applying a threshold to the linear regression output then places a point in one of the buckets surrounding the gulf region of points over which the regression problem is solved. However, more interpretable and sophisticated methodologies such as logistic regression, SVM, DT, and other formulations are available in place of converting a linear regression problem to a classification problem. Surprisingly, these formulations still utilize gradient descent in the case of differentiable loss functions, and in a philosophical sense what is solved is also a regression. Caution should be exercised not to get confused just because gradient descent is deployed in all these formulations; the intuitions behind the approaches differ between regression and classification problems. In the case of neural networks, one often finds regression problems being solved for each of the output classes represented by the neurons in the final layer of the network. The outputs are then compared to find the maximally scoring neuron, using a softmax classifier to detect the predicted class. In this case, regression scores are used for comparison and are converted to solve a classification problem. However, it is recommended to treat a neural network as a separate problem formulation from plain regression, though gradient descent is deployed in both places.
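One simple way to learn a threshold from data, sketched below, is to fit a least-squares line to the 0/1 labels and then pick the cut on the regression score that maximizes training accuracy. The one-dimensional Gaussian data and the accuracy-maximizing search are assumptions made for illustration; the chapter does not prescribe a specific procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two made-up classes on a line, centered at -2 and +2.
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
labels = np.concatenate([np.zeros(50), np.ones(50)])

# Plain linear regression of the labels on x.
m, c = np.polyfit(x, labels, 1)
scores = m * x + c

def accuracy(threshold):
    """Fraction of points whose thresholded score matches the label."""
    predicted = (scores >= threshold).astype(float)
    return np.mean(predicted == labels)

# Learn the threshold: try every observed score as a candidate cut.
best_threshold = max(np.sort(scores), key=accuracy)
print(best_threshold, accuracy(best_threshold), accuracy(0.5))
```

The learned cut is at least as accurate on the training data as the manually fixed value of 0.5, since every distinct way of splitting the sorted scores is among the candidates.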
2.3 Logistic regression
Logistic regression is one of the fundamental classification algorithms, in which the log odds in favor of one of the classes is defined and maximized via a weight vector. In contrast to linear regression, where $w \cdot x$ is used directly to predict the $y$ coordinate, in the logistic regression formulation $w \cdot x$ is defined as the log odds in favor of the predicted class being 1. Apart from this interpretation of the $w \cdot x$ values, the rest of the formulation depends on regression of the $w$ vector to minimize a logit loss function, hence the name logistic regression. It is essentially a classification algorithm, though the word “regression” is present. Let $y' \in \{0, 1\}$ be the actual label of a data point in a two-class problem. Let $D = \{(x', y')\}$ be the data set of points and, over all $(x', y') \in D$, let $D_1 = \{(x', 1) \mid y' = 1\}$ and $D_0 = \{(x', 0) \mid y' = 0\}$. Let $y$ denote the predicted class. Now, let us define $w \cdot x = \log\left(\frac{P(y = 1 \mid w, x)}{P(y = 0 \mid w, x)}\right)$, where $P(y = 1 \mid w, x)$ denotes the probability of the predicted class being 1 given the $w$ and $x$ vectors. With this setting, the logistic regression weight update equations are derived as below.
$$P(y = 0 \mid w, x) = 1 - P(y = 1 \mid w, x)$$
$$w \cdot x = \log\left( \frac{P(y = 1 \mid w, x)}{P(y = 0 \mid w, x)} \right) \;\Rightarrow\; P(y = 1 \mid w, x) = \frac{1}{1 + e^{-w \cdot x}}$$
$$P(D \mid w) = P(D_1 \mid w) \cdot P(D_0 \mid w) \quad \text{(i.i.d.)}$$
$$= \prod_{(x_1, y_1) \in D_1} P(y = 1 \mid w, x_1) \cdot \prod_{(x_0, y_0) \in D_0} P(y = 0 \mid w, x_0)$$
$$= \prod_{(x, y) \in D} P(y = 1 \mid w, x)^{y} \cdot P(y = 0 \mid w, x)^{(1 - y)}$$
$$\Rightarrow\; \log P(D \mid w) = \sum_{(x, y) \in D} \left( y \log P(y = 1 \mid w, x) + (1 - y) \log P(y = 0 \mid w, x) \right)$$
$$= \sum_{(x, y) \in D} \left( y \log P(y = 1 \mid w, x) + (1 - y) \log\left(1 - P(y = 1 \mid w, x)\right) \right)$$
$$= \sum_{(x, y) \in D} \left( y \log \frac{1}{1 + e^{-w \cdot x}} + (1 - y) \log\left( 1 - \frac{1}{1 + e^{-w \cdot x}} \right) \right)$$
$$= \sum_{(x, y) \in D} \left( y \log \frac{e^{w \cdot x}}{1 + e^{w \cdot x}} + (1 - y) \log \frac{1}{1 + e^{w \cdot x}} \right)$$
$$= \sum_{(x, y) \in D} \left( y\,(w \cdot x) - y \log(1 + e^{w \cdot x}) - (1 - y) \log(1 + e^{w \cdot x}) \right)$$
$$= \sum_{(x, y) \in D} \left( y\,(w \cdot x) - \log(1 + e^{w \cdot x}) \right)$$
$$L(w) = -\log P(D \mid w) = -\sum_{(x, y) \in D} \left( y\,(w \cdot x) - \log(1 + e^{w \cdot x}) \right) \tag{11}$$
In logistic regression, we need to determine the $w$ vector such that the probability of the data is maximized (Eq. 11). The per-point interpretation of the loss function is given by Eq. (12).
$$L(w, x', y') = -\log P(y = y' \mid w, x') = \log(1 + e^{w \cdot x'}) - y'\,(w \cdot x') \tag{12}$$
The optimal value is obtained by taking the derivative of the loss function with respect to $w$, which is the gradient, and iterating by moving in the direction of the negative gradient, calculated in each step (Eq. 13).
$$w^{*} = \arg\max_{w} P(D \mid w) = \arg\min_{w} -\log P(D \mid w) = \arg\min_{w} L(w)$$
$$w_{new} = w_{old} - \alpha\, \nabla L(w)\big|_{w = w_{old}} \tag{13}$$
Here $\nabla L(w) = -\sum_{(x, y) \in D} \left( y - \frac{1}{1 + e^{-w \cdot x}} \right) \odot x$. The second derivative of the loss function, or the gradient of the gradient, results in the Hessian matrix (Eq. 14). The outer product $x x^T$ ensures positive semidefiniteness of the Hessian.
$$\frac{\partial \nabla L(w)}{\partial w} = \sum_{(x, y) \in D} \frac{e^{-w \cdot x}}{(1 + e^{-w \cdot x})^2}\, x x^T \succeq 0 \tag{14}$$
Consider a vector $v$ in the same $d$-dimensional vector space as the input points, $\mathbb{R}^{d}$:

\[
(\forall v \in \mathbb{R}^{d}) : \; v^{T} (x x^{T})\, v = (v^{T} x)(x^{T} v) = (v^{T} x)^{2} \ge 0
\]

There is an interesting simplified form of the logistic regression formulation when $y \in \{-1, +1\}$ instead of $\{0, 1\}$. Consider the loss function (Eq. 12) for the two scenarios $y = 0$ and $y = 1$.

\[
L(w, x, y) = \log(1 + e^{w \cdot x}) - y\,(w \cdot x) \tag{15}
\]

\[
y = 0 \Rightarrow L(w, x, 0) = \log(1 + e^{w \cdot x}) \tag{16}
\]

\[
y = 1 \Rightarrow L(w, x, 1) = \log(1 + e^{w \cdot x}) - (w \cdot x) = \log(1 + e^{w \cdot x}) - \log(e^{w \cdot x}) = \log(1 + e^{-w \cdot x}) \tag{17}
\]

\[
\Rightarrow L(w, x, \beta) = \log(1 + e^{-\beta\,(w \cdot x)}) \tag{18}
\]

Eq. (18) is obtained by converting the label $y \in \{0, 1\}$ to $\beta = 2y - 1 \in \{-1, +1\}$. For instance, fitting a logistic regression classifier to a data set of the moons type is shown in Fig. 5. The shades of red indicate the probability of each pixel in the image region belonging to the red class according to the trained classifier. The shades of blue indicate the probability of each pixel in the image region belonging to the blue class according to the trained classifier. The multiclass classification scenario is more advanced than the binary scenario and is presented in the subsequent sections.
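The gradient update of Eq. (13) and the equivalence of the two loss forms, Eqs. (12) and (18), can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the toy data set, the appended constant bias feature, the step size `alpha` and the names `fit_logistic`, `loss01` and `loss_pm1` are all hypothetical.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def loss01(w, x, y):
    """Eq. (12): log(1 + e^{w.x}) - y (w.x), with y in {0, 1}."""
    z = dot(w, x)
    return math.log1p(math.exp(z)) - y * z

def loss_pm1(w, x, beta):
    """Eq. (18): log(1 + e^{-beta (w.x)}), with beta in {-1, +1}."""
    return math.log1p(math.exp(-beta * dot(w, x)))

def fit_logistic(data, alpha=0.1, steps=500):
    """Batch gradient descent: w_new = w_old - alpha * grad L(w_old)."""
    d = len(data[0][0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in data:
            p = sigmoid(dot(w, x))          # P(y = 1 | w, x)
            for k in range(d):
                grad[k] += (p - y) * x[k]   # equals -(y - p) * x
        w = [wi - alpha * g for wi, g in zip(w, grad)]
    return w

# Hypothetical 1-D toy data; the trailing 1.0 is a bias feature (an assumption).
toy = [([-2.0, 1.0], 0), ([-1.0, 1.0], 0), ([-0.5, 1.0], 1),
       ([0.5, 1.0], 0), ([1.0, 1.0], 1), ([2.0, 1.0], 1)]
w_fit = fit_logistic(toy)
total_loss = sum(loss01(w_fit, x, y) for x, y in toy)
```

The toy labels overlap on purpose, so the weights stay finite; on perfectly separable data the unregularized weights would keep growing.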
FIG. 5 Separation of two classes of data points in the moons data set using a degree one polynomial (line). The colors in the 2D plot range from red to blue in proportion to the confidence score of the respective colored classes. The colors in the ambiguous regions are light red and light blue. The farther a region is from the intersection region of the two classes, the darker the colors.
2.4 Support vector machine: linear kernel

A linear kernel SVM (Cortes and Vapnik, 1995; van den Burg and Groenen, 2016) is a formulation that separates two classes of points by a margin. A margin is a region bounded by two parallel hyperplanes (a hyperplane being the generalization of a plane to more than three dimensions). The margin region between the two separating hyperplanes should not contain any points. In the soft margin formulation, points falling in the margin are given a penalty. The data points lying on the margin boundaries are called support vectors. The data points on either side of the separating hyperplanes belong to the two classes, although certain points may be wrongly classified, contributing to the error value.

The first step in the formulation of SVM is to label the two classes with $+1$ and $-1$. The formulation then determines a vector $w$ such that $w \cdot x \ge +1$ for the $+1$ class of points and $w \cdot x \le -1$ for the other class of points. Let

\[
D = \{(x, y) \mid y \in \{-1, +1\}\}, \qquad D_{-} = \{(x, y) \mid y = -1\}, \qquad D_{+} = \{(x, y) \mid y = +1\}
\]

Combining the two inequalities for the $+1$ and $-1$ classes,

\[
\begin{aligned}
w \cdot x &\ge +1 \quad (\forall (x, y) \in D_{+} \rightarrow y = +1) \\
w \cdot x &\le -1 \quad (\forall (x, y) \in D_{-} \rightarrow y = -1) \\
\Rightarrow (\forall (x, y) \in D) &: \; y\,(w \cdot x) \ge 1
\end{aligned}
\]
TABLE 3 ξ interpretations.

| ξ value | Interpretation |
|---|---|
| $\xi \le 0$ | Points are correctly classified |
| $0 < \xi$ … | … |

\[
(\ldots > 0) : \; y_i\,(w \cdot x_i) \ge 1 - \xi_i \tag{20}
\]
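Take two points $A_{+}$ and $B_{+}$ on the $+1$ hyperplane. Then

\[
w \cdot A_{+} = +1, \quad w \cdot B_{+} = +1 \;\Rightarrow\; w \cdot (A_{+} - B_{+}) = 0 \;\Rightarrow\; w \perp \text{hyperplane}
\]

Consider two points along the $w$ vector, one point $A_{+}$ on the $+1$ hyperplane and another point $A_{-}$ on the $-1$ hyperplane. The size of the margin is governed by the scalar $\lambda$ that relates these two points along the $w$ vector, given in Eq. (21).

\[
\begin{aligned}
A_{+} - A_{-} &= \lambda w \\
w \cdot A_{+} &= +1 \\
w \cdot A_{-} &= -1 \\
w \cdot (A_{+} - A_{-}) &= +1 - (-1) = 2 \\
w \cdot (\lambda w) &= 2 \\
\lambda \, \|w\|^{2} &= 2
\end{aligned}
\]

\[
\lambda = \frac{2}{\|w\|^{2}} \tag{21}
\]

The geometric distance between the two hyperplanes is $\|\lambda w\| = 2 / \|w\|$, so maximizing $\lambda$ is equivalent to maximizing the width of the margin.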
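Identifying the margin resulting in maximal separation of the two classes corresponds to Eq. (22).

\[
\begin{aligned}
w^{*} &= \arg\max_{w} \lambda = \arg\min_{w} \frac{\|w\|^{2}}{2} \\
w^{*} &= \arg\min_{w} \frac{\|w\|^{2}}{2} \quad \text{such that } (\forall i \in [1 \ldots N]) : \; y_i\,(w \cdot x_i) \ge 1 - \xi_i
\end{aligned} \tag{22}
\]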
The error terms $\xi_i$ can be added to the optimization function with constraints in Eq. (22) to assume the form of Eq. (23).

\[
w^{*} = \arg\min_{w} \Big\{ \frac{\|w\|^{2}}{2} + \sum_{i=1}^{N} \xi_i \Big\} \tag{23}
\]
However, in this form, the positive and negative errors cancel each other out as in Eq. (24), while it is required to penalize only the entities with $\xi_j > 0$. In order to focus on the correction of errors, a hinge loss function is introduced into the optimization function in Eq. (22) to assume the form of Eq. (25).

\[
(\exists i, j \in [1 \ldots N]) : \; \xi_i < 0 \wedge \xi_j > 0 \;\Rightarrow\; \xi_i + \xi_j < \xi_j \tag{24}
\]

\[
\begin{aligned}
w^{*} &= \arg\min_{w} \Big\{ \frac{\|w\|^{2}}{2} + \sum_{i=1}^{N} \max(0, \xi_i) \Big\} \\
&= \arg\min_{w} \Big\{ \frac{\|w\|^{2}}{2} + \sum_{i=1}^{N} \max\big(0, \, 1 - y_i\,(w \cdot x_i)\big) \Big\}
\end{aligned} \tag{25}
\]
The more advanced forms of SVM are based on kernel functions and Mercer's theorem, which allow the $w$ vector to be expressed as a linear combination of the input data points.
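The hinge loss objective of Eq. (25) is not differentiable everywhere, but it can be minimized with subgradient steps. The sketch below is one possible way to do so and is not taken from the text: the step size `alpha`, the number of steps, the appended bias feature and the toy points are assumptions, and the helper names are hypothetical.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def svm_objective(w, data):
    """Eq. (25): ||w||^2 / 2 + sum_i max(0, 1 - y_i (w . x_i))."""
    reg = 0.5 * dot(w, w)
    hinge = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in data)
    return reg + hinge

def fit_linear_svm(data, alpha=0.01, steps=500):
    """Subgradient descent on Eq. (25); returns the best iterate seen."""
    d = len(data[0][0])
    w = [0.0] * d
    best_w, best_obj = list(w), svm_objective(w, data)
    for _ in range(steps):
        grad = list(w)                       # gradient of ||w||^2 / 2
        for x, y in data:
            if y * dot(w, x) < 1.0:          # point violates the margin
                for k in range(d):
                    grad[k] -= y * x[k]      # subgradient of the hinge term
        w = [wi - alpha * g for wi, g in zip(w, grad)]
        obj = svm_objective(w, data)
        if obj < best_obj:                   # subgradient steps are not monotone
            best_w, best_obj = list(w), obj
    return best_w

# Hypothetical separable 2-D points; the trailing 1.0 is a bias feature.
svm_data = [([2.0, 2.0, 1.0], +1), ([3.0, 1.0, 1.0], +1), ([2.0, 3.0, 1.0], +1),
            ([-2.0, -1.0, 1.0], -1), ([-1.0, -3.0, 1.0], -1), ([-3.0, -2.0, 1.0], -1)]
w_svm = fit_linear_svm(svm_data)
```

Keeping the best iterate is a design choice: a subgradient step can increase the objective, so the last iterate is not necessarily the best one.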
2.5 Decision tree

A DT algorithm (Safavian and Landgrebe, 1991) takes a table X as input and recursively partitions the table into sub-tables, sub-sub-tables and so on, improving a purity score of the "label" column in each partition. The purity score is a measure based on the proportion of individual classes in a mixture of class labels. The higher the proportion of one of the classes, the purer the collection is.
The attributes of a table can be numeric or categorical. A numerical attribute can partition the table into two parts about its mean value: rows with an attribute value less than the mean form one part and the remaining rows form the other part. A categorical attribute can partition the table into as many parts as it has distinct values.
Outline of decision tree
• Determine the base propensities of the classes
• Determine the base purity score
• Evaluate the weighted purity score for each attribute over its possible values, where the weight is the fraction of rows having that value
• Select the attribute that gives the highest purity gain over the base purity
• Split the table into parts based on the attribute values
• Compute the DT for each of the parts recursively

In more detail, a DT constructs a function $y = F(x)$, where $F$ is composed of a sequence of tests over the attribute values that finally lands in a node. The node holds the propensities of the individual classes, which are output as class probabilities for the input $x$. A new data point goes through a series of tests against the selected attributes and their corresponding values, starting from the root node and moving down to a leaf node. The DT algorithm is shown in Algorithm 1.

An application of the DT classifier to the moons data is shown in Fig. 6. Each node in the DT corresponds to either a horizontal or a vertical line. A particular subregion is subdivided by horizontal and vertical lines according to the DT depth considered. The tree shown here is for a depth of 3, where the whole region is partitioned into 8 subregions. If the depth is $k$, the 2D region would be partitioned into $2^{k}$ subregions.

An application of the DT regression model is shown in Fig. 7. Here the feature space is one dimensional. The output is a real valued quantity that needs to be predicted. The output in each subtree of the DT is the mean of the target values within it. The tree is computed for a depth of 3, hence resulting in $2^{3}$ horizontal partitions. Each horizontal partition corresponds to a mean value computed within that subbranch. The tree with depth 1 is also illustrated, which has just two mean values ($2^{1}$).
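The outline of Algorithm 1 can be sketched for numeric attributes as follows. This is a simplified illustration, not the algorithm as published: it handles only numeric columns split about their mean, uses the sum of squared class proportions as the purity score, and stops at a maximum depth or at a pure node. The purity choice, the stopping rule and all names are assumptions.

```python
from collections import Counter

def purity(labels):
    """Sum of squared class proportions; 1.0 means a pure collection."""
    n = len(labels)
    return sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3):
    """rows: list of numeric feature lists; labels: list of class labels."""
    counts = Counter(labels)
    node = {"probas": {k: v / len(labels) for k, v in counts.items()}}
    base = purity(labels)
    if depth >= max_depth or base == 1.0:        # stopping criteria (assumed)
        return node
    best = None
    for c in range(len(rows[0])):
        mu = sum(r[c] for r in rows) / len(rows)  # split about the mean
        left = [i for i, r in enumerate(rows) if r[c] < mu]
        right = [i for i, r in enumerate(rows) if r[c] >= mu]
        if not left or not right:
            continue
        wt = sum(len(part) / len(rows) * purity([labels[i] for i in part])
                 for part in (left, right))       # weighted purity
        if wt > base and (best is None or wt > best[0]):
            best = (wt, c, mu, left, right)
    if best is None:                              # no attribute improves purity
        return node
    _, c, mu, left, right = best
    node["attr"], node["mu"] = c, mu
    node["lo"] = build_tree([rows[i] for i in left], [labels[i] for i in left],
                            depth + 1, max_depth)
    node["hi"] = build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth)
    return node

def predict_proba(node, x):
    """Run x through the tests from the root down to a leaf."""
    while "attr" in node:
        node = node["lo"] if x[node["attr"]] < node["mu"] else node["hi"]
    return node["probas"]

# Hypothetical toy table: the first column separates the classes.
dt_rows = [[0, 0], [0, 1], [1, 0], [1, 1], [3, 0], [3, 1], [4, 0], [4, 1]]
dt_labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
tree = build_tree(dt_rows, dt_labels)
```

For this table the root splits on the first column at its mean of 2, after which both children are pure and the recursion stops.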
2.6 Ensemble methods

Ensembling is a method of combining several classifiers to obtain a better classifier in terms of reduced variance and bias (Shi et al., 2011; Webb and Zheng, 2004). There are three types of ensembling approaches: (i) bagging, (ii) boosting, and (iii) random forests. Bagging is a process where a powerful classifier is trained on different subsets of the data and the average prediction is taken. Such a process reduces variance in predictions. Boosting is a process
ALGORITHM 1 Decision Tree Algorithm (DT).
Require: X /*input table*/
Node = new Node() /*Node to return after tree construction*/
C = cols(X) /*set of columns*/
Node.basep = purity(X) /*purity on label column*/
Node.probas = probas(X['label']) /*class propensities*/
maxpurity = Node.basep /*highest enhanced purity*/
maxc = 'label' /*attribute that enhances purity*/
β = [] /*condition tests*/
Γ = [] /*sub trees*/
if Customize Stopping Criteria then
    return Node
end if
for c ∈ C do
    Π = [] /*table partitions based on values*/
    βc = [] /*condition tests for this attribute*/
    if type(c) = categorical then
        for v ∈ set(X[c]) do
            Π ← Π ⊙ X[c = v]
            βc ← βc ⊙ "c == v"
        end for
    else
        μ = mean(vals(X[c]))
        Π ← [X[c < μ], X[c ≥ μ]]
        βc = ["c < μ", "c ≥ μ"]
    end if
    wtpurity = 0
    for i = 1 : |Π| do
        wtpurity += (|Π[i]| / |X|) × purity(Π[i])
    end for
    if wtpurity > maxpurity then
        maxpurity = wtpurity
        maxc = c
        β = βc
    end if
end for
if maxc ≠ 'label' then
    for i = 1 : |β| do
        Γ ← Γ ⊙ DT(X[β[i]]) /*Recursion for sub tree building*/
    end for
end if
return Node
FIG. 6 A decision tree classifier on the moons data set of two classes of points, colored red and blue. The X-axis and the Y-axis correspond to the dimensions of the points (here they are 2D). Given that both dimensions are of numeric type, they are split about the mean value, then one of the partitions is selected and split about its mean and so on, along similar lines to a binary search, but in 2D and for splitting points. The whole area is split into rectangles (as it is 2D). As the split within a branch is made about the mean value, it always results in rectangles that halve the area. However, it is to be noted that the above scenario is for numeric type attributes. For categorical attributes, there is no notion of ordering among the values and each such attribute is split only once in the whole of the tree building process.
[Figure 7: plot titled "Decision tree regression"; x-axis "Data", y-axis "Target"; legend entries: Maxdepth = 1, Maxdepth = 2, Data.]

FIG. 7 Decision tree for a regression problem on 1-dimensional points. For a given x-coordinate, its corresponding y-coordinate is to be estimated using the decision tree approach. In this figure, a split is chosen along the x-axis where the sum of mean squared errors in the two partitions is minimized. Here, only one attribute is chosen again and again; however, the process of selection is applicable to any mixture of numerical and categorical attribute types. A tree of depth $k$ splits the x-axis into $2^{k}$ partitions, first splitting about the mean of the x-coordinate values, and then recursively splitting each half. In this figure, the blue lines show a tree of depth 1, which splits the x-coordinates into two zones. The light yellow lines show a tree of depth 2, which splits the x-axis into four zones.
ALGORITHM 2 Bagging methodology.
Require: X /*Data set*/
Γ = []
for i = 1 : M do
    Xsub ← sample(X)
    Γ ← Γ ⊙ train(Xsub)
end for
if classification problem then
    y ← voting({γ(x) : ∀γ ∈ Γ})
else
    y ← mean({γ(x) : ∀γ ∈ Γ})
end if
where incrementally, over steps, several weak classifiers are combined so as to reduce error. Random forest is a technique where multiple random classifiers are built on different subsets of data columns to reduce both bias and variance. Each of these methods has its own set of control parameters and characteristic behavior on diverse data sets.

A sketch of the bagging approach is given in Algorithm 2. It is reasonable to assume that the test data distribution spans the diverse subsamples of the training data. In this case, conceptually, the average prediction from multiple classifiers built over those different subsets of training data behaves differently from a single model built over the entire training data. The behavior is analogous to the CV process, where a model's performance is averaged over multiple subsamples of training data. The average model will have lower variance than a single model. However, nothing can be guaranteed about the training set accuracy or the bias. If all the constituent models are too simple, the bias of their combination in bagging is still going to be high.

The boosting algorithm, as depicted in Algorithm 3, creates the ensemble with a focus on error reduction at each step of adding a new classifier. Typically all the classifiers are weak; however, because the remaining error is reduced at each step of adding a new classifier to the ensemble, the final classifier attains much higher accuracy than the individual weaker ones. However, unlike bagging, boosting algorithms do not reduce variance and they are prone to overfitting, though they reduce bias. The main advantage is that a collection of simple weak classifiers is sufficient, rather than a single complex classifier. Overfitting can be reduced by reducing the size of the ensemble or by considering even simpler classifiers in the ensemble.
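The bagging procedure of Algorithm 2 can be sketched as below. The sketch assumes bootstrap sampling (drawing with replacement) for `sample(X)` and uses a one-nearest-neighbour learner on 1-D inputs as a stand-in for the constituent classifier; both choices, the toy data and all function names are assumptions made for illustration.

```python
import random
from collections import Counter
from statistics import mean

def train_1nn(xs, ys):
    """Stand-in base learner: predicts the label of the nearest training point."""
    pairs = list(zip(xs, ys))
    def predict(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]
    return predict

def bagging_fit(xs, ys, train, n_models=25, seed=0):
    """Train n_models learners, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]      # sample with replacement
        models.append(train([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x, classification=True):
    outputs = [m(x) for m in models]
    if classification:
        return Counter(outputs).most_common(1)[0][0]    # voting
    return mean(outputs)                                # averaging

# Hypothetical 1-D data with two well separated classes.
bag_x = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
bag_y = [0, 0, 0, 0, 1, 1, 1, 1]
bag_models = bagging_fit(bag_x, bag_y, train_1nn)
```

Calling `bagging_predict(bag_models, 1.5)` returns the majority vote of the constituent models, and passing `classification=False` returns their mean output instead.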
The random forest (RF) algorithm (Breiman, 2001), as depicted in Algorithm 4, creates an ensemble of multiple classifiers built over different subsets of columns of the input data. However, unlike bagging, where subsets of data are used for model building, each classifier sees the whole data. This is also unlike boosting, where each new classifier aims to reduce the remaining error; here in RF, the
ALGORITHM 3 Boosting methodology.
Require: X /*Data set*/
Γ0 = 0 /*List of classifiers so far*/
Error = L(Γ(X), y) /*Loss function*/
for i = 1 : M do
    Γi ← Γi−1 + α × f(x) /*Add a new learner*/
    fi, αi ← arg min_{f,α} L(Γi(X), y) /*Reduce loss further*/
    /*For a convex loss function*/
    Setting ∂L(Γi(X), y)/∂f = 0 → fi
    Setting ∂L(Γi(X), y)/∂α = 0 → αi
end for
y = ΓM(x) /*to predict for new input vector*/

ALGORITHM 4 Random forest methodology.
Require: X /*Data set*/
Γ0 = [] /*List of classifiers so far*/
χ = cols(X) /*set of columns*/
for i = 1 : M do
    χi = subset(χ) /*random subset*/
    h* = arg min_h L(h(X[χi]), y) /*best classifier on subset of columns*/
    Γi ← Γi−1 ⊙ h*
end for
if classification problem then
    y ← voting({γ(x) : ∀γ ∈ Γ})
else
    y ← mean({γ(x) : ∀γ ∈ Γ})
end if
new classifier does not have the notion of remaining error. A random forest reduces variance by averaging the decision over diverse classifiers. A random forest also reduces bias if each individual classifier is sufficiently complex over the given subset of features. A comparison of the three ensembling methods is given in Table 4. The bagging and random forest methodologies are more closely related to each other than either is to the boosting methodology. The nature of the constituent classifiers affects bias and variance.
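The column-subset ensemble of Algorithm 4 can be illustrated with a short sketch. To keep it brief, the constituent classifier here is a nearest-centroid rule rather than a decision tree, which is an assumption made purely for illustration; the toy table, the number of columns per model and all function names are likewise hypothetical.

```python
import random
from collections import Counter

def train_centroid(rows, labels, cols):
    """Stand-in base learner: nearest class centroid, using only `cols`."""
    sums, counts = {}, Counter(labels)
    for row, lab in zip(rows, labels):
        acc = sums.setdefault(lab, [0.0] * len(cols))
        for k, c in enumerate(cols):
            acc[k] += row[c]
    cents = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}
    def predict(x):
        return min(cents, key=lambda lab: sum((x[c] - m) ** 2
                                              for c, m in zip(cols, cents[lab])))
    return predict

def column_subset_fit(rows, labels, n_models=15, n_cols=2, seed=0):
    """Each model sees all rows but only a random subset of the columns."""
    rng = random.Random(seed)
    all_cols = list(range(len(rows[0])))
    return [train_centroid(rows, labels, rng.sample(all_cols, n_cols))
            for _ in range(n_models)]

def vote(models, x):
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Hypothetical table with three informative numeric columns.
rf_rows = [[0, 0, 0], [1, 0, 1], [0, 1, 1], [5, 5, 5], [6, 5, 6], [5, 6, 6]]
rf_labels = ["a", "a", "a", "b", "b", "b"]
rf_models = column_subset_fit(rf_rows, rf_labels)
```

Swapping `train_centroid` for a tree learner would bring the sketch closer to a forest in the usual sense; the voting step stays the same.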
TABLE 4 Comparison of ensembling methods.

| Aspect | Bagging | Boosting | Random forest |
|---|---|---|---|
| Training data for building constituent classifiers | Multiple subsets | Whole data | Whole data |
| Typical complexity of the constituent classifiers | Complex | Weak | Complex |
| Variance reduction | Yes | No | Yes |
| Bias reduction | No | Yes | No |
| Complexity control | Constituent classifiers | Ensemble size | Constituent classifiers and ensemble size |
2.6.1 Boosting algorithms

Boosting algorithms work by focusing on the error in each iteration and reducing that error by adding a new model. There are two main types of boosting algorithms: AdaBoost focuses on erroneous data points via a notion of sample weights, and gradient boosting predicts the error value and subtracts it.

AdaBoost algorithm: This algorithm (Freund and Schapire, 1996) employs a notion of weight for each point. The higher the weight, the greater the contribution of the point to the overall error; the lower the weight, the smaller the contribution. The algorithm starts with uniform weights for all the points and updates the weights over iterations as in Algorithm 5. The notations used are described in Table 5.

Let us discuss why this algorithm works in terms of the mathematics behind α, E, W. The first thing to start with is defining the error function. Let $y_i \in \{-1, +1\}$ $(\forall i \in [1 \ldots N])$. Such an assumption does not restrict the scope of the AdaBoost derivation, as we will see. Let $F_M(x) = \sum_{m=1}^{M} \alpha_m h_m(x)$ be the ensemble. Let us define a loss function $L()$ over the $F_M()$ classifier. The error on misclassification is determined using the $L_A$ error function (Eq. 26). However, for ease of derivation of the update equations for the α, E, W quantities, we consider the exponential form. The exponentiated value of $L_A$ is still a good indicator of the error, although it takes only positive values ($L_B$ in Eq. 27). Since a sum of terms occurs inside the exponentiation, applying Jensen's inequality to the convex exponential function yields Eq. (30), which is an upper bound on the error in the $L_A$ term.

\[
L_A = \frac{1}{N} \sum_{i=1}^{N} -y_i \, F_M(x_i) \tag{26}
\]
ALGORITHM 5 AdaBoost algorithm.
W0(i) = 1/N (∀i ∈ [1…N])
for m = 1 : M do
    hm = arg min_h Σ_{i=1}^{N} Wm−1(i) × 1{yi ≠ h(xi)}
    Em = Σ_{i=1}^{N} Wm−1(i) × 1{yi ≠ hm(xi)}
    αm = 1/2 × log((1 − Em) / Em)
    for i = 1 : N do
        if yi = hm(xi) then
            γ = 1 / (1 − Em)
        else
            γ = 1 / Em
        end if
        Wm(i) = 1/2 × Wm−1(i) × γ
    end for
end for
y = Σ_{m=1}^{M} αm hm(x)
TABLE 5 AdaBoost notations.

| Symbol | Meaning |
|---|---|
| $W_k(i)$ | Weight of the ith data point in the kth iteration |
| $h_m(x)$ | Classifier built in the mth iteration |
| $E_m$ | Sum of weights of erroneous points in the mth iteration |
| $\alpha_m$ | Weight of each classifier $h_m(x)$ |
| $1\{a \ne b\}$ | Indicator function, which is 1 when $a \ne b$ and 0 otherwise |
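\[
L_B = e^{L_A} \tag{27}
\]

\[
\Rightarrow (L_B \ge 0) \wedge (L_B \ge L_A) \tag{28}
\]

\[
L_B \le \frac{1}{N} \sum_{i=1}^{N} e^{-y_i F_M(x_i)} \tag{29}
\]

\[
L(F_M) := \frac{1}{N} \sum_{i=1}^{N} e^{-y_i F_M(x_i)} \tag{30}
\]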
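The chain of inequalities in Eqs. (27) to (30) is easy to check numerically. The snippet below is only a sanity check on made-up numbers: the list `margins` holds hypothetical values of $y_i F_M(x_i)$, and the function names are illustrative.

```python
import math

def loss_a(margins):
    """Eq. (26): mean of -y_i F_M(x_i), given margins m_i = y_i F_M(x_i)."""
    return sum(-m for m in margins) / len(margins)

def exp_loss(margins):
    """Eq. (30): mean of exp(-y_i F_M(x_i))."""
    return sum(math.exp(-m) for m in margins) / len(margins)

margins = [2.0, 0.5, -1.0, 1.5, -0.3]   # hypothetical values of y_i * F_M(x_i)
la = loss_a(margins)
lb = math.exp(la)                        # Eq. (27)
upper = exp_loss(margins)                # right-hand side of Eq. (29)
```

Negative margins (misclassified points) dominate the exponential loss, which is why minimizing the bound pushes the ensemble toward correcting them.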
The AdaBoost formulation reduces an upper bound on the error rather than the error directly. Let the weights $W_k(i)$ be such that $\sum_{i=1}^{N} W_k(i) = 1$ $(\forall k \in [0 \ldots M])$, where $Z_k$ is the normalizing factor. The best classifier and its weight are selected by minimizing the loss function (Eq. 31).
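\[
\begin{aligned}
L(F_{M+1}) &= L(F_M + \alpha h) \\
&= \frac{1}{N} \sum_{i=1}^{N} e^{-y_i F_{M+1}(x_i)} \\
&= \frac{1}{N} \sum_{i=1}^{N} e^{-y_i (F_M(x_i) + \alpha h(x_i))} \\
&= \frac{1}{N} \sum_{i=1}^{N} e^{-y_i F_M(x_i)} \, e^{-y_i \alpha h(x_i)}
\end{aligned}
\]

\[
\text{Let } W_M(i) := \frac{e^{-y_i F_M(x_i)}}{Z_M} \;\Rightarrow\; L(F_{M+1}) = Z_M \, l(\alpha, h), \qquad l(\alpha, h) := \frac{1}{N} \sum_{i=1}^{N} W_M(i) \, e^{-y_i \alpha h(x_i)}
\]

\[
(\alpha_{M+1}, h_{M+1}) = \arg\min_{(\alpha, h)} l(\alpha, h) = \arg\min_{(\alpha, h)} \Big\{ \frac{1}{N} \sum_{i=1}^{N} W_M(i) \, e^{-y_i \alpha h(x_i)} \Big\} \tag{31}
\]

We can note from Eq. (31) that $h_{M+1}$ does not depend on $\alpha$ or $Z_M$, since $Z_M$ is a positive constant with respect to $\alpha$ and $h$; the optimal $h_{M+1}$ is therefore given by Eq. (32).

\[
\begin{aligned}
l(\alpha, h) &= \frac{1}{N} \sum_{i=1}^{N} W_M(i) \, e^{-y_i \alpha h(x_i)} \\
&= \frac{1}{N} \Big\{ \sum_{y_i = h(x_i)} W_M(i) \, e^{-\alpha} + \sum_{y_j \ne h(x_j)} W_M(j) \, e^{\alpha} \Big\} \\
&= \frac{1}{N} \Big\{ e^{-\alpha} \sum_{y_i = h(x_i)} W_M(i) + e^{\alpha} \sum_{y_j \ne h(x_j)} W_M(j) \Big\} \\
&= \frac{1}{N} \Big\{ e^{-\alpha} \Big( 1 - \sum_{y_j \ne h(x_j)} W_M(j) \Big) + e^{\alpha} \sum_{y_j \ne h(x_j)} W_M(j) \Big\} \\
&= \frac{1}{N} \Big\{ e^{-\alpha} - e^{-\alpha} \sum_{y_j \ne h(x_j)} W_M(j) + e^{\alpha} \sum_{y_j \ne h(x_j)} W_M(j) \Big\} \\
&= \frac{1}{N} \Big\{ e^{-\alpha} + (e^{\alpha} - e^{-\alpha}) \sum_{y_j \ne h(x_j)} W_M(j) \Big\} \\
&= \frac{1}{N} \Big\{ e^{-\alpha} + (e^{\alpha} - e^{-\alpha}) \sum_{i=1}^{N} 1\{y_i \ne h(x_i)\} \, W_M(i) \Big\}
\end{aligned}
\]

For any fixed $\alpha > 0$ the factor $(e^{\alpha} - e^{-\alpha})$ is positive, so minimizing over $h$ amounts to minimizing the weighted error:

\[
\arg\min_{h} l(\alpha, h) = \arg\min_{h} \sum_{i=1}^{N} 1\{y_i \ne h(x_i)\} \, W_M(i)
\]

\[
h_{M+1} = \arg\min_{h} L(F_{M+1}) = \arg\min_{h} \sum_{i=1}^{N} 1\{y_i \ne h(x_i)\} \, W_M(i) \tag{32}
\]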
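Once the minimizing $h_{M+1}$ is determined, the next task is to determine the minimizing $\alpha_{M+1}$ of the loss function. With the weighted error $E_{M+1}$ defined in Eq. (33), $\alpha_{M+1}$ is determined as in Eq. (34).

\[
E_{M+1} = \sum_{i=1}^{N} W_M(i) \, 1\{y_i \ne h_{M+1}(x_i)\} \tag{33}
\]

\[
\begin{aligned}
l(\alpha, h_{M+1}) &= \frac{1}{N} \Big\{ e^{-\alpha} + (e^{\alpha} - e^{-\alpha}) \sum_{i=1}^{N} 1\{y_i \ne h_{M+1}(x_i)\} \, W_M(i) \Big\} \\
&= \frac{1}{N} \big\{ e^{-\alpha} + (e^{\alpha} - e^{-\alpha}) \, E_{M+1} \big\} \\
\alpha_{M+1} &= \arg\min_{\alpha} l(\alpha, h_{M+1})
\end{aligned} \tag{34}
\]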
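In order to identify $\alpha$, setting the partial derivative of Eq. (34) with respect to $\alpha$ to zero yields Eq. (35).

\[
\begin{aligned}
\frac{\partial l(\alpha, h_{M+1})}{\partial \alpha} &= 0 \\
\Rightarrow -e^{-\alpha} + (e^{\alpha} + e^{-\alpha}) \, E_{M+1} &= 0 \\
\Rightarrow -1 + (e^{2\alpha} + 1) \, E_{M+1} &= 0 \\
\Rightarrow e^{2\alpha} &= \frac{1}{E_{M+1}} - 1 \\
\alpha_{M+1} &= \frac{1}{2} \log \frac{1 - E_{M+1}}{E_{M+1}}
\end{aligned} \tag{35}
\]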
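The closed form of Eq. (35) can be compared against a brute-force search over $\alpha$. The following check is a sketch on an assumed error value: `err = 0.2` stands in for $E_{M+1}$, and the grid resolution and function names are arbitrary choices.

```python
import math

def l_alpha(alpha, err):
    """Bracketed term of Eq. (34): e^{-a} + (e^{a} - e^{-a}) * E."""
    return math.exp(-alpha) + (math.exp(alpha) - math.exp(-alpha)) * err

def best_alpha(err):
    """Closed form of Eq. (35)."""
    return 0.5 * math.log((1.0 - err) / err)

err = 0.2                                    # hypothetical weighted error E_{M+1}
a_star = best_alpha(err)
grid = [k / 1000.0 for k in range(1, 3000)]  # candidate alphas in (0, 3)
a_grid = min(grid, key=lambda a: l_alpha(a, err))
```

At the minimizer the bracketed term equals $2\sqrt{E_{M+1}(1 - E_{M+1})}$, the same quantity that appears in the normalization of the weights.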
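The weight update equation for the next iteration, $W_{M+1}$, depends on the previous $W_M$ as in Eq. (36).

\[
\begin{aligned}
(\exists \gamma) : \; W_{M+1}(i) &= \gamma \, e^{-y_i F_{M+1}(x_i)} \\
&= \gamma \, e^{-y_i (F_M(x_i) + \alpha_{M+1} h_{M+1}(x_i))} \\
&= \gamma \, e^{-y_i F_M(x_i)} \, e^{-y_i \alpha_{M+1} h_{M+1}(x_i)} \\
W_{M+1}(i) &= \gamma \, W_M(i) \, Z_M \, e^{-y_i \alpha_{M+1} h_{M+1}(x_i)}
\end{aligned} \tag{36}
\]

Requiring the new weights to sum to one gives

\[
\begin{aligned}
\sum_{i=1}^{N} W_{M+1}(i) &= \gamma \, Z_M \sum_{i=1}^{N} W_M(i) \, e^{-y_i \alpha_{M+1} h_{M+1}(x_i)} = 1 \\
&= \gamma \, Z_M \Big( \sum_{y_i = h_{M+1}(x_i)} W_M(i) \, e^{-\alpha_{M+1}} + \sum_{y_i \ne h_{M+1}(x_i)} W_M(i) \, e^{\alpha_{M+1}} \Big) = 1 \\
&= \gamma \, Z_M \big( (1 - E_{M+1}) \, e^{-\alpha_{M+1}} + E_{M+1} \, e^{\alpha_{M+1}} \big) = 1
\end{aligned}
\]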
Consider Eq. (35):

\[
\alpha_{M+1} = \frac{1}{2} \log \frac{1 - E_{M+1}}{E_{M+1}}
\;\Rightarrow\; e^{\alpha_{M+1}} = \sqrt{\frac{1 - E_{M+1}}{E_{M+1}}}
\;\Rightarrow\; e^{-\alpha_{M+1}} = \sqrt{\frac{E_{M+1}}{1 - E_{M+1}}}
\]

Then normalizing the weights $W_{M+1}(i)$ $(\forall i \in [1 \ldots N])$ yields Eq. (37).

\[
\begin{aligned}
\gamma \, Z_M \big( (1 - E_{M+1}) \, e^{-\alpha_{M+1}} + E_{M+1} \, e^{\alpha_{M+1}} \big) &= 1 \\
\gamma \, Z_M \Big( (1 - E_{M+1}) \sqrt{\frac{E_{M+1}}{1 - E_{M+1}}} + E_{M+1} \sqrt{\frac{1 - E_{M+1}}{E_{M+1}}} \Big) &= 1 \\
\gamma \, Z_M \Big( \sqrt{(1 - E_{M+1}) \, E_{M+1}} + \sqrt{E_{M+1} \, (1 - E_{M+1})} \Big) &= 1
\end{aligned}
\]

\[
\gamma = \frac{1}{Z_M \times 2 \times \sqrt{(1 - E_{M+1}) \, E_{M+1}}} \tag{37}
\]
Substituting Eq. (37) in Eq. (36) and simplifying separately for the correctly classified points and for the points with $y_i \ne h_{M+1}(x_i)$ results in the separate weight update Eqs. (38) and (39).

\[
\begin{aligned}
W_{M+1}(i) &= \gamma \, W_M(i) \, Z_M \, e^{-y_i \alpha_{M+1} h_{M+1}(x_i)} \\
&= \frac{1}{Z_M \times 2 \times \sqrt{(1 - E_{M+1}) \, E_{M+1}}} \, W_M(i) \, Z_M \, e^{-y_i \alpha_{M+1} h_{M+1}(x_i)} \\
&= \frac{1}{2 \sqrt{(1 - E_{M+1}) \, E_{M+1}}} \, W_M(i) \, e^{-y_i \alpha_{M+1} h_{M+1}(x_i)}
\end{aligned}
\]

\[
(y_i = h_{M+1}(x_i)) : \quad W_{M+1}(i) = \frac{W_M(i) \, e^{-\alpha_{M+1}}}{2 \sqrt{(1 - E_{M+1}) \, E_{M+1}}} = \frac{W_M(i)}{2 \sqrt{(1 - E_{M+1}) \, E_{M+1}}} \sqrt{\frac{E_{M+1}}{1 - E_{M+1}}} = \frac{W_M(i)}{2 \, (1 - E_{M+1})} \tag{38}
\]

\[
(y_i \ne h_{M+1}(x_i)) : \quad W_{M+1}(i) = \frac{W_M(i) \, e^{\alpha_{M+1}}}{2 \sqrt{(1 - E_{M+1}) \, E_{M+1}}} = \frac{W_M(i)}{2 \sqrt{(1 - E_{M+1}) \, E_{M+1}}} \sqrt{\frac{1 - E_{M+1}}{E_{M+1}}} = \frac{W_M(i)}{2 \, E_{M+1}} \tag{39}
\]
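The updates derived above (Eqs. 32, 33, 35, 38 and 39) fit together as in Algorithm 5. The sketch below assumes threshold stumps on 1-D inputs as the weak classifiers and takes the sign of the weighted sum as the predicted label; the stump family, the early-stopping rule, the toy data and all names are assumptions made for illustration.

```python
import math

def stump_predict(stump, x):
    """Threshold stump: predicts s for x < t and -s otherwise."""
    t, s = stump
    return s if x < t else -s

def candidate_stumps(xs):
    pts = sorted(set(xs))
    cuts = [pts[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
    return [(t, s) for t in cuts for s in (+1, -1)]

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n                      # W_0(i) = 1/N
    ensemble = []                                # list of (alpha_m, h_m)
    for _ in range(rounds):
        def weighted_error(h):
            return sum(w for w, x, y in zip(weights, xs, ys)
                       if stump_predict(h, x) != y)
        h = min(candidate_stumps(xs), key=weighted_error)    # Eq. (32)
        err = weighted_error(h)                               # Eq. (33)
        if err == 0.0 or err >= 0.5:
            break   # assumed rule: stop on a perfect or a chance-level stump
        alpha = 0.5 * math.log((1.0 - err) / err)             # Eq. (35)
        ensemble.append((alpha, h))
        weights = [w / (2.0 * (1.0 - err)) if stump_predict(h, x) == y
                   else w / (2.0 * err)                       # Eqs. (38), (39)
                   for w, x, y in zip(weights, xs, ys)]
    return ensemble, weights

def ada_predict(ensemble, x):
    score = sum(alpha * stump_predict(h, x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can classify correctly.
ada_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ada_y = [+1, +1, -1, -1, +1, +1]
ada_model, ada_w = adaboost(ada_x, ada_y, rounds=3)
```

On this data each single stump misclassifies at least two points, yet the weighted vote of three stumps classifies all six correctly, and the weights remain normalized after every update.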
2.6.2 Gradient boosting algorithm

Gradient boosting (Friedman, 2001) is another highly popular boosting algorithm, where over iterations the error itself is predicted and subtracted from the output of the classifier. The word gradient reflects the fact that the error is proportional to the negative gradient of a loss function. Let $y = F(x)$ be the machine-learned function that predicts the $y$ coordinate for the input $x$. Let $L(F(x), y)$ be the loss function that measures the difference between the predicted and actual outputs. Let $L(F(x), y) = \frac{1}{2} (y - F(x))^{2}$ be the squared loss function. Then the following derivation computes the new values of $F$ in Eq. (40).

\[
\begin{aligned}
L(F(x), y) &= \tfrac{1}{2} \, (y - F(x))^{2} \\
\frac{\partial L(F(x), y)}{\partial F(x)} &= \tfrac{1}{2} \times 2 \times (y - F(x)) \times (-1) \\
\Rightarrow \nabla L(F(x), y) &= \frac{\partial L(F(x), y)}{\partial F(x)} = (F(x) - y) \\
\Rightarrow F_{new}(x) &= F_{old}(x) - \nabla L(F(x), y) \big|_{F(x) = F_{old}(x)} \\
F_{new}(x) &= F_{old}(x) - \text{predicted}(F_{old}(x) - y)
\end{aligned} \tag{40}
\]
Building a sequence of classifiers: The steps in building the sequence of classifiers are as follows.
• Let $F_1(x)$ be the first classifier built over the data set
• Let $e_1(x) = F_1(x) - y$ be the classifier or regressor for the error
• Let $F_2(x) = F_1(x) - e_1(x)$ be the updated classifier
• Let $e_2(x) = F_2(x) - y$ be the classifier or regressor for the error on the updated classifier
• Let $F_3(x) = F_2(x) - e_2(x)$ be the updated classifier, and so on
• Let $F_{M+1}(x) = F_M(x) - e_M(x)$
• Then we can expand $F_{M+1}(x) = F_1(x) - \sum_{i=1}^{M} e_i(x)$
For any other differentiable loss function of the form $L(F(x) - y)$, the gradient term $\nabla L(F(x) - y)$ takes the place of the error term $(F(x) - y)$; for the squared loss the two coincide. There are other variants of gradient boosting techniques; one very popular technique is called xgboost, which combines features of random forest and gradient boosting. Though boosting algorithms reduce error, they are prone to overfitting. Unlike bagging, boosting algorithms can be ensembles of several weak classifiers. The focus in boosting is error reduction, whereas the focus of bagging is variance reduction.
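The sequence $F_1, e_1, F_2, e_2, \ldots$ described above can be sketched for the squared loss as follows. The sketch assumes a constant first model, least-squares regression stumps on 1-D inputs as the error regressors and no shrinkage factor; these choices, the toy data and the function names are illustrative assumptions.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump: (threshold, left_mean, right_mean)."""
    best = None
    pts = sorted(set(xs))
    for a, b in zip(pts, pts[1:]):
        t = (a + b) / 2.0
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def stump_value(stump, x):
    t, lm, rm = stump
    return lm if x < t else rm

def gradient_boost(xs, ys, rounds=20):
    f1 = sum(ys) / len(ys)                   # F_1: a constant model (assumed)
    errors = []                              # regressors e_1, ..., e_M
    preds = [f1] * len(xs)
    for _ in range(rounds):
        resid = [p - y for p, y in zip(preds, ys)]      # F_m(x) - y
        e = fit_stump(xs, resid)                        # regressor for the error
        errors.append(e)
        preds = [p - stump_value(e, x) for p, x in zip(preds, xs)]
    return f1, errors

def gb_predict(model, x):
    """F_{M+1}(x) = F_1(x) - sum_i e_i(x)."""
    f1, errors = model
    return f1 - sum(stump_value(e, x) for e in errors)

def gb_mse(model, xs, ys):
    return sum((gb_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical 1-D regression data.
gb_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
gb_y = [x ** 2 for x in gb_x]
```

Each added stump is a least-squares fit to the current residual, so the training error cannot increase from one round to the next; practical implementations usually also scale each correction by a small learning rate to limit overfitting.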
2.7 Bias-variance trade off

Bias and variance are two important characteristics of any machine learning model and they closely relate to training and validation set accuracies (Wolpert, 1997). The simpler the model is, the lower the test set accuracy and even the training set accuracy. On the other hand, if the model is too complex, the training set accuracy is higher whereas the test set accuracy is poor. A machine learning model needs to generalize well on unseen data, thereby requiring the model neither to underfit nor to overfit.

Bias is a problem related to underfitting of the model to the data. The problem is evident from low training set accuracy itself. It is addressed by increasing the complexity of the model. Variance is a problem related to overfitting of the model to the training data. Overfitting is a scenario where the model performs poorly when given any input outside of the training data provided to the model. The variance problem is addressed by bagging, by increasing the training data size, or by reducing the complexity of the model. It is desirable to have low bias and low variance, although achieving both is difficult if not impossible for practical problem scenarios.

The algorithm for determining the bias and variance of a given model is shown in Algorithm 6. Multiple subsets are drawn from the input (with replacement). For each subset, a classifier is built that minimizes error on that subset, and finally an ensemble of such classifiers is built (the Γ list). Bias is then defined as the deviation of the averaged prediction from the true value, and variance is the spread of the values across the constituent classifiers' predictions. Derivation of the bias and variance trade off involves considering expectations of test set accuracies over diverse data sets. Following the usual notation, $E_{x \in X}[f(x)]$ denotes the expectation of the function $f(x)$ taken over a
ALGORITHM 6 Bias variance calculation.
Require: X /*Input data set*/
Γ = [] /*List of classifiers*/
for i = 1 : M do
    Xsub = subset(X)
    h* = arg min_h L(h(Xsub), y)
    Γ ← Γ ⊙ h*
end for
/*Define bias and variance as functions over individual classifiers' outputs*/
Given (x, y) ∈ X /*x is input and y is true value*/
bias(x) := (mean({γ(x) : ∀γ ∈ Γ}) − y)²
variance(x) := var({γ(x) : ∀γ ∈ Γ})
| Symbol | Meaning |
|---|---|
| $D$ | Data set of $(x, y)$ points |
| Sample() | Function that selects a subset of points |
| $\Pi$ | Set of all models built over subsets of $D$ |
| $\bar{M}(x)$ | Average of model predictions |
| $\delta(x)$ | Error over a given data point |
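range of inputs $x \in X$. Some of the terminology is shown in the accompanying tabulation. Let us derive the bias variance relation using a least squares error function.

\[
D = \{(x, y)\}, \qquad \Pi = \{ M_S \mid M_S = \text{model}(\text{sample}(D)) \}, \qquad \bar{M}(x) = E_{M_S \in \Pi}[M_S(x)], \qquad \delta_S(x) = \{ M_S(x) - \hat{y} \}^{2}
\]

The relationship between bias and variance is hidden in the $\delta_S(x)$ function for the model $M_S$ built over the subset of data $S$, where $\hat{y}$ is the true value for the input $x$. The final error $E$ is the expected value taken over all the data points in the test set for all the models built, as depicted in Eq. (41), where bias is denoted by the symbol $\beta$ and variance by the symbol $\nu$. The bias is defined in Eq. (42) and the variance in Eq. (43). The final error is the sum of the bias and variance components (Eq. 44), because expanding the expectation over the constituent terms of the $\delta_S(x)$ function results in a clear separation of the bias and variance terms.

\[
\begin{aligned}
\delta_S(x) &= \{ M_S(x) - \hat{y} \}^{2} \\
&= \{ M_S(x) - \hat{y} - \bar{M}(x) + \bar{M}(x) \}^{2} \\
&= \{ (M_S(x) - \bar{M}(x)) + (\bar{M}(x) - \hat{y}) \}^{2} \\
&= \{ M_S(x) - \bar{M}(x) \}^{2} + \{ \bar{M}(x) - \hat{y} \}^{2} + 2 \, (M_S(x) - \bar{M}(x)) \, (\bar{M}(x) - \hat{y})
\end{aligned}
\]

\[
E = E_S[E_x[\delta_S(x)]] \tag{41}
\]

\[
\beta = E_S[E_x[\{ \bar{M}(x) - \hat{y} \}^{2}]] = E_x[E_S[\{ \bar{M}(x) - \hat{y} \}^{2}]] = E_x[\{ \bar{M}(x) - \hat{y} \}^{2}] \tag{42}
\]

\[
\nu = E_x[E_S[\{ M_S(x) - \bar{M}(x) \}^{2}]] \tag{43}
\]

\[
\begin{aligned}
E &= E_x[E_S[\delta_S(x)]] \\
&= E_x[E_S[\{ M_S(x) - \bar{M}(x) \}^{2} + \{ \bar{M}(x) - \hat{y} \}^{2} + 2 \, (M_S(x) - \bar{M}(x)) \, (\bar{M}(x) - \hat{y})]] \\
\Rightarrow E &= \beta + \nu + E_x[E_S[2 \, (M_S(x) - \bar{M}(x)) \, (\bar{M}(x) - \hat{y})]]
\end{aligned}
\]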
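The third term in the expansion of $E$ results in a zero value, using $E_S[M_S(x)] = \bar{M}(x)$, as follows.

\[
\begin{aligned}
E_x[E_S[2 \, (M_S(x) - \bar{M}(x)) \, (\bar{M}(x) - \hat{y})]]
&= 2 \, E_x[E_S[M_S(x) \bar{M}(x) - \bar{M}(x) \bar{M}(x) - M_S(x) \hat{y} + \bar{M}(x) \hat{y}]] \\
&= 2 \, E_x[\bar{M}(x)^{2}] - 2 \, E_x[\bar{M}(x)^{2}] - 2 \, E_x[\bar{M}(x) \hat{y}] + 2 \, E_x[\bar{M}(x) \hat{y}] \\
&= 0
\end{aligned}
\]

\[
\therefore E = \beta + \nu \tag{44}
\]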
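The decomposition in Eq. (44) can be checked numerically in the spirit of Algorithm 6. The sketch below makes several assumptions that are not in the text: the true curve is a noise-free sine on $[0, 2\pi]$, each model is trained on two uniformly sampled points, the expectation over $x$ is approximated on a fixed grid, and the names `fit_const`, `fit_line` and `decompose` are hypothetical.

```python
import math
import random

def fit_const(pts):
    """Degree 0 model: the mean of the training targets."""
    b = sum(y for _, y in pts) / len(pts)
    return lambda x: b

def fit_line(pts):
    """Degree 1 model: the line through two training points."""
    (x1, y1), (x2, y2) = pts
    if x1 == x2:                      # degenerate sample: fall back to a constant
        return lambda x: (y1 + y2) / 2.0
    a = (y2 - y1) / (x2 - x1)
    return lambda x: y1 + a * (x - x1)

def decompose(fit, n_models=200, seed=0):
    """Return (bias, variance, error), each averaged over a grid of x values."""
    rng = random.Random(seed)
    grid = [2 * math.pi * k / 50 for k in range(51)]
    models = []
    for _ in range(n_models):
        xs = [rng.uniform(0.0, 2 * math.pi) for _ in range(2)]
        models.append(fit([(x, math.sin(x)) for x in xs]))
    bias = var = err = 0.0
    for x in grid:
        preds = [m(x) for m in models]
        avg = sum(preds) / len(preds)                 # average model at x
        truth = math.sin(x)
        bias += (avg - truth) ** 2
        var += sum((p - avg) ** 2 for p in preds) / len(preds)
        err += sum((p - truth) ** 2 for p in preds) / len(preds)
    n = len(grid)
    return bias / n, var / n, err / n

b0, v0, e0 = decompose(fit_const)
b1, v1, e1 = decompose(fit_line)
```

Because the average model is computed over the same set of sampled models, the cross term vanishes exactly and the error equals bias plus variance up to floating point rounding.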
2.7.1 Bias variance experiments

Bias and variance are computed over two models, (i) a horizontal line and (ii) an inclined line, both of which correspond to valid models (Fig. 8). The data is a set of 2D coordinates drawn over a true curve of sinusoidal shape with some added noise. A thousand random subsets are selected from the data set and both models ((i) and (ii)) are fit and plotted. The average of the predictions is plotted in Fig. 9. This average corresponds to the bias in the model with the given parameter settings. The variance about each data point is plotted for both the models in Fig. 10. The smaller the width of the pink bounds region, the lower the variance. As can be seen, the variance for some points is high while for the others it is low. The denser regions correspond to low variance zones. The variance plots for a larger training data size are shown in Fig. 11. As the training data size increases, the variance reduces for both models.

2.8 Cross validation and model selection

CV is used for model configuration or parameter selection. The process of CV is closely related to the bagging method, since in both processes different subsets of training data are used for building classifiers. The training data is repeatedly split into two parts, a learning part and a validation part. On each learning subset, a machine learning model is built and evaluated on the validation subset. The aggregate of the evaluation metric is computed and reported as the final CV metric. A model parameter configuration that results in overfitting causes each and every model to overfit to its learning subset and exhibit poor performance on its validation subset. The final aggregate metric over multiple overfitting models still results in poor performance with respect to the CV metric. Two highly popular CV methods are K-fold and leave one out (LOO).
In the K-fold CV algorithm, a data set is shuffled and split into K parts of roughly equal size. Iteratively, over K iterations, each of the subsets is chosen as the validation set and the remaining (K-1) folds are combined and used as the learning subset. In LOO cross validation, just one example is chosen as the validation subset and all the remaining points are used as the learning subset. The sketch
FIG. 8 The red curve in the left and the right plots is the true sine curve from which data is sampled uniformly to create (x, y) tuples. The problem is of regression type where, given an x coordinate, the task is to predict the corresponding y coordinate. The left plot shows attempts to fit a degree 0 polynomial (i.e., of the form $y = b$) and the right plot shows attempts to fit a degree 1 polynomial (i.e., of the form $y = a_0 x + a_1$). Each of the left and the right plots has 100 models, corresponding to the number of subsets sampled from the true sine curve, which are the 100 lines seen here. In this plot, each of the 100 models has just 2 points in its training set. The task is to assess the behavior of the average model in terms of its mean and variance.
FIG. 9 For each column, i.e., each x-coordinate, the average value is calculated across the models along that column. The resulting averages, also termed the average models, for the degree 0 polynomials (left plot) and the degree 1 polynomials (right plot) are shown in dark red color.
FIG. 10 The plots show variation about the mean μ(x) along each column, i.e., each x-coordinate. Each of the 100 models (in either of the plots) is built with just 2 points in its training set. The standard deviation σ(x) is calculated for each x-coordinate over the set of values computed by the models. A pink colored line is drawn from μ(x) − σ(x) to μ(x) + σ(x) to illustrate the amount of variation. One can note that the variation is higher for the degree 1 polynomial than for the degree 0 polynomial. This is expected behavior because the space explored by the degree 1 polynomials is larger, or in other words, the capacity of the degree 1 polynomial is higher.
FIG. 11 The plots show the impact of adding more training data points to each of the models. Each of the 100 models (in either of the plots) is built with 100 points in its training set (instead of just 2 as in Fig. 10). As the x-coordinates are uniformly sampled from a given range, the larger the sample, the greater the similarity between the sample sets. The reduction in variance is due to the increased training set sizes, which yield similar individual models from each subset.
ALGORITHM 7 Cross validation.
Require: X, metric /*Input data set and metric function*/
scores = []
for i = 1 : M do
    /*In K fold - subset() retrieves combined (K-1) folds and M is K*/
    /*In LOO - subset() retrieves all but one exemplar*/
    Xlearn = subset(X)
    /*In K fold - the remaining fold is validation set*/
    /*In LOO - the remaining exemplar is the validation set*/
    Xval = X \ Xlearn
    f = model(Xlearn) /*fitted model, named f to keep it distinct from the iteration count M*/
    yval = Xval['label']
    scores = scores ⊙ metric(f(Xval), yval)
end for
return mean(scores)
of the CV method is given in Algorithm 7. The metric function metric(ypred, ytrue) takes two arguments as input, the predicted and the actual values of the y coordinates. The function model() builds a classifier or regressor model. The data X has an implicit label column X['label']. The model selection process involves exploring diverse configurations of the model parameters and evaluating each over the training data set using the CV methodology. For each configuration of the model, a CV score is computed, and finally the model configuration with the highest score is output.
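The cross validation procedure described above can be sketched in plain Python as follows. This is a minimal illustration, not the chapter's implementation: the helper names (k_fold_splits, cross_validate, mean_model, neg_mse) and the toy data are assumptions introduced here, and the "model" is a trivial mean predictor chosen only to keep the example self-contained. Setting k equal to the number of points gives LOO.

```python
import random
from statistics import mean

def k_fold_splits(n, k, seed=0):
    """Yield (learn_idx, val_idx) index lists for K-fold CV; k == n gives LOO."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)           # shuffle before splitting
    folds = [idx[i::k] for i in range(k)]      # K parts of roughly equal size
    for i in range(k):
        val = folds[i]
        learn = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield learn, val

def cross_validate(X, y, build_model, metric, k=5, seed=0):
    """Mean of the metric over the K validation folds."""
    scores = []
    for learn, val in k_fold_splits(len(X), k, seed):
        model = build_model([X[i] for i in learn], [y[i] for i in learn])
        y_pred = [model(X[i]) for i in val]
        scores.append(metric(y_pred, [y[i] for i in val]))
    return mean(scores)

# Hypothetical model: always predicts the mean label of the learning subset.
def mean_model(X_learn, y_learn):
    m = mean(y_learn)
    return lambda x: m

# Hypothetical metric: negative mean squared error (higher is better).
def neg_mse(y_pred, y_true):
    return -mean((p - t) ** 2 for p, t in zip(y_pred, y_true))

X = [[float(i)] for i in range(20)]
y = [2.0 * row[0] + 1.0 for row in X]
cv_score = cross_validate(X, y, mean_model, neg_mse, k=5)
loo_score = cross_validate(X, y, mean_model, neg_mse, k=len(X))
```

The metric is negated so that, as in the text, the configuration with the highest score is the preferred one.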
2.8.1 Model selection process
Selection of a model corresponds to selection of parameters for a chosen ML methodology. For instance, if SVM is the chosen methodology, then the parameters to select are the kernel and the penalty attribute C. As another example, if the chosen methodology is a decision tree, then the parameters to select include the depth of the tree, the impurity function, the minimal leaf size and the minimal purity metric value. The steps in model selection are summarized in Algorithm 8.
2.8.2 Learning curves
Learning curves correspond to evaluation of the model over various parameters or over various sizes of the data. They may also correspond to any other aspect of model evaluation where the idea is to determine the best values of the metrics that lead to selection of (near) optimal parameters for the model. An illustration of a learning curve over training sets of various sizes is shown in Fig. 12. A validation curve over various degrees of polynomial features is shown in Fig. 13.
ALGORITHM 8 Model selection (pseudocode).
Decide on the modeling methodology, e.g., SVM, DT, Logistic regression, Random forest
Provide possible sets of values for the configurable attributes of the model as a grid to explore automatically
Choose a metric to evaluate; it can be a customized metric as well
for param ∈ parameter sets do
    for (learning subset, validation subset) of training data do
        Select a subset of the training data as learning data and the remaining data as validation subset
        Perform cross validation and compute the metric
    end for
    Compute average value of the metric
end for
Choose the parameter combination that gives the highest score for the metric
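A minimal sketch of the grid exploration in Algorithm 8 is given below, under stated assumptions: the methodology is swapped for a tiny k-nearest-neighbor classifier on one-dimensional data (not one of the models discussed in this chapter), the metric is accuracy, and all names (knn_predict, cv_score, select_model) and the synthetic data are hypothetical. The data is generated in random order, so the folds are taken by simple striding.

```python
import random
from itertools import product

def knn_predict(X_learn, y_learn, x, k):
    """Majority vote among the k nearest learning points (1D features)."""
    nearest = sorted(range(len(X_learn)), key=lambda i: abs(X_learn[i] - x))[:k]
    votes = [y_learn[i] for i in nearest]
    return max(set(votes), key=votes.count)

def cv_score(X, y, params, n_folds=5):
    """Average validation accuracy over the folds for one parameter set."""
    scores = []
    for f in range(n_folds):
        val = list(range(f, len(X), n_folds))
        held_out = set(val)
        learn = [i for i in range(len(X)) if i not in held_out]
        X_l, y_l = [X[i] for i in learn], [y[i] for i in learn]
        hits = sum(knn_predict(X_l, y_l, X[i], **params) == y[i] for i in val)
        scores.append(hits / len(val))
    return sum(scores) / n_folds

def select_model(X, y, grid):
    """Evaluate every parameter combination and return the best one."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((cv_score(X, y, params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, results

# Synthetic two class data with about 10% label noise (an assumption).
rng = random.Random(0)
X = [rng.uniform(0.0, 1.0) for _ in range(100)]
y = [int(x > 0.5) if rng.random() > 0.1 else 1 - int(x > 0.5) for x in X]
best_params, best_score, results = select_model(X, y, {"k": [1, 3, 5, 9]})
```

Libraries such as scikit-learn provide equivalent grid search utilities; the explicit loops here only mirror the steps of the pseudocode.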
FIG. 12 Learning curve over training set size (moons data set); x-axis: training set size, y-axis: accuracy score. The learning curve for various training set sizes is shown here for the moons data set and the classifier used in Fig. 6. Cross validation (CV) with stratified shuffling, using five random splits into learning and validation subsets of 80% and 20% sizes, is carried out on each training data partition. Each execution of the CV results in a mean accuracy score for the learning and the validation subsets. CV is carried out for random subsets of the training data set of different sizes, and the mean accuracy scores for the learning and the validation sets are shown in green and red colors, respectively. The scores in general increase as the learning set size increases.
FIG. 13 Validation curve over polynomial degree (moons data set); x-axis: degree of polynomial features, y-axis: accuracy. The validation curve for various degrees of polynomial features expanding the input dimensionality is shown here for the moons data set and the classifier used in Fig. 6. The data is of the form (x1, x2, label), and for a degree k it is expanded into a higher dimension as the set of all distinct terms $\{x_1^a x_2^b \mid \forall (a, b) : 0 \le (a + b) \le k\}$. Accuracy scores are computed for the learning and the validation sets and shown in green and red colors, respectively, for values of the expansion degree parameter from 1 to 6.
2.9 Multiclass and multivariate scenarios
The algorithms discussed so far handle two label classification and univariate regression. These algorithms need to be modified for multiclass and multivariate scenarios. Modification of the linear regression algorithm for a multivariate scenario is straightforward. It is equivalent to carrying out multiple linear regressions simultaneously, one for each output variable, each contributing one summation term to the loss function. However, modification of the two class classification algorithms for multiclass classification is not straightforward and typically involves tricky customization techniques.
2.9.1 Multivariate linear regression
Multivariate linear regression over an output vector can be posed as minimization of a loss function that sums the errors for each of the output variables. Let y = [y1, …, yK] be the K dimensional output vector for any data point. The input data, i.e., the rows, are given by the matrix X, where Xi denotes the ith row. Let Y denote the matrix of output vectors of all data points, so that Yj denotes the jth column of Y across all data points. Let Wj be the vector of weights corresponding to the jth column of Y. The jth column of the ith row is denoted by Yj[i]. The ith row of the input is also denoted by X[i]. The prediction formulation then becomes Eq. (45), where N is the number of data points.
$$Y_j[i] = W_j \cdot X[i] \quad (\forall i \in [1 : N] \wedge \forall j \in [1 : K]) \tag{45}$$
Here Yj is an (N × 1) vector and Yj[i] is the scalar value in that vector corresponding to the ith row. Wj is a vector of (d × 1) dimensions, where d is the dimensionality of the input data. The input data is denoted by X, which has (N × d) dimensions, where N is the number of input points. The loss function is defined as the summation of the losses incurred over the individual Wj vectors over the entire data set, Eq. (46).
$$L'(X, W, Y) = \sum_{j=1}^{K} \sum_{i=1}^{N} L(X_i \cdot W_j, Y_j[i]) \tag{46}$$
This loss function is then minimized with respect to each of the Wj[k] variables, and the gradient descent algorithm is applied to find the optimal values, Eq. (47).
$$\nabla L' : \quad \frac{\partial L'}{\partial W_j[k]} = \frac{\partial \left( \sum_{i=1}^{N} L(X_i \cdot W_j, Y_j[i]) \right)}{\partial W_j[k]} = \sum_{i=1}^{N} \frac{\partial L(X_i \cdot W_j, Y_j[i])}{\partial W_j[k]} \tag{47}$$
The weight update equation now becomes Eq. (48).
$$W_j[k]^{new} \leftarrow W_j[k]^{old} - \alpha \, \nabla L' \big|_{W_j[k] = W_j[k]^{old}} \tag{48}$$
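The update rule of Eq. (48) can be illustrated with a short sketch. It assumes a squared error for the per-point loss L, which the text leaves generic, and divides the gradient by N, which only rescales the learning rate α. The function names and the noise-free synthetic data are hypothetical, and a constant feature of 1 is appended to each input row to act as an intercept.

```python
import random

def predict(W, x):
    """Return the K outputs W_j . x for one input row x."""
    return [sum(wk * xk for wk, xk in zip(Wj, x)) for Wj in W]

def fit_multivariate(X, Y, alpha=0.1, n_iter=1000):
    """X: N rows of d features; Y: N rows of K outputs. Returns K weight vectors."""
    n, d, K = len(X), len(X[0]), len(Y[0])
    W = [[0.0] * d for _ in range(K)]
    for _ in range(n_iter):
        for j in range(K):  # each output column j has its own weight vector W_j
            resid = [sum(w * v for w, v in zip(W[j], X[i])) - Y[i][j] for i in range(n)]
            # dL'/dW_j[k] for squared error, averaged over the N points
            grad = [sum(2.0 * resid[i] * X[i][k] for i in range(n)) / n for k in range(d)]
            W[j] = [W[j][k] - alpha * grad[k] for k in range(d)]
    return W

rng = random.Random(0)
true_W = [[2.0, -1.0, 0.5], [0.0, 3.0, -2.0]]   # K = 2 outputs, d = 3 (last is intercept)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0] for _ in range(50)]
Y = [predict(true_W, x) for x in X]
W = fit_multivariate(X, Y)
```

Because the loss is a sum over the output columns, the K regressions decouple and each W_j is updated independently, as the text notes.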
2.9.2 Multiclass classification
There are a variety of techniques to handle multiclass classification problems, depending on the nature of the classification algorithm applied. For SVMs, multiple classes are handled by one-vs-rest or all-pairs approaches. For logistic regression, however, multiple classes are handled as a summation of the loss functions of the individual logistic regressions over the pertinent classes, each class having its own weight vector.
2.9.2.1 Multiclass SVM
There are two major approaches to handling multiple classes in SVM (van den Burg and Groenen, 2016). In the first approach, one SVM is built for each class, treating that class as positive and the remaining classes as negative exemplars (Algorithm 9). In the other approach, one SVM is built for each pair of classes, and a voting scheme or a weighted confidence of prediction is determined for a new input (Algorithm 10).
2.9.2.2 Multiclass logistic regression
The multiclass logistic regression formulation needs a proper definition of the log odds in favor of one class against some other class or the rest of the data. The formulation can be very similar to multivariate linear regression, where the task is to determine each of the Wo vectors corresponding to the output dimension o.
ALGORITHM 9 SVM: One vs Rest.
Require: D = {(x, y)} /*input data - feature vector and label*/
Let χ be the set of output classes of {y : (x, y) ∈ D}
Γ = []
for c ∈ χ do
    D′ ← {(x, (y = c)) : (x, y) ∈ D}
    Γ ← Γ ⊙ Ψc(D′)
end for
/*Prediction for input x*/
y = arg max_{Ψ ∈ Γ} Ψ(x)
/*Alternatively, all the SVMs can be used to transform input vector*/
DΓ = {[Ψ(x) : Ψ ∈ Γ] (∀(x, y) ∈ D)}
MΓ = Classifier(DΓ)
/*Prediction for input x*/
y = MΓ(x)
ALGORITHM 10 SVM: All pairs.
Require: D = {(x, y)} /*input data - feature vector and label*/
Let χ be the set of output classes of {y : (x, y) ∈ D}
Γ = []
for c1, c2 ∈ χ (∀ c1 ≠ c2) do
    D′ ← {(x, 1) : (x, y) ∈ D ∧ y = c1} ∪ {(x, 0) : (x, y) ∈ D ∧ y = c2}
    Γ ← Γ ⊙ Ψc1(D′)
end for
/*Prediction for input x*/
y = arg max_{c ∈ χ} |{Mc(x) = 1 : Mc ∈ Γ}| /*voting*/
/*Alternatively, all the SVMs can be used to transform input vector*/
DΓ = {[M(x) : M ∈ Γ] (∀(x, y) ∈ D)}
MΓ = Classifier(DΓ)
/*Prediction for input x*/
y = MΓ(x)
However, as logistic regression deals with the probabilities of the classes, which must sum up to 1, the formulation takes a different turn. In order to define the log odds, the definition of the other class becomes important. One of the classes is chosen as the pivot class and the log odds are defined in terms of that class. Some of the symbols and their meanings are given in the tabulation below.

Symbol                                                                                              Meaning
$W_c$                                                                                               Weight vector for determining class c
$(\forall c \in [1 \ldots (K-1)]) : W_c \cdot x = \log \frac{P(y = c \mid W_c)}{P(y = K \mid W_K)}$   Log odds in favor of class c against class K
$K$                                                                                                 Pivot class
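The relation between class c and class K is derived as below to result in Eq. (49).
$$W_c \cdot x = \log \frac{P(y = c \mid W_c)}{P(y = K \mid W_K)} \quad (\forall c \in [1 \ldots (K-1)])$$
$$\Rightarrow P(y = c \mid W_c) = e^{W_c \cdot x} \, P(y = K \mid W_K) \quad (\forall c \in [1 \ldots (K-1)])$$
$$\because \sum_{c=1}^{K-1} P(y = c \mid W_c) + P(y = K \mid W_K) = 1$$
$$\Rightarrow \left( \sum_{c=1}^{K-1} e^{W_c \cdot x} \right) P(y = K \mid W_K) + P(y = K \mid W_K) = 1$$
$$\Rightarrow P(y = K \mid W_K) \left( 1 + \sum_{c=1}^{K-1} e^{W_c \cdot x} \right) = 1$$
$$P(y = K \mid W_K) = \frac{1}{1 + \sum_{c=1}^{K-1} e^{W_c \cdot x}} \tag{49}$$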
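However, to unify the notations for the classes 1…(K−1) and the Kth class, the steps below result in Eq. (50). The summation index is written as c′ wherever c already denotes the class of interest.
$$(\forall c \in [1 \ldots (K-1)]) : P(y = c \mid x, [W_1, \ldots, W_K]) = P(y = c \mid x, W_c) = e^{W_c \cdot x} \, P(y = K \mid W_K) = \frac{e^{W_c \cdot x}}{1 + \sum_{c'=1}^{K-1} e^{W_{c'} \cdot x}}$$
$$c = K \rightarrow P(y = K \mid [W_1, \ldots, W_K]) = P(y = K \mid W_K) = \frac{1}{1 + \sum_{c=1}^{K-1} e^{W_c \cdot x}}$$
$$P(y = K \mid W_K) = \frac{e^{\alpha}}{e^{\alpha}} \cdot \frac{1}{1 + \sum_{c=1}^{K-1} e^{W_c \cdot x}} = \frac{e^{\alpha}}{e^{\alpha} + \sum_{c=1}^{K-1} e^{(W_c \cdot x + \alpha)}}$$
$$(\alpha = W_K \cdot x) \rightarrow P(y = K \mid W_K) = \frac{e^{W_K \cdot x}}{e^{W_K \cdot x} + \sum_{c=1}^{K-1} e^{W_c \cdot x + W_K \cdot x}} = \frac{e^{W_K \cdot x}}{e^{W_K \cdot x} + \sum_{c=1}^{K-1} e^{(W_c + W_K) \cdot x}}$$
$$\Rightarrow (\forall c \in [1 \ldots K]) : W'_c \mapsto (W_c + W_K) \Rightarrow P(y = c \mid W'_c) = P(y = c \mid W_c)$$
$$(\forall [W_1, \ldots, W_K]), (\forall c \in [1 \ldots K]) : P(y = c \mid W_c) = \frac{e^{W_c \cdot x}}{\sum_{c'=1}^{K} e^{W_{c'} \cdot x}} \tag{50}$$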
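Eq. (50) and the shift argument that precedes it can be checked numerically with a small sketch. The weight vectors and the input below are made-up values; the pivot class is given a zero weight vector so that the result can also be compared with Eq. (49). Subtracting the maximum score before exponentiating is a standard numerical safeguard and is itself an instance of the same shift invariance.

```python
import math

def class_probabilities(W, x):
    """P(y = c | W_c) = exp(W_c . x) / sum_c' exp(W_c' . x), as in Eq. (50)."""
    scores = [sum(w * v for w, v in zip(Wc, x)) for Wc in W]
    m = max(scores)                       # shifting all scores leaves the ratios unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights for K = 3 classes; the last (pivot) class has W_K = 0.
W = [[0.5, -1.0], [1.5, 0.2], [0.0, 0.0]]
x = [1.0, 2.0]
p = class_probabilities(W, x)

# Pivot class probability from Eq. (49): 1 / (1 + sum over the non-pivot classes).
p_pivot_eq49 = 1.0 / (1.0 + sum(math.exp(sum(w * v for w, v in zip(Wc, x))) for Wc in W[:-1]))

# Adding the same vector to every W_c does not change the probabilities.
shift = [0.3, -0.7]
W_shifted = [[w + s for w, s in zip(Wc, shift)] for Wc in W]
p_shifted = class_probabilities(W_shifted, x)
```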
Now the likelihood of the data with respect to $W_c \ (\forall c \in [1 \ldots K])$ is defined as in Eq. (51), due to the i.i.d. property of the individual data elements.
$$P(\{(x_1, y_1), \ldots, (x_N, y_N)\} \mid [W_1, \ldots, W_K]) = \prod_{i=1}^{N} P(y = y_i \mid x_i, [W_1, \ldots, W_K]) = \prod_{i=1}^{N} P(y_i \mid x_i, W_{y_i}) \tag{51}$$
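Expressing the likelihood of the data using Eq. (50) results in the steps below.
$$[W_1, \ldots, W_K] = \arg\max_{[W_1, \ldots, W_K]} \prod_{i=1}^{N} P(y_i \mid x_i, W_{y_i}) = \arg\max_{[W_1, \ldots, W_K]} \log \left( \prod_{i=1}^{N} P(y_i \mid x_i, W_{y_i}) \right)$$
$$\text{Let } l(W_1, \ldots, W_K) = \log \left( \prod_{i=1}^{N} P(y_i \mid x_i, W_{y_i}) \right)$$
$$\Rightarrow l(W_1, \ldots, W_K) = \sum_{i=1}^{N} \log P(y_i \mid x_i, W_{y_i})$$
$$= \sum_{i=1}^{N} \log \left( e^{W_{y_i} \cdot x_i} \, \frac{1}{\sum_{c=1}^{K} e^{W_c \cdot x_i}} \right)$$
$$= \sum_{i=1}^{N} \left( (W_{y_i} \cdot x_i) - \log \left( \sum_{c=1}^{K} e^{W_c \cdot x_i} \right) \right)$$
$$= \sum_{i=1}^{N} W_{y_i} \cdot x_i - \sum_{i=1}^{N} \log \left( \sum_{c=1}^{K} e^{W_c \cdot x_i} \right)$$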
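The final expression for the log-likelihood translates directly into code. The sketch below is illustrative only: the function name, the zero-based class labels and the toy data are assumptions, and the log of the sum of exponentials is computed with the usual max-shift for numerical stability.

```python
import math

def log_likelihood(W, X, y):
    """l(W_1..W_K) = sum_i W_{y_i} . x_i - sum_i log sum_c exp(W_c . x_i).

    Classes are zero-based here (0 .. K-1), unlike the 1 .. K of the text.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        scores = [sum(w * v for w, v in zip(Wc, x_i)) for Wc in W]
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += scores[y_i] - log_norm
    return total

# Toy data: N = 4 points, d = 2 features, K = 3 classes.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
y = [0, 1, 2, 1]
W_zero = [[0.0, 0.0] for _ in range(3)]
W_some = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
ll_zero = log_likelihood(W_zero, X, y)   # every class has probability 1/K
ll_some = log_likelihood(W_some, X, y)
```

With all weights at zero each class has probability 1/K, so the log-likelihood equals −N log K, which gives a quick sanity check; maximizing this quantity over the weights (for example by gradient ascent) yields the fitted model.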
2.10 Regularization
Regularization is an attempt to reduce the sensitivity of the model with respect to fine changes in the input. Consider a model,
$$\hat{Y} = M(\Theta, X)$$
Now consider a simple mean squared loss over N points,
$$L(X) = \frac{1}{N} \lVert \hat{Y} - Y \rVert^2$$
Note that we have expressed the loss function in terms of the input points X. Taking the gradient of the loss function with respect to the input points results in a function g(Θ),
$$\nabla_X L(X) = g(\Theta)$$
The function is sensitive to the magnitudes of Θ, which implies that slight changes in the input are going to cause fluctuations in the model predictions. In order to avoid this, a usual trick is to add the magnitude of the weights or the absolute values of the parameters as a penalty to the loss function. This term is added with an appropriate scaling factor to fine tune the amount of regularization needed.
2.10.1 Regularization in gradient methods
An Lp norm for a k dimensional vector w, in the case of a linear regression or logistic regression problem, is defined as
$$L_p(w) = \sum_{i=1}^{k} |w_i|^p$$
A value of p = 1 gives the L1 norm, used in the lasso, and a value of p = 2 gives the L2 norm. Contours of Lp for various values of p are depicted in Fig. 14. These values are used as regularization factors in a loss function when fine tuning a
FIG. 14 Contours of the curves $(|x|^p + |y|^p)^{1/p} = c$ for p = [0.5, 1, 2, 3, 100] are shown in maroon, green, red, blue, and purple colors, respectively. Note that for a given value c, the contours for higher values of p in the Lp norm tend to take the shape of a maximal square, i.e., with corners where all the dimensions have the highest magnitude values. Observe closely that, although the contour for p = 1 has a square shape, it is still not maximal as just defined. A weaker reasoning for why L1 begets sparsity compared to L2 would be to assume that the optimization converges to solutions at the corners.
FIG. 15 Curves fitted to sinusoidal data $\hat{y} = \sin(x) + 0.1 \, U[0, 1]$ using L1 and L2 regularizations for least squares error minimization are shown here. The x-coordinate is expanded in dimensionality to a degree 10 polynomial, i.e., $x \mapsto (x^0, \ldots, x^{10})$, and predicted using the loss function $L(w) = \sum_x (w \cdot x - \hat{y})^2 + L_p(w)$, where w is also 11 dimensional. Actual data and predicted values using L1 and L2 regularizations are shown in green, red, and blue colors, respectively. In this scenario, L2 is better, as in the case of L1 several components of the w vector are zero.
machine learning model. An illustration of the two norms over a 2D scenario (i.e., k = 2) is shown in Fig. 15.
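As a small illustration of how the penalty enters the loss, the sketch below evaluates the Lp term exactly as defined above (without a 1/p root) and adds it to a squared error with a scaling factor. The names lp_penalty, regularized_loss and lam, as well as the example vectors, are assumptions made for this sketch.

```python
def lp_penalty(w, p):
    """L_p(w) = sum_i |w_i|^p, as defined in the text (no 1/p root)."""
    return sum(abs(wi) ** p for wi in w)

def regularized_loss(w, X, y, p, lam=1.0):
    """Squared error plus lam * L_p(w); lam is the scaling factor of the penalty."""
    sq = sum((sum(wi * xi for wi, xi in zip(w, row)) - t) ** 2 for row, t in zip(X, y))
    return sq + lam * lp_penalty(w, p)

# Two weight vectors with the same L2 penalty but different L1 penalties.
w_sparse = [1.0, 0.0]
w_dense = [0.6, 0.8]
l2_sparse, l2_dense = lp_penalty(w_sparse, 2), lp_penalty(w_dense, 2)   # both about 1.0
l1_sparse, l1_dense = lp_penalty(w_sparse, 1), lp_penalty(w_dense, 1)   # 1.0 and about 1.4

X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.0]
loss = regularized_loss(w_sparse, X, y, p=1, lam=0.5)
```

The two vectors are indistinguishable under the L2 penalty, while the L1 penalty charges the dense vector more, which is one informal way to see why L1 tends to favor sparse solutions.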
2.10.2 Regularization in other methods
In the case of DT based methods, the most common type of regularization corresponds to limiting the depth of a DT. Another way of regularizing a DT is to increase the minimum required size of any leaf, which makes the tree smaller. In the case of deep neural networks, dropout has a regularization effect. Another concept related to regularization is the momentum operation in deep nets, where an exponential average of previous gradients is used instead of a single gradient estimate in each iteration. In the case of the AdaBoost method, regularization corresponds to limiting either the depth of the DTs or the number of estimators. In any machine learning model, regularization corresponds to reducing its sensitivity to changes in the input.
2.11 Metrics in machine learning
A machine learning model needs to be evaluated for optimal choice of parameters, optimal performance on a data set, low bias and greater generalizability. Good performance on the training set gives a low bias model, which is nontrivial and able to capture the signal in the data. After demonstrable accuracy on the training data set comes generalizability. The algorithm should not over learn the training data; it should learn just enough that performance on unseen data is improved. Generalization is characterized by low variance. The points are summarized into four steps below.
• First identify the essential model configuration that results in decent training set accuracies
• First identify a low bias model
• Next identify a low variance model for better performance on test data, i.e., better generalizability
• Search over the parameter space and data subsets
2.11.1 Confusion matrix
In the case of a two class classification problem, a typical confusion matrix is defined over positive and negative classes. The steps are as below.
• Choose one of the two classes as positive, and the other as negative
• This choice varies from one domain to another and is subjective to the individual or the team involved
• The ground truth data (i.e., data set) can now be categorized into two parts: (i) actual positives and (ii) actual negatives
• The model predictions can now be categorized into two parts: (i) predicted positives and (ii) predicted negatives
• From these sets of data points, commonalities can be determined to form a confusion matrix (Table 6).
In the case of multiple classes, designating a single positive and a single negative class is not possible. In this case a matrix of predicted and actual classes is constructed. If the number of classes is k, then the dimensionality of this matrix is k × k. The elements on the diagonal, i.e., each (i, i)th cell, correspond to correct predictions. The elements off the diagonal correspond to wrong predictions, both horizontally and vertically. An illustration of the confusion matrix is given in Fig. 16. In this matrix, the off-diagonal elements in the column of a class correspond to false positives (FP) for that class. The off-diagonal elements in the row of a class correspond to false negatives (FN) for that class. The true positives are the elements on the diagonal for each class. The true negatives for a class are all the remaining elements, i.e., those outside both the row and the column of that class. The accuracy metric from the confusion matrix is Eq. (52). Here α denotes accuracy, the confusion matrix is denoted by CM[][], the number of classes is k and the total number of elements is N.

TABLE 6 Confusion matrix.
                      Actual positives    Actual negatives
Predicted positives   True positives      False positives
Predicted negatives   False negatives     True negatives
FIG. 16 Confusion matrix for the two class classification problem on the moons data set is shown here (x-axis: predicted label, y-axis: true label). The steps followed for data set generation and the classifier are identical to the process used in Fig. 6. For the classes 0 and 1, the X-axis in the plot is for the predicted class and the Y-axis is for the true class. The true class elements of a row are spread across columns and the elements of the matrix are normalized row wise, i.e., the fractions along a row sum to 1. In the plot, the row for true class 0 reads 0.95 and 0.05, and the row for true class 1 reads 0.24 and 0.76. The only correct predictions are along the diagonal, i.e., each (i, i)th element of the matrix, and all off-diagonal elements along a row are wrong predictions. The more correct the predictions for a class, the darker the blue hue of its cell in the plot of the confusion matrix.
$$\alpha = \frac{\sum_{i=1}^{k} CM[i][i]}{N} \tag{52}$$
Precision for the jth class is given by Eq. (53).
$$\pi[j] = \frac{CM[j][j]}{\sum_{i=1}^{k} CM[i][j]} \tag{53}$$
Recall for the jth class is given by Eq. (54).
$$\rho[j] = \frac{CM[j][j]}{\sum_{i=1}^{k} CM[j][i]} \tag{54}$$
Same equations are applicable for the two class problem as well.
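Eqs. (52) to (54) can be sketched in a few lines, assuming the convention implied by the equations, i.e., rows of CM index the true class and columns index the predicted class. The function names and the small label lists are made up for illustration.

```python
def confusion_matrix(y_true, y_pred, k):
    """CM[t][p] counts points of true class t predicted as class p."""
    cm = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def accuracy(cm):
    """Eq. (52): sum of the diagonal divided by the total count N."""
    n = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / n

def precision(cm, j):
    """Eq. (53): diagonal element over the sum of column j."""
    col = sum(cm[i][j] for i in range(len(cm)))
    return cm[j][j] / col if col else 0.0

def recall(cm, j):
    """Eq. (54): diagonal element over the sum of row j."""
    row = sum(cm[j])
    return cm[j][j] / row if row else 0.0

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, k=3)
acc = accuracy(cm)
prec_2, rec_2 = precision(cm, 2), recall(cm, 2)
```

For k = 2 the same functions reproduce the usual two class definitions, with the positive class chosen as the index j.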
2.11.2 Precision-recall curve
Another metric for model assessment is the precision-recall curve, also called the PR curve. Strictly speaking, this is not a curve but a 2D scatter plot on which various classifiers are plotted. We want to select the classifier that gives high precision and high recall scores. This can be used for model selection as well. However, it is customary to use the PR curve after the model selection process, for threshold selection. Most classifiers can be configured to emit an
FIG. 17 PR curve for the moons data; x-axis: recall, y-axis: precision. Moons data for a two class classification problem is generated using sklearn's make_moons() function. A logistic regression classifier is fit using the LBFGS solver for optimization. The classifier's predictions are refined further by imposing thresholds on the predicted probabilities of points in the test data set. The plot of the precision and recall scores corresponding to a list of thresholds, depicted here, is called the PR curve.
equivalent of a probability value for each of the predictions. The probability value can further be thresholded to refine the prediction quality. An illustration of the PR curve for a logistic regression classifier on the moons data set is shown in Fig. 17. The steps in PR curve formation are summarized in the list below.
• Perform model selection (using CV or other means)
• Configure the model to emit an output score (such as a probability score)
• For each threshold on the score, determine the confusion matrix values
• Compute precision and recall scores
• Plot them on a 2D plot of the PR curve
• Each threshold becomes a point in the PR curve; these points are all connected to form a continuous looking shape
• Pick the threshold that gives the highest precision and recall scores among others
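The threshold sweep in the steps above can be sketched as follows. The scores and labels are made-up values standing in for a classifier's predicted probabilities on a test set. Since precision and recall usually trade off against each other, this sketch picks the threshold with the highest F1 score; that choice is an assumption of the sketch and not something the text prescribes.

```python
def precision_recall_points(scores, labels, thresholds):
    """Return (threshold, precision, recall) for each threshold on the score."""
    points = []
    for t in thresholds:
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 1)
        prec = tp / (tp + fp) if tp + fp else 1.0   # convention when nothing is predicted positive
        rec = tp / (tp + fn) if tp + fn else 0.0
        points.append((t, prec, rec))
    return points

def f1(point):
    _, prec, rec = point
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Hypothetical predicted probabilities and true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
points = precision_recall_points(scores, labels, [0.25, 0.5, 0.75])
best = max(points, key=f1)   # (threshold, precision, recall) with the highest F1
```

Plotting the recall and precision values of `points` against each other gives the PR curve for this toy example.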
2.11.3 ROC curve
The receiver operating characteristic (ROC) curve is also a 2D plot, between the recall and fall-out scores of a classifier. Fall-out corresponds to how many of the actual negatives are leaking in as FPs. It is the ratio of FPs to the total number of actual negatives. Recall corresponds to how many of the actual positives are retrieved as true positives by the classifier. While the classifier should retrieve the positives correctly, it should not allow leaking in from the negative lot.
FIG. 18 ROC curve for the moons data (AUC = 0.9696); x-axis: false positive rate, y-axis: true positive rate. The data set, the classifier and the process of applying thresholds are identical to the ones described in Fig. 6. However, here the plot is between the false positive rate $FPR = \frac{FP}{FP + TN}$ and the true positive rate $TPR = \frac{TP}{TP + FN}$. This plot of FPR and TPR values at various thresholds is called the ROC curve. An important metric computed from this curve is the area under the curve (AUC), which is also indicated in the plot. The AUC metric is numerically computed from the plot by slicing the region into small intervals.
The process of construction of the curve is similar to PR curve construction, although here recall is on the Y-axis and fall-out is on the X-axis. An illustration of the ROC curve is shown in Fig. 18 for a DT classifier on the moons data set. A comparison of different classifiers for their behavior with respect to the shapes of the input data points is shown in Fig. 19. Note that not all classifiers are suitable for a given problem at hand. The best methodology is assessed both by the judgement and reasoning of the engineer and by benchmarking studies involving PR curves and ROC curves. The best scoring threshold value in a ROC curve is the point which gives the highest recall with the least fall-out. In an overall sense, to assess the quality of the chosen classification methodology and the selected parameters, the ROC curve can be used to compute the best (recall, fall-out) point across diverse methodologies and their configurations. One such assessment is by the area under the curve (AUC). This value is numerically computed by partitioning the ROC curve into vertical slices and summing up the areas of the slices. The higher the AUC score, the better the classifier.
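The construction of the ROC curve and the slice-based AUC computation can be sketched as below. The scores and labels are made-up values, the function names are hypothetical, and the sketch assumes that all scores are distinct (tied scores would need to be grouped into a single step).

```python
def roc_points(scores, labels):
    """Return (fall-out, recall) pairs obtained by lowering the threshold one point at a time."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tp += 1          # one more actual positive retrieved
        else:
            fp += 1          # one more actual negative leaking in
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve as a sum of vertical trapezoidal slices."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical predicted probabilities and true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
```

For this toy data the area equals the fraction of (positive, negative) pairs in which the positive point receives the higher score, here 14 out of 16.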
2.11.4 Metrics for the multiclass classification
The notions of precision and recall depend on setting up a positive and a negative class. However, in a multiclass classification scenario, there is no single positive class. Precision and recall scores are therefore calculated for each and every class, treating that class in turn as the positive one (Fig. 20).
FIG. 19 Comparison of classifiers on different types of data distributions is depicted here. The data distributions are of the moons, concentric circles, and linearly separable types, generated using scikit-learn's make_moons(), make_circles() and make_classification() functions, and are shown along the three rows from top to bottom. The data sets are of the form $\{(x_1, x_2, y) \mid y \in \{0, 1\}\}$, to which a small amount of noise is purposefully added to mimic a more realistic scenario of inseparability of a small fraction of points. For the moons and the circles data points, Gaussian noise with zero mean and variance of 0.3 is added to each of the coordinates, and for the linearly separable case uniform noise from U[0, 2] is added. The first column shows the raw data distribution with red and blue colors for the points of the two classes, and the following columns from left to right show Nearest Neighbors, Linear SVM, Neural Net, and Decision Tree, respectively.
FIG. 20 (A) Per class PR curves (x-axis: precision, y-axis: recall). An illustration of the PR curve for multiclass classification on the MNIST digits (10 classes) data set is shown here. A logistic regression classifier from the scikit-learn library is used on a 60%-40% train-test split of the data set. (B) Per class ROC curves (x-axis: false positive rate, y-axis: true positive rate). An illustration of the ROC curve for the MNIST digits data set is shown here. Note that there is not just one PR or ROC curve, but one curve for each of the classes.
In order to obtain a consensus score, the mean average precision metric is used for summarizing the performance of a multiclass classification model. In simple terms, the mean average precision (MAP) is computed as the average, over the classes, of the area under the PR curve of each class,
$$MAP = \frac{\sum_{i=1}^{K} AP(i)}{K}$$
where AP(i) is the area under the PR curve for the ith class.
3 Practical considerations in model building
The supervised learning methods discussed so far may work well in ideal scenarios. However, real world data come with a number of problems at various stages of the machine learning model building pipeline. Some of the concerns a practitioner would face include the following.
• Noise in the data (Yakowitz and Lugosi, 1990; Lui et al., 2017)
• Missing values (Zhang et al., 2013)
• Class imbalance (Alejo et al., 2013; Soda, 2011)
• Data distribution changes (Hutter and Zaffalon, 2005; Chechik et al., 2010)
3.1 Noise in the data
The notion of noise in the data ranges from the physical properties of measuring devices to semantic levels of human understanding. The definition of noise therefore changes from one problem scenario to another. Any undesirable, unwanted or nonpattern element can be defined as noise. The best way to eliminate noise is to preprocess the data. Preprocessing need not be a simple task; it can be an involved effort by a team of customers, data engineers, developers, scientists and management to identify and exclude noisy patterns and provide clean data. The solution can even include engineering interventions, such as forced color coding of certain entities in a computer vision task or a soundproof recording room in a speech processing task. The cleaner the data, the easier it is to build and maintain a machine learning model. Sometimes the noise can have a highly consistent pattern. In such cases, it is also common to build machine learning models to identify and eliminate the noise. In summary, there are four important aspects of noise.
• Noise in the machine learning practitioner's world is typically defined at the semantic level
• The noise patterns may be strong or weak. If the pattern is strong, it is a good idea to eliminate it by creating rules
• Noise results in sensitivity of the accuracy with respect to slight perturbations in the input values. Sensitivity analysis should detect noise.
• Label corruption is a form of noise.
3.2 Missing values
Missing values are a regular phenomenon in the real world during the data collection phase (Zhang et al., 2013). However, every classifier needs all the columns that it takes as input. Any missing value needs to be filled in for the classifier or regression model to generate output. The process of coming up with a proposed value for the missing value is called missing value imputation. Some of the popular missing value imputation techniques are highlighted below.
• Numeric attributes: mean, median, or mode among the values of that attribute in the data set
• Categorical entity: mode among the values of that attribute in the data set
• Inferring from other data points based on a notion of neighbors
• Inferring from other attributes of a given data point (conditional inference)
• Missing values may also be imputed based on adjacency of attributes or columns or adjacency of rows
• Fill them with zero
• Fill them with a constant positive value
• It may be possible to remove the data points containing missing values (although this is very rare)
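The simplest of the techniques listed above, filling a column with its mean, median or mode, can be sketched as follows. The function name, the use of None as the missing value marker and the example columns are assumptions of this sketch.

```python
from statistics import mean, median, mode

def impute_column(values, strategy="mean"):
    """Replace None entries with a statistic of the observed values of the column."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [30, None, 40, 50, None]            # numeric attribute
colors = ["red", "blue", None, "red"]      # categorical attribute
ages_filled = impute_column(ages, "mean")
colors_filled = impute_column(colors, "mode")
```

Neighbor-based or conditional imputation follows the same pattern, with the fill value computed per row from similar data points or from the other attributes instead of from the whole column.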
3.3 Class imbalance
The problem of class imbalance is the lack of availability of training data for a class that corresponds to a rare event or for which labeling is highly costly. The rare events are peculiar to each domain and occur only sporadically. Some examples are fraudulent credit card transactions in the banking domain, breakdown of an electrical generator in the power grid sector or a disease condition in healthcare. Due to the predominance of the abundant class of points, the classifier will be biased toward the majority class. Even a dummy predictor that outputs the label of the majority class for any input is going to be highly accurate, as the majority class dominates as in Eq. (55). The precision will also be high (Eq. 56), because only the points of the under represented class contribute to FP. However, the class specific recall for the under represented class is drastically lower.
$$\lim_{TP \to \infty} \frac{TP + TN}{TP + TN + FP + FN} \to 1 \tag{55}$$
$$\lim_{TP \to \infty} \frac{TP}{TP + FP} \to 1 \quad (\because 0 \le FP \ll \infty) \tag{56}$$
In order to overcome the problem of representation of data points from either of the classes, some of the techniques are as follows.
• Over sample the under represented class (e.g., SMOTE)
• Under sample the over represented class
• Increase the weight of the smaller class (class weights vector)
• Increase the weights of the points in the smaller class
• Synthetic data augmentation
• Preprocess to exclude the bulk of the over represented data
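The effect of a dummy majority class predictor, and two of the remedies listed above, can be sketched on a made-up data set with 95 majority and 5 minority points. The oversampling below simply duplicates minority points at random, which is a simpler stand-in for SMOTE, and the inverse-frequency class weight formula is one common choice assumed here; all function names are hypothetical.

```python
import random
from collections import Counter

def oversample_minority(X, y, seed=0):
    """Randomly duplicate points of the smaller classes until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

def class_weights(y):
    """Inverse-frequency weights: n / (n_classes * count of the class)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

y = [0] * 95 + [1] * 5                      # class 1 is the rare class
X = [[float(i)] for i in range(100)]

dummy_pred = [0] * len(y)                   # always predicts the majority class
dummy_accuracy = sum(p == t for p, t in zip(dummy_pred, y)) / len(y)
minority_recall = sum(p == 1 and t == 1 for p, t in zip(dummy_pred, y)) / y.count(1)

X_bal, y_bal = oversample_minority(X, y)
weights = class_weights(y)
```

The dummy predictor reaches 95% accuracy while retrieving none of the rare class, which is why accuracy alone is a misleading metric under class imbalance.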
3.4 Model maintenance
Maintenance of a machine learning solution that is already deployed in production takes up a major part of the day-to-day activities of a data scientist (Yang et al., 2017; Benjamini and Hochberg, 1995). Unlike machine learning competitions and challenges, where a model is demonstrated on one data set and published, in an industry scenario the life cycle of a machine learning model starts after it is put into production use. The nature of the data changes over time, depending on the stakeholder entities involved in data generation. An example is an e-commerce website that suggests advertisements to users based on their search and access patterns. A multitude of factors play a role here, including seasonality, changes in user behavior, events, and ongoing trends. Life events play a significant role in user behavioral patterns. For instance, a parent with a newborn starts to search for infant items in the initial months, then for child items, and so on. A single model deployed in production does not perform well when the test data on which it reported high accuracy at deployment time differs in nature from the production data that reaches the model. Such a scenario is commonly referred to as data distribution difference. The usual steps in case of data distribution difference are summarized as follows.
- Set up alarms to detect differences between the model training and/or test data and the production data
- Define metrics to calculate how similar or dissimilar a pair of data sets is
- Determine the data points on which the prediction is ambiguous and send them for labeling
- Retrain the model periodically or based on an alarm
- As new knowledge is acquired or new features are discovered, upgrade the model
- As a constituent or dependent submodel improves, retrain the model
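As an illustration of the first two steps, the sketch below compares one numeric feature of the training data against production data using a histogram-based Kullback–Leibler divergence and raises an alarm above a threshold. The function name, the bin count, the additive smoothing, and the threshold value are all assumptions chosen for the example. In practice they would be tuned per feature.

```python
import numpy as np

def kl_divergence(reference, production, bins=10, smoothing=0.5):
    """Estimate KL(reference || production) for a 1-D feature.

    Both samples are binned on a shared range; `smoothing` is added to every
    bin count so that empty bins do not produce infinities.
    """
    lo = min(reference.min(), production.min())
    hi = max(reference.max(), production.max())
    p, _ = np.histogram(reference, bins=bins, range=(lo, hi))
    q, _ = np.histogram(production, bins=bins, range=(lo, hi))
    p = (p + smoothing) / (p + smoothing).sum()
    q = (q + smoothing) / (q + smoothing).sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)          # feature values seen at training time
prod_same = rng.normal(0.0, 1.0, 5000)      # production data, same distribution
prod_shifted = rng.normal(1.5, 1.0, 5000)   # production data after a shift

THRESHOLD = 0.1  # hypothetical alarm level
alarm_same = kl_divergence(train, prod_same) > THRESHOLD
alarm_shifted = kl_divergence(train, prod_shifted) > THRESHOLD
```

An alarm of this kind would typically trigger the labeling and retraining steps listed above.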
4 Unsupervised methods
The bulk of the data in the machine learning world is unlabeled, because the labeling process is costly in terms of both data acquisition and error-free manual annotation (Hinton and Sejnowski, 1999; Ghahramani, 2004). When only a handful of labels are available, unsupervised methods provide a mechanism to propagate the labels to thousands of newer points. For supervised methods, the word “learn” is quite understandable: it means calculating an optimal mapping function from the input dimensional space to the output dimensions or “label.” The word “learn” is less intuitive in the unsupervised scenario, where it corresponds to learning structure in the data. Based on how the structure is represented and on the learning process itself, there are a number of unsupervised learning methods, some of which are listed below.
- Clustering (Xu and Wunsch, 2008; Schubert et al., 2017)
- Matrix factorization (Lee and Seung, 2000; Salakhutdinov and Mnih, 2008)
- Principal component analysis (Ding, 2006; Agarwal et al., 2010; Jolliffe, 2005)
- Graphical methods (Edwards, 2000; Heckerman and Shachter, 1995; Pacer and Griffiths, 2011; Zhang et al., 2014)
4.1 Clustering
Clustering is more of a process, a paradigm, or a philosophy than a single method (Xu and Wunsch, 2008). It is a process of aggregating related entities together and segregating dissimilar entities from each other. The representation of aggregates, their properties, and the quantification of the quality of aggregation and segregation differ from one algorithm to another. Each algorithm differs from the others in its capabilities and limitations, expressed via the nature of the data that can be clustered, the characteristics of the algorithm itself, the resultant clusters, and the representation of clusters. Each algorithm also comes with its own parameters, execution time, and characteristic behavior. Some of the popular algorithms include the following.
- K-means
- Hierarchical clustering
- Density-based clustering
- Matrix factorization
- Principal component analysis
4.1.1 K-means
The K-means algorithm (Hamerly and Elkan, 2003; Arthur and Vassilvitskii, 2007) determines disconnected blobs of data based on the notion of the centroid of a cluster, the distance between points and centroids, and the membership of points to centroids, as in Eq. (57). Though the formulation looks like a search over all possible subsets, the actual implementation of the K-means algorithm has much lower time complexity and converges much faster than exponential time. The iterative algorithm is given in Algorithm 11. There are diverse variations of the K-means algorithm, such as K-medoids and fuzzy K-means. However, in all of these algorithms the spirit of the formulation is maintained. A very different approach is to formulate K-means as a gradient descent algorithm, as in Eq. (58). The gradient descent formulation of K-means converges much faster than the iterative membership-based approach and is also suitable for online learning, as the centroid update corresponds to a weight update (Eq. 59).

$$\chi^* = \arg\min_{\chi'} \sum_{i=1}^{k} \sum_{x \in \chi'[i].set} (x - \chi'[i].cntr)^2 \tag{57}$$
ALGORITHM 11 K-means.
Require: X /*d-dimensional data*/, k /*number of clusters*/, N /*number of iterations*/
(∀c ∈ [1…k]) : χ[c].cntr = RAND(d, 1) /*k points, each of d dimensions*/
for iter = 1 : N do
    (∀c ∈ [1…k]) : χ[c].set = {}
    for x ∈ X do
        c* = arg min_{c ∈ [1…k]} DIST(χ[c].cntr, x)
        χ[c*].set = χ[c*].set ∪ {x}
    end for
    (∀c ∈ [1…k]) : χ[c].cntr = MEAN(χ[c].set)
end for
return χ
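$$w^* = [w_1 \ldots w_k]^* = \arg\min_{[w_1 \ldots w_k]} \sum_{i=1}^{k} \sum_{x \in X} (x - w_i)^2 \tag{58}$$

$$w^{(i+1)} = w^{(i)} - \alpha \left. \frac{\partial \sum_{i=1}^{k} \sum_{x \in X} (x - w_i)^2}{\partial w} \right|_{w = w^{(i)}} \tag{59}$$

Here the inner sum is to be read as running over the points of X that are currently closest to $w_i$. Taken over all of X for every i, the objective would pull every $w_i$ toward the global mean.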
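A minimal NumPy sketch of the iterative procedure and of a single online gradient step follows. It departs from the pseudocode in two labeled ways: the initial centroids are drawn from the data points rather than generated at random, and the loop stops early once the centroids no longer move. The function names and the test data are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Iterative K-means: assign points to the nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise centroids with k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: memberships can no longer change
        centroids = new_centroids
    return centroids, labels

def online_step(centroids, x, alpha=0.1):
    """One gradient step on (x - w_c)^2 for the centroid w_c nearest to x."""
    c = int(np.linalg.norm(centroids - x, axis=1).argmin())
    centroids[c] += 2.0 * alpha * (x - centroids[c])  # w <- w - alpha * d/dw (x - w)^2
    return c

# Two well-separated hypothetical blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
               rng.normal([10, 10], 0.5, size=(50, 2))])
centroids, labels = kmeans(X, k=2)
```

The online step touches only the nearest centroid, which is what makes the gradient formulation usable when points arrive one at a time.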
4.1.2 Hierarchical clustering
Hierarchical clustering (Jonyer et al., 2001; Bateni et al., 2017) is a commonly used technique, especially in biological data analysis, including the study of evolutionary characteristics of gene and protein sequences. In this form, the clusters are visualized as dendrograms, which are essentially tree representations of points based on similarity or dissimilarity metrics. Distance scores are defined between every pair of points. The distance metric is highly customizable and can capture any notion of dissimilarity, for example the Euclidean measure, the Manhattan measure, or the negative of a similarity score. Once distances are defined between every pair of points, the clustering algorithm proceeds by iteratively identifying subclusters. The procedure can be top down or bottom up. In the top down (divisive) process, all points are initially assumed to be in a single cluster. As the algorithm proceeds, the single cluster is split into two or more parts. However, divisive clustering has to examine an exponential number of subsets to determine where to split. This problem is mitigated by the bottom up approach. In this formulation, all points are initially assigned to individual clusters. The clusters are then merged to form the next level by defining a notion of cluster-to-cluster distance. The algorithm for bottom up hierarchical clustering is shown in Algorithm 12.
ALGORITHM 12 Bottom up hierarchical clustering.
Require: X /*data set of N points*/, k /*number of clusters*/
/*initialize*/
χ = {}
for i = 1 : N do
    C_i = {x_i}
    χ = χ ∪ {C_i}
end for
while |χ| > k do
    (i*, j*) = arg min_{i, j ∈ [1…|χ|] : i ≠ j} DIST(C_i, C_j)
    χ = (χ − {C_i*, C_j*}) ∪ {C_i* ∪ C_j*}
end while
return χ
[Figure 21 panels: Scattered points; Dendrogram]
FIG. 21 A dendrogram constructed for 10 points occurring in three groups of 3, 3, and 4 points. The dendrogram recovers the natural grouping of the points, with leaves occurring in these numbers, as shown by the three differently colored sets of leaf nodes. Agglomerative clustering is carried out with the single linkage option and the Euclidean distance metric between pairs of points.
The dendrogram is shown in Fig. 21, where a set of points was generated using the make_blobs utility of the scikit-learn library. The points are then clustered using agglomerative clustering and the dendrogram of pairwise distances is plotted.
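The bottom up procedure of Algorithm 12 can be sketched directly in NumPy. The example below assumes single linkage (cluster-to-cluster distance is the smallest pairwise distance) and uses ten hand-written points in groups of 3, 3, and 4 instead of make_blobs, so that it has no dependency on scikit-learn. It is a naive illustration, not an efficient implementation.

```python
import numpy as np

def single_linkage(X, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[i] for i in range(len(X))]              # every point starts alone
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise Euclidean distances
    while len(clusters) > k:
        best_d, best_a, best_b = np.inf, None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: closest pair of members across the two clusters
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best_d:
                    best_d, best_a, best_b = d, a, b
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
    return clusters

# Hypothetical points in three natural groups of 3, 3, and 4
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 0], [10, 1], [11, 0],
              [5, 10], [5, 11], [6, 10], [6, 11]], dtype=float)
clusters = single_linkage(X, k=3)
```

Recording the distance at which each merge happens would give the heights needed to draw the dendrogram.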
4.1.3 Density-based clustering
The density-based clustering algorithm DBSCAN (Schubert et al., 2017) is based on recursively determining chains of dense points that satisfy radius and density criteria. The algorithm is useful for determining arbitrarily shaped clusters (Algorithm 13). However, the algorithm is mildly sensitive to the starting point, resulting in slight ambiguity at the cluster boundaries, although these issues are negligible when dealing with high volumes of data. The running time is dominated by the neighborhood queries: with a spatial index it is typically close to O(n log n) in the number of data points n, and it degrades to O(n²) in the worst case or when all pairwise distances are computed.
ALGORITHM 13 DBSCAN.
Require: X /*data set*/, η /*number of points in radius*/, ρ /*radius*/
/*initialize*/
χ = [] /*list of clusters*/
Γ = {} /*dense points*/
for all x ∈ X do
    if |{x′ ∈ X : |x − x′| ≤ ρ}| ≥ η then
        Γ = Γ ∪ {x}
    end if
end for
while |Γ| > 0 do
    C = {}
    C = {x} for some x ∈ Γ /*cluster starting from some dense point*/
    Γ′ = {} /*gather all dense points of a cluster, iteratively*/
    while |C − Γ′| > 0 do
        Γ′ = C
        C = {x′ ∈ Γ : (∃x ∈ Γ′) |x′ − x| ≤ ρ} /*identify any other reachable dense points*/
    end while
    C = {x ∈ X : (∃x′ ∈ Γ′) |x − x′| ≤ ρ} /*determine member points*/
    χ = χ ⊙ C /*append to the list of clusters*/
    Γ = Γ − Γ′ /*focus on remaining core points*/
end while
return χ
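A direct, unoptimized transcription of these steps into NumPy is sketched below. It computes all pairwise distances up front, so it is quadratic in the number of points. The function name and the toy data are assumptions for the example. As in the pseudocode, points that are within the radius of no dense point end up in no cluster (noise).

```python
import numpy as np

def dbscan(X, radius, min_pts):
    """Return a list of clusters, each a set of point indices."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    near = D <= radius   # near[i, j]: point j lies within the radius of point i
    # Dense (core) points have at least min_pts neighbours, counting themselves
    core = {int(i) for i in np.flatnonzero(near.sum(axis=1) >= min_pts)}
    clusters = []
    while core:
        frontier = {next(iter(core))}   # start from some dense point
        reached = set()
        while frontier - reached:       # keep going while new dense points appear
            reached |= frontier
            frontier = {j for i in reached for j in core if near[i, j]}
        # Members: every point within the radius of a reached dense point
        members = {int(j) for i in reached for j in np.flatnonzero(near[i])}
        clusters.append(members)
        core -= reached                 # continue with the remaining dense points
    return clusters

# Hypothetical data: two tight groups and one isolated outlier
X = np.array([[0, 0], [0, 0.5], [0.5, 0], [0.5, 0.5],
              [5, 5], [5, 5.5], [5.5, 5], [5.5, 5.5],
              [20, 20]], dtype=float)
clusters = dbscan(X, radius=1.0, min_pts=3)
```

Replacing the dense distance matrix with a spatial index for the radius queries is what practical implementations do to avoid the quadratic cost.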
4.2 Comparison of clustering algorithms over data sets
No clustering algorithm performs best on every data set. For instance, K-means works well on globular data but not on arbitrarily shaped clusters. DBSCAN handles arbitrarily shaped clusters, but its speed of execution can become an issue. A comparison of the clustering algorithms discussed in this section on some well-known synthetic data sets is illustrated in Fig. 22.
[Figure 22 panels: rows S curve, Circles, Moons, Blobs; columns Agglomerative, K-Means, DBSCAN; each panel is annotated with the number of clusters found (nclus)]
FIG. 22 Visualizations of some of the major clustering algorithms against different data distributions (shapes). Four types of data distributions in 2D are considered (row wise): (i) S curve, (ii) two concentric circles, (iii) moons, and (iv) three blobs. Three clustering algorithms are evaluated on each data set (column wise): (i) Agglomerative, (ii) K-means, and (iii) DBSCAN. Both Agglomerative and K-means are configured a priori to determine three clusters. For DBSCAN, a radius of 0.1 units and a minimum of 100 samples are specified. For the data sets, 1000 points are generated using scikit-learn’s make_s_curve, make_circles, make_moons, and make_blobs utility functions, with a noise of 0.1 for the first three. It is important to note that a change in the parameter configuration of any of the algorithms or in the data set distributions affects the outcome of the clustering, and what is shown here is only an illustration of the intuition behind these algorithms. The parameters of the algorithms were chosen by hand after several iterations of visualization, which will not be possible in a real-world scenario. One can note that arbitrarily shaped clusters are determined well by DBSCAN, apart from a few noise points. However, determining the parameters of the algorithm is the main challenge. K-means recovers natural globular groups of points when they are well separated. However, it injects unwanted clusters when the points are not globular. Agglomerative clustering performs similarly to K-means. However, it is sensitive to linkages between natural groups of points. Though the points are illustrated in 2 dimensions, the concepts extend naturally to multidimensional points. Feature expansion or kernel-based pairwise distances are applicable as well for points in high dimensional spaces.

4.3 Matrix factorization
Matrix factorization techniques (Lee and Seung, 2000; Salakhutdinov and Mnih, 2008) are commonly considered when dealing with recommendation systems. The scenario corresponds to a group of m entities mapping to a group of n other entities. A very common example is movie recommendation for users. In this case the task is to determine proximity and distance between a movie and a user as if both were the same type of entity. In order to obtain such a representation, in which movies and users can be compared and matched, they both need to be cast as vectors of identical dimension. Consider a matrix $X_{m \times n}$ corresponding to m users and n movies. The entries of the matrix can carry any semantics, such as a ranking. We need a representation of each user and each movie as a k dimensional vector. One of the methods, based on singular value decomposition (SVD), is illustrated below. With regularization added, the formulation takes the form of Eq. (60).

$$X_{m \times n} = A_{m \times k} B_{k \times n}$$

$$A^*, B^* = \arg\min_{A, B} |X - AB|^2$$

$$A^*, B^* = \arg\min_{A, B} |X - AB|^2 + |A| + |B| \tag{60}$$
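One way to minimize an objective of the form of Eq. (60) is alternating least squares, sketched below. This is a swapped-in technique chosen for its simplicity, not the method of the chapter. It assumes a fully observed matrix X, squared Frobenius norms as the regularizers, and a hypothetical weight `lam`. A real recommender matrix has missing entries that would need masking.

```python
import numpy as np

def factorize(X, k, lam=1e-3, n_iter=100, seed=0):
    """Alternating least squares for min |X - AB|^2 + lam * (|A|^2 + |B|^2)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    A = rng.normal(size=(m, k))   # one k-dimensional row vector per user
    B = rng.normal(size=(k, n))   # one k-dimensional column vector per movie
    I = np.eye(k)
    for _ in range(n_iter):
        # Fix B and solve the ridge problem for A: A (B B^T + lam I) = X B^T
        A = np.linalg.solve(B @ B.T + lam * I, B @ X.T).T
        # Fix A and solve the ridge problem for B: (A^T A + lam I) B = A^T X
        B = np.linalg.solve(A.T @ A + lam * I, A.T @ X)
    return A, B

# Hypothetical 6 users x 5 movies matrix of exact rank 2
rng = np.random.default_rng(42)
X = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))
A, B = factorize(X, k=2)
relative_error = np.linalg.norm(X - A @ B) / np.linalg.norm(X)
```

After factorization, row i of A and column j of B live in the same k dimensional space, so their dot product scores how well user i matches movie j.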
4.4 Principal component analysis
Principal component analysis (PCA) (Agarwal et al., 2010; Ding, 2006; Jolliffe, 2005) is a popular technique in the machine learning community for dimensionality reduction. The idea is to calculate a handful of vectors from the input data that are representative of the whole data. Any new input can then be represented simply by its dot products with the representative vectors. The effect is that the input dimensionality reduces drastically. Consider a hypothetical data set of human faces: one-megapixel grayscale photographs, each of width and height 1000 × 1000. Every pixel is a feature, so an input image is a vector of $10^6$ dimensions. Assume one is able to identify some 10 vectors (each of the input dimensionality, i.e., $10^6$) representative of the eyes, ears, mouth, nose, and head regions. The input image can now be represented by its dot product with each of these 10 vectors, resulting in just 10 numbers. This is equivalent to transforming $10^6$ dimensions into 10 dimensions. How to automatically identify the representative vectors is the problem addressed by PCA. Consider data $X_{N \times d}$ consisting of N data points, each having d dimensions. The eigenvectors of the covariance (scatter) matrix of the mean-centered data give the directions of maximal spread of the data points, and the corresponding eigenvalues measure that spread. The top k eigenvalues and their corresponding eigenvectors are selected, and any new input is transformed into the k dimensional space (Eq. 61).

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} X[i]$$

$$(\forall i \in [1 \ldots N]) : X[i] = X[i] - \bar{x}$$

$$([\lambda_1, \ldots, \lambda_d], [v_1, \ldots, v_d]) = \mathrm{eig}(X^T X) \quad [(\forall i \le j \in [1 \ldots d]) : \lambda_i \ge \lambda_j]$$

$$x_{d \times 1} \mapsto x'_{k \times 1} = (x \cdot v_1, \ldots, x \cdot v_k) \tag{61}$$
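The steps of Eq. (61) map onto a few NumPy calls. The sketch below uses `numpy.linalg.eigh`, which is appropriate because $X^T X$ is symmetric. The function name and the synthetic data (points scattered along one direction in 2D) are assumptions made for illustration.

```python
import numpy as np

def pca(X, k):
    """Project N x d data onto its top-k principal directions."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # centre the data
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)    # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]           # indices of the k largest
    components = eigvecs[:, order]                  # d x k, one eigenvector per column
    return Xc @ components, components, eigvals[order], mean

# Hypothetical data: 300 points spread along the direction (1, 1) with small noise
rng = np.random.default_rng(0)
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)
t = rng.normal(0.0, 5.0, 300)
X = np.outer(t, direction) + rng.normal(0.0, 0.1, (300, 2)) + np.array([3.0, -2.0])

Z, components, eigvals, mean = pca(X, k=1)
alignment = abs(components[:, 0] @ direction)   # close to 1 when the axis is recovered
```

A new point x would be transformed with `(x - mean) @ components`, which is the dot product mapping of Eq. (61).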
The eigenvalue computation is carried out through numerical methods, of which the Golub and Kahan (1965) method is the most widely used. The procedure is to employ Givens row and column transformations repeatedly to construct a bidiagonal matrix. There are several variants of PCA, such as the Kernel, Sparse, Truncated, and Incremental forms (Schraudolph et al., 2006; Demšar et al., 2012). Application of the PCA algorithm to the face recognition problem is shown in Fig. 23. The eigenvectors are of the same dimension as the input image. These vectors are then scaled between 0 and 255 and displayed back as images to inspect which pixels constitute the significant values of the eigenvectors. Application of PCA for detecting the prominent direction in a point cloud originating from the measurement of a mechanical part is shown in Fig. 24. The axis detection problem is significant in mechanical part quality assessment scenarios.
[Figure 23 panels: Actual Points (2 + 6 dimensions); PCA recovered (2 dimensional); NMF recovered (2 dimensional)]
FIG. 23 Principal component analysis (PCA) and nonnegative matrix factorization (NMF)-based lower dimensional feature transformation of data points in 8 dimensions. A set of 1000 data points is generated from four natural groups of points, shown in four distinct colors (leftmost panel). The points are then transformed into 8 dimensional space by the transformation $(x_1, x_2) \mapsto (R(), R(), x_1, x_2, R(), R(), R(), R())$, where R() is a function that generates a Gaussian random number 1 + 0.5 * N(0, 1). The input data is then transformed using PCA into a lower dimensional space by specifying the number of components as 2, i.e., considering only the top two eigenvectors (middle panel). The dot product of each eigenvector with an 8 dimensional point generates a coordinate. One can observe that the original four groups of points in the 8 dimensional space are recovered after PCA-based feature reduction, as indicated by the four distinctly colored groups. Similar to the PCA exercise, NMF-based factorization is carried out for two components, i.e., $X_{1000 \times 8} = W_{1000 \times 2} H_{2 \times 8}$. Each of the data points is then transformed using the H matrix and the points are plotted in 2 dimensions. One can observe that both PCA and NMF are able to recover the natural groups of points in the input data, despite the addition of noisy dimensions.
[Figure 24: 2D scatter plot; horizontal axis from −10 to 10, vertical axis from −15 to 15]
FIG. 24 Principal component analysis is used for determining the major axis of a distribution of 1000 points along a parallelogram. The top eigenvector is plotted as a line in the 2 dimensional space. The center of the data is shown by a red dot.
4.5 Understanding the SVD algorithm
SVD stands for singular value decomposition (Klema and Laub, 1980; Hogben, 2007; Strang, 2009), a matrix factorization that is computed by iterative numerical methods. The algorithm is used for determining the eigenvalues, the rank, and the null space of a matrix, and it forms the crux of the principal component analysis discussed earlier in the chapter. The state-of-the-art SVD algorithm involves QR decomposition, Householder transformations, and Givens matrix rotations. The following are some of the key concepts involved in understanding the SVD algorithm.
- Solving systems of linear equations and LU decomposition
- Householder transformation
- Givens matrix rotations
- Orthonormal matrices
- Eigenvector and eigenvalue detection
4.5.1 LU decomposition
LU decomposition stands for Lower Upper triangular decomposition. Triangular matrices make it fast to solve a system of linear equations by substitution. For instance, consider the problem of solving the system of linear equations below.

$$\left(A = \begin{bmatrix} 10 & 20 & 30 \\ 3 & 30 & 45 \\ 5 & 22 & 54 \end{bmatrix}\right) \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \left(b = \begin{bmatrix} 140 \\ 198 \\ 211 \end{bmatrix}\right)$$

The Gaussian elimination process applies row operations to bring the A matrix to triangular form and then back substitutes values to determine x. It uses the augmented matrix so that the same operations are applied to the b values.

$$\begin{bmatrix} 10 & 20 & 30 & : & 140 \\ 3 & 30 & 45 & : & 198 \\ 5 & 22 & 54 & : & 211 \end{bmatrix} \to \begin{bmatrix} 1 & 2 & 3 & : & 14 \\ 3 & 30 & 45 & : & 198 \\ 5 & 22 & 54 & : & 211 \end{bmatrix} \to \begin{bmatrix} 1 & 2 & 3 & : & 14 \\ 0 & 24 & 36 & : & 156 \\ 0 & 12 & 39 & : & 141 \end{bmatrix} \to \begin{bmatrix} 1 & 2 & 3 & : & 14 \\ 0 & 1 & 0 & : & 2 \\ 0 & 0 & 1 & : & 3 \end{bmatrix} \Rightarrow \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$
The back substitution process is the most attractive aspect of the triangular form of matrices. However, with plain elimination the row operations need to be replayed every time a new b is input. A more efficient approach is to remember the effect of the triangularization process, and the LU decomposition algorithm addresses precisely this problem by factorizing a square matrix into a lower and an upper triangular matrix.

$$A_{m \times m} = L_{m \times m} U_{m \times m}$$

$$\begin{bmatrix} 10 & 20 & 30 \\ 3 & 30 & 45 \\ 5 & 22 & 54 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0.3 & 1 & 0 \\ 0.5 & 0.5 & 1 \end{bmatrix} \begin{bmatrix} 10 & 20 & 30 \\ 0 & 24 & 36 \\ 0 & 0 & 21 \end{bmatrix}$$

Solving for the x vector can now be accomplished in two steps.

$$A x = b \Rightarrow L U x = b \Rightarrow L (U x) = b$$

The steps for solving the system of linear equations are:
1. Let y = Ux
2. First solve Ly = b using forward substitution ⟹ y
3. Then solve Ux = y using back substitution ⟹ x

$$\begin{bmatrix} 1 & 0 & 0 \\ 0.3 & 1 & 0 \\ 0.5 & 0.5 & 1 \end{bmatrix} \begin{bmatrix} 10 & 20 & 30 \\ 0 & 24 & 36 \\ 0 & 0 & 21 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 140 \\ 198 \\ 211 \end{bmatrix}$$

Solving for the L side of the equation,

$$\begin{bmatrix} 1 & 0 & 0 \\ 0.3 & 1 & 0 \\ 0.5 & 0.5 & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 140 \\ 198 \\ 211 \end{bmatrix} \to \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 140 \\ 156 \\ 63 \end{bmatrix}$$

Solving for the U side of the equation,

$$\begin{bmatrix} 10 & 20 & 30 \\ 0 & 24 & 36 \\ 0 & 0 & 21 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 140 \\ 156 \\ 63 \end{bmatrix} \to \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$

The LU decomposition has simplified solving the system of linear equations. There are a number of algorithms to accomplish this factorization, including:
- Doolittle’s algorithm
- Crout’s algorithm
- With full pivoting
- With partial pivoting
We present here a simplified algorithm for LU decomposition, shown in Algorithm 14 and based on Doolittle’s algorithm. Suppose a given square matrix $A_{n \times n}$ needs to be LU factorized. Let us denote each element of A in typical computer programming notation, A[i][j]. We will deduce a factorization A = L U, where L is lower triangular with L[i][j] = 0 (∀j > i) and L[i][i] = 1, and U is an upper triangular matrix such that U[i][j] = 0 (∀i > j).

ALGORITHM 14 LU decomposition.
(∀i ∈ [1…n]) : U[1][i] = A[1][i]
(∀i ∈ [1…n]) : L[i][1] = A[i][1] / U[1][1]
for i = 2…n do
    for m = i…n do
        U[i][m] = A[i][m] − Σ_{k=1}^{i−1} L[i][k] U[k][m]
    end for
    L[i][i] = 1
    for m = i+1…n do
        L[m][i] = (A[m][i] − Σ_{k=1}^{i−1} L[m][k] U[k][i]) / U[i][i]
    end for
end for
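$$A = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ L[2][1] & 1 & 0 & \cdots & 0 \\ L[3][1] & L[3][2] & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ L[n][1] & L[n][2] & \cdots & L[n][n-1] & 1 \end{bmatrix} \begin{bmatrix} U[1][1] & U[1][2] & \cdots & U[1][n-1] & U[1][n] \\ 0 & U[2][2] & U[2][3] & \cdots & U[2][n] \\ 0 & 0 & U[3][3] & \cdots & U[3][n] \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & U[n][n] \end{bmatrix}$$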
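The worked example of this section can be reproduced with a short Doolittle-style routine followed by forward and back substitution. The sketch below is written in plain Python with 0-based indices. It performs no pivoting, so it assumes that no zero appears on the diagonal of U. The function names are chosen for the example.

```python
def lu_decompose(A):
    """Doolittle factorisation A = L U with a unit diagonal in L (no pivoting)."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for m in range(i, n):        # row i of U
            U[i][m] = A[i][m] - sum(L[i][k] * U[k][m] for k in range(i))
        for m in range(i + 1, n):    # column i of L
            L[m][i] = (A[m][i] - sum(L[m][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

def lu_solve(L, U, b):
    """Solve L y = b by forward substitution, then U x = y by back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x, y

A = [[10, 20, 30], [3, 30, 45], [5, 22, 54]]
b = [140, 198, 211]
L, U = lu_decompose(A)
x, y = lu_solve(L, U, b)
```

Once L and U are stored, each additional right-hand side b costs only the two substitution passes.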
However, when it comes to calculation of eigenvectors of a given matrix in applications such as PCA, a more desirable factorization is QR decomposition where Q is an orthonormal matrix and R is an upper triangular matrix. The algorithms accomplishing such a factorization are more involved than the LU decomposition methods.
4.5.2 QR decomposition
In LU decomposition, the L and U matrices are not designed to be orthogonal matrices. Requiring one of the factors to be orthogonal helps in devising the SVD algorithm. The QR factorization algorithm factorizes a matrix as A = Q R, where Q is an orthogonal matrix and R is an upper triangular matrix. There are multiple ways of performing this decomposition, including:
- Gram–Schmidt method
- Householder reflection-based method
- Givens rotations-based method
The QR decomposition algorithm based on Givens rotations is presented in Algorithm 15. Recall the 2 × 2 rotation matrix, whose first row corresponds to the X-axis and whose second row corresponds to the Y-axis. To rotate any given point in the plane by an angle θ about the origin, the rotation matrix is as below.

$$R(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}$$

Applying this matrix to any other matrix rotates all of its columns by θ: $R_{2 \times 2}(\theta)\, A_{2 \times m} = A^{rotated}_{2 \times m}$.
ALGORITHM 15 QR decomposition: simplified pseudocode.
Input: A /*n × n matrix*/
Q = I, R = A
for j = 1 : n − 1 do
    for i = j + 1 : n do
        G = Z(i, j, R) /*Givens rotation matrix that zeroes cell (i, j) of R*/
        R = G * R
        Q = Q * G^T
    end for
end for
return (Q, R) /*A = Q * R*/
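A NumPy sketch of a Givens-rotation QR factorization is given below. It zeroes the cells below the diagonal column by column, from left to right, so that cells already zeroed stay zero, and it accumulates the transposed rotations into Q so that A = Q R holds throughout. The function name is an assumption, and the full rotation matrix is built explicitly for clarity rather than efficiency.

```python
import numpy as np

def givens_qr(A):
    """QR factorisation by Givens rotations: returns (Q, R) with A = Q @ R."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    Q = np.eye(m)
    for col in range(min(m - 1, n)):
        for row in range(col + 1, m):
            if R[row, col] == 0.0:
                continue                      # cell is already zero
            r = np.hypot(R[col, col], R[row, col])
            c = R[col, col] / r               # cos(theta*)
            s = -R[row, col] / r              # sin(theta*): s*R[col,col] + c*R[row,col] = 0
            G = np.eye(m)                     # Givens matrix acting on rows col and row
            G[col, col], G[col, row] = c, -s
            G[row, col], G[row, row] = s, c
            R = G @ R                         # zeroes R[row, col]
            Q = Q @ G.T                       # keeps Q @ R equal to A
    return Q, R

A = np.array([[10.0, 20.0, 30.0], [3.0, 30.0, 45.0], [5.0, 22.0, 54.0]])
Q, R = givens_qr(A)
```

Production implementations apply each rotation to the two affected rows only, instead of forming the full matrix G.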
A Givens rotation matrix is a generalization of the rotation matrix to a high dimensional space. Consider an identity matrix, each row of which is a vector. To convert it into a matrix that rotates any given vector in the plane of the ith and jth dimensions, i.e., from the ith toward the jth dimension by an angle θ, proceed as follows. Start from the identity, $G_{n \times n}[p][p] = 1\ (\forall p \in [1 \ldots n])$ and $G[p][q] = 0\ (\forall p \ne q)$. Then set up the rotation elements, G[i][i] = cos(θ), G[i][j] = −sin(θ), G[j][i] = sin(θ), G[j][j] = cos(θ).

$$G(i, j, \theta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & \cos(\theta) & \cdots & -\sin(\theta) & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & \sin(\theta) & \cdots & \cos(\theta) & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix}$$

When applied to any other matrix, this matrix rotates each of its columns in the (i, j) plane, from the ith toward the jth dimension, by an angle θ. Only rows i and j of the matrix change. We use this matrix to zero selected cells of the input matrix A through the operation $A^{modified} = G(i, j, \theta)\, A$. The angle is set to a value θ* such that

$$\theta^* : \sin(\theta^*)\, A[i][i] + \cos(\theta^*)\, A[j][i] = 0.$$

Such a θ* gives $A^{modified} = G(i, j, \theta^*)\, A \Rightarrow A^{modified}[j][i] = 0$, which is the same as zeroing a selected cell of the matrix. Let Z(j, i, A) denote the Givens rotation matrix that zeroes the (j, i)th cell of a matrix A when applied to it. Obtaining it involves two steps: (i) selecting θ* and (ii) constructing the Givens rotation matrix. A sequence of Givens rotations on a matrix A can convert it to an upper triangular matrix, as in Algorithm 15. Note that the Q matrix remains orthonormal after the series of multiplications in the iterations. The R matrix is an upper triangular matrix. The SVD algorithm makes use of QR decomposition and Givens rotations to factorize a nonsquare matrix $A_{m \times n}$. Let $A = U \Sigma V^T$ be the SVD, where $U_{m \times m}$, $\Sigma_{m \times n}$, and $V_{n \times n}$ are the factor matrices. Both U and V are orthonormal matrices. The Σ matrix is a diagonal matrix, i.e., Σ[i][j] = 0 (∀i ≠ j). The diagonal elements of Σ are ordered such that Σ[i][i] ≥ Σ[j][j] (∀i ≤ j). If we denote Σ[i][i] by $\sigma_i$, there are at most k = min{m, n} nonzero diagonal elements $\sigma_1 \ldots \sigma_k$. There are several implementations of the SVD algorithm, including those listed after Algorithm 16. We show here a simplified pseudocode for ease of understanding by a beginner reader (Algorithm 16). This procedure rests on the idea that upper triangularization of a lower triangular matrix yields a diagonal matrix. In general this holds only in the limit of repeated application.
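ALGORITHM 16 SVD algorithm: simplified pseudocode.
Input: $A_{m \times n}$ matrix.
$\hat{U}_{m \times m}, Z_{m \times n} = QR(A)$ // QR factorization of A
⟹ $A = \hat{U} Z$
Note that Z is an upper triangular matrix
Consider $Z^T_{n \times m}$
$V_{n \times n}, D_{n \times m} = QR(Z^T)$ // QR factorization of $Z^T$
⟹ $Z^T = V D$
$A = \hat{U} (Z^T)^T = \hat{U} (V D)^T = \hat{U} D^T V^T$
Note that D is upper triangular, so $D^T$ is lower triangular; it becomes diagonal only after the two QR steps above are repeated on $D^T$ until its off-diagonal entries vanish, with the orthogonal factors accumulated into $\hat{U}$ and V
Now, $D^T$ needs to be cast as Σ with diagonal element ordering
Let (∃P) : $D^T = P \Sigma$
Let $U = \hat{U} P$ // to absorb the row permutations
Then, we have $A = U \Sigma V^T$ as required by the SVD factorization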
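The idea behind this pseudocode can be tried out numerically by alternating QR factorizations of the current factor and of its transpose until the middle factor becomes diagonal. The sketch below is a teaching illustration under that assumption, not one of the production algorithms. It relies on `numpy.linalg.qr` with `mode="complete"`, uses a fixed iteration count chosen for the example, and leaves the diagonal in whatever order the iteration produces.

```python
import numpy as np

def svd_by_qr_iteration(A, n_iter=500):
    """Return (U, S, V) with A ~= U @ S @ V.T and S (numerically) diagonal."""
    S = np.array(A, dtype=float)
    m, n = S.shape
    U, V = np.eye(m), np.eye(n)
    for _ in range(n_iter):
        Q1, R1 = np.linalg.qr(S, mode="complete")      # S = Q1 R1
        Q2, R2 = np.linalg.qr(R1.T, mode="complete")   # R1^T = Q2 R2
        # S = Q1 (Q2 R2)^T = Q1 R2^T Q2^T, so A = (U Q1) R2^T (V Q2)^T
        U, V, S = U @ Q1, V @ Q2, R2.T
    # Make the diagonal non-negative by moving the signs into U
    signs = np.ones(m)
    d = np.diag(S)
    signs[:len(d)] = np.where(d < 0, -1.0, 1.0)
    return U * signs[None, :], signs[:, None] * S, V

A = np.array([[10.0, 20.0, 30.0], [3.0, 30.0, 45.0], [5.0, 22.0, 54.0]])
U, S, V = svd_by_qr_iteration(A)
singular_values = np.sort(np.diag(S))[::-1]
```

The off-diagonal entries shrink faster when the singular values are well separated. Practical algorithms first reduce the matrix to bidiagonal form to make each sweep cheap.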
- Iterative Householder matrix transformations
- Golub–Reinsch algorithm
- Golub–Kahan algorithm
- Bidiagonal algorithm
- Demmel–Kahan algorithm
- Numerical method-based algorithm
- Jacobi rotation algorithm

There are a number of very important applications of SVD factorization, including the following list.
- Eigenvalue computation
- Computing the pseudoinverse of a matrix
- Principal component analysis
- Clustering problems
- Multidimensional scaling
- Low rank approximations of matrices
4.6 Data distributions and visualization
Data visualization (Fayyad et al., 2001) is a fundamental need for a data science practitioner. There are effectively two types of visualization in data science: (i) metric plots and (ii) data distribution plots. While metric plots are a routine aspect of the everyday life of a practitioner, there are only a handful of data visualization algorithms. Examples of metric-based visualizations are listed below.
- Correlation scatter plots
- Precision-recall and ROC curves
- Validation curves of the metric of choice versus data set or parameters
- Confusion matrix
- Kullback–Leibler divergence score between data sets
- Time series plots of the metric value of interest

Though metric visualization suffices in most cases, data visualization helps in gaining intuition about the data and in devising better methods or features. Some examples of data visualization approaches are listed below.
- Network graph visualization of connections between data points
- Dendrogram of clustered data points
- Multidimensional scaling (MDS)
- Student’s t-distributed stochastic neighbor embedding (tSNE)
- PCA-based dimensionality transformation

In this chapter let us look at the data point visualization strategies: MDS, tSNE, and PCA-based visualization.
4.6.1 Multidimensional scaling
The MDS algorithm (Wu et al., 2019) works by projecting higher dimensional data onto a lower dimensional space subject to distance constraints. The key idea behind this algorithm is to deduce a topological encoding of the points by requiring all pairwise distances between points to be conserved across dimensions. The formulation is based on pairwise distances between points. Assume that all the points are centered about their average and that all the input points are unit vectors in D dimensional space. Given two points u and v, the squared Euclidean distance between them is

$$|u - v|^2 = 2(1 - u \cdot v)$$

Let the input set of points be X, where X[i] denotes the ith point of the set. Now assume an isomorphic mapping from the points in D dimensional space to a lower dimensional space (typically 2 dimensional). Let the points in the lower dimensional space be Y, such that X[i] has a counterpart Y[i]. Now,

$$(\forall i, j) : |X[i] - X[j]| \approx |Y[i] - Y[j]|$$

The distances can be stated in terms of dot products as

$$d^X_{ij} = |X[i] - X[j]|^2 = 2(1 - X[i] \cdot X[j])$$

$$d^Y_{ij} = |Y[i] - Y[j]|^2 = 2(1 - Y[i] \cdot Y[j])$$

$$D_X = [d^X_{ij}]_{\forall i, j} = 2 \cdot \mathbf{1}\mathbf{1}^T - 2 (X^T X)$$

$$D_Y = [d^Y_{ij}]_{\forall i, j} = 2 \cdot \mathbf{1}\mathbf{1}^T - 2 (Y^T Y)$$

where the points are the columns of X and Y, and $\mathbf{1}\mathbf{1}^T$ is the all-ones matrix. We need to minimize the difference between the distance matrices,

$$Y^* = \arg\min_{Y} ||D_X - D_Y||^2$$
A possible modification is to use a kernel function in place of the plain dot product, $X[i] \cdot X[j] \mapsto K(X[i], X[j])$, where K(·, ·) is a kernel function. A major drawback of this method is that it is not incremental, i.e., when a new data point is added, the whole of Y needs to be recomputed. Another issue is that the method is sensitive to noise. These issues limit the practical applicability of the method in industry-scale use cases. Visualization of MDS of 20 random points is shown in Fig. 25. The original points and the reconstructed points are shown in orange and green colors. The plot also shows a variant of MDS, called nonmetric MDS, which enforces the ordering of pairwise distances.
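For intuition, the sketch below implements classical (Torgerson) MDS, a closed-form relative of the optimization stated above. Instead of minimizing the distance mismatch iteratively, it double-centers the squared distance matrix and takes the top eigenvectors. This is a swapped-in technique chosen because it is short and deterministic. The function names and the random test points are assumptions for the example.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    return np.linalg.norm(X[:, None] - X[None, :], axis=2)

def classical_mds(D, dim=2):
    """Embed points in `dim` dimensions given only their distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # Gram matrix of the centred points
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dim]  # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Hypothetical 2D points, rotated by 60 degrees before distances are taken
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
theta = np.deg2rad(60.0)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
D = pairwise_distances(X @ rotation.T)

Y = classical_mds(D, dim=2)
```

The recovered layout Y matches the original only up to rotation and reflection, since pairwise distances carry no information about orientation.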
[Figure 25 panels: Actual; Rotated & distorted; MDS recovered]
FIG. 25 Application of the multidimensional scaling (MDS) algorithm for recovering the structure of the input data. A sample of 100 points is generated in the form of an S curve using scikit-learn’s make_s_curve function. All the points are then centered about their mean and rotated by 60 degrees. The rotated set of points is used for generating all pairwise distances, which are fed to the MDS algorithm to recover the structure of the points in 2D. The actual points are shown in the leftmost panel, the center panel shows the rotated points, and the last panel shows the recovered structure of the points. One can appreciate the usefulness of the MDS algorithm for visualizing higher dimensional points in a 2D plane to obtain intuition regarding the nature of the data.
4.6.2 tSNE
In this method (van der Maaten and Hinton, 2008), the pairwise distance between points is reinterpreted as points selecting their neighbors. The tSNE algorithm is much more successful than MDS and has been used in a number of domains for data visualization. The method provides an impressive supplementary perspective on the nature of the data in slow-paced scenarios such as genomics research. For a given point X[i], the tSNE method treats the distance to another point X[j] as determining the probability that the ith point would select the jth point as its neighbor. The probability that the jth point is selected as a neighbor, normalized by the cumulative probabilities of all other points being selected as neighbors, is given by P[i][j] as below.

$$P[i][j] = \frac{e^{-(d^X_{ij})^2 / (2\sigma_i^2)}}{\sum_k e^{-(d^X_{ik})^2 / (2\sigma_i^2)}}$$

For the corresponding points in the other dimensional space Y (typically reduced and 2 dimensional), the equivalent definition of the probability of picking the jth point as a neighbor of the ith point is given by

$$Q[i][j] = \frac{(1 + ||Y[i] - Y[j]||^2)^{-1}}{\sum_k (1 + ||Y[i] - Y[k]||^2)^{-1}}$$
The difference between these two distributions is taken as the error, as below.

$$\mathrm{error}(Y) = KL(P \,\|\, Q) = \sum_{i \neq j} P[i][j] \log \frac{P[i][j]}{Q[i][j]}$$

The values of the elements of the matrix Y are determined by gradient descent on the error function error(Y). The “t” in the name of the method refers to the distribution Q[][] being a heavy-tailed Student’s t-distribution. The method has been demonstrated successfully on diverse data sets including MNIST digits, Olivetti faces, Netflix, ImageNet, and other popular data sets. However, one drawback of this method is that it is not applicable in an incremental scenario where new data points are added to the previous ones. Moreover, multiple executions of the method on the same data set, even in batch mode, can result in different Y values and hence different layouts of the points. The method is also prone to the curse of dimensionality: when the number of input dimensions is large, the P[][] values are all similar and small. The method is applicable where another perspective on a given data set is wanted and the data set is more or less mature and expected to be static. However, its use in industry scenarios involving seasonal data and the interpretation of the causes of error of machine learning models is still an active area of research.
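The three quantities above, P, Q, and error(Y), can be computed directly for a small data set. The following is a minimal sketch under stated assumptions: a single shared bandwidth sigma is used for every point (practical tSNE implementations tune each σ_i separately), no gradient descent on Y is performed, and the function names are hypothetical.

```python
import numpy as np

def neighbor_probs_high(X, sigma=1.0):
    """P[i][j]: Gaussian neighbor probabilities in the input space (shared sigma assumed)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                     # a point does not select itself
    return W / W.sum(axis=1, keepdims=True)

def neighbor_probs_low(Y):
    """Q[i][j]: heavy-tailed (Student t) neighbor probabilities in the embedding space."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    W = 1.0 / (1.0 + d2)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(axis=1, keepdims=True)

def kl_error(P, Q, eps=1e-12):
    """error(Y) = sum over i != j of P[i][j] * log(P[i][j] / Q[i][j])."""
    mask = ~np.eye(P.shape[0], dtype=bool)
    return float(np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))     # 30 points in 5 dimensions
Y = rng.normal(size=(30, 2))     # a random, not yet optimized, 2D layout
P = neighbor_probs_high(X)
Q = neighbor_probs_low(Y)
err = kl_error(P, Q)
```

A full implementation would then move the rows of Y along the negative gradient of this error until it stops decreasing.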
4.6.3 PCA-based visualization
A simple alternative to the MDS and tSNE algorithms is to plot each data point in a lower dimensional space based on the eigenvectors of the data set. For a given set of points in D-dimensional space, the PCA algorithm outputs D eigenvectors. Consider the top two eigenvalues by magnitude, $\lambda_1, \lambda_2$, and their corresponding eigenvectors $\hat{v}_1, \hat{v}_2$. For each point $u \in X$, compute $(u \cdot \hat{v}_1, u \cdot \hat{v}_2)$ as a 2-dimensional embedding and plot it for visualization. The major limitation of this approach is that nonlinear relationships between features are not captured by the visualization, as standard PCA is a linear algorithm. Though other variants such as kernel PCA may be applicable, the result is still biased by the majority distribution in the data set.

4.6.4 Research directions
A visualization algorithm is closely tied to the vectorization of the data points. Vectorization means the features that determine a data point. The features themselves may be engineered or automatically learned, for example using deep learning methodologies such as an autoencoder, previous layers of a convolutional neural network (CNN), or the hidden vector of a recurrent neural network (RNN). A good quality vectorization separates data points into blobs in high dimensional space based on the true clusters. Conversely, a low quality vectorization cannot separate data points into clean blobs, and natural clusters overlap, leading to ambiguity in predictions. A necessary condition for clear separation of data in the low dimensional space being visualized is high quality of vectorization and features of the input points. Typically the distance between points in low dimensional space is less informative and may lead to false conclusions if not backed up by quantitative metrics. There is scope for assessing points which regress over time as the model is trained on new seasonal data. There is also a need for visualization to be incremental as new data flows in.
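Returning to the PCA-based embedding of Section 4.6.3, the projection onto the top two eigenvectors can be sketched as follows. This is a minimal illustration which assumes the data are centered before projection and that the eigenvectors are those of the sample covariance matrix; the function name pca_embed_2d and the synthetic data are hypothetical.

```python
import numpy as np

def pca_embed_2d(X):
    """Project each row of X onto the two eigenvectors with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                      # assumption: center the data first
    cov = np.cov(Xc, rowvar=False)               # D x D sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:2]        # indices of lambda_1 and lambda_2
    V = eigvecs[:, order]                        # columns are v_1 and v_2
    return Xc @ V, eigvals[order]                # rows are (u . v_1, u . v_2)

rng = np.random.default_rng(0)
# Synthetic 5-dimensional data with most variance along the first two axes.
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 0.5, 0.5, 0.5])
embedding, top_eigvals = pca_embed_2d(X)
```

The two columns of embedding can be passed directly to a scatter plot; their sample variances equal the two retained eigenvalues.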
5 Graphical methods
Graphical models (Edwards, 2000) in the machine learning world are widely associated with a small subset of graph-based models called Bayesian networks (Jensen and Nielsen, 2007; Neal, 1996). However, the general concept of using a graph abstraction goes well beyond the probabilistic framework, and several state-of-the-art practical systems are not constrained by the requirements of a probabilistic framework. For instance, a very common and widely used concept such as a flow chart is a graphical model of control flow. In distributed computing, the message passing framework among nodes is a graphical formulation. A wide variety of practical systems employ graph-based abstractions, such as page rank, social networking, and cellular automata, among several others. In this section we focus on Bayesian networks, which impose a probabilistic framework on the rows and columns of a data matrix. This abstraction helps human interpretability of data in terms of relations between the columns, which is most desirable in the machine learning world. More precisely, if the cause and effect relation in the data (Heckerman and Shachter, 1995; Pacer and Griffiths, 2011) is captured, it can be used to carefully introduce interventions as required. In order to understand the cause and effect relationships in data, we need to model the data generating process and consider the given data set as a sample from the generating process. Given this snapshot sampled from a hypothetical generating process, the task that remains is to assess the right parameters of the process. Once the right parameters are determined, the generating model can be used to synthetically generate data as required. In this approach to understanding cause and effect relationships between data items, the features are interconnected. One feature determines another. The measured value of a feature is determined by the values of other features.
For instance, shadow regions in an image correspond to an occluded path from a bright light region. A sequence of previous frames in a video determines the content of the next video frame. The cause and effect relations may be between data in two time frames (temporal) or between two features within a data point (spatial). Another example, in the context of recommendation systems, is a woman buying several baby products because she has recently been blessed with a baby.
A graphical model is represented by a graph of nodes and edges. The nodes correspond to features or hidden concepts. The nodes are also called states. The edges may be directed or undirected and are often weighted to indicate the probability of transition between a pair of nodes. A typical graphical model imposes certain restrictions on the topology of the graph and the edge weights. There are two types of nodes: (i) observed nodes and (ii) hidden nodes. The observed nodes correspond to the data set. The hidden nodes correspond to hypotheses about the generative process. Hidden nodes are optional; observed nodes are compulsory. The edges in the graph are of two types: (i) transitions between hidden states and (ii) connections from hidden states to observed states. The latter are called emissions. The edge weights are probabilities of transitions between nodes. The observed states may be implicit when modeling real valued outputs. When the observations are real valued, there is a probability associated with each value, typically defined by a parameterized function. The probabilities on the outgoing edges of a node sum to 1. The probability values may be determined by a formula (parametric models) or given as singleton values. The parameters of the formulae are determined by iterative algorithms. One algorithm used for estimating the parameter values is expectation maximization. Some of the popular graphical algorithms include the following.
• Naive Bayes (McCallum and Nigam, 1998)
• Gaussian mixture model (Rabiner, 1989)
• Markov model (Puterman, 2014)
• Hidden Markov model (Eddy, 2004)
• Latent Dirichlet allocation (Blei et al., 2003; Nakajima et al., 2014)
Extended abstraction of graphical models: A wide variety of problems can be considered as graphical models by extending the abstract concept of nodes and edges to implicit nodes and implicit edges. The number of such abstract nodes may be uncountably infinite when they represent real valued elements. Graphical models need not obey a probabilistic framework; examples include social network models (Wasserman and Faust, 1994), message passing models (Zhang et al., 2014), Petri nets (Peterson, 1983), and cellular automata (Codd, 1968). Bayesian networks, the class of graphical models that obey a probabilistic framework, form only a small, though highly impactful, subset of these problems.
5.1 Naive Bayes algorithm
The Naive Bayes (NB) algorithm is a Bayesian graphical model that has nodes corresponding to each of the columns or features. It is called naive because it ignores the prior distribution of parameters and assumes independence of all features and all rows. Ignoring the prior has both an advantage and a disadvantage. The advantage is that we can plug in any type of distribution over individual features and learn the maximum likelihood parameters from the data. We need not restrict the class of prior distributions to the exponential family in order to simplify the algebra of the product of likelihood and prior. The disadvantage is that it is a maximum likelihood model; it does not improve the posterior iteratively. Despite these advantages and disadvantages, the NB method is still a probabilistic generative model, i.e., given the parameters, one can synthetically generate data. The nodes emit values, which are the observed feature values. The values may be real valued for numeric attributes or may be a discrete set of symbols for categorical attributes. The label column itself corresponds to a node. The label may be a real valued quantity, as in the case of a regression problem, or of categorical type. NB makes two primary assumptions: (i) all columns are independent of each other and depend only on the label, and (ii) all rows are independent of each other. Based on the nature of the attributes there are three major versions of the NB algorithm: (i) Bernoulli NB (McCallum and Nigam, 1998), (ii) Gaussian NB, and (iii) Multinomial NB (Rennie et al., 2003). Though the names are different, the underlying formulation is common and generic, and we present it below. Let X denote the data set and $x \in X$ denote a single data point. Let x[i] denote the ith column of the data element. Assuming there are n columns, $i \in [1 \ldots n]$. Let x[L] denote the label value of the data element. Let $C_i$ denote the random variable for the ith column. Let $P(C_i = x[i] \mid L = x[L])$ denote the conditional probability of the ith column of a data point taking the value observed in the data point x, given its observed label value. Let each column $C_i$ have a set of parameters, compactly denoted as $\Theta_i$.
Given that all columns are independent of each other, the conditional probability of a data point given its label value is modeled as

$$P(x \mid L = x[L]) = \prod_i P(C_i = x[i] \mid L = x[L])$$

Every random variable $C_i$ ($\forall i$) comes with its own parameters, based on the type of data observed and the modeling assumption. For example, if a column records coin tosses, the observed data is either head or tail, and the modeling scientist chooses a Bernoulli distribution, then the parameter of $C_i$ would be $\lambda$, the probability of observing heads. The machine learning task is to learn these parameters $\Theta_i$ ($\forall i$) from data. Let $\Theta = \{\Theta_1 \cup \Theta_2 \cup \ldots \cup \Theta_n\}$. The task of determining the right parameters is posed as maximizing the posterior, as in Eq. (62). The posterior is proportional to the product of the likelihood $P(X \mid \Theta)$ and the prior $P(\Theta)$.

$$\Theta^* = \arg\max_\Theta P(\Theta \mid X) = \arg\max_\Theta P(X \mid \Theta)\, P(\Theta) \qquad (62)$$
In the case of multiclass classification problems, the NB formulation ties each $\Theta_i$ to a class in $\Gamma = \{x[L] : x \in X\}$, as in Eq. (63). In this equation, $X[1 \ldots L-1]$ denotes all columns excluding the Lth column. All the rows of the Lth column are denoted by X[L].

$$P(X \mid \Theta) = P(X[1 \ldots L-1], X[L] \mid \Theta) = P(X[1 \ldots L-1] \mid X[L], \Theta)\, P(X[L], \Theta) \qquad (63)$$

In Eq. (63), the joint probability of X[L], Θ can be modeled by combining the $\Theta_i$'s with the labels to generate a multitude of parameters. For instance, given a set of labels $\Gamma$ and a set of parameters $\Theta$, the number of parameters when combined with each and every label becomes $|\Gamma| \cdot |\Theta|$. Let us denote this combination of the parameters as

$$\Theta_L = \{\Theta_i^\gamma : \gamma \in \Gamma,\ i \in 1 \ldots L-1\}$$

Note: It is completely up to the person who is modeling the problem to choose which variables have which parameters and distributions. The label can also be parameterized. Based on the independence of columns and rows, the equation can be simplified as in Eq. (64).

$$
\begin{aligned}
P(X \mid \Theta) &= \prod_{x \in X} P(x \mid \Theta) \\
&= \prod_{x \in X} \prod_{i \in [1 \ldots L-1]} P(C_i = x[i], L = x[L] \mid \Theta) \\
&= \prod_{x \in X} \prod_{i \in [1 \ldots L-1]} P(C_i = x[i] \mid L = x[L], \Theta)\, P(L = x[L] \mid \Theta) \\
P(X \mid \Theta) &= \prod_x \prod_i P(C_i = x[i] \mid L = x[L], \Theta_i)\, P(L = x[L]) \qquad (64)
\end{aligned}
$$
In the case of multiclass classification problems, let $\Gamma = \{x[L] \mid x \in X\}$ denote the set of all possible label values of the data items in the given data set. Determining the conditional probability $P(C_i = x[i] \mid L = x[L], \Theta_i)$ then requires creating a separate instance of the parameter set for each possible value of L, i.e., $(\forall \gamma \in \Gamma) : \Theta_i^\gamma$. With a multiclass label having k label values and n columns, the number of parameter sets over all columns is $k \times n$. The algorithm for learning the parameters is given in Algorithm 17. The probability of the data set given the parameters is

$$P(X \mid [\Theta_i^\gamma]_{i,\gamma}) = \prod_x \prod_i P(C_i = x[i] \mid \Theta_i^{x[L]})$$

In prediction mode, the Naive Bayes algorithm determines the class with the best posterior probability score given the data. The error function over the parameters is defined as log(P()),

$$\Lambda([\Theta_i^\gamma]_{i,\gamma}) = \sum_x \sum_i \log\big(P(C_i = x[i] \mid \Theta_i^{x[L]})\big)$$
ALGORITHM 17 Naive Bayes: Multiclass.
for $\gamma \in \Gamma$ do
    for $i \in [1 \ldots n]$ do
        // In the usual NB formulations, parameters can be determined analytically and very easily
        $\Theta_i^\gamma \Leftarrow \partial \Lambda(\cdot) / \partial \Theta_i^\gamma = 0$
    end for
end for
It is possible to derive other simplified variants, where the $\Theta_i$ are independent of the label classes $\gamma \in \Gamma$. The variants of the NB formulation for $P(R = v \mid \theta)$ are tabulated below, where R is the random variable in question ($C_i$) and $\theta$ denotes its parameters.

| Parameters | Data value | Formulation type |
|---|---|---|
| $\theta = (\mu, \sigma)$ | Real valued | Gaussian NB |
| $\theta = \lambda$ | x[i] in 2 classes | Bernoulli NB |
| $\theta = [\lambda_k]$ | x[i] in k classes | Multinomial NB |
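As an illustration of the Gaussian NB row of the table and of the per-class, per-column parameter sets $\Theta_i^\gamma$ learned in Algorithm 17, the following is a minimal sketch in NumPy. It is not a reference implementation: the function names, the small variance floor, and the synthetic two-class data are assumptions made for the example, and the parameters are stored as (mean, variance) rather than (μ, σ).

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Maximum likelihood estimates: {label: (prior, per-column means, per-column variances)}."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        # Closed-form solutions of d(Lambda)/d(Theta) = 0 for a Gaussian.
        params[label] = (len(rows) / len(X), rows.mean(axis=0), rows.var(axis=0) + 1e-9)
    return params

def predict_gaussian_nb(params, X):
    """Pick the label maximizing log P(L) + sum_i log P(C_i = x[i] | L)."""
    labels = list(params)
    scores = []
    for label in labels:
        prior, mu, var = params[label]
        log_lik = -0.5 * (np.log(2.0 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + log_lik)
    return np.array(labels)[np.argmax(np.vstack(scores), axis=0)]

# Hypothetical data: two well-separated classes in two real-valued columns.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[6.0, 6.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

params = fit_gaussian_nb(X, y)
accuracy = float((predict_gaussian_nb(params, X) == y).mean())
```

With k = 2 labels and n = 2 columns, the fitted model holds k × n = 4 (mean, variance) pairs, one per label and column.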
5.2 Expectation maximization
The expectation maximization (EM) technique (Do and Batzoglou, 2008; Heskes et al., 2003) is typically easy to understand in the context of one specific problem at a time; however, its formulation may be confusing in general purpose scenarios. In this section we clarify how graphical models are dealt with and how the formulation of the method generalizes. In any engineering problem, the parameters need to be learned from data. The learning may be done by humans or by automated methods; eventually the parameter values need to be determined. In the case of graphical models, the parameters are those of the probability distributions on each of the hidden and observed nodes that generate the observed data. A probability distribution is essentially a function that gives a score for each observed measurement, with constraints imposed on the scores by the axioms of probability. A cost function is defined as a function of the parameters and is minimized over a sequence of parameter values using optimization algorithms. In a probabilistic generative model, data can be sampled from the probability distribution. The probability of the current data set being sampled is computed, and a corresponding error function is defined so as to increase this probability over the parameters.
Assessing the true probability distribution based on a given data set is often an intractable task. The approach is to assume some probability distribution Q() over the model and improve it over iterations with respect to some loss function. In EM, the observed and hidden states are all considered random variables, where each variable assumes certain values with certain probability scores. Each random variable has an associated formula which is used to assess the probability of a given observation. Let us consider the scenario of classifying email messages as spam or nonspam, and the definition of its nodes.

Example of email spam and nonspam problem: posing as a graphical model
• Define a spam node
• Define a nonspam node
• Determine the vocabulary of the whole set of messages
• Each of the nodes can generate all of the words of the vocabulary
• However, the probability of generating a given word differs between the nodes
• A collection of spam and nonspam messages is given
• The task is to learn these probabilities
• There can be other features, such as bulk email addresses and wrong names; they can be modeled likewise, albeit differently
The set of parameters used in all of the probability distribution functions for the nodes of the graphical model is indicated by the symbol Θ. The probability distribution function itself is indicated by Q(Θ) over the parameter values. The set of all hidden variables is indicated by the symbol Z. The probability distribution function for the values of Z given Θ is denoted by $Q(Z \mid \Theta)$. The actual (unknown) probability distribution is denoted by P(). The actual data is denoted by the symbol X. The problem now is how to learn, from data, the parameters Θ of the function Q(). As it is impossible to assess exactly the true distribution P() of the parameters and of the values taken by hidden and observed states, we can only approximate it by an empirical distribution and refine it over multiple iterations. The EM method starts with some distribution Q() of parameters and node values and improves the parameters over iterations so as to reduce the difference with respect to P(). The EM procedure tries to find the Θ that maximizes the likelihood of the data given the parameters (Eq. 65).

$$\Theta^* = \arg\max_\Theta P(X \mid \Theta) \qquad (65)$$
We need a formulation in terms of Q(), as P() is not known. The following algebra results in a form that incorporates Q() into the formulation.

$$
\begin{aligned}
P(X \mid \Theta) &= \sum_Z P(X, Z \mid \Theta) = \sum_Z P(X, Z \mid \Theta)\, \frac{Q(Z \mid \Theta)}{Q(Z \mid \Theta)} \\
&= \sum_Z Q(Z \mid \Theta)\, \frac{P(X, Z \mid \Theta)}{Q(Z \mid \Theta)} \\
&= E_Z\!\left[\frac{P(X, Z \mid \Theta)}{Q(Z \mid \Theta)}\right]
\end{aligned}
$$
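Finding the maximizing Θ* is equivalent to determining the maximizer of the log likelihood (Eq. 66).

$$
\begin{aligned}
\Theta^* &= \arg\max_\Theta P(X \mid \Theta) \\
\Rightarrow \Theta^* &= \arg\max_\Theta E_Z\!\left[\frac{P(X, Z \mid \Theta)}{Q(Z \mid \Theta)}\right] \\
\Theta^* &= \arg\max_\Theta \log E_Z\!\left[\frac{P(X, Z \mid \Theta)}{Q(Z \mid \Theta)}\right] \qquad (66)
\end{aligned}
$$

However, dealing with $\log(\sum \cdot)$ is more difficult than dealing with $\sum \log(\cdot)$ forms. We apply Jensen’s inequality to Eq. (66) to obtain Eq. (67). This form gives a lower bound that needs to be tightened, after which Θ* can be determined.

$$\log E_Z\!\left[\frac{P(X, Z \mid \Theta)}{Q(Z \mid \Theta)}\right] \geq E_Z\!\left[\log \frac{P(X, Z \mid \Theta)}{Q(Z \mid \Theta)}\right] \qquad (67)$$

The maximum value in Eq. (67), i.e., equality, is attained when the argument of the logarithm takes the same value across all Z (Eq. 68).

$$\frac{P(X, Z \mid \Theta)}{Q(Z \mid \Theta)} = \text{constant for all } Z \qquad (68)$$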
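Thus updating Q() to its posterior value increases the value of the lower bound, which is also called tightening the lower bound. This is equivalent to determining the posterior distribution of Z given Θ and the data set X.

$$
\begin{aligned}
Q(Z \mid \Theta) \propto P(X, Z \mid \Theta) &\Rightarrow (\exists c) : Q(Z \mid \Theta) = c\, P(X, Z \mid \Theta) \\
\sum_Z Q(Z \mid \Theta) = 1 &\Rightarrow Q(Z \mid \Theta) = \frac{c\, P(X, Z \mid \Theta)}{\sum_{Z'} c\, P(X, Z' \mid \Theta)} = \frac{P(X, Z \mid \Theta)}{P(X \mid \Theta)} = P(Z \mid X, \Theta)
\end{aligned}
$$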
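The Θ itself needs to be updated to maximize the likelihood over the data (Eq. 69).

$$\Theta^* = \arg\max_\Theta Q(X \mid \Theta) \qquad (69)$$

$$
\begin{aligned}
Q(X \mid \Theta) &= \sum_Z Q(X, Z \mid \Theta) = \sum_Z Q(X \mid Z, \Theta)\, Q(Z \mid \Theta) = E_Z[Q(X \mid Z, \Theta)] \\
\Rightarrow \arg\max_\Theta Q(X \mid \Theta) &= \arg\max_\Theta E_Z[Q(X \mid Z, \Theta)] = \arg\max_\Theta \log(E_Z[Q(X \mid Z, \Theta)]) \\
\Rightarrow \arg\max_\Theta \log(E_Z[Q(X \mid Z, \Theta)]) &\approx \arg\max_\Theta E_Z \log(Q(X \mid Z, \Theta))
\end{aligned}
$$

The error function is defined as $\Lambda(\Theta) = E_Z \log(Q(X \mid Z, \Theta))$. The maximizing value Θ* is derived from the gradient formulation below, using a convenient formulation for Q() such as the exponential family of distributions.

$$\Theta^* \Leftarrow \frac{\partial \Lambda(\cdot)}{\partial \Theta} = 0$$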
5.2.1 E and M steps
E-step: The expectation step corresponds to computing problem-specific metrics based on the given values of Q(). The metrics are averaged over the probability scores of the current assignments of the latent variables.
M-step: The maximization step is mostly deduced analytically, per problem, by experts. This step updates the Θ values based on the aggregate metrics computed in the E-step.
The two steps are repeated until the change in the probability values falls below a threshold or for a fixed number of iterations. As in all iterative methods, we almost never know the exact number of iterations in advance. The engineering principle is simply that some solution is better than nothing, i.e., start with a random setting and only improve over iterations.

5.2.2 Sampling error minimization
The E-step closely relates to drawing a sample from the given Q() distribution. This can be seen in the case of K-means, where each point updates its cluster membership. Let us denote sample generation by $X' \sim \Gamma(\Theta)$. The M-step corresponds to analytically deducing the optimal parameters for the error between the generated sample and the given data. The alternative formulation then corresponds to Algorithm 18. Here Γ(Θ) is a sampler, and any of the standard techniques can be used, such as MCMC or Gibbs sampling, among others. This method of sampling-based parameter estimation is used in restricted Boltzmann machines (RBM), where the formulation is based on minimizing contrastive divergence.
ALGORITHM 18 Simplified pseudocode: Sampling error minimization.
Require: X and initial Θ
for $i \in [1 \ldots n]$ do
    $\Lambda(\Theta) = |\Gamma(\Theta) - X|^2$
    $\Theta^* = \arg\min_\Theta \Lambda(\Theta)$
end for
return Θ*
Genetic algorithms: The genetic algorithms can be posed under the sampling error minimization formulation as well. A new population of individuals is generated based on crossover and mutation operators and this corresponds to the sampling step. The maximization step corresponds to applying fitness function on the new population and selecting the top individuals for further iterations.
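To make the E and M steps of Section 5.2.1 concrete, the sketch below runs EM on a mixture of two one-dimensional Gaussians, a case in which the M-step has well-known closed-form updates. It is an illustrative example under stated assumptions and not a general EM implementation: the initialization, the fixed number of iterations, the function name em_gmm_1d, and the synthetic data are all choices made for the example.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1D Gaussian mixture; returns weights, means, variances, log-likelihoods."""
    # Initial Theta: means at the data extremes, shared variance, equal weights.
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()], dtype=float)
    w = np.array([0.5, 0.5])
    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior probability of each hidden component for each point.
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
        log_liks.append(float(np.log(dens.sum(axis=1)).sum()))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizers given the responsibilities.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var, log_liks

# Hypothetical data: two clusters centered near -3 and 4.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(4.0, 1.0, 300)])
w, mu, var, log_liks = em_gmm_1d(x)
```

The recorded log-likelihoods do not decrease from one iteration to the next, which is the practical signal used to decide when the probability values have saturated.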
5.3 Markovian networks
A Markov model (Puterman, 2014) is a Bayesian network without hidden variables. The transitions between states and the emissions of output symbols are defined for every node. In the case of categorical data with a finite vocabulary, a Markov network is defined directly over the observed symbols. The transition and emission probabilities need to be learned from data using a maximum likelihood formulation. A set of states $\Sigma = \{S_1, \ldots, S_n\}$ is defined. Each state emits an output symbol from the vocabulary $W = \{w_1, \ldots, w_m\}$. Each state can emit any of the output symbols with probability $(\forall i, j) : P(w_i \mid S_j)$. The transitions between states are defined using the probabilities $(\forall i, j) : P(S_i \mid S_j)$. The parameters of the Markov model are then all these probabilities bundled into a single set Θ. The learning aspect of the Markov model defines the likelihood of the training sequence over the model probabilities. Let $\Omega = [o_1, \ldots, o_k] : o_i \in W$ be the training data sequence of length k. Let $\xi = [\sigma_1, \ldots, \sigma_k]$ be the sequence of states that would have emitted the observations Ω, i.e., $\sigma_j \in \Sigma$. The likelihood of the sequence, $P(\Omega \mid \xi)$, factorizes under the Markov assumption, and finding the best state sequence is a maximization problem, as in Eq. (70).

$$
\begin{aligned}
P(\Omega \mid \xi) &= P(o_1, o_2, \ldots, o_k \mid \sigma_1, \ldots, \sigma_k) = P(\sigma_1) \prod_{i=2}^{k} P(o_i \mid \sigma_i)\, P(\sigma_i \mid \sigma_{i-1}) \\
\xi^* &= \arg\max_\xi P(\Omega \mid \xi) \qquad (70)
\end{aligned}
$$
Given a particular assignment for Θ, Eq. (70) can be solved using a dynamic programming algorithm such as the Viterbi algorithm to determine the optimal assignment of states. The assignment of states to the input data can also be thought of as the generation of data points which are tuples of the form $\langle \text{emit}, S_j, o_k \rangle$, $\langle \text{transit}, S_i, S_j \rangle$, and $\langle \text{start}, S_k \rangle$, denoting emission, transition, and starting, respectively. Let us denote this data, based on the optimal Markov sampling, as

$$D' = \{\langle \text{emit}, S_j, o_k \rangle, \langle \text{transit}, S_i, S_j \rangle, \langle \text{start}, S_k \rangle\}$$

Based on these new assignments, the underlying probabilities of the individual states can be adjusted as in Eq. (71).

$$\Theta^* = \arg\max_\Theta P(D' \mid \Theta) \qquad (71)$$

By choosing convenient exponential family distributions and maximizing $\log(P(D' \mid \Theta))$, Θ* can be deduced analytically. When the steps of sampling and maximization are repeated, the system converges to appropriate Θ* values. An illustration of a very simple Markovian model, where output states are directly connected, is shown in Fig. 26. A simple Markovian model has no hidden states with which to imply more theory about the observed states.
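The dynamic programming step mentioned above can be sketched as follows. This is a minimal log-space Viterbi implementation in its standard form, which also scores the emission of the first observation; the array layout (start, trans, emit), the function name, and the two-state example with made-up probabilities are assumptions for illustration only.

```python
import numpy as np

def viterbi(obs, start, trans, emit):
    """Most probable state sequence for the observation indices in obs.

    start[s] = P(s), trans[r, s] = P(s | r), emit[s, o] = P(o | s).
    """
    k, n = len(obs), len(start)
    with np.errstate(divide="ignore"):           # log(0) = -inf is acceptable here
        log_start, log_trans, log_emit = np.log(start), np.log(trans), np.log(emit)
    score = np.empty((k, n))                     # best log-probability ending in each state
    back = np.zeros((k, n), dtype=int)           # best predecessor of each state
    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, k):
        cand = score[t - 1][:, None] + log_trans # cand[r, s]: come from r, move to s
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]
    path = [int(score[-1].argmax())]
    for t in range(k - 1, 0, -1):                # follow the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical two-state, two-symbol model with "sticky" states.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
emit = np.array([[0.9, 0.1],
                 [0.1, 0.9]])
best_path = viterbi([0, 0, 1, 1], start, trans, emit)
```

Counting the emit, transit, and start tuples along the returned path gives the data D' from which the probabilities can be re-estimated.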
FIG. 26 Markov and hidden Markov models illustrated using a hypothetical boy’s feelings of being loved or not loved by the girl of his dreams. The observed states are a phone call or a smile. The hidden states are loved or not loved. (A) Markov model of transitions between direct observations. (B) Hidden Markov model where hidden states are introduced and shown by dotted outlines in blue and green.

5.3.1 Hidden Markov model
The hidden Markov model (HMM) (Eddy, 2004; Rabiner, 1989) is an extension of the Markov model concept in which hidden states are used as well. A set of states $\Sigma = \{S_1, \ldots, S_n\}$ is defined. Each state emits an output symbol from the vocabulary $W = \{w_1, \ldots, w_m\}$. Each state can emit any of the output symbols with probability $(\forall i, j) : P(w_i \mid S_j)$. However, in an HMM we do not have direct transitions between observed states. Instead we define hidden variables $Z = \{z_1, \ldots, z_q\}$. The transitions from hidden variables to observed states are defined by $(\forall i, j) : P(S_i \mid z_j)$. The transitions between hidden states are defined by $(\forall i, j) : P(z_i \mid z_j)$. Let the parameters all be bundled into a single set Θ, just as in the case of the Markov model. The steps are outlined in Algorithm 19.
ALGORITHM 19 Simplified pseudocode: HMM algorithm.
Require: // Training data sequences
Initialize: Θ
for i = 1 : niter do
    // Expectation step: deduce optimal tuples for emissions, transitions and start states
    Construct D'
    // Maximization step: using analytical precomputed formulae
    Find optimal: $\Theta^* = \arg\max_\Theta P(D' \mid \Theta)$
end for
return Θ*

5.3.2 Latent Dirichlet allocation
Latent Dirichlet allocation (LDA) (Blei et al., 2003; Nakajima et al., 2014) is a very important method for the automatic detection of topics in data. Though it is mostly applied to text data, nontextual data can also be adapted to the topic detection formulation. The approach is to have topics in documents as latent variables, with topics generating the vocabulary, whose words are the observed symbols. Starting with a random distribution of topics per document and of vocabulary per topic, the algorithm arrives at optimal parameter values iteratively. The user needs only to specify the number of topics, and the algorithm simultaneously clusters documents and vocabulary according to topics. The iterations are similar to those of the K-means algorithm, where each point changes its cluster membership and each cluster centroid changes its location. Let W denote the vocabulary, $W = \{w_1, \ldots, w_n\}$. Let D denote the set of documents, $D = \{d_1, \ldots, d_m\}$. Let T denote the set of topics, $T = \{t_1, \ldots, t_k\}$. Let the probability distribution of words per topic be $q(w \mid t)$. Let the probability distribution of topics per document be $q(t \mid d)$. Let the parameters behind these probability distributions be Θ. The topic detection algorithm proceeds much like the standard EM algorithm, as below.
• Initialize q().
• Select optimal assignments of (i) topics per document and (ii) words per topic, based on a closeness measure as in K-means.
• Based on the memberships of topics, refine the Θ values using predefined update equations.
• Repeat the above steps for a certain number of iterations.
A simplified pseudocode of the algorithm, for illustration purposes, is shown in Algorithm 20. The actual implementation is much more complex than the one shown here and involves adjusting the counts of the topic-document and topic-word matrices. We present here a formulation based on a K-means-like methodology for the benefit of a reader beginning to understand the concepts behind deploying topic models.

ALGORITHM 20 Simplified pseudocode: Topic modeling.
Require: k // number of topics
$(\forall w \in W)$ : map[w] = RAND(1, …, k)
$(\forall d \in D)$ : map[d][w] = RAND(1, …, k)
for iter = 1 … niter do
    $(\forall w \in W)$ : $q(w \mid t)$ // calculate probability of word per topic
    $(\forall d \in D)$ : $q(t \mid d)$ // calculate probability of topic per document
    $(\forall t)$ : $q(t)$ // calculate probability of topic
    for $d \in D$ do
        for $t \in T$ do
            for $w \in W$ do
                UPDATE: map[d][w] = $\arg\max_{t'} q(t', w \mid d) = \arg\max_{t'} q(w \mid t') * q(t' \mid d)$
                UPDATE: map[w] = $\arg\max_{t'} q(t', w) = \arg\max_{t'} q(w \mid t') * q(t')$
            end for
        end for
    end for
end for

The topic model algorithm is not restricted to text data. Any other multimedia data can be modeled in the topic detection framework. Some examples of processing nontextual data using the topic modeling framework are outlined below.

Topic modeling of audio data
• Consider a raw audio file such as .wav, where sampled amplitudes are stored
• Create a sliding window of some size k amplitudes
• Slide through each audio file at some stride (i.e., hop length) and generate slides
• Cluster all the slides into h clusters
• Each slide corresponds to a cluster
• Consider each cluster as a word; the vocabulary size is the number of clusters (h)
• For each audio file, consider all its constituent slides, which correspond to words (if the length of the input is l, the number of slides is l − k + 1 for a stride of 1)
• Now each audio file is a sentence composed of words in a vocabulary whose entries are clusters of audio slides
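The audio steps above can be sketched as a small function that turns a signal into a sequence of word indices. This is a minimal illustration under stated assumptions: the cluster centroids are random stand-ins for centroids that would normally come from a clustering algorithm such as K-means, the signal is synthetic rather than read from a .wav file, and the name audio_to_words is hypothetical.

```python
import numpy as np

def audio_to_words(signal, k, stride, centroids):
    """Map each length-k window of a 1D signal to the index of its nearest centroid."""
    n_windows = (len(signal) - k) // stride + 1
    windows = np.stack([signal[i * stride : i * stride + k] for i in range(n_windows)])
    # Squared Euclidean distance from every window to every centroid.
    dists = ((windows[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)                  # one word (cluster index) per window

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)                   # stand-in for sampled amplitudes
k, h = 16, 8                                     # window size and vocabulary size
centroids = rng.normal(size=(h, k))              # assumption: stand-in for K-means centroids

words_stride1 = audio_to_words(signal, k, 1, centroids)
words_stride8 = audio_to_words(signal, k, 8, centroids)
```

With a stride of 1 the sentence has l − k + 1 words; a larger stride shortens the sentence to ⌊(l − k) / stride⌋ + 1 words. The resulting index sequences can be fed to a topic model in place of tokenized text.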
Topic modeling of image data
• Consider an image file
• Consider a sliding rectangular region; call it the sliding box
• Slide through the image from the top-left to the bottom-right corner using the selected rectangular sliding box
• Cluster all the rectangular image patches into h clusters
• If each image is composed of l slides, then it is equivalent to a sentence having l words, where each word corresponds to a cluster
• Now the problem is posed as topic modeling on a set of sentences, which are images whose vocabulary consists of the cluster centroids of image patches

6 Deep learning
Deep learning (Goodfellow et al., 2016; Bengio, 2012) is a methodology for transforming an input in a given dimensional space to another dimensional space. The vectors in the transformed space are spaced relative to one another according to the strength of correlations between features in the input space. The correlations can be nonlinear and highly complex, which makes deep learning algorithms applicable to a wide array of real world problems. Deep networks are used for determining complex patterns in the input which are difficult for human feature engineering teams to capture and maintain (Geron, 2017). The intuition behind why deep neural networks work the way they do today comes from interpreting them as a representation of a computer program. A computer program can be abstracted as a flow chart where nodes denote computations and edges denote the flow of information. Assume the compute nodes are further decomposed to the finest level of basic arithmetic operations over integers and real numbers. The question now is whether a neural graph or network can be constructed to emulate this compute graph. The answer is yes, due to the universal approximation theorem (Cybenko, 1989; Hornik, 1991). The universal approximation theorem only states that one can compose, by hand, the weights of a neural network to mimic an arbitrary function. However, the task in machine learning is to learn from data. As learning a complex compute graph is practically infeasible, simple architectures with layered structures and sparse cycles have come into existence. A number of deep neural network architectures are used in state-of-the-art machine learning models today, including the following. In this section we will discuss their architectures and working details (Geron, 2017).
• Deep belief networks (DBN) (Hinton et al., 2006)
• Restricted Boltzmann machine (RBM) (Fischer and Igel, 2012; Sutskever et al., 2008)
• Autoencoder (AE) (Joo et al., 2019)
• Variational autoencoder (Pu et al., 2016)
• Convolutional neural networks (CNN) (Lu et al., 2017)
• Recurrent neural networks (RNN) (Schmidhuber, 2015) and variants including long short-term memory networks (LSTM) (Hochreiter and Schmidhuber, 1997)
• Generative adversarial networks (GAN) (Goodfellow, 2017) and variants such as deep convolutional GAN (DCGAN) (Radford et al., 2016)
6.1 Neural network
An artificial neural network (ANN), or simply a neural network, is a state-of-the-art computational model for data transformation via a weighted graph of nodes and edges (Braspenning et al., 1995; Jain and Allen, 1995; Nielsen, 2018; Aggarwal, 2018b). The weights of the network are learned from the data. Every node has incoming edges and outgoing edges. A node consumes data through its incoming edges, performs arithmetic operations on the input data, and emits an output value. The output value is broadcast through all the outgoing edges. During the learning phase, error values flow in the reverse direction and update the weights along the way. The data flowing into a neuron is already multiplied by the edge weight as it flows. The first task a neuron performs is a simple addition of all its inputs. It then applies a transformation, called the activation function, before emitting the transformed value. The output value is replicated on all of the outbound edges and is multiplied by the edge weights as it flows through. The process then repeats at the receiving neurons, and so on. There are various types of activation functions, and custom activation functions can be defined as well. Some of the commonly used activation functions are listed below (Table 7). An illustration of a typical neural network is given in Fig. 27. Consider a data set having N points, $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where each point $x_i$ is d-dimensional. A single layer neural network with sigmoid activation function consumes each $x_i$ and emits a corresponding output $\hat{y}_i$, using a weight vector w and a bias b. The bias is the equivalent of the y-intercept in the equation of a line, Eq. (10).
$$\hat{y}_i = \sigma(w \cdot x_i + b)$$
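TABLE 7 Standard activation functions.
| Activation function | Formula |
| Sigmoid | $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ |
| ReLu | $\mathrm{relu}(x) = \max(0, x)$ |
| Leaky-ReLu | $\mathrm{lrelu}(x) = \max(\alpha x, x)$ |
| tanh | $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ |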
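The activation functions of Table 7 can be written down directly. The following is a minimal sketch in plain Python; the function names and the default slope `alpha=0.01` for Leaky-ReLu are assumptions made for illustration and are not taken from the text.

```python
import math


def sigmoid(x: float) -> float:
    """Sigmoid: 1 / (1 + e^(-x)); squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    """ReLu: max(0, x); passes positive values and zeroes out the rest."""
    return max(0.0, x)


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky-ReLu: max(alpha * x, x); alpha (assumed 0 < alpha < 1) is the
    small slope kept for negative inputs."""
    return max(alpha * x, x)


def tanh(x: float) -> float:
    """tanh: (e^x - e^(-x)) / (e^x + e^(-x)); squashes into (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))
```

For large |x| the explicit exponentials can overflow; a production implementation would normally call `math.tanh` and a numerically stable sigmoid instead.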
FIG. 27 Illustration of a neural network with three layers. A layer is counted only if it has weights to be learned from data; therefore the input layer is not counted, although four layers are visible. The notation Hij denotes the jth neuron of the ith hidden layer. The output layer is denoted by Oj, where j indexes its jth neuron. Each edge in the neural network carries a weight which is learned during the training process. In this neural network, the input layer is 4 dimensional, hidden layer 1 is 3 dimensional, hidden layer 2 is 5 dimensional, and the output layer is 2 dimensional.
The weight update for minimizing the squared error is shown below, where η is the learning rate. There are several other forms of error as well; some of the most popular ones are shown in Table 8.
$$L(w, x) = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
$$w^{*} = \arg\min_{w} L(w, x) \;\Rightarrow\; w^{new} = w^{old} - \eta \, \nabla_{w = w^{old}} L(w, x)$$
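To make the update rule concrete, the sketch below trains a single sigmoid neuron with the squared error and the gradient step $w^{new} = w^{old} - \eta \nabla L$. The toy data set (the logical OR function), the learning rate of 0.5 and the 200 iterations are assumptions chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """y_hat = sigma(w . x + b) for a single neuron."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def squared_error(w, b, data):
    """L(w, x) = sum_i (y_hat_i - y_i)^2."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in data)


def gradient_step(w, b, data, eta):
    """One update w_new = w_old - eta * grad L (and likewise for b)."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in data:
        y_hat = predict(w, b, x)
        # Chain rule: d(y_hat - y)^2 / dz = 2 (y_hat - y) * sigma'(z),
        # with sigma'(z) = y_hat * (1 - y_hat).
        delta = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)
        for j, xj in enumerate(x):
            grad_w[j] += delta * xj
        grad_b += delta
    new_w = [wj - eta * g for wj, g in zip(w, grad_w)]
    return new_w, b - eta * grad_b


# Hypothetical toy data: the logical OR of two binary inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w, b = [0.0, 0.0], 0.0
loss_before = squared_error(w, b, data)
for _ in range(200):
    w, b = gradient_step(w, b, data, eta=0.5)
loss_after = squared_error(w, b, data)
```

With all weights at zero every prediction starts at 0.5, so `loss_before` is 1.0; the repeated updates then lower the loss.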
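TABLE 8 Some of the standard loss functions.
| Loss function | Formula |
| Mean squared error | $\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$ |
| Hinge loss | $\max(0, 1 - y_i \hat{y}_i)$, with labels $y_i \in \{-1, +1\}$ |
| Cross entropy loss | $-\{y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\}$ |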
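As a small illustration, the three losses can be coded in their standard textbook forms. This is a sketch under stated assumptions: the hinge loss takes labels in {-1, +1} and a real-valued score, and the cross entropy takes a predicted probability strictly between 0 and 1 and a label in {0, 1}. The function names are hypothetical.

```python
import math


def mean_squared_error(y_hat, y):
    """(1/N) * sum_i (y_hat_i - y_i)^2 over two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


def hinge_loss(score, label):
    """max(0, 1 - label * score) for one sample, label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)


def cross_entropy(prob, label):
    """-(label*log(prob) + (1-label)*log(1-prob)) for one sample,
    with 0 < prob < 1 and label in {0, 1}."""
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
```

A confident and correct score, such as `hinge_loss(2.0, 1)`, incurs no hinge loss, whereas a score on the wrong side of the margin is penalized linearly.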
A neural network is a hierarchical organization from left to right. The left-most layer is numbered 1 and is always the input layer. The right-most layer is the output layer, which is always the last layer. In principle, the smallest neural network has two layers, one of inputs and one of outputs. Each layer has neurons. The neurons are numbered from top to bottom, with the top-most neuron being 1. As layer 1 is the input, the number of neurons in the first layer is the same as the dimensionality of the input data. As it might be difficult to add a bias column to the input, we reserve one extra neuron in the first layer. The bias neuron always emits 1. Let us indicate the last layer by L, which is also the total number of layers in the neural network. There are connections between neurons from the ith layer to the (i + 1)th layer. The connections are typically drawn from left to right. However, during the training phase, backpropagated error values flow in the reverse direction. Consider a neural network having L layers. Let us denote the jth neuron of the ith layer by N[i][j]. Also, let us denote the size of the ith layer by the attribute N[i].size, which includes the bias neuron that always emits 1. Let us denote the weight on the connection between N[i][j] and N[i + 1][k] by W[i][j][k]. Compactly, W[i] denotes a matrix of size N[i].size × N[i + 1].size. The output of the ith layer is an N[i].size dimensional vector, say X[i], where X[i][j] denotes the value emitted by the jth neuron of this layer. As the data flows through the network, a summation over incoming edges happens at each node N[i + 1][k]. The summation is
$$Z[i+1][k] = \sum_{j=1}^{N[i].\mathrm{size}} X[i][j] \, W[i][j][k]$$
The summation can be captured in the form of a matrix multiplication as
$$Z[i+1] = (X[i]^T W[i])^T = W[i]^T X[i]$$
The summation of weighted inputs is then passed through an activation function at each neuron. Let us denote the activation function by ψ() and the final output of the jth neuron of the ith layer by ψ(Z[i][j]). In compact notation, the application of ψ() at the vector level is
$$X[i] = \psi(Z[i]).$$
Therefore the final output is X[L], which is obtained by Eq. (72).
$$X[L] = \psi(Z[L]) = \psi(W[L-1]^T X[L-1]) \tag{72}$$
The output of the ith layer, expressed in terms of the previous layer, recursively becomes Eq. (73). Starting from the input X[1], the value is computed by successive application of the weight matrices to generate X[L].
$$(\forall i = [L : 2]) : \; X[i] = \psi(W[i-1]^T X[i-1]) \tag{73}$$
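The recursion of Eq. (73) is a short loop in code. The sketch below is only illustrative: it uses 0-based Python lists, a sigmoid for ψ(), and omits the bias neuron for brevity; the weight values are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, weights, psi=sigmoid):
    """Return the outputs of all layers, [X[1], ..., X[L]].

    weights[i][j][k] is the weight from neuron j of one layer to neuron k
    of the next, so each entry of `weights` plays the role of a matrix W[i].
    """
    outputs = [list(x)]
    for W in weights:
        prev = outputs[-1]
        n_out = len(W[0])
        # Z[i+1][k] = sum_j X[i][j] * W[i][j][k]
        z = [sum(prev[j] * W[j][k] for j in range(len(prev)))
             for k in range(n_out)]
        # X[i+1] = psi(Z[i+1]), applied element by element
        outputs.append([psi(v) for v in z])
    return outputs


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output.
W1 = [[0.1, -0.2, 0.3],
      [0.4, 0.5, -0.6]]
W2 = [[0.7], [-0.8], [0.9]]
layers = forward([1.0, 2.0], [W1, W2])
```

Here `layers[-1]` is the final output X[L], and the intermediate entries are kept because the gradient computation needs them later.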
Let us denote the ground truth output vector by Y, where Y[j] denotes, from top to bottom, the expected value of the jth neuron. Corresponding to the bias output of the Lth layer, which is the last layer, assume that Y[N[L].size] = 1 always. Recall that the bias neuron of any layer always emits 1. Now we can formulate a loss function and gradient descent equations for the learning part of the network. Let us denote all the weight matrices simply by W; then the loss function is as shown in Eq. (74). Note that loss(W, X[1]) is a single real number after substitution of its parameter values.
$$\mathrm{loss}(W, X[1]) = L(X[L], Y) \tag{74}$$
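The partial derivative of the loss function with respect to the weight W[i][j] is given by the following equations, where ⊙ denotes element-wise product and the subscripts indicate dimensions.
$$\frac{\partial L(\cdot)}{\partial W[i][j]} = \left(\frac{\partial L(\cdot)}{\partial X[L]}\right)^{T}_{1 \times N[L].\mathrm{size}} \left[\frac{\partial X[L]}{\partial W[i][j]}\right]_{N[L].\mathrm{size} \times 1}$$
$$\frac{\partial X[L]}{\partial W[i][j]} = \frac{\partial \psi(Z[L])}{\partial W[i][j]} = \left[\frac{\partial \psi(Z[L])}{\partial Z[L]}\right]_{N[L].\mathrm{size} \times N[L].\mathrm{size}} \left[\frac{\partial Z[L]}{\partial W[i][j]}\right]_{N[L].\mathrm{size} \times 1}$$
$$\frac{\partial Z[L]}{\partial W[i][j]} = \frac{\partial \left( [W[L-1]^T]_{N[L].\mathrm{size} \times N[L-1].\mathrm{size}} \, [X[L-1]]_{N[L-1].\mathrm{size} \times 1} \right)}{\partial W[i][j]}$$
$$= \left[\frac{\partial (W[L-1]^T X[L-1])}{\partial X[L-1]}\right]_{N[L].\mathrm{size} \times N[L-1].\mathrm{size}} \left[\frac{\partial X[L-1]}{\partial W[i][j]}\right]_{N[L-1].\mathrm{size} \times 1} + \left[\frac{\partial (W[L-1]^T X[L-1])}{\partial W[L-1]^T}\right]_{N[L-1].\mathrm{size} \times 1} \left[\frac{\partial W[L-1]^T}{\partial W[i][j]}\right]_{N[L].\mathrm{size} \times N[L-1].\mathrm{size}}$$
$$\frac{\partial (W[L-1]^T X[L-1])}{\partial X[L-1]} = W[L-1]^T, \qquad \frac{\partial (W[L-1]^T X[L-1])}{\partial W[L-1]^T} = X[L-1]$$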
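Putting it all together, the simplification becomes
$$\frac{\partial X[L]}{\partial W[i][j]} = \frac{\partial \psi(Z[L])}{\partial W[i][j]} = \frac{\partial \psi(Z[L])}{\partial Z[L]} \frac{\partial Z[L]}{\partial W[i][j]} = \frac{\partial \psi(Z[L])}{\partial Z[L]} \frac{\partial (W[L-1]^T X[L-1])}{\partial W[i][j]}$$
$$= \frac{\partial \psi(Z[L])}{\partial Z[L]} \left\{ \frac{\partial (W[L-1]^T X[L-1])}{\partial W[L-1]^T} \frac{\partial W[L-1]^T}{\partial W[i][j]} + \frac{\partial (W[L-1]^T X[L-1])}{\partial X[L-1]} \frac{\partial X[L-1]}{\partial W[i][j]} \right\}$$
$$= \frac{\partial \psi(Z[L])}{\partial Z[L]} \left\{ X[L-1] \frac{\partial W[L-1]^T}{\partial W[i][j]} + W[L-1]^T \frac{\partial X[L-1]}{\partial W[i][j]} \right\}$$
We also note that X[k] is independent of W[i][j] whenever k ≤ i, because the weights W[i] only influence layers to their right; the corresponding partial derivatives are set to zero whenever they arise. The recursive definition becomes Eq. (75).
$$\frac{\partial X[k]}{\partial W[i][j]} = \frac{\partial \psi(Z[k])}{\partial Z[k]} \begin{cases} 0 & (i \geq k) \\ X[k-1] & (i = k-1) \\ W[k-1]^T \dfrac{\partial X[k-1]}{\partial W[i][j]} & (i < k-1) \end{cases} \tag{75}$$
The derivative of the loss function becomes Eq. (76).
$$\frac{\partial L(W, X[L])}{\partial W[i][j]} = \frac{\partial L(W, X[L])}{\partial X[L]} \frac{\partial X[L]}{\partial W[i][j]} \tag{76}$$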
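One way to gain confidence in such a derivation is to compare the chain-rule gradient with a finite-difference estimate. The sketch below does this for a small sigmoid network with a squared-error loss. It uses the usual layer-by-layer ("delta") form of the chain rule rather than a literal transcription of Eq. (75), omits bias neurons, and all weights and data are made-up values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, weights):
    outputs = [list(x)]
    for W in weights:
        prev = outputs[-1]
        outputs.append([sigmoid(sum(prev[j] * W[j][k] for j in range(len(prev))))
                        for k in range(len(W[0]))])
    return outputs


def loss(x, y, weights):
    out = forward(x, weights)[-1]
    return sum((o - t) ** 2 for o, t in zip(out, y))


def backprop(x, y, weights):
    """Return dL/dW[i][j][k] for every weight via the chain rule."""
    outs = forward(x, weights)
    grad_x = [2.0 * (o - t) for o, t in zip(outs[-1], y)]  # dL/dX[L]
    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        W, x_in, x_out = weights[i], outs[i], outs[i + 1]
        # For the sigmoid, psi'(Z) = X * (1 - X).
        delta = [g * o * (1.0 - o) for g, o in zip(grad_x, x_out)]
        grads[i] = [[x_in[j] * delta[k] for k in range(len(delta))]
                    for j in range(len(x_in))]
        # Propagate the error one layer to the left through W.
        grad_x = [sum(W[j][k] * delta[k] for k in range(len(delta)))
                  for j in range(len(x_in))]
    return grads


def numeric_grad(x, y, weights, i, j, k, eps=1e-6):
    """Central finite difference for a single weight."""
    original = weights[i][j][k]
    weights[i][j][k] = original + eps
    up = loss(x, y, weights)
    weights[i][j][k] = original - eps
    down = loss(x, y, weights)
    weights[i][j][k] = original
    return (up - down) / (2.0 * eps)


weights = [[[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]],
           [[0.7, -0.1], [-0.8, 0.2], [0.9, 0.3]]]
x, y = [0.5, -1.0], [0.0, 1.0]
grads = backprop(x, y, weights)
max_gap = max(abs(grads[i][j][k] - numeric_grad(x, y, weights, i, j, k))
              for i in range(len(weights))
              for j in range(len(weights[i]))
              for k in range(len(weights[i][j])))
```

A small `max_gap` indicates that the analytic and numerical gradients agree for every weight in the network.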
6.1.1 Gradient magnitude issues
As we see from Eqs. (75) and (76), the W[i][j] term is recursively multiplied along the way from the last layer L down to the (i + 1)th layer. This poses a problem for the weight updates in the left side layers of the neural network. Let us denote by ∇ψ(Z[i]) the gradient of the activation function at Z[i]. If (∀i, j) : ∇ψ(Z[i]) * W[i][j] > 1, then the product of the terms becomes ≫ 1, which is called the exploding gradient problem. When (∀i, j) : ∇ψ(Z[i]) * W[i][j] < 1, the product of the terms becomes ≪ 1, which is called the vanishing gradient problem. The solution to these problems is twofold: (i) the choice of the ψ() function and (ii) the initialization of W[i][j]. The vanishing gradient problem is most pronounced for the σ() activation function, whose derivative never exceeds 1/4. The ReLu and Leaky-ReLu functions in Table 7 are much less affected, because their gradient is 1 for positive inputs; tanh also saturates, although its gradient reaches 1 at the origin. Also, the W[i][j] initialization can be done using Xavier's method, i.e., setting the values to a Gaussian random variable with mean 0 and variance inversely proportional to the average of the number of input and output neurons.
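The effect of the repeated product can be seen with a few lines of arithmetic. The sketch below is illustrative only: `gradient_scale` assumes, for simplicity, the same pre-activation and the same scalar weight at every layer, and `xavier_init` uses the common choice of variance 2 / (n_in + n_out), i.e., the reciprocal of the average of the two layer sizes. Both function names are hypothetical.

```python
import math
import random


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25, at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_scale(depth, weight, z=0.0):
    """Product of `depth` identical factors sigma'(z) * weight."""
    return (sigmoid_grad(z) * weight) ** depth


def xavier_init(n_in, n_out, rng):
    """Gaussian weights with mean 0 and variance 2 / (n_in + n_out)."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


vanishing = gradient_scale(depth=10, weight=1.0)   # 0.25 ** 10
exploding = gradient_scale(depth=10, weight=8.0)   # 2.0 ** 10
W = xavier_init(4, 6, random.Random(0))
```

With unit weights the factor per layer is at most 0.25, so ten layers already shrink the gradient by about six orders of magnitude; with weights of 8 the per-layer factor is 2 and the same ten layers amplify it about a thousandfold.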
6.1.2 Relation to ensemble learning
There is a strong relationship between neural networks and ensembling methods (Hansen and Salamon, 1990; Krogh and Vedelsby, 1995). An ensemble of weak classifiers in a boosting algorithm makes a strong classifier; refer to the AdaBoost Algorithm 5. In a boosting algorithm, the errors from the current cumulative classifier are fed into each new classifier as it is added. In a neural network, similarly, the outputs of one layer are fed into the next layer, and so on. The loss function is minimized from the right end of the neural network. This is equivalent to minimizing losses successively over layers from left to right. Essentially, one can find resemblances between boosting algorithms and neural networks, and both are ensembling types of methods.
6.2 Encoder
Encoding corresponds to the transformation of points in a given dimensional space to another dimensional space. Typically the transformation is to a lower dimensional space. It is also related to data compression; however, the reverse transformation may not be possible. There are different ways of encoding real-world data; some of the methods are tabulated in Table 9.
6.2.1 Vectorization of text
Vectorization of text is one of the fundamental steps in text analytics (Aggarwal, 2018a). Consider the classification of email messages as spam or not-spam. For the classifier to consume the data, the input text needs to be in the form of a vector. There are various vector representations of textual data, and all of the approaches operate over vocabulary sets (Pedregosa et al., 2011).
TABLE 9 Some of the standard encoding methods.
| Method name | Brief description |
| Binary encoding | All columns are either 0 or 1 |
| One-hot encoding | A special type of binary encoding, where only one 1 is present |
| Vector encoding | Converting data into a numeric vector of fixed dimension |
| Text vectorizer | Converting textual data into vector forms |
• Gather all the words in the data set into a vocabulary set.
• Convert each email message into a binary vector of 0's and 1's based on the presence or absence of words.
• The size of the vector is the same as the size of the vocabulary.
• The whole data set is now in a form ready to be consumed by standard machine learning algorithms.
Count vectorization: In this technique, for each word a count of the number of occurrences within a document or paragraph is stored in the vector representation, instead of mere presence or absence. A count vectorizer may be more informative than a plain binary vectorizer.
TFIDF vectorization: In this technique (Aizawa, 2003) the frequency of a word within a document is computed; however, it is divided by the number of documents in which that particular term occurs. This is more informative than a mere count vectorizer.
k-gram vectorization: In this technique (Cavnar and Trenkle, 1994), instead of a single word, a stretch of k words is considered as a single word. In this setting the size of the vocabulary increases exponentially with k. If, for instance, a vocabulary has 1000 words and we are considering 3-grams, then the new vocabulary size can be as large as 1000^3. Though this technique results in very sparse spaces, it often retains highly informative features, also when combined with TFIDF vectorization.
One-hot vectorization: This technique is applicable to categorical attributes. If an attribute has m categories and, in a given row of data, the attribute can take only one of the m values, then it is processed via one-hot encoding. In this encoding, a vector of length m is generated for that attribute in which only one bit is set to 1, while the remaining bits are all 0. The vectors for individual attributes are all appended together to form the final feature vector for the data.
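The vectorization schemes above can be sketched in a few lines of plain Python. The three-document corpus and the function names are made up for illustration, and `tfidf_vector` follows the simple definition given in the text (term count divided by the number of documents containing the term); library implementations usually use a logarithmic inverse document frequency instead.

```python
from collections import Counter

# Hypothetical toy corpus of three short messages.
docs = ["win money now", "meeting at noon", "win a free meeting now now"]


def build_vocabulary(docs):
    """Gather all words in the data set into a sorted vocabulary."""
    return sorted({word for doc in docs for word in doc.split()})


def binary_vector(doc, vocab):
    """1 if the word is present in the document, else 0."""
    present = set(doc.split())
    return [1 if word in present else 0 for word in vocab]


def count_vector(doc, vocab):
    """Number of occurrences of each vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


def tfidf_vector(doc, vocab, docs):
    """Term count divided by the number of documents containing the term."""
    counts = Counter(doc.split())
    doc_freq = {w: sum(1 for d in docs if w in d.split()) for w in vocab}
    return [counts[w] / doc_freq[w] for w in vocab]


def kgrams(doc, k):
    """All stretches of k consecutive words, each treated as one token."""
    words = doc.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def one_hot(value, categories):
    """Vector of length m with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


vocab = build_vocabulary(docs)
```

Every vector has the length of the vocabulary, so the whole corpus becomes a matrix with one row per document that standard learning algorithms can consume.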
6.2.2 Autoencoder
An autoencoder (Joo et al., 2019) is a machine learning-based vector encoding that reduces the difference between the generated data and the input data. In this approach a given data point is processed via an operator matrix to generate a transformed vector. The transformed vector is then mapped back into the input space. The difference between the input and the regenerated input is minimized using an optimization formulation that learns the elements of the operator matrix. Let a data point be x in a d-dimensional space. Let $W_{d \times k}$ be a transformation matrix. The point can be transformed into the k-dimensional space by
$$x' = W^T x$$
The transformed point can then be transformed back into the d-dimensional space by
$$x_R = W x'$$
Let the loss function be
$$L(W) = \lVert x - x_R \rVert^2$$
An optimal W is determined using gradient descent,
$$W^{*} = \arg\min_{W} L(W)$$
On a data set X of points, the loss function is computed as a cumulative summation,
$$L(W) = \sum_{x \in X} \lVert W W^T x - x \rVert^2$$
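A minimal sketch of this optimization is given below for the special case k = 1, where W reduces to a single d-dimensional vector w, the code is the scalar w · x, and the reconstruction is w (w · x). The data, starting point, learning rate and iteration count are assumptions chosen for illustration.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def reconstruct(w, x):
    code = dot(w, x)                 # x' = W^T x (a scalar when k = 1)
    return [wi * code for wi in w]   # x_R = W x'


def loss(w, data):
    """L(W) = sum over x of ||W W^T x - x||^2."""
    return sum(sum((r - xi) ** 2 for r, xi in zip(reconstruct(w, x), x))
               for x in data)


def gradient(w, data):
    """dL/dw = sum over x of 2 * ((w.x) * r + (r.w) * x), r = w (w.x) - x."""
    grad = [0.0] * len(w)
    for x in data:
        code = dot(w, x)
        residual = [wi * code - xi for wi, xi in zip(w, x)]
        rw = dot(residual, w)
        for a in range(len(w)):
            grad[a] += 2.0 * (code * residual[a] + rw * x[a])
    return grad


# Hypothetical 2-dimensional points that all lie on the line x1 = x2.
data = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [0.5, 0.5]]
w = [0.8, 0.1]
loss_before = loss(w, data)
for _ in range(1000):
    g = gradient(w, data)
    w = [wi - 0.005 * gi for wi, gi in zip(w, g)]
loss_after = loss(w, data)
```

Because the points lie on a line, a one-dimensional code suffices: gradient descent drives w toward the unit vector along that line, where the reconstruction error is zero.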
6.2.3 Restricted Boltzmann machine
The Boltzmann distribution is a probability function used in statistical physics to characterize the state of a system of particles with respect to temperature and energy. The system can exist in several states; however, the chance of being in a certain subset of states is higher than in others. The chance itself is parameterized over certain property values. When the energy of a system is high, it fluctuates heavily; as it loses energy and cools down, it stabilizes to one of the desirable states (Fischer and Igel, 2012). The analogy is used in AI-based state space search as well. In simulated annealing, the probability of transition to a next state depends on a heuristic function value and on the number of iterations, which emulates system cooling. This method finds applications in synthetic generation of data, filling in missing values, and encoding of representations, which are of high importance in the data science regime.
Imagine a situation where all features of a data point interact with and influence each other. A given data set can be considered a sample drawn from a probability distribution. The learning task is to induce a probability distribution on the interactions between features when considered over a collection of samples. A pair of data points that exhibit certain feature values may be more closely related than some other pairs in the data set. Determining such a probability distribution helps in engineering better features or in performing feature transformations compliant with the hidden structure in the data. The past work is referred to as Ising models (Cipra, 1987), where a lattice of electrons is studied for their spins subject to ambient energy. The spin of an electron can be +1 or −1. Given a lattice of electrons, the spin of an electron depends probabilistically on the spins of its neighbors. When the system has high energy, the spins are all random. As the system cools down, it converges to a specific structure of spins.
Relating this back to data science problems, one can imagine a data set where all features are binary. Each feature flips its binary value based on its neighbors. A number of such feature flips are generated to form a synthetic data set. The difference between the synthetic data set and the given data is evaluated on any metric of choice, and the probability distribution parameters are adjusted. Over several iterations, the system converges to optimal parameter values that capture the structure in the data. However, these models are limited in capturing any underlying or hidden principles connecting the direct features. When dealing with a large number of features, such as a data set of images where each image is one megapixel and therefore has 1 million features, the feature correlation and neighborhood approach is not practical, as we do not know how large a neighborhood to use and there are too many parameters. Alternatively, one can model the problem by introducing hidden variables that connect a group of features to one hidden variable, another group to another hidden variable, and so on. The hidden states and the observed states may have interconnections between them and among themselves. Such a model is called a Boltzmann model. The Boltzmann model is impractical to solve due to the circular nature of its connections: the system forgets the learned parameters and repeats calculations.
A Restricted Boltzmann Machine (RBM) (Sutskever et al., 2008) is a simplification of the general Boltzmann machine approach, in the sense of imposing more restrictions on the structure of the graphical model. A bipartite graph is created, composed of hidden states and observed (or visible) states. There are no connections among the hidden states and no connections among the visible states themselves. These restrictions force the system to learn parameters and converge over iterations.
A Bernoulli RBM is modeled as a mapping from a visible vector v in a d-dimensional space to a hidden vector h in a k-dimensional space via a matrix $W_{d \times k}$. Both the visible and the hidden vectors are binary vectors. The hidden vector is obtained from the visible vector by applying the W matrix and setting the bits of the h vector based on a probability given by a sigmoid activation function. Similarly, the visible vector is generated back from the h vector based on the sigmoid activation function.
$$h \sim \sigma(W^T v), \qquad v_R \sim \sigma(W h)$$
The difference between the reconstructed visible vector $v_R$ and the actual visible vector v is minimized over the data set X, Eq. (77).
$$W^{*} = \arg\min_{W} \sum_{v \in X} \lVert W W^T v - v \rVert^2 \tag{77}$$
An illustration of an RBM is shown in Fig. 28 where hidden and visible nodes are marked.
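The two sampling steps of a Bernoulli RBM can be sketched as follows. This covers only one visible-to-hidden-to-visible pass, not the full training procedure; the weight values, the seed and the function names are assumptions for illustration, and bias terms are omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_hidden(v, W, rng):
    """h_j ~ Bernoulli(sigmoid(sum_i W[i][j] * v[i])), i.e. h ~ sigma(W^T v)."""
    k = len(W[0])
    probs = [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))))
             for j in range(k)]
    return [1 if rng.random() < p else 0 for p in probs]


def sample_visible(h, W, rng):
    """v_i ~ Bernoulli(sigmoid(sum_j W[i][j] * h[j])), i.e. v_R ~ sigma(W h)."""
    d = len(W)
    probs = [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(d)]
    return [1 if rng.random() < p else 0 for p in probs]


rng = random.Random(0)
# Hypothetical 4 x 3 weight matrix: 4 visible units, 3 hidden units.
W = [[rng.gauss(0.0, 0.1) for _ in range(3)] for _ in range(4)]
v = [1, 0, 1, 1]
h = sample_hidden(v, W, rng)
v_reconstructed = sample_visible(h, W, rng)
reconstruction_error = sum((a - b) ** 2 for a, b in zip(v, v_reconstructed))
```

Training would repeat this pass over the data set and adjust W so that the reconstruction error shrinks on average.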
FIG. 28 Restricted Boltzmann machine (RBM) for a 4 dimensional input and one hidden layer with three neurons. Each edge in the neural network corresponds to a parameter which is learned during the training phase. The neurons $X_j$ and $X'_j$ correspond to the actual input and to the reconstructed input after the transformation by the hidden layer. The RBM algorithm minimizes the net reconstruction error by the contrastive divergence mechanism.
6.3 Convolutional neural network
A CNN (Lu et al., 2017) is a deep neural network architecture that comes under the supervised learning methodology. It was mainly developed to address automatic learning of features in images through several layers of convolutions and scaling. Though the networks are typically used for images and the initial demonstration was on images, they can be applied to any type of data where all features have homogeneous semantics.
Feature homogeneity and localization: This is the notion that all features have the same semantics and value distributions and possess proximity. Same value distribution means that the probability that a feature $F_i$ assumes a value v, $P(F_i = v)$, is the same as $P(F_j = v)$ for another feature $F_j$. Same semantics means that all features have the same meaning, for instance, all pixels, all audio amplitudes, all temperatures, all electrostatic forces, and so on. Localization of features corresponds to a notion of neighborhood among the features. For instance, a given pixel has eight neighboring pixels: left, right, top, bottom and the four corners. In the case of an audio signal, the samples are taken at regular intervals, say every 10 ms, i.e., 100 Hz. The neighbors of a given amplitude recording are the one on the left and the one on the right. In the case of a video recording, a given frame has a predecessor frame and a next frame, excluding of course the boundaries. In the case of text data, a given sentence is a sequence of words. The previous word and the next word correspond to the neighborhood of a given word.
Notion of a filter: Consider, for instance, a household filter which segregates larger items from smaller ones. Though the filter outputs the smaller items mixed up among each other, it has still carried out a useful step of eliminating large elements. Consider another example of a filter, which segregates elements by their weight. For instance, in a mass spectrometer, elements of differing masses are separated.
Convolution: Given a notion of neighborhood among the features of a data point, a weighted average over the neighborhood of a particular feature (imagine a pixel and its neighborhood) often captures more information. For instance, consider a high resolution photograph of a house of, say, 1 megapixel, with a width of 1000 pixels and a height of 1000 pixels. In order for us to recognize that image as a house, a stamp-sized image of a mere 100 pixels, with a width of 10 pixels and a height of 10 pixels, may be sufficient. Often, looking at information in high resolution when it is not required leads to wrong predictions. It is then necessary to purposefully down-sample the data while retaining aggregate information. One way to accomplish this task in signal processing is to take a weighted average of neighborhoods and down-sample. A fundamental building block of a CNN is the convolution filter. A filter is a matrix of smaller dimension that is used for localized weighted averaging. Assume an image given in the form of a matrix $X_{W \times H}$. Consider a filter as a small matrix $F_{(2k+1) \times (2k+1)}$ (assuming a simple odd-sized, square filter). For a given pixel position i, j in the input image, a convolution is defined as in Algorithm 21.
Tensor convolution: The convolution operation may be defined over a multidimensional filter as well. For instance, if $X_{d_1 \times d_2 \times \ldots \times d_k}$ is a k dimensional tensor, then a convolutional filter will also be of k dimensions, of size $F_{p_1 \times p_2 \times \ldots \times p_k}$.
ALGORITHM 21 Example: image convolution.
Require: X, F //input and filter
  X' = 0 //output image after convolution
  for w = 0 … W − 1 do
    for h = 0 … H − 1 do
      //for each pixel, perform convolution
      sum = 0
      for p = −k … k do
        for q = −k … k do
          if 0 ≤ w + p ≤ W − 1 and 0 ≤ h + q ≤ H − 1 then
            sum ← sum + X[w + p][h + q] × F[p + k][q + k]
          end if
        end for
      end for
      X'[w][h] ← sum
    end for
  end for
  //Convolution is defined as X' = X ⊙ F
  //The same is computed by the above pseudocode
  return X'
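Algorithm 21 translates almost line by line into Python. The sketch below assumes the image is a list of rows, the filter is square with odd side 2k + 1, and positions outside the image contribute nothing (zero padding); the function name is hypothetical.

```python
def convolve2d(X, F):
    """Convolve image X (W x H) with a (2k+1) x (2k+1) filter F."""
    W, H = len(X), len(X[0])
    k = (len(F) - 1) // 2
    out = [[0.0] * H for _ in range(W)]
    for w in range(W):
        for h in range(H):
            total = 0.0
            for p in range(-k, k + 1):
                for q in range(-k, k + 1):
                    # Use only neighbors that fall inside the image.
                    if 0 <= w + p <= W - 1 and 0 <= h + q <= H - 1:
                        total += X[w + p][h + q] * F[p + k][q + k]
            out[w][h] = total
    return out


ones_image = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
box_filter = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
summed = convolve2d(ones_image, box_filter)
```

With an all-ones image and an all-ones filter, each output pixel simply counts its in-bounds neighbors: 9 in the center, 6 on an edge, and 4 in a corner.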
Then a convolution is defined as in Eq. (78) for a multidimensional matrix or tensor.
$$X'[a] = \sum_{i=1}^{k} X[a - p_i/2 : a + p_i/2, \, i] \odot F[0 : p_i, \, i] \tag{78}$$
The convolution operation is efficiently realized in a highly parallel environment on GPU architectures. This is the primary reason why CNNs are trained in orders of magnitude less time on a GPU than on a CPU.
6.3.1 Filter learning
The filters in a CNN need to be learned from data. In order to illustrate the filter learning process, let us consider a very simple example which extends readily to complex and real-world cases. Assume we are building a one-dimensional CNN, where the input is one dimensional and the output is a single numeric value. The input goes through a convolution phase with a one-dimensional filter. Let $X = [x_0, \ldots, x_{n-1}]$ be an input of dimension 1 × n, i.e., a one-dimensional array. Let σ() be the activation function of the neural network. Let the convolutional filter $F = [f_0, f_1, f_2]$ be one dimensional with three elements. The setup is shown in Algorithm 22. The input is convolved with F[] and an intermediate output is generated. The convolved vector is then multiplied with the W[] vector to generate the output, Eq. (79). The error of the output is measured by a loss function.
$$o(X, F, W) = \sigma(W \cdot (X \odot F)) \tag{79}$$
The derivative of the loss function with respect to the filter values is computed below for the example of $f_0$. The derivative of the loss function with respect to each $f_i$ is given in Eq. (80). The dimensions are indicated for clarity.
ALGORITHM 22 CNN filter learning example.
  //initialize convolution
  X' = [0]_{1×n}
  for i = 0 : n − 1 do
    sum = 0
    for p = [−1 … +1] do
      if 0 ≤ i + p ≤ n − 1 then
        sum = sum + X[i + p] * F[p + 1]
      end if
    end for
    X'[i] = sum
  end for
  ŷ = σ(Σ_{i=0}^{n−1} w_i X'[i])
  //Loss function L(y, ŷ)
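$$\frac{\partial L(\hat{y}, y)}{\partial f_0} = \frac{\partial L(\hat{y}, y)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial f_0}$$
$$\frac{\partial \hat{y}}{\partial f_0} = \frac{\partial \sigma(W \cdot X')}{\partial f_0} = \frac{\partial \sigma(W \cdot X')}{\partial W} \frac{\partial W}{\partial f_0} + \frac{\partial \sigma(W \cdot X')}{\partial X'} \frac{\partial X'}{\partial f_0}$$
Since $\partial W / \partial f_0 = 0$, the first term vanishes:
$$\frac{\partial \hat{y}}{\partial f_0} = \frac{\partial \sigma(W \cdot X')}{\partial X'} \frac{\partial X'}{\partial f_0}$$
Since $\partial (W \cdot X') / \partial X' = W$, and writing σ'() for the derivative of σ(),
$$\frac{\partial \hat{y}}{\partial f_0} = \sigma'(W \cdot X') \, W^T \frac{\partial X'}{\partial f_0}$$
Since
$$\frac{\partial X'}{\partial f_0} = \frac{\partial (X \odot F)}{\partial f_0} = \frac{\partial [x_{i-1} f_0 + x_i f_1 + x_{i+1} f_2]_{i = 0, \ldots, n-1}}{\partial f_0} = [x_{-1}, \ldots, x_{n-2}]^T$$
we obtain
$$\frac{\partial \hat{y}}{\partial f_0} = \sigma'(W \cdot X') \, W^T [x_{-1}, \ldots, x_{n-2}]^T$$
In general, $(\forall i \in [0, \ldots, 2])$, with $x_j = 0$ for all $j \notin [0, \ldots, n-1]$:
$$\frac{\partial L(\hat{y}, y)}{\partial f_i} = \frac{\partial L(\hat{y}, y)}{\partial \hat{y}} \, \sigma'(W \cdot X') \left( [x_{i-1}, \ldots, x_{n-2+i}]_{1 \times n} \, W_{n \times 1} \right)_{1 \times 1} \tag{80}$$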
Each of the filter's weights is updated using gradient descent as in Eq. (81), where η is the learning rate.
$$f_i^{(t)} = f_i^{(t-1)} - \eta \frac{\partial L(\hat{y}, y)}{\partial f_i} \tag{81}$$
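The filter update can be checked numerically on a small example. The sketch below follows the setup of Algorithm 22 with a squared-error loss $L = (\hat{y} - y)^2$, which is an assumption since the text leaves the loss generic. The gradient includes the factor σ'(W · X') from the chain rule, and the input, weights, target and learning rate are made-up values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def convolve1d(X, F):
    """X'[i] = sum over p in {-1, 0, 1} of X[i + p] * F[p + 1], zero outside."""
    n = len(X)
    out = [0.0] * n
    for i in range(n):
        for p in (-1, 0, 1):
            if 0 <= i + p <= n - 1:
                out[i] += X[i + p] * F[p + 1]
    return out


def predict(X, F, W):
    """y_hat = sigma(sum_i w_i * X'[i])."""
    return sigmoid(sum(w * c for w, c in zip(W, convolve1d(X, F))))


def filter_gradient(X, F, W, y):
    """dL/df_i for L = (y_hat - y)^2, for i = 0, 1, 2."""
    n = len(X)
    y_hat = predict(X, F, W)
    dL_dyhat = 2.0 * (y_hat - y)
    dsigma = y_hat * (1.0 - y_hat)      # sigma'(W . X')
    grads = []
    for i in range(3):
        # Shifted input [x_{i-1}, ..., x_{n-2+i}], with x_j = 0 out of range.
        shifted = [X[m + i - 1] if 0 <= m + i - 1 <= n - 1 else 0.0
                   for m in range(n)]
        grads.append(dL_dyhat * dsigma * sum(w * s for w, s in zip(W, shifted)))
    return grads


X = [0.5, -1.0, 2.0, 0.0, 1.5]
W = [0.2, -0.1, 0.4, 0.3, -0.5]
F = [0.1, 0.2, -0.3]
y = 1.0
loss_before = (predict(X, F, W) - y) ** 2
for _ in range(100):
    g = filter_gradient(X, F, W, y)
    F = [f - 0.5 * gi for f, gi in zip(F, g)]   # f_i <- f_i - eta * dL/df_i
loss_after = (predict(X, F, W) - y) ** 2
```

Only the three filter values are updated here; W is held fixed so that the effect of the filter update alone is visible in the decreasing loss.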
Filter update for multiple output neurons: The filter update equation presented in Eq. (81) can be extended to complex scenarios. The filter F (in Eq. 79) is situated between a pair of input and output layers. The extension of Eq. (81) to multiple output neurons o = 1 … O is given in Eq. (82).
$$f_i^{(t)} = f_i^{(t-1)} - \eta \sum_{o=1}^{O} \frac{\partial L(\hat{y}_o, y)}{\partial f_i} \tag{82}$$
Updating multidimensional tensor filters: The filter presented in Eq. (82) is one dimensional. However, the formulation extends directly to filters of multiple dimensions. If the input is a k dimensional tensor, the convolving filter is also a k dimensional tensor. The update equation for a tensor type filter is given in Eq. (83).
$$f_{i_1, \ldots, i_k}^{(t)} = f_{i_1, \ldots, i_k}^{(t-1)} - \eta \sum_{o=1}^{O} \frac{\partial L(\hat{y}_o, y)}{\partial f_{i_1, \ldots, i_k}} \tag{83}$$
Filter update in interior layers: A filter need not sit between the end layers; it can be situated between any pair of layers in the interior of a CNN. Let the filter lie between the lth layer and the (l + 1)th layer, where the output of the lth layer is X[l] and the output of the (l + 1)th layer is X[l + 1]. The derivative of the loss function with respect to a filter cell between X[l] and X[l + 1] is given by
$$\frac{\partial L(\cdot)}{\partial f_{i_1, \ldots, i_k}} = \frac{\partial L(\cdot)}{\partial X[l+1]} \frac{\partial X[l+1]}{\partial f_{i_1, \ldots, i_k}}$$
6.3.2 Convolution layer
A typical CNN has several hundred filters at a convolutional layer. It also has several tens of layers. Each filter may also be a tensor of more than three dimensions. The dimensionality of a filter in the lth layer matches the dimensionality of the output of the lth layer. Each filter generates one output, of the same size as the input, after convolution. If there are $n_{f(l+1)}$ filters in the (l + 1)th layer, then the number of outputs generated is $n_{f(l+1)}$. All these outputs are weighted and combined into the required number of neurons in the (l + 1)th layer. The number of convolution layers in some of the popular CNN architectures is given in Table 11.
6.3.3 Max pooling
Max pooling corresponds to the selection of the maximum value within a neighborhood. Such an operation is desirable when the data needs to be down-sampled. A down-sampling operation makes good sense after an averaging step. For instance, in order to recognize a given image as an object of interest, it is sufficient to blur the image repeatedly and reduce its size until a small thumbnail-sized image is good enough for inference. In CNNs, max pooling is carried out with a stride (typically 2). Stride is a concept that corresponds to the jump in each of the tensor dimensions. If the jump is 2, then the image size reduces by that scale. The stride values and the resulting image sizes are reported in Table 10.
TABLE 10 Stride size.
| Stride | Output size |
| 1 | Original size |
| 2 | Half |
| 3 | One-third |
| k | 1/k |
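A minimal sketch of max pooling with a stride is shown below. It assumes a two-dimensional input given as a list of rows, a square pooling window, and no padding; because of the missing padding, the output sizes match Table 10 only approximately (a 2 × 2 window with stride 1 on a 4 × 4 input yields 3 × 3, not 4 × 4). The function name and defaults are hypothetical.

```python
def max_pool2d(X, size=2, stride=2):
    """Take the maximum over each size x size window, moving by `stride`."""
    rows, cols = len(X), len(X[0])
    out = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            out_row.append(max(X[r + dr][c + dc]
                               for dr in range(size)
                               for dc in range(size)))
        out.append(out_row)
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
pooled = max_pool2d(image, size=2, stride=2)
```

With a stride of 2, the 4 × 4 input is reduced to a 2 × 2 output, i.e., half the size along each dimension.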
6.3.4 Fully connected layer
A fully connected layer refers to connections from all inputs into a designated set of output neurons in this layer. Typically the fully connected layer is the last layer of a CNN. If it is the final layer of the CNN, the neurons of this layer correspond to the output categories. In order to determine the final class, the softmax operation is used to emit probability scores relative to the other outputs. If the (l + 1)th layer is a fully connected layer, then all the neurons of the previous, lth layer are connected to each and every neuron of the (l + 1)th layer. The number of weights is N[l].size × N[l + 1].size. All the weights of a trained neural network can be saved and reused in a number of different applications. During the prediction phase, the input goes from left to right through a given neural network via the specified layers.
6.3.5 Popular CNN architectures
Some of the popular CNN architectures are shown in Table 11.
6.4 Recurrent neural network
Learning sequential data requires a mechanism to remember the context. In plain feedforward neural networks and CNNs, there is no provision for memory. An RNN addresses this issue (Schmidhuber, 2015) by looping back one of the outputs to previous layers. Though the concept looks simple, it poses challenges when formulating backpropagation equations. A protocol needs to be in place to define how the weights should be updated, such as backpropagation through time.
Multiaxle bus and the road analogy: The unrolling steps and the slicing of the input can be best imagined through an analogy to something that we see every day. Imagine an RNN as a multiaxle bus, where each of its tires corresponds to one of the inputs that it consumes simultaneously. The road itself corresponds to a variable-sized input sequence of vectors. Exactly as a bus moves along a road, an RNN moves along an input sequence of vectors and processes them. Much like a bus that can move fast or slow, an RNN can take a longer or shorter stride through the input sequence. The unrolled steps of an RNN correspond to the axles of the bus. Each of the unrolled steps in an RNN takes its corresponding input, much like each of the axles takes the pressure impact due to road conditions. The previous stretch of the road and the current position of the bus affect the driver's perception of the road quality. In an RNN, the hidden vector encodes the memory of the previous inputs together with the current inputs, and the vector changes as the RNN slides through the input.
TABLE 11 Popular CNN architectures.
| Name | Input image type | Layers | Activation | References | Other comments |
| LeNet-5 | 32 × 32 gray scale | 7 | Sigmoid | LeCun et al. (1998) | First proof of concept on MNIST database |
| AlexNet | 224 × 224 RGB | 8 | ReLu | Krizhevsky et al. (2012) | ImageNet winner |
| ZFNet | 224 × 224 RGB | 8 | ReLu | Zeiler and Fergus (2014) | Minor modifications to AlexNet, baseline high accuracy on ImageNet |
| Inception | 224 × 224 RGB | 22 | ReLu | Szegedy et al. (2015) | Higher accuracy on ImageNet |
| VGGNet | 224 × 224 RGB | 16 | ReLu | Simonyan and Zisserman (2015) | Higher accuracy on ImageNet |
| ResNet | 224 × 224 RGB | 152 | ReLu | He et al. (2016) | State-of-the-art highest accuracy on ImageNet |
6.4.1 Anatomy of simple RNN
The input is a sequence of vectors $S = [x_1, \ldots, x_N]$, where each $x_i \in R^d$ is a d dimensional vector. An RNN cell is composed of one or more units. Each unit consumes one vector $x_j$ at a time as input. If an RNN is composed of U units, the units consume, respectively, the vectors $x_i, \ldots, x_{i+U-1}$ from the input. A hidden vector $h_{k \times 1}$ is maintained to encode the state of the RNN upon receiving a certain sequence of inputs. Each unit is made up of three fundamental matrices: $Wxh_{d \times k}$, $Whh_{k \times k}$ and $Why_{k \times m}$. The consumption of input is shown in the list below. An illustration of the RNN unit is shown in Fig. 29.
FIG. 29 (A) Basic cell in a recurrent neural network (RNN). The context from the previous part of the sequence is encoded by the previous hidden vector HA (annotated as # A). The input vector $X_{1 \times i}$ is indicated by the orange rectangle. The input is multiplied with a dimensionally compatible matrix $Wxh_{i \times h}$, shown in light red, to yield a hidden vector representation $HB_{1 \times h} = X \times Wxh$ (# B annotation). The hidden vector is updated by HC = HA + HB (shown with the # C annotation) and is output as the hidden vector of this unit. The HC vector is then transformed to emit the output, $Y_{1 \times o} = HC \times Why_{h \times o}$. The reader is requested to note that many other customizations of this basic unit exist in the literature and in practice, and the cell depicted here is only for illustration. (B) Batch-based mechanism of hidden vector propagation between input batches. The dimensionality of the hidden vector is $H_{B \times N}$, where B is the batch size and N is the number of neurons. Each batch has B input elements, where each input is I dimensional. The output is generated as one 1 × O vector for each input of the batch, i.e., $Y_{B \times O}$.
• A single RNN unit consumes a single vector from the input and processes it
• Consume input $x_i$
• Pass it through the $W_{xh}$ matrix to generate the hidden vector h
• Use the previous hidden vector to update the state using the $W_{hh}$ matrix
• Use the output matrix $W_{hy}$ to emit the output in m-dimensional space
• Proceed to process the next point
• Multiple RNN units consume a consecutive sequence of inputs simultaneously and process them
• There is communication from left to right among the multiple RNN units
The steps are presented in the equations below for a single-unit RNN.
$$h_{input} = x_i^T W_{xh}$$
$$h_{hprev} = h_{(i-1)}^T W_{hh}$$
$$h_{(i)} = h_{input} + h_{hprev}$$
$$\hat{y}_i = h_{(i)}^T W_{hy}$$
In the case of an RNN having multiple units, the h vector flows from left to right among the RNN units. Let us denote by $W_{xh}^u$, $W_{hh}^u$, $W_{hy}^u$ the three transformation matrices of the uth unit of an RNN cell. If the uth unit consumes $x_i$, then the (u + 1)th unit consumes $x_{i+1}$. The $h_{(i)}$ vector is sent as input to the (u + 1)th unit. The U units emit the output vectors $\hat{y}_i, \ldots, \hat{y}_{i+U-1}$. The leftmost unit gets an h vector of 0 in most architectures; however, it can be configured to assume any other value as well. The RNN unit is given as a function in Eq. (84).
$$\hat{y}, h = \Gamma(x, h) \tag{84}$$
A series of RNN units thus performs the operation in Eq. (85).
$$[\hat{y}_i, \ldots, \hat{y}_{i+U-1}] = \Gamma([x_i, \ldots, x_{i+U-1}], h_{(i-1)}) \tag{85}$$
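The unit function Γ and its unrolling over consecutive inputs can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names `vec_mat`, `rnn_unit`, and `rnn_unroll` are hypothetical, vectors are treated as rows, and all units are assumed to share the same three matrices (the text also allows per-unit matrices).

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def vec_mat(v: Vector, m: Matrix) -> Vector:
    """Row vector times matrix, i.e., v^T M, returned as a list."""
    return [sum(v[r] * m[r][c] for r in range(len(v))) for c in range(len(m[0]))]


def rnn_unit(x: Vector, h_prev: Vector,
             w_xh: Matrix, w_hh: Matrix, w_hy: Matrix) -> Tuple[Vector, Vector]:
    """One application of Gamma(x, h): returns (y_hat, h_new)."""
    h_input = vec_mat(x, w_xh)        # x^T W_xh, length k
    h_hprev = vec_mat(h_prev, w_hh)   # h_prev^T W_hh, length k
    h_new = [a + b for a, b in zip(h_input, h_hprev)]
    y_hat = vec_mat(h_new, w_hy)      # h^T W_hy, length m
    return y_hat, h_new


def rnn_unroll(xs: List[Vector], h0: Vector,
               w_xh: Matrix, w_hh: Matrix, w_hy: Matrix) -> Tuple[List[Vector], Vector]:
    """Feed consecutive inputs through the units, passing h from left to right.

    Assumption: every unit shares the same matrices.
    """
    h = h0
    outputs = []
    for x in xs:
        y_hat, h = rnn_unit(x, h, w_xh, w_hh, w_hy)
        outputs.append(y_hat)
    return outputs, h


# Toy usage with d = 2, k = 2, m = 1 and a zero initial hidden vector.
identity = [[1.0, 0.0], [0.0, 1.0]]
outputs, h_last = rnn_unroll([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0],
                             identity, identity, [[1.0], [1.0]])
```

Because the hidden vector is simply accumulated in this basic unit, the second output already reflects both inputs.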
6.4.2 Training a simple RNN
The training of an RNN updates the elements of the matrices of an RNN unit. The output vectors are regressed against the ground truth via a loss function, and the gradient of the loss function is used for the weight updates. The loss function for the output vectors is given in Eq. (86). In this loss function a stride of 1 is used; however, the step for the ith position need not be 1 and can be larger. The number of RNN units is denoted by U.
$$L([W_{xh}, W_{hh}, W_{hy}]) = \sum_{i=1}^{N} \sum_{j=i}^{U} \|\hat{y}_j - y_j\|^2 \tag{86}$$
We should note that in Eq. (86), the $W_{xh}$, $W_{hh}$, $W_{hy}$ matrices are updated in one shot for the entire 1…U units, as if $[y_i, \ldots, y_{i+U-1}]$ were a single vector. As the backpropagation happens over the unrolled units, these updates are also commonly referred to as backpropagation through time (BPTT).
6.4.3 LSTM
The long short-term memory (LSTM) cell (Hochreiter and Schmidhuber, 1997) is a special configuration of the basic RNN unit. For the RNN unit to remember context over long unrolled lengths, the size of the h vector has to be increased. However, increasing the size of the h vector increases sparsity in the model and makes the generation of the output vector more sensitive. A solution to this problem is to capture the necessary complexity in the form of additional matrices called gates. The LSTM introduces three types of gates: the input gate, the output gate, and the forget gate. The input gate combines the input and updates the h vector, the output gate combines the current and previous h vectors, and the forget gate prevents updates to the current h vector in the given RNN unit.
6.4.4 Examples of sequence learning problem statements
Semantic encoding: Learning the semantics of a given sentence runs into the same problems of variable input size and dependence on the order of the words. The task here is to learn a vector encoding for a given sequence of words. The vector is a representation of the semantics.
6.4.5 Sequence to sequence mapping
Language translation: Consider the task of translating a sentence from one language to another. The input is a sequence of words in one language. The output is a sequence of words in another language. The sizes of the input and the output can be quite different. In the training data, there can be several (input, output) pairs of varying lengths. In this scenario, a fixed-dimension input mechanism is not suitable. A word at the ith position depends on the words at the jth (j < i) positions.
Figure captioning: Given an image, the task is to generate a sequence of words describing it. Once the objects in an image are identified and tagged with textual words, the image can be scanned from the top-left corner to the bottom-right corner in a left-to-right manner. This corresponds to an input sentence. The corresponding output sentence is a meaningful figure caption in the training data set of (input, output) pairs. The mapping can be learned via sequence to sequence learning, as in language translation.
Automated dialog-bot: Consider a repository of software troubleshooting in which conversations between a human and a computer operator are recorded over time. This is a sequence to sequence scenario where the input is a query sequence of words and the output is another sequence of words that constitutes the response. Now, given a new query, how can we generate a response automatically? The problem is very similar to language translation, although the vocabulary and the letters of the words may be the same.
6.5 Generative adversarial network
A generative adversarial network (GAN) (Goodfellow, 2017) is a graphical model for the generation of synthetic data. The working process is similar to that of an RBM (Sutskever et al., 2008), where data is sampled from a Bernoulli network, i.e., the hidden states are 0 or 1. In a GAN system as well, data is sampled from a neural network. The neural network in a GAN is architected in the reverse sense of a typical feedforward network. In a feedforward network, the size of the vector diminishes over the layers, and the final output layer forms a vector on which softmax is performed. In a GAN, the architecture is reversed: it starts from a small vector, increases the vector dimensions over the layers, and the final output layer has the same size as the input image.
6.5.1 Training GAN
Training a GAN is accomplished using gradient descent defined over the output generated by the GAN vs a ground truth data point. A GAN is linked to a discriminator network to form a loss function. The discriminator network classifies a given input from the adversarial network as fake or real. The discriminator network computes the difference in data distribution between the generated data and the known real data as the loss function. The cumulative difference between the discriminator network's encoding of the adversary-generated data and of the ground truth data is minimized by gradient descent over the adversarial network weights. Let G() be the generator network. Upon input z, the generator network generates a synthetic data point. Let D() be the discriminator network, which generates a binary output indicating real or fake. Let X denote the known real ground truth data. Let $X_G$ denote the fake data generated by G() from random inputs to the generator network. The loss function is then formulated as
$$L(G) = E_{x_G \in X_G}[\log(1 - D(x_G))] - E_{x \in X}[\log(D(x))]$$
6.5.2 Applications of GANs
GANs are applied for data augmentation purposes. When the training data is not sufficient and does not capture the required variety of subdistributions, GANs are used to generate synthetic data. Some GAN-generated fake images appear so real that the human eye cannot distinguish between real and fake (Fig. 30). GANs may also be used to generate sequence data (Yu et al., 2017). That formulation is closely related to reinforcement learning, with the generator modeled as an RL agent.
FIG. 30 (A) The overall architecture of generative adversarial networks (GAN). Both the generator (G) and the discriminator (D) neural networks are trained simultaneously in the GAN framework, but using two different loss functions. The filters in the G and D neural networks are updated separately based on the loss function values. The G network computes its loss over the fake data points being correctly recognized as fake by the discriminator, i.e., the lower the loss, the more likely a fake data point passes as a correct one. The D network computes its loss over data points being correctly classified with their true labels, be it fake or real. The G network up-scales a random input vector to the data dimension, shown as the 1 × n to m × n transformation. (B) An illustration of the bed-of-nails approach to up-scaling, in which an input data point $X_{2 \times 2}$ is scaled to a 5 × 4 data point and is convolved using a filter $F_{3 \times 3}$. The blank values are all padded with a specified value, typically zero. The filter is then trained the same way as a regular convolution filter, but on the up-scaled 5 × 4 data. The spacing between data points is 2 here, corresponding to a stride of 2 along the horizontal and vertical axes.
7 Optimization
The optimization methods discussed so far include gradient descent for linear regression, logistic regression, and neural networks. In tree-based methods, the impurity scores are reduced iteratively using a hill climbing paradigm for choosing which attribute to split on. However, the loss function may be complicated and not differentiable in a number of real-world problems related to molecular dynamics simulation, manifold learning, and topological optimization. For a rigorous study of optimization problems and the numerical setting, the reader is referred to some of the standard texts in this domain (Nocedal and Wright, 2006; Deb, 2001; Boyd and Vandenberghe, 2004).
Genetic algorithms (Lim, 2014) are used in cases where a gradient descent formulation is difficult and multiple local minima are expected. In a genetic algorithm, a solution is represented as a set of chromosomes. A chromosome is composed of a set of genes that are related to each other in certain semantic aspects. A gene is a fundamental unit that goes through mutation and crossover operations over time. A simple problem may have just one chromosome with a few genes. A gene can be equated to a parameter-value assignment. A chromosome then corresponds to multiple parameter-value assignments that together constitute a solution. A population is seeded with initial random assignments of parameter-value pairs, and hence with individuals. The individuals undergo mutation and crossover operations over the iterations, resulting in newer genes and hence newer individuals. The expanded population is then subjected to a fitness function, by which only the top k fittest individuals are retained in the population and the rest are discarded. A crossover operation takes two solutions, mixes a subset of parameter values between them, and generates newer individuals. A mutation operation takes a solution and randomly modifies one of its parameter values. The rates of the mutation and crossover operations are updated over time. Initially the mutation rate is high; as time progresses, the crossover rate is increased, since it results in a finer search as the optimal solution is approached.
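The loop described above can be sketched as follows for a single chromosome of real-valued genes. This is only an illustrative sketch under stated assumptions: the names `crossover`, `mutate`, and `genetic_minimize`, the Gaussian mutation step, the uniform crossover, the linear rate schedule, and the initial range [-5, 5] are all choices made here for the example and are not prescribed by the text.

```python
import random
from typing import Callable, List

Individual = List[float]  # one chromosome: a list of parameter values (genes)


def crossover(a: Individual, b: Individual, rng: random.Random) -> Individual:
    """Mix a subset of parameter values between two solutions."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(ind: Individual, rng: random.Random, scale: float = 0.5) -> Individual:
    """Randomly modify one of the parameter values (assumed Gaussian step)."""
    child = list(ind)
    gene = rng.randrange(len(child))
    child[gene] += rng.gauss(0.0, scale)
    return child


def genetic_minimize(loss: Callable[[Individual], float], n_genes: int,
                     pop_size: int = 30, top_k: int = 10,
                     iterations: int = 200, seed: int = 0) -> Individual:
    rng = random.Random(seed)
    # Seed the population with random parameter-value assignments.
    population = [[rng.uniform(-5.0, 5.0) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for t in range(iterations):
        # Mutation rate starts high; crossover takes over as time progresses.
        p_mutation = 1.0 - t / iterations
        offspring = []
        for _ in range(pop_size):
            if rng.random() < p_mutation:
                offspring.append(mutate(rng.choice(population), rng))
            else:
                a, b = rng.sample(population, 2)
                offspring.append(crossover(a, b, rng))
        # Expanded population: keep only the top k fittest (lowest loss).
        expanded = population + offspring
        expanded.sort(key=loss)
        population = expanded[:top_k]
    return population[0]


def bowl(w: Individual) -> float:
    """Toy fitness: a quadratic bowl with its minimum at (1, -2)."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2


best = genetic_minimize(bowl, n_genes=2)
```

Retaining the parents in the expanded population means the best loss never gets worse from one iteration to the next, which keeps the sketch simple at the cost of reduced diversity.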
8 Artificial intelligence
AI is a broad spectrum of algorithms, software design paradigms, and thinking methodologies in which a notion of intelligence and automation is defined over software components and connected devices to cater to the needs of a given domain (Fig. 1). Intelligence is programmed over knowledge representation and reasoning methodologies and is broadly categorized into two major types: deductive and inductive reasoning. Knowledge representation involves symbols and the definition of operators to represent facts about a given world and its transformations. The symbolic representation and operators directly translate a real-world problem statement into objects and operations of a system architecture or simulation setup that can then be executed to evaluate and generate predictions. The notion of using a symbol for an unknown dates back to the era of Aristotle (ca. 320 BCE). Over the centuries, the notion of a symbol evolved and was used in algebra by Diophantus (CE 200). Today children learn to use variables and perform substitutions starting in high school. The very notion of using a symbol for an unknown has ushered us into a civilization rich with technology. Deductive reasoning involves the programmatic combination of facts to generate newer facts, subject to the constraints of logic or a given set of rules, mainly involving the notion of a state space and search. Inductive reasoning caters to the need to interpolate missing pieces of information based on a numerical combination of other elements. The whole of machine learning comes under this category. When the missing values to be interpolated
correspond to a specially named attribute called the target or label, it leads to a major class of methods called supervised learning. AI offers a framework on top of which diverse methods from the deductive and the inductive schools of reasoning can interplay to provide a solution to a given problem scenario. Turning to the machine learning world, given the high cost of labeling a datum, it is important to understand the structure of the data automatically. The class of algorithms that compute patterns and subpatterns in data without the need for any label information is called unsupervised learning. Unsupervised methods and some of the supervised methods have embraced deep learning to learn the structure of data and to map between feature spaces. For over a decade, deep learning has been predominantly used to learn substructures in data automatically. While most AI research provides a systematic conceptual framework for defining intelligence over a discrete state space for the development of rigorous solutions, there is still major scope to bring continuous state space representation systems such as deep learning into this generic framework and reap the benefits of robustness and explainability.
8.1 Notion of state space and search
The AI methodology starts with defining a state and a notion of navigation of the state space for any given domain. Some of the states are desirable; these are called final states, and the search looks for them. A state can be thought of as a snapshot of parameter values, for instance a weight vector or a matrix of values. A state space is a conceptual topological space, such as a graph, where the nodes are states and the edges connect neighboring states, for instance notional connections between weight vectors whose elements are slightly modified. Navigation in the state space corresponds to generating the neighbor states, or next states, of a given current state and keeping track of the visited states. A search corresponds to identifying a desirable state that satisfies domain conditions, such as the best heuristic score so far. While the very thought of a state space may trouble a beginner with the question of how to store an infinite number of states, in reality it is only notional or conceptual. The states are unveiled only as required, and typically a minute fraction of the state space can be explored intelligently.
Example of continuous state space: One of the popular state space search methods is the gradient descent algorithm. For instance, consider the scenario where a hyperplane is fit through a collection of points in a k-dimensional space. We fit a hyperplane given by a vector w such that $y = w x$, where x is the input data. The idea is to find the w that minimizes a loss function L(w),
$$w = \arg\min_{w} L(w)$$
The weight vector is updated along the negative gradient direction of the loss function so that $L(w_{new}) < L(w_{old})$. With α denoting the learning rate, the update is
$$w_{new} = w_{old} - \alpha \, \nabla L(w) \big|_{w = w_{old}}$$
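The update rule can be read as a state space search in which every w is a state and the gradient step generates the next state. The sketch below assumes a mean squared error loss for the hyperplane fit; the function names, the toy data, and the stopping thresholds are illustrative assumptions rather than part of the text.

```python
from typing import List, Sequence


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def loss(w: Sequence[float], xs: List[List[float]], ys: List[float]) -> float:
    """Assumed loss: mean squared error of the hyperplane fit."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w: Sequence[float], xs: List[List[float]], ys: List[float]) -> List[float]:
    """Gradient of the mean squared error with respect to w."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = predict(w, x) - y
        for k in range(len(w)):
            grad[k] += 2.0 * err * x[k] / len(xs)
    return grad


def gradient_descent(xs: List[List[float]], ys: List[float], alpha: float = 0.05,
                     max_iters: int = 5000, tol: float = 1e-12) -> List[float]:
    w = [0.0] * len(xs[0])            # start state
    previous = loss(w, xs, ys)
    for _ in range(max_iters):        # stop on the iteration budget ...
        g = gradient(w, xs, ys)
        w = [wi - alpha * gi for wi, gi in zip(w, g)]   # next state
        current = loss(w, xs, ys)
        if abs(previous - current) < tol:               # ... or on a negligible change in L(w)
            break
        previous = current
    return w


# Toy data generated from w = (2, -3); the search should end near that state.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [2.0, -3.0, -1.0, 1.0]
w_fit = gradient_descent(xs, ys)
```

A learning rate that is too large for the data would make L(w) oscillate or grow, which is why the iteration budget is kept as a second stopping condition.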
The AI framework considers each w vector as a state. The next state from a given state is computed by the gradient descent step. The navigation may fall into an infinite loop if L(w) oscillates over the iterations because of the step size. The search for the optimal w stops when domain constraints are met; here these refer to the number of iterations or the change in the L(w) values. The example given here is also called the hill climbing method, where L(w) is the notional hill that is climbed by iterating over w. The above example is over a set of continuous parameters, i.e., the components of the w vector. The more common examples in AI textbooks are introduced via a discrete state space. A discrete state space involves variables that assume discrete values, such as boolean or multinomial values or a finite number of numeric values. Movement from one state to another is equivalent to changing the value of one discrete variable between the states. A set of states is designated as the final states.
Example of a discrete state space: Consider the river crossing problems under constraints that many students will have seen in general aptitude tests. One such problem is moving a Lion, a Goat, and a Cabbage from the left bank to the right bank of a river. The boat has limited capacity and can accommodate only the man and any one of the objects above. The constraints are to avoid situations where one entity eats another: (i) if the Goat and the Cabbage are left alone, the Goat will eat the Cabbage and (ii) if the Lion and the Goat are left alone, the Lion will eat the Goat. Consider providing a state space representation. Let us abbreviate the entities as L (Lion), G (Goat), C (Cabbage), and M (Man). The words Left and Right denote the left bank and the right bank, respectively. The state of the system depicts the Left and Right banks at a given point in time. Initial state = (Left,L,G,C,M),(Right,-,-,-,-). A next state can be (Left,L,G,-,-),(Right,-,-,C,M). However, this state is not valid, as it violates the constraint that the Lion will eat the Goat if they are not in the company of the Man. We need the final desired state (Left,-,-,-,-),(Right,L,G,C,M). It is left as an exercise to the reader to identify the correct sequence of states. For some other problems, a given final state may not be reachable at all from a given start state, or there may be several optimal sequences of states. A cost may be associated with each move, and the idea is then to find the least-cost sequence of moves. There are different types of methods for performing state space search, and each procedure can be customized for a given problem scenario.
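One possible encoding of the river crossing state space is sketched below, with each state stored as a pair of sets (left bank, right bank) and a breadth first exploration used to find a shortest valid sequence. The representation and the names `is_valid`, `next_states`, and `solve` are assumptions made for this illustration.

```python
from collections import deque
from typing import FrozenSet, Iterator, List, Optional, Tuple

State = Tuple[FrozenSet[str], FrozenSet[str]]  # (left bank, right bank)
ITEMS = ("L", "G", "C")                        # Lion, Goat, Cabbage; "M" is the Man


def is_valid(left: FrozenSet[str], right: FrozenSet[str]) -> bool:
    """A bank without the Man must not hold Goat+Cabbage or Lion+Goat."""
    for bank in (left, right):
        if "M" not in bank and ({"G", "C"} <= bank or {"L", "G"} <= bank):
            return False
    return True


def next_states(state: State) -> Iterator[State]:
    left, right = state
    man_on_left = "M" in left
    src, dst = (left, right) if man_on_left else (right, left)
    for cargo in (None,) + ITEMS:              # the Man crosses alone or with one item
        if cargo is not None and cargo not in src:
            continue
        moved = {"M"} | ({cargo} if cargo else set())
        new_src, new_dst = src - moved, dst | moved
        candidate = (new_src, new_dst) if man_on_left else (new_dst, new_src)
        if is_valid(*candidate):
            yield candidate


def solve() -> Optional[List[State]]:
    start: State = (frozenset("LGCM"), frozenset())
    goal: State = (frozenset(), frozenset("LGCM"))
    parent = {start: None}                     # predecessor links, also the visited set
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


path = solve()
```

Printing `path` lists the banks after every crossing, so readers who want to attempt the exercise themselves may prefer to run the code only to check their answer.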
8.2 State space – Search algorithms
Based on the notion of state space navigation and keeping track of the explored states, there are different types of search algorithms, as categorized below.
• Enumerative methods such as:
  – Depth first search (DFS)
  – Breadth first search (BFS)
  – Depth first iterative descent (DFID)
• Heuristic methods such as:
  – Best first search (BeFS)
  – Tabu search (TabuS)
  – Beam search (BS)
  – Hill climbing (HC)
• Randomized methods such as:
  – Simulated annealing
  – Iterated hill climbing
  – Genetic algorithms
  – Ant colony optimization
We discuss some of the methods in each category here to give insight into the AI framework described so far.
8.2.1 Enumerative search methods
Enumerative methods obtain the list of next states from a given state and explore them systematically. Mechanisms are put in place to avoid falling into an infinite loop. The algorithms in this section are given as pseudocode, with the symbolic notation explained in Table 12. The first algorithm to look at is depth first search (DFS). In this algorithm, starting from a given state, the subsequent next states are identified. The search goes to one of the next states and the process repeats. A to-do list of the states that are yet to be processed is maintained (the Ω list). If the final state is reached during processing, the search terminates and returns it. Otherwise, when the final state is not reachable, the search is exhausted once the χ set has covered all the states. The pseudocode is provided in Algorithm 23. The space complexity of the algorithm is expressed in terms of the cardinality, denoted by |·|, of the set of open states: $|\Omega| \propto O(D \times B)$, where |Ω| is the size of the open list, D is the depth of the current state, and B is the maximum branching factor. The algorithm for breadth first search (BFS) is just a modification of Line 13 of Algorithm 23. In the case of BFS, instead of prepending the new states to the Ω list, the new states are appended. This line just becomes Ω ← Ω ⊙ N. Such a change drastically changes the space complexity behavior to an exponential
TABLE 12 Symbols and meaning for pseudocodes of AI algorithms.
Algorithm notations
Notation | Meaning
Lowercase letters | States
x.π | Predecessor state of x
Ω | List of states yet to explore
χ | Set of states already explored
Ω[i] | Accessing the ith element of the list
FINAL(x) | Tests if x is a final state
NEXT(x) | Next states for state x
x.δ | Depth of the current state
A⊙B | Concatenation of B to A
x.cost | Cost of the path from the start till x
edge(x, y) | Cost of the edge from x to y
ALGORITHM 23 Depth first search.
Require: s /*Start state*/
1: s.π ← NIL
2: Ω ← [s]
3: χ ← {}
4: while |Ω| > 0 do
5:   x ← Ω[0]
6:   if FINAL(x) then
7:     return x
8:   end if
9:   Ω ← Ω − x; χ ← χ ∪ {x}
10:  N ← NEXT(x)
11:  N ← N − χ − Ω
12:  Set, (∀y ∈ N) : y.π ← x
13:  Ω ← N ⊙ Ω //OR Ω ← Ω ⊙ N for BFS
14: end while
rate; now $|\Omega| \propto O(B^D)$, where B and D are the branching factor and the depth of the current node, respectively. Note that the only difference in behavior is in terms of |Ω|; the time to explore the full tree is the same in both the DFS and BFS algorithms. Also, the size of χ, the set of visited states, is the same in both methods.
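The difference between the two enumerative searches, prepending versus appending the new states to the open list, can be seen in a short sketch. The function name `search`, the `breadth_first` flag, and the toy graph are hypothetical; the open list, closed set, and predecessor links play the roles of Ω, χ, and x.π.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Optional


def search(start: Hashable,
           is_final: Callable[[Hashable], bool],
           next_states: Callable[[Hashable], Iterable[Hashable]],
           breadth_first: bool = False) -> Optional[List[Hashable]]:
    """Depth first search by default; breadth first when breadth_first is True."""
    predecessor: Dict[Hashable, Optional[Hashable]] = {start: None}  # x.pi
    open_list = [start]   # Omega: states yet to explore
    closed = set()        # chi: states already explored
    while open_list:
        x = open_list.pop(0)
        if is_final(x):
            path = []
            while x is not None:
                path.append(x)
                x = predecessor[x]
            return path[::-1]
        closed.add(x)
        new = []
        for y in next_states(x):
            if y not in closed and y not in open_list and y not in new:
                new.append(y)
                predecessor[y] = x
        # The only difference between the two searches: prepend (DFS) or append (BFS).
        open_list = open_list + new if breadth_first else new + open_list
    return None


# Toy state space, given as an adjacency mapping.
graph = {"a": ["b", "c"], "b": ["e"], "c": ["d"], "e": ["d"], "d": []}
dfs_path = search("a", lambda s: s == "d", graph.__getitem__)
bfs_path = search("a", lambda s: s == "d", graph.__getitem__, breadth_first=True)
```

On this toy graph the depth first variant commits to the first branch and returns a longer route, while the breadth first variant returns a route with the fewest moves.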
Based on the premise that the final state lies not at the far end of the search space but somewhere in the middle, it is useful to combine the space-wise benefits of DFS with the exploratory nature of BFS. The combination leads to another algorithm called depth first iterative descent (DFID), in which DFS is invoked repeatedly for increasing depths as the iterations progress (Algorithm 24).
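The repeated, depth-bounded invocation of DFS can be sketched as below. As a simplifying assumption, this variant avoids cycles by checking the current path rather than keeping a global set of explored states, and it stops when a pass finds nothing deeper left to explore; the names `depth_limited`, `dfid`, and `max_depth` are illustrative.

```python
from typing import Callable, Hashable, Iterable, List, Optional, Tuple


def depth_limited(start: Hashable,
                  is_final: Callable[[Hashable], bool],
                  next_states: Callable[[Hashable], Iterable[Hashable]],
                  limit: int) -> Tuple[Optional[List[Hashable]], bool]:
    """DFS that ignores states deeper than limit.

    Returns (path or None, whether any state was cut off by the limit).
    """
    stack = [(start, [start])]
    cutoff = False
    while stack:
        x, path = stack.pop()
        if is_final(x):
            return path, cutoff
        depth = len(path) - 1
        if depth >= limit:
            # Evidence that new states still exist below the current limit.
            if any(y not in path for y in next_states(x)):
                cutoff = True
            continue
        for y in reversed(list(next_states(x))):
            if y not in path:              # avoid cycles along the current path
                stack.append((y, path + [y]))
    return None, cutoff


def dfid(start: Hashable,
         is_final: Callable[[Hashable], bool],
         next_states: Callable[[Hashable], Iterable[Hashable]],
         max_depth: int = 50) -> Optional[List[Hashable]]:
    for limit in range(max_depth + 1):     # incrementally search greater depths
        path, cutoff = depth_limited(start, is_final, next_states, limit)
        if path is not None:
            return path
        if not cutoff:                     # nothing deeper to explore
            return None
    return None


graph = {"a": ["b", "c"], "b": ["e"], "c": ["d"], "e": ["d"], "d": []}
dfid_path = dfid("a", lambda s: s == "d", graph.__getitem__)
```

Each pass stores only the states along the paths currently under consideration, while the growing limit makes the first solution found a shallowest one.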
8.2.2 Heuristic search methods – Example A* algorithm
DFS and BFS are blind searches in which no importance is given to assessing where the final state lies. As state spaces grow exponentially even in models of moderate or lower complexity, choosing the right neighbors that take the search closer to the final goal state becomes a major factor in the practical usability of the solutions.
ALGORITHM 24 DFID: Depth first search iterative descent.
Require: s /*Start state*/
1: flag ← True /*To keep iterating as long as new states exist to explore*/
2: Δ ← 0 /*To denote depth to explore in the state space tree*/
3: while flag == True do
4:   Δ ← Δ + 1 /*Incrementally search for higher depths*/
5:   flag ← False /*Start pessimistically*/
6:   s.π ← NIL
7:   s.δ ← 0 /*to denote current depth*/
8:   Ω ← [s]
9:   χ ← {}
10:  while |Ω| > 0 do
11:    x ← Ω[0]
12:    Ω ← Ω − x
13:    χ ← χ ∪ {x}
14:    if x.δ > Δ then
15:      /*Ignore more deep states*/ continue
16:    end if
17:    if FINAL(x) then
18:      return x
19:    end if
20:    N ← NEXT(x)
21:    N ← N − χ − Ω
22:    flag ← flag OR (|N| > 0) /*Evidence that new states yet exist*/
23:    Set, (∀y ∈ N) : y.π ← x and y.δ ← x.δ + 1
24:    Ω ← N ⊙ Ω /*Prepend*/
25:  end while
26: end while
One of the major types of heuristic search is the A* algorithm (Nilsson, 1980). Here the best path so far, in terms of least cost, is stored as a chain of predecessor links for each state from the start state. Among the next states, the node with the least estimated cost is chosen. The heuristic function is required to always underestimate the actual cost to the goal state from any given state. It is also required that all edges have positive weights in order to alleviate the need to maintain the χ set. The A* algorithm is shown in Algorithm 25. An example of a heuristic function H() is any machine learning model that takes a vector representation of a state and outputs a score indicative of how close it is to the goal state. A heuristic function can also be a hand-composed metric over the individual elements of a state representation.
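A compact sketch of A* using a priority queue from the Python standard library is given below. The names `a_star`, `neighbors`, and `heuristic`, together with the toy graph and heuristic values, are assumptions for illustration; the heuristic table is chosen so that it never overestimates the remaining cost.

```python
import heapq
from itertools import count
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple


def a_star(start: Hashable,
           is_final: Callable[[Hashable], bool],
           neighbors: Callable[[Hashable], Iterable[Tuple[Hashable, float]]],
           heuristic: Callable[[Hashable], float]
           ) -> Optional[Tuple[List[Hashable], float]]:
    """Returns (path, cost) of a least-cost path, or None if no final state is reachable."""
    cost: Dict[Hashable, float] = {start: 0.0}            # x.cost
    predecessor: Dict[Hashable, Optional[Hashable]] = {start: None}  # x.pi
    tie = count()                                         # tie-breaker for equal scores
    frontier = [(heuristic(start), next(tie), start)]     # ordered by x.cost + H(x)
    while frontier:
        score, _, x = heapq.heappop(frontier)
        if score > cost[x] + heuristic(x):
            continue                                      # stale entry: a cheaper path was found
        if is_final(x):
            total = cost[x]
            path = []
            while x is not None:
                path.append(x)
                x = predecessor[x]
            return path[::-1], total
        for y, edge_cost in neighbors(x):
            v = cost[x] + edge_cost
            if v < cost.get(y, float("inf")):
                cost[y] = v
                predecessor[y] = x
                heapq.heappush(frontier, (v + heuristic(y), next(tie), y))
    return None


# Toy weighted graph and a hand-composed underestimating heuristic.
edges = {"s": [("a", 1), ("b", 4)], "a": [("b", 2), ("g", 6)], "b": [("g", 1)], "g": []}
h_table = {"s": 3, "a": 3, "b": 1, "g": 0}
result = a_star("s", lambda x: x == "g", edges.__getitem__, h_table.__getitem__)
```

Instead of removing a state from the open list when its cost improves, this sketch pushes a second entry and skips the stale one when it is popped, which is a common way to use `heapq` for this purpose.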
8.3 Planning algorithms
Planning can be defined as identifying a series of activities or steps that results in achieving the final goal state from a starting state (Lavalle, 2006; Bylander, 1994). The representation of states in the context of planning algorithms is different from the state representation in the previous algorithms we have seen. Each state is represented as a collection of facts about the configuration of the system in question. A given problem or domain is modeled
ALGORITHM 25 A* Algorithm.
Require: s, H() /*Start state and Heuristic function*/
1: s.π ← NIL
2: s.cost ← 0
3: Ω ← [s]
4: while |Ω| > 0 do
5:   x ← arg min_{x ∈ Ω} (x.cost + H(x))
6:   Ω ← Ω − {x}
7:   if FINAL(x) then
8:     return x
9:   end if
10:  N ← NEXT(x)
11:  for y ∈ N do
12:    v ← x.cost + edge(x, y)
13:    if v < y.cost then
14:      y.cost ← v
15:      y.π ← x
16:      Ω ← Ω ⊙ y
17:    end if
18:  end for
19: end while
as a set of boolean variables, where each variable captures one minute aspect of the system. Any state of the system can be represented as a boolean algebraic expression. Any such expression can be transformed into disjunctive normal form, which is a logical OR of several clauses, where each clause is a conjunction of literals.
8.3.1 Example of a state
Consider a hypothetical problem of attending a meeting by 9 AM in the office. The first task is to identify all the pertinent boolean variables in the domain. Some of the variables are enumerated below.
• Wake up
• Freshen up
• Have breakfast
• Start driving from home
• Reach office
• Attend meeting
There can be other variables as well, such as those below.
• Read newspaper
• Drop kids at school
• Buy vegetables
• Go to gym
The idea is to define a truth value for each of the variables. If the activities and actions are all represented as states, the planning problem corresponds to finding paths that reduce a cost function. In the space of planning algorithms, all edges are labeled by actions. Optionally, there is a cost associated with each edge or state. There are two approaches to identifying the target goal state from the start state in the context of planning algorithms: one is called forward state space search and the other is called backward state space search. In planning algorithms, alongside the cost, a path is considered valid if the start state lies on the path.
Forward state space search (FSSP): The algorithm starts from the start state and exhaustively explores neighbors until the goal state is found to be either reachable or not reachable. This approach is simple to understand and, at any point in time, only valid paths from the start state exist. However, it is costly when the state space is large.
Backward state space search (BSSP): In this approach, the algorithm starts from the goal state and works backwards to find the start state. The idea is to focus more on the goal state by excluding states that are not relevant to it. While the focus on the goal is evident, this approach results in several temporary paths to the goal state that may not be valid, meaning that they cannot be linked back to the start state.
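Forward state space search can be sketched with states as sets of true facts and actions that carry preconditions, facts to add, and facts to delete. Everything specific in the sketch is hypothetical: the `Action` type, the fact names, and the particular preconditions chosen for the meeting example are assumptions made only to have something concrete to run.

```python
from collections import deque
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]      # facts that must hold before the action
    add: FrozenSet[str]      # facts made true by the action
    delete: FrozenSet[str]   # facts made false by the action


def forward_search(start: Iterable[str], goal: Iterable[str],
                   actions: List[Action]) -> Optional[List[str]]:
    """Breadth first exploration from the start state; edges are labeled by actions."""
    start_state, goal_facts = frozenset(start), frozenset(goal)
    plans = {start_state: []}            # state -> action names leading to it
    queue = deque([start_state])
    while queue:
        state = queue.popleft()
        if goal_facts <= state:
            return plans[state]
        for a in actions:
            if a.pre <= state:           # action is applicable in this state
                nxt = (state - a.delete) | a.add
                if nxt not in plans:
                    plans[nxt] = plans[state] + [a.name]
                    queue.append(nxt)
    return None


def fs(*facts: str) -> FrozenSet[str]:
    return frozenset(facts)


# Hypothetical encoding of the meeting example.
ACTIONS = [
    Action("wake_up", fs(), fs("awake"), fs()),
    Action("freshen_up", fs("awake"), fs("fresh"), fs()),
    Action("have_breakfast", fs("fresh"), fs("fed"), fs()),
    Action("drive_to_office", fs("fed", "at_home"), fs("at_office"), fs("at_home")),
    Action("attend_meeting", fs("at_office"), fs("in_meeting"), fs()),
    Action("read_newspaper", fs("awake"), fs("news_read"), fs()),
]
plan = forward_search({"at_home"}, {"in_meeting"}, ACTIONS)
```

Every partial plan explored this way starts at the start state, which is the validity property noted for FSSP; the price is that irrelevant actions such as reading the newspaper are also expanded.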
Goal stack planning: This algorithm combines the benefit of validity from the FSSP algorithm with the benefit of focus on the goal from the BSSP algorithm by introducing the notion of a state stack. The stack keeps track of states that originate from the goal state and are pruned based on validity from the start state, as given in the recursive Algorithm 26 (Ryu and Irani, 1992).
8.4 Formal logic
The foundation of the AI reasoning system is computational logic theory (Siekmann, 2014). A state of the system is considered to be a set of truth statements about various observations. The recorded observations are assumed to be true. As the system moves from one state to another, the observations change over time. Facts may be added or removed over time, as in the case of the planning algorithms (Bylander, 1994). However, facts may also be assumed to be true perpetually. New facts are deduced from older facts by combining the facts using a set of rules of inference. Facts are expressed in a mechanistic fashion with a specific ordering of words rather than in natural language form. The positions of the words carry meaning, and a clear pattern is intended to be expressed explicitly. Such an expression of explicit patterns of word ordering is called predicate logic. Calculation over and modification of the components of the word patterns is called predicate calculus.
ALGORITHM 26 Recursive goal stack planning algorithm (RGSP).
Require: S, G /*start and goal states*/
1: Π = [] //empty plan
2: if G ⊆ S then
3:   return Π
4: end if
5: X = G − S
6: for g ∈ X do
7:   (∃a) : (a.Add ∩ X ≠ ϕ) ∧ (a.Del ∩ G = ϕ)
8:   Π_a = RGSP(S, a.Pre)
9:   S = Progress(S, a.Pre) //S to Preconditions
10:  S = Progress(S, Π_a) //Preconditions to Plan upon selected action
11:  S = Progress(S, a) //Upon selected action to new state
12:  Π = Π + Π_a + a
13: end for
14: return Π
8.4.1 Predicate or propositional logic
Propositional logic is a form of ordering the words such that the form is explicit and a commonality is clearly seen across diverse truth statements.
For instance, consider the fact that the Sun rises in the east and sets in the west. The same is expressed as,
• rises(sun,east)
• sets(sun,west)
For instance, a student reads a textbook for a course. Here the facts capture two students (studentA and studentB); three books (bookA, bookB, and bookC); three courses (course1, course2, and course3); and the enrollment of studentA in course1 and course2. As studentA has not enrolled for course3, he does not read the textbook for course3.
• reads(studentA,bookA).
• reads(studentA,bookB).
• textbook(course1,bookA).
• textbook(course2,bookB).
• textbook(course3,bookC).
• enrolled(studentA,course1).
• enrolled(studentA,course2).
8.4.2 First-order logic
As we have seen in the previous case, predicate logic requires each and every fact to be recorded and processed. However, this is not feasible or scalable, whereas a common predicate word ordering form may be written in a reduced form. An example of first-order logic involving variables is given below.
• A student reads the textbook of the course he/she has enrolled for
• enrolled(s,c), textbook(c,t) → reads(s,t)
• For all students "s," for all textbooks "t," and for all courses "c"
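The saving offered by the rule with variables can be illustrated by deriving the reads facts from the enrollment and textbook facts instead of recording them. The tuple encoding of facts and the function name `derive_reads` are assumptions made for this sketch.

```python
from typing import Set, Tuple

Fact = Tuple[str, str, str]  # (predicate, first argument, second argument)

FACTS: Set[Fact] = {
    ("textbook", "course1", "bookA"),
    ("textbook", "course2", "bookB"),
    ("textbook", "course3", "bookC"),
    ("enrolled", "studentA", "course1"),
    ("enrolled", "studentA", "course2"),
}


def derive_reads(facts: Set[Fact]) -> Set[Fact]:
    """Apply the rule enrolled(s, c), textbook(c, t) -> reads(s, t) for all s, c, t."""
    derived: Set[Fact] = set()
    for pred1, s, c in facts:
        if pred1 != "enrolled":
            continue
        for pred2, c2, t in facts:
            if pred2 == "textbook" and c2 == c:   # the variable c must bind consistently
                derived.add(("reads", s, t))
    return derived


reads_facts = derive_reads(FACTS)
```

Only the two reads facts for studentA are produced; nothing is derived for bookC because no enrollment binds the variable c to course3.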
8.4.3 Automated theorem proof
Given a knowledgebase of truth statements and a query truth statement, the task is to prove or disprove the query. There are two ways of accomplishing this: (i) the forward proof method and (ii) the backward proof method.
8.4.3.1 Forward chaining
A set of rules is provided to the automated theorem proof algorithm. The algorithm systematically expands the knowledgebase by combining pieces of knowledge to produce derived knowledge. Some of the rules of inference are shown in Table 13. Here, → denotes implication, ¬ denotes negation, ∨ denotes logical OR, ∧ and the comma denote logical AND, and [] denotes grouping. The rules are verifiable via a truth table. There can be more rules based on combinations in the truth table.
TABLE 13 Rules of inference.
Rule name | Operators
Modus ponens | [A → B, A] → B
Modus tollens | [A → B, ¬B] → ¬A
Conjunction | [A, B] → A ∧ B
Disjunctive syllogism | [A ∨ B, ¬A] → B
Addition | [A] → A ∨ B
Simplification | [A, B] → A
Hypothetical syllogism | [A → B, B → C] → A → C
Constructive dilemma | [A → B, C → D, A, C] → B ∨ D
Destructive dilemma | [A → B, C → D, ¬B ∨ ¬D] → ¬A ∨ ¬C
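That such rules are verifiable via a truth table can be checked mechanically by enumerating every assignment of truth values. The sketch below encodes a few of the rules as Python functions; the helper names `implies` and `is_tautology` and the selection of rules are choices made for this illustration.

```python
from itertools import product
from typing import Callable


def implies(p: bool, q: bool) -> bool:
    return (not p) or q


def is_tautology(rule: Callable[..., bool], n_vars: int) -> bool:
    """True if the rule holds for every row of its truth table."""
    return all(rule(*values) for values in product([False, True], repeat=n_vars))


# Each entry: (number of variables, premises -> conclusion as one formula).
RULES = {
    "Modus ponens": (2, lambda a, b: implies(implies(a, b) and a, b)),
    "Modus tollens": (2, lambda a, b: implies(implies(a, b) and (not b), not a)),
    "Disjunctive syllogism": (2, lambda a, b: implies((a or b) and (not a), b)),
    "Hypothetical syllogism": (3, lambda a, b, c: implies(
        implies(a, b) and implies(b, c), implies(a, c))),
    "Destructive dilemma": (4, lambda a, b, c, d: implies(
        implies(a, b) and implies(c, d) and ((not b) or (not d)),
        (not a) or (not c))),
}

verified = {name: is_tautology(rule, n) for name, (n, rule) in RULES.items()}

# A non-rule for contrast: from A -> B and B, concluding A is not valid.
affirming_consequent = is_tautology(lambda a, b: implies(implies(a, b) and b, a), 2)
```

The same check rejects invalid patterns, as the contrasting example shows, so it can be used to screen candidate rules before adding them to a rule set.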
8.4.3.2 Incompleteness of the forward chaining
The rules of inference may not be sufficient for proving a given query in the forward method. New rules need to be added for each special query that cannot be addressed. For instance, the following rule cannot be deduced from the given set of rules (Table 13).
$$[B \wedge D, A \vee C] \rightarrow (A \wedge B) \vee (C \wedge D)$$
To handle it, one can add this as a new rule. However, this leaves uncertainty about whether the rule set is applicable to all queries.
8.4.3.3 Backward chaining
In backward chaining, the query (or goal) is considered alongside the knowledgebase and deductions are evaluated. One approach is to add the query to the knowledgebase and to prove the whole group to be true for some assignment of values to the variables. However, this reduces to evaluating all possible values and identifying a particular combination of variable assignments that leads to truth. The scenario quickly becomes a satisfiability problem, which has exponential time complexity. Another way to prove that a query is deducible is the method of contradiction, which is presented next.
8.5 Resolution by refutation method
To circumvent the need to exhaustively search for truth values of variables when a goal is augmented to the knowledgebase, an alternative approach is to add the negation of the goal and arrive at a contradiction.
196 Handbook of Statistics
Resolution is a special rule of inference that improves on disjunctive syllogism (Eq. 87):

[A ∨ B, C ∨ ¬B] → A ∨ C    (87)

The refutation method consists of the following:
• Negate the goal and augment it to the knowledge base
• Arrive at a contradiction (or refutation)
• Apply the resolution rule repeatedly
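The refutation steps listed above can be sketched for the propositional case as follows. This is an illustrative sketch under stated assumptions, not the procedure of any particular system: clauses are frozensets of string literals, a leading `~` marks negation, the goal is a single literal, and the function names `negate`, `resolve`, `refutes`, and `entails` are hypothetical.

```python
from itertools import combinations


def negate(literal):
    """Return the complementary literal: A <-> ~A."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def resolve(c1, c2):
    """Return all resolvents of two clauses (frozensets of literals)."""
    resolvents = []
    for literal in c1:
        if negate(literal) in c2:
            resolvents.append(
                frozenset((c1 - {literal}) | (c2 - {negate(literal)})))
    return resolvents


def refutes(clauses):
    """True if the empty clause is derivable, i.e. the clause set is contradictory."""
    known = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(known, 2):
            for resolvent in resolve(c1, c2):
                if not resolvent:          # empty clause: contradiction reached
                    return True
                new.add(resolvent)
        if new.issubset(known):            # nothing new can be derived
            return False
        known |= new


def entails(kb, goal_literal):
    """Negate the goal, add it to the knowledge base, and look for a refutation."""
    return refutes(set(kb) | {frozenset({negate(goal_literal)})})


# Knowledge base: A, A -> B (written ~A v B), B -> C (written ~B v C).
kb = [frozenset({"A"}), frozenset({"~A", "B"}), frozenset({"~B", "C"})]
proves_c = entails(kb, "C")   # C follows from the knowledge base
proves_d = entails(kb, "D")   # D does not
print(proves_c, proves_d)
```

The loop terminates because only finitely many clauses can be built from a finite set of literals.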
The Prolog system (Bratko, 1986) uses backtracking and the resolution method. The knowledge in the system is expressed as Horn clauses, and the method used is SLD resolution. In converting a natural language sentence to a formal statement in clausal form, Skolemization is used to eliminate existential quantifiers (Baaz and Iemhoff, 2008). The method used for symbolic matching and substitution is called the unification algorithm (Chadha and Plaisted, 1994).
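Symbolic matching and substitution can be made concrete with a small unification routine. The sketch below is a textbook-style version under assumed conventions, not the routine of any Prolog implementation: variables are strings starting with an uppercase letter, constants are lowercase strings, and compound terms are tuples whose first element is the functor. It includes the occurs check; the title of the cited Chadha and Plaisted paper refers to unification without it.

```python
def is_var(term):
    """Variables are strings that start with an uppercase letter (Prolog style)."""
    return isinstance(term, str) and term[:1].isupper()


def walk(term, subst):
    """Follow variable bindings until an unbound variable or a non-variable."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def occurs(var, term, subst):
    """True if var appears inside term under the current substitution."""
    term = walk(term, subst)
    if term == var:
        return True
    if isinstance(term, tuple):
        return any(occurs(var, arg, subst) for arg in term[1:])
    return False


def unify(x, y, subst=None):
    """Return a substitution (dict) that unifies x and y, or None if none exists."""
    subst = {} if subst is None else subst
    x, y = walk(x, subst), walk(y, subst)
    if x == y:
        return subst
    if is_var(x):
        return None if occurs(x, y, subst) else {**subst, x: y}
    if is_var(y):
        return None if occurs(y, x, subst) else {**subst, y: x}
    if (isinstance(x, tuple) and isinstance(y, tuple)
            and x[0] == y[0] and len(x) == len(y)):
        for a, b in zip(x[1:], y[1:]):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None


match = unify(("parent", "X", "bob"), ("parent", "alice", "Y"))
clash = unify(("f", "a"), ("f", "b"))
cyclic = unify("X", ("f", "X"))   # rejected by the occurs check
print(match, clash, cyclic)
```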
8.6 AI framework adaptability issues
There has been a tremendous amount of prior work in AI systems, including symbolic logic, unification theory, inference mechanisms, state space search, planning systems, and the Prolog inference engine. However, only bits and pieces of these concepts are used in state-of-the-art systems, and none at scale as regards implementation systems such as PDDL (McDermott et al., 1998) and Prolog (Bratko, 1986). For instance, we would encourage the reader to try to program a simple river crossing problem in PDDL, or an image captioning algorithm in Prolog. It is practically impossible to use the AI implementations in their current form. However, the concepts can be carried forward to bring in robust abstractions.

One major limitation of state space search algorithms is that they are discrete and nondifferentiable. There is no notion of continuous state space search in the traditional systems, whereas real-world problems are routinely modeled as optimization problems that require a gradient in most cases. If no gradient information is available, a randomized method such as a genetic algorithm may be used, while still operating in the continuous state space regime.

Another major limitation of traditional AI systems is the need for a heuristic function. For instance, in the case of the A* algorithm, there are seven theorems bearing on the correctness of the algorithm when the heuristic is admissible, i.e., of the underestimating type (it never overestimates the true cost to the goal). The question reduces to "who will bell the cat?", that is, who will supply the heuristic function. In a practical scenario, an engineer coming to work will never be given a heuristic function; the engineer is typically given a larger, vaguely stated business problem, or at most some exemplars or data sets. In such a situation the traditional AI framework becomes inapplicable, if not undesirable.
There is major scope today to marry the schools of thought that evolved over decades in state space search and formal logic with current machine learning methodologies, including deep learning, and so reap the benefits of both areas of strength.
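For comparison with the PDDL exercise suggested in this section, a river crossing problem can be written directly as a discrete state space search. The sketch below assumes the classic farmer, wolf, goat, and cabbage puzzle (the text does not say which river crossing problem is meant) and uses plain breadth-first search, which needs no heuristic function; all names are illustrative.

```python
from collections import deque

# A state records the river bank (0 or 1) of each of these, in this order.
ITEMS = ("farmer", "wolf", "goat", "cabbage")


def is_safe(state):
    """The goat may not be left with the wolf or the cabbage without the farmer."""
    farmer, wolf, goat, cabbage = state
    if wolf == goat != farmer:
        return False
    if goat == cabbage != farmer:
        return False
    return True


def successors(state):
    """The farmer crosses alone (i == 0) or with one item from the same bank."""
    farmer = state[0]
    for i in range(len(ITEMS)):
        if state[i] != farmer:
            continue
        nxt = list(state)
        nxt[0] = 1 - farmer
        if i != 0:
            nxt[i] = 1 - farmer
        nxt = tuple(nxt)
        if is_safe(nxt):
            yield nxt


def solve(start=(0, 0, 0, 0), goal=(1, 1, 1, 1)):
    """Breadth-first search; returns a shortest list of states from start to goal."""
    frontier = deque([start])
    parent = {start: None}
    while frontier:
        state = frontier.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None


path = solve()
for state in path:
    print(dict(zip(ITEMS, state)))
```

The search is discrete and exhaustive, which illustrates the limitation discussed above: nothing in it offers a gradient or a continuous relaxation.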
9 Applications and laboratory exercises
Machine learning libraries are available in a wide variety of programming languages. Industry practice typically recommends Python because of its ease of integration with other programs and its scripting capabilities on Unix-based systems on the server side. However, it is also common to see people using R, MATLAB, and other tools. Some of the popular, production-grade libraries and their purposes are presented in Table 14. There are a number of other implementations, such as Weka and packages in R and MATLAB, which are also useful. However, from the point of view of a large software team integrating with other code bases, these implementations serve well for building prototypes, but scalability concerns restrict their use in production-grade systems.
9.1 Automatic differentiation
Automatic differentiation (Kakade and Lee, 2018) refers to the generation of an expression parse tree by applying differentiation as an operator. In a graphical representation of a program, the loss function is essentially an expression tree. The more complex a loss function, the more difficult it is to calculate its gradient manually and analytically. Today, automatic differentiation is the de facto standard for loss functions, as it minimizes human error and effort.
TABLE 14 Popular open source software and their major purpose.

Name                                 Remark
Scikit-learn                         Standard machine learning
Keras, TensorFlow, MXNet, PyTorch    Deep learning
Mallet                               Topic modeling
NLTK                                 Natural language processing
An illustration of a very minimal differentiation is shown in Fig. 31. The figure shows the application of the differential operator to the multiplication operator,

d(X · Y) → d(X) · Y + d(Y) · X

The recursive definition of the conversion of expression trees extends to loop structures as well. Most typically, the loop iterations are treated as incremental and the differentiation operation goes inside the loop. It is therefore the responsibility of the engineer to create loss functions in which each iteration of the loop corresponds to an operation on an independent parameter of the data. Given that a loss function is a sequence of operations rather than a single operation, the sequence is processed from top to bottom (or left to right) when the differential operator is applied. Wherever the loss function refers to a previously used variable, automatic differentiation guarantees that its previously computed differential value is reused seamlessly.

FIG. 31 (A) An illustration of automatic differentiation of a multiplication operation over two operands. The task of differentiation is carried forward recursively to the subtrees of the operands. (B) An illustration of how a loop is handled in automatic differentiation. The recurrence equations and updated values are shown in the other two boxes. The initial equation is y = x^4 and the iteration is y = y · x + x^2, for which ∂y/∂x is determined. Recurrence equations are defined for the value of y in the kth iteration, y(k), and for the value of the derivative of y in the kth iteration, y′(k). For a starting value of x = 3, the derivative y′(2) = 1491 is calculated, as described in the intermediate steps.
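The loop example of Fig. 31(B) can be checked with a few lines of code. The sketch below is a forward-mode implementation based on operator overloading, which is one common way to realize automatic differentiation and not necessarily the expression-tree transformation the figure depicts. The class name `Dual` and the function `loop_program` are illustrative assumptions.

```python
class Dual:
    """A value paired with its derivative with respect to the input x."""

    def __init__(self, value, deriv=0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        """Treat plain numbers as constants, whose derivative is zero."""
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        # Sum rule: d(X + Y) = d(X) + d(Y)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: d(X * Y) = d(X) * Y + d(Y) * X
        return Dual(self.value * other.value,
                    self.deriv * other.value + other.deriv * self.value)

    __rmul__ = __mul__


def loop_program(x, iterations):
    """y = x^4, followed by `iterations` updates of y = y * x + x^2."""
    y = x * x * x * x
    for _ in range(iterations):
        y = y * x + x * x
    return y


x = Dual(3, 1)            # seed dx/dx = 1
y = loop_program(x, 2)
print(y.value, y.deriv)   # value y(2) and derivative y'(2) at x = 3
```

Tracing the recurrences by hand gives y(0) = 81 and y′(0) = 108, then y(1) = 252 and y′(1) = 411, then y(2) = 765 and y′(2) = 1491, in agreement with the caption.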
9.2 Machine learning exercises
Laboratory exercises on traditional machine learning problems would include the following (Pedregosa et al., 2011; Geron, 2017).
• Synthetic data generation and plotting: (i) concentric circles, (ii) moons, and (iii) blobs
• Gradient descent: (i) full data, (ii) batch, and (iii) stochastic variants
• Classifiers to train and fit: (i) logistic regression, (ii) SVM, and (iii) decision tree (DT)
• Ensemble classifiers to train and fit: (i) bagging classifier, (ii) AdaBoost, and (iii) gradient boosting
• Validation curves: plot cross-validation (CV) scores against (i) the depth of a DT and (ii) the size of the learning set
• Regression data: generate synthetic data by perturbing a sinusoidal curve with noise
• Regression models: (i) fit linear regression, (ii) fit a DT regressor, and (iii) use polynomial features
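As a starting point for the gradient descent exercise, the three variants differ only in how many samples contribute to each update. The following is a minimal sketch using only the Python standard library, assuming a one-dimensional linear model y ≈ w·x + b with mean squared error; the data generator, learning rate, and batch sizes are illustrative choices, not values from the text.

```python
import random


def make_data(n=200, w=2.0, b=-1.0, noise=0.1, seed=0):
    """Synthetic regression data: y = w * x + b plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [w * x + b + rng.gauss(0, noise) for x in xs]
    return xs, ys


def gradient(w, b, xs, ys):
    """Gradient of the mean squared error of y ~ w * x + b over the given samples."""
    n = len(xs)
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def fit(xs, ys, batch_size, epochs=200, lr=0.05, seed=1):
    """Gradient descent; batch_size selects the full, batch, or stochastic variant."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            gw, gb = gradient(w, b,
                              [xs[i] for i in batch], [ys[i] for i in batch])
            w -= lr * gw
            b -= lr * gb
    return w, b


xs, ys = make_data()
full = fit(xs, ys, batch_size=len(xs))   # (i) full data: one update per epoch
mini = fit(xs, ys, batch_size=20)        # (ii) batch: 10 updates per epoch
sgd = fit(xs, ys, batch_size=1)          # (iii) stochastic: one sample per update
print(full, mini, sgd)
```

All three runs should land near the generating parameters w = 2 and b = −1, with the stochastic variant fluctuating the most around them.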
9.3 Clustering exercises
Exercises on clustering would include the following.
• Generate synthetic data: use scikit-learn to generate blobs of data and retrieve k clusters using k-means
• Perform agglomerative clustering of the data points and threshold the dendrogram to match the clusters
• Change the distance measure, repeat k-means clustering, and count the number of steps until convergence
• Pose k-means as a gradient descent problem and retrieve the clusters
• Perform DBSCAN clustering of S-curve-shaped data and compare against k-means on the same data
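The first and third exercises can be prototyped without any library before moving to scikit-learn. The sketch below is a plain implementation of Lloyd's k-means iteration with a pluggable distance function and a step counter. The blob generator, the deliberately easy initialization (one point taken from each blob, chosen for reproducibility), and all names are assumptions made for illustration.

```python
import random


def make_blobs(centers, n_per_blob=50, spread=0.3, seed=0):
    """Gaussian blobs in 2D; points are grouped blob by blob in the output list."""
    rng = random.Random(seed)
    return [(cx + rng.gauss(0, spread), cy + rng.gauss(0, spread))
            for cx, cy in centers for _ in range(n_per_blob)]


def euclidean(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5


def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def kmeans(points, init_centroids, distance=euclidean, max_iter=100):
    """Lloyd's algorithm; returns centroids, labels, and the number of steps used."""
    centroids = list(init_centroids)
    labels = None
    for step in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(len(centroids)),
                          key=lambda j: distance(p, centroids[j]))
                      for p in points]
        if new_labels == labels:          # assignments stable: converged
            return centroids, labels, step
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, labels, max_iter


true_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = make_blobs(true_centers)
init = [points[0], points[50], points[100]]

centroids, labels, steps = kmeans(points, init)
_, labels_m, steps_m = kmeans(points, init, distance=manhattan)
print(steps, steps_m, centroids)
```

Note that swapping in the Manhattan distance changes only the assignment step here; the centroid update still uses the mean, which is a simplification.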
9.4 Graphical model exercises
Latent Dirichlet allocation (LDA) is well suited to automatic topic extraction. The topics are not restricted to text; they can be subregions of images in an image data set or audio clips in an audio data set.
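One way to read the claim that topics are not restricted to text is that any signal can be turned into a sequence of word-like tokens before LDA is applied. The following sketch illustrates this for a 1D signal, assuming a sliding window and, as a stand-in for a proper clustering step, a simple binning of each window's mean amplitude. The function names and parameter values are hypothetical.

```python
import math
from collections import Counter


def sliding_windows(signal, width, stride):
    """Cut a 1D signal into overlapping windows ("slides")."""
    return [signal[i:i + width]
            for i in range(0, len(signal) - width + 1, stride)]


def quantize(window, k, lo=-1.0, hi=1.0):
    """Map a window to one of k word IDs by binning its mean amplitude.

    This binning is only a stand-in for clustering the windows into k clusters.
    """
    mean = sum(window) / len(window)
    bin_id = int((mean - lo) / (hi - lo) * k)
    return min(max(bin_id, 0), k - 1)


def to_document(signal, width=8, stride=4, k=5):
    """Convert a signal into a list of tokens, one per window."""
    return [f"w{quantize(w, k)}" for w in sliding_windows(signal, width, stride)]


# A synthetic "audio" signal: four periods of a sine wave.
signal = [math.sin(2 * math.pi * t / 64) for t in range(256)]
doc = to_document(signal)
counts = Counter(doc)     # bag-of-words counts, the usual input to LDA
print(len(doc), dict(counts))
```

A corpus of such token lists, one per signal, can then be passed to any LDA implementation in place of text documents.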
The following three exercise questions will clarify the applicability scope of the LDA method.
9.4.1 Exercise: Topics in text data
Consider a corpus of text data composed of sentences, each sentence being composed of words. The task is to apply text pruning to remove noninformative words and then initiate the steps of LDA. Given a number k of topics, deduce the topic-word matrix and the document-topic matrix, and iterate until the assignments saturate at their optimal values. Use scikit-learn library functions to perform LDA.

9.4.2 Exercise: Topics in image data
Consider an image as composed of subimage regions. Take a sliding window and slide it through the whole image from the top-left corner to the bottom-right corner with a stride of choice. As the window slides over the image, a large number of slides (patches) is formed per image. Over a given data set of images, the slides per image multiply and form a large data set of slides. Cluster all the slides into some k clusters. Each cluster now corresponds to a word in our vocabulary. A given image is composed of slides that belong to one cluster or another, and hence to certain words of the vocabulary. Convert the image data set to a text corpus in which the words are cluster IDs. Do not perform word pruning, and carry out LDA on this image-to-text converted corpus.

9.4.3 Exercise: Topics in audio data
Consider a raw audio waveform, a time series of amplitudes, as composed of several subregions in 1D. Take a sliding window of a chosen interval. Slide it through the audio file from left to right and generate several slides per audio file. Across the whole database of audio files, this leads to a large number of slides. Cluster the slides into some k clusters. Each cluster ID now serves as a word in the vocabulary. The analysis is then similar to the image patch exercise above. Apply LDA to the audio-to-text converted corpus to detect topics in the audio data.

9.5 Data visualization exercises
Visualization of high-dimensional data in a lower-dimensional space adds to our understanding of the nature of the data. In this regard, exercises are listed below for practice.
• Generate a synthetic data set of blobs using scikit-learn
• Compute all pairwise distances and perform multidimensional scaling to plot the data on a 2D plane and visualize
• Perform t-SNE and visualize the clusters
• Determine the top two principal components using PCA and plot the points about the two axes. In this exercise the blobs of data points should appear as distinct clusters in 2D as well
• Repeat the exercise on the concentric circles, moons, and S-curve data sets
• Repeat the exercises on the MNIST digits data set and identify whether the digits also occur in separate regions of the 2D transformed space
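For the PCA exercise, the top two principal components can be computed from first principles before turning to a library. The sketch below assumes power iteration with deflation on the sample covariance matrix, which is one of several ways to obtain the leading eigenvectors, and applies it to three synthetic blobs in 3D. All names and parameter values are illustrative.

```python
import random


def make_blobs(centers, n_per_blob=30, spread=0.2, seed=0):
    """Gaussian blobs in any dimension; points are grouped blob by blob."""
    rng = random.Random(seed)
    return [[c + rng.gauss(0, spread) for c in center]
            for center in centers for _ in range(n_per_blob)]


def covariance(data):
    """Return the mean-centered data and its sample covariance matrix."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return centered, cov


def power_iteration(matrix, iters=500, seed=0):
    """Leading eigenvalue and eigenvector of a symmetric matrix."""
    d = len(matrix)
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    mv = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
    eigenvalue = sum(v[a] * mv[a] for a in range(d))   # Rayleigh quotient
    return eigenvalue, v


def top_two_components(data):
    """Top two principal components and the 2D projection of the data."""
    centered, cov = covariance(data)
    d = len(cov)
    l1, v1 = power_iteration(cov)
    # Deflation: remove the first component, then repeat on what remains.
    deflated = [[cov[a][b] - l1 * v1[a] * v1[b] for b in range(d)]
                for a in range(d)]
    l2, v2 = power_iteration(deflated)
    projected = [(sum(r[j] * v1[j] for j in range(d)),
                  sum(r[j] * v2[j] for j in range(d))) for r in centered]
    return (l1, v1), (l2, v2), projected


n = 30
data = make_blobs([(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 3.0, 0.0)],
                  n_per_blob=n)
(l1, v1), (l2, v2), projected = top_two_components(data)

# Mean position of each blob in the 2D projection.
blob_means = [(sum(p[0] for p in projected[i:i + n]) / n,
               sum(p[1] for p in projected[i:i + n]) / n)
              for i in range(0, len(projected), n)]
print(l1, l2, blob_means)
```

Plotting `projected` with any plotting tool should show the three blobs as distinct clusters, as the exercise anticipates.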
9.6 Deep learning exercises
Some exercises for practicing deep learning tools are listed below.
• Build CNNs and evaluate them on the MNIST and ImageNet data sets
• Build an RNN (using LSTM) that learns to map English to French
• Consider a data set of audio waveforms of digit pronunciations. Build an LSTM to encode each digit pronunciation as a vector and determine the classification accuracy on the vector-encoded audio waves
• Build an autoencoder to convert a digit image (from the MNIST data set) into a vector and determine classifier accuracy over the vectors using the known digit classes
References
Agarwal, M., Agrawal, H., Jain, N., Kumar, M., 2010. Face recognition using principle component analysis, eigenface and neural network. In: ICSAP, IEEE Computer Society, pp. 310–314. http://dblp.uni-trier.de/db/conf/icsap/icsap2010.html#AgarwalAJK10.
Aggarwal, C.C., 2018a. Machine Learning for Text, first ed. Springer Publishing Company, Incorporated. ISBN: 3319735306, 9783319735306.
Aggarwal, C.C., 2018b. Neural Networks and Deep Learning: A Textbook. Springer, pp. 1–497. ISBN: 978-3-319-94462-3.
Aizawa, A., 2003. An information-theoretic perspective of TF-IDF measures. Inf. Process. Manag. 39 (1), 45–65. https://doi.org/10.1016/S0306-4573(02)00021-3.
Alejo, R., Valdovinos, R.M., García, V., Pacheco-Sanchez, J.H., 2013. A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios. Pattern Recogn. Lett. 34 (4), 380–388. http://dblp.uni-trier.de/db/journals/prl/prl34.html#AlejoVGP13.
Arthur, D., Vassilvitskii, S., 2007. k-means++: the advantages of careful seeding. In: SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 1027–1035. http://portal.acm.org/citation.cfm?id=1283383.1283494.
Baaz, M., Iemhoff, R., 2008. On Skolemization in constructive theories. J. Symb. Log. 73 (3), 969–998. http://dblp.uni-trier.de/db/journals/jsyml/jsyml73.html#BaazI08.
Bateni, M., Behnezhad, S., Derakhshan, M., Hajiaghayi, M., Kiveris, R., Lattanzi, S., Mirrokni, V.S., 2017. Affinity clustering: hierarchical clustering at scale. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (Eds.), NIPS, pp. 6867–6877. http://dblp.uni-trier.de/db/conf/nips/nips2017.html#BateniBDHKLM17.
Bengio, Y., 2012. Deep learning of representations for unsupervised and transfer learning. In: Unsupervised and Transfer Learning Challenges in Machine Learning, vol. 7, p. 19.
Benjamini, Y., Hochberg, Y., 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 57 (1), 289–300.
Bishop, C.M., 2008. Pattern Recognition and Machine Learning. Springer-Verlag, New York.
Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022.
Boyd, S., Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press.
Bratko, I., 1986. Prolog Programming for Artificial Intelligence. Addison-Wesley.
Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32. https://doi.org/10.1023/A:1010933404324.
Bylander, T., 1994. The computational complexity of propositional strips planning. Artif. Intell. 69 (1–2), 165–204. http://dblp.uni-trier.de/db/journals/ai/ai69.html#Bylander94.
Cavnar, W.B., Trenkle, J.M., 1994. N-gram-based text categorization. In: Proceedings of SDAIR94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161–175. Las Vegas, US. http://citeseer.ist.psu.edu/68861.html.
Chadha, R., Plaisted, D.A., 1994. Correctness of unification without occur check in prolog. J. Log. Program. 18 (2), 99–122. https://doi.org/10.1016/0743-1066(94)90048-5.
Chechik, G., Sharma, V., Shalit, U., Bengio, S., 2010. Large scale online learning of image similarity through ranking. J. Mach. Learn. Res. 11, 1109–1135. https://doi.org/10.1145/1756006.1756042.
Cipra, B.A., 1987. An introduction to the ising model. Am. Math. Mon. 94 (10), 937–959. https://doi.org/10.2307/2322600.
Codd, E.F., 1968. Cellular Automata. Academic Press, New York.
Cortes, C., Vapnik, V., 1995. Support vector networks. Mach. Learn. 20, 273–297.
Cybenko, G., 1989. Approximation by superpositions of a sigmoidal function. Math. Control Signal Syst. 2 (4), 303–314. https://doi.org/10.1007/BF02551274.
Deb, K., 2001. Multi-Objective Optimization Using Evolutionary Algorithms. John Wiley & Sons.
Demšar, U., Harris, P., Brunsdon, C., Fotheringham, A.S., McLoone, S., 2012. Principal component analysis on spatial data: an overview. Ann. Assoc. Am. Geogr. 103 (1), 106–128. https://doi.org/10.1080/00045608.2012.689236.
Ding, S., 2006. Independent component analysis based on learning updating with forms of matrix transformations and the diagonalization principle. In: FCST. IEEE Computer Society, pp. 203–210. http://dblp.uni-trier.de/db/conf/fcst/fcst2006.html#Ding06.
Lu, L., Zheng, Y., Carneiro, G., Yang, L., 2017. Deep learning and convolutional neural networks for medical image computing: precision medicine, high performance and large-scale datasets. Advances in Computer Vision and Pattern Recognition, Springer. http://dblp.uni-trier.de/db/series/acvpr/learning2017.html.
Do, C.B., Batzoglou, S., 2008. What is the expectation maximization algorithm? Nat. Biotechnol. 26 (8), 897–899.
Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification. Wiley, New York.
Eddy, S.R., 2004. What is a hidden Markov model? Nat. Biotechnol. 22 (10), 1315–1316. https://doi.org/10.1038/nbt1004-1315.
Edwards, D., 2000. Introduction to Graphical Modelling. Springer. ISBN: 0387950540. http://www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20&path=ASIN/0387950540.
Fayyad, U., Grinstein, G., Wierse, A., 2001. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann. ISBN: 1558606890.
Fischer, A., Igel, C., 2012. An introduction to restricted boltzmann machines. In: Alvarez, L., Mejail, M., Gomez, L., Jacobo, J. (Eds.), Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer, Berlin, Heidelberg, pp. 14–36.
Freund, Y., Schapire, R.E., 1996. Experiments with a new boosting algorithm. In: International Conference on Machine Learning, pp. 148–156.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29 (5), 1189–1232. https://doi.org/10.1214/aos/1013203451.
Geron, A., 2017. Hands-on Machine Learning With Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, Sebastopol, CA. ISBN: 978-1491962299.
Ghahramani, Z., 2004. Unsupervised learning. In: Bousquet, O., Raetsch, G., von Luxburg, U. (Eds.), Advanced Lectures on Machine Learning. Lecture Notes in Artificial Intelligence 3176, Springer-Verlag, Berlin.
Golub, G., Kahan, W., 1965. Calculating the singular values and pseudo-inverse of a matrix. J. Soc. Indust. Appl. Math. Ser. B Numer. Anal. 2, 205–224.
Goodfellow, I.J., 2017. NIPS 2016 tutorial: generative adversarial networks. CoRR abs/1701.00160. http://dblp.uni-trier.de/db/journals/corr/corr1701.html#Goodfellow17.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press.
Hamerly, G., Elkan, C., 2003. Learning the k in k-means. In: Advances in Neural Information Processing Systems, vol. 17.
Hansen, L.K., Salamon, P., 1990. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. 12, 993–1001.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: CVPR. IEEE Computer Society, pp. 770–778. http://dblp.uni-trier.de/db/conf/cvpr/cvpr2016.html#HeZRS16.
Heckerman, D., Shachter, R., 1995. A definition and graphical representation for causality. In: Besnard, P., Hanks, S. (Eds.), Proc. Eleventh Conf. on Uncertainty in Artificial Intelligence (UAI-95), August, pp. 262–273. Montreal, Quebec.
Heskes, T., Zoeter, O., Wiegerinck, W., 2003. Approximate expectation maximization. In: Thrun, S., Saul, L.K., Schölkopf, B. (Eds.), NIPS. MIT Press, pp. 353–360. http://dblp.uni-trier.de/db/conf/nips/nips2003.html#HeskesZW03.
Hinton, G.E., Sejnowski, T.J., 1999. Unsupervised Learning. MIT Press, Cambridge, MA.
Hinton, G., Osindero, S., Teh, Y.-W., 2006. A fast learning algorithm for deep belief nets. Neural Comput. 18 (7), 1527–1554.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8), 1735–1780.
Hogben, L., 2007. Handbook of Linear Algebra. Chapman&Hall/CRC. http://scholar.google.de/scholar.bib?q=info:lRUP9s0uRPcJ:scholar.google.com/&output=citation&hl=de&ct=citation&cd=0.
Hutter, M., Zaffalon, M., 2005. Distribution of mutual information from complete and incomplete data. Comput. Stat. Data Anal. 48 (3), 633–657. http://www.hutter1.de/ai/mifs.htm.
Jain, L.C., Allen, G.N., 1995. Introduction to artificial neural networks. In: Jain, L.C. (Ed.), Electronic Technology Directions. IEEE Computer Society, pp. 36–62. http://dblp.uni-trier.de/db/conf/etd2000/etd2000-1995.html#JainA95.
Jensen, F.V., Nielsen, T.D., 2007. Bayesian Networks and Decision Graphs, second ed. Springer, New York, NY. ISBN: 0387682813.
Jolliffe, I., 2005. Principal Component Analysis. Wiley Online Library.
Jonyer, I., Holder, L.B., Cook, D.J., 2001. Graph-based hierarchical conceptual clustering. Int. J. Artif. Intell. Tools 10 (1–2), 107–135.
Joo, W., Lee, W., Park, S., Moon, I.-C., 2019. Dirichlet variational autoencoder. CoRR abs/1901.02739. https://arxiv.org/abs/1901.02739.
Kakade, S.M., Lee, J.D., 2018. Provably correct automatic sub-differentiation for qualified programs. In: Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (Eds.), NeurIPS, pp. 7125–7135. http://dblp.uni-trier.de/db/conf/nips/nips2018.html#KakadeL18.
Klema, V., Laub, A., 1980. The singular value decomposition: its computation and some applications. IEEE Trans. Autom. Control 25 (2), 164–176.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.
Krogh, A., Vedelsby, J., 1995. Neural network ensembles, cross validation, and active learning. In: Advances in Neural Information Processing Systems, vol. 7, pp. 231–238.
Hornik, K., 1991. Approximation capabilities of multilayer feedforward networks. Neural Netw. 4 (2), 251–257. https://doi.org/10.1016/0893-6080(91)90009-T.
Lavalle, S.M., 2006. Planning Algorithms. Cambridge University Press. ISBN: 0521862051.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.7665.
Lee, D.D., Seung, H.S., 2000. Algorithms for non-negative matrix factorization. In: NIPS, pp. 556–562. citeseer.ist.psu.edu/lee01algorithms.html.
Lim, T.Y., 2014. Structured population genetic algorithms: a literature survey. Artif. Intell. Rev. 41 (3), 385–399. http://dblp.uni-trier.de/db/journals/air/air41.html#Lim14.
Lui, Y.L., Wong, H.T., Leung, C.-S., Kwong, S., 2017. Noise resistant training for extreme learning machine. In: Cong, F., Leung, A.C.-S., Wei, Q. (Eds.), ISNN (2), Lecture Notes in Computer Science, vol. 10262. Springer, pp. 257–265. http://dblp.uni-trier.de/db/conf/isnn/isnn2017-2.html#LuiWLK17.
McCallum, A., Nigam, K., 1998. A comparison of event models for naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, AAAI Press, pp. 41–48.
McDermott, D., Ghallab, M., Howe, A.C., 1998. PDDL-the planning domain definition language. In: Proceedings of the International Conference on Artificial Intelligence Planning Systems.
Nakajima, S., Sato, I., Sugiyama, M., Watanabe, K., Kobayashi, H., 2014. Analysis of variational Bayesian latent Dirichlet allocation: weaker sparsity than MAP. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (Eds.), NIPS, pp. 1224–1232. http://dblp.uni-trier.de/db/conf/nips/nips2014.html#NakajimaSSWK14.
Neal, R.M., 1996. Bayesian Learning for Neural Networks. Springer-Verlag, New York.
Ng, A., 2012. CS229 lecture notes: supervised learning.
Nielsen, M.A., 2018. Neural Networks and Deep Learning. Determination Press. http://neuralnetworksanddeeplearning.com/.
Nilsson, N.J., 1980. Principles of Artificial Intelligence. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 1–476.
Braspenning, P.J., Thuijsman, F., Weijters, A.J.M.M. (Eds.), 1995. Artificial Neural Networks: An Introduction to ANN Theory and Practice. Lecture Notes in Computer Science, vol. 931. Springer. http://dblp.uni-trier.de/db/conf/ann/ann1995.html.
Nocedal, J., Wright, S., 2006. Numerical Optimization. Springer Science & Business Media.
Pacer, M., Griffiths, T.L., 2011. A rational model of causal inference with continuous causes. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F.C.N., Weinberger, K.Q. (Eds.), NIPS, pp. 2384–2392. http://dblp.uni-trier.de/db/conf/nips/nips2011.html#PacerG11.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Peterson, J., 1983. Petri Net Theory and the Modelling of Systems. Prentice Hall.
Pu, Y., Gan, Z., Henao, R., Yuan, X., Li, C., Stevens, A., Carin, L., 2016. Variational autoencoder for deep learning of images, labels and captions. In: Lee, D.D., Sugiyama, M., von Luxburg, U., Guyon, I., Garnett, R. (Eds.), NIPS, pp. 2352–2360. http://dblp.uni-trier.de/db/conf/nips/nips2016.html#PuGHYLSC16.
Puterman, M.L., 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257–286.
Radford, A., Metz, L., Chintala, S., 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. In: Bengio, Y., LeCun, Y. (Eds.), ICLR. http://dblp.uni-trier.de/db/conf/iclr/iclr2016.html#RadfordMC15.
Rennie, J.D., Shih, L., Teevan, J., Karger, D.R., 2003. Tackling the poor assumptions of Naive Bayes text classifiers. In: Fawcett, T., Mishra, N. (Eds.), ICML. AAAI Press, pp. 616–623.
Russell, S., Norvig, P., 2003. Artificial Intelligence: A Modern Approach. Pearson Education Inc.
Ryu, K.R., Irani, K.B., 1992. Learning from goal interactions in planning: goal stack analysis and generalization. In: Swartout, W.R. (Ed.), AAAI. AAAI Press/The MIT Press, pp. 401–407. http://dblp.uni-trier.de/db/conf/aaai/aaai92.html#RyuI92.
Safavian, S.R., Landgrebe, D., 1991. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 21 (3), 660–674. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=97458.
Salakhutdinov, R., Mnih, A., 2008. Probabilistic matrix factorization. In: Advances in Neural Information Processing Systems, vol. 20.
Schmidhuber, J., 2015. Deep learning in neural networks: an overview. Neural Netw. 61, 85–117. https://doi.org/10.1016/j.neunet.2014.09.003.
Schraudolph, N.N., Günter, S., Vishwanathan, S.V.N., 2006. Fast iterative Kernel PCA. In: Schölkopf, B., Platt, J.C., Hofmann, T. (Eds.), NIPS. MIT Press, pp. 1225–1232. http://dblp.uni-trier.de/db/conf/nips/nips2006.html#SchraudolphGV06.
Schubert, E., Sander, J., Ester, M., Kriegel, H.-P., Xu, X., 2017. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans. Database Syst. 42 (3), 19:1–19:21. http://dblp.uni-trier.de/db/journals/tods/tods42.html#SchubertSEKX17.
Shi, C., Kong, X., Yu, P.S., Wang, B., 2011. Multi-label ensemble learning. In: Proceedings of the ECML/PKDD 2011.
Siekmann, J.H., 2014. Computational logic. In: Siekmann, J.H. (Ed.), Handbook of the History of Logic, Computational Logic, vol. 9. Elsevier, pp. 15–30. http://dblp.uni-trier.de/db/series/hhl/hhl9.html#Siekmann14.
Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-scale image recognition. In: Bengio, Y., LeCun, Y. (Eds.), ICLR. http://dblp.uni-trier.de/db/conf/iclr/iclr2015.html#SimonyanZ14a.
Soda, P., 2011. A multi-objective optimisation approach for class imbalance learning. Pattern Recogn. 44 (8), 1801–1810. http://dblp.uni-trier.de/db/journals/pr/pr44.html#Soda11.
Strang, G., 2009. Introduction to Linear Algebra, fourth ed. Wellesley-Cambridge Press, Wellesley, MA. ISBN: 9780980232714, 0980232716, 9780980232721, 0980232724, 9788175968110, 8175968117.
Sutskever, I., Hinton, G.E., Taylor, G.W., 2008. The recurrent temporal restricted Boltzmann machine. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (Eds.), NIPS. Curran Associates, Inc., pp. 1601–1608. http://dblp.uni-trier.de/db/conf/nips/nips2008.html#SutskeverHT08.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
van den Burg, G.J.J., Groenen, P.J.F., 2016. GenSVM: a generalized multiclass support vector machine. J. Mach. Learn. Res. 17, 225:1–225:42. http://dblp.uni-trier.de/db/journals/jmlr/jmlr17.html#BurgG16.
van der Maaten, L., Hinton, G., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605. http://www.jmlr.org/papers/v9/vandermaaten08a.html.
Wasserman, S., Faust, K., 1994. Social Network Analysis: Methods and Applications. Cambridge University Press.
Webb, G.I., Zheng, Z., 2004. Multistrategy ensemble learning: reducing error by combining ensemble learning techniques. IEEE Trans. Knowl. Data Eng. 16 (8), 980–991.
Wolpert, D.H., 1997. On bias plus variance. Neural Comput. 9 (6), 1211–1243. https://doi.org/10.1162/neco.1997.9.6.1211.
Wu, B., Smith, J.S., Wilamowski, B.M., Nelms, R.M., 2019. DCMDS-RV: density-concentrated multi-dimensional scaling for relation visualization. J. Visualization 22 (2), 341–357. http://dblp.uni-trier.de/db/journals/jvis/jvis22.html#WuSWN19.
Xu, R., Wunsch, D., 2008. Clustering. Wiley-IEEE Press.
Yakowitz, S., Lugosi, E., 1990. Random search in the presence of noise, with application to machine learning. SIAM J. Sci. Comput. 11 (4), 702–712. http://dblp.uni-trier.de/db/journals/siamsc/siamsc11.html#YakowitzL90.
Yang, F., Ramdas, A., Jamieson, K.G., Wainwright, M.J., 2017. A framework for multi-A(rmed)/B(andit) testing with online FDR control. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (Eds.), NIPS, pp. 5959–5968. http://dblp.uni-trier.de/db/conf/nips/nips2017.html#YangRJW17.
Yu, L., Zhang, W., Wang, J., Yu, Y., 2017. SeqGAN: sequence generative adversarial nets with policy gradient. In: Singh, S.P., Markovitch, S. (Eds.), AAAI. AAAI Press, pp. 2852–2858. http://dblp.uni-trier.de/db/conf/aaai/aaai2017.html#YuZWY17.
Zeiler, M.D., Fergus, R., 2014. Visualizing and understanding convolutional networks. In: European Conference on Computer Vision, pp. 818–833.
Zhang, Q., Rahman, A., D'Este, C., 2013. Impute vs. ignore: missing values for prediction. In: IJCNN. IEEE, pp. 1–8. http://dblp.uni-trier.de/db/conf/ijcnn/ijcnn2013.html#ZhangRD13.
Zhang, J., Schwing, A.G., Urtasun, R., 2014. Message passing inference for large scale graphical models with high order potentials. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (Eds.), NIPS, pp. 1134–1142. http://dblp.uni-trier.de/db/conf/nips/nips2014.html#ZhangSU14.
Chapter 4
Bayesian model selection for high-dimensional data
Naveen Naidu Narisetty*
Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, United States
* Corresponding author: e-mail: [email protected]
Abstract
High-dimensional data, where the number of features or covariates can even be larger than the number of independent samples, are ubiquitous and are encountered on a regular basis by statistical scientists both in academia and in industry. A majority of the classical research in statistics dealt with settings where there is a small number of covariates. Due to modern advancements in data storage and computational power, the high-dimensional data revolution has come to occupy a significant part of mainstream statistical research. In gene expression datasets, for instance, it is not uncommon to encounter observations on at most a few hundred independent samples (subjects) with information on tens or hundreds of thousands of genes for each sample. An important and common question that arises quickly is: "which of the available covariates are relevant to the outcome of interest?" This concerns the problem of variable selection (and more generally model selection) in statistics and data science. This chapter provides an overview of some of the most well-known model selection methods along with some of the more recent methods. While frequentist methods will be discussed, Bayesian approaches will be given a more elaborate treatment. The frequentist framework for model selection is primarily based on penalization, whereas the Bayesian framework relies on prior distributions for inducing shrinkage and sparsity. The chapter treats the Bayesian framework in the light of objective and empirical Bayesian viewpoints, as the priors in the high-dimensional setting are typically not based completely on subjective prior beliefs. An important practical aspect of high-dimensional model selection methods is computational scalability, which will also be discussed.

Keywords: Bayesian variable selection, High-dimensional data, Model comparison, Bayesian computation
Handbook of Statistics, Vol. 43. https://doi.org/10.1016/bs.host.2019.08.001 © 2020 Elsevier B.V. All rights reserved.
1 Introduction The rapid developments in collecting, storing, transmitting, and managing massive amounts of data have led to unique opportunities and challenges in Statistics and the emerging field of Data Science. Variable selection is a fundamentally important problem for many modern datasets that have a large number of variables, which is a common feature of data sets from many applications including biology, climate sciences, and behavioral and environmental sciences. The linear regression model is one of the most commonly used models in statistics and is also a building block for many general models. In this chapter, we primarily consider the linear regression model, but most of the ideas and methods discussed can be applied more generally to other models such as the generalized linear models and nonlinear regression models. Consider the linear regression model
$$Y_{n \times 1} = X_{n \times p}\, \beta_{p \times 1} + E_{n \times 1}, \tag{1}$$
with standard assumptions on the error vector $E$. The classical least squares approach for estimating $\beta$ minimizes the loss function
$$L(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2, \tag{2}$$
where $y_i$ denotes the $i$th response and $x_i$ denotes the covariate vector for the $i$th observation. The well-known least squares estimator for $\beta$ is
$$\hat{\beta}^{\mathrm{ols}} := (X^\top X)^{-1} X^\top Y. \tag{3}$$
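As a small numerical sketch of the estimator in (3), the following Python snippet computes the least squares estimate on simulated data and then checks the rank of $X^\top X$ for a design with more columns than rows. The simulated data, the helper name `ols`, and the dimensions are illustrative assumptions, not part of the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(X, y):
    """Least squares estimate (X'X)^{-1} X'y; requires X to have full column rank."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Low-dimensional case (hypothetical data): n = 50 observations, p = 3 covariates.
n, p = 50, 3
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, 0.0, -1.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)
beta_hat = ols(X, y)

# Wide case: 10 rows and 40 columns, so the 40 x 40 matrix X'X has rank at most 10.
X_wide = rng.standard_normal((10, 40))
gram_rank = np.linalg.matrix_rank(X_wide.T @ X_wide)
```

In the wide case the Gram matrix is rank deficient, so the inverse in (3) does not exist and `ols` would not be applicable.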
In the high-dimensional setting, the dimension p of the covariate vector can be quite large and potentially even larger than the sample size. For example, in gene expression datasets, there are at least thousands of genes as covariates but typically only a few dozen independent samples. In such cases, p is much larger than n and will be denoted by p ≫ n. When p ≫ n, even estimation of the regression parameter β is a challenging problem since the least squares minimization (2) does not have a unique solution and the least squares estimator is not well defined. This necessitates some simplifying assumptions on the data generating model and the most common assumption made is that most of the components of β are zero, often referred to as the “sparsity assumption” (van de Geer, 2016). Sparsity assumption is made in a lot of applications and is often reasonable. However, one cannot hope that the sparsity assumption remains valid for every application and hence relaxations of the sparsity assumption are also considered in the literature (Belloni and Chernozhukov, 2013; Belloni et al., 2011). Even under the sparsity assumption, it is a very challenging problem to uncover the precise sparsity structure. The problem of detecting the nonzero
components of the parameter vector is often referred to as variable selection or model selection. In many scientific applications, a sparser model is generally desired for the reasons of parsimony, reduced variance, and ease of interpretability. This article attempts to provide a review of variable selection methods for high-dimensional datasets with a focus on the Bayesian approaches for variable selection. Fan and Lv (2010) and Bühlmann and van de Geer (2011) provide reviews of frequentist approaches for variable selection. While George and McCulloch (1997) and O’Hara and Sillanpää (2009) provide reviews of some Bayesian variable selection methods, the current article covers more recently developed strategies. There has been tremendous research activity in this direction and it is not possible to cover all the work, so the article only presents a selection of ideas. The remaining part of the chapter is organized as follows. In Section 2, a brief outline of some of the classical approaches for model selection, which mainly target the low-dimensional setting, is given. In Section 3, the penalization framework for variable selection is discussed. In Section 4, the Bayesian framework is introduced, followed by a detailed description of spike and slab priors in Section 5 and of continuous shrinkage priors in Section 6. Section 7 discusses some computational aspects of Bayesian model selection and Section 8 gives an outline of the different types of theoretical results studied in the literature. Section 9 lists R packages available for implementation and Section 10 provides an example data analysis.
2 Classical variable selection methods
2.1 Best subset selection For model selection, the best subset selection approach is to find the best combination of variables among all possible combinations. For example, if there are three predictors $X_1, X_2, X_3$, then we would consider all the possible models $\{X_1\}$, $\{X_2\}$, $\{X_3\}$, $\{X_1, X_2\}$, $\{X_1, X_3\}$, $\{X_2, X_3\}$, $\{X_1, X_2, X_3\}$ and determine which of the above models is the best based on some criterion function. The criterion function used needs to penalize large models since large models tend to have better in-sample fit. A major difficulty arises when the number of predictors $p$ is large, in which case there are too many models to consider and it soon becomes infeasible to evaluate all the possible models. More recently, modern optimization algorithms have been proposed to perform best subset selection in high-dimensional contexts. Bertsimas et al. (2016) developed mixed integer optimization methods to obtain optimal solutions to the best subset selection problem. These algorithms can be used in problems with up to a thousand predictors. Recently, Hazimeh and Mazumder (2018) have developed coordinate descent algorithms for the best subset selection problem and demonstrated that their algorithms can be applied to simulated datasets with $p$ nearly a million.
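The exhaustive search described above can be sketched in a few lines of Python for small $p$. This is only an illustration: the criterion used here, $n \log(\mathrm{RSS}/n) + k \log n$, is a BIC-type score assumed for the sketch, and the function names and simulated data are hypothetical.

```python
from itertools import combinations
import numpy as np

def rss(X, y, subset):
    """Residual sum of squares of the least squares fit on the given columns."""
    Xs = X[:, list(subset)]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ coef
    return float(resid @ resid)

def best_subset(X, y):
    """Evaluate every nonempty subset of columns; return the best one,
    its score, and the number of models evaluated (2^p - 1)."""
    n, p = X.shape
    best, best_score, n_models = None, np.inf, 0
    for size in range(1, p + 1):
        for subset in combinations(range(p), size):
            n_models += 1
            # BIC-type score (an assumed choice): fit term plus size penalty.
            score = n * np.log(rss(X, y, subset) / n) + np.log(n) * size
            if score < best_score:
                best, best_score = subset, score
    return best, best_score, n_models

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.standard_normal(100)
best, score, n_models = best_subset(X, y)
```

The number of evaluated models doubles with each additional predictor, which is why this direct enumeration is only usable for small $p$.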
2.2 Stepwise selection methods Motivated by the computational burden associated with traditional best subset selection algorithms, stepwise methods have been developed for finding a small subset of “good models” to consider for further evaluation.
• Forward selection (FS): Starting from the null model which has no covariates, at each step of the FS algorithm, a new variable is added to the current model based on some criterion such as the decrease in residual sum of squares (RSS). This provides a sequence of $p$ models, and the model minimizing a criterion, such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), which are defined in Section 2.3, is selected as the final model.
• Backward elimination (BE): Very similar in spirit to the FS algorithm, with the difference that the BE algorithm starts from the full model (when it is possible to estimate the full model) and removes one variable at a time based on the increase in RSS.
For both FS and BE, it is not necessary to obtain the whole sequence of $p$ models. One can instead terminate the algorithm early after a certain number of steps. In such a case, compared to BE, FS has the computational advantage since only relatively small models need to be fit, especially when the number of steps is small. This is because working with the full model or models close to the full model in size is typically computationally more expensive. On the other hand, if a pair of important variables are not significant marginally but are jointly significant, then forward selection tends to miss both variables whereas backward elimination has a higher chance of selecting them. The models selected by BE and FS need not coincide. One simple strategy is to use a criterion function and select one final model from the two models selected by BE and FS. A more general algorithm compared to FS and BE is to consider the possibilities of both addition and deletion of a variable at each step. A further generalization is to include swapping of a selected variable with a variable not yet selected in the model. While these strategies provide a more general sequence of models and are likely to provide better model selection performance, they demand more computational power compared to forward selection. The stepwise approaches described so far provide a sequence of models instead of specifying a stopping rule. Stopping criteria based on p-values or F-test statistics have been commonly considered in the literature (Bendel and Afifi, 1977; Finos et al., 2010; Grechanovsky and Pinsker, 1995). However, an important issue that needs to be carefully addressed in those cases is the control of false positives that could occur due to the large number of comparisons involved. A better alternative is to use a criterion function on the entire collection of models, which can incorporate the multiplicity of the comparisons. We now discuss some commonly used criterion functions.
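The forward selection procedure described in this section can be sketched as follows. The function names, the optional early-stopping argument `max_steps`, and the simulated data are assumptions made for illustration; the selection rule at each step is the decrease in RSS, as in the text.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of the least squares fit on columns `cols`."""
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    resid = y - X[:, cols] @ coef
    return float(resid @ resid)

def forward_selection(X, y, max_steps=None):
    """Greedy forward selection: at each step add the variable that gives
    the smallest RSS. Returns the nested sequence of (model, RSS) pairs."""
    p = X.shape[1]
    steps = p if max_steps is None else min(max_steps, p)
    selected, path = [], []
    for _ in range(steps):
        remaining = [j for j in range(p) if j not in selected]
        best_j = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best_j)
        path.append((list(selected), rss(X, y, selected)))
    return path

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.standard_normal(100)
path = forward_selection(X, y, max_steps=4)   # early termination after 4 steps
```

A final model would then be chosen from `path` by a criterion function such as those defined in Section 2.3.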
2.3 Criterion functions
• Akaike information criterion (AIC): AIC (Akaike, 1973) for the model $M_k$ with dimension $k$ is defined as
$$\mathrm{AIC}(M_k) = -2 \log L(M_k) + 2k,$$
where $L(M_k)$ is the likelihood corresponding to the model $M_k$. The first term $-2 \log L(M_k)$ in AIC is twice the negative log likelihood, which, for the linear regression model with a Gaussian likelihood, reduces to the residual sum of squares corresponding to the model $M_k$ (up to an additive constant, when the error variance is known and equal to one). That is, $-2 \log L(M_k) = \sum_{i=1}^{n} (y_i - x_i^\top \hat{\beta}(M_k))^2$, where $\hat{\beta}(M_k)$ is the least squares estimator for model $M_k$. Therefore, the first term acts as a measure of lack of fit to the data, with smaller values preferred. The second term acts as a penalty on models having a large dimension. AIC aims to balance the lack of fit and the model complexity, with smaller AIC values indicating a better balance between these two aspects.
• Bayesian information criterion (BIC): BIC (Schwarz, 1978) for the model $M_k$ with dimension $k$ is defined as
$$\mathrm{BIC}(M_k) = -2 \log L(M_k) + \log(n)\, k,$$
where $n$ is the sample size. BIC is motivated by a Bayesian framework in the sense that the model minimizing BIC approximately corresponds to the model with the highest posterior probability. Due to the larger penalty of $\log(n)$ on the model complexity as opposed to 2 for AIC, BIC often selects a sparser model compared to AIC.
• Extended Bayesian information criterion (EBIC): Chen et al. (2008) proposed a generalization of BIC for the settings with $p > n$, in which case the regularization imposed by BIC on model complexity is not sufficient. A version of EBIC proposed by Chen et al. (2008) uses the following criterion function:
$$\mathrm{EBIC}(M_k) = -2 \log L(M_k) + \log(p \vee n)\, k,$$
where $(p \vee n)$ denotes the maximum of $p$ and $n$. Note that it simply increases the penalty term of $\log(n)$ in BIC to $\log(p \vee n)$.
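The three criteria above differ only in the per-variable penalty, which the following sketch makes explicit. It assumes Gaussian errors with known unit variance so that $-2 \log L(M_k)$ can be replaced by the RSS up to a constant; the function names and the simulated data are hypothetical.

```python
import math
import numpy as np

def neg2_loglik(X, y, cols):
    """-2 log-likelihood up to a constant: the RSS of the least squares fit
    on columns `cols` (assumes Gaussian errors with known unit variance)."""
    cols = list(cols)
    if not cols:
        return float(y @ y)
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    resid = y - X[:, cols] @ coef
    return float(resid @ resid)

def aic(X, y, cols):
    return neg2_loglik(X, y, cols) + 2 * len(cols)

def bic(X, y, cols):
    return neg2_loglik(X, y, cols) + math.log(len(y)) * len(cols)

def ebic(X, y, cols):
    n, p = X.shape
    return neg2_loglik(X, y, cols) + math.log(max(p, n)) * len(cols)

rng = np.random.default_rng(2)
n, p = 30, 200                      # a p > n setting
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + rng.standard_normal(n)
model = [0, 5, 7]                   # an arbitrary candidate model of size 3
scores = {"AIC": aic(X, y, model), "BIC": bic(X, y, model), "EBIC": ebic(X, y, model)}
```

With $n = 30$ and $p = 200$, the per-variable penalties are 2, $\log 30 \approx 3.4$, and $\log 200 \approx 5.3$, so the same model receives increasing scores under AIC, BIC, and EBIC.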
3 The penalization framework
When the covariates are highly correlated, the least squares estimator, although unbiased, suffers from inflated variance. This is because the matrix $G = X^\top X$ is nearly singular, which causes its inverse to be ill-conditioned even if it exists. Motivated by this problem, “ridge regression” introduces an additional term to the least squares objective function in an attempt to regularize the resultant estimator. The ridge regression objective function is given by
$$R(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,$$
where $\lambda$ is a tuning parameter that controls the amount of penalization or regularization. In comparison to the information criterion functions that use the model size (or equivalently the $L_0$ norm of the regression vector $\beta$) to measure the model complexity, the regularization term of ridge regression uses the squared $L_2$ norm of the regression vector. At one extreme with $\lambda = 0$, the ridge regression estimator is the LSE and at the other extreme of $\lambda \to \infty$, it is the zero vector. For intermediate values of $\lambda$, it provides a shrinkage toward zero. The ridge regression estimator is given by
$$\hat{\beta}^{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top Y.$$
For $\lambda > 0$, the ridge estimator introduces some bias, but it helps reduce variance when $X^\top X$ is nearly singular. For the special case of the orthogonal design with $X^\top X = nI$, the ridge regression estimator has a simple relationship with the least squares estimator given by
$$\hat{\beta}^{\mathrm{ridge}} = \frac{1}{n + \lambda} X^\top Y = \frac{1}{1 + \lambda/n}\, \hat{\beta}^{\mathrm{ols}}.$$
This provides an intuition for the type of shrinkage the ridge estimator provides, as it shrinks the least squares estimator toward zero. Therefore, the ridge estimator is biased but has less variance compared to the least squares estimator. In high dimensions with $p > n$, even though $X^\top X$ is necessarily singular, the ridge estimator is still well-defined, unlike the LSE. However, the ridge estimator is not sparse since, in general, none of its components is exactly zero. While ridge regression does not perform variable selection by default, the properties of the ridge estimator for prediction and estimation have been studied for the large $p$ setting (Dicker, 2016; Dobriban and Wager, 2018; Hsu et al., 2014).
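The closed form of the ridge estimator and its relationship to least squares under the orthogonal design can be checked numerically. The sketch below builds a design with $X^\top X = nI$ from a QR factorization; the data, dimensions, and the function name `ridge` are illustrative assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X'X + lam I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n, p, lam = 40, 5, 10.0

# Orthogonal design with X'X = n I: scale orthonormal columns by sqrt(n).
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.standard_normal(n)

beta_ols = ridge(X, y, 0.0)          # lam = 0 gives the least squares estimate
beta_ridge = ridge(X, y, lam)        # equals beta_ols / (1 + lam / n) here

# p > n: X'X is singular, but the ridge system is still solvable for lam > 0.
X_wide = rng.standard_normal((5, 20))
y_wide = rng.standard_normal(5)
beta_wide = ridge(X_wide, y_wide, 1.0)
```

In the wide example every component of `beta_wide` is nonzero, which illustrates why ridge regression by itself does not select variables.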
3.1 LASSO and generalizations Tibshirani (1996) proposed using an $L_1$ regularized estimator popularly known as the LASSO estimator. The LASSO estimator minimizes the objective function
$$L(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.$$
Although very similar in form to the ridge regression, the LASSO estimator is quite special since it is a sparse estimator. It is not only a valid estimator when p ≫ n, but the number of nonzero components of the LASSO estimator can be much smaller than the sample size for appropriately chosen values of λ. In other words, LASSO estimator does both estimation and variable selection.
To see this, consider the orthogonal design case with $X^\top X = nI$, where the LASSO estimator can be written as
$$\hat{\beta}_j^{\mathrm{lasso}} = \begin{cases} \hat{\beta}_j^{\mathrm{ols}} - \dfrac{\lambda}{2n} & \text{if } \hat{\beta}_j^{\mathrm{ols}} > \dfrac{\lambda}{2n}, \\[4pt] 0 & \text{if } |\hat{\beta}_j^{\mathrm{ols}}| \le \dfrac{\lambda}{2n}, \\[4pt] \hat{\beta}_j^{\mathrm{ols}} + \dfrac{\lambda}{2n} & \text{if } \hat{\beta}_j^{\mathrm{ols}} < -\dfrac{\lambda}{2n}, \end{cases}$$
where $\hat{\beta}_j^{\mathrm{ols}}$ denotes the $j$th component of the least squares estimator. The relationship between the LASSO estimator and the least squares estimator for the orthogonal design provides important insights about the LASSO estimator. When the magnitude of the least squares estimator is smaller than $\lambda/(2n)$, the LASSO estimator sets it to zero and otherwise, it shrinks the magnitude of the coefficient by $\lambda/(2n)$. Fig. 1 illustrates the relationship between the least squares estimator, the ridge regression estimator, and the LASSO estimator for the orthogonal design case. We can see from the figure that the ridge estimator’s shrinkage is multiplicative and hence implies larger bias for coefficients with a large magnitude. On the other hand, the shrinkage of LASSO sets coefficients with small magnitudes exactly to zero while the bias remains constant for all the coefficients with a large magnitude. Under quite weak regularity conditions, the LASSO estimator is shown to have optimal theoretical properties for the purposes of estimation and prediction (Bickel et al., 2009; Bühlmann and van de Geer, 2011; Meinshausen and Yu, 2009; Mousavi et al., 2017; van de Geer, 2008; Zhang and Huang, 2008). However, for LASSO to have desirable theoretical properties for selection,
FIG. 1 The shrinkage of the ridge regression estimator and the LASSO estimator as a function of the least squares estimator for the orthogonal design.
it requires quite stringent conditions on the design matrix, called irrepresentable conditions (Zhao and Yu, 2006), a version of which will be described in the following.
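The two shrinkage maps for the orthogonal design with $X^\top X = nI$ (soft thresholding at $\lambda/(2n)$ for LASSO, multiplicative shrinkage by $1/(1 + \lambda/n)$ for ridge) can be sketched as below. The function names and the example values are illustrative assumptions.

```python
import numpy as np

def lasso_orthogonal(beta_ols, lam, n):
    """LASSO solution under X'X = n I: soft thresholding at lam / (2n)."""
    t = lam / (2.0 * n)
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - t, 0.0)

def ridge_orthogonal(beta_ols, lam, n):
    """Ridge solution under X'X = n I: multiplicative shrinkage."""
    return beta_ols / (1.0 + lam / n)

beta_ols = np.array([-3.0, -0.2, 0.0, 0.4, 2.0])
lam, n = 10.0, 10                                     # threshold lam / (2n) = 0.5
beta_lasso = lasso_orthogonal(beta_ols, lam, n)       # [-2.5, 0, 0, 0, 1.5]
beta_ridge = ridge_orthogonal(beta_ols, lam, n)       # [-1.5, -0.1, 0, 0.2, 1.0]
```

The large coefficients lose a constant 0.5 under LASSO but half of their magnitude under ridge, while only LASSO produces exact zeros.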
3.1.1 Strong irrepresentable condition Zhao and Yu (2006) discussed the strong irrepresentable condition under which the LASSO estimator can consistently perform perfect variable selection. However, this condition may fail to hold even when the correlations between the covariates are only moderately high. Write the matrix $G = X^\top X$ as
$$G = \begin{pmatrix} G_{11} & G_{12} \\ G_{21} & G_{22} \end{pmatrix} := \begin{pmatrix} X_A^\top X_A & X_A^\top X_I \\ X_I^\top X_A & X_I^\top X_I \end{pmatrix},$$
where $X_A$ and $X_I$ denote the columns of $X$ corresponding to the active and inactive variables, respectively. The strong irrepresentable condition states that
$$\left| G_{21} G_{11}^{-1}\, \mathrm{sign}(\beta_A) \right| < 1,$$
with the inequality holding componentwise, where $\beta_A$ is the regression vector restricted to the active variables. To appreciate the restrictive nature of this condition, Zhao and Yu (2006) considered the case $G = rJ + (1 - r)I$, where $J$ is the matrix of all 1’s and $0 < r < 1$ is the common correlation between the predictors. In this case, the irrepresentable condition requires that $r < 1/(1 + cs)$ for a positive constant $c$, where $s$ is the number of nonzero components of $\beta$, also referred to as the sparsity level. This is a strong condition especially when $s$ is large and demonstrates that while LASSO is suitable for estimation and prediction, it may not be best suited for variable selection since an approximate version of the strong irrepresentable condition is also necessary for LASSO to be selection consistent (Zhao and Yu, 2006).
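The quantity appearing in the strong irrepresentable condition is easy to evaluate for a given Gram matrix. The sketch below computes $\max |G_{21} G_{11}^{-1} \mathrm{sign}(\beta_A)|$ for the equicorrelated case $G = rJ + (1 - r)I$, taking the first $s$ variables as active with all-positive signs; this sign pattern, the function names, and the values of $p$, $r$, and $s$ are assumptions made for illustration. For this particular configuration a direct calculation gives the value $rs/(1 + (s - 1)r)$.

```python
import numpy as np

def irrepresentable_value(G, active, signs):
    """max_j |(G21 G11^{-1} sign(beta_A))_j| over the inactive variables j."""
    p = G.shape[0]
    active = np.asarray(active)
    inactive = np.setdiff1d(np.arange(p), active)
    G11 = G[np.ix_(active, active)]
    G21 = G[np.ix_(inactive, active)]
    return float(np.max(np.abs(G21 @ np.linalg.solve(G11, signs))))

def equicorrelated(p, r):
    """Gram matrix r J + (1 - r) I with common correlation r."""
    return r * np.ones((p, p)) + (1.0 - r) * np.eye(p)

p, r = 50, 0.5
values = {}
for s in (1, 5, 25):
    G = equicorrelated(p, r)
    values[s] = irrepresentable_value(G, np.arange(s), np.ones(s))
```

With $r = 0.5$ the values are roughly 0.50, 0.83, and 0.96 for $s = 1, 5, 25$, so the margin below 1 shrinks as the sparsity level grows.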
3.1.2 Adaptive LASSO Given an initial consistent estimator $\hat{\beta}$, the adaptive LASSO estimator (Zou, 2006) minimizes
$$AL(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\hat{\beta}_j|}.$$
This helps in providing adaptive shrinkage: a larger penalty for smaller coefficients and a smaller penalty for larger coefficients. When a good initial estimator is available, this turns out to be a good strategy as it also provides a natural scaling of the coefficients. While the adaptive LASSO helps to improve the performance of LASSO when there is a good initial estimator available, it still lacks model selection consistency under weak conditions. Ideally, under the sparsity assumption, one would like to utilize the $L_0$ norm penalty for regularization to obtain model selection consistency. However, minimizing the $L_0$ regularized objective function may not be feasible in practice with large $p$ as the number of models to be evaluated is $2^p$, which is huge. Among the $L_q$ penalties, $q = 1$ is the smallest value for which the penalty is convex, and the $L_1$ norm is therefore a natural
Bayesian model selection for high-dimensional data Chapter
4 215
relaxation to the L0 in view of its computational appeal. However, a drawback of the L1 regularization is that the penalty term is linearly proportional to the magnitude of the coefficient β, which causes high bias for estimating the coefficients with large magnitude.
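The adaptive LASSO objective of Section 3.1.2 is a weighted $L_1$ problem with weights $1/|\hat{\beta}_j|$, and one simple way to minimize it is cyclic coordinate descent. The sketch below is a minimal illustration under stated assumptions: the least squares estimate is used as the initial estimator (which requires $p < n$), the number of sweeps is fixed, and the function names and data are hypothetical.

```python
import numpy as np

def soft(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Coordinate descent for sum_i (y_i - x_i'b)^2 + lam * sum_j weights[j] |b_j|."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_ss = np.sum(X ** 2, axis=0)
    resid = y.astype(float).copy()               # current residual y - X beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]           # remove coordinate j from the fit
            rho = X[:, j] @ resid
            beta[j] = soft(rho, lam * weights[j] / 2.0) / col_ss[j]
            resid -= X[:, j] * beta[j]           # add the updated coordinate back
    return beta

def objective(X, y, beta, lam, weights):
    return float(np.sum((y - X @ beta) ** 2) + lam * np.sum(weights * np.abs(beta)))

rng = np.random.default_rng(5)
n, p, lam = 200, 6, 20.0
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0]) + rng.standard_normal(n)

beta_init = np.linalg.lstsq(X, y, rcond=None)[0]     # initial estimator (assumed: OLS)
w = 1.0 / np.abs(beta_init)                          # adaptive weights
beta_adapt = weighted_lasso(X, y, lam, w)
beta_lasso = weighted_lasso(X, y, lam, np.ones(p))   # plain LASSO for comparison
```

Setting all weights to one recovers the ordinary LASSO objective, so the same routine serves for both estimators.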
3.1.3 Elastic net The elastic net (Zou and Hastie, 2005) penalty attempts to combine the advantages of both ridge regression and LASSO, namely shrinkage and sparsity together. The elastic net estimator minimizes
$$EN(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} |\beta_j|^2.$$
Due to the ridge regularization, the elastic net estimator can handle correlations between the predictors better than LASSO and due to the L1 regularization, sparsity is obtained. However, the bias issue present for LASSO is still present for elastic net.
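To see how the two penalties interact, it may help to work out the elastic net solution in the orthogonal design case $X^\top X = nI$ used earlier for ridge regression and LASSO. The following derivation is an illustrative calculation under that assumption and is not taken from the chapter.

```latex
% Under X^T X = n I the objective separates across coordinates:
EN(\beta) = \text{const} + \sum_{j=1}^{p} \Big\{ n\,(\beta_j - \hat{\beta}_j^{\mathrm{ols}})^2
            + \lambda_1 |\beta_j| + \lambda_2 \beta_j^2 \Big\}.
% Minimizing each term separately gives soft thresholding followed by
% multiplicative shrinkage:
\hat{\beta}_j^{\mathrm{en}}
  = \frac{\mathrm{sign}(\hat{\beta}_j^{\mathrm{ols}})
          \big( |\hat{\beta}_j^{\mathrm{ols}}| - \tfrac{\lambda_1}{2n} \big)_{+}}
         {1 + \lambda_2 / n}.
% Special cases: \lambda_2 = 0 recovers the LASSO formula with threshold \lambda_1/(2n);
% \lambda_1 = 0 recovers the ridge shrinkage factor 1/(1 + \lambda_2/n).
```

The thresholding step produces exact zeros, while the additional division by $1 + \lambda_2/n$ shows that the large coefficients remain biased toward zero.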
3.2 Nonconvex penalization Motivated by the bias induced by convex penalties such as the LASSO, Fan and Li (2001) and Fan and Peng (2004) proposed a nonconvex penalty called the smoothly clipped absolute deviation (SCAD) penalty. Although nonconvex penalized objective functions may not have a unique minimizer, several computational algorithms which attempt to provide good solutions have been proposed (Breheny and Huang, 2011; Fan and Li, 2001; Mazumder et al., 2011; Zhang, 2010). The SCAD penalty function is given by (Fig. 2):
$$\rho_{\mathrm{SCAD}}(\beta) = \begin{cases} \lambda |\beta|, & \text{if } |\beta| \le \lambda, \\[4pt] \dfrac{2a\lambda|\beta| - \beta^2 - \lambda^2}{2(a - 1)}, & \text{if } \lambda < |\beta| \le a\lambda, \\[4pt] \dfrac{\lambda^2 (a + 1)}{2}, & \text{if } a\lambda < |\beta|. \end{cases} \tag{4}$$
FIG. 2 Penalty functions corresponding to different penalization methods.
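The SCAD penalty in (4) can be evaluated directly, which also makes it easy to confirm that the three pieces join continuously. In this sketch the function name is hypothetical and the default $a = 3.7$ is the value commonly used following Fan and Li (2001); treat it as an assumed default rather than a requirement.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of equation (4), evaluated elementwise (requires a > 1)."""
    b = np.abs(np.asarray(beta, dtype=float))
    linear = lam * b                                              # |beta| <= lam
    quad = (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))  # lam < |beta| <= a*lam
    const = lam ** 2 * (a + 1) / 2                                # |beta| > a*lam
    return np.where(b <= lam, linear, np.where(b <= a * lam, quad, const))

lam, a = 1.0, 3.7
grid = np.array([0.0, 0.5, 1.0, 2.0, 3.7, 10.0])
values = scad_penalty(grid, lam, a)
```

For coefficients with magnitude beyond $a\lambda$ the penalty is flat at $\lambda^2(a + 1)/2$, so large coefficients are no longer shrunk.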
With a similar motivation Zhang (2010) proposed another nonconvex penalty called the minimax concave penalty (MCP). The idea behind these penalties is that they start penalizing the coefficients near zero in an L1 manner similar to the LASSO penalty. However, as the magnitude of the coefficient becomes larger, the amount of their penalty smoothly decreases to zero. The rate at which this regularization decreases is more drastic for the MCP penalty compared to the SCAD penalty. In either case, these penalties avoid the bias induced by LASSO for large magnitude coefficients as the penalty becomes small. An advantage of the nonconvex penalties is that they exhibit attractive theoretical properties for variable selection without stringent conditions as required for LASSO. For instance, Zhang (2010) and Loh and Wainwright (2017) showed that the nonconvex procedures achieve selection consistency under weaker conditions on the design matrix compared to LASSO.
3.3 Variable screening When the dimension is extremely large, variable selection methods may not be feasible both in terms of computational implementation and theoretical performance. In such cases, screening methods play an important role as they reduce the dimension to a manageable size using computationally feasible approaches. Fan and Lv (2008) proposed using the marginal correlations between the covariates and the response for screening out covariates that have low correlation. Define $\rho_j = \mathrm{cor}(X_j, Y)$, and define $\rho_{(j)}$ by ordering the $\rho_j$’s based on their magnitudes so that $|\rho_{(1)}| < \cdots < |\rho_{(j)}| < \cdots < |\rho_{(p)}|$. Then the covariates having their magnitude of correlation smaller than $|\rho_{(p-K)}|$, that is, $\{j : |\rho_j| < |\rho_{(p-K)}|\}$, are screened out (excluded) from further analysis, where $K$ is a chosen number that controls how many covariates are retained. Under some conditions on the design matrix and the data generating model, Fan and Lv (2008) established the sure screening property, which ensures that all the relevant variables are retained by this marginal correlation-based screening. However, for such results to hold, the marginal correlation between each relevant covariate and the response should be large. This rules out situations where some covariates have no marginal correlation with the response but have joint effects in the presence of other covariates. Fan and Song (2010) generalized this to the GLM setting where marginal correlation is replaced by the marginal effect measured by fitting a one-covariate GLM model for each covariate. He et al. (2013) proposed screening based on the conditional quantiles of $Y$ given each covariate marginally, motivated by the flexibility of quantile regression (Koenker, 2005; Koenker and Bassett, 1978). A nice feature of this approach is that it is much more robust to outliers. Moreover, quantile-based screening does not require specification of a model and can also handle heterogeneity of the observations in the data. For these reasons, it is a strong alternative to correlation-based screening.
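Marginal correlation screening as described above amounts to ranking the covariates by the magnitude of their sample correlation with the response. The sketch below keeps the $K$ covariates with the largest magnitudes; this retention convention, the function name, and the simulated data are assumptions for illustration.

```python
import numpy as np

def marginal_screening(X, y, K):
    """Keep the K covariates with the largest absolute marginal correlation
    with y. Returns the retained column indices and all correlations."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    order = np.argsort(np.abs(corr))          # ascending in magnitude
    kept = np.sort(order[-K:])
    return kept, corr

rng = np.random.default_rng(6)
n, p, K = 100, 500, 20
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 3] + 3.0 * X[:, 100] + rng.standard_normal(n)
kept, corr = marginal_screening(X, y, K)
```

A variable selection method such as those in Sections 3.1 and 3.2 would then be applied to the retained columns `X[:, kept]` only.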
4 The Bayesian framework for model selection
Model choice has been a very important topic within the Bayesian framework. Bayesian hypothesis testing can be viewed as a special case of the Bayesian approach to model selection. Consider the following hypothesis testing problem:
$$H_0 : M_0 \quad \text{vs} \quad H_1 : M_1,$$
where $M_0$ and $M_1$ are two competing models (or two probability distributions). The Bayesian approach specifies prior probabilities for the hypotheses, say $p_0$ for $H_0$ and $p_1 = (1 - p_0)$ for $H_1$. If $\pi(\mathrm{Data} \mid M_0)$ and $\pi(\mathrm{Data} \mid M_1)$ denote the data generating distributions under the two different models, respectively, then the Bayesian approach involves computing the posterior probabilities of the models $M_0$ and $M_1$ given the data, namely
$$P[H_0 \mid \mathrm{Data}] = \frac{P[\mathrm{Data} \mid H_0]\, P[H_0]}{P[\mathrm{Data} \mid H_0]\, P[H_0] + P[\mathrm{Data} \mid H_1]\, P[H_1]}. \tag{5}$$
Therefore, the ratio of the posterior probabilities is given by
$$\underbrace{\frac{P[H_1 \mid \mathrm{Data}]}{P[H_0 \mid \mathrm{Data}]}}_{\text{Posterior odds}} = \underbrace{\frac{P[\mathrm{Data} \mid H_1]}{P[\mathrm{Data} \mid H_0]}}_{\text{Bayes factor}} \times \underbrace{\frac{P[H_1]}{P[H_0]}}_{\text{Prior odds}}. \tag{6}$$
The first term in the RHS is the Bayes factor and the second term is the prior odds. The LHS is the posterior odds. Kass and Raftery (1995) provide concrete but ad hoc guidelines on how to interpret Bayes factors in terms of the strength of evidence they provide against the null hypothesis. For example, they suggest that a Bayes factor larger than 10 provides strong evidence for the alternative hypothesis. Note that, unlike with frequentist hypothesis testing, the Bayesian approach is symmetric in the hypotheses tested since the posterior odds could just as well be defined as $P[H_0 \mid \mathrm{Data}]/P[H_1 \mid \mathrm{Data}]$. Another advantage of the Bayesian approach is that it can be easily generalized to any number of hypotheses by placing a prior distribution on all the hypotheses. However, an issue that arises with multiple hypotheses is that of adjusting for multiplicity. Scott and Berger (2010) studied Bayesian testing of multiple hypotheses and proposed strategies for multiplicity adjustment. Let us now consider the normal means model, which is an important special case of the linear regression model. Suppose that we have $n$ independent observations $X_j \sim N(\mu_j, \sigma^2)$, $j = 1, \ldots, n$, where the variance $\sigma^2$ is known and can potentially depend on $n$. We are interested in testing the hypotheses
$$H_{0j} : \mu_j = 0 \quad \text{vs} \quad H_{1j} : \mu_j \neq 0.$$
In the single hypothesis testing context, it is perhaps most natural to assign a prior probability of $P[H_0] = P[H_1] = 0.5$ as an objective choice. However, in the multiple hypotheses context, such a prior of $P[H_{0j}] = P[H_{1j}] = 0.5$, independently across $j = 1, \ldots, n$, does not provide multiplicity control. Although this prior may seem like an objective prior on each of the hypotheses individually, Scott and Berger (2010) call it pseudo-objective since it has a prior expectation of $n/2$ hypotheses being nonnull. Under this prior, the probability that all nulls are true is $(1/2)^n$, which is minuscule even for moderate $n$. One way to handle the issue of multiplicity in a Bayesian way is to first consider a prior on the possibility that all nulls are true, followed by conditionally specifying prior probabilities on each individual hypothesis. Westfall et al. (1997) adopted one such strategy that leads to a Bayesian version of the Bonferroni correction for multiplicity. In general, a challenge with the multiple hypothesis setting, and more specifically with high-dimensional model selection, is that there is no unique way of defining an objective prior. Let us now consider the more general linear regression model
$$Y_{n \times 1} = X_{n \times p}\, \beta_{p \times 1} + E_{n \times 1}.$$
The Bayesian framework relies on a likelihood specification for the observed data and a prior distribution on the parameters of interest. For linear regression, the most natural likelihood comes from specifying a Gaussian distribution on the errors. That is,
$$Y \mid (X, \beta) \sim N(X\beta, \sigma^2 I).$$
Our objective here is to select the model corresponding to the nonzero components of $\beta$ from all the possible models.
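The relation in (6) between posterior odds, Bayes factor, and prior odds, and the multiplicity issue with independent prior probabilities of 0.5, can be illustrated with a few lines of arithmetic. The function names and the numerical inputs below are assumptions chosen for illustration.

```python
def posterior_prob_h1(bayes_factor, prior_h1=0.5):
    """P[H1 | Data] from equation (6): posterior odds = Bayes factor * prior odds."""
    prior_odds = prior_h1 / (1.0 - prior_h1)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

def prob_all_nulls_true(n, prior_h0=0.5):
    """Prior probability that all n nulls hold under independent priors."""
    return prior_h0 ** n

def expected_nonnull(n, prior_h1=0.5):
    """Prior expected number of nonnull hypotheses under independent priors."""
    return n * prior_h1

post_equal = posterior_prob_h1(10.0, prior_h1=0.5)     # 10 / 11, about 0.909
post_sparse = posterior_prob_h1(10.0, prior_h1=0.01)   # same evidence, sparse prior
all_null_20 = prob_all_nulls_true(20)                  # (1/2)^20
nonnull_20 = expected_nonnull(20)                      # 10.0
```

The same Bayes factor of 10 yields a posterior probability of about 0.91 under equal prior odds but only about 0.09 when the prior probability of the alternative is 0.01, which is the kind of adjustment a multiplicity-aware prior induces.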
This problem can be formulated as a multiple hypothesis testing problem for the collection of the hypotheses
$$H_{0j} : \beta_j = 0 \quad \text{vs} \quad H_{1j} : \beta_j \neq 0, \quad j = 1, \ldots, p.$$
Define binary indicator variables $Z_j$ to indicate whether the hypothesis $H_{1j}$ is true, and the binary vector $Z = (Z_1, \ldots, Z_p)$, which defines a model as it uniquely identifies the nonzero components of $\beta$. The idea is to obtain the posterior distribution of the model vector $Z$, which can be used to perform model selection. There are different ways in which the posterior distribution $\pi(Z \mid Y)$ can be used for model selection:
• Maximum a posteriori (MAP) model: the MAP model is the model that maximizes the posterior probability, that is, the model $\arg\max_{k} P[Z = k \mid Y]$, where $k$ denotes an arbitrary model coded as a binary vector with ones corresponding to the active covariates and zeroes corresponding to inactive covariates, and $P[Z = k \mid Y]$ is the posterior probability that $k$ denotes the data generating model. As an example, $k = (1, 1, 1, 0, \ldots, 0)$ indicates a model with the first three covariates active. To obtain the MAP model, one would need to evaluate all the $2^p$ possible models, which can be quite expensive if $p$ is large. A good approximation for the posterior distribution is also difficult to obtain in high dimensions. Iterative methods (Hans et al., 2007; Yang et al., 2016) that aim to obtain a good model having posterior probability close to the MAP model are often used in practice.
• Median probability model: Find the set of covariates whose marginal posterior probabilities exceed 0.5, that is, $\{j : P[Z_j = 1 \mid Y] > 0.5\}$. Barbieri and Berger (2004) called the model corresponding to this set of covariates the median probability model and studied its theoretical properties.
• The threshold of 0.5 used by the median probability model is arguably ad hoc and an alternative threshold may be used for better model selection performance. An adaptive way to choose the threshold is to obtain the models corresponding to different threshold values followed by the use of a criterion function such as BIC to select a final model. This strategy was used by Narisetty and He (2014).
Several papers including Scott and Berger (2006, 2010) discuss prior distributions on the hypotheses which induce some level of multiplicity adjustment. Two of the most commonly used priors are the following:
• The $Z_j$’s have independent Bernoulli priors
$$P[Z_j = 1] = 1 - P[Z_j = 0] = q, \quad q \sim \mathrm{Beta}(a, b).$$
Note that here $q$ is an unknown parameter which has a Beta prior distribution. Scott and Berger (2010) discuss this fully Bayesian approach along with an empirical Bayes approach which estimates the prior probability $q$ based on the data. Several other authors, including Narisetty and He (2014), Scott and Berger (2010), and Yang et al. (2016), treat $q$ as a hyperparameter and provide conditions on $q$ for achieving appropriate multiplicity adjustment.
• Castillo et al. (2015) and Martin et al. (2017) used the following form of priors on $Z$:
$$\pi(Z = k) \propto \binom{p}{|k|}^{-1} f(|k|),$$
where $|k|$ is the number of nonzero components of $k$, also referred to as the size of the model $k$, and $f(\cdot)$ is a distribution on the model size. The intuition behind this prior is that the prior probability of a model depends only on its size and the form of $f$ determines the prior distribution on the model size. Castillo et al. (2015) proposed the following specific form for $f(\cdot)$:
$$f(|k|) \propto c^{-|k|}\, p^{-a|k|}, \quad c, a > 0. \tag{7}$$
This prior places exponentially decreasing prior mass on models as their size increases, which encourages sparser models to be selected. Once the prior distribution on the model space is provided, to carry out the Bayesian framework, we would need a prior distribution on the parameter vector $\beta$. A wide variety of prior distributions are considered for this purpose. We now discuss some of these priors, which are broadly called spike and slab priors in the literature.
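For small $p$, the posterior over all $2^p$ models can be enumerated, which makes the MAP and median probability models of this section concrete. The sketch below combines the model size prior in (7) with a crude stand-in for the marginal likelihood, $\exp(-\mathrm{BIC}/2)$; this approximation, the choice $c = a = 1$, the function names, and the simulated data are all assumptions made for illustration and are not the chapter's procedure.

```python
from itertools import product
from math import comb, log
import numpy as np

def log_complexity_prior(k, p, c=1.0, a=1.0):
    """log pi(Z = k) up to a constant: -log C(p,|k|) - |k| (log c + a log p)."""
    size = sum(k)
    return -log(comb(p, size)) - size * (log(c) + a * log(p))

def bic(X, y, k):
    """BIC-type score n log(RSS/n) + |k| log n for the model coded by k."""
    n = len(y)
    cols = [j for j, z in enumerate(k) if z == 1]
    resid = y
    if cols:
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        resid = y - X[:, cols] @ coef
    return n * log(float(resid @ resid) / n) + log(n) * len(cols)

def model_posterior(X, y, c=1.0, a=1.0):
    """Approximate posterior over all 2^p models (assumed: exp(-BIC/2) likelihood)."""
    p = X.shape[1]
    models = list(product([0, 1], repeat=p))
    log_post = np.array([-0.5 * bic(X, y, k) + log_complexity_prior(k, p, c, a)
                         for k in models])
    post = np.exp(log_post - log_post.max())
    return models, post / post.sum()

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 5))
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + rng.standard_normal(100)

models, post = model_posterior(X, y)
map_model = models[int(np.argmax(post))]               # MAP model
inclusion = np.array(models).T @ post                  # P[Z_j = 1 | Y]
median_model = tuple(int(v) for v in (inclusion > 0.5))
```

The enumeration over `product([0, 1], repeat=p)` is exactly what becomes infeasible for large $p$, which motivates the iterative search methods mentioned above.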
5 Spike and slab priors In the Bayesian literature on variable selection and shrinkage, there are two primary classes of prior distributions for the regression coefficient $\beta_j$ under the null hypothesis that $Z_j = 0$: (i) the point mass spike prior, which is a degenerate distribution placing all of its probability mass at $\beta_j = 0$, and (ii) continuous (spike) priors, which take the view that the magnitude of $\beta_j$ under the null hypothesis is “small” but need not be exactly zero. Before we proceed, we define some notation to be used in the rest of the chapter. We use $k$ to denote a generic model, which is a binary vector of length $p$ indicating which covariates are active. For instance, $k = (1, 1, 1, 0, \ldots, 0)$ indicates a model with the first three covariates active. We use $\beta_k$ and $X_k$ to denote the components and columns of $\beta$ and $X$ corresponding to the nonzero components of $k$, respectively. Similarly, $\beta_{k^c}$ and $X_{k^c}$ denote the components and columns of $\beta$ and $X$ corresponding to the zero components of $k$, respectively.
5.1 Point mass spike prior Under the hypothesis $Z_j = 0$, the point mass prior at zero is a natural prior choice and hence has been considered by several authors in the literature. Mitchell and Beauchamp (1988) considered the point mass spike prior for $Z_j = 0$ and, for the hypothesis $Z_j = 1$, proposed a proper uniform prior on $\beta_j$ (which has a density that looks like a slab), giving rise to the well-known spike and slab prior terminology. See Fig. 3 for an illustration of the spike and slab priors with different slab distributions including the uniform slab prior.
5.1.1 g-priors Zellner (1986) proposed a multivariate normal prior for the regression coefficients $\beta$ given a model. That is, given that $Z = k$, $\beta_{k^c} = 0$ and the active part $\beta_k \mid \phi \sim N\!\left(0, \frac{g}{\phi} (X_k^\top X_k)^{-1}\right)$, and $\phi$ has an improper diffuse prior given by
[Figure 3: four panels titled Uniform slab, Gaussian slab, Laplace slab, and Nonlocal slab; in each panel the horizontal axis runs from -5.0 to 5.0 and the vertical axis from 0.0 to 0.5.]
FIG. 3 Examples of point mass spike priors with different slab priors. The (red) arrow indicates a point mass at zero and the slab priors are (i) a Uniform prior on $[-5, 5]$, (ii) a Gaussian prior with variance 10, (iii) a Laplace prior with scale 0.15, and (iv) a nonlocal prior from the pMOM prior family with $r = 1$.
$$\pi(\phi) \propto \phi^{-1}.$$
In this framework, $g$ is a hyperparameter to be chosen. Several authors have provided recommendations for the value of $g$. For example, Foster and George (1994) suggested $g = p^2$, Kass and Raftery (1995) suggested $g = n$, and Fernández et al. (2001) suggested $g = \max(n, p^2)$. To provide better variable selection properties, Liang et al. (2008) proposed placing an additional prior distribution on $g$ to obtain a mixture of g-priors. The prior distribution on $g$ proposed by Liang et al. (2008) is given by
$$\pi(g) = \frac{a - 2}{2}\,(1 + g)^{-a/2}, \quad g > 0, \; a > 2.$$
In the low-dimensional setting with p fixed, they showed model selection consistency of the mixture g-priors.
5.1.2 Nonlocal priors Johnson and Rossell (2012) proposed using nonlocal priors for Bayesian variable selection. The main motivation behind the nonlocal priors is that the slab prior should ideally not have a fixed positive mass around zero since the slab prior should represent signals with nonzero magnitude. A nonlocal prior on the other hand has a density function that converges to zero as the magnitude of the parameter converges to zero. A more formal definition of local and nonlocal priors can be found in Johnson and Rossell (2012).
Examples of nonlocal priors are the product moment (pMOM) prior and the product inverse moment (piMOM) prior. A pMOM prior for a given model $k$ is given by
$$\pi(\beta_k \mid \tau, \sigma^2) \propto \exp\left\{-\frac{1}{2\tau\sigma^2}\,\beta_k^\top \beta_k\right\} \prod_{j=1}^{|k|} \beta_{kj}^{2r},$$
and a piMOM prior under a model $k$ is given by
$$\pi(\beta_k \mid \tau, \sigma^2) \propto \prod_{j=1}^{|k|} |\beta_{kj}|^{-(r+1)} \exp\left\{-\frac{\tau\sigma^2}{\beta_{kj}^2}\right\},$$
where $\tau$ and $\sigma^2$ are hyperparameters and $|k|$ is the number of active covariates in model $k$. It can be seen that both the pMOM and piMOM priors have densities converging to zero as the magnitude of the regression coefficient converges to zero. With these priors, Johnson and Rossell (2012) showed that the posterior concentrates on the true model with probability going to one for $p \le n$. In particular, if $t$ denotes the true model, the authors showed that $P[Z = t \mid Y] \xrightarrow{P} 1$, where $\xrightarrow{P}$ indicates convergence in probability. This notion of consistency is called global model selection consistency or strong model selection consistency, and is much stronger than the more commonly considered consistency in terms of pairwise Bayes factors, that is, $P[Z = k \mid Y] / P[Z = t \mid Y] \xrightarrow{P} 0$ for each model $k$ marginally. Even when this pairwise consistency holds, it is still possible that the posterior probability of the true model tends to zero, that is, $P[Z = t \mid Y] \xrightarrow{P} 0$. This is because there are $2^p - 1$ false models and the cumulative probability of all of them can be large even if each of them individually is small in comparison to the true model. In fact, strong selection consistency is equivalent to having
$$\sum_{k \ne t} \frac{P[Z = k \mid Y]}{P[Z = t \mid Y]} \xrightarrow{P} 0,$$
and is a much stronger statement since the sum is over a large collection of models. More interestingly, Johnson and Rossell (2012) also showed that with local priors, that is, priors whose mass around zero does not tend to zero, it is not possible to achieve such posterior concentration. This is an important result since it provides a guideline on how to choose slab priors in the high-dimensional setting. It is important to note that while Gaussian priors with a fixed variance would be considered local priors since the density at zero is positive, a Gaussian prior can still achieve the properties of a nonlocal prior if its variance parameter is allowed to increase to infinity, as discussed by Narisetty and He (2014). This is because the Gaussian prior mass around zero becomes small and tends to zero with an increasing variance. The posterior distributions corresponding to nonlocal priors may not be computable using standard computational algorithms such as the Gibbs sampler.
Johnson and Rossell (2012) proposed using Laplace approximations for approximate posterior computation. In high dimensions, the computational burden of this approach can be quite high. Therefore, Shin et al. (2018) more recently proposed a scalable algorithm for computing the posterior with nonlocal priors in high dimensions. This algorithm, called Simplified Shotgun Stochastic Search with Screening (S5), generalizes the shotgun stochastic search (SSS) algorithm of Hans et al. (2007) for more efficient sampling of the model space.
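The defining property of nonlocal priors, a density that vanishes as the coefficient approaches zero, can be checked numerically. The following is a minimal sketch, assuming unnormalized pMOM and piMOM densities with illustrative values $\tau = \sigma^2 = 1$ and $r = 1$; the function names are hypothetical and chosen only for this illustration. A Gaussian density with variance 10 is included as a local prior for contrast.

```python
import numpy as np

def pmom_unnormalized(beta, tau=1.0, sigma2=1.0, r=1):
    """Unnormalized pMOM density: exp(-b'b / (2 tau sigma2)) * prod_j b_j^(2r)."""
    beta = np.atleast_1d(np.asarray(beta, dtype=float))
    gaussian_part = np.exp(-(beta @ beta) / (2.0 * tau * sigma2))
    return float(gaussian_part * np.prod(beta ** (2 * r)))

def pimom_unnormalized(beta, tau=1.0, sigma2=1.0, r=1):
    """Unnormalized piMOM density: prod_j |b_j|^-(r+1) * exp(-tau sigma2 / b_j^2)."""
    beta = np.atleast_1d(np.asarray(beta, dtype=float))
    if np.any(beta == 0.0):
        return 0.0  # limiting value of the density as a coordinate tends to zero
    terms = np.abs(beta) ** (-(r + 1)) * np.exp(-tau * sigma2 / beta ** 2)
    return float(np.prod(terms))

def gaussian_density(beta, variance=10.0):
    """A local prior for comparison: its density at zero is strictly positive."""
    return float(np.exp(-beta ** 2 / (2.0 * variance)) / np.sqrt(2.0 * np.pi * variance))

# Evaluate the three densities on a sequence of coefficients shrinking to zero.
for b in (1.0, 0.5, 0.1, 0.01, 0.0):
    print(b, pmom_unnormalized(b), pimom_unnormalized(b), gaussian_density(b))
```

Both nonlocal densities shrink to zero along this sequence, while the Gaussian density stays bounded away from zero.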
5.2 Continuous spike priors
For point mass spike priors, the dimension of the regression vector $\beta_k$ differs across the possible models $k$ for $Z$. Methods for posterior computation in this setting tend to be computationally intensive because of this change in dimension, which motivated several authors to consider continuous priors for the spike distribution $\beta_j \mid Z_j = 0$, for $j = 1, \ldots, p$. That is, priors of the form
$$\beta_j \mid Z_j = 0 \sim \pi_0(\beta_j), \qquad \beta_j \mid Z_j = 1 \sim \pi_1(\beta_j), \tag{8}$$
with both $\pi_0$ and $\pi_1$ being continuous distributions. The prior $\pi_0$ still concentrates the majority of its probability mass around zero. Priors on the model indicator $Z$ are placed as discussed in Section 4. A foremost example of the continuous spike and slab prior framework was proposed by George and McCulloch (1993):
$$\begin{aligned} Y \mid (X, \beta, \sigma^2) &\sim N(X\beta, \sigma^2 I), \\ \beta_j \mid \sigma^2, Z_j = 0 &\sim N(0, \sigma^2 \tau_0^2), \\ \beta_j \mid \sigma^2, Z_j = 1 &\sim N(0, \sigma^2 \tau_1^2), \\ P(Z_j = 1) = 1 - P(Z_j = 0) &= q, \\ \sigma^2 &\sim IG(\alpha_1, \alpha_2), \end{aligned} \tag{9}$$
where $0 < \tau_0^2 < \tau_1^2 < \infty$ are the variances of the spike and slab priors, respectively, which may be tuned, and IG denotes an inverse Gamma distribution. The intuition behind this setup is that the covariates with zero or very small coefficients will be identified with zero $Z$ values, and the active covariates will be classified as $Z = 1$. We use the posterior probabilities of the latent variables $Z$ to identify the active covariates (see Fig. 4). Ishwaran and Rao (2005) studied several theoretical properties related to the framework defined by (9) when the dimension $p$ is fixed. In particular, they studied the consistency properties of the posterior mean as an estimator. They also studied the variable selection properties of their selection method, which is based on thresholding the posterior mean. While their results provide substantial insights about the shrinkage properties of the spike and slab prior framework, they are not applicable to the high-dimensional setting, which is the current interest.
FIG. 4 Examples of continuous spike and slab priors: spike and slab Gaussian priors, and spike and slab LASSO priors.
Narisetty and He (2014) studied the model selection properties associated with these priors and provided insightful results on how the prior parameters $\tau_0^2$, $\tau_1^2$, and $q$ should be selected to depend on $n$ and $p$ for achieving appropriate shrinkage and model selection performance. The spike and slab prior variances are set so that $\tau_0^2 \to 0$ and $\tau_1^2 \to \infty$ as $n$ goes to $\infty$, where the specific rates of convergence depend on $n$ and $p$. We refer the reader to Narisetty and He (2014) for more specific details about the requirements on these prior parameters. With these prior conditions, Narisetty and He (2014) provided two insightful results about the posterior distribution on the model space. The first is that as the sample size $n$ goes to $\infty$, even if the number of variables $p$ is nearly exponentially large in $n$, the posterior probability of the true model goes to one under mild conditions, that is, $P[Z = t \mid Y] \xrightarrow{P} 1$. As will be discussed in Section 9, this is a much stronger result than the usual Bayes factor consistency commonly considered. The second insight is that the posterior on the model space induces a regularization similar to the $L_0$ penalty, so that it acts as an information criterion asymptotically. For a more detailed discussion, we refer to Narisetty and He (2014).
5.3 Spike and slab LASSO
Ročková (2018) and Ročková and George (2018) proposed and studied the spike and slab Laplace prior, which places a two-component mixture of Laplace priors on the regression parameters (Fig. 5). More specifically, the spike and slab LASSO model is given by:
$$\begin{aligned} Y \mid (X, \beta, \sigma^2) &\sim N(X\beta, \sigma^2 I), \\ \beta_j \mid Z_j = 0 &\sim LP(\lambda_0), \\ \beta_j \mid Z_j = 1 &\sim LP(\lambda_1), \\ P(Z_j = 1) = 1 - P(Z_j = 0) &= q, \end{aligned} \tag{10}$$
where $LP(\lambda)$ is the Laplace distribution with pdf given by $\psi(\beta \mid \lambda) = \frac{\lambda}{2} \exp\{-\lambda |\beta|\}$, and $\lambda_0 \gg \lambda_1$.
[Figure 5: six panels, each with horizontal axis from -2 to 2. Top row: Ridge, LASSO, SCAD. Bottom row: Horseshoe, Gaussian spike and slab, Spike and slab Lasso.]
FIG. 5 Penalty functions corresponding to penalization methods on the top panel and penalty functions induced by different Bayesian methods on the bottom panel.
Unlike the previous literature, which focused on using the posterior distribution of the $Z$'s for variable selection, Ročková and George (2014) considered obtaining a point estimator for $\beta$, namely the maximum a posteriori (MAP) estimator corresponding to the posterior $\beta \mid (Y, X)$. The authors proposed a novel EM algorithm for obtaining this MAP estimator. More details about the EM algorithm for computation in Bayesian variable selection are deferred to Section 7.3. Recently, Gan et al. (2018) used the spike and slab LASSO priors for estimation and sparsity recovery in graphical models and observed that optimal theoretical properties can be obtained due to the adaptive nature of the shrinkage. For a discussion of the shrinkage and regularization implicitly induced by different Bayesian methods, including the spike and slab LASSO, see Section 6.4.
6 Continuous shrinkage priors
We now discuss the priors on the regression coefficients which provide shrinkage directly without introducing the binary latent variables as in the spike and slab approaches. Let us again consider the linear regression model $Y_{n \times 1} = X_{n \times p}\,\beta_{p \times 1} + E_{n \times 1}$. With the Gaussian likelihood, a conjugate prior distribution on the regression vector $\beta$ is again a Gaussian distribution. That is, $\beta \mid \tau \sim N(0, \tau^2 I)$, where the parameter $\tau$ can either be treated as fixed or random. With a fixed $\tau$, the posterior mean (and mode) corresponding to this Gaussian likelihood and Gaussian prior is the ridge regression estimator. More generally, there is a correspondence between a specific prior distribution and a regularization imposed by the prior distribution at the MAP estimator. To see this, note that finding the MAP estimator that maximizes the posterior distribution is equivalent to minimizing $-\log \pi(\beta \mid Y)$, and is given by
$$\hat{\beta}_{\mathrm{MAP}} = \arg\min_{\beta} \{-\log \pi(\beta \mid Y)\} = \arg\min_{\beta} \Big\{ -\log f(\beta; Y) + \underbrace{(-\log \pi(\beta))}_{\text{Bayesian-induced penalty}} \Big\}. \tag{11}$$
Independent Laplace priors on the components of $\beta$ lead to a posterior whose mode is the LASSO estimator, which motivates the Bayesian LASSO (Park and Casella, 2008).
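The correspondence between a Gaussian prior and ridge regression can be verified directly. The sketch below is an illustration on simulated data, assuming a known noise variance $\sigma^2$ and a fixed prior variance $\tau^2$ (both values are arbitrary choices for the example). It computes the ridge estimator with penalty $\sigma^2 / \tau^2$ and checks that the gradient of the negative log posterior vanishes there.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
sigma2, tau2 = 1.0, 4.0  # illustrative noise variance and prior variance
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
Y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

def neg_log_posterior(beta):
    """-log f(beta; Y) - log pi(beta), up to constants not depending on beta."""
    loss = np.sum((Y - X @ beta) ** 2) / (2.0 * sigma2)
    penalty = np.sum(beta ** 2) / (2.0 * tau2)  # Bayesian-induced penalty
    return loss + penalty

# Ridge estimator with penalty parameter sigma2 / tau2.
ridge_lambda = sigma2 / tau2
beta_ridge = np.linalg.solve(X.T @ X + ridge_lambda * np.eye(p), X.T @ Y)

# Gradient of the negative log posterior, evaluated at the ridge estimator.
grad = -X.T @ (Y - X @ beta_ridge) / sigma2 + beta_ridge / tau2
print(np.max(np.abs(grad)))
```

Because the objective is strictly convex, a vanishing gradient confirms that the ridge estimator is the posterior mode.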
6.1 Bayesian LASSO
Bayesian LASSO (Park and Casella, 2008) places independent Laplace prior distributions on the components of $\beta$:
$$\pi(\beta \mid \sigma, \lambda) \propto \prod_{j=1}^{p} \frac{\lambda}{2\sigma} \exp\left\{-\frac{\lambda |\beta_j|}{\sigma}\right\}.$$
The mode of the resultant posterior is the same as the LASSO estimator with tuning parameter $\lambda$. The posterior distribution in addition provides uncertainty quantification for the regression parameters in the form of credible intervals. In classical low-dimensional settings, credible intervals obtained from a Bayesian posterior not only have a subjective Bayesian interpretation but also have valid frequentist properties in an asymptotic sense (van der Vaart, 1998). This is not the case in high dimensions: credible intervals do not have an obvious interpretation in either sense, since the prior is more aptly viewed as a shrinkage inducing tool rather than a belief inducing mechanism, and frequentist properties of the credible intervals from Bayesian LASSO are not established.
6.2 Horseshoe prior
The Laplace prior is a scale mixture of normal distributions. More specifically, if $\beta \mid \tau \sim N(0, \tau^2)$ and $\tau^2 \sim \mathrm{Exp}(\lambda^2/2)$, then the marginal distribution of $\beta$ is the double exponential distribution with parameter $\lambda$. In other words, when the variance of the normal distribution is exponentially distributed, the result is a Laplace distribution. The horseshoe prior (Carvalho et al., 2009a, b) instead uses an even heavier-tailed distribution, namely a half-Cauchy distribution, for the scale of the normal distribution:
$$\beta \mid (\lambda, \tau) \sim N(0, \lambda^2 \tau^2); \qquad \lambda \sim C^{+}(0, 1),$$
[Figure 6: three panels titled Gaussian, Laplace, and Horseshoe-like; in each panel the horizontal axis runs from -5.0 to 5.0.]
FIG. 6 Prior densities corresponding to some continuous shrinkage priors.
where $C^{+}(0, 1)$ denotes the half-Cauchy distribution on the positive real line. Fig. 6 shows the density functions of the Gaussian, Laplace, and horseshoe priors. The horseshoe prior is much more concentrated around zero than the Gaussian and Laplace priors, inducing stronger shrinkage. Due to this, there has been a lot of interest in the use of horseshoe priors recently. We refer to Datta and Ghosh (2013) for an exposition on the Bayes risk properties of the horseshoe estimator and to Bhadra et al. (2017, 2019) for a comprehensive review of the horseshoe shrinkage approach and comparative studies with LASSO.
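The scale-mixture representation of the Laplace prior stated at the start of this section can be checked numerically. The sketch below is an illustration, not part of the original development: it integrates the $N(0, \tau^2)$ density against an exponential density with rate $\lambda^2/2$ for $\tau^2$ using a simple trapezoidal rule, and compares the result with the Laplace density $\frac{\lambda}{2} e^{-\lambda|\beta|}$. The grid limits are assumptions chosen for the tested values of $\beta$ and $\lambda$; the integrand is singular at $\beta = 0$, so only nonzero $\beta$ is considered.

```python
import numpy as np

def laplace_density(beta, lam):
    """Laplace (double exponential) density with parameter lam."""
    return 0.5 * lam * np.exp(-lam * np.abs(beta))

def mixture_density(beta, lam, upper=80.0, num=400_001):
    """Marginal density of beta when beta | v ~ N(0, v) and v ~ Exp(rate lam^2 / 2).

    Uses a trapezoidal rule on a fine grid over the variance v; intended for
    nonzero beta and moderate lam, where the integrand is smooth on the grid.
    """
    v = np.linspace(1e-8, upper, num)
    normal_part = np.exp(-beta ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)
    mixing_part = 0.5 * lam ** 2 * np.exp(-0.5 * lam ** 2 * v)
    f = normal_part * mixing_part
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(v)))

for beta, lam in [(1.0, 1.0), (2.0, 1.5)]:
    print(beta, lam, mixture_density(beta, lam), laplace_density(beta, lam))
```

The two columns agree to several decimal places, which is the numerical counterpart of the identity used above.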
6.3 Global-local shrinkage priors
Polson and Scott (2010) noted that all continuous shrinkage priors can be written as global-local mixtures of normal priors in the following sense:
$$\beta_j \mid (\psi_j, \tau) \sim N(0, \psi_j \tau), \qquad \psi_j \sim f, \qquad \tau \sim g,$$
where $\tau$ represents the global shrinkage of $\beta$ toward zero and the $\psi_j$ allow for differential shrinkage of each of the $\beta_j$'s. Motivated by this, Bhattacharya et al. (2015) proposed the Dirichlet–Laplace prior framework (in the special case of the normal means model), which is a specific global-local shrinkage prior and is given by:
$$\beta_j \mid (\psi_j, \tau) \sim DE(\psi_j \tau), \qquad \psi \sim \mathrm{Dir}(a, \ldots, a), \qquad \tau \sim \mathrm{gamma}(na, 1/2),$$
where DE is the double exponential distribution, Dir is the Dirichlet distribution, and $a$ is a hyperparameter. They provide an augmented Gibbs sampling algorithm which can be used for posterior computation. It is worth noting that there are many other continuous shrinkage priors considered in the literature and it is not possible to discuss them all here. The double Pareto shrinkage prior of Armagan et al. (2013) is one such example that is worth exploring for a reader interested in further reading. Finally, we note that continuous shrinkage priors such as the horseshoe or the Dirichlet–Laplace priors do not directly provide a way to select the variables
since the posterior mode or mean corresponding to these methods need not be sparse. However, the estimator obtained from these methods can be thresholded to select variables. The choice of the threshold parameter would require some tuning. We discuss some strategies for selecting tuning parameters in the next section.
6.4 Regularization of Bayesian priors
As discussed previously, the prior distribution in the Bayesian framework implicitly induces shrinkage and regularization, as indicated by the definition of the MAP estimator in Eq. (11). The Bayesian penalty function induced by the horseshoe prior can be approximately written as:
$$-\log \pi_H(\beta) \approx -\sum_{j=1}^{p} \log \log\left(1 + \frac{2}{\beta_j^2}\right).$$
The penalty function induced by the spike and slab priors can be written as:
$$\mathrm{pen}(\beta_j) := -\log\left[(1 - \theta)\,\psi(\beta_j \mid \lambda_0) + \theta\,\psi(\beta_j \mid \lambda_1)\right], \quad j = 1, \ldots, p, \tag{12}$$
where $\psi(\cdot \mid \lambda_0)$ and $\psi(\cdot \mid \lambda_1)$ are the densities of the spike and slab priors, respectively. As argued by Ročková and George (2018), these spike and slab priors have a desirable adaptive regularization property similar to nonconvex penalty functions such as the SCAD, whose regularization decreases to zero as the magnitude of the coefficient increases. Fig. 5 provides a plot of the penalty functions corresponding to ridge regression, LASSO, and SCAD methods along with those corresponding to the Gaussian spike and slab prior, the spike and slab LASSO prior, and the horseshoe prior. It can be seen that the regularization from all the Bayesian approaches is quite nonconvex and is closer to the $L_0$ penalty than that of the LASSO.
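The adaptive nature of the spike and slab LASSO regularization can be illustrated by computing the penalty as the negative log of the two-component Laplace mixture, following the convention of Eq. (11), together with its slope in $|\beta_j|$. The sketch below uses illustrative values $\theta = 0.5$, $\lambda_0 = 10$, and $\lambda_1 = 0.5$, which are assumptions made for this example; the function names are hypothetical.

```python
import numpy as np

def laplace_pdf(beta, lam):
    """Laplace density psi(beta | lam) = (lam / 2) exp(-lam |beta|)."""
    return 0.5 * lam * np.exp(-lam * np.abs(beta))

def ssl_penalty(beta, theta=0.5, lam0=10.0, lam1=0.5):
    """Negative log of the spike and slab Laplace mixture density."""
    mix = (1.0 - theta) * laplace_pdf(beta, lam0) + theta * laplace_pdf(beta, lam1)
    return -np.log(mix)

def ssl_penalty_slope(beta, theta=0.5, lam0=10.0, lam1=0.5):
    """Derivative of the penalty with respect to |beta| (for beta != 0).

    It is a weighted average of lam0 and lam1, with the weight on lam1 equal to
    the conditional probability that beta came from the slab component.
    """
    spike = (1.0 - theta) * laplace_pdf(beta, lam0)
    slab = theta * laplace_pdf(beta, lam1)
    slab_weight = slab / (spike + slab)
    return slab_weight * lam1 + (1.0 - slab_weight) * lam0

# The LASSO penalty lam * |beta| has constant slope lam for every beta != 0,
# whereas the mixture penalty has a slope that decays from near lam0 to lam1.
for b in (0.01, 0.1, 0.5, 1.0, 5.0):
    print(b, ssl_penalty(b), ssl_penalty_slope(b))
```

Small coefficients are penalized at a rate close to $\lambda_0$, while large coefficients are penalized at a rate close to the much smaller $\lambda_1$, which mirrors the behavior of nonconvex penalties.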
6.5 Prior elicitation—Hyperparameter selection In the traditional subjective Bayesian context, one builds the prior distribution based on the belief or prior knowledge available. However, it is hard to be subjective in the high-dimensional context due to the vastness of the parameter space involved. Therefore, an objective Bayesian or an empirical Bayes stand is often taken under which the hyperparameters of the priors need to be selected based on some prespecified criterion or using the data themselves. We discuss a few such strategies for hyperparameter selection.
6.5.1 Empirical Bayes
Consider the following linear regression model with a generic prior indexed by a hyperparameter $\alpha$:
$$Y \mid (X, \beta) \sim N(X\beta, \sigma^2 I), \qquad \beta \sim \pi_\alpha,$$
where $\alpha$ is the hyperparameter that needs to be selected. The basic idea of the empirical Bayes strategy is to treat the parameter of interest as latent and integrate it out to obtain a marginal likelihood for the hyperparameter $\alpha$. Such a marginal likelihood can be used to obtain an estimator for $\alpha$ based on the observed data. The marginal likelihood is
$$ML(\alpha \mid Y) \propto \int L(Y \mid \theta)\, \pi_\alpha(\theta)\, d\theta. \tag{13}$$
Using the marginal likelihood for $\alpha$, the empirical Bayes strategy is to find a point estimator. While any estimating method can be used in principle, a common approach is to maximize the marginal likelihood to obtain
$$\hat{\alpha} = \arg\max_{\alpha} ML(\alpha \mid Y).$$
In some cases, especially with conjugate priors such as the Gaussian likelihood and the Gaussian prior case, closed form expressions are available to compute the above marginal likelihood as a function of α. However, it is not always possible to obtain the marginal likelihood in closed form. In such cases, Laplace approximation (discussed in Section 7) or MCMC-based strategies are also often used to approximate the integral in Eq. (13). For variable selection, George and Foster (2000) proposed a conditional empirical Bayes strategy in the context of g-priors and Yuan and Lin (2005) proposed an empirical Bayes method to select the prior hyperparameters in the context of point mass spike and Laplace slab priors.
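As an illustration of the conjugate case, suppose (as an assumption for this sketch) that the prior is $\beta \sim N(0, \alpha I)$ and that $\sigma^2$ is known. Integrating out $\beta$ gives $Y \sim N(0, \sigma^2 I + \alpha X X^\top)$, so the marginal likelihood in Eq. (13) is available in closed form and can be maximized over a grid of candidate values. The data, grid, and function name below are hypothetical choices for the example.

```python
import numpy as np

def log_marginal_likelihood(alpha, X, Y, sigma2=1.0):
    """log ML(alpha | Y) when beta ~ N(0, alpha I) and Y | beta ~ N(X beta, sigma2 I)."""
    n = X.shape[0]
    cov = sigma2 * np.eye(n) + alpha * (X @ X.T)  # marginal covariance of Y
    _, logdet = np.linalg.slogdet(cov)
    quad = Y @ np.linalg.solve(cov, Y)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

# Simulated data with prior variance alpha_true (illustrative values).
rng = np.random.default_rng(1)
n, p, alpha_true = 200, 20, 2.0
X = rng.normal(size=(n, p))
beta = rng.normal(scale=np.sqrt(alpha_true), size=p)
Y = X @ beta + rng.normal(size=n)

# Empirical Bayes estimate: maximize the marginal likelihood over a grid.
grid = np.linspace(0.05, 6.0, 120)
log_ml = np.array([log_marginal_likelihood(a, X, Y) for a in grid])
alpha_hat = grid[np.argmax(log_ml)]
print(alpha_hat)
```

A grid search is used here only for transparency; any one-dimensional optimizer could be substituted.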
6.5.2 Criterion-based tuning
The hyperparameters can also be selected based on some criterion function such as AIC or BIC, discussed in Section 2. The idea here is that different values of the hyperparameter $\alpha$ yield different models, among which the best one is selected based on a criterion function. More specifically, let $M(\alpha)$ denote the selected model corresponding to the hyperparameter value $\alpha$. Then, BIC can be used to choose an optimal value for $\alpha$ as follows:
$$\hat{\alpha} := \arg\min_{\alpha} \mathrm{BIC}(M(\alpha)),$$
and $M(\hat{\alpha})$ will be the final model selected. Such a strategy is commonly used in the literature (Narisetty and He, 2014; Narisetty et al., 2019). Another alternative is to use the cross-validated prediction error as the criterion to minimize in place of AIC or BIC.
7 Computation
Efficient computation is a crucial component of any statistical procedure in the Big Data era. There are a variety of computational approaches for Bayesian variable selection proposed in the literature. The objectives of some of
these computational approaches are quite different. For instance, the EM approaches attempt to obtain the maximum-a-posteriori estimator for β and do not necessarily attempt model selection directly. On the other hand, there are stochastic search approaches which operate only on the model space and do not explicitly consider parameter estimation. In the following, we will discuss some of the major computational approaches for Bayesian high-dimensional regression.
7.1 Direct exploration of the model space
7.1.1 Shotgun stochastic search
Hans et al. (2007) proposed this method, which aims to search the space of models to obtain models having high posterior probabilities. The algorithm is similar to stepwise selection algorithms in the sense that at each step it considers a neighborhood of models and selects the model maximizing the posterior probability. Along this path, since many models having high posterior probabilities are visited, a set of models with high posterior probabilities is collected. In particular, a simplified version of the SSS algorithm starts with an initial model $\gamma_0$ and follows the steps below at iteration $t$ to update the model from $\gamma_{t-1}$ to $\gamma_t$ for $t = 1, \ldots, T$.
• Compute $S(\gamma)$, the criterion function such as the posterior probability of the model $\gamma$, for all models in the neighborhood of $\gamma_{t-1}$. One way to define a neighborhood of a model is to consider all the models which either have one additional covariate, one covariate less, or one covariate different from the model $\gamma_{t-1}$. More specifically, these neighborhoods can be defined as:
  – Addition neighborhood: all the models that add one covariate to the current model $\gamma_{t-1}$ are considered as neighbors. For instance, if the current model has covariates $\{1, 2\}$, the models $\{1, 2, 3\}, \{1, 2, 4\}, \ldots, \{1, 2, p\}$ will be in the neighborhood.
  – Deletion neighborhood: all the models that remove one covariate from the current model $\gamma_{t-1}$ are considered as neighbors. If the current model has covariates $\{1, 2\}$, the models $\{1\}$ and $\{2\}$ will be considered as neighbors.
  – Swap neighborhood: all the models that have one covariate different from the current model $\gamma_{t-1}$ are considered as neighbors. If the current model has covariates $\{1, 2\}$, the models $\{1, 3\}, \{1, 4\}, \ldots, \{1, p\}, \{2, 3\}, \{2, 4\}, \ldots, \{2, p\}$ are neighbors.
• From the neighbors of $\gamma_{t-1}$, sample a model with probability proportional to the criterion function $S(\gamma)$ by normalizing the total probability within the neighboring set. Set the resultant model as $\gamma_t$.
• After the prespecified number of iterations $T$, choose the model which maximizes the criterion function among all the models visited.
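The steps above can be sketched in a few lines of code. This is a minimal illustration of the simplified search, not the implementation of Hans et al. (2007): the criterion $S(\gamma)$ is replaced by a hypothetical stand-in, $\exp(-\mathrm{BIC}/2)$ of the least-squares fit, and the function names, simulated data, and number of iterations are assumptions made for the example. Covariates are indexed from 0.

```python
import numpy as np

def log_score(model, X, Y):
    """Hypothetical stand-in for log S(gamma): -BIC/2 of the least-squares fit."""
    n = len(Y)
    idx = sorted(model)
    if idx:
        coef, *_ = np.linalg.lstsq(X[:, idx], Y, rcond=None)
        rss = np.sum((Y - X[:, idx] @ coef) ** 2)
    else:
        rss = np.sum(Y ** 2)
    return -0.5 * (n * np.log(rss / n) + len(idx) * np.log(n))

def neighbors(model, p):
    """Addition, deletion, and swap neighborhoods of a model (a frozenset of indices)."""
    inside = set(model)
    outside = set(range(p)) - inside
    add = [frozenset(inside | {j}) for j in outside]
    delete = [frozenset(inside - {j}) for j in inside]
    swap = [frozenset((inside - {j}) | {k}) for j in inside for k in outside]
    return add + delete + swap

def shotgun_search(X, Y, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    current = frozenset()  # initial model gamma_0: no covariates
    visited = {current: log_score(current, X, Y)}
    for _ in range(n_iter):
        nbrs = neighbors(current, p)
        scores = np.array([log_score(m, X, Y) for m in nbrs])
        visited.update(zip(nbrs, scores))
        # Sample the next model with probability proportional to S(gamma).
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        current = nbrs[rng.choice(len(nbrs), p=probs)]
    # Return the best model among all models visited.
    return max(visited, key=visited.get)

rng = np.random.default_rng(3)
n, p = 100, 8
X = rng.normal(size=(n, p))
Y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(size=n)
best = shotgun_search(X, Y)
print(sorted(best))
```

Scores are handled on the log scale and normalized within the neighborhood, which avoids numerical overflow when the criterion values differ by many orders of magnitude.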
This algorithm is a special case of Metropolis–Hastings random walk algorithms. These algorithms can be viewed as stochastic versions of stepwise algorithms, which have been commonly used for variable selection. The performance of these methods could potentially depend on the initial model $\gamma_0$ and the neighborhood considered. Shin et al. (2018) generalized the shotgun stochastic search approach, which they use for computation in their nonlocal prior setting. Liang et al. (2007) and Liang (2009) proposed another class of model space exploration approaches called stochastic approximation Monte Carlo (SAMC). SAMC operates by first partitioning the model space into disjoint subsets and by enforcing sampling from each of these subsets to avoid the local trap issues often encountered by stochastic search algorithms. SAMC algorithms rely on the selection of appropriate subsets of the model space and on estimating the posterior probabilities of the selected subsets. The details of these methods are quite involved and are beyond the scope of the current article; we refer the reader to Liang (2009) and Liang et al. (2013).
7.2 Gibbs sampling
Gibbs sampling algorithms are quite commonly used for computation of Bayesian posterior distributions. In the context of variable selection, Gibbs sampling algorithms involving standard distributions can be used for computation when continuous spike and slab priors are used. For example, George and McCulloch (1993, 1997), Ishwaran and Rao (2005), and Narisetty and He (2014) used easy-to-sample-from Gibbs samplers based on Gaussian spike and slab priors. With the model (9), a standard Gibbs sampling algorithm would take the following form:
• The conditional distribution of $\beta$ is given by $\beta \mid (Z, \sigma^2, Y, X) \sim N(V X^\top Y, \sigma^2 V)$, where $V = (X^\top X + D_z)^{-1}$ and $D_z = \mathrm{Diag}(Z \tau_1^{-2} + (1 - Z) \tau_0^{-2})$.
• The conditional probability for $Z_j$ is
$$P(Z_j = 1 \mid \beta, \sigma^2, Y, X) = \frac{q\,\phi(\beta_j, 0, \sigma^2 \tau_1^2)}{q\,\phi(\beta_j, 0, \sigma^2 \tau_1^2) + (1 - q)\,\phi(\beta_j, 0, \sigma^2 \tau_0^2)}.$$
• The conditional of $\sigma^2$ is the Inverse Gamma distribution $IG(a, b)$ with $a = \alpha_1 + n/2 + p/2$ and $b = \alpha_2 + \beta^\top D_z \beta / 2 + (Y - X\beta)^\top (Y - X\beta)/2$.
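The three conditional updates above can be assembled into a short sampler. The following is a minimal sketch under model (9), written for small $p$ and intended only as an illustration: the hyperparameter values ($\tau_0^2 = 0.01$, $\tau_1^2 = 10$, $q = 0.2$, $\alpha_1 = \alpha_2 = 1$), the simulated data, the initialization, and the function name are assumptions made for the example, and no burn-in period is discarded. The inclusion probability is computed through its log odds for numerical stability, which is algebraically equivalent to the ratio of normal densities given above.

```python
import numpy as np

def gibbs_spike_slab(X, Y, tau0_sq, tau1_sq, q, alpha1=1.0, alpha2=1.0,
                     n_iter=300, seed=0):
    """Gibbs sampler for the Gaussian spike and slab model; returns estimates of P(Z_j = 1 | Y)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, XtY = X.T @ X, X.T @ Y
    Z = np.zeros(p, dtype=int)  # start from the empty model
    sigma2 = 1.0
    z_sum = np.zeros(p)
    for _ in range(n_iter):
        # beta | (Z, sigma2, Y, X) ~ N(V X'Y, sigma2 V), V = (X'X + D_z)^{-1}
        d = np.where(Z == 1, 1.0 / tau1_sq, 1.0 / tau0_sq)
        V = np.linalg.inv(XtX + np.diag(d))
        V = 0.5 * (V + V.T)  # enforce exact symmetry before sampling
        beta = rng.multivariate_normal(V @ XtY, sigma2 * V)

        # Z_j | (beta, sigma2): log odds of slab versus spike normal densities
        log_odds = (np.log(q / (1.0 - q)) + 0.5 * np.log(tau0_sq / tau1_sq)
                    + beta ** 2 / (2.0 * sigma2) * (1.0 / tau0_sq - 1.0 / tau1_sq))
        prob = 1.0 / (1.0 + np.exp(-np.clip(log_odds, -500.0, 500.0)))
        Z = (rng.random(p) < prob).astype(int)

        # sigma2 | (beta, Z, Y, X) ~ IG(a, b), drawn as 1 / Gamma(shape=a, rate=b)
        d = np.where(Z == 1, 1.0 / tau1_sq, 1.0 / tau0_sq)
        resid = Y - X @ beta
        a = alpha1 + n / 2.0 + p / 2.0
        b = alpha2 + beta @ (d * beta) / 2.0 + resid @ resid / 2.0
        sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b)

        z_sum += Z
    return z_sum / n_iter

rng = np.random.default_rng(4)
n, p = 100, 6
X = rng.normal(size=(n, p))
Y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=n)
incl = gibbs_spike_slab(X, Y, tau0_sq=0.01, tau1_sq=10.0, q=0.2)
print(np.round(incl, 2))
```

The explicit $p \times p$ inverse in the $\beta$ update is the step that becomes expensive when $p$ is large.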
These are standard distributions that can be easily sampled. However, when the dimension $p$ of the design matrix is large, the real challenge is that sampling from a $p$-variate normal distribution for $\beta$ is computationally intensive. Direct sampling would typically require on the order of $p^3$ operations, as it requires the computation of the $p \times p$ matrix $(X^\top X + D_z)^{-1/2}$, which, if computed using the eigenvalue decomposition of $(X^\top X + D_z)$, leads to computational complexity of order $p^3$. The Skinny Gibbs algorithm (Narisetty et al., 2019) provides a simple and very effective modification of the Gibbs sampler to avoid the high computational
complexity in the case of large $p$. The idea is to split $\beta$ into two parts in each Gibbs iteration, corresponding to the "active" (with the current $Z_j = 1$) and "inactive" (with the current $Z_j = 0$) subvectors. The active part has a low dimension and is sampled from the multivariate normal distribution. The inactive part has a high dimension, but we simply sample it from a normal distribution with independent marginals. More specifically, the Skinny Gibbs sampler proceeds as follows, after an initialization.
Step 1 (for sampling $\beta$). Define the index sets $A$ and $I$ as the active (corresponding to $Z_j = 1$) and the inactive (corresponding to $Z_j = 0$) sets, and decompose $\beta = (\beta_A, \beta_I)$ so that $\beta_A$ and $\beta_I$ contain the components of $\beta$ corresponding to $Z_j = 1$ and $Z_j = 0$, respectively. Similarly rearrange the design matrix $X = [X_A, X_I]$. Then, the vector $\beta$ is sampled as:
$$\beta_A \mid (Y, Z) \sim N(m_A, \sigma^2 V_A^{-1}), \qquad \beta_I \mid (Y, Z) \sim N(0, \sigma^2 V_I^{-1}), \tag{14}$$
where $V_A = X_A' X_A + \tau_1^{-2} I$, $m_A = V_A^{-1} X_A' Y$, and $V_I = \mathrm{Diag}(X_I' X_I) + \tau_0^{-2} I = (n + \tau_0^{-2}) I$; the last equality corresponds to columns of $X$ standardized so that $X_j' X_j = n$.
Step 2 (for sampling $Z$). Generate $Z_j$ ($j = 1, \ldots, p$) conditioned on the remaining components of $Z$ using the following conditional odds:
$$\frac{P[Z_j = 1 \mid Z_{-j}, \beta, Y]}{P[Z_j = 0 \mid Z_{-j}, \beta, Y]} = \frac{q\,\phi(\beta_j, 0, \tau_{1,n}^2)}{(1 - q)\,\phi(\beta_j, 0, \tau_{0,n}^2)} \exp\left\{\beta_j X_j' (Y - X_{C_j} \beta_{C_j})\right\}, \tag{15}$$
where $Z_{-j}$ is the $Z$ vector without the $j$th component, and $C_j$ is the index set corresponding to the active components of $Z_{-j}$, i.e., $C_j = \{k : k \ne j, Z_k = 1\}$.
Step 3 (for sampling $\sigma^2$). The conditional distribution of $\sigma^2$ given $\beta$ and $Z$ is the Inverse Gamma distribution $IG(a, b)$ with $a = \alpha_1 + n/2 + p/2$ and $b = \alpha_2 + \beta^\top D_z \beta / 2 + (Y - X_A \beta_A)^\top (Y - X_A \beta_A)/2 + n \beta_I^\top \beta_I / 2$, where the index sets $A$ and $I$ are the active and the inactive sets as defined in Step 1.
The main idea is that in Step 1, the update of $\beta$ is modified such that the coefficients corresponding to $Z_j = 1$ (denoted by $\beta_A$) and those corresponding to $Z_j = 0$ (denoted by $\beta_I$) are sampled independently, and the components of $\beta_I$ are updated independently so that large matrix computations are avoided. That is, Skinny Gibbs modifies the precision matrix $V_z$ as
$$V_z = \begin{pmatrix} X_A' X_A + \tau_1^{-2} I & X_A' X_I \\ X_I' X_A & X_I' X_I + \tau_0^{-2} I \end{pmatrix} \;\longrightarrow\; \begin{pmatrix} X_A' X_A + \tau_1^{-2} I & 0 \\ 0 & \mathrm{Diag}(X_I' X_I) + \tau_0^{-2} I \end{pmatrix}.$$
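Step 1 of the Skinny Gibbs sampler can be sketched as follows. This is an illustrative sketch of the $\beta$ update in Eq. (14) only, not the authors' implementation; the function name, arguments, and example data are assumptions made for the illustration. The active block is drawn through a Cholesky factor of $V_A$, and the inactive block is drawn coordinate-wise, so that no $p \times p$ matrix is ever formed.

```python
import numpy as np

def skinny_beta_update(X, Y, Z, sigma2, tau0_sq, tau1_sq, rng):
    """Draw beta = (beta_A, beta_I) as in Step 1 of Skinny Gibbs."""
    n, p = X.shape
    active = np.flatnonzero(Z == 1)
    inactive = np.flatnonzero(Z == 0)
    beta = np.zeros(p)

    if active.size > 0:
        XA = X[:, active]
        VA = XA.T @ XA + np.eye(active.size) / tau1_sq
        mA = np.linalg.solve(VA, XA.T @ Y)
        # If VA = L L', then u = L'^{-1} e with e ~ N(0, I) has covariance VA^{-1}.
        L = np.linalg.cholesky(VA)
        u = np.linalg.solve(L.T, rng.standard_normal(active.size))
        beta[active] = mA + np.sqrt(sigma2) * u

    # Inactive coordinates: independent normals with precision Diag(X_I'X_I) + 1/tau0^2.
    VI_diag = np.sum(X[:, inactive] ** 2, axis=0) + 1.0 / tau0_sq
    beta[inactive] = rng.standard_normal(inactive.size) * np.sqrt(sigma2 / VI_diag)
    return beta

rng = np.random.default_rng(5)
n, p = 100, 500
X = rng.normal(size=(n, p))
Y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=n)
Z = np.zeros(p, dtype=int)
Z[:2] = 1  # current active set A = {0, 1}
beta_draw = skinny_beta_update(X, Y, Z, sigma2=1.0, tau0_sq=0.01, tau1_sq=10.0, rng=rng)
print(beta_draw[:4])
```

The cost of this update is driven by the size of the active set rather than by $p$, which is the source of the computational savings described above.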
As can be seen, the precision matrix is heavily modified in Step 1, which can alter the original Gibbs sampler. To compensate for the loss of correlation structure due to this modification, Step 2 of the Skinny Gibbs algorithm is designed to take this lost dependence structure into account. In spite of this modification, Narisetty et al. (2019) showed that the Skinny Gibbs algorithm retains the desired statistical properties, such as the strong model selection consistency property, which will be discussed in more detail in Section 8. It is worth noting that the technique developed in the Skinny Gibbs algorithm is very general and can be applied in many modeling settings where the likelihood or priors involved can be written as mixtures of normal distributions. For instance, Narisetty et al. (2019) studied a Skinny Gibbs algorithm applied to logistic regression. The technique also applies to priors beyond normal priors that can be written as scale mixtures of normal priors, such as Laplace priors. Bhattacharya et al. (2016) considered an alternative approach to scaling up Gibbs sampling algorithms that involve large multivariate normal distributions. Their approach is to utilize properties of matrices to sample from the original high-dimensional normal distribution. While this algorithm has linear order complexity in terms of $p$, it has quadratic complexity in terms of the sample size $n$, whereas Skinny Gibbs has linear order complexity in $n$. Moreover, Skinny Gibbs can be scaled up further by utilizing their matrix identities for sampling $\beta_A$ when $|A|$ is large.
7.3 EM algorithm
The EM algorithm is a popular technique to compute maximum likelihood estimators and maximum a posteriori (MAP) estimators (Dempster et al., 1977). Even in high dimensions, Ročková and George (2014) found the EM algorithm to be effective for obtaining the MAP estimator corresponding to the spike and slab Gaussian prior specification given by (9), and Ročková and George (2018) generalized it to the spike and slab Lasso prior specification given by (10). The algorithm treats $Z$ as latent and implements an EM algorithm. Let us consider the model (9) with Gaussian spike and slab priors for illustration, where the variances of the spike and slab priors are $\tau_0^2$ and $\tau_1^2$, respectively. The maximization problem involves the objective function
$$Q(Z, \beta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 - \frac{1}{2\sigma^2} \sum_{j=1}^{p} \frac{\beta_j^2}{\tau_1^2 Z_j + \tau_0^2 (1 - Z_j)}.$$
E-Step: At the E-step, the conditional expectation of the $Q$ function with respect to the $Z$ variables given all the other parameters is obtained. That is, we need to find the conditional expectation
$$E\left[\frac{1}{\tau_1^2 Z_j + \tau_0^2 (1 - Z_j)}\right] = \frac{p_j^{*}}{\tau_1^2} + \frac{1 - p_j^{*}}{\tau_0^2} := d_j,$$
where $p_j^{*} = E(Z_j \mid \beta) = \frac{a}{a + b}$, $a = \pi(\beta_j \mid Z_j = 1)\,\pi(Z_j = 1)$, and $b = \pi(\beta_j \mid Z_j = 0)\,\pi(Z_j = 0)$.
M-Step: After the E-step, the conditional expectation of $Q(Z, \beta)$ is essentially a weighted ridge regression penalized objective function with penalty weights given by the $d_j$'s. Therefore, the maximization at the M-step has a closed form, with the solution being
$$\hat{\beta}^{(k+1)} = (X^\top X + D)^{-1} X^\top Y,$$
where $D$ is a $p \times p$ diagonal matrix with $d_j$ as its diagonal elements. This yields a simple iterative algorithm for obtaining the MAP estimator corresponding to the spike and slab Gaussian prior specification. However, one drawback of this approach is that the posterior probabilities of the $Z$ variables, which are important quantities for performing variable selection, are not obtained as part of the EM algorithm. As a proxy for these posterior probabilities, the conditional probabilities $P[Z_j = 1 \mid \hat{\beta}]$, where $\hat{\beta}$ is the MAP estimator of $\beta$, are used for variable selection. Ročková (2018), Ročková and George (2018), and Gan et al. (2018), among others, used the EM algorithm for MAP estimation and the conditional probability strategy for variable selection.
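The E-step and M-step described above can be combined into a short iterative procedure. The sketch below is a simplified illustration that holds $\sigma^2$ fixed and known, uses a ridge estimate as an arbitrary starting value, and runs a fixed number of iterations; these choices, the hyperparameter values, and the function names are assumptions made for the example rather than part of the published algorithms.

```python
import numpy as np

def inclusion_prob(beta, tau0_sq, tau1_sq, q, sigma2):
    """p_j* = a / (a + b), computed through the log odds of slab versus spike."""
    log_odds = (np.log(q / (1.0 - q)) + 0.5 * np.log(tau0_sq / tau1_sq)
                + beta ** 2 / (2.0 * sigma2) * (1.0 / tau0_sq - 1.0 / tau1_sq))
    return 1.0 / (1.0 + np.exp(-np.clip(log_odds, -500.0, 500.0)))

def em_spike_slab(X, Y, tau0_sq, tau1_sq, q, sigma2=1.0, n_iter=50):
    p = X.shape[1]
    XtX, XtY = X.T @ X, X.T @ Y
    beta = np.linalg.solve(XtX + np.eye(p), XtY)  # ridge starting value (arbitrary)
    for _ in range(n_iter):
        # E-step: expected inverse prior variance d_j for each coefficient.
        p_star = inclusion_prob(beta, tau0_sq, tau1_sq, q, sigma2)
        d = p_star / tau1_sq + (1.0 - p_star) / tau0_sq
        # M-step: weighted ridge regression with penalty weights d_j.
        beta = np.linalg.solve(XtX + np.diag(d), XtY)
    # Conditional probabilities at the final estimate, used as a selection proxy.
    return beta, inclusion_prob(beta, tau0_sq, tau1_sq, q, sigma2)

rng = np.random.default_rng(6)
n, p = 100, 6
X = rng.normal(size=(n, p))
Y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=n)
beta_map, p_star = em_spike_slab(X, Y, tau0_sq=0.01, tau1_sq=10.0, q=0.2)
print(np.round(beta_map, 2), np.round(p_star, 2))
```

Coefficients with a small conditional probability are shrunk heavily through a large $d_j$, while those with a probability near one receive only the mild slab penalty.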
7.4 Approximate algorithms
7.4.1 Laplace approximation
In the context of point mass spike priors as discussed in Section 5.1, a direct computation of the posterior probability of a model requires evaluation of the marginal likelihood $P[Y \mid Z = k]$, as indicated by the posterior probability expression in Eq. (5). This can be written as
$$P[Y \mid Z = k] = \int_{\mathbb{R}^{|k|}} P[Y \mid Z = k, \beta_k]\, \pi[\beta_k \mid Z = k]\, d\beta_k.$$
The Laplace approximation for integrals uses quadratic Taylor’s approximation for the integrand above (Raftery, 1996). This is similar to making a Gaussian approximation to the integrand above as a function of βk which corresponds to the conditional posterior distribution of βk | (Y, Z ¼ k). Therefore, performance of the Laplace approximation depends on how close this conditional posterior distribution is to a Gaussian distribution. The Laplace approximation is commonly used for posterior computation for Bayesian model selection. For instance, Yuan and Lin (2005) use the approximation for empirical Bayes variable selection using g-priors, Liang et al. (2007) use it in the mixture of g-priors context, and Johnson and Rossell (2012) use it for computation with their nonlocal priors.
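To see how the approximation works in a case where it is exact, consider (as an assumption for this sketch) a Gaussian slab $\beta_k \sim N(0, \sigma^2 \tau^2 I)$ with known $\sigma^2$. The log integrand is then quadratic in $\beta_k$, so the Laplace approximation $\log P[Y \mid Z = k] \approx h(\hat{\beta}_k) + \frac{|k|}{2} \log(2\pi) - \frac{1}{2} \log\det(-H)$, with $h$ the log integrand, $\hat{\beta}_k$ its maximizer, and $H$ its Hessian, coincides with the exact marginal likelihood. The simulated data and parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma2, tau2 = 60, 3, 1.0, 5.0  # d plays the role of |k|
Xk = rng.normal(size=(n, d))
Y = Xk @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

def log_integrand(beta):
    """h(beta) = log P[Y | Z = k, beta] + log pi[beta | Z = k]."""
    log_lik = (-0.5 * n * np.log(2.0 * np.pi * sigma2)
               - np.sum((Y - Xk @ beta) ** 2) / (2.0 * sigma2))
    log_prior = (-0.5 * d * np.log(2.0 * np.pi * sigma2 * tau2)
                 - beta @ beta / (2.0 * sigma2 * tau2))
    return log_lik + log_prior

# Mode of the integrand and negative Hessian of h at the mode.
neg_hessian = (Xk.T @ Xk + np.eye(d) / tau2) / sigma2
beta_hat = np.linalg.solve(Xk.T @ Xk + np.eye(d) / tau2, Xk.T @ Y)

# Laplace approximation to the log marginal likelihood.
_, logdet_neg_hessian = np.linalg.slogdet(neg_hessian)
log_ml_laplace = (log_integrand(beta_hat) + 0.5 * d * np.log(2.0 * np.pi)
                  - 0.5 * logdet_neg_hessian)

# Exact value: marginally Y ~ N(0, sigma2 (I + tau2 Xk Xk')).
cov = sigma2 * (np.eye(n) + tau2 * (Xk @ Xk.T))
_, logdet_cov = np.linalg.slogdet(cov)
log_ml_exact = -0.5 * (n * np.log(2.0 * np.pi) + logdet_cov + Y @ np.linalg.solve(cov, Y))

print(log_ml_laplace, log_ml_exact)
```

For non-Gaussian slabs the two quantities differ, and the size of the gap reflects how far the conditional posterior of $\beta_k$ is from a Gaussian distribution.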
7.4.2 Variational approximation
Variational approximation (Blei et al., 2017; Jordan et al., 1999) is a powerful and general strategy to approximate posterior distributions. The general idea of variational approximation is to use a computationally feasible class of distributions from which the closest one to the posterior distribution is to be found. More specifically, suppose that p is the posterior distribution to be computed. For example, p can be the posterior distribution of (β, Z) given data corresponding to Model (9). Consider a family of distributions Q that are easy to compute or sample from. From the set Q, the distribution $\hat q$ that is closest to p in terms of minimizing the Kullback–Leibler distance is found. That is,
$$\hat q = \arg\min_{q \in Q} \mathrm{KL}(p \,\|\, q),$$
where $\mathrm{KL}(p \,\|\, q)$ denotes the Kullback–Leibler distance between the distributions p and q. In particular, if the family Q is indexed by a parameter vector θ so that $Q := \{q_\theta : \theta \in \Omega\}$, then this corresponds to finding
$$\hat\theta = \arg\min_{\theta \in \Omega} \mathrm{KL}(p \,\|\, q_\theta).$$
The main challenge in this context is to find the family Q of distributions which is reasonably close to the posterior distribution π along with being computationally friendly. With point mass spike priors, Carbonetto and Stephens (2012) proposed the following Q family for variational approximation:
$$q(\beta, Z, \theta) = \prod_{k=1}^{p} q(\beta_k, Z_k, \theta_k),$$
where
$$q(\beta_k, Z_k, \theta_k) = \phi_k\, N(\beta_k \mid \mu_k, s_k^2)\, 1\{Z_k = 1\} + (1 - \phi_k)\, \delta_0(\beta_k)\, 1\{Z_k = 0\}. \qquad (16)$$
This is a family of component-wise product distributions implying independence of different components. Although the posterior does not belong to this family, the hope is that there is one distribution in this family which is close to the posterior and that its important summary statistics are somewhat close to those of the posterior. Carbonetto and Stephens (2012) proposed a coordinate descent algorithm for the KL minimization problem which yields easily interpretable updates as follows:
$$s_k^2 = \frac{\sigma^2}{(X^\top X)_{kk} + (\sigma^2 \tau_1^2)^{-1}},$$
$$\mu_k = \frac{s_k^2}{\sigma^2} \left( (X^\top y)_k - \sum_{j \neq k} (X^\top X)_{jk}\, \phi_j \mu_j \right),$$
$$\frac{P[Z_k = 1 \mid (Y, X, \beta)]}{P[Z_k = 0 \mid (Y, X, \beta)]} \approx \frac{\phi_k}{1 - \phi_k} = \frac{q}{1 - q}\, \frac{s_k}{\sigma \tau_1} \exp\left( \frac{\mu_k^2}{2 s_k^2} \right).$$
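The coordinate-wise updates can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `cavi_spike_slab` is hypothetical, the error variance `sigma2` and prior inclusion probability `q` are treated as known, and the slab is parametrized directly by its variance `slab_var` so that the code does not depend on a particular scaling convention for $\tau_1$.

```python
import numpy as np

def cavi_spike_slab(X, y, sigma2, slab_var, q, n_sweeps=50):
    """Coordinate-wise variational updates for a point mass spike / Gaussian slab prior.

    Returns phi (approximate inclusion probabilities), mu and s2 (slab-component
    mean and variance for each coefficient).
    """
    p = X.shape[1]
    xtx = X.T @ X
    xty = X.T @ y
    phi = np.full(p, q)
    mu = np.zeros(p)
    # s_k^2 depends only on the k-th column, so it is computed once.
    s2 = 1.0 / (np.diag(xtx) / sigma2 + 1.0 / slab_var)
    for _ in range(n_sweeps):
        for k in range(p):
            # sum over j != k of (X'X)_{jk} * phi_j * mu_j
            others = xtx[:, k] @ (phi * mu) - xtx[k, k] * phi[k] * mu[k]
            mu[k] = s2[k] / sigma2 * (xty[k] - others)
            log_odds = (np.log(q / (1.0 - q))
                        + 0.5 * np.log(s2[k] / slab_var)
                        + mu[k] ** 2 / (2.0 * s2[k]))
            # Clip to avoid overflow in exp for very strong signals.
            log_odds = np.clip(log_odds, -500.0, 500.0)
            phi[k] = 1.0 / (1.0 + np.exp(-log_odds))
    return phi, mu, s2

# Simulated data (hypothetical): only the first two coefficients are active.
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
y = X @ beta_true + rng.standard_normal(n)

phi, mu, s2 = cavi_spike_slab(X, y, sigma2=1.0, slab_var=100.0, q=0.05)
print(np.round(phi, 3))
```

Each sweep updates one coordinate at a time while holding the others fixed, which is what makes the updates cheap and easy to interpret.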
However, Huang et al. (2016) argued that the component-wise variational Bayes algorithm does not work well in high dimensions. Motivated by this, they proposed a new algorithm called the batch-wise variational Bayes algorithm where the parameters $(\mu_j)_{j=1}^p$, $(\alpha_j)_{j=1}^p$, and $(s_j)_{j=1}^p$ are updated simultaneously across different j instead of marginally updating for each j conditional on others. The updates for the parameters α and $s_j$ remain similar in the batch-wise algorithm, whereas the update for μ changes to
$$\mu = \left( \Phi X^\top X \Phi + n \Phi (1 - \Phi) + \frac{1}{\tau_1} \right)^{-1} \Phi X^\top y,$$
where Φ is the $p \times p$ diagonal matrix with $\phi_k$ as diagonal elements, i.e., $\Phi = \mathrm{Diag}(\phi_k)$. While a direct computation of μ would be computationally expensive, μ can be sequentially updated much more efficiently based on the μ from the previous iteration. Huang et al. (2016) studied the model selection consistency associated with their variational Bayes in the high-dimensional setting when $p \gg n$. While the majority of the variational approximations for Bayesian variable selection have focused on the variational family given by (16), Ormerod et al. (2017) recently discussed some other choices and studied their properties when $p \gg n$.
8 Theoretical properties
In high dimensions, it is difficult to take an entirely subjective Bayesian view about the priors and think of the prior distribution as prior belief about the underlying parameters. This is because, due to the high-dimensional probability involved, we do not quite have the right intuition about all the parts of the high-dimensional parameter space, which can lead to some strong prior mass in some parts irrespective of which kinds of priors one uses. For example, if each of the coefficients has a standard normal prior, which may be reasonable in a subjective Bayesian sense, the L2 norm of the entire regression vector will be heavily concentrated around $\sqrt{p}$, which is extremely large in most applications. However, if the prior is chosen for convenience and does not quite reflect the prior uncertainty about the entire parameter space, the posterior uncertainty may not be scientifically interpretable. This provides the motivation to study objective properties associated with Bayesian procedures in the frequentist sense. There are different types of theoretical properties considered in the Bayesian variable selection context. In this section, we provide a brief description of the different types of theoretical results considered and the related references. Let us denote the generic posterior distribution by π(β, Z | Y). However, in some cases such as for continuous shrinkage priors (for example, Bayesian LASSO), the binary vector Z is not used, in which case Z can be taken as the vector of all ones.
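The concentration of the prior norm around $\sqrt{p}$ is easy to check by simulation. The sketch below is only an illustration; the helper name `prior_norm_summary`, the number of draws, and the chosen dimensions are assumptions made for the example.

```python
import numpy as np

def prior_norm_summary(p, n_draws, rng):
    """Draw beta ~ N(0, I_p) repeatedly and summarize the L2 norm of the draws.

    Returns the mean norm divided by sqrt(p) and the standard deviation of the norm.
    """
    draws = rng.standard_normal((n_draws, p))
    norms = np.linalg.norm(draws, axis=1)
    return norms.mean() / np.sqrt(p), norms.std()

rng = np.random.default_rng(1)
summaries = {p: prior_norm_summary(p, 200, rng) for p in (10, 1000, 10000)}
for p, (ratio, spread) in summaries.items():
    print(p, round(ratio, 4), round(spread, 4))
```

The ratio of the mean norm to $\sqrt{p}$ approaches one as p grows while the spread of the norm stays bounded, so nearly all prior mass sits on a thin shell far from the origin.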
8.1 Consistency properties of the posterior mode
The posterior mode that maximizes the posterior π(β | Y) is the maximum a posteriori (MAP) estimator of the parameter β. If obtaining a point estimator is sufficient in an application, studying the properties of the MAP estimator is quite natural. As discussed in Section 4, there is a natural correspondence between the MAP estimator and penalized estimators since the prior implicitly induces a penalty, which is often termed Bayesian penalization or regularization. The penalty functions induced by many of the commonly used Bayesian priors are nonconvex and in some cases nonconcave. Rockova (2018) studied the estimation consistency properties of spike and slab LASSO and showed that the MAP estimator achieves the optimal rate of convergence in terms of L2 norm error, L1 norm error, and prediction loss. In the context of Bayesian graphical models, Gan et al. (2018) show that the MAP estimator using the spike and slab LASSO prior would also be consistent in terms of L∞ norm, which is more relevant for model selection.
8.2 Posterior concentration
In the Bayesian context, studying the concentration properties of the posterior distribution is quite important (Ghosal et al., 2000). The posterior concentration for the regression vector β requires that
$$\sup_{\beta_0} E_{\beta_0}\, \pi\left( \beta : \|\beta - \beta_0\| > \epsilon_n \mid Y \right) \overset{P}{\to} 0, \qquad (17)$$
where $\|\cdot\|$ denotes the Euclidean norm and the rate of posterior concentration, $\epsilon_n$, is defined implicitly to satisfy Eq. (17). The following are some existing works which studied the posterior concentration under various prior settings:
• Bhattacharya et al. (2015) studied the posterior concentration properties with their proposed Dirichlet–Laplace priors under the special case of the orthogonal design matrix $X^\top X = nI$. In particular, they show posterior concentration (17) with $\epsilon_n = O\left(\sqrt{s \log(n/s)/n}\right)$ with $s = \|\beta\|_0$, which is the optimal error rate for posterior concentration for the normal means model. Compared to the low-dimensional setting, the optimal error rate has an additional $\log(n/s)$ that represents the cost paid for the dimension, which is the same as the sample size n in this case.
• Castillo and van der Vaart (2012) and Castillo et al. (2015) studied the posterior concentration results extensively based on general choices of priors for the model size and for the slab prior. Under their prior choices, Castillo et al. (2015) showed posterior concentration with $\epsilon_n = O\left(\sqrt{s \log p / n}\right)$. In the high-dimensional setting with increasing p, the error rate $\epsilon_n$ pays a penalty factor of $\log p$ for not knowing the true model corresponding to the sparsity of $\beta_0$, which is a standard phenomenon in high-dimensional analysis. Atchade (2017) further extended these results to modeling settings beyond the linear regression model.
• Martin and Walker (2014) and Martin et al. (2017) studied the posterior concentration results for an empirical Bayes approach that has prior distributions depending on the data. Their error rate for posterior concentration is $\epsilon_n = O\left(\sqrt{s \log(p/s)/n}\right)$, which matches the optimal error rate as discussed by Rigollet and Tsybakov (2012) and is slightly better compared to the error rate of Castillo et al. (2015).
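To get a feel for how the two high-dimensional rates in the list above differ, the following sketch evaluates the expressions inside the $O(\cdot)$ terms for one set of values. The function names and the values of n, p, and s are hypothetical, and since the rates hold only up to constants the numbers are indicative, not exact error bounds.

```python
import math

def rate_log_p(n, p, s):
    """sqrt(s log p / n): the rate with a log p penalty factor."""
    return math.sqrt(s * math.log(p) / n)

def rate_log_p_over_s(n, p, s):
    """sqrt(s log(p/s) / n): the rate with a log(p/s) penalty factor."""
    return math.sqrt(s * math.log(p / s) / n)

n, p, s = 100, 10_000, 10
r1 = rate_log_p(n, p, s)
r2 = rate_log_p_over_s(n, p, s)
print(r1, r2, r2 / r1)
```

The improvement from $\log p$ to $\log(p/s)$ matters only when the sparsity level s is a non-negligible power of p; for fixed s the two rates are of the same order.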
8.3 Pairwise model comparison consistency
Casella et al. (2009) and Moreno et al. (2010) considered consistency in terms of the comparison of pairwise posterior probabilities. That is, if we consider any model $k \neq t$, where t is the true model, then the pairwise consistency requires that
$$\frac{P[Z = k \mid Y]}{P[Z = t \mid Y]} \overset{P}{\to} 0.$$
Casella et al. (2009) showed that their intrinsic priors have the pairwise consistency property for the fixed p case and Moreno et al. (2010) showed the same for potentially diverging dimension but with p < n. When the dimension p is fixed, there are finitely many models k so that this pairwise consistency implies that $P[Z = t \mid Y] \overset{P}{\to} 1$. However, when the dimension p can grow to infinity, the pairwise consistency does not imply that the posterior probability of the true model converges to one. In fact, Johnson and Rossell (2012) argued that even under the pairwise consistency, the posterior probability P[Z = t | Y] can go to zero when p is not fixed. A stronger notion of consistency compared to the pairwise consistency in high dimensions can be defined as
$$\max_{k \neq t} \frac{P[Z = k \mid Y]}{P[Z = t \mid Y]} \overset{P}{\to} 0, \qquad (18)$$
which assures that the posterior probability of the true model is uniformly larger in magnitude compared to any other model.
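A small arithmetic sketch shows why small posterior ratios alone do not make the posterior probability of the true model large when the number of models grows. It uses the identity $P[Z = t \mid Y] = 1 / (1 + \sum_{k \neq t} P[Z = k \mid Y] / P[Z = t \mid Y])$ and makes the artificial assumption, purely for illustration, that all $2^p - 1$ competing models share the same ratio; the function name is hypothetical.

```python
def true_model_posterior(ratio, p):
    """Posterior probability of the true model when each of the 2**p - 1
    competing models has the same posterior ratio relative to it."""
    n_competitors = 2 ** p - 1
    return 1.0 / (1.0 + n_competitors * ratio)

small_p = true_model_posterior(1e-6, 10)   # about a thousand competitors
large_p = true_model_posterior(1e-6, 40)   # about a trillion competitors
print(small_p, large_p)
```

With the same tiny ratio, the true model dominates when p = 10 but receives almost no posterior mass when p = 40, because the number of competitors grows exponentially in p.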
8.4 Strong model selection consistency
Johnson and Rossell (2012) considered a stronger version of Bayesian model selection consistency which requires that $P[Z = t \mid Y] \overset{P}{\to} 1$. They showed that their proposed nonlocal priors satisfy this property for p < n while local priors would not have this strong model selection consistency. Narisetty and He (2014) provided the strong selection consistency for spike and slab Gaussian priors with the spike prior variance decreasing to zero and the slab prior variance increasing to infinity as sample size n increases, when the dimension p can be nearly exponentially large in sample size, that is, when $\log p = o(n)$. Narisetty et al. (2019) showed similar consistency for their proposed Skinny Gibbs algorithm under the logistic regression model.
It should be noted that the strong model selection consistency is not only stronger in theory than the pairwise consistency but is also practically very helpful. This is because in practice the posterior distribution can only be obtained approximately using MCMC algorithms or other approximation techniques such as the Laplace approximation. In these cases, it is desirable to have a substantial gap between the posterior probability of the true model and those of the others. Even if the uniform consistency (18) holds true, it is still possible that $P[Z = t \mid Y] \overset{P}{\to} 0$, in which case it would be hard to distinguish the true model from the rest using approximate methods. Therefore, strong model selection consistency is indeed desired for Bayesian variable selection methods.
Along with studying strong selection consistency, Yang et al. (2016) simultaneously studied the algorithmic mixing properties of their proposed stochastic search algorithm in the context of point mass spike priors and slab g-priors. For a randomly generated dataset with Gaussian error distribution, they showed that after $O(nps^2 \log p)$ iterations, the posterior distribution estimated by their stochastic search algorithm is close to the actual posterior distribution corresponding to the prior specification and the true model will be selected with high probability. This is a distinct result in the Bayesian model selection literature as it accounts for the algorithmic approximation and ensures that it only takes nearly linear order computations for finding the true model. Similar theoretical results for other computational algorithms such as the Skinny Gibbs algorithm are worthy of future study.
9 Implementation
In this section, we discuss some software implementations available in R for performing Bayesian variable selection. This is definitely not an exhaustive list but covers a range of methods developed under different perspectives. In Table 1, we provide a list of R packages and excerpts from the description provided in their package manual as a reference to some of the useful packages.
TABLE 1 Some R packages for Bayesian computation.

| Package | Description (excerpts from the package manual) |
|---|---|
| BAS | Package for Bayesian Variable Selection and Model Averaging in linear models and generalized linear models using stochastic or deterministic sampling without replacement from posterior distributions. Prior distributions on coefficients are from Zellner’s g-prior or mixtures of g-priors corresponding to the Zellner–Siow Cauchy Priors or the mixture of g-priors. |
| BayesVarSel | Conceived to calculate Bayes factors in linear models and then to provide a formal Bayesian answer to testing and variable selection problems. From a theoretical side, the emphasis in this package is placed on the prior distributions and it allows a wide range of them. |
| BayesS5 | A scalable stochastic search algorithm that is called the Simplified Shotgun Stochastic Search (S5) and aimed at rapidly explore interesting regions of model space and finding the maximum a posteriori (MAP) model. |
| BMA | Package for Bayesian model averaging and variable selection for linear models, generalized linear models and survival models (cox regression). |
| BMS | Bayesian model averaging for linear models with a wide choice of (customizable) priors. Built-in priors include coefficient priors (fixed, flexible, and hyper-g-priors), five kinds of model priors, moreover model sampling by enumeration or various MCMC approaches. |
| monomvn | Estimation of multivariate normal and Student-t data of arbitrary dimension where the pattern of missing data is monotone. The current version supports maximum likelihood inference and a full Bayesian approach employing scale mixtures for Gibbs sampling. |
| SAMCpack | We provide generic SAMC samplers for continuous distributions. User-specified densities in R and C++ are both supported. We also provide functions for specific problems that exploit SAMC computation. |
| Spikeslab | Spike and slab for prediction and variable selection in linear regression models. Uses a generalized elastic net for variable selection. |
| SSLASSO | Efficient coordinate ascent algorithm for fitting regularization paths for linear models penalized by Spike-and-Slab LASSO of Rockova and George (2018). |
| varSelectIP | Objective Bayes Variable Selection in Linear Regression and Probit models (Casella et al., 2009; Leon-Novelo et al., 2012). |

10 An example
In this section, we provide an illustration of some of the Bayesian methods for variable selection discussed in the chapter. We consider the data obtained from an experiment by Lan et al. (2006) to study the genetics of two inbred mouse populations. The data contain information about gene expression levels of 22,575 genes of 31 female and 29 male mice. The response variable we consider is the phenotype called glycerol-3-phosphate acyltransferase (GPAT), which is measured by quantitative real-time PCR. The dataset is publicly available at GEO (http://www.ncbi.nlm.nih.gov/geo; accession number GSE3330). Zhang et al. (2009), Bondell and Reich (2012), and Narisetty and He (2014) used these data for demonstrating their methods. Following Narisetty and He (2014), we first perform a screening based on marginal correlation with the response and obtain p = 200 predictors (including the intercept and gender variable). We consider the following Bayesian methods for illustration: (i) the g-prior and (ii) the hyper g-prior methods implemented using the BAS R package of Clyde et al. (2010), (iii) the mixture of g-priors method implemented using the GibbsBvs function from the R package BayesVarSel (Garcia-Donato and Martinez-Beneito, 2013), (iv) the Bayesian LASSO and (v) the horseshoe priors implemented using the monomvn R package, (vi) the spike and slab LASSO method using the SSLASSO package, and (vii) the BASAD method implemented using the publicly available code on the author’s website. In addition, we also implement LASSO and SCAD using the glmnet and ncvreg R packages, respectively. The models selected by various methods are tabulated in Table 2, which also provides the values of BIC for each of these selected models. Among all the methods considered, BASAD has the smallest BIC and the mixture g-prior takes a very close second smallest value. It is worth noting that the model selected by mixture g-prior is a submodel of the BASAD model. Due to this and given the large gap between the BIC values of models from BASAD and mixture g-prior in comparison with the remaining models, we would recommend selection of the BASAD model in this data example.
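The screening and BIC steps used in this example can be sketched as follows. The sketch runs on simulated data, not the GPAT data, and rests on several assumptions: the function names are hypothetical, the screening keeps the predictors with the largest absolute marginal correlation, and BIC is computed in the common Gaussian form $n \log(\mathrm{RSS}/n) + k \log n$ with an intercept included, which may differ from the exact form used for Table 2.

```python
import numpy as np

def screen_by_marginal_correlation(X, y, keep):
    """Return the (sorted) indices of the `keep` columns of X most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    top = np.argsort(-np.abs(corr))[:keep]
    return np.sort(top)

def bic_linear(X, y, selected):
    """Gaussian BIC, n*log(RSS/n) + k*log(n), for least squares on the selected columns."""
    n = len(y)
    design = np.column_stack([np.ones(n), X[:, selected]])  # intercept plus selected columns
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    return n * np.log(rss / n) + design.shape[1] * np.log(n)

# Simulated stand-in for a "many genes, few samples" data set (hypothetical values).
rng = np.random.default_rng(0)
n, p = 60, 500
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 3] - 2.0 * X[:, 7] + 0.5 * rng.standard_normal(n)

kept = screen_by_marginal_correlation(X, y, keep=20)
bic_true = bic_linear(X, y, np.array([3, 7]))
bic_wrong = bic_linear(X, y, np.array([0, 1]))
print(kept, bic_true, bic_wrong)
```

A smaller BIC indicates a better trade-off between fit and model size, which is how the models in Table 2 are compared.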
TABLE 2 Models selected for PCR data using different methods.

| Method | Model size | BIC | Selected variables |
|---|---|---|---|
| LASSO | 10 | 144.22 | 24 46 74 101 129 175 176 180 181 191 |
| SCAD | 14 | 145.74 | 24 46 68 74 90 101 123 129 132 175 176 180 181 191 |
| g-prior | 3 | 139.17 | 86 90 147 |
| Hyper g-prior | 4 | 134.21 | 86 90 147 190 |
| Mixture g-prior | 4 | 114.02 | 90 101 104 123 |
| SSLASSO | 5 | 120.98 | 31 71 90 118 195 |
| Bayesian LASSO | 8 | 131.32 | 23 30 41 100 117 122 130 180 |
| Horseshoe | 7 | 126.54 | 23 30 41 100 122 140 180 |
| **BASAD** | 6 | **113.66** | 24 86 90 101 104 123 |

BASAD model has the smallest BIC value whereas the model chosen by mixture g-prior also has a very similar BIC with a smaller model. In fact, the model chosen by mixture g-prior is a submodel of the one chosen by BASAD. Bold indicates the best performing method in terms of having the smallest BIC value.
We would like to emphasize that the results from the analysis of this single data set should not be overly generalized. The purpose of this exercise is to give an idea about how these methods can be used in practice and to give a concrete example for the application of these methods. The Bayesian methods based on g-prior, hyper g-prior, mixture g-prior, and BASAD provide the marginal posterior probabilities for all the covariates. As mixture g-prior and BASAD have the smallest BIC values, we provide the plots of their posterior probabilities in Fig. 7. It is interesting to note that the covariates having high posterior probabilities for both the BASAD and Mixture g-prior methods are quite similar, which makes sense as these two methods have the lowest BIC values. The Bayesian methods of Bayesian LASSO and Horseshoe, which are based on continuous shrinkage priors, provide the posterior mean of the regression coefficient along with uncertainty measures in the form of boxplots for each of the regression coefficients. The posterior mean estimates along with their boxplots are provided in Fig. 8. As discussed earlier, the interpretation of the uncertainty associated with the boxplots is ambiguous in general, but if the priors are viewed as subjective Bayesian priors, the boxplots provide a way to quantify uncertainty of the regression coefficients in the subjective Bayesian sense.

FIG. 7 Marginal posterior probabilities for Mixture g-prior and BASAD. The covariate #90 (circled in red) has the highest posterior probability for both the methods. [Figure omitted: two panels, "Mixture g-prior" and "BASAD prior"; x-axis: Variables; y-axis: Marginal Posterior Probability; labeled covariates: 24, 86, 90, 101, 104, 123.]

FIG. 8 Coefficient estimation and boxplots for continuous shrinkage priors. [Figure omitted: panels for Bayesian LASSO and Horseshoe showing coefficient estimates against Variables, and boxplots of regression coefficients.]
11 Discussion
In the article, we mainly focused on Bayesian variable selection for linear regression models. However, most of the ideas and methods discussed generalize naturally to many other models. There are existing works that extend these approaches to generalized linear models and group selection among others. For instance, Nott and Daniela (2004), Jiang (2007), Wang and George (2007), Chen et al. (2008), Liang et al. (2013), and Narisetty et al. (2019) studied various issues related to Bayesian variable selection in generalized linear models. Bayesian variable selection approaches for settings where the predictors are grouped together are also considered in the recent literature. Xu and Ghosh (2015) and Chen et al. (2016) proposed methods for Bayesian group selection. It is well known that the least squares regression and the corresponding likelihood are sensitive to outliers. For this reason, least absolute deviation regression and more generally quantile regression (Koenker, 2005; Koenker and Bassett, 1978) is a more robust framework for regression when the errors are likely to be heavy tailed. Quantile regression (Koenker and Bassett, 1978) models the τ-th conditional quantile of the response yi given the covariates. Unlike the least squares setting, quantile regression is a local model and does not explicitly assume a specific conditional distribution for Y given X. This means that there is no natural likelihood available for quantile regression and necessitates the use of working likelihoods for carrying out Bayesian inference. Yu and Moyeed (2001) proposed a working likelihood based on the Asymmetric Laplace Distribution (ALD). Computation of the posterior with the ALD likelihood is easy to implement using Gibbs sampling (Kozumi and Kobayashi, 2011) or Metropolis–Hastings (MH) algorithms. Yu et al. (2013) proposed spike and slab priors and a Gibbs sampling algorithm for performing Bayesian variable selection for quantile regression with the ALD likelihood. 
Variable selection with quantile regression has the potential to be much more robust than the more common linear regression while accommodating heterogeneity in the data. Bayesian shrinkage and regularization has long been used in various contexts demonstrating the flexibility and power of this strategy in various statistical modeling contexts. Beyond statistical models, Bayesian regularization has also seen successful application for machine learning models. For instance, Snoek et al. (2015), Wang and Yeung (2016), and Gal et al. (2017) employed Bayesian regularization for neural networks and deep learning.
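The working likelihood for quantile regression mentioned above can be made concrete with a short sketch. It assumes the standard form of the asymmetric Laplace density, $f(y) = \frac{\tau(1-\tau)}{\sigma} \exp\{-\rho_\tau((y - \mu)/\sigma)\}$ with check loss $\rho_\tau(u) = u(\tau - 1\{u < 0\})$. The function names and the intercept-only example are illustrative assumptions, not code from the cited papers.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile regression check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def ald_loglik(y, mu, tau, scale=1.0):
    """Log-likelihood of y under an asymmetric Laplace working likelihood."""
    u = (np.asarray(y, dtype=float) - mu) / scale
    return np.sum(np.log(tau * (1.0 - tau) / scale) - check_loss(u, tau))

# Intercept-only illustration: maximizing the ALD likelihood over a constant
# location recovers the sample tau-th quantile.
rng = np.random.default_rng(0)
y = rng.standard_normal(501)
tau = 0.25
candidates = np.sort(y)
logliks = np.array([ald_loglik(y, m, tau) for m in candidates])
mu_hat = candidates[np.argmax(logliks)]
print(mu_hat, np.quantile(y, tau))
```

Maximizing this working likelihood is equivalent to minimizing the total check loss, which is why the ALD is a convenient device for Bayesian quantile regression even when the errors are not believed to be asymmetric Laplace.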
Acknowledgments
The author is grateful to two reviewers for their extensive and helpful feedback and graduate students Teng Wu and Ke Li for proofreading an initial version of the article. The author gratefully acknowledges support from NSF Award DMS-1811768.
References
Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, pp. 267–281.
Armagan, A., Dunson, D.B., Lee, J., 2013. Generalized double Pareto shrinkage. Stat. Sin. 23, 119–143.
Atchade, Y.A., 2017. On the contraction properties of some high-dimensional quasi-posterior distributions. Ann. Stat. 45 (5), 2248–2273. https://doi.org/10.1214/16-AOS1526.
Barbieri, M.M., Berger, J.O., 2004. Optimal predictive model selection. Ann. Stat. 32, 870–897.
Belloni, A., Chernozhukov, V., 2013. Least squares after model selection in high-dimensional sparse models. Bernoulli 19, 521–547.
Belloni, A., Chernozhukov, V., Hansen, C., 2011. Inference for high-dimensional sparse econometric models. In: Advances in Economics and Econometrics, 10th World Congress of Econometric Society.
Bendel, R.B., Afifi, A.A., 1977. Comparison of stopping rules in forward “stepwise” regression. J. Am. Stat. Assoc. 72, 46–53.
Bertsimas, D., King, A., Mazumder, R., 2016. Best subset selection via a modern optimization lens. Ann. Stat. 44, 813–852.
Bhadra, A., Datta, J., Polson, N.G., Willard, B., 2017. The horseshoe+ estimator of ultra-sparse signals. Bayesian Anal. 12 (4), 1105–1131.
Bhadra, A., Datta, J., Polson, N.G., Willard, B., 2019. Lasso meets horseshoe: a survey. Stat. Sci. (in press).
Bhattacharya, A., Pati, D., Pillai, N.S., Dunson, D.B., 2015. Dirichlet–Laplace priors for optimal shrinkage. J. Am. Stat. Assoc. 110, 1479–1490.
Bhattacharya, A., Chakraborty, A., Mallick, B., 2016. Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika 103, 985–991.
Bickel, P.J., Ritov, Y., Tsybakov, A.B., 2009. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37, 1705–1732.
Blei, D.M., Kucukelbir, A., McAuliffe, J.D., 2017. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112 (518), 859–877. https://doi.org/10.1080/01621459.2017.1285773.
Bondell, H.D., Reich, B.J., 2012. Consistent high dimensional Bayesian variable selection via penalized credible regions. J. Am. Stat. Assoc. 107, 1610–1624.
Breheny, P., Huang, J., 2011. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 5 (1), 232.
Bühlmann, P., van de Geer, S., 2011. Statistics for High-Dimensional Data. Springer-Verlag, Berlin, Heidelberg.
Carbonetto, P., Stephens, M., 2012. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal. 7, 73–108.
Carvalho, C.M., Polson, N.G., Scott, J.G., 2009a. Handling sparsity via the horseshoe. J. Mach. Learn. Res. 97 (5), 73–80.
Carvalho, C.M., Polson, N.G., Scott, J.G., 2009b. The horseshoe estimator for sparse signals. Biometrika 97, 465–480.
Casella, G., Giron, F.J., Martinez, M.L., Moreno, E., 2009. Consistency of Bayesian procedures for variable selection. Ann. Stat. 37, 1207–1228.
Castillo, I., van der Vaart, A., 2012. Needles and straw in a Haystack: posterior concentration for possibly sparse sequences. Ann. Stat. 40, 2069–2101.
Castillo, I., Schmidt-Hieber, J., van der Vaart, A., 2015. Bayesian linear regression with sparse priors. Ann. Stat. 43, 1986–2018.
Chen, M.H., Huang, L., Ibrahim, J.G., Kim, S., 2008. Bayesian variable selection and computation for generalized linear models with conjugate priors. Bayesian Anal. 3, 585–614.
Chen, R.-B., Chu, C.-H., Yuan, S., Wu, Y.N., 2016. Bayesian sparse group selection. J. Comput. Graph. Stat. 25 (3), 665–683.
Clyde, M., Ghosh, J., Littman, M., 2010. Bayesian adaptive sampling for variable selection and model averaging. J. Comput. Graph. Stat. 20, 80–101.
Datta, J., Ghosh, J.K., 2013. Asymptotic properties of Bayes risk for the Horseshoe prior. Bayesian Anal. 8 (1), 111–132.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodological) 39, 1–38.
Dicker, L.H., 2016. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli 22 (1), 1–37. https://doi.org/10.3150/14-BEJ609.
Dobriban, E., Wager, S., 2018. High-dimensional asymptotics of prediction: ridge regression and classification. Ann. Stat. 46 (1), 247–279. https://doi.org/10.1214/17-AOS1549.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360.
Fan, J., Lv, J., 2008. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B 70, 849–911.
Fan, J., Lv, J., 2010. A selective overview of variable selection in high dimensional feature space. Stat. Sin. 20, 101–148.
Fan, J., Peng, H., 2004. Nonconcave penalized likelihood with a diverging number of parameters. Ann. Stat. 32, 928–961.
Fan, J., Song, R., 2010. Sure independence screening in generalized linear models with NP-dimensionality. Ann. Stat. 38, 3567–3604.
Fernández, C., Ley, E., Steel, M.F., 2001. Benchmark priors for Bayesian model averaging. J. Econ. 100, 381–427.
Finos, L., Brombin, C., Salmaso, L., 2010. Adjusting stepwise p-values in generalized linear models. Commun. Stat. Theory Methods 39, 1832–1846.
Foster, D.P., George, E.I., 1994. The risk inflation criterion for multiple regression. Ann. Stat. 22, 1947–1975.
Gal, Y., Islam, R., Ghahramani, Z., 2017. Deep Bayesian active learning with image data. In: ICML’17. Proceedings of the 34th International Conference on Machine Learning, vol. 70. JMLR.org, pp. 1183–1192. http://dl.acm.org/citation.cfm?id=3305381.3305504.
Gan, L., Narisetty, N.N., Liang, F., 2018. Bayesian regularization for graphical models with unequal shrinkage. J. Am. Stat. Assoc. (in press).
Garcia-Donato, G., Martinez-Beneito, M.A., 2013. On sampling strategies in Bayesian variable selection problems with large model spaces. J. Am. Stat. Assoc. 108, 340–352.
George, E.I., Foster, D.P., 2000. Calibration and empirical Bayes variable selection. Biometrika 87, 731–747.
George, E.I., McCulloch, R.E., 1993. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 88, 881–889.
George, E.I., McCulloch, R.E., 1997. Approaches for Bayesian variable selection. Stat. Sin. 7, 339–373.
Ghosal, S., Ghosh, J.K., van der Vaart, A.W., 2000. Convergence rates of posterior distributions. Ann. Stat. 28 (2), 500–531.
Grechanovsky, E., Pinsker, I., 1995. Conditional p-values for the F-statistic in a forward selection procedure. Comput. Stat. Data Anal. 20, 239–263.
Hans, C., Dobra, A., West, M., 2007. Shotgun stochastic search for “large p” regression. J. Am. Stat. Assoc. 102, 507–516.
Hazimeh, H., Mazumder, R., 2018. Fast best subset selection: coordinate descent and local combinatorial optimization algorithms. arXiv:1706.10179.
He, X., Wang, L., Hong, H.G., 2013. Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann. Stat. 41, 342–369.
Hsu, D., Kakade, S.M., Zhang, T., 2014. Random design analysis of ridge regression. Found. Comput. Math. 14, 569–600.
Huang, X., Wang, J., Liang, F., 2016. A variational algorithm for Bayesian variable selection. arXiv:1602.07640.
Ishwaran, H., Rao, J.S., 2005. Spike and slab variable selection: frequentist and Bayesian strategies. Ann. Stat. 33, 730–773.
Jiang, W., 2007. Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities. Ann. Stat. 35, 1487–1511.
Johnson, V.E., Rossell, D., 2012. Bayesian model selection in high-dimensional settings. J. Am. Stat. Assoc. 107, 649–660.
Jordan, M.I., Ghahramani, Z., Jaakkola, T., Saul, L., 1999. Introduction to variational methods for graphical models. Mach. Learn. 37, 183–233.
Kass, R.E., Raftery, A.E., 1995. Bayes factors. J. Am. Stat. Assoc. 90, 773–795.
Koenker, R., 2005. Quantile regression. In: Econometric Society Monograph Series. Cambridge University Press. https://doi.org/10.1017/CBO9780511754098.
Koenker, R., Bassett, G., 1978. Regression quantiles. Econometrica 46, 33–50.
Kozumi, H., Kobayashi, G., 2011. Gibbs sampling methods for Bayesian quantile regression. J. Stat. Comput. Simul. 81 (11), 1565–1578.
Lan, H., Chen, M., Flowers, J.B., Yandell, B.S., Stapleton, D.S., Mata, C.M., Mui, E.T., Flowers, M.T., Schueler, K.L., Manly, K.F., Williams, R.W., Kendziorski, K., Attie, A.D., 2006. Combined expression trait correlations and expression quantitative trait locus mapping. PLoS Genet. 2, e6.
Leon-Novelo, L., Moreno, E., Casella, G., 2012. Objective Bayes model selection in probit models. J. Am. Stat. Assoc. 31, 353–365.
Liang, F., 2009. Improving SAMC using smoothing methods: theory and applications to Bayesian model selection problems. Ann. Stat. 37, 2626–2654.
Liang, F., Liu, C., Carroll, R.J., 2007. Stochastic approximation in Monte Carlo computation. J. Am. Stat. Assoc. 102, 305–320.
Liang, F., Paulo, R., Molina, G., Clyde, M.A., Berger, J.O., 2008. Mixtures of g priors for Bayesian variable selection. J. Am. Stat. Assoc. 103 (481), 410–423.
Liang, F., Song, Q., Yu, K., 2013. Bayesian subset modeling for high dimensional generalized linear models. J. Am. Stat. Assoc. 108, 589–606.
Loh, P.-L., Wainwright, M.J., 2017. Support recovery without incoherence: a case for nonconvex regularization. Ann. Stat. 45 (6), 2455–2482.
Bayesian model selection for high-dimensional data Chapter
4 247
Martin, R., Walker, S.G., 2014. Asymptotically minimax empirical Bayes estimation of a sparse normal mean vector. Electron. J. Stat. 8, 2188–2206. Martin, R., Mess, R., Walker, S.G., 2017. Empirical Bayes posterior concentration in sparse highdimensional linear models. Bernoulli 23, 1822–1847. Mazumder, R., Friedman, J.H., Hastie, T., 2011. Sparsenet: coordinate descent with nonconvex penalties. J. Am. Stat. Assoc. 106 (495), 1125–1138. Meinshausen, N., Yu, B., 2009. Lasso-type recovery of sparse representations for highdimensional data. Ann. Stat. 246–270. Mitchell, T.J., Beauchamp, J.J., 1988. Bayesian variable selection in linear regression. J. Am. Stat. Assoc. 83, 1023–1032. Moreno, E., Giron, F.J., Casella, G., 2010. Consistency of objective Bayes factors as the model dimension grows. Ann. Stat. 38, 1937–1952. Mousavi, A., Maleki, A., Baraniuk, R.G., 2017. Consistent parameter estimation for LASSO and approximate message passing. Ann. Stat. 45, 2427–2454. Narisetty, N.N., He, X., 2014. Bayesian variable selection with shrinking and diffusing priors. Ann. Stat. 42, 789–817. Narisetty, N.N., Shen, J., He, X., 2019. Skinny Gibbs: a scalable and consistent Gibbs sampler for model selection. J. Am. Stat. Assoc. Nott, D.J., Daniela, L., 2004. Sampling schemes for Bayesian variable selection in generalized linear models. J. Comput. Graph. Stat. 13, 362–382. O’hara, R.B., Sillanpaa, M.J., 2009. A review of Bayesian variable selection methods: what, how and which. Bayesian Anal. 4, 85–117. Ormerod, J.T., You, C., Muller, S., 2017. A variational Bayes approach to variable selection. Electron. J. Stat. 11, 3549–3594. Park, T., Casella, G., 2008. The Bayesian LASSO. J. Am. Stat. Assoc. 103, 681–686. Polson, N.G., Scott, J.G., 2010. Shrink globally, act locally: sparse Bayesian regularization and prediction. In: Bernardo, J.M., Bayarri, M.J., Berger, J.O., Dawid, A.P., Heckerman, D., Smith, A.F.M. et al., (Eds.), Bayesian Statistics 9. 
Oxford University Press, New York, pp. 501–538. Raftery, A.E., 1996. Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83 (2), 251–266. Rigollet, P., Tsybakov, A.B., 2012. Sparse estimation by exponential weighting. Stat. Sci. 27, 558–575. Rockova, V., 2018. Bayesian estimation of sparse signals with a continuous spike-and-slab prior. Ann. Stat. 46 (1), 401–437. Rockova´, V., George, E.I., 2014. EMVS: the EM approach to Bayesian variable selection. J. Am. Stat. Assoc. 109 (506), 828–846. Rockova, V., George, E.I., 2018. The spike-and-slab LASSO. J. Am. Stat. Assoc. 113, 431–444. Schwarz, G.E., 1978. Estimating the dimension of a model. Ann. Stat. 6, 461–464. Scott, G.S., Berger, J.O., 2006. An exploration of aspects of Bayesian multiple testing. J. Stat. Plann. Inference 136 (7), 2144–2162. Scott, G.S., Berger, J.O., 2010. Bayes and empirical-Bayes multiplicity adjustment in the variableselection problem. Ann. Stat. 38, 2587–2619. Shin, M., Bhattacharya, A., Johnson, V.E., 2018. Scalable Bayesian variable selection using nonlocal prior densities in ultrahigh-dimensional settings. Stat. Sin. 28, 1053–1078. Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M. M. A., Prabhat, P., Adams, R.P., 2015. Scalable Bayesian optimization using deep neural networks. In: ICML’15. Proceedings of the 32nd International Conference on International Conference on Machine Learning, vol. 37JMLR.org, pp. 2171–2180. http://dl.acm.org/citation.cfm?id¼3045118.3045349.
248 Handbook of Statistics Tibshirani, R., 1996. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58, 267–288. van der Vaart, A. W., 1998. Bayes procedures. In: Gill, R., Ripley, B.D., Ross, O.S., Stein, M., Williams, D. (Eds.), Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, pp. 138–152. van de Geer S. A., 2008. High-dimensional generalized linear models and the Lasso. Ann. Stat. 36, 614–645. van de Geer, S., 2016. Estimation and Testing Under Sparsity. Lecture Notes in Mathematics Book Series, Springer ISBN 978–3-319–32774-7. Wang, X., George, E.I., 2007. Adaptive Bayesian criteria in variable selection for generalized linear models. Stat. Sin. 667–690. Wang, H., Yeung, D.-Y., 2016. Towards Bayesian deep learning: a framework and some existing methods. IEEE Trans. Knowl. Data Eng. 28 (12), 3395–3408. ISSN 1041–4347. https://doi. org/10.1109/TKDE.2016.2606428. Xu, X., Ghosh, M., 2015. Bayesian variable selection and estimation for group Lasso. Bayesian Anal. 10 (4), 909–936. https://doi.org/10.1214/14-BA929. Yang, Y., Wainwright, M.J., Jordan, M.I., 2016. On the computational complexity of highdimensional Bayesian variable selection. Ann. Stat. 44, 2497–2532. Yu, K., Moyeed, R.A., 2001. Bayesian quantile regression. Stat. Probab. Lett. 54 (4), 437–447. Yu, K., Chen, C.W.S., Reed, C., Dunson, D.B., 2013. Partial correlation estimation by joint sparse regression models. Stat. Interface 6, 261–274. Yuan, M., Lin, Y., 2005. Efficient empirical Bayes variable selection and estimation in linear models. J. Am. Stat. Assoc. 100, 1215–1225. Zellner, A., 1986. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: Bayesian Inference and Decision Techniques, New York, pp. 233–243. Zhang, C.-H., 2010. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38 (2), 894–942. Zhang, C.-H., Huang, J., 2008. 
The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann. Stat. 1567–1594. Zhang, D., Lin, Y., Zhang, M., 2009. Penalized orthogonal-components regression for large P small N data. Electron. J. Stat. 3, 781–796. Zhao, P., Yu, B., 2006. On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2563. Zou, H., 2006. The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429. Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67, 301–320.
Chapter 5
Competing risks: Aims and methods
Ronald B. Geskus¹
Centre for Tropical Medicine, Oxford University Clinical Research Unit, Ho Chi Minh City, Viet Nam
Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
¹ Corresponding author: e-mail: [email protected]
Abstract
In the end we all die, but not all at the same age and from the same cause. A competing risks analysis quantifies the occurrence over time of mutually exclusive event types. Competing risks methods are increasingly used in different branches of science. Yet, there remains a lot of confusion with respect to the quantities that can be estimated under specific assumptions, as well as their interpretation. Some characteristics are different from the classical time-to-event setting. With right censored time-to-event data, analysis is often based on the rate, or hazard. In the presence of competing risks, three different hazards can be defined: the cause-specific hazard, the subdistribution hazard, and the marginal hazard. The first two quantify different aspects of the competing risks process. Both can be used as a basis for quantifying the risk, the cumulative event probability, although direct approaches to estimation of the risk exist as well. The marginal hazard considers the (possibly completely hypothetical) setting in which the competing risks are absent; it requires the competing risks to be independent for unbiased estimation. Using examples from the medical and epidemiological field, I explain when and how to use models and techniques for competing risks and how to interpret the results. Emphasis is on nonparametric estimation. Regression models are covered briefly. I discuss how to deal with time-varying covariables when the subdistribution hazard is the estimand. Software is readily available. In fact, many analyses can be done using standard software for time-to-event analysis.
Keywords: Subdistribution, Nonparametric estimation, Product-limit estimator, Aalen–Johansen estimator, Time-varying covariable, Explanation vs prediction
Handbook of Statistics, Vol. 43. https://doi.org/10.1016/bs.host.2019.11.001 © 2020 Elsevier B.V. All rights reserved.
1 Introduction
I work in a medical setting. Whenever a colleague asks me for advice on her statistical analyses, one of my first questions is on the purpose of her study. She tells me that she is interested in the risk factors for a certain outcome. She explains the type of outcome to me and presents a list of variables to consider. However, this is not sufficient information to give her full recommendations on the statistical approach. Does she want a model to predict the outcome as accurately as possible, which can be used in clinical practice? Or does she have a more scientific aim and does she want to understand the mechanisms that lead to the outcome? The distinction between prediction and explanation is important because it determines the type of regression model, the decision which variables to include, and how the outcome is quantified. A good overview of the difference between both is given in Shmueli (2010). My focus is on the analysis of time-to-event data with competing risks. This means that the outcome is an event; its timing may differ per individual. In the classical time-to-event setting, all events are of the same type, or the types are taken together. In a competing risks setting there are several possible event types that are considered separately; the occurrence of one event type precludes the other event types from occurring (or at least from occurring under the same conditions). In the presence of competing risks, the appropriate approach critically depends on the aim of the study, whether one wants to explain or to predict. Often, the analysis is based on the hazard. Two different hazards can be used, the cause-specific hazard and the subdistribution hazard. The cause-specific hazard is often more appropriate for understanding etiology, even though its actual value does not reflect effect size. The subdistribution hazard has a direct relation to the cumulative probability of a specific event type, which is the quantity of interest in prediction.
In Section 2, I give three examples with different types of research aims. I leave out statistical and mathematical detail. In Section 3, I define the relevant quantities in a formal way and define three nonparametric estimators of the cause-specific cumulative probability. Although their structure is quite different, I prove them to be equivalent. I briefly cover regression models and software. In Section 4, I discuss analysis on the subdistribution hazard in the presence of time-varying covariables. I conclude by giving a list of issues that continue to generate confusion, and come up with my suggestions.
2 Research aim: Explanation vs prediction
I give three examples of time-to-event outcomes from medical and epidemiological research in which different event types are present. In each example, I formulate the relevant study questions and consider the type of analysis that is most appropriate for answering that question.
2.1 In-hospital infection and discharge
This example is interesting because the setting and the data allow for two different study questions, which each need a different approach in order to be answered with the collected data. Suppose a hospital collects data on Staphylococcus infection during in-patient stay of patients with burn wounds. Not all patients become infected during hospital stay. Data after discharge are not relevant for the risk of in-hospital infection. How we deal with patients that are discharged without infection in our analysis depends on the study question. We may want to quantify the infection risk in the hospital, and compare this with other hospitals. Then, we want to estimate the time-to-infection distribution if patients would stay in hospital forever, or at least until they become infected. This question concerns etiology: it reflects the prevalence of Staphylococcus bacteria and the conditions that lead to infection. In the study question there is only one type of event, as depicted graphically in Fig. 1. However, in practice the data contain the competing event “discharge without infection,” as in Fig. 2. The quantity of interest becomes a marginal distribution. It describes the distribution of one outcome (infection) without consideration of the other (discharge). Another question is what percentage of patients actually get infected while staying in hospital. This is relevant for the hospital if it needs to predict the financial or disease burden from Staphylococcus infection. The question is clinical; we predict whether and when infection will be observed in every patient. Now not only the practical data analysis but also the study question takes the competing risk “discharge without infection” into account as a separate outcome. The time until the event “infection in the hospital” is described by the infection-specific distribution, with discharge as competing risk.
[Diagram: Admitted to hospital → Infected in the hospital]
FIG. 1 Staphylococcus infection during hospital stay; no discharge.
[Diagram: Admitted to hospital → Infected in the hospital; Admitted to hospital → Discharged from the hospital without infection]
FIG. 2 Staphylococcus infection during hospital stay; discharge competing risk.
TABLE 1 Estimation with complete follow-up (artificial data).
Day          0–1   1–2   2–3   3–4   4–5   5–6   6–7   >7
Infection      1     2     6    11     9    11     2   18
Cumulative     1     3     9    20    29    40    42   60
Discharge      5     9     6     6     9    12     4   35
Cumulative     5    14    20    26    35    47    51   86
Time-to-event analysis is about transitions between states. Individuals become at risk for the transition to infection in the hospital when they enter the state “admitted to hospital.” Patients are no longer of interest if they are discharged without infection. However, the way we deal with that transition differs between both study questions. Let us have a look at the (artificial) data in Table 1.
2.1.1 Marginal distribution: Discharge as censoring event
When estimating the marginal distribution, we want to know the time at which discharged individuals would have been infected if they had stayed in hospital. Such information on unobserved events may be available in data as generated in simulation studies, but not in real data. However, it may not be unreasonable to assume that Staphylococcus infection and discharge are independent mechanisms, i.e., the marginal distributions of time-to-infection and time-to-discharge are independent. Then the discharged individuals can be represented by the ones that remain in the hospital and continue to be at risk for infection. This is the basis for the Kaplan–Meier estimator, in which discharge is treated as a censoring event. If the independence assumption does not hold, then the Kaplan–Meier estimator can still be computed, but it does not estimate anything meaningful.
2.1.2 Cause-specific distribution: Discharge as competing event
For the infection-specific distribution, we compute the (cumulative) fraction of individuals over time that becomes infected. Those that are discharged without infection will never contribute to the numerator, but they continue to contribute to the denominator; discharge is not treated as a censoring event. There are no individuals that leave the study within the first 7 days without having experienced either infection or discharge (i.e., there are no right censored data). Therefore, the probability to become infected within 6 days can be estimated as the observed frequency
\[ \widehat{P}(\text{infection} \le 6 \text{ days}) = 40/146. \]
In summary, for the etiological study question, the marginal distribution is of interest. In answering this question, “discharged without infection” is interpreted as a censoring event, and treated as such in estimation. For prediction, the infection-specific cumulative probability is of interest. “Discharged without infection” is interpreted as a competing risk, and ignored in estimation.
2.2 Causes of death after HIV infection
Combination antiretroviral therapy (cART) greatly reduced AIDS-related mortality in HIV infected individuals. The efficacy of cART is typically investigated in a randomized clinical trial, the gold standard for explanatory research. However, a well-designed observational study may come up with similar answers (Hernán et al., 2003). One may also want to quantify the impact of the introduction of cART on mortality at the population level. This research question is primarily predictive: we want to quantify what we expect to see as a consequence of the introduction of cART. We not only consider AIDS-related mortality as end point, but also the impact of cART on other causes of death (COD). Side effects of cART may increase the risk to die of other causes such as cardiovascular disease. But even if cART itself does not increase the risk of other types of mortality, individuals that no longer die of AIDS will sooner or later die of something else. Although our observations reflect the etiology of cART, they also reflect that “in the end we all die, but not all at the same age and from the same cause.” Therefore, it is better to call it the “impact” of the introduction of cART rather than the “effect.” The same factors that increase HIV infection risk also increase the risk of other viral and bacterial infections, such as hepatitis C virus (HCV) infection. HCV infection increases the risk of liver cancer, but because this cancer on average takes decades to develop, it was not observed very often in HIV infected individuals when the probability to die of AIDS was high, as was the case in the pre-cART era. In van der Helm et al. (2013), we made liver-related death a separate COD and studied whether it is observed more often in the cART era. The other COD are grouped as “natural” (such as cardiovascular and cancer) and “nonnatural” (such as suicide, accident, and drug overdose).
In this example, the different COD act as competing risks, as depicted in Fig. 3. The situation in which only AIDS-related mortality exists as COD is completely hypothetical and of little to no interest. Entering the state “HIV infection” defines the start of the relevant time scale. We quantified the impact of cART via the variable “calendar time of follow-up.” We chose 1997 as the single cutpoint (which is certainly too simplistic because the efficacy of cART has improved over time). For individuals that became infected before 1997, the calendar period of follow-up changed over time. Fig. 4 gives the estimated cumulative probability of each cause of death. There is no indication that the other causes of death have become more frequent.
[Diagram: HIV infection → AIDS related; HIV infection → Liver related; HIV infection → Natural; HIV infection → Non natural]
FIG. 3 Spectrum in causes of death after HIV infection.
FIG. 4 Nonparametric estimate of the cumulative probability to die after HIV infection, before and after the introduction of combination antiretroviral therapy (cART). Four classes of causes of death are distinguished. Gray areas depict 95% confidence intervals.
Note, however, that this may be explained by the presence of risk factors that act as confounders for the relation between calendar period and COD. Risk group is one of them. In the populations from which our data were sampled, HIV infection through injecting drug use went down over time. This may explain the decrease in liver-related mortality (hepatitis C infection is very common in HIV infected injecting drug users (IDU)) as well as in the mortality due to nonnatural causes (overdose, suicide). We return to this example in Section 4.
2.3 AIDS and pre-AIDS death
We want to estimate the distribution of time from HIV infection to AIDS in the absence of antiretroviral treatment (often called the natural history of HIV infection). Data are from the Amsterdam Cohort Studies on HIV Infection and AIDS (http://www.amsterdamcohortstudies.org). In these studies, two different risk groups have been followed from 1984 onwards: men who have sex with men (MSM) and IDU. We compare the natural history in both groups. This question concerns etiology: do the differences in lifestyle and health between both groups have an effect on progression to AIDS? In the data some individuals died before they developed AIDS; pre-AIDS death is a competing event. This is depicted in Fig. 5. We want to quantify the marginal distribution of time-to-AIDS, i.e., the distribution in the hypothetical situation in which pre-AIDS death does not occur. The question is similar to the etiological one in the example of Section 2.1. There is one important difference: the assumption that time-to-AIDS and time-to-death are independent mechanisms is not valid. Table 2 shows that several of those that died before AIDS had clear signs of disease progression. Therefore they cannot be represented with respect to their progression to AIDS by the ones that were still alive at the same time since HIV infection, and we cannot use the Kaplan–Meier estimator in which we censor individuals when they die before AIDS. Because of their unhealthy and often marginalized lifestyle, the IDU group was expected to have a faster progression to AIDS. However, the Kaplan–Meier in Fig. 6 falsely suggests that injecting drugs slows down AIDS progression. Because those that are close to AIDS are selectively excluded from follow-up, the relatively healthy individuals receive more emphasis in the Kaplan–Meier, which makes the results too optimistic. The bias is largest in the IDU, because pre-AIDS mortality is much higher in that group.
[Diagram: HIV → AIDS; HIV → Death before AIDS]
FIG. 5 Data on time to AIDS, pre-AIDS death competing risk.
TABLE 2 Cause of death before AIDS.
Reason of death           Number
                          IDU   MSM
HIV-related infections      3     0
Overdose/suicide            6     0
Violence/accident           2     0
Liver cirrhosis             2     0
Cancer                      0     1
Heart attack                0     1
Unknown                     4     3

[Figure: curves for MSM and IDU; y-axis: Cumulative probability of AIDS; x-axis: Time since HIV infection (years)]
FIG. 6 Kaplan–Meier estimate of time from HIV infection to AIDS by risk group. Individuals are censored when they die before AIDS. IDU, injecting drug users; MSM, men who have sex with men.
For those that died before AIDS, we need to come up with another good guess of the time to AIDS if they had not died. The most extreme one is to assume that they were about to develop AIDS, let us say it would have occurred 1 day later. Then we basically combine both event types and estimate the overall time-to-event distribution, which leads to the estimate in Fig. 7.
[Figure: curves for MSM and IDU; y-axis: Cumulative probability of AIDS; x-axis: Time since HIV infection]
FIG. 7 Kaplan–Meier estimate of time from HIV infection to AIDS by risk group. Individuals that die before AIDS are assumed to develop AIDS 1 day later. IDU, injecting drug users; MSM, men who have sex with men.
[Figure: curves for MSM and IDU; y-axis: Crude risk; x-axis: Time since HIV infection (years)]
FIG. 8 Kaplan–Meier estimate of time from HIV infection to AIDS by risk group. Individuals that die before AIDS are assumed to never develop AIDS. IDU, injecting drug users; MSM, men who have sex with men.
At the other extreme is the (unrealistic) assumption that those that died were immune to developing AIDS. They remain in the event-free risk set during the complete follow-up. Hence, pre-AIDS death is treated as a competing risk and we obtain the estimate as in Fig. 8. This AIDS-specific cumulative probability is a quantity in its own right, but not the one of interest if we want to compare the natural history between both risk groups.
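The three ways of treating pre-AIDS death (censoring, AIDS one day later, never AIDS) can be put side by side on a small data set. The sketch below uses hypothetical toy data without ties or other censoring, not the Amsterdam Cohort data, and the function names are illustrative.

```python
# Hypothetical toy data: (years since HIV infection, first event). No censoring.
data = [(2, "aids"), (3, "death"), (4, "aids"), (5, "death"),
        (6, "aids"), (7, "death"), (8, "aids"), (9, "aids")]
n = len(data)


def combined(t):
    """Death counted as (almost) AIDS: overall cumulative incidence."""
    return sum(1 for x, _ in data if x <= t) / n


def competing(t):
    """Death as competing risk: those who died stay in the denominator."""
    return sum(1 for x, e in data if x <= t and e == "aids") / n


def censored_km(t):
    """Death as censoring: one minus the Kaplan-Meier estimate."""
    event_free = 1.0
    for x, e in sorted(data):
        if x > t:
            break
        if e == "aids":
            at_risk = sum(1 for y, _ in data if y >= x)
            event_free *= 1 - 1 / at_risk
    return 1 - event_free


for t in (4, 8):
    print(t, competing(t), censored_km(t), combined(t))
```

In this toy data set the estimate that censors at death lies between the two extremes, in line with the ordering of the curves described for Figs. 6 to 8.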
3 Basic quantities and their estimators
Section 2 focused on the research question, without considering estimation in detail. In this section, I define the quantities that are used in answering the research questions. Next I introduce their nonparametric estimators. Time-to-event data are often incomplete. I consider the setting with right censoring and/or left truncation. Estimation becomes more complicated, but the essential characteristics and main properties are not different from the setting with complete data. Therefore it is often insightful to consider the situation with complete information first. At the end of this section, I briefly cover regression models and software.
3.1 Definitions and notation
First consider the classical situation with a single event type. In the presence of competing risks, this is the setting that holds if we combine all event types. The event time is always relative to some time origin. Typically, this time origin is determined by the entry into the at-risk state. In Section 2.1, this was hospital admission, while in Sections 2.2 and 2.3 this was HIV infection. In the classical situation, there is only one single state that can be reached from the initial state. Let the random variable $T$ describe the time-to-event distribution $F$ in the population of interest: $T \sim F$ with $F(t) = P(T \le t)$. In mathematics, $F$ is called the cumulative distribution function, but I follow common practice in medicine and epidemiology and call it the cumulative incidence. For the complement, the probability to remain event-free, I write $\overline{F}$ instead of the more common notation $S$ (for survival function). Hence $\overline{F}(t) = 1 - F(t) = P(T > t)$. The distribution can also be described based on the hazard. Often, the population distribution is assumed to be absolutely continuous. Then, the hazard is defined as
\[ h(t) = \lim_{\Delta t \downarrow 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}, \]
and the relation between hazard and cumulative probability is $\overline{F}(t) = \exp\{-\int_0^t h(s)\,ds\}$. If $T$ is discrete, events can only occur at a finite set of time points, say $\tau_1 < \cdots < \tau_L$; I define $\tau_0 = 0$. The hazard at time $\tau_j$ is $h(\tau_j) = P(T = \tau_j \mid T \ge \tau_j)$, which can equivalently be written as $P(T = \tau_j \mid T > \tau_{j-1})$ or $P(T = \tau_j \mid T > \tau_j-)$. Using the law of conditional probabilities, we derive the product form of the cumulative incidence/survival function
\[ \overline{F}(t) = P(T > t) = \prod_{\tau_j \le t} P(T > \tau_j \mid T > \tau_{j-1}) = \prod_{\tau_j \le t} \{1 - h(\tau_j)\}. \tag{1} \]
The discrete setting is relevant for nonparametric estimation. Relation (1) forms the basis for the Kaplan–Meier estimator of the survival function. This product form has a characteristic that I will use in the sequel.
Lemma 1. The probability mass $P(T = \tau_i)$ can be written as
\[ F(\tau_i) - F(\tau_{i-1}) = \overline{F}(\tau_{i-1}) \times h(\tau_i). \]
Proof. We have
\[ P(T = \tau_i) = \prod_{j=1}^{i-1} P(T > \tau_j \mid T > \tau_{j-1}) \times P(T = \tau_i \mid T > \tau_{i-1}) = \overline{F}(\tau_{i-1}) \times h(\tau_i). \qquad \square \]
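Relation (1) and Lemma 1 can be checked on a small discrete example. The sketch below starts from a hypothetical set of discrete hazards (not taken from any data in this chapter), builds the event-free probabilities via the product form, and recovers the probability masses via Lemma 1; exact fractions are used so that the identities hold without rounding.

```python
from fractions import Fraction
from itertools import accumulate

# Hypothetical discrete hazards h(tau_1), ..., h(tau_4); the last one equals 1,
# so that everybody has had the event by tau_4.
hazard = [Fraction(1, 10), Fraction(1, 3), Fraction(1, 3), Fraction(1)]

# Relation (1): Fbar(tau_j) is the running product of 1 - h(tau_i) for i <= j.
event_free = []
running = Fraction(1)
for h in hazard:
    running *= 1 - h
    event_free.append(running)

# Fbar(tau_{j-1}), using Fbar(tau_0) = 1.
event_free_prev = [Fraction(1)] + event_free[:-1]

# Lemma 1: P(T = tau_i) = Fbar(tau_{i-1}) * h(tau_i).
mass = [fb * h for fb, h in zip(event_free_prev, hazard)]

# Cumulative incidence F(tau_j) as the running sum of the masses.
cum_inc = list(accumulate(mass))

print([str(m) for m in mass])  # 1/10, 3/10, 1/5, 2/5
```

The masses sum to one, and the running sum of the masses equals one minus the product form at every time point, which is the content of the lemma.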
3.1.1 Competing risks
I assume $K$ competing risks. Hence, individuals can progress from the initial state to one of $K$ possible states. Let $E$ denote the event type that occurs. The combination $(T, E)$ describes the probability to experience an event of type $k$ within some time span: $P(T \le t, E = k)$. Without loss of generality I number the competing event types as $1 < 2 < \cdots < K$. The cumulative probability to have an event of a specific type is often called the cause-specific cumulative incidence; I use notation $F_k$. It is also called a subdistribution, because it is a distribution that does not go up to the value one. This subdistribution can alternatively be described through a single random variable defined as
\[ T_k = T \times I\{E = k\} + \infty \times I\{E \ne k\}. \tag{2} \]
Hence we have $T_k \sim F_k(t) = P(T_k \le t) = P(T \le t, E = k)$. The definition of the hazard that uniquely describes this distribution is analogous to the classical setting. For continuous distributions we have
\[ h_k(t) = \lim_{\Delta t \downarrow 0} \frac{P(t \le T_k < t + \Delta t \mid T_k \ge t)}{\Delta t} \tag{3} \]
\[ \overline{F}_k(t) = P(T_k > t) = \exp\left\{-\int_0^t h_k(s)\,ds\right\} \tag{4} \]
and for discrete distributions
\[ h_k(\tau_j) = P(T_k = \tau_j \mid T_k \ge \tau_j) \tag{5} \]
\[ \overline{F}_k(t) = P(T_k > t) = \prod_{\tau_j \le t} \{1 - h_k(\tau_j)\}. \tag{6} \]
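Definition (2) and relations (5) and (6) can be illustrated with a small discrete joint distribution of $(T, E)$. The sketch below uses hypothetical probabilities with $K = 2$ and three time points; the function names are illustrative. Individuals with a competing event get $T_k = \infty$ and therefore stay in the conditioning set $\{T_k \ge \tau_j\}$.

```python
import math
from fractions import Fraction

# Hypothetical joint distribution P(T = tau_j, E = k) on tau = 1, 2, 3 with K = 2.
joint = {
    (1, 1): Fraction(1, 10), (1, 2): Fraction(2, 10),
    (2, 1): Fraction(2, 10), (2, 2): Fraction(1, 10),
    (3, 1): Fraction(1, 10), (3, 2): Fraction(3, 10),
}
taus = [1, 2, 3]


def subdist_time(t, e, k):
    """T_k from definition (2): the event time if E = k, infinity otherwise."""
    return t if e == k else math.inf


def cum_inc(k, t):
    """F_k(t) = P(T_k <= t) = P(T <= t, E = k)."""
    return sum(p for (tj, e), p in joint.items() if subdist_time(tj, e, k) <= t)


def subdist_hazard(k, tau):
    """h_k(tau_j) = P(T_k = tau_j | T_k >= tau_j), as in (5)."""
    mass = sum(p for (tj, e), p in joint.items() if subdist_time(tj, e, k) == tau)
    at_risk = sum(p for (tj, e), p in joint.items() if subdist_time(tj, e, k) >= tau)
    return mass / at_risk


def product_form(k, t):
    """Right-hand side of (6)."""
    out = Fraction(1)
    for tau in taus:
        if tau <= t:
            out *= 1 - subdist_hazard(k, tau)
    return out


print([str(subdist_hazard(1, tau)) for tau in taus])  # 1/10, 2/9, 1/7
```

For both event types the product form reproduces $1 - F_k(t)$ exactly, while $F_1$ and $F_2$ individually stay below one and add up to the overall cumulative incidence.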
3.1.2 Multistate approach
An alternative hazard that can be used as basis to describe the cause-specific cumulative incidence is based on $(T, E)$. It is called the cause-specific hazard. For continuous distributions, it is defined as
\[ \lambda_k(t) = \lim_{\Delta t \downarrow 0} \frac{P(t \le T < t + \Delta t, E = k \mid T \ge t)}{\Delta t}. \]
With the help of Fig. 9, we derive relation (7) between cause-specific hazard and cause-specific cumulative incidence
\[ F_k(t) = P(T \le t, E = k) = \int_0^t \overline{F}(s)\lambda_k(s)\,ds. \tag{7} \]
$\overline{F}(s)$ is the probability to remain free of any event type. It is determined by all cause-specific hazards via
\[ \overline{F}(t) = \exp\left\{-\int_0^t \sum_{e=1}^{K} \lambda_e(s)\,ds\right\}. \]
The cause-specific hazard is a special case of the transition hazard in a multistate model (Geskus, 2016; Putter et al., 2007). Therefore, using the cause-specific hazard as basis can be seen as the multistate approach to competing risks, in contrast to the subdistribution approach described earlier. Hence, two different hazards describe the rate of occurrence of events of a specific type in a competing risks setting and both can be used as basis for the cause-specific cumulative incidence. Table 3 summarizes the possible quantities that may be of interest in the presence of competing risks.
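The difference between the two hazards can be made concrete in the simplest continuous setting. The sketch below assumes two constant cause-specific hazards (the values are hypothetical), evaluates relation (7) numerically, compares it with the closed form that holds under constant hazards, and derives the subdistribution hazard from (4) as $h_k(t) = \overline{F}(t)\lambda_k(t) / \{1 - F_k(t)\}$.

```python
import math

# Assumed constant cause-specific hazards for K = 2 event types (per year).
lam = {1: 0.3, 2: 0.1}
lam_total = sum(lam.values())


def overall_event_free(t):
    """Fbar(t) = exp(-integral of the summed cause-specific hazards)."""
    return math.exp(-lam_total * t)


def cum_inc_numeric(k, t, steps=100_000):
    """F_k(t) via relation (7), approximated by a midpoint Riemann sum."""
    ds = t / steps
    return sum(overall_event_free((i + 0.5) * ds) * lam[k] * ds for i in range(steps))


def cum_inc_closed(k, t):
    """Closed form of (7) when all cause-specific hazards are constant."""
    return lam[k] / lam_total * (1 - math.exp(-lam_total * t))


def subdist_hazard(k, t):
    """h_k(t) = -d/dt log{1 - F_k(t)} = Fbar(t) * lambda_k / {1 - F_k(t)}."""
    return overall_event_free(t) * lam[k] / (1 - cum_inc_closed(k, t))


for t in (0.0, 1.0, 5.0):
    print(t, lam[1], round(subdist_hazard(1, t), 4))
```

Under these assumptions the cause-specific hazard stays constant, whereas the subdistribution hazard starts at the same value and decreases over time, because individuals with a competing event remain in its risk set.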
3.2 Data setup
In our data, information on event times and event types may be incomplete. There are right censored data if the event of interest has not been observed in some individuals until the end of their follow-up. Two reasons for right censored data are loss to follow-up and administrative censoring. Individuals that left the study before the event could occur are called lost to follow-up.
FIG. 9 From cause-specific hazard to cumulative incidence.
TABLE 3 Hazard types and corresponding cumulative quantities.
Hazard                                   Cumulative quantity
Event types separate
  Marginal          kλ                   Marginal cumulative incidence          kF(t)
  Cause-specific    $\lambda_k$          No corresponding quantity
  Subdistribution   $h_k$                Cause-specific cumulative incidence    $F_k(t)$
Event types combined
  Overall           $h$                  Overall cumulative incidence           $F(t)$
Individuals that are still in the study and event free at the date of analysis are called administratively censored. Other types of censored data, such as interval censored data or doubly censored data, are not considered here. The occurrence of a competing event also prevents the event of interest from being observed. If we are interested in the marginal distribution, then the occurrence of a competing event is interpreted as censored data as well. If we perform a competing risks analysis, it is interpreted as a separate type of event. Another type of incomplete information is due to left truncation. This can occur if there is late entry: some individuals enter the study after entering the initial state. It may cause some individuals to be missed because they already experienced the final event before they could enter the study. Individuals with a shorter event time are more likely to be missed. This length-biased sampling is called left truncation. Note that there can be late entry without left truncation (Geskus, 2016, p. 12). Information on the event type may be incomplete as well. For example, some individuals may have died from unknown cause. This is another type of incomplete data that we do not consider here (see, e.g., Bakoyannis et al. (2010) and Goetghebeur and Ryan (1995)). Let $N$ be the sample size. The complete information for individual $i \in \{1, 2, \ldots, N\}$ is $(t_i, e_i)$. The event may be unobserved due to censoring at $c_i < t_i$. Define $x_i = \min\{t_i, c_i\}$ and $\delta_i = I\{t_i \le c_i\}$. An individual may enter the study after the time origin, at time $v_i$. The observed data can be represented as $\{(v_1, x_1, e_1\delta_1), \ldots, (v_N, x_N, e_N\delta_N)\}$. I use $t_{(1)} < t_{(2)} < \cdots < t_{(n)}$ to denote the ordered distinct observed event times. Note that $n < N$ if there are ties in the observed event times. Similarly I let $c_{(1)} < \cdots < c_{(n_c)}$ and $v_{(1)} < \cdots < v_{(n_l)}$ denote the ordered distinct observed censoring and entry times, respectively. Let $m_i$ be the number of censorings at $c_{(i)}$, and let $w_i$ be the number of entries at $v_{(i)}$.
For each $t_{(i)}$, define $d_k(t_{(i)})$ as the number of observed events of type $k$ at $t_{(i)}$ and $d(t_{(i)})$ as the number of observed events of any type at $t_{(i)}$. I make a distinction between individuals that are “observed to be at risk” and those that are “in the subdistribution risk set.” In the multistate approach, individuals leave the risk set when they are no longer observed to be at risk for experiencing an event, either because they are censored or because they experience one of the event types. In the subdistribution approach, individuals that experience a competing event are no longer at risk for the event of interest, but they do remain in the risk set. Let $r(t)$ be the number of individuals that are in follow-up and event free at time $t$, i.e., observed to be at risk. I use $r^*(t)$ to denote the number in the subdistribution risk set, which also includes individuals that had an earlier competing event. Individuals with a competing event remain included until the time they would be censored. This time is usually not known because the event comes first, but I will explain an elegant alternative using inverse probability weights. Events, censorings and late entries may occur at the same time point $t$. In that case I assume the ordering $t_i < c_j < v_{j'}$: events come first, then censorings, then entries. Hence, when estimating the event hazard or event probability at $t$, individuals that are censored at $t$ are included in $r(t)$, but individuals that enter at $t$ are not. The three quantities in Table 3 that are of most interest in a competing risks analysis are the cause-specific hazard, the subdistribution hazard, and the cause-specific cumulative incidence. In the next section, I introduce their nonparametric estimators.
3.3 Nonparametric estimation
For nonparametric estimation, the data are taken “as is,” and the finite number of observations suggests a discrete distribution.

3.3.1 Complete data
If we have neither right censoring nor left truncation in our data, the cause-specific cumulative incidence is most easily estimated via the empirical cumulative distribution function (ECDF), as I did in Table 1:
\[
\hat F_k^{EC}(t) = \frac{\#\{i : t_i \le t,\ e_i = k\}}{N}. \tag{8}
\]
We can also obtain an estimator via the product form (6) and the estimator of the subdistribution hazard that is introduced in (15):
\[
\hat F_k^{PL}(t) = 1 - \prod_{t_{(i)} \le t} \left\{ 1 - \frac{d_k(t_{(i)})}{r^*(t_{(i)})} \right\}.
\]
Individuals with a competing event never leave the risk set r*(t); its size only decreases when individuals experience the event of interest. It is easy to show that this “product-limit (PL)” estimator is equivalent to the ECDF form. It is generalized to data with left truncation and/or right censoring in (16).
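The equivalence of the two forms for complete data can be checked numerically. The following is a minimal sketch, not the author's code: the data set and the function names `ecdf_cuminc` and `pl_cuminc` are hypothetical, and exact rational arithmetic is used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction

def ecdf_cuminc(data, k, t):
    """ECDF form: share of individuals with a type-k event at or before t."""
    return Fraction(sum(1 for ti, ei in data if ti <= t and ei == k), len(data))

def pl_cuminc(data, k, t):
    """Product-limit form based on the subdistribution risk set r*(t)."""
    surv = Fraction(1)
    for s in sorted({ti for ti, _ in data if ti <= t}):
        d_k = sum(1 for ti, ei in data if ti == s and ei == k)
        # r*(s): everyone except individuals with an earlier type-k event;
        # individuals with a competing event never leave the risk set
        r_star = sum(1 for ti, ei in data if ti >= s or ei != k)
        surv *= 1 - Fraction(d_k, r_star)
    return 1 - surv

# hypothetical complete data: (event time, event type)
data = [(1, 2), (2, 1), (2, 1), (3, 2), (4, 1), (6, 2)]
for t in range(0, 7):
    assert ecdf_cuminc(data, 1, t) == pl_cuminc(data, 1, t)
```

With these six observations both forms give 1/3 at t = 2 and 1/2 at t = 4 for event type 1.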
3.3.2 Cause-specific hazard: Aalen–Johansen estimator
The estimator of the cause-specific hazard is completely standard:
\[
\hat\lambda_k(t_{(j)}) = \frac{d_k(t_{(j)})}{r(t_{(j)})}. \tag{9}
\]
It forms the basis of the classical Aalen–Johansen estimator of the cause-specific cumulative incidence, which is based on relation (7) from the multistate approach:
\[
\hat F_k^{AJ}(t) = \sum_{i:\, t_{(i)} \le t} \frac{d_k(t_{(i)})}{r(t_{(i)})}\, \hat{\overline F}{}^{PL}(t_{(i)}-), \tag{10}
\]
in which $\hat{\overline F}{}^{PL}$ is the Kaplan–Meier estimator for all event types combined:
\[
\hat{\overline F}{}^{PL}(t_{(i)}-) = \prod_{j:\, t_{(j)} < t_{(i)}} \left\{ 1 - \frac{d(t_{(j)})}{r(t_{(j)})} \right\}.
\]
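\[
\text{weight of individual } j \text{ at } t_{(i)} =
\begin{cases}
0 & \text{if } j \text{ is censored or had event of type } k \text{ before } t_{(i)}, \\[4pt]
1 & \text{if } j \text{ is event free and under observation at } t_{(i)}, \\[4pt]
\dfrac{\hat\Gamma(t_{(i)}-)\,\hat\Phi(t_{(i)}-)}{\hat\Gamma(t_{(\nu)}-)\,\hat\Phi(t_{(\nu)}-)} & \text{if } j \text{ had competing event observed at } t_{(\nu)} < t_{(i)}.
\end{cases} \tag{14}
\]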
The estimator of the subdistribution hazard replaces the number observed to be at risk $r$ by the number in the subdistribution risk set $r^*$:
\[
\hat h_k(t_{(j)}) = \frac{d_k(t_{(j)})}{r^*(t_{(j)})}. \tag{15}
\]
Interpretation of the subdistribution hazard is somewhat problematic (Andersen and Keiding, 2012). It goes against the classical epidemiological concept of a rate, in which all individuals in the denominator are at risk for experiencing the event of interest. Since the subdistribution hazard describes the cause-specific cumulative incidence, we can use (6) to obtain the PL estimator of the cause-specific cumulative incidence:
\[
\hat F_k^{PL}(t) = 1 - \prod_{t_{(i)} \le t} \left\{ 1 - \frac{d_k(t_{(i)})}{r^*(t_{(i)})} \right\}. \tag{16}
\]
3.3.5 Weighted ECDF estimator
This estimator extends the ECDF for complete data (8) to the setting with late entries and/or censored observations:
\[
\hat F_k^{EC}(t) = \frac{1}{\hat N} \sum_{t_{(i)} \le t} \frac{d_k(t_{(i)})}{\hat\Gamma(t_{(i)}-)\,\hat\Phi(t_{(i)}-)}. \tag{17}
\]
3.3.6 Equivalence
Although the formulas (10), (16), and (17) look very different, the three estimators of the cause-specific cumulative incidence are algebraically equivalent.

Theorem 1.
\[
\hat F_k^{AJ}(t) = \hat F_k^{EC}(t) = \hat F_k^{PL}(t) \quad \text{for all time points } t. \tag{18}
\]
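The equivalence can be illustrated numerically in the simplest incomplete-data setting. The sketch below is an illustration under stated assumptions, not the author's implementation: it covers right censoring only (no late entry, so the entry-time factor is taken as 1 and the estimated sample size as N), it takes the censoring weight to be the left limit of the Kaplan–Meier estimator of the censoring distribution, and it uses the tie convention that events precede censorings. The function name `cuminc_three_ways` and the data are hypothetical.

```python
from fractions import Fraction

def cuminc_three_ways(data, k):
    """data: list of (x_i, status); status 0 = censored, 1, 2, ... = event type.
    Returns {event time: (AJ, weighted ECDF, PL)} for cause k."""
    n_total = len(data)
    times = sorted({x for x, _ in data})

    # Left limits G(s-) of the Kaplan-Meier estimator of the censoring distribution.
    g_minus, g = {}, Fraction(1)
    for s in times:
        g_minus[s] = g
        m = sum(1 for x, e in data if x == s and e == 0)
        if m:
            # events at s come first, so they have already left this risk set
            still_in = sum(1 for x, _ in data if x > s) + m
            g *= 1 - Fraction(m, still_in)

    out = {}
    surv_all = Fraction(1)   # overall Kaplan-Meier survival, left limit
    surv_sub = Fraction(1)   # running product of the PL form
    aj = ec = Fraction(0)
    for s in times:
        d = sum(1 for x, e in data if x == s and e != 0)
        if d == 0:
            continue
        d_k = sum(1 for x, e in data if x == s and e == k)
        r = sum(1 for x, _ in data if x >= s)
        # subdistribution risk set: those at risk plus weighted earlier competing events
        r_star = Fraction(r) + sum(g_minus[s] / g_minus[x]
                                   for x, e in data if x < s and e not in (0, k))
        aj += Fraction(d_k, r) * surv_all
        ec += Fraction(d_k) / (n_total * g_minus[s])
        surv_sub *= 1 - Fraction(d_k) / r_star
        surv_all *= 1 - Fraction(d, r)
        out[s] = (aj, ec, 1 - surv_sub)
    return out

# hypothetical right-censored data with ties
example = [(1, 1), (1, 0), (2, 2), (3, 0), (3, 1), (5, 1), (6, 0)]
for aj_val, ec_val, pl_val in cuminc_three_ways(example, 1).values():
    assert aj_val == ec_val == pl_val
```

Exact fractions are used because the statement is an algebraic identity; with floating point the three values would agree only up to rounding error.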
Before I prove Theorem 1, I first derive some properties that are of interest in themselves. Although $\hat\Phi(t)$ is a product over all entry times after $t$, the combination $\hat N\,\hat\Phi(t)$ can be computed based on the information up to time $t$. This is a consequence of Lemma 2.

Lemma 2. Define $\hat\Phi^{\sigma}(t)$ and $\hat N^{\sigma}$ as the values of (12) and (13) based on the information up to time $\sigma$. Let $\sigma_0$ be a fixed time point. Then we have for all $\sigma \ge \sigma_0$
\[
\hat\Phi^{\sigma}(t)\,\hat N^{\sigma} = \hat\Phi^{\sigma_0}(t)\,\hat N^{\sigma_0} \quad \text{for all } t \le \sigma_0. \tag{19}
\]
Proof. Since we ignore the new entries at the upper limit $\sigma$, we have $\hat\Phi^{\sigma}(\sigma) = 1$ for all $\sigma$.
The proof goes by induction on the entry times after $\sigma_0$. Let the entry times after $\sigma_0$ be denoted as $(v^1, v^2, \ldots, v^{n_l+1-\nu}) = (v_{(\nu)}, v_{(\nu+1)}, \ldots, v_{(n_l)})$. Define $I_1 = (\sigma_0, v^1]$ and $I_i = (v^{i-1}, v^i]$. We have $\hat\Phi^{\sigma} \equiv \hat\Phi^{\sigma_0}$ for all $\sigma \in I_1$, because there are no new entries. This also holds for $\sigma = v^1$, because new entries at the upper limit are ignored. Censorings at $\sigma_0$ may become events until $\sigma$, but that does not change $\hat N$. Hence equality (19) holds for all $\sigma \le v^1$ (and even for all $t \le \sigma$). Suppose (19) holds for all $\sigma \le v^i$. We show it holds for all $\sigma \in I_{i+1} = (v^i, v^{i+1}]$. Let $\{x_j^i\}$ and $\{x_j^\sigma\}$ be the event and censoring time information collected up to $v^i$ and $\sigma$, respectively. Since there are no new entries between $v^i$ and $v^{i+1}$, $\hat\Phi^{\sigma}(t) = 1$ for all $v^i < t \le \sigma$. Hence $\hat N^{\sigma}$ can be written as
X
1 + σ b xσ vi Φσ ðxj Þ j
X 1 1 ¼ + rðvi + Þ, σ σ b b vi Fα,p,n+mp1 to BF10 ðX, YÞ. At significance level α, the null hypothesis is rejected if
p=2 τ* 1 BF10 ðX, YÞ > τ*α 1 α * Cn , (26) τα
1 where Cn ¼ ðpFα,p,n + mp1 Þ pFα,p,n + mp1 + n + m p 1 ,
1 1 . τα ¼ nmfðn + mÞτα g , and τα ¼ nm ðn + mÞFα,p,n+mp1 1
2.5 Dependent observations
For testing equality of means of two populations as presented in (1), the observations from each population are assumed to be independently and identically distributed. Most of the test statistics presented so far have been developed under several assumptions constraining the dependence structure. The testing problem has also been addressed when the covariance matrices are structured (Cai et al., 2014; Gregory et al., 2015; Zhong et al., 2013). But what happens if the observations are identically distributed but are not independent? Suppose the observations have the following covariance structure, parametrized as $\operatorname{cov}(X_i, X_j) = \Sigma_1^{(i,j)}$ and $\operatorname{cov}(Y_i, Y_j) = \Sigma_2^{(i,j)}$. Then for any $i$ and $j$, the expected values of the inner products of the variables will be $E(X_i^\top X_j) = \mu_1^\top \mu_1 + \operatorname{tr}\{\Sigma_1^{(i,j)}\}$ and $E(Y_i^\top Y_j) = \mu_2^\top \mu_2 + \operatorname{tr}\{\Sigma_2^{(i,j)}\}$, respectively. Considering the functional based on the Euclidean norm of $\bar X - \bar Y$, its expected value will be
\[
E\left\{ (\bar X - \bar Y)^\top (\bar X - \bar Y) \right\} = (\mu_1 - \mu_2)^\top (\mu_1 - \mu_2) + \frac{1}{n^2} \sum_{i,j=1}^{n} \operatorname{tr}\{\Sigma_1^{(i,j)}\} + \frac{1}{m^2} \sum_{i,j=1}^{m} \operatorname{tr}\{\Sigma_2^{(i,j)}\}. \tag{27}
\]
Since the samples are assumed to be identically distributed, we have $\Sigma_1^{(i,i)} = \Sigma_1$ and $\Sigma_2^{(i,i)} = \Sigma_2$. In the independent case, additionally we have $\Sigma_1^{(i,j)} = \Sigma_2^{(i,j)} = 0_{p \times p}$ when $i \ne j$. Under the dependence structure, we have an additional $n(n-1) + m(m-1)$ covariance matrices in the model. An unstructured dependence structure will therefore be infeasible because for any $i$ and $j$, we have only one pair of observations $(X_i, X_j)$ to estimate $\Sigma_1^{(i,j)}$. To make estimation feasible, assume second-order stationarity on the dependence structures,
\[
\operatorname{cov}(X_i, X_j) = \Sigma_1^{(i,j)} = \Sigma_1(i - j), \qquad \operatorname{cov}(Y_i, Y_j) = \Sigma_2^{(i,j)} = \Sigma_2(i - j).
\]
By symmetry, we have $\Sigma_1(-a) = \Sigma_1^\top(a)$ and $\Sigma_2(-a) = \Sigma_2^\top(a)$ for all $a \in \mathbb{Z}^{+}$. In time series, $\{\Sigma_1(a), a \in \mathbb{Z}\}$ and $\{\Sigma_2(a), a \in \mathbb{Z}\}$ represent the autocovariance functions of the two populations, respectively. The matrices $\Sigma_1(a)$ and $\Sigma_2(a)$ represent the autocovariance at lag $a$. Using the autocovariance function, the expected value in (27) simplifies to
\[
E\left\{ (\bar X - \bar Y)^\top (\bar X - \bar Y) \right\} = (\mu_1 - \mu_2)^\top (\mu_1 - \mu_2) + \frac{1}{n^2} \sum_{a=-(n-1)}^{n-1} (n - |a|) \operatorname{tr}\{\Sigma_1(a)\} + \frac{1}{m^2} \sum_{a=-(m-1)}^{m-1} (m - |a|) \operatorname{tr}\{\Sigma_2(a)\}. \tag{28}
\]
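The step from (27) to (28) rests on a counting argument that may be worth spelling out. As a short sketch using only the stationarity assumption above: for a fixed lag $a$, the number of index pairs $(i, j)$ in $\{1, \ldots, n\}^2$ with $i - j = a$ is $n - |a|$, so the double sum over pairs collapses to a single sum over lags.

```latex
\sum_{i,j=1}^{n} \operatorname{tr}\{\Sigma_1^{(i,j)}\}
  = \sum_{i,j=1}^{n} \operatorname{tr}\{\Sigma_1(i-j)\}
  = \sum_{a=-(n-1)}^{n-1} \#\{(i,j) : i-j=a\}\, \operatorname{tr}\{\Sigma_1(a)\}
  = \sum_{a=-(n-1)}^{n-1} (n-|a|)\, \operatorname{tr}\{\Sigma_1(a)\}.
```

The same argument with $m$ in place of $n$ gives the term for the second population.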
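A functional that is unbiased for the Euclidean norm of $\mu_1 - \mu_2$ can be constructed using (28) as
\[
M = (\bar X - \bar Y)^\top (\bar X - \bar Y) - \left[ \frac{1}{n^2} \sum_{a=-(n-1)}^{n-1} (n - |a|) \operatorname{tr}\{\hat\Sigma_1(a)\} + \frac{1}{m^2} \sum_{a=-(m-1)}^{m-1} (m - |a|) \operatorname{tr}\{\hat\Sigma_2(a)\} \right], \tag{29}
\]
where $\hat\Sigma_1(a)$ and $\hat\Sigma_2(a)$ are the biased estimators of $\Sigma_1(a)$ and $\Sigma_2(a)$, respectively, defined as
\[
\hat\Sigma_1(a) = \frac{1}{n} \sum_{i=1}^{n-|a|} (X_i - \bar X)(X_{i+a} - \bar X)^\top, \qquad
\hat\Sigma_2(a) = \frac{1}{m} \sum_{i=1}^{m-|a|} (Y_i - \bar Y)(Y_{i+a} - \bar Y)^\top. \tag{30}
\]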
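Only the traces of the estimators in (30) enter the functional in (29), so no $p \times p$ matrix has to be formed. The following is a minimal sketch of that computation, assuming each sample is stored as a list of equal-length rows ordered in time; the function names and the toy data are hypothetical, and negative lags are handled through the symmetry of the trace.

```python
def mean_vec(rows):
    """Component-wise sample mean of a list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def trace_autocov(rows, a):
    """Trace of the biased lag-a autocovariance estimator."""
    n, a = len(rows), abs(a)          # tr{Sigma(-a)} = tr{Sigma(a)}
    xbar = mean_vec(rows)
    centred = [[x - m for x, m in zip(row, xbar)] for row in rows]
    # tr{(X_i - Xbar)(X_{i+a} - Xbar)^T} is the inner product of the two vectors
    return sum(sum(u * v for u, v in zip(centred[i], centred[i + a]))
               for i in range(n - a)) / n

def mean_cov_trace(rows):
    """(1/n^2) * sum over lags of (n - |a|) * tr{Sigma_hat(a)}."""
    n = len(rows)
    return sum((n - abs(a)) * trace_autocov(rows, a)
               for a in range(-(n - 1), n)) / n ** 2

def functional_m(x_rows, y_rows):
    """Squared distance of the sample means minus the two trace corrections."""
    diff = [u - v for u, v in zip(mean_vec(x_rows), mean_vec(y_rows))]
    return sum(d * d for d in diff) - (mean_cov_trace(x_rows) + mean_cov_trace(y_rows))

# hypothetical toy series with n = 3 time points and p = 2 components
toy = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
print(trace_autocov(toy, 0), trace_autocov(toy, 1), functional_m(toy, toy))
```

Each trace costs O(np) operations, so the full correction term is O(n²p); for an M-dependent model one would restrict the lag range to |a| ≤ M.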
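These are the biased estimators (Brockwell and Davis, 1986), which should be of no concern to us since we are only interested in their trace. When $p$ is finite, these estimators are known to be asymptotically unbiased. However, in high dimensions, when $p$ increases with $n$, this property is no longer valid. For instance, the expected value of $\operatorname{tr}\{\hat\Sigma_1(a)\}$ for a lag $a \ge 0$ will be $E[\operatorname{tr}\{\hat\Sigma_1(a)\}] = \sum_{b=0}^{n-1} \theta_n(a, b) \operatorname{tr}\{\Sigma_1(b)\}$, where
\[
\theta_n(a, b) = \left(1 - \frac{a}{n}\right) \mathbb{1}(a = b) + \frac{1}{n}\left(1 - \frac{a}{n}\right)\left(1 - \frac{b}{n}\right)\{2 - \mathbb{1}(b = 0)\} - \frac{1}{n^2} \sum_{t=1}^{n-a} \sum_{s=1}^{n} \left\{ \mathbb{1}(|t - s| = b) + \mathbb{1}(|t + a - s| = b) \right\}. \tag{31}
\]
Asymptotic unbiasedness for finite $p$ follows from the leading term converging to 1 and the second and third terms, which are $O(n^{-1})$, converging to zero as $n$ goes to infinity because $\operatorname{tr}\{\Sigma_1(a)\} = O(1)$. In high dimension, if the autocovariance structure is proper with all eigenvalues being nonzero, then $\operatorname{tr}\{\Sigma_k(a)\} = O(p)$ for $k = 1, 2$ and all lags $a$. Hence all three terms in the expression for $\theta_n(a, b)$ in (31) should be considered. The expected value of $\operatorname{tr}\{\hat\Sigma_1(a)\}$ depends on the autocovariance matrices at all lags through the trace function, which is a univariate measure of the matrix. Expressing this in vector form, we have $E(\hat\gamma_n) = \Theta_n \gamma$, where $\Theta_n = (\theta_n(a, b))_{a, b \in \{0, \ldots, n-1\}}$, $\gamma = (\operatorname{tr}\{\Sigma_1(0)\}, \ldots, \operatorname{tr}\{\Sigma_1(n-1)\})$ and $\hat\gamma_n = (\operatorname{tr}\{\hat\Sigma_1(0)\}, \ldots, \operatorname{tr}\{\hat\Sigma_1(n-1)\})$. This property can be used to construct unbiased estimators for $\operatorname{tr}\{\Sigma_1(a)\}$ as elements of the vector $\hat\gamma_n^* = \Theta_n^{-1} \hat\gamma_n$. Denoting the elements of $\hat\gamma_n^*$ as $\operatorname{tr}\{\hat\Sigma_1^*(a)\}$, and defining $\operatorname{tr}\{\hat\Sigma_2^*(a)\}$ analogously for the second sample, the functional can finally be constructed as
\[
M_n = (\bar X - \bar Y)^\top (\bar X - \bar Y) - \left[ \frac{1}{n^2} \sum_{a=-(n-1)}^{n-1} (n - |a|) \operatorname{tr}\{\hat\Sigma_1^*(a)\} + \frac{1}{m^2} \sum_{a=-(m-1)}^{m-1} (m - |a|) \operatorname{tr}\{\hat\Sigma_2^*(a)\} \right]. \tag{32}
\]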
Ayyala et al. (2017) proposed a test statistic based on $M_n$ defined in (32). In addition to the second-order stationary autocovariance structure, observations from the two populations are assumed to be realizations of two independent M-dependent strictly stationary Gaussian processes with means $\mu_1$ and $\mu_2$ and autocovariance structures $\{\Sigma_1(a)\}$ and $\{\Sigma_2(a)\}$, respectively. The M-dependence structure forces the autocovariance matrices to be zero for lags greater than $M$. Properties of the test statistic are established under the following assumptions:

(APR I) The observations are realizations of M-dependent strictly stationary Gaussian processes.

(APR II) The rates of increase of dimension $p$ and order $M$ with respect to $n$ are linear and polynomial, respectively,
\[
p = O(n), \qquad M = O(n^{1/8}).
\]

(APR III) For any $k_1, k_2, k_3, k_4 \in \{1, 2\}$,
\[
\operatorname{tr}\{\Sigma_{k_1}(a) \Sigma_{k_2}(b) \Sigma_{k_3}(c) \Sigma_{k_4}(d)\} = o\left[ (M + 1)^{-4} \operatorname{tr}^2\{(\Omega_1 + \Omega_2)^2\} \right],
\]
where $\Omega_1 = \sum_{a=-M}^{M} (1 - |a|/n) \Sigma_1(a)$ and $\Omega_2 = \sum_{a=-M}^{M} (1 - |a|/n) \Sigma_2(a)$.

(APR IV) The means $\mu_1$ and $\mu_2$ satisfy the local alternative condition
\[
(\mu_1 - \mu_2)^\top \{\Sigma_w(a) \Sigma_w(a)\}^{1/2} (\mu_1 - \mu_2) = o\left[ (M + 1)^{-4} n^{-1} \operatorname{tr}\{(\Omega_1 + \Omega_2)^2\} \right].
\]

To better understand the condition (APR III) on the autocovariance structure, consider the following illustration. Setting $M = 0$ and $\Sigma_1(a) = \Sigma_2(a) = 0$ for all $a \ne 0$, it is straightforward to see that the conditions (APR III) and (APR IV) are similar to (CQ III) and (CQ IV). The test statistic is given by
\[
T_{APR} = \frac{M_n}{\sqrt{\widehat{\operatorname{var}}(M_n)}}, \tag{33}
\]
where the variance estimate is constructed similarly to $T_{CQ}$ and $T_{PA}$, using a leave-out method for better asymptotic properties. For the exact form of the estimator, please refer to Ayyala et al. (2017). Under the conditions (APR I)–(APR IV), $T_{APR}$ is shown to be asymptotically normal. While the test statistic and the empirical studies of Ayyala et al. are valid, Cho et al. (2019) identified some theoretical errors in the proofs and provided corrections to some results and assumptions in Ayyala et al. (2017). One issue that still needs to be addressed is the choice of $M$. Simulation studies reported in Ayyala et al. indicate that over-estimating $M$ is better than under-estimating it. When the specified value of $M$ in the analysis is greater than the true order of dependency, the only error is in estimating zero matrices for lags greater than the true $M$. Under-specifying the value results in bias, as autocovariances for several lags will not be estimated. Accurate estimation of $M$ using the data is not addressed and remains an open area of research. A large class of models can be approximated using M-dependent strictly stationary processes. Tests for other classes of models, such as second-order stationary processes or non-Gaussian processes, are another area of active research.
3 Covariance matrix
The covariance matrix of a multivariate random variable is a measure of dependence between the components of the variable. It is the second-order central moment of the variable, defined as $\Sigma = \operatorname{var}(X) = E\{(X - \mu)(X - \mu)^\top\}$, where $\mu = E(X)$. The covariance matrix is often reparameterized using its inverse, called the precision matrix, $\Omega = \Sigma^{-1}$. Elements of the precision matrix are useful in determining conditional independence under normality. If $X \sim N(\mu, \Sigma)$, then $\Omega_{ij} = 0$ implies $X_i$ is independent of $X_j$ conditional on $\{X_k : k \ne i, j\}$. The precision matrix is important because it can be used to construct an undirected graphical network model. Representing the components as nodes of the network, edges are defined by the elements of $\Omega = (\omega_{ij})$, where $\omega_{ij} \ne 0$ indicates the presence of an edge and $\omega_{ij} = 0$ indicates the absence of an edge between nodes $i$ and $j$. In view of these properties of the covariance matrix and other distributional properties, normality of the variables is commonly assumed in covariance matrix estimation. Unless otherwise stated, we shall assume the variables are normally distributed for the remainder of this section. Given an i.i.d. sample $X_i \sim N(\mu, \Sigma)$, $i = 1, \ldots, n$, the biased sample covariance matrix is defined as
\[
S = (s_{ij})_{i,j=1,\ldots,p} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)(X_i - \bar X)^\top, \tag{34}
\]
with $E(S) = \{(n - 1)/n\}\Sigma$ and $\operatorname{rank}(S) = \min(n - 1, p)$. In the traditional multivariate setting with $p < n$, $S$ is nonsingular and consistent for $\Sigma$. The sampling distribution of $nS$ is a Wishart distribution with $n - 1$ degrees of freedom (Anderson, 2003; Muirhead, 1982). Additionally, the eigenvalues of $S$ are consistent for the eigenvalues of $\Sigma$. Asymptotically, the eigenvalues are normally distributed, a result that can be used to construct hypothesis tests. Estimation of the eigenvalues of $\Sigma$ is of importance because they give the variance of the principal components, which are useful in constructing lower dimensional embeddings of the data (dimension reduction). Hypothesis tests concerning the structure of the covariance matrix, such as sphericity ($H_0 : \Sigma = \sigma^2 I$) and uniform correlation ($H_0 : \Sigma = \sigma^2\{(1 - \rho) I + \rho 1 1^\top\}$), are constructed using this property (Anderson, 2003; Muirhead, 1982). Testing equality of covariance matrices for two or more groups is also well-defined when using the sample covariance matrix and its Wishart properties. Results from traditional multivariate analysis are valid only when $n > p$ and $p$ is assumed to be fixed. In high-dimensional analysis, as seen in Section 2, $p$ is assumed to be increasing with $n$. How can we construct consistent estimators for $\Sigma$ and test statistics to compare the covariance structures of two or more populations in high dimension? In high-dimensional models with $p > n$, the sample covariance matrix $S$ is rank-deficient. Estimation of $\Sigma$ and $\Omega$ also suffers from the curse of dimensionality even when $p < n$ with $p/n \to c \in (0, 1)$. When $p \to \infty$, $S$ is no longer consistent for $\Sigma$. Estimation of $\Sigma$ was not an issue in tests for the mean vector since we were only interested in a consistent estimator for a function of $\Sigma$, e.g., $\operatorname{tr}(\Sigma)$ or $\operatorname{tr}(\Sigma^2)$. However, for hypothesis tests on the covariance matrix, the entire covariance matrix or an appropriate functional (not necessarily the trace) needs to be estimated consistently.
3.1 Estimation
To obtain consistent estimators for $\Sigma$, two methods for reducing the parameter space dimension are used: structural constraints or regularization through sparsity. A banding approach, proposed by Bickel and Levina (2008), sets elements outside a band around the diagonal to zero. For any $1 \le k \le p$, the banded estimator $\hat\Sigma^{(k)}$ is defined as
\[
\hat\Sigma^{(k)}_{ij} = \begin{cases} s_{ij} & \text{if } |i - j| < k, \\ 0 & \text{if } |i - j| \ge k. \end{cases} \tag{35}
\]
Here $k$ denotes the width of the band, so that $\hat\Sigma^{(1)}$ is the diagonal estimator. The estimator is consistent for $\Sigma$ under the $\ell_2$ matrix norm when $\log p / n \to 0$. The optimal value of $k$ is chosen using K-fold cross-validation of the estimated risk. Banding is particularly effective when the components of $X$ are ordered so that $|\sigma_{ij}|$ decreases as $|i - j|$ increases. Consistency of the estimator is also shown to hold for non-Gaussian variables whose elements have subexponential tails.

Regularization is a more commonly used approach for covariance matrix estimation as it is easier to formulate mathematically. Under normality, the log-likelihood of $\Sigma$ given a sample $X_1, \ldots, X_n$ can be expressed, up to an additive constant, as
\[
L(\Sigma \mid X_1, \ldots, X_n) = -n \log\left\{\sqrt{\det \Sigma}\right\} - \frac{1}{2} \sum_{i=1}^{n} (X_i - \bar X)^\top \Sigma^{-1} (X_i - \bar X) = -\frac{n}{2} \left[ \log\{\det \Sigma\} + \operatorname{tr}(S \Sigma^{-1}) \right]. \tag{36}
\]
The expression of the second term follows by applying the matrix result that for any $p$-dimensional vector $x$ and $p \times p$ matrix $B$, we have $x^\top B x = \operatorname{tr}(x^\top B x) = \operatorname{tr}(B x x^\top)$. Alternatively, the likelihood can be expressed in terms of the precision matrix $\Omega$ as
\[
L(\Omega \mid X_1, \ldots, X_n) = \frac{n}{2} \left[ \log\{\det \Omega\} - \operatorname{tr}\{S \Omega\} \right]. \tag{37}
\]
Maximizing the likelihood in (36) with respect to $\Sigma$ yields $\hat\Sigma = S$. Regularization of the covariance matrix estimator is achieved by adding a penalty term to the likelihood in (36),
\[
L^*(\Sigma \mid X_1, \ldots, X_n) = -\frac{n}{2} \left[ \log\{\det \Sigma\} + \operatorname{tr}(S \Sigma^{-1}) \right] - \lambda P(\Sigma), \tag{38}
\]
for some penalty function $P$ which can be defined to achieve a desired effect on $\hat\Sigma$. The penalty parameter $\lambda$ dictates the trade-off between maximizing the likelihood term and minimizing the penalty. Inspired by lasso (Tibshirani, 1996), Bien and Tibshirani (2011) proposed using an $\ell_1$-penalty to induce sparsity in the estimator. The penalty function is given by $P(\Sigma) = \| W \circ \Sigma \|_1 = \sum_{i,j} w_{ij} |\sigma_{ij}|$, where $\circ$ denotes the Hadamard element-wise product. The matrix $W = 1 1^\top$ penalizes all the elements of $\Sigma$ whereas $W = 1 1^\top - I$ penalizes only the off-diagonal terms. Another approach to regularization was developed by Daniels and Kass (2001), who shrink the eigenvalues to make the estimator more stable.

While developing penalized estimates for the covariance matrix is theoretically important, in practice it is more convenient to obtain sparse estimates of the precision matrix. Sparsity of the precision matrix translates to absence of edges between nodes in the network model. Hence a sparse precision matrix can be used to isolate clusters of nodes which are strongly dependent within themselves and independent of the other clusters. To better understand how sparse precision matrices translate to connected graphical models, consider a mock model with $p = 20$ variables. The precision matrix is divided into three diagonal blocks of sizes 5, 10, and 5, respectively. These three blocks are filled with large nonzero values (set equal to 1) and the remaining elements are filled with small random values (generated from $0.1 \times \mathrm{Beta}(1, 5)$). Hard thresholding is used to induce sparsity by setting $\omega_{ij}$ to zero whenever $|\omega_{ij}| < \delta$, for a given $\delta > 0$. Fig. 3 presents the networks for different values of $\delta$. We can see that as $\delta$ increases the graph becomes sparse, with the three subnetworks distinctly separated for $\delta = 0.1$. The $\ell_1$ penalized precision matrix estimation is done by maximizing the function
\[
L(\Omega \mid X_1, \ldots, X_n) = \frac{n}{2} \left[ \log\{\det \Omega\} - \operatorname{tr}\{S \Omega\} \right] - \lambda \| \Omega \|_1. \tag{39}
\]
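The mock model with three diagonal blocks can be reproduced in a few lines. The sketch below is an illustration under stated assumptions rather than the code behind Fig. 3: the function names `mock_precision` and `threshold_components` are hypothetical, the random seed is arbitrary, and the network is summarized by its connected components after hard thresholding instead of being drawn.

```python
import random

def mock_precision(block_sizes=(5, 10, 5), seed=0):
    """Symmetric matrix with entries 1 inside the diagonal blocks and
    0.1 * Beta(1, 5) draws elsewhere."""
    rng = random.Random(seed)
    p = sum(block_sizes)
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    omega = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            if block_of[i] == block_of[j]:
                value = 1.0
            else:
                value = 0.1 * rng.betavariate(1, 5)
            omega[i][j] = omega[j][i] = value
    return omega

def threshold_components(omega, delta):
    """Drop edges with |omega_ij| < delta and return the connected components."""
    p = len(omega)
    seen, components = set(), []
    for start in range(p):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:                       # depth-first search over kept edges
            i = stack.pop()
            comp.append(i)
            for j in range(p):
                if j != i and j not in seen and abs(omega[i][j]) >= delta:
                    seen.add(j)
                    stack.append(j)
        components.append(sorted(comp))
    return components

omega = mock_precision()
for delta in (0.0, 0.1):
    print(delta, [len(c) for c in threshold_components(omega, delta)])
```

Because the off-block values lie below 0.1, thresholding at 0.1 removes every between-block edge and leaves exactly the three clusters; intermediate thresholds remove only part of them.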
Since the $\ell_1$-penalized problem (39) was introduced by Friedman et al. (2008) as glasso (short for graphical lasso), it has garnered great interest. Several extensions and improvements of the original glasso method have been proposed. Danaher et al. (2014) and Guo et al. (2011) studied joint estimation of $K > 1$ precision matrices by imposing two levels of penalties. For sparse estimation of precision matrices $\Omega^{(1)}, \ldots, \Omega^{(K)}$, using (39) individually will not preserve the cluster structure across the groups. By introducing a penalty to merge the $K$ groups, the following penalty functions have been proposed:

Fused graphical lasso:
\[
P\left(\Omega^{(1)}, \ldots, \Omega^{(K)}\right) = \lambda_1 \sum_{k=1}^{K} \sum_{i \ne j} |\omega_{ij}^{(k)}| + \lambda_2 \sum_{k < m} \sum_{i,j} |\omega_{ij}^{(k)} - \omega_{ij}^{(m)}|,
\]
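Group graphical lasso:
\[
P\left(\Omega^{(1)}, \ldots, \Omega^{(K)}\right) = \lambda_1 \sum_{k=1}^{K} \sum_{i \ne j} |\omega_{ij}^{(k)}| + \lambda_2 \sum_{i \ne j} \left\{ \sum_{k=1}^{K} \left(\omega_{ij}^{(k)}\right)^2 \right\}^{1/2},
\]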
k 0 and observed that Un performs well even in the high-dimensional case. For the identity hypothesis, they constructed a new test statistic, 2 1 p 1 p 2 W n ¼ trfS I g (44) trS + , p n p n which is also asymptotically chi-squared with p(p + 1)/2 degrees of freedom but has better properties than Vn. Relaxing the assumption of normal distribution and a direct relationship between n and p, Chen et al. (2010) proposed test statistics Un* and Vn* which are asymptotically normally distributed. These test statistics are in the same spirit as TCQ (9) and uses leave-out crossvalidation type products to improve the asymptotic properties. Next, consider testing equality of covariance matrices from two normal populations Xi N ð0, Σ1 Þ, i ¼ 1, …, n and Y j N ð0, Σ2 Þ, j ¼ 1, …, m: The sample covariance matrices and pooled covariance matrix, S1 ¼
n m > > 1X 1X nS 1 + mS 2 Xi X Xi X , S 2 ¼ Yj Y Yj Y , S pl ¼ , n i¼1 m j¼1 n+m
are used to construct the likelihood ratio test statistic as
L ¼ ðn + mÞ log jS pl j n log jS 1 j m log jS 2 j :
(45)
Under H0 : Σ1 ¼ Σ2, L asymptotically follows a chi-squared distribution with p(p + 1)/2 degrees of freedom. Extending to K groups, the test statistic is ( ) K X LK ¼ ng log jS pl j log jS g j , g¼1
where $n_g$ is the sample size of the $g$th group and $S_{pl} = \left( \sum_{g=1}^{K} n_g \right)^{-1} \sum_{g=1}^{K} n_g S_g$. Under $H_0 : \Sigma_1 = \cdots = \Sigma_K$, the LRT statistic $L_K$ asymptotically follows a chi-squared distribution with $(K - 1)p(p + 1)/2$ degrees of freedom. However, for the two sample case, LRT fails when $p > \min(n, m)$ because at least one of $S_1$ or $S_2$ will become singular. Bai et al. (2009) and Jiang et al. (2012) provided asymptotic corrections to the LRT when $n, p \to \infty$ with $c_n = p/n \to c \in (0, \infty)$ and proposed
\[
L^* = \frac{L - p\left\{ 1 - \left(1 - \dfrac{n}{p}\right) \log\left(1 - \dfrac{p}{n}\right) \right\} + \dfrac{1}{2} \log\left(1 - \dfrac{p}{n}\right)}{\sqrt{-2\left[ \dfrac{p}{n} + \log\left(1 - \dfrac{p}{n}\right) \right]}},
\]
which is asymptotically normally distributed under the null hypothesis.
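Among the statistics above, $W_n$ in (44) is the simplest to compute, because it needs only traces of $S$ and never a determinant or an inverse. The following is a minimal sketch, assuming $S$ is supplied as a symmetric list of lists; the function name `identity_test_statistic` is hypothetical.

```python
def identity_test_statistic(s, n):
    """W_n for a symmetric p x p sample covariance matrix s built from n observations."""
    p = len(s)
    # for symmetric S, tr{(S - I)^2} is the sum of squared entries of S - I
    tr_sq = sum((s[i][j] - (1.0 if i == j else 0.0)) ** 2
                for i in range(p) for j in range(p))
    tr_s = sum(s[i][i] for i in range(p))
    return tr_sq / p - (p / n) * (tr_s / p) ** 2 + p / n

identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
print(identity_test_statistic(identity, n=10))   # S = I gives 0 (up to rounding)
```

The last two terms cancel when $\operatorname{tr} S = p$, which is the correction that keeps the statistic centred when $p/n$ does not vanish.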
Another approach for testing equality of covariance matrices is to construct a functional $F(\Sigma_1, \Sigma_2)$ which will be equal to zero when $\Sigma_1 = \Sigma_2$. Schott (2007) used the squared Frobenius norm of the difference $\Sigma_1 - \Sigma_2$ as the functional to construct the test statistic. This method is readily extended to comparing $K$ covariance matrices, with the test statistic $T = F_n / \sqrt{\widehat{\operatorname{var}}(F_n)}$, where
K o i X n X 1 h tr ðS i S j Þ2 ðK 1Þ ni ðni 2Þtr S 2i + n2i ftrðS i Þg2 , nη i 2 i ¼ σ 2 RI R> ð¼ σ 2 I Þ vs H A : Σ 6¼ σ 2 RI R> ð¼ σ 2 I Þ, H0 : Σ ¼ RI R> ð¼ I Þ vs HA : Σ ¼ 6 RI R> ð¼ I Þ: (47)
Theoretical properties of such tests are an active area of research.
4 Discrete multivariate models
Multivariate count data occur frequently in genomics and text mining. In high-throughput genomic experiments such as RNA-Seq (Wang et al., 2009), data are reported as the number of reads aligned to the genes in a reference genome. In text mining (Blei et al., 2003), the number of occurrences of a dictionary of words in a library of books is counted to study patterns of keywords and topics. In metagenomics (Holmes et al., 2012), abundances of bacterial species in samples are studied by recording the counts of reads assigned to different bacterial species. In all these data sets, the data matrix consists of nonnegative integer counts. Analyzing multivariate discrete data can be addressed in two ways. The absolute counts can be modeled using discrete probability models, or the data can be transformed (e.g., using relative abundances instead of absolute counts) so that continuous probability models such as the Gaussian can be used. The research community is still divided on whether this transformation causes a loss of information (McMurdie and Holmes, 2014). Transforming the variables will enable us to use the hypothesis testing tools presented in Section 2. In this section, we will look at some discrete multivariate models.
4.1 Multinomial distribution
The multinomial distribution is the most commonly used multivariate discrete model, extending the univariate binomial distribution to multiple dimensions. For $p \ge 2$, the multinomial distribution is parameterized by a probability vector $\pi = (\pi_1, \ldots, \pi_p)$ with $\pi_1 + \cdots + \pi_p = 1$ and the total count $N \in \mathbb{Z}^{+}$. The probability mass function of $X \sim \mathrm{Mult}(N, \pi)$ is given by
\[
P(X = x) = P(X_1 = x_1, \ldots, X_p = x_p) = \frac{N!}{x_1! \cdots x_p!}\, \pi_1^{x_1} \cdots \pi_p^{x_p}, \tag{48}
\]
for all $x \in \mathbb{Z}_{+}^{p}$ such that $x_1 + \cdots + x_p = N$. An alternative representation of the multinomial distribution can be obtained using independent Poisson random variables. Consider $p$ independent Poisson random variables, $X_k \sim \mathrm{Pois}(\lambda_k)$, $k = 1, \ldots, p$. Then the vector $(X_1, \ldots, X_p)$, conditional on $\sum_{k=1}^{p} X_k = N$, follows a multinomial distribution with probability parameter $\pi = (\lambda_1, \ldots, \lambda_p)/(\lambda_1 + \cdots + \lambda_p)$. The reparameterization using Poisson variables is scale invariant, i.e., the same multinomial distribution is obtained when $X_k \sim \mathrm{Pois}(s\lambda_k)$ for all $s > 0$. Levin (1981) provides a very simple expression for the cumulative distribution function using this property,
\[
F_X(a_1, \ldots, a_p) = P(X_1 \le a_1, \ldots, X_p \le a_p) = \frac{N!}{s^N e^{-s}} \left\{ \prod_{k=1}^{p} P(Y_k \le a_k) \right\} P(S = N), \tag{49}
\]
where $s > 0$ is any positive number, $Y_k \sim \mathrm{Pois}(s\pi_k)$ are independent, and $S = Y_1^* + \cdots + Y_p^*$, where $Y_k^*$ is a truncated Poisson variable, $Y_k^* \sim \mathrm{Pois}(s\pi_k; \{0, \ldots, a_k\})$. This alternative formulation and Eq. (49) reduce the computational cost of calculating the CDF significantly. Using the mass function, the calculation would require a comprehensive search of the sample space $\{X : X_1 + \cdots + X_p = N\}$, which has a computational cost of exponential order with respect to $p$.

The first two moments are functions of $\pi$, given by $E(X) = N\pi$ and $\operatorname{var}(X) = N\{\operatorname{diag}(\pi) - \pi\pi^\top\}$. The constraint on the total sum implies the variables are always negatively correlated, with $\operatorname{cov}(X_i, X_j) = -N\pi_i\pi_j$ for $i \ne j$. Parameter estimation for multinomial distributions is well studied in the literature. Using the added constraint $\pi_1 + \cdots + \pi_p = 1$, the maximum likelihood estimates can be easily derived as
\[
\hat\pi_k = \frac{X_k}{N}, \qquad k = 1, \ldots, p. \tag{50}
\]
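The computational advantage of the Poisson representation in (49) can be seen in a small sketch. The code below is an illustration, not Levin's original algorithm: the function names are hypothetical, the distribution of the sum of truncated Poisson variables is obtained by direct convolution (Levin's paper also discusses approximations, which are not used here), and a brute-force enumeration over the sample space is included only as a check.

```python
from itertools import product
from math import exp, factorial

def pois_pmf(lam, x):
    return exp(-lam) * lam ** x / factorial(x)

def multinomial_cdf_levin(N, pi, a, s=None):
    """P(X_1 <= a_1, ..., X_p <= a_p) for X ~ Mult(N, pi) via the Poisson representation."""
    s = float(N) if s is None else s       # any s > 0 gives the same answer
    prod_p = 1.0                           # product of P(Y_k <= a_k)
    dist = [1.0]                           # distribution of the running sum S
    for p_k, a_k in zip(pi, a):
        a_k = min(a_k, N)
        pmf = [pois_pmf(s * p_k, x) for x in range(a_k + 1)]
        total = sum(pmf)
        prod_p *= total
        trunc = [q / total for q in pmf]   # truncated Poisson on {0, ..., a_k}
        new = [0.0] * min(len(dist) + a_k, N + 1)
        for i, d_i in enumerate(dist):
            for j, t_j in enumerate(trunc):
                if i + j < len(new):
                    new[i + j] += d_i * t_j
        dist = new
    p_sum = dist[N] if N < len(dist) else 0.0
    return factorial(N) / (s ** N * exp(-s)) * prod_p * p_sum

def multinomial_cdf_bruteforce(N, pi, a):
    """Same probability by summing the mass function over the sample space."""
    total = 0.0
    for x in product(*(range(min(a_k, N) + 1) for a_k in a)):
        if sum(x) == N:
            coef, prob = factorial(N), 1.0
            for x_k, p_k in zip(x, pi):
                coef //= factorial(x_k)
                prob *= p_k ** x_k
            total += coef * prob
    return total

args = (5, (0.2, 0.3, 0.5), (2, 3, 4))
print(multinomial_cdf_levin(*args), multinomial_cdf_bruteforce(*args))
```

The convolution costs O(pN²) operations at most, whereas the enumeration grows exponentially in p, which matches the cost comparison made in the text.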
Starting with the works by Rao (1952, 1957), wherein consistency and asymptotic properties of the maximum likelihood estimator have been established, several extensions have been developed. When $\pi$ is restricted to a convex region in the parameter space, Barmi and Dykstra (1994) developed an iterative estimation method based on a primal-dual formulation of the problem. Jewell and Kalbfleisch (2004) developed estimators when the multinomial parameters are ordered, i.e., $\pi_1 \le \pi_2 \le \cdots \le \pi_p$. Leonard (1977) provided a Bayesian approach to parameter estimation by imposing a Dirichlet prior on the probability vector and derived the Bayesian estimates under a quadratic loss function. When comparing two multinomial populations, $X \sim \mathrm{Mult}(\pi_X)$ and $Y \sim \mathrm{Mult}(\pi_Y)$, the hypothesis of interest is
\[
H_0 : \pi_X = \pi_Y \quad \text{vs} \quad H_A : \pi_X \ne \pi_Y. \tag{51}
\]
Unlike the hypothesis tests in Section 2, we do not require replicates of the count vectors to construct the test statistic and study its asymptotic properties. Instead, sample sizes for (51) are $n = \sum_{k=1}^{p} X_k$ and $m = \sum_{k=1}^{p} Y_k$. Traditional tests include the Pearson chi-squared test and the likelihood ratio test,
\[
T_{Pearson} = \sum_{k=1}^{p} \left\{ \frac{(X_k - \hat X_k)^2}{\hat X_k} + \frac{(Y_k - \hat Y_k)^2}{\hat Y_k} \right\}, \qquad
T_{LRT} = 2 \sum_{k=1}^{p} \left\{ X_k \log \frac{\hat\pi_{Xk}}{\hat\pi_k} + Y_k \log \frac{\hat\pi_{Yk}}{\hat\pi_k} \right\}, \tag{52}
\]
where $\hat\pi_k = (X_k + Y_k)/(n + m)$, $\hat\pi_{Xk} = X_k/n$, $\hat\pi_{Yk} = Y_k/m$, $\hat X_k = n\hat\pi_k$, and $\hat Y_k = m\hat\pi_k$. Asymptotically, the test statistics follow a chi-squared distribution with $p - 1$ degrees of freedom under $H_0$. When $p$ is fixed, Hoeffding (1965) provided asymptotically optimal tests for (51). Furthermore, he also provided conditions under which $T_{LRT}$ has superior performance compared to $T_{Pearson}$. Morris (1975) provided a general framework for deriving the limiting distributions of any general sums of the form
\[
S_p = \sum_{k=1}^{p} f_k(X_k),
\]
when $\{f_k, k = 1, \ldots, p\}$ are polynomials of bounded degree, which generalize $T_{Pearson}$ and $T_{LRT}$. For a comprehensive review of tests, refer to Balakrishnan and Wasserman (2018) and the references therein. Distributional properties of these tests hold when all the counts are large, i.e., $X_k > 0$ and $Y_k > 0$, and the number of categories $p$ is smaller than $n + m$. When $p$ is larger than $n + m$, we encounter sparsity, because the minimum number of zero elements will be $p - (n + m)$. Results derived by Morris hold when $p$ and $n + m$ both increase. When the data are large and sparse, i.e., $p > n + m$, Zelterman (2013) derived the mean and standard deviation of $T_{Pearson}$ and normalized the test statistic to construct an asymptotically normal test statistic. Using the $\ell_1$ norm of the difference, $\| \pi_X - \pi_Y \|_1 = \sum_{k=1}^{p} |\pi_{Xk} - \pi_{Yk}|$, and the Euclidean norm, $\| \pi_X - \pi_Y \|_2^2 = \sum_{k=1}^{p} (\pi_{Xk} - \pi_{Yk})^2$, Chan et al. (2013) considered the following functionals to use as test statistics:
\[
T_1 = \sum_{k=1}^{p} \frac{(X_k - Y_k)^2 - X_k - Y_k}{X_k + Y_k}, \qquad
T_2 = \sum_{k=1}^{p} \left\{ (X_k - Y_k)^2 - X_k - Y_k \right\}. \tag{53}
\]
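The four statistics can be computed from a pair of count vectors in a single pass. The sketch below is illustrative only: the function name `two_sample_statistics` is hypothetical, the likelihood ratio statistic is taken on the usual scale of twice the sum of observed times log(observed/expected), and the handling of empty categories (skipped) and of zero counts in the log terms (contributing zero) is an assumption of this example, not something specified in the text.

```python
from math import log

def two_sample_statistics(x, y):
    """Pearson, likelihood ratio, T1 and T2 statistics for two count vectors."""
    n, m = sum(x), sum(y)
    pearson = lrt = t1 = t2 = 0.0
    for xk, yk in zip(x, y):
        if xk + yk == 0:
            continue                          # empty category: no contribution
        pooled = (xk + yk) / (n + m)
        ex, ey = n * pooled, m * pooled       # expected counts under H0
        pearson += (xk - ex) ** 2 / ex + (yk - ey) ** 2 / ey
        if xk > 0:
            lrt += 2 * xk * log(xk / ex)      # x log x -> 0 as x -> 0
        if yk > 0:
            lrt += 2 * yk * log(yk / ey)
        num = (xk - yk) ** 2 - xk - yk
        t1 += num / (xk + yk)
        t2 += num
    return {"pearson": pearson, "lrt": lrt, "t1": t1, "t2": t2}

print(two_sample_statistics((3, 0), (1, 2)))
```

For the two toy vectors the Pearson statistic is 3, $T_1$ is 1 and $T_2$ is 2. Since no sampling distribution is given for $T_1$ and $T_2$, their cut-offs would be obtained by permuting the pooled observations between the two samples.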
However, the sampling distributions of $T_1$ and $T_2$ were not provided. Instead, permutation-based cut-offs need to be calculated to do inference. Studying the asymptotic properties of such functionals, Plunkett and Park (2019) constructed a test statistic, given by
\[
T_{PP} = \frac{\displaystyle\sum_{k=1}^{p} \left\{ \left( \frac{X_k}{n} - \frac{Y_k}{m} \right)^2 - \frac{X_k}{n^2} - \frac{Y_k}{m^2} \right\}}{\sqrt{\displaystyle\sum_{k=1}^{p} \left\{ \frac{2}{n^2} \left( \hat\pi_{Xk}^2 - \frac{\hat\pi_{Xk}}{n} \right) + \frac{2}{m^2} \left( \hat\pi_{Yk}^2 - \frac{\hat\pi_{Yk}}{m} \right) + \frac{4}{nm} \hat\pi_{Xk} \hat\pi_{Yk} \right\}}}. \tag{54}
\]
The test statistic was shown to be asymptotically normal under the following conditions:

(PP I) $\min(n, m) \to \infty$ and $n/(n + m) \to c \in (0, 1)$. This condition is the same as (BS II), (SD II) and (PA II).

(PP II) The probabilities are not concentrated, i.e.,
\[
\frac{1}{\| \pi_X \|_2^2} \max_k \pi_{Xk}^2 \to 0 \quad \text{and} \quad \frac{1}{\| \pi_Y \|_2^2} \max_k \pi_{Yk}^2 \to 0 \quad \text{as } p \to \infty.
\]
This condition ensures that the number of components with nonzero probabilities is not bounded. For example, we cannot have $\pi_X = (1/m, \ldots, 1/m, 0, \ldots, 0)$ where the number of nonzero elements is equal to $m$, because $\max_k \pi_{Xk}^2 = 1/m^2$ and $\| \pi_X \|_2^2 = 1/m$, resulting in the ratio being equal to $1/m$.

(PP III) The sample sizes $n$ and $m$ and dimension $p$ are restricted as $(n + m) \| \pi_X + \pi_Y \|_2^2 \ge \epsilon > 0$ for some $\epsilon > 0$. To better understand this condition, consider $\pi_X + \pi_Y = (1/p, \ldots, 1/p)$. Then $(n + m) \| \pi_X + \pi_Y \|_2^2 = (n + m)/p$, which implies $p$ can increase at most linearly with respect to $n$.

(PP IV) Asymptotic normality is valid in the local alternative
\[
n^2 \| \pi_X - \pi_Y \|_2^2 = O\left( \| \pi_X + \pi_Y \|_2^2 \right).
\]
4.2 Compound multinomial models
Consider $n$ multivariate count vectors of dimension $p$, $X_1, \ldots, X_n$. Such data commonly arise when multiple samples are collected, e.g., gene expression counts of $p$ genes collected from $n$ specimens. One common criticism of the standard multinomial distribution is that it does not address over-dispersion in the data. If we consider that the count vectors are i.i.d. from $\mathrm{Mult}(\pi)$, we are inadvertently assuming that the population is homogeneous. To account for heterogeneity in the population, it is advised to assume a model with a sample-specific parameter,

$$X_i \mid \pi_i \sim \mathrm{Mult}(\pi_i), \qquad i = 1, \ldots, n.$$
Heterogeneity can further be modeled using a distribution on the $p$-dimensional simplex $S_p = \{\pi : \pi_1 + \cdots + \pi_p = 1\}$. In the univariate case, the beta distribution is the natural choice for the distribution on $S_2$. Extending to $p$ dimensions, the natural extension is the multivariate beta distribution or the Dirichlet distribution. The Dirichlet distribution is characterized by a single parameter $\theta = (\theta_1, \ldots, \theta_p)$, with density function

$$f(\pi; \theta) = \frac{\Gamma(\theta_0)}{\prod_{k=1}^{p} \Gamma(\theta_k)} \, \pi_1^{\theta_1 - 1} \cdots \pi_p^{\theta_p - 1}, \qquad \pi_1 + \cdots + \pi_p = 1,$$
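where $\theta_0 = \theta_1 + \cdots + \theta_p$ and $\Gamma(\cdot)$ is the gamma function. The compound Dirichlet-Multinomial (DirMult) distribution, constructed as the marginal of $X_i \mid \pi_i \sim \mathrm{Mult}(\pi_i)$ and $\pi_i \sim \mathrm{Dir}(\theta)$, has the mass function given by

$$f(X; \theta) = \frac{\Gamma(X_0 + 1)\,\Gamma(\theta_0)}{\Gamma(X_0 + \theta_0)} \prod_{k=1}^{p} \frac{\Gamma(X_k + \theta_k)}{\Gamma(X_k + 1)\,\Gamma(\theta_k)}, \tag{55}$$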
where $X_0 = X_1 + \cdots + X_p$. The DirMult model was first introduced by Mosimann, who derived the properties of the distribution. The mean and variance of the DirMult distribution are $E(X) = X_0 \theta_0^{-1} \theta$ and

$$\mathrm{var}(X) = n \left\{ \theta_0^{-1} \mathrm{diag}(\theta) - \theta_0^{-2} \, \frac{X_0 + \theta_0}{1 + \theta_0} \, \theta \theta^\top \right\}.$$

The variance matrix is the sum of a full-rank matrix (diagonal part) and a rank-one matrix. Using the result from Miller (1981), the precision matrix can be calculated in closed form as

$$\mathrm{var}(X)^{-1} = n^{-1} \left\{ \theta_0 \, \mathrm{diag}(\theta^{-1}) + \frac{\dfrac{X_0 + \theta_0}{1 + \theta_0}}{\theta_0^2 - \dfrac{X_0 + \theta_0}{1 + \theta_0}} \, \mathbf{1}\mathbf{1}^\top \right\}.$$

For parameter estimation, the likelihood function of (55) does not admit a maximum for $\theta$ in closed form. An approximate solution can be obtained using iterative methods such as the Newton–Raphson algorithm. One convenient feature for computation is that the second-order derivative of the log-likelihood function has a closed form expression for the inverse (Sklar, 2014). Thus the Newton–Raphson step has a linear computation cost. When $p$ is larger than $X_0$, Danaher (1988) derived parameter estimates using the beta-binomial marginals and established their consistency. While the density function is known to be globally convex, maximization can still lead us to a local maximum. A proper initial value specification is essential for good performance of the estimator. Choice of optimal initial values has been an area of considerable interest, even for the Dirichlet distribution. The challenge lies in the fact that the method of moments (MM) estimator is not unique. This is because of the scaling in $E(X_k) = n\theta_k/\theta_0$, which gives both $\hat\theta$ and $c\hat\theta$ as MM estimates for any $c > 0$. Ronning (1989) proposed using the same initial value for all elements, $\hat\theta_k = \min_{ij} X_{ij}$. This proposal was based on an observation that the method
of moments estimates can lead to Newton–Raphson updates becoming inadmissible, i.e., $\hat\theta_k < 0$ for some $k$. Hariharan and Velu (1993) carried out a comprehensive comparison of the different initial values under several models. However, they concluded that none of the methods is uniformly consistent across all the models.

Dirichlet-Multinomial has been applied to study multivariate count data in several applications in biomedical research. In metagenomics, the study of bacterial composition of environmental (biological or ecological) samples, we are interested in modeling the abundance of different species of bacteria in samples. The Dirichlet-Multinomial model is apt for such data because (i) abundances of bacteria are constrained by the total number of bacteria sampled in the specimen and (ii) over-dispersion due to environmental variability is accounted for. Holmes et al. (2012) used a Dirichlet multinomial mixture model to cluster samples by abundance profile, i.e., the DirMult parameter. Chen and Li (2013) developed an $\ell_1$-penalized parameter estimation
for variable selection in the DirMult model. Sun et al. (2018) used the DirMult model to construct a clustering algorithm for single-cell RNA-seq data.

The most celebrated application of the DirMult distribution is latent Dirichlet allocation (LDA), introduced by Blei et al. (2003). Developed for text mining, to classify documents by keywords, the model is a hierarchical Bayesian model with three levels. First, the $p$ elements of $X$ represent the words in the vocabulary. A word is represented as $X = (x_1, \ldots, x_p)$ where $x_k \in \{0, 1\}$ for all $k = 1, \ldots, p$ and $\sum_{k=1}^{p} x_k = 1$. A collection of $q$ words represents a topic, which can be used to classify documents, which will also be a multinomial variable $T = (t_1, \ldots, t_q)$ with $t_k \in \{0, 1\}$ for all $k = 1, \ldots, q$. The number of topics, $K$, is assumed to be fixed. It should be noted that while the words in the vocabulary are defined and observed, the topic corresponding to a word is a latent variable. Second, each document is defined as a sequence of $N$ words, $\mathcal{X} = \{X_1, \ldots, X_N\}$. The number of words in a document is assumed to have a Poisson distribution ($N \sim \mathrm{Pois}(\lambda)$) and the topics follow a multinomial distribution with a document-specific parameter. And finally, a corpus is defined as a collection of $M$ documents, $\mathcal{D} = \{\mathcal{X}_1, \ldots, \mathcal{X}_M\}$.

The LDA model is parameterized as follows. Each document is characterized by the probability of its keywords $\theta_m$, $T \sim \mathrm{Mult}(\theta_m)$. The probability parameters are assumed to follow a Dirichlet distribution,

$$\theta_m \sim \mathrm{Dir}(\alpha), \qquad m = 1, \ldots, M.$$

Conditional on the latent topics $T$, $\pi_{kt} = P(X_k = 1 \mid T_t = 1)$ denotes the probability that the $k$th word in the vocabulary is observed, provided the word describes the topic. The collection of all such probabilities is parameterized as a $p \times q$ matrix $\Pi = (\pi_{kt} : k = 1, \ldots, p;\ t = 1, \ldots, q)$. Using these components, the complete likelihood can be written as

$$P(\mathcal{D} \mid \alpha, \Pi) = \prod_{d=1}^{D} \int_{S_p} f(\theta_d \mid \alpha)\, P(\mathcal{X}_d \mid \theta_d, \Pi)\, d\theta_d = \prod_{d=1}^{D} \int_{S_p} f(\theta_d \mid \alpha) \left\{ \prod_{n=1}^{N_d} g(T_n \mid \theta_d)\, h(X_n \mid T_n, \Pi) \right\} d\theta_d. \tag{56}$$
In this model, $f(\cdot)$ is the Dirichlet density function, $g(\cdot)$ is the multinomial mass function and $h(\cdot)$ is obtained from $\Pi$. Parameter estimation is done by maximizing the likelihood using the expectation–maximization (EM) algorithm by conditioning on the latent keywords. The major focus of LDA research has been on developing faster algorithms (Hoffman et al., 2010) to be able to analyze larger corpora with a large number of documents. Mimno et al. (2012) considered sparsity in the model from the Gibbs sampling perspective to improve the efficiency of the algorithm. However, most of the research has been from a machine learning and estimation perspective. Statistical properties of the estimators, which could be of
potential interest for developing hypothesis tests, have not been established. One potential problem of interest could be comparing the Dirichlet parameters of two corpora,

$$H_0 : \alpha_1 = \alpha_2 \quad \text{vs} \quad H_A : \alpha_1 \neq \alpha_2. \tag{57}$$
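The three-level hierarchy behind the LDA likelihood can be made concrete by simulating from it. The sketch below follows the generative description in the text (a Dirichlet draw per document, a Poisson number of words, a latent topic per word, and a word drawn from the topic-specific probabilities). The function names, the topic-major layout `word_probs[t][k]` for the entries of $\Pi$, and all numerical values are hypothetical choices made for illustration; it is not an estimation routine.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent gamma variables."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]


def sample_poisson(lam, rng):
    """Inverse-CDF Poisson draw; adequate for small rates."""
    u = rng.random()
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u and pmf > 0.0:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(alpha, word_probs, lam, n_docs, seed=0):
    """Simulate documents; word_probs[t][k] = P(word k | topic t)."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_docs):
        theta = sample_dirichlet(alpha, rng)       # document-specific topic probabilities
        n_words = sample_poisson(lam, rng)         # document length
        doc = []
        for _ in range(n_words):
            t = sample_categorical(theta, rng)            # latent topic of this word
            w = sample_categorical(word_probs[t], rng)    # observed word given topic
            doc.append((t, w))
        corpus.append(doc)
    return corpus


# Hypothetical example: 2 topics over a vocabulary of 4 words.
alpha = [0.5, 0.5]
word_probs = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
corpus = generate_corpus(alpha, word_probs, lam=8.0, n_docs=5)
print([len(doc) for doc in corpus])
```

In real data only the word indices are observed; the topic labels returned here alongside each word are exactly the latent variables that the EM and Gibbs sampling approaches have to integrate out.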
In the computer science literature, the focus has been on developing methods for efficient analysis of corpora with a large number of documents. Sample size is known to affect accuracy of the allocation (Crossley et al., 2017). A large $p$ small $n$ problem in this context would be efficient classification of a small number of documents (small $M$) with a large vocabulary (large $p$). Understanding the efficiency of LDA in such large $p$ small $n$ scenarios is an open area of research.
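Returning to the Dirichlet-Multinomial mass function in (55), the over-dispersion that motivates the compound model can be verified numerically on a small example. The sketch below evaluates the log of (55) with the log-gamma function and enumerates every count vector with a fixed total, so that the total mass, the mean and the variance of one component can be compared with the plain multinomial with $\pi = \theta/\theta_0$. The helper names and the values of $\theta$ and $X_0$ are hypothetical.

```python
import math


def dirmult_logpmf(x, theta):
    """Log of the Dirichlet-Multinomial mass function in Eq. (55)."""
    x0 = sum(x)
    theta0 = sum(theta)
    out = math.lgamma(x0 + 1) + math.lgamma(theta0) - math.lgamma(x0 + theta0)
    for xk, tk in zip(x, theta):
        out += math.lgamma(xk + tk) - math.lgamma(xk + 1) - math.lgamma(tk)
    return out


def compositions(total, parts):
    """Yield all count vectors of length `parts` that sum to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


theta = (1.0, 2.0, 3.0)   # hypothetical Dirichlet parameter
x0 = 4                    # hypothetical total count
support = list(compositions(x0, len(theta)))
probs = [math.exp(dirmult_logpmf(x, theta)) for x in support]

total_mass = sum(probs)
mean_x1 = sum(p * x[0] for p, x in zip(probs, support))
var_x1 = sum(p * (x[0] - mean_x1) ** 2 for p, x in zip(probs, support))

# Variance of the first component under Mult(pi) with pi = theta / theta_0.
pi1 = theta[0] / sum(theta)
multinomial_var_x1 = x0 * pi1 * (1 - pi1)
print(total_mass, mean_x1, var_x1, multinomial_var_x1)
```

The mean agrees with $X_0\theta_1/\theta_0$, while the variance of the component exceeds its multinomial counterpart by the factor $(X_0 + \theta_0)/(1 + \theta_0)$, which is the over-dispersion the compound model is designed to capture.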
4.3 Other distributions
The Dirichlet-Multinomial is a natural extension of the univariate beta-binomial distribution, which gives the marginals of the DirMult distribution. This observation raises the following question: can we develop multivariate count distributions with known marginals? The theoretical answer to this question is to use Sklar’s theorem (Nelson, 2006) and construct a copula to model the joint distribution. However, parametric inference such as hypothesis testing is very tedious and sometimes intractable when using copula models. In this section, we shall look at some multivariate extensions of known univariate distributions which have useful parameterizations and for which inference is tractable.
4.3.1 Bernoulli distribution
One of the earliest generalizations of the Bernoulli distribution using a parametric approach was developed by Teugels (1990). Using the moments of all orders $k = 1, \ldots, p$, the moment-generating function of the multivariate Bernoulli was constructed. Teugels (1990) also provided an extension to the multivariate binomial distribution using the sum of independent Bernoulli variables. Using the joint probabilities, Dai et al. (2013) proposed a multivariate Bernoulli distribution which has an analytical form of the mass function. Before generalizing the multivariate Bernoulli distribution, consider the case where elements of the variable $X = (X_1, \ldots, X_p)$ are independent with $X_k \sim \mathrm{Ber}(\pi_k)$, $k = 1, \ldots, p$. Then the joint probability of $X = x$ is given by

$$P(X = x) = \prod_{k=1}^{p} P(X_k = x_k) = \prod_{k=1}^{p} \pi_k^{x_k} (1 - \pi_k)^{1 - x_k}.$$
When the variables are dependent, the joint probability cannot be factored into the product of marginals. Using the joint probabilities, the mass function can be defined as

$$P(X_1 = x_1, \ldots, X_p = x_p) = \pi_{00\ldots0}^{\prod_{k=1}^{p} (1 - x_k)} \; \pi_{10\ldots0}^{x_1 \prod_{k=2}^{p} (1 - x_k)} \cdots \pi_{11\ldots1}^{\prod_{k=1}^{p} x_k}, \tag{58}$$

where $\pi_{00\ldots0} = P(X_1 = 0, \ldots, X_p = 0)$ and so on. The marginals of $X$ are Bernoulli with cumulative probability,

$$X_k \sim \mathrm{Ber}(\pi_k), \qquad \pi_k = \sum_{a_i \in \{0, 1\},\ i \neq k} \pi_{a_1 \ldots a_{k-1} 1 a_{k+1} \ldots a_p}.$$
Using this formulation, Dai et al. (2013) computed the moments and also calculated the maximum likelihood estimates using the Newton–Raphson algorithm. However, the main drawback is the dimension of the parameter space. To define the multivariate Bernoulli mass function, we require a total of $2^p - 1$ parameters, which can be computationally infeasible for higher dimensions.
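A small numerical illustration helps to see how the joint-probability parameterization works and why it scales poorly. The sketch below stores the joint probabilities in a table indexed by binary tuples, evaluates the product form of the mass function, and recovers the Bernoulli marginals by summing over all cells in which the $k$th coordinate equals one. The table itself (weights proportional to one plus the number of ones) and the function names are hypothetical.

```python
from itertools import product


def marginal_probabilities(joint):
    """Return pi_k = P(X_k = 1) for each k by summing the joint table."""
    p = len(next(iter(joint)))
    return [
        sum(prob for cell, prob in joint.items() if cell[k] == 1)
        for k in range(p)
    ]


def mass(joint, x):
    """Product form of the mass function: each cell probability is raised to
    an exponent that equals 1 only for the cell matching x, and 0 otherwise."""
    value = 1.0
    for cell, prob in joint.items():
        exponent = 1
        for a, xk in zip(cell, x):
            exponent *= xk if a == 1 else (1 - xk)
        value *= prob ** exponent
    return value


# Hypothetical joint table for p = 3.
cells = list(product((0, 1), repeat=3))
weights = [1 + sum(c) for c in cells]
joint = {c: w / sum(weights) for c, w in zip(cells, weights)}

n_free_parameters = len(joint) - 1          # 2**p - 1 with p = 3
marginals = marginal_probabilities(joint)
print(n_free_parameters, marginals)
print(mass(joint, (1, 0, 1)) == joint[(1, 0, 1)])
```

The table has one entry per binary pattern, so its size doubles with every additional variable; already at $p = 30$ it would hold more than a billion cells, which is the computational obstacle described above.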
4.3.2 Binomial distribution
The bivariate binomial distribution (BBD) was first introduced by Aitken and Gonin (1936) in the context of the analysis of $2 \times 2$ contingency tables when the two outcomes are not independent. Several extensions have been provided since, including work by Krishnamoorthy (1951), who derived the properties of the BBD by extending the moment-generating function from the independent case to dependent variables. Hudson et al. (1986) established limit theorems for the BBD, expressing them as sums of independent multivariate Bernoulli variables. Several other researchers have discussed the properties of the BBD. For a recent list of publications, please refer to Biswas and Hwang (2002) and the references therein.

The multivariate binomial distribution (MBD) also suffers from the same curse of dimensionality as the Bernoulli distribution. The total number of parameters required to define the $p$-dimensional distribution is equal to $2^p - 1$. The multivariate binomial distribution poses several questions that still need to be answered. For instance, it would be of interest to simplify the distribution for a restricted parameter set. If we assume only $k$-fold interactions are feasible, then the model can be reduced to have $2^k - 1$ parameters. The generalized additive and multiplicative binomial distribution models proposed by Altham (1978) can serve as motivation for building such reduced models.

MBD can also be used to model several data sets in genomics. For instance, when studying epigenomic modifications such as DNA methylation, comethylation (mutual methylation of pairs of genes) is actively studied for understanding their association with different phenotypes (outcomes). MBD can be used to model the joint probability of methylation of pairs of genes. However, the major bottleneck that needs to be solved first is the computational complexity. Currently, there are no existing tools to compute and model MBD. With improved computational capabilities, this task should be accomplished easily.
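In the bivariate case the computation is still light. The sketch below assumes the construction mentioned above, in which $(X_1, X_2)$ is the sum of $n$ i.i.d. bivariate Bernoulli pairs with cell probabilities $\pi_{00}, \pi_{01}, \pi_{10}, \pi_{11}$. Under that assumption the joint mass function follows from multinomial counting of the four cell types, summing over the number $z$ of trials in which both components equal one. The function name and the numerical values are hypothetical, and the formula is a derivation under the stated assumption rather than a quotation from the cited papers.

```python
import math


def bivariate_binomial_pmf(x1, x2, n, p00, p01, p10, p11):
    """P(X1 = x1, X2 = x2) for the sum of n i.i.d. bivariate Bernoulli pairs,
    where p_ab = P(first component = a, second component = b)."""
    total = 0.0
    # z = number of trials in which both components equal 1.
    for z in range(max(0, x1 + x2 - n), min(x1, x2) + 1):
        n10, n01, n00 = x1 - z, x2 - z, n - x1 - x2 + z
        coef = math.factorial(n) // (
            math.factorial(z) * math.factorial(n10)
            * math.factorial(n01) * math.factorial(n00)
        )
        total += coef * p11**z * p10**n10 * p01**n01 * p00**n00
    return total


# Hypothetical cell probabilities (they sum to one) and number of trials.
n, p00, p01, p10, p11 = 6, 0.4, 0.1, 0.2, 0.3
pmf = {
    (a, b): bivariate_binomial_pmf(a, b, n, p00, p01, p10, p11)
    for a in range(n + 1)
    for b in range(n + 1)
}
total_mass = sum(pmf.values())
mean1 = sum(a * v for (a, b), v in pmf.items())
mean2 = sum(b * v for (a, b), v in pmf.items())
cov12 = sum(a * b * v for (a, b), v in pmf.items()) - mean1 * mean2
print(total_mass, mean1, mean2, cov12)
```

Here a single inner sum suffices; the $p$-dimensional analogue requires a sum over all ways of splitting the $n$ trials among $2^p$ cells, which is where the cost grows out of reach.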
4.3.3 Poisson distribution
Constructing a multivariate Poisson distribution whose marginals are univariate Poisson variables is fairly easy. Consider the bivariate case. If $Z_k \sim \mathrm{Pois}(\lambda_k)$, $k = 1, 2, 3$ are independent Poisson variables, then $X = (X_1, X_2)$ defined as

$$X_1 = Z_1 + Z_3, \qquad X_2 = Z_2 + Z_3,$$

gives a bivariate distribution with Poisson marginals, $X_1 \sim \mathrm{Pois}(\lambda_1 + \lambda_3)$ and $X_2 \sim \mathrm{Pois}(\lambda_2 + \lambda_3)$. The joint mass function can be expressed as

$$\begin{aligned} P(X_1 = x_1, X_2 = x_2) &= \sum_{z=0}^{\min(x_1, x_2)} P(Z_1 = x_1 - z,\ Z_2 = x_2 - z,\ Z_3 = z) \\ &= e^{-(\lambda_1 + \lambda_2 + \lambda_3)} \sum_{z=0}^{\min(x_1, x_2)} \frac{\lambda_1^{x_1 - z}\, \lambda_2^{x_2 - z}\, \lambda_3^{z}}{(x_1 - z)!\, (x_2 - z)!\, z!}. \end{aligned} \tag{59}$$
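The mass function (59) can be evaluated directly by summing over the shared component. The sketch below does this on the log scale for numerical stability and checks, on a truncated grid, that the mass sums to one and that the covariance of $X_1$ and $X_2$ equals $\lambda_3$, the variance of the common term $Z_3$. The function name, the rates and the truncation point are hypothetical, and all three rates are assumed to be strictly positive.

```python
import math


def bivariate_poisson_pmf(x1, x2, lam1, lam2, lam3):
    """Evaluate Eq. (59) by summing over the shared component Z3 = z.

    Assumes lam1, lam2, lam3 > 0 so that the logarithms are defined.
    """
    total = 0.0
    for z in range(min(x1, x2) + 1):
        log_term = (
            (x1 - z) * math.log(lam1) - math.lgamma(x1 - z + 1)
            + (x2 - z) * math.log(lam2) - math.lgamma(x2 - z + 1)
            + z * math.log(lam3) - math.lgamma(z + 1)
        )
        total += math.exp(log_term)
    return math.exp(-(lam1 + lam2 + lam3)) * total


# Hypothetical rates; the grid is truncated at 40, far in the tail.
lam1, lam2, lam3 = 1.0, 2.0, 0.5
grid = range(41)
pmf = {(a, b): bivariate_poisson_pmf(a, b, lam1, lam2, lam3) for a in grid for b in grid}

total_mass = sum(pmf.values())
mean1 = sum(a * v for (a, b), v in pmf.items())
mean2 = sum(b * v for (a, b), v in pmf.items())
cov12 = sum(a * b * v for (a, b), v in pmf.items()) - mean1 * mean2
print(total_mass, mean1, mean2, cov12)
```

Each evaluation costs $\min(x_1, x_2) + 1$ terms, which is negligible in two dimensions but foreshadows the nested summations required in higher dimensions.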
Extending to $p$ dimensions, the multivariate Poisson is defined through the latent $Z_k$’s as

$$X_k = Z_{kk} + \sum_{j \neq k} Z_{kj}, \qquad k = 1, \ldots, p, \tag{60}$$

where $Z_{jk} \sim \mathrm{Pois}(\lambda_{jk})$. Expressing the latent variables in matrix form $(Z_{jk})_{j,k = 1, \ldots, p}$, defining $X$ requires $p(p + 1)/2$ independent latent components. The mass function can be expressed as $p(p - 1)/2$ summations and is computationally intensive for even moderate values of $p$. A more general form of the multivariate Poisson requires $2^p - 1$ latent components and is infeasible to express as in Eq. (60). The following trivariate Poisson should serve as a basic overview of the idea:

$$\begin{aligned} X_1 &= Z_1 + Z_{12} + Z_{13} + Z_{123}, \\ X_2 &= Z_2 + Z_{12} + Z_{23} + Z_{123}, \\ X_3 &= Z_3 + Z_{13} + Z_{23} + Z_{123}. \end{aligned} \tag{61}$$

The main drawback with this formulation of the multivariate Poisson distribution is its restrictive dependence structure. In the bivariate case, the correlation between $X_1$ and $X_2$ is given by

$$\mathrm{cor}(X_1, X_2) = \frac{\lambda_3}{\sqrt{\lambda_1 + \lambda_3}\, \sqrt{\lambda_2 + \lambda_3}},$$

which is always positive. Extending the distribution to a larger class of correlation structures, Shin and Pasupathy (2010) proposed using the normal to anything (NORTA) algorithm (Cario and Nelson, 1997) for random number generation from multivariate Poisson with negative correlations. They define the iterative procedure for generating bivariate Poisson variables with correlation $\rho$ as follows. Let $U_1, U_2, U_3 \sim U(0, 1)$ be i.i.d. variables. A bivariate
Poisson distribution with marginals $X_1 \sim \mathrm{Pois}(\lambda_1)$ and $X_2 \sim \mathrm{Pois}(\lambda_2)$ can be obtained using

$$X_1 = F^{-1}_{\lambda_1 - \lambda^*}(U_1) + F^{-1}_{\lambda^*}(U_3), \qquad X_2 = \begin{cases} F^{-1}_{\lambda_2 - \lambda_2 \lambda^*/\lambda_1}(U_2) + F^{-1}_{\lambda_2 \lambda^*/\lambda_1}(U_3) & \text{if } \rho > 0, \\[4pt] F^{-1}_{\lambda_2 - \lambda_2 \lambda^*/\lambda_1}(U_2) + F^{-1}_{\lambda_2 \lambda^*/\lambda_1}(1 - U_3) & \text{if } \rho < 0, \end{cases} \tag{62}$$

where $F^{-1}_{\lambda}(x) = \inf\{y : F_{\lambda}(y) \geq x\}$ is the inverse Poisson cumulative distribution function with parameter $\lambda$. The parameter $\lambda^* \in [0, \lambda_1]$, assuming $\lambda_1 \leq \lambda_2$. If $\lambda_1 \geq \lambda_2$, $X_1$ and $X_2$ can be interchanged. While this formulation gives a method for generating random samples from bivariate Poisson variables with negative correlations, it is unusable for inference as the likelihood function is not available. Obtaining the likelihood function for the bivariate case using (62) and parameter estimation using the derived likelihood are a few open problems in using this construction of multivariate Poisson variables.

Karlis and Xekalaki (2010) developed another approach to characterize multivariate Poisson random variables by compounding independent Poisson components through a multivariate distribution on their parameters. If $\lambda = (\lambda_1, \ldots, \lambda_p) \sim G(\Theta)$ is a multivariate distribution, then a dependence structure on $X$ can be imposed by taking a mixture of independent Poisson distributions with $G$,

$$P(X = x \mid \Theta) = \int_{\mathbb{R}_+^p} \prod_{k=1}^{p} \frac{e^{-\lambda_k} \lambda_k^{x_k}}{x_k!} \, g(\lambda; \Theta)\, d\lambda_1 \ldots d\lambda_p. \tag{63}$$
A popular choice for $G$ is the log-normal distribution, since the distribution should be defined on $\mathbb{R}_+^p$. This formulation has two advantages. First, the covariance structure on $\lambda$ will impart a dependence structure on $X$. Second, the mixture model ensures that the variance of $X_k$ is greater than $\lambda_k$ for all components, thereby addressing over-dispersion. For more details, readers may refer to Inouye et al. (2017) and the references therein for further work on the multivariate Poisson distribution.

Multivariate Poisson distributions are fairly new and have many problems that need to be addressed. The framework for hypothesis testing is not extensively developed, and there is very limited literature in this regard. For example, Stern and Zacks (2002) developed a test for the bivariate Poisson model in (59), testing $H_0 : \lambda_3 = 0$ versus $H_A : \lambda_3 \neq 0$ using a Bayesian significance test. Testing hypotheses comparing two or more multivariate Poisson families is not addressed. High-dimensional tools for multivariate Poisson are extremely hard to develop due to the exponential computation cost: $2^p - 1$ latent variables are required to define the distribution.
Restricted models, such as using only pairwise correlations in (61), have a quadratic computation cost and are easier to study. These could potentially be a good starting point for studying the complete model.
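Returning to the construction in (62), the sampler is short enough to sketch in full. The code below implements the inverse Poisson distribution function by accumulating probabilities and then generates pairs with either a shared uniform (positive dependence) or an antithetic one (negative dependence). The mapping from $\lambda^*$ to a target correlation $\rho$ is not covered here; the boolean argument `positive`, the function names, the rates and the seed are hypothetical choices of this sketch.

```python
import math
import random


def poisson_quantile(u, lam):
    """Smallest y with F_lam(y) >= u, the inverse CDF used in Eq. (62)."""
    if lam <= 0.0:
        return 0
    y = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u and pmf > 0.0:
        y += 1
        pmf *= lam / y
        cdf += pmf
    return y


def sample_pair(lam1, lam2, lam_star, positive, rng):
    """One draw of (X1, X2) following Eq. (62).

    Requires 0 <= lam_star <= lam1 <= lam2.
    """
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    shared2 = lam2 * lam_star / lam1
    x1 = poisson_quantile(u1, lam1 - lam_star) + poisson_quantile(u3, lam_star)
    v3 = u3 if positive else 1.0 - u3   # antithetic uniform for negative dependence
    x2 = poisson_quantile(u2, lam2 - shared2) + poisson_quantile(v3, shared2)
    return x1, x2


def correlation(pairs):
    """Sample correlation of a list of (a, b) pairs."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    cov = sum((a - m1) * (b - m2) for a, b in pairs) / n
    v1 = sum((a - m1) ** 2 for a, _ in pairs) / n
    v2 = sum((b - m2) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(v1 * v2)


rng = random.Random(2024)
neg = [sample_pair(2.0, 3.0, 1.5, False, rng) for _ in range(20000)]
pos = [sample_pair(2.0, 3.0, 1.5, True, rng) for _ in range(20000)]
print(correlation(neg), correlation(pos))
```

Because each coordinate is a sum of two independent Poisson quantile transforms whose rates add up to $\lambda_1$ or $\lambda_2$, the marginals are preserved regardless of the sign of the dependence; only the coupling through $U_3$ changes.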
5 Conclusion
High-dimensional inference is a very exciting field of statistics with many theoretical challenges and practical uses. Availability of large-scale and high-dimensional data is increasing by leaps and bounds. Conducting large-scale analysis has become practical with the availability of high performance computing facilities. There is an urgent need to develop statistical tools that can tackle these large dimensional data sets efficiently and accurately. Statistical methodology and computational tools need to progress in conjunction with each other, leaving the onus on statisticians to develop more accurate methods for estimation and inference.

In this chapter, we have looked at three areas of high-dimensional inference that are being actively developed. Hypothesis testing for the population mean is one of the more standard inference problems, which has been well studied in high dimensions. We looked at the two main approaches: asymptotics-based tests and random projection-based tests. The asymptotics-based tests have been fairly well studied in comparison to the random projection-based tests. Projection into lower dimensional spaces using random matrices is an active area of research in mean vector testing. We should consider other methods for dimension reduction to study their use in high-dimensional inference. Convolutional neural networks (CNN) (Goodfellow et al., 2016), which are commonly used in deep learning, are another exciting dimension reduction technique that can potentially be used for high-dimensional inference.

Sparse covariance matrix estimation has found practical use in understanding the graphical network structure of variables in high dimensions. We looked at different approaches to construct the regularization and the computational tools developed for optimization.
While Gaussianity of variables is commonly assumed in sparse precision matrix estimation due to its properties, extension to non-Gaussian distributions is yet to be studied. We have looked at hypothesis testing for comparing two or more covariance matrices in the high-dimensional setting. One approach we can identify that is lacking is the use of random projections in covariance matrix testing. This poses an interesting challenge to see the versatility of random projections in high-dimensional inference.

Finally, we looked at the development of discrete multivariate models and the challenges therein. Only two distributions have been extensively studied: multinomial and Dirichlet-Multinomial. We looked at high-dimensional hypothesis tests for the multinomial parameters. The hierarchical models and sparse regression models for the Dirichlet-Multinomial distribution are also well studied. However, a lot of work needs to be done for other distributions.
The theoretical developments in multivariate Bernoulli models need to be supplemented with computational tools for estimation and inference. A generalized multivariate Poisson distribution needs to be developed, which can lead to potential extensions such as multivariate Poisson-gamma mixtures.
References
Achlioptas, D., 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci. 66 (4), 671–687. https://doi.org/10.1016/S0022-0000(03)00025-4.
Aitken, A.C., Gonin, H.T., 1936. XI.—On fourfold sampling with and without replacement. Proc. R. Soc. Edinb. 55, 114–125. https://doi.org/10.1017/S0370164600014413.
Altham, P.M.E., 1978. Two generalizations of the binomial distribution. J. R. Stat. Soc. C (Applied Statistics) 27 (2), 162–167. https://doi.org/10.2307/2346943.
Anderson, T.W., 2003. An Introduction to Multivariate Statistical Analysis, third ed. John Wiley and Sons.
Ayyala, D.N., Frankhouser, D.E., Marcucci, G., Ganbat, J.-O., Yan, P., Bundschuh, R., Lin, S., 2015. Statistical methods for detecting differentially methylated regions based on MethylCap-seq data. Brief. Bioinform. 17 (6), 926–937. https://doi.org/10.1093/bib/bbv089.
Ayyala, D.N., Park, J., Roy, A., 2017. Mean vector testing for high-dimensional dependent observations. J. Multivar. Anal. 153, 136–155. https://doi.org/10.1016/j.jmva.2016.09.012.
Bai, Z., Saranadasa, H., 1996. Effect of high dimension: by an example of a two sample problem. Stat. Sinica 6, 311–329.
Bai, Z., Jiang, D., Yao, J.F., Zheng, S., 2009. Corrections to LRT on large-dimensional covariance matrix by RMT. Ann. Stat. 37 (6B), 3822–3840. https://doi.org/10.1214/09-AOS694.
Balakrishnan, S., Wasserman, L., 2018. Hypothesis testing for high-dimensional multinomials: a selective review. Ann. Appl. Stat. 12 (2), 727–749. https://doi.org/10.1214/18-AOAS1155SF.
Barmi, H.E., Dykstra, R.L., 1994. Restricted multinomial maximum likelihood estimation based upon Fenchel duality. Stat. Probab. Lett. 21 (2), 121–130. https://doi.org/10.1016/0167-7152(94)90219-4.
Bickel, P.J., Levina, E., 2008. Covariance regularization by thresholding. Ann. Stat. 36 (6), 2577–2604. https://doi.org/10.1214/08-AOS600.
Bien, J., Tibshirani, R.J., 2011. Sparse estimation of a covariance matrix. Biometrika 98 (4), 807–820. https://doi.org/10.1093/biomet/asr054.
Bingham, E., Mannila, H., 2001. Random projection in dimensionality reduction: applications to image and text data. In: KDD ’01. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, pp. 245–250.
Biswas, A., Hwang, J.S., 2002. A new bivariate binomial distribution. Stat. Probab. Lett. 60 (2), 231–240. https://doi.org/10.1016/S0167-7152(02)00323-1.
Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022. https://doi.org/10.1162/jmlr.2003.3.4-5.993.
Brockwell, P.J., Davis, R.A., 1986. Time Series: Theory and Methods. Springer-Verlag, Berlin, Heidelberg. ISBN: 0-387-96406-1.
Cai, T., Liu, W., Luo, X., 2011. A constrained ℓ1 minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc. 106 (494), 594–607. https://doi.org/10.1198/jasa.2011.tm10155.
Cai, T.T., Liu, W., Xia, Y., 2014. Two-sample test of high dimensional means under dependence. J. R. Stat. Soc. B Stat. Methodol. 76 (2), 349–372. https://doi.org/10.1111/rssb.12034.
Cario, M.C., Nelson, B.L., 1997. Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Ind. Eng. 1–19. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.48.281&rep=rep1&type=pdf.
Chan, S.-O., Diakonikolas, I., Valiant, P., Valiant, G., 2013. Optimal algorithms for testing closeness of discrete distributions. In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1193–1203.
Chen, J., Li, H., 2013. Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. Ann. Appl. Stat. 7 (1), 418–442. https://doi.org/10.1214/12-AOAS592.
Chen, S.X., Qin, Y.L., 2010. A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Stat. 38 (2), 808–835. https://doi.org/10.1214/09-AOS716.
Chen, S.X., Zhang, L.X., Zhong, P.S., 2010. Tests for high-dimensional covariance matrices. J. Am. Stat. Assoc. 105 (490), 810–819. https://doi.org/10.1198/jasa.2010.tm09560.
Cho, S., Lim, J., Ayyala, D.N., Park, J., Roy, A., 2019. Note on mean vector testing for high-dimensional dependent observations. arXiv e-prints arXiv:1904.09344.
Chung, J.H., Fraser, D.A.S., 1958. Randomization tests for a multivariate two-sample problem. J. Am. Stat. Assoc. 53 (283), 729–735. https://www.jstor.org/stable/2282050.
Crossley, S.A., Dascalu, M., McNamara, D.S., 2017. How important is size? An investigation of corpus size and meaning in both latent semantic analysis and Latent Dirichlet allocation. In: Proceedings of the Thirtieth International Florida Artificial Intelligence Research Society Conference, pp. 293–296.
Dai, B., Ding, S., Wahba, G., 2013. Multivariate Bernoulli distribution. Bernoulli 19 (4), 1465–1483. https://doi.org/10.3150/12-BEJSP10.
Danaher, P.J., 1988. Parameter estimation for the Dirichlet-multinomial distribution using supplementary beta-binomial data. Commun. Stat. Theory Methods 17 (6), 1777–1788. https://doi.org/10.1080/03610928808829713.
Danaher, P., Wang, P., Witten, D.M., 2014. The joint graphical lasso for inverse covariance estimation across multiple classes. J. R. Stat. Soc. B Stat. Methodol. 76 (2), 373–397. https://doi.org/10.1111/rssb.12033.
Daniels, M.J., Kass, R.E., 2001. Shrinkage estimators for covariance matrices. Biometrics 57 (4), 1173–1184. https://doi.org/10.1111/j.0006-341X.2001.01173.x.
Dempster, A.P., 1958. A high dimensional two sample significance test. Ann. Math. Stat. 29 (4), 995–1010. https://doi.org/10.1214/aoms/1177706437.
Fan, J., Han, F., Liu, H., 2014. Challenges of big data analysis. Natl. Sci. Rev. 1 (2), 293–314. https://doi.org/10.1093/nsr/nwt032.
Fradkin, D., Madigan, D., 2003. Experiments with random projections for machine learning. In: KDD ’03. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, pp. 517–522.
Friedman, J., Hastie, T., Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (3), 432–441. https://doi.org/10.1093/biostatistics/kxm045.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
Gregory, K.B., Carroll, R.J., Baladandayuthapani, V., Lahiri, S.N., 2015. A two-sample test for equality of means in high dimension. J. Am. Stat. Assoc. 110 (510), 837–849. https://doi.org/10.1080/01621459.2014.934826.
Guo, J., Levina, E., Michailidis, G., Zhu, J., 2011. Joint estimation of multiple graphical models. Biometrika 98 (1), 1–15. https://doi.org/10.1093/biomet/asq060.
Hariharan, H.S., Velu, R.P., 1993. On estimating Dirichlet parameters—a comparison of initial values. J. Stat. Comput. Simulation 48 (1–2), 47–58. https://doi.org/10.1080/00949659308811539.
Hoeffding, W., 1965. Asymptotically optimal tests for multinomial distributions. Ann. Math. Stat. 36 (2), 369–401. https://www.jstor.org/stable/2238145.
Hoffman, M., Bach, F.R., Blei, D.M., 2010. Online learning for Latent Dirichlet allocation. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (Eds.), Advances in Neural Information Processing Systems 23. Curran Associates, Inc., pp. 856–864. http://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation.pdf.
Holmes, I., Harris, K., Quince, C., 2012. Dirichlet multinomial mixtures: generative models for microbial metagenomics. PLoS ONE 7 (2), e30126. https://doi.org/10.1371/journal.pone.0030126.
Hotelling, H., 1931. The generalization of Student’s ratio. Ann. Math. Stat. 2 (3), 360–378. https://doi.org/10.1214/aoms/1177732979.
Hudson, W.N., Tucker, H.G., Veeh, J.A., 1986. Limit theorems for the multivariate binomial distribution. J. Multivar. Anal. 18 (1), 32–45. https://doi.org/10.1016/0047-259X(86)90056-4.
Inouye, D., Yang, E., Allen, G., Ravikumar, P., 2017. A review of multivariate distributions for count data derived from the Poisson distribution. Wiley Interdiscip. Rev. Comput. Stat. 9 (3), e1398. https://doi.org/10.1002/wics.1398.
Jewell, N.P., Kalbfleisch, J.D., 2004. Maximum likelihood estimation of ordered multinomial parameters. Biostatistics 5 (2), 291–306. https://doi.org/10.1093/biostatistics/5.2.291.
Jiang, D., Jiang, T., Yang, F., 2012. Likelihood ratio tests for covariance matrices of high-dimensional normal distributions. J. Stat. Plan. Inference 142 (8), 2241–2256. https://doi.org/10.1016/j.jspi.2012.02.057.
John, S., 1971. Some optimal multivariate tests. Biometrika 58 (1), 123–127. https://doi.org/10.1093/biomet/58.1.123.
Johnson, W.B., Lindenstrauss, J., 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26, 189–206. https://doi.org/10.1090/conm/026/737400.
Karlis, D., Xekalaki, E., 2010. Mixed Poisson distributions. Int. Stat. Rev. 73 (1), 35–58. https://doi.org/10.1111/j.1751-5823.2005.tb00250.x.
Krishnamoorthy, A.S., 1951. Multivariate binomial and Poisson distributions. Sankhya B 11 (2), 117–124. https://www.jstor.org/stable/25048072.
Kudo, A., 1963. A multivariate analogue of the one-sided test. Biometrika 50 (3), 403–418. https://www.jstor.org/stable/2333909.
Ledoit, O., Wolf, M., 2002. Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. Ann. Stat. 30 (4), 1081–1102.
Leonard, T., 1977. A Bayesian approach to some multinomial estimation and pretesting problems. J. Am. Stat. Assoc. 72 (360), 869–874.
Levin, B., 1981. A representation for multinomial cumulative distribution functions. Ann. Stat. 9 (5), 1123–1126. https://www.jstor.org/stable/2240628.
Li, J., Chen, S.X., 2012. Two sample tests for high-dimensional covariance matrices. Ann. Stat. 40 (2), 908–940. https://doi.org/10.1214/12-AOS993.
Li, P., Hastie, T.J., Church, K.W., 2006. Very sparse random projections. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD ’06, pp. 287–296.
Lopes, M., Jacob, L., Wainwright, M.J., 2011. A more powerful two-sample test in high dimensions using random projection. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 24. Curran Associates, Inc., pp. 1206–1214.
McMurdie, P.J., Holmes, S., 2014. Waste not, want not: why rarefying microbiome data is inadmissible. PLOS Comput. Biol. 10 (4), 1–12. https://doi.org/10.1371/journal.pcbi.1003531.
Miller, K.S., 1981. On the inverse of the sum of matrices. Math. Mag. 54 (2), 67–72. https://www.jstor.org/stable/2690437.
Mimno, D., Hoffman, M.D., Blei, D.M., 2012. Sparse stochastic inference for Latent Dirichlet allocation. In: ICML ’12. Proceedings of the 29th International Conference on Machine Learning, pp. 1515–1522. http://arxiv.org/abs/1206.6425.
Morris, C., 1975. Central limit theorems for multinomial sums. Ann. Stat. 3 (1), 165–188. https://www.jstor.org/stable/2958086.
Muirhead, R.J., 1982. Aspects of Multivariate Statistical Theory. John Wiley and Sons.
Nagao, H., 1973. On some test criteria for covariance matrix. Ann. Stat. 1 (4), 700–709.
Nelson, R.B., 2006. An Introduction to Copulas, second ed. Springer Series in Statistics. ISBN: 0-387-28659-4.
Nunes, D., Antunes, L., 2018. Neural random projections for language modelling. CoRR abs/1807.00930. http://arxiv.org/abs/1807.00930.
Park, J., Ayyala, D.N., 2013. A test for the mean vector in large dimension and small samples. J. Stat. Plan. Inference 143 (5), 929–943. https://doi.org/10.1016/J.JSPI.2012.11.001.
Plunkett, A., Park, J., 2019. Two-sample test for sparse high-dimensional multinomial distributions. Test 28, 804–826. https://doi.org/10.1007/s11749-018-0600-8.
Rao, C.R., 1952. Advanced Statistical Methods in Biometric Research. Wiley. ISBN 02-850820-3.
Rao, C.R., 1957. Maximum likelihood estimation for the multinomial distribution. Sankhyā: Indian J. Stat. (1933–1960) 18 (1/2), 139–148. http://www.jstor.org/stable/25048341.
Ronning, G., 1989. Maximum likelihood estimation of Dirichlet distributions. J. Stat. Comput. Simulation 32 (4), 215–221. https://doi.org/10.1080/00949658908811178.
Schott, J.R., 2007. A test for the equality of covariance matrices when the dimension is large relative to the sample sizes. Comput. Stat. Data Anal. 51 (12), 6535–6542. https://doi.org/10.1016/j.csda.2007.03.004.
Shin, K., Pasupathy, R., 2010. An algorithm for fast generation of bivariate Poisson random vectors. INFORMS J. Comput. 22 (1), 81–92. https://doi.org/10.1287/ijoc.1090.0332.
Sklar, M., 2014. Fast MLE computation for the Dirichlet multinomial. http://arxiv.org/abs/1405.0099.
Srivastava, M.S., 2009. A test for the mean vector with fewer observations than the dimension under non-normality. J. Multivar. Anal. 100 (3), 518–532. https://doi.org/10.1016/j.jmva.2008.06.006.
Srivastava, M.S., 2013. Some tests concerning the covariance matrix in high dimensional data. J. Jpn Stat. Soc. 35 (2), 251–272. https://doi.org/10.14490/jjss.35.251.
Srivastava, M.S., Du, M., 2008. A test for the mean vector with fewer observations than the dimension. J. Multivar. Anal. 99 (3), 386–402. https://doi.org/10.1016/j.jmva.2006.11.002.
Srivastava, M.S., Yanagihara, H., 2010. Testing the equality of several covariance matrices with fewer observations than the dimension. J. Multivar. Anal. 101 (6), 1319–1329. https://doi.org/10.1016/j.jmva.2009.12.010.
Srivastava, M.S., Katayama, S., Kano, Y., 2013. A two sample test in high dimensional data. J. Multivar. Anal. 114 (1), 349–358. https://doi.org/10.1016/j.jmva.2012.08.014.
Srivastava, M.S., Yanagihara, H., Kubokawa, T., 2014. Tests for covariance matrices in high dimension with less sample size. J. Multivar. Anal. 130, 289–309. https://doi.org/10.1016/j.jmva.2014.06.003.
Chapter 7
Big data challenges in genomics
Hongyan Xu*
Division of Biostatistics and Data Science, Department of Population Health Sciences, Medical College of Georgia, Augusta University, Augusta, GA, United States
* Corresponding author: e-mail: [email protected]
Abstract
With recent developments in biotechnology, especially next-generation sequencing in genomics, there has been an explosion of genomic data. The data are big in terms of both volume and diversity. Big data contain much more information, but they also pose unprecedented challenges for data analysis. In this article, we discuss the big data challenges and opportunities in genomics research. We also discuss possible solutions to these challenges, which can serve as the basis for future research.
Keywords: Next-generation sequencing, Data volume, Human genomics, Disease genomics, Bioinformatics, Computational biology, Machine learning, RNA-Seq
1 Introduction
Recent developments in genomic biotechnology have led to revolutionary approaches to genome sequencing that are less costly and mostly high-throughput in nature. This revolution in genomics research started with the genome sequencing projects. The first sequencing project of a model organism was the whole-genome sequencing of the bacterium Haemophilus influenzae Rd (Fleischmann et al., 1995). The genome of the first eukaryotic organism, Saccharomyces cerevisiae, was sequenced in the following year through the collaboration of 19 countries (Goffeau et al., 1996). At the turn of the century, the flowering plant Arabidopsis thaliana was sequenced as the first plant genome project (Arabidopsis Genome Initiative, 2000). Most notably, the Human Genome Project (HGP) started in 1990 and was completed in 2003 through the collaboration of 20 institutions and genome centers in 6 countries (International Human Genome Sequencing Consortium, 2004). One consequence of these early genome projects was the development of genomic biotechnologies that enable the production of high-throughput data at increasingly lower cost. Most notably, next-generation sequencing (NGS) based approaches have enabled subsequent genome projects, including the Encyclopedia of DNA Elements (ENCODE) project (ENCODE Project Consortium, 2004), the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2010), and the Roadmap Epigenomics Project (Roadmap Epigenomics Consortium et al., 2015). More recently, these technologies have been used in larger population-based genomic projects such as the 100,000 Genomes Project in the United Kingdom and the GenomeAsia 100K project. The goal of these latter two projects is to sequence 100,000 individuals in the United Kingdom and Asia, respectively, in order to understand health, population structure, and relatedness in these populations.
Handbook of Statistics, Vol. 43. https://doi.org/10.1016/bs.host.2019.08.002 © 2020 Elsevier B.V. All rights reserved.
2 Next-generation sequencing
Among these technologies, next-generation sequencing (NGS) is the most prominent and underlies current genome projects. NGS differs from traditional Sanger sequencing in that it sequences DNA strands in a highly parallel fashion and can generate millions of short reads, which substantially increases sequencing throughput. The technology has evolved rapidly since its beginnings in the mid-2000s. These advances have driven the cost of whole-genome sequencing (WGS) below $1000 and are making genome sequencing an increasingly available clinical tool for personalized medicine. WGS has been used in large-scale population-based genome projects to study the patterns of genetic variation in worldwide populations and their relationship with various phenotypes, including disease risk factors and traits. Besides improved throughput and decreased cost, WGS has the advantage of providing an unbiased survey of genetic variants across the genome compared to other high-throughput methods such as genotyping microarrays.
The applications of NGS are not limited to WGS. Exome sequencing and targeted sequencing are also popular approaches for investigating focused genomic regions. By focusing on certain genomic regions, these approaches allow higher sequencing depth and larger sample sizes, which is valuable for characterizing rare genetic variants. With the completion of the HGP and many genome-wide association studies, rare genetic variants have been shown to be important in explaining "missing heritability" and the risk of complex diseases (Gorlov et al., 2008).
Beyond sequencing of primary DNA, NGS has been applied to study gene expression and gene regulation with technologies such as ChIP-seq, Methyl-seq, and RNA-seq. ChIP-seq uses immunoprecipitation to pull down protein-DNA complexes and enrich the DNA segments that interact with the protein. The pulled-down DNA is then subjected to NGS (Park, 2009). In Methyl-seq, methylated DNA segments are selectively enriched and sequenced with NGS; the method has been widely used in epigenetic studies because of its good coverage of the genome. RNA-seq is a popular approach for evaluating gene expression levels. In a typical RNA-seq experiment, RNA is reverse transcribed to cDNA, which is then subjected to NGS. Compared to microarrays, RNA-seq has good coverage and resolution. It has been used extensively in transcriptome studies to quantify RNA expression and to characterize alternative splicing and RNA isoforms.
Big Data is the term used to describe the high-throughput data generated by NGS and related technologies. Big Data creates unique challenges and opportunities characterized by the 5Vs: Volume, Velocity, Variety, Veracity, and Value. In the following sections, we lay out several challenges posed by Big Data in genomics.
3 Data integration
With current technologies in genomics, data at different layers are available, such as primary DNA sequencing data, DNA methylation data, gene expression data, and environmental factors. The goal of data integration is to relate these different types of data to responses such as disease status, disease progression, and response to treatment. The relationship among the different types of data is depicted in Fig. 1. Primary DNA sequences contain all the genetic information, the blueprint for life. To realize the information encoded in the DNA sequences, they need to be expressed as RNAs and proteins, which have biological functions. All the cells in a human contain the same DNA sequences, yet different cell types have different gene expression, hence the different morphology and functions of different cell types. In the human population, genetic variations (mutations) lead to variations in gene expression, which can then lead to diseases.
FIG. 1 Relationship of various types of genomic data and diseases. The arrows show the potential direction of the effect.
DNA methylation is a modification of DNA sequences (epigenetics). Levels of DNA methylation are determined partly by genetic variation and partly by environmental factors such as smoking and diet. The main effect of DNA methylation is thought to be on gene expression. Hypermethylation in gene promoter regions has been shown to be inversely related to gene expression, but the relationship between gene expression and DNA methylation in other genomic regions is not so clear-cut. DNA methylation is carried out by proteins (enzymes) in the methylation pathway, whose gene expression levels in turn affect DNA methylation levels. As the final piece of the puzzle, gene expression is affected by DNA sequences, DNA methylation, and environmental factors. Variation in gene expression will eventually lead to different responses (phenotypes), including diseases. With all these interconnected parts, it is highly likely that variation at the DNA level is reflected as variation at the DNA methylation and gene expression levels, and the signal may be amplified at successive levels. Therefore, integrated analysis of the various types of genomic data has the potential to maximize statistical power by combining information across data types (Ritchie et al., 2015).
Current methods for integrated analysis of genomic data can be roughly put into two categories: multi-stage analysis and simultaneous analysis. In multi-stage analysis, the analysis consists of multiple steps, roughly following the flow of information in Fig. 1. For example, step 1 could test the association of sequence variation with the phenotype; the genetic variants that pass this step are then used to filter the gene expression data. In step 2, the expression values of the corresponding genes are tested for association with the phenotype. The specific statistical methods used for the association tests depend on the outcome variable. For a continuous outcome variable, linear regression is a commonly used approach. For a categorical outcome variable, common methods are based on generalized linear models such as logistic regression.
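The two steps described above can be sketched in code. The following is a minimal illustration under stated assumptions, not the procedure of any particular study: the function names (`marginal_p_values`, `two_stage`), the SNP-to-gene map `snp_to_gene`, the cutoffs `alpha1` and `alpha2`, and the simulated data are all hypothetical. It assumes a continuous outcome, tests each variable with a simple linear regression, and uses a normal approximation to the t statistic, which is only reasonable for moderately large samples.

```python
import math
import numpy as np

def marginal_p_values(X, y):
    """Two-sided p-values for the slope of y ~ x_j, one column of X at a time.

    The slope test of a simple linear regression is equivalent to a test of
    the Pearson correlation r, with t = r * sqrt((n - 2) / (1 - r^2)).
    A normal approximation replaces the t distribution (sketch assumption).
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])

def two_stage(genotypes, expression, snp_to_gene, phenotype, alpha1=1e-3, alpha2=0.05):
    """Stage 1 filters SNPs; stage 2 tests only the genes linked to kept SNPs."""
    p_snp = marginal_p_values(genotypes, phenotype)
    kept_snps = np.flatnonzero(p_snp < alpha1)
    candidate_genes = np.unique(snp_to_gene[kept_snps])
    p_gene = marginal_p_values(expression[:, candidate_genes], phenotype)
    selected_genes = candidate_genes[p_gene < alpha2]
    return kept_snps, selected_genes

# Simulated data (hypothetical): SNP 0 changes the expression of gene 0,
# and the expression of gene 0 drives the phenotype.
rng = np.random.default_rng(0)
genotypes = rng.binomial(2, 0.3, size=(400, 50)).astype(float)
expression = rng.normal(size=(400, 50))
expression[:, 0] += 1.5 * genotypes[:, 0]
phenotype = expression[:, 0] + rng.normal(size=400)
snp_to_gene = np.arange(50)  # SNP j is mapped to gene j in this toy example

kept_snps, selected_genes = two_stage(genotypes, expression, snp_to_gene, phenotype)
```

Lowering `alpha1` makes the first stage more stringent and shrinks the set of genes that can ever reach the second stage.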
The advantage of regression-based methods is that potential confounding effects can be adjusted for by including covariates in the models. Common covariates in biomedical research include age, sex, race, disease stage, and medications. The multi-stage analysis approach has been successfully applied in recent studies to investigate the genetic basis of drug-induced toxicity (Huang et al., 2007, 2008). Its disadvantage is the arbitrariness of the selection criteria, usually P-value cutoffs at each stage. Over-stringency at early stages can lead to missed true signals and therefore low overall statistical power. The optimal strategy for setting the selection criteria has yet to be established. A specific example of the multi-stage analysis method is likelihood-based causality model selection (LCMS) (Schadt et al., 2005), which makes causal inference about the effect of RNA expression on complex diseases. The basic idea is that some common genetic variants underlie both the gene expression and the disease phenotype. Let L, R, and C represent the genetic variant, the gene expression, and the disease phenotype, respectively.
LCMS uses a likelihood-based method to select among three models: the causal model (M1), the reactive model (M2), and the independent model (M3). The likelihoods for the three models are
M1: $P(L, R, C) = P(L)\,P(R \mid L)\,P(C \mid R)$
M2: $P(L, R, C) = P(L)\,P(C \mid L)\,P(R \mid C)$
M3: $P(L, R, C) = P(L)\,P(C \mid L)\,P(R \mid L, C)$
Note that in the independent model (M3), $P(R \mid L, C)$ represents the probability of gene expression given the genetic variant and the disease phenotype; the dependence of R on C in this term is attributed to other shared genetic variants and common environment in addition to the genetic variant L. Model selection is based on the Akaike Information Criterion (AIC) (Sakamoto et al., 1986). The model with the smallest AIC is selected as the best model.
In simultaneous analysis, genomic data of different types are combined into one meta-data set for analysis. The advantage of this approach is that multivariate methods can be applied and there is no loss of information, since all the data are combined. The disadvantage is that the corresponding model is more complex because of the different data types. The most straightforward approach for simultaneous analysis is to concatenate the various types of genomic data into one big matrix by sample ID. Appropriate statistical methods that account for the heterogeneity of the data types can then be applied to the combined data. One example of such an approach is a Bayesian integrative model to study the joint effect of genetic variants (SNPs) and gene expression on a continuous response to gemcitabine treatment in cancer cell lines (Fridley et al., 2012). The model first specifies the direct effect of SNPs and gene expression on the response variable with a linear model that includes both SNPs and gene expression levels as predictors. Next, the model specifies the effect of SNPs on gene expression using a linear framework, assuming that gene expression follows a Normal distribution.
Lastly, this approach performs Bayesian variable selection using stochastic search variable selection (SSVS) (George and McCulloch, 1993; Mukhopadhyay et al., 2010) through model averaging and shrinkage of SNP effects toward zero. The prior distribution of each SNP effect is a mixture of two Normal distributions, both centered at 0 but with different variances, representing inclusion or exclusion of the SNP in the final model. Another example is the method proposed by Mankoo et al. (2011) for an integrative analysis of the effects of DNA copy number variation, DNA methylation, miRNA, and gene expression on time to event (survival time) in ovarian cancer. This method first performs variable selection using the least absolute shrinkage and selection operator (LASSO) on the full model with all the different types of independent variables. The selected variables are then used in a Cox regression model to predict survival time. Because this type of simultaneous analysis combines all the variables, it increases the number of independent variables substantially, and some form of data reduction, such as the variable selection in these two examples, is necessary for further statistical analysis.
One of the difficulties of the concatenation-based approach is that different types of genomic data often have very different scales, which can bias statistical inference when the data are combined directly. To overcome this problem, several methods have been proposed to transform the data to a proper scale before combining them. One example is the graph-based integration approach (Kim et al., 2012) to predict cancer outcomes in brain and ovarian tumors using copy number variation, DNA methylation, miRNA, and gene expression data. In this approach, an individual graph is generated for each type of genomic data through graph-based semi-supervised learning (Zhou et al., 2004), in which a node represents a sample and an edge connecting two nodes represents the relationship between the two samples, determined by a Gaussian function of the Euclidean distance between them. The graphs generated from each type of genomic data are then combined through linear combination to generate the final graph for the prediction of cancer outcomes.
In some cases, where different types of genomic data are generated from different sets of subjects, it is possible to analyze each data type separately to generate one prediction/classification model per data type and then integrate the models. An example is the study of driver mutations in melanoma using chromosome copy number and gene expression data (Akavia et al., 2010). In this study, a Bayesian network is constructed from each data type. The resulting Bayesian networks are then combined with a Bayesian scoring function that maximizes the overall joint probability of the data and the model structure.
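As an illustration of the graph-based integration idea described above, the sketch below builds one sample-by-sample similarity graph per data type and combines the graphs linearly. It is a simplified illustration rather than the published method: the exact form of the Gaussian function, the bandwidth `sigma`, the fixed combination weights, the column standardization, and all function names and simulated data are assumptions made here for concreteness.

```python
import numpy as np

def standardize(X):
    """Put every feature on a comparable scale (zero mean, unit variance)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def gaussian_similarity(X, sigma):
    """Edge weights w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) between samples (rows)."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def combine_graphs(graphs, weights):
    """Linear combination of similarity graphs; weights are normalized to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * g for w, g in zip(weights, graphs))

# Two simulated data types measured on the same 20 samples, on different scales.
rng = np.random.default_rng(0)
expression = rng.normal(loc=8.0, scale=3.0, size=(20, 100))
methylation = rng.uniform(0.0, 1.0, size=(20, 50))

graphs = []
for data in (expression, methylation):
    Z = standardize(data)
    # Hypothetical bandwidth choice: the square root of the number of features.
    graphs.append(gaussian_similarity(Z, sigma=np.sqrt(Z.shape[1])))

combined = combine_graphs(graphs, weights=[0.5, 0.5])
```

Because each graph has entries between 0 and 1 regardless of the measurement units of the underlying data, the combination step is not dominated by the data type with the largest numerical scale.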
4 High dimensionality
Big data in genomics is characterized by its high dimensionality, which refers to both the sample size and the number of variables and their structure. The sheer volume of the data brings challenges in data storage and computation. The data volume can be on the order of terabytes for just the raw data of each sample. For the different types of genomic data, it is good practice to keep the raw data, often in image file format, so that more sophisticated base-calling algorithms can be applied later, when available, for improved accuracy. Data can be stored locally on hard drive arrays and backed up on other, more permanent storage media. It is also good practice to deposit the data in public databases for easy sharing within the scientific community, such as the Gene Expression Omnibus at the National Center for Biotechnology Information (NCBI) for functional genomics data (Barrett et al., 2013). Cloud storage is another option, where the data can be stored and maintained at a central location accessible to the whole research community.
Big Data is characterized by its large number of variables. Traditional algorithms can become unstable with the large number of variables in genomic big data. The large number of variables also contributes to false positive findings due to the multiplicity of statistical testing. Data heterogeneity is also a challenge for big data, given the increasingly popular international collaborations formed to achieve a large sample size, in which data are collected from diverse laboratories and at different time points. While data heterogeneity is a challenge for big data analysis, the large sample sizes also provide unique opportunities for understanding the unique and common features of each subgroup. For example, one popular approach for inferring population structure from genetic data is the Bayesian clustering method STRUCTURE (Pritchard et al., 2000), which can assign proportional ancestry in several populations to admixed individuals. STRUCTURE uses a Markov Chain Monte Carlo (MCMC) algorithm. It begins with a random assignment of individuals to K pre-determined populations, each of which has allele frequencies distinct from those of the other populations. Allele frequencies are then estimated in each population from the individual genotypes, and individuals are re-assigned based on the updated frequency estimates. This iterative process is repeated many times until convergence, typically comprising 100,000 iterations. Upon convergence, we obtain the final allele frequency estimates in each population and assign each individual to a particular population according to the posterior allele frequency estimates. This method has had tremendous impact on research in human genetics, evolutionary genetics, and ecology. However, it is limited in the number of genetic markers and the sample size it can handle because of the computational cost of the MCMC algorithm. This is clearly a limitation for Big Data in genomics. Several recent works focus on overcoming this limitation with likelihood-based methods (Alexander et al., 2009; Tang et al., 2005) and variational inference (Raj et al., 2014).
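The assign-estimate-reassign loop described above can be illustrated with a deliberately simplified sketch. This is not STRUCTURE: it replaces MCMC sampling with deterministic hard assignment, ignores admixture, and stops when the labels no longer change. The function name `cluster_genotypes`, the pseudocount used in the frequency estimate, and the simulated genotypes are assumptions of this example.

```python
import numpy as np

def cluster_genotypes(G, K, n_iter=100, seed=0):
    """Hard-assignment clustering of individuals into K populations.

    G is an (individuals x markers) array with entries 0, 1, or 2, the number
    of copies of one allele. Returns the labels and the allele frequencies
    from the last update.
    """
    rng = np.random.default_rng(seed)
    n, m = G.shape
    labels = rng.integers(0, K, size=n)  # random initial assignment
    for _ in range(n_iter):
        # Estimate allele frequencies in each population (with a pseudocount
        # so that frequencies stay strictly between 0 and 1).
        freqs = np.empty((K, m))
        for k in range(K):
            members = G[labels == k]
            freqs[k] = (members.sum(axis=0) + 1.0) / (2.0 * len(members) + 2.0)
        # Binomial log-likelihood of every individual under every population,
        # dropping terms that do not depend on the population.
        loglik = G @ np.log(freqs).T + (2 - G) @ np.log1p(-freqs).T
        new_labels = loglik.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels, freqs

# Simulated genotypes (hypothetical): two populations with distinct frequencies.
rng = np.random.default_rng(1)
freq_a = rng.uniform(0.05, 0.30, size=200)
freq_b = rng.uniform(0.70, 0.95, size=200)
G = np.vstack([
    rng.binomial(2, freq_a, size=(30, 200)),
    rng.binomial(2, freq_b, size=(30, 200)),
])
labels, freqs = cluster_genotypes(G, K=2)
```

A deterministic loop of this kind may stop at a poor local optimum depending on the random start; the stochastic updates of an MCMC sampler are one way to reduce that risk, at the computational cost noted above.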
5 Computing infrastructure
The large volume of Big Data in genomics makes computation infeasible with traditional computing facilities. It could take months to finish the alignment and annotation of NGS reads for studies with large samples using desktop computers. One solution to this problem is to use high-performance computing facilities such as computer clusters. The idea is to split the big computing job into small jobs and distribute them to the computing nodes in the cluster. The result is highly parallel computing, so that the big job finishes much faster (Almasi and Gottlieb, 1989). Most computing clusters are of the Beowulf type, in which generally identical computers are connected to a head computer in a local-area network. Besides their improved computing power, Beowulf clusters are highly scalable and relatively easy to maintain, which makes them especially appealing for Big Data computing needs.
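The split-and-distribute idea can be sketched on a single machine. In the toy example below, worker threads stand in for cluster nodes and counting G and C bases stands in for a real per-chunk job such as read alignment; the function names and the example reads are hypothetical, and a real cluster would use a job scheduler rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(items[start:end])
        start = end
    return chunks

def count_gc(reads):
    """Toy per-chunk job: count G and C bases in a list of reads."""
    return sum(read.count("G") + read.count("C") for read in reads)

def parallel_gc_count(reads, n_workers=4):
    """Run the per-chunk job on every chunk in parallel and merge the results."""
    chunks = split_into_chunks(reads, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_gc, chunks))
    return sum(partial_counts)

reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT", "GATTACA"]
total = parallel_gc_count(reads, n_workers=3)
```

The pattern works because the per-chunk results can be merged with a simple sum; jobs whose chunks depend on one another are harder to distribute in this way.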
Cloud computing is another potential solution to the challenge of computing facilities. In cloud computing, major computing companies provide end users with computing platforms, storage, software, and CPU time as services. Current cloud computing services include AWS (Amazon Web Services), Microsoft Azure, Google Cloud Platform, VMware Cloud, and IBM Cloud. The salient features of cloud computing are its elasticity and scalability. Users can buy the level of service that matches the size of the project. The service is available anytime and anywhere with an internet connection. It is also maintenance-free for users, who can assume that the platforms are well maintained with the most recent software packages. Storing the data in a central database hosted by the cloud service provider removes the need to transfer data between separate local databases and thus saves data transfer time, which is usually a bottleneck for Big Data computing.
6 Dimension reduction
Big Data poses challenges in computing and analysis because of its high dimensionality. One solution to this challenge is to use statistical techniques for dimension reduction. In WGS, the data can be represented by an $n \times d$ matrix, where $n$ is the number of subjects and $d$ is the number of genetic variants. Each entry in the matrix is the genotype or genetic score of the subject at the genetic variant. Because of the large number of genetic variants, it is generally infeasible to use the data matrix directly in standard statistical analysis. The idea of dimension reduction is to reduce the data dimension through a linear or non-linear transformation while keeping as much of the information in the original data matrix as possible. One common dimension reduction method is principal component analysis (PCA). This is a linear transformation method that first calculates the eigenvectors of the sample covariance (or correlation) matrix. The principal components (the $k$ eigenvectors with the largest eigenvalues) span a $k$-dimensional subspace. The original data matrix is then projected onto this subspace to obtain a data matrix of dimension $n \times k$. The reduced data matrix retains a large fraction of the variance in the original data matrix. This approach has been shown to be effective in adjusting for population stratification in genome-wide association studies (Price et al., 2006). With the large sample sizes of genomic studies, direct application of PCA may not be feasible. New methods are needed for efficient dimension reduction with Big Data. One potential method is random projection based on the Johnson-Lindenstrauss lemma, which projects the original data matrix onto a lower-dimensional subspace that approximately preserves the pairwise distances between data points (Johnson and Lindenstrauss, 1984). This method is powerful and computationally simple in that its complexity increases linearly with sample size.
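Both reductions described above can be sketched in a few lines. The example below is illustrative only: it computes PCA through a singular value decomposition of the centered data rather than an explicit covariance matrix, uses a Gaussian random matrix as one common construction for random projection, and runs on simulated data; the function names, dimensions, and seeds are assumptions made here.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the first k principal components.

    The right singular vectors of the centered data are the eigenvectors of
    the sample covariance matrix, ordered by decreasing eigenvalue.
    Returns the n x k scores and the fraction of variance they retain.
    """
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    return scores, explained

def random_project(X, k, seed=0):
    """Project the rows of X onto k random directions (Gaussian construction).

    Scaling by 1/sqrt(k) keeps squared distances unchanged in expectation.
    """
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    return X @ R

# Simulated n x d data matrix with a low-dimensional structure plus noise.
rng = np.random.default_rng(0)
n, d = 50, 2000
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, d)) + 0.1 * rng.normal(size=(n, d))

scores, explained = pca_project(X, k=3)
Z = random_project(X, k=200)

# Compare one pairwise distance before and after random projection.
ratio = np.linalg.norm(Z[0] - Z[1]) / np.linalg.norm(X[0] - X[1])
```

The random projection needs only one matrix product and no decomposition of the data, which is the source of its computational appeal for large samples.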
7 Data smoothing
NGS data are subject to several problems, such as missing values, correlations among neighboring genomic positions, and non-trivial technology-specific noise sources. Many standard statistical methods, as well as some machine learning methods, rely on rather simplistic specifications of correlations and noise, and are not robust if these specifications are inaccurate. Data smoothing techniques can be useful for de-noising and recovering the true signal. Functional data analysis (FDA), a repertoire of statistical methods that treats data as evaluations of curves (mathematical functions) over a discrete grid, plays a critical role in exploiting the output of NGS assays and allows sophisticated biological interpretation of shape information. FDA is an appealing option for overcoming the aforementioned problems with NGS data. In fact, correlations among neighboring measurements can be advantageous in FDA, which smooths such measurements into curves, effectively reducing the dimension of the data. Importantly, the dimension of smooth data representations can be controlled by selecting the type and number of basis functions employed, while roughness penalties (e.g., on the total curvature of a function) allow continuous control over smoothness. By representing the data as functions, FDA also alleviates the impact of non-trivial noise and "fills in" missing values, improving statistical power. In addition to improving signal-to-noise ratios, and hence power, smoothing can unveil information and biological insights missed by multivariate techniques, as long as the assumption of smoothness is reasonable (Cremona et al., 2019; Frøslie et al., 2013; Ryu et al., 2016).
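The effect of a roughness penalty can be illustrated with a small sketch. The example below is a discrete stand-in for basis-function smoothing, not a full FDA implementation: it penalizes squared second differences of the fitted values on the grid, gives missing observations zero weight so that they are filled in by the smooth curve, and uses a simulated signal; the function name, the penalty weight `lam`, and the data are assumptions made here.

```python
import numpy as np

def penalized_smooth(y, lam=1000.0):
    """Minimize sum_i w_i (y_i - z_i)^2 + lam * sum_i (z_{i-1} - 2 z_i + z_{i+1})^2.

    w_i is 1 for observed values and 0 for missing values (NaN), so missing
    positions are determined by the smoothness penalty alone. At least two
    observed values are needed for a unique solution.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    observed = ~np.isnan(y)
    W = np.diag(observed.astype(float))
    D = np.diff(np.eye(n), n=2, axis=0)  # (n - 2) x n second-difference operator
    rhs = np.where(observed, y, 0.0)
    return np.linalg.solve(W + lam * D.T @ D, rhs)

# Simulated signal along 200 neighboring positions, with noise and a gap.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)
truth = np.sin(2.0 * np.pi * grid)
y = truth + rng.normal(scale=0.3, size=grid.size)
y[50:60] = np.nan  # missing values

smooth = penalized_smooth(y)

observed = ~np.isnan(y)
mse_raw = np.mean((y[observed] - truth[observed]) ** 2)
mse_smooth = np.mean((smooth - truth) ** 2)
```

Larger values of `lam` give smoother curves; as `lam` grows the fit approaches a straight line, so the penalty weight plays the role of the continuous smoothness control mentioned above.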
8 Data security
Genomic data are special in that, given enough genetic marker information, they can uniquely identify an individual, much like fingerprints. Indeed, genetic information has long been used in forensics for individual identification (Jeffreys et al., 1985a,b). Genomic data contain critical information for life. With the availability of Big Data in genomics, it is possible to predict many individual characteristics, including major disease risks, from the data. Data security is therefore a major concern for Big Data in genomics. Genomic data should be considered protected health information and handled according to the regulations of HIPAA (Health Insurance Portability and Accountability Act) in the United States. Common security practices should be implemented, such as password protection, data/disk encryption, secure storage, secure transmission, and regular checking of data integrity with checksum analysis. Cloud computing offers convenient access to data and computing services at the same host with continuous support, and is hence very popular for Big Data analysis. However, data security is a concern for cloud computing because the data are hosted outside the investigator's institution. Cloud computing service providers need to address these concerns by providing corresponding security measures such as controlled access and secure data transfer.
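Of the practices listed above, checksum-based integrity checking is the simplest to illustrate. The sketch below records a SHA-256 digest of a file and later verifies that the file is unchanged; the file name, its contents, and the function names are hypothetical, and the choice of SHA-256 is an assumption of this example rather than a requirement of any regulation.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks so that
    large sequencing files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_file(path, expected_hex):
    """Return True if the file still matches the recorded digest."""
    return sha256_of_file(path) == expected_hex

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.fastq")  # hypothetical file
    with open(path, "wb") as handle:
        handle.write(b"@read1\nACGT\n+\nIIII\n")

    recorded = sha256_of_file(path)          # store this alongside the data
    intact = verify_file(path, recorded)     # check before analysis or after transfer

    with open(path, "ab") as handle:         # simulate corruption or tampering
        handle.write(b"X")
    change_detected = not verify_file(path, recorded)
```

A checksum detects accidental corruption and unrecorded changes; it does not by itself restrict access, so it complements rather than replaces encryption and access control.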
9 Example
One example of big data analysis in genomics is an integrative genomics project initiated by AstraZeneca. It is a partnership between AstraZeneca, Human Longevity in the United States, the Wellcome Trust Sanger Institute in the United Kingdom, and the Institute for Molecular Medicine in Finland. The goal is to use big data analytics on whole-genome sequencing and whole-exome sequencing data to identify novel targets for drug discovery. Patients will be matched to the treatments that are most likely to be beneficial based on their genomic profiles. In this project, AstraZeneca plans to generate genomic sequences for two million subjects by 2026, including 500,000 participants in its various clinical trials. The partners established a cloud-based warehousing and analysis platform from DNAnexus, which can process the raw genomic sequencing data from thousands of subjects per week efficiently and at a reasonable cost. It also provides a secure platform where genomic data can be combined with clinical data and shared among the collaborators.
10 Conclusion
New biotechnologies such as NGS generate Big Data in genomics with unprecedented speed and volume. This Big Data poses challenges for data analysis. We discussed the challenges in data integration, data management and computing facilities, dimension reduction, data smoothing, and data security in the analysis of Big Data in genomics, along with some potential solutions. There are other challenges that are more data specific, such as the analysis of single-cell sequencing data, de novo assembly of sequencing reads, and the analysis of rare genetic variants. These areas are under intense research and will provide important tools and information for the understanding of genomics and life in general.
References
1000 Genomes Project Consortium, Abecasis, G.R., Altshuler, D., Auton, A., Brooks, L.D., Durbin, R.M., Gibbs, R.A., Hurles, M.E., McVean, G.A., 2010. A map of human genome variation from population-scale sequencing. Nature 467 (7319), 1061–1073. https://doi.org/10.1038/nature09534.
Akavia, U.D., Litvin, O., Kim, J., Sanchez-Garcia, F., Kotliar, D., Causton, H.C., Pochanard, P., Mozes, E., Garraway, L.A., Pe’er, D., 2010. An integrated approach to uncover drivers of cancer. Cell 143 (6), 1005–1017. https://doi.org/10.1016/j.cell.2010.11.013.
Alexander, D.H., Novembre, J., Lange, K., 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19 (9), 1655–1664. https://doi.org/10.1101/gr.094052.109.
Almasi, G.S., Gottlieb, A., 1989. Highly Parallel Computing. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA.
Arabidopsis Genome Initiative, 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408 (6814), 796–815. https://doi.org/10.1038/35048692.
Barrett, T., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., et al., 2013. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41 (Database issue), D991–D995. https://doi.org/10.1093/nar/gks1193.
Cremona, M., Xu, H., Makova, K., Reimherr, M., Chiaromonte, F., Madrigal, P., 2019. Functional data analysis for computational biology. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz045.
ENCODE Project Consortium, 2004. The ENCODE (ENCyclopedia of DNA elements) project. Science 306 (5696), 636–640. https://doi.org/10.1126/science.1105136.
Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M., 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269 (5223), 496–512.
Fridley, B.L., Lund, S., Jenkins, G.D., Wang, L., 2012. A Bayesian integrative genomic model for pathway analysis of complex traits. Genet. Epidemiol. 36 (4), 352–359. https://doi.org/10.1002/gepi.21628.
Frøslie, K.F., Røislien, J., Qvigstad, E., Godang, K., Bollerslev, J., Voldner, N., Henriksen, T., Veierød, M.B., 2013. Shape information from glucose curves: functional data analysis compared with traditional summary measures. BMC Med. Res. Methodol. 13, 6. https://doi.org/10.1186/1471-2288-13-6.
George, E.I., McCulloch, R.E., 1993. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 88 (423), 881–889.
Goffeau, A., Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B., Feldmann, H., Galibert, F., et al., 1996. Life with 6000 genes. Science 274 (5287), 546, 563–567.
Gorlov, I.P., Gorlova, O.Y., Sunyaev, S.R., Spitz, M.R., Amos, C.I., 2008. Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms. Am. J. Hum. Genet. 82 (1), 100–112. https://doi.org/10.1016/j.ajhg.2007.09.006.
Huang, R.S., Duan, S., Bleibel, W.K., Kistner, E.O., Zhang, W., Clark, T.A., Chen, T.X., et al., 2007. A genome-wide approach to identify genetic variants that contribute to etoposide-induced cytotoxicity. Proc. Natl. Acad. Sci. U. S. A. 104 (23), 9758–9763. https://doi.org/10.1073/pnas.0703736104.
Huang, R.S., Duan, S., Kistner, E.O., Hartford, C.M., Eileen Dolan, M., 2008. Genetic variants associated with carboplatin-induced cytotoxicity in cell lines derived from Africans. Mol. Cancer Ther. 7 (9), 3038–3046. https://doi.org/10.1158/1535-7163.MCT-08-0248.
International Human Genome Sequencing Consortium, 2004. Finishing the euchromatic sequence of the human genome. Nature 431 (7011), 931–945. https://doi.org/10.1038/nature03001.
Jeffreys, A.J., Brookfield, J.F., Semeonoff, R., 1985a. Positive identification of an immigration test-case using human DNA fingerprints. Nature 317 (6040), 818–819.
Jeffreys, A.J., Wilson, V., Thein, S.L., 1985b. Hypervariable ‘minisatellite’ regions in human DNA. Nature 314 (6006), 67–73.
Johnson, W.B., Lindenstrauss, J., 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26, 189–206.
Kim, D., Shin, H., Song, Y.S., Kim, J.H., 2012. Synergistic effect of different levels of genomic data for cancer clinical outcome prediction. J. Biomed. Inform. 45 (6), 1191–1198.
348 Handbook of Statistics Mankoo, P.K., Shen, R., Schultz, N., Levine, D.A., Sander, C., 2011. Time to recurrence and survival in serous ovarian tumors predicted from integrated genomic profiles. PLoS One 6(11), e24709. https://doi.org/10.1371/journal.pone.0024709. Mukhopadhyay, S., George, V., Xu, H., 2010. Variable selection method for quantitative trait analysis based on parallel genetic algorithm. Ann. Hum. Genet. 74 (1), 88–96. https://doi. org/10.1111/j.1469-1809.2009.00548.x. Park, P.J., 2009. ChIP-Seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10 (10), 669–680. https://doi.org/10.1038/nrg2641. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., Reich, D., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38 (8), 904–909. Pritchard, J.K., Stephens, M., Donnelly, P., 2000. Inference of population structure using multilocus genotype data. Genetics 155 (2), 945–959. Raj, A., Stephens, M., Pritchard, J.K., 2014. FastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197 (2), 573–589. https://doi.org/10.1534/ genetics.114.164350. Ritchie, M.D., Holzinger, E.R., Li, R., Pendergrass, S.A., Kim, D., 2015. Methods of integrating data to uncover genotype-phenotype interactions. Nat. Rev. Genet. 16 (2), 85–97. https://doi. org/10.1038/nrg3868. Roadmap Epigenomics Consortium, Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Heravi-Moussavi, A., et al., 2015. Integrative analysis of 111 reference human epigenomes. Nature 518 (7539), 317–330. https://doi.org/10.1038/nature14248. Ryu, D., Xu, H., George, V., Su, S., Wang, X., Shi, H., Podolsky, R.H., 2016. Differential methylation tests of regulatory regions. Stat. Appl. Genet. Mol. Biol. 15 (3), 237–251. https://doi.org/10.1515/sagmb-2015-0037. Sakamoto, Y., Ishiguro, M., Kitagawa, G., 1986. Akaike Information Criterion Statistics. D. 
Reidel, Dordrecht, The Netherlands, 81. Schadt, E.E., Lamb, J., Yang, X., Zhu, J., Edwards, S., Guhathakurta, D., Sieberts, S.K., et al., 2005. An integrative genomics approach to infer causal associations between gene expression and disease. Nat. Genet. 37 (7), 710–717. https://doi.org/10.1038/ng1589. Tang, H., Peng, J., Wang, P., Risch, N.J., 2005. Estimation of individual admixture: analytical and study design considerations. Genet. Epidemiol. 28 (4), 289–301. https://doi.org/10.1002/ gepi.20064. Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Sch€olkopf, B., 2004. Learning with local and global consistency. In: Thrun, S., Saul, L.K., Sch€olkopf, B. (Eds.), Advances in Neural Information Processing Systems 16. MIT Press, pp. 321–328. http://papers.nips.cc/paper/2506-learningwith-local-and-global-consistency.pdf.
Chapter 8
Analysis of microarray gene expression data using information theory and stochastic algorithm

Narayan Behera∗
Department of Physics, Dayananda Sagar University, Bengaluru, Karnataka, India
Complex systems and computing group, Sandeb Tech Pvt Ltd, Bengaluru, Karnataka, India
Institute of Bioinformatics and Applied Biotechnology, Bengaluru, India
Department of Applied Physics, Adama Science and Technology University, Adama, Ethiopia
∗ Corresponding author: e-mail: [email protected]
Abstract

Microarray gene expression data provide a simultaneous expression profile of thousands of genes for a biological process. Generally, a few key genes among these thousands play dominant roles in a disease process, and finding them computationally is an important area of research in bioinformatics. A new computational approach is developed here to identify the candidate genes of a cancer process from microarray gene expression data. Gene clustering enables identification of co-expressed genes that play pivotal roles in specified biological conditions. Many algorithms exist for extracting this information, but all have inherent limitations. The present model is a hybrid of a clustering algorithm and evolutionary computation. The evolutionary computation uses a genetic algorithm, which applies three biological principles of evolution (selection, recombination, and mutation) to solve an optimization problem. The interdependence measure between the genes is based on mutual information, whereas many conventional algorithms use the Euclidean distance (differences of the gene expression values). Mutual information takes into account the similarity of the gene expression levels as well as positive and negative correlations between the genes while clustering them. The genes having higher interdependence measures are the top candidate genes responsible for cancer. These top genes are believed to be faulty genes that contain the most diagnostic information for a diseased state. The analysis is carried out on gastric cancer, colon cancer, and brain cancer microarray gene expression datasets. In comparison with many existing computational tools, the top candidate genes found by this evolutionary computational model classify the samples into cancerous and normal classes with higher accuracy. The new model creates a more even distribution of genes in the clusters and provides better accuracy in picking up the top candidate genes. Furthermore, the present computational tool is more coherent in clustering the genes across large gene expression numbers. This information-theoretic computational method can potentially be applied to the analysis of big data from other sources.

Keywords: Microarray gene expression data, Mutual information, Genetic algorithm, Clustering, Classification accuracy, Cancer-causing genes

Handbook of Statistics, Vol. 43. https://doi.org/10.1016/bs.host.2020.02.002 © 2020 Elsevier B.V. All rights reserved.
1 Introduction

Microarray technology has become an important tool with which biologists monitor genome-wide expression levels of genes in a given organism. Genes are made of deoxyribonucleic acid (DNA). DNA molecules are constructed by binding nucleotides into a linear sequence. There are four possible nucleotides, and their order allows DNA to encode information, which gives DNA a large storage capacity as an encoding device. Within the DNA, genes are sequences of different lengths. The genes within a cell contain the information for synthesizing the proteins, which carry out the functions that a cell needs. Proteins are the ultimate product of the gene expression process and are formed from 20 different subunits called amino acids.

A microarray is typically a glass slide onto which DNA molecules are fixed in an orderly manner at specific locations called spots. A microarray may contain thousands of spots, and each spot may uniquely correspond to a gene. The DNA in a spot may be either genomic DNA or a short stretch of oligonucleotide strands that corresponds to a gene. The spots are printed onto the glass slide by a machine. DNA microarray technology can simultaneously measure the interactions of thousands of genes through their expression levels.

Differential gene expression analysis involves those genes that exhibit different expression levels under different experimental conditions, that is, in various tissue types or in different developmental stages of the organism. Typical studies include normal-versus-diseased state investigations. Gene co-regulation is somewhat similar to differential gene expression. Generally, differential expression studies look at a single gene profile against the experimental conditions, whereas a gene co-regulation study compares the gene profiles with each other. The objective is to identify the genes whose expression levels vary in a correlated fashion across the studied experimental conditions or samples.
There is a positive co-regulation between two genes if the expression level of one gene increases when the other increases. A negative co-regulation exists if the expression level of one gene increases while the other decreases. Generally, gene co-regulation studies compare gene profiles of several genes. The microarray experiments reveal the function of the desired genes. The process basically involves the comparison of the desired genes’ expression profiles under various conditions with the corresponding profiles of genes with known function. The functions of the genes with highly similar
expression profiles serve as tools for inferring the function of the new gene. Gene expression data are also an important tool for clinical diagnostics because they can reveal expression patterns that are characteristic of a particular disease. Microarray data analysis draws on techniques from the life sciences, computer science, and statistics. A gene profile is a gene expression profile that describes the expression values for a single gene across many samples or conditions. An array profile is a gene expression profile that describes the expression values for many genes under a single condition or for a single sample.

There are several steps in microarray data analysis. The raw data are first processed to reduce the data volume. A gene data matrix contains numbers representing each gene's expression level in the different samples. Because the instrument is imperfect, the measured value deviates from the true expression level by some amount called the measurement error. A measurement error has two components: bias and variance. The bias describes the tendency of the instrument to detect values that are either too low or too high. The measurement error due to variance is generally normally distributed: wrong measurements in either direction are equally frequent, and small deviations are more frequent than large ones. Normalization involves numerical methods to deal with measurement errors.

After the successful completion of the human genome project, it has been a great challenge to understand the roles played by the genes in various biological systems. The advent of microarray technology has revolutionized functional genomic studies. Scientists can now study thousands of genes together, along with their expression profiles under different experimental conditions. This makes it possible to understand how the gene products are regulated in normal and diseased conditions on a global, genome-wide scale.
This global analysis allows one to determine the cellular functions of genes, the nature and regulation of biochemical pathways, and the regulatory mechanisms at play during certain disease processes. A cell’s genome has an implicit mechanism for self-replication and transformation of gene information into proteins. The gene to protein transformation constitutes the “central dogma of molecular biology.” It is a two-step process. First, a gene (DNA) makes a ribonucleic acid (RNA). This process is known as transcription. Second, in the process called translation, the RNA makes a protein. The information represented by the DNA sequence of genes is transferred into an intermediate molecular representation, an RNA sequence. The information represented by the RNA is then used as a template for constructing proteins. The RNA occurring as an intermediate structure is referred to as messenger RNA (mRNA). The overall process consisting of transcription and translation is known as gene expression. Due to the RNA’s inherent chemical instability, it is often useful to work with more stable complementary DNA (cDNA) made by reverse transcription at intermediate steps. However, before the array is made, the cDNA is broken up into its individual strands. The basic principle of the DNA microarray is that RNA samples are hybridized to known cDNA probes on the arrays.
Typical goals of DNA microarray experiments involve the comparison of gene expression in several different cells, in cells exposed to different physical and chemical conditions, and under different biological conditions (normal versus diseased). Genetic diseases such as cancer are characterized by genes being inappropriately transcribed. A cDNA microarray study can pinpoint the transcription differences between a normal and a diseased phase. It can also reveal different patterns of abnormal transcription to identify different disease stages. The microarray experiments give an estimate of the absolute values of gene expression. For technical reasons, only a single sample can be measured on a chip of this kind. The gastric cancer, colon cancer, and brain cancer datasets used in this study were obtained from microarray experiments. The extensive literature on the experimental and computational analysis of microarray data is covered in Berrar et al. (2003).
1.1 Gene clustering algorithms

Developing suitable algorithms to cluster genes with similar functions has been challenging. Clustering of microarray gene expression data to identify genes with similar expression profiles is an important area of bioinformatics. Such clusters indicate which genes are co-regulated and co-expressed under specified conditions. Genes then need to be selected that can reliably classify the test samples into diseased and normal classes. In microarray data, the number of genes generally far exceeds the number of samples. This makes gene selection and sample classification difficult, as a large number of genes can potentially create noise in the data. Ideally, a small subset of genes containing good diagnostic information is needed. This reduces the dimensionality of the data, and these genes can then be used for diagnostic study.

Microarray technology makes it possible to put the gene expression values of an entire genome onto a chip. Each data point provided by an experimenter lies in the high-dimensional space defined by the size of the genome under investigation. However, the sample size is severely limited in these experiments. To make sense of the data, feature selection methods are essential. The goal of feature selection is to select relevant features and eliminate irrelevant ones. An important problem in microarray experiments is the classification of biological samples using the gene expression data, particularly in the context of cancer research: reliable and precise classification of tumors is essential for the successful diagnosis and treatment of cancer.

Many clustering algorithms have been developed in the recent past. The k-means algorithm is one of the simplest and most widely used algorithms for clustering objects (Lloyd, 1982). In this algorithm, the number of clusters, k, has to be specified a priori. It is an unsupervised learning method because the purpose is simply to cluster the data points, not to compare the result with any benchmark performance.
It classifies the genes, based on certain attributes or features (i.e., gene expression values), into k distinct partitions so that each gene belongs to one cluster only. The range of gene expression values is found from the microarray data. The algorithm groups the genes by minimizing the sum of squared distances between the data points and their corresponding cluster centroids, and proceeds as follows. Initially, the k centers are chosen randomly; these centers are gene expression values picked at random. Each data point is assigned to the nearest cluster center. Each cluster center is then recomputed as the average of the data points that belong to that cluster. The iteration continues until the cluster centers no longer change. In this way, k clusters are generated for the genes. The genes within each cluster are similar in their expression, whereas genes in different clusters are more dissimilar.

Some other clustering algorithms are mentioned here without going into details. They do not enter directly into the present subject but enhance our understanding of clustering. Kohonen's Self-Organizing Map (SOM) is a well-known unsupervised learning technique used for clustering (Kohonen, 1990). The SOM can be visualized as a sheet-like neural-network array, the cells (or nodes) of which become specifically tuned to various input signal patterns or classes of patterns in an orderly fashion. The learning process is competitive and unsupervised. This makes the SOM useful for visualizing low-dimensional views of high-dimensional data.

The hierarchical clustering algorithm has been widely used for gene expression data analysis, especially because of its simple visualization tool (Johnson, 1967). The traditional representation of this hierarchy is a tree (called a dendrogram) with individual elements at one end and a single cluster containing every element at the other.
Hierarchical clustering is subdivided into agglomerative methods, which proceed by a series of fusions of the genes into groups, and divisive methods, which separate the genes successively into finer groupings.

Biclustering, or two-mode clustering, is a data mining technique that allows simultaneous clustering of the rows and columns of a data matrix (Cheng and Church, 2000). Given a set of m rows and n columns (i.e., an m × n matrix), the biclustering algorithm generates biclusters: subsets of rows that exhibit similar behavior across a subset of columns, or vice versa.

All these clustering techniques are widely used for clustering genes in gene expression data (Eisen et al., 1998; Golub et al., 1999; Heyer et al., 1999). The clustering algorithms use the differences of the gene expression values to cluster the genes. The attribute of a gene is defined as its expression level under a specific experimental condition, or a value in the time series of its expression levels. The most widely used distance measures are the Euclidean distance and Pearson's correlation coefficient. The Euclidean distance (e) between two genes, x and y, having n attributes is given by:

$$ e = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1} $$

where $x_i$ is the value of the ith attribute of gene x and $y_i$ is the value of the ith attribute of gene y. Pearson's correlation coefficient (r) between two genes, x and y, having n attributes is given by:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2} $$
where $\bar{x}$ and $\bar{y}$ are the means of the expression values of genes x and y, respectively, $x_i$ is the value of the ith attribute of gene x, and $y_i$ is the value of the ith attribute of gene y. A negative value of r denotes a negative linear relationship and a positive value denotes a positive linear relationship. A value of r equal to zero denotes that there is no linear relationship between the genes, which does not by itself imply that they are independent.

The conventional clustering algorithms use the Euclidean distance or Pearson's correlation as distance measures to calculate the similarity between two genes. However, these measures have some limitations. The Euclidean distance measures similarity based on expression values but not on the shapes of the expression profiles. Hence, genes having the same expression profile shape but differing by a large scaling factor in expression may not be clustered together; conversely, genes differing by a small magnitude but having different expression profile shapes may be clustered together. Pearson's correlation is better than the Euclidean distance in that it captures the relationship between profiles, but it is sensitive to outliers. The clustering algorithms also depend on the initial distribution of the genes in the clusters, so they give different results for different initial distributions. Furthermore, an important shortcoming of these algorithms is that the number of clusters has to be specified initially.

Recently, a new distance measure based on an information-theoretic approach has been proposed for clustering, in an algorithm called the Attribute Clustering Algorithm (ACA) (Au et al., 2005). It addresses some of the limitations of the conventional distance measures for gene clustering. The ACA essentially uses the k-means algorithm concept.
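Since the ACA builds on the k-means concept, a minimal sketch of the k-means loop described earlier in this section may be a useful reference. It uses the squared Euclidean distance of Eq. (1) for the assignment step. The function names, the optional `init_centers` argument, and the toy profiles are illustrative assumptions and are not taken from the chapter.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(profiles, k, init_centers=None, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length expression profiles.

    init_centers is a hypothetical convenience argument; when it is
    omitted, k profiles are drawn at random as the initial centers.
    """
    rng = random.Random(seed)
    centers = [list(c) for c in (init_centers or rng.sample(profiles, k))]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in profiles
        ]
        # Unchanged assignments mean the centers would not move either.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy data (made up): two well-separated groups of two-sample profiles.
profiles = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(profiles, k=2, init_centers=[[0.0, 0.0], [10.0, 10.0]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With random initial centers the labels can come out permuted or, on less cleanly separated data, differ between runs, which illustrates the dependence on the initial distribution noted above.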
However, the genetic distance measure being employed is an information-theoretic measure, called the interdependence redundancy measure, that takes into account the interdependence between the genes. The interdependence redundancy measure between two genes, x and y, having n attributes is given by:

$$ IR(x:y) = \frac{M(x:y)}{E(x:y)} \tag{3} $$

where

$$ M(x:y) = \sum_{k=1}^{g} \sum_{l=1}^{h} P(v_k \wedge v_l) \log \frac{P(v_k \wedge v_l)}{P(v_k)\,P(v_l)}, \tag{4} $$

and

$$ E(x:y) = -\sum_{k=1}^{g} \sum_{l=1}^{h} P(v_k \wedge v_l) \log P(v_k \wedge v_l), \tag{5} $$
M(x:y) is the mutual information, E(x:y) is the joint entropy, g is the number of intervals of x, h is the number of intervals of y, $v_k$ is the kth interval of x, $v_l$ is the lth interval of y, $P(v_k \wedge v_l)$ is the joint probability of the expression values of x and y occurring in the intervals $v_k$ and $v_l$, respectively, $P(v_k)$ is the probability of a value occurring within the interval $v_k$, and $P(v_l)$ is the probability of a value occurring within the interval $v_l$. Although mutual information measures the mutual dependence of two genes, say x and y, its value increases with the number of possible gene expression values. Hence, to obtain a proper distance measure between genes, the mutual information is normalized by the joint entropy. This normalized information measure is the Interdependence Redundancy measure, IR(x:y), calculated as shown in Eq. (3). It reflects the degree of deviation from independence between two genes.

A model called Clustering by Mutual Information (CMI) has been studied (Behera et al., 2018); it is fundamentally similar to the one developed by Au et al. (2005), except that a different smoothing process is used. However, in this model, the initial search space of gene distributions over the clusters is limited, which constrains the search for the best solution (the best configuration of gene clusters). There is a high possibility that the solution found is not the optimal one. To overcome this problem, a hybrid of evolutionary computation and a clustering algorithm is implemented (Behera et al., 2018). Evolutionary computations are very efficient search algorithms and can be used to explore a large number of different initial distributions of genes in the clusters. With these issues in mind, a new algorithm called the Evolutionary Clustering Algorithm (ECA) is proposed (Behera et al., 2018). The ECA is an evolutionary computation extension of the CMI.
It provides better gene classification and selection by reducing the limitations that arise from a biased initial distribution of genes in the clusters. This model has also shown promising results with some synthetic data, as discussed below. The ECA correctly identifies gene distributions in the clusters. The algorithm is applied to three sets of real data: gastric cancer (Tsutsumi et al., 2002), colon cancer (Alon et al., 1999), and brain cancer (MacDonald et al., 2001). These datasets provide insight into its performance in understanding a disease process. The results obtained are then compared with the outcomes from other popular clustering algorithms.

C4.5 is a decision tree algorithm that is used for building classifiers on the selected subset of candidate genes (Quinlan, 1993). The C4.5 software is well known for its learning efficiency and robustness (Elomaa, 1994). It builds a decision tree from a set of data using concepts from information theory and is used in data mining as a decision tree classifier. It has been used to analyze top-ranking genes from microarray data in order to classify test samples as normal or diseased (Cho et al., 2007).
2 Methodology

2.1 Discretization

In the real world, most data are continuous and need to be discretized for quantitative analysis. They usually contain noise due to measurement errors or incorrect entry of values. This noise can create a large number of intervals in the discretization result. As more intervals lead to a greater loss of information, these problems need to be taken care of while discretizing. The gene expression values in a microarray are typically continuous and need to be discretized into proper intervals for computing an information measure. The Optimal Class Dependent Discretization (OCDD) algorithm is used to discretize the continuous data, as it gives a nearly globally optimal solution (Wong et al., 2004). The OCDD takes into account the interdependence between the classes and the gene expression values and minimizes the information loss (see Appendix A for the method). Classes are defined as the categories to which each sample belongs. In this context, there are two classes: normal and diseased. To circumvent the problems that arise with real data, the data are smoothed and a statistical test (the chi-square test) is performed, as described in Appendix B. Smoothing removes noise before discretization, and the chi-square test reduces the number of intervals. The smoothing, the chi-square test, and the parameter values used in the ECA are essentially the same as those described for the OCDD algorithm.
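Once expression values are discretized, the interdependence redundancy of Eq. (3) can be computed directly from interval counts. The following is a minimal sketch under stated assumptions: simple equal-width binning stands in for the OCDD algorithm (which is not reproduced here), probabilities are estimated as relative frequencies, and the natural logarithm is used (the ratio in Eq. (3) does not depend on the base). All function names and the toy values are hypothetical.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins):
    """Map continuous values to interval indices 0..n_bins-1.

    Equal-width binning is an assumption made for illustration only;
    the chapter uses OCDD for this step.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against constant profiles
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def interdependence_redundancy(x_bins, y_bins):
    """IR(x:y) = M(x:y) / E(x:y), following Eqs. (3)-(5)."""
    n = len(x_bins)
    joint = Counter(zip(x_bins, y_bins))  # counts for P(v_k and v_l)
    px = Counter(x_bins)                  # counts for P(v_k)
    py = Counter(y_bins)                  # counts for P(v_l)
    mutual_info = 0.0
    joint_entropy = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mutual_info += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
        joint_entropy -= p_ab * math.log(p_ab)
    # A zero joint entropy means both profiles are constant: no information.
    return mutual_info / joint_entropy if joint_entropy > 0 else 0.0


# Two made-up profiles over four samples that move in opposite directions.
x = equal_width_bins([1.0, 2.0, 9.0, 10.0], 2)  # [0, 0, 1, 1]
y = equal_width_bins([10.0, 9.0, 2.0, 1.0], 2)  # [1, 1, 0, 0]
print(interdependence_redundancy(x, y))
```

In this toy case the two genes are perfectly negatively related and the measure is at its maximum of 1 (up to floating-point rounding), which shows that the measure captures negative as well as positive dependence.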
2.2 Genetic algorithm

A genetic algorithm is an evolutionary search technique used in computing to find exact or approximate solutions to optimization problems. It has found wide application in solving important optimization problems in science and engineering (Goldberg, 2008; Holland, 1975). This evolutionary algorithm has been successfully applied to diverse problems in evolutionary biology (Behera and Nanjundiah, 1995, 1996, 1997, 2004), optimization of multiple protein sequence alignment (Behera et al., 2017), and analysis of microarray gene expression data (Behera et al., 2018). Genetic algorithms are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology, such as mutation, selection, and crossover (also called recombination). The algorithm imitates the Darwinian idea that Nature is the best optimizer: a population of solutions evolves under Darwinian selection over many generations. The raw material for Darwinian evolution is mutation, where new genes appear spontaneously as a result
of statistical processes due to the change of some nucleotides in the gene sequence. During reproduction, recombination of genes in the chromosomes creates new chromosomes. For the purpose of illustration, a haploid organism is considered. A crossover point is chosen at random along a chromosome. The genes up to the crossover point in parent 1 and the genes from the crossover point to the end of the chromosome in parent 2 are combined to create a new chromosome as offspring. Interchanging parent 1 and parent 2 in this process gives a second new offspring. Mutation and recombination thus generate a variety of new individuals that were not present in the previous generation. As the population grows, the inherent carrying capacity of the environment means that individuals that adapt poorly to the environment do not survive to reproduce, and their genes are eliminated from the population. The individuals that adapt better produce more offspring, and their genes pass to the next generation. So the genes that confer higher fitness on the individual increase in frequency in subsequent generations. The fitness of an individual is defined as the net number of genes it leaves behind to the next generation.

The genetic algorithm uses these ideas of Darwinian evolution to solve optimization problems arising in science and engineering. The central idea is to define the fitness function appropriately for the research problem that needs optimization. Here, the genetic algorithm is used to find a few key genes responsible for cancer by analyzing thousands of genes from microarray gene expression data. Generally, in a genetic algorithm, an individual is coded as a string of 1s and 0s. It represents a possible solution to the optimization problem of interest.
The genetic algorithm starts from a population of randomly generated individuals (possible solutions) and proceeds in successive generations in finding better solutions. In each generation, every individual in the population is modified (recombined and possibly mutated) with some probability (crossover probability and mutation probability) to form a new individual. The target solution to the optimization problem becomes better in each successive generation. Each individual is evaluated for its fitness and multiple individuals are stochastically selected from the current population (based on their fitness) to form a new population. The new population is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations have been achieved, or a satisfactory fitness level has been reached for the population. Finally, the individual with the highest fitness provides the best solution to the optimization problem.
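The generic loop described in this section can be sketched as follows. This is a minimal illustration, not the ECA itself: individuals are bit strings, selection is fitness-proportional, crossover is the one-point scheme described above, and the toy fitness function (the number of 1s) and all parameter values are assumptions chosen only to make the example self-contained.

```python
import random


def one_point_crossover(parent1, parent2, rng):
    """Swap the tails of two parents at a random crossover point."""
    point = rng.randrange(1, len(parent1))
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]


def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def genetic_algorithm(fitness, length, pop_size=30, generations=60,
                      crossover_prob=0.8, mutation_prob=0.01, seed=1):
    """Evolve bit strings; `fitness` must return a non-negative number."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Fitness-proportional selection weights (tiny offset avoids all zeros).
        weights = [fitness(ind) + 1e-9 for ind in population]
        next_population = []
        while len(next_population) < pop_size:
            p1, p2 = rng.choices(population, weights=weights, k=2)
            if rng.random() < crossover_prob:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            next_population.append(mutate(c1, mutation_prob, rng))
            next_population.append(mutate(c2, mutation_prob, rng))
        population = next_population[:pop_size]
        # Keep track of the fittest individual seen so far.
        best = max(population + [best], key=fitness)
    return best


# Toy problem (hypothetical): maximize the number of 1s in a 20-bit string.
best = genetic_algorithm(sum, length=20)
print(best, sum(best))
```

Here the run stops after a fixed number of generations; stopping once a satisfactory fitness is reached, as mentioned above, would only require an extra check inside the loop.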
2.3 The evolutionary clustering algorithm

To cluster the genes, an evolutionary model of an information-theoretic clustering algorithm is introduced here. It is primarily based on the CMI algorithm defined above. The ECA uses the interdependence redundancy measure as the distance measure and the concept of modes of clusters for clustering the genes. A mode (M) of a cluster is defined as the gene in the cluster having the highest multiple interdependence redundancy (MIR). The MIR of a gene x is defined as the sum of its IR measures with all the other genes belonging to the same cluster:

$$ MIR(x) = \sum_{y \in C_k,\; y \neq x} IR(x:y), \qquad x \in C_k \tag{6} $$
where y represents the genes in the same cluster as x. The genes are clustered on the basis of a gene’s higher IR measure with a mode. A flow diagram of the ECA is shown in Fig. 1(A).
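As a small illustration of Eq. (6) and of mode-based assignment, the sketch below assumes that the pairwise IR values are already available as a symmetric matrix `ir`. The matrix values, the function names and the five-gene example are hypothetical and are not taken from the study.

```python
def mir(gene, cluster, ir):
    """Multiple interdependence redundancy: sum of IR with the other genes of the cluster."""
    return sum(ir[gene][other] for other in cluster if other != gene)

def mode_of(cluster, ir):
    """The mode is the gene with the highest MIR in its cluster."""
    return max(cluster, key=lambda g: mir(g, cluster, ir))

def assign_to_modes(n_genes, modes, ir):
    """Assign every non-mode gene to the cluster whose mode it shares the highest IR with."""
    clusters = [[m] for m in modes]
    for g in range(n_genes):
        if g in modes:
            continue
        k = max(range(len(modes)), key=lambda j: ir[g][modes[j]])
        clusters[k].append(g)
    return clusters

# Hypothetical IR matrix for five genes (symmetric, zero diagonal).
ir = [
    [0.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 0.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 0.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 0.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 0.0],
]
clusters = assign_to_modes(5, modes=[0, 3], ir=ir)
new_modes = [mode_of(c, ir) for c in clusters]
```

With these toy values, genes 1 and 2 join the cluster of mode 0 and gene 4 joins the cluster of mode 3.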
2.3.1 Creation of individuals

Each individual is represented as a two-dimensional array of numbers in which one index represents the corresponding gene from the gene expression data and the other denotes the cluster number. It simply says which genes belong to which clusters. A set of unique gene expression values (equal in number to the clusters) is chosen randomly and assigned as the modes of the clusters. The remaining genes are then assigned to the clusters on the basis of their highest IR measures with the corresponding modes of the respective clusters. The other individuals in the population are created in a similar fashion. The total number of genes in an individual is constant and is the same for every individual. A diverse set of individuals having different types of gene distributions in the clusters is thus generated. This diversity is implemented in the algorithm by changing the seed values. The population size is optimized to be 300 after studying the effects of population size on the final outcome (Behera et al., 2018).

2.3.2 Mutation operations

(a) Cluster assignment operator

The Genetic k-Means algorithm was introduced by fusing the k-means model of clustering with the genetic algorithm idea (Krishna and Murty, 1999). This model gives the optimum result for the given data, in line with its convergence result. Here we have implemented a marriage of the K-mode clustering algorithm with an evolutionary algorithm in two simple steps. First, the mode (M) of each cluster is identified; it is the gene having the maximum MIR in that cluster. Second, the other genes are assigned to the cluster with whose mode they have the highest IR, i.e., IR(x : Mk), where Mk is the mode of the kth cluster. Then the new modes of the clusters and the fitness of the individual are calculated.

(b) Probabilistic mutation operator

The mutation rate is a user-defined input.
The best 5% of the population is retained to prevent the loss of better-fit individuals, which increases the efficiency of the genetic algorithm. The remaining 95% of the individuals are stochastically selected for mutation. To select an individual for mutation, a random number between zero and one is generated. An individual is selected for mutation if the random number is less than the mutation rate, which is generally a small number. For the selected individual, in each cluster having at least five genes, the gene having the least MIR measure is chosen for donation to another cluster of the same individual. The relative distribution of the IR values of the gene with the modes of the clusters is used to construct a roulette wheel (see below for details). The cluster to which this gene is transferred is selected stochastically using roulette wheel selection. The new modes of the clusters and the fitness of the individual are calculated. Once all the clusters have undergone the mutation, a mutated individual is created. The new population then consists of the best 5% of the population, the mutated individuals and the unmutated individuals. The unmutated individuals are taken from the lower-fitness end of the sorted population. This is done to remove the bias due to the selection of the best 5% of the population, avoid premature convergence, bring diversity and allow the worse individuals to evolve. The probabilistic mutation rate is optimized to be 0.1 after analyzing the effects of mutation rates (Behera et al., 2018). A flow chart of the steps of the operator is shown in Fig. 1(B).

FIG. 1 (A) Flow chart of the ECA: starting at generation number G = 1, the fitness and modes of the clusters are calculated, followed by cluster assignment, probabilistic mutation, selection and a check of the mean fitness differences; if the termination criterion is not met, G is incremented and the loop repeats. (B) Flow chart of the probabilistic mutation: for the individuals selected for mutation, the least-MIR gene of each cluster is found and transferred to another cluster probabilistically, giving new individuals. (C) Pictorial representation of roulette wheel selection for three individuals with shares of 16%, 32% and 52%; the weakest individual has the smallest share and the fittest individual has the largest share.
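The probabilistic mutation operator can be sketched as follows. This is a hedged illustration rather than the authors' Java code: the function names are hypothetical, the IR matrix `ir` is assumed to be precomputed, and the wheel is built only over the other clusters (the text says the gene moves to "another cluster", so the donor cluster is excluded here by assumption).

```python
import random

def mir(gene, cluster, ir):
    """Sum of the IR of `gene` with the other genes of its cluster."""
    return sum(ir[gene][other] for other in cluster if other != gene)

def mutate_individual(clusters, ir, rng, min_size=5):
    """Move the least-MIR gene of each sufficiently large cluster to another cluster."""
    clusters = [list(c) for c in clusters]  # work on a copy
    for k in range(len(clusters)):
        if len(clusters[k]) < min_size or len(clusters) < 2:
            continue
        # Modes are recomputed before each transfer.
        modes = [max(c, key=lambda g: mir(g, c, ir)) for c in clusters]
        donor = min(clusters[k], key=lambda g: mir(g, clusters[k], ir))
        targets = [j for j in range(len(clusters)) if j != k]
        # Roulette wheel weights: IR of the donated gene with each target mode.
        weights = [ir[donor][modes[j]] for j in targets]
        if sum(weights) <= 0:
            continue
        dest = rng.choices(targets, weights=weights, k=1)[0]
        clusters[k].remove(donor)
        clusters[dest].append(donor)
    return clusters

def maybe_mutate(clusters, ir, rng, rate=0.1):
    """Mutate an individual only if a uniform random number falls below the mutation rate."""
    if rng.random() < rate:
        return mutate_individual(clusters, ir, rng)
    return clusters
```

The retention of the best 5% and the rebuilding of the population are left out of the sketch; only the per-individual step is shown.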
2.3.3 Fitness function

The fitness of an individual is calculated as the sum of the MIR of the modes of its clusters. The individual with the highest fitness value is the fittest individual in the population. It is defined as:

$$F = \sum_{i=1}^{k} R_i \tag{7}$$

where F is the individual fitness, i denotes the cluster number, k is the total number of clusters and $R_i$ is the multiple interdependence redundancy measure of the mode of the cluster i.
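Eq. (7) translates directly into code once an IR matrix is available. The sketch below is illustrative only; the matrix values and the names `mir` and `fitness` are hypothetical.

```python
def mir(gene, cluster, ir):
    """Sum of the IR of `gene` with the other genes of its cluster."""
    return sum(ir[gene][other] for other in cluster if other != gene)

def fitness(clusters, ir):
    """F = sum over clusters of the MIR of the mode (the highest MIR in the cluster)."""
    return sum(max(mir(g, c, ir) for g in c) for c in clusters)

# Hypothetical IR matrix for five genes.
ir = [
    [0.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 0.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 0.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 0.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 0.0],
]
F = fitness([[0, 1, 2], [3, 4]], ir)  # 1.7 (mode: gene 0) + 0.9 (mode: gene 3)
```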
2.3.4 Selection

The roulette wheel selection method is used for selecting all the individuals for the next generation. It is a popular selection method in genetic algorithms. A roulette wheel is constructed from the relative fitness (the ratio of individual fitness to total fitness) of each individual. It is represented in the form of a pie chart where the area occupied by each individual on the roulette wheel is proportional to its relative fitness. The sum of relative fitness (S) is calculated. A random number in the interval between 0 and S is generated. The population is traversed and the corresponding relative fitness values are summed. When the sum is greater than the random number generated, the corresponding individual is selected. Since an individual with a better fitness value occupies a bigger area in the pie chart, the probability of selecting it is also higher. A pictorial representation of the roulette wheel selection is shown in Fig. 1(C).
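The traversal described above can be written as a short function. The sketch uses the three shares shown in Fig. 1(C) as fitness values; the function name and the use of Python's `random` module are illustrative assumptions.

```python
import random

def roulette_select(fitnesses, rng):
    """Return the index of the individual picked by roulette wheel selection."""
    total = sum(fitnesses)
    relative = [f / total for f in fitnesses]  # relative fitness of each individual
    s = sum(relative)                          # S, equal to 1 up to round-off
    r = rng.uniform(0, s)
    running = 0.0
    for idx, share in enumerate(relative):
        running += share
        if running > r:
            return idx
    return len(relative) - 1  # guard against floating-point round-off

rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(6000):
    counts[roulette_select([16, 32, 52], rng)] += 1
```

Over many draws the counts are expected to be roughly proportional to the shares, so the fittest individual is chosen most often while the weakest still has a chance of being selected.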
2.3.5 Termination

The algorithm terminates when the variation of the mean fitness over 10 consecutive generations is less than 2%. As an example, at the nth generation, the percentage differences between the mean fitness of the (n − i)th generation and the (n − 10)th generation are calculated, where i varies from 0 to 9. If all these 10 differences are less than 2%, the program execution is terminated; otherwise it proceeds to the next generation.
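A minimal sketch of this stopping rule is given below. It assumes that the percentage difference is taken relative to the mean fitness of the (n − 10)th generation and that this reference value is nonzero; the text does not state the denominator, so this is an interpretation. The function name is hypothetical.

```python
def should_terminate(mean_fitness, window=10, tol_percent=2.0):
    """mean_fitness[g] is the mean population fitness at generation g (latest last)."""
    if len(mean_fitness) <= window:
        return False  # not enough history yet
    reference = mean_fitness[-1 - window]  # generation n - 10
    for i in range(window):                # generations n, n - 1, ..., n - 9
        current = mean_fitness[-1 - i]
        diff = abs(current - reference) / abs(reference) * 100.0
        if diff >= tol_percent:
            return False
    return True

flat_history = [10.0 + 0.01 * g for g in range(11)]   # changes by at most 1%
rising_history = [float(g) for g in range(1, 12)]     # still improving quickly
```

Here `should_terminate(flat_history)` is true and `should_terminate(rising_history)` is false.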
3 Results
The tests are performed on a 3.0 GHz dual-core Intel Xeon processor with 2 GB RAM. The whole work has been implemented in Java version 1.6.0. It has been divided into three segments—discretization, calculation of redundancy, and the development of the ECA. Time taken for discretization for a set of 7129 genes and 30 samples is 0.6469 min. The total number of intervals produced is 55,337. For the same dataset, the calculation of redundancy takes 1363.3997 min and one simulation of ECA takes 48.6153 min. The equilibrium generation ranges between 12 and 14. For the same set of data, one simulation of CMI takes 0.7068 min and takes 2–4 iterations for convergence. One simulation of k-means for 10,000 iterations takes 70.56 min. For a higher number of iterations, k-means showed the recurrence of the solution more than once, which means the solution is likely to be a statistically optimal solution.
3.1 Synthetic data

To analyze the efficiency of the algorithm for clustering data, studies on two different synthetic datasets are performed initially. Each dataset contains 200 samples and 20 genes. In the first dataset, the gene expression values vary from 0.0 to 1.0 and the domain is divided into two intervals, [0.0, 0.5] and (0.5, 1.0]. The first dataset comprises two clusters and three classes as defined below. The ranges of values of genes G1 and G2 are used to define the values of the other genes. G1 defines the values of genes G3 to G11 such that G3 to G6 are in the same range as G1 and G7 to G11 are in the opposite range. Similarly, G2 defines the values of G12 to G20: the values of G12 to G15 are in the same range as G2 and those of G16 to G20 are in the opposite range. The samples for which G1 and G2 both range from 0.0 to 0.5 are assigned class label 1. Class label 2 is assigned to the samples for which G1 and G2 both range from 0.5 to 1.0, and the rest of the samples are assigned class label 3. The second dataset comprises four clusters and five classes, and its expression values vary from 0 to 10. The method adopted for generating the second dataset is similar to the method described for the first dataset. Equal-width discretization is used to discretize the synthetic datasets (Cios and Kurgan, 2004). For both datasets, the ECA correctly identifies the gene distribution in the clusters in all the simulations, whereas the CMI groups on average 65% of the genes correctly (Behera et al., 2018).
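One way to generate a dataset with the structure of the first synthetic dataset is sketched below. The uniform sampling within each half-interval, the equal chance of G1 and G2 being low or high, and all function names are assumptions made for illustration; the chapter does not specify how the values were drawn.

```python
import math
import random

def make_first_dataset(n_samples=200, seed=0):
    """Return (samples, labels); each sample holds 20 gene values G1..G20."""
    rng = random.Random(seed)

    def draw(is_low):
        u = rng.random()  # u lies in [0, 1)
        return 0.5 * u if is_low else 1.0 - 0.5 * u  # [0, 0.5) or (0.5, 1.0]

    samples, labels = [], []
    for _ in range(n_samples):
        g1_low = rng.random() < 0.5
        g2_low = rng.random() < 0.5
        row = [draw(g1_low), draw(g2_low)]            # G1, G2
        row += [draw(g1_low) for _ in range(4)]       # G3..G6: same range as G1
        row += [draw(not g1_low) for _ in range(5)]   # G7..G11: opposite range
        row += [draw(g2_low) for _ in range(4)]       # G12..G15: same range as G2
        row += [draw(not g2_low) for _ in range(5)]   # G16..G20: opposite range
        if g1_low and g2_low:
            label = 1
        elif not g1_low and not g2_low:
            label = 2
        else:
            label = 3
        samples.append(row)
        labels.append(label)
    return samples, labels

def equal_width_bin(value, lo=0.0, hi=1.0, n_bins=2):
    """Equal-width discretization; the value 0.5 falls in the first interval [0.0, 0.5]."""
    width = (hi - lo) / n_bins
    idx = math.ceil((value - lo) / width) - 1
    return min(max(idx, 0), n_bins - 1)

samples, labels = make_first_dataset()
```

After discretization, G3 to G6 always share the interval of G1 and G7 to G11 fall in the other interval, which is the dependence structure the clustering algorithm is expected to recover.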
3.2 Real data

To evaluate the performance of the ECA, three gene expression datasets are used: the gastric cancer dataset (Tsutsumi et al., 2002), the colon cancer dataset (Alon et al., 1999) and the brain cancer dataset (MacDonald et al., 2001). Descriptions of all the datasets are given in Table 1. For each dataset, 50 simulations of the CMI are performed. The number of clusters varies from 2 to 20 for each of these simulations. The cluster number showing the highest individual fitness becomes the optimal cluster number of the dataset for that simulation. The smallest value of the optimal cluster number among the 50 simulations is used as the optimal cluster number for the dataset. The smallest value is chosen because the number of clusters should not be large, as it would then scatter the data. For all the algorithms, the same cluster number is used for all the simulations. The ECA is compared with the k-means and the CMI. For the purpose of comparison, 10 simulations of each algorithm are considered. For each simulation, a set of clusters is obtained. A subset of genes (called the candidate genes) is selected from each cluster for the classification studies. These candidate genes are composed of the top-ranking genes having higher fitness. In the case of the ECA and the CMI, the top-ranking genes are defined as the genes having the highest multiple redundancy measure in the clusters. For the k-means, they are the genes having the least Euclidean distance from the mean value of the cluster.

TABLE 1 Description of each dataset: the number of genes, the number of total samples, the number of diseased samples, the number of healthy samples, the number of clusters, and the minimum and maximum gene expression values.

| | Gastric cancer | Colon cancer | Brain cancer |
|---|---|---|---|
| Number of genes | 7129 | 2000 | 2059 |
| Number of total samples | 30 | 62 | 23 |
| Number of diseased samples | 22 | 40 | 10 |
| Number of healthy samples | 8 | 22 | 13 |
| Number of clusters | 4 | 10 | 6 |
| Minimum gene expression value | 0.1 | 5.82 | 0.5 |
| Maximum gene expression value | 14,237 | 20,903 | 184,262 |

For the classification accuracy calculation, the leave-one-out cross-validation (LOOCV) method is used. For LOOCV, the first sample is selected as the test set and the remaining samples as the training set. The top-ranking genes from the training set are used in the algorithm to predict the test sample as diseased or normal. Then the second sample is used as the test sample and the remaining set is taken as the training set. The process is repeated by considering each individual sample as the test sample. The classification accuracy is calculated as the percentage of the test samples correctly predicted as either diseased or normal. A higher value of classification accuracy represents better efficiency of the algorithm in selecting the significant genes that contain the most diagnostic information.
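The LOOCV procedure can be sketched as below. The chapter does not name the classifier in this passage, so a simple nearest-centroid rule is used purely as a stand-in; it and all function names are assumptions. The sketch also assumes that each sample has already been reduced to the expression values of the candidate genes.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier: predict the label whose class centroid is closest."""
    groups = {}
    for x, y in zip(train_x, train_y):
        groups.setdefault(y, []).append(x)
    best_label, best_dist = None, float("inf")
    for label, rows in groups.items():
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_accuracy(x, y, predict=nearest_centroid_predict):
    """Percentage of samples predicted correctly when each is held out in turn."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return 100.0 * correct / len(x)

# Hypothetical one-gene example with well-separated classes.
x = [[0.10], [0.20], [0.15], [0.90], [1.00], [0.95]]
y = ["normal"] * 3 + ["diseased"] * 3
accuracy = loocv_accuracy(x, y)
```

With 30 samples, as in the gastric cancer dataset, each misclassified sample costs 100/30 ≈ 3.33 percentage points, so 29 correct predictions out of 30 correspond to 96.67%.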
4 Section A: Studies on gastric cancer dataset (GDS1210)

4.1 Comparison of the algorithms based on the classification accuracy of samples

The average classification accuracy of the top-ranking genes is calculated for each of the algorithms. Table 2 shows one instance for the gastric cancer dataset. The percentage differences of the classification accuracy of the ECA over the CMI and the k-means are computed; they represent the improvement of the ECA over the other algorithms. The improvements of the ECA over the CMI and the k-means with respect to top-ranking genes for the gastric cancer dataset are shown in Table 3. The studies show that the ECA outperforms the CMI and the k-means. As we are dealing with a stochastic computation, when the seed for generating the random numbers is changed, the initial gene distributions in the clusters also change. This can lead to different evolutionary dynamics, such as the equilibrium generation number and the mean fitness of top-ranking genes. Therefore, it is necessary to study the average effects of different simulations corresponding to different initial gene distributions in the clusters. A comprehensive analysis of the algorithms on the gastric cancer dataset shows that the number of simulations containing top-ranking genes that provide higher classification accuracy of test samples is greater for the ECA than for the CMI and the k-means. This suggests that the ECA has a higher probability of finding the correct candidate genes than the other algorithms. The individual classification accuracies for the 10 simulations of the ECA, the CMI, and the k-means are shown in Table 4. The table shows the number of simulations for different numbers of top-ranking genes showing the corresponding classification accuracy.
TABLE 2 The classification accuracy (%) of the two top-ranking genes per cluster for the ECA, the CMI and the k-means for the gastric cancer dataset.

| Simulation | ECA | CMI | k-means |
|---|---|---|---|
| 1 | 73.33 | 83.33 | 63.33 |
| 2 | 96.67 | 73.33 | 70.00 |
| 3 | 96.67 | 86.67 | 73.33 |
| 4 | 96.67 | 76.67 | 63.33 |
| 5 | 96.67 | 83.33 | 73.33 |
| 6 | 96.67 | 76.67 | 73.33 |
| 7 | 80.00 | 93.33 | 70.00 |
| 8 | 76.67 | 73.33 | 73.33 |
| 9 | 73.33 | 73.33 | 73.33 |
| 10 | 96.67 | 63.33 | 63.33 |
| Average | 88.34 | 78.33 | 69.66 |

The individual classification accuracy for each of the 10 simulations and the average classification accuracy for each algorithm are shown.
TABLE 3 The improvement of the ECA over the CMI and the k-means for the gastric cancer dataset.

| Top genes per cluster | % improvement of ECA over CMI | % improvement of ECA over k-means |
|---|---|---|
| 1 | 4.4 | 1.99 |
| 2 | 11.32 | 21.15 |
| 3 | 5.17 | 29.66 |
| 4 | 5.86 | 24.14 |
| 5 | 6.55 | 25.18 |

The average classification accuracies of the corresponding algorithms are used to compute the percentage differences with respect to the ECA.
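The link between Tables 2 and 3 can be checked numerically. The exact formula is not spelled out in the text; the sketch below assumes that the improvement is the difference of the average accuracies divided by the ECA average, which reproduces the row for two top-ranking genes to within rounding. The variable and function names are illustrative.

```python
# Per-simulation accuracies for the two top-ranking genes per cluster (Table 2).
eca = [73.33, 96.67, 96.67, 96.67, 96.67, 96.67, 80.00, 76.67, 73.33, 96.67]
cmi = [83.33, 73.33, 86.67, 76.67, 83.33, 76.67, 93.33, 73.33, 73.33, 63.33]
kmeans = [63.33, 70.00, 73.33, 63.33, 73.33, 73.33, 70.00, 73.33, 73.33, 63.33]

def average(values):
    return sum(values) / len(values)

def improvement_over(eca_avg, other_avg):
    """Percentage difference with respect to the ECA average (assumed definition)."""
    return (eca_avg - other_avg) / eca_avg * 100.0

imp_cmi = improvement_over(average(eca), average(cmi))
imp_kmeans = improvement_over(average(eca), average(kmeans))
print(round(imp_cmi, 2), round(imp_kmeans, 2))
```

The small gap between these values and the tabulated 11.32 and 21.15 is consistent with the table having been computed from rounded averages.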
TABLE 4 Depicts the performance of the ECA, the CMI, and the k-means in terms of classification accuracy for the gastric cancer dataset. % Classification accuracy >60 & ≤80
>80 & ≤95
>95 & jx Λ lc j jx1, 1 Λ lc j j x Λ lc j < X jx1, 1 Λ lc j log 1, 1 log 1, 1 +z m ð1, sÞ ¼ T T T l j j c > : c¼1 ∞,
(A.2) if s ¼ 1 otherwise (A.3)
3. The new redundancy value, z′, is calculated.
4. z and z′ are compared. If z = z′, then the optimal partition is P′. Otherwise, z is set to z′ and steps 2, 3 and 4 are repeated.
Appendix B: Smoothing and Chi-square test method

For smoothing of the data, a segment $s_i$ is defined for any gene value $x_i$ such that the segment consists of the values from $x_{i-w}$ to $x_{i+w}$, where w is the width of the segment. For this segment, a ratio (r) is calculated between the frequency of the most frequently occurring class label and the frequency of the class label of $x_i$. If r is greater than some threshold value (t), the class label is changed to the most frequently occurring class label. For discretization, we have used the default values of 5.0 for w and 1.3 for t. The smoothing and the parameter values used are essentially the same as described in the OCDD algorithm (Wong et al., 2004).

The Chi-square test is performed to test the statistical significance of the interdependence between a gene and the class labels. For two neighboring intervals, it checks whether the frequency distributions among them and the class label are significantly interdependent and accordingly decides whether to merge the intervals. For this purpose, the IR between each gene and the class label is computed. Let the joint entropy between a gene and the class label be E. The product (p) of E and the total number of data points is calculated. The ratio of the chi value to twice p is calculated and is called rt. If the IR is greater than rt, the intervals are merged. Chi values are taken from the standard chi-square distribution table. The degree of freedom for the chi value is defined as the product of one less than the total number of classes and one less than the total number of intervals. The methodology of the test is the same as described in the OCDD algorithm (Wong et al., 2004).
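The smoothing step described at the start of this appendix can be sketched as follows. The sketch assumes that the samples are sorted by the value of the gene and that the segment is an index window of w neighbors on each side; both points are interpretations of the description, and the function name is hypothetical.

```python
from collections import Counter

def smooth_labels(labels, w=5, t=1.3):
    """Relabel a point when another class dominates its surrounding segment."""
    smoothed = list(labels)
    for i, own in enumerate(labels):
        segment = labels[max(0, i - w): i + w + 1]
        counts = Counter(segment)
        top_label, top_freq = counts.most_common(1)[0]
        r = top_freq / counts[own]  # ratio of most frequent label to own label
        if r > t:
            smoothed[i] = top_label
    return smoothed

noisy = ["A"] * 5 + ["B"] + ["A"] * 5
cleaned = smooth_labels(noisy)
```

In this toy example the isolated "B" is surrounded by ten "A" labels, so r = 10 exceeds the threshold 1.3 and the label is changed to "A". Ratios are computed from the original labels, so the result does not depend on the order in which points are visited.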
References

Alon, U., et al., 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. U.S.A. 96 (12), 6745–6750.
Au, W., et al., 2005. Attribute clustering for grouping, selection, and classification of gene expression data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2 (2), 83–101.
Behera, N., Nanjundiah, V., 1995. An investigation into the role of phenotypic plasticity in evolution. J. Theor. Biol. 172 (3), 225–234.
Behera, N., Nanjundiah, V., 1996. The consequence of phenotypic plasticity in cyclically varying environments: a genetic algorithm study. J. Theor. Biol. 178 (2), 135–144.
Behera, N., Nanjundiah, V., 1997. Trans-gene regulation in adaptive evolution: a genetic algorithm model. J. Theor. Biol. 188, 153–162.
Behera, N., Nanjundiah, V., 2004. Phenotypic plasticity can potentiate rapid evolutionary change. J. Theor. Biol. 226, 177–184.
Behera, N., Jeevitesh, M., Jose, J., Kant, K., Dey, A., Mazher, M., 2017. Higher accuracy protein multiple sequence alignment by genetic algorithm. Procedia Comput. Sci. 108, 1135–1144.
Behera, N., Sinha, S., Gupta, R., Geoncy, A., Dimitrova, N., Mazher, M., 2018. Analysis of gene expression data by evolutionary clustering algorithm. IEEE Xplore. https://doi.org/10.1109/ICIT.2017.41.
Berrar, D.P., Dubitzky, W., Granzow, M., 2003. A Practical Approach to Micro-Array Data Analysis. Kluwer Academic Publishers, London.
Cheng, Y., Church, G.M., 2000. Biclustering of expression data. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 93–103.
Cho, S.W., Kim, D.H., Uhmn, S., Ko, Y.W., Cheong, J.Y., Kim, J., 2007. Chronic hepatitis and cirrhosis classification using SNP data, decision tree and decision rule. In: ICCSA, vol. 3, pp. 585–596.
Cios, K.J., Kurgan, L., 2004. Discretization algorithm that uses class-attribute interdependence maximization. IEEE/ACM Trans. Knowl. Data Eng. 16 (2), 145–153.
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D., 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95 (25), 14863–14868.
Elomaa, T., 1994. In defense of C4.5: notes on learning one-level decision trees. In: Proceedings of the 11th International Conference on Machine Learning. Morgan Kaufmann, pp. 62–69.
Goldberg, D.E., 2008. Genetic Algorithms in Search, Optimization and Machine Learning. Pearson Education, India.
Golub, T.R., et al., 1999. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A. 96, 2907–2912.
Heyer, L.J., Kruglyak, S., Yooseph, S., 1999. Exploring expression data: identification and analysis of coexpressed genes. Genome Res. 9, 1106–1115.
Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI, USA.
Johnson, S.C., 1967. Hierarchical clustering schemes. Psychometrika 32, 241–254.
Kohonen, T., 1990. The self-organizing map. Proc. IEEE 78, 1464–1479.
Krishna, K., Murty, M., 1999. Genetic K-means algorithm. IEEE Trans. Syst. Man Cybern. B Cybern. 29, 433–439.
Liu, Y., Shen, M., Wen, J.F., Hu, Z.L., 2006. Expressions of TGIF, MMP9 and VEGF proteins and their clinicopathological relationship in gastric cancer. Zhong Nan Da Xue Xue Bao Yi Xue Ban 31 (1), 70–74.
Lloyd, S.P., 1982. Least squares quantization in PCM. IEEE Trans. Inf. Theory 28 (2), 129–137.
MacDonald, T.J., et al., 2001. Expression profiling of medulloblastoma: PDGFRA and the RAS/MAPK pathway as therapeutic targets for metastatic disease. Nat. Genet. 29 (2), 143–152.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco.
Tsutsumi, S., Hippo, Y., Taniguchi, H., Machida, N., 2002. Global gene expression analysis of gastric cancer by oligonucleotide microarrays. Cancer Res. 62 (1), 233–240.
Wang, L., Zhu, J.S., Song, M.Q., Chen, G.Q., Chen, J.L., 2006. Comparison of gene expression profiles between primary tumor and metastatic lesions in gastric cancer patients using laser microdissection and cDNA microarray. World J. Gastroenterol. 12 (43), 6949–6954.
Wong, A.K.C., Liu, L.L., Yang, W., 2004. A global optimal algorithm for class-dependent discretization of continuous data. Intelligent Data Anal. 8 (2), 151–170.

Further reading

Behera, N., 1997. Effect of phenotypic plasticity on adaptation and evolution: a genetic algorithm analysis. Curr. Sci. 73, 968–976.
Chapter 9

Human life expectancy is computed from an incomplete sets of data: Modeling and analysis

Arni S.R. Srinivasa Rao (a, b, c, *) and James R. Carey (d, e)

(a) Division of Health Economics and Modeling, Department of Population Health Sciences, Medical College of Georgia, Augusta University, Augusta, GA, United States
(b) Laboratory for Theory and Mathematical Modeling, Department of Medicine, Division of Infectious Diseases, Medical College of Georgia, Augusta University, Augusta, GA, United States
(c) Department of Mathematics, Augusta University, Augusta, GA, United States
(d) Department of Entomology, University of California, Davis, CA, United States
(e) Center for the Economics and Demography of Aging, University of California, Berkeley, CA, United States
(*) Corresponding author: e-mail: [email protected]

Abstract

Estimating human longevity and computing life expectancy are central to population dynamics. These aspects have been studied seriously by scientists since the 15th century, including the renowned astronomer Edmund Halley. From basic principles of population dynamics, we propose a method to compute life expectancy from incomplete data.

Keywords: Modeling, History of life expectancy, Population biology
Handbook of Statistics, Vol. 43. https://doi.org/10.1016/bs.host.2020.02.001 © 2020 Elsevier B.V. All rights reserved.

1 Introduction

In 1570 the Italian mathematician Girolamo Cardano suggested that a man who took care of himself would have a certain life expectancy of α (so that at any age x we could expect him to live e(x) = α − x more years) and then asked how many years would be squandered by imprudent lifestyles (Smith and Keyfitz, 1977). Cardano’s healthiest man might be born with the potential of living to 260 years but die at age 80, having wasted away years due to bad habits and other such ill-advised choices. In this work, Cardano was in good company; mathematicians such as Fibonacci, d’Alembert, Daniel Bernoulli, Euler, Halley, Lotka and many others contributed to our understanding of population dynamics through mathematical models. We can trace the notion of life expectancy in particular back to the 17th century astronomer Edmund Halley, who developed a method to compute life expectancy (Halley, 1693). His studies led him to observe “how unjustly we repine at the shortness of our Lives, and think our selves wronged if we attain not Old Age;” for “one half of those that are born are dead in Seventeen years time” and to urge readers “that instead of murmuring at what we call an untimely Death, we ought with Patience and unconcern to submit to that Dissolution which is the necessary Condition of our perishable Materials, and of our nice and frail Structure and Composition: And to account it as a Blessing that we have survived, perhaps by many Years, that Period of Life, whereat the one half of the whole Race of Mankind does not arrive” [postscript to Halley (1693)]. Besides his philosophical musings, Halley’s essay contained many tables and detailed analyses.

Life expectancy at birth is defined as the number of years remaining to the average newborn. It is arguably the most important summary metric in the life table because it is based on, and thus reflects, the longevity outcome of the mortality experience of newborns throughout their life course. When life expectancy is referred to without qualification, the value at birth is normally assumed (Carey and Roach, 2020; Misra, 1995; Preston et al., 2001; Wachter, 2014). Life expectancy is intuitive and thus easily understandable by lay persons, independent of population age structure, an indicator of health conditions in different societies, used in insurance annuity computations and as a baseline for estimating the impact on longevity of diseases (e.g., AIDS, cancer, diabetes) and lifestyle choices (e.g., smoking, alcohol consumption).
The value of life expectancy at birth is identical to the average age in a life table population. The difference in life expectancies between men and women is known as the gender gap. The inverse of life expectancy equals both the per capita birth (b) and per capita death (d) rates in stationary populations (b − d = 0). And since b + d is a measure of the number of vital events in a population, double the inverse of life expectancy equals what is referred to as “population metabolism” as applied to stationary populations. Life expectancy at birth is the most frequently used comparative metric in biological studies of plants and animals.

The first substantive demographic work in which life expectancy was estimated was the “Bills of Mortality” published in 1662 by Graunt (1662), who noted “From when it follows, that of the said 100 conceived there remains at six years 64, at thirty-six 26 at sixty-six 3 and at eighty 0”. Although Halley (1693) and Milne (1815) both introduced life table methods for computing life expectancy, King (1902) is generally credited with introducing the life table and life expectancy in modern notation. It was not until 1947 that life tables in general and life expectancy in particular were introduced to the population biology literature for studying longevity in nonhuman species (Deevey, 1947). Although life expectancy is computed straightforwardly from life table survival data, complete information is often not available. Therefore our objective in this paper is to describe a model that we derived for use in estimating life expectancy at birth from a limited amount of information.

The information required to estimate life expectancy in a given year with our model includes the number of births, the number of infant deaths, and the number in the population at each age from 0 through the maximal age, ω. Our computational concept for ω = 2 is based on the following logic: (1) person-years lived for a newborn cohort during the first year is the difference between the number born and the number of infants that died. Person-years is the sum of the number of years lived by all persons in a cohort or population. The number of person-years equals the life expectancy of this cohort if their maximal age is one year (i.e., l(1) = maximal age); (2) person-years lived for this cohort during their first 2 years of life is equal to person-years lived up to 1 year and person-years that would be lived by people who have lived up to age 1. Person-years lived by the newborn cohort during their second year of life is less than the person-years lived by newborns during the first year; (3) the hypothetical number of person-years lived by the newborn cohort during their third year of life (i.e., l(3) = 0) equals the number in the birth cohort minus the person-years lost due to deaths during the third year. We use the number of newborns and the population at age 1 to compute the person-years to be lived by newborns during the first 3 years of life. This process continues through the oldest age, ω > 2.

Traditionally, the life expectancy of a population is computed through life table techniques. The life table of a population is a stationary population mathematical model which primarily uses the populations and death numbers at all single-year ages for a year or for a period of years to produce life expectancy through the construction of several columns.
The last column of the life table usually consists of the life expectancies at each single-year age, and the first value of this column is called the life expectancy of the corresponding population for the year for which the life table was constructed (see Fig. 1 for the life table of the US population in 2010 (Arias and Xu, 2015)). There are seven columns in this life table. The second column in Fig. 1, which consists of the values of the probability of dying between ages x and x + 1 for x = 0, 1, …, 100+, is first computed from the raw data (see Fig. 2 for the data needed for a life table), and the other columns are derived from the second column using formulae without any raw data. The last column of the table in Fig. 1 consists of the values of the expectation of life at age x for x = 0, 1, …, 100+. The first value in the last column of the table in Fig. 1 is 78.7, which means that the life expectancy for newborn babies during 2010 in the US population (boys and girls combined who are aged 0–1 during 2010) is 78.7 years. Arias and Xu (2015) give the various steps involved in Fig. 1. For standard life table methods, see Keyfitz and Caswell (2005); for recent developments in computing life expectancy, see Bongaarts and Feeney (2003); for the astronomer Edmund Halley’s life table constructed in the 17th century, see Smith and Keyfitz (1977). Recent advances in the theory of stationary population models (Rao and Carey, 2015) are serving the purpose of computing life expectancies for populations in captive cohorts (Age Tables Behind Bars by Ben Pittman-Polletta, 2014).

We propose a very simple formula for computing the life expectancy of newly born babies within a time interval when age-specific death rates and life tables are not available. Age-specific death rates at age a are traditionally defined as the ratio of the number of deaths at age a to the population size at age a (Keyfitz and Caswell, 2005). The method of calculating life expectancies given in standard life tables uses age-specific death rates, which are computed from the deaths and populations at each single-year age. Refer to Fig. 2 for the data needed in the traditional life table approach and for the newly proposed method. In this paper, we propose a formula for computing life expectancies that is comparable to the technique used to calculate life expectancies in standard life tables, but can be applied when limited data are available. The derived formula uses effective age-specific population sizes, the number of infant deaths, and the number of live births within a year. The number of infant deaths is usually defined as the number of deaths within the first year of life in human populations. If the study population is insects, the necessary data can be considered within any appropriate time interval. We tested our proposed simple formula on both small hypothetical populations and global human populations. When a sufficient amount of data on age-specific death rates is available, the life table-based life expectancy is still recommended.

FIG. 1 United States life table for the year 2015. This life table was directly taken from Arias, E., Xu, J., 2015. United States Life Tables. National vital statistics reports, vol. 67, nr. 7. National Center for Health Statistics, Hyattsville, MD and selected age data are considered for demonstration purpose only.

FIG. 2 (A) Data needed for life table approach. (B) Data needed for computing life expectancy through new approach. Green-bordered rectangles are populations and red colored rectangles are death numbers in the respective ages for a year. Blue-bordered rectangle is birth numbers for a year. Both panels list the age groups from ages 100+ down to age 1–2; panel (A) ends with age 0–1 and panel (B) with births.
2 Life expectancy of newly born babies
In this section we derive a formula for the life expectancy from basic elements of population dynamics, namely, population-age structure over two time points, simple birth and infant death numbers observed over an interval of time. Suppose, the global population at the beginning of times t0 and t1 (for t0 < […]

$$
e(B(t_0)) =
\begin{cases}
\dfrac{3}{2} - \dfrac{\int_{t_0}^{t_1} D_0(s)\,ds}{\int_{t_0}^{t_1} B(s)\,ds} + \dfrac{2}{\int_{t_0}^{t_1} B(s)\,ds} \displaystyle\sum_{n=1}^{\frac{\omega}{2}-1} \int_{t_0}^{t_1} P_{2n}(s)\,ds & \text{if } \omega \text{ is even,} \\[3ex]
\dfrac{1}{2} + \dfrac{2}{\int_{t_0}^{t_1} B(s)\,ds} \displaystyle\sum_{n=0}^{\frac{\omega-3}{2}} \int_{t_0}^{t_1} P_{2n+1}(s)\,ds & \text{if } \omega \text{ is odd.}
\end{cases}
\tag{8}
$$
3 Numerical examples
We consider an example population of some arbitrary species, whose effective population age structures, births and infant deaths are observed during some interval [t0, t1) (see Table 1). We give the computed life expectancies in Table 1. We further simplify the life expectancy formula (8) based on a few assumptions and obtain (A.6); for details, see the Appendix. We tested this formula (for ω even and odd) on global population data (UN, 2012). The total population in 2010 was approximately 6916 million, and infant deaths were 4.801 million. We obtained P(t0), the size of the population aged one and above, by removing the population aged zero from the total population. The adjusted P1(t0) is 6756 million. Assuming that between 90 and 100 million live births occurred during 2010, we calculated that the life expectancy of cohorts born in 2010 will be between 69 and 76.5 years (when ω is even), and life expectancy for these
TABLE 1 Set of two hypothetically observed population age structures, births, infant deaths during [t0, t1), and computed life expectancies.

Case   Age   Effective population   Births   Infant deaths   Life expectancy
(a)    0     10                     12       1               4.5
       1     12
       2     14
       3     12
       4     6
       5     0
(b)    0     12                     9        3               5.17
       1     16
       2     18
       3     12
       4     0
newly born will be 68.1–75.5 years (when ω is odd). In 2010, the actual global life expectancy was 70 years. We note that the formula in (A.6) and the assumption in (A.5) may not hold for every population's age structure. Interestingly, the results of formula (A.6) are very close to the life table-based standard estimates for the US and UK populations. However, it should be noted that the formula did not work for some populations. The total population of the US in 2011 was approximately 313 million, and the total live births were approximately 4 million. This gives us $e(B(t_0)) = 0.5 + 313/4 = 0.5 + 78.25 = 78.75$ years, whereas the actual life expectancy for the US population for 2011 is 78.64 years. Similarly, the formula-based value for the UK is 78.23 years and the actual value is 80.75 years. In this paper we suggest a formula for computing the life expectancy of a cohort of newborn babies when it is difficult to construct a life table-based life expectancy. For the standard life table technique, one requires information on $\int_{t_0}^{t_1} \int_0^{\omega} D_i(s)\,di\,ds$, the total deaths during [t0, t1), where $\int_0^{\omega} D_i(s)\,di$ is the age-specific death numbers at time $s \in [t_0, t_1)$, and then traditionally computes age-specific death rates at age i during [t0, t1) using,
[Figure 3 diagram labels: Births, Population by age, Infant deaths, Life expectancy (e).]
FIG. 3 Life expectancy with limited data. Only with the information on births, effective population by age and infant deaths in a year, the proposed formula will forecast the life expectancy of newly born babies in a year.
$$
\frac{\int_{t_0}^{t_1} D_i(s)\,ds}{P_i(t_0)}. \tag{9}
$$
It is possible to obtain probability of deaths from (9), with some assumptions on the pattern of deaths within the time interval. We compute various columns of the life table from these death probabilities and compute life expectancy. The proposed formula in (8) is very handy and can be computed by nonexperts with minimal computing skills. It can be adapted by ecologists, experimental biologists, and biodemographers where the data on populations are limited. See Fig. 3 for the data needed to compute life expectancy of newly born babies in a year. It requires some degree of caution to apply the proposed formula when sufficient death data by all age groups is available. Our method heavily depends on the age structure of the population at the time of data collection. Our approach needs to be explored when populations are experiencing stable conditions given in Rao (2014) and also to be tested for its accuracy at different stages of demographic transition. We still recommend using life table methods when age-wise data on deaths for all the ages and the corresponding population sizes are available as indicated in Fig. 2.
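The arithmetic behind Table 1 and the global and US examples can be reproduced with a short script. The sketch below is an illustration only: the function names are hypothetical, the integrals over [t0, t1) in (8) and (A.6) are replaced by the single observed values, and ω is assumed to be the last age listed in the population vector.

```python
def life_expectancy_eq8(pop_by_age, births, infant_deaths):
    """Formula (8), with each integral replaced by one observed value.

    pop_by_age[n] is the effective population at age n, n = 0, ..., omega
    (assumption: omega is the last index of the list).
    """
    omega = len(pop_by_age) - 1
    if omega % 2 == 0:
        # Sum of even-age populations P_2, P_4, ..., P_{omega - 2}.
        even_sum = sum(pop_by_age[2 * n] for n in range(1, omega // 2))
        return 1.5 - infant_deaths / births + 2.0 * even_sum / births
    # Sum of odd-age populations P_1, P_3, ..., P_{omega - 2}.
    odd_sum = sum(pop_by_age[2 * n + 1] for n in range(0, (omega - 3) // 2 + 1))
    return 0.5 + 2.0 * odd_sum / births


def life_expectancy_a6(pop_one_plus, births, infant_deaths, omega_even):
    """Simplified formula (A.6); pop_one_plus is the population aged one and above."""
    ratio = pop_one_plus / births
    if omega_even:
        return 1.5 - infant_deaths / births + ratio
    return 0.5 + ratio


if __name__ == "__main__":
    # Table 1, cases (a) and (b).
    print(life_expectancy_eq8([10, 12, 14, 12, 6, 0], births=12, infant_deaths=1))
    print(round(life_expectancy_eq8([12, 16, 18, 12, 0], births=9, infant_deaths=3), 2))

    # Global figures for 2010 (millions), for 100 and 90 million births.
    for births in (100.0, 90.0):
        even = life_expectancy_a6(6756.0, births, 4.801, omega_even=True)
        odd = life_expectancy_a6(6756.0, births, 4.801, omega_even=False)
        print(births, round(even, 2), round(odd, 2))

    # US 2011 example from the text: 0.5 + 313 / 4.
    print(life_expectancy_a6(313.0, 4.0, 0.0, omega_even=False))
```

Case (a) has ω = 5 and therefore uses the odd branch, in which infant deaths do not enter; case (b) has ω = 4 and uses the even branch.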
Acknowledgments Dr. Cynthia Harper (Oxford) and Ms. Claire Edward (Kent) have helped to correct and revise several sentences. Our sincere gratitude to all.
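Appendix. Analysis of the life expectancy function In general, $\int_{t_0}^{t_1} D_0(s)\,ds < \int_{t_0}^{t_1} B(s)\,ds$. When ω is even, the supremum and infimum of $\left( \frac{3}{2} - \frac{\int_{t_0}^{t_1} D_0(s)\,ds}{\int_{t_0}^{t_1} B(s)\,ds} \right)$ are $\frac{3}{2}$ and $\frac{1}{2}$. The contribution of the term $\left( \frac{3}{2} - \frac{\int_{t_0}^{t_1} D_0(s)\,ds}{\int_{t_0}^{t_1} B(s)\,ds} \right)$ to the computation of life expectancy is minimal in comparison with the term $\left( \frac{2}{\int_{t_0}^{t_1} B(s)\,ds} \sum_{n=1}^{\frac{\omega}{2}-1} \int_{t_0}^{t_1} P_{2n}(s)\,ds \right)$, hence e(B(t0)) can be approximated by

$$
e(B(t_0)) \approx \frac{2}{\int_{t_0}^{t_1} B(s)\,ds} \sum_{n=1}^{\frac{\omega}{2}-1} \int_{t_0}^{t_1} P_{2n}(s)\,ds.
$$

Similarly, when ω is odd, e(B(t0)) can be approximated by

$$
e(B(t_0)) \approx \frac{2}{\int_{t_0}^{t_1} B(s)\,ds} \sum_{n=1}^{\frac{\omega-3}{2}} \int_{t_0}^{t_1} P_{2n+1}(s)\,ds.
$$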
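Remark A.1. Suppose $(P_n(t_0))_{0}^{\omega}$ is increasing; then we arrive at the two inequalities (A.1) and (A.2).

$$
\sum_{n=1}^{\frac{\omega}{2}-1} \int_{t_0}^{t_1} P_{2n}(s)\,ds < \frac{1}{2} \sum_{n=1}^{\omega} \int_{t_0}^{t_1} P_n(s)\,ds \quad \text{if } \omega \text{ is even;} \tag{A.1}
$$

$$
\sum_{n=1}^{\frac{\omega-3}{2}} \int_{t_0}^{t_1} P_{2n+1}(s)\,ds > \frac{1}{2} \sum_{n=1}^{\omega} \int_{t_0}^{t_1} P_n(s)\,ds \quad \text{if } \omega \text{ is odd.} \tag{A.2}
$$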
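Remark A.2. In general, when $(P_n(t_0))_{0}^{\omega}$ is increasing, without any condition on ω, we can write the inequality (A.3) by combining (A.1) and (A.2) as

$$
\sum_{n=1}^{\frac{\omega}{2}-1} \int_{t_0}^{t_1} P_{2n}(s)\,ds < \frac{1}{2} \sum_{n=1}^{\omega} \int_{t_0}^{t_1} P_n(s)\,ds < \sum_{n=1}^{\frac{\omega-3}{2}} \int_{t_0}^{t_1} P_{2n+1}(s)\,ds. \tag{A.3}
$$

Remark A.3. Suppose $\int_{t_0}^{t_1} D_0(s)\,ds = \int_{t_0}^{t_1} B(s)\,ds$ in (8); then the life expectancy, irrespective of whether ω is even or odd, becomes

$$
e(B(t_0)) = \frac{1}{2} + \frac{2}{\int_{t_0}^{t_1} B(s)\,ds} \sum_{n=0}^{\frac{\omega-3}{2}} \int_{t_0}^{t_1} P_{2n+1}(s)\,ds. \tag{A.4}
$$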
Remark A.4. When the total population aged one and above at t0 is approximately the same as twice the sum of the populations of even single-year ages and also twice the sum of the populations of odd single-year ages, i.e.

$$
2 \sum_{n=1}^{\frac{\omega}{2}-1} \int_{t_0}^{t_1} P_{2n}(s)\,ds \approx \int_{t_0}^{t_1} P(s)\,ds \approx 2 \sum_{n=0}^{\frac{\omega-3}{2}} \int_{t_0}^{t_1} P_{2n+1}(s)\,ds, \tag{A.5}
$$

then the life expectancy in (8) further reduces to

$$
e(B(t_0)) =
\begin{cases}
\dfrac{3}{2} - \dfrac{\int_{t_0}^{t_1} D_0(s)\,ds}{\int_{t_0}^{t_1} B(s)\,ds} + \dfrac{\int_{t_0}^{t_1} P_1(s)\,ds}{\int_{t_0}^{t_1} B(s)\,ds} & \text{if } \omega \text{ is even,} \\[3ex]
\dfrac{1}{2} + \dfrac{\int_{t_0}^{t_1} P_1(s)\,ds}{\int_{t_0}^{t_1} B(s)\,ds} & \text{if } \omega \text{ is odd,}
\end{cases}
\tag{A.6}
$$

where $P_1(s)$ is the effective population aged one and above at time $s \in [t_0, t_1)$.
References
Age Tables Behind Bars by Ben Pittman-Polletta, 2014. Math digest section of American Mathematical Society's math in the media, October 28. http://www.ams.org/news/math-in-the-media/md-201410-toc#201410-populations.
Arias, E., Xu, J., 2015. United States Life Tables. National vital statistics reports, vol. 67, nr. 7. National Center for Health Statistics, Hyattsville, MD.
Bongaarts, J., Feeney, G., 2003. Estimating mean lifetime. Proc. Natl. Acad. Sci. 100 (23), 13127–13131.
Carey, J.R., Roach, D.A., 2020. Biodemography: An Introduction to Concepts and Methods. Princeton University Press, Princeton.
Deevey Jr, E.S., 1947. Life tables for natural populations of animals. Q. Rev. Biol. 22, 283–314.
Graunt, J., 1662. Natural and Political Observations Mentioned in a Following Index, and Made Upon the Bills of Mortality. Thomas Roycroft, London.
Halley, E., 1693. An estimate of the degrees of the mortality of mankind. Philos. Trans. 17, 596–610.
Keyfitz, N., Caswell, H., 2005. Applied Mathematical Demography, third ed. Springer.
King, G., 1902. Institute of Actuaries Textbook. Part II. Charles and Edward Layton, London.
Milne, J., 1815. A Treatise on the Valuation of Annuities and Assurances on Lives and Survivorships. London.
Misra, B.D., 1995. An Introduction to the Study of Population, second ed. South Asian Publishers, New Delhi.
Preston, S.H., Heuveline, P., Guillot, M., 2001. Demography: Measuring and Modeling Population Processes. Blackwell Publishers, Malden, MA.
Rao, A.S.R.S., 2014. Population stability and momentum. Not. Am. Math. Soc. 61 (9), 1062–1065.
Rao, A.S.R.S., Carey, J.R., 2015. Carey's equality and a theorem on stationary population. J. Math. Biol. 71 (3), 583–594.
Smith, D., Keyfitz, N. (Eds.), 1977. Mathematical Demography. Springer-Verlag, Berlin (p1).
UN, 2012. http://esa.un.org/unpd/wpp/index.htm.
Wachter, K.W., 2014. Essential Demographic Methods. Harvard University Press, Harvard.
Chapter 10
Support vector machines: A robust prediction method with applications in bioinformatics Arnout Van Messem* Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium Center for Biotech Data Science, Ghent University Global Campus, Incheon, Republic of Korea * Corresponding author: e-mail: [email protected]
Abstract Over the last decades, classification and regression problems have been studied extensively. This research led to the development of a vast number of methods for solving these problems. Recent technological developments allowed us to implement algorithms to do the work for us, giving rise to the field of statistical machine learning. Among the most popular classical machine learning techniques are support vector machines (SVMs). One of the advantages of SVMs is that they are nonparametric, and therefore assume no (or very little) prior knowledge of the underlying distribution of the data. Other advantages of these kernel-based estimators are their sparseness, their flexibility, and the fact that they can easily deal with large data sets with unknown, complex, and high-dimensional dependency structures, which can occur in, for example, bioinformatics and genomics. However, the main advantage of this technique is its robustness. This property is very important, since it guarantees that SVMs will still perform well in the presence of outliers or extreme data points, no matter whether these are simply errors in the data, extreme observations, or the data stem from extreme value or heavy-tailed distributions. In this chapter we introduce the concept of support vector machines and take a (mathematical) look at some of their properties, paying special attention to the robustness of SVMs by way of influence functions. We finish the chapter with some applications of SVMs in a bioinformatics setting. Keywords: Machine learning, Support vector machines, Regression, Classification, Robustness, Influence function
Handbook of Statistics, Vol. 43. https://doi.org/10.1016/bs.host.2019.08.003 © 2020 Elsevier B.V. All rights reserved.
Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world. Atul Butte, Stanford
1 Introduction In this age of modern technology, tons of data are generated every minute. Data science will try to make sense of these data by extracting useful information which can be converted into knowledge. However, the sheer amount of data, the immense size of the data sets, and the complexity of the structure of the measurements (think, e.g., of microarray data), make this a daunting task for any human to perform. Consequently, people will depend on computers to do the task at hand for them and expect the machine to apply an algorithm to “learn” and extract any unknown patterns or dependencies from the observed data, as well as be able to automatically make predictions for future observations. This field of research is called machine learning. Support vector machines (SVMs) are one of the techniques that fall under the denominator of statistical machine learning theory. They are nonparametric methods that can be used for both classification and regression purposes. The aim in nonparametric statistical machine learning is to find a functional relationship between an X -valued input variable X and a Y-valued output or response variable Y, under the assumption that the joint distribution P of (X, Y) is (almost) completely unknown. Since the algorithm not only needs to detect the unknown dependency in the observed data, but also needs to be able to assign a response to future input values, it is therefore not sufficient to only find a good description of the relationship between input and output, but it is also extremely important to find a prediction rule that works well for new, unseen data points. Support vector machines are, of course, only one of the many techniques in the field of machine learning that are able to detect such patterns. 
The reason we focus on SVMs in this chapter, however, is not only their flexibility, but also their strong robustness properties, which bring added value by limiting (or even ignoring) the influence of outlying data points, whether SVMs are used on their own or as part of a learning ensemble. The goal of this chapter is thus twofold: to introduce the reader to support vector machines and to make them aware of the importance of robust statistical methods. In order to model the relationship between the input and response variables, we typically assume to be in possession of a finite training data set $D = ((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$, consisting of observations from independent and identically distributed (i.i.d.) random variables $(X_i, Y_i)$, $i = 1, \ldots, n$, which all follow the same, but unknown, distribution P on $\mathcal{X} \times \mathcal{Y}$ equipped with the corresponding Borel σ-algebra. Since both $\mathcal{X}$ and $\mathcal{Y}$ will be topological spaces in our setting, this Borel σ-algebra exists. The method is called nonparametric since no, or very few, additional assumptions are made on the underlying distribution P. This means we do not assume the existence of densities, symmetry, or a parametric model. The goal is then to build a predictor $f : \mathcal{X} \to \mathcal{Y}$, based solely on the observations, which assigns to each input vector x, sometimes also called risk vector, a prediction f(x) which will hopefully be a good approximation of the observed output y. Depending on the type of the output, we distinguish different prediction tasks: classification when predicting categorical values, and regression when the responses are quantitative. Common choices of input and output spaces are $\mathcal{X} \times \mathcal{Y} = \mathbb{R}^d \times \{-1, +1\}$ in the binary classification case and $\mathcal{X} \times \mathcal{Y} = \mathbb{R}^d \times \mathbb{R}$ in the regression case. In the next section, we recall some mathematical prerequisites that are necessary for a good understanding of this chapter, while in Section 3 both the (geometrical) history of SVMs and the nowadays more common interpretation of SVMs via empirical risk minimization are given. At the same time, we also discuss the kernel and the reproducing kernel Hilbert space (RKHS) and take a closer look at the loss functions. In Section 4 we introduce the concept of robustness and the (Bouligand) influence function, which we subsequently use in Section 5 to prove that SVMs are, under some mild conditions, robust machine learning algorithms. Finally, in Section 6 we discuss some applications of SVMs. This chapter is written from a statistical/mathematical perspective and not all parts are necessary for a working knowledge and understanding of support vector machines. For a "light" version of the chapter we advise readers to follow this road map.
Start with Sections 3.1–3.3 to get an idea about where SVMs originated, followed by Section 3.4 complemented with Section 2.5 for an understanding of the current formulation of the SVM optimization problem. Take a look at Section 3.6 to learn about the loss functions and their desired properties and read through Section 4 for an introduction to the important concept of robustness. Finish with the examples given in Section 6. The sections described in the road map are indicated by (†). Throughout the text, key learnings are summarized at the end of one or more sections.
2 Mathematical prerequisites
2.1 Topology In this chapter, both metric spaces and Polish spaces are used. We review here their definitions and some linked concepts. More detailed information on topological spaces can be found in, e.g., Willard (1970) or Dudley (2002).
Definition 1. Let X be a set. A subset τ of the power set $2^X$ of X is called a topology on X if it satisfies the following three conditions:
1. $\emptyset \in \tau$, $X \in \tau$.
2. If $O_1 \in \tau$ and $O_2 \in \tau$, then $O_1 \cap O_2 \in \tau$.
3. If I is any index set and $O_i \in \tau$ for all $i \in I$, then $\bigcup_{i \in I} O_i \in \tau$.
The pair $(X, \tau)$ is called a topological space and each $O \in \tau$ is called an open set. A special case of topological spaces are the metric spaces. For $d : X \times X \to [0, \infty)$ a metric, we call the pair $(X, d)$ a metric space. If d is clear from the context, we omit it and simply call X a metric space. The simplest example of a metric space is the Euclidean space $\mathbb{R}^d$, $d \in \mathbb{N}$, equipped with the Euclidean distance. For $(X, \tau)$ a topological space, we call a set $A \subset X$ dense if its closure $\overline{A} = X$. A topological space $(X, \tau)$ is called separable if there exists a countable and dense subset of X. $\mathbb{R}$ is separable, since $\mathbb{Q}$ is a countable and dense subset of $\mathbb{R}$. A sequence $(x_n)_n$ is called a Cauchy sequence if for every ε > 0 there exists an $n_0 \geq 1$ such that, for all $m, n \geq n_0$, $d(x_m, x_n) \leq \varepsilon$. Trivially, every convergent sequence is a Cauchy sequence, but the converse is in general not true. Therefore, a metric space is called complete if and only if every Cauchy sequence converges. The metric d is then said to be a complete metric for X. A topological space $(X, \tau)$ is (completely) metrizable if it is homeomorphic to a (complete) metric space. The following definition was first introduced by Bourbaki. Definition 2. A topological space $(X, \tau)$ is called a Polish space if τ has a countable basis and there exists a complete metric defining τ. Another definition is that the topological space $(X, \tau)$ is separable and completely metrizable. This means that the space has to be homeomorphic to a complete metric space that has a countable dense subset. Although Polish spaces are metrizable, they are not necessarily themselves metric spaces.
Each Polish space admits many complete metrics giving rise to the same topology, but not one of these is singled out or distinguished. A Polish space with a distinguished complete metric is called a Polish metric space. For example, the Euclidean spaces $\mathbb{R}^d$ are Polish. Trivially, also all complete separable metric spaces are Polish.
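On a finite set, the three conditions of Definition 1 can be checked mechanically. The following is a minimal sketch, assuming a finite set X and a finite family of candidate open sets; the function name `is_topology` is hypothetical. For a finite family, closure under pairwise unions and intersections is enough to obtain closure under arbitrary unions and finite intersections.

```python
from itertools import combinations


def is_topology(X, tau):
    """Check the axioms of a topology for a finite set X and a finite family tau."""
    X = frozenset(X)
    tau = {frozenset(O) for O in tau}

    # Condition 1: the empty set and X itself must be open.
    if frozenset() not in tau or X not in tau:
        return False
    # Every candidate open set must be a subset of X.
    if any(not O <= X for O in tau):
        return False
    # Conditions 2 and 3 for a finite family: pairwise intersections and unions.
    for O1, O2 in combinations(tau, 2):
        if (O1 & O2) not in tau or (O1 | O2) not in tau:
            return False
    return True


if __name__ == "__main__":
    X = {1, 2, 3}
    print(is_topology(X, [set(), {1}, {1, 2}, {1, 2, 3}]))  # nested family
    print(is_topology(X, [set(), {1}, {2}, {1, 2, 3}]))     # {1} | {2} is missing
```

The second family fails because the union of {1} and {2} is not among the listed sets.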
2.2 Probability and measure theory Next, we state some necessary notions and results concerning measure and probability theory. More details on global measure theory and probability can be found in, e.g., Billingsley (1995). For the specific parts on Polish spaces, we refer to Dudley (2002).
Definition 3. Let X be a nonempty set. A subset $\mathcal{A}$ of the power set $2^X$ of X is called a σ-algebra on X if it satisfies:
1. $X \in \mathcal{A}$.
2. $A^C := X \setminus A \in \mathcal{A}$ for all $A \in \mathcal{A}$.
3. $\bigcup_{n \in \mathbb{N}} A_n \in \mathcal{A}$ for all sequences $(A_n)_{n \in \mathbb{N}}$ of sets in $\mathcal{A}$.
We call $(X, \mathcal{A})$ a measurable space. If $\mathcal{A}$ is clear from the context, or if its specific form is irrelevant, we just call X a measurable space. The σ-algebra generated by $\mathcal{C} \subset 2^X$, denoted as $\sigma(\mathcal{C})$, is the smallest σ-algebra that contains $\mathcal{C}$. Hence, $\mathcal{C} \subset \sigma(\mathcal{C}) \subset \mathcal{A}$ for all σ-algebras $\mathcal{A}$ on X with $\mathcal{C} \subset \mathcal{A}$. An example of a generated σ-algebra is the Borel σ-algebra $\mathcal{B}(\tau) := \mathcal{B}(X) := \sigma(\tau)$ of a topological space $(X, \tau)$. For $(X_1, \mathcal{A}_1)$ and $(X_2, \mathcal{A}_2)$ two measurable spaces, a function $f : X_1 \to X_2$ is called measurable, or $(\mathcal{A}_1, \mathcal{A}_2)$-measurable, if $f^{-1}(\mathcal{A}_2) \subset \mathcal{A}_1$. The triple $(X, \mathcal{A}, \mu)$ is called a measure space or a probability space if $(X, \mathcal{A})$ is a measurable space and μ is a measure, respectively probability measure, on $\mathcal{A}$. A probability measure will most often be written as P instead of μ. The Dirac measure $\delta_x$ for a measurable space $(X, \mathcal{A})$ and an $x \in X$ is defined as $\delta_x(A) := 1$ if $x \in A$ and $\delta_x(A) := 0$ if $x \notin A$. The following theorem (Rademacher, 1919) describes the measure of the set of differentiable points of a Lipschitz continuous function f. Theorem 1 (Rademacher's theorem). Let $U \subset \mathbb{R}^n$ be open, and $f : U \to \mathbb{R}^m$ be a Lipschitz continuous function. Then f is Fréchet-differentiable almost everywhere (i.e., the points where f is not Fréchet-differentiable form a set of Lebesgue measure 0). For $(X, \mathcal{A}, \mu)$ a measure space, a measurable function $f : X \to [-\infty, \infty]$ is called μ-integrable if

$$
\int_X |f| \, d\mu < \infty.
$$

The set of all μ-integrable functions is written as $\mathcal{L}_1(\mu)$. Let us also review some properties of σ-algebras and measures defined by a topology. A measure $\mu : \mathcal{B}(X) \to [0, \infty]$, where $\mathcal{B}(X)$ is the Borel σ-algebra for a topological space $(X, \tau)$, is called a Borel measure. A Borel measure μ is said to be regular if for each $A \in \mathcal{B}(X)$ we have both outer regularity, i.e.,

$$
\mu(A) = \inf \{ \mu(O) : A \subset O, \ O \text{ open} \},
$$

and inner regularity, i.e.,

$$
\mu(A) = \sup \{ \mu(C) : C \subset A, \ C \text{ compact} \}.
$$

This definition allows the following theorem (Dudley, 2002, p. 225).
Theorem 2 (Ulam's theorem). Every finite Borel measure on a Polish space is regular. Polish spaces are also important in "splitting" probability measures, see, e.g., Dudley (2002, Section 10.2). Lemma 1 (Regular conditional distribution for Polish spaces). Let $(X, \mathcal{A})$ be a measurable space, Y be a Polish space with its Borel σ-algebra $\mathcal{B}(Y)$, and P be a probability measure on $\mathcal{A} \otimes \mathcal{B}(Y)$. Then there exists a map $P(\cdot \mid \cdot) : \mathcal{B}(Y) \times X \to [0, 1]$ such that
1. $P(\cdot \mid x)$ is a probability measure on $\mathcal{B}(Y)$ for all $x \in X$.
2. $x \mapsto P(B \mid x)$ is measurable for all $B \in \mathcal{B}(Y)$.
3. For all $A \in \mathcal{A}$, $B \in \mathcal{B}(Y)$, we have

$$
P(A \times B) = \int_A P(B \mid x) \, dP_X(x).
$$

The map $P(\cdot \mid x)$ is called a regular conditional probability or regular conditional distribution of P. $P_X$ is called the marginal probability or marginal distribution. Here, $\mathcal{A} \otimes \mathcal{B}(Y)$ denotes the product σ-algebra on the product space $X \times Y$. For a sequence $(X_n, \mathcal{A}_n)$ of measurable spaces, the product σ-algebra $\bigotimes_{n \in \mathbb{N}} \mathcal{A}_n$ on the product space $\prod_{n \in \mathbb{N}} X_n$ is defined as the σ-algebra generated by the sets $A_n \times \prod_{m \neq n} X_m$, $A_n \in \mathcal{A}_n$, $n \in \mathbb{N}$. For a probability space $(X, \mathcal{A}, P)$ we define the expectation of a function $f \in L_1(P)$ as $E_P f := \int_X f \, dP$. If P is a distribution on $X \times Y$, then $E_P f := \int_{X \times Y} f \, dP$, and if in this case P can be split into a marginal distribution $P_X$ on X and a conditional distribution $P(\cdot \mid x)$ on Y, we write for $f \in L_1(P_X)$ that $E_{P_X} f := \int_X f \, dP_X$.
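The Dirac measure and the expectation can be made concrete on a finite sample. The sketch below is an illustration under stated assumptions: it represents measurable sets as finite Python sets, and it introduces an empirical distribution (the average of Dirac measures at the sample points), which is not defined in the text above and is used here only as the simplest probability measure to compute with. All function names are hypothetical.

```python
def dirac(x):
    """Dirac measure at x: returns 1 for sets containing x, 0 otherwise."""
    return lambda A: 1.0 if x in A else 0.0


def empirical_measure(sample):
    """Average of the Dirac measures at the sample points (illustrative assumption)."""
    n = len(sample)
    return lambda A: sum(dirac(x)(A) for x in sample) / n


def expectation(f, sample):
    """E_P f for the empirical measure P: the integral reduces to a sample mean."""
    return sum(f(x) for x in sample) / len(sample)


if __name__ == "__main__":
    sample = [1, 2, 2, 5]
    P = empirical_measure(sample)
    print(dirac(2)({1, 2}))                 # Dirac mass of a set containing 2
    print(P({2}))                           # share of sample points equal to 2
    print(expectation(lambda x: x, sample))  # mean of the sample
```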
2.3 Functional and convex analysis In this section we review some needed concepts from functional analysis, as well as properties of functions on Banach spaces. We refer the interested reader to, e.g., Dudley (2002) and Lax (2002). For E a vector space and $\| \cdot \| : E \to [0, \infty)$ a norm, we call $(E, \| \cdot \|)$ a normed space. If the metric associated with the norm is complete, the pair $(E, \| \cdot \|)$ is called a Banach space. If there is no confusion possible, we write E instead of $(E, \| \cdot \|)$. To distinguish between norms, we often add an index, writing $\| \cdot \|_E$ for the norm of the normed space E. For E and F two vector spaces, a map $S : E \to F$ is called a (linear) operator if $S(\alpha x) = \alpha S(x)$ and $S(x + y) = S(x) + S(y)$ for all $\alpha \in \mathbb{R}$ and $x, y \in E$. We often write Sx instead of S(x). An operator $S : E \to F$ is bounded if the image of the unit ball is bounded under S. If E and F are normed spaces, then this is equivalent to saying that S is continuous, or that there exists a constant $c \in [0, \infty)$ such that for all $x \in E$ we have $\|Sx\|_F \leq c \|x\|_E$. The space of all bounded (linear) operators mapping from E to F is written as $\mathcal{L}(E, F)$. If E = F, we will use $\mathcal{L}(E) := \mathcal{L}(E, E)$. A special case of linear operators are the bounded linear functionals, i.e., the elements of the dual space $E' := \mathcal{L}(E, \mathbb{R})$. Note that, due to the completeness of $\mathbb{R}$, dual spaces are always Banach spaces. For $x \in E$ and $x' \in E'$, the evaluation of $x'$ at x is often written as a dual pairing, i.e., $\langle x', x \rangle_{E', E} := x'(x)$. For the proof of Theorem 6 we will need the following consequence of the open mapping theorem, see Lax (2002, p. 170) or Dudley (2002, p. 214). Theorem 3. Let E and F be Banach spaces, $S : E \to F$ be a bounded, linear, and bijective operator. Then the inverse $S^{-1} : F \to E$ is a bounded linear operator. For a measurable space $(\mathcal{X}, \mathcal{A})$, let $\mathcal{L}_0(\mathcal{X})$ denote the set of all real-valued measurable functions f on $\mathcal{X}$ and μ be a measure on $\mathcal{A}$. For $p \in [1, \infty)$ and $f \in \mathcal{L}_0(\mathcal{X})$, write $\|f\|_{L_p(\mu)} := \left( \int_{\mathcal{X}} |f|^p \, d\mu \right)^{1/p}$. Denote the set of p-integrable functions by $\mathcal{L}_p(\mu) := \{ f \in \mathcal{L}_0(\mathcal{X}) : \|f\|_{L_p(\mu)} < \infty \}$. We call two functions $f, f' \in \mathcal{L}_p(\mu)$ equivalent, written as $f \sim f'$, if $\|f - f'\|_{L_p(\mu)} = 0$. The set of equivalence classes is denoted by $L_p(\mu) := \{ [f]_\sim : f \in \mathcal{L}_p(\mu) \}$, where $[f]_\sim := \{ f' \in \mathcal{L}_p(\mu) : f \sim f' \}$, and $(L_p(\mu), \| \cdot \|_{L_p(\mu)})$ is a Banach space. A very important example of Banach spaces are Hilbert spaces. For $\langle \cdot, \cdot \rangle : H \times H \to \mathbb{R}$ an inner product, the pair $(H, \langle \cdot, \cdot \rangle)$ is called a pre-Hilbert space. To differentiate between different inner products, we will often write $\langle \cdot, \cdot \rangle_H$. If the inner product is clear from the context, H is called a pre-Hilbert space. The Cauchy–Schwarz inequality $|\langle x, y \rangle|^2 \leq \langle x, x \rangle \langle y, y \rangle$, $x, y \in H$, can be used to show that $\|x\|_H := \sqrt{\langle x, x \rangle}$, $x \in H$, defines a norm on H. If this norm is complete, $(H, \langle \cdot, \cdot \rangle)$ is called a Hilbert space.
Let us end this section with some properties of functions on Banach spaces. A subset A of some Banach space E is called convex if, for all $x_1, x_2 \in A$ and for all $\alpha \in [0, 1]$, it holds that $\alpha x_1 + (1 - \alpha) x_2 \in A$. In this case we call $f : A \to \mathbb{R} \cup \{\infty\}$ a convex function if, for all $x_1, x_2 \in A$ and for all $\alpha \in [0, 1]$, we have

$$
f(\alpha x_1 + (1 - \alpha) x_2) \leq \alpha f(x_1) + (1 - \alpha) f(x_2).
$$

If, for all $x_1 \neq x_2$ and all $\alpha \in (0, 1)$, the inequality is strict, f is called a strictly convex function. A function that is twice differentiable will be convex provided its Hessian matrix is positive semi-definite. Furthermore, f is called concave if $-f$ is convex. For $A \subset \mathbb{R}^d$, a function f is affine if f is convex, concave and finite; i.e., if there exists a vector $a \in \mathbb{R}^d$ and a constant $b \in \mathbb{R}$ such that $f(x) = a^T x + b$ for all $x \in A$. Important properties of convex functions are that (i) a convex function is always continuous on the interior of its domain (Rockafellar and Wets, 2009), and (ii) a (strictly) convex and continuous function has a (unique) minimizer (Ekeland and Turnbull, 1983, Proposition II.4.6). Given two Banach spaces E and F and $A \subset E$, a function $f : A \to F$ is called Lipschitz continuous if there exists a constant $c \geq 0$ such that $\|f(x) - f(x')\|_F \leq c \|x - x'\|_E$ for all $x, x' \in A$. The smallest constant that fulfills this inequality is denoted by $|f|_1$ and is called the Lipschitz constant.
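Both the convexity inequality and the Lipschitz condition can be probed numerically on a grid of points. The sketch below is only a finite check under stated assumptions: passing it does not prove convexity, and the returned ratio is a lower bound for the Lipschitz constant over the sampled points. The function names and the choice of the absolute value as test function are illustrative.

```python
def is_convex_on_grid(f, points, alphas, tol=1e-12):
    """Test f(a*x1 + (1-a)*x2) <= a*f(x1) + (1-a)*f(x2) on all grid combinations."""
    for x1 in points:
        for x2 in points:
            for a in alphas:
                lhs = f(a * x1 + (1 - a) * x2)
                rhs = a * f(x1) + (1 - a) * f(x2)
                if lhs > rhs + tol:
                    return False
    return True


def lipschitz_estimate(f, points):
    """Largest observed ratio |f(x) - f(y)| / |x - y| over distinct grid points."""
    return max(
        abs(f(x) - f(y)) / abs(x - y)
        for x in points
        for y in points
        if x != y
    )


if __name__ == "__main__":
    points = [i / 10 for i in range(-20, 21)]
    alphas = [j / 10 for j in range(11)]

    print(is_convex_on_grid(abs, points, alphas))               # absolute value
    print(is_convex_on_grid(lambda x: -x * x, points, alphas))  # a concave function
    print(lipschitz_estimate(abs, points))
```

The absolute value passes the convexity check and attains the ratio 1, in line with its Lipschitz constant being 1, while the concave function fails.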
2.4 Derivatives in normed spaces Let us recall the well-known definitions of the Gâteaux- and Fréchet-derivative, before introducing Bouligand-derivatives. Let E and F be normed spaces, $U \subset E$ and $V \subset F$ be open sets, and $f : U \to V$ be a function. We say that f is Gâteaux-differentiable at $x_0 \in U$ if there exists a bounded linear operator $\nabla^G f(x_0) \in \mathcal{L}(E, F)$ such that

$$
\lim_{t \to 0, \, t \neq 0} \frac{\| f(x_0 + t x) - f(x_0) - t \nabla^G f(x_0)(x) \|_F}{t} = 0, \quad x \in E.
$$

We say that f is Fréchet-differentiable at $x_0$ if there exists a bounded linear operator $\nabla^F f(x_0) \in \mathcal{L}(E, F)$ such that

$$
\lim_{x \to 0, \, x \neq 0} \frac{\| f(x_0 + x) - f(x_0) - \nabla^F f(x_0)(x) \|_F}{\| x \|_E} = 0.
$$

We call $\nabla^G f(x_0)$ and $\nabla^F f(x_0)$ the Gâteaux- and Fréchet-derivative of f at $x_0$, respectively. The function f is called Gâteaux- (or Fréchet-) differentiable if f is Gâteaux- (or Fréchet-) differentiable for all $x_0 \in U$. Furthermore, f is called continuously (Fréchet-) differentiable if it is Fréchet-differentiable and the derivative $\nabla^F f : U \to \mathcal{L}(E, F)$ is continuous. For a more detailed description of the various notions of S-differentiation, i.e., Fréchet-, Hadamard-, and Gâteaux-differentiability, we refer the interested reader to Averbukh and Smolyanov (1967, 1968), Fernholz (1983) and Rieder (1994). It is important to note that the S-derivative of f at x, $\nabla^S f(x)$, is a continuous linear mapping and that it is uniquely defined. Through the definition of these concepts by means of coverings, it becomes clear that $\nabla^F$ implies $\nabla^H$ which, in turn, implies $\nabla^G$. Furthermore, it can be shown that $\nabla^H$ is actually the weakest S-derivative which fulfills the chain rule (Averbukh and Smolyanov, 1967, 1968). Another, and lesser-known, concept of differentiability is the Bouligand-derivative, which we introduce next, together with strong approximation of functions. As will become clear, Bouligand-derivatives can play an important role when studying the robustness properties of statistical methods.
Let $E_1$, $E_2$, W, and Z be normed linear spaces, and consider neighborhoods $N(x_0)$ of $x_0$ in $E_1$, $N(y_0)$ of $y_0$ in $E_2$, and $N(w_0)$ of $w_0$ in W. Let F and G be functions from $N(x_0) \times N(y_0)$ to Z, $h_1$ and $h_2$ functions from $N(w_0)$ to Z, f a function from $N(x_0)$ to Z and g a function from $N(y_0)$ to Z. A function f approximates F in x at $(x_0, y_0)$, written as $f \sim_x F$ at $(x_0, y_0)$, if

$$
F(x, y_0) - f(x) = o(x - x_0).
$$

Similarly, $g \sim_y F$ at $(x_0, y_0)$ if $F(x_0, y) - g(y) = o(y - y_0)$. This kind of approximation can be observed, for example, in the definition of the partial Fréchet-derivative: saying that F has a partial Fréchet-derivative $\nabla^F_1 F$ at $(x_0, y_0)$ corresponds to saying that $F(x_0, y_0) + \nabla^F_1 F(x_0, y_0)(x - x_0) \sim_x F$ at $(x_0, y_0)$ (Robinson, 1991). A function $h_1$ strongly approximates $h_2$ at $w_0$, written as $h_1 \approx h_2$ at $w_0$, if for each ε > 0 there exists a neighborhood $N(w_0)$ of $w_0$ such that whenever w and $w'$ belong to $N(w_0)$,

$$
\| (h_1(w) - h_2(w)) - (h_1(w') - h_2(w')) \| \leq \varepsilon \| w - w' \|.
$$

Strong approximation amounts to requiring $h_1 - h_2$ to have a strong Fréchet-derivative equal to 0 at $w_0$, though neither $h_1$ nor $h_2$ is assumed to be differentiable in any sense. Strong approximation for functions of several groups of variables, for example, $G \approx_{(x,y)} F$ at $(x_0, y_0)$, is defined by replacing W by $E_1 \times E_2$ and making the obvious substitutions. A function f strongly approximates F in x at $(x_0, y_0)$, written as $f \approx_x F$ at $(x_0, y_0)$, if for each ε > 0 there exist neighborhoods $N(x_0)$ of $x_0$ and $N(y_0)$ of $y_0$ such that whenever x and $x'$ belong to $N(x_0)$ and y belongs to $N(y_0)$ we have

$$
\| (F(x, y) - f(x)) - (F(x', y) - f(x')) \| \leq \varepsilon \| x - x' \|.
$$

A similar definition is made for strong approximation in y. For example, if F(x, y) is Fréchet-differentiable in x in a neighborhood of $(x_0, y_0)$ and its partial Fréchet-derivative $\nabla^F_1 F$ is continuous in both x and y at $(x_0, y_0)$, then $\nabla^F_1 F(x_0, y_0) \approx_x F$ at $(x_0, y_0)$ (Dontchev and Hager, 1994). Note that one has both $f \approx_x F$ and $g \approx_y F$ at $(x_0, y_0)$ exactly if $f(x) + g(y) \approx_{(x,y)} F$ at $(x_0, y_0)$.
Recall that a function $f : E_1 \to Z$ is called positive homogeneous if

$$
f(\alpha x) = \alpha f(x) \quad \forall \alpha \geq 0, \ \forall x \in E_1.
$$

Following Robinson (1987) we can now define the Bouligand-derivative. Definition 4. Given a function f from an open subset U of a normed linear space $E_1$ into another normed linear space Z, we say that f is Bouligand-differentiable at a point $x_0 \in U$, if there exists a positive homogeneous function $\nabla^B f(x_0) : U \to Z$ such that

$$
f(x_0 + h) = f(x_0) + \nabla^B f(x_0)(h) + o(h),
$$
which can be rewritten as

$$
\lim_{h \to 0} \frac{\| f(x_0 + h) - f(x_0) - \nabla^B f(x_0)(h) \|_Z}{\| h \|_{E_1}} = 0.
$$
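A one-dimensional illustration of Definition 4 is the absolute value at the origin, which is not Fréchet-differentiable there. The sketch below assumes $f(x) = |x|$, $x_0 = 0$ and the candidate $\nabla^B f(0)(h) = |h|$; this example and the helper names are illustrative choices and are not taken from the chapter. The script evaluates the remainder ratio from the limit above, checks positive homogeneity, and shows that the candidate is not linear.

```python
def f(x):
    """Test function: the absolute value."""
    return abs(x)


def b_derivative_at_zero(h):
    """Candidate Bouligand-derivative of f at x0 = 0 (illustrative assumption)."""
    return abs(h)


def remainder_ratio(func, x0, derivative, h):
    """|func(x0 + h) - func(x0) - derivative(h)| / |h| for h != 0."""
    return abs(func(x0 + h) - func(x0) - derivative(h)) / abs(h)


if __name__ == "__main__":
    steps = [10.0 ** (-k) for k in range(1, 8)]
    steps += [-h for h in steps]

    # The remainder ratio vanishes for every step, so the limit is 0.
    print(max(remainder_ratio(f, 0.0, b_derivative_at_zero, h) for h in steps))

    # Positive homogeneity: d(alpha * h) == alpha * d(h) for alpha >= 0.
    print(all(
        abs(b_derivative_at_zero(a * h) - a * b_derivative_at_zero(h)) < 1e-15
        for a in (0.0, 0.5, 2.0, 7.0)
        for h in steps
    ))

    # Not linear: d(1) + d(-1) differs from d(1 + (-1)) = d(0).
    print(b_derivative_at_zero(1.0) + b_derivative_at_zero(-1.0), b_derivative_at_zero(0.0))
```

Because the candidate is positive homogeneous but not additive, this function is Bouligand-differentiable at 0 without having a linear derivative there.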
We will from here on use the abbreviations B-, F-, H-, and G-derivatives for Bouligand-, Frechet-, Hadamard-, and G^ateaux-derivatives, respectively. Partial B-derivatives of f are denoted by rB1 f , rB2 f , rB2,2 f :¼ rB2 ðrB2 f Þ, etc. Let F : E1 E2 ! Z, and suppose that F has a partial B-derivative rB1 Fðx0 , y0 Þ with respect to x at (x0, y0). We call rB1 Fðx0 , y0 Þ strong if Fðx0 , y0 Þ + rB1 Fðx0 , y0 Þðx x0 Þ x F at ðx0 , y0 Þ: Robinson (1987) showed that, given certain conditions, the chain rule holds for B-derivatives. Let f be a Lipschitzian function from an open set Ω ℝm to Rk , x0 2 Ω, and f B-differentiable at x0. Let g be a Lipschitzian function from an open set Γ ℝk , with f(x0) 2 Γ, to Rl be B-differentiable at f(x0). Then g ∘ f is B-differentiable at x0 and rB ðg ∘ f Þðx0 Þ ¼ rB gðf ðx0 ÞÞ ∘ rB f ðx0 Þ: The fact that B-derivatives, similar to F- and H-derivatives, fulfill a chain rule is no contradiction to the previously mentioned fact that H-differentiability is the weakest S-differentiation which fulfills the chain rule, because B-derivatives are not necessarily continuous linear functions. G- and B-differentiability are, in general, not directly comparable, because B-derivatives are by definition positive homogeneous, but not necessarily linear. However, since every linear function is always positive homogeneous, it follows directly from the definition that every F-differentiable function is also B-differentiable. Robinson (1991, Corollary 3.4) states the following implicit function theorem for B-derivatives. For a function f from a metric space ðX, dX Þ to another metric space ðY, dY Þ, we define δðf , XÞ :¼ inffdY ðf ðx1 Þ, f ðx2 ÞÞ = dX ðx1 , x2 Þ j x1 6¼ x2 ; x1 , x2 2 Xg: Clearly δðf , XÞ 6¼ 0 only if f is one-to-one on X . Theorem 4 (Implicit function theorem). Let Y be a Banach space and X and Z be normed linear spaces. 
Let $x_0$ and $y_0$ be points of $X$ and $Y$, respectively, and let $\mathcal{N}(x_0)$ be a neighborhood of $x_0$ and $\mathcal{N}(y_0)$ be a neighborhood of $y_0$. Suppose that $G$ is a function from $\mathcal{N}(x_0) \times \mathcal{N}(y_0)$ to $Z$ with $G(x_0, y_0) = 0$. In particular, for some $\phi$ and each $y \in \mathcal{N}(y_0)$, $G(\cdot, y)$ is assumed to be Lipschitz continuous on $\mathcal{N}(x_0)$ with modulus $\phi$. Assume that $G$ has partial B-derivatives with respect to $x$ and $y$ at $(x_0, y_0)$, and that:
(i) $\nabla^B_2 G(x_0, y_0)(\cdot)$ is strong.
(ii) $\nabla^B_2 G(x_0, y_0)(y - y_0)$ lies in a neighborhood of $0 \in Z$, for all $y \in \mathcal{N}(y_0)$.
(iii) $\delta(\nabla^B_2 G(x_0, y_0), \mathcal{N}(y_0) - y_0) =: d_0 > 0$.
Then for each $\xi > d_0^{-1} \phi$ there are neighborhoods $U$ of $x_0$ and $V$ of $y_0$, and a function $f^* : U \to V$ satisfying
(a) $f^*(x_0) = y_0$.
(b) $f^*$ is Lipschitz continuous on $\mathcal{N}(x_0)$ with modulus $\xi$.
(c) For each $x \in U$, $f^*(x)$ is the unique solution in $V$ of $G(x, y) = 0$.
(d) The function $f^*$ is B-differentiable at $x_0$ with
$$\nabla^B f^*(x_0)(u) = \left( \nabla^B_2 G(x_0, y_0) \right)^{-1} \left( - \nabla^B_1 G(x_0, y_0)(u) \right).$$
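Part (d) can be checked numerically on a toy case. The example below is not taken from Robinson (1991); it assumes $G(x, y) = 2y - |x|$ on $\mathbb{R} \times \mathbb{R}$ with $(x_0, y_0) = (0, 0)$, so that $\nabla^B_1 G(0,0)(u) = -|u|$, $\nabla^B_2 G(0,0)(v) = 2v$, and the solution map is $f^*(x) = |x|/2$. All function names are illustrative. The sketch compares a one-sided difference quotient of $f^*$ with the right-hand side of (d), which here equals $|u|/2$: positive homogeneous, but not linear.

```python
def G(x, y):
    """Toy function G(x, y) = 2*y - |x| (an illustrative assumption)."""
    return 2.0 * y - abs(x)

def f_star(x):
    """Unique solution y of G(x, y) = 0."""
    return abs(x) / 2.0

def b_derivative_formula(u):
    """Right-hand side of (d): (grad_2 G)^{-1}( -grad_1 G(u) ) at (0, 0)."""
    grad1 = -abs(u)        # partial B-derivative of G w.r.t. x, applied to u
    return (-grad1) / 2.0  # inverse of the map v -> 2*v, applied to -grad1

def difference_quotient(u, t):
    """(f*(0 + t*u) - f*(0)) / t for a small step t > 0."""
    return (f_star(t * u) - f_star(0.0)) / t

for u in (-3.0, -1.0, 0.5, 2.0):
    assert abs(G(u, f_star(u))) < 1e-12  # f* really solves G(x, y) = 0
    print(u, difference_quotient(u, 1e-6), b_derivative_formula(u))
```

Both columns agree, and the derivative takes the same value at $u$ and $-u$, which a linear map could not do unless it were zero.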
2.5 (†) Convex programs, Lagrange multipliers and duality
Optimization theory provides us with necessary and sufficient conditions for a function to be the solution to a certain optimization problem. As will become clear in Section 3, SVMs can be converted into a form suitable to this framework, namely the maximization (or minimization) of a convex function subject to a number of linear constraints.

Definition 5. A convex program (P) is an optimization problem
$$\begin{aligned} &\text{minimize} && f(x), \quad x \in A \\ &\text{subject to} && g_i(x) \le 0, \quad i = 1, \ldots, m, \\ & && h_i(x) = 0, \quad i = 1, \ldots, n, \end{aligned}$$
where $A \subset \mathbb{R}^d$ is a convex set, and the functions $f, g_1, \ldots, g_m, h_1, \ldots, h_n : A \to \mathbb{R}$ are finite convex functions. The function $f$ is usually called the objective function, whereas the functions $g_i(x) \le 0$ define the inequality constraints, and the functions $h_i(x) = 0$ the equality constraints.

More details on convex problems can, e.g., be found in Cristianini and Shawe-Taylor (2000, Chapter 5), and Steinwart and Christmann (2008b, Chapter A.6). For a more general discussion on optimization problems, we refer to, e.g., Gill et al. (1981, Chapter 3). Please note that it suffices to consider the problem (P) as defined above, since any maximization problem is easily converted into a minimization problem by changing the sign of the function $f$. Similarly, the constraints can always be rewritten as given above. If the objective function, the equality and the inequality constraints are all linear, the optimization problem is called a linear program. If the objective function is quadratic, while all constraints remain linear, it is called a quadratic program.

A vector $z$ is said to be a feasible solution of the convex program (P) if $z \in A$ while satisfying the constraints from Definition 5. The set of all feasible solutions, the feasible region $R$, is a (possibly empty) convex set. The points where the infimum of $f$ is attained, given that $R \ne \emptyset$, are the optimal solutions to (P). A convex program is said to be well-posed in the sense of Hadamard if an
optimal solution exists and is unique for all data sets, and it depends on the data in a smooth (or continuous) way.

An inequality constraint $g_i(x) \le 0$ is said to be active (or tight) if the solution $z$ satisfies $g_i(z) = 0$; otherwise it is called inactive. Equality constraints can be considered to be always active. Sometimes slack variables $\xi_i$, $i = 1, \ldots, m$, are introduced to transform the inequality constraints into equality constraints:
$$g_i(x) \le 0 \iff g_i(x) + \xi_i = 0, \quad \text{with } \xi_i \ge 0.$$
For active constraints, the slack variables will be zero; for inactive constraints they give a measure of "looseness" in the constraint.

One approach to solving convex programs is through Lagrangian theory (1788), which characterizes the solution of an optimization problem with only equality constraints by introducing the Lagrange function and Lagrange multipliers. This method generalized a previous result of Fermat (1629), which gave the solution for unconstrained optimization problems. Later, Kuhn and Tucker (1951) expanded this approach so that it is able to cope with both equality and inequality constraints. Details on these methods can be found, e.g., in Vapnik (1998, Chapter 9.5). Below, we first define the Lagrange multipliers and the Lagrangian function, which contains information about both the objective function and the constraints, and then continue by stating the Kuhn–Tucker theorem.

Definition 6. Given an optimization problem with objective function $f : A \to \mathbb{R}$, where $A \subset \mathbb{R}^d$, and equality constraints $h_i(x) = 0$, for $i = 1, \ldots, n$, the Lagrangian function, or in short Lagrangian, is defined as
$$L(x, \beta) := f(x) + \sum_{i=1}^{n} \beta_i h_i(x),$$
where $\beta = (\beta_1, \ldots, \beta_n)$ and the $\beta_i \ge 0$ are called the Lagrange multipliers. If, in addition, there are also inequality constraints $g_i(x) \le 0$, $i = 1, \ldots, m$, then the generalized Lagrangian function is defined as
$$L(x, \alpha, \beta) := f(x) + \sum_{i=1}^{m} \alpha_i g_i(x) + \sum_{i=1}^{n} \beta_i h_i(x),$$
and the components of both $\alpha = (\alpha_1, \ldots, \alpha_m)$, $\alpha_i \ge 0$, and $\beta = (\beta_1, \ldots, \beta_n)$, $\beta_i \ge 0$, are the Lagrange multipliers.

Note that inequalities of the form $g_i(x) \ge 0$ first need to be rewritten as $-g_i(x) \le 0$. Furthermore, in the absence of equality constraints, we simply write $L(x, \alpha)$ for the Lagrangian. The following theorem, which is a generalization of Lagrange multipliers, states that a solution that fulfills the given conditions is necessarily a global minimum, and thus an optimal solution to the convex program.
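Theorem 5 (Kuhn–Tucker). Given a convex program (P) with affine functions $g_i$, $i = 1, \ldots, m$, and $h_i$, $i = 1, \ldots, n$, then necessary and sufficient conditions for a point $x^*$ to be an optimal solution are the existence of $\alpha^* = (\alpha_1^*, \ldots, \alpha_m^*)$ and $\beta^* = (\beta_1^*, \ldots, \beta_n^*)$ such that
$$\begin{aligned} \frac{\partial L(x^*, \alpha^*, \beta^*)}{\partial x} &= 0, \\ \frac{\partial L(x^*, \alpha^*, \beta^*)}{\partial \beta} &= 0, \\ \alpha_i^* g_i(x^*) &= 0, \quad i = 1, \ldots, m, \\ g_i(x^*) &\le 0, \quad i = 1, \ldots, m, \\ \alpha_i^* &\ge 0, \quad i = 1, \ldots, m. \end{aligned}$$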
The first condition gives a set of new equations, the second one returns the equality constraints. The third relation is known as the Karush–Kuhn–Tucker complementarity condition, which implies that for an active constraint the Lagrange multiplier will be $\alpha_i^* \ge 0$, while that of any inactive constraint needs to be zero. Together, these five conditions are called the Karush–Kuhn–Tucker (KKT) conditions.

In practice, the Lagrangian approach of a convex problem always passes via the (equivalent) dual description of the problem.

Definition 7. The Lagrangian dual problem of the primal problem from Definition 5 is
$$\begin{aligned} &\text{maximize} && \theta(\alpha, \beta), \\ &\text{subject to} && \alpha \ge 0, \end{aligned}$$
where $\theta(\alpha, \beta) = \inf_{x \in A} L(x, \alpha, \beta)$.

The reason that the dual problem is used is that it is often easier to tackle than the primal problem, since it avoids handling the inequality constraints directly and optimizes the dual function over the Lagrange multipliers instead of optimizing over the vectors $x$ in the feasible region. To transform the primal program into the dual program, the derivatives of the Lagrangian with respect to the primal variables are set to zero, thus imposing stationarity. Next, these relations are substituted into the Lagrangian, which removes any dependence on the primal variables. This is exactly the same as computing the function
$$\theta(\alpha, \beta) = \inf_{x \in A} L(x, \alpha, \beta).$$
The only variables in the resulting function are the Lagrange multipliers, which then will be maximized under simpler constraints, namely the remaining KKT conditions. To see a detailed example of the primal-dual method, we refer to Sections 3.1 and 3.3 where it has been applied for constructing support vector machines.
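The primal-dual recipe can be traced on a one-dimensional toy problem. The sketch below assumes the convex program "minimize $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$", which is chosen for illustration and does not appear in the text. Stationarity of $L(x, \alpha) = x^2 + \alpha(1 - x)$ gives $x = \alpha/2$, so that $\theta(\alpha) = \alpha - \alpha^2/4$; the code maximizes $\theta$ on a grid and then checks the KKT conditions.

```python
def f(x):
    """Objective function of the toy problem."""
    return x * x

def g(x):
    """Inequality constraint g(x) <= 0, i.e. x >= 1."""
    return 1.0 - x

def lagrangian(x, alpha):
    return f(x) + alpha * g(x)

def theta(alpha):
    """Dual function: inf over x of L(x, alpha), attained where dL/dx = 2x - alpha = 0."""
    x_min = alpha / 2.0
    return lagrangian(x_min, alpha)

# Maximize the dual function over a grid of multipliers alpha >= 0.
grid = [i / 1000.0 for i in range(0, 5001)]
alpha_star = max(grid, key=theta)
x_star = alpha_star / 2.0

print("alpha* =", alpha_star, " x* =", x_star)
print("dual optimum =", theta(alpha_star), " primal optimum =", f(x_star))

# KKT conditions from Theorem 5 (no equality constraints here).
assert abs(2.0 * x_star - alpha_star) < 1e-12  # stationarity
assert abs(alpha_star * g(x_star)) < 1e-12     # complementarity
assert g(x_star) <= 1e-12                      # primal feasibility
assert alpha_star >= 0.0                       # dual feasibility
```

The constraint is active at the solution, so its multiplier is allowed to be nonzero, and the dual and primal optimal values coincide.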
3 An introduction to support vector machines
Support vector machines started as a geometrical idea in a classification setting: how can we find a computationally efficient way of learning "good" separating hyperplanes in a high-dimensional feature space? We will define "good" through the notion of the maximal margin separating hyperplane, but another possibility would be to find the separating hyperplane that minimizes the number of support vectors. Since SVMs produce sparse dual representations of the hypothesis, extremely efficient algorithms can be obtained due to the Karush–Kuhn–Tucker conditions, which hold for the solution and which play an important role in the practical implementation and analysis of SVMs. A second important feature is that, due to Mercer's conditions on the kernels, the corresponding optimization problem is convex, and therefore is not hindered by local extrema. This fact, in combination with the strongly reduced number of nonzero parameters, is exactly what distinguishes support vector machines from other learning algorithms such as neural networks. We refer to Vapnik (1998, 2000), Cristianini and Shawe-Taylor (2000), Burges (1998), Schölkopf and Smola (2002), and Steinwart and Christmann (2008b) for textbooks on SVMs and related topics.
3.1 (†) The generalized portrait algorithm
The foundation for SVMs was laid by the generalized portrait algorithm (GPA) as introduced by Vapnik and Lerner (1963). The GPA, which is a binary classification algorithm, considers the simplest case possible: a linear machine trained on separable data. Please note that "separable" in this case simply means that both classes of the data can be separated; it has nothing to do with the mathematical definition of a separable space as on p. 4. Hence, the data set can be completely and correctly separated by a linear hyperplane. For mathematical ease, the two classes are labeled $+1$ and $-1$. Mathematically, we take $\mathcal{Y} = \{-1, +1\}$ and $\mathcal{X} \subset \mathbb{R}^d$, and assume that the training data set $D = ((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$ is perfectly separable. There thus exists a hyperplane
$$H_0 : \langle w, x \rangle + b = 0 \tag{1}$$
characterized by $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ such that the training data satisfy
$$\langle w, x_i \rangle + b \ge +1 \quad \forall\, i \text{ with } y_i = +1, \tag{2}$$
$$\langle w, x_i \rangle + b \le -1 \quad \forall\, i \text{ with } y_i = -1, \tag{3}$$
which obviously separates the positive from the negative samples. Conditions (2) and (3) can then be combined as
$$y_i (\langle w, x_i \rangle + b) \ge 1 \quad \forall\, i = 1, \ldots, n. \tag{4}$$
The hyperplane (1) is called the separating hyperplane because it separates the positive from the negative samples without error. The parameter $w$ is the normal to the hyperplane and $|b| / \|w\|_2$ the perpendicular distance from the hyperplane to the origin. The geometrical margin $\gamma_g$ of the separating hyperplane is defined as the distance from the closest vector to the hyperplane (Vapnik, 2000, p. 131). Call $d_+$ the shortest distance from the separating hyperplane to a positive point, and $d_-$ the distance to the closest negative point. Since clearly the separating hyperplane is not unique, an additional condition is needed to end up with a unique solution. The GPA will look for the perfectly separating hyperplane $(w_D, b_D)$ that has maximal geometrical margin. This hyperplane is called the maximal margin hyperplane, and the resulting decision function is defined by
$$f_D(x) := \operatorname{sign}(\langle w_D, x \rangle + b_D) \quad \forall\, x \in \mathbb{R}^d.$$
Clearly, $f_D$ will assign positive labels to one affine half-space and negative labels to the other.

To find this decision function, a quadratic problem needs to be solved. Let us therefore take a look at the margin $\gamma_g$, which needs to be maximized. Points for which equality in (2) holds lie on the hyperplane
$$H_1 : \langle w, x \rangle + b = +1;$$
equality in Eq. (3) gives the hyperplane
$$H_2 : \langle w, x \rangle + b = -1.$$
The perpendicular distances of $H_1$ and $H_2$ to the origin are $|1 - b| / \|w\|_2$ and $|1 + b| / \|w\|_2$, respectively. Since the normals of $H_0$, $H_1$, and $H_2$ are the same, the hyperplanes are parallel, leading to $d_+ = d_- = 1 / \|w\|_2$ and thus the geometrical margin $\gamma_g = 1 / \|w\|_2$. This is shown in Fig. 1. Consequently, maximizing the margin is equivalent to minimizing the norm $\|w\|_2^2$ subject to (4). Hence, the optimization problem becomes:
$$\begin{aligned} &\text{minimize} && \tfrac{1}{2} \langle w, w \rangle \quad \text{over } w \in \mathbb{R}^d,\ b \in \mathbb{R} \\ &\text{subject to} && y_i (\langle w, x_i \rangle + b) \ge 1, \quad i = 1, \ldots, n, \end{aligned} \tag{5}$$
which forces the hyperplane to make no classification errors on $D$, thus perfectly separating both classes.

FIG. 1 Both classes are separated by a linear hyperplane $H_0$ in $\mathcal{X} = \mathbb{R}^2$. $H_1$ and $H_2$ are the decision boundaries, $\gamma_g$ is the geometrical margin.

To solve this optimization problem, we rely on the Lagrangian approach, as described in Section 2.5. We do so for two reasons: (i) the constraints in (5) will be replaced by constraints on the Lagrange multipliers, which are easier to work with, and (ii) the training data will only appear as inner products between vectors, which will allow us later on to generalize this algorithm to the nonlinear case via kernels. By applying Lagrangian theory to our problem, the following primal Lagrangian is obtained:
$$L_P(w, b, \alpha) = \frac{1}{2} \langle w, w \rangle - \sum_{i=1}^{n} \alpha_i \left( y_i (\langle w, x_i \rangle + b) - 1 \right), \tag{6}$$
with $\alpha = (\alpha_1, \ldots, \alpha_n)$ the vector of the nonnegative Lagrange multipliers. Note that for SVMs the problem contains no equality constraints, which simplifies the expression for the Lagrangian. Since the SVM problem is a convex problem with linear constraints, Theorem 5 tells us that the Karush–Kuhn–Tucker (KKT) conditions are necessary and sufficient for some $w_D$, $b_D$, and $\alpha^*$ to be an optimal solution. Solving the (primal) problem is thus equivalent to finding a solution to the KKT conditions:
$$\frac{\partial L_P(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{n} y_i \alpha_i x_i = 0, \tag{7}$$
$$\frac{\partial L_P(w, b, \alpha)}{\partial b} = - \sum_{i=1}^{n} y_i \alpha_i = 0, \tag{8}$$
$$y_i (\langle w, x_i \rangle + b) - 1 \ge 0, \quad \forall\, i = 1, \ldots, n,$$
$$\alpha_i \ge 0, \quad \forall\, i = 1, \ldots, n,$$
$$\alpha_i \left( y_i (\langle w, x_i \rangle + b) - 1 \right) = 0, \quad \forall\, i = 1, \ldots, n. \tag{9}$$
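Implementing these conditions is done through the dual Lagrangian, with addition of the complementarity condition (9). Rewriting (7) as $w = \sum_{i=1}^{n} y_i \alpha_i x_i$ and substituting it together with (8) in the primal (6) produces the dual formulation
$$\begin{aligned} L_D(w, b, \alpha) &= \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle + \sum_{i=1}^{n} \alpha_i \\ &= \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle, \end{aligned}$$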
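and the corresponding optimization problem is given by
$$\begin{aligned} &\text{maximize} && \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \quad \text{over } \alpha_i \in \mathbb{R} \\ &\text{subject to} && \sum_{i=1}^{n} y_i \alpha_i = 0, \\ & && \alpha_i \ge 0, \quad i = 1, \ldots, n. \end{aligned} \tag{10}$$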
Therefore, training the SVM amounts to maximizing $L_D$ with respect to the Lagrange parameters $\alpha_i$, since these are the only unknowns left in the equation. Each training point $x_i$ corresponds to one of the $\alpha_i$. The support vectors are exactly those $x_i$ for which $\alpha_i \ne 0$, since only these points add a contribution in the expression (7) of the vector $w$. In other words, they "support" the hyperplane. Define the set of the indices of the support vectors as $sv$. The complementarity condition (9) makes it clear that the support vectors lie either on $H_1$ or $H_2$: only for inputs $x_i$ that have $y_i (\langle w, x_i \rangle + b) = 1$ can the corresponding $\alpha_i$ be nonzero. All other $\alpha_i$ will be equal to zero and the corresponding data points are not support vectors. The support vectors are the critical elements of the data set $D$, since the decision boundaries are completely defined by only these points. Removing all other points from $D$ and repeating the training process would yield exactly the same separating hyperplane.

For Lagrange parameters $\alpha^* = (\alpha_1^*, \ldots, \alpha_n^*)$ that solve the optimization problem (10), the normal to the maximal margin hyperplane can be easily calculated:
$$w_D = \sum_{i=1}^{n} y_i \alpha_i^* x_i = \sum_{i \in sv} y_i \alpha_i^* x_i.$$
However, since $b_D$ does not appear in the dual problem, its value needs to be computed using the primal constraints as:
$$b_D = - \frac{\max_{y_i = -1} \langle w_D, x_i \rangle + \min_{y_i = +1} \langle w_D, x_i \rangle}{2}.$$
The maximal margin hyperplane is then given by
$$0 = \langle w_D, x \rangle + b_D = \sum_{i \in sv} y_i \alpha_i^* \langle x_i, x \rangle + b_D. \tag{11}$$
The computation of $b_D$ can be facilitated using the complementarity condition: choose an index $i$ for which $\alpha_i^* \ne 0$, and compute $b_D$ using (9). It is, of course, numerically more stable to do this for all $i$ with $\alpha_i^* \ne 0$ and then take the mean of all found $b_D$-values. From the complementarity condition it follows as well, for $i \in sv$, that
$$1 = y_i (\langle w_D, x_i \rangle + b_D) = y_i \Big( \sum_{j \in sv} y_j \alpha_j^* \langle x_j, x_i \rangle + b_D \Big),$$
which, in combination with (8), can be used to calculate the optimal geometrical margin:
$$\begin{aligned} \|w_D\|_2^2 = \langle w_D, w_D \rangle &= \sum_{i,j=1}^{n} \langle y_i \alpha_i^* x_i, y_j \alpha_j^* x_j \rangle \\ &= \sum_{i \in sv} y_i \alpha_i^* \sum_{j \in sv} y_j \alpha_j^* \langle x_i, x_j \rangle \\ &= \sum_{i \in sv} \alpha_i^* (1 - y_i b_D) = \sum_{i \in sv} \alpha_i^*. \end{aligned}$$
Therefore, $\gamma_g = 1 / \|w_D\|_2 = \big( \sum_{i \in sv} \alpha_i^* \big)^{-1/2}$. The margin is thus fully defined by only the support vectors, which is an indication of the sparsity of the method.

However, it is very clear that the setup of the GPA is extremely limiting. The two major issues for this method are (i) that a linear decision function may not always be suitable for the classification problem at hand, i.e., if the set $D$ is not linearly separable; and (ii) that, due to the effect of noise in the data set, we would actually want to allow some points to be misclassified in order to avoid overfitting (which can especially pose serious problems if the dimension $d$ is greater than the sample size $n$). These problems will be tackled in Sections 3.2 and 3.3.
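The formulas of this section can be verified on a small example. The sketch below assumes a four-point data set in $\mathbb{R}^2$ and a candidate dual solution $\alpha^* = (0.25, 0.25, 0, 0)$; both are hypothetical and chosen so that the numbers can be checked by hand. The code rebuilds $w_D$ and $b_D$ from $\alpha^*$, confirms the KKT conditions (7)–(9), and checks that $\|w_D\|_2^2 = \sum_{i \in sv} \alpha_i^*$.

```python
import math

# Hypothetical, linearly separable toy data (not from the text).
X = [(2.0, 2.0), (0.0, 0.0), (3.0, 3.0), (-1.0, 0.0)]
y = [+1, -1, +1, -1]
alpha = [0.25, 0.25, 0.0, 0.0]  # assumed dual solution, verified below via KKT

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

n = len(X)
sv = [i for i in range(n) if alpha[i] > 0]  # indices of the support vectors

# w_D = sum over support vectors of y_i * alpha_i * x_i
w = [sum(y[i] * alpha[i] * X[i][k] for i in sv) for k in range(2)]

# b_D = -(max_{y_i=-1} <w, x_i> + min_{y_i=+1} <w, x_i>) / 2
b = -(max(dot(w, X[i]) for i in range(n) if y[i] == -1)
      + min(dot(w, X[i]) for i in range(n) if y[i] == +1)) / 2.0

margins = [y[i] * (dot(w, X[i]) + b) for i in range(n)]
gamma = 1.0 / math.sqrt(dot(w, w))  # geometrical margin

# KKT conditions: (8), primal feasibility (4), complementarity (9)
assert abs(sum(y[i] * alpha[i] for i in range(n))) < 1e-12
assert all(m >= 1.0 - 1e-12 for m in margins)
assert all(abs(alpha[i] * (margins[i] - 1.0)) < 1e-12 for i in range(n))

print("w_D =", w, " b_D =", b, " margin =", gamma)
print("sum of alpha =", sum(alpha), " ||w_D||^2 =", dot(w, w))
```

Only the first two points are support vectors; dropping the other two would leave $w_D$, $b_D$, and the margin unchanged.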
3.2 (†) The hard margin SVM
Boser et al. (1992) extended the previous problem to the case of nonlinear decision functions by applying a rather old trick (Aizerman et al., 1964). For this, it is important to note that the data $x_i$ only appear as inner products in the optimization problem. The idea is to map the input data $(x_1, \ldots, x_n)$ into some (possibly infinite-dimensional) Hilbert space $\mathcal{H}_0$, the feature space, by a typically nonlinear map $\Phi : \mathcal{X} \to \mathcal{H}_0$, called the feature map, such that the mapped data $\Phi(x_i)$ can be linearly separated in the feature space $\mathcal{H}_0$ by applying the GPA on the mapped data set $((\Phi(x_1), y_1), \ldots, (\Phi(x_n), y_n))$, see Fig. 2. In other words, by choosing a good transformation, the transformed data points become linearly separable in the new space and hence the method from Section 3.1 is applicable.

FIG. 2 Mapping the original data into a higher-dimensional space allows them to be separated by a hyperplane using GPA.

The training algorithm will then solely depend on the data through inner products in the feature space $\mathcal{H}_0$, which are of the form $\langle \Phi(x_i), \Phi(x_j) \rangle$ (instead of $\langle x_i, x_j \rangle$ from before), which is exactly the definition of the kernel $k$ from Definition 9. Therefore, it is possible to work with only the kernel $k$ in the training algorithm, without explicitly having to know the function $\Phi$. For more details on the kernel, the feature space and the feature map, we refer to Section 3.5.

Although calculating $w$, which now resides in the feature space $\mathcal{H}_0$ and no longer in $\mathbb{R}^d$, does require explicit knowledge of the feature map due to its form
$$w = \sum_{i \in sv} y_i \alpha_i^* \Phi(x_i),$$
the separating hyperplane (and thus also the decision function) can easily be calculated without this knowledge by adapting the expression (11):
$$0 = \sum_{i \in sv} y_i \alpha_i^* \langle \Phi(x_i), \Phi(x) \rangle + b_D = \sum_{i \in sv} y_i \alpha_i^* k(x_i, x) + b_D.$$
Originally, this method was called the maximal margin classifier; later on it was termed the hard margin SVM. If the data set $D$ does not contain any contradictory data, i.e., there are no $(x_i, y_i)$ and $(x_j, y_j)$ with $x_i = x_j$ and $y_i \ne y_j$, and if a suitable feature map $\Phi$ is chosen, see Steinwart and Christmann (2008b, Section 4.6), then this method will be able to perfectly separate every training data set by a hyperplane in the feature space.

But the flexibility of this method comes at a price: since the separating hyperplane lies in a high- or even infinite-dimensional space, the problem of overfitting becomes very prominent.
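The kernel trick described in this section can be made concrete with a small sketch. It assumes the homogeneous quadratic kernel $k(x, x') = \langle x, x' \rangle^2$ on $\mathbb{R}^2$ (the polynomial kernel with $m = 2$ and $c = 0$), one explicit feature map $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ for it, and an XOR-type data set; the data and the hyperplane in feature space are chosen by hand for illustration.

```python
import math

def k(x, z):
    """Homogeneous quadratic kernel k(x, z) = <x, z>^2 on R^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """One explicit feature map for k, mapping R^2 into R^3."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# XOR-type data: no hyperplane in R^2 separates the two classes.
X = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
y = [+1, +1, -1, -1]

# The kernel value equals the inner product of the mapped points.
for x in X:
    for z in X:
        assert abs(k(x, z) - dot(phi(x), phi(z))) < 1e-12

# In feature space the hand-picked hyperplane <w, Phi(x)> + b = 0 separates the data.
w, b = (0.0, 1.0 / math.sqrt(2.0), 0.0), 0.0
margins = [yi * (dot(w, phi(xi)) + b) for xi, yi in zip(X, y)]
print(margins)  # every entry is (numerically) equal to 1
```

The training algorithm itself would only ever call `k`; `phi` is written out here solely to confirm that both routes give the same inner products.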
3.3 (†) The soft margin SVM
Cortes and Vapnik (1995) solved the problem of overfitting by proposing the soft margin SVM. The idea behind this method is to relax the constraints in (5) by introducing some nonnegative slack variables $\xi_i$, $i = 1, \ldots, n$, which will allow for the margin constraints to be violated when necessary, see Fig. 3, i.e., allow that some classification errors are made. They will thus add an extra cost to the primal objective function, since we want to limit the amount of slack. Combining this with the previously explained idea of the feature map then gives:
$$\begin{aligned} &\text{minimize} && \frac{1}{2} \langle w, w \rangle + C \sum_{i=1}^{n} \xi_i \quad \text{over } w \in \mathcal{H}_0,\ b \in \mathbb{R},\ \xi \in \mathbb{R}^n \\ &\text{subject to} && y_i (\langle w, \Phi(x_i) \rangle + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \\ & && \xi_i \ge 0, \quad i = 1, \ldots, n, \end{aligned} \tag{12}$$
where $C > 0$ is a free, but fixed constant that is used to balance the weight accorded to both parts of the objective function and $\xi = (\xi_1, \ldots, \xi_n)$ is the vector of slack variables. The larger the value of $C$, the more penalization is attributed to the training errors.

FIG. 3 Visualization of the slack variable in $\mathcal{X} = \mathbb{R}^2$ when a linear kernel $k(x, x') = \langle x, x' \rangle$ is used.

As mentioned above, the slack variables will allow individual observations to fall on the wrong side of the margin. If $\xi_i = 0$, it means that the observation $x_i$ lies on the correct side of the margin. For $\xi_i > 0$ the observation is on the wrong side of the margin, and when the value of $\xi_i$ surpasses one, a classification error will be made since this means crossing the separating hyperplane. Therefore, $\sum_{i=1}^{n} \xi_i$ can be seen as some sort of upper bound on the number of training errors. Minimizing this upper bound will assure that a minimal number of training errors is made.

This problem is still a convex programming problem with linear constraints, and thus it can be solved using the primal-dual approach. As will become clear, neither the $\xi_i$, nor their associated Lagrange multipliers will appear in the dual problem. Writing the optimization problem (12) in the primal Lagrangian form gives
$$L_P(w, b, \xi, \alpha, \beta) = \frac{1}{2} \langle w, w \rangle + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left( y_i (\langle w, \Phi(x_i) \rangle + b) - 1 + \xi_i \right) - \sum_{i=1}^{n} \beta_i \xi_i, \tag{13}$$
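with $\alpha = (\alpha_1, \ldots, \alpha_n)$ and $\beta = (\beta_1, \ldots, \beta_n)$ the vectors of nonnegative Lagrange multipliers. To obtain the dual formulation, we first differentiate with respect to $w$, $b$ and $\xi$, and impose stationarity:
$$\frac{\partial L_P(w, b, \xi, \alpha, \beta)}{\partial w} = w - \sum_{i=1}^{n} y_i \alpha_i \Phi(x_i) = 0, \tag{14}$$
$$\frac{\partial L_P(w, b, \xi, \alpha, \beta)}{\partial b} = - \sum_{i=1}^{n} y_i \alpha_i = 0, \tag{15}$$
$$\frac{\partial L_P(w, b, \xi, \alpha, \beta)}{\partial \xi_i} = C - \alpha_i - \beta_i = 0, \quad i = 1, \ldots, n. \tag{16}$$
The KKT conditions, which by Theorem 5 are necessary and sufficient for an optimal solution, are Eqs. (14)–(16) completed with
$$y_i (\langle w, \Phi(x_i) \rangle + b) - 1 + \xi_i \ge 0, \quad \xi_i \ge 0, \quad \alpha_i \ge 0, \quad \beta_i \ge 0,$$
$$\alpha_i \left( y_i (\langle w, \Phi(x_i) \rangle + b) - 1 + \xi_i \right) = 0, \tag{17}$$
$$\beta_i \xi_i = 0. \tag{18}$$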
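Inserting (14)–(16) into the primal (13) yields the dual Lagrangian
$$\begin{aligned} L_D(w, b, \xi, \alpha, \beta) &= \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle \Phi(x_i), \Phi(x_j) \rangle + C \sum_{i=1}^{n} \xi_i - \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle \Phi(x_i), \Phi(x_j) \rangle \\ &\quad + \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \alpha_i \xi_i - \sum_{i=1}^{n} \beta_i \xi_i \\ &= \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle \Phi(x_i), \Phi(x_j) \rangle + \sum_{i=1}^{n} (C - \alpha_i - \beta_i) \xi_i \\ &= \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle \Phi(x_i), \Phi(x_j) \rangle, \end{aligned}$$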
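which is identical to the dual of the GPA, but with $\Phi(x_i)$ instead of $x_i$. The difference between both methods occurs in the constraints: $C - \alpha_i - \beta_i = 0$, together with $\beta_i \ge 0$, enforces that $0 \le \alpha_i \le C$, thus providing an upper bound for the $\alpha_i$. Using the kernel $k$, the optimization problem becomes
$$\begin{aligned} &\text{maximize} && \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j k(x_i, x_j) \quad \text{over } \alpha_i \in \mathbb{R} \\ &\text{subject to} && \sum_{i=1}^{n} y_i \alpha_i = 0, \\ & && 0 \le \alpha_i \le C, \quad i = 1, \ldots, n. \end{aligned} \tag{19}$$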
For $\alpha^* = (\alpha_1^*, \ldots, \alpha_n^*)$ a solution of the dual program (19), the normal to the separating hyperplane is
$$w_D = \sum_{i \in sv} y_i \alpha_i^* \Phi(x_i).$$
Combining the complementarity condition (18) with (16) shows that if $\xi_i \ne 0$, then $\beta_i = C - \alpha_i^* = 0$ and thus $\alpha_i^* = C$. Furthermore, for $\alpha_i^* < C$, $\xi_i = 0$. The value of $b_D$ can then be calculated by taking any training point for which $0 < \alpha_i^* < C$ and $\xi_i = 0$, and using (17). As before, it is recommended to take the average over all such training points.

Condition (17) shows that those training points for which $\alpha_i \ne 0$, i.e., the support vectors, will lie between or on the boundaries determined by the hyperplanes $H_1$ and $H_2$. Hence, in this case, there are, as could be expected, more support vectors than in the simpler case of separable data. Due to their location, the distance of the support vectors from the separating hyperplane is at most $1 / \|w_D\|$. If $0 < \alpha_i < C$, they lie at exactly the target distance $1 / \|w_D\|$ from the separating hyperplane and thus on either $H_1$ or $H_2$.

For $f(x) := \sum_{i=1}^{n} y_i \alpha_i^* k(x, x_i) + b_D$, the decision function will be $f_D(x) = \operatorname{sign}(f(x))$. Since $\|w_D\|^2 = \sum_{i,j \in sv} y_i y_j \alpha_i^* \alpha_j^* k(x_i, x_j)$, the geometrical margin will be $\gamma_g = \big( \sum_{i,j \in sv} y_i y_j \alpha_i^* \alpha_j^* k(x_i, x_j) \big)^{-1/2}$, which again only depends on the support vectors.

Key learnings
• SVMs are (originally) a linear classifier method.
• Using kernels allows for nonlinear classifiers by projecting the data in a higher-dimensional space.
• Adding slack variables reduces overfitting by allowing a limited amount of misclassification.
• The hyperplane that separates the data is fully defined by only the support vectors (sparse representation).
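To illustrate how the dual program (19) can be solved in practice, the following is a minimal sketch of a pairwise coordinate ascent in the spirit of sequential minimal optimization. This solver is an assumption made here for illustration and is not the method described in the text: it repeatedly picks two multipliers, maximizes the dual objective over that pair analytically, and clips the result so that $0 \le \alpha_i \le C$ and $\sum_i y_i \alpha_i = 0$ remain satisfied. The function names, the toy data, and the fixed number of passes are all hypothetical choices.

```python
def train_svm(X, y, kernel, C, passes=50):
    """Pairwise coordinate ascent on the dual (19); a simplified SMO-style sketch."""
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(passes):
        for i in range(n):
            for j in range(i + 1, n):
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
                if eta <= 1e-12:
                    continue  # the two points coincide in feature space: skip the pair
                # g_k = sum_m alpha_m y_m k(x_m, x_k); the offset b cancels in the difference
                g_i = sum(alpha[m] * y[m] * K[m][i] for m in range(n))
                g_j = sum(alpha[m] * y[m] * K[m][j] for m in range(n))
                delta = y[j] * ((g_i - y[i]) - (g_j - y[j])) / eta
                # the interval [lo, hi] keeps 0 <= alpha <= C and sum_i y_i alpha_i = 0
                if y[i] != y[j]:
                    lo, hi = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
                else:
                    lo, hi = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
                new_j = min(hi, max(lo, alpha[j] + delta))
                alpha[i] += y[i] * y[j] * (alpha[j] - new_j)
                alpha[j] = new_j
    # offset b_D from the points with 0 < alpha_i < C (where xi_i = 0), averaged
    free = [i for i in range(n) if 1e-9 < alpha[i] < C - 1e-9]
    offsets = [y[i] - sum(alpha[m] * y[m] * K[m][i] for m in range(n)) for i in free]
    b = sum(offsets) / len(offsets) if offsets else 0.0
    return alpha, b

def decision(x, X, y, alpha, b, kernel):
    """f(x) = sum_i y_i alpha_i k(x, x_i) + b; the predicted label is its sign."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy data; with C = 10 the box constraint is not active here.
X = [(2.0, 2.0), (0.0, 0.0), (3.0, 3.0), (-1.0, 0.0)]
y = [+1, -1, +1, -1]
alpha, b = train_svm(X, y, linear_kernel, C=10.0)
print("alpha =", alpha, " b =", b)
print([decision(x, X, y, alpha, b, linear_kernel) for x in X])
```

On this data only the first two points end up with nonzero multipliers, and swapping `linear_kernel` for any other kernel changes nothing else in the code. A production solver would add a convergence test and a smarter choice of pairs; the fixed number of passes here is only for brevity.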
3.4 (†) Empirical risk minimization and support vector machines
We now shift our focus from the historical, geometrical idea of SVMs to the interpretation through empirical risk minimization, which is the prevailing view nowadays. Recall that the predictor $f : \mathcal{X} \to \mathcal{Y}$ will assign to each risk vector $x$ a prediction $f(x)$ which will, hopefully, be a good approximation of the observed output $y$. A function $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ is called a loss function (or in short a loss) if $L$ is measurable. The loss function assesses the quality of a prediction $f(x)$ for an observed output value $y$ by $L(x, y, f(x))$, i.e., it gives the cost or loss incurred for predicting $y$ by $f(x)$. Therefore, the smaller $L(x, y, f(x))$, the better the prediction. By the same reasoning, it is logical to assume that $L(x, y, y) = 0$ for all $y \in \mathcal{Y}$, because if the forecast $f(x)$ equals the observed value $y$, there is no loss.

To assess the quality of a predictor $f$ it is, of course, not sufficient to only know the values $L(x, y, f(x))$ for particular choices of $(x, y)$; we need to quantify how small the function $(x, y) \mapsto L(x, y, f(x))$ itself is. In statistical learning theory this is most commonly done by considering the $L$-risk or expected loss of $f$
$$\mathcal{R}_{L,P}(f) := \mathbb{E}_P L(X, Y, f(X)) = \int_{\mathcal{X} \times \mathcal{Y}} L(x, y, f(x)) \, dP(x, y),$$
for $P$ a probability measure on $\mathcal{X} \times \mathcal{Y}$ and $f : \mathcal{X} \to \mathbb{R}$ a measurable function. The learning goal is then to find a decision function $f_D$ that (approximately) achieves the Bayes risk, which is the smallest possible risk:
$$\mathcal{R}^*_{L,P} := \inf \{ \mathcal{R}_{L,P}(f);\ f : \mathcal{X} \to \mathbb{R} \text{ measurable} \}. \tag{20}$$
The measurable function $f^*_{L,P} : \mathcal{X} \to \mathbb{R}$ for which $\mathcal{R}_{L,P}(f^*_{L,P}) = \mathcal{R}^*_{L,P}$ is called the Bayes decision function. However, since the distribution $P$ is unknown, the risk $\mathcal{R}_{L,P}(f)$ is also unknown and consequently we cannot compute $f_D$. This problem is solved by replacing $P$ by the empirical distribution
$$\mathrm{D} = \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}$$
corresponding to the data set $D$, where $\delta_{(x_i, y_i)}$ denotes the Dirac distribution in $(x_i, y_i)$. This gives us the empirical $L$-risk
$$\mathcal{R}_{L,\mathrm{D}}(f) := \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, f(x_i)).$$
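The empirical $L$-risk is simply an average of losses over the sample, as the short sketch below shows. The hinge loss $L(x, y, t) = \max(0, 1 - yt)$ used here is an assumption for illustration (it is a common convex loss for classification, but it is not introduced in this section), and the data and the two candidate predictors are hypothetical.

```python
def hinge_loss(x, y, t):
    """Hinge loss max(0, 1 - y*t); assumed here purely for illustration."""
    return max(0.0, 1.0 - y * t)

def empirical_risk(loss, f, data):
    """Empirical L-risk: (1/n) * sum_i L(x_i, y_i, f(x_i))."""
    return sum(loss(x, y, f(x)) for x, y in data) / len(data)

# Hypothetical one-dimensional sample of (x_i, y_i) pairs.
data = [(-2.0, -1), (-0.5, -1), (0.5, 1), (3.0, 1)]

def f1(x):
    return x        # hypothetical predictor

def f2(x):
    return 2.0 * x  # hypothetical predictor with a steeper slope

print(empirical_risk(hinge_loss, f1, data))  # two points fall inside the margin
print(empirical_risk(hinge_loss, f2, data))  # all points satisfy y * f(x) >= 1
```

Both predictors classify every point correctly, yet the hinge loss still distinguishes them because it also penalizes points that lie inside the margin.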
Although $\mathcal{R}_{L,\mathrm{D}}(f)$ can be considered as an approximation of $\mathcal{R}_{L,P}(f)$ for each single $f$, solving $\inf_{f : \mathcal{X} \to \mathbb{R}} \mathcal{R}_{L,\mathrm{D}}(f)$ will in general not result in a good approximate minimizer of $\mathcal{R}_{L,P}(\cdot)$. Partially, this is due to the effect of overfitting: the learning method will model the data $D$ too closely and will have a poor performance on future, previously unseen data points.

The danger of overfitting can be reduced by not considering all measurable functions $f : \mathcal{X} \to \mathbb{R}$, but by looking at a smaller, but still reasonably rich, class $\mathcal{F}$ of functions that is assumed to contain a good approximation of the solution of (20). Instead of then searching the infimum (which, in practice, will usually be a minimum) of $\mathcal{R}_{L,\mathrm{D}}(\cdot)$ over all measurable functions, we only consider $\mathcal{F}$, and thus solve
$$\inf_{f \in \mathcal{F}} \mathcal{R}_{L,\mathrm{D}}(f). \tag{21}$$
This approach is called empirical risk minimization (ERM), and tends to produce approximate solutions of
$$\mathcal{R}^*_{L,P,\mathcal{F}} := \inf_{f \in \mathcal{F}} \mathcal{R}_{L,P}(f).$$
There are, however, two serious issues with ERM: (i) due to the limited knowledge of the distribution $P$ it is difficult to guarantee that the model error $\mathcal{R}^*_{L,P,\mathcal{F}} - \mathcal{R}^*_{L,P}$ is sufficiently small; and (ii) solving (21) might be computationally unfeasible.

To make the optimization problem computationally feasible, SVMs apply three measures. First, a convex loss function is used, since in this case the risk functional $f \mapsto \mathcal{R}_{L,P}(f)$ is also convex. If we further assume that the set $\mathcal{F}$ of functions over which we optimize is convex as well, the learning method defined by (21) will become a convex optimization problem.

Secondly, only a very specific set of functions will be considered. For SVMs, this set is the reproducing kernel Hilbert space $\mathcal{H}$ of some measurable kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. All functions in $\mathcal{H}$ can be described using the kernel. Moreover, the value $k(x, x')$ can often be interpreted as a measure of similarity or dissimilarity between two vectors $x$ and $x'$.

The third step of SVMs toward computational feasibility, and also to uniqueness of the SVM, is to add a special Hilbert norm regularization term $\lambda \|f\|_{\mathcal{H}}^2$, where $\|\cdot\|_{\mathcal{H}}$ denotes the norm in $\mathcal{H}$. It can easily be verified that the sum of a convex function and of a strictly convex function over a convex set is strictly convex, and thus that the problem remains a convex optimization problem. This regularization term also reduces the danger of overfitting, see, e.g., Vapnik (1998) and Schölkopf and Smola (2002), because it penalizes rather complex functions $f$ which model the output values in the training set $D$ too closely, since these functions have a large RKHS norm. The term
$$\mathcal{R}^{reg}_{L,P,\lambda}(f) := \mathcal{R}_{L,P}(f) + \lambda \|f\|_{\mathcal{H}}^2$$
is called the regularized $L$-risk, and the constant $\lambda$ is the regularization parameter. The regularized empirical $L$-risk is given by
$$\mathcal{R}^{reg}_{L,\mathrm{D},\lambda}(f) := \mathcal{R}_{L,\mathrm{D}}(f) + \lambda \|f\|_{\mathcal{H}}^2.$$
Definition 8. Let $\mathcal{X}$ be the input space, $\mathcal{Y} \subset \mathbb{R}$ be the output space, $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a loss function, $\mathcal{H}$ be a reproducing kernel Hilbert space of functions from $\mathcal{X}$ to $\mathbb{R}$, and $\lambda > 0$ be a constant. For a probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, a support vector machine is defined as the minimizer, if it exists,
$$f_{L,P,\lambda} := \arg\inf_{f \in \mathcal{H}} \mathcal{R}_{L,P}(f) + \lambda \|f\|_{\mathcal{H}}^2. \tag{22}$$
The empirical SVM is denoted by
$$f_{L,\mathrm{D},\lambda} := \arg\inf_{f \in \mathcal{H}} \mathcal{R}_{L,\mathrm{D}}(f) + \lambda \|f\|_{\mathcal{H}}^2 = \arg\inf_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2.$$
The link between this formulation and the SVM as obtained in Section 3.3 will become clear later on.

Key learnings
• SVMs are defined as the solution of an $L$-risk minimization problem over a specific set of functions, namely the reproducing kernel Hilbert space.
• A regularization term is added to reduce overfitting.
• This formulation also allows for regression problems.
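The trade-off in the regularized empirical $L$-risk can be seen in a short sketch. It makes three assumptions that are not part of the text above: the loss is the hinge loss $\max(0, 1 - yt)$, the kernel is the linear kernel $k(x, x') = x x'$ on $\mathbb{R}$, and the candidate functions are finite kernel expansions $f = \sum_j c_j k(\cdot, x_j)$, for which $\|f\|_{\mathcal{H}}^2 = \sum_{i,j} c_i c_j k(x_i, x_j)$. The data, coefficient vectors, and values of $\lambda$ are hypothetical.

```python
def kernel(u, v):
    """Linear kernel on R (assumed for illustration)."""
    return u * v

def hinge_loss(y, t):
    """Hinge loss max(0, 1 - y*t) (assumed for illustration)."""
    return max(0.0, 1.0 - y * t)

def regularized_empirical_risk(c, xs, ys, lam):
    """(1/n) sum_i L(y_i, f(x_i)) + lam * ||f||_H^2 for f = sum_j c_j k(., x_j)."""
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    f_vals = [sum(c[j] * K[i][j] for j in range(n)) for i in range(n)]
    risk = sum(hinge_loss(ys[i], f_vals[i]) for i in range(n)) / n
    norm_sq = sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))
    return risk + lam * norm_sq

# Hypothetical one-dimensional sample.
xs = [-2.0, -0.5, 0.5, 3.0]
ys = [-1, -1, 1, 1]

c_small = [-0.5, 0.0, 0.0, 0.0]  # corresponds to f(x) = x,   ||f||^2 = 1
c_large = [-1.0, 0.0, 0.0, 0.0]  # corresponds to f(x) = 2*x, ||f||^2 = 4

for lam in (0.1, 0.01):
    print(lam,
          regularized_empirical_risk(c_small, xs, ys, lam),
          regularized_empirical_risk(c_large, xs, ys, lam))
```

With the larger $\lambda$ the function with the smaller norm has the lower objective value despite its nonzero empirical risk; with the smaller $\lambda$ the ordering reverses, which is the balance that the regularization parameter controls.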
3.5 Kernels and the reproducing kernel Hilbert space
In this section, the kernel and the reproducing kernel Hilbert space are examined more closely. As seen, the kernel and the associated feature map allow us to obtain nonlinear decision functions by applying the linear SVM approach on a mapping of the original input data in a higher-dimensional feature space. First, we give a formal definition of the kernel, the feature map and the feature space, and give examples of commonly used kernels. We only consider real-valued kernels (but most of the theory is also applicable to complex-valued kernels) and thus all considered Hilbert spaces are $\mathbb{R}$-Hilbert spaces. Next, we describe the reproducing kernel Hilbert space, and finally we state some properties of kernels and RKHSs.

Definition 9. Let $\mathcal{X}$ be a nonempty set and $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We call $k$ a kernel on $\mathcal{X}$ if there exists a Hilbert space $\mathcal{H}_0$ and a map $\Phi : \mathcal{X} \to \mathcal{H}_0$ such that for all $x, x' \in \mathcal{X}$ we have
$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle.$$
In this case, $\Phi$ is called the feature map and $\mathcal{H}_0$ the feature space of $k$.

Let $m \ge 0$ be an integer and $c \ge 0$ a real number, and take $x, x' \in \mathcal{X}$. Some kernels often used in practice are the polynomial kernel
$$k(x, x') := (\langle x, x' \rangle + c)^m,$$
with as special case the linear kernel ($m = 1$ and $c = 0$), the exponential kernel
$$k(x, x') := \exp(\langle x, x' \rangle),$$
and the Gaussian radial basis function (RBF) kernel
$$k_{RBF}(x, x') = \exp(-\gamma^{-2} \| x - x' \|^2), \tag{23}$$
where $\gamma$ is a positive constant, called the width. The number of possible kernels is almost limitless and many specialized kernels have been developed for very specific problems. More generally, it can be shown, see, e.g., Steinwart and Christmann (2008b, Theorem 4.16), that a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel if and only if it is symmetric and positive definite.
Another concept we introduced in Section 3.4 is the reproducing kernel Hilbert space.
Definition 10. Let $\mathcal{X} \neq \emptyset$ and let $H$ be a Hilbert space consisting of functions mapping from $\mathcal{X}$ into $\mathbb{R}$.
1. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a reproducing kernel of $H$ if, for all $x \in \mathcal{X}$, $k(\cdot, x) \in H$ and the reproducing property
\[ f(x) = \langle f, k(\cdot, x) \rangle_H \tag{24} \]
holds for all $f \in H$ and all $x \in \mathcal{X}$.
2. The space $H$ is called a reproducing kernel Hilbert space (RKHS) over $\mathcal{X}$ if it possesses a reproducing kernel.
3. For an RKHS $H$, the feature map $\Phi : \mathcal{X} \to H$ is given by
\[ \Phi(x) := k(\cdot, x), \quad x \in \mathcal{X}, \]
and is called the canonical feature map.
Please note that, because of (24), the kernel can be used to describe all functions contained in $H$. Moreover, the value $k(x, x')$ can often be interpreted as a measure of similarity or dissimilarity between the risk vectors $x$ and $x'$. Using the canonical feature map, the reproducing property can also be rewritten as, for all $f \in H$ and all $x \in \mathcal{X}$,
\[ f(x) = \langle f, \Phi(x) \rangle_H . \]
For a given kernel, neither the feature map nor the feature space is uniquely defined. However, every RKHS is uniquely defined by its kernel $k$ and vice versa. Yosida (1974, Theorem 1, p. 96) tells us that the reproducing kernel of a given Hilbert space is unique, and Steinwart and Christmann (2008b, Theorem 4.21) show the uniqueness of the RKHS for a given kernel $k$. There thus exists a one-to-one relationship between the kernel and the RKHS. The RKHS $H$ can then, for $H_0$ a feature space of the kernel $k$ and $\Phi_0 : \mathcal{X} \to H_0$ a feature map, be written as
\[ H := \{ f : \mathcal{X} \to \mathbb{R} \mid \exists\, w \in H_0 \text{ with } f(x) = \langle w, \Phi_0(x) \rangle_{H_0} \text{ for all } x \in \mathcal{X} \} \]
equipped with the norm
\[ \|f\|_H := \inf \{ \|w\|_{H_0} \mid w \in H_0 \text{ with } f = \langle w, \Phi_0(\cdot) \rangle_{H_0} \} . \]
This expression shows that the RKHS associated with a kernel $k$ consists exactly of all possible functions of the given form, which allows us to determine the RKHS of any given kernel. This means, in a sense, that the RKHS is the smallest feature space of the kernel and can thus be seen as a canonical feature space. Moreover, this set of functions remains unchanged when considering other feature spaces of $k$. More details on reproducing kernels and their RKHSs can be found in Berlinet and Thomas-Agnan (2004).
For $k$ a kernel on $\mathcal{X}$ with RKHS $H$, the Cauchy–Schwarz inequality and (24) tell us that
\[ |k(x, x')|^2 = |\langle k(\cdot, x), k(\cdot, x') \rangle_H|^2 \le \|k(\cdot, x)\|_H^2 \, \|k(\cdot, x')\|_H^2 = k(x, x)\, k(x', x') \tag{25} \]
for all $x, x' \in \mathcal{X}$. Therefore $\sup_{x, x' \in \mathcal{X}} |k(x, x')| = \sup_{x \in \mathcal{X}} k(x, x)$, and hence a kernel $k$ is called bounded if
\[ \|k\|_\infty := \sup_{x \in \mathcal{X}} \sqrt{k(x, x)} < \infty . \]
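As a small numerical illustration, the kernels named above and inequality (25) can be sketched in a few lines. This is only a sanity check on a handful of hypothetical points, not a proof; the RBF kernel is written as $\exp(-\|x - x'\|^2 / \gamma^2)$, which is how we read (23) given that $\gamma$ is called the width, and the default parameter values are arbitrary choices.

```python
import math
from itertools import product


def dot(x, z):
    """Euclidean inner product <x, z>."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, m=2, c=1.0):
    return (dot(x, z) + c) ** m


def exponential_kernel(x, z):
    return math.exp(dot(x, z))


def rbf_kernel(x, z, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / gamma**2)


def satisfies_cauchy_schwarz(k, points, rel_tol=1e-12):
    """Check |k(x, z)|^2 <= k(x, x) * k(z, z) for all pairs, up to rounding."""
    return all(
        k(x, z) ** 2 <= k(x, x) * k(z, z) * (1.0 + rel_tol)
        for x, z in product(points, repeat=2)
    )


# Hypothetical sample points in R^2.
points = [(0.0, 1.0), (1.0, -2.0), (0.5, 0.5), (-1.5, 2.0)]
kernels = {
    "polynomial": polynomial_kernel,
    "exponential": exponential_kernel,
    "rbf": rbf_kernel,
}
cs_ok = {name: satisfies_cauchy_schwarz(k, points) for name, k in kernels.items()}

# For the RBF kernel k(x, x) = 1 for every x, so sqrt(k(x, x)) is constant.
rbf_sup_norm = max(math.sqrt(rbf_kernel(x, x)) for x in points)
print(cs_ok, rbf_sup_norm)
```

The last line reflects why the Gaussian RBF kernel is bounded: $k(x, x) = 1$ regardless of $x$, whereas $k(x, x)$ grows without bound for the polynomial and exponential kernels on $\mathbb{R}^d$.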
Eq. (25) also shows that, for $\Phi : \mathcal{X} \to H$ a feature map of $k$, $\|\Phi(x)\|_H = \sqrt{k(x, x)}$ for all $x \in \mathcal{X}$. Thus $\Phi$ is bounded if and only if $k$ is bounded. Using this equality and the reproducing property, we can obtain the well-known inequalities
\[ \|f\|_\infty \le \|k\|_\infty \|f\|_H \quad \text{and} \quad \|\Phi(x)\|_\infty \le \|k\|_\infty \|\Phi(x)\|_H \le \|k\|_\infty^2 \tag{26} \]
for $f \in H$ and $x \in \mathcal{X}$. As an example of a bounded kernel we mention the Gaussian RBF kernel, a property that adds to its popularity.
To conclude this section, we would like to mention that if the kernel $k$ is bounded, all functions in its RKHS are bounded as well (Steinwart and Christmann, 2008b, Lemma 4.23). Also, if $k$ is a measurable kernel, then all functions in the RKHS are measurable (Steinwart and Christmann, 2008b, Lemma 4.24).
Key learnings
• Kernels can often be seen as a measure of (dis)similarity between two inputs.
• Every RKHS is uniquely defined by its kernel, and vice versa.
• For a bounded/measurable kernel, all functions in the RKHS are bounded/measurable.
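The first inequality in (26) can be checked numerically for a function that is a finite kernel expansion. The sketch below assumes a one-dimensional Gaussian RBF kernel with width 1 (so $\|k\|_\infty = 1$) and uses the standard identity $\|\sum_i \alpha_i k(x_i, \cdot)\|_H^2 = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$, which follows from the reproducing property; the centers and coefficients are made-up values.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """One-dimensional Gaussian RBF kernel exp(-(x - z)^2 / gamma^2)."""
    return math.exp(-((x - z) ** 2) / gamma**2)


# Hypothetical expansion f = sum_i alpha_i * k(x_i, .)
centers = [-1.0, 0.0, 2.0]
alphas = [0.7, -1.2, 0.4]


def f(x):
    return sum(a * rbf_kernel(c, x) for a, c in zip(alphas, centers))


# ||f||_H^2 = sum_{i,j} alpha_i alpha_j k(x_i, x_j), by the reproducing property.
rkhs_norm = math.sqrt(
    sum(
        ai * aj * rbf_kernel(ci, cj)
        for ai, ci in zip(alphas, centers)
        for aj, cj in zip(alphas, centers)
    )
)

k_sup = 1.0  # ||k||_inf for the RBF kernel, since k(x, x) = 1
grid = [-5.0 + 0.1 * i for i in range(101)]
sup_abs_f = max(abs(f(x)) for x in grid)
print(sup_abs_f, k_sup * rkhs_norm)
```

The supremum is only approximated on a grid, so the comparison illustrates the bound rather than verifying it everywhere.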
3.6 (†) Loss functions
As previously stated, SVMs use loss functions to measure the similarity between the observed output $y_i$ and the predicted output $f(x_i)$ for a given risk vector $x_i$. Recall that a loss is a measurable function $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$. The similarity between the prediction and the observed output is then measured through the L-risk associated with the loss function $L$. In this section we introduce some commonly used losses and discuss their desired properties.
We call a loss function $L$ convex, continuous, Lipschitz continuous, or differentiable, if $L$ has this property with respect to its third argument. For example, $L$ is Lipschitz continuous if there exists a constant $|L|_1 \in (0, \infty)$, the Lipschitz constant, such that, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and all $t_1, t_2 \in \mathbb{R}$,
\[ |L(x, y, t_1) - L(x, y, t_2)| \le |L|_1 \, |t_1 - t_2| . \]
If $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ only depends on its last two arguments, i.e., if there exists a measurable function $L : \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ such that $L(x, y, t) = L(y, t)$ for all $(x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R}$, then $L$ is called a supervised loss. A supervised loss $L : \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ is called margin-based if, for some function $\varphi : \mathbb{R} \to [0, \infty)$, $L(y, t) = \varphi(yt)$ for all $(y, t) \in \mathcal{Y} \times \mathbb{R}$. It is said to be distance-based if there exists a function $\psi : \mathbb{R} \to [0, \infty)$ with $L(y, t) = \psi(y - t)$
for all $(y, t) \in \mathcal{Y} \times \mathbb{R}$ and $\psi(0) = 0$. It is called symmetric if $\psi(r) = \psi(-r)$ for all $r \in \mathbb{R}$. Most classification losses are margin-based, whereas most loss functions used in regression are distance-based.
A loss function $L$ is called a Nemitski loss if there exists a measurable function $b : \mathcal{X} \times \mathcal{Y} \to [0, \infty)$ and an increasing function $h : [0, \infty) \to [0, \infty)$ such that
\[ L(x, y, t) \le b(x, y) + h(|t|), \quad (x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R} . \]
If additionally $b \in L_1(P)$, we say that $L$ is a P-integrable Nemitski loss.
Traditionally, research in nonparametric regression is often based on the least squares loss $L_{LS}(x, y, t) := (y - t)^2$. This loss is convex in $t$, is useful to estimate the conditional mean function, and is advantageous from a numerical point of view, but $L_{LS}$ is not Lipschitz continuous. From a practical point of view, some situations will require more appropriate loss functions. When the main interest lies in fitting a conditional quantile function instead of modeling the conditional mean, the pinball loss function
\[ L_{\tau\text{-pin}}(x, y, t) := \begin{cases} (\tau - 1)(y - t), & \text{if } y - t < 0, \\ \tau (y - t), & \text{if } y - t \ge 0, \end{cases} \]
is used, where $\tau \in (0, 1)$ specifies the desired conditional quantile (see, e.g., Koenker and Bassett (1978) and Koenker (2005) for parametric quantile regression and Takeuchi et al. (2006) for nonparametric quantile regression). The pinball loss is, for example, often used in the field of econometrics. If the aim is to estimate the conditional median function, then the $\varepsilon$-insensitive loss, given by $L_\varepsilon(x, y, t) := \max\{|y - t| - \varepsilon, 0\}$, $\varepsilon \in (0, \infty)$, promises algorithmic advantages in terms of sparseness compared to the $L_1$-loss function $L_{L1}(y, t) = |y - t|$ (Schölkopf and Smola, 2002; Vapnik, 1998). And if the conditional distribution of $Y$ given $X = x$ is known to be symmetric, basically all distance-based loss functions of the form $L(y, t) = \psi(r)$ with $r = y - t$, where $\psi : \mathbb{R} \to [0, \infty)$ is convex, symmetric and has its only minimum at 0, can be used to estimate the conditional mean (Steinwart, 2007).
An example is the logistic loss for regression defined as
\[ L_{r\text{-log}}(x, y, t) := -\ln \frac{4 \exp(y - t)}{(1 + \exp(y - t))^2} = -\ln \big( 4 \Lambda(y - t) (1 - \Lambda(y - t)) \big), \]
with $\Lambda(y - t) = 1 / (1 + e^{-(y - t)})$. If one fears outliers in the y-direction, a less steep loss function such as Huber's loss function given by
\[ L_{c\text{-Huber}}(x, y, t) := \begin{cases} 0.5 (y - t)^2, & \text{if } |y - t| \le c, \\ c |y - t| - c^2 / 2, & \text{if } |y - t| > c, \end{cases} \]
for some $c \in (0, \infty)$, may be more suitable, see, e.g., Huber (1964) and Christmann and Steinwart (2007).
Two losses that are commonly used in classification problems are the hinge loss
\[ L_{hinge}(x, y, t) := \max\{0, 1 - yt\}, \]
and the logistic loss for classification
\[ L_{c\text{-log}}(x, y, t) := \ln(1 + \exp(-yt)), \]
for which $(x, y, t) \in \mathcal{X} \times \{-1, +1\} \times \mathbb{R}$.
Note that all six loss functions mentioned earlier (excluding the least squares loss and the $L_1$-loss) are convex and Lipschitz continuous supervised losses, but only the logistic losses are twice continuously F-differentiable. Neither the $\varepsilon$-insensitive loss nor the pinball loss is even once F-differentiable. These six loss functions are visualized in Fig. 4. Clearly, the hinge loss and the logistic loss for classification are margin-based losses, whereas the other four are distance-based. All of the mentioned distance-based losses, except for the pinball loss with $\tau \neq 0.5$, are symmetric.
The reason to mainly consider convex losses is that in that case SVMs have, under weak assumptions, at least the following four advantageous properties, see, e.g., Vapnik (1998), Cristianini and Shawe-Taylor (2000), Schölkopf and Smola (2002), and Steinwart and Christmann (2008b) for details.
1. The objective function in (22) becomes convex in $f$ and the SVM $f_{L,D,\lambda}$ is the unique solution of a well-posed convex optimization problem in Hadamard's sense. Furthermore, this minimizer is of the form
\[ f_{L,D,\lambda} = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot), \tag{27} \]
where $k$ is the kernel corresponding to the RKHS $H$ and the $\alpha_i \in \mathbb{R}$, $i = 1, \ldots, n$, are suitable coefficients. The minimizer $f_{L,D,\lambda}$ is thus a weighted sum of (at most) $n$ kernel functions $k(x_i, \cdot)$, where the weights $\alpha_i$ are data-dependent, clearly showing the link between the historical discussion of SVMs and its definition through empirical risk minimization. An important consequence of (27) is that the SVM $f_{L,D,\lambda}$ is contained in a finite dimensional space, even if the space $H$ itself is considerably larger. This observation makes it possible to consider even infinite dimensional spaces $H$ such as the one corresponding to the popular RBF kernel defined in (23).
FIG. 4 Some commonly used loss functions for regression and classification. In order: $\varepsilon$-insensitive loss with $\varepsilon = 0.5$, pinball loss with $\tau = 0.7$, logistic loss for regression, Huber loss with $c = 0.5$, hinge loss, and logistic loss for classification. [Six panels, each plotting the loss (vertical axis, 0.0 to 3.0) over the range -3 to 3 of $y - t$ for the four regression losses and of $yt$ for the two classification losses.]
2. These SVMs are under very weak assumptions L-risk consistent, i.e., for suitable null-sequences $(\lambda_n)$ with $\lambda_n > 0$ we have
\[ \mathcal{R}_{L,P}(f_{L,D,\lambda_n}) \to \mathcal{R}^*_{L,P}, \quad n \to \infty, \]
in probability (Christmann and Steinwart, 2007, 2008; Christmann et al., 2009; Steinwart, 2002, 2005). This means that with an increasing amount of information (i.e., an increasing data set) the L-risk will increasingly approximate the Bayes risk.
3. SVMs based on a convex loss have good statistical robustness properties, if $k$ is continuous and bounded in the sense of $\|k\|_\infty < \infty$ and if $L$ is Lipschitz continuous (Christmann and Steinwart, 2004, 2007; Christmann and Van Messem, 2008; Christmann et al., 2009). In a nutshell, statistical robustness implies that the SVM $f_{L,P,\lambda}$ only varies in a smooth and bounded manner if $P$ changes slightly in the set $\mathcal{M}_1$ of all probability measures on $\mathcal{X} \times \mathcal{Y}$. This also implies that outliers or extreme data points will only have a limited influence on the predictor.
4. There exist efficient numerical algorithms to determine the vector of the weights $\alpha = (\alpha_1, \ldots, \alpha_n)$ in the empirical representation (27) and therefore also the SVM $f_{L,D,\lambda}$, even for large and high-dimensional data sets $D$. From a numerical point of view, the vector $\alpha$ is usually computed as a solution of the convex dual problem derived from a Lagrange approach.
If the loss function is not convex, the SVM may in general not be unique, and the optimization problem might encounter computational difficulties. It needs to be remarked that, in fact, we do not really need the convexity of the loss function itself, but rather that the risk is convex for all distributions $P$, because this leads to a strictly convex objective function. However, to the best of our knowledge, there exist no nonconvex losses such that the risk is convex for all $P$.
Often, Lipschitz continuity of the loss $L$ is also imposed, since this is a condition for most robustness and consistency results. Clearly, the six loss functions defined earlier are all Lipschitz continuous.
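To make the finite-dimensional representation (27) tangible, the following minimal sketch writes the regularized empirical risk as a function of the coefficient vector $\alpha$ and decreases it with plain gradient descent. Several things here are our own simplifying assumptions and not the method described in the text: the logistic loss for classification is used because it is differentiable, the kernel is a one-dimensional RBF kernel with width 1, the data are made up, and gradient descent on $\alpha$ merely stands in for the dual solvers mentioned in item 4.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    return math.exp(-((x - z) ** 2) / gamma**2)


def logistic_loss(y, t):
    return math.log(1.0 + math.exp(-y * t))


def logistic_loss_derivative(y, t):
    """Derivative of ln(1 + exp(-y*t)) with respect to t."""
    return -y / (1.0 + math.exp(y * t))


def kernel_predictions(alpha, K):
    """f(x_i) = sum_j alpha_j k(x_i, x_j) for f = sum_j alpha_j k(x_j, .)."""
    n = len(alpha)
    return [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]


def objective(alpha, K, ys, lam):
    f_vals = kernel_predictions(alpha, K)
    risk = sum(logistic_loss(y, t) for y, t in zip(ys, f_vals)) / len(ys)
    norm_sq = sum(a * t for a, t in zip(alpha, f_vals))  # alpha^T K alpha = ||f||_H^2
    return risk + lam * norm_sq


def gradient(alpha, K, ys, lam):
    n = len(ys)
    f_vals = kernel_predictions(alpha, K)
    g = [
        logistic_loss_derivative(y, t) / n + 2.0 * lam * a
        for y, t, a in zip(ys, f_vals, alpha)
    ]
    return [sum(K[i][j] * g[j] for j in range(n)) for i in range(n)]


# Hypothetical one-dimensional training data.
xs = [0.0, 0.5, 2.0, 2.5]
ys = [1.0, 1.0, -1.0, -1.0]
K = [[rbf_kernel(a, b) for b in xs] for a in xs]
lam, step = 0.1, 0.1

alpha = [0.0] * len(xs)
initial_value = objective(alpha, K, ys, lam)
for _ in range(200):
    grad = gradient(alpha, K, ys, lam)
    alpha = [a - step * g for a, g in zip(alpha, grad)]
final_value = objective(alpha, K, ys, lam)
print(initial_value, final_value, alpha)
```

The point of the sketch is only that, thanks to (27), the search over the possibly infinite-dimensional space $H$ reduces to a search over $n$ real coefficients.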
Another nice feature is that Lipschitz continuous loss functions are trivially Nemitski loss functions for all probability measures on $\mathcal{X} \times \mathcal{Y}$, because
\[ L(x, y, t) = L(x, y, 0) + L(x, y, t) - L(x, y, 0) \le b(x, y) + |L|_1 |t|, \]
where $b(x, y) := L(x, y, 0)$ for $(x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R}$ and $|L|_1 \in (0, \infty)$. Furthermore, Lipschitz continuous losses are P-integrable if $\mathcal{R}_{L,P}(0)$ is finite (Steinwart and Christmann, 2008b, p. 31).
Finally, we would like to mention that, although the least squares loss is not Lipschitz continuous, extensive research on the least squares support
vector machine (LS-SVM) has been conducted, see, e.g., Suykens et al. (2002b). However, in this case sparseness is lost and this method is not robust. To solve these shortcomings, a weighted LS-SVM has been proposed (Suykens et al., 2002a).
Key learnings
• The choice of the loss depends on the optimization problem.
• Convex and Lipschitz continuous losses are preferred for SVMs.
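The six losses of this section translate directly into code. The sketch below implements them as functions of $(y, t)$ with the parameter values used in Fig. 4 and estimates the largest slope in $t$ on a grid. The Lipschitz constants listed in the dictionary are our own reading of the definitions (1 for the $\varepsilon$-insensitive, hinge and both logistic losses, $\max\{\tau, 1 - \tau\}$ for the pinball loss, $c$ for Huber's loss) and the grid check is a plausibility test, not a proof.

```python
import math


def eps_insensitive(y, t, eps=0.5):
    return max(abs(y - t) - eps, 0.0)


def pinball(y, t, tau=0.7):
    r = y - t
    return tau * r if r >= 0 else (tau - 1.0) * r


def logistic_regression(y, t):
    r = y - t
    return -math.log(4.0 * math.exp(r) / (1.0 + math.exp(r)) ** 2)


def huber(y, t, c=0.5):
    r = abs(y - t)
    return 0.5 * r * r if r <= c else c * r - c * c / 2.0


def hinge(y, t):
    return max(0.0, 1.0 - y * t)


def logistic_classification(y, t):
    return math.log(1.0 + math.exp(-y * t))


def max_slope(loss, y, ts):
    """Largest absolute difference quotient in t between neighbouring grid points."""
    return max(abs(loss(y, b) - loss(y, a)) / (b - a) for a, b in zip(ts, ts[1:]))


ts = [-3.0 + 0.25 * i for i in range(25)]  # grid on [-3, 3]

# (loss, fixed y used for the check, assumed Lipschitz constant)
losses = {
    "eps-insensitive": (eps_insensitive, 0.0, 1.0),
    "pinball": (pinball, 0.0, 0.7),
    "logistic regression": (logistic_regression, 0.0, 1.0),
    "Huber": (huber, 0.0, 0.5),
    "hinge": (hinge, 1.0, 1.0),
    "logistic classification": (logistic_classification, 1.0, 1.0),
}
slopes = {name: max_slope(loss, y, ts) for name, (loss, y, _) in losses.items()}
for name, slope in slopes.items():
    print(f"{name}: observed slope {slope:.4f}, assumed constant {losses[name][2]}")
```

By contrast, the same check applied to the least squares loss $(y - t)^2$ yields slopes that grow with the width of the grid, in line with the remark that $L_{LS}$ is not Lipschitz continuous.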
3.7 Bouligand-derivatives of loss functions
Let us now take a look at the B-derivative of some loss functions, which will be needed for the robustness results presented in Section 5.2. These calculations were first performed by Christmann and Van Messem (2008). Since both logistic losses $L_{c\text{-log}}$ and $L_{r\text{-log}}$ are F-differentiable, there is no need to calculate their partial B-derivative because it will be equal to the partial F-derivative. We will, however, illustrate this property by means of the least squares loss $L_{LS}$.
Least squares loss
Since the function $L(x, y, t) = L_{LS}(x, y, t)$ is known to be F-differentiable, the B-derivative should be the same: $\nabla^B_3 L(x, y, t)(h) = -2(y - t)h$ and $\nabla^B_{3,3} L(x, y, t)(h) = 2h$. Hence, as expected, $|\nabla^B_3 L(x, y, t)|$ is unbounded, whereas $\nabla^B_{3,3} L(x, y, t) = 2$. The first-order B-derivative is
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = L(x, y, t + h) - L(x, y, t) = (y - t - h)^2 - (y - t)^2 = 2th - 2yh + h^2 = -2h(y - t) + h^2, \]
and therefore $\nabla^B_3 L(x, y, t) = -2(y - t) = \nabla^F_3 L(x, y, t)$. In the same way, we find the second-order B-derivative:
\[ \nabla^B_{3,3} L(x, y, t)(h) + o(h) = \nabla^B_3 L(x, y, t + h) - \nabla^B_3 L(x, y, t) = -2(y - t - h) - (-2(y - t)) = 2h . \]
So $\nabla^B_{3,3} L(x, y, t) = 2$, which also corresponds to the second-order partial F-derivative.
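The least squares calculation can be mirrored numerically. A minimal sketch, assuming that for the real-valued losses considered here the B-derivative in direction $h$ can be approximated by the one-sided difference quotient $(L(y, t + sh) - L(y, t))/s$ for a small $s > 0$; the helper name and the step size are our own choices.

```python
def least_squares(y, t):
    return (y - t) ** 2


def one_sided_quotient(loss, y, t, h, s=1e-6):
    """Approximates the directional change of loss(y, .) at t in direction h, with s > 0 small."""
    return (loss(y, t + s * h) - loss(y, t)) / s


y, t, h = 1.5, 0.5, 2.0
approx = one_sided_quotient(least_squares, y, t, h)
exact = -2.0 * (y - t) * h  # first-order B-derivative derived above
print(approx, exact)
```

The two values differ only by the term $s h^2$, which is the $h^2$ remainder of the expansion scaled by the step size.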
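$\varepsilon$-insensitive loss
We shall show for $L(x, y, t) = L_\varepsilon(x, y, t)$ that
\[ \nabla^B_3 L(x, y, t)(h) = \begin{cases} -h & \text{if } \{t < y - \varepsilon\} \text{ or } \{y - t = \varepsilon,\ h < 0\}, \\ 0 & \text{if } \{y - \varepsilon < t < y + \varepsilon\} \text{ or } \{y - t = \varepsilon,\ h \ge 0\} \text{ or } \{y - t = -\varepsilon,\ h < 0\}, \\ h & \text{if } \{t > y + \varepsilon\} \text{ or } \{y - t = -\varepsilon,\ h \ge 0\}, \end{cases} \]
and $\nabla^B_{3,3} L(x, y, t)(h) = 0$. Thus both the first and second-order partial B-derivative are uniformly bounded: $|\nabla^B_3 L(x, y, t)| \le 1$ and $|\nabla^B_{3,3} L(x, y, t)| = 0$. For the derivation of $\nabla^B_3 L(x, y, t)$ we need to consider five cases.
1. If $t > y + \varepsilon$, we have $t + h > y + \varepsilon$ as long as $h$ is small enough. Therefore,
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = L(x, y, t + h) - L(x, y, t) = t + h - y - \varepsilon - (t - y - \varepsilon) = h . \]
2. If $t < y - \varepsilon$, we have $t + h < y - \varepsilon$ if $h$ is sufficiently small. Thus
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = y - t - h - \varepsilon - (y - t - \varepsilon) = -h . \]
3. If $y - t \in (-\varepsilon, \varepsilon)$ we have $y - t - h \in (-\varepsilon, \varepsilon)$ for $h \to 0$. This yields $\nabla^B_3 L(x, y, t)(h) + o(h) = 0 - 0 = 0$.
4. If $y - t = \varepsilon$ we have to consider two cases. If $h \ge 0$ and small, then $-\varepsilon < y - t - h < \varepsilon$ and hence $\nabla^B_3 L(x, y, t)(h) + o(h) = 0 - 0 = 0$. If $h < 0$, we have $y - t - h > \varepsilon$ and thus
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = y - t - h - \varepsilon - 0 = -h . \]
5. If $y - t = -\varepsilon$ we have again to consider two cases. If $h \ge 0$, we have $y - t - h < -\varepsilon$. Hence
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = t + h - y - \varepsilon - 0 = h . \]
If $h < 0$, we get $-\varepsilon < y - t - h < \varepsilon$ which gives $\nabla^B_3 L(x, y, t)(h) + o(h) = 0 - 0 = 0$.
This gives the assertion for the first partial B-derivative. Using the same reasoning we obtain $\nabla^B_{3,3} L(x, y, t)(h) = 0$.
Pinball loss
For $L(x, y, t) = L_{\tau\text{-pin}}(x, y, t)$ we will get
\[ \nabla^B_3 L(x, y, t)(h) = \begin{cases} (1 - \tau) h & \text{if } \{y - t < 0\} \text{ or } \{y - t = 0,\ h \ge 0\}, \\ -\tau h & \text{if } \{y - t > 0\} \text{ or } \{y - t = 0,\ h < 0\}, \end{cases} \]
and $\nabla^B_{3,3} L(x, y, t)(h) = 0$. This means that $|\nabla^B_3 L(x, y, t)| \le \max\{1 - \tau, \tau\}$ and $|\nabla^B_{3,3} L(x, y, t)| = 0$.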
For the calculation of $\nabla^B_3 L(x, y, t)$ we consider three cases.
1. If $y - t < 0$ we have $y - t - h < 0$ for sufficiently small values of $|h|$. Hence
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = L(x, y, t + h) - L(x, y, t) = (\tau - 1)(y - t - h) - (\tau - 1)(y - t) = (1 - \tau) h . \]
2. If $y - t > 0$ we have $y - t - h > 0$ for sufficiently small values of $|h|$ which yields
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = \tau (y - t - h) - \tau (y - t) = -\tau h . \]
3. Assume $y - t = 0$. If $y - t - h < 0$ we have $\nabla^B_3 L(x, y, t)(h) + o(h) = (1 - \tau) h$. If $y - t - h > 0$ it follows that
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = \tau (y - t - h) - \tau (y - t) = -\tau h . \]
Together this gives the assertion for $\nabla^B_3 L(x, y, t)(h)$. In the same way we get $\nabla^B_{3,3} L(x, y, t)(h) = 0$.
Huber loss
It will be shown that for $L(x, y, t) = L_{c\text{-Huber}}(x, y, t)$ we have
\[ \nabla^B_3 L(x, y, t)(h) = \begin{cases} -c \, \mathrm{sign}(y - t)\, h & \text{if } |y - t| > c, \\ -(y - t)\, h & \text{if } |y - t| \le c, \end{cases} \]
and
\[ \nabla^B_{3,3} L(x, y, t)(h) = \begin{cases} h & \text{if } \{y - t = c,\ h \ge 0\} \text{ or } \{y - t = -c,\ h < 0\} \text{ or } \{|y - t| < c\}, \\ 0 & \text{else.} \end{cases} \]
Hence, $|\nabla^B_3 L(x, y, t)| \le c$ and $|\nabla^B_{3,3} L(x, y, t)| \le 1$. For the derivation of $\nabla^B_3 L(x, y, t)$ we consider the following five cases.
1. Let $y - t = c$. If $h \ge 0$, i.e., $y - t - h \le c$, then
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = L(x, y, t + h) - L(x, y, t) = \tfrac{1}{2}(y - t - h)^2 - \tfrac{1}{2}(y - t)^2 = -(y - t) h + \frac{h^2}{2} . \]
If $h < 0$, i.e., $y - t - h > c > 0$, we have
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = c |y - t - h| - \frac{c^2}{2} - \frac{1}{2}(y - t)^2 = c (y - t - h) - \frac{c^2}{2} - \frac{c^2}{2} = c (c - h) - c^2 = -(y - t) h . \]
2. Now we consider the case $y - t = -c$. If $h \ge 0$, i.e., $y - t - h \le -c < 0$, we obtain
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = c |y - t - h| - \frac{c^2}{2} - \frac{1}{2}(y - t)^2 = c (c + h) - \frac{c^2}{2} - \frac{c^2}{2} = -(y - t) h . \]
If $h < 0$, i.e., $y - t - h > -c$, we get
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = \tfrac{1}{2}(y - t - h)^2 - \tfrac{1}{2}(y - t)^2 = -(y - t) h + \frac{h^2}{2} . \]
3. If $y - t > c$, we have $y - t - h > c$ and thus
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = c |y - t - h| - \frac{c^2}{2} - c |y - t| + \frac{c^2}{2} = c (y - t - h) - c (y - t) = -c h = -c \, \mathrm{sign}(y - t)\, h . \]
4. If $y - t < -c$, we have $y - t - h < -c$ and obtain analogously to the previous calculation that
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = c |y - t - h| - \frac{c^2}{2} - c |y - t| + \frac{c^2}{2} = c (-y + t + h) - c (-y + t) = c h = -c \, \mathrm{sign}(y - t)\, h . \]
5. If $-c < y - t < c$, then $-c < y - t - h < c$ and
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = \tfrac{1}{2}(y - t - h)^2 - \tfrac{1}{2}(y - t)^2 = -(y - t) h + \frac{h^2}{2} . \]
This gives the assertion for $\nabla^B_3 L(x, y, t)(h)$. Only the first two cases, where $y - t = \pm c$, were necessary to compute, since in the other three cases the function is already F-differentiable, and thus also B-differentiable.
For the second partial B-derivative we consider three cases.
1. Assume $y - t = c$. If $y - t - h < c$ then
\[ \nabla^B_{3,3} L(x, y, t)(h) + o(h) = \nabla^B_3 L(x, y, t + h) - \nabla^B_3 L(x, y, t) = -(y - t - h) - (-(y - t)) = h . \]
If $y - t - h > c$ then $\nabla^B_{3,3} L(x, y, t)(h) + o(h) = -c - (-(y - t)) = 0$.
2. Assume $y - t = -c$. If $y - t - h < -c$ we obtain $\nabla^B_{3,3} L(x, y, t)(h) + o(h) = c - (-(y - t)) = 0$. If $y - t - h > -c$ then
\[ \nabla^B_{3,3} L(x, y, t)(h) + o(h) = -(y - t - h) - (-(y - t)) = h . \]
3. Assume that $|y - t| \neq c$. If $|y - t| > c$, then $\nabla^B_3 L(x, y, t + h) = \nabla^B_3 L(x, y, t)$ for sufficiently small $|h|$, so the difference vanishes and consequently $\nabla^B_{3,3} L(x, y, t)(h) = 0$. If $|y - t| < c$, then $\nabla^B_3 L(x, y, t + h) - \nabla^B_3 L(x, y, t) = -(y - t - h) + (y - t) = h$, in agreement with the formula stated above.
This gives the assertion for Huber's loss function.
Hinge loss
Finally, we will show that for $L(x, y, t) = L_{hinge}(x, y, t)$ we obtain the following partial B-derivatives:
\[ \nabla^B_3 L(x, y, t)(h) = \begin{cases} -y h & \text{if } \{yt < 1\} \text{ or } \{y = t = 1,\ h < 0\} \text{ or } \{y = t = -1,\ h \ge 0\}, \\ 0 & \text{if } \{yt > 1\} \text{ or } \{y = t = 1,\ h \ge 0\} \text{ or } \{y = t = -1,\ h < 0\}, \end{cases} \]
and $\nabla^B_{3,3} L(x, y, t)(h) = 0$. Clearly, the first partial B-derivative is unbounded. For the first derivative, we need to distinguish three cases:
1. If $yt < 1$ then for $h$ small enough, we also have $y(t + h) < 1$. Thus
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = L(x, y, t + h) - L(x, y, t) = 1 - y(t + h) - (1 - yt) = -y h . \]
2. Likewise if $yt > 1$ then also $y(t + h) > 1$ if $h$ is sufficiently small and so
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = L(x, y, t + h) - L(x, y, t) = 0 - 0 = 0 . \]
3. The third case is where $yt = 1$. Here we have to consider two cases. If $y(t + h) > 1$ we have $\nabla^B_3 L(x, y, t)(h) + o(h) = 0 - 0 = 0$. Or else, for $y(t + h) < 1$, we obtain
\[ \nabla^B_3 L(x, y, t)(h) + o(h) = 1 - y(t + h) - 0 = -y h, \]
which gives us the assertion. Following the same reasoning, it is clear that $\nabla^B_{3,3} L(x, y, t)(h) = 0$.
Key learnings
• If the Fréchet-derivative exists, the Bouligand-derivative exists and both are equal.
• The Bouligand-derivative exists also for nonsmooth losses.
• The first and second-order partial Bouligand-derivatives of $L_\varepsilon$, $L_{\tau\text{-pin}}$, and $L_{c\text{-Huber}}$ are uniformly bounded.
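The case distinctions of this section can be cross-checked against one-sided difference quotients at the kinks, where the losses are not F-differentiable. The sketch below encodes the first-order B-derivatives stated above for the $\varepsilon$-insensitive, pinball, Huber and hinge loss; the function names, the test points and the step size are our own choices, and the quotient is only an approximation of the derivative.

```python
EPS, TAU, C = 0.5, 0.7, 0.5  # parameter values as in Fig. 4


def eps_insensitive(y, t):
    return max(abs(y - t) - EPS, 0.0)


def pinball(y, t):
    r = y - t
    return TAU * r if r >= 0 else (TAU - 1.0) * r


def huber(y, t):
    r = abs(y - t)
    return 0.5 * r * r if r <= C else C * r - C * C / 2.0


def hinge(y, t):
    return max(0.0, 1.0 - y * t)


def bd_eps_insensitive(y, t, h):
    r = y - t
    if r > EPS or (r == EPS and h < 0):
        return -h
    if r < -EPS or (r == -EPS and h >= 0):
        return h
    return 0.0


def bd_pinball(y, t, h):
    r = y - t
    return (1.0 - TAU) * h if r < 0 or (r == 0 and h >= 0) else -TAU * h


def bd_huber(y, t, h):
    r = y - t
    if abs(r) > C:
        return -C * h if r > 0 else C * h  # -c * sign(y - t) * h
    return -r * h


def bd_hinge(y, t, h):
    m = y * t
    return -y * h if m < 1 or (m == 1 and y * h < 0) else 0.0


def one_sided_quotient(loss, y, t, h, s=1e-6):
    return (loss(y, t + s * h) - loss(y, t)) / s


# (loss, stated B-derivative, y, t), with t chosen at a kink of the loss.
checks = [
    (eps_insensitive, bd_eps_insensitive, 1.0, 0.5),  # y - t = eps
    (eps_insensitive, bd_eps_insensitive, 0.0, 0.5),  # y - t = -eps
    (pinball, bd_pinball, 0.0, 0.0),                  # y - t = 0
    (huber, bd_huber, 1.0, 0.5),                      # y - t = c
    (hinge, bd_hinge, 1.0, 1.0),                      # y = t = 1
    (hinge, bd_hinge, -1.0, -1.0),                    # y = t = -1
]
max_error = max(
    abs(one_sided_quotient(loss, y, t, h) - bd(y, t, h))
    for loss, bd, y, t in checks
    for h in (1.0, -1.0)
)
# At a kink the B-derivative is positive homogeneous in h but not linear:
sum_of_opposite_directions = bd_eps_insensitive(1.0, 0.5, 1.0) + bd_eps_insensitive(1.0, 0.5, -1.0)
print(max_error, sum_of_opposite_directions)
```

The nonzero sum in the last line shows why the B-derivative is the right tool here: at $y - t = \varepsilon$ the directions $h$ and $-h$ do not give opposite values, so no linear (Fréchet) derivative exists.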
3.8 Shifting the loss function
In this section we describe the approach of SVMs based on shifted loss functions. We will later show that, by considering those shifted loss functions, we can enlarge the applicability of SVMs to heavy-tailed distributions. We first explain why SVMs for response variables with heavy tails need careful consideration and discuss SVMs based on shifted loss functions. Subsequently, we take a look at some properties of such SVMs.
The results for SVMs based on shifted loss functions are mainly meant for regression purposes. Of course, the case of classification where the output space $\mathcal{Y}$ is just a set of a finite number of real numbers is a special case and thus classification is covered by the results we will show. The problem of heavy tails is however not present in classification because the conditional distribution of the response variable $Y$ given $x$ then obviously has a bounded support.
Assume we can split the probability measure $P$ on $\mathcal{X} \times \mathcal{Y}$ into the marginal distribution $P_X$ on $\mathcal{X}$ and the conditional probability $P(y|x)$ on $\mathcal{Y}$. This is, e.g., possible when $\mathcal{Y} \subset \mathbb{R}$ is closed (because then $\mathcal{Y}$ becomes a complete separable metric space, and therefore a Polish space), and $\mathcal{X}$ is a measurable space, see Theorem 1. Furthermore, let $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a Lipschitz continuous loss function. Keeping in mind that $L(X, Y, Y) = 0$, we obtain the following inequality for the L-risk:
\[ \begin{aligned} \mathcal{R}_{L,P}(f) &= E_P \big( L(X, Y, f(X)) - L(X, Y, Y) \big) \\ &= \int_{\mathcal{X}} \int_{\mathcal{Y}} L(x, y, f(x)) - L(x, y, y) \, dP(y|x) \, dP_X(x) \\ &\le |L|_1 \int_{\mathcal{X}} \int_{\mathcal{Y}} |f(x) - y| \, dP(y|x) \, dP_X(x) \\ &\le |L|_1 \int_{\mathcal{X}} |f(x)| \, dP_X(x) + |L|_1 \int_{\mathcal{X}} \int_{\mathcal{Y}} |y| \, dP(y|x) \, dP_X(x), \end{aligned} \tag{28} \]
which is finite if $f \in L_1(P_X)$, i.e., $f$ is integrable with respect to the marginal distribution $P_X$, and the first absolute moment is finite:
\[ E_P |Y| = \int_{\mathcal{X}} \int_{\mathcal{Y}} |y| \, dP(y|x) \, dP_X(x) < \infty . \tag{29} \]
SVMs are consistent and robust for both classification and regression purposes if they are based on a Lipschitz continuous loss and a bounded kernel with a separable RKHS that is dense in L1(μ) for all distributions μ (Christmann and Steinwart, 2007, 2008; Steinwart and Christmann, 2008b). These properties even hold true in the regression context for unbounded output spaces, if the target function f is integrable with respect to the marginal distribution of the input variable X and if the output variable Y has a finite first absolute moment, i.e., if the moment condition (29) holds. However, the latter assumption excludes
distributions with heavy tails, such as several stable distributions, including the Cauchy distribution, or some extreme value distributions which occur in, e.g., financial or actuarial problems.
Nevertheless, the applicability of SVMs can be enlarged to such distributions which violate the moment condition $E_P |Y| < \infty$. The applied approach is based on a trick well-known in the literature on robust statistics, see, e.g., Huber (1967) for an early use of this trick on M-estimators without regularization term. The trick consists of shifting the loss $L(x, y, t)$ downwards by an amount of $L(x, y, 0) \in [0, \infty)$. The function $L^\star : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ defined by
\[ L^\star(x, y, t) := L(x, y, t) - L(x, y, 0) \]
will be referred to as the shifted loss function or the shifted version of $L$. Using this definition, we obtain for the $L^\star$-risk of $f$ that, for all $f \in L_1(P_X)$,
\[ \begin{aligned} \mathcal{R}_{L^\star,P}(f) &= E_P L^\star(X, Y, f(X)) = E_P \big( L(X, Y, f(X)) - L(X, Y, 0) \big) \\ &\le \int_{\mathcal{X} \times \mathcal{Y}} |L(x, y, f(x)) - L(x, y, 0)| \, dP(x, y) \\ &\le |L|_1 \int_{\mathcal{X}} |f(x)| \, dP_X(x) < \infty . \end{aligned} \tag{30} \]
Therefore, for $f \in L_1(P_X)$, the $L^\star$-risk is finite, no matter whether the moment condition (29) is fulfilled or not. By relaxing the finiteness of the risk through this "$L^\star$-trick", Christmann et al. (2009) showed that many important results on the SVM $f_{L,P,\lambda}$, such as existence, uniqueness, consistency, and statistical robustness also hold for
\[ f_{L^\star,P,\lambda} := \arg\inf_{f \in H} \mathcal{R}_{L^\star,P}(f) + \lambda \|f\|_H^2, \]
the SVM solution of the shifted problem. Moreover, if $f_{L,P,\lambda}$ exists, then $f_{L^\star,P,\lambda} = f_{L,P,\lambda}$ and hence the existing algorithms can be used to compute $f_{L^\star,D,\lambda}$ because the empirical SVM $f_{L,D,\lambda}$ exists for all data sets $D = ((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$. The advantage of $f_{L^\star,P,\lambda}$ over $f_{L,P,\lambda}$ is that $f_{L^\star,P,\lambda}$ is still well-defined and useful for heavy-tailed conditional distributions $P(y|x)$, for which the first absolute moment $\int_{\mathcal{Y}} |y| \, dP(y|x)$ is infinite. In particular, even in the case of heavy-tailed distributions, the forecasts $f_{L^\star,D,\lambda}(x) = f_{L,D,\lambda}(x)$ are consistent and robust with respect to the (Bouligand) influence function, if the kernel is bounded and a Lipschitz continuous loss function is used (Christmann et al., 2009). For instance, the combination of the Gaussian RBF kernel with the $\varepsilon$-insensitive loss function or Huber's loss function for regression purposes or with the pinball loss for quantile regression yields SVMs with good consistency and robustness properties.
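The effect of the shift can be illustrated on a toy example with exact rational arithmetic. The following sketch is a deliberately crude stand-in for the SVM setting, and every modelling choice in it is our own assumption: a single real number $t$ plays the role of the prediction $f(x)$, the penalty $\lambda t^2$ plays the role of $\lambda \|f\|_H^2$, and two very large responses mimic heavy tails (a finite sample cannot, of course, have an infinite moment).

```python
from fractions import Fraction as F

EPS = F(1, 2)  # epsilon of the eps-insensitive loss, which has Lipschitz constant 1
LAM = F(1, 10)


def eps_insensitive(y, t):
    return max(abs(y - t) - EPS, F(0))


def shifted(loss):
    """Return the shifted loss L*(y, t) = L(y, t) - L(y, 0)."""
    def shifted_loss(y, t):
        return loss(y, t) - loss(y, F(0))
    return shifted_loss


L = eps_insensitive
L_star = shifted(L)

# Hypothetical responses; the last two mimic draws from a heavy-tailed distribution.
ys = [F(1, 5), F(-2, 5), F(1), F(10**9), F(-3 * 10**8)]
candidates = [F(i, 10) for i in range(-20, 21)]  # candidate predictions t in [-2, 2]


def empirical_risk(loss, t):
    return sum(loss(y, t) for y in ys) / len(ys)


def objective(loss, t):
    return empirical_risk(loss, t) + LAM * t * t


best_L = min(candidates, key=lambda t: objective(L, t))
best_L_star = min(candidates, key=lambda t: objective(L_star, t))

# |L*(y, t)| <= |L|_1 * |t| holds pointwise, however extreme y is.
bound_holds = all(abs(L_star(y, t)) <= abs(t) for y in ys for t in candidates)

print(float(empirical_risk(L, F(1))), float(empirical_risk(L_star, F(1))))
print(best_L, best_L_star, bound_holds)
```

The unshifted risk is dominated by the two extreme responses, while the shifted risk stays bounded by $|t|$; since both objectives differ only by a term that does not depend on $t$, they share the same minimizer, which mirrors the statement $f_{L^\star,D,\lambda} = f_{L,D,\lambda}$.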
Let us now state some general facts on the function $L^\star$ which are needed to obtain the robustness results in Section 5.2. The general assumptions for the rest of this chapter are summarized in
Assumption 1. Let $n \in \mathbb{N}$, $\mathcal{X}$ be a complete separable metric space (e.g., a closed $\mathcal{X} \subset \mathbb{R}^d$ or $\mathcal{X} = \mathbb{R}^d$), $\mathcal{Y} \subset \mathbb{R}$ be a nonempty and closed set (e.g., $\mathcal{Y} = \mathbb{R}$ or $\mathcal{Y} = \{-1, +1\}$ for regression or classification, respectively), and $P$ be a probability distribution on $\mathcal{X} \times \mathcal{Y}$ equipped with its Borel $\sigma$-algebra. Since $\mathcal{Y}$ is closed, $P$ can be split into the marginal distribution $P_X$ on $\mathcal{X}$ and the conditional probability $P(y|x)$ on $\mathcal{Y}$. Let $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a loss function and $L^\star : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ its shifted loss function defined by
\[ L^\star(x, y, t) := L(x, y, t) - L(x, y, 0) . \]
We say that $L$ (or $L^\star$) is convex, Lipschitz continuous, continuous or differentiable, if $L$ (or $L^\star$) has this property with respect to its third argument. If not otherwise mentioned, $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a measurable kernel with reproducing kernel Hilbert space $H$ of measurable functions $f : \mathcal{X} \to \mathbb{R}$, and $\Phi : \mathcal{X} \to H$ denotes the canonical feature map, i.e., $\Phi(x) := k(\cdot, x)$ for $x \in \mathcal{X}$.
Since these assumptions are independent of the data set, they can effectively be checked. The reason the value $L(x, y, 0)$ is used by the $L^\star$-trick is simply that the zero function $f(x) = 0$, for all $x \in \mathcal{X}$, is always an element of the RKHS $H$, whereas for other (constant) functions this might not be the case.
Because $L(x, y, t) \in [0, \infty)$, the definition implies that $-\infty < L^\star(x, y, t) < \infty$, and thus $L^\star$ is no longer a nonnegative loss. Since for all practical purposes negative losses do not make sense, there is no intuitive interpretation of the $L^\star$-function. The shift is merely a mathematical trick to enlarge the domain on which SVMs are defined. In practice, this means that for all data sets the SVM based on $L$ and the SVM based on $L^\star$ will yield identical results: the trick only shifts the objective function, but not the value where this function is minimal.
It is easy to verify that $L^\star$ is (strictly) convex if and only if $L$ is (strictly) convex, and that $L^\star$ is Lipschitz continuous if and only if $L$ is Lipschitz continuous. Furthermore, both Lipschitz constants are equal, i.e., $|L|_1 = |L^\star|_1$. Due to the strict convexity of the mapping $f \mapsto \lambda \|f\|_H^2$, $f \in H$, this also implies that $L^\star(x, y, \cdot) + \lambda \|\cdot\|_H^2$ is a strictly convex function if $L$ is convex. Therefore, for a convex loss $L$, the mapping
\[ f \mapsto \mathcal{R}_{L^\star,P}(f) + \lambda \|f\|_H^2, \quad f \in H, \]
is a strictly convex function as it is the sum of the convex risk functional $\mathcal{R}_{L^\star,P}$ and the strictly convex mapping $f \mapsto \lambda \|f\|_H^2$. Hence, a unique solution will exist.
Note, however, that for $L$ a distance-based loss, $L^\star$ will not necessarily share this property, as can be easily demonstrated: for $L_{LS}(x, y, t) = (y - t)^2$ we obtain $L^\star(x, y, t) = (y - t)^2 - (y - 0)^2 = t(t - 2y)$, which clearly cannot be written as a function in $y - t$ only.
Assume that the partial F-derivative of a loss $L$ exists for $(x, y) \in \mathcal{X} \times \mathcal{Y}$, then
\[ \nabla^F_3 L^\star(x, y, t) = \lim_{h \to 0,\, h \neq 0} \frac{L^\star(x, y, t + h) - L^\star(x, y, t)}{h} = \lim_{h \to 0,\, h \neq 0} \frac{L(x, y, t + h) - L(x, y, t)}{h} = \nabla^F_3 L(x, y, t) . \]
Hence $\nabla^F_3 L^\star(x, y, t) = \nabla^F_3 L(x, y, t)$ for all $t \in \mathbb{R}$. An analogous calculation is valid for the B-derivative because the term $L(x, y, 0)$ cancels out in the definition of the B-derivative and we obtain
\[ \nabla^B_3 L^\star(x, y, t) = \nabla^B_3 L(x, y, t), \quad \forall\, t \in \mathbb{R} . \tag{31} \]
Therefore, both the F-derivative and the B-derivative remain unchanged under a shift of the loss function.
Key message
• Shifting the loss function extends results to heavy-tailed and extreme value distributions by avoiding a moment condition.
• This trick is an essential step for considering statistical robustness because it allows the operator to be defined on all probability measures.
4 (†) An introduction to robustness and influence functions
A strong argument in favor of the use of SVMs is that they are L-risk consistent under weak assumptions (see Section 5.1), which means that SVMs are able to "learn". It is, nevertheless, equally important to investigate the robustness properties of a statistical learning method. More often than not, statistical models are nothing more than approximations of the true random process that generated the given data set. Hence it is only natural to look at the impact that deviations between the statistical model and the true data generating process may have on the results. J.W. Tukey, one of the pioneers of robust statistics, remarked as early as 1960 (Hampel et al., 1986, p. 21): "A tacit hope in ignoring deviations from ideal models was that they would not matter; that statistical procedures which were optimal under the strict model would still be approximately optimal under the approximate model. Unfortunately, it turned out that this hope was often drastically wrong; even mild deviations often have much larger effects than were anticipated by most statisticians."
This means that, in theory, statistical procedures were expected to work well given a certain set of data points, even if errors, such as extreme values or outliers, had crept into the data set. However, in reality, it turned out that small errors might cause huge disturbances in the prediction function, making it virtually useless for practical use. Exactly for this reason, robust methods were developed. These methods search for models that fit the majority of the data and hence will still perform (reasonably) well, even in the presence of bad data points.
Let us consider a statistic $T(P) := f_{L,P,\lambda}$, in our case the SVM, with $P$ a probability measure, as a mapping $T : P \mapsto f_{L,P,\lambda}$. In robust statistics, smooth and bounded functions $T$ are preferred, since these will give stable regularized risks within small neighborhoods of $P$. If an appropriately chosen derivative $\nabla T(P)$ of $T(P)$ is bounded, then the function $T(P)$ cannot increase or decrease without bound in small neighborhoods of $P$. Hence we expect that the value of $T(Q)$ will be close to the value of $T(P)$ for a (perturbing) distribution $Q$ in a small neighborhood of $P$. In other words, small deviations can never have an unlimited influence on the statistic.
Several notions of differentiability have been used for this purpose. From a mathematical point of view, the classical F-derivative would probably be the most suitable notion to use when investigating robustness properties. Unfortunately, a good deal of interesting statistical methods are not F-differentiable, since this is a rather strict concept. Therefore, approaches using weaker notions of differentiability have been introduced, most notably G-differentiability, but also H-differentiability and more recently B-differentiability. It was exactly G-differentiability that inspired Hampel to introduce the influence function, a widespread method to check robustness (Hampel, 1968, 1974).
The influence function is an even weaker concept than G-differentiability, because it is defined as a G-derivative, but without the assumption of linearity. The application of these weaker concepts is especially important if the estimates use nonsmooth functions, such as some of the loss functions used by SVMs, see Section 3.6. Of course, other notions to verify robustness are also in use. Some examples are the breakdown point of an estimator, its maxbias, or its sensitivity curve. For more detail on the theory of robustness, we refer to Hampel et al. (1986), Huber (1981), and Maronna et al. (2006). Let us formally define the influence function.

Definition 11. Let $M_1$ be the set of probability distributions on a measurable space $(\mathcal{Z}, \mathcal{B}(\mathcal{Z}))$ and let $H$ be a reproducing kernel Hilbert space. The influence function (IF) of $T : M_1 \to H$ at a point $z \in \mathcal{Z}$ for a distribution $P$ is defined as

$$\mathrm{IF}(z; T, P) = \lim_{\varepsilon \downarrow 0} \frac{T\big((1-\varepsilon)P + \varepsilon\delta_z\big) - T(P)}{\varepsilon}, \tag{32}$$

if the limit exists.
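The difference quotient in (32) can be explored numerically before taking the limit. The following is a minimal sketch, not part of the original text: it assumes an empirical distribution $P$ on a small grid of real numbers, uses the mean and the median as simple stand-ins for the statistic $T$ (they are not SVMs), and evaluates the quotient at a fixed small $\varepsilon$ instead of the limit. All function names are hypothetical.

```python
import numpy as np

def weighted_mean(points, weights):
    """Mean functional T(P) = E_P[X] for a discrete distribution."""
    return float(np.sum(points * weights))

def weighted_median(points, weights):
    """Median functional: smallest point whose cumulative mass reaches 1/2."""
    order = np.argsort(points)
    cum = np.cumsum(weights[order])
    idx = int(np.searchsorted(cum, 0.5))
    return float(points[order][idx])

def finite_eps_influence(stat, sample, z, eps=0.01):
    """Quotient (T((1-eps)P + eps*delta_z) - T(P)) / eps for the empirical P."""
    n = len(sample)
    base = np.full(n, 1.0 / n)
    t_p = stat(sample, base)
    # Contaminated distribution: shrink the original weights, add mass eps at z.
    t_mix = stat(np.append(sample, z), np.append((1.0 - eps) * base, eps))
    return (t_mix - t_p) / eps

sample = np.linspace(-1.0, 1.0, 101)
if_mean = {z: finite_eps_influence(weighted_mean, sample, z) for z in (10.0, 1e6)}
if_median = {z: finite_eps_influence(weighted_median, sample, z) for z in (10.0, 1e6)}
print("mean:  ", if_mean)
print("median:", if_median)
```

In this setup the quotient for the mean grows linearly with the position $z$ of the contamination, whereas the quotient for the median is the same for both values of $z$, which is the finite-sample counterpart of an unbounded versus a bounded influence function.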
Within this approach, robust estimators are those for which the influence function is bounded, because in that case any deviation caused by $\delta_z$ only has a limited impact on the estimator. The influence function is neither supposed to be linear nor continuous. If the influence function exists for all points $z \in \mathcal{Z}$ and if it is continuous and linear, then the IF is a special G-derivative. Christmann and Steinwart (2004, 2007) and Steinwart and Christmann (2008a) showed that SVMs have a bounded influence function in binary classification and in regression problems provided that the kernel is bounded and continuous, $L$ is twice F-differentiable, and the first and second F-derivatives of $L$ are bounded. Hence Lipschitz continuous loss functions are of special interest from a robustness point of view. An example of a loss function with these properties is the logistic loss for regression. However, the important special cases $L_{\epsilon}$, $L_{\tau\text{-pin}}$, and $L_{c\text{-Huber}}$ are excluded in these results, because these loss functions are not everywhere (twice) F-differentiable. To fill this gap, Christmann and Van Messem (2008) proposed an alternative to the classical influence function based on B-derivatives, which they called the Bouligand influence function. Since B-derivatives are only supposed to be positive homogeneous instead of linear, the Bouligand influence function allowed them to extend the robustness result on SVMs also to those SVMs that are based on nonsmooth loss functions, by showing that a broad class of support vector machines based on a Lipschitz continuous, but not necessarily F-differentiable, loss function are robust in the sense of having a bounded Bouligand influence function.

Definition 12. Let $M_1$ be the set of probability distributions on a measurable space $(\mathcal{Z}, \mathcal{B}(\mathcal{Z}))$ and let $H$ be a reproducing kernel Hilbert space. The Bouligand influence function (BIF) of the function $T : M_1 \to H$ for a distribution $P$ in the direction of a distribution $Q \neq P$ is the special Bouligand-derivative (if it exists)

$$\lim_{\varepsilon \downarrow 0} \frac{\big\| T\big((1-\varepsilon)P + \varepsilon Q\big) - T(P) - \mathrm{BIF}(Q; T, P) \big\|_H}{\varepsilon} = 0. \tag{33}$$
The interpretation of the BIF is that it measures the impact of an infinitesimally small amount of contamination (in practice often outliers or extreme values) of the original distribution $P$ in the direction of $Q$ on the quantity of interest $T(P)$. It is therefore desirable that the function $T$ has a bounded BIF. Christmann and Van Messem (2008) showed that (33) is indeed a special B-derivative, and that, if $\mathrm{BIF}(Q; T, P)$ exists, then

$$\mathrm{BIF}(Q; T, P) = \lim_{\varepsilon \downarrow 0} \frac{T\big((1-\varepsilon)P + \varepsilon Q\big) - T(P)}{\varepsilon},$$

which corresponds to the definition of the IF, if we choose $Q = \delta_z$. Hence, the existence of the BIF implies the existence of the IF and in that case both are
FIG. 5 Road map on the relation between the different types of differentiation and influence functions.
equal. Please note, however, that this in general still does not imply that the IF is a G-derivative. The connection between the various types of differentiation and their link to the influence functions is shown in Fig. 5. In Section 5.2, we will show that SVMs are indeed robust in the sense of influence functions, where we will focus on the Bouligand influence function since these results also cover the classical influence function due to the above-mentioned implication between BIF and IF. Using these concepts of robust statistics, it is also possible to compare different methods with respect to their robustness properties. In general, methods with a smaller norm of the influence function are considered to be more robust. Recently, the concept of the Bouligand influence function has also been used in other settings, such as the selection of the optimal number of folds for cross-validation when determining the hyperparameters of the SVM (Liu and Liao, 2017) and even to construct a fast approximate cross-validation method using the BIF as an alternative to cross-validation itself (Liu et al., 2019).

Key learnings
• A statistical method is robust if the influence of outliers or extreme values on the estimator is limited.
• Robustness properties can be verified through, for example, (Bouligand) influence functions.
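As a small worked illustration of the relation between the BIF and the IF, consider the mean functional $T(P) = \mathbb{E}_P X$ for real-valued $X$ with finite first moment. This example is an assumption made here for illustration only: it is not an SVM, and the Hilbert space is simply $H = \mathbb{R}$. Inserting it into the limit formula for the BIF gives:

```latex
T\big((1-\varepsilon)P + \varepsilon Q\big) - T(P)
  = (1-\varepsilon)\,\mathbb{E}_P X + \varepsilon\,\mathbb{E}_Q X - \mathbb{E}_P X
  = \varepsilon\,\big(\mathbb{E}_Q X - \mathbb{E}_P X\big),
\qquad\text{hence}\qquad
\mathrm{BIF}(Q; T, P) = \mathbb{E}_Q X - \mathbb{E}_P X,
\qquad
\mathrm{IF}(z; T, P) = z - \mathbb{E}_P X \quad (Q = \delta_z).
```

The influence function of the mean is therefore unbounded in $z$, so in the sense described above the mean is not robust.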
5 Properties of SVMs

5.1 Existence, uniqueness and consistency of SVMs

The support vector machine, as defined in Definition 8, exists and is unique under the following conditions. Details and proofs can be found in Steinwart and Christmann (2008b, Lemmas 5.1 and 5.2).

Lemma 2 (Uniqueness of SVM). Let $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a convex loss, $H$ be the RKHS of a measurable kernel over $\mathcal{X}$, and $P$ be a distribution on $\mathcal{X} \times \mathcal{Y}$ with $\mathcal{R}_{L,P}(f) < \infty$ for some $f \in H$. Then for all $\lambda > 0$ there exists at most one general SVM solution $f_{L,P,\lambda}$.
Lemma 3 (Existence of SVM). Let $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a convex loss, $H$ be the RKHS of a bounded measurable kernel over $\mathcal{X}$, and $P$ be a distribution on $\mathcal{X} \times \mathcal{Y}$ such that $L$ is a $P$-integrable Nemitski loss. Then for all $\lambda > 0$ there exists a general SVM solution $f_{L,P,\lambda}$.

Similar results were proven by Christmann et al. (2009, Theorems 5 and 6) for the SVM $f_{L^\star,P,\lambda}$ based on the shifted loss function.

Lemma 4 (Uniqueness of SVM). Let $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a convex loss function. Assume that (i) $\mathcal{R}_{L^\star,P}(f) < \infty$ for some $f \in H$ and $\mathcal{R}_{L^\star,P}(f) > -\infty$ for all $f \in H$ or (ii) $L$ is Lipschitz continuous and $f \in L_1(P_X)$ for all $f \in H$. Then for all $\lambda > 0$ there exists at most one SVM solution $f_{L^\star,P,\lambda}$.

Lemma 5 (Existence of SVM). Let $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a Lipschitz continuous and convex loss function and let $H$ be the RKHS of a bounded measurable kernel $k$. Then for all $\lambda > 0$ there exists an SVM solution $f_{L^\star,P,\lambda}$.

Knowing now that a solution $f_{L^\star,P,\lambda}$ of the shifted problem exists and is unique, it is interesting to examine its relationship to the solution $f_{L,P,\lambda}$ of the original problem, if the latter solution exists. If $\mathcal{R}_{L,P}(0) < \infty$, it is unnecessary to apply the $L^\star$-trick, because then

$$\begin{aligned}
\mathcal{R}^{reg}_{L^\star,P,\lambda}(f_{L^\star,P,\lambda})
&= \inf_{f \in H} \mathbb{E}_P\big(L(X, Y, f(X)) - L(X, Y, 0)\big) + \lambda \|f\|_H^2 \\
&= \inf_{f \in H} \Big(\mathbb{E}_P L(X, Y, f(X)) + \lambda \|f\|_H^2\Big) - \mathbb{E}_P L(X, Y, 0) \\
&= \mathcal{R}^{reg}_{L,P,\lambda}(f_{L,P,\lambda}) - \mathcal{R}_{L,P}(0),
\end{aligned}$$

and $\mathcal{R}_{L,P}(0)$ is finite and independent of $f$. Hence, $f_{L^\star,P,\lambda} = f_{L,P,\lambda}$, and both solutions coincide if $\mathcal{R}_{L,P}(0) < \infty$. Note the link between the finiteness of the risk and the moment condition (29). When we plug $f = 0$ into (28), we immediately see that the risk can only be finite if the moment $\mathbb{E}_P |Y|$ is finite. The above calculation thus shows that, when the moment is finite and the application of the trick is superfluous, $f_{L^\star,P,\lambda}$ and $f_{L,P,\lambda}$ are always equal.

Similar to (27), a representation of $f_{L^\star,P,\lambda}$ as a finite sum of kernel functions can be obtained as a consequence of Theorem 7 in Christmann et al. (2009). For $L$ a convex, Lipschitz continuous, and F-differentiable loss function and $k$ a bounded and measurable kernel with separable RKHS $H$, it holds that

$$f_{L^\star,D,\lambda}(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i), \qquad x \in \mathcal{X},$$

with coefficients equal to

$$\alpha_i = -\frac{1}{2\lambda n}\, \nabla^F_3 L^\star\big(x_i, y_i, f_{L^\star,D,\lambda}(x_i)\big).$$
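The coefficient formula can be checked numerically on a small example. The sketch below is an illustration under stated assumptions and is not taken from the text: it uses the logistic loss for regression $L_{r\text{-log}}$ (whose derivative in the third argument is the same for the shifted and the unshifted loss), a Gaussian RBF kernel, simulated data, and a simple fixed-point iteration on the coefficients as a hypothetical solver. The function names are made up for this sketch.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix for one-dimensional inputs."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def dloss_rlog(y, t):
    """Derivative in t of the logistic loss for regression: 1 - 2*Lambda(y - t)."""
    return 1.0 - 2.0 * (1.0 / (1.0 + np.exp(-(y - t))))

def fit_alpha(K, y, lam, n_iter=2000):
    """Fixed-point iteration for f = sum_i alpha_i k(., x_i).

    Each step subtracts the gradient of the regularized empirical risk,
    expressed in the coefficients of the kernel expansion.
    """
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(n_iter):
        alpha -= 2.0 * lam * alpha + dloss_rlog(y, K @ alpha) / n
    return alpha

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3.0, 3.0, 40))
y = np.sin(x) + 0.1 * rng.standard_normal(40)
lam = 0.1
K = rbf_kernel(x, x)

alpha = fit_alpha(K, y, lam)
# Coefficients predicted by the representation formula, using the fitted values.
alpha_formula = -dloss_rlog(y, K @ alpha) / (2.0 * lam * len(y))
print(np.max(np.abs(alpha - alpha_formula)))
```

The printed number is the largest discrepancy between the fitted coefficients and those given by the formula; at a stationary point of the iteration the two coincide up to rounding.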
This expression might seem strange at first, but the values $f_{L^\star,D,\lambda}(x_i)$ are already known from the empirical risk minimization step and can therefore be used to determine the value of $f_{L^\star,D,\lambda}$ at all other points $x \in \mathcal{X}$. We thus obtain an explicit formula for the empirical decision function $f_{L^\star,D,\lambda}$ as a linear combination of the kernel functions $k(\cdot, x_i) = \Phi(x_i)$, for $i = 1, \ldots, n$.

A final property we would like to mention, without going into too much detail, is the consistency of SVMs. Recall that the aim of an SVM in particular, or a statistical learning method in general, is to find a decision function $f_D$, based on the set $D$ of training data, such that the $L$-risk $\mathcal{R}_{L,P}(f_D)$ is as close to the Bayes risk $\mathcal{R}^*_{L,P}$ as possible. Since this set $D$ consists of realizations of i.i.d. random variables from an unknown distribution $P$, $f_D$ and $\mathcal{R}_{L,P}(f_D)$ will also be random variables. To verify if a learning method $D \mapsto f_D$ is really capable of learning, one can investigate what the probability is that the risk $\mathcal{R}_{L,P}(f_D)$ of the empirical decision function is close to the minimal risk $\mathcal{R}^*_{L,P}$, or whether the difference between both tends to zero when the size of the data set increases. One way to answer this question is through consistency, which is of an asymptotic nature and can often be verified without any assumptions on the distribution $P$. Since SVMs are measurable learning methods under some minimal assumptions (Steinwart and Christmann, 2008b, Lemma 6.23), the maps $x \mapsto f_D(x)$ are measurable and the risks $\mathcal{R}_{L,P}(f_D)$ will exist for all fixed $D \in (\mathcal{X} \times \mathcal{Y})^n$ and all $n \geq 1$. This implies (Steinwart and Christmann, 2008b, Lemma 6.3) that also the maps $(\mathcal{X} \times \mathcal{Y})^n \to [0, \infty] : D \mapsto \mathcal{R}_{L,P}(f_D)$ are measurable. With this information, the concept of consistency can be introduced.

Definition 13. Let $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a loss and $P$ be a distribution on $\mathcal{X} \times \mathcal{Y}$. An SVM is said to be $L$-risk consistent for $P$ if, for all $\varepsilon > 0$, we have that

$$\lim_{n \to \infty} P^n\Big(D \in (\mathcal{X} \times \mathcal{Y})^n : \mathcal{R}_{L,P}(f_D) \leq \mathcal{R}^*_{L,P} + \varepsilon\Big) = 1.$$

Furthermore, it is called universally $L$-risk consistent if it is $L$-risk consistent for all distributions $P$ on $\mathcal{X} \times \mathcal{Y}$.

In other words, when the data set becomes sufficiently large, an $L$-risk consistent method will provide a decision function $f_D$ whose associated risk will be close to the Bayes risk, so that, with high probability, $f_D$ will be nearly optimal. The method will therefore be able to learn. If the method is even universally $L$-risk consistent, this learning can be done without any specific knowledge of the underlying distribution $P$, which is a prerequisite for nonparametric methods. The only drawback is that consistency does not say anything about the speed of the convergence; we thus do not know at what rate
the method is able to learn. It just tells us that, in the long run, the method will learn the optimal decision function. Furthermore, we will call a learning method consistent if the decision function $f_D$ converges to the Bayes function $f^*$. In practice, the goal is most often to minimize the risk, and thus in those cases, the knowledge of risk consistency will be enough. However, sometimes practitioners are also interested in explicitly knowing the goodness of the function $f_{L,D,\lambda}$ and the predictions $f_{L,D,\lambda}(x)$ made by it for previously unseen input values $x \in \mathcal{X}$. Consistency results (both risk consistency and consistency) for standard SVMs can be found in, e.g., Christmann and Steinwart (2007, 2008), Steinwart (2001, 2002, 2005), and Steinwart and Christmann (2008b), whereas similar results for the SVM solution $f_{L^\star,P,\lambda}$ of the shifted problem are given in Christmann et al. (2009). These results prove that SVMs are indeed capable of learning the optimal solution. Christmann and Hable (2012) prove that bootstrap approximations of SVMs are consistent as well.

Key learnings
• The SVM exists and is unique under some mild conditions.
• The empirical SVM can be written as a finite sum of (at most) n kernel functions.
• SVMs are consistent, i.e., able to learn a (near) optimal solution.
• These properties hold both for the SVM solution of the original problem and that of the shifted problem.
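The probability statement in the definition of $L$-risk consistency can be illustrated by simulation in a deliberately simplified setting. The sketch below is an assumption-laden toy example and not an SVM: the hypothesis class consists of constant predictions only, $Y \sim N(0, 1)$, the loss is the pinball loss, and the learning method returns an empirical $\tau$-quantile. For this setting the risk of a constant prediction has a closed form, so the excess risk over the Bayes risk can be computed exactly. All names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

STD_NORMAL = NormalDist()  # Y ~ N(0, 1) is an assumption of this sketch

def pinball_risk(t, tau):
    """Exact pinball risk of the constant prediction t when Y ~ N(0, 1)."""
    return STD_NORMAL.pdf(t) + t * (STD_NORMAL.cdf(t) - tau)

def empirical_minimizer(sample, tau):
    """An empirical tau-quantile, i.e., a minimizer of the empirical pinball risk."""
    k = int(np.ceil(len(sample) * tau))
    return float(np.sort(sample)[k - 1])

def excess_risks(n, tau, n_rep, rng):
    """Excess risks R(f_D) - R* over n_rep independent data sets of size n."""
    bayes_risk = pinball_risk(STD_NORMAL.inv_cdf(tau), tau)
    return np.array([
        pinball_risk(empirical_minimizer(rng.standard_normal(n), tau), tau) - bayes_risk
        for _ in range(n_rep)
    ])

rng = np.random.default_rng(0)
tau, eps = 0.75, 0.01
results = {}
for n in (10, 100, 1000, 10000):
    excess = excess_risks(n, tau, n_rep=200, rng=rng)
    # Mean excess risk and the fraction of data sets with R(f_D) <= R* + eps.
    results[n] = (float(excess.mean()), float(np.mean(excess <= eps)))
    print(n, results[n])
```

The second entry of each result estimates the probability appearing in the definition for the chosen $\varepsilon$. The simulation says nothing about rates, in line with the remark that consistency alone does not quantify the speed of convergence.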
5.2 Robustness of SVMs

Through the Bouligand influence function, we can show that a broad class of support vector machines based on a Lipschitz continuous, but not necessarily F-differentiable, loss function has a bounded Bouligand influence function. We will discuss results both for SVMs using a standard loss function (Christmann and Van Messem, 2008, Theorem 2) and for SVMs based on a shifted loss function (Christmann et al., 2009, Theorem 13). The reason we focus here on the BIF is that, as mentioned before, if the BIF exists, then also the IF exists and both are equal. Let us first investigate the robustness of SVMs that are based on standard, nonshifted loss functions. To this end define

$$T : M_1(\mathcal{X} \times \mathcal{Y}) \to H, \qquad T(P) := f_{L,P,\lambda}.$$

Since the growth behavior of the loss function $L$ plays an important role in obtaining consistency and robustness results (Christmann and Steinwart, 2007), we restrict attention to Lipschitz continuous loss functions. For notational convenience we shall often write $\nabla^B_3 L(X, Y, f(X))$ instead of $\nabla^B_3 L(X, Y, \cdot)(f(X))$, because $f(X) \in \mathbb{R}$. We will sometimes explicitly write "$\cdot$" for multiplication to avoid misunderstandings.
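Theorem 6 (Bouligand influence function). Let $\mathcal{X} \subset \mathbb{R}^d$ be closed, $H$ be an RKHS of a bounded, continuous kernel $k$, and $f_{L,P,\lambda} \in H$. Let $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a convex, Lipschitz continuous loss function with Lipschitz constant $|L|_1 \in (0, \infty)$. Let the partial B-derivatives $\nabla^B_3 L(x, y, \cdot)$ and $\nabla^B_{3,3} L(x, y, \cdot)$ be measurable and bounded by

$$\kappa_1 := \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \big\| \nabla^B_3 L(x, y, \cdot) \big\|_\infty \in (0, \infty), \qquad \kappa_2 := \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \big\| \nabla^B_{3,3} L(x, y, \cdot) \big\|_\infty < \infty. \tag{34}$$

Let $P$, $Q$ be probability measures on $\mathcal{X} \times \mathcal{Y}$ with $\mathbb{E}_P |Y| < \infty$ and $\mathbb{E}_Q |Y| < \infty$, $\delta_1 > 0$, $\delta_2 > 0$, $N_{\delta_1}(f_{L,P,\lambda}) := \{ f \in H ;\ \| f - f_{L,P,\lambda} \|_H < \delta_1 \}$, and $\lambda > \tfrac{1}{2} \kappa_2 \|k\|_\infty^3$. Define $G : (-\delta_2, \delta_2) \times N_{\delta_1}(f_{L,P,\lambda}) \to H$,

$$G(\varepsilon, f) := 2\lambda f + \mathbb{E}_{(1-\varepsilon)P + \varepsilon Q} \nabla^B_3 L(X, Y, f(X)) \cdot \Phi(X), \tag{35}$$

and assume that $\nabla^B_2 G(0, f_{L,P,\lambda})$ is strong. Then the Bouligand influence function of $T(P) := f_{L,P,\lambda}$ in the direction of $Q \neq P$ exists,

$$\mathrm{BIF}(Q; T, P) = S^{-1}\Big( \mathbb{E}_P \nabla^B_3 L(X, Y, f_{L,P,\lambda}(X)) \cdot \Phi(X) \Big) \tag{36}$$

$$\qquad\qquad\qquad - S^{-1}\Big( \mathbb{E}_Q \nabla^B_3 L(X, Y, f_{L,P,\lambda}(X)) \cdot \Phi(X) \Big), \tag{37}$$

where $S : H \to H$ with $S(\cdot) := \nabla^B_2 G(0, f_{L,P,\lambda})(\cdot) = 2\lambda\, \mathrm{id}_H(\cdot) + \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \langle \Phi(X), \cdot \rangle_H\, \Phi(X)$, and $\mathrm{BIF}(Q; T, P)$ is bounded.

Remember that because $\mathcal{X}$ and $\mathcal{Y}$ are closed, $P$ ($Q$) can be split up into the marginal distribution $P_X$ ($Q_X$) and the regular conditional probability $P(\cdot \mid x)$ ($Q(\cdot \mid x)$), $x \in \mathcal{X}$, on $\mathcal{Y}$. Furthermore, the theorem also holds more generally if $\mathcal{X}$ is a complete separable normed linear space.

Remark 1. To prove Theorem 6, we additionally show that, under its assumptions, the following statements hold:
1. For some $\chi$ and each $f \in N_{\delta_1}(f_{L,P,\lambda})$, $G(\cdot, f)$ is Lipschitz continuous on $(-\delta_2, \delta_2)$ with Lipschitz constant $\chi$.
2. $G$ has partial B-derivatives with respect to $\varepsilon$ and $f$ at $(0, f_{L,P,\lambda})$.
3. $\nabla^B_2 G(0, f_{L,P,\lambda})(h - f_{L,P,\lambda})$ lies in a neighborhood of $0 \in H$, $\forall h \in N_{\delta_1}(f_{L,P,\lambda})$.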
4. The constant $d_0$ defined as

$$d_0 := \inf_{\substack{h_1, h_2 \in N_{\delta_1}(f_{L,P,\lambda}) - f_{L,P,\lambda} \\ h_1 \neq h_2}} \frac{\big\| \nabla^B_2 G(0, f_{L,P,\lambda})(h_1) - \nabla^B_2 G(0, f_{L,P,\lambda})(h_2) \big\|_H}{\| h_1 - h_2 \|_H}$$

is strictly positive.
5. For each $\xi > d_0^{-1} \chi$ there exist constants $\delta_3, \delta_4 > 0$, a neighborhood $N_{\delta_3}(f_{L,P,\lambda}) := \{ f \in H ;\ \| f - f_{L,P,\lambda} \|_H < \delta_3 \}$, and a function $f^* : (-\delta_4, \delta_4) \to N_{\delta_3}(f_{L,P,\lambda})$ satisfying
(v.1) $f^*(0) = f_{L,P,\lambda}$.
(v.2) $f^*$ is Lipschitz continuous on $(-\delta_4, \delta_4)$ with Lipschitz constant $|f^*|_1 = \xi$.
(v.3) For each $\varepsilon \in (-\delta_4, \delta_4)$, $f^*(\varepsilon)$ is the unique solution of $G(\varepsilon, f) = 0$ in $N_{\delta_3}(f_{L,P,\lambda})$.
(v.4) $\nabla^B f^*(0)(u) = \big( \nabla^B_2 G(0, f_{L,P,\lambda}) \big)^{-1} \big( -\nabla^B_1 G(0, f_{L,P,\lambda})(u) \big)$, $u \in (-\delta_4, \delta_4)$.
The function $f^*$ is the same as in the implicit function theorem, Theorem 4.

Remark 2. As shown in Section 3.7, $\kappa_2 = 0$ for $L = L_{\epsilon}$ and $L = L_{\tau\text{-pin}}$, and thus in these cases the regularization condition reduces to $\lambda > \tfrac{1}{2} \kappa_2 \|k\|_\infty^3 = 0$.

Note that $S$ can be interpreted as the (Bouligand-)Hessian of the regularized risk, see (38) and (41). We would further like to remark that the expression for the BIF given in (36) and (37) is similar to the one obtained by Christmann and Steinwart (2007) for the IF of $T(P) = f_{L,P,\lambda}$. But because we here allow the loss functions to be nonsmooth, B-derivatives instead of F-derivatives are used. Clearly the first summand of the BIF given in (36) does not depend on the contaminating distribution $Q$. However, the second summand of the BIF given in (37) does depend on $Q$ and consists of two factors. The first factor depends on the partial B-derivative of the loss function, and is hence bounded due to assumption (34). The second factor is the feature map $\Phi(x)$, which is bounded because $k$ is bounded. For the Gaussian RBF kernel we expect that the second factor is not only bounded, but that the impact of $Q \neq P$ on the BIF is approximately local, since $k(x, x')$ converges exponentially fast to zero if $\| x - x' \|_2$ is large.

The key ingredient of the proof of Theorem 6 is the map $G : \mathbb{R} \times H \to H$ defined by (35). If $\varepsilon < 0$, the integration is with respect to a signed measure. The $H$-valued expectation used in the definition of $G$ is well-defined for all $\varepsilon \in (-\delta_2, \delta_2)$ and all $f \in N_{\delta_1}(f_{L,P,\lambda})$, because $\kappa_1 \in (0, \infty)$ by (34) and $\| \Phi(x) \|_\infty \leq \|k\|_\infty^2 < \infty$ by (26). Both F- and B-derivatives follow a chain rule
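and F-differentiable functions are also B-differentiable. For $\varepsilon \in [0, 1]$ we thus obtain

$$G(\varepsilon, f) = \frac{\partial \mathcal{R}^{reg}_{L,(1-\varepsilon)P + \varepsilon Q,\lambda}}{\partial H}(f) = \nabla^B_2 \mathcal{R}^{reg}_{L,(1-\varepsilon)P + \varepsilon Q,\lambda}(f). \tag{38}$$

Since $f \mapsto \mathcal{R}^{reg}_{L,(1-\varepsilon)P + \varepsilon Q,\lambda}(f)$ is convex and continuous for all $\varepsilon \in [0, 1]$, Eq. (38) shows that $G(\varepsilon, f) = 0$ if and only if $f = f_{L,(1-\varepsilon)P + \varepsilon Q,\lambda}$ for such $\varepsilon$. Hence

$$G(0, f_{L,P,\lambda}) = 0. \tag{39}$$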
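Following the steps in Remark 1, it can be shown that Theorem 4 applies to $G$ and that a B-differentiable function $\varepsilon \mapsto f_\varepsilon$ exists that is defined on a small interval $(-\delta_2, \delta_2)$ for some $\delta_2 > 0$ and satisfies $G(\varepsilon, f_\varepsilon) = 0$ for all $\varepsilon \in (-\delta_2, \delta_2)$. The existence of this function will give $\mathrm{BIF}(Q; T, P) = \nabla^B f_\varepsilon(0)$.

Proof. The existence of $f_{L,P,\lambda}$ follows from the convexity of $L$ and the penalizing term (Christmann and Steinwart, 2007, Proposition 8). The assumption that $G(0, f_{L,P,\lambda}) = 0$ is valid by (39). Let us now prove the results of Remark 1 parts (1)–(5).

Remark 1 part (1). For $f \in H$ fixed let $\varepsilon_1, \varepsilon_2 \in (-\delta_2, \delta_2)$. Using $\|k\|_\infty < \infty$ and (39) we obtain

$$\begin{aligned}
&\Big\| \mathbb{E}_{(1-\varepsilon_1)P + \varepsilon_1 Q} \nabla^B_3 L(X, Y, f(X)) \cdot \Phi(X) - \mathbb{E}_{(1-\varepsilon_2)P + \varepsilon_2 Q} \nabla^B_3 L(X, Y, f(X)) \cdot \Phi(X) \Big\|_H \\
&\quad = \Big\| \mathbb{E}_P \nabla^B_3 L(X, Y, f(X)) \Phi(X) + \varepsilon_1 \mathbb{E}_{Q-P} \nabla^B_3 L(X, Y, f(X)) \Phi(X) \\
&\qquad\quad - \mathbb{E}_P \nabla^B_3 L(X, Y, f(X)) \Phi(X) - \varepsilon_2 \mathbb{E}_{Q-P} \nabla^B_3 L(X, Y, f(X)) \Phi(X) \Big\|_H \\
&\quad = \Big\| (\varepsilon_1 - \varepsilon_2)\, \mathbb{E}_{Q-P} \nabla^B_3 L(X, Y, f(X)) \cdot \Phi(X) \Big\|_H \\
&\quad = \Big\| (\varepsilon_1 - \varepsilon_2) \int \nabla^B_3 L(x, y, f(x)) \Phi(x)\, d(Q - P)(x, y) \Big\|_H \\
&\quad \leq |\varepsilon_1 - \varepsilon_2| \int \big\| \nabla^B_3 L(x, y, f(x)) \cdot \Phi(x) \big\|_H \, d|Q - P|(x, y) \\
&\quad \leq |\varepsilon_1 - \varepsilon_2| \int \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \big| \nabla^B_3 L(x, y, f(x)) \big| \, \sup_{x \in \mathcal{X}} |\Phi(x)| \, d|Q - P|(x, y) \\
&\quad \leq |\varepsilon_1 - \varepsilon_2| \, \| \Phi(x) \|_\infty \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \big\| \nabla^B_3 L(x, y, \cdot) \big\|_\infty \int d|Q - P|(x, y) \\
&\quad \leq 2 \|k\|_\infty^2 \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \big\| \nabla^B_3 L(x, y, \cdot) \big\|_\infty \, |\varepsilon_1 - \varepsilon_2| \\
&\quad = 2 \|k\|_\infty^2 \, \kappa_1 \, |\varepsilon_1 - \varepsilon_2| < \infty.
\end{aligned}$$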
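Remark 1 part (2). We have

$$\begin{aligned}
\nabla^B_1 G(\varepsilon, f)
&= \nabla^B_1 \Big( \mathbb{E}_{(1-\varepsilon)P + \varepsilon Q} \nabla^B_3 L(X, Y, f(X)) \cdot \Phi(X) \Big) \\
&= \nabla^B_1 \Big( (1-\varepsilon)\, \mathbb{E}_P \nabla^B_3 L(X, Y, f(X)) \Phi(X) + \varepsilon\, \mathbb{E}_Q \nabla^B_3 L(X, Y, f(X)) \Phi(X) \Big) \\
&= \nabla^B_1 \Big( \mathbb{E}_P \nabla^B_3 L(X, Y, f(X)) \cdot \Phi(X) + \varepsilon\, \mathbb{E}_{Q-P} \nabla^B_3 L(X, Y, f(X)) \cdot \Phi(X) \Big) \\
&= \mathbb{E}_{Q-P} \nabla^B_3 L(X, Y, f(X)) \cdot \Phi(X) \\
&= \mathbb{E}_Q \nabla^B_3 L(X, Y, f(X)) \cdot \Phi(X) - \mathbb{E}_P \nabla^B_3 L(X, Y, f(X)) \cdot \Phi(X).
\end{aligned} \tag{40}$$

These expectations exist due to (26) and (34). Furthermore,

$$\begin{aligned}
\nabla^B_2 G(0, f_{L,P,\lambda})(h) + o(h)
&= G(0, f_{L,P,\lambda} + h) - G(0, f_{L,P,\lambda}) \\
&= 2\lambda h + \mathbb{E}_P \nabla^B_3 L\big(X, Y, (f_{L,P,\lambda}(X) + h(X))\big) \cdot \Phi(X) - \mathbb{E}_P \nabla^B_3 L(X, Y, f_{L,P,\lambda}(X)) \cdot \Phi(X) \\
&= 2\lambda h + \mathbb{E}_P \Big( \nabla^B_3 L\big(X, Y, (f_{L,P,\lambda}(X) + h(X))\big) - \nabla^B_3 L(X, Y, f_{L,P,\lambda}(X)) \Big) \cdot \Phi(X).
\end{aligned}$$

Since the term $\nabla^B_3 L\big(X, Y, (f_{L,P,\lambda}(X) + h(X))\big) - \nabla^B_3 L(X, Y, f_{L,P,\lambda}(X))$ is bounded due to (26), (34), and $\|k\|_\infty < \infty$, this expectation also exists. Using $\langle \Phi(x), \cdot \rangle_H \in H$, for all $x \in \mathcal{X}$, we get

$$\nabla^B_2 G(0, f_{L,P,\lambda})(\cdot) = 2\lambda\, \mathrm{id}_H(\cdot) + \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \langle \Phi(X), \cdot \rangle_H\, \Phi(X). \tag{41}$$

Note that $\mathbb{E}_P \nabla^B_{3,3} L(X, Y, f(X)) = \nabla^B_3 \mathbb{E}_P \nabla^B_3 L(X, Y, f(X))$, because by using the definition of the B-derivative, we obtain for the partial B-derivatives

$$\begin{aligned}
&\nabla^B_3 \mathbb{E}_P \nabla^B_3 L(X, Y, f(X)) - \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f(X)) \\
&\quad = \mathbb{E}_P \nabla^B_3 L\big(X, Y, (f(X) + h(X))\big) - \mathbb{E}_P \nabla^B_3 L(X, Y, f(X)) - \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f(X)) + o(h) \\
&\quad = \mathbb{E}_P \Big( \nabla^B_3 L\big(X, Y, (f(X) + h(X))\big) - \nabla^B_3 L(X, Y, f(X)) - \nabla^B_{3,3} L(X, Y, f(X)) \Big) + o(h) \\
&\quad = o(h), \qquad h \in H.
\end{aligned}$$

Remark 1 part (3). Let $N_{\delta_1}(f_{L,P,\lambda})$ be a $\delta_1$-neighborhood of $f_{L,P,\lambda}$. Because $H$ is an RKHS and hence a vector space, it follows for all $h \in N_{\delta_1}(f_{L,P,\lambda})$ that $\| f_{L,P,\lambda} - h - 0 \|_H \leq \delta_1$ and thus $h - f_{L,P,\lambda} \in N_{\delta_1}(0) \subset H$.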
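Note that $\nabla^B_2 G(0, f_{L,P,\lambda})(\cdot)$ computed by (41) is a mapping from $H$ to $H$. For $\xi := h - f_{L,P,\lambda}$ we have $\| \xi \|_H \leq \delta_1$ and the reproducing property yields

$$\nabla^B_2 G(0, f_{L,P,\lambda})(\xi) = 2\lambda \xi + \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \xi\, \Phi(X).$$

Using (26) and (34) we obtain

$$\begin{aligned}
\Big\| 2\lambda \xi + \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \xi\, \Phi(X) - 0 \Big\|_H
&\leq 2\lambda \| \xi \|_H + \Big\| \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \xi\, \Phi(X) \Big\|_H \\
&\leq 2\lambda \| \xi \|_H + \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \big\| \nabla^B_{3,3} L(x, y, \cdot) \big\|_\infty \, \| \xi \|_\infty \, \| \Phi(x) \|_\infty \\
&\leq 2\lambda \| \xi \|_H + \kappa_2 \, \| \xi \|_H \, \|k\|_\infty^3 \\
&\leq \big( 2\lambda + \kappa_2 \|k\|_\infty^3 \big)\, \delta_1,
\end{aligned}$$

which shows that $\nabla^B_2 G(0, f_{L,P,\lambda})(h - f_{L,P,\lambda})$ lies in a neighborhood of $0 \in H$, for all $h \in N_{\delta_1}(f_{L,P,\lambda})$.

Remark 1 part (4). Due to (41), we have to prove that

$$d_0 := \inf_{f_1 \neq f_2} \frac{\Big\| 2\lambda (f_1 - f_2) + \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot (f_1 - f_2)\, \Phi(X) \Big\|_H}{\| f_1 - f_2 \|_H}$$

is strictly positive. If $f_1 \neq f_2$, then (26), (34), and $\lambda > \tfrac{1}{2} \kappa_2 \|k\|_\infty^3$ yield that

$$\begin{aligned}
&\frac{\Big\| 2\lambda (f_1 - f_2) + \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot (f_1 - f_2)\, \Phi(X) \Big\|_H}{\| f_1 - f_2 \|_H} \\
&\quad \geq \frac{\| 2\lambda (f_1 - f_2) \|_H - \Big\| \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot (f_1 - f_2)\, \Phi(X) \Big\|_H}{\| f_1 - f_2 \|_H} \\
&\quad \geq 2\lambda - \kappa_2 \|k\|_\infty^3 > 0,
\end{aligned}$$

which gives the assertion.

Remark 1 part (5). The assumptions of Robinson's implicit function theorem (Theorem 4) are valid for $G$ due to the results of Remark 1 parts (1)–(4) and the assumption that $\nabla^B_2 G(0, f_{L,P,\lambda})$ is strong. This yields part (5).

The result of Theorem 6 now follows from inserting (40) and (41) into Remark 1 part (v.4). Using (34) we see that $S$ is bounded. The linearity of $S$ follows from its definition and the inverse of $S$ exists by Theorem 4. If necessary we can restrict the range of $S$ to $S(H)$ to obtain a bijective function $S^* : H \to S(H)$ with $S^*(f) = S(f)$ for all $f \in H$. Hence $S^{-1}$ is also bounded and linear by Theorem 3. This gives the existence of a bounded BIF specified by (36) and (37). □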
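The representation (36) and (37) can be compared with a finite-difference approximation of the limit defining the BIF. The following sketch makes several assumptions that are not part of the text: $P$ is the empirical distribution of 30 simulated points, $Q$ is a Dirac measure at a hypothetical outlier $z = (0.5, 10)$, the loss is the smooth logistic loss for regression, the kernel is a Gaussian RBF kernel, and the SVM is computed by a simple fixed-point iteration. All functions are expressed through coefficients with respect to the kernel functions at the support points of $P$ and $Q$, so that $S$ becomes a matrix.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def d1(y, t):
    """First derivative in t of the logistic loss for regression."""
    return 1.0 - 2.0 * sigmoid(y - t)

def d2(y, t):
    """Second derivative in t of the logistic loss for regression."""
    s = sigmoid(y - t)
    return 2.0 * s * (1.0 - s)

def solve_svm(K, y, w, lam, n_iter=500):
    """Coefficients a of f = sum_j a_j k(., x_j) with 2*lam*f + E_w[d1 * Phi] = 0."""
    a = np.zeros(len(y))
    for _ in range(n_iter):
        a -= 2.0 * lam * a + w * d1(y, K @ a)
    return a

rng = np.random.default_rng(1)
n, lam = 30, 0.1
x = rng.uniform(-3.0, 3.0, n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)
# Support points: the n data points followed by the contamination point z.
xs, ys = np.append(x, 0.5), np.append(y, 10.0)
p = np.append(np.full(n, 1.0 / n), 0.0)   # weights of P
q = np.append(np.zeros(n), 1.0)           # weights of Q = Dirac at z
K = rbf_kernel(xs, xs)

a0 = solve_svm(K, ys, p, lam)
f0 = K @ a0
# Matrix form of S = 2*lam*id + E_P[d2 * <Phi(X), .> Phi(X)] in these coordinates.
S = 2.0 * lam * np.eye(n + 1) + (p * d2(ys, f0))[:, None] * K
bif_coef = np.linalg.solve(S, p * d1(ys, f0) - q * d1(ys, f0))
bif_vals = K @ bif_coef                   # BIF evaluated at the support points

eps = 1e-6
a_eps = solve_svm(K, ys, (1.0 - eps) * p + eps * q, lam)
fd_vals = K @ (a_eps - a0) / eps          # finite-difference approximation
print(np.max(np.abs(bif_vals - fd_vals)), np.max(np.abs(bif_vals)))
```

The first printed number is the largest difference between the formula and the finite-difference quotient at the support points, the second is the largest absolute value of the BIF there. For this smooth loss the same computation also gives the classical influence function, since $Q$ is a Dirac measure.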
The results from Theorem 6 cover some commonly used SVMs, such as SVMs based on the $\epsilon$-insensitive loss function or Huber's loss function for regression, and SVMs based on the pinball loss function for nonparametric quantile regression (Christmann and Van Messem, 2008). As seen in Section 3.7, these loss functions have uniformly bounded first and second partial B-derivatives. The robustness of SVMs based on some smooth loss functions, such as the logistic loss for classification and the logistic loss for regression, was covered by previous results on the influence function (Steinwart and Christmann, 2008b, Theorems 10.10 and 10.18). By contrast, SVMs based on the least squares loss are not robust since their influence function is unbounded.

Corollary 1. Let $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}$ be closed, and $P$, $Q$ be distributions on $\mathcal{X} \times \mathcal{Y}$ with $\mathbb{E}_P |Y| < \infty$ and $\mathbb{E}_Q |Y| < \infty$.
1. For $L \in \{ L_{\tau\text{-pin}}, L_{\epsilon} \}$, assume that for all $\delta > 0$ there exist positive constants $\xi_P$, $\xi_Q$, $c_P$, and $c_Q$ such that for all $t \in \mathbb{R}$ with $|t - f_{L,P,\lambda}(x)| \leq \delta \|k\|_\infty$ the following inequalities hold for all $a \in [0, 2\delta \|k\|_\infty]$ and $x \in \mathcal{X}$:

$$P\big(Y \in [t, t + a] \mid x\big) \leq c_P\, a^{1 + \xi_P} \quad \text{and} \quad Q\big(Y \in [t, t + a] \mid x\big) \leq c_Q\, a^{1 + \xi_Q}. \tag{42}$$

2. For $L = L_{c\text{-Huber}}$, assume for $x \in \mathcal{X}$:

$$P\big(Y \in \{ f_{L,P,\lambda}(x) - c,\ f_{L,P,\lambda}(x) + c \} \mid x\big) = 0, \qquad Q\big(Y \in \{ f_{L,P,\lambda}(x) - c,\ f_{L,P,\lambda}(x) + c \} \mid x\big) = 0. \tag{43}$$

Then the assumptions of Theorem 6 are valid: $\mathrm{BIF}(Q; T, P)$ of $T(P) := f_{L,P,\lambda}$ exists, is given by (36) and (37), and is bounded.

To prove Corollary 1, we need to know the partial B-derivatives of the three loss functions and need to check that $\nabla^B_2 G(0, f_{L,P,\lambda})$ is strong. Since we already computed the partial B-derivatives in Section 3.7, it only remains to show that these partial derivatives are strong.

Proof. Because these loss functions have bounded first and second partial B-derivatives, we now check whether or not $\nabla^B_2 G(0, f_{L,P,\lambda})$ is strong. Recall that $\nabla^B_2 G(0, f_{L,P,\lambda})$ is strong, if for all $\varepsilon^* > 0$ there exist a neighborhood $N_{\delta_1}(f_{L,P,\lambda})$ and an interval $(-\delta_2, \delta_2)$ with $\delta_1, \delta_2 > 0$ such that for all $f_1, f_2 \in N_{\delta_1}(f_{L,P,\lambda})$ and for all $\varepsilon \in (-\delta_2, \delta_2)$

$$\big\| \big( G(\varepsilon, f_1) - g(f_1) \big) - \big( G(\varepsilon, f_2) - g(f_2) \big) \big\|_H \leq \varepsilon^* \| f_1 - f_2 \|_H, \tag{44}$$

where, for $f \in H$,

$$\begin{aligned}
g(f) ={}& 2\lambda f_{L,P,\lambda}(X) + \mathbb{E}_P \nabla^B_3 L(X, Y, f_{L,P,\lambda}(X)) \cdot \Phi(X) + 2\lambda\, \mathrm{id}_H\big( f(X) - f_{L,P,\lambda}(X) \big) \\
&+ \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \big\langle \big( f(X) - f_{L,P,\lambda}(X) \big), \Phi(X) \big\rangle_H\, \Phi(X).
\end{aligned}$$
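Fix $\varepsilon^* > 0$. Obviously, (44) holds for $f_1 = f_2$. Therefore we can assume for the rest of the proof that $f_1, f_2 \in N_{\delta_1}(f_{L,P,\lambda})$ are fixed arbitrary functions with $f_1 \neq f_2$. For the left hand side of (44) we obtain

$$\begin{aligned}
&\Big\| \Big( 2\lambda f_1(X) + \mathbb{E}_{(1-\varepsilon)P + \varepsilon Q} \nabla^B_3 L(X, Y, f_1(X)) \cdot \Phi(X) - 2\lambda f_{L,P,\lambda}(X) - \mathbb{E}_P \nabla^B_3 L(X, Y, f_{L,P,\lambda}(X)) \cdot \Phi(X) \\
&\qquad - 2\lambda \big( f_1(X) - f_{L,P,\lambda}(X) \big) - \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \big( f_1(X) - f_{L,P,\lambda}(X) \big) \Phi(X) \Big) \\
&\quad - \Big( 2\lambda f_2(X) + \mathbb{E}_{(1-\varepsilon)P + \varepsilon Q} \nabla^B_3 L(X, Y, f_2(X)) \cdot \Phi(X) - 2\lambda f_{L,P,\lambda}(X) - \mathbb{E}_P \nabla^B_3 L(X, Y, f_{L,P,\lambda}(X)) \cdot \Phi(X) \\
&\qquad - 2\lambda \big( f_2(X) - f_{L,P,\lambda}(X) \big) - \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \big( f_2(X) - f_{L,P,\lambda}(X) \big) \Phi(X) \Big) \Big\|_H \\
&= \Big\| \mathbb{E}_{(1-\varepsilon)P + \varepsilon Q} \big( \nabla^B_3 L(X, Y, f_1(X)) - \nabla^B_3 L(X, Y, f_2(X)) \big) \cdot \Phi(X) - \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \big( f_1(X) - f_2(X) \big) \Phi(X) \Big\|_H
\end{aligned}$$

$$\begin{aligned}
\leq{}& |1 - \varepsilon| \, \Big\| \mathbb{E}_P \Big( \nabla^B_3 L(X, Y, f_1(X)) - \nabla^B_3 L(X, Y, f_2(X)) - \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \big( f_1(X) - f_2(X) \big) \Big) \cdot \Phi(X) \Big\|_H \\
&+ |\varepsilon| \, \Big\| \mathbb{E}_Q \big( \nabla^B_3 L(X, Y, f_1(X)) - \nabla^B_3 L(X, Y, f_2(X)) \big) \cdot \Phi(X) \Big\|_H \\
&+ |\varepsilon| \, \Big\| \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \big( f_1(X) - f_2(X) \big) \Phi(X) \Big\|_H
\end{aligned} \tag{45}$$

$$=: |1 - \varepsilon| A + |\varepsilon| B + |\varepsilon| C. \tag{46}$$

It remains to show that (46) is bounded from above by $\varepsilon^* \| f_1 - f_2 \|_H$. When looking at the first partial B-derivatives of the loss functions, we can distinguish two cases: $L_{\epsilon}$ and $L_{\tau\text{-pin}}$ have one or more discontinuities in $\nabla^B_3 L$, whereas $\nabla^B_3 L$ is continuous for $L_{c\text{-Huber}}$. Recall that the set $D$ of points where Lipschitz continuous functions are not F-differentiable has Lebesgue measure zero by Rademacher's theorem (Theorem 1). Define the function

$$h(y, f_1(x), f_2(x)) := \nabla^B_3 L(x, y, f_1(x)) - \nabla^B_3 L(x, y, f_2(x)).$$

For $L \in \{ L_{\epsilon}, L_{\tau\text{-pin}} \}$, denote the set of discontinuity points of $\nabla^B_3 L$ by $D$. Take $f_1, f_2 \in N_{\delta_1}(f_{L,P,\lambda})$. For $\nabla^B_3 L(X, Y, f_{L,P,\lambda}(x)) \notin D$ we obtain $\nabla^B_3 L(X, Y, f_1(x)) = \nabla^B_3 L(X, Y, f_2(x))$ for sufficiently small $\delta_1$ and hence $h(y, f_1(x), f_2(x)) = 0$. If, on the other hand, $\nabla^B_3 L(X, Y, f_{L,P,\lambda}(x)) \in D$ and $f_1(x) < f_{L,P,\lambda}(x) < f_2(x)$ or $f_2(x) < f_{L,P,\lambda}(x) < f_1(x)$, then we have that $\nabla^B_3 L(X, Y, f_1(x)) \neq \nabla^B_3 L(X, Y, f_2(x))$ and thus $h(y, f_1(x), f_2(x)) \neq 0$. Define $m = 2|D|$.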
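Pinball loss

For the pinball loss $L = L_{\tau\text{-pin}}$ we obtain that $|h(y, f_1(x), f_2(x))| \leq c_1$, with $c_1 = 1$, $D = \{0\}$, $m = 2$, and $\nabla^B_{3,3} L(x, y, t) = 0$, for all $t \in \mathbb{R}$. For all $f \in N_{\delta_1}(f_{L,P,\lambda})$ we get

$$| f(x) - f_{L,P,\lambda}(x) | \leq \| f - f_{L,P,\lambda} \|_\infty \leq \|k\|_\infty \| f - f_{L,P,\lambda} \|_H \leq \|k\|_\infty \delta_1. \tag{47}$$

Furthermore,

$$| f_1(x) - f_2(x) | \leq \| f_1 - f_2 \|_\infty \leq \|k\|_\infty \| f_1 - f_2 \|_H \leq 2 \|k\|_\infty \delta_1. \tag{48}$$

Using (47), (48), and (42) we get

$$\begin{aligned}
A &= \Big\| \mathbb{E}_P \big( \nabla^B_3 L(X, Y, f_1(X)) - \nabla^B_3 L(X, Y, f_2(X)) \big) \cdot \Phi(X) \Big\|_H \\
&\leq \mathbb{E}_P\, | h(Y, f_1(X), f_2(X)) | \, | \Phi(X) | \\
&\leq \|k\|_\infty^2 \, \mathbb{E}_P\, | h(Y, f_1(X), f_2(X)) | \, \mathbf{1}_{\{h \neq 0\}} \\
&\leq \|k\|_\infty^2 \, c_1 \, P\big( \nabla^B_3 L(X, Y, f_1(X)) \neq \nabla^B_3 L(X, Y, f_2(X)) \big) \\
&= \|k\|_\infty^2 \Big( P\big( \{ Y - f_1(X) < 0 \} \wedge \{ Y - f_2(X) > 0 \} \big) + P\big( \{ Y - f_2(X) < 0 \} \wedge \{ Y - f_1(X) > 0 \} \big) \Big) \\
&= \|k\|_\infty^2 \int_{\mathcal{X}} P\big( Y \in (f_2(x), f_1(x)) \mid x \big) + P\big( Y \in (f_1(x), f_2(x)) \mid x \big)\, dP_X(x) \\
&= \|k\|_\infty^2 \int_{\mathcal{X}} P\big( Y \in (f_2(x), f_2(x) + [f_1(x) - f_2(x)]) \mid x \big) + P\big( Y \in (f_1(x), f_1(x) + [f_2(x) - f_1(x)]) \mid x \big)\, dP_X(x) \\
&\leq m \|k\|_\infty^2 \, c_P \int_{\mathcal{X}} | f_1(x) - f_2(x) |^{1 + \xi_P} \, dP_X(x) \\
&\leq m \|k\|_\infty^2 \, c_P \, \| f_1 - f_2 \|_\infty^{1 + \xi_P} \\
&\leq m \, c_P \, \|k\|_\infty^{3 + \xi_P} \, \| f_1 - f_2 \|_H^{1 + \xi_P},
\end{aligned}$$

where $P_X$ denotes the marginal distribution of $X$. Similar calculations yield that $B \leq m \, c_Q \, \|k\|_\infty^{3 + \xi_Q} \, \| f_1 - f_2 \|_H^{1 + \xi_Q}$. Moreover, $C = 0$, because $\nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) = 0$. Hence, the term in (46) is less than or equal to

$$\begin{aligned}
&|1 - \varepsilon| \, m \, c_P \, \|k\|_\infty^{3 + \xi_P} \| f_1 - f_2 \|_H^{1 + \xi_P} + |\varepsilon| \, m \, c_Q \, \|k\|_\infty^{3 + \xi_Q} \| f_1 - f_2 \|_H^{1 + \xi_Q} \\
&\quad = \Big( |1 - \varepsilon| \, c_P \, \|k\|_\infty^{\xi_P} \| f_1 - f_2 \|_H^{\xi_P} + |\varepsilon| \, c_Q \, \|k\|_\infty^{\xi_Q} \| f_1 - f_2 \|_H^{\xi_Q} \Big) \, m \, \|k\|_\infty^3 \, \| f_1 - f_2 \|_H \\
&\quad \leq \varepsilon^* \| f_1 - f_2 \|_H,
\end{aligned}$$

where $\varepsilon^* = \big( |1 - \varepsilon| \, c_P \, \|k\|_\infty^{\xi_P} 2^{\xi_P} \delta_1^{\xi_P} + |\varepsilon| \, c_Q \, \|k\|_\infty^{\xi_Q} 2^{\xi_Q} \delta_1^{\xi_Q} \big) \, m \, \|k\|_\infty^3$.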
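The step in the estimate of $A$ that replaces the event $\{h \neq 0\}$ by the event that $Y$ lies between the two predictions can be checked by simulation. The sketch below is only an illustration under assumptions not made in the text: $Y \sim N(0, 1)$, there is no covariate, and the two predictions are the constants $f_1 = 0.1$ and $f_2 = f_1 + \text{gap}$. The helper name is hypothetical.

```python
import numpy as np

def pinball_d3(y, t, tau):
    """Partial derivative in t of the pinball loss (values for y != t)."""
    return np.where(y - t < 0, 1.0 - tau, -tau)

rng = np.random.default_rng(2)
tau = 0.25
y = rng.standard_normal(100_000)   # Y ~ N(0, 1): an assumption of this sketch
f1 = 0.1
fractions, jump_sizes, matches = [], [], []
for gap in (0.5, 0.05, 0.005):
    f2 = f1 + gap
    h = pinball_d3(y, f1, tau) - pinball_d3(y, f2, tau)
    # h is nonzero exactly when y lies between the two predictions
    # (the boundary point y == f1 has probability zero).
    between = (y >= f1) & (y < f2)
    matches.append(bool(np.array_equal(h != 0, between)))
    fractions.append(float(np.mean(h != 0)))
    jump_sizes.append(float(np.max(np.abs(h))))
    print(gap, fractions[-1], jump_sizes[-1])
```

The size of the jump stays at $c_1 = 1$ however close the two predictions are, while the fraction of observations with $h \neq 0$ shrinks with the gap. This is why a condition such as (42) on the conditional distribution of $Y$ is used to control $A$.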
$\epsilon$-insensitive loss

The proof for the $\epsilon$-insensitive loss $L = L_{\epsilon}$ is analogous to that of $L_{\tau\text{-pin}}$, but now $c_1 = 2$, $D = \{ -\epsilon, +\epsilon \}$, $m = 4$, and therefore four cases instead of two where $h(y, f_1(x), f_2(x)) \neq 0$ need to be considered.

Huber loss

For Huber's loss function $L = L_{c\text{-Huber}}$ we have $| \nabla^B_{3,3} L(x, y, t) | \leq 1 =: c_2$ and $h(y, f_1(x), f_2(x))$ is bounded by $c_1 = 2c$. Let us define

$$h^*(y, f_{L,P,\lambda}(x), f_1(x), f_2(x)) := \nabla^B_3 L(x, y, f_1(x)) - \nabla^B_3 L(x, y, f_2(x)) - \nabla^B_{3,3} L(x, y, f_{L,P,\lambda}(x)) \cdot \big( f_1(x) - f_2(x) \big).$$

Somewhat tedious calculations yield eight cases where $h^*(y, f_{L,P,\lambda}(x), f_1(x), f_2(x)) \neq 0$ and six cases for which $h^*(y, f_{L,P,\lambda}(x), f_1(x), f_2(x)) = 0$. In each of the eight cases, $y - f_{L,P,\lambda}(x) \in \{ -c, c \}$ and $| h^*(y, f_{L,P,\lambda}(x), f_1(x), f_2(x)) | \leq | f_1(x) - f_2(x) |$. Due to the symmetry of the Huber loss function, calculations are quite similar, and thus we only consider here some of those cases.

If $-c < Y - f_{L,P,\lambda}(x) < c$, then $\nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(x)) \cdot ( f_1(x) - f_2(x) ) = f_1(x) - f_2(x)$ and, for sufficiently small $\delta_1$, $\nabla^B_3 L(X, Y, f_1(x)) = -(Y - f_1(x))$ and $\nabla^B_3 L(X, Y, f_2(x)) = -(Y - f_2(x))$. A small calculation shows that $h^*(Y, f_{L,P,\lambda}(x), f_1(x), f_2(x)) = 0$. Straightforward calculations give that $h^*(Y, f_{L,P,\lambda}(x), f_1(x), f_2(x)) = 0$ for the following five cases:
1. $Y - f_{L,P,\lambda}(x) < -c$ or $Y - f_{L,P,\lambda}(x) > c$,
2. $Y - f_{L,P,\lambda}(x) = c$ and $f_{L,P,\lambda}(x) > f_2(x) > f_1(x)$,
3. $Y - f_{L,P,\lambda}(x) = c$ and $f_1(x) > f_2(x) > f_{L,P,\lambda}(x)$,
4. $Y - f_{L,P,\lambda}(x) = -c$ and $f_{L,P,\lambda}(x) > f_2(x) > f_1(x)$,
5. $Y - f_{L,P,\lambda}(x) = -c$ and $f_1(x) > f_2(x) > f_{L,P,\lambda}(x)$.

For $Y - f_{L,P,\lambda}(x) = -c$ and $f_1(x) > f_{L,P,\lambda}(x) > f_2(x)$, we have $\nabla^B_3 L(X, Y, f_1(x)) = c$, $\nabla^B_3 L(X, Y, f_2(x)) = -(Y - f_2(x))$, and $\nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(x)) \cdot ( f_1(x) - f_2(x) ) = 0$. Hence,

$$h^*(Y, f_{L,P,\lambda}(x), f_1(x), f_2(x)) = c + Y - f_2(x) = f_{L,P,\lambda}(x) - f_2(x) \neq 0,$$

since $f_2(x) < f_{L,P,\lambda}(x)$. Similar calculations show that $h^*(Y, f_{L,P,\lambda}(x), f_1(x), f_2(x)) \neq 0$ for the following seven cases:
1. $Y - f_{L,P,\lambda}(x) = c$ and $f_2(x) > f_{L,P,\lambda}(x) > f_1(x)$,
2. $Y - f_{L,P,\lambda}(x) = c$ and $f_{L,P,\lambda}(x) > f_1(x) > f_2(x)$,
3. $Y - f_{L,P,\lambda}(x) = c$ and $f_2(x) > f_1(x) > f_{L,P,\lambda}(x)$,
4. $Y - f_{L,P,\lambda}(x) = c$ and $f_1(x) > f_{L,P,\lambda}(x) > f_2(x)$,
5. $Y - f_{L,P,\lambda}(x) = -c$ and $f_2(x) > f_{L,P,\lambda}(x) > f_1(x)$,
6. $Y - f_{L,P,\lambda}(x) = -c$ and $f_{L,P,\lambda}(x) > f_1(x) > f_2(x)$,
7. $Y - f_{L,P,\lambda}(x) = -c$ and $f_2(x) > f_1(x) > f_{L,P,\lambda}(x)$.
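The case distinction for Huber's loss can be reproduced with a few lines of code. The sketch below rests on assumptions made only for illustration: $c = 1$, $f_{L,P,\lambda}(x) = 0$, and one representative configuration of $(y, f_1(x), f_2(x))$ per case, chosen so that all numbers are exactly representable. The directional form of the second derivative at the kink points $y - t = \pm c$ is written out by hand and the function names are hypothetical.

```python
def huber_d3(y, t, c):
    """First partial derivative in t of Huber's loss."""
    r = y - t
    if r > c:
        return -c
    if r < -c:
        return c
    return -r

def huber_d33_dir(y, t, u, c):
    """One-sided (Bouligand) derivative of t -> huber_d3(y, t, c) in direction u."""
    r = y - t
    if abs(r) < c:
        return u
    if r == c:           # moving t upwards enters the quadratic part
        return u if u > 0 else 0.0
    if r == -c:          # moving t downwards enters the quadratic part
        return u if u < 0 else 0.0
    return 0.0

def h_star(y, f, f1, f2, c):
    return huber_d3(y, f1, c) - huber_d3(y, f2, c) - huber_d33_dir(y, f, f1 - f2, c)

c, f = 1.0, 0.0
# (y, f1, f2) triples: interior, outside, and boundary cases with both
# predictions on the same side of f and f2 between f and f1.
zero_cases = [(0.5, 0.25, -0.25), (2.0, 0.25, -0.25),
              (1.0, -0.5, -0.25), (1.0, 0.5, 0.25),
              (-1.0, -0.5, -0.25), (-1.0, 0.5, 0.25)]
# Boundary cases y - f = +c or -c with the remaining four orderings each.
orderings = [(-0.5, 0.5), (-0.25, -0.5), (0.25, 0.5), (0.5, -0.5)]
nonzero_cases = [(y, f1, f2) for y in (1.0, -1.0) for f1, f2 in orderings]

zero_values = [h_star(y, f, f1, f2, c) for y, f1, f2 in zero_cases]
nonzero_values = [h_star(y, f, f1, f2, c) for y, f1, f2 in nonzero_cases]
print(zero_values)
print(nonzero_values)
```

Under these assumptions the six configurations in `zero_cases` give $h^* = 0$, the eight in `nonzero_cases` give $h^* \neq 0$, and in the latter $|h^*|$ does not exceed $|f_1(x) - f_2(x)|$.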
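Using (43) in (46) we get for the term $A$ that

$$\begin{aligned}
A &= \big\| \mathbb{E}_P\, h^*(Y, f_{L,P,\lambda}(X), f_1(X), f_2(X)) \, \Phi(X) \big\|_H \\
&\leq \|k\|_\infty^2 \int | h^*(y, f_{L,P,\lambda}(x), f_1(x), f_2(x)) | \, \mathbf{1}_{\{h^* \neq 0\}} \, dP(x, y) \\
&\leq \|k\|_\infty^2 \int | f_1(x) - f_2(x) | \, P\big( Y \in \{ -c + f_{L,P,\lambda}(x),\ c + f_{L,P,\lambda}(x) \} \mid x \big) \, dP_X(x) = 0.
\end{aligned}$$

Also

$$C = \Big\| \mathbb{E}_P \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \big( f_1(X) - f_2(X) \big) \Phi(X) \Big\|_H \leq \kappa_2 \, \|k\|_\infty^3 \, \| f_1 - f_2 \|_H.$$

One can compute the analogous terms to $A$ and $C$, say $A(Q)$ and $C(Q)$, respectively, where the integration is with respect to $Q$ instead of $P$. Combining these expressions we obtain

$$\begin{aligned}
B &= \Big\| \mathbb{E}_Q \big( \nabla^B_3 L(X, Y, f_1(X)) - \nabla^B_3 L(X, Y, f_2(X)) \big) \cdot \Phi(X) \Big\|_H \\
&\leq \mathbb{E}_Q \Big| \nabla^B_3 L(X, Y, f_1(X)) - \nabla^B_3 L(X, Y, f_2(X)) - \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \big( f_1(X) - f_2(X) \big) \Big| \, | \Phi(X) | \\
&\quad + \mathbb{E}_Q \Big| \nabla^B_{3,3} L(X, Y, f_{L,P,\lambda}(X)) \cdot \big( f_1(X) - f_2(X) \big) \Big| \, | \Phi(X) | \\
&= A(Q) + C(Q) \leq \kappa_2 \, \|k\|_\infty^3 \, \| f_1 - f_2 \|_H.
\end{aligned}$$

Hence, (46) is less than or equal to $\varepsilon^* \| f_1 - f_2 \|_H$ where $\varepsilon^* = 2 |\varepsilon| \kappa_2 \|k\|_\infty^3$. This gives the assertion, because $|\varepsilon|$ can be chosen arbitrarily small. □

Due to (43), the only exclusions we need for the somewhat smoother Huber loss function are that the conditional probabilities of $Y$ given $X$, both with respect to $P$ and $Q$, have no point probabilities at the points $f_{L,P,\lambda}(x) - c$ and $f_{L,P,\lambda}(x) + c$. Therefore, in this case, we can choose $Q$ to be a Dirac distribution, which then leads to BIF = IF. If the BIF exists for the pinball loss function, calculations give that

$$\mathrm{BIF}(Q; T, P) = \frac{1}{2\lambda} \int_{\mathcal{X}} \big( P(Y \leq f_{L,P,\lambda}(x) \mid x) - \tau \big) \Phi(x) \, dP_X(x) - \frac{1}{2\lambda} \int_{\mathcal{X}} \big( Q(Y \leq f_{L,P,\lambda}(x) \mid x) - \tau \big) \Phi(x) \, dQ_X(x).$$

In this representation, we expect that the first integral is small, since $f_{L,P,\lambda}(x)$ is a good approximation of the $\tau$-quantile of $P(\cdot \mid x)$, for which even rates of convergence are known (Steinwart and Christmann, 2008a,b). From the proof we know that (42) and (43) guarantee that the regular conditional probabilities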
P( jx) and Q( jx) do not have large point masses at those points where the Lipschitz continuous loss function L is not F-differentiable or in small neighborhoods around these points. But even for the existence of the IF in the case of parametric quantile regression, meaning for L ¼ Lτ-pin, λ ¼ 0 and the unbounded linear kernel k(x, x0 ) :¼ hx, x0 i, assumptions on the distribution P are necessary. Koenker (2005, p. 44), for example, assumed that the density of the distribution P is continuous and strictly positive where needed. As the following counterexample shows, it seems—as far as we know— impossible to prove Theorem 6 and Corollary 1 for nonsmooth loss functions without making any assumption on the distributions P and Q. Consider kernel-based quantile regression based on the Gaussian RBF kernel. This means that L ¼ Lτ-pin, k ¼ kRBF, and λ > 0. The set D of discontinuity points of rB3 L is in this case D ¼ f0g. Fix x 2 X and y, y 2 Y with y 6¼ y*, and define P ¼ δ(x,y) and Q ¼ δðx,y Þ . Take two functions in a neighborhood of the SVM, but in such a way that their predictions lie on opposite sides of both y and y*: f1 , f2 2 N δ1 ð fL,P,λ Þ with f1(x) ¼ 6 f2(x), y f1(x) > 0, y f2(x) < 0, y* f1(x) > 0, and y* f2(x) < 0. From the calculations in Section 3.7, we obtain that rB3 Lðx, y, f1 ðxÞÞ ¼ rB3 Lðx, y , f1 ðxÞÞ ¼ τ, rB3 Lðx, y, f2 ðxÞÞ ¼ rB3 Lðx, y , f2 ðxÞÞ ¼ 1 τ, and that rB3,3 Lðx, y, tÞ ¼ 0 for all y, t 2 ℝ. For the H-norm in (45) we then get that Eð1εÞP + εQ rB LðX, Y, f1 ðXÞÞ rB LðX, Y, f2 ðXÞÞ ΦðXÞ 3 3 H ¼ kΦðxÞkH > 0:
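Hence, for this special case, $\nabla^B_2 G(0, f_{L,P,\lambda})$ is not strong as required by Corollary 1, because $\|\Phi(x)\|_H$ is in general larger than $\varepsilon^* \|f_1 - f_2\|_H$ for arbitrarily small values of $\varepsilon^*$. On the other hand, for the smooth loss function $L_{r\text{-log}}$, the assumptions (42) or (43) are not needed to obtain a bounded BIF. $L_{r\text{-log}}$ is clearly strictly convex and F-differentiable with
\[
\begin{aligned}
\nabla^F_3 L_{r\text{-log}}(x, y, t) &= 1 - 2\Lambda(y - t), \\
\nabla^F_{3,3} L_{r\text{-log}}(x, y, t) &= 2\Lambda(y - t)[1 - \Lambda(y - t)], \\
\nabla^F_{3,3,3} L_{r\text{-log}}(x, y, t) &= -2\Lambda(y - t)[1 - \Lambda(y - t)][1 - 2\Lambda(y - t)],
\end{aligned}
\]
where $\Lambda(y - t) = 1/(1 + e^{-(y-t)})$. Obviously, these partial derivatives are bounded for all $y, t \in \mathbb{R}$. Furthermore,
\[
\kappa_1 = \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} |\nabla^F_3 L_{r\text{-log}}(x, y, \cdot)|_1 = 1, \qquad
\kappa_2 = \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} |\nabla^F_{3,3} L_{r\text{-log}}(x, y, \cdot)|_1 \le 1/2,
\]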
because a function g that is F-differentiable is Lipschitz continuous with $|g|_1 = \|\nabla^F g\|_\infty$ if $\nabla^F g$ is bounded. This result allows us to state the following corollary on robustness in the case where the logistic loss for regression is used.
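The derivative formulas above can be checked numerically. The following is a minimal sketch, assuming the usual form of the logistic loss for regression, $L_{r\text{-log}}(x, y, t) = -\ln\big(4\Lambda(y-t)(1-\Lambda(y-t))\big)$ (this definition comes from Section 3.7 and is not restated here); the helper names `Lambda`, `loss`, `d1`, `d2`, `d3`, and `fd` are illustrative only. It compares the closed-form derivatives with central finite differences and evaluates the suprema of the first two derivatives on a grid.

```r
# Logistic distribution function and the logistic loss for regression
# (assumed form: L(y, t) = -log(4 * Lambda(y - t) * (1 - Lambda(y - t)))).
Lambda <- function(r) 1 / (1 + exp(-r))
loss   <- function(y, t) -log(4 * Lambda(y - t) * (1 - Lambda(y - t)))

# Closed-form partial derivatives with respect to t, as stated in the text.
d1 <- function(y, t) 1 - 2 * Lambda(y - t)
d2 <- function(y, t) 2 * Lambda(y - t) * (1 - Lambda(y - t))
d3 <- function(y, t) -2 * Lambda(y - t) * (1 - Lambda(y - t)) * (1 - 2 * Lambda(y - t))

# Central finite difference in t, used as an independent check.
fd <- function(g, y, t, h = 1e-5) (g(y, t + h) - g(y, t - h)) / (2 * h)

y      <- 0.7                      # arbitrary response value
t_grid <- seq(-6, 6, by = 0.25)    # arbitrary evaluation points

err1 <- max(abs(fd(loss, y, t_grid) - d1(y, t_grid)))  # first derivative
err2 <- max(abs(fd(d1,   y, t_grid) - d2(y, t_grid)))  # second derivative
err3 <- max(abs(fd(d2,   y, t_grid) - d3(y, t_grid)))  # third derivative

# Grid approximations of the suprema of |d1| and |d2|.
wide   <- seq(-40, 40, by = 0.01)
sup_d1 <- max(abs(d1(y, wide)))
sup_d2 <- max(abs(d2(y, wide)))
print(c(err1 = err1, err2 = err2, err3 = err3, sup_d1 = sup_d1, sup_d2 = sup_d2))
```

On such a grid the first derivative stays within [-1, 1] and the second within [0, 1/2], which is consistent with the boundedness used in the corollary below.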
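Corollary 2. Let $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}$ be closed, $L = L_{r\text{-log}}$, and P, Q be distributions on $\mathcal{X} \times \mathcal{Y}$ with $E_P |Y| < \infty$ and $E_Q |Y| < \infty$. Then the assumptions of Theorem 6 are valid, and BIF(Q; T, P) of $T(P) := f_{L,P,\lambda}$ exists, is given by (36) to (37), and BIF(Q; T, P) is bounded.
Proof. Both partial F-derivatives $\nabla^F_3 L_{r\text{-log}}(x, y, t) = 1 - 2\Lambda(y - t)$ and $\nabla^F_{3,3} L_{r\text{-log}}(x, y, t) = 2\Lambda(y - t)[1 - \Lambda(y - t)]$ are bounded since $\Lambda(z) \in (0, 1)$, $z \in \mathbb{R}$. It only remains to show that $\nabla^B_2 G(0, f_{L,P,\lambda})$ is strong for $L = L_{r\text{-log}}$, i.e., that (45) is bounded by $\varepsilon^* \|f_1 - f_2\|_H$ for arbitrarily chosen $\varepsilon^* > 0$. A Taylor expansion gives for arbitrary $y, t_1, t_2 \in \mathbb{R}$ that
\[
\Lambda(y - t_2) = \Lambda(y - t_1) + (t_1 - t_2)\Lambda(y - t_1)(1 - \Lambda(y - t_1)) + O((t_1 - t_2)^2). \tag{49}
\]
Combining (26), (47), (48), and (49) we obtain
\[
\begin{aligned}
& \big\| E_P \big( \nabla^B_3 L(X,Y,f_1(X)) - \nabla^B_3 L(X,Y,f_2(X)) - \nabla^B_{3,3} L(X,Y,f_{L,P,\lambda}(X)) \cdot (f_1(X) - f_2(X)) \big) \Phi(X) \big\|_H \\
&\le 2\|k\|_\infty^2 \, E_P \big| \Lambda(Y - f_2(X)) - \Lambda(Y - f_1(X)) - \Lambda(Y - f_{L,P,\lambda}(X))(1 - \Lambda(Y - f_{L,P,\lambda}(X)))(f_1(X) - f_2(X)) \big| \\
&\le 2\|k\|_\infty^2 \, E_P \big| (f_1(X) - f_2(X)) \big[ \Lambda(Y - f_1(X))(1 - \Lambda(Y - f_1(X))) - \Lambda(Y - f_{L,P,\lambda}(X))(1 - \Lambda(Y - f_{L,P,\lambda}(X))) \big] + O((f_1(X) - f_2(X))^2) \big| \\
&\le 2\|k\|_\infty^2 \, E_P \big( \|f_1 - f_2\|_\infty \big| \Lambda(Y - f_1(X))(1 - \Lambda(Y - f_1(X))) - \Lambda(Y - f_{L,P,\lambda}(X))(1 - \Lambda(Y - f_{L,P,\lambda}(X))) \big| + c_3 \|f_1 - f_2\|_\infty^2 \big).
\end{aligned}
\tag{50}
\]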
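A Taylor expansion around $f_{L,P,\lambda}(x)$ shows that $\Lambda(y - f_1(x))(1 - \Lambda(y - f_1(x)))$ equals
\[
\Lambda(y - f_{L,P,\lambda}(x))(1 - \Lambda(y - f_{L,P,\lambda}(x))) + (f_{L,P,\lambda}(x) - f_1(x)) \Lambda(y - f_{L,P,\lambda}(x))(1 - \Lambda(y - f_{L,P,\lambda}(x)))(1 - 2\Lambda(y - f_{L,P,\lambda}(x))) + O((f_1(x) - f_{L,P,\lambda}(x))^2).
\]
Using this expansion together with (26), (47), and (48), it follows that the term in (50) is bounded by
\[
\begin{aligned}
& 2\|k\|_\infty^2 \, E_P \Big( \|f_1 - f_2\|_\infty \Big( \tfrac{1}{4} \|f_1 - f_{L,P,\lambda}\|_\infty + c_4 \delta_1^2 \|k\|_\infty^2 \Big) + c_3 \|f_1 - f_2\|_\infty^2 \Big) \\
&\le \|k\|_\infty^3 \big( \delta_1 \|k\|_\infty / 2 + 2 c_4 \delta_1^2 \|k\|_\infty^2 + 4 c_3 \delta_1 \|k\|_\infty \big) \|f_1 - f_2\|_H .
\end{aligned}
\tag{51}
\]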
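Using the Lipschitz continuity of $\nabla^B_3 L(x, y, \cdot)$, (26), and (49), we obtain
\[
\begin{aligned}
|\varepsilon| \, \big\| E_{Q-P} \big( \nabla^B_3 L(X,Y,f_1(X)) - \nabla^B_3 L(X,Y,f_2(X)) \big) \Phi(X) \big\|_H
&\le |\varepsilon| \, \|k\|_\infty^2 \, E_{|Q-P|} \big| \nabla^B_3 L(X,Y,f_1(X)) - \nabla^B_3 L(X,Y,f_2(X)) \big| \\
&\le |\varepsilon| \, \|k\|_\infty^3 \, \|f_1 - f_2\|_H .
\end{aligned}
\tag{52}
\]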
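Combining (51) and (52) shows that (45) is bounded by $\varepsilon^* \|f_1 - f_2\|_H$ with the positive constant $\varepsilon^* = \|k\|_\infty^3 \big( \delta_1 \|k\|_\infty / 2 + 2 c_4 \delta_1^2 \|k\|_\infty^2 + 4 c_3 \delta_1 \|k\|_\infty + |\varepsilon| \big)$, where $\delta_1 > 0$ and $\varepsilon > 0$ can be chosen as small as necessary. □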
As a remark, we would like to state that Corollary 2 is of course also valid for empirical distributions $D_n$ and $Q_m$ consisting of n and m data points, respectively, because there are no specific assumptions made on P and Q.
Let us now also consider robustness properties of SVMs based on shifted loss functions by means of the BIF (this result was first published in Statistics and Its Interface, Volume 2 (2009), published by International Press). As will become clear, similar results as before hold, showing that SVMs are still robust, even if the data come from a heavy-tailed or extreme value distribution. To this end, set the function $T : \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to H$, $T(P) := f_{L^\star,P,\lambda}$.
Theorem 7 (Bouligand influence function). Let $\mathcal{X} \subset \mathbb{R}^d$ be closed and H be an RKHS of a bounded, continuous kernel k. Let $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a convex, Lipschitz continuous loss function with Lipschitz constant $|L|_1 \in (0, \infty)$. Let the partial B-derivatives $\nabla^B_3 L(x, y, \cdot)$ and $\nabla^B_{3,3} L(x, y, \cdot)$ be measurable and bounded by
\[
\kappa_1 := \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \|\nabla^B_3 L(x, y, \cdot)\|_\infty \in (0, \infty), \qquad
\kappa_2 := \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \|\nabla^B_{3,3} L(x, y, \cdot)\|_\infty < \infty. \tag{53}
\]
Let P and $Q \ne P$ be probability measures on $\mathcal{X} \times \mathcal{Y}$, $\delta_1 > 0$, $\delta_2 > 0$, $N_{\delta_1}(f_{L^\star,P,\lambda}) := \{ f \in H : \|f - f_{L^\star,P,\lambda}\|_H < \delta_1 \}$, and $\lambda > \frac{1}{2} \kappa_2 \|k\|_\infty^3$. Define $G : (-\delta_2, \delta_2) \times N_{\delta_1}(f_{L^\star,P,\lambda}) \to H$,
\[
G(\varepsilon, f) := 2\lambda f + E_{(1-\varepsilon)P + \varepsilon Q} \nabla^B_3 L^\star(X, Y, f(X)) \Phi(X), \tag{54}
\]
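and assume that $\nabla^B_2 G(0, f_{L^\star,P,\lambda})$ is strong. Then the Bouligand influence function BIF(Q; T, P) of $T(P) := f_{L^\star,P,\lambda}$ exists, is bounded, and equals
\[
S^{-1} \big( E_P \nabla^B_3 L^\star(X, Y, f_{L^\star,P,\lambda}(X)) \Phi(X) \big) - S^{-1} \big( E_Q \nabla^B_3 L^\star(X, Y, f_{L^\star,P,\lambda}(X)) \Phi(X) \big), \tag{55}
\]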
where $S := \nabla^B_2 G(0, f_{L^\star,P,\lambda}) : H \to H$ is given by
\[
S(\cdot) = 2\lambda \, \mathrm{id}_H(\cdot) + E_P \nabla^B_{3,3} L^\star(X, Y, f_{L^\star,P,\lambda}(X)) \langle \Phi(X), \cdot \rangle_H \Phi(X).
\]
Proof. From Section 5.1 we know that $f_{L^\star,P,\lambda}$ exists and is unique. By definition of $L^\star$ it follows from (31) that $\nabla^B_3 L(x, y, t) = \nabla^B_3 L^\star(x, y, t)$. Therefore,
\[
G(\varepsilon, f) := 2\lambda f + E_{(1-\varepsilon)P + \varepsilon Q} \nabla^B_3 L^\star(X, Y, f(X)) \Phi(X) = 2\lambda f + E_{(1-\varepsilon)P + \varepsilon Q} \nabla^B_3 L(X, Y, f(X)) \Phi(X).
\]
Hence $G(\varepsilon, f)$ is the same as in Theorem 6. All conditions of Theorem 6 are fulfilled since we assume that $\nabla^B_2 G(0, f_{L^\star,P,\lambda})$ is strong. Therefore, the proof of Theorem 7 is identical to the proof of Theorem 6, which is based on the implicit function theorem for B-derivatives (Theorem 4), and the assertion follows. □
Note that also this time the Bouligand influence function of the SVM only depends on Q through the second term in (55). A similar result for the classical IF was derived by Christmann et al. (2009, Theorem 10). It can be easily verified that the assumptions on L stated in that theorem are fulfilled for, e.g., the F-differentiable logistic loss functions for classification and for regression for which $(\kappa_1, \kappa_2) = (1, 1/4)$ and $(\kappa_1, \kappa_2) = (1, 1/2)$, respectively, in combination with any bounded and continuous kernel, such as the Gaussian RBF kernel. The use of the BIF becomes particularly interesting to study robustness properties of statistical functionals if these are defined as minimizers of non-F-differentiable objective functions such as, e.g., the ϵ-insensitive loss or the pinball loss, for which $(\kappa_1, \kappa_2) = (1, 0)$ and $(\kappa_1, \kappa_2) = (\max\{1 - \tau, \tau\}, 0)$, respectively (see Section 3.7).
As a final remark, we would like to point out that the expression for the BIF (as well as the one for the IF) is proportional to $\lambda^{-1}$, and thus will go to infinity for $\lambda \to 0$. Although this seems counter-intuitive to the concept of robustness, there is a certain logic to it. Large values of λ force the SVM $f_{L^\star,P,\lambda}$ to be smoother, thereby limiting the influence of any perturbation Q.
This fact, however, is not so much linked to the intrinsic robustness of the method, as it is related to the regularization itself. On the other hand, small λ will allow for an interpolation of the data, which leaves room for even a single point to have a large influence on the estimated curve, and thus can lead to large biases.
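To make the $\lambda^{-1}$ dependence explicit, it may help to specialize (55) to a loss with $\kappa_2 = 0$, such as the pinball loss or the ϵ-insensitive loss mentioned above. The following is a worked special case under that assumption only (so that the second-order term in S vanishes); it is a sketch, not an additional result.

```latex
\[
\nabla^{B}_{3,3} L^{\star} \equiv 0
\;\Longrightarrow\;
S = 2\lambda\,\mathrm{id}_H ,
\qquad
S^{-1} = \frac{1}{2\lambda}\,\mathrm{id}_H ,
\]
\[
\mathrm{BIF}(Q;T,P)
= \frac{1}{2\lambda}\Big(
E_P\,\nabla^{B}_{3} L^{\star}\big(X,Y,f_{L^{\star},P,\lambda}(X)\big)\,\Phi(X)
- E_Q\,\nabla^{B}_{3} L^{\star}\big(X,Y,f_{L^{\star},P,\lambda}(X)\big)\,\Phi(X)
\Big).
\]
```

In this case the prefactor $1/(2\lambda)$ appears directly. Note that the two expectations also depend on λ through $f_{L^\star,P,\lambda}$, so the BIF is not simply a fixed element of H divided by λ.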
These robustness results for basic SVMs, both in terms of the IF and the BIF, form the basis for research on more specialized topics with respect to the robustness of support vector machines. Hable and Christmann (2013), for example, investigate the goal conflict between robustness and consistency in ill-posed problems. Dumpert and Christmann (2018) and Dumpert (2019) take a look at the robustness of localized SVMs, which have been specifically developed to deal with the ever-increasing amount of data and the related runtime and storage issues by adapting the hyperparameters of the SVM in certain regions to accommodate local behavior of the prediction function. Christmann and Hable (2012) first construct support vector machines for additive models and then prove their consistency and robustness. Through a linear combination of SVMs, Hable and Christmann (2014) estimate the scale functions median absolute deviation and interquartile range in a robust way. Xu et al. (2009) look at robustness from a different angle and argue that robustness is the reason that regularized SVMs generalize well (contrary to the common view that robustness is a consequence of the regularization). Hable and Christmann (2011) investigate the qualitative robustness of SVMs, while Christmann et al. (2013) do the same for the bootstrap approximations of SVMs. Finally, Christmann et al. (2018) look at a more general form of robustness, where not only deviations in the probability measure P are considered, but the regularization parameter λ and the kernel k may also change slightly.
Key learnings
• SVMs are proven to be robust in the sense of influence functions in binary classification and regression problems.
• These results cover the following losses: $L_{\epsilon}$, $L_{\tau\text{-pin}}$, $L_{c\text{-Huber}}$, $L_{r\text{-log}}$, and $L_{c\text{-log}}$.
6 (†) Applications
Ever since their introduction in the early nineties, support vector machines have been state of the art, often outperforming other techniques, including neural networks. However, convolutional neural networks have nowadays become a formidable competitor. Nevertheless, SVMs are still widely applied in a broad range of domains (either by themselves or in combination with one or more other techniques). Without trying to be exhaustive, SVMs have been successfully applied for face detection/recognition (Judith and Suchitra, 2018; Kumar et al., 2019), for text classification (Goudjil et al., 2018), for image classification (Jain et al., 2018), for handwriting recognition (Kessentini et al., 2018), in geological and environmental sciences (De Boissieu et al., 2018), in finance and insurance (Tran et al., 2018), and last but not least, in bioinformatics. Most problems in the latter field are classification problems, such as diagnosis of brain tumors based on MRI images (Bauer et al., 2011), tissue classification (e.g., for cancer) based on microarray data (Haussler et al., 2000),
gene function prediction from microarray data (Brown et al., 2000), protein secondary structure prediction (Guo et al., 2004), protein fold prediction (Li et al., 2016), or splice site detection (Bari et al., 2014; Degroeve et al., 2005; Sonnenburg et al., 2007). Some examples of regression type problems are prediction of the gene expression as a function of its transcriptors (Cheng et al., 2011), survival analysis (Fouodo et al., 2018), or drug screening.
Before looking into some examples, we would like to draw attention to the estimation of the hyperparameters. The quality of the estimator $R_{L,D}(f_{L,D,\lambda})$ and the accuracy of predictions $f_{L,D,\lambda}(x)$ for unseen $x \in \mathcal{X}$ do not only depend on the data set used for training the SVM but also on the utilized hyperparameters. Hence, the choice of these is of crucial importance. Unfortunately, determining the optimal values usually requires computing $f_{L,D,\lambda}$ for a large number of different combinations of the hyperparameters, which means that, instead of solving a single convex problem, we now have to solve a series of convex problems. Furthermore, a reasonable choice of the hyperparameters will depend on the criteria used to measure their quality. For regression problems, this criterion is usually a minimization of the empirical L-risk. In the context of classification, a common choice is looking at accuracy, e.g., through minimizing the misclassification rate. Various methods to obtain the optimal values of these parameters exist, none of which is optimal for all data sets or is applicable to samples of all sizes. Most often the parameters are chosen in a data-dependent manner by methods such as random search, cross-validation, via a grid search, or through a training-validation SVM. Optimization through a grid search is straightforward. Each dimension of the search space, which is the space of all possible combinations of the hyperparameters, is split up into parts.
The intersections of these splits form the grid with the trial points for which the objective function is calculated. The best performing point is then taken. A two-stage grid search is also possible: first, the space is covered by a rough grid, after which the optimal point from that broad search is used as the center of a finer grid search. If the search range is large enough and the grid is fine enough, it is very unlikely that the algorithm finds a local optimum instead of the global one. On the other hand, the larger and the finer the grid, the more time consuming the method becomes. Another standard technique is (k-fold) cross-validation, which is mostly used for relatively small to moderate-sized data sets. It randomly divides the data set into k equal-sized disjoint subsets, called the folds, and in turn uses each fold as a validation set while the other k − 1 folds together make up the training set. Although cross-validation is widely used in practice, there are some disadvantages to the method, see, e.g., Schölkopf and Smola (2002). A major drawback is the apparent danger of overfitting, since the training set and the validation set are related to each other. Another disadvantage is that the optimal hyperparameters depend on the number of folds (and thus the size of each subset), since smaller training sets often require larger values of the regularization
parameter λ, which in turn can lead to different values for the other hyperparameters. In Liu and Liao (2017), a method based on the BIF is proposed to automatically choose the ideal number of folds, and in Liu et al. (2019), the authors even suggest an alternative to approximate k-fold cross-validation of kernel methods, including SVMs, for which only a single training step on the full data is needed, hence significantly increasing efficiency. A simple method specifically developed for choosing the regularization parameter λ is the use of a training-validation SVM (TV-SVM) (Steinwart and Christmann, 2008b, Chapter 6.5). The idea of TV-SVMs is to use the training set to construct a number of SVM decision functions and then use the decision function that performs best on an independent validation set. We would like to point out that the validation step does not necessarily provide a unique regularization parameter. Steinwart and Christmann (2008b) show that for all interesting cases, a measurable TV-SVM will exist and prove the consistency of the method. Other methods include the Nelder–Mead algorithm (Nelder and Mead, 1965), heuristic choices of the hyperparameters (Cherkassky and Ma, 2004; Mattera and Haykin, 1999), or pattern search (Momma and Bennett, 2002).
The rest of this section is dedicated to some applications of SVMs. In the first example, our goal is to empirically demonstrate the robustness of the method. Through a simple regression problem, we will show that the inclusion of (a large number of) outliers has little influence on the predictor function. The second example is conceived as a step-by-step explanation of the application of SVMs in a classification setting: we try to predict whether or not a breast cancer patient will suffer from distant metastasis. To end this section, we discuss the use of SVMs in the more complex problem of splice site detection to show their versatility and usefulness in modern-day problems.
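The grid search and k-fold cross-validation described above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the procedure used for the examples in this section: the data are simulated, and kernel ridge regression (least squares loss with an RKHS penalty) is used as a simple stand-in learner instead of an SVM so that no extra package is needed. All function names (`rbf_kernel`, `fit_krr`, `predict_krr`, `cv_risk`) and grid values are hypothetical.

```r
set.seed(1)

# Simulated one-dimensional regression data (hypothetical, for illustration only).
n <- 120
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)

# Gaussian RBF kernel matrix, k(a, b) = exp(-(a - b)^2 / gamma^2).
rbf_kernel <- function(a, b, gamma) exp(-outer(a, b, "-")^2 / gamma^2)

# Stand-in learner: kernel ridge regression, NOT an SVM.
fit_krr <- function(x, y, lambda, gamma) {
  K <- rbf_kernel(x, x, gamma)
  alpha <- solve(K + length(y) * lambda * diag(length(y)), y)
  list(x = x, alpha = alpha, gamma = gamma)
}
predict_krr <- function(model, x_new) {
  as.vector(rbf_kernel(x_new, model$x, model$gamma) %*% model$alpha)
}

# k-fold cross-validation: each fold serves once as validation set,
# the remaining k - 1 folds form the training set.
cv_risk <- function(x, y, lambda, gamma, folds) {
  errs <- sapply(sort(unique(folds)), function(j) {
    val <- folds == j
    model <- fit_krr(x[!val], y[!val], lambda, gamma)
    mean((y[val] - predict_krr(model, x[val]))^2)
  })
  mean(errs)
}

# Random assignment of the observations to k equal-sized disjoint folds.
k <- 5
folds <- sample(rep(seq_len(k), length.out = n))

# Grid: every combination of the trial values on the two hyperparameter axes.
grid <- expand.grid(lambda = 2^seq(-10, 0, by = 2), gamma = 2^seq(-2, 2, by = 1))
grid$risk <- mapply(cv_risk, lambda = grid$lambda, gamma = grid$gamma,
                    MoreArgs = list(x = x, y = y, folds = folds))

# The best performing grid point is taken.
best <- grid[which.min(grid$risk), ]
print(best)
```

A two-stage search would repeat the `expand.grid` step on a finer grid centered at `best`. Replacing `fit_krr` and `predict_krr` by an SVM fit and its prediction function leaves the rest of the sketch unchanged.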
The computations for the first and second example were carried out in R (R Development Core Team, 2009). The function svm from the library e1071 was used. Please note that the regularization parameter λ is encoded in svm through a cost parameter C, where the relation between both is $\lambda = \frac{1}{2nC}$. Furthermore, for the γ-parameter in the Gaussian RBF kernel the relation $\gamma = \sqrt{1/\gamma_R}$ holds, where $\gamma_R$ is the γ-parameter used in the R-function.
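The two relations above are easy to get wrong when moving between the notation of this chapter and the arguments of svm, so a small conversion sketch may be useful. The helper names below are hypothetical and not part of e1071; they only encode $\lambda = 1/(2nC)$ and $\gamma = \sqrt{1/\gamma_R}$ and their inverses.

```r
# Conversions between the chapter's parametrization (lambda, gamma)
# and the cost / gamma arguments used by e1071::svm (C, gamma_R).
# Helper names are illustrative only.
lambda_to_cost  <- function(lambda, n) 1 / (2 * n * lambda)
cost_to_lambda  <- function(cost, n)   1 / (2 * n * cost)
gamma_to_gammaR <- function(gamma)     1 / gamma^2
gammaR_to_gamma <- function(gamma_R)   sqrt(1 / gamma_R)

# Example with arbitrary values: n = 100 observations, lambda = 0.01, gamma = 2.
n       <- 100
cost    <- lambda_to_cost(0.01, n)   # cost argument for svm
gamma_R <- gamma_to_gammaR(2)        # gamma argument for svm
print(c(cost = cost, gamma_R = gamma_R))
```

Note that the cost parameter depends on the sample size n, so the same λ corresponds to different values of C for data sets of different sizes.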
6.1 Predicting blood pressure through BMI in the presence of outliers
In this example, we want to demonstrate the robustness of SVMs by showing that the influence of outliers on the prediction is limited. To that end, we consider the data set NHANES, which contains survey data that have been collected by the US National Center for Health Statistics (NCHS) ever since the early 1960s. The data set in the R-package NHANES has been adapted
from the original data set for educational purposes, and contains data from the survey years 2009 up to 2011. Some of the measured variables are the systolic blood pressure and the BMI of each person. Our goal is to predict the systolic blood pressure as a function of the BMI. This is, of course, not very realistic in a real-world setting, but our aim is not to create a strong prediction model based on a set of correctly chosen features, but merely to show that introducing outliers in the data set will have little impact on the predictions made by the model. The reason we only consider one predictor variable is to be able to plot the SVM. From the NHANES data set, we selected all persons from 40 to 49 years old and deleted those with missing values, resulting in $n = 1299$ data points. The SVM regression was performed by using the ϵ-insensitive loss function and the polynomial kernel of degree 2. The hyperparameters $(\lambda, \epsilon)$ of the SVM have been determined by minimizing the L-risk via a two-dimensional grid search over $17 \times 5 = 85$ knots, where λ is the regularization parameter of the SVM, and ϵ is the parameter of the ϵ-insensitive loss function. For each knot in the grid, an SVM was fitted to the data points and the corresponding $L^\star$-risk (since we will add outliers) was calculated. The grid search resulted in the choice $(\lambda, \epsilon) = (2^{8}/n, 0.01)$ as optimal parameter set. Next, we randomly selected 5% of the data and multiplied them by 10 to simulate errors in handling the data. We fitted an SVM to this new data set, following exactly the same modelling scheme. The optimal parameters for the new model are $(\lambda, \epsilon) = (2^{7}/n, 0.01)$. Fig. 6 shows both fitted SVMs (blue full line for the original data, red dashed line for the modified data). The upper subplot clearly shows that the data set contains extreme values because of the multiplication with a factor 10. Nevertheless, the SVMs fit the pattern set by the majority of the data points quite well.
Due to these outliers, a relatively small difference between the SVMs would not be visible. Therefore, we zoomed in on the y-axis, as shown in the lower subplot of Fig. 6. In this graph, we see that there is almost no bias between both models. In fact, the difference between both curves is at most 1.3695. Hence, this example confirms the theoretical robustness results: an SVM is robust in the presence of outliers, if the sample size is large enough.
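The contamination experiment can be imitated along the following lines. This is only a sketch under several assumptions: the data are simulated and merely stand in for the NHANES subset (the coefficients are arbitrary), only the response values are multiplied by 10 (as the axes of Fig. 6 suggest), and the cost and epsilon arguments are fixed instead of being selected by the grid search described above. The helper `contaminate` is hypothetical. The SVM fits are only run if the package e1071 is installed.

```r
set.seed(42)

# Simulated stand-in for the NHANES subset (hypothetical numbers).
n   <- 500
bmi <- runif(n, min = 18, max = 45)
sbp <- 100 + 0.8 * bmi + rnorm(n, sd = 12)

# Multiply a random 5% of the responses by 10 to mimic data-handling errors.
contaminate <- function(y, fraction = 0.05, factor = 10) {
  idx <- sample(length(y), size = round(fraction * length(y)))
  y[idx] <- factor * y[idx]
  list(y = y, idx = idx)
}
cont <- contaminate(sbp)

clean <- data.frame(bmi = bmi, sbp = sbp)
dirty <- data.frame(bmi = bmi, sbp = cont$y)

# Fit both SVMs only if e1071 is available; cost and epsilon are fixed here
# for brevity instead of being tuned by a grid search.
if (requireNamespace("e1071", quietly = TRUE)) {
  fit_svm <- function(d) {
    e1071::svm(sbp ~ bmi, data = d, type = "eps-regression",
               kernel = "polynomial", degree = 2, coef0 = 1,
               cost = 1, epsilon = 0.01)
  }
  bmi_grid   <- data.frame(bmi = seq(18, 45, length.out = 200))
  pred_clean <- predict(fit_svm(clean), newdata = bmi_grid)
  pred_dirty <- predict(fit_svm(dirty), newdata = bmi_grid)
  max_diff   <- max(abs(pred_clean - pred_dirty))  # largest gap between the curves
  print(max_diff)
}
```

Keep in mind that svm standardizes the variables by default (argument scale), so the effective value of epsilon refers to the scaled response; the resulting gap between the two curves therefore need not match the value reported above.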
6.2 Breast cancer distant metastasis through gene expression
Both van ’t Veer et al. (2002) and van de Vijver et al. (2002) were interested in finding a gene expression signature which would be strongly predictive of a short interval to distant metastases (“poor prognosis” signature) in breast cancer patients. Their goal was to find a strategy to select patients who would benefit from adjuvant therapy. Distant metastasis means that the cancer has spread from the original tumor to distant organs or lymph nodes.
[Figure 6: two stacked panels, each plotting systolic blood pressure (y-axis) against BMI (x-axis, 20 to 60); the y-axis of the upper panel extends to 1500, that of the lower panel runs from 80 to 200.]
FIG. 6 The upper subplot shows all data points, including outliers, as well as the SVM fitted to the original data (blue full line) and the one fitted to the modified data (red dashed line). The difference between both functions is hardly visible. The lower subplot zooms in on the y-axis. It shows that the bias between both SVMs is almost negligible, despite the presence of extreme values.
Adjuvant systemic therapy substantially improves the overall survival in women up to the age of 70 with breast cancer, but is most beneficial for patients with a poor prognosis. Using microarray analysis, they classified a number of patients with primary breast carcinomas as having a gene expression signature associated with either a poor or a good prognosis. The gene expression data sets they used were later collected by Schroeder et al. (2019) and made available through Bioconductor (http://bioconductor.org/), a project that provides open source software and data sets for bioinformatics. In what follows, we construct a support vector machine that aims at predicting, based on the gene expression, whether or not a patient is likely to have distant metastasis. For patients identified to be more susceptible to
distant metastasis, a more thorough follow-up of the distant lymph nodes and organs could then be envisioned. To install the Bioconductor packages, enter (R version 3.6 or higher is required):
> if (!requireNamespace("BiocManager", quietly = TRUE))
+     install.packages("BiocManager")
> BiocManager::install("breastCancerNKI")
> BiocManager::install("Biobase")
We first fix a random seed to ensure that the results are reproducible, and open the necessary libraries: e1071 for the SVM, breastCancerNKI for the data, and Biobase to be able to access the different parts of the data. The data set nki is imported in R.
> set.seed(23)
> library(e1071)
> library(breastCancerNKI)
> library(Biobase)
> data(nki)
A special class, ExpressionSet, has been created to store expression data from microarray experiments, as well as phenotype data and information about the experiment. The assay data can be retrieved through the function exprs from the package Biobase. In this matrix, the rows correspond to the different genes, whereas the columns represent the individuals. Using the function complete.cases, we first remove any genes with missing values, since the function svm cannot handle incomplete data. Next, we transpose the data to ensure that the different variables, i.e., gene expressions, correspond to the columns of the matrix. The outcome variable we are interested in is stored among the phenotype data, accessible with pData. The binary variable e.dmfs indicates whether or not the event of distant metastasis occurred (1 ¼ event occurred). > > > > > >
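The preprocessing steps just described (dropping genes with missing values, then transposing so that genes become columns) can be illustrated on a small toy matrix. This is a sketch only: `expr_toy` is a hypothetical stand-in for the matrix returned by exprs, with made-up values, and `keep` and `x_toy` are illustrative names.

```r
# Toy stand-in for an expression matrix: rows = genes, columns = individuals.
expr_toy <- matrix(c( 0.1,   NA,  0.3,
                      1.2,  0.8,  1.0,
                     -0.5, -0.2,   NA,
                      2.0,  1.9,  2.1),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("gene", 1:4), paste0("patient", 1:3)))

# complete.cases works row-wise: TRUE for genes without any missing value.
keep <- complete.cases(expr_toy)

# Keep the complete genes and transpose, so that rows are individuals
# and columns are the gene expression variables.
x_toy <- t(expr_toy[keep, , drop = FALSE])
print(dim(x_toy))
print(x_toy)
```

Because complete.cases is applied before the transposition, it removes incomplete genes rather than incomplete individuals, which matches the order of the steps in the text.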
x
dim(x) summary(y) x > > > > > > > > > > + > >
total.length > > > > > > > > > > > > > >
cost_vector
library("multtest") # Insert this part after # index.val