Statistical Methods at the Forefront of Biomedical Advances
Editor: Yolanda Larriba, Department of Statistics and Operations Research, University of Valladolid, Valladolid, Spain
ISBN 978-3-031-32728-5    ISBN 978-3-031-32729-2 (eBook)
https://doi.org/10.1007/978-3-031-32729-2
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface
Statistics and biomedical research have been strongly linked since the nineteenth century. The works of Quetelet (1796–1874) and Fisher (1890–1962) marked the beginning of the development of statistical methods in medicine and biomedical science and prompted a novel discipline, formally called Biostatistics. Biostatistics can be defined as the science of data reduction methods, variability, and populations in the life sciences, medicine, public health, and biology. The technological and computational advances of the twentieth century confronted Biostatistics with the formidable challenge of translating information into knowledge, with significant contributions of statistical methodology to design, implementation, and analysis in areas such as laboratory medicine, randomized clinical trials, clinical decision-making, and new drug development.

Over the last decades, an explosive growth of scientific work in biomedical research and public health has produced cutting-edge statistical methodologies as well as a source of new problems. However, there is still evidence of a lack of integration between the statistical and medical communities, most often because of the absence of a common language between the two areas. This lack of mutual comprehension sometimes translates into skepticism, or into a reluctance within the medical community to accept that biostatistical concepts are an integral part of sound medical research; consequently, their practical application is delayed.

This book aims to present outstanding, user-friendly statistical methods and software that enhance advances in hot topics in biomedical research, written in an easily comprehensible language with the purpose of being reproducible in the clinic. Specifically, it consists of a collection of 11 scientific works contributed by some of the leading experts in the mathematical and statistical field, addressing new challenges in very disparate biomedical areas. From a methodological point of view, this book takes no single stance: it ranges from classical statistical modelling, hypothesis testing procedures, and probability distribution distances to exciting directional statistics approaches and pioneering machine learning methods. Frequentist and Bayesian statistics coexist in the different chapters of this book, as do the parametric and non-parametric perspectives. From the applied point of
view, this book addresses open questions in genetics, cytogenetics, microbiome research, ophthalmology, cardiology, and chronobiology, among many others.

This book is intended for scientists, researchers, and professors interested in learning more about pioneering statistical methods in biomedical research. The accessible writing and computational guidance also make it a reference handbook for clinicians and for professionals in biomedical engineering and medical devices.

I would like to thank all the distinguished and expert authors for their enthusiastic participation in this book. It was with utmost kindness and gratefulness that they accepted my invitation and agreed to multiple interactions and chapter revisions. Finally, I sincerely thank the anonymous reviewers and the Springer staff for their support, understanding, and encouragement during the editing of this book over almost two years. I hope that this book serves as a bridge between the statistical and medical communities and perhaps as a source of inspiration to continue pushing scientific boundaries every day.

Valladolid, Spain
February 2023
Yolanda Larriba
Contents
1 Multivariate Disease Mapping Models to Uncover Hidden Relationships Between Different Cancer Sites
  Aritz Adin, Tomás Goicoa, and María Dolores Ugarte
  1.1 Introduction
  1.2 M-Models for Multivariate Spatio-Temporal Areal Data
    1.2.1 Model Implementation and Identifiability Constraints
  1.3 Joint Analysis of Lung, Colorectal, Stomach, and LOCP Cancer Mortality Data in Spanish Provinces
    1.3.1 Descriptive Analysis
    1.3.2 Model Fitting Using INLA
  1.4 Discussion
  References

2 Machine Learning Applied to Omics Data
  Aida Calviño, Almudena Moreno-Ribera, and Silvia Pineda
  2.1 Introduction
  2.2 Data Types
    2.2.1 Genomics
    2.2.2 Immunomics
  2.3 Challenges in the Omics Data Analysis
  2.4 Machine Learning Techniques
    2.4.1 Random Forests
    2.4.2 Multinomial Logistic Regression
    2.4.3 Association Rules
  2.5 Application
    2.5.1 Study Subjects
    2.5.2 Material and Methods
    2.5.3 Results
  2.6 Conclusions and Future Work
  References

3 Multimodality Tests for Gene-Based Identification of Oncological Patients
  Jose Ameijeiras-Alonso and Rosa M. Crujeiras
  3.1 Introduction
  3.2 Analysing the Number of Groups
  3.3 Application to the Gene Expression of Breast Cancer Patients
  3.4 Conclusions and Future Work
  References

4 Hippocampus Shape Analysis via Skeletal Models and Kernel Smoothing
  Eduardo García-Portugués and Andrea Meilán-Vila
  4.1 Introduction
  4.2 Methodology
    4.2.1 Kernel Smoothing on the Polysphere
    4.2.2 Density Ridges
  4.3 Results
    4.3.1 An Illustrative Numerical Example
    4.3.2 Main Mode of Variation of Hippocampus Shapes
  4.4 Discussion
  References

5 Application of Quantile Regression Models for Biomedical Data
  Mercedes Conde-Amboage, Ingrid Van Keilegom, and Wenceslao González-Manteiga
  5.1 Introduction
  5.2 The New Testing Procedure
    5.2.1 Bootstrap Approximation
    5.2.2 Computational Aspects
  5.3 Simulation Study
  5.4 Real Data Application
  5.5 Conclusions
  References

6 Advances in Cytometry Gating Based on Statistical Distances and Dissimilarities
  Hristo Inouzhe
  6.1 Introduction
  6.2 Dissimilarities and Distances
    6.2.1 Wasserstein Distance
    6.2.2 Maximum Mean Discrepancy
    6.2.3 Kullback–Leibler Divergence
    6.2.4 Hellinger Distance
    6.2.5 Friedman–Rafsky Statistic
  6.3 Applications to the Gating Workflow
    6.3.1 Grouping Cytometric Datasets
    6.3.2 Template Production
    6.3.3 Interpolation Between Cytometry Datasets
  6.4 Conclusions
  References

7 Derivation of Optimal Experimental Design Methods for Applications in Cytogenetic Biodosimetry
  Manuel Higueras and Jesús López-Fidalgo
  7.1 Introduction
  7.2 Dose–Response Model
  7.3 Optimal Experimental Design
  7.4 OED for Cytogenetic Biodosimetry
  7.5 Applied Examples
    7.5.1 Dicentric Plus Ring Chromosomes
    7.5.2 Translocation Assay
  7.6 Conclusions
  References

8 Multiple Imputation for Compositional Data (MICoDa) Adjusting for Covariates
  Abhisek Saha, Diane L. Putnick, Huang Lin, Edwina Yeung, Rajeshwari Sundaram, and Shyamal Das Peddada
  8.1 Introduction
  8.2 MICoDa Methodology
  8.3 Performance of MICoDa Using a Simulation Study
  8.4 Illustration
    8.4.1 Children's Activity Data
    8.4.2 Diversity in Microbial Compositions
  8.5 Discussion
  References

9 Integral Analysis of Circadian Rhythms
  Jesús Vicente-Martínez, Pedro Francisco Almaida-Pagan, Antonio Martinez-Nicolas, Juan Antonio Madrid, Maria-Angeles Rol, and María-Ángeles Bonmatí-Carrión
  9.1 Introduction
  9.2 Techniques Used to Measure Circadian Rhythms
    9.2.1 Motor Activity
    9.2.2 Thermometry
    9.2.3 Light Exposure
    9.2.4 Multivariable Ambulatory Circadian Monitoring
    9.2.5 Feeding Rhythms (Diary, Event Marker, Animals)
    9.2.6 Hormone Determination (Sampling, Analytic Techniques, Melatonin, DLMO)
    9.2.7 Gene Expression
  9.3 Importance of the Frequency
  9.4 Determinism in Time Series
  9.5 Analysis of Rhythmicity
    9.5.1 First Step: Filtering the Data
    9.5.2 Parametric Methods to Determine the Existence of Circadian Rhythmicity
    9.5.3 Non-parametric Analysis
    9.5.4 Intrinsic Circular Nature of the Data
  9.6 Dynamic Circadian Models
    9.6.1 Inputs and Outputs: Phase Response Curves
    9.6.2 Kronauer's Dynamic Model
    9.6.3 Two-Process Model of Sleep Regulation
    9.6.4 Combining Circadian Oscillators and Sleep Models
    9.6.5 Further Reading
  9.7 Conclusion and Future Work
  References

10 Modelling the Circadian Variation of Electrocardiographic Parameters with Frequency Modulated Models
  Yolanda Larriba and Cristina Rueda
  10.1 Introduction
  10.2 Methods
    10.2.1 FMM Background
    10.2.2 HRV Wave (HRV-WA) Algorithm
  10.3 Results
    10.3.1 HRV-WA Performance Modeling 24-h HRV. Cosinor Comparison
    10.3.2 HRV-WA as a Tool for an Interpretable Characterization of HRV
  10.4 Discussion
  References

11 Novel Modeling Proposals for the Analysis of Pattern Electroretinogram Signals
  Christian Canedo, Itziar Fernández, Rosa M. Coco, Rubén Cuadrado, and Cristina Rueda
  11.1 Introduction
  11.2 Methodological Background
    11.2.1 Basic Concepts
    11.2.2 The FMM Model
    11.2.3 Measure of Goodness of Fit
    11.2.4 Identification and Estimation Algorithm for Incomplete Oscillatory Signals
    11.2.5 FMM Model with Autoregressive Errors: The FMM(q) Model
  11.3 Results
    11.3.1 Real Data Analysis
    11.3.2 Simulation Experiments
  11.4 Discussion
  References

Index
Chapter 1
Multivariate Disease Mapping Models to Uncover Hidden Relationships Between Different Cancer Sites

Aritz Adin, Tomás Goicoa, and María Dolores Ugarte
Department of Statistics, Computer Science and Mathematics, and Institute for Advanced Materials and Mathematics (InaMat2), Public University of Navarre, Pamplona, Spain
Abstract Multivariate spatio-temporal disease mapping models offer some advantages over their univariate counterparts, as they enhance borrowing of strength across diseases and unveil connections between them. Though theoretically well founded, multivariate disease mapping modelling is not generally used in practice due to the computational burden and the difficulty of implementation in common statistical software by users without computing training. Here we consider a multivariate modelling approach to disease mapping based on M-models, and we use the Bartlett decomposition of Wishart matrices to avoid over-parameterization of certain covariance matrices. The models are implemented in R through the R-INLA package so that they can be routinely used. We illustrate the methodology with a joint analysis of male mortality data for lung, colorectal, stomach, and LOCP (lip, oral cavity, and pharynx) cancer in continental Spain for the period 2006–2020.

Keywords Bayesian inference · Cancer epidemiology · INLA · Spatial patterns
1.1 Introduction

The extensive amount of research in disease mapping during the last decades has promoted the development of new statistical techniques and modelling proposals for the analysis of spatial and spatio-temporal areal count data. The analysis of geographical and temporal variation of disease risks or rates is crucial in epidemiology and public health, as it helps to formulate hypotheses about the disease's aetiology or to look for potential health inequalities. Typically, in these studies the region of interest is divided into non-overlapping areal units where epidemiological data are
presented as aggregated disease counts for each geographical unit (administrative divisions such as states or local health areas). The main objective of disease mapping models is to obtain accurate local estimates of relative risks (or rates) to uncover geographical patterns and temporal trends of the phenomenon under study [1, 2]. Usually, these models include structured random effects to smooth risks by borrowing information from spatial and temporal neighbours.

Traditionally, disease mapping studies have mainly focused on the spatial and/or spatio-temporal analysis of a single disease. Most of the research is based on the use of the conditional autoregressive (CAR) prior distribution [3, 4], where the spatial correlation in the random effects is determined by the neighbourhood structure of the areal units represented as an undirected graph. However, the joint modelling of several responses using multivariate disease mapping models offers some advantages, such as (i) improving smoothing by borrowing strength between diseases, and (ii) allowing relationships between the geographical distributions and temporal trends of the diseases (i.e., correlations between spatial and temporal patterns).

A wide variety of models have been proposed in the literature for the joint analysis of several (two or more) diseases. Unfortunately, their use is not routine practice due to computational complexity and lack of implementation in common statistical packages. Probably the simplest multivariate extension of univariate CAR models is the so-called shared component model [5, 6]. The key feature of the shared component model (SCM) is the assumption that the diseases under study share a common risk factor (or factors). Under this assumption, the underlying risk surface for each disease includes a spatial latent component common to two or more diseases (shared risk effect). See MacNab [7] for details about how this class of models is linked with the general multivariate framework. Without being exhaustive, some studies including shared component models in disease mapping applications are described below. Richardson et al. [8] consider a SCM for analysing joint patterns of male and female lung cancer incidence in a region of England, assuming that both populations share environmental risk factors. In Etxeberria et al. [9], SCMs are used for the joint modelling of brain cancer incidence and mortality data in very small areas, including gender-specific shared spatial components. The high correlation between incidence and mortality data in this lethal cancer makes these models ideal to borrow strength among both disease processes, increasing the effective sample size. A similar shared component modelling approach is considered by Etxeberria et al. [10] to obtain more accurate incidence forecasts of rare cancer types using mortality data.

Unlike shared component models, where an existing dependence between diseases is known in advance, general multivariate models are more appropriate if we do not know the relationship between the diseases. Most of the multivariate proposals for disease mapping are based on multivariate extensions of univariate CAR models. A very attractive approach to multivariate disease mapping is the coregionalization framework [11, 12], where the between-disease dependence is introduced by pre- or post-multiplying the spatial (or temporal) random effects by appropriate matrices. However, computing time can be prohibitive when the number of diseases increases. To overcome this problem, Botella-Rocamora et al.
[13] proposed the so-called
M-models to reduce the computational burden at the cost of the identifiability of some parameters. A comprehensive review of multivariate models for spatial count data can be found in the work by MacNab [14], in which the three main lines in the construction of multivariate CAR proposals are discussed: namely, the multivariate conditionals based approach [15], the univariate conditionals based approach [16], and the coregionalization method mentioned above. The majority of these models are fitted using Markov chain Monte Carlo (MCMC) techniques requiring advanced programming skills, which hampers their use in practice. Other very recent modelling approaches to jointly analyse several diseases include the use of multivariate P-spline models [17] or multivariate directed acyclic graph autoregressive models [18].

The main objective of this work is to show how multivariate modelling is a powerful tool to uncover spatial and temporal patterns of several diseases and to establish associations among them. In particular, we explore the relationship between the mortality of four different cancer sites, lung, colorectal, stomach, and LOCP (lip, oral cavity, and pharynx), in continental Spanish provinces along the period 2006–2020. To this end, we use an approach similar to the M-based models for the analysis of multivariate spatio-temporal areal data described in [19]. The main advantage of these models is that they provide estimates of the correlations between the spatial (and temporal) patterns of the different diseases, revealing connections among them that could provide clues to common underlying risk factors. The models are fitted using integrated nested Laplace approximations (INLA) [20] through R-INLA (https://www.r-inla.org), an R package for approximate Bayesian inference for latent Gaussian models that is being increasingly used in many research areas, such as spatial statistics, ecology, geoscience, public health and environmental statistics, among others [21].

The rest of the chapter is organized as follows. Section 1.2 describes the M-models framework for spatio-temporal areal data. It also provides details on its implementation in the R-INLA package [22]. In Sect. 1.3 we show the results corresponding to the multivariate analysis of lung, stomach, colorectal, and LOCP cancer mortality data for the male population in Spanish provinces during the period 2006–2020. The chapter ends with a discussion.
1.2 M-Models for Multivariate Spatio-Temporal Areal Data

Let us assume that the region under study is divided into $S$ contiguous small areas and data are available for $T$ time periods and $J$ diseases. Let $O_{itj}$ and $E_{itj}$ denote the number of observed and expected cases, respectively, in the $i$-th small area ($i=1,\ldots,S$), $t$-th period ($t=1,\ldots,T$) and $j$-th disease ($j=1,\ldots,J$). Here, $E_{itj}$ is computed using indirect standardization for each disease as $E_{itj}=\sum_{k} n_{itjk}\cdot m_{jk}$, where $k$ denotes the age-group, $n_{itjk}$ is the population at risk in area $i$, time period $t$ and age-group $k$ for the $j$-th disease, and $m_{jk}$ is the overall mortality rate of disease $j$ for the $k$-th age-group in the total area and whole study period. That is, $m_{jk}=\sum_{i=1}^{S}\sum_{t=1}^{T} O_{itjk} \big/ \sum_{i=1}^{S}\sum_{t=1}^{T} N_{itjk}$, where $N_{itjk}$ is the population at risk in area $i$, time $t$, disease $j$, and age-group $k$.

Conditional on the relative risks $R_{itj}$, the number of observed cases in each area-time-disease stratum is assumed to follow a Poisson distribution

$$O_{itj} \mid R_{itj} \sim \text{Poisson}(\mu_{itj} = E_{itj} \cdot R_{itj}), \qquad \log \mu_{itj} = \log E_{itj} + \log R_{itj}.$$

Then, the log-risk is modelled as

$$\log R_{itj} = \alpha_j + \theta_{ij} + \gamma_{tj} + \delta_{itj}, \qquad (1.1)$$

where $\alpha_j$ is a disease-specific intercept, $\theta_{ij}$ and $\gamma_{tj}$ are the spatial and temporal main effects of the $j$-th disease, and $\delta_{itj}$ is the spatio-temporal interaction within the $j$-th disease. If we denote by $\mathbf{R} = (\mathbf{R}_1', \ldots, \mathbf{R}_J')'$ the vector of relative risks with $\mathbf{R}_j = (R_{11j}, \ldots, R_{S1j}, \ldots, R_{1Tj}, \ldots, R_{STj})'$, for $j=1,\ldots,J$, the model described in Eq. (1.1) can be expressed in matrix form as

$$\log(\mathbf{R}) = (\mathbf{I}_J \otimes \mathbf{1}_T \otimes \mathbf{1}_S)\,\boldsymbol{\alpha} + (\mathbf{I}_J \otimes \mathbf{1}_T \otimes \mathbf{I}_S)\,\boldsymbol{\theta} + (\mathbf{I}_J \otimes \mathbf{I}_T \otimes \mathbf{1}_S)\,\boldsymbol{\gamma} + (\mathbf{I}_J \otimes \mathbf{I}_T \otimes \mathbf{I}_S)\,\boldsymbol{\delta},$$

where $\mathbf{1}_S$ and $\mathbf{1}_T$ are column vectors of ones of length $S$ and $T$, respectively, $\mathbf{I}_S$, $\mathbf{I}_T$ and $\mathbf{I}_J$ are identity matrices of dimension $S \times S$, $T \times T$ and $J \times J$, respectively, and $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_J)'$ is the set of disease-specific intercepts. The terms $\boldsymbol{\theta} = (\boldsymbol{\theta}^{(1)'}, \ldots, \boldsymbol{\theta}^{(J)'})'$, $\boldsymbol{\gamma} = (\boldsymbol{\gamma}^{(1)'}, \ldots, \boldsymbol{\gamma}^{(J)'})'$, and $\boldsymbol{\delta} = (\boldsymbol{\delta}^{(1)'}, \ldots, \boldsymbol{\delta}^{(J)'})'$, where $\boldsymbol{\theta}^{(j)} = (\theta_{1j}, \ldots, \theta_{Sj})'$, $\boldsymbol{\gamma}^{(j)} = (\gamma_{1j}, \ldots, \gamma_{Tj})'$, and $\boldsymbol{\delta}^{(j)} = (\delta_{11j}, \ldots, \delta_{S1j}, \ldots, \delta_{1Tj}, \ldots, \delta_{STj})'$, are the vectors of spatial, temporal and spatio-temporal random effects for the $j$-th disease.

Following Botella-Rocamora et al. [13], we rearrange the main spatial and temporal effects into the matrices $\boldsymbol{\Theta} = (\boldsymbol{\theta}^{(1)}, \ldots, \boldsymbol{\theta}^{(J)}) = \{\theta_{ij} : i=1,\ldots,S;\ j=1,\ldots,J\}$ and $\boldsymbol{\Gamma} = (\boldsymbol{\gamma}^{(1)}, \ldots, \boldsymbol{\gamma}^{(J)}) = \{\gamma_{tj} : t=1,\ldots,T;\ j=1,\ldots,J\}$ to better comprehend their dependence structure. Then, similarly to Vicente et al. [19], we express the matrices $\boldsymbol{\Theta}$ and $\boldsymbol{\Gamma}$ as

$$\boldsymbol{\Theta} = \boldsymbol{\Phi}_\theta \mathbf{M}_\theta, \qquad \boldsymbol{\Gamma} = \boldsymbol{\Phi}_\gamma \mathbf{M}_\gamma,$$

where $\boldsymbol{\Phi}_\theta$ is a matrix of order $S \times J$ composed of stochastically independent columns that are assumed to follow a Leroux et al. [23] conditional autoregressive prior (denoted as LCAR) distribution to deal with the spatial dependence within diseases, and $\boldsymbol{\Phi}_\gamma$ is a matrix of order $T \times J$ composed of stochastically independent columns that are assumed to follow a first-order random walk (RW1) prior distribution to deal with the temporal dependence within diseases. The $J \times J$ matrices $\mathbf{M}_\theta$ and $\mathbf{M}_\gamma$ are non-singular but arbitrary matrices responsible for inducing dependence between the different columns of $\boldsymbol{\Theta}$ and $\boldsymbol{\Gamma}$, respectively, i.e., for inducing correlation between the spatial and temporal patterns of the diseases. That is, $\boldsymbol{\Sigma}_\theta = \mathbf{M}_\theta' \mathbf{M}_\theta$ and $\boldsymbol{\Sigma}_\gamma = \mathbf{M}_\gamma' \mathbf{M}_\gamma$ are the covariance matrices between the spatial patterns and temporal trends, respectively, of the different diseases (see Botella-Rocamora et al. [13] and Vicente et al. [19] for further details).

The resulting prior distribution for the spatial random effect $\boldsymbol{\theta}$ is a multivariate Normal distribution with non-separable precision matrix given by

$$\boldsymbol{\Omega}_\theta = \left(\mathbf{M}_\theta^{-1} \otimes \mathbf{I}_S\right)' \,\text{Blockdiag}(\boldsymbol{\Omega}_1, \ldots, \boldsymbol{\Omega}_J)\, \left(\mathbf{M}_\theta^{-1} \otimes \mathbf{I}_S\right), \qquad (1.2)$$

where $\boldsymbol{\Omega}_1, \ldots, \boldsymbol{\Omega}_J$ are the spatial precision matrices of the corresponding LCAR prior distributions. That is, $\boldsymbol{\Omega}_j = \lambda_j \boldsymbol{\Omega}_{iCAR} + (1-\lambda_j)\mathbf{I}_S$, for $j=1,\ldots,J$, where $\boldsymbol{\Omega}_{iCAR} = \mathbf{D}_w - \mathbf{W}$ is the precision matrix of an intrinsic conditional autoregressive prior (iCAR), $\mathbf{W} = (w_{il})$ is the spatial binary adjacency matrix whose $il$-th element is equal to one if areas $i$ and $l$ are neighbours (regions sharing a common border) and zero otherwise, $\mathbf{D}_w$ is a diagonal matrix with the number of neighbours of each area in the main diagonal, and $\lambda_j \in [0,1]$ is a spatial smoothing parameter. Note that if $\lambda_j = \lambda$, for $j=1,\ldots,J$, then Eq. (1.2) becomes $\boldsymbol{\Omega}_\theta = \boldsymbol{\Sigma}_\theta^{-1} \otimes \boldsymbol{\Omega}_{LCAR}$, and the precision matrix is separable.

Regarding the time dimension, since RW1 priors have been considered to control the within-disease temporal variability, the resulting prior distribution for the temporal random effect $\boldsymbol{\gamma}$ is a multivariate Normal distribution with a separable precision matrix given by

$$\boldsymbol{\Omega}_\gamma = \left(\mathbf{M}_\gamma' \mathbf{M}_\gamma\right)^{-1} \otimes \boldsymbol{\Omega}_{RW1} = \boldsymbol{\Sigma}_\gamma^{-1} \otimes \boldsymbol{\Omega}_{RW1}, \qquad (1.3)$$

where $\boldsymbol{\Omega}_{RW1}$ is the precision matrix of a first-order random walk prior, i.e.,

$$\boldsymbol{\Omega}_{RW1} = \begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix}$$

(see Rue and Held [24], p. 95).

Finally, the interaction random effects are assumed to be independent between diseases. The reason is that these terms generally capture a small part of the variability, and including dependence between them would unnecessarily complicate the model. Then, the following prior distribution is assumed for $\boldsymbol{\delta}^{(j)} = (\delta_{11j}, \ldots, \delta_{S1j}, \ldots, \delta_{1Tj}, \ldots, \delta_{STj})'$, for $j=1,\ldots,J$:

$$\boldsymbol{\delta}^{(j)} \sim N\left(\mathbf{0}, \left[\tau_{\delta_j} \boldsymbol{\Omega}_{\delta_j}\right]^{-}\right),$$

where the $\tau_{\delta_j}$ are different precision parameters for each disease, the symbol $-$ denotes the Moore–Penrose generalized inverse of a matrix, and the $\boldsymbol{\Omega}_{\delta_j}$ are matrices of order $ST \times ST$ obtained as the Kronecker product of the corresponding spatial and temporal structure matrices. Here, the four different types of interactions originally proposed by Knorr-Held [25] have been considered (see Table 1.1).

Table 1.1 Specification for the four possible types of space-time interactions

Interaction | $\boldsymbol{\Omega}_\delta$ | Spatial correlation | Temporal correlation
Type I | $\mathbf{I}_T \otimes \mathbf{I}_S$ | – | –
Type II | $\boldsymbol{\Omega}_{RW1} \otimes \mathbf{I}_S$ | – | Yes
Type III | $\mathbf{I}_T \otimes \boldsymbol{\Omega}_{iCAR}$ | Yes | –
Type IV | $\boldsymbol{\Omega}_{RW1} \otimes \boldsymbol{\Omega}_{iCAR}$ | Yes | Yes

Introducing correlation between temporal trends may not be sensible depending on the diseases. For example, there exist screening programmes for certain types of cancer that have an impact on the temporal evolution of mortality, whereas other types of cancer are detected when they present symptoms or are at an advanced stage. In these situations it might be better to consider independent temporal trends for each disease. In our real case study, there is a screening programme for colorectal cancer in Spain, although there are differences between regions. As far as we know, there are no screening programmes for the other cancer sites. Consequently, we also consider a model assuming independent temporal random effects for each disease. Doing so, we substantially reduce the number of hyperparameters of the temporal effect from $J \times (J+1)/2$ to $J$ parameters, hence decreasing the computational burden. A detailed description of the precision matrices for the two main models considered in this chapter has been included in Table 1.2: Model 1, including correlations between the spatial and temporal patterns of the diseases, and Model 2, with between-disease independent main temporal trends.

Table 1.2 Description of the precision matrices for the different M-models described in Sect. 1.2

- Spatial effect $\boldsymbol{\Theta} = (\boldsymbol{\theta}^{(1)}, \ldots, \boldsymbol{\theta}^{(J)})$, both models: $\boldsymbol{\Omega}_\theta = (\mathbf{M}_\theta^{-1} \otimes \mathbf{I}_S)'\,\text{Blockdiag}(\boldsymbol{\Omega}_1, \ldots, \boldsymbol{\Omega}_J)\,(\mathbf{M}_\theta^{-1} \otimes \mathbf{I}_S)$.
- Temporal effect $\boldsymbol{\Gamma} = (\boldsymbol{\gamma}^{(1)}, \ldots, \boldsymbol{\gamma}^{(J)})$: Model 1 (M-model + M-model + iCAR$\otimes$RW1): $\boldsymbol{\Omega}_\gamma = (\mathbf{M}_\gamma' \mathbf{M}_\gamma)^{-1} \otimes \boldsymbol{\Omega}_{RW1}$. Model 2 (M-model + RW1 + iCAR$\otimes$RW1): $\boldsymbol{\gamma}^{(j)} \sim N(\mathbf{0}, [\tau_{\gamma_j} \boldsymbol{\Omega}_{RW1}]^{-})$, for $j=1,\ldots,J$.
- Spatio-temporal effect $\boldsymbol{\delta} = (\boldsymbol{\delta}^{(1)'}, \ldots, \boldsymbol{\delta}^{(J)'})'$, both models: $\boldsymbol{\delta}^{(j)} \sim N(\mathbf{0}, [\tau_{\delta_j} \boldsymbol{\Omega}_{\delta_j}]^{-})$, for $j=1,\ldots,J$.
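To make the algebra above concrete, the following R sketch (not the authors' implementation; the toy dimensions, the path-graph adjacency matrix and the matrix M_theta are illustrative assumptions) builds the LCAR precision matrices, the M-model spatial precision of Eq. (1.2), and the structure matrices of the four interaction types in Table 1.1.

```r
# Illustrative sketch (not the authors' code): precision matrices of Sect. 1.2
library(Matrix)

S <- 5; T <- 4; J <- 2                          # areas, time periods, diseases

# Toy binary adjacency matrix W for a path graph of S areas
W <- matrix(0, S, S)
for (i in 1:(S - 1)) { W[i, i + 1] <- 1; W[i + 1, i] <- 1 }
D_w    <- diag(rowSums(W))
R_icar <- D_w - W                               # iCAR structure matrix

# LCAR precision of each disease: Omega_j = lambda_j * R_icar + (1 - lambda_j) * I_S
lambda     <- c(0.9, 0.5)                       # arbitrary smoothing parameters
Omega_list <- lapply(lambda, function(l) l * R_icar + (1 - l) * diag(S))

# RW1 structure matrix (tridiagonal, rank T - 1)
D_1   <- diff(diag(T), differences = 1)
R_rw1 <- t(D_1) %*% D_1

# M-model spatial precision, Eq. (1.2)
M_theta     <- matrix(c(1, 0.4, 0, 1), J, J)    # arbitrary non-singular J x J
A           <- kronecker(solve(M_theta), diag(S))
Omega_theta <- t(A) %*% as.matrix(bdiag(Omega_list)) %*% A

# Structure matrices of the four space-time interaction types (Table 1.1)
Omega_delta <- list(
  typeI   = kronecker(diag(T), diag(S)),
  typeII  = kronecker(R_rw1,   diag(S)),
  typeIII = kronecker(diag(T), R_icar),
  typeIV  = kronecker(R_rw1,   R_icar)
)
```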
1.2.1 Model Implementation and Identifiability Constraints

In this chapter, models are fitted under a fully Bayesian approach using the well-known integrated nested Laplace approximation (INLA) technique [20]. The use of INLA for Bayesian inference has become very popular in applied statistics in general (see, e.g., Rue et al. [21] and references therein) and in spatial statistics in particular [26], as it has proven to be faster than MCMC techniques. One of the main advantages of INLA is that many models used in practice are available in the R-INLA package [22] and others can be implemented using the rgeneric or cgeneric constructions, allowing users to define their own latent effects. Our M-model proposal for the analysis of multivariate spatio-temporal count data described in Sect. 1.2 is not directly available in R-INLA, but it was implemented by Vicente et al. [19] to analyse different crimes against women in India. Here, an alternative internal parameterization of the M-models based on the Bartlett decomposition of Wishart distributed matrices (see, for example, [27]) has been considered. In particular, we assume Wishart priors for the precision matrices $\boldsymbol{\Sigma}_\theta^{-1}$ and $\boldsymbol{\Sigma}_\gamma^{-1}$ of the spatial and temporal patterns of the diseases. For further details the reader is referred to [28].

Since the generalized linear mixed models described in this section are typically not identifiable, appropriate sum-to-zero constraints must be imposed on the random effects. In Goicoa et al. [29], the spectral decomposition of the precision matrices of the random effects reveals the required identifiability constraints in spatio-temporal univariate CAR models. A similar approach has been followed here to obtain the set of identifiability constraints for the multivariate spatio-temporal M-models (see Table 1.3).

Table 1.3 Identifiability constraints to fit the multivariate spatio-temporal M-models described in Eq. (1.1)

Type I: $\sum_{i=1}^{S} \theta_{ij} = 0$, $\sum_{t=1}^{T} \gamma_{tj} = 0$, and $\sum_{i=1}^{S}\sum_{t=1}^{T} \delta_{itj} = 0$, for $j = 1, \ldots, J$

Type II: $\sum_{i=1}^{S} \theta_{ij} = 0$, $\sum_{t=1}^{T} \gamma_{tj} = 0$, and $\sum_{t=1}^{T} \delta_{itj} = 0$, for $i = 1, \ldots, S$; $j = 1, \ldots, J$

Type III: $\sum_{i=1}^{S} \theta_{ij} = 0$, $\sum_{t=1}^{T} \gamma_{tj} = 0$, and $\sum_{i=1}^{S} \delta_{itj} = 0$, for $t = 1, \ldots, T$; $j = 1, \ldots, J$

Type IV: $\sum_{i=1}^{S} \theta_{ij} = 0$, $\sum_{t=1}^{T} \gamma_{tj} = 0$, $\sum_{t=1}^{T} \delta_{itj} = 0$, for $i = 1, \ldots, S$; $j = 1, \ldots, J$, and $\sum_{i=1}^{S} \delta_{itj} = 0$, for $t = 1, \ldots, T$; $j = 1, \ldots, J$
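As an illustration of how such constraints enter R-INLA, the sketch below sets up a simplified univariate analogue with a Type II interaction, passing the S sum-to-zero constraints of Table 1.3 through the extraconstr argument of f(). This is not the authors' rgeneric M-model code (which is available in the repository cited in Sect. 1.3.2); the data frame dat, its column names, the placeholder adjacency matrix and the fixed smoothing parameter are all assumptions made for illustration.

```r
library(INLA)

S <- 47; T <- 15
# W: binary adjacency matrix of the 47 provinces (assumed available); a
# hypothetical path-graph placeholder keeps the sketch self-contained.
W <- matrix(0, S, S)
for (i in 1:(S - 1)) { W[i, i + 1] <- 1; W[i + 1, i] <- 1 }
R_icar <- diag(rowSums(W)) - W
D_1    <- diff(diag(T)); R_rw1 <- t(D_1) %*% D_1

Omega_area <- 0.9 * R_icar + 0.1 * diag(S)   # LCAR precision, lambda fixed at 0.9
R_typeII   <- kronecker(R_rw1, diag(S))      # Type II structure, index (t-1)*S + i

# Type II constraints of Table 1.3: sum_t delta_itj = 0 for each area i
A_constr <- kronecker(matrix(1, 1, T), diag(S))   # S x (S*T) constraint matrix

# 'dat' is a hypothetical stacked data frame with observed counts O, expected
# counts E_cases, and index columns id_area (1..S), id_year (1..T), id_inter
formula <- O ~ 1 +
  f(id_area,  model = "generic0", Cmatrix = Omega_area, constr = TRUE) +
  f(id_year,  model = "rw1", constr = TRUE) +
  f(id_inter, model = "generic0", Cmatrix = R_typeII, rankdef = S,
    extraconstr = list(A = A_constr, e = rep(0, S)))

fit <- inla(formula, family = "poisson", data = dat, E = E_cases,
            control.compute = list(dic = TRUE, waic = TRUE))
```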
1.3 Joint Analysis of Lung, Colorectal, Stomach, and LOCP Cancer Mortality Data in Spanish Provinces

Several studies indicate that tobacco smoking is the most important risk factor in lung and oral cancer incidence and mortality [30–33]. There are also studies
suggesting a relation between tobacco consumption and stomach cancer mortality [34], or other cancer sites such as liver and colorectal cancer [35]. In addition, epidemiological evidence supports a causal association of alcohol consumption with oral, colorectal and liver cancers [36]. For all these reasons, we consider the use of statistical models for a joint analysis of mortality risks of several cancer sites essential to properly explore the relationship between them. However, applied papers in the disease mapping literature that jointly analyse two or more cancer sites are not so abundant, probably due to computational burden and difficulties in implementation. Held et al. [6] propose the use of shared component models for the joint modelling of oral, oesophagus, larynx, and lung cancer mortality rates of the male population in Germany, assuming that smoking is a common risk factor. Shared component models allow information to be exchanged between diseases in order to obtain more accurate estimates for cancer sites with few cases or analysed at a highly disaggregated level. Using similar models, Retegui et al. [37] perform a joint study of LOCP and lung cancer mortality rates by province, age-group and gender in Spain during the period 2011–2015. Botella-Rocamora et al. [13] use multivariate M-based spatial models to jointly analyse 21 different causes of mortality in the municipalities of the autonomous region of Comunidad Valenciana, Spain. They find high positive spatial correlations in several pairs of diseases, such as lung/oral, larynx/oral or lung/bladder, that may be connected to smoking habits. Very recently, Gao et al. [18] propose a multivariate extension of directed acyclic graphical autoregressive models and illustrate the proposed methodology by jointly analysing lung, oesophagus, larynx, and colorectal cancer data across counties in California. However, all these studies are limited to the analysis of spatial areal count data.

The study described in this section is based on the joint analysis of the spatio-temporal evolution of male lung, colorectal, stomach, and LOCP cancer mortality data ($J = 4$ diseases) in the $S = 47$ provinces of continental Spain (excluding the Balearic and Canary Islands and the autonomous cities of Ceuta and Melilla) during the period 2006–2020 ($T = 15$ years). Data were reported by the Spanish Statistical Office (INE). According to recent studies [38], lung and colorectal cancer were the leading causes of cancer deaths in the male population (the second and third causes in the female population behind breast cancer) in Europe in year 2020, representing 24.2% and 12.3% of all cancer deaths, respectively. Stomach and LOCP cancer represent 5.5% (fifth position) and 3.8% (eighth position) of all cancer deaths in the male population, respectively. According to the latest data published by the INE (see https://www.ine.es/jaxiT3/Tabla.htm?t=7947), the percentages of mortality due to these cancer sites out of the total cancer deaths in the Spanish male population are around 24.7% for lung cancer, 13.4% for colorectal cancer, 4.4% for stomach cancer and 2.5% for LOCP cancer, respectively.
Table 1.4 Summary statistics (mean, standard deviation, minimum, quantiles and maximum) of the standardized mortality ratio (SMR) for the different cancer sites across the Spanish provinces in some selected years

Cancer site | Year | Mean | SD | Min | Q1 | Q2 | Q3 | Max
Lung | 2006 | 1.04 | 0.19 | 0.63 | 0.92 | 1.02 | 1.14 | 1.41
Lung | 2009 | 1.01 | 0.20 | 0.66 | 0.89 | 1.02 | 1.10 | 1.54
Lung | 2013 | 0.99 | 0.14 | 0.70 | 0.87 | 0.96 | 1.07 | 1.36
Lung | 2017 | 0.93 | 0.13 | 0.70 | 0.82 | 0.92 | 1.00 | 1.21
Lung | 2020 | 0.84 | 0.12 | 0.57 | 0.74 | 0.84 | 0.94 | 1.12
Colorectal | 2006 | 1.00 | 0.17 | 0.67 | 0.88 | 1.02 | 1.09 | 1.46
Colorectal | 2009 | 1.03 | 0.18 | 0.59 | 0.93 | 1.03 | 1.12 | 1.45
Colorectal | 2013 | 1.07 | 0.13 | 0.83 | 0.98 | 1.05 | 1.14 | 1.37
Colorectal | 2017 | 0.95 | 0.10 | 0.75 | 0.87 | 0.96 | 1.02 | 1.20
Colorectal | 2020 | 0.92 | 0.12 | 0.68 | 0.83 | 0.91 | 1.00 | 1.25
Stomach | 2006 | 1.32 | 0.30 | 0.70 | 1.10 | 1.31 | 1.48 | 1.96
Stomach | 2009 | 1.22 | 0.26 | 0.78 | 1.07 | 1.19 | 1.31 | 2.19
Stomach | 2013 | 1.11 | 0.26 | 0.68 | 0.91 | 1.05 | 1.29 | 1.68
Stomach | 2017 | 0.94 | 0.21 | 0.49 | 0.78 | 0.96 | 1.08 | 1.40
Stomach | 2020 | 0.82 | 0.17 | 0.50 | 0.70 | 0.81 | 0.93 | 1.33
LOCP | 2006 | 1.08 | 0.39 | 0.23 | 0.89 | 1.03 | 1.36 | 1.90
LOCP | 2009 | 1.03 | 0.33 | 0.15 | 0.79 | 1.00 | 1.20 | 1.82
LOCP | 2013 | 1.09 | 0.39 | 0.33 | 0.82 | 1.05 | 1.31 | 2.59
LOCP | 2017 | 1.00 | 0.26 | 0.45 | 0.82 | 0.99 | 1.18 | 1.68
LOCP | 2020 | 0.91 | 0.25 | 0.21 | 0.72 | 0.90 | 1.10 | 1.48
1.3.1 Descriptive Analysis

A total of 242,723 lung cancer deaths (corresponding to International Classification of Diseases-10 codes C33 and C34), 126,965 colorectal cancer deaths (ICD-10 codes C17–C21), 47,965 stomach cancer deaths (ICD-10 code C16) and 23,785 LOCP cancer deaths (ICD-10 codes C00–C14) were registered for the male population in the provinces of continental Spain during the period 2006–2020.

Summary statistics (mean, standard deviation, minimum, quantiles and maximum) of the standardized mortality ratio (SMR) for the different cancer sites across the Spanish provinces in some selected years are shown in Table 1.4. The SMR is defined as the ratio of observed and expected cases for the corresponding province, year and cancer site. Note that values higher than one indicate an excess of risk for a cancer site, i.e., the number of observed cases for a particular province and year is greater than expected in comparison to the whole of Spain during the study period. A similar interpretation holds for SMR values lower than one.

Figure 1.1 compares the temporal evolution of year-specific SMRs for the different cancer sites, computed by aggregating mortality cases by year for all provinces. We observe strong temporal trends. In particular, colorectal cancer exhibits a temporal trend different to the other cancer sites. There is a screening programme for colorectal cancer that might have an impact on mortality. For the rest of the cancers, no screening programme is implemented. Therefore, it may be sensible to use Model 2 described in the previous section.

Fig. 1.1 Standardized mortality ratio (SMR) for the different cancer sites computed by year for the whole of the Spanish provinces
1.3.2 Model Fitting Using INLA

The multivariate spatio-temporal models described in Sect. 1.2 have been implemented to jointly analyse the evolution in space and time of lung, colorectal, stomach, and LOCP cancer male mortality risks in Spanish provinces during the period 2006–2020. The four types of interactions originally proposed by Knorr-Held [25] have been considered for the spatio-temporal random effects in both Model 1 and Model 2 (see Table 1.2). As described above, the internal parameterization of the M-model for the spatial/temporal random effects is based on the Bartlett decomposition of Wishart matrices, in particular, of the corresponding between-disease covariance matrices $\boldsymbol{\Sigma}_\theta$ and $\boldsymbol{\Sigma}_\gamma$. Details on its implementation using the rgeneric construction of R-INLA can be found in [28]. For the rest of the model parameters, improper uniform prior distributions are considered for the inverse square roots of the precision parameters $\tau_{\gamma_j}$ and $\tau_{\delta_j}$ (the standard deviations), while uniform prior distributions on the interval $[0,1]$ are given to the spatial smoothing parameters $\lambda_j$ of the LCAR prior. Finally, a vague Gaussian distribution with mean zero and precision equal to 0.001 is given to the disease-specific intercepts $\alpha_j$. The R code to reproduce the results shown in this chapter is available at https://github.com/spatialstatisticsupna/BookChapter_STMmodels.

The recently proposed hybrid approximate method that combines the Laplace method with a low-rank Variational Bayes correction to the posterior mean [39, 40] has been considered to fit the models, using the R-INLA stable version 22.12.16 in R-4.2.1. All computations are made on a personal computer with an Intel Core i5-7500 processor and 32 GB of RAM.

Table 1.5 shows two commonly used model selection criteria for Bayesian models, the Deviance Information Criterion (DIC) [41] and the Watanabe-Akaike information criterion (WAIC) [42]. According to these measures, Model 2, with an M-model for the spatial random effects with a non-separable LCAR prior distribution, independent temporally structured random effects with a RW1 prior distribution, and independent Type II space-time interaction random effects (structured in time but not in space), is selected as the best model. In the following, we provide results using the selected model.

Table 1.5 Model selection criteria ($\bar{D}(\theta)$: mean deviance, $p_D$: effective number of parameters, DIC: deviance information criterion, WAIC: Watanabe-Akaike information criterion) and computational time (in minutes). The lowest values of DIC and WAIC are marked with an asterisk

Model | Interaction | $\bar{D}(\theta)$ | $p_D$ | DIC | WAIC | Time
Model 1 | Type I | 20222.9 | 638.2 | 20861.1 | 20854.3 | 4
Model 1 | Type II | 20132.1 | 615.2 | 20747.3 | 20733.3 | 12
Model 1 | Type III | 20346.2 | 536.4 | 20882.6 | 20915.4 | 6
Model 1 | Type IV | 20272.3 | 436.9 | 20709.2 | 20725.9 | 20
Model 2 | Type I | 20275.7 | 587.6 | 20863.3 | 20863.0 | 3
Model 2 | Type II | 20242.3 | 454.1 | 20696.4* | 20706.9* | 8
Model 2 | Type III | 20379.9 | 501.4 | 20881.4 | 20919.4 | 4
Model 2 | Type IV | 20258.9 | 470.3 | 20729.2 | 20744.1 | 14
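For reference, the criteria in Table 1.5 are computed by R-INLA when control.compute requests them; from a fitted object such as fit in the earlier sketch, they could be read off roughly as follows (a sketch; the field names are those I believe R-INLA returns in its dic and waic lists):

```r
# Mean deviance, effective number of parameters, DIC and WAIC (cf. Table 1.5)
c(mean.deviance = fit$dic$mean.deviance,
  p.eff         = fit$dic$p.eff,
  DIC           = fit$dic$dic,
  WAIC          = fit$waic$waic)
```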
Figure 1.2 shows the maps with the underlying spatial patterns for the different cancer sites, that is, the posterior median estimates of $\exp(\theta_{ij})$ (top) and posterior exceedance probabilities $P(\exp(\theta_{ij}) > 1 \mid \mathbf{O})$ (bottom). Dark blue areas in the maps of posterior probabilities indicate significantly high province-specific risks during the period 2006–2020 in comparison to Spain as a whole. For some of the cancer sites, these maps show similarities in the estimated spatial patterns of mortality risk. One of the main advantages of the multivariate models considered here is that they provide estimates of the correlations between the spatial patterns of the diseases. This may suggest hidden connections between the different cancer types related to factors such as time of diagnosis, access to treatment, or lifestyle. In our analysis, the strongest spatial correlation is estimated between colorectal and LOCP cancer (posterior median estimate of 0.619), followed by lung and colorectal cancer (posterior median estimate of 0.404) (see Table 1.6). A negative association is estimated between lung and stomach cancer, but its 95% credible interval shows that this correlation is not significant. The estimated global temporal patterns $\exp(\gamma_{tj})$ are displayed in Fig. 1.3.

Fig. 1.2 Maps of posterior median estimates of province-specific mortality risks $\exp(\theta_{ij})$ (top) and posterior exceedance probabilities $P(\exp(\theta_{ij}) > 1 \mid \mathbf{O})$ (bottom) for lung, colorectal, stomach, and LOCP cancer
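The exceedance probabilities mapped in Fig. 1.2 follow from the posterior marginals of the spatial effects. A sketch, using the hypothetical effect label id_area from the code above:

```r
# P(exp(theta_i) > 1 | O) = P(theta_i > 0 | O), one probability per province
exc_prob <- sapply(fit$marginals.random$id_area,
                   function(m) 1 - inla.pmarginal(0, m))
```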
Table 1.6 Estimated between-disease spatial correlations (posterior medians and 95% credible intervals). Significant correlations are marked with an asterisk

Cancer site | Colorectal | Stomach | LOCP
Lung | 0.404* (0.240, 0.582) | −0.145 (−0.423, 0.188) | 0.185* (0.002, 0.416)
Colorectal | – | 0.196 (−0.036, 0.395) | 0.619* (0.417, 0.755)
Stomach | – | – | 0.289* (0.059, 0.492)
Fig. 1.3 Posterior median estimates of year-specific mortality risks $\exp(\gamma_{tj})$ and 95% credible intervals for each cancer site (lung, colorectal, stomach, and LOCP)
Mortality risks show a decreasing trend during the period 2006–2020 for all cancer sites but colorectal cancer, which shows an increasing trend from 2006 to 2012 and then starts to decrease linearly. Stomach cancer shows the sharpest decrease in mortality risks during the whole study period. A more moderate but steady decline is also observed for lung and LOCP cancer. As expected, wider credible intervals are obtained for cancers with lower counts.

Figures 1.4, 1.5, 1.6, and 1.7 display the maps with the relative mortality risk estimates (posterior medians) of lung, colorectal, stomach, and LOCP cancer, respectively. For lung cancer data, the maps remain fairly constant during the first half of the period even though a marked global decreasing trend is observed (see Fig. 1.3).
Fig. 1.4 Maps of posterior median estimates of lung cancer mortality risks in the years 2006, 2009, 2012, 2014, 2017, and 2020
Fig. 1.5 Maps of posterior median estimates of colorectal cancer mortality risks in the years 2006, 2009, 2012, 2014, 2017, and 2020
Some provinces (Navarra, Burgos, Guadalajara, Ávila and Sevilla) exhibit significantly high risks, with posterior median estimates greater than 1.2 until the year 2012. The estimated mortality risks for colorectal cancer show less variability than the other cancer sites considered in this analysis, with posterior median estimates ranging from 0.76 to 1.29. The highest relative risks are attained in year 2012, with Navarra, Orense, Vizcaya and Sevilla being the provinces with the highest risks in that year. From 2012 onwards, the maps become lighter.
Fig. 1.6 Maps of posterior median estimates of stomach cancer mortality risks in the years 2006, 2009, 2012, 2014, 2017, and 2020
Fig. 1.7 Maps of posterior median estimates of LOCP cancer mortality risks in the years 2006, 2009, 2012, 2014, 2017, and 2020
In contrast, the relative mortality risks for stomach cancer show the highest variability. Averaged over the whole of Spain, the posterior median estimates are about 1.31 in year 2006 and 0.83 in year 2020. Although changes in the spatial pattern are observed throughout the study period, some provinces stand out because their relative risks remain significantly high or low compared to the whole of Spain. Barcelona and Orense exhibit the highest relative mortality risks, with values above 1.82 in 2006 and above 1.31 in 2017. In 2020, the last year of the period, the estimated relative risks in these two provinces are 1.21 and 1.14, respectively. On the contrary, the province of Lugo is the one
that shows the lowest relative risks for all years, with values close to 0.95 at the beginning of the study period, 0.75 in the intermediate years, and 0.65 in the last years. Finally, we notice that the LOCP cancer mortality risks do not seem to show a very smooth spatial pattern, with high-risk areas or hotspots located all around the map. Similarly to other cancer sites, the provinces of Navarra, Barcelona, Vizcaya, and Orense, among others, have high estimated relative risks. This agrees with the significant and positive estimated correlations between LOCP and the rest of the cancer sites (see Table 1.6).
1.4 Discussion

Recent advances in computational methods and estimation techniques for Bayesian inference are making possible the development of new statistical tools for the joint analysis of multiple diseases. Unlike the usual spatio-temporal univariate models that separately study the geographical and temporal variations of each disease, multivariate areal models improve risk estimation by borrowing strength across diseases. In addition, these multivariate models permit estimation of latent correlations between the spatial and temporal patterns of the diseases that may give clues about possible relationships and associations with underlying risk factors or lifestyles.

Botella-Rocamora et al. [13] describe a unifying modelling framework for multivariate disease mapping models for spatial areal data commonly known as M-models. Different CAR prior distributions for the latent spatial random effects can be included in the models, defining either separable or non-separable covariance structures. Originally, inference for M-models was based on MCMC algorithms implemented in the BUGS language [43]. However, M-models can also be fitted using INLA through the package R-INLA (see, e.g., [44]). Our implementation of the multivariate models described in Sect. 1.2 is based on the spatio-temporal M-model proposal described in Vicente et al. [19], using the "rgeneric" construction of R-INLA. Here we use an alternative parameterization based on the Bartlett decomposition of Wishart matrices that avoids over-parameterization of the between-disease covariance matrices.

Using these M-based models for multivariate areal data, we jointly analysed the spatio-temporal patterns of lung, colorectal, stomach, and LOCP cancer mortality risks for the male population in Spanish provinces. Our analysis reveals that LOCP cancer mortality is correlated with the other cancer sites. These associations might be connected to tobacco smoking and alcohol consumption. Our results agree with previous studies analysing cancer mortality data in Spain. However, these studies are limited to the analysis of a single response, as is the case of the estimation of temporal patterns of cancer mortality rates in Spain using either change-point regression models [46] or age-period-cohort models [47]; the estimation of the geographical distribution of mortality risks at municipality level using spatial models [48, 49]; or the estimation of spatio-temporal patterns of stomach [50] and colorectal [51] cancer mortality risks. As far as we know,
1 Multivariate Disease Mapping Models
17
this is the first study that analyses the spatio-temporal evolution of mortality data for multiple cancer sites in Spain. Presumably, the main limitation of the proposed methodology is that the number of hyperparameters of the M-model formulation increases substantially with the number of diseases. According to the INLA developers [21], in the classical integrated nested Laplace approximation technique, the number of hyperparameters should be relatively small (not exceeding 20). This assumption is required for both computational reasons and to ensure accurate approximations of the posterior estimates of model parameters. However, the use of new nested thread-level parallelization strategies in the optimization phase of INLA and the integration of the PARDISO solver makes it possible to fit models with higher number of hyperparameters when using multi-core computer architectures [45]. Covariates can be included in the model easily if they were available. However, there are still unsolved theoretical issues related to the estimation of the fixed effects in spatial and spatio-temporal models. When spatial (and also temporal) random effects are included in the model, they compete with the covariates to cope with the spatial variability in the response and it is not clear to what extent the fixed effect estimates are biased. This is usually known as spatial confounding. In addition, the variance of the fixed effect estimates could be inflated leading to overly conservative inference. Consequently, the covariate effect and the remaining spatial variability could not be well estimated. Fortunately, the risk surface is well estimated (see Adin et al. [52]). Our research team is currently working on this challenging topic to propose a solution. This research field is being very active lately with some interesting pieces of research (see, for example, [53–55] or [56]) though no definite solution has been reached yet. Finally, we would like to further investigate the potential associations between different cancer sites in Spain at a finer spatial resolution, for example, using data at municipality level. The combination of the scalable methodology proposed by Orozco-Acosta et al. [57] and Vicente et al. [28] seems to be a promising approach to jointly analyse high-dimensional spatio-temporal count data for multiple responses. Acknowledgments The authors would like to thank the Spanish Statistical Office (INE) and the Spanish National Epidemiology Center (area of Environmental Epidemiology and Cancer) for providing the data. This work has been supported by project PID2020-113125RBI00/MCIN/AEI/10.13039/501100011033.
References 1. Lawson, A.B., Banerjee, S., Haining, R.P., and Ugarte, M.D. (editors): Handbook of spatial epidemiology. New York: Chapman and Hall/CRC (2016) 2. Martínez-Beneito, M.A., and Botella-Rocamora, P.: Disease mapping: from foundations to multidimensional modeling. CRC Press (2019) 3. Besag, J.: Spatial interaction and the statistical analysis of lattice systems. J Roy Stat Soc B, 36(2), 192–225 (1974)
18
A. Adin et al.
4. Besag, J., York, J., and Mollié, A.: Bayesian image restoration, with two applications in spatial statistics. Ann Inst Stat Math, 43(1), 1–20 (1991) 5. Knorr-Held, L., and Best, N.G.: A shared component model for joint and selective clustering of two diseases. J Roy Stat Soc A, 164(1), 73–85 (2001) 6. Held, L., Natário, I., Fenton, S.E., Rue, H., and Becker, N.: Towards joint disease mapping. Stat Methods Med Res, 14(1), 61–82 (2005) 7. MacNab, Y.C.: On Bayesian shared component disease mapping and ecological regression with errors in covariates. Stat Med, 29(11), 1239–1249 (2010) 8. Richardson, S., Abellan, J.J., Best, N.: Bayesian spatio-temporal analysis of joint patterns of male and female lung cancer risks in Yorkshire (UK). Stat Methods Med Res, 15(4), 385–407 (2006) 9. Etxeberria, J., Goicoa, T., and Ugarte, M.D.: Joint modelling of brain cancer incidence and mortality using Bayesian age-and gender-specific shared component models. Stoch Env Res Risk A, 32(10), 2951–2969 (2018) 10. Etxeberria, J., Goicoa, T., and Ugarte, M.D.: Using mortality to predict incidence for rare and lethal cancers in very small areas. Biometrical J, 65(3), 2200017 (2023) 11. Jin, X., Banerjee, S., and Carlin, B.: Order-free co-regionalized areal data models with application to multiple-disease mapping. J Roy Stat Soc B, 69(5), 817–838 (2007) 12. Martinez-Beneito, M.A.: A general modelling framework for multivariate disease mapping. Biometrika, 100(3), 539–553 (2013) 13. Botella-Rocamora, P., Martinez-Beneito, M.A., and Banerjee, S.: A unifying modeling framework for highly multivariate disease mapping. Stat Med, 34(9), 1548–1559 (2015) 14. MacNab, Y.C.: Some recent work on multivariate Gaussian Markov random fields. Test, 27(3), 497–541 (2018) 15. Mardia, K.: Multi-dimensional multivariate Gaussian Markov random fields with application to image processing. J Multivariate Anal, 24(2), 265–284 (1988) 16. Sain, S.R., Furrer, R., and Cressie, N.: A spatial analysis of multivariate output from regional climate models. Ann Appl Stat, 5(1), 150–175 (2011) 17. Vicente, G., Goicoa, T., and Ugarte, M.D.: Multivariate Bayesian spatio-temporal P-spline models to analyse crimes against women. Biostatistics (in press), https://doi.org/10.1093/ biostatistics/kxab042 (2021) 18. Gao, L., Datta, A., and Banerjee, S.: Hierarchical multivariate directed acyclic graph autoregressive models for spatial diseases mapping. Stat Med, 41(16), 3057–3075 (2022) 19. Vicente, G., Goicoa, T., and Ugarte, M.D.: Bayesian inference in multivariate spatio-temporal areal models using INLA: analysis of gender-based violence in small areas. Stoch Env Res Risk A, 34(10), 1421–1440 (2020) 20. Rue, H., Martino, S., Chopin, N.: Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J Roy Stat Soc B 71(2), 319–392 (2009) 21. Rue, H., Riebler, A., Sørbye, S.H., Illian, J.B., Simpson, D.P., Lindgren, F.: Bayesian Computing with INLA: A review. Annu Rev Stat Appl 4, 395–421 (2017) 22. Lindgren, F., and Rue, H.: Bayesian spatial modelling with R-INLA. J Stat Softw, 63(1), 1–25 (2015) 23. Leroux, B.G., Lei, X., and Breslow, N.: Estimation of disease rates in small areas: a new mixed model for spatial dependence. In Halloran, M. Berry, D. (eds). Statistical Models in Epidemiology, the Environment, and Clinical Trials, 179–192 (1999) 24. Rue, H., and Held, L.: Gaussian Markov Random Fields: Theory and Applications, volume 104. Chapman & Hall/CRS (2005) 25. 
Knorr-Held, L.: Bayesian modelling of inseparable space-time variation in disease risk. Stat Med, 19(17–18), 2555–2567 (2000) 26. Bakka, H., Rue, H., Fuglstad, G.A., Riebler, A., Bolin, D., Illian, J., Krainski, E., Simpson, D., Lindgren, F.: Spatial modeling with R-INLA: A review. Wiley Interdiscip Rev Comput Stat 10(6), e1443 (2018) 27. Peña V., and Irie K: On the relationship between Uhlig extended and beta-Bartlett processes. J. Time Ser. Anal., 43, 147–153 (2022).
1 Multivariate Disease Mapping Models
19
28. Vicente, G., Adin, A., Goicoa, T., and Ugarte, M.D.: High-dimensional order-free multivariate spatial disease mapping. arXiv preprint, https://arxiv.org/abs/2210.14849 (2022) 29. Goicoa, T., Adin, A., Ugarte, M.D., and Hodges, J.S.: In spatio-temporal disease mapping models, identifiability constraints affect PQL and INLA results. Stoch Env Res Risk A, 32(3):749–770 (2018) 30. Parkin, D.M., Pisani, P., Lopez, A. D., and Masuyer, E.: At least one in seven cases of cancer is caused by smoking. Global estimates for 1985. Int J Cancer, 59(4), 494–504 (1994) 31. Hecht, S.S.: Tobacco smoke carcinogens and lung cancer. JNCI-J Natl Cancer I, 91(14), 1194– 1210 (1999) 32. Johnson, N.: Tobacco use and oral cancer: a global perspective. J Dent Educ, 65(4), 328–339 (2001) 33. Lubin, J.H., Caporaso, N., Wichmann, H.E., Schaffrath-Rosario, A., and Alavanja, M.C.: Cigarette smoking and lung cancer: modeling effect modification of total exposure and intensity. Epidemiology, 18(5), 639–648 (2007) 34. Chao, A., Thun, M.J., Henley, S.J., Jacobs, E.J., McCullough, M.L., and Calle, E.E.: Cigarette smoking, use of other tobacco products and stomach cancer mortality in US adults: The Cancer Prevention Study II. Int J Cancer, 101(4), 380–389 (2002) 35. Kuper, H., Boffetta, P., and Adami, H.O.: Tobacco use and cancer causation: association by tumour type. J Intern Med, 252(3), 206–224 (2002) 36. Connor, J.: Alcohol consumption as a cause of cancer. Addiction, 112(2), 222–228 (2017) 37. Retegui, G., Etxeberria, J., and Ugarte, M.D.: Estimating LOCP cancer mortality rates in small domains in Spain using its relationship with lung cancer. Sci Rep, 11(1), 1–10 (2021) 38. Dyba, T., Randi, G., Bray, F., and others: The European cancer burden in 2020: Incidence and mortality estimates for 40 countries and 25 major cancers. Eur J Cancer, 157, 308–347 (2021) 39. van Niekerk, J., Rue, H.: Correcting the Laplace Method with Variational Bayes. arXiv preprint, https://arxiv.org/abs/2111.12945 (2021) 40. van Niekerk, J., Krainski, E., Rustand, D., Rue, H.: A new avenue for Bayesian inference with INLA. Comput Stat Data An 181, 107692 (2023) 41. Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Van Der Linde, A.: Bayesian measures of model complexity and fit. J Roy Stat Soc B 64(4), 583–639 (2002) 42. Watanabe, S., and Opper, M.: Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J Mach Learn Res, 11(12) (2010) 43. Lunn, D.J., Thomas, A., Best, N., Spiegelhalter, D.: WinBUGS - A Bayesian modelling framework: concepts, structure, and extensibility. Stat Comput 10(4), 325–337 (2000) 44. Palmí-Perales, F., Gómez-Rubio, V., and Martinez-Beneito, M.A.: Bayesian Multivariate Spatial Models for Lattice Data with INLA. J Stat Softw, 98(2), 1–29 (2021) 45. Gaedke-Merzhäuser, L., van Niekerk, J., Schenk, O., Rue, H.: Parallelized integrated nested Laplace approximations for fast Bayesian inference. Stat Comput, 33(1), 25 (2023) 46. Cabanes, A., Vidal, E., Aragonés, N., Pérez-Gómez, B., Pollán, M., Lope, V., and LopezAbente, G.: Cancer mortality trends in Spain: 1980–2007. Ann Oncol, 21, iii14-iii20 (2010) 47. Seoane-Mato, D., Aragonés, N., Ferreras, E., García-Pérez, J., Cervantes-Amat, M., Fernández-Navarro, P., Pastor-Barriuso, R., and López-Abente, G. Trends in oral cavity, pharyngeal, oesophageal and gastric cancer mortality rates in Spain, 1952–2006: an ageperiod-cohort analysis. BMC Cancer, 14(1), 1–11 (2014) 48. 
López-Abente, G., Aragonés, N., Pérez-Gómez, B., Pollán, M., García-Pérez, J., Ramis, R., and Fernández-Navarro, P.; Time trends in municipal distribution patterns of cancer mortality in Spain. BMC Cancer, 14(1), 1–15 (2014) 49. Santafé, G., Adin, A., Lee, D., and Ugarte, M.D. Dealing with risk discontinuities to estimate cancer mortality risks when the number of small areas is large. Stat Methods Med Res, 30(1), 6–21 (2021) 50. Aragonés, N., Goicoa, T., Pollán, M., Militino, A.F., Pérez-Gómez, B., López-Abente, G., and Ugarte, M.D.: Spatio-temporal trends in gastric cancer mortality in Spain: 1975–2008. Cancer Epidemiol, 37(4), 360–369 (2013)
20
A. Adin et al.
51. Etxeberria, J., Ugarte, M.D., Goicoa, T., and Militino, A.F.: Age-and sex-specific spatiotemporal patterns of colorectal cancer mortality in Spain (1975–2008). Popul Health Metr, 12(1), 1–11 (2014) 52. Adin, A., Goicoa, T., Hodges, J.S., Schnell, P.M., and Ugarte, M.D.: Alleviating confounding in spatio-temporal areal models with an application on crimes against women in India. Stat Model, 23(1), 9–30 (2023) 53. Marques, I., Kneib, T., and Klein, N.: Mitigating spatial confounding by explicitly correlating Gaussian random fields. Environmetrics, 33(5), e2727 (2022) 54. Urdangarin, A., Goicoa, T., and Ugarte, M.D.: Evaluating recent methods to overcome spatial confounding. Rev Mat Complut, 36, 333–360 (2023) 55. Guan, Y., Page, L.G., Reich, B.J., and Ventrucci, M.: A spectral adjustment for spatial confounding. Biometrika. DOI: https://doi.org/10.1093/biomet/asac069 (2022) 56. Khan, K., Berret, C.: Re-thinking Spatial Confounding in Spatial Linear Mixed Models. arXiv preprint, https://doi.org/10.48550/arXiv.2301.05743 (2023) 57. Orozco-Acosta, E., Adin, A., and Ugarte, M.D.: Big problems in spatio-temporal disease mapping: methods and software. Comput Meth Prog Bio, 231, 107403 (2023)
Chapter 2
Machine Learning Applied to Omics Data Aida Calviño, Almudena Moreno-Ribera, and Silvia Pineda
Abstract In this chapter we illustrate the use of some Machine Learning techniques in the context of omics data. More precisely, we review and evaluate the use of Random Forest and Penalized Multinomial Logistic Regression for integrative analysis of genomics and immunomics in pancreatic cancer. Furthermore, we propose the use of association rules with predictive purposes to overcome the low predictive power of the previously mentioned models. Finally, we apply the reviewed methods to a real data set from TCGA made of 107 tumoral pancreatic samples and 117,486 germline SNPs, showing the good performance of the proposed methods to predict the immunological infiltration in pancreatic cancer. Keywords Genomics · High-throughput data · Association rules · Random Forest · LASSO
2.1 Introduction Big data has revolutionized the biomedical field and the way we study complex traits resulting in the generation of vast amounts of omics data, such as genomics, epigenomics, transcriptomics, microbiomics, or immunomics among others. In this era of rapid molecular technological advancements, studies using classical statistical techniques are becoming too simplistic when considering the reality of complex traits. Instead, Machine Learning (ML) techniques that may reflect combinatorial effects (including additive and interactive) should be contemplated to address such complexity. Given the wealth and availability of omics data, ML methods provide a powerful opportunity to improve the current knowledge about the determinants of complex diseases [1]. Developing efficient strategies to identify
A. Calviño () · A. Moreno-Ribera · S. Pineda Department of Statistics and Data Science, Complutense University of Madrid, Madrid, Spain e-mail: [email protected]; [email protected]; [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 Y. Larriba (ed.), Statistical Methods at the Forefront of Biomedical Advances, https://doi.org/10.1007/978-3-031-32729-2_2
21
22
A. Calviño et al.
new markers and its interactions requires applying ML techniques through the integration of large omics data [2–4]. Many recent findings have been powered by statistical advances that have enabled profiling of the biological system on multiple levels. Through gene expression, genetics, immune repertoire and cell profiling, we can characterize potential new markers playing decisive roles in health and disease. In this regard, the immune system requires special attention. We now know that the immune system is not only just responsible for protection against diseases and host defense but also plays a role in tissue maintenance and repair. As such, the system is spread throughout not only the blood and lymphatic systems but also most tissues. The immune system plays a key role in the regulation of tumor development and progression, and crosstalk between cancer cells and immune cells has been incorporated into the list of major hallmarks of cancer [5], consequently, the tumor microenvironment is closely connected to every step of tumorigenesis [6]. Translating this complexity into new biological insights requires advanced statistics, data mining, and ML skills. In this chapter, we focus on the proper use of ML techniques applied to the study of omics data in pancreatic cancer (PC). PC is a dreadful disease and it is the deadliest cancer worldwide with 7% 5-year survival rate. Despite the efforts to advance with this disease in many angles, there are still a number of unknown risk factors and its interactions and correlations contributing to this devastating disease. Immunological infiltration and its interactions with other factors plays an important role in PC development [7]. Indeed, some years ago, it was shown that genetic variations explained an appreciable fraction of trait heritability [8], and very recently it has been estimated that environmental factors and genetic susceptibility may explain 50% and 40% of the immune system differences [9, 10], respectively, suggesting the feasibility of genetic dissection of quantitative variation of specific immune cell types. Thus, we can hypothesize that the characteristics of the tumoral immune-infiltration are modulated by genetic susceptibility. This chapter focuses on the use of ML techniques to integrate genomicimmunomic data that may be useful in deciphering new biological insight that will help to advance PC research. First, we describe the omics data used in this work in Sect. 2.2 and discuss the main challenges faced by these data types in Sect. 2.3. Section 2.4 is devoted to the most important ML techniques in the biomedical field, as well as others less known. Finally, we show an application to a real data set in Sect. 2.5 and give some conclusions and lines of future work in Sect. 2.6.
2.2 Data Types 2.2.1 Genomics The whole genetic information of an individual can be studied using whole genome sequencing determining the order of all nucleotides within the DNA molecule. Most
2 Machine Learning Applied to Omics Data
23
of the studies have determined a subset of genetic markers to capture as much of the complete information as possible using genotyping information. These markers are Single Nucleotide Polymorphisms (SNPs) that are changes of one nucleotide base pair that occur in at least 1% of the population. In humans, the majority of the SNPs are bi-allelic, indicating the two possible bases at the corresponding position within a gene. If we define A as the common allele and B as the variant allele, three combinations are possible: AA (the common homozygous), AB (the heterozygous) and BB (the variant homozygous) which can be transformed into .0, 1, 2 considering the number of variant alleles. These combinations are known as the genotypes and they are assessed using either genotyping platforms or highthroughput sequencing generally from germline DNA (see Fig. 2.1a). One very well known type of study that utilizes genotyping information are the Genomewide association studies (GWAS) used to identify genetic variants (SNPs) extracted from germline DNA that are significantly associated with disease states (healthy vs. unhealthy individuals). Using GWAS in combination with ML techniques, advances in the identification of genetic susceptibility loci have been shown in cancer [11], sclerosis [12] or Covid-19 [13] and many other diseases. How such variation might relate to a particular disorder is not always known and more complex analyses are needed. Linking particular genetic variants (SNPs) to associated mechanisms, and one directly related with the omics integration is through the expression of quantitative trait loci (eQTL) analysis, which are SNPs that partly explain the variation of a gene expression phenotype. In recent years, many statistical and ML methods for eQTL analysis have been developed with the ability to provide a more complex perspective towards the identification of relationships between genetic variation and genetic expression [14].
2.2.2 Immunomics The adaptive immune system is composed of B and T lymphocytes which produce B cell receptors (BCR) or antibodies capable of recognizing foreign substances, such as pathogens or viruses, and T cell receptors (TCR) which recognize fragments of antigens presented on the surface of the cells. BCR (most commonly called immunoglobulins, IG) consist of two identical heavy chains (IGH) and two lightchains, Kappa (IGK) and Lambda (IGL). Human T cell receptors (TCR) consist of an alpha and beta chain (TRA and TRB) and a gamma and delta chain (TRG and TRD). The intact antibody contains a variable and a constant domain (C). Antigen binding occurs in the variable domain, which is generated by recombining a set of variable (V), diversity (D) and joining (J) gene segments forming the Band T- cell immune repertoire (IR), and its diversity is mainly concentrated in the complementary-determining region 3 (CDR3), from now on, this combination will be defined as V(D)J (see Fig. 2.1b).
24
A. Calviño et al.
Fig. 2.1 Definition and matrix data types. (a) Genomics and (b) Immunomics
Recent technological advances have made it possible to apply high-throughput sequencing to characterize BCR and TCR [15]. In addition, bioinformatic tools for extracting BCR and TCR from bulk RNA-seq have also been developed [16]. So, BCR and TCR sequencing allows for a broad examination of B cells and T cells by following changes in clonal and population dynamics. Each sequencing read from BCR or TCR can be grouped into clonotypes which is defined by the number of reads having the same V and J segments, same CDR3 length, and 90%95% nucleotide identity between CDR3s. Consequently a matrix by samples and clonotypes is built for further examination. We have shown before comprehensive analysis using ML techniques applied to this data type in cancer [17, 18] and organ transplantation [19], but numerous other examples have been done in multiple sclerosis [20] or influenza vaccine responses [21]. There has been some attempts to integrate genomic profiles and immune repertoire data, such as in this example [22] across 11 tumor types showing that high expression of T and B cell signatures predict overall survival. However, the development of new statistical strategies to analyze and integrate this type of data with other omics are still in progress.
2.3 Challenges in the Omics Data Analysis The ultimate goal of this research is to investigate whether the variability previously observed in the BCR in pancreatic cancer [17] is associated with germline genetic susceptibility (SNPs).
2 Machine Learning Applied to Omics Data
25
Genomic studies are complex and the data is large and very heterogeneous, therefore classical statistical assumptions are limited as has been extensively review previously [2, 23]. The most classical statistical approach to assess the relationship between genetic variants (SNPs) with the BCR measuring tumor immune-infiltration is a linear regression model, but one of the main assumptions for this model is the independence between the regressor variables. SNPs are variables that can be highly correlated due to linkage disequilibrium which is the non-random association of alleles at different loci in a given population. Moreover, the high dimensionality is also a current problem affecting method convergence and being computation time-consuming. To deal with these problems, ML algorithms are proposed. For example, regularized regression methods (such as the LASSO Regression) are employed when there are many features (more than samples in the study, i.e., .p >> n, where p and n refer to the number of variables and instances, respectively) and a multicollinearity problem among them. Approaches such as Random Forest consider possible interactions among the features and have increased in popularity in the last few years [24]. In complex models, it is very important to propose as parsimonious models as possible. Thus, in ML techniques, feature selection remains a big challenge to be addressed. Additionally, in omics analysis there is a special interest in finding the most influential variables to furtherly extract the biological mechanism that might be involved with the disease under investigation. Currently there are some lines of investigation [25] but given the special nature of genetic data, new proposals are needed. Another important challenge relates to the nature of the clinical endpoints frequently used in biomedical studies. That is, the majority of the studies considered dichotomous clinical variables measuring whether the individuals have or have not the disease, or any other specific characteristic, but probably subtypes of clinical outcomes should be contemplated. Additionally, statistical modeling is also designed for dichotomous dependent variables, but the majority of the time, the modeling with clinical outcomes is more complex than just two categories and the implication of ML techniques capable of working with multinomial models are needed. Taking into consideration these main challenges, in this chapter, we review two of the most commonly used ML techniques and we propose the application of association rules (AR) with predictive purposes extracting the most influential variables, as it will be later shown. Although AR has been previously applied to GWAS studies [26–28], as far as we know, they have not been implemented in the predictive context. In addition, we will focus on the application of predictive methods to variables with more than two outcomes, which are less common in the applied literature.
26
A. Calviño et al.
2.4 Machine Learning Techniques The term Machine Learning (ML) refers to a broad range of statistical and computational techniques that aim at extracting knowledge from large amounts of data. For that reason, ML methods are especially useful for multivariate approaches, which are needed for the integrative analysis of omics data types. ML techniques can be used for supervised or unsupervised learning; examples of the first category are association and prediction analysis, whereas clustering and dimensionality reduction approaches belong to the latter. In this chapter we restrict ourselves to the supervised context, where the objective is to develop a tool that allows guessing the value of a variable (usually referred to as target) by means of a set of other variables (usually referred to as input). As it will be shown in the following subsections, ML techniques are suitable for this task, especially when the number of input variables is large, the number of instances in the data set is extreme or the input variables are correlated, among others. More precisely, we focus on the methods that can be applied when the target variable is a categorical one with .K > 2 levels (usually referred to as multinomial). Many different ML methods have been previously proposed in the literature (see [29] for a review), however, none of them has been shown to be capable of coping with the vast amount of data types and applications that arise in the real world. Taking into account the characteristics of omics data, we now review three of the most convenient methods that can be applied in this context: Random Forest, Multinomial Logistic Regression, and Association Rules.
2.4.1 Random Forests Random Forests (RF) were first proposed in [30] as an attempt to reduce the large variability associated with classification and regression trees. Trees have been shown to be robust to the presence of missing and outlier values, do not require previous variable selection, work efficiently with both quantitative and qualitative input variables and are very intuitive, making them a very attractive predictive tool. Nonetheless, trees suffer from large variability (as a result of its building process) and are very sensitive to the presence of overpowering input variables (see [29]). RFs provide predictions by means of a (large) set of trees that have been built from bootstrap samples,1 where the set of input variables used to grow the branches of the trees is randomly restricted. Once the trees are built, predictions are obtained as the average of the predictions generated by each tree. Including two sources of randomness in the growing process leads to two important characteristics of RF (in comparison with single trees): (1) the variability 1 Bootstrap samples are simply random samples drawn with replacement from the original data set that allow emulating the process of obtaining new data samples.
2 Machine Learning Applied to Omics Data
27
Fig. 2.2 Random Forest illustration
of the model is reduced because of the properties of the sample mean and the decorrelation of the forest attained; and (2) most of the variables with some predictive power make part of the forest, leading to better model performance. Figure 2.2 shows an schematic illustration of the building process of a RF. Moreover, RFs have been shown to cope effectively with large data sets, mainly because the trees are built independently of each other and, thus, parallelization can be applied, reducing significantly computational times. Furthermore, no requirement exists regarding the fact that the number of variables is higher than the number of instances available. Another important feature of RFs, which comes from the properties of the trees in the forest, is the fact that they take into account interactions among input variables. Interactions arise when the input variables are not independent and, thus, the effect that they have on the target variable depends on the values that other input variables have. Interactions are made of two or more variables and are usually difficult to find with other types of models, as will be seen in the next section, especially if more than two input variables are involved. When building RFs, three parameters need to be selected:
28
A. Calviño et al.
• The number of trees that make part of the forest (ntree). Although RFs do not depend heavily on this parameter, as long as it is high enough, building too many trees can lead to unnecessary large computation time. • The size of the input variable set from which the branches of the trees are built (mtry). This parameter has a great impact on performance and computation time, as it affects the way input variables are incorporated in the trees. • The minimum number of observations that a leaf needs (minobs). If this parameter is set too small/large, the trees may be very large/small leading to worse performance of the forest. As in any other ML technique, in order to determine the best parameters, in the sense that a good performance is achieved without risking overfitting, resampling techniques are highly recommended (see [29] for more information on this topic). These techniques allow evaluating models with a set of data that did not belong to the training set and, thus, can provide more accurate and reliable estimates of the predictive power of the models. In the context of RFs, a very useful tool is the Out Of Bag (OOB) observations, which allow evaluating the model accurately while building it. As already mentioned, RFs make use of bootstrap samples in the building process and, therefore, every time a tree is grown, a subset of observations are left aside (these are the OOB ones). These data points can be used to assess performance if they are predicted using only the trees in which they did not make part of the growing process. OOB observations are one of the main advantages of RFs as they permit speeding up the process of setting the best parameters.
2.4.2 Multinomial Logistic Regression Regression models are a wide type of predictive models given by the following equation: Y = f (X1 , X2 , . . . , Xp ) + ,
.
(2.1)
where Y is the target variable, .{Xi }i=1,...,p are the input variables, .f (·) is a mathematical function that relates both types of variables and is usually referred to as link function, and . is the error term. Although trees and RFs could be considered a type of regression model, this term is frequently used for the models that can be easily written by means of the most suitable link function. In the special case of the multinomial logistic regression (MLR), where the number of different values of the qualitative target variables equals K, with some additional assumptions not included here for the sake of brevity, Eq. (2.1) leads to the following formulation of the model:
2 Machine Learning Applied to Omics Data
exp(β0, + β1, x1,i + · · · + βp, xp,i ) , P (yi = ) = K j =1 exp(β0,j + β1,j x1,i + · · · + βp,j xp,i )
.
29
(2.2)
where .i = 1, . . . , n, . = 1, . . . , K and the .β parameters contain the information of the relationship between the input variables and the target one (see [29] for more details regarding this formulation). Determining the optimal values of these parameters (including which ones are set to 0) is the key of the MLR model. MLR models have been widely used as they are a natural extension of the classical linear regression model and, thus, their .β parameters can be interpreted (as opposed to black box models where the exact effect of the input variables on the target remains unknown). However, MLR models do not cope well with outliers and missing observation, which may not be an inconvenience if a proper cleaning of the data is applied. Moreover, MLR models do not take into account interactions among input variables unless they are explicitly included in the model as products of them. Furthermore, MLR models assume that the relationship between an input variable and the target one can be modeled through (2.2) and, thus, if this assumption does not hold, the predictive power of the model might be reduced. Finally, and more importantly in this context, the number of parameters to be estimated needs to be strictly smaller than the number of instances available in the training data set. Whereas in the usual application of MLR, this drawback would have little effect (unless interactions are considered), in the context of omics data, most of the application cases deal with data sets with more variables than observations. In order to deal with this problem and to be able to efficiently select the best input variables (i.e., the ones with corresponding parameters different from zero), some modifications to the classical MLR have been proposed in the literature. In particular, the lasso regularization, and the subsequent group-lasso and sparsegroup-lasso have been shown to be very useful. In the classical MLR context, the optimal parameters are given by the maximization of the likelihood of the training data (or, equivalently, the minimization of the negative-log-likelihood). In this scenario, all the input variables considered get estimated parameters strictly different from zero and, thus, no variable selection is achieved. For that reason, if the number of observations is not sufficiently higher than the number of input variables (as the number of parameters per variable may be higher than one), no solution is found. The lasso model (acronym for Least Absolute Shrinkage and Selection Operator) was proposed in [31] as a solution to the previously mentioned drawback. The basic idea is to add a penalty term in the minimization process that essentially accounts for the number of parameters distinct from zero. The weight given to this penalty term (.λ) allows controlling the number of input variables selected, leading to the classical regression model when it is set to zero. Therefore, the classical lasso model can be stated through the following equation:
30
A. Calviño et al.
⎫ p K ⎬ βi,j , ˆ = argmin − log L + λ .β ⎭ β∈Rp×K ⎩ ⎧ ⎨
(2.3)
i=1 j =1
where .L represents the corresponding likelihood, not shown here for the sake of simplicity, and .β is a .p × K matrix containing the .β parameters. The original lasso model, although very effective when dealing with data sets where .p >>> n, is not always the best choice as it can lead to models where only certain parameters of an input variable are selected (and not all of them). In order to avoid this, the group-lasso was proposed [32] where a modification of the original penalty term is suggested that forces the minimization process to select all or none of the parameters in a group. Generally, the groups are defined by means of the input variables the parameters are associated with, but it is not the only case. The group-lasso might be too restrictive in some cases leading to models with less predictive power. For that reason, a third alternative was proposed, referred to as sparse-group-lasso [33], where both penalty terms are added to the minimization process, with a weight of .α to the lasso and .1 − α to the group-lasso, with .α ∈ [0, 1]. For that reason, the sparse-group-lasso can be seen as a generalization to the previous methods, leading to the classical lasso when .α = 1 and to the group-lasso when .α = 0. In order to determine the best MLR model (with the corresponding lasso regularization) the best combination of .α and .λ needs to be established. As opposed to the RF, in the MLR context there are no OOB observations and, thus, a different resampling technique needs to be applied. In the context of ML techniques, it is very common to resort to k-fold cross-validation (k-CV), where the data set is divided into k parts and, successively, one is devoted to evaluate the model and the remaining k-1 are used to train the model. When the process has been repeated k times, all of the observations in the data set have been predicted once (without having been part of the training process) and that information can be used to accurately evaluate the model. As a final remark to LASSO-based models, we highlight that this type of models are limited to the sample size as the maximum number of parameters that can be set different from zero.
2.4.3 Association Rules Although association rules (AR) are not usually considered supervised models, they can be used to predict the value of a target variable using the values of the input ones, leading to what is usually referred to as Rule-Based Classifiers [34, 35]. It should be noted that this approach is not very frequently used in the GWAS literature and is, therefore, one of the main contributions of this chapter. AR are commonly applied in the retail context and, for that reason, its generation is usually referred to as Market Basket Analysis (MBA), as they were first proposed
2 Machine Learning Applied to Omics Data
31
for determining the products in a market basket that were frequently bought together [36]. Applying the AR terminology, AR mining consists of detecting sets of items (which can be levels of a variable) that have a “large” joint frequency and evaluating the co-occurrences found afterwards. More precisely, these co-occurrences are given by “if-then” statements, where the if clause is denoted as “lhs” (left hand side) and the consequence, as “rhs” (right hand side). In the MBA context, an example of rule would be “If milk and eggs are bought, flour will be as well.” Thus, translating into genetic studies, the rule could be “If a variation on SNP rs1 and rs3 is present, SNP rs5 will have one as well.”2 The aim of AR is, thus, finding rules that take place frequently and can be considered reliable. In order to evaluate AR, three new concepts arise: • Support: the support of a rule is simply the joint relative frequency of the items that belong to the rule. It can be interpreted as the joint probability of the items in the rule. The support is a measure of how often a rule can be used. • Confidence: the confidence of a rule is the probability of the consequence (rhs) conditional to the antecedent (lhs). The confidence of a rule evaluates how often it is true. • Lift: the lift is the ratio of confidence to support. This measure compares how often the lhs and the rhs occur together in comparison with what should be expected if both were independent. Alternatively, it quantifies how often the rhs arises along with the lhs in comparison with the “general population.” Ideally, all combinations of items should be generated and evaluated in order to generate a list of useful ones. However, as the full set of items is generally very large, this procedure is usually unfeasible. For that reason, several procedures have been proposed in the literature to find AR in a data set (see [37] for a review on association rules mining). In this chapter we restrict ourselves to two of them: the apriori algorithm and the Random Forest. The apriori algorithm (proposed in [38]) assumes that, in order to efficiently generate the rules, the ones that do not appear frequently enough should be skipped as they will not be useful. For that reason, instead of generating all possible combinations of items, the algorithm sequentially generates larger rules combining previous sets that have a sufficiently high support (called frequent itemsets). Applying this simple idea, computation times are drastically reduced without risking removing useful rules (as frequent itemsets are always made of frequent “itemsubsets”). Moreover, rules that do not reach a minimum confidence level are removed, reducing the storage requirements. This algorithm was initially proposed in the context of MBA and, thus, if one wants to apply it to generate AR more generally, the variables involved in the rules to be generated need to be transformed previously. More precisely, each variable
2 Please note that the usual way of referring to SNPs in the literature is by means of “rs” followed by its corresponding identification in numbers.
32
A. Calviño et al.
needs to be converted to an item in such a way that, instead of working with items purchased together, we deal with values of variables that take place in the same observation. For that reason, qualitative variables need to be converted to dummy variables and quantitative variables need to be first discretized and then converted to dummy variables as well. Regarding quantitative variables, we note that some loss of information takes places because of the discretization process and, thus, this process needs to be done carefully. Although the apriori algorithm significantly reduces computation time, if the number of variables is very large (and, thus, the number of items is even larger), it still can become unfeasible to generate the rules as the process cannot be parallelized. In such cases, one can resort to different methods of variable selection to reduce the number of variables (or equivalently items) involved. Alternatively to the apriori algorithm, the RF method assumes that the leaves of the trees in the forest can be considered as rules [39, 40], as the conditions that define the leaf might serve as antecedents and the majority target classes, as their consequences. Once generated, the rules are analyzed and the ones not reaching a sufficient confidence level are discarded. This method has several advantages: the first one deals with the fact that input variables do not need to be converted to dummy variables and, thus, the items considered can be made of several levels of the same variable. Finally, as the procedure can be parallelized, it is very fast in comparison with the apriori algorithm. However, because of the randomness included on the input variables, not all combinations of them are evaluated and, thus, some powerful rules might never be found. It is important to highlight that this method is only applicable when the only desired rules are the ones that involve a single variable (in our case the target one) as consequence. As previously mentioned, AR can be used with predictive purposes if the rules obtained contain one of the levels of the target variable as the consequence. In this sense, the observations that fulfill the antecedent condition, can be classified as the consequence. Classical predictive models need to provide a procedure that allows classifying all observations in a data set. However, some observations are generally more difficult to predict than the rest and, because of that, if models are evaluated considering all the available observations it might seem that the models are useless or, in other words, that the input variables do not have information on the target one. AR permit focusing only on the “strong” associations between input and target variables, even if they do not involve all observations in the data set, leading to some insight on the relationship between input and target variables. Moreover, the antecedent can contain several input variables, allowing to include interactions among them.
2 Machine Learning Applied to Omics Data
33
2.5 Application 2.5.1 Study Subjects The data used in this chapter come from a total of 144 confirmed PC cases from TCGA [41].3 The BCR data was extracted from the RNA sequencing (RNA-seq) data using MiXCR tool [16], which align the RNA-seq data in fastq format (storage of sequence data) to the VDJ region to extract IGH, IGK, IGL and TRA, TRB, TRD and TRG. In previous analysis [17], we found IGK clonotypes (defined by the same V and J gene, same CDR3 length, and 90% nucleotide identity) discriminating PC samples. Using a centered log ratio (CLR) LASSO for compositional data followed by a hierarchical clustering, three main clusters were found to classify the PC samples. The distribution of samples across the three clusters was: cluster 1 (22.43%), cluster 2 (43.93%) and cluster 3 (33.64%). Cluster 1 was characterized by a lower infiltration and was more similar to the normal pancreas associated with higher tumor purity and worse survival while cluster 3 was the one with higher infiltration and better survival. Cluster 2 was closer to cluster 3 but showing some differences in tumor purity and survival. We now believe that these clusters might be explained by germline genetic variation as stated in the introduction [8–10]. Therefore, for the purpose of this chapter, we downloaded the genotyping data available from the same set of patients in TCGA. A total of 906,600 blood genotypes were available from 120 PC samples. The monomorphic SNPs and the ones with minor allele frequency .0.8). In this data set, patients come from different races, to avoid population stratification, we selected only the Caucasian individuals (.n = 107 individuals). The final data set was composed of a total of 117,486 SNPs and 107 PC samples and SNPs were defined as .0, 1, 2 corresponding to the number of variant alleles per sample.
2.5.2 Material and Methods Taking all previous information into consideration, the aim of this application is to develop a predictive tool considering the set of SNPs that allows to classify PC patients into one of the 3 previously mentioned clusters, and give some insight on the most influential SNPs. For that purpose, we will make use of LASSO MLR models, RF and AR.
3 https://portal.gdc.cancer.gov/.
34
A. Calviño et al.
As already mentioned, when applying prediction methods, the use of resampling techniques is widely spread as it permits obtaining more accurate estimates of the prediction power. In this chapter, we will partition the data set into training (80% of the samples) and testing (the remaining 20%) in order to develop and evaluate the models, respectively. When dealing with multinomial models, the most frequent measure of evaluation is the accuracy, which is simply the percentage of observations that have been correctly classified. However, in the presence of unbalanced target variables, this measure can lead to misleading conclusions as the best results can be achieved by classifying all observations into the larger category from the target variable. In order to avoid this problem, alternative evaluation measures have been proposed and analyzed [43] that focus mainly on the predicted probability of each of the categories. In this chapter, we make use of the pairwise AUC (Area Under the receiver operating characteristic Curve), which is obtained as the average of the AUCs obtained from the predictive probabilities considered in a one vs. one scenario. On the other hand, regarding the AR, we will make use of the three previous measures, i.e., support, confidence, and lift. Regarding the software used for this application, we make use of R 4.2.0 and, more precisely, packages randomForest, msgl, and arules for the prediction part. For the performance evaluation, package pROC provides the pairwise AUC and we have adapted the code of package OOBCurve for the OOB resampling technique. Moreover, the CPU time required to train the models on a HP ProDesK 400 G7, Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz, RAM: 16 GB (8 .× 2GB), was 19 s for the RF (parallelizing the process) and 125 s, for the LASSO MLR.
2.5.3 Results 2.5.3.1
Random Forest and LASSO Multinomial Logistic Regression
In order to find the most suitable predictive model for a data set, the corresponding parameters need to be tuned. For that purpose, as already mentioned, we resort to the OOB observations in the RF scenario, and the k-CV scheme for the LASSO MLR applied only to the train partition of the data. Figures 2.3 and 2.4 show the pairwise AUC of both types of models for the different parameters that need to be tuned. In particular, Fig. 2.3 shows the results for the RF parameters, i.e., minobs, mtry and ntree. As it can be seen, there is no clear pattern regarding the best configuration. Moreover, the models do not seem to be very powerful, as for many configurations the AUC is smaller than 0.5. Nonetheless, the best configuration leads to a value slightly larger than 0.6 and corresponds to approximately 100 trees (94 precisely), a minimum of 10 observations per leaf (which corresponds with approximately 10% of the observations available) and random sets of 300 variables. Regarding the best setup, Table 2.1 contains
2 Machine Learning Applied to Omics Data
35
Fig. 2.3 Pairwise AUC obtained from the OOB observations for the different configurations of the RF
Fig. 2.4 Pairwise AUC obtained from k-CV for the different configurations of the LASSO MLR
information on the performance as well as the number of SNPs (variables) that have been effectively selected by the RF (out of the 117,486 available). On the other hand, Fig. 2.4 shows the results for the LASSO MLR parameters, i.e., the values of the .λ and .α parameters. It should be noted that, for the sake of understanding, instead of plotting the value of .λ, we have opted for the corresponding number of features selected (that is, the number of variables that have, at least, one parameter different from zero). As it can be seen, this type of model leads to lower predictive power (as the pairwise AUCs obtained take smaller values). Among many other reasons, we believe that the fact that interactions are not taken into account, being frequently expected in the SNPs context, could be one of the main reasons. Nevertheless, we can conclude that the best configuration consists of selecting 63 features applying a value of .α equal to .0.125. Table 2.1 contains information on the performance of the LASSO (both by means of the k-CV and the test partition).
36
A. Calviño et al.
Table 2.1 Performance measures of the best RF and LASSO MLR evaluated in the test partition and the corresponding resampling method. The number of SNPs selected by each model is also shown RF LASSO
# of SNPs 723 63
AUC pairwise (test) 0.668 0.530
AUC pairwise (OOB/k-CV) 0.607 0.586
Table 2.2 Summary of the three sets of association rules obtained Method # of rules RF extraction 115 RF apriori 14,482 LASSO MLR 29,773 apriori
2.5.3.2
# SNPs in Average lift Max lift a rule 2–8 2.014 4.458 1–2 1.740 4.087
% of rules target .= 1 4.35% 0.19%
% of rules % of rules target .= 2 target .= 3 66.96% 28.69% 94.49% 5.32%
1–7
0.72%
80.38%
2.012
4.458
18.90%
Association Rules
As the performance of the previous models is not sufficiently good to apply them in a predictive context, we now generate association rules that can lead to extract new information of the relationship between SNPs and our target variable. In order to generate the AR, we have applied the two different algorithms previously mentioned (RF extraction and the apriori algorithm). For the latter, we have obtained two different sets of rules by working only with the variables selected in the corresponding predictive model (LASSO MLR and RF). The rationale behind this decision is computational as the number of available items (without subsetting) equals to 352,461 (three dummy variables per SNP plus other three for the target variable) and the corresponding number of combinations gets very large. In this sense, it is important to highlight that the library we have used, arules, establishes a maximum time for obtaining the combinations. For that reason, when the total number of items is large, the rules obtained are usually made by only one or two items. Moreover, this function does not allow generating only the rules that have the target variable as a consequence and, for that reason, the computation capability is not maximized. For instance, the RF-apriori leads to over 60 million rules, most of which are discarded. Table 2.2 contains information on the three sets of rules than have been found: two using solely the SNPs selected by the best RF and LASSO MLR separately4 and the one obtained directly from the leaves in the RF. It is important to highlight that we have set a minimum support of .10% (which corresponds to the optimum number of observations per leaf in the RF) and a minimum confidence of .0.7.
4 We note that we have not considered the SNPs selected by both methods at the same time because of the time limit previously mentioned, as the number of rules generated was smaller than the sum of the rules obtained separately.
2 Machine Learning Applied to Omics Data
37
As it can be seen, the number of rules extracted from the RF is significantly smaller than the apriori cases; this is due to the fact that not all combinations of SNPs are evaluated and, moreover, the number of selected trees in the forest is set relatively low (94). Despite of that, the strength of the rules is as high as for the rules generated with SNPs selected by the LASSO MLR (both methods lead to similar values of the lift). Because of the large number of SNPs selected by the RF, the apriori algorithm is not capable of generating combinations involving more than 3 items. This fact has two main consequences: (1) less complex interactions among SNPs can be found and (2) the rules found are slightly less strong than in the other methods (although still useful). From the set of SNPs selected by RF and LASSO in the predictive models, less than 1% are overlapping between the two methods. Therefore, the number of different rules achieved by the three methods altogether reaches 45,000. Importantly, all these rules might be used with predictive purposes and, moreover, can give new biological insight considering the combinations of SNPs that frequently lead to the different levels of the target variables. Furthermore, Table 2.2 includes information on the percentage of rules having as consequence the three levels of the target variable. The results show that significantly more rules are generated for the second level, which is related to the fact that this level is the most frequent one. Nevertheless, as many rules are generated, even if the percentage is low, the absolute number per level is not negligible. Finally, in order to illustrate the rules generated, Table 2.3 shows, considering each of the outcomes of the target variable as consequence (RHS), the rules with the largest lift value for the three methods. As it can be seen, the rules associated with the first level are the “strongest” ones, as the maximum lift value exceeds 4, whereas this maximum value is close to 3 and slightly larger than 2 for the third and second levels of the target variable, respectively. This is in contrast with the number of rules generated, showing that more rules do not necessarily mean better predictive power. As an example of the rules obtained, we now comment in detail two of the rules in Table 2.3: • The first rule included in the table for the RF apriori method states that the patients that have the two common alleles for the SNP rs8298918 and the two variant alleles for the SNP rs8633827, will belong to the first level of the target variable. This combination of facts happens in 10.3% of the patients and holds in 91.7% of the cases. Regarding the interpretation of the lift, patients that have the two common alleles for the SNP rs8298918 and the two variant alleles for the SNP rs8633827 are 4.09 times more likely to belong to the first level of the target variables than a randomly selected patient. A similar analysis can be carried out for the remaining rules obtained through the apriori algorithm). • The second rule shown in the table for the RF extraction procedure can also be analyzed using the previous ideas regarding the values of support, confidence, and lift. However, it should be noted that, as previously mentioned, rules obtained
38
A. Calviño et al.
Table 2.3 Analysis of the most important rules generated by method and target value Method RF extraction
RF apriori
LASSO MLR apriori
Antecedent SNPs rs8313 = 0, rs38334 >= 1, rs602560 >= 1, rs81162 = 0, rs97842 = 1, rs43904 >= 1, rs45962 >= 1 rs12902 1,
.
(3.2)
where j is the true number of modes of the distribution. This would allow us to know for which genes their genetic expressions are grouped around more than one value. If that hypothesis is rejected, a second step would be to determine which genes behave “abnormally.” Those are the genes that present different groups of patients, but characterizing patients with high and low expression would not be enough. To do this, one could test the hypothesis .H0 : j = 2 versus H1 : j > 2. The different tests that allow us to analyse the null hypothesis of unimodality and bimodality will be briefly summarized in Sect. 3.2. Section 3.3 apply these test to determine which gene expression behave differently depending on the individuals with breast cancer. We give some concluding remarks in Sect. 3.4. A list of genes where the null hypothesis of unimodality was rejected is shown in the Appendix.
3.2 Analysing the Number of Groups With the aim of analysing which genes present their genetic expressions gathered around a certain value, we will provide a brief review on some statistical tests for this purpose. As mentioned in the Introduction, our problem can be formulated in terms of a test for assessing the underlying distribution of the observed sample is unimodal. A first approach, as noted before, would be to determine the number of groups by computing the number of components fitting a mixture model. For example, one could try, for different values of M, to fit a mixture of M normal densities to the data. Given a sample that could be done by obtaining the maximum likelihood estimators of the parameters .μ, .σ and .w of the density fμ,σ ,w (x) =
M
.
wj gμj ,σj (x),
j =1
where each .gμj ,σj is a Gaussian density with mean .μj and standard deviation .σj , and .wj ∈ [0, 1] is the weight, satisfying . M j =1 wj = 1. Then, after
3 Multimodality Tests for Gene-Based Identification of Oncological Patients
51
exploring different values of M, the number of components could be selected using, for example, a Bayesian Information Criterion. A revision of parametric mixture models, with other components and other ways of selecting the number of components, can be found in McLachlan and Peel [8]. For the reader interested in how to employ that procedure in practice, we refer to the R [9] packages mixtools [10], mclust [5], or mixAK [11]. However, if modes are determined in this way, we may encounter a misspecification problem when modelling with simple parametric models the gene expression. In addition, the number of components in a mixture model does not necessarily coincide with the number of modes, so the selection of these particular points is not direct. This can be seen in Fig. 3.4, where a mixture with two components yields a unimodal density estimation. Although mixture models are quite flexible, their use for the purpose of this work presents important drawbacks. Thus, this is one of the situations when it is better not to impose restrictions on the distributional shape. In particular, the idea is to determine the number of modes using nonparametric methods which is the perspective followed in this work. If the objective would be performing nonparametric clustering or classification, we refer to the review paper by Menardi [12]. There, one can find several references on how to perform modal clustering, but most of them depend on some choices, either the employed bandwidth or the initial potential mode candidates. This may be useful when the objective is to classify individuals in the sample or when a classification rule is needed. If that is the goal, one can also find different R packages, see, for example, LPCM [13], modalclust [14], or pdfCluster [15]. Usually, the methods that perform modal clustering do not provide an initial formal well-calibrated test to assess the number of modes. The last is the objective of this chapter, to know if a gene would be a candidate to divide the population in two (or more) groups. In this sense, we will provide a brief review of the statistical techniques that allow to nonparametrically performing the hypothesis testing (3.2). Focusing on the tests that allow us to analyse which genes have their genetic expressions grouped around a certain value, that is when we want to test whether the underlying distribution of the sample is unimodal, there are some nonparametric alternatives. Most of them, are based on two concepts: the critical bandwidth and the excess mass. The critical bandwidth was proposed by Silverman [16] and can be defined as the smallest bandwidth for which the kernel density estimation has one mode if we are testing for unimodality. This bandwidth was extended also to the case of testing for a general number (k) of modes .H0 : j = k versus .H1 : j > k. The formal definition of the critical bandwidth for k modes is: hk = inf{h : fˆh has at most k modes }.
.
(3.3)
In Fig. 3.6 (left), we show the density estimation when employing the critical bandwidth for one of the gene expressions. In Fig. 3.6 (left), we can see that the estimation is unimodal, with a mode at the value 3870, but if one considers smaller bandwidths a second mode appears around the value 9000. Besides being the standard choice, another reason for using the normal density function as kernel
J. Ameijeiras-Alonso and R. M. Crujeiras
2e−04 0e+00
1e−04
Density
3e−04
52
2000
4000
6000
8000
10000
200960_x_at
Fig. 3.6 Left: Kernel density estimation for the 200960_x_at gene expression sample, using the Gaussian kernel and the critical bandwidth for one mode, .h1 . Right: excess mass (area in grey) for the density function f (solid black curve) and .λ = 0.0001
can be found when numerically computing the critical bandwidth. The reason is that the number of modes of .fˆh is a nonincreasing function of h when using a Gaussian kernel. Silverman [16] proposed to employ .hk as the test statistic for testing .H0 : j = k. The calibration of this test is done using bootstrap techniques, where the resamples are generated from the distribution associated with .fˆhk . This test statistic was studied, more in detail, by Hall and York [17]. They found out that Silverman [16] proposal was not consistent and suggested a correction for the unimodality test (3.2). For performing their unimodality test, it must be assumed that the mode is located in a known closed interval I . The critical bandwidth of Hall and York [17] is defined as: h HY = inf{h : fˆh has exactly one mode in I }.
.
Since Silverman’s proposal is not well-calibrated and the support of the gene expression depends on each gene, as can be seen in Fig. 3.2, we will not employ these proposals for testing unimodality in our data. More details about how to correct the bootstrap procedure for testing unimodality with the proposal by Hall and York are given in [17]. Also, the critical bandwidth (3.3) was employed by Fisher and Marron [18] to estimate the cumulative distribution function .Fˆhk (x) = x ˆ −∞ fhk (t)dt. They proposed to employ the following Cramér-von Mises test statistic for testing k-modality, Tk =
n
.
i=1
2i − 1 Fˆhk (X(i) ) − 2n
2 +
1 , 12n
where .{X(1) ≤ · · · ≤ X(n) } is the ordered sample. The distribution of this test statistic under the null is approximated by using a bootstrap procedure, where the resamples are generated from the distribution associated with .fˆhk . As shown in the
3 Multimodality Tests for Gene-Based Identification of Oncological Patients
53
simulation study by [19], the behaviour of the Fisher and Marron proposal is far from satisfactory. The second group of tests for multimodality is based on the idea of excess mass, which was introduced by Müller and Sawitzki [20]. To obtain the expression of this statistic from the sample, it is necessary to define the empirical mass excess for k modes, that is, En,k (Pn , λ) =
.
sup C1 ,...,Ck
k
Pn (Cl ) − λ||Cl || ,
(3.4)
l=1
where .λ > 0, .{C1 , . . . , Ck } are closed and disjoint intervals, and .Pn (Cl ) = n 1 I(Xi ∈ Cl ) is its empirical probability, being .I the indicator function. Given a n i=1
density f , the population version of the excess mass is represented in Fig. 3.6 (right). There, we can see how the excess mass for .k = 2 modes is constructed. Given a value of .λ = 0.0001, the probability associated with two closed intervals (i.e., the area under the curve of f ) minus .λ times its measure is computed. Among all the possible two closed intervals, those maximizing that difference are chosen. The area in grey in Fig. 3.6 (right) is equal to the excess mass for the represented density function. For testing k-modality, the difference between the excess of empirical mass for .(k + 1) modes and k modes is computed. Then, the test statistic is obtained from the maximum in .λ of those differences. More formally, the excess mass statistic for testing k-modality is equal to n,k = max{En,k+1 (Pn , λ) − En,k (Pn , λ)}.
.
λ
(3.5)
The null hypothesis .H0 : j = k should be rejected when the value of .n,k is “large.” Müller and Sawitzki [20] proposed to calibrate this test when testing for unimodality (3.2) using Monte Carlo techniques. In particular, they proposed to generate resamples from the uniform distribution. The result is a test that is extremely conservative when testing for unimodality (see, e.g., [19]). Using the excess mass test statistic, the first (asymptotically) well-calibrated test for unimodality was introduced by Cheng and Hall [21]. They proposed to approximate the distribution of .n,1 under the hypothesis of unimodality using bootstrap techniques. The resamples are generated from a parametric family of distributions (beta, normal, or rescaled Student-t distributions) depending on the
1/5 (see [21] for more details). The issues of their value of . f 3 (x0 )/|f (x0 )| approach are that it can be only employing for testing for unimodality and that a well-calibrated test is only obtained for large values of the sample size, when the true underlying density has a “complex” shape. An alternative way of calibrating the excess mass test statistic was proposed by Ameijeiras-Alonso et al. [19]. Their method consists in calibrating the excess mass statistic using a bootstrap procedure, where the resamples are generated from (a modified version of) .fˆhk (see [19] for details), the KDE with critical bandwidth. The
54
J. Ameijeiras-Alonso and R. M. Crujeiras
Table 3.1 Nonparametric multimodality tests. The column “resamples” indicate the density that is employed to generate resamples under the null hypothesis to compute the p-values. NM shows the number of modes that can be tested: only unimodality (1) or a general number of modes (k). AC indicates if the method is asymptotically well-calibrated. .Yes∗ means that it is asymptotically well-calibrated when the support I is bounded Test statistic Critical bandwidth .hk .h HY Cramér-von Mises .Tk Excess mass
.n,1 .n,k
Method Silverman [16] Hall and York [17] Fisher and Marron [18] Müller and Sawitzki [20] Cheng and Hall [21] Ameijeiras-Alonso et al. [19]
Resamples
NM k .fˆh HY (corrected bootstrap) 1 .fˆhk k Uniform 1 Parametric family 1 Modified version of .fˆhk k
.fˆhk
AC No ∗ .Yes No No Yes Yes
first advantage of this last proposal is that a general number of modes (.H0 : j = k) can be tested, with an asymptotically well-calibrated test (see [19] for the theoretical proof). The second is that their proposal presents correct behaviour even when the sample size is “small.” The different nonparametric multimodality tests reviewed on this book chapter are summarized in Table 3.1, providing information about how resamples to approximate p-values are obtained, how many modes can be tested and if the asymptotic calibration of the procedure is correct. There, we can see that only the proposals by Hall and York, Cheng and Hall, and Ameijeiras-Alonso et al. are asymptotically well-calibrated. Among these three proposals, only AmeijeirasAlonso et al. method allows for testing a general number of modes k. The finite behaviour of the different proposals summarized in Table 3.1 was analysed using a simulation study in [19], where it is shown the correct behaviour, when .n = 200, of Ameijeiras-Alonso et al. proposal, .n = 200 is the sample size in the studied dataset (see Sect. 3.3). In Sect. 3.3, we will employ their procedure for testing, first, if a gene can distinguish between at least two groups of patients (i.e., if unimodality is rejected), and, in that case, if that gene behaves “abnormally” (i.e., if more than two groups of patients are detected for that particular gene). For the reader interested in using any of the tests reviewed in this section, in the statistical software R, we refer to the library multimode [7].
3.3 Application to the Gene Expression of Breast Cancer Patients The application of the unimodality and bimodality tests will be carried out on data from patients with breast cancer. We employed the database analysed by Schmidt et al. [2]. A sample of 200 patients is considered in their study, all between 34 and 89 years old, with breast cancer treated at the Department of Obstetrics and Gynaecology of the Johannes University Gutenberg (Mainz, Germany) between
3 Multimodality Tests for Gene-Based Identification of Oncological Patients
55
1988 and 1998. The database can be found in the NCBI GEO repository under code GSE11121 [22]. All of these patients were treated with surgery, without receiving any systematic therapy in the treatment performed to decrease the risk of cancer reappearing. In [2], the RNA expressions of 22283 genes were considered. For each of these genes, the gene expression has been quantified as a continuous variable. Its values have been obtained from the colour scale shown in the microarray, from red (high expression) to green (low expression). In Fig. 3.1, as an illustration, we showed a microarray for a subsample of 20 patients and 25 genes. As an example, get for now the expression of a single gene, for example, suppose that only the gene 200960_x_at is considered (corresponding histogram and kernel density estimation are represented in Figs. 3.2 and 3.3). Then, the unimodality test of Ameijeiras-Alonso et al. could be employed for the sample with .n = 200 individuals. This is implemented in the R function modetest of the library multimode [7]. The result of performing that test over this sample with .B = 1000 bootstrap replicates is a p-value equal to .0.003. Thus, for a significance level .α = 0.05, the null hypothesis that all patients belong to the same group can be rejected. If only this gene is considered, the conclusion would be that this gene is useful to distinguish between, at least, two groups of patients. Since the null hypothesis of unimodality is rejected, then, we will test if there are more than two groups, to identify if this gene behaves abnormally. After applying that test (with .B = 1000 resamples), we obtain a p-value equal to .0.282. Thus, at a significance level .α = 0.05, there are no evidence to reject the null hypothesis of bimodality on this gene. In summary, the gene 200960_x_at would be a good candidate to distinguish between two kinds of patients. The shortcoming in the application of such an approach is that we have a total of 22283 samples (genes) for which the null hypothesis of unimodality should be tested. If we apply the unimodality test of Ameijeiras-Alonso et al., at a significance level .α = 0.05, since it is well-calibrated, even if all the genes are unimodal, we expect that the null hypothesis will be rejected for .22283 × 0.05 ≈ 1114 genes. Thus, it is clear that a False Discovery Rate (FDR) correction must be applied in order to control the incorrect rejections of the null hypothesis. During the last few years, several ways of “correcting” the p-values under this situation have been proposed. The objective of this chapter is not to do a review of the different available methods for controlling the FDR. For the reader interested in that topic, we refer, for example, to the review papers [23] and [24], where also some other applications in genetics are provided. In this particular application, we suggest employing the FDR correction proposed by Benjamini and Hochberg [25]. The reason for suggesting that procedure is, first, because is simple to implement. It only consists in, first, ordering the obtained p-values after applying the null hypothesis of unimodality .p(1) ≤ · · · ≤ p(22283) . Secondly, obtain the largest value of b such that .p(l) ≤ lα/22283, being .l = 1, . . . , L the tests to be performed. Thirdly, reject the null hypothesis of unimodality for the genes associated with the first l ordered p-values. This method is implemented on the p.adjust function of the R package stats. 
The second, and most important, reason is that this procedure works correctly if we have independent or positively correlated tests (see [26]). In
56
J. Ameijeiras-Alonso and R. M. Crujeiras
this case, it is natural to assume that we have positively correlated tests as many genes may be related and we are considering, for all of them, always the same individuals. We will use a significance level .α = 0.05. Now, considering that the decision between rejecting or not the null hypothesis will depend on “small” differences in the p-values, we need to consider more decimal digits on the p-values. Since we are using a bootstrap methodology, this means that we have to use a “large” number of resamples in our test. For that reason, we decided to employ .B = 105 bootstrap resamples to test for unimodality on each gene. The result of applying the unimodality test of Ameijeiras-Alonso et al., together with the FDR correction of Benjamini and Hochberg is that the null hypothesis of unimodality is rejected for 77 genes. In Appendix 3.4, we provide the full list of genes where the null hypothesis of unimodality was rejected. For these 77 genes where there are at least two subgroups of individuals, we tested the null hypothesis of bimodality with the test of Ameijeiras-Alonso et al. Again, the FDR correction of Benjamini and Hochberg was employed. The result is that one gene behaves “abnormally.” Only for the gene 203128_at, more than two subgroups of patients were detected. By analysing the RNA expression of different genes from breast cancer patients, we have been able to identify a subselection of only few genes in which the expression divides into two groups. Thus, the proposed procedure allows us to identify which genes would be potential candidates to apply different treatments depending on the specific expression of those genes for each patient.
3.4 Conclusions and Future Work In this book chapter, we have shown how multimodality tests can be employed to identify which genes allow breast cancer patients to be classified into two different groups. For reaching that objective we have provided a well-calibrated statistical test that allows us to find genes that discriminate patient subgroups. We have argued that the combination of the multimodality test of Ameijeiras-Alonso et al. [19], together with the FDR correction of Benjamini and Hochberg [25], provides a procedure that ensures reliable results in practice. The employed techniques do not restrict only to this disease and could be employed in other different frameworks, where the objective is to determine the number of patients groups. This is not the first time that bimodality has been used in order to determine the number of groups of patients. In particular, Hellwig et al. [4] also employed the dataset of [2] with the same objective. There, they discuss different tools/scores to analyse if the distribution of the data is bimodal. However, among the employed methods, only the dip test of [27] performs a statistical test of unimodality (and, thus, provides a p-value). The dip test allows to test (3.2) and it provides exactly the same results as the conservative unimodality test by Müller and Sawitzki [20] (see [19] for more details). This is, probably, the reason why Hellwig et al. only found 8 genes with multimodal gene expressions. Also, the paper by Hellwig et al. [4]
3 Multimodality Tests for Gene-Based Identification of Oncological Patients
57
does not provide any tool to identify which genes are bimodal and which behave abnormally (have more than 3 modes). The current book chapter provides answers to those questions with more accurate techniques for identifying genes with a bimodal gene expression. Future work could address the prognostic relevance of the selected genes (see Appendix 3.4). Some of these genes are already reported in the biomedical literature for having different expressions in different subsets of patients. For example, it is known that “high” gene expression of RAB27B [28] or HIST1H1A [29] correlates with lower survival time of breast cancer patients. The only gene from our set behaving abnormally, SPTLC2 (203128_at), was reported as a gene whose high expression predicts poor survival time in patients with pancreatic adenocarcinoma [30]. Spears et al. [30] only divided their patients into two groups. Based on our results, when analysing the effect of SPTLC2 in breast cancer patients it would be interesting to classify them according to their gene expression in at least three groups. Minchin and Butcher [31] have already demonstrated the applicability of dividing the gene expression into three groups according to the NAT1 gene expression. In their paper, they conclude that the overall survival and the response to chemotherapy of a patient depend on the gene expression subgroup. Finally, further analysis should be done in other genes that appear in Appendix 3.4 and that have not been reported yet as prognostic genes in breast cancer, since they are promising candidates to identify new markers to improve the stratification of breast cancer patients and potentially improve their treatment. Acknowledgments The work of this book chapter was supported by Grant PID2020-116587GBI00 funded by MCIN/AEI/10.13039/501100011033 and the Competitive Reference Groups 2021– 2024 (ED431C 2021/24) from the Xunta de Galicia. The authors thank to the Editor and the anonymous reviewers for their valuable comments that led to an improvement of the book chapter. We also thank Dr. Raquel Cruz for her contributions and discussions on genetic interpretations.
Author Contribution Statement All authors have contributed equally and declare that they have no conflicts of interest.
Appendix Genes Presenting a Multimodal Pattern In this appendix, we include all the genes for which unimodality was rejected for a significance level .α = 0.05. These results were obtained after applying the unimodality test of Ameijeiras-Alonso et al. [19], together with the FDR correction of Benjamini and Hochberg [25]. In bold, we highlight the only gene behaving abnormally, i.e., the gene for which we significantly found more than
58
J. Ameijeiras-Alonso and R. M. Crujeiras
two groups of patients. For each identified gene (each row), we give the Probe Set ID (first column), the gene symbol (second column), and the description (third column). The gene symbol and description were obtained with the R package hgu133plus2.db [32].
Probe set ID 203815_at 205186_at 205230_at 205637_s_at 207199_at 207815_at 208484_at 209151_x_at 211010_s_at 215653_at 215845_x_at 215915_at 217111_at 219360_s_at
Gene symbol GSTT1 DNALI1 RPH3A SH3GL3 TERT PF4V1 HIST1H1A TCF3 NCR3
Description glutathione S-transferase theta 1 dynein, axonemal, light intermediate chain 1 rabphilin 3A SH3-domain GRB2-like 3 telomerase reverse transcriptase platelet factor 4 variant 1 histone cluster 1, H1a transcription factor 3 natural cytotoxicity triggering receptor 3
GULP1 AMACR TRPM4
219882_at 220039_s_at 220061_at
TTLL7 CDKAL1 ACSM5
GULP, engulfment adaptor PTB domain containing 1 alpha-methylacyl-CoA racemase transient receptor potential cation channel, subfamily M, member 4 tubulin tyrosine ligase-like family member 7 CDK5 regulatory subunit associated protein 1-like 1 acyl-CoA synthetase medium-chain family member 5
220450_at 87100_at 211496_s_at 215334_at 215644_at 217522_at
ABHD2 PDC EFR3B ZNF518A KCNV2
219989_s_at
ANKS1B
222319_at 203128_at
NAP1L4 SPTLC2
205225_at 208128_x_at 208548_at 209052_s_at 209587_at 219136_s_at
ESR1 KIF25 IFNA6 WHSC1 PITX1 LMF1
abhydrolase domain containing 2 phosducin EFR3 homolog B zinc finger protein 518A potassium channel, voltage gated modifier subfamily V, member 2 ankyrin repeat and sterile alpha motif domain containing 1B Nucleosome assembly protein 1-like 4 serine palmitoyltransferase, long chain base subunit 2 oestrogen receptor 1 kinesin family member 25 interferon, alpha 6 Wolf-Hirschhorn syndrome candidate 1 paired-like homeodomain 1 lipase maturation factor 1
3 Multimodality Tests for Gene-Based Identification of Oncological Patients Probe set ID 220914_at 203969_at 204152_s_at
Gene symbol
Description
PEX3 MFNG
211573_x_at 220220_at 221598_s_at 205321_at
TGM2 FLJ10120 MED27 EIF2S3
207017_at 209876_at
RAB27B GIT2
peroxisomal biogenesis factor 3 MFNG O-fucosylpeptide 3-beta-Nacetylglucosaminyltransferase transglutaminase 2 hypothetical protein FLJ10120 mediator complex subunit 27 eukaryotic translation initiation factor 2, subunit 3 gamma, 52kDa RAB27B, member RAS oncogene family G protein-coupled receptor kinase interacting ArfGAP 2
210824_at 215561_s_at 219225_at 201755_at
IL1R1 PGBD5 MCM5
207607_at 214642_x_at 215181_at 217210_at 222277_at 206298_at 207870_at 220666_at 222172_at 207314_x_at 206566_at
ASCL2
interleukin 1 receptor, type I piggyBac transposable element derived 5 minichromosome maintenance complex component 5 achaete-scute family bHLH transcription factor 2
CDH22
cadherin 22, type 2
C1QTNF9B-AS1 ARHGAP22 AKAP9
C1QTNF9B antisense RNA 1 Rho GTPase activating protein 22 A kinase (PRKA) anchor protein 9
NPAS3
neuronal PAS domain protein 3
SLC7A1
206960_at 209402_s_at
LPAR4 SLC12A4
209618_at 209764_at
CTNND2 MGAT3
211579_at
ITGB3
201206_s_at 206269_at 208088_s_at 215800_at 217193_x_at 220799_at 213942_at
RRBP1 GCM1 CFHR5 DUOX1
solute carrier family 7 (cationic amino acid transporter, y+ system), member 1 lysophosphatidic acid receptor 4 solute carrier family 12 (potassium/chloride transporter), member 4 catenin (cadherin-associated protein), delta 2 mannosyl (beta-1,4-)-glycoprotein beta-1,4-Nacetylglucosaminyltransferase integrin, beta 3 (platelet glycoprotein IIIa, antigen CD61) ribosome binding protein 1 glial cells missing homolog 1 (Drosophila) complement factor H-related 5 dual oxidase 1
GCM2 MEGF6
glial cells missing homolog 2 (Drosophila) multiple EGF-like-domains 6
59
60 Probe set ID 214481_at 210424_s_at 208287_at 210384_at 215099_s_at 220967_s_at 205461_at 210382_at 213211_s_at
J. Ameijeiras-Alonso and R. M. Crujeiras Gene symbol HIST1H2AM GM88 HCG9 PRMT2 RXRB ZNF696 RAB35 SCTR TAF6L
Description histone cluster 1, H2am 88-kDa golgi protein HLA complex group 9 (non-protein coding) protein arginine methyltransferase 2 retinoid X receptor, beta zinc finger protein 696 RAB35, member RAS oncogene family secretin receptor TAF6-like RNA polymerase II, p300/CBP-associated factor (PCAF)-associated factor, 65kDa
References 1. Therese Sørlie, Charles M Perou, Robert Tibshirani, Turid Aas, Stephanie Geisler, Hilde Johnsen, Trevor Hastie, Michael B Eisen, Matt Van De Rijn, Stefanie S Jeffrey, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 98(19):10869–10874, 2001. 2. Marcus Schmidt, Daniel Böhm, Christian von Tüorne, Eric Steiner, Alexander Puhl, Henryk Pilch, Hans-Anton Lehr, Jan G Hengstler, Heinz Kolbl, and Mathias Gehrmann. The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer research, 68(13):5405–5413, 2008. 3. Matt P. Wand and M. Chris Jones. Kernel Smoothing. Chapman and Hall, Great Britain, 1995. 4. Birte Hellwig, Jan G. Hengstler, Marcus Schmidt, Mathias C. Gehrmann, Wiebke Schormann, and Jörg Rahnenführer. Comparison of scores for bimodality of gene expression distributions and genome-wide evaluation of the prognostic relevance of high-scoring genes. BMC bioinformatics, 11:1, 2010. 5. Luca Scrucca, Michael Fop, T. Brendan Murphy, and Adrian E. Raftery. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1):289–317, 2016. 6. Probal Chaudhuri and James Stephen Marron. Sizer for exploration of structures in curves. Journal of the American Statistical Association, 94:807–823, 1999. 7. Jose Ameijeiras-Alonso, Rosa M. Crujeiras, and Alberto Rodriguez-Casal. multimode: An r package for mode assessment. Journal of Statistical Software, 97(9):1–32, 2021. 8. Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, United States of America, 2000. 9. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2022. 10. Tatiana Benaglia, Didier Chauveau, David R. Hunter, and Derek S. Young. mixtools: An R package for analyzing mixture models. Journal of Statistical Software, 32(6):1–29, 2009. 11. Arnošt Komárek and Lenka Komárková. Capabilities of R package mixAK for clustering based on multivariate continuous and discrete longitudinal data. Journal of Statistical Software, 59:1– 38, 2014. 12. Giovanna Menardi. A review on modal clustering. International Statistical Review, 84(3):413– 433, 2016. 13. Jochen Einbeck and Ludger Evers. LPCM: Local Principal Curve Methods, 2020. R package version 0.46-7. 14. Yansong Cheng and Surajit Ray. Parallel and hierarchical mode association clustering with an R package modalclust. Open Journal of Statistics, 4(10):826–836, 2014.
3 Multimodality Tests for Gene-Based Identification of Oncological Patients
61
15. Adelchi Azzalini and Giovanna Menardi. Clustering via nonparametric density estimation: The R package pdfCluster. Journal of Statistical Software, 57(11):1–26, 2014. 16. Bernard W. Silverman. Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society. Series B, 43:97–99, 1981. 17. Peter Hall and Matthew York. On the calibration of Silverman’s test for multimodality. Statistica Sinica, 11:515–536, 2001. 18. Nicholas I. Fisher and James Stephen Marron. Mode testing via the excess mass estimate. Biometrika, 88:419–517, 2001. 19. Jose Ameijeiras-Alonso, Rosa M Crujeiras, and Alberto Rodríguez-Casal. Mode testing, critical bandwidth and excess mass. Test, 28(3):900–919, 2019. 20. Dietrich Werner Müller and Günther Sawitzki. Excess mass estimates and tests for multimodality. Annals of Statistics, 13:70–84, 1991. 21. Ming-Yen Cheng and Peter Hall. Calibrating the excess mass and dip tests of modality. Journal of the Royal Statistical Society. Series B, 60:579–589, 1998. 22. Marcus Schmidt, Daniel Böhm, Christian von Tüorne, Eric Steiner, Alexander Puhl, Henryk Pilch, Hans-Anton Lehr, Jan G Hengstler, Heinz Kolbl, and Mathias Gehrmann. NCBI gene expression omnibus (GEO) series GSE11121. the humoral immune system has a key prognostic impact in node–negative breast cancer. http://www.ncbi.nlm.nih.gov/geo/query/acc. cgi?acc=GSE11121, 2015. [Online; accessed October 27, 2022]. 23. Yoav Benjamini, Dan Drai, Greg Elmer, Neri Kafkafi, and Ilan Golani. Controlling the false discovery rate in behavior genetics research. Behavioural brain research, 125(1–2):279–284, 2001. 24. Xiongzhi Chen, David G. Robinson, and John D. Storey. The functional false discovery rate with applications to genomics. Biostatistics, 22(1):68–81, 2021. 25. Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1):289–300, 1995. 26. Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of statistics, pages 1165–1188, 2001. 27. John A. Hartigan and Pamela M. Hartigan. The dip test of unimodality. Journal of the American Statistical Association, 86:738–746, 1985. 28. Jia-Xing Zhang, Xiao-Xia Huang, Man-Bo Cai, Zhu-Ting Tong, Jie-Wei Chen, Dong Qian, YiJi Liao, Hai-Xia Deng, Ding-Zhun Liao, Ma-Yan Huang, et al. Overexpression of the secretory small GTPase RaB27B in human breast cancer correlates closely with lymph node metastasis and predicts poor prognosis. Journal of translational medicine, 10(1):1–10, 2012. 29. Ruocen Liao, Xingyu Chen, Qianhua Cao, Yifan Wang, Zhaorui Miao, Xingyu Lei, Qianjin Jiang, Jie Chen, Xuebiao Wu, Xiaoli Li, et al. HIST1H1B promotes basal-like breast cancer progression by modulating CSF2 expression. Frontiers in Oncology, page 4398, 2021. 30. Meghan E Spears, Namgyu Lee, Sunyoung Hwang, Sung Jin Park, Anne E Carlisle, Rui Li, Mihir B Doshi, Aaron M Armando, Jenny Gao, Karl Simin, et al. De novo sphingolipid biosynthesis necessitates detoxification in cancer cells. Cell reports, 40(13):111415, 2022. 31. Rodney F Minchin and Neville J Butcher. Trimodal distribution of arylamine nacetyltransferase 1 mRNA in breast cancer tumors: association with overall survival and drug resistance. BMC genomics, 19(1):1–10, 2018. 32. Marc Carlson. hgu133plus2.db: Affymetrix Human Genome U133 Plus 2.0 Array annotation data (chip hgu133plus2), 2016. R package version 3.2.3.
Chapter 4
Hippocampus Shape Analysis via Skeletal Models and Kernel Smoothing Eduardo García-Portugués and Andrea Meilán-Vila
Abstract Skeletal representations (s-reps) have been successfully adopted to parsimoniously parametrize the shape of three-dimensional objects and have been particularly employed in analyzing hippocampus shape variation. Within this context, we provide a fully nonparametric dimension-reduction tool based on kernel smoothing for determining the main source of variability of hippocampus shapes parametrized by s-reps. The methodology introduces the so-called density ridges for data on the polysphere and involves addressing high-dimensional computational challenges. For the analyzed dataset, our model-free indexing of shape variability reveals that the spokes defining the sharpness of the elongated extremes of hippocampi concentrate the most variation among subjects. Keywords Density ridges · Dimension reduction · Directional data · Nonparametric statistics · Skeletal representations
4.1 Introduction Mental illnesses are prevalent and highly debilitating disorders that affect a substantial proportion of society. Various studies have shown that there is a direct relationship between the etiology of mental diseases and the deformation of more vulnerable parts of the brain, such as the hippocampus (see, e.g., [1]). Hence, the analysis of hippocampus shapes is a relevant target of medical research and a useful instrument for informing it. The present work contributes towards this analysis by introducing a new method to determine the main variation of three-dimensional shapes, like hippocampi, that are instantiated in the form of skeletal models. Statistical shape analysis of three-dimensional objects [2] can be enhanced by using skeletal models, as these capture the interior of objects, and therefore, they
E. García-Portugués () · A. Meilán-Vila Universidad Carlos III de Madrid, Leganés, Madrid, Spain e-mail: [email protected]; [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 Y. Larriba (ed.), Statistical Methods at the Forefront of Biomedical Advances, https://doi.org/10.1007/978-3-031-32729-2_4
63
64
E. García-Portugués and A. Meilán-Vila
more stably and richly collect object shape than models that capture only its boundary. These models explicitly capture the width of the objects, the normal directions, and the boundary curvatures [3]. Skeletal models contain a skeleton, which is centrally located along the object, and the spokes (vectors originating from the skeleton terminating at the object boundary), such that the spokes do not cross within the object [4]. A general construction of these models is the skeletal representation, referred to as s-rep [5]. The rigorous statistical analysis of skeletal models requires the development of tailored novel methods, this constituting an instance of the so-called object oriented data analysis [6]. There is a substantial literature involving s-reps, see, for example, [7, 8], and [9], among others. See also [10] and the recent survey by [11] for a complete review of skeletal models. The dataset considered in this work consists of .n = 177 hippocampus shapes that are instantiated in the form of s-reps (see [3]). Figure 4.4 shows the s-reps of two characteristic hippocampi. The spokes are the segments (in varying colors) joining the inner skeletal points (in black) with the boundary points (also in varying colors, some of them numbered). Each hippocampus has .r = 168 spokes with associated radii and directions. The directions of these spokes lie on the polysphere .(S2 )168 , r where .(Sd )r := Sd × · · · ×Sd , with .Sd := {x ∈ Rd+1 : x = 1} and .r, d ≥ 1. The shapes of the hippocampi constituting the analyzed dataset are different, but their inner skeletal points share roughly matching configurations. Therefore, it is reasonable to consider the average inner skeletal configuration as a common reference, and then investigate the vectors that lead from it to the boundary. Fixing also the radii of these spokes to their averages across subjects allows a reduced representation, as an observation on .(Sd )r (size is ignored), of the hippocampus shape captured by an s-rep. Traditionally, Principal Component Analysis (PCA) has been used to describe the main features of the data by estimating the principal directions of its maximum projected variance. In the framework of skeletal models, modes of variation on the sphere based on a non-geodesic approach can provide more appropriate dimensionality reduction [4, 12]. Following this strategy, [13] introduced Principal Arc Analysis (PAA), which uses small circles on the sphere .S2 to parametrize the main source of variation. Principal Nested Spheres (PNS) is the extension of this method to the hypersphere .Sd [14]. An alternative to the previous parametric approaches for summarizing the primary characteristics of the data consists of generating flexible principal curves informed by the underlying density of the data. Density ridges extend the concept of modes and rely on the gradient and Hessian of the density function [15]. Although density ridge estimation is a challenging task, which can be addressed with an appropriate smoothing-based estimator, it entails a much larger flexibility over fixed parametric modes of variation (e.g., small circles on .Sd ). We introduce in this work a novel fully nonparametric dimension-reduction technique for polyspherical data. The proposed methodology involves estimating density ridges for .(Sd )r -valued data, which entail a specific kernel density estimator and the computation of its .(Sd )r -adapted gradient and Hessian. The estimation of
4 Hippocampus Shape Analysis via Skeletal Models and Kernel Smoothing
65
density ridges applies an Euler-like algorithm that presents several high-dimensional computational challenges, and thus we describe keys and guidelines for its implementation and practical use. We propose an effective data-driven indexing and parametrization of the set of .(Sd )r -valued points that is outputted by the Euler algorithm to attain a ridge analog of a (first) principal component curve. Marching along this ridge principal curve is especially useful to visualize the main mode of variation on the original s-rep space. We also highlight this work gives a proof of concept of the applicability of density ridges in high-dimensional settings. The rest of this chapter is organized as follows. Section 4.2 introduces a new dimension-reduction method to determine the main shape variation of threedimensional objects parametrized through s-reps. The approach requires two tailored smoothing techniques (Sect. 4.2.1). On the one hand, a kernel density estimator for polyspherical data and its derivatives (Sects. 4.2.1.1–4.2.1.2), and, on the other hand, a kernel regression estimator for .(Sd )r -valued response (Sect. 4.2.1.3). Density ridges are presented in Sect. 4.2.2 for the population Euclidean (Sect. 4.2.2.1) and sample polyspherical (Sect. 4.2.2.2) cases. The details of the advocated density ridge estimation procedure are elaborated in Sects. 4.2.2.3– 4.2.2.5. Section 4.3 shows the results of applying our methodology. Specifically, an illustrative numerical example on .(S2 )2 (Sect. 4.3.1) and the visualization of the main mode of variation of the aforementioned hippocampi data (Sect. 4.3.2). A critical discussion of the methodology and the identified open areas for improvement is provided in Sect. 4.4. Proofs are relegated to the appendix.
4.2 Methodology 4.2.1 Kernel Smoothing on the Polysphere 4.2.1.1
Density Estimation
Let f be a probability density function (pdf) on .(Sd )r ⊂ Rr(d+1) with respect to r the product measure .σd,r := σd × · · · ×σd , where .σd is the surface area measure on .Sd . Let .X1 , . . . , Xn be an independent and identically distributed (iid) sample from f . Let .x = (x1 , . . . , xr ) ∈ (Sd )r , with .xj = (xj 1 , . . . , xj (d+1) ) ∈ Sd for r .j = 1, . . . , r, and set .h := (h1 , . . . , hr ) ∈ R+ . We consider the kernel density estimator (kde) of f at .x defined as 1 Lh (x, Xi ), Lh (x, y) := cd,L (h)L (1 − x y) h(−2) , . fˆ(x; h) := n n
.
i=1
(4.1) L(s) :=
r j =1
Lj (sj ),
cd,L (h) :=
r j =1
cd,Lj (hj ),
(4.2)
66
E. García-Portugués and A. Meilán-Vila
where . denotes the Hadamard product, .y = (y1 , . . . , yr ) ∈ (Sd )r , and ⎞ ⎛ A11 B11 · · · A1r B1r ⎜ .. .. ⎟ .. .A B := ⎝ . . . ⎠ Ar1 Br1 · · · Arr Brr stands for a “block-in-block matrix product” between the r-partitioned (either in their rows, columns, or both) matrices .A = (Aij ) and .B = (Bij ), .1 ≤ i, j ≤ r. The type of r-partition of the two matrices involved in the product will be clear from the context given the product space structure; e.g., .x y := (x1 y1 , . . . , xr yr ) ∈ Rr . + The normalizing constant of the j th kernel .Lj : R+ 0 → R0 in (4.2) is defined as cd,Lj (hj )
.
−1
:=
Sd
Lj
1 − xj yj
h2j
σd (dxj ).
The most common kernel is the von Mises–Fisher (vMF) kernel, .LvMF (t) := e−t , for .t ≥ 0, although the “Epanechnikov” kernel, .LEpa (t) := (1 − t)1{0≤t≤1} , is more efficient on .Sd . The normalizing constant for the vMF kernel is −1 2 cd,LvMF (h) = (2π )(d+1)/2 I(d−1)/2 (h−2 )e−1/ h hd−1 ,
.
(4.3)
where .Iν is the modified Bessel function of the first kind and .νth order. For .d = 2, a numerically stable form for (4.3) when .h ≈ 0 is .
2 log(c2,LvMF (h)) = − 2 log(h) + log(2π ) + log1p − e−2/ h ,
where .log1p(x) is the numerically stable computation of .log(1 + x) for .x ≈ 0.
4.2.1.2
Gradient and Hessian Density Estimation
To derive the gradient and Hessian of the kernel density estimator introduced in (4.1), let us first consider the radial extension of .f : (Sd )r → R+ 0 given by + r(d+1) ¯ ¯ .f : R \{0} → R0 such that .f (x) := f (¯x), where .x¯ := proj(Sd )r (x) := (x1 /x1 , . . . , xr /xr ) ∈ (Sd )r , for .x ∈ Rr(d+1) \{0}. The reason for this extension is the necessity of taking derivatives on f , defined on the closed support d r .(S ) . The following result provides the expressions of the gradient and Hessian of d r .f¯(x), for .x ∈ (S ) . These statements are required for deriving the gradient and Hessian of the kernel density estimator, that is, .f¯ˆ(x; h) = fˆ(x; h), for .x ∈ (Sd )r . Proposition 1 Assume that .f¯ is twice continuously differentiable on .(Sd )r . Then:
4 Hippocampus Shape Analysis via Skeletal Models and Kernel Smoothing
67
1. The (row) gradient vector of .f¯ at .x ∈ (Sd )r is .∇ f¯(x) = (∇ 1 f¯(x), . . . , ∇ r f¯(x)), ∇ j f¯(x) = ∇ j f (x)(Id+1 − xj xj ),
j = 1, . . . , r,
.
where .Ip stands for the identity matrix of size p. 2. The Hessian matrix of .f¯ at .x ∈ (Sd )r is ⎛
H11 f¯(x) ⎜ .. .Hf¯(x) = ⎝ . H1r f¯(x)
⎞ · · · H1r f¯(x) ⎟ .. .. ⎠, . . · · · Hrr f¯(x)
where Hjj f¯(x) = (Id+1 − xj xj )Hjj f (x)(Id+1 − xj xj ) − (∇ j f (x)xj )(Id+1 − xj xj ) − xj ∇ j f (x) + (xj ∇ j f (x)) − 2(∇ j f (x)xj )xj xj , (4.4)
.
Hkj f¯(x) = (Id+1 − xk xk )Hkj f (x)(Id+1 − xj xj ), with .j, k = 1, . . . , r, .k = j . Remark 1 The addend (4.4) is the only j th-term in the gradient and Hessian expressions that is not orthogonal to the subspace spanned by .xj . + 2 Remark 2 For a kernel .L : R+ 0 → R0 , we have that .∇L((1 − x y)/ h ) = −2 2 2 −4 2 −h L ((1 − x y)/ h )y and .HL((1 − x y)/ h ) = h L ((1 − x y)/ h )yy . This holds for the aforementioned kernels .LvMF and .LEpa . These kernel derivatives are: −t −t .L vMF (t) = −e , .LvMF (t) = e ; .LEpa (t) = −1{0≤t≤1} , .LEpa (t) = 0.
From Proposition 1 and Remark 2, the block gradients and Hessians of .f¯ˆ(·; h), ¯ ¯ .∇ j fˆ(·; h) and .Hkj fˆ(·; h), follow immediately from those of .fˆ(·; h): ∇ j fˆ(x; h) = −
.
Hjj fˆ(x; h) = Hkj fˆ(x; h) =
cd,L (hj ) n
nh2j
i=1
cd,L (hj ) n
nh4j
L
i=1
L
1 − xj Xij h2j
1 − xj Xij h2j
L−j,h (x, Xi ) Xij , L−j,h (x, Xi ) Xij Xij ,
cd,L (hk )cd,L (hj ) nh2k h2j n X X 1 − x 1 − x ij ik j k × L L−k,−j,h (x, Xi ) L 2 h h2j k i=1 × Xik Xij ,
68
E. García-Portugués and A. Meilán-Vila
where .x ∈ (Sd )r and (−2) , L−j,h (x, y) := cd−j ,L (h−j )L (1 − x−j y−j ) h−j (−2) L−k,−j,h (x, y) := cd−k,−j ,L (h−k,−j )L (1 − x−k,−j y−k,−j ) h−k,−j . .
For the vMF kernel, the above gradient and Hessian expressions simplify to n 1 Lh (x, Xi )Xij , ∇ j fˆ(x; h) = 2 nhj i=1
.
Hkj fˆ(x) =
n 1 Lh (x, Xi )Xik Xij , nh2k h2j i=1
which can be further compressed into
n 1 .∇ fˆ(x; h) = h Lh (x, Xi )Xi , . n i=1 n 1 Lh (x, Xi )(Xi Xi ) . Hfˆ(x; h) = (hh )(−2) n (−2)
(4.5)
(4.6)
i=1
The simplicity of (4.5)–(4.6) is a major practical benefit of the vMF kernel. Therefore, this kernel is adopted henceforth, although the subsequent theory also holds for other kernels.
4.2.1.3
Polysphere-on-Scalar Regression Estimation
The indexing of density ridges benefits from an auxiliary smoothing of .(Sd )r -valued data with respect to a scalar variable. This smoothing can be cast within a regression framework where one is interested in estimating the extrinsic regression function .t ∈ R → m(t) := proj(Sd )r (E[X|T = t]) given the iid sample .(T1 , X1 ), . . . , (Tn , Xn ) on .R × (Sd )r . An alternative intrinsic approach based on the conditional Fréchet mean is also possible, yet it would involve several issues (non-explicitness, nonuniqueness, and potential smeariness; see [16] on the latter). Within this extrinsic regression setup, given .t ∈ R, we consider the Nadaraya– Watson estimator
n Kh (t − Ti ) , .m(t; ˆ h) := proj(Sd )r Wi (t; h)Xi , Wi (t; h) := n j =1 Kh (t − Tj ) i=1
(4.7) which acts as a weighted local mean informed by the scaled kernel .Kh (·) = K(·/ h)/ h (typically, a Gaussian pdf) and the bandwidth .h > 0. Bandwidth
4 Hippocampus Shape Analysis via Skeletal Models and Kernel Smoothing
69
selection for (4.7) can be approached by cross-validation: hˆ CV := arg min CV(h),
.
h>0
2 1 d(Sd )r Xi , m ˆ −i (Ti ; h) , n n
CV(h) :=
(4.8)
i=1
where .m ˆ −i (·; h) denotes (4.7) computed without the ith observation and prevents a spurious overfitting. In (4.8), .d(Sd )r stands for the geodesic distance on the product manifold .(Sd )r , which arises from the Euclidean combination of geodesic distances on .Sd (see, e.g., [13, p. 600]):
d(Sd )
.
⎛ ⎞1/2 r −1 2 ⎝ cos (xj yj ) ⎠ . r (x, y) =
(4.9)
j =1
The cross-validated bandwidth (4.8) can be smoothed according to the one-standard error rule principle from the glmnet package [17]. The rule favors regression simplicity within a one-standard error neighborhood of .CV(hˆ CV ), that is .hˆ 1SE := ˆ 2 (CV(h)) := ˆ max h > 0 : CV(h) = CV(hˆ CV ) + SE(CV( hˆ CV )) , where .SE 2 1 n 2 ˆ −i (Ti ; h) . i=1 (CVi (h) − CV(h)) and .CVi (h) := d(Sd )r Xi , m n−1 A faster and equivalent expression for .CV(h) in (4.8) is given in the following result. For a sample size .n = 200, the median computation time of evaluating .CV(h) as described in Proposition 2 is approximately just .8.6% of that of the naive form (4.8). ˜ and .W ˜ be .n × n matrices with ij -entries .kij := Proposition 2 Let .K (1 − δij )Kh (Ti − Tj ) and .wij := kij /( nj=1 kij ), respectively, where .δij ˜ ˜ := WX, denotes Kronecker’s delta and .i, j = 1, . . . , n. Let .X where .X is the .n × (r(d + 1)) response matrix whose rows are .X1 , . . . , Xn . Then n ˜ 2 ˜ ˜ .CV(h) = i=1 d(Sd )r Xi , proj(Sd )r (Xi ) , where .Xi is the ith row of .X. A more sophisticated local polynomial estimator could be considered instead of (4.7), yet with higher computational cost and higher variability at low-density regions.
4.2.2 Density Ridges 4.2.2.1
Population Euclidean Case
Density ridges are higher-dimensional extensions of the concept of mode that inform on the main features of a density f on .Rp . Density ridges are defined through the gradient and Hessian of f . In particular, they require the eigendecomposition p .Hf (x) = U(x)(x)U(x) , for .x ∈ R , where .U(x) = (u1 (x), . . . , up (x)) is a matrix whose columns are the eigenvectors and .(x) = diag(λ1 (x), . . . ,
70
E. García-Portugués and A. Meilán-Vila
λp (x)), .λ1 (x) ≥ . . . ≥ λp (x), contains the corresponding eigenvalues. Denoting U(p−1) (x) := (u2 (x), . . . , up (x)), the projected gradient onto .{u2 (x), . . . , up (x)} is
.
∇ (p−1) f (x) := ∇f (x)U(p−1) (x)U(p−1) (x) .
.
(4.10)
The density ridge of f is defined by [15] as the set R(f ) := x ∈ Rp : ∇ (p−1) f (x) = 0, λ2 (x), . . . , λp (x) < 0 .
.
(4.11)
Note that .x ∈ R(f ) if either .x is a maximum or a saddle point, or .∇f (x) is parallel to .u1 (x), i.e., the directions of maximum ascent and largest (signed) curvature coincide. To determine .R(f ) in practice, assuming that f is a known density on .Rp , an iterative Euler algorithm that starts at an arbitrary point .x0 ∈ Rp and converges to a certain point in .R(f ) is often used. The algorithm is based on the updating xt+1 = xt + H η(p−1) (xt )
.
(4.12)
until convergence, using a step matrix .H and the normalized projected gradient η(p−1) (x) := ∇ (p−1) f (x)/f (x).
.
(4.13)
The gradient (4.13) boosts the passing through low-density regions and modulates its magnitude at high-density regions. We refer to [15] and [18, Section 6.3] for further details on the population case and for the exposition of the sample version. For the sake of brevity, we directly address next the sample polyspherical case.
4.2.2.2
Sample Polyspherical Case
We turn on back to the setting in the present work: a density f supported over (Sd )r ⊂ Rp , with .p = r(d + 1) henceforth, that is unknown. The recipe for estimating density ridges from a sample .X1 , . . . , Xn on .(Sd )r rests on two main adaptations: (i) plug-in the kde .fˆ(·; h) instead of f in (4.10), (4.11), and (4.13); (ii) conform to the polysphere .(Sd )r the Euler step given in (4.12). The projected gradient of .fˆ(·; h) involves the extended gradient .∇ f¯ˆ(·; h) and Hessian .Hf¯ˆ(·; h) obtained in Proposition 1. However, some care is needed, as the direct translation of (4.10) to .(Sd )r leads to three important issues. First, repeatedly ˆ ˆ ˆ h)(x; h)U(x; h) for computing the full eigendecomposition .Hf¯ˆ(x; h) = U(x; d r .x ∈ (S ) is expensive, especially for large p. However, due to orthogonality, ˆ (p−1) (x; h) = Ip − uˆ 1 (x; h)uˆ 1 (x; h) , and this expression has ˆ (p−1) (x; h)U .U the advantage of involving only the first eigenvector .uˆ 1 (x; h) and not the full eigendecomposition. The computation of the first eigenvector, or a set of prescribed .
4 Hippocampus Shape Analysis via Skeletal Models and Kernel Smoothing
71
eigenvectors, can be done efficiently with the implicitly restarted Arnoldi algorithm in ARPACK [19], ported in Armadillo’s eigs_sym [20]. Second, as advanced in Remark 1, the Hessian .Hf¯ˆ(x; h) has a component that is non-orthogonal to .x and that corresponds to the terms (4.4) in the r diagonal blocks. Due to the specificity of .(Sd )r , this component has to be subtracted before the gradient projection: if included, .Ip − uˆ 1 (x; h)uˆ 1 (x; h) would be projecting the gradient .∇ f¯ˆ(x; h) partly along .x, that is, outside the tangent space at .x, spanned by .Ip −diag(x1 x1 , . . . , xr xr ), f¯ˆ(x; h) the Hessian matrix projected on the where .∇ f¯ˆ(x; h) lies. We denote by .H orthogonal space to .x that does not include the terms (4.4) in each of the r diagonal f¯ˆ(x; h) has r null eigenvalues, which is apparent blocks of .Hf¯ˆ(x; h). Third, .H given the mismatch between p and dr, the intrinsic dimension of .(Sd )r . If they are not specifically filtered out, null eigenvalues in the form of machine epsilons may arise in the r largest (signed) eigenvalues. Taking into account the three previous issues, we denote with .u˜ 1 (x; h) the f¯ˆ(x; h). eigenvector associated with the largest (signed) non-null eigenvalue of .H Then, we define the kde analog of the projected gradient (4.10) as ∇ (p−1) f¯ˆ(x; h) := ∇ f¯ˆ(x; h) Ip − u˜ 1 (x; h)u˜ 1 (x; h) .
.
(4.14)
The kde-normalized projected gradient is then defined as ηˆ (p−1) (x; h) := ∇ (p−1) f¯ˆ(x; h)/fˆ(x; h).
.
(4.15)
The Euler step (4.12) transforms into xt+1 := proj(Sd )r xt + h2 ηˆ (p−1) (xt ; h) .
.
(4.16)
In (4.16), .proj(Sd )r preserves each new iteration within .(Sd )r and .h2 ηˆ (p−1) (xt ; h) = h21 ηˆ 1,(p−1) (xt ; h), . . . , h2r ηˆ r,(p−1) (xt ; h) multiplies the j th projected gradient according to the corresponding squared bandwidth. Squares appear as an analogy to the Euclidean case (4.12), where .H, not .H1/2 , is considered to modulate the Euler step [18, Section 6.3]. The recurrence (4.16) is iterated until convergence, when .xt+1 approximately belongs to the sample version of the ridge: R fˆ(·; h) := x ∈ (Sd )r : ∇ (p−1) f¯ˆ(·; h)(x) = 0, λ˜ 2 (x; h), . . . , λ˜ dr (x; h) < 0 ,
.
f¯ˆ(x; h). being .λ˜ 2 (x; h) > . . . > λ˜ dr (x; h) the non-null eigenvalues of .H Computing the gradient and Hessian behind (4.14) in high-dimensional setups has to be done carefully, as their entries quickly underflow. Thanks to (4.5) and (4.6), this issue can be prevented by (i) working in logarithmic scale and (ii) computing rather the gradient and Hessian standardized by the kde (4.1). Obviously, dividing by
72
E. García-Portugués and A. Meilán-Vila
f¯ˆ(x; h), yet it makes them numerically fˆ(x; h) does not affect the eigenvectors of .H stable. To guide the discussion of the specifics in the proposed density ridge estimation procedure on .(Sd )r , we summarize in Algorithm 1 its main steps.
.
Algorithm 1 Ridge estimation and indexing on .(Sd )r Given a sample .X1 , . . . , Xn on .(Sd )r , its estimated ridge is determined and indexed as follows: 1. Select a “suitable” data-driven bandwidth .hˆ (Sect. 4.2.2.3). 2. For each element in an initial grid .{x0,1 , . . . , x0,m } ⊂ (Sd )r , iterate (4.16) “until convergence” to a given .xj , .j = 1, . . . , m (Sect. 4.2.2.4). ˆ fˆ ·; hˆ := {x1 , . . . , xm } ⊂ (Sd )r and assign “scores” to 3. “Index” the estimated ridge .R .X1 , . . . , Xn (Sect. 4.2.2.5).
4.2.2.3
Bandwidth Selection
Bandwidth selection in Step 1 can be done with “upscaled versions” of plug-in bandwidths. [21] proposed a simple plug-in bandwidth selector for the kde (4.1) on d .S . This estimator is an analog to Silverman’s rule-of-thumb [22], as it assumes that the underlying population is a vMF distribution with concentration .κ to estimate the curvature term present in the so-called Asymptotic Mean Integrated Squared Error (AMISE) bandwidth. Within the setting of the present work, the marginal bandwidth selector in the j th .Sd is ⎡ hˆ j,ROT := ⎣
.
4π 1/2 I(d−1)/2 (κˆ j )2
⎤1/(4+d)
⎦ 2dI(d+1)/2 (2κˆ j ) + (2 + d)κˆ j I(d+3)/2 (2κˆ j ) n
(d+1)/2
κˆ j
, (4.17)
1 n where .κˆ j := A−1 i=1 Xij ), with .Ad (r) := I(d+1)/2 (r)/I(d−1)/2 (r), is d ( n the maximum likelihood estimate of .κj . Independently combining the marginal bandwidth selectors (4.17) gives .hˆ IROT := hˆ 1,ROT , . . . , hˆ r,ROT ∈ Rr+ . This admittedly simple selector is explicit and easy to compute, but it undersmooths the underlying density in .(Sd )r . Besides, following the discussion in [18, Section 6.3], the kind of bandwidth selectors recommended for density ridge estimation are the ones designed for Hessian density estimation, since (4.15) critically depends on adequately estimating the Hessian’s first eigenvector. To solve both issues in a computationally tractable manner, given the current lack of theory for derivative bandwidth selectors on .(Sd )r , we consider the following upscaled version of .hˆ IROT : .
(s) hˆ UIROT := hˆ IROT × n1/(d+4)−1/(dr+2s+4) ,
4 Hippocampus Shape Analysis via Skeletal Models and Kernel Smoothing
73
where s denotes the order of the derivatives of f that are being estimated. The −1/(dr+2s+4) entries of .hˆ (s) , i.e., the standard rate an AMISE UIROT have order .O n bandwidth for the sth derivatives of a pdf on .Rdr has [18, Section 5.5]. We will (2) consider .C × hˆ UIROT in Step 1, where .C > 0 is determined experimentally. 4.2.2.4
Euler Iteration
An important practical issue is to initiate (4.16) in Step 2 from a sensible grid of points .{x0,1 , . . . , x0,m } ⊂ (Sd )r . This can be challenging in .(Sd )r due to two reasons: (i) the likely vastness of the domain, which forbids considering a product of uniform-like grids on each .Sd (besides, such uniform grids are unknown for .d > 1); (ii) the ubiquitous low-density regions, with associated long convergence paths to ridge points that are usually spurious. Solutions to both problems include setting the initial grid by sampling from .fˆ ·; hˆ or by directly using the sample .X1 , . . . , Xn , thus building data-driven grids across .(Sd )r adapted to the purpose of Algorithm 1. In practice, the iteration of the recurrence (4.16) can be done for a maximum number of iterations N or until a certain stopping .ε-criterion on the standardized √ version of distance (4.9) is met: .(d(Sd )r (xt+1 , xt )/(π r)) < ε. The standardization allows securing the same accuracy within different polyspheres. In our experiments, we found that .N = 1000 and .ε = 10−5 gave a good accuracy–speed trade-off. When applying Step 2 on a high-dimensional space .(Sd )r , we have found that a convenient way to speed up and monitor the obtention of the ridge on .(Sd )r is to initialize the (expensive) Euler algorithm with the endpoints of (much faster) marginal Euler algorithms on each of the .r/k blocks formed by .(Sd )k . This process can be refined by using . passes forming the sequence .1 ≤ k1 < . . . < k = r. Finally, in practice Step 2 is followed by a filtering process that removes spurious endpoints .xj meeting any of the next conditions: (i) .ε-convergence was not achieved ˆ ≥ 0; (iii) .fˆ(xj ; h) ˆ < fˆα , where .fˆα is the .α-quantile in N iterations; (ii) .λ˜ 2 (xj ; h) ˆ ˆ ˆ ˆ of .f (X1 ; h), . . . , f (Xn ; h) for, say, .α = 0.01 (i.e., .xj is in a low-density region).
4.2.2.5
Indexing Ridges
ˆ fˆ ·; hˆ = {x1 , . . . , xm } ⊂ (Sd )r obtained in Step 2 is a The estimated ridge .R set of points without an explicit notion of order. To build a flexible analog of a first ˆ fˆ ·; hˆ is essential. Inspired by the use of principal component, an indexing of .R MultiDimensional Scaling (MDS) in [23] for non-Euclidean dimension-reduction purposes, we advocate the use of a metric MDS (see, e.g., Section 9.1 in [24]) on the matrix of geodesic distances .D := (d(Sd )r (xi , xj )), with .1 ≤ i, j ≤ m. Metric MDS from .(Sd )r to .R produces
74
E. García-Portugués and A. Meilán-Vila
(tˆ1 , . . . , tˆm ) = arg
.
min
t1 ,...,tm ∈R
m 2 d(Sd )r (xi , xj ) − |ti − tj | .
(4.18)
i,j =1
ˆ fˆ ·; hˆ . Optimization The indexes .tˆ1 , . . . , tˆm give an effective handle to traverse .R of (4.18) can be done with the smacof package [25]. The smoother (4.7) becomes now relevant to (i) smooth out irregularities in the estimated ridge and, more importantly, (ii) evaluate the ridge at arbitrary indexes beyond those in (4.18). Consequently, we define the Smoothed–Indexed–Estimated Ridge (SIER) as the curve .t ∈ R → rˆ(t; h) ∈ (Sd )r generated by (4.7) acting on ˆ fˆ ·; hˆ neither contains nor is contained by the sample .(tˆ1 , x1 ), . . . , (tˆm , xm ). .R .{ˆ r(t; h) : t ∈ R}, yet the latter can be regarded as the estimated mean of the former. The score of an arbitrary point .x ∈ (Sd )r on the SIER is defined as the index of its projection on the SIER curve: scorerˆ (·;h) (x) := arg min d(Sd )r rˆ (t; h), x .
.
t∈R
(4.19)
Note that .xj , the “Euler-projection” of .xj,0 , and the projection .projrˆ (;h) (xj,0 ) := rˆ scorerˆ (·;h) (xj,0 ); h can be very different since the Euler paths follow the projected gradient flow and not the geodesic to the closest point on the ridge. This difference is clearly illustrated in Fig. 4.1, where the Euler-projections introduce distortions in the color gradient of the triangles (e.g., longest blue and green paths), which are not present in the sample scores .{scorerˆ (·;h) (Xi )}ni=1 shown in the rug of the right plot.
4.3 Results 4.3.1 An Illustrative Numerical Example We demonstrate the performance of Algorithm 1 for dimension-reduction with a numerical example on .(S2 )2 . The left and central panels of Fig. 4.1 display a sample of size .n = 200 in solid points. The dependence pattern on each .S2 follows a small circle variation that is coupled between .S2 ’s and that is indicated by a common rainbow palette; i.e., points with the same colors in the two panels represent the 2 2 2 .S -coordinates of a single point on .(S ) . The Euler paths arising from running in transparent color. Algorithm 1 taking the sample as the initial grid are shown ˆ fˆ ·; hˆ . The SIER, shown These paths converge to the triangular points defining .R in the black curves, is then obtained with .C = 2 and .hˆ 1SE (Sect. 4.2.1.3) for (4.7). The right panel of Fig. 4.1 shows the scores of the sample points on the SIER, evidencing that the color gradient encoding the one-dimensional mode of variation of the data is recovered (rainbow rug). Indeed, the Spearman correlation between the order of the colors and the order of the sample scores is .0.9999.
75
0.3 0.0
0.1
0.2
Density function
0.4
0.5
0.6
4 Hippocampus Shape Analysis via Skeletal Models and Kernel Smoothing
−3
−2
−1
0
1
2
3
Sample scores
Fig. 4.1 Numerical example on .(S2 )2 . The left and central plots display the (joint) main mode of variation of the data, encoded by a common rainbow color palette. The sample is shown in solid points, the Euler paths in transparent curves, and the ridge points in triangles. The black curves represent the two .S2 -views of the common SIER. The right plot shows the kde of the sample scores
4.3.2 Main Mode of Variation of Hippocampus Shapes We analyze now the hippocampi dataset mentioned in Sect. 4.1. The data consists of .n = 177 hippocampi parametrized using s-reps, where each of the subjects has .r = 168 spokes. The s-reps were fitted to a set of binary images of the hippocampi that were segmented from magnetic resonance imaging [3]. Fixing the radii of these vectors to their sample means, hence taking into account only the shape of the hippocampus and not its size, each s-rep is reduced to a value on .(S2 )r . A main form of variation is not well-defined in densities that are rotationally symmetric and unimodal about a certain location, as it distinctively happens with the vMF distribution. To detect such cases delivering spurious ridges, we run in each of the .r = 168 samples of spokes the hybrid test for rotational symmetry with unspecified location from [26], implemented in rotasym [27]. To control for false discoveries, we corrected the r resulting p-values using the false discovery rate by [28]. For a conservative .1% significance level, .r ∗ = 88 non-rotationally symmetric ∗ spokes were found, for which we then ran Algorithm 1 on .(S2 )r . Figure 4.2 shows in color the .r ∗ non-rotationally symmetric spokes and in gray the .r − r ∗ = 89 rotationally symmetric spokes. As advanced in Sect. 4.2.2.4, Algorithm 1 was run in a blockwise fashion to facilitate faster convergences. Precisely, . = 3 passes were applied to .r ∗ /kl blocks of sizes .k1 = 1, .k2 = 22, and .k3 = r ∗ . The initial grid for the first pass was set as the sample, and then subsequent passes were fed with the endpoints of the former pass. After each pass, the spurious endpoints were removed (.α = 0.01) to prevent their propagation into long convergence paths, hence successively trimming (2) the size of the initial grids. The bandwidths applied on each pass were .Cl × hˆ UIROT
(adapted to .(S2 )kl ), with .Cl = 2l experimentally determined, for .l = 1, 2, 3. Our implementation of Algorithm 1 based on a hybrid of C.++ (for the core routines) and R (for interfacing) yielded running times of 64, 170, and 3121 s for the three
76
E. García-Portugués and A. Meilán-Vila
Fig. 4.2 March along the SIER of hippocampi. From left to right, the three plots show the reconstructed hippocampi for the score quantiles .1%, .50%, and .99%. In them, .r ∗ = 88 nonrotationally symmetric spokes (colored) vary along the march, while the remaining .r − r ∗ = 89 spokes (in gray) remain fixed at their spherical means. The yellow/purple color gradient codifies the large/small degree of change along the march. The black points are the average inner skeletal points of the n hippocampi. Surfaces were constructed using alphashape3d [29]
respective blocks. These times were measured on an Apple M1 processor. The fast runs for blockwise fits were convenient for quick exploration of approximate ridges. Finally, the SIER was computed with .hˆ 1SE . Replicating code is available from the authors upon reasonable request. Figure 4.2 depicts the main outcome of our dimension-reduction tool on the hippocampi dataset: the march along the SIER, instantiated for the sake of conciseness at the quantiles .1%, .50%, and .99% of the sample scores. The coloring indicates that the largest variation appears at the yellow/green spokes (e.g., see spokes 28 and 99), with purple indicating virtually no variation, and gray denoting rotationally symmetric spokes. This march shows that: (i) most of the variation is concentrated at the spokes describing the sharpness of the elongated convex edge (right-positioned in the plots), and at the narrowest extreme of the hippocampus form (bottom); (ii) the joint variation of the previous spokes is elucidated as a “synchronous opening of pincers” given by pairs of spokes, i.e., there is a variation gradient from sharper to thicker edges in the hippocampus shapes; (iii) low variation occurs on the normal spokes to the elongated form of the hippocampus; (iv) the concave edge (left) and the widest extreme (top) concentrate most of the rotationally symmetric spokes; (v) on overall, the determined main shape variation across subjects is mild. Figure 4.3 ∗ shows the .S2 -projections of the .(S2 )r -valued SIER, indicating with a rainbow palette the score-driven march along the SIER for which three quantiles were shown in Fig. 4.2. The density of the scores in Fig. 4.3 points towards an asymmetric distribution of the subjects, with a secondary cluster at the right of the main mode. The scores given by the SIER serve to identify the “median hippocampus shape” and “most extreme hippocampus shape” in a straightforward way, given its univariate nature. Indeed, we define the first as the hippocampus whose score is the median of the scores (.−0.30, see the rightmost plot of Fig. 4.3), while we set the second as that hippocampus with the largest absolute score (.10.33). These
4 Hippocampus Shape Analysis via Skeletal Models and Kernel Smoothing 86
6766 697072 74 65 73 64 68 61 71
56 76
86
24
25
59 57 54
63
88
45
19
27 7
29 4 34 1
11 8 5
50
58 27 7 29 4 5
3
10 26 43
60
28 10 26
9 6
51
47
20 17 21 15 23
18
31
32
46 49 20 15 17 2 1 23
62
18
22
11 8
30
77 53 2242 44 41
83
19
28
78 24 4048
676665 6471 68 61
87
35
25 3637 38 80 39
33
55
30
2
16 52
3
14 32
6
35
34 1 2
13 85
0.15
84
0.20
81
74 72
Density function
70
69
1416 13 12
33
82
45 47
50
0.25
79
0.10
52
88 31
73
43
55 85
87
75
8279
0.05
60 58
75 39 80 36 81 77 40 37 53 42 78 48 38 44 51 4946 41 56 76
12
84 83
9
0.00
62 59 57 54
63
77
−10
−5
0
5
10
Sample scores
Fig. 4.3 The left and central plots show two views of the same .S2 in which (i) each of the .n = 177 directions of the .r ∗ = 88 spokes has been drawn (colored points) and (ii) the .r ∗ .S2 -projections ∗ of the .(S2 )r -valued SIER are jointly plotted. The yellow/purple color gradient of the directions is assigned according to the spoke to which they belong. The rainbow palette is common to the .r ∗ SIER .S2 -projections and is determined from the order of the sample scores (right plot)
Fig. 4.4 Hippocampus shapes corresponding to the most extreme (left) and median (right) hippocampus. The first corresponds to the hippocampus with the largest absolute score, while the second is the hippocampus whose score is the median of the sample scores (see Fig. 4.3)
particular hippocampi are depicted in Fig. 4.4. The medial hippocampus is highly symmetric and has a small curvature along its elongated direction. The most extreme hippocampus (left plot) is a vertically-squeezed hippocampus that is notably thick, especially in the upper part displayed in Fig. 4.4. Indeed, the height ratio between the upper and lower parts is unusually high (not visible in Fig. 4.4). In opposition to the medial hippocampus, its elongated convex edge is markedly curved and asymmetric.
4.4 Discussion A new fully nonparametric dimension-reduction procedure for finding the main mode of variability of the shape of s-reps was introduced in this chapter. The tech-
78
E. García-Portugués and A. Meilán-Vila
nique targets the polyspherical reduction of s-reps and provides a complete pipeline for attaining an analog of the first principal component on .(Sd )r based on density ridges. As demonstrated with the hippocampi dataset, the tool can be used for flexible dimension-reduction analyses in medical applications involving s-reps, also in high-dimensional settings, that deliver useful visualizations and insights. The proposed technique presents some limitations and is subject to future improvements. As in any kernel-based method, bandwidth selection is a crucial issue. In that regard, the upscaled marginal bandwidths are simple choices open to large enhancements with the development of a theory for density derivative estimation on .(Sd )r , for example, in the direction of cross-validatory or plug-in methods. From an application standpoint, the presented analysis also has certain limitations that constitute opportunities for further research. Arguably, the most relevant improvement would be a more holistic approach to determining the main mode of variation of the hippocampi involving the radii of the spokes and the position of the inner skeletal, as well as their interactions with the directions of the spokes. Addressing this case in a fully nonparametric way would involve substantial further complexities in the definition and estimation of the involved density ridges, caused by the notable increment of dimension and the different nature of the components involved. The presented methodology, ultimately instantiated with the SIER march and scores, has medical potential in regard to analyzing the shapes of hippocampi and other three-dimensional objects parametrized by s-reps. On the one hand, it delivers a rich exploratory data analysis of the morphology of these objects, either through the SIER march or through the sample scores. The univariate scores, in particular, allow investigating the existence of possible clusters, determining the most prototypical subjects, and outlier-hunting signaling abnormal shapes. On the other hand, the methodology can be applied to obtain effective comparisons between treatment and control groups, either through the SIER march (visual, qualitative) or through the metrics on the distribution of the sample scores (quantitative). Acknowledgments Both authors acknowledge support from grant PID2021-124051NB-I00, funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”. The authors greatly acknowledge Prof. Stephen M. Pizer and Dr. Zhiyuan Liu (University of North Carolina at Chapel Hill) for kindly providing the analyzed s-reps hippocampi data. Comments by the editor and a referee are appreciated.
Proofs Proof (Proposition 1) For x¯ = (x1 /x1 , . . . , xr /xr ) =: (¯x1 , . . . , x¯ r ) ∈ (Sd )r , .
∂ x¯ j and = xj −3 xj 2 ei − xij xj ∂xij
∂ x¯ j = 0, ∂xik
4 Hippocampus Shape Analysis via Skeletal Models and Kernel Smoothing
79
where ei is the ith canonical vector of Rd+1 , i = 1, . . . , d + 1, and j, k = 1, . . . , r, j = k. It now follows that, for x ∈ (Rd+1 )r , .
∂ f (¯x) = xj −3 ∇ j f (¯x) xj 2 ei − xij xj , j = 1, . . . , r, i = 1, . . . , d + 1. ∂xij
Hence, for x ∈ (Sd )r , ∇ j f¯(x) = ∇ j f (x)(Id+1 − xj xj ),
.
j = 1, . . . , r.
To obtain the Hessian of f¯, we first compute the entries of Hjj f¯(x) for x ∈ and j = 1, . . . , r:
(Rd+1 )r .
∂2 f¯(x) ∂xpj ∂xqj
d+1 ∂ ∂ ∂ = f (¯x) − xj −3 f (¯x)xlj xqj xj −1 ∂xpj ∂xqj ∂xlj l=1 ∂ ∂ ∂ ∂ = f (¯x) + xj −1 f (¯x) xj −1 ∂xqj ∂xpj ∂xqj ∂xpj d+1 ∂ ∂ f (¯x)xlj xqj xj −3 − ∂xlj ∂xpj l=1
− xj −3
d+1 ∂ ∂ ∂ ∂ f (¯x) xlj xqj + f (¯x) (xlj xqj ) ∂xlj ∂xpj ∂xpj ∂xlj l=1
!
= xj −3 − xpj
∂ ∂ f (¯x) − xqj f (¯x) ∂xqj ∂xpj
d+1 ∂ ∂2 f (¯x) xlj f (¯x) + xj + 3xj −2 xpj xqj − δpq ∂xpj ∂xqj ∂xlj
−1
− xj
xpj
l=1
d+1 s=1
+ xpj xqj #
d+1 d+1 l=1 s=1
d+1 ∂2 ∂2 f (¯x) + xqj xlj f (¯x) xsj ∂xsj ∂xqj ∂xpj ∂xlj
xsj xlj
l=1
∂2 f (¯x) ∂xsj ∂xlj
"
= xj −3 − ep xj ∇f (¯x) eq − ep ∇f (¯x)xj eq + ep 3xj −2 xj xj − Id+1 eq xj ∇f (¯x) + xj ep Hjj f (¯x)eq
80
E. García-Portugués and A. Meilán-Vila
− xj −1 ep xj xj Hjj f (¯x)eq + ep Hjj f (¯x)xj xj eq $ + ep xj xj eq xj Hjj f (¯x)xj ,
(4.20)
with p, q = 1, . . . , d + 1. Collecting the entries in (4.20) into Hjj f¯(x), it follows that, for x ∈ (Sd )r , Hjj f¯(x) = − xj ∇ j f (x) − ∇ j f (x) xj + (3xj xj − Id+1 )(∇ j f (x)xj )
.
+ Hjj f (x) − (xj xj Hjj f (x) + Hjj f (x)xj xj ) + xj xj (xj Hjj f (x)xj ) = (Id+1 − xj xj )Hjj f (x)(Id+1 − xj xj ) − (∇ j f (x)xj )(Id+1 − xj xj ) − A,
(4.21)
where A := xj ∇ j f (x) + (xj ∇ j f (x)) − 2(∇ j f (x)xj )xj xj is a symmetric matrix that, differently from the other terms in (4.21), is non-orthogonal to xj xj : A(xj xj ) = (xj ∇ j f (x)) − (∇ j f (x)xj )xj xj ,
.
despite being easy to check that (xj xj )A(xj xj ) = (Id+1 − xj xj )A(Id+1 − xj xj ) = 0. In addition, for k, j = 1, . . . , r, k = j , and p, q = 1, . . . , d + 1, .
∂2 f¯(x) ∂xpk ∂xqj = xj −1
∂ ∂xpk
= xj −1 xk −1 − xk −2 xpk
d+1 ∂ ∂ ∂ f (¯x) − xj −3 f (¯x) xlj xqj ∂xqj ∂xpk ∂xlj l=1
!
∂2 ∂xpk ∂xqj
d+1 s=1
f (¯x)
d+1 ∂2 ∂2 f (¯x)xlj f (¯x)xsk − xj −2 xqj ∂xpk ∂xlj ∂xsk ∂xqj
− xj −2 xk −2 xpk xqj
l=1
d+1 d+1 l=1 s=1
xsk xlj
∂2 f (¯x) ∂xsk ∂xlj
"
# = xj −1 xk −1 ep Hkj f (¯x)eq
− xk −2 ep xk xk Hkj f (¯x)eq − xj −2 ep Hkj f (¯x)xj xj eq $ + xj −2 xk −2 ep xj xk eq xk Hkj f (¯x)xj .
4 Hippocampus Shape Analysis via Skeletal Models and Kernel Smoothing
81
By an analogous collection of terms to that in (4.21), for x ∈ (Sd )r , Hkj f¯(x) = (Id+1 − xk xk )Hkj f (x)(Id+1 − xj xj ). Proof (Proposition the unprojected esti 2) The proof follows after recalling that ˜ −i (t; h) = nj =1, j =i W−i,j (t; h)Xj mator m(t; ˜ h) := nj =1 Wj (t; h)Xi satisfies m with W−i,j (t; h) = Wj (t; h)/(1 − Wi (t; h)) since ni=1 Wi (t; h) = 1, for all t ∈ R.
References 1. Csernansky J.G., Wang L., Swank J., Miller J.P., Gado M., Mckeel D., Miller M.I., Morris J.C.: Preclinical detection of Alzheimer’s disease: hippocampal shape and volume predict dementia onset in the elderly. Neuroimage 25(3), 783 (2005) doi: https://doi.org/j.neuroimage. 2004.12.036 2. Dryden I.L., Mardia K.V.: Statistical Shape Analysis, with Applications in R. Wiley Series in Probability and Statistics. Wiley, Chichester (2016) doi: https://doi.org/10.1002/ 9781119072492 3. Liu Z., Hong J., Vicory J., Damon J.N., Pizer S.M.: Fitting unbranching skeletal structures to objects. Med. Image Anal. 70, 102020 (2021) doi: https://doi.org/10.1016/j.media.2021. 102020 4. Pizer S.M., Marron J.S.: In Zheng, G., Li, S., Székely, G. (eds.) Statistical Shape and Deformation Analysis, pp. 137–164. Academic Press, London (2017) doi: https://doi.org/10. 1016/B978-0-12-810493-4.00007-9 5. Pizer S.M., Jung S., Goswami D., Vicory J., Zhao X., Chaudhuri R., Damon J.N., Huckemann S., Marron J.: Nested sphere statistics of skeletal models. In Breuß, M., Bruckstein, A., Maragos, P. (eds) Innovations for Shape Analysis, pp. 93–115. Springer, Heidelberg (2013) doi: https://doi.org/10.1007/978-3-642-34141-0_5 6. Marron J.S., Dryden I.L.: Object Oriented Data Analysis, Monographs on Statistics and Applied Probability, vol. 169. CRC Press, Boca Raton (2021) doi: https://doi.org/10.1201/ 9781351189675 7. Hong J., Vicory J., Schulz J., Styner M., Marron J.S., Pizer S.M.: Non-Euclidean classification of medically imaged objects via s-reps. Med. Image Anal. 31, 37 (2016) doi: https://doi.org/ 10.1016/j.media.2016.01.007 8. Schulz J., Pizer S.M., Marron J., Godtliebsen F.: Non-linear hypothesis testing of geometric object properties of shapes applied to hippocampi. J. Math. Imaging Vis. 54(1), 15 (2016) doi: https://doi.org/10.1007/s10851-015-0587-7 9. Pizer S.M., Hong J., Vicory J., Liu Z., Marron J.S., Choi H.Y., Damon J., Jung S., Paniagua B., Schulz J., Sharma A., Tu L., Wang J.: Object shape representation via skeletal models (s-reps) and statistical analysis. In Pennec, X., Sommer, S., Fletcher, T. (eds.) Riemannian Geometric Statistics in Medical Image Analysis, pp. 233–271. Elsevier (2020) doi: https://doi. org/10.1016/B978-0-12-814725-2.00014-5 10. Siddiqi K., Pizer S.: Medial Representations: Mathematics, Algorithms and Applications, Computational Imaging and Vision, vol. 37. Springer Science & Business Media (2008) doi: https://doi.org/10.1007/978-1-4020-8658-8 11. Pizer S., Marron J., Damon J.N., Vicory J., Krishna A., Liu Z., Taheri M.: Skeletons, object shape, statistics. Front. Comput. Sci. 4, 842637 (2022) doi: https://doi.org/10.3389/fcomp. 2022.842637 12. Pizer S.M., Hong J., Vicory J., Liu Z., Marron J.S., Choi H.Y., Damon J., Jung S., Paniagua B., Schulz J., Sharma A., Tu L., Wang J.: Object shape representation via skeletal models (s-reps) and statistical analysis. In Pennec, X., Sommer, S., Fletcher, T. (eds.) Riemannian
82
E. García-Portugués and A. Meilán-Vila
Geometric Statistics in Medical Image Analysis, pp. 233–271. Academic Press, London (2020) doi: https://doi.org/10.1016/B978-0-12-814725-2.00014-5 13. Jung S., Foskey M., Marron J.S.: Principal arc analysis on direct product manifolds. Ann. Appl. Stat. 5(1), 578 (2011) doi: https://doi.org/10.1214/10-aoas370 14. Jung S., Dryden I.L., Marron J.S.: Analysis of principal nested spheres. Biometrika 99(3), 551 (2012) doi: https://doi.org/10.1093/biomet/ass022 15. Genovese C.R., Perone-Pacifico M., Verdinelli I., Wasserman L.: Nonparametric ridge estimation. Ann. Stat. 42(4), 1511 (2014) doi: https://doi.org/10.1214/14-AOS1218 16. Eltzner B.: Geometrical smeariness – a new phenomenon of Fréchet means. Bernoulli 28(1), 239 (2022) doi: https://doi.org/10.3150/21-BEJ1340 17. Friedman J., Hastie T., Tibshirani R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010) doi: https://doi.org/10.18637/jss.v033.i01 18. Chacón J.E., Duong T.: Multivariate Kernel Smoothing and its Applications, Monographs on Statistics and Applied Probability, vol. 160. CRC Press, Boca Raton (2018) doi: https://doi. org/10.1201/9780429485572 19. Lehoucq R.B., Sorensen D.C., Yang C.: ARPACK Users’ Guide. Society for Industrial and Applied Mathematics. Philadelphia (1998) doi: https://doi.org/10.1137/1.9780898719628 20. Sanderson C., Curtin R.: A user-friendly hybrid sparse matrix class in C++. In Davenport, J.H., Kauers, M., Labahn, G., Urban, J. (eds.) Mathematical Software – ICMS, pp. 422– 430. Springer International Publishing, Cham (2018) doi: https://doi.org/10.1007/978-3-31996418-8_50 21. García-Portugués E.: Exact risk improvement of bandwidth selectors for kernel density estimation with directional data. Electron. J. Stat. 7, 1655 (2013) doi: https://doi.org/10.1214/ 13-ejs821 22. Silverman B.W.: Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. Chapman & Hall, London (1986) doi: https://doi.org/10.1007/978-14899-3324-9 23. Zoubouloglou P., García-Portugués E., Marron J.S.: Scaled torus principal component analysis. J. Comput. Graph. Stat. (to appear) (2022) doi: https://doi.org/10.1080/10618600.2022. 2119985 24. Borg I., Groenen P.J.: Modern Multidimensional Scaling: Theory and Applications. Springer Series in Statistics. Springer Science & Business Media (2005) doi: https://doi.org/10.1007/0387-28981-X 25. de Leeuw J., Mair P.: Multidimensional scaling using majorization: SMACOF in R. J. Stat. Softw. 31(3), 1 (2009) doi: https://doi.org/10.18637/jss.v031.i03 26. García-Portugués E., Paindaveine D., Verdebout T.: On optimal tests for rotational symmetry against new classes of hyperspherical distributions. J. Am. Stat. Assoc. 115(532), 1873 (2020) doi: https://doi.org/10.1080/01621459.2019.1665527 27. García-Portugués E., Paindaveine D., Verdebout T.: rotasym: Tests for Rotational Symmetry on the Hypersphere (2022). https://CRAN.R-project.org/package=rotasym. R package version 1.1.4 28. Benjamini Y., Yekutieli D.: The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29(4), 1165 (2001) doi: https://doi.org/10.1214/aos/1013699998 29. Lafarge T., Pateiro-López B.: Implementation of the 3D Alpha-Shape for the Reconstruction of 3D Sets from a Point Cloud (2020). https://CRAN.R-project.org/package=alphashape3d. R package version 1.3.1
Chapter 5
Application of Quantile Regression Models for Biomedical Data Mercedes Conde-Amboage, Ingrid Van Keilegom, and Wenceslao González-Manteiga
Abstract A new lack-of-fit test for censored quantile regression models with multiple or even high-dimensional covariates will be presented. The test is based on the cumulative sum of residuals with respect to unidimensional linear projections of the covariates. The test is then adapting the ideas presented in [1], to cope with high-dimensional covariates, to the test proposed by [2]. The limit distribution of the empirical process associated with the test statistic will be shown. Furthermore, in order to approximate the critical values of the test, a bootstrap mechanism is used, which is similar to the proposal developed in [3]. In addition, an extensive simulation study and an interesting real data application will be presented in order to show the behaviour of the new test in practice. Keywords Quantile regression · Censored data · Lack-of-fit test · High-dimensional covariates
M. Conde-Amboage () Department of Statistics, Mathematical Analysis and Optimization, Universidade de Santiago de Compostela (USC), Santiago de Compostela, Spain CITMAga, Santiago de Compostela, Spain e-mail: [email protected] I. Van Keilegom Research Centre for Operations Research and Statistics (ORSTAT), KU Leuven, Belgium e-mail: [email protected] W. González-Manteiga Department of Statistics, Mathematical Analysis and Optimization, Universidade de Santiago de Compostela (USC), Santiago de Compostela, Spain CITMAga, Santiago de Compostela, Spain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 Y. Larriba (ed.), Statistical Methods at the Forefront of Biomedical Advances, https://doi.org/10.1007/978-3-031-32729-2_5
83
84
M. Conde-Amboage et al.
5.1 Introduction Although mean regression achieved its greatest diffusion in the twentieth century, quantile regression ideas were earlier. While the beginning of least squares regression dates back to the year 1805 by the work of Legendre, in the mideighteenth century Boscovich already fitted data about the ellipticity of the Earth using concepts of quantile regression. Later, quantile regression methods found a great development since the emergence of Robust Statistics, which reached great expansion in the 1980s. The works [4] or [5] are good examples in this line. These concepts are still applied nowadays. Quantile regression is employed when the aim of study is centred on the estimation of different positions (quantiles). This kind of regression allows a more detailed description of the behaviour of the response variable (not only focused on measures of central positions), adapts to situations under more general conditions of the error distribution (that is, do not require stringent assumptions, such as homoscedasticity or normality) and has good robustness properties. For all that, quantile regression is a very useful statistical methodology for a large diversity of disciplines, such as Economy, Environmental Science or Medicine. Let us consider a regression model denoted by: Y = qτ (X) + ε,
.
where Y denotes the response variable, .X ∈ Rd denotes the covariates, .qτ represents the regression function and the error .ε has a conditional .τ-quantile equal to zero, that is, .P(ε < 0|X = x) = τ for almost all x. Note that this last condition about the model error is analogous to assuming in the classical mean regression context that the conditional expectation of the error is equal to zero. Quantile regression estimators were introduced in [6] as a weighted absolute residuals fit which allows to extend some properties of classical least squares estimation to quantile regression estimates. That is, given a sample .{(X1 , Y1 ), . . . , (Xn , Yn )}, they assume that .qτ (Xi ) = θτ (1, Xi ) where .θτ is a vector of unknown parameters, and propose estimating .θτ by θτ = arg min
n
.
θ
ρτ Yi − θτ (1, Xi ) ,
i=1
where .ρτ (u) = u(τ − I(u < 0)) is the quantile loss function and .I is the indicator function. Several classical statistical tools and procedures have been adapted to the quantile regression scenario over the years. In this line, a good review about quantile regression can be found in [7]. In the literature, we can identify many application examples of quantile regression for biomedical data. For instance, an application to neuroimaging can be found in [8], an illustration with longitudinal data is presented in [9], the utility of quantile
5 Application of Quantile Regression Models for Biomedical Data
85
Fig. 5.1 Illustration of the different types of observations that we can find when we analyse a right-censored variable
UNCENSORED CENSORED UNCENSORED CENSORED UNCENSORED CENSORED CENSORED
START OF STUDY
END OF STUDY
methods in anaesthesia research is shown in [10] or, recently, an application to COVID-19 data is presented in [11], among others. Once we have fitted a quantile regression model, it is natural to ask whether the fitted model is correct or not. That is, to formulate the following lack-of-fit test: .
H0 : qτ ∈ Qθ = qτ (·, θ ) : θ ∈ ⊂ Rq / Qθ Ha : qτ ∈
.
Several authors have addressed this issue, see, for instance, the test proposed in [12] that is based on smoothing ideas, the proposal presented in [13] that is based on empirical regression processes, in [14] the main goal is to test if the conditional median function is linear against a nonparametric alternative with unknown smoothness, a nonparametric test is presented in [15], a nonparametric test for the correct specification of a linear conditional quantile function over a continuum of quantile levels is introduced in [16], an omnibus specification test for parametric dynamic quantile models is presented in [17], or a test for additive quantile models is proposed in [18], among others. Furthermore, quantile regression procedures allow us to consider regression scenarios with complex data (such as censored data), in many cases, under better conditions than classical least squares methods. In the biomedical context, it is quite common to work with censored data, that is, the variable of interest (also known as survival time), denoted by Y , is not observed or censored if it is greater than the corresponding censoring time, denoted by C. Figure 5.1 shows different types of observations that we can find when we analyse a right-censored variable and, for some individual (censored values), we are not able to know completely the survival time but we have information until a certain moment (that could be, for instance, the end of the study). It is easy to think about different biomedical examples of survival times, for instance: time that COVID-19 patients remain hospitalized in the Intensive Care Unit (ICU), time to return to drug abuse, survival time after a pancreatic cancer diagnostic, or viral load response obtained in a clinical trial for HIV-infected individuals.
86
M. Conde-Amboage et al.
In the literature we can find several methods in order to fit a parametric censored quantile regression model. For instance, • The first estimator for censored quantile regression has been presented in [19] and it is given by θτ = arg
.
min d+1 θτ ∈R
n
ρτ Zi − min{Ci , θτ (1, Xi )} ,
i=1
where .ρτ is the well-known loss quantile function and .{(Zi , Xi , Ci )} with i = 1, . . . , n represent observations of the observed variable (.Z = min{Y, C}), the covariates and the censoring variable. Note that in this case the sample of censoring variable, .{C1 , . . . , Cn }, is supposed to be known. • Later, a method of recursively estimating linear √ conditional quantile functions is presented in [20] and the consistency and . n-convergence of the proposed estimator are established. Portnoy’s method can be regarded as a generalization to the regression context of the Kaplan-Meier estimator. • Afterwards, the approach developed in [21] is associated with the Nelson-Aalen estimator of the cumulative hazard function. Their proposed martingale based estimating equations lead to a simple algorithm that involves minimizations only of .L1 -type convex functions. • Recently, a new estimator is presented in [2] and it is based on the “quantile version” of formula (2.2) in [22] for classical mean regression, that is, .
θτ = arg minq θτ ∈R
n
.
WiKM ρτ (Zi − qτ (Xi , θτ )) ,
i=1
where .Zi = min{Yi , Ci }, .WiKM denote the Kaplan-Meier weights that are given by WiKM =
.
δj i−1
n−j δi , n−i+1 n−j +1 j =1
and .δi = I(Yi ≤ Ci ) represent the censoring indicators. Note that the sample {(Zi , Xi , δi ) .with i = 1, . . . , n} is ordered by the observed variable .Zi .
.
More recently, a complete review about quantile regression models for survival data is presented in [23]. In particular, [23] mentions several authors that have been dealt with the estimation problem for censored data, such as, [24–27] or [28], among others. In the biomedical literature, we can find censored quantile regression applications to haematopoietic cell transplant research (see [29]), survival data with covariates subject to detection limits (see [30]), length-biased survival data (see [31]), predic-
5 Application of Quantile Regression Models for Biomedical Data
87
tion in a renal disease study ([32]) or biomarker data with missing values at early visits (see [33]), among others. Moreover, in the context of parametric censored quantile regression, there exist also several proposals in order to check the validity of the fitted models, although notably less than in the context of complete data. See, for instance, the lack-of-fit test proposed in [34] using smoothing ideas or the test based on empirical processes presented in [2]. Nowadays, it is common to have large amounts of data, that is, a lot of information about each individual in the study. In the regression setup, that means to consider models with multivariate or even high-dimensional covariates. Such situations have recently been addressed in the context of quantile regression, see, for example, [35–38] or [39]. It is well-known that a high (or even moderate) dimension of the covariate can affect the performance of the specification tests. In this kind of situations classical lack-of-fit tests (both based on smoothing ideas or on empirical regression processes) suffer from the curse of dimensionality when the dimension of the explanatory variable increases. This fact has been studied in [40, 41] or [42] for quantile regression models with complete data, but nothing can be found in the censored data setup. Taking into account the state of the art, along this chapter a lack-of-fit test for parametric censored quantile regression models, with good properties for multidimensional or even high-dimensional covariates, will be presented. Moreover, a real data application, where the main goal is to describe the effect of several covariates on the time to return to drug abuse of a group of 517 patients that have had addiction problems, will be presented. The new developed procedure is able to test if the parametric censored quantile regression model that will be fitted is correct or not (in which case, nonparametric techniques will be needed) for different .τ-quantiles of interest. Henceforth, the chapter is organized as follows: in Sect. 5.2 we present two new test statistics, together with a bootstrap method to approximate the critical values of the tests and some computational ideas. Hereafter, Sect. 5.3 contains a simulation study that allows us to show the performance of the new tests in practice compared with their natural competitors. Moreover, a real data application is presented in Sect. 5.4 to illustrate how the tests can be useful in the biomedical context. Finally, Sect. 5.5 contains some conclusions and the Appendix contains the theoretical behaviour of the new proposals.
5.2 The New Testing Procedure Let .Yi be the response variable (survival time) that depends on a d-dimensional vector of covariates .Xi , and let .Ci denote the censoring time with .i = 1, . . . , n. Then, the variable .Yi is not observed or said to be censored if it is greater than the corresponding censoring time .Ci . More specifically, the censored quantile regression model assumes
88
M. Conde-Amboage et al.
Yi = qτ (Xi , θτ ) + εi ,
(5.1)
.
where .qτ represents the conditional quantile function of .Yi given .Xi , which is known up to a finite-dimensional parameter .θτ ∈ Rq , and the .εi are independent random errors which verify that .P(εi ≤ 0|Xi ) = τ, for some quantile of interest .τ ∈ (0, 1). Moreover, let us remember the following notation: the observed variables .Zi = min{Yi , Ci } and the censoring indicators .δi = I(Yi ≤ Ci ) where .I is the indicator function. The main goal of this chapter will be to perform the following lack-of-fit test
H0 : qτ ∈ Qθ = qτ (·, θ ) : θ ∈ ⊂ Rq
.
Ha : qτ ∈ / Qθ
(5.2)
when the covariate is multivariate or even high-dimensional. The new lack-of-fit test for censored quantile regression will be based on the cumulative sum of residuals with respect to unidimensional linear projections of the covariates following the ideas in [1] and [41] to deal with the curse of dimensionality. The idea of using onedimensional projections of the covariates is justified by Lemma 1, whose proof can be found in the Appendix. Lemma 1 (Characterization of the Null Hypothesis) The null hypothesis .H0 : qτ ∈ Qθ , holds if and only if, for some .θτ ∈ ⊂ Rq , and for any .t ∈ R and .β ∈ Rd with .β = 1,
E ψτ (Y − qτ (X, θτ )) qτ(1) (X, θτ ) I(β X ≤ t) = 0,
.
(1)
where .ψτ (r) = τI(r > 0) + (τ − 1)I(r < 0) and .qτ (x, θ ) =
(5.3)
∂ ∂θ qτ (x, θ ).
Given Lemma 1, we will consider consistent estimators of (5.3) to define the test statistic. First of all, if the true parameter .θτ was known, extending the idea appearing in [41] for censored data, the test could be based on the processes 0 Rn,1 (t, β) = n1/2
n
.
WiKM ψτ (Zi − qτ (Xi , θτ )) qτ(1) (Xi , θτ ) I(β Xi ≤ t)
i=1 0 Rn,2 (t, β) = n1/2
n
WiKM ψτ (Zi − qτ (Xi , θτ )) I(β Xi ≤ t),
i=1
where .{(X1 , Z1 , δ1 ), · · · , (Xn , Zn , δn )} represents a random sample of the regression variables and .WiKM denote the Kaplan-Meier weights that are given by KM
Wi
.
δ[j ] i−1
n−j δ[i] = . n−j +1 n−i+1 j =1
5 Application of Quantile Regression Models for Biomedical Data
89
Note that .δ[i] represents the censoring indicator associated with .Z[i] where .Z[1] ≤ Z[2] ≤ · · · ≤ Z[n] . Hereafter, for simplicity, we are going to use the following notation: .Z[i] = Zi , .X[i] = Xi and .δ[i] = δi . 0 is the “quantile version” of the one proposed in [22] The empirical process .Rn,2 for censored mean regression but including the one-dimensional projections of the 0 has been considered because [13], in the covariates. Furthermore, the process .Rn,1 context of quantile regression with complete data, established that including the derivative of the regression function (.qτ(1) ) is desirable in terms of power of the test. In practice, we will need to replace .θτ with an estimator. We will use the estimator proposed in [2], that is, θτ = arg minq θτ ∈R
n
WiKM ρτ (Zi − qτ (Xi , θτ )) .
.
(5.4)
i=1
The asymptotic behaviour of the estimator in (5.4) is shown in [2] and it will be crucial in order to study the behaviour of the new test under composite and alternative hypotheses. In this case, the empirical processes can be written down as follows 1 Rn,1 (t, β) = n1/2
n
.
WiKM ψτ Zi − qτ (Xi , θτ ) qτ(1) (Xi , θτ ) I(β Xi ≤ t)
i=1 1 Rn,2 (t, β) = n1/2
n
WiKM ψτ Zi − qτ (Xi , θτ ) I(β Xi ≤ t).
(5.5)
i=1
Finally, once we have defined the empirical processes (5.5), the test statistics associated with the new proposals can be defined as a Cramer-von Mises norm of them. That is, the following test statistics will be used Tn,1 = largest eigenvalue of
.
Tn,2 =
1 1 Rn,1 (t, β) [Rn,1 (t, β)] Fn,β (dt) dβ
1 (Rn,2 (t, β))2 Fn,β (dt) dβ,
(5.6)
where . = [−∞, +∞] × Sd , .Sd is the unit sphere on .Rd , and .Fn,β is the empirical distribution of the projected covariates .β X1 , . . . , β Xn . Note that in the case of .Tn,1 the result of applying a Cramer-von Mises norm will be a .q × q matrix and, following the ideas developed in [13], we will consider the largest eigenvalue as a summary measure. In the Appendix the asymptotic behaviour of the new tests under the null and the alternative hypotheses is derived in order to show the consistency of the new proposals. Moreover, the proofs of these results also can be found in the Appendix.
90
M. Conde-Amboage et al.
5.2.1 Bootstrap Approximation 1 with .i = 1, 2 In the Appendix, the limit distribution of the empirical processes .Rn,i has been shown but this convergence could be slow and it involves the estimation of unknown quantities such as the conditional density of the error (especially difficult to estimate in the multivariate context). On the other hand, we can consider other possibilities such as a multiplier bootstrap, used in [22] in the context of mean regression for censored data, but this approach also depends on the estimation of unknown quantities. Moreover, taking into account that we are working under Stute’s conditions (assumptions (A1)(i) and (A1)(ii) given in the Appendix), we decide to use a bootstrap approach following the ideas presented in [3] for classical mean regression, and adapted to quantile regression and heteroscedastic scenarios in [2]. To apply the ideas developed in [3], it will be necessary to compute the conditional Kaplan-Meier estimator of the error given the covariates. Obviously, this estimator will be affected by the curse of dimensionality and because of this reason we will use the dimension reduction approach presented in [43]. Then, we will assume that
(ε, C) and X are conditionally independent given λ X
.
for some .λ ∈ ⊂ Rd . Following the ideas developed in [43], the parameter .λ will be estimated using a single-index model, and we will denote this estimator by .λ. For simplicity, we are going to detail the bootstrap procedure for the test statistic .Tn,1 , but it would be analogous for .Tn,2 . Then, the resampling process is the following one: Step 1:
Let us consider a parametric quantile regression model given by Y = qτ (X, θτ ) + ε,
.
(5.7)
where the response variable Y is right censored so we can only know the observed θτ an variable that is denoted by Z. We will fit the model (5.7) and denote by . estimation of the parameter .θτ and by .εi = Zi − qτ (Xi , θτ ) the residuals. Then, we could compute the test statistic as Tn,1 = largest eigenvalue of
.
1 1 Rn,1 (t, β) [Rn,1 (t, β)] Fn,β (dt) dβ,
where . = [−∞, +∞] × Sd , .Sd is the unit sphere on .Rd , and .Fn,β is the empirical distribution of the projected covariates .β X1 , . . . , β Xn . Step 2: For a given covariate .Xi = (Xi1 , Xi2 , . . . , Xid ), let us consider the set Ai = k ∈ {1, . . . , n} : λ Xi − gn ≤ λ Xk ≤ λ Xi + gn ,
.
5 Application of Quantile Regression Models for Biomedical Data
91
for a given univariate smoothing parameter .gn . We can compute a bootstrap error, denoted by .ε , drawn from the conditional Kaplan-Meier estimator of the distribution function of .ε, built using the set .Ai , as follows:
1 .Fε|X=Xi (r) = 1 − 1− . εl ≥ εj ) l I( j ∈Ai , εj ≤r, δj =1 Define the bootstrap dataset .(X1 , Y1 ), . . . , (Xn , Yn ), where
Step 3:
Yi = qτ (Xi , θτ ) + εi ,
.
i = 1, . . . , n.
Note that the bootstrap responses .Yi follow the null hypothesis by construction. Step 4: Based on assumption (A1)-(ii), generate bootstrap resamples of the censoring indicator using that C (Yi ), P(δi = 1|Yi , Xi ) = P(δi = 1|Yi ) = 1 − F
.
C represents the Kaplan-Meier estimator of the distribution function of where .F the censoring variable C. As a result, we can construct the bootstrap censoring observations .{C1 , . . . , Cn } as follows: C restricted to the interval .[Y , ∞). • If .δi = 1 then .Ci is taken from .F i
C restricted to the interval .[−∞, Y ). • If .δi = 0 then .Ci is taken from .F i Finally, we can compute .Zi = min{Yi , Ci }. Step 5: Given the bootstrap sample .{(X1 , Z1 , δ1 ), · · · , (Xn , Zn , δn )}, fit the θτ the bootstrap estimator of . θτ . Then, parametric model (5.7) and denote by . we compute the bootstrap test statistic as
Tn,1 = largest eigenvalue of
.
1, 1, Rn,1 (t, β) [Rn,1 (t, β)] Fn,β (dt) dβ,
where 1, Rn,1 (t, β) = n1/2
n
.
WiKM, ψτ Zi − qτ (Xi , θτ ) qτ(1) (Xi , θτ ) I β Xi ≤ t .
i=1
Step 6:
Repeat steps 2, 3, 4, and 5 many times.
If B bootstrap samples are generated, then the p-value of the test may be approximated by the proportion of bootstrap values not smaller than the original test statistic, that is,
92
M. Conde-Amboage et al.
.
B 1
I Tn,1 ≤ Tn,1,b . B b=1
A discussion of this bootstrap validity can be found in [2]. Finally, in order to use this bootstrap procedure in practice, it will be necessary to choose the smoothing parameter .gn involved in Step 2. In this case, the smoothing ideas presented [44] will be used. Note that in [44] a bootstrap approach to nonparametric mean regression is presented, but it is necessary to assume that the response variable and the censoring variable are conditionally independent given the covariate and we are not using this condition.
5.2.2 Computational Aspects Tests that deal with the curse of dimensionality usually require additional algorithms over other more common model checks. In particular, [1] and [45] are based on the test developed in [46] and require additional computations over this original method. Similarly, a test for high-dimensional covariates that is based on the proposal in [47] is presented in [48] and requires an optimization algorithm over a set of Zheng-type statistics. The proposed method here is an adaptation of the ideas appearing in [13] to censored data and high-dimensional covariates following the ideas developed in [1]. One important virtue of this procedure is that the computational complexity does not grow dramatically with the dimension of the covariate. Following the ideas presented in [1], one can show that .Tn,1 represents the largest eigenvalue of a matrix .MTn,1 that can be expressed as follows MTn,1 =
.
=n
1 1 Rn,1 (t, β)[Rn,1 (t, β)] Fn,β (dt)dβ
n n i=1 j =1
×
WiKM ψτ ( εi ) WjKM ψτ ( εj ) qτ(1) (Xi , θτ ) qτ(1) (Xj , θτ )
I(β Xi ≤ t) I(β Xj ≤ t) Fn,β (dt) dβ
=
n n n i=1 j =1 r=1
×
Sd
WiKM ψτ ( εi ) WjKM ψτ ( εj ) qτ(1) (Xi , θτ )qτ(1) (Xj , θτ )
I(β Xi ≤ β Xr ) I(β Xj ≤ β Xr ) dβ
5 Application of Quantile Regression Models for Biomedical Data
=
n n n i=1 j =1 r=1
=
n n i=1 j =1
93
WiKM ψτ ( εi ) WjKM ψτ ( εj ) qτ(1) (Xi , θτ ) qτ(1) (Xj , θτ ) Aij r
WiKM ψτ ( εi ) WjKM ψτ ( εj ) qτ(1) (Xi , θτ ) qτ(1) (Xj , θτ ) A• [i, j ],
where . = [−∞, +∞] × Sd , . Sd is the unit sphere on .Rd , and .Fn,β is the empirical distribution of the projected covariates .β X1 , . . . , β Xn . Moreover, .A• is a .n × n matrix that is given by A• [i, j ] =
n
.
r=1
Aij r =
n r=1
A(0) ij r
π d/2−1
d 2 +1
with i, j = 1, · · · , n,
where .A(0) ij r is the complementary angle between the vectors .(Xi − Xr ) and .(Xj − Xr ) measured in radians, . is the gamma function, and d is the dimension of the covariate, X. That is ⎧ ⎪ π ⎪ ⎪ ⎪ ⎪ ⎪ ⎨2π
(0) ij r = ⎪π ⎪
if Xi = Xj and Xi = Xr if Xi = Xj = Xr
.A
⎪ ⎪ ⎪ ⎪ ⎩π − arccos (Xi −Xr ) (Xj −Xr ) Xi −Xr Xj −Xr
if Xi = Xj and Xi = Xr or Xj = Xr in any other case.
Note that .Aij r = Aj ir and that simplifies the evaluation of the test statistic .Tn,1 , since it implies that .A• is a symmetric matrix. This fact improves drastically the computation time of the test statistic and allows to apply the test to larger datasets. Thus, the total number of computations needed to obtain the test statistic depends on the dimension, d , only at a linear rate, which is the same rate required by the tests presented in [13] or [2], and much less than the optimization in d dimensions required by other methods in the literature. Moreover, the matrix .A• , which is the most expensive one in terms of computation time, does not need to be recomputed for each bootstrap sample because its elements only depend on the covariate of the quantile regression model that are not modified in the bootstrap procedure. So, the bootstrap test statistic presented in Sect. 5.2.1 can be written as follows:
n,1 =
.T
n n i=1 j =1
(1) (1) WiKM, ψτ ( εi ) WjKM, ψτ ( εj ) qτ (Xi , θτ ) qτ (Xj , θτ ) A• [i, j ].
94
M. Conde-Amboage et al.
5.3 Simulation Study In this section the performance of the new proposed methods under the null and the alternative hypotheses will be studied using a Monte Carlo simulation study in order to show the adjustment of the nominal level and the power of our proposals. In all considered scenarios, the number of simulated original samples was .N = 2000 and the number of bootstrap replications .B = 500. Note that in order to apply the bootstrap procedure proposed in Sect. 5.2.1, we will need to select the one-dimensional bandwidth parameter .gn = cn−0.11 using the ideas developed in [44]. Furthermore, in order to select the constant c we use a simple rule that consists in taking the range of the projected variables .λ X (following the proposal presented in [43]), used previously in [49] or [50], among others. We first focus on the behaviour of the new tests under the null hypothesis, in order to check the adjustment of the significance level. We simulate values from the following censored quantile regression models: Model 1: Y = 1 + X1 + X2 + (ε − cτ ), .
Model 2: Y = 1 + X1 + X2 + X3 + X4 + X5 + (ε − cτ ),
where .X1 , . . . , X5 are independent and follow a uniform distribution on the interval , and .ε is the unknown error that follows different distributions, which is drawn independently of the covariates. The constant .cτ represents the .τ-quantile of the error distribution. Moreover, the censoring variable C follows a Gaussian distribution with different means and variance .1.5. These means will depend on the quantile of interest and is chosen in order to obtain a pre-determined censoring rate. Note that for Models 1 and 2, the null hypothesis is the linear model in X versus an alternative that includes any nonparametric relation between Y and X. In this θτ . .λ = situation we can simplify the estimator proposed in [43] using Tables 5.1 and 5.2 show the proportion of rejections associated with the new test statistics .Tn,1 and .Tn,2 for different quantiles of interest (denoted by .τ), sample sizes (denoted by n), nominal levels (denoted by .α ) and censoring rates (denoted by CR ). Moreover we consider Model 1 with .ε ∼ N (0, 1) and .ε ∼ χ32 in order to show the effect of the error distribution on the adjustment of the nominal level. Finally, with Model 2 we try to show the effect of increasing the dimension of the covariate on the level of the new proposals. According to Tables 5.1 and 5.2, the new tests show a reasonable good adjustment of the nominal level in the different scenarios that have been considered. We should mention that the new proposal is a bit conservative for .τ = 0.75 when .ε ∼ χ32 and the censoring rate increases, but that can be expected because it is a really difficult situation. Finally, to conclude the analysis of the adjustment of the nominal level we will consider a heteroscedastic model in order to show the behaviour of .Tn,1 and .Tn,2 under this kind of scenarios. Note that we have presented a bootstrap procedure adapted to heteroscedastic scenarios (really common in quantile setup). Then, we
.(0, 1)
5 Application of Quantile Regression Models for Biomedical Data
95
Table 5.1 Proportions of rejections associated with the proposed lack-of-fit tests for Models 1 and 2 with censoring rate .CR = 20% .Tn,1
Model .τ n Model 1 0.25 100 with 500 1000 .ε ∼ N (0, 1) 0.50 100 500 1000 0.75 100 500 1000 Model 1 0.25 100 with 500 2 1000 .ε ∼ χ3 0.50 100 500 1000 0.75 100 500 1000 Model 2 0.25 100 500 with .ε ∼ N (0, 1) 1000 0.50 100 500 1000 0.75 100 500 1000
= 0.10 0.10 0.10 0.09 0.10 0.09 0.09 0.09 0.10 0.10 0.09 0.10 0.10 0.10 0.09 0.09 0.07 0.07 0.07 0.08 0.11 0.11 0.09 0.09 0.09 0.07 0.08 0.08
.α
.Tn,2
= 0.05 0.05 0.06 0.04 0.04 0.04 0.05 0.04 0.04 0.05 0.05 0.05 0.05 0.05 0.04 0.04 0.03 0.03 0.03 0.03 0.05 0.05 0.04 0.04 0.05 0.04 0.04 0.04
.α
= 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.00 0.01 0.00 0.01 0.01 0.02 0.01 0.01 0.01 0.00 0.01 0.01
.α
= 0.10 0.10 0.10 0.08 0.10 0.09 0.10 0.09 0.10 0.09 0.08 0.10 0.09 0.10 0.09 0.08 0.06 0.08 0.07 0.08 0.10 0.10 0.09 0.09 0.09 0.07 0.08 0.09
.α
= 0.05 0.05 0.06 0.05 0.05 0.04 0.05 0.04 0.04 0.05 0.04 0.05 0.05 0.05 0.04 0.05 0.03 0.03 0.04 0.03 0.05 0.05 0.04 0.05 0.05 0.03 0.04 0.04
.α
= 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.01
.α
will consider . Model
3: Y = 1 + X1 + X2 + X3 + (X1 + 0.5)(ε − cτ ),
where .X1 , X2 , X3 are independent and follow an uniform distribution on the interval , and .ε is the unknown error that follows a standard Gaussian distribution, which is drawn independently of the covariates. The constant .cτ represents the .τquantile of the Gaussian distribution. Moreover, the censoring variable C follows a Gaussian distribution with different means and variance 1. This mean will depend on the quantile of interest and is chosen in order to obtain a pre-determined censoring rate.
.(0, 1)
96
M. Conde-Amboage et al.
Table 5.2 Proportions of rejections associated with the proposed lack-of-fit tests for Models 1 and 2 with censoring rate .CR = 30% .Tn,1
Model .τ n Model 1 0.25 100 with 500 1000 .ε ∼ N (0, 1) 0.50 100 500 1000 0.75 100 500 1000 Model 1 0.25 100 with 500 2 1000 .ε ∼ χ3 0.50 100 500 1000 0.75 100 500 1000 Model 2 0.25 100 500 with .ε ∼ N (0, 1) 1000 0.50 100 500 1000 0.75 100 500 1000
= 0.10 0.10 0.10 0.12 0.07 0.11 0.11 0.07 0.10 0.09 0.11 0.08 0.09 0.08 0.09 0.08 0.04 0.04 0.06 0.10 0.12 0.13 0.08 0.09 0.10 0.07 0.09 0.10
.α
.Tn,2
= 0.05 0.05 0.05 0.07 0.04 0.06 0.06 0.04 0.05 0.04 0.06 0.04 0.05 0.03 0.05 0.04 0.02 0.01 0.03 0.05 0.06 0.07 0.04 0.05 0.06 0.03 0.04 0.05
.α
= 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.01 0.01 0.02 0.03 0.01 0.01 0.01 0.00 0.01 0.01
.α
= 0.10 0.10 0.11 0.11 0.08 0.11 0.10 0.07 0.10 0.08 0.11 0.09 0.09 0.08 0.09 0.08 0.04 0.04 0.06 0.09 0.11 0.13 0.08 0.09 0.09 0.07 0.09 0.11
.α
= 0.05 0.05 0.05 0.06 0.04 0.06 0.05 0.03 0.05 0.04 0.06 0.04 0.05 0.04 0.05 0.04 0.02 0.02 0.03 0.04 0.05 0.08 0.04 0.05 0.05 0.03 0.04 0.05
.α
= 0.01 0.01 0.01 0.02 0.01 0.01 0.01 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.01 0.01 0.02 0.01 0.01 0.01 0.00 0.01 0.01
.α
Table 5.3 shows the proportion of rejections associated with the new tests for different nominal levels (.α ), sample sizes (n) and quantiles of interest (.τ). According to Table 5.3, the new proposals show a good adjustment of the nominal level under heteroscedastic models in the different considered scenarios. Now, let us analyse the behaviour of the proposed lack-of-fit tests under alternative hypotheses and we will compare them with their natural competitors. As we mention previously, in the literature there exists the proposals presented in [34] and [2]. Note that the test proposed in [34] is based on smoothing ideas and as a result a generalized cross validation (GCV) procedure is proposed to estimate a d -dimensional bandwidth parameter. This procedure suffers from the curse of dimensionality and it is not computationally feasible when the dimension of the
5 Application of Quantile Regression Models for Biomedical Data
97
Table 5.3 Proportions of rejections associated with our proposed lack-of-fit tests for Model 3 with censoring rate .CR = 25% .Tn,1 .τ
0.25
0.50
0.75
n 100 500 1000 100 500 1000 100 500 1000
.Tn,2
= 0.10 0.10 0.11 0.12 0.08 0.12 0.13 0.08 0.10 0.13
.α
= 0.05 0.05 0.06 0.07 0.03 0.06 0.08 0.03 0.05 0.07
.α
= 0.01 0.01 0.02 0.02 0.00 0.02 0.02 0.00 0.01 0.02
.α
= 0.10 0.11 0.12 0.13 0.09 0.12 0.13 0.08 0.10 0.13
.α
= 0.05 0.05 0.06 0.07 0.03 0.06 0.08 0.03 0.05 0.08
.α
= 0.01 0.01 0.02 0.02 0.01 0.02 0.02 0.00 0.01 0.02
.α
covariate increases (see [42] for complete data). So we will compare the new tests with the test proposed in [2] that is our natural competitor. Note that we will consider the test developed in [2] that contains the derivative of the quantile regression function because it is supposed to be more powerful. We will simulate datasets from the following quantile regression model: . Model
4: Y = 1 + X1 + X2 + X3 − c (X2 + X3 )2 + (ε − cτ ),
where .X1 , .X2 , and .X3 are independent and follow an uniform distribution on the interval .(0, 1), and .cτ denotes the .τ-quantile of the error .ε. We consider two possible error distributions, namely a standard normal distribution and an exponential distribution with expectation .1/5, to check the effect of the error distribution on the power. The null hypothesis is the hypothesis of a linear model in .X1 , .X2 and .X3 . Note that the function .c (X2 + X3 )2 represents the deviation of the model from the null hypothesis, and the constant c will control these deviations. Figures 5.2 and 5.3 show the proportion of rejections for a nominal level .α = 0.01, different sample sizes (n), quantiles of interest (.τ) and values of the constant c associated with the new proposals (.Tn,1 represented with the dark orange line and .Tn,2 with the light orange line) and the test proposed in [2] (represented with violet line). According to these results we can conclude that the proportion of rejections associated with the new proposals goes to 1 when the sample size or the constant c increases showing their consistency. Moreover, the power of the new test is higher than that of the test presented in [2] in all the considered scenarios (or equal in situations with proportion of rejections equal to 1) independently of the considered quantile of interest. Finally, as the proposed lack-of-fit tests have been presented to deal with the curse of dimensionality, we will consider Model 5 in order to check the effect of increasing the dimension of the covariate on the power of the tests. Let us consider the following quantile regression model
98
M. Conde-Amboage et al.
Fig. 5.2 Proportion of rejections associated with our proposed lack-of-fit tests (.Tn,1 represented with the dark orange line and .Tn,2 with the light orange line) and the test proposed in [2] (violet line) for Model 4 depending on the error distribution, the sample size and the constant c that controls the deviation from the null hypothesis for .α = 0.01, .τ = 0.25 and .CR = 20%. (a) .ε ∼ N (0, 1) and .n = 100. (b) .ε ∼ N (0, 1) and .n = 500. (c) .ε ∼ Exp(5) and .n = 100. (d) .ε ∼ Exp(5) and .n = 500
. Model
5: Y = 1 + X1 + · · · + Xd − c eX1 +X2 + (ε − cτ ),
where .X1 , . . . , Xd are independent and follow an uniform distribution on the interval . The error follows a standard Gaussian distribution and .cτ denotes its .τquantile. The null hypothesis is the hypothesis of a linear model in .X1 , . . . , Xp and the function .c eX1 +X2 represents the deviation of the model from the null hypothesis, and the constant c will control these deviations. Figure 5.4 shows the proportion of rejections for a nominal level .α = 0.01, a sample size .n = 250, a quantile of interest .τ = 0.5 and different dimensions of the covariate (d ) and values of the constant c, associated with the new proposals (.Tn,1 represented with the dark orange line and .Tn,2 with the light orange line) and the test proposed in [2] (violet line). In view of these results we can conclude that the
.(0, 1)
5 Application of Quantile Regression Models for Biomedical Data
99
Fig. 5.3 Proportion of rejections associated with our proposed lack-of-fit tests (.Tn,1 represented with the dark orange line and .Tn,2 with the light orange line) and the test proposed in [2] (violet line) for Model 4 depending on the error distribution, the sample size and the constant c that controls the deviation from the null hypothesis for .α = 0.01, .τ = 0.50 and .CR = 20%. (a) .ε ∼ N (0, 1) and .n = 100. (b) .ε ∼ N (0, 1) and .n = 500. (c) .ε ∼ Exp(5) and .n = 100. (d) .ε ∼ Exp(5) and .n = 500
power of the tests decreases when the dimension of the covariate increases, which is expected because we are considering the same deviation from the null hypothesis while the quantile function increases. Note that there exists also an estimation effect on the power. Moreover, the power of the new proposals is clearly higher than the power of the test presented in [2] for the different considered scenarios, but specially for .d = 5. Finally, these results support the idea developed in [13] of including the derivative of the quantile function in the test statistic because the power of .Tn,1 (dark orange line) is superior.
100
M. Conde-Amboage et al.
Fig. 5.4 Proportion of rejections associated with our proposed lack-of-fit tests (.Tn,1 presented with the dark orange line and .Tn,2 with the light orange line) and the test proposed in [2] (violet line) for Model 5 depending on the dimension of the covariate (denoted by d) and the constant c that controls the deviation from the null hypothesis for .α = 0.01, .τ = 0.5, .n = 250 and .CR = 30%. (a) .d = 3. (b) .d = 5
5.4 Real Data Application As discussed in the introduction, the lack-of-fit tests presented in this chapter can be used in a multitude of practical situations whenever we are interested in checking the influence of a set of covariates on a survival time. Throughout this section we will illustrate the behaviour of new proposals in a real data application associated with a project about drug abuse. More specifically, we will focus on analysing the validity of quantile regression models associated with different quantiles of interest where the response variable is the time to return to drug abuse. The considered dataset is available in the R package quantreg under the name uis and it was collected by the University of Massachusetts AIDS Research Unit (UMARU) IMPACT Study. This is a 5-year collaborative research project about drug abuse whose complete description can be found in [51]. Moreover, this data set has been studied previously by several authors, see, for instance, [52] or [53]. The dataset uis contains the time to return to drug abuse (measured in days, and denoted by TIME) of a group of 517 patients. The variable is subject to random right censoring. Moreover, the dataset includes the following covariates that could have an effect on the variable TIME: AGE: BECK: HC:
age of each patient measured in years. Beck depression score of each patient at admission. indicator of whether the patient consumes heroin or cocaine tree months previous to the admission in the detoxification program (where 1 represents “heroin and cocaine”, 2 represents “heroin only”, 3 represents “cocaine only” and 4 represents “neither heroin nor cocaine”).
5 Application of Quantile Regression Models for Biomedical Data
101
Table 5.4 P-values associated with our proposed lack-of-fit tests and with the test presented in [2] (denoted by .TCKG ) for model (5.8). Moreover, computation times are included Quantiles of interest = 0.1 .τ = 0.25 0.246 0.000 0.242 0.000 0.058 0.626
.τ .Tn,1 .Tn,2 .TCKG
IV: NDT: RACE: TREAT: SITE: LEN.T: FRAC:
.τ
= 0.50 0.000 0.000 0.830
.τ
= 0.75 0.688 0.702 1.000
.τ
= 0.9 0.940 0.800 1.000
Computation time 15.62 12.03 9.24
history of intravenous drug use (where 1 represents “never”, 2 represents “previous” and 3 represents “recent”). number of prior drug treatments. race of each patient (where 0 represents “white” and 1 represents “other”). treatment randomization assignment (where 0 represents “short” and 1 represents “long”). place where each patient receives treatment (where 0 represents “location A” and 1 represents “location B”) length of stay in treatment measured in days. compliance fraction defined as LEN.T/90 for short treatments and LEN.T/180 for long treatments.
The main goal will be to test if the relation between the time to return to drug use and the 10 covariates can be fitted using a linear censored quantile regression model. Then, the model under the null hypothesis will be .log(TIME)
= θτ,0 + θτ,1 AGE + θτ,2 BECK + θτ,3 HC + θτ,4 IV + θτ,5 NDT + θτ,6 RACE + θτ,7 TREAT + θτ,8 SITE + θτ,9 LEN.T + θτ,10 FRAC + ε,
(5.8)
where .ε denotes the unknown error having .τ-quantile equal to zero. Note that [2] consider a similar model but only with two covariates. Table 5.4 shows the p-values associated with the proposed lack-of-fit tests for model (5.8) computed for different values of the .τ-quantile of interest. According to these results, if we consider the new proposals the linear quantile model is rejected for .τ = 0.25 and .τ = 0.5 while the test proposed in [2] does not reject the linear model. These results are in line with what is shown in Sect. 5.3, illustrating the best power of the new proposals. On the other hand, the linear quantile model is not rejected for .τ = 0.1, .τ = 0.75 or .τ = 0.9, regardless of the considered lack-of-fit test. To sum up, the main advantage of new tests is related to .τ = 0.25 or .τ = 0.5, because we are able to demonstrate that the lineal model is not correct in these situations, showing that the new proposals are more powerful than our natural competitor. Finally, in Table 5.4 the computational cost associated with each lack-of-fit test is presented. More specifically each value represents the mean time (measured in
102
M. Conde-Amboage et al.
seconds) it takes to run each test for the five quantiles of interest. Note that the computation time obtained for different quantiles are quite similar. These times were measured on a computer with 12th Gen Intel(R) Core(TM) i7-12800H, .2.40 GHz and 16 GB of RAM. As expected, the new tests require more computations than our natural competitor (because we are considering all possible unidimensional projections of the covariates), but the differences are not too large and the power of the new tests, shown in Sect. 5.3, justifies the use of the new proposal in spite of the increase in computation time.
5.5 Conclusions In this chapter, two new lack-of-fit tests for censored quantile regression models are presented. These tests are specially designed in order to deal with the curse of dimensionality when the covariate is multivariate or even high-dimensional. Note that censored variables are quite common in the biomedical context when we pay attention to the time it takes until we observe an event of interest, such as death, relapse into a certain disease or discharge from a certain hospital service, among others. On the one hand, the asymptotic behaviour of the new proposals under the null and the alternative hypotheses is derived in order to show the consistency of the new tests. On the other hand, to use the new procedures in practice, a bootstrap mechanism is presented to approximate the critical values. This bootstrap approximation was designed to work well in homoscedastic as well as heteroscedastic scenarios, which is especially important in the quantile context where it is not necessary to impose stringent assumptions on the error distribution. Furthermore, some computational aspects of the new tests are derived in order to reduce their computation time and make the new proposals applicable in practice. Thanks to an extensive Monte Carlo simulation study, we have shown the performance on the new tests in practice. Firstly, we show the good adjustment of the significance level independently of the considered quantile of interest, censoring rate, error distribution or sample size for different dimensions of the covariate. Secondly, the power of the new tests, compared with their natural competitor, is presented. Our proposals are generally more powerful than their natural competitor, for different deviations from the null hypotheses, samples sizes or quantiles of interest. To conclude the simulation study, we illustrate that the differences in the power increase when the dimension of the covariate is higher. In addition, the proposed tests were applied in a real data situation, where they were useful to validate well-known models in the biomedical literature. In particular, our main goal is to check if a linear censored quantile regression is correct in order to explain the behaviour of the time to return to drug abuse of a group of 517 patients that have had addiction problems, depending on different covariates such as age, race or length of stay.
5 Application of Quantile Regression Models for Biomedical Data
103
Thanks to this practical situation, we have shown the utility of the new lack-offit tests that are able to demonstrate that, for some quantiles of interest, the linear model is not correct and some nonparametric alternatives are needed. Finally, to facilitate the use of the new lack-of-fit tests in practice, an R package has been designed. The new R package is called LoFQuant and it contains different lack-of-fit tests for quantile regression models, in particular the tests presented in this chapter. Acknowledgments The authors gratefully acknowledge the financial support received through grant PID2020-116587GB-I00 funded by MCIN/AEI/10.13039/501100011033 and the European Union, and from the European Research Council (2016–2022, Horizon 2020/ERC grant agreement No. 694409).
Appendix Proof of Lemma 1 This lemma is a consequence of the following characterization of the null hypothesis proved in Lemma 4.1 of [54] (page 139): .H0
(1) : qτ ∈ Qθ holds ⇐⇒ E ψτ (Y − qτ (X, θτ )) qτ (X, θτ ) | β X = 0.
Note that given a pair of variables V1 and V2 if follows that .E [V1 |V2 ]
= 0 ⇐⇒ E [V1 I(V2 ≤ v)] = 0
for each v.
The implication “⇒” is immediate because E [V1 I(V2 ≤v)] =E [E [V1 |V2 ] I(V2 ≤v)]. Moreover, the implication “⇐” is a consequence of the following development: .E [V1
I(V2 ≤ v)] = 0 ∀v ⇒ E [E [V1 |V2 ] I(V2 ≤ v)] = 0 ∀v ⇒ E [V1 |V2 ] I(V2 ≤ v) dFV2 = 0 ∀v ⇒
v −∞
E [V1 |V2 ] dFV2 = 0 ∀v
⇒ E [V1 |V2 ] = 0, where dFV2 represents the cumulative distribution function of V2 , and the last implication is because of density properties.
In this appendix the asymptotic behaviour of the new tests under the null and the alternative hypotheses will be derived. We are going to focus on the empirical i with i = 0, 1, because the analysis of R i will be analogous. Let us processes Rn,1 n,2 introduce the following notation: .H (y)
= P(Z ≤ y)
104
M. Conde-Amboage et al.
0 (y) = P(Z ≤ y, δ = 0) = H
y −∞
(1 − F (u))G(du)
11 (x, y) = P(X ≤ x, Z ≤ y, δ = 1) H y− H0 (dz) . γ0 (y) = exp −∞ 1 − H(z)
Moreover, for any real-valued function ϕt,β defined on Rd+1 , let us consider 1 ϕt,β 1 (y) = 1 − H (y)
.γ
ϕt,β
γ2
(y) =
11 (dx, dw), I(y ≤ w) ϕt,β (x, w) γ0 (w) H
I(v < y, v < w) ϕt,β (x, w) γ0 (w) (1 − H (v))2
ϕ
11 (dx, dw), 0 (dv) H H
ϕ
and
ϕ
ηi t,β = ϕt,β (Xi , Zi , θ) γ0 (Zi ) δi + γ1 t,β (Zi ) (1 − δi ) − γ2 t,β (Zi ).
In our particular case, we will consider the function .ϕt,β (x, y, θ)
= ψτ (y − qτ (x, θ)) qτ (x, θ) I(β x ≤ t) (1)
ϕ
and we are going to use the notation ηi t,β ≡ ηi (t, β, θ). We will start showing the behaviour of the new test under the simple null hypothesis where the parameter θτ is known. In order to get this result it will be 0 , for which crucial to present an iid representation of the empirical processes Rn,1 the following assumptions will be needed: (A1)
(i) Y and C are independent. (ii) P(δ = 1|X, Y ) = P(δ = 1|Y ), that is, given X and Y , the probability of observing the real value for the response variable only depends on Y . (iii) The upper bound of the support of Y is strictly smaller than the upper bound of the support of C . Moreover, the distribution functions of Y and C are continuous. (iv) The conditional distribution of the error given the covariates, Fε|X (r|x), is continuously differentiable with respect to r uniformly in x and fε|X (0) > 0.
(A2) E
(1)
ψτ (Z − qτ (X, θτ )) qτ (X, θτ )
2
γ0 (Z)2 δ
< ∞.
1 0 < ∞ (A3) (1−H dH )3 (A4) All partial derivatives of qτ (x, θ) with respect to the components of θ of order 0, 1 and 2 exist and are continuous in (x, θ) for all x and θ , and (1) P(qτ (X, θ) = 0) = 0 for all θ ∈ . In addition, the first derivative of the
5 Application of Quantile Regression Models for Biomedical Data
105
regression function qτ (x, θ) with respect to the parameter θ , that is denoted by (1) qτ (x, θ), is uniformly bounded and Lipschitz with respect to θ and x . Note that previous assumptions are classical conditions in the context of censoring and quantile regression. On the one hand, conditions like (A1)(i)–(A1)-(iii), (A2) or (A3) are needed in order to obtain the convergence of the Kaplan-Meier integrals involved in the definition of the empirical processes given in (5.5). In this sense, these conditions are quite similar to the assumptions appearing in [22]. On the other hand, condition (A1)(iv) guarantees that there exist data points around the conditional quantile that we want to estimate and it also appears in [13]. 0 Then, the iid representation of the empirical process Rn,1 is presented in Theorem 1. Theorem 1 Under conditions (A1)–(A4) and if the null hypothesis holds, the empirical 0 can be written as a sum of independent and identically distributed random process Rn,1 variables plus a negligible error: 0 −1/2 n,1 (t, β) = n
.R
n
ηi (t, β, θτ ) + Qn (t, β),
(5.9)
i=1
where sup |Qn (t, β)| = O
.
(t,β)∈
(log n)3 √ n
almost surely.
0 can be written as a Kaplan-Meier integral Proof The empirical process Rn,1
0 1/2 n,1 (t, β) = n
.R
X,Y (x, y), ϕt,β (x, y, θτ ) d F
where .ϕt,β (x, y, θτ )
= ψτ (y − qτ (x, θτ )) qτ (x, θτ ) I(β x ≤ t), (1)
X,Y (x, y) = n W KM I(Xi ≤ x, Zi ≤ y) and W KM represents the Kaplan-Meier weight. F i=1 i i Following the ideas of [2], the i.i.d. representation given in (5.9) is a consequence of the results of Theorem 3.11 in [55] (see page 123). Then, we need to prove the conditions established in the latter theorem to conclude this proof. Condition 1: The family of functions ϕt,β should be a VC-class. To get this statement, we will check that the family of indicator functions of the form I(β X ≤ t) is a VC-class (1) because the quantities ψτ (ε) and qτ (x, θτ ) are bounded. Then, the family of indicator functions is a VC-class as a consequence of the two following arguments: – On the one hand, the theory presented in [56] establishes that the set of all half spaces Hr in Rr is a VC-class (see Problem 14 of page 152) where a half space is given by
106
M. Conde-Amboage et al. .H (β, u)
= {x ∈ Rr : β x ≤ u}
and consequently .Hr
= {H (β, u) : β ∈ Sr , u ∈ R},
where Sr = {x ∈ Rr : x = 1}. See Problem 8.4 of [57] for a more complete explanation. – On the other hand, arguments developed in [58] prove that a collection of sets is a VCclass of sets if and only if the collection of the corresponding indicator functions is a VC-class of functions (see page 275). Conditions 2 and 3: these conditions hold as a consequence of arguments developed along proof of Theorem 2 in [2]. Remark 1 Note that Theorem 1 can be easily extended for any parameter θ ∈ . This result will be used to show the asymptotic behaviour of the test statistics under the alternative hypothesis.
Theorem 1 will be essential to obtain the asymptotic behaviour of the presented empirical process under the null hypothesis. This result is established in the following theorem: Theorem 2 (Simple Null Hypothesis) Under the null hypothesis and conditions (A1)–(A4), 0 converges weakly in l ∞ ( ) to a centred Gaussian process R 0 the empirical process Rn,1 ∞ with covariance 0
0
.Cov(R∞ (x1 ), R∞ (x2 ))
= E[η(t1 , β1 , θτ ) η(t2 , β2 , θτ )],
where x1 = (t1 , β1 ), x2 = (t2 , β2 ) and l ∞ ( ) denotes the space of bounded functions defined on = [−∞, +∞] × Sd . 0 , we will analyse the Proof To obtain the asymptotic behaviour of the empirical process Rn,1 dominant terms of the i.i.d representation given in (5.9). This proof is analogous to the proof of Corollary 3 derived by Conde-Amboage et al. [2]. The main idea of the proof is to show that the class of functions η(t, β, θτ ) with (t, β) ∈ is Donsker using arguments similar to those in Lemma 8 in [2]. As in [2], it will be necessary to show that E[η(t, β, θτ )] = 0. Using results in [59] and [60] it follows that .E[η(t1 , β1 , θτ )]
= E[ψτ (Z − qτ (Z, θτ )) qτ (X, θτ ) I(β1 X ≤ t1 ) γ0 (Z) δ (1)
ϕt1 ,β1
+ γ1
ϕt1 ,β1
(Z) (1 − δ) − γ2
(Z)]
= E[ψτ (Z − qτ (X, θτ )) qτ (X, θτ ) I(β1 X ≤ t) γ0 (Z) δ] (1)
ϕt1 ,β1
+ E[γ1
ϕt1 ,β1
(Z) (1 − δ)] − E[γ2
(Z)]
= E[ψτ (Z − qτ (X, θτ )) qτ (X, θτ ) I(β1 X ≤ t) γ0 (Z) δ] (1)
5 Application of Quantile Regression Models for Biomedical Data
107
= E[ψτ (Y − qτ (X, θτ )) qτ (X, θτ ) I(β1 X ≤ t1 )] (1)
= E[E[ψτ (Y − qτ (X, θτ )) |X] qτ (X, θτ ) I(β1 X ≤ t1 )] = 0 (1)
for each t1 ∈ R and β1 ∈ Rd where we have used that ψτ (ε) = τ − I(ε < 0) and that the conditional distribution of I(ε < 0) given X is a Bernoulli distribution with parameter τ. As a result E [ψτ (ε) | X] = 0.
Moreover, as a result of Theorem 2, the test statistic converges to the largest 0 as a consequence of the eigenvalues of the Cramer-von Mises norm of R∞ continuous mapping theorem. Now, we would like to study the behaviour of the test statistic under the composite null and the alternative hypotheses. Let us consider the following local alternative hypothesis: .H1n
1 : Y = qτ (X, θτ ) + √ h(X) + ε, n
where h(·) is a real-valued function with E[ | h(X) | ] < ∞. At this aim, it will be crucial to use the asymptotic representation of the estimator given in (5.4), that is detailed in [2], n
√ 1 (1) ξi ( θτ ) + −1 E fε|X (0) h(X) qτ (X, θτ ) + op (1), n θτ − θτ = √ −1 n
.
i=1
(5.10) where .ξi (θ)
ϕ∗
(1)
ϕ∗
= ψτ (Zi − qτ (Xi , θ)) qτ (Xi , θ) γ0 (Zi ) δi + γ1 (Zi ) (1 − δi ) − γ2 (Zi ),
θτ = arg min E [ρτ (Y − qτ (X, θ))] = arg min S(θ) θ
(1) (1) = E fε|X (0) qτ (X, θτ ) qτ (X, θτ )
θ
under H1n .
and ϕ ∗ (x, y) = ψτ (y − qτ (x, θ)) qτ(1) (x, θ). To obtain the asymptotic representation given in (5.10), we will need to assume the following conditions: (A5) For all > 0, .
sup
θ:θ−θτ ≥
E [ρτ (Y − qτ (X, θ))] > E [ρτ (Y − qτ (X, θτ ))]
and E[supθ∈ |Y − qτ (X, θ)|] < ∞. (A6) The matrix = E fε|X (0) qτ(1) (X, θτ ) qτ(1) (X, θτ ) is non-singular. Moreover, all components of the vector E fε|X (0) h(X) qτ(1) (X, θτ ) and of the matrix (2) (2) ∂ 2 q (X, θ). E fε|X (0) h(X) qτ (X, θτ ) are finite, where qτ (X, θ) = ∂θ∂θ τ
108
M. Conde-Amboage et al.
(A7) E[γ04 (Z)δ] < ∞ and E H 2 (Z)/(1 − H (Z))2 γ04 (Z)δ] < ∞. (A8) FZ|X=x (z) is Lipschitz in z uniformly in x . In order to show the behaviour of the new lack-of-fit tests under the alternative hypothesis, the following Lemma, whose proof is quite technical and similar to the proof of Lemma 8 in [2], will be useful. Lemma 2 Under assumption (A4), it follows that ⎤ ⎡ n n 1 n 1 1 ⎦ ⎣ η (t, β, θ) − η (t, β, θ ) . sup η (t, β, θ) − E τ i i i n n n θ−θτ ≤δn i=1 i=1 i=1 = op (n−1/2 )
(5.11)
uniformly in (t, β) for all positive values δn = o(1). 1 Now we can derive the asymptotic representation of the empirical process Rn,1 under local alternatives. This result is presented in the following Theorem:
Theorem 3 (Composite and Alternative Hypotheses) Let us assume that the data come from .Yi
1 = qτ (Xi , θτ ) + √ h(Xi ) + εi n
i ∈ {1, . . . , n},
where ε1 , . . . , εn are independent errors with conditional τ-th quantile equal to zero and the 1 allows response variable Y is right censored. Under conditions (A1)–(A8), the process Rn,1 the following representation: n 1 (t, β) = R 0 (t, β) − √1 −1 (t, β) ξi ( θτ ) n,1 n,1 n i=1
.R
(1) − −1 (t, β) E fε|X (0) h(X) qτ (X, θτ )
(1) + E fε|X (0) h(X) qτ (X, θτ ) I(β X ≤ t) + op (1)
uniformly in (t, β) ∈ where .(t, β)
(1) (1) = E fε|X (0) qτ (X, θτ ) qτ (X, θτ ) I(β X ≤ t) .
The second term of the right-hand side represents the effect of the estimation of the parameter θτ , while the third and fourth terms are constants reflecting the deviation from the null hypothesis. 1 as follows Proof As a consequence of Theorem 1, we can write Rn,1
5 Application of Quantile Regression Models for Biomedical Data
1 1/2 n,1 (t, β) = n
.R
n
109
(1) WiKM ψτ Zi − qτ (Xi , θτ ) qτ (Xi , θτ ) I(β Xi ≤ t)
i=1 n 1 ηi (t, β, θτ ) + op (1) = √ n i=1
⎡ ⎤ n n 1 1 ηi (t, β, θτ ) − E ⎣ √ ηi (t, β, θτ )⎦ = √ n n i=1
⎡
⎤
i=1
1 + E⎣√ ηi (t, β, θτ )⎦ + op (1) n n
i=1
1a (t, β) + R 1b (t, β) + o (1). = Rn,1 p n,1 1a (t, β) = R 0 (t, β) + On the one hand, following Lemma 2 we can conclude that Rn,1 n,1 op (1) uniformly in (t, β) ∈ . On the other hand, following the argument used to obtain E[η(t1 , β1 , θτ )] = 0, we can write
⎡ 1b ⎣ 1 .R n,1 (t, β) = E √n
n
⎤ ηi (t, β, θτ )⎦ =
√
n E ηi (t, β, θτ )
i=1
(1) √ θτ ) qτ (Xi , θτ ) I(β Xi ≤ t) = n E ψτ Yi − qτ (Xi , √ (1) = n E E ψτ Yi − qτ (Xi , θτ ) X qτ (Xi , θτ ) I(β Xi ≤ t) . For simplicity, let us denote by .l(x, θ)
1 = qτ (x, θ) − qτ (x, θτ ) − √ h(x) n
and by Fεi |Xi and fεi |Xi the corresponding conditional distribution and density functions associated with εi . Then .E
ψτ Yi − qτ (Xi , θτ ) X = E ψτ εi − l(Xi , θτ ) X = E τ − I εi < l(Xi , θ τ ) X θτ )) = Fεi |Xi (0) − Fεi |Xi (l(Xi , θτ )) = τ − Fεi |Xi (l(Xi ,
1 θτ ) − qτ (Xi , θτ ) − √ h(Xi ) = −fεi |Xi (ξi,1 ) qτ (Xi , n
1 (1) θτ − θτ − √ h(Xi ) = −fεi |Xi (ξi,1 ) qτ (Xi , ξ2 ) n
1 (1) θτ − θτ − √ h(Xi ) + op θτ − θτ , = −fεi |Xi (0) qτ (Xi , θτ ) n
110
M. Conde-Amboage et al.
√ i ) and 0, and ξ2 between θτ where ξi,1 is an element between qτ (Xi , θτ ) − qτ (Xi , θτ ) − h(X n and θτ . Then, together with (5.10), it follows that 1b
.Rn,1 (t, β)
√ θτ − θτ E fε|X (0) qτ(1) (X, θτ ) qτ(1) (X, θτ ) I(β X ≤ t) =− n
√ + E fε|X (0) h(X) qτ(1) (X, θτ ) I(β X ≤ t) + op n θτ − θτ $ % n
1 = − √ −1 ξi ( θτ ) + −1 E fε|X (0) h(X) qτ(1) (X, θτ ) + op (1) n i=1
× E fε|X (0) qτ(1) (X, θτ ) qτ(1) (X, θτ ) I(β X ≤ t)
√ θτ − θτ + E fε|X (0) h(X) qτ(1) (X, θτ ) I(β X ≤ t) + op n n
1 = − √ −1 (t, β) ξi ( θτ ) − −1 (t, β) E fε|X (0) h(X) qτ(1) (X, θτ ) n i=1
+ E fε|X (0) h(X) qτ(1) (X, θτ ) I(β X ≤ t) + op (1) 1 under the alternative from which the asymptotic behaviour of the empirical process Rn,1 hypothesis follows.
Note that if we will consider a model under the composite null hypothesis, the 1 allows the following representation empirical process Rn,1 n 1 −1 (t, β) 1 0 ξi (θτ ) + op (1), n,1 (t, β) = Rn,1 (t, β) − √n i=1
.R
where the second term represents the estimation effect. Furthermore, as a corollary 1 and the test of Theorem 3, the asymptotic distribution of the empirical process Rn,1 statistic Tn,1 can be derived.
References 1. Escanciano, J.C. (2006). A consistent diagnostic test for regression models using projections. Econometric Theory, 22, 1030–1051. 2. Conde-Amboage, M., Van Keilegom, I. and González-Manteiga, W. (2021). A new lack-of-fit test for quantile regression with censored data. Scandinavian Journal of Statistics, 48, 655– 688. 3. Orbe, J. and Núñez-Antón, V. (2013). Confidence Intervals on Regression Models with Censored Data. Communications in Statistics-Simulation and Computation, 42, 2140–2159. 4. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986). Robust statistics: The approach based on influence functions. Wiley. 5. Huber, P. (1981). Robust statistics. Wiley. 6. Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica, 46, 33–50.
5 Application of Quantile Regression Models for Biomedical Data
111
7. Koenker, R. (2005). Quantile regression. Cambridge University Press. 8. Yu, D., Kong, L. and Mizera, I. (2016). Partial functional linear quantile regression for neuroimaging data analysis. Neurocomputing, 195, 74–87. 9. Huang, Q., Zhang, H., Chen, J., and He, M. (2017). Quantile Regression Models and Their Applications: A Review. Journal of Biometrics & Biostatistics, 8, 1–6. 10. Staffa, S. J., Kohane, D. S., and Zurakowski, D. (2019). Quantile regression and its applications: a primer for anesthesiologists. Anesthesia & Analgesia, 128, 820–830. 11. Mazucheli, J., Alves, B., Menezes, A. F., and Leiva, V. (2022). An overview on parametric quantile regression models and their computational implementation with applications to biomedical problems including COVID-19 data. Computer Methods and Programs in Biomedicine, 106816. 12. Zheng, J. X. (1998). A consistent nonparametric test of parametric regression models under conditional quantile restrictions. Econometric Theory, 14, 123–138. 13. He, X. and Zhu, L.X. (2003). A lack-of-fit test for quantile regression. Journal of the American Statistical Association, 98, 1013–1022. 14. Horowitz, J.L. and Spokoiny, V.G. (2002). An adaptive, rate-optimal test of linearity for median regression models. Journal of the American Statistical Association, 97, 822–835. 15. Mammen, E., Van Keilegom, I. and Yu, K. (2019). Expansion for moments of regression quantiles with applications to nonparametric testing. Bernoulli, 25, 793–827. 16. Escanciano, J.C. and Goh, S.C. (2014). Specification analysis of linear quantile models. Journal of Econometrics, 178, 495–507. 17. Escanciano, J.C. and Velasco, C. (2010). Specification tests of parametric dynamic conditional quantiles. Journal of Econometrics, 159, 209–221. 18. Dette, H., Guhlich, M. and Neumeyer, N. (2015). Testing for additivity in nonparametric quantile regression. Annals of the Institute of Statistical Mathematics, 67, 437–477. 19. Powell, J. (1986). Censored Regression Quantiles. Journal of Econometrics, 32, 143–155. 20. Portnoy, S. (2003). Censored Quantile Regression. Journal of the American Statistical Association, 98, 1001–1012. 21. Peng, L. and Huang, Y. (2008). Survival Analysis with Quantile Regression Models. Journal of the American Statistical Association, 103, 637–649. 22. Stute, W., González-Manteiga, W. and Sánchez-Sellero, C. (2000). Nonparametric model checks in censored regression. Communications in Statistics: Theory and Methods, 29, 1611– 1629. 23. Peng, L. (2021). Quantile regression for survival data. Annual review of statistics and its application, 8, 413–437. 24. Portnoy, S. and Lin, G. (2010). Asymptotics for censored regression quantiles. Journal of Nonparametric Statistics, 22, 115–130. 25. Peng, L. (2012). A note on self-consistent estimation of censored regression quantiles. Journal of Multivariate Analysis, 105, 368–379. 26. Yang, X., Narisetty, N.N. and He, X. (2018). A new approach to censored quantile regression estimation. Journal of Computational and Graphical Statistics, 27, 417–425. 27. De Backer, M., El Ghouch, A. and Van Keilegom, I. (2019). An adapted loss function for censored quantile regression. Journal of the American Statistical Association, 114, 1126–1137. 28. De Backer, M., El Ghouch, A. and Van Keilegom, I. (2020). Linear censored quantile regression: A novel minimum-distance approach. Scandinavian Journal of Statistics, 47, 1275– 1306. 29. Wei, B. (2022). Quantile regression for censored data in haematopoietic cell transplant research. 
Bone Marrow Transplantation, 57, 853–856. 30. Yu, T., Xiang, L. and Wang, H. J. (2021). Quantile regression for survival data with covariates subject to detection limits. Biometrics, 77, 610–621. 31. Wang, H. J., and Wang, L. (2014). Quantile regression analysis of length biased survival data. Stat, 3, 31–47. 32. Li, R. and Peng, L. (2017). Assessing quantile prediction with censored quantile regression models. Biometrics, 73, 517–528.
112
M. Conde-Amboage et al.
33. Lee, M. and Kong, L. (2014). Quantile regression for longitudinal biomarker data subject to left censoring and dropouts.Communications in Statistics-Theory and Methods, 43, 4628–4641. 34. Wang, L. (2008). Nonparametric test for checking lack of fit of the quantile regression model under random censoring. Canadian Journal of Statistics, 36, 321–336. 35. Chen, C.H., Li, K.C. and Wang, J.L. (1999). Dimension reduction for censored regression data. Annals of Statistics, 27, 1–23. 36. Shows, J.H., Lu, W. and Zhang, H.H. (2010). Sparse estimation and inference for censored median regression. Journal of Statistical Planning and Inference, 140, 1903–1917. 37. Xia, Y., Zhang, D. and Xu, J. (2010). Dimension reduction and semiparametric estimation of survival models. Journal of the American Statistical Association, 105, 278–290. 38. Zheng, Q., Peng, L. and He, X. (2018). High dimensional censored quantile regression. Annals of Statistics, 46, 308–343. 39. Fei, Z., Zheng, Q., Hong, H.G. and Li, Y. (2021). Inference for high-dimensional censored quantile regression. Journal of the American Statistical Association, 1–15. 40. Wilcox, R. (2008). Quantile regression: a simplified approach to a goodness-of-fit test. Journal of Data Science, 6, 547–556. 41. Conde-Amboage, M., Sánchez-Sellero, C. and González-Manteiga, W. (2015). A lack-of-fit test for quantile regression models with high-dimensional covariates. Computational Statistics and Data Analysis, 88, 128–138. 42. Maistre, S., Lavergne, P. and Patilea, V. (2017). Powerful nonparametric checks for quantile regression. Journal of Statistical Planning and Inference, 180, 13–29. 43. Li, W. and Patilea, V. (2018). A dimension reduction approach for conditional Kaplan-Meier estimators. TEST, 27. 44. Li, G., and Datta, S. (2001). A bootstrap approach to nonparametric regression for right censored data. Annals of the Institute of Statistical Mathematics, 53, 708–729. 45. Stute, W., Xu, W.L. and Zhu, L.X. (2008). Model diagnosis for parametric regression in highdimensional spaces. Biometrika, 95, 451–467. 46. Stute, W. (1997). Nonparametric model checks for regression. Annals of Statistics, 25, 613– 641. 47. Zheng, J. X. (1996). A consistent test of functional form via nonparametric estimation techniques. Journal of Econometrics, 75, 263–289. 48. Lavergne, P. and Patilea, V. (2008). Breaking the curse of dimensionality in nonparametric testing. Journal of Econometrics, 143, 103–122. 49. Groeneboom, P. and Jongbloed, G. (2015). Nonparametric confidence intervals for monotone functions. Annals of Statistics, 43, 2019–2054. 50. Lopuhaä, H.P. and Musta, E. (2017). Isotonized smooth estimators of a monotone baseline hazard in the Cox model. Journal of Statistical Planning and Inference, 191, 43–67. 51. Hosmer, D.W., Lemeshow, S. and May, S. (2008). Applied Survival Analysis: Regression Modeling of Time-to-Event Data. John Wiley & Sons. 52. Yang, S.J., El Ghouch, A. and Van Keilegom, I. (2014). Varying coefficient models having different smoothing variables with randomly censored data. Electronic Journal of Statistics, 8, 226–252. 53. Geerdens, C., Janssen, P. and Van Keilegom, I. (2020). Goodness-of-fit test for a parametric survival function with cure fraction. TEST, 29, 768–792. 54. Conde-Amboage, M. (2017). Statistical Inference in Quantile Regression Models. (Universidade de Santiago de Compostela. (PhD dissertation). http://hdl.handle.net/10347/15424. 55. Sánchez-Sellero, C. (2001). Inferencia Estadística en datos con censura y/o truncamiento. 
Universidade de Santiago de Compostela. (PhD dissertation). 56. Van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer. 57. Wellner, J.A. (2005). Empirical processes: Theory and applications. Notes for a course given at Delft University of Technology. 58. Van der Vaart, A.W. (2000). Asymptotic Statistics. Cambridge University Press.
5 Application of Quantile Regression Models for Biomedical Data
113
59. Stute, W. (1995). The central limit theorem under random censorship. Annals of Statistics, 23, 461–471. 60. Stute, W. (1996). Distributional convergence under random censorship when covariables are present. Scandinavian Journal of Statistics, 23, 461–471.
Chapter 6
Advances in Cytometry Gating Based on Statistical Distances and Dissimilarities Hristo Inouzhe
Abstract In this chapter, we overview some recent and relevant applications of discrepancy measures (distances and dissimilarities) between statistical objects (such as random variables, probability distributions, samples) in the field of cytometry gating. Cytometry gating identifies cell populations in cytometry datasets, i.e., finds groups in multidimensional measurements of (hundreds of) thousands of single cells. From a statistical perspective, cytometry gating is a classification problem, and hence the applicable methods can be unsupervised, supervised, or semi-supervised. Since substantial amounts of variability are unavoidable in biological data, crucial tasks to help classification are establishing similarity between entire (or parts of) cytometry datasets and finding transformations between datasets that are optimal in some sense. A powerful approach to establish similarity between cytometry datasets is to model them as statistical objects and to use some distance or dissimilarity such as the Wasserstein distance, maximum mean discrepancy, and some f -divergence such as Kullback–Leibler or Hellinger or some statistic such as Friedman–Rafsky. We briefly overview the previous discrepancy measures and present how they are (or can be) used for grouping cytometry datasets, for producing templates from a group of datasets, or for interpolation between datasets. We provide instructive examples and further sources of information. The code for generating all figures is freely available at https://github.com/HristoInouzhe/Gating-with-Statistical-Distances. Keywords Flow cytometry gating · Optimal transport · Statistical distances
6.1 Introduction Cytometry is concerned with the measurement (“metry”) of physical, chemical, and other properties of a cell (“cyto”) and, therefore, offers a characterization
H. Inouzhe () Basque Center for Applied Mathematics, Bilbao, Spain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 Y. Larriba (ed.), Statistical Methods at the Forefront of Biomedical Advances, https://doi.org/10.1007/978-3-031-32729-2_6
115
116
H. Inouzhe
of biological samples at the single-cell level. Cytometry has diverse and relevant applications in clinical and research immunology and oncology, for example, for diagnosing a range of hematologic (blood) cancers and diseases such as AIDS. Two prevalent techniques for characterizing single cells are flow cytometry (FC) and time of flight (mass) cytometry (CyTOF). In FC, cells tagged with fluorescent antibodies are exposed to lasers with different wavelengths, and the resulting light spectrum is measured and used to characterize a cell. An extensive and comprehensible description of flow cytometry and its applications can be found in [1]. In CyTOF, cells tagged with heavy metal isotope-coupled antibodies go through a mass spectrometer, and the resulting mass spectrum is used for the characterization of a cell. An up-to-date review on CyTOF and its applications can be found in [2]. Currently, the number of different antibodies, also known as markers, used to characterize a cell with CyTOF can get as high as 100, while for FC it can be close to 50. The number of cells that can be characterized in a single sample goes from around hundred thousand to more than several millions. Hence, cytometry data, obtained after appropriate measurement and preprocessing, belong to high-dimensional spaces with big sample sizes. Throughout this chapter, we will use cytometry data, cytometric data, cytometric datasets, and cytometries interchangeably. A crucial task in cytometry data analysis is to identify different cell populations, which amounts to discovering groups of cells that display some significant differences in one or a group of the measured markers. A reference for standardized cell types is [3]. This allows a variety of applications, for example, the cell types and their relative proportions, identified in each cytometry dataset corresponding to a different blood sample in a study, can be used to characterize an immune system reaction or an illness. The standard way of identifying cell populations is called manual gating. Examples of manual gating can be found in Figure 1 in [4] and Figure 169 in [1]. It consists of an expert selecting a pair of markers, then, in the corresponding bi-dimensional projection the expert selects a region where cells inside it are further inspected. That is, a new pair of markers is selected, the cells inside the previous region are represented in the corresponding bi-dimensional plot, and the expert selects a new region for further inspection. This continues until the cells inside the region of interest are considered to be of the same type. Hence, a single cell type is obtained by defining a sequence (hierarchy) of pairs of markers with corresponding regions of interest (also known as gates). Different cell types are defined by a different sequence of pairs of markers and the corresponding gates. This algorithmic procedure allows to identify the cell populations of interest or to discover new ones in a cytometry dataset. Manual gating has been extremely successful, but it presents several drawbacks [1, 4, 5]. Firstly, it is subjective and time-consuming. The selected hierarchy of pairs of markers and the corresponding gates depend on the knowledge and dedication of the expert. Hence, reproducibility of results between different experts may be low. The bi-dimensional inspection can be very time-consuming when the number of markers (the dimension of the space) is high (for example, around 30). 
In consequence, the time required for an expert or a small group of experts to annotate
6 Advances in Cytometry Gating Based on Statistical Distances and Dissimilarities
117
(label, gate) hundreds of high-dimensional cytometry datasets is a major bottleneck for modern studies. Secondly, high-dimensional information is lost, due to the sequential exploration of bi-dimensional projections, which makes it impossible to find and use intricate correlations between multiple markers as a criterion for defining a cell type. To address some of these limitations, automated or semiautomated gating has been introduced using powerful tools based on the interplay of statistics and computer science commonly known as Machine Learning (ML) (see [5, 6]). The main applications of ML to cytometry gating originate from the tools developed for classification tasks and can be divided into three broad categories: unsupervised, supervised, and semi-supervised. Unsupervised techniques try to extract structure from the raw data without requiring knowledge of any ground truth. The primary tool to consider is cluster analysis or clustering, where data are divided into groups (clusters) where elements in the same group are more similar to each other, in some predefined way, than members of different groups. Clustering algorithms can be split into partitioning and agglomerative ones. Partitioning algorithms try to divide the data into a number of clusters such that an optimality criterion is fulfilled, where k-means is the best known and most widely used. Agglomerative algorithms start with single observations and merge them into clusters according to some dissimilarity criteria; hierarchical clustering is probably the most popular example. For readers interested in the topic, a good source is [7]. Clustering is applied to cytometry data as a way of discovering cell populations in high-dimensional spaces in an automatic, time-efficient, unbiased, and reproducible manner. Typically, after clustering a cytometry dataset, clusters are assigned, by an expert or another algorithm, to previously known cell populations or are considered as a new cell population. When using supervised learning, the approach to cytometry gating is fundamentally different to the unsupervised case. The task is to automatically learn a gating strategy from previously manually or otherwise gated cytometry datasets to gate a new ungated one. In this case, contrary to the unsupervised setting, the algorithm directly assigns each cell in an unlabeled cytometry dataset to a specific cell population. The available tools are many, and we highlight quadratic discriminant analysis, tree-based methods as random forest or any approach based on neural networks. It is out of the scope of this chapter to present in detail the many tools of supervised learning, so we refer the interested reader to [8, 9]. The main point of supervised learning applied to cytometry data is to use high-quality historical data, i.e., previously expertly manually gated cytometry datasets, to gate a new ungated cytometry dataset in a time-efficient way which uses intricate and high-dimensional correlations between markers. The clinical setting, where well-established protocols lead to good historical data, is especially well suited for supervised methods. Semi-supervised learning can be considered a mixture between the previous settings where an unsupervised method requires some input from a human or a
118
H. Inouzhe
previous example for the gating task. This is a fairly common paradigm in cytometry gating since it offers a good trade-off between time efficiency and previously available or expert information. Examples of such applications can be found in [10, 11]. One of the main challenges with automatic or semi-automatic gating is the huge variability present in cytometric data, which ensues from a diversity of sources (see [12]). There is a natural biological variability, for example, the cytometry data from blood samples of the same individual in the same conditions measured on the same flow cytometer may present non-negligible differences. A technical source of variability, commonly referred to as batch effect, corresponds to measurements in different conditions (different days, locations, temperature, pressure, etc.), with different machines, with the same machine but different settings, with different staining antibodies, and so on. Another prominent source of variability comes from experts having different criteria for gating, i.e., different sequence of pairs of markers and gates, and the different level of completeness in a gated cytometry dataset, and it is common to gate only some cell populations of interest leaving the rest ungated. Therefore, any automatic gating method that deals with cytometry data from a variety of measurements must be robust to the previous types of variability while also correctly detecting meaningful variability coming from the natural response of the immune system, a cell population characteristic of a disease, a vaccine effect, and many others. To address the previous difficulties and aid the automatic gating workflow, a successful strategy has been to quantify variability of cytometric data (see, for example, [11, 13, 14]). Such variability quantification has been based on mathematical tools that measure the difference between raw or gated cytometry datasets. In essence, when the signal of interest is stronger than natural biological variability or batch effects, one expects higher values for the measure of variability. Hence, with an appropriate measure of difference or similarity between cytometric data, one can detect meaningful and meaningless variability, and this can guide the learning strategy for automated gating. Additionally, the possibility of establishing similarity between cytometry datasets allows for matching and alignment which are of common use in gating workflows. Matching refers to the problem of how to optimally assign a group of cytometry datasets to another group of cytometry datasets, while alignment (interpolation) refers to the problem of transforming, in some predefined way, one cytometry dataset into another cytometry dataset. The aim of this chapter is to introduce the reader to some of the main aspects of the cytometry gating workflow where statistical dissimilarities and distances, that is, measures of discrepancy (difference) between statistical objects, are used. Our goal is to provide some basic notions, while the interested reader can find much more details in the references. We presuppose that the reader has basic notions of probability, if it is not the case we refer to the chapters on random variables, integration (expectation), and joint distributions in the introductory books [15, 16]. The first section of this chapter is dedicated to introducing the mathematical
6 Advances in Cytometry Gating Based on Statistical Distances and Dissimilarities
119
modelling of cytometric data and to presenting some popular statistical discrepancy measures used in automated flow cytometry gating. In the second section, we present how statistical measures of dissimilarity can be used in the gating workflow, particularly: for grouping cytometric datasets, for producing template cytometry data, and for interpolation between cytometries. We finish this chapter with some brief concluding remarks.
6.2 Dissimilarities and Distances In this section, the mathematical formalism for dealing with cytometric data is provided. Firstly, several useful ways of describing a cytometry dataset are presented. Secondly, we provide definitions for the notions of dissimilarity and distance between cytometric data. Finally, some of the most popular dissimilarities and distances used in gating are overviewed. A raw cytometry dataset X can be viewed as a collection of single-cell measurem N ×m , with N ments .X = {xi }N i=1 with .xi ∈ R or equivalently as a matrix .X ∈ R the number of cells in the measured sample and m the used markers. An example of two raw cytometries for two markers can be seen in the top of Fig. 6.1. A cytometric dataset can be viewed as an empirical probability distribution η=
.
N 1 δx , N i
(6.1)
i=1
i.e., a probability distribution giving weights .1/N to each .xi ∈ X, or alternatively, as some probability distribution .ηX estimated from the raw sample X. As was noted in Sect. 6.1, N can be in the millions and m close to hundred, and therefore cytometry data can be considered high-dimensional and sample sizes are not particularly small. A gated cytometry is a cytometry dataset with labels for each cell, i.e., .X˜ = m {(xi , li )}N i=1 with .xi ∈ R and .li ∈ L, where .L is some finite set of labels, usually the names of the cell populations from a manual or supervised gating or the label of the cluster from an unsupervised gating. An example of two gated cytometries can be seen in the bottom of Fig. 6.1, where the set of labels is .L = {1, 2, 3, 4}. A gated cytometry can also be viewed as a collection of probability distributions and some associated weights. Let us say that the number of cell populations or the number of clusters in the gated cytometry .X˜ is K; equivalently, we can write .|L| K= K. Then, one can take .X˜ = {(μk , wk )}K k=1 with weights .wk > 0 such that . k=1 wk = 1 and with .μk representing some probability distribution. Usually, .μk is obtained from a model-based or a non-parametric fit to the cells belonging to the k-th cell population or cluster, while .wk is the relative frequency of the cells in the k-th group with respect to the total amount of cells. For instance, one can fit a multivariate normal distribution to each group in the bottom of Fig. 6.1, and the collection of normal distributions and the relative frequencies of the points in the clusters are a
120
H. Inouzhe
10660 Fig. 6.1 Top: Two ungated cytometries .X1 = {x1,i }43156 i=1 and .X2 = {x2,i }i=1 with .X1 , X2 ⊂ R2 . Middle: Non-parametric density estimation with kernel smoothing of .X1 and .X2 . Bottom: Unsupervised gating of .X1 and .X2 , which we denote .X˜ 1 and .X˜ 2 , with a clustering method called tclust (see [17]) looking for four different cell types
6 Advances in Cytometry Gating Based on Statistical Distances and Dissimilarities
121
good representation of the cytometries at hand. Notice that the different ways of modelling cytometry data can be suitable for different purposes. Equipped with a formal definition of cytometry data, we can provide tools for comparing different datasets. A dissimilarity (discrepancy) between two cytometry datasets X and Y is a measure of how different the two objects are, with values close to zero if the two datasets are similar and high values if they are very different. Exactly the same ˜ Y˜ which we omit for simplicity concept can be applied to the gated versions .X, of exposition. Formally, a dissimilarity d is a symmetric divergence, and hence it fulfils: 1. .d(X, Y ) ≥ 0 for any two cytometry datasets X and Y and .d(X, Y ) = 0 if and only if the datasets are the same, .X = Y (definition of divergence). 2. .d(X, Y ) = d(Y, X) for any .X, Y (symmetry). A distance (or metric) between cytometry datasets, d, is a dissimilarity with an additional property: 3. .d(X, Z) ≤ d(X, Y ) + d(Y, Z) for any cytometry datasets .X, Y, Z (triangle inequality). A question that arises naturally is what formalism to use for cytometry data and what is an appropriate dissimilarity or distance for different contexts. These questions are not independent, and in the next sections, we briefly present some of the most useful dissimilarities and their applications. In Table 6.1, one can find a summary with some general information about the discrepancy measures introduced in the next sections.
6.2.1 Wasserstein Distance Wasserstein distance, sometimes referred to as the Earth mover’s distance, is a distance between probability distributions which is well suited for cytometric data since it is robust against small translations and small changes in probability [18]. Even more, it can handle distributions with non-overlapping supports, which is a typical situation in cytometry datasets. Broadly speaking, the Wasserstein distance measures the cost of optimally transporting one distribution into the other. The following are good references for the theory of optimal transport [19] and for the theory and computation [20]. Definition 1 (Optimal Transport Cost) Let X and Y be random variables on the spaces .X and .Y where .law(X) = μ .(X ∼ μ) and .law(Y ) = ν .(Y ∼ ν). Let .π(X × Y) be the space of all joint probability measures on the product space .X × Y with first marginal .μ and second marginal .ν, i.e., such that for any joint probability .π ∈ π(X × Y), . dπ(x, y) = μ and . Y X dπ(x, y) = ν. Finally, let .c(x, y) be a cost function representing the cost of transporting a unit of probability mass from
Type Distance
Distance
Dissimilarity
Distance Dissimilarity
Wasserstein
Maximum mean discrepancy
Symmetric Kullback-Leibler
Hellinger Friedman-Rafsky
Discrete, continuous Discrete
Discrete, continuous
Discrete, continuous
Distributions Discrete, continuous
Yes No
Yes
No
Density estimation No
No No
No
Yes
Efficient barycenter computation Yes
Software R: transport Python: POT R: kernlab Python: Easy to implement Easy to implement Easy to implement Python: PyTorch R (Bioconductor): flowMap
Table 6.1 Summary of the statistical measures of discrepancy between probability distributions presented in this chapter. Type indicates if one is dealing with a distance or a dissimilarity as defined in the beginning of Sect. 6.2. Distributions indicates if both discrete and continuous probability distributions can be handled. Density estimation refers to the need of density estimation to compute the respective measure of discrepancy in the cytometry gating setting. The fourth column presents if barycenters (also known as Frechet means) can be computed in an efficient way. The last column presents some freely available software
122 H. Inouzhe
6 Advances in Cytometry Gating Based on Statistical Distances and Dissimilarities
123
x to y. The optimal transport (OT) cost is defined as the solution of the following optimal transport problem: OTc (μ, ν) =
.
min
π ∈π(X ×Y ) X ×Y
c(x, y)dπ(x, y)
= min {E(X,Y ) c(X, Y ) : law(X) = μ, law(Y ) = ν}.
(6.2)
(X,Y )
Notice that for discrete measures, that is, when .μ and .ν give masses to finite collections of points .xi , . . . , xn and .y1 , . . . , yn , respectively, the OT cost (6.2) becomes
OTc (μ, ν) =
.
min
n n
π ∈π(X ×Y )
(6.3)
πi,j c(xi , yj ),
i=1 j =1
where $\Pi(\mathcal{X} \times \mathcal{Y})$ is the set of $n \times n'$ matrices with row sums equal to $(\mu(x_1), \ldots, \mu(x_n))$ and column sums equal to $(\nu(y_1), \ldots, \nu(y_{n'}))$. Let us stress that the discrete OT problem can be viewed as a soft assignment problem, where the mass of an origin point $x_i$ is assigned in different proportions to the corresponding $y_j$ points. This is in contrast with a hard assignment, where origin points are assigned exclusively to one destination point. When $n = n'$, the OT problem is equivalent to a hard assignment problem, i.e., a bijection between the sets $\{x_i\}_{i=1}^n$ and $\{y_j\}_{j=1}^{n'}$. The $p$-Wasserstein distance is an optimal transport cost where the cost function is a $p$-power of the usual Euclidean distance.

Definition 2 ($p$-Wasserstein Distance) Let $\mathcal{X} = \mathcal{Y} \subseteq \mathbb{R}^m$, $\mathbb{E}\|X\|^p, \mathbb{E}\|Y\|^p < \infty$ (finite $p$-moments), and $c(x, y) = \|x - y\|^p$, with $\|\cdot\|$ representing the usual Euclidean norm. Then, the $p$-Wasserstein distance is defined as

$$d_{W_p}(\mu, \nu) = \left( \min_{\pi \in \Pi(\mathcal{X}\times\mathcal{X})} \int_{\mathcal{X}\times\mathcal{X}} \|x - y\|^p\, d\pi(x, y) \right)^{1/p} = \left( \min_{(X,Y)} \left\{ \mathbb{E}_{(X,Y)} \|X - Y\|^p : \mathrm{law}(X) = \mu,\ \mathrm{law}(Y) = \nu \right\} \right)^{1/p}. \qquad (6.4)$$
Associated with $d_{W_p}(\mu, \nu)$, under some mild conditions, there is a (not necessarily unique) optimal coupling $T$, which encodes an optimal way of transforming $\mu$ into $\nu$. For example, when both probabilities are discrete, $T$ is a matching indicating how much mass to send from each origin point to its corresponding matched points in the destination. A particularly important case is the 2-Wasserstein distance, where, when at least one of the probability distributions has a density, there is a unique optimal map $T$ such that

$$d_{W_2}^2(\mu, \nu) = \int_{\mathcal{X}} \|x - T(x)\|^2\, d\mu(x).$$
Intuitively, a coupling $T$ provides a means to do interpolation (alignment) between probability distributions, and this interpolation is unique when at least one of the probabilities has a density and $T$ is obtained through the 2-Wasserstein distance. A very attractive property of the Wasserstein distance is that it makes it possible to produce a "sort of average," which respects geometrical properties of the underlying data. In applications to gating, this means that it is possible to obtain a template from a group of cytometry datasets. The tool we refer to is the Wasserstein barycenter.

Definition 3 (Wasserstein Barycenter) Let $\mu_1, \ldots, \mu_n$ be a set of probability distributions belonging to $\mathcal{P}_p(\mathbb{R}^m)$, the set of probability distributions with finite $p$-moment. Let $\{w_i\}_{i=1}^n$ be weights such that $\sum_{i=1}^n w_i = 1$. Then, the $p$-Wasserstein barycenter is the measure $\mu^*$ that solves the following optimization problem:

$$\mu^*_{W_p} = \operatorname{argmin}_{\mu \in \mathcal{P}_p(\mathbb{R}^m)} \sum_{i=1}^{n} w_i\, d_{W_p}^p(\mu, \mu_i). \qquad (6.5)$$
In the case of the 2-Wasserstein barycenter, if one of the probability distributions $\mu_i$ has a density, then the barycenter is unique. Furthermore, if all $\mu_i$ have densities, there are unique optimal interpolations between each of them and the barycenter. For practical purposes, it is essential to be able to compute Wasserstein distances and barycenters efficiently. The task is relatively simple when dealing with location-scatter families such as the (multivariate) normal distribution, but for more general distributions there are only approximate algorithms. Furthermore, to improve efficiency, it is common to use (entropically) regularized versions of the Wasserstein distance, which are only approximations of it. For details on computation, we refer to [20]. Nevertheless, these approximate algorithms enable many real-world applications, and in particular the application of the Wasserstein distance to different stages of cytometry gating workflows.
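As a concrete illustration, the sketch below computes the 2-Wasserstein distance (6.4) between two small synthetic samples with the POT library listed in Table 6.1, both exactly and through its entropically regularized approximation; the sample sizes, distribution parameters, and regularization strength are illustrative choices, not values taken from this chapter.

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(300, 2))  # sample from mu
Y = rng.normal(loc=2.0, scale=1.0, size=(400, 2))  # sample from nu

a = np.full(len(X), 1.0 / len(X))  # uniform weights: empirical measures
b = np.full(len(Y), 1.0 / len(Y))
M = ot.dist(X, Y)  # pairwise squared Euclidean costs, c(x, y) = ||x - y||^2

w2_exact = np.sqrt(ot.emd2(a, b, M))                   # exact discrete OT, Eq. (6.3)
w2_sinkhorn = np.sqrt(ot.sinkhorn2(a, b, M, reg=1.0))  # entropic approximation
print(w2_exact, w2_sinkhorn)
```

The regularized solver trades a small bias for a large speed-up, which matches the efficiency remarks above.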
6.2.2 Maximum Mean Discrepancy

The maximum mean discrepancy (MMD) is a popular measure of difference between probability distributions. It is of great interest since it allows the use of many standard ML algorithms, such as Support Vector Machines and Gaussian Process Regression, with probability distributions. For the interested reader, extensive details can be found in [21].

Definition 4 (Maximum Mean Discrepancy) Let $\mu$ and $\nu$ be probability distributions on $\mathcal{X}$, and let $\mathcal{F}$ be a class of real-valued functions defined on $\mathcal{X}$. The MMD with respect to $\mathcal{F}$ is defined as

$$\mathrm{MMD}(\mu, \nu, \mathcal{F}) = \sup_{f \in \mathcal{F}} \left( \int_{\mathcal{X}} f(x)\, d\mu(x) - \int_{\mathcal{X}} f(x)\, d\nu(x) \right) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{X\sim\mu} f(X) - \mathbb{E}_{X\sim\nu} f(X) \right). \qquad (6.6)$$
Let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS) with kernel $\kappa$, i.e., $\mathcal{H} = \mathrm{closure}(\mathrm{span}\{\kappa(x, \cdot) : x \in \mathcal{X}\})$ and $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric and positive definite function. Then (6.6) becomes

$$\begin{aligned} \mathrm{MMD}^2(\mu, \nu, \mathcal{H}) &= \int_{\mathcal{X}\times\mathcal{X}} \kappa(x, x')\, d\mu(x)\, d\mu(x') - 2 \int_{\mathcal{X}\times\mathcal{X}} \kappa(x, y)\, d\mu(x)\, d\nu(y) + \int_{\mathcal{X}\times\mathcal{X}} \kappa(y, y')\, d\nu(y)\, d\nu(y') \\ &= \mathbb{E}_{X,X'\sim\mu}\, \kappa(X, X') - 2\, \mathbb{E}_{X\sim\mu,\, Y\sim\nu}\, \kappa(X, Y) + \mathbb{E}_{Y,Y'\sim\nu}\, \kappa(Y, Y'). \end{aligned} \qquad (6.7)$$

If $\kappa$ is a universal kernel, that is, the corresponding RKHS $\mathcal{H}$ is dense in the space of continuous and bounded functions in $\mathcal{X}$, then

$$\mathrm{MMD}(\mu, \nu, \mathcal{H}) = 0 \quad \text{if and only if} \quad \mu = \nu. \qquad (6.8)$$

Therefore, in that setting, the MMD is a distance, and we will denote it

$$d_{\mathrm{MMD},\mathcal{H}}(\mu, \nu) = \mathrm{MMD}(\mu, \nu, \mathcal{H}). \qquad (6.9)$$
A well-known universal kernel in $\mathbb{R}^m$ is the Gaussian kernel with parameter $\lambda$, given by

$$\kappa(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\lambda^2} \right). \qquad (6.10)$$
We stress that there are other universal kernels in $\mathbb{R}^m$ and that the choice of kernel is non-trivial and problem specific. There is an equivalent barycenter problem for MMD for a set of measures $\{\mu_i\}_{i=1}^n$, which can be written as

$$\mu^*_{\mathrm{MMD},\mathcal{H}} = \operatorname{argmin}_{\mu \in \mathcal{P}(\mathbb{R}^m)} \sum_{i=1}^{n} w_i\, d^2_{\mathrm{MMD},\mathcal{H}}(\mu, \mu_i), \qquad (6.11)$$

where the barycenter has the following expression (see [22]):

$$\mu^*_{\mathrm{MMD},\mathcal{H}} = \sum_{i=1}^{n} w_i\, \mu_i. \qquad (6.12)$$
In the setting of MMDs, optimal interpolation between measures, i.e., following geodesics, is straightforward since a geodesic between $\mu$ and $\nu$ is given by $\mu_t = (1 - t)\mu + t\nu$ for $0 \le t \le 1$. It is worth mentioning that barycenters coming from Eq. (6.5) with $p > 1$, particularly the 2-Wasserstein distance, have significantly different
geometrical properties than barycenters coming from (6.11). A clear example can be seen in the top of Fig. 6.3. Computations of $d_{\mathrm{MMD},\mathcal{H}}$ based on (6.7) are fairly straightforward since for samples $X = \{x_i\}_{i=1}^{n}$ and $Y = \{y_i\}_{i=1}^{n'}$ the following is an unbiased estimator:

$$\hat d^2_{\mathrm{MMD},\mathcal{H},n} = \frac{1}{n(n-1)} \sum_{x \ne x' \in X} \kappa(x, x') - \frac{2}{nn'} \sum_{x \in X,\, y \in Y} \kappa(x, y) + \frac{1}{n'(n'-1)} \sum_{y \ne y' \in Y} \kappa(y, y').$$
The computation of the barycenter in the setting of MMD can be found in [22].
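The estimator above is easy to code directly; the following sketch implements it in plain NumPy with the Gaussian kernel (6.10), where the bandwidth value is an illustrative choice.

```python
import numpy as np

def mmd2_unbiased(X, Y, lam=1.0):
    """Unbiased estimator of MMD^2 with the Gaussian kernel of bandwidth lam."""
    def gram(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * lam ** 2))

    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    # drop diagonal terms so the within-sample sums run over x != x', y != y'
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()
```

For large cytometries, the quadratic memory cost of the Gram matrices is the main limitation, which is one reason batched GPU implementations (e.g., in PyTorch, as listed in Table 6.1) are popular.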
6.2.3 Kullback–Leibler Divergence

The Kullback–Leibler (KL) divergence, which fulfils only the first criterion of a dissimilarity and is also called relative entropy, is a very popular way of measuring the difference between two probability distributions. The KL divergence is probably the best-known representative of a family of difference measures between probability distributions known as $f$-divergences. More information about the topic can be found in Section 8.1 of [20].

Definition 5 (Kullback–Leibler Divergence) For discrete distributions $\mu$ and $\nu$ defined on the same space $\mathcal{X}$, the Kullback–Leibler divergence is defined as

$$\mathrm{KL}(\mu, \nu) = \sum_{x \in \mathcal{X}} \mu(x) \log \frac{\mu(x)}{\nu(x)}. \qquad (6.13)$$
When $\mu$ and $\nu$ have densities $f_\mu$ and $f_\nu$ and are defined on $\mathcal{X} \subseteq \mathbb{R}^m$, the Kullback–Leibler divergence takes the form

$$\mathrm{KL}(\mu, \nu) = \int_{\mathcal{X}} f_\mu(x) \log \frac{f_\mu(x)}{f_\nu(x)}\, dx. \qquad (6.14)$$
It is important to notice that if $\mu$ assigns probability to a set (region, points) where $\nu$ does not assign any probability, then $\mathrm{KL}(\mu, \nu) = \infty$, and if it is the other way around, then $\mathrm{KL}(\nu, \mu) = \infty$. Therefore, the KL divergence is better suited for mutually absolutely continuous measures, that is, for measures that assign zero probability to the same sets. One possible way to obtain a dissimilarity from the KL divergence is the symmetric KL divergence, defined as

$$d_{\mathrm{KL}}(\mu, \nu) = \mathrm{KL}(\mu, \nu) + \mathrm{KL}(\nu, \mu). \qquad (6.15)$$
The computation of the KL divergence in the continuous case requires an estimation of the densities and then a numerical computation of the integrals, which makes it very sensitive to the curse of dimensionality. Additionally, when absolute continuity is not strictly fulfilled, one may need to allow some tolerance for the KL divergence to return meaningful results.
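For binned (histogram) density estimates, the symmetric KL divergence (6.15) with such a tolerance can be sketched as follows; the tolerance value below is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.special import rel_entr  # elementwise p * log(p / q)

def symmetric_kl(p, q, tol=1e-10):
    """Symmetric KL divergence (6.15) between two histograms p and q."""
    p = np.asarray(p, dtype=float) + tol  # tolerance guards empty bins
    q = np.asarray(q, dtype=float) + tol
    p, q = p / p.sum(), q / q.sum()       # renormalize after adding tolerance
    return rel_entr(p, q).sum() + rel_entr(q, p).sum()
```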
6.2.4 Hellinger Distance

The Hellinger distance is another popular measure of the difference between probability distributions. Like the KL divergence, it also belongs to the family of $f$-divergences. It presents some desirable properties; for example, it can be easily used to define kernel functions (see [23]), something that is not the case with the Wasserstein distance or the KL divergence. It is also more computationally amenable than the KL divergence.

Definition 6 (Hellinger Distance) For discrete distributions $\mu$ and $\nu$ defined on the same space $\mathcal{X}$, the Hellinger distance is defined as

$$d_H^2(\mu, \nu) = \frac{1}{2} \sum_{x \in \mathcal{X}} \left( \sqrt{\mu(x)} - \sqrt{\nu(x)} \right)^2. \qquad (6.16)$$

For $\mu$ and $\nu$ with densities $f_\mu, f_\nu$ defined on $\mathcal{X} \subseteq \mathbb{R}^m$, the Hellinger distance takes the form

$$d_H^2(\mu, \nu) = \frac{1}{2} \int_{\mathcal{X}} \left( \sqrt{f_\mu(x)} - \sqrt{f_\nu(x)} \right)^2 dx. \qquad (6.17)$$

Notice that in the continuous case (6.17), the Hellinger distance is the $L^2$ distance between the square roots of the density functions. As for the KL divergence, when applied to the cytometry gating setting, the computation of the Hellinger distance requires estimating densities and then numerically computing an integral. Hence, it presents similar difficulties to the KL divergence, although it handles better probabilities that assign zero measure to different sets.
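A sketch of the discrete version (6.16) on histogram estimates of two one-dimensional samples follows; the number of bins is an illustrative choice and plays the role of the density estimation step mentioned above.

```python
import numpy as np

def hellinger(x, y, bins=50):
    """Hellinger distance (6.16) between histogram estimates of two 1D samples."""
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    p, q = p / p.sum(), q / q.sum()
    return np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())
```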
6.2.5 Friedman–Rafsky Statistic

The Friedman–Rafsky (FR) statistic [24] was conceived as a statistic for testing whether two multivariate samples came from the same distribution. The basic concept is the following: if the two samples come from the same distribution, they should be well
mixed in the space of markers $\mathbb{R}^m$. The technically difficult aspect is how to measure the "mixedness" in space. Broadly speaking, to measure how well mixed two samples $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_{n'}\}$ with $x_i, y_j \in \mathbb{R}^m$ are, one creates a complete graph considering $\{z_k\}_{k=1}^{n+n'} = \{x_1, \ldots, x_n, y_1, \ldots, y_{n'}\}$ as the vertices and the respective Euclidean distance $\|z_k - z_{k'}\|$ as the weight for the edge connecting $z_k$ and $z_{k'}$. From the complete graph, one extracts the minimum spanning tree (MST), which is a subgraph of the complete graph that connects all vertices, without cycles and with the minimum total edge weight. Once the MST is obtained, all its edges connecting points from the two different samples are removed. The number of remaining subgraphs, $r$, is an indication of how well mixed the data are. An insightful example of this procedure can be found in [25]. For example, $r = 2$ means that there was only one edge in the MST connecting the two samples, and this can be interpreted as the samples being not well mixed. If the value of $r$ is high, many edges in the MST were connecting points from the different samples, and this can be interpreted as well-mixedness. The FR statistic compares $r$ to its expected value and normalizes with the standard deviation. The formal definition is the following.

Definition 7 (Friedman–Rafsky Statistic) In the setting and notation of the previous paragraph, let us define $N = n + n'$, and

$$m = \frac{2nn'}{N} + 1, \qquad \sigma^2 = \frac{2nn'}{N(N-1)} \left[ \frac{2nn' - N}{N} + \frac{(c - N + 2)\left( N(N-1) - 4nn' + 2 \right)}{(N-2)(N-3)} \right],$$
where $c$ is the total number of edge pairs sharing common nodes in the MST. Then, the Friedman–Rafsky statistic is defined as

$$\mathrm{FR}(X, Y) = \frac{r - m}{\sigma}. \qquad (6.18)$$
From the FR statistic (6.18), one can define a dissimilarity between two cytometries $X$ and $Y$, or the associated empirical distributions, as

$$d_{FR}(X, Y) = \left| \mathrm{FR}(X, Y) \right|. \qquad (6.19)$$
The main computational challenge for obtaining the FR statistic is the computation of the MST. This can be done with standard tools in popular libraries such as igraph in R and scipy in Python.
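A minimal sketch with scipy, following Definition 7 directly, is given below; it assumes all pairwise distances are strictly positive (scipy's MST routine treats zero entries of a dense matrix as absent edges) and is meant as an illustration rather than an optimized implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

def fr_dissimilarity(X, Y):
    """FR dissimilarity (6.19) between two samples X and Y in R^m."""
    n, n2 = len(X), len(Y)
    N = n + n2
    Z = np.vstack([X, Y])
    mst = minimum_spanning_tree(cdist(Z, Z)).tocoo()
    edges = np.column_stack([mst.row, mst.col])  # the N - 1 MST edges
    # removing the cross-sample edges leaves r = (#cross edges) + 1 subgraphs
    cross = np.sum((edges[:, 0] < n) != (edges[:, 1] < n))
    r = cross + 1
    # c = number of edge pairs sharing a node = sum over nodes of deg choose 2
    deg = np.bincount(edges.ravel(), minlength=N)
    c = (deg * (deg - 1) // 2).sum()
    m = 2 * n * n2 / N + 1
    var = (2 * n * n2 / (N * (N - 1))) * (
        (2 * n * n2 - N) / N
        + (c - N + 2) * (N * (N - 1) - 4 * n * n2 + 2) / ((N - 2) * (N - 3))
    )
    return abs((r - m) / np.sqrt(var))
```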
6.3 Applications to the Gating Workflow

In this section, we present some fundamental applications of statistical distances and dissimilarities to the different gating workflows. In such workflows, the objective is to gate large amounts of cytometric data with no or a minimal amount of expert intervention. As expected, this is where automatic gating is the most useful. Since distances and dissimilarities make it possible to compare cytometric data, their main applications are the following:

• Group cytometry datasets into homogeneous groups, which reduces variability [11, 14, 25]
• Produce templates, through barycenters and other techniques, that can summarize the information in a group of cytometric datasets [13, 14, 26]
• Interpolate between cytometric datasets, allowing for gates in one cytometry dataset to be transferred to another or allowing for mitigation of batch effects [11, 26–29]
6.3.1 Grouping Cytometric Datasets

The idea behind grouping cytometry data is straightforward: since variability is so high, it is useful to form groups of cytometric data where variability is lower and then work on these more homogeneous groups of datasets. Therefore, the objective is to do clustering on cytometric datasets. A simple procedure is the following: for a set of cytometry datasets $\{X_1, \ldots, X_n\}$, gated or ungated, choose a distance or dissimilarity $d$ and produce a distance (dissimilarity) matrix $D_d$ such that $[D_d]_{ij} = d(X_i, X_j)$. Then, one can use $D_d$ for hierarchical clustering, although other clustering options are also possible, and obtain a partition of the cytometry datasets. The main difficulty here is to produce a distance matrix $D_d$ according to how cytometry data are modelled.
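This grouping step can be sketched as follows, with any of the measures of Sect. 6.2 plugged in as `pairwise_distance` (a placeholder name for the chosen $d$); the linkage method and number of groups are illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def group_cytometries(datasets, pairwise_distance, n_groups=2):
    """Hierarchical clustering of cytometry datasets from a distance matrix D_d."""
    n = len(datasets)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = pairwise_distance(datasets[i], datasets[j])
    Z = linkage(squareform(D), method="average")  # agglomerative clustering
    return fcluster(Z, t=n_groups, criterion="maxclust")
```

For example, `group_cytometries(samples, fr_dissimilarity, n_groups=3)` would group a list of samples using the FR dissimilarity sketched earlier.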
6.3.1.1 Ungated Cytometry Datasets

The first case to consider is when data are samples, i.e., $X_i = \{x_{i,1}, \ldots, x_{i,n_i}\} \subset \mathbb{R}^m$ for $1 \le i \le n$, with an empirical probability distribution associated with each $X_i$. This means that suitable candidates for a measure of similarity are the Wasserstein distance (6.4), the maximum mean discrepancy distance (6.7), and the dissimilarity based on the Friedman–Rafsky statistic (6.19). For example, one can consider the cytometry datasets in the top of Fig. 6.1 and compute the respective distance. Depending on the sample size of the cytometries involved, to lower the computational and memory cost of the distance calculation for $d_{W_p}$, (6.4), and $d_{\mathrm{MMD},\mathcal{H}}$, (6.9), multivariate adaptive extensions of histograms can be used (see, for example, [30, 31] and the references therein). Hence, a dataset $X_i$ is
approximated as $\hat X_i = \{(c_j, p_j)\}_{j=1}^{n_i}$, where $c_j \in \mathbb{R}^m$ is a centroid corresponding to the points in hyperbox $j$, $p_j$ is the relative weight of the points in the same box, and $n_i$ is the number of hyperboxes.

[Fig. 6.2 (caption fragment): pairwise distances between the clusters of Cytometry 1 and Cytometry 2 under the 2-Wasserstein distance, the MMD with the vanilla kernel (the usual scalar product), the squared MMD distance (6.7) with the Gaussian kernel (6.10) with $\lambda = 10/\sqrt{2}$, the Hellinger distance (6.17), the symmetric KL divergence (6.15), and the FR dissimilarity (6.19). The plot's interpretation is the following: the black 1 (Cluster 1 in Cytometry 2) over the x-label value 1 (Cluster 1 in Cytometry 1) is the lowest of all the black numbers (rest of the clusters in Cytometry 2) at the same x-label value. Hence, in 2-Wasserstein distance, Cluster 1 in Cytometry 1 is the closest to Cluster 1 in Cytometry 2. Only the 2-Wasserstein distance, the MMD with the vanilla kernel, and the Hellinger distance match Clusters 1, 2, 3, and 4 of Cytometry 1 to Clusters 1, 2, 3, and 4 of Cytometry 2, respectively, which correctly captures which clusters represent the same cell types.]
template. This reduces the number of comparisons required and facilitates assigning a new cytometry to the group of datasets that is most similar to it. From the previous sections, it is clear that a good candidate for a template, but not the only possible one, is the barycenter of a group of cytometric datasets. Below we present some strategies on how to produce template cytometries.
6.3.2.1 Ungated Cytometry Datasets
To obtain a template sample $T = \{t_1, \ldots, t_{n_T}\} \subset \mathbb{R}^m$ from a group of raw cytometry samples $\{X_1, \ldots, X_n\}$ (as in Sect. 6.3.1.1), one can solve a barycenter problem like the ones introduced in Eqs. (6.5) and (6.11). One gets $T$ as a sample from the distribution $\mu^*_{W_p}$ or $\mu^*_{\mathrm{MMD},\mathcal{H}}$, respectively. The main difficulty with this approach is computation. For low-dimensional subspaces of markers, $m \le 3$, and for relatively small cell counts (sample sizes), it can be done in reasonable computation time. However, to tackle more realistic situations, further work in the field of barycenter computation is required. A toy example is given in the top of Fig. 6.3. The synthetic cytometries, i.e., the plotted samples of size 1500, encapsulate the common information present in $X_1$ and $X_2$, which are plotted in the top of Fig. 6.1. These templates, or barycenters, can be used as a representation of the set of cytometric datasets $\{X_1, X_2\}$.
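One way to approximate a 2-Wasserstein template in practice is POT's free-support barycenter solver; the sketch below is one such approximation under illustrative choices for the template size and initialization, not the exact procedure used for Fig. 6.3.

```python
import numpy as np
import ot

def wasserstein_template(samples, n_template=1500, n_iter=50, seed=0):
    """Approximate 2-Wasserstein barycenter (6.5) of a list of samples."""
    rng = np.random.default_rng(seed)
    weights = [np.full(len(X), 1.0 / len(X)) for X in samples]
    pooled = np.vstack(samples)
    # initialize the template support with random points from the pooled data
    init = pooled[rng.choice(len(pooled), size=n_template, replace=False)]
    return ot.lp.free_support_barycenter(samples, weights, init,
                                         numItermax=n_iter)
```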
6.3.2.2 Gated Cytometry Datasets

A workaround to the problem faced in the ungated setting is to work with gated datasets. Since there are many efficient unsupervised gating procedures, this is a viable option. For extensive reviews on such methods, we refer to [4, 6, 33–35]. Hence, we are in the setting of Sect. 6.3.1.2, and there is a collection of cytometry datasets where each individual dataset is modelled as a collection of clusters. Therefore, one has $\{\mathcal{C}_i = \{C_{i,\ell}\}_{\ell \in L_i}\}_{i=1}^n$. This is the setting represented in the bottom of Fig. 6.1, where we have two cytometries, each formed by four clusters labelled from one to four. There are different approaches on how to obtain a template in this setting. One way, which is used in [14], is to pool all clusters together, hence obtaining the set $\{C_{1,\ell_{1,1}}, \ldots, C_{1,\ell_{1,k_1}}, \ldots, C_{n,\ell_{n,1}}, \ldots, C_{n,\ell_{n,k_n}}\} = \{C_{\ell_i}\}_{i=1}^{k_1+\cdots+k_n}$, and to try to group elements in this set. The rationale behind this is that similar clusters, with respect to some dissimilarity (distance) measure, will represent the same, or at least similar, cell types, and hence grouping them together will allow the separation of different cell types. Once different cell types are separated, one can obtain a template for each cell type. The collection of cell type templates is the template for the group of cytometry datasets. Once more, a viable strategy is to obtain a distance matrix $[D_d]_{ij} = d(C_i, C_j)$ and to use hierarchical clustering. As previously, the distance between clusters $d$ can be chosen as in Sect. 6.3.1.2. In the particular case when each cluster is modelled as a member of a location-scale family, there is an efficient extension of the $k$-means algorithm known as $k$-barycenter (for details see [36]), which produces a template for $\{C_{\ell_i}\}_{i=1}^{k_1+\cdots+k_n}$ with $k$ different cell types. Another possible strategy, the one followed in [13], is to start with individual cytometric datasets and produce a template of the two closest ones. Then, since templates can be handled exactly the same as the other cytometric datasets, one can continue merging the two closest cytometric datasets until there is only one
Fig. 6.3 Examples of different templates for a group of cytometric datasets. The goal is to obtain a synthetic cytometry that captures the information of both cytometries in Fig. 6.1. For ease of computation, we select the template to have 1500 cells. Top: Templates obtained from the ungated cytometry datasets $X_1$ and $X_2$ in the top of Fig. 6.1. We see that the most relevant information of both cytometries is well represented, with the 2-Wasserstein barycenter producing some spurious clusters. Bottom: Templates obtained from the gated versions of $X_1$ and $X_2$, $\tilde X_1$ and $\tilde X_2$, plotted in the bottom of Fig. 6.1. We considered clusters that are closest in 2-Wasserstein distance to correspond to the same cell type (see Fig. 6.2) and obtained a barycenter for each cell type. Again, the templates seem to represent the original information well, with the 2-Wasserstein template producing the most homogeneous cell types
last template of the whole group of cytometric datasets. Hence, in this approach it is important to have a distance between gated cytometries such as $d_{\mathrm{sim},c}$ (6.21) or $d_{\mathrm{GEC},c,\lambda}$ (6.22). Notice that the last one comes with a hard assignment, which can be used as a recipe for which clusters to merge together and which ones to leave unmerged. To complete the template production, one needs to describe a method for obtaining a template from the clusters that have been grouped together as representing the same or similar cell types. A straightforward cell type template can be achieved by pooling together all the points of the clusters grouped together. A more sophisticated approach is to solve a barycenter problem. Here, solving the barycenter problem (6.11) to obtain $\mu^*_{\mathrm{MMD},\mathcal{H}}$ will give a template which will be fairly similar to the one obtained by pooling. By solving a Wasserstein barycenter problem (6.5), one can obtain a different result. As mentioned in the previous section, when the space is high dimensional and the involved clusters have hundreds of thousands of points, one may need to fit location-scale models with densities to the clusters and then solve a 1-barycenter problem. In the bottom of Fig. 6.3, we show the templates obtained from grouping together each cluster of $\tilde X_1$ with its closest counterpart in $\tilde X_2$, where the barycenters for each cell type are obtained by a 2-Wasserstein barycenter, an MMD barycenter, or by pooling. We see that the resulting templates are a sensible representation of the information stored in the two original cytometries.
6.3.3 Interpolation Between Cytometry Datasets

The ability to transform one cytometry dataset into another in some controlled fashion is very desirable. Two major consequences are the following: firstly, one can translate gates used to gate one of the cytometric datasets to gate the other one, and secondly, one can transform several cytometry datasets to try to reduce batch effects in a procedure known as normalization. In this section, we describe several methods based on statistical distances.
6.3.3.1 Gate Transportation

Typically, manual gating is a one- or two-dimensional hierarchical procedure, and therefore to use gates from a gated cytometry, one needs to be able to interpolate in one or two dimensions and not in the full space of $m$ markers. This is a considerable reduction in dimension and makes computation far easier. A good tool for interpolation in this setting is the transport map $T$, whenever it exists, associated with the solution of the optimal transport problem (6.4). Hence, in order for the transport maps to exist and be unique, we will assume that we are working with the 2-Wasserstein distance and that the cytometric datasets are samples from probability
distributions with densities. Although transport maps in two dimensions can be used to transport two-dimensional gates from one cytometry dataset to another one, this is far more technical and out of the scope of this chapter. On the other hand, the one-dimensional counterpart provides good intuition and can be broadly applied. The optimal map between two measures $\mu$ and $\nu$ defined on $\mathbb{R}$ is given by

$$T(x) = F_\nu^{-1} \circ F_\mu(x), \qquad (6.23)$$
where $F_\mu$ is the cumulative distribution function (CDF) of $\mu$ and $F_\nu^{-1}$ is the quantile function (QF), also known as the generalized inverse, of $\nu$. For a sample $X = \{x_1, \ldots, x_{n_X}\}$ from $\mu$ and $Y = \{y_1, \ldots, y_{n_Y}\}$ from $\nu$, a plug-in estimator is obtained by

$$T_n(x) = F_{n,\nu}^{-1} \circ F_{n,\mu}(x), \qquad (6.24)$$
where $F_{n,\mu}$ is the empirical CDF associated with the sample $X$ and $F_{n,\nu}^{-1}$ is the empirical QF associated with the sample $Y$. Notice that when dealing with one-dimensional projections, a gate associated with marker $m_i$ is just a value $\theta_i \in \mathbb{R}$. Therefore, the transported version of the gate is $T_n(\theta_i)$, and it is the one to be used for gating $Y$ with respect to marker $m_i$. Examples of this gate transportation can be seen in Fig. 6.4. Let us stress that here the samples are just the one-dimensional projections onto marker $m_i$. We want to point out that this alignment method can replace or be used alongside the alternatives in the workflows presented in [11] and [26]. A different approach, also based on the OT problem and introduced in [27], tries to reweight the learning sample, the one that is gated, in order to minimize the OT cost to the ungated cytometry dataset. The optimal weights can be understood as the relative frequencies of the original gated cell types in the new cytometry dataset. Since, usually, the relative frequencies are relevant for diagnosis, the previous procedure can be good enough in many practical situations. Since the origin cytometry is gated, one can write its empirical distribution function as
$$\eta = \sum_{k=1}^{K} \frac{|C_{1,\ell_k}|}{\sum_{j=1}^{K} |C_{1,\ell_j}|} \left( \sum_{x \in C_{1,\ell_k}} \frac{1}{|C_{1,\ell_k}|}\, \delta_x \right) = \sum_{k=1}^{K} \frac{n_{\ell_k}}{n_1}\, \eta_{\ell_k}, \qquad (6.25)$$
where $|C_{1,\ell_k}| = n_{\ell_k}$ is the number of points that have label $\ell_k$, $\sum_{j=1}^{K} |C_{1,\ell_j}| = n_1$ is the total number of cells in $\mathcal{C}_1$, and $\eta_{\ell_k}$ is the empirical distribution of the points with label $\ell_k$. A reweighting of $\eta$ with weights $w = \{w_k\}_{k=1}^{K}$, with $w_k > 0$ and $\sum_{k=1}^{K} w_k = 1$, is given by

$$\eta(w) = \sum_{k=1}^{K} w_k\, \eta_{\ell_k}. \qquad (6.26)$$
[Fig. 6.4: two one-dimensional density plots, for the markers CD4+CD20:PB-A LOGICAL and CD3:APC-A LOGICAL, each showing the origin cytometry, the destination cytometry, the gate at origin, and the transported gate.]
Fig. 6.4 Examples of gate transportation using the OT map (6.24). In solid black, we have the density estimation of projections onto two markers of an origin cytometry $X$. Some one-dimensional gates are given in dashed black, which separate low and high values in the respective marker. Transported gates, which do not require any human input, are presented in dashed red.
Let us call $\eta' = \sum_{x \in X_2} \frac{1}{n_2}\, \delta_x$ the empirical distribution associated with the ungated cytometry $X_2 = \{x_{2,1}, \ldots, x_{2,n_2}\}$. Then, one has the following minimization problem:

$$w^* = \arg\min_w\, d_{W_2}^2(\eta(w), \eta'). \qquad (6.27)$$
An approximate solution of the regularized version of this problem, i.e., with the entropically regularized Wasserstein distance as an approximation of the Wasserstein distance, can be computed efficiently (see details in [27]), and its solution $w^*$ represents the relative weights in the new cytometry $X_2$ of the cell types present in $\mathcal{C}_1$. It is also possible to obtain a full gating of $X_2$ if one has access to the optimal coupling.
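The one-dimensional plug-in map (6.24) is only a couple of lines of code; the sketch below transports a gate $\theta_i$ from an origin sample to a destination sample, where the function and variable names are illustrative.

```python
import numpy as np

def transport_gate(x_origin, y_destination, gate):
    """Transported gate T_n(gate) via the plug-in OT map of Eq. (6.24)."""
    x_sorted = np.sort(np.asarray(x_origin))
    # empirical CDF of the origin sample evaluated at the gate
    u = np.searchsorted(x_sorted, gate, side="right") / len(x_sorted)
    # empirical quantile function of the destination sample
    return np.quantile(np.asarray(y_destination), np.clip(u, 0.0, 1.0))
```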
6.3.3.2 Reduction of Batch Effects

When the problem of interest is the reduction of batch effects, there are two distinct strategies. One is to look for cell type-dependent normalizations, which assumes that batch effects are not independent from the cell types. Another is to treat batch effects as a common perturbation to the whole dataset and therefore as cell type independent. The main setting is the following: there are batches $\{B_j\}_{j=1}^{N_B}$, where each batch is a collection of cytometric datasets $B_j = \{X_{j,1}, \ldots, X_{j,N_j}\}$ such that $X_{j,i} = \{x_{j,i,1}, \ldots, x_{j,i,n_{j,i}}\} \subset \mathbb{R}^m$. A cell type-dependent normalization may proceed by previously gating the available datasets in an unsupervised fashion and then producing a normalization dependent on the groups (as in [26]), or, alternatively, it can affect only some gates in a gating hierarchy (as in [37]). The main tools required in those settings are the production of an "average" element for a group of 1D samples and the interpolation between two 1D samples. When possible, reducing batch effects is helped if one can have a control sample in each batch (see [26]). Therefore, one can find the empirical quantile function of the barycenter of the 1D projections onto the marker $m_q$ of the control sample in the different batches, which we denote as $F_{n,*,q}^{-1}$. For details on the computation, see Remark 9.6 in [20]. Hence, for each batch $j$, there is a transport map
.
(6.28)
where .Fn,j,q is the empirical CDF associated with the projection onto the marker mq of the control sample in batch j . A normalization of marker .mq for batch j corresponds to applying .Tn,j,q to the .mq projection of the other cytometric datasets of batch j . The full normalization is the resulting data for the correction of all (or some) markers. When normalization is cell type specific, this is done for each cell type and the corresponding markers. We stress that in [26, 37] methods not
.
6 Advances in Cytometry Gating Based on Statistical Distances and Dissimilarities
139
directly related to statistical measures of discrepancy are used. However, adopting the techniques presented in this section is fairly simple. A different approach consists of finding an approximation of a function .g : Rm → Rm such that cytometry datasets in different batches are closer after transforming them by g. This problem can be situated in the fields of domain adaptation of transfer learning in ML (for some comprehensive reviews on the topics see [38, 39]). The idea is to choose a reference cytometry dataset .X∗ , usually the one that will be gated, and try to find a function g that brings all other cytometries closer to .X∗ . This can be done using generative adversarial networks (GANs) where a loss function is based on a distance d between cytometric datasets. Examples of this procedure can be found in [28, 29]. The distance d can be any of the ones we have discussed so far, but usually the most efficient are the Wasserstein and the one based on MMD since they can be computed from samples efficiently.
6.4 Conclusions The main point that the reader should take from this chapter is that working with cytometry datasets as statistical objects, particularly through the lenses of statistical distances, is very helpful in the whole gating workflow. Some popular gating methods can be readily adapted to incorporate (or already do) a measure of discrepancy or an alignment method between cytometry datasets based on the tools discussed in this chapter. Furthermore, in supervised settings, preprocessing steps based on reducing variability have proven effective in improving performance as shown in [14]. Therefore, any practitioner or interested researcher should have at least a basic knowledge of the topics presented in this chapter. This knowledge can be very helpful since many of the discussed topics are very active fields of research and innovation which can have further positive impact in the cytometry gating workflow. Acknowledgments This research was partially supported by project MTM2017-86061-C2-1P (AEI/FEDER, UE), by grant PID2021-128314NB-I00 (MCIN/AEI/10.13039/501100011033/ FEDER, UE), by the Basque Government through the BERC 2018-2021 and the Basque Modelling Task Force (BMTF) program, by the Spanish Ministry of Science and Innovation (MCIN) and AEI (BCAM Severo Ochoa accreditation CEX2021-001142-S), and by grants VA005P17 and VA002G18 (Junta de Castilla y León).
References

1. Cossarizza, A., Chang, H.D., Radbruch, A., Acs, A., Adam, D., Adam-Klages, S., Agace, W.W., Aghaeepour, N., Akdis, M., Allez, M., et al.: Guidelines for the use of flow cytometry and cell sorting in immunological studies. European Journal of Immunology 49(10), 1457–1973 (2019)
2. Iyer, A., Hamers, A.A., Pillai, A.B.: CyTOF® for the masses. Frontiers in Immunology 13, 815828 (2022)
3. Finak, G., Langweiler, M., Jaimes, M., Malek, M., Taghiyar, J., Korin, Y., Raddassi, K., Devine, L., Obermoser, G., Pekalski, M.L., et al.: Standardizing flow cytometry immunophenotyping analysis from the human immunophenotyping consortium. Scientific Reports 6(1), 1–11 (2016)
4. Saeys, Y., Van Gassen, S., Lambrecht, B.N.: Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nature Reviews Immunology 16(7), 449–462 (2016)
5. Hu, Z., Bhattacharya, S., Butte, A.J.: Application of machine learning for cytometry data. Frontiers in Immunology, p. 5703 (2021)
6. Aghaeepour, N., Finak, G., Hoos, H., Mosmann, T.R., Brinkman, R., Gottardo, R., Scheuermann, R.H.: Critical assessment of automated flow cytometry data analysis techniques. Nature Methods 10(3), 228–238 (2013)
7. Hennig, C., Meila, M., Murtagh, F., Rocci, R.: Handbook of Cluster Analysis. CRC Press (2015)
8. Alpaydin, E.: Introduction to Machine Learning. MIT Press (2020)
9. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press (2014)
10. Hu, Z., Jujjavarapu, C., Hughey, J.J., Andorf, S., Lee, H.C., Gherardini, P.F., Spitzer, M.H., Thomas, C.G., Campbell, J., Dunn, P., et al.: MetaCyto: a tool for automated meta-analysis of mass and flow cytometry data. Cell Reports 24(5), 1377–1388 (2018)
11. Lux, M., Brinkman, R.R., Chauve, C., Laing, A., Lorenc, A., Abeler-Dörner, L., Hammer, B.: flowLearn: fast and precise identification and quality checking of cell populations in flow cytometry. Bioinformatics 34(13), 2245–2253 (2018)
12. Maecker, H.T., McCoy, J.P., Nussenblatt, R.: Standardizing immunophenotyping for the human immunology project. Nature Reviews Immunology 12(3), 191–200 (2012)
13. Azad, A., Pyne, S., Pothen, A.: Matching phosphorylation response patterns of antigen-receptor-stimulated T cells via flow cytometry. In: BMC Bioinformatics, vol. 13, pp. 1–8. Springer (2012)
14. Del Barrio, E., Inouzhe, H., Loubes, J.M., Matrán, C., Mayo-Íscar, A.: optimalFlow: optimal transport approach to flow cytometry gating and population matching. BMC Bioinformatics 21(1), 1–25 (2020)
15. Klenke, A.: Probability Theory: A Comprehensive Course. Springer Science & Business Media (2013)
16. Ross, S.M.: A First Course in Probability. Pearson (2014)
17. García-Escudero, L.A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: A general trimming approach to robust cluster analysis. The Annals of Statistics 36(3), 1324–1345 (2008)
18. Orlova, D.Y., Zimmerman, N., Meehan, S., Meehan, C., Waters, J., Ghosn, E.E., Filatenkov, A., Kolyagin, G.A., Gernez, Y., Tsuda, S., et al.: Earth mover's distance (EMD): a true metric for comparing biomarker expression levels in cell populations. PLoS ONE 11(3), e0151859 (2016)
19. Villani, C.: Optimal Transport: Old and New, vol. 338. Springer (2009)
20. Peyré, G., Cuturi, M., et al.: Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning 11(5-6), 355–607 (2019)
21. Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al.: Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning 10(1-2), 1–141 (2017)
22. Cohen, S., Arbel, M., Deisenroth, M.P.: Estimating barycenters of measures in high dimensions. arXiv preprint arXiv:2007.07105 (2020)
23. Haasdonk, B., Bahlmann, C.: Learning with distance substitution kernels. In: Joint Pattern Recognition Symposium, pp. 220–227. Springer (2004)
24. Friedman, J.H., Rafsky, L.C.: Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, pp. 697–717 (1979)
25. Hsiao, C., Liu, M., Stanton, R., McGee, M., Qian, Y., Scheuermann, R.H.: Mapping cell populations in flow cytometry data for cross-sample comparison using the Friedman–Rafsky test statistic as a distance measure. Cytometry Part A 89(1), 71–88 (2016)
26. Van Gassen, S., Gaudilliere, B., Angst, M.S., Saeys, Y., Aghaeepour, N.: CytoNorm: a normalization algorithm for cytometry data. Cytometry Part A 97(3), 268–278 (2020)
27. Freulon, P., Bigot, J., Hejblum, B.P.: CytOpT: Optimal transport with domain adaptation for interpreting flow cytometry data. arXiv preprint arXiv:2006.09003 (2020)
28. Li, H., Shaham, U., Stanton, K.P., Yao, Y., Montgomery, R.R., Kluger, Y.: Gating mass cytometry data by deep learning. Bioinformatics 33(21), 3423–3430 (2017)
29. Shaham, U., Stanton, K.P., Zhao, J., Li, H., Raddassi, K., Montgomery, R., Kluger, Y.: Removal of batch effects using distribution-matching residual networks. Bioinformatics 33(16), 2539–2546 (2017)
30. Ram, P., Gray, A.G.: Density estimation trees. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 627–635 (2011)
31. Roederer, M., Moore, W., Treister, A., Hardy, R.R., Herzenberg, L.A.: Probability binning comparison: a metric for quantitating multivariate distribution differences. Cytometry: The Journal of the International Society for Analytical Cytology 45(1), 47–55 (2001)
32. Coen, M.H., Ansari, M.H., Fillmore, N.: Comparing clusterings in space. In: ICML (2010)
33. Cheung, M., Campbell, J.J., Whitby, L., Thomas, R.J., Braybrook, J., Petzing, J.: Current trends in flow cytometry automated data analysis software. Cytometry Part A 99(10), 1007–1021 (2021)
34. Liu, P., Liu, S., Fang, Y., Xue, X., Zou, J., Tseng, G., Konnikova, L.: Recent advances in computer-assisted algorithms for cell subtype identification of cytometry data. Frontiers in Cell and Developmental Biology 8, 234 (2020)
35. Montante, S., Brinkman, R.R.: Flow cytometry data analysis: Recent tools and algorithms. International Journal of Laboratory Hematology 41, 56–62 (2019)
36. Alvarez-Esteban, P.C., del Barrio, E., Cuesta-Albertos, J.A., Matran, C.: Wide consensus aggregation in the Wasserstein space. Application to location-scatter families. Bernoulli 24(4A), 3147–3179 (2018)
37. Finak, G., Jiang, W., Krouse, K., Wei, C., Sanz, I., Phippard, D., Asare, A., De Rosa, S.C., Self, S., Gottardo, R.: High-throughput flow cytometry data normalization for clinical trials. Cytometry Part A 85(3), 277–286 (2014)
38. Kouw, W.M., Loog, M.: A review of domain adaptation without target labels. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(3), 766–785 (2019)
39. Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., He, Q.: A comprehensive survey on transfer learning. Proceedings of the IEEE 109(1), 43–76 (2020)
Chapter 7
Derivation of Optimal Experimental Design Methods for Applications in Cytogenetic Biodosimetry
Manuel Higueras and Jesús López-Fidalgo
Abstract The definition of cytogenetic dose–response curves requires the irradiation of in vitro blood samples at different ionizing radiation dose levels. Here, optimal experimental design techniques are applied to these curves. This optimization is mainly focused on the selection of dose levels. As cytogenetic dose estimation leads to a calibration problem, an optimal design criterion for calibration purposes is explained here.

Keywords Biological dosimetry · Dose–response calibration · Experimental design · Chromosomal aberrations · Poisson distribution
7.1 Introduction

Ionizing radiation may be absorbed by humans, implying dramatic health consequences. For the clinical actions after a radiation event, absorbed dose estimates are necessary. Ionizing radiations produce damage at a cellular level in the form of chromosomal aberrations, which are used as biomarkers of the absorbed dose. Calibration dose–response curves are built for quantifying the effect of ionizing radiation. The exposure to ionizing radiation induces breaks in the chromosomal DNA. The wrong reparation of these breaks generates chromosome aberrations. Consequently, the yield of chromosome aberrations increases with the amount of radiation and establishes a reliable radiation absorbed dose biomarker. Examples of radiation-
induced chromosome aberrations are dicentrics, centric rings, excess acentric fragments, micronuclei, and translocations. See [1] for further details. Cytogenetic dose–response curves describe the quantity of damage provoked by the amount of absorbed radiation dose. A laboratory develops a dose–response curve for the yield of a chromosome aberration from a radiation source. When an individual is suspected of being overexposed to that source of ionizing radiation, his/her incidence level of that chromosome aberration is compared to the curve. These curves are fitted with data coming from the irradiation of in vitro blood samples which simulate homogeneous whole-body exposures. The experimental design bases of these curves can be found in [1]. They can be summarized in the following suggestions:

• For low linear energy transfer (LET) radiation, use at least ten dose levels in the range (0.0, 5.0] Gy jointly with the non-irradiated level (0 Gy). For high LET radiation, a maximum of 2.0 Gy is suggested. The LET characterizes how much energy an ionizing particle transfers to the traversed material per unit distance. Examples of low LET sources are X- and γ-rays; of high LET, α-particles and neutrons.
• Use at least four dose levels in the range (0.0, 1.0] Gy. If the laboratory is capable of performing it, at least one of these levels should be below 0.25 Gy.
• Scoring should aim to detect 100 chromosome aberrations at each dose level. At lower dose levels, this may require scoring several thousand cells, so a number between 3000 and 5000 is suggested.

These considerations do not take into account optimal experimental design (OED) techniques, although they pursue purposes which can be directly managed with these methods. Designing an experiment optimally may be an objective for any kind of model. Optimal experimental designs are oriented to optimize one specific objective of interest for the study. This objective may be minimizing the volume of the confidence ellipsoid of the parameters of the model, optimizing the prediction, estimating the area under the curve, or estimating the minimum of the response. Nonlinear models are also considered under this approach. This theory adds an important value to the classic theory of experimental design, where designs with just a good compromise of different properties for linear models are usually sought. For further reading on optimal experimental design, we recommend [2] for a concise introduction, [3] for a friendly introduction for non-statisticians, and [4] for deeper learning. In this work, a specific optimality criterion based on making a good calibration is presented. This work introduces a framework for OED applied to cytogenetic biodosimetry dose estimation models. It is focused on Poisson models for a given dose–response function. The Poisson assumption is widely used in cytogenetic biodosimetry for modeling the yield of chromosomal aberrations induced by whole-body homogeneous exposures from low LET sources. Different dose effects are taken: linear, quadratic (the most widely used), including an individual age effect, or sigmoid, among others. Optimal designs tend to be extreme: they have a small number of points which mostly are in the vertices of the design space. As these extreme designs
do not usually fulfill experimenters’ expectations, constrained suboptimal designs can be produced. The structure of this chapter is the following. First, Sect. 7.2 introduces the statistical modeling of cytogenetic biodosimetry dose–response curves. Then, Sect. 7.3 reviews some OED concepts and criteria. Afterward, Sect. 7.4 establishes the application of different OED criteria applied to cytogenetic dose estimation. Finally, in Sect. 7.5, two real application examples are shown.
7.2 Dose–Response Model

Ionizing radiation induces chromosomal aberrations at the cellular level in humans and, more generally, in mammals. After a suspected overexposure event, blood samples taken from individuals are analyzed in biodosimetry laboratories to count chromosomal aberrations in blood cells. For each individual, his/her chromosomal aberration counts are used to calibrate his/her absorbed dose. The absorbed dose is calibrated by the damage, i.e., the yield of chromosomal aberrations, induced over in vitro blood samples simulating whole-body homogeneous exposure at different dose levels. This yield is typically assumed to be Poisson distributed, although there are irradiation scenarios for which an overdispersed alternative is more suitable (e.g., the negative binomial). An expected level of damage is also assumed for a given dose–response effect, for instance, a quadratic effect of the absorbed dose for low linear energy transfer irradiations. Calibration data consist of $m$ blood cells which are irradiated in vitro simulating homogeneous whole-body irradiation. Each cell is irradiated to dose level $d_i$ ($i \in \{1, \ldots, m\}$). After the exposure, $y_i$ chromosomal aberrations are scored in each cell. As the number of different dose levels is relatively small, calibration data are usually summarized by the total count of chromosomal aberrations in a number of analyzed blood cells for each irradiated dose level (see Tables 7.1 and 7.3). In the context of this work, it is assumed that the number of chromosomal aberrations per cell $Y \sim \mathrm{Pois}[\lambda(\mathbf{X})]$, where $\mathbf{X}$ is the explanatory variable set and $\lambda(\cdot)$ is the dose–response function, which establishes the model parameter set $\theta$. The absorbed dose $D$ is always in $\mathbf{X}$; in fact, in most situations $\mathbf{X}$ is formed only by $D$. The realizations of $Y$ are assumed to be independent among them. The goal in this exercise is the dose estimation for a given total chromosomal aberration assay $S$ in a sample of $N$ blood cells, i.e.,

$$\hat D = \lambda^{-1}(S/N) = \nu(\theta). \qquad (7.1)$$
The first equality refers to the interpolation of the average chromosomal aberrations ($S/N$) over the dose–response curve. The second equality refers to this interpolation as a prediction function, which is used in Sects. 7.4 and 7.5.
7.3 Optimal Experimental Design

An exact experimental design of size $n$ is a set of points $p_i$ ($i \in \{1, \ldots, n\}$) in a given compact design space $\mathcal{X}$, together with a probability measure mapping to each point the proportion of times it appears in the design. For OED purposes, the search can be constrained to discrete designs, i.e.,

$$\xi = \begin{pmatrix} p_1 & p_2 & \cdots & p_n \\ w_1 & w_2 & \cdots & w_n \end{pmatrix}, \qquad (7.2)$$
where $p_i$ ($i \in \{1, \ldots, n\}$) are the support points and $\xi(p_i) = w_i$ refers to the proportion of experiments to perform at point $p_i$. Thus, $w_i \ge 0$ and $\sum_{i=1}^{n} w_i = 1$. Let the model be represented by $f(Y|\mathbf{X}, \theta)$, the probability mass function of $Y$ conditioned on the explanatory variable(s) $\mathbf{X}$ and the values of the model parameter set $\theta$. Note that this work is focused on Poisson responses, but $f(Y|\mathbf{X}, \theta)$ could be a probability density function if the response is a continuous random variable. The Fisher information matrix (FIM) of a design $\xi$ is given by

$$M(\xi, \theta) = \sum_{x \in \mathcal{X}} I(x, \theta)\, \xi(x), \qquad (7.3)$$
where

$$I(x, \theta) = \mathbb{E}\left[ \frac{\partial \log f(Y|x, \theta)}{\partial \theta}\, \frac{\partial \log f(Y|x, \theta)}{\partial \theta^T} \right] \qquad (7.4)$$
is the FIM at a particular point $x$. The superscript $T$ denotes the transpose operator. The FIM is evaluated at some nominal value of $\theta$. The nominal value usually represents the best guess for the parameter vector $\theta$ at the beginning of the experiment. An optimal design criterion $\Phi[M(\xi, \theta)]$, for simplicity $\Phi(\xi)$, aims to minimize the variance–covariance matrix in some sense and therefore the inverse of the information matrix. The main optimal design criterion is the D-optimality criterion, which maximizes the FIM determinant, i.e., $\Phi_D(\xi) = \det M(\xi, \theta)$, as the inverse of the FIM is asymptotically proportional to the variance–covariance matrix of the parameter estimators. The c-optimality criterion is used to minimize the variance of a linear combination of the model parameters, say $c^T\theta$, and it is defined by $\Phi_c(\xi) = c^T M^-(\xi, \theta)\, c$. The superscript $-$ stands for the generalized inverse class of the matrix. The I-optimality criterion minimizes the average variance of a prediction $\hat Y = \eta(\theta)$, i.e., the function

$$\Phi_I(\xi) = \int \frac{\partial \eta(\theta)}{\partial \theta^T}\, M^-(\xi, \theta)\, \frac{\partial \eta(\theta)}{\partial \theta}\, \pi(\mathcal{X})\, d\mathcal{X}, \qquad (7.5)$$
where $\pi(\mathcal{X})$ is the density function of the established probability distribution on the design space. This probability distribution can be used for highlighting a region of $\mathcal{X}$. Otherwise, if there is no such specific region, it can be uniform across all of $\mathcal{X}$. More details and principles of OED can be found in [2, 3], and [4].
7.4 OED for Cytogenetic Biodosimetry

The dose estimation leads to an inverse regression (calibration) problem. The dose levels are fixed values selected by the experimenter. Consequently, the irradiated dose cannot be modeled as a regression function of the frequency of chromosomal aberrations. However, as explained in Sect. 7.2, the mean of chromosomal aberrations can be interpolated to the dose–response curve, i.e., $\hat D = \lambda^{-1}(S/N)$. The sample size $N$ is the number of blood cells analyzed. Two different sample sizes are used in practice:

• $N = 500$, the number of cells recommended for analysis by the IAEA
• $N = 50$, the number of cells recommended for analysis in emergency situations for fast triage radio-protection response (see [5])

As the counts of chromosomal aberrations in cells are independent among them, the total count of the chromosomal aberration assay is $S \sim \mathrm{Pois}[N\lambda(\mathbf{X})]$. Consequently, the predictor $\nu(\theta)$ does not depend on the absorbed dose, as it is being calibrated, and neither does its gradient. Consequently, the $I$-criteria (7.5) are modified for dose calibration, leading to a new method proposed in [6] denoted as the $I_M$-criteria. This $I_M$-optimality takes the expected value of the gradient of $\nu(\theta)$, and now it depends on the explanatory variables $\mathbf{X}$, i.e.,

$$\Phi_{I_M}(\xi) = \int_{\mathcal{X}} \nabla_X\, M^-(\xi, \theta)\, \nabla_X^T\, \pi(X)\, dX, \qquad (7.6)$$
where

$$\nabla_X = \mathbb{E}\left[ \frac{\partial \nu(\theta)}{\partial \theta} \right], \qquad (7.7)$$
taking into account $S \sim \mathrm{Pois}[N\lambda(\mathbf{X})]$. The sensitivity function, the Fréchet directional derivative in the direction of a one-point design, for this criterion of a given design $\xi$ is

$$\psi(x, \xi) = \partial \Phi_{I_M}(\xi, \xi_x) = \int_{\mathcal{X}} \left( \nabla_{x'}\, M^-(\xi, \theta)\, \nabla_{x'}^T - g_x\, M^-(\xi, \theta)\, \nabla_{x'}^T\, \nabla_{x'}\, M^-(\xi, \theta)\, g_x^T \right) \pi(x')\, dx', \qquad (7.8)$$
where $\xi_x$ is the one-point design at $x \in \mathcal{X}$ and

$$g_x = \frac{1}{\sqrt{\lambda(x)}}\, \frac{\partial \lambda(x)}{\partial \theta}. \qquad (7.9)$$
The General Equivalence Theorem (see [7]) states that the design $\xi$ is locally (for the nominal values used) $I_M$-optimal if and only if $\psi(x, \xi) \ge 0$ for all $x \in \mathcal{X}$.
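To make the Poisson ingredients concrete, the sketch below computes the FIM (7.3) of a design under the quadratic dose–response model $Y \sim \mathrm{Pois}(C + \alpha D + \beta D^2)$ used in the next section, where, for the identity link, $I(x, \theta) = g_x g_x^T$ with $g_x$ as in (7.9). The nominal values match the estimates reported in Sect. 7.5.1, while the three-point design and the evaluation of the D-criterion are illustrative choices only.

```python
import numpy as np

def g(d, theta):
    """g_x of Eq. (7.9) for lambda(d) = C + alpha*d + beta*d^2."""
    C, alpha, beta = theta
    lam = C + alpha * d + beta * d ** 2
    grad = np.array([1.0, d, d ** 2])  # gradient of lambda w.r.t. theta
    return grad / np.sqrt(lam)

def fim(doses, weights, theta):
    """FIM M(xi, theta) of Eq. (7.3), with I(x, theta) = g_x g_x^T."""
    return sum(w * np.outer(g(d, theta), g(d, theta))
               for d, w in zip(doses, weights))

theta0 = (0.00128, 0.0315, 0.0606)       # nominal values (Sect. 7.5.1)
doses, weights = [0.5, 2.5, 5.0], [1 / 3, 1 / 3, 1 / 3]  # illustrative design
print(np.linalg.det(fim(doses, weights, theta0)))        # D-criterion value
```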
7.5 Applied Examples

Two cases of application of these techniques are displayed in this section. They are for two different chromosomal aberration assays which require different dose–response curves.
7.5.1 Dicentric Plus Ring Chromosomes

The most typically analyzed radiation-induced chromosomal aberration is the dicentric chromosome. A dicentric is formed by the interchange between the fragments of two different chromosomes, resulting in an aberrant chromosome with two centromeres. The dicentric assay has very good properties, such as robustness to inter-individual variations and suitability to the assumption of Poisson responses. Some experimenters add the incidence of centric rings to the assay. Centric rings, also called ring chromosomes, are the interchange of two breaks of separate arms of the same chromosome. See [1] for further details. In [8], the authors constructed a calibration curve for dicentric plus ring chromosomes (D + R) at doses ranging from 0 to 5 Gy of X-rays at a dose rate of 0.275 Gy/min. The D + R assay is shown in Table 7.1. In this scenario, a quadratic dose effect is assumed, i.e., $Y \sim \mathrm{Pois}(C + \alpha D + \beta D^2)$. In this model, it is important to remark that the link function is the identity, instead of the canonical link for the Poisson distribution: the logarithmic function. All this gives $\mathcal{X} = [0, 5]$ and $\theta = \{C, \alpha, \beta\}$. The maximum likelihood estimates are used as nominal values, and they result in $\hat C = 1.28 \cdot 10^{-3}$ ($4.39 \cdot 10^{-4}$) [$3.47 \cdot 10^{-3}$], $\hat\alpha = 3.15 \cdot 10^{-2}$ ($7.15 \cdot 10^{-3}$) [$1.04 \cdot 10^{-5}$], $\hat\beta = 6.06 \cdot 10^{-2}$ ($4.02 \cdot 10^{-3}$) [...]

Chapter 8
Multiple Imputation for Compositional Data (MICoDa) Adjusting for Covariates
A. Saha et al.

[...] $y_{id} > 0$, $d = 1, 2, \ldots, D$, $\sum_{d=1}^{D} y_{id} = C$, $C > 0$. Let $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ip})'$ denote the vector of covariates on the $i$th subject. Each component of the compositional vector $\mathbf{y}_i$ is called a "composition." Due to compositionality, the means of the outcome are analyzed and interpreted in relative terms. The two popular relative measures introduced by Aitchison [1] are the Additive Log-Ratios (ALRs) and the Centered Log-Ratios (CLRs). For each subject $i$, $i = 1, 2, \ldots, n$, the ALRs are log-ratios of each composition relative to a predefined composition, say $y_{iD}$. Thus, $f : S^D \to \mathbb{R}^{D-1}$, where $f(\mathbf{y}_i) \equiv \mathbf{r}_i = (r_{i1}, r_{i2}, \ldots, r_{iD-1})' = (\log(y_{i1}/y_{iD}), \ldots, \log(y_{iD-1}/y_{iD}))'$. The CLRs are the log-ratios of each composition relative to the geometric mean $GM = \left( \prod_{d=1}^{D} y_{id} \right)^{1/D}$ of all compositions. Thus, $f(\mathbf{y}_i) = (\log(y_{i1}/GM), \log(y_{i2}/GM), \ldots, \log(y_{iD}/GM))'$. Depending upon the scientific question of interest or application, a researcher may prefer to use one of these log-ratios as the transformed outcome variable. In practice, however, the observed $y_{ij}$'s may fail to satisfy some of the constraints noted above for a variety of reasons. In some instances, a specific composition $d$ could take a value of zero or be missing for all subjects with a particular value of a covariate. Such zeros are known as structural zeros. For example, some activities of children in a day may be driven by the socio-economic status of their parents because not all parents can afford certain types of activities for their children. The time spent on such activities will be structural zeros for those children. As another example, suppose the outcome variable of interest is the abundance of microbial phyla in the gut of a newborn infant, and the covariate of interest is mode of delivery, C-section or vaginal delivery. Infants born via C-section are expected to be deprived of the phylum Bacteroidetes during the early days of life. Thus, in this case the C-section born babies have a structural zero corresponding to the phylum Bacteroidetes. One should not impute structural zeros because those are real zeros.
On the other hand, suppose a parent is asked to recall the average times spent by their child in a typical 24-h day on various activities such as Reading, Physical Activity, Screen Time, Cognitive Activities, Sleep, and Other (which includes eating and other miscellaneous activities). In their recollection, a parent may (i) forget the amount of time spent on one or more of the activities (i.e., respond with "0" or leave them unanswered) or (ii) recall incorrectly the amount of time spent on various activities, so that the total time spent in a 24-h period is not 24 h. Standard multiple imputation techniques for unconstrained Euclidean space data do not apply to the present context due to the various constraints noted above. However, there are some simple algorithms described in the literature. Imputation of missing compositions (or zeros) for compositional data is well discussed in the
literature [5]. A common approach is to replace the missing values by a small positive constant, such as 1, and then adjust the remaining compositions (if needed) so that for each subject $i$, the modified compositional vector, denoted by $\mathbf{y}_i^*$, satisfies the constraint $\mathbf{1}'\mathbf{y}_i^* = C$, where $\mathbf{1} = (1, 1, 1, \ldots, 1)'$. While this is a simple and commonly used approach, it has been demonstrated in the literature that it may result in inflated false discovery rates or loss of power when testing hypotheses regarding the outcome variables [5, 6]. The focus of this chapter is to impute missing values (or zeros) for the non-structural zero case using the ALR. Once all missing compositions are imputed, one may use a log-ratio (ALR or CLR) of choice depending upon the scientific question of interest. In Sect. 8.2, we introduce a simple iterative algorithm for imputing missing compositional data when the missingness depends upon the covariates. The proposed method is known as the Multiple Imputation method for Compositional Data (MICoDa). In Sect. 8.3, the performance of MICoDa is evaluated using a simulation study based on data obtained from NICHD. The method is illustrated using two different types of data. In Sect. 8.4.1, we apply the methodology on data obtained from the NICHD's Upstate KIDS cohort study [7]. The Upstate KIDS Study is a birth cohort which recruited mothers who delivered their newborns between 2008 and 2010 in upstate New York. As a second illustration, in Sect. 8.4.2, we illustrate MICoDa using simulated microbiome data based on a sample of 1006 western adults selected by Lahti et al. [8] from the Human Intestinal Tract Chip (HITChip) phylogenetic microarray.
8.2 MICoDa Methodology

The MICoDa methodology imputes missing compositions using the ALR transformation of the data. Since the ALR requires log-transformation of the compositional data, to avoid taking logarithms of zero values, as a first step a positive constant $W$ is added to all $y_{id}$, $i = 1, 2, \ldots, n$, $d = 1, 2, \ldots, D$.

Remark 1 By adding $W$ to all $y_{id}$, $i = 1, 2, \ldots, n$, $d = 1, 2, \ldots, D$ to deal with zeros in the data, the sum of the $y_{id}$'s for each subject potentially exceeds $C$. Hence, upon convergence of the algorithm, each $y_{id}$ is rescaled so that the sum equals $C$. However, if the goal is to correlate the compositions with covariates, then such a rescaling may not be necessary because the addition of $W$ may be simply viewed as part of the transformation of the data, similar in spirit to the Box-Cox family of transformations.

The reference composition is chosen so that it is not missing in most of the samples, if not all samples. For the rest of this section, for simplicity of exposition, we shall take the reference composition to be the $D$th composition. Thus, the ALR vectors $\mathbf{r}_i$, $i = 1, 2, \ldots, n$, are computed relative to the $D$th composition. For a given subject $i$ and composition $d$, if $y_{id}$ is missing (denoted as NA), then $r_{id}$ is defined as NA; otherwise, it is computed as described above.
We assume that for all $i = 1, 2, \ldots, n$, none of the components of $\mathbf{x}_i$ have missing values. Then, for each subject $i$, $i = 1, 2, \ldots, n$ and composition $d$, $d = 1, 2, \ldots, D-1$, we model $r_{id}$ in terms of the covariate vector $\mathbf{x}_i$ using the following standard linear regression model:

$$r_{id} = \beta_{0d} + \mathbf{x}_i^T \boldsymbol{\beta}_d + \epsilon_{id}. \qquad (8.1)$$
(8.2)
d=1
In the above expressions, .C ∗ = C + W D. However, if no zeros are present in the data, then .C ∗ = C. This simple idea serves as a useful recipe for imputing missing ALR values and the missing compositions for any subject in the sample. Let the set of the missing compositions for the .i th subject be denoted as .Mi . Thus, .yid = NA, d ∈ Mi ⊂ {1, . . . , D}. Two assumptions are made regarding the subjects with missing compositions (.Mi = φ, the null set).
Assumptions A.1: The cardinality of the set .Mi is at least 2, i.e., .|Mi | ≥ 2. If .|Mi | = 1, then missing component can be trivially computed using the compositionality constraint. A.2: The number of subjects with complete data .nc is larger than the number of parameters .p + 1 in the ALR regression model so that the model parameters can be estimated using the complete data.
For the ith subject such that .Mi = φ and .Mci = φ, where .Mci denotes complementary set of compositions with observed values, Eq. 8.2 can be applied to impute .yid ’s given a set of estimates of regression coefficients. However, we propose an iterative method, summarized in Eq. 8.3–Eq. 8.5, to improve upon the imputation by accounting for partial information in .Mci , given
8 Multiple Imputation for Compositional Data (MICoDa) Adjusting for Covariates
161
the ALR model as in Eq. 8.1, and a set of estimates of regression coefficients, say ˆ0d , βˆ d ] , d = 1, · · · , (D − 1). Let .rˆid = .βˆ0d + xT βˆ d , ∀ d < D. .[β i
.Initialize
0 yˆiD = C∗/ 1 +
D−1
0 0 exp{ˆrid } ; yˆid = yˆiD exp{ˆrid }, ∀ d < D (by Eq. 8.2)
d=1
l Iterate rˆid
l yˆiD
l−1 ∀d ∈ Mci , l → 1, 2, . . . . → log yid /yˆiD ⎡ ⎤ l ⎦ = C ∗ / ⎣1 + exp{ˆrid } + exp{ˆrid } d∈Mi ,d