Studies in Classification, Data Analysis, and Knowledge Organization
Leonardo Grilli · Monia Lupparelli · Carla Rampichini · Emilia Rocco · Maurizio Vichi Editors
Statistical Models and Methods for Data Science
Studies in Classification, Data Analysis, and Knowledge Organization
Managing Editors
Wolfgang Gaul, Karlsruhe, Germany
Maurizio Vichi, Rome, Italy
Claus Weihs, Dortmund, Germany

Editorial Board
Daniel Baier, Bayreuth, Germany
Frank Critchley, Milton Keynes, UK
Reinhold Decker, Bielefeld, Germany
Edwin Diday, Paris, France
Michael Greenacre, Barcelona, Spain
Carlo Natale Lauro, Naples, Italy
Jacqueline Meulman, Leiden, The Netherlands
Paola Monari, Bologna, Italy
Shizuhiko Nishisato, Toronto, Canada
Noboru Ohsumi, Tokyo, Japan
Otto Opitz, Augsburg, Germany
Gunter Ritter, Passau, Germany
Martin Schader, Mannheim, Germany
Studies in Classification, Data Analysis, and Knowledge Organization is a book series which offers constant and up-to-date information on the most recent developments and methods in the fields of statistical data analysis, exploratory statistics, classification and clustering, handling of information and ordering of knowledge. It covers a broad scope of theoretical, methodological as well as application-oriented articles, surveys and discussions from an international authorship and includes fields like computational statistics, pattern recognition, biological taxonomy, DNA and genome analysis, marketing, finance and other areas in economics, databases and the internet. A major purpose is to show the intimate interplay between various, seemingly unrelated domains and to foster the cooperation between mathematicians, statisticians, computer scientists and practitioners by offering well-based and innovative solutions to urgent problems of practice.
Editors Leonardo Grilli Department of Statistics, Computer Science, Applications “G. Parenti” University of Florence Florence, Italy
Monia Lupparelli Department of Statistics, Computer Science, Applications “G. Parenti” University of Florence Florence, Italy
Carla Rampichini Department of Statistics, Computer Science, Applications “G. Parenti” University of Florence Florence, Italy
Emilia Rocco Department of Statistics, Computer Science, Applications “G. Parenti” University of Florence Florence, Italy
Maurizio Vichi Department of Statistical Sciences Sapienza University of Rome Rome, Italy
ISSN 1431-8814    ISSN 2198-3321 (electronic)
Studies in Classification, Data Analysis, and Knowledge Organization
ISBN 978-3-031-30163-6    ISBN 978-3-031-30164-3 (eBook)
https://doi.org/10.1007/978-3-031-30164-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface
This book offers a collection of papers focusing on methods and models in classification and data analysis. Several research topics are covered, ranging from statistical inference and modeling to clustering and factorial methods, from directional data analysis to time series analysis and small area estimation. Applications deal with new investigations in a variety of relevant fields: medicine, finance, engineering, marketing, and cyber risk, to cite a few.

These contributions are a selection of post-conference papers presented at the 13th meeting of the CLAssification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS), organized by the Department of Statistics, Computer Science, Applications “G. Parenti” of the University of Florence (Italy) on September 9–11, 2021. The submitted papers followed a careful review process involving two reviewers per paper. In the end, 14 papers were selected for publication in this volume. Due to the persistent uncertainty about the COVID-19 epidemic, CLADAG 2021 was held entirely online. Despite this unfortunate situation, the Conference was well attended and lively, as the wide range of contributions collected here shows.

CLADAG, a member of the International Federation of Classification Societies (IFCS), organizes an international scientific meeting every two years devoted to presenting theoretical and applied papers in classification and, more generally, data analysis. The meeting includes advanced methodological research in multivariate statistics, mathematical and statistical investigations, survey papers on the state of the art, real case studies, papers on numerical and algorithmic aspects, and applications in special fields of interest at the interface between classification and data science. The Conference aims to encourage the interchange of ideas in these research fields and to disseminate new findings. CLADAG conferences, initiated in 1997 in Pescara (Italy), soon became an attractive forum for the exchange of information and an important meeting point for people interested in classification and data analysis. Traditionally, a selection of the presented papers, fully peer-reviewed, is published in a volume of post-conference proceedings.

The Scientific Committee of the 2021 edition planned Plenary and Invited Sessions to provide a fresh perspective on the state of the art of knowledge and research in the field. The scientific program of CLADAG 2021 was especially rich.
All in all, it comprised five Keynote Lectures, 26 Invited Sessions promoted by the members of the Scientific Program Committee, 10 Contributed Sessions, and a Plenary Session on statistical issues in the COVID-19 pandemic. We thank all the session organizers for inviting renowned speakers from many countries. We are greatly indebted to the referees for the time spent carefully reviewing the papers collected in this book. Special thanks are due to the members of the Local Organizing Committee and all the people who collaborated with CLADAG 2021. Last but not least, we thank all the authors and participants, without whom the Conference would not have been possible, and above all those who chose this book to share their research findings. We hope this book will contribute to fostering new knowledge in the field.

Florence, Italy
November 2022
Leonardo Grilli Monia Lupparelli Carla Rampichini Emilia Rocco Maurizio Vichi
Contents
Clustering Financial Time Series by Dependency
Andrés M. Alonso, Carolina Gamboa, and Daniel Peña

The Homogeneity Index as a Measure of Interrater Agreement for Ratings on a Nominal Scale
Giuseppe Bove

Hierarchical Clustering of Income Data Based on Share Densities
Francesca Condino

Optimal Coding of High-Cardinality Categorical Data in Machine Learning
Agostino Di Ciaccio

Bayesian Multivariate Analysis of Mixed Data
Chiara Galimberti, Federico Castelletti, and Stefano Peluso

Marginals Matrix Under a Generalized Mallows Model Based on the Power Divergence
Maria Kateri and Nikolay I. Nikolov

Time Series Clustering Based on Forecast Distributions: An Empirical Analysis on Production Indices for Construction
Michele La Rocca, Francesco Giordano, and Cira Perna

Partial Reconstruction of Measures from Halfspace Depth
Petra Laketa and Stanislav Nagy

Posterior Predictive Assessment of IRT Models via the Hellinger Distance: A Simulation Study
Mariagiulia Matteucci and Stefania Mignani

Shapley-Lorenz Values for Credit Risk Management
Niklas Bussmann, Roman Enzmann, Paolo Giudici, and Emanuela Raffinetti

A Study of Lack-of-Fit Diagnostics for Models Fit to Cross-Classified Binary Variables
Maduranga Dassanayake and Mark Reiser

Robust Response Transformations for Generalized Additive Models via Additivity and Variance Stabilization
Marco Riani, Anthony C. Atkinson, and Aldo Corbellini

A Random-Coefficients Analysis with a Multivariate Random-Coefficients Linear Model
Laura Marcis, Maria Chiara Pagliarella, and Renato Salvatore

Parsimonious Mixtures of Matrix-Variate Shifted Exponential Normal Distributions
Salvatore D. Tomarchio, Luca Bagnato, and Antonio Punzo

Author Index
Clustering Financial Time Series by Dependency Andrés M. Alonso, Carolina Gamboa, and Daniel Peña
Abstract In this paper, we propose a procedure for clustering financial time series by dependency on their volatilities. Our procedure is based on the generalized cross correlation between the estimated volatilities of the time series. Monte Carlo experiments are carried out to analyze the improvements obtained by clustering using the squared residuals instead of the levels of the series. Our procedure was able to recover the original clustering structures in all cases in our Monte Carlo study. Finally, the methodology is applied to a set of financial data. Keywords Estimated volatilities · Generalized cross correlation · Correlation matrix · Unsupervised classification
1 Introduction

The advancement of information technology has made available a large amount of temporal high-frequency data in a variety of fields, such as economics, biology, medicine, meteorology, and demography. This fact has promoted research in cluster analysis to find common structures in a data set. The data are grouped into homogeneous groups by maximizing some measure of similarity. A variety of methods have been proposed in the literature to cluster sets of time series (see, for example, Caiado et al. (2006), Caiado et al. (2015), Galeano and Peña (2000), Jeong et al. (2011), Piccolo (1990), Díaz and Vilar (2010), and Lafuente-Rego and Vilar (2016)).

A. M. Alonso (B) · D. Peña
Department of Statistics and Institute Flores de Lemus, Universidad Carlos III de Madrid, 28903 Getafe, Spain
e-mail: [email protected]
D. Peña e-mail: [email protected]
C. Gamboa
Department of Statistics, Universidad Carlos III de Madrid, 28903 Getafe, Spain
e-mail: [email protected]
In those methods, clustering is approached from two different perspectives: in the first one, working directly on the original time series and defining an appropriate metric between them; in the second one, projecting the time series onto a small space of characteristics or parameters. An often used method was proposed by Piccolo (1990), in which the distances between the parameters of an autoregressive approximation to each time series are used to form the clusters.

The previous methodologies are useful when the time series are independent. However, in many applications this assumption is not met, and ignoring this fact can lead us to group highly dependent time series into different groups or to group independent time series into the same group. In financial time series, the returns do not present a strong structure in the levels, cor(X_t, X_s) ≈ 0, but they show higher order dependence, as, for instance, in the squares, cor(X_t^2, X_s^2) ≠ 0. Thus, returns are often uncorrelated but not independent (see, for example, Tsay (2010)). These characteristics have opened a line of research on the clustering of time series that takes into account the similarity of the evolution of the univariate conditional variances. For instance, Otranto (2008) and D'Urso et al. (2013) extended the methods proposed in Piccolo (1990) to Generalized AutoRegressive Conditional Heteroscedasticity (GARCH) models. They use the representation of the squared disturbances in their AR(∞) form in order to compute a distance measure defined as a function of the autoregressive parameters, as in Piccolo (1990).

A few articles have proposed methods for clustering by dependency. Zhang and An (2018) proposed a distance measure based on copulas to measure general dependencies between the time series. Alonso and Peña (2019) introduced the generalized cross-correlation (GCC) metric, which is based on the determinant of a set of cross correlations between two time series up to a certain lag k. These two methods assume that the dependency among the time series is on the levels, and do not consider the case in which the dependency could show up in the conditional variances. To the best of our knowledge, the only proposal to deal with dependency between volatilities is due to La Rocca and Vitale (2021), who proposed to use the auto-distance correlation function (see Zhou, 2012) to handle both linear and nonlinear dependence structures.

In this work, we present a procedure to cluster time series by linear dependency on the volatilities. We believe that this is the most important case in practice, and we extend the approach presented by Alonso and Peña (2019) to the search for dependencies between the squares of two time series or between the estimated volatilities. The work is organized as follows. Section 2 introduces the notation and reviews some conditional heteroscedastic models. Section 3 extends the GCC measure for measuring dependencies between squares of time series. In Sect. 4, we present some Monte Carlo simulations in order to evaluate the performance of the GCC measure to detect relations between squares of heteroscedastic time series. Section 5 illustrates the use of the proposed procedure on a well-known set of 100 financial time series.
2 Conditional Heteroscedastic Models

Volatility, or conditional variance, is an important feature of financial markets, since it captures the conditional variations in the returns of the assets. As discussed in Otranto (2008), it is generally considered to be a proxy of financial risk, since high volatility means that the value of returns can change dramatically in short periods of time, and low volatility is associated with stable assets or periods. Some other characteristics that usually appear in asset returns are as follows: (i) their distributions are heavy-tailed; (ii) there exist volatility clusters, i.e. periods of high volatility are often followed by periods of low volatility and vice versa; (iii) although the returns are not correlated, their squared values have a strong autocorrelation structure (see, for example, Tsay (2010)). Based on these properties, some models have been proposed in the literature for studying the behavior of these time series. Engle (1982) proposed the well-known Autoregressive Conditional Heteroscedasticity (ARCH) model, in which the dependence between squared asset returns is parameterized by using an autoregressive-type model and the volatility, σ_t^2, is a function of past squared returns. Since this model often requires a high AR order, Bollerslev (1986) proposed the GARCH model, where the volatility also depends on past values of the conditional variance. Given a stationary time series x_t = μ + e_t with mean μ and heteroscedastic noise e_t, it is said that the series follows a GARCH(p,q) model if the noise follows the equation e_t = σ_t ε_t, where ε_t is an i.i.d. random variable with mean zero and variance one, and σ_t is the volatility, which can be expressed as

$$\sigma_t^2 = \omega + \sum_{j=1}^{p} \alpha_j e_{t-j}^2 + \sum_{l=1}^{q} \beta_l \sigma_{t-l}^2,$$
where ω > 0 and 0 ≤ α_j, β_l < 1, in order to ensure that the conditional variance is positive, and $\sum_{j=1}^{p} \alpha_j + \sum_{l=1}^{q} \beta_l < 1$, so that the process is stationary. Those models have been generalized to the multivariate framework but, as mentioned by Tsay (2014), this approach suffers from the curse of dimensionality. For instance, with n time series we have n conditional variances and n(n − 1)/2 conditional covariances. Our work would make it possible to reduce the number of elements to be modeled, as the covariances between groups of independent (or very slightly dependent) series can be considered negligible.
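As a concrete illustration (not taken from the paper), the following R sketch simulates a GARCH(1,1) path with the recursion above; the parameter values are arbitrary and only serve to reproduce the stylized facts mentioned in this section.

```r
# Minimal sketch: simulate a GARCH(1,1) series e_t = sigma_t * eps_t
simulate_garch11 <- function(n, omega = 0.01, alpha = 0.05, beta = 0.90) {
  stopifnot(omega > 0, alpha >= 0, beta >= 0, alpha + beta < 1)  # stationarity
  e  <- numeric(n)
  s2 <- omega / (1 - alpha - beta)              # start at the unconditional variance
  for (t in 1:n) {
    e[t] <- sqrt(s2) * rnorm(1)                 # i.i.d. innovation, mean 0, variance 1
    s2   <- omega + alpha * e[t]^2 + beta * s2  # volatility recursion
  }
  e
}
x <- simulate_garch11(1000)
acf(x,   plot = FALSE)$acf[2]   # lag-1 autocorrelation of the levels: close to zero
acf(x^2, plot = FALSE)$acf[2]   # lag-1 autocorrelation of the squares: clearly positive
```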
3 Procedure for Clustering Time Series by Dependency

Alonso and Peña (2019) proposed a measure based on the dependency between the conditional means of two time series. Our goal is to extend this measure to the dependency between conditional variances, σ_t^2. The procedure could be extended in order to find interesting relations between other conditional moments of time series. Let w_t and z_t be two stationary time series, and let x_t = w_t^2, y_t = z_t^2 be their corresponding squares, which will also be stationary. Using the results given in Alonso and Peña (2019), we are going to define a dependence measure between (x_t, y_t). Suppose that we have standardized the squared series so that E(x_t) = E(y_t) = 0 and var(x_t) = var(y_t) = 1. For lags h = 0, ±1, ..., ±k, the autocorrelation function of x_t is ρ_x(h) = E(x_{t−h} x_t) and the cross correlation between x_t and y_t is ρ_{xy}(h) = E(x_{t−h} y_t). The linear dependency between the two time series of squares can be summarized in the matrix

$$R(h) = \begin{pmatrix} \rho_x(h) & \rho_{xy}(h) \\ \rho_{yx}(h) & \rho_y(h) \end{pmatrix}.$$

For the set of lags from 0 to k, we define the matrix

$$R_k = \begin{pmatrix} R(0) & R(1) & \cdots & R(k) \\ R(-1) & R(0) & \cdots & R(k-1) \\ \vdots & \vdots & \ddots & \vdots \\ R(-k) & R(-k+1) & \cdots & R(0) \end{pmatrix},$$

which corresponds to the correlation matrix of the stationary process (x_t, y_t, x_{t−1}, y_{t−1}, ..., x_{t−k}, y_{t−k})'. We can reorganize the elements of the above vector as Z_{t,2(k+1)} = (X'_{t,(k+1)}, Y'_{t,(k+1)})', where X_{t,(k+1)} = (x_t, ..., x_{t−k})' and Y_{t,(k+1)} = (y_t, ..., y_{t−k})'. In this case, the correlation matrix will be

$$R_{yx,k} = \begin{pmatrix} R_{xx,k} & C_{xy,k}^{T} \\ C_{xy,k} & R_{yy,k} \end{pmatrix},$$

where R_{xx,k} and R_{yy,k} are the correlation matrices of the X_{t,k} and Y_{t,k} processes, respectively, and C_{xy,k} is the matrix of cross correlations between these two vectors. The matrix R_{yx,k} satisfies 0 ≤ det(R_{yx,k}) ≤ 1. The determinant takes the value one when the matrix is diagonal, i.e. when all the cross correlations are zero and the autocorrelations of both series are zero at all nonzero lags; and the determinant takes the value zero when there is a perfect linear dependence between the components of the vector Z_{t,2(k+1)}.
In Alonso and Peña (2019) it is shown that det(R_{yx,k}) by itself is not sufficient to capture the linear relationship between the two time series, since it depends on the cross correlations as well as on the autocorrelations of both processes. For example, if C_{xy,k} = 0 then det(R_{xy,k}) = det(R_{xx,k}) det(R_{yy,k}); in addition, if one of the series presents strong autocorrelations, det(R_{yx,k}) will be close to zero, regardless of whether the series are independent or not. For this reason, an alternative similarity measure was proposed, defined as follows:

$$\mathrm{GCC}(x_t, y_t) = 1 - \left( \frac{\det(R_{yx,k})}{\det(R_{xx,k})\,\det(R_{yy,k})} \right)^{1/(k+1)}.$$
This similarity measure GCC(x_t, y_t) satisfies the following properties: (1) GCC(x_t, y_t) = GCC(y_t, x_t); (2) 0 ≤ GCC(y_t, x_t) ≤ 1, where the value one is attained when the dependence between the two variables is perfectly linear, and the value zero is attained when all cross-correlation coefficients are zero. The last statement was proven in Alonso and Peña (2019, Eq. 11) using Hadamard's inequality. It is called the Generalized Cross-Correlation measure (GCC). In addition, this measure takes into account the dimension of the matrices and thus allows us to compare matrices of different dimensions. The selection of k can be done using the BIC criterion (see Sect. 5.1 in Alonso and Peña 2019). Additionally, the analyst may have a priori information about which lags are relevant to her specific problem. Based on this measure, a dissimilarity between x_t and y_t can be defined by d(x_t, y_t) = 1 − GCC(x_t, y_t); in that way, high dissimilarity values are associated with weak linear dependence, and values close to zero are related to strong linear dependence between x_t and y_t. Once the pairwise dissimilarities between time series are obtained, we can apply any clustering method that uses dissimilarity matrices as input. In this work, we will use a hierarchical clustering algorithm with single linkage, as this allows us to find interesting dependency structures such as chain dependency (see details in Alonso and Peña (2019) and Alonso et al. (2021)).
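To fix ideas, here is a minimal R sketch of the GCC between the squares of two series (an illustration written for this text, not the authors' code): the correlation matrices are estimated directly from the lagged sample vectors, which approximates the block matrices defined above.

```r
# Sample-based sketch of GCC between the squares of series w and z, for lags 0..k
gcc_squares <- function(w, z, k = 1) {
  x <- as.numeric(scale(w^2))   # standardized squared series
  y <- as.numeric(scale(z^2))
  X <- embed(x, k + 1)          # columns: x_t, x_{t-1}, ..., x_{t-k}
  Y <- embed(y, k + 1)
  R_full <- cor(cbind(X, Y))    # correlation matrix of (X'_{t,(k+1)}, Y'_{t,(k+1)})'
  1 - (det(R_full) / (det(cor(X)) * det(cor(Y))))^(1 / (k + 1))
}
# The dissimilarity d(x, y) = 1 - gcc_squares(x, y) can then be collected into a
# matrix D and passed to hclust(as.dist(D), method = "single") for clustering.
```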
4 Simulation Study

In this section, we carry out a Monte Carlo experiment to evaluate the performance of the proposed methods using ARCH(1) models. Suppose that

$$\epsilon_{t,x} = S_{t,x}\,\eta_{t,x}, \qquad \epsilon_{t,y} = S_{t,y}\,\eta_{t,y}, \qquad (1)$$

where S_{t,y}, S_{t,x}, η_{t,y}, η_{t,x} are pairwise independent, except for the pair (η_{t,y}, η_{t,x}), which will be dependent. The dummy variables S_{t,y} and S_{t,x} take only two possible values {−1, 1}, each with probability 1/2, and are independent over time,
and η_{t,y} and η_{t,x} are standardized white noise normal variables with cor(η_{t,y}, η_{t,x}) = ρ > 0. Then, the variables ε_{t,x} and ε_{t,y} have mean zero, variance one, and cor(ε_{t,x}^2, ε_{t,y}^2) = cor(η_{t,x}^2, η_{t,y}^2) = ρ^2 (see Appendix). Furthermore, we have

$$E(\epsilon_{t,x}) = E(S_{t,x}\eta_{t,x}) = E(S_{t,x})E(\eta_{t,x}) = 0, \qquad E(\epsilon_{t,y}) = E(S_{t,y}\eta_{t,y}) = E(S_{t,y})E(\eta_{t,y}) = 0. \qquad (2)$$

Thus, taking into account (2), we obtain cov(ε_{t,x}, ε_{t,y}) = E(ε_{t,x} ε_{t,y}) = E(S_{t,x} S_{t,y} η_{t,x} η_{t,y}) = 0. Hence, the series will have linear dependencies in their squares, but will be uncorrelated in the levels.

Consider a data set e_{t,i} = σ_{t,i} ε_{t,i}, with i = 1, ..., 15, where each of the 15 time series has length T = 1000. The σ_{t,i} follow the ARCH(1) model σ_{t,i}^2 = α_{0,i} + α_{1,i} e_{t−1,i}^2, with

Model 1: α_{0,i} = 0.01 and α_{1,i} = 0.3 for i ∈ {1, ..., 5} ∪ {11, ..., 15}
Model 2: α_{0,i} = 0.001 and α_{1,i} = 0.6 for i ∈ {6, ..., 10}

Using (1), we introduce the cross dependence in the set of time series {e_{t,i}} through the noise {ε_{t,i}}. For this, we will assume that the η_{t,i} variables are multivariate normal with dependence structure defined by ρ(i, j) = cor(η_{t,i}, η_{t,j}). We will consider five scenarios where there are two (or six) clusters by dependency. Different types of dependency will be studied: independence, chain dependency, and full dependency. The noise in the five scenarios is defined as follows:

– Scenario 1: ρ(i, j) = 0.9 for i = 1, ..., 9 and j = i + 1, ..., 10, and 0 otherwise. In this scenario, we have ten fully dependent series and five independent ones; in total, we have six groups.
– Scenario 2: ρ(i, j) = 0.9 for i = 1, ..., 9 and j = i + 1, ..., 10, ρ(i, i + 1) = 0.5 for i = 11, ..., 14, and 0 otherwise. In this scenario, we have ten fully dependent series and five chain-dependent ones; in total, we have two groups.
– Scenario 3: ρ(i, j) = 0.9 for i = 1, ..., 9 and j = i + 1, ..., 10, ρ(i, j) = 0.9 for i = 11, ..., 14 and j = i + 1, ..., 15, and 0 otherwise. In this scenario, we have ten fully dependent series and five fully dependent ones; in total, we have two groups.
– Scenario 4: ρ(i, i + 1) = 0.5 for i = 1, ..., 9, ρ(i, i + 1) = 0.5 for i = 11, ..., 14, and 0 otherwise. In this scenario, we have ten chain-dependent series and five chain-dependent ones; in total, we have two groups.
– Scenario 5: ρ(i, i + 1) = 0.5 for i = 1, ..., 9, and 0 otherwise. In this scenario, we have ten chain-dependent series and five independent ones; in total, we have six groups.
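The following R sketch (not the authors' code) illustrates the construction in (1) for a single pair of series under Model 1: the resulting series are essentially uncorrelated in the levels but clearly dependent in the squares.

```r
# Sketch: a pair of ARCH(1) series, uncorrelated in the levels, dependent in the squares
set.seed(123)
n_obs <- 1000; rho <- 0.9
eta_x <- rnorm(n_obs)
eta_y <- rho * eta_x + sqrt(1 - rho^2) * rnorm(n_obs)   # cor(eta_x, eta_y) = rho
S_x <- sample(c(-1, 1), n_obs, replace = TRUE)          # independent random signs
S_y <- sample(c(-1, 1), n_obs, replace = TRUE)
eps_x <- S_x * eta_x
eps_y <- S_y * eta_y
arch1 <- function(eps, a0 = 0.01, a1 = 0.3) {           # ARCH(1) filter, Model 1 values
  e  <- numeric(length(eps))
  s2 <- a0 / (1 - a1)                                   # start at unconditional variance
  for (t in seq_along(eps)) {
    e[t] <- sqrt(s2) * eps[t]
    s2   <- a0 + a1 * e[t]^2
  }
  e
}
e_x <- arch1(eps_x); e_y <- arch1(eps_y)
c(levels = cor(e_x, e_y), squares = cor(e_x^2, e_y^2))  # ~0 and clearly positive
```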
[Fig. 1 Dependence structure representation; panels (a) S.1, (b) S.2, (c) S.3, (d) S.4, (e) S.5]
In Fig. 1, we present these five scenarios; each one is represented by an ellipse where each point corresponds to a time series, and a line represents a nonzero cross correlation between two time series.

For the clustering procedure, we compare the results using the levels and the squares of the time series. The first, the bivariate GCC(e_{i,t}, e_{j,t}) measure on the levels, will be denoted by GCC, and the second, the GCC measure on the squares of e_{i,t}, that is GCC(e_{i,t}^2, e_{j,t}^2), will be denoted by GCC2. We have also computed the GCC measure on the estimated volatilities, that is GCC(σ̂_{t,i}^2, σ̂_{t,j}^2) (denoted by GCC_vol), where the volatilities have been estimated by using a GARCH(1,1), which is a usual choice in financial time series modeling. Note that the results of using the estimated volatility will depend on the chosen model, and this is a strong limitation. Using the squares is equivalent to estimating the volatility using an ARCH(1) process, as then the volatilities are linear transformations of the squared residuals.

As proposed by Hubert and Arabie (1985), the partitions can be compared using the Adjusted Rand Index (ARI). ARI is based on counting pairs of observations that are classified in the same cluster under two cluster partitions. The closer the index is to one (zero), the higher (lower) the agreement between the two partitions. Table 1 reports the means of the ARI measure over 100 replicates for these five scenarios. We group the time series using hierarchical clustering with single linkage. Assuming that the number of groups is known, we can observe that when the estimated volatility is used, the performance of this methodology is similar to that of GCC2. Both procedures capture the grouping structure in all cases for scenarios 1 to 3. Also, both measures perform better than the GCC measure that uses the original time series. Additional Monte Carlo experiments have been performed, but are not presented here due to space constraints.
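A sketch of the evaluation step is given below (assuming the gcc_squares() helper sketched in Sect. 3 and the mclust package for the adjusted Rand index); for brevity the series here are plain white noise, whereas in the study they are the dependent ARCH(1) series described above.

```r
library(mclust)
set.seed(1)
series <- replicate(15, rnorm(1000), simplify = FALSE)  # placeholder series
n <- length(series)
D <- matrix(0, n, n)
for (i in 1:(n - 1)) for (j in (i + 1):n) {
  D[i, j] <- D[j, i] <- 1 - gcc_squares(series[[i]], series[[j]], k = 1)
}
hc    <- hclust(as.dist(D), method = "single")
found <- cutree(hc, k = 6)          # the number of groups is assumed known
truth <- c(rep(1, 10), 2:6)         # true labels in scenario 1 (one big group + 5 singletons)
adjustedRandIndex(found, truth)
```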
Table 1 Clustering performance, ARCH(1) model (mean ARI over 100 replicates)

Method   | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 | Scenario 5
GCC      | 0.70       | 0.16       | 0.24       | 0.01       | 0.09
GCC2     | 1.00       | 1.00       | 1.00       | 0.89       | 0.90
GCC_vol  | 1.00       | 0.99       | 1.00       | 0.89       | 0.90
[Fig. 2 Example of dendrograms (using single linkage) for scenario S.3; panels: (a) GCC on levels, (b) GCC on squares]

[Fig. 3 Example of dendrograms (using single linkage) for scenario S.4; panels: (a) GCC on levels, (b) GCC on squares]
Figures 2 and 3 show examples of the dendrograms obtained using the GCC measure for the third and fourth scenarios when we use the original time series (levels) and their squares. The two groups of scenario 3 are clearly distinguishable in the dendrogram built from the squares; moreover, we note that this structure cannot be captured when we use the levels. For the other scenarios, we obtain similar results; however, when there is chain dependence, the groups merge at high points in the dendrogram (see Fig. 3).
5 Real Data Example

In this section, we analyze 100 time series that contain equal-weighted returns from the intersections of five portfolios formed on size (market equity, ME) and five portfolios formed on the ratio of book equity to market equity (BE/ME), for Europe (EU), Japan (JAP), Pacific Asia except Japan (PA), and North America (AM). The size is coded from 1 (small) to 5 (big) and the book-to-market ratio is coded from 1 (low) to 5 (high). A detailed description can be found on the web page maintained by Kenneth R. French, http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html. The data are taken from July 1990 to April 2019, making a total of 7522 days.

In Fig. 4, we present the dendrogram obtained by using a hierarchical clustering procedure with single linkage on the levels of daily returns. The number of clusters has been obtained by using the Silhouette statistic (see Rousseeuw (1987)). This statistic measures the mean similarity (closeness) of observations in their own group relative to the similarity with other groups. The Silhouette statistic detects four clusters in the data set, which correspond to the four regions analyzed. In addition, the series that belong to each cluster present a strong linear dependence between them, except in the cluster of the Asian time series. When we use the squares of daily returns for clustering, the Silhouette statistic detects five clusters. They are the four main groups found in the levels and a fifth group composed of a single time series belonging to Pacific Asia. Also, in the squares, the group of American time series presents weaker dependencies than in the levels (see Fig. 5).

In Fig. 6, we show that the dependency structures based on levels differ from those based on squared returns. In particular, it is remarkable that portfolios AM12, AM13, AM14, and AM15 make up a group of dependent series in the levels; however, this group is divided when squares are taken into account, and the same is true for the group of portfolios AM52, AM53, and AM54.
[Fig. 4 Dendrogram using GCC on returns from 100 portfolios (using single linkage method)]
[Fig. 5 Dendrogram using GCC on squares of returns from 100 portfolios (using single linkage method)]
The results with the estimated volatilities are similar to those obtained with the squared returns, so we omit the figure; it is available upon request from the authors.
6 Conclusions

Clustering time series in high-dimensional dynamic data sets by taking into account their dependency is an important field of research. In financial time series, the dependency often appears in the volatilities, and clustering the series by similar volatility can be very helpful to build factorial models for volatilities in large data sets. The use of the generalized cross-correlation measure applied to the dependencies between the squares of the series, or between their estimated volatilities, has been shown to be very useful in the Monte Carlo study. Also, some interesting features in the real data analysis have been illustrated by comparing the clusters in levels and squares. We conclude that clustering using the linear dependency between the volatilities seems to be a promising line of research, although other possible forms of nonlinear dependency in real data should be explored in the future.
Fig. 6 Dendrogram using GCC of returns and squares from American and European portfolios
Acknowledgements The authors gratefully acknowledge the financial support from the Spanish government Agencia Estatal de Investigación (PID2019-108311GB-I00/AEI/10.13039/ 501100011033) and Comunidad de Madrid.
Appendix: Correlation of Two Dependent Chi-square Variables

Let η_{t,x} and η_{t,y} be normal variables with mean 0, variance 1, and cor(η_{t,x}, η_{t,y}) = ρ. Suppose that η_{t,y} = ρη_{t,x} + aη_{t,z}, with η_{t,z} a standard normal variable independent of η_{t,x} and a^2 = 1 − ρ^2. Thus

$$\eta_{t,x}^2 \eta_{t,y}^2 = \eta_{t,x}^2 (\rho^2 \eta_{t,x}^2 + 2\rho a \eta_{t,x}\eta_{t,z} + a^2 \eta_{t,z}^2) = \rho^2 \eta_{t,x}^4 + 2a\rho\, \eta_{t,x}^3 \eta_{t,z} + a^2 \eta_{t,x}^2 \eta_{t,z}^2.$$

Using the moments of standard normal variables and the independence between η_{t,x} and η_{t,z}, we have that E(η_{t,x}^2 η_{t,y}^2) = 3ρ^2 + a^2 = 2ρ^2 + 1. Thus, we conclude that

$$\mathrm{cov}(\eta_{t,x}^2, \eta_{t,y}^2) = E(\eta_{t,x}^2 \eta_{t,y}^2) - E(\eta_{t,x}^2)E(\eta_{t,y}^2) = E(\eta_{t,x}^2 \eta_{t,y}^2) - 1 = 2\rho^2.$$

On the other hand, we know that var(η_{t,x}^2) = var(η_{t,y}^2) = 2, therefore

$$\mathrm{cor}(\eta_{t,x}^2, \eta_{t,y}^2) = \frac{2\rho^2}{2} = \rho^2.$$
References

Alonso, A. M., & Peña, D. (2019). Clustering time series by linear dependency. Statistics and Computing, 29, 655–676. https://doi.org/10.1007/s11222-018-9830-6
Alonso, A. M., D'Urso, P., Gamboa, C., & Guerrero, V. (2021). Cophenetic-based fuzzy clustering of time series by linear dependency. International Journal of Approximate Reasoning, 137, 114–136. https://doi.org/10.1016/j.ijar.2021.07.006
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31, 307–327. https://doi.org/10.1016/0304-4076(86)90063-1
Caiado, J., Crato, N., & Peña, D. (2006). A periodogram-based metric for time series classification. Computational Statistics & Data Analysis, 50, 2668–2684. https://doi.org/10.1016/j.csda.2005.04.012
Caiado, J., Maharaj, E. A., & D'Urso, P. (2015). Time-series clustering. In Handbook of cluster analysis (pp. 262–285). Chapman and Hall/CRC. https://doi.org/10.1201/9780429058264
Díaz, S. P., & Vilar, J. A. (2010). Comparing several parametric and nonparametric approaches to time series clustering: A simulation study. Journal of Classification, 27, 333–362. https://doi.org/10.1007/s00357-010-9064-6
D'Urso, P., Cappelli, C., Di Lallo, D., & Massari, R. (2013). Clustering of financial time series. Physica A: Statistical Mechanics and its Applications, 392, 2114–2129. https://doi.org/10.1016/j.physa.2013.01.027
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50, 987–1007. https://doi.org/10.2307/1912773
Galeano, P., & Peña, D. (2000). Multivariate analysis in vector time series. Resenhas do Instituto de Matemática e Estatística da Universidade de São Paulo, 4, 383–403.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218. https://doi.org/10.1007/BF01908075
Jeong, Y.-S., Jeong, M. K., & Omitaomu, O. A. (2011). Weighted dynamic time warping for time series classification. Pattern Recognition, 44, 2231–2240. https://doi.org/10.1016/j.patcog.2010.09.022
Lafuente-Rego, B., & Vilar, J. A. (2016). Clustering of time series using quantile autocovariances. Advances in Data Analysis and Classification, 10, 391–415. https://doi.org/10.1007/s11634-015-0208-8
La Rocca, M., & Vitale, V. (2021). Clustering time series by nonlinear dependence. In M. Corazza et al. (Eds.), Mathematical and Statistical Methods for Actuarial Sciences and Finance (pp. 291–297). https://doi.org/10.1007/978-3-030-78965-7_43
Otranto, E. (2008). Clustering heteroskedastic time series by model-based procedures. Computational Statistics & Data Analysis, 52, 4685–4698. https://doi.org/10.1016/j.csda.2008.03.020
Piccolo, D. (1990). A distance measure for classifying ARIMA models. Journal of Time Series Analysis, 11, 153–164. https://doi.org/10.1111/j.1467-9892.1990.tb00048.x
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7
Tsay, R. S. (2010). Analysis of financial time series. Wiley. https://doi.org/10.1002/9780470644560
Tsay, R. S. (2014). Multivariate time series analysis. Wiley. https://doi.org/10.1002/9780470644560.ch8
Zhang, B., & An, B. (2018). Clustering time series based on dependence structure. PloS One, 13, e0206753. https://doi.org/10.1371/journal.pone.0206753
Zhou, Z. (2012). Measuring nonlinear dependence in time-series, a distance correlation approach. Journal of Time Series Analysis, 33, 438–457. https://doi.org/10.1111/j.1467-9892.2011.00780.x
The Homogeneity Index as a Measure of Interrater Agreement for Ratings on a Nominal Scale Giuseppe Bove
Abstract Interrater agreement for classifications on nominal scales is usually evaluated by overall measures across targets (subjects or objects) like Cohen's kappa index. In this paper, the homogeneity index for a qualitative variable is proposed to evaluate the agreement between raters for each target, and also to obtain a global measure of interrater agreement for the whole group of targets evaluated. The target-specific and global measures proposed do not depend on a particular definition of agreement (simultaneously between two, three, or more raters) and are not influenced by the marginal rater distributions of the scale, unlike most kappa-type indices.

Keywords Nominal classification scales · Interrater agreement · Homogeneity index
1 Introduction

In behavioral and biomedical sciences, classifications of subjects or objects (targets) into predefined classes or categories and the analysis of their agreement are a rather common activity. For instance, agreement between clinical diagnoses provided by several physicians (raters) who recognize symptoms with respect to the type of disease is considered when identifying the best treatment for patients. The patient will not feel confident if physicians seriously differ in opinion. The rating procedure (or scale) can be used with confidence and will be accepted by the patient depending on the extent to which the diagnoses coincide. This does not only apply to physicians but also, in general, to all those situations where it is impossible to establish a 'true' classification in an objective way. Hence, in this type of application it is important to analyze interrater absolute agreement, that is, the extent to which raters assign the same category on the rating scale. Absolute agreement is distinct from association.

G. Bove (B)
Dipartimento di Scienze della Formazione, Università degli Studi Roma Tre, Rome, Italy
e-mail: [email protected]
Strong absolute agreement requires strong association, but strong association can exist without strong absolute agreement. Consequently, a measure of association is not necessarily a good measure of absolute agreement. In the following, attention is restricted to the case of a nominal scale, and ordinal or interval scales will not be considered. A consequence of this restriction is that agreement statistics based on Pearson product moment correlations or upon analysis of variance techniques (which assume interval categories) are ruled out.
1.1 Aims of This Contribution

Agreement between two raters who rate each of a sample of targets on a nominal scale is usually assessed with Cohen's kappa (Cohen, 1960). Generalizations of kappa for more than two raters and for targets assessed by different groups of raters have been proposed by many authors (e.g., Fleiss, 1971; Conger, 1980). These indices are used to analyze the agreement between multiple raters for a whole group of targets. Moreover, methods to detect subsets of raters who demonstrate a high level of interobserver agreement were considered, for instance, by Landis and Koch (1977). Less frequently, agreement on a single target has been considered (O'Connell & Dobson, 1984), in spite of the fact that having evaluations of agreement on a single case is particularly useful, for example, in situations where the rating scale is being tested and it is necessary to identify any changes to improve it, or to request from the raters a specific comparison on a single case in which the agreement is poor. Another situation that needs the analysis of agreement on single targets is when teachers in a school are evaluated by a questionnaire administered to different groups of raters (pupils, peers, principal). In this case, comparisons are made between the level of agreement in each group, for each single teacher and for the whole group of teachers. Many of the indices available are not able to measure the agreement when each target (teacher) is evaluated by a completely different set of raters (pupils) or by a different number of raters.

In the next sections, after a brief review of the indices proposed to measure agreement for nominal scales, an index to measure the interrater agreement on single targets is proposed, based on a measure of dispersion for nominal variables. Furthermore, a global measure of agreement on the whole group of targets, obtained as the arithmetic average of the target values of the index, will also be considered and applied to a data set concerning the cause of death of 35 hypertensive patients. The measures proposed allow avoiding paradoxes caused by the dependence of most kappa-type indices on the marginal distributions of the categories of the scale observed for each rater.
1.2 Measures of Interrater Agreement for Nominal Scales

Ratings provided by a group of raters for a group of targets have been presented in different ways. In the case of two raters, the most common data representation is a contingency table with the raters considered as two crossed variables, having the levels of the scale as categories, and entries in each cell given by the number (or proportion) of ratings corresponding to each combination of categories (e.g., Cohen, 1960, Table 1). For multiple (two or more) raters, the case mainly considered in this paper, data are usually represented either (1) as a multivariate matrix targets × raters (see Table 1) or (2) as a distribution of raters by targets and categories (see Table 2), according to the type of measure of interrater agreement to be computed. In Table 1, x_{ig} is the category that rater g assigns to target i. The approach mainly considered to extend coefficient kappa (Cohen, 1960) to the case of multiple raters is based on the representation in Table 2, where r_{ik} is the number of raters who classified target i into category k, r_i is the number of raters who rated target i (equal to R if all raters rate target i), r̄_{.k} is the mean number of raters who classified a target into category k, and r̄ is the mean number of raters who classified a target (equal to R if all raters rate each target).
Table 1 Targets × Raters matrix of ratings

Targets | Rater 1 | Rater 2 | … | Rater g | … | Rater R
1       | x11     | x12     | … | x1g     | … | x1R
2       | x21     | x22     | … | x2g     | … | x2R
⋮       | ⋮       | ⋮       |   | ⋮       |   | ⋮
i       | xi1     | xi2     | … | xig     | … | xiR
⋮       | ⋮       | ⋮       |   | ⋮       |   | ⋮
N       | xN1     | xN2     | … | xNg     | … | xNR
Table 2 Targets × Categories distribution of raters with column averages and row totals

Targets | Categ. 1 | Categ. 2 | … | Categ. k | … | Categ. q | Total
1       | r11      | r12      | … | r1k      | … | r1q      | r1
2       | r21      | r22      | … | r2k      | … | r2q      | r2
⋮       | ⋮        | ⋮        |   | ⋮        |   | ⋮        | ⋮
i       | ri1      | ri2      | … | rik      | … | riq      | ri
⋮       | ⋮        | ⋮        |   | ⋮        |   | ⋮        | ⋮
N       | rN1      | rN2      | … | rNk      | … | rNq      | rN
Average | r̄.1      | r̄.2      | … | r̄.k      | … | r̄.q      | r̄
Extensions of Cohen's kappa (Cohen, 1960) are mostly formulated as

$$\kappa = \frac{P_a - P_e}{1 - P_e},$$

where P_a is the observed proportion of agreement and P_e is the proportion of chance agreement. The idea of adjusting P_a for chance agreement P_e in kappa is that in some situations raters could be unclear about the categorization of a target and could agree by pure chance. The denominator (1 − P_e) is the maximum value that can be assumed by (P_a − P_e). Most extensions share the same definition of P_a, but differ in the expression used to compute chance agreement. The extent of agreement among the r_i raters for the ith target is usually considered as the proportion of agreeing pairs out of all the r_i(r_i − 1)/2 possible pairs of assignments. In the case of balanced data (r_i = R, i = 1, 2, ..., N), this proportion is

$$\sum_{k=1}^{q} \frac{r_{ik}(r_{ik}-1)}{R(R-1)}.$$
The overall proportion of agreement P_a is the arithmetic mean of the proportions of agreement of the N targets,

$$P_a = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{q} \frac{r_{ik}(r_{ik}-1)}{R(R-1)}.$$
For only two raters (the case considered in Cohen (1960)), P_a is directly interpretable as the proportion of joint judgments in which there is agreement (the sum of the proportions positioned along the main diagonal of the contingency table with the raters considered as the two crossed variables). We remark here that the agreement measured by P_a is between pairs of raters (2-agreement type), but joint agreement between three (3-agreement type), four (4-agreement type), or more raters could be considered as well (Conger, 1980; Warrens, 2012). The definition of the proportion of chance agreement P_e is controversial, and it is given in different ways by several authors, according to how they define the concept of agreement by chance (for an extended discussion, see Gwet (2014)). Fleiss' extension of kappa (Fleiss, 1971), for instance, defines

$$P_e = \sum_{k=1}^{q} p_k^2,$$
where $p_k = \frac{1}{N}\sum_{i=1}^{N} \frac{r_{ik}}{R}$ is the proportion of all assignments which were given to the kth category. In this approach, it is assumed that a randomly selected rater assigns category k to a randomly selected target with probability p_k. It follows that, under the hypothesis of rater independence, the proportion of chance agreement for category k is p_k^2, and P_e is the total proportion of chance agreement. If the simple assumption is made that p_k = 1/q, that is, that each randomly selected rater assigns category k to a randomly selected target with the same probability, the kappa index proposed by Brennan and Prediger (1981) is obtained (Marasini et al., 2016 support the utilization of this index and have developed a weighted version for ordinal scales). Light (1971) proposed a slightly different extension that Conger (1980) proved to be equivalent to the average of all pairwise Cohen's kappas for the raters. In this approach, it is assumed that a randomly selected rater assigns category k to a randomly selected target according to his personal marginal distribution of the scale. P_e can be computed as the average of the chance agreements in the contingency tables corresponding to each pair of raters (or by a simplified formula provided in Gwet 2014, p. 53). A conceptually different approach to the definition of the proportion of chance agreement is presented in Gwet (2008), where it is assumed that only an unknown portion of the observed ratings is subject to randomness and that chance agreement occurs when at least one rater rates a target randomly. Then, the estimation problem concerns (1) the conditional probability that raters agree given that one of them (or both) has performed a random rating and (2) the probability that one rater (or both) has performed a random rating (see Gwet, 2008 for details of the estimation problem). The proportion of chance agreement is

$$P_e = \frac{1}{q-1}\sum_{k=1}^{q} p_k (1 - p_k),$$
where $p_k = \frac{1}{N}\sum_{i=1}^{N} \frac{r_{ik}}{R}$ is the proportion of all assignments which were given to the kth category, and the resulting kappa extension is usually named Gwet's AC1 coefficient. The AC1 coefficient does not seem to be influenced by the marginal distributions of the categories of the scale observed for each rater. Another kappa extension, frequently applied in the field of communication, was proposed by Krippendorff (1970, 2012). The values assumed in applications by Krippendorff's kappa extension are very similar to those of Fleiss' kappa extension, especially when there are no missing data and 5 or more raters. All the kappa extensions presented above assume a maximum value equal to 1 in the case of perfect agreement between the ratings of the raters (in this case, P_a = 1), a value equal to 0 when the agreement found in the observed ratings is equal to that obtained as a result of chance (P_a = P_e), and negative values if the agreement is less than that obtained as a result of chance (P_a < P_e). Although not always coincident, indications have been provided on how to interpret the values that can be assumed; we can say that, generally, values lower than 0.6 are found in correspondence with low or moderate levels of agreement, values between 0.6 and
0.8 indicate a good level of agreement, and values above 0.8 indicate an excellent level of agreement. The extensions of kappa have some drawbacks: (1) most of them cannot be computed for only one target (N = 1), because in that case the agreement expected by chance is not defined or based on information statistically not relevant; (2) they are formulated in terms of agreement statistics based on all pairs of raters, but some authors argue that simultaneous agreement among three or more raters can be alternatively considered (e.g., see Warrens, 2012); (3) agreement expected by chance depends on the observed proportions of targets allocated to the categories of the scale by each rater, and this implies that the measure of agreement depends on the marginal distributions of the categories of the scale observed for each rater, a source of some paradoxes (for this aspect, see, e.g., Marasini et al., 2016, where a modification of Fleiss’ kappa, not affected by this dependence, is proposed). In particular, overall measures of agreement allow us to analyze the agreement between multiple raters for a whole group of targets but not for a single target. When the agreement is poor, it is often necessary to identify the targets with low levels of agreement to request the raters for specific comparisons on the single cases rated, and this can change the scale definition in order to improve the consistency of the rating procedure. In the next section, a proposal of a target-specific measure of agreement by O’Connell and Dobson (1984) is discussed, emphasizing some limitations, and a new measure is proposed based on the homogeneity index to measure the dispersion of a nominal variable (e.g., Leti, 1983).
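As a small illustration of the quantities just introduced (not part of the original paper), the following R sketch computes P_a, the Fleiss, Gwet AC1, and Brennan–Prediger chance-agreement terms, and the corresponding kappa-type coefficients from a targets × categories matrix of counts such as Table 2; the function name and the toy data are hypothetical, and balanced data are assumed.

```r
# Kappa-type coefficients from an N x q matrix of counts r[i, k], balanced data
agreement_coefficients <- function(r) {
  R  <- sum(r[1, ])                                   # raters per target
  q  <- ncol(r)
  Pa <- mean(rowSums(r * (r - 1)) / (R * (R - 1)))    # observed proportion of agreement
  pk <- colMeans(r / R)                               # overall category proportions
  Pe_fleiss <- sum(pk^2)                              # Fleiss' chance agreement
  Pe_gwet   <- sum(pk * (1 - pk)) / (q - 1)           # Gwet's AC1 chance agreement
  Pe_bp     <- 1 / q                                  # Brennan-Prediger chance agreement
  c(Fleiss = (Pa - Pe_fleiss) / (1 - Pe_fleiss),
    AC1    = (Pa - Pe_gwet)   / (1 - Pe_gwet),
    BP     = (Pa - Pe_bp)     / (1 - Pe_bp))
}
# Toy example: 4 targets, 3 categories, 5 raters per target
r <- rbind(c(5, 0, 0), c(3, 2, 0), c(1, 1, 3), c(0, 5, 0))
agreement_coefficients(r)
```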
2 Target-specific Measures of Interrater Agreement for Nominal Scales

In this section, a representation of the observed ratings as a multivariate matrix targets × raters is assumed in order to present the index proposed by O'Connell and Dobson (1984). Referring to the response profile in row i of Table 1, a target-specific chance-corrected measure of agreement for several raters using nominal (or ordinal) categories is given by S_i = 1 − D_i/Δ, where D_i represents the overall disagreement on the whole response profile i (i = 1, 2, ..., N) and Δ is the disagreement expected by chance (see O'Connell & Dobson, 1984, Eq. (6)), computed on the basis of the observed proportion of targets that each rater allocates to the categories of the scale. The computation of the overall disagreement D_i can be explained by the small example in Table 3. A disagreement function for the response profile i and a pair of raters is defined as 0 if the raters agree and 1 if they disagree (other choices should be considered for ordinal scales). So, the disagreement between rater 1 and rater 2 is 1, the disagreement
Table 3 Judgements of seven raters on a patient (5-category scale)

Patient | Rater 1  | Rater 2  | Rater 3  | Rater 4  | Rater 5  | Rater 6  | Rater 7
i       | Categ. 4 | Categ. 2 | Categ. 4 | Categ. 2 | Categ. 3 | Categ. 4 | Categ. 3
between rater 1 and rater 3 is 0, etc. The overall disagreement Di on the whole response profile i is obtained as the sum of disagreements on all twenty-one possible pairs of different raters. So, the value Di = 16 is obtained, with respect to a maximum value of disagreement that, with 5 categories and 7 raters, is 19. Si takes the value 1 when there is perfect agreement; it is positive when the agreement is better than chance, and negative otherwise. An overall measure of agreement across subjects Sav can be obtained as the arithmetic average of the Si individual values. The index Si shares the same drawbacks of the overall measures of agreement already discussed: (1) it cannot be computed for only one target (N = 1) because in that case the disagreement expected by chance is not defined; (2) it is formulated in terms of agreement statistics based on pairs of raters; (3) agreement expected by chance depends on the marginal distributions of the categories of the scale observed for each rater. A different approach is considered below, based on the well-known homogeneity index O to measure the dispersion of a nominal variable having q categories (e.g., Leti, 1983), given by O=
q
f k2 ,
k=1
where f k is the observed proportion of ratings in category k (k = 1, 2, ..., q). The index is equal to 1 in the case of maximum homogeneity (perfect agreement), and 1/q in the case of maximum heterogeneity (total disagreement, for each category k is f k = 1/q). O depends on the number of categories, so the normalization in the interval [0,1] is given by Or el = (q O − 1)/(q − 1). It is not difficult to verify that Or el is the normalized variance of the proportions f k (k = 1, 2, ..., q), and it is equal to 1 minus Gini’s heterogeneity index (see also Capecchi & Iannario, 2016). In order to apply the normalized homogeneity index Or el to the distribution of the seven raters with respect to the five categories (Table 4), it is convenient to reformulate the index with respect to subject i as
Or el,i =
2 q q k=1 rRik − 1 (q − 1)
.
22
G. Bove
Table 4 Distribution of the seven raters with respect to the five categories of the scale in Table 3 Patient Category 1 Category 2 Category 3 Category 4 Category 5 Total i
0
2
2
3
0
7
Thus, the target-specific measure of agreement Or el,i is equal to zero for total disagreement and one for perfect agreement. Some experiences with real applications indicate thresholds for the interpretation of Or el,i similar to those presented for kappa in Sect. 2. For the distribution in Table 4, a value Or el,i = 0.18 is obtained, showing a low level of agreement among the seven ratings. An overall measure of agreement on a whole group O r el can be easily obtained as the arithmetic average of the individual values Or el,i as O r el =
N 1 Or el,i . N i=1
Advantages of the new measures proposed are that (1) they can be computed for only one target (Or el,i ), so they can be applied also when each target is rated by a different set of raters and/or the number of raters is different for each target; (2) they are not formulated in terms of agreement statistics based on pairs of raters but in terms of statistical variability of the distributions of the raters with respect to the categories of the scale; (3) they do not depend on the marginal distributions of the categories of the scale observed for each rater, so they are not affected by paradoxes like many of the kappa extensions previously presented. It can be shown that in this particular case the number of raters is equal to the number of categories of the scale (R = q); O r el is equal to Pa . Berry and Mielke (1988) presented seven basic attributes that a measure of agreement should embody; some of them apply to the case of nominal scales considered in this paper. Namely, O r el is constructed by an approach based on statistical variability and cannot be considered a chancecorrected measure of agreement. In our opinion, most of the assumptions that allow the computation of the proportion of chance agreement Pe are unrealistic and produce paradoxes. In line with Gwet (2008), we think that in most applications raters do not assign all ratings by chance, and only a portion of raters assign by chance to a small portion of targets. From this point of view, low values of the target-specific index Or el,i can be an indication of targets for which ratings could be affected by chance assignments and for which the group of raters could be called for a revision. Moreover, as previously remarked, an approach based on statistical variability allows us to avoid some paradoxes and limitations by which some of the kappa extensions are affected (exceptions seem to be Gwet’s AC1 and the extension of kappa proposed in Brennan and Prediger (1981); see Quarfoot and Levin (2016)). Berry and Mielke (1988) advocate that a measure of agreement should be able to analyze multivariate data and evaluate information from more than two observers;
both these requirements are satisfied by O_rel,i and Ō_rel. On the other hand, the approach considered here is mainly descriptive, so O_rel,i and Ō_rel lack an inferential basis, an aspect that should be considered in future developments of this study.
3 Application

The data considered come from a study in which seven nosologists assessed the cause of death of 35 hypertensive patients by using their death certificates (Woolson, 1987). The scores were assigned according to the following categories: 1 = arteriosclerotic disease, 2 = cerebrovascular disease, 3 = other heart disease, 4 = renal disease, and 5 = other disease. The marginal proportions of ratings for the five categories were 0.21, 0.17, 0.19, 0.27, and 0.16, respectively. Some results are presented for the method based on the O_rel index. The target-specific values O_rel,i allowed the detection of a low level of agreement for many evaluations (28.6% of the O_rel,i values were less than 0.4), which calls for a possible revision of the assessment procedure. It is also interesting to analyze some descriptive statistics provided in Table 5 for the comparison of S_i and O_rel,i. The mean values for global agreement are S_av = 0.48 and Ō_rel = 0.56. The S_i values show higher dispersion than the O_rel,i values. The two measures are almost perfectly correlated (r = 0.99). We also computed the values of other indices of interrater agreement: the values of the average Cohen’s kappa (Light’s extension of kappa), Fleiss’ kappa, Krippendorff’s kappa, and Gwet’s AC_1 were approximately equal to 0.48 and coincide with S_av; the value of the Brennan and Prediger kappa was 0.40. So, all indices show a moderate level of agreement between the seven nosologists. It is interesting to point out that if we increase the level of agreement between raters by collapsing the five categories into the two strongly unbalanced categories cerebrovascular disease (marginal proportion 0.17) and all other diseases (marginal proportion 0.83), the values of S_av, Light’s kappa, Fleiss’ kappa, and Krippendorff’s kappa remain almost the same, while the new values of Ō_rel, Gwet’s AC_1, and the Brennan and Prediger kappa increase to 0.75, 0.80, and 0.71, respectively, in accordance with the new high level of agreement determined by the aggregation. It is not uncommon in applications to have highly unbalanced categories; this happens, for example, when a diagnostic category is rare or when for some reason the raters use almost exclusively very few levels of the scale.
Table 5 Some descriptive statistics for S_i and O_rel,i values

           N    Mean   Std. Dev.   CV
S_i        35   0.48   0.27        56.5
O_rel,i    35   0.56   0.23        42.1
The possibility of computing also the target-specific values O_rel,i is an advantage of the proposed approach with respect to Gwet’s AC_1 and the Brennan and Prediger kappa.
4 Conclusion

Generalizations of Cohen’s kappa for analyzing agreement on a nominal scale in the case of more than two raters were reviewed, emphasizing some limitations frequently encountered in their application. These indices are used to analyze the agreement between multiple raters for a whole group of targets. Measures of agreement on a single target have been considered less frequently, in spite of the fact that having evaluations of agreement on a single case is particularly useful. O’Connell and Dobson (1984) proposed a single-target measure of interrater agreement that, however, depends on knowledge of the whole group of ratings and on the marginal distributions of the categories of the scale observed for each rater. A descriptive approach to the analysis of absolute interrater agreement has been proposed here, formulated in terms of the statistical variability of the distributions of the raters with respect to the categories of the scale, rather than in terms of agreement statistics based on pairs of raters. A target-specific measure and a global measure of agreement are proposed that present some advantages with respect to the kappa extensions more frequently encountered in applications. The indices proposed are mainly considered as measures of the size of interrater agreement; therefore, future developments of this study may concern the definition of reliable thresholds useful in applications. Besides, the sampling properties of Ō_rel should be studied and compared with those of competing measures such as Gwet’s AC_1 coefficient. Finally, we notice that a measure of interrater agreement for ordinal data recently proposed and applied in educational studies follows an approach similar to the present proposal (Bove et al., 2021), in which a measure of dispersion for ordinal variables is considered instead of the homogeneity index.
References

Berry, K. J., & Mielke, P. W. (1988). A measure of interrater absolute agreement for ordinal categorical data. Educational and Psychological Measurement, 48, 921–933.
Bove, G., Conti, P. L., & Marella, D. (2021). A measure of interrater absolute agreement for ordinal categorical data. Statistical Methods and Applications, 30(3), 927–945.
Brennan, R. L., & Prediger, D. J. (1981). Coefficient Kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687–699.
Capecchi, S., & Iannario, M. (2016). Gini heterogeneity index for detecting uncertainty in ordinal data surveys. Metron, 74, 223–232.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 213–220.
Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88, 322–328.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29–48.
Gwet, K. L. (2014). Handbook of inter-rater reliability (4th ed.). Gaithersburg, MD, USA: Advanced Analytics, LLC.
Krippendorff, K. (1970). Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30, 61–70.
Krippendorff, K. (2012). Content analysis: An introduction to its methodology. Thousand Oaks, CA: Sage Publications.
Landis, J. R., & Koch, G. G. (1977). An application of hierarchical Kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics, 33, 363–374.
Leti, G. (1983). Statistica descrittiva. Bologna: Il Mulino.
Light, R. J. (1971). Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 76, 365–377.
Marasini, D., Quatto, P., & Ripamonti, E. (2016). Assessing the interrater agreement for ordinal data through weighted indexes. Statistical Methods in Medical Research, 25, 2611–2633.
O’Connell, D. L., & Dobson, A. J. (1984). General observer-agreement measures on individual subjects and groups of subjects. Biometrics, 40, 973–983.
Quarfoot, D., & Levin, R. A. (2016). How robust are multirater interrater reliability indices to changes in frequency distribution? American Statistician, 70(4), 373–384.
Warrens, M. J. (2012). Equivalences of weighted kappas for multiple raters. Statistical Methodology, 9, 407–422.
Woolson, R. F. (1987). Statistical methods for the analysis of biomedical data. New York: Wiley.
Hierarchical Clustering of Income Data Based on Share Densities

Francesca Condino
Abstract Starting from a situation where a reference population of income earners is naturally divided into sub-groups, the aim of this work is to explore the similarity of the sub-populations in terms of income inequality. To this end, we propose to extract information from a particular function, the so-called share density, strongly related to different inequality measures, such as the Gini and Theil indexes. The Jensen-Shannon dissimilarity measure is proposed to evaluate the discrepancy across share densities, and a hierarchical clustering algorithm is employed to find the family of partitions. Results regarding data from the Survey on Households Income and Wealth (SHIW) by the Bank of Italy are shown.

Keywords Tail inequality · Dissimilarity measure · Income concentration
1 Introduction

The analysis of income inequality is a central theme in the economic and social debate, as proven by the wide existing literature on this topic. A great variety of methodologies have been proposed, not only to quantify the concentration level in a population but also to compare different populations, geographical areas, groups and so on. Most of these methodologies are based on the Lorenz curve, proposed by Lorenz (1905) and still widely used today. Starting from the Lorenz curve, different inequality measures and tools have been proposed to explore the disparity in a population and to construct different methodologies for group comparisons. With reference to this second issue, Fields and Fei (1978) identify two approaches, namely the cardinal and the ordinal approach, according to whether the comparison is implemented by means of summary measures or on the basis of the Lorenz domination criterion. Actually, when the comparison across groups of income earners is
limited to the comparison between synthetic inequality index values, such as Gini or Theil, the obtained results can be very inaccurate, since similar values of these measures can hide very different structures in the income distribution. On the other hand, checking the Lorenz dominance can lead to unclear comparisons, since the curves may intersect (Davies & Hoy, 1995). The aim of this work is to develop a procedure for quantifying the dissimilarity across sub-groups on the basis of the whole structure of inequality, rather than of specific indicators. In doing so, we consider a dissimilarity measure borrowed from information theory, a context having many contact points with the inequality field, as proven by the large body of literature that has its roots in the work of Theil (1967) and its successive developments in Cowell (1980a, 1980b), Shorrocks (1980) and Cowell and Kuga (1981), to cite a few. In this context, more recently, Rohde (2016) proposes the use of a symmetric entropy statistic, the J-divergence measure, to study income inequality. In order to consider the whole structure of inequality, the chosen dissimilarity measure will be applied to the so-called share densities, particular functions obtained from the Lorenz curve and related to the probability that a unit of amount, chosen at random, is earned by a specific percentile range of the population. Finally, this dissimilarity will be used for clustering purposes.
2 Lorenz Curve and Share Density: A Parametric Approach

Although many methods to estimate income distribution and its properties are available, here we focus on the parametric approach, which requires the choice of an a priori functional form for income data description. Given a parametric model, it is possible to obtain the Lorenz curve and its properties. To this end, let Y be the continuous random variable (rv) describing income and F(y; θ) its distribution function (df), where θ is the parameter vector. It is well known that, starting from the expression of the df, the parametric Lorenz curve L(u; θ) can be obtained as

L(u; θ) = \frac{1}{\mu} \int_0^u F^{-1}(t; θ)\, dt
where F^{-1}(t; θ) = sup{y | F(y) ≤ t}, for t ∈ [0, 1], is the quantile function and \mu = \int_Y y \, dF(y; θ) is the mean of Y. In the economic literature, the Lorenz curve is a well-known and widely used tool for analysing income inequality. Since its proposal in 1905 (Lorenz, 1905), it has stimulated a great deal of investigation among statisticians and economists, generating a fertile field of study. It is known that each Lorenz curve L(u; θ) (u ∈ [0, 1]) can be viewed as a distribution function on the unit interval, and it is possible to consider its derivative, l(u; θ) = ∂L(u; θ)/∂u, as a parametric density function.
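As an illustration of these two definitions, the Lorenz curve and the share density can be obtained by numerical integration of the quantile function of any parametric income model. The sketch below uses a lognormal distribution with arbitrary parameters purely as a placeholder (the application in Sect. 4 uses the Dagum model instead):

```python
import numpy as np
from scipy import integrate, stats

# Illustrative parametric income model (lognormal); any model with a quantile
# function works in the same way.
s, scale = 0.6, 2.0                                       # hypothetical parameters
F_inv = lambda t: stats.lognorm.ppf(t, s, scale=scale)    # quantile function F^{-1}(t)
mu = stats.lognorm.mean(s, scale=scale)                   # mean income

def lorenz(u):
    # L(u; theta) = (1/mu) * integral_0^u F^{-1}(t; theta) dt
    val, _ = integrate.quad(F_inv, 0.0, u)
    return val / mu

def share_density(u):
    # l(u; theta) = dL/du = F^{-1}(u; theta) / mu
    return F_inv(u) / mu

print(lorenz(0.5), share_density(0.5))   # cumulative income share and share density at the median
```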
This function has rarely been mentioned and used to explore inequality, although it furnishes different information about it, as suggested by Rohde (2008), who has shown that the two well-known Theil inequality indexes, T_L and T_T, can be directly obtained from l(u). In particular, Theil’s T_T index coincides with the Shannon entropy of l(u), changed in sign. Some reference to the Lorenz density can be found in Farris (2010), where this curve is referred to as share density. Afterwards, the concept of Lorenz density is recalled in Zizler (2014), Kämpke and Radermacher (2015) and Shao (2021). It is worth noting that comparing the Gini or Theil index values between groups is equivalent to comparing specific characteristics of their corresponding share densities. Therefore, here we want to quantify the discrepancy among groups of income earners by quantifying the dissimilarity between the whole share densities. To do this, the choice of a proper measure is required.
3 Hierarchical Algorithm Based on JS Dissimilarity

Let us consider a reference population naturally divided into a set E of K groups of income earners, i.e. E = {ω_1, ..., ω_K}, each characterized by its own share density function l_1, ..., l_K. In this paper, we consider an agglomerative clustering algorithm, starting with K distinct non-overlapping clusters (C_1, ..., C_K), each containing a single object. The algorithm is quite similar to that used for a classical data structure of the type units × variables, and the main difference regards the calculation of the dissimilarity matrix, which must take into account the particular nature of the objects to be partitioned and their descriptors, namely the share densities. In order to analyse the existing differences among the various groups of income earners, this dissimilarity measure will be considered in connection with the Lorenz density. Therefore, let F_1, ..., F_K (for brevity of notation, the arguments of the functions are omitted) be the parametric df corresponding to K different groups of income earners. As previously mentioned, we can obtain the corresponding Lorenz curves L_1, ..., L_K and the corresponding derivatives with respect to u, l_1, ..., l_K. With the aim of quantifying the discrepancy between each pair of share densities, of all possible measures of dissimilarity among density functions, in this work we consider the Jensen-Shannon (JS) dissimilarity, also called total divergence to the average. This is an information theoretic measure and, therefore, strongly related to the entropy-based measures of inequality originally introduced by Theil (1967). Hence, given two share densities, l_k and l_{k'}, the JS divergence is given by

D_{JS}(l_k, l_{k'}) = H(l_m) - \pi_k H(l_k) - \pi_{k'} H(l_{k'})    (1)

where l_m = \pi_k l_k + \pi_{k'} l_{k'} is their mixture. The weights \pi_k and \pi_{k'}, as suggested by Bishop et al. (2003), represent the income shares for the k-th and k'-th group, respectively. The JS dissimilarity can be alternatively written as

D_{JS}(l_k, l_{k'}) = \pi_k D_{KL}(l_k \| l_m) + \pi_{k'} D_{KL}(l_{k'} \| l_m)    (2)
where D_{KL}(l_j \| l_m) = \int_0^1 l_j(u) \log \frac{l_j(u)}{l_m(u)}\, du, for j = k, k', is the so-called Kullback-Leibler divergence between each share density and the mixture l_m. From (1), it can be noted that the JS divergence directly depends on Theil’s T_T index for the two groups, since the entropies of the share densities are involved; moreover, it depends on Theil’s T_T index for the mixture of the two sub-populations, changed in sign. Therefore, it represents the difference between Theil’s inequality within the two groups and Theil’s inequality under the hypothesis that individuals belong to the population made up of the two mixture components. From (2), it is also evident that this measure takes into account the discrepancy of each function from the mixture along the whole unit interval, so that it will be influenced by any existing deviation that may occur in some segment of the examined population. As a consequence, the clustering procedures based on the JS divergence will exploit these discrepancies. In this paper, we consider the hierarchical clustering procedure that seeks to progressively merge groups of income earners having the smallest JS dissimilarity in terms of share density functions. Therefore, the starting point is represented by the dissimilarity matrix D = {d_{JS}(l_k, l_{k'})}, whose elements are the JS dissimilarities between each pair of share densities, computed through (1), for k ≠ k' = 1, ..., K. Starting from the initial partition P^{(1)} = (C_1, ..., C_K), where C_k, for k = 1, ..., K, contains one observation ω_k, at each step s the two closest clusters in P^{(s)}, i.e. the clusters for which d(C_k, C_{k'}) in D^{(s)} is the smallest, are merged, until the final partition P^{(K)}, made up of a single cluster containing all the objects ω_1, ..., ω_K, is obtained. As for the classical approach, different criteria can be used to compute the distances d(C_k, C_{k'}) between clusters and update the partition.
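A minimal sketch of how the JS dissimilarity of Eq. (1) might be evaluated numerically for two share densities; the densities l(u) = a·u^(a−1) (i.e., Lorenz curves L(u) = u^a) and the income-share weights below are illustrative placeholders, not values from the paper:

```python
import numpy as np
from scipy import integrate

# Illustrative share densities l(u) = a * u**(a - 1); in the paper they are
# derived from a parametric income model (the Dagum model).
def make_share_density(a):
    return lambda u: a * u ** (a - 1)

def entropy(l):
    # Shannon entropy H(l) = -int_0^1 l(u) log l(u) du
    val, _ = integrate.quad(lambda u: l(u) * np.log(l(u)), 0.0, 1.0)
    return -val

def js_dissimilarity(lk, lk2, pi_k, pi_k2):
    # Equation (1): D_JS = H(l_m) - pi_k H(l_k) - pi_k' H(l_k')
    lm = lambda u: pi_k * lk(u) + pi_k2 * lk2(u)   # mixture share density
    return entropy(lm) - pi_k * entropy(lk) - pi_k2 * entropy(lk2)

l1, l2 = make_share_density(1.6), make_share_density(2.1)
print(js_dissimilarity(l1, l2, 0.45, 0.55))        # hypothetical income-share weights
```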
4 An Application

In this section, data from the Survey on Households Income and Wealth (SHIW), carried out by the Bank of Italy in 2016, are considered. To take into account the composition of households, equivalent incomes (in tens of thousands) are obtained using the OECD-modified equivalence scale. For this application, we consider the Dagum model (Dagum, 1977, 1980) to fit the income distribution. Indeed, the Dagum model has been widely used to study income and wealth, thanks to its ability to reproduce particular features of these kinds of data. Kleiber and Kotz (2003) and Kleiber (2008) provided an exhaustive description of the genesis of this model and a comprehensive review of its characteristics. Here, we recall that the Dagum rv has positive support and its df is given by F_{Da}(y; β, λ, δ) = (1 + λ y^{-δ})^{-β}, where all the involved parameters are positive. In particular, λ is a scale parameter and β and δ are shape parameters. In order to obtain the estimates for each region, we consider the maximum likelihood estimation method and income data at the regional level. Having obtained the estimates \hat{β}_k, \hat{λ}_k, \hat{δ}_k, for k = 1, ..., 20, it is possible to compute some synthetic values describing various population features, such as the average income \hat{μ}_k and
Table 1 Fitted means, negative entropy and Gini index for Italian regions

Regions           μ̂_k      −Ĥ(l_k)   Ĝ_k
Piedmont          2.0797   0.1277    0.2751
Aosta Valley      2.3134   0.1195    0.2663
Veneto            1.9739   0.1244    0.2706
Friuli            2.3140   0.1226    0.2693
Emilia Romagna    2.3971   0.1141    0.2602
Tuscany           2.3648   0.1131    0.2588
Abruzzo           1.9542   0.1237    0.2717
Calabria          1.3472   0.1425    0.2910
Sardinia          1.5772   0.1361    0.2843
Lombardy          2.4798   0.1632    0.3064
Molise            1.7789   0.1873    0.3294
Campania          1.3461   0.1815    0.3252
Apulia            1.4558   0.1557    0.3026
Basilicata        1.5191   0.1850    0.3287
Sicily            1.4610   0.1633    0.3071
Trentino          2.2247   0.1008    0.2408
Liguria           2.2482   0.1356    0.2782
Umbria            1.9897   0.1024    0.2456
Marche            2.1809   0.1053    0.2475
Lazio             1.9972   0.1437    0.2883
the Gini inequality index Ĝ_k. Moreover, for this model it is easy to obtain the derivative of the Lorenz curve and, consequently, through numerical integration, its entropy. All these results are reported in Table 1. For each pair of regions, the mixture of the corresponding Lorenz densities is considered, and the elements of the JS dissimilarity matrix D^{(1)} are computed to initialize the above-described algorithm. Table 2 reports the obtained values, in thousandths, from which it is evident that the minimum value is obtained for the pair {Aosta Valley, Emilia Romagna}. Therefore, these two regions are merged in the first step of the algorithm. On the other hand, the maximum value of the JS dissimilarity is obtained for the pair {Trentino, Campania}. To compute the distances between clusters, the unweighted pair group method with averaging (UPGA) is chosen, after comparing its performance with that of different agglomeration methods based on the dissimilarity matrix D. In particular, the value of the cophenetic correlation coefficient from the UPGA method (0.6286) is higher than those obtained from the weighted pair group method with averaging (WPGA; 0.5920), the complete (0.6176) and the single (0.5127) method, suggesting a more faithful representation of the original pairwise dissimilarities among regions.
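Given the JS dissimilarity matrix, the agglomeration step and the cophenetic correlation can be obtained with standard hierarchical-clustering routines; the sketch below uses average linkage (UPGMA) on a small random placeholder matrix instead of the SHIW-based values of Table 2:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import squareform

# D: symmetric K x K matrix of pairwise JS dissimilarities between share densities
# (here a random placeholder instead of the regional values of Table 2).
rng = np.random.default_rng(0)
K = 6
A = rng.random((K, K)) / 1000.0
D = (A + A.T) / 2.0
np.fill_diagonal(D, 0.0)

d_condensed = squareform(D)                        # condensed form required by scipy
Z = linkage(d_condensed, method="average")         # UPGA = average linkage (UPGMA)

coph_corr, _ = cophenet(Z, d_condensed)            # cophenetic correlation coefficient
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into three clusters
print(coph_corr, labels)
```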
Aosta Valley Lombardy Trentino Veneto Friuli Liguria E. Romagna Tuscany Umbria Marche Lazio Abruzzo Molise Campania Apulia
0.083 0.865 0.973 0.144 0.095 0.615 0.178 0.213 0.503 0.709 0.435 0.163 0.731 1.319 0.414
Piedmont
0.342 0.478 0.106 0.082 0.400 0.069 0.077 0.215 0.245 0.276 0.143 1.838 0.670 0.361
1.477 0.761 0.642 0.513 1.335 1.317 1.283 1.452 0.329 0.891 0.302 0.455 0.341
Aosta Valley Lombardy
0.724 0.880 0.777 0.604 0.527 0.262 0.158 1.136 1.710 3.126 3.576 2.322
Trentino
Table 2 JS distance matrix elements for Italian regions (in thousandths)
0.104 0.425 0.157 0.155 0.368 0.451 0.348 0.371 1.040 1.586 0.604
Veneto
0.587 0.097 0.110 0.399 0.484 0.411 0.278 1.644 1.462 0.615
Friuli
0.668 0.623 0.776 0.604 0.280 1.311 1.189 1.557 0.900
Liguria
0.073 0.220 0.390 0.662 0.216 1.058 2.072 0.827
0.182 0.320 0.651 0.282 1.146 2.176 0.901
E. Romagna Tuscany
(continued)
0.133 0.913 0.873 2.706 2.972 1.696
Umbria
Lazio Abruzzo Molise Campania Apulia Basilicata Calabria Sicily Sardinia
Basilicata Calabria Sicily Sardinia
0.680 0.168 0.774 0.123 Marche 0.899 1.070 2.113 3.146 1.755 2.162 1.218 1.818 1.161
Piedmont
Table 2 (continued)
0.832 0.623 0.940 0.411 0.668 0.490 0.346 0.507
1.907 1.546 0.706 1.677 0.254 1.222 0.184
0.334 0.461 0.210 0.604 Abruzzo
Aosta Valley Lombardy
1.738 0.321 0.480 0.160 Lazio
0.198 0.417 0.232 1.079 0.354 1.045
3.241 1.972 2.165 1.761 Molise
Trentino
0.409 0.185 0.666 0.427 0.983
1.021 0.367 0.821 0.296 Campania
Veneto
0.366 0.232 0.308 0.307
1.573 0.388 0.860 0.242 Apulia
Friuli
0.888 0.399 0.910
1.310 0.993 0.673 0.946 Basilicata
Liguria
0.578 0.097
1.012 0.347 1.247 0.310 Calabria
0.706
1.108 0.413 1.280 0.374 Sicily
E. Romagna Tuscany
2.706 1.239 1.822 1.027
Umbria
Fig. 1 Dendrogram based on UPGA agglomeration method
Therefore, according to the UPGA agglomeration method, given two clusters C_j and C_{j'}, consisting of N_j and N_{j'} regions, respectively, and belonging to the partition P^{(s)}, the distance between them is

d(C_j, C_{j'}) = \frac{1}{N_j \cdot N_{j'}} \sum_{ω_k \in C_j} \sum_{ω_{k'} \in C_{j'}} d_{JS}(l_k, l_{k'})
where l_k is the share density for the region ω_k ∈ C_j and l_{k'} is the share density for the region ω_{k'} ∈ C_{j'}, j ≠ j'. The dendrogram in Fig. 1 suggests the presence of three or four clusters. In particular, by considering a partition made up of three clusters, we obtain the following groups:

C_1 = {Abruzzo, Piedmont, Sardinia, Friuli, Emilia Romagna, Aosta Valley, Calabria, Veneto, Tuscany, Liguria, Lazio}
C_2 = {Molise, Campania, Basilicata, Sicily, Lombardy, Apulia}
C_3 = {Trentino, Marche, Umbria}.

As we can see from the results reported in Table 3, the clusters seem clearly characterized, with regions having generally lower concentration of income belonging to clusters 1 and 3, and regions with higher concentration levels included in cluster 2. Moreover, the element that mainly differentiates cluster 1 from cluster 3 seems to be the different concentration level among the poorest households. Indeed, if we look at the Gini index for the poorest households (G^(L)), i.e., households having income lower than the median, and the Gini index for the richest households (G^(U)), having income higher than the median, we note that the regions belonging to cluster 1 tend to show higher inequality in the left tail than the regions belonging to cluster 3. This phenomenon is less evident in the Liguria region, as confirmed by the silhouette plot in Fig. 2, and, when four clusters are considered,
Table 3 Observed Gini index for total households (G_k), for poorer (G_k^(L)) and richer earners (G_k^(U)) and membership cluster for Italian regions

Regions           G_k      G_k^(L)   G_k^(U)   Cluster
Piedmont          0.2756   0.1854    0.1717    1
Aosta Valley      0.2594   0.1967    0.1627    1
Veneto            0.2848   0.1818    0.1937    1
Friuli            0.2699   0.1860    0.1593    1
Emilia Romagna    0.2601   0.1761    0.1616    1
Tuscany           0.2562   0.1631    0.1602    1
Abruzzo           0.2589   0.1764    0.1405    1
Calabria          0.2743   0.1982    0.1651    1
Sardinia          0.2851   0.2056    0.1631    1
Liguria           0.2845   0.1572    0.2032    1
Lazio             0.2878   0.1830    0.1840    1
Lombardy          0.3096   0.1804    0.2249    2
Molise            0.3167   0.1663    0.2133    2
Campania          0.3216   0.2210    0.2047    2
Apulia            0.2922   0.1879    0.1848    2
Basilicata        0.3273   0.2242    0.1938    2
Sicily            0.3064   0.1965    0.2027    2
Trentino          0.2758   0.1383    0.2315    3
Umbria            0.2451   0.1555    0.1503    3
Marche            0.2415   0.1426    0.1465    3

Fig. 2 Dendrogram and silhouette plot
it constitutes a new cluster together with Lazio, while the other two clusters remain unchanged. In this case, the average silhouette increases from 0.53 to 0.58.
5 Conclusion

In this paper, a non-conventional hierarchical clustering procedure is proposed with the aim of grouping sub-populations of income earners having a similar structure in terms of inequality. In this context, the objects to be partitioned are described by means of the derivative of the Lorenz curve, the so-called share density. As shown by different authors, this function furnishes a variety of information regarding inequality. Therefore, a measure of the discrepancy across density functions, strongly related to information theory, namely the JS dissimilarity, is considered to quantify the existing differences among the inequality structures of different sub-populations. The application to real data shows the potential of this proposal. A parametric approach based on the Dagum model is considered to fit the income data, and the obtained findings suggest that this method allows us to properly partition the share densities on the basis of their global and local behaviour. Indeed, through this approach it is possible to take into account inequality in different segments of the population and to explore in more depth the different structures of inequality across groups.
References

Billard, L., & Diday, E. (2020). Clustering methodology for symbolic data. Hoboken, NJ: Wiley.
Bishop, J. A., Chow, K. V., & Zeager, L. A. (2003). Decomposing Lorenz and concentration curves. International Economic Review, 44, 965–978.
Chotikapanich, D., Griffiths, W. E., Hajargasht, G., Karunarathne, W., & Rao, D. S. P. (2018). Using the GB2 income distribution. Econometrics, 6, 21.
Cowell, F. A. (1980a). Generalized entropy and the measurement of distributional change. European Economic Review, 13, 147–159.
Cowell, F. A. (1980b). On the structure of additive inequality measures. The Review of Economic Studies, 47, 521–531.
Cowell, F. A., & Kuga, K. (1981). Additivity and the entropy concept: An axiomatic approach to inequality measurement. Journal of Economic Theory, 25, 131–143.
Dagum, C. (1977). A new model of personal distribution: Specification and estimation. Economie Appliquée, 30, 413–437.
Dagum, C. (1980). The generation and distribution of income, the Lorenz curve and the Gini ratio. Economie Appliquée, 33, 327–367.
Dancelli, L. (1989). Confronti fra le curve di concentrazione Z(p) e L(p) nel modello di Dagum. Statistica Applicata, 1, 399–415.
Davies, J., & Hoy, M. (1995). Making inequality comparisons when Lorenz curves intersect. The American Economic Review, 85(4), 980–986.
Farris, F. A. (2010). The Gini index and measures of inequality. American Mathematical Monthly, 117(10), 851–864.
Fields, G. S., & Fei, J. C. H. (1978). On inequality comparisons. Econometrica, 46(2), 303–316.
Kämpke, T., & Radermacher, F. (2015). Income modeling and balancing: A rigorous treatment of distribution patterns. Switzerland: Springer.
Kleiber, C., & Kotz, S. (2003). Statistical size distributions in economics and actuarial sciences. Wiley.
Kleiber, C. (2008). The Lorenz curve in economics and econometrics. London: Routledge.
Lorenz, M. O. (1905). Methods of measuring the concentration of wealth. Publications of the American Statistical Association, 9(70), 209–219.
Rohde, N. (2008). Lorenz curves and generalised entropy inequality measures. In D. Chotikapanich (Ed.), Modeling income distributions and Lorenz curves. Economic Studies in Equality, Social Exclusion and Well-Being (Vol. 5). New York: Springer.
Rohde, N. (2016). J-divergence measurements of economic inequality. Journal of the Royal Statistical Society. Series A (Statistics in Society), 179(3), 847–870.
Shao, B. (2021). Decomposition of the Gini index by income source for aggregated data and its applications. Computational Statistics, Epub ahead of print: 1–25.
Shorrocks, A. F. (1980). The class of additively decomposable inequality measures. Econometrica, 48, 613–625.
Theil, H. (1967). Economics and information theory. Amsterdam: North Holland.
Zizler, P. (2014). Gini indices and the moments of the share density function. Applications of Mathematics, 59, 167–175.
Optimal Coding of High-Cardinality Categorical Data in Machine Learning

Agostino Di Ciaccio
Abstract Analyzing categorical data in machine learning generally requires a coding strategy. This problem is common to multivariate statistical techniques, and several approaches have been suggested in the literature. This article proposes a method for analyzing categorical variables with neural networks. Both supervised and unsupervised approaches are considered, in which the variables can have high cardinality. Some simulated data applications illustrate the interest of the proposal.

Keywords Encoding categorical data · Neural networks · High-cardinality attributes · Optimal scaling
1 Introduction

Most machine learning algorithms cannot be applied directly to categorical data, which are generally non-numeric. Their application therefore requires some form of encoding that transforms the categorical features into one or more numeric variables. This problem can pose a serious difficulty if the variables have many categories, a common situation with big data, which generally include variables with mixed measurement levels. We could have categorical variables with tens or hundreds of categories, e.g., names of provinces, postcodes, ATECO codes, lists of company names, lists of product names, and so on. A further complication is that some categories that we might observe in the future were not observed in the training data. To analyze qualitative variables together with quantitative variables, we can create for each categorical variable an embedding of the categories in a space with one or more dimensions.
The embedding of strings in a multidimensional space is a well-established procedure in Natural Language Processing (NLP) (Bengio et al., 2003) and is the inspiration for this paper. Many encoding systems for supervised and unsupervised models have been proposed in the literature, but they have several weaknesses. In a previous paper (Di Ciaccio, 2020), a technique was proposed, the Low Embedding Encoder (LEE), which extends the concept of quantification using an approach similar to word embedding in Natural Language Processing. The aim of LEE was to analyze categorical variables with high cardinality using neural networks in a supervised approach. In this paper, this technique is extended to consider also an unsupervised approach, pointing out some relationships with consolidated techniques such as Correspondence Analysis (Benzécri, 1973) or Optimal Scaling (Gifi, 1990). In Sect. 2, there is a review of the techniques proposed in the literature, in both supervised and unsupervised approaches, showing some weaknesses of each technique with reference to a machine learning context. In Sect. 3, some approaches based on dummy variables are analyzed; then, in the successive sections, we show a non-linear extension using neural networks, in both the supervised and unsupervised cases. Some examples highlight the interest of our proposal.
2 Quantify Categorical Features: A Review of Existing Methods

In the literature, many methods have been proposed to encode categorical variables (a recent review is Hancock et al., 2020). The classification of the encoding methods can consider several elements: how the method takes into account the variable to predict (the target, if available), how the coding is influenced by the other explanatory variables, how many new variables are created, and what the aim of the analysis is. Several encoding methods for ML have been proposed in the literature (see, for example, Potdar et al., 2017), mainly in a supervised approach. The scikit-learn software library (https://scikit-learn.org/) allows one to apply 17 different methods. Here, we will consider the most used techniques, without claiming to be exhaustive, distinguishing them into three groups: encoders that do not use the target or other data; encoders requiring only the target; encoders that use dummy variables.
2.1 Methods that Do Not Consider the Target or the Other Variables

These methods assign the encoding to the categorical variable according to a criterion that does not involve the target or other data. In this way, there is no risk of overfitting, but the encodings obtained are extraneous to the dataset, and the result largely depends
on the researcher’s choices or is essentially a random quantification and therefore not interpretable. A crude way is to assign integer values to the categories (label encoding), possibly respecting the natural category order (ordinal encoding). The result is a single numerical variable. Note that all the possible categories must be known before assigning the quantifications (integers). The hash encoder uses a hash function to convert the K categories of a variable to s

if X_1 > X_2 and X_3 < 100  ⇒  Y ∼ N(20, 1.5)
if X_1 ≤ X_2 and X_3 < 100  ⇒  Y ∼ N(10, 1.5)
in all other cases              Y ∼ N(1, 1.5)

There are only 3 expected values E(Y | x_1, x_2, x_3), i.e., (1, 10, 20), so an optimal regression model should predict these values. Note that the expected value of Y depends on the interaction of the three categorical variables and that the three conditional distributions of Y overlap in the tails (Fig. 2). The dataset was then split into a training set (50%) and a test set (50%). Regression algorithms such as MORALS or regression trees cannot make a satisfactory prediction on these data unless all the interaction terms are introduced explicitly into the model, producing thousands of dummy variables. On the contrary, neural networks can autonomously detect the interactions among the variables. Then, for this simulation, after encoding the categorical variables, a small neural network was chosen to predict the target Y. The neural network includes an input layer, two hidden layers with 8 and 3 neurons (ELU activation function), and 1 output neuron with a linear activation function. As for data leakage, which is always possible in the supervised case, it can be controlled with the usual tools of neural networks, such as simplifying the architecture, introducing dropout layers, or adding L1/L2 penalizations. Applying the LEE approach, each categorical variable is considered a separate input coded as OHE, followed by one dense layer with 2 neurons (p = 2) and no bias.
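A sketch of how the supervised LEE architecture described above might be assembled in Keras/TensorFlow; the cardinalities of the three categorical inputs are hypothetical placeholders, since the full simulation settings are not reproduced here:

```python
import tensorflow as tf

# Hypothetical cardinalities of the three categorical predictors.
cards = {"X1": 200, "X2": 200, "X3": 150}
p = 2                                              # size of the LEE quantification

inputs, embedded = [], []
for name, K in cards.items():
    x = tf.keras.Input(shape=(K,), name=name)               # one-hot coded variable
    z = tf.keras.layers.Dense(p, use_bias=False)(x)          # LEE: linear quantification in R^p
    inputs.append(x)
    embedded.append(z)

h = tf.keras.layers.Concatenate()(embedded)
h = tf.keras.layers.Dense(8, activation="elu")(h)            # first hidden layer
h = tf.keras.layers.Dense(3, activation="elu")(h)            # second hidden layer
out = tf.keras.layers.Dense(1, activation="linear")(h)       # regression output

model = tf.keras.Model(inputs, out)
model.compile(optimizer="adam", loss="mse")
model.summary()
```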
Table 2 Comparison between three encoding approaches (200 iterations)

                 MSE-train   MSE-test   n. parameters
OHE              2.11        6.18       4839
LEE              2.62        4.64       1287
Target encoder   7.22        7.29       1673
Fig. 3 Distribution of the target Y in the simulation for OHE (left) and LEE (right)
If we want to avoid the sparse coding matrices of OHE, an embedding layer can be used for each original categorical variable, without creating the dummy variables (for example, an embedding layer is available in the TensorFlow software library, https://www.tensorflow.org/). To verify that the results obtained, shown in Table 2, do not depend on the choice of the supervised model applied, a check was also made with other network architectures. The Target Encoder obtained bad results also on the training set, but with far fewer parameters (only 63); therefore, this encoder was applied with a bigger neural network, with 38 neurons in each hidden layer, to obtain an equivalent number of parameters in the comparison. Despite the increase in the number of parameters, the result is poor even on the training set, since this encoding does not allow the identification of interactions. In Fig. 3, it is possible to compare the performance of the OHE and LEE encoders in allowing the neural network to identify the expected values of Y. In particular, the true Y values (ordinates) and the estimated Y values (abscissas) are shown for OHE (left) and LEE (right). In the optimal result, the predicted Y should have only 3 values: 1, 10, and 20.
5 Non-linear Encoding in the Unsupervised Case

Using an autoencoder neural network, it is also possible to reproduce, with some effort, the quantifications provided by HOMALS. To obtain this result, it is necessary to build an encoder-average-decoder NN with three internal layers to reproduce the loss function (4).
Table 3 A frequency table showing association between X and Y

X/Y     a      b      c      d      e      Total
A       801    100    100    100    100    1201
B       100    800    100    100    100    1200
C       100    100    800    100    100    1200
D       100    100    100    800    100    1200
E       100    100    100    100    800    1200
Total   1201   1200   1200   1200   1200   6001
The encoder will consist of the LEE encoding, i.e., the introduction of the variables through OHE, followed by a small dense layer with linear activation for each variable with size p, and then one average layer. The decoder will be constructed inversely to the encoder, with, in the output, the variables encoded as OHE and the same parameters as the encoder. Finally, to obtain the same result as (4), it would also be necessary to impose orthogonality and normalization constraints, followed by an appropriate rotation of the components obtained. Obviously, all this would not make sense, as with much less effort we can use the elegant analytical solution provided by MCA. On the other hand, this scheme can easily be modified to obtain a non-linear solution simply by using non-linear activation functions in the decoder instead of linear functions, without imposing any constraints on the parameters. The coding achieved in this way can be much more effective than that provided by HOMALS/MCA. As an example, let us consider only two categorical variables, X and Y, each with 5 categories, respectively (A, B, C, D, E) and (a, b, c, d, e), which give rise to the frequency Table 3. The strong associations of the pairs of categories (A, a), (B, b), (C, c), (D, d), (E, e) are evident from the table. We would therefore expect a representation on two components that highlights these associations, which is the main information inside the table: a representation in which those pairs are close while there is an approximate equidistance from the other categories. By applying MCA, the first 4 components have the same eigenvalue and are all necessary to obtain a satisfactory representation of the categories. Figure 4 shows the result obtained on the first 2 components with MCA (on the left) and with the non-linear version just described (on the right). Note how the presence of only one more unit for the pair (A, a) completely distorts the result of MCA.
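A minimal sketch of the non-linear encoder-average-decoder scheme just described, again in Keras/TensorFlow; the tanh activation in the decoder, the reconstruction loss, and the training details are illustrative choices not specified in the text:

```python
import tensorflow as tf

cards = [5, 5]          # two categorical variables X and Y with 5 categories each (Table 3)
p = 2                   # dimension of the quantification space

inputs, codes = [], []
for j, K in enumerate(cards):
    x = tf.keras.Input(shape=(K,), name=f"var_{j}")            # one-hot input
    codes.append(tf.keras.layers.Dense(p, use_bias=False)(x))  # linear encoder per variable
    inputs.append(x)

z = tf.keras.layers.Average()(codes)                            # average layer

outputs = []
for j, K in enumerate(cards):
    h = tf.keras.layers.Dense(p, activation="tanh")(z)          # non-linear decoder branch
    outputs.append(tf.keras.layers.Dense(K, activation="softmax", name=f"rec_{j}")(h))

autoencoder = tf.keras.Model(inputs, outputs)
# categorical cross-entropy is a natural choice for reconstructing one-hot outputs
autoencoder.compile(optimizer="adam", loss="categorical_crossentropy")
```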
Fig. 4 Categorical encoding for Correspondence Analysis (left) and LEE (right) on data of Table 3

6 Conclusions

The proposed encoding method LEE makes it possible to apply neural networks to data with high-cardinality categorical variables, reducing the number of parameters and the memory resources required. The simulation results obtained show an increased predictive capacity of the neural network, thanks to the more efficient encoding. Even in the unsupervised case, this approach could provide more effective and interpretable non-linear quantifications than classical methods.
References

Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. The Journal of Machine Learning Research, 3, 1137–1155.
Benzécri, J. P. (1973). L’Analyse des Données. Tome II: L’Analyse des Correspondances. Paris: Dunod.
De Leeuw, J. (1973). Canonical analysis of categorical data. Ph.D. thesis, University of Leiden, The Netherlands.
De Leeuw, J., Young, F. W., & Takane, Y. (1976). Additive structure in qualitative data: An alternating least squares method with optimal scaling features. Psychometrika, 41(4), 471–503.
Di Ciaccio, A. (2020). Categorical encoding for machine learning. In A. Pollice et al. (Eds.), Book of short papers SIS2020. Pearson Italia.
Gifi, A. (1990). Nonlinear multivariate analysis. New York: Wiley.
Guo, C., & Berkhahn, F. (2016). Entity embeddings of categorical variables. arXiv:1604.06737.
Hancock, J. T., & Khoshgoftaar, T. M. (2020). Survey on categorical data for neural networks. Journal of Big Data, 7, 28. https://doi.org/10.1186/s40537-020-00305-w
https://scikit-learn.org/
https://www.tensorflow.org/
Meulman, J., van der Kooij, A. J., & Duisters, K. L. W. (2019). ROS Regression: Integrating regularization with optimal scaling regression. Statistical Science, 34(3), 361–390.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). https://doi.org/10.3115/v1/D14-1162
Potdar, K., Pardawala, T. S., & Pai, C. D. (2017). A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175, 4.
Yong-Xia, Z., & Ge, Z. (2010). MD5 research. In 2010 Second International Conference on Multimedia and Information Technology, Kaifeng (pp. 271–273).
Young, F. W., De Leeuw, J., & Takane, Y. (1976). Regression with qualitative and quantitative variables: An alternating least squares method with optimal scaling features. Psychometrika, 41(4).
Bayesian Multivariate Analysis of Mixed Data

Chiara Galimberti, Federico Castelletti, and Stefano Peluso
Abstract Graphical models provide an effective tool to represent conditional independences among variables. While this class of models has been extensively studied in the Gaussian and categorical settings separately, state-of-the-art literature which jointly models the two types of data is narrow. However, mixed data are widespread in many applications where both continuous and categorical measurements are available, such as genomics or industrial machine processes. In this paper, we propose a Bayesian framework for the analysis of mixed data. Specifically, we define a likelihood function for n observations following a conditional Gaussian distribution and assign suitable priors to model parameters. We develop an MCMC scheme for approximate posterior inference in two alternative model parameterizations and a structure learning algorithm for related undirected graph models. Keywords Conditional Gaussian distribution · Undirected graph · Graphical model · Marginal likelihood · Mixed variables · Structure learning
1 Introduction

Graphical models provide a powerful framework to represent conditional dependence structures in multivariate distributions (Lauritzen, 1996). Typically, the graphical model generating the observations is unknown, and inferring it from the data is possible using structure learning methodologies. Several contributions for structure learning of graphical models given continuous (Gaussian) or discrete/categorical data are available in the literature; see for instance Kalisch and Bühlmann (2007) and Meinshausen and Bühlmann (2006). However, mixed-type data are extremely diffuse in many contexts where both continuous and categorical measurements are available. One instance is nanotechnology, which develops functional structures designed at the atomic or molecular scale, related to optoelectronics, luminescent materials, lasing materials, and biomedical imaging (Hodes et al., 1987; Bawendi et al., 1989; Ma et al., 2004). In particular, nanostructure data may be represented as categorical data, while the variables involved in the process of measuring the nanostructures are continuous. However, the literature on graphical models for mixed-type data is narrow, especially in the Bayesian framework. A few recent works on graphical models for mixed data with a frequentist approach are available in the literature. Some of them propose parameter estimation methods for models belonging to exponential families, based on node-wise conditional generalized linear regressions; see Chen et al. (2015), Yang et al. (2012), and Yang et al. (2014). Other contributions work in the context of the Conditional Gaussian (CG) distribution introduced by Lauritzen and Wermuth (1989); in particular, Lee and Hastie (2015) propose a maximum pseudo-likelihood approach under the assumption of constant conditional covariance for the continuous variables, while Cheng et al. (2017) fit separate regression models for each variable with a weighted lasso penalty. The more recent Bayesian methodology of Bhadra et al. (2018) allows one to infer a network structure given mixed data by adopting Gaussian scale mixtures, while Zareifard et al. (2021) propose a Gibbs sampler algorithm to estimate a Directed Acyclic Graph (DAG) in the presence of continuous and discrete observations. In this contribution, we focus on undirected graphs (UGs). The aim of our study is to develop a Bayesian methodology for structure learning and inference of UG models from mixed data. The starting point of our model is the CG distribution. We elaborate on the latter model by extending it to a Bayesian framework and providing a class of prior distributions for the model parameters. Specifically, we adopt two different parameterizations for the same model: a moment and a canonical representation; the first one enables closed-form results in terms of parameter posterior distribution as well as marginal likelihood. The second instead provides an effective way to express conditional independence relations in the joint distribution, and in turn to learn the underlying graphical structure; from a computational perspective, it requires Markov Chain Monte Carlo (MCMC) strategies for approximate posterior inference.
The rest of the contribution is organized as follows. In Sect. 2.1, we present some results obtained through the moment representation of the model, while in Sect. 2.2 we introduce the canonical parameterization. In Sect. 3, we present an application to heart disease diagnosis data, and we conclude in Sect. 4 with a discussion.
2 Bayesian Model Development

The starting point of our model is the Conditional Gaussian (CG) distribution, as introduced by Lauritzen and Wermuth (1989). Let V be a finite set of nodes indexing a collection of random variables X = (X_1, ..., X_{|V|})^T, which includes both discrete and continuous quantities, indexed by Δ and Γ respectively, with Δ ∪ Γ = V. Lauritzen and Wermuth (1989) defined a general class of probability distributions of the form

f(x) = f(s, y) = \exp\Big\{ g(s) + h(s)^T y - \frac{1}{2} y^T K(s)\, y \Big\}    (1)
(2)
for each level s assumed by Z Δ . Moreover, if K(s) = K a CG distribution is called homogeneous (HCG). A representation based on the triplet (g, h, K) is named canonical. One possible alternative parameterization is given in terms of momentcharacteristics parameters (θ, μ, Ω). In the following sections, we detail some results for a Bayesian model formulation under both parameterizations.
2.1 Moment Representation Let (Z 1 , . . . , Z p ) be p categorical variables, (Y1 , . . . , Yq ) q continuous variables. Let also I be the space of all possible configurations of the p categorical variables and θ = {θ (s), s ∈ I}) where θ (s) = Pr(Z 1 = s1 , . . . Z p = s p ) is the probability to observe configuration s = (s1 , . . . , s p ).
56
C. Galimberti et al.
Under the HCG assumption, we can write for each s ∈ I Y1 (s), . . . , Yq (s) | μ(s), Ω ∼ Nq (μ(s), Ω −1 ).
(3)
In particular, the relation between the canonical representation and the moments of the Gaussian model can be expressed through the re-parameterization μ(s) = K−1 h(s) and Ω = K−1 . Consider now a collection of n i.i.d. observations z i = (z i,1 , . . . , z i, p )T , yi = (yi,1 , . . . , yi,q )T , i = 1, . . . , n from (2) and (3). Categorical data {x i , i = 1, . . . , n} can be equivalently represented as a contingency table of counts N with elements n(s) ∈ N such that n n(s) = 1(z i = s), i=1
where 1(·) is the indicator function and s∈I n(s) = n. Following Frydenberg & Lauritzen (1989), the likelihood function can be written as f (N, y1 , . . . , yn | θ , {μ(s)}s∈I , Ω) =
s∈I
θ (s)n(s)
φ( yi | μ(s), Ω −1 )
s∈I i∈ν(s)
1 n(s) T ∝ θ (s) |Ω| exp − ( yi − μ(s)) Ω( yi − μ(s)) , (4) 2 s∈I s∈I i∈ν(s)
1 2
where ν(s) is the set of observations among {1, . . . , n} with observed configuration s and φ denotes the Gaussian density. We complete our Bayesian model formulation by assigning the following prior distributions: θ ∼ Dirichlet( A), μ(s) | Ω ∼ Nq (m(s), (aμ Ω)−1 ), Ω ∼ Wq (aΩ , U), (5) where in particular Wq (aΩ , U) denotes a Wishart distribution having expectation aΩ U −1 , with aΩ > q − 1 and U a s.p.d. matrix. It is advisable to set hyperparameters to values leading to proper prior distributions. A standard way to proceed, whenever no substantial prior information is available, is to choose hyperparameters leading to weakly informative priors. In particular, A may be set equal to a vector with all equal (e.g. unit) components (each associated to one level of the categorical variables). With regard to the Normal priors, we can fix a zero mean, while aμ = 1. Finally, the hyperparameters of the Wishart distribution can be fixed as aΩ = q, U = I q , the (q, q) the identity matrix. Under prior parameter independence, the posterior distribution can be written after standard calculations as
Bayesian Multivariate Analysis of Mixed Data
57
p(θ, {μ(s)}s∈I , Ω | N, y1 , . . . , yn ) ∝ ·
s∈I
with R =
θ (s)a(s)+n(s)−1
s∈I
1 1 T ¯ ¯ Ω(μ(s) − m(s)) |Ω| 2 | exp − (n(s) + aμ )(μ(s) − m(s)) 2 aΩ +n−q−1 1 · |Ω| 2 exp − tr[(U + R + R0 )Ω] , 2
s∈I
(6)
SSD(s), aμ n(s) y¯ (s), m(s) + aμ + n(s) aμ + n(s) aμ n(s) (m(s) − y¯ (s))(m(s) − y¯ (s))T , R0 = a + n(s) μ s∈I
¯ m(s) =
where SSD(s) = i∈ν(s) ei eiT , ei = ( yi − y¯ (s)), and y¯ (s) is the (q, 1) vector with sample means of (Y1 , . . . , Yq ) relative to observations i ∈ ν(s). Note that the independence is assumed also among the configuration s for the whole set of μ in order to have prior independence between μ(s)’s. From Eq. (6), it follows that θ | N ∼ Dirichlet( A + N) ¯ [(aμ + n(s))Ω)]−1 ) μ(s) | N, Y , Ω ∼ N q (m(s),
(7)
Ω | Y ∼ Wq (aΩ + n, U + R + R0 ), where Y denotes the (n, q) data matrix, row-binding of the yi ’s; see also Degroot (2004) for details on multivariate Normal models with Normal-Wishart priors and posterior calculations. In addition, because of conjugacy, the marginal data distribution m(Y , N) =
f (N, Y | θ , {μ(s)}s∈I , Ω) p(θ)
s∈I
p(μ(s)) p(Ω) dθ
dμ(s) dΩ
s∈I
can be computed from the ratio of prior and posterior normalizing constants. The result in (7) enables posterior inference on the parameters of an unconstrained (complete) graphical model. Specifically, by implementing a Monte Carlo sampler it is possible to infer the parameters of the marginal distribution of discrete variables and the parameters of the conditional distribution of continuous (Gaussian) variables.
58
C. Galimberti et al.
2.2 Canonical Representation The canonical parameterization of a CG distribution relies on the notion of interaction. In particular, the triplet (g, h, K) in (1) can be first expressed through the following expansions: g(s) =
λd (s),
d:d⊆Δ
h(s) =
ηd (s),
K(s) =
d:d⊆Δ
d (s),
(8)
d:d⊆Δ
where parameters λ, η, and are called interaction terms and d represents any subset (including the null set) of the categorical variables. Each term of the expansions represents a type of interaction between variables. Specifically, – λ∅ is the log normalizing constant. Also, λd (d = ∅) are pure discrete interactions among variables in d. If |d| = 1 they correspond to the main effects of the discrete variables; – η∅ ’s coordinates are the main effects of the continuous variables. Instead, ηd , d = ∅ are mixed linear interactions between a continuous variable and variables in d; – ∅ ’s elements are pure quadratic interactions. Differently, d (d = ∅) are mixed quadratic interaction matrices between variables in d and pairs of continuous variables. Using this representation, the distribution is homogeneous if and only if it has an interaction representation with no mixed quadratic interactions. Interaction terms allow for a more direct characterization of conditional independencies between variables; accordingly, the Markov property of a given UG can be expressed through zero-constraints on such parameters. Let G be an UG; a CG distribution is said to be nearest-neighbour Gibbs with respect to a graph G (or G-Gibbsian) if it has a representation with interaction terms satisfying λd (s) ≡ 0 unless d is complete in G, ηd (s)γ ≡ 0 unless d ∪ {γ } is complete in G, γδ d (s)
(9)
≡ 0 unless d ∪ {γ , δ} is complete in G,
where γ and δ represent continuous variables in Γ . Notice that a Gibbsian probability has an expansion with interaction terms involving variables that are neighbours only. Moreover, it can be proved that a CG distribution is G-Markovian if and only if it is G-Gibbsian; see Proposition 3.1 in Lauritzen and Wermuth (1989). As a consequence, the joint density factorizes into a product of local densities that only depends on variables that are mutual neighbours. In addition, it can be shown that the so-obtained factorization splits up into separate factorizations of the constant, linear, and quadratic terms; see Appendix B in Lauritzen and Wermuth (1989).
In what follows, we consider a simplified model by imposing the following conditions on the order of interactions:
– |d| ≤ 2 for λ_d(s),
– |d| ≤ 1 for η_d(s),
– d = ∅ for Ψ_d(s) ≡ Ψ_d.
In other words, the simplified model omits all interaction terms between the categorical variables of order higher than two and it defines the canonical mean vector of the Gaussian variables as a linear function of the categorical variables instead of an "arbitrary" dependence function. Moreover, the so-obtained distribution is HCG since the conditional covariance matrix does not depend on the categorical variables. As a consequence, the conditional independencies can be read off according to the following equivalences:
$$
Z_j \perp Z_k \mid X \setminus \{Z_j, Z_k\} \;\Leftrightarrow\; \lambda_{jk} \equiv 0, \qquad
Z_j \perp Y_\gamma \mid X \setminus \{Z_j, Y_\gamma\} \;\Leftrightarrow\; \eta_{j\gamma} \equiv 0, \qquad
Y_\gamma \perp Y_\delta \mid X \setminus \{Y_\gamma, Y_\delta\} \;\Leftrightarrow\; \Psi_0^{\gamma\delta} \equiv 0,
\tag{10}
$$
where (Z_j, Z_k) represents two discrete variables and (Y_γ, Y_δ) represents two continuous variables. As in the moment representation setting, we consider a mixed dataset consisting of n i.i.d. observations and represent the categorical data through the implied contingency table of counts. For simplicity of exposition, in the following we assume all categorical variables to be binary. By adopting the canonical parameterization, the likelihood function can be written as
$$
f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)
\tag{11}
$$
$$
= \prod_{i=1}^{n} \exp\Big\{ \lambda_0 + \sum_{j=1}^{p} \lambda_j z_{ij} + \sum_{j<k} \lambda_{jk} z_{ij} z_{ik} + y_i^T \Big[\eta_0 + \sum_{j=1}^{p} \eta_j z_{ij}\Big] - \tfrac{1}{2}\, y_i^T \Psi_0\, y_i \Big\}
= \exp\Big\{ n\lambda_0 + \sum_{j=1}^{p} \lambda_j n(j) +
$$
(i) d(π, σ) > 0 for π ≠ σ and π, σ ∈ S_N; (ii) d(π, π) = 0 for every π ∈ S_N; (iii) d(π, σ) = d(σ, π) for every π, σ ∈ S_N.

The invariance properties under left or right compositions are defined below and play a fundamental role in various applications of distances on permutations.

Definition 2 A distance d(·,·) on S_N is called right-invariant (label-invariant) if d(π, σ) = d(π ∘ τ, σ ∘ τ) for every π, σ, τ ∈ S_N, and left-invariant (rank-invariant) if d(π, σ) = d(τ ∘ π, τ ∘ σ) for every π, σ, τ ∈ S_N. If d(·,·) is both right- and left-invariant, then d(·,·) is a bi-invariant distance.

As pointed out by Critchlow (1985), right-invariance is a necessary requirement in data modeling since it ensures that the distance does not depend on the labeling of the objects. If d(·,·) is left-invariant, then d(π, σ) does not use the numerical values (integers from 1 to N) of π, σ ∈ S_N, but only the way they are ordered. Examples of right-invariant distances are the commonly used Kendall's tau, Spearman's footrule, and Spearman's rho, which are also applied in the data analysis in Sect. 4. In fact, all eight distances employed there and listed in Table 1 are right-invariant. Among them, only the Cayley and Hamming distances are also left-invariant, i.e., bi-invariant. As discussed in Sect. 3.2, the bi-invariance property leads to a special design of the Marginals matrix under a distance-based model. An important class of distances on S_N is defined as follows.

Definition 3 A distance d(·,·) on S_N is called a Hoeffding distance if
$$
d(\pi, \sigma) = \sum_{i=1}^{N} a\big(\pi(i), \sigma(i)\big), \quad \text{for every } \pi, \sigma \in S_N,
\tag{1}
$$
where a(·, ·) is a real-valued function on {1, . . . , N } × {1, . . . , N } that satisfies a(i, i) = 0 and a(i, j) = a( j, i).
Hoeffding distances form a rich family of distances on permutations that includes Spearman's footrule, Spearman's rho, and the Hamming and Lee distances. As given in Definition 3, they can be decomposed linearly over the ranking components. This feature is crucial in obtaining some asymptotic properties for large values of N, like the combinatorial central limit theorem (CCLT), formulated and proved by Hoeffding (1951). Furthermore, as we will show in Sect. 3, the linearity in Definition 3 affects the structure of the Marginals matrix induced by the distance models presented in the next subsection.
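To make the Hoeffding property concrete, here is a small illustrative snippet (hypothetical helper code, not taken from the chapter) that implements a few of the distances used later as sums Σ_i a(π(i), σ(i)) over 0-based NumPy permutations and numerically checks right-invariance.

```python
import numpy as np

def footrule(p, s):        # Spearman's footrule, a Hoeffding distance
    return int(np.abs(p - s).sum())

def rho(p, s):             # Spearman's rho, a Hoeffding distance
    return int(((p - s) ** 2).sum())

def hamming(p, s):         # Hamming distance, a Hoeffding (and bi-invariant) distance
    return int((p != s).sum())

def lee(p, s):             # Lee distance, a Hoeffding distance
    N = len(p)
    diff = np.abs(p - s)
    return int(np.minimum(diff, N - diff).sum())

# Right-invariance: d(pi o tau, sigma o tau) = d(pi, sigma), with (pi o tau)(i) = pi(tau(i)).
rng = np.random.default_rng(1)
pi, sigma, tau = (rng.permutation(5) for _ in range(3))
for d in (footrule, rho, hamming, lee):
    assert d(pi[tau], sigma[tau]) == d(pi, sigma)
```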
2.2 Generalized Mallows Model Based on the Power Divergence

A common approach for analyzing rank data is to construct a probability distribution over all permutations in S_N. In some situations, it is reasonable to assume that there is a fixed central (modal) ranking π_0 ∈ S_N and that the probability of observing a ranking π ∈ S_N decreases exponentially as the distance from π to π_0 increases. This corresponds to the classical distance-based model, widely known as the Mallows model (MM), defined by
$$
P(\pi) = P(\pi \mid \theta, \pi_0) = \exp\big(\theta d(\pi, \pi_0) - \psi_N(\theta)\big),
\tag{2}
$$
where d(·,·) is a fixed right-invariant distance on S_N, θ ∈ R is a parameter, and ψ_N(θ) is a normalizing constant, independent of π_0. Regarding the modal permutation π_0, it can be either fixed or unknown. The special cases of MM, with d(·,·) being Kendall's tau and Spearman's rho, were first proposed by Mallows (1957) and later generalized by Diaconis (1988). Under certain conditions, (2) is the closest to the discrete uniform model when the model discrepancy is measured in terms of the Kullback-Leibler divergence; see Diaconis (1988, pp. 175–176). More recently, Kateri and Nikolov (2022) extended the optimality property of MM by making use of the φ-divergence measures and proposed the following generalized Mallows model (GMM), induced by the Cressie-Read power divergence (Cressie and Read, 1984),
$$
P(\pi) = P(\pi \mid \beta, \lambda, \pi_0) = \big(1 + \beta d(\pi, \pi_0)\big)^{1/\lambda} \,\frac{1}{c(\beta, \lambda)},
\tag{3}
$$
where β, λ ∈ R \ {0}, d(·,·) is a right-invariant distance on S_N, π_0 ∈ S_N is a central permutation, and $c(\beta, \lambda) = \sum_{\sigma \in S_N} (1 + \beta d(\sigma, \pi_0))^{1/\lambda}$ is a normalizing constant. The
parameter λ is associated with the Cressie-Read divergence and determines the shape properties of (3), whereas β can be viewed as a scale parameter. When β = 0, λ → ∞ or λ → −∞, GMM in (3) coincides with the uniform model P0 (π) = 1/N !, while for λ → 0 we obtain the exponential MM in (2), i.e., MM≡GMM for λ = 0. The special values λ = −2, λ = −1, and λ = 1 have good interpretation as well,
since in these cases (3) corresponds to the optimal model under Neyman-modified X 2 , modified Kullback-Leibler, and Pearsonian X 2 divergences, respectively; see Cressie and Read (1988) and Kateri and Nikolov (2022) for more details. Similar to MM, the permutation π 0 is the location parameter for GMM and is associated with the consensus (modal) ranking in the population. More details on the estimating procedures and constraints for the unknown parameters (β, λ, π 0 ) can be found in Kateri and Nikolov (2022). In the next subsection, we consider an alternative generalization of MM based on Hoeffding distances.
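For small N, the GMM probabilities in (3) can be evaluated by brute force, which is a convenient way to explore the roles of β and λ. The snippet below is an illustrative sketch (not the authors' code); it assumes a distance function like the ones above and parameter values inside the feasible region where 1 + βd(π, π_0) > 0 for every π (the constraints discussed in Kateri and Nikolov 2022).

```python
import numpy as np
from itertools import permutations

def gmm_probabilities(N, beta, lam, dist, pi0=None):
    """Enumerate S_N and return (permutations, probabilities) under model (3)."""
    pi0 = np.arange(N) if pi0 is None else np.asarray(pi0)
    perms = [np.array(p) for p in permutations(range(N))]
    # Unnormalized weights (1 + beta * d)^(1/lambda); c(beta, lambda) is their sum.
    w = np.array([(1.0 + beta * dist(p, pi0)) ** (1.0 / lam) for p in perms])
    return perms, w / w.sum()

# Example: lambda = -1 (the modified Kullback-Leibler case) with Spearman's rho and N = 4.
rho = lambda p, s: float(((p - s) ** 2).sum())
perms, probs = gmm_probabilities(4, beta=0.5, lam=-1.0, dist=rho)
assert abs(probs.sum() - 1.0) < 1e-12 and probs[0] == probs.max()   # modal at pi0 = e_N
```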
2.3 Marginals Model

The Marginals model was first proposed by Verducci (1982) under the name quasi-independence model. It is a member of the exponential family and is based on a more data-analytical approach. Its probability mass function is defined as
$$
P(\pi) = P(\pi \mid \boldsymbol{\theta}) = \exp\Big( \sum_{i=1}^{N} \sum_{j=1}^{N} \theta_i^{(j)}\, \mathbf{1}[\pi(i) = j] - \psi(\boldsymbol{\theta}) \Big), \quad \text{for } \pi \in S_N,
\tag{4}
$$
where $\boldsymbol{\theta} = \big(\theta_i^{(j)}\big)_{i,j=1}^{N}$ are $N^2$ real parameters, $\mathbf{1}[\cdot]$ is the indicator function and ψ(θ) is a normalizing constant. The aim of formula (4) is to explain the quantities
$$
m_{ij} = \sum_{\pi(i) = j} P(\pi \mid \boldsymbol{\theta}), \quad \text{for } i, j \in \{1, \ldots, N\},
$$
where the summation is over all permutations π ∈ S_N such that π(i) = j. The matrix $M = (m_{ij})_{i,j=1}^{N}$ is called the Marginals matrix, since its i-th row gives the theoretical marginal distribution of the ranks assigned to object i, and its j-th column gives the theoretical marginal distribution of the objects given rank j. From
$$
\sum_{\pi \in S_N} P(\pi \mid \boldsymbol{\theta}) = 1,
$$
it follows that M is a bistochastic matrix, i.e.,
$$
\sum_{i=1}^{N} m_{ij} = 1 \quad \text{and} \quad \sum_{j=1}^{N} m_{ij} = 1.
$$
Thus, there are only $(N-1)^2$ free parameters $\big(\theta_i^{(j)}\big)_{i,j=1}^{N}$ of the Marginals model. An extension of model (4) with more free parameters is proposed by Diaconis (1989) as an application of spectral analysis to permutation data. Models of the MM class based on Hoeffding distances are specific cases of the Marginals model with additional constraints for the parameters θ. For example, model (2) induced by Spearman's footrule
$$
d_F(\pi, \sigma) = \sum_{i=1}^{N} |\pi(i) - \sigma(i)|, \quad \text{for } \pi, \sigma \in S_N,
\tag{5}
$$
coincides with model (4) when
$$
\theta_i^{(j)} = \theta\, |j - \pi_0(i)|, \quad \text{for } i, j \in \{1, \ldots, N\}.
$$
In order to study more precisely the fit of the proposed models, we present several goodness-of-fit measures in the next subsection.
2.4 Model Comparisons

Let us assume that the data sample consists of n full rankings denoted by $\boldsymbol{\pi}^* = (\pi^*_1, \ldots, \pi^*_n)$, where $\pi^*_i \in S_N$ for $i \in \{1, \ldots, n\}$. To test the fit of GMM with fixed parameter λ, we first consider the log-likelihood ratio statistic (LRS), defined as
$$
LRS_\lambda = 2 \ln\!\left( \frac{\mathcal{L}_\lambda(\boldsymbol{\pi}^*, \hat{\beta})}{\mathcal{L}_\lambda(\boldsymbol{\pi}^*, 0)} \right),
$$
where $\hat{\beta}$ is the maximum likelihood estimator (MLE) of β, $\mathcal{L}_\lambda(\boldsymbol{\pi}^*, \beta)$ is the likelihood under (3) with fixed λ, and $\mathcal{L}_\lambda(\boldsymbol{\pi}^*, 0)$ is the likelihood under the uniform model (β = 0). Similarly, the LRS for the Marginals model is given by
$$
LRS_m = 2 \ln\!\left( \frac{\mathcal{L}_m(\boldsymbol{\pi}^*, \hat{\boldsymbol{\theta}})}{\mathcal{L}_m(\boldsymbol{\pi}^*, 0)} \right),
$$
with $\mathcal{L}_m(\boldsymbol{\pi}^*, \boldsymbol{\theta})$ being the likelihood under (4) and $\mathcal{L}_m(\boldsymbol{\pi}^*, 0)$ corresponding to the likelihood under the uniform model (θ = 0). The MLEs $\hat{\boldsymbol{\theta}} = \big(\hat{\theta}_i^{(j)}\big)_{i,j=1}^{N}$ can be obtained by using the Newton-Raphson method or by the algorithm proposed in Verducci (1989) that is based on the minimum majorization decomposition.
Next, we quantify the total nonuniformity in the data sample $\boldsymbol{\pi}^*$ by introducing the TNU coefficient, considered by Marden (1995). Let $f(\pi) = \sum_{i=1}^{n} \mathbf{1}\{\pi^*_i = \pi\}$ be the frequency of a given ranking π ∈ S_N, i.e., f(π) is the number of observations in the sample that coincide with π. Then, the empirical probability for π is f(π)/n and
$$
TNU = 2 \sum_{\pi \in S_N} f(\pi) \left[ \ln\frac{f(\pi)}{n} - \ln\frac{1}{N!} \right],
$$
as defined in Marden (1995, p. 145). Then, the goodness-of-fit of a model can be tested by combining the TNU and the associated LRS, i.e., the corresponding $LRS_\lambda$ or $LRS_m$. The null hypothesis of perfect fit is rejected for large values of the statistic GOF = TNU − LRS. Marden (1995, p. 144) showed that under certain regularity conditions GOF has an asymptotic chi-square distribution (χ²) with N! − p degrees of freedom as n → ∞, where p is the number of estimated unknown parameters (p = 1 for $LRS_\lambda$ and p = (N − 1)² for $LRS_m$). Although the asymptotic critical region for GOF is very easy to compute, GOF does not have a clear interpretation. Thus, Marden (1995, p. 144) considered the following coefficient:
$$
R^2 = 1 - \frac{GOF}{TNU},
\tag{6}
$$
which is similar to the multiple correlation coefficient in linear regression and measures the percentage of nonuniformity in the data that is explained by the fitted model. Since R² is a simple transformation of GOF, for the example in Sect. 4 the significance of the fitted models is discussed only in terms of the R² values. Nevertheless, since the TNU value is also reported, the associated GOF values can easily be calculated. Notice that for the Marginals model (4) a sufficient statistic is the sample Marginals matrix $\hat{M} = (\hat{m}_{ij})_{i,j=1}^{N}$ defined by
$$
\hat{m}_{ij} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{1}\big[\pi^*_k(i) = j\big], \quad \text{for } i, j \in \{1, \ldots, N\}.
$$
Thus, the fitted Marginals matrix coincides with the empirical one ($\hat{M}$). In contrast, the fitted Marginals matrices under MM and GMM depend on the estimated parameters and are further studied in the next section.
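For reference, the sample quantities just introduced are simple to compute. The following sketch (illustrative code, not from the paper) builds the sample Marginals matrix M̂ and the TNU coefficient from an n × N array of full rankings with entries 1, …, N.

```python
import numpy as np
from collections import Counter
from math import factorial, log

def sample_marginals(rankings):
    """Relative frequencies m_hat[i-1, j-1] of the event pi(i) = j."""
    n, N = rankings.shape
    M_hat = np.zeros((N, N))
    for i in range(N):
        for j in range(1, N + 1):
            M_hat[i, j - 1] = np.mean(rankings[:, i] == j)
    return M_hat

def tnu(rankings):
    """Total nonuniformity 2 * sum_pi f(pi) [ln(f(pi)/n) - ln(1/N!)]."""
    n, N = rankings.shape
    freq = Counter(map(tuple, rankings.tolist()))
    return 2.0 * sum(f * (log(f / n) - log(1.0 / factorial(N))) for f in freq.values())
```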
3 Marginals Matrix Under GMM
Let us denote by $M(\beta, \lambda, N) = \big(m_{ij}(\beta, \lambda, N)\big)_{i,j=1}^{N}$ the Marginals matrix under GMM in (3). Without loss of generality, it can be assumed that π_0 = e_N, since d(·,·) in (3) is right-invariant and varying the permutation π_0 is equivalent to reordering the rows of the matrix. Then, the elements of M(β, λ, N) can be written as
$$
m_{ij}(\beta, \lambda, N) = \sum_{\pi(i) = j} P(\pi \mid \beta, \lambda, e_N), \quad \text{for } i, j \in \{1, \ldots, N\},
\tag{7}
$$
where P(π | β, λ, e_N) is defined in (3) for π_0 = e_N. Furthermore, let us consider the random variable
$$
D_N = d(\pi, e_N),
\tag{8}
$$
where π is uniformly chosen from the set S_N. Note that the distribution of D_N depends only on the distance d(·,·). Hence, we can express the normalizing constant in (3) as
$$
c(\beta, \lambda) = N!\; \mathbb{E}\big[(1 + \beta D_N)^{1/\lambda}\big],
$$
with the expectation E[·] taken with respect to D_N. In a similar way, by considering the sets
$$
S_N^{(i,j)} = \{\pi \in S_N : \pi(i) = j\}, \quad \text{for } i, j \in \{1, \ldots, N\},
\tag{9}
$$
and the random variables
$$
D_N^{(i,j)} = d\big(\pi^{(i,j)}, e_N\big), \quad \text{for } i, j \in \{1, \ldots, N\},
\tag{10}
$$
with $\pi^{(i,j)}$ being uniformly chosen from the set $S_N^{(i,j)}$, we obtain the following proposition.

Proposition 1 The elements of the Marginals matrix M(β, λ, N) under the generalized Mallows model (3) are given by
$$
m_{ij}(\beta, \lambda, N) = \frac{1}{N}\; \frac{\mathbb{E}\big[(1 + \beta D_N^{(i,j)})^{1/\lambda}\big]}{\mathbb{E}\big[(1 + \beta D_N)^{1/\lambda}\big]},
\tag{11}
$$
where $D_N$ and $D_N^{(i,j)}$ are defined in (8) and (10), respectively.
It is worth mentioning that the elements of the Marginals matrix under MM in (2) can be expressed in a similar form as in formula (11). However, by using the exponential form of MM, the corresponding expectations in the numerator and the denominator of (11) simplify to the moment generating functions of $D_N^{(i,j)}$ and $D_N$, respectively, evaluated at the parameter value θ in (2); see Fligner and Verducci (1986)
for more details. Furthermore, for some special distances, like Hoeffding distances, the Marginals matrix under MM is symmetric and can be approximated for large values of N by applying the asymptotic normality results for the random variables $D_N^{(i,j)}$ and $D_N$; see Marden (1995) and Nikolov and Stoimenova (2019). In the next subsection, we prove that in the class of Hoeffding distances the Marginals matrix preserves its symmetry property even under GMM.
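Before turning to the structural results, note that for small N the matrix M(β, λ, N) in (7) can also be obtained by direct enumeration, which is convenient for checking properties such as the symmetry established below. The following sketch (hypothetical code, using the same conventions as the snippets above) accumulates P(π(i) = j) over all of S_N and verifies that the resulting matrix is symmetric and bistochastic for a Hoeffding distance.

```python
import numpy as np
from itertools import permutations

def marginals_matrix(N, beta, lam, dist):
    e_N = np.arange(N)
    perms = [np.array(p) for p in permutations(range(N))]
    w = np.array([(1.0 + beta * dist(p, e_N)) ** (1.0 / lam) for p in perms])
    probs = w / w.sum()
    M = np.zeros((N, N))
    for p, pr in zip(perms, probs):
        for i in range(N):
            M[i, p[i]] += pr                      # add P(pi) to the entry (i, pi(i))
    return M

footrule = lambda p, s: float(np.abs(p - s).sum())    # a Hoeffding distance
M = marginals_matrix(4, beta=0.5, lam=-1.0, dist=footrule)
assert np.allclose(M, M.T)                             # symmetry (Proposition 2 below)
assert np.allclose(M.sum(axis=0), 1) and np.allclose(M.sum(axis=1), 1)   # bistochastic
```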
3.1 Marginals Matrix Structure Under Hoeffding Distances Proposition 2 If the distance d(·, ·) used in the generalized Mallows model (3) is a Hoeffding distance, then the corresponding Marginals matrix is symmetric.
Proof Let $M(\beta, \lambda, N) = \big(m_{ij}(\beta, \lambda, N)\big)_{i,j=1}^{N}$ be the Marginals matrix under model (3) based on a Hoeffding distance d(·,·), defined by (1). Then, it is straightforward to check that d(·,·) is right-invariant. Indeed, for any π, σ, τ ∈ S_N,
$$
d(\pi \circ \tau, \sigma \circ \tau) = \sum_{i=1}^{N} a\big(\pi(\tau(i)), \sigma(\tau(i))\big) = \sum_{i=1}^{N} a\big(\pi(i), \sigma(i)\big) = d(\pi, \sigma),
$$
where the middle equality is obtained by rearranging the summation terms. Therefore, from (7) we have
$$
\begin{aligned}
m_{ij}(\beta, \lambda, N) &= \sum_{\pi \in S_N^{(i,j)}} \big(1 + \beta d(\pi, e_N)\big)^{1/\lambda} \frac{1}{c(\beta, \lambda)} \\
&= \sum_{\pi \in S_N^{(i,j)}} \big(1 + \beta d(\pi \circ \pi^{-1}, e_N \circ \pi^{-1})\big)^{1/\lambda} \frac{1}{c(\beta, \lambda)} \\
&= \sum_{\pi \in S_N^{(i,j)}} \big(1 + \beta d(e_N, \pi^{-1})\big)^{1/\lambda} \frac{1}{c(\beta, \lambda)} \\
&= \sum_{\pi \in S_N^{(i,j)}} \big(1 + \beta d(\pi^{-1}, e_N)\big)^{1/\lambda} \frac{1}{c(\beta, \lambda)},
\end{aligned}
\tag{12}
$$
where $S_N^{(i,j)}$ is defined in (9). Since, for fixed $\pi \in S_N^{(i,j)}$, the inverse permutation $\pi^{-1}$ is an element of $S_N^{(j,i)}$, there is a one-to-one correspondence between the $(N-1)!$ permutations in the sets $S_N^{(i,j)}$ and $S_N^{(j,i)}$. Thus, from (12) it follows that
$$
m_{ij}(\beta, \lambda, N) = \sum_{\sigma \in S_N^{(j,i)}} \big(1 + \beta d(\sigma, e_N)\big)^{1/\lambda} \frac{1}{c(\beta, \lambda)} = m_{ji}(\beta, \lambda, N),
$$
which completes the proof.
Similar to MM, the Marginals matrix under GMM has a further structural symmetry for some specific distances on S N . In the next subsection, we outline these additional symmetric properties for a few particular distances.
3.2 Special Cases

From the definition of $m_{ij}(\beta, \lambda, N)$ in (7) and the linear decomposition of Spearman's footrule in (5), we obtain the following relation between the elements of the Marginals matrix under GMM with $d_F$,
$$
m_{i,j}(\beta, \lambda, N) = m_{N-i+1,\, N-j+1}(\beta, \lambda, N).
\tag{13}
$$
Similarly, it is easy to check that under GMM with Spearman's rho, defined by
$$
d_R(\pi, \sigma) = \sum_{i=1}^{N} \big(\pi(i) - \sigma(i)\big)^2, \quad \text{for } \pi, \sigma \in S_N,
$$
the elements of the Marginals matrix have the symmetry property (13). Hence, under GMM with $d_F$ or $d_R$ there are the same $\left[\frac{N+2}{2}\right]\left(N - \left[\frac{N}{2}\right]\right)$ different elements of the Marginals matrix, with [x] being the integer part of x. The structural freedom is reduced even further if we use the Lee distance, given by
$$
d_L(\pi, \sigma) = \sum_{i=1}^{N} \min\big(|\pi(i) - \sigma(i)|,\; N - |\pi(i) - \sigma(i)|\big), \quad \text{for } \pi, \sigma \in S_N.
$$
Analogously to the MM case studied in Nikolov and Stoimenova (2019), the Marginals elements under GMM with $d_L$ satisfy $m_{i,j}(\beta, \lambda, N) = m_{N,k}(\beta, \lambda, N)$, where $k = N - \min(|i - j|, N - |i - j|)$. Thus, there are only $\left[\frac{N+2}{2}\right]$ different elements of the matrix M(β, λ, N) induced by $d_L$. Moreover, Marden (1995) showed that under MM based on a bi-invariant distance the Marginals matrix has only two different elements: diagonal and off-diagonal. In the same fashion, under GMM with
a bi-invariant distance, it can be proved that
$$
m_{i,j}(\beta, \lambda, N) =
\begin{cases}
B, & \text{for } i = j, \\[2pt]
\dfrac{1 - B}{N - 1}, & \text{for } i \neq j,
\end{cases}
$$
where B is a constant depending on β, λ, and N. In the special cases of the Cayley and Hamming distances (bi-invariant), we have
$$
B = \frac{1}{N}\; \frac{\mathbb{E}\big[(1 + \beta D_{N-1})^{1/\lambda}\big]}{\mathbb{E}\big[(1 + \beta D_N)^{1/\lambda}\big]},
$$
where $D_{N-1}$ and $D_N$ are as in (8); cf. Nikolov and Stoimenova (2020). From the specific results derived in this subsection, it is clear that even for some of the most widely used distances on S_N the Marginals matrix under both MM and GMM has limited design freedom. The next section illustrates how these restrictions might be observed in the sample Marginals matrix as well and points out the associated effect on the fit of the suggested models.
4 Illustrative Example

Mao et al. (2013) studied the human ability to make noisy comparisons of items in ranking tasks by performing tests based on counting pseudo-randomly distributed dots in images. In the conducted experiment, each ranking involved sorting four pictures by increasing number of plotted dots. Here, we focus on one experimental setting that corresponds to the easiest task for the participants. The collected data are available at http://preflib.org (Mattei and Walsh, 2013), an online library of data sets concerning preferences, and consist of n = 794 full rankings for N = 4 types of pictures with 200, 209, 218, and 227 dots. From the problem setup, it is natural to assume that the consensus ranking π_0 equals the identity permutation e_N = (1, 2, 3, 4). The results of fitting MM, the Mallows model (2), and GMM, the generalized Mallows model (3), for eight of the most commonly used distances on S_N are given in Table 1. Since the nonuniformity of the dots data is TNU = 650.68, the critical value of the R² coefficient in (6) for GMM with fixed λ is R²_crit. = 0.9459 (based on χ² with 23 = 4! − 1 degrees of freedom and nominal level 0.05). Hence, for λ = 0, there is no significant MM fit for any of the eight distances in Table 1. However, we can improve the fit by changing the distributional shape via the parameter λ. Thus, under d_F, d_R, and d_K, the associated GMM is significant for values of λ close to −1. The best fit is obtained for GMM with d_R, λ = −1.2699, β = −1.7696, and R² = 0.9708. Nevertheless, for better parameter interpretation we suggest GMM with d_R and λ = −1, which has a
Table 1 Results of fitting model (3) to the dots data with free parameter λ (GMM) and with λ = 0 (MM in (2))

| Distance            | Notation | GMM: λ̂  | GMM: β̂  | GMM: R² | MM (λ = 0): θ̂ | MM (λ = 0): R² |
|---------------------|----------|---------|---------|---------|----------------|----------------|
| Spearman's footrule | d_F      | −0.8567 | −1.4336 | 0.9512  | −0.3931        | 0.9290         |
| Spearman's rho      | d_R      | −1.2699 | −1.7696 | 0.9708  | −0.1630        | 0.8642         |
| Chebyshev metric    | d_M      | −0.8171 | −3.0273 | 0.9171  | −0.9012        | 0.9003         |
| Kendall's tau       | d_K      | −0.9637 | −3.0381 | 0.9620  | −0.6245        | 0.9127         |
| Cayley distance     | d_C      | −1.0271 | −4.5367 | 0.6873  | −0.8621        | 0.6412         |
| Ulam distance       | d_U      | −0.6101 | 1.2544  | 0.5978  | −0.9625        | 0.5699         |
| Hamming distance    | d_H      | −0.7781 | −1.8300 | 0.6674  | −0.6032        | 0.6513         |
| Lee distance        | d_L      | −0.8831 | −1.7228 | 0.7545  | −0.4535        | 0.7195         |
similar fit (R² = 0.9597) and corresponds to the optimal model under the modified Kullback-Leibler divergence, as pointed out in Sect. 2.2. The Marginals model (4) has an insignificant fit for the dots data, with statistic R²_m = 0.9392 and critical value R²_crit. = 0.9616 (based on χ² with 15 = 4! − 3² degrees of freedom and nominal level 0.05). Therefore, model (3) performs substantially better, although model (4) has 9 free parameters versus only 2 for GMM. The reason for this lies in the structure of the sample Marginals matrix M̂. The values of M̂, together with M_R, M_L, and M_C, the Marginals matrices induced by the fitted GMM in Table 1 with d_R, d_L, and d_C, respectively, are given below in percentages:
$$
\hat{M} = \begin{pmatrix}
51.26 & 25.57 & 12.97 & 10.20 \\
25.57 & 39.42 & 21.79 & 13.22 \\
14.61 & 21.79 & 38.79 & 24.81 \\
 8.56 & 13.22 & 26.45 & 51.76
\end{pmatrix}, \quad
M_R = \begin{pmatrix}
51.78 & 25.05 & 13.48 &  9.69 \\
25.05 & 39.31 & 22.16 & 13.48 \\
13.48 & 22.16 & 39.31 & 25.05 \\
 9.69 & 13.48 & 25.05 & 51.78
\end{pmatrix},
$$
$$
M_L = \begin{pmatrix}
45.24 & 20.82 & 13.13 & 20.82 \\
20.82 & 45.24 & 20.82 & 13.13 \\
13.13 & 20.82 & 45.24 & 20.82 \\
20.82 & 13.13 & 20.82 & 45.24
\end{pmatrix}, \quad
M_C = \begin{pmatrix}
45.48 & 18.17 & 18.17 & 18.17 \\
18.17 & 45.48 & 18.17 & 18.17 \\
18.17 & 18.17 & 45.48 & 18.17 \\
18.17 & 18.17 & 18.17 & 45.48
\end{pmatrix}.
$$
Clearly, the elements of M̂ follow a similar pattern to the matrix structure under Spearman's footrule and rho, described in Sect. 3.2, so it is natural that models (2) and (3) with d_F and d_R fit relatively well. In contrast, the restricted freedom of the Marginals matrices under d_L and d_C leads to a poor approximation of M̂ by M_L and M_C, which implies the lack of explanatory power of MM and GMM for these two distances. Since the evaluation of M̂ itself does not require as much computational
effort as estimating the parameters in MM, GMM, and particularly in model (4), the form of M̂ can be used as guidance for appropriate distance and model choices. Furthermore, from the results in Table 1, we can conclude that model (2) is improved remarkably more by model (3), with only one extra parameter, than by model (4), with 8 additional parameters.
5 Concluding Remarks

As future work, it would be interesting to consider a statistic that measures the deviance of the sample Marginals matrix from the structural design under a given probability model. Moreover, a simple statistical procedure could be developed to test the significance of this measure, which would reduce the computational time and resources needed for estimating the unknown model parameters when N, the number of ranked items, is large. Another possible direction for continuing the current research is to apply the presented features of the Marginals matrix under GMM in the framework of RSS or other schemes that involve comparing the ranking abilities of two judges or ranking methods.
References Chen, Z., Bai, Z., & Sinha, B. (2003). Ranked set sampling: Theory and applications. In Lecture Notes in Statistics (vol. 176). New York: Springer. Cressie, N., & Pardo, L. (2000). Minimum φ-divergence estimator and hierarchical testing in loglinear models. Statistica Sinica, 10(3), 867–884. Cressie, N., & Read, T. (1984). Multinomial goodness-of-fit tests. The Journal of the Royal Statistical Society, Series B Statistical Methodology, 46(3), 440–464. Cressie, N., & Read, T. (1988). Goodness of fit statistics for discrete multivariate data. New York: Springer. Critchlow, D. E. (1985). Metric methods for analyzing partially ranked data. New York: Springer. Diaconis, P. (1988). Group representations in probability and statistics. In: IMS Lecture Notes— Monograph Series (vol. 11), Hayward, CA. Diaconis, P. (1989). A generalization of spectral analysis with application to ranked data. The Annals of Statistics, 17(3), 949–979. Fligner, M. A., & Verducci, J. S. (1986). Distance based ranking models. The Journal of the Royal Statistical Society, Series B Statistical Methodology, 48(3), 359–369. Forcina, A., & Kateri, M. (2021). A new general class of RC association models: Estimation and main properties. The Journal of Multivariate Analysis, 184(3), 104741, 1–16. Hoeffding, W. (1951). A combinatorial central limit theorem. Annals of Mathematical Statistics, 22(4), 558–566. Kateri, M. (2018). φ-Divergence in contingency table analysis. Entropy, 20, 324. Kateri, M., & Agresti, A. (2007). A class of ordinal quasi symmetry models for square contingency tables. Statistics and Probability Letters, 77, 598–603. Kateri, M., & Agresti, A. (2010). A generalized regression model for a binary response. Statistics and Probability Letters, 80, 89–95.
Kateri, M., & Nikolov, N. I. (2022). A generalized Mallows model based on φ-divergence measures. The Journal of Multivariate Analysis, 190(104958), 1–14. Kateri, M., & Papaioannou, T. (1997). Asymmetry models for contingency tables. Journal of the American Statistical Association, 92, 1124–1131. Mallows, C. (1957). Non-null ranking models. I. Biometrika, 44(1), 114–130. Mao, A., Procaccia, A. D., & Chen, Y. (2013). Better human computation through principled voting. In Twenty-Seventh AAAI Conference on Artificial Intelligence (pp. 1142–1148). Marden, J. I. (1995). Analyzing and modeling rank data. In Monographs on statistics and applied probability (Vol. 64). London: Chapman & Hall. Mattei, N., & Walsh, T. (2013). Preflib: A library for preferences. In: International conference on algorithmic decision theory (pp. 259–270). Springer. http://preflib.org. Nikolov, N. I., & Stoimenova, E. (2019). Asymptotic properties of Lee distance. Metrika, 82(3), 385–408. Nikolov, N. I., & Stoimenova, E. (2020). Mallows’ models for imperfect ranking in ranked set sampling AStA–Adv. Statistical Analysis, 104(3), 459–484. Pardo, L. (2006). Statistical inference based on divergence measures. New York: Chapman & Hall. Verducci, J. S. (1982). Discriminating Between Two Probabilities on the Basis of Ranked Preferences. Technical report, Stanford University Press, Redwood City. Verducci, J. S. (1989). Minimum majorization decomposition. Contributions to probability and statistics (pp. 160–173). New York: Springer.
Time Series Clustering Based on Forecast Distributions: An Empirical Analysis on Production Indices for Construction Michele La Rocca, Francesco Giordano, and Cira Perna
Abstract This paper presents and discusses a recent proposal for clustering autoregressive nonlinear time series data in which dissimilarities are computed according to their forecast distributions. The procedure uses feedforward neural networks to approximate the original nonlinear process combined with the pair bootstrap as a resampling technique. An empirical analysis of the construction sector for 21 European countries is performed. Since the COVID-19 pandemic has hit all the countries in the last part of the observational period, the analysis also evaluates the possible different group structures due to the pandemic. The results are almost identical for different forecast horizons in the pre-COVID-19 period. On the contrary, where all countries experienced severe contractions in their economic activities, the dataset shows a diverse group structure, indicating different routes and timelines for economic recovery. Keywords Nonlinear time series · Clustering · Pair bootstrap · Forecast distribution
1 Introduction There has been a growing interest in time series clustering in recent decades. In this context, different approaches have been proposed in the statistical literature (see Aghabozorgi et al. 2015) for review. They rely on working directly with raw data, indirectly with features extracted from the raw data, or with models built from it. M. La Rocca (B) · F. Giordano · C. Perna Department of Economics and Statistics, University of Salerno, Salerno, Italy e-mail: [email protected] F. Giordano e-mail: [email protected] C. Perna e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Grilli et al. (eds.), Statistical Models and Methods for Data Science, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-30164-3_7
In the first case, the so-called raw-data-based approach, the distance/similarity measure for static data is modified with an appropriate one to account for the time series structure (see D'Urso et al. 2018 for examples of this approach).

Since it is not easy to work directly with raw data that are highly noisy, the feature-based approach can be implemented. It is based on a preliminary transformation of the raw time series into a feature vector of lower dimension and then on applying a clustering algorithm. In this context, attractive proposals are based on the use of autocovariances and autocorrelations (Lafuente-Rego & Vilar 2016; D'Urso & Maharaj 2009) and of the spectral density function (D'Urso et al. 2020).

Moreover, the model-based clustering approach considers that some model or probability distribution generates each time series. Time series are considered similar when the models identified on the individual series or the residuals of the fitted model are similar. As data generating processes, ARIMA models (Piccolo 1990), TAR models (Aslan et al. 2018), and GARCH models (D'Urso et al. 2016) have been considered, inter alia, in the literature. Some recent approaches rely on the use of distance criteria which compare the forecast densities estimated by using a resampling method combined with a nonparametric kernel estimator (see Alonso et al. 2006; Vilar et al. 2010).

More recently, La Rocca et al. (2021) have proposed a novel approach for clustering nonlinear autoregressive time series based on comparing their full forecast distribution at a given forecast horizon. Rather than considering patterns throughout observations, this approach introduces information on the forecast behaviour at a specific time horizon in the clustering procedure. The procedure combines a class of neural network models to approximate the original nonlinear process with the pair bootstrap as a resampling device. Under general conditions on the data generating process, the overall clustering procedure, which depends on the implemented bootstrap scheme, can deliver consistent results. Moreover, it performs well in small finite samples for a broad class of nonlinear autoregressive processes and different types of innovations.

This paper aims to present and discuss the novel clustering approach and to address its application to a real dataset on the production index for construction, a vital business cycle indicator. In the analysis, a set of 21 European countries has been considered. The observations span the period from January 2000 to December 2020 (base year 2015), including the final year, when the COVID-19 pandemic hit Europe. In this empirical analysis, the aim is to identify the different group structures induced by the COVID-19 pandemic by using the forecast one-step ahead distribution for January 2020 (so excluding any observations from the COVID-19 pandemic), the forecast multi-step ahead distribution for January 2021, and the forecast one-step ahead distribution for January 2021 (where all models have been trained up to December 2020).

The paper is organized as follows. Section 2 introduces and reviews the time series clustering procedure based on neural network forecast distributions. Section 3 discusses the case study on clustering monthly production indices for construction. Some remarks in Sect. 4 close the paper.
2 The Clustering Procedure

Let {Y_t, t ∈ Z} be a real-valued stationary stochastic process modelled as a nonlinear autoregressive (NAR) model of the form
$$
Y_t = g(Y_{t-1}, \ldots, Y_{t-p}) + \varepsilon_t,
$$
where g(·) is an unknown (possibly) nonlinear function, and {ε_t} are iid error terms, with E[ε_t] = 0 and E[ε_t²] > 0. In the following, for the sake of simplicity, we put $\mathbf{x}'_{t-1} = (Y_{t-1}, \ldots, Y_{t-p})$.

Let $\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(S)}$ be S observed time series of length T generated from a DGP of the previous class, where $\mathbf{y}^{(i)} = \big(Y_1^{(i)}, \ldots, Y_T^{(i)}\big)$. The aim is to cluster the time series based on their full forecast distribution at a specific future time T + h, with h ≥ 1. The proposed approach accounts for the future dynamic behaviour of the time series in the clustering procedure by using the L_r-norm distance
$$
D_{r,ij} = \int \big| F^i_{T+h|T}(y) - F^j_{T+h|T}(y) \big|^r \, dy, \qquad r = 1, 2,
\tag{1}
$$
where $F^i_{T+h|T}(\cdot)$, i = 1, ..., S, is the forecast distribution function at a given future point T + h of the series $\mathbf{y}^{(i)}$, conditioned on the information set available up to time T. Since the L_r-norm distance previously defined cannot be computed directly, La Rocca et al. (2021) have proposed a strategy in which the unknown distributions are consistently estimated by using a feedforward neural network estimator and the pair bootstrap approach. In particular, given the forecast horizon h, the unknown function g(·) can be approximated by using the network
$$
f_{mh}(\mathbf{x}_{t-h}; \boldsymbol{\theta}) = \sum_{k=1}^{m} c_k\, \psi\big(\mathbf{w}'_k \mathbf{x}_{t-h} + w_{k0}\big) + c_0,
\tag{2}
$$
with $\boldsymbol{\theta} = (c_0, c_1, \ldots, c_m, \mathbf{w}_1, \ldots, \mathbf{w}_m, w_{10}, \ldots, w_{m0})$, where m is the hidden layer size; $\mathbf{w}_k$ are the vectors of weights for the connections between the input layer and the hidden layer; $c_k$, k = 1, ..., m, are the weights of the links between the hidden layer and the output; $w_{k0}$ and $c_0$ are the bias terms; ψ(·) is a properly chosen activation function; and $\mathbf{x}'_{t-h} = (Y_{t-h}, \ldots, Y_{t-h-p+1})$. As usual in neural network applications, we assume a sigmoidal activation function such as the logistic or the hyperbolic tangent function. In this case, single hidden layer neural networks are universal approximators in that they can arbitrarily closely approximate, in an appropriate corresponding metric, L¹-integrable functions (Hornik 1991). Moreover, they have a good performance in forecasting and are uninfluenced by the eventual increasing dimension of the time series. The general proposed procedure is reported in the following algorithm.
Algorithm
1: Fix the forecast horizon h ≥ 1. Let $\mathcal{X} = \{(Y_t, \mathbf{x}'_{t-h}),\ t = p + h, \ldots, T\}$.
2: Fix the hidden layer size m and the lag structure p and estimate the weights of the network as $\hat{\boldsymbol{\theta}}_h = \arg\min_{\boldsymbol{\theta}} \frac{1}{T - p - h + 1} \sum_{t = p + h}^{T} \big(Y_t - f_{mh}(\mathbf{x}_{t-h}; \boldsymbol{\theta})\big)^2$.
3: Compute the residuals from the estimated network, defined as $\hat{\varepsilon}_t = Y_t - f_{mh}\big(\mathbf{x}_{t-h}; \hat{\boldsymbol{\theta}}_h\big)$.
4: Compute the centred residuals: $\tilde{\varepsilon}_t = \hat{\varepsilon}_t - \frac{1}{T - p - h + 1} \sum_{t = p + h}^{T} \hat{\varepsilon}_t$.
5: Resample $\{(Y^*_t, \mathbf{x}^{*\prime}_{t-h}) = (Y^*_t, Y^*_{t-h}, \ldots, Y^*_{t-h-p+1}),\ t = p + h, \ldots, T\}$ as an iid sample from the set of tuples $\mathcal{X}$.
6: Get the bootstrap estimate of the neural network weights: $\hat{\boldsymbol{\theta}}^*_h = \arg\min_{\boldsymbol{\theta}} \frac{1}{T - p - h + 1} \sum_{t = p + h}^{T} \big(Y^*_t - f_{mh}(\mathbf{x}^*_{t-h}; \boldsymbol{\theta})\big)^2$.
7: Compute $\hat{Y}^*_{T+h} = f_{mh}\big(Y_T, Y_{T-1}, \ldots, Y_{T-p+1}; \hat{\boldsymbol{\theta}}^*_h\big) + \varepsilon^*_{T+h}$, where $\varepsilon^*_{T+h}$ is a random sample from the centred residuals $\{\tilde{\varepsilon}_t\}$.
8: The bootstrap forecast distribution $F^*_{T+h|T}$ is given by the law of $\hat{Y}^*_{T+h}$ conditioned on $\mathcal{X}$.
As usual, the bootstrap distribution can be approximated by Monte Carlo simulations, repeating B times Steps 5–7 and then computing the empirical cumulative distribution function (ECDF) of $\hat{Y}^b_{T+h}$, b = 1, 2, ..., B:
$$
\hat{F}^*_{T+h|T}(y) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}\big( \hat{Y}^b_{T+h} \le y \big),
\tag{3}
$$
with I(·) denoting the indicator function. Some comments and remarks on some of the steps of the algorithm are in order. In Step 1, the forecast horizon is fixed and depends on the problem at hand. It is left to the researcher. In Step 2, selecting a suitable neural network topology is a critical issue, but it has been deeply studied from a statistical perspective. Several solutions have been proposed involving information criteria (Kuan and White 1994), pruning, stopped training and regularization (Reed 1993), and inferential techniques (Anders & Korn 1999; La Rocca & Perna 2005). Finally, it is worthwhile to stress that, although the parameters of a neural network model are unidentifiable, when focusing on prediction, the problem disappears (Hwang and Ding 1997). This latter property justifies the use of a feedforward neural network in a procedure of clustering nonlinear processes based on forecast distributions. Here, we have used an almost automatic approach based on time series cross-validation as described in Bergmeir et al. (2018). In Step 5, the pair bootstrap has been implemented as a resampling scheme. This choice appears appropriate in neural networks since it is robust for misspecified models, as the neural networks intrinsically are. In Step 6, the optimization problem has been treated as an NLS problem and solved with a BFGS algorithm. In Step 7, a direct multi-step forecasting approach is considered. A separate neural network model is estimated for each forecasting horizon, and forecasts are computed only conditioning on the observed data. This choice is justified by the nonlinearity of
the data generating process. Moreover, although the direct strategy may have greater variability than the recursive method, it is most beneficial when the model is misspecified (Chevillon 2007), a peculiar aspect of neural networks, which are inherently misspecified, as we have previously pointed out. Under very general assumptions, concerning essentially the stationarity and the α-mixing condition of the data generating process (see La Rocca et al. 2021 for details), the proposed clustering procedure can deliver consistent results in nonlinear autoregressive models. Moreover, it also gives good results in finite sample sizes for a broad class of nonlinear autoregressive processes and different types of innovations.
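As a concrete illustration of Steps 1–8, the condensed sketch below generates the bootstrap forecast distribution for a single series. It is not the authors' implementation (which is written in R with a BFGS optimizer); scikit-learn's MLPRegressor is used here as a stand-in for the single-hidden-layer network f_mh, and all names and defaults are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def bootstrap_forecast_draws(y, p=2, m=3, h=1, B=500, seed=0):
    """Return B bootstrap draws of Y*_{T+h} for a 1D array y of length T (direct h-step scheme)."""
    rng = np.random.default_rng(seed)
    T = len(y)
    # Step 1: tuples (Y_t, x'_{t-h}) with x_{t-h} = (Y_{t-h}, ..., Y_{t-h-p+1}), t = p+h, ..., T.
    X = np.column_stack([y[p - 1 - k: T - h - k] for k in range(p)])
    Y = y[p + h - 1: T]
    # Step 2: fit the single-hidden-layer network with a quasi-Newton optimizer.
    net = MLPRegressor(hidden_layer_sizes=(m,), activation="logistic",
                       solver="lbfgs", max_iter=5000).fit(X, Y)
    # Steps 3-4: centred residuals.
    res = Y - net.predict(X)
    res = res - res.mean()
    x_last = y[T - 1: T - 1 - p: -1]            # (Y_T, ..., Y_{T-p+1})
    draws = np.empty(B)
    idx = np.arange(len(Y))
    for b in range(B):
        s = rng.choice(idx, size=len(idx), replace=True)     # Step 5: pair bootstrap
        net_b = MLPRegressor(hidden_layer_sizes=(m,), activation="logistic",
                             solver="lbfgs", max_iter=5000).fit(X[s], Y[s])  # Step 6
        # Step 7: bootstrap forecast plus a resampled centred residual.
        draws[b] = net_b.predict(x_last.reshape(1, -1))[0] + rng.choice(res)
    return draws                                 # Step 8: their ECDF estimates F*_{T+h|T}
```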
3 An Application to the European Construction Sector The proposed procedure has been used to cluster the production index for construction (seasonally and calendar adjusted) for 21 European countries. The production index measures the building and construction industry activity. It is of great interest for evaluating the sector’s contribution to the economy and is considered a critical business cycle indicator. The time series are observed from January 2000 to December 2020 (the base year is 2015). In the last year, the dataset includes the period in which the European countries experienced the COVID-19 pandemic, which has hit our most consolidated habits with severe challenges for social and economic systems. The dataset has been downloaded from Eurostat (https://appsso.eurostat.ec.europa. eu/nui/show.do?dataset=sts_copr_m&lang=en). The time series plots reported in Fig. 1 show, for all the European countries, a structural break at the beginning of 2020, due to the well-known COVID-19 pandemic. This behaviour is more evident in the countries that have already experienced a nationwide lockdown during the pandemic beginning with more severe provisions and personal restrictions. Moreover, the time plots highlight a clear non-stationarity in the mean of all the considered time series. To achieve stationarity, a necessary condition to implement the proposed clustering strategy, all the series have been pre-processed by applying the first-order differences. The plots of the differenced series for all the European countries are reported in Fig. 2. Since all the series are seasonally and calendar adjusted, the first difference is the only needed transformation to get stationarity. The difference plots highlight even better the severe effect of the pandemic on the construction sector, except for a few countries (Czechia, Finland, Romania, and Slovenia) that have experienced lighter restrictions on economic activity to limit the virus spread. The results of the Teraesvirta test for neglected nonlinearity are reported in Table 1. For 11 out of 21 series, the test rejects the null linearity hypothesis at level 5%. However, note that, although artificial neural networks show clear advantages in the case of nonlinearity, they continue to be effective tools for linear relationships. That
[Fig. 1 consists of 21 panels, one per country (Austria, Belgium, Bulgaria, Croatia, Czechia, Denmark, Finland, France, Germany, Hungary, Italy, Luxembourg, Netherlands, Poland, Portugal, Romania, Slovakia, Slovenia, Spain, Sweden, UK), each plotting the monthly index against the years 2000–2020.]
Fig. 1 Production indices for construction (base = 2015). Seasonally and calendar adjusted monthly time series from January 2000 to December 2020
allows straightforward and homogeneous modelling of both linear and nonlinear dynamics. Neural network models have been trained by using a quadratic loss function with weight decay and a Broyden-Fletcher-Goldfarb-Shanno optimization algorithm. The authors have implemented all procedures using the R language (version 4.1.2). The bootstrap resampling scheme has been efficiently coded by exploiting the equivalence between the pair bootstrap and the multinomial weighted bootstrap and its embarrassingly parallel nature. As a result, a single bootstrap prediction distribution can be obtained with a computational burden that, on average, is in the range of tens of seconds on an Intel I7 multicore processor. The input lag values p and the network topology’s hidden layer size m to be fitted to each time series have been determined by time-series cross-validation. The results are reported in the last two columns of Table 1.
[Fig. 2 consists of 21 panels, one per country, plotting the first-order differences of the monthly index over the years 2000–2020.]
Fig. 2 First-order differences of production indices for construction (base = 2015)
To identify if the COVID-19 pandemic has induced different group structures, we have used the forecast one-step-ahead distribution for January 2020 and the forecast thirteen-step-ahead distribution for January 2021 (excluding any observations from the COVID-19 pandemic), using as training period January 2000–December 2019. Then, we added January 2019–December 2020 to the previous set (including observations from the COVID-19 pandemic) to obtain the forecast models to derive the one-step-ahead distribution for January 2021. Possible converging or diverging behaviour among the countries, as shown by the two group structures, might be due to the effect of the pandemic on the construction sector. We have considered a hierarchical clustering technique with the average linking method. The metric is given by the bootstrap estimated counterpart of expressions (1), namely Dˆ r , with r = 1 and r = 2 estimated by the data. All bootstrap distributions have been derived using 5,000 Monte Carlo runs.
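For completeness, the clustering step can be sketched as follows (illustrative Python, not the authors' R code): the L_r distance in (1) is approximated on a grid from the bootstrap ECDFs, and the resulting dissimilarity matrix is passed to average-linkage hierarchical clustering; scipy.cluster.hierarchy.dendrogram can then be applied to the returned linkage to draw figures analogous to Figs. 3 and 4.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def lr_distance(draws_i, draws_j, r=1, grid_size=512):
    """Approximate the L_r distance (1) between two forecast ECDFs from bootstrap draws."""
    lo = min(draws_i.min(), draws_j.min())
    hi = max(draws_i.max(), draws_j.max())
    grid = np.linspace(lo, hi, grid_size)
    Fi = np.searchsorted(np.sort(draws_i), grid, side="right") / len(draws_i)
    Fj = np.searchsorted(np.sort(draws_j), grid, side="right") / len(draws_j)
    return np.sum(np.abs(Fi - Fj) ** r) * (grid[1] - grid[0])   # Riemann-sum approximation

def cluster_countries(boot_draws, r=1):
    """boot_draws: dict country -> 1D array of bootstrap forecasts Y*_{T+h}."""
    names = list(boot_draws)
    S = len(names)
    D = np.zeros((S, S))
    for a in range(S):
        for b in range(a + 1, S):
            D[a, b] = D[b, a] = lr_distance(boot_draws[names[a]], boot_draws[names[b]], r)
    return names, linkage(squareform(D), method="average")   # average-linkage dendrogram input
```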
Table 1 Teraesvirta's neural network test for neglected nonlinearity (p-values, lag = 1). The values of p and m denote, respectively, the maximum lag and the hidden layer size of the neural network models selected via time series cross-validation. In bold in the original table are the p-values lower than the threshold 0.05 (11 out of 21 time series).

|    | Country     | p-value | p | m  |
|----|-------------|---------|---|----|
| 1  | Austria     | 0.4865  | 6 | 4  |
| 2  | Belgium     | 0.0000  | 3 | 3  |
| 3  | Bulgaria    | 0.3548  | 2 | 2  |
| 4  | Croatia     | 0.0560  | 2 | 6  |
| 5  | Czechia     | 0.0000  | 2 | 5  |
| 6  | Denmark     | 0.1071  | 4 | 3  |
| 7  | Finland     | 0.0000  | 6 | 2  |
| 8  | France      | 0.0121  | 3 | 4  |
| 9  | Germany     | 0.0019  | 1 | 6  |
| 10 | Hungary     | 0.3321  | 5 | 3  |
| 11 | Italy       | 0.0034  | 2 | 8  |
| 12 | Luxembourg  | 0.3123  | 5 | 3  |
| 13 | Netherlands | 0.3601  | 6 | 5  |
| 14 | Poland      | 0.0009  | 1 | 8  |
| 15 | Portugal    | 0.0011  | 6 | 2  |
| 16 | Romania     | 0.3935  | 6 | 2  |
| 17 | Slovakia    | 0.0644  | 3 | 2  |
| 18 | Slovenia    | 0.0275  | 3 | 3  |
| 19 | Spain       | 0.0000  | 6 | 10 |
| 20 | Sweden      | 0.8719  | 5 | 3  |
| 21 | UK          | 0.0005  | 1 | 9  |
The dendrograms showing the results of the proposed clustering procedure using the L 1 -norm are reported in Fig. 3a, b and c. Apparently, the group structure would have been almost identical without the impact due to the COVID-19 pandemic (h = 1 and h = 13), showing a somewhat stable economic evolution of all the countries considered in the application (see panels a and b). All groups have been identified with the average silhouette. In particular, with h = 1 five groups have been identified: the first with Bulgaria and Slovakia (group G1 ); the second with France, Italy, Portugal, Belgium, and Spain (group G2 ); the third with the UK, Czechia, Luxemburg, Germany, Finland, Denmark, and Poland (group G3 ); the fourth with Slovenia, Sweden, the Netherlands, Croatia, and Austria (group G4 ); Hungary constitutes a group of its own. Without the impact of the COVID-19 pandemic, the clustering would have remained the same. Of course, the countries within each group join at different distances, showing further evolution for the construction sector while preserving the group structure. On the contrary, when the clustering is based on models that include the year 2020 in the training period, where all countries experienced severe contractions in their economic activities, the dataset shows a pretty different group structure, indicating different routes and timelines for economic recovery (see panel c). For example, Italy moved from the group G2 to a group that now includes Poland, Austria, Sweden, Luxembourg, and Finland. The UK moves from the group G3 to a group that now includes Czechia and Portugal. In Fig. 4a, b, and c, the dendrograms with the L 2 -norm are reported. Again, we have considered a hierarchical clustering technique with the average linking method and the forecast horizons h = 1 and h = 13 (with neural networks trained with observations up to December 2019) and h = 1 (with neural networks trained with
[Fig. 3 shows three dendrograms (hierarchical clustering of the 21 countries, with height on the vertical axis): (a) training period January 2000–December 2019, prediction h = 1; (b) training period January 2000–December 2019, prediction h = 13; (c) training period January 2000–December 2020, prediction h = 1.]
Fig. 3 Construction index clustering based on h-step-ahead forecast distributions and L 1 -norm distance
[Fig. 4 shows the analogous three dendrograms for the L_2-norm: (a) training period January 2000–December 2019, prediction h = 1; (b) training period January 2000–December 2019, prediction h = 13; (c) training period January 2000–December 2020, prediction h = 1.]
Fig. 4 Construction index clustering based on h-step-ahead forecast distributions and L 2 -norm distance
observations up to December 2020). As in the previous analysis with L 1 −norm, the group structure remains almost stable with h = 1 and h = 13 (see panel Fig. 4a and b), while, once again, substantial differences are highlighted when the models include the year 2020 (see panel Fig. 4c). Some interesting remarks are in order by comparing the results obtained with the two norms. The clustering structure seems to be not sensitive to the choice of the norm used to compute countries’ distances: the groups remain almost identical in the pre-COVID-19 period. That once again shows relative stability in the dynamic evolution of the European construction sector without the COVID-19 shock. On the contrary, based on models that include the year 2020 in the training period, results obtained with the L 1 -norm differ from those obtained with the L 2 -norm, and different clusters of countries arise. A possible explanation might be found in the different behaviour of the two norms when structural breaks or outlying observations (both additive and innovative) are included in the training period.
4 Concluding Remarks In this paper, we have presented and discussed an approach for clustering autoregressive nonlinear time series. It is based on dissimilarities computed according to time series forecast distributions. It uses feedforward neural networks to approximate the original process and pair bootstrap as a resampling device. An empirical analysis of the construction sector has shown the good performance of the proposed approach. Moreover, we have also empirically evaluated if the COVID-19 pandemic has affected production indices for construction by comparing the group structures pre- and post-COVID-19. In the pre-COVID period, the structure of the groups seems to be the same as the time horizon changes. In the post-COVID period, a substantial change seems to have occurred, confirming, once again, the effect that this pandemic has had on the European economy.
References Aghabozorgi, S., Shirkhorshidi, A. S., & Wah, T. Y. (2015). Time-series clustering-a decade review. Information Systems, 53, 16–38. Alonso, A. M., Berrendero, J. R., Hernández, A., & Justel, A. (2006). Time series clustering based on forecast densities. Computational Statistics & Data Analysis, 51(2), 762–776. Anders, U., & Korn, O. (1999). Model selection in neural networks. Neural Networks, 12, 309–323. Aslan, S., Yozgatligil, C., & Iyigun, C. (2018). Temporal clustering of time series via threshold autoregressive models: application to commodity prices. Annals of Operations Research, 260, 51–77. Bergmeir, C., Hyndman, R. J., & Koo, B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120, 70–83.
Chevillon, G. (2007). Direct multi-step estimation and forecasting. Journal of Economic Survey, 21, 746–785. D’Urso, P., & Maharaj, E. A. (2009). Autocorrelation-based fuzzy clustering of time series. Fuzzy Sets and Systems, 160, 3565–3589. D’Urso, P., De Giovanni, L., & Massari, R. (2016). GARCH-based robust clustering of time series. Fuzzy Sets and Systems, 305, 1–28. D’Urso, P., De Giovanni, L., & Massari, R. (2018). Robust fuzzy clustering of multivariate time trajectories. International Journal of Approximate Reasoning, 99, 12–38. D’Urso, P., De Giovanni, L., Massari, R., D’Ecclesia, R. L., & Maharaj, E. A. (2020). Cepstral-based clustering of financial time series. Expert Systems with Applications, 161, 113705. Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4, 251–257. Hwang, J. T. G., & Ding, A. A. (1997). Prediction intervals for artificial neural networks. Journal of the American Statistical Association, 92, 748–757. Kuan, C., & White, H. (1994). Artificial neural networks: an econometric perspective. Econometric Reviews, 13, 1–91. La Rocca, M., & Perna, C. (2005). Variable selection in neural network regression models with dependent data: a subsampling approach. Computational Statistics & Data Analysis, 48, 415– 429. La Rocca, M., Giordano, F., & Perna, C. (2021). Clustering nonlinear time series with neural network bootstrap forecast distributions. International Journal of Approximate Reasoning, 137, 1–15. Lafuente-Rego, B., & Vilar, J. A. (2016). Clustering of time series using quantile autocovariances. Advances in Data Analysis and Classification, 10, 391–415. Piccolo, D. (1990). A distance measure for classifying ARIMA models. Journal of Time Series Analysis, 11, 153–164. Reed, R. (1993). Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 4, 740–747. Vilar, J. A., Alonso, A. M., & Vilar, J. M. (2010). Non-linear time series clustering based on nonparametric forecast densities. Computational Statistics & Data Analysis, 54(11), 2850–2865.
Partial Reconstruction of Measures from Halfspace Depth Petra Laketa and Stanislav Nagy
Abstract The halfspace depth of a d-dimensional point x with respect to a finite (or probability) Borel measure μ in Rd is defined as the infimum of the μ-masses of all closed halfspaces containing x. A natural question is whether the halfspace depth, as a function of x ∈ Rd , determines the measure μ completely. In general, it turns out that this is not the case, and it is possible for two different measures to have the same halfspace depth function everywhere in Rd . In this paper, we show that despite this negative result, one can still obtain a substantial amount of information on the support and the location of the mass of μ from its halfspace depth. We illustrate our partial reconstruction procedure in an example of a non-trivial bivariate probability distribution whose atomic part is determined successfully from its halfspace depth. Keywords Halfspace depth · Reconstruction · Characterization problem
1 The Depth Characterization/Reconstruction Problem

Let x be a point in the d-dimensional Euclidean space Rd and let μ be a finite Borel measure in Rd. We write H for the collection of all closed halfspaces¹ in Rd and H(x) for the subset of those halfspaces from H that contain x in their boundary hyperplane. The halfspace depth (or Tukey depth) of the point x with respect to μ is defined as
$$
D(x; \mu) = \inf_{H \in \mathcal{H}(x)} \mu(H).
\tag{1}
$$
¹ A halfspace is one of the two regions determined by a hyperplane in Rd; any halfspace can be written as a set {y ∈ Rd : ⟨y, u⟩ ≤ c} for some c ∈ R and u ∈ Rd \ {0}.
P. Laketa · S. Nagy (B) Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic e-mail: [email protected] P. Laketa e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Grilli et al. (eds.), Statistical Models and Methods for Data Science, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-30164-3_8
The history of the halfspace depth in statistics goes back to the 1970s (Tukey 1975). The halfspace depth plays an important role in the theory and practice of nonparametric inference for multivariate data; for many references, see Liu et al. (1999); Nagy et al. (2019); Zuo and Serfling (2000). The depth (1) was originally designed to serve as a multivariate generalization of the quantile function. As such, it is desirable that, just as the quantile function in R, the depth function x → D(x; μ) in Rd characterizes the underlying measure μ uniquely, and that μ is straightforward to retrieve from its depth. The question of whether the last two properties are valid for D is known as the halfspace depth characterization and reconstruction problems. They both turned out not to have an easy answer. In fact, only the recent progress in the theory of the halfspace depth gave the first definite solutions to some of these problems. In Nagy (2021), the general characterization question for the halfspace depth was answered in the negative, by giving examples of different probability distributions in Rd with d ≥ 2 with identical halfspace depth functions. On the other hand, several authors have also obtained partial positive answers to the characterization problem; for a recent overview of that work, see Nagy (2020). Only three types of distributions are known to be completely characterized by their halfspace depth functions: (i) univariate measures, in which case the depth (1) is just a simple transform of the distribution function of μ; (ii) atomic measures with finitely many atoms (which we subsequently call finitely atomic measures for brevity) in Rd (Struyf & Rousseeuw, 1999; Laketa & Nagy, 2021); and (iii) measures that possess all Dupin floating bodies² (Nagy et al. 2019). In this contribution, we revisit the halfspace depth reconstruction problem. We pursue a general approach and do not restrict only to atomic measures or to measures with densities. Our results are valid for any finite (or probability) Borel measure μ in Rd. As the first step in addressing the reconstruction problem, our intention is to identify the support and the location of the atoms of μ, based on their depth. We will see at the end of this note that without additional assumptions, neither of these problems can be resolved. We, however, prove several positive results. We begin by introducing the necessary mathematical background in Sect. 2. In Sect. 3, we state our main theorem; a detailed proof of that theorem is given in the Appendix. We show that (i) the support of the measure μ must be concentrated only in the boundaries of the level sets of its halfspace depth; (ii) each atom of μ is an extreme point of the corresponding (closed and convex) upper level sets of the halfspace depth; and (iii) each atom of μ induces a jump in the halfspace depth function on the line passing through that atom.
² A Borel measure μ on Rd is said to possess all Dupin floating bodies if each tangent halfspace to the halfspace depth upper level set {x ∈ Rd : D(x; μ) ≥ α} is of μ-mass exactly α, for all α ≥ 0.
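Throughout the paper, it may help to keep in mind how (1) is evaluated in practice for an empirical or finitely atomic measure. The snippet below is a naive illustrative sketch (not the authors' code) for d = 2: it scans a grid of directions u and returns the smallest mass of the closed halfspace {y : ⟨y, u⟩ ≤ ⟨x, u⟩}, which approximates the infimum in (1).

```python
import numpy as np

def halfspace_depth_2d(x, points, weights=None, n_dir=3600):
    """Approximate D(x; mu) for a weighted point mass measure mu in R^2."""
    points = np.asarray(points, dtype=float)
    x = np.asarray(x, dtype=float)
    w = np.ones(len(points)) / len(points) if weights is None else np.asarray(weights, float)
    depth = np.inf
    for t in np.linspace(0.0, 2.0 * np.pi, n_dir, endpoint=False):
        u = np.array([np.cos(t), np.sin(t)])
        # Mass of the closed halfspace whose boundary hyperplane passes through x.
        mass = w[(points @ u) <= (x @ u) + 1e-12].sum()
        depth = min(depth, mass)
    return depth
```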
These advances enable us to identify the location of the atoms of μ, at least in simpler scenarios. We illustrate this in Sect. 4, where we give an example of a non-trivial bivariate probability measure μ whose atomic part we are able to determine from its depth. We conclude by giving an example of two measures whose depth functions are the same, yet both their supports and the location of their atoms differ.
2 Preliminaries: Flag Halfspaces and Central Regions

Notations. When writing simply of a subspace of R^d we always mean an affine subspace, that is the set a + L = {a + x ∈ R^d : x ∈ L} for a ∈ R^d and L a linear subspace of R^d. The intersection of all subspaces in R^d that contain a set A ⊆ R^d is called the affine hull of A, and denoted by aff(A). It is the smallest subspace that contains A. The affine hull aff({x, y}) of two different points x, y ∈ R^d is the infinite line passing through both x and y; another example of a subspace is any hyperplane in R^d. For a set A ⊆ R^d we write int(A), cl(A), and bd(A) to denote the interior, closure, and boundary of A, respectively. The interior, closure, and boundary of a set B ⊆ A when considered only as a subset of a subspace A ⊆ R^d are denoted by int_A(B), cl_A(B), and bd_A(B), respectively. For two different points x, y ∈ R^d, x ≠ y, we denote by L(x, y) the interior of the line segment between x and y when considered inside the infinite line aff({x, y}). In other words, L(x, y) is the open line segment between x and y. In the special case of A = aff(B) we write relint(B) = int_A(B), relbd(B) = bd_A(B), and relcl(B) = cl_A(B) to denote the relative interior, relative boundary, and relative closure of B, respectively. For instance, relbd(L(x, y)) = {x, y} and L(x, y) = relint(L(x, y)), but int(L(x, y)) = ∅ if d > 1.

We write M(R^d) for the collection of all finite Borel measures in R^d. For a subspace A ⊆ R^d and μ ∈ M(R^d) we write μ|_A to denote the measure obtained by restricting μ to the subspace A, that is the finite Borel measure given by μ|_A(B) = μ(B ∩ A) for any Borel set B ⊆ R^d. By supp(μ) we mean the support of μ ∈ M(R^d), which is the smallest closed subset of R^d of full μ-mass.
2.1 Minimizing Halfspaces and Flag Halfspaces

For μ ∈ M(R^d) and x ∈ R^d, we call H ∈ H(x) a minimizing halfspace of μ at x if μ(H) = D(x; μ). For d = 1 a minimizing halfspace always trivially exists. It also exists if μ is smooth, in the sense that μ(bd(H)) = 0 for all H ∈ H(x), or if μ ∈ M(R^d) is finitely atomic. In general, however, the infimum in (1) does not have to be attained. We give a simple example.
Fig. 1 The support of μ ∈ M(R²) from Example 1 (colored disk) and its unique atom a (diamond). No minimizing halfspace of μ at x = (1, 0) ∈ R² exists. On the left-hand panel, we see a halfspace Hn ∈ H(x) whose μ-mass is almost D(x; μ). The halfspace Hn does not contain a. On the right-hand panel, the unique minimizing flag halfspace F ∈ F(x) of μ at x is displayed
Example 1 Take μ ∈ M(R²) to be the sum of a uniform distribution on the disk B = {x ∈ R² : ‖x‖ ≤ 2} and a Dirac measure at a = (1, 1) ∈ R². For x = (1, 0) ∈ R² no minimizing halfspace of μ at x exists. As can be seen in Fig. 1, the depth D(x; μ) is approached by μ(H_n) for a sequence of halfspaces H_n ∈ H(x) with inner normals v_n = (cos(−1/n), sin(−1/n)) that converge to the halfspace H_v ∈ H(x) with inner normal v = (1, 0), yet D(x; μ) = lim_{n→∞} μ(H_n) < μ(H_v).

For certain theoretical properties of the halfspace depth of μ to be valid, the existence of minimizing halfspaces appears to be crucial. As a way to alleviate the issue of their possible non-existence, a novel concept of the so-called flag halfspaces was introduced in Pokorný et al. (2022). A flag halfspace F centered at a point x ∈ R^d is defined as any set of the form
$$F = \{x\} \cup \bigcup_{i=1}^{d} \operatorname{relint}(H_i). \qquad (2)$$
In this formula, H_d ∈ H(x) and, for each i ∈ {1, ..., d − 1}, H_i stands for an i-dimensional halfspace inside the subspace relbd(H_{i+1}) such that x ∈ relbd(H_i). The collection of all flag halfspaces in R^d centered at x ∈ R^d is denoted by F(x). An example of a flag halfspace in R² is displayed in the right-hand panel of Fig. 1. That flag halfspace is the union of an open halfplane H_2 (light-colored halfplane) whose boundary passes through x, a halfline (thick halfline) in the boundary line bd(H_2) starting at x, and the point x itself. The results derived in the present paper lean on the following crucial observation, whose complete proof can be found in Pokorný et al. (2022, Theorem 1).
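To see numerically why no minimizing halfspace exists in Example 1, the following sketch (not part of the original text) approximates the μ-mass of the closed halfspaces H ∈ H(x) at x = (1, 0) as the inner normal rotates towards v = (1, 0). The disk radius 2, the atom mass δ = 1/10 (as used in Fig. 2), and the Monte Carlo sample size are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed Example 1 set-up: uniform distribution (total mass 1) on the disk of
# radius 2 centred at the origin, plus an atom of mass delta at a = (1, 1).
n, delta = 200_000, 0.1
u = rng.normal(size=(n, 2))
u /= np.linalg.norm(u, axis=1, keepdims=True)
disk = 2.0 * np.sqrt(rng.uniform(size=(n, 1))) * u      # uniform sample on the disk
a, x = np.array([1.0, 1.0]), np.array([1.0, 0.0])

def mu_mass(v):
    """Approximate mu of the closed halfspace {y : <y - x, v> >= 0}."""
    uniform_part = np.mean((disk - x) @ v >= 0.0)        # mass of the uniform part
    atom_part = delta if (a - x) @ v >= 0.0 else 0.0     # atom counted iff a lies in H_v
    return uniform_part + atom_part

# Inner normals v_n = (cos(-1/n), sin(-1/n)) approach v = (1, 0).
for t in (1.0, 0.1, 0.01, 0.001, 0.0):
    v = np.array([np.cos(-t), np.sin(-t)])
    print(f"tilt {t:7.3f}:  mu(H_v) ~ {mu_mass(v):.4f}")
# The masses decrease towards the infimum D(x; mu), but at the limiting normal
# v = (1, 0) the atom a lies on bd(H_v), so mu(H_v) jumps up by delta:
# the infimum is approached but never attained.
```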
Lemma 1 For any x ∈ R^d and μ ∈ M(R^d) it holds true that
$$D(x; \mu) = \min_{F \in \mathcal{F}(x)} \mu(F).$$
In particular, there always exists F ∈ F(x) such that μ(F) = D (x; μ). Any flag halfspace F ∈ F(x) from Lemma 1 that satisfies μ(F) = D (x; μ) is called a minimizing flag halfspace of μ at x. This is because it minimizes the μ-mass among all the flag halfspaces from F(x). Lemma 1 tells us two important messages. First, the halfspace depth D(x; μ) can be introduced also in terms of flag halfspaces instead of the usual closed halfspaces in (1), and the two formulations are equivalent to each other. Second, in contrast to the usual minimizing halfspaces that do not exist at certain points x ∈ Rd , according to Lemma 1, there always exists a minimizing flag halfspace of any μ at any x.
2.2 Halfspace Depth Central Regions

The upper level sets of the halfspace depth function D(·; μ), given by
$$D_\alpha(\mu) = \{x \in \mathbb{R}^d : D(x; \mu) \ge \alpha\} \quad \text{for } \alpha \ge 0, \qquad (3)$$
play the important role of multivariate quantiles in depth statistics. The set Dα(μ) is called the central region of μ at level α. All central regions are known to be convex and closed. The sets (3) are clearly also nested, in the sense that Dα(μ) ⊆ Dβ(μ) for β ≤ α. Besides (3), another collection of depth-generated sets of interest, considered in Laketa and Nagy (2022) and Pokorný et al. (2022), is
$$U_\alpha(\mu) = \{x \in \mathbb{R}^d : D(x; \mu) > \alpha\} \quad \text{for } \alpha \ge 0.$$
We conclude this collection of preliminaries with another result from Pokorný et al. (2022), which tells us that the set difference Dα(μ) \ Uα(μ) cannot contain a relatively open subset of positive μ-mass. That result lends insight into the properties of the support of μ, based on its depth function D(·; μ). It will be of great importance in the proof of our main result in Sect. 3. The complete proof of the next technical lemma can be found in Laketa et al. (2022, Lemma 3.1).

Lemma 2 Let μ ∈ M(R^d) and let K ⊂ R^d be a relatively open set of points of equal depth of μ that contains at least two points. Then μ(K) = 0.
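For a finitely atomic (e.g. empirical) measure, the depth and the central regions (3) can be approximated directly. The following sketch is our own illustration, not part of the chapter: it evaluates an approximation of D(x; μ̂) by minimizing the empirical halfspace mass over a finite set of directions (which can only overestimate the infimum), and then collects grid points of depth at least α as an approximation of the central region Dα(μ̂).

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((500, 2))                 # empirical measure mu_hat in R^2

angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])

def depth(x):
    """Approximate halfspace depth of x with respect to mu_hat.

    For every scanned direction u, the empirical mass of the closed halfspace
    {y : <y - x, u> >= 0} is computed; the minimum over directions
    approximates (from above) the infimum over all closed halfspaces.
    """
    masses = np.mean((data - x) @ dirs.T >= 0.0, axis=0)
    return float(masses.min())

print(depth(np.array([0.0, 0.0])))                   # close to 1/2 near the centre
print(depth(np.array([3.0, 0.0])))                   # small depth far from the data

# Approximate central region D_alpha(mu_hat): grid points with depth >= alpha.
alpha = 0.25
xs = np.linspace(-3.0, 3.0, 41)
grid = np.array([(s, t) for s in xs for t in xs])
region = grid[[depth(p) >= alpha for p in grid]]
# The collected points fill a convex set; the regions are nested in alpha.
```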
3 Main Result

The preliminary Lemma 2 hints that the mass of μ cannot be located in the interior of regions of constant depth. We refine and formalize that claim in the following Theorem 1, which is the main result of the present work.

In part (i) of Theorem 1 we bound the support of μ ∈ M(R^d), based on the information available in its depth function D(·; μ). We do so by showing that μ may be supported only in the closure of the boundaries of the central regions Dα(μ). That is a generalization of a similar result, known to be valid in the special case of finitely atomic measures μ ∈ M(R^d) (Laketa & Nagy, 2021; Liu et al., 2020; Struyf & Rousseeuw, 1999). In the latter situation, all central regions Dα(μ) are convex polytopes, there is only a finite number of different polytopes in the collection {Dα(μ) : α ≥ 0}, and the atoms of μ must be located in the vertices of the polytopes from that collection. Nevertheless, not all vertices of Dα(μ) are atoms of μ; an algorithmic procedure for the reconstruction of the atoms, and the determination of their μ-masses, is given in Laketa and Nagy (2021). Extending the last observation about the possible location of atoms from finitely atomic measures to the general scenario, in part (ii) of Theorem 1 we show that all atoms of μ are contained in the extreme points³ of the central regions Dα(μ). Note that this indeed corresponds to the known theory for finitely atomic measures: the extreme points of polytopes are exactly their vertices. Our last observation in part (iii) of Theorem 1 is that each atom x ∈ R^d of μ induces a jump discontinuity in the halfspace depth, when considered on the straight line connecting any point of higher depth with x. This will be useful in detecting possible locations of atoms for general measures.

Theorem 1 Let μ ∈ M(R^d).
(i) Let A be a subspace of R^d that contains at least two points. Then
$$\operatorname{supp}(\mu|_A) \subseteq \operatorname{cl}_A\Big(\bigcup_{\alpha \ge 0} \operatorname{bd}_A\big(D_\alpha(\mu) \cap A\big)\Big).$$
In particular, for A = R^d we have
$$\operatorname{supp}(\mu) \subseteq \operatorname{cl}\Big(\bigcup_{\alpha \ge 0} \operatorname{bd}\big(D_\alpha(\mu)\big)\Big).$$
(ii) Each atom x of μ with D(x; μ) = α is an extreme point of Dβ(μ) for all β ∈ (α − μ({x}), α].
(iii) For any x ∈ R^d with D(x; μ) = α, any z ∈ Uα(μ), and any y ∈ R^d such that x belongs to the open line segment L(y, z) between y and z, it holds true that
$$D(y; \mu) \le D(x; \mu) - \mu(\{x\}).$$

³ For a convex set C ⊂ R^d, a face of C is a convex subset F ⊆ C such that x, y ∈ C and (x + y)/2 ∈ F implies x, y ∈ F. If {z} is a face of C, then z is called an extreme point of C.

The proof of Theorem 1 is given in the Appendix. Theorem 1 sheds light on the support and the location of the atoms of a measure. Its part (i) tells us that if a depth function D(·; μ) attains only at most countably many different values, and each level set Dα(μ) is a polytope, the mass of μ must be concentrated in the closure of the set of vertices of the level sets Dα(μ). A special case is, of course, the setup of finitely atomic measures treated in Laketa and Nagy (2021); Struyf and Rousseeuw (1999).
4 Examples

We conclude this note by giving two examples. Parts (ii) and (iii) of Theorem 1 show a way, at least in special situations, to locate the atomic parts of measures. We start by reconsidering our motivating Example 1. The distribution μ ∈ M(R²) is not purely atomic and can be shown not to possess Dupin floating bodies. Thus, it is currently unknown whether its depth function D(·; μ) determines μ uniquely. In our first example of this section, we show how Theorem 1 recovers the position of the atomic part of μ. Then, in Example 3, we argue that the general problem of determining the support or the location of the atoms of μ ∈ M(R^d) from its halfspace depth cannot be solved without further restrictions.

Example 2 Suppose that in Example 1 we have μ({a}) = δ for δ ∈ (0, 1/2) small enough, and that the non-atomic part of μ is ν ∈ M(R²), uniform on the disk B, with ν(B) = 1. Hence, μ(R²) = ν(B) + μ({a}) = 1 + δ. We first compute the halfspace depth function D(·; μ) of μ and then show how to use Theorem 1 to find the atom a of μ from its depth. The computation of the depth function is done by means of determining all the central regions (3) of μ at levels β ≥ 0. We denote α = D(a; ν) and split our argumentation into three situations according to the behavior of the regions Dβ(μ): (i) β < α, where a is contained in the interior of Dβ(μ); (ii) β ∈ (α, α + δ], where a lies in the boundary of Dβ(μ); and (iii) β > α + δ, where Dβ(μ) does not contain a. First note that because ν is uniform on a disk centered at the origin, all non-empty depth regions Dβ(ν) of ν are circular disks centered at the origin, and all the touching halfspaces⁴ of Dβ(ν) carry ν-mass exactly β.

Case I: β ≤ α. For α = D(a; ν) we have that Dα(ν) is a disk centered at the origin containing a on its boundary. Note that the halfspace depths of μ and ν remain the same outside Dα(ν), since the added atom a does not lie in any minimizing halfspace of a point x ∉ Dα(ν), so we have Dβ(μ) = Dβ(ν) for all β ≤ α.
⁴ We say that H ∈ H is touching A ⊂ R^d if H ∩ A ≠ ∅ and int(H) ∩ A = ∅.
Case II: β ∈ (α, α + δ]. We have D(a; μ) = α + δ ≥ β, meaning that a ∈ Dβ(μ). Because μ is obtained by adding mass to ν, it must hold that Dβ(ν) ⊆ Dβ(μ), and due to the convexity of the central regions (3), the convex hull C of Dβ(ν) ∪ {a} must be contained in Dβ(μ). Denote by H ∈ H(a) a touching halfspace of Dβ(ν) that contains a on its boundary. Then ν(H) = β, and hence int(H) ∩ Dβ(μ) = ∅. We obtain that Dβ(μ) is equal to the convex hull of Dβ(ν) ∪ {a}.

Case III: β > α + δ. In a manner similar to Case II, one concludes that Dβ(μ) is the convex hull of the circular disk Dβ(ν) and a, intersected with the disk D_{β−δ}(ν).

In order to complete the reconstruction of the atomic part of the measure μ from Example 1 based on its depth function, we present Lemma 3, which is a special case of a more general result (called the generalized inverse ray basis theorem) whose complete proof can be found in Laketa and Nagy (2022, Lemma 4).

Lemma 3 Suppose that μ ∈ M(R^d), α > 0, a point x ∉ Dα(μ), and a face F of Dα(μ) are given, so that the relatively open line segment L(x, y) does not intersect Dα(μ) for any y ∈ F. Then there exists a touching halfspace H ∈ H of Dα(μ) such that μ(int(H)) ≤ α, x ∈ H, and F ⊂ bd(H).
Reconstruction. We now know the complete depth function D(·; μ) of μ; see also Fig. 2. From this depth only, we will locate the atoms of μ and their mass. The only point in R² that is an extreme point of more than one depth region is certainly a, so that a is the only possible candidate for an atom of μ by part (ii) of Theorem 1. Take any β ∈ (α, α + δ). Then Dβ(μ) is the convex hull of a circular disk and the point a outside that disk, so its boundary contains a line segment L(a, y_β) for y_β ∈ bd(Dβ(ν)).
Fig. 2 Left panel: The measure μ from Example 1, being the sum of a uniform distribution on a disk and a single atom at a ∈ R² (black point) with δ = 1/10, with several contours of its depth D(·; μ) (thick colored lines). The halfspace median is the red line segment in the middle of the plot. From the depth only, Theorem 1 allows us to determine the mass and the location of the atom. Two depth contours that share a ∈ R² as an extreme point are plotted with boundaries in green. Right panel: Example 3 with d = 2. Several density contours of the measure μ ∈ M(R²) (solid lines) and its atom (point at the origin), together with multiple contours of the corresponding depth D(·; μ) ≡ D(·; ν) (dashed lines)
Due to Lemma 3, there is a halfspace H_β ∈ H such that L(a, y_β) ⊂ bd(H_β) and
$$\mu(\operatorname{int}(H_\beta)) \le \beta < \alpha + \delta = D(a; \mu) \le \mu(H_\beta),$$
the last inequality because a ∈ H_β. We obtain μ(bd(H_β)) ≥ α + δ − β. This is true for any β ∈ (α, α + δ), and for different β₁, β₂ ∈ (α, α + δ) we have H_{β₁} ≠ H_{β₂} with a ∈ bd(H_{β_i}) and μ(bd(H_{β_i})) ≥ α + δ − β_i, i = 1, 2. In conclusion, we obtain uncountably many different lines bd(H_β) of positive μ-mass, all passing through a. That is possible only if a is an atom of μ, and μ({a}) ≥ δ. Theorem 1 again guarantees that μ({a}) = δ and that there is no other atom of μ.

The complete Example 1 thus gives a partial positive result toward the halfspace depth characterization problem, and promises methods allowing one to determine features of μ from its depth D(·; μ), at least for special sets of measures. The complete determination of the support or the atoms of μ from its depth is, however, a considerably more difficult problem, impossible to solve in full generality. We conclude with an example, taken from Nagy (2020, Section 2.2), of mutually singular measures⁵ μ, ν ∈ M(R^d) sharing the same depth function.

Example 3 For μ₁ ∈ M(R^d) with independent Cauchy marginals and μ₂ ∈ M(R^d) the Dirac measure at the origin, define μ ∈ M(R^d) as the sum of μ₁ and μ₂ with weights 1/d and 1/2 − 1/(2d), respectively. The total mass of μ is hence μ(R^d) = 1/2 + 1/(2d), and its support is R^d. For the other distribution take ν ∈ M(R^d), the probability measure supported in the coordinate axes A_i = {x = (x₁, ..., x_d) ∈ R^d : x_j = 0 for all j ≠ i}, i = 1, ..., d. The density g of ν with respect to the one-dimensional Hausdorff measure on its support supp(ν) = ⋃_{i=1}^{d} A_i is given as a weighted sum of densities of univariate Cauchy distributions in A_i:
$$g(x) = \frac{1}{d} \sum_{i=1}^{d} \frac{1}{\pi(1 + x_i^2)}\, \mathbb{1}[x \in A_i] \quad \text{for } x = (x_1, \dots, x_d) \in \mathbb{R}^d.$$
It can be shown (Nagy, 2020, Section 2.2) that the depths of μ and ν coincide at all points x = (x₁, ..., x_d) in R^d:
$$D(x; \mu) = D(x; \nu) = \begin{cases} \dfrac{1}{d}\left(\dfrac{1}{2} - \dfrac{\arctan\big(\max_{i=1,\dots,d} |x_i|\big)}{\pi}\right) & \text{if } x \in \mathbb{R}^d \setminus \{0\}, \\[2ex] \dfrac{1}{2} & \text{for } x = 0 \in \mathbb{R}^d. \end{cases}$$
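As a quick sanity check of this formula (our own verification, not taken from the source), consider a point x = (t, 0, ..., 0) with t > 0 on the first coordinate axis. The closed halfspace H = {y ∈ R^d : y₁ ≥ t} meets supp(ν) only in the part of the axis A₁ beyond t, so
$$\nu(H) = \frac{1}{d}\int_{t}^{\infty}\frac{\mathrm{d}s}{\pi(1+s^{2})} = \frac{1}{d}\left(\frac{1}{2}-\frac{\arctan t}{\pi}\right),$$
which coincides with the displayed value of D(x; ν), since max_i |x_i| = t here. Any other halfspace with x on its boundary contains a ray of A₁ emanating from x of at least this ν-mass, plus possibly parts of the remaining axes, so its ν-mass cannot be smaller.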
The two measures μ and ν are, however, singular, since for A = supp(ν) \ {0} we have μ(A) = ν(R^d \ A) = 0. For an arbitrary finite Borel measure it is therefore impossible to retrieve the full information about its support from its depth function alone. For a visualization of the measure μ and of its halfspace depth, see Fig. 2.

⁵ Recall that μ, ν ∈ M(R^d) are called singular if there is a Borel set A ⊂ R^d such that μ(A) = ν(R^d \ A) = 0.

The same example demonstrates that, in general, also the location of the atoms of μ ∈ M(R^d), or even their number, cannot be recovered from the depth function D(·; μ) only: the measure ν in Example 3 does not contain any atoms, but
μ has a single atom at its unique halfspace median (the smallest non-empty central region (3)). Because of the very special position of the atom of μ, it is impossible to use our results from parts (ii) and (iii) of Theorem 1 to decide whether the origin is an atom of μ, or not.
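The coincidence of the two depth functions can also be checked numerically. The sketch below (our own illustration; sample sizes and the direction grid are arbitrary) draws large samples from μ and from ν for d = 2 and compares crude Monte Carlo depth estimates with the closed-form expression above.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 2, 100_000
w_cauchy, w_atom = 1.0 / d, 0.5 - 1.0 / (2 * d)      # weights of mu_1 and mu_2

mu_pts = rng.standard_cauchy((N, d))                  # sample from mu_1 (prob. measure)
axis = rng.integers(0, d, size=N)                     # sample from nu: pick an axis,
nu_pts = np.zeros((N, d))                             # then a Cauchy coordinate on it
nu_pts[np.arange(N), axis] = rng.standard_cauchy(N)

angles = np.linspace(0.0, 2.0 * np.pi, 720, endpoint=False)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])

def depth_mu(x):
    best = np.inf
    for u in dirs:
        mass = w_cauchy * np.mean((mu_pts - x) @ u >= 0.0)   # Cauchy part, weight 1/d
        mass += w_atom * float((-x) @ u >= 0.0)              # atom at 0, if it lies in H
        best = min(best, mass)
    return best

def depth_nu(x):
    return min(np.mean((nu_pts - x) @ u >= 0.0) for u in dirs)

def closed_form(x):
    x = np.asarray(x, dtype=float)
    return 0.5 if not x.any() else (0.5 - np.arctan(np.abs(x).max()) / np.pi) / d

for x in ([1.0, 0.0], [1.0, 1.0], [0.5, -2.0]):
    print(x, round(depth_mu(np.array(x)), 3),
             round(depth_nu(np.array(x)), 3), round(closed_form(x), 3))
# The three columns agree up to Monte Carlo error, even though mu and nu are singular.
```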
5 Conclusion

The halfspace depth has many applications, for example, in classification or in nonparametric goodness-of-fit testing. However, in order to apply it properly, one needs to make sure that the measure μ is characterized by its halfspace depth function, so that the halfspace depth can be used to distinguish μ from other measures. For that reason, it is important to know which collections of measures satisfy this property. The partial reconstruction procedure provided in this paper may be used to narrow down the set of all possible measures that correspond to a given halfspace depth function. That can be used to guide the selection of an appropriate tool for depth-based analysis. The problem of determining those distributions that are uniquely characterized by their halfspace depth, however, remains open.
6 Proof of Theorem 1

For part (i), take x ∈ supp(μ|_A) and denote α = D(x; μ), D_A = D_α(μ) ∩ A, and U_A = U_α(μ) ∩ A. Because x comes from the support of μ|_A, we know that μ|_A(B_x) > 0 for any open ball B_x in A centered at x. Using Lemma 2 we conclude that B_x cannot be a subset of D_α(μ) \ U_α(μ), meaning that it also cannot be a subset of D_A \ U_A. But, because x ∈ D_A \ U_A, necessarily x ∈ bd_A(D_A \ U_A) ⊆ bd_A(D_A) ∪ bd_A(U_A). Now, suppose that x ∈ bd_A(U_A). Then there exists a sequence {x_n}_{n=1}^∞ ⊂ U_A ⊂ A converging to x. We know that α_n = D(x_n; μ) > α for each n. Thus, for any n = 1, 2, ... we have that x_n ∈ D_{α_n}(μ) and x ∉ D_{α_n}(μ), meaning that there is a point y_n ∈ A from the set bd_A(D_{α_n}(μ) ∩ A) in the line segment L(x, x_n). Since x_n → x, also the sequence {y_n}_{n=1}^∞ ⊂ ⋃_{β>α} bd_A(D_β(μ) ∩ A) converges to x, and necessarily x ∈ cl_A(⋃_{β>α} bd_A(D_β(μ) ∩ A)), as we intended to show.

To prove part (ii), consider x ∈ R^d such that μ({x}) > 0 and α = D(x; μ). Choose any y, z ∈ R^d such that x ∈ L(y, z). We will prove that one of the points y and z has depth at most α − μ({x}), which means that x must be an extreme point of Dβ(μ) for any β ∈ (α − μ({x}), α]. Let F ∈ F(x) be a minimizing flag halfspace of μ at x from Lemma 1, i.e. let μ(F) = α. We can write F in the form of the union {x} ∪ ⋃_{i=1}^{d} relint(H_i) as in (2). Since H_d ∈ H(x) is a halfspace that contains x on its boundary and x lies in the open line segment L(y, z), one of the following must hold true with j = d:
(C1) L(y, z) ⊂ relbd(H_j), or
(C2) exactly one of the points y and z is contained in relint(H_j).

If (C1) holds true with j = d, then, together with x ∈ relbd(H_{d−1}) and x ∈ L(y, z), it implies again that one of (C1) or (C2) must be true with j = d − 1. We continue this procedure iteratively as j decreases, until we reach an index J ∈ {1, ..., d} such that relint(H_J) contains exactly one of the points y and z. Note that such an index must exist: H_1 is a halfline originating at x inside the line relbd(H_2), so L(y, z) ⊂ relbd(H_2) would imply either that y ∈ relint(H_1) or that z ∈ relint(H_1). We choose J to be the largest index j ∈ {1, ..., d} satisfying (C2) and assume, without loss of generality, that y ∈ relint(H_J). Then L(y, z) ⊂ bd(H_j) for each j ∈ {J + 1, ..., d}.

Recall that for a set A ⊂ R^d and u ∈ R^d we denote by A + u = {a + u : a ∈ A} the shift of A by the vector u. Then for each j ∈ {1, ..., d} the j-dimensional halfspace H_j + (y − x) satisfies y ∈ relbd(H_j + (y − x)). Since y ∈ relint(H_J), it must be that relbd(H_J + (y − x)) ⊂ relint(H_J), and therefore
$$\operatorname{relint}\big(H_j + (y - x)\big) \subset H_j + (y - x) \subset \operatorname{relint}(H_J) \quad \text{for } j \in \{1, \dots, J\}, \qquad (4)$$
because the relative boundaries of H_J and H_J + (y − x) are parallel. At the same time, we have
$$\operatorname{relint}\big(H_j + (y - x)\big) = \operatorname{relint}(H_j) \quad \text{for each } j \in \{J + 1, \dots, d\}, \qquad (5)$$
since the indices j ∈ {J + 1, ..., d} all satisfy y ∈ relbd(H_j). Consider thus a shifted flag halfspace
$$F' = F + (y - x) = \{y\} \cup \bigcup_{j=1}^{d} \operatorname{relint}\big(H_j + (y - x)\big). \qquad (6)$$
Using (4), (5), and (6) we obtain
$$F' \subset \{y\} \cup \operatorname{relint}(H_J) \cup \Big(\bigcup_{j=J+1}^{d} \operatorname{relint}(H_j)\Big) \subset F \setminus \{x\}. \qquad (7)$$
Therefore, we have μ(F′) ≤ μ(F) − μ({x}) = α − μ({x}) < β, and necessarily also D(y; μ) < β by Lemma 1. Hence y ∉ Dβ(μ). We conclude that x ∈ Dα(μ) cannot be contained in the relative interior of any line segment whose endpoints are both in Dβ(μ) for β ∈ (α − μ({x}), α], and x is therefore an extreme point of each such Dβ(μ).

Consider now part (iii) and take F to be any minimizing flag halfspace of μ at x. Then μ(F) = D(x; μ) = α < D(z; μ), and necessarily z ∉ F. Since x ∈ L(y, z), we can use the same argumentation as in part (ii) of this proof to conclude that exactly one of the points y and z is contained in the relative interior of one of the closed i-dimensional halfspaces H_i taking part in the decomposition of F in (2),
meaning that F \ {x} contains exactly one of these two endpoints y and z. Since we found that z ∉ F, it must be that y ∈ F \ {x}. Then from (7) it follows that F + (y − x) ⊂ F \ {x}. Therefore,
$$\mu\big(F + (y - x)\big) \le \mu(F) - \mu(\{x\}). \qquad (8)$$
At the same time, Lemma 1 gives us D(y; μ) ≤ μ(F + (y − x)), which together with (8) finally implies
$$D(y; \mu) \le \mu\big(F + (y - x)\big) \le \mu(F) - \mu(\{x\}) = D(x; \mu) - \mu(\{x\}),$$
where the last equality follows from the fact that F is a minimizing flag halfspace of μ at x. We proved all three parts of our main theorem.

Acknowledgements The work of S. Nagy was supported by the Czech Science Foundation (EXPRO project n. 19-28231X). P. Laketa was supported by the OP RDE project "International mobility of research, technical and administrative staff at the Charles University", grant CZ.02.2.69/0.0/0.0/18_053/0016976.
References

Laketa, P., & Nagy, S. (2021). Reconstruction of atomic measures from their halfspace depth. Journal of Multivariate Analysis, 183, Paper No. 104727. https://doi.org/10.1016/j.jmva.2021.104727
Laketa, P., & Nagy, S. (2022). Halfspace depth for general measures: The ray basis theorem and its consequences. Statistical Papers, 63(3), 849–883. https://doi.org/10.1007/s00362-021-01259-8
Laketa, P., Pokorný, D., & Nagy, S. (2022). Simple halfspace depth. Electronic Communications in Probability, 27, 1–12. https://doi.org/10.1214/22-ECP503
Liu, R. Y., Parelius, J. M., & Singh, K. (1999). Multivariate analysis by data depth: Descriptive statistics, graphics and inference. The Annals of Statistics, 27(3), 783–858. https://doi.org/10.1214/aos/1018031260
Liu, X., Luo, S., & Zuo, Y. (2020). Some results on the computing of Tukey's halfspace median. Statistical Papers, 61(1), 303–316. https://doi.org/10.1007/s00362-017-0941-5
Nagy, S. (2020). The halfspace depth characterization problem. In Nonparametric statistics. Springer Proceedings in Mathematics & Statistics (Vol. 339, pp. 379–389). Springer. https://doi.org/10.1007/978-3-030-57306-5_34
Nagy, S. (2021). Halfspace depth does not characterize probability distributions. Statistical Papers, 62(3), 1135–1139. https://doi.org/10.1007/s00362-019-01130-x
Nagy, S., Schütt, C., & Werner, E. M. (2019). Halfspace depth and floating body. Statistics Surveys, 13, 52–118. https://doi.org/10.1214/19-ss123
Pokorný, D., Laketa, P., & Nagy, S. (2022). Another look at halfspace depth: Flag halfspaces with applications. Under review.
Struyf, A., & Rousseeuw, P. J. (1999). Halfspace depth and regression depth characterize the empirical distribution. Journal of Multivariate Analysis, 69(1), 135–153. https://doi.org/10.1006/jmva.1998.1804
Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians (Vancouver, B.C., 1974) (Vol. 2, pp. 523–531). Canadian Mathematical Congress, Montreal, Que.
Zuo, Y., & Serfling, R. (2000). General notions of statistical depth function. The Annals of Statistics, 28(2), 461–482. https://doi.org/10.1214/aos/1016218226
Posterior Predictive Assessment of IRT Models via the Hellinger Distance: A Simulation Study

Mariagiulia Matteucci and Stefania Mignani
Abstract Fit assessment of item response theory models is a crucial issue. In recent years, posterior predictive model checking has become a popular tool for investigating overall model fit and potential misfit due to specific items. Different approaches rely on graphical analysis, posterior predictive p-values, the relative entropy, and, more recently, the Hellinger distance. In this study, we focus on the performance of the Hellinger distance in the case multidimensional data are analyzed with a unidimensional approach. In particular, we consider the case of three latent dimensions. A simulation study is conducted to show the effectiveness of the method. Finally, the results of an empirical application to potential three-dimensional data are discussed. Keywords Posterior predictive model checking · Hellinger distance · IRT models · Goodness of fit
1 Introduction

Model checking is a crucial issue in statistical analysis. In the field of item response theory (IRT; see, e.g. van der Linden & Hambleton, 1997), posterior predictive model checking (PPMC; Rubin, 1984) has recently been used to investigate different aspects of model fit. PPMC relies on Markov chain Monte Carlo (MCMC) estimation under a Bayesian approach, which has proved to be a very flexible method, especially for estimating models of increasing complexity. In particular, PPMC is used to assess the discrepancies between data and a model and also to identify possible limitations of the chosen model with respect to the specific application (see Gelman et al., 2014). The PPMC method is based on the comparison between the observed and the
replicated data of a given discrepancy measure D. The main advantages of PPMC are the relative ease of implementation, given MCMC simulations to represent the entire posterior distribution of the parameters of interest, and the fact that the method does not require distributional assumptions, unlike classical frequentist methods. Bayesian estimation via MCMC is well-established for IRT models. For this reason, PPMC has also been used intensively to examine different aspects of model fit, such as person fit or item fit, the local independence assumption, and test dimensionality (see, among others, Sinharay et al., 2006; Levy et al., 2009; Levy & Svetina, 2011). Special attention was given to the case of multidimensional data analyzed with a unidimensional approach. In fact, real item response data often show multidimensionality due to the existence of different latent variables (abilities). In these cases, estimating a unidimensional model yields a less accurate measure of individual traits as compared to fitting multiple unidimensional models with correlated traits. To address these aspects, it was proposed to use the relative entropy (RE) within PPMC with global fit measures (Wu et al., 2014). The RE is able to measure the difference between the predictive and the realized distribution of a specific discrepancy measure and overcomes the limitations of the PPP-value, which is not able to quantify the amount of model misfit. The RE in turn suffers from two main drawbacks that may weaken its usefulness as a pairwise distance measure in applied settings: it is asymmetric and not upper bounded. For these reasons, an approach based on the Hellinger (H) distance was proposed (Matteucci & Mignani, 2021a, 2021b). In particular, the H distance was used to quantify the magnitude of the difference between the predictive and the realized distribution of measures for item pairs. In fact, similarly to the RE, the H distance provides a quantitative measure of the degree of misfit, going beyond the simple PPP-value approach. Unlike the RE, the H distance satisfies the metric properties, including symmetry, and it is bounded between 0 and 1. These properties make the H distance a suitable measure for improving the interpretation of results in applied settings. Moreover, the H distance can be used for model comparison purposes, considering both single discrepancy measures on item pairs and overall measures. The approach based on the H distance was proposed and implemented in an under-fitting scenario to investigate the misfit of unidimensional models to multidimensional data (Matteucci & Mignani, 2021a). Also, an over-fitting scenario was considered in Matteucci and Mignani (2021b), where the fit of different multidimensional approaches to unidimensional data was evaluated. In this study, we extend the study in Matteucci and Mignani (2021a) by considering multidimensional IRT models with three specific latent traits instead of two, to investigate whether the latent dimensionality may affect the effectiveness of the H distance in assessing model fit.
The performance of the H distance is investigated for detecting the misfit of an IRT unidimensional model with both simulated and real multidimensional response
data. This scenario is especially common in socio-behavioral research, where the complexity of the data structure cannot be handled with a unidimensional approach, which is more typical of the educational field. We compare our proposal to the classical PPMC-based PPP-values. The main strengths of the H distance, compared to traditional approaches, rely on the possibility to directly quantify the amount of misfit and to be used for model comparison purposes in situations with a complex data structure. The paper is organized as follows. Section 2 briefly reviews the IRT models used in the paper. Section 3 discusses the PPMC-based approach and the discrepancy measures used for investigating local independence in IRT models; the proposal based on the H distance is also reviewed. The results of a simulation study are reported in Sect. 4, while an empirical application to three-dimensional data is discussed in Sect. 5. Concluding remarks end the paper.
2 IRT Models

IRT models (see, e.g. van der Linden & Hambleton, 1997), also called latent trait models, are usually employed to model the relation between categorical response variables and a set of continuous latent variables. Given a response variable vector Y_j containing the responses to item j, with j = 1, ..., k items, given by n subjects, and a set of latent traits, also called abilities, θ, IRT models are based on the assumption of local independence, i.e.
$$P(\mathbf{Y} = \mathbf{y} \mid \boldsymbol{\theta}) = \prod_{j=1}^{k} P(Y_j = y_j \mid \boldsymbol{\theta}), \qquad (1)$$
where Y collects the response variable vectors Y_j for all k items and y is its realization. IRT models are often used in the field of educational measurement and to describe social phenomena, when responses to a questionnaire are typically available. In fact, IRT models assume that the relations among the items may be explained by the underlying latent variables. In its simplest formulation, an IRT model expresses the probability of a positive response to a binary item j by a subject i, with i = 1, ..., n, as a function of a single latent trait, as follows:
$$P(Y_{ij} = 1 \mid \theta_i, \alpha_j, \delta_j) = \Phi(\alpha_j \theta_i - \delta_j). \qquad (2)$$
Equation (2) describes the two-parameter normal ogive (2PNO) model, where α_j and δ_j represent the discrimination and the difficulty of item j, respectively, and Φ(·) denotes the standard normal cumulative distribution function. The normal ogive parameterization is chosen here instead of the more common logistic one, as it is more convenient for Bayesian estimation via MCMC (see, e.g. Béguin & Glas, 2001).
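As a small illustration of (2) (our own sketch, not taken from the paper; the sample size and parameter ranges simply mirror the simulation design of Sect. 4), the 2PNO response probabilities and simulated binary responses can be generated as follows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

n, k = 1000, 15                                  # subjects and items
alpha = rng.uniform(1.0, 2.0, size=k)            # item discriminations
delta = rng.uniform(-2.0, 2.0, size=k)           # item difficulties
theta = rng.standard_normal(n)                   # unidimensional latent trait

# 2PNO model (2): P(Y_ij = 1) = Phi(alpha_j * theta_i - delta_j)
prob = norm.cdf(np.outer(theta, alpha) - delta)  # n x k response probabilities
Y = rng.binomial(1, prob)                        # simulated binary response matrix
```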
In the case where more than one latent trait is assumed, we resort to multidimensional IRT models (MIRT; Reckase, 2009). Under a confirmatory setting with m specific latent traits, v = 1, ..., m, the 2PNO multi-unidimensional model (see Sheng & Wikle, 2007) is defined as follows:
$$P(Y_{vij} = 1 \mid \theta_{vi}, \alpha_{vj}, \delta_j) = \Phi(\alpha_{vj}\theta_{vi} - \delta_j), \qquad (3)$$
where α_{vj} is the v-specific item discrimination parameter for item j, δ_j is the difficulty parameter, and θ_{vi} is the v-specific latent trait for respondent i. Model (3) is based on a simple structure, i.e. each item may load on a single latent dimension only. Also, the different latent traits may be correlated. Lastly, the additive model (see Sheng & Wikle, 2009) generalizes the multi-unidimensional model (3) by adding an overall latent trait θ_0, which is related to all the items through the discrimination parameters α_{0j}, j = 1, ..., k, as follows:
$$P(Y_{vij} = 1 \mid \theta_{0i}, \theta_{vi}, \alpha_{0j}, \alpha_{vj}, \delta_j) = \Phi(\alpha_{0j}\theta_{0i} + \alpha_{vj}\theta_{vi} - \delta_j). \qquad (4)$$
According to Eq. (4), m + 1 latent abilities are specified, which may again be correlated. The details about Bayesian estimation via MCMC for these models can be found in Sheng and Wikle (2007) and Sheng and Wikle (2009).
3 PPMC and Discrepancy Measures for IRT Models

PPMC techniques are based on the comparison of observed data with replicated data generated or predicted by the model, by using a number of diagnostic measures that are sensitive to model misfit (see Sinharay et al., 2006). Substantial differences between the posterior distribution based on observed data and the posterior predictive distribution indicate poor model fit. Given the data y, let p(y|ω) be the likelihood for a model depending on the set of parameters ω, and p(ω) the prior distribution for the parameters. From a practical point of view, one should define a suitable discrepancy measure D(·) and compare the posterior distribution of D(y, ω), based on observed data, to the posterior predictive distribution of D(y^rep, ω), based on replicated data. Discrepancy measures should be chosen to capture relevant features of the data and differences between the data and the model. Besides a graphical analysis, it is possible to resort to the PPP-value, defined as the probability that the replicated data could be more extreme than the observed data, as measured by the test quantity (Gelman et al., 2014), as follows:
$$\text{PPP-value} = P\big(D(\mathbf{y}^{rep}, \omega) \ge D(\mathbf{y}, \omega) \mid \mathbf{y}\big) \qquad (5)$$
$$= \iint_{D(\mathbf{y}^{rep},\,\omega)\, \ge\, D(\mathbf{y},\,\omega)} p(\mathbf{y}^{rep} \mid \omega)\; p(\omega \mid \mathbf{y})\; \mathrm{d}\mathbf{y}^{rep}\, \mathrm{d}\omega. \qquad (6)$$
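Operationally, with S retained MCMC draws the double integral in (6) is approximated by simulating one replicated data set per posterior draw and counting how often the predictive discrepancy is at least as large as the realized one. The following generic sketch is our own illustration; `simulate_data` and `discrepancy` stand for model-specific routines that are not specified here.

```python
import numpy as np

def ppp_value(y, posterior_draws, simulate_data, discrepancy, seed=None):
    """Monte Carlo estimate of the PPP-value in (5)-(6).

    posterior_draws : iterable of parameter draws omega^(s) from p(omega | y)
    simulate_data   : function (omega, rng) -> replicated data y_rep
    discrepancy     : function (data, omega) -> value of D(., omega)
    """
    rng = np.random.default_rng(seed)
    exceed = [
        discrepancy(simulate_data(omega, rng), omega) >= discrepancy(y, omega)
        for omega in posterior_draws
    ]
    return float(np.mean(exceed))
```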
The choice of a suitable discrepancy measure is crucial in posterior predictive assessment. In order to check for the local independence assumption in IRT models, effective diagnostic measures are based on the association or on the covariance/correlation among item pairs. In this paper, we consider the model-based covariance (MBC), which depends on both data and model parameters, as follows:
$$\mathrm{MBC}_{jj'} = \frac{1}{n}\sum_{i=1}^{n}\big(Y_{ij} - E(Y_{ij})\big)\big(Y_{ij'} - E(Y_{ij'})\big), \qquad (7)$$
where E(Y_{ij}) is the expected value of the response variable under the specific IRT model. The MBC is found to be effective as it measures the covariance among item pairs by explicitly conditioning on the underlying latent variable. If the local independence assumption holds, the MBC is close to zero. If the local independence does not hold, the MBC is greater than zero for items loading on the same latent variable (PPP-values are close to zero) and smaller for items loading on different latent variables (PPP-values are close to one). Similar results may be obtained with the Mantel-Haenszel (MH) statistic, which is defined as the odds ratio conditional on the rest score, i.e. the raw test score obtained by excluding the particular item pair under consideration. Unlike the MBC, the MH statistic is based on data only. Lastly, in Levy and Svetina (2011) an overall measure was proposed, namely the generalized dimensionality discrepancy measure (GDDM), defined as the mean of the absolute values of the MBC over unique item pairs. The GDDM is a unidirectional measure of average conditional covariance. When the GDDM is equal to zero, weak local independence for all the item pairs is assumed. If the assumption of local independence is violated, the GDDM is greater than zero and the PPP-value will be close to zero.

As the approach based on the PPP-value only counts the number of replications for which the predictive discrepancy exceeds the realized one, a relevant improvement in posterior predictive assessment would be to quantify the amount of the difference between the two distributions, in case the chosen discrepancy measure depends on both data and model parameters. To this aim, the use of the RE was proposed in Wu et al. (2014). However, this approach has several limitations, due to the lack of interpretability and of the possibility to make comparisons in applied settings, as the RE is not upper bounded. For this reason, to quantify the difference between the realized and the predictive distribution within PPMC, in Matteucci and Mignani (2021a) it was proposed to use the H distance, which is symmetric, obeys the triangle inequality, and has range [0, 1]. The interpretation of the H distance follows the empirical rule "the smaller the better". The direct calculation is computationally demanding and, given the MCMC
simulations, it is usually estimated by normal kernel density estimation. Given a discrepancy measure D(·), the H distance in the context of PPMC is defined as
$$H(P, Q) = \sqrt{\,1 - \int \sqrt{p\big(D(\mathbf{y}, \omega)\big)\, p\big(D(\mathbf{y}^{rep}, \omega)\big)}\;\mathrm{d}\mathbf{y}\,\mathrm{d}\omega\,}, \qquad (8)$$
where P and Q are two probability distributions of continuous random variables. In order to check for local independence, we use the H distance with the MBC discrepancy measure (MBC-H), to take into account a fit measure for each item pair, and with the GDDM measure (GDDM-H), to evaluate the overall fit based on item pairs. In Matteucci and Mignani (2021a), it was proposed to investigate the assumption of local independence for 2PNO models by focusing on multidimensional data analyzed with the unidimensional model, for the case of m = 2 specific latent variables. Also, in Matteucci and Mignani (2021b), a study in an over-fitting scenario was proposed, where unidimensional data are analyzed through different multidimensional models. In this paper, we focus on an under-fitting scenario, in which multidimensional data are treated through a unidimensional approach, but for the case of m = 3 specific latent traits, to study whether the number of dimensions may affect the effectiveness of the H distance in assessing model fit.
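Following the kernel density route just mentioned, a minimal sketch of the resulting H distance estimate between the realized and the predictive samples of a discrepancy (for instance the MBC of one item pair, one value per retained MCMC draw) could look as follows; the default bandwidth of `gaussian_kde`, the grid size, and the padding are our own arbitrary choices, not the settings used by the authors.

```python
import numpy as np
from scipy.stats import gaussian_kde

def hellinger(realized, predictive, n_grid=512):
    """Estimate H between the realized and predictive discrepancy distributions.

    Both densities are estimated by Gaussian kernel density estimation, and the
    Bhattacharyya coefficient is integrated numerically on a common grid.
    """
    realized, predictive = np.asarray(realized), np.asarray(predictive)
    lo = min(realized.min(), predictive.min())
    hi = max(realized.max(), predictive.max())
    pad = 0.1 * (hi - lo) if hi > lo else 1.0
    grid = np.linspace(lo - pad, hi + pad, n_grid)
    p = gaussian_kde(realized)(grid)
    q = gaussian_kde(predictive)(grid)
    bc = np.trapz(np.sqrt(p * q), grid)           # Bhattacharyya coefficient
    return float(np.sqrt(max(0.0, 1.0 - bc)))     # H in [0, 1]: "the smaller the better"
```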
4 Simulation Study

In this study, we consider multidimensional data simulated from either a multi-unidimensional model with m = 3 specific latent abilities or an additive model with m + 1 = 4 latent abilities, i.e. three specific traits and one overall trait. The test length is set to k = 15, with 3 subtests of length k1 = k2 = k3 = 5, and the sample size is fixed at n = 1,000. The item parameters are drawn from the following distributions: α0 ∼ U(1, 2), αv ∼ U(1, 2), δ ∼ U(−2, 2). The latent scores are drawn from a multivariate normal θ ∼ MN(0, Σ), where Σ is a correlation matrix with off-diagonal elements set equal to 0, 0.3, 0.6, and 0.9 to define different simulation conditions. The parameters of the data-analysis model are estimated via the Gibbs sampler. A total of 5,000 MCMC iterations are conducted, of which 1,000 are burn-in iterations. The effective samples are thinned by 4, so that 1,000 samples are used in PPMC. For the unidimensional model, a standard normal prior is used for the latent trait and the difficulty parameters. A standard normal distribution truncated at zero to the left is used as prior distribution for the discrimination parameters to ensure positivity. For the multi-unidimensional and the additive models, conjugate standard normal priors are specified for the item parameters, and a conjugate multivariate normal prior is used for the covariance matrix of the latent traits (see Sheng 2008b, 2010). Finally, 100 replications are done for each simulation condition.
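A minimal sketch of the data-generating step for one replication of the multi-unidimensional design described above (our own illustration; the actual estimation and PPMC are carried out in MATLAB by the authors):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(123)

n, m, k_sub = 1000, 3, 5                         # sample size, traits, items per subtest
rho = 0.6                                        # one of the correlation conditions
Sigma = np.full((m, m), rho) + (1.0 - rho) * np.eye(m)
theta = rng.multivariate_normal(np.zeros(m), Sigma, size=n)   # n x m latent scores

alpha = rng.uniform(1.0, 2.0, size=(m, k_sub))   # v-specific discriminations
delta = rng.uniform(-2.0, 2.0, size=(m, k_sub))  # difficulties

# Model (3): each item of subtest v loads only on the v-th latent trait.
prob = norm.cdf(theta[:, :, None] * alpha[None, :, :] - delta[None, :, :])
Y = rng.binomial(1, prob).reshape(n, m * k_sub)  # 1000 x 15 binary data matrix
```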
Table 1 Proportion of extreme PPP-values for the MH statistic and the MBC, and PPP-value for the GDDM, when the same model is used to simulate and analyze data

Model      ρ     MH      MBC     GDDM PPP-value
Multi-uni  0.0   0.000   0.000   0.546
           0.3   0.000   0.000   0.529
           0.6   0.000   0.000   0.546
           0.9   0.000   0.000   0.181
Additive   0.0   0.000   0.000   0.376
           0.3   0.000   0.000   0.531
           0.6   0.000   0.000   0.491
           0.9   0.000   0.000   0.376
The software MATLAB was used both to estimate the model parameters and to implement PPMC and the proposed H distance.¹ The performance of the PPP-values and of the proposed MBC-H and GDDM-H at detecting misfit is evaluated. We use the proportion of extreme PPP-values (below 0.05 or above 0.95) among the item pairs to summarize the results for the MH statistic and the MBC. We report some descriptive statistics for the MBC-H and the GDDM-H. The case in which the same multidimensional model is used both to generate and to fit the data is also considered, to investigate whether the chosen measures are able to report good fit in these cases. In this specific case, for all conditions, the proportion of extreme PPP-values for the MBC computed on the 105 different item pairs indicates good fit (see Table 1). Also, the PPP-values for the GDDM are all around 0.5, meaning good fit, with the exception of the case of strong correlations (ρ = 0.9) for the multi-unidimensional model (0.181). Even if this last PPP-value is not extreme, the deviation from 0.5 may be due to the strong correlation between the traits, which makes a unidimensional representation of the data reasonable. The results on the MBC-H and the GDDM-H are reported in Table 2. In Matteucci and Mignani (2021a), 0.5 was established as a possible threshold for model fit. Here, the average values of the MBC-H are below 0.5, indicating good fit, for all the conditions except, again, the case of ρ = 0.9 in the multi-unidimensional model. Analogously, it can be observed that the average value of the MBC-H is 0.454, close to 0.5, for the additive model with ρ = 0.9. Overall, the values of the MBC-H tend to increase as the correlation among the traits increases, as a multidimensional approach is progressively less needed when data become nearly unidimensional. Also, the proportion of MBC-H values greater than 0.5 is equal to zero for all the cases, with the exception of the two strongly correlated cases.
¹ MATLAB packages provided in Sheng (2008a, b, 2010) are used for data generation and for item parameter estimation. The Authors wrote MATLAB-specific programs for performing PPMC with the H distance. The documentation is not yet ready, but researchers interested in the scripts may request them from the Authors free of charge.
Table 2 MBC-H and GDDM-H when the same model is used to simulate and analyze data

                      MBC-H                                               GDDM-H
Model      ρ     Mean    Median  SD      Min     Max     Prop > 0.5
Multi-uni  0.0   0.347   0.343   0.070   0.183   0.497   0.000          0.264
           0.3   0.349   0.348   0.052   0.180   0.456   0.000          0.282
           0.6   0.364   0.366   0.053   0.222   0.496   0.000          0.281
           0.9   0.509   0.509   0.043   0.407   0.642   0.590          0.723
Additive   0.0   0.329   0.324   0.061   0.209   0.489   0.000          0.297
           0.3   0.336   0.333   0.055   0.198   0.475   0.000          0.266
           0.6   0.366   0.371   0.047   0.243   0.500   0.010          0.286
           0.9   0.454   0.449   0.055   0.326   0.598   0.210          0.319
Table 3 Proportion of extreme PPP-values for the MH statistic and the MBC, and PPP-value for the GDDM, when a multidimensional model is used to simulate data and the unidimensional model to analyze data

Model      ρ     MH      MBC     GDDM PPP-value
Multi-uni  0.0   0.829   0.010   0.000
           0.3   0.762   0.610   0.000
           0.6   0.543   0.514   0.000
           0.9   0.057   0.029   0.000
Additive   0.0   0.867   0.838   0.000
           0.3   0.743   0.695   0.000
           0.6   0.390   0.248   0.000
           0.9   0.000   0.000   0.048
The values of the GDDM-H again show good fit for all the cases (GDDM-H < 0.5), with the exception of the multi-unidimensional model with ρ = 0.9.

The results for multidimensional data analyzed with a unidimensional model are reported in Tables 3 and 4. Here, we expect bad fit. From Table 3, we can notice that the proportion of extreme PPP-values for both discrepancy measures decreases as the correlation among the traits increases. This outcome is coherent with the previous interpretation of nearly unidimensional data when the correlations among the latent variables are very strong. However, there is a further peculiar case, also observed in Matteucci and Mignani (2021a). In fact, the MH statistic and the MBC perform differently for the case of uncorrelated traits with multi-unidimensional data: while the MH statistic correctly reports model misfit based on a rather high proportion of extreme PPP-values (0.867), the MBC fails by reporting only 1% of extreme PPP-values. The behavior of the MBC may be attributed to the peculiarity of the generating model, which resembles a set of separate unidimensional models.
Table 4 MBC-H and GDDM-H when a multidimensional model is used to simulate data and the unidimensional model to analyze data

                      MBC-H                                               GDDM-H
Model      ρ     Mean    Median  SD      Min     Max     Prop > 0.5
Multi-uni  0.0   0.644   0.597   0.140   0.466   0.924   0.876          1.000
           0.3   0.832   0.846   0.146   0.589   1.000   1.000          1.000
           0.6   0.813   0.836   0.135   0.581   1.000   1.000          1.000
           0.9   0.611   0.585   0.079   0.452   0.833   0.981          0.998
Additive   0.0   0.899   0.934   0.103   0.546   1.000   1.000          1.000
           0.3   0.877   0.891   0.098   0.613   1.000   1.000          1.000
           0.6   0.770   0.753   0.120   0.472   0.996   0.990          1.000
           0.9   0.551   0.550   0.065   0.343   0.741   0.829          0.825
In particular, we have noticed that the item discrimination estimates for the unidimensional model are higher than one for the first subtest (items 1-5) and close to zero for the second and the third subtests (items 6-15). This means that the latent trait is only defined by the first subtest items, which are related to the same ability. The PPP-values for the GDDM are all extreme (< 0.05), thus correctly reporting misfit for all the conditions. With respect to the H distance, Table 4 shows that the average MBC-H is higher than the threshold of 0.5 for all the simulation conditions, reporting bad fit. Moreover, we observe a decrease in the MBC-H as the correlation among the traits increases, with the exception of the peculiar case of the multi-unidimensional data with uncorrelated traits described above. The proportions of MBC-H values higher than 0.5 are always close to 1, again showing bad fit. Lastly, the values of the GDDM-H equal to one show the maximum possible distance between the predictive discrepancy and the realized one, with slightly lower values only for strongly correlated traits.

The results of this simulation study show that the H distance is effective in highlighting the presence of possible misfit and in determining plausible thresholds for classifying the misfit levels. In fact, unlike the approach based on the PPP-values, it is possible to evaluate the magnitude of the difference between the predictive and the realized distribution of the discrepancy measure (MBC or GDDM).
5 Empirical Application

The empirical data come from a survey conducted by the Center of Advanced Studies in Tourism-University of Bologna to investigate the residents' evaluations of their personal well-being in the Romagna area (see Bernini et al., 2018).
Table 5 Proportion of extreme PPP-values for the MH statistic and the MBC, and PPP-value for the GDDM, for the tourism data

Model      MH      MBC     GDDM PPP-value
Uni        0.642   0.633   0.000
Multi-uni  0.467   0.392   0.000
Additive   0.333   0.217   0.000
Bifactor   0.342   0.242   0.000
Table 6 DIC, WAIC, MBC-H, and GDDM-H for the tourism data

                                    MBC-H                                               GDDM-H
Model      DIC        WAIC       Mean    Median  SD      Min     Max     Prop > 0.5
Uni        14038.79   13928.28   0.796   0.875   0.215   0.320   1.000   0.833          1.000
Multi-uni  12846.50   12621.08   0.629   0.610   0.238   0.219   1.000   0.625          1.000
Additive   12328.51   12083.12   0.505   0.466   0.238   0.097   1.000   0.475          1.000
Bifactor   12307.68   12061.63   0.497   0.450   0.247   0.084   1.000   0.442          1.000
The data used in this study consist of the responses given by 784 residents to 16 items of the questionnaire, covering the following three domains: personal life (5 items), leisure activities (7 items), and life evaluation (4 items). Item responses have been dichotomized, where Y = 1 denotes satisfaction while Y = 0 indicates dissatisfaction. The unidimensional, the multi-unidimensional, the additive, and the bifactor models are fitted via the Gibbs sampler. The bifactor model is a particular case of the additive model in which the overall and the specific latent variables are set orthogonal, while the correlations among the specific traits are estimated (see Gibbons & Hedeker, 1992). Three specific factors are assumed (m = 3), one for each domain. The same prior distributions used in the simulation have been used in the empirical application.

Table 5 reports the proportion of extreme PPP-values for the MH statistic and the MBC, and the PPP-value estimate for the GDDM, under the four competing models. It is clear that the unidimensional model is associated with the highest proportion of extreme PPP-values and that the fit improves as the model complexity increases. In particular, the additive and the bifactor models show similar performances. On the other hand, the PPP-value estimates for the GDDM are all equal to zero, reporting bad fit for all models. Table 6 reports the results for the H distance, together with the deviance information criterion (DIC) and the Watanabe-Akaike information criterion (WAIC); see Gelman et al. (2014), Chap. 7. Both DIC and WAIC suggest that the additive and the bifactor models fit the data comparatively better than the unidimensional and the multi-unidimensional models, with a slight preference toward the bifactor one. The results on the H distance are easily interpretable and confirm that the best models in terms of fit are the additive and the bifactor.
Fig. 1 Kernel densities of the realized (MBC obs) and predictive (MBC rep) discrepancies of MBC for the item pair 1, 4 (on the left) and the item pair 3, 15 (on the right)
In particular, the bifactor model is associated with an average MBC-H lower than 0.5, meaning that the amount of discrepancy between the predictive and the realized distribution of the MBC is the lowest among the competing models. This evidence is again confirmed by the proportion of item pairs for which the MBC-H is higher than 0.5, which is equal to 0.442 for the bifactor model, the lowest observed value. On the other hand, the GDDM-H reports bad fit for all the models and is not able to compare the models effectively. A strong advantage of the approach based on the MBC-H is the possibility to investigate the fit of item pairs. In Fig. 1, we compare the realized (MBC obs) and the predictive (MBC rep) distribution of the MBC in the bifactor model for items 1 and 4 (on the left) and items 3 and 15 (on the right). The values of the H distance for these specific item pairs are MBC-H1,4 = 0.9252 and MBC-H3,15 = 0.113, indicating bad and good fit, respectively.
6 Concluding Remarks

The under-fitting scenario considered in this paper is particularly meaningful for its implications in real situations. In fact, unidimensional models are often used in practice even if unidimensionality, and the associated assumption of local independence, do not hold. This often happens in behavioral or social science contexts, where more than one domain affects the item responses. The main strengths of the H distance, compared to traditional approaches, rely on the possibility (a) to directly quantify the amount of misfit, (b) to be used for model comparison purposes, and (c) to make more informative analyses on item pairs. Furthermore, it is demonstrated that, in practical applications, the MBC-H can be used to (a) leave out the models that show serious misfit by using the threshold
of 0.5; (b) compare the amount of misfit of different competing models and choose the model which fits the data best; and (c) identify, also through graphical plots, critical items that may involve misfit, i.e. items associated with a high MBC-H in several pairs. Future research may explore the use of different discrepancy measures that are able to investigate aspects of model fit other than the local independence assumption.
References

Béguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some fit analysis of multidimensional IRT models. Psychometrika, 66(4), 541–561. https://doi.org/10.1007/BF02296195
Bernini, C., Matteucci, M., & Mignani, S. (2018). Modelling subjective well-being dimensions through an IRT bifactor model: Evidences from an Italian study. Electronic Journal of Applied Statistical Analysis, 11(2), 427–446. https://doi.org/10.1285/i20705948v11n2p427
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). CRC Press.
Gibbons, R. D., & Hedeker, D. R. (1992). Full-information item bi-factor analysis. Psychometrika, 57, 423–436. https://doi.org/10.1007/BF02295430
Levy, R., Mislevy, R. J., & Sinharay, S. (2009). Posterior predictive model checking for multidimensionality in item response theory. Applied Psychological Measurement, 33(7), 519–537. https://doi.org/10.1177/0146621608329504
Levy, R., & Svetina, D. (2011). A generalized dimensionality discrepancy measure for dimensionality assessment in multidimensional item response theory. British Journal of Mathematical and Statistical Psychology, 65, 208–232. https://doi.org/10.1348/000711010X500483
Matteucci, M., & Mignani, S. (2021a). The Hellinger distance within posterior predictive assessment for investigating multidimensionality in IRT models. Multivariate Behavioral Research, 56(4), 627–648. https://doi.org/10.1080/00273171.2020.1753497
Matteucci, M., & Mignani, S. (2021b). Investigating model fit in item response models with the Hellinger distance. In Porzio, G. C., Rampichini, C., & Bocci, C. (Eds.), CLADAG 2021 book of abstracts and short papers (pp. 150–153). Firenze University Press. https://doi.org/10.36253/978-88-5518-340-6
Reckase, M. (2009). Multidimensional item response theory. Springer.
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12, 1151–1172. https://doi.org/10.2307/2240995
Sheng, Y. (2008a). Markov chain Monte Carlo estimation of normal ogive IRT models in MATLAB. Journal of Statistical Software, 25(8), 1–15. https://doi.org/10.18637/jss.v025.i08
Sheng, Y. (2008b). A MATLAB package for Markov chain Monte Carlo with a multi-unidimensional IRT model. Journal of Statistical Software, 28(10), 1–20. https://doi.org/10.18637/jss.v028.i10
Sheng, Y. (2010). Bayesian estimation of MIRT models with general and specific latent traits in MATLAB. Journal of Statistical Software, 34(3), 1–27. https://doi.org/10.18637/jss.v034.i03
Sheng, Y., & Wikle, C. (2007). Comparing multiunidimensional and unidimensional item response theory models. Educational and Psychological Measurement, 67(6), 899–919. https://doi.org/10.1177/0013164406296977
Sheng, Y., & Wikle, C. (2009). Bayesian IRT models incorporating general and specific abilities. Behaviormetrika, 36(1), 27–48. https://doi.org/10.2333/bhmk.36.27
Sinharay, S., Johnson, M. S., & Stern, H. S. (2006). Posterior predictive assessment of item response theory models. Applied Psychological Measurement, 30(4), 298–321. https://doi.org/10.1177/0146621605285517
Posterior Predictive Assessment of IRT Models …
119
van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. Springer. Wu, H., Yuen, K. V., & Leung, S. O. (2014). A novel relative entropy-posterior predictive model checking approach with limited information statistics for latent trait models in sparse 2k contingency tables. Computational Statistics & Data Analysis, 79, 261–276. https://doi.org/10.1016/j. csda.2014.06.004.
Shapley-Lorenz Values for Credit Risk Management Niklas Bussmann, Roman Enzmann, Paolo Giudici, and Emanuela Raffinetti
Abstract A trustworthy application of Artificial Intelligence requires measuring its possible risks in advance. When applied to regulated industries, such as banking, finance and insurance, Artificial Intelligence methods lack explainability and, therefore, the authorities in charge of monitoring risks may not validate them. To solve this issue, explainable machine learning methods have been introduced to "interpret" black-box models. Among them, Shapley values are becoming popular: they are model agnostic and easy to interpret. However, they are not normalised and, therefore, cannot become a standard procedure for Artificial Intelligence risk management. This paper proposes an alternative explainable machine learning method, based on Lorenz Zonoids, that is statistically normalised and can therefore be used as a standard for the application of Artificial Intelligence. The empirical analysis of 15,000 small and medium companies applying for credit confirms the advantages of our proposed method. Keywords Artificial intelligence · Lorenz Zonoids · Shapley values · Risk management
1 Introduction A key prerequisite to develop reliable and trustworthy Artificial Intelligence (AI) methods is the capability to measure their risks. When applied to high impact and regulated industries, such as energy, finance and health, AI methods need to be validated by national regulators, possibly according to international standards, such as ISO/IEC CD 23894. Most AI methods rely on the application of highly complex machine learning models which, while reaching high predictive performance, lack explainability. This is a problem in regulated industries, as the authorities in charge of monitoring the risks arising from the application of AI methods may not be able to validate them (see, e.g., Financial Stability Board 2020). In these fields, comprehensible results need to be obtained to allow organisations to detect risks, especially in terms of the factors which can cause them. This need is particularly evident for AI systems. Indeed, AI methods have an intrinsic black-box nature, resulting in automated decision-making that can, for example, classify a person into a class associated with the prediction of their individual behaviour, without explaining the underlying rationale. To avoid wrong actions being taken as a consequence of "automatic" choices, AI methods need to explain the reasons for their classifications and predictions. The notion of "explainable" AI has become very important in recent years, following the increasing application of AI methods that impact the daily life of individuals and societies. At the institutional level, explanations can answer different kinds of questions about a model's operations depending on the stakeholder they are addressed to (see, e.g., Bracke et al. 2019): developers, managers, model checkers and regulators. In general, to be explainable, AI methods have to provide details or reasons clarifying their functioning. From a mathematical viewpoint, the explainability requirement can be fulfilled using simple machine learning models (such as logistic and linear regression models). However, these models achieve lower predictive accuracy. To improve predictive accuracy, the implementation of complex machine learning models (such as neural networks and random forests) seems necessary, but this leads to limited interpretability. This trade-off can be addressed by empowering accurate machine learning models with innovative methodologies able to explain their classification and prediction output. A recent attempt in this direction can be found in Bussmann et al. (2020), who proposed to apply correlation networks (see, e.g., Mantegna and Stanley 1999) to Shapley values (see Shapley 1953) so that AI predictions can be grouped according to the similarity in the underlying explanations. The proposal has been applied to the scoring of borrowers in financial lending, a service for which the use of AI methods is developing fast. Shapley values have the advantage of being agnostic: independent of the underlying model with which classifications and predictions are computed; but they have the disadvantage of not being normalised and are, therefore, difficult to use in comparisons outside the specific application.
In this paper, we propose an alternative explainable machine learning method, based on the combination of the Shapley value approach (see Shapley 1953) and the Lorenz Zonoid tool described in Giudici and Raffinetti (2020). Shapley values belong to the class of local explanation methods, as they aim to interpret individual predictions in terms of which variables mostly affect them (see, e.g., Financial Stability Board 2020 and Joseph 2019). Lorenz Zonoids instead are a global explanation method, as they aim to interpret all model predictions as a whole, in terms of which variables mostly determine them, for all observations. The combination of the two approaches has been successfully tested in Giudici and Raffinetti (2021) within the context of predicting a continuous variable, the price of bitcoins, by means of three well-known continuous financial variables: the price of oil, gold and the Standard and Poor's index. In this paper, we extend the methodology to a much more challenging problem: the prediction of credit default (a binary variable) by means of a large set of highly correlated company performance variables, taken from balance sheets. This requires extending the statistical part of the methodology, to binary response variables and binary classification, and the computational part of the proposal, providing an automated search through a high-dimensional space of possible alternative models. Our empirical findings are very promising and may lead to a new explainable machine learning method for financial risk management, based on the Shapley-Lorenz approach which, unlike Shapley values, gives rise to normalised values that can be used in any comparison between alternative model explanations. The paper is organised as follows: the next section illustrates the methodology; Sect. 3 provides the Python algorithm that can be employed to apply our proposed method; Sect. 4 discusses the empirical findings obtained by applying our proposal to financial data; finally, Sect. 5 contains some brief concluding remarks.
2 Methodology To meet the requirement that risk measurement is explainable, leading to develop a trustworthy application of AI, in this section we propose a global explainable machine learning method. Our proposal derives from the combination of two research streams. The first one concerns the development of machine learning methods for binary classification problems. The second one concerns the development of explainable methods to understand the output of advanced machine learning models. The result is a novel method for AI risk management which is, at the same time, predictively accurate, interpretable and robust to anomalies in the observations.
2.1 Binary Classification Let Y be a binary response variable, which can, for example and without loss of generality, express whether a company defaults (Y = 1) or not (Y = 0). Given K explanatory variables $X_1, \ldots, X_K$, a logistic regression model for Y can be specified as follows:

$$\ln \frac{\pi_i}{1-\pi_i} = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik} = \eta_i, \qquad (1)$$

where $i = 1, \ldots, n$; $\eta_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik}$; $\pi_i$ represents the probability of the event for the i-th observation (company); $x_i = (x_{i1}, \ldots, x_{iK})$ is the K-dimensional vector of values taken by the K explanatory variables for the i-th observation; $\beta_0$ and $\beta_k$ are the parameters representing the intercept and the k-th regression coefficient, respectively. By means of the Maximum Likelihood Estimation method, the parameters $\beta_0$ and $\beta_k$ can be estimated, leading to the predicted probability of default

$$\hat{\pi}_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}, \qquad (2)$$
which can be employed to attach to the i-th observation a "score", a number between zero and one which can be interpreted to signal, for example, the creditworthiness of a company: the higher the score, the lower the trust. A classification of each company can then follow, comparing the score with an appropriate threshold, chosen on an experiential basis. While in the following we will refer to the specific case of credit scoring, the proposed approach is quite general and can be applied to any binary classification problem.
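As an illustration of the scoring step just described, the following minimal Python sketch fits a logistic regression and converts the fitted probabilities of Eq. (2) into classifications; the synthetic data and the 0.5 threshold are purely illustrative assumptions and are not part of the empirical analysis of Sect. 4.

```python
# Illustrative sketch of Eqs. (1)-(2): logistic scoring of a binary default indicator.
# The data below are synthetic stand-ins for the balance-sheet ratios used later.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # hypothetical explanatory variables
eta = -1.5 + X[:, 0]                           # linear predictor eta_i
y = (rng.random(1000) < 1 / (1 + np.exp(-eta))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)      # maximum likelihood estimates of beta_0, beta_k
model.fit(X_train, y_train)

pi_hat = model.predict_proba(X_test)[:, 1]     # Eq. (2): predicted default probability (the "score")
y_pred = (pi_hat >= 0.5).astype(int)           # classification against a chosen threshold
```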
2.2 The Shapley-Lorenz Decomposition for Credit Risk Data Following the early experimentation in Giudici and Raffinetti (2021), we propose, for financial risk management purposes, a global explainable AI method, named Shapley-Lorenz decomposition, which combines the interpretability power of the local Shapley value game theoretic approach (see Shapley 1953) with a more robust global approach based on the Lorenz Zonoid model accuracy tool (see Giudici and Raffinetti 2020). The Lorenz Zonoids, originally introduced by Koshevoy and Mosler (1996), were further developed by Giudici and Raffinetti (2020) as a generalisation of the ROC curve in a multidimensional setting and, therefore, the Shapley-Lorenz decomposition has the advantage of combining predictive accuracy and explainability performance into a single diagnostic. Furthermore, the Lorenz Zonoid is based on a measure of mutual variability that is more robust to the presence of outlying (anomalous) observations than the standard variability around the mean.
These theoretical properties can be exploited to develop partial dependence measures that make it possible to detect the additional contribution of a new predictor to an existing model. Shapley values were originally introduced by Shapley (1953) as a pay-off concept from cooperative game theory. When referring to machine learning models, the notion of pay-off corresponds to the model prediction. For any single statistical unit i (i = 1, ..., n), the pay-offs are computed as

$$poff(X_{ik}) = \hat{f}(X \cup X_k)_i - \hat{f}(X)_i, \qquad (3)$$
where $\hat{f}(X)_i$ are the predicted values provided by a machine learning model depending only on the X predictors, and $\hat{f}(X \cup X_k)_i$ are the predicted values generated by the machine learning model depending both on the X predictors and on the additionally included predictor $X_k$. The main advantage of Shapley values, over alternative explainable AI (XAI) methods, is that they can be employed to measure the contribution of each explanatory variable to each point prediction of a machine learning model, regardless of the underlying model itself (see, e.g., Lundberg and Lee 2017; Štrumbelj and Kononenko 2010). In other words, Shapley-based XAI methods combine generality of application (they are model agnostic) with the personalisation of their results (they can explain any single point prediction). The main drawback of Shapley values is that they provide explainability scores that are not normalised. They can be used to compare the relative contribution of one variable to that of another, but they cannot be used to assess the absolute importance of each variable, nor to make comparisons beyond the specific context. The key benefit of employing the Lorenz Zonoid tool is the possibility of evaluating the contribution of any additional explanatory variable to the whole model prediction with a normalised measure that can be used to assess the importance of each variable. The Lorenz Zonoid measure was introduced by Giudici and Raffinetti (2020) to develop new partial dependence measures. Specifically, given a set of K explanatory variables, the marginal contribution associated with the variable $X_k$ can be determined from a global perspective by taking the difference between two Lorenz Zonoids. More precisely, the two Lorenz Zonoids are built on the predictions $\hat{Y}_{X \cup X_k}$, provided by a model including the additional covariate $X_k$, and on the predictions $\hat{Y}_X$, provided by the reduced model excluding the covariate $X_k$. The additional contribution (AC) related to the inclusion of covariate $X_k$ can then be expressed as
$$AC = LZ(\hat{Y}_{X \cup X_k}) - LZ(\hat{Y}_X), \qquad (4)$$
where $LZ(\hat{Y}_{X \cup X_k})$ and $LZ(\hat{Y}_X)$ denote the Lorenz Zonoids computed on the predicted values provided by the model including covariate $X_k$ and on those provided by the model including the X covariates but excluding covariate $X_k$, respectively.
As we are dealing with a binary response variable, denoting the active or default status of the companies, the terms $\hat{Y}_{X \cup X_k}$ and $\hat{Y}_X$ can be re-written as the predicted probabilities of default $\hat{\pi}_{X \cup X_k}$ and $\hat{\pi}_X$, obtained from the logistic regression model including the explanatory variable $X_k$ and from the logistic regression model not including it, respectively. Thus, Eq. (4) becomes

$$AC = LZ(\hat{\pi}_{X \cup X_k}) - LZ(\hat{\pi}_X). \qquad (5)$$
In order to compute the Lorenz Zonoids $LZ(\hat{\pi}_{X \cup X_k})$ and $LZ(\hat{\pi}_X)$ in Eq. (5), we resort to the covariance operators (see, e.g., Lerman and Yitzhaki 1984), i.e.,

$$LZ(\hat{\pi}_{X \cup X_k}) = \frac{2\,Cov(\hat{\pi}_{X \cup X_k},\, r(\hat{\pi}_{X \cup X_k}))}{n\,E(\hat{\pi}_{X \cup X_k})} \quad \text{and} \quad LZ(\hat{\pi}_X) = \frac{2\,Cov(\hat{\pi}_X,\, r(\hat{\pi}_X))}{n\,E(\hat{\pi}_X)}, \qquad (6)$$
where $r(\hat{\pi}_{X \cup X_k})$ and $r(\hat{\pi}_X)$ are the rank scores of $\hat{\pi}_{X \cup X_k}$ and $\hat{\pi}_X$; n is the sample size; $E(\hat{\pi}_{X \cup X_k})$ and $E(\hat{\pi}_X)$ are the expected values of $\hat{\pi}_{X \cup X_k}$ and $\hat{\pi}_X$. The Shapley-Lorenz decomposition is the result of combining the Shapley value-based formula with the Lorenz Zonoid tools. Formally, the contribution of the additional variable $X_k$, expressed in terms of the differential contribution to the global predictive accuracy, equals
$$LZ_{X_k}(\hat{\pi}) = \sum_{X \subseteq C(X) \setminus X_k} \frac{|X|!\,(K - |X| - 1)!}{K!}\,\bigl[LZ(\hat{\pi}_{X \cup X_k}) - LZ(\hat{\pi}_X)\bigr], \qquad (7)$$
where $LZ(\hat{\pi}_{X \cup X_k})$ and $LZ(\hat{\pi}_X)$ measure the marginal contribution provided by the inclusion of variable $X_k$; K is the number of available predictors; $C(X) \setminus X_k$ is the set of all the possible model configurations which can be obtained with the K − 1 variables excluding variable $X_k$; |X| denotes the number of variables included in each such model. Finally, it is worth noting that the Shapley-Lorenz decomposition is an agnostic explainable Artificial Intelligence method which can be applied to the predictive output regardless of which model and data generated it.
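To make Eqs. (6) and (7) concrete, the following sketch computes the Lorenz Zonoid of a prediction vector through the covariance form and then the exact Shapley-Lorenz value of a predictor by enumerating all subsets. It is only a didactic implementation, feasible for small K; it is not the authors' released package (linked in Sect. 3), and the repeated logistic refits, the population-covariance choice and the intercept-only baseline are simplifying assumptions.

```python
import numpy as np
from itertools import combinations
from math import factorial
from sklearn.linear_model import LogisticRegression

def lorenz_zonoid(pred):
    # Covariance form of Eq. (6): LZ = 2 Cov(pred, r(pred)) / (n E[pred]),
    # with r(pred) the rank scores 1..n and Cov the population covariance.
    pred = np.asarray(pred, dtype=float)
    n = pred.size
    ranks = np.argsort(np.argsort(pred)) + 1
    return 2.0 * np.cov(pred, ranks, bias=True)[0, 1] / (n * pred.mean())

def fitted_probs(X, y, cols):
    # Predicted default probabilities from a logistic model on the listed columns;
    # with no columns, the intercept-only prediction is the sample default rate.
    if not cols:
        return np.full(len(y), y.mean())
    model = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
    return model.predict_proba(X[:, cols])[:, 1]

def shapley_lorenz(X, y, k):
    # Exact Eq. (7) for predictor k: Lorenz Zonoid gains from adding X_k,
    # weighted by the Shapley kernel |X'|!(K - |X'| - 1)!/K! over all subsets without k.
    K = X.shape[1]
    others = [j for j in range(K) if j != k]
    value = 0.0
    for size in range(K):
        weight = factorial(size) * factorial(K - size - 1) / factorial(K)
        for subset in combinations(others, size):
            gain = (lorenz_zonoid(fitted_probs(X, y, list(subset) + [k]))
                    - lorenz_zonoid(fitted_probs(X, y, list(subset))))
            value += weight * gain
    return value
```

On the synthetic data of the previous sketch, `[shapley_lorenz(X_train, y_train, k) for k in range(X_train.shape[1])]` returns one normalised contribution per predictor; the cost grows as 2^K, which motivates the approximation summarised in Sect. 3.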
3 Algorithm The Python code that we used to compute the Shapley-Lorenz Zonoid values is available on https://github.com/roye10/ShapleyLorenz, for full reproducibility. We now summarise the logical structure of the code.
Table 1 Shapley-Lorenz Zonoid algorithm

0. A pre-defined (user-supplied) upper bound on the total number of fully considered subset permutations can be given, say n_perm. This is similar to the nsamples parameter used in the kernel SHAP module by Lundberg and Lee (2017). Unlike the kernel SHAP module, however, only subsets for which all permutations can be considered, given n_perm, are included. Subsets are considered sequentially, in order of highest to lowest Shapley kernel weight, defined by $|X|!(K - |X| - 1)!/K!$. Due to the symmetry of this Shapley kernel weight, a given subset is always considered pairwise with its complement, i.e. first all permutations of subset size 1 and of size K − 1 are considered. If all permutations of the next subset size can also be considered, given the upper bound n_perm, it is added to the subset sizes considered in the next step.
1. For each k ∈ K (i.e. for all covariates):
   1(a). For each s ∈ C(X) (i.e. for all subset permutations):
   1(b). For each i ∈ N: let X̃ contain a given permutation of a given subset size without k, and compute E[f(X) | X̃ = X̃_i], once for X̃ without X_k and once for X̃ ∪ X_k. Assuming covariate independence, this can be approximated by $\frac{1}{N}\sum_{j=1}^{n} f(X_j \setminus \tilde{X},\, \tilde{X}_i)$, and analogously for X̃_i ∪ X_k. Here X \ X̃ represents all covariates not included in the subset X̃ ∪ X_k and is obtained by replacing those covariates with either training data or a row-wise shuffled variant of the original covariate matrix. The result of this step is E[f(X) | X̃ = X̃_i], approximated by the sample mean over the underlying distribution of X.
2. Sort the values obtained for the current permutation of the current subset size iteration.
3. Compute the Lorenz Zonoid share for the current permutation of the current subset size iteration, once for the permutation not including k and once for the permutation including k; then compute the difference.
4. Weight the obtained Lorenz Zonoid difference by the kernel weight defined above.
5. Compute the weighted sum of the differences.
Following the Shapley value attribution method, computing exact Shapley-Lorenz Zonoid covariate contribution measures for K covariates requires the computation of Lorenz Zonoid marginal contributions across $2^K$ different subsets per covariate. Computationally, this becomes intractable for all but small numbers of covariates, and therefore an approximate solution is implemented in the Shapley-Lorenz Zonoid package. The method can be summarised as in Table 1.
We remark that, as the Shapley-Lorenz Zonoid approach is based on the construction of all the possible model configurations, it is computationally intensive. To overcome this drawback, especially when the set of available explanatory variables is large, a preliminary selection of the predictors to be included in the model can be carried out using the Lorenz Zonoid-based measures proposed in Giudici and Raffinetti (2020).
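A possible form of such a preliminary screening is sketched below: it simply ranks predictors by the Lorenz Zonoid of single-variable logistic fits and keeps the strongest ones, reusing the helper functions and synthetic data of the previous sketches. This is only an illustration of the idea, not the specific Lorenz Zonoid-based selection measures of Giudici and Raffinetti (2020).

```python
# Illustrative pre-screening: rank candidate predictors by the Lorenz Zonoid of a
# single-variable logistic model, keep the top ones, then run the Shapley-Lorenz
# decomposition only on the retained predictors (helpers defined in the sketch above).
marginal_lz = {k: lorenz_zonoid(fitted_probs(X_train, y_train, [k]))
               for k in range(X_train.shape[1])}
keep = sorted(marginal_lz, key=marginal_lz.get, reverse=True)[:3]
sl_values = {k: shapley_lorenz(X_train[:, keep], y_train, keep.index(k)) for k in keep}
```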
4 Application 4.1 Data We apply our proposed method to data supplied by a European External Credit Assessment Institution (ECAI) that specialises in credit scoring for P2P platforms focused on SME commercial lending. The data are described in Giudici et al. (2020), to which we refer for further details. In summary, the analysis relies on a dataset composed of official financial information, extracted from the balance sheets of 15,045 SMEs, mostly based in Southern Europe, for the year 2015. The information about the status (0 = active, 1 = defaulted) of each company one year later (2016) is also provided. The observed proportion of defaulted companies is equal to 10.9%. The nineteen financial variables included in our dataset are as follows: Total Assets/Equity; (Long term debt + Loans)/Shareholders Funds; Total Assets/Total Liabilities; Current Assets/Current Liabilities; (Current assets − Current assets: stocks)/Current liabilities; (Shareholders Funds + Non current liabilities)/Fixed assets; EBIT/interest paid; (Profit or Loss before tax + Interest paid)/Total assets; Return on Equity; Operating revenues/Total assets; Sales/Total assets (Activity Ratio); Interest paid/(Profit before taxes + Interest paid); EBITDA/interest paid; EBITDA/Operating revenues; EBITDA/Sales; Trade Payables/Operating Revenues; Trade Receivables/Operating Revenues; Inventories/Operating Revenues; Turnover. We remark that, for these variables, and particularly for those reflecting the operations of the companies, there is a noticeable presence of unusually large or small values when compared to the mean. These outliers should not be substituted or deleted, as they can provide important information about the companies included in the sample. However, their presence is going to affect the robustness of machine learning models, and of Shapley values in particular. This further motivates the use of Shapley-Lorenz values, which are more robust to outliers.
4.2 Results Using the data, Giudici et al. (2020) have constructed logistic regression scoring models that aim at estimating the probability of default of each company, using the
available financial data from the balance sheets and, in addition, network centrality measures obtained from similarity networks. To improve the predictive performance of the model, Bussmann et al. (2020) applied the Gradient Boosting Tree algorithm to the same database and obtained a substantial increase in predictive performance. The same authors identify Variable 3: Total Assets/Total Liabilities; Variable 7: EBITDA/Interest paid; and Variable 8: (Profit or Loss before tax + Interest paid)/Total assets as the variables that contribute most to the Shapley value decomposition. This is quite consistent with most credit scoring models, which typically include, among the explanatory variables of credit default, a measure of financial leverage (such as Variable 3) and a measure of profitability (such as Variables 7 and 8). We use the same data and apply the proposed Shapley-Lorenz procedure to detect the main factors impacting the probability of default. Although the Gradient Boosting Tree method performs better in terms of predictive accuracy, providing an Area Under the ROC Curve (AUROC) equal to 0.93, we resort to the logistic regression model to fit our data. This is because, although the logistic regression model loses some predictive accuracy (the AUROC decreases to 0.81), it is explainable by design and therefore simpler to understand. Through the cross-validation procedure, the data are split into a training set (80%) and a test set (20%). We then calculate, on the same split, the contribution of each of the nineteen explanatory variables to the estimate of the probability of default, using two explainable AI methods: the Shapley value approach and the Shapley-Lorenz approach that we propose. Table 2 contains the results of the comparison. For each variable, we report the value of the Shapley-Lorenz Zonoid contribution and the Shapley value contribution, calculated as the sum of the Shapley values over all observations. For comparison purposes, we also report the contribution of each variable to the deviance (G²), calculated using the Shapley value formula, as in the global ANOVA-like explainable AI method described in Financial Stability Board (2020) and Joseph (2019). From Table 2, note that the variable which contributes most to the prediction of default, according to the sum of the Shapley values, is Variable 8: (Profit or Loss before tax + Interest paid)/Total assets; followed at a considerable distance by Variables 13 and 14 (both related to EBITDA) and by Variable 3 (Total Assets/Total Liabilities). In terms of G², instead, the differences between Variable 8 (the highest contributor) and Variables 14, 15 and 3 are smaller, and the role played by Variable 13 in terms of Shapley values is taken by Variable 15. Note that the variables selected as the most important by both Shapley values and G² have a similar structure: one variable that indicates leverage (Variable 3) and a few variables that express profitability (Variables 8, 13, 14 or Variables 8, 14, 15). The latter are highly correlated, as they are based on similar information: this may indicate a weakness of these selection methods, which may be redundant and more sensitive to outlier observations. The first column of Table 2, giving the Shapley-Lorenz values, indicates, instead, that Variable 8, with a value of 0.165, and Variable 3, with a value of 0.114, are one order of magnitude higher than the others.
This indicates a more clearcut choice,
Table 2 Marginal contribution of each explanatory variable in terms of Shapley-Lorenz Zonoids, G² and total Shapley values

ID | Variable | Shapley-Lorenz | G² | Shapley
1 | Total assets/Equity | 0.003 | 0.16 | 2.53
2 | (Long term debt + Loans)/Shareholders Funds | 0.002 | 0.54 | –202.80
3 | Total assets/Total liabilities | 0.114 | 1088.12 | –1273.97
4 | Current assets/Current liabilities | 0.057 | 553.68 | –641.69
5 | (Current assets − Current assets: stocks)/Current liabilities | 0.001 | 479.06 | –93.51
6 | (Shareholders Funds + Non current liabilities)/Fixed assets | 0.003 | 13.16 | 4180.56
7 | EBIT/interest paid | –0.001 | 411.10 | 1504.44
8 | (Profit or Loss before tax + Interest paid)/Total assets | 0.165 | 1633.51 | –13115.53
9 | Return on Equity | 0.054 | 826.96 | –1993.98
10 | Operating revenues/Total assets | 0.062 | 17.36 | –289.46
11 | Sales/Total assets | –0.002 | 10.96 | 252.59
12 | Interest paid/(Profit before taxes + Interest paid) | 0.013 | 103.26 | 379.73
13 | EBITDA/interest paid | 0.024 | 418.00 | –1697.31
14 | EBITDA/Operating revenues | 0.038 | 1254.63 | –1419.43
15 | EBITDA/Sales | 0.026 | 1122.05 | –785.95
16 | Trade Payables/Operating revenues | 0.001 | 14.73 | –193.60
17 | Trade Receivables/Operating revenues | 0.058 | 475.40 | –585.58
18 | Inventories/Operating revenues | 0.013 | 126.78 | 1190.47
19 | Turnover | 0.022 | 85.26 | 1072.37
with only two variables being selected: a measure of leverage and a measure of profitability. In the latter case, only the most contributing one, among the several variables that measure profitability, is chosen. We remark that the negative sign appearing for Variables 7 and 11 is not to be interpreted as an inverse relationship with the probability of default, as it would be for Shapley values. Indeed, since the Shapley-Lorenz values are normalised, the negative sign is due to the approximation internal to the algorithm. Comparing the results obtained with the different methods, the Shapley-Lorenz selection is evidently the easiest to interpret and, by construction, it is more robust to outlier observations, as can easily be verified by repeating the analysis for different training/validation splits. Further research focuses on the application of the methodology to other Artificial Intelligence applications that involve a binary classification problem. Our code is openly accessible for this purpose. This may allow the development of a quantitative standard for AI risk management.
5 Concluding Remarks The paper proposes a new agnostic explainable AI method, based on the combination of Shapley values and Lorenz Zonoids, that can be used to interpret the results of a highly performing machine learning risk management algorithm. The proposed method, like other explainable AI methods, is able to identify the variables which most affect the predictions, combining the notion of explainability with that of predictive accuracy. The application of the proposal to a credit risk management use case shows its superior performance in terms of selectivity, consistency with economic intuition and robustness. It identifies one measure of leverage and one measure of profitability as those that matter most. We can thus conclude that the illustrated method is satisfactory and can be proposed as a use case to standardise risk measurement and management in the application of Artificial Intelligence to credit lending. The application of the Shapley-Lorenz approach to other fields and to further linear and non-linear models represents an interesting research area which is currently in progress. Moreover, due to its appealing features, the Shapley-Lorenz decomposition can be further extended to fulfil the requirements of robustness and fairness which, together with the explainability and accuracy principles, contribute to ensuring a trustworthy AI. Indeed, as the Lorenz Zonoid is a measure of mutual variability, it can be exploited both to provide robust findings (i.e., findings not affected by outlying observations) and to provide measures of fairness (equality) with respect to specific sub-groups of the population.
References Bracke, P., Datta, A., Jung, C., & Shayak, S. (2019). Machine learning explainability in finance: An application to default risk analysis. https://www.bankofengland.co.uk/working-paper/2019/ machine-learning-explainability-in-finance-an-application-to-default-risk-analysis. Bussmann, N., Giudici, P., Marinelli, D., & Papenbrock, J. (2020). Explainable machine learning in credit risk management. Computational Economics, 57, 203–216. https://doi.org/10.1007/ s10614-020-10042-0 Financial Stability Board: Interpretable Machine Learning-A Guide for Making Black Box Models explainable (2020). https://cristophm.github.io/interpretable-ml-book. Giudici, P., & Raffinetti, E. (2020). Lorenz model selection. Journal of Classification, 37(2), 754– 768. https://doi.org/10.1007/s00357-019-09358-w Giudici, P., & Raffinetti, E. (2021). Shapley-Lorenz explainable artificial intelligence. Expert Systems with Applications, 167, 1–9. https://doi.org/10.1016/j.eswa.2020.114104 Giudici, P., Hadji-Misheva, B., & Spelta, A. (2020). Network based credit risk models. Quality Engineering, 32(1), 199–211. https://doi.org/10.1080/08982112.2019.1655159 Joseph, A. (2019). Parametric inference with universal function approximators. https://www. bankofengland.co.uk/working-paper/2019/shapley-regressions-a-framework-for-statisticalinference-on-machine-learning-models. Koshevoy, G., & Mosler, K. (1996). The Lorenz Zonoid of a multivariate distribution. Journal of the American Statistical Association, 91(434), 873–882. 10/2291682. Lerman, R., & Yitzhaki, S. (1984). A note on the calculation and interpretation of the Gini index. Economics Letters, 15(3–4), 363–368. https://doi.org/10.1016/0165-1765(84)90126-5 Lundberg, S. M., & Lee, S. (2017). A unified approach to interpreting model predictions. In Conference on Neural Information Processing Systems (NIPS 2017). Long Beach. Mantegna, R. N., & Stanley, H. E. (1999). Introduction to econophysics: Correlations and Complexity in finance. Cambridge University Press. Shapley, L. S. (1953). A value for n-person games. In H. Kuhn & A. Tucker (Eds.), Contributions to the theory of games II (pp. 307–317). Princeton University Press. Štrumbelj, E., & Kononenko, I. (2010). An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11, 1–18.
A Study of Lack-of-Fit Diagnostics for Models Fit to Cross-Classified Binary Variables Maduranga Dassanayake and Mark Reiser
Abstract In this paper, an extended version of the GFfit statistic is compared to other lack-of-fit diagnostics for models fit to cross-classified binary variables. The extended GFfit statistic is obtained by decomposing the Pearson statistic from the full table into orthogonal components defined on marginal distributions. The extended version of the statistic, GFfit⊥^(ij), can be applied to a variety of models for cross-classified tables. Simulation results show that GFfit⊥^(ij) has good Type I error performance even if the joint frequencies are very sparse. Asymptotic power calculations and simulation results show that GFfit⊥^(ij) has higher power for detecting the source of lack of fit compared to other diagnostics on bivariate marginal tables for binary variables. Keywords Multinomial distribution · Pearson Chi-square · Orthogonal components · IRT model
1 Introduction For a multi-way contingency table, the traditional Pearson's Chi-square statistic is obtained by comparing the observed frequencies to the expected frequencies under the null hypothesis. A composite null hypothesis $H_0: \pi = \pi(\beta)$, where the null distribution depends on a vector of g unknown parameters $\beta = (\beta_1, \ldots, \beta_g)'$, can be tested with the Pearson-Fisher statistic, $X^2_{PF} = \sum_s z_s^2$, where $z_s = \sqrt{n}\,(\pi_s(\hat{\beta}))^{-1/2}(\hat{p}_s - \pi_s(\hat{\beta}))$, and where $\hat{p}_s$ is element s of $\hat{p}$, a vector of multinomial proportions, n is the total sample size, $\hat{\beta}$ is the parameter estimator vector, and $\pi_s(\beta)$ is the expected
proportion for cell s and $\pi_s(\hat{\beta})$ is the estimated expected proportion for cell s. The degrees of freedom for a model on q cross-classified variables, $2^q - g - 1$, were given in Fisher (1924). Orthogonal components of $X^2_{PF}$ have been studied by many authors, including Lancaster (1969) and Rayner and Best (1989). A new extended GFfit statistic was proposed in Reiser et al. (2022) for the purpose of detecting lack of fit. The extended statistic, GFfit⊥^(ij), is obtained by decomposing the Pearson statistic from the full table into orthogonal components defined on marginal distributions. In this paper, the performance of GFfit⊥^(ij) is compared to standardized residuals (Reiser 1996) and χ̄²ij (Liu and Maydeu-Olivares 2014) using simulations to assess Type I error rate and power for a latent variable model fit to binary cross-classified variables. χ̄²ij is a moment-corrected version of χ²ij, where χ²ij is the Pearson statistic applied to bivariate table i, j.
2 Marginal Proportions A traditional statistic such as $X^2_{PF}$ uses the joint frequencies to calculate goodness of fit for a model that has been fit to a cross-classified table. This section presents a transformation from joint proportions or frequencies to marginal proportions. The method for using marginal proportions to obtain GFfit⊥^(ij) and the other statistics mentioned in Sect. 1 is explained in Sect. 3.
2.1 First- and Second-Order Marginals The relationship between joint proportions and marginals can be shown by using 0's and 1's to code the levels of dichotomous response random variables $Y_i$, i = 1, 2, ..., q, where $Y_i$ follows the Bernoulli distribution with parameter $P_i$. Then, a q-dimensional vector of 0's and 1's, sometimes called a response pattern, will indicate a specific cell from the contingency table formed by the cross-classification of q response variables. For dichotomous response variables, a response pattern is a sequence of 0's and 1's of length q. The $T = 2^q$-dimensional set of response patterns can be generated by varying the levels of the q-th variable most rapidly, the (q − 1)-th variable next, etc. Define V as the T by q matrix with response patterns as rows. For instance, when q = 3,
$$V = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix}.$$
Let $v_{is}$ represent element i of response pattern s, s = 1, 2, ..., T. Then, under the model $\pi = \pi(\beta)$, the first-order marginal proportion for variable $Y_i$ can be defined as $P_i(\beta) = \mathrm{Prob}(Y_i = 1 \mid \beta) = \sum_s v_{is}\pi_s(\beta)$, and the true first-order marginal proportion is given by $P_i = \mathrm{Prob}(Y_i = 1) = \sum_s v_{is}\pi_s$. Under the model, the second-order marginal proportion for variables $Y_i$ and $Y_j$ can be defined as $P_{ij}(\beta) = \mathrm{Prob}(Y_i = 1, Y_j = 1 \mid \beta) = \sum_s v_{is}v_{js}\pi_s(\beta)$, where j = 1, 2, ..., q − 1; i = j + 1, ..., q, and the true second-order marginal proportion is given by $P_{ij} = \mathrm{Prob}(Y_i = 1, Y_j = 1) = \sum_s v_{is}v_{js}\pi_s$.
2.2 Higher Order Marginals A general matrix $H_{[t:u]}$ to obtain marginals of any order can be defined by using Hadamard products among the columns of V. The symbol $H_{[t:u]}$, t ≤ u ≤ q, denotes the transformation matrix that produces marginals from order t up to and including order u. Furthermore, $H_{[t]} \equiv H_{[t:t]}$. $H_{[1:q]}$ gives a mapping from joint proportions to the set of $(2^q - 1)$ marginal proportions: $P = H_{[1:q]}\pi$, where $P = (P_1, P_2, P_3, \ldots, P_q, P_{12}, P_{13}, \ldots, P_{q-1,q}, P_{123}, \ldots, P_{q-2,q-1,q}, \ldots, P_{1,2,3,\ldots,q})'$ is the vector of marginal proportions. An example of $H_{[1:3]}$ for q = 3 is given in the online supplement.
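As a small illustration of this mapping, the following sketch (with q = 3 and illustrative variable names) builds the response-pattern matrix V and the rows of H for first- and second-order marginals as Hadamard products of columns of V; it is a minimal sketch, not the H matrix given in the online supplement.

```python
import numpy as np
from itertools import combinations, product

q = 3
V = np.array(list(product([0, 1], repeat=q)))       # 2^q response patterns as rows,
                                                    # last variable varying most rapidly
H1 = V.T                                            # first-order rows: P_i = sum_s v_is pi_s
H2 = np.array([V[:, i] * V[:, j]                    # second-order rows: Hadamard products,
               for i, j in combinations(range(q), 2)])  # P_ij = sum_s v_is v_js pi_s
H12 = np.vstack([H1, H2])                           # H[1:2]: maps pi (length 2^q) to (P_i, P_ij)
```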
3 Lack-of-Fit Statistics This section illustrates the mathematical details of the GFfit⊥^(ij) statistic, the adjusted residuals and the χ̄²ij statistic mentioned in Sect. 1.
3.1 The GFfit⊥^(ij) Statistic
A traditional composite null hypothesis for a test of fit on a multinomial model is $H_0: \pi = \pi(\beta)$. Linear combinations of π may be tested under the null hypothesis $H_0: H\pi = H\pi(\beta)$. H may specify linear combinations that form marginal proportions as defined in Sect. 2.2. Define the unstandardized residual $u_s = \hat{p}_s - \pi_s(\hat{\beta})$, and denote the vector of unstandardized residuals as u with element $u_s$. $\sqrt{n}\,u$ has asymptotic covariance matrix $\Sigma_u$, where

$$\Sigma_u = D(\pi(\beta)) - \pi(\beta)\pi(\beta)' - G(A'A)^{-1}G', \qquad (3.1)$$
and where $D(\pi(\beta))$ is the diagonal matrix with (s, s) element equal to $\pi_s(\beta)$, $A = D(\pi(\beta))^{-1/2}\,\partial\pi(\beta)/\partial\beta'$, and $G = \partial\pi(\beta)/\partial\beta'$. The H matrix can also be used to create residuals for marginals. A vector of simple residuals for marginals of any order can be defined as $e = H(\hat{p} - \pi(\hat{\beta})) = Hu$. As defined previously, $T = 2^q$ and g is the dimension of β. If H contains T − g − 1 linearly independent rows corresponding to marginals from order 1 to q, then define the statistic

$$\chi^2_{[T-g-1]} = n\,u'H'\hat{\Sigma}_e^{-1}Hu. \qquad (3.2)$$

Here, the statistic is evaluated at $\beta = \hat{\beta}$, where $\hat{\beta}$ is a consistent and efficient estimator for β, such as the maximum likelihood estimator, and where $\Sigma_e = H\Sigma_uH'$. With the added condition that the rows of H are linearly independent of the columns of the model matrix for $\pi(\beta)$, $\chi^2_{[T-g-1]}$ can be shown to be equivalent to $\chi^2_{PF}$. See Reiser (1996). To obtain orthogonal components, define the upper triangular matrix F such that $F\Sigma_eF' = I$, i.e. $F = (C')^{-1}$, where C is the Cholesky factor of $\Sigma_e$. Then, writing $\Sigma_e$ as $C'C$,

$$\chi^2_{PF} = n\,u'H'(\hat{C}'\hat{C})^{-1}Hu = n\,u'H'\hat{F}'\hat{F}Hu,$$

where $\hat{F}$ and $\hat{C}$ are the matrices F and C evaluated at $\beta = \hat{\beta}$. Premultiplication by $(C')^{-1}$ orthonormalizes the matrix H relative to the matrix $D(\pi) - \pi\pi' - G(A'A)^{-1}G'$. Let $H^* = FH$; then

$$\chi^2_{PF} = n\,u'(\hat{H}^*)'\hat{H}^*u, \qquad (3.3)$$
where $\hat{H}^* = H^*(\hat{\beta})$. Define $\hat{\gamma} = n^{1/2}\hat{F}Hu = n^{1/2}\hat{H}^*u$. Then,

$$\chi^2_{PF} = \hat{\gamma}'\hat{\gamma} = \sum_{k=1}^{T-g-1}\hat{\gamma}_k^2, \qquad (3.4)$$
and the elements $\hat{\gamma}_k^2$ are orthogonal components of $\chi^2_{PF}$. Since the $\hat{H}^*u$ statistics have asymptotic covariance matrix $F\Sigma_eF' = I_{T-g-1}$, the elements $\hat{\gamma}_k^2$ are asymptotically independent $\chi^2_1$ random variables, assuming a consistent estimator for $\pi(\beta)$ and $\hat{\Sigma}_e$ for $\Sigma_e$. The asymptotic approximation may not be valid for components from a sparse higher order marginal table (Reiser and VandenBerg 1994). Then, for binary cross-classified variables, define $GFfit_{\perp}^{(ij)} = \hat{\gamma}_k^2$, where k = q + 1, q + 2, ..., q(q + 1)/2; i = 1, 2, ..., q − 1; j = i + 1, ..., q. When variables have c ≥ 2 categories, GFfit⊥^(ij) is a sum of $(c-1)^2$ orthogonal components (Reiser 2008). GFfit⊥^(ij) is an extended version of the GFfit^(ij) statistic from Cagnone and Mignani (2007). GFfit⊥^(ij) statistics have a sequential property due to the application of the Cholesky factor. Since GFfit⊥^(ij) statistics are asymptotically independent $\chi^2_1$ random variates, they can be summed to form the statistic $X^2_{[2]}$, which is distributed asymptotically as Chi-square (Reiser 2008). This statistic can be used for a more omnibus test focused on the second-order marginals with null hypothesis $H_0: H_{[2]}\pi = H_{[2]}\pi(\beta)$. If this null hypothesis is rejected, then it follows that the null hypothesis $H_0: \pi = \pi(\beta)$ should be rejected. Since $X^2_{[2]}$ is calculated from second-order marginals, the asymptotic Chi-square distribution can be expected to be valid even when the full cross-classified table is sparse (Reiser 1996).
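A direct numerical transcription of Eqs. (3.1)-(3.4) is sketched below for illustration; the inputs (fitted cell probabilities, observed proportions, the marginal transformation H and the Jacobian G) are assumed to be available from the fitted model, and H is assumed to be chosen so that the marginal covariance matrix is nonsingular. Note that this textbook computation factorises and inverts the estimated covariance directly, whereas, as discussed in Sect. 4.2, the components are better computed through an orthogonal regression for numerical stability.

```python
import numpy as np

def orthogonal_components(p_hat, pi_hat, H, G, n):
    # Eq. (3.1): asymptotic covariance of the unstandardized cell residuals
    A = np.diag(pi_hat ** -0.5) @ G
    sigma_u = np.diag(pi_hat) - np.outer(pi_hat, pi_hat) - G @ np.linalg.inv(A.T @ A) @ G.T
    sigma_e = H @ sigma_u @ H.T                    # covariance of the marginal residuals
    C = np.linalg.cholesky(sigma_e).T              # upper-triangular factor with sigma_e = C'C
    F = np.linalg.inv(C.T)                         # F = (C')^{-1}, so F sigma_e F' = I
    gamma = np.sqrt(n) * F @ H @ (p_hat - pi_hat)  # gamma_hat of Eq. (3.4)
    return gamma ** 2                              # asymptotically independent chi^2_1 components
```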
3.2 Adjusted Residuals The adjusted residual for second-order marginal i, j is

$$Z_{ij} = \frac{n^{1/2}\,e_{(k)}}{\hat{\sigma}_{e(k)}}, \qquad (3.5)$$
where k = 1, 2, ..., $\binom{q}{2}$ and corresponds to item pair i, j; $e_{(k)}$ is an element of e, and $\hat{\sigma}_{e(k)}$ is the square root of the corresponding diagonal element of $\hat{\Sigma}_e$. Adjusted residuals on marginal tables for binary variables were developed by Reiser (1996). $\hat{\gamma}_k$, defined above, is essentially an orthogonalized version of this adjusted residual.
3.3 The χ̄²ij Statistic Liu and Maydeu-Olivares (2014) proposed a mean and variance adjusted Chi-square statistic, χ̄²ij, for the bivariate distribution of variables i, j within a large table. Consider the case where Pearson's Chi-square is applied to a bivariate subtable,

$$\chi^2_{ij} = n\,(p_{ij} - \hat{\pi}_{ij})'D_{ij}^{-1}(p_{ij} - \hat{\pi}_{ij}), \qquad (3.6)$$
where $D_{ij} = \mathrm{diag}(\pi_{ij})$ is a diagonal matrix of the bivariate probabilities. If β is estimated on the full table, $\chi^2_{ij}$ is not distributed asymptotically as Chi-square and has an unknown distribution function. But a mean and variance adjusted statistic has an asymptotic distribution that can be well approximated by the Chi-square distribution. The mean and variance adjusted statistic, χ̄²ij, is defined as

$$\bar{\chi}^2_{ij} = 2\,\frac{\hat{\mu}_1}{\hat{\mu}_2}\,\chi^2_{ij}, \qquad (3.7)$$
where the definitions of the two asymptotic moments ($\mu_1$, $\mu_2$) are given in the online supplement. χ̄²ij has an approximate reference Chi-square distribution with degrees of freedom $a = 2\hat{\mu}_1^2/\hat{\mu}_2$, but the joint distribution of a set of χ̄²ij for a model on a cross-classified table is unknown.
4 Simulation Studies 4.1 Type I Error Study The first simulation examined Type I error for the lack-of-fit diagnostics reviewed above using a continuous latent factor model for categorical variables known as the two-parameter logistic (2PL) item response model (Bock and Lieberman 1975). In this model, $\beta = (\beta_0', \beta_1')'$, where $\beta_0$ is a vector of intercepts and $\beta_1$ is a vector of slopes. In this study with q = 8 manifest variables, $\beta_1$ = (0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.2, 0.2). Results shown below are for intercepts symmetric around zero, $\beta_0$ = (–2.0, –1.5, –1.0, –0.5, 0.5, 1.0, 1.5, 2.0). Results for simulations with asymmetric intercepts are shown in the online supplement. Using Monte Carlo methods, 1,000 data sets were generated for each setting. Empirical Type I error rates of the individual lack-of-fit diagnostics were calculated. Since an individual GFfit⊥^(ij) is distributed approximately as Chi-square with one degree of freedom, to calculate the empirical Type I error rate for each GFfit⊥^(ij), the number of cases exceeding the Chi-square critical value (at the 5% significance level) with one degree of freedom was divided by the number of data sets. A similar process was used to calculate the
Type I error rates of the adjusted residuals and χ̄²ij. This simulation was repeated for sample sizes 300 and 500, as is common in the literature on lack-of-fit diagnostics (Cagnone and Mignani 2007; Liu and Maydeu-Olivares 2014; Reiser 1996; Reiser et al. 2022). The GFfit⊥^(ij) statistics were calculated using an orthogonal regression given in Reiser (2008). Table 1 below reports the empirical Type I error rates for q = 8 manifest variables for the symmetric intercept setting. The Type I error rates outside the Monte Carlo error interval $0.05 \pm 1.96\sqrt{0.05(0.95)/1000} = (0.0365, 0.0635)$ are bolded. When n = 300, Type I error rates related to GFfit⊥^(ij) for pairs (4,5) and (5,6) were outside the Monte Carlo error interval. Given that there are twenty-eight individual GFfit⊥^(ij), it is possible that one or two components may randomly fall slightly outside the Monte Carlo error interval. However, error rates for five χ̄²ij and four adjusted residuals were outside the Monte Carlo error interval. This suggests that, when n = 300, GFfit⊥^(ij) has a better Type I error rate compared to χ̄²ij and adjusted residuals for eight manifest variables with a symmetric intercept model. When n = 500, all of the error rates for GFfit⊥^(ij) were inside the Monte Carlo interval, while χ̄²ij and adjusted residuals each had two error rates outside the Monte Carlo interval. Similar results for Type I error were found in simulations using 15 variables, as shown in the online supplement.
4.2 Estimated Mean and Variance of the Statistics
Since an individual GFfit⊥^(ij) is distributed approximately as $\chi^2_1$, it is expected that the mean be close to 1 and the SD be close to $\sqrt{2}$. The results in Table 2 suggest that GFfit⊥^(ij) has mean and SD close to 1 and $\sqrt{2}$, respectively. For the adjusted residuals, the mean should be close to 0 and the SD should be close to 1. χ̄²ij should have an approximate Chi-square distribution, but the degrees of freedom may not be the same for each pair. If a statistic has a larger empirical variance, in general larger than 2 times the mean for statistics distributed as Chi-square, this is an extremely undesirable feature in terms of real-world applications, because even in correctly specified models one may observe extremely large values of such a statistic, which may lead researchers to believe that the model grossly misfits one or more pairs. Table 2 below indicates that, when n = 300, GFfit⊥^(ij) has results similar to χ̄²ij for mean and SD and better results compared to adjusted residuals. High variance for the version of GFfit^(ij) in Cagnone and Mignani (2007) was reported by Liu and Maydeu-Olivares (2014). This high variance results from the direct application of a matrix inverse to $\hat{\Sigma}_e$, an approach which lacks sufficient numerical stability. Calculation of GFfit^(ij) using an orthogonal regression, as in the present study, has high numerical stability.
Table 1 Type I error study for the 2PL latent variable model, symmetric intercepts. The row of pairs below lists the 28 variable pairs (i, j); the six rows of rates that follow give, in that pair order, the empirical Type I error rates of GFfit⊥^(ij), the adjusted residuals and χ̄²ij for n = 300, followed by the same three statistics for n = 500.
(i j)
Pair (i,j)
G F f it⊥
Adj. Residuals
χ¯ i2j
G F f it⊥
Adj. Residuals
χ¯ i2j
(1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (1, 7) (1, 8) (2, 3) (2, 4) (2, 5) (2, 6) (2, 7) (2, 8) (3, 4) (3, 5) (3, 6) (3, 7) (3, 8) (4, 5) (4, 6) (4, 7) (4, 8) (5, 6) (5, 7) (5, 8) (6, 7) (6, 8) (7, 8)
0.046 0.048 0.044 0.044 0.049 0.051 0.057 0.051 0.039 0.043 0.052 0.043 0.047 0.050 0.042 0.049 0.043 0.041 0.074 0.062 0.037 0.050 0.070 0.039 0.052 0.045 0.037 0.052
0.055 0.048 0.057 0.034 0.048 0.063 0.053 0.055 0.046 0.054 0.063 0.059 0.048 0.058 0.038 0.060 0.048 0.043 0.080 0.079 0.054 0.042 0.074 0.044 0.050 0.045 0.044 0.040
0.052 0.047 0.054 0.03 0.044 0.060 0.052 0.054 0.049 0.052 0.059 0.060 0.048 0.058 0.038 0.056 0.049 0.043 0.079 0.077 0.052 0.042 0.073 0.043 0.052 0.048 0.044 0.036
0.052 0.044 0.051 0.042 0.053 0.043 0.051 0.041 0.038 0.049 0.048 0.048 0.057 0.05 0.043 0.051 0.056 0.047 0.064 0.057 0.037 0.048 0.062 0.039 0.037 0.054 0.049 0.041
0.059 0.046 0.051 0.043 0.049 0.045 0.053 0.042 0.047 0.050 0.042 0.049 0.053 0.052 0.044 0.046 0.050 0.039 0.070 0.068 0.051 0.042 0.063 0.037 0.037 0.048 0.037 0.037
0.056 0.046 0.047 0.043 0.047 0.045 0.053 0.043 0.046 0.050 0.042 0.047 0.054 0.051 0.042 0.046 0.048 0.040 0.069 0.067 0.049 0.042 0.064 0.038 0.037 0.050 0.038 0.040
1000 samples; 975 (n = 300), 998 (n = 500) convergence
4.3 Power Study for Eight Variables A power study was conducted by using Monte Carlo methods to generate 1,000 data sets from a model with two continuous latent factors. Intercepts and slopes for the first factor were the same as in the Type I error study. For the second factor, β 2 = (0,0,0,1,1,1,1,1). Higher power is expected for lack-of-fit diagnostics related to variables 1, 2 and 3. Data sets were fit using a false model with one continuous latent
A Study of Lack-of-Fit Diagnostics for Models Fit …
141
Table 2 Mean and SD for the 2PL latent variable model, symmetric intercepts, eight variables. Each row below lists the pair (i, j) followed by the mean and SD of GFfit⊥^(ij), the adjusted residual and χ̄²ij for n = 300, and then the same six quantities for n = 500.
G F f it⊥
n = 500 Adj. Res.
(i j)
χ¯ i2j
G F f it⊥
Adj. Res.
χ¯ i2j
Pair(i,j)
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
Mean
SD
(1, 2)
0.99
1.42
0.14
1.24
0.98
1.42
1.09
1.55
–0.02
1.13
1.14
1.95
(1, 3)
1.06
1.62
0.05
1.24
0.96
1.29
0.96
1.36
0.04
1.09
1.01
1.82
(1, 4)
1.16
2.08
0.44
3.13
0.94
1.37
0.97
1.33
–0.01
0.99
1.02
1.98
(1, 5)
1.03
1.64
–0.83
3.49
0.96
1.34
0.98
1.40
0.00
0.98
1.00
1.73
(1, 6)
1.12
1.57
–0.71
2.78
0.93
1.25
1.02
1.50
0.02
1.01
1.03
1.57
(1, 7)
1.02
1.47
–0.10
1.27
0.97
1.25
1.03
1.57
0.00
1.03
1.07
1.57
(1, 8)
1.01
1.41
–0.15
1.32
1.02
1.67
1.00
1.44
0.00
1.02
1.07
1.69
(2, 3)
0.97
1.33
0.01
1.03
0.96
1.42
0.99
1.35
–0.03
0.99
0.98
1.34
(2, 4)
1.00
1.48
0.45
2.83
0.87
1.24
0.94
1.33
0.01
0.97
0.94
1.34
(2, 5)
1.04
1.67
–0.78
3.56
1.00
1.57
0.93
1.30
–0.02
1.00
0.99
1.38
(2, 6)
1.06
1.56
–0.60
2.57
0.92
1.27
1.02
1.43
0.03
0.97
0.94
1.29
(2, 7)
1.06
1.47
–0.03
1.12
0.99
1.29
1.00
1.43
0.01
1.01
1.02
1.37
(2, 8)
1.02
1.45
–0.13
1.16
1.01
1.41
0.99
1.25
0.02
1.00
0.99
1.44
(3, 4)
1.04
1.91
0.31
2.46
1.05
1.39
0.99
1.39
–0.04
1.00
1.00
1.41
(3, 5)
1.05
1.39
–0.68
2.95
0.98
1.37
0.94
1.34
0.03
0.97
0.94
1.38
(3, 6)
1.09
1.51
–0.57
2.44
0.97
1.34
1.00
1.45
0.01
0.99
0.98
1.43
(3, 7)
1.08
1.48
–0.07
1.15
0.99
1.33
1.03
1.47
0.00
1.00
1.00
1.45
(3, 8)
1.01
1.48
–0.11
1.10
0.96
1.29
0.98
1.28
0.03
1.01
1.02
1.96
(4, 5)
1.21
2.36
–2.30
8.57
0.99
1.52
1.06
2.01
0.03
1.09
1.17
2.19
(4, 6)
1.15
1.80
–1.39
5.34
0.92
1.38
0.97
1.35
0.08
0.99
0.99
1.47
(4, 7)
0.98
1.51
–0.06
2.03
0.98
1.30
1.01
1.39
0.00
1.00
0.99
1.35
(4, 8)
1.10
1.51
–0.07
1.89
0.91
1.28
1.00
1.40
0.05
0.97
0.94
1.27
(5, 6)
1.14
1.73
–2.79
9.15
1.05
1.44
1.07
1.54
0.00
1.05
1.11
2.03
(5, 7)
0.98
1.35
–0.48
2.35
0.92
1.34
0.99
1.45
–0.02
0.98
0.95
1.34
(5, 8)
1.17
1.88
–0.41
2.18
0.92
1.30
0.99
1.43
–0.03
0.97
0.94
1.23
(6, 7)
1.18
2.02
–0.47
2.45
1.05
1.41
0.97
1.43
0.05
1.02
1.03
1.50
(6, 8)
1.02
1.49
–0.41
2.22
0.99
1.38
0.97
1.31
–0.04
1.00
1.00
1.41
(7, 8)
1.09
1.55
–0.08
1.16
1.04
1.38
0.98
1.36
0.01
0.97
0.94
1.36
factor. The process was repeated for n = 300 and n = 500. Asymptotic and empirical power comparisons are given in Table 3. Asymptotic power is displayed only for GFfit⊥^(ij). By examining the highlighted values, it is clear that the empirical power of the second-order GFfit⊥^(ij) for pairs (1, 2), (1, 3) and (2, 3) is substantially higher compared to the other components. Thus, these second-order GFfit⊥^(ij) were successful in detecting the source of a poorly fitting model. Empirical power for the three diagnostics was close for variable pairs (1, 2) and (1, 3), but GFfit⊥^(ij) had higher power for pair (2, 3). These power results should be considered in the context that GFfit⊥^(ij) has better Type I error performance when n = 300. Empirical power results for GFfit⊥^(ij) were
Table 3 Asymptotic and empirical power comparison for the 2PL latent variable model, eight variables. Each row below lists the pair (i, j) followed by the empirical power of GFfit⊥^(ij), the adjusted residuals and χ̄²ij and the asymptotic power of GFfit⊥^(ij) for n = 300, and then the same four quantities for n = 500.
n = 500 (i j)
Pair (i,j)
G F f it⊥
Adj. Resi. χ¯ i2j
Asy. power*
G F f it⊥
Adj. Resi. χ¯ i2j
Asy. powera
(1, 2)
0.584
0.601
0.599
0.660
0.835
0.826
0.824
0.865
(1, 3)
0.543
0.553
0.553
0.609
0.814
0.781
0.781
0.823
(1, 4)
0.060
0.064
0.063
0.051
0.054
0.080
0.081
0.051
(1, 5)
0.049
0.058
0.055
0.052
0.056
0.080
0.081
0.053
(1, 6)
0.048
0.064
0.066
0.054
0.045
0.074
0.075
0.057
(1, 7)
0.054
0.047
0.049
0.050
0.051
0.068
0.069
0.050
(1, 8)
0.045
0.065
0.065
0.050
0.052
0.060
0.060
0.050
(2, 3)
0.610
0.562
0.562
0.654
0.852
0.803
0.803
0.860
(2, 4)
0.048
0.072
0.072
0.051
0.059
0.064
0.066
0.052
(2, 5)
0.043
0.062
0.067
0.052
0.060
0.073
0.072
0.053
(2, 6)
0.067
0.056
0.055
0.054
0.060
0.075
0.074
0.057
(2, 7)
0.044
0.059
0.057
0.050
0.042
0.054
0.055
0.050
(2, 8)
0.056
0.054
0.056
0.050
0.058
0.061
0.060
0.050
(3, 4)
0.056
0.064
0.064
0.051
0.067
0.062
0.065
0.052
(3, 5)
0.056
0.053
0.054
0.052
0.061
0.065
0.065
0.053
(3, 6)
0.071
0.048
0.049
0.054
0.070
0.075
0.075
0.057
(3, 7)
0.061
0.056
0.055
0.050
0.052
0.064
0.064
0.050
(3, 8)
0.053
0.066
0.064
0.050
0.040
0.065
0.066
0.050
(4, 5)
0.050
0.07
0.073
0.050
0.052
0.063
0.064
0.050
(4, 6)
0.051
0.073
0.071
0.050
0.053
0.076
0.076
0.050
(4, 7)
0.043
0.058
0.06
0.050
0.055
0.055
0.055
0.050
(4, 8)
0.051
0.063
0.062
0.050
0.040
0.059
0.057
0.050
(5, 6)
0.059
0.079
0.077
0.056
0.046
0.063
0.062
0.050
(5, 7)
0.045
0.061
0.061
0.050
0.053
0.052
0.050
0.050
(5, 8)
0.053
0.052
0.048
0.050
0.051
0.049
0.049
0.050
(6, 7)
0.049
0.056
0.057
0.050
0.054
0.062
0.062
0.050
(6, 8)
0.050
0.070
0.064
0.050
0.050
0.061
0.060
0.050
(7, 8)
0.051
0.070
0.072
0.050
0.049
0.057
0.058
0.050
a Asymptotic
power was calculated only for the orthogonal components 1000 samples; 984 (n = 300), 999 (n = 500) convergence
somewhat lower compared to the asymptotic power results when n = 300, indicating that when the sample size is smaller the empirical distribution may not be close to the theoretical power function. From the results in Table 3, it is clear that the empirical power increases with the sample size. When n = 500, GFfit⊥^(ij) had higher power for all three variable pairs and substantially higher power for pair (2, 3). An important advantage in applications is that GFfit⊥^(ij) statistics have an independence property. Also, when
n = 500, the empirical power results and asymptotic power results for GFfit⊥^(ij) were fairly close, indicating that the empirical distribution approaches the theoretical distribution. Similar results for power were found in simulations using 15 variables.
5 Application The Epidemiologic Catchment Area (ECA) program of research (United States Department of Health and Human Services 1985) was initiated in response to the 1977 report of the U.S. President's Commission on Mental Health. The purpose was to collect data on the prevalence and incidence of mental disorders and on the use of and need for services by the mentally ill. For this application, eight items related to the psychiatric condition known as simple phobias were chosen from the ECA study to analyze as a real-world application. The data set was limited to the Johns Hopkins (Baltimore, MD, U.S.A.) area. The selected items are (1) fear of heights, (2) fear of closed places, (3) fear of speaking in front of close friends, (4) fear of speaking to strangers, (5) fear of storms, (6) fear of water, (7) fear of spiders and (8) fear of harmless animals. There were 3,316 observations related to these specifications. Each variable has two categories, 'yes' or 'no', so there are 2^8 = 256 response patterns. However, as most of the answers are 'no', many response patterns have a cell count less than five, and 165 response patterns are empty. A model with one latent factor was fitted to the data. For this model, $\chi^2_{PF}$ = 488.95 on 239 degrees of freedom. Since the overall table is sparse, the Chi-square approximation for $\chi^2_{PF}$ may not be valid. However, the bivariate tables are not sparse, and $X^2_{[2]}$ = 108.53 on 28 degrees of freedom with p-value = 1.51E−10, indicating that the model with one factor is not a good fit. GFfit⊥^(ij) was used to identify the lack of fit. When a large number of components is produced, a multiple decision rule should be used to determine which components are significantly large. Because the GFfit⊥^(ij) are independent random variates, it is possible to take advantage of the False Discovery Rate (FDR) procedure for independent tests (Benjamini and Hochberg 1995) to maintain the Type I error rate. FDR adjustment is not valid for χ̄²ij because it has an unknown joint distribution, so a more conservative method such as the Benjamini and Yekutieli (2001) procedure would be needed to maintain Type I error for χ̄²ij. The lack-of-fit diagnostics for second-order marginals are shown in Table 4 along with the raw p-values and the adaptive FDR p-values for GFfit⊥^(ij). In the table, GFfit⊥^(ij) for pairs (1, 8), (3, 4), (3, 7), (3, 8) and (6, 7) have significant FDR p-values, indicating that these pairs of variables have associations not explained by the model with one factor. Results related to GFfit⊥^(ij), the adjusted residuals and χ̄²ij are consistent with each other. Further, variable 3, 'fear of speaking in front of close friends', appears in three of these large components. 'Fear of speaking' may represent more than a simple phobia.
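Because the GFfit⊥^(ij) components are asymptotically independent, the step-up procedure of Benjamini and Hochberg (1995) can be applied directly to their raw p-values; a minimal sketch is given below. It implements the plain Benjamini-Hochberg adjustment, not the adaptive variant reported in Table 4, and the chi-square p-value step is only an assumption about how the raw p-values would be obtained.

```python
import numpy as np
from scipy.stats import chi2

def bh_adjust(p):
    # Benjamini-Hochberg step-up adjusted p-values, valid for independent tests.
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_min = 1.0
    for position, idx in enumerate(order[::-1]):          # from largest to smallest p-value
        rank = m - position                                # rank of p[idx] among the sorted p-values
        running_min = min(running_min, p[idx] * m / rank)  # enforce monotonicity of the adjustment
        adjusted[idx] = running_min
    return adjusted

# Example usage: raw upper-tail chi-square(1) p-values for a vector of GFfit components
# gffit = np.array([...]); fdr_p = bh_adjust(chi2.sf(gffit, df=1))
```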
Table 4 $GF_{fit\perp}^{(ij)}$, adjusted residuals and $\bar{\chi}^2_{ij}$ for the ECA study

Pair (i, j)   GFfit⊥    Adj. residual   χ̄²ij     Raw p-value   FDR p-value
(1, 2)        1.637      1.153           1.285     0.200         0.423
(1, 3)        1.504     −1.350           1.905     0.220         0.423
(1, 4)        3.702     −2.289           5.019     0.054         0.176
(1, 5)        0.277      0.658           0.293     0.598         0.698
(1, 6)        0.768      0.617           0.256     0.380         0.560
(1, 7)        1.204     −1.220           3.802     0.272         0.438
(1, 8)       11.825     −3.328          11.132     0.001         0.008
(2, 3)        0.056     −0.566           0.278     0.812         0.842
(2, 4)        0.508     −1.967           3.506     0.475         0.614
(2, 5)        0.172     −0.905           0.903     0.677         0.731
(2, 6)        3.633     −0.109           0.037     0.056         0.176
(2, 7)        0.568     −2.752           9.193     0.450         0.614
(2, 8)        0.445     −1.378           1.731     0.504         0.614
(3, 4)       32.109      4.752          24.251     1.45E−08      4.08E−07
(3, 5)        1.460     −2.387           5.730     0.226         0.423
(3, 6)        0.0005    −1.502           2.262     0.982         0.982
(3, 7)       10.984     −3.543          13.571     0.001         0.008
(3, 8)       10.077     −1.844           3.213     0.001         0.010
(4, 5)        1.872     −3.238          10.168     0.171         0.399
(4, 6)        1.281     −1.523           2.104     0.257         0.438
(4, 7)        2.706     −2.536           6.602     0.099         0.279
(4, 8)        2.555     −1.789           2.823     0.109         0.279
(5, 6)        3.910     −1.503           2.734     0.047         0.176
(5, 7)        0.170      0.685           0.501     0.679         0.731
(5, 8)        4.437      0.221           0.115     0.035         0.164
(6, 7)        9.014     −2.855          11.429     0.002         0.014
(6, 8)        0.475     −2.101           4.335     0.490         0.614
(7, 8)        1.158      2.335           6.135     0.281         0.438
6 Conclusions

$GF_{fit\perp}^{(ij)}$ statistics are asymptotically independent chi-square variates obtained as orthogonal components of Pearson's $\chi^2_{PF}$, and $GF_{fit\perp}^{(ij)}$ is a powerful diagnostic for detecting the source of lack of fit when a hypothesized model does not fit a table of cross-classified binary variables. Power calculations for a latent variable model show that $GF_{fit\perp}^{(ij)}$ has higher power than adjusted residuals and $\bar{\chi}^2_{ij}$. Simulation
results show that $GF_{fit\perp}^{(ij)}$ has good Type I error performance even if the joint frequencies in the full $2^q$ table are sparse, and that $GF_{fit\perp}^{(ij)}$ is computationally stable when calculated with an orthogonal regression.
References

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.
Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4), 1165–1188.
Bock, R. D., & Lieberman, M. (1975). Fitting a response model for n dichotomously scored items. Psychometrika, 40, 5–32.
Cagnone, S., & Mignani, S. (2007). Assessing the goodness of fit for a latent variable model for ordinal data. Metron, LXV, 337–361.
Fisher, R. A. (1924). The conditions under which chi square measures the discrepancy between observation and hypothesis. Journal of the Royal Statistical Society, 87, 19–43.
Lancaster, H. (1969). The chi-squared distribution. New York: Wiley.
Liu, Y., & Maydeu-Olivares, A. (2014). Identifying the source of misfit in item response theory models. Multivariate Behavioral Research, 49(4), 354–371.
Rayner, J. C. W., & Best, D. J. (1989). Smooth tests of goodness of fit. New York: Oxford University Press.
Reiser, M. (1996). Analysis of residuals for the multinomial item response model. Psychometrika, 61(3), 509–528.
Reiser, M. (2008). Goodness-of-fit testing using components based on marginal frequencies of multinomial data. British Journal of Mathematical and Statistical Psychology, 61(2), 331–360.
Reiser, M., & VandenBerg, M. (1994). Validity of the chi-square test in dichotomous variable factor analysis when expected frequencies are small. British Journal of Mathematical and Statistical Psychology, 47, 85–107.
Reiser, M., Cagnone, S., & Zhu, J. (2022). An extended GFfit statistic defined on orthogonal components of Pearson's chi-square. Psychometrika. https://doi.org/10.1007/s11336-022-09866-8
United States Department of Health and Human Services, National Institute of Mental Health. (1985). Epidemiological Catchment Area (ECA) Survey of Mental Disorders, Wave I (Household), 1980–1985: [United States]. Rockville, MD: [producer], 1985. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 1991. https://doi.org/10.3886/ICPSR08993.v1
Robust Response Transformations for Generalized Additive Models via Additivity and Variance Stabilization

Marco Riani, Anthony C. Atkinson, and Aldo Corbellini
Abstract The AVAS (Additivity And Variance Stabilization) algorithm of Tibshirani provides a non-parametric transformation of the response in a linear model to approximately constant variance. It is thus a generalization of the much-used Box-Cox transformation. However, AVAS is not robust. Outliers can have a major effect on the estimated transformations both of the response and of the transformed explanatory variables in the Generalized Additive Model (GAM). We describe and illustrate robust methods for the non-parametric transformation of the response and for estimation of the terms in the model, and report the results of a simulation study comparing our robust procedure with AVAS. We illustrate the efficacy of our procedure through the analysis of real data.

Keywords Augmented star plot · AVAS · Backfitting · Forward search · Heatmap · Outlier detection · Robust regression
1 Introduction

Nonlinear parametric transformations of the response variable are common practice in regression problems, for example, logarithms of survival times. Tibshirani (1988) used smoothing techniques to provide non-parametric transformations of the response together with transformations of the explanatory variables, a procedure he
called AVAS (additivity and variance stabilization). The resulting model is a generalized additive model (GAM) with a response transformed to approximately constant variance. Tibshirani's work can be seen as a non-parametric extension of the power transformation family of Box and Cox (1964), in which the goals are the stabilization of error variance and the approximate normalization of the error distribution, hopefully combined with an additive model. It also extends the parametric transformation of explanatory variables of Box and Tidwell (1962). A discussion of the relationship of AVAS to the Box-Cox transformation is in Hastie and Tibshirani (1990, Chap. 7). Tibshirani's AVAS is not robust with respect to outliers. Our main purpose is to provide a robust version of his work, which, for obvious reasons, we call RAVAS. In developing our procedure we made four important improvements to the original AVAS. Like robustness, these are available as options. Thus, RAVAS can be used for fitting a response-transformed GAM when robustness is not an issue, or for fitting a GAM without response transformation.

Section 2 introduces the generalized additive model and the associated backfitting algorithm for estimation of the transformations of the explanatory variables, which uses a smoothing algorithm. The AVAS procedure and the associated numerical variance stabilization transformation are described in Sects. 2.3 and 2.4. Section 3 outlines the various forms of robust regression that are available in our algorithm and describes the resulting outlier detection procedures. The purpose is to provide an outlier-free subset of the data for transformation and smoothing. An outline of our improvements to AVAS is in Sect. 4. Appreciably more detail of these is provided in Riani et al. (2023), as well as further data analyses. Section 5 presents the results of a simulation study comparing some properties of AVAS and RAVAS in the presence of outliers: the mean squared error of parameter estimates, the power of detection of outliers (just for RAVAS) and the number of numerical iterations of the two algorithms required for convergence. The performance of AVAS is severely degraded by the presence of outliers. The last two sections present a data analysis, which makes use of the augmented star plot as a guide to the choice of options in the estimation process and includes a comparison of the choices using a heatmap of correlations. The purpose of the paper is both to introduce the MATLAB program we have written for this form of robust data analysis and to illustrate some of its properties.
2 Generalized Additive Models and the Structure of AVAS

2.1 Introduction

The generalized additive model (GAM) has the form

$$g(Y_i) = \beta_0 + \sum_{j=1}^{p} f_j(X_{ij}) + \epsilon_i. \qquad (1)$$
The functions $f_j$ are unknown and are, in general, found by the use of smoothing techniques. A monotonicity constraint can be applied. If the response transformation or link function g is unknown, it is restricted to be monotonic, but scaled to satisfy the technically necessary constraint that var{g(Y)} = 1. In the fitting algorithm, the transformed responses are scaled to have mean zero; the constant $\beta_0$ can therefore be ignored. The observational errors are assumed to be independent and additive with constant variance. The performance of fitted models is compared by use of the coefficient of determination $R^2$. Since the $f_j$ are estimated from the data, the traditional assumption of linearity in the explanatory variables is avoided. However, the GAM retains the assumption that explanatory variable effects are additive. Buja et al. (1989) describe the background and early development of this model.
2.2 Backfitting

For the moment we assume that the response transformation g(Y) is known. The backfitting algorithm, described in Hastie and Tibshirani (1990, p. 91), is used to fit a GAM. The algorithm proceeds iteratively using residuals when one explanatory variable in turn is dropped from the model. With g(y) the n × 1 vector of transformed responses, let $e^{(j)}$ be the vector of residuals when $f_j(x_j)$ is removed from the model without any refitting. Then

$$e^{(j)} = g(y) - \sum_{k=1,\, k \neq j}^{p} f_k(x_k). \qquad (2)$$
The new value of $f_j(\cdot)$ depends on ordered values of $e^{(j)}$ and $x_j$. Let the ordered values of $x_j$ be $x_{s,j}$. The residuals $e^{(j)}$ are sorted in the same way to give the new order $e_{s,(j)}$. Within each iteration each explanatory variable is dropped in turn, $j = 1, \ldots, p$. The iterations continue until the change in the value of $R^2$ is less than a specified tolerance. For iteration l the vector of sorted residuals for $x_j$ is $e^{l}_{(j)}$. The new estimate $f^{(l+1)}_{s,j}$ is

$$f^{(l+1)}_{s,j} = S\left(e^{l}_{s,(j)},\, x_{s,j}\right). \qquad (3)$$
The function S depends on the constraint imposed on the transformation of variable j. If the transformation can be non-monotonic, S denotes a smoothing procedure. As does Tibshirani (1988), we use the supersmoother (Friedman and Stuetzle 1982), a non-parametric estimator based on local linear regression with adaptive bandwidths. Monotonic transformations using isotonic regression are also an optional possibility (Barlow et al. 1972). The backfitting algorithm is not invariant to the permutation of the order of the variables inside the matrix X, with high collinearity between the explanatory variables causing slow convergence of the algorithm: the residual sum of squares can change very
little between iterations. Our option orderR2, Sect. 4.1.4, attempts a solution to this problem by reordering the variables in order of importance.
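The following is a minimal sketch, in Python rather than the MATLAB of our program, of the backfitting iteration in (2)–(3) with a generic smoother supplied by the user. It assumes a centred transformed response and uses a crude running-mean smoother as a stand-in for the supersmoother; all names are illustrative.

```python
import numpy as np

def backfit(gy, X, smoother, max_iter=50, tol=1e-6):
    """Backfitting for an additive model gy ~ sum_j f_j(x_j) (sketch).

    gy       : (n,) centred transformed response
    X        : (n, p) explanatory variables
    smoother : function (x_sorted, e_sorted) -> fitted values on x_sorted
    """
    n, p = X.shape
    f = np.zeros((n, p))                 # current estimates of f_j(x_ij)
    r2_old = -np.inf
    for _ in range(max_iter):
        for j in range(p):
            e_j = gy - f.sum(axis=1) + f[:, j]        # partial residuals, eq. (2)
            order = np.argsort(X[:, j])
            fitted_sorted = smoother(X[order, j], e_j[order])   # eq. (3)
            f[order, j] = fitted_sorted - fitted_sorted.mean()  # keep terms centred
        resid = gy - f.sum(axis=1)
        r2 = 1.0 - resid.var() / gy.var()
        if abs(r2 - r2_old) < tol:        # stop when the change in R^2 is small
            break
        r2_old = r2
    return f

def running_mean(x_sorted, e_sorted, k=11):
    """Crude moving-average smoother used only for illustration."""
    return np.convolve(e_sorted, np.ones(k) / k, mode="same")

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
gy = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=200)
f_hat = backfit(gy - gy.mean(), X, running_mean)
```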
2.3 The AVAS Algorithm

In this section, we present the structure of the AVAS algorithm of Tibshirani (1988). The variance stabilizing transformation used to estimate the response transformation is outlined in Sect. 2.4. Our RAVAS algorithm has a similar structure to that given here, made more elaborate by the requirements of robustness and the presence of options. In this description of the algorithm ty and tX are transformed values of y and X.

1. Initialize Data. Standardize the response y so that ty has mean zero and var(ty) = 1, where var is the maximum likelihood (biased) estimator of variance. Centre each column of the X matrix so that each $tX_j$ has mean zero, j = 1, ..., p.
2. Initial call to the 'Inner Loop' to find an initial GAM using ty and tX; this calculates the initial value of the coefficient of determination, $R^2$. Set convergence conditions on the number of iterations and the value of $R^2$.
3. Main (Outer) Loop. Given the values of ty and tX, at each iteration the outer loop finds numerically updated values of the transformed response. Given the newly transformed response, updated transformed explanatory variables are found through the call to the backfitting algorithm (inner loop). In our version iterations continue until a stopping condition on $R^2$ is verified or until a maximum number of iterations has been reached.
2.4 The Numerical Variance Stabilizing Transformation

We first consider the case of a random variable Y with known distribution for which E(Y) = μ and var(Y) = V(μ). We seek a transformation ty = h(y) for which the variance is, at least approximately, independent of the mean. A Taylor series expansion of h(y) leads to var(Y) ≈ V(μ){h′(μ)}². For a general distribution, h(y) is then a solution of the differential equation $dh/d\mu = C/\sqrt{V(\mu)}$. For random variables standardized, as are the values ty, to have unit variance, C = 1 and the variance stabilizing transformation is

$$h(t) = \int^{t} \frac{1}{\sqrt{V(u)}}\, du. \qquad (4)$$

In the AVAS algorithm for data, $1/\sqrt{V(u)}$ is estimated by the vector of the reciprocals of the absolute values of the smoothed residuals, sorted using the ordering based on fitted values of the model. There are n integrals, one for each observation. The
range of integration for observation i goes from the smallest fitted value to the old transformed value ty i , i = 1, . . . , n. The computation of the n integrals uses the trapezoidal rule and is outlined in Sect. 4.2. Since the transformation is the sum of an increasing number of non-negative elements, monotonicity is assured. The logged residuals in the estimation of the variance function are smoothed using the running line smoother of Hastie and Tibshirani (1986).
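A minimal sketch of this numerical variance stabilizing transformation is given below, assuming the residuals, fitted values and current transformed responses of the additive model are already available. It uses raw absolute residuals where the paper smooths the logged residuals, omits the robust subset, and uses simple clamped interpolation at the ends rather than the trapezoid/rectangular extrapolation options of Sect. 4.2; names are illustrative.

```python
import numpy as np

def variance_stabilize(ty_old, fitted, residuals):
    """Numerical variance stabilizing transformation, a sketch of eq. (4)."""
    order = np.argsort(fitted)
    grid = fitted[order]
    # reciprocal of a crude estimate of sqrt(V) at the ordered fitted values;
    # the paper first smooths the logged absolute residuals
    v = 1.0 / np.maximum(np.abs(residuals[order]), 1e-6)
    ty_new = np.empty_like(np.asarray(ty_old, dtype=float))
    for i, t in enumerate(ty_old):
        # integral from the smallest fitted value up to t, by the trapezoidal rule
        vt = np.interp(t, grid, v)            # value of 1/sqrt(V) at t (clamped at the ends)
        g = np.concatenate([grid[grid < t], [t]])
        vv = np.concatenate([v[grid < t], [vt]])
        ty_new[i] = np.sum(0.5 * (vv[1:] + vv[:-1]) * np.diff(g))
    # rescale to zero mean and unit variance, as in the AVAS algorithm
    return (ty_new - ty_new.mean()) / ty_new.std()
```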
3 Robustness and Outlier Detection

3.1 Robust Regression

We robustify our transformation method through the use of robust regression to replace least squares. The examples in this paper have been calculated using adaptive hard trimming. In the Forward Search (FS), the observations are hard trimmed, the amount of trimming being determined by the choice of the trimming parameter h, the value of which is found adaptively by the search. Atkinson et al. (2010) provide a general survey of the FS, with discussion. We have also implemented Least Trimmed Squares (Hampel 1975; Rousseeuw 1984), as well as soft trimming (downweighting); specifically, we include S and MM estimation.
3.2 Robust Outlier Detection

Our algorithm works with k observations treated as outliers, providing the subset $S_m$ of m = n − k observations used in model fitting and parameter estimation. This section describes our outlier detection methods. The default setting of the forward search uses the multivariate procedure of Riani et al. (2009), adapted for regression (Torti et al. 2021), to detect outliers at a simultaneous level of approximately 1% for samples of size up to around 1,000. Optionally, a different level can be selected. For the other two methods of robust regression, we apply a Bonferroni inequality to robust residuals to give a simultaneous test for outliers. Since different response transformations can indicate different observations as outliers, the identification of outliers occurs repeatedly during our robust algorithm, once per iteration of the outer loop.
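The Bonferroni-type rule mentioned for the non-FS estimators can be sketched as follows: scale the robust residuals and flag those exceeding the two-sided normal quantile at level α/n. This is only an illustration, not the FS procedure of Riani et al. (2009), and the residual scale is assumed to come from the robust fit; names are illustrative.

```python
import numpy as np
from scipy import stats

def bonferroni_outliers(residuals, scale, alpha=0.01):
    """Flag outliers from robust residuals using a Bonferroni bound (sketch)."""
    n = len(residuals)
    standardized = np.asarray(residuals) / scale
    # two-sided normal critical value at the Bonferroni-corrected level alpha/n
    threshold = stats.norm.ppf(1.0 - alpha / (2.0 * n))
    return np.abs(standardized) > threshold, threshold

rng = np.random.default_rng(1)
res = np.concatenate([rng.normal(size=97), [6.0, -7.5, 8.2]])   # three gross outliers
flags, cut = bonferroni_outliers(res, scale=1.0)
print(flags.sum(), "observations flagged at cut-off", round(cut, 2))
```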
4 Improvements and Options

Our RAVAS procedure introduces five improvements to AVAS, programmed as options. These do not have a hierarchical structure, so that there are $2^5 = 32$ possible combinations of the options. The augmented star plot of Sect. 6 provides a method for assessing these choices. We discuss the motivation and implementation for each. The order in Sect. 4.1 is that in which the options are applied to the data when all five are used. We also give the names of the options, which are used as labels in the augmented star plot.
4.1 Initial Calculations

The structure of our algorithm is an elaboration of that of AVAS outlined in Sect. 2.3. Four of the five options can be invoked before the start of the outer loop.
4.1.1 Initialization of Data: Option Tyinitial
Our numerical experience is that it is often beneficial to start from a parametric transformation of the response. This is optionally found using the automatic robust procedure for power transformations described by Riani et al. (2022). For min(y) > 0 we use the Box-Cox transformation. For min(y) ≤ 0 the extended Yeo-Johnson transformation is used (Atkinson et al. 2020). This family of transformations has separate Box and Cox transformations for positive and negative observations. In both cases the initial parametric transformations are only useful approximations, found by searching over a coarse grid of parameter values. The final non-parametric transformations sometimes suggest a generalization of the parametric ones.
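A minimal, non-robust sketch of such an initial parametric choice is given below: the Box-Cox profile log-likelihood is evaluated over a coarse grid of parameter values and the best value is retained, assuming a strictly positive response. The tyinitial option itself uses the automatic robust procedure of Riani et al. (2022) and the extended Yeo-Johnson family when min(y) ≤ 0, neither of which is reproduced here; the names are illustrative.

```python
import numpy as np

def boxcox_loglik(y, lam):
    """Profile log-likelihood of the Box-Cox parameter lam (requires y > 0)."""
    n = len(y)
    z = np.log(y) if lam == 0 else (y ** lam - 1.0) / lam
    return -0.5 * n * np.log(z.var()) + (lam - 1.0) * np.log(y).sum()

def initial_transformation(y, grid=(-1.0, -0.5, 0.0, 1/3, 0.5, 1.0)):
    """Pick the best transformation parameter over a coarse grid."""
    lls = [boxcox_loglik(y, lam) for lam in grid]
    return grid[int(np.argmax(lls))]

rng = np.random.default_rng(2)
y = np.exp(rng.normal(size=100))        # a response that the log transformation normalizes
print(initial_transformation(y))        # typically 0.0 for this example
```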
4.1.2 Ordering Explanatory Variables in Backfitting: Option Scail
To avoid dependence of the fitted model on the order of the explanatory variables, one approach is to use an initial regression to remove the effect through scaling (Breiman 1988). With $b_j$ the coefficient of $f_j(x_j)$ in the multiple regression of g(y) on all the $f_j(x_j)$, the option scail provides new transformed values for the explanatory variables: $tX_j = b_j f_j(x_j)$, j = 1, ..., p. Option scail is used only in the initialization of the data.
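In code, the rescaling is a single regression step; the sketch below assumes a centred transformed response and a matrix whose columns hold the current f_j(x_j), and is not the MATLAB implementation of the option.

```python
import numpy as np

def scail(gy, F):
    """Rescale transformed explanatory variables tX_j = b_j * f_j(x_j) (sketch).

    gy : (n,) centred transformed response
    F  : (n, p) matrix whose columns are the current f_j(x_j)
    """
    # multiple regression of g(y) on all f_j(x_j); b holds the coefficients b_j
    b, *_ = np.linalg.lstsq(F, gy, rcond=None)
    return F * b                         # column j becomes b_j * f_j(x_j)
```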
4.1.3 Robust Regression and Robust Outlier Detection: Option Rob
We robustify our method through the use of robust regression as described in Sect. 3. The subset Sm , changing at each iteration, defines the observations used in backfitting and in the calculation of the variance stabilizing transformation.
4.1.4 Ordering Predictor Variables: Option OrderR2

For complete elimination of dependence on the order of the variables, we include an option that, at each iteration, provides an ordering based on the variable which produces the highest increment of $R^2$. With this option the most relevant features are immediately transformed and those that are perhaps irrelevant will be transformed in the final positions. For robust estimation, this procedure is applied solely to the observations in the subset $S_m$. Option orderR2 is available at each call to the backfitting function.
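A plain least-squares sketch of this greedy ordering is shown below: at each step the variable giving the largest increase in $R^2$ is selected next. In RAVAS the same idea is applied only to the observations in $S_m$ and within backfitting; the function and data below are illustrative.

```python
import numpy as np

def order_by_r2(y, X):
    """Order the columns of X by successive increments of R^2 (sketch)."""
    n, p = X.shape
    remaining, selected = list(range(p)), []
    while remaining:
        best_j, best_r2 = None, -np.inf
        for j in remaining:
            Xj = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(Xj, y, rcond=None)
            r2 = 1.0 - (y - Xj @ beta).var() / y.var()
            if r2 > best_r2:
                best_j, best_r2 = j, r2
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 2] + 0.5 * X[:, 0] + rng.normal(size=100)
print(order_by_r2(y, X))    # the most relevant variable (index 2) comes first
```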
4.2 Outer Loop

4.2.1 Numerical Variance Stabilizing Transformation: Option Trapezoid
Plots of residuals against fitted values are widely used in regression analysis to check the assumption of constant variance. Here the observations have been transformed, so the fitted values are $\widehat{ty}_i$. To estimate the variance stabilizing transformation, the fitted values have to be sorted, giving a vector of ordered values $\widehat{ty}_s$. The residuals are ordered in the same way and, following the procedure of Sect. 2.4, provide estimates $v_i$ of the integrand $V^{-0.5}(y)$ in (4). The $v_i$ provide estimates at the ordered points $\widehat{ty}_{is}$. Calculation of the variance transformation (4) is, however, for sorted observed responses $ty_{is}$, rather than for the fitted, transformed responses $\widehat{ty}_{is}$. As did Tibshirani, we use the trapezoidal rule to approximate the integral. Linear interpolation and extrapolation are used in the calculation of the $v_i$ at the $ty_{is}$. We provide an option 'trapezoid' for the choice between two methods for the extrapolation of the variance function estimate, the interpolation method remaining unchanged. Our approach leads to trapezoidal summands in the approximation to the integral for the extrapolated elements, whereas Tibshirani's leads to rectangular elements. When we are concerned with robust inference, there are only m = n − k members of $\widehat{ty}_s$ whereas there are n values of $ty_{is}$, so that robustness increases the effect of the difference between the two rules. The option trapezoid = false uses rectangular elements in extrapolation.
5 Simulations
We now use simulations to compare overall properties of AVAS with our robust version. The model was linear regression with data generated to have an average value of $R^2$ of 0.8. The responses were standardized to have zero mean and unit variance; 10% of the observations were contaminated by a shift δ and the responses were exponentiated. There were 1,000 simulations for n = 200 and n = 1,000, and 200 simulations for n = 10,000. We encountered no numerical problems in the simulations. The figures compare the performances of RAVAS (with all options) and standard AVAS (with no options). Results for RAVAS use a dashed line. The left-hand panels of Fig. 1 show the mean squared error of the parameter estimates in the linear models. For RAVAS, those for n = 200 and 1,000 exhibit a slight increase for moderately small values of δ, which then decreases to be close to zero as δ increases and the outliers become easier to detect. That for n = 10,000 is virtually constant. The results for AVAS rapidly become much larger. The right-hand column shows the average power, that is, the proportion of generated outliers that are detected by RAVAS. This climbs, in all cases, steadily to one. Of course, AVAS does not detect outliers. We also compared the number of iterations to convergence of the algorithms; the default maximum is 20. Figure 2 shows results for the same simulations as above. The three panels show that RAVAS converges in around 3 iterations, except for n = 200, when there is a peak around δ = 2, that is, when the outliers are large enough to have
an effect, but are still difficult to detect. This behavior is distinct from that of AVAS, where the number of iterations increases steadily both with δ and with the sample size.

Fig. 1 Mean squared error (MSE) and average power against shift contamination for AVAS and RAVAS. Top panels n = 200, p = 5, mid panels n = 1,000, p = 10, bottom panels n = 10,000, p = 20

Fig. 2 Average number of iterations to convergence for AVAS and RAVAS against shift contamination. Left-hand panel n = 200, p = 5, central panel n = 1,000, p = 10, right-hand panel n = 10,000, p = 20
6 The Generalized Star Plot

We have added five options to the original AVAS. There are therefore 32 combinations of options that could be chosen. It is not obvious that all will be necessary when analyzing any particular set of data. Our program provides flexibility in the assessment of these options. One possibility is a list of options ordered by, for example, the value of $R^2$ or the significance of the Durbin-Watson test. In this section we describe the augmented star plot, one graphical method for visualizing interesting combinations of options in a particular data analysis. An example is Fig. 3. We remove all analyses for which the residuals fail the Durbin-Watson test of independence and the Jarque-Bera normality test (Jarque and Bera 1987), at the 10 per cent level (two-sided for Durbin-Watson). The threshold of 10% can be optionally changed. We order the remaining, admissible, solutions by the Durbin-Watson significance level multiplied by the value of $R^2$ and by the number of units not declared as outliers. Other options are available. The lengths of rays in individual panels of
the plot are equal for those features used in an analysis. All rays are in identical places in each panel of the plot; the length of the rays for each analysis is proportional to $p_{DW}$, the significance level of the Durbin-Watson test. The ordering in which the five options are displayed in the plot depends on the frequency of their presence in the set of admissible solutions. For example, if robustness is the option with the highest frequency, its ray is shown on the right. The remaining options are displayed counterclockwise, in order of frequency.

Fig. 3 Weight of fish. Augmented star plot of six options. Option 1 excludes trapezoid
7 Prediction of the Weight of Fish

There are two websites, https://www.kaggle.com/aungpyaeap/fish-market and http://jse.amstat.org/datasets/fishcatch.txt, which present data on the weight of 159 fish caught in a lake near Tampere, Finland. Interest is in the relationship between weight and five measurements of the dimensions of the fish. There are 7 species of fish, including pike, which behave rather differently from the other six species, so we ignore them. We use the first three lengths, for which the remaining fish seem homogeneous. This assumption will be tested by our robust analysis if one or more species are identified as outliers. The variables are:

y    Weight of the fish (in grams)
x1   Length from the nose to the beginning of the tail (in cm)
x2   Length from the nose to the notch of the tail (in cm)
x3   Length from the nose to the end of the tail (in cm).
Fig. 4 Weight of fish. Heatmap of pairwise response correlations among the six solutions
After the deletion of the data on pike, 142 observations remain. Scatter plots of the response against the three explanatory variables reveal that all three lengths are highly correlated with the response, as they are with each other. It is reasonable to assume that weight increases with each of the explanatory variables. We therefore impose a monotonicity constraint on the transformations of the $x_j$. However, multiple regression with highly correlated explanatory variables can lead to problems in interpretation, such as estimated effects having a physically incorrect sign. The augmented star plot for these data is in Fig. 3. There are six combinations of options that satisfy the constraints on the distribution of residuals. The first solution, with an $R^2$ of 0.991, uses all five options except trapezoid. Robustness is used in all succeeding selections, giving $R^2$ values of 0.988 or 0.983. The heatmap of the response correlations between the pairs of solutions is in Fig. 4. This shows that the first three solutions are strongly correlated with each other, as they are with the fifth and sixth solutions, the fifth and sixth solutions themselves having a very high correlation of 0.998. The heatmap emphasizes that solution four is appreciably different from the other five. We now consider the adaptive identification of outliers using the FS. The first solution identifies three outliers. The left-hand panel of Fig. 5 shows that the response has been smoothly transformed. The plot of residuals against fitted values in the right-hand panel shows that there is only one remote outlier and that there is no remaining structure in the residuals. The plots of transformed explanatory variables (not given here) show that $f(x_1)$ is decreasing and slightly curved. The other two functions are increasing, but only that for $x_2$ is almost straight, with slight curvature for the lowest values of the variable.
Fig. 5 Weight of fish. Left-hand panel, transformed y against y; right-hand panel, residuals against fitted values. Three outliers in red in the online version
Fig. 6 Weight of fish. Non-parametric transformation of the response y compared to $y^{1/3}$. Left-hand panel, three explanatory variables; right-hand panel, only x1. Three and one outliers in black and red in the online version
The interpretation of the results from fitting three explanatory variables is that the variables are too highly correlated to give individually meaningful results. In our final analysis of the data we used only $x_1$. The star plot showed that the best selection included all options except orderR2, an option that is not possible with a single explanatory variable. The value of $R^2$ for this fit is 0.980 with the deletion of a single outlier. The three acceptable solutions had mutual fitted response correlations of 0.9994 or 1; the fitted model was stable to the choice of options. In regressing volume on measurements of length, arguments from dimensional analysis suggest that volume should have a one-third transformation. Our final plot, Fig. 6, compares the transformed responses from the fits with three and one explanatory variables to $y^{1/3}$, for which transformation the value of $R^2$ is 0.968. The figure shows that both non-parametric transformations are indeed close to $y^{1/3}$, with a small systematic departure for the largest values of x. The fitted values from the single explanatory variable follow the power transformation slightly more closely than those when three variables are used. The transformation of $x_1$ is virtually straight with some curvature for large values. The flexibility of the non-parametric transformation provides an improved simple model compared with regression on untransformed $x_1$.
Acknowledgements We are very grateful to the editors and referees, whose comments greatly helped us to clarify the presentation of our work. Our research has benefited from the High Performance Computing (HPC) facility of the University of Parma. We acknowledge financial support from the University of Parma project “Robust statistical methods for the detection of frauds and anomalies in complex and heterogeneous data,” and the Project ECS00000033 “Ecosystem for Sustainable Transition in Emilia-Romagna”.
References

Atkinson, A. C., Riani, M., & Cerioli, A. (2010). The forward search: Theory and data analysis (with discussion). Journal of the Korean Statistical Society, 39, 117–134. https://doi.org/10.1016/j.jkss.2010.02.007
Atkinson, A. C., Riani, M., & Corbellini, A. (2020). The analysis of transformations for profit-and-loss data. Applied Statistics, 69, 251–275. https://doi.org/10.1111/rssc.12389
Barlow, R. E., Bartholomew, D. J., Bremner, J. M., & Brunk, H. D. (1972). Statistical inference under order restrictions. Chichester: Wiley.
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B, 26, 211–252.
Box, G. E. P., & Tidwell, P. W. (1962). Transformations of the independent variables. Technometrics, 4, 531–550.
Breiman, L. (1988). Comment on "Monotone regression splines in action" (Ramsey, 1988). Statistical Science, 3, 442–445.
Buja, A., Hastie, T., & Tibshirani, R. (1989). Linear smoothers and additive models. Annals of Statistics, 17, 453–510.
Friedman, J., & Stuetzle, W. (1982). Smoothing of scatterplots. Technical Report ORION 003, Department of Statistics, Stanford University.
Hampel, F. R. (1975). Beyond location parameters: Robust concepts and methods. Bulletin of the International Statistical Institute, 46, 375–382.
Hastie, T., & Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1, 297–318.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models. London: Chapman and Hall.
Jarque, C. M., & Bera, A. K. (1987). A test for normality of observations and regression residuals. International Statistical Review, 52, 163–172.
Riani, M., Atkinson, A. C., & Cerioli, A. (2009). Finding an unknown number of multivariate outliers. Journal of the Royal Statistical Society, Series B, 71, 447–466.
Riani, M., Atkinson, A. C., & Corbellini, A. (2022). Automatic robust Box-Cox and extended Yeo-Johnson transformations in regression. Statistical Methods and Applications. https://doi.org/10.1007/s10260-022-00640-7
Riani, M., Atkinson, A. C., & Corbellini, A. (2023). Robust transformations for multiple regression via additivity and variance stabilization. Journal of Computational and Graphical Statistics. https://doi.org/10.1080/10618600.2023.2205447
Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79, 871–880.
Tibshirani, R. (1988). Estimating transformations for regression via additivity and variance stabilization. Journal of the American Statistical Association, 83, 394–405.
Torti, F., Corbellini, A., & Atkinson, A. C. (2021). fsdaSAS: A package for robust regression for very large datasets including the Batch Forward Search. Stats, 4, 327–347.
A Random-Coefficients Analysis with a Multivariate Random-Coefficients Linear Model

Laura Marcis, Maria Chiara Pagliarella, and Renato Salvatore
Abstract Random-coefficients linear models can be considered as a particular case of linear mixed models. Different sources of variation are treated by random effects, which depend on some specific model design matrices. A redundancy analysis of the estimates of the multivariate random effects may be able to capture the leading contribution to the covariance between the observed responses and the model covariates. We introduce the reduced space of the random effects by a weighted least-squares closed-form solution, starting from the standardized multivariate best linear predictors. The application shows the effect of the linear dependence of the random effects in the space of the model covariates.

Keywords Random-coefficients model · Multivariate linear mixed model · Best linear unbiased predictor · Redundancy analysis
1 Introduction

In standard linear regression models, a fundamental assumption is the independence of the observations. This condition may be relaxed when clusters of units need to be explained as following a common behavior inside a linear relationship. Usually, in the context of the linear mixed model (Demidenko 2004), it is possible to manage studies involving subjects naturally bound together. General linear models with clustered data are widely employed, for example, in longitudinal data analysis (Hsiao and Pesaran 2008).
Random-coefficients linear regression models (RCM) (Longford 2011) represent a special class of linear mixed models, where the vector of regression coefficients for the subjects (e.g., repeated observations) is modeled in a second-stage linear regression equation. In order to specify this type of model, it is convenient to define a two-stage hierarchical linear model, with a first stage that models within-subject observations, and a second stage with a linear model for the random regression coefficients. Unlike the basic mixed model, in which random effects are not correlated with the response variable at the population level, in the RCM this correlation depends on the fixed-effects design matrix of the regression model. Therefore, in the RCM the correlation structure is more complicated (with respect to the simpler random intercept model) because it depends on the covariate values. Since this happens, this paper aims to understand which random effect is most affected by the presence of the covariates and, at the same time, which random effect is "essentially random", i.e., orthogonal to the subspace spanned by the model covariates. We propose to achieve this result using an approach based on Redundancy Analysis (RDA) (Van Den Wollenberg 1977), because RDA provides a constrained analysis of the whole set of linear relations between the two sets of variables (Takane and Hwang 2007), and another, unconstrained, analysis given by the set of multivariate regression residuals (Härdle and Simar 2015). Some studies highlight that an RDA of the criterion variables predicted by the best linear unbiased predictor may be quite representative (Marcis and Salvatore 2020). This paper uses an RDA based on a least-squares solution from the data provided by the random-coefficients predictors of the criterion variables. The application study applies the proposed method to official data on the Italian Equitable and Sustainable Well-being indicators.
2 The Model and the Analysis of the Random Coefficients

Given a q-variate random vector y, consider the case when a matrix Y of repeated observations from y is partitioned into m subjects (groups), each of them with $n_i$ individuals (i = 1, ..., m; j = 1, ..., $n_i$). We assume that the general population model for the m subjects is $y_i = B x_i + A_i z_i$, where B is the q × p matrix of fixed regression coefficients. The $A_i$ is a q × r matrix of q-variate r-dimensional vectors of random effects, with $a_i = vec(A_i) \sim N(0, \Sigma_a)$, $\Sigma_a = cov(vec(A_i))$, whose elements are $\Sigma_{a,qq}$, with $\Sigma_{a,qq} = cov(vec(A_{i,qq}))$, the r × r blocks of $\Sigma_a$. When r = p, the population model is a multivariate RCM, with $z_i = x_i$. Given a sample of $N = \sum_{ij} n_{ij}$ units (e.g., repeated measurements), the model structure is (with matrix dimensions reported as subscripts):

$$Y_{N \times q} = X^{\otimes}_{N \times p} B_{p \times q} + Z^{\otimes}_{N \times pm} A_{pm \times q} + E_{N \times q}, \qquad (1)$$
with X⊗ the matrix of data covariates, Z⊗ the design matrix of random effects (the exponent ⊗, the symbol usually used for the Kronecker product, is to differentiate the
matrices just introduced from those that will be introduced below in order to rewrite the model in vector form) and E the matrix of regression within-subject errors, with vec(E) ∼ N(0, R). In this article, we assume both Y and $X^{\otimes}$ to be columnwise centered and standardized. With ⊗ the Kronecker product and cov(y*) the model covariance, we rewrite the last model in vector form, with y* = vec(Y), $X = (I \otimes X^{\otimes})$, β = vec(B), $Za = vec(ZA) = (I \otimes Z^{\otimes})vec(A)$. The components of the model covariance $\Sigma_a = \Sigma_a(\theta)$ depend on a multivariate vector θ that accounts for the variance-covariance parameters to be estimated. Maximum and restricted maximum likelihood procedures may be employed for the estimation in the case of normality, as well as other general methods, like the method of moments. Some studies report the effectiveness of the method of moments as the most common restricted likelihood estimation (see again Demidenko 2004), which is the one used in this work. By the following Theorem 1 and Proposition 1, we extend some properties of the univariate RCM to the multivariate RCM (the proofs are given in the Appendix).

Theorem 1 (Conditional distribution of the multivariate random effects) Given the multivariate RCM in (1), with
$$a \sim N(0, D), \qquad y^* \sim N(X\beta,\, ZDZ' + R), \qquad \alpha = (a',\, y^{*\prime})',$$
$$cov(\alpha) = \Sigma = \begin{pmatrix} \Sigma_{a,a} & \Sigma_{a,y^*} \\ \Sigma_{y^*,a} & \Sigma_{y^*,y^*} \end{pmatrix}, \qquad cov(a, y^*) = \Sigma_{a,y^*} = \Sigma_{y^*,a}' = D[I \otimes (Z^{\otimes})'],$$

then the conditional distribution of a given y* is

$$a \,|\, y^* \sim N\big[\Sigma_{a,y^*}\Sigma_{y^*,y^*}^{-1}(y^* - X\beta),\; \Sigma_{a,a} - \Sigma_{a,y^*}\Sigma_{y^*,y^*}^{-1}\Sigma_{y^*,a}\big].$$
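As an illustrative numerical check of this conditional mean, the sketch below simulates a small random-coefficients model with a single response (the q = 1 special case) and computes E(a|y) = DZ′(ZDZ′ + R)⁻¹(y − Xβ). It is not the authors' SAS/IML code; all dimensions and names are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n_i, p = 5, 10, 2                       # subjects, units per subject, random coefficients
Xi = [np.column_stack([np.ones(n_i), rng.normal(size=n_i)]) for _ in range(m)]
X = np.vstack(Xi)                          # fixed-effects design
Z = np.zeros((m * n_i, m * p))             # block-diagonal random-effects design, Z = diag(X_1, ..., X_m)
for i in range(m):
    Z[i * n_i:(i + 1) * n_i, i * p:(i + 1) * p] = Xi[i]

Sigma_a = np.array([[0.5, 0.1], [0.1, 0.3]])   # covariance of each subject's random coefficients
D = np.kron(np.eye(m), Sigma_a)                # covariance of the stacked random effects a
R = 0.2 * np.eye(m * n_i)                      # within-subject error covariance
beta = np.array([1.0, 2.0])

a = rng.multivariate_normal(np.zeros(m * p), D)
y = X @ beta + Z @ a + rng.multivariate_normal(np.zeros(m * n_i), R)

# conditional mean of a given y: E(a|y) = Sigma_{a,y} Sigma_{y,y}^{-1} (y - X beta), Sigma_{a,y} = D Z'
Vy = Z @ D @ Z.T + R
a_hat = D @ Z.T @ np.linalg.solve(Vy, y - X @ beta)
print(np.round(np.corrcoef(a, a_hat)[0, 1], 2))   # predicted effects track the simulated ones
```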
Proposition 1 (Component estimates of the multivariate RCM) Given the multivariate RCM in (1), with $a \sim N(0, D)$, $y^* \sim N(X\beta, ZDZ' + R)$, $Z^{\otimes} = I \otimes X_d$, $X_d = diag(X_1, \ldots, X_m)$, $cov(vec(A_i)) = \Sigma_a(\theta)$, and an estimate $\widehat{\theta}$, the model estimates of $\widehat{\beta} - \beta$ and $Z(\widehat{a} - a)$ have the covariance matrix $\Gamma$, with elements
$$\Gamma_{11} = cov(\widehat{\beta} - \beta) = cov(\widehat{\beta}) = [X'(ZDZ' + R)^{-1}X]^{-1},$$
$$\Gamma_{12} = \Gamma_{21}' = cov[\widehat{\beta},\, Z(\widehat{a} - a)] = -[X'cov(y^*)^{-1}X]^{-1}X'cov(y^*)^{-1}(I \otimes X_d)D(I \otimes X_d)',$$
$$\Gamma_{22} = cov[Z(\widehat{a} - a)] = ZDZ' - ZDZ'\,P\,ZDZ' = (I \otimes X_d)D(I \otimes X_d)' - (I \otimes X_d)D(I \otimes X_d)'\, P\,(I \otimes X_d)D(I \otimes X_d)'.$$
The matrix P represents the projection matrix onto the complement of the column space of X in the metric of $(ZDZ' + R)^{-1}$. The multivariate best linear predictor $\widehat{Y}$ is given by the best linear predictor $\widehat{y}^* = vec(\widehat{Y}) = X\widehat{\beta} + Z\widehat{a}$, with $\widehat{a} = E(a \,|\, y)$. Here $Z\widehat{a} = ZDZ'\,cov(y^*)^{-1}(y^* - X\widehat{\beta})$, $\widehat{\beta} = \widehat{\beta}_{gls}$, $\widehat{y}^* = \Lambda y^* + (I - \Lambda)X\widehat{\beta}$, $\Lambda = ZDZ'\,cov(y^*)^{-1}$, $cov(y^*) = ZDZ' + R$, and $D = \{D_{qq}\}$, with $D_{qq} = \Sigma_{a,qq} \otimes I_m$. Given $y_{qi}$, for the RCM as a special case of the general linear mixed model, we have that $cov(a_{qi}, y_{qi}) = DZ_i' = DX_{qi}'$. Thus, for the RCM in (1) (with r = p):
$$cov(a, y^*) = E(a\, y^{*\prime}) = cov(a,\, a'Z') = DZ' = D(I \otimes Z^{\otimes})' = D[I \otimes (Z^{\otimes})'] = D(I \otimes X_d)'.$$

For the sake of clarity, it should be noted that $\widehat{a}$ depends on the matrix of fixed effects $X^{\otimes}$ and that, unlike the classic mixed model, in which $cov(a, y^*) = DZ'$, in this case $cov(a, y^*) = D(I \otimes X_d)'$; that is, the random effects are affected by the covariates. In a similar way to RDA, we can achieve our goal by evaluating which random effects are more influenced by the model covariates (and, as a consequence, which random effects are not). In standard RDA two subspaces are obtained: one constrained, from the Singular Value Decomposition (SVD) of the predicted values, $\widehat{Y} = X\widehat{B} = U_{\widehat{Y}}\Delta_{\widehat{Y}}V_{\widehat{Y}}'$, and another unconstrained, from the SVD of the residual values. Our new step is as follows: after estimating the vector of the multivariate covariance parameters $\theta = \widehat{\theta}$ and the model estimates of fixed and random effects $\widehat{Y} = X^{\otimes}\widehat{B}(\widehat{\theta}) + Z^{\otimes}\widehat{A}(\widehat{\theta})$, we want to use the projection of the centered and standardized ratio $\phi = cov(\widehat{y})^{-1/2}[\widehat{y} - E(\widehat{y})] = cov(\widehat{y})^{-1/2}\widehat{y}^{**}$ of the multivariate predictor $\widehat{y}^*$ in a common reduced covariate subspace. In particular, assuming both Y and $X^{\otimes}$ columnwise centered and standardized, and a balanced design, we get in the general RCM $cov(a_{qi}, y_{qi}) = DZ_i' = DX_{qi}'$. Then, we can compute φ as indicated above, noting that $E(\widehat{y}) = E(y) = X\beta$ and $\widehat{y} - E(\widehat{y}) = \widehat{y}^* - X\widehat{\beta} = Z\widehat{a} - E(Z\widehat{a}) = Z\widehat{a}$.
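For reference, a minimal sketch of the standard RDA decomposition mentioned above is shown below: the constrained subspace comes from the SVD of the fitted values of the multivariate regression of Y on X, and the unconstrained one from the SVD of the residuals. The paper's extension replaces Y by the standardized random-coefficient predictors; the data and names here are illustrative only.

```python
import numpy as np

def rda(Y, X):
    """Standard redundancy analysis via SVD of fitted and residual matrices (sketch)."""
    Y = (Y - Y.mean(0)) / Y.std(0)      # columnwise centring and standardization
    X = (X - X.mean(0)) / X.std(0)
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)      # multivariate regression coefficients
    fitted, resid = X @ B, Y - X @ B
    constrained = np.linalg.svd(fitted, full_matrices=False)    # U, singular values, V'
    unconstrained = np.linalg.svd(resid, full_matrices=False)
    return constrained, unconstrained

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 1))                        # a single covariate, as with LBE1 below
Y = X @ rng.normal(size=(1, 5)) + rng.normal(scale=0.5, size=(80, 5))
constrained, unconstrained = rda(Y, X)
print(np.round(constrained[1], 2))                  # singular values of the constrained part
```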
Therefore, $cov(\widehat{y}^*)^{-1/2}Z\widehat{a} = cov(\widehat{y})^{-1/2}\widehat{y}^{**} = \phi$. In certain cases, due to the unbiasedness of the model covariance parameter estimates, the expression $cov(\widehat{y}^*)$ may be substituted by the mean squared error $mse(\widehat{y}^*)$. Consequently, $mse(\widehat{y}^*)^{-1/2}$ represents the inverse matrix of the root mean squared error of the multivariate predictor. We may adopt two different methods to achieve a numerical evaluation of the statistic $\phi = cov(\widehat{y})^{-1/2}\widehat{y}^{**}$ or, simplifying, $cov(\widehat{y}^*)^{-1/2}Z\widehat{a}$:

(a) First method: compute $M = vecdiag[cov(\widehat{y}^*)^{-1}] = I_N \circ cov(\widehat{y}^*)^{-1}$, the matrix of diagonal elements of $cov(\widehat{y}^*)^{-1}$ ($\circ$ is the Hadamard product). Then compute $H = [I_N \circ (1_N' \otimes Z\widehat{a})]$, the diagonal matrix with $Z\widehat{a}$ as the vector of its diagonal elements, and (by the $vec^{-1}$ operator) we get $Z\widehat{a} \times cov(\widehat{y}^*)^{-1/2} = HM^{-1/2}$ and $\widehat{y}^* = HM^{-1/2}1_N$. Finally, $\widehat{Y}^* = vec^{-1}(\widehat{y}^*) = ((vec\, I_q)' \otimes I_N)(I_q \otimes \widehat{y}^*) = ((vec\, I_q)' \otimes I_N)(I_q \otimes HM^{-1/2}1_N)$.

(b) Second method: compute $M^{-1/2} = \{I_N \circ cov[\widehat{y}^*]\}^{-1/2}$. Then, with $\tau = \{I_N \circ cov(\widehat{y}^*)\}^{-1/2}1_N$ the vector of the inverse diagonal elements of $cov(\widehat{y}^*)$, and $T = vec^{-1}(\tau) = ((vec\, I_q)' \otimes I_N)(I_q \otimes \tau)$ the matrix of elementwise variances of $\widehat{Y}$, we get $\widehat{Y} = \widehat{Y} \circ T$.
1
Since we are interested in the simultaneous representation of all the predicted ai , given by a common projection subspace, we may also proceed following an alternative method, respect to those under methods (a) and (b), to explore the ratio φ. (c) Third method: Find the minimum Frobenius norm from the multivariate predictor Y, as explained below. The minimum Frobenius norm from the multivariate predictor Y is found by the difference 1 y)− 2 − X⊗ B, Y∗∗ = Y − E( Y) = Y − 1 N E( y ), i.e., = Y∗∗ var( 2 ∗∗ − 21 2 = trace( ) = Y − X⊗ B = min .
We assume ∗ ∗ yq − yq∗ )( yq∗ − yq∗ ) , cov( yqq ) = E ( 1 ∗ gtrace[cov( yqq = )] , and N ∗ ∗ ∗ yq − yq∗ ) ( yq − yq∗ )( yq∗ − yq∗ ) ] = E ( yq∗ − yq∗ ) . gtrace[cov( yqq )] = trace[E ( ∗ Here gtrace[cov( yqq )] is the generalized trace operator, that gives a matrix with elements as traces of submatrices of a given square matrix (Timm 2022). Now, setting
166
L. Marcis et al.
φ = vec( ) = ( − 2 ⊗ I N ) y∗ − (Iq ⊗ X⊗ )β = 1
β = vec(B),
− 21
= (
− 21
− 21 ∗
y − Xβ,
⊗ I N ),
we come to the following properties of β: −1
−1 ∗ y∗ − Xβ) ( 2 y − Xβ) trace(φ φ) = trace ( 2 −1
y∗ − Xβ), = ( y∗ − Xβ) ( 1 −1 −1 ∗ where X = 2 X. Thus, β = (X X)−1 X y is the q-variate vector in the y∗ orthogonal to the columns subspace spanned by the columns of the matrix X, with −1 −1 −1 ∗ −1 of X in the metrics of , y x = 0. Then, PX = X(X X)−1 X is the Y= projection matrix of the predictor y onto the joint subspace by X. The SVD of ⊗ Y , i.e., V , X β gives the common rescaled predictor’s coordinates of Y = U Y Y Y Y V −1 contains the row joint reduced coordinates of further noticing that U =
Y
Y
Y
Y. Y in the space of Finally, to give the exact number of the principal coordinates by the number of non-zero eigenvalues when projecting the matrix in the two different subspaces, depending or not from the design matrix X, we establish the following Theorem on the rank of (the proof is given in the Appendix). Theorem 2 Rank of Let i a n i × p matrix of n i repeated observations from the i th subject (i = m n i = n) of the p-variate random vector φ i j , where the j th row vector 1, ..., m, i=1 ( j = 1, ..., n i ) of i has the linear structure φ i j = B xi j + α i + ηi j , where α i ∼ ind(0, α ), ηi j ∼ ind(0, σ 2 ), vec(ηi ) ∼ (0, σ 2 Ini × p ), cov(α i , ηi ) = 0. B is a h × p matrix of fixed effects in the linear model, and α i is a vector of p-dimensional random effects. Let also = [ 1 , 2 , ..., m ] , modeled as = XB + Zα + η, Z = diag(1n 1 , ..., 1n m ), cov(vec(α)) = α ⊗ Im . Then, for n p h, and p > h + m, we have the following rank r of the model components: (a) (b) (c) (d)
r(XB) = h r(Zα) = m − 1, r(XB) + r(Zα) = h + m − 1, r( ) = r(η) = p.
If p ≤ h + m, we have that: (e) r(XB) + r(Zα) = p.
3 Application Study

In accordance with the recent law reforms in Italy, the Equitable and Sustainable Well-being indicators (in Italian, BES), annually provided by the Italian Statistical Institute (ISTAT, 2017), are designed to inform the economic policies which largely act on some fundamental aspects of the quality of life. In order to highlight the results of the proposed method, we use 12 BES indicators relating to the years 2013–2016, collected at NUTS-2 (Nomenclature of Territorial Units for Statistics-2) level. The variables employed in the application study are in Table 1. We use the per capita adjusted disposable income variable (its logarithm, as is usually done in economic studies), indicated with LBE1, as the unique covariate for the RCM, while the remaining 11 variables are dependent variables (please refer to Table 1 for the descriptions and acronyms used for the variables). The application uses restricted maximum likelihood estimation and a SAS/IML code. We adopted the estimation method (a) discussed above. We propose to estimate the model under a uniform correlation structure among the multivariate components of the random effects. This structure is equivalent to the compound-symmetry covariance structure, with better numerical properties in terms of optimization. Further, some studies (Ledoit and Wolf 2004) highlight that using uniform correlation matrices reduces the estimation noise. The model covariance matrix D of random effects is then a
Table 1 Description of the variables used for the application

Variable   Description
S8         Age-standardized mortality rate for dementia and nervous system diseases
IF3        People having completed tertiary education (30–34 years old)
L12        Share of employed persons who feel satisfied with their work
REL4       Social participation
POL5       Trust in other institutions like the police and the fire brigade
SIC1       Homicide rate
BS3        Positive judgement for future perspectives
PATR9      Presence of Historic Parks/Gardens and other Urban Parks recognized of significant public interest
AMB9       Satisfaction for the environment (air, water, noise)
INN1       Percentage of R&D expenditure on GDP
Q2         Children who benefited of early childhood services
BE1        Per capita adjusted disposable income
LBE1       Logarithm of per capita adjusted disposable income
Table 2 The slope parameters from the multivariate regression with the LBE1 covariate

Dependent variable   Slope parameter (LBE1)   Std. error   t        Pr > |t|
AMB9                  0.9802                   0.3255       3.01     0.0035
BS3                   0.9330                   0.0891      10.47     0.0001
IF3                  −0.3166                   0.1673      −1.89     0.0621
INN1                 −0.0433                   0.0170      −2.54     0.0130
L12                   0.0016                   0.0107       0.15     0.8786
PATR9                 0.0975                   0.0756       1.29     0.2007
POL5                 −0.0036                   0.0085      −0.42     0.6775
Q2                    0.2031                   0.1762       1.15     0.2526
REL4                  0.5602                   0.1690       3.31     0.0014
S8                   −0.0506                   0.0293      −1.73     0.0879
SIC1                 −0.0072                   0.0150      −0.48     0.6314
generalized uniform correlation matrix, $D = \Sigma_a \boxtimes I_m$. The symbol $\boxtimes$ is the Tracy-Singh product operator, which represents a Kronecker product structure applied to matrix blocks instead of their elements (Ploymukda et al. 2018). The multivariate model is based on two covariance parameters, $\sigma_a^2$ and $\rho_a$, that jointly represent the variances, covariances and correlations within and between the random effects of the multivariate dependent model components. Analytically, with the symbols defined in the previous sections, the covariance matrices are

$$\Sigma_a = I_q \otimes \Sigma_{a,q} + (1_q 1_q' - I_q) \otimes \Sigma_{a,qq} = I_q \otimes (\Sigma_{a,q} - \Sigma_{a,qq}) + (1_q 1_q') \otimes \Sigma_{a,qq},$$
$$\Sigma_{a,qq} = \sigma_a^2 \rho_a 1_r 1_r', \qquad \Sigma_{a,q} = \sigma_a^2[(1 - \rho_a)I_r + \rho_a 1_r 1_r'] = \sigma_a^2(1 - \rho_a)I_r + \Sigma_{a,qq}.$$

The restricted maximum likelihood estimates of the covariance parameters are $\widehat{\sigma}_a^2 = 7.315$ and $\widehat{\rho}_a = 0.083$. Table 2 shows the slope parameter estimates from the multivariate regression, with their significance levels. Table 3 reports the MANOVA multivariate test statistics, based on the characteristic roots. These are the eigenvalues of the product of the sum-of-squares matrix of the regression model and the inverse of the sum-of-squares matrix of the error. The null hypothesis for each of these tests is the same: the independent variable (LBE1) has no effect on any of the dependent variables. The four tests are all significant. After the application of the multivariate model in (1), we may explore the global set of relations between the variables employed by applying the redundancy analysis of $\Phi$. Figures 1 and 2 give the two independent subspaces obtained by projecting the matrix $\Phi$ onto the space of the model (1) and onto its orthogonal complement.
Table 3 MANOVA test criteria and F approximations for the hypothesis of no overall LBE1 effect

Statistic                Value      F value   Num DF   Den DF
Wilks' lambda             0.0566    102.98    11       68
Pillai's trace            0.9434    102.98    11       68
Hotelling–Lawley trace   16.6590    102.98    11       68
Roy's largest root       16.6590    102.98    11       68