Statistical Causal Discovery: LiNGAM Approach (JSS Research Series in Statistics) 4431557830, 9784431557838

This is the first book to provide a comprehensive introduction to a new semiparametric causal discovery approach known as LiNGAM.


English, 108 pages, 2022


Table of contents :
Preface
Contents
Acronyms
1 Introduction
1.1 A Starting Point for Causal Inference
1.2 Framework of Causal Inference
1.3 Identification and Estimation of the Magnitude of Causation
1.4 Identification and Estimation of Causal Structures
1.5 Concluding Remarks
References
Part I Basics of LiNGAM Approach
2 Basic LiNGAM Model
2.1 Independent Component Analysis
2.2 LiNGAM Model
2.3 Identifiability of the LiNGAM model
2.4 Concluding Remarks
References
3 Estimation of the Basic LiNGAM Model
3.1 ICA-Based LiNGAM Algorithm
3.2 DirectLiNGAM Algorithm
3.3 Multigroup Analysis
3.3.1 LiNGAM Model for Multiple Groups
3.3.2 DirectLiNGAM Algorithm for Multiple LiNGAMs
3.4 Concluding Remarks
References
4 Evaluation of Statistical Reliability and Model Assumptions
4.1 Evaluation of Statistical Reliability
4.1.1 A Bootstrap Approach
4.1.2 Bootstrap Probability
4.1.3 Multiscale Bootstrap for LiNGAM
4.2 Evaluation of Model Assumptions
References
Part II Extended Models
5 LiNGAM with Hidden Common Causes
5.1 Identification and Estimation of Causal Structures of Confounded Variables
5.1.1 LiNGAM Model with Hidden Common Causes
5.1.2 Identification Based on Independent Component Analysis
5.1.3 Estimation Based on Independent Component Analysis
5.2 Identification and Estimation of Causal Structures of Unconfounded Variables
5.3 Other Hidden Variable Models
5.3.1 LiNGAM Model for Latent Factors
5.3.2 LiNGAM Model in the Presence of Latent Classes
References
6 Other Extensions
6.1 Cyclic Models
6.2 Time-Series Models
6.3 Nonlinear Models
6.4 Discrete Variable Models
References
Correction to: Statistical Causal Discovery: LiNGAM Approach
Correction to: S. Shimizu, Statistical Causal Discovery: LiNGAM Approach, JSS Research Series in Statistics, https://doi.org/10.1007/978-4-431-55784-5

SpringerBriefs in Statistics JSS Research Series in Statistics Shohei Shimizu

Statistical Causal Discovery: LiNGAM Approach

SpringerBriefs in Statistics

JSS Research Series in Statistics

Editors-in-Chief: Naoto Kunitomo, The Institute of Mathematical Statistics, Tachikawa, Tokyo, Japan; Akimichi Takemura, The Center for Data Science Education and Research, Shiga University, Hikone, Shiga, Japan

Series Editors: Genshiro Kitagawa, Meiji Institute for Advanced Study of Mathematical Sciences, Nakano-ku, Tokyo, Japan; Shigeyuki Matsui, Graduate School of Medicine, Nagoya University, Nagoya, Aichi, Japan; Manabu Iwasaki, School of Data Science, Yokohama City University, Yokohama, Kanagawa, Japan; Yasuhiro Omori, Graduate School of Economics, The University of Tokyo, Bunkyo-ku, Tokyo, Japan; Masafumi Akahira, Institute of Mathematics, University of Tsukuba, Tsukuba, Ibaraki, Japan; Masanobu Taniguchi, School of Fundamental Science and Engineering, Waseda University, Shinjuku-ku, Tokyo, Japan; Hiroe Tsubaki, The Institute of Statistical Mathematics, Tachikawa, Tokyo, Japan; Satoshi Hattori, Faculty of Medicine, Osaka University, Suita, Osaka, Japan; Kosuke Oya, School of Economics, Osaka University, Toyonaka, Osaka, Japan; Taiji Suzuki, School of Engineering, University of Tokyo, Tokyo, Japan

The current research of statistics in Japan has expanded in several directions in line with recent trends in academic activities in the area of statistics and statistical sciences over the globe. The core of these research activities in statistics in Japan has been the Japan Statistical Society (JSS). This society, the oldest and largest academic organization for statistics in Japan, was founded in 1931 by a handful of pioneer statisticians and economists and now has a history of about 80 years. Many distinguished scholars have been members, including the influential statistician Hirotugu Akaike, who was a past president of JSS, and the notable mathematician Kiyosi Itô, who was an earlier member of the Institute of Statistical Mathematics (ISM), which has been a closely related organization since the establishment of ISM. The society has two academic journals: the Journal of the Japan Statistical Society (English Series) and the Journal of the Japan Statistical Society (Japanese Series). The membership of JSS consists of researchers, teachers, and professional statisticians in many different fields including mathematics, statistics, engineering, medical sciences, government statistics, economics, business, psychology, education, and many other natural, biological, and social sciences. The JSS Series of Statistics aims to publish recent results of current research activities in the areas of statistics and statistical sciences in Japan that otherwise would not be available in English; they are complementary to the two JSS academic journals, both English and Japanese. Because the scope of a research paper in academic journals inevitably has become narrowly focused and condensed in recent years, this series is intended to fill the gap between academic research activities and the form of a single academic paper. The series will be of great interest to a wide audience of researchers, teachers, professional statisticians, and graduate students in many countries who are interested in statistics and statistical sciences, in statistical theory, and in various areas of statistical applications.

Shohei Shimizu

Statistical Causal Discovery: LiNGAM Approach

Shohei Shimizu Faculty of Data Science Shiga University and RIKEN Hikone, Shiga, Japan

ISSN 2191-544X ISSN 2191-5458 (electronic) SpringerBriefs in Statistics ISSN 2364-0057 ISSN 2364-0065 (electronic) JSS Research Series in Statistics ISBN 978-4-431-55783-8 ISBN 978-4-431-55784-5 (eBook) https://doi.org/10.1007/978-4-431-55784-5 © The Author(s), under exclusive licence to Springer Japan KK 2022, corrected publication 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Japan KK, part of Springer Nature. The registered company address is: Shiroyama Trust Tower, 4-3-1 Toranomon, Minato-ku, Tokyo 1056005, Japan

Preface

This monograph discusses statistical causal discovery methods for inferring causal relationships from data derived primarily from non-randomized experiments. Specifically, I focus on a non-Gaussian approach named LiNGAM. The LiNGAM approach uses a non-Gaussian data structure for model identification and can identify a much broader range of causal relationships than classic methods. This book aims to provide a concise summary of the basic ideas of the LiNGAM approach. More information on recent advances, applications, and available code packages can be found on the following website: https://www.shimizulab.org/lingam/lingampapers. I would be delighted if readers could get more familiar with the LiNGAM approach and become interested in working in the field of causal discovery.

Hikone and Osaka, Japan
January 2022

Shohei Shimizu

Acknowledgements I am deeply thankful to all my collaborators. I particularly extend my heartfelt gratitude to Aapo Hyvärinen, Patrik O. Hoyer, Takashi Nicholas Maeda, Yan Zeng, Patrick Blöbaum, Tatsuya Tashiro, Takashi Ikeuchi, Keisuke Kiritoshi, Kento Uemura, Hidetoshi Shimodaira, and Yutaka Kano.

The original version of the book was revised: Missing mathematical symbols have been updated throughout. The correction to the book is available at https://doi.org/10.1007/978-4-431-55784-5_7



Acronyms

SCM      Structural causal model
SEM      Structural equation modeling
LiNGAM   Linear non-Gaussian acyclic model
ICA      Independent component analysis
DAG      Directed acyclic graph
xi       i-th observed variable
ei       i-th error variable
fi       i-th hidden variable
si       i-th independent component
x        Observed variable vector
e        Error variable vector
f        Hidden variable vector
s        Independent component vector
X        Data matrix
B        Coefficient matrix for observed variables x
Λ        Coefficient matrix for hidden variables f
A        Mixing matrix in ICA
W        Separating matrix in ICA
P        Permutation matrix
D        Diagonal matrix


Chapter 1

Introduction

1.1 A Starting Point for Causal Inference

Statistical causal inference (or, simply, causal inference) is a methodology for inferring causal relationships using data. Consider an example (Shimizu, 2019, 2020). Reportedly, people with higher levels of sleep disorder have higher levels of depression. For example, an epidemiological study (Raitakari et al., 2008) finds a correlation coefficient of 0.77 between the degrees of sleep disorder and depression, which is quite large (Rosenström et al., 2012). However, it would be hasty to posit a causal relationship merely because of a large correlation coefficient. Figure 1.1 illustrates this point: it shows a case with no causal relationship between the degrees of sleep disorder and depression in which a third variable (known as a common cause or confounding variable) nevertheless induces a correlation between them. Such common causes may include smoking and exercise habits. Other causal structures can likewise induce a correlation between sleep disorder and depression. Figure 1.2 presents two such examples. Unlike in Fig. 1.1, in one case the degree of sleep disorder causes that of depression, and in the other the degree of depression causes that of sleep disorder. Such diagrams of qualitative causal relationships between variables (i.e., causal structures) are called causal graphs. If there is no causal relationship between sleep disorder and depression, as in the causal graph of Fig. 1.1, changing the degree of sleep disorder will not change that of depression, and vice versa. However, if sleep disorder causes depression, as in the causal graph on the left of Fig. 1.2, changing the degree of sleep disorder may change that of depression, but not vice versa. Similarly, if depression causes sleep disorder, as in the causal graph on the right of Fig. 1.2, changing the degree of depression may change that of sleep disorder, but not vice versa.


Fig. 1.1 A common cause or confounding third variable may create a correlation between the degrees of sleep disorder and depression. The arrows indicate the directions of causation

Fig. 1.2 On the left-hand side of this figure, sleep disorder causes depression. On the right-hand side, depression causes sleep disorder

The correlation coefficient between the degree of sleep disorder and that of depression can be 0.77 under any of the three causal structures represented by the causal graphs in Figs. 1.1 and 1.2. The correlation coefficient does not indicate which causal relationship produced the correlation between the degree of sleep disorder and that of depression. Therefore, a correlation does not necessarily mean a causal relationship. The correlation coefficient is a measure of the magnitude of correlation, not of causation. When performing causal analysis and evaluating the magnitude of causation, one must first represent causality mathematically using a causality-specific measure, rather than the correlation coefficient, which is not suited to this purpose. Causal inference researchers have focused on developing such mathematical frameworks for representing causal relationships (Imbens and Rubin, 2015; Pearl, 2000). These frameworks enable one to consider under what conditions or assumptions causal queries can be identified, that is, uniquely estimated from the data. Given that such causal queries are identifiable, it is feasible to discuss estimation methods, assess the statistical reliability of estimation results, and test the validity of assumptions in causal analysis.

1.2 Framework of Causal Inference

Representative frameworks for causal inference include the potential outcome model (Imbens and Rubin, 2015) and the structural causal model (Pearl, 2000). This book introduces causal discovery methods based on the structural causal model, in which causal graphs representing the causal structures of variables appear explicitly. In the structural causal model, structural equations describe how the values of variables are generated via the data-generating process. For example, consider the data-generating process in a population with the causal structure on the left-hand
side of Fig. 1.2. The degree of sleep disorder is denoted by x1, that of depression by x2, and the common cause by z. The data-generating process of the degree of sleep disorder x1 and that of depression x2 can then be written using two functions g1 and g2 as follows:

x1 = g1(z, e1),   (1.1)
x2 = g2(x1, z, e2).   (1.2)

Such equations are called structural equations. They show that the values of the variables on the left-hand side of the equations are determined by those of the variables on the right-hand side. Here, the parent variables of x1 and x2, that is, the variables with arrows pointing to x1 and x2 in the causal graph, appear on the right-hand sides of the structural equations. Further, e1 and e2 on the right-hand sides are unobserved error variables. In the structural equation determining x1 in Eq. (1.1), all variables that determine the value of x1 other than z are lumped together and represented by the error variable e1. The same applies to the error variable e2. For example, an intervention to force the degree of sleep disorder x1 to be a constant c is represented by replacing the structural equation determining x1 (i.e., x1 = g1(z, e1)) with x1 = c. When performing this intervention, the data-generating process becomes the following, in which the structural equation determining x1 in Eq. (1.1) is replaced by x1 = c:

x1 = c,   (1.3)
x2 = g2(x1, z, e2).   (1.4)

Before performing the intervention, the value of the degree of sleep disorder x1 is determined by the values of the common cause z and error variable e1, as in Eq. (1.1). However, after performing the intervention, the value of the degree of sleep disorder x1 is always set to c, as in Eq. (1.3). The intervention of setting or forcing x1 = c is denoted by do(x1 = c) using the mathematical operator do. The distribution of x2 after the intervention of setting x1 = c is then denoted by p(x2|do(x1 = c)) and defined by the distribution of x2 in the data-generating process after the intervention represented by Eqs. (1.3) and (1.4). This post-intervention distribution p(x2|do(x1)) is called the causal effect from x1 to x2. If the distribution of the degree of depression x2 when the degree of sleep disorder x1 is intervened to have x1 = d differs from that when x1 is intervened to have x1 = c, then they have a causal relationship in which the degree of sleep disorder x1 causes that of depression x2. Therefore, when investigating whether x1 and x2 have a causal relationship where x1 causes x2, consider whether there are constants c and d of x1 such that

p(x2|do(x1 = d)) ≠ p(x2|do(x1 = c)).   (1.5)


Equation (1.5) evaluates a qualitative relationship. Thus, to quantitatively examine the magnitude of the causal relationship, compute the expectations of x2 in both cases with do(x1 = d) and do(x1 = c) and ascertain their difference, for example, as follows:

E(x2|do(x1 = d)) − E(x2|do(x1 = c)).   (1.6)

This quantity is called the average causal effect. When examining the magnitude of correlation, the correlation coefficient is computed as the measure. When examining the magnitude of causation, it is typical to compute the average causal effect. After representing the causal quantity of focus by a mathematical expression, the next step employs the theory of causal inference to judge whether it can be uniquely estimated from data (i.e., whether it is identifiable). This step requires some domain knowledge and assumptions per analyst judgment. The next section expands on this issue. In recent years, causal inference frameworks such as the potential outcome model (Imbens and Rubin, 2015) and the structural causal model (Pearl, 2000) have increasingly been used to describe assumptions and concepts in machine learning. For example, they are used to describe and interpret assumptions in transfer learning (Zhang et al., 2015, 2020), improve reinforcement learning algorithms via causality knowledge (Lee and Bareinboim, 2018), and define the fairness of prediction models (Kusner et al., 2017). Further, causal inference methods improve the explainability of prediction models based on the causal structure of variables (Sani et al., 2020) or the concept of the probability of causation (Pearl, 1999; Galhotra et al., 2021) and estimate the necessary intervention on a feature such that the desired prediction is obtained (Blöbaum and Shimizu, 2017; Kiritoshi et al., 2021).
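To make the average causal effect in Eq. (1.6) concrete, the following sketch simulates the data-generating process of Eqs. (1.1) and (1.2) and the intervened process of Eqs. (1.3) and (1.4). The binary common cause, the linear forms of g1 and g2, the coefficient values, and the Gaussian errors are hypothetical choices made only for this illustration; they are not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate(n, do_x1=None):
    """Simulate Eqs. (1.1)-(1.2); do_x1 applies do(x1 = c) as in Eqs. (1.3)-(1.4)."""
    z = rng.binomial(1, 0.4, size=n)      # common cause (e.g., smoking), hypothetical
    e1 = rng.normal(0, 1, size=n)         # error variable e1
    e2 = rng.normal(0, 1, size=n)         # error variable e2
    x1 = 0.8 * z + e1 if do_x1 is None else np.full(n, float(do_x1))
    x2 = 1.5 * x1 + 0.6 * z + e2          # true effect of x1 on x2 is 1.5 per unit
    return x1, x2


# Average causal effect E(x2|do(x1 = d)) - E(x2|do(x1 = c)) for d = 1 and c = 0.
_, x2_d = simulate(100_000, do_x1=1.0)
_, x2_c = simulate(100_000, do_x1=0.0)
print("estimated average causal effect:", x2_d.mean() - x2_c.mean())  # close to 1.5
```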

1.3 Identification and Estimation of the Magnitude of Causation

Once the causal quantity of interest is mathematically represented in the structural causal model framework, it is necessary to ascertain whether it is identifiable; it is identifiable if it is uniquely estimable from the data. In a typical problem setting, suppose the causal graph containing the pair of variables x1 and x2 is known from domain knowledge, and consider judging whether the magnitude of causation between them is uniquely estimable. The example on the left-hand side of Fig. 1.2 presents the degree of sleep disorder x1 and that of depression x2, whose causal relationship is to be ascertained. Suppose the causal graph on the left-hand side of the figure is correct. To simplify the explanation, suppose further that the only common cause is smoking status. Is the causal effect from the degree of sleep disorder x1 to that of depression x2 then identifiable?


Fig. 1.3 When the participants are divided into two groups based on the common cause (i.e., smoking status), the common cause disappears within each group

To judge the identifiability of this causal query, consider which variables in the causal graph are observable in the population under analysis. Suppose the variables whose causation magnitude is of interest (the degrees of sleep disorder x1 and depression x2) are observed or measured. Further, assume smoking status, the single common cause, is also observed. Accordingly, dividing the participants into two groups per their smoking habit generates a group comprising those with a smoking habit and another comprising those without such a habit. Then, as in Fig. 1.3, smoking status causes neither the degree of sleep disorder nor that of depression within each group. That is, there is no common cause within each group because all members of one group have a smoking habit and those of the other group have no smoking habit (i.e., smoking status is constant and independent of any other variables in each group). In the case of no common cause, as in Fig. 1.3, the causal effect p(x2|do(x1)) from x1 to x2 is equal to the conditional distribution p(x2|x1) (Pearl, 1995). Thus, the post-intervention expectation E(x2|do(x1)) also equals the conditional expectation E(x2|x1), and the average causal effect in each group, E(x2|do(x1 = d)) − E(x2|do(x1 = c)), is estimated by the difference in the conditional expectations E(x2|x1 = d) − E(x2|x1 = c). These conditional expectations can be computed from data with no intervention. After estimating the average causal effect for each group, that for the entire population is estimated by their weighted average, with the weights being the proportions of participants in each group. In general, if the causal graph representing the qualitative causal relationships between variables is acyclic, one can judge which variables must be observed to estimate the causal effect (Pearl, 1995; Shpitser and Pearl, 2006). For example, if the causal graph is a directed acyclic graph, as in Fig. 1.4, then it is sufficient to observe the pair of x and y whose causal effect is of interest and the set of parent variables pa(x) = {z1, z2} of x, which causes y. After calculating the conditional expectation of y conditional on x and its parents pa(x), marginalize it over the parent variables pa(x) to obtain the post-intervention expectation:

E(y|do(x)) = E_{pa(x)}[E(y|x, pa(x))].   (1.7)

Note that the causal effect from the degree of sleep disorder x1 to that of depression x2 cannot generally be estimated from the data if the common cause, smoking status, is not observed.
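The stratification argument above and the adjustment formula (1.7) can be illustrated numerically. In the sketch below, observational data are simulated from a hypothetical linear model with a binary common cause z standing in for smoking status; the effect of x1 on x2 is estimated within each stratum of z, and the per-stratum estimates are averaged with weights equal to the stratum proportions. The coefficients and the use of within-stratum regression slopes are illustrative assumptions, not the book's prescription.

```python
import numpy as np

rng = np.random.default_rng(1)

# Observational data: z (smoking) -> x1 (sleep disorder) -> x2 (depression), and z -> x2.
n = 100_000
z = rng.binomial(1, 0.4, size=n)
x1 = 0.8 * z + rng.normal(0, 1, size=n)
x2 = 1.5 * x1 + 0.6 * z + rng.normal(0, 1, size=n)     # true effect of x1 on x2 is 1.5

# Adjustment in the spirit of Eq. (1.7): estimate the effect within each stratum of the
# observed common cause z, then average with weights P(z).  In a linear model, the
# within-stratum regression slope equals E(x2|do(x1 = d)) - E(x2|do(x1 = c)) for d - c = 1.
adjusted = 0.0
for value in (0, 1):
    mask = z == value
    slope = np.polyfit(x1[mask], x2[mask], deg=1)[0]   # within-stratum effect estimate
    adjusted += mask.mean() * slope

naive = np.polyfit(x1, x2, deg=1)[0]    # ignores the common cause z
print("adjusted estimate:", adjusted)   # close to 1.5
print("naive estimate:", naive)         # biased because z is not adjusted for
```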


Fig. 1.4 Variables x, y, z1, z2, and w form a directed acyclic graph. z = {z1, z2} are the parents of x

After judging whether causal effects and post-intervention expectations can be estimated from the data, the estimation stage follows. Additional techniques, though not needed for identifiability, can make the estimation easier to perform. For example, when there are multiple parents to be conditioned on in Eq. (1.7) and the dimensionality of the parents is high and data are sparse, the propensity score (Rosenbaum and Rubin, 1983) is often used to obtain a one-dimensional representation of the multiple parents.
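As a hedged sketch of how a propensity score can enter the estimation stage, the example below uses a simplified setting with a binary cause x and two observed parents z1 and z2, as in Fig. 1.4: the propensity score P(x = 1 | z1, z2) is estimated with a logistic regression and used for inverse probability weighting. The data-generating model, the coefficients, and the scikit-learn-based estimator are assumptions made for this illustration only; the book does not prescribe this particular estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical setting: binary cause x, observed parents z1 and z2, outcome y.
# The true average causal effect of x on y is 2.0.
n = 50_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
p_treat = 1.0 / (1.0 + np.exp(-(0.7 * z1 - 0.5 * z2)))   # x depends on its parents
x = rng.binomial(1, p_treat)
y = 2.0 * x + 1.0 * z1 + 1.0 * z2 + rng.normal(size=n)

# Propensity score e(z) = P(x = 1 | z1, z2): a one-dimensional summary of the parents.
Z = np.column_stack([z1, z2])
e = LogisticRegression().fit(Z, x).predict_proba(Z)[:, 1]

# Inverse probability weighting estimate of E(y|do(x = 1)) - E(y|do(x = 0)).
ipw = np.mean(x * y / e) - np.mean((1 - x) * y / (1 - e))
print("IPW estimate:", ipw)                                       # close to 2.0
print("naive difference:", y[x == 1].mean() - y[x == 0].mean())   # confounded
```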

1.4 Identification and Estimation of Causal Structures

In structural causal models, qualitative causal relationships of variables (i.e., causal structures) are represented by causal graphs. The previous section assumed that the causal graph containing the pairs of variables whose causal effects are of interest is known from prior knowledge. This section considers cases where it is unknown and discusses methods for estimating causal graphs from data. Causal discovery, or causal structure search, is the task of inferring causal graphs from data; it explores hypotheses about the causal structure (Spirtes et al., 1993; Shimizu, 2014; Zhang and Hyvärinen, 2016). Being able to estimate the causal graph from data is helpful to analysts when prior knowledge is insufficient to draw it. From the estimated causal graph, the identifiability of causal effects can be examined. The estimated causal graph may be updated using prior knowledge and then used to examine identifiability, and prior knowledge can also be incorporated into the process of learning causal graphs. Readers might think that causal discovery enables causal analysis without prior knowledge. However, additional assumptions are often required to identify the causal graph from the data. For example, assumptions about the functional forms and distributions, though not necessarily needed when the causal graph is known, are often made. There is a trade-off between what is assumed and what is estimated from data. Analysts must continue to improve their causal graphs by performing causal analyses based on both domain knowledge and data. In a typical problem setting of causal discovery, the causal graph is assumed to be a directed acyclic graph, and all the common causes of the observed variables are assumed to be observed. These assumptions leave the three candidates for the causal graph of the two observed variables x1 and x2 shown in Fig. 1.5.


Fig. 1.5 The three candidate causal graphs for two observed variables when the causal graph is a directed acyclic graph and all common causes are observed

Regardless of which of the causal graphs on the left-hand side and in the center of the figure is correct, x1 and x2 are dependent. However, if the rightmost causal graph is correct, then x1 and x2 are independent. Suppose data are generated from one of these causal graphs. If x1 and x2 are independent in the data, this contradicts the dependency implied by the two causal graphs on the left-hand side and in the center, so these two candidates are removed. The rightmost causal graph remains a candidate, and it is the correct causal graph. Conversely, if x1 and x2 are dependent in the data, the rightmost candidate, which would make the two variables independent, is removed, leaving the left-hand side and center candidates. These two remaining candidates cannot be distinguished from the independence and dependence of the variables because x1 and x2 are dependent irrespective of which is correct. A set of causal graphs indistinguishable from data (here, the two candidates on the left-hand side and in the center) is called an equivalence class. The independence relation of the two variables thus provides the clue to distinguish the rightmost causal graph from the two others and to identify where directed edges exist, though the causal orientation remains unknown. Similarly, in the case of three or more variables, causal graphs are generally not identifiable solely based on the independence and conditional independence of the variables. However, it is possible to infer from data between which variables directed edges exist, though the orientations often remain unknown (Pearl and Verma, 1991; Spirtes and Glymour, 1991). Another important problem setting is obtained by assuming that the common causes are not necessarily all observed (i.e., there may be hidden common causes). In this case, causal graphs are generally not identifiable based on the conditional independence of variables, but it is possible to estimate from the data between which observed variables hidden common causes are likely to exist (Pearl and Verma, 1991; Spirtes et al., 1995). Extensions to cyclic cases (Richardson, 1996) and time-series cases (Malinsky and Spirtes, 2018) have been considered. Note that using temporal directions of variables as prior knowledge helps identify causal directions in time-series cases, but the problem of hidden common causes remains. These methods of inferring causal graphs using the conditional independence of variables (Spirtes et al., 1993) are implemented in software such as TETRAD (Scheines et al., 1998), pcalg (Kalisch et al., 2012), and causal-learn (https://github.com/cmu-phil/causal-learn). Thus far, no assumptions about the functional forms or distributions of the variables and errors in the structural causal models have been made beyond using the conditional independence of the observed variables. However, if additional assumptions can be made, it is reasonable to exploit them. Such additional information on the functional forms or distributions makes it possible to infer causal graphs in more depth
than using only the conditional independence of observed variables (Shimizu et al., 2006; Hoyer et al., 2008, 2009; Zhang and Hyvärinen, 2009b; Peters et al., 2014; Park and Raskutti, 2017; Wei et al., 2018; Zeng et al., 2022). This book focuses on this more recent causal discovery approach, the LiNGAM approach (Shimizu, 2014, 2016; Zhang and Hyvärinen, 2016; Shimizu and Blöbaum, 2020), which employs additional information on the functional forms and distributions for model identification when it is available. LiNGAM is the abbreviation of the linear non-Gaussian acyclic model. For example, assuming linearity, the structural causal models with the three causal graphs in Fig. 1.5 are written as follows. First, the leftmost model is written as

x1 = e1,   (1.8)
x2 = b21 x1 + e2.   (1.9)

Second, the center model is given by

x1 = b12 x2 + e1,   (1.10)
x2 = e2.   (1.11)

Finally, the rightmost model is written as

x1 = e1,   (1.12)
x2 = e2.   (1.13)
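The paragraph that follows explains why non-Gaussianity of e1 and e2 makes these three models distinguishable. As a numerical illustration of the underlying idea, the sketch below simulates the leftmost model, Eqs. (1.8) and (1.9), with uniform (hence non-Gaussian) errors, regresses in both directions, and compares how strongly each residual depends on its regressor. The dependence proxy (correlation between squared variables) and the coefficient value are simplifications chosen for this illustration; the proper independence measures appear in later chapters.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate the leftmost model, Eqs. (1.8)-(1.9), with non-Gaussian (uniform) errors.
# The true direction is x1 -> x2 with an illustrative coefficient b21 = 0.8.
n = 100_000
x1 = rng.uniform(-1, 1, size=n)                 # x1 = e1
x2 = 0.8 * x1 + rng.uniform(-1, 1, size=n)      # x2 = b21 x1 + e2


def dependence(u, v):
    """Crude dependence proxy: correlation between squared variables.

    Zero correlation between u and v does not imply independence; in the wrong
    regression direction the residual stays dependent on the regressor, and
    this dependence shows up in the squares.
    """
    return abs(np.corrcoef(u**2, v**2)[0, 1])


# Regress in each direction and compare residual-regressor dependence.
res_forward = x2 - np.polyval(np.polyfit(x1, x2, 1), x1)   # residual for x1 -> x2
res_backward = x1 - np.polyval(np.polyfit(x2, x1, 1), x2)  # residual for x2 -> x1

print("x1 -> x2:", dependence(x1, res_forward))    # close to 0: residual independent of x1
print("x2 -> x1:", dependence(x2, res_backward))   # clearly larger: dependence remains
```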

The error variables e1 and e2 are independent in all three models above, given the assumption that there is no hidden common cause. It is possible to identify which of the three candidates in Fig. 1.5 is correct if the functional forms are linear and e1 and e2 follow continuous distributions other than Gaussian distributions (Dodge and Rousson, 2001; Shimizu et al., 2005, 2006). The reason is that one can use not only the conditional independence between the observed variables x1 and x2 but also the independence between the error variables e1 and e2 to infer the causal graph (Zhang and Hyvärinen, 2009a). Similarly, the causal graphs with a hidden common cause z in Fig. 1.6 are identifiable (Hoyer et al., 2008).

Fig. 1.6 Candidate causal graphs of two observed variables given a hidden common cause z. Unobserved or hidden variables are enclosed in circles, and observed variables are in rectangular boxes

For the estimation of causal graphs, methods using the ideas of independent component analysis (ICA) (Hyvärinen et al., 2001) have been proposed (Shimizu et al., 2006, 2011; Hoyer et al., 2008; Henao and Winther, 2011; Ding et al., 2019). There are other extensions of linear non-Gaussian models such as cyclic (Lacerda et al., 2008), time-series (Hyvärinen et al., 2010; Gong et al., 2015), nonlinear (Hoyer et al., 2009; Zhang and Hyvärinen, 2009b; Khemakhem et al., 2021), and mixed cases with continuous and discrete variables (Wei et al., 2018; Zeng et al., 2022). Application examples are present in many fields, including economics (Moneta et al., 2013), marketing science (Moriyama and Kuwano, 2021), neuroscience (Ogawa et al., 2022), epidemiology (Rosenström et al., 2012), chemistry (Campomanes et al., 2014), climatology (Liu and Niyogi, 2020), and medicine (Kotoku et al., 2020). See this page for an extensive and updated list: https://www.shimizulab.org/lingam/lingampapers. For code packages, in addition to TETRAD, causal-learn, and pcalg, a package more specialized to the LiNGAM approach, the LiNGAM Python package, is available at https://github.com/cdt15/lingam; a minimal usage sketch follows.
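Below is the minimal usage sketch referred to above. It assumes the DirectLiNGAM estimator discussed in Chap. 3 and the interface documented in the package repository (a fit method together with the causal_order_ and adjacency_matrix_ attributes); the three-variable data-generating model and its coefficients are invented for this illustration.

```python
import numpy as np
import lingam   # https://github.com/cdt15/lingam (install with: pip install lingam)

rng = np.random.default_rng(4)

# Simulate a three-variable LiNGAM, x0 -> x1 -> x2, with uniform (non-Gaussian) errors.
n = 2_000
x0 = rng.uniform(-1, 1, size=n)
x1 = 1.5 * x0 + rng.uniform(-1, 1, size=n)
x2 = -0.8 * x1 + rng.uniform(-1, 1, size=n)
X = np.column_stack([x0, x1, x2])

# Fit DirectLiNGAM (see Chap. 3) and inspect the estimated structure.
model = lingam.DirectLiNGAM()
model.fit(X)
print(model.causal_order_)       # estimated causal ordering of the variables
print(model.adjacency_matrix_)   # estimated coefficient matrix B
```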

1.5 Concluding Remarks

Regarding machine learning prediction, the performance of constructed prediction models can be evaluated on test data. In causal inference, however, the gold standard for examining the validity of estimated causal effects and causal structures is to conduct interventions, and it is often not easy to conduct interventions for ethical or monetary reasons. Therefore, it is essential to employ domain knowledge and take steps to improve the causal models by examining possible violations of assumptions, considering such violations, and modifying the assumptions before conducting interventions. This book highlights a causal discovery approach called the LiNGAM approach. The LiNGAM approach uses additional information on the functional forms or distributions, such as linearity and non-Gaussianity of continuous variables. It aims at using all the information contained in both prior knowledge and data to build better causal models. It can identify a much broader range of causal relationships than classic methods based on the conditional independence of variables. This book aims to provide a concise summary of the basic ideas of the LiNGAM approach. Chapter 2 introduces the basic LiNGAM model, which is the first multivariate causal model whose causal graph is identifiable. It opened a research direction in statistical causal inference that investigates the conditions under which causal graphs are identifiable. Chapter 3 discusses two estimation methods for the LiNGAM model. One method employs iterative estimation algorithms developed in the field of ICA. The other is a direct algorithm that is guaranteed to converge to the right solution in a fixed number of steps. Chapter 4 discusses how to evaluate the statistical reliability of estimated causal graphs and detect possible violations of model assumptions to improve the proposed models. Chapter 5 introduces the LiNGAM model with hidden common causes and its estimation methods. The chapter also discusses hidden or latent variables in LiNGAM other than hidden common causes; specifically, it considers latent factors and latent classes. Chapter 6 presents other key extensions of the LiNGAM model to cyclic, time-series, nonlinear, and discrete variable cases.

References

Blöbaum, P., & Shimizu, S. (2017). Estimation of interventional effects of features on prediction. In Proceedings of the 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP) (pp. 1–6). IEEE.
Campomanes, P., Neri, M., Horta, B. A., Roehrig, U. F., Vanni, S., Tavernelli, I., & Rothlisberger, U. (2014). Origin of the spectral shifts among the early intermediates of the rhodopsin photocycle. Journal of the American Chemical Society, 136(10), 3842–3851.
Ding, C., Gong, M., Zhang, K., & Tao, D. (2019). Likelihood-free overcomplete ICA and applications in causal discovery. In Advances in neural information processing systems (Vol. 32, pp. 6883–6893).
Dodge, Y., & Rousson, V. (2001). On asymmetric properties of the correlation coefficient in the regression setting. The American Statistician, 55(1), 51–54.
Galhotra, S., Pradhan, R., & Salimi, B. (2021). Explaining black-box algorithms using probabilistic contrastive counterfactuals. In Proceedings of the 2021 International Conference on Management of Data (pp. 577–590).
Gong, M., Zhang, K., Schoelkopf, B., Tao, D., & Geiger, P. (2015). Discovering temporal causal relations from subsampled data. In Proceedings of the 32nd International Conference on Machine Learning (ICML2015) (pp. 1898–1906).
Henao, R., & Winther, O. (2011). Sparse linear identifiable multivariate modeling. Journal of Machine Learning Research, 12, 863–905.
Hoyer, P. O., Janzing, D., Mooij, J., Peters, J., & Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In Advances in neural information processing systems (Vol. 21, pp. 689–696).
Hoyer, P. O., Shimizu, S., Kerminen, A., & Palviainen, M. (2008). Estimation of causal effects using linear non-Gaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49(2), 362–378.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvärinen, A., Zhang, K., Shimizu, S., & Hoyer, P. O. (2010). Estimation of a structural vector autoregressive model using non-Gaussianity. Journal of Machine Learning Research, 11, 1709–1731.
Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
Kalisch, M., Mächler, M., Colombo, D., Maathuis, M. H., & Bühlmann, P. (2012). Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11), 1–26.
Khemakhem, I., Monti, R., Leech, R., & Hyvärinen, A. (2021). Causal autoregressive flows. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research (Vol. 130, pp. 3520–3528). PMLR.
Kiritoshi, K., Izumitani, T., Koyama, K., Okawachi, T., Asahara, K., & Shimizu, S. (2021). Estimating individual-level optimal causal interventions combining causal models and machine learning models. In Proceedings of The KDD’21 Workshop on Causal Discovery, Proceedings of Machine Learning Research (Vol. 150, pp. 55–77). PMLR.
Kotoku, J., Oyama, A., Kitazumi, K., Toki, H., Haga, A., Yamamoto, R., Shinzawa, M., Yamakawa, M., Fukui, S., Yamamoto, K., et al. (2020). Causal relations of health indices inferred statistically using the DirectLiNGAM algorithm from big data of Osaka prefecture health checkups. PLOS ONE, 15(12), e0243229.
Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). Counterfactual fairness. In Advances in neural information processing systems (Vol. 30). Curran Associates, Inc.
Lacerda, G., Spirtes, P., Ramsey, J., & Hoyer, P. O. (2008). Discovering cyclic causal models by independent components analysis. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI2008) (pp. 366–374).
Lee, S., & Bareinboim, E. (2018). Structural causal bandits: Where to intervene? In Advances in neural information processing systems (Vol. 31, pp. 2568–2578).
Liu, J., & Niyogi, D. (2020). Identification of linkages between urban heat island magnitude and urban rainfall modification by use of causal discovery algorithms. Urban Climate, 33, 100659.
Malinsky, D., & Spirtes, P. (2018). Causal structure learning from multivariate time series in settings with unmeasured confounding. In Proceedings of the 2018 ACM SIGKDD Workshop on Causal Discovery (pp. 23–47).
Moneta, A., Entner, D., Hoyer, P. O., & Coad, A. (2013). Causal inference by independent component analysis: Theory and applications. Oxford Bulletin of Economics and Statistics, 75(5), 705–730.
Moriyama, T., & Kuwano, M. (2021). Causal inference for contemporaneous effects and its application to tourism product sales data. Journal of Marketing Analytics, 1–11.
Ogawa, T., Shimobayashi, H., Hirayama, J.-I., & Kawanabe, M. (2022). Asymmetric directed functional connectivity within the frontoparietal motor network during motor imagery and execution. NeuroImage, 247, 118794.
Park, G., & Raskutti, G. (2017). Learning quadratic variance function (QVF) DAG models via overdispersion scoring (ODS). Journal of Machine Learning Research, 18(224), 1–44.
Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4), 669–688.
Pearl, J. (1999). Probabilities of causation: Three counterfactual interpretations and their identification. Synthese, 121(1), 93–149.
Pearl, J. (2000). Causality: Models, reasoning, and inference. Cambridge University Press.
Pearl, J., & Verma, T. (1991). A theory of inferred causation. In Proceedings of the 2nd International Conference on Principles of Knowledge Representation and Reasoning (pp. 441–452). Morgan Kaufmann, San Mateo, CA.
Peters, J., Mooij, J. M., Janzing, D., & Schölkopf, B. (2014). Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15, 2009–2053.
Raitakari, O. T., Juonala, M., Rönnemaa, T., Keltikangas-Järvinen, L., Räsänen, L., Pietikäinen, M., et al. (2008). Cohort profile: The cardiovascular risk in Young Finns Study. International Journal of Epidemiology, 37(6), 1220–1226.
Richardson, T. (1996). A polynomial-time algorithm for deciding Markov equivalence of directed cyclic graphical models. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (UAI1996).
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
Rosenström, T., Jokela, M., Puttonen, S., Hintsanen, M., Pulkki-Råback, L., Viikari, J. S., Raitakari, O. T., & Keltikangas-Järvinen, L. (2012). Pairwise measures of causal direction in the epidemiology of sleep problems and depression. PLOS ONE, 7(11), e50841.
Sani, N., Malinsky, D., & Shpitser, I. (2020). Explaining the behavior of black-box prediction algorithms with causal learning. arXiv preprint arXiv:2006.02482.
Scheines, R., Spirtes, P., Glymour, C., Meek, C., & Richardson, T. (1998). The TETRAD project: Constraint based aids to causal model specification. Multivariate Behavioral Research, 33(1), 65–117.
Shimizu, S. (2014). LiNGAM: Non-Gaussian methods for estimating causal structures. Behaviormetrika, 41(1), 65–98.
Shimizu, S. (2016). Non-Gaussian structural equation models for causal discovery. In Statistics and causality: Methods for applied empirical research (pp. 153–184). Wiley.
Shimizu, S. (2019). Non-Gaussian methods for causal structure learning. Prevention Science, 20(3), 431–441.
Shimizu, S. (2020). Toukeiteki inga suiron e no shoutai (Introduction to statistical causal inference). Suuri Kagaku, 58(9), 7–14.
Shimizu, S., & Blöbaum, P. (2020). Recent advances in semi-parametric methods for causal discovery. Direction Dependence in Statistical Modeling: Methods of Analysis, 111–130.
Shimizu, S., Hoyer, P. O., Hyvärinen, A., & Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7, 2003–2030.
Shimizu, S., Hyvärinen, A., Kano, Y., & Hoyer, P. O. (2005). Discovery of non-Gaussian linear causal models using ICA. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI2005) (pp. 526–533). Arlington, Virginia: AUAI Press.
Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P. O., & Bollen, K. (2011). DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12, 1225–1248.
Shpitser, I., & Pearl, J. (2006). Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI2006) (pp. 437–444).
Spirtes, P., & Glymour, C. (1991). An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9, 67–72.
Spirtes, P., Glymour, C., & Scheines, R. (1993). Causation, prediction, and search. Springer. (2nd ed. MIT Press 2000).
Spirtes, P., Meek, C., & Richardson, T. (1995). Causal inference in the presence of latent variables and selection bias. In Proceedings of the 11th Annual Conference on Uncertainty in Artificial Intelligence (UAI1995) (pp. 491–506).
Wei, W., Feng, L., & Liu, C. (2018). Mixed causal structure discovery with application to prescriptive pricing. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI2018) (pp. 5126–5134).
Zeng, Y., Shimizu, S., Matsui, H., & Sun, F. (2022). Causal discovery for linear mixed data. In Proceedings of the First Conference on Causal Learning and Reasoning (CLeaR2022). Accepted.
Zhang, K., Gong, M., & Schölkopf, B. (2015). Multi-source domain adaptation: A causal view. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI2015).
Zhang, K., Gong, M., Stojanov, P., Huang, B., & Glymour, C. (2020). Domain adaptation as a problem of inference on graphical models. In Advances in neural information processing systems (Vol. 33).
Zhang, K., & Hyvärinen, A. (2009a). Causality discovery with additive disturbances: An information-theoretical perspective. In Proceedings of the European Conference on Machine Learning (ECML2009) (pp. 570–585).
Zhang, K., & Hyvärinen, A. (2009b). On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI2009) (pp. 647–655).
Zhang, K., & Hyvärinen, A. (2016). Nonlinear functional causal models for distinguishing causes from effect. In Statistics and causality: Methods for applied empirical research. Wiley.

Part I

Basics of LiNGAM Approach

Chapter 2

Basic LiNGAM Model

2.1 Independent Component Analysis

Independent component analysis (ICA) (Comon, 1994; Jutten and Hérault, 1991) is a data analysis method developed in the field of signal processing (Hyvärinen et al., 2001). Consider an example of an ICA model:

x1 = a11 s1 + a12 s2,   (2.1)
x2 = a21 s1 + a22 s2,   (2.2)

where x1 and x2 on the left-hand side are observed variables and s1 and s2 on the right-hand side are hidden continuous variables. The coefficients a11, a12, a21, and a22 are constants. These coefficients represent how the hidden variables s1 and s2 are mixed to generate the observed variables x1 and x2. This ICA model represents the following data-generating process. First, the values of the unobserved independent variables s1 and s2 are generated. The values of the observed variables x1 and x2 are then generated as linear sums of the values of s1 and s2. This data-generating process is illustrated by the causal graph on the right-hand side of Fig. 2.1. The ICA model can thus be considered a structural causal model of the data-generating process. ICA aims to estimate the coefficients a11, a12, a21, and a22 and recover the values of the unobserved variables s1 and s2 using the data matrix X generated from the model of Eqs. (2.1) and (2.2). ICA is characterized by the assumption that the unobserved variables s1 and s2 are independent and follow non-Gaussian continuous distributions. Since independence is key, the unobserved variables s1 and s2 are called independent components. This independence between s1 and s2 allows estimation of the coefficients a11, a12, a21, and a22 and recovery of the values of the unobserved independent components s1 and s2.

Fig. 2.1 ICA model is a structural causal model

In general, the ICA model for p observed variables xi (i = 1, . . . , p; p ≥ 2) is defined as follows (Jutten and Hérault, 1991; Comon, 1994):

xi = Σ_{j=1}^{q} aij sj,   (i = 1, . . . , p)   (2.3)

where sj (j = 1, . . . , q) on the right-hand side are unobserved non-Gaussian continuous variables and are mutually independent. The ICA model of Eq. (2.3) can be written in the following matrix form:

x = As.   (2.4)

The matrix A is a p × q matrix whose elements are the coefficients aij (i = 1, . . . , p; j = 1, . . . , q). A is a mixing matrix that represents how the independent components in s are mixed to generate the observed variables in x. The i-th element of the vector x is the i-th observed variable xi, and the j-th element of the vector s is the j-th independent component sj. Note that any two columns of the mixing matrix A are assumed to be linearly independent (Comon, 1994; Eriksson and Koivunen, 2004). If two columns were linearly dependent, one would be a constant multiple of the other. The two columns could then be combined into one column, and a mixing matrix with one fewer column and one fewer independent component could represent the same observed variable vector x. This assumption of linear independence means there is no such redundancy. Next, consider the identifiability of the mixing matrix A in Eq. (2.4). The mixing matrix A is identifiable only up to the ordering and scaling of its columns (Comon, 1994; Eriksson and Koivunen, 2004). Thus, it is not feasible to estimate the mixing matrix A itself uniquely; what can be obtained is a matrix that may differ from the mixing matrix A in the ordering and scaling of its columns. Let AICA denote a matrix that may differ from the mixing matrix A only in the ordering and scaling of the columns. The relationship between the p × q matrices AICA and A can be written as follows:

AICA = ADP,   (2.5)

where the matrix P is a q × q permutation matrix. D is a q × q diagonal matrix with non-zero diagonal components. The permutation matrix P may change the ordering of the mixing matrix A columns, and the diagonal matrix D may change the column scaling of the mixing matrix A.
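The indeterminacy in Eq. (2.5) can be observed numerically. The sketch below generates data from the two-variable mixing model of Eqs. (2.1) and (2.2) with uniform independent components and a hypothetical mixing matrix, and then estimates the mixing matrix with scikit-learn's FastICA, one of many ICA algorithms (the estimation methods used in this book appear in Chap. 3). After normalizing the columns, the estimate matches A only up to column ordering, scaling, and sign.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(5)

# Non-Gaussian independent components mixed as in Eqs. (2.1)-(2.2)
# with a hypothetical mixing matrix A.
n = 10_000
S = rng.uniform(-1, 1, size=(n, 2))        # columns hold samples of s1 and s2
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])                 # rows: (a11, a12) and (a21, a22)
X = S @ A.T                                # each row is one observation of x = A s

# Estimate the mixing matrix with FastICA.
ica = FastICA(n_components=2, random_state=0)
ica.fit(X)
A_ica = ica.mixing_

# A_ica should equal A only up to column permutation, scaling, and sign,
# i.e., A_ica = A D P as in Eq. (2.5); compare the columns after normalization.
print(A / np.linalg.norm(A, axis=0))
print(A_ica / np.linalg.norm(A_ica, axis=0))
```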


If the permutation matrix P and diagonal matrix D are identity matrices, the matrices AICA and A are equal because the ordering and scaling do not change. However, since P and D are unknown, whether the ordering and scaling are the same as in the original is unknown. The ordering and scaling of the columns of the mixing matrix A cannot be estimated uniquely because, even if the ordering and scaling of the columns of A are changed, the changed mixing matrix and independent component vector still satisfy the assumptions of the ICA model in Eq. (2.4), provided the ordering and scaling of the elements of the independent component vector s are changed to match. If the changed mixing matrix or independent component vector did not satisfy the assumptions of the ICA model, it would be possible to detect that they differ from the original ones. However, they still satisfy the assumptions of the ICA model even after their ordering and scaling change, so it is not feasible to know whether they differ from the original ones. Consider the following example (Shimizu, 2017). First, suppose that the first and second columns of the mixing matrix A are swapped, which gives the permuted version Apermuted:

A = [ a11  a12
      a21  a22 ],   (2.6)

Apermuted = [ a12  a11
              a22  a21 ].   (2.7)

The permuted version Apermuted differs from the original mixing matrix A unless its first and second columns are identical. However, by further swapping the first and second elements of the vector of independent components s to create its permuted version spermuted, it remains possible to represent the same vector of observed variables x using these permuted versions of the original mixing matrix A and independent component vector s. That is, the following equality holds:

x = Apermuted spermuted.   (2.8)
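A short numerical check of Eq. (2.8): swapping the columns of A together with the elements of s leaves the observed vector x unchanged. The numbers are arbitrary illustrative values.

```python
import numpy as np

# Numerical check of Eq. (2.8) with arbitrary illustrative values.
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])
s = np.array([0.7, -1.2])

A_permuted = A[:, [1, 0]]          # swap the two columns of the mixing matrix
s_permuted = s[[1, 0]]             # swap the two independent components accordingly

print(A @ s)                       # x = A s
print(A_permuted @ s_permuted)     # identical x = A_permuted s_permuted
```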

The permuted mixing matrix Apermuted has only the columns of A swapped; thus, any two of its columns remain linearly independent. The elements of the permuted independent component vector spermuted are also mutually independent and follow non-Gaussian continuous distributions, given that they have only been reordered. Therefore, both the permuted mixing matrix Apermuted and the permuted independent component vector spermuted satisfy the assumptions of the ICA model in Eq. (2.4). Next, consider changing the column scaling of the mixing matrix A by multiplying its first column by a non-zero constant c, which gives a differently scaled version Arescaled as follows:


Arescaled = [ c × a11  a12
              c × a21  a22 ].   (2.9)

This rescaled mixing matrix Arescaled differs from the original mixing matrix A. However, the same observed variable vector x can be represented as before the change in column scaling by multiplying the first element of the independent component vector s by 1/c:

[ x1 ]   [ a11  a12 ] [ s1 ]
[ x2 ] = [ a21  a22 ] [ s2 ]   (2.10)

         [ c × a11  a12 ] [ s1/c ]
       = [ c × a21  a22 ] [ s2   ].   (2.11)

Note that multiplying an element of the independent component vector s by a constant changes the scaling of the corresponding independent component (i.e., its variance). However, in the rescaled mixing matrix Arescaled, any two columns remain linearly independent. The elements of the rescaled independent component vector srescaled are also mutually independent and follow non-Gaussian distributions. Therefore, the rescaled mixing matrix Arescaled and the rescaled independent component vector srescaled also satisfy the ICA model assumptions. These two examples can be generalized. In general, the following pair of a mixing matrix AICA and an independent component vector sICA satisfies the assumptions of the ICA model in Eq. (2.4):

AICA = ADP,   (2.12)
sICA = P−1 D−1 s.   (2.13)

That such pairs reproduce the same observed variable vector x can be seen from the following chain of equalities:

x = As  (2.14)
  = ADP P^{-1} D^{-1} s  (2.15)
  = (ADP)(P^{-1} D^{-1} s)  (2.16)
  = A_ICA s_ICA,  (2.17)

where the matrix A_ICA and the vector s_ICA satisfy the same assumptions as the original mixing matrix A and independent component vector s. That is, any two columns of A_ICA are linearly independent, and the elements of s_ICA are independent and follow non-Gaussian distributions. A key feature of ICA is the assumption that the independent components follow non-Gaussian distributions. How does the identifiability of the model change if the independent components are Gaussian? In the Gaussian case, the mixing matrix A is identifiable only up to an orthogonal transformation, in addition to its column ordering and scaling. That is, the following pairs of matrices A_Gauss and vectors s_Gauss satisfy the same assumptions as the original mixing matrix A and independent component vector s:


A_Gauss = ADPQ,  (2.18)
s_Gauss = Q^{-1} P^{-1} D^{-1} s  (2.19)
        = Q^T s_ICA,  (2.20)

where Q is a q × q orthogonal matrix. Note that the inverse of an orthogonal matrix is its transpose, i.e., Q^{-1} = Q^T. If the independent components s_j (j = 1, ..., q) follow Gaussian distributions, the orthogonal matrix Q cannot be identified, in addition to the permutation matrix P and the diagonal matrix D. If an orthogonal matrix Q other than the identity matrix is used, multiple independent components are mixed together in s_Gauss, and the original independent components cannot be recovered. The orthogonal matrix Q cannot be identified because independence and uncorrelatedness are equivalent for Gaussian variables, so the data carry less information that could be used for identification. Independence is a stronger condition than uncorrelatedness (Hyvärinen et al., 2001): independent variables are always uncorrelated, but the reverse is not necessarily true. For example, if two variables y and z are independent, then for any bounded functions g1 and g2,

cov[g1(y), g2(z)] = 0.  (2.21)

That is, the covariance of g1(y) and g2(z) is zero for every choice of bounded functions g1 and g2. In contrast, to show that two variables y and z are uncorrelated, it is sufficient to show that the covariance of y and z themselves is zero:

cov(y, z) = 0.  (2.22)

In other words, uncorrelatedness corresponds to checking the single case in which g1 and g2 are the identity functions, whereas independence requires far more conditions to be satisfied. For non-Gaussian variables, these many conditions can be used to identify the orthogonal matrix Q. For Gaussian variables, only the condition that the covariance is zero is available, and that condition alone is not sufficient to identify Q. Thus, non-Gaussian distributions contain more information than Gaussian distributions.
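The following short numerical sketch (not part of the original text; it assumes NumPy and uses squared variables as simple nonlinear transforms in place of general bounded functions) illustrates this point. Rotating two independent Gaussian components by an orthogonal matrix Q yields components that are still uncorrelated and still independent, whereas rotating two independent uniform components yields components that are uncorrelated but clearly dependent; this dependence is the extra information that ICA exploits.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# A 45-degree rotation, i.e., an orthogonal matrix Q.
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def nonlinear_cov(y, z):
    """Covariance of the simple nonlinear transforms y**2 and z**2.
    For independent variables this is zero; a clearly non-zero value
    reveals dependence even when cov(y, z) itself is zero."""
    return np.cov(y**2, z**2)[0, 1]

# Gaussian sources: after rotation by Q the components are still uncorrelated
# and still independent, so Q cannot be detected from the data.
s_gauss = rng.standard_normal((2, n))
y_gauss = Q @ s_gauss
print("Gaussian: cov =", round(np.cov(y_gauss)[0, 1], 3),
      " nonlinear cov =", round(nonlinear_cov(y_gauss[0], y_gauss[1]), 3))

# Non-Gaussian (uniform) sources: after rotation the components are still
# uncorrelated, but the nonlinear covariance reveals their dependence.
s_unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))   # unit variance
y_unif = Q @ s_unif
print("Uniform:  cov =", round(np.cov(y_unif)[0, 1], 3),
      " nonlinear cov =", round(nonlinear_cov(y_unif[0], y_unif[1]), 3))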

Finally, we describe how the mixing matrix A is estimated from data in ICA (Hyvärinen et al., 2001). In many cases, the mixing matrix A is assumed to be square (i.e., p = q), so that there are as many observed variables as independent components. The independent component vector s is then estimated by a linear transformation of the observed variable vector x with a p × p matrix W (i.e., y = Wx). This matrix W is called the separating or demixing matrix. If the separating matrix W equals the inverse of the mixing matrix A (i.e., W = A^{-1}), the original independent component vector s is recovered:

y = Wx = A^{-1}As = s.  (2.23)


The mixing matrix A can then be estimated as the inverse of the separating matrix W (i.e., A = W^{-1}). To estimate the separating matrix, W is chosen so that the components y_j (j = 1, ..., p) of the vector y are as independent as possible, because the components s_j (j = 1, ..., p) of the independent component vector s that y is intended to estimate are independent. A common measure of independence is mutual information, which is nonnegative and equals zero if and only if the variables are independent. Therefore, W is estimated by minimizing the mutual information of the elements of the vector y = Wx. The mutual information of the vector y is defined as

I(y) = \sum_{j=1}^{q} H(y_j) - H(y),  (2.24)

where H(y) is the entropy of y, defined by

H(y) = E[-\log p(y)].  (2.25)
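As a rough numerical illustration (not part of the original text), the following sketch estimates I(y) in Eq. (2.24) for two variables using a simple histogram (plug-in) entropy estimate; it assumes NumPy, uses an arbitrary 2 × 2 mixing matrix, and is not the estimator used inside FastICA. The estimate is close to zero for independent components and clearly positive after they are linearly mixed.

import numpy as np

def hist_entropy_1d(x, bins=60):
    """Plug-in differential entropy of a one-dimensional sample via a histogram."""
    counts, edges = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    width = np.diff(edges)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask] / width[mask])))

def hist_entropy_2d(y, bins=60):
    """Plug-in differential entropy of a two-dimensional sample (rows = variables)."""
    counts, xedges, yedges = np.histogram2d(y[0], y[1], bins=bins)
    p = counts / counts.sum()
    cell = np.outer(np.diff(xedges), np.diff(yedges))
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask] / cell[mask])))

def mutual_information(y, bins=60):
    """Rough estimate of I(y) = H(y1) + H(y2) - H(y) for a two-variable sample."""
    return hist_entropy_1d(y[0], bins) + hist_entropy_1d(y[1], bins) - hist_entropy_2d(y, bins)

rng = np.random.default_rng(1)
s = rng.laplace(size=(2, 100_000))              # independent components
y = np.array([[1.0, 0.8], [0.4, 1.0]]) @ s      # linearly mixed components
print("independent components:", round(mutual_information(s), 3))   # close to 0
print("mixed components      :", round(mutual_information(y), 3))   # clearly positive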

The mutual information in Eq. (2.24) is zero if and only if all the elements of the vector y are independent. The mutual information is estimated from the data matrix X as follows:

\hat{I}(y) = \sum_{j=1}^{q} \left\{ \frac{1}{n} \sum_{m=1}^{n} -\log p(y_j^{(m)}) \right\} - \frac{1}{n} \sum_{m=1}^{n} -\log p(y^{(m)})
           = \sum_{j=1}^{q} \left\{ \frac{1}{n} \sum_{m=1}^{n} -\log p(w_j^T x^{(m)}) \right\} - \frac{1}{n} \sum_{m=1}^{n} -\log p(W x^{(m)}).  (2.26)

The vector y^{(m)} is the m-th column of the matrix Y = WX, the linear transformation of the data matrix X by W, and is the m-th observation of the variable vector y. Further, y_j^{(m)} is the j-th element of the vector y^{(m)} and is the m-th observation of the variable y_j. The row vector w_j^T is the j-th row of the matrix W. The vector x^{(m)} is the m-th column of the data matrix X and the m-th observation of the observed variable vector x. A standard algorithm for estimating the W that minimizes the mutual information in Eq. (2.26) is the fixed-point algorithm FastICA (Hyvärinen, 1999). One of its useful features is that the non-Gaussian distributions of the independent components do not need to be specified explicitly in advance. However, the separating matrix W is identifiable only up to a permutation matrix P and a diagonal matrix D that determine the ordering and scaling of its rows. This indeterminacy stems from the same reason that the mixing matrix A is identifiable only up to its column ordering and scaling.


Since the separating matrix W is the inverse of the mixing matrix A, the indeterminacy in the columns of A carries over to the rows of W. Therefore, estimating the separating matrix W that minimizes the mutual information in Eq. (2.26) yields an estimate of the following matrix W_ICA:

W_ICA = PDW (= PDA^{-1}).  (2.27)
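As an illustration (not from the original text), the following sketch applies scikit-learn's FastICA to data generated from a known 2 × 2 mixing matrix with Laplace-distributed independent components; the mixing matrix A, sample size, and distributions are arbitrary choices for the example. The estimated mixing matrix matches the true one only up to column permutation, sign, and scaling, as in Eq. (2.12).

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical 2 x 2 mixing matrix and Laplace (non-Gaussian) components.
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
s = rng.laplace(size=(2, n))
X = (A @ s).T                       # samples in rows, as scikit-learn expects

ica = FastICA(n_components=2, random_state=0)
ica.fit(X)
A_ica = ica.mixing_                 # estimate of A D P, as in Eq. (2.12)

# Normalize the column scales before comparing: each estimated column should
# match one true column up to sign, but the column ordering and scaling
# (the matrices P and D) remain arbitrary.
def unit_columns(M):
    return M / np.linalg.norm(M, axis=0)

print("true A, unit columns:\n", unit_columns(A).round(3))
print("estimated mixing matrix, unit columns:\n", unit_columns(A_ica).round(3))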

2.2 LiNGAM Model

This section describes the basic LiNGAM model (Shimizu et al., 2005, 2006), a representative model where the causal graph is identifiable. The LiNGAM model for p observed variables x_1, x_2, ..., x_p is defined by

x_i = \sum_{j \in pa(x_i)} b_{ij} x_j + e_i  (i = 1, ..., p),  (2.28)

where pa(x_i) is the set of parents of the observed variable x_i. Each observed variable x_i is a linear combination of its parent variables pa(x_i) plus its corresponding error variable e_i. If the coefficient b_ij is zero, there is no direct causal effect from x_j to x_i; if the coefficient b_ij is non-zero, there is a direct causal effect from x_j to x_i (i, j = 1, ..., p). The error variables e_i (i = 1, ..., p) are independent and follow non-Gaussian continuous distributions. This independence between the error variables comes from the assumption that there are no unobserved or hidden common causes. Further, the causal graph of the variables is assumed to be a directed acyclic graph, as in Fig. 2.2. Equation (2.28) can be written in matrix form as follows:

x = Bx + e.  (2.29)

The vectors x and e collect the observed variables x_i and the error variables e_i (i = 1, ..., p), respectively, and the square matrix B collects the coefficients b_ij (i, j = 1, ..., p). The LiNGAM model is identifiable (Shimizu et al., 2006); that is, the coefficient matrix B in Eq. (2.29) can be uniquely estimated from the distribution p(x) of the observed variables. A causal graph can therefore be drawn from the zero/non-zero pattern of the elements of B: if the coefficient b_ij is non-zero (zero), a directed edge from the variable x_j to x_i is (not) drawn.
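As a minimal sketch (not from the original text, with hypothetical coefficient values), the following code generates data from a model of the form of Eq. (2.29) by solving x = Bx + e for x, i.e., x = (I - B)^{-1} e, using independent non-Gaussian (uniform) errors.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
p = 3

# Hypothetical coefficient matrix B of an acyclic model in which
# x2 <- x1 and x3 <- x1, x2 (the variables are listed here in a causal
# ordering, so B is strictly lower triangular).
B = np.array([[0.0,  0.0, 0.0],
              [0.8,  0.0, 0.0],
              [0.4, -0.6, 0.0]])

# Independent, non-Gaussian (uniform, zero-mean) error variables.
e = rng.uniform(-1.0, 1.0, size=(p, n))

# Solving x = B x + e for x gives x = (I - B)^{-1} e.
x = np.linalg.solve(np.eye(p) - B, e)

# The non-zero pattern of B shows up as dependence among the rows of x.
print(np.cov(x).round(2))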

Fig. 2.2 LiNGAM assumes that the causal graph is acyclic


Fig. 2.3 An example LiNGAM model

We now explain the relationship between the acyclicity of the causal graph and the coefficient matrix B, beginning with the notion of a causal ordering of the observed variables. A causal ordering is a topological ordering of the variables such that, when the variables are sorted according to it, a variable in a later position never affects a variable in an earlier position. We represent such an ordering of the observed variables x_1, x_2, ..., x_p by k(1), k(2), ..., k(p), where k(i) denotes the position of x_i in the ordering. The model in Fig. 2.3 is an example (Shimizu, 2014). Its matrix representation is

\underbrace{\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}}_{x} = \underbrace{\begin{bmatrix} 0 & 0 & 3 \\ -5 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}}_{B} \underbrace{\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}}_{x} + \underbrace{\begin{bmatrix} e_1 \\ e_2 \\ e_3 \end{bmatrix}}_{e}.  (2.30)

In the causal graph of Fig. 2.3, the variable x_3 receives no directed edge from either of the other two variables x_1 and x_2; that is, x_3 has no parent among the observed variables. There is also no observed variable with a directed path to x_3; that is, x_3 has no ancestor among the observed variables. A directed path is a sequence of directed edges with the same orientation. For example, in the causal graph of Fig. 2.3, there is no directed edge from x_3 to x_2, but there is a directed path from x_3 to x_2 via x_1. Since no observed variable has a directed edge or directed path to x_3, we may put x_3 in the first place of the causal ordering and the remaining variables x_1 and x_2 in later places; the variables in the later places then do not cause the variable in the first place.

Next, we determine the places of the remaining variables x_1 and x_2 in the causal ordering. There is no directed edge or path from x_2 to x_1, whereas there is a directed edge from x_1 to x_2. Therefore, putting x_1 in the second place and x_2 in the third place, the later variable does not cause the earlier one. In summary, putting x_3, x_1, and x_2 in the first, second, and third places of the causal ordering, respectively, there is no directed edge or path from a variable in a later place to a variable in an earlier place. That is, a variable later in the causal ordering cannot be a cause of an earlier variable. Using k(.), this causal ordering is represented by k(3) = 1, k(1) = 2, and k(2) = 3; for example, k(3) = 1 means that the variable x_3 is in the first place.


We now permute the observed variables of the model in Eq. (2.30) according to this causal ordering (i.e., x_3, x_1, x_2). Note that the observed variables x_1, x_2, x_3 and the error variables e_1, e_2, e_3 on the right-hand side must be sorted in the same ordering; otherwise, the equality in Eq. (2.30) would not hold. Permuting the observed variables on both sides and the error variables on the right-hand side according to the causal ordering yields the matrix representation

\underbrace{\begin{bmatrix} x_3 \\ x_1 \\ x_2 \end{bmatrix}}_{x_{permuted}} = \underbrace{\begin{bmatrix} 0 & 0 & 0 \\ 3 & 0 & 0 \\ 0 & -5 & 0 \end{bmatrix}}_{B_{permuted}} \underbrace{\begin{bmatrix} x_3 \\ x_1 \\ x_2 \end{bmatrix}}_{x_{permuted}} + \underbrace{\begin{bmatrix} e_3 \\ e_1 \\ e_2 \end{bmatrix}}_{e_{permuted}}.  (2.31)

The permuted coefficient matrix B_permuted is a lower triangular matrix with all diagonal elements equal to zero; that is, it is strictly lower triangular. No other causal ordering in this example makes B_permuted strictly lower triangular. A small sketch of this permutation check is given below.
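The following sketch (not from the original text; it assumes NumPy) finds a causal ordering from the coefficient matrix B of Eq. (2.30) by repeatedly picking a variable with no parents among the remaining variables, and then verifies that permuting the rows and columns of B according to this ordering makes it strictly lower triangular.

import numpy as np

# Coefficient matrix of the example in Eq. (2.30): x1 <- x3 (coefficient 3)
# and x2 <- x1 (coefficient -5).
B = np.array([[ 0.0, 0.0, 3.0],
              [-5.0, 0.0, 0.0],
              [ 0.0, 0.0, 0.0]])

def causal_ordering(B):
    """Return the variable indices in a causal ordering by repeatedly picking
    a variable with no parents among the remaining variables, i.e., a variable
    whose row of B, restricted to the remaining columns, is all zeros.
    (Assumes the graph is acyclic; otherwise no such variable can be found.)"""
    remaining = list(range(B.shape[0]))
    order = []
    while remaining:
        root = next(i for i in remaining if np.allclose(B[i, remaining], 0))
        order.append(root)
        remaining.remove(root)
    return order

order = causal_ordering(B)            # [2, 0, 1], i.e., x3, x1, x2
B_perm = B[np.ix_(order, order)]      # permute rows and columns together
print("causal ordering:", [f"x{i + 1}" for i in order])
print("permuted B:\n", B_perm)
print("strictly lower triangular:", bool(np.allclose(np.triu(B_perm), 0)))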

More generally, the LiNGAM model in Eq. (2.29) can be rewritten using a causal ordering k(i) (i = 1, ..., p) as follows:

x_i = \sum_{k(j) < k(i)} b_{ij} x_j + e_i  (i = 1, ..., p).  (2.32)