DATA HANDLING IN SCIENCE AND TECHNOLOGY Volume 31
DATA FUSION METHODOLOGY AND APPLICATIONS
DATA FUSION METHODOLOGY AND APPLICATIONS

Edited by
MARINA COCCHI

Series editors
BEATA WALCZAK
LUTGARDE BUYDENS
Elsevier
Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

Copyright © 2019 Elsevier B.V. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the Publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither AACCI nor the Publisher, nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-444-63984-4
ISSN: 0922-3487

For information on all Elsevier publications visit our website at https://www.elsevier.com/books-and-journals
Publisher: Susan Dennis
Acquisition Editor: Kathryn Morrissey
Editorial Project Manager: Laura Okidi
Production Project Manager: Prem Kumar Kaliamoorthi
Cover Designer: Christian Bilbow
Cover Drawing: Anita Accorsi

Typeset by TNQ Technologies
Contents

Contributors
Preface

1. Introduction: Ways and Means to Deal With Data From Multiple Sources
MARINA COCCHI
1. Motivation
2. Context, Definition
3. Main Approaches
4. Remarks in the User's Perspective
References

2. A Framework for Low-Level Data Fusion
AGE K. SMILDE AND IVEN VAN MECHELEN
1. Introduction and Motivation
2. Data Structures
3. Framework for Low-Level Data Fusion
4. Common and Distinct Components
5. Examples
6. Conclusions
References

3. General Framing of Low-, Mid-, and High-Level Data Fusion With Examples in the Life Sciences
AGNIESZKA SMOLINSKA, JASPER ENGEL, EWA SZYMANSKA, LUTGARDE BUYDENS, AND LIONEL BLANCHET
1. Introduction
2. Data Sampling, Measurements, and Preprocessing
3. Data Fusion Strategy
4. Data Fusion Strategies with Examples
5. Interpretation of the Outcomes
6. Conclusions
References

4. Numerical Optimization-Based Algorithms for Data Fusion
N. VERVLIET AND L. DE LATHAUWER
1. Introduction
2. Numerical Optimization for Tensor Decompositions
3. Canonical Polyadic Decomposition
4. Constrained Decompositions
5. Coupled Decompositions
6. Large-Scale Computations
References

5. Recent Advances in High-Level Fusion Methods to Classify Multiple Analytical Chemical Data
D. BALLABIO, R. TODESCHINI, AND V. CONSONNI
1. Introduction
2. Methods
3. Application on Analytical Data
4. Results
5. Conclusions
References

6. The Sequential and Orthogonalized PLS Regression for Multiblock Regression: Theory, Examples, and Extensions
ALESSANDRA BIANCOLILLO AND TORMOD NÆS
1. Introduction
2. How It Started
3. Model and Algorithm
4. Some Mathematical Formulae and Properties
5. How to Choose the Optimal Number of Components
6. How to Interpret the Models
7. Some Further Properties of the SO-PLS Method
8. Examples of Standard SO-PLS Regression
9. Extensions and Modifications of SO-PLS
10. Conclusions
References

7. ComDim Methods for the Analysis of Multiblock Data in a Data Fusion Perspective
V. CARIOU, D. JOUAN-RIMBAUD BOUVERESSE, E.M. QANNARI, AND D.N. RUTLEDGE
1. Introduction
2. ComDim Analysis
3. P-ComDim Analysis
4. Path-ComDim Analysis
5. Software
6. Illustration
7. Conclusion
References

8. Data Fusion by Multivariate Curve Resolution
ANNA DE JUAN AND R. TAULER
1. Introduction. General Multivariate Curve Resolution Framework. Why to Use It in Data Fusion?
2. Data Fusion Structures in MCR. Multiset Analysis
3. Constraints in MCR. Versatility Linked to Data Fusion. Hybrid Models (Hard-Soft, Bilinear/Multilinear)
4. Limitations Overcome by Multiset MCR Analysis. Breaking Rank Deficiency and Decreasing Ambiguity
5. Additional Outcomes of MCR Multiset Analysis. The Hidden Dimensions
6. Data Fusion Fields
7. Conclusions
References

9. Dealing With Data Heterogeneity in a Data Fusion Perspective: Models, Methodologies, and Algorithms
FEDERICA MANDREOLI AND MANUELA MONTANGERO
1. Introduction
2. Overview of Life Science Data Sources
3. Addressing Data Heterogeneity
4. Latest Trends and Challenges
5. Conclusions
References

10. Data Fusion Strategies in Food Analysis
ALESSANDRA BIANCOLILLO, RICARD BOQUÉ, MARINA COCCHI, AND FEDERICO MARINI
1. Introduction
2. Chemometric Strategies Applied in Data Fusion
3. Building, Optimization, and Validation of Data-Fused Models
4. Applications
5. Conclusions
References

11. Image Fusion
ANNA DE JUAN, AOIFE GOWEN, LUDOVIC DUPONCHEL, AND CYRIL RUCKEBUSCH
1. Introduction
2. Image Fusion by Using Single Fused Data Structures
3. Image Fusion by Connecting Different Images Through Regression Models
4. Image Fusion for Superresolution Purposes
5. Conclusions
References

12. Data Fusion of Nonoptimized Models: Applications to Outlier Detection, Classification, and Image Library Searching
JOHN H. KALIVAS
1. Outlier Detection
2. Classification
3. Thermal Image Analysis
Acknowledgments
References

Index
Contributors

D. Ballabio Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano Bicocca, Milano, Italy
Alessandra Biancolillo Department of Chemistry, University of Rome "La Sapienza", Rome, Italy
Lionel Blanchet Department of Pharmacology and Toxicology, NUTRIM School for Nutrition and Translational Research in Metabolism, Maastricht University, Maastricht, The Netherlands
Ricard Boqué Universitat Rovira i Virgili, Department of Analytical Chemistry and Organic Chemistry, Campus Sescelades, Tarragona, Spain
Lutgarde Buydens Radboud University, Institute for Molecules and Materials, Department of Analytical Chemistry, Nijmegen, The Netherlands
V. Cariou StatSC, ONIRIS, INRA, Nantes, France
Marina Cocchi Department of Chemical and Geological Sciences, University of Modena and Reggio Emilia, Modena, Italy
V. Consonni Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano Bicocca, Milano, Italy
Anna de Juan Chemometrics Group, Department of Chemical Engineering and Analytical Chemistry, Universitat de Barcelona, Barcelona, Spain
L. De Lathauwer KU Leuven, Department of Electrical Engineering ESAT/STADIUS, Kasteelpark Arenberg, Leuven, Belgium; Group Science, Engineering and Technology, KU Leuven - Kulak, Kortrijk, Belgium
Ludovic Duponchel Université de Lille, LASIR, Lille, France
Jasper Engel Biometris, Wageningen University and Research, Wageningen, The Netherlands
Aoife Gowen School of Biosystems and Food Engineering, University College Dublin, Dublin, Ireland
D. Jouan-Rimbaud Bouveresse UMR Ingénierie Procédés Aliments, AgroParisTech, INRA, Université Paris-Saclay, Massy, France; UMR PNCA, AgroParisTech, INRA, Université Paris-Saclay, Paris, France
John H. Kalivas Department of Chemistry, Idaho State University, Pocatello, ID, United States
Federica Mandreoli Dip. di Scienze Fisiche, Informatiche e Matematiche, Modena, Italy
Federico Marini Department of Chemistry, University of Rome "La Sapienza", Rome, Italy
Manuela Montangero Dip. di Scienze Fisiche, Informatiche e Matematiche, Modena, Italy
Tormod Næs Nofima AS, Aas, Norway; Quality and Technology, Department of Food Science, Faculty of Life Sciences, University of Copenhagen, Frederiksberg, Denmark
E.M. Qannari StatSC, ONIRIS, INRA, Nantes, France
Cyril Ruckebusch Université de Lille, LASIR, Lille, France
D.N. Rutledge UMR Ingénierie Procédés Aliments, AgroParisTech, INRA, Université Paris-Saclay, Massy, France
Age K. Smilde Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands
Agnieszka Smolinska Department of Pharmacology and Toxicology, NUTRIM School for Nutrition and Translational Research in Metabolism, Maastricht University, Maastricht, The Netherlands
Ewa Szymanska FrieslandCampina, Amersfoort, The Netherlands
R. Tauler IDAEA-CSIC, Barcelona, Spain
R. Todeschini Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano Bicocca, Milano, Italy
Iven Van Mechelen Research Group on Quantitative Psychology and Individual Differences, KU Leuven, Leuven, Belgium
N. Vervliet KU Leuven, Department of Electrical Engineering ESAT/STADIUS, Kasteelpark Arenberg, Leuven, Belgium
Preface

This book deals with data fusion, aiming to furnish a vision of the different available methodologies and of the data analytics challenges, while framing the nature of coupled data and showing how data fusion can enhance knowledge discovery. To this aim, the book also focuses on methods that allow unravelling the common and distinct information carried by different blocks of data.

The adoption of the data-driven discovery paradigm in science has led to the need to handle large amounts of diverse data. Drivers of this change are, on the one hand, the increased availability and accessibility of hyphenated analytical platforms and imaging techniques and the explosion of omics data, and, on the other hand, the development of information technology. Hence, the main challenge is how to face these multiple-source data and how to retrieve all the available information. One of the salient aspects is the methodology to integrate data from multiple sources, analytical platforms, different modalities, and varying timescales, as well as unstructured data. This is generally referred to as data fusion.

Data fusion is certainly a hot issue, as well as a multifaceted concept, and, as emerges from the literature, recent years have seen the progressive development of a wealth of methods. However, the prevailing attitude has been, with few exceptions, to present and discuss specific tools confined to a given discipline or context; among monographs, in particular, the focus has mostly been on remote sensing and multisensor data integration. A unified view is still lacking, and this prevents a useful contamination across disciplines as well as a wider understanding and proper dissemination of the methodology. This book has the ambition to address these issues, providing a comprehensive and comprehensible description of the current state of the art, offering a cross-disciplinary approach and a knowledge retrieval-based perspective, and presenting challenging and convincing applications.

The book is multiauthored and written as a collection of independent chapters that provide different perspectives and applications. When considered altogether, however, the chapters complement one another, giving a sound basis for understanding the data fusion process. The first five chapters cover the basic concepts and the main methodologies. The Introduction details the ways and means to accomplish data fusion and the related taxonomies, framing them in a user's perspective.
Altogether, Chapters 2, 3, and 5 provide a general framework for low-, mid-, and high-level data fusion methodologies, including motivating applications in life science and analytical chemistry; in particular, Chapter 2 presents a generic model for coupled data blocks aimed at recovering the full information in each single data block, as well as their common and distinctive information and the linking relations. Chapter 4 presents a very general and flexible mathematical framework, which allows matrices and/or tensors to be coupled through (partially) shared factors or through common underlying variables, addressing as well techniques suitable for handling big data. Chapters 6, 7, and 8 illustrate, in a tutorial manner, multiblock and multiset methods in the perspective of data fusion, showing the specificities of and the links among these approaches. The last four chapters (Chapters 9–12) are more oriented toward applications, such as food characterization and authenticity (Chapter 10) and image analysis (Chapter 11), or toward specific ambits: Chapter 12 presents an "ensemble" approach to data fusion with applications to outlier detection, classification, and image library searching, and Chapter 9 illustrates the challenges, and possible solutions from computer science, in dealing with data heterogeneity, e.g., in the fusion of semantic data in the life science domain.

A conspicuous effort has also been devoted to presenting an extensive bibliography; it is nonetheless incomplete, considering the amount of scientific literature on the subject, but it can serve as a good starting point.

Written by invited authors who are recognized experts in their fields, this book is addressed to graduate students, researchers, and practitioners; the chapters are written to be understandable to a large and diverse audience and contain enough information to be self-sufficient and readable independently of the other chapters. What the reader will find is data fusion put in context and perspective, a multidisciplinary view, detailed descriptions of methods, and challenging applications.

Finally, I would like to express my sincere gratitude to all the invited authors who accepted to cooperate and contributed to this book, as well as my apologies for the delays and difficulties somehow encountered. I also wish to thank the several people who, in one way or another, supported this project.
CHAPTER 1

Introduction: Ways and Means to Deal With Data From Multiple Sources

Marina Cocchi
Department of Chemical and Geological Sciences, University of Modena and Reggio Emilia, Modena, Italy
1. MOTIVATION

The interest in data integration or fusion has gained renewed attention in recent years owing both to the continuous development of instrumental techniques and sensor devices and to a paradigm change in the study of complex systems toward holistic, data-driven approaches [1,2]. On the one hand, technological development in instrumentation has increased data acquisition speed, the coupling of different instrumental modalities (hyphenation), their portability (in situ, on/in-line), and data storage capacity, thus leading to an enormous growth of the data available for analysis; on the other hand, the paradigm shift relies on data-intensive statistical exploration and data mining for knowledge discovery. The need for data integration is thus becoming ubiquitous and encompasses several different disciplines. Roughly summarizing, three main areas can be identified, which also correspond to the main reference communities (listed distinctly for brevity, but not to be intended as exclusive): multisensors/remote sensing [3,4] (engineering, geoscience, signal processing), internet of things/big data [5,6] (informatics, machine learning, computer science), and joint analysis of multiple data sets acquired by different modalities (different types of instruments, experiments, and settings) [7] (chemometrics, psychometrics, applied mathematics).
Each area focuses on a different motivation for data fusion. (1) For the first, it is improving decision making, i.e., aiming at automatic decisions based on the fusion of data acquired on the same scene from multiple sensors (of different nature, at different locations, and at different times), e.g., in robotics applications, such as drones, to provide obstacle detection and localization [8], or in remote sensing to achieve high spatial–spectral–temporal resolution by integrating images of different nature, each showing high resolution in only one of these respects [9]. (2) In the second area, the main challenge is to integrate information from heterogeneous (e.g., text, numerical, relational) and flowing (i.e., continuously generated/updated) data [10]. Examples include data generated by recording navigation through websites, mobile apps, and social media posts; measurements from sensors embedded into objects (e.g., bar codes) or from wearable sensors; and data collected from different databases on the same items, e.g., an illness (clinical and genomic databases, metabolic pathways). The main expectation from data fusion in this area is to improve the management of systems and resources, benefiting from linking information pertaining to different domains, also in real time, and to improve knowledge of systems' behavior. (3) In the third area, the main aim is to obtain improved models, either exploratory or predictive, from the joint analysis of the diverse data sets and, as a consequence, to enhance the understanding and solving of the phenomenon or problem at hand [7]. In this context, the interest is most often in assessing the commonalities and differences between the different data sets, also taking into account their linking relations [11]. The main focus here will be on the latter context; however, the discussed methodologies have broad applicability, and it is worth noticing that the motivations and objectives overlap. The underlying assumption in performing data fusion is that the outcome will be more informative than any result obtained by the distinct analysis of each single data set (source). In other words, it is assumed that complementarity or synergy among the different data acquisition modalities will enable either a unified and enhanced view of the system under study or improved modeling/decision making and understanding of the system/phenomenon.
2. CONTEXT, DEFINITION

Data fusion as a specifically defined area of research may be dated back to the 1990s [4,12–14], whereas if we refer to the methods used in the joint analysis of multiple data sets, basic references start with, e.g., canonical correlation analysis (CCA), parallel factor analysis (PARAFAC), and tensor decomposition [15–17]. An idea of the impact and growth of
FIGURE 1.1 (A) Number of publications retrieved by a Scopus search for "data fusion" in the article title. Top: percentage of documents per subject area; bottom: number of documents per year. (B) Number of publications retrieved by a Scopus search for "data fusion" in the article title and "chemometrics", "chemistry", "analytica", or "omic" in the journal title. Top: percentage of documents per subject area; bottom: number of documents per year.
interest in the field can be gathered from Fig. 1.1, wherein the number of publications and the coverage of disciplines are reported. It may be observed that there is a fast increasing trend from 1996 and that the majority of publications are in the engineering and computer science areas (Fig. 1.1A). Limiting the search to journals linked mainly to chemistry, chemometrics, and omics, the trend is almost the same; however, the number of publications is definitely lower. This highlights that only very recently has the joint analysis of multiple data sets in these disciplines been considered under the data fusion framework. Several definitions of the term "data fusion" are present in the literature, which mainly differ with regard to the degree of generality and also depend on the specific research area. One of the first and most popular definitions, at least in the multisensor area, was introduced by the Joint Directors of Laboratories and the US Department of Defense in Ref. [13], wherein DF is defined as: "A process dealing with the association, correlation, and combination of data and information from single and multiple sources to achieve refined position and identity estimates, and complete and timely assessments of situations and threats as well as their significance." This is very specific to the military context in the objective to be achieved by DF and focuses on the possible relations among the data sources and the type
of connecting levels, as detailed in Ref. [4]. Broader definitions, still pertaining to the multisensor DF field and the main tools/methodologies employed there, have been reviewed in Refs. [4] and [3]. The general task to be accomplished by DF in this context can be described as that of tracking one or several targets, monitored by a set of sensors, whose information has to be fused to establish a predictive model and support decision making. Salient issues are defining tools for data association (establishing which measure pertains to each target), state estimation (e.g., the target position), and the decision fusion rules. A revision and reassessment of definitions by the remote sensing community, to achieve greater flexibility and generality, was done in 1998 and reported by Wald [18]: "Data fusion is a formal framework in which are expressed means and tools for the alliance of data originating from different sources. It aims at obtaining information of greater quality; the exact definition of 'greater quality' will depend upon the application." The merit of this definition is to avoid circumscribing DF to specific tools and applications, and also to place emphasis on quality, i.e., it focuses on the concept that DF should bring more satisfactory results, while specifying that the improvement has to be judged according to the specific task/problem at hand. Moving to the general data analysis context, i.e., leaving aside the specific focus on sensors as data acquisition tools, the DF scenario is that of gathering enhanced information, in a perspective of data-driven knowledge, about system(s) (phenomenon(a)) observed with different types of instrumental techniques (e.g., going from a temperature probe to spectroscopic and hyphenated chromatographic–spectroscopic techniques, including imaging-based ones), at different times, in different conditions/environments, or under varying experimental setups. The corresponding data sets are denoted by different terms, such as multiset, linked or coupled (underlining that they are acquired on the same system/phenomenon), or multimodal (intending each of the above techniques or conditions as a modality) data; DF pertains to performing the analysis of such data [7,11]. To this aim, the focus should be on the nature and structure of the data together with the inherent research questions that are posed. This is where a much broader set of data analysis methods, e.g., from the chemometrics, psychometrics, and applied mathematics communities, comes into the DF context, as recently illustrated [11,19] in terms of DF prospects and challenges. Van Mechelen and Smilde ([11] and Chapter 2) provided a formal description of coupled data as "a collection of data blocks, with the connections between blocks consisting of shared modes" and proposed a generic framework for decomposition techniques applied to coupled data, for the subset of problems wherein the interest is in recovering the full information
in each single data block, as well as their common and distinctive information and the linking relations. Lahat et al. [7], in reviewing the DF framework for multimodal data, placed the accent on "how to exploit diversity," which is "the property that allows to enhance the uses, benefits and insights in a way that cannot be achieved with a single modality," and according to this perspective adopted the following DF definition: "Data Fusion is the analysis of several data sets such that different data sets can interact and inform each other." They discussed DF in theoretical/mathematical terms (restricting to data that can be represented by multilinear relationships and to "latent variable" modeling, in its broad sense), focusing on the link uniqueness–constraints–diversity [20,21]; referring to Ref. [21], they exploited the concept of DF as a way of introducing factor uniqueness, based on the fact that the joint factorization of two matrices [22] sharing at least one factor can be unique (whereas the decomposition of a single matrix is not, unless constraints, such as orthogonality, sparseness, etc., are imposed). According to Ref. [21], there is a close relation between joint factorization and tensor decomposition, and this has been used as the foundation of the general structured data fusion framework ([21] and Chapter 4). Lahat et al. [7] also reviewed different coupling models, such as coupled independent vectors [23] and coupled matrix and tensor factorization (CMTF) [22], their generalizations, and those of others, underlining as a final remark: "Properly linking data sets can be regarded as introducing a new form of diversity, and this diversity is the basis and driving force of data fusion." Summarizing, common to the different definitions and disciplines is the concept that DF is a framework, fit by an ensemble of tools, for the joint analysis of data from multiple sources (modalities), which allows achieving information/knowledge not recoverable from the individual ones, although the degree of generalization, formalization, and unification of the methodology is distinctive. Besides the DF definition, another taxonomic concern is the so-called levels at which "fusion" is operated. Three levels of DF were defined [24] and have become a reference classification, e.g., in chemometrics and related application areas [25–28]: low level (raw data or observational level), where DF methods are directly applied to the raw data sets/blocks; mid level (feature level or state-vector level), where DF operates on features extracted from each data set (in more general terms, this level can be referred to as DF on preprocessed data); and high level (decision/information level), where decisions (model outcomes) from the processing of each data block are fused (Chapter 5). These are also called levels of abstraction in information fusion in Ref. [4]. Other classification schemes, which are more relevant to the multisensor field and will not be further considered here, concern the architecture [4], i.e., the way the individual sources are processed, and the information
conveyed when building automated decision systems (e.g., to treat problems pertinent to area 1 described in Section 1). A main distinction, which can be applied to broader contexts, is between parallel and sequential processing, where the data from different sources are initially processed separately and an intermediate decision step (independent if the processing is done in parallel, hierarchical if it is done sequentially) is taken before merging their outputs. These cases are distinguished from "true" DF in Ref. [19] because the different sources (modalities) are not "let to fully interact and inform each other." The following sections frame the main DF approaches and applications in the perspective of the joint analysis of multiple data sets and chemometrics-related areas.
3. MAIN APPROACHES

In Fig. 1.2 the different steps involved in the data fusion process are schematically reported. First, an overview of the different DF levels/approaches is presented; then each route, depicted by a different arrow style in Fig. 1.2A, is described in more detail.
FIGURE 1.2 Data fusion approaches. (A) Each path, corresponding to a different DF level, is identified by the arrow style: solid, low level; dotted, mid level; dashed, high level. X1–Xb indicate the raw data blocks; F1–Fb indicate the features extracted from each data block; and D1–Db indicate the outputs (decisions) from the separate modeling (classification, regression, clustering, etc.) of each data block. The left-oriented dashed arrow from F1–Fb pointing to the modeling box highlights that mid- and high-level data fusion can be applied sequentially. (B) Example of a low-level data fusion path on kernels calculated from each data block.
Characteristic of DF at "low level" is that the data are modeled (either for exploratory or predictive purposes) after fusion/coupling. A main advantage is the possibility of interpreting the results directly in terms of the original variables collected in each data block. DF at "low level" may be undertaken either by using a variety of methods that directly operate on several data blocks, joined or coupled, by decomposition, i.e., multiblock, multiset, tensor, and coupled matrix-tensor factorization methods, or simply by concatenating the different data blocks. Actually, the latter approach is a specific case of the former, but in the literature they are somehow distinguished: simple concatenation (mainly applied to data blocks sharing the samples mode) is considered as just producing an augmented data set, which is then modeled as if it were a single block, without analyzing the research questions pertinent to the role/contribution/peculiarity of each block, which are instead handled by the first category of methods. The main concern is then the scaling procedure to be applied to each block before concatenation (fusion).

DF at "mid level" implies a first modeling step, before fusion, aimed at extracting features from each data block; mainly two approaches are used, either decomposition (or resolution) techniques or variable selection methods. The obtained features are then fused in a "new" data set, which is modeled to produce the desired outcome, which could be, e.g., an exploratory model, the prediction of a given property, the assignation to a class (supervised) or a cluster (unsupervised), etc. Model interpretation in terms of the original variables is straightforward if variable selection is used in the first step, whereas otherwise it requires linking back the salience of each feature in the final model to the corresponding pattern in terms of original variables, e.g., a loadings profile in the case that principal component analysis (PCA) has been used to decompose the original data blocks. Also, it is worth making a distinction between applying unsupervised or supervised modeling in the first step, because the requirements in terms of validation are different. The use of a supervised method at this stage may be seen as actually leading to a DF level intermediate between mid and high levels, i.e., it is not properly "high level" because only the features, not the decisions (the model results), are fused, but on the other hand, the extracted features bear information about the final model sought (second step of modeling).

DF at "high level" implies the fusion of decisions (model results) obtained from the disjoint modeling of each data block, generally supervised. In this case, the focus is on the final outcome (e.g., the correct prediction of the class or attribute of each sample), and generally the role of each data block (and its original variables) is not investigated, because a fused model in the strict sense is not obtained but only a "fused decision."
In addition to these well-established DF approaches, an "ensemble" approach to DF has recently been proposed, where a collection of nonoptimized models is ensembled ([29], Chapter 12). Finally, DF can also be operated in kernel space, i.e., each data block is transformed into a kernel matrix, $K_b$, which collects the inner products of the samples in a specific function space (kernel). Applying a linear kernel gives $K_b = X_b X_b^T$; in the more general case, nonlinear kernel functions (polynomial, radial basis function [RBF], etc.) can be used to handle nonlinearity with linear algorithms (e.g., PCA, partial least squares [PLS]). DF is then operated on the kernel matrices (of dimension samples by samples), and the most used approach is to combine them by a weighted sum [30–32], as shown in Fig. 1.2B for a "low-level" DF context; obviously, the kernel transformation of each data block can also be seen as a preliminary step preceding either "mid-level" or "high-level" fusion (operating on each kernel matrix). We are at present developing an approach that shares most of the characteristics of the ensemble methodology and is applied to adjacency matrices obtained from each data block [33].
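As a rough illustration of this kernel route, the following sketch (Python/NumPy; the linear kernel, trace normalization, and equal weights are illustrative assumptions, not prescriptions from Refs. [30–32]) builds one samples-by-samples kernel per block and fuses them by a weighted sum, as in Fig. 1.2B:

```python
import numpy as np

def linear_kernel(X):
    """Linear kernel K_b = X_b X_b^T computed on a column-centered block."""
    Xc = X - X.mean(axis=0)
    return Xc @ Xc.T

def fuse_kernels(blocks, weights=None):
    """Weighted sum of per-block kernel matrices (samples x samples).

    All blocks must contain the same samples in the same row order.
    Each kernel is trace-normalized so that no block dominates by scale.
    """
    if weights is None:
        weights = [1.0 / len(blocks)] * len(blocks)
    n = blocks[0].shape[0]
    K = np.zeros((n, n))
    for w, X in zip(weights, blocks):
        Kb = linear_kernel(X)
        K += w * Kb / np.trace(Kb)
    return K

# Hypothetical example: two blocks measured on the same 50 samples
rng = np.random.default_rng(0)
X1 = rng.normal(size=(50, 700))  # e.g., NIR spectra
X2 = rng.normal(size=(50, 120))  # e.g., a GC-MS peak table
K = fuse_kernels([X1, X2])

# The fused kernel can feed any kernel-based linear algorithm, e.g., kernel PCA:
eigvals, eigvecs = np.linalg.eigh(K)
scores = eigvecs[:, -2:] * np.sqrt(eigvals[-2:])  # top two components
```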
3.1 Low-Level DF

Fig. 1.3 illustrates some examples of connected data blocks, highlighting the shared mode(s) in bold. Simple concatenation, as discussed earlier, is possible for homogeneous data blocks, i.e., of the same order, sharing a number of modes equal to the array order minus one (Fig. 1.3A). Data blocks of second order can be concatenated sample wise when the same set of samples is analyzed by different techniques, such as nuclear magnetic resonance (NMR) and near-infrared (NIR) spectroscopy, or variable wise if the same variables are measured for different sets of samples (e.g., process sensor data collected at two different reacting units). An example of variable-wise concatenation of data blocks of third order can be air pollution monitoring in different cities by 2D fluorescence (excitation–emission matrix [EEM]) spectroscopy (in this case the two spectral modes, excitation and emission, are shared, whereas a distinct set of sampling points is acquired for each city); a case of sample-wise concatenation can be the acquisition, for the same patients during the follow-up period, of clinical (e.g., blood tests) and metabolic data (e.g., by NMR of urine); in this case, the patients and time modes are shared, whereas the measured variables differ for each data set. On the other hand, when only one mode (of a third-order array) is shared, as illustrated in Fig. 1.3B, or data blocks of different order (heterogeneous) are coupled (Fig. 1.3C), simple concatenation is not possible and data fusion strategies are based on methods that jointly
FIGURE 1.3 Examples of different data block structures. (A) Data blocks of the same order: 2D arrays sharing one of the modes, or 3D arrays sharing two of the modes. The shared modes are in bold. In this case, the data blocks can simply be concatenated. (B) Data blocks of the same order, i.e., 3D arrays, which share only one of the modes (in bold). (C) Data blocks of different order sharing at least one mode.
analyze the different data blocks (obviously, they also apply to the cases illustrated in Fig. 1.3A). These methods have been reviewed in several articles [11,19,21,22,34]. In particular, in Ref. [11] they are analyzed in terms of the research questions that can be addressed, considering as well the challenges pertaining to the specific nature/structure of the coupled data (these concepts are reprised in Chapter 2, by the same authors, where a general framework for low-level DF is presented and discussed). To give a preliminary overview of these methodologies, it is useful to focus on the association among the data blocks and on what information is desired to be retrieved. The data blocks are considered (1) exchangeable when there is neither a hierarchy (all of them have the same importance) nor a dependency relationship among them, e.g., as in exploratory data analysis, and (2) not exchangeable in the other cases, as in predictive modeling, where one
block will be explained as a function of all the others, or in path modeling [35,36], where a network of relationships among data blocks is considered, i.e., some data blocks (not all) have causal relationships with some others. DF strategies in case (2) include (not a comprehensive list) multiblock PLS (MB-PLS) [37–41], hierarchical PLS [42], sequential and orthogonalized PLS (SO-PLS) ([43], Chapter 6), On-PLS [44], Path-PLS [45], and P-ComDim (Path-ComDim) ([46], Chapter 7). The main data fusion approaches in case (1) can be considered under the general coupled factorization framework, which implies that the different data blocks are decomposed in factor (component) models and share some factor matrices (or parts of them) depending on their coupled modes (or type of linking) [19,21,47,48]. They can be distinguished according to different criteria, one of which is the data block structure they can handle. Tensor decomposition methods, such as PARAFAC [49,50] and higher order singular value decomposition [51], require data blocks that can be stacked in a single tensor (as in Fig. 1.3A). Collective matrix factorization [52] deals with matrices coupled in at least one mode, which share a factor matrix, and includes, as a specific case, simultaneous component methods (e.g., simultaneous component analysis [SCA], multiple factor analysis [MFA], and STATIS [34,53]), for which the joint decomposition model can be expressed, in the case the samples mode is shared, as:

$$w_b X_b = T V_b^T + E_b \tag{1.1}$$

$$\min_{(T,\,V_b)} \sum_b \left\| X_b - T V_b^T \right\|^2 \tag{1.2}$$
where Eq. (1.1) defines the model of each data block, $T^T T = V^T V = I$, and $w_b$ represents an optional weighting that can be applied to each data block (e.g., $w_b = 1$ in SCA, whereas it is the inverse of the first singular value of $X_b$ in MFA [34]), and Eq. (1.2) expresses the loss function (in the case the variable mode is shared, $V$ in Eqs. (1.1) and (1.2) will be the same for each data block and $T_b$ will be block specific). A systematic comparison with sequential multiblock methods, such as hierarchical PCA, has not been reported in the literature; however, the main properties of hierarchical and consensus PCA are illustrated in Ref. [42]. Data block structures as depicted in Figs. 1.3B and 1.3C can be handled by coupled matrix and tensor factorization (CMTF [22,47]), which is based on early work by Harshman (incomplete and linked-mode PARAFAC, Section 5 in Ref. [49]). The CMTF formulation, e.g., in the case of a third-order tensor and a matrix coupled by the sample mode, can be written as follows (but it can be generalized to any number/order of arrays and different coupled modes):

$$\min_{(A,B,C,V)} \left\| \mathcal{X}_1 - [\![ A, B, C ]\!] \right\|^2 + \left\| X_2 - A V^T \right\|^2 \tag{1.3}$$
where the tensor $\mathcal{X}_1$ is decomposed by a CANDECOMP/PARAFAC (CP) model and the factor matrix $A$ is shared. This model has also been extended to decompositions other than CP and to other loss functions [54]. Now focusing on the information that is desired to be retrieved by data fusion: in many applications the relevant research question is to assess the common and distinctive information inherent to each data block [11]. The joint factorization models presented earlier, i.e., SCA and CMTF, assume that all factors/components in the common factor mode are shared, and this represents a limitation if both common and distinctive information has to be retrieved. Several methods have been developed to this aim, and the most popular have recently been compared and expressed under the same general framework ([53], Chapter 2); in particular, a clear description has been given in terms of which subspaces the common and distinctive components span. There are basically two ways in which common and distinctive components can be identified: (1) by rotating the solution resulting from a simultaneous factorization (such as the one obtained by SCA) toward a target matrix that holds the information about unshared components, e.g., as done in the distinctive and common component analysis (DISCO) method [55,56], or by defining a priori some elements of the component(s) as shared and others not (e.g., as can be implemented in incomplete PARAFAC [49]); and (2) by introducing sparsity constraints on the factors. As a representative example, in Ref. [57] sparsity is introduced in the context of matrix factorization, whereas in CMTF sparsity has been implemented by adding penalties as reported below, and the method has been named structure-revealing data fusion, or advanced CMTF (ACMTF) [22,58]:

$$\min_{(A,B,C,V,\lambda,\sigma)} \left\| \mathcal{X}_1 - [\![ \lambda; A, B, C ]\!] \right\|^2 + \left\| X_2 - A \Sigma V^T \right\|^2 + \beta \|\lambda\|_1 + \beta \|\sigma\|_1 \tag{1.4}$$

where $\lambda$ and $\sigma$ hold the weights of the rank-one components of the third-order tensor ($\mathcal{X}_1$) and the matrix ($X_2$), respectively, and $\Sigma$ is a diagonal matrix with the entries of $\sigma$ on its diagonal. The weights are sparsified by the 1-norm penalties tuned by the $\beta$ term, so that unshared factors will have weights close to zero. Multivariate curve resolution (MCR) [59,60] applied to multiset structured data can also be seen under the general framework of CMTF, where the data block structure and linking are handled by specifying proper constraints; this methodology is thoroughly discussed in Chapter 8. CCA is one of the first methods describing the common variation in two data sets (coupled sample wise) [15]; differently from the joint factorization methods discussed earlier, instead of finding an explicit set
of shared factors (also referred to as a hard link among data blocks in Ref. [19]), it focuses on finding linear combinations of the data blocks (canonical variates) that have maximum correlation with each other; in other words, CCA seeks the coordinate system that describes the maximum cross-covariance between two data sets (establishing a "soft link" among the data blocks, according to Ref. [19]). CCA has been generalized to deal with more than two data sets (generalized canonical correlation analysis, GCCA [61]) and to deal with multiway arrays [62]. Moreover, to overcome the issue that canonical variates may poorly describe the single data blocks, because the amount of variance explained is not taken into account in the optimization, a PCA step before CCA has been proposed [63]. Relations between CCA and other methods can be found in Refs. [53,63,64]. A method that shares some relation with GCCA is common components and specific weights analysis (CCSWA or ComDim) [32,65], which operates on the sum of the product matrices $X_b X_b^T$ (linear kernels); this method has also been extended to nonexchangeable data blocks (predictive context and path modeling), and it is presented in detail in Chapter 7. A very general and flexible mathematical framework, which allows any data block structure and flexible links across blocks (varying numbers of shared and partially shared modes), is represented by structured data fusion (SDF) [21], whose distinctive characteristic is its modularity. As in ACMTF, each data block (a tensor of any order) is approximated by a tensor decomposition model (which may differ for each data block) with some factor matrices shared (wholly or in part, e.g., single rows, columns, or entries); in addition, any structure can be imposed on the factor matrices, in the form of a function of the underlying variables, e.g., factor entries can be constrained to vary between 0 and 1 (as in example 16 described in Chapter 4). Modularity means that the decomposition method, the loss function, the regularization terms, and the factor matrix structure can be set distinctly for each data block; algorithms suitable in the SDF context are discussed in detail in Chapter 4. Several methods have also been developed that use independent component analysis as the decomposition method in joint factorization; a detailed description in the DF framework can be found in Refs. [19,66,67]. Notwithstanding which low-level DF approach is adopted, as already observed, a main issue is how to deal with the heterogeneity of the data variables (data entries in general) with respect to different physical units, measurement scales, etc. This issue can be addressed by preprocessing, such as scaling and normalization; a large volume of literature has underlined the need to take into account the specific nature of the data and to evaluate the distortions that may be introduced [68,69].
A further issue is how to ensure a "fair" contribution of each data block in the fusion process, considering their different sizes, variances, and information contents. In simple concatenation, this issue is in most cases dealt with by block scaling, e.g., to equal sum of squares, whereas in multiblock methods it is usually dealt with by weighting the data blocks, with the weighting scheme chosen according to the objective pursued, e.g., having an equal contribution from each data block, or weighting a data block more if it has a lower degree of redundancy with the others, etc. A comparison of the preprocessing strategies (at both the variable and the block level) adopted in several methods, and a discussion of the implications of the different choices, can be found in Ref. [34]. Anyhow, the problem of data comparability/commensurability is broader, and it may not be easy to assess whether scaling can fully solve it [11]. Other aspects that can be addressed by preprocessing are misalignment among data blocks, e.g., in time or space, outliers, and spurious data. A critical issue is also represented by possible differences in the type/level of noise present in the different data blocks, which may affect the data fusion process. Some proposals to overcome this issue are based on Bayesian or maximum likelihood frameworks [66,70]. A further remark concerns the still open question of general criteria to identify the number of common and distinctive components, as observed in Ref. [64], where a preliminary proposal is formulated.
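Before moving to mid-level DF, a minimal sketch of the simple-concatenation route (Python/NumPy; the autoscaling, equal-sum-of-squares block scaling, and three components are illustrative choices, not the only possible ones) ties together the block-scaling and joint-decomposition ideas of this section:

```python
import numpy as np

def block_scale(X):
    """Autoscale columns, then scale the whole block to unit sum of squares."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    return Xs / np.linalg.norm(Xs)  # Frobenius norm -> equal block weight

# Hypothetical blocks sharing the samples mode (rows)
rng = np.random.default_rng(1)
X1 = rng.normal(size=(40, 500))  # e.g., spectra
X2 = rng.normal(size=(40, 30))   # e.g., process sensor readings

X_aug = np.hstack([block_scale(X1), block_scale(X2)])  # low-level fusion

# A single SVD of the augmented matrix gives an SCA-type solution
# (cf. Eqs. 1.1 and 1.2): shared scores T and block-specific loadings V_b.
U, s, Vt = np.linalg.svd(X_aug, full_matrices=False)
T = U[:, :3] * s[:3]                     # shared sample scores (3 components)
V1, V2 = Vt[:3, :500].T, Vt[:3, 500:].T  # loadings split back per block
```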
3.2 Mid-Level DF

Data fusion at mid level requires two modeling steps (Fig. 1.2): the first is used to extract features from each data block ($F_k$), which are then fused and modeled in the second step to obtain the desired objective(s), which can be data exploration, accomplishing a classification or regression task, etc. The feature extraction step can be accomplished by different strategies: (1) applying feature selection strategies; (2) calculating global features, such as summary statistics (e.g., standard deviation and other moments) or other simple properties; and (3) applying factorization (decomposition) methods and using the resulting components (factors, latent variables) as features. A subcase of strategy (3) is to use as features the peak areas of resolved chromatographic/spectral profiles, which can be obtained by decomposition methods such as PARAFAC and MCR. In most applications, mid-level DF concerns data blocks coupled sample wise, but there is no intrinsic limitation on which mode(s) can be compressed. The choice depends obviously on the final aim of the data analysis and the motivation for going mid level. Fig. 1.4 shows some possible scenarios for applying mid-level DF to the data blocks presented in Fig. 1.3.
FIGURE 1.4 A few examples of different strategies for mid-level data fusion. (A) Features from the two spectroscopic data sets are obtained by a variable selection method, which can be unsupervised, e.g., peaks selected by experts or according to a noise threshold value, or supervised, e.g., by VIP, SR, iPLS, or GA. (B) Features from the two data blocks (3D arrays) are obtained by a decomposition (factorization) method, such as PARAFAC or Tucker (unsupervised), NPLS (supervised), etc. (C1) and (C2) present two different ways of extracting features from the two data blocks X1 and X2, both 3D arrays sharing the third mode, i.e., "year". (C1) Global variables, such as standard deviation, average, and so on, are calculated for each "weather" variable, e.g., temperature, humidity, etc., over seasons (block X1) and over locations (block X2), thus obtaining a matrix (year × features) for each data block. (C2) Block X1 has been decomposed, e.g., by PARAFAC, and the third-mode (year) loadings (for each PARAFAC factor) are taken as features (F1), while block X2 is first unfolded and then analyzed by MCR; the concentration values for each location corresponding to each year are then used as features (for each MCR component). In the C2 approach, the information about behavior at the different locations is maintained.
The main pros of adopting a mid-level DF approach are that the noninformative variance can be removed in the feature reduction step, and thus the final models may show better performance (e.g., in prediction). This approach also solves the issue of concatenating data blocks of different order (e.g., each of them can be reduced to latent variables by an appropriate multiway/multiset technique), or that of obtaining, after concatenation, quite a large matrix that may be difficult to handle (memory, computation time) or that may prevent the use, in the second step, of some modeling methods (e.g., linear discriminant analysis [LDA]
classification); furthermore, the issue of scaling the data blocks to make them contribute equally may become less severe. On the cons side, mid-level DF requires building as many models as there are data blocks (if strategy 3 above is applied) or applying a feature selection algorithm to each data block (if strategy 1 is applied), thus becoming more demanding especially in terms of validation (number of samples). In fact, for each single model an internal validation to set the model (or feature selection algorithm) parameters is required, as well as a further independent validation set (not to be used in the development of any of the first-step models). Moreover, more effort is required to interpret the results in terms of the raw variables (when strategy 3 is adopted) because the feature contributions in the final model are related to the salience of the extracted factors. Anyhow, going back to the raw variable contributions can be accomplished by using the loadings (e.g., if PCA is used as the factorization technique) or the resolved component profiles (e.g., for PARAFAC, MCR) obtained by the single data block models. Different considerations come into play when comparing mid-level DF with low-level DF based on coupled factorization methodology. In this case, the low level has the advantage that all the information pertaining to the common, distinctive, and linking relations among the data blocks can be exploited, obviously provided a method allowing recovery of this information is selected. Furthermore, it is worth recalling that when the final objective is a predictive model, there are multiblock/multiset methods explicitly modeling data block dependency relationships (defined as type 2 in Section 3.1, e.g., SO-PLS, P-ComDim). However, these approaches have seldom been compared with mid-level DF [71], and then only in terms of final model performance, never systematically. The scaling of the single data blocks is less critical in mid-level DF than in low-level DF, but the issue of choosing the factor model dimensionality for each data block is introduced. Obviously, the same criteria as in single data set analysis apply; the risks are retaining irrelevant variability (or noise) or discarding parts of each single data block that would be relevant when the data blocks are analyzed jointly [72]. It has to be noticed that mid-level DF is more widespread in the predictive modeling context; in this case, a supervised model is usually also used in the first step of feature extraction, e.g., the feature selection criterion is based on predictive capability, or the PLS-DA components are used as features. As long as proper validation is applied, this may be a sound choice; a limiting issue can be the number of available samples with respect to those required for the double steps of validation, depending on the supervised method in use.
The mid-level DF framework also includes the application of one of the earlier discussed mid-level DF strategies to the kernel matrices calculated from the raw data blocks (Fig. 1.2B).
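To fix ideas on strategy (3), the following minimal sketch (Python/NumPy; PCA as the per-block factorization and the component counts are illustrative assumptions) follows the two-step path of Fig. 1.2:

```python
import numpy as np

def pca_scores(X, n_comp):
    """First modeling step: PCA scores (features) of one autoscaled block."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, _ = np.linalg.svd(Xs, full_matrices=False)
    return U[:, :n_comp] * s[:n_comp]

# Hypothetical blocks on the same 60 samples
rng = np.random.default_rng(2)
X1 = rng.normal(size=(60, 800))
X2 = rng.normal(size=(60, 45))

# Fusion: concatenate the extracted features into a "new" data set
F = np.hstack([pca_scores(X1, 5), pca_scores(X2, 3)])

# F (60 x 8) feeds the second modeling step (e.g., LDA, PLS-DA, clustering);
# as discussed above, the number of components per block must be validated.
```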
3.3 High-Level DF

High-level DF, also defined as decision fusion, is treated in detail in Chapter 5, so here it is only briefly discussed. This approach is less used in chemometrics, whereas it is prevalent in other fields, e.g., multisensors [3,4,24]. High-level DF fuses the outputs of several models. These outputs may be quantitative, when numerical values are produced (e.g., predicted values from different regression models), or qualitative, when an identity statement is made (e.g., class assignation). Several methods can be used for high-level DF, such as weighted decision methods, Bayesian inference, Dempster–Shafer theory of evidence, and fuzzy set theory (Chapter 5, [3,4]). When using/selecting a method, some critical issues have to be taken into account, such as the level of correlation among the data blocks (sources) and the identification of conflicting/contradictory information provided by the different models [3,19]. Generally, this level of DF is used when the main objective is to improve the performance of the final model and reach an automatic decision. It has to be remarked that the modeling approach can be either supervised (the most used) or unsupervised, e.g., the outcome can be the assignment to a cluster achieved by an unsupervised method. As observed for mid-level DF, when supervised methods are used, an independent validation set (not used in any of the single models) is needed to assess the performance of the final fused model. There is a link between high-level DF and ensemble methods [73], because these are based on the aggregation of multiple weighted models, e.g., from several weak classifiers, to obtain a combined one. Ensemble methods are mostly used in pattern recognition/classification but have been developed for regression as well [74,75]. Usually, they are applied to data from a single source using a set of classification/regression techniques, but the concept can be extended by considering several data blocks at the same time. The rules used to fuse the single model decisions (predictions) may vary, e.g., in the case of regression, from a weighted average taking into account the model errors to nonlinear strategies (based on an RBF network) [74]. The DF approach illustrated in Chapter 12 further generalizes the ensemble concept in the sense that the multiple models come not only from the use of several classifiers but also from using different scans of parameter values for each classifier.
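As a minimal sketch of a quantitative decision-fusion rule (Python/NumPy; the inverse-error weighting is one simple instance of the weighted-average rules mentioned above, and all numbers are hypothetical):

```python
import numpy as np

def fuse_predictions(preds, cv_errors):
    """Error-weighted average of the predictions of single-block models.

    preds: one prediction vector per single-block regression model.
    cv_errors: validation error (e.g., RMSECV) of each model; a lower
    error gives that model a higher weight in the fused decision.
    """
    w = 1.0 / np.asarray(cv_errors, dtype=float)
    w /= w.sum()
    return w @ np.vstack(preds)

# Hypothetical outputs of three single-block regression models
y1 = np.array([10.2, 11.0, 9.8])
y2 = np.array([10.5, 10.7, 10.1])
y3 = np.array([9.9, 11.2, 9.6])
y_fused = fuse_predictions([y1, y2, y3], cv_errors=[0.5, 0.3, 0.8])
```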
A particular case of decision fusion is represented by the fusion of semantic data from different sources, or the combination of numerical and semantic data [4,76,77], which is relevant, e.g., in multimedia event detection [76,77] or in life science to integrate information from literature databases and clinical evidence with gene expression data, metabolic pathway data, etc. ([78,79] and Chapter 9). The methodology used in Ref. [79] is based on the graph-regularized nonnegative matrix factorization approach [80], which is interesting because it offers a way to include the information derived from the proximity pattern of samples, or variables, or both, in the coupled matrix decomposition framework. Just to give a rough idea, consider as an example two data blocks that share the samples mode, and a proximity matrix for the samples derived from another source of information (e.g., if our samples are proteins, the proximity matrix may come from homology information). The loss function in matrix factorization may then be constrained by the proximity information (more generally, by the neighboring or connecting information from a graph structure) by introducing a penalty term, which increases when the samples' neighboring pattern in scores space departs from the graph structure.
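To make the penalty term concrete, a common graph-regularization form (a sketch, not necessarily the exact formulation of Refs. [79,80]; here $W$ is the samples' proximity/adjacency matrix, $D$ the diagonal degree matrix with $d_{ii} = \sum_j w_{ij}$, $L = D - W$ the graph Laplacian, and $t_i$ the score vector of sample $i$) is:

$$\operatorname{tr}\left( T^T L T \right) = \frac{1}{2} \sum_{i,j} w_{ij} \left\| t_i - t_j \right\|^2$$

This term is small when samples that are neighbors in the graph also have similar scores; it is added to the factorization loss, multiplied by a tunable regularization parameter.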
4. REMARKS IN THE USER'S PERSPECTIVE

The application domain of DF methods is very broad, and some domains were already mentioned in the previous discussion; just to cite a few that were recently reviewed or were the object of comprehensive studies: emotional analysis of video material [77], data fusion for wearable health monitoring [81], and data fusion of biological networked data (e.g., protein–protein interaction networks, gene interaction networks, metabolic interaction networks, and gene coexpression networks) [82]. The following chapters include, besides methodology, examples from omics, medicine, complex analytical data (multiplatform, hyphenated techniques), and social media/websites, and two chapters are specifically focused on image data fusion (Chapter 11) and the food context (Chapter 10), respectively. Nevertheless, some remarks can be added about the main question that likely occurs in practice, i.e., which approach (DF level/method) is the best. Obviously, the choice should be guided by the research objectives and the data [11], and the question needs to be contextualized. In Figs. 1.5 and 1.6, a couple of examples are schematically shown, which may offer several alternatives. In the quality control context, it is becoming usual to adopt a holistic/blind approach based on the extensive characterization (fingerprinting) of the products, and to this aim several analytical platforms are used (Fig. 1.5).
FIGURE 1.5 An example of DF in the food authentication context. The food product has been characterized by three analytical platforms: gas chromatography–mass spectrometry (GC-MS), excitation-emission fluorescence (EEM), and near infrared spectroscopy (NIR). Three alternative approaches to data fusion are presented: (1) structure revealing data fusion (ACMTF), top figure; (2) common components and specific weights analysis (ComDim), middle figure; (3) distinctive and common simultaneous component analysis (DISCO), bottom figure. The factor matrix (scores) for the shared samples mode is represented on the right; the C labels indicate common components and the D labels the distinctive components. Going from left to right, the first set of common components (the first C on the left) is common to GC-MS and EEM, the second set to EEM and NIR, and the third set to all three data blocks. The first set of distinctive components (the first D on the left) is for GC-MS, the second for EEM, and the third for NIR.
The most relevant question is what redundant and unique information is provided by each platform, so that it is possible to determine which platforms are really required to set up a quality monitoring protocol. In fact, the aim is to obtain the best characterization at the minimum cost. Thus, it is essential to gather the common and distinctive information contributions from each data block; to this aim, the low-level DF framework seems appropriate, and the different properties of the main methods that can accomplish this partitioning are discussed in Ref. [53].
FIGURE 1.6 An example of fusion of process sensors and NIR spectral data in a continuous process line. The top figure represents a schematic view of the process, indicating the location of the process sensors (represented by different shapes and colours) on the process line and the location of the two on-line NIR instruments. The process sensors that fall in the S1, S2, and S3 areas (delimited by the dotted shapes) are assigned to three distinct data blocks. NIR1 and NIR2 spectra are kept in two further distinct data blocks, respectively, and are separately compressed to features by, e.g., PCA or MCR. The data fusion for developing the monitoring charts can be accomplished by multiblock methods such as consensus (CPCA) or hierarchical (HPCA) PCA, while the predictive model of the quality variables (Y) can be developed by multiblock PLS (MB-PLS) or sequential orthogonal PLS (SO-PLS).
However, it may be worth compressing the raw data coming from analytical platforms such as NMR, gas chromatography–mass spectrometry (GC-MS), high-performance liquid chromatography–diode array detection (DAD), and similar, to integrated areas of resolved chemical components (e.g., by MCR or PARAFAC2) before fusion, while taking the raw data for techniques such as NIR spectroscopy, where bands strongly overlap and resolution is not feasible, thus using a mix of low- and mid-level DF.

The example in Fig. 1.5 can be taken as representative of the characterization of protected denomination of origin (PDO) wine categories by spectroscopic (NIR, EEM) and chromatographic (GC-MS) techniques, giving three data blocks of different order. Three alternative DF approaches are illustrated below.

Alternative 1: Structure revealing data fusion (ACMTF), which allows simultaneous decomposition of matrices and tensors. The data blocks, in
this example, share the samples mode; thus the mode 1 factor matrix (usually called the scores matrix) is the same, and the common (C) and distinctive (D) factors/components are recovered by analysis of the weights λ (GC-MS), σ (EEM), and γ (NIR). A factor/component is common to, e.g., the GC-MS and NIR data blocks if the respective weights λ and γ are close to 1 and σ is close to 0 (distinctive for EEM). The respective weighted factor models maximize the explained variance for each data block.

Alternative 2: Common components and specific weights analysis (ComDim). Because for each data block the product matrix $X_b X_b^T$ (a linear kernel) is first calculated, both matrices and tensors can be dealt with. The decomposition of the summed kernel matrices gives the global shared scores matrix; a minimal numerical sketch of this step is given below. The common (C) components can be recovered by analysis of the weights λ; a component is considered common to two or more data blocks if the λ weights are high for all of them. However, distinctive components are not specifically identified but are indirectly highlighted by the fact that only one data block has a dominant λ value. Moreover, the models for each single data block do not maximize the data block variance, in analogy with CCA.

Alternative 3: A "mixed" low-/mid-level approach, where features are extracted for GC-MS (e.g., by taking the areas of resolved GC profiles after decomposition by PARAFAC2 or MCR) and EEM (by taking the concentrations, i.e., mode 1 loadings, from a PARAFAC decomposition), whereas for the NIR block the raw data (spectra) are considered. The resulting blocks, which are now of the same order and share the samples mode, can be jointly analyzed by DISCO, which is specifically devoted to recovering common (C) and distinctive components [55]. Also here, the respective weighted factor models maximize the explained variance for each data block.

There is no a priori best strategy among these three, but it is clear that the models have different properties. ComDim focuses on common components and may poorly describe each single data block; on the other hand, it can handle data sets of different orders and, in principle, nonlinearity, because the first step is to calculate "kernels"; however, nonlinear kernels in ComDim analysis have not yet been explored. ACMTF seems more flexible but requires tuning of the penalty parameter, which introduces some subjectivity. The feature extraction step before applying DISCO has the advantage of using resolved, chemically meaningful components (obviously more effort in terms of modeling is required). Both ACMTF and DISCO furnish least squares models for each data block. DISCO is more interpretable in terms of the properties of the spanned subspaces [53].
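The following minimal NumPy sketch illustrates the ComDim kernel step for the first common component. It is a simplified version under our own naming: real ComDim implementations extract several components with deflation and differ in details.

    import numpy as np

    def comdim_first_component(blocks, n_iter=50):
        """Simplified first ComDim common component: build one linear
        kernel per block, decompose the salience-weighted sum, and
        update the saliences (lambda) iteratively."""
        kernels = [X @ X.T for X in blocks]        # linear kernel per block
        lam = np.ones(len(kernels))                # saliences, start at 1
        for _ in range(n_iter):
            W = sum(l * K for l, K in zip(lam, kernels))
            vals, vecs = np.linalg.eigh(W)
            q = vecs[:, -1]                        # global scores: top eigenvector
            lam = np.array([q @ K @ q for K in kernels])
        return q, lam                              # shared scores, block weights

    # Toy usage: three random blocks sharing 10 samples.
    rng = np.random.default_rng(0)
    blocks = [rng.normal(size=(10, p)) for p in (40, 15, 200)]
    blocks = [X - X.mean(axis=0) for X in blocks]  # column-center each block
    q, lam = comdim_first_component(blocks)
    print(lam / lam.sum())                         # relative saliences per block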
In Fig. 1.6, the second example concerns a process monitoring context, i.e., a continuous process in polymer production, where data are collected at different stages of the process by several process sensors measuring temperatures, pressures, and flows and by two NIR instruments equipped with online probes positioned at different stages of the process line. The different types of sensors are indicated by different shapes in the process scheme (top part of Fig. 1.6). The two NIR probes are named NIR1 and NIR2. The process sensors, on the basis of their location along the process line, are assigned to data blocks named S1, S2, and S3 (Fig. 1.6, top).

The objectives in this case are to build a process monitoring model (multivariate control charts) and a predictive model of the final product quality variables (Y-block). The first model should describe the normal operating conditions as well as possible, so as to allow fault detection and diagnosis. It is interesting to know the role of each monitored stage (to detect the causes of deviation) and how the variability in the process sensors affects the physicochemical characteristics of the intermediate products, as captured by the NIR spectra; moreover, the model should work in real time. A low-level DF approach by one of the multiblock methods can be suitable; e.g., consensus PCA has often been used in the literature [83,84], giving the possibility to inspect deviations in scores space (Hotelling T² control chart; a sketch of such a chart is given below) both at the global (superscores) and single-block levels through the block scores. However, the interpretation of a control chart based on single-block residuals is less obvious, because these most probably contain systematic variation that is not shared by all data blocks, as well as unsystematic variability (noise). Alternatively, low-level approaches based on other decompositions, e.g., collective matrix factorization, could also be considered, and again, how to build and interpret the monitoring charts is not trivial. A mixed low- and mid-level DF approach, where the spectral data blocks are compressed first (e.g., by PCA, giving the feature blocks F1 and F2, Fig. 1.6), could be advantageous to filter the huge unsystematic variability typical of NIR. Relevant issues also include how many models have to be handled in real-time monitoring and how the model(s) are maintained.

Considering now the second objective of predicting the quality variables by using the online NIR probes and the process sensors, the different DF approaches may range from low-level, with one of the multiblock methods specific for the predictive context (e.g., MB-PLS, SO-PLS, etc.), to high-level DF, if single regression models (e.g., PLS) are built for each data block and the predicted Y values are then fused (for example by a weighted average of the single predictions). Here the guiding criteria should be the ease of model handling/updating, the possibility of handling missing data, and the industrial requirement on the magnitude of the prediction error. High-level DF is more demanding in terms of models to be maintained; mid-level DF could be a good compromise, and MB-PLS has been the most used method [85]. Again, an a priori best strategy cannot be envisaged; it also has to be remarked that the exploration and comparison of various DF approaches in the process monitoring context is at the moment lacking.
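As an illustration of the monitoring-chart idea mentioned above, here is a minimal sketch of a PCA-based Hotelling T² statistic. It assumes a single (already fused) data block and omits the control limit and the residual (Q) statistic that a real multiblock chart would add; all names are ours.

    import numpy as np

    def hotelling_t2(X_noc, X_new, n_comp=2):
        """Fit a PCA model on normal-operating-condition (NOC) data and
        return the Hotelling T2 statistic of new samples in that model."""
        mu = X_noc.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_noc - mu, full_matrices=False)
        P = Vt[:n_comp].T                          # loadings (variables x comps)
        var = s[:n_comp] ** 2 / (len(X_noc) - 1)   # variance of each score
        T = (X_new - mu) @ P                       # scores of the new samples
        return np.sum(T**2 / var, axis=1)          # T2 value per new sample

    # Toy usage: 50 NOC samples and 5 new samples on 8 process sensors.
    rng = np.random.default_rng(1)
    X_noc = rng.normal(size=(50, 8))
    X_new = rng.normal(size=(5, 8)) + 0.5          # slightly shifted process
    print(hotelling_t2(X_noc, X_new))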
The aim of presenting this overview of the main applicative contexts and data fusion approaches was to highlight the opportunities and the complexity inherent in data integration, as well as to frame, in the DF perspective, methods that are more familiar as multiblock or multiset methods. Although most of the methods used are well established, there is a need to extend the comparative analysis of the different approaches in conjunction with the specific research objectives, and especially to formulate a proposal for evaluating the efficacy of the fusion process.
References
[1] R. Kitchin, Big Data, new epistemologies and paradigm shifts, Big Data Soc. 1 (2014) 1–12, https://doi.org/10.1177/2053951714528481.
[2] H. Martens, Quantitative Big Data: where chemometrics can contribute, J. Chemom. 29 (2015) 563–581, https://doi.org/10.1002/cem.2740.
[3] B. Khaleghi, A. Khamis, F.O. Karray, S.N. Razavi, Multisensor data fusion: a review of the state-of-the-art, Inf. Fusion 14 (2013) 28–44, https://doi.org/10.1016/j.inffus.2011.08.001.
[4] F. Castanedo, A review of data fusion techniques, Sci. World J. 2013 (2013), https://doi.org/10.1155/2013/704504.
[5] J. Camacho, R. Magán-Carrión, P. García-Teodoro, J.J. Treinen, Networkmetrics: multivariate big data analysis in the context of the internet, J. Chemom. 30 (2016) 488–505, https://doi.org/10.1002/cem.2806.
[6] F. Alam, R. Mehmood, I. Katib, N.N. Albogami, A. Albeshri, Data fusion and IoT for smart ubiquitous environments: a survey, IEEE Access 5 (2017) 9533–9554, https://doi.org/10.1109/ACCESS.2017.2697839.
[7] D. Lahat, C. Jutten, Multimodal data fusion: an overview of methods, challenges, and prospects, Proc. IEEE 103 (2015) 1450–1477.
[8] S. Duncan, S. Singh, Approaches to multisensor data fusion in target tracking: a survey, IEEE Trans. Knowl. Data Eng. 18 (2006) 1696–1710.
[9] H. Ghassemian, A review of remote sensing image fusion methods, Inf. Fusion 32 (2016) 75–89, https://doi.org/10.1016/j.inffus.2016.03.003.
[10] V. Steinmetz, F. Sévila, Data fusion and IoT for smart ubiquitous environments: a survey, Chemom. Intell. Lab. Syst. 74 (2016) 1–12, https://doi.org/10.1016/j.medengphy.2016.12.011.
[11] I. Van Mechelen, A.K. Smilde, A generic linked-mode decomposition model for data fusion, Chemom. Intell. Lab. Syst. 104 (2010) 83–94, https://doi.org/10.1016/j.chemolab.2010.04.012.
[12] E. Waltz, J. Llinas, Multisensor data fusion, J. Navig. (1990), https://doi.org/10.1017/S0373463300010055.
[13] F.E. White, JDL, Data Fusion Lexicon, Tech. Panel C3, 1991.
[14] J. Llinas, D.L. Hall, An introduction to multi-sensor data fusion, in: Proc. 1998 IEEE Int. Symp. Circuits Syst. (ISCAS '98) 6 (1998) 537–540, https://doi.org/10.1109/ISCAS.1998.705329.
[15] H. Hotelling, Relations between two sets of variates, Biometrika 28 (1936) 321–377.
[16] R.B. Cattell, "Parallel proportional profiles" and other principles for determining the choice of factors by rotation, Psychometrika 9 (1944) 267, https://doi.org/10.1007/BF02288739.
[17] J.R. Kettenring, Canonical analysis of several sets of variables, Biometrika 58 (1971) 433–451, https://www.jstor.org/stable/2334380.
[18] L. Wald, Definitions and terms of reference in data fusion, Int. Arch. Photogramm. Remote Sens. 32 (1999) 2–6.
[19] D. Lahat, T. Adali, C. Jutten, Multimodal data fusion: an overview of methods, challenges, and prospects, Proc. IEEE 103 (2015) 1449–1477, https://doi.org/10.1109/jproc.2015.2460697.
[20] N.D. Sidiropoulos, R. Bro, On communication diversity for blind identifiability and the uniqueness of low-rank decomposition of N-way arrays, Proc. Int. Conf. Acoust. Speech Signal Process. IEEE 5 (2000) 2449–2452.
[21] L. Sorber, M. Van Barel, L. De Lathauwer, Structured data fusion, IEEE J. Sel. Top. Signal Process. 9 (2015) 586–600, https://doi.org/10.1109/JSTSP.2015.2400415.
[22] E. Acar, R. Bro, A.K. Smilde, Data fusion in metabolomics using coupled matrix and tensor factorizations, Proc. IEEE 103 (2015) 1602–1620, https://doi.org/10.1109/JPROC.2015.2438719.
[23] T. Adalı, M. Anderson, G.-S. Fu, Diversity in independent component and vector analyses: identifiability, algorithms, and applications in medical imaging, IEEE Signal Process. Mag. 31 (2014) 18–33.
[24] D.L. Hall, J. Llinas, An introduction to multisensor data fusion, Proc. IEEE 85 (1997).
[25] E. Borràs, J. Ferré, R. Boqué, M. Mestres, L. Aceña, O. Busto, Data fusion methodologies for food and beverage authentication and quality assessment – a review, Anal. Chim. Acta 891 (2015) 1–14, https://doi.org/10.1016/j.aca.2015.04.042.
[26] V. Steinmetz, F. Sévila, V. Bellon-Maurel, A methodology for sensor fusion design: application to fruit quality assessment, J. Agric. Eng. Res. 74 (1999) 21–31, https://doi.org/10.1006/jaer.1999.0428.
[27] A.K. Smilde, M.J. van der Werf, S. Bijlsma, B.J.C. van der Werff-van der Vat, R.H. Jellema, Fusion of mass spectrometry-based metabolomics data, Anal. Chem. 77 (2005) 6729–6736, https://doi.org/10.1021/ac051080y.
[28] Y. Liu, S.D. Brown, Wavelet multiscale regression from the perspective of data fusion: new conceptual approaches, Anal. Bioanal. Chem. 380 (2004) 445–452, https://doi.org/10.1007/s00216-004-2776-x.
[29] A.J. Tencate, J.H. Kalivas, A.J. White, Fusion strategies for selecting multiple tuning parameters for multivariate calibration and other penalty based processes: a model updating application for pharmaceutical analysis, Anal. Chim. Acta 921 (2016) 28–37, https://doi.org/10.1016/j.aca.2016.03.046.
[30] A. Smolinska, L. Blanchet, L. Coulier, K.A.M. Ampt, T. Luider, R.Q. Hintzen, S.S. Wijmenga, L.M.C. Buydens, Interpretation and visualization of non-linear data fusion in kernel space: study on metabolomic characterization of progression of multiple sclerosis, PLoS One 7 (2012), https://doi.org/10.1371/journal.pone.0038163.
[31] F.R. Bach, G.R.G. Lanckriet, M.I. Jordan, Multiple kernel learning, conic duality, and the SMO algorithm, in: Proc. 21st Int. Conf. Mach. Learn., 2004.
[32] A. El Ghaziri, V. Cariou, D.N. Rutledge, E.M. Qannari, Analysis of multiblock datasets using ComDim: overview and extension to the analysis of (K + 1) datasets, J. Chemom. (2016) 420–429, https://doi.org/10.1002/cem.2810.
[33] N. Cavallini, F. Savorani, R. Bro, M. Cocchi, Fused adjacency matrices to enhance information extraction: the beer benchmark, Anal. Chim. Acta 1061 (2019) 70–83, https://doi.org/10.1016/j.aca.2019.02.023.
[34] K. Van Deun, A.K. Smilde, M.J. van der Werf, H.A.L. Kiers, I. Van Mechelen, A structured overview of simultaneous component based data integration, BMC Bioinform. 10 (2009) 246, https://doi.org/10.1186/1471-2105-10-246.
[35] J. Pagès, M. Tenenhaus, Multiple factor analysis combined with PLS path modelling. Application to the analysis of relationships between physicochemical variables, sensory profiles and hedonic judgements, Chemom. Intell. Lab. Syst. (2001), https://doi.org/10.1016/S0169-7439(01)00165-4.
[36] H. Wold, Soft modelling: the basic design and some extensions, in: Systems under Indirect Observation: Causality-Structure-Prediction, 1982.
[37] I.E. Frank, B.R. Kowalski, Prediction of wine quality and geographic origin from chemical measurements by partial least-squares regression modeling, Anal. Chim. Acta 162 (1984) 241–251.
[38] T. Skov, A.H. Honoré, H.M. Jensen, T. Næs, S.B. Engelsen, Chemometrics in foodomics: handling data structures from multiple analytical platforms, Trends Anal. Chem. 60 (2014) 71–79, https://doi.org/10.1016/j.trac.2014.05.004.
[39] L.E. Wangen, B.R. Kowalski, A multiblock partial least squares algorithm for investigating complex chemical systems, J. Chemom. 3 (1988) 3–20.
[40] S.J. Qin, S. Valle, M.J. Piovoso, On unifying multiblock analysis with application to decentralized process monitoring, J. Chemom. 15 (2001) 715–742.
[41] J.A. Westerhuis, A.K. Smilde, Deflation in multiblock PLS, J. Chemom. 15 (2001) 485–493.
[42] S. Wold, N. Kettaneh, K. Tjessem, Hierarchical multiblock PLS and PC models for easier model interpretation and as an alternative to variable selection, J. Chemom. (1996), https://doi.org/10.1002/(SICI)1099-128X(199609)10:5/63.0.CO;2-L.
[43] I. Måge, B.H. Mevik, T. Næs, Regression models with process variables and parallel blocks of raw material measurements, J. Chemom. (2008), https://doi.org/10.1002/cem.1169.
[44] T. Löfstedt, J. Trygg, OnPLS - a novel multiblock method for the modelling of predictive and orthogonal variation, J. Chemom. (2011), https://doi.org/10.1002/cem.1388.
[45] M. Tenenhaus, V.E. Vinzi, Y.-M. Chatelin, C. Lauro, PLS path modeling, Comput. Stat. Data Anal. (2005), https://doi.org/10.1016/j.csda.2004.03.005.
[46] V. Cariou, E.M. Qannari, D.N. Rutledge, E. Vigneau, ComDim: from multiblock data analysis to path modeling, Food Qual. Prefer. (2016) 1–8, https://doi.org/10.1016/j.foodqual.2017.02.012.
[47] E. Acar, T.G. Kolda, D.M. Dunlavy, All-at-once optimization for coupled matrix and tensor factorizations, 2011, arXiv:1105.3422v1.
[48] E.E. Papalexakis, C. Faloutsos, N.D. Sidiropoulos, Tensors for data mining and data fusion: models, applications, and scalable algorithms, ACM Trans. Intell. Syst. Technol. 8 (2016).
[49] R.A. Harshman, M.E. Lundy, PARAFAC: parallel factor analysis, Comput. Stat. Data Anal. 18 (1994) 39–72.
[50] R. Bro, PARAFAC. Tutorial and applications, Chemom. Intell. Lab. Syst. 38 (1997) 149–171, https://doi.org/10.1016/S0169-7439(97)00032-4.
[51] L. De Lathauwer, B. De Moor, J. Vandewalle, A multilinear singular value decomposition, SIAM J. Matrix Anal. Appl. (2000), https://doi.org/10.1137/S0895479896305696.
[52] A.P. Singh, G.J. Gordon, Relational learning via collective matrix factorization, in: Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (KDD '08), 2008, https://doi.org/10.1145/1401890.1401969.
[53] A.K. Smilde, I. Måge, Common and Distinct Components in Data Fusion, 2018.
[54] Y.K. Yılmaz, A.T. Cemgil, Generalised coupled tensor factorisation, Adv. Neural Inf. Process. Syst. (2011).
[55] K. Van Deun, I. Van Mechelen, L. Thorrez, M. Schouteden, B. De Moor, DISCO-SCA and properly applied GSVD as swinging methods to find common and distinctive processes, PLoS One 7 (2012), https://doi.org/10.1371/journal.pone.0037840.
[56] A.K. Smilde, I. Måge, T. Næs, T. Hankemeier, M.A. Lips, H.A.L. Kiers, E. Acar, R. Bro, Common and distinct components in data fusion, J. Chemom. 31 (2017), https://doi.org/10.1002/cem.2900.
[57] K. Van Deun, T.F. Wilderjans, R.A. Van Den Berg, A. Antoniadis, I. Van Mechelen, A flexible framework for sparse simultaneous component based data integration, BMC Bioinform. (2011).
[58] E. Acar, E.E. Papalexakis, G. Gürdeniz, M.A. Rasmussen, A.J. Lawaetz, M. Nilsson, R. Bro, Structure-revealing data fusion, BMC Bioinform. (2014) 1–17.
[59] S.C. Rutan, A. de Juan, R. Tauler, 2.15 Introduction to multivariate curve resolution, Compr. Chemom. 2 (2009) 249–260.
[60] A. de Juan, J. Jaumot, R. Tauler, Multivariate curve resolution (MCR). Solving the mixture analysis problem, Anal. Methods 6 (2014) 4964, https://doi.org/10.1039/c4ay00571f.
[61] A. Tenenhaus, M. Tenenhaus, Regularized generalized canonical correlation analysis, Psychometrika (2011), https://doi.org/10.1007/s11336-011-9206-8.
[62] A. Tenenhaus, L. Le Brusquet, G. Lechuga, Multiway regularized generalized canonical correlation analysis, 2015, pp. 1–6.
[63] H. Kiers, A. Smilde, A comparison of various methods for multivariate regression with highly collinear variables, Stat. Methods Appl. (2007), https://doi.org/10.1007/s10260-006-0025-5.
[64] R. Vitale, Novel Chemometric Proposals for Advanced Multivariate Data Analysis, Processing and Interpretation (Ph.D. thesis), 2018, http://hdl.handle.net/10251/90442.
[65] G. Mazerolles, M. Hanafi, E. Dufour, D. Bertrand, E.M. Qannari, Common components and specific weights analysis: a chemometric method for dealing with complexity of food products, Chemom. Intell. Lab. Syst. 81 (2006) 41–49, https://doi.org/10.1016/j.chemolab.2005.09.004.
[66] A.R. Groves, C.F. Beckmann, S.M. Smith, M.W. Woolrich, Linked independent component analysis for multimodal data fusion, Neuroimage (2011), https://doi.org/10.1016/j.neuroimage.2010.09.073.
[67] T. Adalı, Y. Levin-Schwartz, V.D. Calhoun, Multimodal data fusion using source separation: two effective models based on ICA and IVA and their properties, Proc. IEEE (2015), https://doi.org/10.1109/JPROC.2015.2461624.
[68] P. Filzmoser, B. Walczak, What can go wrong at the data normalization step for identification of biomarkers? J. Chromatogr. A 1362 (2014) 194–205, https://doi.org/10.1016/j.chroma.2014.08.050.
[69] J. Walach, P. Filzmoser, K. Hron, Data normalization and scaling: consequences for the analysis in omics sciences, Compr. Anal. Chem. (2018), https://doi.org/10.1016/bs.coac.2018.06.004.
[70] T.F. Wilderjans, E. Ceulemans, I. Van Mechelen, The SIMCLAS model: simultaneous analysis of coupled binary data matrices with noise heterogeneity between and within data blocks, Psychometrika (2012), https://doi.org/10.1007/s11336-012-9275-3.
[71] R. Ríos-Reina, R.M. Callejón, F. Savorani, J.M. Amigo, M. Cocchi, Data fusion approaches in spectroscopic characterization and classification of PDO wine vinegars, Talanta 198 (2019) 560–572.
[72] B.P. Geurts, J. Engel, B. Ra, L. Blanchet, A. Suppers, E. Szymanska, J. Jansen, L.M.C. Buydens, Improving high-dimensional data fusion by exploiting the multivariate advantage, Chemom. Intell. Lab. Syst. 156 (2016) 231–240, https://doi.org/10.1016/j.chemolab.2016.05.010.
[73] Y. Ren, L. Zhang, P.N. Suganthan, Ensemble classification and regression - recent developments, applications and future directions, IEEE Comput. Intell. Mag. (2016) 41–53.
[74] L. Douha, N. Benoudjit, F. Melgani, A robust regression approach for spectrophotometric signal analysis, J. Chemom. 26 (2012) 400–405, https://doi.org/10.1002/cem.2455.
[75] D.-p. Niu, F.-l. Wang, L.-l. Zhang, D.-k. He, M.-x. Jia, Neural network ensemble modeling for nosiheptide fermentation process based on partial least squares regression, Chemom. Intell. Lab. Syst. (2011), https://doi.org/10.1016/j.chemolab.2010.11.007.
[76] S. Oh, S. McCloskey, I. Kim, A. Vahdat, K.J. Cannons, H. Hajimirsadeghi, G. Mori, A.G.A. Perera, M. Pandey, J.J. Corso, Multimedia event detection with multimodal feature fusion and temporal concept localization, Mach. Vision Appl. (2013), https://doi.org/10.1007/s00138-013-0525-x.
[77] E. Acar, F. Hopfgartner, S. Albayrak, A comprehensive study on mid-level representation and ensemble learning for emotional analysis of video material, Multimed. Tools Appl. 76 (2017) 11809–11837, https://doi.org/10.1007/s11042-016-3618-5.
[78] A. Zaveri, G. Ertaylan, Linked data for life sciences, Algorithms 10 (2017), https://doi.org/10.3390/a10040126.
[79] V. Gligorijević, N. Malod-Dognin, N. Pržulj, Patient-specific data fusion for cancer stratification and personalised treatment, Pac. Symp. Biocomput. (2016) 321–332.
[80] F. Shang, L.C. Jiao, F. Wang, Graph dual regularization non-negative matrix factorization for co-clustering, Pattern Recognit. 45 (2012) 2237–2250, https://doi.org/10.1016/j.patcog.2011.12.015.
[81] R.C. King, E. Villeneuve, R.J. White, R.S. Sherratt, W. Holderbaum, W.S. Harwin, Application of data fusion techniques and technologies for wearable health monitoring, Med. Eng. Phys. 42 (2017) 1–12, https://doi.org/10.1016/j.medengphy.2016.12.011.
[82] G. Tsiliki, S. Kossida, Fusion methodologies for biomedical data, J. Proteomics 74 (2011) 2774–2785, https://doi.org/10.1016/j.jprot.2011.07.001.
[83] J.A. Westerhuis, T. Kourti, J.F. MacGregor, Analysis of multiblock and hierarchical PCA and PLS models, J. Chemom. 12 (1998) 301–321, https://doi.org/10.1002/(SICI)1099-128X(199809/10)12:53.0.CO;2-S.
[84] S. Wold, S. Hellberg, T. Lundstedt, M. Sjöström, H. Wold, in: Proc. Symp. on PLS Model Building: Theory and Application, Frankfurt am Main, 1987.
[85] J. Kohonen, S. Reinikainen, K. Aaljoki, A. Perkiö, T. Väänänen, A. Höskuldsson, Multi-block methods in multivariate process control, J. Chemom. 22 (2007) 281–287, https://doi.org/10.1002/cem.1120.
CHAPTER 2

A Framework for Low-Level Data Fusion

Age K. Smilde*,¹, Iven Van Mechelen†

*Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands; †Research Group on Quantitative Psychology and Individual Differences, KU Leuven, Leuven, Belgium; ¹Corresponding author
1. INTRODUCTION AND MOTIVATION

In this chapter we describe systematic ways to analyze multiple data sets or data blocks simultaneously measured on the same system. Our examples are from the field of (systems) biology, but the methods and frameworks discussed are generic. The applications of these types of analyses are wide ranging, from medical biology [1–3] to microbial biology [4,5] to plant biology [6] and studies in mice [7,8] and rats [9]. This chapter draws heavily on earlier publications of the authors [10,11].
1.1 Data Integration

Owing to the abundant availability of multiple data sets or data blocks that have been measured on the same biological system, there is a growing need to analyze and visualize such data blocks simultaneously in order to arrive at a global understanding of the system in question. This is generally called data integration or data fusion. Data integration can comprise many things. One of the most basic ways of integrating data is based on simple descriptive statistics such as correlations [12,13].
Association networks are very popular in this regard [14]. A more demanding way of integrating data relies on models; we will refer to this endeavor as model-based data fusion. The models underlying model-based data fusion can be either ad hoc structures (often associated with a primary goal of mere data reduction) or structures rooted in substantive theories (such as biological accounts). Examples of the latter include genome-scale models [15], whole-body models [16], and models of parts of a system. Furthermore, all forms of data integration or data fusion can be combined with clever or advanced types of visualization, as exemplified by charts of association networks and visual representations of genome-scale models.
1.2 Model-Based Data Fusion

Within the area of model-based data fusion, several distinctions can be drawn. A first distinction is between symmetric fusion and predictive or asymmetric fusion; in the latter, one or several data blocks are used to predict an outcome. Examples of asymmetric data fusion methods are multiblock-PLS (partial least squares) [17–19], PO-PLS (parallel and orthogonalised partial least squares) [20], and SO-PLS (sequential and orthogonalised partial least squares) [21]. Latent-path models are special cases in this respect and are widely used in management and food applications [22]. We will not discuss these types of asymmetric fusion methods in this chapter.

Symmetric data fusion considers all data sets as taking the same roles and as having equal importance. There is neither an importance hierarchy nor a distinction between criterion (response or dependent variable) and predictor (explanatory or independent variable) data blocks; in other words, the blocks are exchangeable. Within symmetric data fusion, a second distinction is on what level the fusion takes place. High-level data fusion models each data set separately and then combines all the modeling results, e.g., using majority voting schemes [23]. Mid-level fusion first subjects each data set to some kind of preprocessing (such as a form of variable selection) and then uses low-level methods to fuse the preprocessed data. Low-level fusion fuses the different data sets without prior preprocessing of each of them. We focus in this chapter on low-level data fusion.
1.3 Goals of Data Fusion

Many goals of model-based data fusion can be envisaged. In current practice, these goals are usually implicit. By making them explicit one can start looking for custom-made models and tools for data analysis.
A recurring element in goals of model-based data fusion is that many of them pertain to differences or heterogeneity in each of the data blocks under study. As an example, one may think of between-person or individual differences, which are highly relevant in precision medicine and nutritional interventions. Fusing data may help to find such differences (e.g., in terms of to-be-estimated model parameters) and thereby facilitate population stratification.

Four specific types of goals with regard to within-block differences can be distinguished. A first one is purely exploratory in nature. The idea here is to simply chart or describe the heterogeneity under study, possibly making use of proper visualization tools. A second type of goal is to capture some particular aspects of heterogeneity in each data set and to subsequently look for a synthesis or consensus of these aspects across all data sets. A third type of goal, which is becoming more and more popular, is separating common versus distinctive sources of variation in the respective data sets or data blocks. Such a separation may greatly simplify the subsequent interpretation of the results. We describe these types of applications in more detail in Section 4. These first three types of goals pertain to the study of within-block heterogeneity or differences in themselves, whereas a fourth type of goal pertains to the study of within-block differences in relation to a known covariate. As an example, one may think of a covariate that represents treatment condition (treatment vs. control, or different active treatment conditions), as in data regarding an intervention or clinical trial. In that case, one may wish to identify treatment effects, both in terms of global effects for all data blocks as a whole and in terms of distinctive effects for the different individual data blocks separately.
1.4 Motivating Examples

Some examples will serve to motivate the discussion of low-level data fusion methods. The first example is from the field of microbial metabolomics [24] and deals with integrating gas chromatography–mass spectrometry (GC-MS) and liquid chromatography–mass spectrometry (LC-MS) data of the same microbial system. The structure of this data set is visualized in Fig. 2.1 and can be considered a case of two coupled two-way two-mode data blocks that are connected through a common fermentation mode. The research question is to arrive at a global view of the metabolism given the two different sets of measurements and to find out what drives the differences in the fermentation process. Hence, this is an example of exploratory analysis. The second example is from the field of medical biology: the effects of gastric bypass surgery on obese and diabetic subjects [3].
FIGURE 2.1 Coupled GC-MS and LC-MS data along the sampling mode. The two blocks (metabolites 1, measured by LC-MS, and metabolites 2, measured by GC-MS) share the experimental-conditions mode.
A total of 14 obese patients with diabetes mellitus type II (DM2) underwent gastric bypass surgery, and blood samples were taken 4 weeks before and 3 weeks after surgery; on each occasion samples were taken both before and after a meal. The blood samples were then analyzed on multiple analytical platforms for the determination of amines, lipids, and oxylipins. Clearly, there is an experimental design underlying the shared sampling mode (see Fig. 2.2), and thus the goal is to establish (common and distinct) treatment effects in the different data sets. The third example is from analytical chemistry and concerns resolving mixtures of chemical compounds into their underlying spectra and concentrations. This is visualized in Fig. 2.3, where nuclear magnetic resonance (NMR) and LC-MS measurements are performed on the same set of samples. Owing to the combined modeling, concentrations and pure spectral profiles can be obtained for both the NMR and LC-MS data [25,26]. We will make use of many figures like the ones in Figs. 2.1, 2.2, and 2.3. We always depict matrices with the first (row) mode pertaining to the sampling mode (each row of a matrix corresponds to a sample) and the second (column) mode pertaining to the variables (each column represents a variable).
FIGURE 2.2 Amines, lipids, and oxylipins measured repeatedly on the same persons. The data block Design contains the encoding of the underlying experimental design (see text for details).
FIGURE 2.3 NMR and LC-MS measurements performed on the same set of mixtures of chemical compounds (modes: mixtures × chemical shifts for NMR; mixtures × features for LC-MS).
2. DATA STRUCTURES

When discussing data fusion methods, a central notion is the idea of coupled data. Without any coupling of the data, data-driven fusion is not possible. There are different multiset data structures with corresponding coupling characteristics. The first case pertains to data coupled in the sampling mode; the examples given earlier are of this type. This is exemplified in Fig. 2.4 by placing the sets of data next to each other: the measurements (variables 1, 2, 3) are obtained on the same samples. Another possibility is coupling along the variable mode, as visualized in Fig. 2.5, where now the same variables are measured on different sets of samples. Also, hybrid cases are possible, as shown in Fig. 2.6, and of course there are many more situations. One special case worth mentioning is when data sets share both modes.
FIGURE 2.4 Coupling along the sampling mode (variables 1, 2, and 3 measured on the same samples).
FIGURE 2.5 Coupling along the variables mode (the same variables measured on samples 1, 2, and 3).
Such structures are called multiway structures, and special methods are in place for multiway analysis [27]. In this chapter, we illustrate our framework with data coupled in the sampling mode, that is, situations as shown in Fig. 2.4.
3. FRAMEWORK FOR LOW-LEVEL DATA FUSION

In this section, we describe a novel model for data fusion [11]. This model is generic in that it subsumes a very broad range of specific models (both existing and to-be-developed ones) as special cases.
FIGURE 2.6 Partly coupled along variables and samples.
The generic model will appear to be a global model for the whole of all coupled data blocks. This global model will consist of (1) a submodel for each data block that accounts for the individual data entries in that block, along with (2) a linking structure between these submodels. We first outline those two aspects in the following text. Subsequently, we describe a few existing examples of our generic proposal.
3.1 Submodel per Data Block

The first ingredient of our framework is a submodel for each data block, as described in more detail elsewhere [28]. This submodel is made of two parts: quantifications of the modes per data block and association rules that define how these quantifications can be combined to model each block. We denote the different blocks by $X_b$, $b = 1, \ldots, B$, with size $I \times J_b$ for block $X_b$. We will describe each of those two ingredients in more detail and use a two-way two-block case as a guiding example.

3.1.1 Quantifications of Data Block Modes

The first constituent of the submodel is a quantification of the two modes of the two blocks of data. Such quantifications can be seen as reductions of the modes in question. For our example (fusing two blocks of data), this is visualized in Fig. 2.7. The matrix $X_1$ has quantifiers $A_1$ ($I \times P_1$) and $B_1$ ($J_1 \times Q_1$) for the first and second modes, respectively. Likewise, there are quantifiers for the second block: $A_2$ ($I \times P_2$) and $B_2$ ($J_2 \times Q_2$). There are many alternatives for choosing quantifiers. When the quantification matrix $A_1$ is real valued, it implies a representation of the rows of $X_1$ as points in a low-dimensional (i.e., a $P_1$-dimensional) space. Similar interpretations are available for the other quantification matrices. Note that the dimensionalities of the quantification matrices for the same block do not have to be equal: $P_1$ is not necessarily equal to $Q_1$. The flexibility of using such quantifiers is further illustrated by choosing $A_1$ to be the identity matrix (of size $I \times I$), resulting in no reduction of the first mode of $X_1$. A special case exists for data in which the two modes of the two-way data coincide (so-called one-mode two-way data), such as in distance matrices: then the quantifiers for both ways should be chosen to be the same.
FIGURE 2.7 Idea of a linking structure: the submodels of X1 and X2, with quantifiers A1, B1 and A2, B2, are tied together through a linking function L(.).
3.1.2 Block-Specific Association Rule

Next to the quantifiers, it is necessary to define rules that associate the quantifiers to model the different blocks. Again, there are many alternatives, and the generic scheme for the first data block can be written as:

$X_1 = f(A_1, B_1, W_1) + E_1,$   (2.1)

with $E_1$ denoting a matrix of residuals, $W_1$ ($P_1 \times Q_1$) denoting a core matrix, and the function $f$ defining a mapping. This rather abstract representation serves to show the flexibility, and a few examples are given to clarify the concept of an association rule. For the other block(s) similar rules can be made, but these association rules do not necessarily have to be equal across all blocks. The simplest case is to choose $W_1$ to be the identity matrix of size ($P_1 \times P_1$) (which implicitly assumes that $P_1 = Q_1$) and the function $f$ to be the outer product of $A_1$ and $B_1$. When minimizing the sum of squared residuals in $E_1$ one obtains the familiar principal component model for $X_1$. In the above-mentioned case of distance matrices and $W_1$ an identity matrix, Eq. (2.1) reduces to:

$f(A_1, B_1)_{ij} = \left[\sum_{p_1=1}^{P_1} \left(a_{ip_1} - b_{jp_1}\right)^2\right]^{1/2},$   (2.2)

which is known by the name multidimensional unfolding in the psychological literature. Other association rules are discussed in [28].
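A minimal numerical sketch of two association rules may help: the bilinear rule with a core matrix (written here in the form $A_1 W_1 B_1^T$, which reduces to the outer-product/PCA case for $W_1 = I$) and the unfolding distance rule of Eq. (2.2). All names and sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    I_, J1, P1 = 20, 12, 3
    A1 = rng.normal(size=(I_, P1))                 # row-mode quantifiers
    B1 = rng.normal(size=(J1, P1))                 # column-mode quantifiers

    # Bilinear association rule: f(A1, B1, W1) = A1 W1 B1^T. With W1 the
    # identity, this is the outer-product (PCA-type) model below Eq. (2.1).
    W1 = np.eye(P1)                                # core matrix (here identity)
    X1_bilinear = A1 @ W1 @ B1.T                   # modeled data block

    # Distance association rule of Eq. (2.2): each modeled entry is the
    # Euclidean distance between a row point and a column point
    # (multidimensional unfolding).
    diff = A1[:, None, :] - B1[None, :, :]         # shape (I, J1, P1)
    X1_unfolding = np.sqrt((diff ** 2).sum(axis=-1))
    print(X1_bilinear.shape, X1_unfolding.shape)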
3.2 Linking Structure Between Different Submodels

The final ingredient of our framework ties the blocks together through linking functions. For the two-block two-way case considered earlier we have the two submodels for the two blocks:

$X_1 = f_1(A_1, B_1, W_1) + E_1$
$X_2 = f_2(A_2, B_2, W_2) + E_2$   (2.3)

and the linking should now be done by making assumptions about $A_1$ and $A_2$, because these represent the quantifiers for the shared mode, the sample mode. This idea is also visualized in Fig. 2.7. In our generic model, this mode sharing is captured through constraints on the quantification matrices of the shared modes; these constraints can be conceived as representing the linking structure of the model. There are many alternatives for linking structures, and some of them are shown in Fig. 2.8. In principle, a broad range of constraints could be considered as linking structure. The most simple of them is an identity constraint, which
FIGURE 2.8 Different linking structures. For explanation, see text.
is also used in most cases of low-level data fusion (see Fig. 2.8A). Such a constraint simply implies that a shared mode is given the same quantification in all submodels in which it shows up. The global model for such a linking structure would be:

$X_1 = f_1(A, B_1, W_1) + E_1$
$X_2 = f_2(A, B_2, W_2) + E_2$   (2.4)
where the identity constraint becomes visible through the fact that the quantification matrix $A$ no longer bears a block-specific subscript. Rather than a full identity constraint on the quantification matrices of shared modes, Fig. 2.8 also shows two partial identity constraints. The first of these reads that a number of columns of the quantification matrices of a shared mode are constrained to be identical, whereas other columns are left unconstrained; through such a partial identity constraint one may wish to capture both commonalities in the structures of the linked data blocks (in terms of the identical quantification columns) and distinctive aspects (in terms of the unconstrained columns) (see Fig. 2.8C). This type of linking structure will be explained in more detail in Section 4. A second partial identity constraint reads that the quantification matrices of a shared mode are constrained to be identical with regard to the vast majority of their rows (see Fig. 2.8D). This means that, for the vast majority of the elements of the mode involved, but not for all, the quantifications have to be the same (with the elements that require different quantifications having to be identified during the data-analytic process). A special type of linkage structure may be needed if one of the shared
modes is the time mode (see Fig. 2.8B). In such cases, the linking structure may have to account for lags in dynamics (indicated by the symbol s). This is, for instance, the case if measurements of metabolites are performed in blood and urine, where the metabolite usually appears earlier in the blood. We give examples of two other possible linkage structures. The first of these pertains to the case of binary quantification matrices (which can be conceived as membership matrices in some clustering). A constraint on such matrices could read that the clustering implied by the first quantification matrix is nested in the second. As a special case of this, if one considers partitioning matrices only, a nestedness constraint would imply that the first partitioning is a refinement of the second (i.e., the first partitioning is then obtained by splitting a number of classes of the second one). As a second possibility, in the case of real-valued quantification matrices, one may require two quantifications of the same mode to be in a space-subspace relation.
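In the notation of Eq. (2.3), a hedged way of writing the partial identity constraint of Fig. 2.8C is

$A_1 = [\,A_C \mid A_{1D}\,], \qquad A_2 = [\,A_C \mid A_{2D}\,],$

where the columns of $A_C$ are constrained to be identical across blocks (capturing common structure) and $A_{1D}$ and $A_{2D}$ are left unconstrained (capturing distinctive structure); Section 4 develops this partitioning in detail.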
3.3 Examples of the Framework

In chemometrics, SUM-PCA (or Consensus-PCA) is a widely used data fusion method [18]. Related methods in psychometrics are multiple factor analysis (MFA; [29]) and STATIS [30]. The exact relationships between these methods have been published elsewhere [31] and will not be repeated here. All these methods fit within our generic framework, as illustrated here for our two-block case. Assuming that appropriate preprocessing has taken place per data block, SUM-PCA assumes that:

$q_{m1} X_1 = A(B_1)^T + E_1 = f(A, B_1) + E_1$
$q_{m2} X_2 = A(B_2)^T + E_2 = f(A, B_2) + E_2,$   (2.5)

where, depending on the specific method, different weights $q_{mk}$ are assigned to the data blocks (see [31] for details). This is an example of a data fusion model with an identity link, where the blocks $X_1$ and $X_2$ have the sampling mode in common. It can easily be generalized to more than two blocks of data. An example of the use of this method can be found in [9], where metabolomics and gene expression data are coupled in a toxicology experiment. Also, our example in Section 5.1 is of this kind, using the MFA method; a minimal numerical sketch of such identity-link fusion is given below. Simultaneous component analysis (SCA) already has a long history in psychometrics [32,33]. It was developed for cases in which the same set of
variables has been measured in different sets of samples, for example, stemming from different cultures. Assuming a suitable preprocessing, the basic SCA model (SCA-P) for a two-block case can be cast as follows within our generic framework:

$X_1 = A_1(B)^T + E_1 = f(A_1, B) + E_1$
$X_2 = A_2(B)^T + E_2 = f(A_2, B) + E_2,$   (2.6)
where again an identity link is used. This example shows that coupling along the variables mode (see Fig. 2.5) can also be cast in our framework. Two members of the SCA family are further worth mentioning, because they have been used in several fields of science: multilevel SCA (MSCA), as used in psychometrics, functional genomics, and process chemometrics ([34–37]), and ANOVA-SCA (ASCA [36,38–40]), as used in functional genomics. We start by discussing ASCA, making use of the notation in ASCA applications ([35]). The typical background of ASCA is a set of designed experiments in which functional genomics data are collected from several subjects exposed to a treatment (k) and measured over time (t). Assuming that proper preprocessing has been done, each block $X_b$ contains the measurements performed for treatment group $b$ and can be modeled as:

$X_b = \mathbf{1}\mathbf{t}_b^T P_1^T + T_t P_2^T + T_{bt} P_3^T + E_b,$   (2.7)
where $\mathbf{1}$ is a vector of ones of the proper order; $\mathbf{t}_b$, $T_t$, and $T_{bt}$ contain the scores representing the samples; $P_1$, $P_2$, and $P_3$ contain the loadings; and $E_b$ contains the residuals. The matrices $T_t$ and $T_{bt}$ have a specific structure, which is not important for this paper and which can be found elsewhere ([40]). On rewriting Eq. (2.7), we obtain:

$X_b = [\,\mathbf{1}\mathbf{t}_b^T \;\; T_t \;\; T_{bt}\,]\,[\,P_1 \;\; P_2 \;\; P_3\,]^T + E_b = A_b(B)^T + E_b,$   (2.8)

which clearly is a special case of SCA-P. The basic equation for the MSCA model reads as:

$X_b = \mathbf{1}\mathbf{t}_b^T P_1^T + T_b^w P_2^T + E_b,$   (2.9)
assuming again proper preprocessing. The block $X_b$ now represents an individual subject (or object) measured, for example, over time (t). Then $\mathbf{1}\mathbf{t}_b^T P_1^T$ represents the between-subject variation and $T_b^w P_2^T$ the within-subject variation. Eq. (2.9) clearly shows that MSCA is a special case of ASCA. Hence, MSCA also fits within our generic framework.
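To make the identity-link models of this section concrete, here is a minimal NumPy sketch in the style of Eq. (2.5): weighted blocks sharing the sample mode are concatenated and decomposed by a single SVD, giving common scores and block-specific loadings. The fixed weights and the single-SVD shortcut are simplifying assumptions of ours; the individual methods differ precisely in how the weights $q_{mk}$ are chosen and how the model is fitted.

    import numpy as np

    def identity_link_fusion(blocks, weights, n_comp=2):
        """Fuse blocks sharing the sample mode: weight, concatenate
        column-wise, and decompose once, giving common scores A and
        per-block loadings B_b (cf. Eq. 2.5)."""
        Xw = np.hstack([q * X for q, X in zip(weights, blocks)])
        U, s, Vt = np.linalg.svd(Xw, full_matrices=False)
        A = U[:, :n_comp] * s[:n_comp]             # common (shared) scores
        B = Vt[:n_comp].T                          # stacked loadings
        splits = np.cumsum([X.shape[1] for X in blocks])[:-1]
        return A, np.split(B, splits, axis=0)

    rng = np.random.default_rng(3)
    X1, X2 = rng.normal(size=(28, 44)), rng.normal(size=(28, 144))
    A, (B1, B2) = identity_link_fusion([X1, X2], weights=[1.0, 1.0])
    print(A.shape, B1.shape, B2.shape)             # (28, 2) (44, 2) (144, 2)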
An example of combining a PARAFAC model (for a three-way block) and a PCA model (for a two-way block) for data coupled in the sampling mode reads:

$X_1 = A(B_1)^T + E_1 = f_1(A, B_1) + E_1$
$X_2 = A(B_2 \odot C_2)^T + E_2 = f_2(A, B_2, C_2) + E_2,$   (2.10)

where $B_2$ and $C_2$ are loading matrices pertaining to the second and third modes of the properly matricized three-way array $X_2$ and $\odot$ is the symbol for the Khatri-Rao product ([27]). The area of analyzing multiway arrays is already covered in some textbooks [27,41], and extensions for fusing data sets in which multiway arrays are involved are also available [25,26,42,43].
4. COMMON AND DISTINCT COMPONENTS

4.1 Generic Model for Common and Distinct Components
This section leans heavily on previous work [10]. The two spaces spanned by the columns of $X_1$ and $X_2$, $R(X_1)$ and $R(X_2)$, are located in the same $I$-dimensional column-space $\mathbb{R}^I$; see Fig. 2.9 for an illustration in three-dimensional space. Each variable is a vector in this coordinate system, indicating the level of that variable for each sample (row).
FIGURE 2.9 The I-dimensional space having R(X1) (blue) and R(X2) (green) as subspaces. Only three axes of this I-dimensional space are drawn. The red line X12C represents the common subspace. For the sake of illustration the dimensions of both column-spaces are equal (two); this is not necessarily always the case.
These variables are not explicitly shown in this figure but lie within the spaces indicated by the blue and green column-spaces. If the two column-spaces intersect nontrivially (the zero vector is always shared), then the intersection space is called the common space. In Fig. 2.9, there is only one common direction (i.e., the common space is one dimensional), but there can be more dimensions or none. The common subspace will be called $R(X_{12C})$, where the subscript C stands for "Common." Note that $R(X_{12C}) \subseteq R(X_1)$ and $R(X_{12C}) \subseteq R(X_2)$. The common part of the two blocks will in most cases not span the whole of $R(X_1)$ and $R(X_2)$. Some definitions regarding the rest of these spaces are therefore needed. The subspaces representing the rest after identification of the common part will be called "distinct" subspaces. The requirement is that the space spanned by the columns in a block $X_b$ ($b = 1, 2$) is a direct sum of the common space and the distinct space within that block. Hence, these two parts within a block are linearly independent (two subspaces are linearly independent if no vector in one subspace can be written as a linear combination of vectors of the other, and vice versa). These subspaces are called $R(X_{1D})$ and $R(X_{2D})$, where the subscript D stands for "Distinct."

There are several possibilities for selecting subspaces to be orthogonal to each other. One option is to select $R(X_{1D})$ and $R(X_{2D})$ to be orthogonal to $R(X_{12C})$. Another option is to select $R(X_{1D})$ orthogonal to $R(X_{2D})$, and, of course, it is also possible not to impose orthogonality at all. The choice of which type of orthogonality to impose, if any, depends on the application. What we have accomplished now is decomposing $R(X_1)$ and $R(X_2)$ into direct sums of spaces:

$R(X_1) = R(X_{12C}) \oplus R(X_{1D})$
$R(X_2) = R(X_{12C}) \oplus R(X_{2D})$   (2.11)

because $R(X_{12C}) \cap R(X_{1D}) = \{0\}$ and $R(X_{12C}) \cap R(X_{2D}) = \{0\}$ [44]. Hence, it also holds that:

$\dim R(X_1) = \dim R(X_{12C}) + \dim R(X_{1D})$
$\dim R(X_2) = \dim R(X_{12C}) + \dim R(X_{2D})$   (2.12)

If the distinct-orthogonal-to-common option is chosen, then $R(X_{12C}) \perp R(X_{1D})$ and $R(X_{12C}) \perp R(X_{2D})$. Note that, for this case, given the common space, the decomposition is unique, because then $R(X_{1D})$ is the orthogonal complement of $R(X_{12C})$ within $R(X_1)$, and likewise for $R(X_{2D})$ (but not necessarily the basis within the subspaces if these have dimension higher than one). In the nonorthogonal case, the distinct part can be defined by any set of linearly independent vectors that are in the original spaces but not in the common space. For a thorough description of direct sums of spaces, see [45].
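Numerically, (approximately) common directions of two column-spaces can be found from the principal angles between them, as in the following sketch. The cosine cutoff is an ad hoc modeling choice of ours, needed because noise keeps sample subspaces from intersecting exactly; it is not a published estimator.

    import numpy as np

    def common_directions(X1, X2, tol=0.99):
        """Estimate a common subspace of R(X1) and R(X2): directions whose
        canonical correlation (cosine of the principal angle) exceeds
        `tol` are treated as common."""
        Q1, _ = np.linalg.qr(X1)                   # orthonormal basis of R(X1)
        Q2, _ = np.linalg.qr(X2)                   # orthonormal basis of R(X2)
        U, cosines, Vt = np.linalg.svd(Q1.T @ Q2)  # SVD yields principal angles
        k = int(np.sum(cosines > tol))             # nearly common directions
        return Q1 @ U[:, :k], cosines

    # Toy usage: two blocks constructed to share exactly one direction.
    rng = np.random.default_rng(4)
    shared = rng.normal(size=(30, 1))
    X1 = np.hstack([shared, rng.normal(size=(30, 2))])
    X2 = np.hstack([shared, rng.normal(size=(30, 2))])
    C, cosines = common_directions(X1, X2)
    print(C.shape, np.round(cosines, 3))           # one common direction found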
4.2 DISCO (Distinct and Common Components)

We will illustrate our generic model with the DISCO-SCA (or DISCO, for short) method [46,47]. The first step in DISCO is to solve an SCA problem to find scores $\tilde{A}$ ($I \times R$) and loadings $\tilde{B}$ ($(J_1 + J_2) \times R$) of the concatenated matrix $[X_1 | X_2]$. Note that we now use the SCA method for a shared sampling mode. The loading matrix $\tilde{B}$ can be partitioned into $\tilde{B}_1$ ($J_1 \times R$) and $\tilde{B}_2$ ($J_2 \times R$). Subsequently, the matrix $\tilde{B}$ is orthogonally rotated to a simple structure reflecting distinct and common components. For the sake of illustration, assume that $R = 3$; there is one common and two distinct components (one for each block). Then $\tilde{B}$ is orthogonally rotated toward a structure $B_{target}$ according to:

$\min_{Q^T Q = I} \left\| V * (\tilde{B} Q - B_{target}) \right\|^2$   (2.13)

where $V$ is a matrix of zeros and ones selecting the elements across which the minimization occurs, the symbol $*$ indicates the Hadamard or elementwise product, and

$B_{target} = \begin{bmatrix} \times & 0 & \times \\ \vdots & \vdots & \vdots \\ \times & 0 & \times \\ 0 & \times & \times \\ \vdots & \vdots & \vdots \\ 0 & \times & \times \end{bmatrix}$   (2.14)

in which the first $J_1$ rows (pattern $[\times\; 0\; \times]$) correspond to the variables of $X_1$ and the last $J_2$ rows (pattern $[0\; \times\; \times]$) to the variables of $X_2$, the symbol $\times$ means an arbitrary value not necessarily zero, and $B = \tilde{B}Q = [B_1^T\; B_2^T]^T$. This will result in the first component being distinct for $X_1$, the second component being distinct for $X_2$, and the third component being the common one. After finding the optimal $Q$, the scores $\tilde{A}$ are counterrotated, resulting in $A = \tilde{A}Q = [a_1\; a_2\; a_3]$, and the following decomposition is obtained:

$X_1 = A B_1^T + E_1 = a_1 b_{11}^T + a_2 b_{12}^T + a_3 b_{13}^T + E_1$
$X_2 = A B_2^T + E_2 = a_1 b_{21}^T + a_2 b_{22}^T + a_3 b_{23}^T + E_2$   (2.15)
where $b_{11}$ gives the loadings for the distinct component for $X_1$; $b_{22}$ for the distinct component for $X_2$; and $b_{13}$, $b_{23}$ for the common component. DISCO has been used in metabolomics [47] and in gene expression analysis [48].
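A minimal sketch of the rotation step of Eq. (2.13) is given below: a plain projected-gradient scheme on the orthogonal group, under our own naming. Published DISCO-SCA implementations use more refined optimization, and in practice the masked entries are driven toward zero rather than exactly to zero.

    import numpy as np

    def disco_rotation(B, V, n_iter=2000, step=0.01):
        """Find an orthogonal Q minimizing ||V * (B Q)||^2, where V marks
        the loading entries that the target structure (Eq. 2.14) requires
        to be (near) zero."""
        Q = np.eye(B.shape[1])
        for _ in range(n_iter):
            G = 2 * B.T @ (V * (B @ Q))            # gradient of the objective
            U, _, Wt = np.linalg.svd(Q - step * G) # project the update back
            Q = U @ Wt                             # onto the orthogonal group
        return Q

    # Toy usage: loadings of two concatenated blocks (J1 = 4, J2 = 5), R = 3.
    rng = np.random.default_rng(5)
    B = rng.normal(size=(9, 3))
    V = np.zeros_like(B)
    V[:4, 1] = 1                                   # comp 2 zero in block 1 rows
    V[4:, 0] = 1                                   # comp 1 zero in block 2 rows
    Q = disco_rotation(B, V)
    print(((V * (B @ Q)) ** 2).sum())              # minimized zero-target residual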
5. EXAMPLES

We give two examples in this section. The first one is an example of our generic framework for data fusion and concerns the goal of exploring relationships between two data blocks. The second one is an example of common and distinct components and of how these are affected by a covariate, which is a treatment effect.
5.1 Microbial Metabolomics Example

As already introduced in Section 1.4, the microbial metabolomics example concerns the fermentation of Escherichia coli and the use of this information to construct improved production strains. The E. coli strain was cultivated under different environmental conditions, and from these fermentations, comprising in total 28 samples, the metabolomes were analyzed using a GC-MS and an LC-MS method, resulting in 144 and 44 metabolites, respectively. Hence, the blocks as visualized in Fig. 2.1 consist of LC-MS (28 × 44) and GC-MS (28 × 144) data with an underlying experimental design [24]. The two data blocks were preprocessed by log-transformation and subsequent column-centering. Then they were subjected to an MFA, which is an instantiation of our data fusion framework (see Section 3.3); a minimal sketch of the MFA computation is given below. The weights put on the GC versus the LC block in the MFA were 0.66 and 0.34, respectively. The proportion of variance accounted for by the successive MFA components is shown in Fig. 2.10. Based on this figure, we chose five MFA components to analyze the data. The common scores were Varimax rotated to a simple structure [49]. This resulted in components of which the first two had contributions from GC and LC, the third and fifth mostly from LC, and the fourth from GC. The corresponding loadings are shown in Fig. 2.11. Some interpretation is given here and can be based on the experimental design. The first MFA component seems to consist of metabolic processes related to oxygen limitation and to the early stationary growth phase (after approximately 40 h of fermentation), which might be the result of oxygen stress. The metabolites fumarate, malate, aspartate, α-ketoglutarate, and 2-hydroxyglutarate are grouped together in the fourth MFA component and are all related to succinate metabolism; these metabolites are more abundant in samples with succinate as the carbon source. A detailed interpretation of the results is given elsewhere [31].
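A minimal sketch of the MFA computation used here (assuming the classical MFA weighting, in which each column-centered block is scaled by the inverse of its largest singular value before a joint decomposition; names and toy data are ours):

    import numpy as np

    def mfa_scores(blocks, n_comp=5):
        """MFA as an instance of the framework: weight each block by the
        inverse of its first singular value, so that no block dominates,
        then decompose the concatenation by one SVD (cf. Eq. 2.5)."""
        w = [1.0 / np.linalg.svd(X, compute_uv=False)[0] for X in blocks]
        Xw = np.hstack([wi * X for wi, X in zip(w, blocks)])
        U, s, Vt = np.linalg.svd(Xw, full_matrices=False)
        return U[:, :n_comp] * s[:n_comp], w       # common scores, weights

    # Toy usage mimicking the block sizes of this example (28 fermentations).
    rng = np.random.default_rng(6)
    gc = rng.normal(size=(28, 144))
    lc = rng.normal(size=(28, 44))
    gc, lc = gc - gc.mean(axis=0), lc - lc.mean(axis=0)  # column-centering
    scores, w = mfa_scores([gc, lc])
    print(scores.shape, np.round(w, 3))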
FIGURE 2.10 Proportion of explained variances for the MFA solutions (top: GC block; bottom: LC block; horizontal axis: component number). The bars represent variances within each block by MFA components and the curve represents explained variance within a block by separate component analysis.
5.2 Medical Biology Example

The data set is a subset of a larger study on the effects of gastric bypass surgery on obese and diabetic subjects [3]. Here, we focus on 14 obese patients with DM2 who underwent gastric bypass surgery. A description of the data was already given in Section 1.4. The three data blocks, amines (A), lipids (L), and oxylipins (O), consist of 14 subjects × 4 samples = 56 rows and 34, 243, and 32 variables, respectively. All variables in all three blocks were square-root transformed to obtain more evenly distributed data. Individual differences between subjects were removed by subtracting each subject's average profile. All variables were then scaled to unit variance. The blocks were also scaled to unit norm before SCA, to normalize scale differences between blocks; a sketch of this preprocessing chain is given below. Selecting the dimensions of the subspaces is more complicated when the number of blocks increases. In this three-block example, we need to decide the dimensions of seven subspaces: X123C, X12C, X13C, X23C, X1D, X2D, and X3D.
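A minimal sketch of this preprocessing chain (the function and variable names are ours; the toy data are random):

    import numpy as np

    def preprocess_block(X, subject_ids):
        """Square-root transform, remove each subject's average profile,
        autoscale the variables, and scale the block to unit norm, as
        described in the text."""
        X = np.sqrt(X)                             # more evenly distributed data
        for s in np.unique(subject_ids):
            rows = subject_ids == s
            X[rows] -= X[rows].mean(axis=0)        # center within each subject
        X /= X.std(axis=0, ddof=1)                 # unit variance per variable
        return X / np.linalg.norm(X)               # unit norm per block

    # Toy usage: 14 subjects x 4 samples = 56 rows, 34 amine variables.
    rng = np.random.default_rng(7)
    X = rng.uniform(0.1, 10.0, size=(56, 34))
    subject_ids = np.repeat(np.arange(14), 4)
    Xp = preprocess_block(X, subject_ids)
    print(round(float(np.linalg.norm(Xp)), 3))     # 1.0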
FIGURE 2.11 Heatmap of the loadings of the five rotated MFA components. Only 130 of the 188 metabolites are shown; for the other metabolites the identity was unknown.
Explained variance as a function of the number of SCA components is given in Fig. 2.12. The curve of cumulative variance does not have a clear bend, which makes it hard to decide the cutoff between structure and noise. To allocate the common and distinct components, we need to fix the number of SCA components and then compare the fit values of Eq. (2.13) for different target matrices. The computations are time-consuming, as there are, e.g., 462 possible target matrices for the five-component model.

To illustrate the complexity of selecting the dimensions of the subspaces, we calculated all possible rotations for models with three to five SCA components. The fit values are very similar, making it hard to conclude which rotation gives the best fit. Looking further into the actual rotated score vectors, we discover that many of the models agree on some of the subspaces. We choose to interpret the five-component model with fit value 0.24, the best five-component model, which contains one component that is common across all three blocks, two components common to A and L, and one distinct component each for A and O. The decomposition of each block is illustrated by the pie charts in Fig. 2.13A–C. Notice that there is a substantial contribution of one of the C-AL components also in the O block (7%), which implies that this component could perhaps also be regarded as common across all three blocks.
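As a rough illustration of the SCA step and of per-block explained variance curves of the kind shown in Fig. 2.12, the sketch below uses synthetic data with the shapes from this example. The preprocessing follows the description above; all names and the data itself are our assumptions, and the DISCO rotation toward target matrices is omitted.

```python
# SCA sketch on three synthetic blocks shaped like the example
# (56 rows = 14 subjects x 4 samples; 34, 243, and 32 variables).
import numpy as np

rng = np.random.default_rng(1)
sizes = {"amines": 34, "lipids": 243, "oxylipins": 32}
blocks = {k: rng.lognormal(size=(56, p)) for k, p in sizes.items()}
subject = np.repeat(np.arange(14), 4)       # subject index per row

def preprocess(X):
    X = np.sqrt(X)                          # square-root transform
    for s_id in np.unique(subject):         # remove each subject's average profile
        X[subject == s_id] -= X[subject == s_id].mean(axis=0)
    X = X / X.std(axis=0, ddof=1)           # unit-variance scaling per variable
    return X / np.linalg.norm(X)            # block scaling to unit (Frobenius) norm

Xp = {k: preprocess(X) for k, X in blocks.items()}

# SCA: one common score matrix from an SVD of the concatenated blocks.
U, s, Vt = np.linalg.svd(np.hstack(list(Xp.values())), full_matrices=False)

# Explained variance per block for models with 1..10 common components;
# since each block has unit norm, explained variance = 1 - residual norm^2.
for name, X in Xp.items():
    ev = [1 - np.linalg.norm(X - U[:, :r] @ (U[:, :r].T @ X)) ** 2
          for r in range(1, 11)]
    print(name, np.round(ev, 3))
```

The DISCO step proper would then rotate the common components toward target matrices with zero blocks and compare the fit values of Eq. (2.13); that combinatorial search is not part of this sketch.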
FIGURE 2.12 Explained variances for SCA. The bars represent the variance within each block (amines, lipids, oxylipins) explained per component, and the curve represents the cumulative explained variance in all blocks combined.
FIGURE 2.13 Subplots A, B, and C show the decomposition by DISCO for the amines (A), lipids (L), and oxylipins (O) blocks, respectively. Each pie segment represents a component (dimension): C-ALO is the component common to all three blocks (19%, 28%, and 31% of the variance in A, L, and O), the two C-AL components are common to A and L, and D-A and D-O are the distinct components.
To interpret the different subspaces, we plot the scores and loadings from the DISCO model. Fig. 2.14 shows the one-dimensional subspace that is common to all three blocks (C-ALO), which accounts for 19%, 28%, and 31% of the variation in A, L, and O, respectively. The scores are shown in the top panel of Fig. 2.14.
FIGURE 2.14 Scores (top) and loadings per block (bottom) for the one-dimensional DISCO subspace common to all three blocks. Score symbols distinguish the four conditions: after surgery–after meal, after surgery–before meal, before surgery–after meal, and before surgery–before meal.
It is clear that this component contains information related to both surgery and meal; the scores increase after surgery and decrease after the meal. The variables spanning this dimension in each of the three blocks are shown in the bar plots of Fig. 2.14 (bottom). The most striking observation is that the branched-chain amino acids leucine and valine (and, to a lesser extent, isoleucine), together with L-2-aminoadipic acid (closely related to the branched-chain amino acids), are downregulated after surgery, which confirms earlier findings [3]. There is more in common between amines and lipids than with oxylipins; both amines and lipids are involved in central carbon and energy metabolism, and they may therefore show higher correlations between some amino acids and some lipid groups (as reflected by the common subspace).

The two-dimensional subspace common between A and L is shown in Fig. 2.15. These two components together account for 24% and 39% of the variation in the A and L blocks, respectively, and they even explain 9% in the O block. We also see groupings here according to both surgery and meal, especially in the vertical dimension. Note that the two groups that were overlapping in the C-ALO component ("before surgery–before meal" vs. "after surgery–after meal") are completely separated in this subspace. Plots of the distinct components (not shown) did not reveal clear patterns related to the factors treatment and meal. Hence, all effects are seen in the common parts, meaning that a large part of the metabolism is affected simultaneously by these two factors.
FIGURE 2.15 Scores for the two-dimensional DISCO subspace common between the amine and lipid blocks. The horizontal axis is the C-AL component explaining A: 5%, L: 11%, O: 2% of the variance; the vertical axis is the C-AL component explaining A: 19%, L: 28%, O: 7%. Symbols distinguish the four surgery/meal conditions; the numbers label individual samples.
6. CONCLUSIONS

We hope to have shown that fusing data sets can have advantages. Our framework for data fusion and our generic model for finding common and distinct components could serve as guidelines for how to perform certain tasks in data fusion. We would like to stress that data fusion requires several steps that should be carefully considered:
1. Consideration of the goal of the data fusion, based on substantive questions.
2. Careful examination of the nature of the data sets to be fused (see also [50]).
3. Casting the fusion problem in a formal mathematical model.
4. Selection of a proper global loss function, related to the mathematical model, to estimate the parameters.
5. Estimation of the parameters and proper validation of the whole model.
These steps may require several rounds of interaction between the key players, e.g., biologists, data scientists, and analytical chemists, to arrive at a proper problem definition. This shows that fusing data is an interdisciplinary endeavor!
References
[1] O. Azimzadeh, W. Sievert, H. Sarioglu, J. Merl-Pham, R. Yentrapalli, M. Bakshi, D. Janik, M. Ueffing, M. Atkinson, G. Multhoff, S. Tapio, Integrative proteomics and targeted transcriptomics analyses in cardiac endothelial cells unravel mechanisms of long-term radiation-induced vascular dysfunction, J. Proteome Res. 14 (2) (2015) 1203–1219.
[2] R. Higdon, R. Earl, L. Stanberry, C. Hudac, E. Montague, E. Stewart, I. Janko, J. Choiniere, W. Broomall, N. Kolker, R. Bernier, E. Kolker, The promise of multi-omics and clinical data integration to identify and target personalized healthcare approaches in autism spectrum disorders, OMICS 19 (4) (2015) 197–208.
[3] M. Lips, J. Van Klinken, V. Van Harmelen, H. Dharuri, P. 't Hoen, J. Laros, G. Van Ommen, I. Janssen, B. Van Ramshorst, B. Van Wagensveld, D. Swank, F. Van Dielen, A. Dane, A. Harms, R. Vreeken, T. Hankemeier, J. Smit, H. Pijl, K. Willems van Dijk, Roux-en-Y gastric bypass surgery, but not calorie restriction, reduces plasma branched-chain amino acids in obese women independent of weight loss or the presence of type 2 diabetes mellitus, Diabetes Care 37 (12) (2014) 3150–3156.
[4] P.H. Bradley, M.J. Brauer, J.D. Rabinowitz, O.G. Troyanskaya, Coordinated concentration changes of transcripts and metabolites in Saccharomyces cerevisiae, PLoS Comput. Biol. 5 (1) (2009) e1000270.
[5] M.T.A.P. Kresnowati, W.A. van Winden, M.J.H. Almering, A. ten Pierick, C. Ras, T.A. Knijnenburg, P. Daran-Lapujade, J.T. Pronk, J.J. Heijnen, J.M. Daran, When transcriptome meets metabolome: fast cellular responses of yeast to sudden relief of glucose limitation, Mol. Syst. Biol. 2 (2006) 49.
[6] C. Caldana, T. Degenkolbe, A. Cuadros-Inostroza, S. Klie, R. Sulpice, A. Leisse, D. Steinhauser, A. Fernie, L. Willmitzer, M. Hannah, High-density kinetic analysis of the metabolomic and transcriptomic response of Arabidopsis to eight environmental conditions, Plant J. 6 (5) (2011) 869–884.
[7] C. Clish, E. Davidov, M. Oresic, T. Plasterer, G. Lavine, T. Londo, M. Meys, P. Snell, W. Stochaj, A. Adourian, X. Zhang, N. Morel, E. Neumann, E. Verheij, J. Vogels, L. Havekes, N. Afeyan, F. Regnier, J. Van Der Greef, S. Naylor, Integrative biological analysis of the APOE*3-Leiden transgenic mouse, OMICS 8 (2004) 3–13.
[8] I. Montoliu, F.P.J. Martin, S. Collino, S. Rezzi, S. Kochhar, Multivariate modeling strategy for intercompartmental analysis of tissue and plasma 1H NMR spectrotypes, J. Proteome Res. 8 (5) (2009) 2397–2406.
[9] W. Heijne, R. Lamers, P. van Bladeren, J. Groten, J. van Nesselrooij, B. van Ommen, Profiles of metabolites and gene expression in rats with chemically induced hepatic necrosis, Toxicol. Pathol. 33 (2005) 425–433.
[10] A.K. Smilde, I. Mage, T. Naes, T. Hankemeier, M.A. Lips, H.A.L. Kiers, E. Acar, R. Bro, Common and distinct components in data fusion, J. Chemometr. 31 (7) (2017) e2900.
[11] I. Van Mechelen, A. Smilde, A generic linked-mode decomposition model for data fusion, Chemometr. Intell. Lab. Syst. 104 (2010) 83–94.
[12] I. Kouskoumvekaki, N. Shublaq, S. Brunak, Facilitating the use of large-scale biological data and tools in the era of translational bioinformatics, Brief. Bioinf. 15 (6) (2014) 942–952.
[13] W. Luo, C. Brouwer, Pathview: an R/Bioconductor package for pathway-based data integration and visualization, Bioinformatics 29 (14) (2013) 1830–1831.
[14] S. Wienkoop, K. Morgenthal, F. Wolschin, M. Scholz, J. Selbig, W. Weckwerth, Integration of metabolomic and proteomic phenotypes, Mol. Cell. Proteomics 7 (9) (2008) 1725–1736.
[15] I. Thiele, N. Swainston, R.M.T. Fleming, A. Hoppe, S. Sahoo, M.K. Aurich, H. Haraldsdottir, M.L. Mo, O. Rolfsson, M.D. Stobbe, S.G. Thorleifsson, R. Agren, C. Bolling, S. Bordel, A.K. Chavali, P. Dobson, W.B. Dunn, L. Endler, D. Hala, M. Hucka, D. Hull, D. Jameson, N. Jamshidi, J.J. Jonsson, N. Juty, S. Keating, I. Nookaew, N. Le Novere, N. Malys, A. Mazein, J.A. Papin, N.D. Price, E. Selkov, M.I. Sigurdsson, E. Simeonidis, N. Sonnenschein, K. Smallbone, A. Sorokin, J.H.G.M. van Beek, D. Weichart, I. Goryanin, J. Nielsen, H.V. Westerhoff, D.B. Kell, P. Mendes, B.O. Palsson, A community-driven global reconstruction of human metabolism, Nat. Biotechnol. 31 (5) (2013) 419–425.
[16] M. Krauss, S. Schaller, S. Borchers, R. Findeisen, J. Lippert, L. Kuepfer, Integrating cellular metabolism into a multiscale whole-body model, PLoS Comput. Biol. 8 (10) (2012) e1002750.
[17] T. Naes, O. Tomic, N. Afseth, V. Segtnan, I. Mage, Multiblock regression based on combinations of orthogonalisation, PLS-regression and canonical correlation analysis, Chemometr. Intell. Lab. Syst. 124 (2013) 32–42.
[18] A. Smilde, J. Westerhuis, S. de Jong, A framework for sequential multiblock component methods, J. Chemometr. 17 (2003) 323–337.
[19] J.A. Westerhuis, T. Kourti, J.F. MacGregor, Analysis of multiblock and hierarchical PCA and PLS models, J. Chemometr. 12 (5) (1998) 301–321.
[20] I. Mage, E. Menichelli, T. Naes, Preference mapping by PO-PLS: separating common and unique information in several data blocks, Food Qual. Prefer. 24 (1) (2012) 8–16.
[21] E. Menichelli, T. Almoy, O. Tomic, N. Olsen, T. Naes, SO-PLS as an exploratory tool for path modelling, Food Qual. Prefer. 36 (2014) 122–134.
[22] D. Kaplan, Structural Equation Modeling, Sage, 2009.
[23] T.G. Doeswijk, A.K. Smilde, J.A. Hageman, J.A. Westerhuis, F.A. van Eeuwijk, On the increase of predictive performance with high-level data fusion, Anal. Chim. Acta 705 (1–2) (2011) 41–47.
[24] A. Smilde, M. van der Werf, S. Bijlsma, B. van der Werff-van-der Vat, R. Jellema, Fusion of mass spectrometry-based metabolomics data, Anal. Chem. 77 (2005a) 6729–6736.
[25] E. Acar, E. Papalexakis, G. Gürdeniz, M. Rasmussen, A. Lawaetz, M. Nilsson, R. Bro, Structure-revealing data fusion, BMC Bioinf. 15 (2014) 239.
[26] E. Acar, R. Bro, A. Smilde, Data fusion in metabolomics using coupled matrix and tensor factorizations, Proc. IEEE 103 (9) (2015) 1602–1620.
[27] A. Smilde, R. Bro, P. Geladi, Multi-way Analysis: Applications in the Chemical Sciences, John Wiley & Sons, 2004.
[28] I. Van Mechelen, J. Schepers, A unifying model involving a categorical and/or dimensional reduction for multimode data, Comput. Stat. Data Anal. 52 (2007) 537–549.
[29] J. Pagès, Collection and analysis of perceived product inter-distances using multiple factor analysis: application to the study of 10 white wines from the Loire valley, Food Qual. Prefer. 16 (7) (2005) 642–649.
[30] H. L'Hermier des Plantes, B. Thiébaut, Étude de la pluviosité au moyen de la méthode STATIS, Rev. Stat. Appl. 25 (1977) 57–81.
[31] K. Van Deun, A. Smilde, M. van der Werf, H. Kiers, I. Van Mechelen, A structured overview of simultaneous component based data integration, BMC Bioinf. 10 (2009) 246.
[32] H. Kiers, J. Ten Berge, Hierarchical relations between methods for simultaneous component analysis and a technique for rotation to a simple simultaneous structure, Br. J. Math. Stat. Psychol. 47 (Part 1) (1994) 109–126.
[33] M. Timmerman, H. Kiers, Four simultaneous component models of multivariate time series from more than one subject to model intraindividual and interindividual differences, Psychometrika 68 (2003) 105–122.
[34] O. de Noord, E. Theobald, Multilevel component analysis and multilevel PLS of chemical process data, J. Chemometr. 19 (5–7) (2005) 301–307.
[35] J. Jansen, H. Hoefsloot, J. van der Greef, M. Timmerman, A. Smilde, Multilevel component analysis of time-resolved metabolic fingerprinting data, Anal. Chim. Acta 530 (2) (2005a) 173–183.
[36] M. Nueda, A. Conesa, J. Westerhuis, H. Hoefsloot, A. Smilde, M. Talon, A. Ferrer, Discovering gene expression patterns in time course microarray experiments by ANOVA-SCA, Bioinformatics 23 (2007) 1792–1800.
[37] M. Timmerman, Multilevel component analysis, Br. J. Math. Stat. Psychol. 59 (Part 2) (2006) 301–320.
[38] P.D. Harrington, N.E. Vieira, J. Espinoza, J.K. Nien, R. Romero, A.L. Yergey, Analysis of variance-principal component analysis: a soft tool for proteomic discovery, Anal. Chim. Acta 544 (1–2) (2005) 118–127.
[39] J.J. Jansen, H.C.J. Hoefsloot, J. van der Greef, M.E. Timmerman, J.A. Westerhuis, A.K. Smilde, ASCA: analysis of multivariate data obtained from an experimental design, J. Chemometr. 19 (2005b) 469–481.
[40] A.K. Smilde, J.J. Jansen, H.C.J. Hoefsloot, R.J.A.N. Lamers, J. van der Greef, M.E. Timmerman, ANOVA simultaneous component analysis (ASCA): a new tool for analyzing designed metabolomics data, Bioinformatics 21 (13) (2005b) 3043–3048.
[41] P. Kroonenberg, Applied Multiway Data Analysis, John Wiley & Sons, New Jersey, 2008.
[42] E. Acar, M. Rasmussen, F. Savorani, T. Naes, R. Bro, Understanding data fusion within the framework of coupled matrix and tensor factorizations, Chemometr. Intell. Lab. Syst. 129 (2013) 53–63.
[43] T. Wilderjans, E. Ceulemans, H. Kiers, K. Meers, The LMPCA program: a graphical user interface for fitting the linked-mode PARAFAC-PCA model to coupled real-valued data, Behav. Res. Methods 41 (2009) 1073–1082.
[44] J. Schott, Matrix Analysis for Statistics, Wiley and Sons, 1997.
[45] H. Yanai, K. Takeuchi, Y. Takane, Statistics for Social and Behavioral Sciences, Springer, 2011.
[46] M. Schouteden, K. Van Deun, S. Pattyn, I. Van Mechelen, SCA with rotation to distinguish common and distinctive information in linked data, Behav. Res. Methods 45 (3) (2013) 822–833.
[47] K. Van Deun, I. Van Mechelen, L. Thorrez, M. Schouteden, B. De Moor, M. van der Werf, L. De Lathauwer, A. Smilde, H.L. Kiers, DISCO-SCA and properly applied GSVD as swinging methods to find common and distinctive processes, PLoS One 7 (5) (2012).
[48] K. Van Deun, A. Smilde, L. Thorrez, H. Kiers, I. Van Mechelen, Identifying common and distinctive processes underlying multiset data, Chemometr. Intell. Lab. Syst. 129 (2013) 40–51.
[49] H. Kaiser, The varimax criterion for analytic rotation in factor analysis, Psychometrika 23 (1958) 187–200.
[50] I. Van Mechelen, A. Smilde, Comparability problems in the analysis of multiway data, Chemometr. Intell. Lab. Syst. 106 (2011) 2–11.
CHAPTER

3 General Framing of Low-, Mid-, and High-Level Data Fusion With Examples in the Life Sciences

Agnieszka Smolinska*,¹, Jasper Engel†, Ewa Szymanska‡, Lutgarde Buydens§, Lionel Blanchet*

*Department of Pharmacology and Toxicology, NUTRIM School for Nutrition and Translational Research in Metabolism, Maastricht University, Maastricht, The Netherlands; †Biometris, Wageningen University and Research, Wageningen, The Netherlands; ‡FrieslandCampina, Amersfoort, The Netherlands; §Radboud University, Institute for Molecules and Materials, Department of Analytical Chemistry, Nijmegen, The Netherlands; ¹Corresponding author
1. INTRODUCTION

Nowadays, in many different scientific fields, information about a particular phenomenon is simultaneously collected from multiple biological or technical platforms/modalities. The term platform refers here to a biological source or organism (e.g., urine, tissue, exhaled breath, or blood) [1–4] or an analytical technique (e.g., gas chromatography [GC], nuclear magnetic resonance [NMR], or liquid chromatography [LC]) [1,5–7]. Owing to the complexity of natural phenomena, it is unusual that all relevant information can be obtained using just one of the modalities.
Therefore, it is common practice to combine measurements from multiple sources of information to better understand the underlying problem. Combining multiple types of measurements in a single analysis, also known as data fusion or data integration, is a widely studied topic in many fields, ranging from analytical chemistry to biology, psychology, and computer science [8,9].

Depending on their nature, different types of measurements can be linked in different ways [10]. Typically, the modes of the acquired data sets are studied to understand how they can be linked. A mode is defined as one of the categories according to which the data can be ordered, such as the observations. Consider, as an example, an experiment in which several data matrices are produced. If each data matrix contains different variables and different observations, there is no common mode: although there might be a relationship between the different data sources, it is not directly evident how the data matrices should be linked. When the variable mode is in common, each data matrix considers the same variables for different sets of observations. This situation is often encountered in analytical chemistry, for example, when different batches of data are measured, or when observations can clearly be grouped because of the experimental design that was applied [11,12]. Similarly, when the observation mode is in common, each data matrix consists of a different set of variables measured for the same set of observations. Such data are often encountered in systems-biology studies where, for example, different omics techniques are used to simultaneously probe the entire system at different levels of biological complexity [5,13–15]. Finally, the data matrices may have both modes in common; for example, a series of liquid chromatography–mass spectrometry (LC-MS) spectra have retention time and mass as common modes, and together they form a multiway array [16].

In this work, we define data fusion as an approach for integrating several data sets/platforms measured on the same set of samples, and we specifically focus on the case in which several data matrices (data blocks) were measured [17]. Various techniques are then used to measure the relation between the platforms/data blocks [5,18,19]. Data fusion is widely applied in fields such as omics, analytical chemistry, industrial process analysis, comprehensive quality control, and sensor and communication analysis [18,20,21]. In analytical chemistry and omics-related fields, fusion of multiple platforms usually has two specific goals: (1) increasing the prediction accuracy and/or (2) obtaining a better understanding of the studied phenomena [1,5,18,22]. Depending on the nature of the data, the different types of measurements can be linked in different ways to achieve these goals [5,18,19].
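As a concrete, entirely synthetic illustration of the two common-mode cases above: blocks sharing the observation mode can be linked by concatenation along the variable axis, and blocks sharing the variable mode by concatenation along the observation axis. All shapes and names below are made up for the sketch.

```python
# Linking data blocks via a shared mode; shapes are illustrative only.
import numpy as np

rng = np.random.default_rng(2)

# Common observation mode: the same 30 samples measured on two platforms.
X_nmr = rng.normal(size=(30, 120))           # e.g., 120 NMR variables
X_lcms = rng.normal(size=(30, 450))          # e.g., 450 LC-MS features
row_linked = np.hstack([X_nmr, X_lcms])      # (30, 570): variables side by side

# Common variable mode: the same 120 variables measured on two sample batches.
batch_a = rng.normal(size=(20, 120))
batch_b = rng.normal(size=(25, 120))
col_linked = np.vstack([batch_a, batch_b])   # (45, 120): observations stacked
```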
For example, the data from the different platforms can be combined at the level of the "raw" measurements. A dimension reduction procedure can then be carried out to identify patterns in the data that are shared among (some of) the platforms and patterns that are unique to a particular platform [23–26]. This is an example of data fusion whereby the aim is to obtain a better understanding of the studied phenomena, sometimes also referred to as structure-revealing data fusion [27,28]. This approach has been used, for example, to obtain promising results in the joint analysis of NMR and fluorescence measurements of plasma samples for differentiating colorectal cancer patients from controls [29]. In a classification problem, the platforms are instead sometimes fused at the level of the classifications obtained with models fit to each individual data matrix, e.g., by a majority voting rule [30]. This is an example whereby data fusion is used to improve prediction/classification accuracy. It has been shown that, even for platforms with low individual predictive power, such data fusion can greatly increase the prediction accuracy, because the (multivariate) relationships between the data of the different platforms are considered.

In the above-mentioned approaches, the data acquired from the different platforms are combined at different levels, i.e., the level of the raw data and the level of predictions. In between is the "feature" level, where data are combined after an initial dimension reduction or variable selection step. These levels are referred to as low-level, mid-level, and high-level data fusion. Finally, there is the special case called kernel fusion [1,31,32], whereby the data of each platform are first transformed via the kernel trick and the resulting kernels are then combined via a linear combination, as sketched below.

This chapter details how data can be fused at the different levels, discusses the benefits and disadvantages of the different approaches, and provides some links to structure-revealing data fusion approaches. Data fusion at the different levels is demonstrated and compared using four data sets: exhaled breath data of patients with Crohn disease (CD) obtained by gas chromatography–mass spectrometry (GC-MS) [33], 454 pyrosequencing microbiome data of patients with CD [34], and metabolic profiling of beer brands by GC-MS and by positive and negative ion mode liquid chromatography–mass spectrometry (LC-MS) [35]. Using these four data sets, the strategies for low-level, mid-level, high-level, and kernel fusion will be demonstrated. Note that the first steps of data analysis, related to sampling, analytical data acquisition, and data preprocessing, are not covered here because they are very data specific and have been thoroughly described elsewhere.
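For the kernel fusion case just mentioned, a hedged sketch follows. The data, labels, kernel choices, and weight are all illustrative assumptions, not the chapter's actual pipeline; it only shows the mechanics of combining one kernel per platform into a single fused kernel for a kernel classifier.

```python
# Kernel fusion sketch: one kernel per platform, combined linearly, then a
# kernel classifier on the fused kernel. Data and labels are synthetic.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=40)              # e.g., active disease vs. remission
platform_a = rng.normal(size=(40, 200))      # e.g., GC-MS features
platform_b = rng.normal(size=(40, 80))       # e.g., microbiome features

# One kernel matrix per platform (RBF chosen here for illustration).
K_a = rbf_kernel(platform_a, gamma=1.0 / 200)
K_b = rbf_kernel(platform_b, gamma=1.0 / 80)

w = 0.5                                      # block weight; tune by cross-validation
K_fused = w * K_a + (1.0 - w) * K_b          # linear combination of kernels

clf = SVC(kernel="precomputed").fit(K_fused, y)
print("training accuracy:", clf.score(K_fused, y))  # illustration only
```

In practice, the kernel parameters and the block weight would be selected by cross-validation, and predicting for new samples requires the kernel between the test and training samples rather than the training kernel shown here.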
2. DATA SAMPLING, MEASUREMENTS, AND PREPROCESSING

2.1 Exhaled Breath and Fecal Microbiota (Data Sets One and Two)

Exhaled breath and fecal samples were obtained from 71 patients with CD who were participating in a 1-year follow-up study as part of a prospective cohort of inflammatory bowel disease outpatients of the population-based IBDSL cohort [18,19]. During that time, each individual delivered between two and five breath and fecal samples. Disease activity was defined by a fecal calprotectin (FC) > 250 µg/g [22]. Remission was defined by a Harvey-Bradshaw index ≤ 4 in combination with both serum C-reactive protein