English Pages 384 [380] Year 2024
Studies in Big Data 139
Hiroshi Ishikawa
Hypothesis Generation and Interpretation: Design Principles and Patterns for Big Data Applications
Studies in Big Data Volume 139
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Big Data” (SBD) publishes new developments and advances in the various areas of Big Data- quickly and with a high quality. The intent is to cover the theory, research, development, and applications of Big Data, as embedded in the fields of engineering, computer science, physics, economics and life sciences. The books of the series refer to the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources coming from sensors or other physical instruments as well as simulations, crowd sourcing, social networks or other internet transactions, such as emails or video click streams and other. The series contains monographs, lecture notes and edited volumes in Big Data spanning the areas of computational intelligence including neural networks, evolutionary computation, soft computing, fuzzy systems, as well as artificial intelligence, data mining, modern statistics and Operations research, as well as self-organizing systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. The books of this series are reviewed in a single blind peer review process. Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH. All books published in the series are submitted for consideration in Web of Science.
Hiroshi Ishikawa Department of Systems Design Tokyo Metropolitan University Hino, Tokyo, Japan
ISSN 2197-6503 ISSN 2197-6511 (electronic) Studies in Big Data ISBN 978-3-031-43539-3 ISBN 978-3-031-43540-9 (eBook) https://doi.org/10.1007/978-3-031-43540-9 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.
Preface
In the big data era, which is characterized by volume, variety, and velocity and in which a large amount of diverse data is generated at high speed, the role of a hypothesis in generating the final value is more important than ever, and such a hypothesis is also more complicated than ever. At the same time, the era of big data creates new vague concerns for end users as to whether big data relevant to them will be used appropriately. Generating hypotheses in advance determines not only the success of scientific discoveries but also the success of investments in business and government. A model is constructed by abstracting the generated hypothesis. Executing the model produces individual values and judgments as results. When conducting socially influential projects (e.g., tourism, engineering, science, and medicine) based on big data, it is necessary to explain to the users the model (i.e., hypothesis) generation processes and the basis of the individual decisions and predictions that result from executing the model. If such an explanation can be given properly, it will lead to accountability (i.e., evidence-based policy making). Based on historical viewpoints such as the philosophy of science, we first explain various kinds of reasoning and problem solving as basic hypothesis generation methods and then explain machine learning and integrated approaches as advanced hypothesis generation methods. In addition, we show that the description of the overall procedure for hypothesis generation (i.e., how), including both data management and data analysis, is effective in explaining hypothesis generation. On the other hand, in sciences such as modern astronomy and physics, it is important to reduce an observable event to more basic ones. Even in other fields, knowing which variables the model emphasizes in decision-making leads to the interpretation of the model.
As a method for explaining the interpretation of the hypothesis (i.e., why), we mainly describe the selection and contribution of significant variables, the discovery of basic events, and the visualization of the basis of decisions. In this book, we explain the design principles and patterns as reusable solutions to common problems that frequently appear in the generation and interpretation of hypotheses during the design of big data applications. To make them understandable, we take as many concrete case studies as possible.
This book focuses on data analysis and data management in hypothesis generation. In addition, it incorporates historical perspectives of science. Further, it places emphasis on the origin of the ideas behind the design principles and patterns for big data applications and explains how the ideas were born and developed. This book is intended for practitioners as well as college students in specialized courses. This book does not focus only on data science and data analysis but emphasizes the importance of data engineering and data management required for big data applications as well. This also makes the book useful both for data engineers who want to be able to analyze data and data scientists who want to be able to utilize data engineering.
Kawasaki, Japan
August 2023
Hiroshi Ishikawa
Contents
1 Basic Concept
  1.1 Big Data
    1.1.1 Big Data in the 5G Era
    1.1.2 Characteristics of Big Data
    1.1.3 Society 5.0
    1.1.4 5G
  1.2 Key Concepts of Big Data Analysis
    1.2.1 Interaction Between Real-World Data and Social Data
    1.2.2 Universal Key
    1.2.3 Ishikawa Concept
    1.2.4 Single Event Data and Single Data Source
    1.2.5 Process Flow of Big Data Analysis
  1.3 Big Data’s Vagueness Revisited
    1.3.1 Issues
    1.3.2 Integrated Data Model Approach
  1.4 Hypothesis
    1.4.1 What Is Hypothesis?
    1.4.2 Hypothesis in Data Analysis
    1.4.3 Hypothesis Generation
    1.4.4 Hypothesis Interpretation
  1.5 Design Principle and Design Pattern
  1.6 Notes on the Cloud
  1.7 Big Data Applications
    1.7.1 EBPM
    1.7.2 Users of Big Data Applications
  1.8 Design Principles and Design Patterns for Efficient Big Data Processing
    1.8.1 Use of Tree Structures
    1.8.2 Reuse of Results of Subproblems
    1.8.3 Use of Locality
    1.8.4 Data Reduction
    1.8.5 Online Processing
    1.8.6 Parallel Processing
    1.8.7 Function and Problem Transformation
  1.9 Structure of This Book
  References
2 Hypothesis
  2.1 What Is Hypothesis?
    2.1.1 Definition and Properties of Hypothesis
    2.1.2 Life Cycle of Hypothesis
    2.1.3 Relationship of Hypothesis with Theory and Model
    2.1.4 Hypothesis and Data
  2.2 Research Questions as Hints for Hypothesis Generation
  2.3 Data Visualization
    2.3.1 Low-Dimensional Data
    2.3.2 High-Dimensional Data
    2.3.3 Tree and Graph Structures
    2.3.4 Time and Space
    2.3.5 Statistical Summary
  2.4 Reasoning
    2.4.1 Philosophy of Science and Hypothetico-Deductive Method
    2.4.2 Deductive Reasoning
    2.4.3 Inductive Reasoning
    2.4.4 Generalization and Specialization
    2.4.5 Plausible Reasoning
  2.5 Problem Solving
    2.5.1 Problem Solving of Pólya
    2.5.2 Execution Means for Problem Solving
    2.5.3 Examples of Problem Solving
    2.5.4 Unconscious Work
  References
3 Science and Hypothesis
  3.1 Kepler Solving Problems
    3.1.1 Brahe’s Data
    3.1.2 Obtaining Orbit Data from Observation Data (Task 1)
    3.1.3 Deriving Kepler’s First Law (Task 2)
  3.2 Galileo Conducting Experiments
    3.2.1 Galileo’s Law of Free Fall
    3.2.2 Thought Experiments
    3.2.3 Galileo’s Law of Inertia
    3.2.4 Galileo’s Principle of Relativity
  3.3 Newton Seeking After Universality
    3.3.1 Reasoning Rules
    3.3.2 Three Laws of Motion
    3.3.3 The Universal Law of Gravitation
  3.4 Darwin Observing Nature
    3.4.1 Theory of Evolution
    3.4.2 Population Growth Model
    3.4.3 Fibonacci Sequence Revisited
    3.4.4 Logistic Model
  References
4 Regression
  4.1 Basics of Regression
    4.1.1 Ceres Orbit Prediction
    4.1.2 Method of Least Squares
    4.1.3 From Regression to Orthogonal Regression to Principal Component Analysis
    4.1.4 Nonlinear Regression
    4.1.5 From Regression to Sparse Modeling
  4.2 From Regression to Correlation to Causality
    4.2.1 Genetics and Statistics
    4.2.2 Galton
    4.2.3 Karl Pearson
    4.2.4 Neyman and Gosset
    4.2.5 Wright
    4.2.6 Spearman
    4.2.7 Nightingale
    4.2.8 Mendel
    4.2.9 Hardy–Weinberg Equilibrium
    4.2.10 Fisher
  References
5 Machine Learning and Integrated Approach
  5.1 Clustering
    5.1.1 Definition and Brief History of Clustering
    5.1.2 Clustering Based on Partitioning
    5.1.3 Hierarchical Clustering
    5.1.4 Evaluation of Clustering Results
    5.1.5 Advanced Clustering
  5.2 Association Rule Mining
    5.2.1 Applications
    5.2.2 Basic Concept
    5.2.3 Overview of Apriori Algorithm
    5.2.4 Generation of Association Rule
  5.3 Artificial Neural Network and Deep Learning
    5.3.1 Cross-Entropy and Gradient Descent
    5.3.2 Biological Neurons
    5.3.3 Artificial Neural Network
    5.3.4 Classification
    5.3.5 Deep Learning
  5.4 Integrated Hypothesis Generation
  5.5 Data Structures
    5.5.1 Hierarchy
    5.5.2 Graph and Network
    5.5.3 Digital Ecosystem
  References
6 Hypothesis Generation by Difference
  6.1 Difference-Based Method for Hypothesis Generation
    6.1.1 Classification of Difference-Based Methods
    6.1.2 Difference Operations
  6.2 Difference in Time
    6.2.1 Analysis of Time Series Data
    6.2.2 Time Difference: Case of Discovery of Satisfactory Spot
    6.2.3 Time Difference: Case of Tankan of BOJ
    6.2.4 Difference in Differences: Case of Effect of New Drug
    6.2.5 Time Series Model: Smoothing and Filtering
    6.2.6 Multiple Moving Averages: Case of Estimating Best Time to View Cherry Blossoms
    6.2.7 Exponential Smoothing: Case of Detecting Local Trending Spots
    6.2.8 Nested Moving Averages: Case of El Niño–Southern Oscillation
    6.2.9 Time Series Forecasting
    6.2.10 MQ-RNN
    6.2.11 Difference Equation
  6.3 Differences in Space
    6.3.1 Image with Time Difference
    6.3.2 Difference Analysis of Medical Images
    6.3.3 Difference Analysis of Topographic Data
    6.3.4 Difference in Lunar Surface Images: Case of Discovery of Newly Created Lunar Craters
    6.3.5 Image Processing
  6.4 Differences in Conceptual Space
    6.4.1 Case of Creating the Essential Meaning of Concept
    6.4.2 Case of International Cuisine Notation by Analogy
  6.5 Difference Between Hypotheses
    6.5.1 Case of Discovery of Candidate Installation Sites for Free Wi-Fi Access Point
    6.5.2 Case of Analyzing Influence of Weather on Tourist Behavior
    6.5.3 GWAS
  References
7 Methods for Integrated Hypothesis Generation
  7.1 Overview of Integrated Hypothesis Generation Methods
    7.1.1 Hypothesis Join
    7.1.2 Hypothesis Intersection
    7.1.3 Hypothetical Union
    7.1.4 Ensemble Learning
  7.2 Hypothesis Join: Case of Detection of High-Risk Paths During Evacuation
    7.2.1 Background
    7.2.2 Proposed System
    7.2.3 Experiments and Considerations
  7.3 Hypothesis Intersection: Case of Detection of Abnormal Automobile Vibration
    7.3.1 Background
    7.3.2 Proposed Method
    7.3.3 Experiments
    7.3.4 Considerations
  7.4 Hypothesis Intersection: Case of Identification of Central Peak Crater
    7.4.1 Introduction
    7.4.2 Proposed Method
    7.4.3 Experiments
  References
8 Interpretation
  8.1 Necessity to Interpret and Explain Hypothesis
  8.2 Explanation in the Philosophy of Science
    8.2.1 Deductive Nomological Model of Explanation
    8.2.2 Statistical Relevance Model of Explanation
    8.2.3 Causal Mechanical Model of Explanation
    8.2.4 Unificationist Model of Explanation
    8.2.5 Counterfactual Explanation
  8.3 Subjects and Types of Explanation
    8.3.1 Subjects of Explanation
    8.3.2 Types of Explanation
  8.4 Subjects of Explanation Explained
    8.4.1 Data Management
    8.4.2 Data Analysis
  8.5 Model-Dependent Methods for Explanation
    8.5.1 How to Generate Data (HD)
    8.5.2 How to Generate Hypothesis (HH)
    8.5.3 What Features of Hypothesis (WF)
    8.5.4 What Reason for Hypothesis (WR)
  8.6 Model-Independent Methods of Explanation
    8.6.1 LIME
    8.6.2 Kernel SHAP
    8.6.3 Counterfactual Explanation
  8.7 Reference Architecture for Explanation Management
  8.8 Overview of Case Studies
  8.9 Case of Discovery of Candidate Installation Sites for Free Wi-Fi Access Point
    8.9.1 Overview
    8.9.2 Two Hypotheses
    8.9.3 Explanation of Integrated Hypothesis
    8.9.4 Experiments and Considerations
  8.10 Case of Classification of Deep Moonquakes
    8.10.1 Overview
    8.10.2 Features for Analysis
    8.10.3 Balanced Random Forest
    8.10.4 Experimental Settings
    8.10.5 Experimental Results
    8.10.6 Considerations
  8.11 Case of Identification of Central Peak Crater
    8.11.1 Overview
    8.11.2 Integrated Hypothesis
    8.11.3 Explanation of Results
  8.12 Case of Exploring Basic Components of Scientific Events
    8.12.1 Overview
    8.12.2 Data Set
    8.12.3 Network Configuration and Algorithms
    8.12.4 Visualization of Judgment Evidence by Grad-CAM
    8.12.5 Experiments to Confirm Important Features
    8.12.6 Seeking Basic Factors
  References
Index
Chapter 1
Basic Concept
1.1 Big Data
1.1.1 Big Data in the 5G Era
We are surrounded by an ever-increasing amount of big data in the new communication environment of 5G (the fifth-generation mobile communication system). With the start and spread of 5G services, a larger volume of higher-speed (real-time) data will be generated than ever before. Furthermore, the number of Internet of Things (IoT) devices (IEEE 2022), such as sensors, that can be connected simultaneously has increased dramatically, and it is increasingly necessary to handle a wide variety of data, including real-world data, social data, and open data, in combination. By using such big data in an integrated manner, an intelligent and dynamic society such as Society 5.0 (CAO 2022) can be expected to be realized. The expected application fields include tourism, mobility, social infrastructure, medicine, and science.
In order to understand and utilize big data, to realize concrete applications (e.g., problem solving and prediction), and to create new value as a result, we need some kind of intelligent leverage. We think that generating and interpreting hypotheses is essential as leverage to realize big data applications, which include a variety of tasks such as identification, prediction, and problem solving. However, it is not always obvious how to combine various kinds of data to obtain a necessary and useful hypothesis. The main theme of this book is how to generate a new hypothesis by combining data and hypotheses and how to interpret the structure, mechanism, and results of hypotheses, as required in the 5G big data era. Based on our observations of various big data use cases (concrete application examples), we will explain the integrated hypothesis generation methods, which are the basic technologies required to build big data application systems, as well as reasoning, problem solving, and machine learning. Further, we will introduce the hypothesis interpretation methods based on an approach that uses data analysis (machine learning, data mining) and data management (database) technologies in a harmonious manner.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Ishikawa, Hypothesis Generation and Interpretation, Studies in Big Data 139, https://doi.org/10.1007/978-3-031-43540-9_1
1.1.2 Characteristics of Big Data
First, let us describe the characteristics that big data commonly have. The characteristics of big data can be summarized as follows (Ishikawa and Yamamoto 2021).
• Volume: The amount of data is enormous.
• Velocity: Data are generated at high speed.
• Variety: There are many types of data structures and data sources.
• Vagueness: There is ambiguity or uncertainty in individual data values.
• Veracity: Data require certainty and reliability. It is a characteristic paired with vagueness.
• Value: New value is created by analyzing and utilizing data.
It is often said that big data necessarily have the first three characteristics, but we do not think that it is always necessary for big data to have all of them. Here, we loosen the definition of big data slightly: in this book, data that satisfy at least one of these characteristics are regarded as big data. Further, big data can be roughly classified as follows.
• Real-world data: Real-world data are derived from the real world. They range from data generated in relation to human activities to data obtained by observing natural phenomena. For example, they include data generated from IoT devices such as those installed in connected cars and data from various observation devices such as those installed in space probes like the Japan Aerospace Exploration Agency’s (JAXA) Hayabusa.
• Open data: Open data are mainly collected by public institutions (national governments, local governments, and non-profit organizations) through surveys and observations and are made available to the general public. For example, population data, shelter data, and meteorological data are included.
• Social data: Social data are posted by users on social media (e.g., Twitter, Facebook, Instagram, and Flickr).
However, these classes are not always mutually exclusive. For example, most of the data observed by JAXA are real-world data, and some parts of them are open data as well. If social data are available to outsiders through the API provided by the media site, then social data can also be considered as open data.
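The loosened definition and the non-exclusive classification described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method from this book: the names Characteristic, Category, DataSource, and is_big_data, as well as the two example data sources, are hypothetical and introduced here purely for illustration.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Characteristic(Flag):
    """The six characteristics listed in the text (names are illustrative)."""
    NONE = 0
    VOLUME = auto()
    VELOCITY = auto()
    VARIETY = auto()
    VAGUENESS = auto()
    VERACITY = auto()
    VALUE = auto()


class Category(Flag):
    """Rough classes of big data; a source may belong to several at once."""
    NONE = 0
    REAL_WORLD = auto()
    OPEN = auto()
    SOCIAL = auto()


@dataclass(frozen=True)
class DataSource:
    name: str
    characteristics: Characteristic
    categories: Category

    def is_big_data(self) -> bool:
        # Loosened definition: at least one characteristic is enough.
        return bool(self.characteristics)


# Hypothetical example: observation data that are also released publicly,
# so they are real-world data and open data at the same time.
probe_observations = DataSource(
    name="publicly released space probe observations",
    characteristics=Characteristic.VOLUME | Characteristic.VARIETY,
    categories=Category.REAL_WORLD | Category.OPEN,
)

# Hypothetical counterexample: a source with none of the characteristics.
small_table = DataSource(
    name="small curated table",
    characteristics=Characteristic.NONE,
    categories=Category.OPEN,
)

print(probe_observations.is_big_data())
print(small_table.is_big_data())
print(Category.OPEN in probe_observations.categories)
```

Bit flags are used here because both the characteristics and the classes can be combined freely, which matches the statement that the classes are not exclusive.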
1.1.3 Society 5.0
Next, Society 5.0 (CAO 2022) is explained as a concept related to big data. Until now, our society has focused on hunting (Society 1.0), farming (Society 2.0), industry (Society 3.0), and information (Society 4.0). Society 5.0 is positioned as a human-centered society that can be constructed by maximizing the use of IT and big data, leading to one of our social goals. Big data analysis that combines advanced technologies (e.g., machine learning, data mining, and data management) brings high-value-added effects to the real world by providing valuable information, creating new services, and improving existing services. Further, combining multiple data sources is expected to enhance these effects. Therefore, it can be said that big data are indispensable for the realization of Society 5.0.
1.1.4 5G
Furthermore, 5G (fifth-generation mobile communication) (MIC 2022) is expected to be one of the IT technologies that can support the realization of Society 5.0. Mobile communications before 5G are characterized as follows.
• First generation (1G): Analog method.
• Second generation (2G): Digital system.
• Third generation (3G): Higher speed in comparison with 2G (e.g., W-CDMA and LTE).
• Fourth generation (4G): Higher speed and larger capacity in comparison with 3G (e.g., WiMAX2 and LTE-Advanced).
5G has the following characteristics related to data communication.
• Large capacity.
• High speed.
• Low latency.
• Many connections.
It can be said that these 5G characteristics make the previously described characteristics of big data more prominent. First of all, regarding data communication, 5G’s large capacity as well as its high speed and low latency increase the amount of data accumulated in cyber space through sensor networks (i.e., networks consisting of IoT devices) dramatically in real time. Especially in 5G, the types of different data sources will increase through the simultaneous use of various IoT devices connected to sensor networks, and even if only one data source is used, the dimensions of the data will increase with the development of devices. Furthermore, 5G will further promote the availability of data before hypothesis generation and service development.
Such situations raise the fundamental problem of how to combine a wide variety of data to create new services. In other words, it is necessary to answer the non-trivial question of how to use multiple data in an integrated manner to generate and interpret a hypothesis. One of the aims of this book is to try to tackle this problem head-on.
1.2 Key Concepts of Big Data Analysis
1.2.1 Interaction Between Real-World Data and Social Data
Whether real-world data or social data, a wide variety of big data are being created around us from moment to moment. However, the meanings of real-world data are not explicit in nature. In many cases, it is not possible to extract clear meanings (semantics) from the data alone. In other words, in order to know the meanings of real-world data, it is necessary to make the characteristics of the data correspond to the meanings that exist in other data associated with them. On the other hand, social data often have explicit meanings, whether they are texts (e.g., Twitter) or images (e.g., Flickr). For example, in Twitter, interests, opinions, and impressions are directly expressed as short sentences, and in Flickr, what appears in the photographs is considered to express some interests in itself. Here, if we can discover any relationships that exist between different data sources, we will be able to use them as clues to match (i.e., synchronize) related data contained in different data sources. However, we should also keep in mind that such relationships cannot always be found.
1.2.2 Universal Key
Here, data attributes that can be used to match data with similarities in a general way are called universal keys in this book. Positional information and temporal information, which are generally added to data as metadata, are among the candidates for universal keys that connect different data sources. In other words, if the same position (i.e., area) and the same time (i.e., duration) are used, there is a possibility that different data can be synchronized. If different data are analyzed by synchronizing their positions or times or both as universal keys, it may be possible to extract explicit meanings from one data source and assign the meanings to another. This type of analysis is called synchronized analysis. For example, synchronized analysis of real-world data and social data may supplement the latent meanings of real-world data with the explicit meanings of the synchronized part of social data. Furthermore, if even different data have semantic information (e.g., text and tags), the similarity between the data (e.g., feature vectors) can be measured in the semantic
space (e.g., vector space model) (Ishikawa 2015) created by words or concepts contained in the data. Large similarities (or, conversely, small distances) are an alternative to some universal keys, allowing data to be synchronized between different data sources. For example, it is possible to use the similarity between texts or tags in the body of Twitter articles and tags attached to Flickr images. Recently, Web services that can generate tags directly from image data, such as Google Cloud Vision API (Google Cloud Vision API 2022), have become available. This kind of service enables integrated analysis between different data sources in the semantic space regardless of media types such as images and texts. Even if it is not possible to match multiple data sources as they are, you can select data from one data source by using the features and attributes of results obtained by selection or analysis of another data source.
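As a minimal sketch of similarity used in place of a universal key, the following Python code represents texts and tags as bag-of-words vectors and pairs items whose cosine similarity exceeds a threshold. The function names, the sample posts and photos, and the threshold of 0.5 are illustrative assumptions, not taken from the book.

```python
import math
from collections import Counter

def to_vector(words):
    """Bag-of-words vector in a simple vector space model: word -> count."""
    return Counter(w.lower() for w in words)

def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def match_by_similarity(posts, photos, threshold=0.5):
    """Pair each post with every photo whose tag vector is similar enough."""
    pairs = []
    for post_id, words in posts.items():
        for photo_id, tags in photos.items():
            sim = cosine_similarity(to_vector(words), to_vector(tags))
            if sim >= threshold:
                pairs.append((post_id, photo_id, round(sim, 3)))
    return pairs

# Hypothetical sample data: words from short posts and tags attached to photos.
posts = {"t1": ["cherry", "blossom", "park"], "t2": ["ramen", "lunch"]}
photos = {"p1": ["cherry", "blossom", "tokyo"], "p2": ["beach", "sunset"]}

pairs = match_by_similarity(posts, photos)
print(pairs)
```

Only the post and the photo that share two of their three words are paired here. In practice the vectors would usually be weighted (e.g., by TF-IDF) rather than raw counts.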
1.2.3 Ishikawa Concept
By analyzing multiple types of big data in synchronization with their positions, time, or semantics, we will be able to discover correlations in a broad sense and generate hypotheses based on the results. As a result, applications such as problem solving, recommendation, causal analysis, and near-future prediction become possible. The basic concept of such an analysis method is generally called the Ishikawa concept (Olshannikova et al. 2017; Ishikawa and Yamamoto 2021). According to the Ishikawa concept, we can combine and analyze real-world data, open data, and social data to develop applications. In particular, we call big data “social big data” in this book when they include social data for analysis. This book explains, through actual examples, how to describe the principal parts of an actual application based on this concept, either procedurally as in pseudocode (Pseudocode standard 2022) or declaratively as in SQL (Celko 2014) or SQL-like languages. Figure 1.1 illustrates the Ishikawa concept.
Here, a few notes are made about the types of data sources to be combined in this analysis. In addition to the combination of three different kinds of big data, the following combinations are possible.
• Real-world data and social data.
• Real-world data and open data.
• Open data and social data.
As a special case of synchronized analysis, new findings may be obtained by synchronizing either different social data, different real-world data, or different open data.
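To make the idea of synchronizing two data sources on universal keys concrete, here is a small sketch that joins hypothetical real-world data (rainfall readings) with hypothetical social data (posts) on the same area and the same hour, using SQL through Python's standard sqlite3 module. The table names, columns, and values are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensor (area TEXT, hour TEXT, rainfall_mm REAL);  -- real-world data
CREATE TABLE post   (area TEXT, hour TEXT, body TEXT);         -- social data
""")
conn.executemany("INSERT INTO sensor VALUES (?, ?, ?)", [
    ("A1", "2024-04-01T09", 12.5),
    ("A1", "2024-04-01T10", 0.0),
    ("B2", "2024-04-01T09", 3.0),
])
conn.executemany("INSERT INTO post VALUES (?, ?, ?)", [
    ("A1", "2024-04-01T09", "heavy rain, trains delayed"),
    ("A1", "2024-04-01T09", "umbrella broke"),
    ("C3", "2024-04-01T09", "sunny here"),
])

# Synchronize on the universal keys (area, hour); readings without any
# matching post are kept with a count of zero.
rows = conn.execute("""
SELECT s.area, s.hour, s.rainfall_mm, COUNT(p.body) AS n_posts
FROM sensor AS s
LEFT JOIN post AS p ON p.area = s.area AND p.hour = s.hour
GROUP BY s.area, s.hour, s.rainfall_mm
ORDER BY s.area, s.hour
""").fetchall()

for row in rows:
    print(row)
```

The joined rows let the explicit content of the posts be read alongside the sensor reading for the same place and time. A LEFT JOIN is used so that the absence of matching social data is visible rather than silently dropped, which reflects the caveat that relationships cannot always be found.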
[Figure: diagram with the labels real world, cyber world, synchronization (space/time), information extraction, analysis and visualization of relationships (space, time, and semantics), and use of relationships (prediction, recommendation, and problem solving). Data sources shown: real-world data, open data, and social data, with the examples meteorological data, vehicle driving data, quake data, IoT data, YouTube, Flickr, Facebook, and Twitter; latent semantics and explicit semantics are marked.]
Fig. 1.1 Integrated analysis of big data (aka Ishikawa concept)
1.2.4 Single Event Data and Single Data Source
Even if the same event is the target of analysis, different observation data can be obtained by measuring it with different instruments and methods. This is also a kind of analysis according to the Ishikawa concept, which is the principle of analysis that combines different types of data with some universal key. Familiar examples include determining earthquake epicenters (sources) by multiple seismographs and global positioning by multiple satellites (Global Navigation Satellite System).
Here it should be noted that even if there is only one source of original data, multiple data sets are often generated from the single data set. Even the same data can be converted to different data depending on the processing (conversion) method. In particular, there is a method, called data augmentation, that increases the number of data samples by performing multiple types of operations (image processing such as rotation and flipping) when the number of samples available for machine learning is too small. For example, data augmentation is used in lunar crater detection. In addition, different data series can be created by applying aggregation operations (e.g., moving average) with different parameters (e.g., periods) to a single source of data series. This method is used in the analysis of phenological observation data (open data) such as cherry blossom (aka Sakura) flowering. Further, various image filters can be used to create multiple images from one image (e.g., convolution and pooling in convolutional neural networks).
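The two ways of deriving several data sets from one source described above can be sketched as follows. This is an illustrative example in plain Python, not the procedure used in the studies mentioned: the temperature series, the periods 2 and 3, and the tiny 2x2 "image" are all made-up assumptions.

```python
def moving_average(series, period):
    """Simple moving average; returns len(series) - period + 1 values."""
    if period < 1 or period > len(series):
        raise ValueError("period must be between 1 and len(series)")
    return [
        sum(series[i - period + 1 : i + 1]) / period
        for i in range(period - 1, len(series))
    ]

def augment(image):
    """Return the image plus its horizontal flip, vertical flip,
    and 90-degree clockwise rotation (image = list of rows)."""
    h_flip = [row[::-1] for row in image]
    v_flip = image[::-1]
    rot90 = [list(col) for col in zip(*image[::-1])]
    return [image, h_flip, v_flip, rot90]

# One source series, several derived series (one per period).
temps = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
derived = {p: moving_average(temps, p) for p in (2, 3)}
print(derived)

# One source image, several derived samples.
samples = augment([[1, 2], [3, 4]])
print(samples)
```

Each derived series or image can then be treated as a separate data set and combined with the others in a later analysis step.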
For the same samples (data set), multiple subsets can be created by giving separate data selection conditions (i.e., subsetting). It is often used to create exclusive subsets. For example, by such a subsetting operation, it is possible to obtain a set of genes of persons with a certain disease and a set of genes of persons without the disease from the original set (e.g., case control association study on genes). It may be possible to generate a new hypothesis by combining multiple data created in this way according to the Ishikawa concept. Furthermore, let us consider the extreme case. Even if it is a single data source as it is, data with time and position information are more relevant to events and spots that can be identified with the time and position than data without the information. So to speak, analysis based on data with position and time information is likely to be a hands-on analysis. For example, let us compare one case where you live in a foreign country (e.g., France) and actually visit a place in Japan (e.g., Takeshita Street in Harajuku, Tokyo) and talk about that place in social media with another case where you only talk about the place you have not visited yet. Since the first case is generally based on actual experiences, it can be said that the contents and locations of social data are more closely related to each other than the second case. Of course, this method is also applicable when analyzing any single data set whose position data are accurate as real-world data. In particular, social big data with location information are sometimes called geosocial big data. For example, Fig. 1.2 generated by our research team illustrates the outlines of the UK islands drawn solely by smoothing locations (i.e., geotags) where Flickr photos with a tag “beach” were shot (Omori et al. 2014). In a word, human activities can depict both natural and artificial geographical features (refer to Box “Lu Xun”). 
Box: Lu Xun The last passage of Lu Xun’s novel (My old home, 1923) (Lu 1972) perfectly expresses the concept of geosocial big data. It says “Hope cannot be said to exist, nor can it be said not to exist. It is just like roads across the earth. For actually the earth had no roads to begin with, but when many men pass one way, a road is made”. □
Fig. 1.2 Geotagged Flickr images with a tag “beach” can draw the outlines of UK Islands
1.2.5 Process Flow of Big Data Analysis
Here, we will take social big data and explain the general process flow. The basic flow of analysis using different sources of social big data is as follows.
1. Collection of social data and storage in database
Data are collected and stored in a collection database using the search and streaming APIs provided by social media sites.
2. Database search
Data are selected from the collection database by specifying selection conditions according to the analysis purpose.
3. Data transformation and processing
The selected data are transformed and processed if necessary. For example, data transformation (e.g., normalization, standardization, and discretization) and data aggregation as well as handling missing values, contradictory values, or outliers are performed in this step.
[Figure: two flow diagrams built from the same chain of boxes: collection of social big data/database storage, database search, data preprocessing, storage in analysis database, hypothesis generation/verification, and visualization/knowledge conversion of hypotheses. Panel (a) shows parallel analysis; panel (b) shows serial analysis.]
Fig. 1.3 Parallel analysis and serial analysis. a Parallel analysis. b Serial analysis
4. Storage in analysis database
Usually, the transformed and processed data are stored in an analysis database different from the collection database.
5. Hypothesis generation and verification
Data mining and machine learning algorithms are applied to the transformed and processed data to generate new hypotheses and verify generated hypotheses (Ishikawa 2015). For example, such algorithms include clustering, classification, and regression.
6. Visualization of hypotheses and knowledge creation from hypotheses
Generated hypotheses and verification results are visualized using appropriate methods such as maps, networks (graphs), and various charts. Finally, they are turned into knowledge and even wisdom.
Based on this process flow, the analysis processing can be roughly divided as follows.
• Parallel analysis: Perform Steps 1 to 6 for each type of social media. Furthermore, in Steps 5 and 6, the results obtained for different media are integrated (synchronized) and analyzed.
• Serial analysis: Perform Steps 1 to 6 for one type of social media. For another type of social media, start again from Step 1. In Step 2, search the second media based on the results obtained for the first, and then proceed step by step to Step 6.
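The difference between parallel and serial analysis can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: Step 1 (collection), Step 4 (storage), and Step 6 (visualization) are omitted, "hypothesis generation" is reduced to counting records per place, and all function names, field names, and sample records are hypothetical.

```python
def search(db, condition):
    """Step 2: select records that satisfy the selection condition."""
    return [r for r in db if condition(r)]

def preprocess(records):
    """Step 3: drop records whose place value is missing."""
    return [r for r in records if r.get("place") is not None]

def generate_hypothesis(records):
    """Step 5 (toy stand-in): count records per place."""
    counts = {}
    for r in records:
        counts[r["place"]] = counts.get(r["place"], 0) + 1
    return counts

def analyze(db, condition):
    return generate_hypothesis(preprocess(search(db, condition)))

def parallel_analysis(db_a, db_b, condition):
    """Analyze each media independently, then integrate the results."""
    result_a = analyze(db_a, condition)
    result_b = analyze(db_b, condition)
    shared = result_a.keys() & result_b.keys()
    return {place: (result_a[place], result_b[place]) for place in shared}

def serial_analysis(db_a, db_b, condition):
    """Analyze the first media, then use its result to search the second."""
    result_a = analyze(db_a, condition)
    if not result_a:
        return {}
    top_place = max(result_a, key=result_a.get)
    return analyze(db_b, lambda r: r.get("place") == top_place)

media_a = [
    {"place": "Harajuku", "tag": "crepe"},
    {"place": "Harajuku", "tag": "fashion"},
    {"place": "Shibuya", "tag": "crossing"},
    {"place": None, "tag": "crepe"},
]
media_b = [
    {"place": "Harajuku", "tag": "crepe"},
    {"place": "Ueno", "tag": "sakura"},
]

parallel_result = parallel_analysis(media_a, media_b, lambda r: True)
serial_result = serial_analysis(media_a, media_b, lambda r: True)
print(parallel_result, serial_result)
```

In the parallel case both media are processed with the same condition and only the final results are combined; in the serial case the selection condition for the second media is derived from the result obtained for the first.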
Figure 1.3 shows parallel analysis and serial analysis of social big data. On social media sites, when various user events (e.g., posts and check-ins) occur in the real world, data related to such events are stored in a database on the premises or in the cloud. After that, they are displayed on the user’s timeline and can be accessed through the API provided to the user. Of course, what is normally visible to users outside the site is only part of the accumulated data. Regarding the use of data other than social data (i.e., real-world data and open data), it is desirable to collect as much data as possible and store them in a database when they are generated. In this case as well, once a hypothesis has been generated, the data necessary to verify it are selected from the database by specifying appropriate conditions, and the necessary processing is executed to perform verification through experiments using the selected data.
1.3 Big Data’s Vagueness Revisited
1.3.1 Issues
Here we explain a little more about another aspect of big data’s vagueness. This aspect is related to the vague concerns of users as to how big data related to them are used. In Japan, there is a plan called the information bank (i.e., personal data trust bank) (MIC 2022) that aims not only to properly manage personal data under laws such as GDPR (EU 2022) but also to promote active use of personal information by companies. One of the prerequisites for the success of such services is that the users are convinced that their own data will be used safely and properly. One way to relieve users’ anxieties is to be accountable for the application. To do so, you should at least explain the analysis procedure to the user. For that purpose, we need words or methods to explain the procedure.
A direct explanation is to present the program implementing the procedure itself. This method is theoretically possible if the users have sufficient knowledge of the programming language. However, the size of the entire program is often quite large in big data applications. Therefore, even if the users have such knowledge, it takes a huge amount of time to understand the program. There is also a problem with the intellectual property rights of any proprietary program. Therefore, this method is not very realistic.
Another option is to explain the program in natural language. While programs written in traditional programming languages are often called high code these days, their abstraction level is rather low. Thus, the gap between natural languages and programs is too large. Therefore, it is difficult to explain the application by this method so that the user can understand it.
Recently, frameworks that allow development by low code have been proposed instead of traditional high-code development. In a low-code framework, it is possible to build an application by combining the components provided in advance through an interactive user interface.
If low code is available, it may be possible to use it
as it is, in order to explain the use of big data. However, this method is not always applicable because low-code frameworks limit flexible customization and not all applications can be developed with low code alone. In any case, a method that is more abstract than a program and does not depend on individual programming languages (e.g., Python and Java) is required. In other words, it is necessary to be able to describe the procedures as the meanings of the programs. Indeed, in the case of science, the accuracy (i.e., efficiency) of a model as an implemented hypothesis is important in practical use, but it is also important to explain both the basis for the individual judgments of the model and the structure and mechanism of the model. If these remain vague, the hypotheses and models as well as the individual judgments may not be accepted by the scientists involved.
1.3.2 Integrated Data Model Approach
1.3.2.1 Data Model
Here, we introduce a data model approach as an effective way of resolving the vagueness of the explanation. A data model, as a set of fundamental principles used in database systems, basically consists of the following components (Ramakrishnan and Gehrke 2002; Ishikawa and Yamamoto 2021).
• Data model = data structures + data operations + data constraints.
That is, the data model consists of the data structures and the operations on them. In addition, conditions to be satisfied between data, called data constraints, are added to the components of the data model. However, for the sake of simplicity, we will focus on data structures and data operations. The data model, in a nutshell, defines the meaning of data and operations. In this and the following sections, we will explain the need for a data model approach in terms of digital ecosystems and hypothesis generation and interpretation.
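As a rough illustration of "data structures + data operations + data constraints", the sketch below models a relation as a set of tuples (structure), with selection and projection (operations) and a uniqueness check on one attribute (constraint). The class name Relation, its methods, and the shelter data are hypothetical and chosen only for this example; a real database system offers far more than this.

```python
from dataclasses import dataclass, field

@dataclass
class Relation:
    attributes: tuple                         # attribute names, in column order
    key: str                                  # constraint: this attribute is unique
    rows: set = field(default_factory=set)    # structure: a set of tuples

    def insert(self, row):
        """Add a tuple, enforcing arity and the uniqueness constraint."""
        if len(row) != len(self.attributes):
            raise ValueError("wrong number of values")
        k = self.attributes.index(self.key)
        if any(r[k] == row[k] for r in self.rows):
            raise ValueError("duplicate key value")
        self.rows.add(row)

    def select(self, predicate):
        """Operation: keep the tuples for which predicate(named tuple) is true."""
        return {r for r in self.rows if predicate(dict(zip(self.attributes, r)))}

    def project(self, *names):
        """Operation: keep only the named attributes (duplicates collapse)."""
        idx = [self.attributes.index(n) for n in names]
        return {tuple(r[i] for i in idx) for r in self.rows}

shelters = Relation(("id", "city", "capacity"), key="id")
shelters.insert((1, "Hino", 200))
shelters.insert((2, "Hino", 150))
shelters.insert((3, "Hachioji", 300))

large = shelters.select(lambda t: t["capacity"] >= 200)
cities = shelters.project("city")
print(large, cities)
```

Because the result of each operation is again a set of tuples, operations can be described and explained independently of any particular programming language, which is the point of the data model approach.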
1.3.2.2 Different Digital Ecosystem
According to the author’s observations and experiences of real big data applications, they consist of data management and data analysis. Data management and data analysis are so-called digital ecosystems (MEDES 2022) (hereinafter referred to simply as ecosystems) that have evolved separately. In general, an ecosystem can be said to be an interdependent system consisting of multiple vendors and users connected through products and services. Most big data applications are, in a nutshell, hybrid application systems consisting of data management (data engineering) and data analysis (data science) that have evolved separately as different ecosystems.
The features of the two ecosystems are described below from the perspective of basic data structures.
• Data management ecosystem: The system used for data management is mainly a database system. In particular, the basis of the data model (Ramakrishnan and Gehrke 2002; Ishikawa and Yamamoto 2021) in the relational database that is widely used today is the set of tuples (i.e., relations) and the operations on them (i.e., relational algebra, or SQL that realizes it). As an exception, the SQL group by operation (grouping and aggregation function) goes beyond the concept of sets, and it is rather based on the concept of partitions of sets. That is, the set of tuples to be queried is divided into subsets of tuples, each subset having the same grouping key value, and an aggregate function is applied to each subset. In other words, only the grouping with an aggregate function is based on the mathematical concept of a family (informally, a collection of sets). In short, traditional data management is mostly based on the concept of sets, but partially on the concept of families. On the other hand, databases collectively called NoSQL (Venkatraman et al. 2016), which have been attracting attention recently because of their ability to flexibly store JavaScript Object Notation (JSON) documents, key-value pairs, and graphs, basically access one object (i.e., an element of the set) at a time or one kind of attribute (i.e., a part of an element of the set), rather than a set of objects. The former is, in a nutshell, a concept similar to cursors in SQL. The latter is column-oriented access of relations in SQL and enhances scalability based on distribution of databases. NoSQL will be explained in more detail in Chap. 5.
• Data analysis ecosystem: The methods used for data analysis are mainly data mining and machine learning (Ishikawa 2015). In both clustering and classification, which are typical methods for hypothesis generation or model construction, the input data structure is a set, and the output data structure is a collection of sets (i.e., a partition). That is, in exclusive clustering, each cluster corresponds to a set obtained by partitioning the input set. In classification, each class also corresponds to a set as an element of the partition of the input set. In association rules, which is another method for model construction, both the input and output data structures are elements of the power set of the whole item set (i.e., item database). On the other hand, the input and output in regression, which is another method for model construction, are generally both simple sets of data.
In short, the data model for explaining an application needs to be able to explain the meaning of such a hybrid ecosystem of data management and data analysis together and to dispel the vague anxieties that users feel. Therefore, abstract data structures and their operations are required as a concept that integrates these two different ecosystems. The data structures for that purpose will be described in Chap. 5.
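The observation that group by and exclusive clustering both produce a family of sets (a partition of the input set) can be illustrated with a short Python sketch. The helper names and the sample tuples are assumptions made for this example; the data structures actually proposed in the book are described in Chap. 5.

```python
from collections import defaultdict

def partition(items, key):
    """Divide a set into blocks whose members share the same key value."""
    blocks = defaultdict(set)
    for item in items:
        blocks[key(item)].add(item)
    return dict(blocks)

def group_by_aggregate(items, key, aggregate):
    """SQL-like GROUP BY: apply an aggregate function to each block."""
    return {k: aggregate(block) for k, block in partition(items, key).items()}

def is_partition(blocks, universe):
    """True if the blocks are non-empty, pairwise disjoint, and cover universe."""
    blocks = list(blocks)
    union = set().union(*blocks) if blocks else set()
    disjoint = sum(len(b) for b in blocks) == len(union)
    return all(blocks) and disjoint and union == set(universe)

# Hypothetical tuples: (user, city, number_of_posts)
posts = {("u1", "Tokyo", 3), ("u2", "Tokyo", 5), ("u3", "Osaka", 4)}

family = partition(posts, key=lambda t: t[1])
totals = group_by_aggregate(posts, key=lambda t: t[1],
                            aggregate=lambda block: sum(t[2] for t in block))
print(family, totals)
```

The same `partition` output shape would describe the result of an exclusive clustering, with cluster labels in place of grouping key values, which is why a family of sets can serve as a common structure for both ecosystems.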
1.4 Hypothesis
1.4.1 What Is Hypothesis?
First, let us briefly explain the relationship between big data and hypotheses. In the era of big data, it has become increasingly important, and also increasingly difficult, to make promising hypotheses in advance. In general, a hypothesis is a tentative explanation based on the observations of a certain phenomenon, and in a narrow sense, it is a predictive relationship between variables corresponding to the cause of an observable phenomenon and variables corresponding to the effect of the phenomenon. Moreover, a hypothesis must be able to be verified. However, the verification of a hypothesis does not equal the proof of the hypothesis. With mathematical theorems, counterexamples can prove that the hypothesis is incorrect. However, it is generally difficult to prove that a hypothesis is totally correct. In other words, hypothesis verification is mainly a quantitative evaluation of how well the hypothesis that explains the phenomenon is accepted after it has occurred. A more detailed definition of hypothesis is given in Chap. 2.
Is a hypothesis necessary for the task of analysis in the first place? Indeed, without making a hypothesis in advance, it is conceivable to create a feature vector of thousands of dimensions representing as many variables as possible and use such vectors as an input to an analysis method on a parallel software platform on computer clusters. As a result, some kind of prediction based on the output may be possible by using machine learning. However, if it is difficult to explain the mechanism of such a hypothesis-free prediction method and the reasons for its individual judgments, it will not be easy for users to adopt this method for prediction wholeheartedly. In other words, even in the big data era, it is better to make a hypothesis in advance of data analysis.
1.4.2 Hypothesis in Data Analysis
Here, we will explain the position of a hypothesis in machine learning and data mining. Simply put, data mining is a task that aims to generate a pattern or a model, which corresponds to a hypothesis, based on collected data. The algorithm used for that purpose is called machine learning or simply learning. However, data mining and machine learning have the same purpose and have much in common with respect to their functionality. In a word, they have just evolved within separate ecosystems of data engineering and data science. To put it bluntly, neural networks may be one of the models that have been around for a long time only in machine learning and did not receive much attention in data mining until very recently. Data mining and machine learning have influenced each other. Therefore, this book does not distinguish strictly between the two disciplines.
Classification, which belongs to the category of supervised learning, uses a part of sample data explicitly labeled with class names, called training data, to generate a hypothesis. Classification uses the remaining sample data to verify the hypothesis with respect to classification accuracy. On the other hand, association rules and cluster analysis belonging to the category of unsupervised learning can be executed without generating any hypothesis in advance. However, hypotheses generated in advance of data mining or examples that can be a hint for generating a hypothesis may help to appropriately select a method and its related parameters for obtaining a good model as a target of data mining. For example, it can be said that learning a model such as classification rules after reducing data attributes to only those which the user considers important for classification corresponds somehow to a prestage hypothesis. In association rule mining, the domain expert’s degree of interest in an item type, support value, and confidence value can be estimated by analyzing frequent item sets (or association rules) obtained from the experiences. Furthermore, even in clustering, giving in advance constraints that two objects should belong to the same cluster or constraints that they should belong to separate clusters provides a kind of guideline for hypothesis generation performed by data mining. In data mining, the goodness of a hypothesis (model) is mainly measured by the accuracy. At the same time, the value of a hypothesis is measured based on interests of analysts and users in a certain field. In other words, a prestage hypothesis or a hint to generate a hypothesis can be said to express the degree of interests in the field in some sense. Even in association rule mining and cluster analysis, which belong to unsupervised learning in data mining, rules of thumb and expectations of field experts often precede the analysis. 
Therefore, it can be said that prestage hypotheses and hints are effective for measuring the degree of interests in hypotheses generated mechanically by data mining. Let us consider a hypothesis in science as one of the extreme cases. Even seemingly accidental scientific discoveries require careful observation of phenomena, development of hypotheses based on sharp inspirations and deep insight, as well as well-planned experiments to test the hypotheses. Hypotheses in science need to be able to explain phenomena that have not yet been observed. As long as a hypothesis continues to explain phenomena, the hypothesis will be accepted, and if a phenomenon unexplained by the hypothesis appears, the hypothesis will be either rejected or modified. In this way, a hypothesis is required to enrich scientific knowledge; that is, it should be ampliative. In other words, science is essentially hypothesis driven. Indeed, there is an argument that scientific research can be carried out only by a data-driven computational approach (Hey et al. 2009), but we do not think that it is always applicable to science in general.
1.4.3 Hypothesis Generation
While a hypothesis is important for big data applications and big data are relatively easy to obtain nowadays, it is not always obvious how to generate a hypothesis. On the other hand, our observations and experiences so far have convinced us that there are hints for hypothesis generation and algorithms for commonly appearing problems (Ishikawa and Miyata 2021). Therefore, it is desirable that common problems and their solutions in hypothesis generation can be reused together with the ideas behind hypothesis generation.
(Diversity of Hypotheses) A hypothesis can be viewed in two different ways as follows.
• Hypothesis as a declaration: This type of hypothesis corresponds to one in traditional data analysis, especially in confirmatory data analysis (CDA). A hypothesis has already been given in some way. Such a hypothesis is called a declarative hypothesis in this book. The main task for this type of hypothesis is hypothesis testing, which roughly corresponds to theorem proving in mathematics.
• Hypothesis as a procedure: This type of hypothesis is obtained as a result of performing a certain procedure such as an algorithm. It can be considered an extended concept of exploratory data analysis (EDA), which has been attracting attention in recent years. Of course, even in this case, it is necessary to verify a generated hypothesis. Such a hypothesis is called a procedural hypothesis in this book. The main task for this type of hypothesis is hypothesis generation, which is a generalization of problem solving in mathematics.
Especially in the latter type of hypothesis, the procedure for generating the hypothesis plays an important role. Methods for generating procedural hypotheses include the following.
• Reasoning.
• Problem solving.
• Machine learning and data mining.
• Integrated methods.
These cover a wide spectrum from traditional methods and basic methods to advanced methods. One of the main themes of this book is to explain problems that often appear in hypothesis generation and reusable solutions to them in detail.
1.4.4 Hypothesis Interpretation
Currently, machine learning research and applications are widely conducted in various big data application fields. Therefore, in order to further expand the utilization of big data, it is necessary to enable data analysts and field experts to accept
models generated using machine learning and decisions based on big data. For that purpose, it is necessary that field experts can understand the structures and behaviors of the model as well as the evidence for the results of individual judgments. These circumstances call for the necessity of “micro-level” or model-specific explanations of a hypothesis. As mentioned previously, applications using big data are generally a hybrid digital ecosystem consisting of two types of processes: data management and data analysis. Therefore, in order for big data applications to be accepted by their users (i.e., field experts and end users), it is necessary at least to ensure the transparency of an overall process flow of an application system. These circumstances require the necessity of “macro-level” or overall explanations of a hypothesis. In a word, hypothesis interpretation includes both “micro-level” explanations and “macro-level” explanations as its essential components. (Scientific Reproducibility) Strictly speaking, the reliability of a scientific paper or research depends on the fact that data are basically prepared according to the procedures described in the paper and that the same result as the described result can be obtained by executing the procedures described in the paper (scientific reproducibility). In other words, computer science or information science is also a science in a broad sense and reproducible explanations of procedures are indispensable for the reliability of big data applications, which are applications of computer science (Peng 2011). To do so, we ensure that the reproducibility of computer science holds, that is, that a third party can obtain the same results within an allowable range by preparing and analyzing data according to the overall explanation of a given application. Such reproducibility can be expected to facilitate model customization and transfer to promote model reuse. 
Another theme of this book is to clarify the basic principles and methods for interpreting hypotheses based on explanation.
1.5 Design Principle and Design Pattern
In data engineering, design patterns are abstract representations of typical solutions to common problems that can be reused for application development. In a concrete application, design patterns are reified (or instantiated) before use. For example, design patterns have been proposed as an object-oriented methodology. The Gang of Four (GoF), who are regarded as pioneering proponents of design patterns, describe a total of 23 design patterns in their reference book (GoF 1994). On the other hand, design principles are concepts and rules that should be followed in design, but they are more abstract than design patterns and are not always associated with concrete methods (Pescio 1997). However, the boundary between design principles and design patterns is ambiguous, and it is not always possible to make a strict distinction between the two.
Therefore, hereafter in this book, it will not be specified which category described items belong to. In this book, we propose design principles and patterns for generating and interpreting hypotheses in big data applications, incorporating the concepts of principles and patterns introduced in software design. It is desirable that the means for describing proposed design patterns and principles is more abstract than programming languages. Then they can be transformed into concrete programs in your environment (i.e., a software framework) in a straightforward manner.
1.6 Notes on the Cloud
Here, we would like to make a note about architectures for realizing big data applications. First of all, to realize an application system, there is the conventional on-premises option, in which all computational resources are provided by one’s own organization. Recently, in addition to on-premises, the cloud option has emerged, which uses services already available on the network. In general, the cloud has the following advantages.
• You do not have to prepare your own computational resources in advance.
• You can use as many computational resources as you need when you need them. This is convenient when you do not know in advance how much peak performance is required.
• The initial cost can be kept low and the system can be constructed immediately.
• There are neither labor costs nor operating costs for dedicated tools.
On the other hand, on-premises has the advantage of full control over all computational resources.
There are design patterns specialized for the cloud. For example, Amazon Web Services (AWS) describes the following cloud-native design patterns (AWS 2019); their details are not explained in this book.
• API Gateway.
• Service discovery and service registries.
• Circuit breaker.
• Command-query responsibility segregation.
• Event sourcing.
• Choreography.
• Log aggregation.
• Polyglot persistence.
It is expected that the use of the cloud will gradually expand in the future, but this book does not assume the use of the cloud. Instead, this book focuses on hypothesis generation and interpretation and explains design principles and design patterns that
we should basically know in hypothesis generation and interpretation, regardless of the use of the cloud.
1.7 Big Data Applications
Big data applications are expected to increase in various fields, not limited to 5G-related fields. The fields of application can be broadly classified as follows, although the list is not comprehensive.
• Business.
• Social infrastructure.
• Engineering.
• Science.
• Medical.
Use cases for hypothesis generation will be explained concretely in Chap. 2 and the following chapters. Before that, we would like to mention EBPM as one of the backgrounds for the emergence of big data applications. In addition, we describe the users of big data applications.
1.7.1 EBPM
Traditionally, various policy decisions have been made based on the collection and analysis of easy-to-collect data such as episodes or questionnaires, because it is costly to collect timely and sufficient data. In other words, either no data (evidence) directly related to the decisions exist or, if any exist, they are neither dynamic nor sufficient. On the other hand, big data such as real-world data and social data are being created from moment to moment today. In other words, data available for analysis already exist around us. In various fields, not limited to administration, evidence-based policy making (EBPM) has been proposed, which uses clearly established evidence in good faith when choosing from multiple policy options (Kano and Hayashi 2021). Integrated methods for hypothesis generation as proposed in this book, which combine machine learning and big data processing, are therefore expected to help EBPM managers determine in advance where, how much, and what kinds of demands and necessities exist. For example, quantitatively and qualitatively grasping the behavioral patterns of groups (e.g., people and things) will lead to effective measures and successful investment in areas important for administrations and companies related to social infrastructure. In a broad sense, the following examples of EBPM address various problems common to societies, such as those of the public, governments, cities, and regions.
1.7.1.1 Tourism
Tourism is an important economic sector for every country. The most common complaints of foreigners visiting Japan are the difficulty of accessing the Internet and of obtaining information. However, it is extremely inefficient to create free Wi-Fi access spots indiscriminately throughout Japan. If places that foreigners prefer to visit but that provide no free Internet access are known in advance based on quantitative evidence, it is possible to invest efficiently by preferentially setting up free Wi-Fi access spots there. Furthermore, big data analysis can be applied to the following situations.
• If foreign visitors to Tokyo can find out which spots have cherry blossoms (Sakura) in full bloom during their limited travel period, such spots will greatly promote tourism.
• It is often effective to introduce foreign visitors to areas similar to one where they are staying or that they already know.
• If we can quantitatively understand how foreign tourists change their behaviors due to local weather conditions, we can take effective measures for accepting them.
• If we can explain Japanese foods in terms of their differences from foods that foreigners are familiar with, we will be able to create more foreigner-friendly menus.
1.7.1.2 Disaster Management
Needless to say, disaster management is one of the most urgent issues in the world. If it is known in advance, based on quantitative evidence, where people will crowd together in the event of a disaster and which routes from the current location to a nearby evacuation facility are safe (or dangerous) depending on the time of day, secondary disasters can be expected to decrease through appropriate guidance.
1.7.1.3 Social Infrastructure
The development of social infrastructure such as urban development and transportation systems is an important issue, especially for cities. Many of the inconveniences that foreigners feel come from language problems, for example, the languages used in various forms of guidance. Of course, it is conceivable to use all languages everywhere, but this is not realistic. Instead, if the languages used by foreigners who often use a certain station or line are known, it will be possible to provide effective guidance at such stations or on such lines in those languages. Easy-to-understand evacuation guidance is also important for foreigners in the event of disasters.
In order to build next-generation mobility services (e.g., Mobility as a Service or MaaS), it is necessary to build service infrastructures that promote such services as well as to develop underlying technologies such as IoT.
1.7.1.4 Science and Medicine
Exploration of the internal structure of the moon, a subject of lunar and planetary science, is important in that it can clarify the crater formation mechanism and the crustal movement process there. However, it is not realistic to send dedicated spacecraft everywhere because of a limited budget. On the other hand, some lunar craters have a peak at the center of the crater floor. Such craters are called central peak craters. A central peak is a remnant of a large-scale collision (e.g., an impact of a meteor) at the time of crater formation and is formed from uplifted underground materials. In other words, the lunar internal materials are exposed on the surface of a central peak, making a central peak a promising candidate site for exploration. However, there are in the first place no data on central peak craters that cover the entire lunar surface. Furthermore, the identification of central peak craters is mostly done by manual operations such as visual inspection of images by experts. Therefore, it is necessary to develop an automatic extraction method for creating a complete catalog of central peak craters. Also, if we can discover craters formed relatively recently, they will be useful for understanding the mechanism of crater formation.
El Niño, a meteorological phenomenon, can also be detected by examining the time series data of the sea surface temperature in specific areas.
In medicine, new lesions can be found by comparing radiographs of the same parts of the same patient taken at different times. In addition, it may be possible to identify genes associated with a disease by comparing and analyzing a large number of genes from people with and without the disease. By combining multiple sources of data, the number of new positive cases of COVID-19 can be predicted on a daily basis.
1.7.2 Users of Big Data Applications
Then, what kinds of users are involved in big data applications? They are roughly classified into the following categories.
• Data scientists: This type of user is mainly in charge of parts related to data science (i.e., machine learning and data mining) in big data applications.
• Data engineers: This type of user is mainly responsible for parts related to data engineering (i.e., data management) in big data applications.
• Field experts: This type of user motivates the realization of applications and accepts and uses the execution results of the model.
• End users: This type of user enjoys the end-user services provided by the applications.
However, the above classification cannot always be clearly made. In an attempt to realize an application system with limited human resources, one data scientist may have a role in data engineering as well and one field expert may have all of these roles in extreme cases.
1.8 Design Principles and Design Patterns for Efficient Big Data Processing
In the 5G era, in addition to the features of large capacity and ultra-high speed, the feature of low delay in data communication becomes prominent. Along with these, the need to process a large volume of data at high speed is increasing. Before turning to the design principles and patterns of hypothesis generation and interpretation, we give an overview of design principles and patterns useful for high-speed processing of a large volume of data (Ivezic et al. 2019; Kleinberg and Tardos 2005).
1.8.1 Use of Tree Structures
One solution to the problem of efficiently processing a large amount of data is to divide a given problem into subproblems that can be solved independently, each with a smaller search space. This method solves the original problem by integrating the solutions to the subproblems. This is referred to as the divide-and-conquer principle and is widely used in problem solving (Kleinberg and Tardos 2005) as well as in historical wars (refer to Box “The Battle of Austerlitz”). Here we assume the subproblems are similar to each other, that is, homogeneous.
In general, dividing a problem into subproblems can be represented by a tree structure (i.e., hierarchical structure), in which the original problem corresponds to the root (parent) node and the divided subproblems correspond to the child nodes. The solution of the problem starts from the root (parent) node of the tree structure and goes down to the child nodes. When the results of solved subproblems are integrated, control is returned from the child nodes to the parent node in the opposite direction.
Algorithms based on the divide-and-conquer principle are often realized recursively because it is easy to formulate such algorithms and prove their correctness (i.e., proof by mathematical induction, refer to Chap. 2). Let us consider a simple algorithm in which a top-down computation method uses itself to solve a subproblem. The algorithm divides the given data into subsets, creates a set of subproblems from the original problem, applies the same algorithm recursively to the subproblems with the subsets, and integrates the solutions of the subproblems to obtain the solution to the original problem.
For example, in the problem of sorting N data elements, which often appears in various applications, the typical algorithm for that purpose, MergeSort, divides given
data into multiple subsets of similar size and applies MergeSort (itself) recursively to sort such subsets and merges the sorted results into the whole sorted set by using Merge. The pseudocode for MergeSort is described below. Here, l and r are the left and right indexes of the data array to be sorted, respectively.
Algorithm 1.1 MergeSort (l, r).
1. {if l = r then return data (l);
2. else return Merge (MergeSort (l, (l + r)/2), MergeSort ((l + r)/2 + 1, r));}
The amount of computation (i.e., computational cost) required to sort N data elements is O (NlogN). Here, big O notation is defined as follows: a cost T (N) is O (F (N)) if there are positive constants c and N0 such that T (N) ≤ c F (N) holds whenever N > N0. O (NlogN) is the best cost for sorting algorithms based on data comparison. Divide-and-conquer algorithms often lead to the best performance. An example of dividing a problem into heterogeneous (i.e., different kinds of) subproblems instead of dividing it into homogeneous subproblems as described above will be explained in detail in Chap. 2.
In addition, for multidimensional data, the problem of finding a collection of data contained in a subspace created by specifying a range of values in each dimension (i.e., subsetting) often appears in big data applications. For example, it may be necessary to find a set of data contained in an area of interest, or a set of data close to a certain data object. One solution to this problem is multidimensional indexing, which is based on a hierarchical data structure suitable for efficient access to large amounts of data (Ramakrishnan and Gehrke 2002). Multidimensional indexing is often used especially for the problem of searching for a set of data in the vicinity of a given object (i.e., k-nearest neighbors). Normally, it is created for low-dimensional (1D to 3D) data corresponding to the real space and used for hierarchical access of the data.
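As an aside on Algorithm 1.1 above, the following is a minimal Python sketch of the same recursion. The function names merge_sort and merge, the explicit data argument, and the use of integer division for (l + r)/2 are assumptions made for illustration and are not taken from the original pseudocode; the sketch also assumes a non-empty index range.

```python
def merge(left, right):
    """Merge two already sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of the two tails is non-empty at this point.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(data, l, r):
    """Return data[l..r] (inclusive indexes) as a new sorted list."""
    if l == r:
        # Base case: a single element is already sorted.
        return [data[l]]
    mid = (l + r) // 2  # integer midpoint (assumed reading of (l + r)/2)
    return merge(merge_sort(data, l, mid), merge_sort(data, mid + 1, r))


values = [5, 2, 9, 1, 5, 6]
sorted_values = merge_sort(values, 0, len(values) - 1)
```

Each level of the recursion tree merges N elements in total and there are about log N levels, which is where the O (NlogN) cost comes from.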
Multidimensional indexes include the B-tree (one-dimensional), quad-tree (two-dimensional), octree (three-dimensional), and kd-tree (k-dimensional, for k ≪ log N). With such a hierarchical index, it is possible to create an index for N data objects at a cost of O (N logN) and to search the data using the index at a cost of O (logN), which is the best cost for comparison-based search algorithms.
Box: The Battle of Austerlitz
The Battle of Austerlitz (1805) was strategically spectacular among the wars fought by the French emperor Napoleon Bonaparte, also known as Napoleon I (1769–1821). The opponent was a coalition of the Russian and Austrian armies, which outnumbered the French army led by Napoleon (see Fig. 1.4). However, Napoleon divided the opponent in two by attacking the center of the allied forces. After the division, the French army gained an advantage over each divided opponent and eventually conquered the allied forces. □
Fig. 1.4 The Battle of Austerlitz (Map Courtesy of the United States Military Academy Department of History)
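To make the index costs quoted in this section concrete, the following one-dimensional sketch uses a sorted Python list as a stand-in for a B-tree-like index: building it costs O (N logN), and locating the boundaries of a range costs O (logN) by binary search. The names build_index and range_query are hypothetical, and a real multidimensional index such as a quad-tree or kd-tree is considerably more involved.

```python
from bisect import bisect_left, bisect_right


def build_index(values):
    """Build a one-dimensional 'index' by sorting: O(N log N)."""
    return sorted(values)


def range_query(index, low, high):
    """Return all indexed values v with low <= v <= high.

    The two boundaries are located by binary search in O(log N);
    copying out the m matching values adds O(m).
    """
    start = bisect_left(index, low)
    stop = bisect_right(index, high)
    return index[start:stop]


index = build_index([7, 3, 9, 1, 5])
hits = range_query(index, 3, 7)
```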
1.8.2 Reuse of Results of Subproblems
Here, let us take another problem that can be defined recursively and consider an efficient solution to it. Fibonacci (1170–1250), a mathematician born in Pisa, Italy, considered modeling the problem of rabbit population growth. One pair of parent rabbits gives birth to one pair of offspring rabbits each month, and the offspring rabbits become parent rabbits in a month. The number F(n) of rabbit pairs in the n-th month can be recursively defined by the following difference equation (i.e., recurrence formula).
• F(n) = F(n − 1) + F(n − 2) for n ≥ 3.
• F(1) = F(2) = 1.
The number defined in this way is called the Fibonacci number, and the sequence is called the Fibonacci sequence (see Fig. 1.5).
Fig. 1.5 Fibonacci sequence (offspring and parent rabbit pairs from the 1st to the 6th month)
In reality, the method of recursively calculating the Fibonacci sequence according to the above definition is not efficient. This is because in recursive computation of the Fibonacci number, the same computation (i.e., computation with the same parameters) appears multiple times during the course of computation, and as a result, the same computation must be performed more than once.
To avoid such inefficiencies, let us consider reusing the result of an already solved problem as part of the solution to the given problem. Specifically, there is a method called dynamic programming (Kleinberg and Tardos 2005). In this method, the combination of parameters and result for one computation is recorded as a new record in a table each time a solution is obtained (memoization). Usually, the computation is performed while increasing the values of the parameters, starting from the combination of the simplest values of the parameters (e.g., the minimum integer value if one exists). This is, so to speak, an example of bottom-up computation, in contrast to the top-down computation of recursive algorithms (e.g., MergeSort).
If it is necessary to compute a problem with a new combination of parameters during the course of the computation for the target parameters, the algorithm first searches the table for a record for the new parameters. If the record already exists, the algorithm uses the computation result in the record instead of recomputing it. In other words, it is possible to avoid recomputing the same problem by searching the table for the computation result. Therefore, it is more efficient to compute the Fibonacci sequence by dynamic programming than by plain recursion, provided that the table search can be performed more efficiently than the problem computation.
In association rule mining, which is another method of data mining, the Apriori algorithm discovers frequently occurring item sets (such as the contents of a shopping cart) from a collection of item sets (Ishikawa 2015) in a bottom-up manner. It can also be regarded as a kind of dynamic programming.
When applying dynamic programming to optimization problems (e.g., the shortest path problem), it is necessary to check the following principle.
• Principle of optimality: The optimal solution of a problem includes the optimal solutions of subproblems.
Similarly, even if a problem can be formulated recursively, there may be algorithms that execute the computation in a bottom-up manner. For example, there is also a
non-recursive version of the MergeSort algorithm (refer to Algorithm 1.1) described above, where we consider runs (i.e., sorted data sequences), first merge shorter runs together to make longer runs, and finally make only one run (Kleinberg and Tardos 2005).
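The difference between plain recursion and the table-based reuse described in this section can be sketched in Python as follows. The function names and the call counter are hypothetical helpers introduced only for illustration; fib_memo reuses results top-down through a table, while fib_bottom_up fills the table from the smallest parameters upward.

```python
def fib_naive(n, counter):
    """Plain recursion following the definition; counter[0] counts calls."""
    counter[0] += 1
    if n <= 2:
        return 1
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_memo(n, table=None):
    """Top-down recursion that looks up the table before recomputing."""
    if table is None:
        table = {1: 1, 2: 1}
    if n not in table:
        table[n] = fib_memo(n - 1, table) + fib_memo(n - 2, table)
    return table[n]


def fib_bottom_up(n):
    """Bottom-up dynamic programming: fill the table from F(1), F(2) upward."""
    table = [0, 1, 1]  # table[i] holds F(i); index 0 is unused padding
    for i in range(3, n + 1):
        table.append(table[i - 1] + table[i - 2])
    return table[n]


calls = [0]
naive_result = fib_naive(20, calls)
memo_result = fib_memo(20)
dp_result = fib_bottom_up(20)
```

All three functions return the same value, but the number of calls made by fib_naive grows roughly in proportion to F(n) itself, whereas the table-based versions compute each F(i) only once.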
1.8.3 Use of Locality
Generally, the memories in the memory hierarchy of a computer differ in capacity and access speed (see Fig. 1.6). That is, storage capacity and access speed tend to conflict with each other. In big data processing, we must handle a large amount of data that does not fit in main memory. Therefore, such large amounts of data are usually stored in external storage (i.e., magnetic disk drives or magnetic tapes), which is a low-speed storage medium. A data access from the CPU starts from the highest level of memory (cache or main memory) and returns the data if it hits; otherwise, the access proceeds to the next lower level of the memory hierarchy.
Fig. 1.6 Memory hierarchy (from faster to larger: register, cache, random access memory (RAM), solid state drive (SSD), magnetic disk drive, magnetic tape)
In order to access data at high speed in a computer with a memory hierarchy, it is necessary to consider the difference in access speed of data, especially between main memory and external storage. In other words, it is important to reduce access to external storage by keeping the data required for computation in main memory as much as possible. That is, it is necessary to increase the locality of data within a certain layer of memory as much as possible. Therefore, if the extra cost of investment is not considered, there is an option to increase the size of main memory. If this is not possible, the option of using faster external storage such as an SSD instead of a magnetic disk is also effective. However, this also has the disadvantage of being expensive at present. Therefore, if the whole data cannot fit in main memory, buffers for passing data between external storage and main memory are secured in main memory. Instead of main memory algorithms, it is useful to use algorithms that take such buffers into account. For example, big data can be efficiently searched, sorted, and combined by algorithms using the buffers of a database management system (Ramakrishnan and Gehrke 2002). Even Graphics Processing Units (GPUs), which are currently attracting attention in machine learning, have limited storage capacity, so it is still important to keep the use of locality in mind.
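As a small illustration of an algorithm that works through a fixed-size buffer rather than loading all data into main memory, the following Python sketch computes a mean over a stream of text lines. The function name chunked_mean and the buffer_size parameter are hypothetical, and an in-memory io.StringIO object stands in for a file on external storage.

```python
import io
from itertools import islice


def chunked_mean(lines, buffer_size=1000):
    """Compute the mean of one number per line, holding at most
    buffer_size lines in memory at any time."""
    total = 0.0
    count = 0
    iterator = iter(lines)
    while True:
        # Fill the buffer with the next block of lines.
        buffer = list(islice(iterator, buffer_size))
        if not buffer:
            break
        values = [float(line) for line in buffer]
        total += sum(values)
        count += len(values)
    return total / count if count else float("nan")


# Stand-in for a large file: the numbers 1..100, one per line.
stream = io.StringIO("\n".join(str(i) for i in range(1, 101)))
mean_value = chunked_mean(stream, buffer_size=16)
```

Only the running total, the count, and one buffer are kept in memory, so the memory footprint is independent of the size of the input.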
1.8.4 Data Reduction
Reducing the amount of data, called data reduction, is one way to accelerate big data applications. Now let us consider a data matrix that represents features of data. Each row corresponds to one data item, and each column corresponds to one feature quantity. We let the number of rows and the number of columns be N (data size) and p (number of feature types), respectively.
When N is large, it is conceivable to reduce the data size by random sampling from a large-scale data population. However, some caution is required when sampling. First, when the original data consist of different groups with several features (e.g., class, flavonoids, and proline of wines), it should be noted that a tendency (such as a correlation) may differ between groups or between a group and the whole. Such a phenomenon is called Simpson’s paradox (Pearl and Mackenzie 2018). In Fig. 1.7, while there is a positive correlation between two variables (i.e., the amounts of flavonoids and proline) as a whole, negative correlations are observed within the class 2 and class 3 wine groups. In this case, the wine class affects both the amounts of flavonoids and proline, and a variable such as class is called a confounding variable. Therefore, it may be possible to obtain the correct correlation by collecting samples for each group and performing the analysis per group (stratified sampling).
Furthermore, in addition to the problem that the total amount of data is enormous, when focusing on classification in machine learning (refer to Chap. 5), it often happens that the data size used for learning differs depending on the class. Such a case is called imbalanced data or data imbalance. If imbalanced data are sampled as they are, overfitting may occur due to the dominance of a large class, and the precision for a small class may decrease.
Fig. 1.7 Simpson’s paradox: proline (vertical axis) versus flavonoids (horizontal axis) for wine classes 1, 2, and 3, generated from data (https://archive.ics.uci.edu/ml/datasets/wine)
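The effect shown in Fig. 1.7 can be reproduced with a few lines of Python. The data below are synthetic and purely hypothetical (three groups labeled A, B, and C, not the wine data set): within each group y decreases as x increases, yet the group centers increase together, so the pooled correlation is positive.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical groups: the center moves up in both x and y,
# while inside a group y falls as x rises.
offsets = [-0.5, -0.25, 0.0, 0.25, 0.5]
groups = {}
for label, center in (("A", 1.0), ("B", 3.0), ("C", 5.0)):
    xs = [center + d for d in offsets]
    ys = [center - 0.8 * d for d in offsets]
    groups[label] = (xs, ys)

all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]

pooled = pearson(all_x, all_y)  # correlation ignoring the group
within = {label: pearson(xs, ys) for label, (xs, ys) in groups.items()}
```

Here the group label plays the role of the confounding variable: analyzing each stratum separately recovers the negative within-group trend that the pooled analysis hides.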
In order to balance the sample data size (i.e., rebalancing), over-sampling (or up-sampling) that increases the size of data in a smaller class or down-sampling (or under-sampling) that reduces the size of data in a larger class can be used. A method called Synthetic Minority Over-sampling Technique (SMOTE) (Chawla and Bowyer 2002) is a typical technique of over-sampling. The basic idea of SMOTE is to select members of a small class, find their k-nearest neighbors, and synthesize data by interpolation between each member and its k-nearest neighbors. In addition to random sampling, down-sampling can also be performed by selecting representative data for each cluster obtained by clustering (Yen and Lee 2009). Also, in the case of p ≫ N , such data are said to be sparse. In that case, reduction of dimensions corresponding to features (columns), called dimensionality reduction, may be effective for data reduction. Principal component analysis (PCA) and LASSO regularization are used for dimensionality reduction. These will be explained again in the section on regression of Chap. 4.
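The interpolation idea behind SMOTE described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation of Chawla and Bowyer: the function name smote_like, its parameters, and the sample points are hypothetical, and neighbors are found by brute force.

```python
import random
from math import dist


def smote_like(minority, k=2, n_new=4, seed=0):
    """Synthesize n_new points by interpolating between a randomly chosen
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        p = minority[i]
        # Brute-force k-nearest neighbors among the other minority points.
        others = [q for j, q in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda q: dist(p, q))[:k]
        q = rng.choice(neighbors)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return synthetic


minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote_like(minority_points, k=2, n_new=6, seed=42)
```

Because every synthetic point lies on a line segment between two existing minority points, the new data stay inside the region already occupied by the small class.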
1.8.5 Online Processing
In time series data, it is conceivable to focus only on the data within a certain time width (i.e., time window) rather than on all the data. This is called online processing or streaming. This makes it possible to reduce the number of data objects handled at one time. It is also possible to reduce the number of data objects handled at one time by performing an aggregation operation (e.g., the average function) on a fixed size of data in the past. As will be explained in Chap. 5, online processing in deep learning also leads to reduction of data handled at one time. Hypothesis generation by performing aggregation and difference operations on time series data will be explained concretely in Chap. 6.
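A time window with an aggregation operation can be sketched in Python as follows. The generator name windowed_means and its parameters are hypothetical; the sketch keeps only the most recent values in memory and updates the running sum incrementally.

```python
from collections import deque


def windowed_means(stream, window):
    """Yield, for each incoming value, the mean of the last `window` values."""
    buffer = deque(maxlen=window)
    total = 0.0
    for value in stream:
        if len(buffer) == window:
            # The oldest value is about to be evicted by append().
            total -= buffer[0]
        buffer.append(value)
        total += value
        yield total / len(buffer)


means = list(windowed_means([1, 2, 3, 4, 5], window=3))
```

Each step costs constant time and constant memory regardless of how long the stream is, which is the point of processing data online.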
1.8.6 Parallel Processing
Of course, it is best to use efficient algorithms for a given problem. If they are not available, it is possible to divide the problem into multiple homogeneous problems and process them in parallel. Such parallel processing is performed based on the divide-and-conquer principle. There are software and hardware technologies for parallel processing.
For example, MapReduce (Ishikawa 2015) provides a software framework for parallel processing. MapReduce makes heavy use of external memory and decomposes tasks into homogeneous computations based on shuffling and sorting. This is suitable for relatively simple tasks (e.g., Extract-Transform-Load). Figure 1.8 illustrates how MapReduce computes a document frequency, as a component of term frequency-inverse document frequency (tf-idf), used to measure the importance of a word to a document in a document collection. Tf-idf is commonly used in text mining (Ishikawa 2015).
On the other hand, for tasks requiring many iterative computations such as machine learning algorithms (e.g., regression, clustering, and graph mining), MapReduce alone does not always guarantee efficient execution. There are other frameworks (e.g., Apache Spark) (Apache Spark 2022) that make heavy use of main memory based on hashing. Here, hashing is not only a method of high-speed search, but also plays the role of dividing the problem space into subspaces that do not intersect each other based on the key values of data. This can be considered a kind of parallel processing technology based on the divide-and-conquer principle. Basically, the cost of partitioning N data objects by hashing is O(N), and the cost of searching data is O(1) (Kleinberg and Tardos 2005).
The GPU can process operations (e.g., floating point operations) suitable for image processing with a higher degree of parallelism than the CPU.
By taking advantage of this feature and simultaneously performing the same operations required for deep learning in parallel, speeding up of big data processing can be expected.
Fig. 1.8 Example of MapReduce (Map, Shuffle, and Reduce stages computing the document frequency of Word1 to Word4 over Document1 to Document3)
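The computation illustrated in Fig. 1.8 can be mimicked in a single Python process as follows. This is only a sketch of the three stages, not the API of any MapReduce framework: the function names map_phase, shuffle_phase, and reduce_phase are hypothetical, while the documents and words are those shown in the figure.

```python
from collections import defaultdict

documents = {
    "Document1": ["Word1", "Word3"],
    "Document2": ["Word2", "Word3", "Word4"],
    "Document3": ["Word1", "Word2", "Word3"],
}


def map_phase(doc_id, words):
    """Map: emit one (word, document) pair per distinct word in a document."""
    return [(word, doc_id) for word in sorted(set(words))]


def shuffle_phase(pairs):
    """Shuffle: group the emitted pairs by key (the word)."""
    groups = defaultdict(list)
    for word, doc_id in pairs:
        groups[word].append(doc_id)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce: count the distinct documents that contain the word."""
    return word, len(set(doc_ids))


pairs = [pair for doc_id, words in documents.items()
         for pair in map_phase(doc_id, words)]
grouped = shuffle_phase(pairs)
document_frequency = dict(reduce_phase(w, ids) for w, ids in grouped.items())
```

In a real framework the map calls would run on different machines, and the shuffle would move all pairs with the same key to the same reducer; here the grouping dictionary plays that role.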
1.8.7 Function and Problem Transformation
Even if a problem requires a large amount of computation, it may be easier to solve the problem by transforming functions involved in the problem or the problem itself (Aggarwal 2020).
• Function transformation: Let us consider making computation easier by expanding a function that is difficult to evaluate as it is into a group of simpler functions or by transforming it into another space (function transformation). For example, a Fourier series expands a periodic function into trigonometric functions (orthogonal to each other) with different periods (Strang 2021). By letting the period tend to infinity, the Fourier transform can generally transform a function in the time or space domain into a function in the frequency domain. The kernel method can reduce the computation cost by using a kernel function to calculate the inner product of features transformed into a high-dimensional space directly from the input data, without computing the transformed features explicitly (Bishop 2006).
• Problem transformation: If it is difficult to solve a problem by the methods explained so far, we consider transforming the problem itself into something that is easier to solve (problem transformation). It may be effective to find an approximate solution instead of an exact solution. This method will be discussed in Chap. 2. The method of Lagrange multipliers solves an optimization problem with multiple constraints by introducing Lagrange multipliers and converting it into a simpler optimization problem without constraints. A specific use case of the method of Lagrange multipliers is explained in the section on sparse modeling of Chap. 4.
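The kernel idea mentioned under function transformation can be checked numerically with a small Python sketch. The choice of the degree-2 polynomial kernel and the names phi and poly_kernel are assumptions made for illustration; the text above does not prescribe a particular kernel.

```python
from math import isclose, sqrt


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for 2-D input: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed in the input space only."""
    return dot(x, y) ** 2


x = (1.0, 2.0)
y = (3.0, 4.0)
explicit = dot(phi(x), phi(y))   # inner product in the feature space
implicit = poly_kernel(x, y)     # same value without building phi
```

Both expressions give the same number, but the kernel never constructs the transformed features, which matters when the feature space is very high-dimensional.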
1.9 Structure of This Book
Finally, the structure of this book will be explained. In Chap. 2, the definition of a hypothesis, hints for hypothesis generation, and data visualization are explained. In the latter half of Chap. 2 and the six chapters that follow, reasoning, problem solving, machine learning, and integrated methods are concretely explained as design principles and patterns for big data applications from the perspective of hypothesis generation. In the last chapter, we will explain the principles and methods for interpreting and explaining hypotheses and explain their concrete examples.
While this book focuses on data science and data engineering, it also incorporates historical perspectives of science such as physics and biology (i.e., from the pre-computer era to the present day). Further, it places emphasis on the origin of ideas behind the design principles and patterns for big data applications and explains how these ideas were born and have been developed.
References
Aggarwal CC (2020) Linear algebra and optimization for machine learning: a textbook. Springer
Apache Spark (2022) Lightning-fast unified analytics engine. https://spark.apache.org/ Accessed 2022
AWS (2019) Modern application development on AWS: cloud-native modern application development and design patterns on AWS. https://d1.awsstatic.com/whitepapers/modern-application-development-on-aws.pdf Accessed 2022
Bishop CM (2006) Pattern recognition and machine learning. Springer
Cabinet Office (2022) Society5.0. https://www8.cao.go.jp/cstp/english/society5_0/index.html Accessed 2022
Celko J (2014) Joe Celko’s SQL for smarties: advanced SQL programming, 5th edn. Morgan Kaufmann
Chawla NV, Bowyer KW et al (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
EU (2022) General data protection regulation. http://data.europa.eu/eli/reg/2016/679/oj Accessed 2022
GoF (1994) Design patterns: elements of reusable object-oriented software. Addison-Wesley
Google Cloud Vision API. https://cloud.google.com/vision Accessed 2022
Hey T, Tansley S, Tolle K (2009) The fourth paradigm: data-intensive scientific discovery. Microsoft Research
IEEE (2022) Towards a definition of the Internet of Things (IoT). https://iot.ieee.org/images/files/pdf/IEEE_IoT_Towards_Definition_Internet_of_Things_Revision1_27MAY15.pdf Accessed 2022
Ishikawa H (2015) Social big data mining. CRC Press
Ishikawa H, Miyata Y (2021) Social big data: case studies. In: Large scale data and knowledge centered systems, vol 47. Springer, pp 80–111
Ishikawa H, Yamamoto Y (2021) Social big data: concepts and theory. In: Large scale data and knowledge centered systems, vol 47, pp 51–79
Ivezic Z, Connolly AJ et al (2019) Statistics, data mining, and machine learning in astronomy: a practical python guide for the analysis of survey data. Princeton University Press
Kano H, Hayashi TI (2021) A framework for implementing evidence in policymaking: perspectives and phases of evidence evaluation in the science-policy interaction. Environ Sci Policy 116:86–95
Kleinberg J, Tardos E (2005) Algorithm design. Pearson
Lu X (1972) My old home. Translated by Hsien–yi and Gladys Yang from selected stories of Lu Xun
MEDES (2022) https://medes.sigappfr.org/18/ Accessed 2022
Ministry of Internal Affairs and Communications (MIC) (2022) Information and Communications in Japan 2020 (summary). https://www.soumu.go.jp/main_sosiki/joho_tsusin/eng/whitepaper/2020/index.html. Accessed 2022
Olshannikova E, Olsson T, Huhtamäki J et al (2017) Conceptualizing big social data. J Big Data 4(3). https://doi.org/10.1186/s40537-017-0063-x
Omori M, Hirota M, Ishikawa H, Yokoyama S (2014) Can geo-tags on flickr draw coastlines? In: Proceedings of 22nd ACM SIGSPATIAL international conference on advances in geographic information systems (SIGSPATIAL2014), pp 425–428. https://doi.org/10.1145/2666310.2666436
Pearl J, Mackenzie D (2018) The book of why: the new science of cause and effect. Allen Lane
Peng RD (2011) Reproducible research in computational science. Science 334(6060):1226–1227
Pescio C (1997) Principles versus patterns. IEEE Comput 30(9):130–131
Pseudocode Standard (2022) https://users.csc.calpoly.edu/~jdalbey/SWE/pdl_std.html
Ramakrishnan R, Gehrke J (2002) Database management systems, 3rd edn. McGraw Hill Higher Education
Strang G (2021) Introduction to linear algebra. Wellesley–Cambridge Press
Venkatraman S, Fahd K, Kaspi S, Venkatraman R (2016) SQL versus NoSQL movement with big data analytics. Int J Inf Technol Comput Sci (IJITCS) 12:59–66 (MECS Press)
Yen S-J, Lee Y-S (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst Appl 36(3, Part 1):5718–5727
Chapter 2
Hypothesis
2.1 What Is Hypothesis?
2.1.1 Definition and Properties of Hypothesis
First, we give a general definition and properties of a hypothesis. In a nutshell, a hypothesis is an explanation of an event or a phenomenon. Hypotheses are mainly expressed in words, but they may also be expressed in mathematical formulas, figures, algorithms, or programs, which can be processed formally. To begin with, a hypothesis should have the following properties (Ishikawa 2015).
• (Universal) A hypothesis needs to be able to explain as many past cases as possible. This is the first property to consider when generating a hypothesis.
• (Ampliative, novel) A hypothesis needs to provide new knowledge. This is especially important in scientific hypotheses.
• (Simple, parsimonious) The explanation of a hypothesis should be described in the simplest way possible (refer to Box “Occam’s Razor”). In other words, this property demands that a hypothesis be easy for the user to understand.
• (Fruitful, predictive) A hypothesis must be able to explain and predict cases that are not yet known at the time of generation. This property is related to the usefulness of the hypothesis.
• (Testable, falsifiable) It is essential that a hypothesis can be verified or falsified.
The American philosopher of science Thomas Kuhn (1922–1996), who authored The Structure of Scientific Revolutions, was asked “What are the characteristics of a good scientific theory?” His answer included “broad scope, simplicity, fruitfulness”.
Furthermore, in this book, we would like to add the following two properties to the above properties. One of them is as follows.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Ishikawa, Hypothesis Generation and Interpretation, Studies in Big Data 139, https://doi.org/10.1007/978-3-031-43540-9_2
• (Interesting) A hypothesis, whether academic or business oriented, should reflect current interests in each field. This property is not only important in terms of usefulness, but it also increases the chance of testing and refining a hypothesis since many people are involved.
The other property is as follows.
• (Interpretable) It is desirable to be able to explain the structure of hypotheses, the mechanism of reasoning or prediction, and the basis of judgments in addition to the methods for hypothesis generation. Furthermore, in science, it is required that hypotheses can be decomposed into combinations of the most basic events or parts possible. This property has a great influence on whether a hypothesis is accepted by related experts.
Box: Occam’s Razor
Occam’s razor is also referred to as “lex parsimoniae” (meaning “law of parsimony” in Latin) or KISS (“Keep it simple, stupid!”). This principle argues that when there are multiple competing hypotheses, the hypothesis with the fewest premises should be chosen. William of Ockham (1285–1348) was a Franciscan from the village of Ockham, Surrey, England, and is remembered as an important medieval philosopher. Since the Greek era, there has been a basic belief that the simplest explanation of a phenomenon should be sought, and through the writings of William of Ockham, this belief attracted even more attention. The use of the word “razor” may sound a little strange in modern times, but in the philosophy of the time it referred to “shaving” away unnecessary explanations. □
2.1.2 Life Cycle of Hypothesis
In general, hypotheses have the following life cycle (see Fig. 2.1).
(Life cycle of hypothesis)
1. Define a problem to be solved by research questions (refer to Sect. 2.2).
2. Generate a hypothesis in advance through pre-analyzing a problem (i.e., generation of a prestage hypothesis).
3. Collect data related to the prestage hypothesis by appropriate means such as experiment, observation, database search, and Web search (i.e., data collection).
4. Process and transform the collected data as needed.
Fig. 2.1 Life cycle of a hypothesis (stages shown: prestage hypothesis; generate hypothesis; verify hypothesis; interpret and explain hypothesis; if accepted, formulate theory, yielding a theory; if rejected, modify hypothesis)
5. Based on the data, generate a hypothesis using appropriate means such as reasoning, problem solving, machine learning, and integrated methods (i.e., hypothesis generation).
6. Verify the generated hypothesis by appropriate means such as evaluation and statistical tests (i.e., hypothesis verification).
7. Interpret and explain the hypothesis so that the user can understand how it is generated, what features it has, and how it works.
8. If the hypothesis is accepted as a result of verification, promote it to a candidate theory (i.e., theorizing the hypothesis).
9. Otherwise (i.e., if the hypothesis is rejected as a result of verification), either discard it, or modify it and verify it again.
A prestage hypothesis here is a hypothesis obtained by problem analysis performed in advance, based on the experience and research of oneself and others, or on experiments using small-scale data. It is, so to speak, a tentative hypothesis or a working hypothesis. It is not formally different from a final hypothesis, but it is often coarser in its level of detail or based on a smaller amount of associated data. In other words, a prestage hypothesis becomes a hypothesis after being refined with a large amount of collected data. In this book, we focus on the generation of hypotheses regardless of whether they are prestage hypotheses or not.
2.1.3 Relationship of Hypothesis with Theory and Model
Here, we summarize the relationship of a hypothesis with a theory and a model. It is difficult to give each of them a strict definition, and hence difficult to distinguish them from one another. In this book, we will take the following simple position.
First, it is difficult to distinguish between hypothesis and theory. To put it bluntly, a hypothesis can be considered a theory that has not yet been fully tested or widely accepted. Furthermore, existing theories may be replaced by more plausible hypotheses or theories due to the emergence of new phenomena. For example, with the advent of the theory of special relativity proposed by Albert Einstein (1879–1955), a German-born Nobel laureate in physics, Newtonian mechanics, which had until then been thought to hold universally, came to be positioned as an approximate theory that handles the motion of objects at speeds sufficiently slower than the speed of light.
Models are more abstract explanations of phenomena than hypotheses and theories, and they are derived depending on what is focused on and what is omitted. Therefore, there can be multiple models even if they are based on the same theory. Furthermore, the means of expressing a model include not only conceptual descriptions but also mathematical descriptions, algorithms, and programs that can be executed by computer. Normally, a model has parameters (including hyperparameters) that are determined appropriately by the user or mechanically by computation, and the model is instantiated by specifying concrete parameter values. In this book, the instantiated model can also be regarded as a kind of hypothesis. Furthermore, concrete conclusions drawn from a hypothesis are also regarded as a kind of hypothesis.
2.1.4 Hypothesis and Data
Theories, hypotheses, observations, experiences, and data related to a hypothesis are important as information necessary for hypothesis generation.
First, existing related theories and hypotheses are obtained from papers published in academic journals, international conferences, domestic conferences, and academic books. Indeed, from the viewpoint of reliability and completeness of contents, papers published in conventional academic journals are important, but papers published in open access journals (DOAJ 2022), which usually have a shorter period from submission to publication than conventional journals, may be more useful from the viewpoint of immediacy of topics.
Of course, one’s own observations and experiences are important because they give concrete suggestions for hypothesis generation. Even if there are no such observations and experiences, those of other experts in a related field are also useful.
Finally, we would like to make a note about relationships between a hypothesis and data. Data are usually obtained as a result of observations and experiments. In addition, data may be obtained in advance through preliminary experiments, which are useful for hypothesis generation. On the other hand, there may be data that are available before any concrete purpose of use, that is, before the generation of a specific hypothesis. Such examples include data generated from modern IoT networks, data obtained from social media sites, various data open to the public, and even observational data recorded in astronomy. Such social data and open data can be obtained by dedicated APIs or simply downloaded from publicly available related sites as needed. From there, data necessary for hypothesis generation and verification are selected.
Furthermore, data generally provide a basis for hypotheses and theories. In other words, data support hypotheses as evidence. Therefore, getting to know the obtained data in depth is one of the essential steps leading to better hypothesis generation.
2.2 Research Questions as Hints for Hypothesis Generation
We explain research questions as hints for hypothesis generation. At the very beginning of research, a research question must be set up in order to make the research objectives and methods clear (Järvinen 2008). Hypotheses give initial answers to research questions. In other words, research questions can be expected to lead to hypothesis generation. In many cases, research questions guide us as to how we collect data and what we choose as methods for hypothesis generation and verification.
Each type of basic research question is associated with candidate methods for hypothesis generation. For example, there are the following correspondences between basic research questions and hypothesis generation methods, although the list is not comprehensive.
• Why → Causal analysis, that is, “what is the relationship between a cause and an effect?”, or more generally, hypothesis interpretation.
• What → “What is the cause or effect?” in causal analysis, or “What are the common characteristics of similar data (e.g., events and things)?” grouped by clustering (i.e., intra-cluster analysis).
• Who/whom → “What is the subject or object?” in causal analysis, or “What are the common characteristics of similar data (e.g., humans and creatures)?” grouped by clustering (also, intra-cluster analysis).
• Which → Classification.
• When → Time series data analysis, prediction, or estimation.
• Where → Spatial or geometrical data analysis, prediction, or estimation.
• How → Problem solving.
• How much and how many → Regression, prediction, or estimation.
• How similar → Clustering.
• How frequent → Association rule mining.
• How different → Statistical test (i.e., null hypothesis and alternative hypothesis).
• How to integrate → Generation methods for integrated hypotheses.
Actual research questions are not so simple and are often a combination of basic research questions.
The hypothesis generation methods mentioned above are mainly techniques included in data analysis such as data mining, machine learning, and multivariate analysis. These are explained in detail in the following chapters. Specifically, problem solving is explained concretely later in this chapter. Estimation is also introduced as an approximate method in this chapter. Chapter 3 explains an example of hypothesis generation by reasoning and problem solving in astronomy and biology. Regression and causal analysis are explained in Chap. 4 with astronomy and biology as the subject. Statistical tests are also briefly touched on in Chap. 4. For more details, refer to textbooks on statistics such as Diggle and Chetwynd (2011). Clustering (cluster analysis), association rule mining, and classification (deep learning and decision tree) are explained in detail in Chap. 5. Hypothesis generation methods based on differences between time series data or between spatial, geometrical data, as well as those based on differences between conceptual data or between existing hypotheses, are explained in detail in Chap. 6. Chapter 7 explains hypothesis generation methods based on operations such as join, intersection, and union. Hypothesis interpretation is explained in Chap. 8.
2.3 Data Visualization
Data visualization is a basic means of understanding data. It is used not only to present analysis results in an easy-to-understand manner, but also to generate and verify hypotheses. In this section, various visualization methods are introduced, based on the structure and number of dimensions of the target data and the scale (e.g., time and space) with which the data are visualized. For more details, please refer to visualization textbooks (e.g., Cairo 2016).
2.3.1 Low-Dimensional Data
The following methods are widely used for low-dimensional data (i.e., 1D or 2D data) such as statistical data.
• Bar graph.
• Line graph.
• Band graph.
• Heat map.
Here, an example of a band graph is shown in Fig. 2.2a. It illustrates ratios of wine import volume by country in Japan (2019). A heat map is visualized in one or two dimensions by color or grayscale according to a cell value (i.e., intensity) of data (see Fig. 2.2b). In this example, which we generated in our research work, the brighter the color, the more satisfying the grid area is for foreign visitors to Shibuya.
Fig. 2.2 a Band graph (wine import volume ratio by country, Japan, 2019; countries shown: Chile, Italy, France, Spain, Other). b Heat map (area labels: Bunkyo, Shinjuku, Shibuya)
2.3.2 High-Dimensional Data
For data with three or more dimensions, it is conceivable to use the methods for lower dimensions after performing dimensionality reduction such as principal component analysis (PCA) or multidimensional scaling (MDS). PCA chooses the first principal axis so that it maximizes the variance of the multidimensional data along that axis, then chooses the second principal axis, orthogonal to the first, so that it maximizes the remaining variance along it, and visualizes each data point in the (two-dimensional or three-dimensional) space spanned by those axes. PCA will be taken up again in the section on regression in Chap. 4. MDS visualizes each data point in a lower dimension (e.g., two dimensions) while preserving the distances between data points in the original multidimensional space.
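As a minimal sketch of how PCA picks its axes, the following pure-Python code computes the covariance matrix of a small synthetic 3D data set and extracts the first two principal axes by power iteration with deflation. The function names (`covariance_matrix`, `power_iteration`, `pca_2d`), the synthetic grid data, and the use of power iteration are assumptions made for illustration only; in practice a numerical library would normally be used instead.

```python
import math

def covariance_matrix(points):
    """Return (mean, centered points, sample covariance matrix)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - mean[d] for d in range(dim)] for p in points]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    return mean, centered, cov

def power_iteration(matrix, iterations=500):
    """Approximate the dominant eigenvector/eigenvalue of a symmetric matrix."""
    dim = len(matrix)
    # Arbitrary asymmetric start vector (an assumption); it must not be
    # orthogonal to the eigenvector being sought.
    v = [1.0 / (d + 1) for d in range(dim)]
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * matrix[a][b] * v[b]
                     for a in range(dim) for b in range(dim))
    return v, eigenvalue

def pca_2d(points):
    """Project points onto their first two principal axes."""
    _, centered, cov = covariance_matrix(points)
    dim = len(cov)
    pc1, var1 = power_iteration(cov)
    # Deflation: remove the variance explained by pc1, then search again.
    deflated = [[cov[a][b] - var1 * pc1[a] * pc1[b] for b in range(dim)]
                for a in range(dim)]
    pc2, var2 = power_iteration(deflated)
    projected = [(sum(r[d] * pc1[d] for d in range(dim)),
                  sum(r[d] * pc2[d] for d in range(dim))) for r in centered]
    return (pc1, pc2), (var1, var2), projected

# Synthetic data: a grid stretched most along the direction (1, 1, 0).
s = 1.0 / math.sqrt(2.0)
data = [((i - j) * s, (i + j) * s, float(k))
        for i in range(-10, 11) for j in range(-3, 4) for k in (-1, 0, 1)]
(pc1, pc2), (var1, var2), projected = pca_2d(data)
```

The list `projected` holds one two-dimensional coordinate pair per data point, which can then be drawn with any of the low-dimensional methods. The first axis found is expected to lie along the direction in which the synthetic grid is longest.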
Fig. 2.3 a Parallel coordinate plot (parallel axes: Property1, Property2, Property3). b Scatter plot matrix (attribute pairs of Property1, Property2, Property3)
Other methods include the following. Hilbert curve-based visualization is a method of arranging sorted multidimensional data in a two-dimensional window to preserve its order. Even if the number of data increases, the data can be displayed in a window of the same size. A parallel coordinate plot visualizes data by arranging different axes in parallel for each attribute (see Fig. 2.3a). A scatter plot matrix arranges scatter plots for each attribute pair in a matrix (see Fig. 2.3b).
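To illustrate the Hilbert curve-based arrangement, the sketch below maps a position d in a sorted sequence to a cell (x, y) of an n × n window, using the widely known iterative index-to-coordinate conversion for the Hilbert curve (n must be a power of two). The names `hilbert_d2xy` and `hilbert_layout` are hypothetical helpers introduced here, not part of any particular library.

```python
def _rotate(s, x, y, rx, ry):
    """Rotate/flip a quadrant of size s so that sub-curves connect."""
    if ry == 0:
        if rx == 1:
            x = s - 1 - x
            y = s - 1 - y
        x, y = y, x
    return x, y

def hilbert_d2xy(n, d):
    """Convert position d (0 <= d < n*n) along the curve to a cell (x, y).

    n is the side length of the window and must be a power of two.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = _rotate(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_layout(sorted_items, n):
    """Place already sorted items into an n x n window along the curve."""
    if len(sorted_items) > n * n:
        raise ValueError("window too small for the number of items")
    return [(hilbert_d2xy(n, d), item) for d, item in enumerate(sorted_items)]

# Example: place 16 sorted values into a 4 x 4 window.
layout = hilbert_layout(sorted(range(100, 116)), 4)
```

Because consecutive positions along the curve always fall in neighboring cells, items that are close in the sorted order stay close in the window, which is the order-preserving property mentioned above.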
2.3.3 Tree and Graph Structures
When we represent data as a tree structure, we can use the following methods.
• A node link diagram represents nodes and edges (links) of a tree in two dimensions.
• A cone tree visualizes a tree in three dimensions by arranging a parent node at the apex of a cone and its child nodes on the circumference of the bottom circle of the cone.
For representing data as a graph or a network (i.e., weighted graph), the following methods are available.
• Static drawing of a graph arranges nodes and edges so that they do not overlap while preserving relationships between the nodes as much as possible.
• The elastic map, as dynamic drawing of a graph, assumes that relevant data are connected by springs and arranges them in low dimensions so that the elastic energy of the entire network is minimized.
2.3.4 Time and Space
• Time series data: For time series data, it is conceivable to use the line graphs, band graphs, and heat maps introduced so far.
• Map (terrestrial map, sky map): As real space, there are 2D and 3D maps such as OSM (OpenStreetMap 2022) and Google Maps (Google Maps 2022), and celestial coordinate systems such as equatorial coordinates and ecliptic coordinates. Frequently used methods that map the earth to a plane include a method that maintains true angles (the Mercator projection) and a method that maintains relative sizes of areas (e.g., the Mollweide projection). The distribution and intensity of the data on the earth or in the sky are plotted on such maps.
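As a small illustration of mapping the earth to a plane, the following sketch implements the standard spherical Mercator formulas, x = Rλ and y = R ln tan(π/4 + φ/2), where φ is the latitude and λ the longitude in radians. The function name `mercator` and the mean-radius constant are assumptions chosen for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for illustration

def mercator(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Project latitude/longitude (degrees) to planar Mercator coordinates."""
    if abs(lat_deg) >= 90.0:
        raise ValueError("the Mercator projection is undefined at the poles")
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * lon
    y = radius * math.log(math.tan(math.pi / 4.0 + lat / 2.0))
    return x, y

# Example: planar coordinates (in km) of a point near latitude 35.7, longitude 139.7.
x_km, y_km = mercator(35.7, 139.7)
```

Angles are preserved locally, but the vertical stretching grows with latitude, which is why areas far from the equator appear enlarged on Mercator maps.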
2.3.5 Statistical Summary
A histogram (i.e., “frequency distribution graph”) visualizes the distribution of data with a bar graph that shows the frequency for each class (i.e., bin) on the horizontal axis. From it, you can roughly read the representative value (e.g., mean and mode) and the variation (variance) of the data.
A scatter plot is useful for visualizing correlation between attributes. In the case of a three-dimensional scatter plot, it is conceivable to display data in a three-dimensional space. The scatter plot matrix for higher-dimensional data arranges scatter plots for different attribute pairs (see Fig. 2.3b).
A cumulative frequency polygon (i.e., ogive) is visualized with the upper limit value of each class on the horizontal axis and the cumulative frequency of the focused class and all the preceding classes on the vertical axis (see Fig. 2.4a).
A Lorenz curve orders data items (e.g., households) by a certain amount (e.g., wealth). Its horizontal axis value is the cumulative relative (i.e., normalized) frequency of the data items from the bottom class up to the focused class (e.g., cumulative number of households), and its vertical axis value is the corresponding cumulative relative amount (e.g., cumulative wealth) (see Fig. 2.4b). It is usually approximated by polygonal lines. The value obtained by subtracting twice the area under the Lorenz curve from the area of a square (1 × 1) is called the Gini coefficient, which is used as an index showing inequality in distribution.
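The definition of the Gini coefficient above can be sketched directly in code: build the polygonal Lorenz curve, compute the area under it with the trapezoidal rule, and subtract twice that area from 1. The function names `lorenz_curve` and `gini_coefficient` and the sample amounts are assumptions made for illustration.

```python
def lorenz_curve(amounts):
    """Return the polygonal Lorenz curve as a list of (x, y) points.

    x: cumulative relative number of items (e.g., households)
    y: cumulative relative amount (e.g., wealth)
    """
    values = sorted(amounts)
    total = sum(values)
    if not values or total <= 0:
        raise ValueError("amounts must be non-empty with a positive total")
    n = len(values)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, v in enumerate(values, start=1):
        running += v
        points.append((i / n, running / total))
    return points

def gini_coefficient(amounts):
    """Gini coefficient = 1 - 2 * (area under the Lorenz curve)."""
    points = lorenz_curve(amounts)
    area = sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return 1.0 - 2.0 * area

# Hypothetical asset amounts of five households.
example_gini = gini_coefficient([1, 2, 3, 4, 10])
```

With perfectly equal amounts the Lorenz curve coincides with the diagonal and the coefficient is 0; the more the curve sags below the diagonal, the closer the coefficient gets to 1.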
Fig. 2.4 a Ogive. b Lorenz curve (horizontal axis: cumulative number of households; vertical axis: cumulative amount of assets; shown together with the line of perfect equality)
2.4 Reasoning
The most basic means of hypothesis generation is reasoning, which includes the following types (Bortolotti 2008; Danks and Ippoliti 2018).
• Deductive reasoning (deduction for short).
• Inductive reasoning (induction for short).
• Plausible reasoning: Abduction and IBE (inference to the best explanation).
• Generalization.
• Specialization.
• Analogy.
All the above types of reasoning will be explained in order, as methods useful for hypothesis generation. Before going into detailed explanation of reasoning, we introduce the philosophy of science and scientific methods as related topics.
2.4.1 Philosophy of Science and Hypothetico-Deductive Method
2.4.1.1 Philosophy of Science
Science is a systematic attempt to create, organize, and share knowledge. In the past, science and philosophy were often used interchangeably. In the seventeenth century, natural science branched off from philosophy and came to be considered a different field. Science has come to refer not only to knowledge, but also to ways to explore it, and since the nineteenth century, science has come to mean principle-based methods for understanding the world, as in physics and biology. Scientific methods have come to be considered in relation to science. An academic discipline primarily exploring scientific methods is called the philosophy of science (Bortolotti 2008; Chang 2014).
2.4.1.2 Hypothetico-Deductive Method
Especially in the field of science, there is a scientific method called the hypothetico-deductive method, which corresponds to the life cycle of the hypothesis explained previously (Danks and Ippoliti 2018). The hypothetico-deductive method consists of the following procedures.
1. Generate a hypothesis explaining data (i.e., induction step).
2. Assume the hypothesis is true and predict the result (i.e., deduction step).
3. If the prediction can explain the data, accept the hypothesis for the time being.
4. Otherwise, reject the hypothesis.
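The four procedures above can be sketched as a simple loop. In this illustrative example the induction step is represented only by a hand-written list of candidate hypotheses; the function name `hypothetico_deductive`, the toy observations, and the tolerance are all assumptions.

```python
def hypothetico_deductive(candidates, observations, tolerance=1e-9):
    """Return (name of the first accepted hypothesis or None, rejected names).

    candidates:   list of (name, function) pairs, assumed to come from an
                  earlier induction step (step 1)
    observations: list of (x, y) pairs
    """
    rejected = []
    for name, hypothesis in candidates:
        # Step 2 (deduction): assume the hypothesis is true and predict.
        predictions = [hypothesis(x) for x, _ in observations]
        explains_data = all(abs(pred - y) <= tolerance
                            for pred, (_, y) in zip(predictions, observations))
        if explains_data:
            # Step 3: accept the hypothesis for the time being.
            return name, rejected
        # Step 4: otherwise, reject the hypothesis.
        rejected.append(name)
    return None, rejected

observations = [(1, 3), (2, 5), (3, 7)]
candidates = [("y = 2x", lambda x: 2 * x),
              ("y = 2x + 1", lambda x: 2 * x + 1)]
accepted, rejected = hypothetico_deductive(candidates, observations)
```

Acceptance here is only provisional: a later observation that the accepted function fails to predict would send it to the rejected list as well.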
Induction and deduction in the above steps will be explained later in this section. Generally, reasoning has been used as the basis of scientific methods. Especially, the types of reasoning related to hypothesis generation are explained below.
2.4.2 Deductive Reasoning
In deductive reasoning (deduction for short), general principles including axioms, theorems, and rules are applied to draw concrete conclusions. In other words, deduction is to prove a proposition that represents a hypothesis. The following rules are often used in deduction. Each sentence represents a proposition.
The law of detachment (Modus ponens)
• If A is true, then B is true.
• A is true.
• ---------- (i.e., implies)
• B is true.
For example,
If it is a human being, it dies.
Aristotle is a human being.
Therefore, Aristotle dies.
Generalizing this rule gives us the so-called syllogism.
Syllogism
• If A is true, then B is true.
• If B is true, then C is true.
• ----------
• If A is true, then C is true.
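Both rules can be checked mechanically by enumerating truth values: a rule is valid if no assignment makes all premises true and the conclusion false. The sketch below does this; the helper names `implies` and `is_valid` are assumptions, and a deliberately invalid rule is included for contrast.

```python
from itertools import product

def implies(p, q):
    """Material implication: 'if p is true, then q is true'."""
    return (not p) or q

def is_valid(premises, conclusion, n_vars):
    """True if the conclusion holds whenever all premises hold."""
    for values in product([False, True], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False  # counterexample found
    return True

# Modus ponens: from (A -> B) and A, conclude B.
modus_ponens_valid = is_valid(
    [lambda a, b: implies(a, b), lambda a, b: a],
    lambda a, b: b, 2)

# Syllogism: from (A -> B) and (B -> C), conclude (A -> C).
syllogism_valid = is_valid(
    [lambda a, b, c: implies(a, b), lambda a, b, c: implies(b, c)],
    lambda a, b, c: implies(a, c), 3)

# Not a valid rule: from (A -> B) and B, conclude A.
affirming_consequent_valid = is_valid(
    [lambda a, b: implies(a, b), lambda a, b: b],
    lambda a, b: a, 2)
```

The last check fails because A false and B true satisfies both premises but not the conclusion, which shows that the direction of the implication matters.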
The Greek philosopher Aristotle (384–322 BC) proposed the prototype of the syllogism.
Now, let us consider the role of deductive reasoning in hypothesis generation. In deductive reasoning, a concrete conclusion is a specialization of a general principle, although this is not always explicit. In other words, the conclusion is considered to be contained in the original principle in advance. Therefore, it is said that deductive reasoning, unlike inductive reasoning explained later, does not substantially add new truths or knowledge. However, deduction is used at least in the procedures of the hypothetico-deductive method (i.e., the deduction step), introduced as a scientific method aiming at hypothesis generation. Deduction also plays a major role in the proof of mathematical propositions, including mathematical induction (refer to Box “Mathematical Induction”).
2.4.3 Inductive Reasoning
Unlike deductive reasoning, inductive reasoning (induction for short) generalizes a new hypothesis from individual data. The basic form of induction is as follows.
• The ratio Q of samples of the population P has the characteristic C.
• ----------
• The ratio Q of the population P has the characteristic C (a new hypothesis).
Statistical syllogism is described as follows.
• The ratio Q of samples of the population P has the characteristic C.
• The individual I is a member of the population P.
• ----------
• The individual I has the characteristic C with a plausibility of Q (a new hypothesis).
For example, nearly 0.6 of successful applicants at “T. M. University” are from the Kanto region of Japan (i.e., the eastern part of the largest island of Japan). “Mr. A” has been admitted to T. M. University. Therefore, Mr. A is from Kanto with a plausibility of nearly 0.6.
The above ratio Q (0 ≤ Q ≤ 1) is a numerical value, but it may be qualitatively expressed by words such as “most” and “few”.
Here, care must be taken when applying inductive reasoning. There is no problem if the population P consists of multiple classes and they are homogeneous (i.e., have the same ratio). However, if the ratio differs from class to class, that is, if the classes are heterogeneous, that point needs to be taken into consideration in order to obtain a more plausible hypothesis (refer to Simpson’s paradox in Chap. 1). For example, if Mr. A is an international student in the above example, the conclusion will change.
Enumerative induction corresponds to the case of ratio Q = 1 and can be paraphrased as follows.
• Sample 1 has the characteristic C.
• Sample 2 has the characteristic C.
• …
• As far as I know, all samples have the characteristic C.
• ----------
• All members of the population have the characteristic C (a new hypothesis).
Enumerative induction corresponds to so-called generalization reasoning. Induction is an important tool for scientific discovery. A classic example of inductive reasoning is as follows. The Italian astronomer and physicist Galileo Galilei (1564–1642) measured the period of a pendulum, that is, the time it takes the pendulum to swing back to its original position, while changing the length of the pendulum (Aufmann et al. 2018). The lengths and periods obtained as the experimental result were recorded in a table (see Fig. 2.5). Here, for simplicity, one unit of length is set to 10 in. (25.4 cm), and one unit of period is set to one heartbeat interval of Galileo, who did not have a clock at that time. From this table, Galileo concluded that the period of the pendulum is proportional to the square root of the length. It should be noted that in enumerative induction, a hypothesis loses its correctness as soon as samples that contradict it (i.e., counterexamples) appear.
Length  Period
1       1
4       2
9       3
16      4
25      5
36      6
Fig. 2.5 Pendulum experiment
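The generalization from the table in Fig. 2.5 can be reproduced as a small computation: fitting a power law, period = k × length^p, by least squares on the logarithms of the values. The function name `fit_power_law` is an assumption; the data are the six rows of the table.

```python
import math

lengths = [1, 4, 9, 16, 25, 36]   # in units of 10 in.
periods = [1, 2, 3, 4, 5, 6]      # in heartbeat intervals

def fit_power_law(xs, ys):
    """Fit y = k * x**p by linear least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
             / sum((a - mean_x) ** 2 for a in log_x))
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)   # exponent p, coefficient k

exponent, coefficient = fit_power_law(lengths, periods)
```

For these idealized values the fitted exponent comes out as one half, which is the "proportional to the square root of the length" hypothesis; with real, noisy measurements the exponent would only be approximately one half.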
[Quiz] Inductive reasoning.
Check whether the following proposition is generally true or not. Assume that n is an integer. Then the value of the following expression is a prime number (i.e., a natural number that is greater than 1 and has no divisor other than 1 and itself).
n × (n + 1) + 41.
[Tips to think] Substitute 0, 10, 20, 30, and 40 for n and check whether the above proposition holds. Incidentally, this problem was posed by the Swiss mathematician Leonhard Euler (1707–1783).
In the first place, enumerative induction assumes that phenomena occurring in nature have regularity. This is called the principle of uniformity of nature. However, the Scottish philosopher David Hume (1711–1776) argued that the principle of uniformity of nature was itself derived from enumerative induction, and that in this sense enumerative induction was problematic. In any case, it should be noted that the plausibility of the result of inductive reasoning depends on the size of the sample data, the size of the population, and the degree to which the samples represent the population.
Box: Mathematical Induction
Mathematical induction, despite its name, is a kind of deductive reasoning. Indeed, a mathematical proposition as a hypothesis to be proved is created by some generalization. That is, such a process of hypothesis generation is itself induction. However, deduction is used in the process of applying mathematical induction to the proof of a mathematical proposition. In other words, mathematical induction together with its preceding hypothesis generation can be viewed as a special form of the hypothetico-deductive method, described above as the scientific method. Mathematical induction has been used for a long time, but the British mathematician Augustus De Morgan (1806–1871) gave a proper explanation of it. Further, the validity of mathematical induction is based on the Peano axioms in natural number theory. Relevant parts of the Peano axioms can be described as follows.
(Peano axioms) If a set of integers S satisfies the following two properties, then the set S contains all natural numbers.
• The integer 1 belongs to the set S.
• If the integer k belongs to the set S, the integer k + 1 belongs to the set S.
The set S can be expressed as follows.
S = {n : P(n)}.
Here, n is an integer and P(n) is a proposition to be proved, for example, $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$. □
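Returning to the quiz on n × (n + 1) + 41 earlier in this section, the search for a counterexample can be automated. The following sketch uses simple trial division; the function names `is_prime`, `euler_value`, and `first_counterexample` are assumptions, and the search is restricted to non-negative n.

```python
def is_prime(m):
    """Trial-division primality test, adequate for small numbers."""
    if m < 2:
        return False
    if m % 2 == 0:
        return m == 2
    d = 3
    while d * d <= m:
        if m % d == 0:
            return False
        d += 2
    return True

def euler_value(n):
    """Value of the expression in the quiz."""
    return n * (n + 1) + 41

def first_counterexample(limit):
    """Smallest n in 0..limit for which the value is not prime, else None."""
    for n in range(limit + 1):
        if not is_prime(euler_value(n)):
            return n
    return None

counterexample = first_counterexample(100)
```

The hypothesis survives the first four substitutions suggested in the tips and fails at n = 40, where the value equals 41 × 41. This illustrates how a single counterexample overturns a hypothesis obtained by enumerative induction, no matter how many confirming samples preceded it.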
2.4.4 Generalization and Specialization
We explain generalization and its converse, specialization, in relation to induction.
2.4.4.1 Generalization
Here, generalization of mathematical propositions will be explained with concrete examples. Before that, as a preparation, we take the Pythagorean theorem as a basic proposition and explain a proof of it. It is said that there are more than 500 proofs of this theorem (Crease 2010). Among them, one of the proofs that uses figures is explained.
Let us consider a square box with (a + b) as each side and four right triangle-shaped tiles with three sides a, b, and c (see Fig. 2.6 (left)). Here, the white parts (three squares with a, b, and c as each side) are gaps or vacant spaces, to which attention must be paid now. The triangles are moved (i.e., translated and rotated) so that they do not overlap (see Fig. 2.6 (right)). Since the area of the gap does not change within the same box even if the triangles are moved, it can be proved that the Pythagorean theorem holds by comparing the right and left figures. This method based on invariants is useful for solving various problems, as explained later.
Now, turning to the main subject, we consider the generalization of the Pythagorean theorem.
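The invariant argument can also be written out as a short calculation. This is a sketch in our own notation (the symbols $S_{\text{box}}$, $S_{\text{tiles}}$, and $S_{\text{gap}}$ are introduced here for illustration): the gap area is the same in both arrangements of the four tiles, and it can be evaluated in two ways.

```latex
\begin{align*}
S_{\text{box}} &= (a+b)^2, \qquad
S_{\text{tiles}} = 4 \cdot \tfrac{1}{2}ab = 2ab, \\
S_{\text{gap}} &= S_{\text{box}} - S_{\text{tiles}}
               = (a+b)^2 - 2ab = a^2 + b^2
  && \text{(gap = two squares of sides } a \text{ and } b\text{)}, \\
S_{\text{gap}} &= c^2
  && \text{(gap = one square of side } c\text{)}, \\
\Longrightarrow \quad a^2 + b^2 &= c^2 .
\end{align*}
```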
2.4.4.2 Generalization to Polygons
Instead of the squares assumed by the Pythagorean theorem, let us consider three right triangles that are similar to the original right triangle and whose corresponding sides are a, b, and c, respectively. The following relationship holds between their areas (see Fig. 2.7a).
Area of right triangle c = Area of right triangle a + Area of right triangle b.
Fig. 2.6 Pythagorean theorem. Pay attention to invariant (white squares)
Fig. 2.7 Extended Pythagoras theorem. a Right triangle version. b Polygon version. c Semicircle version
This proof is self-evident from Fig. 2.7a. As long as the corresponding side of three similar triangles is either side of a right triangle, the relationship between the areas of such triangles holds true not only for a right triangle but also for any triangle. At that time, the area ratio of the triangles is a2 : b2 : c2 . Furthermore, in general, if three similar polygons are considered and their corresponding sides correspond to the sides a, b, and c of a right triangle, then the following holds between the area of polygon a, the area of polygon b, and the area of polygon c (Pólya 2009) (see Fig. 2.7b).
2.4 Reasoning
Area of polygon c = Area of polygon a + Area of polygon b.
The generalized Pythagorean theorem can now be stated as follows.
(Generalized Pythagorean Theorem). For similar polygons whose corresponding sides are equal to the sides a, b, and c of a right triangle, the following holds between the three areas. Here, λ is a parameter that depends on the shape of the polygon.
$$\lambda c^2 = \lambda a^2 + \lambda b^2 .$$
[Quiz] Prove that the “Generalized Pythagorean theorem” is correct.
[Tips for proof] A polygon can be divided into triangles, so the area of a polygon is the sum of the areas of these triangles. Similar polygons can be divided so that the corresponding triangles are similar. Therefore, in general, there is a positive constant λ such that the area of polygon a, the area of polygon b, and the area of polygon c can be expressed as $\lambda a^2$, $\lambda b^2$, and $\lambda c^2$, respectively, each being the sum of the areas of the polygon's component triangles. Multiplying both sides of the Pythagorean theorem itself, $c^2 = a^2 + b^2$, by λ gives the following.
$$\lambda c^2 = \lambda a^2 + \lambda b^2 .$$
The Pythagorean theorem is thus generalized to similar polygons.
2.4.4.3
Generalization to Semicircle
Now we consider semicircles (which are always similar) whose diameters are the sides a, b, and c, instead of the squares used in the Pythagorean theorem (see Fig. 2.7c). Area of semicircle c = Area of semicircle a + Area of semicircle b. That is,
$$\frac{\pi}{2}\left(\frac{c}{2}\right)^2 = \frac{\pi}{2}\left(\frac{a}{2}\right)^2 + \frac{\pi}{2}\left(\frac{b}{2}\right)^2 .$$
This is also correct from the Pythagorean theorem. It is easy to prove that the circle version of Pythagorean theorem also holds. This case can be regarded as an analogy to the original Pythagorean theorem (refer to Sect. 2.4.4.1).
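The generalized theorem can also be checked numerically. The following is a minimal sketch, assuming that each shape is described by its factor λ (area = λ × side²); the three factors listed are standard area formulas chosen for illustration, and the function names are hypothetical.

```python
import math

# Shape factors (lambda) for a figure erected on a side of length s,
# so that area = lambda * s**2. Illustrative values, not from the text.
SHAPE_FACTORS = {
    "square": 1.0,                             # s^2
    "semicircle": math.pi / 8,                 # (pi/2) * (s/2)^2
    "equilateral triangle": math.sqrt(3) / 4,  # (sqrt(3)/4) * s^2
}

def similar_area(shape_factor: float, side: float) -> float:
    """Area of a figure whose area scales as shape_factor * side**2."""
    return shape_factor * side ** 2

def check_generalized_pythagoras(a: float, b: float) -> dict:
    """For a right triangle with legs a and b, test area(c) == area(a) + area(b)."""
    c = math.hypot(a, b)  # hypotenuse from the original Pythagorean theorem
    result = {}
    for name, lam in SHAPE_FACTORS.items():
        lhs = similar_area(lam, c)
        rhs = similar_area(lam, a) + similar_area(lam, b)
        result[name] = math.isclose(lhs, rhs)
    return result

if __name__ == "__main__":
    print(check_generalized_pythagoras(3.0, 4.0))
```

Because λ multiplies every term, the check succeeds for any positive shape factor, which is exactly the content of the generalization.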
Fig. 2.8 Generalization of Euclidean distance with respect to dimensions
2.4.4.4
Generalization of Dimensions (N-Dimensional Euclidean Distance L2 )
The Pythagorean theorem can be used to calculate distances in two-dimensional Euclidean space. Such a distance is called the Euclidean distance $L_2$. By applying the Pythagorean theorem in multiple steps, distances in two-dimensional Euclidean space can be generalized to distances in three-dimensional Euclidean space and further to distances in n-dimensional Euclidean space (see Fig. 2.8).
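The step-by-step application of the Pythagorean theorem can be sketched in code. The sketch below is an illustration under the assumption that points are given as equal-length sequences of numbers; the function name `euclidean_distance` is hypothetical. Each step combines the distance accumulated so far with the next coordinate difference as the two legs of a right triangle.

```python
import math
from functools import reduce

def euclidean_distance(x, y):
    """n-dimensional Euclidean distance built by repeated use of the
    two-dimensional Pythagorean theorem (one coordinate at a time)."""
    if len(x) != len(y):
        raise ValueError("both points must have the same number of dimensions")
    diffs = [abs(xi - yi) for xi, yi in zip(x, y)]
    # math.hypot(d, e) returns sqrt(d*d + e*e); folding it over the
    # coordinate differences extends the distance one dimension per step.
    return reduce(math.hypot, diffs, 0.0)

if __name__ == "__main__":
    print(euclidean_distance((0, 0), (3, 4)))
    print(euclidean_distance((0, 0, 0), (1, 2, 2)))
```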
2.4.4.5
Generalization of Exponentiation (from L2 to Lp )
By generalizing the exponent of the Euclidean distance from 2 to a general value p (p ≥ 1), the n-dimensional Euclidean distance is generalized as follows (for n-dimensional vectors $X_i$ and $X_j$).
(Minkowski distance $L_p$)
$$L_p = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p} .$$
(Manhattan distance or block distance $L_1$)
$$L_1 = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}| .$$
Fig. 2.9 Contours of points equidistant from the origin (panels: L1, L2, L∞)
Assuming that something (e.g., a person or a car) travels on the ground, it can be said that the Manhattan distance is more realistic than the Euclidean distance. The Euclidean distance is, so to speak, the distance traveled by a person who can parkour (refer to Box “Parkour”).
(Chebyshev distance $L_\infty$)
$$L_\infty = \max\left( |x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \ldots, |x_{in} - x_{jn}| \right) .$$
The Chebyshev distance emphasizes the largest difference among the dimensions. The Manhattan distance $L_1$, the Euclidean distance $L_2$, and the Chebyshev distance $L_\infty$ correspond to special cases of the Minkowski distance (p = 1, p = 2, and the limit p → ∞, respectively). For reference, Fig. 2.9 shows the contours of points equidistant from the origin for each type of distance. Here, a mathematical distance, called a distance function or metric, satisfies the following axioms.
(Axioms of distance metric) d(x, y) is a distance metric between objects x and y.
• d(x, y) ≥ 0.
• d(x, y) = 0 ↔ x = y.
• d(x, y) = d(y, x).
• d(x, y) + d(y, z) ≥ d(x, z).
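The family of distances above can be written as a single function. The following is a minimal sketch, assuming vectors are equal-length sequences of numbers and that p ≥ 1; the function name `minkowski` and the sample points are hypothetical. Passing `math.inf` as p selects the Chebyshev distance.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance L_p between two vectors (p >= 1, or math.inf)."""
    if len(x) != len(y):
        raise ValueError("both vectors must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be at least 1 for the metric axioms to hold")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # Chebyshev distance: the largest coordinate difference.
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)

if __name__ == "__main__":
    origin, point = (0.0, 0.0), (3.0, 4.0)
    for p in (1, 2, 10, 100, math.inf):
        # The printed values shrink from the Manhattan distance toward
        # the Chebyshev distance as p grows.
        print(p, minkowski(origin, point, p))
```

A spot check of the triangle inequality on a few sample points is a quick way to catch mistakes in such a function, although it is of course not a proof of the axioms.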
Box: Parkour
Parkour is a type of sport derived from French military training (Gilchrist and Wheaton 2011). Parkour players pass through a course with obstacles smoothly and quickly using only their own physical ability. For example, in extreme cases, the players climb a building, run on the roof, and jump to the next building (see Fig. 2.10). □
Fig. 2.10 Image of parkour
2.4.4.6
Specialization
Now let us consider specialization as reasoning in the opposite direction to generalization. For example, the case of squares (i.e., the Pythagorean theorem itself) corresponds to the generalized Pythagorean theorem with λ = 1, and the case of right triangles whose corresponding sides are equal to the sides of the original right triangle corresponds to the generalized Pythagorean theorem with $\lambda = \frac{1}{2} ab / c^2$. In other words, these are specializations of the generalized Pythagorean theorem.
2.4.5 Plausible Reasoning
Similar to inductive reasoning, plausible reasoning is explained as reasoning that creates a new hypothesis. Plausible reasoning includes abduction and IBE (inference to the best explanation). However, they also have common characteristics, and it is difficult to make a strict distinction between them. Let us briefly explain these two below.
2.4.5.1
Abduction
Abduction is described as follows.
• If A is true, then B is true.
• B is true.
-----
• A is more plausible.
For example:
If they are crows, they scatter garbage at a garbage collection point.
Garbage is scattered at the garbage collection point.
Therefore, it is probably the work of crows.
2.4.5.2
IBE
IBE (inference to the best explanation) is described as follows.
• Data D are given.
• Hypothesis H can explain D.
• There is no other competing hypothesis that can explain D better than H.
-----
• H is correct.
In a nutshell, IBE is characterized by generating hypotheses that explain given data best.
2.4.5.3
Analogy
Analogy is reasoning using similarity of problems. The similarity used in analogy is strictly defined as follows. Here, we consider two objects. First, preconditions for analogy are explained. It is assumed that both objects 1 and 2 have components, and there are relationships between the components.
• Correspondence between components
c1: A collection of components of object 1.
c2: A collection of components of object 2.
c1 and c2 correspond.
• Correspondence between relationships
r1: Relationship among c1 (i.e., the components of object 1).
r2: Relationship among c2 (i.e., the components of object 2).
r1 and r2 correspond.
Based on the above correspondences, analogy assumes the following similarities.
• The corresponding components of object 1 and object 2 are similar.
• The corresponding relationships of object 1 and object 2 are similar.
Analogy is used as follows.
• If the corresponding components of two objects are similar, the corresponding relationships may be similar.
• If the corresponding relationships of two objects are similar, the corresponding components may be similar.
For example, the original Pythagorean theorem (i.e., the square version) and the circle or semicircle version of the Pythagorean theorem are analogically related to each other. That is, each square has a side equal to one of the sides of a right triangle, and the three squares are of course similar in shape, while each circle has a diameter equal to one of the sides of the right triangle, and the three circles are also similar. The shared relationship is that the area of the largest square or circle is equal to the sum of the areas of the other two.
2.5 Problem Solving
In this book, problem solving is considered as a kind of hypothesis generation method. The Italian philosopher of science Carlo Cellucci (1940–) describes problem solving as a type of theory generation (Cellucci 2017). In the following subsections, we summarize problem solving tips used in mathematics, engineering, and science.
2.5.1 Problem Solving of Pólya
The Hungarian mathematician George Pólya (1887–1985) classifies mathematical problems into two types and explains how to solve them (Pólya and Conway 2014). Pólya divides general problem solving into the following four steps.
1. Understand a problem.
2. Devise a solution.
3. Execute the solution.
4. Review the solution.
Here, Steps 1 and 2 will be explained in a little more detail.
2.5.1.1
Understand Problem
This step is further divided into the following substeps.
• Problem classification
To understand a problem to be solved, identify the category to which the problem belongs among the following categories.
A. Find an answer to a problem (i.e., so-called equation-solving problem).
B. Make a proof of a problem (i.e., so-called proof problem).
• Problem division
Divide a problem according to the identified category of the problem as follows.
A. Find unknown variables. Furthermore, consider conditions that hold among unknown variables and known data (i.e., given variables and constants).
B. Divide conditions into a premise and a conclusion. Furthermore, find a logical link between the premise and the conclusion.
In this step, it is necessary to check whether the conditions are sufficient and whether there is any contradiction among multiple conditions. For that purpose, it is also important to choose an appropriate notation.
2.5.1.2
Devise Solution
For example, the following strategies are useful for devising a solution.
• Search for similar and simpler problems.
• Draw a diagram.
• Make a table or a chart.
• Make a list of known information.
• Make a list of needed information.
• Divide a problem into smaller problems.
• Find a pattern.
• Write down an equation.
• Work backward.
• Perform an experiment.
• Guess and check the result.
2.5.2 Execution Means for Problem Solving
In problem solving, the means to actually find a solution to a problem is a procedure (modus operandi). That is, the procedure is a sequence of well-coordinated operations that are used to draw conclusions from a hypothesis. If the scope of problem solving is expanded to include problems outside the field of mathematics, the procedures include basic concepts and operations depending on each specialized field, such as laws and equations in physics, data structures and algorithms in computer science, mathematical concepts, and their operations (i.e., reasoning and computation).
Pólya introduces auxiliary problems and dimensional analysis as methods that can be used in problem solving, in addition to divide-and-conquer, inductive reasoning (generalization and specialization), abduction (plausible reasoning), and analogy, which have already been explained. Among them, specialization helps to disprove a proposition by showing that the proposition is not satisfied in a special case. Auxiliary problems correspond to auxiliary lines and lemmas set up to solve a given problem. Dimensional analysis considers a dimensionless group (or dimensionless number) and focuses on basic dimensions (i.e., units) to find the relationship between multiple physical quantities. Dimensional analysis is explained more concretely later in this chapter. As far as mathematical problems are concerned, the American mathematician Paul Zeitz (1958–), who has been involved in the International Mathematical Olympiad for many years, introduces four principles: symmetry, the pigeonhole principle, and the extreme principle, in addition to the invariants already mentioned. His book (Zeitz 2006) describes problem solving more comprehensively, along with mathematical concepts such as algebra and geometry. In particular, the pigeonhole principle and the extreme principle are described as follows.
• Pigeonhole principle: When trying to put N pigeons into M pigeonholes (N > M), there are two or more pigeons in at least one pigeonhole. This is used to prove the existence of a solution (see Fig. 2.11).
Fig. 2.11 Pigeonhole principle. N(pigeon) > M(pigeonhole) → There is one or more pigeonholes containing two or more pigeons
• Extreme principle: If the data to deal with form an ordered set with maximum or minimum values, the principle helps to disprove a problem by paying attention to such extreme values.
Symmetry is concretely described in the following examples.
2.5.3 Examples of Problem Solving
The applied physicist Sanjoy Mahajan (1969–) has introduced several principles in his book (Mahajan 2014), which are useful for solving problems in the fields of engineering and science, especially for estimating unknown values. These principles include the following.
• Symmetry and invariants.
• Analogy and abstraction.
• Divide-and-conquer.
• Lumping.
• Dimensional analysis.
• Proportional reasoning.
• Probabilistic reasoning.
In the following subsections, we will explain problem solving concretely using familiar examples based on the above principles.
2.5.3.1
Use of Symmetry
First, we explain how to use symmetry using a familiar mathematical problem.
Case: Gauss’s calculation of the sum of a sequence.
Here, we consider a smart method to find the sum of the integers from 1 to n, following the German mathematician and physicist Carl Friedrich Gauss (1777–1855) in his childhood (see Fig. 2.12). That is, we calculate the following S. The task actually given to Gauss in his elementary school was the case where n = 100.
S = 1 + 2 + · · · + (n − 1) + n.
Of course, paying attention to symmetry, the value of S does not change even if the numbers are rearranged and added in reverse order.
S = n + (n − 1) + · · · + 2 + 1.
Fig. 2.12 Johann Carl Friedrich Gauss (Courtesy of the Smithsonian Libraries and Archives, released under a CC BY 4.0 license). https://library.si.edu/image-gallery/73259
Let us find a pattern here. Paying attention to the two numbers at the same position in the upper and lower expressions for S, each pair sums to (n + 1), and since there are n such pairs, the following result is obtained.
2S = (n + 1)n, that is, S = n(n + 1)/2.
Of course, this formula can be easily proved by mathematical induction.
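The pairing argument can be mirrored directly in code. This is a small illustrative sketch (the function name `gauss_sum` is hypothetical): it writes the sequence forward and backward, confirms that every column sums to n + 1, and halves the total.

```python
def gauss_sum(n: int) -> int:
    """Sum of 1..n computed by pairing the sequence with its reverse."""
    forward = list(range(1, n + 1))   # 1, 2, ..., n
    backward = forward[::-1]          # n, n-1, ..., 1
    pair_sums = [f + b for f, b in zip(forward, backward)]
    # Symmetry: every column of the two rows adds up to n + 1.
    assert all(s == n + 1 for s in pair_sums)
    # The n columns together give 2S, so S is half of their total.
    return sum(pair_sums) // 2

if __name__ == "__main__":
    print(gauss_sum(100))  # the case given to Gauss: 5050
```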
2.5.3.2
Use of Invariants
Here, we consider solving an estimation problem by using the concept of invariants (or conservation laws). We use Mahajan’s Box model, which focuses on invariants. A Box model assumes that the input and output match, such as a balance between supply and demand. Indeed, there are cases of excess and deficiency, but the Box model targets an equilibrium state. Here, we explain how to use invariants by taking the example of estimating the number of taxis needed for tourists in a certain area.
Fig. 2.13 Taxi pool in front of Sendai Station
Case: How many taxis are needed in a city?
As a specific problem, we estimate the number of taxis required for tourists visiting a certain city. For example, there is a large taxi pool in front of Sendai Station (see Fig. 2.13). For that purpose, consider the following conservation law. Demand (total monthly boarding time required by tourists) and supply (total monthly boarding time provided by taxis) must be balanced (i.e., Box model). We obtain a concrete Box model satisfying the conservation law as follows.
(Box model)
• Total boarding time for tourists (per month) based on “the number of tourists” (per year) = Total working hours of taxi drivers (per month).
First, let us consider the left side of the above equation. “The number of tourists” is the total annual number of people who visited a city as a tourist spot in a prefecture for non-remunerated purposes. Since the statistical value available as open data is the total for 12 months, we divide it by 12 and use the result as the number of tourists per month. It is assumed that tourists travel in small groups of two to three persons (2.5 on average) and that half of the tourist groups use taxis. The boarding time is assumed to be about 30 min, but since the taxi must return to the starting point, 60 min (one hour) are required for a round trip. We assume that one driver is assigned to each taxi and that the working hours per driver per month are 160 h. Let P and T denote “the number of tourists” and the number of taxis, respectively. Then the following holds.
P/12 × 1/2.5 × 1/2 × 1 = 160 × T.
For example, in Sendai City of Japan, the number of tourists P (2018) was 21,817,554. Assigning this value to P in the above equation gives the following T. T = 2273. In other words, at least 2273 taxis will be needed. As a side note, the actual number of taxis in Sendai City was 3221 (2018). This model does not take into account the use of taxis by residents and business travelers in the area. If these are taken into account, the total boarding time as a demand will increase, and then the number of required taxis will increase accordingly.
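The Box model above reduces to a single balance equation, which can be sketched as follows. The default parameter values are the assumptions stated in the text (groups of 2.5, half of the groups using taxis, one hour per round trip, 160 working hours per driver per month); the function and parameter names are hypothetical.

```python
import math

def required_taxis(tourists_per_year: float,
                   group_size: float = 2.5,
                   taxi_share: float = 0.5,
                   hours_per_ride: float = 1.0,
                   driver_hours_per_month: float = 160.0) -> int:
    """Number of taxis T that balances monthly demand and supply."""
    groups_per_month = tourists_per_year / 12 / group_size
    # Demand side of the Box model: total boarding hours per month.
    demand_hours = groups_per_month * taxi_share * hours_per_ride
    # Supply side: each taxi (one driver) provides driver_hours_per_month.
    return math.ceil(demand_hours / driver_hours_per_month)

if __name__ == "__main__":
    # Number of tourists in Sendai City in 2018, as quoted in the text.
    print(required_taxis(21_817_554))
```

Keeping each assumption as a named parameter makes it easy to see how sensitive the estimate is, for example to the share of groups that use taxis.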
2.5.3.3
Analogy and Abstraction
Analogy focuses on the similarity between the components of two objects and that between the relationships among the components, leading to a kind of abstraction. Problem abstraction is generally one of the tools to reduce the complexity of a problem. As already explained, in generalizing the Pythagorean theorem, squares and circles are considered to be corresponding objects in an analogical relationship with respect to areas. As will be described later, analogies are also used in the formation of Darwin’s theory of evolution. Here, we explain the method of analogy and abstraction using a physics problem.
Case: Graviton hypothesis
In particle physics, forces (i.e., interactions) are classified according to the elementary particles involved into four types: electromagnetic force, weak force, strong force, and gravity. Thus, a force is abstracted as an interaction mediated by related elementary particles. As shown below, it has been confirmed that the first three types of forces are mediated by real elementary particles. Considering these observations, the following hypothesis is generated (Lewton 2020; Rehm 2019).
• Electromagnetic force: Photons.
• Weak force: W and Z bosons.
• Strong force: Gluons.
-----
• Gravity: Hypothetical particles (called gravitons).
Of course, this hypothesis needs to be confirmed by future experiments and observations. On the other hand, research is now being conducted on string or superstring theory (Siegel 2018) as another hypothesis, which attempts to unify all forces with vibrating strings rather than with elementary particles.
2.5.3.4
Divide-And-Conquer and Lumping
Problems to be solved can often be divided into smaller sets of subproblems. A tree structure can be used at that time. A parent node (root) in a tree structure is associated with the original problem, and child nodes are associated with the divided subproblems. Here, according to the divide-and-conquer principle, we will take an example of dividing the problem into heterogeneous subproblems instead of homogeneous subproblems (e.g., MergeSort, refer to Chap. 1). In addition, we will explain how to calculate approximate values by simplifying shapes and values without being bound by detailed accuracy. Case: What is the population of Tokyo’s 23 wards? Population is one of the basic numbers in planning a social infrastructure. As a simple example, we consider estimating the total population of the 23 special wards of Tokyo. We will solve this problem by the divide-and-conquer principle based on a tree structure. So, here, the goal (problem to be solved) is divided into subgoals (subproblems) of different types, rather than subgoals of the same type. Population is calculated by the product of area and population density. The population, which is the goal here, is assigned to the root of a tree structure. The area and population density are assigned to the child nodes as separate subgoals. First, let us estimate the area of the 23 wards, which is one of the subgoals. Of course, the answer can be obtained by examining the actual values for each ward and totaling them. However, here we consider a method of estimating the area only from general knowledge, not from accurate individual values. Let us observe the problem and find a pattern. If you look at the map of the entire shape of the 23 wards, not each ward, it is close to a circle (see Fig. 2.14). Therefore, we will approximate the area of the 23 wards by the area of a circle. 
However, since part of Tokyo Bay is also included in the circle, we take the area of the 23 wards to be that of 3/4 of the circle. If the radius of the circle can be estimated, the area of the 23 wards can be estimated as 3/4 of the circle. So, we estimate the radius of the circle covering the 23 wards. Taking the JR Chuo Line or Sobu Line from Nishiogikubo Station at the western end of the 23 wards to Iidabashi Station at the center of the 23 wards takes, empirically, 20 to 30 min by commuter train. So, 25 min is used for the boarding time. The average speed of a train that you can ride while standing is about 50 km/h. The area (the goal here) is assigned as the root of a subtree, and the radius (the subgoal here) required to calculate it is assigned as its child. The radius is further divided into the train speed and the boarding time. However, looking at the route map of Tokyo, the railway is not a straight line but curves as it approaches the center of the 23 wards. Therefore, it is assumed that about 80% of the distance calculated from the speed and time corresponds to the straight-line distance.
Radius = 50 (km)/60 (minutes) × 25 (minutes) × 0.8 ≈ 16.7 (km).
Fig. 2.14 Tokyo 23 wards and railway
By rounding the radius down to 16 km, we get the area as follows.
Area = π × 16² × 3/4 ≈ 600 (km²).
The actual area of the 23 wards is 619 km². Second, we estimate the population density, which is another subgoal. First, we assume that the population density of the 23 wards is different from that of the rest of Tokyo and that the population density is uniform within the 23 wards. Next, we make finer assumptions as follows.
• A single-family house has a total floor area of 100 m² on average.
• As to condominiums and apartments, the ratio of the total floor area to the site area is two or three times that of a single-family house. Here, 2.5 is taken as the intermediate value.
• The total floor area required for one household is constant.
• In the 23 wards, where land is limited, the ratio of the number of single-family houses to the number of condominiums or apartments is about 1/3.
Therefore, in the case of condominiums or apartments, one household can be accommodated on 1/2.5 of the site area needed by a single-family house. After all, the site area required for one household is as follows.
100 × (1 × 0.25 + 1/2.5 × 0.75) = 55 (m²).
Further, only half (50%) of the total area of the 23 wards can be used as residential land. Considering that there are more single-person households in the 23 wards, we let the average number of people in a household be 2. Multiplying the number of people per household (two persons) by the number of households per km² gives the following.
2 × 1000 × 1000 × 0.5/55 ≈ 18,000.
That is, the population density is estimated to be 18,000 persons per km². The actual population density of the 23 wards is 15,426 persons per km² (2022). As the final goal, the population of the 23 wards is estimated to be 600 × 18,000 = 10.80 million (actually, 9.68 million in 2022). Indeed, it may not be possible to obtain accurate values by such rough estimation, but there is an advantage that the solution methods of the subproblems used in the process of problem solving can be reused. As described above, approximate estimation of quantities, as opposed to accurate calculation based on individual details, is generally called lumping.
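The tree of subgoals described above maps naturally onto one small function per node. The following is a rough sketch, assuming the parameter values stated in the text; all function and parameter names are hypothetical, and the rounding steps (radius down to 16 km, area to about 600 km², density to about 18,000) follow the lumping used in the text.

```python
import math

def radius_km(speed_kmh=50.0, minutes=25.0, straightness=0.8):
    """Leaf subgoal: radius from train speed, ride time, and track curvature."""
    return speed_kmh / 60.0 * minutes * straightness

def area_km2(radius, circle_fraction=0.75):
    """Subgoal: area of the part of the circle not covered by Tokyo Bay."""
    return math.pi * radius ** 2 * circle_fraction

def density_per_km2(floor_area_m2=100.0, house_share=0.25, apartment_ratio=2.5,
                    residential_fraction=0.5, persons_per_household=2.0):
    """Subgoal: persons per km^2 from the housing assumptions."""
    # Average site area per household, mixing houses and apartments.
    site_m2 = floor_area_m2 * (house_share + (1.0 - house_share) / apartment_ratio)
    households = 1_000_000 * residential_fraction / site_m2
    return persons_per_household * households

def estimate_population():
    """Root goal: population = area x population density (with lumping)."""
    radius = math.floor(radius_km())         # about 16.7 -> 16
    area = round(area_km2(radius), -2)       # about 603 -> 600
    density = round(density_per_km2(), -3)   # about 18,182 -> 18,000
    return area * density

if __name__ == "__main__":
    print(f"{estimate_population():,.0f}")
```

Because each subgoal is a separate function, a subproblem such as the area estimate can be reused or refined independently, which is the advantage of divide-and-conquer noted in the text.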
2.5.3.5
Dimensional Analysis
Dimensional analysis finds the relationship between physical quantities by considering dimensionless groups (or dimensionless numbers) for multiple physical quantities as follows.
• Make a list of related physical quantities.
• Create independent dimensionless groups.
• Reduce possibilities (candidates) with field knowledge.
Case: What is the relationship among train speed, train acceleration, and circular track radius?
We consider the speed and acceleration of a train running on a circular track and the radius of the track as physical quantities (variables). Also, we consider L (length), T (time), and M (mass) as dimensions. We express the three variables in the above units (dimensions) as follows.
• Acceleration a: $L\,T^{-2} M^{0}$.
• Speed v: $L\,T^{-1} M^{0}$.
• Radius r: $L\,M^{0}$.
Next, we create a dimensionless group ($L^{0} T^{0} M^{0}$) by combining these. However, M does not appear in this example (i.e., $M^{0}$). Therefore, M will be omitted in the following consideration.
• Dimensionless group: $a^{x} v^{y} r^{z}$.
• $(L\,T^{-2})^{x} (L\,T^{-1})^{y} (L)^{z} = L^{0} T^{0}$.
Here, the exponents x, y, and z are rational numbers, but they can be viewed as integers by appropriate transformation. Each corresponding exponent of the left and right sides must match, which gives x + y + z = 0 for L and −2x − y = 0 for T. Solving these simultaneous equations for the three variables (y = −2x and z = x, with x = 1) yields the following dimensionless group.
$a r / v^{2}$ = dimensionless group.
The Buckingham Pi theorem assures that we have only to consider this dimensionless group. By rewriting this, the following relational expression can be obtained.
$a \sim v^{2} / r$.
If the acceleration is known, the centrifugal force can be calculated. If an appropriate inclination called cant is given to the circular track based on the centrifugal force and gravity, a more comfortable ride can be expected, compared to the case where the track is horizontal (see Fig. 2.15).
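The exponent matching in this case can be carried out mechanically. The sketch below is specific to this example rather than a general solver: it assumes, as here, that the radius has no time dimension, so the T equation fixes y and the L equation then fixes z once x is chosen. The names `DIMENSIONS`, `dimensionless_exponents`, and `total_dimension` are hypothetical.

```python
from fractions import Fraction

# (exponent of L, exponent of T) for each quantity; M is omitted as in the text.
DIMENSIONS = {"a": (1, -2), "v": (1, -1), "r": (1, 0)}

def dimensionless_exponents(x=Fraction(1)):
    """Return (x, y, z) such that a^x * v^y * r^z is dimensionless."""
    (la, ta), (lv, tv), (lr, tr) = (DIMENSIONS[k] for k in ("a", "v", "r"))
    if tr != 0 or tv == 0 or lr == 0:
        raise ValueError("this sketch assumes r carries only a length dimension")
    # T exponents must cancel: ta*x + tv*y = 0.
    y = -Fraction(ta) * x / tv
    # L exponents must cancel: la*x + lv*y + lr*z = 0.
    z = -(la * x + lv * y) / lr
    return x, y, z

def total_dimension(exponents):
    """Net (L, T) exponents of a^x * v^y * r^z; (0, 0) means dimensionless."""
    total_l = sum(DIMENSIONS[k][0] * e for k, e in zip("avr", exponents))
    total_t = sum(DIMENSIONS[k][1] * e for k, e in zip("avr", exponents))
    return total_l, total_t

if __name__ == "__main__":
    x, y, z = dimensionless_exponents()
    print(f"a^{x} v^{y} r^{z}")  # corresponds to a r / v^2
```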
2.5.3.6
Proportional Reasoning
In proportional reasoning, an unknown value is found using its ratio to a known value. Further, by focusing on a ratio, common details can be omitted.
Case: What is the surface temperature of a planet in the solar system?
The surface temperature of a planet can be calculated from the balance between the solar energy arriving from the sun and the energy radiated from the planet. In other words, we pay attention to the invariant in this case. If energy S per unit area reaches a planet from the sun, the reflectance of the planet is α (Bond albedo), and the radius of the planet is R, the total energy absorbed by the planet can be written as follows.
$(1 - \alpha)\pi R^{2} S$.
On the other hand, it is known that the energy radiated from the entire surface of a planet can be expressed by using the Stefan–Boltzmann law (Mahajan 2014) as follows.
Fig. 2.15 Rail cant (labels: centrifugal force, gravity, cant)
$4\pi R^{2} \sigma T^{4}$.
Here, σ is the Stefan–Boltzmann constant and T is the surface temperature called the radiative equilibrium temperature. Because they are in balance, we obtain T as follows.
$$T = \left( \frac{S(1 - \alpha)}{4\sigma} \right)^{1/4} .$$
Next, let $d_p$ be the distance between the sun and the planet expressed in astronomical units (i.e., in units of the distance $d_e$ between the earth and the sun). The value of S is inversely proportional to the square of $d_p$. Therefore, using the value $S_0$ at the distance of the earth from the sun ($d_p = 1$), the value of S can be expressed as follows.
$S = S_0 / d_p^{2}$.
Assuming that the surface temperature of the earth $T_e$ is known, let us estimate the surface temperature of the planet $T_p$ by calculating the ratio of $T_p$ to $T_e$. For the sake of simplicity, let us assume that the reflectance α of all planets is equal. Since T is proportional to $S^{1/4}$ and S is proportional to $d_p^{-2}$, the constants α and σ cancel in the ratio. Using only the surface temperature $T_e$ of the earth and the distance $d_p$ between the sun and the planet (in units of $d_e$), the surface temperature $T_p$ of the planet can be expressed as follows.
$T_p = T_e / d_p^{1/2}$.
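The ratio argument can be illustrated with a short sketch. It assumes temperatures in kelvin and distances in astronomical units; the function names are hypothetical, and the flux and albedo passed to `equilibrium_temperature` in the demonstration are arbitrary placeholder values used only to show that they cancel in the ratio. As an example, it uses an earth equilibrium temperature of 255 K and a Mars distance of 1.52 astronomical units.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant in W m^-2 K^-4

def equilibrium_temperature(flux: float, albedo: float) -> float:
    """Radiative equilibrium temperature T = (S(1 - alpha) / (4 sigma))^(1/4)."""
    return (flux * (1.0 - albedo) / (4.0 * SIGMA)) ** 0.25

def scaled_temperature(t_earth: float, distance_au: float) -> float:
    """Proportional reasoning: T_p = T_e / sqrt(d_p), with d_p in AU."""
    if distance_au <= 0:
        raise ValueError("distance must be positive")
    return t_earth / distance_au ** 0.5

if __name__ == "__main__":
    # Estimate for Mars from the earth's value alone.
    print(round(scaled_temperature(255.0, 1.52)))

    # Placeholder flux and albedo: the ratio does not depend on them.
    s0, alpha, d = 1000.0, 0.3, 1.52
    ratio = equilibrium_temperature(s0 / d ** 2, alpha) / equilibrium_temperature(s0, alpha)
    print(ratio, d ** -0.5)
```

The second part of the demonstration shows why focusing on the ratio lets the shared details (albedo, the Stefan–Boltzmann constant, and the absolute flux) be omitted.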
Mars has become increasingly important in recent years (refer to Box “Mars Exploration”). By substituting 255 K (radiative equilibrium temperature) for $T_e$ and 1.52 for the distance $d_p$ between Mars (p) and the sun, the surface temperature $T_p$ of Mars is estimated to be 207 K. The radiative equilibrium temperature of Mars obtained independently is 210 K (Méndez and Rivera–Valentín 2017).
Box: Mars Exploration
As for Mars exploration, the Mars Sample Return (MSR) mission is a joint campaign by the National Aeronautics and Space Administration (NASA) and the European Space Agency (ESA) (NASA 2022). The MMX (Martian Moons eXploration) mission is also planned by the Japan Aerospace Exploration Agency (JAXA) (JAXA 2022). In the future, these missions will collect rock and dust samples from Mars and its satellites, respectively, and return them to earth for analysis (see Fig. 2.16). They will confirm the existence of water and organic matter and will clarify the evolutionary process of Mars and its satellites. □
Fig. 2.16 Martian Moons eXploration (MMX), Courtesy of JAXA
2.5.3.7
Probabilistic Reasoning
Here, Bayesian inference is briefly explained as probabilistic reasoning. Suppose that a hypothesis and data are given as follows.
• H: Hypothesis.
• E: Data (or evidence).
The conditional probability p(H | E) is expressed as follows (Bayes’ theorem).
$$p(H \mid E) = \frac{p(H) \times p(E \mid H)}{p(E)} .$$
Furthermore, since p(E) is a constant that does not depend on H, this can be rewritten as follows.
Posterior probability (p(H | E)) ∝ Prior probability (p(H)) × Likelihood (p(E | H)).
In the above formula, the likelihood is interpreted as the explanatory power of the hypothesis. If we decide to adopt the hypothesis that maximizes the posterior probability among multiple competing hypotheses, this is a kind of plausible reasoning. Now, we define the odds with respect to H and ¬H (i.e., not H) as follows.
$$\text{Odds}(H) = \frac{p(H)}{p(\neg H)} .$$
Furthermore, the likelihood ratio is defined as follows.
$$\text{Likelihood ratio} = \frac{p(E \mid H)}{p(E \mid \neg H)} .$$
Thus, the likelihood ratio measures the relative explanatory power of H and ¬H. By using these, the posterior odds are expressed as follows.
$$\text{Posterior odds}(H \mid E) = \text{Prior odds}(\text{Odds}(H)) \times \text{Likelihood ratio}\left( \frac{p(E \mid H)}{p(E \mid \neg H)} \right) .$$
In the conventional concept of probability, it is interpreted that a probability approaches the true probability by repeated trials. It is, so to speak, a probability based on frequentism. On the other hand, Bayesian inference is based on the likelihood as to a hypothesis. The probability in this case is interpreted as a probability based on subjectivism.
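The relationship between Bayes' theorem and the odds form can be checked with a few lines of code. This is a minimal sketch for a single hypothesis H and its negation; the function names are hypothetical and the probabilities used in the demonstration are made-up illustrative values, not data from the text.

```python
def posterior(prior: float, like_h: float, like_not_h: float) -> float:
    """p(H|E) by Bayes' theorem, with p(E) expanded over H and not-H."""
    evidence = prior * like_h + (1.0 - prior) * like_not_h  # p(E)
    return prior * like_h / evidence

def odds(p: float) -> float:
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)

def posterior_odds(prior: float, like_h: float, like_not_h: float) -> float:
    """Posterior odds = prior odds x likelihood ratio."""
    return odds(prior) * (like_h / like_not_h)

if __name__ == "__main__":
    # Illustrative values: p(H) = 0.2, p(E|H) = 0.9, p(E|not H) = 0.3.
    p_h, l_h, l_not_h = 0.2, 0.9, 0.3
    print(posterior(p_h, l_h, l_not_h))        # posterior probability
    print(odds(posterior(p_h, l_h, l_not_h)))  # odds of that posterior
    print(posterior_odds(p_h, l_h, l_not_h))   # same value via the odds form
```

The last two printed values agree, which shows that the odds form is only a rearrangement of Bayes' theorem that avoids computing p(E) explicitly.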
2.5.4 Unconscious Work
At the end of this chapter, we would like to point out that unconscious work is important in problem solving. The role of the unconscious is a somewhat classical theme of psychology, and the British sociologist and psychologist Graham Wallas (1858–1932) advocated the four stages of the creative process by emphasizing unconscious work (Sadler–Smith 2015). Since hypothesis generation is typical creative work, Wallas’s four stages of the creative process are helpful. Wallas constructed the following model, based on the experiences of the German physicist Hermann Ludwig Ferdinand von Helmholtz (1821–1894) and the French mathematician Jules–Henri Poincaré (1854–1912).
(Wallas’s four-stage model of creative process)
1. Preparation: Do initial conscious work on a problem. Immerse yourself in the problem and try to solve the problem.
2. Incubation: Stop the conscious work. However, unconscious work continues to explore new and promising combinations of possibilities.
3. Illumination: Wait until a hypothesis emerges from the unconscious work.
4. Verification: Verify an appropriate and plausible hypothesis.
References
Aufmann RN, Lockwood JS et al (2018) Mathematical excursions. CENGAGE
Bortolotti L (2008) An introduction to the philosophy of science. Polity
Cairo A (2016) The truthful art: data, charts, and maps for communication. New Riders
Cellucci C (2017) Rethinking knowledge: the heuristic view. Springer
Chang M (2014) Principles of scientific methods. CRC Press
Crease RP (2010) The great equations: breakthroughs in science from Pythagoras to Heisenberg. W. W. Norton & Company
Danks D, Ippoliti E (eds) Building theories: heuristics and hypotheses in sciences. Springer
Diggle PJ, Chetwynd AG (2011) Statistics and scientific method: an introduction for students and researchers. Oxford University Press
DOAJ (2022) Directory of open access journals. https://doaj.org/. Accessed 2022
Gilchrist P, Wheaton B (2011) Lifestyle sport, public policy and youth engagement: examining the emergence of Parkour. Int J Sport Policy Polit 3(1):109–131. https://doi.org/10.1080/19406940.2010.547866
Google Maps. https://www.google.com/maps. Accessed 2022
Ishikawa H (2015) Social big data mining. CRC Press
Järvinen P (2008) Mapping research questions to research methods. In: Avison D, Kasper GM, Pernici B, Ramos I, Roode D (eds) Advances in information systems research, education and practice. Proceedings of IFIP 20th world computer congress, TC 8, information systems, vol 274. Springer. https://doi.org/10.1007/978-0-387-09682-7-9_3
JAXA (2022) Martian moons eXploration. http://www.mmx.jaxa.jp/en/. Accessed 2022
Lewton T (2020) How the bits of quantum gravity can buzz. Quanta Magazine. https://www.quantamagazine.org/gravitons-revealed-in-the-noise-of-gravitational-waves-20200723/. Accessed 2022
Mahajan S (2014) The art of insight in science and engineering: mastering complexity. The MIT Press
Méndez A, Rivera-Valentín EG (2017) The equilibrium temperature of planets in elliptical orbits. Astrophys J Lett 837(1)
NASA (2022) Mars sample return. https://www.jpl.nasa.gov/missions/mars-sample-return-msr. Accessed 2022
OpenStreetMap (2022). https://www.openstreetmap.org. Accessed 2022
Pólya G (2009) Mathematics and plausible reasoning: vol I: induction and analogy in mathematics. Ishi Press
Pólya G, Conway JH (2014) How to solve it. Princeton University Press
Rehm J (2019) The four fundamental forces of nature. Live Science. https://www.livescience.com/the-fundamental-forces-of-nature.html
Sadler-Smith E (2015) Wallas’ four-stage model of the creative process: more than meets the eye? Creat Res J 27(4):342–352. https://doi.org/10.1080/10400419.2015.1087277
Siegel E (2018) This is why physicists think string theory might be our ‘theory of everything.’ Forbes. https://www.forbes.com/sites/startswithabang/2018/05/31/this-is-why-physicists-think-string-theory-might-be-our-theory-of-everything/?sh=b01d79758c25
Zeitz P (2006) The art and craft of problem solving. Wiley
Chapter 3
Science and Hypothesis
3.1 Kepler Solving Problems
3.1.1 Brahe’s Data
Johannes Kepler (1571–1630) was a German who started from astrology and aimed at precise astronomy (see Fig. 3.1a). Based on the observation data of planets recorded by his master, the Danish nobleman Tyge (Latinized as Tycho) Ottesen Brahe (1546–1601) (see Fig. 3.1b), Kepler derived the following three laws, collectively called Kepler’s laws of planetary motion.
• First law: A planet moves in an elliptical orbit with the sun as one focal point (see Fig. 3.2a).
• Second law: The areal velocity of a planet (the area swept per unit time by the vector from the sun to the planet) is constant (see Fig. 3.2b).
• Third law: The square of the orbital period of a planet is proportional to the cube of the size (semi-major axis) of the elliptical orbit (see Fig. 3.3).
The important point here is that the data that Kepler used to generate the hypotheses already existed. It can be said that Kepler performed plausible reasoning (IBE) in that he selected the hypotheses that could best explain the observational data. In particular, the first law could better explain the fact (or problem), evident in Tycho Brahe’s observational data, that “the orbits of planets including Mars do not form strict circles”.
Here, we will first explain how Kepler created the orbital data of the planets as the premise of Kepler’s laws, based on Brahe’s observation data (azimuths of the sun and planets when viewed from the earth). Brahe converted the values observed in the equatorial coordinate system (the coordinate system based on the equatorial plane of the earth) into the ecliptic coordinate system (the coordinate system based on the orbital plane of the earth) and recorded them as orbital data.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Ishikawa, Hypothesis Generation and Interpretation, Studies in Big Data 139, https://doi.org/10.1007/978-3-031-43540-9_3
Fig. 3.1 a Johannes Kepler. b Tycho Brahe (Courtesy of the Smithsonian Libraries and Archives, released under a CC0 license). (a) https://library.si.edu/image-gallery/72832 (b) https://library.si.edu/image-gallery/68473
Fig. 3.2 a Kepler’s first law (ellipse with semi-major axis a, semi-minor axis b, and eccentricity c; the sun at one focus; the planet at (x, y)). b Kepler’s second law (areal velocity = constant)
This task of orbital data creation (let us call it Task 1) is the preprocessing leading to the discovery of the law (let us call it Task 2). Task 1 corresponds to the procedure of data management (data engineering) in pioneering big data applications.
Fig. 3.3 Kepler’s third law (log-log plot of the orbital period T in years against the semi-major axis a in astronomical units for Mercury, Venus, Earth, Mars, Ceres, Jupiter, Saturn, Uranus, Neptune, and Pluto; $T^2 \sim a^3$)
For the procedure part of the following explanation, we referred to “What is physics?” (Tomonaga 1979) by Shinichiro Tomonaga (1906–1979), who won the Nobel Prize in Physics. Although this book is available only in Japanese, we believe that it is one of the masterpieces of physics writing for the general public.
3.1.2 Obtaining Orbit Data from Observation Data (Task 1)
3.1.2.1 Solar System Planet Race
An event in which the sun, the earth, and Mars are aligned (a phenomenon called opposition) is the starting line of the solar system planet race. Here, we assume for simplicity that the orbital plane of Mars and that of the earth coincide. By focusing on the triangle formed by the three celestial bodies at every orbital period of Mars $T_m$ (687 days), the time point (date and time) at which and the place where the earth passes in its orbit are determined. The orbital period $T_m$ of Mars can be calculated from the following formula based on the known orbital period of the earth (365 days) and the known period of opposition between the earth and Mars (780 days).
\[ \left( \frac{1}{365} - \frac{1}{T_m} \right) \times 780 = 1. \]
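The formula above can be solved for $T_m$ directly. The following is a minimal sketch, not part of the original procedure; the function name `sidereal_period_outer` is our own. With the rounded inputs of 365 and 780 days it gives roughly 686 days, close to the 687 days quoted in the text.

```python
def sidereal_period_outer(earth_period: float, opposition_period: float) -> float:
    """Solve (1/earth_period - 1/T) * opposition_period = 1 for T.

    The name and signature are illustrative assumptions, not from the source.
    """
    return 1.0 / (1.0 / earth_period - 1.0 / opposition_period)


# Rounded values used in the text (days).
T_m = sidereal_period_outer(365.0, 780.0)
print(f"Orbital period of Mars: about {T_m:.0f} days")
```

More precise inputs would bring the result closer to 687 days; the rounded values are kept here to match the formula in the text.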
The solar system planet race has finally started. Consider the positions of the Sun (S), the Earth (E), and Mars (M) (see Fig. 3.4). The angle $\theta_1$ between the edge Sun–Mars, whose length is $SM$, and the edge Sun–Earth, whose length is $SE_1$, and the angle $\phi_1$ between the edge $SE_1$ and the edge Earth–Mars, whose length is $E_1M$, can be found from daily observations. The shape of the triangle is then determined, and $SE_1$ can be obtained with $SM$ as a unit as follows (from the law of sines).
\[ SE_i = d_i \times SM. \]
Here, $d_i$ is a relative distance.
In fact, Brahe built large-scale astronomical devices such as quadrants, sextants, and armillary spheres to measure the azimuth and altitude of celestial bodies or their equatorial coordinates (see Fig. 3.5). Brahe converted these observations into ecliptic coordinates using trigonometric functions and recorded them in a star catalog. In the time of Brahe, absolute distances to celestial bodies were not known. Instead, distances were calculated relatively, as ratios.
Fig. 3.4 Solar system planet race
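The law-of-sines step can be made explicit. In the triangle Sun–Earth–Mars, the angle at Mars is $180^\circ - \theta - \phi$, so $SE / \sin(\theta + \phi) = SM / \sin\phi$ and the relative distance is $d = \sin(\theta + \phi) / \sin\phi$. The sketch below assumes this reading of the two angles; the function name and the sample angles are hypothetical, not Brahe’s measurements.

```python
import math


def relative_sun_earth_distance(theta_deg: float, phi_deg: float) -> float:
    """Return d = SE / SM for the triangle Sun-Earth-Mars.

    theta_deg: angle at the Sun between the edges Sun-Mars and Sun-Earth.
    phi_deg:   angle at the Earth between the edges Earth-Sun and Earth-Mars.
    """
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    # Angle at Mars is 180 - theta - phi, and sin(180 - x) = sin(x).
    # Law of sines: SE / sin(angle at Mars) = SM / sin(angle at Earth).
    return math.sin(theta + phi) / math.sin(phi)


# Hypothetical angles, chosen only to illustrate the calculation.
d = relative_sun_earth_distance(30.0, 90.0)
print(f"SE = {d:.3f} x SM")
```

With a right angle at the Earth, the result reduces to $\cos\theta$, which gives a quick sanity check of the formula.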
Fig. 3.5 Brahe’s astronomical observation. a Wall quadrant for culmination altitude measurement. b Large armillary sphere (diameter: 2.9 m) for right ascension and declination measurement (Brahe 1602, Courtesy of Smithsonian Libraries, released under a CC0 1.0 license) https://doi.org/10.5479/sil.77076.39088002053528
3.1.2.2 Time Machine
Now we convert future time points to past time points. While Mars completes 1, 2, 3, … revolutions, the earth completes 1, 3, 5, … full revolutions plus 322, 279, 236, … remaining days. That is, the time points are described as follows.
\[ 687 = 365 + 322, \]
\[ 687 \times 2 = 365 \times 3 + 279, \]
\[ 687 \times 3 = 365 \times 5 + 236, \]
\[ \cdots \]
Here, we pay attention to the fact that the earth passes through the same place in its orbit at the same date and time within the orbital period (i.e., an invariant property).
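The decomposition above is integer division with remainder. As a small sketch (the constant and function names are ours, and the periods are the rounded values used in the text), Python’s built-in `divmod` reproduces the three lines:

```python
EARTH_YEAR = 365  # days, rounded as in the text
MARS_YEAR = 687   # days, rounded as in the text


def earth_revolutions(mars_revolutions: int) -> tuple[int, int]:
    """Return (full Earth revolutions, remaining days) after the given Mars revolutions."""
    return divmod(MARS_YEAR * mars_revolutions, EARTH_YEAR)


for k in (1, 2, 3):
    full, rest = earth_revolutions(k)
    print(f"{MARS_YEAR} x {k} = {EARTH_YEAR} x {full} + {rest}")
```

The remainder is the day within the earth’s orbital period, which is the quantity the invariant property relies on.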
Therefore, the day determined as above is converted to the day within the orbital period as follows.
\[ \mathrm{mod}(\mathit{date}, 365). \]
That is, the remainder is taken after dividing the date by the orbital period of the earth (365 days). In this way, the orbital data of the earth at a time point (day) within the orbital period can be obtained as polar coordinates (angle and relative distance). Therefore, the revolution speed of the earth can also be known.
Now, let us find the orbital data of Mars. With the time of the next opposition (780 days later) as the starting line, the planet race is held again. Using the Sun–Mars distance $SM_1$ at this time as a unit, the orbital data of the earth are determined in exactly the same way as above. However, $SM_1$ is different from the previous $SM$ because the starting line is generally at a different position in the orbit of Mars. The same thing is done for the next opposition, and the orbital data of the earth are obtained in units of the Sun–Mars distance $SM_2$ at that time. Similarly, the distance $SE_{ij}$ between the sun and the earth can be expressed for each starting line $j$ as follows.
\[ SE_{i1} = d_{i1} \times SM_1, \quad SE_{i2} = d_{i2} \times SM_2, \quad \ldots, \quad SE_{ij} = d_{ij} \times SM_j. \]
Here, regarding the orbit of the earth, its position at the same time point $t$ within its orbital period is of course the same at any revolution (invariant). Let $t = \mathrm{mod}(T_j, 365)$; then $SM_1$, $SM_2$, … can be calculated with $SM$ as a unit.
\[ SE_t = d_{T_j} \times SM_j = d_t \times SM. \]
Next, the time point of the starting line is calculated by $\mathrm{mod}(\mathit{date}, 687)$, and the orbital data of Mars are converted to the data at the time point within the Mars orbital period. Also, the azimuth of Mars at the starting line can be obtained by observation, similar to that of the earth. In this way, the orbital data of Mars are obtained.
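The rescaling step uses the invariance of the earth’s position: the same Sun–Earth distance is measured once in units of $SM$ and once in units of $SM_j$, so $SM_j = (d_t / d_{T_j}) \times SM$. The following sketch only illustrates this arithmetic; the function name and the numerical values are hypothetical.

```python
def sun_mars_distance_in_base_unit(d_t: float, d_Tj: float) -> float:
    """Return SM_j expressed in units of SM.

    d_t:  Sun-Earth distance at day t, measured in units of SM.
    d_Tj: the same Sun-Earth distance, measured in units of SM_j.
    From SE_t = d_Tj * SM_j = d_t * SM it follows that SM_j / SM = d_t / d_Tj.
    """
    return d_t / d_Tj


# Hypothetical relative distances for the same day t of the Earth's year.
ratio = sun_mars_distance_in_base_unit(d_t=0.66, d_Tj=0.60)
print(f"SM_j = {ratio:.2f} x SM")

# Day of the starting line within the Mars orbital period, as in mod(date, 687).
day_in_mars_year = 780 % 687
print(f"Second starting line falls on day {day_in_mars_year} of the Mars period")
```

Repeating this for every opposition places all Sun–Mars distances on one common scale, which is what makes the Mars orbit comparable across starting lines.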
Task 1 is a typical case of serial analysis using big data in that the earth orbit was determined first and then the Mars orbit was determined based on the determined earth orbit. It can be said to be an example of extremely skillful problem solving.
3.1.3 Deriving Kepler’s First Law (Task 2)
Based on the orbital data of Mars and other planets calculated in this way, Kepler derived the first law. Kepler carried out the task in the following way, although the road to the discovery was not smooth. From the orbital data of the earth, it was found that the orbit of the earth is almost a circle and that the sun is offset from the center of the circle by 0.018 times the radius.
Kepler also found that, at the perihelion and aphelion, the distance is inversely proportional to the velocity (more accurately, angular velocity × distance). This pointed Kepler toward the generalization that the relation may hold at any position in the orbit, not only at the perihelion and aphelion. It can be rephrased as a constant areal velocity.
Furthermore, from the orbital data of Mars, Kepler knew that its orbit is clearly not a circle. His first trial, fitting the data to an egg-shaped orbit, failed. He then applied the conic sections (i.e., circle, ellipse, parabola, and hyperbola) known since ancient Greece (see Fig. 3.6). As a result, he found that the orbit of Mars is elliptical and that the sun is located at one of its focal points. He also found that the property of constant areal velocity is satisfied.
From the above, Kepler discovered the laws in the following chronological order.
• First, he reached the second law: The areal velocity of any planet is constant.
• Next, he reached the first law: Any planet follows an elliptical orbit with the sun as one focal point (Kepler 1609, Astronomia Nova or New Astronomy).
In addition, he explored the relationship between the period and the semi-major axis of the ellipse, which he had been interested in for some time.
Fig. 3.6 Conic section. a Circle. b Ellipse. c Parabola. d Hyperbola
• Finally, Kepler reached the third law: The square of the orbital period of a planet is proportional to the cube of the semi-major axis of the elliptical orbit (Kepler 1619, Harmonices Mundi or Harmony of the World).
This task (Task 2) corresponds to data analysis in the pioneering big data application.
What is important here is that the observation data of Tycho Brahe, which Kepler used as the basis for hypothesis generation, were extremely accurate for their time. It is said that angles could be measured accurately to about 2 arcminutes. Brahe’s data were taken over by Kepler and compiled as the Rudolphine Tables (a collection of astronomical observation data compiled under the name of the Holy Roman Emperor Rudolf II). Kepler had unwavering confidence in the accuracy of the data Brahe had left behind. He was also convinced that there was a non-negligible difference (deviation) between the results of his own precise calculations based on Brahe’s data and the dogmatic hypotheses based on uniform circular motion that prevailed in his time. It can be said that these grounded beliefs enabled Kepler to accomplish his great astronomical achievements.
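The third law stated above can be checked numerically: with the period $T$ in years and the semi-major axis $a$ in astronomical units, $T^2 / a^3$ should be close to 1 for every planet. The sketch below uses commonly quoted, rounded modern values as an assumption; they are not Brahe’s or Kepler’s figures, and the names are ours.

```python
# (semi-major axis in AU, orbital period in years); rounded modern values,
# used here only as illustrative assumptions.
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def kepler_ratio(a_au: float, period_years: float) -> float:
    """Return T^2 / a^3, which the third law predicts to be constant."""
    return period_years ** 2 / a_au ** 3


ratios = {name: kepler_ratio(a, T) for name, (a, T) in PLANETS.items()}
for name, ratio in ratios.items():
    print(f"{name:8s} T^2/a^3 = {ratio:.4f}")
```

On a log-log plot such as Fig. 3.3, the same relation appears as a straight line of slope 3/2.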
3.2 Galileo Conducting Experiments
The Italian astronomer and physicist Galileo Galilei (1564–1642), who was active at the same time as Kepler, showed the importance of experimentation and observation in hypothesis generation (see Fig. 3.7). Galileo described experiments and reasoning about the motion of objects in his books Dialogue Concerning the Two Chief World Systems (1632) and Discourses and Mathematical Demonstrations Relating to Two New Sciences (1638). Galileo was a pioneer who showed that experiments are important for the generation of hypotheses.
Galileo wrote that “the book of nature is written in the language of mathematics”. Galileo described the laws of nature mathematically and experimentally tested the conclusions drawn deductively from them. In other words, Galileo practiced hypothetical deduction with an emphasis on experimental verification.
Galileo also built telescopes and was among the first to observe celestial bodies with them. In addition to telescopes, he was known for engineering military compasses and thermoscopes. His observations revealed that the moon has a terrain similar to that of the earth. He also discovered that Jupiter has natural satellites (or Jovian moons).
3.2.1 Galileo’s Law of Free Fall
Galileo experimented with slopes of different angles, showing that the distance a sphere rolls down the slope is proportional to the square of time. He also found that the speed is proportional to the square root of the distance. From this, it can be said that speed is proportional to time. This has been generalized to hold even when an object drops vertically (Galileo’s law of free fall).
He also found that a parabola is obtained as the trajectory of an object launched horizontally from a high place by combining the horizontal position and the vertical position (see Fig. 3.8a). In addition, he made a small crossbow (see Fig. 3.8b) dedicated to projection. By projecting objects at various angles with such a crossbow, he confirmed that the trajectory would be a parabola.
Fig. 3.7 Galileo Galilei (Courtesy of the Smithsonian Libraries and Archives, released under a CC0 1.0 license) https://library.si.edu/image-gallery/99826
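The chain of proportionalities in this section can be checked with a few lines of arithmetic. The sketch below assumes the modern forms $s = \tfrac{1}{2}at^2$ and $v = at$ for motion from rest (the constant factor does not affect the proportionalities); the function names and the numerical values are our own illustrative choices.

```python
import math


def fall_distance(a: float, t: float) -> float:
    """Distance covered from rest after time t under constant acceleration a."""
    return 0.5 * a * t ** 2


def fall_speed(a: float, t: float) -> float:
    """Speed reached from rest after time t under constant acceleration a."""
    return a * t


def projectile_height(h0: float, v0: float, a: float, x: float) -> float:
    """Height of an object launched horizontally with speed v0 from height h0.

    Horizontal motion is uniform (x = v0 * t) and vertical motion is free fall,
    so eliminating t gives a parabola in x.
    """
    t = x / v0
    return h0 - fall_distance(a, t)


A = 9.8  # assumed acceleration in m/s^2
for t in (1.0, 2.0, 3.0):
    s = fall_distance(A, t)
    v = fall_speed(A, t)
    # All three ratios stay constant as t changes.
    print(f"t={t}: s/t^2={s / t ** 2:.2f}, v/sqrt(s)={v / math.sqrt(s):.3f}, v/t={v / t:.2f}")
```

Doubling the horizontal position in `projectile_height` quadruples the drop below `h0`, which is the defining property of the parabolic trajectory.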
3.2.2 Thought Experiments
Furthermore, Galileo conducted so-called thought experiments, in which conclusions are drawn from principles as premises in the mind alone, without actual experiments. From Aristotle’s time to Galileo’s time, it had been commonly believed that heavy objects falling from high places (e.g., the tower of Pisa) reach the ground faster than light objects (see Fig. 3.9). Here, consider connecting a heavy object and a light object and dropping them together. Given the above hypothesis held since Aristotle, two different results are predicted at the same time as follows.
• Prediction 1: The total weight of the connected objects is heavier than the weight of either single object, so the connected objects drop faster than either single object.
• Prediction 2: Since the falling speed of the heavy object is high and the falling speed of the light object is low, the falling speed as a whole is between the two speeds.
Clearly, Prediction 1 and Prediction 2 are inconsistent with each other. From this, Galileo rejected the above premise and came to the conclusion that the velocity of a falling object does not depend on the weight (more accurately, mass) of the object.
Of course, Galileo conducted real experiments, too. That is, by dropping objects of different weights (i.e., those of the same size but different densities) from a tall tower, it was confirmed that there was no difference in the falling time.
A similar thought experiment can be used to confirm that Kepler’s third law holds regardless of the mass of the planet.
Fig. 3.8 a Galileo’s law of free fall ($h = h_0 + at^2$, $v = v_0 + at$; h: height, v: velocity, a: acceleration, t: time). b Crossbow
Fig. 3.9 Galileo’s thought experiment on falling bodies (premise: the common wisdom from Aristotle’s time to Galileo’s time; Prediction 1 and Prediction 2 for a light and a heavy object dropped from the Leaning Tower of Pisa contradict each other)
3.2.3 Galileo’s Law of Inertia
An experiment is performed with two slopes (S1 and S2) facing each other, as shown in Fig. 3.10. When a ball is rolled from the top of slope S1, it climbs to almost the same height on the opposite slope S2. The same thing happens when the angle of slope S2 is changed. Further, if the angle of S2 is set to 0 as an extreme case, the ball will continue to roll. Therefore, Galileo hypothesized that a ball continues to move linearly unless a force acts on it.
It was confirmed by experiment and observation that a ball launched horizontally by a crossbow continues to move horizontally at a constant velocity for as long as it continues to fall. Thus, he summarized the law of inertia (Galileo’s law of inertia) as follows.
• An object remains at rest or continues to move in a straight line at a constant velocity equal to its initial velocity unless a force acts on it.
Fig. 3.10 Galileo’s law of inertia (two facing slopes S1 and S2)
3.2.4 Galileo’s Principle of Relativity
Those who had advocated geocentric models since Aristotle (in which all celestial bodies revolve around the earth) raised the following counterargument to the heliocentric model in which Galileo believed (in which the planets revolve around the sun).
• Consider dropping an object from a high place to the ground. If the earth were moving, the object would not fall directly below. However, it does fall directly below.
Galileo responded to this argument with another thought experiment.
• If you drop a lead ball from the top of the mast of a ship, it will fall directly below.
• The result is the same even if the ship is moving at a constant speed.
In other words, to a person aboard a ship, the falling motion looks the same whether the ship is at rest or moving at a constant speed. This is called Galileo’s principle of relativity (see Fig. 3.11). The same is true when the earth is moving. However, since the earth is round, the motion is a uniform circular motion instead of a uniform linear motion.
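The ship thought experiment can be written as a small calculation in the frame of the shore. It is a sketch under simplifying assumptions (no air resistance, uniform linear motion of the ship); the function name, the flag `keeps_ship_velocity`, and the numbers are hypothetical. When the ball keeps the horizontal velocity of the ship, as the law of inertia says, it lands at the foot of the mast; the geocentrists’ expectation corresponds to the ball losing that velocity.

```python
import math


def landing_offset(ship_speed: float, mast_height: float,
                   keeps_ship_velocity: bool = True, g: float = 9.8) -> float:
    """Horizontal distance from the foot of the mast to the landing point.

    Positions are computed in the frame of the shore. A negative value means
    the ball lands behind the mast.
    """
    t_fall = math.sqrt(2.0 * mast_height / g)   # time to fall from the mast top
    mast_foot_x = ship_speed * t_fall           # ship keeps moving uniformly
    ball_speed = ship_speed if keeps_ship_velocity else 0.0
    ball_x = ball_speed * t_fall                # horizontal motion of the ball
    return ball_x - mast_foot_x


print(landing_offset(0.0, 20.0))                              # ship at rest
print(landing_offset(5.0, 20.0))                              # ship moving, inertia holds
print(landing_offset(5.0, 20.0, keeps_ship_velocity=False))   # the counterargument's expectation
```

The first two cases give the same result, which is the content of the principle: the observer on board cannot tell from the fall whether the ship is moving.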
Fig. 3.11 Galileo’s principle of relativity (uniform motion). Uniform circular motion is more correct than uniform linear motion because the earth is spherical
3.3 Newton Seeking After Universality
In astronomy, the English physicist Isaac Newton (1642–1727) took over and developed Kepler’s work (see Fig. 3.12). In his book Principia (Newton 1999), Newton used deductive reasoning to arrive at the law of universal gravitation.
3.3.1 Reasoning Rules
Before deducing the law of universal gravitation, Newton first established reasoning rules as follows (Newton 1999; Chandrasekhar 2003).
• (Rule 1) We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances.
• (Rule 2) Therefore to the same natural effects, we must, as far as possible, assign the same causes.
• (Rule 3) The qualities of bodies, which admit neither intensification nor remission of degrees, and which are found to belong to all bodies within the reach of our experiments, are to be esteemed the universal qualities of all bodies whatsoever.
• (Rule 4) In experimental philosophy, we are to look upon propositions inferred by general induction from phenomena as accurately or very nearly true, notwithstanding any contrary hypotheses that may be imagined (refer to the note below), till such time as phenomena occur, by which they may either be made more accurate, or liable to exceptions.
Of these rules, Rule 1 suggests that Newton recognized Ockham’s razor as one of the essential properties of a hypothesis. Also, Rule 3 and Rule 4 show that Newton emphasized inductive reasoning and plausible reasoning (IBE) based on phenomena and experiments in generating hypotheses.
(Note) In Rule 4, Newton regarded a hypothesis as “a proposition that was neither a phenomenon, nor inferred from a phenomenon, nor proved experimentally”.
Based on the above rules, Principia as a whole started from axiomatic concepts (definitions and axioms) and deduced multiple propositions such as Kepler’s laws, mainly by geometric methods including the concept of infinitesimals. Indeed, he partially used the method of fluxions (Katz 2009), which he had invented to consider the rate of temporal change of a variable that depends on time (i.e., a fluent), with the motion of an object in mind. However, Newton did not use his calculus explicitly in Principia.
Fig. 3.12 Sir Isaac Newton (Courtesy of the Smithsonian Libraries and Archives, released under a CC0 1.0 license). https://library.si.edu/image-gallery/74020
3.3.2 Three Laws of Motion
Newton formulated the three laws of motion as axioms as follows.
• First law (law of inertia): An object continues to be either at rest or in linear motion at a constant velocity unless acted on by a force.
• Second law: The change of an object with respect to “mass × velocity” (momentum in current terminology) is proportional to the force acting on the object, and the change is made in the direction of the force.
• Third law (law of action–reaction): For every action (i.e., force), there is an action called a reaction of the same magnitude and opposite direction. In other words, the mutual actions of two objects are always the same in magnitude and opposite in direction.
The three laws of motion are not axioms in the mathematical sense. Rather, they are axioms backed by experiments and observations, based on the achievements of Newton’s predecessors (e.g., Kepler and Galileo).
Specifically, it can be said that the first law is a modification of Galileo’s law of inertia. That is, in the case of Galileo, the premise is actually a constant velocity circular motion (i.e., uniform circular motion). In the case of Newton, the premise is a constant velocity linear motion (i.e., uniform linear motion). The association between inertia and linear motion was also noticed by the French philosopher René Descartes (1596–1650).
Fig. 3.13 Newton’s third law of motion, aka the law of action–reaction (thought experiment with an object consisting of three parts A, B, and C)
Newton also stated that the second law had been discovered by Galileo. Newton considered Galileo’s law of free fall to be a general law of motion. Furthermore, he showed that various motions can be explained in a unified manner by force as their cause.
The third law can be described in a thought-experiment style as follows. Consider an object consisting of three parts, A, B, and C (≡A) (see Fig. 3.13). Suppose that a force (gravitational force) acts between A and C. The force by which A attracts C is equivalent to the pressure by which C pushes B. The force by which C attracts A is equivalent to the pressure by which A pushes B. The two act in opposite directions. If the two forces did not balance, B, and with it the entire object, would start to move to one side in the absence of any external force, which would violate the law of inertia.
3.3.3 The Universal Law of Gravitation
Newton presupposed the phenomena known as Kepler’s laws (planetary motion) and first proved the following.
1. The moon receives a force from the earth, the satellites of Saturn and of Jupiter from their respective planets, and each planet from the sun. The force is gravity, which is inversely proportional to the square of the distance between the bodies (i.e., the inverse square law).
2. The gravity that an object receives from a planet is proportional to the mass of the object.
Furthermore, using these, Newton derived the universal law of gravitation that holds between all objects. For a systematic and detailed explanation, refer to reference books on Principia (Chandrasekhar 2003).
By the way, the above law (1) can easily be guessed by combining the law that the centripetal force is proportional to $v^2/r$ ($v$: velocity, $r$: radius) if the orbit of an object is circular (refer to Sect. 2.5.3) with the law that $T^2$ ($T$: period of revolution) is proportional to $r^3$ (Kepler’s third law).
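The guess described above can be written out as a short worked derivation. It is a sketch under the assumption of a circular orbit traversed at constant speed, so that $v = 2\pi r / T$; the proportionality constant $k$ is introduced here only for illustration.
\[
\begin{aligned}
F &\propto \frac{v^2}{r} = \frac{1}{r}\left(\frac{2\pi r}{T}\right)^2 = \frac{4\pi^2 r}{T^2}, \\
T^2 &= k\,r^3 \quad \text{(Kepler's third law)}, \\
F &\propto \frac{4\pi^2 r}{k\,r^3} = \frac{4\pi^2}{k}\cdot\frac{1}{r^2}.
\end{aligned}
\]
That is, under these assumptions the force falls off as the inverse square of the radius.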
Robert Hooke (1635–1703), an Englishman who was said to be Newton’s nemesis and who discovered that the force acting on a spring is proportional to the displacement (elongation or compression) of the spring, was also aware of the inverse square law of the centripetal force. Newton, for his part, proved geometrically that if a planet follows an elliptical orbit, the gravitational force acting on it obeys the inverse square law. Conversely, Newton proved that, for a given force and velocity, the only orbit of a planet on which the inverse square law works is an elliptical orbit, based on the assumption that the orbit is unique (except for special cases). The above law (2) can be shown by combining Newton’s second and third laws of motion.
Of course, Newton proved the universal law of gravitation by starting from the definitions (e.g., mass, momentum, and force) and the three laws of motion as axioms, and reasoning deductively step by step. In particular, it should be remembered that Rule 3 in his rules of reasoning led Newton to universal gravitation.
When asked how he came to discover the universal law of gravitation, Newton said, “I keep the subject constantly before me, and wait ’till the first dawnings open slowly, by little and little, into a full and clear light” (Crease 2010). In response to a similar question from Hooke, Newton replied, “If I have seen further, it is by standing on the shoulders of giants”. To interpret these words straightforwardly, the giants would have been Kepler and Galileo.
Newton is said to be one of the founders of calculus, a field of mathematics. In fact, Newton had a priority dispute with the German mathematician Gottfried Leibniz (1646–1716) regarding calculus. However, in Principia, Newton mainly used geometric reasoning. One theory is that this was because calculus was a newer method than geometry and geometry was more widely known in his time.
In fact, the Swiss mathematician and astronomer Leonhard Euler (1707–1783) contributed significantly to the analytical description and study of Newton’s mechanics.
By the way, in the universal law of gravitation, the inverse square law has been verified from the planet scale to several millimeters, but it has not been verified yet below 10 µm (i.e., $10 \times 10^{-6}$ m) (Murata and Tanaka 2015).
3.4 Darwin Observing Nature
3.4.1 Theory of Evolution
The British-born naturalist Charles Robert Darwin (1809–1882) knew well the efforts of humans (e.g., breeders) to breed, mutate, and select domesticated plants and animals in order to improve breeds (see Fig. 3.14). Darwin was convinced by analogy that a similar function exists in natural organisms, most species of which do not grow explosively. Darwin developed that idea and argued that mutations and natural selection in populations also play an important role in the evolution of living organisms in general (that is, today’s theory of evolution).
The theory of evolution can be summarized as follows.
• Individuals included in a species population have variations (i.e., mutations) in terms of traits.
• Some mutations are beneficial to survival and others are not. By natural selection, favorable mutations remain in the population and unfavorable mutations are excluded from the population.
• Parental mutations are inherited by the offspring and become established in the population.
This was largely due to Darwin’s own experience of noticing the existence of variants within species when he visited the Galapagos Islands as a hired naturalist on board the Royal Navy’s research vessel Beagle. There were multiple types of Galapagos finches (i.e., a type of bird), and their beaks had different morphologies adapted to the types of food (see Fig. 3.15). For example, there are relatively large finches (1) and (2) that eat nuts and seeds, those with a small beak (3) that is mainly suitable for sucking nectar, and those with an elongated beak (4) that is suitable for eating insects.
Fig. 3.14 Charles Darwin (Courtesy of Wellcome Collection, released under a CC BY 4.0 license). https://www.jstor.org/stable/community.24800310
Fig. 3.15 Galapagos finches (Darwin 1897) (Courtesy of Wellcome Collection, released under a CC BY 4.0 license). https://www.jstor.org/stable/community.24743215
3.4.2 Population Growth Model
The English economist and demographer Thomas Robert Malthus (1766–1834) had published a theory of population growth (1798), which had a large influence on Darwin. Malthus argued in his theory that the population increases geometrically (i.e., exponentially), but resources such as food increase only arithmetically (i.e., linearly) at best, so that humans eventually fall into a food shortage, which creates competitive situations. Darwin thought that such competitive situations occur not only in humans but also in other species.
The Malthusian population growth model is defined by the following difference equation. Let $p(n)$ be the population at discrete time $n$. Here, the growth rate $r$ (birth rate minus mortality rate) is constant and is called the Malthus parameter.
\[ p(n+1) = (1 + r)\,p(n). \]
The solution of this equation is easily obtained as follows, assuming that the initial value is $p(0)$.
\[ p(n) = p(0)(1 + r)^n. \]
In the process of generating the natural selection theory, Darwin took the following into consideration.
• Diverse adaptations in natural species (e.g., Galapagos finches).
• Competitive pressure on human population growth (i.e., Malthusian population growth theory).
• Artificial selection for breeding desirable plants and animals.
In short, Darwin made a kind of analogy as well as a generalization.
Later, Peter and Rosemary Grant (British-born biologists) demonstrated that natural selection by the environment really occurs in the Galapagos Islands within a period (35 years) that is relatively short from the viewpoint of evolution. In fact, from the latter half of the twentieth century to the beginning of the twenty-first century, the morphology and size of the Galapagos finches evolved with changes in the dietary environment. It was observed that these traits became established in the population through natural selection and genetic drift (that is, accidental change of gene ratio) (Losos 2017).
The population growth is modeled discretely in the above formalization, but if time $n$ and the population $p(n)$ are extended to continuous variables (i.e., $t$ and $N$), Malthusian population growth can be modeled by the following differential equation. Here, $r$ again denotes the rate of increase.
\[ \frac{1}{N}\frac{dN}{dt} = r. \]
The above equation can be rewritten as follows.
\[ \frac{dN}{dt} = rN. \]
The solution of this equation and its approximation (assuming that $r$ is small), with $N_0$ as the initial value, can be easily obtained as follows.
\[ N = N_0 \exp(rt) \approx N_0 (1 + r)^t. \]
In other words, if $r$ is small, it is substantially the same as the model obtained by the difference equation. Moreover, for large $n$ the Fibonacci sequence grows in substantially the same way.
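The claim that the discrete and continuous Malthusian models agree for small $r$ can be checked numerically. The following is a minimal sketch; the function names and the parameter values are our own choices for illustration.

```python
import math


def malthus_discrete(p0: float, r: float, n: int) -> float:
    """Solution p(n) = p(0) * (1 + r)**n of the difference equation."""
    return p0 * (1.0 + r) ** n


def malthus_continuous(n0: float, r: float, t: float) -> float:
    """Solution N(t) = N0 * exp(r * t) of the differential equation."""
    return n0 * math.exp(r * t)


def relative_gap(r: float, t: int) -> float:
    """Relative difference between the two models, starting from the same value."""
    continuous = malthus_continuous(1.0, r, t)
    return abs(continuous - malthus_discrete(1.0, r, t)) / continuous


for r in (0.01, 0.1, 0.5):
    print(f"r={r}: relative gap after 10 steps = {relative_gap(r, 10):.4f}")
```

The gap grows with $r$ because $\exp(r) \approx 1 + r$ is only a first-order approximation.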
3.4.3 Fibonacci Sequence Revisited The Fibonacci sequence that models the increase in a rabbit population can be described by the following second-order linear homogeneous difference equation. F1 = F2 = 1. Fn = Fn−1 + Fn−2 or Fn −Fn−1 −Fn−2 = 0. We assume that F n = x n and substitute this for F n in the above equation. x n −x n−1 −x n−2 = 0. x n−2 (x 2 −x−1) = 0. Here, the following is called the characteristic equation of the Fibonacci difference equation. x 2 −x−1 = 0. The solutions of the characteristic equation, called particular solutions of the difference equation, are as follows. √ 1+ 5 . α= 2√ 1− 5 . β= 2 Here, α is known as the golden ratio (refer to Box “The golden ratio of Luca Pacioli”). The general solution of the difference equation can be obtained by using the particular solutions α and β and real numbers c1 and c2 as follows. Fn = c1 α n + c2 β n . If the initial values F 1 = F 2 = 1 are used, c1 and c2 are determined, and F n becomes as follows after all. ( ( √ )n √ )n 1 1+ 5 1 1− 5 Fn = √ −√ . 2 2 5 5 It is said that this formula was first obtained by a Swiss mathematician Daniel Bernoulli (1700–1782). If n → infinity in this equation, the model is equivalent to Malthusian population growth model. The growth rate in that case is as follows.
3.4 Darwin Observing Nature
$$r = \frac{\sqrt{5} - 1}{2} = -\beta.$$
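The closed-form solution and the limiting growth rate above can be compared with the recurrence directly. This is a small sketch; the function names and the range of n tested are assumptions chosen for illustration, and the closed form is rounded because it is evaluated in floating point.

```python
import math

SQRT5 = math.sqrt(5.0)
ALPHA = (1 + SQRT5) / 2   # golden ratio
BETA = (1 - SQRT5) / 2

def fib_recurrence(n):
    """F_n from F_1 = F_2 = 1 and F_n = F_{n-1} + F_{n-2}."""
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

def fib_closed_form(n):
    """F_n from the general solution with c1 = 1/sqrt(5), c2 = -1/sqrt(5)."""
    return round((ALPHA ** n - BETA ** n) / SQRT5)

# Growth rate per step for large n: F_{n+1} / F_n - 1 approaches -BETA.
growth_rate = fib_recurrence(31) / fib_recurrence(30) - 1
```

For moderate n the two functions return the same integers, and `growth_rate` agrees with $-\beta \approx 0.618$ to many decimal places.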
Originally, Fibonacci studied mathematics with the aim of becoming a merchant and wrote a mathematics book with its commercial applications in mind. In his book, the Fibonacci sequence, which had already been known, was introduced as one of the calculation problems, in addition to perfect numbers and 4-simultaneous linear equations. Through his book, the Fibonacci sequence became popular, and finally, we found its point of contact with real biology.

Box: The Golden Ratio of Luca Pacioli
Fra Luca Bartolomeo de Pacioli (1445–1517) was an Italian mathematician who studied the golden ratio (often denoted by Φ). Pacioli is also known for his contributions to the development of double-entry bookkeeping and is often referred to as the father of modern accounting. Pacioli also had a friendship with Leonardo da Vinci (1452–1519).
By the way, Kepler noticed that the ratio $F_{n+1}/F_n$ of the Fibonacci sequence converges to the golden ratio $\Phi = \frac{1 + \sqrt{5}}{2}$ in the limit n → ∞. Since the golden ratio Φ = α, it satisfies the characteristic equation as follows.

$$\Phi^2 = \Phi + 1.$$

Therefore, the golden ratio divides a line segment of length $\Phi^2$ into Φ : 1. In other words, a line segment of length Φ is divided into 1 : 1/Φ (see Fig. 3.16a). The golden ratio is applied to various things such as historical buildings, designs, and paintings (see Fig. 3.16b). □
Fig. 3.16 a Golden ratio. b Architecture (the Parthenon, 432 BC, Courtesy of Gary Meisner). (b) https://www.goldennumber.net/parthenon-phi-golden-ratio/
3.4.4 Logistic Model
However, in reality, a population does not increase explosively as the original models predict. It is thought that as the population increases, the remaining capacity of the environment surrounding the population becomes smaller. This is not reflected in the Malthusian population growth model. The logistic equation was invented in 1838 by a Belgian mathematician Pierre François Verhulst (1804–1849), who questioned Malthus's model. An American biologist Raymond Pearl (1879–1940) popularized this equation. In this model, when a population becomes equal to the carrying capacity of an environment (let it be K), the population growth will stop. This situation can be modeled by the following differential equation, considering that the rate of increase decreases as N increases. The model is called the logistic model.

$$\frac{1}{N}\frac{dN}{dt} = r\left(1 - \frac{N}{K}\right).$$

The solution to the logistic model can be obtained as follows.

$$N = \frac{K}{1 + \left(\frac{K}{N_0} - 1\right)\exp(-rt)}.$$
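The closed-form solution above can be checked against a direct numerical integration of the logistic differential equation. The sketch below uses a plain forward Euler scheme; the step count and the sample parameter values (N0 = 10, r = 0.5, K = 1000) are assumptions made only for this illustration.

```python
import math

def logistic_closed_form(t, n0, r, k):
    """N(t) = K / (1 + (K/N0 - 1) * exp(-r t))."""
    return k / (1 + (k / n0 - 1) * math.exp(-r * t))

def logistic_euler(t_end, n0, r, k, steps=100_000):
    """Integrate dN/dt = r N (1 - N/K) from 0 to t_end with forward Euler."""
    dt = t_end / steps
    n = n0
    for _ in range(steps):
        n += r * n * (1 - n / k) * dt
    return n

# Hypothetical parameters chosen for illustration.
exact = logistic_closed_form(10.0, 10.0, 0.5, 1000.0)
approx = logistic_euler(10.0, 10.0, 0.5, 1000.0)
```

The two values agree closely, the closed form returns N0 at t = 0, and it approaches the carrying capacity K for large t.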
Generally, it is difficult to solve a differential equation, but this solution can be obtained by transforming the original differential equation (separating the variables and decomposing into partial fractions) and then integrating it as follows.

$$\frac{1}{N\left(1 - \frac{N}{K}\right)}\frac{dN}{dt} = r.$$

$$\left(\frac{1}{N} + \frac{1}{K - N}\right)\frac{dN}{dt} = r.$$

When K is larger than $N_0$, as is normally expected, N becomes an S-shaped curve. This curve is also commonly referred to as the sigmoid curve (see Fig. 3.17). The sigmoid curve appears in many places. We will witness it again in the neural networks (refer to Sect. 5.3).

Fig. 3.17 Sigmoid curve: $Y = \frac{1}{1 + \exp(-5X)}$

Here, N/K is related to the effect that population density has on the growth of a population (called the density effect). In a stable natural environment where the fluctuation of population density is not large and the competition within a species is large, the species is considered to take a strategy called the K strategy to increase the carrying capacity K in natural selection. For example, Galapagos finches take this K strategy. On the other hand, so far, r has been regarded as a natural increase rate regardless of population density. The strategy to increase r in natural selection is called the r strategy. This strategy is taken in an unstable environment where population density fluctuates greatly. For example, fish and insects that lay many eggs at once take the r strategy. The above idea is called the r-K selection theory and was proposed in 1967 by an American ecologist Robert Helmer MacArthur (1930–1972) and an American biologist E. O. Wilson (1929–2021).

Methods of hypothesis generation (i.e., reasoning and problem solving) explained so far are not always exclusive. Again, in this book, a conventional hypothesis as a test target is called a declarative hypothesis, while a hypothesis generated as a result of reasoning or problem solving is called a procedural hypothesis. Further, a new hypothesis can be created by combining multiple available hypotheses, using various methods, such as machine learning and integrated methods explained in the following chapters.
References
Brahe T (1602) Astronomiæ instauratæ mechanica
Chandrasekhar S (2003) Newton's Principia for the common reader. Clarendon Press
Crease RP (2010) The great equations: breakthroughs in science from Pythagoras to Heisenberg. W. W. Norton & Company
Darwin C (1897) Journal of researches into the natural history and geology of the countries visited during the voyage of HMS Beagle round the world, under the command of Capt. Fitz Roy, R.N. John Murray
Katz CJ (2009) A history of mathematics: an introduction, 3rd edn. Pearson
Losos JB (2017) Improbable destinies: fate, chance, and the future of evolution. Riverhead Books
Murata J, Tanaka S (2015) Review of short-range gravity experiments in the LHC era. Class Quantum Gravity 32(3):033001
Newton I (1999) The principia: mathematical principles of natural philosophy: a new translation. In: Cohen IB, Whitman A (eds) A guide to Newton's Principia. University of California Press, Berkeley
Tomonaga S (1979) What is physics?, vol 1. Iwanami Shoten, Publishers (in Japanese)
Chapter 4
Regression
4.1 Basics of Regression
In this section, we will explain regression as a method of hypothesis generation as follows.
• Gauss successfully determined the orbit of a celestial body named Ceres. This great achievement created an opportunity for Gauss's method of least squares to attract attention. We will explain how Gauss succeeded in predicting the orbit of Ceres, as an example of problem solving.
• Basically, regression creates straight lines (planes) or curves (curved surfaces) from multiple observation data, as models (i.e., hypotheses) to explain the data. We will define regression as predicting an objective variable with one or more explanatory variables.
• There is a difference between the value predicted by a model and the observed value, called the residual or error. The method of least squares is used to create a regression model by minimizing the sum of squared residuals. We will explain why we minimize the sum of squared residuals, based on Gauss's assumptions.
• We will explain the variants of regression, nonlinear regression, and sparse modeling as advanced topics.
4.1.1 Ceres Orbit Prediction
4.1.1.1 Titius–Bode Law
The sun, the moon, Venus, Mars, Mercury, Jupiter, and Saturn were already known in Babylonia around 2000 BC. However, no new planet was discovered until the latter half of the eighteenth century. Uranus was discovered as a new planet on March 13, 1781,
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Ishikawa, Hypothesis Generation and Interpretation, Studies in Big Data 139, https://doi.org/10.1007/978-3-031-43540-9_4
by a German astronomer Sir Frederick William Herschel (1738–1822), who lived in Britain. Herschel was studying the relationship between light and temperature in addition to astronomy. Herschel split the sunlight with a prism and placed a thermometer beyond the red end of visible light (i.e., in the region of invisible infrared light) and confirmed that the temperature was higher than that of red light. Through this experiment, Herschel discovered the existence of infrared radiation from the sun. He also worked as a composer and performer of church music.

Before the discovery of Uranus, the following law called the Titius–Bode law regarding the semi-major axis of the elliptical orbits of the solar system planets had been advocated. Here, the unit of numerical value is one au (astronomical unit = $1.495978707 \times 10^{11}$ m, the average distance between the earth and the sun).
• $0.4 + 0.3 \cdot 2^{-\infty}$: Mercury.
• $0.4 + 0.3 \cdot 2^{0}$: Venus.
• $0.4 + 0.3 \cdot 2^{1}$: Earth.
• $0.4 + 0.3 \cdot 2^{2}$: Mars.
• $0.4 + 0.3 \cdot 2^{4}$: Jupiter.
• $0.4 + 0.3 \cdot 2^{5}$: Saturn.
• $0.4 + 0.3 \cdot 2^{6}$: Uranus.

This can be expressed as a general formula as follows.
• $0.4 + 0.3 \cdot 2^{N}$.

Here, N = −∞, 0, 1, 2, (3 is missing!), 4, 5, 6. This existing law came to the forefront again because it was thought that the law had predicted the existence of Uranus (N = 6) before it was discovered. However, the above list lacked a planet corresponding to N = 3. Some said, "the Founder of the Universe surely had not left this place unoccupied". Therefore, the expectation that there would be a new planet corresponding to N = 3 had become widespread. Confirming the truth of this conjecture was a major astronomical challenge at that time.

An Italian astronomer Giuseppe Piazzi (1746–1826), who was looking for a new planet in the solar system, discovered a small celestial body between Mars and Jupiter on January 1, 1801. He continued to follow its orbit and record observations until February 11, when his own illness, bad weather, and the approaching sun made it unobservable. The lost celestial body was named Ceres after the Roman goddess of agriculture (refer to Box "Ceres"). At that time, it was thought that Ceres was the planet that the astronomers were searching for, but now Ceres is classified as a dwarf planet. This time, rediscovering lost Ceres by the end of the year became a major astronomical challenge.
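The Titius–Bode formula is easy to tabulate, including the unoccupied slot at N = 3 that motivated the search leading to Ceres. The following is a minimal sketch; the function name and the dictionary layout are illustrative assumptions.

```python
import math

def titius_bode(n):
    """Semi-major axis in au predicted by 0.4 + 0.3 * 2**N."""
    return 0.4 + 0.3 * 2.0 ** n

# Exponents N as listed in the text; N = 3 is the slot with no known planet.
EXPONENTS = {
    "Mercury": -math.inf,
    "Venus": 0,
    "Earth": 1,
    "Mars": 2,
    "(missing)": 3,
    "Jupiter": 4,
    "Saturn": 5,
    "Uranus": 6,
}

predicted_au = {name: titius_bode(n) for name, n in EXPONENTS.items()}
```

Under this formula the missing slot sits at about 2.8 au, between the predictions for Mars (1.6 au) and Jupiter (5.2 au).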
Box Ceres
The existence of water on Ceres has been confirmed by observation using far-infrared light by the Herschel Space Observatory launched by the European Space Agency (ESA) (see Fig. 4.1a) (Küppers et al. 2014). Bright areas on the surface of Ceres, detected by the Dawn space probe launched by the National Aeronautics and Space Administration (NASA), independently support the presence of water ice there (see Fig. 4.1b). Ceres is once again in the spotlight of modern scientists. □
Fig. 4.1 Observations of Ceres. a This graph shows variability in the intensity of the water absorption signal detected at Ceres by ESA's Herschel space observatory (Courtesy of Michael Küppers). b This map of Ceres, made from images taken by NASA's Dawn spacecraft, shows the locations of about 130 bright areas across Ceres's surface, highlighted in blue. (NASA 2022, Courtesy of NASA)
4.1.1.2 Gauss's Challenge
This task of calculating the orbit of Ceres also seemed attractive to a 24-year-old German mathematician, Johann Carl Friedrich Gauss. Not only did this challenge stimulate his intellectual curiosity, but it was commonly recognized in his time that astronomical challenges would make those who achieved them celebrities. Although it will be a little long, we will explain Gauss's challenge as an example of hypothesis generation using problem solving.

Elements that define the orbit of a planet are called orbital elements. The orbital elements consist of the following.
• ω: Argument of perihelion.
• Ω: Longitude of the ascending node (right ascension of the ascending node).
• i: Orbital inclination.
• a: Semi-major axis of the orbit.
• e: Eccentricity.
• M: Mean anomaly.

Figure 4.2a, b shows an explanation of the orbital elements of a planet and what Gauss found as the orbital elements of Ceres, respectively (Gauss 1874). It had already been known in his time that the orbital elements can be determined if two positions of the planet are given in coordinates with the sun as the origin (i.e., heliocentric coordinates). Gauss obtained two such positions based on three different observations of Ceres in coordinates with the earth as the origin (i.e., geocentric coordinates), although only longitudes and latitudes were known and no absolute distances were known. Figure 4.3 shows the relationship between the geocentric coordinate system and the heliocentric coordinate system.
Fig. 4.2 a Orbital elements. b Gauss’s computational results (Gauss 1874)
Fig. 4.3 Heliocentric and geocentric coordinate systems
Gauss proceeded with calculations based on Kepler’s laws, using geometric methods mainly based on trigonometry and using appropriate approximations for infinitesimal quantities as well. In the following, only an outline of calculations performed by Gauss will be explained. For detailed calculations, please refer to the paper by Teets and Whitehead (1999), which is relatively easy to understand, so the general flow of calculations will be explained accordingly.
4.1.1.3 Calculation of Ceres Orbit
Let the position coordinates of the earth and those of any solar system planet at time τ be (X, Y, Z) and (x, y, z), respectively. Let r, r', and r'' be the column vectors corresponding to the positions of the planet at different times τ, τ', and τ'', respectively. Further, we assume the following.

$$\| r' \times r'' \| = 2f, \qquad \| r \times r'' \| = -2f', \qquad \| r \times r' \| = 2f''.$$

Here, f is the signed area of the triangle formed by the vectors at time τ' and τ''. Also, "×" is the cross product of the vectors. The others are defined in a similar way. We define the matrix ϕ and the vector f for the planet as follows.

$$\phi = \begin{pmatrix} x & x' & x'' \\ y & y' & y'' \\ z & z' & z'' \end{pmatrix}, \qquad \mathbf{f} = \begin{pmatrix} f \\ f' \\ f'' \end{pmatrix}.$$

Now, these three points are on the elliptical orbit of the planet, that is, on the same plane. Using this fact and the areas defined above, r' can be uniquely represented using r and r''.

$$r' = -\frac{f}{f'}\, r - \frac{f''}{f'}\, r''.$$
[Quiz] Show that this equation holds. [Hint] Confirm it by applying a cross product to both sides of this equation by r from the left (or r'' from the right).
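The relation between the three coplanar position vectors can also be checked numerically. The sketch below is only an illustration under stated assumptions: the three sample vectors lie in the plane z = 0, the signed areas are measured against the normal (0, 0, 1), and all names are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    """Inner product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))

def signed_areas(r, r1, r2, normal):
    """Signed triangle areas f, f', f'' using the sign convention of the text."""
    f = dot(cross(r1, r2), normal) / 2
    f1 = -dot(cross(r, r2), normal) / 2
    f2 = dot(cross(r, r1), normal) / 2
    return f, f1, f2

# Hypothetical coplanar positions at times tau, tau', tau''.
r, r1, r2 = (2.0, 0.0, 0.0), (1.0, 1.0, 0.0), (-1.0, 2.0, 0.0)
f, f1, f2 = signed_areas(r, r1, r2, (0.0, 0.0, 1.0))

# Reconstruct r' from r and r'' via r' = -(f/f') r - (f''/f') r''.
r1_rebuilt = tuple(-(f / f1) * a - (f2 / f1) * b for a, b in zip(r, r2))
```

For these sample vectors the rebuilt vector coincides with the middle position, which is the same statement as f r + f' r' + f'' r'' = 0.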
From the above equation, the following equation holds.

$$\phi\, \mathbf{f} = 0.$$

Similarly, for the position of the earth, the matrix Φ and the vector F are defined as follows.

$$\Phi = \begin{pmatrix} X & X' & X'' \\ Y & Y' & Y'' \\ Z & Z' & Z'' \end{pmatrix}, \qquad \mathbf{F} = \begin{pmatrix} F \\ F' \\ F'' \end{pmatrix}.$$

Therefore, the following equation holds similarly.

$$\Phi\, \mathbf{F} = 0.$$

The following holds from the above two equations.

$$(F + F'')(\phi - \Phi)\,\mathbf{f} = \Phi\left((f + f'')\,\mathbf{F} - (F + F'')\,\mathbf{f}\right). \tag{4.1}$$
Furthermore, the following vector is the first column of the matrix ϕ − Φ.

$$\delta\pi = \begin{pmatrix} \xi \\ \eta \\ \zeta \end{pmatrix} = \begin{pmatrix} x - X \\ y - Y \\ z - Z \end{pmatrix}.$$

Here, π and δ are defined as follows.
• π = .
• δ = ρ cos β.

Note that π is a vector, not pi (a mathematical constant). On the other hand, DP is the first column of Φ. D and P are defined as follows.
• P = .
• D = R cos B.

Next, let us consider a matrix whose row vectors are as follows.

$$\delta''\pi'' \times \delta\pi, \qquad D'P' \times \delta\pi, \qquad D'P' \times \delta'\pi', \qquad D'P' \times \delta''\pi''.$$

Multiplying Eq. (4.1) by this matrix from the left gives the four Eqs. (4.2)–(4.5) shown in Fig. 4.4 (Gauss 1874).
Fig. 4.4 Gauss's calculation, showing Eqs. (4.2)–(4.9) (Gauss 1874)
In Fig. 4.4, [a b c] represents the determinant of a matrix consisting of column vectors a, b, and c. Of these, Eqs. (4.6) and (4.7) in Fig. 4.4 can be obtained from Eqs. (4.3) and (4.5), respectively, by approximating the right-hand side by 0 in view of the amount of change over an infinitesimal time. Furthermore, considering the following Kepler's second law (refer to Sect. 3.1), Eqs. (4.8) and (4.9) in Fig. 4.4 can be obtained.

$$\frac{g}{\tau'' - \tau'} = -\frac{g'}{\tau'' - \tau} = \frac{g''}{\tau' - \tau}.$$

Here, g is defined as follows.
• g = Area of the sector created by the vectors for time τ' and τ'' and the elliptical orbit.

Furthermore, g' and g'' are defined in a similar way. All the values other than δ' on the right side of Eqs. (4.8) and (4.9) in Fig. 4.4 can be calculated from the observed data. Thus, if δ' is known, δ and δ'' can be calculated. Therefore, finding δ' is the last remaining task.

Equation (4.10) in Fig. 4.5 is obtained by applying Kepler's laws and approximations regarding infinitesimal quantities (i.e., angle and area) to Eq. (4.2) in Fig. 4.4 (Gauss 1874). The right side of Eq. (4.10) consists only of observation data. Although further details are omitted, R'/r' can be expressed as a function of R'/δ' (Eq. (4.11) in Fig. 4.5) by using the law of cosines, so the left side of Eq. (4.10) contains only the variable R'/δ' after all. Therefore, if Eq. (4.10) is numerically solved for the variable R'/δ', δ' can be obtained for the known R' (distance between the earth and the sun). In Figs. 4.4 and 4.5, the following values were used.

$$M = \frac{2\pi(\tau - \tau_p)}{T}.$$

• T = orbital period.
• $\tau_p$ = time of perihelion passage.

Here, π is truly pi (a mathematical constant).

Fig. 4.5 Gauss's calculation (continued), showing Eqs. (4.10) and (4.11). A: Major axis of Earth's orbit

Using this δ', δ and δ'' can be obtained from the above Eqs. (4.8) and (4.9). The vectors r and r'' can be obtained by adding the coordinates of the earth at that time to the obtained πδ and π''δ'' (π and π'' can be found by observation), respectively. That is, the two positions p and p'' of the planet can be obtained. It has been remarked that the very laborious task Gauss carried out to determine the orbit of Ceres is a kind of wonder beyond the imagination of modern people living in the computer age (Teets and Whitehead 1999).

Gauss thus predicted that Ceres would reappear in December 1801, and as predicted, Ceres was rediscovered on the seventh of the month. This made Gauss's name widely known across Europe as a young genius in astronomy.

After the rediscovery, the observation data on the orbit of Ceres increased. Gauss argued that he used the method of least squares to improve the accuracy of the Ceres orbit based on the increased number of observations. However, it is not clear even among experts whether Gauss really used the method of least squares to predict the orbit of Ceres.

Independently of Gauss, a French mathematician Adrien-Marie Legendre (1752–1833) published the method of least squares in 1805, and the method became widespread in the scientific community thereafter. Gauss claimed in 1809 that he had already used the method of least squares in 1795. This controversy added an entry to the list of priority disputes regarding discoveries and inventions in the history of science.

Gauss contributed to wide areas ranging from mathematics to astronomy to physics. In recent years, Abdulle and Wanner calculated the orbit of Ceres at that time from Piazzi's observation data, using software that executes the method of least squares based on the Gauss–Newton method (Abdulle and Wanner 2002).
Thus, the Titius–Bode law prompted the discovery of Ceres, which led to the invention of the method of least squares. On the other hand, there are exceptions to the Titius–Bode law itself (e.g., Neptune and Pluto). So far, no convincing physical interpretation has been given, and it is thought that the law is just a coincidence (refer to Box "Hypothesis first").

Box Hypothesis First
In attempts to find a new planet in the solar system according to the Titius–Bode law, a hypothesis whose truth has not been determined, astronomers discovered Ceres (more accurately, a dwarf planet). Gauss calculated its orbit based on observational data, and Ceres was actually rediscovered according to his prediction. Of course, there are cases where things are discovered by chance, such as cosmic microwave background radiation (Williams 2015), while there are many cases where things cannot be found without searching for them. For example, until a fossil of Japan's first dinosaur "Moshiryu" was accidentally discovered in Moshi of Iwate Prefecture in 1978, it had been an established theory that "a fossil of a dinosaur cannot be found in Japan". Since this discovery, many dinosaur fossils have been excavated in various parts of Japan, and the dogmatic theory was strongly denied.
In December 2020, Chang'e-5, a Chinese lunar explorer, landed in the northern part of the Oceanus Procellarum region of the moon and successfully returned samples from there. From radioisotope dating of the samples, Che et al. (2021) showed that the landing area was covered by lavas about two billion years ago. This result is highly consistent with the model age previously estimated by Morota et al. (2011), based on the measurement of the crater size frequency distribution using high-resolution image data observed by SELENE (Kaguya), a Japanese lunar orbiter launched in 2007 (see Fig. 4.6).
On the other hand, the other model age estimate, based on coarse-resolution data from an older lunar orbiter, gave a much younger date of one billion years ago. This is probably because the number of identifiable craters was smaller and the error was large when the resolution of the data was coarse. This is a positive example of a case where two independent hypotheses are combined to produce a stronger hypothesis (refer to Chap. 7).
Fig. 4.6 Map of model ages of Oceanus Procellarum region of the moon. (Courtesy of Tomokatsu Morota)
As a familiar example, when we ask the students in my class at my university (Tokyo Metropolitan University Hino Campus), “Can you see Mt. Fuji from this campus?” nearly 40% of the students say, “I don’t know or I didn’t think I can see Mt. Fuji”. In fact, if you look straight southwest along the road on the north side of the campus, you can see Mt. Fuji surprisingly large and clear on a fine day. In any case, it is important to generate a hypothesis first (Weinberg 2010). □
4.1.2 Method of Least Squares
Explaining Gauss's procedure for calculating the orbit of Ceres has taken a long time. So, what was Gauss's method of least squares? Here, we return to a very simple example and consider the principle of the method of least squares (Hastie et al. 2009; Strang 2016). Let us consider fitting a straight line (e.g., a straight orbit) to three observation data points $(x_i, y_i)$ $(i = 1, 2, 3)$. That is, we consider the linear method of least squares (see Fig. 4.7).

$$y = c + dx.$$

In this case, there is no straight line passing through all three observation data points. That is, considering a system consisting of three equations with c and d as variables, there is no solution to this system. Here, we assume the following. First, there is a set of values $(x_i, \beta_i)$ on a certain straight line that satisfy the following relationship.

$$\beta_i = c + d x_i.$$

Second, the observed values $y_i$ are independent of each other and are randomly sampled with an error (i.e., difference from the predicted value) that follows a typical probability distribution. That is, the error follows a normal distribution with equal variances for all observations (see Fig. 4.8). Then, the probability that $y_i$ is observed with the accuracy of $\Delta y$ is as follows.

$$\exp\left(-\frac{(\beta_i - y_i)^2}{2\sigma^2}\right)$$

p(0 = $Th_{similar}$. A candidate trending spot w' satisfying the above condition is detected as a trending spot. All the above thresholds are determined by preliminary experiments.
6.2.7.2 Data Management and Data Analysis
This application consists of two combinations of data management and data analysis. Furthermore, it is a typical example of serial analysis (refer to Chap. 1). In the first combination, Procedures 1 and 2 correspond to data management and data analysis with respect to trending words, respectively. In the second combination, Procedures 3 and 4 correspond to data management and data analysis with respect to trending spots, respectively. The former combination produces the trending words as a hypothesis, which is fed to the latter as a premise for detecting trending spots as a final hypothesis. In that sense, the whole process is an example of serial analysis.
6.2.7.3 Experiments
The data used in the experiments of this case have the following specifications.
6.2 Difference in Time
• The data collection period is January 2018–June 2020.
• The size of Tweet data in Nagoya City, Japan (i.e., Tweets with location information) is 5,131,208.
• The size of spot data for Nagoya City, Japan is 58,831.

In this way, we detected local trending spots as shown in Table 6.3. For example, Wakeoe Shrine is popular locally because of its colorful "Goshuin" (shrine stamp).

Table 6.3 a Top five local trending spots. b Wakeoe Shrine is especially popular for its "Goshuin" (shrine stamp) (Courtesy of Wakeoe Shrine)

(a) Each "Users" column gives the total number of users who posted Tweets as to the trending POI or the trending words in one year, counted by hash or by text.

| Trending POI | Users, trending POI (hash) | Users, trending words (hash) | Users, trending POI (text) | Users, trending words (text) | Similarity (hash) | Similarity (text) | Difference |
|---|---|---|---|---|---|---|---|
| Misen Yaba shop | 6 | 2 | 15 | 9 | 0.333333333 | 0.6 | 0.266667 |
| Nagoya cuisine | 218 | 1 | 584 | 3 | 0.004587156 | 0.005136986 | 0.00055 |
| Kanayama | 79 | 4 | 4750 | 1568 | 0.050632911 | 0.330105263 | 0.279472 |
| Miwa Shrine | 19 | 7 | 68 | 59 | 0.368421053 | 0.867647059 | 0.499226 |
| Wakeoe Shrine | 10 | 4 | 42 | 21 | 0.4 | 0.5 | 0.1 |

(b) Photograph of the "Goshuin" of Wakeoe Shrine.
6 Hypothesis Generation by Difference
6.2.8 Nested Moving Averages: Case of El Niño–Southern Oscillation As another example of using moving averages, the detection of El Niño–Southern Oscillation (ENSO) is explained (NOAA 2022). First, the fact that the tropical regions of the Pacific Ocean have three different states as meteorological phenomena will be explained. Next, the method to determine ENSO based on moving averages of the sea surface temperatures will be explained.
6.2.8.1 Normal State
In the equatorial regions of the Pacific Ocean, easterly winds called the trade winds are constantly blowing. Therefore, warm seawater near the sea surface is blown to the west side of the Pacific Ocean (see Fig. 6.5a). In the western waters near Indonesia, warm seawater accumulates on the surface up to several hundred meters below sea level.

Fig. 6.5 ENSO conditions. a Normal, b El Niño, c La Niña. Each panel shows the trade winds over the Pacific Ocean between Indonesia (west) and South America (east).
Off the west coast of South America (i.e., on the eastern side of the Pacific Ocean), seawater flows in the offshore direction (western side) due to the trade winds and the effect of the earth's rotation (Coriolis force). Along with this, cold seawater wells up from the deep part toward the sea surface. As a result, the sea surface temperature is high in the western part of the equatorial Pacific region but low in the eastern part. In the western Pacific Ocean, where the sea surface temperature is high, evaporation of seawater from the sea surface becomes active, and a large amount of water vapor is supplied to the atmosphere. As a result, cumulonimbus clouds form actively in the sky.
6.2.8.2 State at the Time of the El Niño Phenomenon
When the El Niño phenomenon occurs, the trade winds are weaker than normal, and as a result, the warm seawater that has accumulated in the west spreads to the east, and the upwelling of cold water weakens in the east (see Fig. 6.5b). Therefore, from the central part to the eastern part of the equatorial Pacific region, the sea surface temperature is higher than in normal times.
6.2.8.3 State at the Time of the La Niña Phenomenon
When the La Niña phenomenon occurs, as opposed to the El Niño phenomenon, the easterly wind becomes stronger than in normal times, and as a result of being blown, warm seawater accumulates thicker in the west. Cumulonimbus clouds are actively generated in the sea near Indonesia when the La Niña phenomenon occurs. On the contrary, in the eastern part, the upwelling of cold water becomes stronger than in normal times (see Fig. 6.5c). Therefore, the sea surface temperature is lower in the central and eastern parts of the equatorial Pacific region than in normal times.
6.2.8.4 Southern Oscillation
When the atmospheric pressure is higher than normal in the eastern part of the South Pacific, the atmospheric pressure is lower than normal near Indonesia. Conversely, when the atmospheric pressure is lower than normal in the eastern part of the South Pacific, the atmospheric pressure is higher than normal near Indonesia. This fluctuation phenomenon is called the Southern Oscillation. Since the Southern Oscillation is affected by the strength of the trade winds, it is linked to the El Niño phenomenon. Therefore, at present, the Southern Oscillation and the El Niño phenomenon are regarded as a series of fluctuations in the atmosphere and the ocean, and are collectively called El Niño–Southern Oscillation (ENSO).
6.2.8.5 Determination of ENSO
Of course, ENSO has a big impact on the weather in Japan. The El Niño phenomenon and the La Niña phenomenon affect the weather in Japan as follows.
• El Niño phenomenon → Cold summer and warm winter.
• La Niña phenomenon → Hot summer and cold winter.

Therefore, it is practically important to observe ENSO phenomena. In the past, sea surface temperatures were observed by ships and buoys and sent to the ground via communication satellites. However, it has become possible to make direct observations on a global scale using earth observation satellites equipped with sensors using various wavelengths such as infrared rays (high spatial resolution but blocked by clouds) and microwaves (able to see through clouds but with lower spatial resolution).

First, the monthly reference value $\mu^{\mathrm{ma}(30)}_{m,y}$ is the average of the observations for month m over the 30 years up to the year before year y. That is, the monthly reference value can be defined as follows.

$$\mu^{\mathrm{ma}(30)}_{m,y} = \frac{1}{30}\left(T_{m,y-1} + \cdots + T_{m,y-30}\right).$$

Here, $T_{m,y}$ are the monthly observation values. Next, the moving average of the difference between the monthly observation value and the monthly reference value for the five months with a certain month m in the middle is defined as follows.

$$\mu^{\mathrm{ma}(5)}_{m,y} = \frac{1}{5}\Big\{\big(T_{m-2,y} - \mu^{\mathrm{ma}(30)}_{m-2,y}\big) + \big(T_{m-1,y} - \mu^{\mathrm{ma}(30)}_{m-1,y}\big) + \big(T_{m,y} - \mu^{\mathrm{ma}(30)}_{m,y}\big) + \big(T_{m+1,y} - \mu^{\mathrm{ma}(30)}_{m+1,y}\big) + \big(T_{m+2,y} - \mu^{\mathrm{ma}(30)}_{m+2,y}\big)\Big\}.$$

Using this moving average, the El Niño phase of El Niño–Southern Oscillation can be defined as the following state.
◦ $\mu^{\mathrm{ma}(5)}_{m,y} > 0.5$ (5 consecutive months).
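The nested moving averages above can be sketched in a few lines. This is an illustration under stated assumptions, not the operational procedure of any agency: months are stored in one flat list indexed by t = 12 × (year index) + (month index) so that the five-month window can cross a year boundary, the temperature series is synthetic (a fixed seasonal cycle with an artificial +1.0 warm spell over nine months), and the reading of "5 consecutive months" as a run of at least five flagged months is an assumption.

```python
import math
from statistics import fmean

def reference(temps, t, years=30):
    """30-year mean of the same calendar month, ending the year before month t."""
    return fmean(temps[t - 12 * k] for k in range(1, years + 1))

def anomaly_ma5(temps, t, years=30):
    """Centered 5-month moving average of (observation - monthly reference)."""
    return fmean(temps[s] - reference(temps, s, years) for s in range(t - 2, t + 3))

def el_nino_months(temps, threshold=0.5, run=5, years=30):
    """Months belonging to runs of at least `run` consecutive months above threshold."""
    first, last = 12 * years + 2, len(temps) - 3
    flagged = [t for t in range(first, last + 1)
               if anomaly_ma5(temps, t, years) > threshold]
    events, streak = [], []
    for t in flagged:
        if streak and t != streak[-1] + 1:
            if len(streak) >= run:
                events.extend(streak)
            streak = []
        streak.append(t)
    if len(streak) >= run:
        events.extend(streak)
    return events

# Synthetic data: 32 years of a pure seasonal cycle plus a 9-month warm spell.
temps = [25.0 + 2.0 * math.cos(2 * math.pi * (t % 12) / 12) for t in range(12 * 32)]
for t in range(370, 379):
    temps[t] += 1.0

events = el_nino_months(temps)
```

On this synthetic series the detected months are exactly the nine months of the artificial warm spell; the seasonal cycle itself is removed by the 30-year reference.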
6.2.9 Time Series Forecasting
Generally, time series data are generated from the following elements.
• Autocorrelation: Similarity with the past data.
• Periodic fluctuations: Fluctuations in units such as seasons and weeks.
• Trends: Trends such as increase and decrease.
• White noise: Noise that follows the normal distribution $N(0, \sigma^2)$.
• Exogenous factors: Factors that cannot be explained by the past data.
The moving averages explained so far are effective in detecting trends, and the time series differences are effective in removing trends. Time series forecasting (i.e., future event prediction) is done using time series models. Typical time series models assume dedicated data generation processes as follows.
• AutoRegressive (AR) model: It is assumed that the data piece at a certain point in time consists of the linear sum of the past data.
• Moving average (MA) model: It is assumed that the data piece at a certain point in time consists of the linear sum of the mean and white noise up to that point in time.
• AutoRegressive Moving Average (ARMA) model: It is assumed that the data generation process has the properties of both the AR model and the MA model.
• AutoRegressive Integrated Moving Average (ARIMA) model: It is assumed that the time series difference has the properties of the ARMA model.

The ARIMA with eXogenous variable (ARIMAX) model is an extension of the ARIMA model so that exogenous variables can be considered.

A time series is said to be stationary if its characteristics (i.e., mean, variance, autocovariance) do not depend on the time point in the stochastic generation process. If it is not stationary, the predicted value will diverge or oscillate. It can be said that the MA model is always stationary. This is not always the case with the AR model. The stationarity of the AR model can be checked by using a certain condition with respect to the roots of the characteristic equation of the model (refer to Sect. 6.2.11). Please refer to textbooks for more information on modeling (predicting) time series data, such as (Nielsen 2020; Wei 2019).
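Two of the points above, removing a trend by differencing and checking the stationarity of an AR model through the roots of its characteristic equation, can be sketched as follows. The sketch is restricted to an AR(2) model and assumes the lag-polynomial convention 1 − φ1 z − φ2 z² = 0, under which the model is stationary when all roots lie outside the unit circle; Sect. 6.2.11 may state the condition in a different but equivalent form. The function names and coefficient values are illustrative assumptions.

```python
import cmath

def ar2_characteristic_roots(phi1, phi2):
    """Roots z of 1 - phi1*z - phi2*z**2 = 0 for x_t = phi1*x_{t-1} + phi2*x_{t-2} + noise."""
    if phi2 == 0:
        return [] if phi1 == 0 else [1 / phi1]
    disc = cmath.sqrt(phi1 ** 2 + 4 * phi2)
    return [(-phi1 + disc) / (2 * phi2), (-phi1 - disc) / (2 * phi2)]

def is_stationary_ar2(phi1, phi2):
    """Stationary if every characteristic root lies strictly outside the unit circle."""
    return all(abs(z) > 1 for z in ar2_characteristic_roots(phi1, phi2))

def difference(series):
    """First-order time series difference, which removes a linear trend."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical coefficients: a stationary AR(2), a random walk, and an explosive AR(2).
stationary = is_stationary_ar2(0.5, 0.3)
random_walk = is_stationary_ar2(1.0, 0.0)
explosive = is_stationary_ar2(0.5, 0.6)
detrended = difference([1, 3, 5, 7])
```

The random walk (φ1 = 1) is the boundary case with a root exactly on the unit circle; differencing it once yields white noise, which is the idea behind the "Integrated" part of ARIMA.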
6.2.10 MQ-RNN
6.2.10.1 Basic Model
We have already explained RNN as a neural network for time series prediction. Here, we introduce Multihorizon Quantile Recurrent Neural Network (MQ-RNN) (Wen et al. 2017) as an advanced neural network approach to time series prediction.
First, MQ-RNN assumes the following variables.
• $y_{t+j}$: The time series data of the objective variable to predict.
• $y_{:t}$: The past data for the objective variable.
• $x^{(h)}_{:t}$: The past covariate data.
• $x^{(f)}_{t:}$: The predicted covariate data.
• $x^{(s)}$: The static feature data.
MQ-RNN solves the regression problem with respect to the following conditional probability distribution.
$$p\left(y_{t+k}, \ldots, y_{t+1} \mid y_{:t}, x^{(h)}_{:t}, x^{(f)}_{t:}, x^{(s)}\right)$$
The basic MQ-RNN model is an RNN that uses LSTM (see Fig. 6.6). MQ-RNN consists of two types of neural networks, namely the encoder $\mathrm{MLP}_{\mathrm{Global}}$ and the decoder $\mathrm{MLP}_{\mathrm{Local}}$. $\mathrm{MLP}_{\mathrm{Global}}$ creates a context (intermediate representation) from the input data. On the other hand, $\mathrm{MLP}_{\mathrm{Local}}$ creates output data from the context. As will be explained later, the predicted values are output at the same time for the number of quantiles (Q) given as parameters. The basic model is expressed as follows.
$$h_t = \mathrm{LSTM}(h_{t-1}, x_t, y_t)$$
$$\left[c_{t+1}, \ldots, c_{t+k}, c_a\right] = \mathrm{MLP}_{\mathrm{Global}}\left(h_t, x^{(f)}_{t:}\right)$$
$$\left[\dot{y}^{(q_1)}_{t+k}, \ldots, \dot{y}^{(q_Q)}_{t+k}\right] = \mathrm{MLP}_{\mathrm{Local}}\left(c_{t+k}, c_a, x^{(f)}_{t+k}\right)$$
Fig. 6.6 MQ-RNN (Courtesy of Ruofeng Wen) https://doi.org/10.48550/arXiv.1711.11053
In the above model, the LSTM encodes all history into the hidden states $h_t$. The global MLP creates a context consisting of horizon-specific components $c_{t+j}$ and a horizon-agnostic component $c_a$. The local MLP produces all predicted quantiles $\dot{y}^{(q_j)}_{t+k}$ for a specific future time (t + k).
Furthermore, the advanced form of MQ-RNN uses not only the RNN scheme but also a CNN scheme. That is, MQ-RNN in this case can also have old memories aggregated by using dilated causal convolution or extended causal convolution as in WaveNet (Oord et al. 2016) (see Fig. 6.7). In the advanced form of MQ-RNN, $h_t$ of the above basic model is replaced with $\hat{h}_t$, which is defined as follows.
$$\hat{h}_t = \mathrm{mlp}(h_t, \ldots, h_{t-D})$$
Fig. 6.7 Dilated causal convolution (Courtesy of Ruofeng Wen) https://doi.org/10.48550/arXiv.1711.11053
6.2.10.2 Quantile Regression
In MQ-RNN, quantile regression is performed. Here, q is a quantile, which is a parameter given by the user. For example, q = 0.25, 0.5, and 0.75 correspond to the first quartile, the second quartile (median), and the third quartile, respectively. At each point, the difference between the predicted value $\dot{y}$ and the observed value $y$ is evaluated by the following formula (called the check function), and the total loss J is obtained by summing it up.
$$L_q(y, \dot{y}) = (1 - q)\max(\dot{y} - y, 0) + q\max(y - \dot{y}, 0)$$
$$J = \sum_{t}^{T}\sum_{q}^{Q}\sum_{k}^{K} L_q\left(y_{t+k}, \dot{y}^{(q)}_{t+k}\right)$$
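The check function and the total loss J can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the array layout (observations of shape (T, K), predictions of shape (T, Q, K)) and the function names `check_loss` and `total_loss` are hypothetical and not taken from the MQ-RNN paper.

```python
import numpy as np


def check_loss(y, y_pred, q):
    """Check function L_q(y, y_pred) for a quantile level q between 0 and 1."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    over = np.maximum(y_pred - y, 0.0)   # prediction above the observation
    under = np.maximum(y - y_pred, 0.0)  # prediction below the observation
    return (1.0 - q) * over + q * under


def total_loss(y, y_pred, quantiles):
    """Total loss J summed over creation times, quantiles, and horizons.

    y:         observed values, shape (T, K)
    y_pred:    predicted quantiles, shape (T, Q, K)
    quantiles: the Q quantile levels, in the order used along axis 1 of y_pred
    """
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    total = 0.0
    for index, q in enumerate(quantiles):
        total += check_loss(y, y_pred[:, index, :], q).sum()
    return float(total)


if __name__ == "__main__":
    # For q = 0.9, predicting too low is penalized more than predicting too high.
    print(check_loss(10.0, 8.0, 0.9), check_loss(10.0, 12.0, 0.9))
```

The asymmetry of the two terms is what pushes the prediction for a level q toward the q-th quantile of the observations.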
Here, T, Q, and K are as follows.
• T: Training period (called Forecast Creation Time).
• Q: The number of quantiles.
• K: The number of horizons (i.e., how many time units ahead of the present time).
Here, let us consider the minimization problem by generalizing the error function to a function of a continuous variable so that it can be easily differentiated. F(·) is a cumulative distribution function. F( y˙ ) = p(y […]

[…] t B ). The image A after intensity adjustment is redesignated as the image A.
2. Calculation of SIFT features: The SIFT features (explained later) of the current image A and those of the past image B are extracted.
3. Matching and transformation of images based on SIFT features: The feature points (SIFT features) of the past image B and those of the current image A are matched based on the Euclidean distance between them. Next, the image A is deformed by converting the positions in the image A to those of the matched image B (i.e., transforming a quadrangle into another quadrangle) (see Fig. 6.14). In other words, the images A (i.e., deformed A) and B correspond to the hypotheses (or models) on natural structures on the lunar surface at t A and t B , respectively.
4. Find the difference after coloring the images: The image B is colored with cyan, and the image A is colored with complementary red. By adding (i.e., additive mixing) the two images after coloring, only the common part changes its color to white. This operation corresponds to an overlay of two hypotheses, and the overlapping or common parts as the result correspond to what is invariant during the time interval. By taking the difference between the image A and the detected common parts, the natural structures (red) that have been created since the image B was taken can be found (see Fig. 6.15).

Fig. 6.14 Image matching and transformation. The coordinates (E, F, G, H) of Image2 and the coordinates (A, B, C, D) of Image1 are matched by using the SIFT features. Image2’ is obtained by transforming Image2 so that the coordinates (E, F, G, H) of Image2 respectively correspond to the coordinates (A, B, C, D) of Image1

Fig. 6.15 Finding a difference by overlaying separately colored images. Image1 is recolored from RGB(N,N,N) to RGB(0,N,N) and Image2 from RGB(N,N,N) to RGB(N,0,0); in Image1 + Image2, the output (intersection) is RGB(N,N,N)

(Data management and data analysis)
In this application, Steps 1 and 2 correspond to data management, while Steps 3 and 4 correspond to data analysis.
(Experiments)
As a result, a crater created during the specified time interval could be detected, although its existence had already been known. We have generated the resultant image by combining multiple raw images (see Fig. 6.16). In the actual comparison, in order to avoid overlooking objects that straddle the image boundaries, the images are divided so that they overlap each other, and then the corresponding images are compared.
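Step 4 above (coloring, additive mixing, and taking the difference) can be sketched as follows. This is a simplified illustration, not the implementation used in the case study: it assumes two already aligned 8-bit grayscale images of the same size, and the tolerance `tol` as well as the rule that "new" structures are pixels brighter in A than in B are assumptions made only for this sketch.

```python
import numpy as np


def overlay_difference(image_a, image_b, tol=0):
    """Overlay a red-colored image A and a cyan-colored image B.

    image_a: current grayscale image (2-D, values 0..255)
    image_b: past grayscale image of the same shape
    Returns the RGB overlay, a mask of common parts, and a mask of new parts.
    """
    a = np.asarray(image_a, dtype=np.int32)
    b = np.asarray(image_b, dtype=np.int32)
    zeros = np.zeros_like(a)
    red = np.stack([a, zeros, zeros], axis=-1)   # A keeps only the R channel
    cyan = np.stack([zeros, b, b], axis=-1)      # B keeps only the G and B channels
    overlay = red + cyan                         # additive mixing: (a, b, b) per pixel
    # Where A and B agree, R == G == B, i.e., a neutral (white/gray) pixel.
    common = np.abs(a - b) <= tol
    # Pixels that stay reddish after mixing: assumed to be structures new in A.
    new = (a - b) > tol
    return overlay.astype(np.uint8), common, new


if __name__ == "__main__":
    past = np.array([[200, 100], [50, 255]])
    current = np.array([[200, 200], [50, 255]])
    overlay, common, new = overlay_difference(current, past)
    print(overlay[0, 1], common, new, sep="\n")
```

In practice a nonzero tolerance would probably be needed, because intensity adjustment and image deformation rarely make the common parts match exactly.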
Fig. 6.16 Detection of a newly created crater (partly expanded). The newly created crater is marked in red
Box: Apollo 15 and NAC image
• Apollo 15: Apollo 15 was launched on July 26, 1971, and returned on August 7, 1971. Apollo 15 landed on a plateau between the Apennines and the Hadley Valley in the moon’s sea “Mare Imbrium”. The Lunar Module Falcon with American astronauts David Scott (1932–) and James Irwin (1930–1991) aboard landed on the moon. The time spent on the moon was 66 h and 55 min. For the first time in this mission, a lunar roving vehicle (LRV) was used to explore mountainous areas. The two astronauts performed three extravehicular activities, traveled a total distance of 28 km, installed an unmanned observation station called Apollo Lunar Surface Experiments Package (ALSEP), and brought back 77 kg of lunar rocks to the earth.
• NAC image: The US lunar probe Lunar Reconnaissance Orbiter (LRO) was launched from Cape Canaveral Air Force Station on June 18, 2009, by an Atlas V rocket. The objective of LRO is to investigate the lunar topography and environment in detail and collect information in order to select useful and safe landing sites in preparation for future manned missions to the lunar surface. The Narrow Angle Camera (NAC) mounted on LRO can capture lunar surface images at a high resolution of 0.5 m/pixel from a spacecraft altitude of 50 km. LRO continued to operate in lunar orbit (at the time of writing), and its high-resolution camera also succeeded in capturing the landing site of the Apollo spacecraft. The NAC image clearly shows the landing point of Apollo 15, the ruts of LRV, and ALSEP (see Fig. 6.17) □ (NASA 2022).
Fig. 6.17 NAC image (Courtesy of NASA)
6.3.5 Image Processing
So far, we have explained the differences between entire images. Here, we consider the differences related to spatially local images (i.e., pixels).

6.3.5.1 Smoothing Filter
The image is smoothed by applying the following filter (i.e., a moving average) to each pixel, as in the case of time series data.
$$\frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$
Applying the smoothing filter to a pixel yields the average of the pixel's own value and the surrounding pixel values.
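A minimal NumPy sketch of this 3 × 3 moving-average filter is shown below. The treatment of the image border is not specified in the text, so the sketch assumes that edge pixels are replicated; the function name `smooth` is hypothetical.

```python
import numpy as np


def smooth(image):
    """Apply the 3x3 moving-average filter to a 2-D grayscale image.

    Border handling (an assumption): edge pixels are replicated outward.
    """
    img = np.asarray(image, dtype=float)
    padded = np.pad(img, 1, mode="edge")
    rows, cols = img.shape
    out = np.zeros_like(img)
    # Sum the nine shifted copies of the image, then divide by 9.
    for di in range(3):
        for dj in range(3):
            out += padded[di:di + rows, dj:dj + cols]
    return out / 9.0


if __name__ == "__main__":
    image = np.zeros((5, 5))
    image[2, 2] = 9.0          # a single bright pixel
    print(smooth(image))       # the peak is spread over its 3x3 neighborhood
```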
6.3.5.2 Edge Extraction
(Derivative filter)
Let us consider a filter that extracts edges, i.e., places where the intensity changes suddenly in an image. It is based on the difference between adjacent pixels as shown below. The following formulas represent the horizontal and vertical first-order derivatives, respectively (see Fig. 6.18a).
$$\Delta_x f = \frac{1}{2}\left(f[i+1, j] - f[i-1, j]\right)$$
$$\Delta_y f = \frac{1}{2}\left(f[i, j+1] - f[i, j-1]\right)$$
As first-order derivative filters for vertical (or horizontal) edge extraction, the Prewitt filter and the Sobel filter smooth pixels in the vertical (or horizontal) direction after differentiating them in the horizontal (or vertical) direction in order to make them resistant to noise.
(Laplacian filter)
Similarly, a second-order derivative filter is used for edge extraction. The second-order derivative is obtained by first taking differences between adjacent pixels and then taking the difference of the differences as in the case of a first-order derivative filter. The following formulas represent the second-order derivatives in the horizontal and vertical directions, respectively.
$$\frac{\partial^2 f}{\partial x^2} = f[i+1, j] + f[i-1, j] - 2f[i, j]$$
$$\frac{\partial^2 f}{\partial y^2} = f[i, j+1] + f[i, j-1] - 2f[i, j]$$
The filter expressed by adding the results of the above expressions is called the Laplacian filter. The Laplacian filter can detect edges regardless of directions (see Fig. 6.18b).

Fig. 6.18 a First-order derivative filters: horizontal direction [0 0 0; −0.5 0 0.5; 0 0 0], vertical direction [0 −0.5 0; 0 0 0; 0 0.5 0]. b Second-order derivative filters: horizontal direction [0 0 0; 1 −2 1; 0 0 0] + vertical direction [0 1 0; 0 −2 0; 0 1 0] = Laplacian filter [0 1 0; 1 −4 1; 0 1 0]
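The derivative filters and the Laplacian filter above can be written as 3 × 3 kernels and applied with a small helper. The sketch assumes NumPy arrays indexed as `image[row, column]`, so that the horizontal direction runs along columns, and it replicates edge pixels at the border; both choices, and all names in the code, are assumptions made for illustration.

```python
import numpy as np

# First-order derivative kernels (horizontal and vertical).
KERNEL_DX = np.array([[0.0, 0.0, 0.0], [-0.5, 0.0, 0.5], [0.0, 0.0, 0.0]])
KERNEL_DY = KERNEL_DX.T
# Second-order derivative kernels; their sum is the Laplacian kernel.
KERNEL_DXX = np.array([[0.0, 0.0, 0.0], [1.0, -2.0, 1.0], [0.0, 0.0, 0.0]])
KERNEL_DYY = KERNEL_DXX.T
KERNEL_LAPLACIAN = KERNEL_DXX + KERNEL_DYY


def apply_kernel(image, kernel):
    """Slide a 3x3 kernel over a 2-D image (no kernel flip, edges replicated)."""
    img = np.asarray(image, dtype=float)
    padded = np.pad(img, 1, mode="edge")
    rows, cols = img.shape
    out = np.zeros_like(img)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out


if __name__ == "__main__":
    # A vertical step edge: dark on the left, bright on the right.
    image = np.zeros((5, 6))
    image[:, 3:] = 10.0
    print(apply_kernel(image, KERNEL_DX))
    print(apply_kernel(image, KERNEL_LAPLACIAN))
```

On the step edge, the horizontal derivative responds only near the intensity jump, while the Laplacian changes sign across it.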
6.3.5.3 SIFT Features
The Scale-Invariant Feature Transform (SIFT) feature is one of the feature quantities of images. SIFT has the property of being invariant with respect to image scale and rotation. An outline of how to obtain SIFT features for a local region is explained below.
As preparation, we calculate the gradients from a smoothed (rectangular) image L(x, y) in the local region as follows.
$$f_x(x, y) = L(x+1, y) - L(x-1, y)$$
$$f_y(x, y) = L(x, y+1) - L(x, y-1)$$
From these, the gradient direction θ(x, y) and the gradient intensity m(x, y) are obtained as follows.
$$\theta(x, y) = \tan^{-1}\frac{f_y(x, y)}{f_x(x, y)}$$
$$m(x, y) = \sqrt{f_x(x, y)^2 + f_y(x, y)^2}$$
Algorithm 6.5 SIFT
1. Create a weighted gradient direction histogram from the directions and intensities of the gradients of feature points within the local region. The closer to the center of the feature points, the larger the weight.
2. Determine the peak of the histogram to be the orientation of the feature points.
3. Rotate the rectangular area associated with the features in the determined direction of orientation, divide it into 16 blocks, create a histogram in eight directions for each block, and create a 128-dimensional feature vector as a result.
4. Normalize this vector by its length.
Fig. 6.19 SIFT features
In this way, SIFT features are obtained as features which are invariant to rotation and scale (see Fig. 6.19).
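Steps 1 and 2 of Algorithm 6.5 (the weighted gradient direction histogram and its peak) can be sketched as follows. This is not a full SIFT implementation: the number of bins (36), the Gaussian weight and its width, and the use of `arctan2` to obtain a full-circle direction are assumptions of this sketch, and the function name is hypothetical.

```python
import numpy as np


def gradient_orientation_histogram(L, bins=36, sigma=None):
    """Weighted gradient direction histogram of a smoothed patch L[y, x].

    Returns the histogram and the direction (in radians) of its peak bin.
    """
    L = np.asarray(L, dtype=float)
    # Central differences on the interior pixels of the patch.
    fx = L[1:-1, 2:] - L[1:-1, :-2]
    fy = L[2:, 1:-1] - L[:-2, 1:-1]
    m = np.sqrt(fx ** 2 + fy ** 2)   # gradient intensity
    theta = np.arctan2(fy, fx)       # gradient direction in (-pi, pi]
    # Gaussian weight: pixels closer to the patch center count more.
    h, w = m.shape
    yy, xx = np.mgrid[0:h, 0:w]
    if sigma is None:
        sigma = 0.5 * max(h, w)
    dist2 = (yy - (h - 1) / 2.0) ** 2 + (xx - (w - 1) / 2.0) ** 2
    weight = np.exp(-dist2 / (2.0 * sigma ** 2))
    hist, edges = np.histogram(theta, bins=bins, range=(-np.pi, np.pi),
                               weights=m * weight)
    peak = int(np.argmax(hist))
    orientation = 0.5 * (edges[peak] + edges[peak + 1])  # center of the peak bin
    return hist, orientation


if __name__ == "__main__":
    yy, xx = np.mgrid[0:9, 0:9]
    patch = 2.0 * xx + 1.0 * yy      # intensity rising mostly along x
    hist, orientation = gradient_orientation_histogram(patch)
    print(orientation, np.arctan2(1.0, 2.0))
```

The returned orientation is the center of the peak bin, so it agrees with the true gradient direction only up to half a bin width.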
6.3.5.4 Video Coding
Here, we consider compressing and coding video data which consist of time series of frames. There is a method called interframe coding that utilizes the interframe differences based on the fact that there is a correlation between consecutive frames. Thus, basic coding is performed by using the following differences between consecutive frames at times t and (t − 1).
$$d(i, j, t) = f(i, j, t) - f(i, j, t-1)$$
However, the difference value between consecutive frames becomes large where the movement of an object is large. Therefore, there is a method called motion-compensated coding that predicts the movement of an object and takes the difference between frames.
The amounts of movement of the object $(\Delta_i, \Delta_j)$ are estimated, and the difference between frames is calculated using them as follows.
$$d(i, j, t) = f(i, j, t) - f(i - \Delta_i, j - \Delta_j, t-1)$$
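The two difference formulas can be compared with a toy example. The sketch below assumes a single global displacement for the whole frame, estimated by an exhaustive search that minimizes the sum of absolute differences, and it wraps pixels around the frame border via `np.roll`; real codecs estimate motion per block and handle borders differently. All function names are hypothetical.

```python
import numpy as np


def interframe_difference(curr, prev):
    """d(i, j, t) = f(i, j, t) - f(i, j, t-1)."""
    return np.asarray(curr, dtype=np.int32) - np.asarray(prev, dtype=np.int32)


def motion_compensated_difference(curr, prev, di, dj):
    """d(i, j, t) = f(i, j, t) - f(i - di, j - dj, t-1), with wrap-around."""
    prev = np.asarray(prev, dtype=np.int32)
    # After rolling, shifted[i, j] equals prev[i - di, j - dj] (indices modulo size).
    shifted = np.roll(prev, shift=(di, dj), axis=(0, 1))
    return np.asarray(curr, dtype=np.int32) - shifted


def estimate_motion(curr, prev, max_shift=2):
    """Exhaustive search for the displacement with the smallest absolute residual."""
    best = None
    for di in range(-max_shift, max_shift + 1):
        for dj in range(-max_shift, max_shift + 1):
            cost = np.abs(motion_compensated_difference(curr, prev, di, dj)).sum()
            if best is None or cost < best[0]:
                best = (cost, di, dj)
    return best[1], best[2]


if __name__ == "__main__":
    prev = np.zeros((6, 6), dtype=np.uint8)
    prev[1:3, 1:3] = 100                       # a small bright object
    curr = np.roll(prev, (1, 2), axis=(0, 1))  # the object moves by (1, 2)
    di, dj = estimate_motion(curr, prev)
    print(np.abs(interframe_difference(curr, prev)).sum())
    print(di, dj, np.abs(motion_compensated_difference(curr, prev, di, dj)).sum())
```

When the displacement is estimated correctly, the motion-compensated residual is much smaller than the plain interframe difference, which is what makes it cheaper to code.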
6.4 Differences in Conceptual Space
In this section, we consider the meanings of commonly used concepts or unknown concepts based on the difference between concepts in the conceptual space (i.e., semantic space), not the real space such as time and space that we have dealt with so far. Through the following examples, we explain that the meaning of a concept as a hypothesis can be expressed by the differences between concepts.
• We can find the essential meaning of a concept by the difference between concepts obtained by the method called Word2Vec, which can learn a kind of distributed representations (i.e., vectors) of words (or concepts).
• We can understand an unknown concept by concretely describing the difference between the unknown concept and its similar known concept in the conceptual space (e.g., cuisines).
6.4.1 Case of Creating the Essential Meaning of Concept
6.4.1.1 Vector Space Model
An attempt to express the meaning of a document or a word using the vector space, which is the subject of linear algebra, is called the vector space model (Salton et al. 1975). The vector space model consists of document vectors and word vectors. This is the basic model for information retrieval and Web retrieval as well. If the meanings of concepts (i.e., words) are expressed as distributed representations (i.e., high-dimensional vectors), vector operations such as sum, difference, and inner product can be performed on the distributed representations.
(Integrated Hypothesis)
• The essential meaning of a common concept can be obtained by the difference between the concepts as their distributed representations (i.e., vectors). In other words, the difference-based model can express the core meaning of the concept.
• The similarity between concepts can be measured by calculating the cosine of the angle between the vectors (the inner product of the vectors divided by the product of their lengths).
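The cosine measure in the second item can be written directly from its definition. The vectors in the usage example are made-up three-dimensional values chosen only to show the behavior; they are not learned representations.

```python
import numpy as np


def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


if __name__ == "__main__":
    # Hypothetical toy vectors for illustration only.
    a = np.array([1.0, 2.0, 0.0])
    b = np.array([2.0, 4.0, 0.0])   # same direction as a
    c = np.array([0.0, 0.0, 3.0])   # orthogonal to a
    print(cosine_similarity(a, b), cosine_similarity(a, c))
```

Because the measure depends only on the angle, vectors of different lengths that point in the same direction are treated as maximally similar.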
In this section, we will explain these principles by taking tourism resources (landmarks) and places (regions) as examples. This method for hypothesis generation can be explained from the viewpoint of analogy as follows.
• Tourism resources and places are components of analogy, and the relationship between them (i.e., relationship in analogy) is expressed by the difference between the components.

6.4.1.2 Tourism Application
There are many posts on Twitter (i.e., Tweets) that describe information closely related to the area and emotions of visitors to the area. Research is actively conducted to extract information about tourism, such as the characteristics of tourist spots and the tendency of people to visit tourist spots, from Twitter data. Word2Vec (Mikolov et al. 2013) is applied to texts posted on Twitter in order to extract information that can be adapted to tourism services (Ishikawa and Miyata 2021). Word2Vec produces distributed representations (i.e., meanings) of words by taking a corpus (i.e., a large set of text data) as input. This makes it possible to perform semantic operations such as addition and subtraction on the meanings of words. In this case, these semantic operations are used to extract the semantic relationship between a region and a landmark from Tweets. Landmarks are buildings, cityscapes, and events that symbolize a particular region. Examples of buildings that symbolize regions include the Sky Tree in Tokyo, Tsutenkaku Tower in Osaka, and The Clock Tower in Sapporo. Examples of events that symbolize regions include the Nebuta Festival in Aomori City and the Awa Odori Festival in Tokushima City. Therefore, in this case, landmarks are defined as buildings and festivals that are the purpose of tourism. However, when considering the relationships between regions and landmarks, there are semantic differences even within the same category of landmarks. For example, from the purpose and the appearance of the building, it is considered that the Sky Tree has a meaning closer to Tsutenkaku Tower than the Clock Tower. In order to realize such a judgment by a mechanical method, we perform semantic operations on regions and landmarks by using Word2Vec.
6.4.1.3 Proposed Method
We will explain how to apply Word2Vec to texts posted on Twitter and extract the semantic relationship between a region and a landmark. The dedicated procedure is explained as follows.
1. Create a keyword list that includes the names of cities, regions, and landmarks.
2. Collect Tweets using the Twitter API with the created keyword list as a filter.
3. Learn the vocabulary space using Word2Vec and generate vectors for the names of cities, regions, and landmarks.
4. Perform semantic operations on the generated vectors to extract the semantic relationship between the cities or regions and the landmarks.
Here, please note that Steps 1 and 2 correspond to data management, while Steps 3 and 4 correspond to data analysis.
6.4.1.4 Creating Keyword List
In this case, we create a keyword list that includes the names of cities, regions, and landmarks, by using the tourism resource data provided by the Ministry of Land, Infrastructure, Transport and Tourism (MLIT) since 2014 (MLIT 2022). Tweets are extracted using the words included in this keyword list as search words. Tourism resource data consist of items such as “Tourism resource name” and “Location address” and include records such as buildings, festivals, and specialties all over Japan. Table 6.4 shows an example of the registered tourism resources. In the “Type name” column, the types of tourism resources are described. In this case, we use as tourism resources for analysis “shrines/temples/churches”, “buildings”, “local landscapes”, “villages/towns”, “castle ruins/castles/palaces”, “gardens/parks”, “historic sites”, “animal and botanical gardens/aquariums”, and “annual events”.
Next, the target tourism resource name is morphologically analyzed, and the extracted nouns are added to the keyword list. The reason for morphological analysis of tourism resource names is to extract Tweets by the common names, not by the official names of tourism resources. For example, in tourism resource data, the official name “Tokyo Sky Tree” is described as it is, but the common name is “Sky Tree”. Morphological analysis divides the official name into “Tokyo” and “Sky Tree”, so it is considered sufficient to use “Sky Tree”. In addition, the city, ward, town, or village names are extracted from the data in the “Location/Address” column and added to the keyword list. The prefecture name is also added to the keyword list. Also, from tourism-related sites such as TripAdvisor (TripAdvisor 2022) and Wikipedia’s landmark articles (Wikipedia–Landmark 2022), major landmarks such as Kyoto Tower and Kobe Port Tower that are not registered in the tourism resource data are manually extracted and added to the keyword list.
6.4.1.5 Collecting Tweets
There are a huge number of Tweets posted on Twitter. Therefore, instead of using the Tweets collected at random, we extract Tweets by using the keyword list created in the previous step. Specifically, we extract Tweets that include at least one city, region, or landmark name in the keyword list, prepare a corpus by using the extracted Tweets, and let Word2Vec learn from the corpus that contains these names in order to generate meaningful vectors.

Table 6.4 Example of tourism resource data

| Tourism resource name | Prefecture code | Category | Location |
|---|---|---|---|
| Lake Mashū | 01 | Lake/Marshes | Teshikaga Town |
| Hakkōda Mountains | 02 | Mountain | Aomori City |
| Akita Kanto Festival | 05 | Annual event | Asahi Kita, Akita City |
| Ikaho Onsen | 10 | Hot spring | Shibukawa City |
| Tokyo Sky Tree | 13 | Building | 1-1-2 Oshiage, Sumida Ward |
| Hakone Ekiden | 13 | Entertainment/Event | Chiyoda Ward |
| Kenrokuen Garden | 17 | Garden/Park | 1 Kenrokumachi, Kanazawa City |
| Aokigahara The Sea of Trees | 19 | Vegetation | Shoji, Fujikawaguchiko Town |
| Zenkō-ji | 20 | Shrine/Temple/Church | 491 Motoyoshicho, Nagano City |
| Kawadoko’s Kyoto cuisine | 26 | Food | Kamigyo Ward, Kyoto City |
| Dotonbori | 27 | Town/Village | Chuo Ward, Osaka City |
| Gunkanjima | 42 | Local landscape | Hashima, Takashimamachi, Nagasaki City |
6.4.1.6 Word2Vec
Here, Word2Vec (Mikolov et al. 2013) will be explained. Word2Vec is a calculation method for distributed representation of words using a neural network. Here, we obtain the distributed representations of words by using a language model called the skip-gram model. Specifically, dense vectors of words (with several hundred dimensions in this case) are generated by adjusting parameters to maximize the probability of occurrence of words that co-occur with a specific word in the context of the corpus. Hereafter, the vector of the word w is referred to as $v_w$. In the vector space of the vocabulary constructed by this model, the vectors of words having similar meanings are assigned close to each other, and the vectors of words having different meanings are assigned far from each other. Furthermore, addition or subtraction (i.e., difference) can be performed on word vectors as semantic operations. For example, an expression such as $v_{\text{“Tokyo”}} - v_{\text{“Japan”}} + v_{\text{“France”}} \approx v_{\text{“Paris”}}$ holds true. This corresponds to the fact that the essential concept (or region-independent concept) of “capital” can be obtained by subtracting the concept of “Japan” from the concept of “Tokyo”, that is, the concept of “capital of Japan”. On the other hand, the concept of “Paris”, that is, the “capital of France” can be calculated by adding the concept of the country “France” to the essential concept of “capital”. In this case, we try to extract tourist information from the texts in the posted Tweets using this semantic operation.
The following is a detailed explanation of learning with Word2Vec (Rong 2014). The vector of the input word $w_I$ is represented by $v_{w_I}$. It corresponds to the row for $w_I$ of the weight matrix W from the input layer to the hidden layer of the neural network. Let k be the index of the word $w_I$ in the vocabulary. Using these, the output of the hidden layer h can be written as follows (see Fig. 6.20).
$$h = W^{T}_{(k,\cdot)} = v^{T}_{w_I}$$
The following softmax function is used as the output function.
$$p(w_{c,j} = w_{O,c} \mid w_I) = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})}$$
$$u_{c,j} = u_j = {v'_{w_j}}^{T} \cdot h, \quad c = 1, 2, \ldots, C$$
Here, V is the vocabulary size, C is the number of words in the context, $w_{c,j}$ is the j-th word of the c-th panel (context) of the output layer, and $w_{O,c}$ is the c-th word of the output context words. $v'_{w_j}$ is not only the output vector of the word $w_j$ contained in the vocabulary but also a column of the weight matrix W′ from the hidden layer to the output layer. The following formula is used as the loss function.
$$J = -\log p(w_{O,1}, w_{O,2}, \ldots, w_{O,C} \mid w_I) = -\log \prod_{c=1}^{C} \frac{\exp\left(u_{c,j^*_c}\right)}{\sum_{j'=1}^{V} \exp(u_{j'})} = -\sum_{c=1}^{C} u_{c,j^*_c} + \sum_{c=1}^{C} \log \sum_{j'=1}^{V} \exp(u_{j'})$$
where $j^*_c$ is the index of the c-th output context word. Therefore, the error is defined as follows.
$$\frac{\partial J}{\partial u_{c,j}} = y_{c,j} - 1\left(j = j^*_c\right) = e_{c,j}$$
Here, 1(cond) is an indicator function, and if the condition cond is satisfied, it returns 1; otherwise, it returns 0.
Fig. 6.20 Skip-gram
The errors are summed over the context words as follows.
$$EI = \{EI_1, \ldots, EI_V\}, \quad EI_j = \sum_{c=1}^{C} e_{c,j}$$
Applying the chain rule, the gradient can be calculated as follows.
$$\frac{\partial J}{\partial w'_{ij}} = \sum_{c=1}^{C} \frac{\partial J}{\partial u_{c,j}} \frac{\partial u_{c,j}}{\partial w'_{ij}} = EI_j \cdot h_i$$
The update formula of the weight matrix W′ from the hidden layer to the output layer is obtained as follows.
$$w'^{(new)}_{ij} = w'^{(old)}_{ij} - \eta\, EI_j \cdot h_i$$
$$v'^{(new)}_{w_j} = v'^{(old)}_{w_j} - \eta\, EI_j \cdot h, \quad j = 1, 2, \ldots, V$$
Here, η is the learning rate. Similarly, the update formula of the weight matrix from the input layer to the hidden layer is obtained as follows.
$$v^{(new)}_{w_I} = v^{(old)}_{w_I} - \eta \cdot EH^{T}$$
$$EH_i = \sum_{j=1}^{V} EI_j \cdot w'_{ij}$$
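The forward pass, the loss, and the two update formulas can be put together in a single training step. The following NumPy sketch uses the full softmax over a tiny vocabulary (not the hierarchical softmax used in the experiments); the matrix shapes, the function name `skipgram_step`, and the toy sizes in the usage example are assumptions made for illustration.

```python
import numpy as np


def skipgram_step(W, W_out, k, context_ids, eta=0.1):
    """One gradient step of skip-gram with a full softmax (updates W, W_out in place).

    W:           input-to-hidden weights, shape (V, N); row k is v_{w_I}
    W_out:       hidden-to-output weights W', shape (N, V); column j is v'_{w_j}
    k:           index of the input word w_I
    context_ids: indices j*_c of the C context words
    Returns the loss J before the update.
    """
    h = W[k].copy()                       # hidden layer output h = v_{w_I}
    u = W_out.T @ h                       # u_j = v'_{w_j}^T h, shared by all panels c
    shifted = u - u.max()                 # shift for numerical stability
    exp_u = np.exp(shifted)
    y = exp_u / exp_u.sum()               # softmax output y_j
    C = len(context_ids)
    log_sum = u.max() + np.log(exp_u.sum())
    loss = -u[context_ids].sum() + C * log_sum
    # EI_j = sum_c e_{c,j} = C * y_j - (number of context words equal to j)
    EI = C * y
    np.subtract.at(EI, context_ids, 1.0)
    EH = W_out @ EI                       # EH_i = sum_j EI_j * w'_{ij} (old W')
    W_out -= eta * np.outer(h, EI)        # w'_{ij} <- w'_{ij} - eta * EI_j * h_i
    W[k] -= eta * EH                      # v_{w_I} <- v_{w_I} - eta * EH^T
    return float(loss)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, N = 5, 3                           # toy vocabulary and hidden sizes
    W = rng.normal(scale=0.1, size=(V, N))
    W_out = rng.normal(scale=0.1, size=(N, V))
    for step in range(5):
        print(step, skipgram_step(W, W_out, k=0, context_ids=[1, 2]))
```

Repeating the step on the same training pair makes the loss decrease, which gives a quick sanity check that the signs in the update formulas are consistent.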
6.4.1.7 Experiments
(Tweet Extraction and Normalization)
In this case study, we collected geotagged Tweets posted in Japan from March 11, 2015, to October 28, 2015. As a result, the size of the data set is about 1,005,610,000. Furthermore, the target Tweets are extracted from this data set by using tourism resource data provided by MLIT and the corpus is created. From the collected Tweets, about 115,610,000 Tweets containing the names of cities, regions, and landmarks were extracted. In addition, the same number of Tweets that do not include the names of cities, regions, and landmarks were extracted.
Before conducting the experiments, the extracted Tweets underwent the following normalization processing. Specifically, numbers not included in nouns and symbols such as “@, #” were removed. In addition, all symbols that are often added to the end of sentences, such as “!, ?, ♪, ✩”, are replaced with “.”. Furthermore, when the same symbol “!” is used consecutively, such as “!!!”, it is replaced with “!”.
(Benchmark)
In this case, we manually collected examples of semantic calculation such as $v_{\text{“Tokyo Tower”}} - v_{\text{“Tokyo”}} + v_{\text{“Kyoto”}} \approx v_{\text{“Kyoto Tower”}}$ with the help of five collaborators, from which 60 calculations were selected. These are used for evaluation as benchmarks, which are quantitative or qualitative indicators generally used when comparing and evaluating similar methods or products. Figure 6.21 shows examples of the benchmark used for the experiments.

Fig. 6.21 Benchmark examples

(Hyperparameter Setting)
When constructing the vector space using Word2Vec, it is necessary to set hyperparameters for model construction. The accuracy of semantic operations changes depending on the hyperparameter settings. First, the number of dimensions of the generated vectors is set to 400, and the skip-gram model and hierarchical softmax are used for model construction. The similarity between vectors is calculated using the cosine similarity.
In the experiments, the window size is also used as a parameter when constructing the vector space. This parameter determines how many words before and after the word of interest are considered in the context of the corpus. The larger the window size, the more words far from the word of interest are learned, but the longer the time required for learning. In other words, the window size determines the length of the context to consider and needs to be set appropriately. In addition, there is no upper limit, and it is possible to set a large value, but an excessively large window size may deteriorate the accuracy of semantic operations.
We will briefly explain why an excessively large window deteriorates the accuracy of semantic operations. Tweets have a limit of 140 characters that can be posted at one time, and users tend to post concise sentences. For example, let us consider the Tweet “The night view from the Sky Tree was so beautiful. Hmm, it’s school again from tomorrow”. This Tweet consists of two sentences about things that have nothing to do with each other. If the window size is set to 20, all the words in the second sentence will be considered in the vector generation of “Sky Tree”. It is therefore important to set the window size appropriately when constructing a vector space from Twitter posts. So, an upper limit value L is set for the window size, semantic operations are performed for each window size up to L, and the accuracy is evaluated to determine the optimum value of the window size.
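The effect of the window size on the training pairs can be seen with a small sketch that lists the context words of “Sky Tree” in the example Tweet. The token list below is a hypothetical, simplified tokenization of that Tweet (content words only), and the function `context_pairs` is introduced only for illustration; Word2Vec implementations may additionally subsample or randomly shrink the window.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs using up to `window` words on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def contexts_of(tokens, word, window):
    """Context words paired with `word` for the given window size."""
    return [ctx for center, ctx in context_pairs(tokens, window) if center == word]


if __name__ == "__main__":
    # Hypothetical tokenization of the example Tweet (two unrelated sentences).
    tokens = ["night", "view", "Sky Tree", "beautiful",
              "school", "again", "tomorrow"]
    for window in (1, 2, 20):
        print(window, contexts_of(tokens, "Sky Tree", window))
```

With a small window only the words of the first sentence are paired with “Sky Tree”, whereas with a window of 20 the unrelated words “school”, “again”, and “tomorrow” are included as well.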
First, in order to set the optimal L, all Tweets including the names of cities, regions, and landmarks were morphologically analyzed, and the morphemes were counted. As a result, the average number of morphemes is 14, and L is set to 20, which is sufficiently larger than this value (see Fig. 6.22). Next, the operations on the left side are performed in the vector space of each different window size, and the words whose vectors are close to the vector of the operation result are output. Then, a calculation for which the word on the right side was output within the top ten words was taken as a correct answer, and the number of such results was counted. Similarly, the calculation results output within the top 20, 30, 40, and 50 words were taken as the correct answers, and the numbers were counted. The results are shown in Fig. 6.23. As a result of the verification, it was confirmed that a large window size deteriorates the accuracy of the semantic operations and that the number of correct answers is the smallest when the window size is 20. In addition, it was confirmed that the accuracy of the semantic calculation improved as the window size increased from 1 to 5 and decreased from 6 onward. From the above results, we set the window size to 5 and used it in the experiments.
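The top-n evaluation described above can be sketched as follows. The word vectors here are hypothetical three-dimensional toy values, not vectors learned from Tweets, and excluding the three input words from the ranking is an assumption of this sketch (it is a common convention, but the text does not state it). The function names are introduced only for illustration.

```python
import numpy as np


def analogy_top_n(vectors, a, b, c, n=10):
    """Rank words by cosine similarity to v[a] - v[b] + v[c].

    The input words a, b, c are excluded from the ranking (an assumption).
    """
    words = [w for w in vectors if w not in (a, b, c)]
    target = vectors[a] - vectors[b] + vectors[c]
    target = target / np.linalg.norm(target)
    matrix = np.stack([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    sims = matrix @ target
    order = np.argsort(-sims)[:n]
    return [(words[i], float(sims[i])) for i in order]


def count_correct(vectors, benchmark, n=10):
    """Count benchmark items whose expected word appears within the top n results."""
    correct = 0
    for a, b, c, expected in benchmark:
        top_words = [w for w, _ in analogy_top_n(vectors, a, b, c, n)]
        if expected in top_words:
            correct += 1
    return correct


if __name__ == "__main__":
    # Hypothetical toy vectors: axes loosely stand for "tower", "Tokyo", "Kyoto".
    vectors = {
        "Tokyo Tower": np.array([1.0, 1.0, 0.0]),
        "Kyoto Tower": np.array([1.0, 0.0, 1.0]),
        "Tokyo": np.array([0.0, 1.0, 0.0]),
        "Kyoto": np.array([0.0, 0.0, 1.0]),
        "Sky Tree": np.array([0.9, 1.0, 0.1]),
    }
    benchmark = [("Tokyo Tower", "Tokyo", "Kyoto", "Kyoto Tower")]
    print(analogy_top_n(vectors, "Tokyo Tower", "Tokyo", "Kyoto", n=2))
    print(count_correct(vectors, benchmark, n=1))
```

Running `count_correct` for n = 10, 20, ..., 50 on vector spaces trained with different window sizes would yield counts of the kind plotted in Fig. 6.23.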
Fig. 6.22 Number of Tweets versus number of morphemes (horizontal axis: number of morphemes; vertical axis: number of Tweets, in units of ten million)

Fig. 6.23 Number of correct answers versus window size (series: top 10, top 20, top 30, top 40, and top 50 words)
(Discussion) First, we performed a semantic operation expressed as v “Sky Tree” − v “Tokyo” + v “Osaka”. The calculation result is shown in Table 6.5. As a result of the calculation, two landmarks existing in Osaka were output. The first one is “Tsutenkaku”, which was output in second place. “Sky Tree” and “Tsutenkaku” have observatory facilities and are popular tourist spots that represent Tokyo and Osaka, respectively. The second is “Harukas”, which was output in sixth place. This is a word that officially refers to “Abeno Harukas” in Abeno Ward, Osaka City. Although the categories are different, both are among the tallest in Japan and have commercial facilities.
Table 6.5 Top ten results of evaluating the expression v “Sky Tree” − v “Tokyo” + v “Osaka”

| Rank | Result spot | Similarity |
|---|---|---|
| 1 | Hikone Castle | 0.323 |
| 2 | Tsūtenkaku*1 | 0.317 |
| 3 | Asakusa | 0.311 |
| 4 | Sumida River | 0.31 |
| 5 | Marine Tower | 0.309 |
| 6 | Harukas*2 | 0.292 |
| 7 | Marugame | 0.292 |
| 8 | Light up | 0.291 |
| 9 | Sanja Festival | 0.284 |
| 10 | Nagoya Castle | 0.283 |
Next, we focused on the two port cities of Yokohama and Kobe in Japan and performed semantic operations. Specifically, we substitute landmarks that exist in Yokohama for the spot name in the expression (v spot name - v “Yokohama” + v “Kobe”) and try to extract similar landmarks that exist in Kobe (see Table 6.6). If “Bay Bridge” was given as spot name, “Akashi Kaikyo Bridge” was output in the first place. “Bay Bridge” and “Akashi Kaikyo Bridge” are suspension bridges that represent Yokohama and Kobe, respectively. Next, when “Yokohama Chinatown” was substituted, “Nankinmachi” was output in the first place. “Yokohama Chinatown” and “Nankinmachi” are Chinatowns that represent Yokohama and Kobe, respectively. Next, when “Minato Mirai” was input, “Harborland” was output. “Minato Mirai” is a redevelopment district that straddles Nishi-ku and Naka-ku, Yokohama City, and “Harborland” is a redevelopment district in Chuo-ku, Kobe City. Both redevelopment areas are located in coastal areas and are attracting attention. Furthermore, when “Marine Tower” was substituted, “Kobe Tower” was output in seventh place. This is a word that refers to “Kobe Port Tower” in Chuo-ku, Kobe. Both have observatory facilities and are popular tourist spots that represent Yokohama and Kobe, respectively. From the above calculation results, it was found that there are several similar landmarks in Yokohama and Kobe. Furthermore, from the results of multiple experiments, it was confirmed that it is generally possible to extract landmarks in areas that correspond to a given landmark in another area by using the semantic operations of landmarks.
Table 6.6 Spots as results of evaluating the expression v(spot name) − v(“Yokohama”) + v(“Kobe”)

Spot name: Bay Bridge
Rank  Result spot                     Similarity
1     Akashi Kaikyo Bridge            0.465
2     Akashi Bridge                   0.434
3     Sannomiya                       0.429
4     Kobe Tower                      0.427
5     German beer festival            0.4
6     Differently spelled Sannomiya   0.373
7     Kobe Station                    0.362
8     Kobe Bridge                     0.357
9     Kobe Port                       0.357
10    Old Hunter’s House              0.355

Spot name: Yokohama China Town
Rank  Result spot                     Similarity
1     Nankin Machi                    0.542
2     Sannomiya                       0.476
3     Differently spelled Sannomiya   0.439
4     Ijinkan                         0.394
5     Itayado                         0.39
6     Oji Zoo                         0.383
7     China Town                      0.381
8     Emperor’s Mausoleum             0.372
9     German beer festival            0.369
10    Kobe Station                    0.361

Spot name: Minato Mirai
Rank  Result spot                     Similarity
1     Sannomiya                       0.566
2     Differently spelled Sannomiya   0.505
3     NankinMachi                     0.494
4     Suma                            0.469
5     Ijinkan                         0.468
6     Old settlement                  0.461
7     Kitano                          0.456
8     Harborland                      0.454
9     German beer festival            0.452
10    Momoyamadai                     0.451

Spot name: Marine Tower
Rank  Result spot                     Similarity
1     Akashi Kaikyo Bridge            0.498
2     Sannomiya                       0.477
3     German beer festival            0.448
4     Infiorata Kobe                  0.447
5     NankinMachi                     0.434
6     Ijinkan                         0.431
7     Kobe Tower                      0.415
8     Differently spelled Sannomiya   0.415
9     Kobe Station                    0.411
10    Le Un                           0.405
6.4 Differences in Conceptual Space
6.4.2 Case of International Cuisine Notation by Analogy
6.4.2.1 Domestic Cuisine
When a foreign traveler visits a domestic (e.g., Japanese) restaurant, it may not be possible to get an overview of the dishes by looking at the menu. This happens when the restaurant describes the menu only in its domestic language and the traveler does not understand that language. In that case, it is difficult for the traveler to understand the descriptions of the dishes, and sometimes even the names of the dishes cannot be read. The traveler could try to look up cuisine information on the Internet, but Internet access is not always available. In such a situation, it is difficult for travelers to order the dishes they would like to eat. In recent years, Japan has been actively encouraging foreigners to travel to Japan. According to a survey by the Japan Tourism Agency (JTA) of MLIT (JTA 2022a), “eating Japanese food” was the most common answer to both “what was expected before visiting Japan” (73.1%) and “what was most expected before visiting Japan” (25.6%), indicating that foreign visitors to Japan have a strong interest in Japanese food. According to a questionnaire on improving the environment for accepting foreign visitors to Japan (JTA 2022b), the most common problem encountered while traveling was “difficulty in communication with facility staff” (20.6%). This was followed by problems related to “free public wireless LAN environment” (18.7%), “use of public transportation” (16.6%), and “lack of multilingual guidance/difficulty in understanding any guidance” (16.4%). Indeed, if the restaurant has a free public wireless LAN environment, the problem of “language in general” can be solved by machine translation, and the problem of “food and drink” is expected to be solved by Internet search in many cases.
However, even if the dish name and its explanation are translated sufficiently accurately and the translation result is simply shown to the restaurant clerk, the traveler may still not understand the dish. For restaurant operators who want to accept foreign tourists, describing the menu in multiple languages is considered effective. Most dish names and explanations in restaurant menus are written only in the language of the country where the restaurant is located (hereinafter referred to as the domestic language). Therefore, to create explanations in multiple languages, the explanations written in the domestic language must be translated. However, this requires knowledge of the vocabulary and grammar of the target language. Machine translation can of course be used, but it may not translate the explanations correctly. Furthermore, operators who lack prior knowledge of the target language cannot judge whether the translation result is accurate. In any case, it is important for restaurants seeking to attract foreign tourists to create menus written in multiple languages. For that purpose, we propose an international cuisine notation system based on analogy that allows foreigners to easily infer domestic dishes from similar dishes familiar to them and that can be easily extended to multiple languages. Generating explanations in the proposed notation requires various knowledge about both the dishes being explained and those used for analogy, so it is difficult to do manually. Therefore, we propose a method to automatically generate explanations based on this notation using the data of Cookpad (Cookpad 2022), a user-submitted recipe site popular in Japan. Cookpad is suitable because it contains not only recipes for Japanese dishes but also recipes for dishes from all over the world.
6.4.2.2 Analogical Explanation of Dishes by Difference
The following principle of difference will be explained concretely by taking cuisines (dishes) as an example of a concept (Nobumoto et al. 2017).
• An unknown concept can be explained by explicitly expressing the difference from a similar known concept.
This principle can be explained from the viewpoint of analogy as follows. A dish and its ingredients are the components of the analogy. The relationship between the dish and the ingredients corresponds to the relationship in the analogy. Indeed, the components of different dishes do not necessarily have a one-to-one correspondence. However, the characteristics of a dish can be described by the difference in components. As an integrated hypothesis, we propose an international cuisine notation method based on analogy to facilitate the extension of restaurant menus to multiple languages. Specifically, the proposed cuisine notation method is based on the differences between similar dishes in two countries as follows.

domestic dish = foreign dish + differences. (6.1)
In Eq. (6.1), the left-hand side shows any dish in one country. The right-hand side is an expression resembling a mathematical formula, consisting of the name of a dish of another country that is similar to the dish on the left-hand side and the elements representing the signed (+ or −) differences between those two dishes. By using this notation, it is possible to create an international menu as shown in Fig. 6.24. Please note that this expression is not a mathematical vector operation. Since the proposed cuisine notation method is simpler than the conventional notation method using sentences, translation is considered easy. That is, descriptions based on the proposed notation are written only with nouns such as “garlic” and “cabbage”, so there is no need to interpret sentences, and the descriptions can be translated word by word. Since the proposed notation is a compact description as in Eq. (6.1), the amount of explanation that the reader needs to read is also small. In traditional dish explanations, the number of sentences must be increased to explain the dishes in detail, but the reader is then burdened by the increased amount of reading. On the other hand, if the explanation is simply shortened to reduce the amount of reading, the amount of information explaining the dish is reduced accordingly, and the reader may not be able to fully understand the dish. In contrast, the proposed notation method uses similar dishes that the reader is supposed to know in order to explain unknown dishes, so the amount of description can be minimized and the amount of reading reduced while maintaining the necessary information (i.e., Occam’s razor). In other words, by substituting a similar dish for much of the information on the dish to be explained, it is possible to suppress the loss of information caused by the reduction in the amount of description.

Fig. 6.24 Advantages of the proposed cuisine notation system
6.4.2.3 Proposed Method
We introduce a method for automatically generating dish descriptions based on the international cuisine notation by analogy. This method uses the recipe data of the user-submitted recipe site Cookpad to generate a dish description by the following three-step procedure.

Step 1: Generate a dish feature vector
Here, we explain a method for obtaining the features of dishes using words specific to recipes. Dishes are considered to consist of various elements such as ingredients, cooking processes, tastes, and cooking utensils. Therefore, in this case, words specific to recipes are extracted from the recipe data and classified into categories such as ingredients and cooking processes. After that, a feature vector whose components are word frequencies is generated for each category. The procedure for generating the feature vector is shown below.

(1) Creation of a dish name list for each country
The list of dish names of each country used in this case is prepared in advance. In practice, dishes from each country are collected from websites and books, and one dish list containing L dishes is made. The names of dishes from foreign countries must be included in this list. This list is searched for a foreign dish (the leftmost item on the right-hand side of Eq. 6.1) similar to the domestic dish to be explained (the left-hand side of Eq. 6.1). From now on, the dish to be explained is referred to as the query.

(2) Extraction of recipe-specific words
In this case, among the words that appear in the cooking procedures of the recipe data, words that rarely appear in general documents and frequently appear in cooking procedures are defined as recipe-specific words. In addition to the recipe data, Japanese Wikipedia (Wikipedia 2022) data are used as general documents to extract recipe-specific words. Morphological analysis is applied to the recipes of Cookpad and the pages of Wikipedia, and the score $SC_{df}(w)$, which shows the recipe specificity of each word w that appears in both Cookpad and Wikipedia, is obtained by the following formula.

\[ SC_{df}(w) = \frac{df_{recipe}(w)}{df_{wp}(w)} \qquad (6.2) \]

Here, the document frequency $df_{recipe}(w)$ is the number of recipes in which w appears, and $df_{wp}(w)$ is the number of Wikipedia pages in which w appears. The 500 words with the highest $SC_{df}(w)$ scores are treated as recipe-specific words. Table 6.7 shows the top five extracted recipe-specific words. Nouns and verbs are mixed among the recipe-specific words. In this case, these recipe-specific words were manually classified into several categories such as ingredients and cooking processes.

Table 6.7 Examples of recipe-specific words

Recipe-specific word                 Recipe specificity
中火 Chūbi (Medium heat)              8142.1
Shinnari (Tender)                    4883.1
溶く Toku (Dissolve)                  4735.5
流し入れる Nagashi-ireru (Pour)        3383.6
ゆでる Yuderu (Boil)                  2584.3

(3) Feature vector generation Method I
In this case, we propose two separate methods for generating feature vectors for each dish using the generated list of dishes from a certain country. Here, the feature vector considering the weight of the dish will be explained (hereinafter referred to as Method I). This vector is based on the frequency of categorized recipe-specific words. Using category c (1 ≤ c ≤ C), the recipe-specific word $w_{c,m}$ (1 ≤ m ≤ M) of category c, and dish name l (1 ≤ l ≤ L) in the dish name list of the country, the feature vector $v_a(c, l)$ of the dish name l in the category c is generated by the following formulas.

\[ v_a(c, l) = (a_1, a_2, \ldots, a_m, \ldots, a_M) \qquad (6.3) \]

\[ a_m = \frac{df(w_{c,m}, l)}{|N_l|} \qquad (6.4) \]

Here, the document frequency $df(w_{c,m}, l)$ is the number of recipes that include l in the recipe name and $w_{c,m}$ in the recipe text, and $|N_l|$ is the number of recipes that include l in the recipe name. In addition, the feature vector $v_a(c, q)$ of the query dish q is generated in the same manner, with q in place of l in the above formulas.

(4) Feature vector generation Method II
Next, as another method for generating the feature vector, a feature vector considering the weights of the sentences of the cooking procedure is generated by the following formula (hereinafter referred to as Method II).

\[ b_m = \frac{df(w_{c,m}, l)}{df(w_{c,m}, \mathrm{all})} \qquad (6.5) \]
Here, $df(w_{c,m}, \mathrm{all})$ is the number of recipes, across all recipe data, that include the recipe-specific word $w_{c,m}$ in their cooking procedures. Method II also generates a query feature vector, as in Method I.

Step 2: Extract similar dishes
Using the generated feature vectors, we find the similarity between each dish name in the dish name list and the query. The importance of the factors that characterize dishes, such as ingredients and cooking processes, is considered to differ from factor to factor. For example, ingredients are thought to be more important for a dish than cooking utensils. Therefore, in this case, the importance of each category for the extraction of similar dishes is determined by preliminary experiments. The similarity is calculated with the importance as the weight of each vector, and similar dishes are finally extracted by using the similarity. The similarity between the dish name l and the query dish name q in category c, $\mathrm{sim}_c(l, q)$, is defined as the cosine similarity of the feature vectors v(c, l) and v(c, q). Using $\mathrm{sim}_c(l, q)$ and the importance $\alpha_c$ of the category c, the similarity $\mathrm{sim}_{all}(l, q)$ between each dish name l and the query q is calculated by the following formula.

\[ \mathrm{sim}_{all}(l, q) = \prod_{i=1}^{C} \mathrm{sim}_i(l, q)^{\alpha_i} \qquad (6.6) \]
The importance $\alpha_i$ used here was obtained by a preliminary experiment. The dish with the highest $\mathrm{sim}_{all}(l, q)$ is extracted as the dish s that is similar to the query q.

Step 3: Output the difference
We extract the differences between the elements of the query and those of the dish similar to it and represent them as a formula-like expression. In this case, the elements to be extracted are narrowed down to the ingredients so that the expression is easy to read and understand. The feature vector of the ingredient category is based on the frequency of ingredients in dishes. Therefore, the ingredients that frequently appear in only one of a pair of dishes can be extracted by finding the difference between the feature vector of the query and that of the similar dish. The elements that correspond to the differences between the query q and the similar dish s are extracted and output as an expression. The feature vectors v(s) and v(q) of the ingredient category are used to extract the difference. The vector $v_{diff}(s, q)$, the difference between v(s) and v(q), is generated by the following formula.

\[ v_{diff}(s, q) = v(s) - v(q) \qquad (6.7) \]

The P ingredients $w_p$ (1 ≤ p ≤ P) with the largest absolute values in $v_{diff}(s, q)$ are extracted as the elements that correspond to the differences between the dishes s and q. Using these elements, an output is generated as in Eq. (6.1). Let r be the expression on the right-hand side of Eq. (6.1). r is initialized with r ← s and then extended by concatenating each $w_p$ according to the following updates.

\[ r \leftarrow \begin{cases} r + w_p & (a_p < 0) \\ r & (a_p = 0) \\ r - w_p & (a_p > 0) \end{cases} \qquad (6.8) \]

Here, $a_p$ refers to the value of the element of $v_{diff}(s, q)$ that corresponds to $w_p$. Roughly speaking, Steps 1 and 2 correspond to data management and Step 3 corresponds to data analysis or hypothesis generation in the above process.
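The three steps can be put together in a short sketch. This is only an illustration under stated assumptions, not the implementation evaluated in the experiments: the toy recipes, the dish names, the category word lists, and the importance weights `alpha` are all hypothetical, each recipe is assumed to be already reduced to the set of recipe-specific words it contains, and only Method I (Eq. 6.4) is used for the feature vectors.

```python
import numpy as np

def method1_vector(recipes, dish, words):
    """Eq. (6.4): fraction of recipes named `dish` whose text contains each word."""
    named = [r for r in recipes if dish in r["name"]]
    if not named:
        return np.zeros(len(words))
    return np.array([sum(w in r["text"] for r in named) / len(named) for w in words])

def cosine(u, v):
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu > 0 and nv > 0 else 0.0

def overall_similarity(recipes, dish, query, categories, alpha):
    """Eq. (6.6): product over categories c of sim_c(l, q) ** alpha_c."""
    sim = 1.0
    for c, words in categories.items():
        sim_c = cosine(method1_vector(recipes, dish, words),
                       method1_vector(recipes, query, words))
        sim *= sim_c ** alpha[c]
    return sim

def difference_expression(recipes, similar, query, ingredients, top_p=3):
    """Eqs. (6.7)-(6.8): render the query as 'similar dish + x - y'."""
    diff = (method1_vector(recipes, similar, ingredients)
            - method1_vector(recipes, query, ingredients))    # v(s) - v(q)
    order = np.argsort(-np.abs(diff), kind="stable")[:top_p]  # largest |a_p| first
    r = similar
    for p in order:
        if diff[p] < 0:    # more frequent in the query: add it
            r += f" + {ingredients[p]}"
        elif diff[p] > 0:  # more frequent in the similar dish: subtract it
            r += f" - {ingredients[p]}"
    return r

# Hypothetical toy data: "text" is the set of recipe-specific words in each recipe.
recipes = [
    {"name": "okonomiyaki", "text": {"flour", "egg", "cabbage", "pork", "fry"}},
    {"name": "easy okonomiyaki", "text": {"flour", "egg", "cabbage", "fry"}},
    {"name": "pancake", "text": {"flour", "egg", "milk", "fry"}},
    {"name": "fluffy pancake", "text": {"flour", "egg", "milk", "butter", "fry"}},
    {"name": "pizza", "text": {"flour", "cheese", "tomato", "bake"}},
]
categories = {
    "ingredient": ["flour", "egg", "cabbage", "pork", "milk", "butter", "cheese", "tomato"],
    "process": ["fry", "bake"],
}
alpha = {"ingredient": 1.0, "process": 0.5}  # hypothetical importance weights

query = "okonomiyaki"
candidates = ["pancake", "pizza"]  # hypothetical foreign dish list
similar = max(candidates,
              key=lambda l: overall_similarity(recipes, l, query, categories, alpha))
expression = difference_expression(recipes, similar, query, categories["ingredient"])
print(f"{query} = {expression}")  # okonomiyaki = pancake + cabbage - milk + pork
```

The sign convention follows Eq. (6.8): a negative element of v(s) − v(q) means the ingredient is more frequent in the query, so it is added to the similar dish. The stable sort is used only to make the order deterministic when two ingredients have the same absolute difference.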
6.4.2.4 Experiments
The two methods were implemented using the Cookpad recipe data, consisting of about 1.71 million recipes, and the performance of each method was evaluated by human subjects. Method I and Method II were found to perform similarly in extracting similar dishes. On the other hand, Method I was more capable of extracting the main ingredients of the two dishes. This is because the element value of the feature vector generated by Method II is lower for words that appear more frequently in the recipe texts of the whole recipe data. Therefore, Method I is considered more appropriate as a method for extracting elements as differences.
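The reason given above can be illustrated with a small calculation. The counts below are hypothetical and chosen only to show the effect: a main ingredient that is common across all recipes receives a high Method I value (Eq. 6.4) but a very low Method II value (Eq. 6.5), while a rare minor ingredient shows the opposite pattern.

```python
def method1_value(df_word_dish, n_dish):
    """Eq. (6.4): a_m = df(w, l) / |N_l|."""
    return df_word_dish / n_dish

def method2_value(df_word_dish, df_word_all):
    """Eq. (6.5): b_m = df(w, l) / df(w, all)."""
    return df_word_dish / df_word_all

# Hypothetical counts for one dish that has 100 recipes.
n_dish = 100
counts = {
    # word: (recipes of this dish containing the word, all recipes containing the word)
    "egg": (90, 45000),    # main ingredient here, but common in all recipes
    "yuzu peel": (5, 50),  # minor ingredient here, rare in all recipes
}
a = {w: method1_value(d, n_dish) for w, (d, _) in counts.items()}
b = {w: method2_value(d, total) for w, (d, total) in counts.items()}
print(a)  # {'egg': 0.9, 'yuzu peel': 0.05}
print(b)  # {'egg': 0.002, 'yuzu peel': 0.1}
```

Under these assumed counts, Method I ranks the main ingredient first, whereas Method II ranks the rare minor ingredient first, which is consistent with the observation that Method I is better suited to extracting main ingredients as differences.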
6.5 Difference Between Hypotheses
In this section, we explain a hypothesis generation method in which a new hypothesis is created, as an integrated hypothesis, from the difference between hypotheses. That is, the difference is calculated between hypotheses obtained separately, and the resulting difference is the final hypothesis. The following two case studies will be explained as examples.
• Candidate free Wi-Fi access spots available to foreign visitors are determined by the differences between hypotheses obtained from two different sets of geotagged social data.
• Candidate DNAs related to a genetic disorder (i.e., genetic disease) are extracted from the differences between the DNA sequences of the samples with the disease and those of the samples without the disease (genome-wide association study).
6.5.1 Case of Discovery of Candidate Installation Sites for Free Wi-Fi Access Point
6.5.1.1 Free Wi-Fi and Social Data
In recent years, the public and private sectors have been actively engaged in activities to attract foreign tourists to Japan. As a result, about 31.88 million foreigners visited Japan in 2019, the largest number ever (JNTO 2022). Increasing the number of foreign visitors to Japan is an important issue because consumer behavior in tourism is expected to have a significant ripple effect on the entire industry, including accommodation, food and drink, and other services. To further increase the number of foreign visitors to Japan, it is important to attract foreign tourists who have never visited Japan as well as repeat visitors. In particular, to attract repeat visitors, it is important to increase the attractiveness of Japanese tourist spots and to provide information such as recommendations for new tourist spots, but it is also important to reduce foreign visitors’ dissatisfaction with Japan. According to MLIT, the results of a questionnaire survey on the inconvenience and dissatisfaction of foreign visitors with Japan’s welcoming environment (JTA 2022b) are as follows. “Difficulty in communication with the staff of facilities” was the most common (20.6%), followed by issues related to “Free public wireless LAN environment” (18.7%), “Use of public transportation” (16.6%), and “Lack of multilingual guidance/Difficulty in understanding any guidance” (16.4%). Of these, issues related to “communication” and “use of public transportation” are expected to be solved by using a translator or searching on the Internet if a free public wireless LAN environment is available. Therefore, it is important to improve the “free public wireless LAN environment” to resolve the dissatisfaction of foreign visitors with Japan. If we increase the number of places with free Wi-Fi that is selectable, accessible, and high quality (SAQ2), we expect that the dissatisfaction of foreign visitors with free Wi-Fi in Japan will decrease. However, considering the cost, such free Wi-Fi should be installed efficiently in places that many foreigners are expected to visit. Therefore, among the above three factors, selectability and quality should be considered only after accessible free Wi-Fi available to foreign visitors has been installed. In addition, it is difficult for the national and local governments to know all the points where accessible free Wi-Fi is available. Indeed, in recent years, some local governments have disclosed the points where free Wi-Fi can be used. However, since free Wi-Fi is mainly installed by private companies, information must be collected for each provider.
Therefore, it is difficult for the governments to comprehensively grasp the installation locations of free Wi-Fi. With the spread of smartphones, the number of social media services that allow posting content with location information (geotagging) is increasing. Among them, Twitter, a microblogging service, is easy to use on smartphones, so many users post short messages (i.e., Tweets) on the spot about their actions and emotions. In addition, since location information can be added to Tweets, it is possible to grasp where and when the posters were. However, some foreign visitors to Japan do not use mobile phones, so they cannot post Tweets in areas where free Wi-Fi does not exist. On the other hand, Flickr (Flickr 2022) is a photo-sharing service, and users upload photos taken at tourist spots to the site. Recent digital devices such as digital cameras can add position and time information to photos, so it is possible to grasp where and when the foreigners posting the photos visited in Japan. Many of them cannot upload photos on the spot because free Wi-Fi does not exist everywhere. However, unlike with Twitter, it is also common to store photos on a device and upload them later using network services provided by facilities such as hotels and cafes. Therefore, in this case, we visualize the areas where accessible free Wi-Fi does not exist by analyzing the difference between the distributions of Twitter articles and Flickr photos posted by foreign users with respect to posting or shooting areas.
Fig. 6.25 Overview of the process flow. Stages shown in the figure: Social Data; Data Collection (geotagged Tweets, geotagged photos); Preprocessing (extraction of foreign visitors); Analysis (make cell grids with 30 m unit side, count users per cell, normalize counts over cell grids, select cells with counts over the threshold as the targets); Results (visualization, questionnaire survey)
We focus on the difference in the characteristics of these social data. In some areas, the number of posted Flickr photos is large, but the number of posted Twitter articles is small. Such areas are expected to be areas where Twitter cannot be used due to the lack of accessible free Wi-Fi, even though they attract many foreign visitors. In addition, the proposed method analyzes the spots where accessible free Wi-Fi can be used without considering differences among providers. In this case, we aim to develop an application system that visualizes areas in Japan where accessible free Wi-Fi is not installed but which many foreign tourists visit (Ishikawa and Miyata 2021; Mitomi et al. 2016). Figure 6.25 shows an overview of the process flow of the system. This system can support decision-making as to which areas should be equipped with accessible free Wi-Fi. By providing the visualization results to foreigners visiting Japan, it also enables them to take measures, such as obtaining information in advance in areas with accessible free Wi-Fi, before heading to areas without it.
6.5.1.2 Proposed Method
First, we explain a method for visualizing areas with tourist spots and those with accessible free Wi-Fi. Next, we explain a method for visualizing posters of Tweets who have mobile communication means. Regarding the relationship between regions and free Wi-Fi, there are regions where free Wi-Fi does not exist, regions where free but not accessible Wi-Fi exists, and regions where accessible free Wi-Fi exists. However, since free Wi-Fi that is not accessible is difficult for foreign visitors to use, we treat such regions in the same way as regions where free Wi-Fi does not exist. Therefore, in this case, areas with accessible free Wi-Fi are extracted, and the other areas are visualized as areas without accessible free Wi-Fi. In this study, to analyze foreigners visiting Japan, we first extract foreign users from Twitter and Flickr, respectively. Next, the foreign users are classified into foreigners visiting Japan and foreigners residing in Japan, and analysis is performed based on the results.

(1) Extraction of Foreign Visitors to Japan
We extract foreign visitors separately from Twitter and Flickr data as follows. First, we explain how to extract Tweets posted by foreign visitors to Japan. We collect Tweets using the Twitter API. We apply the authors’ method (Ishikawa and Miyata 2021) to distinguish between foreigners visiting Japan and foreigners residing in Japan among those who posted Tweets. That is, we judge users who posted Tweets in Japan in languages other than Japanese to be foreigners. Then, we estimate the length of the period over which they posted Tweets in Japan. By comparing the estimate with a predetermined threshold, we determine whether the posters are foreign visitors to Japan (i.e., short-stay visitors) or foreign residents in Japan. In this study, only the Twitter users judged to be foreign visitors to Japan (i.e., foreign Twitter users) are analyzed. Next, we explain the method for extracting the Flickr photos and target users used for analysis. Using the Flickr API, we collect photos that were taken in Japan and posted to Flickr. If the photographers of those photos (i.e., Flickr users) have set the residence information in their Flickr profile to include place names in Japan, the users are judged to live in Japan. Otherwise, the users are judged to live outside Japan. In this case study, only the foreign Flickr users residing outside Japan are targeted for analysis.
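The Twitter-side classification can be sketched as below. The text does not specify how the length of stay is estimated or what threshold is used, so this sketch makes several assumptions that should be read as hypothetical: the stay is approximated by the span between a user's first and last geotagged Tweet in Japan, the threshold `max_stay_days` is set to 30 days, and any user with a Japanese-language Tweet is treated as a domestic user. The language code `"ja"` and the tuple layout of the input are also illustrative choices.

```python
from collections import defaultdict
from datetime import date

def classify_users(tweets, max_stay_days=30):
    """Split non-Japanese-language users into visitors and residents.

    tweets: iterable of (user_id, lang, posted_on) for Tweets geotagged in Japan.
    max_stay_days is a hypothetical threshold; the source does not state its value.
    """
    langs, days = defaultdict(set), defaultdict(list)
    for user, lang, posted_on in tweets:
        langs[user].add(lang)
        days[user].append(posted_on)
    visitors, residents = set(), set()
    for user in langs:
        if "ja" in langs[user]:
            continue  # assumption: a Japanese-language Tweet marks a domestic user
        # Assumption: stay length = span between first and last Tweet in Japan.
        stay = (max(days[user]) - min(days[user])).days + 1
        (visitors if stay <= max_stay_days else residents).add(user)
    return visitors, residents

# Hypothetical toy input.
tweets = [
    ("u1", "en", date(2014, 7, 1)), ("u1", "en", date(2014, 7, 6)),
    ("u2", "en", date(2014, 7, 1)), ("u2", "en", date(2014, 12, 1)),
    ("u3", "ja", date(2014, 8, 1)),
]
visitors, residents = classify_users(tweets)
print(visitors, residents)  # {'u1'} {'u2'}
```

Only the users in `visitors` would be passed on to the subsequent analysis as foreign Twitter users.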
(2) Extraction of Areas with Tourist Attractions and Areas with Accessible Free Wi-Fi
We explain a method for determining whether there is accessible free Wi-Fi in each region, based on the number of Tweets posted by foreign Twitter users and the number of photos taken by foreign Flickr users. First, based on latitude and longitude, a grid consisting of cells with a side of about 30 m is generated within the range to be analyzed. Then, the number of foreign Twitter users included in each cell is counted. The result is normalized so that the sum of the cell values is 1. A cell whose normalized value exceeds the predetermined threshold is defined as an area where accessible free Wi-Fi exists (hereinafter referred to as a free Wi-Fi spot). The same process is performed for the number of foreign Flickr users, and a cell whose value exceeds the predetermined threshold is regarded as an area with tourist attractions (i.e., a tourist spot) visited by many foreign visitors.

(3) Identification of Users Who Have Mobile Communication Means
Based on the number of Tweets that each foreign Twitter user posted in areas with accessible free Wi-Fi and the number posted in areas without it, we determine whether each user has mobile communication means, as follows. First, for each user, the ratio of the number of Tweets posted in areas determined not to be free Wi-Fi spots to the total number of the user’s Tweets in the region including those areas is calculated. Next, a user whose ratio exceeds the predetermined threshold is regarded as a user who has mobile communication means. Since it is difficult to judge accurately for users who posted fewer than 5 Tweets, such users are excluded.
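The grid-based extraction in (2) and the user classification in (3) can be sketched as follows. This is a minimal sketch under stated assumptions rather than the system's implementation: the conversion from degrees to metres uses a rough flat-earth approximation, the grid origin and the toy posts are hypothetical, and the function names are illustrative. The cell threshold of 0.02 and the ratio threshold of one half are the values reported for the experiments in the following subsection.

```python
import math
from collections import defaultdict

CELL_M = 30.0  # cell side in metres ("about 30 m" in the text)

def cell_of(lat, lon, lat0, lon0):
    """Map a coordinate to a (row, col) cell index relative to the origin (lat0, lon0)."""
    m_per_deg_lat = 111_320.0  # rough metres per degree of latitude (assumption)
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(lat0))
    return (math.floor((lat - lat0) * m_per_deg_lat / CELL_M),
            math.floor((lon - lon0) * m_per_deg_lon / CELL_M))

def hot_cells(posts, lat0, lon0, threshold=0.02):
    """Cells whose normalized count of distinct users exceeds the threshold.

    posts: iterable of (user_id, lat, lon). Applied to Tweets this yields free
    Wi-Fi spots; applied to Flickr photos it yields tourist spots.
    """
    users = defaultdict(set)
    for user, lat, lon in posts:
        users[cell_of(lat, lon, lat0, lon0)].add(user)
    total = sum(len(u) for u in users.values())  # normalizer: cell values sum to 1
    return {c for c, u in users.items() if len(u) / total > threshold}

def users_with_mobile(posts, wifi_cells, lat0, lon0, ratio_threshold=0.5, min_posts=5):
    """Users who posted more than `ratio_threshold` of their Tweets outside Wi-Fi cells."""
    total, outside = defaultdict(int), defaultdict(int)
    for user, lat, lon in posts:
        total[user] += 1
        if cell_of(lat, lon, lat0, lon0) not in wifi_cells:
            outside[user] += 1
    # Users with fewer than `min_posts` Tweets are excluded, as in the text.
    return {u for u, n in total.items()
            if n >= min_posts and outside[u] / n > ratio_threshold}

# Hypothetical toy data: three users in one cell, one user in another cell.
ORIGIN = (35.4500, 139.6300)
posts = [("u1", 35.4501, 139.6301), ("u2", 35.4501, 139.6301),
         ("u3", 35.4501, 139.6301), ("u4", 35.4510, 139.6301)]
wifi = hot_cells(posts, *ORIGIN, threshold=0.4)  # toy threshold for only four users
print(wifi)  # {(0, 0)}
```

Counting distinct users per cell, rather than posts, follows the description in (2) and keeps a single very active user from turning a cell into a free Wi-Fi spot or tourist spot on their own.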
6.5.1.3 Integrated Analysis
Here, we generate and visualize relevant hypotheses and discuss the results with respect to free Wi-Fi spots and tourist spots. First, the data set used for visualization is described. During the period from July 1, 2014, to February 28, 2015, foreign Twitter users visiting Japan posted 7596 Tweets around Sakuragicho Station, and the number of these users was 1269. Of the 1269 users, 244 were identified as users with mobile communication means. In addition, 186 foreign Flickr users visiting Japan took 2132 photos around Sakuragicho Station during the same period. The reason for using the areas around Sakuragicho for the experiment is as follows. Accessible free Wi-Fi was installed between July 2014 and January 2015 at the Red Brick Warehouse, a famous tourist spot in the area. In addition, there is a tourist spot called Osanbashi Pier to the southeast of the Red Brick Warehouse. Yokohama City’s Official Visitors Guide recommends, as a sightseeing route, going to Osanbashi Pier after the Red Brick Warehouse, so it is highly probable that tourists who visit either the Red Brick Warehouse or Osanbashi Pier will visit the other as well. However, no accessible free Wi-Fi is installed at Osanbashi Pier. Therefore, by visualizing the areas around these two adjacent tourist spots, the effectiveness of the proposed method can be confirmed.

(Hypothesis 1) Areas with Accessible Free Wi-Fi Spots
Figure 6.26a shows the result of visualizing only the cells that were determined to be free Wi-Fi spots using the numbers of foreign Twitter users. When the value of a cell exceeded the threshold (set to 0.02 in the preliminary experiment), it was visualized as a cell with accessible free Wi-Fi. If the value of the cell is small, the cell is displayed in a color close to blue, and if the value is large, in a color close to red. In Fig. 6.26a, the cells including the Red Brick Warehouse were visualized, but the cells including Osanbashi Pier were not. This is because free Wi-Fi available to foreign visitors to Japan has been installed in the Red Brick Warehouse since 2014, so foreign visitors can use free Wi-Fi when visiting the Red Brick Warehouse, whereas free Wi-Fi is not installed at Osanbashi Pier (as of the time of the survey), so users without mobile communication means cannot post Tweets there. In addition, there are Starbucks Coffee stores near Sakuragicho and Minatomirai Stations. Starbucks Coffee stores have free Wi-Fi called at STARBUCKS Wi2 that can be used by foreign visitors to Japan (Starbucks 2022). Therefore, many foreigners visiting Japan posted Tweets using at STARBUCKS Wi2, and it is expected that these areas appeared as free Wi-Fi spots in this experiment.

Fig. 6.26 a Visualization of free Wi-Fi spots based on posted Tweets, b visualization of tourist spots based on posted Flickr photos

(Hypothesis 2) Areas with Tourist Attractions
Figure 6.26b shows the result of visualizing only the cells that were determined to be tourist spots based on the numbers of foreign Flickr users. As with the free Wi-Fi spots, when the value of a cell exceeds the threshold (0.02), the cell is visualized as a tourist spot visited by many tourists. In Fig. 6.26b, Osanbashi Pier is extracted, whereas it is not extracted in Fig. 6.26a. This is because Flickr is a popular photo-sharing site: tourists take photos at spots they are interested in and upload them to Flickr later from any spot where free Wi-Fi is available. Osanbashi Pier is ranked 11th out of all 346 tourist destinations in Yokohama City on TripAdvisor (TripAdvisor 2022), which is often used when searching for tourist destination information. The Red Brick Warehouse near Osanbashi Pier is ranked third according to the site (as of the time of the survey). So, it is expected that these tourist spots will be visited together.

(Hypothesis 3) Areas with Tourist Attractions and without Accessible Free Wi-Fi Spots
Comparing Fig. 6.26a and b, the cells containing Osanbashi Pier are not extracted in Fig. 6.26a, but they are extracted in Fig. 6.26b, as mentioned above. The Red Brick Warehouse and Osanbashi Pier are very close to each other, so it is highly likely that foreigners who visit the Red Brick Warehouse also visit Osanbashi Pier. In fact, about 50% of the foreign Flickr users who took photos at the Red Brick Warehouse also took photos at Osanbashi Pier, which suggests that about 50% of the users who visited the Red Brick Warehouse also visited Osanbashi Pier. On the other hand, less than 10% of the foreign Twitter users who posted Tweets at the Red Brick Warehouse also posted Tweets at Osanbashi Pier. From these facts, it is expected that users who posted Tweets at the Red Brick Warehouse did not post Tweets at Osanbashi Pier because there is no accessible free Wi-Fi there. In addition, colored cells appeared at Sakuragicho and Minatomirai Stations in Fig. 6.26a, whereas they did not appear in Fig. 6.26b. This is because, as mentioned above, accessible free Wi-Fi such as at STARBUCKS Wi2 is available there, so it is highly probable that foreigners use the accessible free Wi-Fi at those points to post on Twitter. In fact, we found many Tweets indicating that the posters had arrived at Minato Mirai and stopped by Starbucks Coffee stores. On the other hand, with regard to Flickr, there are no buildings or landscapes that are attractive to foreigners around the stations, so it is thought that not many photos were taken there. In this way, the difference between Hypothesis 2 (areas with tourist spots) and Hypothesis 1 (areas with accessible free Wi-Fi spots) gives Hypothesis 3 (areas with tourist spots and without accessible free Wi-Fi spots). This is also an example of parallel analysis.
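Treating each hypothesis as a set of grid cells, the difference between hypotheses reduces to a set difference. The sketch below uses place names as stand-ins for grid cells; the labels mirror the spots discussed above, and the function name is hypothetical.

```python
def hypothesis_difference(tourist_cells, wifi_cells):
    """Hypothesis 3 = Hypothesis 2 minus Hypothesis 1 (set difference of cells)."""
    return set(tourist_cells) - set(wifi_cells)

# Place names stand in for grid cells (illustrative labels only).
h1_wifi = {"Red Brick Warehouse", "Sakuragicho Station", "Minatomirai Station"}
h2_tourist = {"Red Brick Warehouse", "Osanbashi Pier"}

h3 = hypothesis_difference(h2_tourist, h1_wifi)
print(h3)  # {'Osanbashi Pier'}
```

The result contains the cells that attract many foreign visitors but show no sign of accessible free Wi-Fi, that is, the candidate installation sites.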
(Hypothesis 4) Consideration on the Possession of Mobile Communication Means
Even in areas where there is no accessible free Wi-Fi, users can post to social media if they have mobile communication means. Figure 6.27a and b show the results of visualizing cells where many users with mobile communication means posted Tweets and cells where many users without mobile communication means posted Tweets, respectively. As users who have mobile communication means, we visualized foreign users who posted more than half of their Tweets during their visit in areas that were judged to have no accessible free Wi-Fi. Comparing Fig. 6.27a and b, the Red Brick Warehouse is extracted in both maps, but Osanbashi Pier and Yamashita Park are not extracted in Fig. 6.27b. It is considered that the cells containing the Red Brick Warehouse are extracted regardless of the presence of mobile communication means because accessible free Wi-Fi is available at the Red Brick Warehouse. On the other hand, it is considered that the cells including Osanbashi Pier are extracted only in Fig. 6.27a because foreign users who have mobile communication means can post Tweets even in areas without accessible free Wi-Fi. Similarly, Yamashita Park is also recommended as a tourist spot, but it has no accessible free Wi-Fi. Therefore, it is considered that the cells including Yamashita Park are extracted in Fig. 6.27a but not in Fig. 6.27b.
Fig. 6.27 a Visualization of spots where many Tweets were posted by the users with mobile communication means. b Visualization of spots where many Tweets were posted by the users without mobile communication means
6.5.2 Case of Analyzing Influence of Weather on Tourist Behavior
We conducted joint research with the Tokyo Metropolitan Government on inbound services for foreign visitors to Japan for the Tokyo Olympics and Paralympics, named “Tourist Behavior Analysis Demonstration Project Utilizing Big Data”. As part of the project, we analyzed changes in tourists’ behaviors under different weather conditions (Ishikawa and Miyata 2021). Open weather data and Twitter social data with temporal and locational information were used to detect how satisfactory spots differ between sunny and rainy days, based on the method explained in Sect. 6.2.2. Figure 6.28 visualizes the spots that are determined to be satisfactory on sunny days (below) and rainy days (above), which correspond to the following hypotheses.
• Sunny day hypothesis: Spots that are popular with foreign tourists on sunny days.
• Rainy day hypothesis: Spots that are popular with foreign tourists on rainy days.
New hypotheses can then be obtained by performing the following operations between the hypotheses.
• Sunny day hypothesis − rainy day hypothesis (difference).
• Sunny day hypothesis ∩ rainy day hypothesis (intersection).
The first of these finds spots that are popular with foreigners only on sunny days, whereas the second finds spots that are popular with foreigners on both sunny and rainy days. As a result, the following observation is obtained.
Fig. 6.28 Changes in satisfactory spots due to weather and temperature. Upper panel (satisfactory spots on rainy days): there are more positive Tweets around Harajuku Station (Takeshita St) on rainy days than on sunny days. Lower panel (satisfactory spots on sunny days): there are more positive Tweets around Shibuya Station on sunny days than on rainy days
• On rainy days, the satisfactory spots for foreigners change from Shibuya to Harajuku. Shibuya is famous for Shibuya Crossing, a photogenic spot on sunny days, whereas Harajuku is famous for Takeshita Street, an all-weather spot where visitors can enjoy shopping for fashionable clothes and eating crepes, as the contents of the Twitter posts show.
6.5.3 GWAS
A genome-wide association study (GWAS) (Uffelmann et al. 2021) uses big data on genes to identify associations between gene regions (loci) and traits (including diseases). In short, GWAS is a method for generating hypotheses that correspond to the associations to be identified. It has long been known that genetic variations between individuals can cause phenotypic differences. Disease-related mutations on the same haplotype (a sequence of genes derived from one parent) are more frequent in the case group (i.e., individuals with the trait of interest) than in the control group (i.e., individuals without the trait). Figure 6.29 schematically shows the typical allele distribution that GWAS tries to identify. Therefore, the difference in allele frequency between the case group and the control group for a certain disease is used to determine the mutations related to the disease. That is, statistical analysis is performed on data (the case group and the control group) collected by focusing on the disease of interest to confirm the possibility that a mutation is associated with the disease.
Fig. 6.29 GWAS identifies disease-associated loci based on differences between allele frequencies in case and control groups (labels: case group, control group, allele, SNP array)
Table 6.8 Contingency table

            Case group D    Control group N
Type 1      a               c
Type 2      b               d
In a typical GWAS study, a tool called an SNP (single nucleotide polymorphism) array is used to find mutations that are common to a large number of individuals, with or without common characteristics (such as disease), throughout the genome. For example, the collected data are divided based on genetic diversity (i.e., type 1 and type 2) and the existence of a certain disease, that is, with the disease (D) and without the disease (N). Table 6.8 shows the contingency table of the resulting counts. The p-value obtained for the following odds ratio of each variant is assigned to the analysis result.
$$\text{Odds ratio} = \frac{a/b}{c/d} = \frac{ad}{bc}.$$
The odds of D and N (i.e., type 1 over type 2) are represented by a/b and c/d, respectively. We regard them as corresponding to Hypothesis 1 and Hypothesis 2. By taking the difference between them (in this case, actually calculating the ratio, not the difference), the gene deeply related to D is identified as new Hypothesis 3. In other words, Hypothesis 3 produces a null hypothesis that there is no difference between Hypothesis 1 and Hypothesis 2, or an alternative hypothesis that there is a difference between them. In the hypothesis test, a chi-square test suitable for the ratio is performed to obtain the p-value. Here, the p-value can be regarded as indicating the significance of the difference in the frequency of the variants between the case group and the control group, that is, the possibility that the variant is related to the trait. For visualization of GWAS results, the Manhattan plot is often used, in which the horizontal axis represents the position (locus) of a gene in the genome and the vertical axis plots $-\log_{10}(p\text{-value})$ for each gene (Ikram et al. 2010) (see Fig. 6.30). Genes whose p-value is below the significance threshold recommended in GWAS ($p = 5 \times 10^{-8}$) are considered candidate genes strongly related to the trait of interest. Because the vertical axis is $-\log_{10}(p\text{-value})$, such candidate genes, whose p-values are below the significance level, appear above the horizontal line that marks the level in the Manhattan plot. It goes without saying that the genes visualized in this way are still candidates, and further studies are needed to determine the genes definitively.
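As a small worked illustration of the test described above, the following Python sketch computes the odds ratio, a Pearson chi-square statistic for the 2 × 2 table, and the corresponding p-value. The counts are invented, the function names are ours, and the use of the uncorrected Pearson statistic with one degree of freedom is an assumption; actual GWAS tools may use other tests.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio (a/b)/(c/d) = ad/(bc) for the layout of Table 6.8."""
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts: type 1 / type 2 in the case group (a, b) and control group (c, d).
a, b, c, d = 300, 700, 200, 800
ratio = odds_ratio(a, b, c, d)
chi2 = chi_square_2x2(a, b, c, d)
p = p_value_1df(chi2)
GWAS_THRESHOLD = 5e-8  # significance threshold quoted in the text
is_candidate = p < GWAS_THRESHOLD

print(ratio, chi2, p, -math.log10(p), is_candidate)
```

For these invented counts the odds ratio is about 1.7 and the p-value is small, but it does not fall below the genome-wide threshold, so this variant would not appear above the line in a Manhattan plot.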
Fig. 6.30 Manhattan plot (Courtesy of Wong Tien Yin)
References
AWS (2022) CNN-QR Algorithm. https://docs.aws.amazon.com/forecast/latest/dg/aws-forecastalgo-cnnqr.html. Accessed 2022
Bank of Japan (BOJ) (2022) Explanation of Tankan (short-term economic survey of enterprises in Japan). https://www.boj.or.jp/en/statistics/outline/exp/tk/index.htm/. Accessed 2022
Bastami A, Belić M, Petrović N (2010) Special solutions of the Riccati equation with applications to the Gross–Pitaevskii nonlinear PDE. Electron J Differ Equat 66:1–10
Behavioral Economics (2022) https://www.behavioraleconomics.com/resources/mini-encyclopedia-of-be/prospect-theory/. Accessed 2022
Brown RG, Meyer RF (1961) The fundamental theorem of exponential smoothing. Oper Res 5(5):1961
Celko J (2014) Joe Celko’s SQL for smarties: advanced SQL programming, 5th edn. Morgan Kaufmann
Columbia University (2022a) Difference in difference estimation. https://www.publichealth.columbia.edu/research/population-health-methods/difference-difference-estimation. Accessed 2022
Columbia University (2022) Kriging interpolation. http://www.publichealth.columbia.edu/research/population-health-methods/kriging. Accessed 2022
Cookpad (2022) https://cookpad.com/. Accessed 2022
Endo M, Hirota M et al (2018) Best-time estimation method using information interpolation for sightseeing spots. Int J Inform Soc (IJIS) 10(2):97–105
Flickr (2022) https://www.Flickr.com. Accessed 2022
Hirota H (1979) Nonlinear partial difference equations. V. Nonlinear equations reducible to linear equations. J Phys Soc Jpn 46:312–319
Ikram MK et al (2010) Four novel loci (19q13, 6q24, 12q24, and 5q14) influence the microcirculation in vivo. PLOS Genetics 6(10)
Ishikawa H, Miyata Y (2021) Social big data: case studies. In: Transaction on large-scale data- and knowledge-centered systems, vol 47. Springer Nature, pp 80–111
Ishikawa H, Yamamoto Y (2021) Social big data: concepts and theory. In: Transactions on large-scale data- and knowledge-centered systems XLVII, vol 12630. Springer Nature
Ishikawa H, Kato D et al (2018) Generalized difference method for generating integrated hypotheses in social big data. In: Proceedings of the 10th international conference on management of digital ecosystems
Japan Meteorological Agency (JMA) (2022) Long-term trends of phenological events in Japan. https://ds.data.jma.go.jp/tcc/tcc/news/PhenologicalEventsJapan.pdf. Accessed 2022
Japan National Tourism Organization (JNTO) (2022) Trends in visitor arrivals to Japan. https://statistics.jnto.go.jp/en/graph/#graph--inbound--travelers--transition. Accessed 2022
Japan Tourism Agency (JTA) (2022a) Results of the consumption trends of international visitors to Japan Survey for the July–September quarter of 2016. https://www.mlit.go.jp/kankocho/en/kouhou/page01_000279.html. Accessed 2022
Japan Tourism Agency (JTA) (2022b) The JTA conducted a survey of foreign travelers visiting Japan about the welcoming environment. https://www.mlit.go.jp/kankocho/en/kouhou/page01_000272.html. Accessed 2022
Jimenez F (2017) Intelligent vehicles: enabling technologies and future developments. Butterworth–Heinemann
MeCab (2022) Yet another part-of-speech and morphological analyzer. https://taku910.github.io/mecab/. Accessed 2022
Mikolov T, Chen K et al (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781 [cs.CL]
Ministry of Land, Infrastructure, Transport and Tourism (MLIT) (2022) Tourism resource data. https://nlftp.mlit.go.jp/ksj/gml/datalist/KsjTmplt-P12.html. Accessed 2022
Mitomi K, Endo M, Hirota M, Yokoyama S, Shoji Y, Ishikawa H (2016) How to find accessible free Wi-Fi at tourist spots in Japan. In: Proceedings of SocInfo 2016. Lecture notes in computer science, vol 10046. Springer
Nakata A, Miura T, Miyasaka K, Araki T, Endo M, Tsuchida M, Yamane Y, Hirate M, Maura M, Ishikawa H (2020) Examination of detection of trending spots using SNS. In: Proceedings of 16th ARG WI2 (in Japanese)
NASA (2022) Apollo 15: follow the tracks. https://www.nasa.gov/mission_pages/LRO/news/apollo-15.html. Accessed 2022
National Oceanic and Atmospheric Administration (NOAA) (2022) Climate.gov, what is the El Niño–Southern Oscillation (ENSO) in a nutshell? https://www.climate.gov/news-features/blogs/enso/what-el-ni%C3%B1o%E2%80%93southern-oscillation-enso-nutshell. Accessed 2022
Nielsen A (2020) Practical time series analysis: prediction with statistics and machine learning. O’Reilly
Nobumoto K, Kato D, Endo M, Hirota M, Ishikawa H (2017) Multilingualization of restaurant menu by analogical description. In: Proceedings of the 9th workshop on multimedia for cooking and eating activities in conjunction with the 2017 international joint conference on artificial intelligence. Association for Computing Machinery, pp 13–18. https://doi.org/10.1145/3106668.3106671
Rong X (2014) word2vec parameter learning explained. arXiv:1411.2738 [cs.CL]
Salton G, Wong A et al (1975) A vector space model for automatic indexing. CACM 18:11
Sasaki Y, Abe K, Tabei M et al (2011) Clinical usefulness of temporal subtraction method in screening digital chest radiography with a mobile computed radiography system. Radio Phys Technol 4:84–90. https://doi.org/10.1007/s12194-010-0109-7
Shibayama T, Ishikawa H, Yamamoto Y, Araki T (2020) Proposal of detection method for newly generated lunar crater. In: Space science informatics symposium 2020 (in Japanese)
Starbucks (2022) https://www.starbucks.com/. Accessed 2022
Takamura H, Inui T, Okumura M (2005) Extracting semantic orientations of words using spin model. In: Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL2005), pp 133–140
TripAdvisor (2022) https://www.tripadvisor.com/. Accessed 2022
Uffelmann E, Huang QQ, Munung NS et al (2021) Genome-wide association studies. Nat Rev Methods Primers 1:59. https://doi.org/10.1038/s43586-021-00056-9
van den Oord A, Dieleman S et al (2016) WaveNet: a generative model for raw audio. arXiv:1609.03499 [cs.SD]
Wei WWS (2019) Time series analysis: univariate and multivariate methods, 2nd edn. Pearson Education
Wen R, Torkkola K, Narayanaswamy B, Madeka D (2017) A multi-horizon quantile recurrent forecaster. https://doi.org/10.48550/arXiv.1711.11053
Wikipedia landmark (2022) https://en.wikipedia.org/wiki/Landmark. Accessed 2022
Wikipedia (2022) https://en.wikipedia.org/wiki/Wikipedia. Accessed 2022
Wu G, Korfiatis P et al (2016) Machine learning and medical imaging. Academic Press
Zhang K, Chen S-C, Whitman D, Shyu M-L, Yan J, Zhang C (2003) A progressive morphological filter for removing nonground measurements from airborne LIDAR data. IEEE Trans Geosci Remote Sens 41(4):2003
Chapter 7
Methods for Integrated Hypothesis Generation
7.1 Overview of Integrated Hypothesis Generation Methods
In this chapter, we introduce the following methods, other than the difference method (refer to Chap. 6), as design principles and patterns for generating integrated hypotheses. We then explain case studies based on these methods.
7.1.1 Hypothesis Join
In order to create an integrated hypothesis in big data applications, we complete a new hypothesis by letting separately obtained hypotheses complement each other. That is, a join of separately obtained hypotheses is made by using common conditions regarding time, space, or semantics between the hypotheses, and the result is used as a final hypothesis. For example, high-risk evacuation paths are detected as an integrated hypothesis by joining multiple hypotheses obtained from social data and open data with common conditions regarding time and space as keys (Kanno et al. 2016).
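The idea of a hypothesis join can be sketched as an ordinary equi-join on a shared key. The following Python fragment is only an illustration under our own assumptions: the two record lists, their field names, and the choice of the area identifier as the single join key are all hypothetical.

```python
# Minimal sketch of a hypothesis join on a common spatial key.
# Records, field names, and values are hypothetical.

# Hypothesis A (e.g., from social data): areas that are crowded in a given hour.
crowded = [
    {"hour": 18, "area": "A1", "people": 420},
    {"hour": 18, "area": "B2", "people": 310},
    {"hour": 5, "area": "A1", "people": 40},
]
# Hypothesis B (e.g., from open data): areas in which activity is difficult.
difficult = [
    {"area": "A1", "difficulty": 4},
    {"area": "C3", "difficulty": 5},
]


def join_hypotheses(left, right, key):
    """Join two lists of records on a common key (inner join)."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[key], []).append(right_row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


integrated = join_hypotheses(crowded, difficult, "area")
for row in integrated:
    print(row)
```

Only records that satisfy the common condition in both hypotheses survive, so each hypothesis complements the other with attributes that it lacks.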
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Ishikawa, Hypothesis Generation and Interpretation, Studies in Big Data 139, https://doi.org/10.1007/978-3-031-43540-9_7
7.1.2 Hypothesis Intersection
In order to create an integrated hypothesis in big data applications, we overlay hypotheses and calculate their common parts. That is, an intersection or a set product between separately obtained hypotheses is calculated, and the result is a final hypothesis. If there is a correlation between hypotheses as a whole, parts with strong correlation are identified. Such examples include the following.
• A vibration of interest can be detected as a result of coupling multiple data series observed simultaneously by multiple sensors (IoT devices) mounted on an automobile (Hashimoto et al. 2019).
• By confirming that the hypothesized lunar craters detected in the original data and those detected in data augmented by rotating the original data do not differ (i.e., are invariant with respect to rotation), we can detect lunar craters mechanically (Hara et al. 2019).
• The intersection of the hypothesis based on radar sounder observations, that there are multiple cavities under the lunar surface, and the independent hypothesis based on gravitational field mapping, that there are mass deficits (i.e., low-density space or voids) in the same area on the lunar surface, strongly suggests the existence of lava tubes on the lunar surface (refer to Box “Hypothesis first” of Chap. 4).
7.1.3 Hypothetical Union
A new hypothesis is created by taking the union of hypotheses in order to create an integrated hypothesis in big data applications. This method increases the number of candidate hypotheses as follows.
• We search for articles with the same hashtag in different social data and take the union of the individual search results (i.e., hypotheses) to increase the total number of relevant hypotheses.
• We increase the number of hypotheses by transitively using similarity, such as co-occurrence relationships and adjacency relationships, with respect to already obtained hypotheses. For example, a technique that calculates the relevance of a word to a particular word by propagating scores from the particular word can extract spots where multiple tourist resources can be enjoyed at the same time (Tsuchida et al. 2017). First, a tourism resource to be analyzed is determined, Tweets containing the tourism resource name are collected in target areas, and areas strongly related to the name are extracted by applying DBSCAN (refer to Chap. 5). Next, the degree of relevance between the target tourism resource name and the words contained in Tweets for each area is calculated using Biased LexRank, adapted from PageRank, and the words are ranked based on the degrees. Based on the results, areas that provide multiple tourism resources are found.
7.1.4 Ensemble Learning
In addition to difference, join, intersection, and union, it is conceivable to execute an aggregation operation on a set of hypotheses. Such operations correspond to ensemble learning of models in the field of machine learning (Hastie et al. 2014). Bagging, one method of ensemble learning, creates a set of hypotheses (models or results) by bootstrapping (random sampling with replacement from a data set).
It then forms an integrated hypothesis by taking an average or a majority vote as an aggregation operation on the set of hypotheses. This makes it possible to reduce the variance of a model (hypothesis). Random Forest is a kind of bagging. It is characterized by using a random combination of explanatory variables. As a by-product, the contribution of explanatory variables to classification and regression (i.e., prediction) can be quantitatively measured (refer to Chap. 8). In a method called dropout (Srivastava et al. 2014), a neural network model is trained while a part of the network is probabilistically invalidated (i.e., following the Bernoulli distribution, a special case of the binomial distribution). Dropout can reduce model overfitting.
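Bagging as an aggregation over hypotheses can be sketched with a deliberately simple base model. In the Python fragment below, the base model (a one-dimensional threshold classifier), the toy data, and all function names are our own assumptions chosen for brevity; they do not come from the cited literature.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Fit a threshold classifier on (x, y) pairs: predict 1 when x > threshold."""
    xs = sorted({x for x, _ in sample})
    candidates = [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])] or [xs[0]]

    def errors(threshold):
        return sum((x > threshold) != bool(y) for x, y in sample)

    return min(candidates, key=errors)


def bagging(data, n_models=25, seed=0):
    """Create a set of hypotheses (thresholds) from bootstrap samples."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # sampling with replacement
        models.append(fit_stump(sample))
    return models


def predict(models, x):
    """Aggregate the individual hypotheses by majority vote."""
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]


# Hypothetical, cleanly separated toy data: label 0 for small x, label 1 for large x.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = bagging(data)
print(predict(models, 2), predict(models, 8))
```

Each bootstrap sample yields a slightly different threshold, and the majority vote smooths out these differences, which is the variance reduction mentioned above.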
7.2 Hypothesis Join: Case of Detection of High-Risk Paths During Evacuation
Disaster management is one of the urgent issues for national and local governments. Based on a hypothesis join with respect to time, space, and semantics as common conditions, new hypotheses on high-risk paths during evacuation are obtained by the following steps, which include estimation, search, and problem solving.
• Estimate dense areas (hypothesis) during evacuation time zones using social data.
• Find evacuation facilities using open data.
• Find multiple candidate paths from dense areas to evacuation facilities by using open data.
• For the obtained candidate paths, calculate weights based on the betweenness centrality and the level of disaster activity difficulty (open data) in the area where a path is located.
• Generate a high-risk evacuation path as an integrated hypothesis by using multiple weights together.
7.2.1 Background
Due to the Great East Japan Earthquake that occurred on March 11, 2011, many transportation infrastructures such as railways and roads in the metropolitan area were disrupted. As a result, the number of people who had difficulty returning home reached about 5.15 million (Tokyo Metropolitan Government 2022a). At that time, in Shinjuku Ward, Tokyo, confusion was caused by a flood of people who had difficulty returning home and headed to the ward-owned facilities and evacuation facilities. Through the experiences of the Great East Japan Earthquake, the Central Disaster Prevention Council held by the Cabinet Office of Japan reaffirmed the importance
of smoothly evacuating residents in order to escape from tsunami, which is an issue for future disaster prevention (CAO 2022). To address these issues, the Tokyo Metropolitan Government Bureau of Urban Development has established safe evacuation roads to evacuation facilities in the event of disasters (Tokyo Metropolitan Government 2022b). Evacuation facilities and roads are revised about every five years in consideration of the changing urban conditions and the population change in Tokyo. However, the influx of population into local areas increases with the completion of an increasing number of new landmarks, and the situation in the city changes daily and even within a day, so it is thought that the current frequency of information updates is not sufficient. Furthermore, factors that prevent smooth evacuation include the number of people present in an area and the geographical conditions of the area. For example, when evacuating from a place with a large number of people (hereinafter referred to as a dense area) to an evacuation facility, people are more likely to encounter congestion and accidents such as domino-like falls than when evacuating from a place with few people. There is then a high possibility that they will not be able to reach the evacuation facility. In addition, evacuation from dense areas may newly produce roads where people are crowded (hereinafter referred to as dense roads). That is, a dense road is a road where evacuation roads from multiple dense areas intersect, as shown by the light blue arrow in Fig. 7.1, or a road where evacuation roads from one dense area leading to multiple evacuation facilities intersect, as shown by the orange arrow in Fig. 7.1. It is considered difficult to evacuate smoothly not only in dense areas but also on dense roads at the time of a disaster.
In addition, there is a high possibility that evacuation will be hindered by geographical conditions, such as narrow roads and roads lined with many wooden buildings where fires are likely to occur. Therefore, in this case study, paths with a high evacuation risk (hereinafter referred to as high-risk paths) are discovered in consideration of the number of people and geographical conditions. Twitter data are used to find out the number of people in each region. Some users post Tweets with latitude and longitude information obtained by the Global Positioning System (GPS). By using the positional information and the posting time, we can grasp when and where users posted Tweets. It is thought that discovering and visualizing high-risk paths by using Twitter data can contribute to the risk avoidance and reduction considered in city planning. Thus, every hour, we extract dense areas using Tweets with latitude and longitude information. Then evacuation roads to evacuation facilities are extracted according to the estimated number of people in each dense area. Dense roads are discovered using network analysis techniques. We implemented a system to discover and visualize high-risk paths on the discovered dense roads, considering the geographical conditions around them. By using the system, we visualized the high-risk paths by time zone and considered the results.
Fig. 7.1 Example of dense roads (Pattern A and Pattern B; labels: dense road, evacuation road, evacuation facility, crowded spot)
7.2.2 Proposed System
We first extract dense areas using Tweets with latitude and longitude information, and search for evacuation facilities within 3 km of each dense area. Next, we propose a system that extracts evacuation roads from dense areas to evacuation facilities and extracts high-risk paths from the results, taking into consideration the dense roads and geographical conditions that emerge during evacuation.
7.2.2.1 Extraction of Dense Areas and Evacuation Facilities
Since the locations of emergent dense areas change with the temporal change of people’s behaviors, dense areas are extracted on an hourly basis using Tweets with latitude and longitude information. First, the target region for extracting dense areas is divided into square meshes 500 m on a side. Then, the number of users who posted Tweets in each mesh is counted and smoothed using a 3 × 3 Gaussian filter. Gaussian filters are used partly because latitude and longitude information contains an error of several tens of meters or more, and partly because people may move to surrounding areas within an hour. If the smoothed result exceeds the predetermined threshold, the mesh is defined as a dense area.
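The extraction step above can be sketched as follows. This is a minimal illustration, not the implemented system: the kernel weights (1-2-1 binomial weights divided by 16), the zero padding at the border of the region, and the toy user counts are our assumptions, since the text only specifies a 3 × 3 Gaussian filter and a predetermined threshold.

```python
# Minimal sketch of dense-area extraction on a mesh of per-cell user counts.
# Kernel weights and border handling are assumptions; counts are hypothetical.
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
KERNEL_SUM = 16


def smooth(counts):
    """Apply the 3x3 kernel to a 2D list of counts, treating outside cells as 0."""
    rows, cols = len(counts), len(counts[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += KERNEL[dr + 1][dc + 1] * counts[rr][cc]
            out[r][c] = total / KERNEL_SUM
    return out


def dense_meshes(counts, threshold):
    """Return the (row, col) indices of meshes whose smoothed count exceeds the threshold."""
    smoothed = smooth(counts)
    return {
        (r, c)
        for r, row in enumerate(smoothed)
        for c, value in enumerate(row)
        if value > threshold
    }


# Hypothetical hourly user counts on a 3 x 3 block of 500 m meshes.
counts = [[0, 0, 0], [0, 1600, 0], [0, 0, 0]]
smoothed = smooth(counts)
dense = dense_meshes(counts, threshold=250)
print(smoothed, dense)
```

Smoothing spreads part of the central count to the neighboring meshes, which reflects the positional error and short-range movement mentioned above, while the threshold keeps only the central mesh as a dense area in this toy case.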
7.2.2.2 Extraction of Multiple Evacuation Roads
Multiple paths are searched to connect each of the extracted dense areas with evacuation facilities. Here, multiple paths are a set of paths connecting one dense area and one evacuation facility. In this case, pgRouting (pgRouting 2022) was used as the path search tool. We now explain how the number of paths connecting one dense area and one evacuation facility is determined. Since the variety of paths to an evacuation facility increases with the number of people in a dense area, multiple paths must be extracted. Paths are extracted in ascending order of length, and each is called the kth path in that order. In other words, the kth path is the kth shortest path connecting a dense area and an evacuation facility. If the length of an extracted path is less than or equal to 3 km, the path is used as an evacuation road. The maximum length of the path from the dense area to the evacuation facility is set to 3 km in this case because the Tokyo Metropolitan Government aims to specify evacuation facilities so that the distance from all places to an evacuation facility is within 3 km (Tokyo Metropolitan Government 2022c). For the extracted path set, the number of people $r_k$ expected to pass through the kth path is calculated by Eq. (7.1) using the ratio of the path length for each k.
$$r_k = \begin{cases} R_C & (t = 1) \\ R_C \, \dfrac{\mathrm{Length}^{+} - \mathrm{Length}(k)}{\mathrm{Length}^{+}} & (t > 1) \end{cases} \tag{7.1}$$
$R_C$ is the number of people after smoothing in the detected dense area C, $\mathrm{Length}(k)$ is the length of the kth path, and t is the number of path searches. $\mathrm{Length}^{+}$ is the total of the path lengths from the first path to the kth path and is expressed by Eq. (7.2).
$$\mathrm{Length}^{+} = \sum_{j=1}^{k} \mathrm{Length}(j) \tag{7.2}$$
When t is two or more, there is a difference between $r_k$ calculated at the (t − 1)th path search and $r_k$ calculated at the tth path search. Let Ave(t) be the average of these differences at the tth path search, calculated by Eq. (7.3).
$$\mathrm{Ave}(t) = \frac{\sum_{k=1}^{t-1} \lvert r_k - r_k' \rvert}{t - 1} \tag{7.3}$$
Here, $r_k'$ is $r_k$ calculated at the (t − 1)th path search. As t increases, Ave(t) decreases monotonically and approaches 0 as t approaches infinity. When Ave(t) falls below the predetermined threshold, the subsequent path search ends.
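The stopping rule can be sketched as below. This is one possible reading rather than the implemented system: we assume that, at the tth search, the total length in Eq. (7.1) is taken over all t paths found so far, and that the candidate lengths are already sorted in ascending order and limited to 3 km. The function names, lengths, and threshold values are hypothetical.

```python
def allocate(r_c, lengths):
    """Expected number of people per path following Eq. (7.1).

    Assumption: the total length is summed over all t = len(lengths) paths found so far.
    """
    if len(lengths) == 1:
        return [float(r_c)]
    total = sum(lengths)
    return [r_c * (total - length) / total for length in lengths]


def ave_change(previous, current):
    """Eq. (7.3): mean absolute change of r_k over the t - 1 paths shared by both searches."""
    return sum(abs(c - p) for p, c in zip(previous, current)) / len(previous)


def search_until_stable(r_c, candidate_lengths, threshold):
    """Add paths in ascending order of length until Ave(t) falls below the threshold."""
    lengths = [candidate_lengths[0]]
    current = allocate(r_c, lengths)
    for length in candidate_lengths[1:]:
        previous = current
        lengths.append(length)
        current = allocate(r_c, lengths)
        if ave_change(previous, current) < threshold:
            break  # no further path search
    return lengths, current


# Hypothetical example: 1000 people and four candidate paths (lengths in meters).
paths_loose, _ = search_until_stable(1000, [1000, 1200, 1500, 2000], threshold=250)
paths_strict, _ = search_until_stable(1000, [1000, 1200, 1500, 2000], threshold=150)
print(len(paths_loose), len(paths_strict))
```

With these numbers Ave(t) is roughly 455, 203, and 117 for t = 2, 3, and 4, so a looser threshold stops the search earlier. Under this reading the values of r_k sum to (t − 1) times the number of people for t > 2, so they are better regarded as relative weights.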
7.2.2.3 Discovery of High-Risk Paths
We explain how to find high-risk paths using the extracted evacuation roads. For each path that constitutes an evacuation road, the degree of risk is calculated using two indicators. One indicator is the degree of congestion, which considers the weights of the dense roads and dense areas that emerge at the time of a disaster. The other is the degree of difficulty in activities during a disaster, which considers geographical conditions such as the road width, the risk of building collapse, and the risk of fire. Paths with a high degree of risk are extracted as high-risk paths.
7.2.2.4 Calculation of Congestion
In calculating the degree of congestion, betweenness centrality (Freeman 1979) and the weight of dense areas are taken into consideration. Nodes with high betweenness centrality appear on major roads, such as those in urban commercial areas (Porta et al. 2006; Porta 2009). In this case, the betweenness centrality of a path, not a node, is calculated for evacuation within 3 km. On the network, a node i nearest to the center of gravity of each dense area is defined as the starting point, and a node j nearest to each evacuation facility that can be reached within 3 km from the starting point is defined as the destination point. Let $g_{ij}$ be the number of paths between i and j, and $g_{ij}(l)$ be the number of times the multiple paths between i and j pass through path l. Betweenness centrality B(l) is expressed by Eq. (7.4).
$$B(l) = \sum_{i \neq j} \sum_{j \in V_i,\, j \neq i} \frac{g_{ij}(l)}{g_{ij}} \tag{7.4}$$
Here, $V_i$ indicates the set of nodes for evacuation facilities that can be reached within 3 km of the node i. Next, we explain how to calculate the weight of dense areas. The weight of the dense area C is determined by using the number of people $R_C$ detected in the dense area, the total number of evacuation facilities S(C) within 3 km of the dense area, and the total number of paths k(CS) from the dense area C to the evacuation facility S. It approximates the number of people passing through the path l. After the search for an evacuation road connecting the dense area C and the evacuation facility S is completed, if the paths constituting the evacuation road include the path l, the weight W(l) of the dense area in the path l is given by Eq. (7.5).
$$W(l) = \sum_{C=1}^{N} \sum_{S \in V(C)} \frac{R_C}{S(C) \cdot k(CS)} \tag{7.5}$$
Here, V(C) is the set of evacuation facility nodes that can be reached within 3 km from the dense area C, and N is the total number of dense areas extracted during that time period. The congestion degree Con(l) of the path l is calculated from the betweenness centrality B(l) and the weight W(l) of the dense area, both of which are normalized to the range from 0 to 1, by using Eqs. (7.6) and (7.7).
$$\mathrm{Normalized}(x) = \frac{x - X_{\min}}{X_{\max} - X_{\min}} \tag{7.6}$$
$$\mathrm{Con}(l) = \mathrm{Normalized}(B(l)) + \mathrm{Normalized}(W(l)) \tag{7.7}$$
Here, $X_{\max}$ and $X_{\min}$ are the maximum and minimum values in X (the set of all values taken by x), respectively.
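The congestion calculation can be sketched on a toy set of evacuation roads. The fragment below is our own illustration under stated assumptions: each start node is identified with its dense area and each destination node with its facility, W(l) receives one term per pair of dense area and facility whose roads include the segment, and a constant set of values is normalized to 0. The data and identifiers are hypothetical.

```python
def normalize(values):
    """Min-max normalization of a dict of values, following Eq. (7.6)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {key: 0.0 for key in values}  # assumption: constant values map to 0
    return {key: (value - lo) / (hi - lo) for key, value in values.items()}


def congestion(routes, people):
    """Compute Con(l) for every road segment l.

    routes: {(dense_area, facility): [road, ...]}, each road being a list of segment ids.
    people: {dense_area: R_C}.
    """
    facilities_per_area = {}
    for area, _ in routes:
        facilities_per_area[area] = facilities_per_area.get(area, 0) + 1  # S(C)

    b, w = {}, {}
    for (area, _), roads in routes.items():
        g = len(roads)  # number of paths between start and destination
        touched = set()
        for road in roads:
            for segment in set(road):
                b[segment] = b.get(segment, 0.0) + 1.0 / g  # accumulates g_ij(l) / g_ij
                touched.add(segment)
        share = people[area] / (facilities_per_area[area] * g)  # R_C / (S(C) * k(CS))
        for segment in touched:
            w[segment] = w.get(segment, 0.0) + share

    nb, nw = normalize(b), normalize(w)
    return {segment: nb[segment] + nw[segment] for segment in b}


# Hypothetical evacuation roads between two dense areas and two facilities.
routes = {
    ("C1", "S1"): [["e1", "e2"], ["e1", "e3"]],
    ("C1", "S2"): [["e1", "e4"]],
    ("C2", "S1"): [["e5", "e2"]],
}
people = {"C1": 400, "C2": 200}
con = congestion(routes, people)
print(sorted(con.items(), key=lambda item: item[1], reverse=True))
```

In this toy network the segment e1 is shared by every road leaving the dense area C1, so it obtains the largest congestion degree, which corresponds to the dense roads discussed in Sect. 7.2.1.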
7.2.2.5 Calculation of Activity Difficulty Level During Disaster
Next, we explain the calculation of the level of activity difficulty during a disaster, called Emergency Response Difficulty (ERD), which is another index of high-risk paths. The ERD level expresses the ease (difficulty) of evacuation, fire extinction, and rescue activities in the event of a disaster, which the Tokyo Metropolitan Government Bureau of Urban Development determined by comprehensively considering the probability of building collapse and fire occurrence, and the width of the road. It is an evaluation of 5133 towns and streets in Tokyo (Tokyo Metropolitan Government 2022d). The survey results are evaluated on a 5-point scale for the unit area of town and street (called chome), and the smaller this value is, the safer the evacuation site is in the event of a disaster. To calculate the ERD level for a path, we determine the evaluations of the ERD level to which each path l belongs and then calculate the length of the part of the path l within each evaluation of the ERD level. If the path l of length L is included in the evaluation Dan of the ERD level and the length of the part of the path in the evaluation Dan is $L_{Dan}$, the ERD level of the path l is expressed by Eq. (7.8), using Eq. (7.6) for normalization.
$$\mathrm{ERD}(l) = \mathrm{Normalized}\Bigl(\sum Dan \cdot L_{Dan}\Bigr) \tag{7.8}$$
7.2.2.6 Calculation of Risk
Finally, we explain how to calculate the risk level of the path l. For each path l, we use Eq. (7.7) to find the congestion degree Con(l) and Eq. (7.8) to find the ERD level ERD(l). The risk level Risk(l) is then calculated by Eq. (7.9).
7.2 Hypothesis Join: Case of Detection of High-Risk Paths During Evacuation
\[ \mathrm{Risk}(l) = \mathrm{Con}(l) + \mathrm{ERD}(l) \tag{7.9} \]
The larger the value of Risk(l), the higher the degree of risk of the path.
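The ERD and risk calculations of Eqs. (7.8) and (7.9) can be sketched as follows. This is only an illustration: the path names, the (Dan, length) segments, and the congestion values are hypothetical, and we assume each path has already been split into the parts that fall within each ERD evaluation.

```python
import numpy as np

def min_max_normalize(values):
    """Eq. (7.6), repeated here so the example is self-contained."""
    x = np.asarray(values, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0.0 else (x - x.min()) / span

def erd_raw(segments):
    """Sum of Dan * L_Dan over the (Dan, length) parts of one path, before normalization."""
    return sum(dan * length for dan, length in segments)

# Hypothetical paths, each given as a list of (ERD evaluation Dan, length in that evaluation).
paths = {
    "p1": [(1, 200.0), (3, 73.0)],
    "p2": [(2, 105.0)],
    "p3": [(2, 60.0), (1, 45.0)],
}
con = np.array([1.2, 0.4, 0.9])  # hypothetical Con(l) values from Eq. (7.7)

raw = np.array([erd_raw(segments) for segments in paths.values()])
erd = min_max_normalize(raw)      # Eq. (7.8)
risk = con + erd                  # Eq. (7.9)
ranking = np.argsort(-risk)       # path indices from highest to lowest risk
```

Since Con(l) ranges over [0, 2] and ERD(l) over [0, 1], the two terms are not weighted equally in this form; whether that is intended is not stated in the text.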
7.2.3 Experiments and Considerations
In this section, we will obtain dense areas and high-risk paths by using the methods explained so far for Tweets with latitude and longitude information, and consider the results. We visualize the results on maps.
7.2.3.1
Data Set
First, we describe the data set used in this case. We collected Tweets with latitude and longitude information posted in the 23 wards of Tokyo from April 1, 2015, to December 31, 2015, via Twitter’s API. The total number of Tweets collected was 5,769,800, and the number of confirmed users was 235,942. A threshold value of 250, determined by a preliminary survey, was used when extracting dense areas. For road network data, we used the data of Tokyo on OpenStreetMap (OpenStreetMap 2022). Roads with the tags “motorway”, “motorway link”, and “motorway junction” were deleted from the data, because people are unlikely to follow such roads when evacuating. As a result, the remaining road network data consisted of 205,930 nodes and 302,141 paths. For evacuation facilities, we selected the data of evacuation facilities in Tokyo (MILT 2022) published by the Ministry of Land, Infrastructure, Transport and Tourism (MILT), and used the facilities located in the 23 wards of Tokyo. The number of evacuation facilities was 1467.
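The removal of motorway-type roads could look like the following sketch. The dictionary layout of a road record is hypothetical (it is not the authors' data format). OpenStreetMap itself writes these values with underscores, as values of the `highway` key.

```python
# Values of the OpenStreetMap "highway" key that are excluded from the evacuation network.
EXCLUDED_HIGHWAY_TAGS = {"motorway", "motorway_link", "motorway_junction"}

def drop_motorways(elements):
    """Keep only elements whose 'highway' tag is not one of the excluded values."""
    return [
        element for element in elements
        if element.get("tags", {}).get("highway") not in EXCLUDED_HIGHWAY_TAGS
    ]

# Hypothetical road records.
roads = [
    {"id": 1, "tags": {"highway": "residential"}},
    {"id": 2, "tags": {"highway": "motorway"}},
    {"id": 3, "tags": {"highway": "motorway_link"}},
    {"id": 4, "tags": {"highway": "primary"}},
]
kept = drop_motorways(roads)
kept_ids = [road["id"] for road in kept]
```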
7.2.3.2
Consideration of Extracted High-Risk Paths
Next, the extracted high-risk paths are described. The top five high-risk paths extracted between 5:00 and 6:00 are shown in Table 7.1a, those extracted between 14:00 and 15:00 are shown in Table 7.1b, and those extracted between 18:00 and 19:00 are shown in Table 7.1c. In Table 7.1a–c, Dan is the five-level ERD evaluation of the activity difficulty during a disaster in the town (chome) that includes the path, and the “number of evacuation facilities” V(C) is the total number of evacuation facilities that can be reached through the path as the result of path search.
Table 7.1 Top five risky paths. a Ranking from 5:00 to 6:00. b Ranking from 14:00 to 15:00. c Ranking from 18:00 to 19:00

(a)

| Rank | Path No | W(l) | gij(l) | Place | Dan | Length | V(C) |
|---|---|---|---|---|---|---|---|
| 1 | 851 | 123 | 174 | In front of Shinjuku Ward Office | 2 | 65 | 23 |
| 2 | 849 | 120 | 170 | In front of Don Quixote Shinjuku east exit head office | 2 | 76 | 19 |
| 3 | 850 | 121 | 171 | In front of Don Quixote Shinjuku east exit head office | 2 | 48 | 19 |
| 4 | 848 | 121 | 171 | In front of Kabukicho | 2 | 47 | 19 |
| 5 | 847 | 121 | 171 | Shinjuku large guard east | 2 | 37 | 19 |

(b)

| Rank | Path No | W(l) | gij(l) | Place | Dan | Length | V(C) |
|---|---|---|---|---|---|---|---|
| 1 | 315030 | 1273 | 634 | Shinjuku 4 south side of Meiji-dori | 1, 3 | 273 | 36 |
| 2 | 392780 | 931 | 557 | Shinjuku 3 south side of Meiji-dori | 2, 1 | 105 | 29 |
| 3 | 31320 | 1034 | 442 | In front of BOX GOLF golf school | 1, 3 | 107 | 38 |
| 4 | 234148 | 714 | 658 | Ikebukuro Station south of the road under the overpass | 2, 3 | 127 | 36 |
| 5 | 392781 | 880 | 517 | Shinjuku 4 north side of Meiji-dori | 2, 1 | 152 | 29 |

(c)

| Rank | Path No | W(l) | gij(l) | Place | Dan | Length | V(C) |
|---|---|---|---|---|---|---|---|
| 1 | 315030 | 1840 | 710 | Shinjuku 4 south side of Meiji-dori | 1, 3 | 273 | 36 |
| 2 | 234148 | 1041 | 889 | Ikebukuro Station south of the road under the overpass | 2, 3 | 127 | 36 |
| 3 | 31320 | 1513 | 500 | In front of BOX GOLF golf school | 1, 3 | 107 | 39 |
| 4 | 393780 | 1347 | 627 | Shinjuku 3 south side of Meiji-dori | 2, 1 | 105 | 29 |
| 5 | 392781 | 1274 | 584 | Shinjuku 4 north side of Meiji-dori | 2, 1 | 152 | 29 |
Table 7.2 Risk rank and total number of extracted routes for each time zone on the south side of Meiji-Dori Street in Shinjuku 4-Chome

| Time zone | Rank | Total number of extracted paths |
|---|---|---|
| 0–1 | 1 | 9602 |
| 1–2 | 1 | 2373 |
| 2–3 | 98 | 1254 |
| 3–4 | | |
| 4–5 | | |
| 5–6 | 98 | 1254 |
| 6–7 | 1 | 3260 |
| 7–8 | 1 | 5935 |
| 8–9 | 6 | 11,947 |
| 9–10 | 9 | 16,419 |
| 10–11 | 6 | 20,215 |
| 11–12 | 4 | 25,078 |
| 12–13 | 1 | 29,650 |
| 13–14 | 1 | 29,733 |
| 14–15 | 1 | 29,144 |
| 15–16 | 1 | 28,704 |
| 16–17 | 1 | 29,596 |
| 17–18 | 1 | 32,164 |
| 18–19 | 1 | 33,181 |
| 19–20 | 1 | 31,898 |
| 20–21 | 1 | 28,493 |
| 21–22 | 1 | 27,442 |
| 22–23 | 1 | 23,139 |
| 23–24 | 1 | 20,146 |
In addition, Fig. 7.2a, b shows the visualization results around the path with the highest risk in each time zone. The paths with the highest risk between 14:00 and 15:00 were the same as those between 18:00 and 19:00, so only the result between 18:00 and 19:00 is shown in Fig. 7.2b.

Fig. 7.2 Highest-risk routes. a Highest-risk routes from 5:00 to 6:00. b Highest-risk routes from 18:00 to 19:00 (legend: dense areas, evacuation facilities, and paths colored from low to high Risk(l); ovals “A” and “B” mark the areas near Shinjuku Station discussed in the text)

(1) Results from 5:00 to 6:00
First, we describe the high-risk paths extracted between 5:00 and 6:00. As Table 7.1a shows, the top five paths with particularly high risks in this time zone were all extracted on Yasukuni Dori Street in Shinjuku Ward. They correspond to the red lines surrounded by the oval “A” in Fig. 7.2a. Figure 7.2a also shows that not many lines are drawn around the place surrounded by “A”. This is partly because no dense areas were extracted except around Shinjuku Station. Further, this is probably because it is possible for people to safely evacuate to evacuation facilities such as Shinjuku High School, Shinjuku Junior High School, and Tenjin Elementary School, without using very narrow roads. Let us comment on these two reasons. As to the former, there is room to perform a more accurate analysis by using data other than Twitter data when extracting dense areas. As to the latter, since many evacuation facilities can be reached by using these paths, it is expected that in the event of a disaster many people will gather on these paths and the risk of evacuation will increase.

(2) Results from 14:00 to 15:00
Next, we describe the high-risk paths extracted from 14:00 to 15:00. The 1st, 2nd, 3rd, and 5th paths in Table 7.1b correspond to the red lines surrounded by the oval “B” in Fig. 7.2b. When we checked the path indicated by the red line on Google Street View (Google Street View 2022), it was a narrow road leading to another road with heavy traffic. In addition to casualties and crowd accidents caused by a disaster, it is highly likely that people will be involved in further damage such as traffic accidents. Furthermore, since the 1st, 2nd, and 5th paths are near the large intersections of Shinjuku 3-chome and Shinjuku 4-chome, it is conceivable that people will gather from various directions. From these facts, it is expected that the risk of evacuation may increase when many people gather on the paths near these high-traffic roads in the event of a disaster. In addition, Table 7.1b shows that all of the paths are evaluated as 1, 2, or 3 on the 5-level ERD scale. In terms of geographical conditions, therefore, they are relatively safe paths for evacuation in the event of a disaster. Taken together with the weights of the dense areas in Table 7.1b, this shows that an area may be safe for evacuation when considered at the town (chome) level and yet contain paths that are not safe when considered at the path level.
(3) Results from 18:00 to 19:00
Finally, we describe the high-risk paths extracted from 18:00 to 19:00. The 1st, 3rd, 4th, and 5th paths in Table 7.1c correspond to the red lines surrounded by the oval “B” in Fig. 7.2b. Comparing Table 7.1b and c, it can be seen that the paths extracted as the top five high-risk paths are exactly the same, although the rankings are different. As Table 7.1a shows, however, these paths were not extracted between 5:00 and 6:00. This suggests that in the afternoon and evening these paths are affected by people evacuating from the surrounding dense areas, so that their risk level is particularly high and they were extracted as the top five paths. Furthermore, these paths were extracted near the large intersections of Shinjuku 3-chome and Shinjuku 4-chome, and the traffic volume at these intersections is expected to change with time. Because the number and weight of the dense areas associated with each path increase between 18:00 and 19:00, the risk of evacuation is expected to be higher than between 14:00 and 15:00. As described above, we confirmed that, even at the same place, the evacuation risk changes with time because of congestion.
7.2.3.3
Consideration on the South Side of Meiji-Dori Street in Shinjuku 4-Chome
Here, we take a closer look at the path extracted on Meiji-Dori Street. According to Table 7.1b and c, the path with the highest risk in these time zones was the south side of Meiji-Dori Street in Shinjuku 4-chome, which is indicated by the area surrounded by the black oval in Fig. 7.3. Table 7.2 shows the risk rank of this path and the total number of extracted routes for each time zone. Table 7.2 does not include results from 3:00 to 4:00 and from 4:00 to 5:00, because no dense areas were extracted during these time periods. The path extracted on the south side of Meiji-Dori Street in Shinjuku 4-chome has the highest risk, not only from 14:00 to 15:00 and from 18:00 to 19:00, but also at most other times. In contrast, from 2:00 to 3:00 and from 5:00 to 6:00, the path is not included in the top ten risk levels. This is because in these time zones, compared to other time zones, no dense areas were extracted around Shinjuku Station, and thus no paths were extracted. The path was extracted as the path with the highest risk in 16 time zones. In other words, this path has a high evacuation risk in the event of a disaster at most times. As Table 7.1b and c show, the number of evacuation facilities that can be reached and the weight of dense areas are both large, so it is expected that people will gather from various places and directions in the event of a disaster, and crowd accidents are likely to occur. Therefore, it is highly probable that smooth evacuation will be difficult. In addition, since the path is near a large intersection, heavy traffic is expected during most time zones and traffic accidents are likely to occur, so it is considered to be a very high-risk path. In fact, independently of our experiments, the Shinjuku Station Area Disaster Prevention Council has pointed out that this path is so dangerous that it cannot be used for evacuation in the event of an earthquake (Shinjuku City 2022).

Fig. 7.3 South side of Meiji-Dori Street in Shinjuku 4-Chome (surrounded by an oval), near Shinjuku Station
7.3 Hypothesis Intersection: Case of Detection of Abnormal Automobile Vibration
This section explains hypothesis generation based on hypothesis intersection by taking detection of abnormal automobile vibration as an example. By intersecting (that is, overlaying and finding common parts of) multiple data or candidate hypotheses generated at the same time, a highly accurate classifier can be constructed as a hypothesis (model) as follows.
• By intersecting multiple data (called multichannel coupling) from multiple IoT devices installed in an automobile, abnormal vibration can be detected accurately.
7.3.1 Background
Mobility as a Service (MaaS) is one of the fields in which further development is expected through 5G technologies. In Japan as well, car-sharing services, which are one form of MaaS, are expanding rapidly. Figure 7.4 shows changes in the number of car-sharing vehicles and the number of registered members in Japan based on a survey by the Transportation Ecology and Mobility Foundation (Eco-Mo Foundation 2018). Both have increased sharply in the last few years. The most important cause of the spread of car sharing is users’ growing awareness of cost reduction. In other words, compared to owning a car, the advantage of car sharing is that there is no need for car purchase costs, taxes, insurance, or maintenance costs (Millard–Ball et al. 2005). Therefore, car-sharing services are expected to grow in the future.

Fig. 7.4 Changes in the number of car-sharing vehicles and the number of registered members, 2002 to 2018 (Eco-Mo Foundation 2018)

On the other hand, through interviews with private businesses, we have found that the number of cases in which the responsibility for “scratches” on shared automobiles is unclear is increasing. For a scratch on an automobile, it is necessary to judge, in addition to its type and size, whether it is spontaneous or artificial. Car rental companies that maintain cars after each rental can visually inspect scratches on a car, but in car sharing users rent and return cars at unmanned sites (see Fig. 7.5), so it is especially important for the operating company to understand occurrences of scratches and their causes in managing automobiles.

Fig. 7.5 Car sharing in MaaS (book, unlock, drive, lock, return; one way)

Currently, as a mechanism for automating the detection of scratches, there is a mechanism for collecting omnidirectional images of an automobile and performing image analysis using machine learning (Patil et al. 2017). It is expected that highly accurate scratch detection can be performed by comparing and analyzing images before and after driving based on machine learning, but diversifying services such as car sharing are also required to save significant time and space. Therefore, a novel mechanism for detecting scratches is needed.

Generally, vibration analysis is used as an approach to detect machine abnormalities. That is, sensors are placed in a machine, an abnormality is detected from the acquired vibration waveform and frequency characteristics, and the causes are estimated. In the study of Gontscharov et al. (2014), sensors are installed at places where an abnormality is considered likely to occur, and vibration signals are acquired by using them. Using a Pattern Recognition Neural Network (PRNN), minor damage can be detected by vibration analysis. Therefore, in this case study, we attached multiple piezoelectric sensors to an automobile body, collected vibration data as shown in Fig. 7.6a, and conducted an experiment to automatically detect abnormalities from vibration by combining machine learning and vibration analysis. Abnormal vibration is different from vibration caused by normal use of an automobile. We regard the detection of abnormal vibrations in automobiles as a two-class (i.e., binary) classification problem of whether they are abnormal (i.e., associated with scratches) or not, and aim to automatically classify and detect abnormal vibrations by supervised learning using labeled vibrations. In addition, we examine a method suitable for detecting abnormal vibrations in automobiles using features generated from vibration data.

Fig. 7.6 Examples of features obtained from three sensors. a Waveform. b PSD. c MFCC
7.3.2 Proposed Method
This section explains data features and model generation methods used in this case.

Fig. 7.7 Process flow of the proposed method (preprocessing, feature generation, channel coupling, model creation (training), model execution (test))
7.3.2.1
Process Flow of Proposed Method
Here, the process flow of our proposed method is explained (see Fig. 7.7). As preprocessing, only a part where an event of abnormal vibration is considered to have occurred is obtained by detecting the event section in the vibration data and is treated as an input. Next, high-dimensional features are generated from the input by connecting multiple channels. Then by applying a Gaussian Mixture Model to the generated features, a model that can identify abnormal vibrations is created, and the model is executed on test abnormal vibrations.
7.3.2.2
Event Interval Detection Method
Abnormal vibration detection is expected to perform well when only the parts of a vibration waveform where an event actually occurs are extracted. So, in this case, we detect the event section from the vibration waveform as preprocessing. As a method, we use short-term average/long-term average (STA/LTA), which is known for detecting earthquake event sections. The red and blue lines in the upper part of Fig. 7.8 are event sections, which are detected by applying a threshold to the ratio of the short-term average to the long-term average shown in the lower part of Fig. 7.8.
Fig. 7.8 Event detection based on STA/LTA
Research on determining the STA/LTA parameters is being actively conducted for the detection of earthquake event sections (Trnkoczy 1999). In this case, however, very short vibrations and a wide variety of vibration waveforms are the targets of detection, so it is difficult to detect event sections comprehensively with a single set of parameters. Therefore, in this case, we visually confirm that an event actually occurred, and we adjust the STA/LTA parameters for each confirmed event one by one.
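The following is a minimal sketch of STA/LTA event detection, not the authors' implementation. It assumes the squared amplitude as the quantity being averaged, trailing windows for both averages, and separate on/off thresholds; the window lengths, thresholds, and the synthetic signal are hypothetical.

```python
import numpy as np

def sta_lta_ratio(signal, n_sta, n_lta):
    """Ratio of the trailing short-term to long-term average of the signal energy."""
    energy = np.asarray(signal, dtype=float) ** 2
    csum = np.concatenate(([0.0], np.cumsum(energy)))
    ratio = np.zeros(len(energy))
    # The ratio is defined only once a full long-term window is available.
    for i in range(n_lta, len(energy) + 1):
        sta = (csum[i] - csum[i - n_sta]) / n_sta
        lta = (csum[i] - csum[i - n_lta]) / n_lta
        if lta > 0.0:
            ratio[i - 1] = sta / lta
    return ratio

def event_sections(ratio, on_threshold, off_threshold):
    """Return (start, end) sample indices: start when ratio >= on, end when ratio < off."""
    sections, start = [], None
    for i, value in enumerate(ratio):
        if start is None and value >= on_threshold:
            start = i
        elif start is not None and value < off_threshold:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(ratio)))
    return sections

# Synthetic example: low-level noise with one short burst starting at sample 2000.
rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(4000)
x[2000:2200] += np.sin(np.linspace(0.0, 40.0 * np.pi, 200))

ratio = sta_lta_ratio(x, n_sta=50, n_lta=1000)
sections = event_sections(ratio, on_threshold=4.0, off_threshold=1.5)
```

With a long-term window much longer than the event, the ratio rises sharply at the onset and decays once the short-term window leaves the burst, which is why the detected end lags slightly behind the true end of the event.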
7.3.2.3
Feature Generation Method
In this case, the following two different features are used as inputs.
• PSD: Power spectral density (PSD) is often used as a feature in earthquake analysis. A fast Fourier transform (FFT) is applied to the vibration waveform, and the PSD is obtained from the power at each frequency (see Fig. 7.6b).
• MFCC: Mel-frequency cepstral coefficients (MFCCs) (Logan 2000) are a representation of frequency characteristics that takes human auditory characteristics into account. They are used in DCASE (Mesaros et al. 2019). An FFT is applied to the vibration waveform, a discrete cosine transform (DCT) is applied to the logarithm of the resulting powers, and the MFCCs are obtained as a result (see Fig. 7.6c).
High identification performance is expected by using PSD and MFCC as features. Since the vibration data are time series data, the relationship between vibrations separated in time is considered to affect the identification. First, the features of the vibration data corresponding to each channel are calculated and used. Next, by connecting the features of the three channels, features with spatial information are obtained and used.
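As a rough illustration of the PSD feature and of channel coupling, the sketch below computes a framed power spectrum per channel with NumPy and concatenates the three channels frame by frame. The Hann window, the scaling of the spectrum, and the reading of "connecting" as concatenation along the feature axis are assumptions, and the sampling rate, frame sizes, and test tones are hypothetical.

```python
import numpy as np

def frame_psd(signal, frame_len, hop_len):
    """Per-frame power spectrum: rows are frames, columns are frequency bins.

    The values are a periodogram up to a constant factor, not calibrated density units.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)
    return (np.abs(spectrum) ** 2) / frame_len

def couple_channels(per_channel_features):
    """Concatenate the feature vectors of all channels for each frame."""
    return np.concatenate(per_channel_features, axis=1)

# Three hypothetical channels: pure tones at different frequencies, one second each.
sr = 8000
t = np.arange(sr) / sr
channels = [np.sin(2.0 * np.pi * f * t) for f in (200.0, 400.0, 800.0)]

features = [frame_psd(ch, frame_len=320, hop_len=160) for ch in channels]
coupled = couple_channels(features)  # shape: (frames, 3 * frequency bins)
```

Each row of `coupled` then describes the same time frame as seen by all three sensors, which is one way of carrying spatial information into the classifier.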
7.3.2.4
Model Generation Method
In this case, a Gaussian Mixture Model (GMM) (Mesaros et al. 2017) is used as a method to build a model that distinguishes between abnormal vibrations and normal vibrations. In general, GMM builds a model by a weighted sum of Gaussian distribution functions for each label of the input features (refer to Sect. 5.1.2.2; Yun et al. 2016). We obtain the classification label y* of test input X = {x_1, …, x_T} as follows.

\[ y^{*} = \operatorname*{argmax}_{y \in Y} F(X, y; \theta) \]

Here, Y represents the set of M labels, and θ stands for the GMM parameter set. In terms of each vibration frame, we obtain a feature vector x_t (1 ≤ t ≤ T). Also, the discriminant function F(X, y; θ) can be modeled using the conditional distribution log p_θ(y | X), as follows.

\[ F(X, y; \theta) = \log p_{\theta}(y \mid X) = \log p_{\theta}(X \mid y)\, p(y) \]

Here, p(y) is the prior probability of the classification label. We assume equal prior probability, i.e., p(y) = 1/M, for all y ∈ Y. When using the K-mixture GMM with diagonal covariance, the probability p_θ(x_t | y) can be expressed as follows.

\[ p_{\theta}(x_t \mid y) = \sum_{k=1}^{K} w_k\, N(x_t \mid \mu_k, \sigma_k) \]

Here, x_t follows the kth Gaussian mixture component (normal distribution), of which the mean vector and variance are denoted by μ_k and σ_k, respectively, and w_k is the weight of the component. The discriminant function can be expressed as follows.

\[ F(X, y; \theta) = \log p_{\theta}(X \mid y)\, p(y) = \log\left[\frac{1}{M}\prod_{t=1}^{T} p_{\theta}(x_t \mid y)\right] = \log\left[\frac{1}{M}\prod_{t=1}^{T}\sum_{k=1}^{K} w_k\, N(x_t \mid \mu_k, \sigma_k)\right] \]
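The discriminant function above can be evaluated directly once the GMM parameters of each label are known. The sketch below does this with NumPy for hand-written, hypothetical parameters, so that the formula stays visible; it is not the implementation used in this case, where the parameters are fitted from training data. Working in the log domain avoids numerical underflow in the product over frames.

```python
import numpy as np

def log_gaussian_diag(X, mean, var):
    """log N(x_t | mean, diag(var)) for every row x_t of X."""
    diff = X - mean
    return -0.5 * (np.sum(np.log(2.0 * np.pi * var)) + np.sum(diff * diff / var, axis=1))

def discriminant(X, weights, means, variances, n_labels):
    """F(X, y; theta) = sum_t log sum_k w_k N(x_t | mu_k, sigma_k) + log(1 / M)."""
    # log_comp[t, k] = log w_k + log N(x_t | mu_k, sigma_k)
    log_comp = np.stack(
        [np.log(w) + log_gaussian_diag(X, m, v)
         for w, m, v in zip(weights, means, variances)],
        axis=1,
    )
    # Log-sum-exp over the K components, computed stably.
    peak = log_comp.max(axis=1, keepdims=True)
    log_px = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return float(log_px.sum() + np.log(1.0 / n_labels))

def classify(X, models):
    """Return the label y* that maximizes F(X, y; theta), together with all scores."""
    scores = {
        label: discriminant(X, *params, n_labels=len(models))
        for label, params in models.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical 2-component, 2-dimensional models: (weights, means, variances) per label.
models = {
    "normal": (np.array([0.5, 0.5]),
               np.array([[0.0, 0.0], [1.0, 1.0]]),
               np.array([[1.0, 1.0], [1.0, 1.0]])),
    "abnormal": (np.array([0.7, 0.3]),
                 np.array([[5.0, 5.0], [6.0, 4.0]]),
                 np.array([[1.0, 1.0], [2.0, 2.0]])),
}
X = np.array([[5.1, 4.9], [5.5, 4.6], [4.8, 5.2]])  # T = 3 hypothetical frames
label, scores = classify(X, models)
```

With equal priors the term log(1/M) is the same for every label, so it does not change which label is selected.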
Table 7.3 Data set for automobile vibrations

| Item | Value |
|---|---|
| Total number of abnormal vibrations | 81 (27 × 3 channels) |
| Total number of normal vibrations | 87 (29 × 3 channels) |
| Data length (time duration) | 1–5 s |
| Total number of channels | 3 |
| Sampling frequency | 44.1 kHz |
7.3.3 Experiments
Here, abnormal vibration is identified using each type of features, and its performance is evaluated.
7.3.3.1
Data Set
Table 7.3 shows the specifications of the data set used in this case. In order to detect abnormal vibration, three piezoelectric sensors were installed on the automobile body as shown in Fig. 7.9, and vibration data were collected. We generated vibrations while the automobile was driving normally, driving slowly, and stopped, and recorded them at a sampling frequency of 44.1 kHz. The number of data pieces of abnormal vibration is 81 (27 sets × 3 channels), and the number of data pieces of normal vibration is 87 (29 sets × 3 channels). For the normal vibration data, a wide range of vibrations was collected, such as vibrations during driving and vibrations caused by hitting bollards and closing hatches. For the abnormal vibration data, we mainly collected vibrations caused by throwing a stone at the automobile body and by hitting the automobile body with a metal rod. Each piece of data is a short vibration of about 1–5 s.
7.3.3.2
Experimental Settings
Table 7.4 shows the parameter settings for the experiments. For both types of features, the frame length and frame period in the frame analysis are 40 ms and 20 ms, respectively. The PSD sample size used in this case was set to 1024 points. The 20-dimensional MFCC and its first and second derivatives (60 dimensions in total) were used, with reference to the MFCC used in the DCASE 2017 baseline (Mesaros et al. 2017). The first and second derivatives provide the features with information that reflects changes over time. The PSD and MFCC features were created by connecting three channels from the sensors mounted at three different locations on the automobile body. We conducted experiments to verify the accuracy of anomaly detection by randomly dividing the data into a training data set and an evaluation data set at a ratio of 7:3 and calculating the accuracy (correct answer rate). In addition, considering the number of data sets, five trials were performed and the average of all the accuracies was taken as the final result.

Fig. 7.9 Layout of piezoelectric sensors installed on the automobile

Table 7.4 Parameter setting for experiments

| Parameter | Value |
|---|---|
| Dimension (PSD) | Varies with vibration length (× 3) |
| Dimension (MFCC) | 60 (× 3) |
| Frame length | 40 ms |
| Frame period | 20 ms |
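The evaluation protocol (a random 7:3 split repeated five times, with the accuracies averaged) could be sketched as follows. The classifier used here is a simple nearest-class-mean stand-in, not the GMM of this case, and the data are synthetic; only the set sizes (27 abnormal and 29 normal coupled sets) follow Table 7.3.

```python
import numpy as np

def split_7_3(n_samples, rng):
    """Randomly split sample indices into training (70%) and evaluation (30%) parts."""
    order = rng.permutation(n_samples)
    n_train = int(round(0.7 * n_samples))
    return order[:n_train], order[n_train:]

def mean_accuracy(features, labels, fit_predict, n_trials=5, seed=0):
    """Average accuracy over n_trials independent random 7:3 splits."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_trials):
        train, test = split_7_3(len(labels), rng)
        predicted = fit_predict(features[train], labels[train], features[test])
        accuracies.append(float(np.mean(predicted == labels[test])))
    return float(np.mean(accuracies)), accuracies

def nearest_mean(train_x, train_y, test_x):
    """Stand-in classifier (NOT the GMM of the text): pick the closest class mean."""
    classes = np.unique(train_y)
    means = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_x[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Synthetic, well-separated features: label 1 = abnormal, label 0 = normal.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(3.0, 0.5, (27, 4)), rng.normal(-3.0, 0.5, (29, 4))])
y = np.array([1] * 27 + [0] * 29)
avg, per_trial = mean_accuracy(x, y, nearest_mean)
```

Any function with the same `fit_predict(train_x, train_y, test_x)` signature, for example one wrapping per-label GMMs, can be passed in place of `nearest_mean`.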
7.3.3.3
Implementation
The Python library librosa (McFee et al. 2018) was used to implement the vibration analysis while the Python library scikit-learn (Pedregosa et al. 2011) was used to implement GMM.
7.3.3.4
Experimental Results
The experimental results are shown in Figs. 7.10 and 7.11. For each type of features, results are given for three patterns: “no channel coupling and no event detection”, “channel coupling and no event detection”, and “channel coupling and event detection”. In the figures, “all”, “out”, and “safe” indicate the overall accuracy, the accuracy for abnormal vibrations, and the accuracy for normal vibrations, respectively. From these results, it can be seen that the following two points affect the accuracy.
Fig. 7.10 Experimental results using PSD (accuracies “All”, “Out”, and “Safe” for: no coupling and no detection; coupling and no detection; coupling and detection)
Fig. 7.11 Experimental results using MFCC (accuracies “All”, “Out”, and “Safe” for: no coupling and no detection; coupling and no detection; coupling and detection)
• The first point is about the type of features. The identification accuracy of the PSD-based model is high for abnormal vibration and low for normal vibration. In other words, the model built by using PSD is biased toward abnormal vibration. On the other hand, although the MFCC-based model is also biased toward abnormal vibration, its identification accuracy for normal vibration is higher than that of the PSD-based model.
• The second point is about the existence of channel coupling. The accuracy is higher when the channels are coupled than when they are not. This is because spatial information can be obtained by combining the vibration data collected in each channel rather than using them separately.
Furthermore, we hypothesized prior to the experiments that, if an event section is detected, only the part where abnormal vibration occurs can be treated as an input and the identification accuracy can be improved. As a result, however, the accuracy with event detection is worse than that without event detection. It can be inferred that the parts where the abnormal event did not occur, such as the vibrations before and after the event, are also an important part of the features for identification. This is probably because the lengths of the vibration data are aligned to some extent.
7.3.4 Considerations
In this section, we introduced an abnormal vibration detection method for automobiles using machine learning. We collected various vibration data and conducted experiments to confirm the effectiveness of the proposed method. The highest accuracy was obtained when classification was performed using MFCC features with channel coupling and without event detection. However, when we visualized and checked the vibration data that were misclassified with a high probability, we found many normal cases (e.g., a closed hatch) in which an event occurs and causes strong vibration. Even when the features are compared as shown in Fig. 7.12, it is difficult to distinguish them visually.

Fig. 7.12 a Features of a thrown stone. b Features of a closed hatch

As a method to correctly classify these vibrations, we also conducted experiments using CNN, which is one of the deep learning methods used for image identification (Hashimoto et al. 2019). As a result, it was confirmed that channel coupling is also advantageous, and the accuracy of CNN is better than that of GMM.
7.4 Hypothesis Intersection: Case of Identification of Central Peak Crater
In this case, hypotheses for multiple data (candidate craters for rotated identical data) are intersected (i.e., overlapped) to obtain a highly accurate hypothesis (i.e., crater). We introduce an automatic discovery method for central peak craters with the ultimate goal of creating a catalog of central peak craters. Using DEM data of the lunar surface obtained by JAXA’s lunar orbiting satellite “KAGUYA (SELENE)”, we identify central peak craters by machine learning and verify whether the proposed methods are effective for the creation of the central peak crater catalog.
• First, we extract DEM data of each crater using the method called rotational pixel swapping for digital terrain model (RPSD method). The RPSD method uses rotational symmetry (i.e., invariance) and overlays hypotheses (candidate craters) at multiple angles to generate the final hypothesis (crater).
• We label the DEM data as either non-craters, non-central peak craters, or central peak craters, and construct a CNN model based on the data. Then we try to identify central peak craters by using the CNN model.
7.4.1 Introduction
There are many craters on the moon, large and small. Among them, there are craters with a special structure called a central peak (hereinafter referred to as central peak craters) (see Fig. 7.13). A central peak has the important feature that materials beneath the lunar crust are exposed on the lunar surface there. In other words, it is possible to estimate the materials of the inner crust by exploring the surface of the central peak. Further, this analysis of the inner crust is expected to help estimate the origin of craters and central peaks, as well as the environment of the past lunar surface and the process of crustal movement. However, at present, the exploration of central peak craters is not actively carried out. This is because the existence of a central peak is confirmed by experts’ visual inspection of images, and only a limited number of central peak craters are known. As a solution to this problem, automating the discovery of central peak craters is expected to significantly increase the number of central peaks that can be candidate sites for exploration on the lunar surface. As a result, a list covering the locations and sizes of central peak craters will be created (hereafter referred to as a central peak crater catalog).
Fig. 7.13 Example of central peak crater (Courtesy of NASA)
Therefore, in this case, the ultimate goal is to create a central peak crater catalog, and we propose an automatic discovery method for central peak craters for that purpose. We use the DEM data of the lunar surface provided by JAXA’s lunar orbiting satellite “Kaguya (SELENE)” to identify central peak craters by machine learning, and we verify whether identification by machine learning is effective in creating a central peak crater catalog. Specifically, the DEM data of craters are first extracted using a high-speed method for crater extraction from DEM data, called rotational pixel swapping for digital terrain model, or the RPSD method for short. The RPSD method used here focuses on the rotational symmetry of craters, that is, their invariance with respect to rotation, and applies hypothesis intersection to generate the final hypotheses (craters). After labeling the obtained craters, we try to identify central peak craters by using CNN. As a result, it is confirmed that CNN is an effective method for identifying central peak craters.
Some craters on the lunar surface have central peaks inside them. The central peak is a mountain-shaped structure formed inside the crater, and it is a valuable observation point on the lunar surface where the internal materials beneath the lunar crust are exposed. Central peaks tend to exist mainly in large craters, since the impacts of the asteroids or comets that formed them were large in scale (Allen 1975; Hale and Head 1979). Thus, the analysis of central peak craters provides plenty of information for the scientific analysis of the moon (Matsunaga et al. 2008). However, at present, there is no central peak crater catalog that can promote the analysis of central peaks. Central peak craters have so far been discovered visually by experts from lunar surface images.
Therefore, it takes time and effort to comprehensively search for candidate central peak craters and to have experts confirm the results one by one. In this case, we try to devise a method for automatically discovering central peak craters. In order to identify central peak craters, experts use images of craters, including candidate central peak craters, and visually confirm structures specific to central peak craters. Therefore, it is considered that the identification of central peak craters can be automated by analyzing these structural features with some method. In this case, the DEM data (Tsubouchi et al. 2016) of the lunar surface collected by JAXA’s “Kaguya (SELENE)” are used. Since the DEM data can express the shapes of craters, it is possible to analyze the structural features of central peak craters by using the DEM data. All the DEM data used in this case are published in DARTS at ISAS/JAXA (JAXA 2022). In order to realize the automatic analysis of the structural features of central peak craters, in this case, we reduce the analysis to classification by machine learning, which does not assume deep knowledge of lunar physical or geological characteristics. For classification by machine learning, we first extract only craters by using the existing RPSD method and give them labels for training a CNN. Then we identify central peak craters with the trained CNN model and examine its effectiveness.
7.4.2 Proposed Method
7.4.2.1 Crater Extraction
As preprocessing in this case, only areas containing craters are extracted from the DEM data on the lunar surface. Figure 7.14 shows an example of DEM data. The vertical axis represents the north–south range in pixels, the horizontal axis represents the east–west range in pixels, and the color represents the altitude in meters (converted to a pixel value). In this case, craters are automatically extracted using the RPSD method (Yamamoto et al. 2017).
Fig. 7.14 DEM data
The basic idea of the RPSD method is to take DEM data of an arbitrary region and focus on the altitude, gradient, and rotational symmetry at each location (pixel) within a crater. The center point of a crater is estimated first, and then the altitudes of the surrounding points are analyzed to determine the extent of the crater. Concretely, a certain range of DEM data to be analyzed is first acquired. Next, from the acquired DEM data, the slope gradient map $G(x, y)$, which gives the gradient of the slope at each point $(x, y)$, and the slope azimuth map $A(x, y)$, which gives the azimuth of the slope at each point $(x, y)$, are calculated. Next, maps $G_\phi(x, y)$ and $A_\phi(x, y)$ are created by rotating the points of $G(x, y)$ and $A(x, y)$ by an angle $\phi$. In this case, maps for $\phi$ = 90°, 180°, and 270° were created. Figures 7.15 and 7.16 show examples of a slope gradient map and of a slope azimuth map, respectively. The analysis range is the same as in Fig. 7.14. The color of the slope gradient map shows the magnitude of the gradient, and the color of the slope azimuth map shows the direction of the gradient. 0°, 180° (or −180°), −90°, and 90° indicate east, west, south, and north, respectively. After creating the above maps, the rotational symmetry at a certain pixel $(x_0, y_0)$ is determined. First, the range of ±$l_{max}$ (the parameter specifying the maximum size to be extracted) around the center point $(x_0, y_0)$ on both the vertical and horizontal axes is cut out from $G(x, y)$ and $A(x, y)$ to create $G_{x_0,y_0}(x, y)$ and $A_{x_0,y_0}(x, y)$, respectively. The same range is cut out from $G_\phi(x, y)$ and $A_\phi(x, y)$ to create $G_{x_0,y_0,\phi}(x, y)$ and $A_{x_0,y_0,\phi}(x, y)$, respectively. Next, the slope gradient evaluation map $U_{x_0,y_0,\phi}(x, y)$ is created by extracting only the pixels with slope gradients in the range between $\theta_L$ and $\theta_U$ from $G_{x_0,y_0}(x, y)$ and $G_{x_0,y_0,\phi}(x, y)$.
Similarly, from $A_{x_0,y_0}(x, y)$ and $A_{x_0,y_0,\phi}(x, y)$, the slope azimuth evaluation map $V_{x_0,y_0,\phi}(x, y)$ is created by extracting only the pixels whose slope azimuth difference is less than or equal to the threshold $\omega$. Then, the rotational symmetry evaluation map $H_{x_0,y_0}(x, y)$ is created by extracting only the pixels whose values are 1 in both $U_{x_0,y_0,\phi}(x, y)$ and $V_{x_0,y_0,\phi}(x, y)$. In other words, this map extracts only the pixels that satisfy the rotational symmetry condition
Fig. 7.15 a Slope gradient map $G(x, y)$. b 90° rotated slope gradient map $G_{90}(x, y)$
Fig. 7.16 a Slope azimuth map $A(x, y)$. b 90° rotated slope azimuth map $A_{90}(x, y)$
for the 360/$\phi$ different rotation angles. Figures 7.17 and 7.18 show the slope gradient evaluation map, the slope azimuth evaluation map, and the rotational symmetry evaluation map at points with and without rotational symmetry, respectively. In all the maps, yellow points are extracted pixels and purple points are pixels that were not extracted. We sum the values of all pixels of $H_{x_0,y_0}(x, y)$ and perform the same calculation for all pixels of the DEM data to be analyzed. In practice, the calculation is performed only for every few pixels in order to reduce the calculation cost. Thus, the rotational symmetry function $R(x, y)$ for the DEM data is obtained. Then, the points whose function value exceeds the predetermined threshold $f$ are listed as candidate crater centers, ranked in descending order of the function value. Next, the crater diameter is calculated for the listed candidate points. First, the altitude profile $P(n)$ is created from the candidate center point in the positive and negative directions of each of the horizontal and vertical axes. Then, the slope gradient profile $Q(n)$ is created from $P(n)$. Starting from $n = l_{min}$ (the parameter specifying the minimum size of the extraction target), $n$ is gradually increased. We find $n$ at the
Fig. 7.17 a Slope gradient evaluation map at observation points with rotational symmetry $U_{x_0,y_0,\phi}(x, y)$ ($\phi$ = 270°). b Slope azimuth evaluation map at observation points with rotational symmetry $V_{x_0,y_0,\phi}(x, y)$ ($\phi$ = 270°). c Rotational symmetry evaluation map at observation points with rotational symmetry $H_{x_0,y_0,\phi}(x, y)$ ($\phi$ = 270°)
Fig. 7.18 a Slope gradient evaluation map at observation points without rotational symmetry $U_{x_0,y_0,\phi}(x, y)$ ($\phi$ = 270°). b Slope azimuth evaluation map at observation points without rotational symmetry $V_{x_0,y_0,\phi}(x, y)$ ($\phi$ = 270°). c Rotational symmetry evaluation map at observation points without rotational symmetry $H_{x_0,y_0,\phi}(x, y)$ ($\phi$ = 270°)
first point satisfying the following condition: the gradient becomes gentler by $\sigma$ than the maximum gradient of the crater rim, or the gradient becomes negative, and the altitude above the level of the candidate center point is higher than $P_{min}$ (the threshold of minimum depth). If the condition is not satisfied by the time $n$ reaches $l_{max}$, no calculation result is output. Here, we use $P_{min} = l_{max} \times 0.05$ and $\sigma = 15$ according to the research by Yamamoto et al. (2017). The above calculation was performed in four directions (up, down, left, and right of the candidate center point), and the average of the derived values was taken as the crater rim diameter (i.e., crater diameter) with respect to the candidate center point. At this time, an arbitrary integer parameter $s_{min}$ ($0 < s_{min} < 5$) is set, and candidate points for which more than $s_{min}$ calculation results are not output are not regarded as candidate center points of a crater. Figure 7.19a, b shows the altitude profile and the slope gradient profile in the positive direction (eastward) on the horizontal axis. In Fig. 7.19a, b, the target point is the same as in Fig. 7.17a; the horizontal axes represent the number of pixels, and the vertical axes represent the altitude in meters and the gradient in radians, respectively.
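The rim search along one altitude profile can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of Yamamoto et al. (2017): the function names, the unit pixel spacing, the reading of σ as degrees, the grouping of the condition as "(gentler or negative) and deep enough", and the parameter `min_valid` (a hypothetical stand-in that only loosely corresponds to the role of s_min) are all assumptions.

```python
import numpy as np

def rim_distance(profile, l_min, l_max, sigma_deg=15.0, p_min=None, pixel_size=1.0):
    """Return the pixel index n of the estimated rim along one profile P(n), or None."""
    p = np.asarray(profile, dtype=float)
    if p_min is None:
        # Assumption: P_min = 0.05 * l_max, expressed in the units of the profile.
        p_min = 0.05 * l_max * pixel_size
    # Slope gradient profile Q(n) in degrees, from neighbouring altitude differences.
    q = np.degrees(np.arctan(np.diff(p) / pixel_size))
    for n in range(l_min, min(l_max, len(q) - 1) + 1):
        steepest = q[: n + 1].max()             # steepest slope seen so far
        flattened = q[n] <= steepest - sigma_deg
        descending = q[n] < 0.0
        deep_enough = (p[n] - p[0]) > p_min
        if (flattened or descending) and deep_enough:
            return n
    return None                                  # no result within l_max

def crater_radius(dem, row, col, l_min, l_max, min_valid=3):
    """Average rim distance over the four axis directions, or None if too few succeed."""
    dem = np.asarray(dem, dtype=float)
    profiles = [
        dem[row, col:col + l_max + 2],           # increasing column index
        dem[row, col::-1][: l_max + 2],          # decreasing column index
        dem[row:row + l_max + 2, col],           # increasing row index
        dem[row::-1, col][: l_max + 2],          # decreasing row index
    ]
    hits = [rim_distance(p, l_min, l_max) for p in profiles]
    hits = [h for h in hits if h is not None]
    if len(hits) < min_valid:
        return None
    return float(np.mean(hits))
```

On a synthetic bowl whose floor rises quadratically up to a rim 10 pixels from the center and falls off beyond it, `crater_radius` returns the rim distance in pixels; a flat profile yields no result.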
Fig. 7.19 a Altitude profile $P(n)$ for the positive direction of the horizontal axis. b Slope gradient profile $Q(n)$ for the positive direction of the horizontal axis
By the above calculation, the craters with radii in the range $l_{min}$ to $l_{max}$ that exist in one piece of DEM data are extracted. By executing this for each piece of DEM data, a data set of craters is obtained.
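The rotational symmetry evaluation at a candidate center can be sketched as follows. This is a simplified illustration under stated assumptions rather than the RPSD implementation itself: instead of rotating the precomputed maps, the sketch rotates the DEM window with `np.rot90` and recomputes the slope gradient and azimuth, so that the azimuth values of the original and rotated windows are directly comparable. The function names, the unit pixel spacing, and the use of array coordinates for the azimuth (rather than the east/north convention of the figures) are also assumptions.

```python
import numpy as np

def slope_maps(dem, pixel_size=1.0):
    """Return (gradient, azimuth) in degrees for a 2-D DEM window."""
    dz_drow, dz_dcol = np.gradient(np.asarray(dem, dtype=float), pixel_size)
    gradient = np.degrees(np.arctan(np.hypot(dz_dcol, dz_drow)))
    azimuth = np.degrees(np.arctan2(dz_drow, dz_dcol))   # array coordinates
    return gradient, azimuth

def symmetry_score(dem, row, col, l_max, theta_l, theta_u, omega):
    """Count the pixels that satisfy the symmetry condition for 90, 180 and 270 degrees."""
    window = np.asarray(dem)[row - l_max: row + l_max + 1,
                             col - l_max: col + l_max + 1]
    g0, a0 = slope_maps(window)
    h = np.ones(window.shape, dtype=bool)
    for quarter_turns in (1, 2, 3):
        g_phi, a_phi = slope_maps(np.rot90(window, quarter_turns))
        # U: gradient within [theta_l, theta_u] in both the original and rotated map.
        u = ((g0 >= theta_l) & (g0 <= theta_u)
             & (g_phi >= theta_l) & (g_phi <= theta_u))
        # V: azimuth difference (wrapped to [-180, 180]) no larger than omega.
        diff = np.abs((a0 - a_phi + 180.0) % 360.0 - 180.0)
        v = diff <= omega
        h &= u & v                                # H: pixels passing both tests
    return int(h.sum())                           # one value of R at (row, col)
```

Evaluating `symmetry_score` on a grid of candidate pixels and keeping those above a threshold would give a ranked list of candidate centers. A radially symmetric bowl scores high, whereas a uniformly tilted plane scores zero because its azimuth changes under rotation.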
7.4.2.2 Identification of Central Peak Crater
Next, the extracted craters are classified by machine learning in order to identify central peak craters. Since each piece of DEM data of an extracted crater has a format similar to an image with only one value per pixel (i.e., one channel), machine learning methods used in the image processing field are expected to be effective as well. Therefore, classification is performed by a convolutional neural network (CNN) (Krizhevsky et al. 2012), which is one of the typical supervised learning models for image classification. An outline of the operation of a CNN is as follows. The image input to the CNN first passes through a network in which two types of layers (i.e., convolutional layers and pooling layers) are stacked in an arbitrary number. A convolutional layer convolves the input image with a filter. A pooling layer compresses the input image. We briefly review the principle. For simplicity, consider a two-dimensional image with one channel for both input and output, that is, a two-dimensional image with only one value for each pixel. Let the input image be $I(x, y)$ and the filter kernel (the matrix representing the filter) be $K(x, y)$. Then the convolution outputs $O(x, y)$ as follows.
$$O(x, y) = \sum_{j=1}^{3} \sum_{k=1}^{3} I(x - 2 + j,\; y - 2 + k) \times K(j, k)$$
By this calculation, the convolutional layer can extract features such as edges from the input image. For example, a kernel that acts as a filter extracting edges in the y-axis direction is as follows.
$$\begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$
If a dimension for multiple channels is added to the kernel to make a three-dimensional matrix, different features are extracted for each channel, and the output is a three-dimensional image with multiple values for each pixel. After features are extracted from the images and their dimensions are reduced by the convolutional layers and the pooling layers, they are classified by the fully connected layers to output the results.
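The convolution formula above can be checked numerically with a short NumPy sketch. The function name `convolve3x3` and the choice to leave border pixels at zero are assumptions made for illustration; the first array index plays the role of the first argument of I and K.

```python
import numpy as np

def convolve3x3(image, kernel):
    """O(x, y) = sum_{j,k=1..3} I(x-2+j, y-2+k) * K(j, k) for interior pixels."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    out = np.zeros_like(image)                    # border pixels stay 0 (assumption)
    for x in range(1, image.shape[0] - 1):
        for y in range(1, image.shape[1] - 1):
            # The 3 x 3 neighbourhood centred on (x, y), weighted by the kernel.
            out[x, y] = np.sum(image[x - 1:x + 2, y - 1:y + 2] * kernel)
    return out

# The edge-extraction kernel given in the text.
edge_kernel = np.array([[1, 2, 1],
                        [0, 0, 0],
                        [-1, -2, -1]])

# A step image: 0 in the first three rows, 1 in the remaining rows.
step = np.zeros((6, 6))
step[3:, :] = 1.0
response = convolve3x3(step, edge_kernel)
```

In this sketch the response is nonzero only in the two rows adjacent to the step, and it vanishes for the transposed image, whose values do not vary along the first index.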
In the above network, the desired classification model is obtained by training the weights and biases of the kernels of the convolutional layers and of the fully connected layers with the training data. For classification by the CNN, the craters detected by the RPSD method are manually labeled before the classification model is created. For labeling, the DEM of each extracted crater was displayed as a two-dimensional image, and experts visually inspected it to determine which label should be attached. There are
three types of labels to attach: “not a crater”, “a non-central peak crater”, and “a central peak crater”. The label “not a crater” is needed for two reasons. First, the parameters of the RPSD method are adjusted so that the search for central peak craters is comprehensive, which produces a data set with high recall and low precision, so objects that are not craters are also extracted. Second, we want to verify how well the CNN can identify craters regardless of the existence of central peaks.
7.4.2.3 Evaluation Method
The precision and recall used as indexes of classification performance in this case are explained here. In general, the precision and recall for any data label L (classification class) are calculated by the following formulas (see Fig. 7.20).
$$\text{precision} = \frac{\text{number of data items identified as } L \text{ by machine learning and with } L \text{ as the correct label}}{\text{total number of data items identified as } L \text{ by machine learning}}$$
$$\text{recall} = \frac{\text{number of data items identified as } L \text{ by machine learning and with } L \text{ as the correct label}}{\text{total number of data items with } L \text{ as the correct label}}$$
The precision is an index that measures how error-free the experimental results are. On the other hand, the recall is an index that measures how well the experimental results cover the desired results. Generally, these two indicators are in a trade-off relationship. In other words, the higher the precision, the lower the recall tends to be, and vice versa. In this case, the purpose is to automatically extract as many central peak craters as possible, as correctly as possible, by constructing a classification model for central peak craters. Therefore, the main goal of this case is to build a classification model with balanced precision and recall, such that central peak craters are likely to be identified correctly and other objects are unlikely to be identified incorrectly as central peak craters.
Fig. 7.20 Precision and recall (the figure illustrates correct data and result data, with recall = 5/9 and precision = 5/10)
7.4.3 Experiments
7.4.3.1 Data Set
In this experiment, the DEM data of the lunar surface published by ISAS/JAXA are used. The surface of the lunar sphere is transformed into a rectangle by simple cylindrical projection, which projects a sphere onto a cylinder of the same diameter; the horizontal axis then corresponds to longitude (east–west) and the vertical axis to latitude (north–south). The entire spherical surface is divided into 1° × 1° squares, and each square consists of 4096 × 4096 pixels, each of which stores altitude data. There are 360 × 360 = 129,600 squares in total, but the closer to the north or south pole, the greater the distortion of the DEM data in the longitude direction, making crater extraction more difficult. Therefore, the experiment covers the region from 60° north latitude to 60° south latitude and from 180° east longitude to 180° west longitude. Labeling the craters extracted by RPSD shows that the number of central peak craters is very small compared with non-craters and non-central peak craters (i.e., the problem of imbalanced data). In general, if the number of data points with a specific label is relatively small in the training data for a CNN, the trained classification model may become extremely inaccurate in identifying that label. Therefore, in this case, in order to make the number of data points for each label uniform in the training data set, image processing operations such as flipping and rotation are applied (i.e., data augmentation) only to central peak craters, so that the data set for central peak craters is oversampled. The preprocessed data set is divided into training data and validation data, the identification model is trained with the training data, and the identification accuracy is verified with the validation data.
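Oversampling the minority class by flipping and rotation can be sketched as follows. The source does not state which flips and rotation angles were used, so the eight variants below (four 90° rotations, each with and without a horizontal flip) and the function names `augment` and `oversample` are assumptions chosen for illustration.

```python
import numpy as np

def augment(dem):
    """Return the 8 rotation/flip variants of a square 2-D array."""
    variants = []
    for quarter_turns in range(4):
        rotated = np.rot90(dem, quarter_turns)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants

def oversample(samples, target_count, seed=0):
    """Draw target_count augmented samples from the minority class."""
    rng = np.random.default_rng(seed)
    pool = [variant for sample in samples for variant in augment(sample)]
    # Sample with replacement only when the pool is smaller than the target.
    idx = rng.choice(len(pool), size=target_count,
                     replace=len(pool) < target_count)
    return [pool[i] for i in idx]
```

For example, `oversample(central_peak_dems, target_count=len(non_crater_dems))` would bring a hypothetical list of central peak crater arrays up to the size of a larger class.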
7.4.3.2 Parameter Settings
In the experiment, some of the parameters for the RPSD method are adjusted based on the crater catalog (Andersson and Whitaker 1982) published by USGS (US Geological Survey)/NASA (USGS 2022).
The parameters to be adjusted are $f$ and $s_{min}$, which have a large effect on the crater extraction accuracy. The larger the values of both parameters, the stricter the conditions on the shape of a crater to be extracted. In this case, since the extracted craters are subsequently identified by the CNN, the parameters are adjusted in the direction of loosening this strictness, with the aim of increasing the absolute number of craters extracted. Making $f$ small increases the number of candidate crater centers and admits craters whose center points have lower rotational symmetry, that is, craters that are strongly distorted from a perfect circle. Reducing the value of $s_{min}$ admits cases where a “gouge” by another crater exists somewhere to the north, south, east, or west of the crater center. Both parameters are adjusted by grid search (i.e., evaluating all combinations of candidate parameter values), and as a result, the values $f$ = 0.003 and $s_{min}$ = 3 are used. With these parameters, the recall of the extraction is about 80% for craters of the target size. In addition, $l_{min}$ and $l_{max}$ are set to smaller values than those used by Yamamoto et al. (2017), trading some accuracy in identifying candidate crater center points for a lower calculation cost. When searching for larger craters, instead of increasing these values, we stitch together multiple pieces of DEM data and downscale the resulting DEM to 4096 × 4096 (i.e., the extractable range). In the experiments, three sizes of analysis target range are considered for one calculation by the RPSD method: a 32° square, a 64° square, and a 128° square. By setting ($l_{min}$, $l_{max}$) to (4, 16), (8, 32), and (16, 64) for the respective sizes, the diameters of craters to be extracted are set to 8–128 km. For craters larger than these, the craters on the USGS crater list are used as they are, instead of using the RPSD method.
Among the extracted craters, the number of small craters is enormous, but the number of central peak craters is small. Therefore, in our experiment, only craters with a diameter of about 8 km or more are used for learning.
7.4.3.3 Learning Model
We extracted the 500 × 500 DEM data of the craters and used them as input. The CNN model constructed in our experiment is as follows.
1. Convolutional layer and pooling layer with 16 filters.
2. Convolutional layer and pooling layer with 32 filters.
3. Convolutional layer and pooling layer with 64 filters.
4. Convolutional layer and pooling layer with 128 filters.
5. Fully connected layer of 4,096 nodes × 2 with the Leaky ReLU activation function, which makes the output proportional to the input, with a small coefficient when the input is negative.
6. Three-class classification with a softmax function.
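The two functions named in the model description can be written out in a few lines of NumPy. This is a generic sketch of the standard definitions, not the code used in the experiment; in particular, the negative-side coefficient `alpha = 0.01` is an assumed value, since the source does not give it.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Identity for x >= 0, small slope alpha for x < 0 (alpha is an assumption)."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, alpha * x)

def softmax(logits):
    """Convert class scores into probabilities that sum to 1 along the last axis."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Three scores, one per class: non-crater, non-central peak crater, central peak crater.
probabilities = softmax(leaky_relu([-2.0, 0.5, 3.0]))
predicted_class = int(np.argmax(probabilities))
```

The predicted label is the class with the largest softmax probability.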
We used Adam (Kingma and Ba 2014) as the optimization algorithm in learning. We used TensorFlow (Abadi et al. 2016) as the deep learning framework and NVIDIA GeForce GTX 1080Ti as the GPU-based computing environment. We trained CNN for 20,000 epochs by using the stochastic gradient descent method with 50 as minibatch size (Bottou 2010). To prevent overfitting, we used early stopping (i.e., stopping learning before the generalization error increases again) (Hastie et al. 2009) to obtain the optimum learning model.
7.4.3.4 Identification Performance
The identification results of non-craters, non-central peak craters, and central peak craters by the proposed method are summarized in Table 7.5. The recall and precision with respect to the identification of central peak craters were 83.5% and 45.7%, respectively. We analyzed factors behind the low precision. As shown in Table 7.5, there are many cases where non-central peak craters were erroneously identified as central peak craters. This is partly because in preparing the training data manually, all craters with some kind of convex structures within them were labeled as central peak craters in order to compensate for the small number of central peak craters. Figure 7.21 shows successful and unsuccessful identification of craters classified as central peak craters by the classification model. On the other hand, the recall and precision with respect to the identification of craters (i.e., central peak craters and non-central peak craters) were 82.6% and 84.6%, respectively. From the results, CNN is judged as useful at least for identifying whether it is a crater or not. However, there were many cases where non-craters were erroneously identified as central peak craters. Therefore, it is considered possible to realize the identification model with higher accuracy if another model of binary classification for identifying whether the output is a crater or not is separately constructed in advance and used as the preliminary identification.
Table 7.5 Identification results (columns give the correct class label; rows give the identification result)
Identification result      Non-crater   Non-central peak crater   Central peak crater   Recall (%)   Precision (%)
Test data samples          966          439                       272
Non-crater                 859          111                       13                    88.9         87.4
Non-central peak crater    32           133                       32                    30.3         67.5
Central peak crater        75           195                       227                   83.5         45.7
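The recall and precision values in Table 7.5 follow directly from the counts in the table, as the following sketch shows. The array layout (rows as identification results, columns as correct labels) mirrors the table; the function names are illustrative assumptions.

```python
import numpy as np

# Rows: identification result; columns: correct class label.
# Class order: non-crater, non-central peak crater, central peak crater.
confusion = np.array([[859, 111, 13],
                      [32, 133, 32],
                      [75, 195, 227]])

def per_class_metrics(matrix):
    """Return (recall, precision) per class, in percent."""
    hits = np.diag(matrix).astype(float)
    recall = 100.0 * hits / matrix.sum(axis=0)      # divide by column totals
    precision = 100.0 * hits / matrix.sum(axis=1)   # divide by row totals
    return recall, precision

def crater_vs_non_crater(matrix):
    """Merge the two crater classes and return (recall, precision) in percent."""
    crater_hits = matrix[1:, 1:].sum()
    recall = 100.0 * crater_hits / matrix[:, 1:].sum()
    precision = 100.0 * crater_hits / matrix[1:, :].sum()
    return recall, precision

recall, precision = per_class_metrics(confusion)
crater_recall, crater_precision = crater_vs_non_crater(confusion)
```

Rounded to one decimal place, these computations reproduce the per-class values in the table and the 82.6% recall and 84.6% precision reported for identifying craters regardless of central peaks.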
Fig. 7.21 a Successful identification of a central peak crater. b Unsuccessful identification of a central peak crater
References
Abadi M, Agarwal A et al (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467
Allen CC (1975) Central peaks in lunar craters. Earth Moon Planets 12(4):463–474
Andersson LA, Whitaker EA (1982) Nasa catalogue of lunar nomenclature
Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT2010, pp 177–186
Cabinet Office (CAO) (2022) Government of Japan, measures to evacuate smoothly from tsunami in the shortest possible time (in Japanese). http://www.bousai.go.jp/jishin/tsunami/hinan/7/pdf/sub1.pdf. Accessed 2022
Freeman LC (1978) Centrality in social networks conceptual clarification. Soc Netw 1(3):215–239. https://doi.org/10.1016/0378-8733(78)90021-7
Gontscharov S, Baumgärtel H, Kneifel A, Krieger K-L (2014) Algorithm development for minor damage identification in vehicle bodies using adaptive sensor data processing. Procedia Technol 15:586–594
Google Street View (2022) https://www.google.co.jp/intl/ja/streetview/ Accessed 2022
Hale W, Head JW (1979) Central peaks in lunar craters-morphology and morphometry. Proc Lunar Planet Sci Conf 10:2623–2633
Hara S, Yamamoto Y, Araki T, Hirota M, Ishikawa H (2019) Identification of moon central peak craters by machine learning using Kaguya DEM. JAXA Res Dev Rep J Space Sci Inform Jpn 8:1–10 (in Japanese)
Hashimoto W, Hirota M, Araki T, Yamamoto Y, Egi M, Hirate M, Maura M, Ishikawa H (2019) Detection of car abnormal vibration using machine learning. In: Proceedings of IEEE international symposium on multimedia (ISM)
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer
JAXA (2022) DARTS at ISAS/JAXA. http://darts.jaxa.jp. Accessed 2022
Kanno M, Ehara Y, Hirota M, Yokoyama S, Ishikawa H (2016) Visualizing high-risk paths using geo-tagged social data for disaster mitigation. In: Proceedings of 9th ACM SIGSPATIAL international workshop on location-based social networks (LBSN2016)
Kingma D, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems, vol 25. Curran Associates, Inc., pp 1097–1105
Logan B (2000) Mel frequency cepstral coefficients for music modeling. In: Proceedings of ISMIR
Matsunaga T, Ohtake M et al (2008) Discoveries on the lithology of lunar crater central peaks by selene spectral profiler. Geophys Res Lett 35(23)
McFee B, McVicar M et al (2018) librosa/librosa:0.6.2
Mesaros A, Heittola T et al (2017) Dcase 2017 challenge setup: tasks, datasets and baseline system. In: Proceedings of DCASE 2017–workshop on detection and classification of acoustic scenes and events
Mesaros A, Diment A, Elizalde B, Heittola T, Vincent E, Raj B, Virtanen T (2019) Sound event detection in the DCASE 2017 challenge. IEEE/ACM Trans Audio Speech Lang Process. https://doi.org/10.1109/TASLP.2019.2907016
Millard–Ball A, Murray G, Ter Schure J, Fox C, Burkhardt JE (2005) Car-Sharing: where and how it succeeds. In: Transit cooperative research program (TCRP) report, vol 108. Transportation Research Board
Ministry of Land, Infrastructure, Transport and Tourism (MILT) (2022) Evacuation facility data (in Japanese). https://nlftp.mlit.go.jp/ksj/gml/datalist/KsjTmplt-P20.html Accessed 2022
Srivastava N, Hinton G et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(56):1929–1958
OpenStreetMap (2022) https://openstreetmap.jp/. Accessed 2022
Patil K, Kulkarni M, Sriraman A, Karande SS (2017) Deep learning based car damage classification. In: Proceedings of 16th IEEE international conference on machine learning and applications (ICMLA)
Pedregosa F, Varoquaux G et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830
pgRouting (2022) https://pgrouting.org/. Accessed 2022
Porta S, Crucitti P, Latora V (2006) The network analysis of urban streets: a dual approach. Physica A 369(2):853–866. https://doi.org/10.1016/j.physa.2005.12.063
Porta S, Strano E, Iacoviello V, Messora R, Latora V, Cardillo A, Wang F, Scellato S (2009) Street centrality and densities of retail and services in Bologna, Italy. Environ Plann B Plann Des 36(3):450–465. https://doi.org/10.1068/b34098
Shinjuku City (2022) Efforts of the Shinjuku station area disaster prevention council (in Japanese). https://www.city.shinjuku.lg.jp/content/000201844.pdf. Accessed 2022
The Foundation for Promoting Personal Mobility and Ecological Transportation (Eco-Mo Foundation) (2018) Changes in the number of car-sharing vehicles and the number of members (in Japanese). http://www.ecomo.or.jp/environment/carshare/carshare_graph2018.3.html
Tokyo Metropolitan Government (2022a) Summary of measures for stranded persons in Tokyo implementation plan. https://www.bousai.metro.tokyo.lg.jp/_res/projects/default_project/_page_/001/005/238/e_plan_summary.pdf. Accessed 2022
Tokyo Metropolitan Government (2022b) Designation of evacuation sites and stayed-in areas for fires caused by an earthquake in Tokyo wards. https://www.toshiseibi.metro.tokyo.lg.jp/bosai/hinan/pdf/pamphlet_en.pdf. Accessed 2022
Tokyo Metropolitan Government (2022c) Evacuation center and evacuation site (in Japanese, but automatically translated into English). https://www.bousai.metro.tokyo.lg.jp/bousai/1000026/1000316.html. Accessed 2022
Tokyo Metropolitan Government (2022d) Your community’s earthquake risk 2018. https://www.toshiseibi.metro.tokyo.lg.jp/bosai/chousa_6/download/earthquake_risk.pdf Accessed 2022
Trnkoczy A (1999) Understanding and parameter setting of sta/lta trigger algorithm. In: New manual of seismological observatory practice
Tsubouchi A, Shinoda R et al (2016) Verification report of elevation values of digital elevation model (DTM) and digital elevation model (DEM) products derived from stereo pair data obtained by the terrain camera onboard SELENE (Kaguya). In: JAXA research and development memorandum, vol JAXA-RM-15-006, pp 1–36 (in Japanese). https://jaxa.repo.nii.ac.jp/?action=repository_uri&item_id=2462&file_id=31&file_no=1
Tsuchida T, Kato D, Endo M, Hirota M, Araki T, Ishikawa H (2017) Analyzing relationship of words using biased LexRank from geotagged tweets. In: Proceeding of the 9th international conference on management of digital ecosystems (MEDES2017)
USGS/NASA (2022) https://planetarynames.wr.usgs.gov/Page/MOON/target. Accessed 2022
Yamamoto S, Matsunaga T, Nakamura R, Sekine Y, Hirata N, Naru, Yamaguchi Y (2017) An automated method for crater counting using rotational pixel swapping method. IEEE Trans Geosci Remote Sens 55(8):4384–4397
Yun S, Kim S, Moon S, Cho J, Kim T (2016) Discriminative training of GMM parameters for audio scene classification. Tech Rep. In: DCASE 2016 challenge
Chapter 8
Interpretation
8.1 Necessity to Interpret and Explain Hypothesis
Why is it necessary to interpret a hypothesis in the first place? In a nutshell, the objective is to get the users (stakeholders) involved in data analysis applications to accept the hypothesis. As explained in Sect. 1.7.2, the parties involved in data analysis applications are as follows.
• Analysts (data scientists and data engineers).
• Field experts.
• End users.
We explain hypothesis interpretation from the viewpoint of user types. In short, analysts and field experts need to determine whether analytical applications (i.e., big data applications) in which hypotheses play a central role are reliable from a purely technical standpoint. In addition, the service providers using such applications are responsible for enabling the service recipients to understand and use the applications. In other words, the beneficiary of the service (i.e., the end user) has the right to explanation for the overall application, including the individual decisions of the service (Kaminski 2019). For that purpose, it is essential to interpret a hypothesis as a basis. For example, in scientific applications, it is easy to imagine that the analysts and field experts are a team of data engineers and scientists. In social infrastructure applications, the analysts are often data engineers and data scientists, the field experts are decision-makers, and the end users are ordinary people who receive services.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 H. Ishikawa, Hypothesis Generation and Interpretation, Studies in Big Data 139, https://doi.org/10.1007/978-3-031-43540-9_8
8.2 Explanation in the Philosophy of Science
Here, we first summarize the position of explanation in the philosophy of science from a historical point of view. The following types of explanation have been proposed in the philosophy of science (Woodward and Lauren 2021).
8.2.1 Deductive Nomological Model of Explanation
The Deductive Nomological model (DN model) of explanation is composed of facts as input, general rules of reasoning, and conclusions. In this case, the facts and rules are collectively called an explanatory term (explanans), and the conclusion is called an explained term (explanandum). In the DN model, the explanandum is derived from the explanans by deductive reasoning. The name of the DN model comes from this.
8.2.2 Statistical Relevance Model of Explanation
First, we define what it means for one attribute to be statistically relevant to another. That is, for a certain group p, an attribute C is statistically relevant to another attribute E only if the following condition is satisfied.
$$P(E \mid p.C) \neq P(E \mid p)$$
In the Statistical Relevance model (SR model) of explanation, statistically relevant attributes are considered to be explanatory.
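The statistical relevance condition can be checked on a small data set as sketched below. The records, the attribute names `large` (playing the role of C) and `central_peak` (playing the role of E), and the counts are entirely hypothetical and serve only to illustrate the comparison of P(E | p.C) with P(E | p).

```python
def conditional_probability(records, event, given=lambda r: True):
    """Relative frequency of `event` among the records that satisfy `given`."""
    selected = [r for r in records if given(r)]
    if not selected:
        raise ValueError("no records satisfy the condition")
    return sum(1 for r in selected if event(r)) / len(selected)

def is_statistically_relevant(records, c, e, tol=1e-9):
    """True if P(E | p.C) differs from P(E | p) for the group p given by `records`."""
    p_e_given_pc = conditional_probability(records, e, c)
    p_e_given_p = conditional_probability(records, e)
    return abs(p_e_given_pc - p_e_given_p) > tol

# Hypothetical group p of ten craters (made-up counts).
group_p = (
    [{"large": True, "central_peak": True}] * 3
    + [{"large": True, "central_peak": False}] * 1
    + [{"large": False, "central_peak": True}] * 1
    + [{"large": False, "central_peak": False}] * 5
)

relevant = is_statistically_relevant(
    group_p,
    c=lambda r: r["large"],
    e=lambda r: r["central_peak"],
)
```

In this toy group, the frequency of E among records with C differs from the frequency of E in the whole group, so C would count as statistically relevant to E in the sense defined above.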
8.2.3 Causal Mechanical Model of Explanation
The Causal Mechanical model (CM model) of explanation focuses on a causal process in which a cause produces a result. The causal process is characterized by the ability of the cause to transmit a mark. Furthermore, if two causal processes intersect each other and some change occurs, it is considered that there is a causal interaction between them. In the CM model, the explanation is to trace a causal process and causal interaction leading to an event.
8.2.4 Unificationist Model of Explanation
Phenomena that were explained by applying multiple laws until a certain time may now be explained by only one law. For example, Kepler’s laws and Galileo’s laws were integrated into Newton’s laws. In other words, the fact that the number of inference patterns to be applied has decreased is considered as an explanation. This Unificationist model includes not only the integration of laws but also the integration of phenomena.
8.2.5 Counterfactual Explanation
In counterfactual explanation, we assume a possible world where the cause has not occurred, in order to identify cause–effect relationships. By checking whether the associated effect changes in the possible world, it is judged whether the cause is really the cause. The closer the possible world is to the real world in which the cause occurred, the more valid the counterfactual explanation will be.
8.3 Subjects and Types of Explanation
We describe model explanation from the viewpoint of structure and level of detail. Basically, explanation structurally consists of a subject of explanation and an action of explanation. The correspondences between subjects and actions of explanation are not always fixed. That is, the action of explanation may or may not depend on the subject of explanation. Here, we first introduce the subjects of explanation and then describe the types of explanation based on the subjects.
8.3.1 Subjects of Explanation
The subjects of the explanation are divided into the following basic components.
• How to generate Data (HD).
• How to generate Hypothesis (HH).
• What Features of hypothesis (WF).
• What Reasons for hypothesis (WR).
They will be described in detail later.
8.3.2 Types of Explanation
Using the subject components, explanations are classified into the following two categories with different levels of detail.
8.3.2.1 Macroscopic Explanation
The macroscopic explanation is a general explanation, and it includes the following subjects: HD, HH, WF, and WR. Here, HD, HH, and WF are mandatory subjects for the macroscopic explanation, and WR is targeted as needed.
8.3.2.2 Microscopic Explanation
The microscopic explanation is a more detailed, individual explanation compared with the macroscopic explanation. In addition to the individual data C given as input, the microscopic explanation includes the following subjects, which produce the result R: HH, WF, and WR. Here, WF and WR are mandatory subjects for the microscopic explanation, and HH is targeted as needed. However, for WF and WR, only the parts directly related to the derivation of the result are targeted, rather than the whole.
8.4 Subjects of Explanation Explained The explanation necessary for data management and data analysis of each model (hypothesis) will be described in more detail along the ecosystems of big data applications. As explained in Chap. 1, big data application systems generally consist of different ecosystems, that is, data management and data analysis. The subjects of explanation will be described from the perspective of each ecosystem.
8.4.1 Data Management In general, data management is responsible for data storage and data manipulation. HD (How to generate Data) is to prepare the data necessary to generate a hypothesis (model). HD is mainly facilitated by data manipulation of data management. Therefore, the procedure (i.e., algorithm) for that purpose is the subject of explanation. Why are we trying to explain with an algorithm in pseudocode rather than a program in a specific programming language? This is because algorithms are more abstract and easier to understand than programs. Here, the process flow in the highest level of abstraction is the subject of explanation. Basically, data manipulation consists of the following. • Operations for data search and data transformation. • Conditions for data search. In many cases, the above can be described at once by SQL language (Celko 2014) and SQL-like languages provided by various frameworks for development, such as BigQuery (2022). In contrast to procedural programming languages, the SQL language is called non-procedural or declarative and does not specify in detail how you produce data, but instead specifies what kind of data you want. Therefore, compared to programming languages, the SQL language is easier to understand and is more suitable as a means of explanation. Furthermore, SQL can be applied to a wide variety of models. Therefore, SQL can be viewed as a model-agnostic method at least in data management.
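The contrast between declarative and procedural descriptions can be sketched as follows. This is only a minimal illustration, assuming a hypothetical table Post with hypothetical columns (user_id, city, stay_days) and using Python's built-in sqlite3 module rather than any system mentioned in the text. The SQL statement states what data are wanted, whereas the loop spells out how to obtain them.

```python
import sqlite3

# Hypothetical toy data; the table name and columns are assumptions for illustration.
rows = [
    ("u1", "Yokohama", 3),
    ("u2", "Tokyo", 10),
    ("u3", "Yokohama", 12),
    ("u4", "Yokohama", 40),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Post (user_id TEXT, city TEXT, stay_days INTEGER)")
conn.executemany("INSERT INTO Post VALUES (?, ?, ?)", rows)

# Declarative: the statement names the wanted data (operations and conditions),
# so the statement itself can serve as the explanation of HD.
HD_QUERY = """
    SELECT user_id FROM Post
    WHERE city = 'Yokohama' AND stay_days <= 14
"""
declarative = sorted(r[0] for r in conn.execute(HD_QUERY))

# Procedural: the same selection written as explicit steps.
procedural = []
for user_id, city, stay_days in rows:
    if city == "Yokohama" and stay_days <= 14:
        procedural.append(user_id)
procedural.sort()

print(declarative, procedural)
```

Both variants select the same users; the difference lies in how much of the control flow a reader has to follow in order to understand the condition.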
8.4.2 Data Analysis
8.4.2.1 How to Generate Hypothesis (HH)
Here, the subjects of explanation are described in more detail according to the type of hypothesis generation. Thus, the subjects of explanation that depend on how to generate a hypothesis (model) HH are described below. First, we describe the subject of explanation regarding inference, problem solving, and SQL. Problem solving procedurally generates hypotheses or outputs results from input data while inference and SQL are rather declarative. In a word, they are declarative hypotheses. • Inference: The subjects of explanation are the general laws and the inference rules (e.g., deduction, induction, and plausible reasoning). • Problem solving: The procedure (algorithm) is the subject of explanation. Algorithms are usually written in pseudocode, which is more abstract than programming languages. As one method of abstracting the procedure, we can also use the
Box model (dependency between variables) (Mahajan 2014) (refer to Sect. 2.5.3) to explain the procedure. Alternatively, we can use a data flow diagram based on data input/output (Larsen et al. 1994) and computational graphs (refer to Sect. 4.2.5) as the subject of explanation. • SQL: SQL itself is the subject of explanation. Creating an intermediate table makes explanation easier to understand. Intermediate tables are also expected to be useful for the explanation of integrated hypotheses. Furthermore, SQL can be applied to machine learning models such as linear regression (Linoff 2016) and k-means clustering (Ordonez 2004). The following methods of hypothesis generation are often regarded as a kind of optimization problem. Optimization problems can be generally formulated as “objective function + algorithm”. That is, the following are the subjects of explanation for each model. • Regression: Objective functions and algorithms based on errors for optimization (i.e., minimization) are the subjects of explanation. • Sparse regression: Objective functions and algorithms based on errors with regularization for optimization (minimization) are the subjects of explanation. • Non-negative matrix factorization (NMF): Objective functions and algorithms based on the errors of matrix factorization for optimization (minimization) are the subjects of explanation. • Clustering: Data distances, objective functions, and algorithms for optimization (minimization) are the subjects of explanation. • Decision tree: Algorithms used for tree construction and the index (e.g., entropy and Gini) used for tree node division are the subjects of explanation. • Random Forest: Algorithms used for tree construction and the index (e.g., entropy and Gini) used for tree node division are also the subjects of explanation. Furthermore, the method of aggregating the results (e.g., majority vote) and the method of sampling are also the subjects of explanation. 
• Association rules: Algorithms as well as supports and confidences are the subjects of explanation.
• Neural network (NN): Model types (e.g., CNN and GAN), loss functions, output functions, and algorithms are the subjects of explanation.
8.4.2.2 What Features of Hypothesis (WF)
The overall characteristics of the hypothesis (model) WF that are independent of individual data correspond to the subject of explanation. First, we describe the subject of explanation regarding inference, problem solving, and SQL. Since all these are declarative hypotheses, features of hypotheses are the descriptions and characteristics of them. • Inference: The purpose and characteristics of laws and inference rules (e.g., deduction, induction, and plausible reasoning) are the subject of explanation. They are
usually represented as the names of the rules and laws as well as appropriately added annotations.
• Problem solving: The purpose of the procedure and the characteristics of the operations and conditions within the procedure, which are indispensable for generating models, are explained. They are usually represented as the name of the procedure and arguments (i.e., procedure interface) as well as appropriately added annotations (i.e., comments).
• SQL: All operations and conditions included in SQL commands are the subjects of explanation.
The following are the subjects of explanation for each model of optimization-based hypothesis generation.
• Regression: In addition to a regression equation (model), the importance of variables and the degree of fit of a model are the subjects of explanation. Furthermore, in the case of linear regression, if a path diagram is created in path analysis (Wright 1920) (refer to Sect. 4.2.5), it is possible to visualize the causal relationship using the path diagram as a subject of explanation.
• Sparse regression: In addition to a regression equation (a model), the regularization term and the dimensions (variables) selected as a result of dimensionality reduction are the subjects of explanation (refer to Sect. 4.1.5).
• NMF: The basis vectors, especially the number of basis vectors, are the subject of explanation.
• Clustering: The evaluation of the entire clustering result (e.g., purity and Silhouette index) and the characteristics (e.g., centroid as representative objects) of each cluster are the subjects of explanation (refer to Sect. 5.1.4).
• Decision tree: The index of node division (e.g., Gini) and all rules are the subjects of explanation.
• Random Forest: The importance of variables and the method of summarizing results in addition to the index of node division (e.g., Gini) are the subjects of explanation.
• Association rules: Supports and confidences for all rules are the subjects of explanation.
• NN: The model configuration, loss function, and output function are the subjects of explanation.
8.4.2.3 What Reason for Hypothesis (WR)
The general form to be explained is described below. First, individual output results R and input facts (individual data) C that led to the derivation of R are the subject of explanation. In addition, the applied (i.e., instantiated) parts of the hypothesis, not the entire hypothesis (model), are the subject of explanation. The basic ideas described below conceptually correspond to either the DN explanation, SR explanation, or CM explanation in the philosophy of science.
• Reasoning: In addition to C and R, the rules really applied to C are the subjects of explanation. • Problem solving: In addition to C and R, the conditions and transformations really applied to C in the procedure are the subjects of explanation. • SQL: In addition to C and R, SQL commands really applied to C and the results of intermediate tables are the subjects of explanation. • Regression: The subject of explanation is the model (regression equations) in addition to C and R. • Sparse regression: In addition to C and R, the model (regression equations) is the subject of explanation. • NMF: Data, basis vectors, and feature vectors corresponding to C are the subjects of explanation. • Clustering: C as data points of interest and the characteristics of clusters to which C belongs are the subjects of explanation. • Decision tree: In addition to C and R, parts of the model such as paths (subtrees or a subset of decision rules) applied to individual data C are the subjects of explanation. • Random Forest: In addition to C and R, the characteristics of the model such as the importance of variables are the subjects of explanation. • Association rules: In addition to C and R, supports and confidences of individually applied rules are the subjects of explanation. • NN: In addition to C and R, parts of the model, such as the gradients associated with R, are the subject of explanation.
8.5 Model-Dependent Methods for Explanation The action of explanation, that is, the method related to the presentation of explanation, is summarized. The basic means is the presentation of textual descriptions of items, such as values, categories, and statements. Furthermore, if multiple items are included by the subjects of explanation, they are presented by basic visualization methods such as tables, graphs (charts), and figures to supplement texts.
8.5.1 How to Generate Data (HD) • SQL: SQL commands are presented as they are, by the utility of the database management system (DBMS). Indeed, SQL cannot be applied to all models, but SQL can be used especially in data management of a wide variety of applications. Therefore, SQL can be viewed as a model-agnostic method in data management.
8.5.2 How to Generate Hypothesis (HH)
We explain a method of presenting explanations that depends on the method of hypothesis generation.
• Inference: The description of the laws and the inference rules (types) for generating a hypothesis is presented at an appropriate level of abstraction.
• Problem solving: Outlines of the procedures (algorithms) for generating a hypothesis by problem solving are presented in pseudocode at an appropriate level of abstraction, focusing on operations and conditions. Mahajan’s Box models (Mahajan 2014), computational graphs, or data flow diagrams (Larsen et al. 1994) are presented as auxiliary descriptions of dependency relationships between variables.
• SQL: SQL commands for generating a hypothesis are presented by the DBMS utility. In particular, the focus is on operations and conditions.
• Regression and sparse regression: In hypothesis generation that can be reduced to an optimization problem, an outline of the optimization algorithm together with the objective function is presented in pseudocode.
• NMF: An outline of the optimization algorithm and objective function is presented in pseudocode.
• Clustering: An outline of the optimization algorithm together with the distance function and objective function is presented in pseudocode.
• Decision tree: An outline of the algorithm together with the index used for node division is presented in pseudocode.
• Random Forest: An outline of the algorithm together with the index used for node division is similarly presented in pseudocode.
• Association rules: An outline of the algorithm with the parameters such as minimum support and minimum confidence is presented in pseudocode.
• NN: An outline of the algorithm together with the loss function and output function is presented in pseudocode.
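As one possible illustration of an "objective function + algorithm" outline for regression, the following sketch pairs a squared-error objective with plain gradient descent. It is written in Python rather than pseudocode, and the function names, learning rate, step count, and synthetic data are all assumptions made for this example; they are not taken from the case studies.

```python
import numpy as np

def objective(w, X, y):
    """Objective function: mean squared error of the linear model X @ w."""
    residual = X @ w - y
    return float(residual @ residual) / len(y)

def gradient_descent(X, y, lr=0.1, steps=2000):
    """Algorithm: minimize the objective by repeated gradient steps."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE
        w -= lr * grad
    return w

# Synthetic, noise-free data generated from y = 1 + 2 * x (an assumption).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0])

w = gradient_descent(X, y)
print(w, objective(w, X, y))
```

Presenting the two functions separately mirrors the split between the objective function and the algorithm as subjects of explanation.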
8.5.3 What Features of Hypothesis (WF) • Inference and problem solving: An overall picture (i.e., purpose) of inference rules or procedures for problem solving is presented. Thus, the operations and conditions for achieving the purpose are listed as the features of a hypothesis. Alternatively, the entire dependency of variables, the entire computational graph, and the data flow diagram are presented. • SQL: All SQL commands used for generating a hypothesis are presented by the DBMS utility. • Regression and sparse regression: A regression model and its associated metrics are presented. Especially in the case of linear regression, the degree of fit of the
model is presented by the coefficient of determination, and the importance of variables is presented by the standardized partial regression coefficients. In the case of sparse regression, variables selected as a result of dimensionality reduction by regularization are presented together with their importance. A regression equation is also overlaid in the same graph as the training data. Furthermore, a causal relationship is presented by the path diagram if available.
• NMF: Feature vectors as well as basis vectors are presented.
• Clustering: As the overall evaluation of clustering results, the metric (e.g., the silhouette coefficient) is presented. Data points belonging to each cluster with its centroid are presented in a low-dimensional scatter plot, after dimensionality reduction if needed. In the case of hierarchical clustering, a dendrogram is also presented.
• Decision tree: A full picture of a decision tree or decision rules is presented.
• Random Forest: The importance of variables calculated by integrating the degree of improvement of the division index (e.g., Gini and entropy) is presented.
• Association rules: Each association rule together with its individual support and confidence is presented.
• NN: The structure of the model is presented in a schematic diagram. The loss function and the output function are presented.
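For the linear regression item above, the two metrics can be computed as in the following sketch. It assumes synthetic data and an ordinary least-squares fit with NumPy; the variable names are hypothetical. The standardized partial regression coefficient of a variable is taken here as its raw coefficient multiplied by the ratio of the standard deviation of that variable to the standard deviation of the target.

```python
import numpy as np

# Synthetic data (an assumption): two features on very different scales.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2)) * np.array([1.0, 10.0])
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, b = coef[0], coef[1:]

# Degree of fit: coefficient of determination R^2.
y_hat = A @ coef
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Importance of variables: standardized partial regression coefficients.
beta_std = b * X.std(axis=0, ddof=1) / y.std(ddof=1)

print(r2, b, beta_std)
```

Because the raw coefficients depend on the units of each variable, the standardized values are the ones that can be compared across variables.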
8.5.4 What Reason for Hypothesis (WR) In addition to the presentation of individual data C and corresponding results R, the following explanation is presented depending on the method of model generation. As the explanation here is concerned with the execution of a model, the method of model generation and the features of a model which are presented should be as much instantiated as possible. • Inference and problem solving: For inference and problem solving, the rules and procedures (i.e., operations and conditions) applied individually to derive R are presented using the tracer that interacts with the model execution system (refer to Sect. 8.7). • SQL: The SQL commands actually executed to derive R and the generated intermediate results are presented by the SQL DBMS utility. • Regression and sparse regression: The feature effect of individual data is presented as the feature value times its weight (coefficient), together with the distribution of effect across all the data per feature. • NMF: The analysis results of feature vectors for C as data points of interest are presented by tables or graphs. • Clustering: The characteristics of clusters to which C as data points of interest belong are presented. The data points belonging to the clusters are presented in a low-dimensional scatter plot after dimensionality reduction if needed.
• Decision tree: The paths (subtrees) and rules (subset) individually applied to C are presented. • Random Forest: The importance and values of variables are presented. • Association rules: Individually applied association rules together with the supports and confidences are presented. • NN: Model components that individually contribute to derivation of R are presented by the methods based on the gradient values, such as Grad-CAM. Grad-CAM will be explained in detail in a case study described below.
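The feature effect mentioned above for regression (feature value times weight) can be sketched as follows. The weights, intercept, and data are hypothetical values chosen for illustration, not results from any case study.

```python
import numpy as np

# Hypothetical fitted linear model (assumed values).
weights = np.array([2.0, -1.0, 0.5])
intercept = 1.0

# Hypothetical data set; the first row plays the role of the individual data C.
X = np.array([[1.0, 2.0, 4.0],
              [0.0, 1.0, 2.0],
              [3.0, 0.0, 6.0]])

# Effect of every feature for every data item: value times weight.
effects = X * weights

# Microscopic explanation for C: its own effects and the result R.
c = X[0]
effect_c = c * weights
r = intercept + effect_c.sum()

# Distribution of effects per feature across all data (here: min and max),
# against which the effects of C can be compared.
effect_min = effects.min(axis=0)
effect_max = effects.max(axis=0)

print(effect_c, r, effect_min, effect_max)
```

In practice the per-feature distribution is often drawn as a box plot with the effects of C marked on it; the minimum and maximum used here are only a compact stand-in.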
8.6 Model-Independent Methods of Explanation
So far, we have mainly explained the methods of explanation that depend on the types of models. Here, we explain methods of explanation that do not depend on the types of models (i.e., model-agnostic explanation). Typical model-independent methods of explanation include the following.
• LIME.
• Kernel SHAP.
• Counterfactual explanation.
In this section, only the principles of these methods are described.
8.6.1 LIME
In order to estimate the behavior of the model (i.e., function) f at the input x, local interpretable model-agnostic explanations (LIME) (Ribeiro et al. 2016) evaluates f in the neighborhood z of x. To explain the behavior of f, we approximate f with a simpler model (that is, a surrogate function) g. At the same time, LIME minimizes the complexity of the model g and improves the interpretability of the model. The input to the surrogate function is made easier for the user to interpret so that the factors of the result can be explained. First, the user concretely specifies the following functions in order to determine the model g.
• Function that defines the neighborhood: $\pi_x(z) = \exp\left(-\frac{(x-z)^2}{2\sigma^2}\right)$.
• Function that defines the loss: $L(f, g, \pi_x)$.
• Function that defines the complexity of the model: $\Omega(g)$.
Please note that a small σ in the neighborhood function (kernel) assigns significant weights only to the closest points, and that cosine distance can be specified instead of Euclidean distance in the kernel, depending on the application domain (e.g., text classification).
Furthermore, the model g is specifically defined as follows.
$$g(z') = \Phi \cdot z' \quad \text{(linear regression)}.$$
Here, Φ is a feature attribution vector and z' is a coalition vector. That is, if the component value of the coalition vector is 1, the corresponding feature exists, and if it is 0, the feature does not exist. The values x' and z' as parameters of g correspond to the values x and z as parameters of f, respectively. We add Ω as a regularization term and obtain the following objective function.
$$L(f, g, \pi_x) + \Omega(g).$$
The objective function is concretely expressed as follows.
$$\sum_{z, z'} \left[ f(z) - g(z') \right]^2 \pi_x(z) + \infty \cdot \mathbf{1}\left( \|\Phi\|_0 > K \right).$$
Here $\mathbf{1}$ is an indicator function, and its value is 1 if the number of nonzero features is greater than K, otherwise 0. Therefore, the second term in the objective function constrains the number of nonzero features to be at most K. LIME has been applied to various methods of model generation. They include column-based explanation for tabular data classification, word-based explanation for text classification, and superpixel (segmented image)-based explanation for image classification.
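The principle can be sketched in a few lines of NumPy. This is a simplified illustration under several assumptions, not the reference implementation of LIME: absent features are replaced by a user-supplied baseline, the surrogate includes an intercept, and the constraint on the number of nonzero features is approximated by keeping the K largest coefficients of a weighted least-squares fit and refitting on them (the original paper uses a LASSO-based selection instead). The function name lime_sketch and all parameter values are hypothetical.

```python
import numpy as np

def lime_sketch(f, x, baseline, n_samples=500, sigma=1.0, K=2, seed=0):
    """Fit a sparse, kernel-weighted linear surrogate g around the input x."""
    rng = np.random.default_rng(seed)
    M = len(x)
    # Coalition vectors z': 1 means the feature is present, 0 means absent.
    Zp = rng.integers(0, 2, size=(n_samples, M))
    # Corresponding points z in the original input space.
    Z = np.where(Zp == 1, x, baseline)
    # Neighborhood kernel pi_x(z) = exp(-(x - z)^2 / (2 sigma^2)).
    d2 = np.sum((x - Z) ** 2, axis=1)
    pi = np.exp(-d2 / (2.0 * sigma ** 2))
    fz = np.array([f(z) for z in Z])

    def weighted_fit(cols):
        # Weighted least squares: minimize sum pi * (f(z) - g(z'))^2.
        A = np.column_stack([np.ones(n_samples), Zp[:, cols]])
        sw = np.sqrt(pi)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], fz * sw, rcond=None)
        return coef

    full = weighted_fit(list(range(M)))
    # Keep only the K features with the largest absolute coefficients.
    top = sorted(np.argsort(-np.abs(full[1:]))[:K].tolist())
    refit = weighted_fit(top)
    phi = np.zeros(M)
    phi[top] = refit[1:]
    return phi

# Toy black-box model (an assumption): only the first two features matter.
f = lambda z: 3.0 * z[0] - 2.0 * z[1]
x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)
phi = lime_sketch(f, x, baseline)
print(phi)
```

For this toy model the surrogate recovers the two relevant features and assigns zero attribution to the third, which is what the sparsity constraint is meant to achieve.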
8.6.2 Kernel SHAP
Kernel SHapley Additive exPlanations (SHAP) (Lundberg and Lee 2017) creates an explanation model with a linear model as LIME does. Kernel SHAP minimizes the following objective function.
$$L(f, g, \pi_x) = \sum_{z, z'} \left[ f(z) - g(z') \right]^2 \pi_x(z').$$
Here z' is a coalition vector. In general, the Shapley value used by this model is based on cooperative game theory. As shown below, each feature is regarded as a player, the average marginal contribution of each feature (i.e., player) is set as the Shapley value, and the model g is defined by the linear sum of them.
$$g(z') = \Phi_0 + \sum_{i=1}^{M} \Phi_i z_i'.$$
Here, M is the number of features. Using this, the above objective function is more specifically expressed as follows.
$$L(f, g, \Pi) = \left[ f(z) - C\Phi \right]^2 \Pi.$$
Here, $N = 2^M$. C is a (N, M + 1) matrix, called a coalition matrix. Π is a (N, N) diagonal matrix and has the following diagonal components, where $|c_i|$ denotes the number of features present in the i-th coalition.
$$\Pi_{ii} = \frac{M - 1}{\binom{M}{|c_i|} \, |c_i| \, (M - |c_i|)}.$$
Please refer to the paper by Lundberg and Lee (2017) for these derivations. Note that, in the case of Kernel SHAP, there is no regularization term corresponding to that of LIME. SHAP has been applied to models such as regression, classification, and Random Forest. The Shapley value of a feature is used to explain its average marginal contribution to the result (e.g., prediction and classification) for individual data.
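To make the notion of average marginal contribution concrete, the following sketch computes Shapley values exactly by enumerating all coalitions, using the classical formula from cooperative game theory. Note that this is not the Kernel SHAP regression itself, which estimates the same quantities through the weighted linear model above. Replacing absent features by a baseline value is an assumption of this example, and the helper shapley_kernel_weight merely evaluates the diagonal weight given above for a coalition of size s with 0 < s < M.

```python
from itertools import combinations
from math import comb, factorial

import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    M = len(x)

    def value(S):
        z = np.array(baseline, dtype=float)
        for j in S:
            z[j] = x[j]
        return f(z)

    phi = np.zeros(M)
    for i in range(M):
        others = [j for j in range(M) if j != i]
        for size in range(M):
            # Weight of a coalition of this size in the Shapley average.
            w = factorial(size) * factorial(M - size - 1) / factorial(M)
            for S in combinations(others, size):
                # Marginal contribution of feature i when joining coalition S.
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

def shapley_kernel_weight(M, s):
    """Kernel weight (M - 1) / (C(M, s) * s * (M - s)) for 0 < s < M."""
    return (M - 1) / (comb(M, s) * s * (M - s))

# Toy model (an assumption) with an interaction between features 1 and 2.
f = lambda z: 2.0 * z[0] + z[1] * z[2]
x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)

phi = shapley_values(f, x, baseline)
print(phi, phi.sum(), f(x) - f(baseline))
```

The interaction term is split equally between the two interacting features, and the values sum to the difference between the prediction for x and the prediction for the baseline. Exact enumeration costs on the order of 2^M model evaluations per feature, which is why sampling-based estimation is used for larger M.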
8.6.3 Counterfactual Explanation
We try to explain the importance of features by observing how the result of a hypothesis (model) changes or the accuracy calculated based on the result changes, when reducing or invalidating a certain feature. In other words, in the presence of the event C and the event E, if C is changed by some intervention I, and if E always changes, it can be explained that the result of E is caused by C. This method is called counterfactual explanation (Woodward 2004) in the context of the philosophy of science. However, the counterfactual explanation in the context of machine learning has a slightly different meaning. That is, there are many cases in which the main purpose is to estimate the minimum changes in feature values that substantially affect the result and accuracy (White and Garcez 2019). The following objective function is minimized for a counterfactual example c (Wachter et al. 2018).
$$\lambda \, \mathrm{loss}(f(c), y) + \mathrm{dist}(c, x).$$
Here, the first term is the loss between the prediction f(c) and the desired value y, and the second term is the distance between c and the instance x to be explained.
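A minimal sketch of this minimization is given below for a hypothetical logistic model. The squared error is used as the loss and the squared Euclidean distance as the distance, both chosen here only because they are differentiable; other choices are possible. The weights, λ, learning rate, and step count are assumptions for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def counterfactual(w, b, x, y_target, lam=10.0, lr=0.05, steps=5000):
    """Minimize lam * (f(c) - y)^2 + ||c - x||^2 by gradient descent on c."""
    c = x.astype(float).copy()
    for _ in range(steps):
        p = sigmoid(w @ c + b)                               # prediction f(c)
        grad_loss = 2.0 * (p - y_target) * p * (1.0 - p) * w  # d loss / d c
        grad_dist = 2.0 * (c - x)                             # d dist / d c
        c -= lr * (lam * grad_loss + grad_dist)
    return c

# Hypothetical model f(c) = sigmoid(w . c + b) and instance x to be explained.
w = np.array([1.0, -2.0])
b = 0.0
x = np.array([-1.0, 1.0])

c = counterfactual(w, b, x, y_target=1.0)
before = sigmoid(w @ x + b)
after = sigmoid(w @ c + b)
print(x, c, before, after)
```

The difference c - x indicates which feature values had to change, and by how much, for the prediction to move toward the desired value. A larger λ pushes the prediction closer to y at the cost of a larger distance from x.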
8.7 Reference Architecture for Explanation Management As the need for explanation grows, it is getting increasingly important to systematically manage explanation (Ishikawa et al. 2019, 2020). The reference architecture for managing explanation generally consists of the following modules and interacts with the model generation and execution systems (i.e., data analysis), and the database system (i.e., data management) (see Fig. 8.1). • Explanation extraction: Extract subjects of explanation from the model generation (source), model execution (source), tracer (log data), and database (data and results). • Explanation generation: Generate an overall explanation by combining the subjects of explanation extracted through the explanation extraction. • Visualization: Visualize the generated overall explanation. • Tracer: Record parts related to actual execution using the model generation system, model execution system, and database system.
Fig. 8.1 Reference architecture for explanation management (blocks shown in the figure: visualization, explanation generation, explanation extraction, tracer, data analysis with model generation and model execution, database management, and the stores for model generation (source), model execution (source), tracer (log data), and database (data, results))
• Database system: Manipulate and store data as input and results as output.
Furthermore, the above modules use the following databases.
• Model generation (source).
• Model execution (source).
• Tracer (log data).
• Database (data and results).
8.8 Overview of Case Studies
In the following sections, we will describe examples of explanation by using real case studies. They include the following case studies, together with the methods used and the subjects of explanation.
• Discovery of candidate installation sites for free Wi-Fi access point
– Explanation of candidate sites by SQL (HD, HH, WF, WR)
• Classification of deep moonquakes
– Explanation of selection of important features by VIF (WF)
– Explanation of important features by Balanced Random Forest (HD, HH, WF)
• Identification of central peak craters
– Explanation of identification of craters by RPSD (HD, HH)
– Explanation of identification of central peak craters by CNN (HH) and Grad-CAM (WR)
• Discrimination of antiprotons and antideuterons
– Explanation of focal points in incident paths by Grad-CAM (WR)
– Counterfactual explanation of important features (WF)
– Exploration of basic components by NMF (HH, WF)
8.9 Case of Discovery of Candidate Installation Sites for Free Wi-Fi Access Point This case exemplifies both a macroscopic explanation that presents the whole procedure for data and hypothesis generation, and a microscopic explanation that presents the result. We also demonstrate that SQL is a universal method for explanation spanning data management and data analysis. Let us revisit the application that was taken as an example of hypothesis generation using a difference between hypotheses (refer to Sect. 6.5.1).
8.9.1 Overview
First, this case is summarized as follows.
• (Specific purpose of the case for social infrastructure) This case is related to social infrastructure. Travel agents and local governments are among the parties involved in the evidence-based policy making (EBPM) described in Sect. 1.7.1. They are confronted with the gap between the social need that “a lot of foreign visitors want to use the Internet” and the desirable state that “free Wi-Fi access spots are available for foreign visitors as infrastructure”. Practically, it is important for them to identify areas with such gaps. Therefore, it is necessary to explain to the EBPM decision-makers how the conclusions are drawn (i.e., how areas with such gaps are identified).
• (Data used in the case) Images on the social media site Flickr (2022) and articles on the social media site Twitter (2022) were used. We collected 4.7 million geotagged Tweets using the site-provided API and selected 7500 Tweets posted by foreign tourists who visited Yokohama. We also collected 600,000 Flickr images using the site-provided API and selected 2132 images posted by foreign tourists who visited Yokohama.
• (Method used in the case) First, we identified foreign visitors by using media-specific methods (Mitomi et al. 2016). We selected only foreign visitors’ posts from social data by using SQL. We visualized intermediate hypotheses generated from different data sources (i.e., media sites). From the separate hypotheses, a final result was obtained using SQL as a procedure based on the inter-hypothesis difference method introduced in Sect. 6.5.
• (Results) As the result of the social infrastructure application, we were able to identify areas which have gaps between social needs and the available infrastructure. An explanation of the entire process leading up to the results was found to be useful in discussions with travel agents.
8.9.2 Two Hypotheses Our method of hypothesis generation assumes two separate hypotheses corresponding to the following two locations with respect to foreign visitors. • Spots Where Many Foreign Tourists Visited (WF): Tourist attractions are extracted using the number of photos taken at a location. You may take photos of objects (e.g., landscapes) based on your interests. You may then upload those photos to Flickr. If there are spots where many photos are taken, we define such spots as tourist attractions that many foreigners must visit. Specifically, we counted the number of photos posted by foreign tourists, found where foreign tourists took many photos, and identified popular destinations for foreign tourists. To achieve this, we mapped the photos to a 2D grid, based on the locations where the photos were taken. Here, a 30-square-meter cell was created as grid component. As a result, all obtained cells in the grid contained photos which were taken within the
range. Next, the number of users in each cell was counted. Cells with more users than the predetermined threshold are considered as tourist attractions. • Spots Where Accessible Free Wi-Fi Was Available (WF): We use the number of articles posted at a location to extract locations with accessible free Wi-Fi access spot. Twitter users often post on the spot just when they want to write an article. Therefore, it is considered that there are accessible free Wi-Fi access spots for foreigners in the place where many articles were posted by foreigners. Based on the location of posted articles, we mapped the articles to the same grid introduced above. All obtained cells in the grid contained articles posted within the range. Next, the number of users was counted for each cell. Cells with more users than the predetermined threshold are considered as spots with accessible free Wi-Fi access.
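The grid mapping and user counting shared by both hypotheses can be sketched as follows. The cell size in degrees, the grid origin, the threshold, and the sample posts are all hypothetical values for illustration; the actual study used its own cell size and thresholds.

```python
from collections import defaultdict

def cell_index(lat, lon, lat0, lon0, cell_deg):
    """Map a geotag to the (row, column) index of its grid cell."""
    return (int((lat - lat0) // cell_deg), int((lon - lon0) // cell_deg))

def popular_cells(posts, threshold, lat0, lon0, cell_deg):
    """Return the cells whose number of distinct users exceeds the threshold."""
    users_per_cell = defaultdict(set)
    for user, lat, lon in posts:
        users_per_cell[cell_index(lat, lon, lat0, lon0, cell_deg)].add(user)
    return {cell for cell, users in users_per_cell.items() if len(users) > threshold}

# Hypothetical posts as (user, latitude, longitude).
posts = [
    ("a", 35.4505, 139.6425),
    ("b", 35.4506, 139.6426),
    ("a", 35.4507, 139.6424),  # second post by user "a" in the same cell
    ("c", 35.4442, 139.6368),
]

cells = popular_cells(posts, threshold=1, lat0=35.44, lon0=139.63, cell_deg=0.001)
print(cells)
```

Applying the same function to Flickr photos and to Tweets yields the two intermediate hypotheses on a common grid, which is what makes them directly comparable later.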
8.9.3 Explanation of Integrated Hypothesis
The SQL-based procedure below implements a generalized difference method for generating a hypothesis using different data obtained from Twitter and Flickr. As a result, we could discover tourist spots that are attractive for foreign tourists but have no accessible free Wi-Fi access available for them at least at the time of the experiment.
(HD, HH, WF)
• insert into ForeignerT select * from T: TweetDB where ForeignVisitorT (*);
From the TweetDB table, we create a database of foreign visitors using the function (ForeignVisitorT) as a filter condition that determines the Tweet posters as foreign visitors based on the length of stay in Japan. We store the intermediate result in the ForeignerT table.
(HD, HH, WF)
• insert into ForeignerF select * from F: FlickrDB where ForeignVisitorF (*);
From the FlickrDB table, we create a database of foreign visitors using the function (ForeignVisitorF) as a filter condition that determines the photo posters as foreign visitors based on the places of residence of the photo posters. We store the intermediate result in the ForeignerF table.
(HD, HH, WF)
• insert into GridT (Index) select ForeignerT.Index from ForeignerT group by ForeignerT.Index having count (*) >= ThT;
We “group” foreign visitors’ tweets ForeignerT based on the geotags (corresponding to “by” Index) attached to them. Furthermore, we obtain the indexes of cells with the number of Tweet posts above the predetermined threshold (ThT). In other words, this query retrieves cells with free Wi-Fi access spots available for foreign visitors.
Fig. 8.2 Presented explanation. a Visualization of free Wi-Fi access spots based on posted Tweets. b Visualization of tourist spots based on posted Flickr photos (landmarks labeled on both maps: MinatoMirai Station, Red Brick Warehouse, Sakuragicho Station, Ōsanbashi Pier)
We store the intermediate result (hypothesis) in GridT. Furthermore, the GridT is visualized by a 2D map (see Fig. 8.2a).
(HD, HH, WF)
• insert into GridF (Index) select ForeignerF.Index from ForeignerF group by ForeignerF.Index having count (*) >= ThF;
We “group” foreign visitors’ photos ForeignerF based on the geotags (corresponding to “by” Index) attached to them. Furthermore, we obtain the indexes of cells with the number of photo posts above the predetermined threshold (ThF). In other words, this query retrieves spots visited by many foreign tourists. We store the intermediate result (hypothesis) in GridF. Furthermore, the GridF is visualized by a 2D map (see Fig. 8.2b).
(HH, WF)
• select * from GridF minus select * from GridT;
This query obtains a set difference (set minus) between the cells visited by many foreign tourists represented by GridF and the cells providing accessible free Wi-Fi access spots represented by GridT. Thus, we detect cells containing spots that are popular for foreign visitors but have no accessible free Wi-Fi access spots available for them.
(WR)
For each of GridF and GridT, we present data for the results of interest (e.g., index of Osanbashi Pier). This tells us that GridF contains the index, but GridT does not.
At the same time, the following SQL command used to generate individual results is presented again. • select * from GridF minus select * from GridT;
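The grouping and set-difference steps of the procedure can be tried out with the following sketch, which uses Python's built-in sqlite3 module on hypothetical toy data. Several details are assumptions made so that the statements run on SQLite: the column is named cell instead of Index (INDEX is a reserved word), the set difference is written with EXCEPT (the SQLite and standard SQL counterpart of minus), and the filter functions ForeignVisitorT and ForeignVisitorF are taken as already applied, so the example starts from the ForeignerT and ForeignerF tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ForeignerT (user_id TEXT, cell INTEGER);
    CREATE TABLE ForeignerF (user_id TEXT, cell INTEGER);
    CREATE TABLE GridT (cell INTEGER);
    CREATE TABLE GridF (cell INTEGER);
""")

# Hypothetical posts by foreign visitors, already mapped to grid cells.
conn.executemany("INSERT INTO ForeignerT VALUES (?, ?)",
                 [("a", 1), ("b", 1), ("c", 2)])
conn.executemany("INSERT INTO ForeignerF VALUES (?, ?)",
                 [("d", 1), ("e", 1), ("f", 3), ("g", 3), ("h", 2)])

TH_T, TH_F = 2, 2  # hypothetical thresholds ThT and ThF

# Intermediate hypotheses: cells with enough Tweets / enough photos.
conn.execute(
    "INSERT INTO GridT (cell) SELECT cell FROM ForeignerT "
    "GROUP BY cell HAVING COUNT(*) >= ?", (TH_T,))
conn.execute(
    "INSERT INTO GridF (cell) SELECT cell FROM ForeignerF "
    "GROUP BY cell HAVING COUNT(*) >= ?", (TH_F,))

# Integrated hypothesis: popular cells without accessible free Wi-Fi.
candidates = sorted(r[0] for r in conn.execute(
    "SELECT cell FROM GridF EXCEPT SELECT cell FROM GridT"))

# WR for one result of interest: the cell is in GridF but not in GridT.
cell = candidates[0]
in_f = conn.execute("SELECT COUNT(*) FROM GridF WHERE cell = ?", (cell,)).fetchone()[0]
in_t = conn.execute("SELECT COUNT(*) FROM GridT WHERE cell = ?", (cell,)).fetchone()[0]

print(candidates, in_f, in_t)
```

The intermediate tables GridT and GridF remain available after the final query, so they can be inspected or visualized when explaining why a particular cell appears in the result.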
8.9.4 Experiments and Considerations
In Japan, from July 1, 2014, to February 28, 2015, more than 4.7 million geotagged data items were collected. Using the method ForeignVisitorT, we detected the Tweets posted by foreign tourists; among them, the number of Tweets posted in the Yokohama area exceeded 7500. In Japan, more than 600,000 geotagged photos were collected from July 1, 2014, to February 28, 2015. Using the method ForeignVisitorF, we detected 2132 photos posted by foreign visitors to Yokohama. For example, the result as the difference between GridF and GridT included “Osanbashi Pier” (see Fig. 8.3). It should be noted that, for the sake of simplicity, the above explanation makes the following approximations, which differ slightly from the procedures described in Sect. 6.5.1.
• The explanation does not consider whether users are unique, since the number of photos is simply viewed as that of users.
Fig. 8.3 Yokohama Osanbashi Pier
• The explanation doesn’t consider whether users have mobile communication means.
Let us describe the role of SQL in explanation. It is confirmed that SQL can be used to explain data generation, hypothesis generation, and hypothesis features. It is also confirmed that the visual presentation of intermediate tables together with SQL commands is useful for explaining the reasons for results. In short, SQL is very versatile in both macroscopic and microscopic explanations. Lastly, the generated hypothesis contains only aggregated information (e.g., counts of users), but not information that can lead to individual identification.
8.10 Case of Classification of Deep Moonquakes
This case describes a macroscopic explanation that presents the features contributing to the classification performed by a model.
8.10.1 Overview
First, this case is summarized as follows (Kato et al. 2017).
• Specific purpose of the case for natural science: This application is related to lunar and planetary science. In addition to the direct method for analyzing substances inside the lunar crust (e.g., exploration of the central peak crater), the indirect method of analyzing lunar earthquakes (i.e., moonquakes) is also effective for understanding the internal structure of the moon. In particular, we pay attention to the deep moonquakes, which have epicenters deep inside the moon and for which many events have been observed. Deep moonquakes have been actively investigated by researchers (Frohlich and Nakamura 2009; Wieczorek et al. 2006). As a preliminary survey for that purpose, it is indispensable to classify moonquakes with respect to epicenters (sources). However, it is not yet fully understood which features are effective in classifying moonquakes. Therefore, as an explanation of the classification model, it is considered effective to identify the features that contribute most to the classification results.
• Data used in the case: We used passive seismic data on the moonquakes collected by the Passive Seismic Experiment (PSE) as part of the Apollo Lunar Surface Experiments Package (ALSEP), which was established by the NASA Apollo program (see Fig. 8.4). The data set includes deep moonquakes observed deep inside the nearside of the moon (see Fig. 8.5). As shown in Table 8.1, the data set has 17 sources and 2537 events. However, the number of events varies depending on the moonquake sources.
• Method used in the case: Balanced Random Forest (Chen et al. 2003) is used to describe the features that contribute most to the one-to-one classification of moonquakes with respect to two sources.
• Results: The analysis of classification results using the state parameters (i.e., position and velocity) of the celestial bodies of the solar system (sun, moon, earth, and Jupiter) suggests that the state parameters of the earth are the most effective among them. Jupiter’s state parameters are effective in classifying moonquakes with respect to some sources. The effectiveness was verified through discussions in our research team of natural scientists and IT specialists.
8.10.2 Features for Analysis
(HD) Table 8.2 shows the parameters of the coordinate system used in this case. The IAU_MOON coordinate system is a fixed coordinate system centered on the moon. The z-axis is the direction of the north pole of the moon, the x-axis is the direction of the meridian of the moon, and the y-axis is perpendicular to the x-z plane. The IAU_EARTH coordinate system is a fixed coordinate system centered on the earth. Here, the z-axis is the direction of the conventional international origin, the x-axis is the direction of the prime meridian, and the y-axis is the direction perpendicular to the x-z plane. Here, sun_perturbation means the solar perturbation.
Fig. 8.4 Apollo 15 ALSEP layout (Courtesy of NASA)
Fig. 8.5 The north–south pole cross section (left) and equatorial cross section (right) of the moon’s internal structure. Gray ellipses indicate major nearside clusters of deep moonquake nests. Labels in the figure: Earth, Farside, Mantle, Core, and the Apollo 12, 14, 15, and 16 stations.
Table 8.1 Number of events for each source
Source    Number of events
A1        441
A5        76
A6        178
A7        85
A8        327
A9        145
A10       230
A14       165
A18       214
A20       153
A23       79
A25       72
A33       57
A35       70
A44       86
A204      85
A218      74
As features, the position (x, y, z), velocity (vx, vy, vz), and distance (lt) of each of the moon, the sun, the earth, and Jupiter are used. These features are calculated based on the time of the moonquake events by using SPICE (NASA/NAIF 2022). Generally, SPICE assists scientists involved in the planning and interpretation of scientific observations from space-borne equipment and assists engineers involved in modeling, planning, and executing the activities needed to carry out planetary exploration missions.

Table 8.2 Coordinate system parameters
Target                     Observer                   Coordinate system   Prefix for features
EARTH BARYCENTER           MOON                       IAU_MOON            Earth_from_moon
SOLAR SYSTEM BARYCENTER    MOON                       IAU_MOON            Sun_from_moon
JUPITER BARYCENTER         MOON                       IAU_MOON            Jupiter_from_moon
SOLAR SYSTEM BARYCENTER    EARTH BARYCENTER           IAU_EARTH           Sun_from_earth
JUPITER BARYCENTER         EARTH BARYCENTER           IAU_EARTH           Jupiter_from_earth
SUN                        SOLAR SYSTEM BARYCENTER    IAU_EARTH           Sun_perturbation

SPICE is also used to calculate the perigee cycle at the position at earth_from_moon, and the x-coordinate and y-coordinate cycles of the solar perturbation. From these periodic features and phase angles, the values of sin and cos are calculated as features. Furthermore, the cosine similarity between the position at moon_from_earth and the position at sun_from_earth is calculated as a feature of the sidereal month. In the experiments described here, a total of 55 features obtained in this way are used.
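The derived features described above (the sin and cos of a phase angle, and the cosine similarity of two position vectors) can be sketched as follows. This is only an illustration in plain Python: the function names are hypothetical, and we assume that the phase angle is the fraction of the cycle elapsed at the event time, expressed in radians, which may differ from the definition actually used in this case.

```python
import math


def cyclic_features(t, period):
    """Phase angle of time t within a cycle of length `period`, with its sin and cos.

    Assumption: the phase is the elapsed fraction of the cycle times 2*pi.
    """
    phase = 2.0 * math.pi * ((t % period) / period)
    return phase, math.sin(phase), math.cos(phase)


def cosine_similarity(u, v):
    """Cosine of the angle between two position vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical positions (arbitrary units) of the moon and the sun seen from the earth.
moon_from_earth = (1.0, 0.0, 0.0)
sun_from_earth = (0.0, 1.0, 0.0)
print(cosine_similarity(moon_from_earth, sun_from_earth))
print(cyclic_features(0.25, 1.0))
```

Encoding a phase by its sin and cos keeps the two ends of a cycle close to each other, which a raw phase angle does not.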
8.10.3 Balanced Random Forest
Random Forest is a kind of ensemble learning that combines a large number of decision trees and has the advantage of being able to calculate the importance of features. However, Random Forest has the problem that, when the number of data items to be learned differs greatly between class labels, the learning is biased toward the class with many data items (i.e., the major class). In general, imbalanced data can be handled by weighting the class with few data items (i.e., the minor class). However, if the weights of data belonging to the minor class become too large, the model may overfit to the minor class. Since the number of events for deep moonquakes differs greatly depending on the sources, it is necessary to apply a method that takes data imbalance into account. Therefore, we apply Balanced Random Forest (Chen et al. 2003), which balances the size of data by down-sampling (under-sampling) the large classes.
This ensures that each class has an equal number of samples when building each decision tree.
8.10.3.1 Random Forest
(HH) First, we explain Random Forest as a basis of Balanced Random Forest. Random Forest was proposed by Breiman (Breiman 2001).
Algorithm 8.1 Random Forest Construction
1. Create a collection of data sets B by bootstrap (i.e., random sampling with replacement) from a data set A;
2. Repeat the following
3. {Draw a data set from the collection B, and recursively create a decision tree for it by using the following algorithm;
4.   Create Decision Tree (node)
5.   {If the number of data items at the node to be divided is larger than the predetermined threshold
6.     {Randomly create a subset of the attribute (i.e., feature) set;
7.      Divide the node so that the index (e.g., Gini or entropy) is improved;
8.      Recursively apply Create Decision Tree to each of the divided nodes;
9.      Attach the returned nodes to the current node and return the node;}
10.  }
11. } until the collection B is empty;
Algorithm 8.2 Class Identification
1. Choose a class for a particular input from B result classes by majority vote, or choose a class with the highest probability for a particular input from B result classes.
8.10.3.2 Gini Index
The node division is performed based on the degree of improvement of an index such as the Gini index or entropy. Here, we explain the Gini index. The Gini index is a measure of impurity. The lower the Gini index is, the lower the impurity is. That is, the variance of data is reduced. Therefore, we prefer a lower Gini index. First, the Gini index is defined as follows.

$$I = \sum_{k} p_k (1 - p_k).$$

Here $p_k$ is the ratio of data items belonging to the class $k$. The degree of improvement of the Gini index is calculated by the following.

$$I - \frac{n_L}{n} I_L - \frac{n_R}{n} I_R.$$

Here, $n$, $n_L$, and $n_R$ represent the number of data items before division, that of data items at the left node after division, and that of data items at the right node after division, respectively. $I_L$ and $I_R$ are the Gini indexes of the left node and of the right node, respectively.
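As a small worked example of the two formulas above, the following plain-Python sketch computes the Gini index of a node and the improvement obtained by a division. The function names and the toy labels are ours and are used only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini index I = sum_k p_k (1 - p_k) of the class labels at a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1.0 - c / n) for c in Counter(labels).values())


def gini_improvement(parent, left, right):
    """Improvement I - (n_L / n) I_L - (n_R / n) I_R obtained by dividing a node."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


# Toy node with four events from each of two sources.
parent = ["A1"] * 4 + ["A5"] * 4
perfect = gini_improvement(parent, parent[:4], parent[4:])
mixed = gini_improvement(parent,
                         ["A1", "A1", "A1", "A5"],
                         ["A1", "A5", "A5", "A5"])
print(gini(parent), perfect, mixed)
```

For this toy node the Gini index is 0.5. A division that separates the two sources completely improves it by 0.5, whereas a division leaving one misplaced event on each side improves it by only 0.125, so the first division is preferred.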
8.10.3.3 Balanced Random Forest as Improvement
(HH) Balanced Random Forest (Chen et al. 2003), improved by the same research group as that of Random Forest, performs the following down-sampling instead of simply creating a bootstrap in the above Random Forest algorithm (refer to Algorithm 8.1).
Algorithm 8.3 Down-Sampling
1. Create a bootstrap from the data set with the minor class labels;
2. Create another bootstrap, with the same number of data items as the bootstrap created in Step 1, from the data set with the major class labels;
3. Concatenate the two bootstraps to create a data set to be drawn.
Last but not least, Balanced Random Forest as well as Random Forest explicitly provides the characteristics of the generated model. The importance of each feature on a decision tree is determined by calculating the weighted mean improvement in the Gini index for the feature during each node division. The final feature importance is the average of the importance of the feature in all decision trees contained in Random Forest.
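The down-sampling of Algorithm 8.3 and the final averaging of feature importance can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in this case: the function names are hypothetical, we assume that the bootstrap of the minor class has the same size as the minor class itself, and the per-tree importance values are made up.

```python
import random


def balanced_bootstrap(minor, major, rng):
    """Algorithm 8.3: bootstrap the minor class, then draw equally many major items."""
    # Step 1 (assumption: the bootstrap has the same size as the minor class).
    boot_minor = [rng.choice(minor) for _ in range(len(minor))]
    # Step 2: the same number of items, sampled with replacement from the major class.
    boot_major = [rng.choice(major) for _ in range(len(boot_minor))]
    # Step 3: concatenate both bootstraps.
    return boot_minor + boot_major


def average_importance(per_tree_importances):
    """Average each feature's importance over all trees (absent features count as 0)."""
    n_trees = len(per_tree_importances)
    features = set().union(*per_tree_importances)
    return {f: sum(tree.get(f, 0.0) for tree in per_tree_importances) / n_trees
            for f in features}


rng = random.Random(0)
minor = [("A5", i) for i in range(76)]    # 76 events for source A5 (Table 8.1)
major = [("A1", i) for i in range(441)]   # 441 events for source A1 (Table 8.1)
sample = balanced_bootstrap(minor, major, rng)
print(len(sample), sum(1 for label, _ in sample if label == "A5"))

# Made-up per-tree importances, for illustration only.
print(average_importance([{"earth_from_moon_z": 0.6, "sun_from_moon_z": 0.4},
                          {"earth_from_moon_z": 0.2, "sun_from_moon_z": 0.8}]))
```

Each tree therefore sees as many events of the major class as of the minor class, regardless of how imbalanced the original pair of sources is.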
8.10.4 Experimental Settings
Here, we explain the experiments conducted to evaluate which features are effective for the classification of moonquake sources, and we analyze and consider the results. The relationships between the features and the sources were analyzed based on the classification performance and feature importance by Balanced Random Forest. An outline of feature analysis is shown below.
1. Calculate features based on the time of a moonquake event.
2. Apply Balanced Random Forest to the classification of moonquakes for each pair of two sources.
3. Calculate and analyze the classification performance and the importance of features.
In this case, by constructing a classifier for each pair of two sources in the data set as a one-to-one method (i.e., binary classification method), the analysis focuses on the relationship between the features and the sources. By using Random Forest, 1000 decision trees are built for each classifier. The samples used to build each decision tree are obtained by the bootstrap method, which takes into account the balance of data size for each class. We used scikit-learn (Pedregosa et al. 2011; scikitlearn 2022), a machine learning library for the Python programming language, to construct each decision tree in the Random Forest.
In this case, the following two analyses are performed as feature selection.
1. Build a classifier that uses all 55 extracted features. In conclusion, some of the earth’s features and Jupiter’s features are useful for classification when the moon is the origin.
2. Build a classifier with a reduced number of features by using a variance inflation factor (VIF).
The variance inflation factor (VIF) is calculated as follows.

$$\mathrm{VIF} = \frac{1}{1 - r^2}.$$

Here $r$ is a correlation coefficient between variables. VIF increases if a correlation between variables is large. VIF quantifies the severity of multicollinearity, that is, the phenomenon in which one predictor (variable) of a multiple regression model can be accurately linearly predicted from the others. In this case, we conducted an experiment with features reduced so that the VIF of each feature was less than 6. That is, VIF is calculated based on the experimental results using all features, and features with VIF of 6 or more are deleted. We used statsmodels (statsmodels 2022) to calculate VIF.
From the data set used in this case, the events of 17 sources with 70 or more observations of moonquake events are selected. In this case, the precision, recall, and F-measure were used as indexes for evaluating the classification performance with respect to moonquake sources. The precision is the ratio of true positives (i.e., correct labels) among the items classified as positive, and the recall is the ratio of true positives among all items that are actually positive. The F-measure is an index that considers the balance between the precision and the recall, and is calculated as the harmonic mean of the recall and the precision.
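The one-to-one setting and the evaluation indexes described above can be sketched as follows. The sketch enumerates the pairs of the 17 sources in Table 8.1 and computes the precision, recall, and F-measure for one pair. The function name and the true and predicted labels are hypothetical and serve only as an illustration.

```python
from itertools import combinations

SOURCES = ["A1", "A5", "A6", "A7", "A8", "A9", "A10", "A14", "A18",
           "A20", "A23", "A25", "A33", "A35", "A44", "A204", "A218"]

# One binary classifier is built for each unordered pair of sources.
pairs = list(combinations(SOURCES, 2))


def precision_recall_f(y_true, y_pred, positive):
    """Precision, recall, and F-measure with `positive` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical labels for the pair (A9, A25).
y_true = ["A9", "A9", "A9", "A25", "A25"]
y_pred = ["A9", "A9", "A25", "A9", "A25"]
print(len(pairs))
print(precision_recall_f(y_true, y_pred, positive="A9"))
```

With 17 sources there are 136 pairs, which is the total number of classifiers against which the shares reported in the experimental results are measured.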
8.10.5 Experimental Results
8.10.5.1 Experimental Results Using All Features
(a) Classification Performance
Figure 8.6 shows the F-measures of the classifiers for each pair of moonquake sources. The vertical and horizontal axes indicate the sources, and the value of each element is the F-measure of the classifier for the corresponding pair of sources. In Fig. 8.6, the highest classification performance is 0.97, which is observed in multiple pairs of sources. The lowest classification performance is 0.55, which is observed in the classification between the sources A9 and A25. Figure 8.6 also shows that some combinations of sources are difficult to classify. The number of classifiers with a classification performance of 0.9 or higher is 28, which is about 21% of the total number of classifiers. The number of classifiers with a classification performance of 0.8 or higher and lower than 0.9 is 64, which is 47% of the total. Only one classifier has a performance of less than 0.6. Most classifiers show high classification performance. Therefore, we conclude that the positional features are effective for the source classification of deep moonquakes.
(b) Feature Importance
(WF) Figure 8.7 shows the importance of each feature. All the features of high importance are those of the earth when calculated with the moon as the origin. It can also be seen that when the moon is the origin, the importance of the features of the earth is high, followed by the importance of some features of Jupiter. Comparing the features when the moon is the origin and when the earth is the origin, the features with the moon as the origin are more important than the features with the earth as the origin. Figure 8.7 also shows that the relationship between the moon and the earth has the strongest influence on the classification. However, since there may be a correlation between the features, it is necessary to further analyze each feature from the viewpoint of the independence of the features. Therefore, in the following, the experimental results after feature reduction using VIF are explained in consideration of correlations between features.
8.10.5.2 Experimental Results After Feature Reduction Using VIF
(a) Classification Performance
Figure 8.8 shows the F-measures of the classifiers after the features are reduced. Similar to Fig. 8.6, the vertical axis and the horizontal axis are both moonquake sources, and each value is the F-measure of the classifier for each pair of sources.
Fig. 8.6 F-measure for binary classification of seismic sources (both axes: seismic sources A1, A5, A6, A7, A8, A9, A10, A14, A18, A20, A23, A25, A33, A35, A44, A204, A218; scale from 0.5 to 1.0)
Fig. 8.7 Importance of each feature (vertical axis: importance, from 0.00 to 0.04; horizontal axis, features from left to right: earth_from_moon_vz, earth_from_moon_lt_phase, earth_from_moon_z, earth_from_moon_lt, earth_from_moon_vy, earth_from_moon_vx, earth_from_moon_x, earth_from_moon_y, earth_from_moon_lt_sin, earth_from_moon_lt_cos, jupiter_from_moon_vx, jupiter_from_moon_y, jupiter_from_moon_vy, jupiter_from_moon_x, sun_pertubation_z, sun_pertubation_vz, sun_pertubation_lt, jupiter_from_moon_z, jupiter_from_earth_z, jupiter_from_moon_vz, sun_from_moon_vz, earth_from_moon_lt_nest_cos, earth_from_moon_lt_nest_sin, earth_from_moon_lt_nest_phase, sun_from_moon_lt, sun_from_moon_vx, sidereal, sun_from_moon_y, sun_from_moon_vy, sun_from_moon_x, sun_from_earth_lt, jupiter_from_earth_vz, jupiter_from_moon_lt, sun_from_moon_z, jupiter_from_earth_lt, sun_from_earth_y, sun_from_earth_vx, sun_from_earth_x, sun_from_earth_vz, sun_from_earth_vy, jupiter_from_earth_vx, jupiter_from_earth_y, sun_pertubation_x, sun_from_earth_z, sun_pertubation_vx, jupiter_from_earth_x, sun_pertubation_y, sun_pertubation_vy, jupiter_from_earth_vy, sun_pertubation_x_cos, sun_pertubation_y_cos, sun_pertubation_x_sin, sun_pertubation_y_sin, sun_pertubation_y_phase, sun_pertubation_x_phase)
Furthermore, the number of classifiers with a classification performance of 0.9 or higher is 34, which is about 25% of the total. The 56 classifiers with a classification performance of 0.8 or higher and lower than 0.9 account for 41% of the total. There are two classifiers with a classification performance of less than 0.6. A comparison of Figs. 8.6 and 8.8 shows that there is no significant change in the classification performance.
(b) Feature Importance
(WF) Figure 8.9 shows the importance of the features (i.e., the feature importance) in each classification after the features are reduced. After feature reduction, only four of the ten earth features (with the moon as the origin) that occupied the top ten positions before feature reduction remain. Similarly, of the four Jupiter features (with the moon as the origin) that occupied the 11th to 14th positions in Fig. 8.7, only one remains. Some parameters of Jupiter are believed to be influenced by other features. In other words, the subset of features after feature reduction may be less affected by multicollinearity.
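The idea behind the VIF-based feature reduction discussed above can be sketched as follows. This is a simplified illustration and not the procedure used in this case: it applies the formula VIF = 1 / (1 − r²) to the correlation coefficient r between two variables and drops a feature when its VIF with an already kept feature is 6 or more. The statsmodels function variance_inflation_factor used in this case is instead based on regressing each feature on all the other features. The function names and the sample values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif(x, y):
    """VIF = 1 / (1 - r^2) for the correlation r between x and y."""
    r = pearson_r(x, y)
    return math.inf if abs(r) >= 1.0 else 1.0 / (1.0 - r * r)


def reduce_features(columns, threshold=6.0):
    """Greedy pairwise reduction (a simplification): keep a feature only if its
    VIF with every already kept feature is below the threshold."""
    kept = []
    for name, values in columns.items():
        if all(vif(values, columns[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical values; only the feature names are taken from Fig. 8.7.
columns = {
    "earth_from_moon_x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "earth_from_moon_lt": [1.1, 2.0, 3.1, 3.9, 5.0],    # nearly collinear with x
    "jupiter_from_moon_vy": [2.0, 1.0, 4.0, 3.0, 5.0],
}
print(reduce_features(columns))
```

In this toy example the second feature is almost a linear function of the first, so its VIF is far above 6 and it is dropped, while the third feature (r = 0.8 with the first) is kept.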
Fig. 8.8 F-measure for binary classification of seismic sources after feature reduction (both axes: seismic sources A1 to A218 as in Fig. 8.6; scale from 0.5 to 1.0)
Fig. 8.9 Importance of each feature after feature reduction (vertical axis: importance, from 0.00 to 0.05; features shown on the horizontal axis include sun_pertubation_x_phase, sun_pertubation_y_phase, sun_from_moon_z, sun_from_earth_vy, sun_from_earth_vx, sun_from_earth_vz, sun_pertubation_y, sun_pertubation_x, earth_from_moon_lt_nest_sin, sun_from_moon_vz, jupiter_from_moon_vz, sun_pertubation_vz, sun_pertubation_z, jupiter_from_moon_vy, earth_from_moon_z, earth_from_moon_vz, earth_from_moon_vx, earth_from_moon_vy, earth_from_moon_lt_nest_cos)
In summary, even after reducing features using VIF, some features of the earth and Jupiter turned out to be useful for classification when the moon is the origin.
8.10.6 Considerations
By using Balanced Random Forest, the importance of features can be easily calculated in addition to classification performance. Therefore, in scientific research such as this case, the feature importance can be used to analyze features and explain models. However, there remain some issues in this case. First, there is room to consider classification features according to each pair of sources. In addition, other classification methods may need to be considered in order to obtain higher classification performance. Furthermore, it is necessary to consider applying a method that directly considers waveforms as features of moonquakes (Kikuchi et al. 2017). Indeed, the results of this case have been shown to be useful for professional analysis and knowledge building. However, since the findings obtained in this case are based only on correlations, it is not possible to directly estimate the causal mechanism of deep moonquakes. Combined with the knowledge of experts, the findings can be expected to help elucidate the causal relationship between the seismic sources and the features of the planets, and finally the overall causes of the moonquakes.
8.11 Case of Identification of Central Peak Crater
This case study describes a microscopic explanation of the basis of individual judgments of a classification model. Here, the case of identification of the central peak craters dealt with in Sect. 7.4 is used again to describe an example of explanation of individual judgment bases.
8.11.1 Overview
• Specific purpose of the case for natural science: This case is also related to lunar and planetary science. In order to understand the internal structures and activities of the moon, one approach is to use the materials inside the moon directly as clues. The central peak in a crater formed by the impact of a meteorite is attracting attention as a place where materials from below the lunar surface are exposed on the surface. However, not all craters with a central peak on the moon (hereinafter referred to as the central peak craters) have been identified. Therefore, it is scientifically meaningful to create a catalog of all central peak craters on the moon. However, since central peak craters have been identified manually by experts, identification has taken a lot of time and effort. So automatic identification has become one of the focal points for scientists in this field. Therefore, it is also necessary to explain to the relevant scientists why craters included in the candidates automatically discovered by machine learning are judged to be central peak craters.
• Data used in the case: Approximately 7200 pieces of DEM data collected and prepared by both NASA and JAXA were used. DEM data can be viewed as grayscale images, although each pixel of a DEM image contains an elevation instead of a color intensity. Each image has been resized to 512 (height) × 512 (width) × 1 (normalized elevation). The entire image collection was divided into the same number of images with three labels: crater with central peak (i.e., central peak crater), crater without central peak (i.e., normal crater), and non-crater.
• Method used in the case: The RPSD method (Yamamoto et al. 2017) was used to detect craters only in order to prepare training data for a CNN. Next, the trained CNN was applied to find central peak craters, including unknown and known ones. Furthermore, Grad-CAM (Selvaraju et al. 2020) was used to examine the evidence for determining central peak craters.
• Results: We were able to classify the three classes with an accuracy of 96.9%. These results were verified by the scientist members of our research team. We were also able to provide the scientist members with individual evidence for determining central peak craters, which consists of the crater rim and the central peak.
8.11.2 Integrated Hypothesis
A central peak crater is identified in the following two steps.
8.11.2.1 Extraction of Lunar Craters Using RPSD Method
(HD, HH) Craters are extracted using a method for digital terrain models called rotational pixel swapping for DTM (RPSD) (Yamamoto et al. 2017). Here, a digital terrain model (DTM) is similar to a digital elevation model (DEM). The RPSD method focuses on the rotational symmetry observed when one piece of DEM data is rotated around a specific point (i.e., the center point). That is, the RPSD method takes advantage of the fact that the negative gradient from the crater rim to the crater center point does not significantly change with respect to crater rotation. In other words, RPSD compares the original candidate crater and the rotated crater (corresponding to the same object in a different observation mode) and confirms that the rotational symmetry holds in order to identify the crater. In a nutshell, the integrated hypothesis generation method finds a hypothesis (candidate crater) by focusing on the intersection of data obtained by different modes (rotations) for the same object.
8.11.2.2 Identification of Central Peak Craters from Extracted Craters Using CNN
(HH) In general, in the identification phase of each layer of deep learning, each output node weights the input values, takes their sum, adds its bias to the sum, and then outputs the result in the forward direction. In the learning phase of deep learning, which is treated as a problem of minimizing the error between the result of prediction and the correct answer, the weight and bias values are updated by differentiating the error function with respect to the weight and bias of each layer in the backward direction. First, RPSD is used to extract a piece of DEM data for each candidate crater, and the DEM data is given one label (i.e., non-crater, normal crater, or central peak crater) to prepare training data. The training data are used to train the CNN model, and then the trained CNN model is used to identify central peak craters. The results obtained from experiments emphasizing recall confirmed that the CNN model can be an effective method for determining central peak craters.
Fig. 8.10 Contribution areas for “central peak crater” and “normal crater” by using Grad-CAM. a Original image. b Contribution area for “central peak crater”. c Contribution area for “normal crater”
8.11.3 Explanation of Results
(WR) To confirm the reason for the classification result, we visualize the contribution areas (i.e., individual evidence) of the input DEM image that affect the model’s decision, that is, the output label. For that purpose, we use Grad-CAM (Selvaraju et al. 2020) and visualize the contribution areas of each label in the input image. Figure 8.10a–c shows the input DEM image, the contribution areas of the “central peak crater” label, and the contribution areas of the “normal crater” label, respectively. The contribution areas of the “central peak crater” label include areas of high intensity inside the crater, and the emphasized areas cover the central peak. On the other hand, there is no emphasized area for a central peak in the contribution areas of the “normal crater” label. Therefore, the areas occupied by the central peak are considered to contribute to the classification of the “central peak crater” label. The mechanism of Grad-CAM itself will be explained in detail in the following case.
8.12 Case of Exploring Basic Components of Scientific Events
This case study aims to discover basic components that make up a model for scientific events using NMF, based on the results of Grad-CAM.
8.12.1 Overview
• Specific purpose of the case for natural science: There is a balloon experiment plan that aims to obtain clues for elucidating dark matter by detecting antideuterons with a high-sensitivity observation device for cosmic-ray antiparticles, called the General AntiParticle Spectrometer (GAPS) (Fuke et al. 2008). Antideuterons are considered to be one of the particles produced by pair annihilation and decay of candidate dark matter particles (Bräuninger and Cirelli 2009), and have not been discovered until now due to their extremely small expected abundance. However, if even one event can be detected in the low energy region, it is highly likely that it originated from dark matter.
• Data used in the case: GAPS uses a dedicated measuring instrument for antiparticle (antiproton or antideuteron) detection. It consists of silicon semiconductor (Si(Li)) detectors and plastic scintillation counters (TOF) that surround the detectors. We have drawn its conceptual diagram (Imafuku et al. 2021, 2022) (see Fig. 8.11). Silicon semiconductor detectors divided into eight segments are arranged in a configuration of 12 × 12 × 10. When an antiparticle enters the measuring instrument, it passes through a two-layered TOF counter and then is slowed down and captured due to energy loss while passing through the stacked silicon detectors, forming an excited exotic atom. Excited exotic atoms deexcite immediately, and in the process, characteristic X-rays, pions, and protons are emitted due to the nuclear annihilation. Since the number of pions and protons produced in the decay process of the excited exotic atoms depends on the type of captured antiparticle, the incident antiparticle can be identified by measuring these tracks (Aramaki et al. 2013). In the detection of antideuterons, antiprotons, which can also form exotic atoms, are compared as well, considering that the abundance ratio of antiprotons to antideuterons is $10^4$ or more. We use simulated data for GAPS measurements.
• Method used in the case: Wada et al. performed in-silico experiments using GAPS-oriented code to simulate the antiproton and antideuteron detectors, which require high discrimination ability and significant suppression of the difference in the abundance of antiprotons and antideuterons (Wada et al. 2020). They have begun developing an antiparticle discrimination method that uses track data and a neural network (NN) model. Specifically, a three-dimensional convolutional neural network (CNN) is used to discriminate, with high accuracy, data with fixed angles of incidence on the antiproton and antideuteron detectors. Although the previous research confirmed a certain level of discrimination accuracy for data with a certain degree of randomness in the incident angles, it states that there is room for improvement. For example, increasing the number of learning events, adding time information, and optimizing the NN structure are mentioned as remaining issues in their paper. However, the factors enabling highly accurate discrimination and the characteristics of erroneously discriminated data have not yet been studied.
Fig. 8.11 Conceptual diagram of GAPS. An antiparticle (e.g., antideuteron $\bar{d}$) captured by Si(Li) detectors (striped ovals) forms an exotic atom. Excited exotic atoms deexcite immediately, and in the process, characteristic X-rays (wavy arrows) and hadrons (i.e., pions π+, π− and protons p) are emitted due to the nuclear annihilation. Labels in the figure: TOF, stack of Si(Li) detectors.
Therefore, in this case (Imafuku et al. 2021, 2022), we aim to solve these issues and explain events through both an analysis of the factors behind the highly accurate discrimination by machine learning and the extraction of fundamental features from the data. To this end, we visualized parts of the discrimination basis of the CNN model by gradient-weighted class activation mapping (Grad-CAM) (Selvaraju et al. 2020).
• Results and additional experiments: From the visualization results, it is found that the discriminator did not consider the incident path and the shape of the particle spread after annihilation. Therefore, we conducted another in-silico experiment focusing on the positional information and shape information of the particles in the simulation data. Furthermore, with the aim of determining some basic events or components necessary for particle discrimination, we additionally conducted experiments using NMF (Nonnegative Matrix Factorization) (Lee and Seung 1999) to model the events.
8.12.2 Data Set
(HD) The three-dimensional simulated data of incident antiprotons and antideuterons and the energy loss values caused by pions and protons, which are recorded in the silicon semiconductor detectors, are used as the input data. The silicon semiconductor detector group has ten layers, each of which is arranged in 12 × 12, and the energy loss values are recorded in a total of 1440 channels in a three-dimensional array. However, since the eight segments of the silicon semiconductor detector have different orientations depending on the layer and cannot be shaped into an array, they are treated as a unit. Two sets of simulated data with respect to incident angles are prepared for each group of antiparticle events (i.e., antiprotons and antideuterons). One set consists of data whose incident angle is fixed vertically downward with respect to the measuring instrument. The other set consists of data whose incident angles are uniformly distributed over the upper hemisphere when viewed from the measuring instrument. Hereafter, in this case, the former and latter are called fixed incident angle data and random incident angle data, respectively. Of course, the latter is more realistic than the former. In the previous studies, two groups of 2 million antiproton events and 2 million antideuteron events were used, but in this case, 20,000 events of each group are used. The learning data (training data) and the input data (test data) for the classifier are 36,000 and 4000 fixed incident angle data, respectively, each including an equal number of antiprotons and antideuterons. In addition, fixed incident angle data are used for visualization.
8.12.3 Network Configuration and Algorithms
(WF) (HH) We use the 3D CNN model as in the previous research (see Table 8.3). The model consists of three-dimensional convolution layers (conv), a pooling layer (max-pooling), and fully connected layers (dense), and applies batch normalization and dropout between the pooling layer and the fully connected layers. In general, batch normalization stabilizes learning by standardizing values within each mini-batch, while dropout prevents model overfitting by randomly disabling outputs during training. The ReLU function is used for the output of each convolution layer. The sigmoid function is used for the final output layer. Furthermore, the binary cross-entropy is used as the loss function. Adaptive Moment Estimation (Adam) (Kingma and Ba 2014) is used as the optimization algorithm. Adam smooths the gradient by computing a momentum term with exponential smoothing and adjusts the effective learning rate of each parameter using an exponentially smoothed average of the squared gradients.
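The Adam update described above can be written out for a flat list of parameters as follows. This is a minimal sketch, not the Keras implementation used in the case study; the default values of `lr`, `beta1`, `beta2`, and `eps` follow Kingma and Ba (2014), and the quadratic test function is a hypothetical example.

```python
import math


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a flat list of parameters (t starts at 1)."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # smoothed gradient (first moment)
        v_i = beta2 * v_i + (1 - beta2) * g * g    # smoothed squared gradient (second moment)
        m_hat = m_i / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Hypothetical example: minimize f(x) = (x - 3)^2 starting from x = 0.
x, m, v = [0.0], [0.0], [0.0]
for t in range(1, 5001):
    grad = [2 * (x[0] - 3.0)]
    x, m, v = adam_step(x, grad, m, v, t, lr=0.01)
print(x[0])
```

The two running averages play different roles: the first gives the direction of the step, and the second rescales the step size so that parameters with consistently large gradients move more cautiously.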
Table 8.3 Network configuration of 3D CNN model

Layer         Kernel       Stride       Output size
Input         -            -            12 × 12 × 10
Conv.         3 × 3 × 3    1 × 1 × 1    12 × 12 × 10 × 64
Conv.         3 × 3 × 3    1 × 1 × 1    12 × 12 × 10 × 128
Conv.         3 × 3 × 3    1 × 1 × 1    12 × 12 × 10 × 256
Max-pooling   2 × 2 × 2    2 × 2 × 2    6 × 6 × 5 × 256
Dense         -            -            512
Dense         -            -            256
Dense         -            -            128
Dense         -            -            64
Dense         -            -            32
Output        -            -            1
Keras (Chollet 2015) (Keras 2022) and TensorFlow (Abadi et al. 2016) (Tensorflow 2022) were used for the construction and learning of the classifier and implementation of Grad-CAM, respectively.
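The output sizes in Table 8.3 can be reproduced with simple shape arithmetic. The sketch below assumes that the convolutions use padding that preserves the spatial size (which the unchanged 12 × 12 × 10 output suggests but the text does not state) and that the input has a single channel; the helper names are hypothetical, and the code only computes shapes and parameter counts rather than building the Keras model.

```python
def conv3d_same(shape, filters):
    """3D convolution with stride 1 and size-preserving padding (assumed)."""
    x, y, z, _ = shape
    return (x, y, z, filters)


def max_pool3d(shape, pool=2, stride=2):
    """3D max pooling; the channel dimension is unchanged."""
    x, y, z, c = shape
    return ((x - pool) // stride + 1, (y - pool) // stride + 1,
            (z - pool) // stride + 1, c)


def conv3d_params(in_channels, filters, kernel=3):
    """Number of weights plus biases in one 3D convolution layer."""
    return kernel ** 3 * in_channels * filters + filters


shape = (12, 12, 10, 1)  # a single input channel is an assumption
shapes, params = [], []
for filters in (64, 128, 256):
    params.append(conv3d_params(shape[3], filters))
    shape = conv3d_same(shape, filters)
    shapes.append(shape)
shape = max_pool3d(shape)
shapes.append(shape)
flattened = shape[0] * shape[1] * shape[2] * shape[3]

for s in shapes:
    print(s)
print("flattened size fed to the first dense layer:", flattened)
print("convolution parameters per layer:", params)
```

Under these assumptions, the pooled feature volume is flattened before the first dense layer, which is why most of the model's weights sit in that first fully connected layer rather than in the convolutions.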
8.12.4 Visualization of Judgment Evidence by Grad-CAM
8.12.4.1 Grad-CAM
(WF) Grad-CAM is a method that visualizes, as a heat map, the regions of the input image that contribute to the classification result of a CNN. The formulas for obtaining the heat map by Grad-CAM with 3D data as input are shown below.

$$\alpha_l^c = \frac{1}{Z} \sum_i \sum_j \sum_k \frac{\partial y^c}{\partial A^l_{ijk}} \qquad (8.1)$$

$$H^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left( \sum_l \alpha_l^c A^l \right) \qquad (8.2)$$
Here, $y^c$ in Eq. (8.1) is the output for class c in the output layer, $A^l$ is the l-th feature map of the convolution layer, $\alpha_l^c$ is the weight for that feature map, and Z is the volume (number of elements) of the feature map. Generally, the feature map of the convolutional layer closest to the output layer is used. The heat map is obtained by multiplying each feature map $A^l$ by its weight $\alpha_l^c$, summing over l as in Eq. (8.2), and applying the ReLU function to the result. The ReLU function is used because it is the positive contributions that are considered to affect the classification result.
Fig. 8.12 Cross-sectional view of correctly identified antideuteron data (left), cross-sectional view of Grad-CAM (center), and their overlay (right)
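Equations (8.1) and (8.2) can be sketched in plain Python once the feature maps and the gradients of the class output with respect to them are available. In the sketch below, each feature map is flattened over the voxel indices (i, j, k), and the numerical values are hypothetical; in practice, the gradients would come from the automatic differentiation of the deep learning framework.

```python
def grad_cam(feature_maps, gradients):
    """Heat map from Eqs. (8.1) and (8.2).

    feature_maps[l][v]: activation A^l at voxel v (voxels flattened over i, j, k).
    gradients[l][v]:    derivative of y^c with respect to A^l at voxel v.
    """
    n_voxels = len(feature_maps[0])  # Z in Eq. (8.1)
    # Eq. (8.1): average the gradient over all voxels of each feature map.
    alphas = [sum(g) / n_voxels for g in gradients]
    # Eq. (8.2): weighted sum over feature maps, followed by ReLU.
    heat = []
    for v in range(n_voxels):
        s = sum(a * fmap[v] for a, fmap in zip(alphas, feature_maps))
        heat.append(max(0.0, s))
    return alphas, heat


# Hypothetical example with two feature maps of four voxels each.
feature_maps = [[1.0, 2.0, 0.0, 1.0],
                [0.0, 1.0, 3.0, 1.0]]
gradients = [[0.5, 0.5, 0.5, 0.5],
             [-1.0, 0.0, -1.0, 0.0]]
alphas, heat = grad_cam(feature_maps, gradients)
print(alphas, heat)
```

The second feature map receives a negative weight, so voxels where it dominates are clipped to zero by the ReLU and do not appear in the heat map.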
8.12.4.2 Correct Identification of Simulation Data and Its Explanation by Grad-CAM
(WR) Figures 8.12 and 8.13 show, for an antideuteron and an antiproton correctly identified by the CNN model, respectively, the cross-sectional view of the data along the incident path (left), the corresponding cross-sectional view of Grad-CAM (center), and the overlay of the two as a heat map (right). Here, the fixed incident angle path runs from (6,6,0) to (6,6,10) in the three-dimensional array (12 × 12 × 10). The cross-sectional view is the cross section (vertical × horizontal = 12 × 10) cut out from the range (6,0,0) to (6,12,10). The white part in the left figure shows the energy loss value [MeV] of the antiparticles, and the antiparticles are incident from left to right. In Grad-CAM, the red regions of the heat map are the regions that the discriminator is watching. Figures 8.12 and 8.13 show that, for both antiprotons and antideuterons, the discriminator watches the end of the incident path, which has the largest energy loss value. In addition, the data show that the incident path of the antiparticle until pair annihilation tends to be longer for antideuterons than for antiprotons.
Fig. 8.13 Cross-sectional view of correctly identified antiproton data (left), cross-sectional view of Grad-CAM (center), and their overlay (right)
8.12.4.3 Erroneous Identification of Simulation Data and Its Explanation with Grad-CAM
(WR) In this experiment, in the case of fixed incident angle data, one antideuteron was erroneously identified as an antiproton and one antiproton was erroneously identified as an antideuteron out of the 4000 test data. A cross-sectional view (left) and a cross-sectional view of Grad-CAM (center) along the incident path of the antiproton data erroneously identified as an antideuteron are shown in Fig. 8.14, together with the overlay of the two as a heat map (right). As already mentioned, there is a difference in the length of the incident path between antideuterons and antiprotons, and the highest energy loss value in one piece of data appears at the end of the incident path, where pair annihilation occurs. Although the incident path of the antiproton data should be relatively short, the region with the largest energy loss value appears beyond the end of the incident path in Fig. 8.14. Therefore, it is probable that the discriminator mistook the incident path of the antiproton for that of an antideuteron.
A cross-sectional view (left) and a cross-sectional view of Grad-CAM (center) along the incident path of the antideuteron data erroneously identified as an antiproton, and the overlay of the two as a heat map (right), are shown in Fig. 8.15. Grad-CAM confirms that the discriminator is watching a region where nothing can be identified in the cross-sectional view of the incident path, ranging from (6,0,0) to (6,12,10), in Fig. 8.15. We therefore sum all the cross-sectional views ((0,0,0) to (0,12,10)), …, ((12,0,0) to (12,12,10)) to confirm the regions that the discriminator watches (see Fig. 8.16).
Fig. 8.14 Cross-sectional view of antiproton data erroneously identified as antideuteron (left), cross-sectional view of Grad-CAM (center), and their overlay (right)
Fig. 8.15 Cross-sectional view of antideuteron data erroneously identified as antiproton (left), cross-sectional view of Grad-CAM (center), and their overlay (right)
Figure 8.16 shows the accumulated cross-sectional view of the data of an antideuteron that was erroneously identified as an antiproton (left), the accumulated cross-sectional view of Grad-CAM (center), and the accumulated overlay as a heat map (right). According to Fig. 8.16, the maximum energy loss value lies in a region other than the end of the incident path, and Grad-CAM confirms that the discriminator pays the most attention to that region. The original intention of introducing the CNN model was to make the discriminator recognize spatial information such as the incident path of particles and the spread of particles after annihilation. Figure 8.16 indicates, however, that the discriminator pays close attention to the maximum value of the energy loss and did not recognize spatial information such as the incident path and the spread.
Fig. 8.16 Accumulated cross-sectional view of antideuteron data erroneously identified as antiproton (left), accumulated cross-sectional view of Grad-CAM (center), and their accumulated overlay (right)
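The accumulation of cross-sectional views can be sketched as a sum over the first array index. The example below is a minimal illustration, assuming the event is stored as a nested list `volume[x][y][z]`; the two energy deposits are hypothetical values chosen so that the largest one lies outside the plane of the incident path.

```python
def accumulate_cross_sections(volume):
    """Sum the y-z cross sections of volume[x][y][z] over all x."""
    ny, nz = len(volume[0]), len(volume[0][0])
    acc = [[0.0] * nz for _ in range(ny)]
    for plane in volume:
        for y in range(ny):
            for z in range(nz):
                acc[y][z] += plane[y][z]
    return acc


NX, NY, NZ = 12, 12, 10
volume = [[[0.0] * NZ for _ in range(NY)] for _ in range(NX)]
volume[6][6][9] = 5.0  # hypothetical deposit at the end of the incident path
volume[3][2][4] = 8.0  # hypothetical larger deposit away from the path

slice_x6 = volume[6]                         # single cross section through the path
accumulated = accumulate_cross_sections(volume)
print(max(map(max, slice_x6)), max(map(max, accumulated)))
```

The single cross section through the incident path misses the larger deposit, whereas the accumulated view contains it, which is why the accumulated view is needed to see what the discriminator is attending to.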
8.12.5 Experiments to Confirm Important Features
In order to make the classifier recognize the spatial information without paying attention to the maximum value of the energy loss, we unify all the nonzero energy loss values in the simulation data to the same value before generating the classifier, and then confirm its accuracy.
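The unification of nonzero values amounts to a simple element-wise mapping. The following sketch assumes that one event is given as a flat list of energy loss values; the function name and the sample values are hypothetical.

```python
def unify_nonzero(event, value=1.0):
    """Replace every nonzero energy loss by the same value; keep zeros as zeros."""
    return [value if e != 0 else 0.0 for e in event]


event = [0.0, 2.5, 0.0, 13.7, 0.4]  # hypothetical energy loss values [MeV]
print(unify_nonzero(event))
```

After this step, only the positions of the hit channels remain, so a classifier trained on such data can no longer rely on the magnitude of the energy loss.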
8.12.5.1 Evaluation Index
Figure 8.17 shows a histogram of the antideuteron-likelihood that the CNN model of the previous research outputs for antiprotons and antideuterons. The antiproton rejection power obtained from the antideuteron-likelihood histogram is often used as an evaluation index in this field. For example, we used such a histogram in our team's previous work (Imafuku et al. 2021, 2022) (see Fig. 8.17). The horizontal axis represents the antideuteron-likelihood, and the vertical axis represents the counts of antiprotons and antideuterons. The closer the value on the horizontal axis is to 1, the more antideuteron-like the CNN model judges the antiparticle to be. The recognition efficiency of antideuterons (d: antideuteron), the misidentification (i.e., erroneous identification) probability of antiprotons (p: antiproton), and the antiproton rejection power are defined below (T is a threshold). From the definitions, the larger the antiproton rejection power, the better the identification accuracy.
Fig. 8.17 Histogram of antideuteron-likelihood by the CNN model of our team's previous research (red: antiproton, blue: antideuteron). Horizontal axis: likelihood of antideuteron (0.0 to 1.0, with the threshold T marked); vertical axis: count
$$\text{Recognition efficiency}\,(d, >T) = \frac{\text{number of antideuteron data judged as antideuteron}}{\text{total number of antideuteron data}}$$

$$\text{Misidentification probability}\,(p, >T) = \frac{\text{number of antiproton data judged as antideuteron}}{\text{total number of antiproton data}}$$

$$\text{Antiproton rejection power}\,(p, >T) = \frac{1}{\text{Misidentification probability}\,(p, >T)}$$

8.12.5.2 Experimental Settings
(HD) (HH) As data preprocessing, all the nonzero energy loss values in the original data (1) are unified to 1.0 (2), and all the data in (2) are further converted using the fast Fourier transform (FFT) and the polar coordinate transform (3), following the data conversion procedure proposed by Nishiura et al. (1994) to make it easier for a computer to recognize the shape of an object in an image. The specific conversion procedure for data set (3) consists of the unification of nonzero values to 1.0, FFT, quadrant swap, polar coordinate transform, FFT, and quadrant swap, in that order. After the preprocessing, we generate classifiers and evaluate them. Random incident angle data were used for each data set, with 3.6 million, 400,000, and 400,000 cases for training, validation, and evaluation, respectively. The ratio of antiprotons to antideuterons in the data is about 1:1. The network configuration of the classifier is the same as already described.
8.12.5.3 Experimental Results
(WF) Figure 8.18 shows the antiproton rejection power with respect to the antideuteron recognition efficiency for the classifiers generated from data sets (1), (2), and (3). The threshold T is varied from 0.5 to 0.99 in steps of 0.01. The classifiers for data sets (2) and (3) retain a certain level of recognition ability, although it is lower than that of the classifier for data set (1). In this experiment, performance deteriorated when the magnitude of the energy loss value was not distinguished and when the spatial information was not considered, which confirms that these factors are important in identifying the antiparticle. In other words, these experiments can be viewed as variations of counterfactual explanations.
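The evaluation indices and the threshold sweep can be sketched as follows. The function `evaluate` implements the three definitions given in Sect. 8.12.5.1, where "judged as antideuteron" is taken to mean that the antideuteron-likelihood exceeds T; the likelihood scores are hypothetical, and returning infinity when no antiproton is misidentified is a convention chosen for this sketch.

```python
def evaluate(scores_d, scores_p, threshold):
    """Return (recognition efficiency, misidentification probability, rejection power).

    scores_d: antideuteron-likelihood outputs for true antideuterons.
    scores_p: antideuteron-likelihood outputs for true antiprotons.
    """
    efficiency = sum(s > threshold for s in scores_d) / len(scores_d)
    misid = sum(s > threshold for s in scores_p) / len(scores_p)
    rejection = 1.0 / misid if misid > 0 else float("inf")
    return efficiency, misid, rejection


# Thresholds 0.50, 0.51, ..., 0.99 as in the experiment.
thresholds = [round(0.5 + 0.01 * i, 2) for i in range(50)]

# Hypothetical classifier outputs for five events of each type.
scores_d = [0.95, 0.90, 0.80, 0.60, 0.40]
scores_p = [0.05, 0.10, 0.30, 0.55, 0.70]

curve = [evaluate(scores_d, scores_p, t) for t in thresholds]
print(curve[0])   # indices at T = 0.50
print(curve[-1])  # indices at T = 0.99
```

Plotting the rejection power against the recognition efficiency over all thresholds gives a curve of the kind shown in Fig. 8.18: raising T trades recognition efficiency for a higher rejection power.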
Fig. 8.18 Antiproton rejection power versus antideuteron recognition efficiency of the classifiers generated from each data set (1), (2), and (3) (1: black, 2: blue, 3: red). Horizontal axis: recognition efficiency of antideuteron (0.800 to 0.975); vertical axis: rejection power of antiproton (logarithmic scale, $10^0$ to $10^1$)
8.12.6 Seeking Basic Factors
In science, reducing a phenomenon to a combination of more basic phenomena is one way to approach an essential interpretation. So far, we have visualized the difference between the events of the two types of antiparticles in a three-dimensional space that we can easily recognize. Further, we have conducted experiments oriented toward counterfactual explanations by invalidating some of the information included in the event data. As a result, we recognize that the positional information (e.g., the incident path) along which the antiparticles pass through the detectors and the shape information of the spread of the antiparticles after annihilation are important as basic events, as are the energy loss values. However, it is not yet theoretically determined how many basic events are essentially needed to describe the original events. In other words, we have to consider a method that enables us to explore a more fundamental explanation. In general, Non-negative Matrix Factorization (NMF) makes it possible to express events by combining more basic events (Lee and Seung 1999). Therefore, we consider determining the number (r) of basic events so that antiprotons and antideuterons have different distributions with respect to a certain feature. Thus, the purpose in this case is to consider the length of the weighting vector (i.e., feature vector) of the basic events obtained by NMF and to examine r so that the difference between the two types of antiparticles in the length distribution becomes significantly large.
8.12.6.1 NMF
(HH) In many physical phenomena, observed values (e.g., images, counts, and intensities) are represented as non-negative feature vectors. NMF (Lee and Seung 1999) considers basic components in lower dimensions than such non-negative vectors. Thus, NMF tries to approximate the original matrix consisting of non-negative vectors by the product of a matrix consisting of basic components and a matrix consisting of the vectors for their weighting. That is, we try the following matrix factorization.

$$A \approx WH.$$

Here, A is an m × n matrix representing feature quantities, W is an m × r matrix representing basic components, H is an r × n matrix representing coefficients (weights), and typically r ≤ min(m, n). Furthermore, A, W, and H are non-negative matrices. For optimization, we minimize the following objective function based on the Frobenius norm,

$$J_F = \|A - WH\|_F^2, \qquad \|B\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} B_{ij}^2},$$

or the following objective function based on the KL divergence (refer to Sect. 5.3.1).

$$J_{KL} = \sum_{i=1}^{m} \sum_{j=1}^{n} \left\{ -A_{ij} \log (WH)_{ij} + (WH)_{ij} \right\}.$$

The basic algorithm for minimizing $J_{KL}$ is shown below (Lee and Seung 2000).
Algorithm 8.4 NMF
1. Initialize W and H randomly;
2. Repeat the following updates alternately until they converge {
3. $W_{ij} \leftarrow W_{ij} \dfrac{\sum_k H_{jk} A_{ik} / (WH)_{ik}}{\sum_k H_{jk}}$;
4. $H_{ij} \leftarrow H_{ij} \dfrac{\sum_k W_{ki} A_{kj} / (WH)_{kj}}{\sum_k W_{ki}}$ }.
On the other hand, in order to minimize $J_F$, Steps 3 and 4 are replaced by the following updates, respectively.

$$W_{ij} \leftarrow W_{ij} \frac{\sum_k H_{jk} A_{ik}}{\sum_k H_{jk} (WH)_{ik}}.$$

$$H_{ij} \leftarrow H_{ij} \frac{\sum_k W_{ki} A_{kj}}{\sum_k W_{ki} (WH)_{kj}}.$$

NMF is considered to have the following advantage. It can be interpreted that if the r components are weighted and added together, the result becomes the feature quantities. In other words, we think that the original event is explained by r basic events.
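The multiplicative updates that minimize $J_F$ can be sketched in plain Python as below. This is a minimal illustration and not the scikit-learn implementation used in the experiment; the small constant `eps` that guards against division by zero, the random initialization range, the fixed number of iterations, and the example matrix are all assumptions made for this sketch.

```python
import random


def matmul(X, Y):
    """Matrix product of two nested lists."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frobenius_error(A, W, H):
    """J_F = ||A - WH||_F^2."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))


def nmf(A, r, n_iter=200, seed=0, eps=1e-12):
    """Multiplicative updates for J_F with non-negative W (m x r) and H (r x n)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(r)]
    for _ in range(n_iter):
        # W_ij <- W_ij * (A H^T)_ij / (W H H^T)_ij
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(matmul(W, H), Ht)
        W = [[w * a / (d + eps) for w, a, d in zip(rw, ra, rd)]
             for rw, ra, rd in zip(W, num, den)]
        # H_ij <- H_ij * (W^T A)_ij / (W^T W H)_ij
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(Wt, matmul(W, H))
        H = [[h * a / (d + eps) for h, a, d in zip(rh, ra, rd)]
             for rh, ra, rd in zip(H, num, den)]
    return W, H


# Hypothetical 4 x 3 non-negative matrix that has an exact rank-2 factorization.
A = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [1.0, 0.0, 1.0],
     [2.0, 2.0, 4.0]]
W1, H1 = nmf(A, r=2, n_iter=1)
W, H = nmf(A, r=2, n_iter=200)
print(frobenius_error(A, W1, H1), frobenius_error(A, W, H))
```

Because every update multiplies a non-negative entry by a non-negative ratio, W and H stay non-negative throughout, which is what allows the columns of W to be read as additive basic components.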
8.12.6.2 Experimental Results
(WF) In the experiment, 20,000 random incident angle data were prepared for each type of antiparticle (i.e., antiproton and antideuteron), NMF was applied multiple times while changing the number of basis vectors (r), and the results were analyzed. The NMF of scikit-learn (Pedregosa et al. 2011), which tries to minimize the Frobenius norm, was used. In the experiment, the number r of basis vectors was set to 6, 8, 9, 10, 11, and 12. For each case, the length distributions of the feature vectors of the antiprotons and antideuterons and the regression curves are overlaid (see Fig. 8.19). The following equation is used for regression.

$$y = b \exp(-ax).$$

Here, x is the length of the feature vector and y is the number of counts for each length. Further, a and b are coefficients. The mean squared error (MSE) was used to evaluate the fit of the regression equation. As far as these results show, the difference between the two distributions is larger when the number of basis vectors (r) is 9 or 10 (see Box "String theory"). Furthermore, r was also changed to 15, 30, 60, and 90, and the differences were small. From this observation alone, it would be too hasty to conclude that the number of true basic dimensions is somewhere in this range. However, if experiments to discriminate various particle pairs can be done in the future, it may be possible to determine a reasonable number of basis vectors by accumulating many such experiments.
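The regression step can be sketched as follows. The text does not state how the coefficients a and b were estimated or how the length of a feature vector was defined, so this sketch assumes the Euclidean length and a least-squares fit on log y, which is one common way to fit an exponential; the data points are hypothetical and generated from known coefficients.

```python
import math


def vector_length(h):
    """Euclidean length of one weighting (feature) vector (assumed definition)."""
    return math.sqrt(sum(x * x for x in h))


def fit_exponential(xs, ys):
    """Fit y = b * exp(-a * x) by least squares on log(y); requires all ys > 0."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x, mean_l = sum(xs) / n, sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    slope = sxl / sxx
    a = -slope
    b = math.exp(mean_l - slope * mean_x)
    return a, b


def mse(xs, ys, a, b):
    """Mean squared error of the fitted curve."""
    return sum((y - b * math.exp(-a * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical histogram: bin centers (lengths) and counts following 100 * exp(-0.8 x).
xs = [0.5, 1.5, 2.5, 3.5, 4.5]
ys = [100.0 * math.exp(-0.8 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(a, b, mse(xs, ys, a, b))
print(vector_length([3.0, 4.0]))
```

Fitting the two particle types separately and comparing the resulting coefficients and errors gives one way to quantify how different the two length distributions are for a given r.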
Fig. 8.19 Length distribution and the regression model: y = b exp (−ax). Panels: r = 6, 8, 9, 10, 11, and 12; each panel overlays the antideuteron and antiproton distributions and lists the fitted a, b, and MSE for each particle type
Box: String Theory Antiprotons and antideuterons consist of multiple elementary particles. According to string theory (Schwarz 2000), all elementary particles can be described by ten-dimensional features consisting of six extra dimensions other than the four-dimensional space–time that we can see directly. Is this just a coincidence? □
References
Abadi M, Agarwal A et al (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467 [cs.DC]
Aramaki T, Chan SK, Craig WW, Fabris L, Gahbauer F, Hailey CJ, Koglin JE, Madden N, Mori K, Yu HT (2013) A measurement of atomic X-ray yields in exotic atoms and implications for an antideuteron-based dark matter search. Astropart Phys 49:52–62
BigQuery (2022) https://cloud.google.com/bigquery. Accessed 2022
Bräuninger CB, Cirelli M (2009) Anti-deuterons from heavy dark matter. Phys Lett B 678(1):20–31
Breiman L (2001) Random forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
Celko J (2014) Joe Celko's SQL for smarties: advanced SQL programming. Morgan Kaufmann
Chen C, Liaw A, Breiman L (2003) Using random forest to learn imbalanced data. Research Article of Department of Statistics UC Berkeley, pp 1–12
Chollet F (2015) Keras. keras.io, vol 10
Flickr (2022) Find your inspiration. https://www.flickr.com/. Accessed 2022
Frohlich C, Nakamura Y (2009) The physical mechanisms of deep moonquakes and intermediate-depth earthquakes: how similar and how different? Phys Earth Planet Inter 173:365–374
Fuke H, Koglin JE, Yoshida T, Aramaki T, Craig WW, Fabris L, Gahbauer F, Hailey CJ, Jou FJ, Madden N, Mori K, Yu HT, Ziock KP (2008) Current status and future plans for the general antiparticle spectrometer (GAPS). Adv Space Res 41(12):2056–2060
Imafuku T, Ishikawa H, Araki T, Yamamoto Y, Fuke H, Shimizu Y, Wada T, Nakagami Y (2021) Application of machine learning and visualization of evidence for cosmic-ray antiparticle identification. In: JAXA space science informatics symposium FY2020 (in Japanese)
Imafuku T, Ishikawa H, Araki T, Yamamoto Y, Fuke H, Shimizu Y, Wada T, Nakagami Y (2022) Application of machine learning and visualization of evidence for cosmic-ray antiparticle identification. JAXA Res Dev Rep: J Space Sci Inf Jpn 11(JAXA-RR-21-008):37–43 (in Japanese)
Ishikawa H, Yamamoto Y, Hirota M, Endo M (2019) Towards construction of an explanation framework for whole processes of data analysis applications: concepts and use cases. In: Proceedings of the 11th international conference on advances in multimedia MMEDIA 2019
Ishikawa H, Yamamoto Y, Hirota M, Endo M (2020) An explanation framework for whole processes of data analysis applications: concepts and use cases. Int J Adv Softw 13(1&2):1–15
Kaminski ME (2019) The right to explanation, explained. Berkeley Technol Law J 34:189–218. https://scholar.law.colorado.edu/articles/1227
Kato K, Yamada R, Yamamoto Y, Hirota M, Yokoyama S, Ishikawa H (2017) Analysis of spatial and temporal features to classify the deep moonquake sources using balanced random forest. In: Proceedings of MMEDIA2017. https://www.thinkmind.org/articles/mmedia_2017_4_10_58006.pdf
Keras (2022) https://keras.io/. Accessed 2022
Kikuchi S, Yamada R, Yamamoto Y, Hirota M, Yokoyama S, Ishikawa H (2017) Classification of unlabeled deep moonquakes using machine learning. In: Proceedings of MMEDIA 2017. https://www.thinkmind.org/articles/mmedia_2017_3_30_58004.pdf
Kingma D, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
Larsen PG, Plat N, Toetenel H (1994) A formal semantics of data flow diagrams. Form Asp Comp 6:586–606. https://doi.org/10.1007/BF03259387
Lee D, Seung H (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401:788–791. https://doi.org/10.1038/44565
Lee D, Seung H (2000) Algorithms for non-negative matrix factorization. In: Proceedings of NIPS2000, pp 556–562
Linoff GS (2016) Data analysis using SQL and excel. Wiley
Lundberg S, Lee S-I (2017) A unified approach to interpreting model predictions. In: Proceedings of NIPS2017. arXiv:1705.07874 [cs.AI]
Mahajan S (2014) The art of insight in science and engineering: mastering complexity. The MIT Press
Mitomi K, Endo M, Hirota M, Yokoyama S, Shoji Y, Ishikawa H (2016) How to find accessible free Wi-Fi at tourist spots in Japan. Proc Socinfo 1:389–403
NASA/NAIF, SPICE (2022) https://naif.jpl.nasa.gov/. Accessed 2022
Nishiura Y, Honami N, Murase H, Takigawa H (1994) Application of 2D FFT in shape recognition: Nondimensionalization of direction information by polar coordinate transformation. J Japanese Soc Agric Mach 56(Supplement):499–500. https://doi.org/10.11357/jsam1937.56.Supplement_499 (in Japanese)
Ordonez C (2004) Programming the K-means clustering algorithm in SQL. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, pp 823–828. https://doi.org/10.1145/1014052.1016921
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Ribeiro MT, Singh S, Guestrin C (2016) Why should I trust you?: explaining the predictions of any classifier. arXiv:1602.04938 [cs.LG]
Schwarz JH (2000) String theory origins of supersymmetry. http://arxiv.org/abs/hep-th/0011078
scikit-learn (2022) Machine learning in Python. https://scikit-learn.org/stable/. Accessed 2022
Selvaraju RR, Cogswell M, Das A et al (2020) Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis 128:336–359. https://doi.org/10.1007/s11263-019-01228-7
statsmodels (2022) https://www.statsmodels.org/stable/index.html. Accessed 2022
tensorflow (2022) https://www.tensorflow.org/. Accessed 2022
Twitter (2022) https://twitter.com/
Wachter S, Mittelstadt B, Russell C (2018) Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harvard J Law Technol
Wada T, Fuke H, Shimizu Y, Yoshida T (2020) Application of machine learning to the particle identification of GAPS. Trans Jpn Soc Aeronaut Space Sci Aerospace Technol Jpn 18(3):44–50. https://doi.org/10.2322/tastj.18.44
White A, d'Avila Garcez A (2019) Measurable counterfactual local explanations for any classifier. arXiv:1908.03020 [cs.AI]
Wieczorek MA, Jolliff BL et al (2006) The constitution and structure of the lunar interior. Rev Mineral Geochem 60(1):221–364
Woodward JB (2004) Making things happen: a theory of causal explanation. Oxford University Press
Woodward J, Ross L (2021) Scientific explanation. In: Zalta EN (ed) The Stanford encyclopedia of philosophy, summer 2021 edn. https://plato.stanford.edu/archives/sum2021/entries/scientific-explanation/
Wright S (1920) The relative importance of heredity and environment in determining the piebald pattern of Guinea-Pigs. Natl Acad Sci USA 6(6):320–332
Yamamoto S, Matsunaga T, Nakamura R, Sekine Y, Hirata N, Yamaguchi Y (2017) An automated method for crater counting using rotational pixel swapping method. IEEE Trans Geosci Remote Sens 55(8):4384–4397