Wing Kam Liu • Zhengtao Gan • Mark Fleming

Mechanistic Data Science for STEM Education and Applications
Wing Kam Liu Northwestern University Evanston, IL, USA
Zhengtao Gan Northwestern University Evanston, IL, USA
Mark Fleming Northwestern University Evanston, IL, USA
ISBN 978-3-030-87831-3    ISBN 978-3-030-87832-0 (eBook)
https://doi.org/10.1007/978-3-030-87832-0
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This book presents Mechanistic Data Science (MDS) as a structured methodology for coupling data with mathematics and scientific principles to solve intractable problems. Dictionary.com defines the word “mechanistic” as referring to theories which explain a phenomenon in purely physical or deterministic terms. Traditional data science methodologies require copious quantities of data to show a reliable pattern which can be used to draw conclusions. The amount of data required to find a solution can be greatly reduced by considering the mathematical science principles.

The history of mathematical science and technology has been a continuous cycle in which data and observations have led to empirical scientific discoveries and then to new technologies and products. These new technologies have, in turn, led to additional tools for data collection, resulting in new scientific discoveries and technologies. For example, Galileo’s study of beams in bending led to the field of strength of materials, which is the fundamental discipline used to this day when designing structures ranging from automobiles to buildings.

Mathematical science principles provide the key framework for describing nature, typically written in the language of mathematics. The mathematics can be continuous, such as the Fourier transform for analyzing the frequency characteristics of a signal, or discrete, such as regression/curve fitting and graphical analysis for analyzing data gathered from experiments.

Mechanistic data science is the hidden link between data and science. The intelligent mining of empirical data can be coupled with existing mathematical science principles to solve intractable problems, uncover new scientific principles, and ultimately aid in decision making. Although in reality there is a broad spectrum for the mix of data and fundamentals, for this book, the focus is on three types of problems. Type 1 is a problem with abundant data but undeveloped or unavailable fundamental principles, often called a purely data-driven problem. This type of problem is typified by marketing behavior of people based on characteristics such as age and gender. Type 2 is a problem with limited data and scientific knowledge, and neither the data nor the scientific principles provide a complete solution. This type of problem is
typified by biomechanical problems such as scoliosis progression, in which fundamental scientific principles can be used to compute the direction of bone growth, but data is required to characterize the effects of age and gender. Type 3 is a problem with known mathematical science principles but uncertain parameters, which can be computationally burdensome to solve. Scientific knowledge is the fundamental understanding of the world which allows people to make predictions, enabling future technologies and new discoveries. This type of problem is typified by physics problems such as determining the actual spring stiffness and damping properties of a spring-mass system based on data collected from multiple cameras at different angles.

Mechanistic data science is an innovative methodology which can be the key to addressing problems that previously could only be dealt with through trial and error or experience. It works by combining available data with the fundamental principles of mathematical science through deep learning algorithms. The first MDS challenge addressed is the generation and collection of multimodal data (such as testing, simulations, or databases). Using the data collected, it will be shown how to extract mechanistic features using basic mathematical tools, including continuous and discrete/digital analysis. Next, it will be shown how to perform knowledge-driven dimension reduction to streamline the analysis. Reduced order surrogate models will be created to introduce the fundamental physics into the solution of the problem. Basic mathematical tools will be used for this, including Fourier analysis, regression, continuous and discrete mathematics, and image analysis. Deep learning algorithms, such as neural networks, will be used for regression and classification. These data and mechanistic analysis steps will then be coupled for system analysis and design.

This book is written in a spectral style and is ideal as an entry-level textbook for STEM (science, technology, engineering, and math) high school students and teachers, engineering and data science undergraduate and graduate students, as well as practicing scientists and engineers. Mini-apps and computer programming segments are included to aid readers in the implementation of the topics presented. While the goal is to make the entirety of the book understandable to all readers, the authors understand that some readers have more background than others. As such, the more advanced sections for readers looking for more detail are denoted as [Advanced topic]. These sections will generally be towards the end of the chapters.

The authors have been working in the fields of engineering mechanics, data science, and engineering education for more than 50 years. This book was initiated in the summer of 2020 based on the notes of a Northwestern University summer course taught to high school students and engineering undergraduate students. It was further refined after teaching undergraduate and graduate students in the fall of 2020 at Northwestern University. Numerous examples are given to describe
in-depth fundamental concepts in terms of everyday terminology. The end goal of this book is to provide readers with a mechanistic data science methodology to solve problems and make decisions by combining available data and mathematical science, not just to write another data science textbook.

Evanston, IL, USA
Wing Kam Liu Zhengtao Gan Mark Fleming
Acknowledgments
It is difficult to accomplish anything in a vacuum, and that is especially true of writing a book. As such, there are several people to acknowledge that have helped bring this book to fruition.

First, I would like to acknowledge Northwestern University graduate students and postdoc fellows who helped write and edit portions of the book, and worked on the example problems and the corresponding Python and/or Matlab codes: Hengyang Li helped on the sections related to signal processing in music and convolution, and helped to write Chap. 4; Satyajit Mojumder helped with the sections on deep learning neural networks and applications to composites, and helped in writing Chaps. 1, 2, and 7; Sourav Saha helped with sections on feature extraction, signal processing in music, and applications to fatigue in additively manufactured parts, and helped with Chaps. 4 and 7; Mahsa Tajdari and Professor Jessica Zhang of Carnegie Mellon University helped with sections related to feature extraction applied to medical imaging and prediction of adolescent idiopathic scoliosis, and helped with Chap. 4; Derick Suarez helped with sections on composite applications, and helped with Chap. 3 and editing Chap. 1; Xiaoyu Xie helped with sections on clustering, Fourier analysis, and Chaps. 6 and 7; Ye Lu and Yangfan Li helped with sections on dimension reduction and reduced order modeling.

We would also like to acknowledge Northwestern undergraduate students Hannah Huang, Madeleine Handwork, and William Sadowski, who provided feedback on the book and worked on several of the presented examples, particularly in composite applications, fatigue of additively manufactured parts, and prediction of adolescent idiopathic scoliosis. Additionally, we acknowledge the valuable editing help from Hannah Boruchov, Bradley Goedde, Michael He, Jeffrey He, Dr. Abdullah Al Amin, and Georgia Tech graduate student Brian Kelly. Lastly, we would like to acknowledge Adlai E. Stevenson High School students Sneha Mohan, Sophia Zhuang, Krushank Bayyapu, and Arnav Srinivasan for participating in a 2020 summer short course, providing feedback on the material, and working on example problems.

Finally, we would like to thank our families for their patience with our incessant Zoom calls to discuss the details of each and every chapter. Their help with editing and daily life experiments during this endeavor has been incredible.
Contents

1 Introduction to Mechanistic Data Science
   1.1 A Brief History of Science: From Reason to Empiricism to Mechanistic Principles and Data Science
   1.2 Galileo's Study of Falling Objects
   1.3 Newton's Laws of Motion
   1.4 Science, Technology, Engineering and Mathematics (STEM)
   1.5 Data Science Revolution
   1.6 Data Science for Fatigue Fracture Analysis
   1.7 Data Science for Materials Design: "What's in the Cake Mix"
   1.8 From Everyday Applications to Materials Design
       1.8.1 Example: Tire Tread Material Design Using the MDS Framework
       1.8.2 Gold and Gold Alloys for Wedding Cakes and Wedding Rings
   1.9 Twenty-First Century Data Science
       1.9.1 AlphaGo
       1.9.2 3D Printing: From Gold Jewelry to Customized Implants
   1.10 Outline of Mechanistic Data Science Methodology
   1.11 Examples Describing the Three Types of MDS Problems
       1.11.1 Determining Price of a Diamond Based on Features (Pure Data Science: Type 1)
       1.11.2 Sports Analytics
       1.11.3 Predicting Patient-Specific Scoliosis Curvature (Mixed Data Science and Surrogate: Type 2)
       1.11.4 Identifying Important Dimensions and Damping in a Mass-Spring System (Type 3 Problem)
   References

2 Multimodal Data Generation and Collection
   2.1 Data as the Central Piece for Science
   2.2 Data Formats and Sources
   2.3 Data Science Datasets
   2.4 Example: Diamond Data for Feature-Based Pricing
   2.5 Example: Data Collection from Indentation Testing
   2.6 Summary of Multimodal Data Generation and Collection
   References

3 Optimization and Regression
   3.1 Least Squares Optimization
       3.1.1 Optimization
       3.1.2 Linear Regression
       3.1.3 Method of Least Squares Optimization for Linear Regression
       3.1.4 Coefficient of Determination (r2) to Describe Goodness of Fit
       3.1.5 Multidimensional Derivatives: Computing Gradients to Find Slope or Rate of Change
       3.1.6 Gradient Descent (Advanced Topic: Necessary for Data Science)
       3.1.7 Example: "Moneyball": Data Science for Optimizing a Baseball Team Roster
       3.1.8 Example: Indentation for Material Hardness and Strength
       3.1.9 Example: Vickers Hardness for Metallic Glasses and Ceramics
   3.2 Nonlinear Regression
       3.2.1 Piecewise Linear Regression
       3.2.2 Moving Average
       3.2.3 Moving Least Squares (MLS) Regression
       3.2.4 Example: Bacteria Growth
   3.3 Regularization and Cross-Validation (Advanced Topic)
       3.3.1 Review of the Lp-Norm
       3.3.2 L1-Norm Regularized Regression
       3.3.3 L2-Norm Regularized Regression
       3.3.4 K-Fold Cross-Validation
   3.4 Equations for Moving Least Squares (MLS) Approximation (Advanced Topic)
   References

4 Extraction of Mechanistic Features
   4.1 Introduction
   4.2 What Is a "Feature"
   4.3 Normalization of Feature Data
       4.3.1 Example: Home Buying
   4.4 Feature Engineering
       4.4.1 Example: Determining a New Store Location Using Coordinate Transformation Techniques
   4.5 Projection of Images (3D to 2D) and Image Processing
   4.6 Review of 3D Vector Geometry
   4.7 Problem Definition and Solution
   4.8 Equation of Line in 3D and the Least Square Method
       4.8.1 Numerical Example
   4.9 Applications: Medical Imaging
       4.9.1 X-ray (Radiography)
       4.9.2 Computed Tomography (CT)
       4.9.3 Magnetic Resonance Imaging (MRI)
       4.9.4 Image Segmentation
   4.10 Extracting Geometry Features Using 2D X-ray Images
       4.10.1 Coordinate Systems
       4.10.2 Input Data
       4.10.3 Vertebra Regions [Advanced Topic]
       4.10.4 Calculating the Angle Between Two Vectors
       4.10.5 Feature Definitions: Global Angles
   4.11 Signals and Signal Processing Using Fourier Transform and Short Term Fourier Transforms
   4.12 Fourier Transform (FT)
       4.12.1 Example: Analysis of Separate and Combined Signals
       4.12.2 Example: Analysis of Sound Waves from a Piano
   4.13 Short Time Fourier Transform (STFT)
   References

5 Knowledge-Driven Dimension Reduction and Reduced Order Surrogate Models
   5.1 Introduction
   5.2 Dimension Reduction by Clustering
       5.2.1 Clustering in Real Life: Jogging
       5.2.2 Clustering for Diamond Price: From Jenks Natural Breaks to K-Means Clustering
       5.2.3 K-Means Clustering for High-Dimensional Data
       5.2.4 Determining the Number of Clusters
       5.2.5 Limitations of K-Means Clustering
       5.2.6 Self-Organizing Map (SOM) [Advanced Topic]
   5.3 Reduced Order Surrogate Models
       5.3.1 A First Look at Principal Component Analysis (PCA)
       5.3.2 Understanding PCA by Singular Value Decomposition (SVD) [Advanced Topic]
       5.3.3 Further Understanding of Principal Component Analysis [Advanced Topic]
       5.3.4 Proper Generalized Decomposition (PGD) [Advanced Topic]
   5.4 Eigenvalues and Eigenvectors [Advanced Topic]
   5.5 Mathematical Relation Between SVD and PCA [Advanced Topic]
   References

6 Deep Learning for Regression and Classification
   6.1 Introduction
       6.1.1 Artificial Neural Networks
       6.1.2 A Brief History of Deep Learning and Neural Networks
   6.2 Feed Forward Neural Network (FFNN)
       6.2.1 A First Look at FFNN
       6.2.2 General Notations for FFNN [Advanced Topic]
       6.2.3 Apply FFNN to Diamond Price Regression
   6.3 Convolutional Neural Network (CNN)
       6.3.1 A First Look at CNN
       6.3.2 Building Blocks in CNN
       6.3.3 General Notations for CNN [Advanced Topic]
       6.3.4 COVID-19 Detection from X-Ray Images of Patients [Advanced Topic]
   6.4 Musical Instrument Sound Conversion Using Mechanistic Data Science
       6.4.1 Problem Statement and Solutions
       6.4.2 Mechanistic Data Science Model for Changing Instrumental Music [Advanced Topic]
   6.5 Conclusion
   References

7 System and Design
   7.1 Introduction
   7.2 Piano to Guitar Musical Note Conversion (Type 3 General)
       7.2.1 Mechanistic Data Science with a Spring Mass Damper System
       7.2.2 Principal Component Analysis for Musical Note Conversion (Type 1 Advanced)
       7.2.3 Data Preprocessing (Normalization and Scaling)
       7.2.4 Compute the Eigenvalues and Eigenvectors for the Covariance Matrix of Bp and Bg
       7.2.5 Build a Reduced-Order Model
       7.2.6 Inverse Transform Magnitudes for all PCs to a Sound
       7.2.7 Cumulative Energy for Each PC
       7.2.8 Python Code for Step 1 and Step 2
       7.2.9 Training a Fully-Connected FFNN
       7.2.10 Code Explanation for Step 3
       7.2.11 Generate a Single Guitar
       7.2.12 Python Code for Step 4
       7.2.13 Generate a Melody
       7.2.14 Code Explanation for Step 5
       7.2.15 Application for Forensic Engineering
   7.3 Feature-Based Diamond Pricing (Type 1 General)
   7.4 Additive Manufacturing (Type 1 Advanced)
   7.5 Spine Growth Prediction (Type 2 Advanced)
   7.6 Design of Polymer Matrix Composite Materials (Type 3 Advanced)
   7.7 Indentation Analysis for Materials Property Prediction (Type 2 Advanced)
   7.8 Early Warning of Rainfall Induced Landslides (Type 3 Advanced)
   7.9 Potential Projects Using MDS
       7.9.1 Next Generation Tire Materials Design
       7.9.2 Antimicrobial Surface Design
       7.9.3 Fault Detection Using Wavelet-CNN
   References

Index
Chapter 1
Introduction to Mechanistic Data Science
Abstract Mechanistic data science provides a framework for combining data science tools with underlying scientific principles to address a wide range of problems. Data science and machine learning continually push the limits of hardware and computer algorithms, while mathematical science and engineering are constantly challenged to move beyond the status quo. In this context, it is worth exploring how the same tasks can be performed with mechanistic data science. One promising approach to attain this ambitious goal is to make effective use of dimension reduction through mechanistic data science, in which the knowledge of past scientific discovery interacts harmoniously with streams of data. A balanced interaction between data and existing scientific knowledge will benefit the advancement of science, for example by accelerating scientific discovery. As an introductory chapter, the evolution of science, technology, engineering, and mathematics throughout history is explored, from Aristotle and the ancient Greeks to Galileo and Isaac Newton, to the present-day data science revolution. The everyday applications of data science and machine learning are ever-present, ranging from suggested movie preferences to fraud detection. However, the efficient application of data science to scientific discovery, with proper deployment of previously acquired scientific and mathematical knowledge, is relatively less explored. This chapter presents real-world examples to demonstrate a methodical approach for solving practical scientific and engineering problems in an accelerated fashion by applying data science in combination with existing scientific laws.

Keywords Mechanistic data science · Fundamental scientific laws · Mathematical science · Scientific method · Falling objects · Gravity · Laws of motion · Law of inertia · Law of force balance · Law of reaction force · Edisonian approach · Empirical approach · Artificial intelligence · Neural networks · Game theory · Nash equilibrium · STEM · Fatigue · Fracture · Material design · Alloys · Gold alloys · AlphaGo · 3D printing · Regression · Multivariate linear regression · Classification · Dimension reduction · Reduced order surrogate model · System and design

Supplementary Information: The online version of this chapter (https://doi.org/10.1007/978-3-030-87832-0_1) contains supplementary material, which is available to authorized users.
Most challenging problems require a combination of data and scientific or underlying principles to find a solution. The power of mechanistic data science is that the techniques which will be shown in this book apply to problems in the physical sciences and mathematics, as well as manufacturing, medicine, social science, and business. As shown in Fig. 1.1, mechanistic data science combines established equations from mathematical science with data and measurements through techniques such as neural networks to address problems which were previously intractable. Although the mechanistic data science methodology will be mostly described in terms of engineering and science examples, the same methodology is applicable to other walks of life in which data is available and decisions are required.

This book is presented in a manner which will be applicable to high school students and teachers, college students and professors, and working professionals. Sections that are more advanced will be noted to inform the reader that additional background knowledge may be required to fully understand that particular section.
Fig. 1.1 Schematic of Mechanistic Data Science
1.1 A Brief History of Science: From Reason to Empiricism to Mechanistic Principles and Data Science
The history of science has been one of observations, leading to theories, leading to new technologies. In turn, the new technologies have enabled people to make new observations, which led to new theories and new technologies. This cycle has been repeated for thousands of years, sometimes at a slow pace, and other times very rapidly.

Ancient philosophers such as Aristotle (384–322 BC) believed that scientific laws could be discerned through reason and logic. Aristotle reasoned that the natural state of an object was to be at rest and that heavier objects fell faster than lighter objects because there was more downward force on them. Aristotle believed in a geocentric universe (centering around earth) and that the heavens were made of the quintessence, which was perfect and unalterable. In other words, there could be no supernovae or comets. These scientific ideas remained the benchmark for centuries until scientists in the Renaissance began questioning them.

The astronomer Nicolaus Copernicus (1473–1543 AD) proposed a heliocentric model of the universe in which the sun was the center, not the earth. The idea that the earth was not the center of the universe was very difficult for humans to accept. However, the publication of Copernicus’ seminal work On the Revolutions of the Celestial Spheres, published in 1543, around the time of his death, set off the Copernican Revolution in science which resulted in a major paradigm shift away from the Ptolemaic geocentric model of the universe.

The work of Copernicus was later supported by the observations of Tycho Brahe (1546–1601 AD). Tycho was a talented astronomer who recorded many accurate measurements of the solar system which became the foundation for future astronomical theories. Tycho was followed by Johannes Kepler (1571–1630), who used the data collected by Tycho to formulate scientific laws of planetary motion which could predict the past or future position of the planets. Of particular note, he determined that the planets moved in an elliptical orbit around the sun, not a circular orbit, in contrast to the ancient Greeks, who thought the universe was geocentric and planetary motion was circular.

Galileo Galilei (1564–1642) is known as the father of the scientific method because of his systematic combination of experimental data and mathematics. A contemporary of Kepler, he was the first scientist to use a telescope to observe celestial bodies and championed the heliocentric model of the universe. In 1638, he published Discourses and Mathematical Demonstrations Relating to Two New Sciences (better known by the abbreviated name The Two New Sciences), where he laid out the fundamentals for strength of materials and motion.
1.2 Galileo’s Study of Falling Objects
One of Galileo’s major contributions was his study of motion and his ability to discern primary forces such as gravity from secondary forces such as friction and wind resistance. A notable example was Galileo’s study of falling objects. Most people have seen a dense, weighty object like a baseball fall faster than a lightweight object like a feather or a piece of paper. As discussed previously, ancient philosophers such as Aristotle had postulated that heavier objects fall faster than lighter objects in proportion to their mass. This remained the generally accepted theory of gravity until Galileo began studying falling objects in the late 1500s. Around 1590, when Galileo was a professor of mathematics at the University of Pisa in Italy, he reportedly conducted experiments (according to his student Vincenzo Viviani) by dropping objects of different masses from the Leaning Tower of Pisa to demonstrate that they would fall at the same speed [1]. Many years later in 1971, astronaut David Scott performed a similar experiment on the moon, in which he dropped a feather and a hammer at the same time. Because the moon has almost no atmosphere (and thus no air resistance to slow the feather), the feather and the hammer hit the ground at the same time [2].
1.3 Newton’s Laws of Motion
Isaac Newton (1642–1726) was a scientist and mathematician best known for his laws of motion and the invention of calculus. Newton’s laws of motion are a classic example of a law that was developed through the scientific method. He synthesized many years, decades, and centuries of observations, experimental data, and theories by scientists and mathematicians such as Galileo, Kepler, and Copernicus, into a new understanding of motion. In 1687, Newton published his work Philosophiae Naturalis Principia Mathematica (better known by its abbreviated name Principia), which has become one of the most classic scientific texts in history. In this book, Newton laid out three fundamental laws of motion:

1. Law of inertia—an object in motion tends to stay in motion and an object at rest tends to stay at rest unless some force is applied to it.
2. Law of force balance—changing the motion requires a force to be applied, which leads to the classic equation: Force = mass × acceleration.
3. Law of reaction forces—for every action/force, there is an equal and opposite reaction.

These three seemingly simple laws of motion account for and describe nearly all the motion seen and experienced in the world around us even to this day. In fact, they were not significantly modified until the early twentieth century, when Albert Einstein’s theory of relativity was needed to describe the motion of objects traveling at a significant fraction of the speed of light.
Joseph Fourier (1768–1830) was a scientist and mathematician born 42 years after the death of Isaac Newton who made fundamental contributions in the areas of heat conduction and dynamics. Fourier recognized that any function, including the equation for periodic dynamic motion, can be approximated as a sum of sine and cosine functions, called a Fourier series. The coefficients for each of the sine and cosine terms in the series determine their relative contributions/amplitudes. As such, the frequencies and amplitudes for a dynamic signal describe the signal and allow for other useful analyses such as filtering and frequency extraction; a generic form of the series is sketched below.
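For readers who want to see the series written out, the standard textbook form for a signal f(t) with period T is as follows (this notation is generic and not specific to this book’s later chapters):

f(t) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \left[ a_k \cos(k\omega_0 t) + b_k \sin(k\omega_0 t) \right], \qquad \omega_0 = \frac{2\pi}{T},

a_k = \frac{2}{T} \int_0^T f(t)\cos(k\omega_0 t)\,dt, \qquad b_k = \frac{2}{T} \int_0^T f(t)\sin(k\omega_0 t)\,dt.

The coefficients a_k and b_k are exactly the “relative contributions/amplitudes” mentioned above; plotting their size against the frequency kω_0 is the basis of the frequency analysis revisited in Chap. 4.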
Since the time of Newton, tremendous technological strides have been made by coupling fundamental laws of science with creativity to meet the needs of people. One emblematic example is Thomas Edison (1847–1931) and the light bulb. One of the most basic human needs is to see in the dark, something that was mostly accomplished by fire until the invention of the electric light bulb. Although Edison did not invent the first light bulb, he took it from a crude concept to a mainstream technology. Early light bulbs would only last around 14 h, but through Edison’s innovations, the working life improved to 1200 h. This was accomplished through a long and arduous process of trial and error. When asked about the 1000 failed attempts at inventing the lightbulb, Edison famously replied that he “didn’t fail 1,000 times. The light bulb was an invention with 1,000 steps”.

This Edisonian brute force method is a purely empirical approach for applying science for technological development. This method involves pursuing and achieving a goal by building a design and testing it, and making small modifications based on the results of the previous tests. These steps are repeated until the inventor is satisfied with the design. During this process, the inventor is learning what works and what does not work. This information can be used for calibration of parameters in conjunction with the applicable scientific laws. Another offshoot of the Edisonian-style brute force method is that lots of data for the various trials is collected, which can be very informative for future work. The combination of calibrated mechanistic principles and data collected can be used to accelerate future development.

One application of using collected data is artificial intelligence (AI) and neural networks, in which data is used to guide decision making. An early success story for AI involved the chess matches between the IBM Deep Blue computer and chess champion Garry Kasparov. In 1985, Carnegie Mellon University began a project to “teach” a computer to play competitive chess. Over the next decade, the computer algorithm was trained using data from 4000 different positions and 700,000 chess games by chess grandmasters. In 1996, Deep Blue actually won a single chess game out of a six-game match against chess champion Garry Kasparov. In a 1997 rematch, Deep Blue won the entire match against Garry Kasparov.

Implicit to the computer programming for playing games like chess is game theory, and no one is more synonymous with game theory than John Nash (1928–2015). Nash laid out his theory for achieving an optimal solution to non-cooperative games which came to be known as Nash equilibrium. It states that in a non-cooperative game with known strategies and rational players, the game achieves a state of equilibrium if no player can improve their position by unilaterally changing their strategy. Nash equilibrium is often illustrated by the prisoner’s dilemma, in which two prisoners apprehended together are interrogated in separate rooms. Both prisoners are given three choices: (1) freedom if they confess before the other prisoner but their partner receives extensive jail time, (2) minimal jail time if neither confesses, or (3) extensive jail time if they do not confess but the other prisoner does confess. It can be easily seen that each prisoner will achieve their own best outcome by confessing first rather than trying to cooperate with each other by not confessing. Since its discovery, Nash equilibrium has become one of the most important concepts in the game theory approach to artificial intelligence and decision making for neural networks.

The dawn of the new millennium also brought about the information age, in which information and data are collected and categorized as never before in history. No more going to the library to look up information for a research paper—just Google it. The information age could also be called the data age because of the large amounts of data being collected. The challenge is to turn that data into information and use it synergistically with the already known and established understanding of our world. In this book, this synergy is called Mechanistic Data Science.

Human progress has been greatly accelerated by our ability to understand and control the world around us. A large part of this is because of science and engineering. Since the times of Galileo and Newton, the fundamental principles of materials and motion have been further studied and formalized. Special technical fields of study have developed that feed and interact with each other in a symbiotic manner.
1.4 Science, Technology, Engineering and Mathematics (STEM)
Science provides a set of fundamental laws that describe nature and natural phenomena. From physics to chemistry to biology, once the natural phenomena are scientifically understood and described, humans are able to predict a particular outcome without testing for each possible outcome. Science is heavily reliant on mathematics for the “language” in which its laws are written [3] and is also dependent on engineering and technology for applying scientific findings and developing new tools to enable future discoveries.

Mathematics is the unifying language of the physical sciences, and as such, the development of mathematics is integral to scientific progress. The understanding and capability of data scientists in mathematical topics such as algebra, geometry, trigonometry, matrix algebra and calculus foster the progress of science and engineering.

Engineering is the application of scientific principles for design and problem solving. In other words, people can make things. Engineers use data collected regarding the needs and wants of society and then use the principles of science,
invoke human creativity, and apply manufacturing craftsmanship to design, develop and produce products that address these societal needs and wants.

Technology is the implementation of products and capabilities developed through science and engineering. This generally takes the form of actual products on the market and in use today. Technology draws heavily on scientific discovery and engineering development to be able to address challenges, grow the economy, and improve efficiency.

The scientific method provides an organized methodology for studying nature and developing new scientific theories. A basic form of the scientific method is:

• Observe: the subject of interest is studied and characterized from multiple standpoints. This first stage involves data collection in order to move to the next step.
• Hypothesize: based on the observations and the early data collected, a hypothesis (proposed explanation) is developed.
• Test: experiments are conducted to evaluate and challenge the hypothesis. In this stage, extensive amounts of data are collected and analyzed.
• Theory or law: if a hypothesis is not proven to be false during the testing challenges then it is established as a theory. Theories that are considered to be fundamental and widely accepted are often described as laws.

Note that oftentimes a theory or law is established with limitations for when it is valid (e.g., Newton’s laws of motion are valid for speeds much less than the speed of light, but Einstein’s theory of relativity is required for objects moving at speeds close to the speed of light). There are often many special cases for a scientific law. In the above-mentioned example of a falling object, it is necessary to account for air and wind resistance when comparing a falling hammer versus a feather. They will fall at the same rate in a vacuum or on the surface of the moon, but in normal atmospheric conditions, the effect of air resistance makes a noticeable difference in the rate of falling. In this case, the law of gravity still applies, but it must be coupled with data on wind resistance, object density, and aerodynamics in order to properly describe the phenomena; a minimal numerical sketch of this coupling is given at the end of this section.

Once a scientific law has been established, engineers can use this information to make calculations in the continuing quest to design and build new products. For example, understanding gravity and falling objects allows engineers to design and build objects that can fly, whether they are a backyard water bottle rocket as shown in Fig. 1.2 below, or a high-tech reusable rocket like the SpaceX rocket that can return components to Earth and land upright on a barge in the ocean.
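The coupling of a known law (gravity) with drag data can be made concrete with a minimal numerical sketch. The script below integrates dv/dt = g − (c/m)v² with a simple Euler scheme for two hypothetical objects; the masses and drag coefficients are illustrative assumptions, not measured values from this book.

```python
# Minimal sketch: free fall with quadratic air drag, dv/dt = g - (c/m) * v**2
# The masses and drag coefficients below are illustrative guesses, not measured data.

g = 9.81     # gravitational acceleration, m/s^2
dt = 0.001   # time step, s

def time_to_fall(height_m, mass_kg, drag_coeff):
    """Step the velocity and position forward until the object has fallen height_m."""
    v = y = t = 0.0
    while y < height_m:
        a = g - (drag_coeff / mass_kg) * v ** 2   # net acceleration: gravity minus drag
        v += a * dt
        y += v * dt
        t += dt
    return t

height = 10.0  # drop height, m
for name, m, c in [("hammer-like object", 1.0, 0.002), ("feather-like object", 0.005, 0.01)]:
    print(f"{name}: falls {height} m in {time_to_fall(height, m, c):.2f} s "
          f"(vs. {(2 * height / g) ** 0.5:.2f} s with no air resistance)")
```

Setting the drag coefficients to zero makes both objects fall in the same time, which is the hammer-and-feather result observed on the moon in Sect. 1.2.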
1.5 Data Science Revolution
Recent years have seen a revolution in data science as large amounts of data have been collected on a vast array of topics. For instance, the ubiquitous smart phone is constantly collecting and transmitting information about grocery store purchases,
Fig. 1.2 (a) Backyard water bottle and baking soda rocket (a video is available in the E-book, Supplementary Video 1.1). (b) SpaceX Falcon Heavy rocket launch (Reuters/Thom Baur)
travel routes, and search history. Companies like Facebook, Google, and Instagram collect and utilize data posted on their sites for various marketing purposes such as targeted advertisements and purchase recommendations. These sites match demographic information such as age, gender, and race with internet browsing history, purchases made, and photos and comments posted to predict future behavior, such as whether you are likely to buy a car or make some other significant purchase. These predictions can be used to target advertisements at the right audience.

Data science has been heavily used for product development, from the concept stage to engineering and manufacturing to the customer. Data collected at each stage can be used to improve future products or identify the source of problems that arise. For instance, manufacturers regularly collect customer data to understand the “voice of the customer”, which is used to plan future product models. Furthermore, as a product is being manufactured, data is being collected at every step of the manufacturing process for process control. The increased use of sensors has given rise to the internet of things (IOT), in which data is automatically collected and transmitted over the internet for analysis without requiring explicit human interaction.
1.6 Data Science for Fatigue Fracture Analysis
Disasters have often been a driving force for exploration of new areas or for developing a much deeper understanding of existing areas. The data, techniques, and methods generated from these explorations in turn become a boon for engineers and product designers when designing new products or working to improve existing designs. One such area of scientific and engineering exploration is the study of fractures and failures due to fatigue. Fatigue is the initiation and slow propagation of small cracks into larger cracks under repeated cyclic loading. The formed cracks will continue to
get longer and longer while the product is being used under normal operating loads, until they become so large that the structure fails catastrophically.

Consequences of Fatigue
On April 28, 1988, Aloha Airlines flight 243 took off from Hilo, HI on a routine flight to Honolulu, HI. The Boeing 737 airplane had just reached cruising altitude when a large section of the fuselage separated from the plane (see Fig. 1.3). The pilot was able to successfully land the plane on the island of Maui, although one flight attendant was lost in the incident.

Fig. 1.3 Aloha Airlines flight 243 fatigue fracture example: (a) scheduled and actual flight paths of flight 243, (b) Boeing 737 airplane after fuselage separation, (c) schematic illustration of fatigue crack growth between rivet holes (https://fearoflanding.com/accidents/accident-reports/aloha-air243-becomes-relevant-thirty-years-later/)

Post-incident inspection of the airplane showed that small cracks had initiated from the rivet holes that were used to join the separate pieces of the airplane fuselage. The cracks had propagated slowly from hole to hole until the resulting crack was sufficiently large that the structure could no longer support the service loads in the forward section of the plane. A commercial airplane is pressurized for every flight, which stresses the fuselage, in addition to takeoff and landing loads, and vibration loads. The subject airplane had accumulated 89,680 flight cycles and 35,496 flight hours prior to the incident [4]. It should be noted that this incident resulted in the formation of the Center for Quality Engineering and Failure Prevention, led by Prof. Jan Achenbach at Northwestern University, and with which two of the authors collaborated extensively in the past. Prof. Achenbach received both the National Medal of Technology and the National Medal of Science, partly for his important scientific work on the non-destructive detection of fatigue cracks.

Fatigue Design Methodology
One of the most common methods for fatigue design and analysis is a mechanistic data-driven methodology known as the stress-life method. In this methodology, the material of interest has been tested at many different stress levels to determine how many load cycles it can withstand before failure. The stress amplitude is then plotted against the number of cycles to failure for that material. This plot is commonly referred to as an S-N curve. When used for design analysis, the cyclic stress amplitude at a location of interest is either measured by laboratory or field testing, or computed using finite element analysis or hand calculations. Using the computed or measured stress amplitude and the appropriate S-N curve, the fatigue life can be estimated; a minimal numerical sketch of this lookup is given at the end of this section.

There are many factors involved in the initiation and propagation of fatigue cracks, such as material strength, microscopic impurities and voids, and surface
roughness. For each material of interest, many controlled laboratory tests are conducted on standardized specimens at a range of stress levels to measure how many cycles the material can endure before fracturing. This data-driven approach has been necessary because of the relatively large number of randomly occurring factors that are involved. Fatigue cracks generally initiate at or near the surface of the material. It has been found that parts with rougher surfaces will have shorter fatigue lives, with the effect being more pronounced at higher fatigue lives. For many materials, a large amount of testing has been performed to characterize the effect of the surface roughness due to the manufacturing process used to make the part (e.g., polished surface, machined surface, or as-cast surface finish).
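To make the stress-life bookkeeping concrete, the sketch below represents an S-N curve with a Basquin-type power law, S = A·N^b, and inverts it to estimate a fatigue life. The coefficient values and the applied stress amplitude are illustrative assumptions, not data from this book.

```python
import numpy as np

# Hypothetical Basquin-type S-N curve: S = A * N**b
# (A and b are illustrative placeholders; real values come from fitting material test data)
A = 900.0   # stress amplitude intercept at N = 1 cycle, MPa
b = -0.09   # fatigue strength exponent

def cycles_to_failure(stress_amplitude_mpa):
    """Invert S = A * N**b to estimate the number of cycles to failure."""
    return (stress_amplitude_mpa / A) ** (1.0 / b)

def allowable_stress(n_cycles):
    """Evaluate the S-N curve at a given life."""
    return A * n_cycles ** b

# Example: a computed or measured cyclic stress amplitude of 250 MPa
applied_stress = 250.0
print(f"Estimated life at {applied_stress:.0f} MPa: {cycles_to_failure(applied_stress):.2e} cycles")

# Tabulate the curve the way an S-N plot would show it
for n in np.logspace(3, 7, num=5):
    print(f"N = {n:12.0f} cycles -> allowable stress amplitude = {allowable_stress(n):6.1f} MPa")
```

In practice the fitted curve, surface-finish corrections, and safety factors all come from the kind of test data described above; the point of the sketch is only the mechanics of going from a stress amplitude to an estimated life using an S-N curve.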
1.7 Data Science for Materials Design: “What’s in the Cake Mix”
The macrostructure (or physical structure you can see and hold in your hand) is composed of trillions of atoms of different elements which are mixed and organized in a certain way at the small-scale sub-structure, and this organization depends on the particular material being used. This small-scale sub-structure is referred to as the microstructure if a powerful microscope is required to see it and is called the mesoscale if you need to use a magnifying glass or just look very closely. The overall macrostructural performance of a bulk part is controlled in large part by the microstructure of the material used to make the part. For example, if a pastry chef is baking a cake, the flavor, texture, and crumble of the cake are controlled by the ingredients, the mixing, and the baking time and temperature. One can think of the microstructure as “what’s in the cake mix” (Fig. 1.4).

Engineered components are often evaluated for strength, stiffness, and fracture resistance (as opposed to taste and texture for a cake). A simple example is to consider an ice cube. If one were to make an ice cube from only water, the ice cube would likely shatter into many pieces when dropped on a rigid surface from a sufficient height. However, if other ingredients were added to the water when the ice was made, the resulting ice cube would likely be more resistant to shattering, depending on what was added. For example, if strips of newspaper were added to the ice, the resulting reinforced ice cube would not shatter when dropped from the same height as the unreinforced ice cube. This is because the mesostructure of the newspaper in the ice increases the toughness of the ice cube and resists cracks from propagating through the cube as it impacts the rigid surface (Fig. 1.5).

If one realizes that we have entered the digital age and newspaper is no longer readily available, then new composite material needs to be developed to replace newspaper filler. Given below is a sampling of alternative reinforcement materials that could be used to make a composite cube structure. To evaluate the impact fracture resistance of the various cubes, several composite ice cubes were made
Fig. 1.4 “Cake Mix” material microstructure example (“Classic Carrot Cake with Cream Cheese Frosting.” Once upon a chef with Jenn Segal. https://www.onceuponachef.com/recipes/carrot-cake. html)
Fig. 1.5 Ice with and without reinforcement dropped to show effects of reinforcement on fracture resistance. Experiment by Northwestern University Prof. Yip Wah Chung, 2003
Fig. 1.6 Composite ice cube experiment by Northwestern University Prof. Mark Fleming and Carmen Fleming, 2020
Table 1.1 Drop test results for ice cubes formed with various reinforcement materials
Filler material      Result
Control              Fractured, big chunks
Salt water           Small chunks
Egg                  Fractured, big chunks
Sawdust              Fractured, small chunks
Wood chips           Very little fracturing
Coffee grounds       Fractured, big chunks
Blueberry            Fractured, big chunks
(Fig. 1.6) and dropped 14 ft. (impact speed of 30 ft./s) onto a concrete surface. Results showed that the addition of wood chips was most effective in preventing fracture of the cubes on impact. This resistance to fracture is called the fracture toughness and is important when designing objects which need to have good impact resistance (Table 1.1).
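As a quick consistency check of the quoted impact speed, the constant-acceleration drop relation (air resistance neglected over this short fall) gives

v = \sqrt{2gh} = \sqrt{2 \times 32.2\ \mathrm{ft/s^2} \times 14\ \mathrm{ft}} \approx 30\ \mathrm{ft/s},

which matches the 30 ft./s figure quoted above.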
1.8 From Everyday Applications to Materials Design
The reinforced ice cube can be considered as a metaphor for an engineered composite material in which the mesoscale structure enhances the overall structural performance. Consider the material substitutions shown in Fig. 1.7. A composite material in its most basic form consists of a matrix material with some reinforcing material embedded inside. If the matrix ice material is replaced with an epoxy and
Fig. 1.7 Ice cube to engineered materials analogy for materials design
the newspaper is replaced with carbon fibers, the result is a carbon fiber reinforced composite material. On the other hand, if the matrix ice material is replaced with rubber and the newspaper is replaced with steel or polymer cords, a tire can be made.
1.8.1 Example: Tire Tread Material Design Using the MDS Framework
Tire durability is one of the fundamental questions facing the tire industry. The unpredictable weather and road surface conditions that each tire faces every day have a significant impact on its durability. One of the key materials property metrics that can be related to tire material performance is called tan(δ). For tire materials, a high tan(δ) is desirable at low temperatures (better ice and wet grip) and a low tan(δ) is desirable at high temperatures (better rolling friction). It is noteworthy that approximately 5–15% of the fuel consumed by a typical car is used to overcome the rolling friction of the tire on the road. Therefore, controlling the rolling friction of tires is a feasible way to save energy (by reducing fuel consumption) and reduce the environmental impact (by reducing carbon emission). Additionally, it ensures the safe operation of tires by providing sufficient ice or wet grip.

The key performance metric tan(δ) is a function of the matrix materials, microstructure, and the operating conditions such as temperature and frequency. It is well known that adding fillers improves the tire materials performance, but which fillers to use, and how to distribute them to achieve optimized properties and performance, is still an important research question. The design space combining different rubber matrices and fillers with the possible microstructures and operating conditions can be enormous, making exhaustive experimental or simulation studies infeasible. The mechanistic data
science approach can provide an effective solution to explore the design space by leveraging data science tools to reveal the underlying mechanisms and to construct accurate and efficient reduced order surrogate models. This approach will enable the industrial practitioner to perform rapid design iterations and expedite the decision-making process (Fig. 1.7).
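For readers unfamiliar with the symbol, tan(δ) is the loss tangent of a viscoelastic material, a standard quantity from viscoelasticity rather than a definition specific to this book:

\tan\delta = \frac{E''}{E'}

where E'' is the loss modulus (energy dissipated per loading cycle as heat and hysteresis) and E' is the storage modulus (energy stored and recovered elastically). Both moduli depend on temperature and loading frequency, which is why the desirable value of tan(δ) differs between the low-temperature grip regime and the high-temperature rolling regime discussed above.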
1.8.2 Gold and Gold Alloys for Wedding Cakes and Wedding Rings
Pure gold is so ductile that it can be rolled into sheets so thin that they can be used to decorate cakes and subsequently eaten. While this expensive application of pure gold is an interesting possibility for a wedding cake, gold is more often used for jewelry such as a wedding ring. As such, it needs to hold a specific shape. The properties of materials like gold are strongly influenced by the microstructure of the material. Pure 24 K gold is extremely ductile and malleable, meaning that it can be reshaped by rolling and pounding without cracking. Gold jewelry is generally made from 18 K or 14 K gold, which is strengthened by alloying it with other metals (mixing with other elements)—see the periodic table of elements in Fig. 1.8. As an example, 18 K gold contains a mix of 75% pure gold, 10% copper, 8% nickel, 4.5% zinc and 2.5% silver (Fig. 1.9).
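As a quick check of the karat arithmetic (a karat is one part in 24 by mass, a standard convention rather than anything specific to this book):

18\ \mathrm{K} \Rightarrow \tfrac{18}{24} = 0.75 = 75\%\ \text{gold}, \qquad 14\ \mathrm{K} \Rightarrow \tfrac{14}{24} \approx 58\%\ \text{gold},

which is why 18 K gold is listed above as 75% pure gold, with the remaining 25% made up of the alloying elements.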
Fig. 1.8 Periodic table of elements (Source: Sciencenotes.org). The red box signifies some of the elements used for making gold alloys
Fig. 1.9 Elements making up 18 K gold, which is a common gold alloy for jewelry
1.9 Twenty-First Century Data Science
1.9.1 AlphaGo
One very interesting data science accomplishment is the use of deep learning neural networks for the complex game of Go. Go is a strategy game created in China about 3000 years ago. The game is played with black and white stones that are placed on a grid-filled board in an attempt to surround an opponent's stones and to strategically occupy space. The possibilities of the game are astronomical, with roughly 10^170 possible configurations, which makes it far more complex than chess. Until recently, the best Go computer programs could only play at a relatively novice level. In 2014, a company called DeepMind began working on a project that led to a program called AlphaGo, which was trained to play using deep learning neural networks. It was initially trained on many amateur games and then by playing against itself, improving its ability all the time. In 2015, AlphaGo played a match against the reigning European champion, Mr. Fan Hui, and won. Then in 2016, AlphaGo beat 18-time world champion Mr. Lee Sedol. Since that time, additional versions have been released, including AlphaZero, which is capable of learning other games such as chess and shogi.
1.9.2 3D Printing: From Gold Jewelry to Customized Implants
Recently, a new method of manufacturing structural parts called additive manufacturing (AM), or 3D printing, has become popular. A report from the National Academy of Engineering identified 3D printing as a revolutionary new manufacturing technology capable of making complex shapes, and possibly one day printing new body parts [5]. Additive manufacturing generally works by building a part through
Fig. 1.10 (a) An example of 3D printed gold jewelry [6]. (b) 3D custom implant to reconstruct a vertebra destroyed by a spinal tumor [7]
depositing thin layers of material one after the other until a complete part is formed. This process can form very complicated shapes and has been widely used in many industrial applications, from precious metals such as gold to customized 3D spinal implants (see Fig. 1.10). This type of technology is making a difference for the environment, health, culture, and more, including disaster relief, affordable housing, more efficient transportation (less pollution), better and more affordable healthcare, 3D bioprinted organs, cultural and archeological preservation, accessible medical and lab devices, and STEM education. A great deal of data is generated during the 3D printing process and part qualification, including process parameters, physical fields (varying in both space and time), material microstructure, and mechanical properties and performance. Data science is extremely useful for qualifying and optimizing 3D printed parts, stabilizing printing processes, and improving material properties.
1.10 Outline of Mechanistic Data Science Methodology
As shown in Fig. 1.11, data science for solving engineering problems can be broken down into six modules, ranging from acquiring and gathering data, to processing the data, to performing analysis. Mechanistic data science is the structured use of data combined with a core understanding of physical phenomena to analyze and solve problems, with the end goal of decision-making. The problems to be solved range from purely data-driven problems to problems involving a mixture of data and scientific knowledge. Type 1, purely data-driven: problems with abundant data but undeveloped or unavailable fundamental principles. This type of problem can be illustrated by using data for the features of diamonds to determine the price.
Fig. 1.11 Schematic of Mechanistic Data Science for Engineering. Scientific knowledge is combined with data science to effect engineering design with the goal of obtaining knowledge for improved decision making
There is no explicit “theory” associated with diamond prices. Instead, the price is determined by the complex interplay of many features such as size and sparkle. Type 2, limited data and scientific knowledge: problems in which neither the data nor the scientific principles provide a complete solution. This type of problem can be illustrated by the analysis of scoliosis patients. X-rays provide some data on the progression of spine growth, but combining those data with finite element surrogate models in a neural network provides a good estimate of scoliosis progression. Type 3, known mathematical science principles with uncertain parameters: problems which can be computationally burdensome to solve. This type of problem can be illustrated through a spring-mass example. Physics models of spring-mass systems typically assume a point mass, a massless spring, and no damping. Data collected on an actual spring-mass system can illustrate how to use data science to identify key physical factors, such as the damping coefficient, directly from high-dimensional noisy data collected from experimental observations.
In this book, mechanistic data science is broken into seven chapters:
Chapter 1: Introduction to Mechanistic Data Science. Illustrative problems will be used to demonstrate the power of MDS: determining the price of a diamond based on its features (pure data science—Type 1), predicting playoff contention for baseball teams (pure data science—Type 1), predicting patient-specific scoliosis curvature (mixed data science and surrogate—Type 2), and identifying important dimensions and damping in a mass-spring system (Type 3 problem). The details of the solutions will be provided in the following chapters. (This chapter is for both general readers and advanced readers.)
Chapter 2: Multimodal Data Generation and Collection. Large quantities of data are collected related to the topic under study. Multimodal data is data from various types of sources, such as different types of measuring instruments and techniques, models, and experimental setups. Multimodal data generation and collection will be described in Chap. 2. Data, and how data evolves into empiricism and then into mechanism, will be introduced using the story of Kepler's laws and Newton's laws of motion in the seventeenth century. Modern deep learning datasets, and how to use those datasets to solve an engineering problem, will be described using examples such as material quantification using macro- and micro-indentation to measure material hardness.
Chapter 3: Least Square Optimization. This chapter includes prerequisites for the book. We will introduce the concept of least square optimization and use a baseball example to demonstrate linear regression. Nonlinear regression methods will be used to analyze interesting phenomena such as stock market performance and bacteria growth. A few more advanced methods, such as moving least squares and reproducing kernel approximations, will be introduced for advanced readers.
Chapter 4: Extraction of Mechanistic Features. Mechanistic features are the key pieces of data that will be used for further data science analysis. They often have to be computed from the raw data collected. It should be noted that data scientists generally describe the feature extraction process as the step where they spend the majority of their time. The concept of the traditional Fourier transform will be shown and related to an important operation in modern data science called convolution. Extraction of meaningful features will be shown for real-life and engineering problems, such as human speech analysis and additive manufacturing.
Chapter 5: Knowledge-Driven Dimensional Reduction and Reduced Order Surrogate Models. Dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables, generally based on the mechanistic features extracted. Two types of dimensional reduction will be introduced. The first type reduces the number of data points using clustering methods such as k-means clustering and the self-organizing map (SOM). We will apply these clustering methods to real-life jogging, diamond pricing, and additive manufacturing design. The second type of dimension reduction reduces the number of features by eliminating redundancy between them. Reduced order surrogate models can be built using these dimension reduction methods. Singular value decomposition (SVD), principal component analysis (PCA), and proper generalized decomposition (PGD) will be described. Identification of intrinsic properties from a spring-mass system will be shown.
Chapter 6: Deep Learning for Regression and Classification. Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables (often called 'predictors', 'covariates', or 'features'). In machine learning and statistics, classification is the problem of identifying to
which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Deep learning methods such as feedforward neural networks (FNN) will be described and used to capture nonlinear relations in examples such as diamond price prediction. The learning process will be analyzed with the Fourier transform to demonstrate some features of neural network learning. By combining convolution layers with feedforward neural networks, convolutional neural networks (CNN) will be introduced. The power of CNNs will be illustrated by examples of human behavior detection and damage identification in rolling bearings.
Chapter 7: System and Design. System and design tie the other modules together. The data science is coupled with the mechanistic principles in order to complete an analysis and make decisions. Based on the authors' years of experience, state-of-the-art applications ranging from daily life (like baseball) to engineering will demonstrate the concepts of system and design:
1. Piano example with spring-mass system (Type 3, general)
2. Feature-based diamond pricing (Type 1, general)
3. Additive manufacturing (Type 1, advanced)
4. Spine growth prediction (Type 2, advanced)
5. Composite design (Type 3, advanced)
6. Indentation analysis for materials property prediction (Type 2, advanced)
7. Early warning of rainfall-induced landslides (Type 3, advanced)
1.11 Examples Describing the Three Types of MDS Problems
Examples are provided in this section to illustrate the application of mechanistic data science for the three types of problems described in Sect. 1.10. Examples ranging from strictly data-centric problems to medical diagnosis to physics-based problems are provided to illustrate the broad applications of MDS.
1.11.1 Determining Price of a Diamond Based on Features (Pure Data Science: Type 1)
Jim and Maddie are two young people who want to get married. Jim wants to give Maddie a traditional diamond ring that really sparkles. However, they recently graduated from college and have student loans to pay off, as well as many other
expenses. Jim realizes that he needs to study what makes diamonds sparkle in order to get the best diamond ring that he can afford. Jim took a mechanistic data science course in college and decided to use that analytical capability when studying the features and prices of diamonds. A cursory study of diamonds shows that they have some very impressive properties and can be used for industrial applications as well as jewelry. Diamonds are the hardest substance on earth, which makes them popular as surface coatings where wear resistance is important, such as on cutting and drilling tools. Diamonds for jewelry are known for their sparkle and impressiveness, which are features that are not easily quantified. However, they are functions of other features that can be quantified. The classic, best-known, quantifiable features of diamonds are the 4-C's (Cut, Clarity, Color, Carat), but other features such as depth and dimension are also reported. A combination of all these features is used when determining the price of a diamond [8].
Multimodal Data Collection Jim found a large repository of data on diamond features and prices at www.kaggle.com, which is a subsidiary of Google. For his analysis, a database of 53,940 diamonds is catalogued with information for ten different features along with the price of each diamond. Three entries from the larger dataset are shown in the table below:

Carat  Cut      Color  Clarity  Depth  Table  Price  x     y     z
0.23   Ideal    E      SI2      61.5   55.0   326    3.95  3.98  2.43
0.21   Premium  E      SI1      59.8   61.0   326    3.89  3.84  2.31
0.23   Good     E      VS1      56.9   65.0   327    4.05  4.07  2.31
Extraction of Mechanistic Features The raw diamond data available on Kaggle is a good start, but Jim needed to do some feature engineering, or initial data processing, to get it into an appropriate form for mathematical analysis. This includes removing missing values and converting alphabetic and alphanumeric scores to numerical scores (e.g., the cut of a diamond is rated as Premium, Ideal, Very Good, Good, or Fair; these ratings are converted to 1, 2, 3, 4, and 5). In addition, it is often useful to normalize the data.
Dimension Reduction Jim found that for the analysis he was performing, certain features were more useful than others. Since he was interested in how a diamond sparkles, the clarity and color were more interesting than geometric features such as the size distribution. For his analysis, Jim created a correlation matrix to help separate the relevant from the irrelevant features. From this, Jim selected four features for further analysis: carat, clarity, color, and y.
Regression and Classification Regression is the process of developing a mathematical relationship between variables in a dataset. Jim first performed a basic form called linear regression, which tries to determine where to draw a line through the
Fig. 1.12 Diamond price vs. carat for all diamonds combined (left) and separated by cut (right)
middle of the data. In Fig. 1.12a, the red dots represent the raw data for carat versus price and the black line represents a linear regression between the price and the carat of the diamonds. The correlation of the data describes how close the actual data is to the regression line. It can be seen from the graph that when all the diamond features are lumped together, there is a lot of scatter of the red data points around the black line, resulting in a low correlation. To achieve a better fit, Jim needed to consider additional features. Figure 1.12b shows the price vs. carat data subdivided by clarity, ranging from low-clarity I1 diamonds with inclusions to high-clarity IF diamonds that are inclusion free. From this graph, it can be seen that the price per carat increases more quickly for higher clarity diamonds. However, there is still increased scatter at higher prices for all the clarity levels.
Multivariate Linear Regression Jim then decided to perform multivariate linear regression in order to consider the effects of multiple features simultaneously. He found that the correlation of the regression improved as more features were considered. In the graphs in Fig. 1.13 below, the prediction using linear regression is plotted versus the actual data. It can be seen that as the number of features considered increases, the amount of scatter decreases, with the least scatter when all nine features are used for the multivariate linear regression.
System and Design Once the multivariate linear regression was complete, Jim was able to estimate the price of a diamond using multiple features, and the nonlinear nature of diamond pricing became apparent. For instance, using the regression data, the following prices are found for different sized diamonds: 1 carat diamond = $4600; 2 carat diamond = $17,500. In short, Jim found that when it comes to diamonds, 1 + 1 ≠ 2. Jim chose . . .?
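As a rough illustration of the multivariate regression described above, the sketch below fits a linear model to the Kaggle diamonds data with scikit-learn. The file name, the ordinal mappings for cut, color, and clarity (taken from the tables in Chap. 2), and the train/test split are assumptions for this sketch, not necessarily the exact choices behind Fig. 1.13.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumes the Kaggle diamonds dataset has been saved locally as "diamonds.csv"
df = pd.read_csv("diamonds.csv")

# Convert the alphabetic ratings to numbers (mappings follow the tables in Chap. 2)
cut_map = {"Premium": 1, "Ideal": 2, "Very Good": 3, "Good": 4, "Fair": 5}
color_map = {c: i + 1 for i, c in enumerate("DEFGHIJ")}            # D -> 1, ..., J -> 7
clarity_map = {"IF": 1, "VVS1": 2, "VVS2": 2, "VS1": 3, "VS2": 3,
               "SI1": 4, "SI2": 4, "I1": 5}
df["cut"] = df["cut"].map(cut_map)
df["color"] = df["color"].map(color_map)
df["clarity"] = df["clarity"].map(clarity_map)

# Multivariate linear regression on nine features, with held-out data to check the fit
features = ["carat", "cut", "color", "clarity", "depth", "table", "x", "y", "z"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["price"], test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out diamonds:", round(model.score(X_test, y_test), 3))
```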
Fig. 1.13 Diamond price prediction vs. observation using regression based on various numbers of features
1.11.2 Sports Analytics
Data science and analytics are very important in sports. Whether it is deciding which star player to choose in a professional sports draft, evaluating fantasy football statistics, or evaluating shot selection in a basketball game, the insight provided by data science has been profound.
1.11.2.1 Example: "Moneyball": Data Science for Optimizing a Baseball Team Roster
Baseball is a game in which tradition is strong and data and statistics carry great weight. This allows baseball fans to compare the careers of Ty Cobb in 1911 to Pete Rose in 1968 (or anyone else for that matter). Historically, the worth of a player was largely dictated by their batting average (how many hits compared to how many times at bat) and runs batted in (how many runners already on base were able to score when the batter hit the ball). However, through the use of data science, a new trend emerged.
The game of baseball is played with nine players from one team in the field playing defense. A pitcher throws a baseball toward home plate, where a batter standing next to home plate tries to hit the ball out into the field and then run to first base. If the batter hits the ball and makes it to first base before the ball is caught or picked up and thrown to first base, then the batter is awarded a hit and allowed to stay on the base, becoming a base runner. If the ball is caught in the air, picked up and thrown to first base before the batter arrives, or the batter is tagged when running to first base, then the batter is out. The runner can advance to second, third, and home base as other batters get hits. When the runner reaches home base, the team is awarded a run. The batting team continues to bat until it makes three outs, at which point it goes out to the field and the team in the field goes to bat.
As mentioned previously, batting average (BA) and runs batted in (RBI) have traditionally been very important statistics for baseball teams when evaluating the worth of a player. Players with high BAs and high RBIs were paid very large salaries by the richest teams (usually large-market teams), and the small-market teams had trouble competing. Although Major League Baseball (MLB) generates around $10 billion in annual revenue, the smaller-market MLB teams have a much lower budget with which to recruit and sign players. In 2002, Billy Beane, the general manager of the Oakland Athletics, found himself in a tough situation because of this. The Oakland A's were a small-market team without a large budget for player salaries, and Beane turned to data science to field a competitive team. Beane and his data-science-capable assistant, Paul DePodesta, analyzed baseball data from previous seasons and determined that they needed to win 95 games to make the playoffs. To achieve this goal, they estimated they needed to score 133 more runs than their opponents. The question they had to answer was "what data should they focus on?"
To field a competitive team, Beane and DePodesta looked at a combination of a player's on-base percentage (OBP), which is the fraction of plate appearances in which a batter reaches base, and slugging percentage (SLG), which is a measure of how many bases a batter reaches per at bat. In formulaic terms: SLG = (1B + 2×2B + 3×3B + 4×HR)/AB, where 1B, 2B, and 3B are the numbers of singles, doubles, and triples, respectively, HR is the number of home runs, and AB is the number of at bats. Through these two measures, it is possible to assess how often a player gets on base in any possible way (and is thus in a position to score) and how far they advance each time they hit the ball.
It is possible to show through linear regression that SLG and OBP provide a good correlation with runs scored (RS). Using a moneyball baseball dataset available from Kaggle (https://www.kaggle.com/wduckett/moneyball-mlb-stats-19622012/data), a regression analysis was performed to compare the number of runs scored as a function of the batting average, and then as a function of the on-base percentage and slugging percentage.
A linear regression analysis was first performed on RS vs. BA. The results showed that the correlation between RS and BA was only 0.69. BA was deemed a marginally useful statistic because it does not distinguish between players hitting singles and home runs and does not account for players getting on base by walks or by being hit by a pitch. By contrast, a linear regression between RS and OBP shows a correlation of r² = 0.82. OBP accounts for all the ways a player can get on base and, as such, provides a more meaningful measure of the number of runs scored than does the batting average. Finally, a multivariate linear regression was performed with RS vs. OBP and SLG. The results of this regression showed a correlation of r² = 0.93, meaning that OBP combined with SLG provides a better indicator of run-scoring performance than BA or OBP by itself. It should be noted that the linear combination of OBP and SLG is called on-base plus slugging (OPS), and it is a commonly used baseball statistic in the game today (OPS = OBP + SLG). With this measure, the number of times a player reaches base is accounted for, as well as how many bases they are able to reach when they hit the ball.
Using these data science techniques, Beane, DePodesta, and the Oakland A's were able to win 103 games in 2002 (including a record-setting 20-game win streak), finish in first place, and make the playoffs. Today, OPS, OBP, and SLG are some of the most closely watched baseball statistics by baseball insiders and fans alike.
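A minimal sketch of the regressions described above is shown below. It assumes the Kaggle moneyball file has been saved locally as baseball.csv with columns named RS, BA, OBP, and SLG; the actual file name and column names in the download may differ.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumes the Kaggle moneyball dataset saved locally as "baseball.csv",
# with columns RS (runs scored), BA, OBP and SLG.
df = pd.read_csv("baseball.csv")

def r_squared(feature_cols):
    """Fit a linear regression of RS on the given columns and return r^2."""
    X, y = df[feature_cols], df["RS"]
    return LinearRegression().fit(X, y).score(X, y)

print("RS ~ BA        r^2 =", round(r_squared(["BA"]), 2))
print("RS ~ OBP       r^2 =", round(r_squared(["OBP"]), 2))
print("RS ~ OBP + SLG r^2 =", round(r_squared(["OBP", "SLG"]), 2))
# With the full 1962-2012 dataset these should come out close to the values
# quoted in the text (about 0.69, 0.82 and 0.93, respectively).
```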
1.11.3 Predicting Patient-Specific Scoliosis Curvature (Mixed Data Science and Surrogate: Type 2)
Mechanistic data science can be used to analyze the progression of Adolescent Idiopathic Scoliosis (AIS), and may someday soon provide a way to virtually assess the effectiveness of patient-specific treatments before starting the actual treatment. AIS is a condition in which the adolescent spine curves in an unnatural manner. Recently, mechanistic data science has been used to study the progression of this condition.
Multimodal Data Generation and Collection The analysis and diagnosis of AIS begins with medical imaging of the spine. Two types of images are used for the analysis: X-rays and magnetic resonance imaging (MRI). X-rays of a patient are taken from the front, or anteroposterior (AP), view and the side, or lateral (LAT), view to capture the position of the vertebrae that make up the spine (see Fig. 1.14). The X-rays are repeated over time to document the progression of the scoliosis condition. An outline of each vertebra can be extracted from the 2D X-ray data, and the 2D data points are then projected into 3D. In addition to X-rays, MRIs taken of a few patients provide a detailed 3D image of the entire spine. The MRI data from one patient's spinal vertebrae can then be used as reference or surrogate images for other patients. The surrogate model of the vertebrae can be adjusted to be patient-specific by combining it with data from the 2D X-rays. The 2D X-ray data points for each vertebra that have been located in 3D space are overlaid on the surrogate model of the vertebra. For this analysis, the surrogate model of the vertebrae is taken from the MRI data of the spine, but there are other
Fig. 1.14 X-ray projections for collecting data on scoliosis progression
Fig. 1.15 Surrogate geometry of a vertebra. The vertebra on the left is before being adjusted by the collected data. The vertebra on the right has been adjusted through the collected data
sources of vertebrae data that can also be used. Once the 3D X-ray data has been overlaid on the surrogate model of the vertebra, the surrogate model is scaled and adjusted to yield a patient-specific model of the vertebra (see Fig. 1.15). This process is repeated for each vertebra until the entire spine has been mapped for a particular patient by combining 2D X-ray images with generic models from a surrogate.
Extraction of Mechanistic Features Key reference points, or landmarks, for each vertebra are also extracted from the 2D X-rays using image processing. The intersection of the two 2D projections is then used to locate the landmarks as 3D data points (see Fig. 1.15).
Knowledge-Driven Dimension Reduction Using the simplified 3D model of the spine derived from the X-ray images, a detailed model of the spine was created using the atlas model vertebra. The generic surrogate model of each vertebra is updated based on the actual vertebra size and shape, as shown in Fig. 1.15. These more detailed vertebrae are assembled to form a patient-specific spine model.
Reduced Order Surrogate Models The patient-specific geometry of the spine can be used to generate a finite element model to compute the pressure distribution on each vertebra due to scoliosis. According to the Hueter-Volkmann (HV) principle, regions of a vertebra under greater compressive stress grow more slowly, while regions under less stress grow faster. The pressure distribution is computed using finite element analysis, and the geometries of the vertebrae are updated based on the HV principle. The gravity load and the material properties of the spine are also updated over time to reflect the changes in a specific patient due to aging. The computed stress results can be used as inputs to a neural network (Fig. 1.16), along with other factors such as the landmarks, the global angles, and the patient's age, to predict how the spine will change over time. At this time, not enough is known about modeling the materials of the growing spine, and it is not possible to measure the pressure distribution of the vertebrae pressing on each other. However, the results below show that, through a combination of finite element simulation to compute the pressure distribution on the vertebrae and the data from the patient X-rays, it is possible to accurately predict the progression of scoliosis (Figs. 1.16 and 1.17).
Fig. 1.16 Neural network combining mechanistic models with X-ray data to predict spine growth in scoliosis patients
Fig. 1.17 Predicted progression of spine growth in scoliosis patients
1.11.4 Identifying Important Dimensions and Damping in a Mass-Spring System (Type 3 Problem)
A young engineer is given the job of running experiments with physical components and then analyzing the data. Unfortunately, the test data does not always clearly match the theory learned in school. This is not a trivial problem, but rather a fundamental challenge in empirical science. Examples abound in complex systems such as neuroscience, web indexing, meteorology, and oceanography: the number of variables to measure can be unwieldy and, at times, even deceptive, because the underlying relationships can often be quite simple. One such example is the spring-mass system shown in Fig. 1.18. The engineer learned in school to model a system like this as an ideal massless spring, assuming that all the mass is concentrated at a point and that the spring has no mass or damping. For an ideal system, when the weight is released a small distance away from equilibrium (i.e., the spring is stretched), the ball will bounce up and down along the length of the spring indefinitely. The frequency of the motion will be constant, determined by the stiffness of the spring and the mass of the attached weight. An actual spring-mass system does not perfectly match these ideal assumptions, since the attached mass is not a point mass, the spring has some mass, and there will be some damping due to friction.
Multimodal Data Generation and Collection The engineer needs to make measurements of an actual spring-mass system to determine the frequency of the motion and the amount of damping in the spring. To make these measurements, he decided to record the position of the weight from three different angles and orientations using video cameras, which record the position of the weight in three dimensions
Fig. 1.18 A spring-mass motion example. The position of a ball attached to a spring is recorded using three cameras 1, 2 and 3. The position of the ball tracked by each camera is depicted in each panel. (a video is available in the E-book, Supplementary Video 1.2)
(since the world is inherently three-dimensional). He placed three video cameras around the spring-mass system and recorded the motion at 120 frames per second, which provided three distinct two-dimensional projections of the position of the ball. Unfortunately, the engineer placed the video cameras at three arbitrary locations, which meant that the angles between the measurements were not necessarily right angles! After recording the motion, the engineer was faced with the big question: how to extract one-dimensional motion data from the two-dimensional projection data collected from three different angles?
Extraction of Mechanistic Features The engineer first needed to obtain digital data from the videos that had been recorded. To do this, the motion of the weight was extracted from the videos using the computer vision motion capture capabilities of Physlet Tracker [9]. The motion data from each camera is shown in the left-hand plots in Fig. 1.19. To aid in further analysis, the data from each camera is centered on its mean.
Knowledge-Driven Dimension Reduction The useful data for analyzing the spring-mass system is the motion of the weight along the orientation of the spring. To extract the 1D data along the longitudinal direction of the spring, the dimension reduction techniques of singular value decomposition (SVD) and principal component analysis (PCA) can be used (these techniques will be described in detail in Chap. 5). These techniques evaluate which motion directions are key and compute the data associated with those directions. A dimension reduction of the data is achieved by retaining only a few of these key motion components.
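The sketch below illustrates the idea on synthetic stand-in data: six columns of noisy camera projections of a single damped oscillation are reduced to one dominant component with an SVD-based principal component analysis. The signal parameters are borrowed from the results quoted later in this example; none of this is the engineer's actual tracker output.

```python
import numpy as np

# Synthetic stand-in for the tracked ball positions: three cameras, each giving
# a 2D projection (x, y) of the same 1D oscillation, plus measurement noise.
t = np.linspace(0, 60, 7200)                              # 120 frames/s for 60 s
motion = np.sin(2 * np.pi * 0.158 * t) * np.exp(-0.008 * t)
rng = np.random.default_rng(0)
angles = rng.uniform(0, np.pi, 3)                         # arbitrary camera orientations
X = np.column_stack([f(a) * motion + 0.02 * rng.standard_normal(t.size)
                     for a in angles for f in (np.cos, np.sin)])   # shape (N, 6)

# Principal component analysis via SVD of the mean-centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)
print("Variance explained by each component:", np.round(explained, 3))

# The first principal component recovers the 1D motion along the spring
z = Xc @ Vt[0]
```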
Fig. 1.19 An example of reducing high-dimensional data to 1D data and then estimating important parameters of the system
Reduced Order Surrogate Models After the dimension reduction, the engineer created a reduced order model with a smaller number of features to represent all of the original data. For this spring-mass system, one-dimensional data for the motion along the longitudinal direction of the spring is desired to represent the original set of data from the three cameras. This is possible because the displacements in the different local coordinate systems are highly correlated, since they record the same spring-mass motion.
Regression and Classification The engineer then used the data from the spring-mass experiments to determine the natural frequency and damping coefficient by considering the mechanistic principles of a spring-mass system. A spring-mass system oscillates in a pattern that matches a sine wave. If the system has some damping, the sine wave is multiplied by a decaying exponential function:

z(t) = A sin(2πft) exp(−bt)

where A is the starting amplitude, f is the natural frequency in Hz, and b is the damping coefficient. The natural frequency is the inverse of the peak-to-peak period of the sine wave, and the damping coefficient is computed from the rate of exponential decay of the oscillation amplitude.
Systems and Design The engineer was able to use the reduced order model of the spring, along with the regression based on mechanistic principles, to determine the critical properties of the spring-mass system. The data show that the natural frequency is f = 0.158 Hz and the damping coefficient is b = 0.008. These data are overlaid on the reduced order model data in Fig. 1.20.
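A minimal sketch of this regression step is shown below, using scipy's curve_fit to recover A, f, and b from a synthetic damped-sine signal generated with the values quoted above. In practice, z would be the one-dimensional signal obtained from the dimension reduction step.

```python
import numpy as np
from scipy.optimize import curve_fit

def damped_sine(t, A, f, b):
    """z(t) = A sin(2*pi*f*t) exp(-b*t): an ideal damped oscillation."""
    return A * np.sin(2 * np.pi * f * t) * np.exp(-b * t)

# Synthetic 1D motion standing in for the PCA-reduced camera data
t = np.linspace(0, 60, 7200)
rng = np.random.default_rng(1)
z = damped_sine(t, 1.0, 0.158, 0.008) + 0.02 * rng.standard_normal(t.size)

# Nonlinear least-squares fit for amplitude, frequency and damping coefficient
(A_fit, f_fit, b_fit), _ = curve_fit(damped_sine, t, z, p0=[1.0, 0.2, 0.01])
print(f"natural frequency f = {f_fit:.3f} Hz, damping coefficient b = {b_fit:.4f}")
```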
Fig. 1.20 Spring-mass motion data with damped sinusoidal plot overlaid
References
1. Galileo's Leaning Tower of Pisa experiment. Wikipedia. 2020 [Online]. https://en.wikipedia.org/w/index.php?title=Galileo%27s_Leaning_Tower_of_Pisa_experiment&action=history. Accessed 8 Sep 2020.
2. https://nssdc.gsfc.nasa.gov/planetary/lunar/apollo_15_feather_drop.html
3. Galileo, Opere Il Saggiatore.
4. Aloha Airlines Flight 243. Wikipedia. 2020. https://en.wikipedia.org/wiki/Aloha_Airlines_Flight_243. Accessed 8 Sep 2020.
5. NAE Report on Making a World of Difference-Engineering Ideas Into Reality.
6. https://3dprintingindustry.com/news/eos-cooksongold-put-bling-3d-printing-precious-metalprinter-33041/
7. https://www.spineuniverse.com/resource-center/spinal-cancer/3d-spinal-implants-glimpsefuture
8. https://www.diamonds.pro/education/gia-certification/
9. https://physlets.org/tracker/
Chapter 2
Multimodal Data Generation and Collection
Abstract Mechanistic data science is heavily reliant on the input data that guides the analysis. This data can come in many shapes, sizes, and formats. Data collection is a key part of the scientific process and generally involves observation and careful recording. Costly data collection from physical observation can be enhanced by taking advantage of modern computer hardware and software to simulate the physical experiments and generate further complementary data. Efficient data collection and management through a database can expedite the problem-solving timeline and help in rapid decision making. This chapter shows how data can be collected and generated from different sources and how it can be managed efficiently. Feature-based diamond pricing and material property testing by indentation are used to demonstrate the key ideas.
Keywords Data collection · Data generation · Empiricism · Mechanism · Mechanistic · Database · Training · Testing · Cross-validation · High fidelity · Low fidelity · Multimodal · Multifidelity · Features · Macro-indentation · Micro-indentation · Nano-indentation · Microhardness · Hardness · Brinell · Vickers · Load-displacement · Sensing · Indenter
As the name suggests, data is the key input for mechanistic data science. The question, though, is "where does the data come from?" The answer is that it can come from many sources and in many formats, which gives rise to the term multimodal data collection and generation. As discussed in Chap. 1, scientific investigation starts with observation, which invariably leads to data collection to test the hypotheses that are developed. Analysis of the data leads to a proven hypothesis and the discovery of new scientific theory. Collecting data from physical observation can be very costly, and it can be difficult to control the independent variables, but it is possible to take advantage of modern computer
hardware and software to simulate the physical experiments and generate further complementary data. Efficient data collection and management through a database can expedite the problem-solving timeline and help in rapid decision making. In this chapter, the data collection and generation process is discussed, including the different sources of data and how they can be managed efficiently. The process will be demonstrated through example problems such as diamond features and prices, and material property testing by indentation.
2.1 Data as the Central Piece for Science
Data provides the evidence that supports scientific knowledge and distinguishes it from conjecture and opinion. As discussed in Chap. 1, this practice goes back centuries to the times of Copernicus, Kepler, Brahe, and Galileo. Galileo has been called the Father of the Scientific Method, in part for his structured use of data in his scientific pursuits. This can be illustrated by his classic beam problem. In his book The Two New Sciences, Galileo presented a drawing of a cantilever beam bending test, as shown in Fig. 2.1. Galileo's analysis centered on the question "how are forces transmitted by structural members?" To answer this question, his unique approach led him to a conclusion that holds true for all structural members used even today. His approach can be seen in the following four steps:
1. Observation: he observed that the strength of the beam was affected by the length of the beam and its cross section.
2. Hypothesis: he noted that the beam strength decreased with length unless the thickness and breadth were increased at an even greater rate.
3. Testing and data collection: he performed many experiments on structural members of different sizes and shapes and collected data on their ability to carry and transmit loads.
4. Scientific theory: from the data and observations (understanding the mechanisms) he came to a conclusion that is applicable irrespective of the length, size, shape, and material of a structural member carrying loads. This also led to a scaling law that holds regardless of size, shape, and material.
Galileo reported his finding as "the breaking force on a beam increases as the square of its length." A more familiar version of his findings is typically taught to undergraduate engineering students in a strength of materials class as the deflection formula for a cantilever beam. The deflection of the tip of a beam can be related to the applied force (F), the length of the beam (L), a material property (the elastic modulus, E), and a geometric factor (the area moment of inertia of the beam cross section, I). The tip deflection is δ = FL³/(3EI), which holds regardless of material, size, shape, and load.
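As a quick numerical illustration of the cantilever formula, the sketch below evaluates δ = FL³/(3EI) for a rectangular steel beam. The load, dimensions, and modulus are illustrative values chosen for this sketch, not data from Galileo's experiments.

```python
# Tip deflection of a cantilever beam, delta = F*L^3 / (3*E*I).
# Illustrative numbers for a rectangular steel beam (not from the book).
F = 500.0                  # applied tip force, N
L = 2.0                    # beam length, m
E = 200e9                  # elastic modulus of steel, Pa
b, h = 0.05, 0.10          # rectangular cross-section width and height, m
I = b * h**3 / 12          # area moment of inertia, m^4

delta = F * L**3 / (3 * E * I)
print(f"Tip deflection: {delta * 1000:.2f} mm")   # about 1.6 mm for these inputs
```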
Fig. 2.1 An excerpt from Galileo’s The Two New Sciences [1]
Another example of data evolving into empiricism or mechanism is Kepler's three laws of planetary motion (1609–1619). Kepler observed the solar system for many years and, based on his observations, came up with three laws to describe the motion of the planets in the solar system (see Fig. 2.2). Described succinctly, the laws are (1) the law of orbits, (2) the law of areas, and (3) the law of periods. All of his laws are empirical in nature and describe the mechanism of planetary motion from direct observation of his collected data. This is the mechanistic part of the problem, which explains the mechanisms of the planetary motions; however, these laws do not explain the reason behind such motions. The science behind them was later discovered by Sir Isaac Newton through the law of gravity in 1687. The theory was further questioned by Einstein, whose research from 1907 to 1917 explained the motion of the planet Mercury and developed the theory of general relativity and gravity [2]. This remains the latest understanding of gravity and the motion of the planets. From these two examples of Kepler and Galileo, it can be seen that data comes from physical observation of a system and provides the basis for finding governing
Fig. 2.2 Discovery of the law of gravitation from planetary motion data. The gravity acting between two objects can be described by Newton's universal law of gravitation. The gravitational accelerations on the earth and the moon are 9.807 and 1.62 m/s², respectively. Weight is mass times the gravitational acceleration acting on a person. If she measures her weight on the moon, she will be happy to see her apparent weight loss. However, on reflection she will realize that it is her mass that has not changed, only her weight
mechanisms. On the other hand, science explains the detailed reasoning behind such observations. The intermediate step is finding the mechanisms and justifying the scientific hypothesis, which is the "mechanistic" aspect of a problem. Combining data with the underlying scientific mechanism results in a unique scientific approach defined in this book as mechanistic data science. The goal of mechanistic data science is twofold: (1) mining the data intelligently to extract the science, and (2) combining data and mechanisms for decision making. One can easily appreciate the amount of time (approximately 300 years from Kepler to Einstein) and the effort of great scientific minds necessary to develop science from raw data observation. We can break this process into two parts: data to empiricism or mechanism, and mechanism to science.
• Data to empiricism or mechanism: collected data are analyzed and the relationships between data samples are established using mathematical tools and intuition.
• Mechanism to science: once the mechanisms of a problem are clearly understood, the theory is further questioned to find the reasons for such behavior in nature.
Here, mechanistic data science clearly establishes the link between data and science by identifying the governing mechanisms. But it all starts from the data. In the next section, we discuss data and some commonly used databases to search for data. Finding appropriate data can be very challenging and may sometimes take years. Hence, having a clear idea of what data to collect and how to collect it makes a significant difference in problem solving.
2.2 Data Formats and Sources
Data is a collection of information (numbers, words, measurements) or descriptions that describes a system or problem. It is an integral part of daily life, including financial data for tracking the stock market, climate data for predicting seasonal changes, or transportation data in the form of automobile accident records, train schedules, and flight delays. This information may take many forms (text, numbers, images, graphs, etc.), but it is all data.
Data is divided into two categories: qualitative and quantitative. Qualitative data is descriptive information. For example, saying "it's too hot outside" describes the temperature without giving an exact value. In contrast, quantitative data is numerical information. Saying "it's 90° outside" gives the precise temperature as a numerical value but does not give context. Quantitative data can be further divided into discrete or continuous data. Discrete data can only take certain values. For example, a dataset recording student heights has a fixed number of data points corresponding to one per student. If there are 10 students being measured, there must be 10 data points, not some fractional number like 10.7 data points. Continuous data can have any value within a given range. For example, temperature changes continuously throughout the day and can have any value (e.g., 47.783°, 65°, 32.6°). In summary, discrete data is counted, while continuous data is measured.
Data used for a mechanistic data science analysis can be obtained in multiple ways, including measurement, computation, or existing databases.
• Measurement: this generally involves setting up a controlled experiment and instrumenting the test to measure data. A test can be repeated multiple times to evaluate consistency (e.g., does a coil of aluminum used to make soda cans meet the specifications) or can be conducted with a varying set of parameters (e.g., what is the effect of changing material suppliers). Making measurements has long been one of the key endeavors of science. For example, Chap. 1 gave a historical description of the data collection for planetary motion and falling objects and how that led to fundamental laws of science.
• Computation: in many cases there is a tremendous amount of mechanistic knowledge that can be used to compute data. For example, as described in the indentation example later in this chapter, finite element analysis can be used to
Fig. 2.3 Sample machine learning database Kaggle
perform detailed calculations of how a structure will perform under different loading conditions.
• Existing database: as data is collected, it can be compiled into a large dataset that can be used for reference or for further analysis. A database is an organized collection of data, generally stored and accessed electronically from a computer system. Mechanistic data science relies on several engineering and machine learning databases, spanning a wide range of industries, disciplines, and problems. For example, Kaggle contains various datasets for machine learning, the Materials Project contains materials data such as compounds and molecules, the National Climatic Data Center (NCDC) contains datasets on weather, climate, and marine data, and the National Institute of Standards and Technology (NIST) has materials physical testing databases [3].
Figure 2.3 is a snapshot listing some Kaggle databases which can be used for data science and machine learning. As can be seen from this list, there is a wide range of data available, including the stock market, earthquakes, global diseases, and other engineering and social topics. A typical dataset is composed of features and data. Features are distinctive variables that describe part of the data and are typically arranged in columns, such as "country" or "earthquake magnitude" [4]. It should be noted that real data used for machine learning is not perfect; many "good" datasets are incomplete and often need to be prepared before being used for analysis. A dataset may contain noisy data, with outliers that come from sensor errors or other artifacts of the data collection process. While it is very tempting to ignore or discard those outliers, it is recommended that they be given careful attention before deciding how to treat them. It is the role of the data scientist to interpret the data and check the influence of outliers on the hypothesis of the problem and the population statistics of the data. Typically, regression-based models are well suited
Fig. 2.4 Data preparation for analysis
to identify outliers in linearly correlated data; clustering methods and principal component analysis are recommended if the data does not show linearity on the correlation planes [5]. Regression models are discussed in Chap. 3; clustering techniques and principal component analysis are discussed in Chap. 5 of this book.
The steps involved in data preparation are captured in Fig. 2.4 and described below [6]. The extraction of mechanistic features is discussed more extensively in Chap. 4.
• Raw Data: collected data in an unmodified form.
• Data Wrangling: transforming and mapping data from one "raw" form into another format with the intent of making it more appropriate and valuable for analysis.
– Data wrangling prepares data for machine interpretation. For example, a computer may not recognize "Yes" and "No," so this data is converted to "1" and "0," respectively. The meaning of the raw data is unchanged, but the information is mapped to another form.
• Data Formatting: formatting data for consistency, associating text data with labels, etc.
• Data Cleansing: providing values for missing entries and removing unwanted characters from the data.
• Database Preparation: adding data from more than one source to create a database.
While many of these steps can be performed with automated data processing techniques, user input is still required. For example, after data is transformed, cleaned, and prepared, it should be visualized. Plotting data, displaying images, and creating graphs often reveal key trends, leading to a better understanding of the data. This knowledge allows data scientists to manually evaluate data science results, ensuring that machine learning trends reflect the data.
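A minimal wrangling and cleansing sketch in the spirit of the steps above is shown below, using pandas on a small made-up table: a Yes/No column is mapped to 1/0, stray characters are stripped, and a missing value is filled with the column mean. The column names and values are invented for illustration only.

```python
import pandas as pd

# Made-up raw records (not a real dataset): text labels, one missing value,
# and one entry with an unwanted character.
raw = pd.DataFrame({
    "passed_inspection": ["Yes", "No", "Yes", "Yes"],
    "hardness": ["250", "310", None, " 275*"],
})

clean = raw.copy()
clean["passed_inspection"] = clean["passed_inspection"].map({"Yes": 1, "No": 0})
clean["hardness"] = (clean["hardness"]
                     .str.strip()          # remove surrounding whitespace
                     .str.rstrip("*")      # remove unwanted trailing characters
                     .astype(float))
clean["hardness"] = clean["hardness"].fillna(clean["hardness"].mean())
print(clean)
```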
2.3 Data Science Datasets
Data science consists of using data to find a functional relationship between input and output data. As shown in Fig. 2.5, an input dataset X_i^N with i features is used to develop the functional relationship y_j^N = f(X_i^N). The function development starts by dividing the input dataset X_i^N into a training set, a validation set, and a test set (Fig. 2.6), with the training set generally being the largest. The inputs and outputs from the training set are fit to a mapping function f(X_i^N) using regression analysis to develop a mechanistic data science model. The validation set measures the accuracy of the model after the training step. This process is repeated with the updated model until the error between the predicted output and the actual output is below a required threshold. Once the error is minimized and the final functional form is established, the function is evaluated against the test set.
Choosing the training, testing, and validation sets from the data can be done either randomly or systematically. One systematic approach uses K-fold cross validation, where the data set is divided into K bins and different bins are used for training, testing, and validation. Cross validation makes the model more
Fig. 2.5 Fitting dataset inputs and outputs to a functional form with machine learning
Fig. 2.6 Data division into training, validation, and test sets for machine learning. In this figure, the training set comprises 70% of the data, the validation set comprises 20% of the data, and the test set comprises 10% of the data
robust and removes bias in model training. More details on K-fold cross validation are discussed in Chap. 3, and it is applied in some of the examples in Chap. 7.
Data modality refers to the source (or mode) of the data. Mechanistic data science is able to incorporate multimodal data (data from multiple sources and test methods). Data deviations between various sources are resolved through calibration. For example, indentation data can be obtained through physical experiments and computer simulations [7]. An example of this is shown later in the chapter.
Data fidelity describes the degree to which a dataset reproduces the state and behavior of a real-world object, feature, or condition. Fidelity is therefore a measure of the realism of a dataset [8]. It can be categorized as high fidelity or low fidelity. This is a somewhat subjective measure which depends on the application, but high fidelity data is generally more accurate and more expensive to obtain. For example, micro-indentation data is high fidelity compared to macro-indentation data, but low fidelity compared to nanoindentation data. Machine learning techniques can improve the resolution of low fidelity data, transforming it into high fidelity data without the large collection cost [9].
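The sketch below shows one way to produce the 70%/20%/10% split of Fig. 2.6 and to set up K-fold cross validation with scikit-learn. The random placeholder arrays simply stand in for a real feature matrix and target vector.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.random.rand(1000, 9)     # placeholder features (e.g., nine diamond features)
y = np.random.rand(1000)        # placeholder target (e.g., price)

# 70% / 20% / 10% split, as in Fig. 2.6
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=1/3, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 700, 200, 100

# Alternatively, K-fold cross validation rotates which bin is held out
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    pass  # fit on X[train_idx], evaluate on X[val_idx]
```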
2.4 Example: Diamond Data for Feature-Based Pricing
Diamond pricing analysis using regression techniques is shown in Chap. 1. Diamonds can be described by several features, such as cut, color, clarity, and carat, and the price of a diamond is a function of all of these features. A dataset was downloaded from Kaggle containing data for 53,940 diamonds with 10 features. A sample of this dataset is shown in Fig. 2.7. For this diamond dataset to be used for predicting prices based on features, the input feature index, i = 9, represents the number of independent variables (in this example, features including cut, color, clarity, and carat). Similarly, the output feature index, j = 1, represents the number of dependent variables (i.e., price). The number of data points in the dataset is N = 53,940.
Cut, color, clarity, and carat are the four features known as the 4 C's. They are defined as:
• Cut: the proportions of the diamond and the arrangement of surfaces and facets
• Color: the color of the diamond, with less color given a higher rating
• Clarity: the amount of inclusions in a diamond
• Carat: the weight of the diamond
Some diamond features, such as cut, color, and clarity, are not rated using numerical values. They must be converted to numerical values in order to be used in a calculation. In this case, the cut, color, and clarity are assigned numerical values based on the individual classifications for each, as shown in the tables below.
Fig. 2.7 A sample of data extracted from diamond features and prices
Cut rating    Numerical value
Premium       1
Ideal         2
Very Good     3
Good          4
Fair          5

Clarity rating                            Numerical value
IF—Internally Flawless                    1
VVS1,2—Very, Very Slightly Included 1,2   2
VS1,2—Very Slightly Included 1,2          3
SI1,2—Slightly Included 1,2               4
I1—Included 1                             5
The color rating scale ranges from D to Z, where D is colorless and Z is a light yellow or brown color. For the given dataset, the diamond colors ranged from D to J, and the numerical values were assigned as: Color (D, E, F, G, H, I, J) → (1, 2, 3, 4, 5, 6, 7). Once all the feature data is converted to numerical values, data normalization can be performed to scale all the data features from 0 to 1 if a regression analysis is to be performed (this will be discussed in more detail in Chap. 4).
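The sketch below applies these conversions to the three sample diamonds of Fig. 2.7, using the cut and clarity tables above and the color mapping D–J → 1–7, followed by min-max normalization to the range [0, 1]. It is a small illustration of the preprocessing, not the full 53,940-diamond pipeline.

```python
import pandas as pd

# The three sample diamonds shown in Fig. 2.7 (price in USD)
df = pd.DataFrame({"carat": [0.23, 0.21, 0.23],
                   "cut": ["Ideal", "Premium", "Good"],
                   "color": ["E", "E", "E"],
                   "clarity": ["SI2", "SI1", "VS1"],
                   "price": [326, 326, 327]})

# Numerical conversions following the tables and color mapping above
cut_map = {"Premium": 1, "Ideal": 2, "Very Good": 3, "Good": 4, "Fair": 5}
color_map = {c: i + 1 for i, c in enumerate("DEFGHIJ")}     # D -> 1, ..., J -> 7
clarity_map = {"IF": 1, "VVS1": 2, "VVS2": 2, "VS1": 3, "VS2": 3,
               "SI1": 4, "SI2": 4, "I1": 5}
df["cut"] = df["cut"].map(cut_map)
df["color"] = df["color"].map(color_map)
df["clarity"] = df["clarity"].map(clarity_map)

# Min-max normalization scales each feature to [0, 1]; constant columns stay at 0
features = ["carat", "cut", "color", "clarity"]
feature_range = df[features].max() - df[features].min()
df[features] = (df[features] - df[features].min()) / feature_range.replace(0, 1)
print(df)
```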
2.5 Example: Data Collection from Indentation Testing
Material hardness testing by indentation is a multimodal data collection technique. Hardness testing consists of pressing a hardened tip into the surface of a material with a specified load and measuring the dimensions of the small indentation that is made. The indentation is generally very small and, as such, the test is considered non-destructive. Furthermore, since the resistance to surface indentation is related to the stress required to permanently deform the material, the measured hardness can often be correlated to other material properties, such as the ultimate tensile strength.
Indentation testing varies with sample size and shape, but the fundamental process remains the same. As shown in Fig. 2.8, the indenter is pressed into the surface of the material with a specified force and leaves a small impression. The hardness is determined by measuring the size of the indent for the applied load [10]. Macro-indentation is used to test large samples, with applied loads exceeding 1 kgf. Small samples are tested using micro-indentation, with applied loads ranging from 1 to 1000 gf. For even smaller scales, nanoindentation (also known as instrumented indentation) is used, with applied loads of less than 1 gf [11]. Common indenter tips (see Fig. 2.9) include hemispherical balls (used for the Brinell hardness test) and various pointed tips (used for the Vickers hardness test and nanoindentation test).
The load vs. indentation depth for a typical nanoindentation test is plotted in Fig. 2.10 [13]. The decrease in the indentation depth when the load is removed is determined by the elasticity of the material. The sample results in Fig. 2.10 show some elasticity since the final (or residual) depth, hr, is less than the maximum depth, hm. The net result is that the indenter tip leaves a permanent impression of depth hr in the surface of the material due to localized surface deformation [12]. The contact area of the indentation depends on the indentation depth and indenter shape. Figure 2.11 shows several indenter tips and the corresponding contact area equations, where d is the indentation depth [12].
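As a small numerical illustration, the sketch below evaluates the standard Vickers and Brinell hardness formulas from an applied load and a measured indent size. The expressions are the usual textbook forms (load in kgf, dimensions in mm), and the input numbers are illustrative rather than measurements from the tests described here.

```python
import math

def vickers_hardness(load_kgf, mean_diagonal_mm):
    """Vickers hardness HV from the applied load and the mean indent diagonal."""
    return 1.8544 * load_kgf / mean_diagonal_mm**2

def brinell_hardness(load_kgf, ball_diameter_mm, indent_diameter_mm):
    """Brinell hardness HB from the applied load, ball diameter and indent diameter."""
    D, d = ball_diameter_mm, indent_diameter_mm
    return 2 * load_kgf / (math.pi * D * (D - math.sqrt(D**2 - d**2)))

# Illustrative values (not measurements from the book)
print(f"HV = {vickers_hardness(10, 0.32):.0f}")        # ~181
print(f"HB = {brinell_hardness(3000, 10, 4.0):.0f}")   # ~229
```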
Fig. 2.8 (a) Indentation testing experimental set-up and (b) impression data (https://matmatch.com/learn/property/vickers-hardness-test)
Fig. 2.9 Experimental set-ups for (a) Brinell hardness test, (b) Vickers microhardness test, and (c) nanoindentation test [11, 12]
Fig. 2.10 Load vs. indentation depth data generated by nanoindentation testing
The indentation data can come from experiments and computer simulations (see Fig. 2.12). The experimental data is collected using imaging and sensing.
• Experimental data is obtained from the indentation test. Typically, indentation experiments record the load-displacement data.
– High-resolution atomic force microscopes (AFM) are used for imaging the indented surface. The surface fracture pattern provides critical information on the material deformation during the indentation process. Additionally, the contact area of the indenter can be measured from these high-resolution microscope images.
Fig. 2.11 Common indenter tips and corresponding contact area equations [12]
Fig. 2.12 Indentation data sources: (a) experiments, (b) imaging, (c) sensing using LVDT sensor, (d) computer simulation using FEM (a video is available in the E-book, Supplementary Video 2.1) [14]
  – Sensing is accomplished using various force and displacement measurements. A common displacement sensor is the Linear Variable Differential Transformer (LVDT), which measures the movement of the indenter shaft through electric voltage change and provides the load and displacement data. Other displacement measurement techniques include differential capacitors or optical sensors. Force can be measured through a spring-based force actuation system.
• Computer simulations are powerful tools to compute the load-displacement data and materials behavior. Computational simulation methods, such as the Finite Element Method (FEM), have been used extensively to compute mechanical properties of materials through indentation simulation. FEM is a well-known computer simulation method for computing deformation and stress given the geometry and the material properties. It has successfully replaced or augmented physical testing for many areas of engineering product development. Computer simulations of surface indentation can also provide valuable data for material characterization. A physical test result and a finite element computer simulation are shown in Fig. 2.13. With proper calibration, the two methods produce nearly identical triangular indentations in the material, and the simulation can be used to provide additional insight and data for the indentation process.

Machine learning databases often combine information from different modes of data collection and levels of fidelity. For example, Table 2.1 summarizes nanoindentation testing data for three different materials [15].
Fig. 2.13 Indentations produced by (a) physical nanoindentation experiment, and (b) finite element method computer simulation [7]

Table 2.1 Summary of nanoindentation testing data [15]
Material | Experiment | Computation
Al-6061 and Al-7075 alloys | 7 experiments each | 2D FEM (axisymmetric): 100 simulations each for conical indenter half angles of 50°, 60°, 70°, and 80°; 3D FEM: 15 simulations for Berkovich indenter
3D printed Ti-6Al-4V alloys (six samples) | 144 experiments for each sample | Not available
The aluminum alloy data consisted of 422 load-displacement curves (7 physical tests, 400 2D axisymmetric FEM simulations, and 15 3D FEM simulations). The data for the 3D printed Ti-6Al-4V material consisted of 864 load-displacement curves (144 experiments on each of six samples). The mix of experimental and computational data represents different modalities (or sources). In addition, the 2D and 3D FEM simulations also represent different levels of fidelity (or resolution). The 3D FEM simulations are more comprehensive but are computationally intensive. The 2D axisymmetric simulations assume that the indentation is axisymmetric but afford a much higher level of model refinement. Consequently, the fidelity of the 2D and 3D simulations must be understood within the context of the physical test being modeled.
2.6 Summary of Multimodal Data Generation and Collection
Mechanistic data science analysis frequently utilizes multimodal and multi-fidelity data as one data source rarely provides sufficient data to fully represent an engineering problem. Experimental data obtained through direct observation is considered the most reliable but may be too expensive or complicated to obtain. Limited experimental data may need to be supplemented with simulations or published experimental results. As a result, data scientists must identify, collect, and synthesize required information from a variety of modes and fidelities to solve engineering problems. This idea will be further developed in Chap. 3 Optimization and Regression.
References

1. Galileo (1638) The two new sciences
2. Siegfried T (2015) Getting a grip on gravity: Einstein's genius reconstructed science's perception of the cosmos. Science News
3. Badr W (2019) Top sources for machine learning datasets. Towards Data Science. https://towardsdatascience.com/top-sources-for-machine-learning-datasets-bb6d0dc3378b. Accessed 1 Sep 2020
4. Kumar M (2020) Global significant earthquake database from 2150BC. Kaggle [Online]. https://www.kaggle.com/mohitkr05/global-significant-earthquake-database-from-2150bc. Accessed 23 June 2020
5. Salgado CM, Azevedo C, Proença H, Vieira SM (2016) Noise versus outliers. Secondary analysis of electronic health records, pp 163–183
6. Jones MT (2018) Data, structure, and the data science pipeline. IBM Developer [Online]. https://developer.ibm.com/articles/ba-intro-data-science-1/. Accessed 1 Sep 2020
7. Liu M, Lu C, Tieu K et al (2015) A combined experimental-numerical approach for determining mechanical properties of aluminum subjects to nanoindentation. Sci Rep 5:15072. https://doi.org/10.1038/srep15072
8. SISO-REF-002-1999 (1999) Fidelity Implementation Study Group Report. Simulation Interoperability Standards Organization. Retrieved January 2, 2015
9. Lu L et al (2020) Extraction of mechanical properties of materials through deep learning from instrumented indentation. Proc Natl Acad Sci 117(13):7052–7062
10. Wikipedia (2020) Indentation hardness [Online]. https://en.wikipedia.org/wiki/Indentation_hardness. Accessed 8 Sep 2020
11. Broitman E (2017) Indentation hardness measurements at macro-, micro-, and nanoscale: a critical overview. Tribol Lett 65(1):23
12. VanLandingham MR (2003) Review of instrumented indentation. J Res Natl Inst Stand Technol 108(4):249–265
13. Nanoindentation. Nanoscience Instruments [Online]. https://www.nanoscience.com/techniques/nanoindentation/. Accessed 1 Sep 2020
14. Rzepiejewska-Malyska KA, Mook WM, Parlinska-Wojtan M, Hejduk J, Michler J (2009) In situ scanning electron microscopy indentation studies on multilayer nitride films: methodology and deformation mechanisms. J Mater Res 24(3):1208–1221
15. Extraction of mechanical properties of materials through deep learning from instrumented indentation. GitHub [Online]. https://github.com/lululxvi/deep-learning-for-indentation. Accessed 23 Jun 2020
Chapter 3
Optimization and Regression
Abstract Linear regression is the simplest method to build a relationship between input and output features. While many relationships are non-linear in science and engineering, linear regression is fundamental to understanding more advanced regression methods. In particular, gradient descent will be discussed as a technique with a wide range of applications. Key to understanding linear regression are concepts of optimization. In this chapter, the fundamentals of linear regression will be introduced, including least squares optimization through gradient descent. Extensions of linear regression to tackle some nonlinear relationships will also be discussed, including piecewise linear regression and moving least squares. The ease and strength of linear regression will be demonstrated through example problems in baseball and material hardness.

Keywords Linear regression · Least squares optimization · Coefficient of determination · Minimum · Gradient descent · Multivariable linear regression · Baseball · Indentation · Vickers hardness · Bacteria growth · Piecewise linear regression · Moving average · Moving least squares · Regularization · Cross-validation
3.1 Least Squares Optimization
Least squares optimization is a method for determining the best relationship between the variables making up a set of data. For example, if data is collected for measuring the shoe size and height of people, the raw data could be plotted on a graph, with shoe size on one axis and height on the other axis. The raw data may be interesting, but it would be more useful if a mathematical relationship could be found between these variables. If the data points fall approximately along a straight line, then a straight line can be drawn through the points. To determine the best placement for the straight line, least squares optimization (or linear regression) can be used.
The details of this method are shown in this chapter. If the points do not fall in approximately a straight line, then some form of nonlinear optimization must be used. This chapter will focus on piecewise linear regression, the moving average, and moving least squares optimization for nonlinear optimization.

The least squares method for optimization and regression was first published by Adrien-Marie Legendre (1805) and Carl Friedrich Gauss (1809) [1]. They were both studying the orbits of celestial bodies, such as comets and minor planets, about the sun based on observations. The term regression is a commonly used word for least squares optimization. The term was coined by Sir Francis Galton from his work in genetics in the 1800s. Galton was initially studying the genetics of sweet peas, in particular comparing the weights of planted and harvested peas. He found that when he plotted the data for the planted and harvested weights, the slope was less than unity, meaning that the offspring of the largest and smallest peas did not demonstrate the same extremes, but "regressed" to the mean. His concept of regression to the mean based on the evaluation of data on graphs led to the use of the term regression to describe the mathematical relationship developed based on data [2].
3.1.1 Optimization
Optimization is the process of finding the minimum (or maximum) value of a set of data or a function. This can be accomplished by analyzing extensive amounts of data and selecting the minimum (or maximum) value, but this is generally not practical. Instead, optimization is generally performed mathematically. A cost function, c(w), is written as a relation between variables of interest, and the goal of the optimization is to find the minimum (or maximum) value of the function over the range of interest. The minimum (or maximum) value of the function corresponds to the location where the tangent slope becomes zero. To find the tangent slope, the first derivative of the cost function is computed using differential calculus

\frac{dc(w)}{dw} = 0    (3.1)
Setting this derivative equal to zero leads to the value of w at the location of the minimum (or maximum) of the function. In general, a location where the first derivative equals zero is a potential minimum, maximum, or inflection point. The second derivative of the original function is used to distinguish between these three types of points:

\frac{d^2 c(w)}{dw^2} > 0, \quad \text{convex (minimum)}    (3.2a)

\frac{d^2 c(w)}{dw^2} < 0, \quad \text{non-convex (maximum)}    (3.2b)

\frac{d^2 c(w)}{dw^2} = 0, \quad \text{(inflection point)}    (3.2c)
Example: Consider the following function

c(w) = 2w^2 + 3w + 4    (3.3)

where c(w) is a quadratic function of the independent variable w (see blue curve in Fig. 3.1). The minimum of this quadratic equation is the point where the slope tangent to the curve is horizontal. The slope is computed using differential calculus to find the first derivative as

\frac{dc}{dw} = 4w + 3 = 0    (3.4)

Setting this equation for the first derivative equal to zero and solving for w gives w = -3/4 as the location of zero slope (the minimum point), as shown in Fig. 3.1.
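As a quick numerical check (not part of the original text), the same minimum can be located by evaluating c(w) on a dense grid; the short Python sketch below confirms that the smallest value occurs near w = -0.75.

import numpy as np

w = np.linspace(-3.0, 2.0, 100001)   # fine grid of candidate w values
c = 2.0*w**2 + 3.0*w + 4.0           # cost function c(w) = 2w^2 + 3w + 4
w_min = w[np.argmin(c)]              # grid point with the smallest cost
print(w_min)                         # approximately -0.75, matching w = -3/4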
Fig. 3.1 A function (blue line) and its derivative (red line)
Fig. 3.2 Non-convex function with multiple local minima
For this function, with the minimum located at w = -3/4, the second derivative is

\frac{d^2 c(w)}{dw^2} = 4, \quad \text{minimum}    (3.5)
Convex functions are preferred for optimization problems because the optimization converges to a solution much more easily. Non-convex functions can have multiple minima, maxima, and inflection points. The global minimum is defined as the absolute minimum across the span of interest. As shown in Fig. 3.2, finding a global minimum for a non-convex function is challenging because a local minimum can be chosen erroneously instead of the global minimum.
3.1.2 Linear Regression
A straight line can easily be drawn through two data points, and a simple linear expression can be written as

y = w_1 x + w_0    (3.6)

where w_1 = (y_2 - y_1)/(x_2 - x_1) is the slope of the line and w_0 is the location where the line crosses the y-axis. If there are more than two points and the points are not all aligned, it is obviously not possible to draw a straight line through all the data points. This common challenge often arises where the data points generally lie along a straight path, but it is not possible to fit a straight line through all the data points. One option is to try to draw a complicated curve through all the data points, but this option is generally not preferred due to the complex mathematics of such a curve. Instead, a straight line can be drawn through the data in such a way that it is close to as many points as possible. The process of determining the best fit of a straight line to the data is called linear regression (Fig. 3.3).
Fig. 3.3 Straight line through two data points
Linear regression is used to model a best-fit linear relationship between variables by fitting a linear equation to observed data. Consider a set of N data points

(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)    (3.7)

A generic linear equation through these points can be written as

\hat{y}_n = w_0 + w_1 x_n, \quad n = 1, \ldots, N

where the coefficients w_0, w_1 are called weights (the constant weight w_0 is often called the bias) and \hat{y}_n is the computed approximation of the "true" value y_n. The best fit approximation for y_n is found through linear regression by determining the optimum values for the weights and bias based on the data. Note that when using regression for developing mechanistic data science models, it is important to assess the quality of the model through cross-validation, which will be introduced in Sect. 3.3.
3.1.3 Method of Least Squares Optimization for Linear Regression
The optimum values for the coefficients w_0, w_1 are determined by minimizing the total error between the computed approximate value, \hat{y}_n, and the "true" value, y_n. Each data point, x_n, is multiplied by the weight, w_1, and the bias, w_0, is added to it to form the approximation, \hat{y}_n, of the "true" value, y_n. The approximation error at data point n in this linear regression model can be evaluated by subtracting the true value y_n and squaring the difference. The total error is found by repeating this for all data points

\text{total error} = \sum_{n=1}^{N} (\hat{y}_n - y_n)^2 = \sum_{n=1}^{N} (w_0 + w_1 x_n - y_n)^2    (3.8)
The best fit line will be the one which minimizes the total error and is determined by performing a least squares optimization. This optimization begins with a cost function, c(w_0, w_1), which is the average of the total error for all data points

\text{cost function} = c(w_0, w_1) = \frac{1}{N} \sum_{n=1}^{N} (w_0 + w_1 x_n - y_n)^2    (3.9)
The weights and bias for a best fit line are determined by finding the minimum (or maximum) of a function or a set of data. In this case, the goal is to minimize the total error. The minimum value can be determined experimentally by collecting an extensive amount of data and selecting the lowest overall value, but that is generally not practical. Instead, the minimum value is usually determined mathematically using a functional relationship between variables of interest.
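To make this concrete, the following Python sketch (an illustration with made-up (x, y) values, not data from the book) finds the weight and bias that minimize the cost function of Eq. (3.9) by solving the corresponding least squares problem with NumPy.

import numpy as np

# Hypothetical (x, y) data that lie roughly along a line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

N = len(x)
X = np.column_stack([np.ones(N), x])        # design matrix with columns [1, x_n]
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes sum (w0 + w1*x_n - y_n)^2, same minimizer as Eq. (3.9)
w0, w1 = w
print(w0, w1)                               # bias and slope of the best-fit line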
3.1.4 Coefficient of Determination (r2) to Describe Goodness of Fit
Goodness of fit describes how closely the data points, yi, are to the line drawn through them. If the line goes through the points like Fig. 3.3, the fit is perfect. If the points do not lie directly on the line but are generally evenly clustered along the length of the line like Fig. 3.4, the fit shows a linear relationship (adequate precision of the data is generally dependent on the application). Conversely, if the points do not evenly cluster along the length of the line, there may be no correlation between the variables or a nonlinear correlation between the variables. A common way to quantify goodness of fit is by the coefficient of determination, r2
Fig. 3.4 Linear regression through non-colinear points
r^2 = \frac{\text{regression sum of squares}}{\text{total sum of squares}} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}    (3.10)

where \bar{y} is the average of the data points y_i. A good fit of the regression to the data will result in an r2 value closer to one.
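The coefficient of determination in Eq. (3.10) can be computed directly from the observed values and the regression predictions; the short sketch below uses illustrative numbers rather than data from the text.

import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])            # observed values (illustrative)
y_hat = np.array([2.06, 4.02, 5.98, 7.94, 9.90])   # values predicted by a fitted line (illustrative)
y_bar = y.mean()

ss_reg = np.sum((y_hat - y_bar)**2)   # regression sum of squares
ss_tot = np.sum((y - y_bar)**2)       # total sum of squares
r2 = ss_reg / ss_tot
print(r2)                             # a value near 1 indicates a good fit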
3.1.5 Multidimensional Derivatives: Computing Gradients to Find Slope or Rate of Change
As shown above, computing the slope or the rate of change is important for optimization problems. Gradient is the slope or rate of change in a particular direction. For one-dimensional problems, determining the rate of change is trivial, but for problems involving multiple variables, determining the slope is more challenging since the slope can be different in each direction. The rate of change is the amount one variable changes when one or more other variables change. Consider the mountain in Fig. 3.5 below. If a skier follows the red-dotted path, a short amount of forward motion will result in a big vertical drop (and more speed) as the skier goes from red dot to red dot. On the other hand, if a
Fig. 3.5 Ski mountain with two possible paths. The red-dotted path is steeper than the yellow-dotted path. (Photo courtesy of Rebecca F. Boniol.) A video is available in the E-book, Supplementary Video 3.1
skier follows the yellow-dotted path, the vertical change with respect to the forward motion is less going from yellow dot to yellow dot. It can be seen that because of this, the route from the top of the mountain to the orange X is shorter and more direct when following the red-dotted path. The rate of change, or slope, can be determined on an average sense (vertical change from the top of the mountain to the orange X relative to the horizontal change in position) or instantaneously (different slope at every red and yellow dot). If measured data are used, the instantaneous rate of change is estimated by dividing the change in vertical height by the change in horizontal distance. If an equation is available, the instantaneous rate of change can be computed mathematically by taking the derivative of the equation for the hill in the direction of travel. More generally, a coordinate system is defined and partial derivatives with respect to each of the coordinates are taken. This set of partial derivatives forms a vector and is the mathematical definition of the gradient. Once the expression for the gradient is computed, the gradient in any specific direction can be computed. The gradient of a cost function is needed to perform a multivariate optimization. For the ski mountain in Fig. 3.5, the optimization values can be visualized, with the maximum corresponding to the top of the mountain and the minimum corresponding to the bottom of the mountain. Optimization can be performed mathematically if a functional relationship is available. Using the mathematical function, the global minima for higher dimension functions requires defining the gradient of the function. The gradient is the derivative
of the function in multiple directions. Consider a vector w containing all the independent variables or features w_0, \ldots, w_S

w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_S \end{bmatrix}    (3.11)

and a scalar function c(w) made of these variables. The gradient of the scalar function c(w) is a vector composed of the partial derivatives (derivatives taken with respect to one variable while holding all other variables constant) with respect to each variable or feature:

\frac{\partial c(w)}{\partial w} = \nabla c(w) = \begin{bmatrix} \dfrac{\partial c(w)}{\partial w_0} \\ \dfrac{\partial c(w)}{\partial w_1} \\ \vdots \\ \dfrac{\partial c(w)}{\partial w_S} \end{bmatrix}    (3.12)

where the symbol \nabla is shorthand notation for the gradient and the symbol \partial denotes the partial derivative.

Example: Consider the following function with two variables, w_0 and w_1

c(w) = (w_0)^2 + 2(w_1)^2 + 1    (3.13)
The independent variable, w, and the gradient of the function, c(w), can be written in matrix form as

w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}    (3.14a)

\nabla c(w) = \begin{bmatrix} \dfrac{\partial c(w)}{\partial w_0} \\ \dfrac{\partial c(w)}{\partial w_1} \end{bmatrix} = \begin{bmatrix} 2w_0 \\ 4w_1 \end{bmatrix}    (3.14b)
The point w^* where the gradient equals zero is found by setting the gradient equal to zero

\begin{bmatrix} 2w_0 \\ 4w_1 \end{bmatrix} = 0 \;\rightarrow\; w^* = \begin{bmatrix} 0 \\ 0 \end{bmatrix}    (3.15)
3.1.6 Gradient Descent (Advanced Topic: Necessary for Data Science)
Computing the minimum or maximum of a function by setting the gradient equal to zero works well for basic functions but is not efficient for higher order functions. Gradient descent is a more efficient way to determine the minimum of a higher order function. The minimum of a cost function, c(w), can be determined through an explicit update algorithm. The process begins by writing an explicit update equation for the independent variable, w, as

w^{k+1} = w^k - \alpha \frac{dc(w^k)}{dw}    (3.16)

where w^{k+1} is the value of w to be computed at the next step, w^k is the current value of w, and \frac{dc(w^k)}{dw} is the derivative of the cost function evaluated at the current step. The parameter \alpha is a user-defined learning rate. This is illustrated using the function plotted in Fig. 3.6. The gradient descent algorithm consists of the following five steps:

1. Select an arbitrary starting point w^0.
2. Find the derivative of the cost function c(w) at w^0.
3. Descend to the next point through the gradient descent equation w^1 = w^0 - \alpha \frac{dc(w^0)}{dw}.
4. Repeat the process for the next point w^2 = w^1 - \alpha \frac{dc(w^1)}{dw}.
5. Continue doing so until the minimum is reached (i.e., negligible change in w) (Fig. 3.7).

Example: Consider the function g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right).

1. Start at w^0 = 2 and use \alpha = 1.
2. Take the derivative \frac{dg(w)}{dw} = \frac{1}{50}\left(4w^3 + 2w + 10\right) and evaluate it at the current position of w, \frac{dg(2)}{dw}.
3. Use the gradient descent formula to calculate w at the next step.
Fig. 3.6 Gradient descent methodology. A video is available in the E-book, Supplementary Video 3.2
Fig. 3.7 Gradient descent example for g(w) = \frac{1}{50}(w^4 + w^2 + 10w)
w^1 = w^0 - \alpha \frac{dg(w^0)}{dw} = 2 - 1 \cdot \frac{dg(2)}{dw} = 1.08

4. Repeat the process for the next step, w^2 = w^1 - \alpha \frac{dg(w^1)}{dw} = 1.08 - 1 \cdot \frac{dg(1.08)}{dw} = 0.736.
5. Continue until the minimum is reached at g(w) = -0.170.
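The worked example can be reproduced with a short Python implementation of the update rule in Eq. (3.16); this sketch is an illustration and is not code from the book.

def g(w):
    return (w**4 + w**2 + 10.0*w) / 50.0

def dg(w):
    return (4.0*w**3 + 2.0*w + 10.0) / 50.0   # derivative of g(w)

w = 2.0        # starting point w0
alpha = 1.0    # learning rate
for k in range(200):
    w_new = w - alpha * dg(w)        # gradient descent update, Eq. (3.16)
    if abs(w_new - w) < 1e-6:        # stop when the change in w is negligible
        break
    w = w_new

print(w, g(w))   # the first iterates are 1.08 and 0.736; convergence is near w = -1.23, g(w) = -0.170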
When using the gradient descent method in higher dimensions, the explicit update formula is written as

w^{k+1} = w^k - \alpha \nabla c(w^k)    (3.17)

where the univariate derivative has been replaced with the gradient and the independent variable w is now a vector. The gradient descent steps in multiple dimensions are:

1. Start at an arbitrary vector w^0.
2. Find the gradient of the function c(w) at w^0.
3. Descend to the next point using the gradient descent equation w^1 = w^0 - \alpha \nabla c(w^0).
4. Repeat the process for w^2 = w^1 - \alpha \nabla c(w^1).
5. Continue until the minimum is reached (negligible change in w).
3.1.7 Example: "Moneyball": Data Science for Optimizing a Baseball Team Roster
Baseball is a game in which tradition is strong and data and statistics carry great weight. This allows baseball fans to compare the careers of Ty Cobb in 1911 to Pete Rose in 1968 (or anyone else for that matter). Historically, the worth of a player was largely dictated by their batting average (how many hits compared to how many times batting) and runs batted in (how many runners already on base were able to score when the batter hit the ball). However, through the use of data science, a new trend emerged (Fig. 3.8).

The game of baseball is played with 9 players from one team in the field playing defense (Fig. 3.8). A pitcher throws a baseball toward home plate where a batter standing next to home plate tries to hit the ball out into the field and then run to first base. If the batter hits the ball and makes it to first base before the ball is caught or picked up and thrown to first base, then the batter is awarded a hit and allowed to stay on the base, becoming a base runner. If the ball is caught in the air, picked up and thrown to first base before the batter arrives, or the batter is tagged when running to first base, then the batter is out. The runner can advance to second, third, and home bases as other batters get hits. When the runner reaches home base, the team is awarded a run. The batting team continues to bat until they make three outs, at which point they go out to the field and the team in the field goes to bat.
Fig. 3.8 Baseball field (https://entertainment.howstuffworks.com/baseball2.htm)
As mentioned previously, batting average (BA) and runs batted in (RBI) have traditionally been very important statistics for baseball teams to evaluate the worth of a player. Players with high BAs and high RBIs were paid very large salaries by the richest teams (usually large market teams), and the small market teams had trouble competing. In 2002, Billy Beane, the general manager of the Oakland Athletics, utilized data science to build a competitive team. Although Major League Baseball (MLB) generates around $10 billion in annual revenue, the smaller market MLB teams have much lower budgets with which to recruit and sign players, and the Oakland A's were a small market team without a large budget for player salaries. Beane and his capable data science assistant, Paul DePodesta, analyzed baseball data from previous seasons and determined that they needed to win 95 games to make the playoffs. To achieve this goal, they estimated they needed to score 133 more runs than their opponents. The question they had to answer was: what data should they focus on? To build a competitive team, Beane and DePodesta looked at a combination of a player's on-base percentage (OBP), which is the percentage of the time a batter reaches base, and the slugging percentage (SLG), which is a measure of how many bases a batter is able to reach for a hit. In formulaic terms: SLG = (1B + 2B × 2 + 3B × 3 + HR × 4)/AB, where 1B, 2B, and 3B are first, second, and third base, respectively, HR is a
"home run", and AB is an "at bat". Through these two measures, it is possible to assess how often a player is getting on base in any possible way (and thus in a position to score) and how far they go each time they hit the ball. It is possible to show through linear regression that SLG and OBP provide a good correlation with runs scored (RS). Using a moneyball baseball dataset available from Kaggle (https://www.kaggle.com/wduckett/moneyball-mlb-stats-19622012/data), a regression analysis was performed to compare the number of runs scored as a function of the batting average, and then as a function of the on base percentage and slugging percentage. A sampling of the moneyball data used for the analysis is shown below (Table 3.1).

A linear regression analysis was first performed on RS vs. BA. The results, which are plotted in Fig. 3.10, showed that the correlation between RS and BA was only 0.69. BA was deemed a marginally useful statistic because it does not account for players hitting singles versus home runs and does not account for players getting on base by walks or being hit by a pitch. By contrast, a linear regression between RS and OBP shows a correlation of 0.82. OBP accounts for all the ways a player can get on base, and as such, provides a more meaningful measure of the number of runs scored than does the batting average. Finally, a multivariate linear regression was performed for RS vs. OBP and SLG. The results of this linear regression showed a correlation of 0.93, meaning that OBP combined with SLG provided a better indicator of run-scoring performance than BA or OBP by itself. It should be noted that the linear combination of OBP and SLG is called On-base Plus Slugging (OPS), and it is a commonly used baseball statistic in the game today (OPS = OBP + SLG). With this measure, the amount of time a player reaches base is accounted for as well as how many bases they are able to reach when they do get on base.
3.1.7.1 Moneyball Regression Analysis Steps
Step 1: Multimodal Data Generation and Collection
Baseball statistics are readily available. One such database is in Kaggle (a sample of data is shown in Table 3.1).
Step 2: Feature Engineering
Various features are present in baseball sports analytics. However, we will restrict attention to team-averaged stats for a few indicators: runs scored (RS), wins (W), on base percentage (OBP), slugging percentage (SLG), and on base plus slugging (OPS).
Table 3.1 A sample of baseball statistical data from Kaggle (https://www.kaggle.com/wduckett/moneyball-mlb-stats-19622012/data)
Team | League | Year | RS | RA | W | OBP | SLG | BA | Playoffs | Rank playoff | Rank season | G | OOBP | OSLG
ARI | NL | 2012 | 734 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | | | 162 | 0.317 | 0.415
ATL | NL | 2012 | 700 | 600 | 94 | 0.320 | 0.389 | 0.247 | 1 | 5 | 4 | 162 | 0.306 | 0.378
BAL | AL | 2012 | 712 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 4 | 5 | 162 | 0.315 | 0.403
BOS | AL | 2012 | 734 | 806 | 69 | 0.315 | 0.415 | 0.260 | 0 | | | 162 | 0.331 | 0.428
CHC | NL | 2012 | 613 | 759 | 61 | 0.302 | 0.378 | 0.240 | 0 | | | 162 | 0.335 | 0.424
CHW | AL | 2012 | 748 | 676 | 85 | 0.318 | 0.422 | 0.255 | 0 | | | 162 | 0.319 | 0.405
RS corresponds to how much a team scores. W is the number of games a team wins in a season. OBP corresponds to how frequently a batter reaches base. SLG corresponds to the total number of bases a player records per at-bat. OPS is the sum of OBP and SLG.
Step 3: Dimension Reduction
Reduce the dimension of the problem by only considering the aforementioned stats/features.
Step 4: Reduced Order Modeling
Reduce the order of the model by assuming it is linear for the purposes of this demonstration.
Step 5: Regression and Classification
Use regression to determine model parameters and decide whether a linear hypothesis is adequate and offers any insight.
Step 6: System and Design
Use OBP and SLG to predict performance (RS) and, therefore, potential recruitment.

Returning to the baseball example, it is possible to find a relationship between OBP and RS (Fig. 3.9). Use the cost function

c(w) = \frac{1}{N} \sum_{n=1}^{N} (w_0 + w_1 x_n - y_n)^2    (3.18a)

c(w) = \frac{1}{N} \left[ (w_0 + w_1 \cdot 0.327 - 691)^2 + (w_0 + w_1 \cdot 0.341 - 818)^2 + \ldots \right]    (3.18b)

and the gradient descent equation

w^k = w^{k-1} - \alpha \nabla c(w^{k-1})    (3.19)
to find the optimal weights w for the model. Figure 3.10 shows the regression model results for RS vs. BA, OBP, SLG, and OPS. There is a good correlation between the runs scored and the on base
Fig. 3.9 Sample data for runs scored (RS) and on base percentage (OBP)
Fig. 3.10 Moneyball analysis results for runs scored (RS) vs. four different statistics: batting average (BA), on base percentage (OBP), slugging percentage (SLG), and on base plus slugging (OPS)
percentage. The correlation can be quantified by the r2 value, with r2 closer to 1.0 indicating a better correlation of the linear regression to the data. Figure 3.10 also shows the regressions for RS vs. SLG and RS vs. BA. Historically, BA was used by baseball scouts to recruit potential players. Billy Beane was correct in using OBP
Fig. 3.11 Regression analysis results for wins (W) vs. four different statistics: batting average (BA), on base percentage (OBP), slugging percentage (SLG), and on base plus slugging (OPS)
and SLG (or a combination of them) instead of BA to do his recruitment, as these statistics have a better correlation with RS. One other logical question is why not perform the analysis based on wins (W) instead of just runs scored (RS), since that is the ultimate metric. As can be seen in Fig. 3.11, the correlation between W and any of these statistics is not good. These are only offensive statistics and do not account for pitching and defense, which are other important parts of winning baseball games.

The regression analysis performed thus far has used one statistical variable at a time. It is possible to perform linear regression using multiple variables, such as performing a linear regression of RS versus both OBP and SLG. Using two variables for linear regression will result in a planar fit through the data instead of a straight line (Fig. 3.12). Note that the dependent variable y_n (runs scored) depends on a vector of independent variables x_n (baseball statistics such as OBP and SLG)

(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)    (3.20)

For a multivariate linear regression, the model equation is

w_0 + w_1 x_{1,n} + w_2 x_{2,n} + \ldots + w_S x_{S,n} \approx y_n \quad \text{for } n = 1, \ldots, N    (3.21)
Fig. 3.12 Sample data for runs scored (RS), on base percentage (OBP), and slugging percentage (SLG)
Note that for this specific problem of performing a linear regression of RS with OBP and SLG, the above equation reduces to

w_0 + w_1 \, OBP_n + w_2 \, SLG_n \approx RS_n \quad \text{for } n = 1, \ldots, N    (3.22)
However, for generality, the arbitrary form is still used. Use the two following vectors to compact the equation:

\hat{x}_n = \begin{bmatrix} 1 \\ x_{1,n} \\ x_{2,n} \\ x_{3,n} \\ \vdots \\ x_{S,n} \end{bmatrix}, \quad w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \\ \vdots \\ w_S \end{bmatrix}    (3.23)
The model can be written in matrix notation as

\hat{x}_n^T w \approx y_n \quad \text{for } n = 1, \ldots, N    (3.24)

After summing the squared differences and dividing by the number of points, the following cost function is obtained

c(w) = \frac{1}{N} \sum_{n=1}^{N} \left( \hat{x}_n^T w - y_n \right)^2    (3.25)
The gradient of the cost function is

\nabla c = \frac{2}{N} \sum_{n=1}^{N} \hat{x}_n \left( \hat{x}_n^T w - y_n \right)    (3.26)

Using the gradient descent method to find the optimal weights leads to

w^k = w^{k-1} - \alpha \nabla c(w^{k-1})    (3.27a)

w^k = w^{k-1} - \alpha \frac{2}{N} \sum_{n=1}^{N} \hat{x}_n \left( \hat{x}_n^T w^{k-1} - y_n \right)    (3.27b)

Applying the gradient descent method to determine the weights and bias results in

RS = -803 + 2729 \cdot OBP + 1587 \cdot SLG    (3.28)
with r2 = 0.93 (Fig. 3.13). It is interesting to note that this result is essentially identical to the linear regression result between RS and OPS shown in Fig. 3.10. The OPS variable is the On Base Plus Slugging, and as the name implies, it is equal to the sum of the OBP and the SLG. As such, the results are the same when the analysis is performed either way.
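A minimal NumPy sketch of this multivariable fit is shown below. It assumes the same baseball.csv file used in Fig. 3.24 (with columns Year, OBP, SLG, and RS); the learning rate and iteration count are illustrative choices, not values from the text.

import numpy as np
import pandas as pd

df = pd.read_csv('baseball.csv', sep=',').fillna(0)   # same Kaggle file as in Fig. 3.24
df = df.loc[df.Year < 2002]

X = np.column_stack([np.ones(len(df)), df.OBP.values, df.SLG.values])  # columns [1, OBP, SLG]
y = df.RS.values.astype(float)

# Direct least squares solution (same minimizer as Eqs. (3.25)-(3.27))
w_direct, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_direct)          # approximately [-803, 2729, 1587], as in Eq. (3.28)

# Gradient descent version of Eq. (3.27b)
w = np.zeros(3)
alpha = 0.5              # illustrative learning rate
for k in range(100000):
    grad = (2.0/len(y)) * X.T @ (X @ w - y)   # gradient from Eq. (3.26)
    w = w - alpha * grad
print(w)                 # approaches w_direct for a small enough alpha and enough iterations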
Fig. 3.13 Moneyball analysis results in 3D plot format for RS vs OBP and SLG
As a result of these various regression models, it can be seen that Billy Beane’s hypothesis of evaluating players based on their OBP and SLG is more accurate in terms of their offensive potential to score runs. Using these data science techniques, Beane’s Oakland A’s were able to win 103 games in 2002 (including a record-setting 20-game win streak), finish in first place, and make the playoffs. Today, OPS and OBP and SLG are some of the most closely watched baseball statistics by baseball insiders and fans alike.
3.1.8 Example: Indentation for Material Hardness and Strength
Stress (σ) is the distribution of force in a material with a load applied to it. Strain (ε) is the relative displacement of a material that results from an applied load. For a uniform rod of material being pulled in tension, the stress is equal to the applied force divided by the cross-sectional area, and the strain is equal to the change in length divided by the original length. A plot of the stress versus strain demonstrates some useful relationships for material performance when loaded.

To understand the relevance of stress in material deformation, picture this scenario: two people of equal weight step on your foot, one wearing sneakers and the other wearing heels. Which case would hurt most? Getting stepped on with heels will hurt more since the force (equal in both cases) is concentrated on a smaller area (the heel piece), therefore producing a higher stress.

An example of a stress vs. strain curve for a metal is shown in Fig. 3.14. At first the sample is not loaded (point a). The initial linear portion of the curve (b), whose slope is known as the Young's modulus, is the elastic, or recoverable, part of the curve: when the load is removed, both the stress and strain return to zero in this region. The yield strength
Fig. 3.14 Typical stress vs. strain curve for a metal
(c) is at the upper end of the linear portion of the curve and defines the onset of permanent (or plastic) deformation (a nonlinear function between stress and strain/state variables S), covering points d–g: when loaded beyond this point and then unloaded, some permanent shape change will occur. In engineering metals the yield strength is calculated by offsetting the Young's modulus line by 0.2 percent. The peak of the curve is the ultimate tensile strength (e), which is the stress when the peak load is applied. After the ultimate tensile strength is the region where necking (or visible localized deformation) occurs (f), at which point some damage D occurs, and finally the fracture point (g). Note that the parts of the curve after the yield point are generally not linear but can be approximated as piecewise linear.

Indentation is an experimental method to measure the hardness of a material, or its resistance to plastic deformation. This test is done using an indenter of a pre-determined shape, such as hemispherical or diamond-shaped. The indenter is pressed into the surface of a material with a specified force. One such hardness test is the Vickers Hardness (HV) test, which uses a diamond-shaped indenter. After the indentation mark is made, the average diagonal distance is measured and the Vickers Hardness is computed as

HV = \frac{F}{A} = \frac{F}{\dfrac{d^2}{2 \sin(68°)}} = 1.8544 \, \frac{F}{d^2}    (3.29)

where F is the applied force, A is the surface area of the indentation, and d is the average diagonal dimension of the diamond-shaped indentation. As shown in Fig. 2.10, the Vickers indenter is diamond-shaped, with the faces making a 68° angle from the indentation axis. It has been shown that for some materials, HV and ultimate tensile strength (TS) are well correlated.
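For example, Eq. (3.29) can be evaluated directly in a few lines of Python; the load and diagonal values below are hypothetical, with the load in kgf and the diagonal in mm so that HV is in the conventional kgf/mm² units.

def vickers_hardness(F_kgf, d_mm):
    """Vickers hardness from Eq. (3.29): HV = 1.8544 * F / d^2."""
    return 1.8544 * F_kgf / d_mm**2

print(vickers_hardness(10.0, 0.2))   # a 10 kgf load and a 0.2 mm average diagonal give HV of about 464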
3.1.9 Example: Vickers Hardness for Metallic Glasses and Ceramics
Vickers hardness measurements were reported for different materials by Zhang et al. [3]. Some of the representative values for metallic glasses are shown in Fig. 3.15. The Vickers hardness vs. ultimate tensile strength for metallic glasses and ceramics is shown in Fig. 3.16. Inspection of the data shows that the measurements for the metallic glasses are generally oriented in a linear pattern, but the measurements for the ceramic materials are much more scattered. Regression analysis using least squares optimization is performed on the data for the metallic glasses and the ceramics. Results show that a linear relationship works very well for metallic glasses (r2 = 0.949), but not for ceramics (r2 = 0.0002) for this data set (see Fig. 3.16). The difference in indentation results between these two types of materials is to be expected. Indentation is measuring the amount of force for local
Fig. 3.15 Representative metallic glass material data from Zhang et al. [3]
Fig. 3.16 Vickers Hardness (HV) vs ultimate tensile strength for metallic glass and ceramic materials
plastic flow on the surface. Metallic materials will demonstrate this type of deformation, but ceramics are generally brittle and will experience surface cracking and fragmentation instead of plastic deformation.
3.2 Nonlinear Regression
A simple straight line relationship often does not exist between two variables. In these cases, it is necessary to employ some form of regression capable of building a nonlinear relationship, such as piecewise linear regression analysis, moving average analysis, or a moving least squares regression analysis.
3.2.1 Piecewise Linear Regression
Piecewise linear regression is one of the most basic nonlinear regression techniques since it consists of subdividing a set of nonlinear data into a series of segments that are approximately linear. Once that is done, a linear regression can be done on these sections one by one. This can be illustrated by planning a route on a map. As shown in Fig. 3.17, if one were to plot a route from Chicago to Los Angeles, there is not a straight route to follow. Instead, there are roads filled with curves that go west for a long way, then roads that traverse in a west southwest direction for a long way, and finally roads that go in a southwest orientation for the remainder of the route. If a
Fig. 3.17 Map from Chicago to Los Angeles with piecewise linear route overlaid
Fig. 3.18 Piecewise linear regression through data
person wanted to estimate the distance, the route could be broken into several linear segments (three in this case) to quickly estimate travel distance and compare routes.

A global piecewise linear regression equation can be developed as a series of line segments with continuity at the common points. Consider the following set of data with two straight lines in Fig. 3.18 fit through different parts of the data

\hat{y}_n = w_0 + w_1 x_n, \quad x < c_1
\hat{y}_n = w_2 + w_3 x_n, \quad x > c_1

If the first (red) equation applies to the line left of point c_1 and the second (blue) equation applies to the line right of c_1, then at the common point x = c_1

w_0 + w_1 c_1 = w_2 + w_3 c_1    (3.30)

which leads to

\hat{y}_n = w_0 + w_1 x_n, \quad x < c_1
\hat{y}_n = w_0 + (w_1 - w_3) c_1 + w_3 x_n, \quad x > c_1    (3.31)

Using these equations, a global cost function can be written as

c(w) = \frac{1}{N_1} \sum_{n=1}^{N_1} (\hat{y}_n - y_n)^2 + \frac{1}{N_2 - N_1} \sum_{n=N_1+1}^{N_2} (\hat{y}_n - y_n)^2    (3.32)
which can be used for a least squares optimization for linear regression as discussed earlier in this chapter.
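Because Eq. (3.31) is linear in the unknown weights once the breakpoint c1 is fixed, the fit can be computed with ordinary least squares. The sketch below is an illustration using synthetic data and an assumed (known) breakpoint, not an example from the book.

import numpy as np

# Synthetic data with a change in slope at x = 5
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = np.where(x < 5.0, 1.0 + 2.0*x, 11.0 - 1.0*(x - 5.0)) + rng.normal(0.0, 0.3, x.size)

c1 = 5.0                 # assumed (known) breakpoint
left = x < c1
# Columns multiply w0, w1, w3 in Eq. (3.31); continuity at c1 is built into the design matrix
A = np.column_stack([np.ones_like(x),
                     np.where(left, x, c1),          # coefficient of w1
                     np.where(left, 0.0, x - c1)])   # coefficient of w3
w0, w1, w3 = np.linalg.lstsq(A, y, rcond=None)[0]
print(w0, w1, w3)        # close to 1, 2, -1 for this synthetic example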
3.2.2 Moving Average
A moving average provides a good method to smooth out data and mute the effects of spikes in the data. This is a popular method for analyzing trends with stock prices in order to smooth out the effects of day to day movement of the stock price. As shown in Fig. 3.19, the price of the S&P 500 stock index goes up and down on a daily basis, but the overall trend is upward for the time period shown. As such, if a financial analyst wanted to evaluate the long term performance of the stock, a moving average provides a good tool for doing so. In addition, evaluating different moving averages can also provide insight into stock trends. In Fig. 3.19, the 50 and 200 day moving averages smooth out the data to show the overall trend. The 200-day can also act as a “floor” or lower limit—buying opportunities exist when the price drops down to that level or below. A basic form of a moving average is a simple moving average, which is computed by summing a quantity of interest over a range and dividing by the number of samples
Fig. 3.19 S&P 500 stock index price with two rolling averages (50 and 200 days) overlaid. The spread between these moving averages provides insight into stock trends
import pandas_datareader.data as web   # assumed import providing the DataReader call below
import matplotlib.pyplot as plt

start = '2016-01-01'
df2 = web.DataReader('^GSPC', 'yahoo', start)
df2.to_csv('gspc.csv')
df2['Close'].plot()
df2['Close'].rolling(50).mean().plot()
df2['Close'].rolling(200).mean().plot()
plt.legend(['Daily close', '50-day moving average', '200-day moving average'])
plt.ylabel('Price ($)')
plt.grid()
plt.show()

Fig. 3.20 Sample Python code for rolling average in Fig. 3.19
SMA_c = \frac{P_c + P_{c-1} + \ldots + P_{c-k}}{k}    (3.33)
where P_c is the stock price at time c and k is the number of points used for the averaging. It should be noted that performing a moving average on one variable of a plot makes the averaged curve appear shifted relative to the original data. To prevent this effect, the averaged value can be plotted at the midpoint of the averaging window of the independent variable instead of at its endpoint (Fig. 3.20).
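Equation (3.33) can be implemented in a couple of lines; the sketch below (illustrative, not from the text) follows the common convention of averaging exactly k points.

import numpy as np

def simple_moving_average(prices, k):
    """k-point simple moving average; each output averages the k most recent prices."""
    kernel = np.ones(k) / k
    return np.convolve(prices, kernel, mode='valid')

prices = np.array([10.0, 11.0, 10.5, 12.0, 12.5, 13.0, 12.0])   # made-up price series
print(simple_moving_average(prices, 3))                         # [10.5, 11.17, 11.67, 12.5, 12.5]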
3.2.3 Moving Least Squares (MLS) Regression
Moving least squares (MLS) regression is a technique to perform regression analysis on data that does not necessarily demonstrate a linear relationship between the variables. The technique is similar to the least squares linear regression already shown, but the inclusion of a weight function allows it to be performed point by point with the data weighted to the point being evaluated. In other words, the regression is “bent” to the data by only using a few points at a time. The weight function results in a localized point-by-point least square fit instead of a global least squares fit as shown previously. Common weight functions include bell-shaped curves such as cubic splines and truncated Gaussian functions. An MLS curve fit through the Apple stock data is shown in Fig. 3.21. The original data (plotted with the blue curve) consisted of 252 data points. An MLS approximation was performed using only 45 evenly spaced points. The weight function used was a cubic spline with a coverage radius of 3 (approximately 3 points on each side of the point of interest were involved in each calculation). The results show that the MLS approximation with this set of parameters is able to accurately capture the trends of the data but results in smoothing the data similar to the moving average calculation shown earlier.
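A compact sketch of the MLS idea is given below (an illustration, not the code used to produce Fig. 3.21): at each evaluation point, a weighted linear least squares fit is solved with a Gaussian weight function so that nearby data points dominate, and the local fit is evaluated at that point.

import numpy as np

def mls_fit(x_data, y_data, x_eval, radius):
    """Moving least squares with a linear basis and a Gaussian weight function."""
    y_fit = np.empty_like(x_eval)
    A = np.column_stack([np.ones_like(x_data), x_data])   # linear basis [1, x]
    for i, xe in enumerate(x_eval):
        w = np.exp(-((x_data - xe) / radius)**2)           # weights emphasize points near xe
        W = np.diag(w)
        # Solve the weighted normal equations (A^T W A) a = A^T W y
        a = np.linalg.solve(A.T @ W @ A, A.T @ W @ y_data)
        y_fit[i] = a[0] + a[1]*xe                          # evaluate the local line at xe
    return y_fit

x = np.linspace(0.0, 10.0, 60)
y = np.sin(x) + np.random.default_rng(1).normal(0.0, 0.1, x.size)   # noisy synthetic data
x_eval = np.linspace(0.0, 10.0, 25)
print(mls_fit(x, y, x_eval, radius=1.0))   # smoothed values tracking sin(x)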
Fig. 3.21 Apple stock price with moving least squares (MLS) approximation overlaid
3.2.4 Example: Bacteria Growth
Recall the discussion on bacteria from Chap. 2. The growth of bacteria is an example of a nonlinear relationship. If food is left out at room temperature and not refrigerated, bacteria will begin to grow. As shown by the blue circles in Fig. 3.22, the number of bacteria will increase slowly at first (lag phase), but after some period of time the rate of increase will become much more rapid (exponential phase). Later on, the rate of increase will become much slower again (stationary phase). A simple linear regression, shown by the red line in Fig. 3.22, obviously provides a poor fit to the data and would not provide a useful predictive tool. However, the data can be fit using piecewise linear regression, as shown by the green lines. As shown in Fig. 3.22, the lag and stationary phases can each be fit using a single line, but the exponential phase requires at least two line segments due to the nonlinear nature of the bacteria growth during this phase. Figure 3.23 shows the results of applying an MLS approximation to the bacteria growth data. The original data contained 45 data points, but the MLS approximation was done using only 15 data points.
Fig. 3.22 Bacteria growth model
Fig. 3.23 Moving least squares (MLS) approximation of bacteria growth data
3.3 Regularization and Cross-Validation (Advanced Topic)
Nonlinear regression methods require complicated, higher order regression models to achieve accuracy, but these complicated models may lose generality. In order to find a good balance between model complexity and accuracy, regularization is introduced into the regression model. The regularized loss function introduces an extra term and is given by

L(w) = \frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2 + \lambda \|w\|_p^p    (3.34)

where \lambda is a predefined regularization parameter with a nonnegative value, p is a positive number, y_n is the original data, \hat{y}_n is the regression model, and N is the number of data points. This equation shows that, besides the first MSE term, a p-norm regularization term is added. This term acts as a "penalty" for having w be too large. In theory, the regularization term seeks to balance on the seesaw of simplicity and accuracy. The \lambda parameter is analogous to the fulcrum location of the seesaw. A larger \lambda implies more simplicity in the model, and a smaller \lambda implies more accuracy in the model. If the
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Load team data
df = pd.read_csv('baseball.csv', sep=',').fillna(0)
df['OPS'] = df.OBP + df.SLG
df2002 = df.loc[df.Year < 2002]

# Linear regression for Runs scored
slBA, intBA, r_valBA, p_valBA, ste_errBA = stats.linregress(df2002.BA, df2002.RS)
rsqBA = r_valBA**2
slOBP, intOBP, r_valOBP, p_valOBP, ste_errOBP = stats.linregress(df2002.OBP, df2002.RS)
rsqOBP = r_valOBP**2
slSLG, intSLG, r_valSLG, p_valSLG, ste_errSLG = stats.linregress(df2002.SLG, df2002.RS)
rsqSLG = r_valSLG**2
slOPS, intOPS, r_valOPS, p_valOPS, ste_errOPS = stats.linregress(df2002.OPS, df2002.RS)
rsqOPS = r_valOPS**2

plt.plot(df2002.BA, df2002.RS, '.', label='BA ($r^2$=%.3f)' %rsqBA)
plt.plot(df2002.OBP, df2002.RS, 'o', label='OBP ($r^2$=%.3f)' %rsqOBP)
plt.plot(df2002.SLG, df2002.RS, '.', label='SLG ($r^2$=%.3f)' %rsqSLG)
plt.plot(df2002.OPS, df2002.RS, '*', label='OPS ($r^2$=%.3f)' %rsqOPS)
plt.xlabel('Statistic')
plt.ylabel('Runs scored')
plt.legend(loc='lower right')
plt.grid()

yBA = slBA*df2002.BA + intBA
plt.plot(df2002.BA, yBA, 'k-')
yOBP = slOBP*df2002.OBP + intOBP
plt.plot(df2002.OBP, yOBP, 'k-')
ySLG = slSLG*df2002.SLG + intSLG
plt.plot(df2002.SLG, ySLG, 'k-')
yOPS = slOPS*df2002.OPS + intOPS
plt.plot(df2002.OPS, yOPS, 'k-')

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df2002.OBP, df2002.SLG, df2002.RS, marker='*', color='r')
ax.set_xlabel('On base percentage (OBP)')
ax.set_ylabel('Slugging percentage (SLG)')
ax.set_zlabel('Runs scored (RS)')

x = df2002.OBP
y = df2002.SLG
x, y = np.meshgrid(x, y)
z = -803 + 2729*x + 1587*y

# Linear regression for Wins
slWBA, intWBA, r_valWBA, p_valWBA, ste_errWBA = stats.linregress(df2002.BA, df2002.W)

Fig. 3.24 Linear regression Python code for baseball example
Fig. 3.25 How the regularization term balances model complexity and accuracy
regularization parameter is λ = 0, the problem is a standard least squares regression problem (i.e., regression without regularization). The choice of λ will be introduced in Sect. 3.3.4. A diagram of how the regularization term balances model complexity and accuracy is shown in Fig. 3.25. As the model complexity increases, the MSE term typically decreases while the regularization term increases. Their sum achieves a minimum when an appropriate model complexity is selected. Two commonly used regularization approaches are the L1 and L2 norm regularized regression methods.
3.3.1 Review of the Lp-Norm
The Lp-norm is a measure of the size of a vector, defined as the p-th root of the sum of the p-th powers of the absolute values of the vector components:

\|w\|_p = \left( \sum_{i=1}^{N} |w_i|^p \right)^{1/p}    (3.35)

Example:

If p = 1, \quad \|w\|_{p=1} = \sum_{i=1}^{N} |w_i| = |w_1| + |w_2| + \ldots + |w_N|    (3.36)

If p = 2, \quad \|w\|_{p=2} = \left( \sum_{i=1}^{N} |w_i|^2 \right)^{1/2} = \sqrt{|w_1|^2 + |w_2|^2 + \ldots + |w_N|^2}    (3.37)

For the limit case when p = \infty,

\|w\|_\infty = \max_j |w_j| = \max\left( |w_1|, |w_2|, \ldots, |w_N| \right)    (3.38)

Fig. 3.26 Geometric interpretation of the L1, L2 and \infty norms (https://en.wikipedia.org/wiki/Lp_space)
The shape of the Lp-norm is illustrated in Fig. 3.26 for a vector containing only two components, i.e., N = 2. For comparison, it is required that \|w\|_p = 1, i.e., \|w\|_{p=1} = 1, \|w\|_{p=2} = 1, and \|w\|_{p=\infty} = 1. Each contour indicates the possible points (w_1, w_2) that satisfy \|w\|_p = 1. In general, as p approaches infinity, the contour approaches a square, that is, the \|w\|_\infty contour (Fig. 3.26).
3.3.2 L1-Norm Regularized Regression
The L1-norm can be used to relieve overfitting by adding a regularization term to the regression equation. A two-dimensional polynomial example is used to demonstrate the recovery of the true function with independently and identically distributed Gaussian noise added to each term

y = 1(x + \epsilon_1)^5 - 4(x + \epsilon_2)^2 - 5(x + \epsilon_3)    (3.39)

where y is the simulated data, and the noise \epsilon_i \sim \text{Normal}(0, 0.05) for all i = 1, 2, 3. To simulate the noise, 81 linearly spaced x-coordinates are generated on the interval [-2, 2]. The noisy data are used to define uncertainties and reflect inaccuracies in measurements. At the beginning of this section, the general form of the regularized regression function is given in Eq. (3.34). For p = 1, the expression of the L1-norm regularized regression is

L(w) = \frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2 + \lambda \|w\|_{p=1}^1    (3.40)
A comparison showing the advantages of regularized regression is made for two regression methods, which are tested to see whether they recover the correct function in Eq. (3.39). Firstly, for non-regularized regression, the mean square error (MSE) is used to find the weights w_0, \ldots, w_5 in the polynomial \hat{y}_n = w_0 + w_1 x_n + w_2 x_n^2 + \ldots + w_5 x_n^5. Secondly,
Fig. 3.27 Comparisons between L1-norm regularization regression and non-regularized regression
a regularization term with the L1-norm is added to find the weights of the L1-norm regularized regression. The results are shown in Fig. 3.27. The blue dots represent the original data based on Eq. (3.39), the blue line is the L1-norm regularized regression result, and the red line is the non-regularized regression result. The regression model obtained with regularization is

y = 0.9671x^5 - 3.9175x^2 - 4.6723x \quad \text{with } \lambda = 0.074    (3.41)

The regression without regularization is

y = 1.0498x^5 + 0.2190x^4 - 0.1003x^3 - 4.5166x^2 - 4.8456x - 0.0018    (3.42)
Results show that both models fit the data very well. However, the predicted functions are very different. The main difference is that L1 regularization can eliminate some high order terms in the regression model (w_0, w_3, and w_4 equal zero). This shows sparsity and thus can be used for feature selection (or model selection), e.g., selecting the order of x. Also, note that in this problem an appropriate value of \lambda is 0.074. One approach for choosing an appropriate \lambda is K-fold cross-validation; details will be introduced in Sect. 3.3.4.
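The same experiment can be sketched in Python using scikit-learn's Lasso (the book's own implementation in Fig. 3.30 uses Matlab's lasso function instead); note that scikit-learn scales its penalty slightly differently, so the alpha value here is only analogous to the λ above, and the recovered coefficients will vary with the random noise.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x0 = np.linspace(-2.0, 2.0, 81)                       # 81 x-coordinates on [-2, 2]
eps = rng.normal(0.0, 0.05, (3, 81))
y = (x0 + eps[0])**5 - 4.0*(x0 + eps[1])**2 - 5.0*(x0 + eps[2])   # simulated data, Eq. (3.39)

X = np.column_stack([x0**k for k in range(1, 6)])     # features x, x^2, ..., x^5 (intercept handled by the model)
model = Lasso(alpha=0.074, max_iter=100000)           # alpha plays the role of lambda
model.fit(X, y)
print(model.intercept_, model.coef_)                  # the x^3 and x^4 weights typically shrink toward zero, as in Eq. (3.41)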
3.3.3 L2-Norm Regularized Regression
Similarly, the L2-norm can be used instead of the L1-norm for the regularization term. The L2-norm regularized regression is defined as

L(w) = \frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2 + \lambda \|w\|_{p=2}^2    (3.43)
Fig. 3.28 Comparisons between L2-norm regularization regression and non-regularized regression
The L2-norm uses the concept of "sum of squares", and thus has useful properties such as convexity, smoothness, and differentiability. L2-norm regularized regression also has an analytical solution because of these properties. Using the same polynomial model \hat{y}_n for regression, the results are depicted in Fig. 3.28. Similarly, the blue dots are the original data generated from Eq. (3.39), the blue line is the L2-norm regularized regression result, and the red line is the non-regularized regression result. The regression model obtained with regularization is

y = 1.0119x^5 - 0.0800x^4 + 0.0495x^3 - 3.7368x^2 - 5.3544x - 0.0707 \quad \text{with } \lambda = 1    (3.44)
The regression without regularization is the same as Eq. (3.42)

y = 1.0498x^5 + 0.2190x^4 - 0.1003x^3 - 4.5166x^2 - 4.8456x - 0.0018

Both models fit the data well, and the weights of the L2-norm regularized regression are nonzero but smaller than the non-regularized regression weights (different from the L1-norm regularized regression). That is, where the L1-regularized regression may squeeze sufficiently small coefficients to zero and thereby omit intricate details, the L2-norm seeks to capture those details, and thus L2-norm regression can preserve details and detect sophisticated patterns in the data. Comparing L1-norm and L2-norm regressions, the L1-norm regression performs better at selecting the key features of the model, while the L2-norm regression preserves details and detects sophisticated patterns in the data.
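Because the L2 penalty keeps the loss quadratic, the minimizer of Eq. (3.43) has the closed form w = (XᵀX + λI)⁻¹Xᵀy, which is what the line b_ridge = (x'*x+lambda*eye(size(x,2)))^-1*x'*y in the Matlab code of Fig. 3.30 computes. A NumPy equivalent (an illustrative sketch with noiseless data) is:

import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form L2-regularized (ridge) solution: (X^T X + lam*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Example usage with a polynomial design matrix [x^5, x^4, x^3, x^2, x, 1]
x0 = np.linspace(-2.0, 2.0, 81)
y = x0**5 - 4.0*x0**2 - 5.0*x0
X = np.column_stack([x0**5, x0**4, x0**3, x0**2, x0, np.ones_like(x0)])
print(ridge_weights(X, y, lam=1.0))   # weights shrink toward, but not exactly to, zero for the unused terms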
3.3.4 K-Fold Cross-Validation
Cross-validation, sometimes called rotation estimation or out-of-sample testing, is an important model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
Table 3.2 Divided data sets
Data set number | Training data | Test data
Set 1 | Points 9 to 81 | Points 1 to 8
Set 2 | Points 1 to 8 and 17 to 81 | Points 9 to 16
Set 3 | Points 1 to 16 and 25 to 81 | Points 17 to 24
... | ... | ...
Set 10 | Points 1 to 72 | Points 73 to 81
Fig. 3.29 Comparison result of different λ
One commonly used cross-validation method is K-fold cross-validation, in which the original sample is randomly partitioned into K equally sized subsamples. Consider the example in Sect. 3.3.2. Eighty-one (81) equally spaced x-coordinates are generated on the interval [-2, 2] through Eq. (3.39). If K = 10-fold cross-validation is used, each data set has 8 data points (9 for the last data set). Of the K = 10 subsamples, a single subsample is retained as the validation data for testing the model, and the remaining 9 subsamples are used as training data [4]. The cross-validation process is then repeated ten times, with each of the ten subsamples used exactly once as the validation data. The MSE for the ten data sets can then be averaged to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. K = 10-fold cross-validation is commonly used [5], but in general K remains an unfixed parameter [6]. The divided data sets are shown in Table 3.2.
The corresponding MSE of each set can be calculated. Thus, for the K = 10 sets, an error bar is computed with the corresponding mean value and standard deviation. It is used to evaluate the goodness of fit of the regression model. The error bar results for different regularization parameters λ are then used to find an appropriate λ. For example, if 50 different λ values are evaluated from 0.01 to 10, the comparison is shown in Fig. 3.29. The appropriate λ value, with the minimum MSE, is 0.0745.

%% L1 and L2 norm regression example
%% Generation of data
clc
clear
x0 = -2:0.05:2; % 81 linearly spaced x-coordinates on the interval [-2,2]
n = length(x0); % The total number of data points (81)
x1 = x0+randn(1,n)*0.05; % x+epsilon1
x2 = x0+randn(1,n)*0.05; % x+epsilon2
x3 = x0+randn(1,n)*0.05; % x+epsilon3
x = [x1.^5;x0.^4;x0.^3;x2.^2;x3;ones(1,n)]'; % Feature matrix
weights = [1;0;0;-4;-5;0]; % Weights
y = x*weights; % Simulated data y = x^5 - 4*x^2 - 5*x
%% Regressions
[b_lasso,fitinfo] = lasso(x,y,'CV',10); % L1-norm regularized regression
lam = fitinfo.Index1SE; % Index of appropriate Lambda
b_lasso_opt = b_lasso(:,lam) % Weights for L1-norm regularized regression
lambda = 1; % Set lambda equal to 1 for L2-norm regularized regression (an appropriate Lambda can also be found by cross-validation)
b_ridge = (x'*x+lambda*eye(size(x, 2)))^-1*x'*y % Weights for L2-norm regularized regression (has an analytical solution)
b_ols = polyfit(x1',y,5) % Weights for non-regularized regression (ordinary least squares)
xplot = [x0.^5;x0.^4;x0.^3;x0.^2;x0;ones(1,n)]';
y_lasso = xplot*b_lasso_opt; % L1 norm regression result
y_ols = xplot*b_ols'; % Non-regularized regression result
y_ridge = xplot*b_ridge; % L2 norm regression result
%% Plots
plot(x0,y,'bo') % Plot of original data
hold on
plot(x0,y_lasso,'LineWidth',1) % Plot of L1 norm regression
hold on
plot(x0,y_ridge,'LineWidth',1) % Plot of L2 norm regression
hold on
plot(x0,y_ols,'LineWidth',1) % Plot of non-regularized regression
ylabel('Y','fontsize',20)
xlabel('X','fontsize',20)
legendset = legend('Original data','L1 norm regression','L2 norm regression','No regularization','location','southeast');
set(gca,'FontSize',20);
lassoPlot(b_lasso,fitinfo,'PlotType','CV'); % Cross-validated MSE
legend('show') % Show legend

Fig. 3.30 Matlab code for norm-regularized regression
The steps to find the appropriate value of λ using K-fold cross-validation are summarized below:
1. For each regularization parameter, divide the original data set into K equal folds (parts).
2. Use one part as the test set and the rest as the training set.
3. Train the model and calculate the mean square error (MSE) with the test set.
4. Repeat steps 2 and 3 K times, and each time use a different part as the test set.
5. Compute the average and standard deviation of the K MSE values. Take the average accuracy as the final model accuracy.
6. Compare the MSE for different λ values to find the appropriate regularization parameter.
The Matlab code for this section is given in Fig. 3.30.
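Below is a minimal Python sketch of this λ-selection procedure, using synthetic data similar to the Matlab example; the candidate λ grid, fold assignment, and noise level are illustrative assumptions rather than the exact settings behind Fig. 3.29.

import numpy as np

rng = np.random.default_rng(0)
x0 = np.linspace(-2, 2, 81)
y = x0**5 - 4*x0**2 - 5*x0 + 0.05*rng.standard_normal(x0.size)
X = np.vander(x0, 6)                          # polynomial features x^5, ..., 1

def ridge_fit(X, y, lam):
    # Closed-form L2-regularized solution
    return np.linalg.solve(X.T @ X + lam*np.eye(X.shape[1]), X.T @ y)

K = 10
folds = np.array_split(np.arange(len(y)), K)  # 8 points per fold (9 in the last)

lambdas = np.linspace(0.01, 10, 50)           # candidate regularization parameters
cv_mse = []
for lam in lambdas:
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(len(y)), test)
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[test] @ w - y[test])**2))
    cv_mse.append(np.mean(errs))

best_lambda = lambdas[int(np.argmin(cv_mse))]
print("lambda with minimum cross-validated MSE:", best_lambda)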
3.4
Equations for Moving Least Squares (MLS) Approximation (Advanced Topic)
Consider an approximation function written as

y_n(x) = p(x)^T a(x)

where y_n is an approximation to be computed, p(x) is a basis vector, and a(x) is a vector of unknown coefficients. For a polynomial basis vector,

p(x_n)^T = [1, x_n, x_n^2, ..., x_n^{d-1}, x_n^d]

a(x) = [a_0(x), a_1(x), a_2(x), ..., a_d(x)]^T

Note that the coefficients a(x) are not constant as they are for linear regression, and vary with the position.
The cost function for MLS is

c(a(x)) = \sum_{I=1}^{N} w(x - x_I)\left[p(x_I)^T a(x) - y_n\right]^2

where w(x - x_I) is a weight function and y_n is the discrete data being used for the regression analysis. The minimum of the cost function can be determined by taking the derivative as

\frac{\partial c}{\partial a(x)} = A(x)\,a(x) - B(x)\,y = 0

where

A(x) = \sum_{I=1}^{N} w(x - x_I)\, p(x_I)\, p^T(x_I)

B(x) = [w(x - x_1)p(x_1),\; w(x - x_2)p(x_2),\; ...,\; w(x - x_N)p(x_N)]

and y is the vector of raw data points. The coefficients a(x) can be solved as

a(x) = A^{-1}(x)\,B(x)\,y

which can be used to construct a reduced-order nonlinear MLS approximation to the original data

y_n = \sum_{I=1}^{N} \phi_I(x)\, y_I

where y_n is the reduced-order MLS approximation, and \phi_I(x) = p^T(x)\,A^{-1}(x)\,B_I(x) is the MLS approximation function.
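The following Python sketch illustrates this MLS construction for one-dimensional data with a linear basis p(x) = [1, x] and a Gaussian weight function; the basis order, weight function, and support size h are illustrative assumptions rather than choices made in the text.

import numpy as np

# Sample data to be approximated
xI = np.linspace(0, 1, 21)
yI = np.sin(2*np.pi*xI)

def mls_approx(x, xI, yI, h=0.15):
    # Gaussian weight w(x - xI) with assumed support scale h
    w = np.exp(-((x - xI)/h)**2)
    P = np.column_stack([np.ones_like(xI), xI])    # rows are p(xI)^T = [1, xI]
    A = (P * w[:, None]).T @ P                     # A(x) = sum_I w_I p_I p_I^T
    B = (P * w[:, None]).T                         # columns of B(x): w(x - xI) p(xI)
    a = np.linalg.solve(A, B @ yI)                 # a(x) = A^-1(x) B(x) y
    return np.array([1.0, x]) @ a                  # y(x) = p(x)^T a(x)

x_eval = np.linspace(0, 1, 101)
y_mls = np.array([mls_approx(x, xI, yI) for x in x_eval])
print("max pointwise error vs. sin:", np.max(np.abs(y_mls - np.sin(2*np.pi*x_eval))))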
References

1. Wolberg EJ (2010) The method of least squares. In: Designing quantitative experiments. Springer, Berlin, Heidelberg
2. Stanton JM (2001) Galton, Pearson, and the peas: a brief history of linear regression for statistics instructors. J Stat Educ 9:3
3. Zhang P, Li SX, Zhang ZF (2011) General relationship between strength and hardness. Mater Sci Eng A 529:62–73
4. Galkin A (2011) What is the difference between test set and validation set? Retrieved 10 Oct 2018
5. McLachlan GJ, Do K-A, Ambroise C (2004) Analyzing microarray gene expression data. Wiley
6. Wikipedia. Cross-validation. https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation
Chapter 4
Extraction of Mechanistic Features
Abstract Extracting the key features of data is an important step that begins the transformation from data to information. Data is often available in unstructured, disorganized, and jumbled formats. It is necessary to process and re-organize the data to make it possible to achieve useful outcomes using data science. In addition, it can be important to understand the scientific principles associated with the data to enhance the feature extraction process. This chapter discusses how to extract features from a given dataset, normalize them to establish meaningful correlations, and use available scientific or mathematical tools to minimize the number of features under consideration. Real-world examples of feature extraction using vector geometry and mathematics are presented for image analysis, including an example from medical imaging. In addition, signal processing tools such as the Fourier transform and the short-time Fourier transform are explained for additional processing of signals. A real-life example of analyzing piano notes is used to demonstrate these techniques.

Keywords Feature · Feature extraction · Normalization · Feature scaling · Image processing · Least square method · Medical image processing · Image segmentation · Geometric feature extraction · Signal processing · Fourier transform · Short time Fourier transform · Prognosis spinal deformity · Adolescent idiopathic scoliosis · 3D image reconstruction · Patient-specific geometry generation · Bone growth
4.1
Introduction
The extraction of mechanistic features is an important step in the mechanistic data science process. Tremendous amounts of data are collected, and this data is often unstructured and in various forms. This data can be forms such as recorded music, photographs, or measured data. In order to effectively utilize the data, some processing is typically required to get it into a useful form and to isolate the key
features of the data. There are many tools and methods for extracting features from data, but this chapter will focus on three of them: geometrical analysis for image processing, Fourier transform analysis, and short-time Fourier transform analysis.
4.2
What Is a “Feature”
A feature is a measurable property of any object (sample) in a data set. For example, data on students taking a course may be used to predict how well each student will perform on the final exam. A student's data include past performance (CGPA), grades on assignments and mid-terms, and attendance in class. These are all quantifiable pieces of information that can be used to build a model for predicting how well the student will do on the final exam. Each of these items can be recognized as a feature. Image analysis is another common example of feature identification. Examples include satellite image analysis for evidence of archaeological sites, species migrations, and changes in shorelines. Facial recognition software and autonomous vehicle technology depend on extensive image collections. While many of these examples involve technologies of the future, humans have historically analyzed images in many STEM fields. For example, doctors use medical images, such as X-rays, MRIs, and ultrasounds, to aid in medical diagnosis and treatment. This chapter briefly explains different imaging techniques in medicine and integrates image processing to simulate a 3D model of adolescent idiopathic scoliosis (AIS), the most common childhood spinal deformity.
4.3
Normalization of Feature Data
The data ranges for multimodal data can be of different magnitudes. For a dataset of elderly patients, for example, the range of patient ages would be very different from the range of cholesterol levels or blood pressures. These range differences make regression and training of neural networks very difficult. This problem can be circumvented by normalizing the raw data prior to performing a regression analysis. Standard normalization consists of transforming the data so that each feature has a mean of zero and a unit standard deviation:

z_i = \frac{X_i - \mu}{\sigma}    (4.1)

where z_i is the standard normalized data point i, X_i is the original data point i, \mu = \frac{1}{N}\sum_i X_i is the mean for a particular feature (N is the total number of data points), and \sigma = \sqrt{\sum_i (X_i - \mu)^2 / N} is the standard deviation of the data for that same feature.
Feature scaling consists of scaling the data into a range from 0 to 1:

z_i = \frac{X_i - X_{\min}}{X_{\max} - X_{\min}}    (4.2)

where z_i is the scaled data point i, X_i is the original data point i, and (X_min, X_max) are the minimum and maximum of the data for a feature, respectively.
4.3.1
Example: Home Buying
Real estate agents say that "location, location, location" is the most important thing when buying a house, but the important features associated with a particular location vary from person to person. If a person wanted to buy a house located in Evanston, Illinois (home of Northwestern University, where the authors are professors), what features would be of interest? The desired location for a particular person depends on other factors, such as age (young couples with children may want to be near good schools), size and vintage of the house, safety of the neighborhood, convenience of travel, and, most importantly, price. Sample data related to four key criteria for decision-making are shown in Fig. 4.1: proximity to schools, number of bedrooms, age of the house, and price. These four measurable properties in the dataset are called the features, and the data points under each feature are called sample points. This chapter shows how key features can be extracted from raw data to be used for further analysis and decision making. The housing data given in Fig. 4.1 can be normalized using either standard normalization or feature scaling. If the data from Fig. 4.1 are organized into a matrix, each column can be standard normalized using the mean, μ, and standard deviation, σ, of that column. For the data in Fig. 4.1, the mean and standard deviation for each feature are
                     Proximity to schools   Bedrooms   Age of house   Price, $
Mean                 1.05                   3          8              212,500
Standard deviation   0.73                   0.83       4.32           131,498
Proximity to schools   Bedrooms   Age of house   Price, $
0.3 miles              3          10 years       200,000
1.2 miles              2          12 years       100,000
2.0 miles              4          2 years        400,000
0.7 miles              3          8 years        150,000

Fig. 4.1 House pricing as an example to understand feature engineering
For these features, standard normalization of the data according to Eq. (4.1) leads to

Original data:
0.3   3   10   200,000
1.2   2   12   100,000
2.0   4    2   400,000
0.7   3    8   150,000

Standard normalized data:
-1.02    0      0.46   -0.10
 0.20   -1.22   0.93   -0.86
 1.30    1.22  -1.39    1.43
-0.48    0      0      -0.48
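A quick numpy check of this normalization is sketched below; note that the tabulated standard deviations correspond to the sample (N-1) definition, which is used here.

import numpy as np

# Housing data from Fig. 4.1: proximity (miles), bedrooms, age (years), price ($)
X = np.array([[0.3, 3, 10, 200000.],
              [1.2, 2, 12, 100000.],
              [2.0, 4,  2, 400000.],
              [0.7, 3,  8, 150000.]])

mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)     # sample standard deviation, matching the tabulated values
Z = (X - mu) / sigma              # standard normalization, Eq. (4.1), column by column
print(np.round(Z, 2))

# Feature scaling (Eq. 4.2) would instead use the column minima and maxima:
Z_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))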
The data from Fig. 4.1 can also be feature scale normalized using the minimum and maximum values in each column. For the data in Fig. 4.1, the minimum and maximum of each feature are

          Proximity to schools   Bedrooms   Age of house   Price, $
Minimum   0.3                    2          2              100,000
Maximum   2.0                    4          12             400,000
which results in a feature scaled data set of

Original data:
0.3   3   10   200,000
1.2   2   12   100,000
2.0   4    2   400,000
0.7   3    8   150,000

Feature scaled data:
0.00   0.5   0.8   0.33
0.53   0.0   1.0   0.00
1.00   1.0   0.0   1.00
0.24   0.5   0.6   0.17

4.4
Feature Engineering
Feature engineering involves transforming the features in a dataset into a form that is easier to use and interpret, while maintaining the critical information. These features may be directly available in the raw data, but many times the features are derived through processing the raw data into a more usable form.
4.4.1
Example: Determining a New Store Location Using Coordinate Transformation Techniques
A store owner wants to establish a new location, with a goal of being close to as many customers as possible. To select a location, the store owner needs data showing where potential customers live. This data for potential customers can be plotted on a graph, with the center point being the potential store location (Fig. 4.2).
This kind of graph uses a Cartesian coordinate system, with the vertical axis denoting miles north/south and the horizontal axis denoting miles east/west. If the target customers are located within a 6 mile radius of the potential store, the data can be color coded, with blue dots for highly likely customers and red x's identifying less likely customers. More likely customers mean a more profitable store. It is easy for humans to understand which locations are profitable by basic inspection, but it is not as easy to state an objective condition based on these Cartesian coordinates that differentiates the likely customer base from the less likely customer base. However, the Cartesian coordinates can be mapped to a polar coordinate system to provide a more straightforward feature for decision making. A polar coordinate system represents the data as a radial distance from the center and an angle from the east direction (see Fig. 4.3). In the Cartesian coordinate system, the location of a point, P, is expressed in coordinates (x, y). In a polar coordinate system, location is expressed in terms of a radial distance, r, from an origin and an angle, θ, in this case relative to the x-axis.
Fig. 4.2 Distribution of profitable (blue dots) and unprofitable (red x’s) locations on a Cartesian coordinate map. The proposed store location is shown with a gold star
Fig. 4.3 Cartesian and polar coordinate systems. For the customer location data, “x” represents east/west and “y” represents north/south. The angle θ is relative to the “x” axis in a counterclockwise direction
The radius, r, and the angle from the x-axis, θ, for point P can be computed using the coordinates (x, y) as

Radius:  r = \sqrt{x^2 + y^2}    (4.3)

Angle:   \theta = \tan^{-1}\left(\frac{y}{x}\right)    (4.4)
The radius and angle can be computed for each red and blue point in Fig. 4.2, with the gold star indicating the proposed store location as the reference point. Following the change of coordinates from Cartesian to polar, the transformed locations can be plotted as shown in Fig. 4.4. In this example, a radial distance of less than 6 miles from the store is considered a key feature for a profitable store location. The polar coordinate plot in Fig. 4.4 shows the likely customers identified by a single measure, the radius, which is easier to implement in computer code. The computer code to compute the change of coordinates and to make the plots is shown below.
Fig. 4.4 Locations of potential customers plotted in a polar coordinate system
Computer Code in Python3 Using Numpy and Matplotlib

import numpy as np
import matplotlib.pyplot as plt

# x, y are arrays with the coordinates of the blue and red data points in Figure 4.2
# N is the number of data points and r0 = 6 is the 6-mile radius threshold
area = 10*np.ones(N)                       # marker sizes
c = np.sqrt(area)
r = np.sqrt(x ** 2 + y ** 2)               # radial distance, Eq. (4.3)
th = np.arctan(y/x) * 180/np.pi            # angle in degrees, Eq. (4.4)
area1 = np.ma.masked_where(r < r0, area)   # keep only points with r >= r0 (unprofitable)
area2 = np.ma.masked_where(r >= r0, area)  # keep only points with r < r0 (profitable)
plt.scatter(x, y, s=area2, marker='o', c='b', label='Profitable')
plt.scatter(x, y, s=area1, marker='x', c='r', label='Unprofitable')
plt.figure()
plt.scatter(r, th, s=area2, marker='o', c='b', label='Profitable')
plt.scatter(r, th, s=area1, marker='x', c='r', label='Unprofitable')
4.5
Projection of Images (3D to 2D) and Image Processing
Photographs and videos are a constant aspect of daily life, whether through television, online images and videos, or photo and video recording of dynamic testing. A constant forensic challenge with photos and videos is how to extract meaningful measurements and data from these images. One common use of video is for instant replay reviews in major sporting events. Instant replay has been a fixture in sports broadcasting since the December 7, 1963 Army-Navy football game, when television director Tony Verna used it during the live broadcast of the game [1]. Since that time, its role has expanded from an entertainment feature to a key tool for helping the in-game referees make a final decision on difficult plays. The games are video recorded using multiple cameras from different angles, but sometimes the video evidence is not clear enough to make a definitive call based on the video. The cameras only provide 2D images of events that are happening in 3D. The sport of volleyball has dealt with this issue through an interesting multi-camera setup for video tracking the 3D position of the ball on the court, and it has become very common for the referees to request a video check of the ball position. Many volleyball rules and fouls are related to the ball position; for example, a ball hitting the boundary line is inbounds, but the ball is out of bounds if it contacts the antenna located on the two sides of the volleyball net (Fig. 4.5). In a fast-moving game, it is difficult to judge whether the ball hits the boundary line, which makes the video check essential for the referees. Video replay usually consists of multiple cameras located on the corners and along the sidelines of the volleyball court. Figure 4.6 shows a 19-camera system layout around the volleyball court. The referees can perform a video check using multiple camera angles to investigate whether the ball hits the boundary line. Consider a situation in which a video replay check is performed using the images from Cameras 1 and Y. These two cameras are at a right angle from each other (these are known as orthogonal views). Figure 4.7 isolates the projection of these two cameras on the two orthogonal planes. Point C represents the center of the ball and
Fig. 4.5 (a) the ball is out of the court, and (b) the ball is touching antenna [2]. In both cases the ball is considered out of bounds
Fig. 4.6 19 Cameras around the volleyball court to capture different views of the ball [3]
has 3D coordinates (xC, yC, zC). Point A is the shadow projection of the point C obtained from Camera 1 on the ZY plane and has 2D coordinates (yA, zA). Similarly, point B is the projection of point C from camera Y onto the XY plane, with 2D coordinates (xB, yB).
4.6
Review of 3D Vector Geometry
A 3D vector (v) is a line segment in three dimensions that starts at point O (tail) and extends to point A (head), as shown in Fig. 4.8. The coordinates of O are (O_x, O_y, O_z) and the coordinates of A are (A_x, A_y, A_z). The vector v can be written as

v = v_x i + v_y j + v_z k    (4.5)
where i, j, and k are unit vectors in the X, Y, and Z directions, respectively, and vx, vy and vz are the corresponding magnitudes in each direction, defined as:
Fig. 4.7 Point B is the “camera Y” projection of the center of the ball (point C) to the XY plane and Point A is the “camera 1” projection of the point C to the YZ plane. Point C has 3D coordinates (xC, yC, zC), while points A and B are described by 2D coordinates (yA, zA) and (xB, yB), respectively
Fig. 4.8 Definition of 3D vectors. Vector v defined by three components vx , vy and vz which are the projection of the vector on the x, y and z accordingly
v_x = A_x - O_x,
v_y = A_y - O_y,
v_z = A_z - O_z.    (4.6)

4.7
Problem Definition and Solution
To utilize 3D vectors from different cameras to determine the volleyball location, the ball positions must be aligned with the global XYZ coordinate system. This requires the following assumptions:
1. The ball has the center point C.
2. Camera 1 projects point C onto the YZ plane. This is point A.
3. Camera Y projects point C onto the XY plane. This is point B.
Fig. 4.9 Visual representation of key assumptions. The vector line L1 connects Camera 1 to point A (the shadow projection of point C on the ZY plane). Similarly, line L2 connects Camera Y and point B (the projection of point C on the XY plane). It is assumed that the positions of Cameras 1 and Y and the points A and B are known. The intersection of lines L1 and L2 determine the unknown 3D coordinate of point C
These assumptions are visually displayed in Fig. 4.9. The unknown 3D coordinates of C must be calculated from the known 2D coordinates of A and B. As Fig. 4.9 shows, C is located at the intersection of lines L1 and L2.
4.8
Equation of Line in 3D and the Least Square Method
Any line L in a 3D coordinate system is defined by a point P and the direction v, as shown in Fig. 4.10. It can be written as

r = r_0 + t v    (4.7)

where t is a scalar constant representing the magnitude, v is the direction of line L, and r_0 is the position vector of point P. Lines can also be expressed in a parametric form as

\frac{r_x - P_x}{v_x} = \frac{r_y - P_y}{v_y} = \frac{r_z - P_z}{v_z}    (4.8)
which is derived from the scalar components of vectors P, r, and v. In addition to points A and B, vectors v1 and v2 (vectors along the directions of L1 and L2, respectively) are known quantities. As a result, L1 and L2 can be defined and equated to determine their intersection at the point of interest C. Based on Eqs. (4.9) and (4.10), define L1 and L2, scaled by arbitrary constants α and β, and obtain
Fig. 4.10 Definition of the 3D line. A line in 3D space can be defined using an arbitrary point (P) located on the line, a direction vector (v) and a scalar constant (t)
L_1 = \begin{pmatrix} 0 \\ y_A \\ z_A \end{pmatrix} + \alpha \begin{pmatrix} v_1^1 \\ v_2^1 \\ v_3^1 \end{pmatrix}    (4.9)

and

L_2 = \begin{pmatrix} x_B \\ y_B \\ 0 \end{pmatrix} + \beta \begin{pmatrix} v_1^2 \\ v_2^2 \\ v_3^2 \end{pmatrix}    (4.10)

where v_1^1, v_2^1, v_3^1 and v_1^2, v_2^2, v_3^2 are the x, y, z components of the vectors v^1 and v^2, respectively. The corresponding scalar x, y, and z components are set equal to determine the intersection of the two lines. This process results in three equations with two unknowns, α and β:

\alpha v_1^1 - \beta v_1^2 = -x_1 + x_2,
\alpha v_2^1 - \beta v_2^2 = -y_1 + y_2,
\alpha v_3^1 - \beta v_3^2 = -z_1 + z_2.    (4.11)
No unique solution exists for this system since it has more equations than unknowns. Instead, the least squares method can be used to determine values of α and β that minimize the error. A cost function can be written as

c = \frac{1}{N}\sum_{n=1}^{N}\left(y_n^{*} - y_n\right)^2 = \frac{1}{N}\sum_{n=1}^{N}\left(\hat{x}_n^T w - y_n\right)^2    (4.12)

where w is a vector of the desired weights, \hat{x}_n^T consists of features, y_n is the actual value, and y_n^{*} is the predicted value. Error, also known as the cost, is defined by
y^{*} - y = c.    (4.13)
The system of equations described by Eq. (4.11) can be expressed in matrix form with Eq. (4.14). Matrix A and vector b contain known quantities, while w is unknown. This leads to

A w = b    (4.14)

Aw can be set equal to y*, while b is set equal to y; recasting Eq. (4.14) as a least squares problem leads to

A_{m \times n}\, w_{n \times 1} = b_{m \times 1}, where Aw = y^{*} and b = y.    (4.15)

Mathematically, solving the least squares problem for Aw = b is equivalent to solving

A^T A w = A^T b.    (4.16)

The system of equations expressed in matrix form can be substituted into Eq. (4.11), producing

A^T A \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = A^T b, where A = \begin{bmatrix} v_1^1 & -v_1^2 \\ v_2^1 & -v_2^2 \\ v_3^1 & -v_3^2 \end{bmatrix} and b = \begin{pmatrix} -x_1 + x_2 \\ -y_1 + y_2 \\ -z_1 + z_2 \end{pmatrix}.    (4.17)
Equation (4.17) can be solved for α and β, and these parameters can then be substituted into Eq. (4.9) or (4.10) to calculate C, identifying the ball center’s actual location in 3D.
4.8.1
Numerical Example
To illustrate this process, let us calculate the coordinates of C in a numerical example. The 2D ball projections and relevant directions are known quantities determined from the two cameras shown in Fig. 4.9. These values are listed below:
A = (A_x, A_y, A_z) = (0, 50, 70),
B = (B_x, B_y, B_z) = (30, 40, 0),
v^1 = (7, 3, 2),
v^2 = (9, 19, 76).    (4.18)
This information is sufficient to define lines connecting Camera 1 to point A on the YZ plane and Camera Y to point B on the XY plane, as shown in Fig. 4.11. These lines intersect at point C, which is the location of the ball center in 3D space. The step-by-step solution process is listed below and shown in Fig. 4.12.
1. Define directional lines. First, known values are substituted into Eqs. (4.9) and (4.10), mathematically defining L1 and L2. The scaling factors, α and β, are the only unknowns.
2. Equate x, y, and z equations. At their intersection, lines L1 and L2 have the same coordinates: point C. As a result, the corresponding x, y, and z scalar equations of each line are equated, forming a system of equations.
3. Rewrite equations in matrix form. The system of equations defined in Step 2 is expressed in matrix form Ax = b, where A is a matrix of coefficients, b is a vector of constants, and x is a vector of unknowns: α and β.
4. Calculate the least squares solution. Using linear algebra, Eq. (4.17) is solved to determine values of α and β that minimize error, generating a least squares solution to the system of equations.
5. Determine C. Values of α and β are substituted into the equations for L1 or L2 to calculate the coordinates of their intersection C.
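The following Python sketch carries out these steps numerically with the values listed in Eq. (4.18); it is a minimal illustration rather than the code used to produce Fig. 4.12.

import numpy as np

# Known quantities from Eq. (4.18)
A_pt = np.array([0.0, 50.0, 70.0])    # projection of C on the YZ plane (Camera 1)
B_pt = np.array([30.0, 40.0, 0.0])    # projection of C on the XY plane (Camera Y)
v1 = np.array([7.0, 3.0, 2.0])        # direction of line L1
v2 = np.array([9.0, 19.0, 76.0])      # direction of line L2

# Steps 2-3: L1(alpha) = A_pt + alpha*v1 and L2(beta) = B_pt + beta*v2 intersect,
# giving alpha*v1 - beta*v2 = B_pt - A_pt, i.e. M @ [alpha, beta] = b
M = np.column_stack([v1, -v2])
b = B_pt - A_pt

# Step 4: least squares solution of the overdetermined 3x2 system (Eq. 4.17)
params, *_ = np.linalg.lstsq(M, b, rcond=None)
alpha, beta = params

# Step 5: substitute back to estimate the ball center C from each line
C1 = A_pt + alpha*v1
C2 = B_pt + beta*v2
print("C from L1:", C1, "  C from L2:", C2)   # nearly identical if the lines almost intersect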
Fig. 4.11 Lines connecting camera source 1 to A and camera source 2 to B. The intersection of the two lines is point C, the center of the ball
Fig. 4.12 Step-by-step example solution process to determine coordinates of point C
4.9
Applications: Medical Imaging
Medical images are frequently used to aid in patient diagnosis and treatment [4]. Medical imaging techniques aid in early screening for possible health conditions, help diagnose the likely cause of existing symptoms, monitor health conditions that have been diagnosed, and examine the effects of treatment. Ultrasound, MRI (Magnetic Resonance Imaging), and CT (Computed Tomography) scans are all examples of medical imaging technology. Based on the application and the type of medical condition, different imaging techniques are proposed by doctors. These images are recorded in DICOM (Digital Imaging and Communications in Medicine), the standard format for communication and management of medical imaging data.
4.9.1
X-ray (Radiography)
X-ray imaging, or radiography, is the most common and oldest imaging technique in medicine. X-rays are commonly used to detect bone-related issues like fractures or skeletal abnormalities. They can also be used for other applications, such as diagnosing infections (such as pneumonia), locating a foreign object in soft tissue, and detecting calcification (like kidney stones or vascular calcifications). Although visible light
reflects off opaque surfaces, X-rays pass through solids. Radiography, the projection of a 3D object onto 2D X-ray film, exploits this feature to reveal internal structures. This process is shown in Fig. 4.13. X-ray imaging works by transmitting a beam of X-rays through the object to be scanned and recording the X-rays that pass through to the opposite side of the object. In medicine, this is typically the human body. The resulting image intensity is determined by material composition and density. Dense materials, like bone, absorb more X-rays, while skin, muscle, and other soft tissue absorb less. As a result, bones are the most visible feature mapped to X-ray film and appear bright white in contrast to gray tissue regions. While X-rays reveal the body's internal structure, making them a powerful medical tool, they are limited to 2D projections. More comprehensive analysis can be performed by using the 2D images to reconstruct 3D patient geometry.
4.9.2
Computed Tomography (CT)
CT imaging, also known as computerized axial tomography (CAT) scanning, is another imaging technique in which a focused X-ray beam is aimed at the patient and rotated rapidly around the body. The images are reconstructed by the computer as cross-sectional images, or "slices," of the organs or tissue. These slices are stacked together to reconstruct the 3D shape of the organ and detect tumors or abnormalities. CT involves a higher radiation dose and is not as noninvasive as a single X-ray, since the images are reconstructed from many individual X-ray images. Although this imaging technique can be used to image hard tissue, soft tissue, and cardiovascular blood vessels, it is more commonly used for hard tissue because hard tissue produces high-contrast images.
Fig. 4.13 Performing a medical X-ray to generate a radiograph of the chest, in which the regions of bone appear white [5]
4.9.3
Magnetic Resonance Imaging (MRI)
Magnetic resonance imaging (MRI) utilizes radio waves and magnetic fields to create detailed images of tissues or organs inside the body. Like CT scans, it creates stacks of 2D images that can be reconstructed to generate a 3D image. Unlike CT and X-ray, this method is more time-consuming and produces loud noises while scanning. MRI is mainly used to image non-bony and soft tissue. Nerves, muscles, ligaments, and the spinal cord are visualized more clearly in MRI compared to X-ray or CT. However, it is an expensive method and is often used as a last step in a diagnostic plan.
4.9.4
Image Segmentation
In image processing, segmentation is an important technique for partitioning a digital image into several regions or sets of pixels depending on certain features [4]. Pixels in one region share similar features, while pixels in adjacent regions differ greatly from them. Segmentation is normally used to identify regions of interest and their boundaries in images. In particular, segmentation of medical images is primarily used for identifying tumors and other pathologies, calculating tissue volumes for computer-guided surgery, diagnosis and treatment planning, as well as analyzing anatomical structures in medical education. One segmentation method is the active contour method, or snakes [6]. In this method, a set of points is located around a region of interest. Figure 4.14 shows two segmented X-ray images of the spine using the snake algorithm (a demo and a code are provided in the E-book, Supplementary File 4.1 and Supplementary Video 4.1). Each vertebra is identified using 16 landmarks. Landmarks are (x, y, z) Cartesian coordinates that outline each vertebra, revealing vertebra shape, position, and orientation in 3D space. For example, 2D landmark projections in the x-z and y-z planes are shown in Fig. 4.14.
4.10
Extracting Geometry Features Using 2D X-ray Images
The human spine is composed of 24 vertebrae, but only 17 are of interest for adolescent idiopathic scoliosis (AIS) analysis. Thoracic (T 1–12) and lumbar (L 1–5) vertebrae, shown in Fig. 4.15, are responsible for scoliosis, making them the focus of diagnosis and treatment. While the cause of AIS is unknown, effective treatments exist. Patient observation, custom braces, and surgery are all viable options that can yield meaningful results. However, there is currently no standard methodology to quantify the shape or severity of AIS cases, classify patients, or determine the best course of treatment.
Fig. 4.14 Identification of landmarks (yellow dots) for the lumbar and thoracic vertebrae from (a) AP and (b) LAT X-ray images. The close-up views of the landmark points for the L3 vertebra are shown for both images [7]
For example, spinal deformities can be described by five global angles, listed below (a code is provided in the E-book, Supplementary File 4.2):
• Trunk Inclination Angle (TIA),
• Sacral Inclination Angle (SIA),
• Thoracic Kyphosis Angle (TKA),
• Lumbar Lordosis Angle (LLA),
• Cobb Angle (CA).
While these angles are helpful, there is no standard method to calculate them. Doctors manually estimate vertebrae planes and locations from patient X-rays, measuring angles according to individual definitions. As a result, AIS treatment methodology varies by hospital. Doctors rely on experience to manually read X-rays, estimate AIS progression, and prescribe corresponding treatment. This example shows how to standardize the feature calculations used in AIS diagnosis and classification, ensuring each patient receives optimal treatment. Instead of relying on the limited resources of one doctor or hospital, standardization allows all AIS cases to be quantified and compared. Doctors can then use all available data to make informed decisions for each AIS patient, leading to more accurate predictions of AIS progression and, consequently, more effective treatments. This example integrates mechanistic data science with current medical practices. In the first step, vertebrae landmarks are extracted from 2D patient X-rays as shown in Fig. 4.14. Following this step, these landmarks are used to define vertebrae boundaries and calculate key spinal features, pioneering a standardized mathematical methodology to quantitatively compare patients. These mechanistic data science results will ensure future AIS treatment is effective and consistent across the medical community, improving patient outcomes. The methodology used in this chapter to
Fig. 4.15 Diagram of the human spine [8]. The thoracic and lumbar regions, which consist of 17 vertebrae, account for scoliosis
develop a Python code relies on several sources, mathematical techniques, and assumptions.
4.10.1 Coordinate Systems

To define and plot vertebrae landmarks, a 3D Cartesian coordinate system was used:
• The xz plane represents the transverse plane.
• The xy plane represents the lateral plane.
• The zy plane represents the anteroposterior plane.
These planes are mutually orthogonal, with the y-axis representing the vertical direction (parallel to patient height).
4.10.2 Input Data

To calculate features, vertebrae landmarks are extracted from X-ray images and the data is stored in a table consisting of landmark locations for 17 vertebrae, T 1–12 and L 1–5. Each row represents 1 vertebra, with the topmost vertebra (T 1) listed first. This results in 18 total rows: 1 header, 12 rows of T 1–12 landmarks, and 5 rows of L 1–5 landmarks. Additionally, each set of three columns represents one landmark expressed as an (x, y, z) coordinate. The total number of columns is an arbitrary multiple of 3, as each landmark consists of three values (x, y, and z), and each vertebra is represented by an arbitrary number of landmarks.
4.10.3 Vertebra Regions [Advanced Topic]

Consistent mathematical notation is used to group landmarks, define vertebra geometry, and calculate features. For example, landmarks on each vertebra are assigned to Region I, II, III, or IV based on location, shown in Fig. 4.16. It is assumed that each region represents an irregular boundary region and contains an arbitrary number of landmarks. Best fit lines are placed through key vertebra regions to calculate several features. For example, Fig. 4.17 shows lines of best fit through Region II and Region IV landmarks of a vertebra in the anteroposterior plane (AP). Notation of the line is described as:
Fig. 4.16 Vertebra landmarks assigned to Regions I, II, III, and IV, anteroposterior plane. A vertebra may have an arbitrary number of landmarks in each region
Fig. 4.17 Vertebra landmarks assigned to Regions I, II, III, and IV, anteroposterior plane. d^i_{2,AP} and d^i_{4,AP} denote the regressed lines passing through Regions II and IV of vertebra i in the AP view, respectively
d^i_{j,k}, where
i : 1, ..., 17 (vertebra number)
j : 1, ..., 4 (vertebra region)
k : plane, typically Anteroposterior (AP) or Lateral (LAT)    (4.19)
4.10.4 Calculating the Angle Between Two Vectors

The angle between two vectors v_1 and v_2 is calculated as

\theta = \cos^{-1}\left(\frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{|\mathbf{v}_1|\,|\mathbf{v}_2|}\right)    (4.20)
Example: Calculating the Angle Between Two Vectors

To calculate the angle (θ) between two lines defined as

y_1 = 2x_1 - 3,   y_2 = 0.5x_2 + 1    (4.21)

As shown in Fig. 4.18, for a 2D line defined as L: y = a_0 x + b_0, the vector r = 1i + a_0 j points in the same direction as line L. It can therefore be inferred that r_1 = 1i + 2j is in the direction of line L_1 and r_2 = 1i + 0.5j is in the direction of L_2. The angle between the two lines can be calculated by substituting r_1 and r_2 into Eq. (4.20), which gives:
Fig. 4.18 Visualization of the angle between two lines in a 2D coordinate system
\theta = \cos^{-1}\left(\frac{\mathbf{r}_1 \cdot \mathbf{r}_2}{|\mathbf{r}_1|\,|\mathbf{r}_2|}\right)
       = \cos^{-1}\left(\frac{(1i + 2j)\cdot(1i + 0.5j)}{|1i + 2j|\,|1i + 0.5j|}\right)
       = \cos^{-1}\left(\frac{1\cdot 1 + 2\cdot 0.5}{\sqrt{1^2 + 2^2}\,\sqrt{1^2 + 0.5^2}}\right)
       = \cos^{-1}\left(\frac{2}{2.5}\right) = 36.86^{\circ}    (4.22)
4.10.5 Feature Definitions: Global Angles

As previously discussed, there are no standard quantities to describe, classify, and compare AIS patients. This example shows how to establish universal definitions of global angles and introduce additional features to quantify the severity of an AIS diagnosis. Ultimately, these values will fully capture the 3D shape of a patient's spine, enabling doctors to prescribe effective treatment with minimal reliance on manually reading X-rays. The five global angles are defined qualitatively and quantitatively, as documented below.

Angles in the Lateral (LAT) Plane

• α1: The Thoracic Kyphosis Angle (TKA) is the angle between Region 2 of T1 and Region 4 of T12,

TKA = \cos^{-1}\left(\frac{d^{1}_{2,LAT} \cdot d^{12}_{4,LAT}}{\lVert d^{1}_{2,LAT}\rVert\,\lVert d^{12}_{4,LAT}\rVert}\right)    (4.23)

• α2: The Lumbar Lordosis Angle (LLA) is the angle between Region 2 of L1 and Region 4 of L5,
Fig. 4.19 Definitions of Thoracic Kyphosis Angle (TKA) and Lumbar Lordosis Angle (LLA)
LLA = \cos^{-1}\left(\frac{d^{13}_{2,LAT} \cdot d^{17}_{4,LAT}}{\lVert d^{13}_{2,LAT}\rVert\,\lVert d^{17}_{4,LAT}\rVert}\right)    (4.24)
Both the TKA and LLA are shown in Fig. 4.19.
Angles in the Anteroposterior (AP) Plane

Before calculating the angles in the AP view, some vertebra labeling is needed to aid in feature extraction. This step identifies the most tilted vertebra. AIS curves can be categorized as single-curvature (C shape) or double-curvature (S shape), as shown in Fig. 4.20.

Identifying the Most Tilted Vertebra in the AP View for a Double-Curvature Spine
1. Load the X-ray images.
2. Apply image segmentation to identify landmarks around each vertebra.
3. Calculate the average of the landmarks corresponding to each vertebra.
4. Fit a third-degree polynomial to the averaged points.
5. Calculate the extrema of the curve.
6. Find the vertebrae corresponding to the two extrema.
7. Label the vertebrae as T and L (if the vertebra belongs to the thoracic region, label it T; otherwise, label it L).
Fig. 4.20 (a) Single-curvature spine (C shape) and (b) double-curvature spine (S shape)
Identifying the Most Tilted Vertebra in the AP View for a Single-Curvature Spine
1. Load the X-ray images.
2. Apply image segmentation to identify landmarks around each vertebra.
3. Calculate the average of the landmarks corresponding to each vertebra.
4. Fit a second-degree polynomial to the averaged points.
5. Calculate the extremum of the curve (point C).
6. Find the vertebra corresponding to point C and label it as vertebra C.
A minimal code sketch of steps 3–6 is given below.
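The sketch below illustrates steps 3–6 for the single-curvature case, assuming segmentation has already produced one array of AP-view landmarks per vertebra; the function and variable names are illustrative, not part of the Supplementary File code.

import numpy as np

def most_tilted_vertebra(landmarks_per_vertebra):
    """landmarks_per_vertebra: list of (n_i, 2) arrays of (x, y) landmarks in the AP view,
    ordered from T1 (index 0) to L5 (index 16). Returns the index of vertebra C."""
    # Step 3: average the landmarks of each vertebra (vertebra centroids)
    centroids = np.array([lm.mean(axis=0) for lm in landmarks_per_vertebra])
    x, y = centroids[:, 0], centroids[:, 1]

    # Step 4: fit a second-degree polynomial x(y) through the centroids
    coeffs = np.polyfit(y, x, 2)

    # Step 5: extremum of the parabola, dx/dy = 0  ->  y* = -b / (2a)
    y_star = -coeffs[1] / (2*coeffs[0])

    # Step 6: the vertebra whose centroid is closest to the extremum is labeled C
    return int(np.argmin(np.abs(y - y_star)))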
• α3: The Trunk Inclination Angle (TIA) is the angle between Region 4 of vertebra 17 (L5) and the vertebra L (the vertebra corresponding to the point of maximum curvature in the lumbar spine). This leads to

TIA = \cos^{-1}\left(\frac{d^{17}_{4,AP} \cdot d^{L}_{4,AP}}{\lVert d^{17}_{4,AP}\rVert\,\lVert d^{L}_{4,AP}\rVert}\right)    (4.25)
• α4: The Sacral Inclination Angle (SIA) is the angle between Region 4 of vertebra 17 (L5) and the vertebra T (the vertebra corresponding to the point of maximum curvature in the thoracic spine). This leads to

SIA = \cos^{-1}\left(\frac{d^{17}_{4,AP} \cdot d^{T}_{4,AP}}{\lVert d^{17}_{4,AP}\rVert\,\lVert d^{T}_{4,AP}\rVert}\right)    (4.26)
• α5: Cobb Angle (CA): For a patient with single curvature (a "C" shaped spine), the Cobb Angle (CA) is the angle between Region 2 of U and Region 4 of B. U and B are, respectively, the vertebrae two above and two below C, the vertebra with maximum curvature. This leads to

CA = \cos^{-1}\left(\frac{d^{U}_{2,AP} \cdot d^{B}_{4,AP}}{\lVert d^{U}_{2,AP}\rVert\,\lVert d^{B}_{4,AP}\rVert}\right)    (4.27)
• α5^1 and α5^2: First Cobb Angle (CA1) and Second Cobb Angle (CA2): For a patient with double curvature (an "S" shaped spine), two Cobb Angles are reported. CA1 is the angle between Region 2 of U1 and Region 4 of B1, the vertebrae two above and two below T, the thoracic vertebra with maximum curvature. Similarly, CA2 is the angle between Region 2 of U2 and Region 4 of B2, the vertebrae two above and two below L, the lumbar vertebra with maximum curvature. This leads to

CA1 = \cos^{-1}\left(\frac{d^{U1}_{2,AP} \cdot d^{B1}_{4,AP}}{\lVert d^{U1}_{2,AP}\rVert\,\lVert d^{B1}_{4,AP}\rVert}\right)    (4.28)

CA2 = \cos^{-1}\left(\frac{d^{U2}_{2,AP} \cdot d^{B2}_{4,AP}}{\lVert d^{U2}_{2,AP}\rVert\,\lVert d^{B2}_{4,AP}\rVert}\right)    (4.29)
The TIA, SIA, and CA1 are shown in Fig. 4.21.
4.11
Signals and Signal Processing Using Fourier Transform and Short Time Fourier Transforms
Feature engineering can be applied to analyze more complicated data sets like signals. A signal is a time-varying or space-varying impulse that has a meaning or stores some information. For instance, a high-pitched siren from an ambulance is a time-varying signal intended to get peoples’ attention. A photograph from a family vacation is a space-varying signal composed of color and intensity. Signals have been recorded and used nearly as long as humans have existed. Some of the earliest recorded signals were from the ancient Egyptians, who would record the flooding of the Nile River. The flooding of the Nile River would fertilize
Fig. 4.21 Definitions of Trunk Inclination Angle (TIA), Sacral Inclination Angle (SIA), and Cobb Angle (CA1)
the land along the river, and data showed that it was a good predictor of crop production. The ancient Egyptians used this signal as a method of predicting food production and setting annual tax rates. The depth of the Nile flooding was measured using a nilometer like the one depicted in Fig. 4.22a below. The flood data was recorded on stone tablets like the Palermo stone shown in Fig. 4.22b. Since those early beginnings of signal processing, many mathematical advances have greatly expanded the power of signal processing. In the late seventeenth century, calculus was developed independently by Isaac Newton and Gottfried Leibniz. In the early nineteenth century, Joseph Fourier developed the Fourier transform, which decomposes a dynamic signal into its component frequencies and amplitudes.
4.12
Fourier Transform (FT)
In nature, nearly everything from sounds to photographs can be described in terms of waves. In order to describe something in terms of waves, it is assumed that there is a periodic (continuously repeating) pattern. As shown in the middle plot of Fig. 4.23, the amplitude of a wave is the maximum distance from the middle of the wave and the period, T, is the horizontal distance for the wave to begin repeating. The frequency, f, is how often the wave repeats for a given distance along the horizontal axis. If the signal is a function of time, the frequency has units of cycles/second or Hertz (Hz). The period is the inverse of the frequency T ¼ 1/f. Note that the angular
Fig. 4.22 (a) Mosaic of a nilometer to measure the Nile River flood depth (http://cojs.org/nile_celebration_mosaic-_5th-6th_century_ce/). (b) The Palermo stone (right) shows early recordings of Nile flood depths
Fig. 4.23 Sinusoidal wave created by the addition of two other sinusoidal waves (the amplitude and period of the wave are labeled in the figure)
frequency, ω = 2πf, has units of radians/second. Spatial frequency is the number of repeating patterns for a given length. Joseph Fourier (1768–1830) discovered that periodic dynamic functions can be represented as a sine and cosine series called a Fourier series

g(t) = a_0 + a_1\sin(\omega t) + a_2\sin(2\omega t) + ... + a_n\sin(n\omega t) + b_1\cos(\omega t) + b_2\cos(2\omega t) + ... + b_n\cos(n\omega t)    (4.30)

where a_n and b_n are coefficients whose magnitudes describe the overall shape of the wave and ω is the angular frequency. As such, finding the amplitude coefficients and
frequencies provides a description of the function. The coefficients for the Fourier series are

a_n = \frac{\omega}{\pi}\int_{0}^{2\pi/\omega} g(t)\sin(n\omega t)\,dt    (4.31a)

b_n = \frac{\omega}{\pi}\int_{0}^{2\pi/\omega} g(t)\cos(n\omega t)\,dt    (4.31b)
The Fourier series can be written only in terms of a sine function if a phase shift is included. The phase shift is essential since not all sine waves that make up a signal will start at the same time, but the difference in starting time can be easily accounted for by including a phase angle shift. This leads to the following equation

g(t) = A_0 + A_1\sin(\omega t + \theta_1) + A_2\sin(2\omega t + \theta_2) + ... + A_n\sin(n\omega t + \theta_n)    (4.32)

where the coefficients are A_n^2 = a_n^2 + b_n^2 and the phase angle is \theta_n = \tan^{-1}(b_n / a_n). A special case of the phase angle shift is the cosine wave, which is simply a sine wave that has been shifted 90°.

A Fourier series and a Fourier transform are complex mathematical definitions, but they are rooted in basic "engineering" fundamentals. Consider the periodic signals shown in Fig. 4.23. The signal on the left is actually the summation of the signal in the middle and the signal on the right. In this case, the amplitude of each of the component waves is the same. It can be seen that the wave in the middle is a sine wave with amplitude zero at the beginning. The wave on the right is a sine wave that is shifted by one quarter of the period of the wave, resulting in it actually being a cosine. In this case, it can be represented as a sine wave with a phase angle shift of 90°. Although signals in nature are continuous, they are recorded as a discrete set of data points, resulting in a wave that is actually more like that shown in Fig. 4.24. Signals made of discrete data are analyzed using a discrete form of the Fourier transform called a Discrete Fourier Transform (DFT). A common form of the DFT is the fast Fourier transform (FFT). An example of an FFT of a sine wave programmed in Python is shown in Fig. 4.25.
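As a small numerical illustration of Eqs. (4.31) and (4.32), the sketch below computes a_n and b_n for a sampled test signal by numerical integration and converts them to amplitude and phase; the test signal, fundamental frequency, and sampling choices are assumptions for illustration.

import numpy as np

f0 = 1.0                                   # fundamental frequency in Hz (assumed)
omega = 2*np.pi*f0
t = np.linspace(0, 2*np.pi/omega, 2000, endpoint=False)   # one full period
dt = t[1] - t[0]
g = 3*np.sin(omega*t) + 2*np.cos(2*omega*t)                # assumed test signal

def fourier_coeffs(g, t, n, omega, dt):
    an = omega/np.pi * np.sum(g*np.sin(n*omega*t)) * dt    # Eq. (4.31a)
    bn = omega/np.pi * np.sum(g*np.cos(n*omega*t)) * dt    # Eq. (4.31b)
    return an, bn

for n in (1, 2):
    an, bn = fourier_coeffs(g, t, n, omega, dt)
    An = np.hypot(an, bn)                      # from A_n^2 = a_n^2 + b_n^2
    theta_n = np.degrees(np.arctan2(bn, an))   # theta_n = tan^-1(b_n / a_n)
    print(n, round(An, 3), round(theta_n, 1))
# Expected: amplitude ~3 with ~0 degree phase at n = 1, amplitude ~2 with ~90 degree phase at n = 2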
4.12.1 Example: Analysis of Separate and Combined Signals Consider two sinusoidal signals f1(t) and f2(t) (red and orange signals in Fig. 4.26). Adding these two signals together results in a combined signal f(t) (blue signal in Fig. 4.26). If a random noise signal, n(t), depicted as the green signal in Fig. 4.26, is
Fig. 4.24 Wave recorded as a discrete signal
import scipy
import scipy.fftpack
import numpy as np
from matplotlib import pyplot as plt

# Set up sine wave signal
tt = np.linspace(0,2*np.pi,21)
yy = np.sin(tt)
plt.plot(tt,yy,'-o')
plt.axhline(y=0,color='k')

# Compute the FFT of the sampled signal and keep the one-sided spectrum
N = yy.shape[0]
FFT = abs(scipy.fftpack.fft(yy))
FFT_side = FFT[range(N//2)] # one side FFT range
freqs = scipy.fftpack.fftfreq(yy.size, tt[1]-tt[0])
fft_freqs = np.array(freqs)
freqs_side = freqs[range(N//2)] # one side frequency range
plt.figure()
p3 = plt.semilogy(freqs_side, abs(FFT_side), "b")

Fig. 4.25 Python code to perform FFT of sine wave signal
added to f(t), the combined total signal is the black signal shown at the bottom of Fig. 4.26. By performing a DFT on the black signal, the signal components f1(t), f2(t) and n(t) can be identified and extracted from the combined signal in Fig. 4.26.

Fig. 4.26 Two sinusoidal signals and a random noise signal plotted separately and combined together

When recording and processing signal data, it is necessary to have an adequate number of data points. If a high-frequency signal like the red signal in Fig. 4.27 is recorded at a low sampling rate (black points), it is not possible to distinguish the high-frequency red curve from the lower-frequency blue curve in Fig. 4.27. This inability to distinguish between signals due to inadequate sampling frequency is called aliasing [9]. To overcome this challenge, the sampling rate must be at least twice the highest frequency of the signal being sampled. This minimum sampling rate is called the Nyquist rate (half of the sampling rate is known as the Nyquist frequency).

Fig. 4.27 High-frequency and low-frequency sinusoidal data with discrete data points sampled. Inadequate sampling will lead to aliasing, which will make these two sine waves impossible to distinguish

A filter is a device or process that removes some selected frequencies from a signal [10]. A low-pass filter removes high frequencies and leaves low frequencies, while a high-pass filter leaves high frequencies and removes low frequencies from a signal.
Fig. 4.28 (a) Fourier analysis of the original combined signals (blue) and filtered combined signals (orange) from Fig. 4.26, (b) signal f(t) and recovered signal ffilt(t) after filtering out random noise from combined signal
A band-pass filter leaves only the frequencies within a specified frequency band and removes the others. A Fourier transform of the combined noisy signal is shown as the blue line in Fig. 4.28a. Two dominant peaks are clear, corresponding to the frequencies of signals f1(t) and f2(t). Many smaller peaks are also present across the entire frequency spectrum, which are due to the random noise, n(t), in the signal. A low-pass Butterworth filter is applied to the signal, which removes all frequencies above 15 Hz, leaving only the two dominant peaks corresponding to f1(t) and f2(t). After filtering, the Fourier transform of the signal is shown as the orange curve in Fig. 4.28a. An inverse Fourier transform can then be performed to reconstruct an approximation of the original signal. As shown in Fig. 4.28b, the original and the reconstructed signals are close.
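A minimal Python sketch of this low-pass filtering step is shown below using scipy. The component frequencies, noise level, sampling rate, and filter order are illustrative assumptions (only the 15 Hz cutoff comes from the text), and a zero-phase time-domain Butterworth filter is used here rather than explicit inverse Fourier transformation.

import numpy as np
from scipy import signal
from scipy.fft import rfft, rfftfreq

fs = 200                                    # sampling rate in Hz (assumed)
t = np.arange(0, 2, 1/fs)
f = np.sin(2*np.pi*5*t) + 0.5*np.sin(2*np.pi*12*t)            # assumed f1(t) + f2(t)
noisy = f + 0.3*np.random.default_rng(0).standard_normal(t.size)   # add random noise n(t)

# 4th-order low-pass Butterworth filter with a 15 Hz cutoff
b, a = signal.butter(4, 15, btype='low', fs=fs)
f_filt = signal.filtfilt(b, a, noisy)       # zero-phase filtering

# Compare the dominant spectral peak before and after filtering
freqs = rfftfreq(t.size, 1/fs)
print(freqs[np.argmax(np.abs(rfft(noisy)))], freqs[np.argmax(np.abs(rfft(f_filt)))])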
4.12.2 Example: Analysis of Sound Waves from a Piano

A piano makes music based on the vibration of strings of different lengths and thicknesses inside the body of the piano. Each string can make a single sound, or strings can be combined to make other sounds, and music. The piano keyboard consists of 88 white and black keys and around 230 strings (the total number of strings can vary from one piano-maker to another). When a key is struck, a felt-tipped hammer inside the body of the piano strikes a wire (or wires) and causes it (or them) to vibrate (Fig. 4.29). The vibration from the wires passes through the
Fig. 4.29 Cross-section view of a piano [11]

Table 4.1 Typical music terminology
Note: a single musical sound or the symbol used to indicate the musical sound
Chord: a combination of notes (usually 3) played simultaneously
Scale: a set of musical notes ordered by fundamental frequency or pitch
Pitch: the frequency of a note
Tempo/sustain: the speed or pace of a given musical piece
Fundamental frequency: the lowest frequency of any vibrating object
Harmonics: integer multiples of the fundamental frequency with lower amplitudes
Inharmonic frequencies: non-integer multiples of the fundamental frequency
bridge and into the sound board. The vibration of the sound board emits sound waves which are heard as music. A list of typical music terminology is given in Table 4.1. The piano is capable of emitting different frequencies based on the diameter and length of the wire(s) and the tension in the wire(s). The piano strings for the lowest (leftmost) key vibrate at 27.5 Hz and the piano strings for the highest (rightmost) key vibrate at 4186 Hz. A single note will have a dominant fundamental frequency, which is the pitch that would be identified with that note. In addition, a note will also have harmonics, which are integer multiples of the fundamental frequency. The fundamental frequency and the harmonics play at the same time, but the harmonics have a much lower sound amplitude. The combination of the fundamental frequency and the harmonics gives the piano its characteristic sound and distinguishes it from other musical instruments that are capable of making musical notes at the same fundamental frequency. Besides the harmonics, there may also be inharmonic frequencies. These are frequencies that are not integer multiples of the fundamental frequency. Figure 4.32 shows the sound wave made by striking the A4 key on a piano, recorded with a sampling rate of 11,020 Hz. The fundamental frequency for the A4 piano key is 440 Hz. When a 2 s time span of this wave is plotted out, the density of the plot makes it look solid. However, zooming in on a short time segment of the signal shows that there is indeed an up-and-down wave pattern. The shape of the wave is approximately sinusoidal, but inspection of the waves shows that it is not a perfect sinusoid; there are some small peaks and valleys within the overall waveform. The sound wave must be analyzed to determine if this note on the piano is properly tuned,
Fig. 4.30 Note on a music staff and a schematic of the fundamental and harmonic frequencies that make up the note when played on a musical instrument
Fig. 4.31 Detailed layout of a piano. (Source: https://www.yamaha.com/en/musical_instrument_guide/piano/mechanism/)
and if there are some interesting harmonic or inharmonic frequencies. The sound wave in Fig. 4.32 has a maximum amplitude at the beginning, which corresponds to the hammer striking the piano string. After that, the amplitude of the wave decreases due to damping. The damping can be due to a damper mechanism built into the piano (Fig. 4.31). The first step in analyzing the piano sound wave is to find the frequency or frequencies that make up the wave. As shown in Fig. 4.30, a musical note played on an instrument like a piano can be composed of a dominant fundamental frequency and multiple harmonic frequencies. The harmonic frequencies are integer multiples of the fundamental frequency, but with much smaller magnitudes. There may also be inharmonic frequencies, which are frequencies that are not integer multiples of the fundamental frequency. These can vary from piano to piano and can be caused by sources such as the length and stiffness of the string (thicker strings are less flexible and generally have more inharmonicity). This means that a grand piano in a concert
Fig. 4.32 Sound wave from an A4 note made on a Yamaha piano (https://www.yamaha.com/en/musical_instrument_guide/piano/mechanism/mechanism004.html). (Audio available with E-book, Supplementary Audio 4.1)
hall will most likely have a slightly different sound than an upright piano in someone’s home. Note that some degree of inharmonicity is appreciated in piano music. In addition, the vibration from one string can cause other strings to vibrate. Sound waves can generally be thought of as a collection of several sine waves with different frequencies and magnitudes added together. If all the sine waves start at the same time they are said to be in phase. However, if the sine waves start at different times they are said to be out of phase. One common example phase
Fig. 4.33 Fast Fourier Transform (FFT) of the A4 piano sound wave in Fig. 4.32
differences is two sine waves that start at different times. If the time difference results in a 90° phase shift, the out-of-phase wave is actually a cosine wave. As described earlier, the collection of sine waves of different amplitudes and frequencies that describe a signal is called a Fourier series. A Fourier series can be analyzed in detail to determine the frequencies and corresponding amplitudes that make up a signal. This analysis is called a Fourier transform. A Fourier transform of the signal in Fig. 4.32 is shown in Fig. 4.33. It can be seen that the dominant fundamental frequency is 440 Hz, which corresponds to the A4 key. In addition, there are eight harmonics that are nearly perfect integer multiples.
4.13
Short Time Fourier Transform (STFT)
The A4 piano signal shown in Fig. 4.32 changes with time. The amplitude is largest when the key is pressed and the hammer strikes the piano strings. The amplitude of the signal then decays over time due to damping. In addition, the shape of the waveform demonstrates some other variations over time due to other sources such as the harmonics and reverberation. The Fourier transform results shown in Fig. 4.33 encapsulate the whole signal and do not distinguish changes in the signal with time. A short time Fourier transform (STFT) can be performed to analyze how the amplitude of the fundamental frequency and harmonic frequencies change with time. To do this, the original signal is broken up into many short time segments by overlaying a window function on the signal being analyzed, such as shown in Fig. 4.34. This results in only the part of the signal visible through the window being analyzed. Common window functions include the square wave or a Hann
Fig. 4.34 Signal with square window function and a Hann window function overlaid
window function. The square window function is easy to implement and visualize, but can result in abrupt transitions from window to window. Other functions such as the Hann window function are tapered at the boundaries and provide a smoother transition between segments of the STFT. After the short time Fourier transform has been performed on each window segment, the results for each segment can be plotted as time-frequency-amplitude (Fig. 4.35). When plotted in two dimensions, the color represents the amplitude of the signal. The STFT results show how the frequency content and amplitude of the signal change with time. As shown in Fig. 4.35, the fundamental frequency has a higher amplitude than the harmonic frequencies, but the amplitude of the fundamental frequency and the harmonic frequencies decays over time. A piano note can be synthesized on a computer, but the question is "will it sound as good as the original?". To synthesize a note, the computed frequencies and amplitudes from the Fourier transform are re-combined. If only the frequencies and amplitudes from the Fourier transform are combined, the resulting synthesized note would be uniform and "tonal". To make the note more realistic, the damping effects as computed from the STFT should be included (Fig. 4.36). A spring-mass-damper mechanistic model can be used to describe the wave-like behavior of a piano sound and provides a convenient way to incorporate damping effects. Figure 4.37 shows a basic spring-mass-damper model, where kn, mn, and bn are the spring constant, mass, and damping coefficient for the nth harmonic, and An is the initial amplitude for the nth harmonic. Each frequency extracted using the STFT can be modeled as a spring-mass-damper system and expressed mathematically as
Fig. 4.35 Short time Fourier transform spectrogram. The color indicates the sound amplitude. The piano note signal is plotted along the top. (See E-book for sounds and animations, Supplementary Audio 4.1)
g_n(t) = A_n e^(-b_n t) sin(ω_n t + ϕ_n)    (4.33)
where g_n is the signal for the nth harmonic (n = 0 for the fundamental frequency, n = 1 for the first harmonic, . . .), A_n is the initial amplitude for each frequency, b_n is the damping coefficient (a measure of how fast the signal reduces in amplitude due to
function generate_piano(fs,tn,a,omega)
%GENERATE_PIANO Generate a sound file from a series of sine waves
%to approximate the piano sound.
%fs: sampling rate of the sound (samples per second)
%tn: duration of the sound (s)
%a: amplitude of each sine wave, a 1-by-I vector
%omega: frequency of each sine wave (Hz), a 1-by-I vector
t=0:1/fs:tn;            %Generate the time array
n=length(a);            %Count the number of sine waves
y=zeros(size(t));       %Initialize the signal
for i=1:1:n
    y=y+a(i).*sin(2.*pi.*omega(i).*t); %Add each sine wave to the signal
end
audiowrite('piano.wav',y,fs); %Write the sound file
Fig. 4.36 Matlab code to generate a piano sound
Fig. 4.37 Spring-mass-damper mechanistic model
damping), ω_n is the frequency (ω_0 is the fundamental frequency, ω_1 is the first harmonic, . . .), and ϕ_n is the phase angle for each frequency. The damping coefficients, b_n, for the fundamental frequency and each harmonic can be obtained by performing an optimization between the STFT data for each frequency (Fig. 4.38) and the mechanistic spring-mass-damper model (the fundamental frequency and first three harmonics are shown in Fig. 4.39). The optimization algorithm is shown below. If the damping is not included, the resulting sound is "tonal" like an alert sound, instead of musical. The results of the optimization are tabulated in Table 4.2. Finally, using this data, the A4 piano signal is reproduced using Matlab. The overlay of the original piano sound and the reproduced piano sound is shown in Fig. 4.40. Consult the E-book to hear the audio files for additional comparison. The optimization algorithm for determining the mechanistic coefficients was based on minimizing the sum of the squared differences between the actual piano signal, f_PIANO, and the piano signal based on the mechanistic features, f_MATLAB.
Fig. 4.38 Short time Fourier transform of the Yamaha A4 piano note plotted in three dimensions
Fig. 4.39 Spring-mass-damper mechanistic model of the fundamental frequency and three harmonics for the A4 piano note
minimize L_Loss = (1/N) Σ_{n=1}^{N} ( f_PIANO(t_n) - f_MATLAB(t_n) )^2    (4.34)
This was done for each frequency extracted using the STFT in order to determine the damping coefficient, b_n. The phase angle was not extracted from the STFT results and was randomly assigned. Code for the feature extractor is provided in the E-book (Supplementary File 4.3).
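The feature extraction described above can be sketched in Python as follows. This is not the Supplementary File 4.3 code; it is an illustrative example that uses scipy.signal.stft with a Hann window and scipy.optimize.curve_fit as a stand-in for the optimization of Eq. (4.34), applied to a synthetic damped 440 Hz tone whose amplitude and damping roughly mimic the fundamental row of Table 4.2.

import numpy as np
from scipy.signal import stft
from scipy.optimize import curve_fit

fs = 11020
t = np.arange(0, 2.0, 1.0 / fs)
# Stand-in for the recorded A4 signal: a 440 Hz tone with exponential decay.
signal = 0.036 * np.exp(-0.85 * t) * np.sin(2 * np.pi * 440 * t)

# Short time Fourier transform; scipy's default window is a Hann window.
f, seg_times, Z = stft(signal, fs=fs, nperseg=2048)
amp = np.abs(Z)

# Track the amplitude of the frequency bin closest to 440 Hz over time.
idx = np.argmin(np.abs(f - 440.0))
envelope = amp[idx, :]

# Fit A * exp(-b t) to the envelope; b estimates the damping coefficient.
def decay(tt, A, b):
    return A * np.exp(-b * tt)

popt, _ = curve_fit(decay, seg_times, envelope, p0=[envelope.max(), 1.0])
# The absolute STFT amplitude depends on the window scaling; the decay rate b is what matters.
print(f"Estimated damping coefficient b = {popt[1]:.3f} 1/s")

Repeating the fit for each harmonic bin gives one damping coefficient per frequency, which is the mechanistic feature set used to re-synthesize the note.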
Table 4.2 Mechanistic data for the fundamental frequency and six harmonics based on STFT and spring-mass-damper mechanistic model

Type          Frequency (Hz)   Initial amplitude   Damping coefficient   Phase angle (rad)
Fundamental   440              3.610E-02           8.503E-01             2.966E-01
Harmonics     881              1.555E-02           1.233E+00             1.830E+00
              1342             5.735E-04           1.611E+00             1.554E+00
              1769             5.342E-03           2.067E+00             2.074E-01
              2219             1.039E-02           2.514E+00             1.901E+00
              2673             6.800E-03           2.577E+00             9.385E-02
              3133             1.078E-02           2.744E+00             1.646E+00
Fig. 4.40 Original authentic Yamaha A4 piano sound signal and the signal reproduced from the STFT and mechanistic model. (Audio available with E-book, Supplementary Audios 4.1 and 4.2)
Chapter 5
Knowledge-Driven Dimension Reduction and Reduced Order Surrogate Models
Abstract This chapter focuses on the knowledge-driven dimension reduction aspect of mechanistic data science. Two types of dimension reduction methods are introduced in this chapter: clustering and reduced order modeling. Clustering aims to reduce the total number of data points in a dataset by grouping similar data points into clusters. The datapoints within a cluster are considered to be more like each other than datapoints in other clusters. There are multiple methods and algorithms for clustering. In the first part of this chapter, three clustering algorithms are presented, ranging from entry level to advanced level: Jenks natural breaks, k-means clustering, and the self-organizing map (SOM). Clustering is a form of dimension reduction that reduces the total number of data points. In the second part of this chapter, Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) will be introduced as reduced order modeling techniques that reduce the number of features by eliminating redundant and dependent features, leading to a new set of principal features. The resulting model is called a reduced order model. Proper Generalized Decomposition (PGD) is a higher order extension of PCA and will also be introduced. Keywords Dimension reduction · K-means clustering · Self-organizing map (SOM) · Reduced order surrogate model · Singular value decomposition (SVD) · Principal component analysis (PCA) · Proper generalized decomposition (PGD) · Spring mass system · Variance · Covariance · Modal superposition · Eigenvalues · Eigenvectors
Supplementary Information: The online version of this chapter (https://doi.org/10.1007/978-3030-87832-0_5) contains supplementary material, which is available to authorized users. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 W. K. Liu et al., Mechanistic Data Science for STEM Education and Applications, https://doi.org/10.1007/978-3-030-87832-0_5
5.1
Introduction
Clustering (sometimes called cluster analysis) is a general task of grouping a set of datapoints so that datapoints within the same group (called a cluster) are more similar to each other than datapoints in other clusters. Cluster analysis originated in the field of anthropology with Driver and Kroeber in 1932 [1] and was later introduced to psychology by Joseph Zubin in 1938 [2] and Robert Tryon in 1939 [3]. It was famously used by Cattell beginning in 1943 [4] for trait theory classification in personality psychology. In 1967, George Frederick Jenks proposed the Jenks optimization method [5], also called the Jenks natural breaks classification method, which is a data clustering method designed to determine the best arrangement of values into different classes. The term "k-means" was first used by James MacQueen in 1967 [6]. In recent years, many clustering methods have been developed, including k-medians clustering [7], hierarchical clustering (1977) [8], and the self-organizing map (2007) [9]. Reduced order modeling is a second type of dimension reduction technique that aims to reduce the number of features by eliminating redundant and dependent features and obtaining a set of principal features. Two classical and highly related dimension reduction methods, Singular Value Decomposition (SVD) and Principal Component Analysis (PCA), are introduced in this chapter. SVD was originally developed by differential geometry researchers Eugenio Beltrami and Camille Jordan in 1873 and 1874, respectively [10]. PCA was independently invented by Karl Pearson (1901) [11] and Harold Hotelling (1933) [12]. The resulting model after dimension reduction is called a reduced order model (ROM). Proper Generalized Decomposition (PGD) (2006) [13] is introduced as a generalized ROM that is a higher order extension of PCA and SVD.
5.2 Dimension Reduction by Clustering
5.2.1 Clustering in Real Life: Jogging
The analysis of jogging performance is used as an application of clustering. Figure 5.1 below shows a sample of some jogging data collected by smartphone apps. The data include the jogging date, distance, duration, and the number of days since the last workout. The days since last workout vs. distance is plotted. The data were clustered into five groups using a k-means clustering algorithm (to be discussed later in this chapter), with each cluster indicated by a different color. Clustering allows the identification of similar workout sessions and their relationship to overall performance. For example, the days since last workout vs. distance shows that the jogging distance was shortest after a large number of rest days (Cluster 3 in Fig. 5.1), but also after only a few days of rest. The longest jogging distances were achieved when there were approximately 6 days since the last workout
Fig. 5.1 The jogging workout session dataset clustered into five distinct groups
Fig. 5.2 Classifying four diamonds with known prices
(Cluster 5 in Fig. 5.1). Therefore, clustering analysis shows that resting 5–7 days between workouts results in the best performance in terms of jogging distance.
5.2.2
Clustering for Diamond Price: From Jenks Natural Breaks to K-Means Clustering
Jenks natural breaks is a very intuitive data clustering algorithm proposed by George Frederick Jenks in 1967 [14]. “Natural breaks” divides the data into ranges that minimize the variation within each range. To illustrate the idea of the Jenks natural breaks, consider four diamonds shown in Fig. 5.2. The goal of the clustering is to
form two groups or clusters from these four diamonds based on their prices, i.e., $300, $400, $1000, and $1200. The first step of the Jenks natural breaks is calculating the "sum of squared deviations from array mean" (SDAM) based on the data array (ordered from smallest to largest):

array = [300, 400, 1000, 1200]    (5.1)

The mean of the array can be computed first:

mean = (300 + 400 + 1000 + 1200)/4 = 725    (5.2)

SDAM is the sum of squared deviations from this mean (proportional to the variance of the data):

SDAM = (300 - 725)^2 + (400 - 725)^2 + (1000 - 725)^2 + (1200 - 725)^2 = 587,500    (5.3)

The second step is to compute all possible range combinations, calculate the "sum of squared deviations from class means" (SDCM) for each, and identify the smallest one. For this diamond price data, there are three range combinations:

1. group 1: [300] and group 2: [400, 1000, 1200]

SDCM = (300 - 300)^2 + (400 - 867)^2 + (1000 - 867)^2 + (1200 - 867)^2 = 346,667    (5.4)

2. group 1: [300, 400] and group 2: [1000, 1200]

SDCM = (300 - 350)^2 + (400 - 350)^2 + (1000 - 1100)^2 + (1200 - 1100)^2 = 25,000    (5.5)

3. group 1: [300, 400, 1000] and group 2: [1200]

SDCM = (300 - 567)^2 + (400 - 567)^2 + (1000 - 567)^2 + (1200 - 1200)^2 = 286,667    (5.6)

Based on this calculation, combination 2 has the smallest SDCM, implying that it is the best clustering.
Fig. 5.3 Clustering of diamonds based on Jenks Natural Breaks method
The final step is to calculate the "goodness of variance fit" (GVF) [15], defined as

GVF = (SDAM - SDCM) / SDAM    (5.7)
GVF ranges from 1 (perfect fit) to 0 (poor fit). The GVF for range combination 2 is

(587,500 - 25,000) / 587,500 = 0.96    (5.8)
The GVF for range combination 1 is

(587,500 - 346,667) / 587,500 = 0.41    (5.9)
In this data set, the range combination group 1: [300, 400] and group 2: [1000, 1200] is best because it has the lowest SDCM of all possible combinations and has a GVF close to 1. The two groups are shown in Fig. 5.3 highlighted with different colors (red and blue). The goal of the Jenks natural breaks (and many other clustering methods) is to minimize the SDCM. A lower SDCM equates to better clustering because SDCM is related to the distance of the data points from the means of their clusters. Thus, a lower SDCM indicates that the data points within a cluster are close together and more similar. For example, Fig. 5.4 shows two different clustering results for the diamond dataset. Figure 5.4(a) has an SDCM of 25,000 while Fig. 5.4(b) has an SDCM of 286,667 (the calculations for the mean and SDCM are also marked in the figure). It can be noticed that the data points are closer to the means (marked as stars with different colors for different clusters) in Fig. 5.4(a) than in Fig. 5.4(b). From this example, it is evident that the Jenks natural breaks method works well for 1D data. However, it is inefficient because it has to go through all possible range combinations. For example, clustering 254 data points into six clusters has C(253,5) = 8,301,429,675 possible range combinations. It is very time consuming to calculate the means and SDCMs for more than eight billion combinations.
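The brute-force nature of the method can be seen in a short Python sketch (an illustrative example, not the book's code) that enumerates the contiguous two-group splits of the four diamond prices and evaluates SDAM, SDCM, and GVF, reproducing the numbers in Eqs. (5.3)-(5.9).

import numpy as np

prices = np.array([300, 400, 1000, 1200])
sdam = np.sum((prices - prices.mean()) ** 2)      # Eq. (5.3): 587,500

def sdcm(groups):
    # Sum of squared deviations from each class mean.
    return sum(np.sum((g - g.mean()) ** 2) for g in groups)

# For 4 ordered prices split into 2 contiguous groups, the break can sit after index 1, 2, or 3.
best = None
for brk in range(1, len(prices)):
    groups = [prices[:brk], prices[brk:]]
    score = sdcm(groups)
    gvf = (sdam - score) / sdam
    print([g.tolist() for g in groups], f"SDCM={score:.0f}, GVF={gvf:.2f}")
    if best is None or score < best[0]:
        best = (score, groups)

print("Best split:", [g.tolist() for g in best[1]])

For larger datasets the number of splits explodes combinatorially, which is exactly the inefficiency discussed above.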
Fig. 5.4 Visualization of two different clustering results: (a) [300, 400] and [1000, 1200] and (b) [300, 400, 1000] and [1200]. Each box indicates a cluster. The stars are the means of each cluster
Where does C(253,5) = 8,301,429,675 come from? Recall: Stars and Bars Combinatorics. Suppose n stars are to be divided into k distinguishable groups, in which each group contains at least one star. Additionally, the stars have a fixed order. k - 1 bars are needed to divide n stars into k groups. See the example below of seven stars divided into 3 groups: ★|★|★★★★★ (n = 7, k = 3). The position of the bars between the stars matters. There are n - 1 gaps between the stars, each of which may or may not contain a bar. Thus, this becomes a simple combination problem, in which the total number of possibilities for choosing which k - 1 gaps out of n - 1 gaps contain a bar can be calculated. This can be represented as C(n - 1, k - 1), or (n - 1)! / ((k - 1)!(n - k)!). To overcome the limitation of the Jenks natural breaks, a more efficient and widely used clustering method, called k-means clustering, is introduced. This clustering method divides the data points into k clusters (hence the "k" in k-means) and
Fig. 5.5 The initial means are randomly assigned at the first iteration
finds the center point for each of the clusters. The center points are moved around, and some points are moved from cluster to cluster until the tightest collection of data points is found for each cluster center point (or mean). When written mathematically, the goal of k-means clustering is to minimize the SDCM

argmin_{μj} Σ_{j=1}^{K} Σ_{xi ∈ Sj} ‖xi - μj‖²    (5.10)
where xi is data point i in the jth cluster Sj, μj is the center point (or mean) of the cluster Sj, ‖·‖² is the squared distance, and K is the number of clusters. The same diamond data from the Jenks natural breaks example is used to explain k-means clustering with two clusters.

1. For the first iteration, the center points of the two clusters are randomly assigned, for example initial means = [1100, 1200] (marked as stars in Fig. 5.5).
2. The distance of each data point to each of the K means is calculated:

Distance to mean 1 = [800, 700, 100, 100]    (5.11)
Distance to mean 2 = [900, 800, 200, 0]    (5.12)

3. Based on the distance to the means, each data point is labeled as belonging to the nearest mean:

Label = [1, 1, 1, 2]    (5.13)

4. The means for the groups of labeled points can then be updated:

means = [567, 1200]    (5.14)

5. The updated means in Step 4 are compared with the previous iteration. If the difference is less than a specified tolerance, the optimum clustering is found. If not, the algorithm returns to step 2.
Fig. 5.6 The cluster means at the second iteration
Fig. 5.7 The procedure of the k-means clustering

Table 5.1 Comparing k-means clustering for one-, two-, and three-dimensional data

            1D              2D                    3D
Data        x               x = (x1, x2)          x = (x1, x2, x3)
Mean        y = Σx / N      y = Σx / N            y = Σx / N
Distance    |x - y|         ‖x - y‖               ‖x - y‖

x = data point, N = number of data points, y = mean. Bold indicates a vector
For this example, only three iterations are needed for the converged result of means = [350, 1100] (see Fig. 5.6). Note that k-means clustering ends up with the same result as the Jenks natural breaks, but k-means clustering is much more efficient. In summary, the procedure of the k-means clustering is shown in Fig. 5.7.
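The same two-cluster example can be reproduced with a short Python sketch using scikit-learn's KMeans (one possible implementation, not the book's supplementary code), assuming scikit-learn is available.

import numpy as np
from sklearn.cluster import KMeans

prices = np.array([[300], [400], [1000], [1200]])   # 1D data as a column

# Two clusters; n_init repeats the random initialization and keeps the best result.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(prices)

print("labels:", km.labels_)                           # cluster label of each point (numbering may vary)
print("cluster means:", km.cluster_centers_.ravel())   # the two means, 350 and 1100 (order may vary)
print("SDCM (inertia):", km.inertia_)                  # 25000.0, matching Eq. (5.5)

The inertia_ attribute is exactly the SDCM of Eq. (5.10), so the result agrees with the Jenks natural breaks clustering above.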
5.2.3
K-Means Clustering for High-Dimensional Data
Higher-dimensional data can be clustered very similarly using k-means clustering. The dimension of the data is determined by the number of features. Table 5.1 lists the differences in clustering for 1D, 2D, and 3D data. It can be seen that for higher dimensional clustering the data points and the means are vectors instead of scalars.
Fig. 5.8 Diamond dataset is clustered into four groups based on carat and price. There are 539 data points used for the clustering
5.2.3.1
Example: Clustering of Diamonds Based on Multiple Features
The diamond dataset from Chap. 2 is clustered based on multiple key features. Initially, two key features (price and carat) are used, and the data is divided into four clusters. The results of this 2D clustering are shown in Fig. 5.8. Based on the k-means clustered data, the clusters of high-end diamonds (black cluster) and economical diamonds (yellow cluster) can be identified. Additional features can be used for clustering. Clustering for 3D data follows a similar method. Figure 5.9 shows the data for 2695 diamonds clustered based on price, clarity, and carat.
5.2.4
Determining the Number of Clusters
Before performing k-means clustering, the ideal number of clusters should be determined to properly represent the data. This can be accomplished using the "elbow method". This method is based on the goodness of variance fit (GVF) as discussed previously. The GVF is defined as

GVF = (SDAM - SDCM) / SDAM    (5.15)
Fig. 5.9 Diamonds are clustered into four groups based on price, clarity, and carat. There are 2695 data points used for the clustering. The order of the colors is randomly generated by the algorithm. It might be different for each run
Fig. 5.10 Using the elbow method on GVF vs. the number of clusters to determine the best number of clusters
where GVF ranges from 1 (perfect fit) to 0 (poor fit), SDAM is the sum of squared deviations from the array mean, and SDCM is the sum of squared deviations from the class means. If GVF is graphed versus the number of clusters, an "elbow" will appear in the curve where increasing K begins to have a diminishing effect on the GVF. For example, as shown in Fig. 5.10 the elbow appears to occur at K = 3 or 4 for the diamond data.
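One way to generate such an elbow curve is sketched below in Python using scikit-learn (an illustrative example, not the book's code). The array X stands in for the diamond feature matrix; KMeans' inertia_ attribute plays the role of the SDCM, so the GVF of Eq. (5.15) follows directly.

import numpy as np
from sklearn.cluster import KMeans

# X is assumed to be an (n_samples, n_features) array of diamond features,
# e.g., columns for carat and price; here a small random stand-in is used.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

sdam = np.sum((X - X.mean(axis=0)) ** 2)     # total sum of squared deviations

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    gvf = (sdam - km.inertia_) / sdam        # inertia_ plays the role of SDCM
    print(f"K={k}: GVF={gvf:.3f}")
# Plotting GVF vs. K and looking for the "elbow" suggests the number of clusters.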
It is noted that the elbow method is somewhat subjective because the elbow occurs in a region and not a precise location. This is one of the limitations of k-means clustering. Other limitations of the k-means clustering will be discussed in the next section.
5.2.5
Limitations of K-Means Clustering
Although k-means clustering is a widely used way to reduce data dimension, there are a few notable limitations. First, the user must specify the number of clusters, K. Some methods such as the elbow method can be used to determine the best number of clusters, but most of them rely on visual inspection and the user's judgment rather than being automated by the method itself. This makes it difficult to completely automate the clustering process. Next, k-means clustering can only handle numerical or categorical data. It cannot handle images or words. Finally, k-means clustering assumes that we are dealing with spherical clusters and that each cluster has roughly an equal number of observations. Those limitations can be overcome (or partially overcome) by other types of advanced clustering methods such as hierarchical clustering/dendrograms [16] and self-organizing maps (SOM) [17], in which the number of clusters does not need to be specified before the clustering. The SOM and its applications will be introduced in the next section as an advanced topic.
5.2.6
Self-Organizing Map (SOM) [Advanced Topic]
The self-organizing map (SOM) is an unsupervised machine learning algorithm that is able to map high-dimensional data to two-dimensional (2D) planes while preserving topology [18]. The main advantage of the SOM is that it can visualize highdimensional data in the form of a low-dimensional map, which helps researchers to visually identify underlying relations between the features. As a tool to visualize high-dimensional datasets, the SOM is beneficial for the cluster analysis of engineering design problems as well. The goal of a self-organizing map (SOM) is to reduce the dimension of data points by representing the topology of the data with fewer points, using a map of neurons, and converting the dataset into two dimensions. The topology refers to the relative distance between points, meaning that there will be more neurons in areas where data is more condensed. These neurons can be treated as mini-clusters. In the map, each neuron is connected to its surrounding neurons, which are referred to as “neighbors.” Because the map preserves the topology of the data, neighboring neurons will describe a similar number of data points. The maps are always two-dimensional, even when used on higher dimensional data, allowing us to assess the clustering of higher dimensions in just two dimensions.
Fig. 5.11 An illustrative 8*8 SOM
The goal of the SOM is to put "similar" datapoints into the same SOM neuron (mini-cluster), and the weight Wij of each neuron represents the average of the included datapoints. For example, an illustrative 8 × 8 SOM is shown in Fig. 5.11. The labels of the X and Y axes are the integers 1 ≤ i ≤ SizeX and 1 ≤ j ≤ SizeY, respectively. To achieve a convergent result, SOM uses competitive learning, rather than error-correction learning (like back propagation [19]). The training procedure can be described by the pseudo code shown in Fig. 5.12. Initially, the weights of the neurons W_ij are set to random numbers in [0, 1]. The elements of each input vector are normalized linearly to [0, 1]. After initialization, the SOM is trained for a number of T epochs. For the current epoch t and each input vector xm, first, the best matching unit (BMU) is determined by calculating the distance between the input vector and each neuron weight using Eq. (5.1) in Fig. 5.12. The BMU is the map unit that has the shortest distance to the input vector xm. Second, the diameter of the neighborhood around the BMU is determined by Eq. (5.2) in Fig. 5.12, where d(t) is a function that decreases monotonically with time. The initial distance coefficient is d0, and the decrease rate is λ. Third, the weights W_ij in the BMU and its neighborhood are updated according to Eqs. (5.3) and (5.4) in Fig. 5.12. In Eqs. (5.3) and (5.4), h_BMU,i,j represents the Gaussian kernel function, where α(t) is a learning rate parameter, r_i,j is the position of each unit, and r_BMU is the position of the BMU. The current epoch is finished after all the xm have been presented to the SOM. By selecting a large enough number of epochs, for example 100, the SOM can converge. When the training is finished, the map can reorder the original datasets while preserving the topological properties of the input space.
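Because the pseudo code of Fig. 5.12 is not reproduced here, the following minimal Python sketch shows one way the training loop can be written. The function name train_som and the exponential decay schedules for the neighborhood radius d(t) and the learning rate α(t) are illustrative assumptions and may differ in detail from Eqs. (5.1)-(5.4) in Fig. 5.12.

import numpy as np

def train_som(X, size_x=8, size_y=8, epochs=100, d0=4.0, lam=30.0, alpha0=0.5):
    """Minimal SOM trainer: X is (n_samples, n_features), already scaled to [0, 1]."""
    rng = np.random.default_rng(0)
    W = rng.random((size_x, size_y, X.shape[1]))              # random initial weights in [0, 1]
    grid = np.stack(np.meshgrid(np.arange(size_x), np.arange(size_y),
                                indexing="ij"), axis=-1)      # (i, j) position of each unit
    for t in range(epochs):
        d_t = d0 * np.exp(-t / lam)                           # shrinking neighborhood radius
        a_t = alpha0 * np.exp(-t / lam)                       # decaying learning rate
        for x in X:
            dist = np.linalg.norm(W - x, axis=2)              # distance of x to every unit weight
            bmu = np.unravel_index(np.argmin(dist), dist.shape)   # best matching unit
            grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
            h = np.exp(-grid_d2 / (2 * d_t ** 2))             # Gaussian neighborhood kernel
            W += a_t * h[..., None] * (x - W)                 # pull BMU and neighbors toward x
    return W

After training, each input vector is assigned to its best matching unit, which acts as a mini-cluster, and the component planes of W can be visualized as in Fig. 5.14.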
Fig. 5.12 SOM algorithm
5.2.6.1
An Engineering Example: Data-Driven Design for Additive Manufacturing Using SOM
The SOM algorithm is demonstrated for visualizing high-dimensional data in additive manufacturing (AM). These data are obtained from well-designed experimental measurements and multiphysics models. The SOM is introduced to find the relationships among process-structure-properties (PSP) in AM, including laser power, mass flow rate, energy density, dilution, cooling rate, dendrite arm spacing, and microhardness. These data-driven linkages between process, structure, and properties have the potential to benefit online process monitoring control in order to derive an ideal microstructure and mechanical properties. In addition, the design windows of process parameters under multiple objectives can be obtained from the visualized SOM. A schematic diagram of this work is shown in Fig. 5.13. Sixty single-track AM experiments using various process parameters were conducted for data generation. Materials characterization in this case included cooling rate measurements, dilution measurements, dendrite arm spacing measurements, and hardness testing. In addition, a computational thermal-fluid dynamics (CtFD) model was developed to simulate the AM process. In total, 25 simulation cases with various laser power levels and mass flow rates were computed. For each case, the structures and properties observed were the melt pool geometry, dilution,
Fig. 5.13 A schematic description of the workflow typically employed in current computational efforts (top row) and of experimental efforts (bottom row), along with a description of how this can be augmented with a data-mining approach to recover high-value PSP linkages of interest to material innovation efforts
cooling rate, secondary dendrite arm spacing (SDAS), and microhardness. Details of experiments and simulations can be found in Refs. [20, 21]. The SOM Toolbox in Matlab [22] was used to simultaneously visualize high-dimensional datasets and design process parameters. Using physics-based simulations and experimental measurements, a seven-dimensional (7D) AM dataset was obtained for data mining, including laser power, mass flow rate, energy density, cooling rate, dilution, SDAS, and microhardness. The Matlab code for SOM is shown below. The file 'Training data from AM (all).xlsx' includes the 7D AM dataset.
Matlab code for SOM:
Dataset=xlsread('Training data from AM (all).xlsx','data'); %Read the 7D dataset
SOM = selforgmap([8 8],100,6);  %Create an 8-by-8 self-organizing map
SOM = train(SOM, Dataset');     %Train the map (columns of Dataset' are samples)
view(SOM);                      %Visualize the trained network
y = SOM(Dataset');              %Map each sample to a neuron
classes = vec2ind(y);           %Convert neuron outputs to cluster indices
The simulation and experimental data points were used together as input vectors to train a single 8 × 8 SOM, without distinguishing between the two sources. It is found that the 8 × 8 SOM has the best
Fig. 5.14 Contour plots of all design variables with the optimized design window outlined by a white wireframe
performance. If the map size is too small, the map resolution is very low; however, an SOM that is too large results in overfitting. Trained SOMs are shown in Fig. 5.14. The relation between the PSP variables can be understood visually. For example, the mass flow rate and SDAS are positively correlated, as the component planes of the mass flow rate and SDAS have similar values at similar positions. Conversely, the mass flow rate and dilution are highly negatively correlated. Thus, according to the visualized SOM in Fig. 5.14, several results are obtained: (1) the mass flow rate, more than laser power, greatly contributes to the cooling rate and SDAS; (2) the dilution and microhardness depend on both the mass flow rate and laser power; (3) the microhardness is dominated by the dilution, rather than by the SDAS or cooling rate. Obtained through the data-mining approach, these relations provide valuable insight into the complex underlying physical phenomena and material evolution during the AM process. In addition, it is possible to obtain the desired process parameter window with multiple objective microstructure and property ranges. In this study, the objective dilution is from 0.1 to 0.3. In this range of dilution, the solidified track can avoid both lack of fusion due to low dilution and property degradation due to high dilution [23]. The SDAS should be minimized and the microhardness should be maximized in order to maintain good mechanical properties. An iteration procedure through all the units is undertaken in order to seek units that satisfy these restrictions. An objective cluster that includes multiple units can be selected as a white wireframe, as shown in Fig. 5.14. Thus, the following desired process parameters can be obtained: a laser power ranging from 1000 W to 1100 W and a mass flow rate ranging from 22.4 g min^-1 to 24.8 g min^-1. The desired energy density, which is defined as laser power divided by mass flow rate, ranges from 2.4 × 10^6 J kg^-1 to 2.9 × 10^6 J kg^-1. The SOM approach can be applied to a broad variety of PSP datasets for AM and other data-intensive processes. Data-
driven relationships between process, structure, and property can provide online monitoring and process control to derive ideal microstructure and mechanical properties.
5.3 Reduced Order Surrogate Models
5.3.1 A First Look at Principal Component Analysis (PCA)
Principal component analysis (PCA) is a dimension reduction or reduced order modeling technique that seeks to determine the directions or components with maximum variability. This can be explained conceptually by considering the images of the teapot in Fig. 5.15. After observing the teapot from all angles, it can be seen that the orientation showing the most variation is the one that goes from the handle to the spout (see the direction with the blue arrow). The directions showing the second and third most variation are bottom to top (see red arrow), and from handle to spout viewed from the top (see green arrow) in Fig. 5.15. These directions are vectors, and
Fig. 5.15 Images of the teapot used to conceptually explain principal components
Table 5.2 Steel material property data [24]

Steel material              Elongation   Ultimate Tensile Strength (UTS) (MPa)
ASTM A36                    0.3          350
API 5L X52                  0.21         450
High strength alloy steel   0.18         760
Boron steel                 0.07         1500
the technical terms used to describe the directions of these vectors are principal components. In this example, the blue arrow is the first principal component. The principal components for a set of data can be computed and printed using the following Python code (assume 3 principal components are desired):
Python code for PCA:
from sklearn.decomposition import PCA
pca = PCA(n_components=3, svd_solver='auto')
coordinates = pca.fit_transform(B)
print("principal components=\n", pca.components_)
Principal component analysis can be demonstrated using the steel material property data shown in Table 5.2. There are four datapoints in the dataset (i.e., four kinds of steel). Each datapoint includes two features that are material properties (elongation and UTS) of the steel. A reduced order model including only one feature can be constructed using PCA. Based on the principal component (the orientation indicating the most variation) computed from PCA, the data points can be projected to the principal component that is treated as a new coordinate. This creates a new reduced order model to represent the original dataset. The procedure of PCA is

1. calculate the mean of each feature of the data:

Elongation: average = 0.19    (5.16)
UTS: average = 766.25    (5.17)

2. normalize the data for each feature around the center point and put into a matrix B

B = [0.30 - 0.19  350 - 766.25; 0.21 - 0.19  450 - 766.25; 0.18 - 0.19  760 - 766.25; 0.07 - 0.19  1500 - 766.25]
  = [0.11  -416.25; 0.02  -316.25; -0.01  -6.25; -0.12  733.75]    (5.18)
Fig. 5.16 The first principal component of the steel material properties
3. compute the eigenvalues and eigenvectors for the covariance matrix of B (this step requires linear algebra, see Sect. 5.4)

λ = [λ1; λ2] = eigenvalues = [2.70e+05; 9.20e-04]    (5.19)

P = [p1  p2] = eigenvectors = [-1.73e-04  0.999; 0.999  1.73e-04]    (5.20)

where the eigenvectors are the principal components (e.g., the arrows of the teapot in Fig. 5.15). The magnitude of each principal direction is indicated by its eigenvalue. Inspection of the eigenvalues shows that the first eigenvalue is nine orders of magnitude larger than the second. The first principal component for the steel material properties is computed and plotted in Fig. 5.16. The blue points are the original raw data, and the red arrow represents the first principal component considering only the first eigenvalue. A reduced order model R can be created by projecting the datapoints to the new coordinate, the principal component (eigenvector):

R = B p1    (5.21)
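These three steps can be checked numerically with the short NumPy sketch below (an illustrative example, not the book's code). Because the feature means are recomputed from Table 5.2, the printed eigenvalues and eigenvectors should be close to, but may differ slightly from, Eqs. (5.19)-(5.20), and the signs of the eigenvectors may be flipped depending on the library's convention.

import numpy as np

# Steel data from Table 5.2: columns are elongation and UTS (MPa).
X = np.array([[0.30, 350.0],
              [0.21, 450.0],
              [0.18, 760.0],
              [0.07, 1500.0]])

B = X - X.mean(axis=0)                 # step 2: center each feature (Eq. 5.18)
C = np.cov(B, rowvar=False)            # 2x2 covariance matrix (uses 1/(n-1))

eigvals, eigvecs = np.linalg.eigh(C)   # step 3: eigenvalues and eigenvectors (Eqs. 5.19-5.20)
order = np.argsort(eigvals)[::-1]      # sort from largest to smallest eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

p1 = eigvecs[:, 0]                     # first principal component
R = B @ p1                             # reduced order model, Eq. (5.21)
print("eigenvalues:", eigvals)
print("first principal component:", p1)
print("projected data R:", R)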
5.3.2
Understanding PCA by Singular Value Decomposition (SVD) [Advanced Topic]
Singular value decomposition (SVD) is amongst the most important matrix factorizations and is the foundation for many other data-driven methods (such as PCA).
5.3.2.1
Recall Matrix Multiplication
The concept of matrix multiplication is important for understanding SVD and PCA. Thus, the multiplication operation of a matrix is recalled in this section. Consider a two-by-two matrix A

A = [1 2; -1 1]    (5.22)

The values in matrix A are assigned arbitrarily. The matrix A can be multiplied with a vector x = [1, 2]^T to get another vector y

y = Ax = [1 2; -1 1][1; 2] = [5; 1]    (5.23)
To illustrate the meaning of the multiplication y = Ax, this operation is plotted in Fig. 5.17. After the multiplication with matrix A, the original vector x is transformed to y. This transform can be divided into two parts: (1) rotate the vector x by angle θ,
Fig. 5.17 Illustration of a matrix multiplication
and (2) stretch the resulting vector to y. Thus, multiplying a matrix by a vector simply means two “actions”: rotating and stretching this vector.
5.3.2.2
Singular Value Decomposition
The SVD decomposes any real matrix A into three matrices (a complex matrix can also be decomposed by SVD, but in this book only real matrices are considered):

A = UΣV^T    (5.24)

where U and V are orthogonal matrices, i.e., U^T U = I and V^T V = VV^T = I, and Σ is a diagonal matrix with zero off-diagonals. The superscript T denotes the transpose of the matrix. For example, consider a matrix A with assigned values:

A = [1 2; 1 0]  →  A^T = [1 1; 2 0]    (5.25)
The SVD can be very easily implemented using Matlab or Python, as shown in the code below:
Matlab code for SVD:
[U, S, V] = svd(A);
Python code for SVD:
from numpy.linalg import svd
U, S, VT = svd(A, full_matrices=False)
To explain the matrices in the SVD, decompose a 2 × 2 matrix:

A = [1 2; -1 1]    (5.26)

Using SVD, matrix A can be decomposed into three matrices:

A = [1 2; -1 1] = [0.957 -0.29; 0.29 0.957] [2.3 0; 0 1.3] [0.29 0.957; -0.957 0.29]    (5.27)
To illustrate the meaning of the matrices U, Σ and V, a vector x is multiplied with the matrix A, which is equivalent to multiplying with the three decomposed matrices. This process is visualized in Fig. 5.18. Through the SVD, the original operation Ax can be decomposed into three separate operations: (1) V^T x, (2) ΣV^T x, and (3) UΣV^T x, which is equal to Ax. In Fig. 5.18, there is a vector x = [3, 2]^T. Operation (1) rotates the vector x to V^T x = [2.78, -2.29]^T. It is a rotation because it only changes the direction of the vector, and the magnitudes of the two vectors remain
Fig. 5.18 Visualization of Ax = UΣV^T x (A video is available in the E-book)
identical, i.e., ‖x‖ = ‖V^T x‖. Multiplying by the matrix Σ stretches the vector V^T x to ΣV^T x = [6.41, -2.99]^T, as shown in (2) in Fig. 5.18. It is an anisotropic stretching in the x and y directions. The stretching factors in the x direction and y direction are equal to the first and second diagonal entries in matrix Σ, respectively. Operation (3) is another rotation from the vector ΣV^T x to the final vector UΣV^T x = Ax = [7, -1]^T. As demonstrated by this example, the SVD decomposes the original matrix into three matrices: a "rotation" matrix U, a "stretching" matrix Σ, and another "rotation" matrix V^T. The diagonal components in matrix Σ are called singular values, representing the stretching factors along different coordinates. The singular values are typically ordered from the largest to the smallest. The column vectors in U and V are called singular vectors.
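The decomposition in Eq. (5.27) and the three operations of Fig. 5.18 can be verified with the following NumPy sketch (illustrative, not the book's code). Note that a library may return singular vectors with opposite signs, which is an equally valid decomposition.

import numpy as np

A = np.array([[1.0, 2.0],
              [-1.0, 1.0]])

U, S, VT = np.linalg.svd(A, full_matrices=False)
print("singular values:", S)                   # approximately [2.303, 1.303]

# Reconstruct A from the three factors and apply them to x = [3, 2]^T.
print("U @ diag(S) @ VT =\n", U @ np.diag(S) @ VT)

x = np.array([3.0, 2.0])
print("V^T x    =", VT @ x)                    # rotation
print("S V^T x  =", np.diag(S) @ VT @ x)       # anisotropic stretching
print("U S V^T x =", U @ np.diag(S) @ VT @ x)  # equals A x = [7, -1]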
5.3.2.3
Matrix Order Reduction by SVD Truncation
The singular values in matrix Σ are defined as σ1, σ2, . . ., σm. The rows and columns corresponding to specific singular value(s) in U, Σ, and V^T can be truncated to reduce the order of the original matrix. These truncated matrices give an approximation of the original A matrix. A simple example is used to demonstrate the SVD truncation. Consider a three-dimensional matrix A:

A = [1 2 3; 6 5 4; 8 7 9]    (5.28)
The matrix A is also equal to a product of three matrices based on SVD:

A = UΣV^T = [0.20 0.61 -0.75; 0.51 -0.72 -0.45; 0.83 0.29 0.47] [16.73 0 0; 0 2.14 0; 0 0 0.58] [0.59 0.52 0.60; -0.63 -0.15 0.75; 0.48 -0.83 0.24]    (5.29)
Since the third singular value (0.58) is smaller than the other two (16.73 and 2.14), a truncated SVD can be built by truncating the third singular value in matrix Σ and the corresponding column in matrix U (third column) and the corresponding row in matrix V^T (third row). This is called the first order reduction of the matrix A:

A ≈ A' = [0.207 0.618; 0.516 -0.727; 0.831 0.297] [16.734 0; 0 2.145] [0.595 0.527 0.608; -0.639 -0.150 0.755]
       = [1.21 1.63 3.11; 6.13 4.78 4.01; 7.87 7.23 8.93]    (5.30)
The first order reduction A' is an approximation of the original A. A further order reduction can be conducted by truncating the second singular value in matrix Σ and the corresponding column in matrix U and the corresponding row in matrix V^T:

A ≈ A″ = [0.207; 0.516; 0.831] [16.734] [0.595 0.527 0.608] = [2.06 1.83 2.11; 5.13 4.54 5.24; 8.27 7.33 8.45]    (5.31)

In this example, the matrix A″ can be represented by the product of three factors (a single column vector, a single singular value, and a single row vector), with little loss in R² [25] (see Table 5.3). In fact, after two reductions, the R² score is still above 0.9. The R² score is a metric that quantifies the similarity of two datasets (two matrices in this case). If R² = 1, the two matrices are identical. If R² = 0, the two matrices are quite different. The Mean Square Error (MSE) is defined as
Table 5.3 MSE and R² of reduced A matrices

                       Matrix                                                      MSE due to order reduction   R²
Original A             A = [1 2 3; 6 5 4; 8 7 9]                                   0                            1
1st order reduction    A ≈ A'  = [1.21 1.63 3.11; 6.13 4.78 4.01; 7.87 7.23 8.93]   0.038                        0.992
2nd order reduction    A ≈ A″ = [2.06 1.83 2.11; 5.13 4.54 5.24; 8.27 7.33 8.45]   0.549                        0.923
Fig. 5.19 A schematic of ideal spring-mass system
MSE = (1/n) Σ_{i=1}^{n} (A_i - A'_i)^2    (5.32)
where Ai is the ith component in matrix A and n is the number of components.
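A short NumPy sketch (illustrative, not the book's code) of the truncation is given below. It rebuilds the rank-2 and rank-1 approximations of the matrix in Eq. (5.28) and evaluates the MSE of Eq. (5.32) together with an R² score, computed here with scikit-learn's r2_score on the flattened matrices as one reasonable way to reproduce Table 5.3; small differences from the tabulated values come from rounding.

import numpy as np
from sklearn.metrics import r2_score

A = np.array([[1.0, 2.0, 3.0],
              [6.0, 5.0, 4.0],
              [8.0, 7.0, 9.0]])

U, S, VT = np.linalg.svd(A, full_matrices=False)

for k in (2, 1):                       # keep the k largest singular values
    A_k = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]
    mse = np.mean((A - A_k) ** 2)      # Eq. (5.32)
    r2 = r2_score(A.ravel(), A_k.ravel())
    print(f"rank-{k} approximation: MSE={mse:.3f}, R2={r2:.3f}")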
5.3.2.4
Example: Spring-Mass Harmonic Oscillator
A spring-mass system shown in Fig. 5.19 is ideally a one-dimensional system. For an ideal system, when the mass is released a small distance away from equilibrium (i.e., the spring is stretched), the mass will oscillate along the length of the spring indefinitely at a set frequency. However, the reality is that the measured results are not one dimensional because the mass swings back and forth, and the camera recording the motion is not held perfectly still. The actual three-dimensional motion is recorded using three cameras (see Fig. 5.20). This 3D motion capture enables more accurate identification of system properties such as spring constant or damping factor. Moreover, it benefits the
Fig. 5.20 A spring-mass motion example. The position of a ball attached to a spring is recorded using three cameras 1, 2 and 3. The projected position of the ball tracked by each camera is depicted in each panel
analysis of system uncertainties and sensitivities. The position of the ball tracked by each camera is depicted in each panel. A local coordinate system can be defined for each camera to record the projected displacement of the spring with respect to that coordinate system (see Fig. 5.21). The time-dependent displacement of the spring in each camera video is extracted using an open-source software, Tracker [26]. The software interface is shown in Fig. 5.22. The goal is to get one-dimensional motion data from the two-dimensional projection data collected from three different angles (different cameras). This new basis will filter out the noise and reveal the hidden structure (i.e., determine the unit basis vector along the z-axis). After the important dimension (axis) is identified, it is possible to estimate the key properties of the system, such as the damping coefficient and the effect of mass, from the collected noisy data. All the local projection data can be put into a single data matrix. The data matrix is six dimensional:

X = [xa  ya  xb  yb  xc  yc]    (5.33)

where xa and ya are local displacement vectors (each vector includes positions at different times) from the first camera, xb and yb are local displacement vectors from the second camera, and xc and yc are local displacement vectors from the third camera. The vectors are column vectors. This system can ideally be described by a single direction (i.e., a one-dimensional system). Therefore, there should be some redundancy between the six measurements. The SVD and PCA will be used to reduce the redundancy and recover the dominant one-dimensional data.
Fig. 5.21 Projection data from three different cameras. x and y are local coordinates. Shaded regions in different colors indicate the projection paths (A video is available in the E-book)
5.3.3
Further Understanding of Principal Component Analysis [Advanced Topic]
Principal component analysis (PCA) pre-processes the data by mean subtraction before performing the SVD. To apply PCA to reduce redundancy between data coordinates, a covariance matrix is required to be built from the data matrix X.
5.3.3.1
Variance and Covariance
The concepts of variance and covariance must be introduced before they can be applied. Variance describes how much a variable varies with respect to its mean. Consider two vectors, x and y (each vector's mean is assumed to be zero),
Fig. 5.22 Using an opensource software, Tracker [27], to extract local displacement of the mass from a filmed video. (A video is available in the E-book)
x = [x1, x2, . . ., xn]^T    (5.34)
y = [y1, y2, . . ., yn]^T    (5.35)

The variance of x and y can be defined as

σ_x² = (1/(n-1)) x^T x = (1/(n-1)) (x1² + x2² + . . . + xn²)    (5.36)
σ_y² = (1/(n-1)) y^T y = (1/(n-1)) (y1² + y2² + . . . + yn²)    (5.37)
Variance is proportional to the square of the magnitude of a (zero-mean) vector. Covariance is expressed very similarly to variance, which is proportional to the inner product of two vectors:
σ_xy² = (1/(n-1)) x^T y = (1/(n-1)) (x1 y1 + x2 y2 + . . . + xn yn)    (5.38)
Therefore, the covariance is positive and large if the two vectors point in the same direction, and the covariance is small (zero) if the two vectors are perpendicular. Consider the matrix of data described previously that has six features (each vector's mean is assumed to be zero):

X = [xa  ya  xb  yb  xc  yc]    (5.39)
A covariance matrix C_X can be built to find how each column varies with the others (if the covariance between the rows of the matrix is of interest, the definition of the covariance matrix is C_X = (1/(n-1)) XX^T):

C_X = (1/(n-1)) X^T X =
[ σ²_xaxa  σ²_xaya  σ²_xaxb  σ²_xayb  σ²_xaxc  σ²_xayc
  σ²_yaxa  σ²_yaya  σ²_yaxb  σ²_yayb  σ²_yaxc  σ²_yayc
  σ²_xbxa  σ²_xbya  σ²_xbxb  σ²_xbyb  σ²_xbxc  σ²_xbyc
  σ²_ybxa  σ²_ybya  σ²_ybxb  σ²_ybyb  σ²_ybxc  σ²_ybyc
  σ²_xcxa  σ²_xcya  σ²_xcxb  σ²_xcyb  σ²_xcxc  σ²_xcyc
  σ²_ycxa  σ²_ycya  σ²_ycxb  σ²_ycyb  σ²_ycxc  σ²_ycyc ]    (5.40)
It is noted that the values along the diagonal are the variances of the features (coordinates). The off-diagonals are the covariances between pairs of features. The covariance matrix is symmetric. The goal of PCA is to get a diagonal covariance matrix, because this means minimizing the covariance between different features, i.e., reducing the redundancy between different features. Specifically, the goal is to find a matrix P to transform the data matrix X to XP so that the covariance matrix of XP, i.e., C_XP, is diagonal.
5.3.3.2
Identifying Intrinsic Dimension of Spring-Mass System Using PCA/SVD
Referring to the previous example, to apply PCA for dimension reduction the data matrix has to be centered by subtracting the mean of each vector:

B = X - [1; 1; . . .; 1] x̄^T    (5.41)

Fig. 5.23 Plots of (a) 2D data matrix and (b) mean-subtracted data
where X is the data matrix and each component of x̄ is the mean of the corresponding data feature. For example, if X = [xa ya], X and B are plotted in Fig. 5.23. The principal components can be obtained by solving the eigenvalue and eigenvector problem for the covariance matrix (the eigenvectors are indeed the principal components, see Appendix A). However, this solution is sometimes computationally expensive. A more efficient way is to solve the SVD of the mean-subtracted data matrix directly and obtain the principal components, since SVD and PCA are mathematically consistent (Sect. 5.4). Thus, the mean-subtracted data matrix B is divided into three matrices by SVD:

B = UΣV^T    (5.42)
The matrix V can transform the data matrix to a new matrix with a diagonal covariance matrix, i.e., C_BV is a diagonal matrix. This means the column vectors of the matrix BV are independent (the proof is provided in Sect. 5.4). For this case, the covariance matrix of the matrix BV is

C_BV = [ 10294  0    0   0    0    0
         0      188  0   0    0    0
         0      0    55  0    0    0
         0      0    0   4.3  0    0
         0      0    0   0    4.2  0
         0      0    0   0    0    1.3 ]    (5.43)
Python code for computing the covariance matrix:
import numpy as np
from numpy.linalg import svd
U, S, VT = svd(B, full_matrices=False)
Y = B.dot(VT.T)      # project the centered data onto the principal directions
print(np.cov(Y.T))   # covariance of the transformed data (diagonal, cf. Eq. 5.43)
Based on the diagonals in the covariance matrix, it is reasonable to conclude that the spring-mass system of interest is intrinsically one dimensional, because the first diagonal is an order of magnitude larger than the others. As a comparison, the covariance matrix of the data matrix B is also shown as

C_B = [ 110   607   304   723   323   118
        607   3510  1678  3983  1786  649
        304   1678  923   2071  890   325
        723   3983  2071  4927  2114  771
        323   1786  890   2114  979   358
        118   649   325   771   358   136 ]    (5.44)
It can be seen that none of the values are zero or much smaller than the others, which means the column vectors of the data matrix are all dependent. The same procedure can be conducted to identify the intrinsic dimension of the spring-mass system. In practice, the intrinsic dimension of the spring-mass system
Fig. 5.24 Intrinsic dimension identified by covariance matrix and SVD/PCA
highly depends on the initial condition of the ball. As shown in Fig. 5.24a, if the ball is released very close to the axis of symmetry (dash-dot line in the figure), the motion of the ball is up and down along the axis of symmetry. In this case one coordinate is good enough to describe the ball's motion, i.e., the intrinsic dimension is 1D. Correspondingly, the data matrix (captured by three cameras) can be transformed so that the first diagonal of the covariance matrix is at least one order of magnitude larger than the others. However, if the ball is released a distance away from the axis of symmetry, the motion of the ball becomes complicated and is three-dimensional. In this case, the covariance matrix of the transformed data matrix is shown in Fig. 5.24b. The first three diagonals are at least one order of magnitude larger than the other diagonals, indicating that the intrinsic dimension of the system of interest is three.
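The covariance-based reasoning of this section can be sketched in Python as follows (an illustrative example, not the book's code). Six noisy channels standing in for the three cameras' x and y projections are generated from a single damped oscillation; the projection angles and noise level are made-up values. After mean subtraction and SVD, the covariance of the transformed data has one dominant diagonal entry, indicating an intrinsic dimension of one, in the spirit of Eq. (5.43) and Fig. 5.24a.

import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 500)
z = np.exp(-0.1 * t) * np.sin(2 * np.pi * 1.0 * t)      # the "true" 1D spring motion

# Each channel sees the motion scaled by a projection factor, plus measurement noise.
X = np.column_stack([np.cos(a) * z + 0.02 * rng.standard_normal(t.size)
                     for a in (0.1, 0.5, 0.9, 1.3, 1.7, 2.1)])   # six noisy channels

B = X - X.mean(axis=0)                                   # mean-subtract (Eq. 5.41)
U, S, VT = np.linalg.svd(B, full_matrices=False)
C = np.cov((B @ VT.T).T)                                 # covariance of the transformed data
print(np.round(np.diag(C), 4))   # one diagonal entry dominates -> intrinsic dimension is 1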
5.3.4
Proper Generalized Decomposition (PGD) [Advanced Topic]
5.3.4.1
From SVD to PGD
Proper generalized decomposition (PGD) is an alternative method to SVD of finding the principal components (or called modes) and reducing the dimensionality of the dataset. To introduce the mathematical concept of the PGD, the SVD is recalled first. Consider the matrix form of SVD for a three-by-three matrix A: 2 A ¼ UΣV T ¼ ½ u1
u2
σ1
0
6 u3 6 40
σ2
0
0
0
32
v1 T
3
76 7 6 T7 07 54 v2 5 σ3 v3 T
ð5:45Þ
where u1, u2 and u3 are three column vectors of matrix U, v1, v2 and v3 are three column vectors of matrix V, and σ 1, σ 2, and σ 3 are singular values. The vector form of SVD can be derived from this matrix form as 2 A ¼ ½ u1
u2
σ1
0
6 u3 6 40
σ2
0
0
0
32
v1 T
3
76 T 7 T T T 6 7 07 54 v2 5 ¼ u1 σ 1 v1 þ u2 σ 2 v2 þ u3 σ 3 v3 σ3 v3 T ð5:46Þ
where u1, u2 and u3 are orthonormal vectors and v1, v2 and v3 are principal components (they are also orthonormal vectors). The PGD follows the similar form as SVD but does not constrain the principal components to be orthonormal. For example, the three-by-three matrix A can be decomposed by PGD as
5.3 Reduced Order Surrogate Models
161
2 A ¼ f 1 g1 þ f 2 g2 þ f 3 g3 þ . . . þ f n gn ¼ ½ f 1 T
T
T
T
f2
g1 T
3
6 T7 6 g2 7 7 6 . . . f n 6 7 ð5:47Þ 6 ... 7 5 4 gn T
where the vectors f1, f2,. . . fn, and g1, g2, . . . gn are not necessarily orthonormal. That is where the name “generalized” comes from. It is noted that PGD is an iterative algorithm rather than an analytical solution like SVD, which means the first term f1 and g1 are computed first, and then f2 and g2, and so on. There are two types of PGD strategies: incremental approach [28] and modal superposition [29].
5.3.4.2
A Matrix Decomposition Example Using Incremental PGD
Consider a two-by-two matrix " A=
1
2
1
1
# ð5:48Þ
A n-order PGD approximates the original matrix by decomposing it into the form: ð5:49Þ
A ¼ f 1 g1 T þ f 2 g2 T þ f 3 g3 T þ . . . þ f n gn T Step (1) Rewrite An in a recursive form:
ð5:50Þ
An ¼ An1 þ f n gn T n1
A
¼ f 1 g1 þ f 2 g2 þ . . . þ f n1 gn1 T
T
ð5:51Þ
T
Step (2) The initial recursive term is assumed to be zero: " A ¼ 0
0 0
# ð5:52Þ
0 0
Step (3) The residual R0 is defined as the difference of A and An 2 1. For example, " R0 ¼ A 2 A = 0
1
2
1
1
#
"
0
0
0
0
#
" ¼
Step (4) The first components f1 and g1 are computed by
1
2
1 1
# ð5:53Þ
162
5 Knowledge-Driven Dimension Reduction and Reduced Order Surrogate Models
1 f 1 ¼ R0 g1 g1 T g1 1 g1 ¼ RT0 f 1 f 1 T f 1
ð5:54Þ ð5:55Þ
A random initial of g1 is set and then the above two equations are used to compute f1 and updated g1. For example, if the random initial of g1 = ½ 1 1 T (this step should be converged to the same result no matter what the initial value is assigned.), the f1 and updated g1 can be computed as 2 311 2 3 #2 30 1 1 1:5 4 5@½1 14 5A ¼ 4 5 f1 ¼ 1 1 1 1 0 0 2 3 1 2 3 2 3 1 " # 1:5 1:5 0:667 1 1 5A ¼ 4 4 5@½1:5 04 5 g1 ¼ 2 1 0 0 1:333 "
1
2
ð5:56Þ
ð5:57Þ
This process is repeated multiple times until the changes in f1 and g1 are below a certain value. That indicates the step (4) is converged. Two convergence factors, C 1fðiÞ and C g1ðiÞ , associated with f and g are defined as
norm f 1ðiÞ f 1ði 2 1Þ
C1fðiÞ ¼ norm f 1ðiÞ
norm g1ðiÞ g1ði 2 1Þ
C g1ðiÞ ¼ norm g1ðiÞ
ð5:58Þ
ð5:59Þ
where i is the iteration number. Table 5.4 shows the values of f1, g1 and convergence factors at different iterations. The iteration is finalized when the convergence factors
Table 5.4 The values and convergence factors of f1 and g1 for seven iterations i f1 g1
1 " "
1:5 0
0:667 1:333
C 1fðiÞ C g1ðiÞ
2 "
# #
"
1:5
3 "
#
0:3 0:513
#
"
1:480 0:399 0:460
#
4 "
#
"
1:472 0:430 0:443
#
5 "
#
"
1:469 0:440 0:438
#
6 "
#
"
1:469 0:440 0:436
#
7 "
#
"
1:468 0:443 0:436
1:410 1.020
1:429 0.002
1:435 0.0002
1:437 2.32e5
1:437 2.39e6
1:437 2.44e7
1.007
0.0007
0.0001
7.35e6
7.53e7
7.71e8
# #
5.3 Reduced Order Surrogate Models
163
are less than a specified value (106 in this case). Then the converged f1 and g1 can be obtained as 2 f1 ¼ 4 2 g1 ¼ 4
1:468
3 5
ð5:60Þ
0:443 0:436
3 5
ð5:61Þ
1:437
Step (5) Update the approximation with the computed components: " A = A þ f 1 g1 ¼ 1
0
T
0
0
0
0
2
#
þ4
1:468
3
"
5½0:436 1:437 ¼
0:443
0:639
2:110
0:193
0:638
#
ð5:62Þ Step (6) Update residual R1 and check the matrix convergence factor C1 " R1 = A 2 A1 =
1
2
1
1
#
"
C1 =
0:639 2:110
#
" ¼
0:193 0:638
0:361
0:110
1:193
0:362
normðR1 Þ 1:303 ¼ 0:566 ¼ 2:303 normðAÞ
# ð5:63Þ ð5:64Þ
The matrix convergence factor (typically, 10^{-6} indicates good convergence) is too large, meaning that the first-order approximation is not good enough in this case. Thus, steps (4)-(6) are repeated to compute the next components f_2 and g_2 (and the associated second-order approximation A^2). After steps (4)-(6) are repeated, the values of f_2 and g_2 are

f_2 = \begin{bmatrix} 0.126 \\ -0.416 \end{bmatrix}   (5.65)

g_2 = \begin{bmatrix} 2.871 \\ -0.871 \end{bmatrix}   (5.66)

The second-order approximation A^2 can be computed by
3 " # # 2 0:126 1:000 2:000 2 1 T 5½2:871 0:871 ¼ A =A þ f 2 g2 ¼ þ4 1:000 1:000 0:193 0:638 0:416 "
0:639 2:110
ð5:67Þ This result is very accurate as compared with the original matrix A, and the matrix convergence factor C2 ¼ 2.16 1016, which means two modes are good enough for achieving a convergence (indicated by a small enough convergence factor). This matrix deposition problem can be done by SVD. To compare the results 1 2 from SVD and PGD, the SVD result of the same matrix A ¼ can be 1 1 expressed as " ASVD ¼ ½ u1 u2
σ1 0 0 σ2
#"
v1 T
#
" ¼
v2 T
#"
0:957 0:290
#"
2:30
0
0
1:30
0:290 0:957
0:290 0:957
#
0:957 0:290 ð5:68Þ
The PGD result of the matrix A can be expressed as

A_{PGD} = \begin{bmatrix} 1.468 \\ 0.443 \end{bmatrix} [0.436 \; 1.437] + \begin{bmatrix} 0.126 \\ -0.416 \end{bmatrix} [2.871 \; -0.871] = \begin{bmatrix} 1.468 & 0.126 \\ 0.443 & -0.416 \end{bmatrix} \begin{bmatrix} 0.436 & 1.437 \\ 2.871 & -0.871 \end{bmatrix}   (5.69)
It is noted that the vectors in the PGD matrices are not normalized. They can be normalized so that the PGD result can be compared with the SVD result:

A_{PGD} = [1.53\,u_1 \;\; 0.43\,u_2] \begin{bmatrix} 1.5\,v_1^T \\ 3.0\,v_2^T \end{bmatrix} = [u_1 \; u_2] \begin{bmatrix} 1.53 × 1.5 & 0 \\ 0 & 0.43 × 3.0 \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix} = [u_1 \; u_2] \begin{bmatrix} 2.30 & 0 \\ 0 & 1.30 \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix}   (5.70)

where u_1 = [0.957, 0.290]^T, u_2 = [0.290, -0.957]^T, v_1 = [0.290, 0.957]^T, and v_2 = [0.957, -0.290]^T.
Thus, PGD can find the same principal components as SVD for the matrix decomposition problem. In practice, the order of the matrix can be higher, e.g., an n-by-n matrix.
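The incremental procedure worked through above is short enough to script directly. The following is a minimal NumPy sketch of incremental PGD for a small matrix; the function name, tolerances, and iteration limits are illustrative choices and not part of the text.

import numpy as np

def incremental_pgd(A, tol=1e-6, max_modes=10, max_iters=100):
    # Peel off rank-one modes f g^T from A one at a time (incremental PGD)
    R = A.astype(float).copy()            # current residual
    modes = []
    for _ in range(max_modes):
        g = np.ones(A.shape[1])           # initial guess for g
        f = R @ g / (g @ g)
        for _ in range(max_iters):        # fixed-point iteration for one mode
            f_new = R @ g / (g @ g)
            g_new = R.T @ f_new / (f_new @ f_new)
            converged = (np.linalg.norm(f_new - f) / np.linalg.norm(f_new) < tol and
                         np.linalg.norm(g_new - g) / np.linalg.norm(g_new) < tol)
            f, g = f_new, g_new
            if converged:
                break
        modes.append((f, g))
        R = R - np.outer(f, g)            # update the residual
        if np.linalg.norm(R) / np.linalg.norm(A) < tol:   # matrix convergence factor
            break
    return modes

A = np.array([[1.0, 2.0], [-1.0, 1.0]])
modes = incremental_pgd(A)
approx = sum(np.outer(f, g) for f, g in modes)
print(len(modes), np.round(approx, 3))    # two modes recover A, as in the hand calculation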
5.3.4.3 A PGD Example Using Modal Superposition
In the previous example using incremental PGD, the first mode (i.e., f_1 g_1^T) is completely solved first, and then the next components f_2 and g_2 are computed based on the known first mode. In contrast, modal superposition PGD updates all the modes back and forth until all the modes are unchanged. In general, modal superposition is expected to obtain the more nearly optimal decomposition, which includes a smaller number of modes than incremental PGD. However, modal superposition is typically more computationally expensive for high-dimensional problems and sometimes encounters convergence difficulties because of its global iterative character. Consider the same two-by-two matrix

A = \begin{bmatrix} 1 & 2 \\ -1 & 1 \end{bmatrix}   (5.71)
In modal superposition PGD, the number of modes is predetermined (two modes in this case):

A = f_1 g_1^T + f_2 g_2^T   (5.72)

Step (1) The initial f_2 and g_2 are assumed to be

f_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}   (5.73)

g_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}   (5.74)
Step (2) The residual R is defined as the difference of A and f_2 g_2^T:

R = A - f_2 g_2^T = \begin{bmatrix} 1 & 2 \\ -1 & 1 \end{bmatrix} - \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ -2 & 0 \end{bmatrix}   (5.75)
Step (3) The first components f_1 and g_1 are computed by

f_1 = R g_1 (g_1^T g_1)^{-1}   (5.76)

g_1 = R^T f_1 (f_1^T f_1)^{-1}   (5.77)
A random initial value of g_1 = [1, 1]^T is set, and then the above two equations are used to compute f_1 and the updated g_1. The f_1 and updated g_1 can be computed as
f_1 = \begin{bmatrix} 0 & 1 \\ -2 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \left( [1 \; 1] \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right)^{-1} = \begin{bmatrix} 0.5 \\ -1 \end{bmatrix}   (5.78)

g_1 = \begin{bmatrix} 0 & -2 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 0.5 \\ -1 \end{bmatrix} \left( [0.5 \; -1] \begin{bmatrix} 0.5 \\ -1 \end{bmatrix} \right)^{-1} = \begin{bmatrix} 1.6 \\ 0.4 \end{bmatrix}   (5.79)
Step (4) The residual R is updated as the difference of A and f_1 g_1^T:

R = A - f_1 g_1^T = \begin{bmatrix} 1 & 2 \\ -1 & 1 \end{bmatrix} - \begin{bmatrix} 0.8 & 0.2 \\ -1.6 & -0.4 \end{bmatrix} = \begin{bmatrix} 0.2 & 1.8 \\ 0.6 & 1.4 \end{bmatrix}   (5.80)
Step (5) The second components f_2 and g_2 are computed by

f_2 = R g_2 (g_2^T g_2)^{-1}   (5.81)

g_2 = R^T f_2 (f_2^T f_2)^{-1}   (5.82)
Step (6) The matrix convergence factor C_1 is computed to check convergence. The matrix convergence factor C_i at the ith iteration is defined as

C_i = norm( A - f_1 g_1^T - f_2 g_2^T ) / norm( A )   (5.83)
Thus, steps (2)-(6) are repeated for each iteration until the matrix convergence factor C_i is less than a predetermined criterion (e.g., 10^{-6}). The values of f_1, g_1, f_2, g_2, and the convergence factor for five iterations are shown in Table 5.5. The calculation reaches convergence after five iterations, so that the sum of the two modes recovers the original matrix, as shown in Eq. (5.84).

Table 5.5 The values of f1, g1, f2, g2 and convergence factors for five iterations
1 " "
0:5 1 1:6
#
2 "
#
"
0:4 " # 1 "
1 0:4
" #
1:6 0.151
"
0:412 0:912 # 1:522 0:711 1:059 1:027 0:365
1:609 0.0073
#
3 " "
#
"
#
"
0:406 0:905 # 1:517 0:722 1:061 1:028 0:363
1:609 3.33e4
#
4 " "
#
"
#
"
0:405 0:905 # 1:517 0:722 1:061 1:028 0:363
1:609 1.47e5
#
5 " "
#
"
#
"
0:405 0:905 # 1:517 0:722 1:061 1:028 0:363
1:609 6.52e7
# #
#
f_1 g_1^T + f_2 g_2^T = \begin{bmatrix} 1 & 2 \\ -1 & 1 \end{bmatrix} = A   (5.84)

5.3.4.4 PGD for High-Dimensional Tensor Decomposition
One benefit of PGD is that it can be used on high-dimensional tensor decomposition rather than only the matrix decomposition that can be handled by SVD/PCA. A tensor is typically represented as a (potentially multidimensional) array, just as a vector (i.e., a one-dimensional tensor) is represented by a one-dimensional array. The numbers in the array are denoted by indices giving their position in the array as subscripts following the symbolic name of the tensor. For example, a matrix can be treated as a two-dimensional tensor represented by a two-dimensional array. The values of this two-dimensional tensor (i.e., matrix) A could be denoted A_{ij}, where i and j are position indices. An example is shown here: a four-dimensional tensor A_{ijkl} can be decomposed by PGD as

A_{ijkl} ≈ \sum_{m=1}^{n} f_m ∘ g_m ∘ h_m ∘ p_m   (5.85)
where f_m, g_m, h_m, and p_m are vectors indexed by m, and ∘ is the outer product operation, which produces a tensor from multiple input vectors. For example, (f ∘ g ∘ h)_{ijk} = f_i g_j h_k. The number of modes is given as n. The high-dimensional PGD can be solved using the same procedure demonstrated above. The Matlab implementation of higher-dimensional PGD can be found at a website created by Northwestern post-doctoral fellow Dr. Ye Lu [30]. This code can automatically determine the optimal number of modes given an approximation accuracy.

In summary, PGD decomposes raw data into a series of 1D vectors (functions) and performs reduced order modelling for high-dimensional tensors. For a 2D tensor (i.e., a matrix), the decomposition provides the same outcome (under appropriate normalization) as SVD/PCA. PGD can be applied to a dataset that is described by a 3D, 4D, or even higher dimensional array (tensor). Therefore, PGD can be seen as a higher dimensional extension of SVD/PCA. Moreover, PGD might provide a more efficient solution for high-dimensional complex problems due to its iterative feature during the solution.
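As an illustration of the idea in Eq. (5.85) only, and not the HOPGD implementation referenced in [30], a greedy rank-one decomposition of a three-dimensional tensor can be sketched with the same alternating update used for matrices; the function name and iteration counts below are arbitrary choices.

import numpy as np

def tensor_pgd_modes(A, n_modes=3, n_sweeps=50):
    # Greedy rank-one decomposition of a 3D tensor: A ~ sum_m f_m (outer) g_m (outer) h_m
    R = A.astype(float).copy()
    modes = []
    for _ in range(n_modes):
        f, g, h = (np.ones(s) for s in R.shape)
        for _ in range(n_sweeps):                      # alternating updates of f, g, h
            f = np.einsum('ijk,j,k->i', R, g, h) / ((g @ g) * (h @ h))
            g = np.einsum('ijk,i,k->j', R, f, h) / ((f @ f) * (h @ h))
            h = np.einsum('ijk,i,j->k', R, f, g) / ((f @ f) * (g @ g))
        modes.append((f, g, h))
        R = R - np.einsum('i,j,k->ijk', f, g, h)       # subtract the captured mode
    return modes

A = np.random.rand(4, 5, 6)
modes = tensor_pgd_modes(A, n_modes=4)
approx = sum(np.einsum('i,j,k->ijk', f, g, h) for f, g, h in modes)
print(np.linalg.norm(A - approx) / np.linalg.norm(A))  # relative residual after four modes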
5.4 Eigenvalues and Eigenvectors [Advanced Topic]

Consider a real n-by-p matrix B. If the covariances between the columns of the matrix are of interest, the covariance matrix of B can be computed by
C_B = \frac{1}{n-1} B^T B   (5.86)
where n is the number of rows of the matrix B, and the superscript T denotes the matrix transpose. Since the matrix C_B is a real symmetric p-by-p matrix, i.e., C_B^T = C_B, it has real eigenvalues and eigenvectors defined by [31]
C_B = G Λ G^T = [g_1 \; g_2 \; \cdots \; g_p] \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \cdots & \cdots & \cdots & 0 \\ 0 & 0 & 0 & \lambda_p \end{bmatrix} \begin{bmatrix} g_1^T \\ g_2^T \\ \cdots \\ g_p^T \end{bmatrix}   (5.87)
where g_1, g_2, ..., g_p are eigenvectors, which are orthonormal, and λ_1, λ_2, ..., λ_p are eigenvalues. The eigenvalues and eigenvectors can be obtained by solving the following equations:

|C_B - λI| = 0   (5.88)

(C_B - λI) g = 0   (5.89)
where |∙| is the determinant of the matrix, and I is the unit matrix.
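Equations (5.86)-(5.89) are easy to check numerically with NumPy; the data matrix below is random and purely illustrative.

import numpy as np

n, p = 100, 3
B = np.random.rand(n, p)                     # n-by-p data matrix
CB = B.T @ B / (n - 1)                       # covariance matrix, Eq. (5.86)
eigvals, eigvecs = np.linalg.eigh(CB)        # real symmetric matrix: eigh gives real eigenpairs
# columns of eigvecs are orthonormal eigenvectors g_i; eigvals are the lambda_i
print(np.allclose(CB @ eigvecs, eigvecs * eigvals))              # (C_B - lambda I) g = 0
print(np.allclose(eigvecs @ np.diag(eigvals) @ eigvecs.T, CB))   # C_B = G Lambda G^T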
5.5 Mathematical Relation Between SVD and PCA [Advanced Topic]
The principal components can be obtained by solving the eigenvector and eigenvalue problem (the eigenvectors are indeed the principal components). However, it is sometimes computationally expensive to solve for the eigenvectors and eigenvalues of a covariance matrix C_B because of the calculation of the determinant. It is more efficient to solve the SVD of the mean-subtracted data matrix directly and obtain the principal components (eigenvectors), because SVD and PCA are mathematically consistent. Consider a mean-subtracted n-by-p data matrix B; the SVD of the matrix B is expressed as

B = U Σ V^T = [u_1 \; u_2 \; \cdots \; u_p] \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \cdots & \cdots & \cdots & 0 \\ 0 & 0 & 0 & \sigma_p \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \\ \cdots \\ v_p^T \end{bmatrix}   (5.90)
where u_1, u_2, ..., u_p and v_1, v_2, ..., v_p are singular vectors, and σ_1, σ_2, ..., σ_p are singular values. It can be proven that there are the following relations between singular values/vectors and eigenvalues/eigenvectors:

G = V, or g_i = v_i   (5.91)

Λ = \frac{Σ^2}{n-1}, or λ_i = \frac{σ_i^2}{n-1}   (5.92)
where G is the matrix containing the eigenvectors g_1, g_2, ..., g_p, Λ is the matrix containing the eigenvalues λ_1, λ_2, ..., λ_p, and n is the number of rows of the matrix B.

Proof Given that U^T U = I and V^T V = V V^T = I, and Σ is a diagonal matrix, the covariance matrix of the matrix B can be expressed as

C_B = \frac{1}{n-1} B^T B = \frac{1}{n-1} (U Σ V^T)^T (U Σ V^T) = \frac{1}{n-1} V Σ U^T U Σ V^T = V \frac{Σ^2}{n-1} V^T   (5.93)
Comparing with the form of the eigenvectors and eigenvalues in PCA, i.e., C_B = G Λ G^T, two relations can be obtained:

G = V   (5.94)

Λ = \frac{Σ^2}{n-1}   (5.95)
Q.E.D.

In PCA, a transform is sought so that the covariance matrix of the transformed matrix is diagonal. It can be proven that the covariance matrix of BV is diagonal.

Proof

C_{BV} = \frac{1}{n-1} (BV)^T (BV) = \frac{1}{n-1} (U Σ V^T V)^T (U Σ V^T V) = \frac{1}{n-1} Σ U^T U Σ = \frac{1}{n-1} Σ^2   (5.96)

Q.E.D.
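These relations can also be verified numerically in a few lines; the random mean-subtracted matrix below is only for illustration, and the sign ambiguity of eigenvectors is handled by comparing absolute values.

import numpy as np

n, p = 200, 4
B = np.random.rand(n, p)
B = B - B.mean(axis=0)                          # mean-subtracted data matrix
U, s, Vt = np.linalg.svd(B, full_matrices=False)
eigvals, G = np.linalg.eigh(B.T @ B / (n - 1))
order = np.argsort(eigvals)[::-1]               # sort eigenpairs to match descending singular values
eigvals, G = eigvals[order], G[:, order]
print(np.allclose(eigvals, s**2 / (n - 1)))               # Lambda = Sigma^2 / (n - 1)
print(np.allclose(np.abs(G), np.abs(Vt.T)))               # G = V, up to the sign of each eigenvector
BV = B @ Vt.T
print(np.allclose(BV.T @ BV / (n - 1), np.diag(s**2 / (n - 1))))  # covariance of BV is diagonal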
References

1. Driver HE, Kroeber AL (1932) Quantitative expression of cultural relationships. University of California Publications in American Archaeology and Ethnology, Berkeley, pp 211–256
2. Zubin J (1938) A technique for measuring like-mindedness. J Abnorm Soc Psychol 33(4):508–516
3. Tryon RC (1939) Cluster analysis: correlation profile and orthometric (factor) analysis for the isolation of unities in mind and personality. Edwards Brothers, Ann Arbor
4. Cattell RB (1943) The description of personality: basic traits resolved into clusters. J Abnorm Soc Psychol 38(4):476–506
5. Jenks GF (1967) The data model concept in statistical mapping. Int Yearbook Cartograp 7:186–190
6. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, vol 1. University of California Press, Berkeley, pp 281–297
7. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs
8. Defays D (1977) An efficient algorithm for a complete-link method. Comp J Br Comp Soc 20(4):364–366
9. Kohonen T, Honkela T (2007) Kohonen network. Scholarpedia 2(1):1568
10. https://en.wikipedia.org/wiki/Singular_value_decomposition#History
11. Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2(11):559–572
12. Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24:417–441 and 498–520
13. Ammar A, Mokdad B, Chinesta F, Keunings R (2006) A new family of solvers for some classes of multidimensional partial differential equations encountered in kinetic theory modeling of complex fluids. J Non-Newtonian Fluid Mech 139(3):153–176
14. Jenks GF (1967) The data model concept in statistical mapping. Int Yearbook Cartograp 7:186–190
15. https://medium.com/analytics-vidhya/jenks-natural-breaks-best-range-finder-algorithm-8d1907192051
16. https://en.wikipedia.org/wiki/Hierarchical_clustering
17. Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464–1480
18. Rauber A, Merkl D, Dittenbach M (2002) The growing hierarchical self-organizing map: exploratory analysis of high-dimensional data. IEEE Trans Neural Netw 13(6):1331–1341
19. Goodfellow I, Bengio Y, Courville A (2016) The back-propagation algorithm (Rumelhart et al., 1986a), p 200
20. Gan Z, Li H, Wolff SJ, Bennett JL, Hyatt G, Wagner GJ, Cao J, Liu WK (2019) Data-driven microstructure and microhardness design in additive manufacturing using a self-organizing map. Engineering 5(4):730–735
21. Wolff SJ, Gan Z, Lin S, Bennett JL, Yan W, Hyatt G, Ehmann KF, Wagner GJ, Liu WK, Cao J (2019) Experimentally validated predictions of thermal history and microhardness in laser-deposited Inconel 718 on carbon steel. Addit Manuf 27:540–551
22. Vesanto J, Himberg J, Alhoniemi E, Parhankangas J (2000) SOM toolbox for Matlab 5. Technical report 57:2
23. Mukherjee T, Zuback JS, De A, DebRoy T (2016) Printability of alloys for additive manufacturing. Sci Rep 6(1):1–8
24. https://www.api5lx.com/api5lx-grades/
25. https://en.wikipedia.org/wiki/Coefficient_of_determination
26. https://physlets.org/tracker/
27. https://physlets.org/tracker/
28. Modesto D, Zlotnik S, Huerta A (2015) Proper generalized decomposition for parameterized Helmholtz problems in heterogeneous and unbounded domains: application to harbor agitation. Comput Methods Appl Mech Eng 295:127–149
29. Bro R (1997) PARAFAC. Tutorial and applications. Chem Intell Lab Syst 38(2):149–171
30. https://yelu-git.github.io/hopgd/
31. Hawkins T (1975) Cauchy and the spectral theory of matrices. Hist Math 2:1–29
Chapter 6
Deep Learning for Regression and Classification
Abstract Deep learning (DL) is a subclass of machine learning methods that works in conjunction with large datasets along with many neuron layers (artificial neural networks (ANN)) (Schmidhuber, Neural Networks, 61, 85–117, 2015). An ANN is a computer system inspired by the biological neural networks in the human brain (Chen et al., Sensors, 19, 2047, 2019). The term "deep" refers to the use of multiple layers of neurons (three or more) in the ANN. Deep learning models can automatically generate features from data, and they can be designed with many structures. In this chapter, two standard structures called the feed forward neural network (FFNN) and the convolutional neural network (CNN) will be introduced and demonstrative examples will be presented. An application of CNN is also given. The CNN is used to read raw chest X-ray images of patients and automatically classify diseases, such as pneumonia or COVID-19. In addition, a musical instrument sound converter for changing piano musical notes to guitar musical notes will be developed using mechanistic data science. This example will demonstrate the advantage of the mechanistic data science approach compared to standard neural networks, namely that a smaller amount of training data is required to achieve good performance.

Keywords Artificial neural networks · Feed forward neural network · Convolutional neural network · Kernel · Convolution · Padding · Stride · Pooling · Instrumental music conversion · COVID-19
6.1 Introduction
Learning is a process for acquiring knowledge and skills, so they are readily available for understanding and solving future problems and opportunities [3]. Deep learning leverages artificial neural networks to automatically find patterns in data, with the objective of predicting some target output or response.
Table 6.1 A comparison of supervised and unsupervised learning, in terms of methods, data, goal, and uses

        | Supervised learning | Unsupervised learning
Methods | Linear regression, nonlinear regression, etc. | K-means clustering, principal component analysis (PCA), etc.
Data    | Input and output variables will be given | Only input data will be given, and the data are not labelled
Goal    | To determine the relationship between inputs and outputs so that we can predict the output when a new dataset is given | To capture the hidden patterns or underlying structure in the given input data
Uses    | Regression, classification, etc. | Clustering, dimension reduction, etc.
Deep learning methods are heavily based on statistics and mathematical optimization. They can be supervised or unsupervised (or semi-supervised, which is beyond the scope of this book). Unsupervised learning looks for previously undetected patterns in a data set with no pre-existing labels and minimal human supervision. It studies how systems can infer a function to describe a hidden structure from unlabeled data. In contrast, supervised learning maps an input to an output based on representative input-output pairs. It infers a function from labeled training data consisting of a set of training examples. Supervised machine learning algorithms can apply what has been learned in the past to new data using labeled examples to predict future events. Supervised learning plays an important role in estimating the relationships between independent variables (i.e., inputs) and dependent variables (i.e., outputs). A comparison of supervised and unsupervised learning is shown in Table 6.1.

Supervised learning attempts to learn the relationship between output variable(s) and input variable(s). If the outputs are continuous, supervised learning becomes a regression problem. An example of a regression problem is predicting the price of a diamond based on its properties such as carat, cut, color, and clarity (i.e., the 4Cs), as shown in Fig. 6.1. Supervised learning attempts to solve the problem of learning input-output mappings from empirical data or the training set. For example, consider a dataset D of n datapoints: D = {(x_i, y_i) | i = 1, 2, ..., n}, where x is the input vector (e.g., the 4Cs in the previous example) and y is the output (e.g., the price of a diamond).

If the outputs are discrete (e.g., yes or no) instead of continuous, supervised learning turns into a classification problem. One example (Fig. 6.2) is recognizing whether a patient has been infected with coronavirus disease 2019 (COVID-19) based on their chest X-ray images. COVID-19 has spread rapidly around the world and become a pandemic since it first appeared in December 2019, with a disastrous impact on public health, daily lives, and the global economy. It is very important to accurately detect positive cases at an early stage to treat patients and prevent the further spread of the pandemic. Chest X-ray imaging has a critical role in the early diagnosis and treatment of COVID-19.
Fig. 6.1 Regression example: predicting the price of a diamond based on its properties such as carat, cut, color, and clarity (i.e., 4Cs)
Fig. 6.2 Classification example: detect COVID-19 based on chest X-ray images [4]
Automated toolkits for COVID-19 diagnosis based on radiology imaging techniques such as X-ray imaging can overcome the issue of a lack of physicians in remote villages and other underdeveloped regions. The application of artificial intelligence (AI) techniques, such as deep learning, coupled with radiological X-ray imaging, can be very helpful for the accurate and automatic detection of this disease. The classification can be binary (COVID-19 vs. normal) or multiclass (COVID-19 vs. normal vs. pneumonia).
6.1.1 Artificial Neural Networks
Artificial neural networks (often called neural networks) learn (or are trained) by a set of data, which contain known inputs and outputs. The training of a neural network from a given dataset is usually conducted by determining the difference between the output of the neural networks (often a prediction) and a target output (an error). The neural network then updates its parameters (such as weights and biases) according to a learning rule based on this error value. Successive adjustments will cause the neural network to produce output which is increasingly similar to the target output. After a sufficient number of these adjustments, the training can be terminated based on certain criteria. Such neural networks “learn” to perform tasks by considering data without being programmed with task-specific rules. For example, in the previous COVID-19 image recognition, neural networks might learn to identify X-ray images that indicate COVID-19 by analyzing many images (i.e., data points) that have been manually labeled as “COVID-19” or “Normal”. Neural networks do this without any prior knowledge of the pathology known by a doctor.
6.1.2 A Brief History of Deep Learning and Neural Networks
In 1976, Alexey Ivakhnenko and Lapa [5] published the first feedforward multilayer neural networks for supervised learning. The term “Deep Learning” was first introduced by Rina Dechter in 1986 [6]. As a milestone, Yann LeCun et al. [7] applied the standard backpropagation algorithm to a deep convolutional neural network (CNN) for recognizing handwritten ZIP codes on mail in 1989. In 1995, Brendan Frey and co-developer Peter Dayan and Geoffrey Hinton demonstrated that it was feasible to train a network containing six fully-connected hidden layers with several hundred neurons using a wake-sleep algorithm [8]. In 1997, a recurrent neural network (RNN) was published by Hochreiter and Schmidhuber and called long short-term memory (LSTM) [9], which avoided the longstanding vanishing gradient problem in deep learning. In the early 2000s, the deep learning began to significantly impact industry. Industrial applications of deep learning to large-scale speech recognition started around 2010. Advances in computational hardware have driven more interest in deep learning. In 2009, Andrew Ng demonstrated that graphics processing units (GPUs) could accelerate the learning process of deep learning by more than 100 times [10]. In 2019, Yoshua Bengio, Geoffrey Hinton, and Yann LeCun was named as recipients of the 2018 Association for Computing Machinery (ACM) Turing Award (the “Nobel Prize of Computing”) for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing, and the story is continuing. . .
6.2 Feed Forward Neural Network (FFNN)

6.2.1 A First Look at FFNN
Neural networks are based on simple building blocks called artificial neurons (or just neurons). A neuron (Fig. 6.3) is a computational node that takes an input value, x, adjusts that value according to a specific weight w, bias b, and activation function A(·), to produce a new output value y = A(wx + b). Weights are values which are multiplied with each input value, essentially reflecting the importance of an input. Biases are constant values that are added to the product of inputs and weights, usually utilized to offset the result. Activation functions A(·) engage the neurons based on the provided input. This typically includes a nonlinear mapping of the input and helps to increase the degrees of freedom of the ANN. Some examples of commonly used activation functions can be seen in Fig. 6.4, of which ReLU is the most popular due to its simplicity. The ReLU activation function has some limitations. For example, the dying ReLU problem [11] occurs when some ReLU neurons essentially "die" for all inputs and remain inactive no matter what input is supplied. This can be corrected by using Leaky ReLU or Parametric ReLU. For these ReLU functions, the slope to the left of x = 0 is changed, causing a leak and extending the range of ReLU (see Fig. 6.5).

Fig. 6.3 Schematic of an artificial neuron and its functionality
Fig. 6.4 Commonly used activation functions
Fig. 6.5 Leaky ReLU and Parametric ReLU activation functions
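For reference, activation functions of this kind can be written in a few lines of NumPy. This is a sketch: the 0.01 slope for Leaky ReLU is one common choice, and the exact set of functions plotted in Figs. 6.4 and 6.5 may differ.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):          # small slope left of x = 0 avoids "dying" ReLU
    return np.where(x > 0, x, slope * x)

def parametric_relu(x, a):              # the slope a is itself a learnable parameter
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x), sigmoid(x))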
Fig. 6.6 A flowchart for training a neural network (NN)
A standard procedure for training a neural network (NN) is shown in Fig. 6.6. The goal of a neural network is to match a target value ȳ given an input value x for the network. The input and output can be either scalars or vectors. Once the network structure is determined (based on prior information or just trial and error), the weights and biases of the neural network are trained/updated to make sure the predicted output, y, is close to the target value, ȳ. If they are close enough, the weights and biases are reported as converged values. The closeness is quantified by a loss function, e.g., (ȳ − y)². If the loss function is larger than a predetermined error limit, a learning algorithm, called back propagation, adjusts the weights and biases of the neural network to reduce the loss function. The back propagation algorithm is modified from the gradient descent method described in Chap. 3. Once the neural network is trained, additional new datapoints should be used to test the network, as described in Chap. 2. A neural network can include hidden neurons or hidden layers of neurons. Each layer of neurons between the input and output neurons is called hidden. Figure 6.7 shows an example of a network structure including one hidden neuron. The output of
Fig. 6.7 A network structure including only one hidden neuron
the network can be computed as follows: (1) the first weight w1 is multiplied by the input value x and the first bias b1 is added, which is then inserted into the activation function A(·), and the output of the hidden neuron is determined as A(w1 x + b1); (2) the output of the hidden neuron is treated as the input of the next neuron, i.e., the output neuron in this case. Repeating the same operation mentioned above, the output of this network can be determined as A(w2 A(w1 x + b1) + b2).

Assume the goal of a network is to fit one datapoint, such as (x = 0.1, ȳ = 20), by adjusting the unknown parameters. A good fit means that if the input value is x = 0.1, the output of this network is y = 20. The detailed training process will be illustrated step by step. To simplify the problem, the biases are assumed to be zero, and a linear activation function is used. That means the output of a neuron is equal to the product of weight and input. The two unknown parameters in this case are w1 and w2. To train the parameters based on the datapoint, a back propagation algorithm is described below:

Step (1): initialize the weights with arbitrary values. For example,

w1 = 10   (6.1)
w2 = 5    (6.2)

Step (2): compute the NN output and error:

y = w2 w1 x = 5 × 10 × 0.1 = 5   (6.3)
ȳ − y = 20 − 5 = 15              (6.4)

Step (3): compute the increments of the weights based on the gradient descent (GD) method explained in Chap. 3 (a learning rate of α = 0.25 is used in this case):

Δw1 = −α ∂L/∂w1 = α(ȳ − y) w2 x = 0.25 × 15 × 5 × 0.1 = 1.875    (6.5)
Δw2 = −α ∂L/∂w2 = α(ȳ − y) w1 x = 0.25 × 15 × 10 × 0.1 = 3.75    (6.6)

Step (4): update the weights:

w1 = w1 + Δw1 = 10 + 1.875 = 11.875   (6.7)
w2 = w2 + Δw2 = 5 + 3.75 = 8.75       (6.8)
Table 6.2 Results of weights, NN output and error at each iteration

i | w1    | w2    | y     | ȳ − y
1 | 10    | 5     | 5     | 15
2 | 11.88 | 8.75  | 10.39 | 9.61
3 | 13.98 | 11.60 | 16.22 | 3.78
4 | 15.07 | 12.92 | 19.48 | 0.52
5 | 15.24 | 13.12 | 20.00 | 0.00
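The hand calculation summarized in Table 6.2 can be reproduced with a short script (a sketch using the same initial weights, learning rate, and update rule as above):

# Reproduce the single-hidden-neuron example: y = w2 * w1 * x, target y_bar = 20 at x = 0.1
x, y_bar = 0.1, 20.0
w1, w2 = 10.0, 5.0
alpha = 0.25                                   # learning rate
for i in range(1, 6):
    y = w2 * w1 * x                            # forward pass (linear activation, zero biases)
    error = y_bar - y
    print(i, round(w1, 2), round(w2, 2), round(y, 2), round(error, 2))
    dw1 = alpha * error * w2 * x               # gradient descent increments, Eqs. (6.5)-(6.6)
    dw2 = alpha * error * w1 * x
    w1, w2 = w1 + dw1, w2 + dw2

The printed rows match Table 6.2.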
Fig. 6.8 A network structure including two hidden neurons
Fig. 6.9 A network structure including one hidden layer but two hidden neurons
Through steps (2)-(4), the two weights of the NN have been updated based on the gradient descent of the error between the NN output and the target value. Steps (2)-(4) can be repeated until the weights are unchanged, or the error is smaller than a criterion. Table 6.2 presents the computed weights w1 and w2, the NN output y, and the error ȳ − y at each iteration. As seen in Table 6.2, after five iterations the NN output reaches 19.996, which is very close to the target of 20. The result can be made more accurate if more iterations are used.

Additional hidden layers can be added between the input and output neurons. For example, consider a network structure with two hidden layers, each layer with only one hidden neuron, as shown in Fig. 6.8. Now there are three unknown parameters, and the NN output can be computed based on the same rule (if the assumptions still hold): y = w3 w2 w1 x. The same back propagation procedure can be used to train the unknown weights.

In addition, each hidden layer can have more than one neuron. For example, a NN structure with one hidden layer but two hidden neurons is shown in Fig. 6.9. In general, the input to the output neuron is equal to the sum of the two hidden neurons. Based on the operation rule, the output of this NN structure is:

y = A[ w3 A(w1 x + b1) + w4 A(w2 x + b2) + b3 ]   (6.9)
Using the same assumptions that the biases are zero and activation functions are linear, the output can be rewritten as
y = w1 w3 x + w2 w4 x   (6.10)
To fit the same datapoint (x = 0.1, ȳ = 20) by adjusting the four unknown weights, back propagation can be used again:

Step (1): initialize the weights with arbitrary values:

w1 = 10   (6.11)
w2 = 10   (6.12)
w3 = 5    (6.13)
w4 = 5    (6.14)

Step (2): compute the NN output and error. It is noted that the formula for the NN output depends on the NN structure, which is different from the previous example:

y = w1 w3 x + w2 w4 x = 5 + 5 = 10   (6.15)
ȳ − y = 20 − 10 = 10                 (6.16)

Step (3): compute the increments of the weights based on gradient descent (GD) (α is the learning rate, which is 0.25 in this case):

Δw1 = −α ∂L/∂w1 = α(ȳ − y) w3 x = 0.25 × 10 × 5 × 0.1 = 1.25    (6.17)
Δw2 = −α ∂L/∂w2 = α(ȳ − y) w4 x = 0.25 × 10 × 5 × 0.1 = 1.25    (6.18)
Δw3 = −α ∂L/∂w3 = α(ȳ − y) w1 x = 0.25 × 10 × 10 × 0.1 = 2.5    (6.19)
Δw4 = −α ∂L/∂w4 = α(ȳ − y) w2 x = 0.25 × 10 × 10 × 0.1 = 2.5    (6.20)
Step (4): update the weights:

w1 = w1 + Δw1 = 10 + 1.25 = 11.25   (6.21)
w2 = w2 + Δw2 = 10 + 1.25 = 11.25   (6.22)
w3 = w3 + Δw3 = 5 + 2.5 = 7.5       (6.23)
w4 = w4 + Δw4 = 5 + 2.5 = 7.5       (6.24)
Repeat steps (2)-(4) until the weights are unchanged or the error is less than a criterion. Table 6.3 presents the computed weights, NN output, and error at each iteration. As seen in the table, this NN structure needs only four iterations (instead of the five used in the previous example) to achieve a good approximation. In practice, adding more neurons or layers increases the complexity of the functions that the NN can represent, but it also increases the computational cost and the risk of overfitting [12]. Thus, choosing an appropriate NN structure for a given problem is still a challenging task.

Table 6.3 Results of weights, NN output and error at each iteration

i | w1    | w2    | w3    | w4    | y      | ȳ − y
1 | 10    | 10    | 5     | 5     | 10     | 10
2 | 11.25 | 11.25 | 7.5   | 7.5   | 16.88  | 3.13
3 | 11.84 | 11.84 | 8.38  | 8.38  | 19.83  | 0.17
4 | 11.87 | 11.87 | 8.428 | 8.428 | 20.009 | -0.009

It is interesting to note that the input and output can be generalized to vectors so that the NN can handle multiple inputs and outputs. To demonstrate this, a datapoint with a 2D vector as input and a 2D vector as output, x = [1, 2]^T and ȳ = [10, 20]^T, will be used to train the NN with the same structure as the previous example. The same procedure can be followed, but the weights of the NN are 2D vectors in this case:

Step (1): initialize the weights with arbitrary vectors:

w1 = [1, 1]^T   (6.25)
w2 = [1, 0]^T   (6.26)
w3 = [0, 1]^T   (6.27)
w4 = [2, 2]^T   (6.28)
Step (2): compute the NN output and error:

y = w3 w1^T x + w4 w2^T x = [0, 1]^T × 3 + [2, 2]^T × 1 = [2, 5]^T   (6.29)
ȳ − y = [10, 20]^T − [2, 5]^T = [8, 15]^T                            (6.30)
Step (3): compute the increments of the weights based on gradient descent (GD); α is the learning rate, which is 0.01 in this case. The learning rate can be predetermined by trial and error. It is noted that too small a learning rate requires a large number of iterative steps, while too large a learning rate leads to an oscillating learning process that cannot converge.

Δw1 = −α ∂L/∂w1 = α(ȳ − y) w3^T x = 0.01 × [8, 15]^T × 2 = [0.16, 0.3]^T    (6.31)
Δw2 = −α ∂L/∂w2 = α(ȳ − y) w4^T x = 0.01 × [8, 15]^T × 6 = [0.48, 0.9]^T    (6.32)
Δw3 = −α ∂L/∂w3 = α(ȳ − y) w1^T x = 0.01 × [8, 15]^T × 3 = [0.24, 0.45]^T   (6.33)
" # " # 8 0:08 ∂L T Δw4 ¼ α ¼ αðy 2 yÞw2 x ¼ 0:01 1¼ ∂w4 15 0:15
ð6:34Þ
Step (4): update the weights:

w1 = w1 + Δw1 = [1, 1]^T + [0.16, 0.3]^T = [1.16, 1.3]^T     (6.35)
w2 = w2 + Δw2 = [1, 0]^T + [0.48, 0.9]^T = [1.48, 0.9]^T     (6.36)
w3 = w3 + Δw3 = [0, 1]^T + [0.24, 0.45]^T = [0.24, 1.45]^T   (6.37)
w4 = w4 + Δw4 = [2, 2]^T + [0.08, 0.15]^T = [2.08, 2.15]^T   (6.38)
Repeat steps (2)-(4) until the weights are unchanged or the error is less than a criterion. The loss function L = ½(ȳ − y)² is also computed for this case; the loss function is half of the squared distance of the error ȳ − y. Convergence is reached when the loss function is less than a specific criterion (0.0001 in this case). Table 6.4 presents the computed weights, NN output, and loss at selected iterations.

Table 6.4 Results of weights, NN output and loss at each iteration
i  | w1           | w2             | w3             | w4             | y                | L
1  | [1, 1]       | [1, 0]         | [0, 1]         | [2, 2]         | [2, 5]           | 144.5
3  | [1.23, 1.54] | [1.62, 1.38]   | [0.33, 1.73]   | [2.15, 2.40]   | [10.84, 17.95]   | 2.46
5  | [1.16, 1.63] | [1.49, 1.56]   | [0.24, 1.84]   | [2.07, 2.51]   | [10.61, 19.70]   | 0.23
15 | [1.1, 1.66]  | [1.388, 1.608] | [0.178, 1.875] | [2.002, 2.542] | [10.003, 19.998] | 6.5e-6
Typically, vector data is more difficult to fit than scalar data. In this case, fifteen iterations are required to reach convergence. The neural network can also handle a larger database including many data points. In the next section, a NN will be used to predict diamond prices. A Python code will be introduced to implement the network.
6.2.2 General Notations for FFNN [Advanced Topic]
In the previous section, a few simple neural networks were trained by updating their weights and biases using the gradient descent algorithm (often called the backpropagation algorithm in machine learning terminology). In this section, a general matrix-based form is explained to compute the output from a neural network with multiple layers and neurons. To introduce the notation in an unambiguous way, Fig. 6.10 shows a three-layer FFNN, and the notations which refer to the weights, biases, and activations (i.e., outputs) in the network are also marked in the figure. The notation w^l_{j,k} denotes the weight for the connection from the kth neuron in the (l − 1)th layer to the jth neuron in the lth layer. For example, Fig. 6.10 shows a weight w^3_{2,4} on a connection from the fourth neuron in the second layer to the second neuron in the third layer of the network. This notation appears cumbersome at first, but it will be explained below to demonstrate that it is easy and natural. Similar notation can be used for the network's biases and activations, i.e., b^l_j denotes the bias of the jth neuron in the lth layer, and a^l_j denotes the activation (or output) of the jth neuron in the lth layer. Based on these notations, the activation a^l_j of the jth neuron in the lth layer is related to the activations in the (l − 1)th layer by the equation
Fig. 6.10 Notations for a multilayer neural network. The blue circles indicate neurons in the hidden layer
a^l_j = A\left( \sum_k w^l_{j,k} a^{l-1}_k + b^l_j \right)   (6.39)
where the sum is over all neurons k in the (l − 1)th layer. To use a matrix form to represent this expression, a weight matrix W^l can be defined for each layer l. The entry in the jth row and kth column of the weight matrix W^l is just w^l_{j,k}. Similarly, a bias vector b^l is defined, whose components are just the b^l_j. An activation vector a^l can also be defined, whose components are the activations a^l_j. With this matrix-based notation, the above equation can be rewritten in a compact vectorized form:

a^l = A( W^l a^{l-1} + b^l )   (6.40)
Thus, in terms of mapping the data between input x and output y over M layers, the following recursive equation is derived:

y = A_M( W^M \cdots A_3( W^3 A_2( W^2 A_1( W^1 x + b^1 ) + b^2 ) + b^3 ) \cdots )   (6.41)

where A_M is the activation function for the Mth layer. Based on this general form, training a neural network specifically optimizes the weights W^l and biases b^l of a NN with M layers over a loss function as
where A M is the activation function for the Mth layer. Based on this general form, the neural networks specifically optimize weights Wland biases bl in a NN with M layers over a loss function as argmin L W 1 , W 2 , . . . W M , b1 , b2 , . . . , bM
ð6:42Þ
where L is the loss function; for example, it can be the squared error between the NN outputs and the target values,

argmin_{W^l, b^l} \sum_{k=1}^{n} ( ȳ_k − y_k )²   (6.43)
where y_k is the NN output for the kth datapoint (out of n datapoints in total), and ȳ_k is the target value (ground truth) for the kth datapoint. The y_k can be substituted by the matrix-form NN output:

argmin_{W^l, b^l} \sum_{k=1}^{n} ( A_M( W^M \cdots A_3( W^3 A_2( W^2 A_1( W^1 x_k + b^1 ) + b^2 ) + b^3 ) \cdots ) − ȳ_k )²   (6.44)

To solve this optimization problem using gradient descent or back propagation, many standard programming libraries such as Python [13], PyTorch [14], and MATLAB [15] can be used. An example will be introduced in the next section.
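Before turning to the library-based example of the next section, the matrix-form forward pass of Eq. (6.40) can be sketched directly in NumPy. The layer sizes, random weights, and ReLU activation below are arbitrary illustrations rather than a trained model.

import numpy as np

def forward(x, weights, biases, act=lambda z: np.maximum(0.0, z)):
    # Evaluate a^l = A(W^l a^(l-1) + b^l) layer by layer, with a linear output layer
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = act(W @ a + b)                    # hidden layers, Eq. (6.40)
    return weights[-1] @ a + biases[-1]       # linear output layer for regression

sizes = [9, 18, 12, 1]                        # e.g., 9 inputs, two hidden layers, 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
x = rng.normal(size=9)
print(forward(x, weights, biases))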
6.2.3 Apply FFNN to Diamond Price Regression
In Chap. 1, the feature-based diamond pricing problem was briefly introduced. The universal method for assessing diamond quality, regardless of location in the world, relies on the 4Cs (color, clarity, cut, carat weight), whose features and scales can be seen in Fig. 6.11. The goal of this example is to predict the price of a new diamond based on its 4Cs and other features. To achieve that goal, a feed forward neural network (FFNN) will be trained on a large database containing many diamonds and their information. The data for this application was found on Kaggle, which is the world's largest data science community with powerful tools and resources to help users achieve their data science goals. The diamond dataset contains the price and features of nearly 54,000 diamonds, and a portion can be seen in Fig. 6.12. The ten features and their ranges can also be seen below in Fig. 6.13.
Fig. 6.11 The 4Cs (Color, Clarity, Cut, and Carat weight) of diamond quality (https://4cs.gia.edu/ en-us/)
Fig. 6.12 Open-source diamond dataset from Kaggle (https://www.kaggle.com/)
Fig. 6.13 Diamond dataset features explained
To use this data effectively, it is important to understand how this raw data can be built into a useful model and to understand machine learning datasets as well. Given a dataset with inputs X_i^N and corresponding output labels y_j^N, i indicates the input feature index, j indicates the output feature index, and N indicates the number of data points. The goal of machine learning is to form a function with multiple parameters so that it can capture the relationship between the input features and the corresponding output values. For the diamond dataset, i = 1...9, representing the number of independent variables, including carat, cut, color, and clarity. Similarly, j = 1, representing the number of dependent variables: price. The dataset contains 53,940 diamonds, so N = 53,940. Machine learning aims to find the functional form y_j^N = f(X_i^N), where f(X_i^N) correctly maps input X_i^N to output y_j^N. The dataset is divided into training (70%), validation (15%), and testing (15%) sets to find the functional relationship and confirm it is the best possible fit. This process has been explained in Chap. 2. First, inputs and outputs from the training set are fit to the mapping function f(X_i^N), developing a FFNN model. The validation set tests the model after each training step. This process is iterative, meaning that the function is updated after each validation test to reduce the error between the predicted and actual outputs. When the error is minimal, the final functional form is established, and the training accuracy meets the required threshold, the function's performance is evaluated with the testing set. In this example, properties of diamonds with known prices are used as test model inputs. Predicted and actual prices are compared to determine model accuracy. For this problem, there are nine different input features in X_i^N (carat, cut, color, etc.) for each of the N = 53,940 diamonds. Price is the only output feature, y_j^N. The neural network attempts to build a relationship between the input features and the output feature. The overall neural network architecture can be seen in Fig. 6.14. Two hidden layers are used, with each layer including twelve hidden neurons. This NN structure is complex enough to capture the relationship between a diamond's features and its price. The loss function used in the neural network for training is the mean squared error (MSE):
Fig. 6.14 Neural network architecture used in the diamond example
L = \frac{1}{N} \sum_{i=1}^{N} ( ȳ − y )²   (6.45)
An Adam optimizer [16] is used to implement the back propagation algorithm. A Python code with annotations to implement this example is shown below:
Python code for diamond price regression:

# import necessary Python libraries
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# data preparation and scaling
# df is the diamond dataset loaded beforehand, e.g., df = pd.read_csv("diamonds.csv"),
# with categorical features (cut, color, clarity) already encoded as numbers
X_df = df.drop(["price"], axis=1)
y_df = df.price
scaler = StandardScaler()
scaler.fit(X_df)
print(X_df.shape)
X_df = scaler.transform(X_df)
X_train, X_test, y_train, y_test = train_test_split(
    X_df, y_df, random_state=2, test_size=0.3)

# define a neural network
def Net():
    model = Sequential()
    model.add(Dense(input_dim=9, activation="relu", units=18))
    model.add(Dense(kernel_initializer="normal", activation="relu", units=12))
    model.add(Dense(kernel_initializer="normal", units=1))
    model.compile(loss="mean_squared_error", optimizer="adam")
    return model

# train the network and predict
estimator = KerasRegressor(build_fn=Net, epochs=10, batch_size=5)
estimator.fit(X_train, y_train)
y_pred = estimator.predict(X_test)
print("Mean absolute error: {:.2f}".format(mean_absolute_error(y_test, y_pred)))
The price of the diamonds as predicted by the neural network is graphed along with the actual price of the diamonds to evaluate the accuracy of the neural network. This graph can be seen in Fig. 6.15. This model has a coefficient of determination of R² = 0.93, which shows a high level of accuracy in predicting the price of diamonds. This correlation, known as the "goodness of fit," is represented as a value between 0.0 and 1.0. A value of 1.0 indicates a perfect fit, and thus likely a good model for future forecasts, while a value of 0.0 would indicate a poor fit.
Fig. 6.15 Actual vs. predicted Price of diamonds based on a neural network
Fig. 6.16 Predictions of the prices for four diamonds
Fig. 6.17 A simple world including one person, one computer with two-by-two pixels resolution, and a two letters alphabet
predict their prices by inputting their features, which would be very helpful for evaluating the value of a diamond for a diamond vendor or a person who wants to buy a diamond.
6.3 Convolutional Neural Network (CNN)

6.3.1 A First Look at CNN
This section will explain how computers recognize images using a convolutional neural network (CNN). To demonstrate the basic idea of the CNN, imagine a world with only one person, and that person has one computer. The computer has a screen with a resolution of two-by-two pixels, and it has a keyboard with only two keys. Only two keys are needed because this simple world has an alphabet with only two letters: a backward slash and a forward slash, as shown in Fig. 6.17. In this simple world, writing is very simple, and the alphabet song is also very simple: it goes like "backward slash and forward slash". What happens when the keyboard is used in this world? As shown in Fig. 6.18, if the computer key on the
Fig. 6.18 Showing backward slash (left) and forward slash (right) on the computer screens when the keyboards are pressed
Fig. 6.19 An image recognition software that can distinguish a backward slash and a forward slash based on the given image
right (i.e., the backward slash) is pressed, a backward slash is shown on the two-by-two pixel screen. If the computer key on the left (i.e., the forward slash) is pressed, a forward slash is shown on the two-by-two pixel screen. It is possible to add some sophistication in this world by creating image recognition software. As shown in Fig. 6.19, if an image of a backward slash is fed into the computer, the computer should "say" that it is a backward slash. If an image of a forward slash is fed into the computer, the computer should "say" that it is a forward slash. Computers do not really see things the way humans see them. The computers treat pixels as numbers. Assume the orange pixel is 1 and the grey pixel is −1. Figure 6.20 shows how the computer sees the backward slash and the forward slash. The computer reads data as a single line (i.e., in flattened form) from left to right and bottom to top. Thus, the computer reads a backward slash as 1, −1, −1, 1, and a forward slash as −1, 1, 1, −1 in this simple setting. To differentiate these two things on the computer, some mathematical operations have to be conducted. Ideally, a certain positive value would be output for one and a certain negative value would be output for the other one. What operations could be used? The simplest operation that can be tried for these numbers is just adding them. If 1, −1, −1, and 1 are added together for the backward slash, a zero (i.e., 1 − 1 − 1 + 1 = 0) is obtained. If −1, 1, 1, and −1 are added for the forward slash, a
Fig. 6.20 Showing how a backward slash and a forward slash are stored in the computer
Fig. 6.21 An example showing how to tell two images apart using a convolution operation
zero (−1 + 1 + 1 − 1 = 0) is obtained again. Thus, the addition operation does not distinguish the two characters. Similarly, a multiplication operation does not effectively distinguish the characters. A mathematical operation that effectively separates these two strings of numbers is to add the first and last numbers together and to subtract the sum of the two middle numbers. This operation can be defined as: add the first number, subtract the second number, subtract the third number, and add the fourth number of the string. For the backward slash string, this yields 1 + 1 + 1 + 1 = 4. For the forward slash, it yields −1 − 1 − 1 − 1 = −4. This operation, which can distinguish the two characters, is actually a convolution operation, and its mathematical definition will be given in the next section. A character recognition classifier has now been constructed in this simple world. To make it more general, a two-by-two matrix called a kernel, or a filter, is defined as shown in Fig. 6.21. Using this kernel, the operation mentioned above can be re-described. Four symbols f1, f2, f3, f4 can be used to represent the kernel, which
is actually how this matrix is stored in the computer. They can be defined as f1 = 1, f2 = −1, f3 = −1, f4 = 1. Four different symbols g1, g2, g3, g4 are used to represent the image; their values are already known for a backward slash or a forward slash. The operation found above is simply f1 g1 + f2 g2 + f3 g3 + f4 g4. If the result is positive, then the character is a backward slash. If the result is negative, the character is a forward slash.

The obvious question is: how does the computer know the right values in the kernel? Actually, that is the core objective of a CNN. Humans can differentiate using the "eyeball filter". In contrast, a computer can compute very quickly, so one possibility is for a computer to check all the possibilities. There are four entries in the kernel and each entry has two choices: positive or negative. In total, there are sixteen possible kernels (2^4 = 16). For this method, the computer will try all the possibilities, will find that most of them work poorly but that two of them work well (Fig. 6.22), and one of the good kernels is selected. These 16 choices are not excessive for a world with only two letters. However, even with only two images for a computer classifier, and a computer screen with two-by-two pixels, 16 possibilities must be checked. Modern images contain many more pixels and colors to be separated, which results in a nearly impossibly large number of possibilities. Moreover, instead of just binary inputs, decimal values such as 0.5, 1.2, or 0.8, and any other values, can be used. So now there are almost infinite possibilities and the computer cannot check all of them. This necessitates a smarter way to find good kernels (filters).

The idea of gradient descent introduced in the previous section can be applied to CNN. Consider four random numbers (Fig. 6.23) for a filter to test the performance of a classifier. The filter can be updated based on the computed error level until the error level is sufficiently small. To achieve this, the filter values are changed slightly, and a direction with reduced error can be identified. The direction is computed using an error function (or loss function). The derivatives of this error function are computed to find the gradient, or slope, in any direction. The gradient can be used to find the directions of maximum and minimum error increase. This procedure can be repeated until the error is minimized, resulting in a good classification filter.
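The entire classifier of this toy world fits in a few lines. This is a sketch of the dot-product operation described above, with the pixel values ordered in the flattened form used in the text.

import numpy as np

kernel = np.array([1, -1, -1, 1])            # f1, f2, f3, f4
backward_slash = np.array([1, -1, -1, 1])    # flattened two-by-two image
forward_slash = np.array([-1, 1, 1, -1])

def classify(image, kernel):
    score = np.dot(kernel, image)            # f1*g1 + f2*g2 + f3*g3 + f4*g4
    return "backward slash" if score > 0 else "forward slash"

print(classify(backward_slash, kernel))      # score +4
print(classify(forward_slash, kernel))       # score -4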
Fig. 6.22 Sixteen possible kernels to be tested
Fig. 6.23 Schematic of using gradient descent to update filter in CNN
6.3.2 Building Blocks in CNN
Before the concept of the convolutional neural network was proposed in the 1990s for solving image classification problems, people used other machine learning methods, such as logistic regression and support vector machines, to classify images. Those algorithms considered pixel values as features, e.g., a 36 × 36 image contained 36 × 36 = 1296 features, while a lot of the spatial interactions between pixels were lost. Although it is possible to handpick features out of the image similar to what a convolution automatically does, it is very time-consuming, and the quality of those extracted features highly depends on the knowledge and experience of the domain experts. A CNN uses information from adjacent pixels to down-sample the image into features by convolution and pooling and then uses prediction layers (e.g., a FFNN) to predict the target values. A typical CNN structure consists of the building blocks of input, convolution, padding, stride, pooling, FFNN, and output (see Fig. 6.24). The mathematical symbols in the figure will be defined and explained in the next section. The CNN starts from an input layer such as a signal (1D) or an image (2D). In Fig. 6.24, the input data is assumed to be one-dimensional, i.e., a 1D signal or a flattened image. The padded input can be obtained by adding zeros around the margin of the signal or image. Multiple convolution operations will be done using several moving kernels to extract features from the padded input. The dimensions of the convolved features can be reduced using pooling layers. Those reduced features are then used as input for a FFNN to calculate the output of the CNN. Table 6.5 shows a list of terms included in the CNN structure. Some descriptions are also provided. More detailed definitions and examples will be given in the following.
Fig. 6.24 An illustrative structure of CNN including several building blocks and concepts
Table 6.5 Terminology used in CNN and their descriptions and objectives

Terminology | Description
Convolution | A mathematical operation that computes the integral of the product of two functions (signals) as one of the signals slides. It can extract features from the input signals.
Kernel (filter) | A function used to extract important features.
Padding | A technique to simply add zeros around the margin of the signal or image to increase its dimension. Padding emphasizes the border values in order to lose less information.
Stride | The step by which the kernel slides during convolution. Moving the kernel by different stride values is designed to extract different kinds of features. The amount of stride chosen affects the size of the extracted feature.
Pooling | An operation that takes the maximum or average of the region of the input overlapped by a sliding kernel. The pooling layer helps reduce the spatial size of the convolved features by providing an abstracted representation of them.
Fully connected layers | A FFNN in which layer nodes are connected to every node in the next layer. The fully connected layers help learn non-linear combinations of the features output by the convolutional layers.

6.3.2.1 Convolution
Multiple convolution filters or kernels that operate over the signal or the image can be used in CNN to extract different features. The concept of convolution in machine learning stems from mathematics. Consider two univariate continuous functions: an original function, or signal, f(t), and a kernel (filter) function, ϕ(t). The definition of the convolution is given as the integral of the product of the two functions after one is reversed and shifted:
( f ∗ ϕ )(t) = \int_{-\infty}^{\infty} f(t − ξ) ϕ(ξ) dξ   (6.46)
where ∗ represents the convolution operation between the two functions, and ξ is the index sliding through the filter function. The integration interval of the convolution in mathematics is from negative infinity to positive infinity. It is typical to use multiple kernels (ϕ1(ξ), ϕ2(ξ), ..., ϕk(ξ)) to conduct k different convolution operations in a CNN. Since the signal or data in data science is finite, the definition of the convolution can be modified by limiting the integration interval from −l to l:

( f ∗ ϕ )(t) = \int_{-l}^{l} f(t − ξ) ϕ(ξ) dξ   (6.47)
where l is a real number. Since the data is stored in computers in a discrete form, the above convolution defined for continuous functions can be modified to be a discrete convolution:

( f ∗ ϕ )(t) = \sum_{ξ=-N}^{N} f(t − ξ) ϕ(ξ) Δξ   (6.48)
where N is an integer. If Δξ is assumed to be 1, the discrete convolution can be written as

( f ∗ ϕ )(t) = \sum_{ξ=-N}^{N} f(t − ξ) ϕ(ξ)   (6.49)
This is the definition used in data science, where f(t) is a signal or a flattened image, and ϕ(ξ) is a filter or kernel used to extract features from the original signal or image (in the previous example, a filter was used to classify whether an image is a backward slash or a forward slash). An example is given to demonstrate how to compute a discrete convolution. Consider a discrete function f(t) including 12 elements f(0), f(1), ..., f(11) and a filter ϕ(ξ) including three elements ϕ(−1), ϕ(0), and ϕ(1), as shown in Fig. 6.25. The convolution is an element-wise multiplication between the function f(t) values and the filter ϕ(ξ) values, which are then summed up. Based on the definition, the first value of the convolution ( f ∗ ϕ )(1) can be computed as

( f ∗ ϕ )(1) = f(2)ϕ(−1) + f(1)ϕ(0) + f(0)ϕ(1) = 2 × 6 + 7 × 3 + 9 × 1 = 12 + 21 + 9 = 42   (6.50)
This formula is consistent with what was used in the previous backward and forward slash example. The one-dimensional case is analyzed here since two-dimensional images are flattened into one-dimensional data and stored in the computer. The filter continues from left to right on the signal and produces the second value of the convolution as shown below (see Fig. 6.25: step 2):
Fig. 6.25 A convolution operation example (the first two steps are shown)
( f ∗ ϕ )(2) = f(3)ϕ(−1) + f(2)ϕ(0) + f(1)ϕ(1) = 4 × 6 + 2 × 3 + 7 × 1 = 24 + 6 + 7 = 37   (6.51)

The filter can be continually moved to the next position (next pixel) so that all the values of the convolution ( f ∗ ϕ )(1), ( f ∗ ϕ )(2), ..., ( f ∗ ϕ )(10) can be calculated.
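The two convolution values above are easy to check in NumPy. Only the first six samples of f(t) appear in the worked example, so the remaining entries below are made up for illustration.

import numpy as np

f = np.array([9, 7, 2, 4, 8, 7, 6, 5, 3, 1, 2, 4])   # f(0)..f(5) from the example; the rest are made up
phi = np.array([6, 3, 1])                             # filter values phi(-1), phi(0), phi(1)

# (f * phi)(t) = sum_xi f(t - xi) phi(xi): slide the reversed window across the signal
conv = [np.dot(f[t - 1:t + 2][::-1], phi) for t in range(1, len(f) - 1)]
print(conv[0], conv[1])                               # 42 and 37, as in Eqs. (6.50) and (6.51)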
6.3.2.2 Stride
In the above example, the filter slides by one position (1 pixel). This is called the stride. In practice, the filter can be moved by different stride values to produce different sizes of convolution and to extract different kinds of features. Figure 6.26 shows an example using a stride value of 3. The calculation of the first step is the same as that in the previous example, so the first value of the convolution is still 42. However, for the subsequent steps the filter is moved by three pixels at a time. For example, the second value of the convolution can be computed as
Fig. 6.26 A convolution operation with a stride value of three (the first two steps are shown)
( f ∗ ϕ )(2) = f(5)ϕ(−1) + f(4)ϕ(0) + f(3)ϕ(1) = 7 × 6 + 8 × 3 + 4 × 1 = 42 + 24 + 4 = 70   (6.52)
The rest of the values of the convolution can be computed with the same procedure. As shown in Fig. 6.26, the size of the convolution (f ∗ ϕ)(t) is affected by the stride. There are only four elements in the convolution if a stride value of 3 is used, because some values in the original signal are skipped when the stride is greater than 1. An equation to calculate the size of the convolution for a particular filter size and stride is as follows

$$\text{Convolution size} = (\text{Signal size} - \text{Filter size})/\text{Stride} + 1 \qquad (6.53)$$
This equation can be verified by substituting the values from the above examples. For the example with stride 3, Convolution size = (12 − 3)/3 + 1 = 4. For the example with stride 1 shown previously, Convolution size = (12 − 3)/1 + 1 = 10. The equation works for both cases.
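A minimal sketch of a strided sliding window and the size formula of Eq. (6.53), reusing the partly hypothetical 12-element signal from the previous snippet:

import numpy as np

f = np.array([9, 7, 2, 4, 8, 7, 5, 3, 6, 1, 2, 8])   # signal (first six values from the text)
phi = np.array([6, 3, 1])                             # filter values phi(-1), phi(0), phi(1)

def strided_conv(f, phi, stride=1):
    """Slide the flipped kernel across f with the given stride (valid positions only)."""
    k = len(phi)
    return np.array([np.dot(f[i:i + k], phi[::-1])
                     for i in range(0, len(f) - k + 1, stride)])

print(strided_conv(f, phi, stride=1).size)   # (12 - 3)/1 + 1 = 10 values
print(strided_conv(f, phi, stride=3))        # 4 values; the first two are 42 and 70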
Fig. 6.27 A convolution operation with padding
6.3.2.3 Padding
As the previous examples show, the convolution changes the size (or dimension) of the original signal or image. Is it possible to keep the convolution the same size as the input signal or image? Indeed, this can be achieved by padding the input. Padding is a technique that adds zeros around the margin of the signal or image to increase its dimension before the convolution operation. Padding also emphasizes the border pixels, reducing information loss. Figure 6.27 shows an example with a 12-dimensional input f(t). It can be padded to a 14-dimensional input f′(t) by adding one zero at each end of the input. Adding these two extra dimensions results in a 12-dimensional output (f′ ∗ ϕ)(t) after the convolution operation, which is the same size as the input signal f(t). The equation to calculate the dimension after convolution for a particular filter size, stride, and padding is as follows

$$\text{Convolution size} = (\text{Signal size} + 2\cdot\text{Padding size} - \text{Filter size})/\text{Stride} + 1 \qquad (6.54)$$

For an image with three channels, i.e., red, green, and blue (RGB), the same operations are performed on all three channels. In this book, only a single channel is considered. More details on image channels can be found in the reference [17]. A CNN learns the filter values through back-propagation to extract different features of the image based on training data. A CNN typically has more than one filter at each convolution layer. The features extracted by convolution are then used to perform different tasks such as classification and regression.
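To see Eq. (6.54) in action, the sketch below pads one zero on each side of an arbitrary 12-element signal so that a size-3 filter with stride 1 returns an output of the same length as the input (the signal and filter values are placeholders):

import numpy as np

f = np.arange(12)                     # any 12-element signal
phi = np.array([6, 3, 1])             # size-3 kernel
f_padded = np.pad(f, pad_width=1)     # one zero at each end -> length 14
out = np.array([np.dot(f_padded[i:i + 3], phi[::-1])
                for i in range(len(f_padded) - 2)])
# (12 + 2*1 - 3)/1 + 1 = 12, i.e. the same length as the original signal
print(len(out))                       # 12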
Fig. 6.28 An example showing max and average pooling layers
6.3.2.4 Pooling
Pooling layers in a CNN help reduce the spatial size of the convolved features and also reduce overfitting by providing low-dimensional representations. Two types of pooling are widely used: max pooling and average pooling. A pooling layer is similar to a convolution layer, but it takes the maximum or the average of the region of the input overlapped by the kernel (or filter). Figure 6.28 shows an example of a max pooling layer and an average pooling layer with a kernel of size 2 and a stride of 2. The max pooling operation takes the maximum of every two pixels, while the average pooling operation computes the average of every two pixels. Max pooling helps reduce noise by ignoring small noisy values in the input data and is therefore often preferred over average pooling.
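A sketch of the two pooling operations of Fig. 6.28 with a window of size 2 and a stride of 2 (the input values below are placeholders, not taken from the figure):

import numpy as np

x = np.array([3.0, 1.0, 4.0, 6.0, 2.0, 5.0, 0.0, 7.0])   # placeholder convolved features

def pool(x, size=2, stride=2, mode="max"):
    """Apply max or average pooling over sliding windows of the input."""
    windows = [x[i:i + size] for i in range(0, len(x) - size + 1, stride)]
    reduce = np.max if mode == "max" else np.mean
    return np.array([reduce(w) for w in windows])

print(pool(x, mode="max"))       # [3. 6. 5. 7.]
print(pool(x, mode="average"))   # [2.  5.  3.5 3.5]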
6.3.2.5 Fully Connected Networks
After the convolution and pooling layers, fully connected networks, such as an FFNN, are typically used with activation functions to learn complicated functional mappings between the convolved features and the outputs. Several activation functions were introduced in the FFNN section. In the fully connected layers, each neuron in a hidden layer is connected to every node in the adjacent layers. A dropout layer is normally used between two consecutive fully connected layers to reduce overfitting [18]. At the last layer, the output size is chosen based on the task. For example, one output value was sufficient for the backward and forward slash example.
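A minimal PyTorch sketch of such a fully connected head with a dropout layer between two linear layers; the layer sizes are illustrative assumptions, not the book's architecture:

import torch
import torch.nn as nn

# maps a flattened vector of pooled features to a single output value,
# as in the backward/forward slash example; the sizes are placeholders
head = nn.Sequential(
    nn.Linear(16, 32),   # 16 pooled features -> 32 hidden neurons
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes activations during training to reduce overfitting
    nn.Linear(32, 1),    # one output value for the binary slash example
)
y = head(torch.randn(4, 16))   # batch of 4 feature vectors
print(y.shape)                 # torch.Size([4, 1])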
Fig. 6.29 Illustration of one-dimensional CNN with the following setup: padding, convolution, pooling, and a FFNN for regression analysis. The first three steps may be repeated [20]
6.3.3 General Notations for CNN [Advanced Topic]
A CNN model consists of four basic unit operations: (1) padding, (2) convolution, (3) pooling, and (4) a feed forward neural network (FFNN). A one-dimensional CNN model is shown in Fig. 6.29. It begins with the input of a series of N values f1, f2, . . ., fN. The CNN consists of several loops of padding, convolution, and pooling. As shown in Fig. 6.29, for each loop iteration η, a padding procedure adds zeros around the boundaries to ensure that the post-convolution dimension is the same as the input dimension. After padding, kernel functions are used to compute the discrete convolution output $\tilde{f}^{\kappa,\eta}_x$, given by

$$\tilde{f}^{\kappa,\eta}_x = \sum_{\xi=-(L_{\mathrm{conv}}-1)/2}^{(L_{\mathrm{conv}}-1)/2} \phi^{\kappa,\eta}_{\xi}\, f^{\mathrm{padded},\eta}_{x+\xi} + b^{\kappa,\eta} \qquad (6.55)$$
where $f^{\mathrm{padded},\eta}_{x+\xi}$ is the padded input, x is the counting index for a location within the signal, ξ is the counting index for a location within the kernel, $\phi^{\kappa,\eta}_{\xi}$ is the κth kernel function, and $b^{\kappa,\eta}$ is the bias for the ηth convolution process (η = 1, 2, . . ., Nconv). The total number of iterations is Nconv. The size of the kernel function is Lconv. A pooling layer is used after convolution to reduce the dimension of the data and extract features from the convolved data. A one-dimensional max pooling layer is formulated as
$$\hat{f}^{P,\kappa,\eta}_{\alpha} = \mathrm{MAX}\left(\tilde{f}^{\kappa,\eta}_{\xi}\right),\quad \xi \in \left[(\alpha-1)L_{\mathrm{pooling}}+1,\ \alpha L_{\mathrm{pooling}}\right] \qquad (6.56)$$
where $\hat{f}^{P,\kappa,\eta}_{\alpha}$ is the output value after max pooling, $\tilde{f}^{\kappa,\eta}_{\xi}$ is the value before max pooling, α is the counting index for a location within the output after pooling (α = 1, 2, . . ., $N^{\eta}_{\mathrm{pooling}}$), $N^{\eta}_{\mathrm{pooling}}$ is the size of the output after pooling for loop iteration η, MAX is the function which returns the largest value in a given list of arguments, and $L_{\mathrm{pooling}}$ is the length of the pooling window. Padding, convolution, and pooling are repeated Nconv times to extract important patterns and features. The output of the pooling layers is transferred to a fully connected FFNN, which is illustrated in Sect. 6.2.2. The output of the CNN is represented by the vector $\bar{\varepsilon}$. The CNN training process can be written as an optimization problem, which finds the filter values in the convolution layers and the weights and biases in the FFNN by minimizing a loss function (e.g., the mean square error (MSE)) representing the distance between the training data ε and the CNN output $\bar{\varepsilon}$
$$\min\ \text{loss function:}\quad \mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\bar{\varepsilon}^{\,i} - \varepsilon^{i}\right)^{2} \qquad (6.57)$$
where N is the number of data points in the training set, $\bar{\varepsilon}^{\,i}$ is the CNN output for the ith data point, and $\varepsilon^{i}$ is the labeled output of the ith data point. Since $\bar{\varepsilon}^{\,i}$ is a function of the filter values in the convolution layers and the weights and biases in the FFNN, those parameters can be iteratively updated by minimizing the loss function using the back-propagation algorithm.
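The four unit operations and the training problem of Eq. (6.57) can be assembled into a compact PyTorch sketch. This is not the book's model; the channel counts, kernel sizes, and random data below are placeholder assumptions that simply mirror the notation above.

import torch
import torch.nn as nn

class Simple1DCNN(nn.Module):
    def __init__(self, n_in=64, n_out=1):
        super().__init__()
        # padding=1 keeps the post-convolution length equal to the input length (Eq. 6.54)
        self.conv = nn.Conv1d(1, 4, kernel_size=3, padding=1)   # 4 kernels phi^kappa
        self.pool = nn.MaxPool1d(kernel_size=2)                 # max pooling, Eq. (6.56)
        self.ffnn = nn.Sequential(nn.Flatten(), nn.Linear(4 * n_in // 2, 32),
                                  nn.Tanh(), nn.Linear(32, n_out))

    def forward(self, x):
        return self.ffnn(self.pool(torch.relu(self.conv(x))))

# toy training loop minimizing the MSE of Eq. (6.57) on random placeholder data
model, loss_fn = Simple1DCNN(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, eps = torch.randn(8, 1, 64), torch.randn(8, 1)
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), eps)
    loss.backward()
    opt.step()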
6.3.4 COVID-19 Detection from X-Ray Images of Patients [Advanced Topic]
In this section, a CNN model for automatically detecting COVID-19 by classifying raw chest X-ray images is presented. Coronavirus disease 2019 (COVID-19) first appeared in December 2019 and caused a worldwide pandemic. The virus infects the lungs and airways and causes inflammation. As the inflammation progresses, a dry or barking cough results, followed by tightness in the chest and deep pain when breathing. Chest X-rays of COVID-19 patients show a progression that differs from that of healthy patients or patients with pneumonia. The CNN model is developed to provide a COVID-19 diagnosis as a multi-class classification, i.e., COVID-19 vs. Pneumonia vs. No-Findings. The X-ray images of a COVID-19 patient’s chest reveal several important features, which are critical for diagnosing inflammation caused by COVID-19. For example, Fig. 6.30 shows chest X-ray images taken at days 1, 4, 5, and 7 for a 50-year-old COVID-19 patient. At day 1, the lungs are clear and there are no significant findings. At days 4 and 5, ill-defined
Fig. 6.30 Chest X-ray images of a 50-year-old COVID-19 patient over a week [22]
alveolar consolidations can be observed on the X-ray images. At day 7, the radiological condition has worsened, with typical findings of Acute Respiratory Distress Syndrome (ARDS [20]). Machine learning-based automatic diagnosis of COVID-19 from chest X-ray images provides an end-to-end architecture that can automatically extract important features (such as alveolar consolidations) from images to assist clinicians in making accurate diagnoses (Fig. 6.31). X-ray images obtained from two sources were used for the diagnosis of COVID-19. The first COVID-19 X-ray image database was generated and collected by Cohen JP [22]. Another chest X-ray image database was provided by Wang et al. [23]. These two X-ray databases are combined for a total of 125 X-ray images of COVID-19 patients (43 female, 82 male, average age of approximately 55 years), 500 X-ray images of pneumonia patients, and 500 X-ray images of patients with no findings. The grayscale X-ray images are obtained from the institute’s PACS system [24]. Each pixel in the images has an intensity ranging from 0 to 255. Nine hundred X-ray images are used for training and 225 images for validation (including 28 COVID-19 cases, 88 pneumonia cases, and 109 no-finding cases). Fivefold cross-validation is used to evaluate the model performance (see Chap. 2 for more details on cross-validation). A typical CNN structure has many convolution layers that extract features and produce feature maps from the input with the applied filters, subsequent pooling layers to reduce the size of the feature maps, and fully connected layers (i.e., an FFNN). The trainable internal parameters in the CNN are adjusted to accomplish a classification or regression task. The CNN structure described in this section is inspired by the Darknet-19 model [25], which is a well-tested classifier for
Fig. 6.31 A schematic presentation of the first convolution layer and max pooling layer
many real-time object detection systems. A leaky rectified linear unit (leaky ReLU) is used as the activation function in the CNN. The CNN structure consists of 17 convolutional layers and 5 max pooling layers. These are typical CNN layers with different filter numbers, sizes, and stride values. A schematic presentation of the first convolution layer and max pooling layer is given in Fig. 6.31. After the padding operation introduced in Sect. 6.3.2.3, the input image with 256 × 256 resolution increases in size to 257 × 257. Eight 3 × 3 filters with stride 1 are then used to produce eight 256 × 256 feature maps. This allows different features to be extracted from the input image. The values in the filters are obtained during the training process. To reduce the size of the feature maps, eight 2 × 2 max pooling operators with stride 2 are used, so that the size of the feature maps is reduced to 128 × 128. To present the whole structure of the CNN in a compact manner, a simplified presentation of the first convolution and max pooling layer is also shown in Fig. 6.31. Figure 6.32 shows the CNN structure with 17 convolutional layers and 5 max pooling layers. Each convolution block has one convolutional layer followed by a LeakyReLU activation function (see Sect. 6.2.1). The ReLU activation function has a zero derivative for negative inputs, whereas LeakyReLU has a small nonzero slope that can overcome the dying-neuron problem [26]. The CNN model performs the COVID-19 detection task by determining the labels of the input chest X-ray images. COVID-19 is represented by the vector [1 0 0]ᵀ, pneumonia is represented by a
Fig. 6.32 The architecture of the CNN model
vector [0 1 0]ᵀ, and a normal case (i.e., No-Findings) is represented by the vector [0 0 1]ᵀ. Finally, the layer details and layer parameters of the model are given in the Python code included with the book. The developed deep learning model consists of 1,164,434 parameters. The Adam optimizer [27] is used for updating the weights, with a selected learning rate of 3 × 10⁻³ (Supplementary Data 6.1). The multi-class classification performance of the CNN model can be evaluated using the error matrix shown in Fig. 6.33. The error matrix allows visualization of the general performance of the CNN model. A total of 225 images in the test set are used to evaluate the performance of the model. The classification accuracy for the COVID-19 category is 24/28 = 85.7%, the accuracy for the No-Findings category is 102/109 = 94.6%, and the accuracy for the Pneumonia category is 75/88 = 85.2%. The CNN model achieved an average classification accuracy of 88.5% across these categories. During the COVID-19 pandemic, X-ray imaging is an important tool that assists diagnostic testing for early diagnosis. Deep learning models such as the one introduced in this section provide a high diagnostic accuracy and are thus particularly useful for identifying early-stage COVID-19 patients. The deep learning model can potentially be used in healthcare centers for an early diagnosis
Fig. 6.33 The error matrix results of the multi-class classification COVID-19 task
or a second “opinion”. However, this CNN involves more than one million parameters, and a large number of training data points are required to train them. In the next section, a mechanistic data science approach is introduced that significantly reduces the number of model parameters by using mechanistic knowledge, achieving good model performance with a small number of data points.
6.4 Musical Instrument Sound Conversion Using Mechanistic Data Science

6.4.1 Problem Statement and Solutions
A machine learning model that can change any piano sound/music to a guitar sound/music will be used to demonstrate mechanistic data science. Pianos and guitars can perform the same music, but a piano and a guitar have quite different instrumental structures and sounds. The challenge is to use mechanistic data science to convert piano sounds to guitar sounds. The first step in the process is to change a single piano note to the same note sounding like a guitar (Fig. 6.34). The input consists of eight pairs of notes (notes A4, A5, B5, C5, C6, D5, E5, G5) performed by a piano and by a guitar. For this example, signals from an open-source database [30] are used for the training dataset. Figure 6.35 shows the time-amplitude signal (sound signal) of the A4 note for piano and guitar. The two signals look different, but they have the same fundamental frequency (same pitch). A Fourier analysis can be conducted to obtain the fundamental frequency and harmonics of the sound (see Chap. 4). It should be noted that the dimension of each curve is very high, since the sampling rate was 44,100 Hz.
Fig. 6.34 A schematic of changing a piano sound to a guitar sound using machine learning. Two images in the figure come from the Internet [29, 30]
Fig. 6.35 A pair of A4 time-amplitude curves for piano and guitar. Audios are available in the E-book (Supplementary Audio 6.1)
Given the training data of paired sounds from the piano and the guitar, two strategies can be used to train a machine learning model: (1) a pure CNN analysis and (2) a mechanistic data science analysis (see Fig. 6.36). The first strategy uses the deep CNN architecture introduced in the previous section. For this strategy, piano sounds are used as input and guitar sounds are used as output. Both are high-dimensional time-amplitude curves. Convolution layers and pooling layers can be applied repeatedly to extract features from the piano sound signals, and a low-dimensional representation (the deep features shown in Fig. 6.36) of the original sound curve is obtained after the CNN. These deep features are then mapped to another set of deep features, which can be converted to guitar sounds using another CNN structure for feature reconstruction. This flexible deep learning structure can be applied to many regression and classification applications with various data structures. However, a drawback of this strategy is the large number of trainable parameters involved in the CNN and FFNN structures. Both the filter values and the network parameters (weights and biases) need to be determined from the training dataset. The number of parameters in a CNN can be in the thousands, millions, or even larger, depending on the size of the network. For applications where the amount of data is small or the quality of data is low, the performance of the CNN is limited. To overcome
Fig. 6.36 A schematic of two solutions for changing a piano sound to a guitar sound: pure CNN approach and mechanistic data science
this drawback of the standard CNN, another strategy called mechanistic data science is proposed (Fig. 6.36). Instead of extracting deep features with a CNN, mechanistic data science extracts mechanistic features based on the underlying scientific principles. In this specific problem, a low-dimensional set of mechanistic features can be extracted from each piano signal and guitar signal. Each signal can be simplified to a set of sine functions, where the mechanistic features are the frequencies, damping coefficients, amplitudes, and phase angles (see Chap. 4 for more details). The damping coefficients describe the decrease in amplitude of the sound wave due to frictional drag or other resistive forces. These mechanistic features can be obtained with a Short Time Fourier Transform (STFT) and a regression to fit a mechanistic model, such as the spring-mass-damper model introduced in Chap. 4. In this way, a high-dimensional sound curve can be represented by a set of mechanistic features with a physical meaning. The number of parameters involved in the model can be significantly decreased, from thousands to dozens, which reduces the amount of training data required. The computer codes for conducting the CNN and MDS analyses are included in the E-book (Supplementary Data 6.2). Figure 6.37 shows the values of the CNN and MDS loss functions at each iteration step of training. The goal of the training is to minimize the loss function, and the MDS model (right in Fig. 6.37) provides a better solution than the CNN (left in Fig. 6.37). The loss function of the CNN is volatile and does not converge to zero. Given the large number of parameters in the CNN, the amount of training data available in this case is insufficient for the CNN to find appropriate values for all of them. Training a CNN model generally requires a number of data points comparable to the dimension of the input signal or image. In this case, the dimension of the input is over one hundred thousand, so a correspondingly large number of training data points would be needed to successfully train the CNN model. As the results show, the eight training examples are not enough. In contrast, the
Fig. 6.37 Values of CNN loss function (left) and MDS loss function (right) at each iteration step during the training
loss function of the MDS model converges to zero if sufficient iteration steps are used. MDS significantly reduces the number of parameters by using mechanistic features, which provides an efficient way to solve this kind of problem with a relatively small amount of data.
6.4.2 Mechanistic Data Science Model for Changing Instrumental Music [Advanced Topic]
A mechanistic data science model is presented which converts music from one instrumental sound to another. In particular, a piano sound is converted to a guitar sound. The training data for the analysis consist of eight pairs of piano and guitar sound files, with signal durations ranging from 1.5 to 3.0 s. The notes used are A4, A5, B5, C5, C6, D5, E5, and G5. A representative pair of time-amplitude curves is shown in Fig. 6.35. The sampling rate used is 44.1 kHz. The recorded duration is 2.8 s for the piano sounds and 1.6 s for the guitar sounds. Thus, the dimension of a piano sound is about 120,000 (44.1 kHz × 2.8 s), and the dimension of a guitar sound is about 72,000 (44.1 kHz × 1.6 s). The high dimension of the input signal necessitates mechanistic feature extraction to perform the analysis efficiently. Mechanistic features were extracted from the signals to enable the mapping from the piano to the guitar. A Short Time Fourier Transform (STFT) is used to reveal the frequency, amplitude, damping, and phase angle, as shown in Chap. 4 (see Fig. 6.38). A mechanistic model of the system is introduced in the form of a spring-mass-damper system. The STFT and a least-squares optimization are performed to determine the parameters of the mechanistic model, as shown in Chap. 4. The coefficients of the reduced order model for the A4 piano and guitar signals are shown in Tables 6.6 and 6.7, respectively. Data for the other piano and guitar sounds are included in the E-book (Supplementary Data 6.2).
Fig. 6.38 An A4 piano sound signal and its STFT result (2D and 3D)
Table 6.6 Optimal coefficients to represent the authentic A4 piano sound

Type         Frequency (Hz)   Initial amplitude   Damping coefficient   Phase angle (rad)
Fundamental  4.410E+02        1.034E-01           3.309E+00             6.954E-01
Harmonics    8.820E+02        1.119E-02           1.844E+00             7.202E-01
             1.323E+03        6.285E-03           5.052E+00             3.469E-01
             1.764E+03        7.715E-04           2.484E+00             5.170E-01
             2.205E+03        1.455E-03           8.602E+00             5.567E-01
             2.646E+03        5.130E-04           1.198E+01             1.565E-01
             3.087E+03        1.899E-04           8.108E+00             5.621E-01
             3.528E+03        3.891E-05           3.282E+00             6.948E-01
Table 6.7 Optimal coefficients to represent the authentic A4 guitar sound

Type         Frequency (Hz)   Initial amplitude   Damping coefficient   Phase angle (rad)
Fundamental  4.400E+02        1.649E-02           1.287E+00             9.798E-01
Harmonics    8.800E+02        8.022E-03           1.865E+00             2.848E-01
             1.320E+03        2.551E-03           2.176E+00             5.950E-01
             1.760E+03        5.454E-03           1.100E+00             9.622E-01
             2.200E+03        5.523E-03           3.346E+00             1.858E-01
             2.640E+03        6.742E-03           2.504E+00             1.930E-01
             3.080E+03        7.643E-04           1.666E+00             3.416E-01
             3.520E+03        9.748E-04           2.609E+00             9.329E-01
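The coefficients of Table 6.6 can be turned back into a sound wave by summing one damped sinusoid per frequency. The sketch below assumes the model form amplitude · exp(−damping · t) · sin(2π · frequency · t + phase); the exact sign and phase conventions used in the supplementary code may differ slightly.

import numpy as np

fs = 44100                                  # sampling rate (Hz)
t = np.arange(0, 2.8, 1.0 / fs)             # 2.8 s piano note

# columns of Table 6.6: frequency (Hz), initial amplitude, damping, phase angle (rad)
freq  = np.array([441, 882, 1323, 1764, 2205, 2646, 3087, 3528])
amp   = np.array([1.034e-1, 1.119e-2, 6.285e-3, 7.715e-4, 1.455e-3, 5.130e-4, 1.899e-4, 3.891e-5])
damp  = np.array([3.309, 1.844, 5.052, 2.484, 8.602, 11.98, 8.108, 3.282])
phase = np.array([0.6954, 0.7202, 0.3469, 0.5170, 0.5567, 0.1565, 0.5621, 0.6948])

# superpose the eight damped sinusoids to approximate the A4 piano sound
y = sum(a * np.exp(-b * t) * np.sin(2 * np.pi * f * t + p)
        for f, a, b, p in zip(freq, amp, damp, phase))

The resulting array y can then be written to a .wav file (e.g., with scipy.io.wavfile.write) and compared against the authentic recording.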
Fig. 6.39 FFNN structure with reduced mechanistic features
Deep learning for regression is performed using the reduced mechanistic features as input and output. A fully-connected FFNN is used to map the relationship between piano and guitar sounds. Figure 6.39 shows the FFNN structure with the reduced order mechanistic model. Three hidden layers with 100 neurons each are used for this FFNN. The tanh function is used as the activation function. The standard mean squared error (MSE) is used as the loss function. The mathematical description of the optimization process for FFNN training can be found in Sect. 6.2.2. The code for implementing MDS and the FFNN is attached to the E-book. Note that the generation of the guitar sound from the piano sound is possible with a significantly smaller dimension (i.e., 4 features × 8 frequencies = 32), and only eight data points are sufficient to train the MDS model (Supplementary Data 6.2). Figure 6.40 shows the result of reconstructing an A4 guitar note from an input piano key. Audio in the figure is available in the E-book. In this figure, the mechanistic features (i.e., frequencies, amplitudes, damping coefficients, and phase angles) obtained from the ground truth piano sound, the ground truth guitar sound, and the MDS generated guitar sound are compared. As shown in Fig. 6.40, the features of the ground truth guitar sound (marked in orange) and the MDS generated guitar sound (marked in blue) are very similar, and they are different from the features of the ground truth piano sound (marked in green). Figure 6.41 shows the time-amplitude curves (i.e., sound waves) when reconstructing an A4 guitar note from an input piano key. Figure 6.41b, d present magnified plots of the sound waves ranging from 0 to 0.01 s. The MDS generated guitar sound wave is similar to the authentic one (i.e., ground truth) and quite different from the input piano sound wave, as shown in Fig. 6.41e, f. The envelope shapes of the sound waves are controlled by the damping coefficients. The detailed wave shapes are affected by the frequencies and their amplitudes. Figure 6.41 demonstrates that the MDS generated guitar sound captures the key features of the authentic guitar sound.
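A hedged PyTorch sketch of the FFNN in Fig. 6.39: 32 mechanistic features in, 32 out, three hidden layers of 100 tanh neurons, trained with an MSE loss. The data below are random placeholders and the exact layer definitions in the supplementary code may differ.

import torch
import torch.nn as nn

feature_net = nn.Sequential(
    nn.Linear(32, 100), nn.Tanh(),     # 32 piano features: 8 frequencies x 4 features
    nn.Linear(100, 100), nn.Tanh(),
    nn.Linear(100, 100), nn.Tanh(),
    nn.Linear(100, 32),                # 32 guitar features
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(feature_net.parameters(), lr=1e-3)

piano_feats = torch.randn(8, 32)       # placeholders for the 8 training notes
guitar_feats = torch.randn(8, 32)
for _ in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(feature_net(piano_feats), guitar_feats)
    loss.backward()
    optimizer.step()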
Fig. 6.40 Results of reconstructing a single guitar key A4 from a piano key as input. Audios are available in the E-book (Supplementary Audio 6.2)
6.5 Conclusion
Two major variants of deep learning neural networks (i.e., FFNN and CNN) are presented in this chapter, and their capabilities are demonstrated through examples. The neural networks can be combined with mechanistic data science to simplify problems and enable solutions with a relatively small amount of data. A musical sound conversion
Fig. 6.41 Time-amplitude curves (sound waves) of reconstructing a single guitar key A4 from a piano key as input. (a) MDS generated guitar sound. (b) Magnification plot of MDS generated guitar sound ranging from 0 to 0.01 s, to highlight the detailed wave shape. (c) Authentic guitar sound. (d) Magnification plot of Authentic guitar sound ranging from 0 to 0.01 s, to highlight the detailed wave shape. (e) Authentic piano sound. (f) Magnification plot of Authentic piano sound ranging from 0 to 0.01 s, to highlight the detailed wave shape
from piano to guitar demonstrated this capability. Four sets of mechanistic features replaced the CNN, and these features were used as the input layer to the FFNN for training the neural network. Incorporating mechanistic knowledge to perform dimension reduction opens a new avenue for other scientific methods. Researchers have demonstrated new approaches to solving partial differential equations by applying dimension reduction using mechanistic deep learning [31]. This allows the solution of scientific problems with limited data and limited understanding of the relevant physics. It is highly advantageous for predicting biomechanical processes such as the progression of scoliosis [32] and for optimizing additive manufacturing processes by discovering dimensionless parameters [33].
References

1. Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
2. Chen Y-Y, Lin Y-H, Kung C-C, Chung M-H, Yen I-H (2019) Design and implementation of cloud analytics-assisted smart power meters considering advanced artificial intelligence as edge analytics in demand-side management for smart homes. Sensors 19(9):2047
3. Smith JK, Brown PC, Roediger HL III, McDaniel MA (2014) Make it stick: the science of successful learning
4. Cohen JP (2020) COVID-19 image data collection. https://github.com/ieee8023/COVID-chestxray-dataset
5. Ivakhnenko AG, Lapa VG (1967) Cybernetics and forecasting techniques. American Elsevier, New York
6. Dechter R (1986) Learning while searching in constraint-satisfaction problems. University of California, Computer Science Department, Cognitive Systems Laboratory, Los Angeles
7. LeCun et al (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1:541–551
8. Hinton GE, Dayan P, Frey BJ, Neal R (1995) The wake-sleep algorithm for unsupervised neural networks. Science 268(5214):1158–1161
9. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
10. Nvidia CEO bets big on deep learning and VR. Venture Beat, 5 April 2016
11. Maas AL, Hannun AY, Ng AY (2013) Rectifier nonlinearities improve neural network acoustic models. Proc ICML 30(1)
12. https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/
13. https://scikit-learn.org/stable/modules/neural_networks_supervised.html
14. https://pytorch.org/
15. https://www.mathworks.com/help/deeplearning/ref/trainnetwork.html
16. https://arxiv.org/abs/1412.6980
17. https://machinelearningmastery.com/introduction-to-1x1-convolutions-to-reduce-the-complexity-of-convolutional-neural-networks/
18. https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/
19. Li H, Kafka OL, Gao J, Yu C, Nie Y, Zhang L et al (2019) Clustering discretization methods for generation of material performance databases in machine learning and design optimization. Comput Mech 64(2):281–305
20. https://www.mayoclinic.org/diseases-conditions/ards/symptoms-causes/syc-20355576
21. https://radiopaedia.org/cases/COVID-19-pneumonia-evolution-over-a-week-1?lang=us
22. Cohen JP (2020) COVID-19 image data collection. https://github.com/ieee8023/COVID-chestxray-dataset
23. Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM (2017) ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2097–2106
24. Choplin R (1992) Picture archiving and communication systems: an overview. Radiographics 12:127–129
25. Redmon J, Farhadi A (2017) YOLO9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition
26. https://medium.com/@shubham.deshmukh705/dying-relu-problem-879cec7a687f
27. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
28. https://www.brothers-brick.com/2020/07/23/lego-ideas-21323-grand-piano-makes-music-starting-aug-1st-news/
29. https://www.dawsons.co.uk/blog/a-guide-to-the-different-types-of-guitar
30. https://www.apronus.com/
31. Zhang L, Cheng L, Li H, Gao J, Yu C, Domel R, Yang Y, Tang S, Liu WK (2021) Hierarchical deep-learning neural networks: finite elements and beyond. Comput Mech 67:207–230
32. Tajdari M, Pawar A, Li H, Tajdari F, Maqsood A, Cleary E, Saha S, Zhang YJ, Sarwark JF, Liu WK (2021) Image-based modelling for adolescent idiopathic scoliosis: mechanistic machine learning analysis and prediction. Comput Methods Appl Mech 374:113590
33. Saha S, Gan Z, Cheng L, Gao J, Kafka OL, Xie X, Li H, Tajdari M, Kim HA, Liu WK (2021) Hierarchical Deep Learning Neural Network (HiDeNN): an artificial intelligence (AI) framework for computational science and engineering. Comput Methods Appl Mech 373:113452
Chapter 7
System and Design
Abstract System and design bring the techniques of mechanistic data science (MDS) together to solve problems. The previous chapters have demonstrated the steps of MDS, starting with data collection through deep learning. This chapter will provide a range of examples for different classes of problems, from daily life topics to detailed science and engineering research. The examples shown in this chapter will include Type 1 (data driven problems), Type 2 (mixed data and scientific knowledge), and Type 3 (known mathematical science principles with uncertain parameters). Keywords System and design · Spine growth · Piano and guitar music · Additive manufacturing · Polymer matrix composites · Indentation · Landslides · Tire design · Antimicrobial surfaces
7.1 Introduction
The previous chapters of this book have focused on the individual steps of a mechanistic data science analysis:
• Chapter 2 showed data generation and collection methodology,
• Chapter 3 showed methods for regression and optimization,
• Chapter 4 emphasized the extraction of mechanistic features,
• Chapter 5 showed methods of knowledge driven dimension reduction, and
• Chapter 6 focused on deep learning for regression and classification.
The ultimate objective of performing all these individual steps is to combine them to be able to “do something”. This could range from developing new products to making critical decisions. In this chapter, some of the highlights of the various MDS steps from the previous chapters will be shown and brought to a conclusion for the
Supplementary Information: The online version of this chapter (https://doi.org/10.1007/978-3-030-87832-0_7) contains supplementary material, which is available to authorized users. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 W. K. Liu et al., Mechanistic Data Science for STEM Education and Applications, https://doi.org/10.1007/978-3-030-87832-0
problems. The examples shown in this chapter will include Type 1 (data driven problems), Type 2 (mixed data and scientific knowledge), and Type 3 (known mathematical science principles with uncertain parameters).
7.2 Piano to Guitar Musical Note Conversion (Type 3 General)
Chapter 4 demonstrated the analysis of the sound wave signal made by a single piano note (note A4 on the piano keyboard) using Fourier transform (FT) analysis and short-time Fourier transform (STFT) analysis. The STFT results enabled a reduced order model, consisting of a spring-mass-damper system, to be constructed from the raw signal. This reduced order model greatly reduced the dimension of the data. In Chap. 6, the same STFT methodology was applied to eight notes from a piano and eight notes from a guitar, and these notes were used to train a neural network to convert a piano note to a guitar note. The challenge going forward is to “make” alternative forms of music by converting piano music to the form of another instrument. This chapter will also demonstrate how to perform the same conversion using principal component analysis (PCA).
7.2.1 Mechanistic Data Science with a Spring Mass Damper System
A comprehensive deep learning neural network algorithm for converting piano sounds to guitar sounds is shown in Fig. 7.1. When this algorithm is trained with larger datasets, it can be used to convert musical melodies (multiple notes) from one
Fig. 7.1 Algorithm for converting from a piano to a guitar sound. Audio available in E-book (Supplementary Audio 7.1)
instrument to another. In this particular example, eight piano notes and eight guitar notes (A4, A5, B5, C5, C6, D5, E5, G5) are used for training a neural network for sound conversion (additional details are given in Chap. 6). The steps of the mechanistic data science methodology presented in this book are followed to achieve this musical sound conversion. The steps are outlined below. Note that some of the steps are grouped together because the lines between them are blurred and they are essentially executed in concert.

Step 1: Multimodal data collection and generation is the important first step in the process. Sound files for several different notes from both a piano and a guitar must be collected. For the analysis to be performed, 1.5–3 s of data for notes A4, A5, B5, C5, C6, D5, E5, G5 are collected. This dataset represents the minimum data needed, and additional data collection would enhance the final results. Representative sound waves are shown in Fig. 6.34 for a piano (left) and a guitar (right). As expected, the wave form for each is different. The amplitude for the piano decays much faster since there is a built-in damping mechanism. Differences such as this are manifested in the sound when each instrument is played. The piano has more reverberation, whereas the guitar has a “twang” sound. Differences such as these lead to differences in the recorded sound files.

Steps 2, 3, and 4 of the MDS analysis process are integrally intertwined and will be described together. These steps consist of extracting the mechanistic features of the sound signals and creating a reduced order surrogate model of each of the sound signals. The key mechanistic features extracted for each of the sound signals are the (1) frequency (fundamental frequency and harmonic frequencies), (2) damping for each frequency, (3) phase angle, and (4) initial amplitude. An STFT is performed on each signal to separate the base signal into a set of frequencies composed of the fundamental frequency and the harmonic frequencies (Fig. 6.37). Eight frequencies are extracted for each of the signals, with each frequency having a different damping, phase angle, and initial amplitude. A least squares optimization algorithm is used to determine the “best fit” properties of a spring-mass-damper mechanistic model for each of the component signals from the STFT of the original signals (Fig. 7.2). This results in a reduced order model of the signals. As summarized in Table 7.1, the original signals were recorded at a frequency of 44,100 Hz, which leads to a dimension of 120,000 data points for the 2.8-s piano note recordings and 72,000 data points for the 1.6-s guitar recordings. The mechanistic spring-mass-damper model only requires 4 features for each of the 8 frequencies extracted using the STFT, resulting in a dimension of 4 features × 8 frequencies = 32. The reduced order model can be used to generate an approximation to the original sound file. Figure 7.3 shows the overlaid comparison of the two signals. It can be seen that the reduced order model provides an accurate approximation, although late in the signal the agreement is not as close. The actual signal appears to increase in amplitude after 1.25 s, possibly due to reverberation or a change in pressure on the damping pedal. Additional mechanistic features would be required to capture those effects.
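A simplified Python stand-in for this feature-extraction step is sketched below using scipy: an STFT isolates a frequency band and a least-squares fit recovers its decay rate. The synthetic signal, window length, and single-component fit are assumptions; the full Matlab implementation (Feature_extractor.m) is listed later in this section.

import numpy as np
from scipy.signal import stft
from scipy.optimize import curve_fit

fs = 44100
t = np.arange(0, 2.0, 1.0 / fs)
y = 0.1 * np.exp(-3.0 * t) * np.sin(2 * np.pi * 440.0 * t)   # synthetic A4-like stand-in signal

# short-time Fourier transform: rows are frequencies, columns are time frames
f, tau, Z = stft(y, fs=fs, nperseg=4096)
k = np.argmin(np.abs(f - 440.0))                  # frequency bin closest to the fundamental
env, tt = np.abs(Z[k, 1:-1]), tau[1:-1]           # drop the zero-padded boundary frames

# fit a decaying exponential a*exp(-b*t) to the envelope of that bin
decay = lambda t, a, b: a * np.exp(-b * t)
(a_fit, b_fit), _ = curve_fit(decay, tt, env, p0=[env[0], 1.0])
print(b_fit)                                      # approximately 3.0, the damping coefficient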
Fig. 7.2 Spring mass damper model (left) and representative wave form (right)

Table 7.1 Summary of dimension reduction achieved for the piano and guitar signals using the reduced order spring-mass-damper model

Instrument   Original signal dimension     Reduced order dimension
Piano        120,000 (44.1 kHz × 2.8 s)    32 (4 features × 8 frequencies)
Guitar       72,000 (44.1 kHz × 1.6 s)     32 (4 features × 8 frequencies)
Fig. 7.3 Comparison of original authentic piano sound wave signal and a reduced order Matlab approximation. Audio available in E-book (Supplementary Audio 7.2)
Step 5: Deep learning for regression. The reduced order surrogate models for the piano notes and the guitar notes can be used to train a neural network to convert the sound of a given piano note to the corresponding guitar note. The four feature sets (frequency, damping, phase angle, and amplitude) for the piano are used as input to a neural network, and the weights and biases of the neural network are optimized to produce the corresponding four feature sets for a guitar (Fig. 7.4).
Fig. 7.4 Schematic of the neural network used to convert piano sounds to guitar sounds
This process is repeated for all the notes in the training dataset (A4, A5, B5, C5, C6, D5, E5, G5). Additional details of this process are given in Chap. 6.

Step 6: System and design. A musical melody is a collection of notes played in sequence to make an enjoyable sound. For example, the blue signal at the top of Fig. 7.5 shows the first seven notes of the children’s song “Twinkle, twinkle little star”, played using piano notes. (For this example, the melody is made up of a collection of individually recorded notes placed in sequence, but this exercise can also be performed for a melody played in one sitting on a piano.) The goal of this example is to convert a melody from an authentic piano to a guitar sound using a trained neural network. To do this, the trained neural network described above and in Chap. 6 is used. The neural network is trained using eight piano notes and eight guitar notes, as described earlier. As with all neural networks, training with more data yields enhanced results. A reduced order model for each piano note in the melody is created by extracting the features with an STFT and using an optimization algorithm to fit the parameters. The features used are the frequency, damping, phase angle, and initial amplitude. The reduced order models of the piano notes are used as input to the trained neural network. The resulting output is the same note played using a guitar sound. The first seven notes of “Twinkle, twinkle little star” are shown in Fig. 7.5. The blue signal (top) is made using the notes from an actual piano, and the green signal (middle) is made using the notes from an actual guitar. The orange signal at the bottom is created by using the feed-forward neural network to convert the piano signal to a guitar sound. It can be seen from inspection of these signals that the authentic piano signal is quite different from the authentic guitar signal. In particular, the piano has a built-in damper that quickly reduces the amplitude of the signal. This accounts for the rapid decay of the piano signal. This characteristic is not present in a guitar, and the green authentic guitar signal shows that the signal decay is much slower than for the piano. The neural network, even trained using only eight piano notes, is able to capture this difference.
Fig. 7.5 First seven notes from “Twinkle, twinkle little star” from a piano (blue), from a guitar (green), and from a neural network trained using piano and guitar notes. Audio available in E-book (Supplementary Audio 7.3)
The Matlab and Python codes used for this example are given below (Supplementary File 7.1):
• Feature_extractor.m
• Sound_generator.m
• Model_trainer.py
• Feature_generator.py
Feature_extractor.m

clear all
clc
%% Read sound file
Filename='A4.wav';
Filename_out='A4.mat';
[y,Fs] = audioread(Filename);
%% STFT to find the damping coefficients, initial amplitudes, and the frequencies
y=y(:,1);
t=0+1/Fs:1/Fs:1/Fs*size(y);
% plot(t,y,"linewidth",1.5);
% xlim([0,0.01])
% xlabel('Time')
% ylabel('Amplitude')
x=y;
% Assumes x is an N x 1 column vector
% and Fs is the sampling rate.
N = size(x,1);
dt = 1/Fs;
t = dt*(0:N-1)';
dF = Fs/N;
f = dF*(0:N/2-1)';
X = fft(x)/N;
X = X(1:N/2);
X(2:end) = 2*X(2:end);
% figure;
% plot(f,abs(X),'linewidth',1.5);
% plot(f(1:100,1),abs(X(1:100,1)),'linewidth',1.5);
% xlim([0,2000]);
% xlabel('Frequency')
% ylabel('Amplitude')
X=abs(X);
[~,Index]=max(X);
basic_f=round(f(Index));
TF = islocalmax(X,'MinSeparation',basic_f.*1.9);
temp=f(TF);
omega=round(temp(1:8));
% figure
[s,f,t]=stft(y,Fs,'Window',rectwin(2048*2),'FFTLength',2048*2,'FrequencyRange','onesided');
% logs=abs(s(1:300,:));
% surface=surf(t,f(1:300,1),logs,'FaceAlpha',1);
% surface.EdgeColor = 'none';
% view(0,90)
% xlabel('Time(s)')
% ylabel('Frequency(Hz)')
% colormap(jet)
%temp=[452,904,1356,1808,2260,2737];
% fit an exponential envelope to each harmonic to initialize amplitude and damping
for i=1:1:8
    [~,Index(i)] = min(abs(f-omega(i)));
    amp=abs(s(Index(i),:));
    g = fittype('a*exp(b*x)');
    fit_f=fit(t,amp',g);
    out=coeffvalues(fit_f);
    a_out(i)=out(1);
    b_out(i)=out(2);
end
%% least-squares regression to find the optimal damping coefficients, initial amplitudes, phase angles
[y,Fs] = audioread(Filename);
y=y(:,1);
t=0+1/Fs:1/Fs:1/Fs*size(y);
% sum of seven damped sinusoids: amplitude, damping, and phase for each harmonic
F = @(x,xdata)x(1).*exp(x(2).*xdata).*sin(2.*pi.*omega(1).*xdata+x(3).*pi)+...
    x(4).*exp(x(5).*xdata).*sin(2.*pi.*omega(2).*xdata+x(6).*pi)+...
    x(7).*exp(x(8).*xdata).*sin(2.*pi.*omega(3).*xdata+x(9).*pi)+...
    x(10).*exp(x(11).*xdata).*sin(2.*pi.*omega(4).*xdata+x(12).*pi)+...
    x(13).*exp(x(14).*xdata).*sin(2.*pi.*omega(5).*xdata+x(15).*pi)+...
    x(16).*exp(x(17).*xdata).*sin(2.*pi.*omega(6).*xdata+x(18).*pi)+...
    x(19).*exp(x(20).*xdata).*sin(2.*pi.*omega(7).*xdata+x(21).*pi);
% give the initial value of amplitude, damping coefficient and phase angle
for i=1:1:8
    x0((i-1)*3+1)=a_out(i)/max(max(a_out)).*max(max(y));
    x0((i-1)*3+2)=b_out(i);
    x0((i-1)*3+3)=rand();
end
options = optimoptions(@fminunc,'Display','iter');
[x,resnorm,~,exitflag,output] = lsqcurvefit(F,x0,t,y',-10,10,options);
% extract the fitted amplitudes, damping coefficients, and phase angles
for i=1:1:8
    a(i)=x((i-1)*3+1);
    b(i)=x((i-1)*3+2);
    phi(i)=x((i-1)*3+3);
end
%% Save data
save(Filename_out,'a','b','phi','omega');
Sound_generator.m

clear all
clc
%% load original sound file and extracted features (original sound file is just used to measure the duration)
Filename='A4.wav';
Filename_data='A4.mat';
Filename_sound_file='A4_MATLAB.wav';
[y,Fs] = audioread(Filename);
load(Filename_data)
amp_factor=0.1087; % sound volume normalization factor
%% generate sound signal
y=y(:,1);
t=0+1/Fs:1/Fs:1/Fs*size(y);
y_t=zeros(size(t));
for i=1:1:7
    y_t=y_t+a(i).*exp(b(i).*t).*sin(2.*pi.*omega(i).*t+phi(i).*pi);
end
y_t=y_t./max(abs(y_t)).*amp_factor;
plot(t,y,'linewidth',1);
hold on
plot(t,y_t,'linewidth',1);
legend("Authentic sound","MATLAB approximation")
xlabel('Time')
ylabel('Amplitude')
%% write sound file
audiowrite(Filename_sound_file,y_t,Fs);
Model_trainer.py

# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader
from tools import *

def model_trainer(dataset_path):
    '''train model
    Args:
        dataset_path: [String] folder to save dataset, please name it as "dataset";
    Returns:
        None, but save model to current_folder + "results/model.pkl"
    '''
    # configuration
    config = Configer()
    dataset_train = MyDataset(dataset_path, 'train')
    dataset_test = MyDataset(dataset_path, 'test')
    print(f'[DATASET] The number of paired data (train): {len(dataset_train)}')
    print(f'[DATASET] The number of paired data (test): {len(dataset_test)}')
    print(f'[DATASET] Piano_shape: {dataset_train[0][0].shape}, guitar_shape: {dataset_train[0][1].shape}')
    # dataset
    train_loader = DataLoader(dataset_train, batch_size=config.batch_size, shuffle=True)
    test_loader = DataLoader(dataset_test, batch_size=config.batch_size, shuffle=True)
    net = SimpleNet(config.p_length, config.g_length)
    net.to(device)
    criterion = nn.MSELoss()
    optimizer = optim.Adam(net.parameters(), lr=config.lr)
    scheduler = StepLR(optimizer, step_size=int(config.epoch/4.), gamma=0.3)
    # Note that this part is about model_trainer
    loss_list = []
    for epoch_idx in range(config.epoch):
        # train
        for step, (piano_sound, guitar_sound, _) in enumerate(train_loader):
            inputs = piano_sound.to(device)
            targets = guitar_sound.to(device)
            inputs = inputs.reshape(inputs.shape[0], 4, -1)
            targets = targets.reshape(inputs.shape[0], 4, -1)
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, targets)
            loss_list.append(loss.item())
            loss.backward()
            optimizer.step()
        # eval
        if epoch_idx % int(config.epoch/10.) == 0:
            net.eval()
            for step, (inputs, targets, _) in enumerate(train_loader):
                inputs = inputs.to(device)
                targets = targets.to(device)
                inputs = inputs.reshape(inputs.shape[0], 4, -1)
                targets = targets.reshape(inputs.shape[0], 4, -1)
                outputs = net(inputs)
                loss = criterion(outputs, targets)
                print(f'epoch: {epoch_idx}/{config.epoch}, loss: {loss.item()}')
    # save model
    torch.save(net.state_dict(), dataset_path.replace('dataset', 'results')+'/model.pkl')
    # plot loss history
    fig = plt.figure()
    plt.plot(loss_list, 'k')
    plt.ylim([0, 0.02])
    plt.xlabel('Iteration', fontsize=16)
    plt.ylabel('Loss', fontsize=16)
    plt.tight_layout()
    plt.savefig('results/MDS_loss.jpg', dpi=300)

if __name__ == '__main__':
    # train model
    dataset_path = 'dataset'
    model_trainer(dataset_path)
Guitar_feature_generator.py

# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from tools import *

def guitar_feature_generator(dataset_path, key_name):
    '''Generate predicted guitar features from piano features
    Args:
        dataset_path: [String] folder to save dataset, please name it as "dataset";
        key_name: [String] key name that you want to generate. Example: "A4"
    Returns:
        gen_guitar_feats: [List] contains predicted guitar features in a dict,
        note that this part can be used to generate many guitar features,
        so we use a list to store the guitar features.
    '''
    config = Configer()
    model_path = dataset_path.replace('dataset', 'results')+'/model.pkl'
    net = SimpleNet(config.p_length, config.g_length)
    net.to(device)
    net.load_state_dict(torch.load(model_path))
    net.eval()
    res, res_true = [], []
    dataset_train = MyDataset(dataset_path, 'train')
    train_loader = DataLoader(dataset_train, batch_size=config.batch_size, shuffle=True)
    for step, (inputs, targets, key_names) in enumerate(train_loader):
        inputs = inputs.to(device)
        targets = targets.to(device)
        inputs = inputs.reshape(inputs.shape[0], 4, -1)
        targets = targets.reshape(inputs.shape[0], 4, -1)
        outputs = net(inputs)
        gen_feats_batch = outputs.detach().cpu().numpy()
        targets_batch = targets.detach().cpu().numpy()
        inputs_batch = inputs.detach().cpu().numpy()
        for i in range(gen_feats_batch.shape[0]):
            if key_names[i] != key_name:
                continue
            pred_feats_norm = gen_feats_batch[i].reshape(4,8)
            # inverse data to original range
            pred_feats = dataset_train.inverse_guitar(pred_feats_norm)
            true_feats_norm = targets_batch[i].reshape(4,8)
            true_feats = dataset_train.inverse_guitar(true_feats_norm)
            inputs_feats_norm = inputs_batch[i].reshape(4,8)
            inputs_feats = dataset_train.inverse_piano(inputs_feats_norm)
            d = {
                'key': key_names[i],
                'freq': pred_feats[0,:],
                'phi': pred_feats[1,:],
                'a': pred_feats[2,:],
                'b': pred_feats[3,:],
            }
            d_true = {
                'key': key_names[i],
                'freq': true_feats[0,:],
                'phi': true_feats[1,:],
                'a': true_feats[2,:],
                'b': true_feats[3,:],
            }
            res_true.append(d_true)
            res.append(d)
            # plot results
            fig = plt.figure(figsize=(12,5))
            ax1 = fig.add_subplot(1, 2, 1)
            lns1 = plt.plot(pred_feats[0,:], pred_feats[2,:], '^', label='Prediction (G)')
            lns2 = plt.plot(true_feats[0,:], true_feats[2,:], 'v', label='Ground Truth (G)')
            plt.xlabel('Frequency', fontsize=16)
            plt.ylabel('Amplitude', fontsize=16)
            ax2 = ax1.twinx()
            lns3 = plt.plot(inputs_feats[0,:], inputs_feats[2,:], 'o', c='g', label='Ground Truth (P)')
            lns = lns1+lns2+lns3
            labs = [l.get_label() for l in lns]
            ax1.legend(lns, labs, loc=0, fontsize=14)
            plt.title('Key: '+key_names[i], fontsize=18)
            ax3 = fig.add_subplot(1, 2, 2)
            lns1 = plt.plot(pred_feats[1,:], pred_feats[3,:], '^', label='Prediction (G)')
            lns2 = plt.plot(true_feats[1,:], true_feats[3,:], 'v', label='Ground Truth (G)')
            plt.xlabel('Phase angle', fontsize=16)
            plt.ylabel('Damping coefficient $b_i$', fontsize=16)
            ax4 = ax3.twinx()
            lns3 = plt.plot(inputs_feats[1,:], inputs_feats[3,:], 'o', c='g', label='Ground Truth (P)')
            lns = lns1+lns2+lns3
            labs = [l.get_label() for l in lns]
            ax3.legend(lns, labs, loc=0, fontsize=14)
            plt.title('Key: '+key_names[i], fontsize=18)
            plt.tight_layout()
            plt.savefig(f'results/MDS_pred_{key_names[i]}.jpg', dpi=300)
    return res

if __name__ == '__main__':
    dataset_path = 'dataset'
    # generate guitar features from piano features
    gen_guitar_feats = guitar_feature_generator(dataset_path, 'A4')
    print('gen_guitar_feats', gen_guitar_feats)
    # show prediction results
    for dt in gen_guitar_feats:
        for key, value in dt.items():
            print("{}={}".format(key, list(value)))
7.2.2 Principal Component Analysis for Musical Note Conversion (Type 1 Advanced)
The dimensions of the raw sound signals can be reduced using principal component analysis (PCA). It should be noted that PCA creates a reduced order model based only on the data and does not consider mechanistic features; as a result, the meaning of the reduced model created by PCA is hard to interpret. The principal components of the data are extracted using the methodology described in Sect. 5.3.

Step 1. Dataset collection

The training data is the same as for the mechanistic data science model. It consists of eight pairs of piano and guitar sound signals, including the A4, A5, B5, C5, C6, D5, E5, and G5 notes for each instrument. Detailed information on the training set can be found in Sect. 6.4.2.

Step 2. Extraction of dominant features by PCA
7.2.3 Data Preprocessing (Normalization and Scaling)
First, the piano sound signals are put into a matrix Ap, in which each row represents a sound signal and each column represents the amplitude at a certain time step. The dimension of Ap is m (8) × n (81,849), where m is the number of piano sound files and n is the minimum number of time steps (duration × sample rate) among all piano sound files. The piano A5 key has the minimum number of time steps (81,849 ≈ 1.86 s × 44,100 Hz). Matrix Ap is illustrated in Fig. 7.6. The mean and standard deviation are calculated for each time step (column) of the matrix Ap, defined as mean(Ap) and std(Ap).
Fig. 7.6 Build a matrix Ap for all piano sounds. The dimension of Ap is m × n

Fig. 7.7 Obtain a normalized and scaled matrix Bp. The dimension of Bp is m × n
The dimensions of mean(Ap) and std(Ap) are both n × 1. The matrix Ap is then normalized column by column by subtracting mean(Ap) and dividing by std(Ap). Note that in this case, Ap is scaled by dividing by the standard deviation to accelerate the training of the fully-connected FFNN. The normalized and scaled matrix is called Bp and has the same dimension as Ap. The same procedure is used to obtain a normalized and scaled matrix Bg for the guitar sound signals, using the mean vector mean(Ag) and standard deviation vector std(Ag) (Fig. 7.7).
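A numpy sketch of this preprocessing step; the random arrays below are placeholders standing in for the eight real piano recordings, already truncated to n = 81,849 samples.

import numpy as np

# placeholders in place of the eight real piano recordings
piano_signals = [np.random.randn(81849) for _ in range(8)]

A_p = np.stack(piano_signals)        # matrix A_p, shape (m, n) = (8, 81849), one signal per row
mean_Ap = A_p.mean(axis=0)           # mean(A_p), computed per time step (per column)
std_Ap = A_p.std(axis=0)             # std(A_p), computed per time step
B_p = (A_p - mean_Ap) / std_Ap       # normalized and scaled matrix B_p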
7.2.4 Compute the Eigenvalues and Eigenvectors for the Covariance Matrix of Bp and Bg
The covariance matrix, Xp, of the normalized matrix Bp is calculated as

$$X_p = \frac{B_p^{T} B_p}{n-1} \qquad (7.1)$$
A Singular Value Decomposition (SVD) is then performed for the covariance matrix Xp (additional details in Sect. 5.4):

$$X_p = P_p \Lambda P_p^{T} = \begin{bmatrix} p_1^{p} & p_2^{p} & \cdots & p_m^{p} \end{bmatrix} \begin{bmatrix} \lambda_1^{p} & 0 & \cdots & 0 \\ 0 & \lambda_2^{p} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_m^{p} \end{bmatrix} \begin{bmatrix} p_1^{pT} \\ p_2^{pT} \\ \vdots \\ p_m^{pT} \end{bmatrix} \qquad (7.2)$$

where $P_p$ is an orthogonal matrix containing the eigenvectors ($p_1^{p}, p_2^{p}, p_3^{p}, \ldots, p_m^{p}$) and Λ is a diagonal matrix containing the eigenvalues ($\lambda_1^{p}, \lambda_2^{p}, \lambda_3^{p}, \ldots, \lambda_m^{p}$).
7.2.5 Build a Reduced-Order Model
The normalized piano sound matrix Bp can be projected onto the eigenvectors to obtain a reduced-order model Rp:

$$R_p = B_p P_p \qquad (7.3)$$
As the dimension of Bp is m × n and the dimension of Pp is n × m, the dimension of Rp is m × m. The ith row of Rp represents a reduced-dimension vector ai for a piano sound:

$$R_p = \begin{bmatrix} a_1^{T} \\ a_2^{T} \\ \vdots \\ a_m^{T} \end{bmatrix} \qquad (7.4)$$

The vector ai contains the magnitudes of all principal components (PCs) for the ith piano sound. The dimension of ai is m × 1. The same procedure can be followed for the guitar sound signals. The ith row of Rg represents a reduced-dimension vector bi for a guitar sound:
7.2 Piano to Guitar Musical Note Conversion (Type 3 General) Table 7.2 Eight principal components for A4 key in piano and guitar
Magnitude for each PC 1st PC 2nd PC 3rd PC 4th PC 5th PC 6th PC 7th PC 8th PC
2
bT1
Piano-A4 78.08 303.93 47.98 5.03 38.21 5.68 4.71 1.45 1013
231 Guitar-A4 47.32 3.18 19.68 27.26 40.95 19.26 144.83 3.58 1014
3
6 bT 7 6 7 Rg ¼ 6 2 7 4⋮5
ð7:5Þ
bTm For the A4 key, the magnitudes of eights PCs for the piano and guitar sound are shown in Table 7.2. A detailed explanation of PCA can be found in Sect. 5.3.
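Equations (7.1)–(7.4) can be reproduced with numpy as sketched below, using a placeholder matrix in place of the real B_p. Rather than forming the very large n × n covariance matrix explicitly, the SVD is applied to B_p itself; for a symmetric covariance matrix the two decompositions yield the same eigenvectors.

import numpy as np

# B_p stands in for the normalized piano matrix of the previous snippet
m, n = 8, 81849
B_p = np.random.randn(m, n)

# economy-size SVD of B_p: its right singular vectors are the eigenvectors P_p
# of the covariance matrix in Eq. (7.1)
U, S, Vt = np.linalg.svd(B_p, full_matrices=False)   # Vt has shape (m, n)
P_p = Vt.T                                           # eigenvectors as columns, shape (n, m)
eigenvalues = S**2 / (n - 1)                         # eigenvalues of Eq. (7.2)

R_p = B_p @ P_p                                      # reduced-order model, Eq. (7.3), shape (m, m)
a_1 = R_p[0, :]                                      # PC magnitudes of the first piano note, Eq. (7.4)

Repeating the same lines with B_g gives P_g and R_g of Eq. (7.5).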
7.2.6 Inverse Transform Magnitudes for all PCs to a Sound
Using the vector bi that contains the magnitudes of all guitar PCs, bi can be inversely transformed to a guitar sound si by multiplying by $P_g^{T}$, taking an element-wise product with std(Ag), and adding mean(Ag):

$$s_i = b_i^{T} P_g^{T} \circ \mathrm{std}(A_g) + \mathrm{mean}(A_g) \qquad (7.6)$$
Note that ∘ is a notation for an element-wise product.
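A self-contained sketch of Eq. (7.6); the matrices below are placeholders standing in for the guitar quantities computed as in the previous snippets.

import numpy as np

# placeholders for the guitar quantities of the earlier preprocessing and SVD steps
m, n = 8, 70560
B_g = np.random.randn(m, n)
mean_Ag, std_Ag = np.zeros(n), np.ones(n)

U, S, Vt = np.linalg.svd(B_g, full_matrices=False)
P_g = Vt.T                                   # guitar eigenvectors, shape (n, m)
R_g = B_g @ P_g                              # guitar reduced-order model, shape (m, m)

b_i = R_g[0, :]                              # PC magnitudes of the first guitar note
s_i = (b_i @ P_g.T) * std_Ag + mean_Ag       # Eq. (7.6): back to a time-domain signal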
7.2.7 Cumulative Energy for Each PC
One important part of using PCA in practice is to estimate the number of PCs needed to describe the data. For the piano sounds, the cumulative energy ei of the ith PC can be defined as:

$$e_i = \sum_{k=1}^{i} \left(\lambda_k^{p}\right)^{2} \Big/ \sum_{k=1}^{m} \left(\lambda_k^{p}\right)^{2}$$
Fig. 7.8 The relationship between cumulative energy and the number of principal components. (a) Piano sounds. (b) Guitar sounds
The cumulative energy for each PC is shown in Fig. 7.8. Using only 5 principal components captures 90% of the information of the original sounds.
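The cumulative energy curve of Fig. 7.8 can be computed directly from the eigenvalues (the values below are placeholders; in practice they come from the SVD of B_p):

import numpy as np

# placeholder eigenvalues; in practice use S**2/(n-1) from the SVD of B_p
eigvals = np.array([9.1, 4.2, 2.5, 1.3, 0.9, 0.5, 0.2, 1e-13])

energy = np.cumsum(eigvals**2) / np.sum(eigvals**2)   # cumulative energy e_i of each PC
n_pc_90 = np.argmax(energy >= 0.90) + 1               # number of PCs capturing 90% of the energy
# with the real piano eigenvalues this reproduces the roughly five components of Fig. 7.8
print(energy, n_pc_90)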
7.2.8 Python Code for Step 1 and Step 2
The Python code for data collection and data preprocessing is shown below:
def fit_norm_PCA(self, itype, min_length, data_type):
    '''Fit normalization and PCA scaler for training set
    Args:
        itype: [string] decide whether you are handling piano or guitar sounds
        min_length: [int] the minimum length for sound files
        data_type: [string] train or test
    Returns:
        PCA_scaler: [object] a PCA scaler
        mean: [array] mean for this instrument
        std: [array] standard deviation for this instrument
    '''
    # load all sound files for a certain instrument
    sound_file_all = glob.glob(f'{dataset_path}/{itype}/train/*.wav')
    # load wav files to a list
    sounds_all = self.load_files(sound_file_all)
    # segment all sounds with the minimum length and convert to an array
    sounds_seg = np.array([i[:min_length] for i in sounds_all])
    # normalization and scaling
    mean = np.mean(sounds_seg, 0)
(continued)
7.2 Piano to Guitar Musical Note Conversion (Type 3 General)
233
std = np.std(sounds_seg, 0) sounds_seg_norm = (sounds_seg - mean) / std # fit PCA scaler PCA_scaler = PCA(n_components=self.n_components) PCA_scaler.fit(sounds_seg_norm) return PCA_scaler, mean, std
Step 3. Deep learning regression
7.2.9 Training a Fully-Connected FFNN
After obtaining the eight piano vectors a_1, a_2, ..., a_8 and the eight guitar vectors b_1, b_2, ..., b_8, a fully-connected FFNN model, F_FFNN, is built to learn the mapping between piano sounds and guitar sounds. The input is a vector a_i that contains the magnitudes of the PCs of the ith piano sound. The output is a predicted vector b̂_i that contains the magnitudes of the PCs of the ith guitar sound. The FFNN can be defined as:

b̂_i = F_FFNN(a_i)    (7.7)
The FFNN has three hidden layers with 100 neurons. Each hidden layer uses a tanh function as the activation function to learn the non-linear relationship. A detailed description of the FFNN can be found in Sects. 6.2.2 and 6.4.2. The only difference between the mechanistic data science model in Sect. 6.4.2 and the PCA-NN is that the dimensions of the input and output are not the same. The structure of the PCA-NN is shown in Fig. 7.9.
Fig. 7.9 FFNN structure for PCA-NN model. The input and output are principal components, which are reduced dimension features
The loss function L calculates the Mean Squared Error (MSE) between the authentic magnitudes of the guitar principal components b_i and the predicted magnitudes of the guitar principal components b̂_i:

L = (1/N) Σ_{i=1}^{N} (b_i − b̂_i)^2    (7.8)
During training, the FFNN is fed the same notes for the different instruments. For example, a_1 can be fed into the FFNN as the input and b_1 as the output. The subscript 1 indicates that the vectors are for the A4 key. Then, the FFNN is trained on another paired set of vectors, a_2 and b_2, corresponding to the A5 key. The FFNN is trained iteratively until the maximum number of iterations is reached.
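The SimpleNet class used in the training code of Sect. 7.2.10 is project-specific and is not listed in full; a minimal PyTorch sketch of a network matching the description above is given here, assuming 100 neurons per hidden layer (the per-layer width is an assumption):

import torch.nn as nn

class SimpleNetSketch(nn.Module):
    '''Illustrative three-hidden-layer FFNN with tanh activations (not the book's exact SimpleNet).'''
    def __init__(self, input_dim, output_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, output_dim),   # linear output: PC magnitudes of the guitar sound
        )

    def forward(self, x):
        return self.net(x)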
7.2.10 Code Explanation for Step 3

PyTorch is used to implement the FFNN and to train the model:
# Assumed imports for this excerpt; Configer, MyDataset, and SimpleNet are
# project-specific classes defined elsewhere, and device is a global torch.device.
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader

def model_trainer(dataset_path):
    '''train a FFNN model
    Args:
        dataset_path: [string] folder to save dataset, please name it as "dataset"
    Returns:
        None, but save a trained model
    '''
    # set your hyperparameters
    config = Configer()
    # load dataset
    dataset_train = MyDataset(dataset_path, 'train', config.n_components)
    # build dataloader
    train_loader = DataLoader(
        dataset_train, batch_size=config.batch_size, shuffle=True)
    # initialize the FFNN
    net = SimpleNet(input_dim=config.n_components,
                    output_dim=config.n_components)
    net.to(device)
    # build a MSE loss function
    criterion = nn.MSELoss()
    # use Adam as the optimizer
    optimizer = optim.Adam(net.parameters(), lr=config.lr)
    # step learning rate decay
    scheduler = StepLR(optimizer, step_size=int(config.epoch/4.), gamma=0.3)
    loss_list = []  # to save the loss
    # iterate over all epochs
    for epoch_idx in range(config.epoch):
        for step, (piano_sound, guitar_sound, _) in enumerate(train_loader):
            # data preparation for PyTorch
            inputs = piano_sound.to(device)
            targets = guitar_sound.to(device)
            inputs = inputs.reshape(inputs.shape[0], config.n_components)
            targets = targets.reshape(inputs.shape[0], config.n_components)
            # backpropagation to update parameters of the FFNN
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, targets)
            loss_list.append(loss.item())
            loss.backward()
            optimizer.step()
        scheduler.step()  # advance the learning-rate schedule once per epoch (added; not shown in the original excerpt)
Step 4. Generate a single guitar sound
7.2.11 Generate a Single Guitar Sound

After training, a well-trained FFNN model F_FFNN is obtained that can be used to predict the magnitudes of the guitar PCs b̂ using the magnitudes of a piano sound's PCs a as input:

b̂ = F_FFNN(a)    (7.9)

In this example, the vector a comes from the training set. The predicted guitar PCs b̂ are then inversely transformed to a guitar sound ŝ_g by projecting back to the original space:

ŝ_g = b̂^T P_g^T ∘ std(A_g) + mean(A_g)    (7.10)
The reconstruction results are shown in Fig. 7.10. The PCA-NN generated guitar sound wave is almost the same as the authentic one, demonstrating that the PCA-NN model captures the key features of the authentic guitar sound.
Fig. 7.10 Time-amplitude curves (sound waves) of reconstructing a single guitar key A4 from a piano key’s PCs as input. (a) PCA-NN generated guitar sound. (b) Magnification plot of PCA-NN generated guitar sound ranging from 0.10 to 0.11 s, to highlight the detailed wave shape
7.2.12 Python Code for Step 4

The inverse transform Python code is shown below:
def inverse(self, itype, dt):
    '''inverse data to original scale
    Args:
        itype: [string] decide whether you are handling piano or guitar sound
        dt: [array] the magnitudes for guitar PCs. The length is the number of principal components
    Returns:
        sound_raw: [array] an array that contains the signal in the original space
    '''
    if itype == 'piano':
        # inverse transform for PCA
        sound_norm = self.piano_PCA_scaler.inverse_transform(dt)
        # inverse transform for normalization
        sound_raw = sound_norm * self.piano_std + self.piano_mean
    elif itype == 'guitar':
        # inverse transform for PCA
        sound_norm = self.guitar_PCA_scaler.inverse_transform(dt)
        # inverse transform for normalization
        sound_raw = sound_norm * self.guitar_std + self.guitar_mean
    return sound_raw
Step 5. Generate a melody
7.2.13 Generate a Melody

To generate a melody for part of the children's song Twinkle, Twinkle Little Star, step 3 is followed to prepare all related sound files, including A5, C5, and G5. Then, all notes are combined in the proper order, based on the order of the notes in the melody, i.e., "C5, C5, G5, G5, A5, A5, G5".
7.2.14 Code Explanation for Step 5

The melody can be generated with a command-line call to FFmpeg, a popular open-source audio and video tool:
ffmpeg -i C5.wav -i C5.wav \
       -i G5.wav -i G5.wav \
       -i A5.wav -i A5.wav -i G5.wav \
       -filter_complex '[0:0][1:0][2:0][3:0][4:0][5:0][6:0]concat=n=7:v=0:a=1[out]' \
       -map '[out]' melody.wav
7.2.15 Application for Forensic Engineering

Forensic engineering involves the application of mathematical science and engineering methods to analyze "clues" that are left following some "event". This "event" may involve an accident or an alleged failure, but it can also consist of analyzing product performance to determine if it is working properly. While one may not think of a musical instrument in this context, a hypothetical scenario can be constructed that represents a typical forensic engineering example. Consider a situation in which someone pays a large amount of money for a special collector's item piano supposedly owned by a famous musician. After getting it home, the piano does not sound right. To better understand the piano, musical notes from the piano in its current condition can be compared with previous recordings from the piano. This hypothetical scenario may not be very common, but the same methodology can be used to classify noises or other signals from mechanical systems. Engineers regularly record vibrations and sounds on automobiles and other machinery to study squeaks
and rattles. Based on the frequencies and other signal characteristics, problems with brakes, bearings, or engine components can be identified and classified.
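As a small illustration of this kind of frequency-based screening (not taken from the book's code), the dominant frequencies of a recorded vibration signal can be extracted with NumPy; the sampling rate and the synthetic signal below are placeholders:

import numpy as np

fs = 8000                                        # sampling rate in Hz (placeholder)
t = np.arange(0, 1.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 900 * t)  # stand-in recording

spectrum = np.abs(np.fft.rfft(signal))           # magnitude spectrum
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
dominant = freqs[np.argsort(spectrum)[::-1][:3]] # three strongest frequency components
print(dominant)                                  # peaks can be compared against known fault signatures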
7.3 Feature-Based Diamond Pricing (Type 1 General)
Predicting the price of a diamond based on its features, such as color, clarity, cut, and carat weight, is a Type 1 problem for mechanistic data science (see Chap. 1). This diamond pricing example is used in several sections of the book to explain key concepts of data science. Starting from a large repository of data (Chap. 2) on diamond features and prices with appropriate data normalization (Chap. 4), the diamonds can be classified into multiple clusters in which the diamonds have similar features and prices. Many clustering methods can achieve this goal, for example, the k-means clustering introduced in Chap. 5. To map the diamonds' features to their prices, regression methods such as linear regression (Chap. 3), nonlinear regression (Chap. 3), and neural networks (Chap. 6) can be applied. Finally, given a new diamond with known features, its price can be predicted by the trained machine learning model. This model can help customers find a reasonable price for the diamond they would like to purchase, and can also help diamond sellers evaluate the value of their diamonds.
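A minimal scikit-learn sketch of this workflow is shown below; the CSV file, column names, and the choice of linear regression are illustrative assumptions rather than the exact dataset and model used in the book:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('diamonds.csv')                      # assumed file with numeric-encoded features
X = df[['carat', 'cut', 'color', 'clarity']].values   # assumed column names
y = df['price'].values

X_scaled = StandardScaler().fit_transform(X)                      # normalization (Chap. 4)
clusters = KMeans(n_clusters=5, n_init=10).fit_predict(X_scaled)  # k-means clustering (Chap. 5)

# One simple way to combine the two steps: append the cluster label as an extra feature
X_full = np.column_stack([X_scaled, clusters])
X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)                  # regression (Chap. 3)
print(model.score(X_test, y_test))                                # R^2 on held-out diamonds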
7.4 Additive Manufacturing (Type 1 Advanced)
Additive manufacturing (AM), sometimes called three-dimensional (3D) printing, is a rapidly growing advanced manufacturing paradigm that promises unparalleled flexibility in the production of metal or non-metal parts with complex geometries. However, the nature of the process creates position-dependent microstructures, defects, and mechanical properties that complicate printing process design, part qualification, and manufacturing certification. In metal additive manufacturing processes such as laser powder bed fusion (L-PBF) and directed energy deposition (DED), most of the relevant physical processes occur in the vicinity of the melt pool. The laser rapidly heats the metal, causing localized melting and vaporization. The melt pool surface extends behind the moving laser, producing large thermal gradients with corresponding variations in surface tension. During rapid solidification, microstructure growth can produce complicated phases and grain morphologies that strongly affect the local component properties and performance. These multiscale and multiphysics phenomena involve interactions and dependencies among a large number of process parameters and material properties, leading to complex process-structure-properties (PSP) relationships. In AM, localized heating/cooling heterogeneity leads to spatial variations of as-built mechanical properties, significantly complicating the materials design process. To this end, a mechanistic data-driven framework [1] integrating wavelet transforms and convolutional neural networks is developed to
predict position-dependent mechanical properties over fabricated parts based on process-induced temperature sequences, i.e., thermal histories. The in-situ thermal histories were measured by a well-calibrated infrared (IR) camera during the DED process for Inconel 718. The mechanical properties of interest include ultimate tensile strength (UTS), yield strength, and elongation. The framework enables multiresolution analysis to reveal dominant mechanistic features underlying the additive manufacturing process, such as fundamental thermal frequencies. The approach provides a concrete foundation for a revolutionary methodology that predicts the spatial and temporal evolution of mechanical properties by leveraging domain-specific knowledge and cutting-edge machine and deep learning technologies. The proposed data-driven supervised learning approach aims to capture the nonlinear
Fig. 7.11 A schematic of the proposed mechanistic data-driven model linking thermal history and mechanical properties such as ultimate tensile strength (UTS). A convolutional neural network (CNN) scheme combined with a multiresolution analysis is developed to deal with high-dimensional thermal histories and a small amount of noisy experimental data. This methodology provides a mechanistic data-driven framework as a digital twin of the physical AM process [1]
Fig. 7.12 Illustrative IR measurements. (a) Positions of areas of interest at wall #1. (b) Positions of areas of interest at wall #12. (c) An illustrative thermal history for a specific area at wall #1. (d) An illustrative thermal history for a specific area at wall #12. The part numbers are provided in the reference [1]
Fig. 7.13 Illustrative tensile specimens cut from different areas of a wall and the stress-strain curve. (a) Positions of tensile specimens at wall #1. (b) A schematic of the stress-strain curve and corresponding mechanical properties
mapping between local thermal histories (i.e., time-temperature curves) and as-built mechanical properties such as UTS, which is defined as the ability of a material to resist a force that tends to pull it apart. A schematic of the proposed framework is shown in Fig. 7.11.

Step 1: Multimodal data generation and collection

Twelve sets of thermal history were collected by in-situ IR measurement for 12 additively manufactured thin walls (5000 uniformly spaced measurement locations per wall). Each thin wall was built using a single-track, multilayer laser DED process. For the mechanical tensile tests, 135 specimens were cut at
Fig. 7.14 A schematic showing the local thermal history and corresponding wavelet scalograms at different locations on the as-built wall when there is no dwell time
specific positions of interest. For training the proposed data-driven model, 135 sets of thermal history were used as input (Fig. 7.12) and the corresponding 135 sets of UTS were used as labeled output (Fig. 7.13). Detailed information on the thermal history extraction and mechanical tensile tests is provided in the reference [1]. All the thermal histories (5000 per wall) were then used as input to the trained data-driven model to predict 2D high-resolution UTS maps for each thin wall fabricated under a specific process condition.

Steps 2, 3 and 4: Extraction of mechanistic features, knowledge-driven dimension reduction, and reduced order surrogate models

To extract mechanistic meaning from the thermal histories and improve the predictive capability of the model given a small amount of noisy data, the high-dimensional thermal histories are transformed into time-frequency spectra using wavelet transforms [2]. Feature engineering is performed by applying wavelet analysis on the experimental time-temperature histories (i.e., thermal histories). Underlying mechanistic information can be revealed by the wavelet-transformed time-frequency maps. Figure 7.14 shows an example where the wall without dwell time was considered. The time-temperature histories at different points were converted into time-frequency maps using the wavelet transform. Thermal histories vary depending on the position on the wall. For example, at the top-left of the wall (position 1 in Fig. 7.14), the thermal history shows dual peaks. This happens because the specimen point is slightly to the left, and the point does not get sufficient cooling time before being reheated again. The heating-cooling cycles manifest as periodic behavior in the time-temperature history. The time-temperature history from the bottom-right of the wall (position 2) has a shorter heating-reheating period because the point is near the corner. Comparing the wavelet transforms of both points, the wavelet scalogram for the point from the bottom-right of the wall
Fig. 7.15 The architecture of the proposed CNN. The first convolution layer has 64 3 × 3 filters. For each block in the center of the network, the filter size and filter number are shown in the figure. For example, the first residual block has 64 3 × 3 filters, and the last residual block has 512 3 × 3 filters. After eight residual blocks, global average pooling with four strides was used to reduce the dimension. Then 2 fully-connected (FC) layers, with 512 neurons in the first layer and 128 neurons in the second layer, were used to make the final prediction
shows comparatively more high frequencies. Both of these plots have a common fundamental frequency of approximately 0.1 Hz. This frequency is related to the scan speed of the AM process.

Step 5: Deep learning for regression and classification

ResNet18 [3], an 18-layer CNN, is used as the base structure, as shown in Fig. 7.15. In the first convolution layer, the filter size is 3 × 3, the stride is 1, and the padding is 1. These parameters help the network maintain most of the information from the inputs. Eight residual blocks are used as the main structure of the network. Each residual block has a residual connection (or shortcut connection). This technique improves the feature extraction capacity while avoiding the vanishing gradient problem. After the eight residual blocks and the global average pooling layer, two fully-connected (FC) layers are used to fit the output label. ReLU activation functions are only used in the first two FC layers. The network is trained using the Adam optimizer [4] with a weight decay of 1 × 10^-3. Mean squared error (MSE) is used as the loss function. There are several hyperparameters, including the learning rate (1 × 10^-3), the number of epochs (50), and the batch size (8). In addition, fivefold cross-validation is used to choose the optimal network. Thus, the total dataset is split into three parts: training set (64%), validation set (16%), and test set (20%).

Step 6: System and design

Figure 7.16 shows the predicted UTS maps for three process conditions. The three UTS maps on the left denote the original average outputs of the CNN models, and the three maps on the right are the associated locally averaged results, which more clearly demonstrate the spatial variation of the UTS distribution. The two maps in the first row are associated with the AM process without an intentional dwell process or melt pool control. The two maps in the second row are associated with the AM process with a 5 s dwell time between layers but without melt pool control. The third-row maps are associated with the AM process without dwell time but with melt pool control (see the reference [1] for more details). The CNN-computed UTS (in black) and experimental values (in red) at positions of interest are marked in Fig. 7.16 as well. The proposed data-driven approach predicts the UTS very well compared with the experimental measurements. Leveraging the proposed data-driven approach with high-fidelity simulations, spatial and temporal mechanical properties could be predicted. This methodology provides a mechanistic data-driven framework as a digital twin of the physical AM process. It will significantly accelerate AM process optimization and printable material discovery by avoiding an Edisonian trial-and-error approach.

Fig. 7.16 Predicted UTS maps for three process conditions: 120 mm wall without any dwell time or melt pool control, 120 mm wall with 5 s dwell time, and 120 mm wall with melt pool control. The CNN outputs (in black) and experimental values (in red) are marked as well
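A minimal PyTorch sketch of the kind of CNN regression setup described in Step 5 is given below. It adapts torchvision's stock ResNet-18 for single-output UTS regression and is not the exact network of Fig. 7.15 (for example, it keeps a single fully-connected head, and the input size is a placeholder):

import torch
import torch.nn as nn
from torchvision.models import resnet18

# Single-channel scalogram input, single UTS output (illustrative adaptation)
model = resnet18()
model.conv1 = nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1, bias=False)  # 3x3 first conv, as in the text
model.fc = nn.Linear(model.fc.in_features, 1)       # regression head for UTS

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)    # values from Step 5
criterion = nn.MSELoss()

# One illustrative training step on a dummy batch of 8 scalograms
x = torch.randn(8, 1, 64, 64)        # batch of wavelet scalograms (placeholder size)
y = torch.randn(8, 1)                # normalized UTS labels (placeholder)
loss = criterion(model(x), y)
optimizer.zero_grad(); loss.backward(); optimizer.step()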
7.5 Spine Growth Prediction (Type 2 Advanced)
Irregular spine growth can lead to scoliosis, a condition in which a side-to-side curvature of the spine occurs. The cause of most scoliosis is unknown, but it is observed in approximately 3% of adolescents [5]. This is known as adolescent idiopathic scoliosis (AIS). A pure physics-based analysis of the spine and this condition is currently not possible because of the complicated nature of the spinal materials and the slow progression of the condition. Historically, the progression of scoliosis has been assessed through a series of "snapshots" taken through X-rays. These are useful for charting
Fig. 7.17 Front and side X-rays for observing scoliosis
changes, but do not provide much detail related to the interaction between the vertebrae. This is a classic Type 2 mechanistic data science problem in which both the data and the fundamental physics are needed to analyze the condition.

Step 1: Multimodal data generation and collection

X-ray imaging is a common way of assessing the progress of AIS. X-rays are taken from the front and the side at intervals specified by the doctor. The X-rays project a 2D shadow image of the 3D object being evaluated (Fig. 7.17). These two image projections establish the position of the spine at a given instant in time and allow for further measurements of the progression of AIS. In addition to the X-rays, 3D CT scans can be performed to provide much more detail, but they also result in a much higher dose of radiation.

Steps 2, 3 and 4: Extraction of mechanistic features, knowledge-driven dimension reduction, reduced-order surrogate models

As with the piano waveform analysis, the lines between these three steps are somewhat blurry, so they will be described together. As described in Chap. 4, the 2D projection images can be used to compute the location of the vertebra of interest in three dimensions. This begins with the use of the snake method to create landmarks outlining each vertebra [6]. The landmark positions in the two projections are used in conjunction with the 3D reconstruction described in Chap. 4 to establish a three-dimensional bounding box for each vertebra. The 3D reference ATLAS model is deformed using the generated landmarks of the X-ray images to establish a 3D model of the vertebrae corresponding to the X-ray images (Fig. 7.18). The data for the 3D reduced-order model were then used to refine the dimensions of a 3D detailed ATLAS model of the vertebrae (Fig. 7.19).

Step 5: Deep learning for regression and classification

The updated 3D model of the vertebrae shown in Fig. 7.19 was used to create a finite element model of the spine. These models were used to compute the contact pressure at key landmark locations on the surface of the vertebrae. The predicted contact pressures from the finite element model were combined with the clinical measurements of spine growth in a neural network. This allowed for a more patient-specific prediction of vertebra growth (Fig. 7.19).
Fig. 7.18 Sixteen landmarks (yellow dots) are located around each vertebra in each projected image
Fig. 7.19 Finite element stress computation at landmarks on the vertebrae used as input to a neural network to predict growth
Step 6: System and design

The data collected and generated from the X-ray imaging are coupled with the predicted stress data at landmark locations to train a neural network that allows for patient-specific prognosis of spinal growth and deformity (Fig. 7.20). The framework was tested on a single patient, as shown in Fig. 7.20. The 3D
Fig. 7.20 Neural network framework for predicting spinal curvature using mechanistic data science methodology
Fig. 7.21 Reconstructed 3D geometry for a patient at (a) 68 months, (b) 84 months, (c) 100 months. (Source: Tajdari et al. 2021—see Footnote 2)
reconstructions of the training data (68 and 100 months) are shown in Fig. 7.20 along with the 3D reconstruction of the prediction results inside the range of the training data using the mechanistic NN, which indicates good agreement with the experimental data (Fig. 7.21).
7.6 Design of Polymer Matrix Composite Materials (Type 3 Advanced)
Materials engineers are constantly challenged to decide which material suits a specific purpose. This requires in-depth knowledge of the properties of the wide variety of materials available. The material choices can be narrowed down by defining some design and application parameters, such as density, moisture tolerance, or high strength. For instance, choosing a polymer composite for an airplane wing provides better strength-to-weight performance than metal or metallic alloy materials. In Chap. 1, polymer composites are briefly discussed and the method to make a composite material is demonstrated. Here, given a desired composite property, the way to choose a specific combination of the constituent phases (matrix and fibers) will be shown through a system and design problem. First, the system and design problem will be discussed, and then the steps of mechanistic data science that can be carried out to solve the problem will be shown. Suppose an engineer requires certain composite material properties, such as elastic modulus, resilience, toughness, and yield strength. The materials engineer is asked to provide the matrix and the fiber type with a specific composition to achieve the desired properties. This poses several challenges, since a materials engineer must consider many different parameters, such as the material combinations, the constituent fractions of each material, and the temperature of the material. This process can be very time-consuming and costly, but mechanistic data science can help streamline it by learning the relationship from a few material combinations in order to derive a solution. Using mechanistic data science, the problem boils down to learning the hidden relationship between the materials system, the microstructure, and their response from the limited available data [7]. That knowledge can then be used to find an appropriate materials system (Fig. 7.21).

Step 1: Multimodal data generation and collection

The first step is to collect or generate data that is relevant to the problem. The properties of interest can be found from the stress-strain data for a composite materials system. Composite microstructures with fiber volume fractions from 1 to 50% have been prepared to perform numerical tensile testing to predict the stress-strain data. Self-consistent clustering analysis has been used to accelerate the tensile test data generation process [8]. Some data can also be generated by other computer simulations, such as finite element analysis. Stress-strain data can further be collected through physical experiments, although sample preparation can be costly. Four different fiber materials were tested in combination with three different matrix materials. The four fibers are carbon, glass, rayon, and Kevlar. The three matrix materials are epoxy, poly(methyl methacrylate) (PMMA), and polyethylene terephthalate (PET). Numerical tensile testing is carried out by applying a tensile load along the fiber direction up to a strain value of 0.04 at three different temperatures. The data generation process and mechanistic feature extraction are demonstrated in Fig. 7.22.
Fig. 7.22 Stress-strain data generated from tensile simulations. Yield strength, elastic modulus, toughness, and resilience have been extracted as mechanistic features from the stress-strain data
Step 2: Mechanistic feature extraction

The stress-strain curves generated from the numerical tensile simulations contain several mechanistic features (namely elastic modulus, yield strength, toughness, and resilience) that can be extracted. The elastic modulus is defined as the slope of the linear (or elastic) portion of the stress-strain curve. The yield strength is the onset of nonlinearity of the stress-strain curve, which is sometimes hard to identify on the curve. One popular way to determine where the stress-strain curve becomes nonlinear is to take a 0.2% strain offset of the linear portion of the stress-strain curve. Toughness is a measure of the material's ability to absorb energy, and it can be calculated from the area under the stress-strain curve. Resilience is the elastic energy stored during loading, and it can be obtained from the area under the elastic portion of the stress-strain curve (a short code sketch of these feature definitions is given below). All these mechanistic features of the stress-strain curve depend on the microstructure features, such as the fiber volume fraction and the fiber and matrix materials system.

Steps 3 and 4: Knowledge-driven dimension reduction and reduced order surrogate models

Understanding the relation between the microstructure features, the materials property features, and the stress-strain curve features is essential to solve the problem. However, considering the materials properties, the problem is very high dimensional, as there are a total of 17 features (see Fig. 7.23). To reduce the number of material features, they are divided into three categories: Microstructure Descriptor (MSD), Mechanical Property Descriptor (MPD), and Physical Property Descriptor (PPD). These categories were chosen to group the most closely related features, which is useful for dimension reduction. MSD includes the volume fraction (ϕ), matrix density (ρ_m), and filler density (ρ_f), which describe
Fig. 7.23 Very high dimensional materials features that can be compressed using PCA. Three PCA components can represent 98.1% of the MPD data. The latent material property space is the space created by the three main principal components of the MPDs. The bottom left figure shows the 3D space created with L1, L2, and L3, while the bottom right shows the 2D space created only with L1 and L2
the densities and ratio makeup of the composites. MPD includes the rest of the filler/matrix features, which describe the mechanical properties of the matrix and filler materials. PPD only consists of temperature, as it describes the physical surroundings of the composite. The features in MSD can all be related to a fourth variable, the composite density (ρ_c), using the "rule of mixtures", ρ_c = ρ_f ϕ + ρ_m (1 − ϕ), which says that the composite density is equal to the weighted sum of the filler and matrix densities. Therefore, all three MSD features can be reduced to ρ_c.
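Returning to the Step 2 feature definitions, a minimal NumPy sketch of how the four mechanistic features might be extracted from a single stress-strain curve is shown below; the array names, the size of the elastic fitting window, and the 0.2% offset implementation are illustrative assumptions:

import numpy as np

def extract_features(strain, stress, elastic_window=20, offset=0.002):
    '''Elastic modulus, 0.2%-offset yield strength, resilience, and toughness.'''
    # Elastic modulus: slope of the initial (linear) portion of the curve
    E = np.polyfit(strain[:elastic_window], stress[:elastic_window], 1)[0]

    # 0.2% offset yield strength: first point where the curve falls below the offset line
    offset_line = E * (strain - offset)
    yield_idx = np.argmax(stress < offset_line)
    yield_strength = stress[yield_idx]

    # Toughness: area under the full stress-strain curve
    toughness = np.trapz(stress, strain)

    # Resilience: area under the elastic portion (up to yield)
    resilience = np.trapz(stress[:yield_idx + 1], strain[:yield_idx + 1])

    return E, yield_strength, resilience, toughness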
Fig. 7.24 The principal component values of the material with maximized mechanical features and a low density on the LMPS
The MPD features are reduced using Principal Component Analysis (PCA). PCA is a process that transforms a dataset's space so that each dimension, called a principal component, is linearly independent. It was determined that 98.1% of the data could be described by three principal components, which will be referred to as L1, L2, and L3. All of the data was graphed in the 3D space created by these principal components (see Fig. 7.24), which will be called the "Latent Material Property Space" (LMPS). When looking at the 2D LMPS in Fig. 7.24, the clustering of the LMPS suggests that L1 represents the filler material, while L2 represents the matrix material. There are three values of L2 per cluster because there is data for three different temperatures, and the matrix properties are temperature dependent. This clustering pattern implies that, given a set of unknown matrix and filler properties, the principal component values could be plotted on the LMPS to find similar materials in the known database.

Step 5: Deep learning for regression

A feed-forward neural network (FFNN) is used, with the desired mechanical features and density as inputs, and the principal component values of the matrix and filler properties as outputs. The final FFNN model is comprised of three hidden layers, with 50 neurons per layer. The hidden activation functions are sigmoid, while the output activation function is linear. The Adam optimizer is used, with a learning rate of 0.001. The model was trained in minibatch sizes of 32 for 4233 epochs. The resulting model had an MSE of 0.271 (validation MSE of 0.286) and an R^2 score of 0.9601.

Step 6: Design of new composite

The principal component outputs of the FFNN can be compared to the LMPS, which will show the closest material system. These outputs can also be transformed back into their matrix and filler properties with reverse PCA (the PCA inverse transform). The flowchart of the process is shown in Fig. 7.25.
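A minimal scikit-learn sketch of the dimension reduction and the reverse mapping used in Step 6 is given below; the MPD array is a random placeholder rather than the book's dataset:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder MPD table: rows are material systems, columns are matrix/filler properties
mpd = np.random.rand(36, 13)

scaler = StandardScaler()
mpd_scaled = scaler.fit_transform(mpd)

pca = PCA(n_components=3)                    # L1, L2, L3 of the latent material property space
lmps = pca.fit_transform(mpd_scaled)         # coordinates of each material in the LMPS
print(pca.explained_variance_ratio_.sum())   # fraction of variance captured by the 3 PCs

# "Reverse PCA": map predicted principal-component values back to property space
predicted_pcs = lmps[0]                                    # e.g., the FFNN output for one design
properties = scaler.inverse_transform(
    pca.inverse_transform(predicted_pcs.reshape(1, -1)))   # recovered matrix/filler properties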
Fig. 7.25 Flowchart of how to predict a specific material system using the proposed model
For one test, the desired mechanical features chosen are the maximum values of elastic modulus, yield strength, resilience, and toughness in the dataset, together with a low density of 1300 kg/m3. By passing these values through the FFNN and comparing the outputs to the LMPS, this material is predicted to have matrix and filler materials closest to epoxy Kevlar at 295 K. A comparison of the desired material properties and the epoxy Kevlar properties at 295 K and 0.055 volume fraction is shown in Table 7.3. Many of the matrix and filler properties are quite similar, but the matrix elastic modulus and matrix yield strength of the desired material are significantly higher than those of the epoxy Kevlar. To increase these values, many studies have shown that adding silica nanoparticles to epoxy matrices increases both the elastic modulus and the yield strength of the matrix [9].
Table 7.3 Desired material properties vs. epoxy Kevlar properties at 295 K and 0.055 volume fraction

Features | Desired material | Epoxy Kevlar 295 K
Volume fraction | 0.055556 | 0.055
Matrix elastic modulus (MPa) | 6238.186 | 4408
Matrix Poisson's ratio | 0.385362 | 0.4
Matrix yield strength (MPa) | 89.313338 | 69.79866
Matrix hardening parameter | 0.297894 | 0.446773
Filler elastic modulus 1 (MPa) | 166,939.45 | 150,000
Filler elastic modulus 2 (MPa) | 3228.879 | 4200
Filler elastic modulus 3 (MPa) | 3228.879 | 4200
Filler Poisson 1 | 0.362125 | 0.35
Filler Poisson 2 | 0.362125 | 0.35
Filler Poisson 3 | 0.354568 | 0.35
Filler shear 1 (MPa) | 2646.434 | 2900
Filler shear 2 (MPa) | 2646.434 | 2900
Filler shear 3 (MPa) | 1051.689 | 1500
Composite density (kg/m3) | 1300 | 1299.9
Composite elastic modulus (MPa) | 24.013 | 4.844093
Composite yield strength (MPa) | 0.164331 | 0.074844
Composite resilience (MPa) | 0.001400 | 0.000707
Composite toughness (MPa) | 0.005218 | 0.002826

7.7 Indentation Analysis for Materials Property Prediction (Type 2 Advanced)
Indentation is a useful technique to extract the mechanical properties of materials. It is a low-cost, semi- or nondestructive testing procedure that is less time-consuming than tensile testing and capable of providing important material properties such as hardness and elastic modulus. Knowledge of the mechanical properties of materials is essential for engineers to design any real-life part or product. In an indentation experiment, an indenter of known shape (e.g., spherical, conical, etc.), size, and material is pressed into the workpiece material, and the load-displacement data is recorded for both loading and unloading of the indenter. The material properties are extracted by analyzing the load-displacement curve (also known as the P-h curve). Important material physics, such as plasticity and yielding, occur during the loading and unloading steps, and the P-h curve carries the signature of the material's localized properties. Several other mechanisms are also activated during the indentation test. For metals and alloys, dislocations are generated and propagated during the loading step, which deform the material permanently in the form of line defects. The dislocation mechanism can eventually reveal many important aspects of material failure.
The indentation test provides the hardness at the location where the material is tested. Hardness is thus a localized property that varies from point to point in the material. Hardness can be directly related to other mechanical properties of the material, such as yield strength, elastic properties, and hardening parameters. Predicting material properties from the localized hardness data is referred to as the inverse problem of indentation. This has a direct application to additively manufactured materials. Because of the process variability of additive manufacturing, the local microstructures can alter significantly, which produces property variation across AM-built parts. To ensure part integrity, it is very important to evaluate the localized mechanical properties and relate them to the processing conditions. However, performing tensile experiments on microstructure-level samples is difficult and time-consuming. Instead, nano- or instrumented indentation is an easier alternative to evaluate the localized mechanical properties of the material. Using mechanistic data science, the inverse problem of indentation can be solved and a relation between the localized hardness and yield strength can be established for AM-built Ti-64 alloy parts.

Step 1: Multimodal data generation and collection

Nanoindentation data has been collected for additively manufactured Ti-64 alloy from the literature [1]. A summary of the available data set is given in Table 2.1 in Chap. 2. The data set consists of indentation data for AM-built parts under different processing conditions. For this example, the S3067 processing condition, having 144 indentation tests, has been selected as the experimental data set. The load-displacement data from the experiment can be considered high-fidelity testing data. However, localized mechanical properties such as the yield strength are not evaluated locally. For this purpose, 70 numerical indentation simulations have been carried out considering different material properties as input (solving the forward problem) (see animation in Chap. 2). The load-displacement data from both the experiments and the numerical simulations are obtained and stored in a database. The objective is to use this multimodal data to first establish the relation between the experimental low-resolution material property data and the indentation testing data (hardness), and then transfer it to the simulation data using transfer learning to find the material properties (yield strength and hardening parameters) from an inverse problem.

Step 2: Mechanistic feature extraction

Several mechanistic features need to be extracted from the collected load-displacement data (see Fig. 7.26). For instance, the load-displacement data can be used to find the peak load and the indentation area, which are necessary to calculate the hardness and other mechanical properties. From the loading portion of the curve, a curvature (C) is identified. From the unloading portion, the slope (S) of the unloading curve at the maximum load point is identified and a contact area is calculated based on the residual depth, h_c. The area under the loading curve is the total work done, while the area under the unloading curve is the elastic work. The ratio of plastic work (the difference between total and elastic work) to total work (W_p/W_t) is
Fig. 7.26 (a) Multimodal data generation and collection for indentation test, (b) typical load-displacement data from an indentation test with different mechanistic features
considered as another important feature. From the load-displacement data, the hardness can be calculated using the following formula:

H = P_m / A_c

where P_m is the maximum load during the indentation process, and the contact area for a Berkovich indenter can be obtained from A_c = f(h_c) = 24.56 h_c^2. The critical depth h_c is measured from the slope of the unloading curve using the formula:

h_c = h − ε P_m / S

where ε is a constant that depends on the indenter geometry. Another important feature is the indentation reduced modulus, which can be evaluated from the following equation:

E* = (S / (2β)) √(π / (24.56 h_c^2))

where β depends on the indenter shape; for a Berkovich indenter it has a value of 1.034. Finally, the elastic modulus of the workpiece and the indentation reduced modulus can be related by:

1/E* = (1 − ν_s^2)/E_s + (1 − ν_i^2)/E_i

where the subscript "s" denotes the workpiece and "i" denotes the indenter. Therefore, from the load-displacement data, the following mechanistic features can be identified: C, P_m, h_m, h_c, A_c, S, H, E*.

Table 7.4 Mechanistic features and their physical units

Mechanistic features | SI unit
Curvature, C | Pa (N/m2)
Slope of unloading curve, S | N/m
Maximum load, P_m | N
Maximum depth, h_m | m
Plastic to total work ratio, W_p/W_t | –
Hardness, H | Pa
Reduced modulus, E* | Pa
Yield strength, σ_y | Pa
Hardening parameter, n | –

Steps 3 and 4: Knowledge-driven dimension reduction and reduced order surrogate models

Instead of working with all the extracted features, a dimensional analysis can be performed on the features from the load-displacement curve. The process of nondimensionalization has been explained in Chap. 5, and the same concept can be applied to the nondimensionalization of the indentation features. Table 7.4 lists the different mechanistic features and their units. For further study of the nondimensionalization, interested readers are referred to Cheng and Cheng [10]. Using these important features, six important nondimensional groups can be identified. Among them, four are obtained from the indentation test and two are materials properties. The nondimensional groups can be written in functional form in terms of the materials properties as follows:

C/E* = Π_1(σ_y/E*, n)
S/(E* h_m) = Π_2(σ_y/E*, n)
W_p/W_t = Π_3(σ_y/E*, n)
H/E* = Π_4(σ_y/E*, n)

These nondimensional groups are highly correlated with each other and can be used for scaling analysis. These relations between the indentation features and the materials properties can be combined into a single functional form from the above equations:

(σ_y/E*, n) = f(C/E*, S/(E* h_m), W_p/W_t, H/E*)

Fig. 7.27 Mechanistic data science approach to predict localized materials property from indentation analysis
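As a small illustration (not the book's code), the dimensionless feature vector above can be assembled from a measured P-h curve with NumPy; the loading/unloading split, the fitting windows, and the constants eps = 0.75 and beta = 1.034 for a Berkovich indenter are assumptions of this sketch:

import numpy as np

def dimensionless_features(h_load, P_load, h_unload, P_unload, beta=1.034, eps=0.75):
    '''Assemble C/E*, S/(E* h_m), W_p/W_t, and H/E* from a load-displacement curve.'''
    # Loading curvature C from P = C h^2
    C = np.polyfit(h_load**2, P_load, 1)[0]
    # Unloading stiffness S: slope near the maximum load point
    S = np.polyfit(h_unload[:5], P_unload[:5], 1)[0]
    P_m, h_m = P_load[-1], h_load[-1]
    # Contact depth and Berkovich contact area
    h_c = h_m - eps * P_m / S
    A_c = 24.56 * h_c**2
    # Hardness and reduced modulus
    H = P_m / A_c
    E_star = S / (2.0 * beta) * np.sqrt(np.pi / A_c)
    # Plastic-to-total work ratio from the areas under the loading/unloading curves
    W_t = np.trapz(P_load, h_load)
    W_e = abs(np.trapz(P_unload, h_unload))
    Wp_Wt = (W_t - W_e) / W_t
    return C / E_star, S / (E_star * h_m), Wp_Wt, H / E_star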
This relationship further reduces the dimension of the problem. The relationship can be found using a neural network-based surrogate model, which is the next step of this problem.

Step 5: Deep learning for regression

As shown in Fig. 7.27, a relationship between the experimentally obtained data and the simulation data needs to be established, since the experimental local mechanical properties such as yield strength are not known. However, these properties are used to generate data from the simulations. A transfer learning approach is applied to build a better model relating the indentation features and the material properties. First, an experimental neural network with the nondimensional input variables shown in Fig. 7.28 is trained to output the material properties. This neural network has two hidden layers with 50 neurons each and "ReLU" activation functions. Training, testing, and validation sets were set to 70%, 20%, and 10% of the dataset. The R^2 value obtained for this neural network is 0.70. These pretrained layers are then reused for the data generated from the simulations to train a separate neural network. This physics-based neural network has two additional hidden layers with 20 neurons each. The R^2 value increases to 0.74 for the physics-based NN. Finally, this neural network can take any indentation test features as input (in dimensionless form) and predict the localized properties.
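A minimal PyTorch sketch of the transfer-learning idea in Step 5 is given below; the layer sizes follow the description above, while the input/output dimensions, the freezing of the pretrained layers, and all names are assumptions of this sketch:

import torch.nn as nn

# Base network trained on the experimental data: 4 dimensionless inputs -> material-property outputs
experimental_net = nn.Sequential(
    nn.Linear(4, 50), nn.ReLU(),
    nn.Linear(50, 50), nn.ReLU(),
    nn.Linear(50, 2),            # e.g., sigma_y/E* and n
)
# ... train experimental_net on the experimental dataset ...

# Physics-based network: reuse the pretrained hidden layers and add two new 20-neuron layers
pretrained_hidden = nn.Sequential(*list(experimental_net.children())[:-1])  # drop the old output layer
for p in pretrained_hidden.parameters():
    p.requires_grad = False      # keep the experimentally learned representation fixed (one option)

physics_net = nn.Sequential(
    pretrained_hidden,
    nn.Linear(50, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 2),
)
# physics_net is then trained on the simulation data, as described in the text.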
Fig. 7.28 (a) Nondimensional hardness distribution over the AM sample surface, (b) MDS predicted localized yield strength mapping

Fig. 7.29 Relation between hardness and yield strength for Ti64 alloys printed following S3090 printing conditions
Step 6: System and design for new materials system

Using the previous steps of mechanistic data science, a knowledge base for the indentation system is set up, and then a new system can be tested. To illustrate this, a new indentation test data set has been selected from [1] for the S3090 processing condition. The sample size is 360 μm by 360 μm, and an indent is made every 30 μm. The nondimensional hardness distribution for the sample is shown in Fig. 7.28. In Fig. 7.28b, the MDS-predicted localized yield strength is shown. The hardness and the yield strength are highly correlated, and further analysis provides a mechanistic relation between hardness and yield strength for this specific materials system (see Fig. 7.29).
7.8 Early Warning of Rainfall Induced Landslides (Type 3 Advanced)
A landslide occurs when the soil and rocks on a hillside give way and a large section of the hillside suddenly moves down the hill. There are multiple factors that affect the likelihood of having a landslide, including soil type, rainfall, and slope
Fig. 7.30 Landslide along California’s Highway 1. (Photo credit San Luis Obispo Tribune)
inclination. As shown in Fig. 7.30, landslides can be very damaging to property and threaten the lives of people in their vicinity. The cost of landslide damage to public and private property in the United States is estimated to exceed $1 billion per year [11]. Landslide Early Warning Systems (LEWS) attempt to mitigate the threat of landslides by monitoring key variables and providing a timely warning to a population. One key parameter is soil moisture due to rain and storms, which inspires the construction of precipitation intensity-duration thresholds for shallow, rainfall-induced landslides. However, other factors come into play as well, which explains why thresholds developed from limited historical databases of rainwater infiltration have large variability in landslide occurrence times. This historical data usually does not show how factors such as topography, soil properties, and initial soil conditions affect reported landslide times. Mechanistic data science can be used to analyze the interplay of numerous key parameters and conditions for landslide prediction. This example shows how rainfall-induced landslide predictions can be improved by combining historical data with data generated from water infiltration simulations. The MDS steps are outlined below:

Step 1: Multimodal data generation and collection is the important first step to ensure that sufficient data is available for analysis. Properties of interest include failure time, rainfall intensity, soil cohesion, soil porosity, soil density, initial moisture conditions, and slope angle. Failure time is the time it takes for a landslide to occur. The rainfall intensity is the amount of water incident on a soil per unit time (mm/h). Soil cohesion is a measure of the force that holds soil particles together. Soil porosity is the percentage of air or spaces between particles of soil in a given
sample. Soil density is the dry weight of the soil divided by its volume. The initial moisture conditions are represented by the difference between the weight of the dry soil and the weight of the moist soil. The slope angle is the angle measured from a horizontal plane to a point on the land. Note that many historical landslide databases do not include all of these parameters. For situations where all the data are not accessible, computer simulations can be used to generate some of the data. This includes simulating water infiltration events with different parameters to compute the time for the slope to become unstable. Data was collected through physically-based simulation software [12, 13]. The simulation software includes a physical evaluation of a factor-of-safety threshold [14] to determine when the landslide occurs based on the moisture content throughout the soil column. All of the properties of interest above are obtained through this software. Figure 7.31 shows how the simulation software generates data through time. Data were extracted from the database for these soil parameters. Figure 7.32 shows plots of failure time versus rainfall intensity for multiple slope angles. It can be seen that the time for a landslide to occur decreases as the rainfall intensity
Fig. 7.31 A diagram illustrating a soil column with a simulated rainfall event
Fig. 7.32 Failure time vs. Intensity across several slope angles
increases. It can also be seen that steeper slope angles result in faster landslides at all rainfall intensities. The soil cohesion is quantified through the internal friction angle, as suggested by the literature. Because this angle is 37.24°, any data near or above this level will have a skewed response in a different domain from the rest of the data. As a result, the slope angles considered are in the range of 25–35°. The log of both intensity and failure time is then taken for all data points, because past research shows a logarithmic relationship between these variables.

Step 2: Extraction of mechanistic features determines important characteristics from the data collected and generated. From these features, rainfall intensity and slope angle were chosen as inputs, while failure time was chosen as the output. These two inputs were picked because they are more easily measured compared to specific soil parameters. Table 7.5 below shows each feature obtained from the simulation software as well as its units and range. The bottom four parameters are all defined as constants, while rainfall intensity and slope angle are varied to produce a landslide failure time.

Steps 3 and 4: Knowledge-driven dimension reduction and reduced order surrogate models. Each soil parameter was standardized to a constant. This was done to reduce the number of input variables while retaining the relationship between inputs and outputs. All the soil properties and external conditions (rainfall intensity and slope) mentioned in Table 7.5 above are important parameters for failure time prediction. Rainfall intensity and slope angle are reported in the literature as the most important variables associated with landslide triggering since they can
Table 7.5 Features extracted from the simulation software

Parameters | Units | Range
Failure time | h | 1–20,000
Rainfall intensity | mm/h | 1–40
Slope angle | ° | 25–35
Soil cohesion | ° | 37.24
Porosity | % | 71.5
Soil density | g/cm3 | 8.7
Initial soil moisture content | kPa | 3.57
Fig. 7.33 Intensity and slope angle vs. failure time in neural network
vary a lot: every rainfall event can have a different intensity, and a hill can have a different slope angle at every point. Therefore, rainfall intensity and slope angle were varied for data generation and database preparation, which reduces the problem dimension significantly.

Step 5: Deep learning for regression. A feed-forward neural network was constructed to relate the rainfall intensity and slope angle (inputs) to the landslide failure time (output). After randomizing the dataset, k-fold cross-validation was conducted. To accomplish this, the dataset was divided into 5 groups (k = 5). Each group was subject to testing while the other four groups were used to fit the model. The mean absolute error ranges from 0.04 to 0.05 and the coefficient of determination is r^2 = 0.96. Figure 7.33 shows the original data compared to the neural network predictions. It can be seen that the neural network accurately predicts the landslide failure time from the rainfall intensity and slope angle.

Step 6: System and design for intensity-duration thresholds. The neural network predicts the failure time given the input intensity and slope angle. This analysis builds on the premise of having relatively static soil parameters. The slope angle of a given area can be measured. The trained neural network can then be used to compute the rainfall intensity-duration thresholds that indicate landslide risk. This analysis can be applied to the coastal and mountainous areas of California, which are some of the areas most susceptible
Fig. 7.34 Landslide failure times of California cities considering different slope angles and rainfall intensities
to landslides in the United States. Figure 7.34 identifies several locations throughout the state for analyzing landslide risk. Rainfall intensity at these California locations can be used to predict possible landslides for areas with different slope angles.
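A minimal scikit-learn sketch of the Step 5 regression with fivefold cross-validation is given below; the data arrays and the network size are placeholders rather than the simulation database and model described above:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

# Placeholder dataset: [log10(rainfall intensity), slope angle] -> log10(failure time)
X = np.random.rand(500, 2)
y = np.random.rand(500)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(X):
    model = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=2000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))   # R^2 on the held-out fold
print(np.mean(scores))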
7.9 Potential Projects Using MDS

7.9.1 Next Generation Tire Materials Design
One of the fundamental questions that the tire industry faces concerns the durability of the tire. The unpredictable weather and road surface conditions that each tire faces every day have a significant impact on its durability. One of the key material property metrics that can be related to tire material performance is tan(δ). For a tire material, it is desirable to have a high tan(δ) at low temperature (providing better ice and wet grip) and a low tan(δ) at high temperature (providing better rolling friction).
It is noteworthy that approximately 5–15% of the fuel consumed by a typical car is used to overcome the rolling friction of the tires on the road. Therefore, controlling the rolling friction of tires is a feasible way to save energy (by reducing fuel consumption) and protect the environment (by reducing carbon emissions). The safe operation of the tire can also be ensured by providing sufficient ice or wet grip. The key performance metric, tan(δ), is a function of the matrix material, the microstructure, and the operating conditions such as temperature and frequency. It is well known that adding filler improves tire material performance, but which fillers, and what filler distribution, achieve optimized properties and performance is still an important research question. The design space combining different rubber matrices and fillers, microstructures, and operating conditions can be so enormous that experimental or simulation techniques alone are not feasible. The mechanistic data science approach can provide an effective solution to explore the design space by leveraging data science tools to reveal the mechanisms and construct accurate and efficient reduced order surrogate models. This approach will enable industrial practitioners to perform rapid design iterations and expedite the decision-making process. The six modules of the mechanistic data science approach can be used to tackle the problem as follows.

Multimodal data generation and collection: Experimental data such as materials composition and microstructure images, along with mechanical testing data such as tension, shear, DMA, and friction tests, can be collected for different matrix and filler combinations. Multiscale simulation data can be generated through numerical simulation. Using transfer learning, a robust numerical model can be established to generate data that is very accurate compared to the experiment. Interphase and interface characterization also needs to be accounted for.

Mechanistic features extraction: Important mechanistic features need to be identified, such as tan(δ) and others. Microstructure features such as volume fraction and filler distribution, along with operating condition features, also need to be analyzed. How these mechanistic features relate to the material performance can be analyzed by a sensitivity analysis.

Knowledge driven dimension reduction: Based on the materials choice, microstructure features, and operating condition features, the dimension of the problem can be reduced. Potentially, PCA can be used to transfer many less-understood correlations among those features into a latent space and to understand which combinations show similar performance and why.

Regression and Classifications: Relate the mechanistic features to a performance metric such as tan(δ). Classification can be used to better understand local material performance such as crack initiation and damage.

Reduced order surrogate model: Establish a mechanistic-based ROM for rapid performance metric prediction for a given operating condition.

System and design: Use the ROM for materials multiscale performance prediction and apply it to real tire design.
7.9.2 Antimicrobial Surface Design
Since ancient times, people in South Asia have used metal and alloy plates (such as gold, silver, bronze, and steel) for their daily meals. They believed that eating food from a metal plate was good for their health, but most of them did not know why. Though the situation has now changed with polymeric and ceramic tableware, eating off a metal plate is still very common in many South Asian countries. There is important science hidden behind this usage of metal plates related to food bacteria. It is well known that metals like gold and silver have antimicrobial properties. Silver has been widely used considering its antimicrobial effects and low cost. Very recently, it has been reported that copper and copper alloys may have similar or even better antimicrobial effects [15]. This can provide a very low-cost solution to prevent bacterial infections by coating frequently touched surfaces in public places (such as buses, lift buttons, etc.). These metal and alloy surfaces can kill bacteria that come into contact with the surface. The key term is "ion diffusion" through the bacterial membrane, which eventually kills the bacteria. Increasing the ion diffusion rate can accelerate bacteria killing. Surface design parameters such as surface patterning and surface roughness can be directly correlated with the ion diffusion rate for these metal and alloy surfaces. However, identifying the engineered surface that minimizes bacteria growth is not straightforward, as it depends on the kinetics of the bacteria growth and the surface-bacteria interaction. Using the MDS framework with available data science tools allows the study of the relation between different alloy designs and their ion diffusion rates in relation to the surface patterning. This will reveal the mechanisms in more detail and eventually lead to engineered surfaces based on the knowledge gained. The six steps of mechanistic data science are outlined below for this problem.

Multimodal data generation and collection: For this step, data is collected from several information sources, such as the ion diffusion rates of different metal and alloy combinations, the relation of surface roughness to the ion diffusion rate, bacteria-surface interaction, and bacteria growth kinetics.

Mechanistic features extraction: The bacteria growth data are typically collected in image form. Images must be analyzed to identify the key mechanistic features of bacteria growth on such surfaces. Also, surface roughness features and ion diffusion rates can be extracted from experimentally collected data or from modeling.

Knowledge driven dimension reduction: With a large data set of metal and alloy combinations with different bacteria, the problem becomes very high dimensional. Different dimension reduction techniques can be employed to identify the important features to be extracted from the huge dataset.

Reduced order surrogate model: Several reduced order models can be developed for the mechanisms, such as the growth kinetics of bacteria for different surface parameters, surface roughness and patterning, and the ion diffusion mechanism for metal alloys.
Regression and classification: Relating the mechanistic features (surface roughness, bacterial growth kinetics, etc.) to a performance metric such as the ion diffusion rate provides the information needed to compare different alloy and material combinations and their antimicrobial properties.

System and design: Develop and design an engineered metal or metal-alloy surface that reduces the bacterial growth rate.
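To make the reduced-order modeling step more concrete, the sketch below couples an assumed logistic bacterial-growth law with a hypothetical monotonic relation between surface roughness and ion diffusion rate, and integrates the resulting kill kinetics for a few candidate roughness values. All functional forms, parameter values, and variable names here are illustrative placeholders, not measured or published data.

# Illustrative reduced-order sketch for the antimicrobial-surface problem:
# logistic bacterial growth with a first-order kill term proportional to an
# (assumed) ion diffusion rate that increases with surface roughness.
import numpy as np
from scipy.integrate import solve_ivp

def ion_diffusion_rate(roughness, d0=0.05, k=8.0):
    # Hypothetical monotonic relation between roughness and ion release rate.
    return d0 * (1.0 + k * roughness)

def growth_rhs(t, n, mu, capacity, kill_rate):
    # Logistic growth minus a kill term driven by ion diffusion.
    return mu * n * (1.0 - n / capacity) - kill_rate * n

mu, capacity, n0 = 0.8, 1.0, 0.01      # growth rate [1/h], carrying capacity, inoculum
t_span, t_eval = (0.0, 24.0), np.linspace(0.0, 24.0, 200)

for roughness in [0.0, 0.1, 0.3]:      # candidate surface roughness values
    kill = ion_diffusion_rate(roughness)
    sol = solve_ivp(growth_rhs, t_span, [n0], t_eval=t_eval,
                    args=(mu, capacity, kill))
    print(f"roughness={roughness:.1f}  final bacteria fraction={sol.y[0, -1]:.3f}")

A surrogate of this kind, once calibrated against the image-derived growth data, could be queried rapidly during the system-and-design step to rank candidate surface patterns.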
7.9.3 Fault Detection Using Wavelet-CNN
A very important application of wavelet-based CNN methodologies is fault detection during the non-destructive testing (NDT) of metallic parts, for example the detection of welding defects with a wavelet-CNN [16]. Welding faults include slag inclusions, lack of fusion, and cracks. To detect these faults, an eddy current is applied and the resulting magnetic field fluctuation is recorded; the response of the magnetic field differs between defect-free and flawed parts. These signals are converted to 2D images using the wavelet transform, and finally a CNN is applied to classify the test and identify what kind of fault is present inside the material. A minimal sketch of this signal-to-image-to-classification pipeline is given below.
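The following sketch illustrates the pipeline just described: a synthetic eddy-current-like signal is converted into a 2D scalogram with a continuous wavelet transform (using PyWavelets) and passed through a small, untrained CNN (in PyTorch) that outputs fault-class logits. The signal, wavelet choice, network architecture, and class labels are illustrative assumptions and are not the specific model of [16].

# Minimal wavelet-CNN sketch: 1D signal -> 2D scalogram -> CNN classification.
import numpy as np
import pywt
import torch
import torch.nn as nn

# Synthetic eddy-current-like signal: a carrier plus a short transient "defect".
t = np.linspace(0.0, 1.0, 512)
signal = np.sin(2 * np.pi * 50 * t)
signal[250:270] += 0.8 * np.sin(2 * np.pi * 200 * t[250:270])   # simulated flaw

# Continuous wavelet transform -> scalogram image (scales x time), normalized.
scales = np.arange(1, 65)
coeffs, _ = pywt.cwt(signal, scales, "morl")
scalogram = np.abs(coeffs)
scalogram = (scalogram - scalogram.min()) / (scalogram.max() - scalogram.min())

class WaveletCNN(nn.Module):
    """Small CNN mapping a 1-channel scalogram to fault-class logits."""
    def __init__(self, n_classes=4):   # e.g., no defect, slag, lack of fusion, crack
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(16 * 4 * 4, n_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = WaveletCNN()
image = torch.tensor(scalogram, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
logits = model(image)                  # shape: (1, n_classes); untrained weights
print("predicted fault class index:", int(logits.argmax(dim=1)))

In practice the network would be trained on labeled scalograms from defect-free and flawed parts before its predictions carry any meaning.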
References

1. Xie X, Saha S, Bennett J, Lu Y, Cao J, Liu WK, Gan Z (2021) Mechanistic data-driven prediction of as-built mechanical properties in metal additive manufacturing. npj Comput Mater 7:86
2. Meyer Y (1992) Wavelets and operators. Cambridge University Press, Cambridge. ISBN 0-521-42000-8
3. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
4. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
5. https://www.mayoclinic.org/diseases-conditions/scoliosis/symptoms-causes/syc-20350716
6. Tajdari M, Pawar A, Li H, Tajdari F, Maqsood A, Cleary E, Saha S, Zhang YJ, Sarwark JF, Liu WK (2021) Image-based modelling for adolescent idiopathic scoliosis: mechanistic machine learning analysis and prediction. Comput Methods Appl Mech Eng 374:113590
7. Huang H, Mojumder S, Suarez D, Amin AA, Liu WK (2021) Design of reinforced polymer composites using mechanistic data science framework (in preparation)
8. Liu Z, Bessa M, Liu WK (2016) Self-consistent clustering analysis: an efficient multi-scale scheme for inelastic heterogeneous materials. Comput Methods Appl Mech Eng 306:319–341
9. Domun N, Hadavinia H, Zhang T, Sainsbury T, Liaghat GH, Vahid S (2015) Improving the fracture toughness and the strength of epoxy using nanomaterials—a review of the current status. Nanoscale 7(23):10294–10329
10. Cheng YT, Cheng CM (2004) Scaling, dimensional analysis, and indentation measurements. Mater Sci Eng R Rep 44(4–5):91–149
11. Fleming RW, Taylor FA (1980) Estimating the costs of landslide damage in the United States. U.S. Geological Survey, Circular 832
12. Lizarraga JJ, Buscarnera G (2019) Spatially distributed modeling of rainfall-induced landslides in shallow layered slopes. Landslides 16:253–263
13. Rundeddu E, Lizarraga JJ, Buscarnera G (2021) Hybrid stochastic-mechanical modeling of precipitation thresholds of shallow landslide initiation. arXiv preprint arXiv:2106.15119
14. Lizarraga JJ, Frattini P, Crosta GB, Buscarnera G (2017) Regional-scale modelling of shallow landslides with different initiation mechanisms: sliding versus liquefaction. Eng Geol 228:346–356
15. Grass G, Rensing C, Solioz M (2011) Metallic copper as an antimicrobial surface. Appl Environ Microbiol 77(5):1541–1547
16. Miao R et al (2019) Online defect recognition of narrow overlap weld based on two-stage recognition model combining continuous wavelet transform and convolutional neural network. Comput Ind 112:103115
Index
A Acute Respiratory Distress Syndrome (ARDS), 202 Adam optimizer, 204 Additive manufacturing (AM), 15, 16, 143, 238, 240–243, 253 Adolescent idiopathic scoliosis (AIS), 25, 90, 105, 243 Advanced clustering methods, 141 Aliasing, 118 Alloys, 14 AlphaGo, 15 Amplitude, 123 Angular frequency, 114, 115 Anisotropic stretching, 151 Anteroposterior plane (AP), 108, 111 Antimicrobial surface design, 264, 265 Approximation function, 86 Aristotle, 3 Artificial intelligence (AI), 5, 6 application, 173 Artificial neural networks, 174 Artificial neurons, 175 functionality, 175 neural networks, 175 Astronomical theories, 3 Atomic Force Microscopes (AFM), 44 Automated data processing techniques, 40 Autonomous vehicle technology, 90
B Band-pass filter, 119 Baseball, 22–24, 60
Batting average (BA), 23, 61 Berkovich indenter, 254 Best matching unit (BMU), 142
C Cantilever beam bending test, 34 Carbon fibers, 13 Cartesian coordinate system, 93 Chest X-ray imaging, 172 Childhood spinal deformity, 90 Cluster analysis, 132 Clustering methods, 132 Cobb Angle (CA), 113 Color, clarity, cut, carat weight (4Cs), 172, 173, 185, 188 Common weight functions, 75 Computational thermal-fluid dynamics (CtFD), 143 Computed tomography (CT), 104 Computerized axial tomography (CAT), 104 Continuous data, 37 Convex functions, 52 Convolution, 18, 194–197, 200 Convolutional neural network (CNN), 19, 174, 239 architecture, 204, 242 building blocks, 194 computers, 189 backward slash, 190, 191 forward slash, 190, 191 mathematical operations, 190 concepts, 194 convolution, 191, 194–197, 200, 201
268 Convolutional neural network (CNN) (cont.) COVID-19 detection, x-ray images, 201, 203–205 FFNN, 199–201 humans, 192 input layer, 193 kernel/ filter, 191–193 machining learning methods, 193 multiple convolution operations, 193 padding, 198, 200 pooling, 199–201 prediction layers, 193 terminology, 194 UTS maps, 242, 243 Correlation, 21 Cost function, 50, 87, 100 Covariance matrix, 157, 159, 160, 169 COVID-19 artificial neural networks, 174 chest X-ray images, 172 classification, chest X-ray images, 173 radiology imaging techniques, 173 COVID-19 detection, x-ray images Adam optimizer, 204 chest X-ray image database, 202 CNN model, 201, 203, 204 CNN structure, 202, 203 deep learning models, 204 diagnosis, 202 error matrix results, multi-class classification, 205 grey scale X-ray images, 202 inflammation, 201 machine learning-based automatic diagnosis, 202 ReLU activation function, 203 Cross validation, 40, 83
D Damping, 121 Data categories, 37 collection and management, 34 definition, 37 mechanistic data science, 33, 37 scientific knowledge, 34 Data age, 6 Data cleansing, 39 Data collection process, 38 Data deviations, 41 Data fidelity, 41 Data formatting, 39 Data modality, 41
Index Data science, 195 functional development, 40 functional relationship, 40 mechanistic data science model, 40 validation set, 40 Data to empiricism/mechanism, 34, 36 Data wrangling, 39 Decision-making process, 14 Deep learning (DL), 18, 19, 204, 210, 211, 218 artificial neural networks, 171 classification, 242, 244 history, 174 mathematical optimization, 172 regression, 242, 244, 256, 261 statistics, 172 Deep learning neural networks, 15 Deep Mind began working, 15 Dense materials, 104 Diagonal covariance matrix, 157 Diagonal matrix, 169 Diamond price regression, FFNN actual vs. predicted price, 188 Adam optimizer, 187 diamond dataset features, 186 hidden layers, 186 independent variables, 186 input features, 186 Kaggle, 185 loss function, 186 machine learning, 186 mapping function, 186 neural network architecture, 187 open-source diamond dataset, 185 Diamond pricing analysis color rating scale, 42 data normalization, 42 dataset, 41 dependent variables, 41 numerical values, 41 regression techniques, 41 Diamond shaped indentation, 70 Digital twin, 243 Dimension reduction, 18, 20, 30 diamond dataset (see Jenks natural breaks) jogging performance, 132, 133 k-means, 138 Directed energy deposition (DED), 238 Discourses and Mathematical Demonstrations Relating to Two New Sciences, 3 Discrete data, 37 Discrete Fourier Transform (DFT), 116 Double-curvature spine, 111 Drop test, 12
Index E Earthquake magnitude, 38 Edisonian approach, 5 Edisonian brute force method, 5 Edisonian style brute force method, 5 Eigenvalues, 168 Eigenvectors, 168 Empirical approach, 5 Engineering, 6 Environmental impact, 13 Experimental data, 44 Extraction of mechanistic features, 29
F Facial recognition software, 90 Falling objects, 4 Fast Fourier transform (FFT), 116 Fatigue cracks, 9, 10 Fatigue fracture analysis consequences, 9 design methodology, 9, 10 engineers and product designers, 8 Fault detection, 265 Feature, 90 Feature data normalization, 90–92 Feature engineering Cartesian coordinates, 93 computer code, 94 dataset, 92 location determination, 92 reference point, 94 signals, 113 Feature identification, 90 Feature scaling, 91 Feed forward neural network (FFNN), 199, 206, 210, 233 activation function, 175, 184 artificial neurons, 175 biases, 175, 183 closeness, 176 datapoint, 177 diamond price regression, 185–189 gradient decent/back propagation, 184 hidden activation functions, 250 leaky ReLU activation functions, 176 loss function, 184 mechanical features, 250 notations, 183 parametric ReLU activation functions, 176 PCA-NN model, 233 PyTorch, 234 ReLU activation functions, 175
269 training, 234 weights, 175, 183 Feedforward neural networks (FNN), 19 Fidelity, 41, 47 Filtering and frequency extraction, 5 Finite element analysis, 37 Finite element computer simulation, 46 Finite Element Method (FEM), 46 Forensic engineering, 237, 238 Fourier series, 5, 115, 116, 123 Fourier transform (FT) combined noisy signal, 119 definition, 123 DFT, 116 engineering fundamentals, 116 FFT, 116 filter, 118 inverse, 119 low frequency, 117 sinusoidal signals, 116 sound wave analysis, 119–122 Fracture toughness, 12 Friction, 4 Fundamental frequency, 120, 121 Fundamental scientific laws, 3, 5, 7
G Galileo Galilei, 3 Galileo’s analysis, 34 Game theory, 5, 6 Gaussian distribution, 90 Gaussian kernel function, 142 Geometry features extraction, 2D x-ray images AIS analysis, 105 angle between two vectors, 109 coordinate system, 107 global angles, 106, 110 input data, 108 mechanistic data science, 106 vertebra regions, 108 vertebrae planes and locations, 106 Global angles, 106 Global cost function, 73 Global minimum, 52 Gold, 14, 15 Gold alloys, 14, 15 Good datasets, 38 Goodness of fit, 54 Goodness of variance fit (GVF), 135 Gradient, 55 Gradient descent algorithm, 58
270 Gradient descent (cont.) cost function, 58 higher dimensions, 60 higher order function, 58 MLB, 61 univariate derivative, 60 user-defined learning rate, 58 Graphics processing units (GPUs), 174 Gravity, 4, 7, 26
H Hann window function, 124 Hardness testing, 43 Harmonic/inharmonic frequencies, 121 Harmonics, 120 Hierarchical Clustering/Dendrograms, 141 High-dimensional tensor decomposition, 167 High-pass filter, 118 Hueter-Volkmann (HV) principle, 26
I Image processing, 96 Image segmentation, 105 Indentation, 70 computer simulations, 46 data sources, 45 depth and indenter shape, 43 elasticity, 43 experimental data, 44 macro-indentation, 43 multimodal data collection technique, 43 nanoindentation, 43 sample size and shape, 43 Indentation, materials property prediction additive manufacturing, 253 deep learning regression, 256 dislocation mechanism, 252 engineers, 252 hardness, 253 hardness vs. yield strength, Ti64 alloys, 257 knowledge-driven dimension reduction, 255, 256 load-displacement curve, 252 localized hardness data, 253 mechanistic data science, 253, 256 mechanistic features, 253–255 multimodal data generation and collection, 253, 254 nondestructive testing procedure, 252 nondimensional hardness distribution, 257 P-h curves, 252 reduced order surrogate models, 255, 256
Index system and design, 257 tensile experiments, 253 Independent variables, 57 Industrial Revolution, 114 Inharmonicity, 121 Initial distance coefficient, 142 In-situ thermal histories, 239 Instrumental music conversion, mechanistic data science A4 piano sound signal, 209 challenge, 205 CNN loss function values, 208 converts music, 208 deep learning, 210 FFNN structure, 210 Fourier analysis, 205 guitar sound, 210, 212 guitars, 205 hyperparameters, 207 machine learning model piano sound to a guitar sound, 205, 206 MDS loss function values, 208 mechanistic features, 207, 208 optimal coefficients A4 piano sound signal, 209 authentic A4 guitar sound, 209 piano and guitar A4 time-amplitude curves, 206 CNN architecture, 206, 207 strategies, machine learning model, 206 training data paired sounds, 206 piano sound to a guitar sound, CNN approach, 207 pianos, 205 STFT, 208 training data, 208 Internet of things (IOT), 8 Ion diffusion, 264 IR in-situ measurement, 240 Iteration, 162
J Jenks natural breaks clustering data points, 135 data clustering algorithm, 133 diamond data, 137 diamond dividing, 133 GVF, 135 iterations, 138 k-means clustering, 136 SDAM, 134 SDCM, 135 Jogging performance, 132, 133
Index K Kaggle databases, 38 Kepler’s three laws, 34 Kernels, 192, 194 K-fold cross validation, 40, 84 K-means clustering, 132 diamond dataset, 139 disadvantages, 141 higher dimension data, 138 number of clusters, 139, 141 procedure, 138 SOM (see Self-organizing map (SOM)) Knowledge driven dimension reduction, 29, 241, 244, 248, 255, 260, 263, 264
L Landmarks, 105 Landslide Early Warning Systems (LEWS), 258 Landslides California slope angles, 262 cost, landslides damage, 258 factors, 257 failure time vs. intensity, 260 LEWS, 258 MDS deep learning regression, 261 knowledge-driven dimension reduction, 260 mechanistic features extraction, 260, 261 multimodal data generation and collection, 258, 259 reduced order surrogate models, 260 system and design, 261, 262 soil and rocks, 257 soil moisture, 258 Laser powder bed fusion (L-PBF), 238 Latent Material Property Space (LMPS), 250 Law of force balance, 4 Law of inertia, 4 Law of reaction forces, 4 Laws of motion, 4–6 Learning, 171 Least square method cost function, 100 error, 100 known quantities, 99, 101 lines, 102 matrix, 101 point of interest, 99
271 problem solving, 101 scalar constant, 99 step-by-step solution process, 102 Least square optimization, 18 coefficient of determination, 54 coordinate system, 56 cost function, 54 functional relationship, 54, 56 goodness of fit, 54 gradient, 58 independent variable, 57 linear regression model, 54 multidimensional derivatives, 55 multivariate optimization, 56 nonlinear optimization, 50 orbits of celestial bodies, 50 rate of change, 55, 56 regression, 50 relationship between variables, 49 scalar function, 57 Light bulb, 5 Limited data and scientific knowledge, 17 Linear regression, 18, 20 analysis, 62 best-fit linear relationship, 53 data points, 52 generic equation, 53 multivariate, 62 Python code, 78 SLG and OBP, 62 weights, 53 Linear regression analysis, 24 Linear Variable Differential Transformer (LVDT), 45 Load-displacement data, 253, 254 Long short-term memory (LSTM), 174 Lowpass Butterworth filter, 119 Low-pass filter, 118 Lp-norm, 80, 81 Lp-norm regularized regression, 81, 82 Lumbar Lordosis Angle (LLA), 110
M Machine learning, 41, 46, 186, 205 Macro-indentation, 43 Magnetic field fluctuation, 265 Magnetic resonance imaging (MRI), 25, 105 Major League Baseball (MLB), 23, 61 Massless spring, 28 Mass-spring system challenges, 28 classification, 30
272 Mass-spring system (cont.) extraction of mechanistic features, 29 knowledge-driven dimension reduction, 29 massless spring, 28 multimodal data generation and collection, 28, 29 reduced order surrogate models, 30 regression, 30 systems and design, 29, 30 Material characterization, 46 Material deformation picture, 69 Material design applications, 12–15 drop test, 12 macrostructure, 10 mesoscale, 10 microstructure, 11 MSD framework, 13, 14 reinforced ice cube, 10, 11 reinforcement materials, 12 small-scale sub-structure, 10 Material hardness testing, 43 Materials characterization, 143 Materials Project, 38 Mathematical relation between SVD and PCA, 168, 169 Mathematical science, 2, 17 Mathematics, 6 Matrix convergence factor, 163 Matrix deposition problem, 164 Matrix transpose, 168 Mean square error (MSE), 81, 84, 152, 186, 210, 242 Mean-subtracted data matrix, 158 Mechanism to science, 36 Mechanistic aspect, 36 Mechanistic Data Science (book), 36 Mechanistic data science (MDS), 106 AlphaGo, 15 antimicrobial surface design, 264, 265 description, 2 determining price of diamond based on features classification, 20 dimension reduction, 20 extraction of mechanistic features, 20 multimodal data collection, 20 multivariate linear regression, 21, 22 properties, 20 regression, 20 sparkle and impressiveness, 20 sports analytics, 22 system and design, 21
Index engineering problems, 16, 17 equations from mathematical science, 2 falling objects, 4 fatigue fracture analysis, 8–10 fault detection, wavelet-CNN, 265 framework, 13 gold, 14, 15 history of science, 3 laws of motion, 4–6 limited data and scientific knowledge, 17 mass-spring system, 28–30 material design, 10–15 next generation tire materials design, 262, 263 patient-specific scoliosis curvature AIS, 25 extraction of mechanistic features, 26 knowledge-driven dimension reduction, 26 multimodal data generation and collection, 25, 26 reduced order surrogate models, 26, 27 PCA, 249, 250 power, 2 purely data-driven, 16 revolution in data science, 7, 8 spring-mass systems, 17 STEM, 6–8 system and design (see System and design, MDS) Mechanistic data science analysis computation, 37 database, 38 measurement, 37 multi-fidelity data, 47 Mechanistic data-driven framework, 238, 243 Mechanistic feature extraction, 247, 248, 253, 254, 260, 261, 263, 264 data science process, 89 elastic modulus, 248 resilience, 248 yield strength, 248 Mechanistic spring-mass-damper model, 126 Medical images, 90, 103 CT, 104 MRI, 105 segmentation, 105 X-ray/radiography, 103, 104 Mesoscale, 10 Metal additive manufacturing, 238 Meteorology, 28 Microhardness, 144, 145 MLS approximation, 75, 76
Index MLS cost function, 87 Modal superposition PGD, 165, 166 Modalities, 47 Moneyball, 22–24, 62 Moneyball regression analysis base percentage, 64 cost function, 64, 68 logical question, 66 matrix notation, 67 multivariate linear regression, 66, 67 optimal weights, 64 procedure, 62, 64 statistical variable, 66 Moving average, 74, 75 Moving least squares (MLS), 75, 86–87 Multimodal data, 18, 20, 33 Multiple convolution filters, 194 Multivariate linear regression, 21, 22, 24
N Nanoindentation, 43, 46 Nash equilibrium, 5, 6 National Climate Data Center (NCDC), 38 National Institute of Standard and Technology (NIST), 38 Neural network (NN) arbitrary vectors, 180 artificial neural networks, 174 database, 183 datapoint, 176, 177, 179, 180 goal, 176 gradient decent (GD), 181 hidden layers, 178 hidden neurons, 176–178 history, 174 mechanistic data science, 211 output and error, 177, 179–181 output and loss, 182 training, 176 weights, 177–180 X-ray images, COVID-19, 174 Neural networks, 2, 5, 6, 17, 26 Neuroscience, 28 Newton’s universal law of gravitation, 36 Next generation tire materials design, 262, 263 Nicolaus Copernicus, 3 Non-convex equation, 52 Non-convex functions, 52 Non-cooperative games, 5 Nonlinear regression, 18 higher order regression models, 78
273 Nonlinear relationship bacteria growth, 76 MLS regression, 75 moving average, 74, 75 piecewise linear regression, 72, 73 Non-regularized regression, 83 Nyquist frequency, 118
O Oceanography, 28 On the Revolutions of the Celestial Spheres, 3 On-base percentage (OBP), 23, 61 On-base Plus Slugging (OPS), 24, 62 Optimization, 50 Outcome variable, 18
P Padding, 198, 200 Periodic pattern, 114 Philosophiae Naturalis Principia Mathematica, 4 Physlet Tracker, 29 Piano to guitar musical note conversion deep learning regression training, fully-connected FFNN, 233, 234 forensic engineering application, 237 MDS and spring mass damper system algorithm, 216 deep learning neural network algorithm, 216 deep learning, regression, 218 dimension reduction, 217, 218 Matlab, 220 mechanistic features, 217 multimodal data collection and generation, 217 neural network, 219, 220 Python codes, 220 reduced order model, 217, 218 system and design, 219 melody generation, 237 command-line code, 237 forensic engineering application, 238 PCA-NN generated guitar sound wave, 235, 236 principal component analysis (PCA) A4 key, 231 covariance matrix, 230 cumulative energy, 231, 232
274 Piano to guitar musical note conversion (cont.) data preprocessing, 228, 229 dataset collection, 228 inverse transform magnitudes, 231 Python code, 232 reduced order model, 228, 230 single guitar sound generation, 235 Python code, 236 STFT, 216 Piecewise linear regression, 72, 73 Planetary motions, 35 Polar coordinate system, 93, 95 Polymer matrix composites composite material, 247 deep learning regression, 250 desired material properties vs. epoxy Kevlar properties, 252 knowledge-driven dimension reduction, 248–250 material choices, 247 materials engineer, 247 mechanistic data science, 247 mechanistic feature extraction, 248 elastic modulus, 248 multimodal data generation and collection, 247 new composite design, 250, 251 reduced order surrogate models, 248–250 Polynomial basis vector, 86 Pooling, 199–201 Predetermined criterion, 166 Principal component analysis (PCA), 18, 29, 132, 216, 250 datapoints, 147, 148 dataset, 147 dimension reduction, 146 eigenvalues and eigenvectors, 148 procedure, 147 reduced order model, 147 Principal components, 250 Process-structure-properties (PSP), 143, 238 Proper generalized decomposition (PGD), 18, 132 column vectors matrix, 160 high-dimensional tensor decomposition, 167 incremental approach, 161–164 mathematical concept, 160 modal superposition, 165, 166 SVD, 160 Purely data-driven, 16 Python code, 107 PyTorch, 234
Index Q Quadratic function, 51 Qualitative data, 37 Quantitative data, 37
R Radiology imaging techniques, 172 Rainfall intensity, 260, 262 Random sub-sampling, 84 Raw data, 39 Raw datapoints, 87 Recurrent neural network (RNN), 174 Reduce order surrogate model, 263 Reduced order model (ROM), 132, 217–219, 230 Reduced order nonlinear MLS approximation, 87 Reduced order surrogate model, 14, 18, 26, 27, 30, 248–250, 255, 256, 260, 264 PCA, 147, 148 SVD (see Singular value decomposition (SVD)) Regression, 18–21, 24, 30, 50, 70, 85 Regularization approaches, 80 K-fold cross-validation, 84 L2-norm regularized regression, 82, 83 Lp-norm, 80, 81 Lp-norm regularized regression, 81, 82 p-norm, 78 regression model, 78 regularized loss function, 78 Regularization parameter, 78 Regularized regression function, 81 Reinforced ice cube, 10–12 ReLU activation function, 175 ReLU type activation function, 256 Revolution in data science, 7, 8 Road surface conditions, 13 Runs batted in (RBI), 23, 61 Runs scored (RS), 24
S Sacral Inclination Angle (SIA), 112 Safety operation, 13 Science, Technology, Engineering and Mathematics (STEM), 6–8 Scientific hypothesis, 36 Scientific law, 7 Scientific method, 3, 4, 7 Scoliosis, 105, 243, 244
Index Scoliosis progression, 17 Secondary dendrite arm spacing (SDAS), 144 Self-organizing map (SOM), 18 AM, 143, 145 BMU, 142 cluster analysis, 141 competitive learning, 142 CtFD model, 143 data-driven design, 143–146 data-driven relationships, 145 datapoints, 142 goal, 141 Matlab code, 144 neurons, 141 process parameters, 143 simulation and experimental data points, 144 T epochs, 142 unsupervised machine learning algorithm, 141 visualized, 145 Self-organizing maps (SOM), 141 Sensing, 45 Short time Fourier transform (STFT), 207, 208, 216 amplitude analysis, 123 damping coefficient, 125–127 damping effects, 124 frequency content, 124 fundamental frequency and harmonics, 128 Matlab, 126 optimization algorithm, 126 spring-mass-damper system, 124 window functions, 123 Signals, 113 Simple moving average, 74 Single-curvature spine, 112 Singular value decomposition (SVD), 18, 29, 132, 230 covariance matrix, 157, 160 damping coefficient, 154 data matrix, 154, 158 data-driven methods, 149 diagonal covariance matrix, 158, 159 dominant one-dimensional data, 154 local displacement vectors, 154 Matlab/Python, 150 matrices, 150, 152 matrix multiplication, 149 matrix order reduction, 151 MSE, 152, 153 principal components, 158 spring-mass system, 153, 159 three-dimensional motion, 153
275 time-dependent displacement, 154 variance and covariance, 155, 157 Singular values, 151 Singular vectors, 151 Slope angle, 260 Slugging percentage (SLG), 23, 61 Small-scale sub-structure, 10 S-N curve, 9 Snake algorithm, 105 Sound waves, 122 Spatial frequency, 115 Spine growth prediction, 243, 244, 246 Sports analytics, 22 Spring-mass-damper mechanistic model, 124, 127, 218 Spring-mass harmonic oscillator, 153 Spring-mass system, 17, 159 Standard deviation, 90 Standard least square regression problem, 80 Standard normalization, 90, 92 Stress, 69 Stress amplitude, 9 Stress life method, 9 Stress vs. strain curve, 69 Stress-strain data, 247, 248 Stride, 196 Sum of squared deviations from array mean (SDAM), 134 Sum of squares concept, 83 Supervised learning classification problem, 172 dependent variables, 172 input-output mappings, 172 output variable(s) vs. input variable(s), 172 regression problem, 172 Surface designing parameters, 264 Surface fracture pattern, 44 System and design, 19, 21, 29, 30 additive manufacturing, 238, 240–243 feature-based diamond pricing, 238 indentation (see Indentation, materials property prediction) landslides, 257, 258, 260–262 Piano to guitar musical note conversion (see Piano to guitar musical note conversion) polymer matrix composites, 247–251 spine growth prediction, 243, 244, 246
T Tensile strength (TS), 70 Thermal history, 240, 241 Thoracic Kyphosis Angle (TKA), 110
276 Three-dimensional (3D) printing, 238 Time-temperature history, 241 Tire durability, 13 Trained SOMs, 145 Trait theory classification, 132 Transfer learning, 256, 263 Triangular indentations, 46 Truncated Gaussian functions, 75 Trunk Inclination Angle (TIA), 112 Tycho Brahe, 3
U Ultimate tensile strength (UTS), 70, 239, 240 Unsupervised learning, 172
V Vertebra landmarks, 109 Vertebra regions, 108 VH measurements, 70 VH vs. ultimate tensile strength, 70 Vibration, 119 Vickers hardness (HV), 70 Video replay check, 96
Index Voice of the customer, 8 Vovariances, 167
W Wavelet-CNN, 265 Weather conditions, 13 Web indexing, 28 Welding faults, 265 Wet grip, 13 Wind resistance, 4
X X-ray (radiography), 103, 104 X-ray imaging, 244, 245 X-rays, 25, 244
Y Young’s modulus, 69
Z Zero slope, 51