Keisuke Takahashi · Lauren Takahashi
Materials Informatics and Catalysts Informatics: An Introduction
Keisuke Takahashi, Department of Chemistry, Hokkaido University, Sapporo, Hokkaido, Japan
Lauren Takahashi, Department of Chemistry, Hokkaido University, Sapporo, Hokkaido, Japan
ISBN 978-981-97-0216-9    ISBN 978-981-97-0217-6 (eBook)    https://doi.org/10.1007/978-981-97-0217-6
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Paper in this product is recyclable.
Preface
Materials and catalysts informatics has swiftly risen to prominence as a fourth viable approach to scientific research, standing alongside the traditional methods of experimentation, theory, and computation. However, despite its rapid growth in popularity and significance, there remains a noticeable lack of comprehensive textbooks and educational resources for these emerging fields. The few materials that do exist tend to assume a certain level of prior knowledge in data science, predominantly catering to experts who are already familiar with programming and data science. This situation results in a substantial learning curve, which can appear formidable for people interested in exploring materials and catalysts informatics. In fact, this learning curve may scare off potential adopters of data science, especially those who have little or no experience with programming or data science. We find this a shame, and hope to provide a more beginner-friendly introduction to data science and to materials and catalysts informatics.

The motivation behind the creation of this book is to bridge this gap and make the knowledge and methodologies underpinning materials informatics easier to access. We aim to make these fields more accessible to individuals with varying levels of exposure or experience in computer science and data science, with a particular focus on students who aspire to adopt these technologies to broaden their skillsets. Such proficiency is likely to be invaluable upon graduation, as it becomes increasingly relevant in the ever-evolving landscape of the workforce.

With this in mind, our intention for this book is to offer a comprehensive journey through the multifaceted world of materials and catalysts informatics. We will begin by delving into the foundational aspects, exploring the history and evolution of these fields and elucidating their significance in the context of modern scientific research. From there, we will transition into the nuts and bolts of programming, data visualization, and machine learning, providing readers with a solid grounding in the essential tools and techniques that underpin materials informatics.

However, we don’t stop at theory alone. Our aim is to make the subject matter tangible and actionable for our readers. Therefore, we will provide real-life examples that illustrate how the various techniques can be practically applied to real-world datasets. These examples serve a dual purpose—they not only reinforce the theoretical concepts
but also give the reader knowledge that can be readily employed in their own research or projects.

In a rapidly evolving world driven by data, we believe that a better understanding of materials and catalysts informatics can be a catalyst for transformative changes across various industries. We hope that this book will serve as a stepping stone for more individuals to venture into the dynamic realm of data science and contribute to the growth and development of materials and catalysts informatics. By equipping our readers with the essential knowledge and skills required for this journey, we hope to foster a broader and more inclusive community of researchers who will push the boundaries of knowledge, unleashing a new era of groundbreaking discoveries. With this, we believe that as the frontiers of materials and catalysts informatics expand, so too will the potential of research, paving the way for innovative solutions and transformative breakthroughs. By encouraging a broader audience to embrace these fields, we anticipate a surge in innovative research and applications, ultimately expanding the horizons of what is achievable in both the scientific and industrial domains.

We thank you for taking the time to learn more about materials and catalysts informatics, and hope that you find this a useful and applicable resource for your future research endeavors.

Sapporo, Japan
November 2023

Keisuke Takahashi
Lauren Takahashi
Contents
1 An Introduction to Materials Informatics and Catalysts Informatics
   1.1 Introduction
   1.2 The Rise of Materials Informatics
   1.3 What Is Materials Informatics?
   1.4 Materials Informaticists
   1.5 Data Science vs. Experiments, Theory, and Computation
   1.6 Data Science in Materials Science
   1.7 Catalysts Informatics and Materials Informatics
   1.8 Conclusion
   Questions
2 Developing an Informatics Work Environment
   2.1 Introduction
   2.2 Hardware
   2.3 Software
   2.4 A Brief Guide to Installing Linux Mint
   2.5 Conclusion
   Questions
3 Programming
   3.1 Introduction
   3.2 Basics of Programming
   3.3 Programming Languages
   3.4 Open Source
   3.5 Conclusion
   Questions
4 Programming and Python
   4.1 Introduction
   4.2 Basics of Python
   4.3 Structuring Code through Logic and Modules
   4.4 Conclusion
   Questions
5 Data and Materials and Catalysts Informatics
   5.1 Introduction
   5.2 Collecting Data
   5.3 Data Preprocessing
   5.4 Conclusion
   Questions
6 Data Visualization
   6.1 Introduction
   6.2 Matplotlib
   6.3 Seaborn
   6.4 Conclusion
   Questions
7 Machine Learning
   7.1 Introduction
   7.2 Machine Learning
   7.3 Supervised Machine Learning
   7.4 Unsupervised Machine Learning
   7.5 Semi-supervised Machine Learning
   7.6 Reinforcement Learning
   7.7 Conclusion
   Questions
8 Supervised Machine Learning
   8.1 Introduction
   8.2 Linear Regression
   8.3 Ridge Regression
   8.4 Logistic Regression
   8.5 Support Vector Machine
   8.6 Decision Tree
   8.7 Random Forest
   8.8 Voting
   8.9 Neural Network
   8.10 Gaussian Process Regression
   8.11 Cross-Validation
   8.12 Conclusion
   Questions
9 Unsupervised Machine Learning and Beyond Machine Learning
   9.1 Introduction
   9.2 Unsupervised Machine Learning
      9.2.1 Dimensional Reduction
      9.2.2 Principal Component Analysis
      9.2.3 K-Means Clustering
      9.2.4 Hierarchical Clustering
      9.2.5 Perspective of Unsupervised Machine Learning
   9.3 Additional Approaches
      9.3.1 Network Analysis
      9.3.2 Ontology
   9.4 Conclusion
   Questions
Solutions
References
Index
1 An Introduction to Materials Informatics and Catalysts Informatics
Abstract
In this chapter, we delve into the fundamental concepts of materials informatics and catalysts informatics, shedding light on their origins and their relationship with the fields of materials and catalysts science. To begin, we explore the history of materials science and catalysts science, emphasizing the traditional approach of designing materials and catalysts based on researchers’ knowledge and a trial-and-error process. For a long time, the development of new materials and catalysts relied heavily on empirical observations, experimental iterations, and theoretical insights. This approach required significant time, resources, and expertise, often leading to slow progress and limited success in achieving desired material properties and catalyst performances. However, with the rapid advancements in data acquisition, storage, and analysis technologies, a new paradigm has emerged, known as materials informatics and catalysts informatics. These disciplines aim to leverage data science methodologies, such as statistical analysis, machine learning, and data visualization, to extract knowledge and insights from large datasets generated by experimental, theoretical, and computational studies.

By employing materials informatics and catalysts informatics, researchers can overcome the limitations of the traditional approach. These emerging fields enable the systematic exploration and design of materials and catalysts by leveraging data-driven approaches. Rather than relying solely on researchers’ intuition and experience, materials informatics and catalysts informatics utilize data-driven models and algorithms to accelerate the discovery and design of materials and catalysts with desired properties and performances. The birth of materials informatics and catalysts informatics signifies a paradigm shift in materials and catalysts science. They allow researchers to harness the power of data to uncover hidden patterns, correlations, and structure-property relationships that were previously difficult to discern. This data-driven approach enables more efficient and targeted experimentation, reducing the time
and resources required for materials and catalysts design. Throughout this chapter, we will delve deeper into the concepts of materials informatics and catalysts informatics, exploring their key principles, methodologies, and applications. We will examine how these disciplines have transformed the landscape of materials and catalysts research, paving the way for accelerated innovation, improved materials design, and enhanced catalyst performance.

Keywords

Material science · Materials informatics · Catalysts informatics · Data science · Processing · Structure · Properties · Performance · Characterization

• Understand how materials informatics came into existence.
• Learn how data science relates to materials science research.
• Cover the basics of materials informatics.
1.1 Introduction
The crucial role that materials play in shaping human civilization cannot be overstated. In combination with human ingenuity and the ability to use tools, materials have facilitated the creation of modern society in its current form. Indeed, materials have allowed us to achieve remarkable feats, such as traveling great distances in short periods of time, communicating with one another instantly through phones and computers, preserving food for extended periods, curing diseases, and even exploring space. These impressive achievements are possible thanks to the tireless efforts of researchers who dedicate their lives to understanding how materials function and how they can be improved or designed to perform even better.

Traditionally, material research has been conducted through experiments and theoretical development. However, recent technological advancements have opened up new avenues for research and innovation, with computers now playing a critical role in the field. This shift toward a more data-centric approach has given rise to the field of data science, which places data at the forefront of research efforts.

The integration of computers and data science in material research has led to an exponential growth in data and a wealth of new insights into the behavior and properties of materials. This data-driven approach allows researchers to extract meaningful information from vast amounts of data, enabling them to make more informed decisions and drive innovation in the field. By utilizing cutting-edge technologies such as machine learning and artificial intelligence, researchers can analyze large amounts of data and identify trends that would have been otherwise difficult or impossible to detect using traditional methods. As technology continues to advance and the field of data science evolves, we can only imagine the extraordinary advancements that will be made possible in the world of materials.
What does the development of data science mean in terms of materials research? In this chapter, we will explore the following:

• What is materials informatics? How did it come into existence?
• The basics of data science and informatics in relation to materials science research
• The four approaches to scientific research
• Unique requirements of applying data science to materials science research
• Data science and the five components of materials science
By exploring these concepts, we aim to provide an overview of materials science, data science, and how these fields interact when applied toward materials research.
1.2 The Rise of Materials Informatics
Materials science provides the foundation for the basic and advanced technologies that are integral to modern society. It encompasses a broad range of materials, from commonplace substances such as iron, steel, and oxygen to the cutting-edge materials used in rocket engines, artificial bones, electrical circuits, and synthetic fibers. The fact that such a wide range of tools can be produced from the same 118 elements of the periodic table is truly remarkable and inspiring. Through the manipulation of various factors such as temperature, pressure, and exposure to other elements or compounds, materials scientists can create new compounds, strengthen existing materials, and increase the shelf life of materials. This ability to control and manipulate materials has played a critical role in the progress and development of human society. Without the ability to manipulate materials, it is difficult to imagine that our society could have progressed to the extent that it has.

Materials science can be seen as an integral aspect of the continued development of society and technology. As we continue to rely more heavily on advanced technologies in every aspect of our lives, the importance of materials science becomes increasingly apparent. From the creation of new and innovative materials to the development of advanced manufacturing techniques, materials science is essential to the continued progress and prosperity of our society. However, the factors that materials science must contend with extend beyond these materials. In particular, the driving question behind much of materials science lies within the mysteries behind how materials can be designed and synthesized.

Throughout history, materials have been designed and synthesized using a combination of experimentation and theory. However, with the introduction of computing, new scientific approaches have become available for materials science research, as illustrated in Fig. 1.1. The computational approach, for example, has become a viable third approach toward materials research, with advancements in computing technologies and developments in theoretical studies. Moreover, data science has emerged as a fourth approach to research in the last decade, with its ability to analyze large data sets and identify patterns and correlations.
Fig. 1.1 History of the development of materials science
Given the rate of exponential growth in scientific and technological developments, it is reasonable to expect that artificial intelligence (AI) will become a fifth potential approach for scientific research in the future. AI has already shown great potential in various fields, including medicine, finance, and transportation. In materials science, AI can be used to design and optimize materials with specific properties, predict material behavior under different conditions, and accelerate the discovery of new materials. As such, it is essential for materials scientists to stay up-to-date with these new developments and explore the potential applications of AI in their research.

The design of materials has traditionally relied on a trial-and-error approach based on previously acquired knowledge and experiences. While this approach has yielded some successes, it is still challenging to design desirable materials directly from the elements of the periodic table. This is due to the vast number of possible combinations that could be considered, leading to a time-consuming and error-prone process. To overcome these challenges, materials informatics has emerged as a promising approach to accelerate the process of discovering new materials. Materials informatics involves the use of materials data and data science techniques such as data mining, machine learning, and artificial intelligence, in order to design and optimize materials. By leveraging materials datasets, materials informatics enables researchers to identify patterns and correlations that would be challenging to detect using traditional approaches. This, in turn, enables the development of predictive models that can accelerate the discovery of new materials with desired properties. Moreover, materials informatics can help researchers optimize existing materials by predicting their performance under different conditions and identifying opportunities for improvement.

The use of materials informatics in materials science is still in its early stages, but it holds great promise for accelerating the discovery and optimization of materials. As the field continues to evolve, it is expected that materials informatics will become an integral part of the materials science toolkit, enabling researchers to design materials with unprecedented precision and efficiency.
1.3 What Is Materials Informatics?
Materials informatics is a subfield of materials science that focuses on using data science toward materials research. Often referred to as a “fourth science” (the third being the computational science approach, after experiment and theory), materials informatics places emphasis on data, where data science techniques are applied toward materials data [1–4]. This adoption of informatics within scientific research has occurred across many disciplines, a movement that can be described as going from “X science” to “X informatics.” This movement has been seen in fields such as biology, chemistry, and pharmacy, where applying data science has led to the formation of bioinformatics, chemoinformatics, and pharmacy informatics. As a field reaches a mature phase and generates substantial amounts of data, the evolution toward X informatics occurs naturally. In this book, materials informatics is introduced, with explored topics ranging from basic concepts to applications of materials informatics.

When dealing with materials informatics, one must first consider the question: What exactly is materials informatics? In essence, the underlying principle of materials informatics is integrating data science into materials science research. This brings about another question to consider: What is the difference between data science and informatics? Data science is a discipline where data are the central driving force behind research. It focuses on understanding and extracting knowledge from data by incorporating tools such as statistics, visualization, and machine learning techniques. Given the broad scope of tools and numerous techniques that are available for data science research, it is actually very similar to the experimental approach to research, which also uses a variety of techniques in order to synthesize and characterize materials. As data science involves many different methodologies and concepts, it is increasingly described as a fourth science, following experiment, theory, and computational science. Informatics, meanwhile, is an application of data science, where data science is implemented in order to solve problems and acquire knowledge via data.

With this understanding, materials informatics can be described as a field in which design and knowledge of materials science can be achieved by applying data science to material data. It is not only important to have knowledge of data science, but it is also important to have knowledge of materials science, which is, itself, an umbrella for many other subfields. This means that simply understanding data science on its own is not enough for materials informatics. Nor is it enough to have specialization in materials science without at least a working understanding of how data science works. It is only by having an understanding of both data science and materials science that materials informatics can truly reach its potential.
1.4 Materials Informaticists
The rapid generation and accumulation of data in the field of materials science presents a unique opportunity to transform the field into materials informatics, as illustrated in Fig. 1.2. However, the nature of materials data presents challenges for effective organization and utilization. Materials data are available in a variety of formats, including academic articles, patent data, and laboratory notebooks. Unfortunately, most materials data are not well organized and are subject to the preferences of the researchers who record and publish them. This can make it difficult to locate and analyze relevant data.

Fig. 1.2 Goals and fundamental concepts in materials informatics

In addition to traditional data sources, visualizations and models such as raw spectra, computer models, calculations, and graphs are also sources of data in materials science. While these data sources are often rich with information, the physical and chemical phenomena they capture are not well defined or explicitly labeled. This makes it challenging to identify relevant information and integrate it into machine learning models. Despite these challenges, the field of materials science continues to generate an exponential amount of data.

Despite the wealth of data available in materials science, there is a growing gap between the number of trained or skilled researchers and the amount of materials data available to analyze. This presents a significant challenge for the successful growth of materials informatics as a viable research field. While the abundance of materials data
provides the possibility of hidden patterns and knowledge ready to be mined using materials informatics, there are not enough materials science researchers who are trained or otherwise skilled in materials informatics to take advantage of this. This gap poses a significant barrier to the development of materials informatics as a mature research field, hindering the full realization of its potential. To bridge this gap, efforts are needed to train and educate more materials science researchers in the principles and techniques of materials informatics. This will require the development of specialized training programs, courses, and resources to help researchers acquire the necessary skills and knowledge. Additionally, collaborations between materials science researchers and experts in data science and machine learning may help to facilitate the integration of materials informatics into traditional materials science research.

Inviting data scientists to conduct materials informatics research can be a potential solution to bridge the gap in the number of trained or skilled researchers in this field. However, a crucial question arises: Can data scientists efficiently handle materials informatics research without proper domain knowledge of materials science? This is a challenging issue to address. On one hand, data scientists possess the necessary expertise to apply machine learning and data science techniques to materials data. However, without the appropriate domain knowledge of materials science, data scientists may not be able to understand the underlying science and physical phenomena behind the data they are analyzing, potentially leading to inaccurate or unrealistic outcomes. This highlights the critical importance of having domain knowledge in materials science. Without it, data scientists may not be able to accurately interpret the results of their analysis, leading to erroneous conclusions and potentially wasted resources.

To further illustrate the importance of domain knowledge in materials informatics, one can use the analogy of car racing. Just like how a good car and a skilled driver are both necessary to win a race, both information science and domain knowledge of materials science are critical components for successful materials informatics research. Here, information science is comparable to the design and development of good parts such as the engine, brakes, tires, and all other parts required for a car to operate. However, even if the car is the fastest car in the race, it is not guaranteed to win such a race because it needs a driver to maximize the performance of the car. Here, the driver may not be able to make the car parts, but the driver knows how to use them and how they function and therefore is able to maximize the performance of the car when driving. This idea can be applied toward materials informatics. Materials informaticists may not be specialized in developing machine learning, visualizations, and all other data science techniques. However, materials informaticists will know how to apply these techniques and skills toward problem-solving efforts and will also know how to validate results via data analysis and determine whether the results are reasonable or accurate enough from the materials science perspective.
Thus, the materials informaticist is the driver: they are expected to know the various data science techniques that will be introduced in this book and to be able to strategically apply those techniques toward various problems and types of data within materials science.
Materials informatics represents a vital approach to discovering and designing new materials by leveraging machine learning, data analysis, and other data science techniques. With the vast amount of materials data available, materials informatics aims to optimize the discovery process by mining this data for hidden patterns and knowledge. However, there is a critical shortage of materials science researchers with the necessary data science skills to take full advantage of materials informatics.
1.5 Data Science vs. Experiments, Theory, and Computation
In the field of materials science research, as well as other areas of scientific inquiry, problems have traditionally been tackled using one of the three primary “sciences”: experiment, theory, or computation. These approaches are fairly straightforward to understand; when thinking about experiments, one might envision scientists conducting tests in real-life settings, such as mixing chemicals or testing materials under various conditions. In contrast, the theoretical approach may conjure up images of researchers developing equations or models to explain scientific phenomena. Meanwhile, computational methods may involve using sophisticated algorithms and computer simulations to model molecules, materials, and other complex systems.

However, recent developments in technology have given rise to a fourth potential approach to scientific research: data science. In this approach, researchers rely on large sets of data to extract meaningful insights and patterns that may not be easily discernible through traditional methods. With the advent of big data and the increasing availability of tools and techniques for data analysis, data science has emerged as a powerful and highly effective means of scientific inquiry. For materials science (and for other areas of scientific research), data science is often referred to as a “fourth science.” What, then, makes data science unique in comparison to the other established approaches? Here, ammonia synthesis (shown in Figs. 1.3, 1.4, 1.5 and 1.6) is used as an example to illustrate these differences.
Fig. 1.3 Ammonia synthesis in experiment
Fig. 1.4 Ammonia synthesis in theory
Fig. 1.5 Ammonia synthesis in computations
From the experimental perspective, researchers first attempt to mix random gases such as N₂ and H₂. Typically, researchers also choose experimental conditions based on their knowledge and experiences. The desired results most likely do not occur on the first try—in fact, researchers would be quite lucky to achieve ammonia synthesis right away. At this stage, the trial-and-error process begins, where researchers learn from the failed experiment and attempt to apply the new knowledge toward the next experiment. For instance, researchers may add CH₄ instead of H₂ and use different experimental conditions. This process is repeated until the target goal is achieved in the experiment.

Fig. 1.6 Ammonia synthesis in informatics

This
approach is the method traditionally taken when synthesizing materials, despite it being a very time-consuming process. In order to accelerate the material synthesis process, a faster way of working through the trial-and-error process must be found.

In the field of material science, the relationship between theory and experimentation is intrinsically interconnected, with one informing the other. While the concept of theory might conjure up images of researchers poring over literature to derive equations that explain natural phenomena, the reality is that it is often the combination of experimental data and theoretical models that leads to meaningful insights. Experimentation is a trial-and-error process that can take years or decades to refine, but the trends that emerge from these experiments can often be expressed mathematically. The laws of thermodynamics are an excellent example of theory that was developed through this kind of iterative process.

The role of theory is to provide a framework for understanding the mechanisms behind experimental observations. By identifying underlying relationships and principles, theories allow researchers to go beyond simply describing what they observe and start to explain why certain phenomena occur. Theories are typically represented through mathematical equations that incorporate variables and constants to describe the relationships between different factors. For example, the Boltzmann constant, mass, and temperature are all variables that might be included in an equation to describe a particular phenomenon. While experiments provide the raw data that inform theoretical models, theories also guide experimental design by suggesting new avenues for exploration. By identifying the key variables that are likely to be important in a particular system, theoretical models can help researchers to design experiments that will test specific hypotheses. In this way, theory and experimentation work hand in hand, each supporting and complementing the other. Ultimately, it is the combination of both theory and experimentation that leads to meaningful discoveries in the field of material science.
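As one concrete illustration of the kind of relation described above, an equation that ties together the Boltzmann constant, mass, and temperature, consider the standard result from the kinetic theory of gases for the average translational kinetic energy of a gas molecule (this example is ours, not one worked out in the book):

```latex
% Kinetic theory of gases: the mean translational kinetic energy of a molecule
% of mass m with mean-square speed <v^2> depends only on the temperature T,
% with k_B the Boltzmann constant.
\frac{1}{2}\, m \langle v^{2} \rangle = \frac{3}{2}\, k_{B} T
```

A single compact relation of this kind summarizes an enormous number of individual measurements, which is precisely the role the text assigns to theory.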
The emergence of computational science has revolutionized the way we approach research, particularly in the field of materials science. With the aid of advanced technology, researchers can now simulate and model complex systems in ways that were once impossible. Computational science has become a viable third science alongside experimental science and theoretical science due to the development of first principles calculations. The term “first principles” refers to the fundamental principles that are inherent to a particular phenomenon and cannot be modified. One of the most well-known examples of first principles calculations is density functional theory. The underlying assumptions of density functional theory make it possible to calculate interatomic distances between atoms, allowing researchers to investigate the properties of materials at the atomic level. This approach has been revolutionary in the field of materials science, as it has enabled researchers to gain a deeper understanding of the fundamental properties of materials and their behavior.

Through the use of first principles calculations, researchers have been able to simulate and design materials with specific properties at the atomic scale. This has opened up new avenues for developing materials with improved properties, such as increased strength or improved conductivity. Additionally, the use of computational science has helped to reduce the need for costly and time-consuming experimental methods, as simulations can be conducted in a fraction of the time it would take to conduct physical experiments. The impact of computational science on materials science research has been far-reaching, enabling researchers to investigate complex systems and phenomena that were once thought to be beyond our understanding. With ongoing advancements in technology and the development of new computational methods, it is likely that the role of computational science in materials science research will continue to grow in the years to come.

Computational science has enabled researchers to gain a detailed understanding of the processes involved in ammonia synthesis at an atomic level. By using first principles calculations to simulate the reactions of N₂ and H₂ over catalysts, it is possible to calculate the activation energies involved in the reactions. This approach allows researchers to map out the energy profile of these reactions and gain insights into the fundamental mechanisms that underlie the synthesis of ammonia. Through these calculations, researchers can examine how atoms interact with each other and form ammonia. For example, it has been revealed that the dissociation of N₂ involves a high activation energy, which suggests that it is a crucial reaction in the synthesis of ammonia. This information can be used to guide the design of catalysts that are optimized for ammonia synthesis, ultimately leading to more efficient and sustainable production methods. In addition to providing insights into the reaction mechanisms involved in ammonia synthesis, computational science has also enabled researchers to investigate the effects of different catalysts on the reaction. By modeling the reactions over different catalysts and comparing the resulting energy profiles, researchers can identify catalysts that are most effective for ammonia synthesis.

One of the challenges of computational science is that it relies on accurate atomic models to carry out calculations.
If the atomic models are not known or are not accurate, it becomes difficult to judge whether the calculations accurately reflect what occurs during
an experiment. This highlights the importance of experimental methods in material science research, as experiments provide the data needed to verify the accuracy of computational models and to identify potential discrepancies or errors. Another limitation of computational science is that it cannot fully capture the complex experimental conditions involved in material synthesis. While computational models can provide insights into the underlying mechanisms that govern chemical reactions, they are unable to account for all of the variables involved in the experimental process. For example, factors such as temperature, pressure, and the presence of impurities can all affect the outcome of a reaction but are difficult to fully simulate in a computational model. Despite these limitations, computational science remains a valuable tool for material science research.

Data science represents a new approach to science that differs from previous approaches in that the focus is on the data itself. Instead of generating new data through experiments or calculations, data science leverages existing data to uncover patterns, trends, and other insights that may not be immediately apparent. In the case of ammonia synthesis, data science techniques can be used to analyze data generated from experiments or simulations to identify optimal gas molecule candidates and experimental conditions. By processing and analyzing large datasets, data science can help to identify correlations between different variables and to generate predictive models that can guide future experiments or simulations. One of the key advantages of data science is that it can help to uncover hidden patterns and relationships that might be difficult to discern using traditional experimental or computational methods. For example, by analyzing large datasets of material properties, data science can help to identify novel materials with unique properties that might have been overlooked using traditional approaches.

However, data science is limited by the quality and quantity of the data being used. If the data are incomplete, noisy, or biased, the results of data science analyses may be inaccurate or misleading. As a result, data preprocessing and cleaning are critical steps in the data science process, as they can help to ensure that the data are of sufficient quality to support meaningful analyses. In addition, data science is not a replacement for experimental or computational methods, but rather a complementary approach that can help to generate insights and guide future research directions. By integrating data science with other approaches, researchers can gain a more comprehensive understanding of materials and accelerate the development of new materials and processes.

Figures 1.7 and 1.8 depict the fundamental difference between the traditional approach to research and data science. As mentioned earlier, the traditional approach involves a trial-and-error process where researchers design models or samples, conduct experiments or computations, and then derive theories. This approach is generally referred to as the “forward analysis” approach. On the other hand, data science involves “inverse analysis” where researchers feed machines with properties they are looking for, and the machine returns a list of candidate materials that are predicted to possess those desired properties.
The forward analysis approach can be time-consuming and resource-intensive, requiring a significant amount of human effort and computational resources to design and conduct experiments or computations. In contrast, data science can quickly identify promising
candidates from a vast amount of data, saving researchers significant amounts of time and resources.

Fig. 1.7 The role of experiment, theory, and computation

Fig. 1.8 The role of data science

It is important to understand that each approach to science, including experiment, theory, computation, and data science, has its own unique benefits and limitations. Experimentation allows researchers to directly observe and measure physical properties, providing a wealth of experimental data. However, experimental methods can be limited by the ability to control experimental conditions or by the need for expensive equipment. Theory, on the other hand, allows researchers to make predictions based on fundamental
principles, providing a theoretical framework for understanding phenomena. However, theoretical models can be limited by the assumptions made and the complexity of the systems being studied. Computation provides a powerful tool for simulating and analyzing systems, offering a level of detail and control not always possible with experimentation. However, computational models can be limited by the accuracy of the underlying physical models and the availability of computational resources. Data science is a newer approach that focuses on extracting knowledge and insights from large and complex datasets. The abundance of data generated from experiments or calculations can be processed using data science techniques, such as machine learning or statistical analysis, to reveal hidden patterns or relationships. Data science can also be applied to develop predictive models or to optimize experimental designs, potentially reducing the time and cost of experimentation. However, data science is limited by the quality and completeness of the data available, as well as the need for careful preprocessing and analysis to avoid introducing biases or artifacts.

By understanding the strengths and limitations of these different approaches, researchers can employ a multidisciplinary approach to create a more complete understanding of materials and phenomena. For example, experimental data can be used to validate theoretical models or provide input for computational simulations. In turn, computational models can be used to guide experimental designs or provide insights into the mechanisms underlying experimental observations. Data science can also be used to analyze and integrate data from multiple sources, potentially uncovering new relationships or insights. Ultimately, by leveraging the strengths of each approach and combining them in a complementary manner, researchers can develop a more comprehensive and accurate understanding of materials and their properties.
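To make the “inverse analysis” idea of Fig. 1.8 a little more concrete, the following is a minimal sketch of forward-model screening, written by us as an illustration rather than taken from the book's later chapters. The dataset, descriptor values, and target property are entirely invented; the sketch assumes only that NumPy and scikit-learn are available.

```python
# Minimal sketch of "inverse analysis" by screening candidates with a forward model.
# All numbers below are invented purely for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy training data: each row holds numeric descriptors of a known material
# (e.g., averaged elemental properties); y is a measured target property.
X_train = np.array([
    [1.2, 0.8, 3.1],
    [0.9, 1.5, 2.7],
    [1.8, 0.4, 3.9],
    [1.1, 1.1, 2.2],
    [2.0, 0.6, 4.1],
])
y_train = np.array([10.5, 8.2, 14.1, 7.9, 15.3])

# Forward model: descriptors -> predicted property.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Hypothetical candidate materials that have not been synthesized yet.
candidates = np.array([
    [1.5, 0.7, 3.5],
    [0.8, 1.4, 2.5],
    [1.9, 0.5, 4.0],
])

# "Inverse" step: screen the candidate space and rank by predicted property,
# so that only the most promising candidates move on to experiments.
predictions = model.predict(candidates)
for idx in np.argsort(predictions)[::-1]:
    print(f"candidate {idx}: predicted property = {predictions[idx]:.2f}")
```

The point of the sketch is the direction of the workflow: rather than synthesizing and measuring every candidate, the model is asked which candidates appear worth making at all.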
1.6 Data Science in Materials Science
Undoubtedly, the integration of data science into material science has opened up new opportunities for innovation and progress. However, the question of how to effectively incorporate data science and materials informatics into the field of material science remains a crucial challenge that needs to be addressed. To tackle this challenge, it is essential to consider the five fundamental components of materials science (processing, structure, properties, performance, and characterization), as illustrated in Fig. 1.9. These components can provide useful hints as to how data science and materials informatics can be integrated into the field of material science. It is possible to apply materials informatics toward each of these five components. Let us take a closer look at how materials informatics can be applied toward each component.
Fig. 1.9 Opportunities in material science
Processing
Processing in materials science refers to the design and synthesis of materials, involving a range of techniques and experimental conditions. It encompasses the selection of appropriate materials synthesis methods, as well as the optimization of parameters such as temperature, pressure, and reaction time. The processing conditions directly impact the resulting material structures. Materials informatics offers valuable tools and approaches to enhance the processing aspect of materials science. By utilizing data science techniques, researchers can optimize material synthesis methods by analyzing and modeling large datasets of experimental conditions and their corresponding material structures. This involves identifying key features, also known as descriptors, that determine the crystal structures and properties of materials. Materials informatics can also assist in finding the most suitable materials synthesis method for a specific application. By leveraging data-driven approaches, researchers can analyze the relationships between synthesis techniques and resulting material properties to identify optimal methods. This enables the design and synthesis of materials with desired structures and properties for targeted applications.
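As a small illustration of what “descriptors” can mean in practice, the following sketch computes composition-weighted averages of elemental properties for a single compound. It is our own minimal example, not code from this book, and the property values are illustrative placeholders rather than reference data.

```python
# Hypothetical sketch: composition-weighted elemental descriptors for one compound.
# The elemental property values are illustrative placeholders, not reference data.
element_properties = {
    "Ti": {"electronegativity": 1.54, "atomic_radius": 1.47},
    "O":  {"electronegativity": 3.44, "atomic_radius": 0.66},
}

composition = {"Ti": 1, "O": 2}  # e.g., TiO2


def weighted_descriptors(composition, element_properties):
    """Return composition-weighted averages of the tabulated elemental properties."""
    total_atoms = sum(composition.values())
    property_names = next(iter(element_properties.values())).keys()
    return {
        name: sum(
            count / total_atoms * element_properties[element][name]
            for element, count in composition.items()
        )
        for name in property_names
    }


print(weighted_descriptors(composition, element_properties))
```

Numeric vectors like these, one row per material, are the kind of input that the data science techniques introduced in later chapters actually operate on.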
Crystal Structures
Crystal structures play a vital role in determining the properties and behavior of materials in materials science. The arrangement and bonding of atoms within a crystal lattice directly influence various properties, such as mechanical, electrical, and thermal characteristics. The choice of elements from the periodic table further dictates the nature and behavior of the crystal structure. Materials informatics offers valuable tools and approaches to establish connections between crystal structures and material properties. By employing data science techniques, researchers can
analyze large datasets of crystal structures and corresponding properties to uncover meaningful correlations and patterns. This enables the identification of structural features and arrangements that are responsible for specific material properties. Furthermore, materials informatics facilitates the exploration of correlations between crystal structures and elements in the periodic table. By integrating data on crystal structures with elemental properties and characteristics, researchers can identify relationships between specific elements and structural motifs, shedding light on the influence of different elements on crystal formation and stability. While the determination of crystal structures has been a long-standing challenge in materials science, materials informatics can contribute to overcoming this hurdle. By analyzing existing crystal structure data, researchers can develop models and algorithms that assist in predicting and determining crystal structures, accelerating the discovery of new materials with desired properties.
Properties
Properties in materials science encompass the physical and chemical characteristics that define how materials respond and behave. These properties include reactivity, mechanical properties, magnetic properties, thermal properties, optical properties, electrical properties, and many others. Understanding and controlling these properties are central to materials research and development. Materials informatics offers a powerful approach to design materials with desired physical and chemical properties. By leveraging data science techniques, researchers can analyze vast datasets of material properties and their corresponding composition, structure, and processing parameters. Through the application of statistical modeling, machine learning, and data mining, materials informatics can reveal valuable insights and establish predictive models that guide the design and optimization of materials with specific target properties. The application of materials informatics in designing materials with desired properties is of great interest to materials researchers. Traditionally, the design of materials with specific properties has been a laborious and time-consuming process. However, with the aid of materials informatics, researchers can expedite the discovery and design of new materials by utilizing data-driven approaches that identify composition-property relationships and optimize material formulations. Moreover, materials informatics holds the potential to unravel the relationships between crystal structure and materials properties. By analyzing large datasets encompassing crystal structures and corresponding properties, researchers can uncover hidden patterns and correlations. This knowledge contributes to a deeper understanding of how the arrangement of atoms in a crystal lattice influences
the observed material properties. Such insights can lead to the development of predictive models that aid in the rational design of materials based on their crystal structures.
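As a simple illustration of such a data-driven workflow, the sketch below trains a regression model on a hypothetical table of elemental descriptors and a measured property. The file name, column names, and target property are placeholders rather than an actual dataset; the Python tools used here are introduced properly in later chapters.

    # A minimal sketch of a composition-property model, assuming a hypothetical
    # CSV file "materials.csv" with descriptor columns and a measured property.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    df = pd.read_csv("materials.csv")                    # hypothetical dataset
    X = df[["electronegativity", "atomic_radius", "valence_electrons"]]
    y = df["band_gap"]                                   # illustrative target property

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)                          # learn composition-property trends

    print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))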
Performance Performance evaluation is a crucial stage in materials science where material properties are assessed and analyzed in real-world applications. It involves testing and evaluating how well a material performs under specific conditions and requirements. During this stage, materials informatics can play a significant role in optimizing and maximizing material properties. Materials informatics enables researchers to explore and analyze large datasets of material performance data. By leveraging data science techniques, researchers can identify patterns, correlations, and relationships between material properties and performance under different conditions. This knowledge can then be used to fine-tune material properties and optimize their performance for specific applications. Through the use of materials informatics, researchers can employ predictive modeling, machine learning, and optimization algorithms to search for conditions and parameters that can enhance material performance. By analyzing data on various material compositions, processing methods, and performance metrics, materials informatics can help identify optimal combinations and parameter ranges that lead to improved performance. Additionally, materials informatics enables researchers to design materials with tailored properties by leveraging data-driven approaches. By combining knowledge of materials properties and performance with computational modeling and simulations, researchers can predict and optimize material behavior before physical testing, saving time and resources.
Characterization Characterization plays an important role in materials science research as it involves the comprehensive evaluation of materials in terms of their structures, properties, and responses to various processing and performance conditions. This stage provides valuable insights and answers the “why” questions in materials science by examining the underlying mechanisms and behaviors of materials. There is a wide range of characterization techniques available, including microscopy, X-ray analysis, and computational material science, among others, that contribute
to the acquisition of important information for materials informatics. Microscopy techniques, such as optical microscopy, electron microscopy, and scanning probe microscopy, enable the visualization and analysis of material structures at different length scales. These techniques provide detailed images and information about the morphology, composition, and crystallographic features of materials. Through image processing and analysis within the framework of materials informatics, researchers can extract quantitative data, identify specific features, and establish relationships between the observed structural characteristics and material properties. X-ray analysis techniques, such as X-ray diffraction (XRD) and X-ray spectroscopy, provide valuable insights into the atomic and crystal structures of materials. These techniques allow researchers to determine crystallographic parameters, identify crystal phases, and analyze the local chemical environments of atoms. The data obtained from these techniques can be used in materials informatics to connect structure with properties. By utilizing measurement informatics, researchers can analyze spectral data and establish correlations between structural characteristics and material properties, enabling more informed materials design and optimization. Computational material science, including techniques such as molecular dynamics simulations and density functional theory calculations, offers virtual experiments that provide insights into the atomic-scale behavior and properties of materials. These computational approaches generate valuable data that can be integrated into materials informatics, contributing to the understanding of structure-property relationships and guiding the design and development of materials with desired properties. Moreover, the characterization process itself generates a wealth of data related to material properties and performance. By collecting and organizing these data, materials informatics can harness them as a valuable source for analysis and modeling. This data-driven approach enables researchers to identify trends, correlations, and patterns in materials properties and performance, further enhancing the understanding and optimization of materials through data science techniques.
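To make the idea of extracting information from measured spectra more concrete, the short sketch below locates peak positions in a synthetic, diffraction-like pattern. Real XRD data would of course be loaded from an instrument file; the peak positions, widths, and detection thresholds here are purely illustrative.

    # A minimal sketch of peak extraction from a synthetic diffraction-like signal.
    import numpy as np
    from scipy.signal import find_peaks

    two_theta = np.linspace(10, 80, 3500)            # scattering angle in degrees
    pattern = (np.exp(-((two_theta - 28.4) ** 2) / 0.05)        # synthetic peak 1
               + 0.6 * np.exp(-((two_theta - 47.3) ** 2) / 0.05)  # synthetic peak 2
               + 0.05 * np.random.default_rng(0).normal(size=two_theta.size))  # noise

    peaks, props = find_peaks(pattern, height=0.2, distance=50)
    print("Peak positions (2theta):", np.round(two_theta[peaks], 2))
    print("Peak heights           :", np.round(props["peak_heights"], 2))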
Understanding the five components of materials science—processing, structure, properties, performance, and characterization—highlights the numerous benefits that materials informatics can offer to the field. By leveraging data science techniques and tools, materials informatics can significantly enhance various aspects of materials science research. One key advantage of materials informatics is its ability to improve the efficiency and effectiveness of experimental design and materials synthesis processes. By analyzing and integrating data from previous experiments and simulations, researchers can identify optimal synthesis conditions, explore novel materials compositions, and reduce the trial-and-error approach in the laboratory. This accelerates the discovery and development of new materials with desired properties. Furthermore, materials informatics enables the
exploration of patterns and correlations between crystal structures and various physical and chemical properties of materials. By leveraging large datasets and applying data science techniques, researchers can uncover hidden relationships, identify structure-property relationships, and gain insights into the underlying mechanisms governing material behavior. This knowledge can guide the design of materials with specific properties, enhance materials performance, and enable the discovery of new materials with tailored functionalities. Moreover, materials informatics can assist in identifying conditions that can maximize specific material properties. By analyzing large datasets and applying machine learning algorithms, researchers can predict and optimize material performance under different processing and environmental conditions. This knowledge can lead to the development of materials with enhanced properties, such as improved mechanical strength, higher reactivity, or better thermal stability. Lastly, materials informatics can play a crucial role in the characterization process by extracting valuable information that helps explain observed phenomena. By utilizing advanced data analysis techniques, materials informatics can analyze and interpret data from various characterization techniques, such as microscopy, spectroscopy, and imaging. This allows researchers to gain deeper insights into the relationships between material structure, properties, and performance, facilitating a better understanding of materials behavior and guiding further research. Overall, materials informatics contributes to expanding our knowledge of materials by integrating data and data science methodologies into materials science research. It enhances experimental design, uncovers structure-property relationships, optimizes material performance, and elucidates underlying mechanisms. By harnessing the power of data and data science techniques, materials informatics accelerates the discovery, design, and understanding of materials, driving innovation and progress in the field of materials science.
1.7
Catalysts Informatics and Materials Informatics
Materials science is a multidisciplinary field that encompasses a diverse range of scientific disciplines, including biology, physics, mathematics, and chemistry. Within this expansive realm, catalysts hold a particularly profound fascination for many materials scientists, as they play a pivotal role in numerous chemical processes employed during materials development. Recognizing the parallel significance of catalysts, the concept of catalysts informatics emerges as an extension of materials informatics. Analogous to the traditional approach observed in materials science, catalysts have traditionally been designed and investigated through the prism of experiments, theory, and computation. However, in line with the prevailing trend witnessed in various research domains, the establishment of discipline-specific informatics has gained traction. Prominent instances of this phenomenon can be found in the realms of bioinformatics and chemoinformatics, wherein the utilization of informatics methodologies has revolutionized the respective fields. Thanks to remarkable advancements in technology, the availability of catalyst-related data has experienced an exponential surge, thereby enabling the application of data
science techniques to catalyst research. Consequently, the nascent discipline of catalysts informatics has come into existence, offering a novel and promising avenue to enhance our understanding and manipulation of catalysts, thus fostering advancements in materials science as a whole. Nevertheless, it is worth noting that catalysts informatics presents unique requirements that diverge slightly from the expectations set by materials informatics. The distinctive nature of catalyst research lies in its inherent association between catalytic reactions and the experimental process conditions in which they occur. This interdependence introduces a significant consideration, as catalysts demonstrate sensitivity to variations in experimental conditions, which can directly impact their activity. The behavior of catalysts is known to fluctuate between active and inactive states depending on the specific process conditions under which they are employed. Consequently, it becomes imperative to meticulously design both the catalyst itself and the corresponding process conditions in a concurrent manner, as they intricately influence each other. Moreover, the structural characteristics of catalysts are also subject to alteration in response to the varying process conditions they experience. In a fascinating parallel, catalysts can be perceived as exhibiting traits reminiscent of living organisms, adapting and responding to changes in their environment and surrounding conditions. Given these intricate dynamics, the realm of catalysts informatics necessitates the incorporation of essential elements such as catalyst composition, catalyst performance data, and catalyst characterization data. These key components provide crucial insights into the behavior and effectiveness of catalysts. By leveraging informatics approaches, researchers can holistically analyze and integrate these diverse datasets, ultimately unraveling the intricate relationships between catalyst composition, performance, and characterization. Such comprehensive analyses pave the way for enhanced understanding and optimization of catalysts, empowering researchers to unlock new frontiers in catalysis and drive advancements in the broader field of materials science.
1.8
Conclusion
Within this chapter, we embarked upon an exploration of the foundational principles that underpin materials science—a scientific domain that can be aptly described as an “umbrella” field, encompassing diverse research disciplines from a wide array of fields. It is worth noting that contemporary advancements and technological progress have opened doors for the integration of data science—a field primarily concerned with the analysis and interpretation of data. The exponential growth and proliferation of data and information pertaining to materials have presented a remarkable opportunity to employ data science methodologies to analyze and derive insights from such vast quantities of data. In fact, these techniques have surpassed the limits of human capacity for analysis. This convergence of data science and materials science has led to the emergence of materials informatics—an interdisciplinary field that capitalizes on the application of data science
within the context of materials science research. By leveraging the power of data science, materials informatics enables researchers to extract valuable knowledge and patterns from the extensive amounts of data generated and accumulated in the field of materials science. This, in turn, empowers scientists and engineers to make informed decisions, enhance material properties, optimize performance, and expedite the discovery and design of new materials with desired functionalities. In essence, the integration of data science into materials science research has given rise to a transformative paradigm known as materials informatics, propelling the field forward and ushering in new opportunities for innovation and advancement. The pursuit of materials research can be approached from four distinct perspectives: experimental investigation, theoretical modeling, computational simulations, and the emerging field of data science. Each of these approaches possesses its own inherent strengths and weaknesses, which, if employed in a synergistic manner, can yield valuable insights and mutually reinforce one another. Materials informatics, as a discipline, heavily relies on data science techniques and tools, such as statistical analysis, visualization methods, and machine learning algorithms. Through the application of these tools, materials informatics aims to unlock hidden knowledge embedded within vast quantities of materials data, which is often derived from experimental, theoretical, and computational studies. By leveraging the power of data science, materials informatics enables researchers to uncover patterns, correlations, and trends that may not be readily discernible through traditional means. However, it is essential to acknowledge that the unique nature of materials science imposes certain considerations when applying data science techniques. Materials science involves complex materials systems with intricate structure-property relationships and multifaceted dependencies on processing conditions. Therefore, data science methodologies cannot be blindly applied to materials data without careful consideration and domain expertise. Researchers utilizing data science in materials science must exercise caution and ensure a thorough understanding of the underlying data and the specific characteristics of the materials being investigated. They must be capable of critically assessing the relevance, quality, and limitations of the data at hand. Moreover, an awareness of the fundamental principles and governing laws within materials science is crucial for interpreting the results generated by data science techniques accurately. By adopting a judicious approach that combines domain expertise, data science methodologies, and materials science knowledge, researchers can leverage the power of materials informatics to advance our understanding, design, and discovery of novel materials with tailored properties and enhanced performance. Researchers seeking to employ data science techniques in materials research must possess not only proficiency in utilizing the available tools but also the ability to comprehend and assess the accuracy of the underlying data. This crucial capability enables researchers to make informed judgments regarding the most appropriate data science techniques for their specific studies. Additionally, it empowers them to evaluate the validity and scientific robustness of the results obtained through data science methodologies. 
This proficiency in data evaluation and judgment is particularly vital when it comes
to materials design, as it involves intricate considerations encompassing processing, structure, properties, performance, and characterization. Understanding how these components interplay is fundamental to effective materials design, as each facet contributes to the overall functionality and behavior of the material. By comprehending these interdependencies, researchers can harness the power of data science in a myriad of ways, significantly enhancing various aspects of materials design and materials research. When researchers possess a deep understanding of the underlying materials science principles and the associated experimental or computational techniques, they are better equipped to evaluate the accuracy and reliability of the data used in their research. This knowledge enables them to discern potential limitations, biases, or uncertainties associated with the data and adjust their data science approaches accordingly. Additionally, understanding the materials’ key characteristics facilitates the identification of relevant features and parameters to be analyzed using data science techniques, enhancing the accuracy and relevance of the results obtained. By integrating their domain expertise in materials science with data science methodologies, researchers can unlock new insights, accelerate materials discovery, and optimize material design and performance. Furthermore, this interdisciplinary approach promotes innovation and opens up new avenues for scientific exploration, ultimately driving advancements in the field of materials research. In the following chapters, materials informatics and catalysts informatics are explained further, starting from a basic level and progressing toward an advanced level. In particular, the following will be covered: materials and catalysts data, data preprocessing, data science techniques, visualization, and machine learning with demonstrations using Python code.
Questions
1.1 How does materials informatics integrate data science with materials science research?
1.2 What are the four perspectives from which materials research can be approached?
1.3 How does materials informatics utilize data science techniques and tools?
1.4 What considerations should be kept in mind when applying data science techniques in materials science?
1.5 What capabilities must researchers possess when employing data science techniques in materials research?
1.6 How does a deep understanding of materials science principles benefit researchers employing data science techniques?
1.7 How does integrating domain expertise in materials science with data science methodologies benefit materials research?
1.8 How does the integration of data science in materials science overcome human capacity limitations?
1.9 What role does materials informatics play in materials research?
1.10 How does the convergence of data science and materials science contribute to innovation and advancement?
1.11 What is materials informatics?
1.12 How does materials informatics differ from data science?
1.13 What are some examples of other informatics fields?
1.14 Why is it important to have knowledge of both data science and materials science in materials informatics?
1.15 How is data science related to the traditional scientific disciplines?
1.16 What is the challenge facing materials informatics as a research field?
1.17 What efforts are needed to bridge the gap in materials informatics?
1.18 Can data scientists effectively handle materials informatics research?
1.19 Why is domain knowledge of materials science important in materials informatics?
1.20 How can the importance of domain knowledge in materials informatics be illustrated?
1.21 What are the three traditional approaches to scientific research?
1.22 How does data science differ from the traditional approaches to scientific research?
1.23 How does the experimental approach work in materials science research?
1.24 How does theory contribute to materials science research?
1.25 How has computational science revolutionized materials science research?
1.26 What limitations does computational science face in materials science research?
1.27 How does data science contribute to materials science research?
1.28 What are the limitations of data science in materials science research?
1.29 What are the five fundamental components of materials science?
1.30 How can materials informatics enhance the processing aspect of materials science?
1.31 How can materials informatics enhance the processing aspect of materials science?
1.32 What role does materials informatics play in identifying optimal materials synthesis methods?
1.33 How does materials informatics establish connections between crystal structures and material properties?
1.34 How can materials informatics contribute to predicting and determining crystal structures?
1.35 In what ways does materials informatics aid in the design of materials with desired properties?
1.36 How does materials informatics optimize material performance in real-world applications?
1.37 What role does materials informatics play in the characterization of materials?
1.38 What is catalysts informatics?
1.39 What distinguishes catalysts informatics from materials informatics?
1.40 How can catalysts be compared to living organisms?
1.41 What are the essential elements incorporated in catalysts informatics?
1.42 How does catalysts informatics contribute to advancements in catalysis?
2
Developing an Informatics Work Environment
Abstract
Data are the equivalent of experimental samples within materials and catalyst informatics, with computers acting as the equivalent of experimental devices. Given the field’s dependency on data quantity and quality, the question turns toward how an informatics working environment can be established. In this chapter, the world of computers is introduced and explored with a guide for how one can go about constructing an informatics working environment.
Keywords
Computing · Hardware · Software · Linux · Servers · Computing environment · Ubuntu · Installation
• Explore the basics behind constructing an informatics working environment.
• Learn about hardware components and their roles in informatics.
• Explore the advantages and disadvantages of Linux vs. major competing operating systems.
• Provide a brief guide to installing Linux Mint.
2.1
Introduction
In the context of informatics, hardware and software are two fundamental components that work together to enable the effective processing, storage, and communication of information. Hardware refers to the physical components of a computer system, including
the central processing unit (CPU), memory (RAM), storage devices (e.g., hard drives and solid-state drives), input/output devices (e.g., keyboard, mouse, and monitor), and the network infrastructure. These hardware components provide the necessary resources for software execution and data manipulation. Software, on the other hand, is a collection of programs and instructions that tell the hardware how to perform specific tasks and operations. In informatics, software encompasses a wide range of applications, from operating systems that manage hardware resources and provide user interfaces, to specialized software used for data analysis, information management, and decision support. Informatics professionals use software to design and implement systems for data collection, storage, retrieval, analysis, and visualization, making it a pivotal element in the field. Computers are complex machines comprised of multiple components, including hardware and software. While programming and other informatics-related software and programs may take center stage in the world of computing, it is important to recognize the significance of the underlying hardware and software. Understanding the fundamental workings of computers is crucial because data analysis performance is heavily reliant on these components. It is not enough to simply interface with a computer through software without having a basic comprehension of the system’s underlying structure. Hardware, denoting the tangible constituents of a computer system, encompasses a spectrum of essential elements, including but not limited to the motherboard, central processing unit (CPU), memory modules, and various storage devices. The synergy and proficiency of these hardware components play a pivotal role in determining the holistic speed and operational efficiency of a computer system, a factor of paramount importance in the context of performing comprehensive data analysis. The performance of hardware components is undeniably a linchpin in the realm of data analysis, where computational prowess is of the essence. The seamless coordination and optimal functioning of these hardware constituents underpin the expeditious execution of intricate algorithms, the rapid retrieval of vast datasets, and the simultaneous processing of multifaceted calculations. Such expeditious and efficient performance is indispensable for data analysts and scientists alike, as it directly impacts the speed, accuracy, and reliability of analytical results. Furthermore, in the dynamic landscape of technology, hardware specifications are subject to continual enhancement and innovation. Staying abreast of these evolutionary leaps is imperative for individuals and organizations seeking to harness the full potential of their computer systems. Adapting to the latest advancements ensures that hardware infrastructure remains capable of accommodating the increasingly demanding computational requirements of modern data analysis methodologies, thus facilitating the realization of cutting-edge insights and breakthroughs in diverse fields. Software, in the context of computing, encompasses a vast array of meticulously crafted programs and applications that serve as the lifeblood of a computer system. These software entities orchestrate the symphony of digital operations, allowing users to engage with their
computer devices in meaningful and productive ways. It is through the realm of software that individuals and organizations wield the power to harness the full potential of their computing resources, embarking on an intricate dance of data manipulation, analysis, and creation. Within this intricate digital ecosystem, software assumes a multifaceted role, acting as both the medium and the conduit through which users interact with their machines. It provides the graphical interfaces, command-line tools, and rich graphical applications that facilitate a spectrum of activities, from word processing and data visualization to complex data analysis endeavors. Indeed, it is through software that users command the hardware’s computational prowess, seamlessly translating their intentions into tangible outcomes. A defining characteristic of the software landscape is its inherent diversity. A profusion of software programs, each tailored to specific tasks and objectives, populates this expansive domain. These programs can range from general-purpose office suites to specialized data analysis tools, each finely tuned to address particular needs and challenges. The critical task then becomes the judicious selection of the right software tool for a given purpose. The importance of this selection process cannot be overstated, especially in the context of data analysis, where the quality, efficiency, and accuracy of results are paramount. Choosing the appropriate software suite or application can significantly impact the analytical workflow, affecting data preprocessing, modeling, visualization, and the extraction of meaningful insights. Consequently, software selection becomes an art as well as a science, demanding a nuanced understanding of the analytical requirements and an awareness of the software’s capabilities and limitations. In an era of perpetual technological advancement, staying abreast of evolving software solutions is a strategic imperative. Software developers continually refine and enhance their products, incorporating innovative features, performance optimizations, and security enhancements. Remaining informed about these developments ensures that data analysts and professionals can leverage the most current tools, unlocking new dimensions of productivity and analytical prowess. To gain a more profound insight into the intricate world of hardware and software, drawing parallels to the human body can be a remarkably enlightening exercise. Much like the human body, which consists of both tangible and imperceptible constituents, computers, too, possess a duality of physical and intangible elements. In this analogy, the hardware of a computer system assumes the role of the corporeal components, representing the tangible aspects of the machine. If we liken the computer’s hardware to the human body, then the hardware components become the digital counterpart to hands, eyes, skin, hair, and vital organs. These tangible elements form the visible and touchable fabric of the computer, akin to the body parts that we can see, feel, and interact with in the physical realm. Just as hands allow humans to grasp and manipulate objects, the hardware components of a computer, such as the CPU and memory, enable the machine to process and manage data. Similarly, the eyes of a computer can be equated with the display monitor, which provides a visual interface for users to observe and interpret information. 
The skin and hair, which serve as protective layers for the human body, align with the computer’s casing and cooling systems, safeguarding the delicate internal components
from external harm. Furthermore, much like the vital organs within the human body, the computer’s hardware components work in harmony, each with a specific role and function. The motherboard acts as the computer’s central nervous system, facilitating communication and coordination among various parts, while storage devices, such as hard drives and SSDs, parallel the functions of the human brain, storing and retrieving digital data. However, hardware alone cannot function without software. Software is the invisible component of the computer system and can be compared to the body’s nerves and blood. Just as nerves and blood are essential for the human body to function properly, software is crucial for a computer system to perform tasks and processes. It is the unseen but critical aspect that connects the hardware and enables it to function effectively. In addition, just as there are different types of hormones and components that perform specific functions in the human body, there are different types of software that perform distinct tasks in a computer system. Operating systems, for example, are a type of software that manage and control the hardware components of the computer. Applications, on the other hand, are software programs designed for specific purposes such as data analysis, word processing, or gaming. Having a basic understanding of hardware and software is crucial for achieving optimal performance in data analysis. While it may be tempting to focus solely on programming and other informatics-related software and programs, it is important to recognize the integral role that hardware and software play in computer systems. By staying up-to-date with advancements in both hardware and software, data analysts can improve their ability to process and analyze data efficiently and accurately. Here, the roles that hardware and software play are introduced and discussed.
2.2
Hardware
In computers, hardware is defined as the physical components of which a computer consists. As shown in Fig. 2.1 and Table 2.1, computers are mainly composed of the following 7 parts: the central processing unit (CPU), memory, motherboard, power unit, graphics card, storage unit, and computer case. Here, the role and details of each part are introduced.
Fig. 2.1 7 parts in computers

Table 2.1 The details of 7 parts in computers

Name          Full name                    Description
CPU           Central processing unit      Computing
RAM           Random access memory         Temporarily stores data
Motherboard   N/A                          Connects parts
Power unit    Power supply unit            Supplies energy
GPU           Graphical processing unit    Transforms data into output images
Storage       N/A                          Stores data
Case          N/A                          Houses all assembled parts
Central Processing Unit (CPU) For those with a basic understanding of computers, the CPU, or “central processing unit,” is a term that is likely to be familiar. In the intricate tapestry of a computer system, the CPU stands as one of its most pivotal and recognizable
components, entrusted with the weighty responsibility of executing calculations and orchestrating a myriad of operations. Much akin to the role of the human brain, the CPU serves as the paramount center of control within a computer’s domain. It is the vital nexus that marshals the intricate choreography of data manipulation, storage, and retrieval. In essence, the CPU functions as the cerebral epicenter of the machine, commanding and coordinating all of its functions with precision and efficiency. Just as the human brain coordinates a plethora of bodily functions, from sensory perception to motor control, the CPU governs the intricate ballet of a computer’s digital realm. It manages the execution of instructions from software, governs the allocation of resources, and ensures that data flow seamlessly between different components. This digital command center integrates diverse processes, much like the brain’s ability to process sensory input and initiate motor responses. Moreover, the CPU is characterized by its innate adaptability and speed, akin to the rapid and dynamic nature of human thought processes. It is designed to execute instructions at an astonishing pace, processing millions or even billions of operations per second. This remarkable capacity for rapid computation empowers computers to tackle a wide array of tasks, from basic arithmetic to complex simulations and data analysis. In the contemporary landscape of computing, CPUs are crafted and produced by an array of well-established companies, each contributing to the diversity and innovation of this critical component. Noteworthy among these manufacturers are industry giants such as Intel and Advanced Micro Devices (AMD), alongside newer entrants such as Apple, which has recently made waves with its groundbreaking Apple M1 CPU, and Loongson Technology, renowned for its pioneering work on the Loongson CPU. The expansive roster of CPU manufacturers reflects a vibrant and competitive market where innovation and specialization thrive. Each of these manufacturers brings forth CPUs endowed with distinctive features and capabilities, meticulously designed to cater to a spectrum of computing needs and demands. For instance, some CPUs are meticulously optimized for the immersive realms of gaming and high-definition video editing. These processors prioritize swift and efficient rendering of graphics, ensuring that gamers can navigate virtual worlds with fluidity and video editors can manipulate high-resolution footage seamlessly. In contrast, other CPUs are tailor-made for the rigors of scientific and engineering calculations. These computational workhorses excel at number-crunching, simulations, and complex mathematical operations, enabling researchers and engineers to tackle intricate problems with precision and speed. The remarkable diversity in CPU offerings underscores the dynamic nature of the computing industry. It illustrates how CPUs have evolved from being general-purpose workhorses into specialized tools, finely tuned to meet the unique requirements of various computing tasks and industries.
It is important to note that when choosing a CPU, one must consider their computing needs and requirements. Depending on the tasks and applications being performed, some CPUs may offer better performance than others. Factors such as clock speed, cache size, and the number of cores can all affect the performance of a CPU. Therefore, it is essential to conduct thorough research and evaluate the features and specifications of different CPUs before making a decision. Furthermore, advances in CPU technology continue to push the boundaries of computing power and performance. The latest CPUs offer faster clock speeds, larger cache sizes, and more cores than ever before. As a result, computers are now capable of performing tasks that were once thought to be impossible. This trend is expected to continue, and it is likely that CPUs will become even more powerful and efficient in the future.
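As a practical aside, the number of logical cores and a rough processor identifier of the machine at hand can be queried directly from Python using only the standard library, as in the short sketch below (the exact strings reported depend on the operating system).

    # Inspect the CPU visible to Python on the local machine.
    import os
    import platform

    print("Logical CPU cores:", os.cpu_count())                      # logical, not physical, cores
    print("Processor string :", platform.processor() or "not reported")
    print("Machine type     :", platform.machine())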
Random Access Memory (RAM) Random access memory (RAM), commonly referred to as memory, is a critical component of a computer system that temporarily stores data. Although some may question the need for temporary storage in a computer, the relationship between CPU, memory, and storage devices provides a clear answer. As illustrated in Fig. 2.2, the CPU is responsible for performing calculations and producing output data, which ideally should be stored in storage devices such as solid-state drives (SSDs) or hard disk drives (HDDs). However, data interaction between the CPU and storage devices can be a significant drawback in the computing process, leading to slow performance and reduced efficiency. To address this issue, RAM is placed between the CPU and storage devices, acting as a buffer to temporarily store data produced from CPU output. By doing so, this process minimizes the slow interactions that occur between the CPU and data storage, resulting in faster data processing and more efficient system performance. The importance of RAM becomes particularly apparent in data processing and machine learning applications, where rapid input and output (I/O) are critical for achieving optimal performance. As such, large RAM sizes are preferred for conducting informatics works. Currently, a RAM size of 64 gigabytes may be considered ideal for materials and catalysts informatics. With advances in technology and the increasing demand for computing power, this ideal RAM size is likely to change in the future. Furthermore, recent developments in RAM technology have enabled the production of faster and more efficient RAM modules. The latest DDR4 RAM technology,
Fig. 2.2 The role of RAM (random access memory)
for example, provides faster data transfer rates and lower power consumption than its predecessor, DDR3 RAM. Moreover, the development of non-volatile RAM (NVRAM) technology has led to the creation of high-speed and energy-efficient memory modules that can retain data even when the power is turned off. As technology continues to evolve, the ideal RAM size and technology may change, and it is important to stay up-to-date with the latest developments in RAM technology to ensure optimal computing performance.
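For readers who want to check how much memory is available on their own machine before loading a large dataset, the following sketch uses the third-party psutil package (installable with pip); the 8 GB requirement used in the check is purely illustrative.

    # A minimal sketch of checking available RAM before a memory-hungry analysis.
    import psutil

    mem = psutil.virtual_memory()
    print(f"Total RAM     : {mem.total / 1024**3:.1f} GB")
    print(f"Available RAM : {mem.available / 1024**3:.1f} GB")
    print(f"In use        : {mem.percent:.0f} %")

    required = 8 * 1024**3            # hypothetical in-memory size of a dataset
    if mem.available < required:
        print("Warning: dataset may not fit in RAM; consider chunked processing.")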
Unit of Data The concept of the “unit of data” is fundamental to understanding how computers operate. In the current state of computer technology, the binary number system is used, where all data are represented using only 0s and 1s. This binary system is the foundation of modern computing, allowing for complex calculations and communication between different parts of the computer. In later chapters, we will explore in more detail how this binary system works and its importance to modern computing.
At the core of the intricate world of computing lies a fundamental concept that forms the bedrock of digital information: the binary code, composed of 0s and 1s, collectively known as “bits.” These seemingly simple entities are the elemental building blocks of all data in the realm of computing, and their significance cannot be overstated. A bit is the most elemental unit of information that a computer can comprehend, a fundamental binary choice between 0 and 1. It represents the smallest quantum of data manipulation, akin to the fundamental particles in the universe. This binary duality serves as the language of computation, where each bit embodies a decision, a true or false, an on or off state. Yet, despite the apparent simplicity of bits, the scale at which they operate in computing is nothing short of staggering. The digital universe is teeming with these minuscule units, a vast and intricate tapestry of 0s and 1s that collectively encode the entirety of digital information, from text and images to videos and software. The sheer volume of bits employed in modern computing is, indeed, monumental, and it can be an overwhelming prospect to fathom. This complexity is further amplified by the diverse roles that bits play in the multifaceted world of technology, from representing individual pixels in high-definition displays to encoding the instructions executed by CPUs. To simplify the intricate world of bits and make it more manageable for everyday use, the International System of Units (SI) has bestowed standardized names upon quantities of bits. One of the most commonly encountered units is the “byte,” which comprises eight bits. This byte serves as the foundational unit of measurement for data storage and transmission. The power of a byte lies in its versatility; it can represent a whopping 256 different values. This multitude of possibilities arises from the fact that a byte can be arranged in 2 to the power of 8 (2^8) unique ways, encompassing a wide range of information, from individual characters in text to various shades of color in images. Moving beyond the byte, we encounter larger units of measurement, each denoting progressively larger quantities of bits. For instance, the “kilobyte” (KB) represents a thousand bytes, the “megabyte” (MB) signifies a million bytes, and the “gigabyte” (GB) stands for a billion bytes (Table 2.2). As the scale continues to expand, we encounter the “terabyte” (TB), equivalent to a trillion bytes. These standardized units offer a practical and comprehensible way to express data sizes and storage capacities. They are indispensable in various aspects of technology, from specifying the capacity of storage devices such as hard drives and flash drives to quantifying the size of files, documents, and multimedia content. Thanks to these units, we can effortlessly navigate the vast digital landscape, making informed decisions about data management and storage. The constant increase in the terms for different amounts of bytes is a clear indication of the rapid growth in computer technology. As technology advances, the amount of data that can be stored and processed also increases. This trend is
Table 2.2 The quantity of bytes

Name         Full name   Size
1 bit        Bit         0 or 1
1 byte       Byte        8 bit
1 kilobyte   KB          1024 bytes
1 megabyte   MB          1024 KB
1 gigabyte   GB          1024 MB
1 terabyte   TB          1024 GB
1 petabyte   PB          1024 TB
evident in the example of Windows operating systems. Windows 95, which was introduced in the mid-90s, required only 1 gigabyte of storage to function. This was considered a significant amount of storage at the time. However, as technology has progressed, the latest Windows 11 operating system now requires 64 gigabytes of storage. Similarly, the amount of RAM required for the operating systems has also increased over time. While Windows 95 needed only 50 megabytes of RAM, Windows 11 now requires a minimum of 4 gigabytes of RAM. From this, we can see that technological innovations have dramatically increased the amount of information that a computer can handle. Additionally, the amount of data generated in certain fields such as traffic data has the potential to reach petabytes of data. Thus, we can see that as society continues to produce large amounts of data, the development of hardware that can handle such large data is likely to follow.
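The following short Python helper illustrates these units in practice, converting a raw byte count into the human-readable names listed in Table 2.2 (using the 1 KB = 1024 bytes convention of the table); the 64 GB figure in the example echoes the storage requirement mentioned above.

    # A small helper for converting byte counts into the units of Table 2.2.
    def human_readable(num_bytes: int) -> str:
        units = ["bytes", "KB", "MB", "GB", "TB", "PB"]
        value = float(num_bytes)
        for unit in units:
            if value < 1024 or unit == units[-1]:
                return f"{value:.1f} {unit}"
            value /= 1024            # step up to the next unit

    print(2 ** 8)                        # a single byte can take 256 distinct values
    print(human_readable(64 * 1024**3))  # prints "64.0 GB"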
Storage Devices As technology continues to advance at a rapid pace, so does the evolution of data storage devices. Over the years, there have been a myriad of different storage devices developed, each with its own unique features and capabilities. One of the earlier technologies used to store data was the floppy disk. Floppy disks were widely used until the 2000s and were popular because of their small size and relatively low cost. However, they could only store a limited amount of data, with the highest capacity being around 2.8 megabytes. As technology progressed, floppy disks were eventually replaced by more advanced storage devices, such as CDs and DVDs. Compact disks (CDs) and digital video disks (DVDs) are optical disk storage devices that use a laser to read and write data. CDs and DVDs are widely used for software installation, as well as for more permanent storage of data, because once
data are written to a CD or DVD, it is difficult to overwrite. While CDs and DVDs have been largely replaced by newer storage devices, they remain a popular choice for certain applications. Another popular storage device is the hard disk drive (HDD). HDDs use magnetic storage technology and consist of a magnetic head and a hard disk platter. HDDs are capable of storing large amounts of data, with current models capable of storing up to 10 terabytes. They are also relatively affordable and widely used in personal computers and servers. More recent developments in data storage technology have led to the development of USB drives and solid-state drives (SSDs). Unlike HDDs, USB drives and SSDs use semiconductor memory technology, which allows them to read and write data much more quickly. This makes them ideal for use in situations where high-speed data access is required, such as for booting an operating system or running resourceintensive applications. USB drives, in particular, have become very popular in recent years due to their small size, affordability, and versatility. They are widely used for storing and transferring data between devices and are also commonly used as bootable devices for installing operating systems. SSDs, on the other hand, have become increasingly popular as primary storage devices for personal computers and servers. They offer faster boot times, quicker application load times, and improved system performance compared to HDDs. While SSDs can be more expensive than HDDs, their superior performance and reliability make them an attractive option for users who demand the best possible performance from their computers.
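As a small practical example, the capacity and free space of the storage device behind a given path can be checked from Python with the standard library alone; the mount point "/" below is just an example and would differ on Windows systems.

    # Check total, used, and free space on the device holding a given path.
    import shutil

    usage = shutil.disk_usage("/")
    print(f"Total : {usage.total / 1024**3:.1f} GB")
    print(f"Used  : {usage.used / 1024**3:.1f} GB")
    print(f"Free  : {usage.free / 1024**3:.1f} GB")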
Graphics Processing Unit (GPU) The graphics processing unit (GPU), more commonly known as the graphics card, has quickly become an important piece of hardware for machine learning applications. Initially, the GPU was developed in order to transform data into the imagery we see on computer displays and monitors. Architecturally and conceptually, CPUs and GPUs are very similar to each other. However, their complexities differ. CPUs are designed to solve complex mathematical functions, while GPUs are comparatively simple. The simplicity of the hardware allows GPUs to contain a larger number of cores than CPUs. Given this, GPUs are able to calculate multiple simple mathematical calculations simultaneously. This ability becomes very important when dealing with machine learning calculations. More specifically, deep learning—a part of machine learning—benefits from GPU use as it requires large amounts of simple
mathematical calculations to be conducted, and these calculations can be carried out in parallel with each other. From this, one can regard the GPU as a core hardware technology for deep learning.
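As an illustration of this parallelism, the following sketch offloads a large matrix multiplication to a GPU with PyTorch, assuming PyTorch has been installed with CUDA support; it falls back to the CPU when no GPU is detected, and the matrix size is arbitrary.

    # A minimal sketch of running a parallel-friendly operation on a GPU.
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("Running on:", device)

    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)
    c = a @ b                      # many simple multiply-add operations executed in parallel
    print("Result shape:", tuple(c.shape))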
Motherboard The motherboard serves as the central circuitry that orchestrates the integration of various essential hardware components within a computer system. It acts as the hub, connecting crucial elements such as the CPU, RAM, GPU, storage devices, and power unit, thus establishing the foundation for the computer’s functionality and performance. Typically, the assembled motherboard is carefully placed and secured within a designated computer case. However, this process of mounting the motherboard can give rise to temperature-related challenges. It is imperative to consider that modern CPUs have the propensity to generate significant amounts of heat during their operation, which can subsequently have a detrimental impact on their overall performance. The accumulation of excessive heat within the CPU can result in a dramatic reduction in its processing power and efficiency. In certain scenarios, particularly when engaged in intensive computational tasks, the CPU temperature can soar to levels surpassing 100 °C. Employing robust cooling mechanisms, such as fans, heatsinks, and liquid cooling systems, is essential to dissipate the generated heat and maintain optimal operating temperatures, safeguarding the performance and longevity of the computer system. Computer components are designed to operate within specific temperature limits, as excessive heat can lead to component failure and damage. To ensure the longevity and optimal functioning of crucial parts, such as the CPU, it is imperative to employ an effective cooling system. One commonly used cooling method involves the utilization of a fan. This cooling system typically involves placing a fan on top of the CPU. As the CPU generates heat during its operation, this heat is transferred to a metal heatsink located beneath the fan. The airflow generated by the fan facilitates the dissipation of heat from the heatsink, thereby helping to maintain a lower temperature for the CPU. CPU fans are a prevalent cooling solution implemented in both server environments and desktop PCs. Another popular cooling system employed in high-performance systems is liquid cooling. This method follows a similar concept to the fan cooling system, with the primary difference being the use of liquid to transfer and dissipate heat. In a liquid cooling setup, the heat generated by the CPU is transferred to a heatsink, similar to the fan cooling system. However, in this case, the heatsink is in direct contact with a liquid coolant, which absorbs the heat. The warm liquid then circulates through the cooling system, often aided
by a pump, to a radiator where a fan or fans cool the liquid by dissipating the heat into the surrounding air. This continuous cycle of heat absorption, transfer, and dissipation helps maintain the CPU’s temperature within an acceptable range. Liquid cooling systems are commonly employed in high-performance computing scenarios, such as gaming rigs or advanced workstation setups, where the CPU is subjected to heavy workloads and requires efficient heat dissipation. These systems offer enhanced cooling capabilities and can contribute to maintaining lower temperatures even under demanding conditions.
Materials and catalysts informatics heavily rely on the utilization of supercomputers, as certain machine learning processes and computational methods, such as first-principles calculations, necessitate extensive CPU time for their execution. The sheer magnitude of these computational tasks demands the exceptional processing power and parallel computing capabilities that supercomputers offer. When one hears the term “supercomputer,” it may conjure up images of an extraordinary and highly specialized machine, distinct from the conventional computers we are familiar with. However, the fundamental concept of a supercomputer remains rooted in the same principles we discussed earlier in relation to computer hardware. In essence, a supercomputer comprises a series of interconnected computers, each equipped with essential components such as motherboards, CPUs, RAMs, GPUs, and power units—mirroring the composition of a typical computer system. While individual computers within a supercomputer possess similar hardware configurations to regular computers, it is their collective synergy and interconnectedness that sets them apart. A supercomputer can be envisioned as a network of interconnected computers, working in harmony to tackle computationally demanding tasks. Through high-speed interconnections, these individual computing nodes collaborate to execute complex calculations and simulations with remarkable efficiency and speed. The distributed nature of supercomputers allows for the parallel execution of tasks across multiple nodes simultaneously, leveraging the power of parallel processing. This parallelization enables the supercomputer to process vast volumes of data and perform intricate calculations that would be unfeasible or prohibitively time-consuming for traditional computing systems. By harnessing the combined computational prowess of numerous interconnected computers, a supercomputer empowers researchers and scientists to explore the frontiers of materials and catalysts informatics. It enables them to delve deeper into complex phenomena, analyze vast datasets, and uncover valuable insights that can drive advancements in various scientific and technological domains. In the realm of supercomputing, a critical technology known as the “message passing interface (MPI)” plays a pivotal role in enabling efficient parallel calculations. The architectural framework of MPI, depicted in Fig. 2.3, provides a mechanism for distributing calculations across multiple interconnected computers, thereby harnessing the power of parallel processing. By leveraging MPI, the input calculations are intelligently distributed
Fig. 2.3 The architecture of MPI
among the individual computers comprising the supercomputer. Each computer is assigned a specific set of calculations, which it performs independently and concurrently with the others. Once the calculations are completed, the results are consolidated and sent back to a central location, facilitating the seamless integration of parallel calculations. This parallel computing paradigm allows supercomputers to unleash their remarkable computational power by harnessing the collective capabilities of numerous interconnected computers. Through the collaborative efforts of these distributed systems, extensive calculations can be efficiently executed in parallel, significantly accelerating the overall computational throughput. However, it is important to note that simply increasing the number of assigned CPU cores within a supercomputer does not necessarily translate to a proportional decrease in computational time. The key lies in how the calculations are divided and allocated to each core. The distribution of calculations must be carefully evaluated to ensure optimal load balancing and resource utilization. In some cases, depending on the nature of the calculations and the distribution strategy employed, a linear improvement in computational time may not be achieved with an increase in CPU cores. Factors such as data dependencies, communication overhead, and synchronization requirements can impact the scalability of parallel computations. Furthermore, augmenting the number of cores in a supercomputer can introduce potential bottlenecks in other system components, such as input and output speeds in RAM or storage devices. The increased computational power must be accompanied by a comprehensive evaluation of the overall system architecture and resource allocation to prevent these potential bottlenecks from impeding performance gains. Therefore, it is imperative to optimize and evaluate how calculations are distributed and executed within supercomputers. This involves carefully assessing the workload distribution, minimizing communication overhead, and maximizing the utilization of
available resources. By fine-tuning these aspects, supercomputers can achieve optimal performance and deliver the exceptional computational capabilities required for tackling complex scientific and technological challenges.
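To make this distribute-and-collect pattern concrete, the following minimal sketch uses the third-party mpi4py package: each process sums a slice of the work, and the partial results are combined on rank 0. The script name and process count in the run command are arbitrary examples.

    # A minimal MPI-style parallel sum with mpi4py.
    # Run with, e.g.: mpirun -np 4 python mpi_sum.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()           # this process's ID
    size = comm.Get_size()           # total number of processes

    n = 1_000_000
    local = sum(range(rank, n, size))                  # each rank sums its slice of the work
    total = comm.reduce(local, op=MPI.SUM, root=0)     # consolidate partial results on rank 0

    if rank == 0:
        print("Parallel sum:", total)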
2.3
Software
Once the hardware setup for informatics is complete, the next crucial step is selecting the appropriate software. Before any calculations can commence, it is necessary to choose and install an operating system (OS). There are three well-established operating systems available for desktop computers: Windows OS, Mac OS, and Linux OS. Windows OS is a widely adopted operating system utilized in various settings worldwide. Known for its user-friendly interface and extensive software compatibility, it has become the operating system of choice for many users. On the other hand, Mac OS, developed by Apple Inc., has garnered popularity among designers, scientists, and medical professionals. It is renowned for its seamless integration with Apple’s hardware and the availability of professional software applications tailored to these fields. This has made Mac OS a preferred choice for those seeking a reliable and efficient computing environment in specialized domains. Meanwhile, Linux OS, developed by the open-source community, has emerged as a strongly recommended operating system for data science and informatics applications. Unlike proprietary software systems such as Windows and Mac OS, Linux is considered open source and is available free of charge. This accessibility has contributed to its widespread adoption, particularly among data scientists and informatics professionals. One of the primary reasons Linux has gained popularity in the data science and informatics realms is its versatility and customizability. Linux provides users with extensive control over their computing environment, allowing them to tailor it to their specific needs and preferences. Furthermore, Linux offers a vast array of powerful tools, libraries, and frameworks that are instrumental in data analysis, machine learning, and scientific computing. Moreover, the open-source nature of Linux fosters a vibrant community of developers and contributors who continuously improve the system, fix bugs, and add new features. This collaborative approach promotes innovation and rapid development, and ensures that Linux remains at the forefront of technological advancements in the informatics field. Details regarding the meanings behind proprietary software and open-source software are explained in later chapters. For now, we shall focus on why Linux is a popular choice for data science and informatics applications. There is a multitude of compelling reasons for choosing Linux as the operating system of choice for data science and informatics work. One significant advantage becomes evident when considering its role in both desktop computers and servers, and how this influences their respective designs and functionalities. Desktop computers are specifically engineered to be utilized directly by users, necessitating the inclusion of hardware components such as display monitors, keyboards, and
Fig. 2.4 Server side and client side
computer mice. These physical peripherals are essential for interacting with the desktop environment and carrying out various tasks. Users directly engage with the graphical user interface (GUI) provided by the operating system, allowing for seamless navigation, data input, and application execution. In contrast, servers are primarily designed to operate in a remote environment, serving as centralized hubs of computational power and data storage accessible to users from remote locations. Unlike desktop computers, servers do not typically incorporate keyboards or display monitors as part of their standard configuration. Instead, users typically access servers remotely via the Internet or local network connections, as depicted in Fig. 2.4. Servers play a crucial role in handling resource-intensive calculations that surpass the capabilities of typical desktop computers. Their superior hardware specifications, including larger RAM capacities, greater storage capacities, and high-performance CPUs, make them well-suited for processing complex computations efficiently. In order to facilitate seamless interaction with servers, they operate on their own dedicated operating systems. This allows users to access the server remotely using their personal operating systems. In cases where the operating systems of the server and desktop differ, users often resort to writing and executing code directly on the servers. This approach ensures compatibility and avoids potential issues arising from operating system disparities. Alternatively, some users create a server environment on their desktop PCs by utilizing virtual machines or Docker containers. These virtualized setups enable users to simulate the functionalities of a server within their desktop operating systems. By leveraging virtual
machines or Docker containers, users can benefit from the advantages of a server environment, such as enhanced computational capabilities and dedicated resources, while working within the familiar confines of their desktop systems. Creating a server environment on a desktop PC using virtualization technologies offers several benefits. It allows users to experiment with different server configurations, test software compatibility, and develop applications in a controlled environment. Furthermore, virtualization enables the efficient utilization of system resources by allocating specific amounts of CPU, RAM, and storage to the virtualized server, thus ensuring optimal performance and minimizing conflicts with other applications running on the desktop. Docker containers, on the other hand, provide a lightweight and portable solution for creating server environments. They offer a standardized packaging format for applications, allowing for easy deployment and replication across different systems. By encapsulating the necessary dependencies and configurations within containers, users can quickly set up server-like environments on their desktop PCs without the need for complex manual configurations or dedicated hardware. An examination of the distribution of desktop and server operating systems provides valuable insights into the potential challenges associated with accessing servers that employ different operating systems. By delving into the market shares of various operating systems, we can gain a better understanding of the prevailing landscape. When considering the operating systems utilized in desktop PCs, it becomes evident that Windows dominates the market with a substantial majority, accounting for over 80% of the market share. Following Windows, we find the Mac OS, developed by Apple Inc., which captures approximately 10% of the desktop operating system market share. The Mac OS has garnered a loyal following, particularly among professionals in fields such as design, science, and medicine, due to its seamless integration with specialized software and its reputation for delivering a highly intuitive and streamlined user experience. On the other hand, Linux, an open-source operating system developed by the collaborative efforts of the open-source community, holds a relatively modest market share of less than 2% in the realm of desktop computing. While Linux enthusiasts appreciate its flexibility, robust security features, and extensive customization options, its adoption remains comparatively limited among the broader user base. The landscape of server operating systems presents a stark contrast to that of desktop operating systems. Upon analyzing Fig. 2.5, it becomes evident that Linux reigns supreme, capturing an overwhelming market share of nearly 99%. This staggering dominance solidifies Linux as the de facto choice for server deployments worldwide. The near ubiquity of Linux in server environments implies that the vast majority of servers, spanning a diverse range of industries and applications, operate using this robust and versatile operating system. This widespread adoption can be attributed to Linux’s exceptional stability, scalability, security features, and its open-source nature, which encourages collaborative development and innovation. However, the prevalence of Linux-based servers poses a significant challenge for users accessing these servers from Windows or Mac operating systems. The discrepancy in operating systems becomes
Fig. 2.5 Rough estimates of the distribution of operating system market shares
particularly pronounced in the realm of data science, where the seamless portability and consistency of the data science environment across different platforms are of paramount importance. Achieving parity in data science environments between Windows or Mac and Linux can prove challenging due to inherent differences in software compatibility, package availability, and system configurations. The variations between these platforms necessitate thoughtful consideration and adaptation to ensure a harmonious workflow and optimal utilization of data science tools and resources. Windows users seeking to access Linux servers often encounter obstacles in replicating the exact Linux environment on their local machines. While efforts have been made to bridge the gap between Windows and Linux, inherent disparities in file systems, libraries, and command-line interfaces can hinder the seamless transferability of data science workflows. Workarounds, such as using virtual machines, containers, or remote access solutions, are often employed to create a virtual Linux environment within the Windows ecosystem. Mac users face similar challenges when accessing Linux servers, as the Mac operating system differs from Linux in various aspects. Although both Mac and Linux are Unix-like systems, disparities in system configurations, library versions, and software availability can complicate the process of ensuring a consistent data science environment. Mac users often resort to virtualization or containerization technologies to simulate Linux environments or leverage remote access methods to interact directly with Linux servers. Addressing these discrepancies and fostering compatibility between different operating systems is a continuous endeavor in the field of data science and informatics. Cross-platform tools, such as Jupyter notebooks, Docker containers, and cloud-based computing solutions, have emerged as powerful facilitators, enabling data scientists to collaborate and seamlessly transition between Windows, Mac, and Linux environments. These tools bridge the gaps between operating systems, ensuring a consistent data science experience while maximizing the potential of Linux servers. By also using Linux for a desktop operating system, it is possible to minimize discrepancies between server and desktop operating systems. The world of Linux distributions is vast and diverse, offering a plethora of options for both servers and desktops. A simple search for “Linux” on any search engine would yield
an extensive list of distributions, each with its unique features and target audience. When it comes to selecting a Linux distribution for servers within the context of data science, there are several factors to consider. Two of the most commonly used server operating systems in the present landscape are CentOS and Ubuntu. While both distributions have their merits, they have distinct characteristics that cater to different needs and preferences. CentOS, once hailed as the flagship distribution for Linux servers, has garnered a loyal following due to its remarkable stability. The developers behind CentOS maintain a conservative approach when it comes to updating software packages, which ensures a rock-solid and reliable system. This stability is of paramount importance in critical environments such as banking systems and research institutions, where system uptime and resilience are non-negotiable. CentOS’s commitment to maintaining a stable software repository does come with a trade-off. The downside is that the software packages available within the distribution might be slightly outdated compared to other distributions that prioritize the latest updates and features. However, for those who prioritize stability and reliability over bleeding-edge software versions, CentOS remains a compelling choice. In recent years, Ubuntu has emerged as a formidable contender in the server operating system landscape. Backed by Canonical Ltd., Ubuntu has gained widespread popularity for its user-friendly approach, extensive software ecosystem, and active community support. Its focus on delivering regular updates, combined with a vast software repository, ensures that users have access to the latest features and security patches. Ubuntu’s popularity within the data science community can be attributed to its seamless integration with popular data science tools, frameworks, and libraries. The distribution offers dedicated packages and repositories specifically tailored for data science workflows, making it an attractive choice for data scientists and researchers. Furthermore, Ubuntu’s user-friendly interface and extensive documentation make it accessible even to those with limited Linux experience. It is important to note that CentOS and Ubuntu are just two examples of server operating systems, and there are numerous other distributions available, each with its own strengths and specialties. The choice of distribution ultimately depends on the specific requirements of the server environment and the preferences of the users. Ubuntu has garnered a significant following and established itself as a leading server operating system. Built upon the solid foundation of Debian Linux, Ubuntu inherits its emphasis on stability, akin to CentOS. However, Ubuntu adds a unique touch by incorporating cutting-edge technologies and software, striking a balance between stability and innovation. Since its inception, Ubuntu has witnessed remarkable growth and adoption, becoming a go-to choice for server deployments. One key aspect that sets Ubuntu apart is its comprehensive software repository, known as the Ubuntu Software Repository. This repository serves as a centralized hub, housing a vast collection of stable and compatible software and libraries specifically curated for Ubuntu. This inclusion of essential data science tools and libraries makes Ubuntu an attractive option for data science and informatics applications. 
Considering that the server environment is likely to be based on Ubuntu, it is highly advantageous to align the personal desktop environment with the same distribution.
Fortunately, Ubuntu offers a dedicated desktop version, ensuring a seamless transition and cohesive experience across systems. Moreover, for users seeking an even more user-friendly desktop environment, Linux Mint, a Linux distribution derived from Ubuntu, provides a variant optimized and tailored for desktop usage. When it comes to selecting a Linux distribution for the desktop environment, Linux Mint emerges as a highly recommended choice. With its emphasis on user-friendliness, Linux Mint offers an intuitive and polished desktop experience, making it particularly appealing to those new to Linux operating systems. This focus on usability, combined with the solid foundation of Ubuntu, creates an environment conducive to productivity and seamless integration with data science workflows.
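As a rough illustration of the remote, display-less access to Linux servers described earlier in this section, the hedged Python sketch below opens an SSH connection and runs a single command. It assumes the third-party paramiko library is installed; the host name and user name are placeholders rather than real systems, and authentication is assumed to rely on an SSH key already registered with the server.

import paramiko  # assumed third-party SSH library

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# Hypothetical Linux server reached over the network.
client.connect("server.example.org", username="researcher")

# Run a command on the remote machine and show its reply.
stdin, stdout, stderr = client.exec_command("uname -a")
print(stdout.read().decode())

client.close()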
2.4
A Brief Guide to Installing Linux Mint
The installation of Linux Mint is briefly explored. With the exception of Mac products (which come installed with Apple’s own OS), most desktop and laptop computers come with Windows OS preinstalled. Given this, it is most likely that the Linux OS will have to be installed manually by the user. Manual installation is one of several reasons why users may feel Linux is a challenging operating system with a large learning curve. However, installing Linux is relatively simple. Here are a series of steps that can be followed for installing Linux:
1. Download the ISO image from a Linux website. As an example, the ISO image file for the desired version of Linux Mint can be downloaded from https://linuxmint.com/.
2. Prepare installation media using the ISO image file. An external data storage device is necessary in order to create the Linux Mint installation media. While CD/DVDs can be used, USB memory sticks are the more popular choice and are the option we recommend. Here, a USB memory stick with at least 4 GB of space is necessary in order to create the installation media.
3. Create a bootable Linux installation USB memory stick using the ISO image. Free software such as Rufus (https://rufus.ie/) is commonly used for this purpose.
4. Once the bootable Linux USB memory is prepared and ready to use, plug the USB memory stick into the desired laptop or desktop. Note that the computer in question should be turned off before starting.
5. Start the computer and enter the Boot Menu. Here, the Boot Menu for the computer can be launched by pressing a specific key on the keyboard after starting the computer. The key for this boot menu differs based on the computer or motherboard brand. For instance, a Dell computer may use F12 to load its boot menu, while an HP computer may use the Esc key.
6. Once the Boot Menu is launched, select the USB to boot. This allows you to start the OS from the USB memory stick.
7. Once the option “Start Linux Mint” appears, select it. This allows the computer to start Linux Mint from the USB memory stick.
8. Install Linux Mint by selecting the “Install Linux Mint” icon located on the desktop. Once selected, installation will automatically begin, and one will be prompted to supply information such as username, password, keyboard layout, language, and other types of basic information.
It must be noted that, once this process is complete, the existing Windows OS will be eliminated. Before installing Linux, one must be aware that the Windows system will be permanently replaced by the Linux OS unless a dual boot system is set up. When considering the installation of a Linux operating system for data science purposes, one may hesitate to replace their existing operating system. Fortunately, there is an alternative solution that allows for the creation of a self-contained Linux environment: the utilization of virtual machines. Virtual machines enable users to install Linux as a Windows application, thereby creating a separate and isolated environment within their existing operating system. This approach offers the flexibility to experiment with Linux without the need for a dedicated physical machine or altering the current system setup. Notable virtual machine software options include VirtualBox (https://www.virtualbox.org) and VMware Workstation (https://www.vmware.com), which provide robust virtualization capabilities. By leveraging virtual machines, users can enjoy the benefits of a Linux operating system while seamlessly integrating it into their current computing environment. This approach allows for the exploration and utilization of Linux-specific tools, libraries, and frameworks for data science tasks. Alternatively, for those who are willing to embrace Linux as their primary operating system, a direct installation on their computer is a viable choice. This approach provides a more integrated experience and allows for optimal utilization of hardware resources. Whether opting for a virtual machine or a dedicated installation, establishing a Linux operating system lays the foundation for setting up an environment tailored to data science-related work. Once the Linux operating system is in place, the next step is to configure the environment to suit the specific requirements of data science endeavors. This entails installing and configuring data science tools, libraries, and frameworks that facilitate data analysis, machine learning, and other informatics-related tasks (a short verification sketch is given at the end of this section). The Linux ecosystem offers a wealth of resources and package managers, such as APT (Advanced Package Tool) and Yum, that streamline the installation process and ensure compatibility with the chosen Linux distribution. Furthermore, it is worth noting that pre-configured data science platforms are also available. Distributions such as Anaconda, which is a Python and R software distribution rather than an operating system, provide a comprehensive suite of data science tools and libraries, eliminating the need for manual setup and simplifying the initial configuration process. By leveraging virtual machines or directly installing Linux, users can establish an optimized and dedicated environment for their data science work. Whether opting for a contained virtual machine or a native installation, the flexibility and power of Linux
enable users to harness the full potential of data science tools and methodologies. With the proper environment in place, data scientists can embark on their analytical journey with confidence and efficiency. In the next chapter, the concepts and usage of programming languages will be explored.
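As promised above, here is a short, hedged verification sketch: once an environment has been configured, whether as a native install, a virtual machine, or an Anaconda setup, a few lines of Python can report the platform and confirm that commonly used data science libraries are importable. The package list below is only an example and can be adjusted freely.

import importlib.util
import platform
import sys

print("operating system:", platform.system(), platform.release())
print("python version:", sys.version.split()[0])

# Example set of libraries commonly used in data science workflows.
for package in ("numpy", "scipy", "pandas", "matplotlib", "sklearn"):
    found = importlib.util.find_spec(package) is not None
    print(package, "installed" if found else "missing")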
2.5
Conclusion
Computer hardware and software play a paramount role in the field of informatics, akin to experimental devices utilized in scientific research. Although informatics scientists may not be tasked with constructing their own computers, possessing a comprehensive understanding of the underlying mechanisms of these components is imperative. In a manner analogous to researchers comprehending the inner workings of their experimental devices, informatics scientists must acquire a profound knowledge of both hardware and software domains to unleash the full potential of these tools and optimize their informatics workflows. To effectively navigate the realm of informatics, a solid foundation in computer hardware is essential. This encompasses a deep comprehension of the various hardware components, their functionalities, and their interconnections within a computer system. Knowledge of hardware architecture, central processing units (CPUs), memory modules, storage devices, and input/output (I/O) mechanisms enables informatics scientists to select and configure hardware setups that align with their computational requirements. Understanding the performance characteristics, scalability, and limitations of different hardware configurations empowers them to make informed decisions when designing and executing computationally intensive tasks. Equally significant is the mastery of software systems in informatics. Informatics scientists must possess a comprehensive understanding of operating systems, programming languages, software libraries, and frameworks that underpin their work. Proficiency in operating systems allows them to navigate the intricate layers of system resources, optimize resource allocation, and ensure efficient utilization of computing power. Additionally, a strong command of programming languages and software tools enables informatics scientists to develop robust algorithms, implement data processing pipelines, and perform complex computations. Familiarity with software libraries and frameworks further enhances their capabilities, providing them with pre-existing functionalities, algorithms, and data structures that expedite the development process and enable rapid prototyping. By merging their knowledge of computer hardware and software, informatics scientists gain a comprehensive perspective that empowers them to harness the full potential of these tools. Such an integrative understanding allows them to fine-tune their informatics workflows, optimize resource allocation, and exploit parallel computing capabilities. By leveraging hardware advancements and employing software techniques tailored to their specific research domains, informatics scientists can significantly enhance the efficiency and accuracy of their analyses, simulations, and modeling tasks. Moreover, a deep
comprehension of computer hardware and software equips informatics scientists with the ability to troubleshoot and resolve technical issues that may arise during their work. This self-reliance not only reduces dependence on external technical support but also ensures the continuity and productivity of their informatics endeavors.
Questions

2.1 Why is it important for informatics scientists to have a comprehensive understanding of computer hardware and software?
2.2 What aspects of computer hardware should informatics scientists be familiar with?
2.3 Why is mastery of software systems important for informatics scientists?
2.4 How does merging knowledge of computer hardware and software empower informatics scientists?
2.5 Why does a deep comprehension of computer hardware and software benefit informatics scientists in troubleshooting and resolving technical issues?
2.6 What are the main components of a computer system?
2.7 How does hardware affect data analysis performance?
2.8 What is the role of software in a computer system?
2.9 How can hardware and software be compared to the human body?
2.10 Why is it important for data analysts to have a basic understanding of hardware and software?
2.11 What is the CPU and what is its role in a computer system?
2.12 Name some well-known CPU manufacturers and explain their differences.
2.13 How does the choice of CPU impact computer performance?
2.14 What is the purpose of RAM in a computer system?
2.15 Why is large RAM size preferred in data processing and machine learning applications?
2.16 How has RAM technology evolved, and why is it important to stay updated?
2.17 What are some examples of storage devices mentioned in the text?
2.18 What is the advantage of using USB drives?
2.19 How do USB drives and SSDs differ from HDDs in terms of technology?
2.20 What is the function of a graphic processing unit (GPU)?
2.21 How does a motherboard contribute to a computer’s functionality and performance?
2.22 What are some cooling mechanisms used to dissipate heat from the CPU?
2.23 Why do materials and catalysts informatics heavily rely on supercomputers?
2.24 How does the distributed nature of supercomputers contribute to their computational power?
2.25 What is the role of the message passing interface (MPI) in supercomputing?
2.26 Is increasing the number of CPU cores in a supercomputer always proportional to a decrease in computational time?
2.27 What factors can impact the scalability of parallel computations in supercomputers?
2.28 What are the three well-established operating systems available for desktop computers?
2.29 What are some advantages of Windows OS?
2.30 What makes Mac OS a preferred choice for certain professionals?
2.31 Why has Linux gained popularity in data science and informatics?
2.32 How can users create a server environment on their desktop PCs?
2.33 What are the market shares of the dominant desktop operating systems?
2.34 Why has Linux gained limited adoption among desktop users?
2.35 Which operating system dominates the server market?
2.36 What challenges do users face when accessing Linux servers from Windows or Mac operating systems?
2.37 What are two commonly used Linux distributions for servers in the context of data science?
2.38 What sets Ubuntu apart from other server operating systems?
2.39 What advantages does Ubuntu offer for data science applications?
2.40 Which Linux distribution is recommended for a user-friendly desktop experience aligned with Ubuntu?
2.41 What is a solution that allows for the installation of a Linux operating system without replacing the existing operating system?
2.42 What are the benefits of leveraging virtual machines for data science purposes?
2.43 What are some pre-configured Linux distributions available for data science applications?
3
Programming
Abstract
Programming languages play a central role in data science. However, before we can explore what roles they play, we must first have an understanding of what programming languages are. In this chapter, basic knowledge pertaining to programming and programming languages is introduced and explored.

Keywords

Programming · Computational science · Scripting language · Compiling language · Python · Text editor · Open source
• Understand the nature of programming languages.
• Cover the basics of programming.
• Introduce major programming languages.
• Explore the concepts of open source.
3.1
Introduction
Programming is the way that people can communicate with machines and is a fundamental component of computer science and software development. It involves creating sets of instructions and definitions using programming languages that machines then process and carry out. Logic and problem-solving are heavily involved as programmers develop algorithms and code to solve various problems or to carry out specific tasks. It is easy to
say that the code written by programmers is the foundation and backbone of the software applications and systems that we rely on throughout our daily lives. Applications for programming widely vary and are embedded in the majority of our lives, whether directly or indirectly. Programming is essential for developing software of all kinds, whether desktop or mobile applications, web applications, or video games. It is also required for website creation and maintenance, database design and management, and task automation in many industrial and commercial sectors. Of course, it is crucial in areas such as data science and machine learning, where algorithms and models are written in order to process and analyze vast amounts of data. It is thus easy to conclude that programming has quickly become a crucial component of how modern society functions.
3.2
Basics of Programming
To start: What is programming? At its core, programming is the art of crafting computer programs that are designed to execute specific tasks and solve problems. These programs serve as the foundation for automating processes, analyzing data, and creating innovative solutions. The process of programming involves three key steps: planning, writing, and running. Each step plays a crucial role in ensuring the success and effectiveness of the program. The first step, planning, is undeniably the cornerstone of the entire programming process. During this phase, programmers carefully define the objective of the program and meticulously plan the algorithms required to achieve that objective. This involves breaking down the problem into smaller, manageable steps and designing a logical flow of operations. Effective planning sets the stage for a well-structured and efficient program. It provides the programmer with clear goals and outcomes to be achieved, serving as a roadmap throughout the development process. A comprehensive and thoughtful plan greatly reduces the risk of writing extraneous or sloppy code that may hinder the program’s functionality or performance. The second step is the actual writing of the code using a programming language of choice. This stage requires the translation of the planned algorithms into a precise and syntactically correct code. Programmers utilize their knowledge of programming languages, libraries, and frameworks to implement the planned logic. The code serves as the building blocks that bring the program’s functionality to life. Once the code is written, the third step involves running the program and observing its execution. This phase tests the program’s functionality, identifies any errors or bugs, and allows for refinement and optimization. Programmers analyze the program’s output and behavior, making necessary adjustments to ensure its accuracy and efficiency. Through rigorous testing and debugging, the program evolves into a reliable and robust software solution.
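As a small, purely illustrative example of these three steps, consider the plan "compute the mean of a list of measured values"; the function name and test numbers below are hypothetical and chosen only to keep the sketch short.

# Step 1 (plan): given a list of measured values, return their average,
# and treat an empty list as an error.

# Step 2 (write):
def mean(values):
    if not values:
        raise ValueError("at least one value is required")
    return sum(values) / len(values)

# Step 3 (run): execute, inspect the output, and refine if necessary.
print(mean([1.2, 3.4, 2.9]))  # prints 2.5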
It is worth emphasizing that the programming process is not linear but iterative. Programmers often revisit earlier steps, refining the planning, modifying the code, and running the program multiple times to achieve the desired outcome. This iterative approach allows for continuous improvement and the fine-tuning of the program. These three steps, namely planning, coding, and execution, encompass the fundamental process of programming. It is important to recognize a fundamental distinction between computer programs and humans: Programs operate solely based on the instructions they are provided. This fundamental concept highlights the role of humans in designing algorithms that serve as instructions for programs to follow during execution. Through meticulous planning and coding, programmers imbue computer programs with the ability to perform specific tasks and solve problems in a systematic and precise manner. How, then, can programming be used in science? There are a variety of tools available. Different fields of research prioritize and utilize a wide range of techniques to accomplish their goals. In the field of data science and informatics, various techniques and processes play a fundamental role in extracting insights from data. For example, machine learning techniques enable the development of predictive models, data preprocessing techniques help clean and transform data for analysis, and data visualization techniques aid in communicating findings effectively. In the realm of computational science, specific approaches are employed to simulate and analyze complex systems. Density functional theory, for instance, is widely used to study the electronic structure of materials, while molecular dynamics simulations allow researchers to explore the behavior of molecules over time. Additionally, methods such as the finite difference method provide numerical solutions to differential equations, facilitating the modeling of physical phenomena. Furthermore, it is important to note that programming serves as a foundational technology in both data science and computational science. It enables researchers to implement algorithms, conduct mathematical computations, and manipulate data effectively. Programming languages provide the necessary tools and frameworks to develop and execute sophisticated analyses, making it an essential skill for researchers in these fields. Moving forward, let us delve into the finer details of programming. The programming landscape can be visualized as a vast array of programming languages, each with its own syntax and purpose. To illustrate the concept, let us examine a simple program as an example. In the realm of computer science, the “Hello World” program serves as a ubiquitous and introductory example that almost every programmer is acquainted with. The “Hello World” program typically involves a minimalistic code snippet that outputs the phrase “Hello, World!” to the screen or console. Despite its simplicity, this program serves as an important starting point for beginners as it introduces fundamental concepts such as printing output and understanding basic syntax. In various programming languages, the implementation of the “Hello World” program may differ, showcasing the unique characteristics and syntax of each language. It serves as a stepping stone for
individuals to grasp the core concepts and structure of a programming language before embarking on more complex coding endeavors. For instance, a program displaying “Hello World” in the Python programming language can be written as simply as the following:

print('Hello World')
Here, the print command serves as a crucial function that facilitates the display of desired output. While the process may appear straightforward on the screen, it is important to recognize that the programming language undertakes a complex transformation of textual data into a series of binary digits, namely 0s and 1s, which is the fundamental language of computers. Consider the widely recognized distress signal “SOS” as an example. Although one might initially assume that “SOS” is an acronym or a meaningful phrase, it was chosen largely because of how simple its encoding is. In Morse code, the letter “S” is transmitted as three dots, while the letter “O” is transmitted as three dashes; if each dot is mapped to 0 and each dash to 1, the Morse representation of “SOS” translates into the sequence 000-111-000. This example highlights the inherent relationship between binary code and computer communication. By encoding information into binary form, computers can efficiently process and interpret commands, allowing for effective communication and execution of tasks. It underscores the fundamental role of binary code as the means through which computers comprehend and respond to human instructions, forming the backbone of modern computing systems. Let us delve into the profound impact of binary digits, 0s and 1s, on modern technology. Beyond their role in representing textual information, these binary digits play a crucial role in visual communication, enabling the display of images and photographs captured by devices such as smartphones. Consider the file size of a single picture, which typically amounts to approximately 1.1 megabytes. Within this seemingly compact file, there are approximately 8.8 million individual 0s and 1s meticulously arranged to encode the visual information. Although we perceive a vivid photograph on our screens, it is important to recognize the intricate interplay of an immense number of binary digits that constitute the underlying framework of these images. In essence, binary code serves as the binding force that unifies hardware and software, forming a cohesive connection between the physical components of a system and the programs that operate on them. This symbiotic relationship is aptly depicted in Fig. 3.1, illustrating the integral role of binary code in bridging the gap between hardware and software. However, the practicality of writing code solely in binary numbers is highly impractical for most users. To address this challenge, specialized programming languages known as assembly languages have been developed.
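The correspondence between text and binary digits can be inspected directly in Python. The short sketch below prints the ASCII bit patterns of the letters in “SOS” (these byte values differ from the Morse dot-dash analogy above) and repeats the rough megabyte-to-bit arithmetic for a picture file; the 1.1 MB figure is simply the example used in the text.

text = "SOS"
# Each character is stored as one byte, i.e., eight binary digits (ASCII encoding).
for byte in text.encode("ascii"):
    print(chr(byte), format(byte, "08b"))

# A roughly 1.1-megabyte picture corresponds to about 8.8 million binary digits.
file_size_bytes = 1_100_000
print("bits:", file_size_bytes * 8)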
Fig. 3.1 Example of assembly language and translated binary 0/1
Assembly languages occupy a unique position at the forefront of binary code, intimately intertwined with the hardware they directly interact with. This close proximity to the underlying binary numbers grants assembly languages a level of control and precision that is unparalleled in higher-level programming languages. However, it is important to note that assembly language is primarily designed to communicate with specific computer hardware, rendering it less accessible and more challenging for humans to read, as depicted in Fig. 3.1. The syntax and structure of assembly language instructions reflect the intricacies of the underlying hardware architecture, making it less intuitive for programmers accustomed to more abstract and user-friendly languages. Furthermore, it is worth emphasizing that each computer architecture typically has its own unique assembly language. For example, the Apple II computer employs an assembler known as LISA, while one of the IBM computer systems utilizes an assembly language called SOAP. This inherent architectural diversity implies that software written in LISA cannot be directly executed on IBM computers, illustrating the assembly language’s close alignment with specific hardware configurations. While assembly language offers unparalleled power and precision in instructing a computer’s operations, it inherently restricts the portability and global applicability of code. As a result, software written in assembly language is often limited in its ability to be easily adapted across different computer systems or architectures. This constraint underscores the need for higher-level programming languages that provide greater portability and code reusability, enabling developers to write code that can be executed on a wider range of hardware platforms. The relationship between software and hardware is visually depicted in Fig. 3.2, illustrating the crucial role that binary code, represented by the digits 0 and 1, plays as the bridge connecting computer hardware and software. Binary code forms the foundation upon which all computer operations and instructions are built, enabling the translation of
Fig. 3.2 The relationship between hardware and software
human-readable commands into a language that the hardware can comprehend. Assembly languages serve as a vital intermediary step between the intricate workings of hardware and the instructions provided by users. They allow programmers to create specific commands and orders that can be executed by the computer. However, as previously mentioned, assembly language can be challenging and cumbersome for users due to its low-level nature and close alignment with hardware architecture. The complexities of assembly language syntax and structure necessitate a deeper understanding of the underlying hardware, making it less user-friendly for programmers. To address these challenges and promote greater accessibility, high-level programming languages were designed and developed. These languages were inspired by the concept of spoken languages, where multiple expressions and variations can convey the same underlying meaning. Similarly, high-level programming languages offer programmers a more intuitive and human-readable syntax, abstracting away the intricate details of hardware operations. Just as spoken languages have their own unique grammar and rules, programming languages also possess their own syntax and guidelines governing the structure and composition of commands. Examples of early high-level programming languages include CPL and LISP, which paved the way for subsequent language developments. Consider the task of creating a program that prints the phrase “Hello World.” With the use of high-level programming languages, such as C, Python, or Java, this can be accomplished with relatively simple and concise code. These languages provide built-in
functions and libraries that abstract away the complexities of low-level hardware operations, allowing programmers to focus on the logic and functionality of their programs. Such code would be written as the following:

CPL
Write("Hello, World!")

LISP
(format t "Hello, World")
The realm of programming languages, while seemingly straightforward, is a complex and dynamic landscape. Each programming language exhibits its own unique syntax, providing a distinct way of expressing instructions and commands. This syntax variation is a fundamental characteristic shared by all programming languages, highlighting their diverse nature and tailored approaches to problem-solving. Since the pioneering days of early languages such as CPL and LISP, the world of programming has witnessed an extraordinary proliferation of languages, each serving specific purposes and catering to different domains of application. As one delves deeper into the world of programming languages, it becomes apparent that they should be perceived as an additional language to learn, akin to human languages. However, in this case, the communication is not with fellow humans but with machines. By embracing this perspective, programmers can approach the learning process with a mindset of mastering the intricacies of communicating with computers, thereby unleashing the potential for creating sophisticated and innovative software solutions. Programming languages can be broadly classified into two main categories: compiled languages and script languages, each with its own distinctive characteristics and purposes. Figure 3.3 provides a visual representation of this classification. Compiled languages, as the name suggests, require a dedicated compiler to translate the human-readable source code into machine code that can be executed by the computer. This compilation process involves converting the entire program into an executable format,
Fig. 3.3 The difference between scripting and compiling languages
resulting in a standalone binary file. The use of compilers introduces an additional step in the workflow, requiring programmers to compile the code before it can be run. This compilation step serves to optimize the program’s performance and ensure compatibility with the target system’s architecture. On the other hand, script languages operate in a more direct and immediate manner. They enable programmers to write and execute code line by line without the need for a separate compilation step. Script languages are interpreted on-the-fly, with each line being executed as it is encountered. This inherent flexibility allows for rapid prototyping and dynamic programming, making script languages particularly well-suited for scenarios that demand quick iterations and interactive development. The comparison of compiled languages and scripted languages is a topic that warrants further exploration. While compiled languages offer faster execution times than script languages, they typically require more complex syntax to achieve optimal performance. In contrast, script languages tend to offer a more straightforward and accessible coding experience. For instance, consider the differences between C and Python. Python code is generally easy to follow, even for individuals who may not have previous experience with the language. This accessibility makes it a popular choice for tasks such as data science, where rapid prototyping and experimentation are critical components. In contrast, C code can be more challenging to comprehend without prior knowledge of the language’s syntax and structure. As such, selecting the appropriate language for a given task is essential to optimize code quality and performance. Ultimately, both compiled languages and script languages have their respective strengths and weaknesses. While compiled languages offer superior performance, their complexity may be a barrier to entry for some programmers. In contrast, script languages such as Python may be more accessible but may not always be the optimal choice for high-performance computing tasks. Selecting the appropriate language for a given programming task requires careful consideration of factors such as performance requirements, coding complexity, and accessibility. By taking these factors into account, programmers can optimize their coding workflows and achieve their desired outcomes. Due to their simplicity and easy execution, script languages are often preferred for data science.

Python
print('Hello World')
C
#include <stdio.h>

int main() {
    printf("Hi Informatics\n");
    return 0;
}
In today’s software landscape, the majority of available applications no longer require users to engage in the intricacies of coding. This paradigm shift
can be attributed to the widespread adoption of graphical user interfaces (GUIs). By overlaying a visual interface on top of the programming language, GUIs enhance usability and accessibility. GUIs manifest as graphical windows that empower users to interact with software in a more intuitive and visually oriented manner. Rather than grappling with lines of code, users can now rely on graphical elements to perform various tasks. To illustrate this point, consider the execution of a simple “hello world” program. Instead of typing out the code and running it in a traditional command-line interface, a GUI solution can be employed. One approach entails creating a button within the GUI, which triggers the execution of the code snippet “print('hello world')” upon selection. This button, acting as a visual representation of the desired action, simplifies the process for users who may not possess coding proficiency. By leveraging GUIs, individuals are empowered to interact with software more effortlessly and efficiently. The advent of GUIs has significantly contributed to bridging the gap between users and complex programming tasks. The ability to perform actions through visual cues and interactions has revolutionized software usability, promoting widespread adoption and engagement. Consequently, GUIs have become a cornerstone of modern software design, enabling users to harness the power of technology without the need for extensive coding knowledge. In Python, graphic user interfaces can be easily built using packages like the Tk GUI toolkit, accessed through the standard tkinter module. With Tk, developers can effortlessly integrate visually appealing and interactive elements into their Python applications. The toolkit provides a comprehensive set of tools, allowing for the creation of intuitive interfaces with customizable features. It also supports event-driven programming and offers cross-platform compatibility, ensuring widespread accessibility; a minimal sketch of the “hello world” button described above is shown at the end of this section. The graphic user interface (GUI) that we interact with is merely the visible surface of a multi-layered system that ultimately translates into binary code, which is then processed by computer hardware. It is important to acknowledge that this system’s structure and functionality might undergo significant transformation in the future, driven by the remarkable advancements in machine learning. Machine learning and artificial intelligence (AI) have the potential to revolutionize the way computers communicate and operate. In this envisioned future, machine learning algorithms could directly interface with computer hardware, eliminating the need for human-generated code. Through automated algorithm generation, machine learning systems could dynamically adapt and optimize their performance based on real-time data. Such a shift would disrupt the conventional programming landscape, fundamentally altering the dynamics of software development. The traditional approach of writing code line by line could potentially be supplanted by a more automated and intelligent process. As a result, programmers may evolve from manually instructing computers to collaborating with AI systems that possess the ability to generate algorithms autonomously. By enabling direct communication between machine learning algorithms and computer hardware, the potential for efficient and optimized processing becomes significantly
enhanced. This paradigm shift could unlock new possibilities in areas such as performance optimization, system responsiveness, and adaptability.
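Returning to the “hello world” button described above, a minimal sketch using Python’s standard tkinter interface to the Tk toolkit might look like the following; the window title, button label, and padding values are arbitrary choices for illustration.

import tkinter as tk  # standard Python interface to the Tk GUI toolkit

root = tk.Tk()
root.title("Hello GUI")

# Pressing the button runs print('hello world'), just as typing the command
# would, but through a graphical element instead of code.
button = tk.Button(root, text="Say hello",
                   command=lambda: print("hello world"))
button.pack(padx=40, pady=20)

root.mainloop()  # hand control to the GUI event loop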
3.3
Programming Languages
Programming encompasses a diverse array of languages, each with its own unique characteristics and applications. Selecting the most appropriate programming language necessitates a thorough understanding of the specific objectives one seeks to accomplish. For instance, if the goal is to develop iOS applications, the Swift programming language emerges as a natural choice due to its native integration with Apple’s operating systems and frameworks. On the other hand, for web-based application development, JavaScript stands out as a highly suitable option, given its prominence in enabling interactive and dynamic functionalities on the client side. It is essential to recognize that programming languages are akin to tools in a developer’s toolkit, each serving a distinct purpose and possessing its own set of strengths. Consequently, the choice of programming language hinges on aligning the language’s features and capabilities with the intended purpose of the project at hand. Within the realm of informatics and data science, the Python programming language has gained substantial recognition and popularity. Renowned for its simplicity, versatility, and extensive library ecosystem, Python has emerged as a frontrunner, offering a wide range of tools and frameworks tailored specifically for data analysis, machine learning, and scientific computing. Its expressive syntax and comprehensive documentation have made Python a preferred choice among professionals in these fields. It is worth noting, however, that the selection of a programming language should also consider factors such as community support, available resources, scalability, performance requirements, and the existing technology stack. By assessing these considerations in tandem with the intended purpose, developers can make informed decisions regarding the most appropriate programming language for their specific use case. In this book, Python is used as the main programming language. You may be wondering why Python is used for data science work. The answer lies in Python’s extensive suite of libraries and modules specifically tailored for data science and informatics, making it a versatile and comprehensive toolset. One indispensable library in the Python ecosystem for data science is pandas. Renowned for its efficient data manipulation and organization capabilities, pandas empowers data scientists to effortlessly handle and manipulate datasets of varying sizes and complexities. By providing a rich set of functions and intuitive data structures, pandas streamlines the data preprocessing and exploration phases of a data science workflow. In addition to pandas, Python boasts a diverse range of libraries dedicated to specific data science tasks. One such library is scikit-learn, a robust toolkit that facilitates machine learning tasks such as classification, regression, and clustering. With scikit-learn, data
scientists can access a wide range of algorithms, tools, and utilities, simplifying the implementation and evaluation of machine learning models. To effectively communicate insights and trends hidden within datasets, Python offers libraries like matplotlib and seaborn for data visualization. These powerful visualization tools equip data scientists with the means to generate insightful charts, graphs, and plots, enabling clear and compelling data-driven storytelling. Python’s data science capabilities extend beyond manipulation and visualization. Libraries such as NumPy and SciPy provide a rich collection of mathematical and scientific functions. NumPy, known for its efficient numerical computations, offers multidimensional arrays and a wide array of mathematical operations. Complementing NumPy, SciPy provides a comprehensive library of scientific and statistical functions, enabling advanced data analysis and modeling. The adoption and popularity of Python in the field of data science have witnessed a remarkable surge since 2005. Python’s widespread appeal can be attributed to its simplicity, readability, and an ever-growing ecosystem of powerful and specialized modules. Notably, Python has found applications across diverse industries, research institutions, and universities, solidifying its status as a go-to language for data-centric endeavors. Python’s founder, Guido van Rossum, once made the following statement:

Over six years ago, in December 1989, I was looking for a ‘hobby’ programming project that would keep me occupied during the week around Christmas. My office . . . would be closed, but I had a home computer, and not much else on my hands. I decided to write an interpreter for the new scripting language I had been thinking about lately: a descendant of ABC that would appeal to Unix/C hackers. I chose Python as a working title for the project, being in a slightly irreverent mood (and a big fan of Monty Python’s Flying Circus). [Foreword for ‘Programming Python’ (1st ed.) Guido van Rossum, Reston, VA, May 1996]
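As a small taste of the library ecosystem described above, the hedged sketch below builds a toy dataset with pandas, fits a linear model with scikit-learn, and plots the result with matplotlib. The numbers are invented purely for illustration, and the "temperature versus yield" framing is only a hypothetical example, not data from this book.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Invented example data: reaction temperature versus product yield.
df = pd.DataFrame({"temperature": [300, 350, 400, 450, 500],
                   "yield": [10.2, 14.8, 21.1, 24.9, 30.3]})

model = LinearRegression()
model.fit(df[["temperature"]], df["yield"])
print("predicted yield at 425:",
      model.predict(pd.DataFrame({"temperature": [425]}))[0])

plt.scatter(df["temperature"], df["yield"], label="data")
plt.plot(df["temperature"], model.predict(df[["temperature"]]), label="fit")
plt.xlabel("temperature")
plt.ylabel("yield")
plt.legend()
plt.show()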
In keeping with van Rossum’s philosophy, Python code aims to be easy to read while remaining powerful and efficient. In the upcoming chapter, an in-depth exploration of the Python programming language will delve into intricate details and nuances, shedding light on its comprehensive features and functionalities. In programming, the text editor is an indispensable tool for code creation and development. Serving as the medium through which programmers craft their code, text editors are instrumental in facilitating efficient and effective coding practices. While some text editors offer minimalistic functionality akin to the simplicity of the Notepad app in the Windows operating system, others present an array of advanced features and capabilities designed to streamline and enhance the coding experience. These feature-rich text editors often provide functionalities such as syntax highlighting, code completion, and error detection, empowering programmers with invaluable assistance and facilitating more accurate and productive code writing. The market abounds with a wide array of text editors, each tailored to meet the diverse needs and preferences of programmers. Choosing the most suitable text editor ultimately boils down to personal preference and the specific requirements of the project at hand.
Table 3.1 List of text editors
Editors
Sublime Text
PyCharm
Eclipse
Atom
Brackets
Xcode
Notepad++
Visual Studio Code
NetBeans
IntelliJ
Nano
Emacs
Vim
RubyMine
Table 3.1 provides an overview of some commonly utilized text editors, shedding light on their popularity and distinguishing features. A selection of frequently employed text editors includes notable options such as Sublime Text, Visual Studio Code, Atom, and Emacs, among many others. These text editors offer a blend of user-friendly interfaces, extensibility, and a rich set of plugins and packages that enhance productivity and cater to various programming languages and workflows. When selecting a text editor, programmers must consider factors such as customization options, performance, compatibility with specific programming languages, available plugins, and the level of community support. By carefully evaluating these aspects, individuals can identify a text editor that fits their personal preferences and optimizes their overall development process. When embarking on your coding journey, the task of selecting an appropriate text editor can indeed be perplexing, given the myriad options available. To facilitate your decisionmaking process, we have compiled a curated selection of text editor suggestions tailored to meet diverse needs and project requirements. For extensive and ambitious projects, several robust text editors stand out for their remarkable capabilities. Atom, renowned for its versatility, offers an array of plugins and file management tools that streamline project organization and management. Similarly, Xcode, Visual Studio Code, and PyCharm prove highly potent with their comprehensive feature sets, empowering developers with powerful debugging, version control, and collaboration capabilities. In situations where a lightweight and swift text editor is desired, Sublime Text emerges as a formidable choice. Renowned for its impressive performance and efficiency, Sublime Text provides a seamless editing experience without compromising on essential features and functionalities.
The aforementioned text editors exemplify the diverse landscape of available options, each excelling in specific areas while catering to different programming preferences and project complexities. While Atom, Xcode, Visual Studio Code, and PyCharm offer extensive plugin ecosystems and advanced project management tools for larger-scale endeavors, Sublime Text shines as a nimble yet powerful option for those seeking a lightweight coding experience. Furthermore, it is worth noting that the selection of a text editor is not a one-size-fits-all decision, but rather a personalized choice that hinges on factors such as coding proficiency, project scope, desired features, and compatibility with specific programming languages. Therefore, it is beneficial to explore these suggested text editors and experiment with their capabilities to discover the most suitable fit for your individual needs. When the need arises to directly edit code within a server environment, programmers commonly rely on three prominent text editors: Emacs, vi, and nano. These editors provide essential functionalities and command-line interfaces that facilitate efficient code editing and manipulation directly on the server. In the context of data science projects involving Python, Jupyter Notebook emerges as a highly favored option, offering an immersive and interactive environment where code execution, input, and output can be seamlessly saved and tracked within a single unified space. Its notebook-style interface empowers data scientists to combine code, visualizations, and textual explanations, fostering a cohesive and reproducible analytical workflow. In their own coding practices, the authors of this chapter adopt a pragmatic approach, leveraging a range of text editors based on specific use cases. For writing test code, Sublime Text proves to be their tool of choice, providing a versatile and intuitive editing experience. When engaging in primary data science and informatics analysis, the authors rely on Jupyter Notebook for its ability to integrate code execution with comprehensive documentation and visualizations. For code editing within server environments, Emacs offers a robust and customizable platform that caters to their needs. Lastly, for the development of substantial projects, PyCharm shines as an indispensable companion, offering powerful features tailored to handle complex codebases with ease. It is worth emphasizing that the selection of a text editor is highly subjective, and there is no one-size-fits-all solution. It is crucial for every programmer to explore different text editors and evaluate them based on individual needs, preferences, and project requirements. By experimenting with various options and considering factors such as functionality, extensibility, ease of use, and compatibility with specific programming languages, developers can discover the text editors that best align with their unique workflows. Now we have a more comprehensive understanding of what programming is and its capabilities. It involves a systematic approach that begins with identifying the problem at hand, followed by the design and development of an appropriate solution through the utilization of algorithms. Programming languages serve as the means to translate these algorithms and methods into executable code.
While this process may appear straightforward, it is of paramount importance to exercise great caution and prudence. Numerous factors can significantly influence the quality and ultimate success of the code. This prompts us to pause and reflect upon critical questions: Who is responsible for designing the algorithm? Who is tasked with writing the program? The process of algorithm design demands precision and expertise. It necessitates a deep understanding of the problem domain, coupled with a robust analytical mindset that enables the formulation of effective and efficient solutions. Skilled algorithm designers possess a wealth of knowledge and experience, enabling them to devise strategies that maximize performance, minimize computational complexity, and address potential pitfalls. Equally important is the role of the programmer in transforming the algorithmic design into actual code. Proficiency in programming languages and adherence to best coding practices are indispensable qualities. A competent programmer ensures that the implemented solution faithfully captures the intended algorithm, while also considering factors such as code readability, maintainability, and scalability. Collaboration and effective communication between algorithm designers and programmers are pivotal to achieving optimal results. A clear understanding of the problem domain, requirements, and constraints fosters a harmonious synergy between these roles, enhancing the likelihood of delivering successful and robust software solutions. Moreover, the expertise and experience of those involved in the algorithm design and programming processes directly impact the code’s quality and reliability. A skilled and knowledgeable team that possesses a deep understanding of programming paradigms, data structures, and algorithms is better equipped to navigate potential challenges and deliver superior outcomes. Asides from one’s skills in programming, there are many other factors that impact the quality and success of the designed program or code. Imagine you find yourself in the task of creating a straightforward program that calculates the average test scores of a classroom. The objective is to develop a program that simplifies the process by summing up the test scores of all students and dividing the sum by the total number of students or test takers. However, before proceeding to write the code for this program, it is crucial to consider a multitude of pertinent questions and factors. One significant aspect to address is the treatment of absent students. How should their absence be accounted for in the calculation? Including absent students would imply assigning them a test score of 0, which would inevitably impact the overall average score. This scenario would yield a slightly lower average compared to a scenario where only the scores of students who participated in the test were considered. Deciding on the appropriate approach requires careful consideration and deliberation. Determining the rules governing the program poses another important consideration. Is the decision-making process solely within your purview, or are there project leaders or stakeholders involved? If you are the sole contributor to this project, you hold the responsibility of making a well-informed decision regarding the inclusion of absent
students in the calculation. Depending on your motivations, values, and objectives, your choice will significantly impact the resulting average score of the classroom. It becomes readily apparent that the rules governing algorithms are inherently established by human beings. This holds immense significance within programming, as the accuracy and appropriateness of the outcomes generated by a codebase rest heavily upon the rules set forth by the people who design them. The consequences of formulating inappropriate rules can be far-reaching and profound. An erroneous algorithm can, for instance, yield results that are incongruous, misleading, or even detrimental, particularly when dealing with complex projects that involve sensitive matters. The magnitude of these implications amplifies exponentially as the project expands in scope and complexity. Consequently, the evaluation and scrutiny of algorithms assume a pivotal role in ensuring their integrity and reliability. It is imperative to subject algorithms to meticulous assessment and testing, aimed at detecting and rectifying any potential shortcomings or unintended consequences. In this regard, collaborative efforts become paramount, especially as projects grow in scale and complexity. Engaging multiple programmers in the evaluation process can help identify and rectify underlying assumptions, biases, or unforeseen factors that may inadvertently find their way into the code. By fostering a collaborative environment and promoting diversity of perspectives, programming teams can enhance the robustness and validity of their algorithms. By leveraging the collective knowledge, expertise, and critical thinking abilities of multiple programmers, the risks associated with erroneous assumptions or flawed rules can be mitigated effectively. This collective approach serves as a safeguard against inadvertent errors, ensuring that algorithms yield accurate, reliable, and ethically sound outcomes.
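To make the class-average discussion concrete, the following is a minimal, hypothetical Python sketch of such a program. The function name class_average, the use of None to mark an absent student, and the include_absent flag are illustrative choices made for this example, not a prescribed design; the point is simply that the rule chosen by the programmer visibly changes the result.

def class_average(scores, include_absent=False):
    """Average the test scores of a class; absent students are recorded as None.

    If include_absent is True, an absence counts as a score of 0 and is included
    in the divisor; otherwise absent students are dropped from the calculation.
    """
    if include_absent:
        graded = [0 if s is None else s for s in scores]
    else:
        graded = [s for s in scores if s is not None]
    if not graded:
        return None  # nothing to average
    return sum(graded) / len(graded)


scores = [80, 92, None, 74]                        # None marks an absent student
print(class_average(scores))                       # 82.0  (absences excluded)
print(class_average(scores, include_absent=True))  # 61.5  (absences counted as 0)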
3.4 Open Source
The concept of open source presents itself as a compelling and effective approach to address the multifaceted challenges encountered in software development. In data science and informatics, the prevalence of Python codes adhering to the open-source philosophy is indeed noteworthy. What, exactly, is open source? Some might feel that it simply means that the software is free, but it is much more than that. At its core, open source endeavors to cultivate an environment where users can not only access and utilize the software but also comprehend and modify the underlying source code. This transparency and user agency lie at the heart of the open-source paradigm. Returning to the example of determining the class average score, a pertinent consideration arises regarding the inclusion or exclusion of absent students in the calculation. Within the educational landscape, divergent perspectives exist among teachers, some advocating for the incorporation of absent students’ scores, while others favor their exclusion. In such a scenario, an ideal solution would entail endowing the user with the autonomy to exercise this choice, wherein the program offers a flexible option to selectively include or exclude
absent students when computing the average score. This exemplifies the fundamental ethos underpinning the philosophy of open source. An additional question that often arises among teachers pertains to the accuracy of the code employed to calculate the average test scores. In an ideal scenario, all teachers, functioning as users of the program, aspire to have access to the source code. This accessibility enables them to scrutinize the underlying code, thereby facilitating a comprehensive evaluation of its correctness and reliability. Moreover, the availability of the source code empowers fellow educators to acquire insights into the intricacies of code composition and execution. This mutual exchange of knowledge not only fosters continuous learning but also engenders a collaborative environment wherein teachers can collectively contribute to the improvement and rectification of any identified issues. Such an open approach to program development, where source code visibility and contribution are encouraged, invariably culminates in the creation of a meticulously accurate and refined software solution. The importance of open-source code becomes apparent when considering the limitations imposed by closed-source code. In a closed-source environment, users are unable to gain a comprehensive understanding of how the code functions and what processes it actually executes. Their interaction is restricted to interpreting the binary output, making it exceedingly challenging to ascertain the accuracy and reliability of the code. The ramifications of closed code extend beyond mere comprehension, as it hampers the ability of users to identify and rectify issues or bugs that arise due to inherent flaws in the source code. Moreover, closed code inhibits collaborative efforts and stifles potential contributions, thereby limiting the growth and progress of a project. As a solution to these challenges, the alternative approach of embracing open-source code emerges. By adopting an open-source philosophy, the code is made accessible and transparent to all, fostering a collaborative ecosystem wherein users can scrutinize, improve, and contribute to the development of a more robust and refined software solution. The fundamental tenet underlying the open-source philosophy revolves around ensuring that users possess the freedom to access and inspect the source code, thereby empowering them to make necessary edits and modifications. Furthermore, open-source principles advocate for the freedom to redistribute the source code, irrespective of whether users possess the capability to edit it or not. By adhering to this approach, software development can reach new heights of sophistication and user-friendliness. In stark contrast, the ominous implications of closed-source code manifest when a program is solely controlled by a single individual. Any sudden alteration to the program’s algorithm has the potential to cascade into a complete transformation of the program’s output. This lack of user control renders them susceptible to encountering unforeseen issues stemming from such modifications, leaving them at the mercy of changes they are unable to influence or mitigate. The open-source movement, initiated in 1983 through the visionary efforts of Richard Stallman under the GNU Project, represents a resolute endeavor to champion the unencumbered utilization of computing devices by end users. A notable milestone within
the expansive GNU Project was the development of Emacs, a venerable text editor that continues to enjoy widespread acclaim and adoption. Over the course of its existence, the GNU Project has garnered substantial backing and enthusiastic support from the vibrant computing community. However, it was the advent of the Linux operating system in the early 1990s that indubitably reshaped the landscape of the open-source movement. Notably, Linux has emerged as an indispensable recommendation for practitioners in the domains of data science and informatics, leveraging its prowess to effect transformative changes not only within these specific disciplines but also within the broader expanse of the computing ecosystem. Its far-reaching impact endures as a testament to the extraordinary potential inherent in the open-source paradigm. The Linux operating system, originally developed by Linus Torvalds in 1991, stands as a testament to the dynamic and transformative nature of the technology landscape. While the names of Bill Gates and Steve Jobs have become synonymous with Microsoft Windows and Apple Mac OS, respectively, Linux has emerged as a formidable third player in the operating system market, heralding a new era of open collaboration and innovation. The origins of Linux diverge significantly from the development paths of Windows and Mac OS. Torvalds, who had begun studying computer science at the University of Helsinki in 1988 and would later make Linux the subject of his master’s thesis, first unveiled the rudimentary concepts of the Linux OS in 1991, envisioning a future where this creation could rival its counterparts. Its earliest releases were distributed under a license of Torvalds’ own devising that prohibited commercial redistribution. However, the remarkable power of collaboration began to exert its influence. A diverse array of programmers from around the globe offered their feedback, ideas, and suggestions to Torvalds, paving the way for a paradigm shift in the trajectory of Linux. Realizing the tremendous potential that a global community of contributors could bring, Torvalds made the momentous decision in 1992 to relicense Linux under the GNU General Public License, inviting a multitude of developers to join forces in shaping the future of Linux. This pivotal transition catapulted the development of the Linux OS into an unprecedented realm of rapid advancement, thanks to the collective genius of a vast army of dedicated developers. Interestingly, the Linux systems we encounter today bear little resemblance to Torvalds’ original code. Much like the concept of evolutionary theory, Linux has undergone a remarkable metamorphosis, adapting and incorporating the needs and requests of its users, solidifying its position as a highly customizable and user-centric operating system. A distinguishing feature of Linux’s open-source nature lies in the accessibility and transparency of its source code, which serves as a guiding manual for developers and enthusiasts alike. This inherent openness facilitates a thriving ecosystem where software and hardware can be seamlessly tailored to fit the Linux environment, fostering a culture of innovation and growth. The collaborative power of Linux, encapsulated within its open-source framework, paves the way for boundless possibilities in the realm of software development and technological advancement. There are 10 definitions that help clarify what open source truly means, which are defined by the Open-Source Initiative as the following (https://opensource.org/osd):
1. Free Redistribution The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale. 2. Source Code The program must include source code and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed. 3. Derived Works The license must allow modifications and derived works and must allow them to be distributed under the same terms as the license of the original software. 4. Integrity of the Author’s Source Code The license may restrict source code from being distributed in modified form only if the license allows the distribution of “patch files” with the source code for the purpose of modifying the program at build time. The license must explicitly permit distribution of software built from modified source code. The license may require derived works to carry a different name or version number from the original software. 5. No Discrimination Against Persons or Groups The license must not discriminate against any person or a group of persons. 6. No Discrimination Against Fields of Endeavor The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research. 7. Distribution of License The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties. 8. License Must Not Be Specific to a Product The rights attached to the program must not depend on the program’s being part of a particular software distribution. If the program is extracted from that distribution and used or distributed within the terms of the program’s license, all parties to whom the program is redistributed should have the same rights as those that are granted in conjunction with the original software distribution. 9. License Must Not Restrict Other Software The license must not place restrictions on other software that is distributed along with the licensed software. For example, the license must not insist that all other programs distributed on the same medium must be open-source software. 10. License Must Be Technology Neutral No provision of the license may be predicated on any individual technology or style of interface.
The open-source philosophy has had a significant impact on the availability of data science and informatics tools, as many of these tools are made accessible under open-source licenses. These licenses, as defined by the Open-Source Initiative, embody the principles of allowing software to be freely used, modified, and shared. They serve as legal frameworks accompanying the code, outlining the permissions granted by the individual or organization sharing the code. These licenses are widely prevalent in the programming community, each offering a unique combination of flexibility and limitations, dictating the extent to which the code can be edited, shared, or even sold. Among the commonly encountered open-source licenses, the MIT license stands out for its permissive nature, granting users broad rights to use and modify the code. The Creative Commons license provides a spectrum of licenses enabling creators to specify the conditions under which their work can be used, while the Apache license emphasizes collaboration and grants users explicit patent rights. Understanding the nuances and implications of these licenses is crucial for developers and users alike to navigate the open-source landscape effectively and responsibly. Here are some further details regarding these licenses.

GNU General Public License V3.0
Copyright (C) This program is free software: You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
MIT licence https://opensource.org/licenses/MIT Copyright Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Apache Licence 2.0 https://www.apache.org/licenses/LICENSE-2.0 TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. “License” shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. “Licensor” shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. “Legal Entity” shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, “control” means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. “You” (or “Your”) shall mean an individual or Legal Entity exercising permissions granted by this License. “Source” form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. “Object” form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. “Work” shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). “Derivative Works” shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. “Contribution” shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, “submitted” means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as “Not a Contribution.” “Contributor” shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution.
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: You must give any other recipients of the Work or Derivative Works a copy of this License; and You must cause any modified files to carry prominent notices stating that You changed the files; and You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and If the Work includes a “NOTICE” text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability.
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole
responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS
Creative Commons License https://creativecommons.org/licenses/ There are six types of Creative Commons licenses that can be used and are distinguished according to acronym.

Attribution CC BY
This license lets others distribute, remix, adapt, and build upon your work, even commercially, as long as they credit you for the original creation. This is the most accommodating of licenses offered. Recommended for maximum dissemination and use of licensed materials.

Attribution-ShareAlike CC BY-SA
This license lets others remix, adapt, and build upon your work even for commercial purposes, as long as they credit you and license their new creations under the identical terms. This license is often compared to “copyleft” free and open-source software licenses. All new works based on yours will carry the same license, so any derivatives will also allow commercial use. This is the license used by Wikipedia and is recommended for materials that would benefit from incorporating content from Wikipedia and similarly licensed projects.

Attribution-NoDerivs CC BY-ND
This license lets others reuse the work for any purpose, including commercially; however, it cannot be shared with others in adapted form, and credit must be provided to you.

Attribution-NonCommercial CC BY-NC
This license lets others remix, adapt, and build upon your work non-commercially, and although their new works must also acknowledge you and be non-commercial, they don’t have to license their derivative works on the same terms.

Attribution-NonCommercial-ShareAlike CC BY-NC-SA
This license lets others remix, adapt, and build upon your work non-commercially, as long as they credit you and license their new creations under the identical terms.

Attribution-NonCommercial-NoDerivs CC BY-NC-ND
This license is the most restrictive of our six main licenses, only allowing others to download your works and share them with others as long as they credit you, but they can’t change them in any way or use them commercially.
When it comes to selecting the appropriate license for your code and project, it is important to consider the different options available. One commonly encountered license is the MIT license, known for its simplicity and permissive nature. This license allows users to freely use, modify, and distribute the code, making it a popular choice for open-source projects. On the other hand, the Apache license offers a unique feature by allowing claims to be made on patents that utilize the code it is attached to. This provides an added layer of protection for both the code and any associated patents, making it a suitable option for projects that involve patented technology. Additionally, the Creative Commons License offers the flexibility to customize the license type based on specific requirements. This allows developers to choose from a range of restrictions and permissions, providing greater control over the commercialization of the code. When considering the appropriate license, it is crucial to carefully evaluate the goals and objectives of the project. Factors such as the desired level of openness, the need for
patent protection, and the intention for commercialization should all be taken into account. By selecting the right license, developers can ensure proper legal protection while fostering collaboration and sharing within the programming community. In the realm of data science and informatics, a vast array of tools are generously provided under the umbrella of open-source licensing. These tools not only empower individuals and organizations in their data-driven endeavors but also foster a collaborative and innovative environment. While Linux distributions offer a multitude of software repositories, serving as a valuable source of readily accessible tools, the digital landscape extends beyond these boundaries. Platforms such as GitHub, renowned as a hub for code sharing and collaboration, emerge as pivotal players in the open-source ecosystem. By hosting repositories brimming with code from diverse projects, GitHub acts as a thriving community where developers can freely contribute, explore, and refine their skills. When venturing into the realm of open-source projects, it is often recommended to utilize this communal platform, provided that the project imposes no restrictions on code sharing. Embracing this ethos of openness and collaboration not only facilitates knowledge exchange but also fuels the continuous growth and evolution of data science and informatics.
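As a small illustration of how open-source licensing surfaces in day-to-day data science work, the sketch below queries the license metadata that installed Python packages declare. It assumes Python 3.8 or later; the package names numpy, pandas, and scikit-learn are used purely as examples, and some packages may report their license only in classifiers rather than in this field.

from importlib.metadata import metadata, PackageNotFoundError

# Print the license declared in each installed package's metadata.
for name in ["numpy", "pandas", "scikit-learn"]:
    try:
        info = metadata(name)
        print(name, "->", info.get("License", "not declared"))
    except PackageNotFoundError:
        print(name, "-> not installed")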
3.5 Conclusion
In this chapter, we embark on a comprehensive exploration of this fundamental discipline, which empowers individuals to unlock the full potential of technology and navigate the intricacies of the digital landscape. It is crucial to recognize the indispensable nature of programming, as it serves as a versatile tool capable of tackling a vast array of tasks and objectives. In an era characterized by ever-increasing complexity and technological advancements, the ability to program effectively becomes paramount. It transcends being a mere technical skill and assumes the role of a strategic approach to problem-solving, empowering users to bring their ideas to life in the digital realm. To harness the true power of programming, it is essential to start by clearly defining the desired outcomes and gaining a deep understanding of the intricacies of the problem at hand. This foundational step allows users to identify the most optimal approaches and methods to accomplish their objectives. Programming acts as the critical conduit through which these meticulously crafted plans are realized, facilitating the transformation of abstract concepts into tangible solutions. Moreover, programming should be viewed as a dynamic and creative process, rather than a rigid set of instructions. It provides users with the flexibility to continuously refine and enhance their solutions, adapting to evolving needs and emerging insights. With each iteration, programmers have the opportunity to fine-tune their code, striving for improved performance, efficiency, and functionality. This iterative nature of programming fosters a mindset of continuous improvement, where innovation and progress become intertwined.
Furthermore, programming empowers individuals to embrace the power of abstraction, enabling them to tackle complex problems by breaking them down into smaller, more manageable components. By employing modular and scalable design principles, programmers can create robust and flexible systems that can adapt to changing requirements. This approach not only enhances the maintainability and scalability of software projects but also promotes collaboration and code reusability. Additionally, programming serves as a gateway to automation, enabling the automation of repetitive tasks and the streamlining of workflows. By leveraging the power of algorithms and computational logic, programmers can devise elegant and efficient solutions that save time and effort. Through automation, mundane and time-consuming processes can be transformed into seamless and efficient operations, allowing users to focus on higher-level tasks and strategic decision-making. In summary, programming is a multidimensional discipline that goes beyond mere technical skills. It empowers individuals to think critically, solve complex problems, and transform abstract ideas into tangible solutions. By approaching programming as an iterative and creative process, users can continuously refine their code, adapt to changing needs, and drive innovation. With the ability to abstract and automate tasks, programming serves as a catalyst for efficiency and productivity in various domains.
Questions

3.1 Why is programming considered an indispensable skill in today’s complex and technologically advanced era?
3.2 How does programming facilitate the transformation of abstract concepts into tangible solutions?
3.3 What is the significance of the iterative nature of programming?
3.4 How does programming promote collaboration and code reusability?
3.5 How does programming enable automation and streamline workflows?
3.6 What are the three key steps involved in programming?
3.7 What is the role of planning in the programming process?
3.8 What happens during the writing phase of programming?
3.9 What is the purpose of running a program in programming?
3.10 How does programming relate to fields like data science and computational science?
3.11 What is the purpose of the “Hello World” program in programming?
3.12 How does binary code relate to computer communication?
3.13 How are binary digits used in visual communication?
3.14 What is the role of assembly languages in programming?
3.15 Why is portability important in programming languages?
3.16 What role does binary code play in the relationship between software and hardware?
3.17 Why are assembly languages challenging for users?
3.18 What are high-level programming languages designed to achieve?
3.19 How do high-level programming languages simplify the task of printing “Hello World”?
3.20 What is the difference between compiled languages and script languages?
3.21 What has contributed to the shift away from coding in today’s software landscape?
3.22 How do GUIs simplify the process of executing a “hello world” program?
3.23 What role do GUIs play in bridging the gap between users and complex programming tasks?
3.24 Which toolkit can be used in Python to easily build graphic user interfaces?
3.25 How might machine learning and AI impact the future of software development?
3.26 What are some robust text editors suitable for extensive and ambitious projects?
3.27 Which text editor is recommended for those seeking a lightweight coding experience?
3.28 Why is it important to evaluate different text editors before making a selection?
3.29 Which text editors are commonly used for code editing within server environments?
3.30 What makes Jupyter Notebook a favored option for data science projects involving Python?
3.31 What role does algorithm design play in programming?
3.32 What qualities are essential for a competent programmer?
3.33 Why is collaboration and effective communication between algorithm designers and programmers important?
3.34 What is the concept of open source?
3.35 Is open source just about free software?
3.36 How does open source enable user autonomy?
3.37 Why is it important for teachers to have access to the source code?
3.38 What are the limitations of closed-source code?
3.39 What is the fundamental tenet of the open-source philosophy?
3.40 How does closed-source code lack user control?
3.41 How did the Linux operating system shape the open-source movement?
3.42 What are some commonly encountered licenses for open-source projects?
3.43 What is the unique feature of the Apache license?
3.44 How does GitHub contribute to the open-source ecosystem?
4 Programming and Python
Abstract
Why is programming a necessary component of data science? Why can’t one, for example, simply carry out data science work using existing tools such as Microsoft Excel? In this chapter, we explore the basics behind programming with a focus on the Python programming language.

Keywords
Python · Programming basics · Python basics · Code · Python modules · Scikit-learn · Pandas
• Learn the basics of Python programming
• Learn how to structure code using if-statements and while-loops
• Explore commonly used Python modules
4.1 Introduction
Python is a high-level, versatile programming language that is known for its simplicity and readability and has gained immense popularity in the data science community. Python’s clean and expressive syntax makes it a great choice for both beginners and experienced programmers, as it allows for easy code writing and debugging. Moreover, Python has a vast ecosystem of libraries and frameworks, including NumPy, Pandas, Matplotlib, and scikit-learn, which are essential for data manipulation, analysis, and visualization. This
extensive library support, along with its strong community, has made Python a preferred language for data scientists. Python plays a pivotal role in data science by providing the tools and libraries necessary for data collection, analysis, and modeling. Data scientists use Python to import, clean, and preprocess data with Pandas, perform complex numerical operations with NumPy, and create data visualizations using Matplotlib or Seaborn. Python is also used for machine learning and artificial intelligence, thanks to the availability of libraries like scikit-learn and TensorFlow. Its flexibility and integration capabilities make Python suitable for developing data-driven applications, from recommendation systems to natural language processing. In this sense, Python can be viewed as a lingua franca for data science, empowering professionals to extract insights and drive informed decision-making from vast and complex datasets.
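As a concrete, if deliberately simple, illustration of the workflow sketched above, the following snippet strings together pandas, NumPy, Matplotlib, and scikit-learn on a tiny made-up dataset. The column names, the values, and the choice of a linear model are invented purely for demonstration and are not taken from any real catalyst study.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# A tiny, made-up dataset: reaction temperature versus observed yield.
df = pd.DataFrame({"temperature": [300, 320, 340, 360, 380],
                   "yield": [12.1, 15.3, 18.2, 21.0, 24.4]})

X = df[["temperature"]].to_numpy()    # NumPy array of features
y = df["yield"].to_numpy()

model = LinearRegression().fit(X, y)  # fit a simple model with scikit-learn
print("slope:", model.coef_[0], "intercept:", model.intercept_)

df.plot(x="temperature", y="yield", kind="scatter")  # visualization via Matplotlib
plt.show()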
4.2 Basics of Python
Programming plays a crucial role in data science for several reasons, making it an essential component of the field. However, why is programming a necessary component of data science? Why can’t one, for example, simply carry out data science work using existing tools such as Microsoft Excel? While existing tools like Microsoft Excel have their advantages, they often fall short when it comes to handling large datasets and performing advanced data manipulation and analysis. There are several compelling reasons why the use of programming languages is necessary for data science work. One of the reasons programming is necessary for data science is the ability to work with massive datasets. Unlike Excel, which has limitations on file size and computational capabilities, programming languages offer the flexibility and power to handle vast amounts of data. For instance, datasets in fields dealing with catalysts and materials can be incredibly large, consisting of thousands of rows and columns or even reaching gigabytes in size. It is very difficult to handle datasets of this size in programs like Excel, which are likely to freeze or crash before they can even load the file. Manipulating and analyzing such extensive datasets necessitates the computational capabilities and efficiency provided by programming languages. Furthermore, advanced data visualization and machine learning techniques require complex data manipulation and mathematical algorithms. These techniques go beyond the capabilities of traditional spreadsheet software. Programming languages provide a wide range of libraries and frameworks that enable data scientists to implement sophisticated algorithms, perform statistical analysis, and build predictive models. With programming, data scientists can unleash the full potential of their data, uncovering hidden patterns and insights that would otherwise remain undiscovered.
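To illustrate the point about dataset size, here is a minimal sketch of how a file too large to open comfortably in a spreadsheet might be processed piece by piece with pandas. The file name huge_dataset.csv and the column conversion are placeholders invented for this example.

import pandas as pd

# Stream a large CSV in chunks instead of loading it all into memory at once.
total, count = 0.0, 0
for chunk in pd.read_csv("huge_dataset.csv", chunksize=100_000):
    total += chunk["conversion"].sum()
    count += len(chunk)

print("rows processed:", count)
print("mean conversion:", total / count)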
4.2 Basics of Python
79
Moreover, programming offers flexibility and customization options that are not available in prebuilt tools like Excel. Data scientists can tailor their code to specific project requirements, create reusable functions and modules, and automate repetitive tasks. This level of control and flexibility allows for efficient data processing, analysis, and experimentation, ultimately leading to more accurate and reliable results. In this book, we have chosen to use the Python programming language as the primary language for exploration and learning. Python holds a prominent position among the most widely implemented programming languages and enjoys widespread adoption among developers and data scientists alike. The design philosophy behind Python, envisioned by its creator Guido van Rossum, emphasizes readability, efficiency, and ease of writing. This philosophy has resonated strongly with developers and data scientists, making Python a preferred language for various applications, including data science. Its intuitive syntax and comprehensive standard library provide a solid foundation for implementing complex data manipulation and analysis tasks. To highlight the differences between programming languages, let us compare Python with C, another popular language used in data science. The following two code snippets demonstrate the contrasting approaches:

Example: C

#include <stdio.h>

int main() {
    int x = 7;
    if (x < 10) {
        printf("yes\n");
    } else {
        printf("no\n");
    }
    return 0;
}
Example: Python

x=7
if x < 10:
    print ('yes')
else:
    print ('no')
*****
Let us delve into the concept of creating a program that generates the output “yes” if a variable, x, is smaller than 10, and “no” if the variable is larger than 10. To illustrate this, let us examine the implementation in the C programming language. In C, the code may contain elements that appear daunting to newcomers, such as the declaration “int main().” This syntax signifies the main function in C, which is commonly used as an entry point for program execution. However, for individuals with limited programming experience or those who are unfamiliar with the syntax of C, these constructs can be intimidating and challenging to comprehend. Furthermore, the presence of bracket sets and punctuation within the C code adds an additional layer of complexity for those who are not well-versed in programming languages. These syntactic elements require a foundational understanding of programming concepts to decipher their meaning and purpose. Therefore, even attempting to read and understand code written in C necessitates a fundamental grasp of programming principles. Familiarity with concepts such as function declarations and language-specific syntax is essential for navigating and comprehending code effectively. Python, as a programming language, offers significant advantages in terms of readability and ease of understanding, even for individuals without prior knowledge of Python or programming in general. When examining the example code written in Python, it becomes apparent that the code structure is more intuitive and user-friendly. Take, for instance, the declaration of the variable x, which is assigned the value of 7. This straightforward syntax allows for a clearer representation of the code’s intentions. Compared to the code written in C, Python’s if-statements exhibit a greater simplicity, devoid of excessive brackets and punctuation that can potentially obscure the logic flow. This clean and concise style of coding in Python enhances the code’s comprehensibility, making it more approachable for beginners. The readability aspect of Python plays a crucial role in minimizing the learning curves associated with programming. Aspiring programmers can quickly grasp the fundamental concepts and syntax of Python, reducing the time spent on acquiring the necessary programming skills. By streamlining the learning process, Python enables individuals to focus more on leveraging their newfound programming abilities in practical applications, such as data science. The accessibility and ease of use offered by Python make it a preferred choice for data science practitioners. With Python’s gentle learning curve, researchers and analysts can swiftly transition into writing code for data science tasks, eliminating the potential hurdles encountered when tackling more complex programming languages like C. By leveraging Python’s user-friendly nature, professionals can allocate more time and energy to solving intricate data science challenges rather than grappling with the intricacies of programming syntax.
Table 4.1 Types in Python

Type         Description                              Example
string       text                                     a='Hello world'
integer      number without decimal point             a=3
float        number with decimal point                a=3.14
datetime     date                                     a=datetime.date.today()
bool         true or false                            a=True, b=False
list         list                                     a=['hello', 3, False]
tuple        similar to list, but it is immutable     a=('hello', 3, False)
dictionary   tag key and corresponding value          a={'hello':1,'world':2}
The simplicity and ease of use of Python have bestowed it with a significant advantage in the field of data science. It can be argued that Python’s emergence has fundamentally transformed the role of data scientists. In the past, engaging in data science necessitated rigorous training in programming. Even with such training, data scientists would invest a substantial amount of time in tasks like data preprocessing, data analysis, and machine learning applications. As depicted in the accompanying figure, there has been an exponential surge in demand for the Python language. Given these factors, this textbook places emphasis on Python as the primary tool for data science. However, it is worth noting that as time progresses, there is a possibility that a more convenient or beneficial alternative programming language may surface. While Python undoubtedly remains a valuable tool, it is essential to continually update one’s skill set to align with current and emerging technologies. Regardless, having experience with Python will undoubtedly provide a solid foundation for venturing into the realm of programming within the context of data science. By staying knowledgeable and adaptable, one can navigate the evolving landscape of data science with confidence and efficacy. In this chapter, the fundamentals of Python in regard to data science are explored. Note that all sample code snippets are written using Linux Mint 20.3 (which uses the Ubuntu 20.4 repository) which comes with the Python 3.8.2 environment. To start, we must first understand the concept of types in Python. A list of types with accompanying definitions and examples are found in Table 4.1. Table 4.1 provides an overview of the eight fundamental types in Python, which play a crucial role in the language’s versatility and functionality. These types can be likened to the building blocks of a language, encompassing nouns, verbs, adjectives, and adverbs in their own unique ways. Each type possesses distinct properties and serves specific purposes within Python programming. Let us delve into a brief exploration of these eight types, shedding light on their characteristics and applications.
1. String: This type represents textual data and is denoted by enclosing the text within quotation marks, such as ‘hello world.’ Strings are commonly used for displaying text in output files and have a wide range of manipulative operations available. 2. Integer: Integers are used to represent whole numbers without any decimal points. They are essential for performing arithmetic operations and numerical computations. 3. Float: Floats, on the other hand, represent numbers with decimal points. They enable precise calculations involving fractional or real values. 4. Datetime: The datetime type is employed for working with dates and times. It allows for various operations and manipulations related to time-based data, such as tracking events or measuring durations. 5. Bool: Booleans, represented by the bool type, have two possible values: True or False. They are particularly useful for implementing conditional statements and determining the truth or falsity of certain conditions. 6. List: Lists serve as containers that can hold elements of different types, including strings, integers, floats, and booleans. They are one of the most versatile and powerful types in Python, enabling various operations like appending, removing, and iterating over elements. 7. Tuple: Similar to lists, tuples also store elements of different types. However, tuples are immutable, meaning their order and content cannot be modified once created. They are represented using parentheses ( ) and are useful in situations where data integrity and immutability are desired. 8. Dictionary: Dictionaries facilitate the creation of key-value pairs, allowing for efficient data retrieval and manipulation. They are particularly useful when organizing and accessing data based on specific tags or keys. By familiarizing ourselves with these fundamental types, we gain a solid foundation in Python programming. It is important to note that while Python currently holds immense popularity and utility, the ever-evolving nature of technology may introduce alternative languages or frameworks in the future. Therefore, it is essential to continuously update and expand our skill set to adapt to emerging technologies. Nonetheless, the knowledge and experience gained through Python will undoubtedly provide a valuable head start in the dynamic world of data science and programming. Here, a longer sample code for Python types is presented. The type() command plays a crucial role in Python programming as it allows developers to determine the type of an object. By directing the type() command to an object, one can quickly ascertain its data type, which is particularly valuable when working with extensive datasets. Understanding the data type is essential for performing appropriate operations and ensuring data integrity. In Python, not all commands are readily available in the default library. To access additional functionalities, one must import specific libraries before you can use commands that belong to said libraries. This can be accomplished by utilizing the import command
followed by the library name. As seen in the code sample, “import datetime” imports the library that holds the information for the command datetime.date.today(), which acquires the date at the moment the command is called. Importing libraries expands the capabilities of Python and enables the use of specialized commands and functions tailored to specific tasks. However, it is essential to note that when working on a Linux system, certain libraries may not be pre-installed. In such cases, users are advised to install the required libraries through tools like the Synaptic Package Manager or other suitable methods. Installing the necessary libraries ensures smooth execution of code that relies on their functionalities, thereby avoiding potential errors or unexpected behavior. By leveraging the type() command and importing relevant libraries, Python programmers gain a powerful toolkit to handle diverse data and perform complex operations. These practices not only enhance the accuracy and efficiency of data analysis but also foster a more structured and maintainable codebase. It is vital for developers to stay proactive in keeping their library repertoire up-to-date, embracing new libraries and modules that emerge, as this ensures they are equipped with the latest tools and techniques to tackle evolving programming challenges.

a='Hydrogen'
print (type(a))
print (a)
print ('********************')
a=3
print (type(a))
print (a)
print ('********************')
a=3.14
print (type(a))
print (a)
print ('********************')
import datetime
a=datetime.date.today()
print (type(a))
print (a)
print ('********************')
a=True
print (type(a))
print (a)
print ('********************')
a=['H','He','Li']
print (type(a))
print (a)
print ('********************')
a=('H','He','Li')
print (type(a))
print (a)
print ('********************')
a={1:'H',2:'He',3:'Li'}
print (type(a))
print (a)
Code Output
<class 'str'>
Hydrogen
********************
<class 'int'>
3
********************
<class 'float'>
3.14
********************
<class 'datetime.date'>
2022-03-31
********************
<class 'bool'>
True
********************
<class 'list'>
['H', 'He', 'Li']
********************
<class 'tuple'>
('H', 'He', 'Li')
********************
<class 'dict'>
{1: 'H', 2: 'He', 3: 'Li'}
[Finished in 19ms]
Python, being a versatile programming language, offers a wide range of capabilities when it comes to performing calculations. As showcased in Table 4.2, Python provides symbols for fundamental arithmetic operations such as addition, subtraction, multiplication, division, modulus, and exponentiation. Whether it is adding up numerical values, subtracting quantities, multiplying factors, dividing quantities, finding remainders, or raising numbers to specific powers, these operators enable efficient and accurate computations. By harnessing the mathematical notation available in Python, programmers can effortlessly handle numeric calculations and incorporate them into their code, making Python a preferred choice for a wide range of computational tasks.
Table 4.2 Calculations in Python

Type of calculation    Symbol
Addition               +
Subtraction            -
Multiplication         *
Division               /
Modulus                %
Exponent               **
Here is a sample Python code written to solve basic calculations. In the given code snippet, the variables “x” and “y” are defined as the integers 6 and 2, respectively, while the variable “a” stores the result of each operation. Upon analyzing the code, we observe that variable “a” is redefined multiple times. During execution, the code follows a top-down approach, where each line is processed sequentially until reaching the end or encountering a stopping condition. The placement of variables within the code determines their accessibility and the possibility of redefinition. This feature proves beneficial when, for instance, there is a need to track and count specific types of data dynamically as the program runs. By leveraging this flexibility, data scientists can effectively manipulate variables to achieve their desired computational goals. The ability to redefine variables during runtime enables dynamic and adaptive problem-solving approaches in various data science scenarios.

Code Example

x=6
y=2
a=x+y
print (a)
print ('********************')
a=x-y
print (a)
print ('********************')
a=x*y
print (a)
print ('********************')
a=x/y
print (a)
print ('********************')
a=x%y
print (a)
print ('********************')
a=x**y
print (a)
Table 4.3 Common mathematical operators in Python
Symbol   Meaning
==       Is equal to
!=       Is not equal to
>        Is greater than
<        Is less than
>=       Is greater than or equal to
<=       Is less than or equal to

In addition to arithmetic, Python provides the comparison operators summarized in Table 4.3. Each of these operators evaluates the relationship between two values and returns a Boolean result, True or False, which makes them central to data processing and manipulation: filtering, branching, and validating data all rest on such tests. The following sample code demonstrates each operator in action.

x=1
y=3
print (x==y)
print ('********************')
print (x!=y)
print ('********************')
print (x>y)
print ('********************')
print (x<y)
print ('********************')
print (x>=y)
print ('********************')
print (x<=y)

Each print statement outputs True or False depending on whether the stated relationship between x and y holds. It is also important to remember that Python executes code sequentially from top to bottom, so the order in which operations appear directly determines how a program behaves. Building on these comparison operators, the if-statement allows the flow of execution to branch: a block of code is executed only when its condition evaluates to True. A simple if-statement is shown below, using example values chosen so that the condition holds.

x=2
y=1
if x>y:
    print ('x is greater than y')
This example effectively showcases the utilization of an if-statement to evaluate the relationship between variable x and variable y. In the given scenario, if the condition “x > y” holds true, the designated output of “x is greater than y” will be printed. However, what if we desire to include an alternative set of instructions to be executed when the condition “x > y” is not satisfied? This is where the “else statement” comes into play, enabling the definition and execution of a distinct code block to cater to such alternative conditions, effectively closing the if-statement with a comprehensive and versatile structure. By incorporating the “else statement,” we enhance the flexibility and robustness of our code,
allowing for more nuanced control over the program’s behavior in response to varying conditions.

x=1
y=1
if x>y:
    print ('x is greater than y')
else:
    print ('Something else')
Within the presented code snippet, the variables x and y are both assigned a value of 1, establishing the initial state for our evaluation. Delving into the if-statement, the first condition encountered scrutinizes whether x surpasses y in value. However, as 1 is not greater than 1, this condition fails to be met, leading the program to progress further down the code. At this juncture, our attention is directed toward the “else” segment, which encapsulates a distinct set of instructions to be executed when the previous condition proves untrue. Given that 1 does not satisfy the condition of being greater than 1, execution falls to the “else” block, thereby triggering the code specified therein. It is worth highlighting that in more complex scenarios, where multiple conditions necessitate evaluation, the introduction of the “elif statement” becomes advantageous. By employing “elif,” programmers can seamlessly continue constructing additional conditions within the if-statement framework without prematurely concluding its execution. This capability facilitates the development of intricate decision-making processes, allowing for more comprehensive and nuanced program logic to be established while preserving the integrity of the overall if-statement structure. A simple way of understanding how if-statements are performed is summarized below, followed by a concrete example.

if [A]:
    perform actions here if A is satisfied
elif [B]:
    perform actions here if B is satisfied
else:
    perform actions here if A and B are both not satisfied
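As a concrete illustration of this structure, consider the following short sketch; the values and messages are hypothetical and chosen only to show how each branch can be reached.

x=5
y=5
if x>y:
    print ('x is greater than y')
elif x==y:
    print ('x is equal to y')
else:
    print ('x is less than y')

Here, the first condition fails because 5 is not greater than 5, the elif condition succeeds, and 'x is equal to y' is printed; changing the values of x and y shifts execution to one of the other branches.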
Another imperative construct in the realm of Python programming is the for-loop. The for-loop plays a significant role when a particular process must be repeated under predefined conditions, allowing iterative tasks to be executed with precision and efficiency. Here is an example of a for-loop.

x=[1,3,7,2,9]
for i in x:
    print (i)
In this example, the code is written so that as variable i loops through x, the content of x is printed. The following code is a clearer example of how this is done.
xx=[]
x=[1,3,7,2,9]
for i in x:
    y=i+1
    xx.append(y)
print (xx)
Initially, an empty list named “xx” is created to serve as a container. In a similar fashion, a loop is established to iterate through each element of the list denoted as “x.” Within this iterative process, a new variable called “y” is defined, representing the sum of the current element “i” and the value 1. Note that in a for-loop written as “for i in x”, the variable “i” takes on the elements of “x” themselves rather than their index positions. On the first pass through the loop, “i” therefore holds the first element of the list, the integer 1, so the expression “y = i + 1” yields 2. This resulting integer is then appended to the previously empty list “xx,” effectively expanding its contents; at this stage, “xx” is transformed from an empty list into a list containing the value [2]. This process of iteration and addition is repeated until all elements within the list “x” have been processed. By the conclusion of the for-loop’s execution, the list “xx” will have undergone further expansion, now encompassing the elements [2, 4, 8, 3, 10]. Consequently, through the utilization of empty lists in conjunction with the append method, it becomes possible to dynamically build new lists, whereby individual actions are performed during each iteration of the loop.

Lastly, the concept of a while-loop is introduced, which exhibits similarities to the for-loop but with a distinctive characteristic. Unlike the for-loop, which iterates over a predetermined sequence, a while-loop persists until certain specified conditions are no longer met. The while-loop operates based on a conditional statement, which, when true, allows the loop to continue executing. It is crucial to define the conditions accurately to ensure proper termination of the loop and prevent infinite execution. By utilizing a while-loop, we can dynamically adapt our code’s behavior based on changing conditions or unknown iteration counts. This flexibility empowers us to handle situations where the exact number of iterations is uncertain or contingent upon specific criteria being fulfilled. An example of this is listed below; the loop body increments x until the condition x < 10 is no longer true, at which point the loop terminates.

x=5
while x<10:
    print (x)
    x=x+1

Beyond these built-in constructs, much of the day-to-day work of data science in Python is carried out with modules such as numpy and pandas, which provide data structures and functions for handling numerical arrays and tabular data. In pandas, a data frame can be filtered by placing a condition on one of its columns: for example, given a data frame df4 that contains an 'AtomicWeight' column, specifying the condition df4['AtomicWeight'] > 5 makes it possible to extract cases where the atomic weight is greater than 5. This condition acts as a filter, allowing only the relevant data points to be included in the result. This selective extraction ensures that data scientists can focus their analysis on specific subsets of the dataset that meet predefined conditions, aiding in the identification of patterns, trends, and outliers. To further enhance the analysis, pandas facilitates the seamless integration of conditional extraction within the data frame itself. By enclosing the code snippet within df4[df4['AtomicWeight'] > 5], data scientists can obtain a comprehensive list of data points that satisfy the given condition. This approach yields a result set that includes both the element values and their corresponding indices, enabling a deeper understanding of the dataset’s composition and structure. Figure 4.6 showcases this process, highlighting the ability to extract specific data from the data frame. This method of conditional extraction proves particularly valuable when working with large datasets.
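The following minimal sketch illustrates this kind of conditional extraction; it assumes that df4 is loaded from a file such as data.csv containing 'Element' and 'AtomicWeight' columns, which is an assumption made for illustration only.

import pandas as pd

# Load the dataset into a data frame (data.csv is assumed to hold
# 'Element' and 'AtomicWeight' columns)
df4 = pd.read_csv('data.csv')

# Boolean mask: True for rows whose atomic weight exceeds 5
mask = df4['AtomicWeight'] > 5

# Enclosing the condition in df4[...] keeps only the matching rows,
# together with their original indices
print(df4[mask])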
The ability to selectively retrieve data based on specific criteria enhances efficiency and reduces the cognitive load associated with manually sifting through voluminous data. By leveraging this approach, data scientists can focus their analysis on relevant subsets of the data, facilitating targeted investigations and uncovering meaningful insights. In addition to the previously discussed functionalities, pandas provides a diverse array of methods to manipulate and transform data. One such method is the sort_values()
method, which offers a powerful mechanism for sorting data based on specific columns. By employing this method, data scientists can seamlessly arrange data in either alphabetical order for text or numerical order for numbers, thereby enabling effective data manipulation and analysis.
The sort_values()
method accepts a parameter, denoted as “by,” which specifies the column by which the data should be sorted. For instance, when invoking sort_values(by='Name'),
the data will be sorted based on the ‘Name’ column. This versatility allows analysts to tailor their data manipulation according to their specific requirements, ensuring that the resulting dataset is organized in a meaningful and logical manner. When working with text data, the sort_values()
method enables alphabetical sorting, facilitating tasks such as arranging names, categories, or any other textual information in a desired order. On the other hand, when applied to numerical data, this method performs numerical sorting, enabling data scientists to organize values in ascending or descending order based on their magnitude. By leveraging such methods, pandas empowers data scientists to perform a wide range of data manipulations, transforming raw data into a structured and organized format. These manipulations not only enhance data exploration and analysis but also facilitate efficient data visualization and modeling. The ability to sort data based on specific columns provides valuable insights into patterns, trends, and relationships within the dataset. Finally, pandas offers powerful capabilities for data concatenation and merging, enabling the seamless integration of different datasets. These operations prove particularly valuable when combining and consolidating diverse pieces of data to derive comprehensive insights. Concatenation refers to the process of joining two or more datasets together. It serves as a practical approach when there is a need to combine separate pieces of data into a unified dataset. Consider the scenario where two databases, namely data.csv and data2.csv, contain distinct sets of information. Pandas provides an intuitive solution to concatenate these datasets, facilitating the integration of their contents. Figure 4.7 showcases the concatenation process, demonstrating the combination of the first dataset comprising elements H, He, Li, and Be with the second dataset comprising elements B, C, and N. By utilizing the concat() command provided by pandas, data scientists can effortlessly concatenate datasets. This command enables the seamless merging of multiple datasets, preserving the integrity and structure of the original data. As a result, the combined dataset incorporates the information from both source datasets, providing a comprehensive view of the underlying data. Data concatenation proves particularly useful when working with datasets that share similar structures but differ in their content. By combining these datasets, analysts can gain a holistic perspective and perform comprehensive analyses that leverage the collective information available.
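A brief sketch of sorting and concatenation follows. It assumes that data.csv (elements H, He, Li, and Be) and data2.csv (elements B, C, and N) share the columns 'Element' and 'AtomicWeight'; the variable name df5 and the exact column layout are assumptions for illustration.

import pandas as pd

df4 = pd.read_csv('data.csv')    # H, He, Li, Be
df5 = pd.read_csv('data2.csv')   # B, C, N

# Sort alphabetically by the text column; the 'by' parameter names the
# column to sort on (e.g. sort_values(by='Name') when a 'Name' column exists)
print(df4.sort_values(by='Element'))

# Concatenate the two data frames into a single, larger data frame (Fig. 4.7)
print(pd.concat([df4, df5]))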
Fig. 4.7 Combining two datasets using Python pandas
Fig. 4.8 Merging two datasets using Python pandas
Merging is another powerful operation offered by pandas for combining datasets. While concatenation focuses on appending data, merging offers greater flexibility by enabling the mixing of data from multiple sources. This functionality proves particularly valuable when dealing with datasets that contain different columns or when a more comprehensive integration of data is required. Figure 4.8 presents an example scenario involving two datasets: data.csv and data3.csv. In data3.csv, we observe the presence of distinct columns not found in the first dataset. By utilizing the merge() command, we can merge these datasets by adding the ‘AtomicNumber’ column from data3.csv to the first dataset, which contains the ‘Element’ and ‘AtomicWeight’ columns.
To achieve this merging operation, the merge() function is invoked, specifying the datasets to merge (df4 and df7 in this case) and indicating the common key for the merging process, which is the ‘Element’ column. Through this operation, all unique data from both datasets is combined into a unified space, facilitating subsequent analysis and exploration. This capability proves particularly useful when the objective is to extract data from multiple datasets and create a new dataset that integrates information from diverse sources. Merging expands the analytical capabilities of pandas by enabling data scientists to integrate and consolidate information from various datasets. By leveraging the merge() function, analysts can perform advanced data manipulations, combining datasets based on common keys or columns. This flexibility allows for the creation of comprehensive datasets that incorporate information from multiple sources, enhancing the depth and richness of the analysis.
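A minimal sketch of this merge is shown below; it assumes df4 was loaded from data.csv with 'Element' and 'AtomicWeight' columns and df7 from data3.csv with 'Element' and 'AtomicNumber' columns, with the exact file contents being assumptions for illustration.

import pandas as pd

df4 = pd.read_csv('data.csv')    # columns: 'Element', 'AtomicWeight'
df7 = pd.read_csv('data3.csv')   # columns: 'Element', 'AtomicNumber'

# Merge on the shared 'Element' key, as in Fig. 4.8; each row of the result
# carries both the atomic weight and the atomic number of an element
merged = pd.merge(df4, df7, on='Element')
print(merged)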
4.4 Conclusion
Throughout this section, we have explored the fundamental aspects of coding in Python for data science applications. As you continue to learn programming, it is important to have access to reliable and comprehensive resources to support your learning and development. In this regard, official documentation for pandas can be found online at “https://pandas.pydata.org/docs/”. This valuable resource offers a wealth of guides, tutorials, and references that cater to various skill levels, aiding in both getting started with and mastering the usage of pandas. Additionally, the documentation provides a comprehensive overview of the pandas API, empowering users with the knowledge to leverage its functionalities effectively. As you explore different modules and tools within the Python ecosystem, you will discover the vast potential and virtually boundless possibilities available to you. Python, with its extensive range of libraries and frameworks, presents an incredibly versatile platform for data analysis, visualization, and modeling. By harnessing the power of Python and its rich ecosystem of modules, you can unlock the potential to create remarkable solutions and derive insightful conclusions from your data. Building upon our foundational understanding of programming concepts, we are now poised to venture into the next chapter, where we will delve into materials and catalyst informatics. These fields combine the principles of data science with the study of materials and catalysts, paving the way for groundbreaking discoveries and advancements in various industries. We will also explore the intricate interplay between data analysis, computational modeling, and materials science, equipping ourselves with the tools and knowledge to tackle complex challenges and drive innovation. In summary, the comprehensive documentation available for pandas serves as an invaluable resource to deepen your understanding and proficiency in utilizing this powerful library. With each module you explore and master, you will uncover endless possibilities
for leveraging Python’s capabilities in the realm of data science. As we transition to the next chapter, the fusion of materials and catalyst informatics awaits, offering a captivating landscape where data science principles intersect with cutting-edge research and development.
Questions

4.1 Where can you find the official documentation for pandas?
4.2 What are some of the resources available in the pandas documentation?
4.3 How does Python’s extensive range of libraries and frameworks contribute to data analysis, visualization, and modeling?
4.4 What is the focus of the next chapter after exploring programming concepts?
4.5 How does the fusion of materials and catalyst informatics contribute to groundbreaking discoveries and advancements?
4.6 Why is programming a necessary component of data science?
4.7 How does programming surpass tools like Microsoft Excel in data science work?
4.8 Why is Python a preferred programming language in data science?
4.9 What is the purpose of the code in C mentioned in the text?
4.10 How does Python’s readability compare to C in terms of coding syntax?
4.11 Why is Python considered a preferred choice for data science practitioners?
4.12 What is the purpose of the type() command in Python programming?
4.13 How does a tuple differ from a list in Python?
4.14 What is the significance of importing libraries in Python?
4.15 Why is it important to continuously update and expand one’s skill set in programming?
4.16 How does Python’s datetime type contribute to working with time-based data?
4.17 What are some fundamental arithmetic operations supported by Python?
4.18 How does Python’s extensive set of mathematical symbols enhance its computational capabilities?
4.19 How can variables be redefined during runtime in Python?
4.20 What is the role of logical operators in programming?
4.21 How do logical operators contribute to data processing and manipulation?
4.22 What is the sequential execution model in Python?
4.23 Why is the proper order of operations important in Python code execution?
4.24 What are if-statements used for in Python?
4.25 What is the purpose of for-loops in Python?
4.26 How does the “else” statement enhance the functionality of an if-statement in Python?
4.27 What is the purpose of an empty list named “xx” in the provided code?
4.28 How is the value of variable “y” calculated within the loop?
4.29 What happens during each iteration of the loop in relation to the list “xx”?
4.30 What is the key difference between a for-loop and a while-loop?
4.31 How is the while-loop in the example terminated?
4.32 What is the advantage of using modules in Python programming?
4.33 How can modules enhance productivity in Python programming?
4.34 How can the numpy module be used to generate random numbers in Python?
4.35 What advantage does importing the numpy module offer in Python code?
4.36 What is the purpose of the random function in the numpy module?
4.37 How can you customize the range of the random number generated using numpy?
4.38 What is the purpose of the print function in generating random numbers using numpy?
4.39 How does importing specialized modules like numpy simplify random number generation?
4.40 What role does the Numpy module play in data science?
4.41 What is the purpose of the Pandas module in data analysis?
4.42 How does the pandas module facilitate the creation of data tables?
4.43 Why is it practical to directly load data from .csv files using the pandas library?
4.44 How can you import and manipulate data stored in a .csv file using pandas?
4.45 What is the function used to instruct the code to access and interact with a .csv file?
4.46 What is the advantage of using the “head(N)” method in pandas for data exploration?
4.47 How can you customize the range of rows displayed in a pandas data frame using the [x:y] notation?
4.48 How does pandas enable data scientists to filter and retrieve specific data points based on conditions?
4.49 How can you obtain a comprehensive list of data points that satisfy a given condition in pandas?
4.50 What advantage does conditional extraction offer when working with large datasets?
4.51 What is the purpose of the sort_values() method in pandas?
4.52 How does data concatenation benefit data scientists in pandas?
4.53 What is the difference between concatenation and merging in pandas?
4.54 How does merging datasets benefit data scientists in pandas?
5 Data and Materials and Catalysts Informatics
Abstract
Materials and catalyst informatics encompasses the systematic exploration and analysis of materials and catalysts data to discern underlying trends and patterns. The quest for such insights drives researchers to address fundamental questions: How can we access relevant materials and catalysts data? Is it feasible to generate materials and catalysts data deliberately? What knowledge can we derive from the wealth of materials and catalysts data at our disposal? Furthermore, what methodologies and tools are necessary to facilitate scientific visualization and harness the power of machine learning in this domain? The pivotal role of data in informatics cannot be overstated, as it serves as the cornerstone of the entire process. This chapter delves deep into the indispensable role of data within the realms of materials science and catalysis. It explores the avenues through which researchers can acquire, curate, and leverage materials and catalysts data to advance our understanding and ultimately pave the way for groundbreaking discoveries. Understanding the nuances of data acquisition and utilization is paramount in the pursuit of advancements in materials and catalyst informatics. Keywords
Data · Data acquisition · Data generation · High throughput · Data preprocessing · Data cleansing · Dataset
• Understand data and its central role in informatics • Explore how data is acquired or generated for data science purposes • Learn about the importance of data preprocessing, data cleansing, and data quality
5.1 Introduction
Data plays a central role in materials informatics, a multidisciplinary field that merges materials science, data science, and computer technology to accelerate the discovery, design, and development of new materials. In materials informatics, data encompasses an extensive range of information related to materials, including their chemical composition, structural properties, manufacturing processes, and performance characteristics. This data is collected through experiments, simulations, and observations, generating vast datasets that can be analyzed using various computational techniques. The objective of materials informatics is to extract valuable insights and patterns from this data to inform the creation and optimization of materials for specific applications.

Materials informatics has become increasingly important in industries ranging from materials science to manufacturing and engineering, as it accelerates the research and development process. By analyzing existing datasets, researchers can predict material behaviors, optimize manufacturing processes, and reduce the time and resources required for experimental testing. Investigating data in this way allows researchers to make more informed decisions about material selection and design, ultimately leading to advanced materials with tailored properties for specific applications.

The relationship between data and materials informatics extends beyond research and development. It fosters knowledge sharing and collaboration within the materials science community, as researchers can access and contribute to centralized materials databases and models, building upon one another’s results and best practices. Additionally, materials informatics aids in the integration of computational models and experimental data, offering a holistic view of materials properties and behavior. In this sense, data can be viewed as the lifeblood of materials informatics, and its effective use is transforming the way new materials are discovered, developed, and deployed in a wide range of technological applications.
5.2 Collecting Data
The volume and complexity of data for material and catalyst research have witnessed an exponential surge in recent decades. This surge has presented both an opportunity and a challenge, driving the emergence of interdisciplinary fields such as materials informatics and catalysts informatics. At the heart of these fields, data serves as the bedrock for insightful analysis and informed decision-making. To effectively harness the power of data science in material and chemical research, it is essential to have access to a substantial and well-curated repository of high-quality data. The success of data-driven approaches is reliant upon the availability of relevant and reliable datasets. The process of collecting and preparing data for data science applications involves several crucial
steps. The initial phase revolves around data acquisition, which entails sourcing data from diverse experimental techniques, simulation methods, literature sources, and collaborative efforts. Ensuring the integrity and reliability of the collected data is paramount, as it forms the foundation upon which subsequent analyses and modeling efforts are built. Once the data is acquired, it undergoes a rigorous process of cleaning, transformation, and integration. Data cleaning involves identifying and rectifying inconsistencies, errors, missing values, and outliers that may impact the integrity and reliability of the dataset. This step ensures that the subsequent analyses and modeling are based on accurate and valid information. Data transformation is another crucial aspect of the preparation process. It involves converting raw data into a suitable format that aligns with the objectives of the analysis. This may include normalization, standardization, feature engineering, or dimensionality reduction techniques, depending on the specific requirements of the data science application at hand. Furthermore, data integration plays a vital role in aggregating and merging disparate datasets to create a unified and comprehensive view. This step involves aligning and reconciling different data sources, resolving conflicts, and harmonizing data representations to create a cohesive dataset that encapsulates diverse aspects of materials and catalyst research. In addition to data cleaning, transformation, and integration, metadata annotation is essential for contextualizing the data. By annotating the dataset with relevant metadata, such as experimental conditions, measurement techniques, and sample characteristics, researchers can enhance the interpretability and reproducibility of their findings. Without access to an appropriate amount of quality, well-organized data, our efforts to explore material and chemical data through data science are hindered. Let us further explore how data is collected and prepared for data science applications. Figure 5.1 serves as an illustrative depiction of the fundamental workflow within the realm of materials informatics and catalysts informatics, providing a high-level overview of the key stages involved. Embarking on this journey entails a systematic approach, beginning with the collection of materials and catalysts data, which forms the bedrock of subsequent analysis and modeling endeavors. While it may be tempting to directly apply machine learning and data visualization techniques to the collected data, a critical realization emerges in the field of informatics. The inherent complexities and nuances of real-world data necessitate a crucial preprocessing step before data science techniques can be effectively employed. This preprocessing stage serves as the gateway to unlocking the true potential hidden within the data, enabling the extraction of meaningful insights and facilitating informed decision-making. Data preprocessing assumes paramount importance due to the myriad challenges encountered in real-world data. These challenges include the presence of outliers, duplicate entries, unlabeled data, and instances with unique labels, among other intricacies. The nature of such imperfections demands careful attention and treatment to ensure the reliability, integrity, and quality of the dataset. 
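As a small illustration of the transformation step described above, the following sketch rescales the numeric columns of a hypothetical materials table with scikit-learn; the file name and column names are assumptions chosen only for illustration.

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical dataset of materials with numeric descriptor columns
df = pd.read_csv('materials_data.csv')
numeric_cols = ['AtomicWeight', 'Density']

# Standardization: rescale each column to zero mean and unit variance
df_std = df.copy()
df_std[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

# Normalization: rescale each column to the range [0, 1]
df_norm = df.copy()
df_norm[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

print(df_std.head())
print(df_norm.head())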
Preprocessing acts as a vital conduit for addressing these challenges and transforming the raw data into a refined and harmonized form suitable for subsequent data science techniques. In addition to addressing data quality concerns, the format of the collected data often requires tailored
Fig. 5.1 Basic workflow in materials and catalysts informatics
adjustments to align with the specific requirements of data science techniques. Raw data seldom conforms to the standardized formats and structures expected by these techniques. Thus, data preprocessing assumes the pivotal role of molding the data into a suitable format, enabling seamless integration and compatibility with a wide array of analytical and modeling approaches. The data preprocessing journey encompasses a diverse set of techniques and methodologies tailored to tackle the unique characteristics of the materials and catalysts domain. These techniques include, but are not limited to, data cleaning, outlier detection and handling, data transformation, normalization, feature selection and extraction, as well as data integration and fusion. Each of these steps contributes to the comprehensive preprocessing pipeline, collectively enhancing the overall quality, coherence, and usability of the data for subsequent data science applications. By meticulously conducting data preprocessing as a preliminary step, researchers and practitioners pave the way for robust and reliable analyses, modeling, and visualization of materials and catalysts data. This preparatory phase ensures that the subsequent application of data science techniques rests upon a solid and trustworthy data foundation. Thus, the success and efficacy of informatics-driven endeavors critically hinge upon the thoroughness and sophistication with which data preprocessing is executed.

For data science, the indispensability of data is self-evident. It serves as the cornerstone upon which the entire field revolves. However, the question arises: how does one go about acquiring the requisite data? What are the potential sources for materials and catalysts data? The landscape of data sources can be broadly classified into two major categories, each offering distinct advantages and considerations.
The first category of data sources revolves around utilizing existing data published in reputable repositories, such as literature and patents. These established sources have long been recognized as invaluable reservoirs of scientific knowledge and insights. Leveraging data from literature and patents offers numerous benefits, including the availability of well-curated datasets, the potential for accessing comprehensive and domain-specific information, and the advantage of building upon prior research and findings. The vast expanse of scientific literature and patent databases encompasses a wealth of data on materials and catalysts, encapsulating an extensive spectrum of studies, experiments, and discoveries conducted by researchers and scientists across the globe. On the other hand, the alternative option entails the creation of a bespoke dataset tailored to one’s specific research objectives and requirements. This avenue involves the collection, generation, and curation of data through deliberate experimentation, observations, simulations, or empirical studies. Creating one’s own dataset empowers researchers to design experiments or simulations that precisely target the desired parameters and variables, offering a degree of control and customization unparalleled by existing datasets. This approach allows for the exploration of uncharted territories, novel phenomena, or specialized aspects that may not have been extensively studied or documented in existing literature. However, it is crucial to recognize that generating one’s own data requires meticulous planning, rigorous execution, and adherence to scientific principles to ensure the integrity and validity of the resulting dataset. The choice between utilizing existing data from literature and patents or creating a new dataset is a consequential decision, contingent upon the specific research context, available resources, research objectives, and the level of customization required. Careful consideration must be given to the advantages, limitations, and potential biases associated with each approach. Depending on the research question at hand, a judicious combination of both approaches may also be warranted, wherein existing data serves as a foundational baseline, complemented by carefully designed experiments or observations to augment and enrich the dataset. Literature data can be collected through the following five ways:

• Manual Collection
• Review Articles
• Text Mining
• Open Data Centers
• Data Purchases
Literature data acquisition encompasses several approaches, including manual collection, review articles, text mining, open data centers, and data purchases. Each method offers unique advantages in obtaining valuable materials and catalysts data. Manual collection is a fundamental approach where researchers proactively search for target data. This process has transitioned to online platforms such as “Google Scholar” and “Web of Science.” These platforms serve as powerful search engines, enabling access to academic articles and patents. Researchers can download relevant publications and
delve deeper into the content to extract pertinent information. This can include details on synthesis methods, process conditions, material properties, and other relevant data. Often, supplementary tables and supporting documents accompany these publications, enriching the available data. In instances where data is presented graphically, specialized tools like WebPlotDigitizer (https://github.com/ankitrohatgi/WebPlotDigitizer) can be employed to extract the data accurately, ensuring its usability in research endeavors. The manual collection method allows for meticulous data curation and direct interaction with the scholarly material. Review articles are another valuable resource for data acquisition. These articles provide comprehensive summaries and analyses of existing research, serving as a consolidated source of information. Researchers can leverage review articles to gather relevant materials and catalysts data, as they often present a synthesis of multiple studies and their findings. By consulting review articles, researchers can efficiently access key insights and data points from a broader range of sources, facilitating the identification of trends and knowledge gaps in the field. Text mining is an invaluable technique for data collection, allowing for the automated extraction of valuable information from articles. This process heavily relies on the power of natural language processing (NLP), a branch of artificial intelligence that focuses on understanding and analyzing human language. At its core, text mining involves the identification and extraction of relevant data by specifying the appropriate context and keywords. By utilizing advanced algorithms and techniques, text mining enables the automation of data collection, significantly enhancing efficiency and scalability. Researchers and data scientists can leverage this approach to gather large volumes of information from diverse textual sources, such as research articles, academic papers, and online documents. To facilitate text mining tasks, various sophisticated tools and libraries have been developed. Notable examples include the Python Natural Language Toolkit (NLTK) and spaCy, which provide a wide range of functionalities for NLP-related tasks. These tools offer pre-trained models, tokenization capabilities, part-of-speech tagging, named entity recognition, and other advanced features that streamline the text mining process. However, it is important to note that while text mining can yield powerful insights, it is not without its challenges. One of the key considerations is the potential for collecting unwanted or inaccurate data. Due to the inherent complexities of language and the limitations of automated algorithms, there is a possibility of extracting irrelevant or erroneous information during the mining process. It is crucial, therefore, to exercise caution and implement rigorous validation procedures to ensure the quality and reliability of the collected data. Data validation also plays a pivotal role in the text mining workflow. It involves verifying the accuracy, consistency, and relevance of the extracted data against established criteria or ground truth. Through careful validation techniques, researchers can identify and rectify any inconsistencies or errors, ensuring the integrity of the collected dataset. This step is essential for maintaining the credibility of the subsequent analyses and interpretations based on the mined data.
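As a minimal sketch of keyword-based text mining with NLTK, the snippet below keeps only the sentences of a document that mention a chosen keyword; the example text and keyword are hypothetical, and a real application would loop over downloaded abstracts or full texts.

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # sentence tokenizer model, downloaded once

text = ("Methane is converted to ethylene over a Li/MgO catalyst. "
        "The support material strongly influences the selectivity.")

keyword = 'catalyst'
# Keep only the sentences that mention the keyword
hits = [s for s in sent_tokenize(text) if keyword in s.lower()]
print(hits)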
Finally, in addition to the aforementioned methods, researchers have the option of utilizing open data centers or purchasing datasets for their data collection needs. Open data centers provide a wealth of diverse and publicly accessible data sources, encompassing various domains and topics. These data centers serve as centralized repositories that house a vast array of structured and unstructured data, covering a wide range of research fields. Researchers can leverage these open data centers to access and retrieve valuable data that aligns with their specific research objectives. Here are some useful open data centers for materials and catalysts.

Materials Data
• Materials Project https://materialsproject.org/
• Aflow http://aflowlib.org/
• Nomad https://nomad-lab.eu/
• Open quantum material data https://oqmd.org/
• NIST Materials Data Repository https://materialsdata.nist.gov/
• NIMS atomwork https://crystdb.nims.go.jp/en/
• Computational materials repository https://cmr.fysik.dtu.dk/
• Crystallography Open Database http://www.crystallography.net/cod/
Catalysts Data
• CADS https://cads.eng.hokudai.ac.jp/
• Catalysis hub https://www.catalysis-hub.org/

Spectra Databases: XAFS
• IXAS XAFS database http://ixs.iit.edu/database/
• Hokkaido University XAFS data https://www.cat.hokudai.ac.jp/catdb/index.php?action=xafs_login_form&opnid=2
• NIST X-ray Photoelectron Spectroscopy Database https://srdata.nist.gov/xps/

For more specialized or targeted data requirements, researchers may opt to purchase datasets from reputable providers. Data vendors offer curated and verified datasets that cater to specific domains, ensuring the reliability and quality of the data. These datasets are often meticulously organized and standardized, enabling researchers to directly incorporate them into their analyses and studies. Materials data can also be purchased from data centers like the following:
• NIMS Atomwork Advs https://atomwork-adv.nims.go.jp/?lan=en
• Cambridge Structural Database https://www.ccdc.cam.ac.uk/
• Reaxys https://www.elsevier.com/solutions/reaxys

If the required data cannot be readily obtained or if there is a need for greater control and customization over the data used, an alternative approach is to generate one’s own materials and catalysts data. This approach involves designing and conducting experiments or simulations to gather specific data points relevant to the research objectives. Creating custom datasets offers researchers the advantage of tailoring the data collection process to their specific needs and research goals. It allows for the inclusion of specific parameters, conditions, and variables that are essential for studying materials and catalysts in a targeted manner. This approach grants researchers a higher degree of control over the data generation process, enabling them to focus on specific aspects or phenomena of interest. Obtaining data in this manner can be broken down into three different categories:

1. Perform experiment and computation one by one
2. Pay a third party to create data
3. Create data through high throughput/robot experiments and high throughput calculations

The first and foremost approach to acquiring data is through the conventional method of conducting experiments or calculations. In this context, the term “conventional” refers to the widely practiced approach adopted by researchers across various scientific disciplines. Researchers follow a systematic procedure where they design and create samples or models specific to their research objectives. These samples can include materials, catalysts, or experimental setups tailored to investigate particular phenomena or properties of interest. Once the samples are prepared, experiments are conducted, and measurements or observations are recorded. These experiments encompass a range of techniques and methodologies depending on the nature of the research. They may involve manipulating variables, controlling experimental conditions, and collecting data through instruments or sensors. Researchers perform careful measurements, record data points, and repeat experiments to ensure the accuracy and reliability of the acquired data. Following the data collection phase, researchers proceed to analyze and review the obtained data. This entails employing statistical methods, data visualization techniques, and data processing algorithms to extract meaningful insights and draw conclusions. The analysis phase often involves comparing results, identifying patterns or trends, and assessing the significance of the findings. Furthermore, the conventional approach necessitates rigorous documentation of the experimental procedures, methodologies, and measurement protocols to ensure reproducibility and facilitate future reference. Researchers maintain comprehensive laboratory notebooks or electronic records containing details of each step, including sample preparation, experimental setup, data collection, and analysis techniques employed. This process is standard across all scientific disciplines.
In addition to conducting experiments and calculations in-house, another avenue for obtaining data is to collaborate with third-party entities and specialized companies that offer comprehensive experimental and computational services. By outsourcing certain aspects of the data collection process, researchers can benefit from the expertise and resources of these external entities. When considering this option, it is crucial to be cognizant of the associated costs and financial implications. Outsourcing experiments and calculations typically entails entering into contractual agreements, wherein the third-party service providers charge fees for their services. It is essential to establish clear communication and understanding regarding the scope of work, deliverables, quality control measures, and financial arrangements. Moreover, researchers should exercise due diligence in selecting reputable and reliable third-party partners to ensure the integrity and accuracy of the collected data. Evaluating the track record, credentials, and reputation of these service providers can help mitigate potential risks and ensure that the outsourced experiments and calculations align with the desired scientific objectives.

The third option for data creation involves utilizing high throughput experiments and high throughput calculations. This approach leverages the concept of high throughput, which enables the efficient execution of synthesis, performance testing, and characterization processes on a large scale. By automating and parallelizing these processes, researchers can generate significant amounts of data in a relatively short period. Before we go further into detail, let us briefly explore what high throughput means. High throughput encompasses the ability to conduct multiple experiments simultaneously, with each experiment focusing on the synthesis of a distinct sample. The synthesized samples are then subjected to performance testing and characterization to obtain valuable data. Figure 5.2 provides an overview of the high throughput concept, illustrating the interconnected nature of synthesis, performance testing, and characterization processes. Unlike low throughput approaches, where individual samples are processed sequentially, high throughput enables the concurrent handling of multiple samples, significantly accelerating data acquisition. In high throughput experiments, researchers have the flexibility to explore various parameters, such as composition, structure, and processing conditions, across a range of samples. This enables the systematic investigation of material and catalyst properties, leading to a comprehensive understanding of structure-performance relationships. Furthermore, high throughput calculations complement experimental efforts by employing computational techniques to predict and analyze material properties. Through simulations and modeling, researchers can explore a vast chemical space, rapidly screening and identifying promising candidates for further experimental validation. By combining high throughput experiments and calculations, researchers can generate large datasets, facilitating the discovery and optimization of materials and catalysts with desired properties. The resulting data can serve as a valuable resource for subsequent data-driven analyses, machine learning algorithms, and predictive models. In conventional experimental approaches, researchers typically focus on conducting experiments using a single sample.
Each experiment involves the synthesis of a specific sample, followed by rigorous testing and comprehensive characterization to gather valuable insights and
Fig. 5.2 The concept of high throughput experiment
data. This sequential process, where synthesis, performance testing, and characterization are performed on individual samples, is commonly referred to as low throughput. Low throughput methodologies often necessitate a significant investment of time and resources due to the sequential nature of sample processing. Each sample is meticulously prepared, tested, and characterized before moving on to the next, which can result in a relatively slower data acquisition rate. However, it is important to note that low throughput approaches still play a vital role in many scientific disciplines. They offer a high level of control and precision, allowing researchers to delve deeply into the properties and behavior of individual samples. This approach is particularly valuable when working with complex materials or catalyst systems that require thorough investigation and analysis. While low throughput experiments provide valuable insights, they may not be suitable for scenarios where a larger dataset is desired or when screening a wide range of samples and conditions. In such cases, high throughput methodologies, as discussed earlier, offer an alternative approach by enabling the simultaneous processing and analysis of multiple samples. The concept of high throughput holds immense significance for materials and catalysts informatics. High throughput methodologies revolutionize the traditional approach by enabling the simultaneous synthesis, performance testing, and characterization of a substantial number of samples, leading to a significant increase in efficiency and data generation. Figure 5.2 provides a visual representation of the high throughput concept, showcasing the parallel processing of multiple samples. Unlike the sequential nature of low throughput experiments, high throughput methodologies allow researchers to synthesize and perform experiments on a large number of samples simultaneously. This
parallelization enables the production of a wealth of data within remarkably shorter timeframes. To illustrate the magnitude of high throughput capabilities, consider the scenario where 100 samples are synthesized and performed in parallel through experimental means. This process empowers researchers to amass a vast amount of data within a relatively condensed timeframe, facilitating comprehensive analysis and evaluation of materials and catalyst systems. Similarly, high throughput calculations are instrumental in generating substantial volumes of data in a short span of time. By leveraging computational methods, researchers can produce and calculate a multitude of models concurrently, unveiling valuable insights and accelerating the exploration of materials and catalysts. The appeal of high throughput methodologies in materials and catalyst informatics lies in their unparalleled capacity to generate extensive datasets. These datasets serve as the lifeblood of data science-driven research, empowering scientists to harness the power of advanced analytics and machine learning techniques. By leveraging these methodologies, researchers can unlock new patterns, correlations, and trends within the data, leading to profound discoveries and advancements in materials and catalyst science. The ability to rapidly generate significant amounts of data through high throughput approaches opens up new avenues for research and exploration. It enables researchers to tackle complex scientific challenges, screen vast libraries of materials, optimize performance parameters, and drive innovation in a more efficient and systematic manner. The concept of high throughput is attractive for generating large amounts of data. The utilization of high throughput methodologies in materials and catalyst informatics brings forth numerous advantages, particularly concerning the generation and accumulation of large volumes of data within remarkably condensed time frames. This immediate advantage revolutionizes the research landscape by facilitating comprehensive exploration and analysis of a broader range of materials and catalyst systems, expediting the pace of scientific discoveries and technological advancements. Moreover, the consistency and quality of the data generated through high throughput approaches stand as formidable benefits. With the employment of the same experimental devices and parameters throughout the entire process, a remarkable level of data consistency is achieved. This uniformity plays a pivotal role in materials and catalyst informatics, where reliable and accurate data forms the bedrock for the effective application of data science techniques. Traditional experimental approaches, in contrast, often contend with inherent inconsistencies arising from various sources, including divergent experimental conditions, disparities in researchers’ abilities, and other concealed factors. These inconsistencies can substantially impede the efficacy of data science techniques. By adopting high throughput methodologies, researchers can amass extensive datasets where all conditions and parameters remain constant, enabling machines to read and predict data with heightened efficiency and precision. The impact of this consistency in high throughput data generation cannot be overstated, particularly in the realm of materials and catalyst informatics. Traditionally, when experiments are conducted, the same material may undergo synthesis and subsequent characterization multiple times. 
However, the presence of diverse experimental factors, variances in researchers’ capabilities, and other latent variables may introduce inconsistencies into
the data, compromising its reliability and hindering the seamless integration of data science techniques. The utilization of high throughput methodologies effectively mitigates these challenges by ensuring the maintenance of constant conditions and parameters throughout the entire process, enabling the collection of large datasets that are free from the confounding effects of inconsistencies. This, in turn, empowers machines to more efficiently interpret and predict data, fostering enhanced understanding and enabling accelerated progress in the field of materials and catalyst informatics. While high throughput methodologies offer significant advantages in data generation, it is crucial to consider the potential drawbacks associated with their utilization. One notable disadvantage pertains to the development of high throughput experimental devices, which demands substantial financial investments and specialized skill sets. The intricate nature of these devices necessitates expertise in their design, construction, and operation, thereby increasing the overall costs and resource requirements. Similarly, high throughput calculations also entail comparable challenges. Employing a shared codebase and server infrastructure, high throughput calculations aim to achieve consistent data outputs. Nevertheless, the development of the requisite high throughput code demands profound knowledge and proficiency to seamlessly integrate model construction, parallel calculations, and data storage. Additionally, financial considerations encompass the procurement and maintenance of servers, utility bills, and other associated expenses. Another aspect to ponder is the trade-off between data quantity and quality in high throughput approaches. While the generation of vast amounts of data is desirable for comprehensive analysis, it often comes at the expense of a slight decrease in data quality compared to conventional experiments and calculations. Factors such as reduced control over individual experimental conditions, inherent limitations in measurement precision, and potential variations within high throughput processes can collectively contribute to this quality disparity. Therefore, a cautious evaluation of the data produced by high throughput experimentation and high throughput calculations becomes imperative, necessitating a comprehensive bird’s-eye-view perspective. In the realm of materials and catalyst informatics, this holistic viewpoint is embraced to effectively interpret and exploit the wealth of data generated through high throughput methodologies, accounting for both its strengths and limitations. It is worth noting that despite these disadvantages, high throughput methodologies remain indispensable tools in materials and catalyst informatics. The benefits they confer in terms of accelerated data generation, expanded exploration of materials space, and enhanced efficiency in data-driven analysis outweigh the associated challenges. By strategically addressing the limitations, such as investing in skilled personnel, optimizing high throughput code, and implementing quality control measures, researchers can effectively harness the power of high throughput approaches while mitigating potential pitfalls. These considerations enable the seamless integration of high throughput methodologies into the research workflow, driving advancements in materials and catalyst informatics and propelling scientific discovery forward.
5.3 Data Preprocessing
Out of all the steps taken with materials and catalysts informatics, the data preprocessing stage can be considered to be the most important one. It serves as a crucial stage where collected data is meticulously refined and prepared to facilitate the application of data science techniques. Essentially, data preprocessing encompasses a range of tasks aimed at rectifying and eliminating inaccurate, corrupted, or unreliable data, thereby enhancing its quality and reliability. This essential data refinement process is often referred to as data cleansing, underscoring its significance in ensuring the integrity of the dataset. Data preprocessing extends beyond data cleansing and encompasses various other preparatory steps to enable effective data visualization. Figure 5.3 provides a comprehensive overview of the diverse tasks involved in data preprocessing, shedding light on the typical processes employed. While the specific requirements may vary depending on the dataset and research objectives, Fig. 5.3 offers valuable insights into the fundamental elements of data preprocessing. At the forefront of data preprocessing tasks, data cleansing holds a prominent position, constituting approximately 25% of the overall process. This critical step involves identifying and rectifying inconsistencies, errors, and outliers present in the dataset, ensuring data accuracy and reliability. By employing various techniques such as outlier detection, missing data imputation, and error correction algorithms, researchers strive to refine the dataset and eliminate potential sources of bias or inaccuracies.
Fig. 5.3 Workload in materials and catalysts informatics
Equally vital to the data preprocessing pipeline is data labeling, which also accounts for approximately 25% of the process. Accurate and comprehensive labeling of the data enables subsequent analysis and modeling, empowering researchers to extract meaningful insights and patterns. Through techniques such as manual labeling, automated labeling, or a combination of both, the dataset is enriched with relevant labels, providing a foundation for supervised learning algorithms and facilitating the interpretation of results. In addition to data cleansing and data labeling, data augmentation holds a notable share of approximately 15% in the preprocessing workflow. Data augmentation techniques involve generating synthetic data points or augmenting existing ones, effectively expanding the dataset size and diversity. By introducing variations, transformations, or perturbations to the data, researchers can mitigate the effects of limited sample sizes and improve the generalization and robustness of subsequent machine learning models. Data aggregation, encompassing approximately 10% of the data preprocessing stage, involves the consolidation and integration of multiple datasets or sources. This process aims to create a comprehensive and unified dataset, allowing researchers to harness a broader range of information and insights for analysis. Moreover, machine learning model training, also accounting for approximately 10%, involves preparing the dataset in a format suitable for model training, encompassing tasks such as feature extraction, normalization, and data partitioning. From this, it becomes evident that 85% of the overall efforts devoted to data preprocessing encompass the crucial tasks of data cleaning, data labeling, and ensuring proper data formatting for machine accessibility. In fact, a significant portion, accounting for 50% of the total workload, is dedicated solely to the meticulous processes of data cleaning and data labeling. This staggering allocation of time and resources underscores the paramount importance of thorough and accurate data preparation. The extensive time and effort invested in data preprocessing are reflective of its pivotal role in ensuring the validity and reliability of subsequent machine learning and data visualization applications. Failure to adequately perform data preprocessing can have detrimental consequences, leading to erroneous results and misguided analyses. As such, utmost care and attention must be given to this crucial stage to mitigate the risk of drawing incorrect conclusions or making flawed decisions based on flawed data. The significance of proper data preprocessing extends beyond machine learning applications and also holds true for experimental studies. Neglecting to include detailed process information along with the collected samples can severely compromise the reliability and integrity of the obtained results. Without clear points of reference connecting the experimental process to the observed outcomes, researchers run the risk of generating misleading or inconclusive findings. Consequently, it is imperative to treat data as an integral part of the process much like with samples when dealing with experimental investigations, ensuring that comprehensive process information is documented and considered in the analysis. Let us explore data cleansing further.
Fig. 5.4 Factors that affect the quality and usability of data
The importance of data preprocessing becomes even more apparent when we delve into the intricacies of data quality and usability. While addressing data inaccuracies and corruption is certainly a critical aspect of the preprocessing stage, it alone does not encompass the full scope of its necessity. One might question whether identifying and rectifying such issues would be sufficient for machine processing to commence. However, a closer examination, as depicted in Fig. 5.4, reveals the multitude of factors that can significantly impact the quality and usability of data, warranting comprehensive data preprocessing measures. One fundamental challenge arises from the absence of standardized rules or industrywide conventions dictating how data should be collected and organized. The responsibility of data collection rests solely with the researcher(s), leading to variations in the recording and collection practices employed. For instance, different researchers may assign varying degrees of importance to specific parameters, such as temperature, when collecting data. These divergent preferences and priorities can introduce complexities and discrepancies into the data collection process itself. The implications of such disparities in data collection preferences can be far-reaching. Consider a scenario where some researchers emphasize the inclusion of temperature data in their datasets, while others may overlook or disregard its significance. This discrepancy directly impacts the usability and compatibility of the resulting datasets when shared with other researchers or utilized in collaborative projects. The absence of standardized practices leads to data fragmentation and hampers effective data sharing and collaboration efforts, rendering datasets incomplete or inadequate for certain applications.
To address these challenges, data preprocessing plays a pivotal role in harmonizing and standardizing the collected data. By subjecting the data to rigorous preprocessing steps, researchers can align and reconcile the discrepancies arising from individual preferences, ensuring that the resulting datasets are comprehensive, consistent, and suitable for widespread use. This not only enhances the interoperability of datasets among researchers but also maximizes the potential for cross-disciplinary collaborations and data-driven insights. Moreover, data preprocessing encompasses a broader spectrum of tasks beyond error detection and correction. It includes data transformation, normalization, feature extraction, and dimensionality reduction, among other techniques, to optimize the data for subsequent analysis and machine learning algorithms. Through these preprocessing steps, the raw data is refined, enhanced, and tailored to meet the specific requirements of the intended analytical approaches and models.

The impact of data recording methods on the resulting datasets should not be underestimated. This is particularly evident when we consider the recording of molecules, where multiple approaches exist, each with its own implications. The choice of how to record a molecule, whether through molecular formula, SMILES notation, or in layman's terms, can significantly influence the discoverability and interpretability of the data. Consequently, differences in labeling conventions can pose challenges when attempting to locate specific types of data within a given dataset. Let us consider the example of the molecule CH₄. Depending on the recording method employed, it could be represented as CH₄, C (in the case of SMILES notation), or simply referred to as "methane" using common language. For individuals with a specialization in chemistry or relevant research fields, it may be apparent that these various notations all pertain to the same molecular entity. However, for those lacking knowledge in methane or chemistry, these representations may appear as distinct and unrelated entities. Similarly, if the data is being processed by a machine, it will naturally assume that the different labels correspond to separate entities unless explicitly defined and connected. Consequently, the lack of standardization in data labeling can lead to confusion, misinterpretation, and even the inadvertent omission of valuable data. It is disheartening to witness potentially valuable information being disregarded simply due to the lack of understanding surrounding the labels used. This issue becomes particularly pronounced when leveraging automated processes, as machines rely on clear instructions and unambiguous data representations to effectively interpret and analyze the information at hand.

To mitigate these challenges and ensure accurate data retrieval and analysis, it is crucial to establish clear and consistent labeling practices. The adoption of standardized data recording methods and conventions can harmonize the representation of molecules, facilitating seamless integration and cross-referencing of data from diverse sources. By implementing universally recognized notations and terminology, researchers can enhance the discoverability and accessibility of molecular data, enabling seamless collaboration, knowledge sharing, and the effective utilization of computational tools and algorithms.
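One lightweight way to connect such labels, for example, is to maintain a small curated alias table and map every variant onto a single canonical name before analysis. The sketch below assumes a hypothetical pandas column named "molecule"; the alias table itself would need to be curated by a domain expert.

import pandas as pd

# Hypothetical records in which the same molecule appears under three different labels
df = pd.DataFrame({"molecule": ["methane", "CH4", "C", "ethane"]})

# Curated alias table mapping label variants onto one canonical formula
aliases = {"methane": "CH4", "C": "CH4", "CH4": "CH4", "ethane": "C2H6"}

# Replace each label with its canonical form; labels missing from the table are kept as-is
df["molecule_canonical"] = df["molecule"].map(aliases).fillna(df["molecule"])
print(df)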
Furthermore, efforts should be directed toward improving data literacy and promoting interdisciplinary understanding. By fostering a culture of shared knowledge and emphasizing the importance of comprehensible and coherent data labeling, researchers can bridge the gap between domain-specific terminologies and broader scientific communities. This will empower researchers from diverse backgrounds to navigate and harness the wealth of available data with greater confidence and accuracy. The challenges surrounding data generation and collection extend even further when we consider cases where certain types of data are derived from other data. A notable example of this can be found in catalysis, where the calculation of yield relies on selectivity and conversion data. In essence, the necessary information for calculating selectivity and conversion is contained within the dataset. However, if this relationship is not explicitly specified, some researchers may erroneously treat yield as an independent variable. Consequently, attempts to predict yield solely based on selectivity and conversion data may yield suboptimal results. To illustrate this further, let us consider a scenario where we collect selectivity and conversion data from multiple sources. It is not uncommon to encounter inconsistencies in the reporting of selectivity values across different datasets. While some researchers may include selectivity data in their dataset, others may omit it. As a result, when merging these datasets, certain sections intended for selectivity may remain blank or devoid of data. This raises an important question: How should we handle such missing or incomplete data? Moreover, there are cases where variables, such as material synthesis methods, are described using text or string-based labels rather than numerical values. Machine learning algorithms and visualization techniques often rely on numerical data, making it challenging to incorporate string data directly. In such instances, it becomes necessary to assign index numbers or some form of numerical representation to these string-based variables to enable their inclusion in computational analyses and visualizations. These examples highlight the current state of data generation and collection, where researchers often rely on individual judgment and discretion due to the absence of standardized rules and guidelines. This lack of standardization hampers the reusability and interoperability of datasets, hindering collaborative efforts and impeding the progress of scientific research. Addressing these challenges requires concerted efforts to establish standardized protocols, promote data sharing best practices, and develop robust data preprocessing methods that account for missing values, inconsistent reporting, and the integration of diverse data types. The development of comprehensive data standards and guidelines would not only facilitate data integration and analysis but also enhance the comparability and reliability of research outcomes. It would provide a common framework for researchers to document and organize their data, ensuring essential information is consistently captured and readily accessible for future analyses and investigations. Furthermore, the creation of standardized data repositories and platforms that adhere to these guidelines would foster a culture of data sharing and collaboration, encouraging researchers to contribute and exchange their datasets in a harmonized and well-documented manner.
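To make the yield example concrete, the sketch below treats yield as a derived quantity, recomputing it from conversion and selectivity (using the common approximation yield = conversion × selectivity, with percentages divided by 100), and then converts a text-based synthesis-method column into integer codes. All column names here are hypothetical.

import pandas as pd

df = pd.DataFrame({
    "conversion_pct": [10.0, 25.0, 40.0],
    "selectivity_pct": [80.0, 60.0, None],        # one source omitted selectivity
    "synthesis_method": ["sol-gel", "impregnation", "sol-gel"],
})

# Yield is derived from conversion and selectivity rather than treated as independent
df["yield_pct"] = df["conversion_pct"] * df["selectivity_pct"] / 100.0

# Encode the string-based synthesis method as integer codes for later visualization or ML
df["synthesis_code"] = df["synthesis_method"].astype("category").cat.codes

print(df)

Note that the record with missing selectivity simply produces a missing yield, which is exactly the kind of gap that the preprocessing steps discussed here must decide how to handle.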
Fig. 5.5 Six key types in data cleansing
Data cleansing, the initial and pivotal step in the data preprocessing pipeline, demands meticulous attention from data scientists. Figure 5.5 presents a comprehensive overview of the six fundamental categories that necessitate thorough examination during the data cleansing process: validity, accuracy, uniformity, completeness, consistency, and duplication. Each category serves a distinct purpose in ensuring the cleanliness and usability of the data for subsequent analyses and machine learning tasks. Validity, the first category, focuses on verifying whether the data adheres to predefined rules or constraints. For instance, suppose the collected data is expected to consist of binary values (0s and 1s), but an unexpected value of “2” surfaces. In such cases, it becomes imperative to investigate the presence of this anomalous value, enabling a deeper understanding of its origin and potential implications. Accurate data is paramount for reliable analyses. Therefore, data scientists must diligently scrutinize the accuracy category. Discrepancies between the collected data and the original data source can occur due to human errors, inconsistencies arising from text mining processes, or other data collection artifacts. Detecting and rectifying these inaccuracies is crucial to maintain the integrity and trustworthiness of the dataset. Uniformity plays a pivotal role in ensuring consistent data representation. This category encompasses assessing the consistency of units, such as Kelvin or Celsius, used throughout the dataset. Inconsistencies in unit usage pose challenges for machines to interpret and process the data accurately. Therefore, establishing and enforcing uniformity in units is essential for eliminating ambiguity and facilitating seamless data analysis.
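A rough sketch of such validity and uniformity checks in pandas might look like the following; the column names, the allowed binary values, and the temperature units are assumptions made purely for illustration.

import pandas as pd

df = pd.DataFrame({"active": [0, 1, 2, 1],                 # expected to contain only 0s and 1s
                   "temp": [773.0, 450.0, 923.0, 500.0],
                   "temp_unit": ["K", "C", "K", "C"]})

# Validity: flag rows whose "active" value falls outside the allowed set {0, 1}
print(df[~df["active"].isin([0, 1])])

# Uniformity: convert Celsius entries to Kelvin so a single unit is used throughout
mask = df["temp_unit"] == "C"
df.loc[mask, "temp"] = df.loc[mask, "temp"] + 273.15
df.loc[mask, "temp_unit"] = "K"
print(df)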
The completeness category focuses on examining whether the dataset contains any missing or blank values. Missing data can arise due to various reasons, such as data collection errors or incomplete records. Addressing these gaps by imputing missing values or applying appropriate data interpolation techniques is crucial to ensure the comprehensiveness and reliability of the dataset. Consistency, as another critical aspect of data cleansing, revolves around verifying the coherence of naming conventions. For instance, data scientists need to determine if different terms or notations are used interchangeably within the dataset. In the context of chemistry, it is essential to ascertain if terms like "methane" and "CH₄" are utilized consistently or interchangeably. Achieving consistency in naming conventions is vital for accurate interpretation and analysis of the data. The final category, duplication, necessitates a comprehensive evaluation of duplicate data instances within the dataset. Identifying and handling duplicates is crucial to avoid skewed analyses or biased model training. It is essential to ascertain whether duplicates are intentional, serving specific purposes, or if they represent independent instances of the same data. Proper management of duplicates ensures the integrity and accuracy of subsequent analyses and prevents redundancy in the dataset.

Data cleansing can be performed using the following four approaches:
1. Observation
2. Python pandas
3. Visualization
4. Machine learning
The initial step in the data cleansing process involves a meticulous manual observation of the data. While this approach demands significant work and effort, it often yields the best results. By immersing oneself in the dataset and carefully examining its contents, data scientists can identify potential anomalies, errors, or inconsistencies that might require cleansing. In addition to manual observation, Python’s powerful data manipulation library, pandas, offers a wide array of functions specifically designed for data cleansing tasks. These functions simplify the process by providing efficient and intuitive ways to handle common data cleaning operations. For instance, pandas offers functions to handle missing values, remove duplicates, transform data types, and apply custom cleansing operations. By leveraging the functionalities of pandas, data scientists can streamline their data cleansing workflow and ensure the accuracy and quality of the dataset. Data visualization techniques also play a crucial role in the data cleansing process. By visualizing the data, patterns, outliers, and inconsistencies can be easily detected. Plots, charts, and graphs can reveal data points that deviate significantly from the expected distribution or exhibit suspicious patterns. Data scientists can employ various visualization techniques, such as scatter plots, box plots, or histograms, to gain insights into the data’s
integrity and identify potential outliers or errors. These visual cues serve as a valuable aid in decision-making during the data cleansing phase. Furthermore, machine learning algorithms can be leveraged to automate the detection of data outliers and anomalies. These algorithms are trained to learn patterns and regularities from a given dataset, and they can identify instances that deviate significantly from the expected behavior. By applying machine learning techniques, data scientists can identify and flag data points that might require further investigation or cleansing. These automated approaches can expedite the data cleansing process and provide valuable insights into potential issues that might have been overlooked during manual inspection.

In practice, the choice of data cleansing methods depends on the specific research context and the characteristics of the dataset. Researchers must evaluate the advantages and limitations of each approach and determine the most suitable method or combination of methods for their data cleansing tasks. In some cases, a manual inspection coupled with pandas functions might suffice, while in other scenarios, a combination of data visualization and machine learning algorithms could be more effective. The selection of the appropriate cleansing methods is a critical decision that ensures the accuracy, reliability, and usability of the data for subsequent analysis and modeling tasks.

An important consideration when dealing with data involves the treatment of string data, which comprises textual information represented by characters. String data often poses challenges during visualization and machine learning processes since many applications are designed to handle numerical variables. However, there are techniques available to address this issue, with one common approach being the utilization of one hot encoding, as depicted in Fig. 5.6. One hot encoding is a method employed to convert text values into categorical variables. This process involves creating binary variables, where a value of 1 is assigned if a particular category is present in the data, and 0 if it is not. For example, let us consider a dataset with a column representing different car models. Using one hot encoding, the car models would be converted into separate categorical variables, such as "Toyota," "Honda," "Ford," etc. If a specific car model is present for a given data instance, the corresponding categorical variable would be assigned a value of 1; otherwise, it would be assigned a value of 0. By employing one hot encoding, text data can be transformed into a numerical format that is suitable for data visualization and machine learning algorithms. This conversion facilitates the inclusion of string data in various analytical tasks.
Fig. 5.6 Read data in csv format using Python pandas
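In pandas, this conversion is often sketched with the get_dummies() function; the hypothetical car-model column below simply mirrors the example above.

import pandas as pd

df = pd.DataFrame({"car_model": ["Toyota", "Honda", "Ford", "Toyota"]})

# One hot encoding: one binary column per category (1 = category present, 0 = absent)
encoded = pd.get_dummies(df["car_model"], prefix="model")
print(pd.concat([df, encoded], axis=1))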
Fig. 5.7 Data duplication techniques in Python pandas
However, it is important to note that one hot encoding may result in an increase in the dimensionality of the dataset, particularly when dealing with a large number of distinct categories. In such cases, feature engineering techniques, such as dimensionality reduction algorithms (e.g., principal component analysis), might be necessary to manage the resulting high-dimensional data effectively. It is worth mentioning that one hot encoding is just one approach to handle string data, and alternative methods exist depending on the specific context and objectives of the analysis. For instance, label encoding can be used to assign unique numerical labels to different categories, or embedding techniques can be employed to represent text data in a continuous vector space. The selection of the most appropriate method depends on factors such as the nature of the data, the requirements of the analysis, and the capabilities of the chosen machine learning algorithms.

Python pandas offers a range of robust techniques that can significantly enhance the quality and integrity of datasets. One such technique involves addressing data duplication, as demonstrated in Fig. 5.7, which visually illustrates the presence of two identical entries for "FeTi" within the dataset. To handle data duplication, pandas provides a powerful command called "drop_duplicates" that simplifies the process of eliminating redundant data instances. By employing this command, duplicated records can be effortlessly identified and removed, ensuring the preservation of data integrity and consistency. Furthermore, the "drop_duplicates" command offers the flexibility to specify the target column(s) on which the duplication check should be performed.
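A minimal sketch of this deduplication step, using a hypothetical composition column containing the duplicated "FeTi" entry of Fig. 5.7, might look like the following.

import pandas as pd

df = pd.DataFrame({"composition": ["FeTi", "NiAl", "FeTi"],
                   "magnetic_moment": [2.1, 0.0, 2.1]})

# Drop exact duplicate rows, keeping the first occurrence (keep="last" is also possible)
print(df.drop_duplicates(keep="first"))

# Alternatively, check duplication only on a chosen column
print(df.drop_duplicates(subset=["composition"]))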
Fig. 5.8 Blank data in Python pandas
In addition to eliminating duplicates, the "drop_duplicates" command also provides the option to retain either the first or the last occurrence of duplicated entries, depending on the specific requirements of the data analysis. This feature allows researchers to exercise control over the deduplication process, enabling them to choose the most appropriate approach for their particular use case.

The presence of blank or missing values poses a significant challenge. Fortunately, Python pandas provides a suite of techniques to effectively handle such data anomalies, ensuring the integrity and reliability of the analysis. Figure 5.8 visually demonstrates the presence of blank values within a dataset. To identify and address blank values, pandas offers the "isna()" command, which allows researchers to check for missing data entries. When applied to a dataset, the "isna()" command evaluates each data point and returns a Boolean value of "True" if the data is missing or blank. This feature enables analysts to efficiently identify and locate areas within the dataset that require attention.

To mitigate the impact of blank data on subsequent analyses, pandas offers the "dropna()" function, which serves as a powerful tool for eliminating blank values from the dataset. By invoking the "dropna()" function, researchers can remove rows or columns that contain missing values, effectively enhancing the reliability and completeness of the dataset. The "dropna()" function can be customized with various parameters to suit specific requirements. For instance, researchers can choose to drop rows or columns that contain any missing values or only those that have a certain threshold of missing data. This flexibility empowers analysts to tailor their data cleansing approach to the unique characteristics of their dataset.
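The following sketch illustrates isna() and dropna() on a small frame with an artificially blank entry; the column names are hypothetical.

import numpy as np
import pandas as pd

df = pd.DataFrame({"composition": ["FeTi", "NiAl", "CoCr"],
                   "selectivity": [55.0, np.nan, 72.0]})

# Locate blank entries: True marks a missing value
print(df.isna())

# Drop any row containing a missing value, or drop affected columns instead
print(df.dropna())
print(df.dropna(axis=1))

# Keep only rows with at least a given number of non-blank entries
print(df.dropna(thresh=2))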
Fig. 5.9 Outlier detection in Python

It is not uncommon to encounter outliers—data points that exhibit significantly different behavior compared to the rest of the dataset. Detecting and addressing these outliers is crucial to ensure the accuracy and reliability of subsequent analyses. One effective method for identifying outliers is through the use of z-scores. The z-score, also known as the standard score, quantifies how many standard deviations a data point deviates from the mean of the dataset. By calculating the z-score for each data point, we can identify values that deviate significantly from the mean, indicating potential outliers. Figure 5.9 visually represents a dataset with a distinct outlier, where most of the data points cluster around 50, while one data point stands at 0.5. To calculate the z-score, we can leverage the statistical functionalities of Python pandas. By subtracting the mean of the dataset from each data point and dividing the result by the standard deviation (that is, z = (x − mean) / standard deviation), we obtain the z-score for each data point. The mean and standard deviation can be easily computed using the "mean()" and "std()" commands in pandas, as demonstrated in Fig. 5.9. The z-score equation, as illustrated in Fig. 5.9, captures the essence of the calculation. By applying this equation to the dataset, we obtain a set of z-scores, with each value indicating the deviation of the corresponding data point from the mean in terms of standard deviations. By setting a threshold, typically around 2 or 3 standard deviations, we can flag data points with z-scores exceeding the threshold as potential outliers. Identifying and addressing these outliers is crucial as they can significantly impact subsequent analyses and distort the overall interpretation of the data. Therefore, integrating z-score analysis into the data preprocessing stage enhances the accuracy and reliability of the subsequent analysis and ensures robust decision-making based on sound data-driven insights.
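A minimal sketch of this z-score screening with pandas, assuming a single numeric column named "value", might be:

import pandas as pd

df = pd.DataFrame({"value": [50.2, 49.8, 50.5, 49.9, 0.5, 50.1]})

# z = (x - mean) / standard deviation, computed column-wise with pandas
z = (df["value"] - df["value"].mean()) / df["value"].std()

# Flag points lying more than roughly 2 standard deviations from the mean
print(df[z.abs() > 2])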
Fig. 5.10 Merging two data in Python pandas

There are instances where merging datasets becomes necessary to gain a comprehensive understanding of the underlying information. Python pandas, a versatile data manipulation library, provides powerful capabilities to facilitate seamless merging of datasets. Consider Fig. 5.10, which visually represents two separate datasets, each with its own set of columns and values. Using pandas, these datasets can be effortlessly read into memory, as demonstrated in the figure. Once loaded, the merge command in pandas becomes invaluable for combining the datasets based on a specified column or key. This merging process enables researchers to integrate the information from both datasets into a single, unified dataset, thereby facilitating a holistic analysis. By leveraging the merge command in pandas, users can efficiently combine datasets by specifying the column(s) that serve as common identifiers. This allows for the creation of a merged dataset that consolidates information from multiple sources. Whether it is merging based on a shared primary key, joining on common attributes, or combining data through other logical relationships, pandas offers flexible functionality to accommodate a variety of merging scenarios.

Undoubtedly, pandas is a powerful tool that empowers researchers and data scientists to manipulate and transform data with ease. However, it is crucial to emphasize the importance of thoroughly examining and understanding the data before applying advanced techniques such as machine learning or data visualization. Much like researchers meticulously prepare samples in experiments, having a deep comprehension of the dataset through careful observation is essential. This initial step provides researchers with the necessary context to make informed decisions during subsequent analyses. Once the data has been thoroughly inspected and understood, advanced data science techniques, including machine learning and visualization, can be effectively employed, and the integration of pandas as a data manipulation tool alongside these techniques enhances the overall analytical workflow, enabling researchers to derive meaningful insights from their data. It is important to be able to read and observe the data with one's own eyes, much as researchers prepare samples precisely in experiments; only then do advanced data science techniques such as machine learning and visualization become useful.
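Before moving on, a minimal sketch of the merging workflow described above, with hypothetical column names, might look like:

import pandas as pd

df1 = pd.DataFrame({"composition": ["FeTi", "NiAl"], "magnetic_moment": [2.1, 0.0]})
df2 = pd.DataFrame({"composition": ["FeTi", "CoCr"], "band_gap": [0.0, 1.2]})

# Merge on the shared key; how="inner" keeps only compositions present in both frames
merged = pd.merge(df1, df2, on="composition", how="inner")
print(merged)

Changing how to "outer", "left", or "right" controls which unmatched rows are retained in the merged result.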
5.4 Conclusion
In this chapter, the intricate nature of data consistency within this field is explored. It becomes apparent that the consistency of materials and catalysts data can vary significantly, primarily due to the diverse methodologies employed in data generation throughout the years. Consequently, it becomes imperative to possess a comprehensive understanding of the data’s intricacies before engaging in subsequent data visualization and machine learning tasks. To achieve this level of comprehension, data preprocessing emerges as the linchpin of success in materials and catalysts informatics. The significance of data preprocessing cannot be overstated, as it plays a pivotal role in transforming raw data into a structured format that is suitable for further analysis. Within the realm of materials and catalysts informatics, several key concepts and techniques constitute the foundation of data preprocessing, and one such vital aspect is data cleansing. Data cleansing, a fundamental process within data preprocessing, entails identifying and rectifying issues within the dataset to ensure its accuracy and reliability. Python’s renowned library, pandas, comes to the forefront by offering an array of powerful tools specifically designed to facilitate the data cleansing process. Leveraging the capabilities of pandas, researchers and data scientists can effectively address data inconsistencies, inaccuracies, missing values, and outliers that might otherwise hinder the integrity and usability of the dataset.
By employing the robust data preprocessing functionalities provided by pandas, practitioners can efficiently cleanse and transform their materials and catalysts data. These preprocessing techniques enable researchers to eliminate duplicate entries, handle missing values, address inconsistent units or formatting, and ensure overall data quality. By adhering to these meticulous data preprocessing practices, researchers can establish a solid foundation upon which subsequent data visualization and machine learning tasks can be performed with confidence.

It is crucial to recognize that data preprocessing serves as a critical stepping stone in the materials and catalysts informatics journey. Python's pandas library not only offers a rich assortment of tools for data preprocessing but also empowers researchers to seamlessly navigate the complexities of their datasets. By harnessing the power of pandas and embracing the essentiality of data preprocessing, researchers can unlock valuable insights, uncover hidden patterns, and derive meaningful conclusions from their materials and catalysts data.

In summary, within the domain of materials and catalysts informatics, data consistency poses a formidable challenge due to the diverse nature of data generation techniques. To tackle this challenge effectively, a deep understanding of the data is paramount before embarking on subsequent data visualization and machine learning endeavors. This is where data preprocessing emerges as a critical component, with data cleansing being a key technique within this process. Python's pandas library offers a comprehensive suite of powerful tools to facilitate the data cleansing and preprocessing tasks, enabling researchers to refine their materials and catalysts data into a reliable and structured format. By prioritizing data preprocessing, researchers can pave the way for success and uncover valuable insights that drive advancements in the field of materials and catalysts informatics.
Questions
5.1 Why is data consistency important in the field of materials and catalysts informatics?
5.2 What is the role of data preprocessing in materials and catalysts informatics?
5.3 What is data cleansing, and why is it a fundamental process in data preprocessing?
5.4 How does Python's pandas library contribute to data cleansing in materials and catalysts informatics?
5.5 What are the benefits of performing data preprocessing with pandas in materials and catalysts informatics?
5.6 Why is access to a substantial and well-curated repository of high-quality data essential for data science in material and chemical research?
5.7 What are the crucial steps involved in collecting and preparing data for data science applications?
5.8 What is the purpose of data cleaning in the data preparation process?
5.9 How does data transformation contribute to the data preparation process?
5.10 What is the role of data integration in the data preparation process?
5.11 Why is data considered indispensable for data science?
5.12 What are the two major categories of data sources for materials and catalysts data?
5.13 What are the advantages of utilizing existing data from literature and patents?
5.14 What is text mining, and how can it contribute to data acquisition?
5.15 What are the different approaches to literature data acquisition?
5.16 What is text mining, and how does it contribute to data collection?
5.17 What are some challenges associated with text mining?
5.18 How does data validation contribute to the text mining workflow?
5.19 What are some notable tools and libraries used in text mining?
5.20 What are the options for data collection other than text mining?
5.21 What is the conventional method for acquiring data?
5.22 What are some techniques involved in experimental data collection?
5.23 How can researchers benefit from collaborating with third-party entities for data acquisition?
5.24 What is the concept of high throughput in data creation?
5.25 What is the difference between low throughput and high throughput approaches?
5.26 What is the concept of high throughput in materials and catalyst informatics?
5.27 How does high throughput differ from low throughput experiments?
5.28 What advantages does high throughput offer in materials and catalyst informatics?
5.29 How does the consistency of data generated through high throughput methodologies benefit materials and catalyst informatics?
5.30 What are the potential drawbacks of using high throughput methodologies in data generation?
5.31 Why is the data preprocessing stage considered the most important in materials and catalyst informatics?
5.32 What is data cleansing in the context of data preprocessing?
5.33 What is the role of data labeling in data preprocessing?
5.34 What is data augmentation, and why is it important in the preprocessing workflow?
5.35 What is the purpose of data aggregation in the data preprocessing stage?
5.36 Why is comprehensive data preprocessing necessary, beyond addressing data inaccuracies and corruption?
5.37 What are some challenges arising from the absence of standardized rules in data collection?
5.38 What is the role of data preprocessing in optimizing data for analysis and machine learning algorithms?
5.39 How can differences in data labeling conventions impact data discoverability and interpretability?
5.40 How can the adoption of standardized data recording methods and clear labeling practices enhance data utilization and collaboration?
5.41 Why is the relationship between yield, selectivity, and conversion important in catalysis?
5.42 How can missing or incomplete data be handled in datasets with inconsistencies in selectivity reporting?
5.43 What challenges arise when incorporating string-based labels into computational analyses and visualizations?
5.44 How can the lack of standardized rules and guidelines hinder data integration and collaboration?
5.45 What are the six fundamental categories involved in data cleansing during the data preprocessing pipeline?
5.46 What are two methods used for data cleansing tasks?
5.47 How can data visualization aid in the data cleansing process?
5.48 What is the role of machine learning algorithms in data cleansing?
5.49 What factors influence the choice of data cleansing methods?
5.50 How can string data be transformed into a numerical format for visualization and machine learning algorithms?
5.51 How can pandas handle blank or missing values in a dataset?
5.52 What does the z-score measure in the context of outlier detection?
5.53 What are the customizable parameters of the "dropna()" function in pandas?
5.54 How does the merge command in pandas facilitate dataset merging?
5.55 Why is it important to thoroughly examine and understand the data before applying advanced techniques?
6 Data Visualization
Abstract
Once the data has undergone preprocessing and preparation, the subsequent stage in the realm of materials and catalysts informatics involves a crucial step known as data visualization. This process plays a pivotal role in unraveling various factors pertaining to the data, including its structure, frequency, bias, and variance. By employing visual techniques, valuable insights can be gleaned prior to the application of machine learning algorithms or further data science methodologies. Consequently, this chapter delves into the realm of data visualization within the specific context of materials and catalyst informatics, utilizing the versatile programming language Python as a powerful tool in this endeavor.

Keywords
Data visualization · Matplotlib · Seaborn · Scatter plot · Multidimensional plotting · Parallel coordinates
• Learn the basics of Matplotlib and how to use several powerful techniques • Learn the basics of Seaborn and how to visualize multidimensional data
6.1 Introduction
Data visualization is a crucial aspect of data science, as it involves presenting complex data in a graphical or visual format, making it easier to understand, analyze, and communicate insights. Visualization helps data scientists and analysts explore patterns, trends, and relationships within datasets, allowing them to make informed decisions and
present findings effectively. Here, we cover two popular Python libraries used for data visualization: Matplotlib and Seaborn. Matplotlib is a widely used data visualization library in Python. It provides a comprehensive set of tools for creating a wide range of static, interactive, and publication-quality plots and charts. Matplotlib is highly customizable, allowing data scientists to adjust various aspects of their visualizations like colors, labels, and plot styles. This flexibility makes it a valuable tool for creating a variety of plots, from simple line charts to complex 3D visualizations. On the other hand, Seaborn is built on top of Matplotlib and offers a higher-level interface for creating aesthetically pleasing statistical graphics. It simplifies the process of generating complex visualizations like heatmaps, pair plots, and violin plots, making it especially useful for data exploration and presentation. Seaborn also comes with built-in themes and color palettes, making it easy to create visually appealing plots with minimal effort. Data visualization is vital in data science for several reasons. Firstly, it helps data exploration by providing a visual context for the information contained within datasets, making it easier to identify outliers, trends, and patterns. Secondly, it simplifies the communication of insights to non-technical stakeholders, allowing data scientists to convey complex findings in an easily digestible format. Moreover, visualizations play a crucial role in model evaluation and validation, helping data scientists assess the performance of machine learning models and identify areas for improvement. Data visualization is, therefore, a powerful tool that enhances the data science process, from data preprocessing and analysis to model interpretation and reporting.
6.2 Matplotlib
To initiate our exploration, it is essential to acquaint ourselves with two indispensable Python libraries that greatly facilitate the process of data visualization: matplotlib and seaborn. The matplotlib library (available at https://matplotlib.org/) provides a comprehensive suite of functionalities for generating an array of visual representations, encompassing anything from simple scatter plots to intricate three-dimensional plots. On the other hand, seaborn (accessible at https://seaborn.pydata.org/) builds upon the foundation of matplotlib and introduces a plethora of advanced plotting capabilities. These encompass a wide range of sophisticated visualization techniques, including the incorporation of statistical information, color enhancements, and the integration of diverse plot types. Consequently, this chapter aims to showcase an extensive array of plots, ranging from rudimentary to intricate, while simultaneously delving into the intricacies of extracting valuable insights from these visualizations.

Basic Plot

Basic plotting is introduced using the Python library matplotlib as shown in Fig. 6.1.
Fig. 6.1 Basic plot in Python
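A minimal sketch of the kind of code shown in Fig. 6.1, following the description below, might look like this (the exact styling used in the figure may differ):

import numpy as np
import pylab as plt

# 30 equidistant points between -3 and 3, and their exponential values
x = np.linspace(-3, 3, 30)
y = np.exp(x)

plt.plot(x, y)
plt.show()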
Figure 6.1 showcases an illustrative example, encompassing both the fundamental Python code and its corresponding output. Let us closely examine the intricacies of the code. To commence, we initiate the process by importing two indispensable libraries: numpy and pylab. The inclusion of numpy is of utmost significance, as it provides a vast array of mathematical functions that prove to be instrumental in the domain of plotting. Moreover, we import pylab, a library that loads matplotlib and facilitates its use. Once these libraries are successfully imported, we proceed to utilize the powerful linspace() function provided by numpy. This function enables the generation of a sequence comprising 30 equidistant data points for the variable x, spanning the range from −3 to 3. In parallel, we calculate the exponential values of x to obtain the corresponding y values. Subsequently, the plot() function is invoked to establish the relationship between the variables x and y, thereby generating the desired plot. It forms the core of the plotting process, allowing us to visually perceive the interplay between the two variables. Finally, the show() command is executed, culminating in the visualization of the plot. Hence, this exemplary code effectively illustrates that even with a relatively simple Python code, it is indeed possible to create informative and visually appealing basic plots.

The iris flower example is a well-known and widely used dataset in the field of data visualization, exemplifying the power of Python for visual analysis. Figure 6.2 portrays a visualization of this dataset, providing valuable insights into its characteristics. The dataset comprises data pertaining to three distinct types of iris flowers: iris setosa, iris versicolor, and iris virginica. These types are typically classified based on specific measurements,
namely the length and width of the sepal and petal.

Fig. 6.2 Loading iris dataset from scikit learn

With the iris dataset at hand, the fundamental question arises: how can we discern and classify the features of each iris type based on the available data? By leveraging the power of data visualization, we can effectively explore and understand the distinguishing attributes of each iris type. Through appropriate visualization techniques, we gain valuable insights into the relationships and patterns within the dataset, allowing for informed classification and analysis. Thus, the iris flower example serves as an influential case study, illustrating the potential of data visualization in unraveling complex datasets and facilitating the comprehension and classification of intricate features within the data.

Let us start by plotting the iris data using a simple line plot, with Fig. 6.3 as a guide. The iris data consists of five columns where sepal length, sepal width, petal length, petal width, and corresponding types are presented. To start, the pandas library is imported along with the data file iris_data.csv under variable "data." Next, matplotlib and numpy are imported within the first two lines. The plot() command (here, referred to by the label "plt" we defined when we first imported
the matplotlib library) is then used to generate the simple line graph that is visualized in Fig. 6.3.

Fig. 6.3 Line plot for iris data
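A minimal sketch of such a line plot is given below; the file name follows the text, while the exact column names inside iris_data.csv are assumptions.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("iris_data.csv")

# Plot raw columns as connected lines, as in Fig. 6.3 (column names assumed)
plt.plot(data["petal_length"], data["petal_width"])
plt.xlabel("petal length")
plt.ylabel("petal width")
plt.show()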
The primary objective of visualizing the iris data is to discern and capture the underlying trends associated with the classification of the three distinct iris types. However, upon examining the line graph presented in Fig. 6.3, it becomes apparent that it is exceedingly challenging to discern any significant trends from the visual representation. Instead of displaying a coherent plot, the line graph appears as scattered and unorganized scribbles, failing to convey the intended information effectively. Nevertheless, upon closer inspection, it is possible to glean certain observations from the plot. Notably, there seem to be two distinct clusters emerging around the petal length threshold of approximately 2, indicating potential groupings within the data. Additionally, there appears to be a potential third cluster around the threshold where the petal length is roughly 4.5. While these indications provide some insights into the potential trends within the dataset, it becomes evident that the line plot falls short in visualizing more intricate and detailed patterns. Consequently, it is crucial to acknowledge that the line plot is not an optimal visualization technique for this particular dataset. To truly capture and comprehend the underlying trends and relationships, alternative visualization methods need to be explored, allowing for a more comprehensive and informative depiction of the iris data.

What sort of plot can we, then, consider to be suitable? Perhaps if we replaced the lines of Fig. 6.3 with simple data points, the graph's usefulness may improve. In Fig. 6.4, we delve into the world of data visualization with a demonstration of a scatter plot, accompanied by the relevant code. This graphical representation serves as a significant departure from the line plot showcased in Fig. 6.3, introducing a more sophisticated and informative visualization technique. The primary distinction lies in the choice of the plotting command, which shifts from plot() to scatter(). By leveraging the scatter() command, we unlock a powerful visual tool that allows for the clear depiction of individual data points, replacing the interconnected lines observed in the line plot. This transformation significantly enhances the interpretability of the plot and facilitates a more comprehensive understanding of the underlying data patterns. Notably, when we examine the scatter plot, a distinct pattern emerges, revealing the presence of at least two major groups distinguished by petal lengths. Upon closer examination, one can observe the formation of a group characterized by petal lengths of 2 or less, while another group emerges for cases where the petal length exceeds 3. This refined visualization approach enables us to discern and appreciate the intricate relationships and trends within the iris dataset. It provides a rich visual exploration of the data, unveiling previously unseen patterns and facilitating the identification of distinctive clusters based on petal lengths. Therefore, through the utilization of scatter plots, we gain deeper insights into the underlying structure of the iris dataset. This enhanced visualization technique enables a more nuanced analysis, empowering researchers and practitioners to make informed decisions based on the clear and visually apparent patterns within the data.

Fig. 6.4 Scatter plot for iris data

The incorporation of color as a third dimension of information in data visualization adds a new level of clarity and insight.
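A sketch of the scatter version, including the color-by-type variant discussed next, might look like the following; the column names and the way the type column is encoded are assumptions.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("iris_data.csv")

# Encode the iris type as integer codes so it can be passed as a color value
colors = data["type"].astype("category").cat.codes

plt.scatter(data["petal_length"], data["petal_width"], c=colors)
plt.xlabel("petal length")
plt.ylabel("petal width")
plt.show()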
By assigning colors based on the iris type, it becomes possible to differentiate between the three types present in the dataset. Figure 6.5 showcases the updated scatter plot, where each data point is now represented by one
of three distinct colors, corresponding to iris setosa, iris versicolor, and iris virginica. With this additional dimension of color, the plot reveals a more pronounced association between the three groups and petal length. It becomes evident that petal length significantly influences the clustering of the iris types. Notably, a third distinct separation becomes apparent, indicating that two groups diverge when petal lengths reach approximately 5 cm. The utilization of a scatter plot, coupled with the integration of color as a third dimension, expedites the identification and understanding of critical factors in iris classification. The visualization clearly highlights the significance of petal lengths at approximately 2.5 cm and 5 cm, shedding light on their pivotal roles in distinguishing and categorizing the different iris types. By employing this enhanced visualization technique, we gain valuable insights into the relationships between petal lengths and iris classification. This knowledge has significant implications for researchers and practitioners seeking to leverage data-driven approaches in identifying and characterizing iris species accurately.

Fig. 6.5 Scatter plot for iris data

In the field of informatics, the term often connotes the utilization of advanced machine learning techniques. However, the example presented here serves as a compelling demonstration of the significant role that data visualization plays in extracting meaningful
insights from raw data. While machine learning methods are undoubtedly powerful, data visualization in itself holds tremendous value as a technique for uncovering hidden trends and patterns. The scatter plot discussed in this context exemplifies how visualization techniques can shed light on the intricate relationships within a dataset. By visually examining the distinct groupings based on petal lengths, we can make informed predictions about the iris setosa type when encountering cases where petal lengths are below 2. This understanding is made possible solely through the effective visualization of the data, without relying on complex machine learning algorithms. Moreover, data visualization is not limited to exploratory analysis; it also enables us to go beyond mere observations and engage in predictive and design-oriented tasks. Armed with the knowledge gained from the scatter plot, we can confidently make predictions about the likely classification of iris flowers based on their petal lengths. This newfound understanding empowers researchers and practitioners to make informed decisions and formulate strategies that leverage the insights gained through visualization. It is important to recognize that data visualization serves as a valuable and independent means of extracting knowledge from data. While machine learning algorithms are undeniably powerful in their own right, they often require substantial computational resources and may not always be necessary for every analysis. In contrast, data visualization offers a more intuitive and accessible approach, providing a visual medium through which patterns and relationships can be
discerned and communicated effectively. Therefore, data visualization holds a prominent place in the realm of informatics, as it not only reveals the underlying structure of data but also equips us with the ability to make informed predictions and design effective solutions. By harnessing the power of visualization, researchers and practitioners can leverage the inherent patterns within datasets to gain valuable insights and drive impactful decision-making processes.

Fig. 6.6 Scatter plot for CH₄ conversion and C₂ selectivity in the methane oxidation reaction. Color bar is C₂ yield

Let us explore a case relating to catalysts. The catalyst data utilized in this study focuses on the oxidative coupling of methane (OCM) reaction, as highlighted in the work by Takahashi et al. [5]. To gain insights into this catalytic process, a scatter plot is employed, visualized in Fig. 6.6, with the intention of extracting valuable knowledge from the data. In essence, the OCM reaction aims to directly convert methane (CH₄) into C₂ compounds, specifically ethylene (C₂H₄) and ethane (C₂H₆). By plotting the OCM literature data, a three-dimensional representation is achieved, incorporating the CH₄ conversion (%) as the x-axis, C₂ selectivity (%) as
the y-axis, and the C₂ yield (%) as the color bar. This multidimensional visualization approach offers a comprehensive view of the OCM reaction, allowing for a more nuanced analysis of the interplay between CH₄ conversion, C₂ selectivity, and C₂ yield. The scatter plot effectively captures the relationship between these variables, facilitating a deeper understanding of the underlying catalyst performance. The incorporation of the color bar as a visual dimension further enhances the plot, enabling the identification of regions associated with higher or lower C₂ yield. This additional dimension provides valuable insights into the overall efficiency of the catalytic process and offers clues for optimizing reaction conditions. Through the utilization of this scatter plot, researchers and practitioners gain valuable knowledge about the OCM reaction. The visual representation facilitates the identification of optimal operating conditions by examining the relationship between CH₄ conversion, C₂ selectivity, and C₂ yield. By leveraging this data visualization technique, scientists can uncover trends, patterns, and correlations, paving the way for the development of more efficient catalysts and enhanced process design in the realm of OCM.

Even a simple scatter plot can reveal a wealth of valuable information and insights. In the case of the OCM reaction data, the scatter plot offers several key observations that aid in understanding the relationship between different variables. Firstly, the tradeoff between CH₄ conversion and C₂ selectivity becomes apparent from the plot. As one variable increases, the other tends to decrease, indicating an inherent compromise between these two factors. This insight provides crucial knowledge for researchers and practitioners to optimize the OCM process by striking a balance between CH₄ conversion and C₂ selectivity. Moreover, the scatter plot serves as a means to visualize the statistical distribution of the data. By observing the concentration of data points in specific regions, such as low CH₄ conversion and C₂ selectivity, important trends can be identified. Furthermore, the observation that there is a scarcity of data points in regions of high C₂ yield highlights a potential bias in the dataset. Recognizing such biases is critical for robust analysis and decision-making, as it prompts researchers to investigate the underlying causes and refine data collection strategies. By utilizing this basic scatter plot, a multitude of insights can be gleaned regarding the trends and patterns inherent within the OCM reaction data. This visualization technique allows researchers to grasp the inherent tradeoffs, identify statistical patterns, and uncover potential biases, all of which contribute to a more comprehensive understanding of the catalyst performance and pave the way for further investigations and improvements. Hence, the simplicity and effectiveness of the scatter plot as a visualization tool make it an indispensable asset in the analysis and interpretation of complex datasets, providing a platform to derive meaningful insights and inform decision-making processes.

Additional Basic Plots

Let us explore other basic plots. The bar plot is a valuable visualization technique that presents data as bars, making it particularly useful for comparing variables, as demonstrated in Fig. 6.7. In this example, the bar plot visualizes the magnetic moments of two-dimensional materials consisting of two elements, A and B.
Fig. 6.7 Bar chart in two-dimensional material data (magnetic moment vs. composition A in two-dimensional AB2)
one can observe trends related to the magnetic moments of these materials. For instance, it becomes apparent that two-dimensional materials containing Mn tend to exhibit larger magnetic moments. Conversely, materials containing elements such as Sn and Pt tend to be non-magnetic. This visual representation enables researchers to quickly identify and compare the magnetic properties associated with different elements, facilitating further analysis and investigation in the field of materials science. Another important graph to consider is the histogram, as depicted in Fig. 6.8. The histogram provides insights into the distribution
Fig. 6.8 Histogram in two-dimensional material data (number of occurrences of each element as composition A in two-dimensional AB2)
of data for various elements in two-dimensional materials. By analyzing the histogram, one can observe the frequency or occurrence of specific elements within the dataset. In the case of Fig. 6.8, it becomes evident that the data distribution of certain elements is not uniform. For instance, the number of occurrences of Sn and Ta data appears to be relatively less compared to other elements. This discrepancy in data distribution indicates a bias within the dataset, where some elements are underrepresented. Such insights gained from the histogram allow researchers to take the data distribution into account and consider
Fig. 6.9 Pie chart of electronegativity (Pauling scale) in H, Li, Ti, and Cl (H 28%, Li 12%, Ti 20%, Cl 40%)
potential biases when interpreting the results. Both the bar plot and the histogram offer valuable information regarding data distribution and trends. The bar plot facilitates a comparison between variables, while the histogram provides a visual representation of the frequency distribution of specific elements. These visualization techniques play a crucial role in uncovering patterns, identifying biases, and extracting meaningful insights from the data, enabling researchers to make informed decisions and draw accurate conclusions in various scientific domains. Next, let us explore the pie chart. The pie chart is a visualization tool that effectively represents data distribution, as demonstrated in Fig. 6.9. In this example, a pie chart is used to visualize the distribution of electronegativity values (on the Pauling scale) for four elements, H, Li, Ti, and Cl, stored as a list. By examining the pie chart, one can easily perceive the relative proportions of the data within each category. In Fig. 6.9, it becomes apparent that H and Cl have larger electronegativity compared to Ti and Li, as they occupy larger portions of the pie chart. This intuitive representation allows researchers to quickly grasp the distribution patterns and make visual comparisons between different categories. While it is possible to calculate data distribution using libraries such as pandas and numpy, the pie chart provides a more intuitive and visually appealing representation. It simplifies the interpretation of data distribution by presenting the proportions in a clear and concise manner. This enables researchers to gain immediate insights into the relative magnitudes or significance of different categories. The use of a pie chart not only facilitates the understanding of data distribution but also enhances the communication of findings to a broader audience. Its
Fig. 6.10 Stackplot in periodic table information of H, Li, Ti, Cl, Mn, Au
visual nature enables viewers to grasp the distribution patterns and draw conclusions more readily. Consequently, the pie chart serves as a valuable tool for exploratory data analysis, allowing researchers to intuitively recognize trends, patterns, and disparities within the dataset. The pie chart is a powerful visualization technique that effectively showcases data distribution. By presenting the proportions of different categories in a visually appealing manner, it enables researchers to intuitively recognize patterns and draw conclusions about the dataset. The stack plot, also known as the stacked area plot, is a valuable visualization tool provided by matplotlib. It allows for the stacking of multiple plots on top of each other, creating a cumulative representation of data, as demonstrated in Fig. 6.10. To illustrate the usage of the stack plot, let us consider an example where we generate data for several
atomic elements. We begin by creating a list of atomic elements such as H, Li, Ti, Cl, Mn, and Au. Next, we define three categories, namely atomic number, electronegativity, and atomic radius, and assign corresponding values for each category to the atomic elements. Once the data is prepared, we can plot the stack plot. In Fig. 6.10, the x-axis represents the three categories (atomic number, electronegativity, and atomic radius), and the atomic elements (H, Li, Ti, Cl, Mn, and Au) are stacked on top of each other, with each element assigned a specific color. By visually inspecting the stack plot, we can observe clear trends and patterns that emerge within the atomic elements. The stack plot allows us to visualize the cumulative contribution of each element to the overall composition across the categories. By examining the stacked areas, we can discern the relative proportions of each element within the different categories. This visualization technique facilitates the identification of trends and patterns that may exist among the atomic elements. Through the stack plot, we can gain insights into how different categories (such as atomic number, electronegativity, and atomic radius) vary and interact with each other across the atomic elements. It provides a comprehensive view of the data, highlighting both the individual contributions of each element and the collective trends that arise when considering them together. The stack plot is a powerful visualization tool that allows for the simultaneous representation of multiple plots, revealing trends and patterns within the data. By stacking the plots on top of each other and assigning distinct colors, the stack plot provides a clear and intuitive understanding of the relationships and contributions among different elements or categories.

Lastly, matplotlib can display multiple different plots simultaneously using the subplot() command. Let us say, for example, that we want to display four different plots simultaneously. Here, we use subplot(x,y,z) as shown in Fig. 6.11 where x, y, and z
Fig. 6.11 Subplot to plot multiple graphs in matplotlib
represent the number of rows, columns, and location. If we want to plot six plots (two rows and three columns), then we input subplot(2,3,z) where z is the place on the canvas that you want to put the plots. By doing so, we can compare several different plots within the same window simultaneously, which can help shed light on new information that may be found within the data.
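As a rough sketch of how the basic plots in this section can be combined on a single canvas, the following example arranges a bar plot, a histogram, a pie chart, and a stack plot using subplot(2,2,z). The element lists and property values are hypothetical stand-ins rather than the exact data behind Figs. 6.7–6.11, so adjust them to your own dataset.

```python
import matplotlib.pyplot as plt

# Hypothetical stand-in data; replace with your own dataset
elements = ["H", "Li", "Ti", "Cl", "Mn", "Au"]
magnetic_moment = [0.0, 0.4, 1.2, 0.1, 2.8, 0.0]          # illustrative values
electronegativity = [2.20, 0.98, 1.54, 3.16, 1.55, 2.54]  # Pauling scale
atomic_number = [1, 3, 22, 17, 25, 79]
atomic_radius = [25, 145, 140, 100, 140, 135]              # approximate, in pm

fig = plt.figure(figsize=(10, 8))

# subplot(rows, columns, position): a 2 x 2 canvas, positions 1-4
plt.subplot(2, 2, 1)                      # bar plot (cf. Fig. 6.7)
plt.bar(elements, magnetic_moment)
plt.ylabel("Magnetic Moment")

plt.subplot(2, 2, 2)                      # histogram (cf. Fig. 6.8)
plt.hist(electronegativity, bins=5)
plt.xlabel("Electronegativity")
plt.ylabel("Frequency")

plt.subplot(2, 2, 3)                      # pie chart (cf. Fig. 6.9)
plt.pie(electronegativity[:4], labels=elements[:4], autopct="%1.0f%%")

plt.subplot(2, 2, 4)                      # stack plot (cf. Fig. 6.10)
plt.stackplot(range(len(elements)), atomic_number, electronegativity,
              atomic_radius,
              labels=["atomic number", "electronegativity", "atomic radius"])
plt.legend(loc="upper left")

plt.tight_layout()
plt.show()
```

Calling subplot() before each plotting command simply switches the active position on the canvas, which is what allows the four plots to be compared side by side in one window.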
6.3 Seaborn
Throughout our exploration of matplotlib, we have only scratched the surface of its capabilities. The library’s true potential lies in its extensive customization options and the ability to create a wide range of visualizations tailored to specific requirements. By harnessing the full power of matplotlib, users can unlock nearly limitless possibilities in data visualization. Matplotlib offers a plethora of customization options that allow users to fine-tune every aspect of their plots. From adjusting colors, markers, and line styles to modifying axis scales, labels, and ticks, every element of a plot can be meticulously tailored to meet specific design and analysis goals. Furthermore, matplotlib supports the creation of complex figures with multiple subplots, annotations, and legends, enabling the creation of rich and informative visual narratives. However, matplotlib is just the tip of the iceberg when it comes to advanced data visualization. To further expand our visualization repertoire, we can leverage additional libraries such as seaborn. Seaborn builds upon matplotlib’s foundation and provides a higher-level interface that simplifies the creation of more sophisticated and visually appealing plots. With seaborn, users can effortlessly generate visually stunning statistical visualizations, such as distribution plots, regression plots, and categorical plots, with just a few lines of code. Moreover, seaborn integrates seamlessly with pandas, another powerful data analysis library, allowing for seamless data manipulation and visualization. By combining the capabilities of matplotlib, seaborn, and pandas, analysts can unlock a comprehensive toolkit for exploring, analyzing, and communicating complex datasets effectively. While our journey with matplotlib has provided us with a solid understanding of its fundamentals, there is still a vast world of advanced customization and visualization techniques awaiting exploration. By delving deeper into matplotlib’s extensive documentation and exploring complementary libraries like seaborn, we can unlock the full potential of data visualization and uncover valuable insights hidden within our data. One of the standout features of seaborn is its ability to enhance and augment matplotlib visualizations. This becomes evident when we examine the oxidative coupling of methane reaction data and leverage seaborn’s capabilities to uncover deeper insights. In Fig. 6.12, we observe a histogram depicting the data distribution of C.2 yield. The histogram alone provides a glimpse into the concentration of C.2 yield values, which appears to be primarily clustered within the range of 0–20. However, seaborn takes it a step further by incorporating kernel density estimation, represented by the smooth line superimposed on the histogram. This additional layer of information offers a more
Fig. 6.12 Dist plot in seaborn (distribution of C2 yield with kernel density estimate)
comprehensive understanding of the distribution, revealing nuances and characteristics that might have otherwise gone unnoticed. Similarly, seaborn’s joint plot, showcased in Fig. 6.13, combines the power of scatter plots and data distribution visualization. Here, the joint plot illustrates the relationship between CH.4 conversion and C.2 selectivity. At a
Fig. 6.13 Joint plot in seaborn (C2 selectivity vs. CH4 conversion)
glance, we can discern an inverse proportionality between these variables. Additionally, the joint plot unveils a notable bias in the data, as most data points are concentrated in areas of low CH.4 conversion. However, what truly sets the joint plot apart is the inclusion of a histogram depicting the data distribution. This histogram instantaneously exposes the underlying biases and patterns, shedding light on the overall distribution characteristics. By utilizing seaborn’s advanced visualization capabilities, we gain deeper insights into
Fig. 6.14 Pairplot in seaborn
the interplay between variables and the distribution of data. These added effects provide a comprehensive and holistic view, enabling us to discern complex relationships, identify biases, and extract valuable information. Seamlessly integrating with matplotlib, seaborn empowers analysts to go beyond the basic visualizations and explore data from multiple angles, ultimately facilitating more informed decision-making and data-driven insights. In matplotlib, users have the flexibility to selectively choose which plots to display using the subplot() command. However, seaborn offers a powerful alternative called pairplot(). This versatile command takes a comprehensive approach by presenting all requested variables along with their data distributions, providing a holistic view of the data structure and enabling users to uncover trends and patterns more effectively. As illustrated in Fig. 6.14, the pairplot() command showcases a matrix of plots that encompasses all the relevant variables. This visual arrangement allows users to explore the relationships and dependencies between different variables, gaining valuable insights into their interactions. By displaying multiple plots side by side, pairplot() offers a unique perspective that enables users to discern intricate patterns, trends, and data structures that may not be readily apparent in individual plots. Examining Fig. 6.14, we can glean important information from the data. For instance, we observe that the temperature variable tends to cluster around the 1000 mark, suggesting a specific range of values in the dataset. Furthermore,
Fig. 6.15 Pairwise correlations in Python (Temperature, O2-pressure, CH4-pressure, CH4-conversion%, and C2-selectivity; pair-wise correlation values range from −1.00 to 1.00)
the distribution of CH.4 conversion highlights that the majority of values lie below 50, providing an indication of the overall distribution characteristics. By leveraging the power of pairplot(), users can unlock deeper insights and derive a more comprehensive understanding of their data. This command serves as a valuable tool for visual exploration, facilitating the identification of complex relationships and unveiling meaningful patterns. Whether it is seeking correlations, identifying outliers, or unraveling intricate structures within the dataset, pairplot() empowers users to leverage the full potential of their data and make informed decisions based on a richer visual representation. In the realm of materials and catalysts informatics, the Pearson correlation coefficient emerges as a fundamental visualization technique. This coefficient serves as a valuable tool for quantifying the linear relationship between variables. To illustrate its utility, let us delve into the realm of oxidative coupling of methane (OCM) reaction data. Figure 6.15 showcases the Pearson correlation map of OCM data, offering a comprehensive visual representation of the interplay between various variables. Through this map, we can readily
discern intriguing relationships, such as the inverse relation between CH.4 conversion and CH.4 pressure. This finding suggests that as CH.4 pressure increases, CH.4 conversion tends to decrease. Conversely, an increase in CH.4 pressure coincides with an elevation in C.2 selectivity. These observations underscore the intricate dynamics at play within the OCM reaction system and provide valuable insights for further analysis and interpretation. It is important to note that the Pearson correlation coefficient does not adhere to a strict threshold for defining the strength of a relationship. However, conventionally, a correlation coefficient of approximately 0.8 or .−0.8 is considered relatively strong, indicating a robust linear relationship. Conversely, a coefficient around 0.3 would be deemed a weak correlation, suggesting a less pronounced association between the variables under consideration. Interpreting these correlation coefficients provides researchers with valuable guidance for discerning significant trends and drawing meaningful conclusions. By applying Pearson correlation coefficient and its visualization in the form of a correlation map, researchers in materials and catalysts informatics can gain valuable insights into the intricate relationships within their data. These insights serve as a foundation for further analysis, exploration, and decision-making, guiding the development of advanced materials, catalysts, and innovative solutions in various domains. Utilizing the Pearson correlation coefficient analysis provides researchers with a valuable initial assessment of the correlations present within their data. While this analysis serves as a rough measurement of correlation, it offers several strong advantages in data exploration and analysis. One significant advantage is the ability to quickly identify correlations between variables. By examining the correlation coefficients, researchers can gain insights into the strength and direction of relationships, guiding them toward the next steps in their analysis. For example, when a strong linear correlation is observed, it prompts researchers to investigate further through scatter plots, enabling them to visually confirm the presence of linear relationships. Furthermore, understanding data correlations can inform the selection of appropriate data science methods. For instance, in the case of machine learning, if a strong linear correlation is identified, employing linear supervised machine learning algorithms may be an effective approach. This correlationdriven guidance helps streamline the data science workflow and directs researchers toward the most suitable techniques and methods for their analysis. By using data correlations as a guiding principle, researchers can optimize their analysis strategies and decisionmaking processes. This approach not only saves time but also enhances the efficiency and effectiveness of data exploration, enabling researchers to uncover valuable insights and make informed decisions based on the underlying relationships within the data. The violin plot is a powerful visualization technique provided by seaborn that offers unique insights into data distribution and correlation. While similar to box plots in some aspects, the violin plot adds an additional layer of information by representing the kernel probability distribution of the data. By utilizing kernel density estimation, the violin plot depicts the density of data points in different regions. 
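For reference, the seaborn figures discussed in this section (Figs. 6.12–6.16), including the violin plot introduced above, can be generated along the following lines. This is a minimal sketch that assumes the OCM data are stored in a CSV file with hypothetical column names such as "CH4-conversion%", "C2-selectivity", "C2 yield", and "Temperature"; the file name and column names are placeholders and should be adapted to the actual dataset.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file and column names for the OCM dataset; adjust to your own data
ocm = pd.read_csv("OCM_data.csv")

# Histogram with kernel density estimation (cf. Fig. 6.12);
# older seaborn versions used sns.distplot() for the same purpose
sns.histplot(ocm["C2 yield"], kde=True)
plt.show()

# Joint plot of CH4 conversion vs. C2 selectivity (cf. Fig. 6.13)
sns.jointplot(data=ocm, x="CH4-conversion%", y="C2-selectivity")
plt.show()

# Pair plot of all variables with their distributions (cf. Fig. 6.14)
sns.pairplot(ocm)
plt.show()

# Pearson correlation map of the numeric columns (cf. Fig. 6.15)
corr = ocm.select_dtypes(include="number").corr()
sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm")
plt.show()

# Violin plot of C2 yield at each reaction temperature (cf. Fig. 6.16)
sns.violinplot(data=ocm, x="Temperature", y="C2 yield")
plt.show()
```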
Highly dense areas are represented by wider sections of the violin, indicating a higher concentration of data. This is clearly illustrated in the example shown in Fig. 6.16, where the violin plot
Fig. 6.16 Violin plot in Python
compares the distribution of C2 yield at each reaction temperature in the OCM dataset. From this plot, several observations can be made. Firstly, it is apparent that C2 yield decreases as temperature increases, suggesting an inverse correlation. Additionally, the plot reveals that C2 yield tends to be higher at temperatures of 700 and 750 compared to other temperature ranges. This information allows researchers to observe the data distribution at each temperature range and draw valuable insights. Based on these observations, one hypothesis that can be derived is that conducting experiments at temperatures of 700 and 750 may result in a higher probability of obtaining a high C2 yield. This demonstrates how violin plots assist in identifying key factors and their correlations, enabling researchers to make informed decisions and formulate hypotheses for further investigation. Violin plots provide a comprehensive view of data accumulation and the factors they correlate with. By visualizing the kernel probability distribution, researchers can gain valuable insights
into data distribution patterns, identify correlations, and generate hypotheses for further analysis and experimentation.

Multidimensional Plotting

Until now, low-dimensional data visualization has been explored. This leads us to our next question: is it possible to visualize high-dimensional data, even when the data involve more than five dimensions? The visualization of high-dimensional data poses a significant challenge due to the difficulty in representing multiple dimensions on a 2D or 3D plot. However, there are two powerful techniques that can be used to tackle this problem: parallel coordinates and radviz. Parallel coordinates is a plotting method that allows for the visualization of data with more than five dimensions. It achieves this by representing each data point as a polyline connecting parallel vertical axes, where each axis represents a different dimension. The polylines provide a visual representation of the relationships and patterns between the different dimensions. By observing the patterns and interactions between the polylines, one can gain insights into the high-dimensional data. On the other hand, radviz (short for radial coordinates visualization) provides a different approach to visualizing high-dimensional data. It places each data point within a circle, with each dimension represented by an anchor on the circumference. The position of each point within the circle is determined by the values of its respective dimensions: the closer a point lies to a given anchor, the larger its value in that dimension relative to the others. This visualization technique allows for the examination of the relationship and relative influence of each dimension on the data points. Both parallel coordinates and radviz provide effective ways to visualize and explore high-dimensional data. These techniques enable researchers to identify patterns, trends, and correlations within complex datasets that would be challenging to perceive using traditional low-dimensional plots. By leveraging these high-dimensional plotting tools, researchers can gain a deeper understanding of their data and make informed decisions in their analysis and modeling processes.

Parallel coordinates is a powerful method for visualizing multidimensional data. It allows for the simultaneous visualization of multiple variables on the x-axis, while the y-axis represents the values of these variables. By connecting data points with lines, patterns and relationships within the data become more apparent. In the example using the iris dataset, Fig. 6.17 demonstrates the effectiveness of parallel coordinates in understanding the classification of three types of iris based on their characteristics. Notably, it is evident that petal length plays a significant role in distinguishing between the iris types. The plot reveals that iris versicolor and iris virginica exhibit similar patterns compared to iris setosa, but they are distinctly separated based on their petal size, width, and sepal width. By examining these patterns together, parallel coordinates enable the visualization of multiple insights simultaneously. Parallel coordinates provide a comprehensive overview of the relationships and interactions between variables in multidimensional data. It allows researchers to identify clusters, trends, and outliers, making it a valuable tool in exploratory data analysis and pattern recognition. With parallel coordinates, complex datasets can be
Fig. 6.17 Parallel coordinates of the iris dataset (sepal length, sepal width, petal length, and petal width in cm for Setosa, Versicolour, and Virginica)
effectively visualized and analyzed, aiding in the understanding and interpretation of the underlying information. Radviz is a visualization technique designed to handle multidimensional data. It represents data using a circular layout, where each dimension is mapped to an anchor point on the circumference of the circle. The plot is created by balancing the forces of springs connecting the data points to the anchors. The position of each data point within the circle is determined by the equilibrium of these spring forces. By using Radviz, we can effectively visualize the iris dataset and gain insights into its classification patterns. Figure 6.18 showcases the Radviz plot of the iris data. From this visualization, it becomes immediately apparent that sepal width plays a crucial role in differentiating iris setosa from
Fig. 6.18 Radviz plot of the iris dataset (sepal length, sepal width, petal length, and petal width anchors for Setosa, Versicolour, and Virginica)
iris virginica and iris versicolor. Additionally, both sepal and petal lengths exhibit strong influences in distinguishing iris virginica from iris versicolor, consistent with the findings from the scatter plot and parallel coordinates analyses. Furthermore, petal width also shows a slight impact in the classification of iris virginica and iris versicolor. Radviz offers an alternative approach to exploring and interpreting multidimensional data. By leveraging the circular layout and spring forces, it allows us to identify trends, relationships, and patterns within the data. The visualization facilitates the identification of key features or dimensions that contribute significantly to the classification or clustering of the data. Radviz provides a valuable tool for visualizing and understanding complex datasets, enabling researchers to uncover meaningful insights and make informed decisions based on the underlying patterns. As a side note, a three-dimensional plot can also be created using matplotlib (as visualized in Fig. 6.19) if the data have no more than four dimensions: three dimensions are taken by x, y, and z, while a fourth dimension can be represented by color. Matplotlib allows the creation of three-dimensional plots, providing an additional dimension for visualization. In a three-dimensional plot, the x, y, and z coordinates
Fig. 6.19 Three-dimensional plot
represent the three dimensions, while color can be used to represent a fourth dimension. This can be particularly useful for visualizing relationships and patterns in data that involve multiple variables. Figure 6.19 demonstrates a three-dimensional plot created using matplotlib. Each data point is positioned in the three-dimensional space defined by the x, y, and z axes, while the color of the points represents the fourth dimension. By incorporating color as an additional dimension, we can further enhance the visual representation of the data and potentially reveal additional insights or trends. It is important to note that while three-dimensional plots can provide valuable insights, they have limitations when it comes to visualizing higher-dimensional datasets. As the number of dimensions increases, the ability to effectively represent and interpret the data becomes more challenging. In such cases, alternative visualization techniques, such as dimensionality reduction or specialized plotting methods, may be more appropriate for gaining insights from higher-dimensional data.
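A sketch of how such plots can be produced is given below. It uses pandas.plotting for parallel coordinates and radviz (cf. Figs. 6.17 and 6.18) and a matplotlib 3D axis for the scatter plot (cf. Fig. 6.19); scikit-learn is used here only as a convenient way to load the iris data, and the column names follow its conventions.

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates, radviz
from sklearn.datasets import load_iris

# Load the iris dataset into a DataFrame with a "species" label column
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})
df["species"] = df["species"].map(dict(enumerate(iris.target_names)))

# Parallel coordinates: one polyline per sample, colored by species
parallel_coordinates(df, "species")
plt.show()

# Radviz: each feature becomes an anchor on the circle
radviz(df, "species")
plt.show()

# Three-dimensional scatter plot with color as a fourth dimension
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
sc = ax.scatter(df["sepal length (cm)"], df["sepal width (cm)"],
                df["petal length (cm)"], c=df["petal width (cm)"])
ax.set_xlabel("sepal length (cm)")
ax.set_ylabel("sepal width (cm)")
ax.set_zlabel("petal length (cm)")
fig.colorbar(sc, label="petal width (cm)")
plt.show()
```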
6.4 Conclusion
Throughout this chapter, we have delved into the significant role of data visualization in the field of materials and catalysts informatics. It has become evident that data visualization is a crucial tool for gaining insights and extracting meaningful information from complex
datasets. The range of techniques available for data visualization is vast, offering unlimited possibilities for representing and exploring data. By applying and combining various visualization techniques, we can uncover valuable trends, patterns, and relationships within the data. These insights, in turn, directly contribute to the design and development of innovative materials and catalysts. Visualizing the data in a meaningful way allows researchers to identify key factors, correlations, and dependencies that can drive decisionmaking and inform the design process. Moreover, data visualization serves as a guiding compass for further analysis. The visual representations of the data provide a roadmap for conducting deeper investigations and applying advanced analytical methods. By observing the visual patterns and trends, researchers can identify areas of interest and focus their efforts on specific aspects of the data that warrant further exploration. Ultimately, data visualization plays a dual role in materials and catalysts informatics. It not only enables us to comprehend complex datasets more effectively but also serves as a catalyst for informed decision-making and exploration. By harnessing the power of data visualization, researchers can unlock the full potential of their data and drive advancements in materials science and catalysis research.
Questions
6.1 What is the significance of data visualization in materials and catalysts informatics?
6.2 How does data visualization contribute to the design and development of innovative materials and catalysts?
6.3 How does data visualization serve as a guiding compass for further analysis?
6.4 What dual role does data visualization play in materials and catalysts informatics?
6.5 How does data visualization contribute to advancements in materials science and catalysis research?
6.6 What are the two Python libraries commonly used for data visualization?
6.7 How does the linspace() function in numpy contribute to data visualization?
6.8 What is the purpose of the plot() function in matplotlib?
6.9 Why is data visualization important in the analysis of the iris flower dataset?
6.10 Why is the line plot visualization technique not optimal for the iris dataset?
6.11 What type of plot is showcased in Fig. 6.3, and how does it differ from Fig. 6.4?
6.12 What insights can be gained from the scatter plot in Fig. 6.4?
6.13 How does the incorporation of color as a third dimension enhance the scatter plot in Fig. 6.5?
6.14 What is the significance of the scatter plot in the context of the oxidative coupling of methane (OCM) reaction?
6.15 What insights can be obtained from a simple scatter plot of the OCM reaction data?
6.16 What is the purpose of a bar plot?
6.17 What insights can be gained from analyzing a bar plot?
6.18 How does a histogram provide insights into data distribution?
6.19 What does a pie chart represent in data visualization?
6.20 How does a stack plot help in understanding data trends?
6.21 What is the true potential of matplotlib?
6.22 How does seaborn enhance matplotlib visualizations?
6.23 What is the advantage of using pairplot() in seaborn?
6.24 How does the Pearson correlation coefficient analysis benefit researchers?
6.25 What insights can be gained from violin plots in seaborn?
6.26 What are the challenges in visualizing high-dimensional data?
6.27 How does parallel coordinates help in visualizing high-dimensional data?
6.28 What is the concept behind radviz visualization technique?
6.29 What insights can be gained from parallel coordinates visualization?
6.30 How can three-dimensional plots be useful in visualizing data?
7 Machine Learning
Abstract
Within the confines of this chapter, our intention is to provide a comprehensive and in-depth introduction to the captivating realm of machine learning—a discipline that has unquestionably exerted a profound and transformative influence on human society. Indeed, one could boldly argue that the development and progression of machine learning bear an analogous significance to the groundbreaking invention of electricity. By drawing a parallel to the revolutionary impact of electricity, which brought about unprecedented changes to industries through its provision of power for factory automation, fundamentally reshaping entire sectors, job markets, and societal structures, we contend that machine learning is poised to catalyze yet another remarkable and consequential transformation. Its pervasive presence is already palpable across a vast array of domains, encompassing marketing, e-commerce, image recognition, medical diagnosis, and beyond. In this chapter, our primary objective is to embark on a profound exploration of the fundamental underpinnings of machine learning, meticulously unraveling its intricacies and unveiling the core principles that underlie its remarkable functionality. Through this immersive journey, our aim is to endow our readers with a comprehensive understanding of the fundamental principles in this field, laying the groundwork for the acquisition of further knowledge and insights. Keywords
Machine learning · Supervised machine learning · Classification · Regression · Unsupervised machine learning · Semi-supervised machine learning · Reinforcement learning · Overfitting
• Understand the history of machine learning.
• Explore supervised machine learning and its two types: classification and regression.
• Understand the basics of unsupervised machine learning.
• Explore other types of learning techniques.
7.1 Introduction
Machine learning is a dynamic and rapidly evolving field within artificial intelligence that focuses on the development of algorithms and models capable of improving their performance on tasks through experience. It allows computers to learn from data, make predictions or decisions, and adapt to new information without being explicitly programmed. Machine learning encompasses a broad spectrum of techniques that can be categorized into several main types, including supervised learning, unsupervised learning, and other common methodologies. Supervised machine learning is one of the fundamental paradigms in the field. It involves training a model on labeled data, where the input data are associated with corresponding target labels or outcomes. The goal is for the model to learn a mapping from input to output, making it capable of making accurate predictions on new, unseen data. Unsupervised machine learning, in contrast, deals with unlabeled data, where the algorithm’s objective is to uncover hidden patterns or structures within the dataset. Clustering techniques, such as K-Means and hierarchical clustering, group similar data points together, revealing natural clusters. Dimensionality reduction methods, such as Principal Component Analysis (PCA), aim to simplify complex datasets by representing them in a lower-dimensional space. These approaches are invaluable for exploratory data analysis, pattern recognition, and data preprocessing. Besides supervised and unsupervised learning, there are other notable types of machine learning techniques. Semi-supervised learning leverages a combination of labeled and unlabeled data, enhancing model performance by learning from both sources. Reinforcement learning focuses on training agents to make sequences of decisions in an environment to maximize a reward, making it well-suited for applications in robotics, gaming, and autonomous systems. Machine learning is a diverse field, and the choice of technique depends on the specific problem and the available data. These methodologies collectively empower machines to learn, adapt, and make informed decisions, making machine learning a pivotal force in numerous industries and applications. In this chapter, we explore what machine learning is along with investigating commonly used techniques.
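As a small illustration of the unsupervised techniques mentioned above, the sketch below applies K-Means clustering and PCA to randomly generated, unlabeled data; the array shapes and parameter choices are arbitrary placeholders rather than settings recommended by this chapter.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic, unlabeled 5-dimensional data standing in for descriptor variables
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Unsupervised clustering: group the points into three clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])            # cluster assignment of the first 10 points

# Dimensionality reduction: project the 5-D data onto 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance captured by each component
```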
7.2 Machine Learning
In the annals of history, specifically in the year 1959, Arthur Samuel, a distinguished luminary in the realm of artificial intelligence (AI), eloquently elucidated machine learning as a “discipline of inquiry that bestows upon computers the remarkable capacity to acquire knowledge autonomously, free from explicit programming.” This profound conceptualization not only captures the essence of machine learning but also unveils a fundamental tenet that forms its bedrock: the pursuit of endowing computers with the inherent ability to assimilate information, thus enabling them to discern patterns, derive insightful conclusions, and generate effective solutions based on their accumulated wisdom. At its core, machine learning strives to transcend the boundaries of traditional programming by nurturing computer systems that possess an intrinsic cognitive prowess— a capacity to acquire knowledge and leverage it to provide intelligent responses. The paramount objective lies in the cognition and aptitude of these computational entities, as they endeavor to amass a vast repertoire of knowledge, harness it effectively, and ultimately yield intelligent outcomes rooted in their acquired expertise. By embracing machine learning, we delve into a paradigm where computers become more than mere tools, transforming into entities capable of autonomous learning and adaptive decisionmaking. This groundbreaking field of inquiry presents a profound shift in the way we perceive and interact with technology, paving the way for advanced applications across diverse domains such as healthcare, finance, transportation, and beyond. In essence, the visionary notion put forth by Arthur Samuel in 1959 encapsulates the essence of machine learning, portraying it as a powerful force that empowers computers to embark on a perpetual quest for knowledge, culminating in the remarkable ability to generate intelligent responses that transcend the confines of explicit programming. Through a meticulous examination of the disparities between human and machine problem-solving approaches, we can significantly enhance our coding practices and data preprocessing techniques to better align with the capabilities of machines. The visual representation in Fig. 7.1 provides insightful workflow charts that elucidate the distinctive methodologies employed by humans and machines in tackling problems. Embarking on the journey of problem-solving, it is crucial to first grasp the intricacies of the issue at hand. Human intervention revolves around comprehending the problem comprehensively, delving into its nuances and complexities. Subsequently, humans proceed to devise algorithms or methodologies tailored to address the challenges inherent in the problem. This critical phase can aptly be labeled as algorithm development, involving a meticulous synthesis of conceptual frameworks to effectively tackle the identified issues. However, it is important to acknowledge that the formulated algorithms may not always yield the desired outcomes. In such instances, an iterative process reminiscent of traditional experimentation ensues, wherein the devised algorithm undergoes rigorous scrutiny to discern the underlying causes of failure. Consequently, diligent efforts are directed toward rectifying the identified flaws, providing an opportunity for subsequent evaluation and validation. This iterative cycle is often repeated multiple times, with each iteration refining
Fig. 7.1 Comparison of problem-solving between human and machine learning (general approach: Problem → Develop Algorithm (Method) → Apply the method → Solution, with a Try & Error loop to Modify Algorithm; machine learning approach: Problem → Data → Machine Learning → Apply the method → Solution, with a loop to Retrain Machine by Supplying New Data)
the algorithm until a state of completion is eventually achieved. The realization of an impeccable algorithm requires a deep appreciation for both success and failure, thereby effectively harnessing the iterative nature of problem-solving to enhance the robustness and efficacy of the final solution. By embracing this iterative and incremental approach, we can capitalize on the collective knowledge gained from each iteration, continually refining the algorithm until it reaches its optimal form. This iterative process serves as a powerful mechanism for honing our problem-solving abilities, enabling us to adapt our coding practices and data preprocessing techniques to better accommodate the unique capabilities of machines. Within the realm of machine learning, there exists a slight deviation from the conventional problem-solving paradigm. While the initial stride involves identifying and comprehending the intricacies of the problem, much like the human-centered approach, a distinct departure emerges in the subsequent stages. In the context of machine learning, the process begins by curating and preprocessing a relevant dataset, tailored specifically to the complexities of the identified problem. Subsequently, machine learning algorithms are employed to equip the machine with the ability to learn from the dataset, aiming to unravel the problem in a conceptual manner. Interestingly, in this scenario, the role traditionally performed by human developers is assumed by the machine itself, which autonomously evolves its algorithmic framework. This machine-driven paradigm offers remarkable advantages. Machines inherently possess the capacity to navigate and
manipulate multidimensional spaces, deftly handling vast volumes of data encompassing diverse dimensions concurrently—a challenging feat that proves exceedingly difficult, if not unattainable, for humans. Moreover, the allure of the machine learning approach amplifies as the entire process can be orchestrated to function autonomously, minimizing the need for extensive manual intervention. Notably, in the event of dataset updates, the machine seamlessly adapts by assimilating the novel information, ensuring that its learning remains up-to-date. This adaptability empowers the machine to transcend the confines of a static problem space, accommodating dynamic shifts and evolving challenges with remarkable agility. By harnessing these distinguishing attributes, machine learning emerges as a transformative paradigm capable of addressing predicaments that elude human capacity alone. Its capacity to navigate and glean insights from vast and complex datasets, coupled with its intrinsic adaptability, endows machine learning with the potential to unravel conundrums that surpass human capabilities, revolutionizing problem-solving in unprecedented ways. Machine learning demonstrates extraordinary prowess in addressing a wide range of challenges, making it an invaluable tool across various domains. Its remarkable capabilities render it particularly effective in solving complex problems that often elude comprehensive human understanding. By operating within multidimensional spaces, machines possess the ability to navigate intricate problem domains with great precision, enabling the development of high-dimensional algorithms capable of tackling the inherent complexity of such challenges. Furthermore, machine learning emerges as a formidable contender when confronted with problems involving vast amounts of data. Extracting meaningful insights and knowledge from massive datasets is typically a daunting task, but machine learning algorithms possess an innate capacity to process and analyze large volumes of information. This enables the extraction of valuable insights that might otherwise remain hidden. It is important to note, however, that handling big data often requires robust hardware configurations encompassing high RAM, storage, GPU, and CPU capabilities to ensure efficient processing. In addition, machine learning exhibits exceptional aptitude in managing systems characterized by intricate rule-based patterns. By diligently learning and internalizing these rules, machines can consistently exhibit behavior that adheres to established patterns, ensuring reliable and consistent system operations. This ability proves particularly valuable in domains where precise adherence to predefined rules is essential. Moreover, machine learning showcases the potential to tackle challenges that have thus far eluded human achievement, including a multitude of inverse problems encountered within scientific research. Leveraging its inherent learning capabilities, machine learning holds promise in surpassing human limitations and providing innovative solutions to long-standing scientific quandaries. The ability to uncover new insights and offer novel approaches to long-standing problems positions machine learning as a compelling choice for driving impactful research endeavors. Given these remarkable capabilities, it is unsurprising that researchers and practitioners across diverse domains are increasingly drawn to the potential of machine learning. 
Its capacity to address complex problems, handle vast amounts of data, leverage rule-based patterns, and potentially surpass human
limitations makes it a powerful tool for driving impactful research and innovation. Machine learning’s transformative power is reshaping our understanding and approach to problem-solving, opening up new frontiers of knowledge and ushering in a new era of innovation. Let us explore machine learning further. Machine learning, as a vast field of study, encompasses various types of learning methods that enable computers to acquire knowledge and make intelligent decisions. It is essential to familiarize oneself with the four primary types of machine learning: supervised machine learning, unsupervised machine learning, semi-supervised machine learning, and reinforcement learning. Each method possesses distinct characteristics and serves different purposes in solving diverse problems. Supervised machine learning involves training a model using labeled data, where the desired outcome or target variable is known. The algorithm learns to associate input features with corresponding output labels by generalizing from the provided examples. This type of learning is particularly useful for classification and regression tasks, enabling the prediction of future outcomes based on historical patterns and known relationships. Unsupervised machine learning, on the other hand, deals with unlabeled data, where the algorithm aims to discover inherent patterns, structures, or relationships within the data itself. Without explicit guidance, the algorithm autonomously clusters data points, identifies anomalies, or performs dimensionality reduction to reveal underlying structures and gain insights into the data. Unsupervised learning is instrumental in exploratory data analysis, data preprocessing, and identifying hidden patterns. Semi-supervised machine learning is a hybrid approach that combines elements of supervised and unsupervised learning. In scenarios where labeled data are scarce or expensive to obtain, semisupervised learning leverages a combination of labeled and unlabeled data to improve model performance. The algorithm learns from both the labeled examples and the unlabeled data to enhance its understanding of the underlying patterns and relationships. This method is particularly beneficial when acquiring labeled data is resource-intensive or time-consuming. Lastly, reinforcement learning revolves around an agent learning to interact with an environment and maximize its performance by receiving feedback in the form of rewards or penalties. The algorithm explores the environment, takes actions, and learns from the consequences of its actions. Through trial and error, the agent discovers an optimal strategy or policy to achieve a specific goal. Reinforcement learning is prominent in applications such as game playing, robotics, and autonomous decisionmaking systems. Understanding the nuances and strengths of each learning method enables researchers and practitioners to choose the most appropriate approach based on the problem at hand. Supervised learning is suitable when labeled data are available, unsupervised learning excels in extracting insights from unlabeled data, semi-supervised learning is valuable when labeled data are limited, and reinforcement learning empowers autonomous decision-making in dynamic environments. By embracing the versatility of these machine learning types, one can unlock the full potential of intelligent systems and drive advancements across a myriad of domains. Here, each learning method is explored further here.
7.3 Supervised Machine Learning
What is supervised machine learning? Supervised machine learning is all about creating the relationship y = f(x). Supervised machine learning encompasses the process of creating a robust and predictive relationship between a dependent variable, denoted as y, and its corresponding independent variables, symbolized as x. In simpler terms, it involves establishing a connection of the form y = f(x). The dependent variable, y, represents the objective variable of interest that requires a solution, while the independent variables, x, encapsulate the descriptor variables that define or influence the objective variable. The fundamental task in supervised machine learning is to train the machine by exposing it to relevant and labeled data. This process enables the machine to acquire the necessary knowledge and insights to comprehend the intricate relationship between the descriptor variables and the objective variable. Through diligent analysis and processing of the data, the machine learns to discern patterns, trends, and underlying correlations that serve as the foundation for making accurate predictions.

The core principle underlying supervised machine learning lies in uncovering the relationship between the descriptor variables and the objective variable. By leveraging the provided data, the machine gains an understanding of how the descriptor variables interact and impact the objective variable. This acquired knowledge allows the machine to generalize and apply its insights to predict the objective variable accurately when faced with new instances of input data. Supervised machine learning plays a crucial role in a wide range of applications, such as predicting customer preferences, forecasting financial trends, diagnosing diseases, and analyzing sentiment in text. By effectively capturing the relationship between the descriptor variables and the objective variable, supervised machine learning enables informed decision-making and provides valuable insights. It is important to note that the success of supervised machine learning relies on meticulous analysis and processing of the data. The machine learns to discern meaningful patterns and correlations by carefully studying the provided data, ensuring that its predictions are accurate and reliable. The ultimate objective of supervised machine learning is to develop a sophisticated and reliable model that, when faced with new instances of input data, can efficiently generalize the acquired knowledge and accurately predict the corresponding objective variable. By leveraging the established relationship between the descriptor variables and the objective variable, the machine serves as a powerful tool for forecasting and decision-making purposes.

However, identifying the optimal set of descriptor variables that exhibit a meaningful and impactful association with the objective variable presents a significant challenge. It requires a careful exploration of the data, feature engineering techniques, and statistical analysis to determine the variables that have the most substantial influence on the target outcome. This meticulous endeavor involves evaluating various statistical metrics, conducting comprehensive data exploration, and leveraging domain expertise to identify and incorporate the most relevant and informative descriptor variables. Thus, supervised machine learning represents an intricate and multifaceted endeavor that revolves around elucidating the complex relationship between the objective variable and its descriptor variables.
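To make the idea of learning y = f(x) from labeled data concrete, the following minimal sketch trains a model on a hypothetical set of descriptor variables x and a known objective variable y, then predicts the objective variable for unseen descriptor values. The synthetic data and the choice of a random forest regressor are illustrative assumptions, not a method prescribed by this chapter.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical labeled dataset: descriptor variables x and objective variable y
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 3))                                   # three descriptors
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=100)   # known target

# Split into training and test data, then learn the mapping y = f(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Predict the objective variable for unseen descriptor values
print(model.predict(X_test[:5]))
print("R2 on test data:", model.score(X_test, y_test))
```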
Through diligent training and
Fig. 7.2 Supervised machine learning: Classification
analysis, the machine strives to grasp the underlying patterns and correlations, ultimately empowering it to make accurate predictions. Nonetheless, the quest for identifying the optimal descriptor variables remains an ongoing challenge, demanding a comprehensive and meticulous exploration of the data landscape to uncover the most influential factors that shape the behavior of the objective variable. It is through this thorough exploration and understanding of the data that the machine can acquire the knowledge necessary to make reliable predictions and support informed decision-making.

Supervised machine learning encompasses two primary types: classification and regression, each serving distinct purposes and applied in diverse contexts. Classification is employed when the objective is to categorize data into different groups or classes based on specific criteria. An illustrative example of classification is spam email identification, as depicted in Fig. 7.2. In the realm of supervised classification machine learning, the process typically involves assigning a numeric value to each group or class. For instance, an active catalyst may be designated with a value of 1, while a non-active catalyst is assigned a value of 0; the aim is then for the machine to learn the underlying relationship between the active and non-active classes and the associated descriptor variables. In the spam example, the machine endeavors, through extensive training and analysis, to discern patterns and features within the data that are indicative of either spam or non-spam characteristics. By capturing the intricate associations between the descriptor variables and the predefined classes, the machine can acquire the ability to automatically detect whether a new incoming email is spam or not. The success of the classification model hinges on the accuracy and reliability of the learned relationship between the descriptor variables and the class labels. This learning process involves extracting relevant features, selecting appropriate algorithms, and tuning model parameters to optimize performance. The machine strives to minimize errors and maximize the correct classification of new instances based on the patterns it has learned. Once this
Fig. 7.3 Supervised machine learning: Regression
relationship is established, the machine can seamlessly apply the acquired knowledge to categorize future email messages, making real-time predictions regarding their spam or non-spam nature. This automated classification process offers significant advantages in terms of efficiency, allowing for timely identification and handling of spam emails without requiring manual intervention. It is worth noting that supervised classification techniques extend beyond spam email identification and find applications in various domains, such as image recognition, sentiment analysis, fraud detection, and medical diagnosis. The ability to accurately classify data based on learned patterns and relationships contributes to enhanced decision-making, improved resource allocation, and streamlined workflows across diverse industries.

Supervised regression models operate within a similar framework as their classification counterparts. However, in regression, the objective variables are continuous, as demonstrated in Fig. 7.3. The fundamental principle remains the same, wherein a functional relationship y = f(x) is established using the machine learning approach. In the context of supervised regression, the goal is to create a model that accurately predicts the value of the continuous objective variable, y, based on the given input variables, x. Through a comprehensive training process, the machine learns to discern the complex relationship between the descriptor variables and the continuous target variable. Once the machine is adequately trained, it gains the capability to generate predictions for the objective variable based on new sets of input variables. By leveraging the acquired knowledge and patterns, the machine can provide estimations for the continuous variable, enabling informed decision-making and forecasting. Similar to classification, the efficacy of a regression model relies on the accuracy and reliability of the learned relationship between
the descriptor variables and the continuous objective variable. A well-trained regression model possesses the capacity to answer queries regarding the dependent variable, y, based on the provided independent variables, x, facilitating a deeper understanding of the relationship between them. Supervised regression models share the underlying concept with classification models. However, in regression, the focus is on predicting continuous objective variables. By establishing the relationship y = f(x) through machine learning techniques, a regression model can provide estimations for the dependent variable based on the provided independent variables, thus enabling enhanced decision-making and forecasting capabilities. The versatility of supervised regression extends to various domains, such as economics, finance, healthcare, and engineering. In economics, regression models can help analyze the relationship between economic factors and market performance. In finance, they can aid in predicting stock prices or estimating risk. In healthcare, regression models can assist in predicting patient outcomes based on clinical variables. The applications are numerous, and the insights gained from regression models contribute to informed decision-making and a deeper understanding of continuous relationships in the data. Hence, supervised regression models operate under a similar framework as their classification counterparts. By establishing the relationship y = f(x) through machine learning techniques, these models can accurately predict continuous objective variables. The trained models provide estimations for the dependent variable based on the provided independent variables, enabling enhanced decision-making and forecasting capabilities across various domains.
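As a companion sketch for the classification case discussed above, the example below assigns a value of 1 to active and 0 to non-active catalysts and trains a classifier on hypothetical descriptor variables; the synthetic descriptors, the labeling rule, and the choice of a random forest classifier are illustrative assumptions rather than the chapter's prescribed method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical catalyst dataset: descriptors and a 1 (active) / 0 (non-active) label
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 4))                      # e.g. composition-derived descriptors
y = (X[:, 0] + 0.5 * X[:, 1] > 0.9).astype(int)     # synthetic activity label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Predict whether unseen candidate catalysts are active (1) or non-active (0)
print(clf.predict(X_test[:5]))
print("accuracy on test data:", clf.score(X_test, y_test))
```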
7.4 Unsupervised Machine Learning
Now that we have an understanding of supervised machine learning, let us explore unsupervised machine learning. Unsupervised machine learning operates on a distinct and powerful premise that sets it apart from its supervised counterpart. Unlike supervised learning, which endeavors to solve the equation y = f(x) or identify objective variables, unsupervised learning takes a different approach altogether. Its primary focus lies in discerning the underlying patterns and relationships that exist within the data, unearthing valuable insights and aiding in various data analysis tasks. In unsupervised learning, the machine is not provided with explicit labels or target variables to guide its learning process. Instead, it autonomously explores the data, searching for inherent structures and organizing principles that can inform its understanding of the underlying information. Through sophisticated algorithms, unsupervised learning enables the identification of variables that can be considered descriptors or that possess independent significance, contributing to a deeper understanding of the data. By leveraging unsupervised learning, researchers and analysts can extract valuable insights from complex datasets: the autonomous exploration of the data allows the machine to uncover hidden patterns, clusters, and relationships that might not be immediately apparent to human observers. This capability proves invaluable in a
wide range of data analysis tasks, including exploratory data analysis, data visualization, and feature engineering. Unsupervised learning techniques encompass various algorithms, including clustering, dimensionality reduction, and association rule mining. Clustering algorithms group similar data points together based on their intrinsic similarities, enabling the identification of meaningful clusters or segments within the data. Dimensionality reduction techniques, on the other hand, aim to reduce the complexity of high-dimensional data by extracting the most relevant features or representations. Unsupervised learning plays a pivotal role in numerous domains and applications. In fields such as market and customer segmentation, unsupervised learning techniques assist in identifying distinct customer segments based on purchasing behavior or preferences. In anomaly detection, unsupervised learning algorithms can automatically flag unusual or anomalous instances within a dataset, highlighting potential outliers or suspicious patterns. Furthermore, unsupervised learning is widely used in exploratory data analysis, where it helps analysts gain a deeper understanding of the data and uncover novel insights. By autonomously exploring and uncovering inherent structures, unsupervised learning provides valuable insights and aids in various data analysis tasks. Its algorithms, such as clustering and dimensionality reduction, empower analysts to extract meaning from complex datasets and discover patterns that may have otherwise remained hidden. With its broad applicability and transformative potential, unsupervised learning continues to shape the field of data science and enable data-driven decision-making. Clustering serves as a prominent and widely utilized approach in the realm of unsupervised machine learning, facilitating the identification of distinct groups or clusters within a given dataset. This algorithmic technique plays a crucial role in exploring and revealing patterns that may exist in the data, without the need for predefined labels or objective variables. As exemplified in Fig. 7.4, the clustering algorithm divides the dataset into three distinctive groups based on the variables. This classification process hinges
Fig. 7.4 Unsupervised machine learning
upon the calculation of distances between data points, reflecting the concept of how humans perceive and discern patterns in data. Comparable to our intuitive abilities, we naturally discern the existence of three groups by observing the aggregation of data points in separate regions, with noticeable gaps between them. Unsupervised machine learning algorithms adopt a similar perspective, aiming to unveil latent groups or clusters within the dataset. While the specific methods employed to measure distances may differ across clustering algorithms, the overarching objective remains steadfast: to unearth meaningful patterns and associations in the data. The fundamental principle underpinning clustering algorithms centers on the notion of proximity or similarity. Data points that exhibit closer proximity in terms of their variable values are considered more akin or related and thus are assigned to the same cluster. Conversely, data points that exhibit greater dissimilarity are assigned to different clusters. The precise mechanisms for measuring distances vary depending on the specific clustering algorithm utilized, such as k-means, hierarchical clustering, or density-based clustering (e.g., DBSCAN). Regardless of the method, the primary objective remains unaltered: to unearth hidden clusters or groups within the dataset. The implications of clustering algorithms extend across diverse domains and applications. In customer segmentation, for instance, clustering techniques aid in identifying distinct segments of customers based on shared attributes or behaviors, facilitating targeted marketing strategies. In image analysis, clustering algorithms assist in grouping similar pixels together, enabling tasks such as image compression or image recognition. Clustering algorithms also find application in anomaly detection, helping to identify aberrant or anomalous data points that deviate from the norm. It is important to emphasize that clustering algorithms operate independently, autonomously exploring the data to identify patterns and groups. Unlike supervised learning approaches that rely on predefined labels, clustering algorithms offer an unbiased means of exploration and discovery. By effectively organizing data into clusters and revealing inherent structures, clustering provides valuable insights into the data and aids in understanding complex datasets. Unsupervised machine learning algorithms exhibit remarkable versatility, extending their capabilities beyond simple two-dimensional datasets. They possess the power to effectively classify multidimensional data, a task that often proves challenging for human analysts. Leveraging sophisticated mathematical techniques and advanced algorithms, unsupervised machine learning algorithms excel in identifying meaningful patterns and associations within large and intricate datasets. One of the key advantages of unsupervised machine learning lies in its ability to analyze vast amounts of data without the need for predefined labels or objective variables. By autonomously exploring the data, these algorithms uncover hidden groups, clusters, or structures that may exist within the dataset. This enables a deeper understanding of the inherent organization and patterns present in the data, ultimately facilitating further analysis and generating valuable insights. Unsupervised machine learning techniques offer a valuable tool for data analysis in diverse domains. 
For instance, in the field of genetics, unsupervised learning algorithms aid in identifying distinct subgroups or patterns within genomic data, leading to advancements in personalized
medicine and disease diagnosis. In the field of finance, unsupervised learning enables the identification of hidden market trends, patterns, and correlations, supporting investment decision-making and risk management. In the field of image processing, unsupervised learning techniques assist in tasks such as image segmentation, where the algorithm autonomously identifies and separates different objects or regions in an image. The ability of unsupervised machine learning to classify and extract underlying classifications or structures within large datasets makes it a valuable and practical approach across various domains. These algorithms offer a powerful means of organizing and understanding complex data, providing insights that may not be easily discernible through manual analysis. By automating the process of pattern discovery and classification, unsupervised machine learning algorithms empower researchers, analysts, and practitioners to extract meaningful knowledge from vast amounts of data, enhancing decision-making processes and driving innovation.
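As a minimal sketch of the clustering idea described above, the following uses k-means from scikit-learn on synthetic two-dimensional data arranged in three separated groups, in the spirit of Fig. 7.4 (all numbers are invented for illustration):

# Minimal k-means sketch: group unlabeled points by distance, without any target labels.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.vstack([                     # three synthetic blobs, no labels provided
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(30, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.3, size=(30, 2)),
    rng.normal(loc=[0.0, 3.0], scale=0.3, size=(30, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(data)      # cluster index assigned to each data point
print(labels[:10])
print(kmeans.cluster_centers_)         # coordinates of the three discovered cluster centers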
7.5 Semi-supervised Machine Learning
Semi-supervised machine learning represents a hybrid approach that combines elements of both supervised and unsupervised machine learning techniques. Its application becomes apparent when considering datasets that are only partially labeled, posing a challenge for traditional machine learning processes. Let us consider the example depicted in Fig. 7.5,
Fig. 7.5 Semi-supervised machine learning
where the dataset exhibits a mix of labeled and unlabeled data points. To the human eye, it is evident that the data points can be grouped into two distinct clusters: one cluster comprising dots and triangles, and the other comprising dots and squares. However, directly applying supervised machine learning to this dataset would be inappropriate because of the incomplete labeling. In such cases, a two-step approach can be employed, harnessing the power of semi-supervised machine learning. The first step applies unsupervised machine learning to assign labels to the unlabeled data points: patterns and groupings within the data are identified, enabling labels such as triangles and squares to be assigned based on the points' inherent characteristics and proximity. Once the dataset has been labeled in this way, the second step leverages supervised machine learning, using the now-labeled dataset for supervised classification so that the machine can learn and make predictions based on the assigned labels. This two-step process, unsupervised learning for labeling followed by supervised learning, constitutes the essence of semi-supervised machine learning. Semi-supervised machine learning serves as a practical solution when neither supervised nor unsupervised learning can be employed on its own because the dataset is only partially labeled. By incorporating the available labeled data, semi-supervised learning enhances the model's learning and predictive capabilities while also benefiting from the insights gained through unsupervised learning. This approach allows for more efficient utilization of data resources and is particularly valuable when dealing with large datasets, where labeling the entire dataset would be time-consuming or resource-intensive.
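A rough sketch of this two-step recipe on invented data is shown below: k-means first proposes labels for all points, and a supervised classifier is then trained on the resulting fully labeled dataset. (scikit-learn also ships dedicated semi-supervised tools such as label propagation; the version here simply mirrors the cluster-then-classify procedure described above.)

# Semi-supervised sketch: label the unlabeled points by clustering, then train a classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.4, size=(40, 2))   # mostly unlabeled points
group_b = rng.normal(loc=[4.0, 4.0], scale=0.4, size=(40, 2))   # mostly unlabeled points
X = np.vstack([group_a, group_b])

# Step 1: unsupervised clustering assigns provisional labels to every point.
pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: supervised learning uses the now fully labeled dataset.
clf = LogisticRegression().fit(X, pseudo_labels)
print(clf.predict([[0.2, -0.1], [3.8, 4.2]]))   # predictions for two new points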
7.6 Reinforcement Learning
Reinforcement learning, another prominent type of machine learning, diverges slightly from the previously discussed approaches. Its fundamental concept revolves around training machines to make decisions based on rewards or penalties received during the learning process. Let us closely examine the example depicted in Fig. 7.6 to gain a better understanding of reinforcement learning. In this particular scenario, the machine is faced with the task of selecting the safer option between fire and water. As humans, we possess the knowledge that fire is hazardous and, therefore, instinctively choose water as the safer alternative. Our objective is to train the machine to make the same choice— to select water. To achieve this, the machine undergoes a process of trial and error,
Fig. 7.6 Reinforcement learning
continuously refining its decision-making based on the received feedback. In the context of reinforcement learning, the machine is assigned a penalty or negative reward when it selects fire, indicating an incorrect or undesirable choice. Subsequently, the machine proceeds to another round of decision-making, armed with the knowledge that fire should be avoided. Through repetition and experience, the machine's success rate gradually increases as it learns from the penalties and rewards. By reinforcing the idea that choosing fire is unfavorable and opting for water is preferable, the machine's decision-making improves over time. Reinforcement learning hinges on the concept of an agent interacting with an environment. The agent receives feedback in the form of rewards or penalties based on its actions, enabling it to learn and optimize its decision-making strategy. By exploring different options, observing the outcomes, and adjusting its approach based on the received feedback, the machine becomes increasingly adept at making decisions in pursuit of a specific objective. Hence, reinforcement learning trains machines to make decisions through a process of trial and error, where penalties and rewards guide the learning process. By repeatedly reinforcing positive actions and penalizing undesirable choices, the machine progressively refines its decision-making abilities. This approach enables machines to autonomously learn optimal strategies and make informed decisions in complex environments.
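A toy sketch of this reward-and-penalty loop for the fire/water example of Fig. 7.6 is given below; the reward values, exploration rate, and update rule are invented for illustration and are far simpler than full reinforcement learning algorithms:

# Toy reinforcement-learning loop: the agent is penalized for "fire" and rewarded for
# "water", and gradually learns to prefer the safer action.
import random

actions = ["fire", "water"]
value = {"fire": 0.0, "water": 0.0}   # estimated value of each action
counts = {"fire": 0, "water": 0}
epsilon = 0.1                          # exploration rate

random.seed(0)
for episode in range(200):
    # Explore occasionally; otherwise exploit the action with the higher estimated value.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: value[a])
    reward = -1.0 if action == "fire" else 1.0    # penalty for fire, reward for water
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]  # running-average update

print(value)   # the "water" estimate ends up higher, so the agent prefers water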
Overfitting is a critical concern when applying machine learning algorithms. As depicted in Fig. 7.7, overfitting arises when a machine learning model becomes excessively tailored to the training data, resulting in difficulties when attempting to predict outcomes for new, unseen test data. It occurs when the model becomes overly complex or is tuned too precisely to the training dataset. The consequence is that the model learns the noise and random fluctuations in the training data rather than the underlying, generalizable patterns. When exposed to new data, an overfitted model may therefore struggle to generalize and provide accurate predictions. While a detailed exploration of overfitting will be addressed in the next chapter, it is
Fig. 7.7 Overfitting
essential to acknowledge the importance of finding the right balance when tuning machine learning models. Striking this balance involves avoiding both underfitting, where the model is too simplistic and fails to capture the underlying relationships, and overfitting, where the model becomes overly complex and excessively adheres to the training data’s idiosyncrasies. Mitigating overfitting requires employing strategies such as regularization techniques, cross-validation, and feature selection to enhance model performance on unseen data. By implementing these approaches, machine learning models can strike the optimal balance between capturing the relevant patterns in the data and maintaining generalization capabilities. Overfitting is a significant challenge in machine learning, resulting from overly precise tuning or excessive model complexity. It impedes the model’s ability to accurately predict outcomes for unseen data. Therefore, it is crucial to maintain a careful balance in model tuning, taking steps to avoid overfitting while capturing meaningful patterns in the data.
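In practice, the balance described above is often checked by comparing the error on the training data with a cross-validated error, so that an overfitted model (excellent training score, poor validation score) is exposed. The sketch below does this for polynomial models of increasing degree on invented noisy data:

# Sketch: detecting overfitting by comparing training score with cross-validated score.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=30)  # noisy sinusoidal data

for degree in (1, 3, 12):                       # increasing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_score = model.fit(X, y).score(X, y)               # R^2 on the training data
    cv_score = cross_val_score(model, X, y, cv=5).mean()    # R^2 on held-out folds
    print(f"degree={degree:2d}  train R2={train_score:.2f}  cv R2={cv_score:.2f}")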
7.7 Conclusion
Indeed, the four types of machine learning explored (supervised, unsupervised, semi-supervised, and reinforcement learning) represent powerful tools for problem-solving in various domains. However, it is crucial to emphasize that the effectiveness and efficiency of these machine learning techniques heavily rely on the quality and preprocessing of the datasets they are applied to. Data preprocessing plays a vital role in optimizing the performance of machine learning algorithms. The analogy of capturing only a fraction of an entire map accurately illustrates the importance of comprehensive and representative data: if the machine learning process is conducted on biased or incomplete data, it may fail to grasp the full picture or generalize accurately. Data collection is the initial step in ensuring the availability of relevant and diverse data that adequately represent the problem at hand. This process involves identifying and gathering data from reliable sources, considering the necessary features or variables, and ensuring the data are representative of the problem's scope. However, data collection alone is insufficient without proper preprocessing. Data preprocessing encompasses various techniques such as cleaning, normalization, feature selection, and handling missing values or outliers. These techniques aim to enhance the quality and reliability of the data, removing noise and irrelevant information and transforming it into a suitable format for machine learning algorithms. By performing diligent data collection and preprocessing, the machine learning process can be primed for success. High-quality, well-preprocessed data allow the algorithms to extract meaningful patterns, relationships, and insights, leading to more accurate predictions and informed decision-making. While machine learning techniques are powerful problem-solving tools, their efficacy is closely intertwined with the quality and preprocessing of the data they are provided. Just as an incomplete map hampers understanding, biased or inadequate data can hinder the machine's ability to learn and generalize effectively.
Thus, meticulous attention to data collection and preprocessing is of utmost importance for achieving optimal results in machine learning applications.
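As a compact illustration of the preprocessing steps named above (handling missing values, removing an obvious outlier, and normalizing features), the pandas sketch below uses an invented table with hypothetical column names:

# Sketch of basic data preprocessing with pandas before machine learning.
import pandas as pd

df = pd.DataFrame({
    "temperature": [300, 320, None, 315, 9999],   # one missing value, one obvious outlier
    "pressure":    [1.0, 1.2, 1.1, None, 1.3],
    "yield":       [0.42, 0.55, 0.48, 0.51, 0.40],
})

df = df.dropna()                                   # remove rows with missing values
df = df[df["temperature"] < 1000]                  # drop a physically implausible outlier
for col in ["temperature", "pressure"]:            # min-max normalization of descriptors
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
print(df)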
Questions
7.1 What role does data preprocessing play in optimizing the performance of machine learning algorithms?
7.2 How does biased or incomplete data impact the effectiveness of machine learning?
7.3 What are the key steps involved in ensuring the availability of relevant and diverse data for machine learning?
7.4 What are some techniques involved in data preprocessing for machine learning?
7.5 How does high-quality, well-preprocessed data contribute to the success of the machine learning process?
7.6 What did Arthur Samuel define machine learning as in the year 1959?
7.7 What is the fundamental tenet that forms the bedrock of machine learning?
7.8 How does machine learning differ from traditional programming in terms of cognitive prowess?
7.9 What is the advantage of using machine learning in handling big data?
7.10 What are the four primary types of machine learning and their purposes?
7.11 What is supervised machine learning?
7.12 What is the core principle of supervised machine learning?
7.13 What are the primary types of supervised machine learning?
7.14 How does supervised classification work?
7.15 What is the purpose of supervised regression models?
7.16 What is the primary focus of unsupervised machine learning?
7.17 How does unsupervised learning differ from supervised learning?
7.18 What are some tasks that unsupervised learning techniques assist in?
7.19 How do clustering algorithms work in unsupervised machine learning?
7.20 What are the advantages of unsupervised machine learning?
7.21 What is semi-supervised machine learning?
7.22 What is the fundamental concept of reinforcement learning?
7.23 How does reinforcement learning train machines to make decisions?
7.24 What is overfitting in machine learning, and why is it a concern?
8 Supervised Machine Learning
Abstract
As previously mentioned, supervised machine learning refers to a prominent algorithmic approach that revolves around solving the function y = f(x), wherein y and x denote the objective and descriptor variables, respectively. The nature of the objective variable dictates whether the models developed fall under the classification or regression category. This chapter serves as an avenue for delving deeper into the realm of supervised machine learning algorithms, wherein a diverse array of such algorithms shall be presented and discussed. Within the context of this chapter, we will familiarize ourselves with a variety of commonly utilized supervised machine learning models, leveraging the capabilities of Python libraries such as pandas and scikit-learn for their implementation and execution.
Keywords
Supervised machine learning · Linear regression · Ridge regression · Logistic regression · Support vector machine · Decision tree · Random forest · Neural network · Gaussian process regression · Cross-validation
• Explore commonly used supervised machine learning models.
• Learn how to use these models with real data.
8.1 Introduction
Supervised machine learning is a fundamental part of machine learning where models are trained on labeled data to make predictions or classifications. In supervised learning, the algorithm learns to map input data to corresponding output based on a set of training examples. There are various models and algorithms used in supervised learning, including linear regression for regression tasks, decision trees, random forests, and support vector machines for classification, along with neural networks for more complex tasks. Pandas and scikit-learn are two essential Python libraries that play significant roles in supervised machine learning. Pandas is primarily used for data preprocessing, data cleaning, and feature engineering. It provides a powerful and flexible framework for working with structured data, allowing data scientists to load, manipulate, and transform data into a suitable format for machine learning tasks. On the other hand, scikit-learn is a comprehensive machine learning library that offers a wide range of algorithms for supervised learning, as well as tools for model selection, evaluation, and hyperparameter tuning. It provides a user-friendly API for building and training machine learning models and is widely adopted in the data science community. Supervised machine learning plays a crucial role in deciphering the relationship between a given objective variable and its corresponding set of descriptors. These descriptors encompass a range of properties that influence the manifestation of a specific attribute. To tackle the function y = f(x), numerous models have been developed, each with its unique approach and characteristics. In the ensuing discussion, we delve into the exploration of several notable models employed in this context. The models under scrutiny include linear regression, ridge regression, logistic regression, support vector machine, decision tree, random forest, voting, neural network, Gaussian process regression, cross-validation, and classification. Each model offers distinct advantages and methodologies, enriching our understanding of supervised machine learning and its applications.
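The division of labor between the two libraries can be sketched as follows; the table contents and column names are invented stand-ins for a real dataset, which in practice would typically be loaded with pd.read_csv():

# Sketch: pandas prepares the table, scikit-learn trains and evaluates a supervised model.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({                      # invented stand-in for a real dataset
    "temperature": [300, 320, 340, 360, 380, 400, 420, 440],
    "pressure":    [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7],
    "yield":       [0.40, 0.44, 0.49, 0.53, 0.58, 0.62, 0.67, 0.71],
})

X = df[["temperature", "pressure"]]      # descriptor columns
y = df["yield"]                          # objective column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("test R2:", model.score(X_test, y_test))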
8.2 Linear Regression
One of the primary supervised machine learning models we shall begin our exploration with is linear regression. It can be regarded as the fundamental form of machine learning, characterized by its simplicity and elegance. In essence, linear regression can be perceived as a form of linear fitting, where the objective is to establish a linear relationship between the variables at hand. However, it would be unjust to undermine the significance of linear regression, as it encapsulates the very essence of supervised machine learning in its concise formulation. By employing linear regression, we can glean valuable insights and draw meaningful conclusions from the data, thereby contributing to our broader understanding of the underlying patterns and trends governing the observed phenomena. Let us direct our attention to Fig. 8.1, which showcases a graphical representation of two variables, x and y, denoting the descriptor and objective variables, respectively. This
Fig. 8.1 Scatter plot and linear regression machine learning
scatter plot effectively illustrates the presence of a linear relationship between x and y. In light of this observation, we proceed to implement the linear regression machine learning model. In this particular instance, we explore the formulation y = w1x + w0, where w1 signifies the slope and w0 represents the intercept. Upon closer examination of the scatter plot, it becomes apparent that the plotted data points deviate slightly from the idealized linear relationship encapsulated by y = w1x + w0. Consequently, we are confronted with the challenge of constructing a model that effectively reconciles the observed data points with the hypothesized equation. To address this, we can leverage the capabilities of the scikit-learn Python library to apply linear regression. This powerful tool enables us to construct a machine learning model capable of accurately fitting the data points to the desired y = w1x + w0 relationship, thereby enhancing our ability to extract meaningful insights from the data. Upon closer examination of Fig. 8.1, we observe that the descriptor variable x and the objective variable y are represented as lists. This choice of data structure enables efficient handling and manipulation of the variables within the linear regression context. Furthermore, we initialize the variable "model" as an instance of the LinearRegression() class, which grants us access to the functionalities and methods associated with the linear regression model. To train the model, we invoke the fit() command, which facilitates the process of fitting the model to the training data. This crucial step enables the model to
learn the underlying patterns and relationships between the descriptor variable x and the corresponding objective variable y, thus empowering it to make accurate predictions and draw meaningful inferences. The fitting process of a machine learning model can be fine-tuned by manipulating hyperparameters. Hyperparameters are parameters that govern the learning process of the machine, playing a crucial role in its performance and behavior. It is worth noting that most machine learning algorithms possess specific hyperparameters that can be adjusted to optimize the model's performance. In the case of linear regression, the hyperparameters revolve around the slope and intercept, as they directly influence the fitted line's characteristics. By carefully selecting appropriate values for these hyperparameters, we can tailor the model to accurately capture the relationships between the descriptor variable x and the objective variable y. When it comes to evaluating the accuracy of a model based on hyperparameter variations, a key consideration is the choice of an appropriate evaluation metric. In this context, the root-mean-square error (RMSE) is often employed to quantify the deviation between the predicted values and the actual observed values. By calculating the RMSE, we can assess the model's ability to accurately predict the objective variable y based on the given descriptor variable x. Lower values of RMSE indicate better accuracy and a closer alignment between the model's predictions and the actual data points. RMSE is calculated by taking the square root of the mean of the squared differences between the predicted and observed values of the objective variable y, that is, the sum of the squared differences divided by the total number of data points, followed by a square root. This metric provides valuable insights into the dispersion or variability of the fitted line, as depicted in Fig. 8.2. A larger RMSE signifies a greater dispersion of the data points around the fitted line, indicating reduced accuracy of the model's predictions. On the other hand, a smaller RMSE suggests a more precise alignment between the predicted values and the actual observations. In the context of linear regression, the objective is to minimize the RMSE by carefully controlling the hyperparameters, specifically the slope and intercept. By fine-tuning these hyperparameters, we strive to obtain a fitted line that best captures the underlying relationships between the descriptor variable x and the objective variable y. This optimization process aims to minimize the dispersion of the data points around the fitted line, thereby improving the overall accuracy of the linear regression model. In the realm of data analysis, the landscape is often far from simplistic, and a solitary descriptor variable frequently falls short in unraveling the intricacies underlying the objective variable. Datasets commonly exhibit the coexistence of multiple descriptor variables, thereby necessitating the adoption of more sophisticated methodologies. In light of this, the concept of multiple linear regression emerges as a compelling solution. Multiple linear regression represents a powerful extension of its linear counterpart, accommodating the incorporation of numerous descriptor variables. Its mathematical formulation can be succinctly expressed through the equation y = W1X1 + W2X2 + ... + WnXn + W0, where W1, W2, ..., Wn symbolize the respective coefficients associated with each descriptor variable, while W0 denotes the intercept term.
Fig. 8.2 Root-mean-square deviation. The figure compares two lines of the form y = slope × x + intercept fitted to the same data points, noting that the slope and intercept are the hyperparameters controlling accuracy: the closer fit has residuals of −5, 4, −2, and 1, giving ((−5)² + 4² + (−2)² + 1²)/4 = 11.5, while the poorer fit has residuals of −20, 1, −2, and 10, giving roughly 126.
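A sketch of the fitting and evaluation workflow just described, with the descriptor and objective variables held as lists, a LinearRegression() instance trained with fit(), and the RMSE of the fitted line computed afterwards; the numbers below are synthetic and are not those of Figs. 8.1 and 8.2:

# Sketch: fit y = w1*x + w0 with scikit-learn and evaluate the fit with RMSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

x = [[5], [10], [20], [30], [40]]          # descriptor variable (one feature per sample)
y = [8, 14, 22, 33, 41]                    # objective variable

model = LinearRegression()
model.fit(x, y)                            # learn slope (coef_) and intercept (intercept_)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

y_pred = model.predict(x)
rmse = np.sqrt(mean_squared_error(y, y_pred))   # square root of the mean squared error
print("RMSE:", rmse)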
By embracing the paradigm of multiple linear regression, one can effectively navigate datasets characterized by elevated dimensions, deftly capturing the multifaceted influence of diverse descriptor variables on the objective variable. Moreover, within the expansive domain of linear regression, the versatility of polynomial regression unveils itself as a valuable tool. This flexible technique enables the seamless integration of polynomial terms into the linear regression framework, facilitating the modeling of non-linear relationships. A notable example of polynomial regression can be encapsulated by the equation y = W1X1 + W2X1² + W0, where the term X1² signifies the squared representation of the descriptor variable X1. The inclusion of polynomial terms empowers the linear regression model to encompass and accommodate the intricate non-linear patterns that may manifest within the data, thus augmenting the model's flexibility and expressiveness. While linear regression is undeniably a potent machine learning algorithm, its performance is intrinsically linked to the characteristics of the data it operates on. The chosen approach must align with the nature of the data at hand, ensuring a suitable and effective modeling strategy. Let us direct our attention to Fig. 8.3, which showcases four distinct types of data, featuring the objective variable y and the descriptor variable x. Each case presents unique challenges and considerations that underscore the significance of choosing the appropriate modeling approach. In the first case, the data exhibit a linear relationship, allowing linear regression to yield impeccable results. The model seamlessly captures the
Fig. 8.3 Four different types of data. Linear regression is implemented for all cases
underlying patterns, showcasing the efficacy of linear regression in scenarios characterized by linear data. However, the second case presents a different scenario, where the data deviate from a linear pattern. In such instances, non-linear machine learning algorithms become necessary to effectively capture and model the non-linear relationships present within the data. The third case introduces the concept of outliers, which can significantly impact the performance of the linear regression model. These outliers, when present, may exert undue influence on the model, leading to suboptimal predictions. Therefore, careful inspection of the training data is essential to identify and address potential outliers. It is crucial to exercise caution as outliers may not always be visually evident, necessitating thorough data visualization and preprocessing techniques to accurately detect and handle them. Nevertheless, it is crucial to acknowledge that the treatment of outliers can greatly impact the performance of linear regression, as demonstrated in the fourth case. Sensible decision-making is vital when dealing with outliers, necessitating data preprocessing and parameter adjustment to mitigate their influence. A comprehensive understanding of the data’s characteristics and potential outliers is essential in leveraging linear regression effectively. Data visualization, careful data preprocessing, and thoughtful consideration of outlier treatment are critical components in optimizing the modeling process and harnessing the full potential of linear regression.
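The sensitivity to outliers discussed above can be demonstrated with a small sketch: the same linear data are fitted twice, once as-is and once with a single corrupted observation, and the recovered slope and intercept shift noticeably (data invented for illustration):

# Sketch: a single outlier can noticeably shift the fitted slope and intercept.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10, dtype=float).reshape(-1, 1)
y_clean = 2.0 * x[:, 0] + 1.0              # perfectly linear data
y_outlier = y_clean.copy()
y_outlier[-1] += 40.0                      # one corrupted observation

for label, y in (("clean", y_clean), ("with outlier", y_outlier)):
    model = LinearRegression().fit(x, y)
    print(f"{label:12s} slope={model.coef_[0]:.2f} intercept={model.intercept_:.2f}")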
8.3 Ridge Regression
Ridge regression stands out as another powerful modeling technique that can be harnessed for addressing complex data scenarios. It extends linear regression by incorporating a penalty term into the loss function, allowing for enhanced control and regularization of the model's behavior. Let us now direct our attention to the illustrative
Fig. 8.4 Ridge regression with different degrees
examples depicted in Fig. 8.4, where the test and training data are generated. The objective is to devise a linear regression model capable of accurately capturing the underlying patterns within the data. Upon examination of the graph, it becomes evident that the data points exhibit a sinusoidal shape, implying that a polynomial linear regression should be employed to adequately model the relationships. In the figure, we observe that the fitted model’s characteristics change in response to alterations in the degree of the polynomial, denoted by “d.” Notably, when d equals 1, a straight line is formed, whereas increasing values of d lead to polynomial fittings of varying complexities. Interestingly, as the degree of the polynomial increases, RMSE decreases, indicating a reduction in the discrepancy between the predicted and actual values in the training data. However, it is crucial to note that this decrease in RMSE may not always signify improved performance. In fact, it can be indicative of overfitting, where the model becomes excessively tailored to the training data, impairing its ability to generalize well to unseen test data. Essentially, overfitting denotes a decrease in the model’s generalization ability, which refers to its capability to accurately predict outcomes for unseen data points. It is imperative to strike a balance between model complexity and generalization performance, ensuring that the model neither underfits nor overfits the data. Thus, through the application of ridge regression and
Fig. 8.5 Regularization in linear regression
careful consideration of the model's complexity, one can mitigate the risk of overfitting, achieve satisfactory generalization, and enable accurate predictions on unseen test data. To exercise control over linear regression and address potential overfitting, regularization techniques come into play. In Fig. 8.5, we observe the process of regularization, which involves the formulation of a loss function. This loss function encompasses polynomial linear regression with penalty terms. The inclusion of penalty terms within the loss function allows for the balancing of model complexity and the minimization of overfitting. A crucial hyperparameter, denoted by α, emerges as a pivotal control parameter governing the impact of the penalty terms. Now, let us delve into how the hyperparameter α influences polynomial linear regression, as illustrated in Fig. 8.4. It becomes apparent that α exerts a significant effect on the resulting polynomial regression. The choice of α plays a critical role in determining the optimal trade-off between model complexity and generalization performance. Determining the appropriate value for α can be a challenging task, as it requires meticulous tuning to ascertain the highest degree of generalization. It demands a careful exploration of various values for α and an assessment of their impact on the model's ability to generalize effectively to unseen data. It is worth emphasizing that the hyperparameter α assumes a pivotal role in controlling the linear regression model. Through judicious selection and tuning of α, one can strike the right balance between model complexity and generalization, enabling the creation of a model that captures the underlying patterns in the data while maintaining its ability to make accurate predictions on unseen test data.
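A sketch of polynomial ridge regression in the spirit of Figs. 8.4 and 8.5, where scikit-learn's alpha argument plays the role of the penalty weight α; the data and the candidate α values below are illustrative only:

# Sketch: polynomial ridge regression, where alpha controls the strength of the penalty term.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=30)   # noisy sinusoidal data

for alpha in (1e-6, 1e-2, 1.0):
    model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=alpha))
    score = cross_val_score(model, X, y, cv=5).mean()   # cross-validated generalization estimate
    print(f"alpha={alpha:g}  mean CV R2={score:.2f}")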
8.4 Logistic Regression
Despite its name, logistic regression is actually a classification technique in machine learning. It leverages the logistic function to assign data points to either class 0 or class 1, making it a binary classification model. Figure 8.6 provides an example of logistic regression, demonstrating how a single descriptor variable can determine whether a system is active or inactive. In this case, the descriptor variable is the petal width, while the objective variable represents the state of the system, encoded as either 0 or 1. The binary nature of logistic regression is closely tied to Boolean logic, where 0 and 1 correspond to False and True, respectively. Thus, logistic regression enables binary predictions, with the aim of assigning data points to one of the two classes. The logistic function used in logistic regression outputs probabilities ranging between 0 and 1. When the probability falls below 0.5, the model predicts class 0, while probabilities above 0.5 are assigned to class 1. Logistic regression, therefore, determines binary classifications based on probabilities, reflecting its probabilistic nature. It is important to note that there exists a decision boundary where the model makes the final determination between class 0 and class 1. This decision boundary is typically set at a probability threshold of 0.5. The significance of the decision boundary extends beyond logistic regression and has implications for other supervised machine learning techniques, such as support vector machines. The choice of the decision boundary can significantly impact the model's performance and generalization
Fig. 8.6 Logistic regression
capability. Implementing logistic regression is straightforward: it can be accomplished with scikit-learn's LogisticRegression() class, as exemplified in Fig. 8.6. This allows for the construction of a logistic regression model, enabling binary classification based on the underlying probabilities obtained through the logistic function.
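A minimal sketch of this binary classification setup, with invented petal-width values; predict_proba exposes the probabilities that underlie the 0/1 decision at the 0.5 threshold:

# Sketch: logistic regression for binary classification based on one descriptor.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.2], [0.4], [0.6], [1.4], [1.6], [1.8]])   # descriptor (e.g., petal width)
y = np.array([0, 0, 0, 1, 1, 1])                            # objective: class 0 or class 1

clf = LogisticRegression()
clf.fit(X, y)

X_new = np.array([[0.5], [1.5]])
print(clf.predict_proba(X_new))   # probabilities for class 0 and class 1
print(clf.predict(X_new))         # 0/1 decision using the 0.5 probability threshold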
8.5 Support Vector Machine
Support Vector Machine (SVM) is a powerful supervised machine learning model that is widely used in various applications. It is capable of handling both regression and classification tasks, and it is versatile enough to handle linear as well as non-linear data instances. SVM achieves this by creating a decision boundary line or hyperplane that separates different classes in the data. Let us consider the example shown in Fig. 8.7, where we have two groups of data points represented by circles and triangles. The task is to classify these data points into their respective groups. SVM accomplishes this by creating a decision boundary line between the two groups, as depicted in Fig. 8.7. In this specific case, the data points are clearly separable, and a simple straight line can serve as an effective decision boundary. However, in real-world scenarios, the decision
Fig. 8.7 The concept of support vector machine
Fig. 8.8 Margin in support vector machine
boundaries are often more complex, and finding an optimal decision boundary becomes more challenging. SVM employs a mathematical technique that maximizes the margin or the distance between the decision boundary and the closest data points of each class. This ensures that the decision boundary is as robust and generalizable as possible. In cases where the data are not linearly separable, SVM employs a kernel trick to transform the data into a higher-dimensional space, where a linear decision boundary can be established. The ability of SVM to handle non-linear decision boundaries makes it a popular choice in various machine learning applications. SVMs can effectively handle complex datasets where the classes are not easily separable by a simple linear boundary. By creating decision boundaries that can be non-linear or more complex, SVMs can accurately classify data points and make predictions in a wide range of scenarios. SVM utilizes the concept of margins to control the decision boundary. The margin refers to the distance between the decision boundary and the closest data point, as illustrated in Fig. 8.8. In SVM, two key concepts are considered: the soft margin and the hard margin. The soft margin allows for some data points to be within the margin, while a hard margin does not tolerate any data points within the margin. The choice between a soft margin and a hard margin depends on the specific characteristics of the dataset. Opting for a hard margin can result in overfitting, where the model becomes too complex and adapts too closely to the training data. This can lead to poor generalization and reduced performance on unseen data. On the other hand, a soft margin allows for more flexibility in adopting complex decision boundaries, but it may not classify the data points accurately. Selecting the appropriate margin setting is crucial and should be based
Fig. 8.9 Non-linear decision boundary in support vector machine
on the particular dataset at hand. It requires careful consideration and tuning to strike the right balance between model complexity and generalization ability. The scikit-learn library provides support for SVM through the LinearSVC class, which allows for support vector classification. In the Python code example depicted in Fig. 8.8, the LinearSVC is employed with the parameter C set to 0, which controls the margin. By adjusting the value of C, the trade-off between fitting the training data and achieving good generalization can be finetuned. Implementing SVM using scikit-learn offers a convenient way to apply support vector classification and explore different margin settings to find the optimal balance for a given dataset. Support Vector Machines (SVMs) are particularly powerful for solving non-linear problems. In cases where linear SVC fails to create appropriate decision boundaries for non-linear data, an alternative approach is to use the Radial Basis Function (RBF) kernel. Figure 8.9 demonstrates the limitation of linear SVC in handling non-linear data. As observed, the linear decision boundary is unable to accurately separate the blue and orange data points. However, by employing the RBF kernel method, it becomes possible to transform the data into a high-dimensional space where a suitable decision boundary can be established. The RBF kernel method maps the data to a higher-dimensional feature space, allowing for the creation of decision boundaries that are not possible with linear SVC alone. In Fig. 8.9, we can observe that the blue data points have been transformed into high dimensions, resulting in a clear separation between the blue and orange data points. This transformation helps unveil intricate decision boundaries that would have been otherwise unachievable with linear SVC. The RBF kernel is just one example of a kernel function that can be used in SVM. Other kernel functions, such as polynomial
Fig. 8.10 Types of decision boundaries in support vector machines
and sigmoid kernels, can also be employed based on the specific characteristics of the data and the problem at hand. The choice of kernel function is crucial in capturing the underlying patterns and relationships in non-linear data. By utilizing the RBF kernel and other appropriate kernel functions, SVM becomes a versatile and robust tool for solving non-linear classification problems. It allows for the discovery of complex decision boundaries and facilitates accurate classification even in scenarios where linear models fall short. Support Vector Machines (SVMs) offer several kernel functions that can be applied to handle different types of data and establish appropriate decision boundaries. Figure 8.10 provides an overview of four commonly used kernel functions in SVM:
• Linear Kernel: The linear kernel applies a straight decision boundary to separate the data points. It is suitable for linearly separable data, where a straight line can effectively classify the classes. The linear kernel is the simplest and fastest option among the kernel functions.
• Radial Basis Function (RBF) Kernel: The RBF kernel, as mentioned earlier, transforms the data into a high-dimensional space. It is effective in capturing complex non-linear relationships and can establish non-linear decision boundaries. The RBF kernel is versatile and widely used due to its ability to handle a wide range of data patterns.
• Polynomial Kernel: The polynomial kernel applies a polynomial function to the data, allowing for the creation of non-linear decision boundaries. It is suitable for data that exhibit polynomial relationships. The degree of the polynomial can be adjusted to control the flexibility of the decision boundary.
• Sigmoid Kernel: The sigmoid kernel applies a sigmoid function to the data, resulting in a decision boundary that resembles the shape of a sigmoid curve. It is primarily used for binary classification problems. The sigmoid kernel can handle data with non-linear relationships, but it is less commonly used compared to the linear, RBF, and polynomial kernels.
Choosing the appropriate kernel and determining the optimal value for the margin parameter, C, depend on the specific characteristics of the data. The selection process
Fig. 8.11 Hyperparameters in support vector machine
involves understanding the underlying relationships, visualizing the data, and experimenting with different kernel functions and margin values. The goal is to find the combination that results in the most accurate and reliable classification performance for the given dataset. In SVM, the hyperparameter γ plays a crucial role in determining the influence of each data point on the decision boundary. Figure 8.11 illustrates the impact of γ on the decision boundary. The hyperparameter γ controls the width of the Gaussian radial basis function used by the RBF kernel. A small value of γ means that the influence of each data point is more spread out, resulting in a smoother decision boundary. This can be seen in the left side of Fig. 8.11, where the decision boundary is relatively smooth and captures a larger area of data points. On the other hand, a large value of γ leads to a more focused and localized influence of each data point. This results in a decision boundary that closely fits the individual data points. However, it also increases the risk of overfitting, as seen on the right side of Fig. 8.11, where the decision boundary appears to be excessively complex and overly tailored to the training data. Choosing the appropriate value of γ is crucial to achieve a good balance between underfitting and overfitting. It requires careful consideration and experimentation to find the optimal value for γ and other hyperparameters, such as the margin parameter C. The selection process often involves techniques like cross-validation and grid search to evaluate the performance of different combinations of hyperparameters and choose the one that yields the best generalization ability on unseen data. SVM with the RBF kernel is widely recognized as one of the most powerful machine learning models available. The RBF kernel enables SVM to effectively capture complex and non-linear relationships in high-dimensional feature spaces. However, while SVM
with the RBF kernel offers remarkable predictive capabilities, it suffers from a notable drawback: the lack of interpretability. This limitation arises from the nature of SVM’s decision boundary creation in the context of multidimensional descriptor variables. As the number of dimensions increases, the resulting feature space becomes increasingly intricate and difficult to visualize. Consequently, understanding the specific mechanisms through which the decision boundary is determined becomes challenging, resembling the concept of a black box. Due to this inherent opacity, SVM’s ability to make accurate predictions often comes at the cost of interpretability. While SVM can provide highly accurate results, it falls short in providing meaningful insights into the underlying reasons or factors driving its decisions. This absence of explanatory power can be a significant consideration in certain applications and domains where interpretability and transparency are crucial. Therefore, when employing SVM with the RBF kernel, it is essential to bear in mind that its strengths lie primarily in its predictive capabilities rather than in its ability to offer a comprehensive understanding of the decision-making process. Consequently, careful consideration should be given to the specific requirements and constraints of the problem at hand when deciding to utilize SVM, particularly when interpretability and explainability are of paramount importance. SVMs are not only applicable to classification tasks but also find utility in regression modeling. Through the utilization of the Radial Basis Function (RBF) kernel, SVM becomes adept at capturing and representing intricate relationships in data. This capability is vividly demonstrated in Fig. 8.12. In regression scenarios, the objective is to predict continuous numerical values rather than discrete classes. The RBF kernel, with its ability to handle non-linear relationships, empowers SVM to effectively model complex data patterns and generate accurate regression predictions. By employing the RBF kernel,
Fig. 8.12 Regression in support vector machine
SVM transforms the original feature space into a higher-dimensional space, enabling the identification of intricate relationships between the descriptor variables and the target variable. In Fig. 8.12, we can observe the power of SVM with the RBF kernel in capturing the complexities of the data. The scattered data points exhibit a non-linear pattern, and the SVM regression model, aided by the RBF kernel, successfully captures this intricate relationship. The resulting decision boundary delineates the predictions generated by the SVM model, providing a reliable estimation of the target variable across the input feature space. By leveraging the RBF kernel, SVM excels at handling regression tasks, offering accurate predictions even in the presence of intricate and non-linear relationships. This capability makes SVM a valuable tool in various domains where capturing complex patterns and making precise numerical predictions are of utmost importance.
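A sketch of RBF-kernel classification and regression with scikit-learn's SVC and SVR classes; the data are synthetic, and the values of C and gamma are illustrative starting points rather than tuned settings:

# Sketch: support vector classification and regression with the RBF kernel.
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
# Non-linearly separable classification data: class 1 sits inside a disk, class 0 outside.
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)

clf = SVC(kernel="rbf", C=1.0, gamma=1.0)     # C controls the margin, gamma the kernel width
clf.fit(X, y)
print("classification accuracy:", clf.score(X, y))

# Regression on a noisy sine curve with support vector regression.
X_r = np.linspace(0, 2 * np.pi, 80).reshape(-1, 1)
y_r = np.sin(X_r[:, 0]) + rng.normal(scale=0.1, size=80)
reg = SVR(kernel="rbf", C=10.0, gamma=0.5).fit(X_r, y_r)
print("regression R2:", reg.score(X_r, y_r))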
8.6 Decision Tree
The decision tree is a versatile machine learning model capable of handling both classification and regression tasks, as well as accommodating linear and non-linear data relationships. It operates based on the principle of constructing a tree-like model that learns to make decisions by analyzing the available data. The decision tree classifier algorithm is commonly used to implement decision trees in practice. Figure 8.13 provides a visual representation of the structure of a decision tree. To illustrate the functioning of a decision tree, let us consider a scenario involving two descriptor variables: temperature and pressure. The decision tree model starts with the first node, which represents a decision based on whether the temperature is smaller than 100. If the temperature is
Fig. 8.13 The concept of decision tree
indeed smaller, the decision tree outputs the result as “circle.” On the other hand, if the temperature is not smaller than 100, the model proceeds to the next question or decision point. In this case, it evaluates whether the pressure is smaller than 1. Depending on the outcome, the decision tree returns either “circle” or “triangle” as the final result. The structure of the decision tree is determined through a process called recursive partitioning, where the algorithm recursively splits the data based on the available features to create decision nodes. Each decision node represents a specific condition or question that helps navigate the tree and determine the final prediction or classification. By learning the optimal decisions from the training data, the decision tree model can accurately predict the target variable for new instances or make classifications based on the input features. The advantage of decision trees lies in their interpretability, as the resulting tree structure provides a transparent representation of the decision-making process. The decisions made at each node can be easily understood and explained, allowing for valuable insights into the relationship between the input variables and the target variable. This interpretability makes decision trees particularly useful in domains where model transparency and human comprehensibility are essential. In practice, decision trees can handle both categorical and numerical variables and are capable of accommodating non-linear relationships in the data. However, it is important to note that decision trees are prone to overfitting, especially when the tree becomes too complex or when the data contain noise or outliers. Techniques such as pruning, ensemble methods (e.g., random forests), and parameter tuning can be employed to mitigate overfitting and enhance the predictive performance of decision tree models. The growth of a decision tree is determined by various factors, including the complexity of the data and the desired level of predictive accuracy. To measure the impurity or statistical dispersion of a decision tree, the Gini coefficient is commonly used. The Gini coefficient quantifies the degree of impurity in a set of samples by calculating the probability of misclassifying a randomly chosen sample. In Fig. 8.13, we observe a dataset consisting of three circles and three crosses. The initial Gini coefficient for this dataset is calculated to be 0.5. The first decision is made based on a particular feature, and the resulting Gini coefficient is computed to be 0.44. This reduction of 0.06 indicates that the decision tree is attempting to minimize impurities within the subsets created by the decision. However, it is important to note that the initial high Gini coefficient suggests that there is still considerable overlap between the two classes (circles and crosses), indicating impurities within the subsets. In the case of the right plot in Fig. 8.13, a lower Gini coefficient of 0.25 is obtained. This significant reduction in impurity indicates a more accurate decision boundary, where the right side of the boundary predominantly contains circles, while the left side mainly consists of crosses. The decision tree algorithm iteratively calculates the Gini coefficient at each decision point and selects the decision boundary that minimizes the Gini coefficient. This iterative process continues until the Gini coefficient is minimized or falls below a predetermined threshold. 
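The Gini calculation described above, together with a small decision tree whose default criterion is the same Gini impurity, can be sketched as follows; the temperature/pressure data are invented to mimic the situation in Fig. 8.13:

# Sketch: Gini impurity of a node, and a decision tree trained on two descriptors.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["circle"] * 3 + ["cross"] * 3))   # 0.5 for a perfectly mixed node

# Invented temperature/pressure data in the spirit of Fig. 8.13.
X = np.array([[80, 0.5], [90, 0.8], [95, 1.2], [120, 0.7], [130, 1.5], [150, 2.0]])
y = np.array(["circle", "circle", "circle", "circle", "triangle", "triangle"])

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["temperature", "pressure"]))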
The goal of the decision tree algorithm is to create decision boundaries that effectively separate the different classes or categories present in the data. By minimizing the Gini coefficient,
Fig. 8.14 Nodes in decision tree
the decision tree aims to achieve the purest subsets possible at each decision node. However, it is important to strike a balance between reducing impurities and avoiding excessive complexity or overfitting. Techniques such as pruning, which involves removing unnecessary decision nodes or limiting the depth of the tree, can help control the growth of the decision tree and prevent overfitting. Overall, the Gini coefficient is a valuable measure used in decision tree algorithms to assess the quality of decision boundaries and guide the construction of the tree. By iteratively reducing the Gini coefficient, decision trees aim to create accurate and interpretable models that capture the underlying patterns in the data. Figure 8.14 presents the structure of a decision tree based on real data. At the beginning of the decision tree, we encounter the root node, which initiates the decision-making process. The root node evaluates a specific condition, and if the condition is met (i.e., “yes”), the classification “circle” is assigned. In this case, the Gini coefficient is 0, indicating that there are no impurities present, and the node becomes a leaf node. A leaf node represents the final classification decision for a particular subset of data. On the other hand, if the condition evaluated at the root node is not met (i.e., “no”), we move to the next decision node, which is known as a child node. From this child node, further decisions are made, leading to the creation of two additional leaf nodes. Each leaf node represents a specific classification outcome based on the given conditions and the observed features of the data. The decision tree continues to grow as more decision nodes are encountered, branching out into various paths based on the conditions evaluated at each node. This branching process allows the decision tree to capture the underlying patterns and relationships within the data, leading to a hierarchical structure that guides the
Fig. 8.15 Regression in decision tree
classification or regression process. It is important to note that the structure of a decision tree is highly dependent on the specific dataset and the features being considered. As the decision tree grows, it seeks to create decision boundaries that best separate the different classes or categories present in the data, while minimizing impurities or misclassifications. The overall goal is to create an interpretable and accurate model that can make predictions or classifications based on the observed features. The Gini coefficient, as mentioned earlier, is a measure of impurity used to assess the quality of decision boundaries at each node. It quantifies the probability of misclassifying a randomly chosen sample within a particular node. In the context of a decision tree, a Gini coefficient of 0 indicates a pure node with no impurities, while a higher Gini coefficient suggests a greater degree of impurity within the node. The decision tree algorithm recursively evaluates conditions and creates decision nodes until a stopping criterion is met, such as reaching a pure node (Gini coefficient of 0) or reaching a predetermined maximum depth. This process allows the decision tree to adapt to the complexity of the data, capturing both linear and non-linear relationships, and providing an interpretable model for classification or regression tasks. Figure 8.15 illustrates the use of decision trees in regression models. While decision trees are commonly associated with classification tasks, they can also be utilized for regression analysis. In this context, the decision tree is constructed to predict continuous target variables based on the observed features. However, it is important to exercise caution when using decision trees for regression. One limitation of decision trees is their inability to predict objective variables that lie outside the range of values observed in
the training data. This can lead to overfitting, where the model becomes overly sensitive to the training data and fails to generalize well to unseen data. To mitigate overfitting and control the complexity of the decision tree, hyperparameters such as the maximum depth of the tree and the number of leaf nodes can be adjusted. These hyperparameters influence the depth and breadth of the decision tree, allowing for a balance between model complexity and generalization ability. Like any machine learning approach, decision trees have their own strengths and weaknesses. One advantage is their simplicity and ease of implementation, making them accessible to users with varying levels of expertise. Decision trees can be viewed as "whitebox" models, meaning that their internal workings are transparent, and it is easy to understand how the model makes predictions based on the decision rules at each node. In contrast, more complex models such as Support Vector Machines (SVM) can be considered "blackbox" models, as they often involve intricate mathematical transformations and optimization procedures that make it challenging to interpret how the model arrives at its predictions. Decision trees provide a more interpretable alternative to SVM in this regard. However, a drawback of decision trees is their sensitivity to small changes in the training data. A slight modification or addition of data points can potentially lead to the construction of a different decision tree, which may not accurately capture the underlying patterns in the data. This sensitivity to data perturbations highlights the importance of evaluating the decision tree model carefully and considering its robustness to variations in the dataset. Decision trees offer a simple yet powerful approach for both classification and regression tasks. They provide interpretability and transparency, making them useful for gaining insights into the decision-making process. However, it is crucial to be mindful of their limitations, such as the potential for overfitting and sensitivity to data variations, and to employ appropriate techniques to control these issues when using decision trees in practice.
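As a minimal sketch of how such a model can be trained with scikit-learn, a decision tree classifier with a limited depth can be built and inspected as follows; the iris dataset and the chosen hyperparameter values are illustrative assumptions, not taken from the text.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth and min_samples_leaf act as pruning-style controls against overfitting
model = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=2)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # accuracy on held-out data
print(export_text(model))           # the "whitebox" decision rules at each node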
8.7 Random Forest
Random Forest (RF) is indeed a widely used and powerful machine learning method that complements the capabilities of Support Vector Machines (SVMs). Similar to decision trees, RF is versatile and can handle both linear and non-linear models, making it suitable for a range of regression and classification tasks. RF is built upon the foundation of decision trees, addressing one of the key limitations of individual decision trees, which is their tendency to overfit the training data. Overfitting occurs when a model learns the training data too well and fails to generalize to unseen data, resulting in reduced accuracy on new observations. To mitigate overfitting, RF employs an ensemble approach by creating multiple decision trees and aggregating their predictions. Each decision tree in the forest is trained on a random subset of the training data, and at each node, a random subset of features is considered for splitting. By introducing randomness in both the data and feature selection, RF aims to reduce the overfitting potential of individual decision
Fig. 8.16 The concept of random forest
trees. Figure 8.16 provides a visual representation of how RF is constructed. Multiple decision trees are grown, each using different subsets of the training data and considering different subsets of features. Each tree independently makes predictions, and the final prediction of the random forest is obtained by combining the predictions of all the trees, such as through majority voting (for classification) or averaging (for regression). This ensemble approach helps to improve the overall accuracy and robustness of the model. By leveraging the collective knowledge of multiple decision trees, RF can capture a wider range of patterns and relationships in the data, making it more capable of handling complex and non-linear problems. Furthermore, RF provides additional benefits such as feature importance estimation, allowing for the identification of influential features in the prediction process. Overall, RF is a powerful machine learning method that overcomes the limitations of individual decision trees by aggregating multiple trees to improve prediction accuracy and reduce overfitting. Its ability to handle both linear and non-linear models, along with its versatility in regression and classification tasks, contributes to its popularity and widespread use in various domains. RF consists of several key steps that contribute to its predictive power and accuracy. The first step in building a random forest is creating multiple decision trees. Each decision tree is constructed using a random subset of the training data, a technique known as bootstrapping. Bootstrapping involves sampling the training data with replacement, which means that some instances may appear multiple times in a given subset, while others may be omitted. By creating different subsets of the training data, each decision tree in the random forest is exposed to a diverse set of observations, which helps to introduce variability and reduce the potential for overfitting. The second step involves making predictions for each decision tree in the random forest. Each tree independently processes
the input data and generates its own predictions based on the learned patterns and splits. In the example of Fig. 8.16, if there are three decision trees, and two of them predict a value of 1, while one predicts a value of 0, the predictions of the individual trees are taken into account in the subsequent steps. The third step is aggregation, where the predictions of the decision trees are combined to obtain a final prediction. In classification tasks, this is typically done through majority voting, where the class that receives the most votes among the decision trees is selected as the final prediction. In this example, since the majority of the trees predict a value of 1, the final prediction is also 1. For regression tasks, the predictions of the decision trees are often averaged to obtain the final prediction. Finally, in the fourth step, the final prediction is made based on the aggregated predictions from the decision trees. By leveraging the collective knowledge of multiple trees and considering the majority vote or averaging, the random forest model can provide a more robust and accurate prediction compared to individual decision trees. This approach helps to mitigate the impact of any individual tree that may make incorrect predictions, as the overall prediction is based on the majority consensus. In the case of the example in Fig. 8.16, the final prediction is 1. By combining the randomized construction of decision trees, their independent predictions, and the aggregation of their results, random forests are able to improve the accuracy and generalization capabilities compared to using a single decision tree. This ensemble approach, with its majority voting or averaging mechanism, contributes to the robustness and predictive power of random forests in handling a wide range of regression and classification tasks. The bootstrapping method plays a crucial role in making random forests powerful through the creation of randomized decision trees. In RF, it is essential to generate diverse and distinct decision trees to capture different aspects of the data. The bootstrapping method enables this by creating multiple subsets of the original data, each containing a random selection of samples. This process involves randomly selecting data points from the original dataset, allowing for the possibility of selecting the same data point more than once. The resulting subsets, known as bootstrap samples, have the same size as the original dataset but exhibit variation due to the random sampling. In Fig. 8.17, an illustration of the bootstrapping method is presented. Suppose there are three sets of data, each consisting of three descriptor variables. The bootstrapping method is applied to generate new datasets by randomly selecting data points from the original dataset, with the possibility of duplicating certain points. For example, in the first bootstrap sample, data point 2 is chosen twice. Once the bootstrap samples are created, individual decision trees are constructed using each sample. The construction of each tree involves randomly selecting descriptor variables at each node based on the available features in the given bootstrap sample. This random selection of variables introduces further variability and ensures that each decision tree focuses on different aspects of the data. By combining the bootstrapping method with the random selection of descriptor variables, RF creates a diverse ensemble of decision trees.
Each tree contributes its unique perspective and predictions based on the random subset of data it was trained on. The aggregation of these predictions, whether through majority voting for classification tasks or averaging for regression tasks, results in a
Fig. 8.17 Bootstrapping in random forest
robust and accurate final prediction. The bootstrapping method plays a crucial role in the randomization of decision trees, ensuring that each tree in the random forest is exposed to different subsets of the data. This randomness and diversity among the decision trees contribute to the strength and power of random forests in handling complex regression and classification problems. RF is simplified and implemented by calling RandomForestClassifier (RandomForestRegressor) as seen in the following code:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

model = RandomForestClassifier()
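To make the bootstrapping idea behind Fig. 8.17 concrete, the toy sketch below draws one bootstrap sample with NumPy; the three-point dataset is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
data = np.array([1, 2, 3])  # a toy dataset of three samples

# A bootstrap sample has the same size as the original data and is drawn with
# replacement, so some points may appear twice while others are left out.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)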
RF offers a powerful feature to evaluate the importance of descriptor variables, allowing us to identify the variables that have a significant impact on the predictions made by the model. The importance analysis, illustrated in Fig. 8.18, provides valuable insights into the relevance and contribution of each descriptor variable. In RF, the impurity of each node in the decision tree is measured using metrics such as the Gini coefficient. By evaluating the impurity reduction achieved by splitting on a particular descriptor variable, we can assess its importance in the overall model. Descriptor variables that contribute to a substantial reduction in impurity are considered important, as they play a crucial role in achieving accurate predictions. The importance analysis assigns a score to each descriptor variable, typically ranging from 0 to 1, reflecting their relative importance in the random forest model. A higher score indicates a greater impact on the predictions.
Fig. 8.18 Feature importance in random forest
This analysis aids in identifying the most influential descriptor variables within the given dataset. The importance analysis in random forests has practical applications. It can be employed to search for significant descriptor variables, helping researchers focus their attention on the most relevant features when building subsequent models, such as SVM or other machine learning algorithms. By incorporating the important descriptor variables identified through the analysis, the subsequent models can benefit from the enhanced predictive power. Moreover, importance analysis also offers valuable insights into the scientific mechanisms underlying the prediction process. By identifying the descriptor variables that have a major impact on predicting the objective variables, researchers can gain a deeper understanding of the relationships and dependencies within the dataset. This understanding can provide valuable scientific insights and guide further investigations into the underlying mechanisms driving the predictions. In summary, the importance analysis in random forests provides a quantitative measure of the relevance and impact of descriptor variables. It enables the identification of important variables for subsequent modeling and facilitates the interpretation of the scientific mechanisms behind the predictions made by the model.
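A minimal, self-contained sketch of this importance analysis is shown below; the iris dataset stands in for materials or catalysts data purely for illustration.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

importances = model.feature_importances_   # impurity-based scores that sum to 1
for i in np.argsort(importances)[::-1]:    # most influential descriptor first
    print(data.feature_names[i], round(importances[i], 3))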
8.8 Voting
Voting machine learning, also known as ensemble learning, offers a unique and powerful approach by harnessing the collective knowledge of multiple machine learning models.
Fig. 8.19 Ensemble voting classification
In this methodology, as illustrated in Fig. 8.19, classification predictions are obtained from diverse models such as logistic regression, support vector machines (SVM), random forests (RF), and other sophisticated algorithms. The essence of voting machine learning lies in aggregating these individual predictions and leveraging the power of the majority to make the final prediction. By considering the outputs generated by each model, voting machine learning maximizes the overall accuracy and robustness of the prediction process. This approach proves particularly valuable in scenarios where individual models may have varying strengths and weaknesses, allowing for a more comprehensive and reliable prediction outcome.
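In scikit-learn, this kind of ensemble voting is available through VotingClassifier. The sketch below is illustrative only; the choice of base models, their settings, and the iris dataset are assumptions rather than recommendations.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

model = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
        ("rf", RandomForestClassifier()),
    ],
    voting="hard",  # majority vote over the class labels predicted by each model
)
model.fit(X, y)
print(model.predict(X[:5]))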
8.9 Neural Network
The neural network is a highly influential and versatile machine learning model that aims to emulate the intricate functioning of the human brain. At the core of the neural network lies the perceptron, an essential classification algorithm serving as an artificial neuron. As depicted in Fig. 8.20, the perceptron receives two input variables, x1 and x2, and applies an activation function to produce an output. However, when confronted with a complex dataset like the one illustrated, a simple perceptron is insufficient to establish an effective decision boundary using a single line. To tackle such intricate classification tasks, a hidden layer is introduced, acting as an intermediary between the input and output layers, as demonstrated in Fig. 8.20. This hidden layer enables the neural network to capture the inherent complexity of the data by learning intricate patterns and relationships.
Fig. 8.20 The concept of neural network
By incorporating multiple hidden layers, the neural network can effectively process and classify complex data that would otherwise be unattainable using a single line or surface. The information extracted from these hidden layers is then amalgamated to make a final decision, enabling the neural network to classify data points that are not amenable to linear separation. These hidden layers serve as the essence of the neural network, harnessing their collective computational power to unveil intricate structures within the data, leading to enhanced classification capabilities and improved accuracy in complex problem domains. In using the Python scikit-learn library, neural networks can be used by calling the MLPClassifier, where the number of layers can be defined. An example of how to do this is shown in the code below.

from sklearn.neural_network import MLPClassifier

model = MLPClassifier()

Deep learning, a branch of machine learning, is built upon the foundation of neural networks. While neural networks typically consist of a single hidden layer, deep learning distinguishes itself by incorporating multiple hidden layers. This increased depth allows deep learning models to effectively handle complex and intricate datasets. By leveraging the hierarchical representations learned in these deep architectures, deep learning models can capture and exploit intricate patterns and relationships that may exist within the data. It is worth noting, however, that the increased complexity and depth of deep learning models pose challenges in terms of interpretability. As the number of layers and parameters grows, it becomes increasingly difficult to comprehend how the model learns and makes predictions. Deep learning models are often referred to as black boxes, as the underlying mechanisms are not readily interpretable by humans. This lack of interpretability can be
seen as a trade-off for the enhanced performance and capability of deep learning models. Another important consideration when working with deep learning is the requirement for large amounts of training data. Deep learning models thrive when provided with abundant data, as they have the capacity to learn intricate representations from a wealth of examples. This data-hungry nature of deep learning highlights the significance of data availability and quality in achieving optimal performance. In Python, the scikit-learn library provides the MLPClassifier, which allows for the implementation of deep learning models. Additionally, frameworks such as TensorFlow and PyTorch are popular choices for deep learning, offering extensive functionalities and flexibility. However, these specific packages are beyond the scope of this book and are not covered in detail. Overall, deep learning holds great potential in tackling complex tasks and extracting meaningful representations from vast amounts of data. It is a rapidly evolving field with a wide range of applications, revolutionizing various domains such as computer vision, natural language processing, and speech recognition.
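As a hedged sketch of how additional hidden layers are specified in scikit-learn, the hidden_layer_sizes argument of MLPClassifier takes one entry per hidden layer; the layer widths, iteration limit, and dataset below are illustrative assumptions only.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # neural networks usually benefit from scaled inputs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers of 32 neurons each; adding entries deepens the network
model = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))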
8.10 Gaussian Process Regression
In the context of materials and catalysts data, it is not uncommon to encounter situations where gathering a large amount of labeled data for supervised machine learning is challenging. In such cases, where the dataset is relatively small, the term “small data” is often used to describe the limited availability of training examples. When dealing with small data, predicting the next data point to optimize or minimize target variables becomes a challenging task. In these scenarios, traditional experimental design approaches are often employed. Experimental design involves systematically planning and executing experiments to obtain informative data points. There are various strategies for experimental design. One approach is to randomly select the next data point, which provides a simple and unbiased way to explore the parameter space. However, this approach may not efficiently utilize the limited data available. Another method, as depicted in Fig. 8.21, is the use of an orthogonal array. An orthogonal array enables the systematic exploration of all possible combinations of descriptor variables while minimizing the number of experimental trials required. By carefully selecting the combinations, an orthogonal array allows for efficient coverage of the parameter space, even with limited data points. The use of orthogonal arrays in experimental design helps ensure that the available data points capture the variability and interactions among the descriptor variables in an efficient and informative manner. This approach aids in maximizing the insights gained from the limited data, allowing for more effective decision-making and understanding of the underlying system. It is important to note that experimental design methods, including the use of orthogonal arrays, can provide valuable insights and guidance when working with small data. These techniques help researchers optimize their experimental efforts, minimize resource utilization, and obtain meaningful results even in scenarios with limited data availability.
Fig. 8.21 The design of experiment
Fig. 8.22 Bayesian optimization with machine learning
Gaussian Process Regressor (GPR) is introduced as an alternative approach for selecting the next data point in scenarios where the goal is to maximize or minimize the objective variable. Figure 8.22 provides a visualization of this concept. In the figure, the blue data points represent the training data, where the y-axis represents the objective variable and the x-axis represents the descriptor variables. The objective is to maximize the value of the objective variable, y, using GPR with the RBF (Radial Basis Function)
kernel. The blue line in the plot represents the predicted result obtained from GPR. This resembles the prediction outcome observed with Support Vector Regression (SVR). However, there is a significant difference: GPR provides an additional measure called standard deviation, which indicates the level of uncertainty associated with the predicted variables. The standard deviation is depicted as the orange shading around the blue line. A higher standard deviation indicates that the predicted point is further away from the training data points, implying greater uncertainty in its prediction. Conversely, a lower standard deviation suggests that the predicted point is closer to the training data points, indicating higher confidence in its prediction. By considering both the predicted values and their associated standard deviations, GPR provides a means to assess the uncertainty in predictions. This information can be valuable in decision-making processes, as it allows researchers to identify regions in the parameter space where the model has a higher degree of uncertainty. This insight can guide the selection of the next data point, enabling more informed and efficient experimental design. Gaussian Process Regressor, with its capability to provide predictions and associated uncertainties, offers a powerful tool for optimizing experimental efforts and exploring the parameter space in situations where limited data are available. In the context of seeking the minimum objective variable predicted by GPR while also targeting points with high standard deviation, the goal is to identify regions of high uncertainty where the objective variable is expected to be low. Figure 8.22 provides insights into this process. In the figure, the largest orange area represents regions with a relatively high standard deviation, indicating higher uncertainty in the predictions made by the GPR model. Simultaneously, the machine predicts relatively low objective variable values in this area. Therefore, based on this analysis, it is suggested that this region should be explored as the next data point to be collected. It is important to note that although selecting the suggested point does not guarantee the discovery of the lowest objective variable, it increases the likelihood of obtaining low objective variable values, which is the focus in this scenario. If the suggested point does not yield the desired minimum of the objective variable, it can be added to the training data, and another round of GPR can be performed. Upon incorporating the new data point into the training set, the standard deviation and predictions made by GPR will be altered, reflecting the updated information. This process can be repeated iteratively, with each round suggesting the next data point for measurement. By accumulating additional data and continuously refining the GPR model, the optimization process aims to guide the exploration of the parameter space and improve the accuracy of predictions. By iteratively selecting data points based on uncertainty and objective variable considerations, the approach combines the insights from GPR's uncertainty estimation and the objective variable optimization, enabling an effective exploration of the parameter space to identify regions with desired characteristics. Indeed, there is a wide range of supervised machine learning models available, each with its own strengths and limitations. This book introduces commonly used models that have proven to be effective in various applications.
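A minimal sketch of the GPR workflow described above is given below; the one-dimensional toy data are an assumption made purely to keep the example short, and return_std=True is what exposes the standard deviation used to gauge uncertainty.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy training data: x is the descriptor, y the objective variable
X_train = np.array([[0.0], [1.0], [2.0], [4.0]])
y_train = np.array([0.0, 0.8, 0.9, -0.5])

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), random_state=0)
gpr.fit(X_train, y_train)

# Predictions and their standard deviations over a grid of candidate points
X_query = np.linspace(0.0, 5.0, 6).reshape(-1, 1)
mean, std = gpr.predict(X_query, return_std=True)
for x, m, s in zip(X_query.ravel(), mean, std):
    print(f"x={x:.1f}  predicted y={m:.2f}  std={s:.2f}")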
Selecting the most appropriate machine learning model for a given task is a crucial step in building accurate and reliable models.
Equally important is the selection of appropriate descriptor variables, as these variables play a fundamental role in capturing the relevant information from the data. The choice of descriptors should be guided by domain knowledge and a deep understanding of the problem at hand. It is essential to consider the specific characteristics of the data, the relationships between variables, and the underlying scientific or engineering principles. Before applying machine learning algorithms, proper data preprocessing is necessary to ensure the data are in a suitable format and to address any issues such as missing values, outliers, or inconsistencies. Data visualization techniques can provide valuable insights into the data, helping to identify patterns, trends, and relationships between variables. Exploratory data analysis through visualization can also provide hints and guidance for selecting the appropriate machine learning model and descriptor variables. By performing data preprocessing and visualization, researchers and practitioners gain a better understanding of the data and can make informed decisions when choosing the most suitable machine learning model and descriptors. These initial steps contribute to the overall success and accuracy of the resulting models, enabling meaningful analysis and predictions based on the available data.
8.11 Cross-Validation
Supervised machine learning has shown powerful advantages in informatics. Here, evaluation methods for machine learning are introduced. Cross-validation is a standard method used to evaluate the accuracy and generalization performance of machine learning models. The goal of cross-validation is to assess how well a trained model can predict the objective variable on unseen data, which is represented by the test dataset. In cross-validation, the available dataset is split into two subsets: the training dataset and the test dataset. The training dataset is used to train the machine learning model, while the test dataset is held back and not used for training. The trained model is then evaluated by making predictions on the test dataset and comparing them with the actual values of the objective variable. The accuracy of the model is typically measured by various metrics, such as mean squared error, accuracy score, or precision and recall, depending on the specific problem. One common approach in cross-validation is to randomly divide the dataset into training and test sets, with a typical split of 70% for training and 30% for testing. This random splitting is repeated multiple times, and the average score across these iterations is calculated to assess the overall performance of the model. Another popular method is k-fold cross-validation, as depicted in Fig. 8.23. In k-fold cross-validation, the dataset is divided into k subsets or folds. Each fold is used as the test dataset once, while the remaining k-1 folds are used as the training dataset. This process is repeated k times, with each fold serving as the test set exactly once. The results from each fold are then averaged to obtain the overall evaluation score of the model. Cross-validation helps to provide a more robust and reliable estimate of the model's performance by using multiple test datasets and considering different variations in the training and
Fig. 8.23 Cross-validation
test splits. It helps to mitigate the effects of data variability and ensures that the model’s performance is not overly influenced by a particular random split of the data. Regression and classification have different evaluation approaches. In the realm of regression analysis, mean squared error (MSE) and coefficient of determination, commonly referred to as R-squared, emerge as fundamental evaluation measures. The mean squared error quantifies the average squared difference between the predicted and actual values of the target variable, offering valuable insights into the overall predictive accuracy of the model. A lower MSE signifies a higher level of precision and closer alignment between the model’s predictions and the ground truth. Conversely, the coefficient of determination, denoted as R-squared, provides a measure of the proportion of the total variance in the target variable that can be explained by the independent variables employed in the regression model. An R-squared value closer to 1 indicates a stronger relationship between the predictors and the target, implying a better fit of the model to the data. In the context of classification machine learning, a diverse repertoire of evaluation methods comes into play. Among these methods, six prominent metrics are commonly employed: the mixing matrix, accuracy, precision, recall, F1 score, and area under the curve (AUC). The mixing matrix, also known as the confusion matrix, assumes a pivotal role by encapsulating four key variables: true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP). These matrix components enable a comprehensive understanding of the model’s performance in predicting positive and negative instances. True negatives represent cases where the model correctly predicts negative outcomes, while true positives indicate accurate predictions of positive outcomes. Conversely, false negatives occur when the model wrongly predicts negative outcomes for positive instances, and false positives arise when the model incorrectly predicts positive outcomes for negative instances. By examining the mixing matrix, one can obtain a rough estimation of the model’s accuracy and discern the various types of prediction errors made. To further assess the model’s performance, the mixing matrix can be leveraged to compute additional evaluation metrics such as accuracy, precision, recall, and the F1
score. Accuracy quantifies the percentage of correctly predicted results in relation to the total number of predictions, providing a comprehensive measure of overall predictive performance. Precision, on the other hand, gauges the proportion of accurately predicted positive instances out of all the predicted positives, shedding light on the model’s ability to precisely identify positive cases. Recall, also known as sensitivity or true positive rate, calculates the percentage of true positive instances correctly identified by the model out of all the actual positive instances, thereby highlighting the model’s capability to capture all relevant positive cases. Furthermore, the F1 score, which strikes a balance between precision and recall, serves as an index that encompasses both measures, providing a holistic evaluation of the model’s performance in classification tasks. By combining precision and recall, the F1 score offers a comprehensive assessment of the model’s ability to achieve both high precision and recall simultaneously, thereby capturing the trade-off between the two metrics. In addition to the aforementioned metrics, the area under the curve (AUC) is a valuable evaluation measure frequently employed in classification tasks, particularly in the context of binary classifiers. The AUC represents the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate at various classification thresholds. A higher AUC value indicates a superior discriminatory power of the model, implying a stronger ability to differentiate between positive and negative instances. Overall, a plethora of evaluation methods exist to assess the accuracy and performance of machine learning models. The selection of the appropriate evaluation metrics depends on the specific objectives and requirements of the project at hand, ensuring that the chosen measures align with the desired goals and deliver meaningful insights into the model’s efficacy. Mitigating overfitting is a crucial aspect of building robust machine learning models. Several effective strategies can be employed to combat overfitting and enhance the generalization capabilities of the model. One fundamental approach is to augment the training data by adding more diverse samples. Increasing the size of the training set provides the model with a broader range of examples to learn from, enabling it to capture the underlying patterns and relationships in a more comprehensive manner. Incorporating diverse data instances helps the model avoid becoming overly specialized to the idiosyncrasies of a limited dataset, thus reducing the likelihood of overfitting. Another strategy involves reducing the complexity of the model by controlling the number of descriptor variables used. This can be achieved through feature selection or dimensionality reduction techniques. By identifying and retaining only the most relevant and informative features, the model focuses on the essential aspects of the data, mitigating the risk of overfitting due to excessive model complexity. Removing redundant or irrelevant variables can enhance the model’s ability to generalize to unseen data. Hyperparameter tuning plays a significant role in preventing overfitting. Hyperparameters are parameters that are not learned from the data but set by the model builder, such as learning rate, regularization strength, or tree depth. 
By fine-tuning these hyperparameters, one can strike an optimal balance between model complexity and generalization. Techniques such as grid search or random search can be employed to systematically explore different combinations of hyperparameter
values, identifying the configuration that yields the best performance while avoiding overfitting. Ensemble learning methods, such as voting or bagging, can also serve as effective mechanisms to mitigate overfitting. By combining multiple models, each trained on a different subset of the data or employing different algorithms, ensemble methods harness the wisdom of the crowd, reducing the risk of individual models overfitting to specific patterns in the data. The collective decision-making process of ensemble models helps enhance the overall robustness and generalization capabilities, resulting in more reliable predictions. Thus, guarding against overfitting requires a multifaceted approach. Expanding the training data, reducing the complexity of the model, fine-tuning hyperparameters, and leveraging ensemble learning techniques all contribute to the overarching goal of preventing overfitting. Employing a combination of these strategies, tailored to the specific characteristics of the dataset and the machine learning algorithm being utilized, helps ensure the development of models that generalize well to unseen data and deliver reliable predictions.
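The sketch below ties these evaluation ideas together with scikit-learn; the breast-cancer dataset, the random forest model, and the small hyperparameter grid are illustrative assumptions only.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=0)

# k-fold cross-validation: the mean score over k = 5 splits estimates generalization
print(cross_val_score(model, X, y, cv=5).mean())

# Grid search over a hyperparameter, itself evaluated by cross-validation
grid = GridSearchCV(model, {"max_depth": [3, 5, None]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)

# Classification metrics computed on a single held-out test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # counts of TN, FP, FN, TP
print(accuracy_score(y_test, y_pred), precision_score(y_test, y_pred),
      recall_score(y_test, y_pred), f1_score(y_test, y_pred))
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))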
8.12 Conclusion
In this chapter, we have explored various types of supervised machine learning models. Understanding the functioning of each model is indeed crucial in order to make informed decisions when choosing the most suitable approach for a given task. It is important to note that while there is a wide range of machine learning models available, there is no universally perfect model that can address all scenarios and datasets. The effectiveness of a machine learning model depends on multiple factors, including the nature of the dataset, the complexity of the problem, and the specific objectives. Different models have their strengths and limitations, and the selection of the appropriate model is contingent upon understanding these characteristics. For instance, decision trees are intuitive and easy to interpret, making them well-suited for problems where interpretability and explainability are crucial. On the other hand, support vector machines excel in handling high-dimensional data and are effective in scenarios where a clear decision boundary is required. Random forests combine the strengths of decision trees and ensemble learning, offering robust performance and the ability to capture complex interactions. Neural networks and deep learning models, with their ability to learn intricate patterns and handle large-scale datasets, are particularly useful in tasks such as image recognition and natural language processing. However, they often require substantial amounts of training data and computational resources. When selecting a machine learning model, it is essential to consider the characteristics of the dataset, including the size, dimensionality, and distribution of the data. Additionally, the specific objectives of the task, such as prediction accuracy, interpretability, or speed, should also be taken into account. Ultimately, the choice of the machine learning model should be guided by a careful evaluation of these factors, coupled with experimentation and iterative refinement. It is important to strike a balance between the complexity of the model and the available data, ensuring that
the selected model is capable of capturing the underlying patterns without overfitting or underfitting the data. In summary, the selection of a machine learning model is a critical step in building effective predictive models. By understanding the functioning, strengths, and limitations of various models, and aligning them with the characteristics of the dataset and the objectives of the task, one can make informed decisions and develop accurate and robust machine learning models.
Questions
8.1 Why is it important to understand the functioning of different supervised machine learning models?
8.2 What are some factors that influence the effectiveness of a machine learning model?
8.3 What are the strengths of decision trees as a machine learning model?
8.4 In what scenarios are support vector machines (SVMs) effective?
8.5 What are the advantages of neural networks and deep learning models?
8.6 What is linear regression?
8.7 How can linear regression be perceived?
8.8 How can linear regression help in understanding data?
8.9 How can linear regression be implemented using scikit-learn?
8.10 What is RMSE and how is it used in linear regression?
8.11 What is multiple linear regression?
8.12 How does polynomial regression enhance linear regression?
8.13 What is ridge regression and how does it address complex data scenarios?
8.14 How does the degree of the polynomial affect the performance of polynomial linear regression?
8.15 What is the role of the hyperparameter α in polynomial linear regression?
8.16 What is a Support Vector Machine (SVM)?
8.17 How does SVM create decision boundaries?
8.18 How does SVM handle non-linear data?
8.19 What are the advantages of using SVM?
8.20 What is the significance of the margin in SVM?
8.21 How does the choice of the RBF kernel impact SVM's performance?
8.22 What is the trade-off when adjusting the gamma hyperparameter in SVM?
8.23 What is the purpose of the decision tree model?
8.24 How is the structure of a decision tree determined?
8.25 What advantage do decision trees have in terms of interpretability?
8.26 What are some techniques to mitigate overfitting in decision trees?
8.27 How is the impurity of a decision tree measured?
8.28 How do decision trees handle regression tasks?
8.29 What is Random Forest (RF)?
8.30 How does RF address the limitation of overfitting?
8.31 How are the predictions of individual decision trees combined in RF?
8.32 How does RF capture a wider range of patterns and relationships in data?
8.33 What is the role of the bootstrapping method in RF?
8.34 How does the importance analysis in RF help in modeling and understanding the data?
8.35 What is voting machine learning?
8.36 What is the role of hidden layers in neural networks?
8.37 What is the difference between neural networks and deep learning?
8.38 What is the trade-off for the enhanced performance of deep learning models?
8.39 What is the trade-off for the enhanced performance of deep learning models?
8.40 How does experimental design help when dealing with small data?
8.41 What is the advantage of using Gaussian Process Regressor (GPR) in experimental design?
8.42 What is the purpose of cross-validation in supervised machine learning?
8.43 What are the two main subsets into which the available dataset is split in cross-validation?
8.44 What are two popular methods for cross-validation?
8.45 What are the evaluation metrics commonly used in regression analysis?
8.46 What are the evaluation metrics commonly used in classification machine learning?
9 Unsupervised Machine Learning and Beyond Machine Learning
Abstract
Unsupervised machine learning endeavors to unveil latent patterns and groupings within datasets. It remains an actively evolving field, teeming with numerous experimental methods that continue to emerge. This chapter serves to introduce the widely employed techniques of unsupervised machine learning. Moreover, the significance of graph data is elucidated, as it holds the potential to serve as a fundamental element in the realms of materials and catalysts informatics. By harnessing the power of graph-based representations, researchers can gain deeper insights into the underlying structures and relationships within these domains. Consequently, the exploration of graph data assumes a pivotal role in driving advancements and facilitating breakthroughs in materials and catalysts informatics.
Keywords
Unsupervised machine learning · Dimensional reduction · Principal component analysis · k-means clustering · Hierarchical clustering · Network analysis · Ontology
• Explore the uses of unsupervised machine learning.
• Learn about the uses of commonly adopted techniques such as principal component analysis, k-means clustering, and hierarchical clustering.
• Explore alternative approaches to machine learning such as network analysis and ontology.
9.1 Introduction
Unsupervised machine learning is a branch of artificial intelligence that deals with unlabeled data and the discovery of hidden patterns, structures, or relationships within that data without the guidance of predefined output labels. In unsupervised learning, algorithms are tasked with exploring and making sense of the inherent structure of the data on their own. This approach is particularly valuable for exploratory data analysis, dimensionality reduction, clustering, and anomaly detection. Techniques such as Principal Component Analysis (PCA) and t-SNE help reduce the dimensionality of complex datasets, making them more manageable while preserving essential information. Clustering algorithms, such as K-means and hierarchical clustering, group similar data points, revealing natural clusters or groupings within the data. Unsupervised learning is also essential for identifying anomalies or outliers, a critical task in fraud detection, quality control, and other applications. Unsupervised machine learning has a wide range of applications across various domains. It is used in customer segmentation for marketing, where it groups customers based on shared behaviors and preferences. In natural language processing, it helps with topic modeling and document clustering, enabling the organization and understanding of vast textual datasets. Unsupervised learning also plays a crucial role in recommendation systems, where it identifies patterns in user behavior and suggests products or content that users are likely to be interested in. Overall, unsupervised machine learning techniques are powerful tools for data exploration, data preprocessing, and uncovering valuable insights from unlabeled data, contributing to data-driven decision-making and knowledge discovery in many fields. In addition to unsupervised machine learning, other concepts such as network theory/analysis and ontology hold promise with regard to materials and catalysts research. Network analysis is a field that focuses on examining the relationships and interactions between entities, often represented as nodes and edges in a network or graph. This can be applied to various domains, including social networks, transportation systems, biological networks, and more. Researchers use graph theory, centrality measures, and community detection algorithms to uncover patterns, influential nodes, and substructures within the network. By analyzing network data, one can gain insights into information flow, connectivity, and vulnerabilities, among other aspects. Ontology, in the context of knowledge representation, refers to a formal and explicit specification of concepts, their attributes, and the relationships between them in a specific domain or knowledge domain. Ontologies are used to capture and organize knowledge in a structured and machinereadable manner, enabling machines to understand and reason about the domain. They define a shared vocabulary that ensures consistency and interoperability in knowledge representation, making them valuable for data integration, search engines, and automated reasoning systems. In the biomedical field, for instance, ontologies such as the Gene Ontology (GO) provide a structured framework for annotating and categorizing gene functions and their relationships, facilitating biological research and data analysis. Overall,
unsupervised machine learning, network analysis, and ontologies each contribute to the exploration and organization of data and knowledge in distinct yet interconnected ways, offering valuable tools for data-driven decision-making and research in various domains.
9.2 Unsupervised Machine Learning
In stark contrast to supervised machine learning, unsupervised machine learning operates without relying on predefined descriptors or objective variables. It forgoes the need to solve the conventional y = f(x) equation, which forms the core objective of supervised machine learning. Instead, unsupervised machine learning endeavors to discover concealed patterns and similarities inherent within datasets, without the utilization of any training data. By eschewing the reliance on labeled examples or explicit guidance, unsupervised machine learning allows for the exploration and extraction of intrinsic structures and relationships within the data itself. This enables the discovery of meaningful insights, associations, and clusters that may not have been apparent through traditional, supervised approaches. In essence, unsupervised machine learning serves as a powerful tool for uncovering and harnessing the latent knowledge encapsulated within the data, thereby facilitating novel discoveries and insights. Unsupervised machine learning encompasses several approaches that can be employed to extract valuable insights from data. Among these approaches, clustering stands out as a commonly utilized technique. Clustering aims to identify similarities and patterns within the data, effectively categorizing it into distinct groups. This process aids in gaining a deeper understanding of the underlying structure and characteristics of the data. Furthermore, once the data have been classified through clustering, independent supervised machine learning models can be applied to the grouped data. This subsequent analysis can provide additional layers of insights and enable the development of predictive models or other forms of analysis. Another widely employed unsupervised machine learning algorithm is dimension reduction. This technique reduces the dimensionality of the variables within the data, transforming them from a high-dimensional space to a lower-dimensional space. Dimension reduction offers several advantages, such as facilitating easier data visualization when variables are reduced to 2 or 3 dimensions. Additionally, it can help mitigate issues related to overfitting by reducing the number of descriptor variables utilized in the models. As a result, dimensional reduction presents an alternative approach to comprehending and analyzing complex data, while also aiding supervised machine learning tasks. In addition to clustering and dimension reduction, several other approaches are frequently employed in unsupervised machine learning. These include principal component analysis (PCA), locally linear embedding (LLE), and k-means clustering. Each of these techniques offers its unique advantages and applications, further expanding the toolkit available for uncovering hidden patterns and extracting valuable knowledge from data.
9.2.1 Dimensional Reduction
Dimensional reduction plays a pivotal role in unsupervised machine learning, particularly when dealing with high-dimensional descriptor variables. The inclusion of numerous dimensions in training such variables often leads to overfitting, a phenomenon commonly referred to as the curse of dimensionality. The curse of dimensionality can be succinctly described as follows: As the number of dimensions increases, the amount of data required for effective unsupervised machine learning grows exponentially. In high-dimensional spaces, even small datasets occupy only a fraction of the available space, making it challenging to capture meaningful patterns and relationships. Consequently, one must exercise caution and recognize that the number of data points must increase proportionally with the expansion of descriptor variables. To address this challenge, dimensional reduction methods prove invaluable. By reducing the dimensionality of the data, these techniques help alleviate the curse of dimensionality. They achieve this by transforming the highdimensional space into a lower-dimensional one, where patterns and relationships can be more effectively captured and analyzed. Consequently, dimensional reduction serves as a crucial tool in managing the curse of dimensionality, enabling more accurate and robust unsupervised machine learning analyses.
9.2.2 Principal Component Analysis
Principal Component Analysis (PCA) is a widely utilized technique for dimensional reduction. It enables the transformation of high-dimensional data into a lower-dimensional representation. Let us consider the example depicted in Fig. 9.1. In this scenario, the dataset comprises five distinct sets of data, with each set consisting of four descriptor variables. These descriptors correspond to the selectivity of CO, CO₂, C₂H₆, and C₂H₄
Fig. 9.1 Principal component analysis in the selectivity of CO, CO₂, C₂H₆, and C₂H₄
during the methane oxidation reaction. Upon initial observation, it can be inferred that the first two datasets exhibit high selectivity for CO and CO₂, while the third dataset demonstrates a preference for C₂H₆ and C₂H₄. The final dataset, however, displays equal selectivities across all descriptors. These observations, though insightful, are only rudimentary interpretations of the data. By employing PCA, the four-dimensional dataset is transformed into a lower-dimensional representation, as illustrated in Fig. 9.1. This transformation allows for a clearer interpretation of the data. For instance, the first dataset receives large negative values in the PCA scores, reflecting its high selectivities for CO and CO₂. Similarly, positive values are assigned to the cases where scores are high for C₂H₆ and C₂H₄. Conversely, the remaining dataset is assigned a PCA score close to zero, signifying its lack of unique patterns compared to the other four sets of data. Through the application of PCA, the inherent structure and patterns within the data become more discernible, enabling a more precise characterization and understanding of the relationships between the variables. PCA can be implemented in Python by utilizing the PCA function from the scikit-learn library, as shown in the example code snippet below:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
transformed_data = pca.fit_transform(data)

One of the challenges encountered when employing PCA is determining the appropriate number of dimensions for the transformed data. This decision is crucial as it directly impacts the amount of information retained and the quality of the dimensionally reduced representation. Selecting the optimal number of dimensions is often a trade-off between preserving sufficient information while reducing the dimensionality to a manageable level. Several methods can aid in deciding the number of dimensions, including analyzing the cumulative explained variance ratio, scree plots, and domain knowledge. These techniques allow the user to assess the amount of variance retained in the data as a function of the number of dimensions and make informed decisions based on their specific requirements and understanding of the dataset. Choosing the appropriate number of dimensions is an essential step in PCA and requires careful consideration to strike the right balance between data representation and computational efficiency.
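One common way to guide this choice is the cumulative explained variance ratio, sketched below; the iris dataset and the 95% threshold are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA().fit(X)  # keep all components first, then inspect the variance they explain
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)

# Smallest number of components that retains at least 95% of the variance
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components)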
9.2.3
K-Means Clustering
The k-means clustering method is one of the simplest ways to cluster data. K-means clustering can be carried out as demonstrated in Fig. 9.2. First, select the number of clusters into which the data should be classified. Then, select a number of random points equal to the chosen number of clusters; these random points are set as the initial centers of mass. Next, each data point is assigned to the nearest center of mass. Here, the data belonging to each center of mass are recognized as a group.
Fig. 9.2 The concept of K-means clustering
Now, the center of mass is recalculated for each group and a new center of mass is assigned. This process is repeated until the set number of iterations is reached. Thus, appropriate groups can be classified based on distance. The k-means clustering method is a straightforward approach for grouping data into clusters. The steps involved in performing k-means clustering are illustrated in Fig. 9.2. To begin, the user specifies the desired number of clusters to be formed. Subsequently, an equal number of random points, corresponding to the selected number of clusters, are chosen as the initial center of mass for each cluster. Next, each data point in the dataset is assigned to the nearest center of mass based on the calculated distances. This step ensures that data points are grouped according to their proximity to a specific center of mass. After the initial assignment of data points to clusters, the center of mass for each group is recalculated by taking the average position of all data points within the cluster. This updated center of mass becomes the new reference point for its respective cluster. The iterative process continues by repeating the steps of assigning data points to the nearest center of mass and recalculating the center of mass for each cluster. This process is iterated until a predetermined number of iterations is reached or until convergence is achieved, for instance, when the positions of the centers of mass stabilize. Through this iterative approach, k-means clustering effectively classifies data points into appropriate groups based on their distances to the centers of mass. The result is a clustering solution that organizes the data into distinct clusters, facilitating further analysis and interpretation of the underlying patterns and structures within the dataset. K-means clustering can be performed simply by calling KMeans() from the scikit-learn library, where the number of clusters must be decided by the user.

from sklearn.cluster import KMeans

num_clusters = 3

# Apply k-means with the chosen number of clusters
kmeans = KMeans(n_clusters=num_clusters)
kmeans.fit(data)

# Accessing the cluster labels assigned to each data point
cluster_labels = kmeans.labels_

# Accessing the cluster centers
cluster_centers = kmeans.cluster_centers_

In this code snippet, “data” represents the input dataset. The “num_clusters” variable denotes the desired number of clusters to be formed. By instantiating the KMeans() object with the specified number of clusters, the k-means clustering algorithm is applied to the data. The fit() method is then called to perform the clustering process. After clustering, you can access the cluster labels assigned to each data point using “kmeans.labels_”. These labels indicate which cluster each data point belongs to. Additionally, the cluster centers can be obtained using “kmeans.cluster_centers_”, which provides the coordinates of the center of each cluster. By utilizing the KMeans class from scikit-learn, users can easily perform k-means clustering with flexibility in choosing the desired number of clusters to be formed. K-means clustering offers simplicity, scalability, interpretability, and versatility, making it a popular choice for unsupervised machine learning tasks. However, its limitations, such as the requirement to specify the number of clusters, sensitivity to initialization, assumptions about cluster sizes and shapes, and susceptibility to outliers, should be carefully considered when applying the algorithm in practice.
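One common heuristic for choosing the number of clusters, offered here as general practice rather than a method specific to this text, is the so-called elbow method: k-means is run for a range of cluster counts and the within-cluster sum of squared distances (the inertia_ attribute in scikit-learn) is recorded for each. Plotting these values against the number of clusters and looking for the point where the curve flattens gives a rough indication of a reasonable choice. The sketch below assumes "data" is the same array used above.

from sklearn.cluster import KMeans

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(data)                  # "data" is assumed to be the same array as above
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Inspect (or plot) the inertias and look for the "elbow" where improvement levels off
print(inertias)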
9.2.4
Hierarchical Clustering
Hierarchical clustering is a powerful method for clustering data, which also allows for visualizing the classification process through dendrograms. The fundamental approach of hierarchical clustering is summarized in Fig. 9.3. Consider a scenario where there are two data points, A and B, that are close to each other within the dataset. In the dendrogram representation, these data points are connected, indicating their proximity. Now, let us examine them as a pair, AB, and a separate data point, C. Upon analysis, we observe that AB and C are also close to each other, leading to a connection in the dendrogram. By clustering the data in this hierarchical manner, we gain insights into the relative distances and relationships between different data points. One of the major advantages of hierarchical clustering is that it provides a detailed depiction of how clusters are formed. The dendrogram visualizes the step-by-step merging of data points and the formation of clusters. This level of detail enhances our understanding of the similarities and dissimilarities in materials and catalysts data. It enables us to explore the hierarchical structure of the dataset and observe how individual data points or groups of data points relate to one another. By leveraging hierarchical clustering, we can gain a comprehensive understanding of the clustering process and the underlying similarities in the data. This knowledge facilitates clear insights into the relationships and patterns within materials and catalysts data, supporting advanced analysis and decision-making.
Fig. 9.3 The concept of hierarchical clustering
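As a minimal sketch of how such a dendrogram can be produced in Python, the snippet below uses SciPy's hierarchical clustering routines; the Ward linkage criterion and the variable name "data" are assumptions made for illustration and are not prescribed by this text.

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Build the hierarchy by successively merging the closest points and clusters
Z = linkage(data, method="ward")  # "data" is assumed to be a samples-by-features array

# Visualize the step-by-step merging as a dendrogram
dendrogram(Z)
plt.show()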
9.2.5
Perspective of Unsupervised Machine Learning
Unsupervised machine learning, a powerful approach in the field of artificial intelligence, offers numerous benefits as well as challenges for data analysis and model development. First, it enables data exploration and discovery by allowing researchers to delve into the data and uncover hidden patterns and relationships that may not be immediately apparent. This exploratory analysis can lead to novel insights and a deeper understanding of the underlying data structure. Another key advantage is its ability to handle unlabeled data effectively. In situations where labeled data are scarce or unavailable, unsupervised learning proves valuable as it does not require explicit target labels for training. This flexibility expands the range of applications where machine learning can be applied, even in scenarios with limited labeled data. Scalability is also a strength of unsupervised learning algorithms. They are often designed to handle large datasets efficiently, making it possible to analyze big data and extract meaningful information from it. This scalability is crucial in today’s era of abundant data, where traditional approaches may struggle to cope with the volume and complexity of the data. Furthermore, unsupervised learning methods facilitate feature extraction and representation learning. By identifying meaningful features or representations of the data, these techniques provide valuable insights that can be utilized in various machine learning tasks. This feature extraction capability contributes to improved performance and generalization across different domains. Despite its benefits, unsupervised machine learning also has some limitations. One major challenge is the lack of ground truth for evaluation. Without labeled data, it becomes difficult to objectively assess the quality and accuracy of unsupervised learning
outcomes. This subjective evaluation hampers the reliability and confidence in the results obtained through unsupervised techniques. Interpretability is another concern associated with certain unsupervised learning methods, particularly deep learning. These techniques often produce complex models that are challenging to interpret and understand. The lack of transparency in model representation can hinder the ability to extract meaningful insights and make informed decisions based on the learned patterns. Lastly, interpreting unsupervised learning results often requires domain expertise. Without prior knowledge and understanding of the specific domain, it can be challenging to identify meaningful patterns or structures in the data. Domain experts are needed to provide context and domain-specific insights, enhancing the interpretation and practical utility of unsupervised learning outcomes. While unsupervised machine learning brings several advantages, such as data exploration, handling unlabeled data, scalability, and feature extraction, it also faces challenges regarding evaluation, interpretability, and the need for domain expertise. Awareness of these pros and cons is crucial in utilizing unsupervised learning effectively and making informed decisions in real-world applications. The future of unsupervised machine learning holds great promise as it emerges as a pivotal approach in extracting valuable insights from vast amounts of unlabeled data. Advancements in this field are expected to revolutionize various domains, including anomaly detection, representation learning, and generative modeling, leading to enhanced capabilities and broader applications of unsupervised learning techniques. In the forthcoming years, unsupervised machine learning is poised to make significant strides in multiple areas. One such area is anomaly detection, where the focus lies on refining the ability to identify anomalies and outliers across diverse domains such as cybersecurity, fraud detection, and predictive maintenance. By leveraging unsupervised techniques, anomalies can be efficiently flagged and addressed, bolstering security, minimizing risks, and optimizing maintenance processes. Furthermore, representation learning is set to undergo substantial advancements. The quest for automatically learning powerful and meaningful representations from unlabeled data drives research efforts in this domain. By extracting intricate patterns and underlying structures, unsupervised learning can enable the development of highly effective representations, ultimately improving the performance of downstream tasks such as classification and regression. This breakthrough has the potential to revolutionize a wide range of applications, including image and speech recognition, natural language processing, and recommendation systems. Generative modeling is likewise expected to advance; by learning to synthesize realistic data, it will not only expand the availability of labeled data but also enhance the robustness and generalization of machine learning models. Lastly, it must be pointed out that unsupervised and supervised learning techniques form integral components of machine learning workflows, with their synergistic interplay contributing to enhanced performance and broader applicability. By leveraging the strengths of each approach, practitioners can use unsupervised learning for data preprocessing, feature engineering, semi-supervised learning, and augmenting the
capabilities of supervised learning algorithms and addressing challenges posed by limited labeled data. The future of unsupervised machine learning appears bright and full of opportunities. Its potential to extract meaningful insights from unlabeled data, coupled with advancements in anomaly detection, representation learning, and generative modeling, paves the way for transformative applications across various domains. As researchers continue to push the boundaries of unsupervised learning, we can anticipate a paradigm shift in the way data are analyzed, interpreted, and utilized to drive innovation and progress.
9.3
Additional Approaches
Machine learning has undoubtedly gained popularity among researchers in the field of materials and catalysts. However, there is an emerging interest in exploring alternative methods for material investigation and catalyst design. Researchers are increasingly seeking to study the underlying relationships present within datasets and understand how different types of data interrelate. One promising approach in this context is network analysis, which utilizes graph theory to transform material data into networks. These networks capture the connections and relationships between various components, such as materials, properties, and performance metrics. Network analysis offers a range of applications, including optimizing experimental conditions for catalyst design. By representing materials and catalysts as interconnected nodes and edges, network analysis provides a comprehensive framework to analyze and interpret complex data relationships. Another valuable approach for material and catalyst research is the use of ontologies. Ontology involves the formal representation of knowledge and semantics within material and catalyst data. By incorporating ontological structures, researchers can define and organize concepts, properties, and relationships in a standardized and machine-readable manner. This enables the integration and sharing of knowledge across different domains and facilitates more comprehensive and efficient data analysis. Ontologies provide a powerful tool for enhancing data management, discovery, and decision-making processes in materials and catalyst research. In summary, alongside machine learning, researchers are exploring alternative methodologies to delve into the relationships within material and catalyst datasets. Network analysis and ontologies offer distinct avenues to understand the complex interplay between various types of data and provide valuable insights for material investigation, catalyst design, and optimization. By leveraging these alternative approaches, researchers can unlock new perspectives and enhance their understanding of materials and catalysts at a deeper level.
9.3.1
Network Analysis
Network analysis is a powerful methodology that focuses on the study of networks consisting of nodes and edges. These networks serve as representations of entities within datasets and provide a visual depiction of their relationships. By transforming data into a network structure, it becomes possible to examine clusters of nodes that are closely interconnected, identify key nodes that control the pathways between different entities, and assess the connectivity patterns of individual nodes in terms of their incoming and outgoing connections. This network perspective becomes particularly valuable when investigating complex processes, such as a series of intermediate reactions in a prototype reaction. By representing the reactions as nodes and the connections between them as edges, one can analyze the network to gain insights into the potential pathways, dependencies, and interactions between the intermediates. This analysis can help uncover the sequence of reactions, identify critical intermediates, and understand the overall reaction mechanism. The visual nature of network analysis allows researchers to intuitively grasp the relationships and patterns within the data. By examining the structure and properties of the network, one can extract valuable information and gain a deeper understanding of the underlying processes and interactions. Network analysis provides a powerful framework for studying complex systems and uncovering hidden relationships within datasets. By representing data as networks of nodes and edges, researchers can analyze the connectivity patterns, identify key entities, and gain insights into the dynamics and mechanisms of various processes, including intermediate reactions in prototype reactions. The concept of graph data, as shown in Fig. 9.4, involves the representation of data using nodes and edges. In this context, nodes represent individual data points or entities within a dataset, while edges denote the relationships or connections between these nodes. By structuring the data as a graph, it becomes possible to analyze and visualize the interdependencies and associations among the data points. Furthermore, the edges in a graph can be assigned weights, which convey the strength or importance of the connections between nodes.
Fig. 9.4 The concept of graph data
In the context of a reaction network, for example, a graph representation
can be used to capture the relationships between different chemical species or reactions. The edges can be weighted to reflect factors such as the Arrhenius equation, which describes the temperature-dependent probability of a reaction occurring. The weights assigned to the edges can represent the influence of temperature or other factors on the likelihood of a specific reaction taking place. By considering these edge weights within the network, it becomes possible to visualize the probabilities associated with different reactions. This provides insights into the relative likelihoods of various reactions occurring within the system. Thus, graph data representation allows for a comprehensive understanding of the relationships and dependencies within a dataset. By incorporating edge weights and considering them in the context of specific applications, such as reaction networks, it becomes possible to visualize and analyze factors that affect the probabilities and outcomes of certain events or processes. With networks, there are two types of edges to consider: undirected and directed edges. Undirected and directed edges determine the manner in which nodes relate to each other. With undirected edges, nodes share connections where directionality does not play a role in the relationship. In Fig. 9.4, this is visually represented as straight lines that connect two nodes (e.g., A-B, A-C, A-D, and B-C). This can be used when, for example, one transforms catalyst and experimental conditions data into a network [6]. A directed graph, meanwhile, denotes the source and target relationships between two nodes. Directed edges mark that a relationship goes in a single direction, where a source node acts as the starting node and the target node acts as the final destination of the path. Directionality becomes important to note in situations where relationships between two nodes are a result of an action. This is seen, for example, in cases where intermediate reactions of a prototype reaction are explored as a network [7]. Thus, it is important to consider the nature of the data being analyzed in order to determine whether the intended network should be directed or undirected. By understanding the inherent characteristics of the relationships between nodes, one can make an informed decision on the appropriate type of edge to use in the network representation.
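As a minimal sketch of how undirected and directed networks can be constructed in Python, the snippet below uses the networkx library; the node labels, edge weights, and reaction species are hypothetical and chosen purely for illustration.

import networkx as nx

# Undirected graph: relationships where directionality plays no role,
# e.g., catalysts linked to shared experimental conditions (hypothetical labels)
G = nx.Graph()
G.add_edge("A", "B", weight=0.8)
G.add_edge("A", "C", weight=0.5)
G.add_edge("A", "D", weight=0.9)
G.add_edge("B", "C", weight=0.3)

# Directed graph: source-to-target relationships,
# e.g., steps in a reaction network (hypothetical species and weights)
D = nx.DiGraph()
D.add_edge("CH4", "CH3", weight=1.2)
D.add_edge("CH3", "C2H6", weight=0.7)

print(G.degree("A"), list(D.successors("CH4")))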
9.3.2
Ontology
Ontology has gained significant interest in the chemistry and materials science communities as a valuable technology for organizing and representing data. Originally rooted in philosophy as the study of being, ontology has been adopted by computer science and information science disciplines to define formal representations and domains for data. In the context of materials and catalysts research, ontologies play a crucial role in organizing data by defining concepts and their relationships. They provide a structured framework for defining and standardizing terminology, ensuring consistency and clarity in data representation. By establishing a common vocabulary and understanding of concepts, ontologies enable researchers to effectively communicate and share data across different domains and disciplines. One of the key advantages of using ontology in materials and catalysts
research is its ability to address issues related to data quality and usability. Materials and catalysts data often come from various sources, with differences in formats, terminologies, and semantic interpretations. This can lead to challenges when integrating and analyzing the data using data science techniques. By employing ontologies, researchers can establish a shared understanding of data elements, their definitions, and the relationships between them. This facilitates data integration, interoperability, and knowledge discovery, ultimately enhancing the quality and usability of the data. Ontologies provide a powerful tool for organizing and structuring materials and catalysts data, enabling researchers to overcome challenges associated with data quality, heterogeneity, and integration. By defining and formalizing concepts and relationships, ontologies support effective data management, analysis, and knowledge generation in the field of chemistry and materials science. In the materials science and catalyst fields, the absence of standardization in creating databases poses challenges for researchers. Without a standardized framework, individual researchers tend to create databases based on personal preferences and specific needs, which can limit the reusability and interoperability of the data when shared with others. One of the key challenges in data preprocessing arises from the lack of consistent terminology. Terms used to describe materials, properties, and processes often vary between disciplines, leading to translation errors and incorrect assumptions about the data. This heterogeneity hampers effective data integration and analysis. Additionally, the lack of accompanying metadata further complicates the understanding and utilization of the data. Metadata provides essential contextual information about the data, such as its source, collection methods, and quality measures. Without comprehensive metadata, researchers may face difficulties in interpreting and leveraging the data effectively. Ontology offers a valuable solution to these challenges by providing a structured framework to clearly define terms and their relationships. It establishes a common vocabulary and semantics, allowing researchers to achieve consistency and standardization in their databases. By defining concepts, properties, and their interconnections, ontology enables accurate and unambiguous data representation and integration across disciplines. Furthermore, ontology provides the means to include metadata and additional information alongside the data. Researchers can define metadata standards, specify data provenance, and capture domain-specific knowledge within the ontology. This enhances the understanding and usability of the data, facilitating data sharing, collaboration, and knowledge discovery. Ontology plays a vital role in addressing the issues of standardization, terminology inconsistencies, and lack of metadata in materials science and catalyst databases. It provides a framework for defining and organizing data, fostering interoperability, and promoting the effective utilization of data across research communities. Figure 9.5 visually demonstrates how ontology can be utilized to organize and access information effectively. In this example, an ontology is created for organizing information related to the periodic table. The ontology defines various types of information and properties, such as the atomic structure and crystal structure.
At first glance, the presentation of different crystal structure types may not appear significantly different from
other methods of categorization and classification. However, the true power of ontology becomes evident when researchers attempt to query the data. Ontologies are constructed using knowledge representation languages such as OWL (Web Ontology Language), which leverage description logic. Description logic provides a formal framework for defining concepts, properties, and relationships within the ontology. The use of description logic allows researchers to query the data based on the ontology’s defined structure and semantics. Queries can be formulated using logical expressions and inference rules, enabling sophisticated searches and reasoning over the data. By employing description logic, researchers can extract specific information, explore relationships, and gain deeper insights from the ontology-based data. This capability to query data using description logic is a distinguishing feature of ontologies. It enhances the usability and flexibility of the data by enabling researchers to access and retrieve information in a more structured and precise manner. Through ontology-based querying, researchers can effectively navigate and explore complex datasets, facilitating knowledge discovery and supporting informed decision-making processes. In summary, ontology-based data organization and querying offer significant advantages for accessing and exploring information. By utilizing knowledge representation languages and description logic, researchers can leverage the power of ontologies to query data efficiently, leading to enhanced data exploration, understanding, and knowledge generation. In Fig. 9.5, the power of ontology-based querying is illustrated through two example queries. These queries demonstrate the ability to extract specific data that match a set of defined restrictions or statements. In Query 1, the goal is to identify elements that possess both a “body-centered cubic” crystal structure and the property of being “antiferromagnetic.” By formulating the query based on these criteria, researchers can retrieve the elements that meet these specific conditions. This targeted search allows for the extraction of relevant information from a large dataset, saving time and effort compared to manual searching. Similarly, Query 2 aims to identify elements that exhibit a “body-centered cubic” crystal structure and belong to Group 1. By combining multiple criteria in the query, researchers can further refine their search and obtain more specific results. This capability enables efficient navigation through extensive datasets, enabling researchers to quickly identify and access the desired information. Furthermore, ontology offers the advantage of merging ontologies from different datasets into a unified space. This means that researchers can combine multiple sources of data, expanding the range of information available for querying. By consolidating diverse datasets within a single ontology, the search space is significantly increased. This integration of datasets allows for more comprehensive and extensive searches, leading to faster and more efficient exploration of the data. Overall, ontology-based querying provides researchers with a powerful tool to navigate and extract information from large datasets. By specifying restrictions and criteria through queries, researchers can quickly locate relevant data, saving time and resources.
Additionally, the ability to merge ontologies from different datasets enhances the search capabilities, enabling researchers to explore a broader range of data in a unified and streamlined manner.
Fig. 9.5 An example of querying information from an ontology. Reprinted (adapted) with permission from [8]. Copyright 2023 American Chemical Society
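As a minimal sketch of what such a query might look like in practice, the snippet below uses the rdflib library to issue a SPARQL query resembling Query 1. The ontology file name, the namespace, and the property and class names (ex:hasCrystalStructure, ex:hasMagneticOrder, and so on) are hypothetical and would need to match whatever ontology is actually being queried.

from rdflib import Graph

g = Graph()
g.parse("periodic_table.ttl")  # hypothetical ontology file in Turtle format

# Query 1: elements with a body-centered cubic structure that are antiferromagnetic
query = """
PREFIX ex: <http://example.org/periodic#>
SELECT ?element WHERE {
    ?element ex:hasCrystalStructure ex:BodyCenteredCubic .
    ?element ex:hasMagneticOrder ex:Antiferromagnetic .
}
"""

for row in g.query(query):
    print(row.element)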
Applying ontology to materials and catalyst data offers several benefits and advantages. First, it provides a means to represent a researcher’s understanding of concepts and data in a format that is machine-readable. By transforming the rules and relationships within the data into ontologies, machines can apply these rules and relationships consistently across large datasets, enabling automated processing and analysis. Standardization is another key benefit of using ontology in materials and catalyst data. Ontologies provide a structured and standardized way to define and organize data, ensuring consistent terminology and relationships. This standardization facilitates data sharing, integration, and collaboration among researchers. It also reduces the likelihood of translation errors and incorrect assumptions about the data, enhancing the reliability and usability of the data across different research efforts. Moreover, ontologies allow for the reclassification and reorganization of materials. By defining data in ontologies, researchers can explore alternative ways of classifying materials and identifying relationships between different materials. This flexibility in classification opens up new possibilities for discovering descriptors, identifying correlations, and uncovering new materials with desired properties. It enables researchers to think beyond traditional classifications and consider novel perspectives, leading to potentially groundbreaking insights and discoveries. Ontology
provides a unique and valuable approach to handling and processing materials and catalyst data. It enables machine-readable representation, standardization, and the exploration of alternative classifications, fostering improved data analysis, knowledge discovery, and innovation in materials science and catalyst research.
9.4
Conclusion
Unsupervised machine learning is indeed a powerful tool for uncovering hidden patterns and groups within data. As an evolving field, there are various unique algorithms available for unsupervised learning, each with its own strengths and limitations. It is crucial to carefully select the most suitable algorithm based on the specific characteristics and requirements of the data at hand. In addition to traditional unsupervised learning methods, emerging concepts such as graph theory and graph data hold great potential for future materials and catalysts informatics. Graphs provide a structured framework for organizing and understanding complex relationships within data. By representing data as nodes and edges, graph theory enables the exploration of connectivity patterns, influence relationships, and other valuable insights. This approach can lead to more effective analysis and interpretation of materials and catalysts data. Furthermore, the utilization of ontologies is of utmost importance in materials and catalysts research. Ontologies enable the formal representation of knowledge, concepts, and relationships within a specific domain. By defining terminologies and preserving the semantic connections, ontologies facilitate data integration, standardization, and interoperability. They enhance data organization, improve data quality, and support advanced querying and inference capabilities. Ontologies play a crucial role in achieving a deeper understanding of materials and catalysts data and promoting collaboration and knowledge sharing within the scientific community. In summary, choosing the appropriate unsupervised machine learning algorithm, leveraging graph theory and graph data, and utilizing ontologies are all important aspects in the advancement of materials and catalysts informatics. These approaches contribute to better organization, analysis, and interpretation of data, leading to valuable insights and discoveries in the field.
Questions
9.1 What are the advantages of using unsupervised machine learning in data analysis?
9.2 How can graph theory contribute to materials and catalysts informatics?
9.3 What is the role of ontologies in materials and catalysts research?
9.4 How can unsupervised machine learning algorithms be selected for a specific dataset?
9.5 What are the potential benefits of utilizing graph data in materials and catalysts informatics?
9.6 What is the main difference between supervised and unsupervised machine learning?
9.7 What is the objective of unsupervised machine learning?
9.8 How does unsupervised machine learning help in data analysis?
9.9 What is clustering in unsupervised machine learning?
9.10 What is the purpose of dimension reduction in unsupervised machine learning?
9.11 What is PCA in unsupervised machine learning?
9.12 What is the curse of dimensionality in unsupervised machine learning?
9.13 How does dimensional reduction help address the curse of dimensionality?
9.14 How can one determine the appropriate number of dimensions in PCA?
9.15 What is k-means clustering and how does it work in unsupervised machine learning?
9.16 What are the advantages of unsupervised machine learning?
9.17 What is one key challenge in evaluating unsupervised learning outcomes?
9.18 What is one concern related to interpretability in certain unsupervised learning methods?
9.19 In which areas is unsupervised machine learning expected to make significant strides in the future?
9.20 How can unsupervised and supervised learning techniques complement each other in machine learning workflows?
9.21 What is the focus of network analysis?
9.22 How can network analysis help in understanding complex processes?
9.23 What advantages does the visual nature of network analysis provide?
9.24 What is the significance of assigning weights to edges in a graph representation?
9.25 What are the two types of edges in network analysis, and how do they differ?
9.26 What role does ontology play in materials and catalysts research?
9.27 How does ontology address challenges related to data quality and usability in materials and catalysts research?
9.28 What are the advantages of using ontology in materials science and catalyst fields?
9.29 How does ontology-based querying enhance data exploration and understanding?
9.30 What are the benefits of applying ontology to materials and catalyst data?
Solutions
Problems of Chap. 1 1.1 Materials informatics is an interdisciplinary field that applies data science techniques within the context of materials science research. By leveraging the power of data science, materials informatics enables researchers to extract valuable knowledge and patterns from extensive materials data, empowering them to make informed decisions, enhance material properties, and expedite the discovery and design of new materials. 1.2 Materials research can be approached from four distinct perspectives: experimental investigations, theoretical modeling, computational simulations, and the emerging field of data science. 1.3 Materials informatics heavily relies on data science techniques and tools such as statistical analysis, visualization methods, and machine learning algorithms. These tools help unlock hidden knowledge embedded within vast quantities of materials data, uncovering patterns, correlations, and trends that may not be readily discernible through traditional means. 1.4 Materials science involves complex materials systems with intricate structureproperty relationships and dependencies on processing conditions. Therefore, data science methodologies cannot be blindly applied to materials data without careful consideration and domain expertise. Researchers must exercise caution, thoroughly understand the data, and have awareness of the fundamental principles and governing laws within materials science. 1.5 Researchers must possess proficiency in utilizing data science tools and the ability to comprehend and assess the accuracy of the underlying data. This capability enables them to make informed judgments, select appropriate data science techniques, evaluate
the validity of results, and comprehend the interdependencies of processing, structure, properties, performance, and characterization in materials design. 1.6 A deep understanding of materials science principles helps researchers evaluate the accuracy and reliability of the data used in their research. It enables them to discern potential limitations, biases, or uncertainties associated with the data and adjust their data science approaches accordingly. Additionally, it facilitates the identification of relevant features and parameters to be analyzed using data science techniques, enhancing the accuracy and relevance of the results. 1.7 Integrating domain expertise in materials science with data science methodologies unlocks new insights, accelerates materials discovery, and optimizes material design and performance. This interdisciplinary approach promotes innovation and opens up new avenues for scientific exploration, ultimately driving advancements in the field of materials research. 1.8 The integration of data science in materials science allows researchers to analyze and derive insights from vast quantities of data, surpassing the limits of human capacity for analysis. Data science methodologies enable the extraction of valuable knowledge and patterns that may not be readily apparent through traditional means, enhancing research capabilities in materials science. 1.9 Materials informatics, as a discipline, plays a significant role in materials research by leveraging data science techniques to unlock hidden knowledge within materials data. It empowers researchers to make informed decisions, enhance material properties, optimize performance, and expedite the discovery and design of new materials with desired functionalities. 1.10 The convergence of data science and materials science gives rise to materials informatics, a transformative paradigm that propels the field forward and creates new opportunities for innovation and advancement. By harnessing the power of data science, researchers can unlock valuable insights, accelerate materials discovery, and optimize material design and performance. 1.11 Materials informatics is a subfield of materials science that integrates data science into materials research. 1.12 While data science focuses on understanding and extracting knowledge from data, materials informatics applies data science specifically to materials science research.
1.13 Examples of other informatics fields include bioinformatics, chemoinformatics, and pharmacy informatics, which have emerged by applying data science to biology, chemistry, and pharmacy, respectively. 1.14 To fully leverage the potential of materials informatics, one must have an understanding of both data science and materials science, as materials informatics requires the application of data science techniques to material data. 1.15 Data science is often referred to as the “fourth science” following experiment, theory, and computational science. It shares similarities with experimental approaches by using a variety of techniques to extract knowledge from data, much like experiments synthesize and characterize materials. 1.16 The challenge is the growing gap between the availability of materials data and the number of trained researchers skilled in materials informatics. 1.17 Efforts such as specialized training programs, courses, and collaborations between materials science researchers and data science experts are needed to train and educate more researchers in materials informatics. 1.18 While data scientists can apply machine learning and data science techniques to materials data, their lack of domain knowledge in materials science may lead to difficulties in understanding and interpreting the outcomes of their analysis. 1.19 Domain knowledge of materials science is crucial because it allows researchers to understand the underlying science and physical phenomena behind the data they analyze, ensuring accurate interpretation and avoiding erroneous conclusions. 1.20 The importance of domain knowledge can be illustrated using the analogy of car racing, where information science represents the design and development of car parts, and domain knowledge of materials science is akin to a skilled driver who knows how to maximize the performance of the car using those parts. 1.21 The three traditional approaches are experiment, theory, and computation. 1.22 Data science focuses on analyzing existing data to uncover patterns and insights, while the traditional approaches involve conducting experiments, developing theories, or using computational methods. 1.23 The experimental approach involves conducting tests and trials in real-life settings, such as mixing chemicals or testing materials under various conditions, to achieve the desired results.
1.24 Theory provides a framework for understanding the mechanisms behind experimental observations in materials science. It helps researchers explain why certain phenomena occur and guides experimental design by suggesting new avenues for exploration. 1.25 Computational science allows researchers to simulate and model complex systems, such as molecules and materials, providing insights into their properties and behavior. It has accelerated the understanding of materials at the atomic level and facilitated the design of materials with specific properties. 1.26 Computational science relies on accurate atomic models and cannot fully capture the complex experimental conditions involved in material synthesis. It requires experimental data for verification and is unable to account for all variables in the experimental process. 1.27 Data science analyzes existing data to uncover patterns and relationships, helping to identify optimal conditions and candidate materials. It can reveal hidden insights and guide future experiments or simulations, complementing traditional approaches. 1.28 Data science is limited by the quality and quantity of the data used. Incomplete, noisy, or biased data can lead to inaccurate or misleading results. Data preprocessing and cleaning are crucial to ensure data quality. Data science is not a replacement for experimental or computational methods but rather a complementary approach. 1.29 The five fundamental components of materials science are processing, structure, properties, performance, and characterization. 1.30 By utilizing data science techniques, materials informatics can optimize material synthesis methods by analyzing and modeling large datasets of experimental conditions and material structures. 1.31 By utilizing data science techniques, materials informatics can optimize material synthesis methods by analyzing and modeling large datasets of experimental conditions and material structures. 1.32 Materials informatics helps researchers analyze the relationships between synthesis techniques and resulting material properties to identify optimal methods for specific applications. 1.33 By employing data science techniques, materials informatics analyzes large datasets of crystal structures and corresponding properties to uncover meaningful correlations and patterns.
1.34 By analyzing existing crystal structure data, materials informatics can develop models and algorithms that assist in predicting and determining crystal structures, thereby accelerating the discovery of new materials. 1.35 Materials informatics utilizes data-driven approaches to analyze large datasets of material properties and their corresponding composition, structure, and processing parameters, enabling the design and optimization of materials with specific target properties. 1.36 By leveraging data science techniques, materials informatics helps identify patterns, correlations, and relationships between material properties and performance under different conditions, allowing for the fine-tuning and optimization of material properties. 1.37 Materials informatics utilizes advanced data analysis techniques to extract valuable information from various characterization techniques, facilitating a better understanding of the relationships between material structure, properties, and performance. 1.38 Catalysts informatics is a discipline that applies data science techniques to catalyst research, aiming to enhance our understanding and manipulation of catalysts for advancements in materials science. 1.39 Catalysts informatics presents unique requirements due to the interdependence between catalytic reactions and experimental process conditions. Catalyst behavior is sensitive to variations in these conditions, necessitating the concurrent design of catalysts and process conditions. 1.40 Catalysts can be perceived as exhibiting traits reminiscent of living organisms, as they adapt and respond to changes in their environment and surrounding conditions. 1.41 Catalysts informatics incorporates catalyst composition, catalyst performance data, and catalyst characterization data as essential elements to gain insights into catalyst behavior and effectiveness. 1.42 By leveraging informatics approaches, researchers can holistically analyze and integrate diverse datasets related to catalysts, unraveling the relationships between catalyst composition, performance, and characterization. This comprehensive analysis leads to an enhanced understanding and optimization of catalysts, enabling researchers to explore new frontiers in catalysis and drive advancements in materials science.
Problems of Chap. 2 2.1 Informatics scientists need a comprehensive understanding of computer hardware and software because these components play a crucial role in the field of informatics. Just as researchers must comprehend the inner workings of their experimental devices, informatics scientists must possess profound knowledge of hardware and software to optimize their workflows and unleash the full potential of these tools. 2.2 Informatics scientists should have a deep comprehension of hardware components, their functionalities, and their interconnections within a computer system. This includes knowledge of hardware architecture, central processing units (CPUs), memory modules, storage devices, and input/output (I/O) mechanisms. Understanding performance characteristics, scalability, and limitations of different hardware configurations enables informed decision-making for computationally intensive tasks. 2.3 Mastery of software systems is essential for informatics scientists because it allows them to navigate operating systems, utilize programming languages, leverage software libraries, and frameworks. Proficiency in operating systems helps optimize resource allocation and efficient utilization of computing power. Strong command of programming languages and software tools enables algorithm development, data processing, and complex computations. Software libraries and frameworks expedite development and enable rapid prototyping. 2.4 Merging knowledge of computer hardware and software provides informatics scientists with a comprehensive perspective, allowing them to optimize their workflows, resource allocation, and exploit parallel computing capabilities. It enables them to leverage hardware advancements and software techniques tailored to their research domains, enhancing the efficiency and accuracy of their analyses, simulations, and modeling tasks. 2.5 A deep comprehension of computer hardware and software empowers informatics scientists to troubleshoot and resolve technical issues independently. This self-reliance reduces dependence on external technical support and ensures the continuity and productivity of their informatics endeavors. 2.6 A computer system consists of hardware and software components. 2.7 The performance of hardware significantly impacts the speed and efficiency of a computer system, which is crucial for data analysis. 2.8 Software programs and applications allow users to interact with a computer and perform various tasks, including data analysis.
2.9 Similar to the human body, computers have physical (hardware) and invisible (software) components that work together to enable functionality. 2.10 Understanding both hardware and software is crucial for achieving optimal performance in data analysis. Staying up-to-date with advancements in hardware and software allows data analysts to process and analyze data efficiently and accurately. 2.11 The CPU, or central processing unit, is a crucial component of a computer system responsible for executing calculations and performing operations. It serves as the central hub that controls all functions of the computer. 2.12 Some well-known CPU manufacturers include Intel, Advanced Micro Devices (AMD), Apple, and Loongson Technology. Each manufacturer offers CPUs with unique features and capabilities designed to cater to different computing needs. 2.13 Factors such as clock speed, cache size, and the number of cores in a CPU can affect its performance. Choosing the right CPU optimized for specific tasks can lead to better overall computer performance. 2.14 RAM, or Random Access Memory, acts as a temporary storage component in a computer system. It serves as a buffer between the CPU and storage devices, allowing for faster data processing and more efficient system performance. 2.15 Rapid input and output (I/O) are critical in data processing and machine learning applications. Large RAM sizes enable faster data transfer and help achieve optimal performance in these fields. 2.16 Recent developments in RAM technology have led to faster and more efficient modules, such as DDR4 and non-volatile RAM (NVRAM). Staying updated with the latest RAM technology ensures optimal computing performance as technology continues to advance. 2.17 Some examples of storage devices mentioned in the text are floppy disks, CDs, DVDs, hard disk drives (HDDs), USB drives, and solid-state drives (SSDs). 2.18 USB drives offer advantages such as small size, affordability, and versatility. They are widely used for storing and transferring data between devices and can also be used as bootable devices for installing operating systems. 2.19 USB drives and SSDs use semiconductor memory technology, which allows them to read and write data much more quickly. In contrast, HDDs use magnetic storage technology.
2.20 The graphic processing unit (GPU), also known as a graphic card, is responsible for transforming data into the imagery displayed on computer monitors. It is particularly important for machine learning applications and deep learning calculations. 2.21 The motherboard serves as the central circuitry that integrates various essential hardware components within a computer system. It connects elements such as the CPU, RAM, GPU, storage devices, and power unit, establishing the foundation for the computer’s functionality and performance. 2.22 Some cooling mechanisms used to dissipate heat from the CPU include fans, heatsinks, and liquid cooling systems. These mechanisms help maintain optimal operating temperatures and safeguard the performance and longevity of the computer system. 2.23 Materials and catalysts informatics require extensive computational power and parallel computing capabilities, which supercomputers offer. 2.24 The distributed nature allows for parallel execution across multiple interconnected computers, enabling efficient processing of vast volumes of data and complex calculations. 2.25 MPI enables the distribution of calculations across interconnected computers, facilitating efficient parallel processing and integration of results. 2.26 No, increasing CPU cores does not necessarily lead to a proportional decrease in computational time. Factors such as workload distribution and resource utilization must be carefully evaluated. 2.27 Factors such as data dependencies, communication overhead, synchronization requirements, and potential bottlenecks in other system components can impact the scalability of parallel computations. 2.28 The three well-established operating systems for desktop computers are Windows OS, Mac OS, and Linux OS. 2.29 Windows OS is known for its user-friendly interface, extensive software compatibility, and wide adaptation in various settings worldwide. 2.30 Mac OS has seamless integration with Apple’s hardware, and there are professional software applications tailored to fields such as design, science, and medicine. 2.31 Linux is open source, free of charge, versatile, and customizable. It offers a wide range of powerful tools, libraries, and frameworks for data analysis, machine learning, and scientific computing.
2.32 Users can create a server environment on their desktop PCs by utilizing virtual machines or dockers. These technologies allow users to simulate server functionalities, experiment with different configurations, and allocate specific resources for optimal performance. 2.33 Windows holds over 80% of the market share, followed by Mac OS with approximately 10%. 2.34 Linux, despite its flexibility and customization options, holds a relatively modest market share of less than 2% in the realm of desktop computing. 2.35 Linux captures an overwhelming market share of nearly 99% in the realm of server operating systems. 2.36 Differences in software compatibility, package availability, and system configurations pose challenges in achieving parity in data science environments between Windows or Mac and Linux. 2.37 CentOS, known for its stability, and Ubuntu, favored for its user-friendly approach, extensive software ecosystem, and active community support, are commonly used server operating systems for data science. 2.38 Ubuntu incorporates cutting-edge technologies and software while maintaining a balance between stability and innovation, making it a leading choice for server deployments. 2.39 Ubuntu provides a comprehensive software repository specifically curated for data science, housing essential tools and libraries, making it an attractive option for data science and informatics applications. 2.40 Linux Mint, a Linux distribution derived from Ubuntu, offers an intuitive and polished desktop experience, making it highly recommended for users new to Linux operating systems. 2.41 Virtual machines provide an alternative solution by enabling the installation of Linux as a Windows application, creating a self-contained and isolated Linux environment within the existing operating system.
2.42 By using virtual machines, users can seamlessly integrate a Linux operating system into their current computing environment, allowing for the exploration and utilization of Linux-specific tools, libraries, and frameworks for data science tasks. 2.43 Pre-configured environments such as Anaconda offer comprehensive suites of data science tools and libraries, simplifying the initial configuration process by providing a ready-to-use environment specifically tailored for data science work.
Problems of Chap. 3 3.1 In an era characterized by complexity and technological advancements, programming is considered indispensable because it serves as a versatile tool capable of tackling a wide range of tasks and objectives. It empowers individuals to bring their ideas to life in the digital realm and facilitates strategic problem-solving. 3.2 Programming acts as a critical conduit for transforming abstract concepts into tangible solutions by providing a means to implement meticulously crafted plans. It enables users to clearly define desired outcomes, gain a deep understanding of the problem at hand, and identify optimal approaches and methods to accomplish objectives. 3.3 The iterative nature of programming fosters a mindset of continuous improvement, where users can continuously refine and enhance their code. Through each iteration, programmers strive for improved performance, efficiency, and functionality, adapting to evolving needs and emerging insights. 3.4 Programming empowers individuals to embrace the power of abstraction and modular design principles. By breaking down complex problems into smaller components, programmers can create robust and flexible systems. This approach enhances the maintainability, scalability, collaboration, and code reusability of software projects. 3.5 Programming enables automation by leveraging algorithms and computational logic. By automating repetitive tasks, programmers can streamline workflows and save time and effort. Automation transforms mundane and time-consuming processes into efficient operations, allowing users to focus on higher-level tasks and strategic decision-making. 3.6 The three key steps in programming are planning, writing, and running. 3.7 Planning is the foundational step in programming where programmers define the objective of the program, design algorithms, and create a logical flow of operations. It sets clear goals and serves as a roadmap for the development process.
3.8 The writing phase involves translating the planned algorithms into precise and syntactically correct code using a programming language of choice. Programmers implement the logic and create the building blocks of the program’s functionality. 3.9 Running the program allows for testing its functionality, identifying errors or bugs, and refining its behavior. Programmers analyze the program’s output and behavior to ensure its accuracy and efficiency. 3.10 Programming serves as a foundational technology in data science and computational science. It enables the implementation of algorithms, mathematical computations, and effective data manipulation, making it essential for tasks such as predictive modeling, data preprocessing, simulations, and numerical solutions to differential equations. 3.11 The “Hello World” program serves as an introductory example in programming, teaching beginners fundamental concepts such as printing output and understanding basic syntax. 3.12 Binary code is the means through which computers comprehend and respond to human instructions. It allows for efficient processing and interpretation of commands, forming the backbone of modern computing systems. 3.13 Binary digits enable the display of images and photographs captured by devices like smartphones. Within a seemingly compact image file, millions of individual 0s and 1s are meticulously arranged to encode visual information. 3.14 Assembly languages are specialized programming languages that directly interact with computer hardware. They offer unparalleled control and precision but are less accessible and challenging for human readability compared to higher-level programming languages. 3.15 Portability allows code to be easily adapted across different computer systems or architectures. Higher-level programming languages provide greater portability and code reusability, enabling developers to write code that can be executed on a wider range of hardware platforms. 3.16 Binary code serves as the bridge connecting computer hardware and software, enabling the translation of human-readable commands into a language that the hardware can comprehend. 3.17 Assembly languages are low level and closely aligned with hardware architecture, making them challenging and cumbersome for users due to their complex syntax and structure.
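To make Solution 3.11 concrete, the entire "Hello World" program in Python is a single line; the file name hello.py used below is only illustrative.

# hello.py -- prints a greeting to standard output
print("Hello World")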
3.18 High-level programming languages are designed to promote greater accessibility by offering programmers a more intuitive and human-readable syntax, abstracting away the intricate details of hardware operations. 3.19 High-level programming languages provide built-in functions and libraries that abstract away the complexities of low-level hardware operations, allowing programmers to focus on the logic and functionality of their programs. 3.20 Compiled languages require a dedicated compiler to translate source code into machine code, resulting in a standalone binary file. Script languages, on the other hand, are interpreted on-the-fly, allowing code to be executed line by line without a separate compilation step. 3.21 The widespread adoption of graphical user interfaces (GUIs) has contributed to the shift away from coding, as GUIs overlay a visual interface on top of the programming language, enhancing usability and accessibility. 3.22 GUIs simplify the process by allowing users to create a button within the interface that triggers the execution of the code snippet "print('hello world')" upon selection, eliminating the need for users to type out the code. 3.23 GUIs enable users to perform actions through visual cues and interactions, revolutionizing software usability and bridging the gap between users and complex programming tasks. 3.24 The Tk GUI toolkit is commonly used in Python to build graphical user interfaces, providing developers with visually appealing and interactive elements for their applications. 3.25 Machine learning and AI have the potential to disrupt the conventional programming landscape by enabling automated algorithm generation and direct communication between machine learning algorithms and computer hardware. This shift could lead to more efficient and optimized processing, revolutionizing areas such as performance optimization, system responsiveness, and adaptability. 3.26 Atom, Xcode, Visual Studio Code, and PyCharm are robust text editors suitable for extensive and ambitious projects, offering comprehensive feature sets, powerful debugging capabilities, version control, and collaboration features. 3.27 Sublime Text is recommended for those seeking a lightweight coding experience. It offers impressive performance and efficiency without compromising on essential features and functionalities.
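As a minimal sketch of Solutions 3.22 and 3.24 (not the book's own example), the Tk toolkit can wrap print('hello world') behind a button; the widget names and labels here are arbitrary.

import tkinter as tk  # Python's standard binding to the Tk GUI toolkit

root = tk.Tk()  # main application window
# Clicking the button triggers the code snippet instead of requiring the user to type it
button = tk.Button(root, text="hello world", command=lambda: print("hello world"))
button.pack()
root.mainloop()  # start the event loop and wait for user interaction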
3.28 The selection of a text editor is highly subjective and depends on individual needs, preferences, and project requirements. Evaluating different text editors allows developers to find the ones that best align with their unique workflows and offer desired functionality, extensibility, ease of use, and compatibility with specific programming languages. 3.29 Emacs, vi, and nano are commonly used text editors for code editing within server environments, providing essential functionalities and command-line interfaces for efficient code editing and manipulation. 3.30 Jupyter Notebook offers an immersive and interactive environment where code execution, input, and output can be seamlessly saved and tracked within a single unified space. Its notebook-style interface allows data scientists to combine code, visualizations, and explanatory text within a single document. 3.31 Algorithm design involves formulating effective and efficient solutions by leveraging a deep understanding of the problem domain and robust analytical thinking. Skilled algorithm designers maximize performance, minimize computational complexity, and address potential pitfalls in their designs. 3.32 Proficiency in programming languages and adherence to best coding practices are essential qualities for a competent programmer. They ensure that the implemented solution captures the intended algorithm accurately while considering factors such as code readability, maintainability, and scalability. 3.33 Collaboration and effective communication between algorithm designers and programmers enhance the likelihood of delivering successful and robust software solutions. Understanding the problem domain, requirements, and constraints fosters a harmonious synergy between these roles, leading to superior outcomes. The expertise and experience of the team directly impact the quality and reliability of the code. 3.34 Open source refers to an approach in software development where users have access to and can modify the underlying source code of the software. 3.35 No, open source goes beyond the idea of free software. It emphasizes transparency, allowing users to understand and modify the source code. 3.36 Open source empowers users to make choices and modifications by providing access to the source code, such as selectively including or excluding certain elements in a program. 3.37 Access to the source code allows teachers to evaluate the accuracy of code used in calculating average test scores and collaborate to improve and rectify any identified issues.
3.38 Closed-source code restricts users from understanding how the code functions, hampers issue identification and rectification, and limits collaborative efforts and potential contributions. 3.39 The fundamental tenet of open source is to ensure users have the freedom to access, inspect, and modify the source code, leading to more sophisticated and user-friendly software development. 3.40 In a closed-source environment, users lack control over modifications to the code, making them susceptible to unforeseen issues and changes they cannot influence or mitigate. 3.41 The Linux operating system emerged as a transformative force in the open-source movement, showcasing the potential of collaborative development and innovation within the computing ecosystem. 3.42 Some commonly encountered licenses for open-source projects include the MIT license, the Apache license, and the Creative Commons License. 3.43 The Apache license allows claims to be made on patents that utilize the code it is attached to, providing an added layer of protection for both the code and any associated patents. 3.44 GitHub serves as a hub for code sharing and collaboration, hosting repositories from diverse projects. It enables developers to contribute, explore, and refine their skills while fostering a community of openness and collaboration.
Problems of Chap. 4 4.1 The official documentation for pandas can be found online at "https://pandas.pydata.org/docs/." It provides guides, tutorials, and references to support learning and mastery of pandas. 4.2 The pandas documentation offers a wealth of guides, tutorials, and references catering to various skill levels. It provides comprehensive information about the pandas API and its functionalities. 4.3 Python's extensive ecosystem of libraries and frameworks provides a versatile platform for data analysis, visualization, and modeling. These tools enable users to create remarkable solutions and derive insightful conclusions from data.
4.4 The next chapter delves into materials and catalyst informatics, combining principles of data science with the study of materials and catalysts. It explores the interplay between data analysis, computational modeling, and materials science to drive innovation. 4.5 The fusion of materials and catalyst informatics combines data science principles with cutting-edge research and development. It offers a captivating landscape where complex challenges can be tackled, leading to groundbreaking discoveries and advancements in various industries. 4.6 Programming is necessary in data science because it allows handling of massive datasets, implementation of advanced data manipulation and analysis techniques, customization and flexibility, and automation of tasks. 4.7 Programming languages offer more computational power and flexibility compared to tools such as Excel, enabling handling of large datasets, implementation of advanced techniques, customization, and automation. 4.8 Python is a preferred language in data science due to its readability, efficiency, and ease of use. It has a comprehensive standard library and is widely adopted, providing a strong foundation for complex data manipulation and analysis tasks. 4.9 The code in C is designed to generate the output “yes” if a variable, x, is smaller than 10, and “no” if the variable is larger than 10.6. 4.10 Python’s coding syntax is more intuitive, user-friendly, and readable compared to C. Python code tends to be cleaner, with simpler if-statements and fewer brackets and punctuation, making it more approachable for beginners. 4.11 Python is preferred in data science due to its accessibility, ease of use, and gentle learning curve. It allows researchers and analysts to quickly grasp fundamental concepts and syntax, enabling them to focus more on solving data science challenges rather than struggling with complex programming languages. 4.12 The type() command allows developers to determine the data type of an object in Python. It helps in performing appropriate operations and ensuring data integrity. 4.13 Tuples and lists both store elements of different types, but tuples are immutable, meaning their order and content cannot be modified once created. Lists, on the other hand, are mutable and allow operations such as appending, removing, and iterating over elements.
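A short sketch of Solutions 4.12 and 4.13: type() reports an object's data type, lists are mutable, and tuples are not. The variable names are arbitrary.

elements = ["H", "He", "Li"]   # list: mutable
point = (1.0, 2.0)             # tuple: immutable once created

print(type(elements))          # <class 'list'>
print(type(point))             # <class 'tuple'>

elements.append("Be")          # lists support appending, removing, iterating
print(elements)                # ['H', 'He', 'Li', 'Be']

try:
    point[0] = 5.0             # modifying a tuple raises an error
except TypeError as error:
    print("Tuples are immutable:", error)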
4.14 Importing libraries in Python expands the capabilities of the language by providing access to specialized commands and functions tailored to specific tasks. It allows programmers to leverage additional functionalities beyond the default library and enhance the accuracy and efficiency of their code. 4.15 The ever-evolving nature of technology may introduce alternative languages or frameworks in the future. By updating and expanding one’s skill set, individuals can adapt to emerging technologies and stay proficient in the dynamic world of data science and programming. 4.16 The datetime type in Python is used for working with dates and times. It enables various operations and manipulations related to time-based data, such as tracking events or measuring durations. It provides functionalities like acquiring the current time and date using the datetime.date.today() command. 4.17 Python supports fundamental arithmetic operations such as addition, subtraction, multiplication, division, modulus, and exponentiation. 4.18 Python’s extensive set of mathematical symbols enables efficient and accurate computations, allowing programmers to perform calculations and solve mathematical challenges with ease. 4.19 In Python, variables can be redefined during runtime, which allows for dynamic and adaptive problem-solving approaches. This feature is particularly useful in data science scenarios where specific types of data need to be tracked and counted dynamically. 4.20 Logical operators enable the connection of two pieces of information and facilitate decision-making processes in programming. They are commonly used to create conditional statements and incorporate powerful decision-making logic into the code. 4.21 Logical operators are invaluable in data sorting operations as they enable filtering, pattern identification, and targeted actions based on specific criteria. They enhance the functionality and efficiency of code in data processing and manipulation tasks, making it more versatile and capable of handling complex datasets. 4.22 The sequential execution model in Python means that code is executed from top to bottom in the order it appears, ensuring that each line of code is processed in the intended sequence. 4.23 The proper order of operations is important in Python code execution to ensure that dependencies, such as variable assignments or function definitions, are correctly resolved
before they are referenced. Misplacing or improperly ordering code segments can result in errors or unexpected behavior. 4.24 If-statements in Python are used to introduce conditional logic, allowing specific code blocks to be executed based on certain conditions being met or not. 4.25 For-loops in Python provide a convenient way to iterate over collections of data, such as lists or strings, and perform repetitive tasks for each element. They enhance code efficiency and readability by simplifying the process of applying the same operations to multiple items. 4.26 The “else” statement in Python allows for the definition and execution of a distinct code block to cater to alternative conditions when the initial condition in an if-statement is not satisfied. It enhances the flexibility and robustness of the code, enabling more nuanced control over the program’s behavior. 4.27 The empty list “xx” serves as a container to store the values obtained during the iterative process. 4.28 The value of variable “y” is calculated by summing the current element “x(i)” with the value 1, represented as “y = x(i) + 1.” 4.29 During each iteration of the loop, the resulting value of variable “y” is appended to the list “xx,” effectively expanding its contents. 4.30 A for-loop iterates a predetermined number of times, whereas a while-loop continues executing until specific conditions are met. 4.31 The while-loop is terminated when the condition “x < 10” is no longer true. 4.32 Modules in Python provide a collection of prebuilt functions and definitions that can be readily utilized, saving time and effort in manual implementation. 4.33 Modules enhance productivity by offering a comprehensive set of prebuilt tools, eliminating the need to manually define functions and enabling the seamless integration of complex functionalities. 4.34 The numpy module provides functions specifically designed for generating random numbers. By importing and utilizing numpy, developers can easily generate random numbers according to their desired criteria. 4.35 Importing the numpy module streamlines the process and reduces the code footprint.
4.36 The random function in the numpy module facilitates the generation of random numbers. 4.37 By customizing the parameters, you can define the desired range from which the random number should be derived. 4.38 The print function is used to present the output, displaying the generated random number. 4.39 Specialized modules such as numpy simplify random number generation by providing predefined functions and capabilities, reducing the need for complex coding. 4.40 The numpy module facilitates array computations and offers a comprehensive range of mathematical functions in data science. 4.41 The pandas module provides a rich set of data structures and functions for efficient data handling, transformation, and exploration in data analysis. 4.42 The pandas module allows the creation of data frames, a versatile data structure that efficiently consolidates tabular data, enabling the generation and manipulation of data tables. 4.43 Directly loading data from .csv files using the pandas library saves time and effort, especially with large datasets, and ensures data integrity and consistency. 4.44 By leveraging the data loading functions provided by pandas, you can import and manipulate data from a .csv file, organizing it into a suitable data structure like a pandas DataFrame. 4.45 The function "pd.read_csv('FILENAME')" is used to access and interact with a .csv file, where FILENAME refers to the name of the specific .csv file. 4.46 The "head(N)" method allows users to selectively display the first N rows of a data frame, providing a concise overview of the dataset and aiding in the initial stages of analysis.
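A brief sketch of Solutions 4.45 and 4.46; the file name data.csv is hypothetical and stands in for whatever .csv file is being analyzed.

import pandas as pd

# Load the .csv file directly into a pandas DataFrame (Solution 4.45)
df = pd.read_csv("data.csv")

# Display the first five rows for a quick overview of the dataset (Solution 4.46)
print(df.head(5))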
4.47 By using the [x:y] notation, users can define the starting and ending rows to display, allowing fine-grained control over the subset of rows to be displayed in a pandas data frame. 4.48 Pandas allows data scientists to extract data based on specified conditions by using conditional statements, such as df4['AtomicWeight'] > 5, which acts as a filter to include only relevant data points that meet the condition. 4.49 By enclosing the code snippet within df4[df4['AtomicWeight'] > 5], you can retrieve a comprehensive list of data points that satisfy the given condition, including both the element values and their corresponding indices. 4.50 Conditional extraction allows for selective retrieval of data based on specific criteria, enhancing efficiency and reducing the cognitive load associated with sifting through large volumes of data manually. It enables focused analysis on relevant subsets of data, aiding in the identification of patterns, trends, and outliers. 4.51 The "sort_values()" method in pandas allows for sorting data based on specific columns. It enables data manipulation by arranging data in either alphabetical order for text or numerical order for numbers, facilitating effective analysis. 4.52 Data concatenation in pandas allows for the seamless integration of different datasets. It is useful when combining separate pieces of data into a unified dataset, providing a comprehensive view and enabling holistic analyses that leverage collective information from multiple sources. 4.53 While concatenation focuses on appending data, merging in pandas offers greater flexibility by enabling the mixing of data from multiple sources. Merging is particularly valuable when dealing with datasets that contain different columns or when a more comprehensive integration of data is required. 4.54 Merging datasets in pandas allows data scientists to integrate and consolidate information from various sources. By leveraging the merge() function, analysts can perform advanced data manipulations, combining datasets based on common keys or columns. This flexibility enables the creation of comprehensive datasets that incorporate information from multiple sources, enhancing the depth and richness of the analysis.
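The pandas operations in Solutions 4.48-4.54 can be sketched with a small toy table; the DataFrame below merely imitates the book's df4, and its column names (Element, AtomicWeight, AtomicRadius) are assumptions made for illustration.

import pandas as pd

df4 = pd.DataFrame({"Element": ["H", "He", "Li", "C"],
                    "AtomicWeight": [1.008, 4.003, 6.94, 12.011]})

# Conditional extraction (Solutions 4.48-4.49): keep rows with AtomicWeight > 5
heavy = df4[df4["AtomicWeight"] > 5]

# Sorting by a column (Solution 4.51)
by_weight = df4.sort_values("AtomicWeight", ascending=False)

# Concatenation appends additional rows (Solution 4.52)
extra = pd.DataFrame({"Element": ["N"], "AtomicWeight": [14.007]})
combined = pd.concat([df4, extra], ignore_index=True)

# Merging joins two tables on a shared key column (Solutions 4.53-4.54)
radii = pd.DataFrame({"Element": ["H", "Li", "C"], "AtomicRadius": [0.25, 1.45, 0.70]})
merged = pd.merge(combined, radii, on="Element", how="left")
print(merged)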
Problems of Chap. 5 5.1 Data consistency is crucial because the consistency of materials and catalysts data can vary significantly due to diverse methodologies employed in data generation. Understanding data consistency is important before engaging in subsequent data visualization and machine learning tasks. 5.2 Data preprocessing plays a pivotal role in transforming raw data into a structured format suitable for further analysis. It is essential in materials and catalysts informatics to ensure data accuracy, reliability, and usability. 5.3 Data cleansing is a process within data preprocessing that involves identifying and rectifying issues within the dataset to ensure accuracy and reliability. It addresses data inconsistencies, inaccuracies, missing values, and outliers that could hinder the integrity and usability of the dataset. 5.4 Python’s pandas library offers a range of powerful tools specifically designed to facilitate the data cleansing process. Researchers and data scientists can leverage pandas to effectively address data inconsistencies, inaccuracies, missing values, and outliers in materials and catalysts data. 5.5 By employing the robust data preprocessing functionalities provided by pandas, practitioners can efficiently cleanse and transform their materials and catalysts data. These preprocessing techniques help eliminate duplicate entries, handle missing values, address inconsistent units or formatting, and ensure overall data quality. It establishes a solid foundation for subsequent data visualization and machine learning tasks. 5.6 Access to a substantial and well-curated repository of high-quality data is essential for data science in material and chemical research because data serves as the bedrock for insightful analysis and informed decision-making. The success of data-driven approaches relies on the availability of relevant and reliable datasets. 5.7 The crucial steps involved in collecting and preparing data for data science applications include data acquisition, data cleaning, data transformation, data integration, and metadata annotation. These steps ensure the integrity, reliability, and suitability of the dataset for subsequent analysis and modeling. 5.8 The purpose of data cleaning is to identify and rectify inconsistencies, errors, missing values, and outliers that may impact the integrity and reliability of the dataset. Data cleaning ensures that subsequent analyses and modeling are based on accurate and valid information.
5.9 Data transformation involves converting raw data into a suitable format that aligns with the objectives of the analysis. This may include normalization, standardization, feature engineering, or dimensionality reduction techniques, depending on the specific requirements of the data science application at hand. 5.10 Data integration plays a vital role in aggregating and merging disparate datasets to create a unified and comprehensive view. It involves aligning and reconciling different data sources, resolving conflicts, and harmonizing data representations to create a cohesive dataset that encapsulates diverse aspects of materials and catalyst research. Data integration enhances the usability and interpretability of the dataset for subsequent analysis. 5.11 Data are considered indispensable for data science because it serves as the cornerstone upon which the entire field revolves. It provides the foundation for analysis, modeling, and decision-making in data science. 5.12 The two major categories of data sources for materials and catalysts data are existing data published in reputable repositories (literature and patents) and the creation of a bespoke dataset tailored to specific research objectives and requirements. 5.13 Utilizing existing data from literature and patents offers numerous benefits, including access to well-curated datasets, comprehensive and domain-specific information, and the advantage of building upon prior research and findings. It encompasses a wealth of data on materials and catalysts from studies, experiments, and discoveries conducted globally. 5.14 Text mining is an emerging technique that involves the automated extraction of data from vast collections of text-based sources. By leveraging natural language processing and machine learning algorithms, researchers can analyze large volumes of literature and extract valuable information. Text mining can uncover hidden patterns, correlations, and trends within textual data, facilitating the efficient extraction of materials and catalysts data. 5.15 Different approaches to literature data acquisition include manual collection, review articles, text mining, open data centers, and data purchases. Manual collection involves proactive searching for target data through online platforms such as Google Scholar and Web of Science. Review articles provide consolidated summaries and analyses of existing research. Text mining uses automated techniques to extract information from text-based sources. Open data centers offer curated datasets openly available for scientific research. Data purchases provide access to specialized or proprietary datasets. 5.16 Text mining is a technique that allows for the automated extraction of valuable information from articles. It involves the identification and extraction of relevant data
by specifying the appropriate context and keywords. Text mining significantly enhances efficiency and scalability by utilizing advanced algorithms and techniques. 5.17 One of the key challenges in text mining is the potential for collecting unwanted or inaccurate data. Due to language complexities and limitations of automated algorithms, irrelevant or erroneous information may be extracted. Rigorous validation procedures are necessary to ensure the quality and reliability of collected data. 5.18 Data validation involves verifying the accuracy, consistency, and relevance of the extracted data against established criteria or ground truth. Through careful validation techniques, researchers can identify and rectify inconsistencies or errors, ensuring the integrity of the collected dataset. It helps maintain the credibility of subsequent analyses and interpretations based on the mined data. 5.19 Notable tools and libraries used in text mining include the Python Natural Language Toolkit (NLTK) and spaCy. These tools offer functionalities for tasks such as tokenization, part-of-speech tagging, named entity recognition, and more. They streamline the text mining process and provide pre-trained models for NLP-related tasks. 5.20 Researchers have the option of utilizing open data centers or purchasing datasets for their data collection needs. Open data centers provide diverse and publicly accessible data sources, while data purchases offer curated and verified datasets from reputable providers. Alternatively, researchers can generate their own materials and catalysts data through experiments or simulations to meet specific research objectives. 5.21 The conventional method involves conducting experiments or calculations in-house, where researchers design and create samples specific to their research objectives, followed by data collection and analysis. 5.22 Experimental data collection techniques can include manipulating variables, controlling experimental conditions, and collecting data through instruments or sensors. 5.23 By collaborating with third-party entities, researchers can leverage their expertise and resources in conducting experiments or calculations, thereby expanding their capabilities and potentially accessing specialized services. 5.24 High throughput refers to the ability to conduct multiple experiments or calculations simultaneously, enabling the generation of large amounts of data in a relatively short period. It involves automating and parallelizing synthesis, performance testing, and characterization processes.
5.25 Low throughput approaches involve processing individual samples sequentially, while high throughput approaches allow for the concurrent handling of multiple samples. Low throughput offers precision and control for in-depth analysis, while high throughput facilitates faster data acquisition and screening of a wide range of samples and conditions. 5.26 High throughput methodologies enable the simultaneous synthesis, performance testing, and characterization of multiple samples, leading to increased efficiency and data generation. 5.27 High throughput methodologies allow researchers to process and analyze multiple samples in parallel, while low throughput experiments handle individual samples sequentially. 5.28 High throughput approaches generate extensive datasets, enabling data-driven research, advanced analytics, and machine learning techniques for profound discoveries and advancements in materials and catalyst science. 5.29 High throughput methodologies ensure constant experimental conditions, resulting in consistent and reliable data that enhance the efficacy of data science techniques and enable accelerated progress in the field. 5.30 Challenges include the development of specialized experimental devices or code, increased financial investments, and a potential trade-off between data quantity and quality compared to conventional experiments and calculations. 5.31 The data preprocessing stage is crucial as it refines and prepares collected data, ensuring its quality and reliability for the application of data science techniques. 5.32 Data cleansing involves identifying and rectifying inconsistencies, errors, and outliers in the dataset to improve its accuracy and reliability. 5.33 Data labeling enriches the dataset with relevant labels, enabling subsequent analysis and modeling, and facilitating the extraction of meaningful insights and patterns. 5.34 Data augmentation involves generating synthetic data points or augmenting existing ones to expand the dataset size and improve the generalization and robustness of machine learning models. 5.35 Data aggregation consolidates and integrates multiple datasets or sources to create a comprehensive and unified dataset, enabling researchers to leverage a broader range of information and insights for analysis.
5.36 Comprehensive data preprocessing is necessary to harmonize and standardize collected data, ensuring its usability and compatibility for widespread use, facilitating effective data sharing and collaboration. 5.37 The absence of standardized rules in data collection leads to variations in recording and collection practices, resulting in complexities, discrepancies, data fragmentation, and inadequate datasets for certain applications. 5.38 Data preprocessing includes tasks such as data transformation, normalization, feature extraction, and dimensionality reduction to refine the raw data and tailor it to meet the specific requirements of analytical approaches and machine learning models. 5.39 Differences in data labeling conventions can lead to confusion, misinterpretation, and omission of valuable data. It hampers accurate retrieval and analysis, particularly when leveraging automated processes or interdisciplinary collaborations. 5.40 Standardized data recording methods and clear labeling practices harmonize data representation, facilitate integration and cross-referencing of data from diverse sources, enhance discoverability and accessibility, and enable seamless collaboration and utilization of computational tools and algorithms. 5.41 The calculation of yield in catalysis relies on selectivity and conversion data. Understanding this relationship is crucial for accurate predictions and optimal results. 5.42 Missing or incomplete data in datasets with inconsistencies in selectivity reporting should be addressed through data preprocessing techniques such as imputation or interpolation to ensure the comprehensiveness and reliability of the dataset. 5.43 Machine learning algorithms and visualization techniques often rely on numerical data, making it challenging to incorporate string-based labels directly. It becomes necessary to assign index numbers or some numerical representation to enable the inclusion of string-based variables in computational analyses and visualizations. 5.44 The absence of standardized rules and guidelines in data generation and collection hampers the reusability and interoperability of datasets, impeding collaborative efforts and hindering the progress of scientific research. 5.45 The six fundamental categories in data cleansing are validity, accuracy, uniformity, completeness, consistency, and duplication. Each category plays a distinct role in ensuring the cleanliness, usability, and integrity of the data for subsequent analyses and machine learning tasks.
5.46 The two methods commonly used for data cleansing tasks are manual observation and Python’s pandas library functions. 5.47 Data visualization techniques help detect patterns, outliers, and inconsistencies in the data. Plots, charts, and graphs can reveal data points that deviate from the expected distribution or exhibit suspicious patterns, assisting in decision-making during the data cleansing phase. 5.48 Machine learning algorithms can automate the detection of data outliers and anomalies. They learn patterns from a dataset and can identify instances that deviate significantly from the expected behavior, helping data scientists identify data points that require further investigation or cleansing. 5.49 The specific research context and characteristics of the dataset influence the choice of data cleansing methods. Researchers need to evaluate the advantages and limitations of different approaches and select the most suitable method or combination of methods for their data cleansing tasks. 5.50 One hot encoding is a common method used to convert string data into categorical variables represented by binary values (0s and 1s). It allows the inclusion of string data in data visualization and machine learning tasks by creating separate categorical variables for each category present in the data. 5.51 Pandas provides techniques such as the “isna()” command to check for missing data entries and the “dropna()” function to eliminate blank values from the dataset. 5.52 The z-score quantifies how many standard deviations a data point deviates from the mean of the dataset, helping to identify potential outliers. 5.53 The “dropna()” function in pandas can be customized to drop rows or columns that contain any missing values or only those that have a certain threshold of missing data. 5.54 The merge command in pandas allows for the seamless merging of datasets based on a specified column or key, enabling researchers to integrate information from multiple sources into a single, unified dataset. 5.55 Thoroughly examining and understanding the data provides researchers with necessary context and insights to make informed decisions during subsequent analyses, ensuring the accuracy and reliability of the results obtained through advanced techniques such as machine learning and data visualization.
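A compact sketch of the cleansing steps in Solutions 5.50-5.53, using a synthetic table whose column names (Catalyst, Conversion) are invented for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({"Catalyst": ["Pd", "Pt", "Pd", "Ni"],
                   "Conversion": [42.0, 55.0, np.nan, 61.0]})

# One hot encoding turns the string column into binary 0/1 columns (Solution 5.50)
encoded = pd.get_dummies(df, columns=["Catalyst"])

# Detect and drop missing entries (Solutions 5.51 and 5.53)
print(df.isna())         # True marks missing values
cleaned = df.dropna()    # remove rows containing any missing value

# z-score: deviation from the mean in units of standard deviations (Solution 5.52)
z = (cleaned["Conversion"] - cleaned["Conversion"].mean()) / cleaned["Conversion"].std()
print(z)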
Problems of Chap. 6 6.1 Data visualization plays a significant role in materials and catalysts informatics as a crucial tool for gaining insights and extracting meaningful information from complex datasets. It enables researchers to comprehend the data more effectively and uncover valuable trends, patterns, and relationships. 6.2 By visualizing the data in a meaningful way, researchers can identify key factors, correlations, and dependencies that directly contribute to the design and development of innovative materials and catalysts. Data visualization enables researchers to make informed decisions and guide the design process. 6.3 Data visualization provides a roadmap for conducting deeper investigations and applying advanced analytical methods. By observing the visual patterns and trends, researchers can identify areas of interest and focus their efforts on specific aspects of the data that warrant further exploration. 6.4 Data visualization plays a dual role in materials and catalysts informatics. It enables researchers to comprehend complex datasets effectively and also serves as a catalyst for informed decision-making and exploration, unlocking the full potential of the data. 6.5 By harnessing the power of data visualization, researchers can drive advancements in materials science and catalysis research. It allows them to gain insights, make informed decisions, and explore the data, ultimately leading to discoveries and advancements in the field. 6.6 The two indispensable Python libraries for data visualization are matplotlib and seaborn. 6.7 The linspace() function in numpy enables the generation of a sequence of equidistant data points, which is useful for creating plots with evenly spaced values. 6.8 The plot() function in matplotlib is used to establish the relationship between variables and generate visual plots. 6.9 Data visualization allows researchers to explore and understand the distinguishing attributes of each iris type, providing valuable insights into the relationships and patterns within the dataset for informed classification and analysis. 6.10 The line plot fails to effectively convey the intended information and capture intricate patterns in the iris dataset, making it challenging to discern significant trends. Alternative visualization methods should be explored for a more comprehensive depiction of the data.
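A minimal sketch of Solutions 6.7 and 6.8: numpy's linspace() supplies evenly spaced points and matplotlib's plot() draws the relationship between them; the sine curve is only an example function.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)   # 50 equidistant points between 0 and 10
y = np.sin(x)                # any relationship between x and y will do

plt.plot(x, y)               # line plot of y against x
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.show()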
6.11 Figure 6.3 showcases a line plot, while Fig. 6.4 demonstrates a scatter plot. The primary distinction lies in the choice of the plotting command, shifting from plot() to scatter(). 6.12 The scatter plot in Fig. 6.4 allows for the clear depiction of individual data points, revealing the presence of at least two major groups distinguished by petal lengths. It enables the identification of distinctive clusters based on petal lengths within the iris dataset. 6.13 The incorporation of color allows for the differentiation between the three types of iris (iris setosa, iris versicolor, and iris virginica). It reveals a more pronounced association between the iris types and petal length, indicating their significant influence on clustering. 6.14 The scatter plot visualizes the relationship between CH4 conversion, C2 selectivity, and C2 yield in the OCM reaction. It offers insights into the interplay between these variables, aiding in the understanding of catalyst performance and providing clues for optimizing reaction conditions. 6.15 The scatter plot reveals a trade-off between CH4 conversion and C2 selectivity, highlighting the need to balance these factors in the OCM process. It also helps identify statistical distributions and potential biases in the dataset, contributing to a comprehensive understanding of catalyst performance and guiding further investigations. 6.16 A bar plot is used to compare variables by presenting data as bars, making it useful for visualizing trends and comparisons. 6.17 By examining a bar plot, one can observe trends and compare the values of different variables, such as the magnetic moments of two-dimensional materials containing different elements. 6.18 A histogram visualizes the frequency or occurrence of specific elements within a dataset, allowing researchers to observe the distribution of data and identify biases or underrepresented elements. 6.19 A pie chart is a visualization tool that showcases data distribution by presenting the relative proportions of different categories. It allows for quick comparisons and grasping distribution patterns. 6.20 A stack plot, also known as a stacked area plot, visualizes multiple plots stacked on top of each other, providing a cumulative representation of data. It helps identify trends and patterns among different elements or categories.
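A sketch of the colored scatter plot discussed in Solutions 6.12 and 6.13, using scikit-learn's bundled iris dataset as a stand-in for the data used in the book (the book's own loading route may differ).

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]   # third feature column: petal length (cm)
petal_width = iris.data[:, 3]    # fourth feature column: petal width (cm)

# Color each point by species so the clusters noted in Solution 6.13 become visible
plt.scatter(petal_length, petal_width, c=iris.target)
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()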
6.21 The true potential of matplotlib lies in its extensive customization options and the ability to create a wide range of tailored visualizations. 6.22 Seaborn enhances matplotlib visualizations by providing a higher-level interface that simplifies the creation of more sophisticated and visually appealing plots. 6.23 The advantage of using pairplot() in seaborn is that it presents all requested variables along with their data distributions, providing a holistic view of the data structure and enabling the uncovering of trends and patterns more effectively. 6.24 The Pearson correlation coefficient analysis benefits researchers by providing a quick assessment of correlations within their data, guiding them toward further analysis and the selection of appropriate data science methods. 6.25 Violin plots in seaborn offer insights into data distribution and correlation by visualizing the kernel probability distribution. They allow researchers to observe data distribution patterns, identify correlations, and generate hypotheses for further analysis and experimentation. 6.26 The difficulty lies in representing multiple dimensions on a 2D or 3D plot, making it challenging to visualize and interpret the data effectively. 6.27 Parallel coordinates represent each data point as a polyline connecting parallel vertical axes, where each axis represents a different dimension. This technique allows for the visualization of relationships and patterns between dimensions, enabling insights into high-dimensional data. 6.28 Radviz places each data point on a circle or hypersphere, with each dimension represented by a radial axis. The position of each point on the circle is determined by the values of its respective dimensions. It allows examination of the relationship and relative influence of each dimension on the data points. 6.29 Parallel coordinates provide a comprehensive overview of relationships and interactions between variables in multidimensional data. It helps identify clusters, trends, outliers, and enables simultaneous visualization of multiple insights. 6.30 Three-dimensional plots offer an additional dimension for visualization, where x, y, and z coordinates represent three dimensions, and color can represent a fourth dimension. They can help visualize relationships and patterns involving multiple variables, enhancing the representation of data and potentially revealing additional insights or trends. However, they have limitations in visualizing higher-dimensional datasets.
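A sketch of the seaborn views described in Solutions 6.23-6.25, again using the bundled iris table as a stand-in (seaborn downloads it on first use); the column names follow that dataset.

import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")

# Pairwise scatter plots plus per-variable distributions (Solution 6.23)
sns.pairplot(iris, hue="species")

# Pearson correlation coefficients shown as a heatmap (Solution 6.24)
plt.figure()
sns.heatmap(iris.drop(columns="species").corr(), annot=True)

# Violin plot of the kernel probability distribution per species (Solution 6.25)
plt.figure()
sns.violinplot(data=iris, x="species", y="petal_length")
plt.show()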
Problems of Chap. 7 7.1 Data preprocessing plays a vital role in optimizing the performance of machine learning algorithms. It encompasses techniques such as cleaning, normalization, feature selection, and handling missing values or outliers, aiming to enhance the quality and reliability of the data for improved algorithm performance. 7.2 Biased or incomplete data can hinder the machine learning process by failing to grasp the full picture or generalize accurately. Just as an incomplete map hampers understanding, biased or inadequate data can limit the machine’s ability to learn and make accurate predictions. 7.3 Data collection is the initial step in ensuring the availability of relevant and diverse data. This process involves identifying and gathering data from reliable sources, considering necessary features or variables, and ensuring the data are representative of the problem at hand. 7.4 Data preprocessing techniques include cleaning the data by removing noise and irrelevant information, normalizing the data to a common scale, performing feature selection to choose the most relevant variables, and handling missing values or outliers to ensure the data are suitable for machine learning algorithms. 7.5 High-quality, well-preprocessed data allow machine learning algorithms to extract meaningful patterns, relationships, and insights from the data. This leads to more accurate predictions and informed decision-making, ultimately contributing to the success of the machine learning process. 7.6 Arthur Samuel defined machine learning as a “discipline of inquiry that bestows upon computers the remarkable capacity to acquire knowledge autonomously, free from explicit programming.” 7.7 The fundamental tenet that forms the bedrock of machine learning is the pursuit of endowing computers with the inherent ability to assimilate information, discern patterns, derive insightful conclusions, and generate effective solutions based on their accumulated wisdom. 7.8 Machine learning strives to nurture computer systems with intrinsic cognitive prowess, enabling them to acquire knowledge and provide intelligent responses, while traditional programming relies on explicit instructions.
7.9 Machine learning algorithms possess the innate capacity to process and analyze large volumes of information, enabling the extraction of valuable insights that might otherwise remain hidden in massive datasets. 7.10 The four primary types of machine learning are supervised machine learning, unsupervised machine learning, semi-supervised machine learning, and reinforcement learning. Supervised learning uses labeled data for prediction, unsupervised learning discovers patterns in unlabeled data, semi-supervised learning combines labeled and unlabeled data, and reinforcement learning focuses on maximizing performance through interaction with an environment. 7.11 Supervised machine learning involves creating a relationship between a dependent variable (y) and its corresponding independent variables (x) to make accurate predictions or classifications. 7.12 The core principle of supervised machine learning is uncovering the relationship between the descriptor variables (x) and the objective variable (y) by analyzing labeled data. 7.13 The primary types of supervised machine learning are classification and regression. 7.14 Supervised classification involves categorizing data into different groups or classes based on specific criteria, where the machine learns the relationship between the descriptor variables and the predefined classes. 7.15 Supervised regression models aim to predict the value of a continuous objective variable (y) based on given input variables (x), enabling informed decision-making and forecasting. 7.16 The primary focus of unsupervised machine learning is discerning the underlying patterns and relationships that exist within the data. 7.17 Unlike supervised learning, unsupervised learning does not rely on explicit labels or target variables to guide its learning process. Instead, it autonomously explores the data to find inherent structures and organizing principles. 7.18 Unsupervised learning techniques aid in tasks such as exploratory data analysis, data visualization, feature engineering, market and customer segmentation, anomaly detection, and more. 7.19 Clustering algorithms group similar data points together based on their intrinsic similarities, allowing the identification of meaningful clusters or segments within the data.
7.20 Unsupervised machine learning algorithms can analyze vast amounts of data without predefined labels, uncover hidden patterns and associations, provide valuable insights, and classify multidimensional data effectively. They are versatile tools in various domains, such as genetics, finance, and image processing. 7.21 Semi-supervised machine learning represents a hybrid approach that combines elements of both supervised and unsupervised machine learning techniques. It is used when dealing with datasets that are only partially labeled, which poses a challenge for traditional machine learning processes. 7.22 The fundamental concept of reinforcement learning revolves around training machines to make decisions based on rewards or penalties received during the learning process. 7.23 Reinforcement learning trains machines to make decisions through a process of trial and error, where penalties and rewards guide the learning process. By reinforcing positive actions and penalizing undesirable choices, the machine progressively refines its decision-making abilities. 7.24 Overfitting is when a machine learning model becomes excessively tailored to the training data, resulting in difficulties when predicting outcomes for new, unseen test data. It occurs when the model becomes overly complex or when it is tuned too precisely to the training dataset. Overfitting is a concern because the model learns noise and random fluctuations in the training data rather than the underlying patterns, leading to poor generalization and inaccurate predictions for new data.
Problems of Chap. 8 8.1 Understanding the functioning of different supervised machine learning models is crucial for making informed decisions when choosing the most suitable approach for a given task. It allows for a better understanding of the strengths and limitations of each model, enabling effective model selection and decision-making. 8.2 The effectiveness of a machine learning model depends on factors such as the nature of the dataset, the complexity of the problem, and the specific objectives. Different models have different strengths and limitations, and considering these factors is crucial when selecting an appropriate model. 8.3 Decision trees are intuitive and easy to interpret, making them well-suited for problems where interpretability and explainability are crucial. They provide insights into the decision-making process and can handle both categorical and numerical data.
8.4 Support vector machines excel in handling high-dimensional data and are effective when a clear decision boundary is required. They can handle linear and non-linear classification tasks and have been widely used in various domains, including text classification and image recognition. 8.5 Neural networks and deep learning models have the ability to learn intricate patterns and handle large-scale datasets. They are particularly useful in tasks such as image recognition and natural language processing. However, they often require substantial amounts of training data and computational resources. 8.6 Linear regression is a fundamental form of supervised machine learning that establishes a linear relationship between variables. 8.7 Linear regression can be perceived as a form of linear fitting, where the objective is to establish a linear relationship between variables. 8.8 Linear regression allows us to glean valuable insights and draw meaningful conclusions from the data, contributing to our broader understanding of patterns and trends. 8.9 Linear regression can be implemented using the scikit-learn Python library, which provides functionalities for constructing and fitting a linear regression model. 8.10 RMSE (root-mean-square error) is an evaluation metric used in linear regression to quantify the deviation between predicted and observed values. Lower RMSE values indicate better accuracy. 8.11 Multiple linear regression extends linear regression by accommodating multiple descriptor variables, capturing the influence of diverse variables on the objective variable. 8.12 Polynomial regression integrates polynomial terms into linear regression, enabling the modeling of non-linear relationships and enhancing the model’s flexibility and expressiveness. 8.13 Ridge regression is a modeling technique similar to linear regression, but with penalty terms incorporated into the loss function. These penalty terms provide enhanced control and regularization of the model’s behavior, making it effective in addressing complex data scenarios. 8.14 In polynomial linear regression, increasing the degree of the polynomial leads to fittings of varying complexities. As the degree increases, the root-mean-square error (RMSE) decreases, indicating a reduction in the discrepancy between predicted and actual values in the training data. However, a higher degree of the polynomial can also lead to
overfitting, where the model becomes too tailored to the training data and performs poorly on unseen test data. 8.15 The hyperparameter α plays a crucial role in controlling the impact of penalty terms in polynomial linear regression. The choice of α determines the trade-off between model complexity and generalization performance. It requires careful tuning to find the optimal value that allows the model to capture underlying patterns while maintaining its ability to make accurate predictions on unseen test data. 8.16 SVM is a powerful supervised machine learning model used in various applications that can handle regression and classification tasks, accommodating linear as well as nonlinear data instances. 8.17 SVM creates decision boundaries by establishing a line or hyperplane that separates different classes in the data, maximizing the margin or distance between the boundary and the closest data points of each class. 8.18 In cases where the data are not linearly separable, SVM employs a kernel trick to transform the data into a higher-dimensional space, where a linear decision boundary can be established. 8.19 SVM can accurately classify data points and make predictions in a wide range of scenarios, especially when dealing with complex datasets and non-linear decision boundaries. 8.20 The margin in SVM refers to the distance between the decision boundary and the closest data point. It is controlled by the choice between a soft margin (allowing some points within the margin) and a hard margin (not tolerating any points within the margin). 8.21 The RBF kernel and other kernel functions allow SVM to capture complex patterns in non-linear data. The choice of the kernel function is crucial in capturing the underlying relationships and should be based on the specific characteristics of the data. 8.22 The gamma hyperparameter controls the influence of each data point on the decision boundary. A smaller gamma value results in a smoother decision boundary, while a larger gamma value leads to a more focused and localized boundary but increases the risk of overfitting. Selecting the appropriate gamma value requires careful consideration and experimentation. 8.23 The decision tree model is used for both classification and regression tasks, analyzing data to make decisions and predictions.
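A brief sketch of Solutions 8.16-8.22 with scikit-learn's SVC on the iris dataset; the kernel, C, and gamma values are illustrative choices, not recommendations from the book.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The RBF kernel handles non-linear boundaries (Solution 8.21); C sets how soft the margin
# is (Solution 8.20) and gamma how localized each point's influence is (Solution 8.22)
model = SVC(kernel="rbf", C=1.0, gamma=0.1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))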
8.24 The structure of a decision tree is determined through recursive partitioning, where the algorithm splits the data based on features to create decision nodes. 8.25 Decision trees provide a transparent representation of the decision-making process, allowing for easy understanding and explanation of the decisions made at each node. 8.26 Techniques such as pruning, ensemble methods (e.g., random forests), and parameter tuning can be employed to mitigate overfitting and enhance the predictive performance of decision tree models. 8.27 The impurity of a decision tree is commonly measured using the Gini coefficient, which quantifies the degree of impurity in a set of samples. 8.28 Decision trees can be used for regression tasks by predicting continuous target variables based on the observed features, but overfitting and limitations in predicting values outside the training data range should be considered. 8.29 Random Forest (RF) is a widely used and powerful machine learning method that complements the capabilities of Support Vector Machines (SVMs). It can handle both linear and non-linear models, making it suitable for regression and classification tasks. 8.30 RF mitigates overfitting by employing an ensemble approach. It creates multiple decision trees and aggregates their predictions. Each tree is trained on a random subset of the training data and considers a random subset of features for splitting, introducing randomness to reduce overfitting potential. 8.31 The predictions of individual decision trees in RF are combined through either majority voting (for classification) or averaging (for regression). The final prediction of the random forest is based on the collective predictions of all the trees. 8.32 RF leverages the collective knowledge of multiple decision trees to capture a wider range of patterns and relationships in the data. Each tree focuses on different aspects of the data, and by combining their predictions, RF improves its capability to handle complex and non-linear problems. 8.33 The bootstrapping method is used in RF to create diverse and distinct decision trees. It involves randomly selecting subsets of the training data, allowing for the possibility of selecting the same data point multiple times. This randomness helps to generate randomized decision trees, contributing to the power of RF.
8.34 The importance analysis in RF evaluates the relevance and impact of descriptor variables. It helps identify influential variables, guiding subsequent modeling efforts and providing insights into the scientific mechanisms underlying the predictions. 8.35 Voting machine learning is an approach that combines the predictions of multiple machine learning models to make a final prediction, maximizing overall accuracy and robustness. 8.36 Hidden layers in neural networks capture the complexity of data by learning intricate patterns and relationships, enabling the network to classify data points that are not easily separable using a single line or surface. 8.37 Neural networks typically consist of a single hidden layer, while deep learning incorporates multiple hidden layers. Deep learning models can effectively handle complex and intricate datasets by leveraging hierarchical representations learned in these deep architectures. 8.38 Deep learning models are often considered black boxes, as their underlying mechanisms are not easily interpretable by humans. The enhanced performance of deep learning comes at the cost of reduced interpretability. 8.39 Deep learning models require large amounts of training data to perform optimally. Additionally, the complexity and depth of these models make it harder to comprehend how they learn and make predictions. 8.40 Experimental design approaches, such as the use of orthogonal arrays, enable systematic exploration of the parameter space with limited data points, maximizing the insights gained from the available data. 8.41 GPR provides predictions and associated uncertainties, allowing researchers to assess the uncertainty in predictions. This information helps guide the selection of the next data point, improving the efficiency and informativeness of experimental design. 8.42 The purpose of cross-validation is to evaluate how well a trained model can predict the objective variable on unseen data, using a test dataset. 8.43 The available dataset is split into the training dataset (used for model training) and the test dataset (held back and not used for training). 8.44 Random splitting and k-fold cross-validation are two popular methods for cross-validation.
8.45 Mean squared error (MSE) and the coefficient of determination (R-squared) are commonly used evaluation metrics in regression analysis.
8.46 The mixing (confusion) matrix, accuracy, precision, recall, F1 score, and area under the curve (AUC) are commonly used evaluation metrics in classification machine learning. (A short sketch computing these metrics follows.)
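The metrics in 8.45 and 8.46 are all available in scikit-learn's metrics module. The following minimal sketch evaluates them on made-up predictions purely for illustration; the confusion_matrix function corresponds to what the text calls the mixing matrix.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_squared_error, precision_score, r2_score,
                             recall_score, roc_auc_score)

# Regression metrics (answer 8.45); the numbers are made up for illustration
y_true_reg = [2.0, 3.5, 5.0]
y_pred_reg = [2.2, 3.1, 5.4]
print(mean_squared_error(y_true_reg, y_pred_reg))
print(r2_score(y_true_reg, y_pred_reg))

# Classification metrics (answer 8.46); labels and scores are made up
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1]    # predicted probability of class 1
print(confusion_matrix(y_true, y_pred))     # the "mixing matrix" of the text
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))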
Problems of Chap. 9
9.1 Unsupervised machine learning is a powerful tool for uncovering hidden patterns and groups within data. It enables the exploration of data without the need for labeled examples, allowing for more flexibility and scalability in analyzing large and complex datasets.
9.2 Graph theory provides a structured framework for organizing and understanding complex relationships within data. By representing data as nodes and edges, graph theory enables the exploration of connectivity patterns, influence relationships, and other valuable insights, leading to more effective analysis and interpretation of materials and catalysts data.
9.3 Ontologies play a crucial role in materials and catalysts research by enabling the formal representation of knowledge, concepts, and relationships within a specific domain. They facilitate data integration, standardization, and interoperability, enhancing data organization, improving data quality, and supporting advanced querying and inference capabilities.
9.4 The selection of an appropriate unsupervised machine learning algorithm depends on the specific characteristics and requirements of the data at hand. Factors such as the nature of the data, desired outcomes, computational efficiency, and interpretability should be considered when choosing the most suitable algorithm for a given dataset.
9.5 Utilizing graph data provides a structured and intuitive representation of complex relationships within materials and catalysts data. This approach can lead to valuable insights, such as identifying influential nodes or understanding connectivity patterns, ultimately contributing to better organization, analysis, and interpretation of data in the field of materials and catalysts informatics.
9.6 Supervised machine learning relies on predefined descriptors or objective variables, while unsupervised machine learning operates without them.
9.7 The objective of unsupervised machine learning is to discover concealed patterns and similarities within datasets without the use of labeled training data.
9.8 Unsupervised machine learning enables the exploration and extraction of intrinsic structures and relationships within the data, leading to the discovery of meaningful insights, associations, and clusters.
9.9 Clustering is a technique used in unsupervised machine learning to identify similarities and patterns within the data, effectively categorizing it into distinct groups.
9.10 Dimension reduction transforms high-dimensional variables into a lower-dimensional space, facilitating easier data visualization and mitigating issues related to overfitting.
9.11 Principal Component Analysis (PCA) is a widely used technique for dimensional reduction in unsupervised machine learning, enabling the transformation of high-dimensional data into a lower-dimensional representation.
9.12 The curse of dimensionality refers to the phenomenon where the amount of data required for effective unsupervised machine learning grows exponentially as the number of dimensions increases, making it challenging to capture meaningful patterns and relationships.
9.13 Dimensional reduction techniques transform high-dimensional data into a lower-dimensional space, where patterns and relationships can be more effectively captured and analyzed, mitigating the challenges posed by the curse of dimensionality.
9.14 The appropriate number of dimensions in PCA can be determined by analyzing the cumulative explained variance ratio, scree plots, and leveraging domain knowledge.
9.15 K-means clustering is a method used to group data into clusters based on their distances to the center of mass. It involves selecting a desired number of clusters, assigning data points to the nearest center of mass, and iteratively recalculating the center of mass until convergence is achieved. (A combined PCA and k-means sketch follows this set of answers.)
9.16 Unsupervised machine learning offers benefits such as data exploration, discovery of hidden patterns and relationships, and the ability to handle unlabeled data effectively.
9.17 The lack of ground truth or labeled data makes it difficult to objectively assess the quality and accuracy of unsupervised learning results.
9.18 Deep learning techniques, often used in unsupervised learning, can produce complex models that are challenging to interpret and understand, limiting the extraction of meaningful insights.
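As a minimal illustration of 9.11, 9.14, and 9.15, the sketch below scales the Iris measurements used elsewhere in the book, reduces them to two principal components, and clusters the result with k-means. scikit-learn is assumed; the choice of three clusters is illustrative only.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# The Iris measurements stand in for unlabeled materials descriptors
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Dimensional reduction: keep two principal components (answers 9.11, 9.14)
pca = PCA(n_components=2).fit(X_scaled)
X_2d = pca.transform(X_scaled)
print(pca.explained_variance_ratio_.cumsum())   # cumulative explained variance

# k-means clustering on the reduced data (answer 9.15)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)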
9.19 The future advancements of unsupervised machine learning are expected in anomaly detection and representation learning, with applications in cybersecurity, fraud detection, predictive maintenance, image recognition, speech recognition, natural language processing, and recommendation systems.
9.20 Unsupervised learning can be leveraged for data preprocessing, feature engineering, semi-supervised learning, and augmenting the capabilities of supervised learning algorithms, addressing challenges posed by limited labeled data and enhancing overall performance.
9.21 Network analysis focuses on the study of networks consisting of nodes and edges, representing entities and their relationships within datasets.
9.22 By representing complex processes as networks, researchers can analyze the interconnections between nodes to uncover potential pathways, dependencies, and interactions, gaining insights into the sequence of reactions and identifying critical intermediates.
9.23 The visual representation of networks allows researchers to intuitively grasp relationships and patterns within the data, enabling a deeper understanding of underlying processes and interactions.
9.24 Assigning weights to edges in a graph allows for the conveyance of the strength or importance of connections between nodes. This weighting can reflect factors such as the influence of temperature on the likelihood of reactions occurring, providing insights into relative probabilities.
9.25 The two types of edges are undirected and directed. Undirected edges represent connections between nodes where directionality does not play a role, while directed edges indicate source and target relationships between nodes, reflecting a unidirectional relationship. The choice depends on the nature of the relationships being analyzed and the specific context of the data. (A small weighted, directed graph sketch follows this set of answers.)
9.26 Ontologies play a crucial role in organizing data by defining concepts and relationships, providing a structured framework for standardizing terminology and ensuring consistency in data representation.
9.27 Ontologies establish a shared understanding of data elements, their definitions, and relationships, facilitating data integration, interoperability, and knowledge discovery. They enhance the quality and usability of data by addressing issues of heterogeneity and providing a common vocabulary.
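The points in 9.21–9.25 about nodes, weighted edges, and directionality can be illustrated with a toy directed reaction network. The sketch below uses the NetworkX library, which is our own assumption (the book does not prescribe a specific graph package), and the species and weights are invented for demonstration only.

import networkx as nx

# Toy directed, weighted reaction network; species and weights are invented
G = nx.DiGraph()
G.add_edge("CH4", "CH3*", weight=0.8)    # weight as a stand-in for likelihood
G.add_edge("CH3*", "C2H6", weight=0.6)
G.add_edge("C2H6", "C2H4", weight=0.4)
G.add_edge("CH3*", "CO2", weight=0.2)

# Inspect connectivity and edge attributes
for source, target, data in G.edges(data=True):
    print(source, "->", target, data["weight"])

# Enumerate simple paths from methane to ethylene through the network
print(list(nx.all_simple_paths(G, "CH4", "C2H4")))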
9.28 Ontology provides a standardized framework for data representation, fostering reusability and interoperability. It addresses challenges related to data standardization, inconsistent terminology, and lack of metadata, promoting effective data management, analysis, and knowledge generation.
9.29 Ontology-based querying enables structured and precise searches by leveraging knowledge representation languages and description logic. Researchers can extract specific information, explore relationships, and gain deeper insights from the ontology-based data, facilitating knowledge discovery and informed decision-making. (A small query sketch follows this set of answers.)
9.30 Applying ontology to materials and catalyst data enables machine-readable representation, standardization, and alternative classifications. It supports automated processing, data sharing, integration, and collaboration, while also fostering innovative approaches to classification and discovering new materials with desired properties.
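To give a flavor of 9.29 and 9.30, the sketch below stores a few catalyst facts as RDF triples and retrieves them with a SPARQL query. The rdflib library, the example namespace, and the properties (activeFor, hasSupport) are assumptions made for illustration; they are not taken from the book's ontology or from the Web Ontology Language examples it discusses.

from rdflib import Graph, Literal, Namespace, RDF

# A tiny, hypothetical set of catalyst facts expressed as RDF triples
EX = Namespace("http://example.org/catalyst#")
g = Graph()
g.add((EX.Pt, RDF.type, EX.Catalyst))
g.add((EX.Pt, EX.activeFor, EX.MethaneOxidation))
g.add((EX.Pt, EX.hasSupport, Literal("Al2O3")))

# Structured query: which catalysts are active for methane oxidation?
query = """
SELECT ?catalyst WHERE {
    ?catalyst a <http://example.org/catalyst#Catalyst> ;
              <http://example.org/catalyst#activeFor>
                  <http://example.org/catalyst#MethaneOxidation> .
}
"""
for row in g.query(query):
    print(row.catalyst)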
Index
A Accuracy, 121, 122, 131, 221 Accurate data, 131 Acquire knowledge, 173 Adaptability, 60, 175 Adaptive decision-making, 173 Adding, 84 Addition, 84 Advanced Micro Devices (AMD), 30 Advanced Package Tool (APT), 45 Agent, 185, 186 Aggregation, 212 Algorithm, 64, 65 Algorithm design, 64 Algorithm designers, 64 Algorithm development, 173 Algorithms, 52, 53, 63–65, 173, 174 Ammonia synthesis, 8, 9, 11, 12 Anaconda, 45 Analyze, 120, 121 Anomaly detection, 181, 182, 235, 236 Apache license, 72 API, 107 append method, 92 Apple, 30 Apple M1 CPU, 30 Applications, 28 Aptitude, 173 Architectural diversity, 55 Architectures, 55 Area under the curve (AUC), 221, 222 Artificial intelligence (AI), 4, 59, 86, 173, 234 Aspects, 121 Assembly languages, 54–56 Association rule mining, 181
Atom, 62, 63 Atomic elements, 157 Atomic number, 157 Atomic radius, 157 AtomicWeight, 98 Au, 157 Automated labeling, 127 Autonomous learning, 173
B Bar plot, 152, 155 Big data, 175, 234 Binary code, 54, 55, 59 Binary digits, 54 Binary form, 54 Binary numbers, 54, 55 Bioinformatics, 5, 19 Biology, 19 Bit, 33 Black box, 205 Blank, 135 Blank data, 135 Blank values, 135 Boolean logic, 199 Bootstrap samples, 212 Bootstrapping, 211 Bootstrapping method, 212, 213 Bugs, 52 Byte, 33
C C, 56, 58, 79, 202, 203 Cache size, 31
Calculations, 121, 122 Canonical Ltd., 43 Catalysis, 130 Catalyst, 114, 115, 122 Catalyst data, 151 Catalyst informatics, 107, 124, 125 Catalyst performance, 152 Catalyst science, 124 Catalysts, 11, 19, 78, 115–118, 121, 123, 151, 233 Catalysts data, 121, 138, 139 Catalysts informatics, 19, 20, 37, 114, 115, 126, 138, 139 Catalytic process, 151, 152 C2 compounds, 151 CentOS, 43 Central processing units (CPUs), 28–31, 33, 35–37, 40, 46, 175 Characterization, 14, 18, 122, 123 CH4 conversion, 152, 159, 160, 162, 163 Chemical, 115 Chemical research, 114 Chemistry, 19 Chemoinformatics, 5, 19 Child node, 208 CH4 pressure, 163 Circles, 200 Cl, 155, 157 Class 0, 199 Class 1, 199 Classification, 178, 179, 192, 200, 206, 210–212, 221 Classification counterparts, 179 Classification models, 180 Cleaning, 115 Cleansing, 131 Cleansing methods, 133 Clock speed, 31 Closed code, 66 Closed-source, 66 Clustering, 181, 229 Clusters, 180, 182 Code, 53, 73, 74 Coding, 53, 58, 59 Coding practices, 173 Coefficient of determination, 221 Cognition, 173 Collaborate, 122 Collection, 130
Collective trends, 157 Color, 148, 149 Color bar, 152 Communication, 122 Communication overhead, 38 Compact disks (CDs), 34 Compilation, 57, 58 Compiled languages, 57, 58 Compilers, 57, 58 Completeness, 131, 132 Complex calculations, 88 Comprehensive view, 157 Computation, 13, 14 Computation science, 5 Computational material science, 17, 18 Computational methods, 8, 124 Computational modeling, 107 Computational science, 11, 53 Computational services, 122 Computational simulations, 21 Computational studies, 21 Computational techniques, 22, 122 Computer cases, 28, 36 Computer hardware, 37, 46, 47 Computer programs, 53 Computer science, 53 Computer system, 46 Computing environment, 45 concat(), 104 Concatenation, 104, 106 Conditions, 121 Conducting experiments, 121, 122 Consistency, 131, 132 Contractual agreements, 122 Contributions, 157 Control, 121 Conventional experimental, 122 Conventional method, 121 Conversion, 130 Cooling mechanisms, 36 Cooling system, 36 Correction, 129 Corruption, 128 Costs, 122 CPL, 56, 57 CPU cores, 38 Creative Commons license, 72 Credentials, 122 Cross-validation, 187, 192, 220
C2 selectivity, 152, 159, 163 Cumulative contribution, 157 Cumulative representation, 156 Curating, 174 Curse of dimensionality, 230 Customer segmentation, 181, 182 C2 yield, 152, 158, 164
D Data, 2, 3, 5, 8, 19, 118–121, 133, 138, 139 Data acquisition, 115, 117, 118 Data aggregation, 127 Data analysis, 8, 26, 45, 81, 107, 139, 180, 182, 234 Data augmentation, 127 Data centers, 119 Data cleaning, 115, 116, 127 Data cleansing, 126, 127, 132, 133, 138, 139 Data collection, 118, 119, 128, 187, 188 Data collection phase, 121 Data collection process, 121 Data consistency, 138, 139 Data dependencies, 38 Data distribution, 155, 156 Data duplication, 134 Data exploration, 234, 235 Data formatting, 127 Data fragmentation, 128 Data generation, 130, 138, 139 Data generation process, 121 Data inaccuracies, 128 Data inconsistencies, 138 Data integration, 115, 116 Data labeling, 127, 129, 130 Data literacy, 130 Data mining, 4 Data partitioning, 127 Data points, 121, 136 Data preparation, 127 Data preprocessing, 12, 53, 81, 115, 116, 126–131, 138, 139, 187, 196, 239 Data preprocessing techniques, 173 Data processing, 86 Data processing algorithms, 121 Data quality, 128, 139 Data recording, 129 Data repositories, 130 Data representation, 239
Data requirements, 120 Data science, 2–5, 7, 8, 12–22, 42–45, 53, 58, 65, 78, 80, 81, 86, 107, 114–116, 124, 125, 181 Data science research, 5 Data science techniques, 126 Data science tools, 45, 46 Data scientists, 7, 42, 43, 46, 118, 132, 133, 138 Data sets, 120, 121 Data sharing, 128 Data sources, 119 Data standards, 130 Data structures, 64, 234 Data transformation, 115, 116, 129 Data validation, 118 Data vendors, 120 Data visualization, 53, 115, 126, 127, 133, 138, 139, 144, 168, 169, 181, 196, 229 Data visualization techniques, 121, 152 Data-driven, 18, 114 Data-driven analysis, 122, 125 Data-driven approaches, 17 Data-driven insights, 129 data.csv, 101 Datasets, 120, 134, 137–139, 174 datetime.date.today(), 83 Debian Linux, 43 Debug, 89 Decision boundary, 199, 200 Decision tree classifier, 206 Decision trees, 192, 206–213, 223 Deep learning, 35, 216, 217, 235 Deliverables, 122 Dendrograms, 233 Density-based clustering, 182 Density functional theory, 11, 53 Density functional theory calculations, 18 Dependencies, 89 Derive insightful conclusions, 173 describe(), 103 Descriptors, 180 Descriptor variables, 179, 199, 213, 214, 220, 222, 230 Design, 121 Designing, 121 Desktop environment, 43, 44 Desktop experience, 44 Desktop PCs, 36 Desktops, 41, 42, 44
Desktop version, 44 df4, 101 Digital video disks (DVDs), 34 Dimensionality reduction, 115, 129, 181, 222 Dimensional reduction, 230 Dimensional reduction methods, 230 Dimension reduction, 229 Directed, 238 Directed edges, 238 Directionality, 238 Discern patterns, 173 Discoverability, 129 Discovery, 234 Disparities, 156 Distribution patterns, 155, 156 Dividing, 84 Division, 84 Dockers, 40, 42 Domain expertise, 235 Domain experts, 235 Domain knowledge, 220 drop_duplicates, 134, 135 dropna(), 135, 136 Duplicate entries, 115, 139 Duplication, 131, 132
E Economics, 180 Edge weights, 238 Edges, 237 Electron microscopy, 18 Electronegativity, 155, 157 Electronic records, 121 Element, 97 elif, 91 else, 90, 91 Emacs, 62, 63, 67 Embedding, 134 Engineering, 180 Ensemble learning, 214, 223 Environment, 185, 186 Error correction algorithms, 126 Error detection, 129 Errors, 52 Ethane, 151 Ethylene, 151 Event-driven programming, 59 Execution, 53
Execution flow, 89 Experiment, 5, 13 Experimental, 9, 21, 22, 122 Experimental design, 217, 219 Experimental efforts, 219 Experimental investigations, 21, 127 Experimental methods, 12 Experimental science, 11 Experimental setups, 121 Experimental studies, 127 Experimentation, 10, 13 Expertise, 122 Exploratory analysis, 234 Exploratory data analysis, 156, 181, 220 Exponentiation, 84
F Failure, 174 False, 199 Fans, 36 Feature engineering, 115, 181 Feature extraction, 127, 129, 234, 235 Feature importance, 211 Feature selection, 116, 187, 222 Fees, 122 File size, 54 Finance, 173, 180, 183 Financial arrangements, 122 Financial implications, 122 Finite difference method, 53 Fire, 184–186 First principles calculations, 11, 37 fit(), 193, 233 Floppy disk, 34 For-loops, 90–93 Forest, 210 Fourth science, 5 Frameworks, 45, 46, 52 Frequency, 154 F1 score, 221, 222 Function declarations, 80 Fusion, 116
G Gaussian process regression, 192 Gaussian Process Regressor (GPR), 218, 219 Generalizable, 201
Generalization, 201 Generalization capability, 200 Generalization performance, 198 Generate, 121 Generate effective solutions, 173 Generative modeling, 235, 236 Genetics, 182 Gigabyte, 33 Gini coefficient, 207–209 GitHub, 73 GNU Project, 66, 67 Graph data, 237, 238, 242 Graphical user interfaces (GUI), 40, 59 Graphic card, 28, 35 Graphic processing units (GPUs), 35–37, 175 Graph representation, 237 Graphs, 242 Graph theory, 236, 242 Grid search, 222 Groups, 182 Guidelines, 130
H Hard disk drives (HDDs), 31, 35 Hardware, 26, 28, 46, 54–57 Hardware advancements, 46 Hardware architecture, 46, 55, 56 Hardware components, 46 Hardware configurations, 46, 55 Hardware platforms, 55 Healthcare, 173, 180 Heatsinks, 36 Hidden layers, 215, 216 Hierarchical clustering, 182, 233 High-level programming languages, 56 High throughput, 122–125 High throughput approaches, 125 High throughput calculations, 122, 124, 125 High throughput code, 125 High throughput experimental devices, 125 High throughput experimentation, 125 High-throughput experiments, 122 High throughput methodologies, 123–125 High throughput processes, 125 Histogram, 153, 155 Human-centered approach, 174
Human developers, 174 Humans, 173 Hyperparameters, 194, 198, 210 Hyperparameter tuning, 222 Hyperplane, 200
I If-statements, 80, 90, 91, 93 Image analysis, 182 Image processing, 183 Import, 82 Inconsistencies, 220 Index numbers, 130 Individual contributions, 157 Informatics, 5, 26, 45, 46, 53, 65, 116, 149 Informatics endeavors, 47 Informatics scientists, 46, 47 Informatics workflows, 46 Information science, 7 Input/output (I/O), 46 Instruments, 121 Integration, 115 Integrity, 122 Intel, 30 Intelligent outcomes, 173 Interdisciplinary understanding, 130 Interoperability, 129 Interpretability, 129, 235 Intricate conditional statements, 88 Inverse problems, 175 Iris classification, 149 Iris data, 146–148 Iris dataset, 147, 148 Iris flowers, 145, 150 Iris setosa, 145, 149, 150 Iris species, 149 Iris types, 147–149 Iris versicolor, 145, 149 Iris virginica, 145, 149 isna(), 135 Iterative and incremental approach, 174 Iterative process, 173
J Java, 56 JavaScript, 60 Jupyter Notebook, 42, 63, 99
K Kernel density estimation, 158, 163 Kernel probability distribution, 163, 164 Kernel trick, 201 K-fold cross-validation, 220 Kilobyte, 33 KMeans(), 233 K-means, 182 K-means clustering, 229, 231–233
L Label encoding, 134 Labeling, 129 Laboratory notebooks, 121 Large margin, 201 Latent groups, 182 Leaf nodes, 208 Li, 155, 157 Libraries, 45, 52 Linear, 200 Linear kernel, 203 LinearRegression(), 193 Linear regression, 192–196, 198 Linear relationship, 195 LinearSVC, 202 Linear SVC, 202 linspace(), 145 Linux, 39, 41–45, 67 Linux distributions, 44, 45, 73 Linux ecosystem, 45 Linux environment, 45 Linux Mint, 44, 81 Linux operating systems, 44, 45 Linux OS, 39, 44 Linux servers, 43 Linux-specific tools, 45 Liquid coolant, 36 Liquid cooling, 36 Liquid cooling systems, 36 LISP, 56, 57 List, 98 Literature, 117 Local linear embedding (LLE), 229 Logical operators, 86 Logical structure, 89 Logistic function, 199 Logistic regression, 192, 199, 215
logisticregression(), 200 Loongson CPU, 30 Loongson Technology, 30 Loss function, 198 Low throughput, 122, 123
M Mac, 41, 42 Machine-driven, 174 Machine learning, 2, 4–8, 14, 16, 17, 19, 35, 45, 53, 59, 78, 81, 86, 115, 124, 127, 129–131, 133, 138, 139, 173, 174, 178–180, 184–188, 199, 206, 210, 211, 216, 217, 219–221, 234–236 Machine learning algorithms, 21, 122, 134 Machine learning models, 127 Machine learning model training, 127 Machine learning tasks, 234 Machine processing, 128 Machines, 173 Mac OS, 39, 41 Magnetic moments, 153 Magnetic properties, 153 Main function, 80 Maintainability, 89 Majority voting, 212 Manual labeling, 127 Margins, 201 Marketing, 181 Massive datasets, 78, 175 Material data, 5 Material informaticist, 7 Material informatics, 5 Material science, 10 Material science research, 12 Materials, 2–5, 12, 78, 107, 114–118, 121–126, 138, 139, 233 Materials and catalysts, 119 Materials data, 4–8 Materials datasets, 4 Materials design, 22 Materials informaticists, 7 Materials informatics, 4–8, 14, 15, 17–21, 114, 115 Materials research, 5, 21, 22 Materials science, 2–6, 8, 11, 15–18, 20–22, 107 Materials science research, 5 Materials scientists, 3, 4
Material synthesis methods, 130 Mathematics, 19 Matplotlib, 61, 95, 144–147, 156–158, 161, 167, 168 max(), 101 mean(), 102, 136 Mean squared error (MSE), 221 Measurement informatics, 18 Measurements, 121 Megabyte, 33 Memory, 28, 31 Memory modules, 46 Merge, 137 merge(), 106, 107 Merging, 106, 240 Message passing interface (MPI), 37 Metadata, 239 Metadata annotation, 115 Metadata standards, 239 Methane, 151 Microscopy, 17–19 Microsoft Excel, 78 min(), 101 Missing data imputation, 126 Missing values, 135, 220 MIT license, 72 Mixing matrix, 221 MLPClassifier, 217 Mn, 153, 157 Model complexity, 202 Model development, 234 Modeling, 107 Models, 121 Modularity, 94 Modules, 93, 94 Modulus, 84 Molecular dynamics simulations, 18, 53 Molecular formula, 129 Molecules, 129 Morse code, 54 Motherboards, 28, 36, 37 Multidimensional data, 182 Multidimensional spaces, 175 Multidimensional visualization approach, 152 Multiple experiments, 122 Multiple linear regression, 194 Multiplication, 84 Multiplying, 84
N Nano, 63 Network analysis, 236, 237 Neural networks, 192, 215, 216, 223 Nodes, 237 Nonlinear, 200 Normalization, 115, 116, 127, 129 Number of cores, 31 NumPy, 61, 94, 95, 145–147, 155 NVRAM, 32
O Objective variables, 179–182, 199, 219 Observations, 121 OCM reaction, 151 OCM reaction data, 152 One hot encoding, 133, 134 Ontologies, 236, 238–242 Open collaboration, 67 Open data centers, 119 Open-source, 65–67, 72 Open Source Initiative, 67 Open-source licensing, 73 Operating system, 28, 40–46 Operators, 87 Optical microscopy, 18 Optimization algorithms, 17 Optimize, 89 Organization, 89 Orthogonal arrays, 217 Outlier detection, 116, 126 Outliers, 115, 136, 137, 196, 220 Outsourcing, 122 Overfit, 210 Overfitting, 186, 187, 197, 198, 201, 210, 211, 222 Oxidative coupling of methane (OCM), 151, 162
P pairplot(), 161, 162 Pandas, 60, 96, 97, 99, 100, 103, 104, 107, 132, 134, 136–139, 146, 147, 155, 158 Pandas DataFrame, 99 Parallel calculations, 37 Parallel computing, 38 Parallel coordinates, 165
Parallel execution, 37 Parallel processing, 37 Parameters, 121 Patents, 117 Patterns, 121, 156, 157, 180 Pearson correlation coefficient, 162, 163 Pearson correlation map, 162 Penalties, 184–186 Perceptron, 215 Performance, 14, 17, 18 Performance optimization, 60 Performance testing, 122, 123 Petabytes, 34 Petal lengths, 146–150 Petal width, 146, 147, 199 Pharmacy informatics, 5 Phenomena, 121 Physics, 19 Pie chart, 155, 156 Planning, 52, 53 plot(), 145–148 Polynomial, 202 Polynomial kernel, 203 Polynomial linear regression, 197, 198 Polynomial regression, 195 Portability, 55 Powers, 84 Power units, 28, 36, 37 Precision, 221 Predictive modeling, 17 Predictive models, 122 Preprocessing, 115, 128, 129, 139, 174, 187, 188, 220 Preprocessing pipeline, 116 Preprocessing workflow, 127 Pressure, 206, 207 Principal component analysis (PCA), 229–231 Print, 95 Print(a), 90 Print command, 54 Problem domain, 64 Problem-solving, 174, 176 Processing, 14, 15, 18 Program, 64 Programmers, 64, 65 Programming, 26, 52, 53, 57, 59, 64, 65, 73, 74, 87 Programming best practices, 89 Programming languages, 46, 52–57, 63, 64
Programming paradigms, 64 Properties, 14, 16, 18, 81, 121 Proportions, 155, 156 Prowess, 175 Proximity, 182 Pruning, 208 Pt, 153 Purchase data sets, 120 Purchasing behavior, 181 Purchasing data sets, 119 PyCharm, 62, 63 Pylab, 145 Python, 56, 58–61, 65, 79, 80, 84, 85, 107, 202 Python pandas, 135–137 Python programming language, 54 PyTorch, 217
Q Quality, 120, 128 Quality control measures, 122 Queries, 240 Querying, 240
R Radial Basis Function (RBF), 202, 203, 205, 206, 218 Radial Basis Function (RBF) kernel, 203–205 Radiator, 37 Radviz, 165–167 Random, 94 Random access memory (RAM), 31, 32, 34, 36–38, 40, 175 RandomForestClassifier, 213 RandomForestRegressor, 213 Random forests (RF), 192, 207, 210–215, 223 Random search, 222 Random variables, 155 Reaction networks, 237, 238 Readability, 89 Real-world, 200 Recall, 221 Receiver operating characteristic, 222 Reclassification, 241 Recursive partitioning, 207 Regression, 178, 179, 200, 206, 210–212, 221 Regression analysis, 209, 221 Regression model, 179
Regularization, 187, 198 Reinforcement learning, 176, 184–187 Relationships, 157, 180 Relative magnitudes, 155 Relative proportions, 155, 157 Reliability, 120, 121 Reliable, 122 Remainders, 84 Representation learning, 234–236 Reproducibility, 121 Reputable, 122 Reputation, 122 Research, 18 Research objectives, 121 Resources, 122 Results, 121 Reusability, 55 Review, 121 Rewards, 184–186 Ridge regression, 192, 196 Rigorous documentation, 121 Robust, 201 Root-mean-square error (RMSE), 194, 197 Root node, 208 Rule-based patterns, 175 Running, 52
S Samples, 121 Scalability, 234, 235 Scanning probe microscopy, 18 scatter(), 148 Scatter plot, 148–152 Scikit-image, 96 Scikit-learn, 60, 95, 193, 202, 217, 231–233 SciPy, 61, 95 Scope of work, 122 Scripted languages, 58 Script languages, 57, 58 Seaborn, 61, 95, 144, 158–161, 163 Selectivity, 130 Semi-supervised, 187 Semi-supervised learning, 184 Semi-supervised machine learning, 176, 183, 184 Sensors, 121 Sepal length, 146, 147 Sepal width, 146, 147
Separable, 200 Sequential execution model, 89 Sequential nature, 89 Server, 41–43 Server environments, 36 Server operating systems, 43 Servers, 41–43 show(), 145 Sigmoid, 203 Sigmoid kernel, 203 Significance, 121 Similarity, 182 Simulations, 121 Small data, 217 SMILES notation, 129 Sn, 153, 154 Soft margin, 201 Software, 26, 28, 30, 46, 47, 54, 55 Software development, 59 Software libraries, 46 Software systems, 46 Software techniques, 46 Software tools, 46 Solid state drives (SSDs), 31, 35 Spectroscopy, 19 Stacked area plot, 156 Stacked areas, 157 Stack plot, 156, 157 Standard deviation, 219 Standardization, 115, 129, 239, 241, 242 Standardized data recording, 129 Standardized rules, 130 Standard score, 136 Statistical analysis, 21 Statistical methods, 121 Statistics, 5 std(), 102, 136 Storage devices, 36, 38, 46 Storage unit, 28 String data, 133, 134 Structure-performance relationships, 122 Structures, 14, 18, 182 Studies, 120 Subgroups, 182 Sublime Text, 62, 63 subplot(), 157, 161 Subtracting, 84 Subtraction, 84
Success, 174 Supercomputers, 37 Supercomputing, 37 Supervised, 183, 184, 187 Supervised classification, 184 Supervised counterpart, 180 Supervised learning, 184, 235, 236 Supervised learning algorithms, 127 Supervised machine learning, 176–178, 180, 184, 192, 200, 223, 229 Supervised regression, 179 Supervised regression models, 179, 180 Support vector machines (SVMs), 192, 199–206, 210, 214, 215, 223 Support Vector Regression (SVR), 219 SWIFT, 60 Synchronization requirements, 38 Syntax, 57, 58 Synthesis, 14, 122 Systematic procedure, 121 System responsiveness, 60
T Ta, 154 Tailoring, 121 Target variables, 180, 209 Temperature, 161, 164, 199, 206, 207 Temperature-related challenges, 36 TensorFlow, 95, 217 Terabytes, 35 Text data, 133 Text editors, 63 Theoretical, 21 Theoretical modeling, 21 Theoretical science, 11 Theory, 5, 10, 13 Third-party entities, 122 Three-dimensional representation, 151 Ti, 155, 157 TK GUI toolkit, 59 Top-to-bottom execution, 90 Track record, 122 Transformation, 115 Transportation, 173 Trends, 121, 156, 157 Trial and error, 184–186 Triangles, 200 True, 199
type(), 82, 83 Types, 81, 82, 146
U Ubuntu, 43, 44 Ubuntu-based, 44 Ubuntu Software Repository, 43 Underfitting, 187 Undirected, 238 Uniformity, 131 Unique labels, 115 Unlabeled data, 115, 234–236 Unseen data, 198 Unsupervised, 184, 187, 235 Unsupervised learning, 182–184, 234, 235 Unsupervised learning algorithms, 234 Unsupervised learning methods, 234 Unsupervised machine learning, 176, 180–184, 229, 230, 234–236 Usability, 128 USB drives, 35 Utilization, 89
V Validity, 131 Variable-centric approach, 89 Variable definition, 89 Variables, 89, 121 vi, 63 Violin plot, 163 Virtual machines, 40, 45 VirtualBox, 45 Visualization, 5, 107, 132, 133, 220 Visualization methods, 21 Visualization techniques, 149, 169 Visual Studio Code, 62, 63 VMWare Workstation, 45 Voting, 192 Voting machine learning, 214, 215
W Water, 184–186 Web Ontology Language, 240 WebPlotDigitizer, 118 Weights, 237 While-loop, 92, 93
Windows, 41, 42 Windows application, 45 Windows OS, 39, 44 Workload distribution, 38 Writing, 52
X Xcode, 62, 63 X informatics, 5 X-ray analysis, 18 X-ray diffraction (XRD), 18 X-ray spectroscopy, 18
Y y=f(x), 179, 180 Yield, 130 Yum, 45
Z Zip, 98 z-scores, 136, 137