Guide to Teaching Data Science: An Interdisciplinary Approach
ISBN-10: 3031247574 · ISBN-13: 9783031247576

Data science is a new field that touches on almost every domain of our lives, and thus it is taught in a variety of environments.


English · 329 [330] pages · 2023


Table of Contents:
Prologue
Contents
List of Figures
List of Tables
List of Exercises
1 Introduction—What is This Guide About?
1.1 Introduction
1.2 Motivation for Writing This Guide
1.3 Pedagogical Principles and Guidelines for Teaching Data Science
1.4 The Structure of the Guide to Teaching Data Science
1.4.1 The Five Parts of the Guide
1.4.2 The Chapters of the Guide
1.5 How to Use This Guide?
1.5.1 Data Science Instructors in Academia
1.5.2 K-12 Teachers
1.5.3 Instructors of the Methods of Teaching Data Science (MTDS) Course
1.6 Learning Environments for Data Science
1.6.1 Textual Programming Environments for Data Science
1.6.2 Visual Programming Environments for Data Science
1.7 Conclusion
Reference
Part I Overview of Data Science and Data Science Education
2 What is Data Science?
2.1 The Interdisciplinary Development of Data Science
2.1.1 The Origins of Data Science in Statistics
2.1.2 The Origins of Data Science in Computer Science
2.1.3 The Origins of Data Science in Application Domains: The Case of Business Analytics
2.2 Data Science as a Science
2.3 Data Science as a Research Method
2.3.1 Exploratory Data Analysis
2.3.2 Machine Learning as a Research Method
2.4 Data Science as a Discipline
2.5 Data Science as a Workflow
2.6 Data Science as a Profession
2.7 Conclusion
References
3 Data Science Thinking
3.1 Introduction
3.2 Data Thinking and the Thinking Skills Associated with Its Components
3.2.1 Computational Thinking
3.2.2 Statistical Thinking
3.2.3 Mathematical Thinking
3.2.4 Application Domain Thinking
3.2.5 Data Thinking
3.3 Thinking About Data Science Thinking
3.4 Conclusion
References
4 The Birth of a New Discipline: Data Science Education
4.1 Introduction
4.2 Undergraduate Data Science Curricula Initiatives
4.2.1 Strengthening Data Science Education Through Collaboration, 2015
4.2.2 Curriculum Guidelines for Undergraduate Programs in Data Science, 2016
4.2.3 The EDISON Data Science Framework, 2017
4.2.4 Envisioning the Data Science Discipline, 2018
4.2.5 Computing Competencies for Undergraduate Data Science Curricula, 2017–2021
4.3 Data Science Curriculum for K-12
4.4 Meta-Analysis of Data Science Curricula
4.5 Conclusion
References
Part II Opportunities and Challenges of Data Science Education
5 Opportunities in Data Science Education
5.1 Introduction
5.2 Teaching STEM in a Real-World Context
5.3 Teaching STEM with Real-World Data
5.4 Bridging Gender Gaps in STEM Education
5.5 Teaching Twenty-First Century Skills
5.6 Interdisciplinary Pedagogy
5.7 Professional Development for Teachers
5.8 Conclusion
References
6 The Interdisciplinarity Challenge
6.1 Introduction
6.2 The Interdisciplinary Structure of Data Science
6.3 Is Data Science More About Computer Science or More About Statistics?
6.4 Integrating the Application Domain
6.4.1 Data Science Pedagogical Content Knowledge (PCK)
6.4.2 Developing Interdisciplinary Programs
6.4.3 Integrating the Application Domain into Courses in Computer Science, Mathematics, and Statistics
6.4.4 Mentoring Interdisciplinary Projects
6.5 Conclusion
References
7 The Variety of Data Science Learners
7.1 Introduction
7.2 Data Science for K-12 Pupils
7.3 Data Science for High School Computer Science Pupils
7.4 Data Science for Undergraduate Students
7.5 Data Science for Graduate Students
7.6 Data Science for Researchers
7.7 Data Science for Data Science Educators
7.8 Data Science for Professional Practitioners in the Industry
7.9 Data Science for Policy Makers
7.10 Data Science for Users
7.11 Data Science for the General Public
7.12 Activities on Learning Environments for Data Science
7.13 Conclusion
References
8 Data Science as a Research Method
8.1 Introduction
8.2 Data Science as a Research Method
8.2.1 Data Science Research as a Grounded Theory
8.2.2 The Application Domain Knowledge in Data Science Research
8.3 Research Skills
8.3.1 Cognitive Skills: Awareness of the Importance of Model Assessment—Explainability and Evaluation
8.3.2 Organizational Skills: Understanding the Field of the Organization
8.3.3 Technological Skills: Data Visualization
8.4 Pedagogical Challenges of Teaching Research Skills
8.5 Conclusion
References
9 The Pedagogical Chasm in Data Science Education
9.1 The Diffusion of Innovation Theory
9.2 The Crossing the Chasm Theory
9.3 The Data Science Curriculum Case Study from the Diffusion of Innovation Perspective
9.3.1 The Story of the New Program
9.3.2 The Teachers’ Perspective
9.4 The Pedagogical Chasm
9.5 Conclusion
References
Part III Teaching Professional Aspects of Data Science
10 The Data Science Workflow
10.1 Data Workflow
10.2 Data Collection
10.3 Data Preparation
10.4 Exploratory Data Analysis
10.5 Modeling
10.5.1 Data Quantity, Quality, and Coverage
10.5.2 Feature Engineering
10.6 Communication and Action
10.7 Conclusion
References
11 Professional Skills and Soft Skills in Data Science
11.1 Introduction
11.2 Professional Skills
11.2.1 Cognitive Skills: Thinking on Different Levels of Abstraction
11.2.2 Organizational Skills: Storytelling
11.2.3 Technological Skills: Programming for Data Science
11.3 Soft Skills
11.3.1 Cognitive Skills: Learning
11.3.2 Organizational Skills: Teamwork and Collaboration
11.3.3 Technological Skills: Debugging Data and Models
11.4 Teaching Notes
11.5 Conclusion
References
12 Social and Ethical Issues of Data Science
12.1 Introduction
12.2 Data Science Ethics
12.3 Methods of Teaching Social Aspects of Data Science
12.3.1 Teaching Principles
12.3.2 Kinds of Activities
12.4 Conclusion
References
Part IV Machine Learning Education
13 The Pedagogical Challenge of Machine Learning Education
13.1 Introduction
13.2 Black Box and White Box Understandings
13.3 Teaching ML to a Variety of Populations
13.3.1 Machine Learning for Data Science Majors and Allied Majors
13.3.2 Machine Learning for Non-major Students
13.3.3 Machine Learning for ML Users
13.4 Framework Remarks for ML Education
13.4.1 Statistical Thinking
13.4.2 Interdisciplinary Projects
13.4.3 The Application Domain Knowledge
13.5 Conclusion
References
14 Core Concepts of Machine Learning
14.1 Introduction
14.2 Types of Machine Learning
14.3 Machine Learning Parameters and Hyperparameters
14.4 Model Training, Testing, and Validation
14.5 Machine Learning Performance Indicators
14.6 Bias and Variance
14.7 Model Complexity
14.8 Overfitting and Underfitting
14.9 Loss Function Optimization and the Gradient Descent Algorithm
14.10 Regularization
14.11 Conclusion
References
15 Machine Learning Algorithms
15.1 Introduction
15.2 K-nearest Neighbors
15.3 Decision Trees
15.4 Perceptron
15.5 Linear Regression
15.6 Logistic Regression
15.7 Neural Networks
15.8 Conclusion
References
16 Teaching Methods for Machine Learning
16.1 Introduction
16.2 Visualization
16.3 Hands-On Tasks
16.3.1 Hands-On Task for the KNN Algorithm
16.3.2 Hands-On Task for the Perceptron Algorithm
16.3.3 Hands-On Task for the Gradient Descent Algorithm
16.3.4 Hands-On Task for Neural Networks
16.4 Programming Tasks
16.5 Project-Based Learning
16.6 Conclusion
References
Part V Frameworks for Teaching Data Science
17 Data Science for Managers and Policymakers
17.1 Introduction
17.2 Workshop for Policymakers in National Education Systems
17.2.1 Workshop Rationale and Content
17.2.2 Workshop Schedule
17.2.3 Group Work Products
17.2.4 Workshop Wrap-Up
17.3 Conclusion
References
18 Data Science Teacher Preparation: The “Methods of Teaching Data Science” Course
18.1 Introduction
18.2 The MTDS Course Environment
18.3 The MTDS Course Design
18.4 The Learning Targets and Structure of the MTDS Course
18.5 Grading Policy and Submissions
18.6 Teaching Principles of the MTDS Course
18.7 Lesson Descriptions
18.7.1 Lesson 6
18.7.2 Mid-Semester Questionnaire
18.7.3 Lesson 7
18.8 Conclusion
References
19 Data Science for Social Science and Digital Humanities Research
19.1 Introduction
19.2 Relevance of Data Science for Social Science and Digital Humanities Researchers
19.3 Data Science Bootcamps for Researchers in Social Sciences and Digital Humanities
19.3.1 Applicants and Participants of Two 2020 Bootcamps for Researchers in Social Sciences and Digital Humanities
19.3.2 The Design and Curriculum of the Data Science for Social Science and Digital Humanities Researchers Bootcamp
19.4 Data Science for Psychological Sciences
19.4.1 The Computer Science for Psychological Science Course
19.4.2 The Data Science for Psychological Science Course
19.5 Data Science for Social Sciences and Digital Humanities, from a Motivation Theory Perspective
19.5.1 The Self-determination Theory
19.5.2 Gender Perspective
19.6 Conclusion
References
20 Data Science for Research on Human Aspects of Science and Engineering
20.1 Introduction
20.2 Examples of Research Topics Related to Human Aspects of Science and Engineering that Can Use Data Science Methods
20.3 Workshop on Data Science Research on Human Aspects of Science and Engineering
20.3.1 Workshop Rationale
20.3.2 Workshop Contents
20.3.3 Target Audience
20.3.4 Workshop Framework (in Terms of Weeks)—A Proposal
20.3.5 Prerequisites
20.3.6 Workshop Requirements and Assessment
20.3.7 Workshop Schedule and Detailed Contents
20.3.8 Literature (For the Workshop)
20.4 Conclusion
Epilogue
Index


Orit Hazzan · Koby Mike

Guide to Teaching Data Science An Interdisciplinary Approach


Orit Hazzan Department of Education in Science and Technology Technion—Israel Institute of Technology Haifa, Israel

Koby Mike Department of Education in Science and Technology Technion—Israel Institute of Technology Haifa, Israel

ISBN 978-3-031-24757-6    ISBN 978-3-031-24758-3 (eBook)
https://doi.org/10.1007/978-3-031-24758-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To our families, students, and colleagues

Prologue

This Guide to Teaching Data Science can be used by all educators in all educational environments and settings: in formal education (from elementary schools through high schools to academia) and informal education, in industry, and in non-profit governmental and third sector organizations. Specifically, the guide can be used as a textbook for Methods of Teaching Data Science courses, in which prospective and in-service teachers learn the pedagogy of data science, which is currently emerging in parallel to the development of the discipline of data science. The guide can also serve other practitioners who are curious about data science, its main characteristics, the challenges it poses, and the opportunities it offers to a variety of populations.

To benefit all of its potential user populations, the guide is organized in a way that enables immediate application of its main ideas. This goal is achieved by presenting the rationale behind the inclusion of each topic presented in this guide, its background, development, and importance in the context of data science and data science education, as well as the details of the actual teaching process (including over 200 exercises, worksheets, topics for discussion, and more).

The writing of this guide is based on our five years of experience teaching and conducting research on data science education in a variety of frameworks (2018–2022 inclusive). Specifically, we have taught courses and facilitated workshops on data science and on data science education in different formats (from 2-h active-learning workshops to full, year-long academic courses) to a variety of populations: high school pupils, undergraduate and graduate students, practitioners in different sectors, researchers in a variety of domains, and pre-service and in-service data science teachers.
In parallel, we researched a variety of data science education topics, such as teaching methods, learning processes, teacher preparation, and social and organizational aspects of data science education. We are also involved in various data science education initiatives and participate in national initiatives and policymaking committees. This guide enables us to share with the professional community of data science educators the professional knowledge that we have accumulated over the years. In addition, supplementary pedagogical material is available on our website at https://orithazzan.net.technion.ac.il/data-science-education/.

We would like to thank all those who have contributed to our understanding of the nature of data science education and who have fostered the interdisciplinary approach to data science presented in this guide: these include all of the students in the various courses we have taught and many prospective and in-service high school computer science and data science teachers, as well as colleagues, researchers, and instructors who have collaborated with us throughout the years in a variety of teaching, research, and development initiatives. Over the past five years, they have all shared with us their knowledge, professional experience, thoughts, and attitudes with respect to data science education. We have learned from them all. In addition, we thank Tech.AI—Technion Artificial Intelligence Hub and The Bernard M. Gordon Center for Systems Engineering, also at the Technion, for their generous support of our research. Special thanks go to Ms. Susan Spira for her highly professional editing of many of our publications (including this guide).

Haifa, Israel
November 2022

Orit Hazzan · Koby Mike

Contents

1

Introduction—What is This Guide About? . . . . . . . . . . . . . . . . . . . . . . 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Motivation for Writing This Guide . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Pedagogical Principles and Guidelines for Teaching Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 The Structure of the Guide to Teaching Data Science . . . . . . . . . 1.4.1 The Five Parts of the Guide . . . . . . . . . . . . . . . . . . . . . . . 1.4.2 The Chapters of the Guide . . . . . . . . . . . . . . . . . . . . . . . . 1.5 How to Use This Guide? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.1 Data Science Instructors in Academia . . . . . . . . . . . . . . . 1.5.2 K-12 Teachers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.3 Instructors of the Methods of Teaching Data Science (MTDS) Course . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6 Learning Environments for Data Science . . . . . . . . . . . . . . . . . . . 1.6.1 Textual Programing Environments for Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6.2 Visual Programing Environments for Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Part I 2

1 1 3 3 4 5 6 11 12 12 12 13 13 14 14 15

Overview of Data Science and Data Science Education

What is Data Science? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 The Interdisciplinary Development of Data Science . . . . . . . . . . 2.1.1 The Origins of Data Science in Statistics . . . . . . . . . . . . 2.1.2 The Origins of Data Science in Computer Science . . . . 2.1.3 The Origins of Data Science in Application Domains: The Case of Business Analytics . . . . . . . . . . . 2.2 Data Science as a Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Data Science as a Research Method . . . . . . . . . . . . . . . . . . . . . . . .

19 19 20 21 22 22 24

ix

x

3

4

Contents

2.3.1 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2 Machine Learning as a Research Method . . . . . . . . . . . . 2.4 Data Science as a Discipline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5 Data Science as a Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6 Data Science as a Profession . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24 24 26 28 30 32 33

Data Science Thinking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Data Thinking and the Thinking Skills Associated with Its Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 Computational Thinking . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.2 Statistical Thinking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.3 Mathematical Thinking . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.4 Application Domain Thinking . . . . . . . . . . . . . . . . . . . . . 3.2.5 Data Thinking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Thinking About Data Science Thinking . . . . . . . . . . . . . . . . . . . . 3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

35 35

The Birth of a New Discipline: Data Science Education . . . . . . . . . . . 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Undergraduate Data Science Curricula Initiatives . . . . . . . . . . . . 4.2.1 Strengthening Data Science Education Through Collaboration, 2015 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.2 Curriculum Guidelines for Undergraduate Programs in Data Science, 2016 . . . . . . . . . . . . . . . . . . . 4.2.3 The EDISON Data Science Framework, 2017 . . . . . . . . 4.2.4 Envisioning the Data Science Discipline, 2018 . . . . . . . 4.2.5 Computing Competencies for Undergraduate Data Science Curricula, 2017–2021 . . . . . . . . . . . . . . . . 4.3 Data Science Curriculum for K-12 . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Meta-Analysis of Data Science Curricula . . . . . . . . . . . . . . . . . . . 4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

59 59 61

Part II 5

36 37 39 40 44 52 53 55 55

62 63 64 65 67 68 69 71 71

Opportunities and Challenges of Data Science Education

Opportunities in Data Science Education . . . . . . . . . . . . . . . . . . . . . . . . 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Teaching STEM in a Real-World Context . . . . . . . . . . . . . . . . . . . 5.3 Teaching STEM with Real-World Data . . . . . . . . . . . . . . . . . . . . . 5.4 Bridging Gender Gaps in STEM Education . . . . . . . . . . . . . . . . . 5.5 Teaching Twenty-First Century Skills . . . . . . . . . . . . . . . . . . . . . . 5.6 Interdisciplinary Pedagogy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

75 75 76 77 78 79 80

Contents

xi

5.7 Professional Development for Teachers . . . . . . . . . . . . . . . . . . . . . 5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

81 82 82

The Interdisciplinarity Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 The Interdisciplinary Structure of Data Science . . . . . . . . . . . . . . 6.3 Is Data Science More About Computer Science or More About Statistics? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4 Integrating the Application Domain . . . . . . . . . . . . . . . . . . . . . . . . 6.4.1 Data Science Pedagogical Content Knowledge (PCK) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4.2 Developing Interdisciplinary Programs . . . . . . . . . . . . . . 6.4.3 Integrating the Application Domain into Courses in Computer Science, Mathematics, and Statistics . . . . 6.4.4 Mentoring Interdisciplinary Projects . . . . . . . . . . . . . . . . 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

85 85 86

7

The Variety of Data Science Learners . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2 Data Science for K-12 Pupils . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3 Data Science for High School Computer Science Pupils . . . . . . 7.4 Data Science for Undergraduate Students . . . . . . . . . . . . . . . . . . . 7.5 Data Science for Graduate Students . . . . . . . . . . . . . . . . . . . . . . . . 7.6 Data Science for Researchers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.7 Data Science for Data Science Educators . . . . . . . . . . . . . . . . . . . 7.8 Data Science for Professional Practitioners in the Industry . . . . 7.9 Data Science for Policy Makers . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.10 Data Science for Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.11 Data Science for the General Public . . . . . . . . . . . . . . . . . . . . . . . . 7.12 Activities on Learning Environments for Data Science . . . . . . . . 7.13 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

101 101 103 105 108 109 111 111 112 114 115 116 117 119 119

8

Data Science as a Research Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2 Data Science as a Research Method . . . . . . . . . . . . . . . . . . . . . . . . 8.2.1 Data Science Research as a Grounded Theory . . . . . . . . 8.2.2 The Application Domain Knowledge in Data Science Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3 Research Skills . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3.1 Cognitive Skills: Awareness of the Importance of Model Assessment—Explainability and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

121 121 122 124

6

88 91 91 92 93 94 97 98

125 126

127

xii

Contents

8.3.2

Organizational Skills: Understanding the Field of the Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3.3 Technological Skills: Data Visualization . . . . . . . . . . . . 8.4 Pedagogical Challenges of Teaching Research Skills . . . . . . . . . 8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

The Pedagogical Chasm in Data Science Education . . . . . . . . . . . . . . . 9.1 The Diffusion of Innovation Theory . . . . . . . . . . . . . . . . . . . . . . . . 9.2 The Crossing the Chasm Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.3 The Data Science Curriculum Case Study from the Diffusion of Innovation Perspective . . . . . . . . . . . . . . . . 9.3.1 The Story of the New Program . . . . . . . . . . . . . . . . . . . . . 9.3.2 The Teachers’ Perspective . . . . . . . . . . . . . . . . . . . . . . . . 9.4 The Pedagogical Chasm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

128 131 133 134 135 137 137 140 141 141 143 145 147 147

Part III Teaching Professional Aspects of Data Science 10 The Data Science Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.1 Data Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.3 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.4 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.5 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.5.1 Data Quantity, Quality, and Coverage . . . . . . . . . . . . . . . 10.5.2 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6 Communication and Action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

151 151 154 156 157 160 160 161 161 162 163

11 Professional Skills and Soft Skills in Data Science . . . . . . . . . . . . . . . . 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Professional Skills . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2.1 Cognitive Skills: Thinking on Different Levels of Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2.2 Organizational Skills: Storytelling . . . . . . . . . . . . . . . . . . 11.2.3 Technological Skills: Programming for Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Soft Skills . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3.1 Cognitive Skills: Learning . . . . . . . . . . . . . . . . . . . . . . . . 11.3.2 Organizational Skills: Teamwork and Collaboration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3.3 Technological Skills: Debugging Data and Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

165 165 167 167 169 171 172 172 173 174

Contents

xiii

11.4 Teaching Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 11.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 12 Social and Ethical Issues of Data Science . . . . . . . . . . . . . . . . . . . . . . . . 12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2 Data Science Ethics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3 Methods of Teaching Social Aspects of Data Science . . . . . . . . . 12.3.1 Teaching Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3.2 Kinds of Activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

179 179 180 185 186 188 194 194

Part IV Machine Learning Education

13 The Pedagogical Challenge of Machine Learning Education . . . 199
13.1 Introduction . . . 199
13.2 Black Box and White Box Understandings . . . 200
13.3 Teaching ML to a Variety of Populations . . . 201
13.3.1 Machine Learning for Data Science Majors and Allied Majors . . . 202
13.3.2 Machine Learning for Non-major Students . . . 202
13.3.3 Machine Learning for ML Users . . . 203
13.4 Framework Remarks for ML Education . . . 204
13.4.1 Statistical Thinking . . . 204
13.4.2 Interdisciplinary Projects . . . 205
13.4.3 The Application Domain Knowledge . . . 206
13.5 Conclusion . . . 207
References . . . 207

14 Core Concepts of Machine Learning . . . 209
14.1 Introduction . . . 209
14.2 Types of Machine Learning . . . 210
14.3 Machine Learning Parameters and Hyperparameters . . . 211
14.4 Model Training, Testing, and Validation . . . 211
14.5 Machine Learning Performance Indicators . . . 214
14.6 Bias and Variance . . . 217
14.7 Model Complexity . . . 218
14.8 Overfitting and Underfitting . . . 219
14.9 Loss Function Optimization and the Gradient Descent Algorithm . . . 222
14.10 Regularization . . . 222
14.11 Conclusion . . . 223
References . . . 223


15 Machine Learning Algorithms . . . 225
15.1 Introduction . . . 225
15.2 K-nearest Neighbors . . . 226
15.3 Decision Trees . . . 229
15.4 Perceptron . . . 230
15.5 Linear Regression . . . 232
15.6 Logistic Regression . . . 232
15.7 Neural Networks . . . 233
15.8 Conclusion . . . 234
References . . . 234

16 Teaching Methods for Machine Learning . . . 235
16.1 Introduction . . . 235
16.2 Visualization . . . 236
16.3 Hands-On Tasks . . . 237
16.3.1 Hands-On Task for the KNN Algorithm . . . 238
16.3.2 Hands-On Task for the Perceptron Algorithm . . . 240
16.3.3 Hands-On Task for the Gradient Descent Algorithm . . . 243
16.3.4 Hands-On Task for Neural Networks . . . 246
16.4 Programming Tasks . . . 246
16.5 Project-Based Learning . . . 248
16.6 Conclusion . . . 248
References . . . 249

Part V Frameworks for Teaching Data Science

17 Data Science for Managers and Policymakers . . . 253
17.1 Introduction . . . 253
17.2 Workshop for Policymakers in National Education Systems . . . 255
17.2.1 Workshop Rationale and Content . . . 256
17.2.2 Workshop Schedule . . . 258
17.2.3 Group Work Products . . . 258
17.2.4 Workshop Wrap-Up . . . 261
17.3 Conclusion . . . 262
References . . . 262

18 Data Science Teacher Preparation: The “Method for Teaching Data Science” Course . . . 265
18.1 Introduction . . . 265
18.2 The MTDS Course Environment . . . 267
18.3 The MTDS Course Design . . . 267
18.4 The Learning Targets and Structure of the MTDS Course . . . 268
18.5 Grading Policy and Submissions . . . 269
18.6 Teaching Principles of the MTDS Course . . . 271


18.7 Lesson Descriptions . . . 274
18.7.1 Lesson 6 . . . 274
18.7.2 Mid-Semester Questionnaire . . . 277
18.7.3 Lesson 7 . . . 279
18.8 Conclusion . . . 280
References . . . 281

19 Data Science for Social Science and Digital Humanities Research . . . 283
19.1 Introduction . . . 283
19.2 Relevance of Data Science for Social Science and Digital Humanities Researchers . . . 284
19.3 Data Science Bootcamps for Researchers in Social Sciences and Digital Humanities . . . 286
19.3.1 Applicants and Participants of Two 2020 Bootcamps for Researchers in Social Sciences and Digital Humanities . . . 286
19.3.2 The Design and Curriculum of the Data Science for Social Science and Digital Humanities Researchers Bootcamp . . . 290
19.4 Data Science for Psychological Sciences . . . 292
19.4.1 The Computer Science for Psychological Science Course . . . 293
19.4.2 The Data Science for Psychological Science Course . . . 295
19.5 Data Science for Social Sciences and Digital Humanities, from a Motivation Theory Perspective . . . 296
19.5.1 The Self-determination Theory . . . 296
19.5.2 Gender Perspective . . . 297
19.6 Conclusion . . . 299
References . . . 300

20 Data Science for Research on Human Aspects of Science and Engineering . . . 303
20.1 Introduction . . . 303
20.2 Examples of Research Topics Related to Human Aspects of Science and Engineering that Can Use Data Science Methods . . . 304
20.3 Workshop on Data Science Research on Human Aspects of Science and Engineering . . . 306
20.3.1 Workshop Rationale . . . 307
20.3.2 Workshop Contents . . . 307
20.3.3 Target Audience . . . 307
20.3.4 Workshop Framework (in Terms of Weeks)—A Proposal . . . 307
20.3.5 Prerequisites . . . 308
20.3.6 Workshop Requirements and Assessment . . . 308


20.3.7 Workshop Schedule and Detailed Contents . . . 309
20.3.8 Literature (For the Workshop) . . . 312
20.4 Conclusion . . . 313

Epilogue . . . 315
Index . . . 317

List of Figures

Fig. 2.1 Map by John Snow showing the clusters of cholera cases in the London epidemic of 1854 (Source: https://en.wikipedia.org/wiki/John_Snow, image is public domain) . . . 25
Fig. 2.2 The authors’ version of the data science Venn diagram, as inspired by Conway (2010) . . . 26
Fig. 2.3 Data science workflow (Authors’ version) . . . 28
Fig. 2.4 The data life cycle (Berman et al., 2016) . . . 29
Fig. 3.1 Data science and data thinking . . . 37
Fig. 3.2 Traffic light classification explanations: distribution of answer categories (n = 98) . . . 48
Fig. 3.3 Carcinoma classification explanations: distribution of answer categories (n = 88) . . . 48
Fig. 3.4 Traffic light question explanation types versus carcinoma classification question explanation types (n = 88) . . . 50
Fig. 5.1 Proposed range of possibilities for data integration in STEM education (Mike & Hazzan, 2022a) . . . 78
Fig. 6.1 The data science Venn diagram, as proposed by Conway (2010) . . . 86
Fig. 6.2 The data science Venn diagram, as proposed by the authors of the guide to teaching data science . . . 87
Fig. 6.3 Pedagogical content knowledge (PCK) . . . 91
Fig. 8.1 Dr. Inbal Zafir-Lavie’s slide on the knowledge required from life science graduates and from data science graduates (presented here with permission) . . . 129
Fig. 9.1 Diffusion of innovation timeline (Rogers, 1962) . . . 139
Fig. 10.1 The CRISP-DM workflow (based on Shearer, 2000) . . . 152
Fig. 10.2 The agile data science workflow (based on Pfister et al., 2015) . . . 153
Fig. 14.1 A general scheme of the ML model generation, testing, and prediction phases . . . 212

Fig. 14.2 Presentation of the lion detection question using a confusion matrix . . . 216
Fig. 14.3 Bias and variance in dart throwing (based on Domingos, 2012) . . . 218
Fig. 14.4 Model complexity in a classification problem . . . 219
Fig. 14.5 Model complexity in a regression problem . . . 220
Fig. 14.6 Overfitting and underfitting—irrigation system analogy . . . 221
Fig. 15.1 The classification problem of Iris virginica versus Iris versicolor . . . 227
Fig. 15.2 KNN hyperparameter tuning: accuracy versus K . . . 228
Fig. 15.3 The underfitting and overfitting phenomena in KNN . . . 228
Fig. 15.4 Decision tree model for the classification of Iris flowers . . . 229
Fig. 19.1 Bootcamp applicants’ research disciplines by gender . . . 287
Fig. 19.2 Bootcamp applicants’ academic rank by gender . . . 288
Fig. 19.3 Participants’ pre-bootcamp knowledge in programming, statistics and machine learning (on a 1–5 scale, n = 38) . . . 289
Fig. 19.4 Participants’ interest in the different bootcamp topics (on a 1–5 scale, n = 38) . . . 289
Fig. 19.5 Applicants’ computer science and statistics knowledge vs. the domain knowledge . . . 290

List of Tables

Table 1.1 The five parts of the guide to teaching data science . . . 5
Table 3.1 The traffic light classification question . . . 45
Table 3.2 The carcinoma classification question . . . 45
Table 3.3 The context and social consideration of the experiment questions . . . 46
Table 3.4 The familiarity-driven domain neglect . . . 49
Table 4.1 Initiatives of undergraduate data science curricula . . . 61
Table 6.1 Questions about the relationships between data science, computer science, and statistics . . . 88
Table 6.2 Students’ perceptions of the mutual relationships between data science and computer science and between data science and statistics . . . 89
Table 6.3 Intervention program in biomedical signal processing project . . . 95
Table 6.4 Students’ perception of the knowledge required to meet the project goal . . . 96
Table 6.5 Students’ perception of the research project’s success factors . . . 96
Table 7.1 Data science for high school curriculum—topics and number of hours . . . 106
Table 7.2 Machine learning algorithms—the mathematical knowledge required to understand them and an alternative intuitive explanation . . . 107
Table 8.1 Mapping data science research skills . . . 122
Table 9.1 The adoption process of the 10th grade data science curriculum . . . 143
Table 9.2 The schedule of the first data science teachers’ training course (teaching methods were added to the second course) . . . 144
Table 11.1 Mapping data science skills . . . 166
Table 12.1 Analysis of data science codes of ethics . . . 182

Table 12.2 Ethical norms of different stakeholders in the different phases of the data science workflow . . . 183
Table 15.1 The KNN algorithm . . . 226
Table 15.2 Vector representation of the perceptron classification algorithm . . . 231
Table 15.3 Non-vectorized representation of the perceptron classification algorithm . . . 231
Table 15.4 The perceptron training algorithm . . . 231
Table 15.5 Logistic regression classification algorithm . . . 233
Table 15.6 Logistic loss function . . . 233
Table 15.7 Logistic regression training algorithm . . . 233
Table 16.1 KNN worksheet . . . 238
Table 16.2 Perceptron worksheet . . . 241
Table 16.3 The gradient descent algorithm worksheet . . . 244
Table 16.4 Neural networks worksheet . . . 247
Table 16.5 Worksheet—programming tasks on the distance function . . . 248
Table 17.1 Topics for teamwork in the workshop for policymakers in the education system . . . 259
Table 17.2 SWOT analysis of the integration of data science in the Israeli education system . . . 260
Table 18.1 The course assignments . . . 270
Table 18.2 List of asynchronous tasks . . . 272
Table 18.3 Course schedule . . . 275
Table 19.1 Bootcamp topics and hours . . . 291
Table 19.2 Computer science for psychological science—topics and number of hours . . . 294
Table 19.3 Data science for psychological science course—topics and number of hours . . . 295
Table 20.1 Science and engineering research: Scientific and engineering aspects, human aspects, and human-related research topics . . . 305
Table 20.2 Workshop schedule and content . . . 309

List of Exercises

Exercise 1.1 Reflection . . . 14
Exercise 2.1 Pedagogical implications of data science as a science . . . 23
Exercise 2.2 Pedagogical implications of data science as a research method . . . 26
Exercise 2.3 Pedagogical implications of data science as a discipline . . . 28
Exercise 2.4 Pedagogical implications of data science as a workflow . . . 29
Exercise 2.5 Environmental and social aspects of the data life cycle . . . 29
Exercise 2.6 Types of data scientists . . . 30
Exercise 2.7 Categories of data science skills . . . 30
Exercise 2.8 Data science as a discipline and data science as a profession . . . 31
Exercise 2.9 Pedagogical implications of data science as a profession . . . 31
Exercise 2.10 Skills of the data scientists . . . 31
Exercise 2.11 Characteristics of data scientists on LinkedIn . . . 31
Exercise 2.12 Connections between the different facets of data science . . . 32
Exercise 2.13 Connections between the different facets of data science and the main characteristics of data science . . . 32
Exercise 2.14 Pedagogical implications of the multi-faceted analysis of data science . . . 32
Exercise 2.15 Overview of the history of data science . . . 33
Exercise 3.1 Computational thinking and data science online course . . . 39
Exercise 3.2 Statistical thinking for everyone . . . 40
Exercise 3.3 Concept formulation processes, objects, and procepts . . . 42
Exercise 3.4 Data science from the perspective of the process-object duality theory . . . 42

Exercise 3.5 The process-object duality theory from the constructivist perspective . . . 42
Exercise 3.6 Reducing abstraction and thinking on different levels of abstraction . . . 43
Exercise 3.7 Describing data science concepts on different levels of abstraction . . . 43
Exercise 3.8 Comparing the performance of machine learning algorithms . . . 46
Exercise 3.9 Analysis of the familiarity-driven domain neglect . . . 49
Exercise 3.10 Developing questions that demonstrate the domain neglect . . . 50
Exercise 3.11 Outcomes of cognitive biases . . . 50
Exercise 3.12 Teaching possible effects of cognitive biases on the interpretation of machine learning models . . . 52
Exercise 3.13 Cases in which data thinking is important . . . 53
Exercise 3.14 Application domain knowledge and real-life data in data thinking . . . 54
Exercise 3.15 Data thinking and other cognitive aspects of data science . . . 54
Exercise 3.16 Additional modes of thinking required for data science . . . 54
Exercise 3.17 Analytical thinking . . . 55
Exercise 4.1 Interdisciplinary education . . . 60
Exercise 4.2 Pioneering data science programs . . . 61
Exercise 4.3 Comparison of undergraduate data science curriculum initiatives . . . 69
Exercise 4.4 Rating the interdisciplinarity of the reports . . . 69
Exercise 4.5 Analysis of the reports from additional prisms . . . 69
Exercise 4.6 Multidisciplinary, interdisciplinary, and transdisciplinary education . . . 70
Exercise 4.7 Multidisciplinary, interdisciplinary, and transdisciplinary education in schools . . . 70
Exercise 4.8 Didactic transposition in data science . . . 70
Exercise 5.1 Teaching the STEM subjects in a real-world context . . . 76
Exercise 5.2 Real data in mathematics and computer science courses . . . 78
Exercise 5.3 Gender gaps in STEM subjects . . . 79
Exercise 5.4 Twenty-first century skills in data science education . . . 80
Exercise 5.5 Interdisciplinary pedagogy . . . 81
Exercise 5.6 Professional development for teachers . . . 81
Exercise 6.1 The interdisciplinary structure of data science . . . 87
Exercise 6.2 Is data science more about computer science or more about statistics? . . . 88
Exercise 6.3 Does data science include computer science? Does it include statistics? . . . 89

List of Exercises

Exercise 6.4 Designing an undergraduate data science program . . . 90
Exercise 6.5 Analysis of undergraduate data science programs by components . . . 90
Exercise 6.6 Data science PCK . . . 92
Exercise 6.7 The challenge of developing interdisciplinary programs . . . 92
Exercise 6.8 Knowledge gaps in PBL . . . 96
Exercise 6.9 Additional challenges of mentoring interdisciplinary projects . . . 97
Exercise 6.10 Revisiting the questions about the interdisciplinary structure of data science . . . 97
Exercise 7.1 Different data science professions . . . 102
Exercise 7.2 Diversity in data science . . . 103
Exercise 7.3 The AI + Ethics Curriculum for Middle School initiative . . . 104
Exercise 7.4 Initiatives for K-12 data science education . . . 104
Exercise 7.5 Learning environments for high school pupils . . . 104
Exercise 7.6 Real data in data science education . . . 106
Exercise 7.7 Didactic transposition of machine learning algorithms . . . 107
Exercise 7.8 Discussions about the role of data science in admission requirements . . . 108
Exercise 7.9 Analysis of undergraduate data science programs . . . 109
Exercise 7.10 Graduate studies . . . 110
Exercise 7.11 Data science and diversity . . . 111
Exercise 7.12 Online learning platforms versus physical campuses . . . 113
Exercise 7.13 Online data science courses . . . 113
Exercise 7.14 Profiles of online learners of data science courses . . . 113
Exercise 7.15 Analysis of data gathered by online learning platforms . . . 114
Exercise 7.16 Policy making based on data science methods . . . 114
Exercise 7.17 Data science knowledge for users . . . 115
Exercise 7.18 Data science knowledge for the general public . . . 116
Exercise 7.19 Designing a data science course for the general public . . . 116
Exercise 7.20 Books about data science for the general public . . . 117
Exercise 7.21 Storytelling from the general public’s perspective . . . 117
Exercise 7.22 Categorization of learning environments for data science according to learner groups . . . 117
Exercise 7.23 Textual programing environments for data science . . . 118
Exercise 7.24 Visual programing environments for data science . . . 118
Exercise 7.25 Use of general data processing tools for data science education purposes . . . 118
Exercise 7.26 Teachable Machine . . . 119
Exercise 8.1 Mapping data science skills . . . 122
Exercise 8.2 Technion Biomedical Informatics Research and COVID-19 . . . 123
Exercise 8.3 Sub-cycles of the data science workflow . . . 124

Exercise 8.4 The challenge of learning the application domain . . . 125
Exercise 8.5 The key role of understanding the application domain . . . 126
Exercise 8.6 Populations for which research skills are relevant . . . 127
Exercise 8.7 Model assessment . . . 128
Exercise 8.8 A metaphor for a discourse between a biologist and a data scientist . . . 129
Exercise 8.9 Building a data science team . . . 130
Exercise 8.10 The importance of the application domain for the data scientist . . . 130
Exercise 8.11 Data science and anthropology . . . 131
Exercise 8.12 Visualization tools of data analysis environments . . . 131
Exercise 8.13 Visualization in the data science workflow . . . 132
Exercise 8.14 Visual programing environments for data science . . . 132
Exercise 8.15 Connections between data visualization and other data science skills . . . 132
Exercise 8.16 Reflective and analysis skills . . . 133
Exercise 8.17 Challenges of data science from a research perspective . . . 134
Exercise 8.18 Additional data science skills . . . 134
Exercise 9.1 Diffusion of educational programs . . . 138
Exercise 9.2 Reflection on your experience with the adoption of innovation . . . 139
Exercise 9.3 Reflection on your experience with the adoption of educational innovation . . . 142
Exercise 9.4 Pedagogical chasms . . . 147
Exercise 10.1 The agile data science workflows . . . 152
Exercise 10.2 Additional data science workflows . . . 153
Exercise 10.3 Data science workflow for learners . . . 154
Exercise 10.4 Data collection . . . 155
Exercise 10.5 Search for a dataset . . . 155
Exercise 10.6 Disadvantages of data gathering by learners . . . 155
Exercise 10.7 Other tools for collecting data from people . . . 155
Exercise 10.8 Activities included in data preparation . . . 156
Exercise 10.9 Data wrangling . . . 156
Exercise 10.10 Data preparation . . . 157
Exercise 10.11 Types of visualization methods . . . 158
Exercise 10.12 Selecting an appropriate visualization method . . . 158
Exercise 10.13 Interpreting graphs . . . 159
Exercise 10.14 Analysis of the exploratory data analysis phase from the perspective of abstraction levels . . . 159
Exercise 10.15 Feature engineering . . . 161
Exercise 10.16 Communication and actions . . . 162
Exercise 11.1 Mapping data science skills . . . 166
Exercise 11.2 Critical thinking . . . 166
Exercise 11.3 Reflection on dataset exploration . . . 168

Exercise 11.4 Analysis of unknown dataset . . . 170
Exercise 11.5 The rhetorical triangle . . . 170
Exercise 11.6 Anthropology and data science . . . 171
Exercise 11.7 Programming tools used in the data science workflow . . . 172
Exercise 11.8 Lifelong learning . . . 173
Exercise 11.9 Coursera 4-week (14 h) course Learning How to Learn . . . 173
Exercise 11.10 A metaphor for a discourse between a data scientist, you, and another employee in your organization . . . 174
Exercise 11.11 Giving and receiving feedback on a presentation about data analysis . . . 174
Exercise 11.12 Exploration of a dataset as a debugging process . . . 176
Exercise 11.13 The expression of skills in the data science workflow . . . 176
Exercise 11.14 Skills of the data science workflow stakeholders . . . 177
Exercise 11.15 Additional data science skills . . . 177
Exercise 11.16 An interdisciplinary perspective on data science skills . . . 177
Exercise 12.1 Famous cases that illustrate the need for an ethical code for data science . . . 180
Exercise 12.2 Generative adversarial networks from an ethical perspective . . . 181
Exercise 12.3 Codes of ethics . . . 181
Exercise 12.4 Comparisons of data science codes of ethics . . . 182
Exercise 12.5 Stakeholders’ behavior in the different phases of the data science workflow . . . 183
Exercise 12.6 Responsible AI . . . 183
Exercise 12.7 Recommendations of curriculum guidelines with respect to the inclusion of ethics in data science programs . . . 184
Exercise 12.8 Ethical principles to be applied in the creation of image sets . . . 184
Exercise 12.9 The ethical aspect of product development . . . 185
Exercise 12.10 Analysis of a documentary movie . . . 185
Exercise 12.11 Embedded ethics in data science . . . 187
Exercise 12.12 Revisiting the AI + Ethics curriculum for middle school initiative . . . 188
Exercise 12.13 Exploration of previously published data science case studies . . . 189
Exercise 12.14 Development of data science case studies . . . 189
Exercise 12.15 Developing scenarios about data science . . . 190
Exercise 12.16 Reflection on a lecture . . . 191
Exercise 12.17 Reflection as a habit of mind . . . 192
Exercise 12.18 PBL and data science . . . 193
Exercise 12.19 Social issues of data science and interdisciplinarity . . . 193
Exercise 13.1 The concepts of explainability and interpretability . . . 204
Exercise 13.2 Machine learning and statistical thinking . . . 205

Exercise 13.3	Machine learning and interdisciplinary projects . . . 206
Exercise 13.4	Machine learning performance indicators . . . 206
Exercise 14.1	Types of machine learning . . . 210
Exercise 14.2	Machine learning performance indicators . . . 214
Exercise 14.3	True or false . . . 216
Exercise 14.4	Performance indicators . . . 217
Exercise 14.5	Comparing performance indicators . . . 217
Exercise 14.6	Overfitting and underfitting . . . 221
Exercise 14.7	Overfitting and underfitting analogies . . . 221
Exercise 14.8	Regularization analogies . . . 223
Exercise 15.1	K as hyperparameter . . . 229
Exercise 17.1	Decision making in governmental offices . . . 254
Exercise 17.2	Data culture . . . 255
Exercise 17.3	The roles of the workshop facilitator . . . 258
Exercise 17.4	Culture . . . 260
Exercise 17.5	Working on the group task . . . 261
Exercise 17.6	Data science in the service of government offices . . . 261
Exercise 17.7	A follow-up workshop for the "Data science for education policymaking, governance, and operations" workshop . . . 261
Exercise 17.8	The pedagogical chasm and policymaking . . . 262
Exercise 18.1	Facilitation of an asynchronous activity . . . 269
Exercise 18.2	Topics to be included in a Methods of Teaching Data Science course . . . 273
Exercise 18.3	Interdisciplinarity of and in the MTDS course . . . 274
Exercise 18.4	Classification of the KNN description into process and object conceptions . . . 276
Exercise 18.5	Analysis of students' responses to the mid-semester questionnaire . . . 278
Exercise 19.1	Data science applications that require knowledge in social sciences and digital humanities . . . 285
Exercise 19.2	Data science job ads that require knowledge in social sciences and digital humanities . . . 286
Exercise 19.3	Interviewing researchers in social sciences and digital humanities . . . 290
Exercise 19.4	Uses of supervised and unsupervised machine learning algorithms in social sciences applications . . . 296
Exercise 19.5	Theoretical perspectives on data science education for researchers in social sciences and digital humanities . . . 298
Exercise 19.6	Data science education for undergraduate social sciences and digital humanities students . . . 299
Exercise 20.1	Examples of human aspects of engineering research . . . 305
Exercise 20.2	Data-driven research . . . 306
Exercise 20.3	Data visualization . . . 306

Exercise EP.1	Pedagogical chasms . . . 315
Exercise EP.2	Connections between chapters . . . 316
Exercise EP.3	Final reflection task . . . 316

Chapter 1

Introduction—What is This Guide About?

Abstract Data science is a new discipline of research that is gaining growing interest in both industry and academia. As a result, demand is increasing for data science programs for a variety of learners from a variety of disciplines (data science, computer science, statistics, engineering, life sciences, social sciences, and humanities) and at a variety of levels (from school children to academia and industry). While significant efforts are being invested in the development of data science curricula, or in other words, in what to teach, only sporadic discussions today focus on data science pedagogy, that is, on how to teach. This pedagogy is the focus of this guide. In the following introduction, we present the motivation for writing this guide (Sect. 1.2), followed by the pedagogical principles we applied in it (Sect. 1.3), its structure (Sect. 1.4), and how it can be used by educators who teach data science in different educational frameworks (Sect. 1.5). Finally, we present several main kinds of learning environments that are appropriate for teaching and learning data science (Sect. 1.6).

1.1 Introduction

Data science is a new research discipline that focuses on generating knowledge and value from raw data. It is an interdisciplinary discipline that inherits knowledge and skills from mathematics, statistics, computer science, and the application domain of the data. While many initiatives aim to develop data science curricula, the pedagogy of data science has been almost completely neglected. This guide attempts to partially close this gap, and it therefore focuses on data science teaching.

Most of the ideas presented in this guide can be applied in any framework for teaching data science, for any data science topic, and at any level, from elementary and middle school through high school to the university and post-academia level. Since the different aspects of data science teaching presented in this guide are not restricted to a specific curriculum, we avoid using code in any programming language. We do, however, present environments for teaching data science to guide educators who design their teaching material and wish to select a learning environment that suits their learners (see Sect. 1.6). We also note that this guide does not purport to teach data science; rather, it focuses on data science teaching.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_1

We presume that our readership has sufficient data science knowledge and is interested in expanding this knowledge to the educational aspects of data science. To provide the readership of this guide with a common basis, most of the demonstrations included in the guide are based on fundamental data science concepts; nevertheless, they can be adjusted to the teaching of any data science concept. Section 1.5 specifies how this guide can be used by different populations of data science educators: data science lecturers in academia, data science teachers in schools, instructors of the Methods of Teaching Data Science (MTDS) course, and more.

The following terminology is used throughout this guide to refer to the different kinds of learners:

• Learners: This term is used when we discuss topics that are not necessarily related to any specific population.
• Pupils: Elementary, middle school, and high school learners.
• Students: Undergraduate and graduate learners in institutions of higher education.
• Data science majors: Students studying toward a degree in data science.
• Allied majors: Students studying toward a degree in a field closely related to data science (such as statistics, computer science, exact sciences, and engineering) who learn the required level of computer science, mathematics, and statistics as an integral part of their study program.
• Non-majors: Other students (e.g., undergraduate or graduate students of the social sciences and humanities) who do not always have the sufficient computer science, mathematical, and statistical background required for data science studies.

In addition to the above groups of learners, other learning populations are addressed in this guide as well; these include researchers, policy makers, practitioners in industry, and users.

We also use the following terminology when addressing data science and its components (see Chap. 2):

• Data science is a discipline.
• Data science education is a field. We use the term field to distinguish between data science education and the disciplines of data science, mathematics, statistics, and computer science, thus helping to ease the reading process. In spite of this convention, we intentionally use the term discipline in the title of Chap. 4, since data science education is in fact an emerging discipline.
• The components of data science are:
  – the disciplines of computer science, mathematics, and statistics;
  – the application domain component of data science, from which the data is taken. The term application domain knowledge refers to its content.
• Domain is also used when referring to domains of life, the domain of the organization, and so on.

Later on in the introduction, we present the motivation for writing the guide (Sect. 1.2), the pedagogical principles and guidelines for data science teaching that


we employ in it (Sect. 1.3), its structure (Sect. 1.4), how it can be used in different frameworks of data science education (Sect. 1.5), and learning environments for learning and teaching data science (Sect. 1.6).

1.2 Motivation for Writing This Guide

The dynamic and rapid evolution of the discipline of data science poses educational and pedagogical challenges, including the design of teaching and learning materials. Indeed, teaching data science is challenging because (a) data science is a very broad discipline with a large body of knowledge; (b) data science contains complex topics, such as machine learning algorithms; (c) data science requires a variety of thinking skills, such as interdisciplinary thinking, computational thinking, and statistical thinking (see Chap. 3); (d) data science requires special professional and organizational skills, such as teamwork, storytelling (see Chap. 11), and ethical behavior (see Chap. 12); and (e) in many cases, data science educators are not data science majors, but rather graduates of one of the disciplines whose intersection creates data science (that is, mathematics, statistics, computer science, and the application domain).

This guide addresses these challenges, and additional ones, by presenting a comprehensive overview and, when appropriate, detailed guidelines for implementing a large variety of topics related to data science pedagogy that we have identified while teaching data science in a variety of frameworks and in our research on data science education. From a practical point of view, the writing style of this guide enables immediate implementation of its ideas in any framework of data science education.

1.3 Pedagogical Principles and Guidelines for Teaching Data Science

In analogy to data science, which aggregates knowledge and skills from a variety of disciplines, including mathematics, statistics, computer science, and various application domains, data science education is based on an interdisciplinary pedagogy composed of knowledge and skills taken from mathematics education, statistics education, computer science education, and the educational field of the relevant application domain. As such, data science pedagogy draws on the best pedagogical practices from this variety of educational fields. In this spirit, we now describe several pedagogical principles that we apply in this guide. These pedagogies, as well as some others, are further elaborated on in the different chapters of the guide.

• Context-based learning. Data science should be taught in the context of real-life problems taken from any domain of life, using real-life data. While it is


sometimes easier to teach with simulated data, teaching with real-life data in the context of real-life domains has many advantages: it enhances the learners' motivation, relies on their real-life experience, attracts learners from a variety of backgrounds, and enhances diversity and inclusion. See also Sect. "Embedded Context".
• Active learning. Active learning advocates the perspective that, in order to achieve meaningful learning, learners should be active rather than passive. Active learning stands in contrast to passive teaching methods, in which teachers lecture without giving students the opportunity to express their knowledge, skills, opinions, and imagination (see also Sect. "Active Learning"). In the spirit of this guide, which aims to inspire the active learning approach by applying it in a variety of contexts, we suggest that data science educators embrace the active learning pedagogical approach in the diverse frameworks in which data science is learned. To support this goal, the guide includes over 200 exercises that encourage readers both to engage with the material and to expand their perspective on data science. See the List of Exercises.
• White box understanding. While it is preferable that learners gain a white box understanding of every topic in the data science curriculum (that is, an understanding of the details of the learned algorithm), this is not always feasible due to gaps in the learners' preliminary knowledge, time and budget constraints, and various other reasons. Therefore, certain learners are taught some data science concepts as a black box (that is, focusing on the algorithm's input and output without understanding the details of the algorithm itself).
• Data science skills.
We advocate integrating the teaching of several kinds of skills (e.g., cognitive, organizational, and research skills) into any data science curriculum, alongside and together with the teaching of data science core concepts (rather than in dedicated courses or workshops). To support the implementation of this pedagogical guideline, activities that address the teaching of data science skills are integrated into the various chapters of this guide; in addition, Chaps. 11 and 12 offer an in-depth discussion of a variety of data science skills.
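To make the white box/black box distinction above concrete, consider a minimal sketch (ours, not taken from any specific curriculum) for K-nearest neighbors classification. In a black box treatment, learners would simply call a library classifier (e.g., scikit-learn's KNeighborsClassifier) and examine only its inputs and outputs; a white box treatment has learners implement the distance-and-vote logic themselves. The tiny dataset below is invented for illustration.

```python
# Illustrative sketch only: a white box implementation of KNN
# that a black box lesson would replace with a library call.
from collections import Counter
import math

# Invented toy training set: (feature vector, class label).
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]

def knn_predict(point, k=3):
    """White box KNN: compute distances, take the k nearest, majority vote."""
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.5)))  # → A
```

A dozen lines like these let learners see why the choice of k and of the distance function matters, which the black box call hides.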

1.4 The Structure of the Guide to Teaching Data Science

This section describes the structure of this guide and of each of its chapters. We note that although the chapters are presented in a specific order, they need neither be read nor applied sequentially; rather, each reader can choose the reading order of the chapters according to his or her needs. Furthermore, the different chapters of the book are interconnected, and in many cases, we refer the reader to other chapters to gain a more comprehensive perspective of the discussed topic.


1.4.1 The Five Parts of the Guide

Following Table 1.1, which presents the five parts of the guide and the chapters comprising each part, we describe the focus and content of each part.

Table 1.1 The five parts of the guide to teaching data science

Part                                                             Chapters
Part I—Overview of Data Science and Data Science Education       2–4
Part II—Opportunities and Challenges of Data Science Education   5–9
Part III—Teaching Professional Aspects of Data Science           10–12
Part IV—Machine Learning Education                               13–16
Part V—Frameworks for Teaching Data Science                      17–20

• Part I—Overview of Data Science and Data Science Education. In this part, we discuss what data science and data science education are and review the current state of data science education, including curricula and pedagogy.
• Part II—Opportunities and Challenges of Data Science Education. This part of the guide elaborates on the educational challenges of data science from a variety of perspectives (including those of learners, teachers, and policy makers, among others), also addressing the multi-faceted and interdisciplinary nature of data science.
• Part III—Teaching Professional Aspects of Data Science. In this part, we take a pedagogical perspective and examine several topics related to the human aspects of data science, such as data science skills and social issues, in general, and ethics, in particular. In line with the teaching principles presented in Sect. 1.3, we champion the message that data science skills and other non-technical issues should be given special attention in any data science program, regardless of its framework or level.
• Part IV—Machine Learning Education. We dedicate one part of this guide to machine learning education for two main reasons. First, machine learning is one of the central steps in the data science workflow and an important emerging approach for data modeling. Second, machine learning is heavily based on mathematics and computer science content and therefore poses unique pedagogical challenges, which we address in this part of the guide. Specifically, we discuss teaching machine learning algorithms using a white box approach, teaching core concepts of machine learning algorithms that are commonly taught in introductory data science courses, and specific teaching methods suitable for teaching machine learning.
• Part V—Frameworks for Teaching Data Science. This part of the guide presents several frameworks for teaching data science to professionals whose core activities are management, education, or research, and who need data science knowledge to foster their professional development and improve their professionalism. The chapters in this part are organized according to the MERge framework for professional development (Hazzan & Lis-Hacohen, 2016), according to which


disciplinary knowledge is required in order to manage, educate, and carry out research. In other words, according to the MERge model, management, education, and research (MER) activities can be carried out meaningfully only when one is an expert in the domain in which these activities are carried out (in the case of data science, the application domain from which the data is taken).

1.4.2 The Chapters of the Guide

This section provides a brief description of each of the next 20 chapters of this guide. Before delving into the details, we note that:

• Each chapter focuses on one pedagogical aspect of data science, including its relevance, importance, and application in the context of data science education.
• While most chapters can be read and used independently, each chapter contains references to other chapters in which a specific topic is further elaborated on. Accordingly, when different topics are related, we suggest the readership further explore this connection by reading additional chapters.
• Most of the chapters of the guide can be used as a textbook for the Methods of Teaching Data Science course (see Chap. 18) or for a course on Data Science Education.
• Each chapter presents a set of exercises whose purpose is to encourage, broaden, and deepen the readership's thinking about the ideas presented in the chapter. As open-ended exercises, they have no single correct answers; on the contrary, since data science education is a new, evolving field, new ideas are welcome. In addition, the exercises illustrate what kinds of activities can be given to learners in the different frameworks in which data science is taught. The list of activities is presented in the List of Exercises at the beginning of this guide.
• At the end of each chapter, we highlight how the interdisciplinarity of data science is reflected in the chapter.

Part I—Overview of Data Science and Data Science Education

Chapter 2: What is Data Science? Although many attempts have been made to define data science, such a definition has not yet been reached. One reason for the difficulty of reaching a single, consensual definition of data science is its multifaceted nature: it can be described as a science, as a research method, as a discipline, as a workflow, or as a profession. No single definition can capture this diverse essence of data science.
In this chapter, we first take an interdisciplinary perspective and review the background for the development of data science (Sect. 2.1). Then we present data science from several perspectives: data science as a science (Sect. 2.2), data science as a research method (Sect. 2.3), data science as a discipline (Sect. 2.4), data science as a workflow (Sect. 2.5), and data science as a profession (Sect. 2.6). We conclude by highlighting three main characteristics of data science: interdisciplinarity, learner diversity, and its research-oriented nature (Sect. 2.7).


Chapter 3: Data Science Thinking. This chapter highlights the cognitive aspect of data science. It presents a variety of thinking modes that are associated with the different components of data science and explores the contribution of each such mode to data thinking—the mode of thinking required of data scientists (not necessarily professional ones). Specifically, computer science contributes computational thinking (Sect. 3.2.1), statistics contributes statistical thinking (Sect. 3.2.2), mathematics adds different ways in which data science concepts can be comprehended (Sect. 3.2.3), and each application domain brings with it its own thinking skills, core principles, and ethical considerations (Sect. 3.2.4). Based on these modes of thinking, each associated with a component of data science, data science thinking is presented (Sect. 3.2.5).

Chapter 4: The Birth of a New Discipline: Data Science Education. Data science is a young discipline, and its associated educational field—data science education—is even younger. As of today, data science education has not yet been recognized as a distinct field and is discussed mainly in the context of the education of the disciplines that compose data science, in other words, computer science education, statistics education, mathematics education, and the educational fields of the application domains, such as medical education or business analytics education. There are voices, however, that call for the integration of relevant knowledge from these educational fields and the formation of a coherent and integrative data science education body of knowledge, based on which data science programs may be designed. In this chapter, we present the story of the birth of the field of data science education by describing its short history. We focus on the main efforts invested in the design of an undergraduate data science curriculum (Sect.
4.2), and on the main initiatives whose aim was to develop a data science curriculum tailored for school pupils (Sect. 4.3).

Part II—Opportunities and Challenges of Data Science Education

Chapter 5: Opportunities in Data Science Education. Data science education brings about multiple new educational opportunities. In this chapter, we elaborate on six such opportunities: teaching STEM in a real-world context (Sect. 5.2), teaching STEM with real-world data (Sect. 5.3), bridging gender gaps in STEM education (Sect. 5.4), teaching twenty-first century skills (Sect. 5.5), interdisciplinary pedagogy (Sect. 5.6), and professional development for teachers (Sect. 5.7). We conclude with an interdisciplinary perspective on the opportunities of data science education (Sect. 5.8).

Chapter 6: The Interdisciplinarity Challenge. In this chapter, we elaborate on the educational (i.e., curricular and pedagogical) challenges posed by the interdisciplinary structure of data science. We first describe the unique and complex interdisciplinary structure of data science (Sect. 6.2). Then, we present the challenge of balancing computer science and statistics in data science curricula (Sect. 6.3) and the challenge of integrating application domain knowledge into data science study programs, courses, and student projects (Sect. 6.4).

Chapter 7: The Variety of Data Science Learners. Since data science is considered an important twenty-first century skill, it is our belief that everyone, children as well as adults, should learn it at a suitable level and to a suitable extent and depth.


Accordingly, after reviewing the importance of data science knowledge for everyone (Sect. 7.1), this chapter reviews the teaching of data science to different populations: K-12 pupils (Sect. 7.2), high school computer science pupils (Sect. 7.3), undergraduate students (Sect. 7.4), graduate students (Sect. 7.5), senior researchers (Sect. 7.6), data science educators (Sect. 7.7), practitioners in the industry (Sect. 7.8), policy makers (Sect. 7.9), users (Sect. 7.10), and the general public (Sect. 7.11). For each population, we discuss the main purpose of teaching it data science and the main concepts that the specific population should learn, and (in some cases) we present suitable learning environments and exercises for its learners. In Sect. 7.12, we present several activities that address the suitability of different learning environments for teaching data science to the different populations discussed in this chapter. In the conclusion (Sect. 7.13), we highlight the concept of diversity in the context of data science.

Chapter 8: Data Science as a Research Method. In this chapter, we focus on the challenge that arises from the fact that data science is also a research method. We first describe the essence of the research process that data science inspires (Sect. 8.2). Then, the focus turns to the teaching of the cognitive, organizational, and technological skills (sometimes) required for carrying out data science research (Sect. 8.3). The chapter ends with several notes that highlight the pedagogical challenges that stem from the fact that data science is a research method (Sect. 8.4).

Chapter 9: The Pedagogical Chasm in Data Science Education. As an interdisciplinary profession, data science poses many challenges for teachers. This chapter presents the story of the adoption, by high school computer science teachers, of a new data science curriculum developed in Israel for high school computer science pupils.
We analyze the adoption process according to the diffusion-of-innovation and crossing-the-chasm theories. First, we present the diffusion-of-innovation theory (Sect. 9.1) and the crossing-the-chasm theory (Sect. 9.2). Next, we present the data science for high school curriculum case study (Sect. 9.3). Data analysis reveals that any time a new curriculum is adopted, there is a pedagogical chasm (i.e., a pedagogical challenge that reduces the motivation of most teachers to adopt the curriculum) that eventually slows the adoption process of the innovation (Sect. 9.4). Finally, we discuss the implications of the pedagogical chasm for data science education (Sect. 9.5).

Part III—Teaching Professional Aspects of Data Science

Chapter 10: The Data Science Workflow. One facet of data science is its examination as a workflow in which a variety of practitioners are involved and which requires technical, cognitive, social, and organizational skills. In this chapter, we elaborate on the data science workflow from an educational perspective. First, we present several approaches to the data science workflow (Sect. 10.1). Then, we elaborate on the pedagogical aspects of the different phases of the workflow: data collection (Sect. 10.2), data preparation (Sect. 10.3), exploratory data analysis (Sect. 10.4), modeling (Sect. 10.5), and communication and action (Sect. 10.6). We conclude with an interdisciplinary perspective on the data science workflow (Sect. 10.7).
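The workflow phases named above can be compressed into a few lines for classroom illustration. The sketch below is ours, deliberately library-free, with invented "study hours vs. exam score" data; it walks through preparation, exploration, modeling, and communication on a toy dataset.

```python
# Hypothetical miniature of the workflow phases:
# collection -> preparation -> exploration -> modeling -> communication.
# All data values are invented for illustration.

# Collection: raw records, one of them incomplete.
raw = [{"hours": 2, "score": 55}, {"hours": 4, "score": 70},
       {"hours": None, "score": 60}, {"hours": 8, "score": 92}]

# Preparation: drop records with missing values.
clean = [r for r in raw if r["hours"] is not None]

# Exploration: a quick summary statistic.
mean_score = sum(r["score"] for r in clean) / len(clean)

# Modeling: fit a least-squares line score ~ hours (closed form).
n = len(clean)
sx = sum(r["hours"] for r in clean)
sy = sum(r["score"] for r in clean)
sxx = sum(r["hours"] ** 2 for r in clean)
sxy = sum(r["hours"] * r["score"] for r in clean)
slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
intercept = (sy - slope * sx) / n

# Communication: report the finding in plain words.
print(f"mean score {mean_score:.1f}; each extra hour adds ~{slope:.1f} points")
# prints: mean score 72.3; each extra hour adds ~6.1 points
```

Even this toy version gives learners one concrete decision to debate at each phase, such as whether dropping the incomplete record was the right preparation choice.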


Chapter 11: Professional Skills and Soft Skills in Data Science. In this chapter, we discuss the skills required in order to deal with data science in a meaningful manner. The chapter is divided into two sections: professional skills (Sect. 11.2) and soft skills (Sect. 11.3). Professional skills are specific skills that are required for data science, while soft skills are more general skills that gain unique importance in the context of data science. In each section, we address cognitive, organizational, and technological skills. The discussion of data science skills is especially important today due to the increasing awareness that scientists and engineers in general, and data scientists in particular, should acquire both professional and soft skills, in addition to disciplinary and technical knowledge.

Chapter 12: Social and Ethical Issues of Data Science. The teaching of social issues related to data science should receive special attention regardless of the data science teaching framework or level. This assertion is derived from the fact that data science (a) is relevant for many aspects of our lives (e.g., health, education, social life, and transportation); (b) can be applied in harmful ways (even without explicit intention); and (c) involves ethical considerations that stem from the application domain. From the many possible social topics whose teaching could have been discussed in this chapter, we focus on data science ethics (Sect. 12.2). We also present teaching methods that are especially suited for the teaching of social issues of data science (Sect. 12.3). Finally, we discuss how the social perspective on data science, as discussed throughout the chapter, further emphasizes the interdisciplinarity of data science.

Part IV—Machine Learning Education¹

Chapter 13: The Pedagogical Challenge of Machine Learning Education. Machine learning (ML) is the essence of the modeling phase of the data science workflow.
In this chapter, we focus on the pedagogical challenges of teaching ML to various populations. We first describe the terms white box and black box in the context of ML education (Sect. 13.2). Next, we describe the pedagogical challenge with respect to different learner populations, including data science majors as well as non-majors (Sect. 13.3). Then, we present three framework remarks for teaching ML (regarding statistical thinking, interdisciplinary projects, and application domain knowledge), which, despite not being mentioned frequently in this part of the book, are important to keep in mind in ML teaching processes (Sect. 13.4). We conclude this chapter by highlighting the importance of ML education in the context of the application domain (Sect. 13.5).

Chapter 14: Machine Learning Core Concepts. This chapter focuses on the teaching of several core concepts of ML that are relevant for many ML algorithms (such as hyperparameter tuning). Specifically, we discuss types of ML algorithms (Sect. 14.2), ML parameters and hyperparameters (Sect. 14.3), model training, validation, and testing (Sect. 14.4), ML performance indicators (Sect. 14.5),

[1] In the chapters of Part IV—Machine Learning Education, we use the abbreviation ML for Machine Learning.


1 Introduction—What is This Guide About?

bias and variance (Sect. 14.6), model complexity (Sect. 14.7), overfitting and underfitting (Sect. 14.8), loss function optimization and the gradient descent algorithm (Sect. 14.9), and regularization (Sect. 14.10). We conclude this chapter by reviewing ML core concepts from an interdisciplinary perspective (Sect. 14.11). Chapter 15: Machine Learning Algorithms. In this chapter, we describe the teaching of several ML algorithms that are commonly taught in introductory ML courses. The algorithms we discuss are K-nearest neighbors (KNN) (Sect. 15.2), decision trees (Sect. 15.3), the perceptron (Sect. 15.4), linear regression (Sect. 15.5), logistic regression (Sect. 15.6), and neural networks (Sect. 15.7). Finally, we discuss interrelations between the interdisciplinarity of data science and the teaching of ML algorithms (Sect. 15.8). Chapter 16: Methods for Teaching Machine Learning. In this chapter, we review four methods for teaching ML: visualization (Sect. 16.2), hands-on tasks (Sect. 16.3), programming tasks (Sect. 16.4), and project-based learning (Sect. 16.5). Part V—Frameworks for Teaching Data Science Chapter 17: Data Science for Managers and Policy Makers. In this chapter, we address the first component of the MERge model, i.e., management. In line with the MERge model as a professional development framework, we show in this chapter how managers and policy makers (at all levels) can use data science in their decision-making processes. We describe a workshop for policy makers that focuses on the integration of data science into educational systems for policy, governance, and operational purposes (Sect. 17.2). The messages conveyed in this chapter can be applied in other systems and organizations in all three sectors: governmental organizations (the first sector), for-profit organizations (the second sector), and non-profit organizations (the third sector). We conclude with an interdisciplinary perspective on data science for managers and policy makers (Sect. 17.3). Chapter 18: Data Science Teacher Preparation: The Methods of Teaching Data Science (MTDS) Course. This chapter focuses on the second component of the MERge model, namely education. We present a detailed description of the Methods of Teaching Data Science (MTDS) course that we designed and taught to prospective computer science teachers at our institution, the Technion—Israel Institute of Technology. As our purpose in this chapter is to encourage the implementation and teaching of the MTDS course in different frameworks, we provide the readership with as many details as possible about the course, including the course environment (Sect. 18.2), the course design (Sect. 18.3), its learning targets and structure (Sect. 18.4), the grading policy and assignments (Sect. 18.5), teaching principles we employed in the course (Sect. 18.6), and a detailed description of two of the course lessons (Sect. 18.7). Full detailed descriptions of all course lessons are available on our Data Science Education website. We hope that this detailed presentation partially closes the pedagogical chasm presented in Chap. 9. Chapter 19: Data Science for Social Science and Digital Humanities Researchers. In this chapter (and in Chap. 20), we focus on the third component of the MERge model, research, and describe two frameworks for teaching data science designed for researchers in social science and digital humanities. We open with a discussion


on the relevance of data science for social science and digital humanities researchers (Sect. 19.2), followed by a description of a data science bootcamp designed for researchers in the social sciences and digital humanities (Sect. 19.3), including the applicants and bootcamp participants (Sect. 19.3.1) and the bootcamp curriculum (Sect. 19.3.2). Then, the curriculum of a year-long specialization in data science for graduate psychology students, derived from this bootcamp, is presented (Sect. 19.4). Finally, we discuss the data science teaching frameworks for researchers in social sciences and digital humanities from a motivational perspective (Sect. 19.5), and conclude by illuminating the importance of an interdisciplinary approach in designing data science curricula for application domain specialists (Sect. 19.6). Chapter 20: Data Science for Research on Human Aspects of Science and Engineering. In this chapter (and in Chap. 19), the focus is on the third component of the MERge model—research. We examine how to teach data science methods to science and engineering students so that they can conduct research on human aspects of science and engineering, including cognitive, social, behavioral, and organizational aspects. These learner populations, unlike the community of social scientists (discussed in Chap. 19), usually have the needed background in computer science, mathematics, and statistics, and should be exposed to the human aspects of science and engineering, which, in many cases, are not included in scientific and engineering study programs. We start by presenting possible human-related science and engineering research topics (Sect. 20.2). Then, we describe a workshop for science and engineering graduate students that can be facilitated in a hybrid format, combining both synchronous (either online or face-to-face) and asynchronous meetings (Sect. 20.3). Epilogue. 
In the Epilogue, we look at the guide from a holistic perspective, reflecting on its main ideas and their interconnections.

1.5 How to Use This Guide?

As mentioned, this guide can be used by a variety of populations who teach data science in a variety of educational frameworks. In addition to this guide, educators can access our website and its supplementary material at: https://orithazzan.net.technion.ac.il/data-science-education/. From this array of populations, we elaborate on the following three: data science instructors in academia, K-12 teachers, and instructors of a Methods of Teaching Data Science (MTDS) course. However, as mentioned above, other populations of data science educators can find this guide useful as well. These include teachers of data science MOOCs, practitioners in industry who wish to teach their colleagues data science as part of professional development programs, leaders of academic institutions who plan on launching a data science program, and policy makers who wish to promote data science education on a national level, in general, and particularly in the first (governmental and local authorities) sector.


1.5.1 Data Science Instructors in Academia

Data science instructors in universities can use this guide in a variety of ways, for instance as a textbook for a specific course, or as learning material for either themselves or their students. We elaborate on several of these ways: Data science instructors in academia can expand their data science educational knowledge and increase their awareness of pedagogical aspects that may arise while teaching data science courses to either data science majors or non-majors. As we shall see in other chapters of this guide, the teaching of data science to non-majors garners a lot of attention due to the interdisciplinary nature of data science, which makes it relevant for many disciplines. University instructors can facilitate activities presented in the guide in their courses, emphasizing the context of the application domain. This guide can also enhance instructors' awareness of different aspects of data science education, such as cognitive aspects (e.g., learners' alternative conceptions and possible difficulties) and social aspects (e.g., ethics and skills). Finally, data science instructors can use this guide to vary their teaching methods so as to increase their learners' interest and motivation to learn data science. This can be done, for example, by implementing active-learning teaching methods in their teaching (see Sect. "Active Learning").

1.5.2 K-12 Teachers

This guide can be used by teachers of all ages and topics in the education system, with the adjustments needed to suit the specific population they teach. High school computer science teachers, for example, can use this guide in ways similar to those described for data science instructors in academia (Sect. 1.5.1).

1.5.3 Instructors of the Methods of Teaching Data Science (MTDS) Course

In the context of an MTDS course, this guide can be used as a textbook. Thus, instructors can ask their students to read specific chapters in preparation for lessons on specific topics, facilitate specific exercises from this guide in class, according to the course objectives and plan, and suggest that interested students broaden their knowledge by referring them to the list of references presented at the end of each chapter. Chapter 18 presents a description of two lessons from the MTDS course. As mentioned, full detailed descriptions of all course lessons are available on our Data Science Education website.


1.6 Learning Environments for Data Science

This guide does not purport to mention all of the material available for data science education (such as study programs, MOOCs, and computational tools). Not only would this be an impossible task, since the arsenal of materials grows daily, but it is also not our intention to recommend any specific teaching or learning environment. At the same time, we do mention, throughout the guide, data science reports, study programs, learning environments, and tools to illustrate a specific point or deliver a specific message. In some exercises, we may suggest that the readership explore such resources from a perspective that, in our opinion, may foster their comprehension of the field of data science education. And when our own experience teaching data science to a variety of populations is relevant to the topic addressed in a specific chapter, we draw on that experience and present relevant study programs and teaching material. In this spirit, we mention that many existing development environments can be used for data science education. These environments enable learners to load, manipulate, model, and visualize data. In fact, any general-purpose industrial programming IDE can support the programming needs of data scientists and data science learners with a basic programming background. At the same time, visual block-based IDEs for data science also exist. The following sections review both textual and visual programming environments that are suitable for data science education. Each educator can choose the relevant kind of environment according to the specific educational context in which he or she teaches.

1.6.1 Textual Programming Environments for Data Science

Notebooks are popular Python programming environments for data science. A notebook is a document that contains blocks of code, the output of their execution, equations, visualizations, and text. Since Python is an interpreted programming language, the notebook can be executed block by block. In general, notebooks enable an interactive mode of working: not only is the code processed step by step, but the data is processed step by step as well. After each step, the processed data can be displayed and visualized. Then, if needed, the code can be changed interactively until the desired results are obtained. Notebooks suit educational purposes and are appropriate environments for teaching and learning data science, for several reasons:

• Notebooks are a great tool for step-by-step demonstrations of how to perform data science tasks and can, therefore, be given to the learners as part of their learning materials.
• Notebooks provide a good platform for lab exercises, since the instructions can be given to the students together with initial partial code to be completed by the learners.


• Notebooks contain full, self-contained reports, including code (and its execution output) and text (technical, methodological, and other documentation), and can be submitted as homework or presented in class. Specifically, homework assignments can be submitted by learners in a notebook format that contains, in a single document, the code, the results, and the student's textual documentation and explanations.

Jupyter Notebook and Google Colab are two widely used notebook tools for educational purposes.
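To make this block-by-block style concrete, the following sketch imitates three notebook cells in plain Python; the dataset, the names, and the passing threshold are all hypothetical, invented for illustration:

```python
# Each "cell" below mimics one notebook block: run it, inspect its
# output, and refine the next step accordingly.

# Cell 1 -- load the data (in a real notebook, typically from a file)
rows = [
    {"name": "Dana", "score": 88},
    {"name": "Noa", "score": 67},
    {"name": "Avi", "score": 94},
    {"name": "Tal", "score": 72},
]

# Cell 2 -- process the data: keep only passing scores (>= 70)
passing = [r for r in rows if r["score"] >= 70]
print(passing)  # intermediate result, displayed right below the cell

# Cell 3 -- summarize the processed data
mean_score = sum(r["score"] for r in passing) / len(passing)
print(round(mean_score, 1))
```

In an actual notebook, each cell's output appears beneath it, and a single cell can be edited and re-run without repeating the earlier steps.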

1.6.2 Visual Programming Environments for Data Science

A visual programming environment for data science allows the data scientist to drag and drop visual data-processing blocks (which represent data-processing units, like "load file", "select rows", "test and score", or "scatter plot") onto a processing canvas. The blocks represent all of the required steps for data processing, such as loading various types of data, data selection, statistics and modeling, machine learning algorithms, and visualization. The way the blocks are organized and connected represents the flow of data rather than the flow of the program instructions. This is an important idea that is further elaborated on in Chap. 3, on data science thinking. Orange Data Mining, KNIME, and Weka are examples of visual environments for data science that can be installed and run locally on the students' computers.
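The dataflow idea can also be mimicked in code. In the sketch below, each function stands in for one visual block, and chaining the calls encodes the flow of data rather than the flow of program instructions; the block names and the toy weather data are hypothetical:

```python
# Each function below stands in for one visual block on the canvas.

def load_file():
    # stands in for a "load file" block
    return [{"city": "Haifa", "temp": 24}, {"city": "Eilat", "temp": 38}]

def select_rows(rows, predicate):
    # stands in for a "select rows" block
    return [r for r in rows if predicate(r)]

def summarize(rows):
    # stands in for a reporting/visualization block
    return {r["city"]: r["temp"] for r in rows}

# Connecting the blocks encodes the flow of data, not a sequence of
# low-level program instructions:
data = load_file()
hot_days = select_rows(data, lambda r: r["temp"] > 30)
report = summarize(hot_days)
print(report)  # {'Eilat': 38}
```

A visual environment makes exactly this pipeline visible as connected blocks, so learners can reason about where the data goes without reading any code.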

1.7 Conclusion

Data science education is a young field that is expected to grow dramatically in the near future, in parallel with the growing awareness of the potential influence of how data is used. In this guide, we aim to highlight the challenges and opportunities of data science education in a way that, in our opinion, channels the discipline of data science to a path that benefits us all—learners and teachers—as citizens of the twenty-first century.

Exercise 1.1 Reflection
Reflect on what you have read so far:
(a) What pedagogical ideas were introduced in this chapter?
(b) Can you speculate how you will use this guide when you teach data science (now or in the future)?


Reference

Hazzan, O., & Lis-Hacohen, R. (2016). The MERge model for business development: The amalgamation of management, education and research. Springer.

Part I

Overview of Data Science and Data Science Education

In this part, we discuss what data science and data science education are and review the current state of data science education, including its curriculum and pedagogy. This part includes the following chapters:

Chapter 2: What is Data Science?
Chapter 3: Data Science Thinking
Chapter 4: The Birth of a New Discipline: Data Science Education

Chapter 2

What is Data Science?

Abstract Although many attempts have been made to define data science, such a definition has not yet been reached. One reason for the difficulty of reaching a single, consensus definition for data science is its multifaceted nature: it can be described as a science, as a research method, as a discipline, as a workflow, or as a profession. No single definition can capture this diverse essence of data science. In this chapter, we first take an interdisciplinary perspective and review the background for the development of data science (Sect. 2.1). Then we present data science from several perspectives: data science as a science (Sect. 2.2), data science as a research method (Sect. 2.3), data science as a discipline (Sect. 2.4), data science as a workflow (Sect. 2.5), and data science as a profession (Sect. 2.6). We conclude by highlighting three main characteristics of data science: interdisciplinarity, learner diversity, and its research-oriented nature (Sect. 2.7).

2.1 The Interdisciplinary Development of Data Science

Data science is emerging from an interdisciplinary integration of mathematics, statistics, computer science, and many application domains such as business, biology, and education. On the surface, nothing about data science is new; any data science method or tool used today can be traced back to statistics, computer science, data mining, bioinformatics, and other data-intensive disciplines. The innovation, however, is in the integration—the holistic approach to data and to the methods used to obtain knowledge and value, whether financial, social, or educational. Over the years, the term data science appeared independently in statistics, computer science, and various application domains (Irizarry, 2020). In this section, we describe the evolution of the focus on data in the disciplines that make up data science: statistics (Sect. 2.1.1), computer science (Sect. 2.1.2), and business analytics as an example of an application domain (Sect. 2.1.3).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_2



2.1.1 The Origins of Data Science in Statistics

In his paper "50 years of data science", Donoho (2017) describes the role of statistics in the creation of data science. The origins of data science, Donoho claims, are rooted in Tukey's 1962 paper "The future of data analysis", in which Tukey presents a broader vision of the field of statistics, whose focus is data analysis rather than statistical inference. In the following years, additional voices from the statistician community called for a shift in the focus of statistics from mathematical modeling to learning from data. According to Wikipedia, in 1985, C.F. Jeff Wu first used the term data science in a lecture before the Chinese Academy of Sciences in Beijing as an alternative name for statistics (Jeff Wu, 2021). Twelve years later, in 1997, he gave a lecture entitled "Statistics = Data Science?" on the occasion of his appointment to the H. C. Carver Professorship at the University of Michigan (Wu, 1997). In his lecture, Wu presented his vision of the future directions of statistics, including handling large and complex data, using neural networks and data mining methods, and representing and exploring knowledge using computational algorithms. Wu also suggested a new, more balanced statistics curriculum with heavier emphasis on data collection, a scientific and mathematical basis for modeling, computing for large and complex systems, and an interdisciplinary component that, for undergraduate students, meant taking a minor in cognitive sciences and, for graduate students, meant studying 30–50% of the curriculum outside of the statistics department. In 2001, Cleveland (2001) published a paper entitled "Data science: An action plan for expanding the technical areas of the field of statistics" in which he called for the establishment of a new field, based on statistics, that focused on data analysis. 
According to Cleveland, the new field should have the following six areas of interest:

• Multidisciplinary Investigations: data analysis collaborations in a collection of subject matter areas.
• Models and Methods for Data: statistical models; methods of model building; methods of estimation and distribution based on probabilistic inference.
• Computing with Data: hardware systems; software systems; computational algorithms.
• Pedagogy: curriculum planning and approaches to teaching for elementary school, secondary school, college, graduate school, continuing education, and corporate training.
• Tool Evaluation: surveys of tools in use in practice; surveys of perceived needs for new tools; and studies of processes for the development of new tools.
• Theory: foundations of data science; general approaches to models and methods, computing with data, teaching, and tool evaluation; mathematical investigations of models and methods, computing with data, teaching, and evaluation.

In the two decades since then, in parallel with the development of data science, it has become accepted that statistics is an inherently necessary component of data science and that professional data scientists need statistical analysis skills.


2.1.2 The Origins of Data Science in Computer Science

In 1966, Peter Naur proposed to the editor of the Communications of the ACM a new terminology and a new name for computer science that stressed the centrality of data in computing (Naur, 1966). The new term Naur proposed was datalogy, defined as "the science of the nature and use of data". According to Naur, datalogy "might be a suitable replacement for computer science" (ibid., p. 485). Data mining is one branch of computer science that today is closely associated with data science. The term was first used by statisticians, data analysts, and information systems management professionals, and was adopted by the computer science community in the 1990s (Fayyad et al., 1996). While the statistician community was ambivalent towards data mining, as it was not considered an acceptable research strategy (Lovell, 1983), it rapidly gained popularity in the computer science community. Another term related to data mining is knowledge discovery in databases (KDD). This term was coined by Piatetsky-Shapiro in 1989, during what was later recognized as the first KDD workshop, and it emphasizes the fact that knowledge is the end product of a data-driven discovery (Fayyad et al., 1996). Today the terms data mining and knowledge discovery are used interchangeably, as Piatetsky-Shapiro suggested in 1990 (Piatetsky-Shapiro, 1990). In 2000, about ten years after that first KDD workshop, Piatetsky-Shapiro (2000) predicted that (a) as computational power and data storage capabilities grow, data mining and KDD will become important research methods; (b) data mining and KDD tools will gradually be better integrated into commercial databases; and (c) new applications of data mining and KDD will emerge in e-commerce and drug discovery. 
Today, data mining is used in a variety of application domains for a variety of purposes, for example, to improve the prediction of heart failure (Ishaq et al., 2021), to detect financial fraud (Al-Hashedi & Magalingam, 2021), and to study user behavior (Su & Wu, 2021). In these applications, data mining is used to extract meaningful data from datasets using statistical, machine learning, mathematical, and artificial intelligence techniques (Al-Hashedi & Magalingam, 2021). Data mining techniques can also utilize data exploration to integrate large quantities of unrelated data, find useful correlations, and recover valuable information from the data (Su & Wu, 2021). In parallel with the growth of data mining and KDD, other branches of computer science research faced the challenges of collecting and managing huge amounts of data that could not be collected, stored, and analyzed using conventional storage techniques (Cox & Ellsworth, 1997). These challenges led to the development of Big Data technology, which consists of exploring datasets of high volume, variety, velocity, and variability that require a scalable architecture for efficient storage, manipulation, and analysis (Chang & Grady, 2019). The ability to store and analyze big data is one of the key drivers of the growth of data science (Berman et al., 2018).


2.1.3 The Origins of Data Science in Application Domains: The Case of Business Analytics

In a paper entitled "Data scientist: The sexiest job of the twenty-first century", published in the October 2012 Harvard Business Review, Davenport and Patil (2012) present the origins of data science in business analytics. According to this paper, the term data scientist was first coined in an industrial context in 2008 by Patil and Hammerbacher, data analytics leaders at LinkedIn and Facebook at the time. The focus of the new data scientist profession was to extract insights from big data. In their paper, Davenport and Patil claimed that no university programs were yet offering degrees in data science and that only little consensus had been reached regarding the role of data scientists in the organization, how they can add meaningful value to the organization, and how their performance should be measured. In their book "Data Science for Business", Provost and Fawcett (2013) describe the emergence of data science from the business need to extract knowledge and value from data:

The past fifteen years have seen extensive investments in business infrastructure, which have improved the ability to collect data throughout the enterprise. Virtually every aspect of business is now open to data collection and often even instrumented for data collection: operations, manufacturing, supply-chain management, customer behavior, marketing campaign performance, workflow procedures, and so on. At the same time, information is now widely available on external events such as market trends, industry news, and competitors' movements. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data—the realm of data science. (p. 1)

2.2 Data Science as a Science

Empirical science has always been about data. Kepler used data about the movement of the planets, collected by Tycho Brahe, to prove Copernicus' theory of the solar system. The first steps of evidence-based medicine were taken around 1750, when James Lind, a Scottish naval surgeon, conducted a controlled experiment and found that oranges can cure scurvy, a disease known today to be caused by vitamin C deficiency. Are Kepler and Lind the fathers of data science, looking for patterns and models in raw data? While Kepler used data that were collected manually and Lind performed his own experiment to collect the data, data science today is more than an empirical science. That is, data science views the data itself as a natural resource and deals with methods for extracting value out of this data. While science focuses both on understanding the world and on developing tools and methods to perform research, data science focuses on understanding data and developing tools and methods to perform research on data (Skiena, 2017). In 2005, the National Science Board of the National Science Foundation published a report entitled "Long-Lived Digital Data Collections: Enabling Research and Education in the Twenty-First Century" (National Science Board, 2005). This


report presents data as a natural resource that needs to be collected and managed to allow science to extract its full value. In this report, data scientists are defined as "information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection" (ibid., p. 27). In 2007, Jim Gray, then manager of Microsoft Research's eScience group, gave a presentation to the National Research Council entitled "eScience—A Transformed Scientific Method" (Gray, 2007). In his presentation, Gray described eScience as an integration of information technology and science, and as the fourth paradigm of science. The first scientific paradigm, established thousands of years ago, is empirical science, in which scientists describe natural phenomena. The second scientific paradigm, applied hundreds of years ago, is the theoretical paradigm, in which scientists build models of nature. The third scientific paradigm was introduced only several decades ago and is the computational paradigm, in which scientists simulate complex phenomena using algorithms and computers. The fourth scientific paradigm, according to Gray, is data exploration, in which data is captured or simulated, and then analyzed by scientists to infer new scientific knowledge. In 2009, Tony Hey, Stewart Tansley, and Kristin Tolle edited a book entitled "The Fourth Paradigm: Data-Intensive Scientific Discovery" (Hey et al., 2009). This book, which was dedicated to Jim Gray, presents the full transcript of Gray's lecture, as well as more than 20 examples of data-centric scientific research and methods for promoting scientific discovery using data in various application domains such as health and wellbeing, earth science, and the environment. 
In 2014, the StatSNSF (National Science Foundation Directorate for Mathematical and Physical Sciences Support for the Statistical Sciences) committee defined data science as the "science of planning for, acquisition, management, analysis of, and inference from data" (Johnstone & Roberts, 2014, p. 6). Following Gray, a report published in 2019 entitled "NIST Big Data Interoperability Framework" claimed that data science is a short version of the term Gray used—data-intensive science—and described it as: "the conduct of data analysis as an empirical science, learning directly from data itself. This can take the form of collecting data followed by open-ended analysis without preconceived hypothesis (sometimes referred to as discovery or data exploration)" (Chang & Grady, 2019, p. 7).

Exercise 2.1 Pedagogical implications of data science as a science
What pedagogical implications can you draw from the analysis of data science as a science?


2.3 Data Science as a Research Method

Data science integrates research tools and methods taken from statistics and computer science that can be used to conduct research in various application domains, such as social science and digital humanities (see Chap. 19) or research on human aspects of engineering (see Chap. 20). In this section, we present two of the many data science methods that can be applied in a variety of application domains: exploratory data analysis (Sect. 2.3.1) and machine learning as a research method (Sect. 2.3.2). Chapter 8 elaborates on data science as a research method, including the skills it requires and the challenges that result from this perspective of data science.

2.3.1 Exploratory Data Analysis

Exploratory data analysis, which originated in statistics, is used to find new patterns and relationships between variables by visualizing raw data and giving meaning to these visualizations (Tukey, 1977). In the context of research, exploration and visualization can uncover unknown patterns and relationships between variables that can, in turn, reveal new knowledge. One well-known example of visualization is John Snow's map of the geographical spread of cholera cases, with which he revealed the source of the 1854 London cholera epidemic (see Fig. 2.1). Another example is Google Correlate, a tool used to compare temporal search patterns of different terms in order to find correlations between phenomena (Mohebbi et al., 2011).
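As a minimal illustration of an exploratory step, the sketch below computes a Pearson correlation coefficient in plain Python before any visualization is drawn; the two variables (hours studied and exam scores) are hypothetical, invented for illustration:

```python
# Hypothetical data: hours studied vs. exam score for five students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 60.0, 65.0, 74.0, 80.0]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(hours, scores)
print(round(r, 3))  # close to 1.0: a strong positive linear relationship
```

In practice, a learner would pair such a summary statistic with a scatter plot of the raw data, since the visualization can reveal structure (outliers, nonlinearity) that a single coefficient hides.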

2.3.2 Machine Learning as a Research Method

Machine learning algorithms, which originated in computer science, have a variety of research applications. For example, they can be used to reveal complex and nonlinear relationships between variables. New machine learning techniques, such as word embedding, convert non-numerical, complex data (such as text, images, audio, and video) into quantitative data, thus enabling, among other things, the application of both qualitative and quantitative research methods to non-numerical, complex data. Drug discovery is one area that illustrates how machine learning is applied as a research method. In a recent survey on machine learning applications for drug discovery, Vamathevan et al. (2019) wrote: "Opportunities to apply machine learning occur in all stages of drug discovery. Examples include target validation, identification of prognostic biomarkers and analysis of digital pathology data in clinical trials. Applications have ranged in context and methodology, with some approaches yielding accurate predictions and insights" (p. 1).


Fig. 2.1 Map by John Snow showing the clusters of cholera cases in the London epidemic of 1854 (Source: https://en.wikipedia.org/wiki/John_Snow, image is public domain)

Another example is the application of machine learning methods in social science research. In such research, complex human-generated data, such as posts on social networks, are used to map social phenomena. For example, Prebor (2021) used machine learning methods to investigate different kinds of feminism in orthodox groups on social media. Grimmer et al. (2021) reviewed current use of machine learning in social science research and stated that: Social scientists are now in an era of data abundance, and machine learning tools are increasingly used to extract meaning from datasets both massive and small… inclusion of machine learning in the social sciences requires us to rethink not only applications of machine learning methods but also best practices in the social sciences… [machine learning] is used to discover new concepts, measure the prevalence of those concepts, assess causal effects, and make predictions. The abundance of data and resources facilitates the move away from a deductive social science to a more sequential, interactive, and ultimately inductive approach to inference. (p. 1)


2 What is Data Science?

Exercise 2.2 Pedagogical implications of data science as a research method
What pedagogical implications can you draw from the analysis of data science as a research method?

2.4 Data Science as a Discipline

Data science integrates knowledge and skills from several disciplines, namely computer science, mathematics, statistics, and an application domain. One way to present such a relationship is with a Venn diagram, a diagram that shows the logical relationships between different sets. Conway (2010) was the first to propose a Venn diagram for data science as a discipline; following Conway, many other Venn diagrams were proposed for the discipline of data science (Taylor, 2016). Figure 2.2 shows our Venn diagram for data science. Researchers recognize three levels of integration between two or more distinct disciplines: multidisciplinarity, interdisciplinarity, and transdisciplinarity (Alvargonzález, 2011). In what follows, we investigate these levels of integration from the learners' perspective. Multidisciplinarity is the lowest level of integration. In multidisciplinary education, learners are expected to gain knowledge and understanding in each discipline separately. For example, in undergraduate multidisciplinary science study programs,

Fig. 2.2 The authors' version of the data science Venn diagram, as inspired by Conway (2010)


students select a collection of courses in different scientific topics (in most cases, two) without any specific specialization.

Interdisciplinarity represents a higher level of integration than multidisciplinarity. In interdisciplinary education, after learners gain basic knowledge and understanding in each discipline separately, they are expected to understand the interconnections between the disciplines and to be able to solve problems that require applying different knowledge and methods from each discipline. Gender studies, for example, combine knowledge from different disciplines, such as the humanities, social sciences, and life sciences, forming a holistic view of the field.

Transdisciplinarity is the highest level of integration. A transdisciplinary discipline is generated when the knowledge, methods, and values formed in the intersection of the original disciplines are detached from those disciplines and create a separate new discipline of their own.

What level of integration does data science reflect today? What is the desired level of integration for data science in the future? The 2015 National Science Foundation (NSF) report entitled "Strengthening Data Science Education Through Collaboration", which summarized the NSF-sponsored workshop on data science education, phrases the debate about the status of data science clearly:

Data science can either be seen as an integrated combination of contributing disciplines or a separate, still emerging discipline with its own content and expected competencies. Many participants supported the view that data science is gradually emerging as a discipline that is different from the academic disciplines that contribute to it. Data science should be freed from the contributing disciplines so that it can gain its own identity. For this to be possible, data science will need to develop its own independent theory and methods. (Cassel & Topi, 2015, p. ii)

Today, data science is transitioning gradually from multidisciplinarity to interdisciplinarity, but challenges are emerging in the process of integrating the different bodies of knowledge and traditions that originate in the different fields and scholarly cultures. The final report of the ACM Data Science Task Force stated explicitly that:

True interdisciplinary work is challenging. If each component remains independent, the relationships remain blurred and the opportunities for cross-fertilization are reduced… Early programs in Data Science will often work with a group of existing courses from the participating disciplines. That is practical and easy to bring a new program into existence... Cross references between courses, projects that call upon topics learned in other courses, and a comprehensive project to bring all the pieces together are essential to turn a mixed set of courses into a cohesive, interdisciplinary program. (Danyluk & Leidig, 2021, p. 11)

In this spirit, we can say that the integration of computer science, mathematics, and statistics to form a unified and holistic discipline focused on data will eventually be described as transdisciplinarity. To conclude this section, we note that data science is a special case of integration, since the application domain component can be almost any discipline, opening up a practically infinite number of possible combinations.


Fig. 2.3 Data science workflow (Authors’ version)1

Exercise 2.3 Pedagogical implications of data science as a discipline
What pedagogical implications can you draw from the analysis of data science as a discipline?

2.5 Data Science as a Workflow

Data science is commonly presented as a workflow for generating value and data-driven actions from data. This workflow starts with a collection of real-world data (either available data or data collected intentionally), proceeds through exploration and modeling, and continues to the publication of new knowledge and/or the performance of data-driven actions in the real world (see Fig. 2.3). The data science workflow is based on previous data-processing workflows such as the Cross-Industry Standard Process for Data Mining (CRISP-DM) (Shearer, 2000). It is an iterative process, whereby each cycle opens up a new cycle of exploration based on the findings obtained in the previous cycle. Chapter 8 further discusses data science as a research method and describes a typical work process that characterizes this paradigm.

The 2015 NSF report summarizing the NSF-sponsored workshop on data science education introduced a definition of data science that reflects this perspective of data science as a workflow: "Data science is a process, including all aspects of gathering, cleaning, organizing, analyzing, interpreting, and visualizing the facts represented by the raw data" (Cassel & Topi, 2015, p. iii).

1. Earth image was originally posted to Flickr by DonkeyHotey at https://flickr.com/photos/47422005@N04/5679642883. It was reviewed on 4 December 2020 by FlickreviewR 2 and was confirmed to be licensed under the terms of the cc-by-2.0 license.
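The iterative character of the workflow described above can be sketched as a loop over its stages. The stage functions below are hypothetical placeholders invented for illustration; in a real project each one hides substantial work, and frameworks such as CRISP-DM describe the stages in far more detail.

```python
def collect():             # gather real-world data (placeholder)
    return [3, -1, 4, 1, 5, 9, 2, 6]

def clean(data):           # remove invalid records (placeholder rule)
    return [x for x in data if x >= 0]

def explore(data):         # summarize the data to guide the next step
    return {"n": len(data), "mean": sum(data) / len(data)}

def model(data, summary):  # fit a (deliberately trivial) model: predict the mean
    return lambda _x: summary["mean"]

def act(predictor):        # publish knowledge / take a data-driven action
    return predictor(None)

# Each pass through the cycle can open a new cycle of exploration,
# based on the findings obtained in the previous one.
data = collect()
for cycle in range(2):
    data = clean(data)
    summary = explore(data)
    predictor = model(data, summary)
    print(f"cycle {cycle}: n={summary['n']}, action={act(predictor):.2f}")
```

The point of the sketch is structural, not statistical: the stages form a loop rather than a one-way pipeline, which is precisely what distinguishes the data science workflow from a linear process.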


Fig. 2.4 The data life cycle (Berman et al., 2016)

The 2016 report of the Data Science Working Group of the NSF Computer and Information Science and Engineering Advisory Committee presented a broader perspective of the data science workflow, referred to as the data life cycle (see Fig. 2.4) (Berman et al., 2016). The data life cycle includes not only the workflow (from collection to action and publication) but also the environmental and social aspects of data science, such as regulations and ethics.

Exercise 2.4 Pedagogical implications of data science as a workflow
What pedagogical implications can you draw from the analysis of data science as a workflow?

Exercise 2.5 Environmental and social aspects of the data life cycle
Discuss the expression of the environmental and social aspects of the data life cycle (Berman et al., 2016).


2.6 Data Science as a Profession

In this section, we present the profession of the data scientist, who develops and applies data science methods, belongs to the community of data scientists, and adheres to its norms. As described in Sect. 2.1.3, the roots of data science are in industry; therefore, the definition of data science can be derived from the description of the profession of the data scientist. Irizarry (2020) proposed that the term data science was coined in order to improve communication between human resource recruiters in industry and job applicants. According to Irizarry,

As the demand in [sic] employees capable of completing data-driven projects increased, the term data scientist quickly became particularly prominent because it helped recruiters specify the type of employee they wanted. Postgraduate degrees in the two disciplines most associated with data analysis and management, Statistics and Computer Science, did not guarantee the expertise needed to successfully complete these projects. Programming skills and experience organizing and analysing messy, complex, and large datasets were fundamental. (Irizarry, 2020, Sect. 1, para. 4)

In their paper “Data scientist: The sexiest job of the twenty-first century” mentioned previously, Davenport and Patil (2012) describe the profession of data scientist as “a high-ranking professional with the training and curiosity to make discoveries in the world of big data… More than anything, what data scientists do is make discoveries while swimming in data. It’s their preferred method of navigating the world around them” (ibid, A New Breed, para. 1). Harris et al. (2013) surveyed data scientists who worked in a business environment in order to learn about their perception of the profession of data science. They found four types of data scientists—data businesspersons, data creators, data developers, and data researchers—and five clusters of expertise required by those professionals: statistics, programming, mathematics, machine learning and big data, and business.

Exercise 2.6 Types of data scientists
What skills are important for each type of data scientist identified by Harris et al. (2013)?

Exercise 2.7 Categories of data science skills
Chapter 8, which discusses data science as a research method, presents three categories of skills: cognitive, organizational, and technical. How are these three categories related to the five clusters of expertise presented by Harris et al. (2013)? Describe a scenario that illustrates your claim.


The NIST Big Data Interoperability Framework report defines the data scientist as "a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes in the data life cycle" (Chang & Grady, 2019, p. 8).

Exercise 2.8 Data science as a discipline and data science as a profession
Is the definition of the data scientist presented in the NIST Big Data Interoperability Framework report (Chang & Grady, 2019) related to the perspective of data science as a discipline? If yes, how?

Exercise 2.9 Pedagogical implications of data science as a profession
What pedagogical implications can you draw from the analysis of data science as a profession?

Exercise 2.10 Skills of the data scientists
Many articles present lists of skills required of data scientists. Read at least five such lists and create a unified list of the skills included in these lists. For each skill in the unified list, identify the perspectives of data science presented in this chapter for which it is especially important.

Exercise 2.11 Characteristics of data scientists on LinkedIn
Locate the profiles of at least ten data scientists on LinkedIn. What are the main skills mentioned in these profiles? What characteristics of data science do these skills highlight?


2.7 Conclusion

The various aspects, facets, and definitions of data science presented in this chapter highlight three main characteristics of data science: interdisciplinarity, diversity of learners, and data science as a research skill. In Part II—Opportunities and Challenges of Data Science Education—we discuss the educational challenges of data science that stem from each of these characteristics: the interdisciplinarity challenge (Chap. 6), challenges related to the variety of learners (Chap. 7), and challenges that result from the discipline's research orientation (Chap. 8). The five perspectives on data science presented in this chapter, as well as its three main characteristics, are mentioned in many chapters of this guide. To deepen their understanding of the essence of data science as an evolving field, as well as of its pedagogy, we recommend that readers keep considering these perspectives and characteristics while reading the different topics the guide addresses.

Exercise 2.12 Connections between the different facets of data science
This chapter presented the field of data science from five perspectives: as a science, a research method, a discipline, a workflow, and a profession. What mutual relationships exist between these perspectives?

Exercise 2.13 Connections between the different facets of data science and the main characteristics of data science
Discuss the connections between the five perspectives of data science presented in this chapter (as a science, a research method, a discipline, a workflow, and a profession) and the three main characteristics of data science: interdisciplinarity, diversity of learners, and data science as a research skill.

Exercise 2.14 Pedagogical implications of the multi-faceted analysis of data science
What pedagogical implications can you draw from the integration of the analysis of data science as a science, a research method, a discipline, a workflow, and a profession?


Exercise 2.15 Overview of the history of data science
Continue reading about the history of data science and present it according to the following points:

• The history of the name data science
• Significant milestones (events, conferences, publications, committees, etc.)
• Significant persons in the history of data science
• Stories that shaped the consciousness of data science
• Open questions in data science.

References

Al-Hashedi, K. G., & Magalingam, P. (2021). Financial fraud detection applying data mining techniques: A comprehensive review from 2009 to 2019. Computer Science Review, 40, 100402.
Alvargonzález, D. (2011). Multidisciplinarity, interdisciplinarity, transdisciplinarity, and the sciences. International Studies in the Philosophy of Science, 25(4), 387–403.
Berman, F. (co-chair), Rutenbar, R. (co-chair), Christensen, H., Davidson, S., Estrin, D., Franklin, M., Hailpern, B., Martonosi, M., Raghavan, P., Stodden, V., & Szalay, A. (2016). Realizing the potential of data science: Final report from the National Science Foundation Computer and Information Science and Engineering Advisory Committee data science working group. December 2016. https://www.nsf.gov/cise/ac-data-science-report/CISEACDataScienceReport1.19.17.pdf
Berman, F., Rutenbar, R., Hailpern, B., Christensen, H., Davidson, S., Estrin, D., Franklin, M., Martonosi, M., Raghavan, P., Stodden, V., & Szalay, A. S. (2018). Realizing the potential of data science. Communications of the ACM, 61(4), 67–72. https://doi.org/10.1145/3188721
Cassel, B., & Topi, H. (2015). Strengthening data science education through collaboration: Workshop report 7-27-2016. Arlington, VA.
Chang, W., & Grady, N. (2019). NIST big data interoperability framework: Volume 1, Definitions. Special Publication (NIST SP). National Institute of Standards and Technology. https://doi.org/10.6028/NIST.SP.1500-1r2
Cleveland, W. S. (2001). Data science: An action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1), 21–26.
Conway, D. (2010). The data science Venn diagram. Dataists. http://www.dataists.com/2010/09/the-data-science-venn-diagram/
Cox, M., & Ellsworth, D. (1997). Managing big data for scientific visualization. ACM Siggraph, 97(1), 21–38.
Danyluk, A., & Leidig, P. (2021). Computing competencies for undergraduate data science curricula. https://www.acm.org/binaries/content/assets/education/curricula-recommendations/dstf_ccdsc2021.pdf
Davenport, T. H., & Patil, D. (2012). Data scientist: The sexiest job of the 21st century. Harvard Business Review, 90(5), 70–76.
Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), 745–766.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37–54.


Gray, J. (2007). eScience—A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
Grimmer, J., Roberts, M. E., & Stewart, B. M. (2021). Machine learning for social science: An agnostic approach. Annual Review of Political Science, 24, 395–419.
Harris, H., Murphy, S., & Vaisman, M. (2013). Analyzing the analyzers: An introspective survey of data scientists and their work. O'Reilly Media, Inc.
Hey, T., Tansley, S., Tolle, K., & Gray, J. (2009). The fourth paradigm: Data-intensive scientific discovery (Vol. 1). Microsoft Research, Redmond.
Irizarry, R. A. (2020). The role of academia in data science education. Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.dd363929
Ishaq, A., Sadiq, S., Umer, M., Ullah, S., Mirjalili, S., Rupapara, V., & Nappi, M. (2021). Improving the prediction of heart failure patients' survival using SMOTE and effective data mining techniques. IEEE Access, 9, 39707–39716.
Jeff Wu, C. F. (2021). In Wikipedia. https://en.wikipedia.org/w/index.php?title=C._F._Jeff_Wu&oldid=1049935836
Johnstone, I., & Roberts, F. (2014). Data science at NSF. https://www.nsf.gov/attachments/129788/public/Final_StatSNSFJan14.pdf
Lovell, M. C. (1983). Data mining. The Review of Economics and Statistics, 65(1), 1–12.
Mohebbi, M., Vanderkam, D., Kodysh, J., Schonberger, R., Choi, H., & Kumar, S. (2011). Google correlate whitepaper.
Naur, P. (1966). The science of datalogy. Communications of the ACM, 9(7), 485.
National Science Board. (2005). Long-lived digital data collections: Enabling research and education in the 21st century. National Science Foundation Report NSB-05-04, September 2005. http://www.nsf.gov/pubs/2005/nsb05040
Piatetsky-Shapiro, G. (1990). Knowledge discovery in real databases: A report on the IJCAI-89 workshop. AI Magazine, 11(4), 68–70.
Piatetsky-Shapiro, G. (2000). Knowledge discovery in databases: 10 years after. ACM SIGKDD Explorations Newsletter, 1(2), 59–61.
Prebor, G. (2021). When feminism meets social networks. Library Hi Tech.
Provost, F., & Fawcett, T. (2013). Data science for business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.
Shearer, C. (2000). The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, 5(4), 13–22.
Skiena, S. S. (2017). The data science design manual. Springer.
Su, Y.-S., & Wu, S.-Y. (2021). Applying data mining techniques to explore user behaviors and watching video patterns in converged IT environments. Journal of Ambient Intelligence and Humanized Computing, 1–8.
Taylor, D. (2016). Battle of the data science Venn diagrams. KDnuggets. https://www.kdnuggets.com/battle-of-the-data-science-venn-diagrams.html/
Tukey, J. W. (1962). The future of data analysis. The Annals of Mathematical Statistics, 33(1), 1–67.
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
Vamathevan, J., Clark, D., Czodrowski, P., Dunham, I., Ferran, E., Lee, G., Li, B., Madabhushi, A., Shah, P., Spitzer, M., & Zhao, S. (2019). Applications of machine learning in drug discovery and development. Nature Reviews Drug Discovery, 18(6), 463–477.
Wu, J. (1997). Statistics = Data Science? http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf

Chapter 3

Data Science Thinking

Abstract This chapter highlights the cognitive aspect of data science. It presents a variety of modes of thinking associated with the different components of data science and describes the contribution of each one to data thinking—the mode of thinking required of data scientists (and not only professional ones). Indeed, data science thinking integrates the thinking modes associated with the various disciplines that make up data science. Specifically, computer science contributes computational thinking (Sect. 3.2.1), statistics contributes statistical thinking (Sect. 3.2.2), mathematics adds different ways in which data science concepts can be conceived (Sect. 3.2.3), and each application domain brings with it its own thinking skills, core principles, and ethical considerations (Sect. 3.2.4). Finally, based on these thinking modes, we present data thinking (Sect. 3.2.5). The definition of data science inspires the message that processes of solving real-life problems using data science methods should be based not only on algorithms and data, but also on application domain knowledge. In Sect. 3.3 we present a set of exercises that analyze the thinking skills associated with data science.

3.1 Introduction

Data science introduces a new mode of thinking, namely data thinking. Since data science education is clearly based on data thinking, before we move on to the other chapters of this guide, we must first describe what data science thinking is. In Chap. 4, we quote the National Academies of Sciences, Engineering, and Medicine's report entitled "Data Science for Undergraduates: Opportunities and Options" (2018), whose committee defines data acumen as "the ability to understand data, to make good judgments about and good decisions with data, and to use data analysis tools responsibly and effectively" (p. 12). In this chapter, we delve into the details of this data acumen by analyzing the concept of data thinking, which we interpret as the integration of the thinking modes associated with the disciplines that make up data science: computational thinking, which is associated with computer science

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_3



(Sect. 3.2.1), statistical thinking, which represents the data science component of statistics (Sect. 3.2.2), different conceptions of mathematical concepts characterized by the community of mathematics education (Sect. 3.2.3), and the thinking skills and core principles rooted in each application domain (Sect. 3.2.4). After each of these modes of thinking is presented, we discuss what data thinking means and conclude with exercises about these and additional modes of thinking related to data thinking.

Before delving into the details, we highlight two of the main ideas regarding the essence of data thinking that we wish to emphasize in this chapter:

(a) Since data science is about data analysis, in which both the algorithms and the data play an important role, both ways of thinking—the one that characterizes computer scientists (and focuses on algorithms) and the one that characterizes statisticians (and focuses on data)—should be learned, practiced, and mastered by data science learners, to different levels of depth according to the expertise expected from each learner.

(b) The application domain from which the data is taken also plays an important role in data thinking. This message is delivered in the various chapters of this guide; for instance, in the present chapter, we emphasize the unique contribution of the application domain to data thinking, highlighting the cognitive bias of domain neglect (which we identified in our research) and the need to be aware of its existence.

This chapter examines data science from a cognitive perspective and is linked to other chapters in this guide that deal with cognitive aspects of data science. Among them we should mention:

• Chapter 8, which discusses data science as a research discipline;
• Chapter 11, which introduces data science skills—cognitive and other;
• Chapter 13, which presents white box and black box understanding; and
• Sect. 14.5, which addresses base-rate neglect.

3.2 Data Thinking and the Thinking Skills Associated with Its Components

Data science is located at the intersection of computer science, mathematics and statistics, and the application domain from which the data is taken (see Fig. 3.1a). Each of these components highlights cognitive theories and habits of mind that are relevant for problem-solving processes associated with, or within the scope of, that component. Specifically,

• computer science highlights computational thinking;
• statistics proposes statistical thinking;

1. This section is based on Mike et al. (2022). Presented here with permission.


(a) Data science integrates computer science, statistics, mathematics, and an application domain.


(b) Data thinking integrates computational thinking, statistical thinking, mathematical thinking, and application domain thinking.

Fig. 3.1 Data science and data thinking

• mathematics offers a variety of learning theories, of which we present two: the process-object duality and the reduction of the level of abstraction; and
• the application domain from which the data is taken addresses the context in which the data science workflow takes place.

Based on an examination of the modes of thinking associated with the components of data science, we propose that data thinking—the thinking skills needed for dealing meaningfully with data science—integrates computational thinking, statistical thinking, different mathematical conceptions, and context-based thinking (see Fig. 3.1b). One of the main ideas highlighted by this perspective is that data science treats the algorithm and the data equally. That is, while computer science tends to focus on the algorithms (over data) and statistics tends to focus on the data (over algorithms), data science treats them equally—they are both important. Therefore, data thinking comprises both computational thinking and statistical thinking, as well as mathematical thinking and the thinking mode associated with the application domain.

3.2.1 Computational Thinking

Over the past two decades, an approach has emerged that claims that the thinking skills and modes of action applied in computer science are important skills for everyone, and not just for computer scientists. These skills were encapsulated in

2. This section is partially based on Chap. 4 of Hazzan et al. (2020). Presented here with permission.


the term computational thinking, first introduced by Papert (1990) and redefined by Wing (2006). One of the main arguments of the computational thinking approach is that it is useful and can be applied in all disciplines. Computational thinking is recognized today as one of the central twenty-first century skills. Furthermore, an understanding has developed that the basic ideas of computer science are fundamental and essential for functioning in all domains of life (Boholano, 2017; Harper, 2018). Although various definitions exist for computational thinking, the necessity of computational thinking for effective citizenship today, and certainly in the future, is widely agreed upon. In addition, it is recognized that acquiring computational thinking skills gives students socio-economic benefits and may close social gaps and promote social mobility. Accordingly, significant work has been done worldwide to promote computational thinking skills in educational systems at all ages: from kindergarten, through elementary, middle, and high schools, to academia.

Computational thinking emphasizes a variety of computer science problem-solving skills that promote learning experiences and support learning processes. Although different views and opinions have been expressed with respect to computational thinking, several of the main principles of Wing's approach are in the consensus:

(1) Computational thinking is a cognitive ability. It is a type of problem-solving process that involves the ability to design solutions to be implemented by a person or a computer or both (Günbatar, 2019).
(2) Computational thinking is about thinking processes, and so its implementation is independent of technology.
(3) With or without computers, some key skills and processes are usually mentioned with respect to computational thinking, including: problem formulation, dividing a problem into sub-problems, organization and logical analysis of data, representation of data with models and simulations using abstraction, suggestion and assessment of several solutions to a given problem, examination and implementation of the chosen solution, and generalization and transfer of the solution to a range of problems (Cuny et al., 2010; Google for Education, 2020; Hu, 2011; Wing, 2014).
(4) Computational thinking also includes social skills that support learning processes, such as teamwork, time management, and planning and scheduling tasks.
(5) The computational thinking approach does not emphasize the teaching of a specific subject matter; rather, it emphasizes the acquisition of broad and multidisciplinary knowledge and a set of skills that can be applied in a variety of contexts (Günbatar, 2019).

The different approaches to computational thinking reflect its tight connection to computer science. For example, a Royal Society position paper that inspired the development and assimilation of computational thinking in the UK defines computational thinking as follows:


Computational thinking is the process of recognizing aspects of computation in the world that surrounds us, and applying tools and techniques from Computer Science to understand and reason about both natural and artificial systems and processes. (The Royal Society, 2012, p. 29)

Studies show that the pedagogy of computational thinking, when integrated into all areas of studies, develops learners’ knowledge in those areas and promotes their problem-solving skills (DeSchryver & Yadav, 2015; Yadav et al., 2014). Furthermore, the application of computational thinking in different disciplines enables students to deepen their understanding of the discipline they are studying and, at the same time, to develop their computational thinking skills. This observation is important in the context of data science since data science itself is carried out in the context of the application domain.

Exercise 3.1 Computational thinking and data science online course
Review the details of MIT's free online course Introduction to Computational Thinking and Data Science at: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/.

(a) What topics does the course teach?
(b) Does the course highlight any connections between computational thinking and data science? If yes—which?
(c) Can you suggest any additional topics to be added to the course that would further highlight the different characteristics of data science?

3.2.2 Statistical Thinking

The term statistical thinking was coined by Deming (1986) and further developed by Moore (1990). It has since attracted considerable attention and much discussion within the statistics education community, and voices have emerged, calling for statistical thinking for everyone (Wallman, 1993). Statistical thinking is associated with an understanding of the essence, the characteristics, and the variability of real-life data (Cobb & Moore, 1997). According to Ben-Zvi and Garfield (2004), it "involves an understanding of why and how statistical investigations are conducted and the 'big ideas' that underlie statistical investigations" (p. 8). Specifically, statistical thinking (a) is based on the understanding that variation exists in any data source and that real-life data contain outliers, errors, biases, and variance; (b) addresses when and how to use specific statistical data analysis methods; (c) refers to the nature of sampling and how to infer from samples to populations; (d) includes statistical models and their usage; (e) considers the context of a given


problem when performing investigations and drawing conclusions; (f) covers the entire process of statistical inquiry; and (g) emphasizes the relevance of critique and evaluation of inquiry results (ibid.). For a discussion on the relevance of statistical thinking for learning processes of machine learning algorithms, the readership is referred to Chap. 14. Specifically, we highlight the importance of statistical thinking for the examination of training data; such examination requires an understanding of the fundamental nature of the data, which is the very core of statistical thinking.
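Characteristic (a), that real-life data contain outliers and variance, can be seen in a minimal sketch. The numbers below are illustrative and not from the text:

```python
import statistics

# Minimal sketch (illustrative numbers): real-life data contain variation
# and outliers, and the choice of summary statistic matters.
commute_minutes = [22, 25, 19, 24, 23, 26, 21, 180]  # 180 is a likely outlier

mean = statistics.mean(commute_minutes)      # pulled upward by the outlier
median = statistics.median(commute_minutes)  # robust to the outlier

print(mean, median)  # 42.5 23.5
```

The outlier pulls the mean far away from the bulk of the data while the median stays put, which is exactly the kind of judgment that characteristic (b), choosing when and how to use a specific analysis method, refers to.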

Exercise 3.2 Statistical thinking for everyone

The Tableau 2022 Data Trends report states that "the development of statistical thinking is an imperative today. Every individual must be able to synthesize data to support decision making, make sense of our world, and prepare for the future" (Setlur, 2022, p. 13). Review the context within which this statement is presented. What does this context imply with respect to the role of statistical thinking in our life, in general, and in data science problem-solving processes, in particular?

3.2.3 Mathematical Thinking

In this section, we review two theories that are borrowed from mathematics education research and are relevant for the examination of the thinking modes needed for data science, in general, and for machine learning algorithms, in particular. The two theories are the process-object duality theory (Sect. "The Process-Object Duality Theory") and the reduction of abstraction theory (Sect. "The Reduction of Abstraction Theory"). The relevance of these two theories for the discussion on data science education stems from the centrality of the concepts of algorithm and data in data science. As it turns out, the analysis of the understanding of these concepts, and of some associated concepts, according to these cognitive theories has several educational implications for the design of pedagogical tools for teaching machine learning concepts. For example, in Chap. 15, we explain the relevance of the process-object duality theory in the context of machine learning algorithms.

[Footnote: This section is based on Mike and Hazzan (2022a, 2022b). Machine learning for non-major data science students: A white box approach, special issue on Research on Data Science Education, The Statistics Education Research Journal (SERJ) 21(2), Article 10. Reprint is allowed by SERJ journal's copyright policy.]


The Process-Object Duality Theory

According to the process-object duality theory, abstract mathematical concepts can be represented in the human mind as either objects or processes (Sfard, 1991). As an object, a mathematical concept is conceived of as a fixed construct, and as a process, the same mathematical concept is conceived of as an algorithm or a computation that generates an output from an input. For example, as an object, the concept of a function can be conceived of as a set of ordered pairs {(x_i, y_i)}, whereas as a process, a function can be represented in the human mind as the steps required to calculate the function output value, y_i, for a given input value, x_i.

In the learning processes of most mathematical concepts, the learner goes through three phases. First, the concept is conceived of as a process. Then, the process is mentally packed (encapsulated) and an object representation is created in the learner's mind. In the final step, after the concept has been packed and has become an object, it can be used by the learner as a component of a more complex process. Conceiving of abstract mathematical concepts as objects, therefore, reflects a deeper understanding than does their conception as processes. An object conception is also considered a more abstract conception than a process conception, since when a mathematical concept is conceived of as an object, its details can be neglected and the concept can be investigated from different perspectives as an entity, by examining its properties, rather than having to deal with the details of the process represented by the concept.

The concept of procept, introduced by Gray and Tall (1994), represents the mathematical duality that exists between the understanding of a concept as a process and as an object. It reflects the idea that an advanced thinker can hold both mental structures in his or her mind and can move back and forth between the two according to the problem being solved.
Hence, a procept is a hybrid schema, an amalgam of the two representations, as a process and as an object.

Let us examine the process-object duality theory by referring mainly to the concept of algorithm, one of the central concepts of machine learning, whose origin is in computer science. The concept of algorithm is appropriate for this discussion since machine learning algorithms can be described as the "mathematical function mapping [of] some set of input values to output values" (Goodfellow et al., 2016, p. 5). It is therefore relevant to examine learners' conception of machine learning algorithms in terms of the understanding of mathematical concepts as processes and objects.

The understanding of machine learning algorithms requires students to understand mathematical concepts such as vectors, matrices, vector distances, dot products, derivatives and partial derivatives, and function optimization. None of these mathematical concepts is part of the non-major data science curriculum (Rabin et al., 2018). Teaching these algorithms requires, therefore, that the gaps in the students' mathematical knowledge be filled as part of a data science course or, alternatively, that some of the mathematical details be omitted and the gap filled with alternative intuitive explanations. In our paper on teaching machine learning using a white box approach (Mike & Hazzan, 2022a, 2022b), we demonstrate how the white box approach is applied to support students' conception of machine learning algorithms


as processes. Specifically, hands-on tasks, in which learners perform the calculations that the algorithm performs using pen and paper rather than computer software, enable them to follow the process of the algorithm and, in turn, support their understanding of the algorithm, at least as a process. For more on hands-on tasks, see Sect. 16.3.
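The function example given earlier in this section can be sketched in code. This is our own illustrative sketch, not from the book; the finite domain and the symmetry check are invented for the example:

```python
# Illustrative sketch: the function f(x) = x**2 represented in two ways that
# mirror the process-object duality.

# Process conception: an algorithm that computes an output from an input.
def f(x):
    return x ** 2

# Object conception: the same function as a fixed set of ordered pairs
# {(x_i, y_i)} over a finite domain.
f_as_object = {(x, f(x)) for x in range(-3, 4)}

# As an object, the function can be examined as an entity, e.g., checking a
# property such as symmetry (f(-x) == f(x)) without re-running the process.
is_symmetric = all((-x, y) in f_as_object for (x, y) in f_as_object)
print(is_symmetric)  # True
```

Moving between the two representations, running the computation when needed and treating the set of pairs as a single entity otherwise, is what the procept notion describes.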

Exercise 3.3 Concept formulation: processes, objects, and procepts

Select five data science concepts. For each of these concepts, describe its process conception, object conception, and procept conception. Based on these formulations, specify pedagogical guidelines for the teaching of each of these concepts.

Exercise 3.4 Data science from the perspective of the process-object duality theory

Data science itself can be conceived of as an object and as a process. As an object, data science can be conceived of as a discipline; as a process, data science can be conceived of as a workflow. See Sects. 2.4 and 2.5. Define the characteristics of each of these conceptions. What aspects of data science are highlighted by each conception?

Exercise 3.5 The process-object duality theory from the constructivist perspective

Read about the constructivist perspective of learning processes (see Sect. "Active Learning"). Can you describe connections between the process-object duality and constructivism?

The Reduction of Abstraction Theory

From a wider perspective, the process-object duality of a mental representation of mathematical concepts is associated with the phenomenon of reducing the abstraction level when learning abstract mathematical concepts (Hazzan, 1999). In general, students who are required to learn mathematical concepts that are too abstract for


their current mental representation (i.e., their current conception of the concepts) use several mechanisms to reduce the level of abstraction. One of these mechanisms is based on the dual process-object representation of mathematical concepts, presented above. In this case, the reduction of the level of abstraction is exhibited when students conceive of concepts that are too abstract (for them) as processes rather than as objects. This claim is based on the fact (as explained above) that an object conception is considered a more abstract conception than a process conception.

The theory of reducing the abstraction of mathematical concepts has been generalized for learning processes of computer science concepts as well (Hazzan, 2003a, 2003b; Hazzan & Hadar, 2005), and so, since computer science is a component of data science, it can also be used to describe students' understanding of data science concepts. In the context of data science education, reduction of the level of abstraction can be illustrated, for example, using the concept of data:

(a) Data with specific features are more concrete (less abstract) than data with abstract features. Thus, images with the features "red" and "green" are more concrete than objects with the features "a" and "b".
(b) Data with specific values are more concrete than data with abstract values; an image with the features [107, 83, city] is more concrete than an object with the features [x_1, x_2, x_3].
(c) Data with meaning in the real world are more concrete than just numbers. Thus, a house with 5 rooms and 2 bathrooms is more concrete than a house with the features (5, 2).

Reducing the level of abstraction should, however, be done very carefully. If it is done too frequently, learners may end up conceiving of the specific (less abstract) case as the general (more abstract) case, which may, in turn, ultimately limit their problem-solving skills and abilities.

Exercise 3.6 Reducing abstraction and thinking on different levels of abstraction

In Sect. 11.2.1, we discuss the cognitive skill of thinking on different levels of abstraction. Explore the connections between this skill and the idea of reducing the level of abstraction.
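The (a)-(c) illustrations of the concept of data can also be sketched in code. This is our own illustrative sketch; the house record and feature names are invented for the example:

```python
# Illustrative sketch: the same house record represented at three levels of
# abstraction, from most concrete (a) to most abstract (c).

house_concrete = {"rooms": 5, "bathrooms": 2}  # (a) real-world meaning
house_named = {"x1": 5, "x2": 2}               # (b) abstract names, specific values
house_vector = (5, 2)                          # (c) a bare feature vector

# Computations on the abstract form should preserve the domain meaning;
# the mapping below records which vector position carries which meaning.
feature_names = ("rooms", "bathrooms")
recovered = dict(zip(feature_names, house_vector))
print(recovered == house_concrete)  # True
```

Working at level (c) is fine as long as the mapping back to (a) is kept explicit; losing it is one way learners come to treat the specific case as the general one.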

Exercise 3.7 Describing data science concepts on different levels of abstraction

Select five data science concepts and describe each of them on three different levels of abstraction. Reflect on the process you went through:

(a) What were your considerations when formulating the three descriptions?


(b) How did you decide what level of abstraction each description represents?
(c) How did you decide on the hierarchy between the three descriptions in terms of abstraction levels?
(d) Based on these formulations, compose guidelines for such tasks, to be applied to other data science concepts.
(e) How can these descriptions of the concepts, formulated on different levels of abstraction, be used for pedagogical purposes?
(f) Any additional thoughts you have about this activity.

3.2.4 Application Domain Thinking

In this section we highlight the context, that is, the application domain from which the data are taken, and we show that it must be taken into consideration in data science problem-solving processes. We show that there is a tendency to ignore the application domain, especially in unfamiliar situations. We call this tendency the domain neglect cognitive bias and discuss its potential to cause harm. It is important to pay attention to domain neglect when teaching data science, since students should be aware of the context in which the data are being analyzed and should acknowledge its importance when evaluating algorithm performance. For example, while a face recognition algorithm with 99.9% accuracy is acceptable in the context of organizing our photo album, in the contexts of autonomous cars, payment systems, and passport verification it is not (see Sect. 14.5 for more on machine learning performance indicators). This example highlights not only the interdisciplinarity of data science, but also the understanding that the application domain component of data science should be taken into consideration, together with the other components of data science, both when solving problems in data science, in general, and when evaluating the performance of an algorithm, in particular.

The Domain Neglect Bias Experiment

Cognitive biases are unconscious and automatic processes of the human brain that evolved to make decision making quicker and more efficient. As such, they may lead to irrational decision-making processes (Kahneman & Tversky, 1973). One of the biases that we identified in our research on data science education is that learners fail to consider the application domain when interpreting the performance

[Footnote: This section is based on the following paper: © 2022 IEEE. Reprinted, with permission, from Mike and Hazzan (2022a, 2022b).]


of machine learning algorithms. In the spirit of other cognitive biases (see Sect. 14.5 for a discussion on the base-rate neglect), we termed this bias the domain neglect.

In our experiment, we presented students with a questionnaire containing two questions about classifiers in two different application domains: transportation and medical diagnosis. In both questions, the participants were asked to compare the performance of two classifiers. Specifically, each question presented two machine learning algorithms with different performances, and the students were asked to determine which performed better (see Tables 3.1 and 3.2, respectively). Both questions presented the students with five optional answers to choose from. To the three comparison options (Machine A performs better, Machine B performs better, and both machines perform equally well) we added two more options: It is not possible to answer the question since data is missing and I do not know. In addition to the multiple-choice question, the students were also presented with an open-ended question that asked them to explain their answer to the multiple-choice question.

Table 3.1 The traffic light classification question

Researchers developed two machine learning algorithms to classify the red light and the green light of a traffic light. Machine A is always correct when the light is red, but is wrong with a probability of 0.001 when the light is green. Machine B is always correct when the light is green, but is wrong with a probability of 0.0001 when the light is red. Which of these machines is better in your opinion?
(a) Machine A
(b) Machine B
(c) Both are the same
(d) It is not possible to answer the question since data is missing
(e) I do not know
Please explain your choice: _____________________________

Table 3.2 The carcinoma classification question

Researchers developed two machine learning algorithms to classify two types of carcinomas: SCC and BCC. Machine A is always correct when the carcinoma is of type SCC, but is wrong with a probability of 0.001 when the carcinoma is of type BCC. Machine B is always correct when the carcinoma is of type BCC, but is wrong with a probability of 0.001 when the carcinoma is of type SCC. Which of these machines is better in your opinion?
(a) Machine A
(b) Machine B
(c) Both are the same
(d) It is not possible to answer the question since data is missing
(e) I do not know
Please explain your choice: _____________________________


Exercise 3.8 Comparing the performance of machine learning algorithms

Before you go on reading, answer the questions in Tables 3.1 and 3.2. How confident are you about your answers?

As the two problem formulations indicate:

(a) While the performance of the machine learning algorithms was described verbally, the meaning of their performance in the real world was not described verbally and had to be interpreted by the participants.
(b) Although the two questions were similar from a mathematical perspective, they differed in the application domains they dealt with. While the traffic light classification question addressed an application domain with which the students were well acquainted from their daily life, the carcinoma classification question referred to medical diagnosis, which (hopefully) most students were not familiar with from their daily life.

Table 3.3 summarizes the context and social considerations of the two questions.

Table 3.3 The context and social considerations of the experiment questions

The context
- The traffic light classification question (see Table 3.1): Transportation: the light in a traffic light
- The carcinoma classification question (see Table 3.2): Medicine: two types of carcinomas: basal cell carcinoma (BCC) and squamous cell carcinoma (SCC)

Students' familiarity with the context
- The traffic light classification question: Students were familiar with the context
- The carcinoma classification question: Students were not familiar with the context

Social considerations
- The traffic light classification question: Although Machine B makes only a tenth of the mistakes that Machine A makes, the fact that Machine A is always correct when the light is red should have alerted the students to the consequences of each type of mistake
- The carcinoma classification question: Although both machines may err with the same probability, 0.001, since SCC is a more severe disease than BCC, the consequences of errors in the SCC case are also more severe
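The social consideration in the traffic light question can be made concrete with a small expected-cost computation. The base rates and error costs below are hypothetical assumptions chosen only for illustration; they are not part of the study:

```python
# Hypothetical numbers (not from the study): base rates of the two lights and
# a cost per error type, chosen only to illustrate domain-aware comparison.
p_red, p_green = 0.5, 0.5      # assumed base rates of red and green lights
cost_miss_red = 1000.0         # classifying red as green: risk of an accident
cost_miss_green = 1.0          # classifying green as red: an unneeded stop

def expected_cost(p_err_on_red, p_err_on_green):
    """Expected cost per observed light, weighting each error by its impact."""
    return (p_red * p_err_on_red * cost_miss_red
            + p_green * p_err_on_green * cost_miss_green)

cost_a = expected_cost(0.0, 0.001)    # Machine A: never errs on red
cost_b = expected_cost(0.0001, 0.0)   # Machine B: never errs on green

print(f"Machine A: {cost_a}, Machine B: {cost_b}")
```

Under these assumed costs, the domain-aware comparison reverses the purely mathematical one: Machine B makes ten times fewer errors overall, yet Machine A has the lower expected cost because its errors are of the harmless kind.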


Analysis of Students' Explanations from the Application Domain Perspective

To demonstrate the importance of the application domain of data science, we focus on the students' answers to the open questions. (The analysis of the closed questions is presented in Mike and Hazzan (2022a, 2022b).) Our analysis of the answers to the open questions, in which the students were requested to explain their answers to the closed questions, revealed three types of answers. We illustrate them for the case of the traffic light question.

(a) Comparisons based on considerations that pertain to the application domain. For example, one student explained that "it is more important not to make mistakes on a red light to reduce car accidents".
(b) Comparisons based on mathematical considerations. For example, one student explained that "Machine B's error probability is lower", ignoring the consequences of the cases in which both machines are 100% correct.
(c) Debating: comparisons based on both application domain considerations and mathematical considerations, without deciding which consideration is more important. For example, one student wrote: "I guess the mistake in red classification is more critical in the context of accidents. So even though B is much less wrong, its mistakes are costlier. It remains to be decided whether the mistake in red will indeed 'happen' 10 times more often than the mistake in green in order to determine which is a better classifier. And this is information that does not appear in the question (and in any case may be moral)."

To characterize the domain neglect, we examined what type of answers students gave in each of the two cases: transportation and medical diagnosis. Students who, in both cases (familiar and unfamiliar context), explained their choices based on application domain considerations did not exhibit the domain neglect cognitive bias. The other students exhibited two types of the domain neglect:

• Mathematics-driven domain neglect: In both cases (familiar and unfamiliar context), these students explained their choices based on mathematical considerations, ignoring the context of the question.
• Familiarity-driven domain neglect: These students offered a different kind of explanation, depending on their familiarity with the application domain (or lack thereof):
– In the familiar application domain, the students based their answers on arguments taken from the application domain (application domain considerations or debating);
– in the unfamiliar application domain, the students based their answers on mathematical considerations, ignoring the application domain.

Figures 3.2 and 3.3 present the percentages of each type of answer to the open questions.


Fig. 3.2 Traffic light classification explanations: distribution of answer categories (n = 98)

Fig. 3.3 Carcinoma classification explanations: distribution of answer categories (n = 88)

Table 3.4 presents the familiarity-driven domain neglect, specifying the shifts in explanation type between the two questions. As can be seen, the students used different considerations depending on their familiarity with the context of the problem. Indeed, 13% of the students exhibited the familiarity-driven domain neglect, switching from an application domain-driven explanation in the familiar application domain to a mathematical explanation in the unfamiliar application domain. Of these 13% who switched to the mathematical explanation in the unfamiliar context, 5% switched from the application domain-driven considerations (63% vs. 58%) and 8% switched from the debating considerations (9% vs. 1%). Figure 3.4 presents the traffic light question explanation types versus the carcinoma classification question explanation types (rows represent the students' type of answer to the traffic light classification question and columns represent the students' answers to the carcinoma classification question).


Exercise 3.9 Analysis of the familiarity-driven domain neglect

We further investigate the familiarity-driven domain neglect with the following two questions:

Question 1: Assume you are working for a company that develops products for transportation design. You are asked to recruit an employee to a machine learning team. You must select one of the following five candidates. Who would be your preference?
(a) Expertise in transportation 0%; expertise in machine learning 100%.
(b) Expertise in transportation 20%; expertise in machine learning 80%.
(c) Expertise in transportation 50%; expertise in machine learning 50%.
(d) Expertise in transportation 80%; expertise in machine learning 20%.
(e) Expertise in transportation 100%; expertise in machine learning 0%.

Question 2: Assume you are working for a company that develops products for medical diagnosis. You are asked to recruit an employee to a machine learning team. You must select one of the following five candidates. Who would be your preference?
(a) Expertise in medicine 0%; expertise in machine learning 100%.
(b) Expertise in medicine 20%; expertise in machine learning 80%.
(c) Expertise in medicine 50%; expertise in machine learning 50%.
(d) Expertise in medicine 80%; expertise in machine learning 20%.
(e) Expertise in medicine 100%; expertise in machine learning 0%.

Based on the familiarity-driven domain neglect, predict how learners would respond to these questions.

Table 3.4 The familiarity-driven domain neglect

Types of answers | Traffic (%) | The difference between traffic and medicine | Medicine (%)
Application domain considerations | 63 | 5% ↓ | 58
Mathematics considerations | 28 | 13% ↑ | 41
Debating | 9 | 8% ↓ | 1


Fig. 3.4 Traffic light question explanation types versus carcinoma classification question explanation types (n = 88)

Exercise 3.10 Developing questions that demonstrate the domain neglect

Compose five questions that may expose the domain neglect, in general, and the familiarity-driven domain neglect, in particular. Describe how you developed your questions. Check your prediction regarding the exhibition of the domain neglect and the familiarity-driven domain neglect among a relevant population of your choice (students, colleagues, etc.).

Exercise 3.11 Outcomes of cognitive biases

In its 2017 report, Lessons from Early AI Projects, the Gartner organization predicted that by 2022, 85% of all AI projects would deliver erroneous outcomes due to bias in data, algorithms, or the teams responsible for managing them.
(a) Check this prediction. Was the report correct?
(b) Suggest several situations in which the domain neglect, in general, and the familiarity-driven domain neglect, in particular, can cause such outcomes.


Addressing the Domain Neglect in Context

Clearly, it is important to increase our students' awareness of the prevalence and possible implications of the domain neglect cognitive bias. For example, we use the traffic light classification question and the carcinoma classification question (Tables 3.1 and 3.2) as a pre-lesson activity for the topic of measuring the performance of machine learning algorithms. The intention of this pre-lesson activity is to reflect to the students their own biases, thus enhancing their understanding of the broad considerations required when implementing machine learning algorithms in real-world applications.

From a wider perspective, it is important to keep reminding the students that it is crucial to consider the application domain when evaluating the performance of machine learning algorithms. Since the application domain is, in many cases, related to some social aspect of our lives, the discussion we hold with the students about the domain neglect is a suitable opportunity to also discuss with them the importance of the social aspects of data science (see Chap. 12) and, specifically, the importance of considering the application domain.

A Cognitive Perspective of the Domain Neglect: The Dual Process Theory

So far, we have explained the students' answers to the traffic light and carcinoma classification problems using the domain neglect cognitive bias. But how can the domain neglect bias itself be explained? What are its cognitive roots? To explain the origin of the domain neglect, we use the dual process theory of cognitive psychology (Kahneman, 2002; Tversky & Kahneman, 1983). The following description of the dual process theory is partially based on Leron and Hazzan (2009).

According to dual process theory, our cognition and behavior operate in parallel in two quite different modes, called System 1 (S1) and System 2 (S2), roughly corresponding to our commonsense notions of intuitive and analytical modes of thinking, respectively. S1 processes are characterized as being fast, automatic, effortless, unconscious, and inflexible (hard to change or overcome). S2 processes, on the other hand, are slow, conscious, effortful, and computationally expensive (drawing heavily on working memory resources). The two systems differ mainly on the dimension of accessibility: how fast and how easily things come to mind. In most situations, S1 and S2 work in concert to produce adaptive responses but, in some cases, S1 generates quick automatic nonnormative responses, while S2 may or may not intervene in its role as monitor and critic to correct or override S1's response. Many of the nonnormative answers people give in psychological experiments can be explained by the quick and automatic responses of S1, and the frequent failure of S2 to intervene in its role as S1's critic.

[Footnote: See Leron and Hazzan (2009). Presented here with permission.]


We suggest that, both in the case of the mathematics-driven domain neglect and in the case of the familiarity-driven domain neglect, an S1 response is invoked by what is most immediately accessible to the students in the situation, which also looks roughly appropriate for the task at hand, that is, the numbers that represent the classifiers' performance. In addition to S1's inappropriate reaction, S2 too fails in its role as S1's critic, since there is nothing in the task situation to alert the monitoring function of S2, in general, and in the familiarity-driven domain neglect, in particular, since the participants are unfamiliar with the required medical knowledge. This invocation of S1, which causes the domain neglect, may lead the students to neglect the social implications of deploying machine learning algorithms in the real world.

Exercise 3.12 Teaching possible effects of cognitive biases on the interpretation of machine learning models

Read the paper: Kliegr et al. (2021). A review of possible effects of cognitive biases on interpretation of rule-based machine learning models. Artificial Intelligence, 103458. Choose five of the twenty cognitive biases presented in the article and suggest how they may be addressed in data science courses.

3.2.5 Data Thinking

Based on the above examination of data science thinking skills as they relate to the disciplines that make up data science, our interpretation of data thinking integrates computational thinking (from computer science), statistical thinking (from statistics), mathematical thinking (from mathematics), and application domain thinking (from the application domain) (see Fig. 3.1b). Specifically, data thinking is the understanding that a solution to a real-life problem should be based not only on data and algorithms, but also on the application domain knowledge that governs those data and algorithms. Data thinking asks whether the data offer a good representation of the real-life situation. It also addresses how the data were collected and asks whether the data collection can be improved. It analyzes the data not only logically but also statistically, using visualizations and statistical methods to find patterns as well as irregular phenomena. Specifically, data thinking is the understanding that:

(a) data are not just numbers to be presented by an adequate data model; these numbers have a meaning that is derived from the application domain;
(b) any process or calculation performed on the data should preserve the meaning of the relevant application domain;
(c) problem abstraction is application domain-dependent, and generalization is subject to biases and variance in the data;
(d) lab testing is not enough, and real-life implementation will always encounter unexpected data and situations, and so improving the models and solutions to


a given problem is a continuous process that includes, among other activities, constant and iterative monitoring and data collection.

To illustrate our claims regarding the added value of data thinking for problem-solving processes, which stems from the combination of computational thinking, statistical and mathematical thinking, and application domain thinking, we present an example from our own discipline, education.

Consider the problem of the dropout rate in higher education. From a computational thinking perspective, a model may be suggested that monitors and checks students' progress, from their entrance into higher education institutions, through all possible study programs, until their graduation. We can also consider additional aspects, such as the students' attributes before embarking on higher education. By adding such considerations, we can predict their success and accept only those students with the highest probability of graduating. Such research is indeed being carried out using machine learning methods to predict student success (e.g., Alyahyan & Düştegör, 2020).

This solution, however, may be biased: by adding statistical thinking, we can question whether the available data are indeed the best possible for this task. Using current student data, we in fact intensify the biases of the existing admission system (Murrell, 2019) by overlooking those applicants who were not accepted; their potential for success is, therefore, not measured or modeled. Building on our disciplinary knowledge of education, we suggest that other factors influencing success, such as motivation, should be added to the prediction model. Thus, using application domain knowledge, we can further improve our computational and statistical models. In this way, we may generate a prediction model based on computational thinking, mathematical and statistical thinking, and application domain thinking, i.e., based on data thinking.
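The admission-bias argument in the dropout example above can be illustrated with a toy simulation. All distributions, thresholds, and the success rule below are invented assumptions, not real data or a real admission model:

```python
import random

# Hypothetical simulation (all numbers invented): why modeling success only
# on admitted students intensifies the biases of the existing admission system.
random.seed(0)

applicants = [{"score": random.gauss(60, 15),        # entrance exam score
               "motivation": random.gauss(50, 15)}   # not measured at admission
              for _ in range(10_000)]

def graduates(a):
    # Ground truth in this toy world: success depends on score AND motivation.
    return 0.5 * a["score"] + 0.5 * a["motivation"] > 55

# The current system admits on score alone, so only these records ever reach
# a success-prediction model trained on enrolled students.
admitted = [a for a in applicants if a["score"] >= 65]

# Rejected applicants with high motivation would largely have succeeded, but
# their outcomes are never observed.
overlooked = [a for a in applicants
              if a["score"] < 65 and a["motivation"] >= 70]

rate_overlooked = sum(graduates(a) for a in overlooked) / len(overlooked)
print(f"graduation rate among overlooked high-motivation applicants: "
      f"{rate_overlooked:.2f}")
```

Because rejected applicants never generate outcome data, a model trained only on the admitted population cannot discover that high motivation compensates for a lower entrance score; this is the bias intensification the text describes.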

Exercise 3.13 Cases in which data thinking is important
In the spirit of the higher education dropout rate problem presented above, suggest another case, from a field you are familiar with, that illustrates the importance and applicability of data thinking. Reflect on the process you went through in developing your case. What guidelines can you suggest for the development process of such cases?

3.3 Thinking About Data Science Thinking

This section presents several exercises whose purpose is to foster thinking about data thinking as a whole, rather than thinking about its individual components, as we have done in the previous sections.


Exercise 3.14 Application domain knowledge and real-life data in data thinking
Two aspects of data thinking are application domain knowledge and real-life data. Explore the mutual connections of these two aspects to each component of data thinking: computational thinking, mathematical and statistical thinking, and application domain thinking.

Exercise 3.15 Data thinking and other cognitive aspects of data science
In addition to this chapter, the cognitive aspect of data science is addressed in other chapters of this guide as well. Explore connections between data thinking as presented in this chapter and the cognitive aspects of data science presented in the following chapters:
• Chapter 8, which discusses data science as a research discipline;
• Chapter 11, which introduces data science skills—cognitive and other;
• Chapter 13, which presents white box and black box understanding; and
• Section 14.5, which addresses base-rate neglect.

Exercise 3.16 Additional modes of thinking required for data science
Different publications on data science skills mention different thinking skills as being required in order to deal meaningfully with data science. These include, among others, analytical thinking, critical thinking, and data literacy. Explore these thinking skills (and others you may find), as well as the interconnections between them and the various thinking skills presented in this chapter.


Exercise 3.17 Analytical thinking
In Chap. 4, we describe the formation of the field of data science education. One of the reports we present is the “Curriculum Guidelines for Undergraduate Programs in Data Science” (De Veaux et al., 2017). The report lists the following core competences for data science graduates (p. 6):
• Analytical thinking (computational thinking and statistical thinking)
• Mathematical foundations
• Model building and assessment
• Algorithms and software foundation
• Data curation
• Knowledge transference—communication and responsibility

De Veaux et al. (2017) propose that analytical thinking is one of the required core competences for undergraduate data science students, presenting it as “computational thinking and statistical thinking”. How does this perspective on analytical thinking relate to other definitions of analytical thinking presented in the literature?

3.4 Conclusion

To conclude this chapter, we highlight three ideas presented in it that reflect the interdisciplinary and multifaceted nature of data science. First, different modes of thinking contribute to data thinking—the mode of thinking required for doing meaningful data science. Second, data thinking combines thinking skills that it inherits from each component of data science. And third, the application domain is of importance in the data science workflow and must not be neglected.

References

Alyahyan, E., & Düştegör, D. (2020). Predicting academic success in higher education: Literature review and best practices. International Journal of Educational Technology in Higher Education, 17(1), 1–21.
Ben-Zvi, D., & Garfield, J. B. (2004). The challenge of developing statistical literacy, reasoning and thinking. Springer.
Boholano, H. (2017). Smart social networking: 21st century teaching and learning skills. Research in Pedagogy, 7(1), 21–29.
Cobb, G. W., & Moore, D. S. (1997). Mathematics, statistics, and teaching. The American Mathematical Monthly, 104(9), 801–823. https://doi.org/10.1080/00029890.1997.11990723


Cuny, J., Snyder, L., & Wing, J. M. (2010). Demystifying computational thinking for non-computer scientists. Unpublished manuscript in progress, referenced in https://www.cs.cmu.edu/~CompThink/resources/TheLinkWing.pdf
De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., Bryant, L., Cheng, L. Z., Francis, A., Gould, R., Kim, A. Y., Kretchmar, M., Lu, Q., Moskol, A., Nolan, D., Pelayo, R., Raleigh, S., Sethi, R. J., Sondjaja, M., Tiruviluamala, N., et al. (2017). Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and Its Application, 4(1), 15–30. https://doi.org/10.1146/annurev-statistics-060116-053930
Deming, W. E. (1986). Out of the crisis. Cambridge: Massachusetts Institute of Technology.
DeSchryver, M. D., & Yadav, A. (2015). Creative and computational thinking in the context of new literacies: Working with teachers to scaffold complex technology-mediated approaches to teaching and learning. Journal of Technology and Teacher Education, 23(3), 411–431.
Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning. MIT Press.
Google for Education: Computational Thinking. (2020). https://edu.google.com/resources/programs/exploring-computational-thinking/
Gray, E. M., & Tall, D. O. (1994). Duality, ambiguity, and flexibility: A “proceptual” view of simple arithmetic. Journal for Research in Mathematics Education, 25(2), 116–140. https://doi.org/10.5951/jresematheduc.25.2.0116
Günbatar, M. S. (2019). Computational thinking within the context of professional life: Change in CT skill from the viewpoint of teachers. Education and Information Technologies, 24(5), 2629–2652.
Harper, B. (2018). Technology and teacher–student interactions: A review of empirical research. Journal of Research on Technology in Education, 50(3), 214–225.
Hazzan, O. (1999). Reducing abstraction level when learning abstract algebra concepts. Educational Studies in Mathematics, 40(1), 71–90.
Hazzan, O. (2003a). How students attempt to reduce abstraction in the learning of mathematics and in the learning of computer science. Computer Science Education, 13(2), 95–122.
Hazzan, O. (2003b). Reducing abstraction when learning computability theory. Journal of Computers in Mathematics and Science Teaching, 22(2), 95–117.
Hazzan, O., & Hadar, I. (2005). Reducing abstraction when learning graph theory. Journal of Computers in Mathematics and Science Teaching, 24(3), 255–272.
Hazzan, O., Ragonis, N., & Lapidot, T. (2020). Guide to teaching computer science: An activity-based approach. Springer.
Hu, C. (2011). Computational thinking: What it might mean and what we might do about it. In Proceedings of the 16th annual joint conference on innovation and technology in computer science education (pp. 223–227).
Kahneman, D. (2002). Maps of bounded rationality: A perspective on intuitive judgment and choice. Nobel Prize Lecture, 8(1), 351–401.
Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80(4), 237–251. https://doi.org/10.1037/h0034747
Kliegr, T., Bahník, Š., & Fürnkranz, J. (2021). A review of possible effects of cognitive biases on interpretation of rule-based machine learning models. Artificial Intelligence, 103458.
Leron, U., & Hazzan, O. (2009). Intuitive vs analytical thinking: Four perspectives. Educational Studies in Mathematics, 71(3), 263–278.
Lessons from early AI projects. (2017). Gartner. https://www.gartner.com/en/documents/3834749
Mike, K., & Hazzan, O. (2022a). Machine learning for non-majors: A white box approach. Statistics Education Research Journal, 21(2), Article 10.
Mike, K., & Hazzan, O. (2022b). What is common to transportation and health in machine learning education? The domain neglect bias. IEEE Transactions on Education. https://doi.org/10.1109/TE.2022.3218013
Mike, K., Ragonis, N., Rosenberg-Kima, R., & Hazzan, O. (2022). Computational thinking in the era of data science. Communications of the ACM, 65(8), 31–33. https://doi.org/10.1145/3545109


Moore, P. G. (1990). The skills challenge of the nineties. Journal of the Royal Statistical Society. Series A (Statistics in Society), 153(3), 265. https://doi.org/10.2307/2982974
Murrell, A. (2019). Big data and the problem of bias in higher education. Forbes. https://www.forbes.com/sites/audreymurrell/2019/05/30/big-data-and-the-problem-of-bias-in-higher-education/
National Academies of Sciences, Engineering, and Medicine. (2018). Data science for undergraduates: Opportunities and options. The National Academies Press. https://doi.org/10.17226/25104
Papert, S. (1990). Mindstorms: Children, computers and powerful ideas. Basic Books.
Rabin, L., Fink, L., Krishnan, A., Fogel, J., Berman, L., & Bergdoll, R. (2018). A measure of basic math skills for use with undergraduate statistics students: The MACS11.
Setlur, V. (2022). AI augments and empowers human expertise. Tableau. https://www.tableau.com/sites/default/files/2022-02/Data_Trends_2022.pdf
Sfard, A. (1991). On the dual nature of mathematical conceptions: Reflections on processes and objects as different sides of the same coin. Educational Studies in Mathematics, 22(1), 1–36. https://doi.org/10.1007/BF00302715
The Royal Society. (2012). Shut down or restart? The way forward for computing in UK schools. Royal Society.
Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90(4), 293.
Wallman, K. K. (1993). Enhancing statistical literacy: Enriching our society. Journal of the American Statistical Association, 88(421), 1–8. https://doi.org/10.1080/01621459.1993.10594283
Wing, J. M. (2006). Computational thinking. Communications of the ACM, 49(3), 33–35. https://doi.org/10.1145/1118178.1118215
Wing, J. M. (2014). Computational thinking benefits society. 40th Anniversary Blog of Social Issues in Computing.
Yadav, A., Mayfield, C., Zhou, N., Hambrusch, S., & Korb, J. T. (2014). Computational thinking in elementary and secondary teacher education. ACM Transactions on Computing Education (TOCE), 14(1), 1–16.

Chapter 4

The Birth of a New Discipline: Data Science Education

Abstract Data science is a young field of research, and its associated educational knowledge—data science education—is even younger. As of the time of writing this book, data science education has not yet gained recognition as a distinct field and is mainly discussed in the context of the education of the disciplines that make up data science, i.e., computer science education, statistics education, mathematics education, and the educational fields of the application domains, such as medical education, business analytics education, or learning analytics. There are, however, voices calling to integrate the relevant knowledge from these educational disciplines and to form a coherent and integrative data science education body of knowledge, based on which data science programs can be designed. In this chapter, we present the story of the birth of the field of data science education by describing its short history. We focus on the main efforts invested in the design of an undergraduate data science curriculum (Sect. 4.2) and on the main initiatives aimed at tailoring a data science curriculum for school pupils (Sect. 4.3). We also suggest several meta-analysis exercises that examine these efforts (Sect. 4.4).

4.1 Introduction

What will data science be in 10 or 50 years? The answer to this question is in the hands of the next-generation researchers and educators (Wing, 2020, Closing Remarks, para. 2).

In this chapter, we tell the story of the birth of data science education by presenting its short history. We show that not only is data science a young field of research and practice, but that the field of data science education is even younger. Indeed, a brief look at the reference list used to prepare the review presented in this chapter reveals that the oldest reference is from 2014! Accordingly, and not surprisingly, data science education has not yet been recognized as a distinct field and is mainly discussed in the context of the education of the disciplines that compose data science, i.e., computer science education, statistics education, mathematics education, and the educational fields of the application domains, such as medical education and business analytics education.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_4


At the same time, efforts are being invested in integrating the disciplines that make up data science, in an attempt to develop a data science curriculum (Cassel & Topi, 2015; Danyluk & Leidig, 2021; De Veaux et al., 2017; EDISON: Building the Data Science Profession, n.d.; National Academies of Sciences, Engineering, and Medicine, 2018). This integration task is challenging not only due to the inherent difficulty of creating an interdisciplinary program in general, but also due to the unique interdisciplinary essence of data science, which, in addition to computer science, mathematics, and statistics, should offer the option of adding any application domain. It is therefore important to view both data science and data science education as interdisciplinary fields, where the latter integrates knowledge, theories, and pedagogies taken from computer science education, statistics education, mathematics education, and the educational fields of various application domains from different disciplines and a variety of cultures (Hazzan & Mike, 2021). In addition, data science education should build on the theory and practice of interdisciplinary education.

Exercise 4.1 Interdisciplinary education
Read several resources that discuss interdisciplinary education and summarize its main ideas. Suggest how these ideas can be implemented in the case of data science education.

Although the main wave of discussions about data science education has taken place since 2015, it is important to note that, in response to the market demand for data scientists, hundreds of new data science programs were launched even prior to 2015, and about 200 colleges and universities were already offering both graduate and undergraduate degrees in data science in 2015 (Cassel & Topi, 2015). Some of these programs were launched as programs in related topics such as business analytics or knowledge discovery in databases (KDD), and only later did they evolve into data science programs. For example, the College of Charleston in South Carolina launched an undergraduate degree in KDD in 2003 as a joint venture between several faculties: mathematics, statistics, computer science, and biology (Anderson et al., 2014). In 2012, when the term data science became much more popular than KDD, the name of the program was changed to data science, to attract more students.

As more and more universities launched or planned to launch data science programs, the need arose for standardization and guidelines, and several initiatives to formulate and design data science curricula were undertaken between 2015 and 2021. In Sect. 4.2, we present these efforts to define the data science body of knowledge, its competence framework, and curriculum recommendations, in the following sub-sections: Strengthening Data Science Education Through Collaboration, 2015 (Sect. 4.2.1), Curriculum Guidelines for Undergraduate Programs in Data Science, 2016 (Sect. 4.2.2), The EDISON Data Science Framework, 2017 (Sect. 4.2.3), Envisioning the Data Science Discipline, 2018 (Sect. 4.2.4), and the ACM Data Science


Task Force, 2021 (Sect. 4.2.5). Among other things, we describe how each of these efforts addresses the application domain and the theme of interdisciplinarity. Following the development of data science programs for undergraduates, several efforts to adapt data science for school pupils were made, which are presented briefly in Sect. 4.3. In Sect. 4.4 we propose meta-analysis activities for data science curricula. We conclude this chapter with an interdisciplinary view on data science education (Sect. 4.5).

Exercise 4.2 Pioneering data science programs
Find several data science programs. Where were they launched? What are their main characteristics? What are their main topics? Do they each have unique characteristics?

4.2 Undergraduate Data Science Curricula Initiatives

This section reviews several of the main initiatives aimed at formulating guidelines for undergraduate data science curricula. Table 4.1 lists these initiatives with their year of publication (beginning in 2015). The following review presents the main messages of each initiative and highlights how each report addressed the interdisciplinarity of data science—one of the main themes of this guide. Clearly, these reports can be analyzed through additional prisms, such as the variety of learners they address (see Chap. 7) and the skills whose importance is highlighted (see Chap. 11).

Table 4.1 Initiatives of undergraduate data science curricula

Initiative title                                                 | Year of publication | Section
Strengthening Data Science Education Through Collaboration       | 2015                | 4.2.1
Curriculum Guidelines for Undergraduate Programs in Data Science | 2016                | 4.2.2
The EDISON Data Science Framework                                | 2017                | 4.2.3
Envisioning the Data Science Discipline                          | 2018                | 4.2.4
Computing Competencies for Undergraduate Data Science Curricula  | 2021                | 4.2.5


4.2.1 Strengthening Data Science Education Through Collaboration, 2015

The report “Strengthening Data Science Education Through Collaboration” (Cassel & Topi, 2015) is the product of the 2015 workshop on data science education funded by the National Science Foundation (NSF). The workshop was organized by the ACM Education Board and Council and took place in Arlington, VA. One of the main motivations for organizing this workshop was the shortage of data science professionals, which in turn prevented organizations and societies from capturing the potential benefits of data science. Although by 2015 several educational programs had already been launched to address this shortage, many workshop participants felt that the quality and future direction of these programs, as well as their structures and practices, did not provide a wide perspective on data science and should be improved. The goal of this workshop was, therefore, “to start a conversation to address these concerns and develop a deeper integrated understanding of the best ways to offer data science education, ultimately leading to a better prepared workforce” (Cassel & Topi, 2015, p. 3).

The workshop started with an attempt to define data science. Since it was impossible to reach a consensus on either a definition of data science or whether it is possible to define data science within the timeframe of the workshop, “[t]he workshop participants made a conscious decision not to spend a significant amount of time defining data science” (p. 8). This agreement not to agree on a definition of data science also led to a similar agreement not to agree on the core of data science, or on the need for such a core, and so the discussion did not, in fact, attempt to define a curriculum. Even though data science and its core were not defined in this workshop, the report lists the following topics as fundamental for data scientists:
• Machine learning
• Statistical inference and modeling
• Data and database management
• Data integrity, privacy, and security
• The [application] domain of activity, for which the data is important

With respect to the theme of our guide, interdisciplinarity, the report raised several important questions:
• How much of data science is about the theory and concepts related to handling data, regardless of the application domain?
• How much should data science students learn about data-related activities (e.g., privacy, security, machine learning, and statistical methods) without addressing the purposes that these data-related activities serve?
• Is it meaningful to have a data science degree without addressing some application domain?
• How can we balance the various disciplinary perspectives?
• Do the answers to the above questions vary depending on the context?


4.2.2 Curriculum Guidelines for Undergraduate Programs in Data Science, 2016

In the summer of 2016, the NSF and the Institute for Advanced Study at Princeton University funded another workshop that focused on formulating curriculum guidelines for an undergraduate data science degree. This workshop was held at the Park City Math Institute of Princeton University in New Jersey. The workshop product was a report entitled “Curriculum Guidelines for Undergraduate Programs in Data Science” (De Veaux et al., 2017), and it emphasizes that the guidelines are not prescriptive, but rather aim to inform and enumerate the core skills that a data science major should have.

The curriculum guidelines are based on six principles: data science as science, the interdisciplinary nature of data science, data at the core, analytical (computational and statistical) thinking, mathematical foundations, and flexibility. While data science is presented as a science, the report states that (1) data science for undergraduate students can be described as an applied field similar to engineering, with an emphasis on using data to describe the world; and (2) the data life cycle should be the core of the data science curriculum (including obtaining, wrangling, curating, managing, and processing data; exploring data; defining questions; performing analyses; and communicating the results). It was agreed that data science requires mathematical foundations, computational thinking, and statistical thinking, and that students must be prepared to learn new techniques and methods that may not even exist at the time of their studies. The report lists the following core competences required of data science graduates:
• Analytical thinking (computational thinking and statistical thinking)
• Mathematical foundations
• Model building and assessment
• Algorithms and software foundation
• Data curation
• Knowledge transference—communication and responsibility

This list is similar, but not identical, to the list of core topics presented by Cassel and Topi (2015) (see Sect. 4.2.1). We note that De Veaux et al. (2017) propose that analytical thinking is a required core competence for undergraduate data science students, and that it is composed of both computational thinking and statistical thinking (see Chap. 3—Data Science Thinking). Finally, the report outlines the recommended curriculum, which includes the following set of courses:
1. Introduction to data science (2 courses): introductory courses in which the students gain initial practice in the data science workflow
2. Mathematical foundations of data science (2 courses)
3. Computational thinking (2 courses): algorithms and software foundations, data curation, databases and data management


4. Statistical thinking (2 courses): introduction to statistical models, statistical and machine learning
5. One course from an unrelated discipline, to let students gain expertise in an application domain
6. Capstone project

With respect to the application domain, the report states that “the practical real-world meanings come from interpreting the data in the context of the domain in which the data arose” (p. 3). The application domain, however, which is a fundamental topic according to Cassel and Topi (2015), is not mentioned as a core idea by De Veaux et al. (2017). Instead, De Veaux and his colleagues emphasize the importance of a capstone project in which the students can practice their data science skills in a real-world context. Indeed, the special interdisciplinary essence of data science, as well as of data science education, was acknowledged, as the following quote indicates:

Data science is inherently interdisciplinary. Working with data requires the mastery of a variety of skills and concepts, including many traditionally associated with the fields of statistics, computer science, and mathematics. Data science blends much of the pedagogical content from all three disciplines, but it is neither the simple intersection, nor the superset of the three. By applying the concepts needed from each discipline in the context of data, the curriculum can be significantly streamlined and enhanced. The integration of courses, focused on data, is a fundamental feature of an effective data science program and results in a synergistic approach to problem solving. (De Veaux et al., 2017, p. 3)
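The data life cycle that the guidelines place at the core of the curriculum (obtaining, wrangling, analyzing, and communicating data) can be sketched in miniature. The example below is purely illustrative: the dataset, program names, and cleaning decision are invented, and in a real course the data would be obtained from an external source (e.g., with pd.read_csv).

```python
import pandas as pd

# Obtain: here the data are created in memory; in practice they would be
# read from a file, database, or API.
raw = pd.DataFrame({
    "student": ["a", "b", "c", "d", "e", "f"],
    "program": ["cs", "cs", "bio", "bio", "cs", "bio"],
    "gpa": [3.2, None, 2.8, 3.9, 3.5, None],
})

# Wrangle/curate: handle missing values, and record the decision taken
# (here: drop students with no recorded GPA).
clean = raw.dropna(subset=["gpa"])

# Define a question and analyze: what is the mean GPA per program?
summary = clean.groupby("program")["gpa"].agg(["count", "mean"]).round(2)

# Communicate: the result is a small table a non-specialist can read.
print(summary)
```

Even this toy example exposes the life-cycle decisions the report emphasizes: what to do with missing values, how the question shapes the analysis, and how to present the result to an audience outside the discipline.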

4.2.3 The EDISON Data Science Framework, 2017

In parallel to the efforts invested in the US in 2015–2017 to define data science and its curriculum, the EDISON project was funded by the European Commission in 2017 to define the data science profession and to promote the education and training of data scientists (EDISON: Building the Data Science Profession, n.d.). The EDISON project produced the EDISON Data Science Framework (EDSF), a collection of four documents that define the data science profession and its required competences and knowledge, and present a model for a data science curriculum (EDISON Data Science Framework (EDSF), n.d.). The framework includes the following documents:
• Data Science Competences Framework (CF-DS)
• Data Science Body of Knowledge (DS-BoK)
• Data Science Model Curriculum (MC-DS)
• Data Science Professional Framework (DSPP)

The framework defines five groups of data science competences:

• Data science analytics (including statistical analysis, machine learning, data mining, business analytics, and more)


• Data science engineering (including software and applications engineering, data warehousing, and big data infrastructure and tools)
• Application domain knowledge and expertise
• Data management and governance (including data stewardship, curation, and preservation)
• Research methods for research-related professions and business process management for business-related professions

For each group, the framework describes the competencies, body of knowledge, skills, learning goals, and proposed courses. In addition, the framework details the required skills with respect to three groups of professional skills:
• Data science professional or attitude skills
• Twenty-first century workplace skills
• Skills for data scientists in modern, agile, data-driven organizations

As a general competence, data science literacy is defined as the competences and skills commonly required for data science-related and data science-enabled occupations. Similar to the recommendations of the Park City Math Institute (Curriculum Guidelines for Undergraduate Programs in Data Science, see Sect. 4.2.2), the EDISON project recommendations do not elaborate on how to integrate the application domain knowledge into the curriculum either. Instead, regarding application domain-related competencies, the EDISON framework exemplifies the required competencies with the domain of business analytics. In the document that defines the body of knowledge (Part 2), an additional component is added to represent the body of knowledge of a specific application domain, but the specific details of the integration are not elaborated on:

The subject domain-related knowledge group (scientific or business) is recognized as essential for practical work of Data Scientist what in fact means not professional work in a specific subject domain but understanding the domain related concepts, models and organization and corresponding data analysis methods and models. These knowledge areas will be a subject for future development in tight cooperation with subject domain specialists. (EDISON Data Science Framework (EDSF), n.d., Part 2, p. 12)

4.2.4 Envisioning the Data Science Discipline, 2018

In December 2016, the US National Academies of Sciences, Engineering, and Medicine established the Committee on Envisioning the Data Science Discipline: The Undergraduate Perspective, charging it with the goal of setting forth a vision for the emerging discipline of data science at the undergraduate level (National Academies of Sciences, Engineering, and Medicine, 2018). The committee was funded by the National Science Foundation (NSF).


The committee recognized the complexity of creating a data science program due to two main reasons: the diversity of the backgrounds from which data science education programs were developed, and the complex essence of data science itself. The committee’s final report stated that:

Current data science courses, programs, and degrees are highly variable in part because emerging educational approaches start from different institutional contexts, aim to reach students in different communities, address different challenges, and achieve different goals. This variation makes it challenging to lay out a single vision for data science education in the future that would apply to all institutions of higher learning, but it also allows data science to be customized and to reach broader populations than other similar fields have done in the past. (National Academies of Sciences, Engineering, and Medicine, 2018, p. 16)

The report further describes a ‘data scientist’ not as a single profession, but as a collection of several types of professions, each with a different focus. For example, the report mentions: data scientists who manage platforms on which data science models are created, data scientists who focus on managing data storage solutions, experts in statistical modeling and machine learning, data visualization experts, and business analytics professionals. The committee defines data acumen as “the ability to understand data, to make good judgments about and good decisions with data, and to use data analysis tools responsibly and effectively” (p. 12). Building on the Park City Math Institute workshop (see Sect. 4.2.2), the committee defines the following concept areas for data science:
• Mathematical foundations
• Computational foundations
• Statistical foundations
• Data management and curation
• Data description and visualization
• Data modeling and assessment
• Workflow and reproducibility
• Communication and teamwork
• Domain-specific considerations
• Ethical problem solving

The committee also describes several models for teaching data science: an introductory exposure to data science, summer programs and bootcamps, certificates in data science, minor studies in data science, massive open online courses (MOOCs), and major studies in data science. Although the focus of this committee was undergraduates, the report also refers to middle school and high school data science curricula, since the undergraduate curriculum has the potential to drive the development of both middle school and high school curricula.

Regarding the application domain, the committee stated that “effective application of data science to a domain requires knowledge of that domain” (p. 29). To achieve this goal, the committee recommended (a) grounding data science instruction in substantive contextual examples, and (b) reinforcing skills and capacities developed in data science courses in the context of specific application domains. For these


purposes, the committee proposed several models including tracks in application domain areas, specialized connector courses that link data science concepts directly to students’ fields of interest, a minor in the application domain area, or a co-major or double major in an application domain area.

4.2.5 Computing Competencies for Undergraduate Data Science Curricula, 2017–2021

The ACM Data Science Task Force (DSTF) was formed in 2017 by the ACM Education Council. The task force’s mission was to “explore a process to add to the broad, interdisciplinary conversation on data science, with an articulation of the role of computing discipline-specific contributions to this emerging field” (Danyluk & Leidig, 2021, p. 6). The task force’s final report recognizes the distinct contribution of each of the disciplines that constitute data science—computer science, statistics, and the application domain. It emphasizes, however, that effective instruction depends on the integration of the instruction of the different subject matters. The report states that:

Each component of the Data Science environment: the domain that provides the data; statistics and mathematics for analysis, modeling, and inference; and computer science for data access, management, protection, as well as effective processing in modern computer architectures, is essential. However, a random collection of the three elements does not constitute a meaningful Data Science program. Data Science is interdisciplinary and requires the effective integration of the three components to produce meaningful results (Danyluk & Leidig, 2021, p. 10).

Since the task force members came from a computing background, they focused mainly on the computer science component of the data science curriculum. In 2018, the task force conducted a survey of ACM members representing academic institutions and industry organizations. Based on the survey data, the ACM curriculum recommendations for computer science, and feedback on early drafts of the final report, the task force assembled the following list of core computer science knowledge areas for data science:

• Analysis and presentation
• Artificial intelligence
• Big data systems
• Computing and computer fundamentals
• Data acquisition, management, and governance
• Data mining
• Data privacy, security, integrity, and analysis for security
• Machine learning
• Professionalism
• Programming, data structures, and algorithms
• Software development and maintenance.


4 The Birth of a New Discipline: Data Science Education

The ACM task force also acknowledged the importance of diversity and devoted one of the chapters in the final report to the theme of broadening participation in data science. Its recommendations included (a) reporting student and faculty demographics as part of the assessment, (b) designing data science course content to support a diverse student body, and (c) training faculty members in inclusive and diverse teaching pedagogy.

4.3 Data Science Curriculum for K-12

Following the development of data science courses and curricula for undergraduates, an effort was made to adjust and develop data science courses and curricula for school pupils. A symposium on data science curricula for schools was held in Paderborn, Germany in 2017 (Biehler & Schulte, 2018), and soon afterwards a draft of a data science curriculum for school pupils was published (Heinemann et al., 2018). Following this publication, the International Data Science in Schools Project (IDSSP) published a curriculum framework for introductory high school data science (Fisher et al., 2019). The Mobilize Introduction to Data Science program aims to develop computational and statistical thinking skills so that pupils can access and analyze data from a variety of traditional and non-traditional sources (Gould et al., 2018). In Israel, a data science course for high school computer science pupils has been integrated into the current, official Israeli high school computer science curriculum (Mike et al., 2020).

Other initiatives aim to expose school pupils to the power of data science in the form of extra-curricular programs. For example, Srikant and Aggarwal (2017) developed a half-day-long data science tutorial for children in grades 5 through 9 that included the development of a friend predictor (i.e., a friend recommendation algorithm) and the implementation of a full cycle of data application development using spreadsheet software. Dryer et al. (2018) developed a data mining workshop based on seven modules of data mining, big data, ethics, and privacy in which the pupils used RapidMiner, a graphical environment for data analysis. Haqqi et al. (2018) introduced Data Jam, a four-month-long competition aimed at introducing high school pupils to data science concepts. The program included teacher workshops, pupil workshops, homework assignments, projects, mentorship, field trips, and final posters and presentations. Finally, Bryant et al. (2019) presented a one-week-long programming camp for middle school pupils that emphasized data science and included Python programming and data analysis tasks. We further elaborate on data science in the K-12 education system in Chap. 7—The Variety of Data Science Learners.
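Srikant and Aggarwal’s friend predictor was built in a spreadsheet; the underlying idea—recommending the candidate who shares the most mutual friends with a user—can be sketched in a few lines of Python. The names and friendship graph below are invented for illustration only:

```python
# Sketch of a "friend predictor": recommend the non-friends who share the
# most mutual friends with the target user (common-neighbors heuristic).
from collections import Counter

# Toy undirected friendship graph; purely illustrative data
friends = {
    "Ana": {"Ben", "Cai", "Dee"},
    "Ben": {"Ana", "Cai"},
    "Cai": {"Ana", "Ben", "Dee"},
    "Dee": {"Ana", "Cai", "Eli"},
    "Eli": {"Dee"},
}

def recommend(user):
    """Rank non-friends of `user` by their number of mutual friends with `user`."""
    counts = Counter()
    for friend in friends[user]:
        for candidate in friends[friend]:
            if candidate != user and candidate not in friends[user]:
                counts[candidate] += 1
    return counts.most_common()

print(recommend("Eli"))
```

The same counting logic can be expressed with spreadsheet formulas, which is what makes the activity accessible to pupils in grades 5 through 9.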


4.4 Meta-Analysis of Data Science Curricula

This section presents several meta-analysis exercises that address data science curricula. Policy makers in both academia and the K-12 education system can use them as guidance when considering the launch of new data science programs or the redesign of existing programs.

Exercise 4.3 Comparison of undergraduate data science curriculum initiatives
Analyze the reports published by the initiatives described in Sect. 4.2: What common guidelines do they share? What are the differences between them? What main conceptions about data science does each initiative reflect? How do data science and data science education, fields that are both undergoing accelerated development, reflect these conceptions?

Exercise 4.4 Rating the interdisciplinarity of the reports
For each of the reports published by the initiatives described in Sect. 4.2, determine its degree of interdisciplinarity. Reflect on how you determined these ratings. (a) What rubric did you use? What are its measures? (b) What do these ratings tell us about the birth of the new field of data science education?

Exercise 4.5 Analysis of the reports from additional prisms
The reports published by the initiatives described in Sect. 4.2 can be analyzed from a variety of perspectives. In particular, the similarities and differences between them can be studied through many prisms, such as the variety of learners they address (see Chap. 7) and the skills whose importance is highlighted (see Chap. 11). (a) Choose two such prisms and analyze the reports from these perspectives. (b) What are your conclusions?


Exercise 4.6 Multidisciplinary, interdisciplinary, and transdisciplinary education
In Sect. 2.4 we reviewed the concepts of multidisciplinary, interdisciplinary, and transdisciplinary education. For each of these concepts, check whether it is reflected in the reports described in Sect. 4.2. If it is, explain how; if it is not, illustrate how the concept can be implemented in the case of data science education.

Exercise 4.7 Multidisciplinary, interdisciplinary, and transdisciplinary education in schools
Repeat the exercise on multidisciplinary, interdisciplinary, and transdisciplinary education (Exercise 4.6) with respect to several high school data science programs. Does the implementation of these three concepts in high school data science education differ from that of the undergraduate data science education reviewed in Sect. 4.2?

Exercise 4.8 Didactic transposition in data science
Didactic transposition is a concept that refers to the process of adapting knowledge used by practitioners for teaching purposes (Chevallard, 1989). The term was first coined in the context of mathematics education, in which it refers to the process by which formal mathematics is adapted to fit school teaching and learning. For example, the introduction of a proof using two columns, one for the statement and the other for the reasoning, represents “a didactic transposition from abstract knowledge about mathematical proofs” (Kang & Kilpatrick, 1992, p. 3). (a) Suggest several examples of didactic transpositions of formal mathematics to school mathematics. Reflect: What guidelines did you follow? (b) Read the paper by Hazzan et al. (2010) in which the authors demonstrate didactic transpositions of software development methods to educational frameworks. What are the paper’s main messages? (c) Suggest possible didactic transpositions of data science concepts from academia to high school and elementary school.


4.5 Conclusion

In this chapter, we told the story of the birth of data science education by reviewing its short history and its current state as an emerging field. With respect to one of the main themes of this guide, namely the interdisciplinarity of data science education, we note that the importance of developing the field of data science education as an interdisciplinary field in its own right was already reflected in the final report of the first NSF workshop on data science education, held in 2015:

Infrastructure and culture of sharing of materials and experiences (including negative ones) among the departments and schools that offer data science programs be supported and encouraged. We should strive to form a knowledge hub across several faculties, domains of knowledge and industry partners, a hub that offers visibility and connections between many existing platforms. Forms of collaboration might include web-based tools, conference(s), or a journal focused on data science education. (Cassel & Topi, 2015, p. iv)

This vision, however, has not yet been fully realized. This can be explained by the unique interdisciplinary nature of data science, which, in addition to computer science, mathematics, and statistics, must take into consideration any application domain. As a new field that is closely associated with the growing field of data science, the field of data science education is at the same time also taking shape, receiving a great deal of attention, and is on its way to defining its own conceptual frameworks. In the following chapters of this guide, we present these frameworks, including the challenges and opportunities of data science education and emerging pedagogical methods to mitigate the challenges and leverage the opportunities.

References

Anderson, P., Bowring, J., McCauley, R., Pothering, G., & Starr, C. (2014). An undergraduate degree in data science: Curriculum and a decade of implementation experience. In Proceedings of the 45th ACM technical symposium on computer science education—SIGCSE ’14 (pp. 145–150). https://doi.org/10.1145/2538862.2538936

Biehler, R., & Schulte, C. (2018). Paderborn symposium on data science education at school level 2017: The collected extended abstracts. Universitätsbibliothek.

Bryant, C., Chen, Y., Chen, Z., Gilmour, J., Gumidyala, S., Herce-Hagiwara, B., Koures, A., Lee, S., Msekela, J., Pham, A. T., Remash, H., Remash, M., Schoenle, N., Zimmerman, J., Dahlby Albright, S., & Rebelsky, S. (2019). A middle-school camp emphasizing data science and computing for social good. In Proceedings of the 50th ACM technical symposium on computer science education, pp. 358–364.

Cassel, B., & Topi, H. (2015). Strengthening data science education through collaboration: Workshop report 7-27-2016. Arlington, VA.

Chevallard, Y. (1989). On didactic transposition theory: Some introductory notes. In Proceedings of the international symposium on selected domains of research and development in mathematics education, pp. 51–62.

Danyluk, A., & Leidig, P. (2021). Computing competencies for undergraduate data science curricula. https://www.acm.org/binaries/content/assets/education/curricula-recommendations/dstf_ccdsc2021.pdf

De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., Bryant, L., Cheng, L. Z., Francis, A., Gould, R., Kim, A. Y., Kretchmar, M., Lu, Q., Moskol, A., Nolan, D., Pelayo, R., Raleigh, S., Sethi, R. J., Sondjaja, M., Tiruviluamala, N., et al. (2017). Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and Its Application, 4(1), 15–30. https://doi.org/10.1146/annurev-statistics-060116-053930

Dryer, A., Walia, N., & Chattopadhyay, A. (2018). A middle-school module for introducing data mining, big data, ethics and privacy using RapidMiner and a Hollywood theme. In Proceedings of the 49th ACM technical symposium on computer science education, pp. 753–758. https://doi.org/10.1145/3159450.3159553

EDISON: Building the data science profession. (n.d.). https://edison-project.eu/

EDISON Data Science Framework (EDSF). (n.d.). https://edison-project.eu/edison/edison-data-science-framework-edsf/

Fisher, N., Anand, A., Gould, R., Hesterberg, J. B. and T., Bailey, J., Ng, R., Burr, W., Rosenberger, J., Fekete, A., Sheldon, N., Gibbs, A., & Wild, C. (2019, September). Curriculum frameworks for introductory data science. http://www.idssp.org/files/IDSSP_Data_Science_Curriculum_Frameworks_for_Schools_Edition_1.0.pdf

Gould, R., Suyen, M.-M., James, M., Terri, J., & LeeAnn, T. (2018). Mobilize: A data science curriculum for 16-year-old students (pp. 1–4). iase-web.org.

Haqqi, S., Sooriamurthi, R., Macdonald, B., Begandy, C., Cameron, J., Pirollo, B., Becker, E., Choffo, J., Davis, C., Farrell, M., Lundahl, J., Marshall, L., Wychw, K., & Zheng, A. (2018). Data jam: Introducing high school students to data science. In Proceedings of the 23rd annual ACM conference on innovation and technology in computer science education, pp. 387–387.

Hazzan, O., Dubinsky, Y., & Meerbaum-Salant, O. (2010). Didactic transposition in computer science education. ACM Inroads, 1(4), 33–37.

Hazzan, O., & Mike, K. (2021). A journal for interdisciplinary data science education. Communications of the ACM, 64(8), 10–11. https://doi.org/10.1145/3469281

Heinemann, B., Opel, S., Budde, L., Schulte, C., Frischemeier, D., Biehler, R., Podworny, S., & Wassong, T. (2018). Drafting a data science curriculum for secondary schools. In Proceedings of the 18th Koli Calling international conference on computing education research—Koli Calling ’18, pp. 1–5. https://doi.org/10.1145/3279720.3279737

Kang, W., & Kilpatrick, J. (1992). Didactic transposition in mathematics textbooks. For the Learning of Mathematics, 12(1), 2–7.

Mike, K., Hazan, T., & Hazzan, O. (2020). Equalizing data science curriculum for computer science pupils. In Koli Calling ’20: Proceedings of the 20th Koli Calling international conference on computing education research, pp. 1–5.

National Academies of Sciences, Engineering, and Medicine. (2018). Data science for undergraduates: Opportunities and options. The National Academies Press. https://doi.org/10.17226/25104

Srikant, S., & Aggarwal, V. (2017). Introducing data science to school kids. In Proceedings of the 2017 ACM SIGCSE technical symposium on computer science education, pp. 561–566. https://doi.org/10.1145/3017680.3017717

Wing, J. M. (2020). Ten research challenge areas in data science. Harvard Data Science Review. https://doi.org/10.1162/99608f92.c6577b1f

Part II

Opportunities and Challenges of Data Science Education

This part of the guide elaborates on the educational opportunities and challenges of data science education from a variety of perspectives (e.g., learners and teachers), addressing also the multifaceted and interdisciplinary nature of data science. This part includes the following chapters:

Chapter 5: Opportunities in Data Science Education
Chapter 6: The Interdisciplinarity Challenge
Chapter 7: The Variety of Data Science Learners
Chapter 8: Data Science as a Research Method
Chapter 9: The Pedagogical Chasm in Data Science Education

Chapter 5

Opportunities in Data Science Education

Abstract Data science education opens up multiple new educational opportunities. In this chapter, we elaborate on six such opportunities: teaching STEM in a real-world context (Sect. 5.2), teaching STEM with real-world data (Sect. 5.3), bridging gender gaps in STEM education (Sect. 5.4), teaching twenty-first century skills (Sect. 5.5), interdisciplinary pedagogy (Sect. 5.6), and professional development for teachers (Sect. 5.7). We conclude with an interdisciplinary perspective on the opportunities of data science education (Sect. 5.8).

5.1 Introduction

In the following chapters of Part II—Opportunities and Challenges of Data Science Education, we elaborate on the educational challenges of data science education from four perspectives: interdisciplinarity (Chap. 6), the variety of learners (Chap. 7), data science as a multifaceted discipline and, specifically, as a research method (Chap. 8), and the pedagogical chasm, which stems from the novelty of data science education (Chap. 9). Alongside these challenges, data science education also offers multiple new educational opportunities. In this chapter, we review some of these opportunities. From the interdisciplinary perspective, we highlight the opportunities of teaching STEM in a real-world context (Sect. 5.2) and of teaching STEM with real-world data (Sect. 5.3); from a perspective that considers the variety of learners, we highlight the opportunity of bridging gender gaps in STEM education (Sect. 5.4); from the perspective of data science as a multifaceted discipline, we highlight the opportunity of teaching twenty-first century skills (Sect. 5.5); and from the perspective of data science as a new field, we highlight the opportunities of interdisciplinary pedagogy (Sect. 5.6) and professional development for teachers (Sect. 5.7). We conclude with an interdisciplinary perspective on the opportunities of data science education (Sect. 5.8).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_5



5.2 Teaching STEM in a Real-World Context

Since data science attributes a great deal of importance to the application domain, that is, the real-world context, it opens up an opportunity to expand this perspective to the other STEM subjects (i.e., science, technology, engineering, and mathematics). Teaching STEM in a real-world context is an important educational goal. From the learners’ perspective, this goal includes:

• Understanding the interdisciplinary aspect of their studies, e.g., understanding how mathematics is implemented in solving real-life engineering problems;
• Understanding the human aspect of their studies, e.g., understanding that the performance of medical diagnosis devices influences human lives;
• Understanding professional ethics, i.e., what is legitimate and accepted by the learners’ professional communities, and what is not. This understanding is based not only on an understanding of the ethical rules, but also on an understanding of the human aspects of STEM. See also Chap. 12.

Several methods exist for incorporating real-world context in STEM education, including mathematical modeling, interdisciplinary courses, and project-based learning (see Sect. “Project-Based Learning”), to name a few. We illustrate this claim in the context of computer science, in which attempts have been made to implement these approaches with respect to different subject matters. For example, Asamoah et al. (2015) offer a course in computer science taught by the faculties of computer science and business administration. The students study in multidisciplinary groups that include students from both faculties and deal with issues related to business data analysis. The course project requires students to define hypotheses related to a topic they are interested in, collect data from the relevant application domain(s), and test their hypotheses by conducting data analysis.

Nevertheless, in the context of the disciplinary education of the components of data science (in other words, mathematics education, statistics education, and computer science education), it is not simple, for either teachers or learners, to implement such methods, and therefore this kind of integration is not common. In the context of data science education, however, as the application domain is inherently integrated, teaching mathematics, statistics, and computer science in a real-world context that is represented by the application domain is not only natural but essentially unavoidable.
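The full project cycle that Asamoah et al. describe (define a hypothesis, collect data, test it through analysis) can be compressed into a minimal sketch. The scenario, the numbers, and the function name welch_t below are all invented for illustration; a real course project would of course use data the students collected themselves:

```python
# Minimal sketch of the hypothesize-collect-analyze cycle. Toy hypothesis:
# weekend orders are larger than weekday orders. All data are invented.
from statistics import mean, variance

weekday = [12.0, 15.5, 11.0, 14.0, 13.5, 12.5]  # toy order sizes
weekend = [16.0, 18.5, 17.0, 15.5, 19.0, 16.5]

def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5

t = welch_t(weekend, weekday)
# A |t| well above ~2 at this sample size is evidence against "no difference"
print(f"t = {t:.2f}")
```

For these toy numbers the statistic is about 4.6, so the (invented) hypothesis would survive the test; the pedagogical point is that the statistical step closes a loop that began in the application domain.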

Exercise 5.1 Teaching the STEM subjects in a real-world context
Review the different topics you teach as part of your disciplinary teaching. Select one of these topics and determine whether or not you currently teach it in a real-world context. If you teach it in a real-world context, choose another topic. Repeat this process until you find a topic that you do not teach in a real-world context.


Describe how you currently teach this topic and design a new teaching process for it, in a real-world context. Compare the two teaching processes. What are your conclusions? Suggest some general guidelines for teaching different subject matters in a real-world context.

5.3 Teaching STEM with Real-World Data

One method of teaching STEM in a real-world context is to teach with real-world data. In both data science education and statistics education, it is common practice to teach using real data. Nevertheless, teaching with real-world data is not a common practice in the context of mathematics education and computer science education.

In the context of computer science education, learning computer science with real data can motivate students and support their learning processes, as was found by Anderson et al. (2015), who offered a CS1 course in which students learned how to connect to accessible databases of real-life data and analyze them. This approach was also applied by Burlinson et al. (2016) to develop BRIDGES, a system that enables the creation of data structure assignments with real-world data and visualizations, and by Tartaro and Chosed (2015), who designed an interdisciplinary computer science course by integrating data from other disciplines. On the high school level, the Israeli high school computer science curriculum includes an elective data science unit (see Sect. 7.3). On the graduate level, we designed, for example, an interdisciplinary Introduction to Computer Science course as part of a Data Science Specialization for graduate students in psychology (Mike & Hazzan, 2022a; see also Chap. 19). Based on recent comprehensive literature reviews of trends in computer science education, it is evident, however, that teaching computer science with real-world data is still not common practice (Becker & Quille, 2019; Luxton-Reilly et al., 2018). The lack of such courses can be explained by the complexity of developing and teaching interdisciplinary courses (Way & Whidden, 2014).

Based on our experience, we propose a six-level hierarchy for integrating real-life data in STEM education (see Fig. 5.1). We note that in the two highest levels of the hierarchy (the 5th and 6th), the real-life data is collected by the students themselves. This may not only increase student motivation, but may also make it easier for the teachers to implement, since the teachers are solely responsible for the selection of data (either designed or real) in the 2nd, 3rd, and 4th levels of the hierarchy.
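On a very small scale, the kind of analysis students perform in such a real-data CS1 course can be illustrated with the Python standard library alone. The inline CSV below is a hypothetical stand-in for a file a student would download from an open-data portal; the city names and PM2.5 values are invented:

```python
# Sketch of a CS1-style exercise on "real" tabular data: read a CSV and
# compute a per-city average. The inline CSV is a hypothetical stand-in
# for a downloaded open dataset; only the standard library is used.
import csv
import io

raw = io.StringIO(
    "city,year,pm25\n"
    "Haifa,2019,21.3\n"
    "Haifa,2020,18.9\n"
    "Tel Aviv,2019,24.1\n"
    "Tel Aviv,2020,20.5\n"
)

# Group the measurements by city
by_city = {}
for row in csv.DictReader(raw):
    by_city.setdefault(row["city"], []).append(float(row["pm25"]))

# Average each group
averages = {city: sum(vals) / len(vals) for city, vals in by_city.items()}
for city, avg in sorted(averages.items()):
    print(f"{city}: mean PM2.5 = {avg:.1f}")
```

Replacing the inline string with a file (or URL) that the students chose themselves moves the same exercise up the hierarchy of Fig. 5.1, from teacher-provided data toward student-collected data.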


Fig. 5.1 Proposed range of possibilities for data integration in STEM education (Mike & Hazzan, 2022a)

Exercise 5.2 Real data in mathematics and computer science courses
This exercise addresses the teaching of mathematics and statistics courses using real data, both as part of a data science program and when taught outside of such a program. (a) Illustrate this idea using specific content/activities/examples taught in a mathematics and statistics course in any data science program of your choice. (b) Compose guidelines for choosing real data to be integrated into a mathematics and statistics course that is part of a data science program. (c) Repeat (a) and (b) with respect to a mathematics and statistics course that is taught outside of a data science program.

5.4 Bridging Gender Gaps in STEM Education

Data show that women are underrepresented in STEM subjects in K-12, in academia, and in industry. For example, statistics published by the US Department of Education (Digest of Education Statistics, 2020) on bachelor’s degrees earned in 2018–19 reveal that women were awarded only 21% of all degrees awarded in computer science. Since data science is an emerging discipline, Berman and Bourne (2015) suggest that it has the potential to narrow the gender gap in STEM subjects. A significant majority (44 of 53, 83%) of the participants in two data science workshops for social sciences and digital humanities researchers that we held in 2020 (see Chap. 19) self-identified as women, although the base rate of women among


social sciences and humanities graduates in Israel is approximately 66% (Mike et al., 2021). This gender proportion, which is the opposite of that prevailing in STEM studies, led us to examine the workshop from a gender perspective. Our findings indicate that the women who participated in the data science workshop perceived it as an opportunity to acquire research tools rather than programming tools. In other words, framing the workshop as a research workshop, rather than as a programming workshop, lowered prevalent gender barriers in STEM and encouraged a majority of women researchers to participate. Following Berman and Bourne (2015), we therefore propose that it is important to frame data science as a field of its own (rather than as a sub-field of computer science) and to frame data science skills as professional and research skills rather than as programming skills (Mike et al., 2021). Such framing of data science and data science skills will promote gender equality both in data science and, on a larger scale, in STEM education.

Exercise 5.3 Gender gaps in STEM subjects
Find data about gender gaps in the STEM subjects and about approaches that aim to bridge them. Analyze these approaches. Specifically, investigate the differences between the commonly proposed approaches and the approach presented in this section to decrease gender gaps in STEM subjects from a data science perspective.

5.5 Teaching Twenty-First Century Skills

Data science is an important twenty-first century competence. It includes many cognitive, social, and organizational skills, which are discussed in different chapters of this guide. In Chap. 3, which deals with data science thinking, we focus on the cognitive skills that together create data thinking, such as computational thinking and statistical thinking. Data thinking is, in fact, a good example of the more general concept of interdisciplinary thinking, which integrates thinking skills from several disciplines. Data science also promotes data literacy, which is considered a basic twenty-first century skill. Chapter 11 deals with professional skills and soft skills of data science. As it is accepted that skills should be taught and practiced in some context rather than in isolation, data science education is an opportunity to promote such skills in a real-world context as part of the data science learning processes.


Exercise 5.4 Twenty-first century skills in data science education
(a) Choose five twenty-first century skills and find a high school data science program. How would you integrate the skills you chose into this data science education program? (b) Choose another five twenty-first century skills and now find an undergraduate data science program. How would you integrate these five skills into this data science education program?

5.6 Interdisciplinary Pedagogy

Researchers recognize several levels of integration between two or more distinct disciplines; two of these levels are multidisciplinarity and interdisciplinarity (Alvargonzález, 2011) (see also Sect. 2.4). Multidisciplinarity is the lowest level of integration, in which learners are expected to acquire knowledge and gain understanding in each discipline separately. Interdisciplinarity represents a higher level of integration than multidisciplinarity. In interdisciplinary programs, after learners gain basic knowledge and understanding in each discipline separately, they are expected to understand the interconnections between the disciplines and to be able to solve problems that require the application of knowledge and methods from both disciplines.

The term interdisciplinary pedagogy is used in the literature mostly as a synonym of interdisciplinary education (see, for example, Cargill, 2005 and Penny, 2009). Others define interdisciplinary pedagogy as an integrative pedagogical approach based on pedagogies of different disciplines (Chesley et al., 2018). In the spirit of interdisciplinary education, however, we propose a new meaning for the term interdisciplinary pedagogy: solving pedagogical challenges of interdisciplinary education based on the integration of pedagogical methods from each of the disciplines. Based on this definition, we can see that data science education offers an opportunity for interdisciplinary pedagogy. An example of such an interdisciplinary pedagogy is teaching machine learning using pedagogical principles of mathematics education (Mike & Hazzan, 2022b). Specifically, the process-object duality of mathematical comprehension presented in Sect. “The Process-Object Duality Theory” can be applied in the teaching of machine learning concepts (as explained in Sect. 16.3).

Data science opens up the opportunity to further develop this kind of pedagogy, which has recently been receiving a great deal of attention in the context of other world problems (see, for example, the UN’s 17 sustainable development goals at https://sdgs.un.org/goals). Data science educators and data science education researchers should develop and evaluate additional interdisciplinary pedagogical methods to leverage the opportunities of data science education.


Exercise 5.5 Interdisciplinary pedagogy
The purpose of this exercise is to encourage you to keep thinking about the term interdisciplinary pedagogy while you read this guide. As you read this guide, you will notice that different theories taken from a variety of disciplines are presented to explain different behaviors and phenomena related to data science education. Your task is to consider different connections between these theories that promote the solving of challenges that relate to data science education (mainly those presented in Chaps. 6–9, but not only).

5.7 Professional Development for Teachers

As data science is relevant for a variety of disciplines, including mathematics, statistics, computer science, and many application domains, it follows that data science education is relevant for teachers from a large variety of disciplines as well. Data science, therefore, offers a professional development opportunity for teachers from all these disciplines to become familiar with the discipline of data science as well as with its teaching methods. In the course of such a professional development process, it is important to encourage the teachers to practice data science using data taken either from their field of expertise (e.g., history, sociology, and psychology) or from the field of education. By undergoing such a professional development process, not only do the teachers become familiar with data science, but they may also improve their understanding of the application domain they teach, and enrich their arsenal of teaching methods to boot.

Exercise 5.6 Professional development for teachers You are asked to design a data science workshop for high school history and literature teachers. Define the learning outcomes of the workshop and its framework in terms of the total number of hours, the number of sessions, and the length of each session. Delve into the details of this workshop: For each session, describe the learning content, the teaching methods, and the activities that the teachers will participate in.


5 Opportunities in Data Science Education

5.8 Conclusion

In this chapter, we present several opportunities that data science education offers. An examination of these opportunities reveals that they are all based on the interdisciplinary essence of data science. Specifically:
• The opportunities of teaching STEM in a real-world context and of teaching STEM with real-world data are based on the fact that data science integrates the STEM subjects with various application domains.
• Bridging gender gaps in STEM education is based on framing data science as a skill that can be applied in other disciplines as well.
• The teaching of twenty-first century skills is based (also, but not only) on the integration of the skills associated with the disciplines that make up data science, as well as those associated with interdisciplinary thinking.
• Interdisciplinary pedagogy is based on the integration of the pedagogies of the education fields of the disciplines that form data science.
• The professional development of teachers is based on the appeal and attractiveness of data science to teachers from a variety of disciplines.

References

Alvargonzález, D. (2011). Multidisciplinarity, interdisciplinarity, transdisciplinarity, and the sciences. International Studies in the Philosophy of Science, 25(4), 387–403. https://doi.org/10.1080/02698595.2011.623366
Anderson, R. E., Ernst, M. D., Ordóñez, R., Pham, P., & Tribelhorn, B. (2015). A data programming CS1 course. In Proceedings of the 46th ACM technical symposium on computer science education, pp. 150–155.
Asamoah, D., Doran, D., & Schiller, S. (2015). Teaching the foundations of data science: An interdisciplinary approach. ArXiv Preprint ArXiv:1512.04456.
Becker, B. A., & Quille, K. (2019). 50 years of CS1 at SIGCSE: A review of the evolution of introductory programming education research. In Proceedings of the 50th ACM technical symposium on computer science education, pp. 338–344.
Berman, F., & Bourne, P. E. (2015). Let’s make gender diversity in data science a priority right from the start. PLoS Biology, 13(7), e1002206.
Burlinson, D., Mehedint, M., Grafer, C., Subramanian, K., Payton, J., Goolkasian, P., Youngblood, M., & Kosara, R. (2016). BRIDGES: A system to enable creation of engaging data structures assignments with real-world data and visualizations. In Proceedings of the 47th ACM technical symposium on computing science education, pp. 18–23.
Cargill, K. (2005). Food studies in the curriculum: A model for interdisciplinary pedagogy. Food, Culture & Society, 8(1), 115–123. https://doi.org/10.2752/155280105778055371
Chesley, A., Parupudi, T., Holtan, A., Farrington, S., Eden, C., Baniya, S., Mentzer, N., & Laux, D. (2018). Interdisciplinary pedagogy, integrated curriculum, and professional development.
Luxton-Reilly, A., Albluwi, I., Becker, B. A., Giannakos, M., Kumar, A. N., Ott, L., Paterson, J., Scott, M. J., Sheard, J., & Szabo, C. (2018). Introductory programming: A systematic literature review. In Proceedings companion of the 23rd annual ACM conference on innovation and technology in computer science education, pp. 55–106.


Mike, K., Hartal, G., & Hazzan, O. (2021). Widening the shrinking pipeline: The case of data science. IEEE Global Engineering Education Conference (EDUCON), 2021, 252–261.
Mike, K., & Hazzan, O. (2021). How can computer science educators benefit from data science education? In Proceedings of the 52nd ACM technical symposium on computer science education, p. 1363.
Mike, K., & Hazzan, O. (2022a). Interdisciplinary CS1 for non-majors: The case of graduate psychology students. In 2022 IEEE Global Engineering Education Conference (EDUCON).
Mike, K., & Hazzan, O. (2022b). Machine learning for non-majors: A white box approach. Statistics Education Research Journal, 21(2), Article 10.
Mike, K., Ragonis, N., Rosenberg-Kima, R., & Hazzan, O. (2022). Computational thinking in the era of data science. Communications of the ACM, 65(8), 31–33. https://doi.org/10.1145/3545109
Penny, S. (2009). Rigorous interdisciplinary pedagogy: Five years of ACE. Convergence, 15(1), 31–54.
Tartaro, A., & Chosed, R. J. (2015). Computer scientists at the biology lab bench. In Proceedings of the 46th ACM technical symposium on computer science education, pp. 120–125. https://doi.org/10.1145/2676723.2677246
Way, T., & Whidden, S. (2014). A loosely-coupled approach to interdisciplinary computer science education. In Proceedings of the international conference on frontiers in education: Computer science and computer engineering (FECS), p. 1.

Chapter 6

The Interdisciplinarity Challenge

Abstract In Sect. 2.4, Data Science as a Discipline, we present the interdisciplinary nature of data science. As we show, this interdisciplinary structure is challenging from an educational perspective (that is, in terms of curricula and pedagogy). In Chap. 4, we discuss the challenge of integrating the application domain into data science education, and in this chapter, we elaborate on the challenges posed by the interdisciplinary structure of data science. First, we describe the unique and complex interdisciplinary structure of data science (Sect. 6.2). Then, we present the challenge of balancing computer science and statistics in data science education (Sect. 6.3), and the challenge of actually integrating the application domain knowledge into data science study programs, courses, and student projects (Sect. 6.4). Although this chapter focuses on the challenges that emerge from the interdisciplinarity of data science, we note that this interdisciplinarity also presents an opportunity, expressed, for example, in the closing of gender gaps in STEM education (see Sect. 5.4 and Chap. 19).

6.1 Introduction

This chapter deals with the educational challenges that emerge from the interdisciplinarity of data science. After reviewing the unique interdisciplinary structure of data science (Sect. 6.2), we delve into the details of its components. Specifically: (a) in Sect. 6.3, we investigate the mutual relationships between data science and computer science and between data science and statistics, and analyze students’ conceptions of the similarity of these two mutual relationships; (b) in Sect. 6.4, we focus on four educational challenges that emerge from the integration of the application domain component into data science study programs, courses, and student projects.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_6


6.2 The Interdisciplinary Structure of Data Science

Data science integrates knowledge and skills from several disciplines: computer science, mathematics, statistics, and an application domain. A Venn diagram is a diagram that shows the logical relationships between different sets (see also Sect. 2.4, Data Science as a Discipline). Following Conway (2010), who was the first to propose a Venn diagram of data science (see Fig. 6.1), many other researchers proposed Venn diagrams to describe the field of data science. For example, in his blog entitled “Battle of the Data Science Venn Diagrams”, Taylor (2016) reviewed 15 different proposed Venn diagrams of data science. Likewise, a Google Images search for “data science Venn diagram” yields dozens of proposals for such diagrams. Figure 6.2 presents our version of a data science Venn diagram.

The multitude of proposed data science Venn diagrams demonstrates the complexity of the interdisciplinary structure of data science. This structure raises several educational questions; following are several illustrative examples:
• How should the fact that the Venn diagram can include multiple application domains be reflected in the design of data science curricula?
• Can data science be taught without referring to an application domain?
• If not, what application domain should be integrated into the curriculum and at what level?
• Is it possible to teach data science in the context of any application domain?
• The Venn diagram shows that data science is composed of 1/3 computer science, 1/3 mathematics and statistics, and 1/3 application domain. Should this ratio be maintained in the curriculum? If it should, why? If not, what should the ratio of each component be? Do the ratios depend on the learners for whom the curriculum is being designed (high school pupils, undergraduate students, etc.)?

Fig. 6.1 The data science Venn diagram, as proposed by Conway (2010)


Fig. 6.2 The data science Venn diagram, as proposed by the authors of the guide to teaching data science

Later on in this chapter, we address these as well as other questions that arise from the interdisciplinary structure of data science. We first address the balance between the computer science and statistics components of data science (Sect. 6.3) and then, we focus on the educational challenges that emerge from the integration of the application domain into data science study programs, courses, and student projects (Sect. 6.4).
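As an aside for instructors: the logical relationships that these Venn diagrams express can also be demonstrated concretely in class with set operations. The skill sets below are invented for illustration only and are not a claim about what each discipline comprises:

```python
# Invented, simplified skill sets for the three circles of a data science
# Venn diagram (illustrative only; real skill inventories are far richer).
computer_science = {"programming", "algorithms", "machine learning"}
math_and_statistics = {"linear algebra", "hypothesis testing", "machine learning"}
application_domain = {"domain questions", "domain data", "machine learning"}

# The region where all three circles overlap: skills shared by every component.
core = computer_science & math_and_statistics & application_domain
print(core)

# A region belonging to exactly one circle, e.g. computer science only.
cs_only = computer_science - math_and_statistics - application_domain
print(cs_only)
```

Such a demonstration lets students see that the diagram's regions are ordinary set intersections and differences, which can then be discussed in terms of curricula.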

Exercise 6.1 The interdisciplinary structure of data science
(a) Add at least three questions that relate to the educational challenge of data science and stem from its interdisciplinary structure.
(b) Try answering the questions presented above, as well as the questions you added in (a).
(c) What are your conclusions from (a) and (b)?


6.3 Is Data Science More About Computer Science or More About Statistics?

Exercise 6.2 Is data science more about computer science or more about statistics?
It is customary to draw Venn diagrams of data science using same-sized circles. In your opinion, does the size of the circles have any meaning? How would you draw the Venn diagram of data science in your classroom? Would you use different circle sizes in different teaching contexts? Explain your answers.

We asked about one hundred undergraduate students who were taking an Introduction to Data Science course about the mutual relationships between data science, computer science, and statistics (Table 6.1 presents the questions). To determine whether the relationships between data science and computer science and between data science and statistics, as perceived by the students, differed significantly, we used a two-sided t-test to compare the mean student rankings of each statement. Table 6.2 presents the means of the students’ answers and the t values of the comparison between each pair of statements.

Table 6.1 Questions about the relationships between data science, computer science, and statistics

Data science and computer science
For each of the following statements, please rate your level of agreement on a 1 (strongly disagree) to 5 (strongly agree) scale:
1. Data science and computer science are two different disciplines (__)
2. Data science includes computer science (__)
3. Computer science includes data science (__)
4. Computer science and data science are overlapping disciplines (__)
5. Data science is based on computer science (__)
6. Computer science is based on data science (__)

Data science and statistics
For each of the following statements, please rate your level of agreement on a 1 (strongly disagree) to 5 (strongly agree) scale:
1. Data science and statistics are two different disciplines (__)
2. Data science includes statistics (__)
3. Statistics includes data science (__)
4. Statistics and data science are overlapping disciplines (__)
5. Data science is based on statistics (__)
6. Statistics is based on data science (__)

Table 6.2 Students’ perceptions of the mutual relationships between data science and computer science and between data science and statistics

Relationship                                     | Computer science    | Statistics          | t-test
Data science and … are two different disciplines | M = 2.95, SD = 0.83 | M = 2.73, SD = 1.01 | t(106) = 1.88, p = 0.06
Data science includes …                          | M = 2.88, SD = 1.14 | M = 3.70, SD = 0.96 | t(105) = −6.26, p = 0.00*
… includes data science                          | M = 3.12, SD = 1.06 | M = 2.70, SD = 1.05 | t(105) = 3.11, p = 0.00*
Data science and … are overlapping disciplines   | M = 3.07, SD = 1.02 | M = 3.20, SD = 0.96 | t(103) = −0.85, p = 0.40
Data science is based on …                       | M = 3.77, SD = 0.94 | M = 4.04, SD = 0.78 | t(103) = −2.04, p = 0.04*
… is based on data science                       | M = 2.45, SD = 1.05 | M = 2.42, SD = 1.00 | t(102) = −0.09, p = 0.93

* Statistically significant with p < 0.05

As can be seen, students rated the relationship ‘computer science includes data science’ significantly higher than they rated the relationship ‘statistics includes data science’, and in the same spirit, the students rated the relationship ‘computer science is included in data science’ significantly lower than the relationship ‘statistics is
included in data science’. These conceptions were also reflected when we asked the students to specify what they expected to learn in the course. We categorized the students’ answers into three groups, according to the discipline to which the topics they mentioned belong: computer science, statistics, or the application domain. As it turns out, the students mentioned topics from computer science 33% more often than topics from statistics.

Exercise 6.3 Does data science include computer science? Does it include statistics?
Analyze the data presented in Table 6.2. What can you learn from students’ perceptions of the relationships between data science, computer science, and statistics? Will you address these mutual relationships in your classes? Explain your answers.

The weight of computer science and statistics in a data science curriculum is an interesting question, and different programs reflect different approaches. For example, in a paper entitled “Creating a Balanced Data Science Program”, Adams (2020) discussed the challenge of balancing computer science and statistics in a data science curriculum. He writes:

To address the challenges posed by the data deluge, the discipline of data science has arisen, and an increasing number of universities are offering undergraduate data science programs. Many of these programs have their origins in a computer science or a statistics department, leading to a data science curriculum that is more heavily weighted toward computing or statistics. (Adams, 2020, p. 1)


Adams continued to describe a 50-semester-hour data science major program, developed at Calvin University, that balances computer science and statistics as follows: • 20 h of computer science coursework (40%) • 20 h of mathematics and statistics coursework (40%) • ~10 h of new data-related coursework (20%).
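The arithmetic behind such a breakdown is simple enough to verify directly. The following sketch recomputes each component's share from the hour counts quoted above (the "~10 h" of data-related coursework is rounded to 10 here for the purpose of the illustration):

```python
# Course-hour breakdown of the 50-semester-hour data science major
# described above (the ~10 data-related hours are taken as exactly 10).
hours = {
    "computer science": 20,
    "mathematics and statistics": 20,
    "data-related coursework": 10,
}

total = sum(hours.values())  # 50 semester hours in total
shares = {component: 100 * h / total for component, h in hours.items()}

for component, share in shares.items():
    print(f"{component}: {share:.0f}%")
```

The same few lines can be reused by students in Exercise 6.4 below to check that their own proposed percentages add up to 100%.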

Exercise 6.4 Designing an undergraduate data science program
Imagine you are asked to design a study program for undergraduate data science major students. How would you divide the courses among the following four sets?
(a) computer science;
(b) mathematics and statistics;
(c) the application domain; and
(d) integrative data science courses.
For the sake of simplicity, use percentages and assume that all of the courses add up to 100%. What were your considerations when deciding on the percentage of each component?

Exercise 6.5 Analysis of undergraduate data science programs by components
Choose five undergraduate data science programs.
(a) For each program:
    (i) Categorize the courses in the program according to the components of data science.
    (ii) Based on this categorization, analyze its structure. What can you tell about its orientation?
(b) Compare the programs: Are they similar? Different? How are their similarities expressed? How are the differences between them expressed?
(c) Based on (a) and (b): What are your conclusions?
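Closing this section with a computational aside: the two-sided paired t statistic behind the comparisons reported in Table 6.2 can itself be demonstrated in class in a few lines. The ratings below are invented for illustration and are not the study's data; in practice, a library routine such as scipy.stats.ttest_rel would be used, which also returns the p-value.

```python
import math

def paired_t_statistic(ratings_a, ratings_b):
    """Two-sided paired t statistic for two equally long lists of ratings."""
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator)
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Invented 1-5 agreement ratings from five hypothetical students:
cs_includes_ds = [4, 3, 4, 5, 3]    # "computer science includes data science"
stat_includes_ds = [3, 2, 3, 3, 2]  # "statistics includes data science"

t = paired_t_statistic(cs_includes_ds, stat_includes_ds)
print(f"t = {t:.2f}")  # compare |t| against the t distribution with n - 1 df
```

Whether the paired or the independent-samples variant is appropriate depends on whether the same students rated both statements, which is itself a useful discussion point for a statistics-oriented class.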


6.4 Integrating the Application Domain

The integration of an application domain into a data science curriculum is not an easy task. In this section, we refer to four challenges involved in this task: teachers’ pedagogical content knowledge (Sect. 6.4.1), the development of interdisciplinary programs (Sect. 6.4.2), the integration of the application domain into specific computer science, mathematics, and statistics courses (Sect. 6.4.3), and the mentoring of interdisciplinary projects (Sect. 6.4.4).

6.4.1 Data Science Pedagogical Content Knowledge (PCK)

Teaching is an interdisciplinary profession; it integrates knowledge and skills from the discipline of education and from the taught discipline (Shulman, 1986). According to Shulman, in order to teach effectively, teachers need three types of knowledge: content knowledge (CK), pedagogical knowledge (PK), and pedagogical content knowledge (PCK). PCK is created at the intersection of pedagogical knowledge and content knowledge (see Fig. 6.3), and it includes the instructional strategies (e.g., games, technology, illustrations) and pedagogy (e.g., group work, problem solving, project-based learning) that are suitable for imparting the taught content successfully.

Fig. 6.3 Pedagogical content knowledge (PCK)

In general, defining data science PCK might be a significant challenge for the community of data science educators due to (a) the interdisciplinary structure of data science (see Sect. 6.2), and (b) the fact that data science education is a new field that is taking shape in parallel to the development of data science. To illustrate this challenge, we examine the use of examples. Although the use of examples is part of a teacher’s PCK in every subject matter (Shulman, 1986), in data science education examples play a special role. This role stems from the fact that in data science teaching, examples can be drawn from a variety of application domains, and so the teacher should select appropriate examples for each specific educational context, based on his or her PCK. The integration of examples from so many diverse application domains may be challenging for data science teachers. For instance, in his book “The Data Science Design Manual”, Skiena (2017) presented examples from fifteen different application domains: entertainment, biomedicine, economics, education, environmental sciences, history, image processing, linguistics, medicine, politics, search, social media, sports, text analysis, and transportation.

Exercise 6.6 Data science PCK Imagine you are a data science teacher. Describe your teaching environment, according to your choice: characterize the students, define the study program, describe the physical learning environment, etc. What PCK would you need in order to teach this class? Describe scenarios in which this PCK might be expressed in your teaching.

6.4.2 Developing Interdisciplinary Programs

Anderson et al. (2014) introduced a multidisciplinary data science program whose faculty members hailed from a variety of fields, including computer science, mathematics, statistics, and other faculties such as bioinformatics and geoinformatics. Accordingly, in addition to courses in computer science, mathematics, and statistics, the curriculum included courses in 14 application domains, such as biology and economics. To complete the program, students had to take several courses from one of the 14 application domains in addition to the mandatory courses in computer science, mathematics, and statistics, and each student was also required to develop a final project in his or her application domain of choice. This approach can be applied by designing an interdisciplinary program that is taught by faculty members from several different faculties.

Exercise 6.7 The challenge of developing interdisciplinary programs
(a) Find several existing interdisciplinary data science programs.
(b) Describe their main characteristics.
(c) Based on your analysis of the programs you found, list the challenges involved in the development of interdisciplinary data science programs.


6.4.3 Integrating the Application Domain into Courses in Computer Science, Mathematics, and Statistics

Although it is common to find interdisciplinary programs that integrate computer science with other disciplines, such as biology or neural science, it is quite uncommon to find interdisciplinary introductory computer science courses (Becker & Quille, 2019; Luxton-Reilly et al., 2018). The lack of such courses can be explained by the complexity involved in the development and teaching of such courses (Way & Whidden, 2014). For example, the Computer Science in the Biology Laboratory course (Tartaro & Chosed, 2015) is an introductory programming course in which students learn to program and to analyze biological data that they collect themselves in a biology laboratory. This course is taught by two lecturers: one from computer science and one from biology. Such courses, however, are still not prevalent (Luxton-Reilly et al., 2018).

In the 52nd ACM Technical Symposium on Computer Science Education, we facilitated a birds-of-a-feather (BoF) session in which we asked “How Can Computer Science Educators Benefit from Data Science Education?” (Mike & Hazzan, 2021). One of the answers suggested using real-life data in computer science courses. Indeed, it has been argued that learning computer science with real data can motivate students and support their learning processes (Anderson et al., 2015). This approach was applied, for instance, by Burlinson et al. (2016), who developed BRIDGES, a system that enables the creation of data structure assignments with real-world data and visualizations. It has also been applied in the design of interdisciplinary computer science courses that integrated data from other disciplines (Tartaro & Chosed, 2015). Here are several examples of such courses:

• Asamoah et al. (2015) described a course in computer science taught by the faculties of computer science and business administration.
The students studied in multidisciplinary groups that included students from both faculties and dealt with issues related to business data analysis. The course project required students to define hypotheses in a topic of their interest, collect data from the relevant application domain(s), and analyze it to test their hypotheses. • Anderson et al. (2015) reported on a CS1 course in which students learned to connect to accessible databases of real-life data and analyze them. • We designed an interdisciplinary introduction to computer science course for graduate students in psychology (Mike & Hazzan, 2022; see also Chap. 19).
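To give a concrete flavor of such assignments, here is a minimal, hypothetical sketch of a CS1-style exercise over real-world-like data. The inline CSV (with invented city names and values) stands in for the open data set that students would actually download:

```python
import csv
import io

# A stand-in for a real open data set (e.g., city air-quality readings);
# in an actual course the students would download a CSV file instead.
raw_csv = """city,year,pm25
Springfield,2021,12.4
Springfield,2022,11.1
Shelbyville,2021,15.8
Shelbyville,2022,14.9
"""

# CS1-level task: read the records and compute the mean PM2.5 per city.
readings = {}
for row in csv.DictReader(io.StringIO(raw_csv)):
    readings.setdefault(row["city"], []).append(float(row["pm25"]))

means = {city: sum(vals) / len(vals) for city, vals in readings.items()}
for city, mean in sorted(means.items()):
    print(f"{city}: {mean:.2f}")
```

The exercise practices the usual CS1 material (loops, dictionaries, file parsing) while the data invites an application-domain discussion, which is precisely the point made above.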


6.4.4 Mentoring Interdisciplinary Projects¹

Working on interdisciplinary projects increases students’ motivation and engagement in real-life problem-solving processes (Yogeshwaran et al., 2019) and promotes cooperation between disciplines (Ramamurthy, 2016). To successfully accomplish interdisciplinary learning in a project-based learning (PBL) environment (see Chap. 12), project teams must include specialists in several disciplines; using cross-disciplinary teams is one way to ensure that a project team includes specialists in all the required disciplines (Othman et al., 2018). It is not always possible, however, to work in cross- or interdisciplinary teams, and in many cases, significant knowledge gaps exist within the project team in one or more of the project disciplines (Setiawan, 2019).

In 2019, we carried out an experiment with fourth-year electrical engineering students working on biomedical signal processing projects, which are evidently interdisciplinary (Mike et al., 2020). The students worked in pairs; the teams, however, were homogeneous, and the students lacked essential medical expertise required to achieve solutions that would be applicable by physicians. Furthermore, the students acknowledged this knowledge gap only in the advanced phases of the project, and so critical phases, such as goal setting and planning, were performed without the required knowledge. From an image processing perspective, the treatment of medical images is inherently different from that of everyday-life images (such as images for entertainment purposes) in the sense that the diagnostic medical information must be preserved in order to protect patients from potentially harmful situations.
For example, in the field of image processing, mathematical fidelity criteria are commonly used to assess processed images with respect to desired results; when dealing with medical images, however, these criteria are insufficient, and the images should also undergo subjective evaluation tests by specialist physicians to assess their quality from the medical perspective. Considering this requirement, collaboration with radiologists is crucial when developing and testing the performance of medical data algorithms. Thus, medical knowledge and expertise are vital in order to achieve satisfactory performance of imaging systems. We therefore decided to develop an interdisciplinary intervention program on biomedical signal processing for undergraduate electrical engineering students, whose objectives were: (a) to teach students basic biomedical knowledge; (b) to convey the importance of understanding biomedicine as an application domain when designing and developing image processing algorithms that aim to be applicable for physicians; and (c) to refer students to relevant resources where they could find self-learning material about biomedical signals (Mike et al., 2020). The intervention program consisted of six regular, short key lectures, as well as several group meetings with radiologists and scientists from the industry who provided valuable medical knowledge and expertise. In addition, individual project follow-up meetings were held with the project supervisor. Table 6.3 presents the program schedule and the content of each meeting.

¹ © 2020 IEEE. Reprinted, with permission, from Mike et al. (2020).


Table 6.3 Intervention program in biomedical signal processing project

Week | Content
1    | Introductory tutorial on ultrasound imaging; presentation of project goals
3    | First meeting with radiologists to present project results and ask for medicine-related feedback
6    | Presentation of self-learning materials on biomedical signals
8    | Tutorial on objective fidelity criteria of medical image processing; second meeting with radiologists
10   | Third meeting with radiologists
13   | Meeting with representatives from the biomedical signal processing industry to receive feedback on the relevance and applicability of the current research results

To evaluate the intervention program, we collected data both from students who participated in the intervention program and from students from previous years who did not. Data from students who participated in the intervention program were collected using anonymous bi-weekly questionnaires, while data from students from previous years who did not participate were collected by interviews and a single questionnaire. In what follows, we present the perceptions of the two groups of students.

At the end of the intervention program, we asked both groups of students to indicate, on a scale of 1–5, how much knowledge they had to acquire for the project development. Table 6.4 presents the students’ answers. As can be seen, the two groups of students gave similar evaluations of the required knowledge they had to learn. Specifically, the students estimated that they had to learn less statistics and biomedicine than electrical engineering material. However, while the perception of the required knowledge is similar, the students from the two groups explained their estimations differently. The students who participated in the intervention program perceived the volume of the required biomedical knowledge as relatively low due to their understanding that they had radiologists supporting them in this aspect. For example, one student wrote: “We had continuous guidance from the radiologist”. On the other hand, the students who did not participate in the intervention program perceived the volume of the required biomedical knowledge as relatively low due to their perception of this knowledge as not being crucial for their project development. For example, one student explained that “it is required to have [only] basic medical knowledge”.
Table 6.4 Students’ perception of the knowledge required to meet the project goal

Discipline             | With intervention (n = 8, scale 1–5) | Without intervention (n = 4, scale 1–5)
Electrical engineering | 4.25                                 | 4.50
Statistics             | 2.25                                 | 3.00
Biomedicine            | 2.50                                 | 2.75

We further compared the students’ perceptions of the project’s success factors (see Table 6.5). While students from the two groups rated the self-learning factors similarly, they rated the cooperation with radiologists differently: the students who participated in the intervention program, and met with the radiologists during the project development, rated this criterion as 4.62; the students who did not participate in the intervention program, and worked on the project without actual cooperation with any radiologists, rated its importance as 2.5.

Table 6.5 Students’ perception of the research project’s success factors

Success factor                        | With intervention (n = 8, scale 1–5) | Without intervention (n = 4, scale 1–5)
Self-learning on biomedical diagnosis | 3.87                                 | 3.00
Self-learning on image processing     | 4.12                                 | 4.50
Cooperation with radiologists         | 4.62                                 | 2.50

Here are illustrative explanations of their ratings: the students who participated in the intervention program explained that “cooperation with the radiologist is necessary”, while students who did not participate in the intervention program explained that “feedback from a radiologist is not as important as self-learning”.

In a similar spirit to the intervention program, when students have knowledge gaps in one of the application domains required for the project’s success, mentors should support the students with the help of application domain specialists. Such support is especially important when the mentor also has knowledge gaps in one of the application domains required for the project development.

Exercise 6.8 Knowledge gaps in PBL
Choose a topic for a project that, in your opinion, can be developed by a team of high school students studying in a data science outreach program.
(a) Characterize the team members (in terms of their knowledge and expertise).
(b) Identify any knowledge gaps the students might have.
(c) Develop an intervention program to close these knowledge gaps.
Repeat the exercise for:
(a) undergraduate students in any engineering discipline of your choice; and
(b) social science graduate students.

Exercise 6.9 Additional challenges of mentoring interdisciplinary projects In your opinion, besides gaps in students’ application domain knowledge, what other challenges should a mentor of interdisciplinary projects expect to deal with during the mentoring process?

6.5 Conclusion

Interdisciplinarity is one of data science’s main characteristics. Although this interdisciplinarity poses several educational challenges, both curricular and pedagogical, as reviewed in this chapter, it also opens up some valuable opportunities: (a) it serves as an entrance gate to the STEM subjects for learners who traditionally do not choose to study these subjects; and (b) it enables learners to acquire cognitive skills for dealing with interdisciplinarity, whose importance for solving the problems the world currently faces is widely acknowledged.

Exercise 6.10 Revisiting the questions about the interdisciplinary structure of data science
At the beginning of this chapter, we presented several questions about the interdisciplinarity of data science and asked you to try answering them (Exercise 6.1). (a) Answer these questions again. (b) Are your answers the same as before or are they different? In what ways are they similar? In what ways are they different? (c) What are your conclusions from (a) and (b)?


6 The Interdisciplinarity Challenge

References

Adams, J. C. (2020). Creating a balanced data science program. In Proceedings of the 51st ACM technical symposium on computer science education, pp. 185–191. https://doi.org/10.1145/3328778.3366800
Albu, A. B., Malakuti, K., Tuokko, H., Lindstrom-Forneri, W., & Kowalski, K. (2008). Interdisciplinary project-based learning in ergonomics for software engineers: A case study. The Third International Conference on Software Engineering Advances, 2008, 295–300.
Anderson, P., Bowring, J., McCauley, R., Pothering, G., & Starr, C. (2014). An undergraduate degree in data science: Curriculum and a decade of implementation experience. In Proceedings of the 45th ACM technical symposium on computer science education—SIGCSE ’14, pp. 145–150. https://doi.org/10.1145/2538862.2538936
Anderson, R. E., Ernst, M. D., Ordóñez, R., Pham, P., & Tribelhorn, B. (2015). A data programming CS1 course. In Proceedings of the 46th ACM technical symposium on computer science education, pp. 150–155.
Asamoah, D., Doran, D., & Schiller, S. (2015). Teaching the foundations of data science: An interdisciplinary approach. ArXiv Preprint ArXiv:1512.04456.
Becker, B. A., & Quille, K. (2019). 50 years of CS1 at SIGCSE: A review of the evolution of introductory programming education research. In Proceedings of the 50th ACM technical symposium on computer science education, pp. 338–344.
Blumenfeld, P. C., Soloway, E., Marx, R. W., Krajcik, J. S., Guzdial, M., & Palincsar, A. (1991). Motivating project-based learning: Sustaining the doing, supporting the learning. Educational Psychologist, 26(3–4), 369–398.
Burlinson, D., Mehedint, M., Grafer, C., Subramanian, K., Payton, J., Goolkasian, P., Youngblood, M., & Kosara, R. (2016). BRIDGES: A system to enable creation of engaging data structures assignments with real-world data and visualizations. In Proceedings of the 47th ACM technical symposium on computing science education, pp. 18–23.
Conway, D. (2010). The data science Venn diagram. Dataists. http://www.dataists.com/2010/09/the-data-science-venn-diagram/
Krishnan, S. (2013). Promoting interdisciplinary project-based learning to build the skill sets for research and development of medical devices in academia. In 2013 35th annual international conference of the IEEE engineering in medicine and biology society (EMBC), pp. 3142–3145.
Liu, J.-S., & Huang, T.-K. (2005). A project mediation approach to interdisciplinary learning. In Fifth IEEE international conference on advanced learning technologies (ICALT’05), pp. 54–58.
Luxton-Reilly, A., Albluwi, I., Becker, B. A., Giannakos, M., Kumar, A. N., Ott, L., Paterson, J., Scott, M. J., Sheard, J., & Szabo, C. (2018). Introductory programming: A systematic literature review. In Proceedings companion of the 23rd annual ACM conference on innovation and technology in computer science education, pp. 55–106.
Mike, K., & Hazzan, O. (2021). How can computer science educators benefit from data science education? In Proceedings of the 52nd ACM technical symposium on computer science education, pp. 1363–1363.
Mike, K., & Hazzan, O. (2022). Interdisciplinary CS1 for non-majors: The case of graduate psychology students. 2022 IEEE global engineering education conference (EDUCON).
Mike, K., Nemirovsky-Rotman, S., & Hazzan, O. (2020). Interdisciplinary education—The case of biomedical signal processing. IEEE Global Engineering Education Conference (EDUCON), 2020, 339–343. https://doi.org/10.1109/EDUCON45650.2020.9125200
Othman, A., Hussin, H., Mustapha, M., & Parman, S. (2017). Cross-disciplinary team learning in engineering project-based: Challenges in collaborative learning. In 2017 7th world engineering education forum (WEEF), pp. 866–871.
Othman, A. R., Hussin, H., Mustapha, M., & Parman, S. (2018). Cross-disciplinary team learning in engineering project-based: Challenges in collaborative learning. In Proceedings—2017 7th world engineering education forum, WEEF 2017. In conjunction with: 7th regional conference on engineering education and research in higher education 2017, RCEE and RHEd 2017, 1st international STEAM education conference, STEAMEC, vol. 201, pp. 866–871. https://doi.org/10.1109/WEEF.2017.8467029
Ramamurthy, B. (2016). A practical and sustainable model for learning and. In The 47th ACM technical symposium on computer science education, SIGCSE 2016, pp. 169–174.
Setiawan, A. W. (2019). Detailed comparison of instructor and student-based assessment in project based learning. IEEE global engineering education conference, EDUCON, April 2019, pp. 557–560. https://doi.org/10.1109/EDUCON.2019.8725126
Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational Researcher, 15(2), 4–14.
Skiena, S. S. (2017). The data science design manual. Springer.
Tartaro, A., & Chosed, R. J. (2015). Computer scientists at the biology lab bench. In Proceedings of the 46th ACM technical symposium on computer science education, pp. 120–125. https://doi.org/10.1145/2676723.2677246
Taylor, D. (2016). Battle of the data science Venn diagrams. KDnuggets. https://www.kdnuggets.com/battle-of-the-data-science-venn-diagrams.html/
Way, T., & Whidden, S. (2014). A loosely-coupled approach to interdisciplinary computer science education. In Proceedings of the international conference on frontiers in education: computer science and computer engineering (FECS), p. 1.
Yogeshwaran, S., Kaur, M. J., & Maheshwari, P. (2019). Project based learning: Predicting bitcoin prices using deep learning. In IEEE global engineering education conference, EDUCON, April 2019, pp. 1449–1454. https://doi.org/10.1109/EDUCON.2019.8725091

Chapter 7

The Variety of Data Science Learners

Abstract Since data science is considered to be an important twenty-first century skill, it should be acquired by everyone—children as well as adults—on a suitable level, to a suitable breadth, and to a suitable depth. And so, after reviewing the importance of data science knowledge for everyone (Sect. 7.1), this chapter reviews the teaching of data science to different populations: K-12 pupils in general (Sect. 7.2) and high school computer science pupils in particular (Sect. 7.3), undergraduate students (Sect. 7.4), graduate students (Sect. 7.5), researchers (Sect. 7.6), data science educators (Sect. 7.7), practitioners in the industry (Sect. 7.8), policy makers (Sect. 7.9), users (Sect. 7.10), and the general public (Sect. 7.11). For each population, we discuss the main purpose of teaching it data science, the main concepts that it should learn, and (in some cases) suitable learning environments and exercises. In Sect. 7.12, we present several activities about the fitness of different learning environments for teaching data science to the different populations discussed in this chapter. In the conclusion (Sect. 7.13), we highlight the concept of diversity in the context of data science.

7.1 Introduction

Data science is relevant for everyone. Not only should data scientists be familiar with its main ideas and how it can be used; children, students, and a variety of professionals, such as researchers, policy makers, educators, medical doctors, lawyers, and adults in general, should also acquire some data science knowledge according to the ways in which they expect to use it. This assertion derives from the fact that each of these populations consumes data science products, even unintentionally. For example, the general public uses social networks whose design and operation are based on data science methods. Indeed, every web search is based on data science. Other examples are navigation apps, marketing channels, and natural language processing tools, all of which are designed and operate on the basis of data analysis conducted using data science methods. The need to teach data science to a variety of populations is, therefore, great and growing.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_7



The need for everyone to understand data science is reflected in the following quote, taken from the 2018 Envisioning the Data Science Discipline report, presented in Chap. 4. The committee defined data acumen as “the ability to understand data, to make good judgments about and good decisions with data, and to use data analysis tools responsibly and effectively” (National Academies of Sciences, Engineering, and Medicine, 2018, p. 12). In this chapter, we delve into the details of this data acumen and its meaning for different populations.

Since data science has become relevant for everyone, it is considered to be among the most important twenty-first century skills—a set of skills that have been identified as required for functioning and succeeding in the twenty-first century society and workplace. This chapter, therefore, highlights the need to teach data science to a variety of populations. In this context, it is interesting to note that even data science experts can be divided into different groups, such as data analysts, data scientists, and so on.

Exercise 7.1 Different data science professions
Data Analyst vs. Data Scientist: What’s the Difference? (2021) is one manuscript that discusses different data science professions. (a) Find additional data science professions. (b) Investigate the similarities and differences among these different data science professions. (c) In your opinion, should the different data science professions receive a similar or a different education? Elaborate on these similarities or differences.

This chapter also discusses the notion of diversity, which refers to the representation of different groups in specific contexts. In our case, the context is data science. The inherent diversity of data science is reflected in this chapter in the different groups whose data science education we discuss: K-12 pupils in general (Sect. 7.2) and high school computer science pupils in particular (Sect. 7.3), undergraduate students (Sect. 7.4), graduate students (Sect. 7.5), researchers (Sect. 7.6), data science educators (Sect. 7.7), practitioners in the industry (Sect. 7.8), policy makers (Sect. 7.9), users (Sect. 7.10), and the general public (Sect. 7.11). For each population, we discuss the reasons for teaching it data science, the main concepts that the population should be taught, and suitable learning environments and exercises.

This diversity of learners is a challenge for data science educators. As we discuss in Chap. 2, data science is an interdisciplinary discipline that requires knowledge and skills from various disciplines, including mathematics, statistics, computer science, and the application domain. Obviously, most learners lack some of the essential knowledge required to understand and practice data science. Throughout this guide, we discuss this challenge and suggest ways of mitigating it.


Exercise 7.2 Diversity in data science
One reason that data science has the potential to enhance diversity is its interdisciplinarity. Explore this diversity: How can it be expressed? What are its benefits? What are its drawbacks? How is diversity reflected in the data science workflow? (see Chaps. 2 and 10).

The chapter ends with several activities that invite the readership to examine the suitability of different learning environments for each population discussed in this chapter (Sect. 7.12).

7.2 Data Science for K-12 Pupils

Since data science is one of the more important twenty-first century skills, it should be learned by all people in society, including K-12 pupils. As a twenty-first century skill, one of the most important objectives of data science education for K-12 pupils is to guide children to use data in a meaningful and thoughtful way that does not harm them.

One initiative that attempts to teach data science to K-12 pupils was developed by a team of scholars from Stanford. The leader of this initiative, Jo Boaler, explains that “[t]eaching data science in K-12 schools is about much more than preparing young people for well-paying careers …. It affords an alternate, more equitable pathway that appeals to broader groups of students. … Data science … relates to students’ daily lives and communities and raises awareness of issues in society” (Boaler, 2021). The initiative’s teaching activities targeted a variety of K-12 populations, including special education pupils. One of these activities, “Big Ideas in Data Science” (Data Big Ideas, 2022), proposes a set of standards for teaching data science to pupils in kindergarten through tenth grade. Specifically, it lists the following data science big ideas that K-10 pupils should learn:

• Formulate statistical investigative questions
• Collect/consider data
• Analyze data
• Interpret and communicate
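These four big ideas can be exercised even on a very small dataset. The following Python sketch walks through them with pandas; the investigative question, the column names, and the survey numbers are all hypothetical stand-ins invented for illustration, not taken from the initiative's materials.

```python
import pandas as pd

# 1. Formulate a statistical investigative question:
#    "Do pupils who sleep more report higher energy levels?"

# 2. Collect/consider data (hypothetical classroom survey responses)
df = pd.DataFrame({
    "hours_sleep": [6, 7, 8, 5, 9, 7, 6, 8],
    "energy_level": [4, 6, 8, 3, 9, 7, 5, 8],  # self-rated on a 1-10 scale
})

# 3. Analyze data: compute the (Pearson) correlation between the two columns
correlation = df["hours_sleep"].corr(df["energy_level"])

# 4. Interpret and communicate the finding
print(f"Correlation between sleep and energy: {correlation:.2f}")
```

Even this toy investigation lets pupils experience the full cycle, from question to communicated answer, which is the pedagogical point of the big-ideas framing.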

For a discussion about the importance of introducing data science at an early age so as to create positive experiences and encourage curiosity about data science as a twenty-first century skill, see also Martinez and LaLonde (2020).


Exercise 7.3 The AI + Ethics Curriculum for Middle School initiative
The “AI + Ethics Curriculum for Middle School” initiative, presented at https://www.media.mit.edu/projects/ai-ethics-for-middle-school/overview/, focuses on artificial intelligence. It seeks to develop an open-source curriculum for middle school students that is made up of a series of lessons and activities. Explore the activities proposed by this initiative. In your opinion, what pedagogical guidelines were applied in the development of these activities? Can these guidelines be applied in the development of learning materials that focus on other data science topics?

Exercise 7.4 Initiatives for K-12 data science education
Read the paper entitled “Scratch Community Blocks: Supporting Children as Data Scientists” (Dasgupta & Hill, 2017). (a) Explore the Scratch Community Blocks system built within the Scratch programming language. What are its main goals and how are they achieved? What learning theories were applied in the design of this environment? (b) Find three additional initiatives, intended for different age groups, whose goal is to teach data science to K-12 pupils. What are the three main pedagogical ideas that each of them implements? What teaching materials does each provide? (c) If possible, try using one of these initiatives. Reflect on your experience: What did you like about it? What did you dislike about it? Try to envision how it can be used to teach data science to K-12 pupils of different ages.

Exercise 7.5 Learning environments for high school pupils
In the next section (Sect. 7.3), we focus on data science for high school computer science pupils. At the same time, several data science curricula have also been developed for high school pupils whose major subject is not computer science (see examples in Fisher et al., 2019; Gould et al., 2018; Heinemann et al., 2018). Explore the three data science initiatives for non-computer science major high school pupils mentioned above: What are the main data science concepts they teach? What are the main pedagogical ideas they implement?


7.3 Data Science for High School Computer Science Pupils

We excluded high school computer science pupils from the above discussion of K-12 pupils (Sect. 7.2) since these pupils have, in most cases, a strong prerequisite background in mathematics and computer science that enables them to study some of the more advanced data science topics, such as machine learning algorithms, including some of the mathematical ideas that ensure their correctness. In this section, we present our attempt to develop a data science course for high school computer science pupils, a course that is, in fact, currently being taught to 10th grade computer science pupils in an Israeli public school and has been integrated into the official Israeli high school computer science curriculum (Mike et al., 2020). This course balances the breadth of data understanding, processing, and workflow, on the one hand, with the depth of the algorithmic facet of data science, specifically machine learning algorithms, on the other hand.

The data science program we developed is designed to accommodate two levels of study. The basic level is taught in the 10th grade and includes developing a project in Python as part of the lab-based learning unit, in which pupils are exposed to other programming paradigms in addition to the object-oriented paradigm, which is the main paradigm addressed in the course. The advanced (extended) level is taught in the 11th and 12th grades and elaborates on both the data science process and machine learning algorithms, with an emphasis on deep learning. In the remainder of this section, we describe the basic level taught in the 10th grade.

The Israeli data science curriculum is based on the data science workflow.
Pupils learn to ask questions about a specific topic, to collect relevant data and clean it, to explore and model the data, to make decisions, and to evaluate the appropriateness and quality of the predictions of the machine learning algorithms. The curriculum includes five machine learning algorithms: K-nearest neighbors (KNN), linear regression, perceptron, support vector machine (SVM), and neural networks (NN). Each pupil learns two algorithms: the KNN algorithm, which is mandatory for all pupils, and one of the other four algorithms, according to the teacher’s choice. Although the KNN algorithm and the perceptron are simple to understand and are taught mostly for pedagogical reasons, they also demonstrate the core principles of machine learning. See, for example, Hazzan and Mike (2022), who explore the suitability of the KNN algorithm for teaching the core principles of machine learning.

Since the curriculum is based on constructivism and active learning, the first two parts of the curriculum, the data science workflow and machine learning, are interlaced. The pupils first learn one data type (tables or images) and one basic machine learning algorithm (KNN), thus enabling them to program and run their first machine learning model on data they collect themselves. The two parts continue in parallel: additional data types and data processing methods are taught alongside more sophisticated machine learning algorithms. The third section, project development, is based on the first two parts. While the project focuses on the implementation of machine learning algorithms, the pupils experience all steps of the data science workflow, including asking questions, looking for and collecting data, data exploration, modeling, and reporting. Table 7.1 presents the structure of the data science for high school curriculum, including the number of hours dedicated to each topic.

We propose that computer science pupils can benefit from the integration of data science into the computer science curriculum (Mike & Hazzan, 2020). Among other benefits, we mention the broader perspective of real-world problems they may gain. Specifically, while the focus of computer science is algorithms, the focus of data science is data. Accordingly, the computer science curriculum today focuses on the understanding of algorithms: how they work, how to implement them efficiently, and so on, while the data science curriculum focuses on the data required to answer questions about the application domain and on obtaining, cleaning, visualizing, and modeling the data in a way that enables us to gain insights, with algorithms serving as tools to achieve these targets.

1 This section is based on Mike et al. (2020). Equalizing Data Science Curriculum for Computer Science Pupils, Koli Calling—International Conference on Computing Education Research, Finland, Article No.: 20, pp. 1–5. https://doi.org/10.1145/3428029.3428045.
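The curriculum's first milestone, programming and running a first machine learning model, can be illustrated with a from-scratch KNN classifier of the kind pupils write. The following Python sketch is ours, not part of the curriculum materials: the `knn_predict` helper and the toy measurement data are hypothetical illustrations of the idea.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical data: two measurements per sample, two classes (0 and 1)
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                    [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # a point near class 0
```

The entire algorithm fits in a few lines that a 10th grader can trace by hand, which is one reason KNN is well suited for demonstrating the core principles of machine learning before the mathematically heavier algorithms are introduced.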

Exercise 7.6 Real data in data science education
In computer science education, we usually create datasets to suit our specific teaching objectives (for example, to demonstrate how a specific algorithm works). In data science education, on the other hand, real data is used in most cases. This is also true for the Israeli high school data science curriculum studied by high school computer science pupils. Section 5.3 further elaborates on this idea. In your opinion, how can the use of real data in data science education contribute to computer science learners’ view of problem-solving processes, in general, and of the solving of real-world problems, in particular?

Table 7.1 Data science for high school curriculum—topics and number of hours

Topic                                                   Hours
Introduction to data science                                3
Python programming                                         15
Python for data science                                     3
Data tables with Pandas                                     3
Introduction to machine learning                            6
Exploratory data analysis                                   6
The KNN algorithm                                           6
Classifier evaluation                                       3
Core concepts of machine learning                           3
Data science workflow                                       6
Images as data                                              3
Second machine learning algorithm (teachers’ choice)        3
Final project development                                  30
Total                                                      90


Table 7.2 Machine learning algorithms—the mathematical knowledge required to understand them and an alternative intuitive explanation

Perceptron
  Required mathematical knowledge: Linear algebra: sum of vectors and dot product
  Alternative intuitive explanation: Define a training-error function and demonstrate how the training error declines

Support vector machine (SVM)
  Required mathematical knowledge: Linear algebra: sum of vectors and dot product; Optimization: Lagrange method
  Alternative intuitive explanation: Use a visual 2-dimensional graph to find the support vectors and the separation line

Logistic regression
  Required mathematical knowledge: Basic math: log function; Statistics: probability and likelihood; Calculus: partial derivatives; Optimization: gradient descent
  Alternative intuitive explanation: Use a visual graph to demonstrate the error function as a function of the decision boundaries

Neural networks and deep learning
  Required mathematical knowledge: Calculus: partial derivatives; Optimization: gradient descent
  Alternative intuitive explanation: Each element of the network is a perceptron, an algorithm the learners are already familiar with (see the first row of this table). Deep learning is a special case of neural networks that includes several layers of perceptrons
Exercise 7.7 Didactic transposition of machine learning algorithms
In Exercise 4.8, we presented the concept of didactic transposition. We now wish to apply it to the mathematical knowledge required of high school computer science pupils who are studying data science. Table 7.2 displays several machine learning algorithms. For each algorithm, it lists the mathematical knowledge required in order to understand it, and an intuitive explanation, that is, a didactic transposition, that may support students in learning the algorithm without fully understanding the mathematics behind it. For each machine learning algorithm presented in Table 7.2, explain the didactic transposition of the mathematical knowledge required in order to teach it to high school computer science pupils.


Exercise 7.8 Discussions about the role of data science in admission requirements
Read the following two documents:
1. Mathematicians and the High School Math Curriculum, at: https://www.mathunion.org/imu-news/archive/2022/imu-news-111-january-2022
2. Does Calculus Count Too Much in Admissions?, at: https://www.insidehighered.com/admissions/article/2022/02/14/does-calculus-count-too-much-admissions
What debates do they present? What is their opinion about data science education? What is your opinion on these debates? How would you rebut the claims they present?

7.4 Data Science for Undergraduate Students

Chapter 4 describes the birth process of data science education on the undergraduate level. Although the review there addresses mainly data science majors, it also refers to two other large populations that study data science on the undergraduate level: allied majors and non-majors (see also Sect. 13.3, in which these groups of learners are addressed in the context of machine learning). For example, the National Academies of Sciences, Engineering, and Medicine (2018) makes the following recommendations with respect to the variety of learners:

Recommendation 2.2: Academic institutions should provide and evolve a range of educational pathways to prepare students for an array of data science roles in the workplace.

Recommendation 2.3: To prepare their graduates for this new data-driven era, academic institutions should encourage the development of a basic understanding of data science in all undergraduates.

Recommendation 4.1: As data science programs develop, they should focus on attracting students with varied backgrounds and degrees of preparation and preparing them for success in a variety of careers.

In line with these recommendations, many undergraduate programs associated with a variety of fields have been launched in recent years. Since it is impossible to review them all, we invite the readership to review and analyze several online resources that reflect the intensive discussions taking place about the main characteristics and the main messages of these programs. See Exercise 7.9.


Exercise 7.9 Analysis of undergraduate data science programs
Review the following two online discussions about undergraduate data science programs.
1. Cliff Stein Discusses Education with Harvard Data Science Initiative, at: https://datascience.columbia.edu/news/2022/watch-cliff-stein-discusses-education-with-harvard-data-science-initiative/
2. Race + Data Science Lecture Series, at: https://www.youtube.com/playlist?list=PLRXLC-iYknE66vW3tiBHy1O5pfYmhFfEo
(a) What are the main characteristics of the programs presented in these conversations? What are the main values they promote? (b) How is each of the following concepts reflected in these programs? (i) Diversity (in terms of learners) (ii) Interdisciplinarity (in terms of the application domain) (iii) Opportunities for professional development (c) How is the physical space in which the programs operate designed? (d) Find at least six additional undergraduate data science programs (two for each of the three populations: majors, allied majors, and non-majors). Do they share the same or similar characteristics with the programs reviewed in the above conversations? Do they adhere to the same values? (e) What is your conclusion from this exercise? Write at least five insights. (f) Reflect: How will you use these insights in your current and/or future data science teaching?

7.5 Data Science for Graduate Students

In Chaps. 2 and 8, one of the main characteristics of data science, namely, data science as a research method, is discussed. As a research method, data science is, therefore, relevant for graduate students who perform research as part of their graduate studies. The fact that data science is also a research method may explain the need of professional data scientists for advanced higher education, as the following data indicates: 88% of data scientists have at least a master’s degree and 46% have a PhD (Analytics Insight, 2022). Furthermore, in Chap. 8, in which we discuss data science as a research method, we show that specific research skills are needed in order to perform data science research.


In this section, we divide graduate students into three groups:

(1) Graduate students in data science. This group of graduate students explores new data science methods; its main purpose is to add new knowledge to the discipline of data science.

(2) Graduate students in other disciplines who do research in (a) allied disciplines (e.g., engineering) or (b) other disciplines (e.g., the humanities and social studies). This group of graduate students studies data science as a research method (whether or not they use it in their research).

(3) Graduate students who do not perform research as part of their graduate studies. These graduate students chose to proceed to graduate studies to promote their professional development by adding data science to their professional knowledge and skills, whether as practitioners in the industry (Sect. 7.8), policy makers (Sect. 7.9), users (Sect. 7.10), or simply as twenty-first century citizens (Sect. 7.11).

Students’ knowledge of mathematics, statistics, and computer science is one factor that should be taken into account when considering the integration of data science into a graduate study program. In Chap. 20, Data Science for Research on Human Aspects of Science and Engineering, we propose a workshop for graduate students in science and engineering who have such knowledge, and in Chap. 19, Data Science for Social Science and Digital Humanities Research, we present a data science workshop for researchers and graduate students in the social sciences and the humanities who in some cases lack sufficient knowledge in mathematics, statistics, and computer science, but need data science as a research method. See also the next section (Sect. 7.6).

Exercise 7.10 Graduate studies For each of the three groups of graduate students mentioned above, review several graduate programs from both the disciplines of science and engineering and social sciences and humanities. (a) Is the research facet of data science emphasized in each of these programs? If it is, how? If it is not, how would you suggest incorporating the concept of ‘data science as a research method’ into each of these programs? (b) Formulate general guidelines for incorporating a research facet into such programs in general.


7.6 Data Science for Researchers

As a research method, data science has become relevant for researchers in all disciplines. Here again, researchers can be divided into two groups:

(a) Researchers in the social sciences and digital humanities, who (in most cases) lack the required knowledge in mathematics, statistics, and computer science. In Chap. 19, we present a workshop that we designed for researchers in the social sciences and the humanities. Details about the workshop can also be found in Mike et al. (2021).

(b) Researchers in science and engineering, who have the required knowledge in mathematics, statistics, and computer science. These researchers should focus on how to integrate their mathematics, statistics, and computer science knowledge with their disciplinary knowledge so as to carry out research using data science methods. Tamimi (2020) proposes a possible path for closing this gap, which also reflects our own experience, as described in Chap. 19.

Exercise 7.11 Data science and diversity
Berman and Bourne (2015) propose that gender diversity should be prioritized in data science. In Mike et al. (2021), we propose that data science research may increase diversity and widen the shrinking pipeline associated with the STEM subjects (Camp, 2002). Discuss these assertions with respect to the interconnections between the three main characteristics of data science: interdisciplinarity, a variety of learners, and data science as a research method.

7.7 Data Science for Data Science Educators

Data science educators comprise one of the most important groups that should study data science; after all, they teach all of the other populations described in this chapter. In general, data science educators must fill in gaps both in data science itself and in the pedagogy of data science, that is, its pedagogical content knowledge (PCK) (Shulman, 1986), which includes the teaching principles and methods that fit each discipline; in our case, data science. In this section, we focus on computer science teachers, both pre-service and in-service, who are sometimes expected to teach data science, both in the K-12 system and in academia. We hope that this guide provides the fundamental knowledge required.


7 The Variety of Data Science Learners

Teacher preparation programs for (pre-service and in-service) computer science teachers are addressed in this guide in two chapters, as follows:

• In Chap. 9, we describe the phenomenon of the pedagogical chasm. We identified this phenomenon in workshops that we facilitated for in-service high school computer science teachers. The purpose of the workshops was to expose teachers to the new data science unit that was added to the Israeli high school computer science curriculum (see Sect. 7.3). According to the pedagogical chasm phenomenon, the adoption process of an innovative curriculum is blocked when a suitable pedagogy is not proposed together with the curriculum itself. This hurdle is removed when an appropriate pedagogy is developed and presented together with the curriculum. As described in Chap. 9, this is what happened in our workshop on the new data science unit, which was attended by in-service high school computer science teachers. When the new data science unit was introduced to the computer science teachers, along with extensive teaching materials and a detailed curriculum, including the number of teaching hours and suitable pedagogical guidelines, more and more teachers expressed interest in starting to teach it.
• In Chap. 18, we describe in detail the Methods of Teaching Data Science course that we taught to pre-service computer science teachers in our institution, the Technion—Israel Institute of Technology. We present the course's target, structure, teaching principles, and grading policy (including all the submissions), the questionnaires we distributed, and the content and activities of two of its lessons.

7.8 Data Science for Professional Practitioners in the Industry

In this section, we focus on practitioners in the industry who wish to deepen their data science knowledge as a twenty-first century skill. This skill will improve their fit for the future job market, in which most professions will, most probably, require some data science proficiency. These practitioners have a variety of available resources from which to choose how to learn the relevant data science knowledge: besides academic programs, many online data science courses are offered by online learning platforms such as Coursera and edX, which give anyone, anywhere, access to online courses from leading universities and companies. On these platforms, practitioners may choose to study single courses according to their interests, sets of courses that provide a certificate in a specific topic, or course programs that earn full academic degrees. The learning environments offered by such platforms are especially suited for practitioners in the industry for several reasons: (a) the learning process is flexible and each practitioner can adapt it to his or her work hours; (b) each practitioner can assemble a set of courses from different universities according to his or her needs;


(c) practitioners can very quickly gain an overview of a topic they wish to study before they decide which course to take (if at all); (d) practitioners can review several courses on a specific topic before deciding which to study; (e) if the practitioner does not need a formal certificate, the courses are free; and (f) no prerequisites or admission criteria are imposed; the courses are open to everyone, with or without prior formal education.

Exercise 7.12 Online learning platforms versus physical campuses

Compare the online learning platforms mentioned above with the physical campus. What are their advantages over the physical campus? What are their disadvantages compared with the physical campus? Address general learning criteria as well as specific arguments related to data science learning.

Exercise 7.13 Online data science courses

(a) How many online data science courses are offered by these online learning platforms?
(b) How many online courses are offered by the online learning platforms in other disciplines
• from science and engineering?
• from the humanities and social studies?
Compare the numbers. What can you learn about the field of data science? What do they tell us about the community of data science learners who choose to study on these online learning platforms?

Exercise 7.14 Profiles of online learners of data science courses

Find data science courses, certificates, and degrees offered on online learning platforms. For each course, certificate, and degree, sketch the profile of an industry practitioner for whom, in your opinion, the said course, certificate, or degree is especially suitable.


Exercise 7.15 Analysis of data gathered by online learning platforms

Another aspect of digital learning platforms is the ability to collect and analyze learners’ data. This approach is called learning analytics or educational data mining (EDM).
(a) What data can the online learning platforms gather in order to learn about their learners and about their learning processes?
(b) How can these online learning platforms use these data to improve their performance as well as the learning experience they provide to their community of learners?
(c) What data science methods are suitable for such an exploration?
(d) Develop five exercises about these data. Do all of your exercises apply data science methods? If they do, in what way? If they do not, what other kinds of exercises did you develop?

7.9 Data Science for Policy Makers

In many cases, policy makers are called upon to make decisions based on data science algorithms. In such cases, their familiarity with the application domain is crucial for making meaningful decisions. Think, for example, about policy making in education systems or health systems. Decision makers must be familiar with what each feature in the data means, not only in terms of its current expression in the real world, but also in terms of its future expression in the real world if the recommendations of the data science exploration are accepted and implemented. This statement is relevant for all disciplines in which policy makers must make decisions based on a data science exploration, e.g., public transportation, economics, the environment, and many more.

Exercise 7.16 Policy making based on data science methods

Consider policy makers who are required to decide whether or not to make a change in their country’s education system.
(a) Suggest a change in the education system to be examined using data science methods in this case study.
(b) Describe the dataset whose exploration, using data science methods, the decisions are to be based on.
(c) Suggest a recommendation that might emerge from a data science-based exploration.


(d) Present two opinions: one that advocates for the implementation of the recommendation and one that argues that the recommendation of the data science exploration should not be implemented.
(e) What is your opinion? Should the recommendation be implemented?
(f) Reflect on the thinking process you went through in stages (a) through (e).
(g) Repeat the above stages for another topic taken from any other domain in your life.

In addition to their familiarity with the application domain, policy makers should have basic data science knowledge. They should know what can be explored, what the characteristics of a dataset suitable for a data science exploration are, and what considerations should be weighed when deciding whether or not to accept the recommendation of a data science exploration. They should also be familiar with and aware of cognitive biases (see Chap. 3), as well as the limitations of data collection and analysis. In Chap. 17, we further elaborate on the data science education of policy makers, describing a 1.5-h data science workshop for policy makers in the education system.

7.10 Data Science for Users

Section 13.3.3 addresses the challenges of teaching machine learning algorithms to users, that is, people who need to understand data science as part of their professional or daily life. Users include, for example, physicians who use machine learning algorithms as diagnostic tools and business managers who use machine learning algorithms for financial decision making. As described in Sect. 13.3.3, the goal of data science education for users is to teach them the functionality and limitations of machine learning algorithms, including how to interpret the algorithms’ output and predictions in real life, in the context of the relevant application domain.

Exercise 7.17 Data science knowledge for users

In your opinion,
(a) What are the five most important data science concepts that users should be familiar with?
(b) What are the five most important data science skills that users should acquire?


7.11 Data Science for the General Public

In this section, we refer to the data science education of all the other populations that were not discussed explicitly in the previous sections. Such populations usually use data science tools as part of their work or everyday life. These uses may include recommendation engines for a variety of real-life problems, various NLP (natural language processing) and voice recognition applications, and social networks. In such cases, several important capabilities are required:

• Awareness of biases introduced into machine learning systems during their development;
• Critical thinking, to monitor decision-making processes and actions based on the recommendations of these systems;
• Awareness of how data is presented in the media and the ability to verify the reliability of the conclusions reached by the commentators of the different media channels;
• Interpretation and understanding of data visualization.

Similar to industry practitioners (see Sect. 7.8), the general public can gain this knowledge in one of the many online courses offered by the online learning platforms. The following exercises explore additional ways by which data science education for the general public can be designed.

Exercise 7.18 Data science knowledge for the general public

In your opinion,
(a) What are the five most important data science concepts that the general public should be familiar with?
(b) What are the five most important data science skills that the general public should acquire?

Exercise 7.19 Designing a data science course for the general public

Design a course on data science for the general public. Specify the course objectives, length, guidelines, and the teaching principles you implemented in the course design, as well as the content of each of its lessons (including topics and activities).


Exercise 7.20 Books about data science for the general public

Look on the web for books about data science written for the general public.
(a) What topics do they address? What are their main messages?
(b) What topics do you suggest should be added to these books?

Exercise 7.21 Storytelling from the general public’s perspective

Section 11.2.2 describes storytelling as an organizational data science skill. The main purpose of storytelling in the context of data science is to communicate data science work, without delving into the scientific and technical details, (mainly but not only) to audiences who are not familiar with such details. Storytelling is usually discussed from the data science perspective. For example, read Storytelling with Data—A Key Skill for Data Scientists (2022). Suggest five guidelines for the general public to apply when listening to such stories (for example, what details should they pay attention to in a story, or what questions should they seek answers to).

7.12 Activities on Learning Environments for Data Science

In Chap. 1 of this guide, we present two main kinds of learning environments for data science. In this section, we present several activities that invite the readership to examine the suitability of different environments for the variety of populations discussed in this chapter.

Exercise 7.22 Categorization of learning environments for data science according to learner groups

Match each environment presented in Chap. 1 with the group, from among the groups discussed in this chapter, that is best suited to it. After completing the task, reflect:
(a) What were your considerations in determining which learning environment is the most suitable one for teaching data science to each population?
(b) How does the availability of a variety of learning environments for data science support the message of data science as an important twenty-first century skill?


Exercise 7.23 Textual programing environments for data science

Jupyter Notebook and Google Colab are two examples of notebooks that are suitable for educational purposes. Find additional notebooks used to teach and learn data science. Compare all of the notebooks you found.
(a) What are the advantages and the disadvantages of each notebook?
(b) Would you prefer to use a specific notebook if you were teaching data science to a specific population mentioned in this chapter? Explain your answer.

Exercise 7.24 Visual programing environments for data science

Orange Data Mining, Weka, and KNIME are three examples of visual environments that are suitable for educational purposes. Find additional visual environments used to teach and learn data science. Compare the visual environments you found.
(a) What are the advantages and the disadvantages of each visual environment?
(b) Would you prefer to use a specific visual environment if you were teaching data science to a specific population mentioned in this chapter? Explain your answer.

Exercise 7.25 Use of general data processing tools for data science education purposes

Besides dedicated environments for data science learning, other tools (such as Excel and Tableau) can be used for data science teaching and learning. Explore this by analyzing, from a pedagogical perspective,
(a) their advantages over dedicated data science learning environments;
(b) their disadvantages relative to dedicated data science learning environments;
(c) their suitability for teaching data science to the specific populations mentioned in this chapter.


Exercise 7.26 Teachable Machine

Teachable Machine is a web-based tool that makes creating machine-learning models fast, easy, and accessible to everyone. See: https://teachablemachine.withgoogle.com/. Explore what can be done with Teachable Machine. Can Teachable Machine be used for pedagogical purposes? If yes, how? If not, why not?

7.13 Conclusion

The topic of this chapter, a variety of learners, alludes to the diversity of the people who deal with data science. Not only professional data scientists are exposed to its main ideas and applications; everybody is, from elementary school pupils to researchers, from industry practitioners in various disciplines to the general public. Furthermore, data science also has the potential to increase diversity in the STEM subjects, which are known as disciplines with low diversity in terms of gender and minorities. One reason for this is the interdisciplinarity of data science, which opens up many gates for a variety of populations who wish to become data science learners (e.g., researchers in a variety of social sciences and humanities disciplines).

References

Analytics Insight. (2022, January 8). 5 Skills every data science candidate should know. https://www.analyticsinsight.net/5-skills-every-data-science-candidate-should-know/
Berman, F. D., & Bourne, P. E. (2015). Let’s make gender diversity in data science a priority right from the start. PLoS Biology, 13(7), e1002206.
Camp, T. (2002). The incredible shrinking pipeline. ACM SIGCSE Bulletin, 34(2), 129–134.
Dasgupta, S., & Hill, B. M. (2017). Scratch community blocks: Supporting children as data scientists. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (pp. 3620–3631).
Data Analyst vs. Data Scientist: What’s the Difference? (2021). Coursera. https://www.coursera.org/articles/data-analyst-vs-data-scientist-whats-the-difference
Data Big Ideas. (2022). YouCubed. https://www.youcubed.org/data-big-ideas/
Dear Data. (2022). YouCubed. https://www.youcubed.org/tasks/dear-data/
Fisher, N., Anand, A., Gould, R., Hesterberg, T., Bailey, J., Ng, R., Burr, W., Rosenberger, J., Fekete, A., Sheldon, N., Gibbs, A., & Wild, C. (2019, September). Curriculum frameworks for introductory data science. http://www.idssp.org/files/IDSSP_Data_Science_Curriculum_Frameworks_for_Schools_Edition_1.0.pdf


Gould, R., Suyen, M.-M., James, M., Terri, J., & LeeAnn, T. (2018). Mobilize: A data science curriculum for 16-year-old students (pp. 1–4). iase-web.org.
Hazzan, O., & Mike, K. (2022). Teaching core principles of machine learning with a simple machine learning algorithm: The case of the KNN algorithm in a high school introduction to data science course. ACM Inroads, 13(1), 18–25.
Heinemann, B., Opel, S., Budde, L., Schulte, C., Frischemeier, D., Biehler, R., Podworny, S., & Wassong, T. (2018). Drafting a data science curriculum for secondary schools. In Proceedings of the 18th Koli Calling International Conference on Computing Education Research (Koli Calling ’18) (pp. 1–5). https://doi.org/10.1145/3279720.3279737
How to teach data science in K-12 schools? Stanford-led team launches “Big Ideas.” (2021, October 11). Stanford Graduate School of Education. https://ed.stanford.edu/news/how-teach-data-science-k-12-schools-stanford-led-team-launches-big-ideas
Martinez, W., & LaLonde, D. (2020). Data science for everyone starts in kindergarten: Strategies and initiatives from the American Statistical Association. https://hdsr.mitpress.mit.edu/pub/wkhg4f7a/release/3
Mike, K., Hartal, G., & Hazzan, O. (2021). Widening the shrinking pipeline: The case of data science. 2021 IEEE Global Engineering Education Conference (EDUCON), 252–261.
Mike, K., Hazan, T., & Hazzan, O. (2020). Equalizing data science curriculum for computer science pupils. In Koli Calling ’20: Proceedings of the 20th Koli Calling International Conference on Computing Education Research (pp. 1–5).
Mike, K., & Hazzan, O. (2020). Data science and computer science education. In O. Hazzan, N. Ragonis, & T. Lapidot (Eds.), Guide to teaching computer science (pp. 95–117). Springer.
National Academies of Sciences, Engineering, and Medicine. (2018). Data science for undergraduates: Opportunities and options. The National Academies Press. https://doi.org/10.17226/25104
Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational Researcher, 15(2), 4–14.
Storytelling with Data—A Key Skill for Data Scientists. (2022). ProjectPro. https://www.projectpro.io/article/storytelling-with-data-a-key-skill-for-data-scientists/174
Tamimi, N. (2020, December 29). Engineers should learn data science differently. Medium. https://towardsdatascience.com/engineers-should-learn-data-science-differently-99c7bf5caa6b

Chapter 8

Data Science as a Research Method

Abstract In this chapter, we focus on the challenges that emerge from the fact that data science is also a research method. First, we describe the essence of the research process that data science inspires (Sect. 8.2). Then, Sect. 8.3 presents examples of cognitive, organizational, and technological skills which are important for coping with the challenge of data science as a research method, and Sect. 8.4 highlights pedagogical methods for coping with it. In the conclusion of this chapter (Sect. 8.5), we review, from an interdisciplinary perspective, the skills required to perform data science research. The discussions about data science skills in this chapter and in Chap. 11 are especially important today due to the increasing awareness that scientists and engineers, in general, and data scientists, in particular, should acquire professional skills, in addition to disciplinary and technical knowledge.

8.1 Introduction

In this chapter, we discuss the challenges that emerge from the fact that not only is data science a research method, but it, in fact, introduces a new perspective on research processes. These challenges emerge mainly because research skills are usually learned at the graduate level while, as we saw in Chap. 7, a variety of populations actually study data science, some of whom lack any knowledge of research processes. We begin by looking at data science as a research method (Sect. 8.2). Then, in Sect. 8.3, we examine a variety of skills needed in order to conduct meaningful research using data science methods. We describe three kinds of skills (cognitive, organizational, and technological), present examples of skills from each category (see Table 8.1), and elaborate on one representative skill from each category. In Sect. 8.4, we propose teaching methods that are especially important for teaching research skills. Since ethical behavior and additional skills, such as professional and soft skills, are needed to perform professional research using data science methods, we focus on relevant skills for data science in two other chapters as well: in Chap. 11, we discuss professional and soft skills, and in Chap. 12, we highlight the importance of ethical behavior. This chapter is also connected to Chap. 10, which presents a pedagogical perspective on the data science workflow (which is, in fact, a research process), and to Chaps. 19 and 20, which describe teaching frameworks of data science for research purposes.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_8

Table 8.1 Mapping data science research skills

Research skills:
• Cognitive skills: model assessment; critical thinking
• Organizational skills: understanding the domain of the organization; communicating research results to different stakeholders
• Technological skills: data visualization; using data science technological tools for research purposes

Exercise 8.1 Mapping data science skills

Add 3–5 skills to each cell in Table 8.1. Reflect on your work:
(a) How did you decide to which cell each skill belongs?
(b) What message do the added skills highlight?

Clearly, to be applied meaningfully, all skills require basic data science knowledge. For example, if you wish to apply critical thinking while reading research results produced by data science methods, you must be aware of, and check, whether or not the researchers selected the analyzed data with special attention to the base rate of each subset of its samples (see Chap. 13). Similarly, managers of organizations who use data science in their decision making should have basic data science knowledge to examine the implications of the predictions suggested by a data science model in a world that keeps changing in general, and specifically in their own business domain.
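The role of the base rate mentioned above can be demonstrated with a short, self-contained Python sketch (the class proportions below are invented for illustration): on an imbalanced sample, a "model" that always predicts the majority class attains a seemingly impressive accuracy while detecting nothing, which is exactly the kind of result that critical thinking should flag.

```python
# Illustration: why the base rate matters when judging reported accuracy.
# A "classifier" that always predicts the majority class looks accurate
# on imbalanced data while learning nothing. The numbers are hypothetical.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 95% negative (0), 5% positive (1).
labels = [0] * 95 + [1] * 5

# The trivial baseline: always predict the majority class.
baseline_predictions = [0] * len(labels)

print(accuracy(labels, baseline_predictions))  # 0.95, yet the baseline
                                               # never detects a positive
```

In a classroom setting, such a sketch can open a discussion on why accuracy alone is a misleading indicator when the base rates of the classes differ.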

8.2 Data Science as a Research Method

Data science is a research method (see Sect. 2.3). Data scientists look for patterns in the data and try to construct models that enable them to predict phenomena that seem to be relevant for the application domain. In this section, we delve into the details of this research process.


Exercise 8.2 Technion Biomedical Informatics Research and COVID-19

Watch the clip https://www.youtube.com/watch?v=uu1MY-odhj8. What kind of new research process in biomedicine does data science enable? How does it differ from traditional research processes in biomedicine? To which of the populations described in Chap. 7 can this question be presented?

Figure 2.3 presents the data science workflow. A traditional research process starts with the formulation of a research problem and targets, based on, and in parallel to, a literature review whose purpose is to find previous relevant research works on which the current research can be built. If needed, the research problem and targets are refined several times at this stage according to the literature review, and again in the following stages, as the understanding of the research problem and target improves. Then, based on the research target, research questions are formulated in a way that addresses the gaps identified in the literature review. According to the formulated research questions, relevant data is gathered and exploratory data analysis is carried out to understand what insights can be derived from the data. Based on the exploratory data analysis, a model is constructed, usually using machine learning methods. Finally, the research results are published and then, based on the published results, a new research process can be initiated. With respect to the above description, we note the following:

1. Any research process in general, and the data science research-oriented workflow in particular, is an iterative process, in which each cycle improves the understanding of the research topic. For example, the data modeling stage may raise new research questions that may lead to a new cycle of the data science workflow.
2. The process may also contain iterative sub-cycles. For example, following a literature review, researchers may discover that the research problem and targets they formulated are too broad and should be narrowed; after the exploratory data analysis stage, the researchers may realize that additional data must be collected; and so on.
3. Since a lot of data is currently gathered automatically by various existing tools and applications, the data need not always be collected by the researchers themselves, since it is already available. Furthermore, some research processes are initiated by already available data. In such cases, when researchers identify interesting data for investigation, they may perform an exploratory data analysis, proceed to the formulation of a relevant research problem, targets, and questions, and continue according to the stages of the data science research process.
4. The data analysis stage, which in the data science workflow is, in many cases, carried out by modeling the data using machine learning methods, can also be carried out by other methods, e.g., a statistical test or qualitative data analysis.
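For teaching purposes, the workflow stages above can be compressed into a minimal, runnable sketch. The records and the "research question" below are invented for illustration, and the model is a deliberately transparent 1-nearest-neighbor classifier rather than a production tool; each stage of the script corresponds, in miniature, to a stage of the workflow.

```python
# A toy run of the data science workflow: (hypothetical) data gathering,
# exploratory data analysis, model construction, and use of the model.

# Stage 1: "gathered" data -- invented (feature_1, feature_2, label) records.
data = [
    (1.0, 1.2, "A"), (0.9, 1.0, "A"), (1.1, 0.8, "A"),
    (3.0, 3.1, "B"), (2.8, 3.3, "B"), (3.2, 2.9, "B"),
]

# Stage 2: exploratory data analysis -- per-class feature means.
def class_means(records):
    sums, counts = {}, {}
    for x1, x2, label in records:
        s = sums.setdefault(label, [0.0, 0.0])
        s[0] += x1
        s[1] += x2
        counts[label] = counts.get(label, 0) + 1
    return {lab: (s[0] / counts[lab], s[1] / counts[lab]) for lab, s in sums.items()}

# Stage 3: model construction -- a 1-nearest-neighbor classifier.
def predict(records, x1, x2):
    def squared_distance(rec):
        return (rec[0] - x1) ** 2 + (rec[1] - x2) ** 2
    return min(records, key=squared_distance)[2]

# Stage 4: apply the model to a new, unseen point.
print(class_means(data))
print(predict(data, 1.05, 1.1))  # the nearest record belongs to class "A"
```

Presenting the workflow this way lets learners see how a finding at one stage (for example, surprising class means in the exploratory analysis) can send them back to an earlier stage, in line with the iterative nature of the process noted above.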


See also Sect. 2.5 and Chap. 10 for further discussions on the data science workflow.

Exercise 8.3 Sub-cycles of the data science workflow

Suggest additional possible sub-cycles of the data science research process.

8.2.1 Data Science Research as a Grounded Theory

As mentioned above, from a research perspective, data scientists look for patterns in the data, trying to construct a model that enables them to predict phenomena that seem to be relevant for the application domain. This process is similar to the process of developing a grounded theory (Glaser & Strauss, 2017), which, according to Wikipedia, is “a systematic methodology that has been largely applied to qualitative research conducted by social scientists. The methodology involves the construction of hypotheses and theories through the collecting and analysis of data. Grounded theory involves the application of inductive reasoning. The methodology contrasts with the hypothetico-deductive model used in traditional scientific research” (Grounded Theory, 2022). Let us now examine several characteristics of grounded theory mentioned in this description and their analogy in the case of data science:

• A systematic methodology that has been largely applied to qualitative research conducted by social scientists: In the case of data science, the analogy of this assertion is: a systematic methodology
– conducted by data scientists and
– applied (a) to quantitative research on structured data and (b) to qualitative research on unstructured data.
• The methodology involves the construction of hypotheses and theories through the collecting and analysis of data: In the case of data science, the analogy of this assertion is the construction of a model through data collection, exploratory data analysis, and data analysis.
• Grounded theory involves the application of inductive reasoning. The methodology contrasts with the hypothetico-deductive model used in traditional scientific research: Exactly the same assertion can be made about a typical data science workflow.
In other words, it seems that data scientists apply the grounded theory research methodology to quantitative data, or to data that have been converted into quantitative data (such as text, photos, and music), instead of to the qualitative data usually analyzed in the grounded theory research methodology. Another difference between classic grounded theory and the research process applied in data science is that in data science, artificial intelligence is combined with human intelligence to generate the research results, while in traditional grounded theory, only humans are involved in the data analysis and in creating the theory from the data analysis.

8.2.2 The Application Domain Knowledge in Data Science Research

In order to do meaningful research in any discipline, one should be an expert in the said discipline. For example, in order to carry out data science research in education (for example, learning analytics or educational data mining research), one should be an expert in education, including being familiar with learning theories and pedagogical approaches. Similarly, in order to do data science research in marketing, transportation, or medicine, one must first acquire expertise in these disciplines, respectively.

Exercise 8.4 The challenge of learning the application domain

As can be seen in Chap. 7, a variety of populations nowadays study data science. How can the challenge of familiarity with the application domain be overcome when a specific population studies data science research methods but lacks the required application domain knowledge?

The importance of application domain knowledge in research processes in general, and in data science research processes in particular, is reflected by the MERge model (Hazzan & Lis-Hacohen, 2016). According to the MERge model, management, education, and research (MER) activities can be carried out meaningfully only when one is an expert in the discipline with respect to which the activities are being carried out, specifically, in the case of data science, the discipline from which the data is taken. This expertise allows one to decide, for example, what data to collect and how to collect it according to the base rate prevalence of each kind of instance in the said application domain, to explore meaningful connections between features, to evaluate the feasibility of the different models produced by data science algorithms as well as to explain them, and to understand the meaning of each model and its predictions in the real world. Part V, Frameworks for Teaching Data Science, provides teaching frameworks of data science for professionals whose core activities are management, education, and research, and who need data science knowledge to improve their professionalism.

Imagine a data scientist who works in a hospital and is asked to compare two possible organizational processes for the hospital using data science, based on available data about hospital operations and management that is gathered around the world. Such an exploration requires medical knowledge as well as experience, expertise, and an understanding of how hospitals work. This knowledge will enable the data scientist to determine what data are relevant and how they should be gathered in a reliable way, how to identify outliers in the data, and which features are more important than others. Such a background will also enable the data scientist to prioritize activities that are more critical from a medical perspective over less critical ones. If the data scientist does not have the relevant medical knowledge, he or she must collaborate with an expert who does have the relevant application domain knowledge (in our case, medical knowledge).

Exercise 8.5 The key role of understanding the application domain
Hypothesize a situation in which an expert in machine learning algorithms analyzes medical data without understanding the relevant medical knowledge and publishes the results of the analysis in his or her blog without any peer review.
(a) Describe a scenario that reflects the harms that might be caused as a result of such a situation.
(b) Identify clashes in the data science workflow that can lead to such a situation.
(c) Repeat steps (a) and (b) with respect to a case study that deals with the COVID-19 pandemic. Were you able to find a real case, or did you have to invent one?

8.3 Research Skills

In this section, we address the skills needed to carry out data science research activities in a meaningful way. We focus on three skills, one from each of the three categories of data science skills: cognitive, organizational, and technological. We selected these skills by surveying several resources that list the skills needed to use data science meaningfully (located either through a literature review or through a simple Google search among the many existing lists of data science skills). We note that the research skills discussed below are important for all data scientists, but not only for them: several research-oriented skills are also important for other data science learners, who can use them, for example, for data collection and exploratory data analysis.


Exercise 8.6 Populations for which research skills are relevant
While reading the following description of data science research skills, identify, for each skill, the populations (from those described in Chap. 7):
(a) for which it is important to acquire the skill; and
(b) for which the teaching of the said skill is too challenging and should be avoided.
For each population you mention, explain your answers to (a) and (b).

8.3.1 Cognitive Skills: Awareness of the Importance of Model Assessment—Explainability and Evaluation

As described above, any typical data science workflow includes the basic stages of a research project: framing the research subject by formulating the research problem, defining its goal, and asking research questions; gathering and analyzing data; and finally, presenting the results. Data scientists should, therefore, master skills that enable them to decide on the volume and type of data needed for the research and to select and apply appropriate models and statistical tests. It is, however, especially important to assess these models since, unlike regular research in which the researchers are the ones who select or develop the model or theory, in the case of data science, the machine builds the model. This increases the importance of the researcher's awareness of two activities that he or she must master:

(a) Addressing the model's explainability: As it turns out, not all models produced by machines are explainable (see the discussion about explainability in Chap. 13). This realization led to the coining of the concept of explainable AI (XAI), or interpretable AI, which refers to people's ability to understand the results of the solution (the model) produced by the machine.

(b) Evaluating the model's performance: This activity requires familiarity with machine learning performance indicators (see Sect. 14.5) and the selection of indicators that are relevant for the specific case. The ability to assess the fitness of the model to the problem under investigation involves additional skills, such as critical thinking.
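The evaluation activity in (b) can be made concrete with a small example. The following sketch, written in plain Python with invented toy labels, computes three common performance indicators, accuracy, precision, and recall, from lists of true labels and model predictions; which indicator matters most depends on the application domain, which is exactly where domain expertise enters.

```python
# A minimal sketch of model evaluation with common performance
# indicators. The labels below are invented toy data: 1 = "positive"
# (e.g., a patient who has the condition), 0 = "negative".

def confusion_counts(y_true, y_pred):
    """Count true/false positives/negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred):
    """Return accuracy, precision, and recall for binary predictions."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Toy data: 10 cases, only 2 of which are positive (an imbalanced
# base rate, as is typical of medical data).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(evaluate(y_true, y_pred))
```

Note that on this data a model that predicts "negative" for every case would score 80% accuracy while missing every positive case (a recall of zero), which illustrates why selecting relevant indicators requires understanding the base rates of the application domain.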


Exercise 8.7 Model assessment
List at least five additional skills needed for model assessment. Explain the role and contribution of each skill to the activity of model assessment. How would you address a case in which an expert in the application domain is required to assess the model's fitness to the problem at hand, but lacks the required skills?

8.3.2 Organizational Skills: Understanding the Field of the Organization

It is crucial to understand the domain in which the organization operates in order to perform meaningful data science research. This assertion is, once again, based on the MERge model, which asserts that disciplinary knowledge is needed in order to perform (any) research meaningfully (see the description of the MERge model above). However, due to the high demand for data scientists in the job market, data scientists may work for several organizations from different sectors during their career. It is, therefore, quite uncommon to find data scientists who are also experts in the domain of their organization. In such cases, experts in the knowledge domain of the organization must work together with the data scientists throughout the entire data science workflow, at every step of the process. This also highlights the importance of teamwork (see Chap. 11) and the dominance of interdisciplinarity embodied in the data science workflow.

A vivid example in this context is Gina Life, described in what follows. Gina Life is a femtech company that has developed a platform for the early detection of problems in women's health. The company's vision states: "With the use of a unique proprietary biomarkers panel supported by AI and data science, a personalized test will be generated for every woman, enabling early detection of women related diseases." (Gina Life—Simply Saving Lives, 2021). In a lecture given at our institution, the Technion—Israel Institute of Technology, on February 13, 2022, Dr. Inbal Zafir-Lavie, a Technion graduate and Gina Life's CEO and founder, described the need presented in Fig. 8.1, which highlights the importance of the application domain (in this case, life science) in the data science workflow carried out at the company.

Thus, in addition to the challenge companies face in recruiting data scientists with research skills, they must also meet the challenge of forming teams with experts in the discipline in which the company operates. Again, the interdisciplinarity of data science is highlighted.


Fig. 8.1 Dr. Inbal Zafir-Lavie’s slide on the knowledge required from life science graduates and from data science graduates (presented here with permission)

We conclude this section by linking the skill of understanding the organization's domain to the skill of storytelling (discussed in Sect. 11.2.2). Specifically, such an understanding helps data scientists choose the stories they tell about the data. In this spirit, the Tableau Data Trends 2022 report says that "Data becomes the language for people and organizations to be seen, have their issues understood, and engage with institutions intended to serve them" (Setlur, 2022, p. 21). Such an understanding also increases data scientists' awareness of ethical issues associated with the specific industry (see Chap. 12 for a discussion about the social and ethical issues of data science education).

Exercise 8.8 A metaphor for a discourse between a biologist and a data scientist
(a) Deepen your understanding of the research carried out by Gina Life.
(b) Imagine you are visiting Gina Life's offices. Describe a scenario in which a biologist and a data scientist, both working for the company, communicate to decide how biomarkers will be selected and what machine learning techniques will be used for the data analysis. What metaphor would be appropriate to describe such a discourse?


Exercise 8.9 Building a data science team
It is your responsibility to recruit an employee for the data science team of a company that develops products for transportation planning. From the following five candidates, you are asked to choose two. For each candidate, mark your preference (first priority, second priority, or waiting list):

- 100% expertise in machine learning, no knowledge in transportation
- 80% expertise in machine learning, 20% expertise in transportation
- 50% expertise in machine learning, 50% expertise in transportation
- 20% expertise in machine learning, 80% expertise in transportation
- No knowledge in machine learning, 100% expertise in transportation

(a) What were your considerations when evaluating the suitability of each candidate for the company?
(b) Repeat this exercise with respect to a company that develops medical products. Were your considerations different? Explain your answer.
(c) Exercise 3.9 examines a similar case from the perspective of familiarity-driven domain neglect. What interconnections can you find between the two contexts in which the case is discussed?

Exercise 8.10 The importance of the application domain for the data scientist
Imagine a data scientist who works in an organization of your choice. For each phase of the data science workflow, describe a scenario that illustrates why a data scientist working in the organization must understand the domain of the organization. Perform this exercise for three domains, one of which is the domain of your own organization.


Exercise 8.11 Data science and anthropology
In Sect. 11.2.2, which discusses the organizational skill of storytelling, we also present an analogy between data science and anthropology. In a similar spirit, Tye Rattenbury (Salesforce) and Dawn Nafus (Intel) published the paper "Data Science and Ethnography: What's Our Common Ground, and Why Does It Matter?" in 2018 (available at https://www.epicpeople.org/data-science-and-ethnography/). The paper presents a conversation about data science and ethnography. Read the conversation and summarize the lessons that each research field can learn from the other.

8.3.3 Technological Skills: Data Visualization

Data visualization is the graphical representation of data and of its analysis, using visual elements such as charts, graphs, and maps. It forms a useful bridge between technical presentation and storytelling. Data visualization is important for both intra-organizational and extra-organizational communication: it helps communicate the results of data science models to the general public, which consumes them, and makes it possible to deliver important messages within the organization quickly, without delving into details (when these are not needed). Clearly, data scientists should be aware of the advantages of the visual presentation of research results and should select the visualization tools that are suitable for the case at hand. At the same time, however, the target audience of the visualization should also have the basic skills needed for its interpretation and understanding (such as the ability to read graphs), which are included in the arsenal of data literacy skills (see Exercise 3.16 and Sect. 5.5). For such cases and many others, the idiom "A picture is worth a thousand words" applies.
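At its core, every charting tool performs the same operation: mapping data values to the size or position of visual marks. The following pure-Python sketch (the admission figures are invented for illustration) shows this mapping as a horizontal text bar chart; in a classroom one would naturally use a graphical library instead, but the data-to-mark scaling logic is the same.

```python
# A minimal sketch of the heart of data visualization: mapping data
# values to the lengths of visual marks. Here the marks are text
# bars; a charting library maps values to pixels in the same way.

def bar_chart(data, width=40):
    """Render a dict of label -> value as lines of a text bar chart."""
    max_value = max(data.values())
    lines = []
    for label, value in data.items():
        # Scale each value to a bar length relative to the maximum.
        bar = "#" * round(width * value / max_value)
        lines.append(f"{label:<10}{bar} {value}")
    return lines

# Invented example data: monthly admissions in a hospital ward.
admissions = {"January": 120, "February": 80, "March": 200}
print("\n".join(bar_chart(admissions)))
```

Even this toy example exposes a design decision every visualization embodies, namely what the bar lengths are scaled against; the choice of baseline and scale is one way a visualization can mislead an audience that lacks the data literacy skills mentioned above.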

Exercise 8.12 Visualization tools of data analysis environments
Search the web for data analysis environments and explore the visualization tools these environments provide. What are their advantages? What are their weaknesses? What visual elements do you like? Which do you dislike? Why?


Exercise 8.13 Visualization in the data science workflow
Review the data science workflow (Fig. 2.3) and suggest three phases or concepts for which, in your opinion, visualization is especially important and offers added value. Explain your answer.

Exercise 8.14 Visual programing environments for data science
In Sect. 1.6.2, we present visual programing environments for data science. Explore the advantages of such environments for the teaching and learning of data science with respect to the different populations presented in Chap. 7.

As it turns out, storytelling (discussed in Sect. 11.2.2) and visualization are both perceived as complementary skills by data science experts. In a Forbes article entitled "Data Storytelling: The Essential Data Science Skill Everyone Needs", Brent Dykes (2016) connects the three ways by which the story of the data can be told: data, visualization, and storytelling:

When narrative is coupled with data, it helps to explain to your audience what's happening in the data and why a particular insight is important. [...] When visuals are applied to data, they can enlighten the audience to insights that they wouldn't see without charts or graphs. [...] Finally, when narrative and visuals are merged together, they can engage or even entertain an audience. (para. 5)

In fact, stories and visual presentations are combined in many ways in our lives, for example, in video clips of songs and in movies. In their column "A Data Scientist's Real Job: Storytelling", published in the Harvard Business Review on March 27, 2013, Bladt and Filbin (2013) also connect visualization with storytelling and advise:

Present data so that everyone can grasp the insights [...] While we used regression analysis to find a list of significant variables, we visualized data to find trends [...] By presenting the data visually, the entire staff was able to quickly grasp and contribute to the conversation. (para. 8)

Exercise 8.15 Connections between data visualization and other data science skills
Explore the connections between data visualization and other data science skills described in this chapter and in Chap. 11.


Visualization is further discussed from a research perspective in Chap. 10 (Sect. 10.4), in the context of the research-oriented exploratory data analysis phase of the data science workflow.

8.4 Pedagogical Challenges of Teaching Research Skills

In this section, we highlight two pedagogical challenges that emerge from the fact that data science introduces a new research-oriented method. Specifically, they result from the fact that a variety of populations study data science and that some of these populations:
(a) have no knowledge of or experience in research processes;
(b) lack the required knowledge in the application domain.

These two facts should be taken into consideration each time data science is taught. First, the educator should check the learners' knowledge of research processes and, accordingly, decide which aspects of the data science workflow to focus on. Second, when selecting examples to be presented to the specific population of learners, the educator should pay special attention to the application domains from which the examples are selected, so that they are meaningful to the learners.

Relevant methods of teaching the skills described in this chapter can be found in several chapters of this guide, including:
- Chapter 10, which presents a pedagogical perspective on the data science workflow;
- Chapter 11, which discusses the teaching of professional and soft skills;
- Chapter 12, in which social and ethical issues of data science education are discussed; and
- Chapter 16, which presents methods for teaching machine learning.

As can be observed, these chapters highlight active learning and the promotion of reflective and analysis skills, which are important in any learning process, and specifically in learning processes of (research) skills.

Exercise 8.16 Reflective and analysis skills
How, in your opinion, can reflective skills improve research processes in general, and data science research processes in particular?


8.5 Conclusion

In this chapter, we describe the research skills required for meaningful implementation of the different phases of the data science workflow, both by data scientists and by other stakeholders involved in this workflow (such as observers of the results of the data science analysis). In concluding this chapter, we highlight, from the interdisciplinary perspective, the skills required in order to perform data science research. One of the messages we impart in this chapter is that these research skills are strongly connected to the interdisciplinarity of data science. In other words, due to the importance of the application domain, the interdisciplinary perspective is especially important in the use of research skills. This observation derives from the fact that one cannot perform research meaningfully without expertise in the relevant discipline. Indeed, how can one make sense of numbers if he or she is not familiar with what they represent? How can one observe relationships between different features of the data without understanding their meaning? The answers to these questions explain why research skills are usually learned in graduate school or during the final stages of undergraduate studies.

One implication of the above discussion is that if we consider introducing the topic of data science at the undergraduate level or even earlier, as presented in Chap. 7, we should also consider teaching some aspects of the research mindset, in general, and of the data science research method and process, in particular, taking into consideration the learners' background and why they are studying data science.

Exercise 8.17 Challenges of data science from a research perspective
In this chapter, we analyzed the field of data science as a research method and discussed several pedagogical challenges that emerged from this perspective.
(a) List these challenges.
(b) Describe the connection of each of the challenges to data science, as an interdisciplinary research method that is learned by a variety of populations.

Exercise 8.18 Additional data science skills
This exercise repeats Exercise 8.1, which was presented at the beginning of this chapter. For each cell in Table 8.1, suggest three additional skills and explain their importance for professional data scientists.

References


Bladt, J., & Filbin, B. (2013). A data scientist's real job: Storytelling. https://hbr.org/2013/03/a-data-scientists-real-job-sto
Dykes, B. (2016). Data storytelling: The essential data science skill everyone needs. https://www.forbes.com/sites/brentdykes/2016/03/31/data-storytelling-the-essential-data-science-skill-everyone-needs/?sh=29958f1e52ad
Gina Life—Simply Saving Lives. (2021, March 23). Gina life. https://www.gina-life.com/
Glaser, B. G., & Strauss, A. L. (2017). Discovery of grounded theory: Strategies for qualitative research. Routledge.
Grounded theory. (2022). https://en.wikipedia.org/wiki/Grounded_theory
Hazzan, O., & Lis-Hacohen, R. (2016). The MERge model for business development: The amalgamation of management, education and research. Springer.
Setlur, V. (2022). AI augments and empowers human expertise. Tableau. https://www.tableau.com/sites/default/files/2022-02/Data_Trends_2022.pdf

Chapter 9

The Pedagogical Chasm in Data Science Education

Abstract As an interdisciplinary discipline, data science poses many challenges for teachers. This chapter presents the story of one of them: the adoption, by high school computer science teachers, of a new data science curriculum developed in Israel for high school computer science pupils. We analyze the adoption process using the diffusion of innovation and crossing the chasm theories. Accordingly, we first present the diffusion of innovation theory (Sect. 9.1) and the crossing the chasm theory (Sect. 9.2). Then, we present the data science for high school curriculum case study (Sect. 9.3). Data collected from teachers who learned to teach the program reveal that when a new curriculum is adopted, a pedagogical chasm might exist (i.e., a pedagogical challenge that reduces the motivation of the majority of teachers to adopt the curriculum) that slows down the adoption process of the innovation (Sect. 9.4). Finally, we discuss the implications of the pedagogical chasm for data science education (Sect. 9.5).

9.1 The Diffusion of Innovation Theory

So far, in this part of the guide (Part II—Opportunities and Challenges of Data Science Education), we have discussed the challenges of data science education from the perspective of three characteristics of data science: interdisciplinarity (Chap. 6), the variety of learners (Chap. 7), and data science as a research method (Chap. 8). In this chapter, we discuss the challenges of data science education from the perspective of the diffusion of innovation theory, as applied to the adoption process of a new curriculum.

The diffusion of innovation theory was proposed some 60 years ago by Everett Rogers, based on his study of the diffusion of agricultural technologies among rural farms in the United States (Rogers, 1962). Since its introduction, the diffusion of innovation theory has been applied in many additional fields. Specifically, in the context of education, it has been applied, for example, to describe the diffusion of educational technology (Lau & Greer, 2022; Sahin, 2006) and of educational programs (Bennett & Bennett, 2003; Scott & McGuire, 2017; West et al., 2007).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_9



Exercise 9.1 Diffusion of educational programs
Explore the resources listed above, which describe the diffusion of educational programs. How are they similar to one another? How do they differ from one another?

Rogers defines innovation as "an idea, practice, or project that is perceived as new by an individual or other unit of adoption" (2003, p. 12). He then identifies four main elements of the diffusion of innovations: (a) the innovation, (b) the communication channels, (c) time, and (d) the social system. According to the diffusion of innovation theory, innovations spread in society by flowing from one of the following five distinct groups of adopters to the next:

Innovators: This group is characterized by enthusiasm for innovation in general. Its members value the innovativeness of a new product and are the first to understand the value of a new technology over existing ones. Innovators are willing to embrace an innovation even if they are required to invest significant effort in using it due to its premature development stage.

Early adopters: Early adopters are ready to take risks and embrace innovation at early stages of its maturation. This group is characterized by the ability to quickly identify the added value of an innovation and to harness it in favor of a larger vision. Early adopters are not looking for improvement; rather, they promote a vision of breakthroughs and innovations. For them, innovation is a means of realizing the vision, and the innovation's importance derives from its ability to contribute to the fulfillment of this vision.

Early majority: This group consists of pragmatic people who are interested in adopting mature innovations to improve their performance of a variety of activities. This group attempts to reduce costs and is, therefore, interested in standardization. Its members look to adopt mature, reliable, and fully supported products.

Late majority: This group consists of conservatives who are fundamentally opposed to innovation and prefer using existing and familiar products over new and innovative ones.

Laggards: Members of this group oppose innovation, are skeptical about its benefits, and pose difficult questions to the innovators. The laggards constantly point out the discrepancies between the promises claimed for a new technology and its actual realization.

It is customary to display the five groups on an innovation-adoption timeline (see Fig. 9.1). The blue line represents the rate of adoption by each group over time, and the yellow line represents the cumulative adoption of the innovation as time progresses.


Fig. 9.1 Diffusion of innovation timeline (Rogers, 1962)1
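The conventional sizes of the five groups in Fig. 9.1 (roughly 2.5% innovators, 13.5% early adopters, 34% early majority, 34% late majority, and 16% laggards) come from cutting a normal adoption curve at one and two standard deviations from the mean adoption time. The short sketch below recomputes these shares from the standard normal distribution; it is an illustration of where the familiar percentages come from, not part of Rogers' original presentation.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

# Rogers' cutoffs, in standard deviations relative to the mean
# adoption time: innovators adopt earlier than (mean - 2 sd),
# early adopters between -2 sd and -1 sd, and so on.
cutoffs = {
    "innovators": (-math.inf, -2.0),
    "early adopters": (-2.0, -1.0),
    "early majority": (-1.0, 0.0),
    "late majority": (0.0, 1.0),
    "laggards": (1.0, math.inf),
}

# The share of each group is the area under the bell curve
# between its two cutoffs.
shares = {
    group: normal_cdf(hi) - normal_cdf(lo)
    for group, (lo, hi) in cutoffs.items()
}

for group, share in shares.items():
    print(f"{group:<16}{share:6.1%}")
```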

Exercise 9.2 Reflection on your experience with the adoption of innovation
Reflect on your personal experience with the adoption of innovations. What group did you belong to in each case? Was it the same group? Were they different groups? What can you learn about your personality as an adopter of innovation?

Rogers (2003) defines an innovation-decision process through which the innovation adopters pass: it starts when the adopters first learn about the existence of an innovation, following which they form an attitude towards it, decide whether to adopt or reject it, and finally, after its adoption (if it is indeed adopted), they use it as part of their environment. Specifically, the innovation-decision process involves five steps: (a) knowledge, (b) persuasion, (c) decision, (d) implementation, and (e) confirmation. In our case, the computer science teachers were exposed to the existence of the new curriculum through official messages sent to them by the Ministry of Education, and they learned the content of the new curriculum in a teachers' training program. The critical steps of the innovation-decision process, namely the persuasion and decision steps, took place during those professional development programs.

Rogers presented five attributes of an innovation that affect its rate of adoption: (a) relative advantage, (b) compatibility, (c) complexity, (d) trialability, and (e) observability. In the case of the new data science curriculum, its relative advantage over

1 The right to use this work is granted here: https://commons.wikimedia.org/wiki/File:Diffusion_of_ideas.svg.


other possible topics derives from the attractiveness of data science as a new profession. The curriculum, however, is not compatible with other learning materials, as it requires Python as the programming language, and its complexity is also higher relative to other topics, as it involves complex machine learning algorithms. Its trialability is low, since a teacher must choose to teach the entire unit and cannot teach only parts of it, and its observability is low, since teachers each work in their own classroom and seldom get to observe other teachers. Furthermore, the data science unit is one possible unit out of several elective units on other topics, including Android application development, web development, and more. The data science unit is, therefore, optional, and computer science teachers can choose whether or not to adopt it.

9.2 The Crossing the Chasm Theory

While some innovations diffuse according to the diffusion of innovation theory, the diffusion rate of other innovations drops before they reach a high level of acceptance. According to the crossing the chasm theory, this effect can be caused by a discontinuity in the flow of innovation between the early market (innovators and early adopters) and the mainstream market (early majority and late majority) (Moore, 2002). Since these two groups (the early market and the mainstream market) have different adoption characteristics, a so-called chasm exists between them, stemming from the fact that adopters from the mainstream market rely on recommendations and on the adoption experience of other adopters in this market, rather than on the adoption experience of the early market. Mainstream market adopters simply understand that early market adopters are willing to adopt innovations that are not yet mature enough for them. The chasm reduces the adoption rate of the innovation after its adoption by the early market and can lead to its failure to penetrate the mainstream market.

Moore (2002) defined two types of innovation: continuous innovation and disruptive innovation. Continuous innovation aims to upgrade the existing capabilities of a product without requiring the adopters to change their behavior (for example, replacing a mobile phone keyboard with a touch screen). Disruptive innovation requires the adopters to change their behavior in order to use it (for instance, replacing a mobile phone keyboard with a voice-operated virtual assistant). While early market adopters are willing to adopt disruptive innovations, mainstream market adopters are willing to adopt only continuous innovations. The mainstream market will adopt a disruptive innovation when it reaches a certain level of maturity and its compatibility enables members of the mainstream market to adopt it with no substantial change in the way they operate. The crossing the chasm theory has already been used to describe the adoption processes of educational innovations; see Lee et al. (2021) for a detailed review.


9.3 The Data Science Curriculum Case Study from the Diffusion of Innovation Perspective2

As described in Sect. 7.3, a three-year program in data science and machine learning, designed for 10th to 12th grade computer science pupils, was approved in Israel in 2019 after three years of pilot implementation (Mike et al., 2020). The program was integrated into the current, official Israeli high school computer science curriculum, which has been updated on a regular basis since 1998, when it was first implemented (Gal-Ezer et al., 1995). These updates include both new content and adaptations for new populations, targeting younger audiences, mainly elementary and middle school pupils.

Similar to the general high school computer science programs taught in Israel, the data science program is designed to accommodate two levels. The basic level, which is taught in the 10th grade, focuses on the data science workflow and machine learning (see Table 7.1: Data science for high school curriculum—Topics and number of hours), and the extended level, which is taught in the 11th and 12th grades, focuses on deep learning. Pupils at both levels develop a machine/deep learning project in Python that includes defining a machine learning problem, searching for and collecting data, data exploration, training, evaluating and comparing several machine learning algorithms, and communicating the results.

Since we are involved in the development of the new data science program as curriculum designers, teachers, and researchers, and actively promote it among computer science teachers in teachers' professional development courses, we are interested in exploring the adoption of the new data science curriculum from the perspective of the diffusion of innovations. In what follows, we describe the case study of the adoption of the new data science curriculum by high school computer science teachers in Israel. We first present the story of the new program (Sect. 9.3.1), followed by the teachers' perceptions of and attitudes towards the new curriculum, and the challenges they face with respect to it (Sect. 9.3.2).
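The project stages mentioned above (problem definition, data collection, exploration, training, and the evaluation and comparison of algorithms) can be illustrated with a deliberately tiny sketch. The pure-Python example below invents a toy dataset and compares a majority-class baseline with a one-nearest-neighbor classifier; an actual pupil project would use a genuine dataset and library implementations, so this is only a minimal illustration of the workflow's shape, not the curriculum's own materials.

```python
# A toy walk-through of the project stages: data, exploration,
# training/prediction with two simple "algorithms", and evaluation.

# Stages 1-2: an invented dataset. Each example is (feature_1, feature_2)
# with a binary label; imagine, e.g., two measurements and a diagnosis.
train = [((1.0, 1.2), 0), ((0.8, 1.0), 0), ((1.1, 0.9), 0),
         ((3.0, 3.2), 1), ((2.9, 3.1), 1)]
test = [((1.05, 1.1), 0), ((3.1, 3.0), 1), ((0.9, 1.1), 0)]

# Stage 3: exploration, here just the class counts (the base rate).
counts = {}
for _, label in train:
    counts[label] = counts.get(label, 0) + 1

# Algorithm A: a majority-class baseline.
majority = max(counts, key=counts.get)

def predict_baseline(x):
    return majority

# Algorithm B: one-nearest-neighbor (predict the label of the
# closest training example, by squared Euclidean distance).
def predict_1nn(x):
    def dist2(point):
        return sum((a - b) ** 2 for a, b in zip(point, x))
    nearest = min(train, key=lambda example: dist2(example[0]))
    return nearest[1]

# Stage 4: evaluation and comparison on held-out test data.
def accuracy(predict):
    hits = sum(1 for x, label in test if predict(x) == label)
    return hits / len(test)

print("baseline accuracy:", accuracy(predict_baseline))
print("1-NN accuracy:", accuracy(predict_1nn))
```

Comparing a learned model against a trivial baseline, as done here, is itself part of the evaluation skill discussed in Chap. 8: a model is only as interesting as its improvement over the base rate.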

9.3.1 The Story of the New Program

The story of the new program started in 2017, when Prof. Tamir Hazan,3 from the Faculty of Industrial Engineering and Management at the Technion, began teaching machine learning as part of the Technion's undergraduate data science program. That year, he also developed a machine learning course and taught it to his daughter's 9th grade class. Following his successful teaching experience in school, he presented the program to several groups of teachers, in an attempt to persuade them to adopt it.

2 See Sect. 7.3 for more details.
3 Full name appears with permission.


Even though many teachers had heard about this new curriculum and had begun to realize the importance of teaching data science in schools, almost none of them adopted it. Clearly, the program was still premature for mass adoption. Nevertheless, a few teachers did adopt the program. These early adopters were willing not only to learn the new content knowledge, but also to develop the new pedagogies and teaching methods required to teach it. One of these teachers was Mara Kofler,4 a highly experienced teacher with a strong entrepreneurial sense. Kofler had, in the past, been involved in the creation and adoption of new computer science curricula, such as a unit in computer graphics as part of the high school computer science curriculum and a full high school study program in data literacy. Kofler assembled a team, recruited new teachers, and initiated the implementation of the data science unit in 2018. One of the teachers on her team was Koby Mike, co-author of this guide. During the first two years of its implementation, the program was taught solely by Kofler’s team. Based on the experience gained in those two years, the curriculum was redesigned, adequate teaching methods were developed, and it was approved by the Israeli Ministry of Education as a pilot program. At this stage, the program was reintroduced to several groups of teachers. Again, only a few teachers adopted it and started teaching it. One of them was Ariel Bar-Itzhak,5 another experienced teacher with an entrepreneurial sense. While teaching the curriculum, Bar-Itzhak invested many hours developing learning materials and instructional movies for the benefit of both teachers and pupils (see https://www.youtube.com/user/arikb30; although the materials are in Hebrew, you can easily discern their richness). 
At the end of the 2020–21 school year (June 2021, to be exact), after appropriate pedagogy and teaching methods were developed, the curriculum was ready to be adopted by teachers in the mainstream market. Indeed, when the program was introduced for the third time to a group of teachers, about half of the group decided to adopt it and began teaching it the following year. Table 9.1 presents the timeline of this adoption process.

Exercise 9.3 Reflection on your experience with the adoption of educational innovation
Reflect on your personal experience with the adoption of educational innovations. Can you construct a similar timetable? At what stage did you adopt the educational innovation and start teaching it?

4 Full name appears with permission.
5 Full name appears with permission.


Table 9.1 The adoption process of the 10th grade data science curriculum

| Period | Phase of adoption of innovation | Description |
| September 2017 | Innovators | Prof. Hazan developed the program and taught it to his daughter’s 9th grade class |
| July 2018 | Innovators | Prof. Hazan presented the curriculum to a group of leading teachers, including Mara Kofler |
| September 2018–June 2020 | Innovators | The program was taught by Mara Kofler’s team, including the development of a suitable pedagogy |
| July 2020 | CHASM | First teachers’ training course (see Table 9.2) was held for computer science teachers in which the new program was presented, but the chasm was not crossed. Nevertheless, several early adopters did adopt the innovation and started teaching the data science unit |
| September 2020 | Early adopters | Ariel Bar-Itzhak, one of the early adopters, started teaching the program, developing additional learning materials and instructional movies along the way |
| July 2021 | CHASM crossed | A second in-service teachers’ training course was held for computer science teachers in which the new program was presented alongside the teaching pedagogy developed for its teaching (see Table 9.2) |
| September 2021 | Early majority | The early majority started teaching the program in 20 schools (note that Israel’s population is similar to that of New Jersey) |

9.3.2 The Teachers’ Perspective

In this section, we present the teachers’ perspective on the challenges involved in the adoption of the new data science curriculum. Data was collected through questionnaires that were distributed to the teachers at the end of the second teachers’ training course (July 2021). In the questionnaire’s open questions, the teachers were asked to share what challenges, in their opinion, they would face when they start teaching the curriculum, and what challenges, in their opinion, their pupils would face while learning it.

Table 9.2 The schedule of the first data science teachers’ training course (teaching methods were added to the second course)

| Lesson | Topic | Number of hours |
| 1 | Introduction to data science; Python notebooks | 6 |
| 2 | Tabular data and Pandas | 6 |
| 3 | Exploratory data analysis; Visualization; Seaborn and Matplotlib | 6 |
| 4 | Introduction to machine learning; The KNN algorithm | 6 |
| 5 | Machine learning performance indicators | 6 |
| 6 | Core concepts of machine learning (overfitting, underfitting) | 6 |
| 7 | Data preparation process (1) | 6 |
| 8 | Data preparation process (2) | 6 |
| 9 | Images as data | 6 |
| 10 | Projects in data science | 6 |
| Total | | 60 |

The teachers’ answers are presented according to the three components of teachers’ knowledge defined by the TPACK framework: pedagogical knowledge, content knowledge, and technological knowledge (Mishra & Koehler, 2006). Twenty teachers answered the questionnaire and provided 56 assertions regarding the challenges that may be encountered when teaching and learning the new curriculum.

Pedagogical Knowledge

Pedagogical knowledge represents the teacher’s knowledge of how to teach; in other words, methods of teaching the subject matter. About two thirds of the teachers’ assertions referred to pedagogical challenges, and these were grouped into four categories: (a) concepts that are difficult to teach and learn, (b) motivation, (c) pupils’ need to work individually, and (d) teaching aids and teacher support. In the next section, we explain why such a large proportion of teachers (two thirds of the sample) mentioned pedagogical challenges. Here, we present and illustrate the four categories of teachers’ assertions that indicate a pedagogical challenge.

Within the ‘concepts that are difficult to teach and learn’ category, about half of the assertions declared that the curriculum includes many concepts that are difficult for teachers to teach and for pupils to learn. In particular, the teachers mentioned that many concepts require comprehensive mathematics and a solid theoretical background in machine learning and data analysis. For example, one teacher wrote that “there is a lot of theory here - from my acquaintance with 10th grade pupils, they have no patience for it at this point. That comes later”.


About a quarter of the assertions referred to the need to enhance pupils’ motivation to learn some essential parts of the curriculum that were perceived as less interesting. For example, one teacher listed “(pupils’) interest in tabular data” as a challenge. Another aspect of the curriculum is that it involves working on real problems with real data, which can make it challenging for the pupils to achieve high-performing algorithms. For example, one teacher mentioned “unsatisfactory final results” as a challenge.

About 15% of the assertions concerned the pupils’ need to work individually on complex projects. For example, one teacher wrote that pupils need to “know how to work alone, and look for answers to questions and databases”.

About 10% of the assertions addressed the lack of learning materials for the new curriculum and the lack of teacher support. For example, one teacher wrote that “arranged lesson plans for teachers with notes, examples, videos beyond advanced training” is a challenge.

Content Knowledge

Content knowledge refers to the teacher’s understanding of the subject matter itself, i.e., what they teach. Approximately one quarter of the assertions addressed the required content knowledge, which can be divided into two categories: the variety of topics included in the curriculum and the required knowledge in data science. For example, one teacher wrote that “the program is difficult to teach, the field is very broad and contains a lot of concepts and a lot of cross-cutting knowledge”.

Technological Knowledge

Technological knowledge refers to the tools required to make the content more accessible to the learners. About 9% of the assertions considered technological challenges. The main technological challenge that the teachers noted was finding suitable datasets (a) for teaching purposes, and (b) for projects to be developed by the pupils. For example, one teacher wrote that “it is difficult to find good databases on a variety of topics”. As it turns out, the use of real datasets and of dataset search engines is apparently a new activity and skill for most computer science teachers.

9.4 The Pedagogical Chasm

The case study presented here demonstrates that while some innovative teachers are willing to adopt a new curriculum in its early phases of development, the majority of teachers are willing to adopt a new curriculum only after new pedagogies and teaching methods have been developed for its teaching.

The crossing the chasm theory is especially relevant in cases of disruptive innovation, which is, we claim, what the new data science curriculum is for computer science high school teachers. We base our claim on the change required (if the new data science curriculum is adopted) in the three components of teachers’ knowledge defined by the TPACK framework: pedagogical knowledge, content knowledge, and technological knowledge (Mishra & Koehler, 2006):

• The pedagogy component: From a pedagogical perspective, a new curriculum is considered disruptive if it uses teaching methods that the teacher does not know how to employ. For example, a curriculum whose teaching is based on project-based learning (PBL) is a disruptive innovation, from the pedagogical perspective, for teachers who for many years have used only frontal teaching methods and have never before experienced any active-learning-based pedagogies (such as PBL).
• The content component: From a content perspective, a new curriculum can be considered disruptive if it is based on knowledge that teachers lack. For example, a curriculum that teaches the Python programming language is a disruptive innovation, from the content perspective, for teachers who have never before learned Python, as they will have to learn the new content before they can start teaching it.
• The technology component: From a technological perspective, a new curriculum is considered disruptive if it uses technology that requires teachers to change their behavior. For example, a curriculum taught in an on-line teaching environment is a disruptive innovation for teachers with no previous experience teaching in such an environment.

As it turns out, the data science curriculum for high school described above is disruptive in terms of all three components:

• The pedagogy component: Teachers must learn new pedagogies, e.g., using real data that were not generated by the teacher (Mike & Hazzan, 2021);
• The content component: Teachers must learn new content knowledge, e.g., data analysis;
• The technology component: Teachers must learn how to use new technology, e.g., dataset search engines.
The diffusion of a curriculum is similar to the diffusion of any other innovation in any other discipline with regard to the content component and the technology component, since any innovation may require users to learn new content or to use new technology. The crossing the chasm theory can therefore predict the diffusion of an innovation to the early and mainstream markets based on the characteristics of its content and technological properties, either as continuous or disruptive. Curriculum innovations are, however, different in terms of the pedagogical component. While any disruptive innovation may require its adopters to learn new knowledge or to use new technology, curriculum innovations may also require teachers to adopt a new pedagogy; hence, they may require the crossing of a pedagogical chasm. It is therefore not surprising that when asked to share what, in their opinion, are the challenges that they and their pupils will face in teaching and learning the curriculum, two thirds of the teachers’ assertions indicated pedagogical challenges.


Exercise 9.4 Pedagogical chasms
This guide is one means we propose for crossing the pedagogical chasm. Can you identify specific ideas that are presented in this guide that support you in crossing the pedagogical chasm in your data science teaching?

9.5 Conclusion

In the development of an educational innovation that requires a pedagogical change, educational researchers, innovators, entrepreneurs, and policy makers (see Chap. 17) should, therefore, first understand whether a pedagogical chasm exists and, if it does, explore its essence. Then, if needed, they should develop the new pedagogies and teaching methods required to cross the pedagogical chasm so as to facilitate adoption of the innovation by the early majority. It is highly recommended to form a team of early-adopter teachers for this purpose, since their experience is crucial for a future successful adoption process of the curriculum. This recommendation is especially valid in the case of computer science education, in which teachers are frequently asked to adopt new curricula and teaching materials, as teacher support is among the core components of successful curriculum implementation (Hazzan et al., 2008). In the case of data science, this guide is one means we suggest for crossing the pedagogical chasm. The Methods of Teaching Data Science course, presented in detail in Chap. 18, is another.

In conclusion, we propose that the interdisciplinarity of data science is reflected, in the case of the adoption of data science educational programs, in the need to cross the chasm related to the content component. Not only must teachers learn new computer science content, mathematics content, and statistics content, but they should also be open to exploring new application domains with which they are not necessarily familiar and, in particular, to mentoring learners in the development of projects in application domains in which the learners are more knowledgeable than they are. Indeed, such an interdisciplinary content chasm is challenging, especially when it is accompanied by a pedagogical chasm as described in this chapter.

References

Avidov-Ungar, O., & Eshet-Alkakay, Y. (2011). The islands of innovation model: Opportunities and threats for effective implementation of technological innovation in the education system. Issues in Informing Science and Information Technology, 8, 363–376.
Bennett, J., & Bennett, L. (2003). A review of factors that influence the diffusion of innovation when structuring a faculty training program. The Internet and Higher Education, 6(1), 53–63.
Fullan, M. (2007). The new meaning of educational change. Routledge.
Gal-Ezer, J., Beeri, C., Harel, D., & Yehudai, A. (1995). A high school program in computer science. Computer, 28(10), 73–80.
Hazzan, O., Gal-Ezer, J., & Blum, L. (2008). A model for high school computer science education: The four key elements that make it! In Proceedings of the 39th SIGCSE technical symposium on computer science education, pp. 281–285.
Hazzan, O., & Zelig, D. (2016). Adoption of innovation from the business sector by post-primary education organizations. Management in Education, 30(1), 19–28.
Lau, K., & Greer, D. M. (2022). Using technology adoption theories to maximize the uptake of e-learning in medical education. Medical Science Educator, 1–8.
Lee, J., Tan, E., Barrow, J., Bocala, C., & Seymour, B. (2021). Crossing the innovation chasm: Identifying facilitators and barriers to early adoption of the global health starter kit curriculum. Annals of Global Health, 87(1).
Mike, K., Hazan, T., & Hazzan, O. (2020). Equalizing data science curriculum for computer science pupils. In Koli Calling ’20: Proceedings of the 20th Koli Calling international conference on computing education research, pp. 1–5.
Mike, K., & Hazzan, O. (2021). How can computer science educators benefit from data science education? In Proceedings of the 52nd ACM technical symposium on computer science education, pp. 1363–1363.
Mishra, P., & Koehler, M. (2006). Technological pedagogical content knowledge: A framework for teacher knowledge. The Teachers College Record, 108(6), 1017–1054.
Moore, G. A. (2002). Crossing the chasm: Marketing and selling disruptive products to mainstream customers. HarperCollins. https://books.google.co.il/books?id=yJXHUDSaJgsC
Rogers, E. M. (1962). Diffusion of innovations. Free Press.
Rogers, E. M. (2003). Diffusion of innovations (5th ed.). Free Press.
Sahin, I. (2006). Detailed review of Rogers’ diffusion of innovations theory and educational technology-related studies based on Rogers’ theory. Turkish Online Journal of Educational Technology-TOJET, 5(2), 14–23.
Scott, S., & McGuire, J. (2017). Using diffusion of innovation theory to promote universally designed college instruction. International Journal of Teaching and Learning in Higher Education, 29(1), 119–128.
West, R. E., Waddoups, G., & Graham, C. R. (2007). Understanding the experiences of instructors as they adopt a course management system. Educational Technology Research and Development, 55(1), 1–26.
Zelig, D. (2011). Adoption of innovation from the business sector by secondary educational organizations [PhD Thesis]. Technion.

Part III

Teaching Professional Aspects of Data Science

In this part, we take a pedagogical perspective and examine several topics related to the professional aspect of data science, such as data science skills and social issues, in general, and ethics, in particular. In line with the teaching principles applied in this guide, we advocate the message that data science skills and other non-technical issues should be given special attention in any data science program, regardless of its framework or level. This part includes the following chapters:

Chapter 10: The Data Science Workflow
Chapter 11: Professional Skills and Soft Skills in Data Science
Chapter 12: Social and Ethical Issues of Data Science

Chapter 10

The Data Science Workflow

Abstract The examination of data science as a workflow is yet another facet of data science. In this chapter we elaborate on the data science workflow from an educational perspective. First, we present several approaches to the data science workflow (Sect. 10.1), following which we elaborate on the pedagogical aspects of the different phases of the workflow: data collection (Sect. 10.2), data preparation (Sect. 10.3), exploratory data analysis (Sect. 10.4), modeling (Sect. 10.5), and communication and action (Sect. 10.6). We conclude with an interdisciplinary perspective on the data science workflow (Sect. 10.7).

10.1 Data Workflow

Consensus has not yet been reached with respect to a single workflow for data science. In Chap. 2 we present two versions of the data science workflow: our version of the data science workflow (Fig. 2.3) and the data life cycle (Fig. 2.4, based on Berman et al., 2016). The existence of different perspectives on the data science workflow reflects the different facets of data science, e.g., as a research method or as a profession (see Chap. 2). Accordingly, the data science workflow can be seen as a research workflow or as a project workflow, respectively.

One of the early sources of the data science workflow is the Cross-Industry Standard Process for Data Mining (CRISP-DM), presented in Fig. 10.1 (Shearer, 2000). This model was developed for business environments, and it reflects the importance attributed to the understanding of the application domain, in this case business. The CRISP-DM model also presumes that the product of the analysis is a system to be deployed in the context of the organizational information system, in contrast to the common product of data analysis as a report.

Fig. 10.1 The CRISP-DM workflow (based on Shearer, 2000)

The agile data science workflow, presented in Fig. 10.2, is another version offered for the data science workflow (Pfister et al., 2015). The main attribute of the agile data science workflow is the ability to move back and forth between its different phases, if needed. This workflow is initiated by asking an interesting question about the world. As data science is also a research method, the exploratory data analysis phase sometimes starts without a specific research question; in such cases, after preliminary observations are revealed in the exploratory data analysis phase, the researcher continues either to the formulation of the research question or to additional data collection in order to deepen the understanding of the essence of the data.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_10

Exercise 10.1 The agile data science workflow
Why, in your opinion, does the agile data science workflow highlight the ability to move back and forth between its different phases? Can you suggest specific scenarios in which this ability is important?

Fig. 10.2 The agile data science workflow (based on Pfister et al., 2015)

From a pedagogical perspective, the workflow taught in any teaching framework should fit the learners. For example:

• Graduate students and researchers will benefit from learning a lifecycle that emphasizes the research-oriented phases of the data science workflow (e.g., data gathering and exploration).
• Learners who are not conducting research (e.g., high-school pupils or undergraduate students) should be exposed to one of the simpler models (e.g., the agile model or the model we introduce in Fig. 2.3). These learners may find it difficult to formulate research questions or to extract meaningful insights from the data exploration.
• Similar considerations should guide the selection of the data science workflow taught in a machine learning course; these learners, however, may not understand the importance of the data acquisition and exploration phases, as the course focuses only on machine learning algorithms.

Exercise 10.2 Additional data science workflows
Find three additional data science workflows.
(a) Describe each workflow, including its main characteristics.
(b) What type of learners does each workflow fit?
(c) Compare the workflows: In what ways are they similar? In what ways are they different?


Exercise 10.3 Data science workflow for learners
Suppose you are an undergraduate data science major who is taking the Introduction to Data Science course and is required to submit a final project at the end of the semester.
(a) Which data science workflow would you prefer to work with? Explain your choice.
(b) Choose two additional types of learners (see Chap. 7). In your opinion, what data science workflow should each type of learner learn?

10.2 Data Collection

Data science learners can collect data from three main resources:

• Their own data
• Academic or industrial partnerships
• Various publicly available dataset sources, for example, Kaggle, Data at WHO (World Health Organization), Wikidata, and search engines (such as Google Dataset Search).

Our experience shows that most learners tend to select the third resource, that is, they search for an existing dataset in a publicly available collection of datasets. While this method seems to be the simplest, it is not always the best strategy, as it may raise several difficulties, such as:

• It might sometimes be difficult to find a dataset that fits a specific project goal.
• As datasets are diverse in terms of the number and type of features and the number of examples they contain, it is sometimes difficult to understand their structure.
• Datasets are also diverse in their quality and reliability: Some datasets were generated automatically, rather than collected, and are therefore not always well structured; in other datasets, vital data is sometimes missing.

Although the acquisition of a dataset through an academic or industrial partnership can bypass the difficulties listed above, we recommend that in some cases the best option is to ask the learners to collect the data themselves. This approach has several advantages:

• Conducting such a data gathering process may enhance the learners’ engagement, interest, and motivation in the project development.
• The learner practices data collection from people, for example, by using questionnaires; gaining experience in such data collection processes is a vital step in many real-world projects and research works.


• While practicing data collection, the learner may learn computer science concepts and gain practical experience in developing data collection tools. Here are several examples of such concepts and practices: writing web crawling algorithms and web scraping programs, and using application programming interfaces (APIs).
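To give a concrete flavor of the web scraping practice mentioned above, here is a minimal sketch using only Python’s standard library. The HTML fragment is fabricated for illustration; in a real project the page would first be fetched from the web (e.g., with urllib) and a dedicated parsing library would typically be used.

```python
from html.parser import HTMLParser

class TableCellCollector(HTMLParser):
    """Collects the text content of all <td> cells from an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# A fabricated HTML fragment standing in for a downloaded web page
html_page = "<table><tr><td>Alice</td><td>172</td></tr><tr><td>Bob</td><td>180</td></tr></table>"
parser = TableCellCollector()
parser.feed(html_page)
print(parser.cells)  # ['Alice', '172', 'Bob', '180']
```

Even this small sketch surfaces the computer science concepts involved: event-driven parsing, state handling, and the mapping from a markup structure to tabular data.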

Exercise 10.4 Data collection
Suppose you are asked to develop a project in your Introduction to Data Science course: What is your preferred method for collecting data for your project? Explain your choice.

Exercise 10.5 Search for a dataset
Suppose you are asked to develop a project in your Introduction to Data Science course. You decide to look for a dataset in one of the open dataset collections available.
(a) Search for a dataset that suits your project.
(b) Reflect on the dataset search process: How did you look for the dataset? Did you find what you were looking for? What did you find?

Exercise 10.6 Disadvantages of data gathering by learners
We mention several advantages of data gathering by learners. Can you think of any disadvantages of such data gathering processes?

Exercise 10.7 Other tools for collecting data from people
One of the advantages we mention of data gathering by learners is the experience learners gain collecting data from people, for example, by using questionnaires. Can you suggest other data collection tools that learners can use to collect data from people?


10.3 Data Preparation

After the data is collected, it needs to be prepared for the next phases: exploratory data analysis and modeling. Preparation of the data for these phases includes:

• Verifying source reliability
• Merging data from different sources
• Cleaning the data, including:
  – Finding and handling erroneous data
  – Finding and handling outliers
• Finding and handling missing data (removing or completing it)
• Balancing and normalizing data
• Labeling data manually.
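A couple of these preparation activities can be sketched with pandas. The tiny DataFrame below is fabricated for illustration: it contains an impossible height of 0 cm (treated as an error) and a missing value, and one possible handling strategy is shown, namely completing the missing values with the column mean.

```python
import numpy as np
import pandas as pd

# Fabricated dataset: heights in cm, with an erroneous value (0) and a missing value
df = pd.DataFrame({
    "name": ["Dana", "Omri", "Noa", "Lior"],
    "height_cm": [172.0, 0.0, np.nan, 180.0],
})

# An impossible height of 0 cm is an error; treat it as missing data
df["height_cm"] = df["height_cm"].replace(0.0, np.nan)

# Handle missing data: here we complete it with the column mean
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())

print(df["height_cm"].tolist())  # [172.0, 176.0, 176.0, 180.0]
```

Whether completing with the mean is appropriate at all is exactly the kind of decision that, as discussed below, requires application domain knowledge.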

Exercise 10.8 Activities included in data preparation
For each of the above activities carried out as part of the data preparation phase, give an example of data that needs to be addressed in this phase.

Exercise 10.9 Data wrangling
In Sect. 11.2.1, we address the activity of data wrangling, which is seemingly similar to data preparation. What do the two terms—data preparation and data wrangling—reflect?

Some of these steps might be difficult or even impossible for learners who lack sufficient knowledge in the application domain. We demonstrate this claim using the concept of outliers. Outliers are data samples that differ significantly from the rest of the observations. In practice, however, this definition might not be sufficient, and application domain knowledge is needed in order to decide whether a sample is an outlier or not. We explain this using the feature “height”, which everyone is familiar with. We know that there are no people whose height is 0 cm. Therefore, a height feature with a value of 0 cm may indicate missing data or an error, rather than an outlier. At the other extreme, there are people who are 200 cm tall, and so even though a height of 200 cm may differ significantly from other height samples, it is probably not an outlier and should neither be removed nor fixed.

Similarly, the handling of missing data also requires application domain knowledge. First, it must be decided whether records with missing data should be completed or removed. Second, if it is decided to complete the data, then appropriate methods must be selected for this task. Clearly, such decisions require an understanding of the meaning of the missing data in the real world, considering the application domain.
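The height example can be made concrete with one common statistical rule, which flags values lying more than 1.5 interquartile ranges outside the quartiles. The data below is fabricated; the point of the sketch is that the statistical rule flags both 0 cm and 200 cm, while only domain knowledge can tell that the former is an error and the latter may be a genuinely tall person.

```python
import numpy as np

# Fabricated height samples (cm), including an error (0) and a tall person (200)
heights_cm = np.array([160, 165, 170, 172, 175, 178, 180, 200, 0])

# The 1.5*IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(heights_cm, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = heights_cm[(heights_cm < low) | (heights_cm > high)]
print(flagged.tolist())  # [200, 0] — both are flagged by the statistical rule
```

The rule treats the two flagged values identically; deciding that 0 cm is missing data while 200 cm is a valid sample is an application domain judgment, not a statistical one.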


Exercise 10.10 Data preparation
Search for a dataset in an application domain you are familiar with. Review the data. Can you find erroneous data? Can you find outliers?
Search for a dataset in an application domain you are not familiar with. Review the data. Can you find erroneous data? Can you find outliers?
(a) Analyze the differences (if such exist) between the results in the two cases: the familiar application domain and the unfamiliar application domain.
(b) Reflect on how you performed this exercise: How did you look for erroneous data? How did you look for outliers? Did you use any resources in these processes? If you did, which resources? If not, why not? Can these processes be improved? If yes, how? If not, why?

10.4 Exploratory Data Analysis

Exploratory data analysis is introduced in Chap. 2 as part of the data science research method (see Sect. 2.3.1). Since the essence of the exploratory data analysis phase is research, it may be challenging for learners with no research background or experience. Two common exploratory data analysis tools are descriptive statistics and visualization, which is indeed discussed in this guide as a research skill (see Sect. 8.3.3). Teaching and learning descriptive statistics and visualization entail several pedagogical challenges, as follows:

• For learners with no statistical knowledge, basic concepts such as randomness and variance may be challenging.
• The selection of the proper visualization method from the many types available (count plots, bar plots, pie plots, box plots, violin plots, histograms, scatter plots, regression plots, timeline plots, heat maps, geographical maps, and more) may be challenging.
• Sometimes, parameter tuning of the visualization as an object by itself is required to produce a correct, not misleading, and interpretable visualization. For example, setting the number of bins in a histogram or setting the scale presented on each axis sometimes totally changes the interpretation of a histogram.
• Color selection for the visualization, as well as the color conventions set for the entire exploratory data analysis phase, has a significant impact on the communication of the visualization. For example, choosing red for positive behavior and green for negative behavior may be confusing.
• Applying the selected visualization method may not be easy, as the data may need to be manipulated in order to arrange it in the form required to generate the desired visualization. For example, sometimes we have to define new columns based on given features, or reorganize the axis coordinates in a logical order.
• Even if the learners can produce the visualization correctly by calling the procedures with the correct syntax and data, it still may be challenging to understand and interpret the visualization produced.

Exercise 10.11 Types of visualization methods
(a) Select five visualization methods with which you are not familiar.
(b) Explore their purposes and how they can be generated.
(c) For each visualization method, suggest a dataset and a relevant research question that is related to the dataset and has some aspect that is highlighted by the visualization produced by the method.
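The bin-count issue mentioned in the list above can be demonstrated numerically, without even drawing the plot. The sample below is fabricated: it is bimodal (two clusters of values), and the same data binned coarsely versus finely supports very different interpretations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fabricated bimodal sample: two clusters of 500 values each
data = np.concatenate([rng.normal(10, 1, 500), rng.normal(20, 1, 500)])

# With only 2 bins, the two clusters collapse into two coarse counts,
# hiding the gap between them
counts_coarse, _ = np.histogram(data, bins=2)

# With 20 bins, the bimodal structure (two peaks, empty middle) can emerge
counts_fine, _ = np.histogram(data, bins=20)

print(len(counts_coarse), len(counts_fine))  # 2 20
```

Plotting both histograms (e.g., with Matplotlib or Seaborn, as in the training course above) makes the contrast visible: the coarse version suggests a single spread-out population, while the fine version reveals two distinct groups.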

Exercise 10.12 Selecting an appropriate visualization method
Consider the following visualization of the length of TED lectures (data is based on TED Talks, 2017):

(a) Which of the three methods presented here is, in your opinion, the best for visualizing the length of TED lectures? Explain your answer.
(b) Can you suggest other visualization methods for this task? What are the advantages and disadvantages of each one?


Exercise 10.13 Interpreting graphs
Consider the following visualization of the number of views of a TED lecture versus the number of comments about it (data is based on TED Talks, 2017):

(a) What can you learn about the relationship between the number of lecture views and the number of comments about it?
(b) Can you suggest other visualization methods for this task?

Exercise 10.14 Analysis of the exploratory data analysis phase from the perspective of abstraction levels
In Sect. 11.2.1, we discuss the cognitive skill of thinking on different levels of abstraction, illustrating this skill through the activity of data wrangling. Analyze the exploratory data analysis phase from this perspective. Ask yourself questions such as: When is it important to think on a high abstraction level? When is it important to lower the level of abstraction? What objects of thought do you think with when the exploratory data analysis is carried out on a high level of abstraction? What objects of thought do you think with when it is carried out on a low level of abstraction? And any additional questions that you find interesting.


10.5 Modeling

The modeling step includes fitting a statistical model (e.g., a Gaussian model) or a machine learning model (e.g., a neural network) to analyze and describe the gathered data. In this section, we elaborate on machine learning models; the description is applicable to statistical models as well.

The pedagogical aspects of machine learning modeling are discussed in detail in Part IV – Machine Learning Education (Chaps. 13, 14, 15, and 16) of this guide. While the discussion in those chapters focuses on the algorithmic aspects of machine learning, in this section, we elaborate on machine learning modeling in the context of the data science workflow. Specifically, this section focuses on the preparation of data for modeling. Unlike Sect. 10.3, which discusses the general process of data preparation for analysis purposes, we focus here on the preparation of data for machine learning modeling. Data preparation for modeling purposes has two aspects: (a) data quantity, quality, and coverage (Sect. 10.5.1) and (b) feature engineering (Sect. 10.5.2).

10.5.1 Data Quantity, Quality, and Coverage

Machine learning models are only as good as the data they learn from. Therefore, the model must learn from sufficient data in terms of quantity and quality, as well as from a representative coverage (in terms of distribution) of real-world examples of the data. These requirements have several pedagogical implications:
• As learners typically work with free, publicly available datasets, they cannot control or improve the dataset's quality, size, or coverage. Furthermore, learners must understand that this situation does not represent the way real data science projects and research are conducted. On the contrary, in real-world data science projects or research, great efforts are invested in improving the dataset, either by acquiring more data or by enhancing its quality and distribution.
• Learners can enrich publicly available datasets using data augmentation. This technique, while very useful in real-world projects, often requires both (a) advanced technical knowledge of how to apply data augmentation, and (b) application domain knowledge in order to choose the augmentation method suitable for the explored dataset (see Exercise 12.2 Generative adversarial networks from an ethical perspective).
• Learners who wish to collect their own datasets for their projects usually need to invest a lot of effort in collecting data of sufficient quantity, quality, and coverage.
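Basic quality and coverage checks of this kind can be sketched in a few lines of pandas. The following is a minimal illustration, not a complete audit; the dataset, column names, and labels are hypothetical:

```python
import pandas as pd

# Hypothetical labeled dataset with some missing values
df = pd.DataFrame({
    "age": [25, 31, None, 47, 52, 29, 35, None],
    "income": [42000, 55000, 61000, None, 88000, 47000, 52000, 39000],
    "label": ["healthy", "healthy", "sick", "healthy",
              "sick", "healthy", "healthy", "healthy"],
})

# Quality: what fraction of each column is missing?
missing_ratio = df.isna().mean()

# Coverage: is the label distribution representative, or skewed?
label_distribution = df["label"].value_counts(normalize=True)

print(missing_ratio)
print(label_distribution)
```

A skewed label distribution or a high missing-value ratio discovered at this stage signals that more data collection or cleaning is needed before modeling.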


10.5.2 Feature Engineering

Feature engineering is the process of generating new features for the data based on existing features, in a manner that enhances the performance of the machine learning algorithm. For example, suppose our target is to predict possible disease based on a person's height and weight. Each feature by itself may have little meaning. The body mass index (BMI), however, which is calculated as weight/height², is a well-known indicator of possible disease. Such feature engineering requires expertise in the application domain, which many learners may lack. Some voices in the deep learning community claim that feature engineering should be avoided, since a deep learning network can learn which feature transformations (that is, feature engineering) will improve the network's performance. While this statement is correct, training networks on raw data usually requires large datasets and high computational power, resources that are not always available to data science learners.
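The BMI example can be sketched in pandas; the values below are hypothetical and serve only to show how two weakly informative features combine into one known indicator:

```python
import pandas as pd

# Hypothetical raw features: height in meters, weight in kilograms
df = pd.DataFrame({
    "height": [1.60, 1.75, 1.82],
    "weight": [55.0, 80.0, 110.0],
})

# Engineered feature: body mass index = weight / height^2
df["bmi"] = df["weight"] / df["height"] ** 2

print(df["bmi"].round(1))
```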

Exercise 10.15 Feature engineering (a) Search for a dataset in an application domain you are familiar with. Review its features. Can you think of a new feature that can be generated (that is, engineered) based on the given features and that will improve the model? (b) Repeat (a) but this time for a dataset in an application domain you are not familiar with. (c) Reflect on your thinking processes in the two cases: the familiar application domain vs. the unfamiliar application domain. What are your conclusions?

10.6 Communication and Action

The final step of the data science workflow is communication and action. Communication refers to the preparation of reports targeted at various consumers of data analysis products, including decision makers, policy makers, professionals in data-driven occupations such as business and medicine, and the general public. Action refers to the performance of a real action in the real world, such as invoking requests to buy or sell stocks. Generating an effective report is not an easy task even for professional data scientists. In her blog, Brownlow (2022) claims that data scientists may feel that the "application of their data-driven reports as business decisions is not sufficient", and calls this problem "the Last Mile of Analytics". Brownlow explains:

The Last Mile of Analytics refers to the space between the data team and the business teams; it is the moment between someone looking at your report or dashboard, and the action they


take as a result; it encompasses all the back and forth questions we get about ‘what this graph means’ or ‘what data is included there’; it’s where we discuss the problems of our stakeholders, and how we can help solve them. (para. 7)

Similar messages are delivered in Sect. 11.3.2 Organizational skills: Teamwork and collaboration, in which we state that "Communication skills are one of the most important teamwork skills for imparting messages in a meaningful way to relevant audiences." It is for this purpose that the storytelling skill may be relevant (see Sect. 11.2.2). In many cases, learners write reports as part of their evaluation, and not in order to deliver data analysis results to users or to generate an action in the real world. Such reports contain information that data users do not need, such as detailed computer code, the process of data preparation, exploratory data analysis (including intermediate calculations and graphs), and model generation. As a result, data science learners do not gain sufficient practice writing reports that meet the real-life requirements of data-driven reports. We therefore recommend that data science learners submit two different types of reports for evaluation:
• A development report covering the development process of the data research or project. This report will be evaluated based on the correctness of the development process and its results.
• A user-oriented report simulating the final product that is to be delivered to data consumers. This report will be evaluated according to its quality as a deliverable data-driven report.

Exercise 10.16 Communication and actions (a) Look for a report on a data science project or research you wrote for your studies. Evaluate the quality of this report as a data-driven deliverable report for users. (b) Look for a data-driven report you prepared as part of your professional work. Evaluate the quality of this report as a learning task. (c) What can you learn from this comparison?

10.7 Conclusion

In this chapter, we reviewed the data science workflow from a pedagogical perspective. This perspective, again, highlights the interdisciplinarity of data science, in general, and the role of the application domain, in particular. For example, the importance of understanding the application domain is emphasized for the evaluation of


dataset quality, the completion of missing data and the identification of outliers, the interpretation of data visualizations, and the engineering of features. In addition, the difference between reports generated for learners' evaluation and reports generated to drive decisions and actions in the real world is emphasized. Reports intended for real users must incorporate application domain knowledge and present the findings clearly and persuasively to their users.

References

Berman, F. (co-chair), Rutenbar, R. (co-chair), Christensen, H., Davidson, S., Estrin, D., Franklin, M., Hailpern, B., Martonosi, M., Raghavan, P., Stodden, V., & Szalay, A. (2016). Realizing the potential of data science: Final report from the National Science Foundation Computer and Information Science and Engineering Advisory Committee data science working group. National Science Foundation Computer and Information Science and Engineering Advisory Committee Report, December 2016. https://www.nsf.gov/cise/ac-data-science-report/CISEACDataScienceReport1.19.17.pdf
Brownlow, T. (2022, March 15). The last mile of analytics can make or break your startup. More Than Numbers. https://blog.count.co/the-last-mile-of-analytics-can-make-or-break-your-startup/
Data at WHO. (2022). https://www.who.int/data
Dataset Search. (2022). https://datasetsearch.research.google.com/
Kaggle: Your Home for Data Science. (2022). https://www.kaggle.com/
Pfister, H., Blitzstein, J., & Kaynig, V. (2015). CS109 data science. https://github.com/cs109/2015/blob/f4dcbcc1446b7dfc33ecad4dd5e92b9a23a274e0/Lectures/01-Introduction.pdf
Shearer, C. (2000). The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, 5(4), 13–22.
TED Talks. (2017). https://www.kaggle.com/rounakbanik/ted-talks
Wikidata. (n.d.). Retrieved May 18, 2022, from https://www.wikidata.org/wiki/Wikidata:Main_Page

Chapter 11

Professional Skills and Soft Skills in Data Science

Abstract In this chapter, we highlight skills that are required to deal with data science in a meaningful manner. The chapter describes two kinds of skills: professional skills (Sect. 11.2) and soft skills (Sect. 11.3). Professional skills are specific skills that are needed in order to engage in data science, while soft skills are more general skills that acquire unique importance in the context of data science. In each section, we address cognitive, organizational, and technological skills. The chapter also offers exercises to practice the skills discussed and ends with several teaching notes (Sect. 11.4). The discussion about data science skills is especially important today due to the increasing awareness of the fact that scientists and engineers in general, and data scientists in particular, should acquire professional and soft skills, in addition to disciplinary and technical knowledge.

11.1 Introduction

In this chapter, we discuss the teaching of a variety of skills needed in order to execute each step of the data science workflow in a meaningful manner. We map the skills on two axes. On one axis, we divide the skills into professional skills (Sect. 11.2) and soft skills (Sect. 11.3), and on the second axis, we divide the skills into cognitive skills, organizational skills, and technological skills. Table 11.1 presents this mapping alongside examples of skills that belong to each type (or cell). In the continuation of this chapter, we elaborate on one representative skill from each cell. Clearly, data science practitioners need additional kinds of skills as well, such as social skills and ethical behavior. Some of those skills are addressed in Chap. 8, which discusses research skills, and in Chap. 12, which focuses on ethical and social issues.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_11


Table 11.1 Mapping data science skills

Cognitive skills
• Professional skills (Sect. 11.2): Thinking on different levels of abstraction; avoiding cognitive biases (see Chap. 3—Data Science Thinking, and Sect. 14.5)
• Soft skills (Sect. 11.3): Learning; critical thinking; knowledge sharing through awareness of the audience's level of knowledge

Organizational skills
• Professional skills (Sect. 11.2): Storytelling; familiarity with the business environment and strategy
• Soft skills (Sect. 11.3): Teamwork; communication

Technological skills
• Professional skills (Sect. 11.2): Programming for data science
• Soft skills (Sect. 11.3): Debugging data and models

Exercise 11.1 Mapping data science skills Add 3–5 skills to each cell in Table 11.1. You can find additional skills by conducting a web search for the term “data science skills”. Reflect on your findings: Do some skills appear in more than one cell? Do they have common characteristics? Were you surprised during your search? If yes, when? Why?

We emphasize that this mapping does not divide data science skills into disjoint groups. For example, the ability to think on different levels of abstraction could also have been mapped as a cognitive soft skill; we determine the specific location of each skill in Table 11.1 according to its centrality and uniqueness in the data science workflow.

Exercise 11.2 Critical thinking
One of the most important cognitive skills required for dealing meaningfully with data science ideas is critical thinking. Propose a case study in which decision-making processes that were not accompanied by critical thinking led to a chain of undesirable events.

Clearly, to be used meaningfully, all skills require data science knowledge. For example, users of a machine learning model produced by a data science algorithm should be aware of whether the researchers selected the analyzed data with special attention to the base rate of each subset of samples in the analyzed data (see Sect. 14.5). Likewise, managers of organizations who use data science for decision making should have data science knowledge in order to examine the implications of the predictions of the data science model in an ever-changing world in general, and


specifically in their business environment. This is further highlighted in Chap. 7, in which we discuss the variety of data science learners, from elementary school pupils to researchers, professional practitioners in a variety of fields, and policy makers. In each of the following two sections (dedicated to professional skills and to soft skills, respectively), we elaborate on three skills that represent the three categories of data science skills: cognitive, organizational, and technological. We selected the skills on which we focus according to our appreciation of their importance in general scientific and engineering work and, specifically, in the context of data science. Before delving into the details, we note that the skills we focus on here are not important only for professional data scientists; rather, they are needed in order to deal meaningfully with data science either as professionals, who are involved in the different phases of the data science workflow, or as users of products produced using data science methods. This last message is imparted vividly by Alison Holder, Director at EQUAL MEASURES 2030, in the Tableau Data Trends 2022 report: "What we hear from [partners], is that having data—having the skills and language and resources to use data in their advocacy, helps them be more credible. It helps them to open the door with policymakers. […Data] is a different tool in their toolbox" (Setlur, 2022, p. 23).

11.2 Professional Skills

In this section, we discuss a variety of skills needed to perform data science research and analysis in a meaningful manner, and focus on one specific example of each kind (given in parentheses): cognitive skills (thinking on different levels of abstraction), organizational skills (storytelling), and technological skills (programming for data science).

11.2.1 Cognitive Skills: Thinking on Different Levels of Abstraction

Thinking on different levels of abstraction means the ability to examine a topic at different levels of detail: sometimes the full, broad picture must be examined; at other times, a close examination of all of the details is needed to understand the essence of a problem or its solution. The ability to think on different levels of abstraction, as well as to navigate between them during problem-solving processes, is required for different activities conducted in the data science workflow. For example, in many cases, exploratory data analysis requires thinking on a lower level of abstraction than the level needed for the analysis of hyperparameters, since the latter requires understanding the properties of a machine learning algorithm, which in turn relies on understanding the algorithm as an object (see Chap. 3).


The skill of thinking on different levels of abstraction requires the ability to keep reflecting on one’s thinking process, in general, and on the movement between the different levels of abstraction, in particular. This process is called reflection in action (Schön, 1983, 1987). While performing a reflection-in-action process on our thinking, we ask ourselves questions such as: Should I look at the details to improve my understanding of the problem? Should I move to a higher level of abstraction, ignoring the details for a moment, in order to understand the context in which I am working and to realize the implications of each decision I make in the data analysis process? Similarly, a reflection on action is a process in which learners are asked to reflect on their activities after they have accomplished a task they were working on.

Exercise 11.3 Reflection on dataset exploration
Explore any dataset of your choice using any data science tool you want (e.g., Python, or a tool that does not require programming experience, such as Weka or Orange Data Mining; see Chap. 1). During this process, keep reflecting (that is, reflect in action) and document your exploration process. After completing the exploration of the chosen dataset, read your documentation and analyze it from the perspective of moving between levels of abstraction. Write your conclusions. What guidelines will you use in future exploration processes?

From the many activities conducted in the data science workflow that require moving between levels of abstraction, we focus now on the activity of data wrangling (also called data cleaning, data remediation, or data munging). We chose to examine this activity for two main reasons: first, it requires thinking on different levels of abstraction and moving between them; second, it is an important activity in the data science workflow, to which data scientists dedicate a significant portion of their time. Data wrangling includes data manipulation, which essentially means transforming data from one format to another. It includes a variety of processes designed to transform raw data into more readily used formats, such as:
• Merging multiple data sources into a single dataset for analysis;
• Identifying gaps in data (for example, empty cells in a spreadsheet) and either filling or deleting them; and
• Deleting data that is either unnecessary or irrelevant for the project at hand.
The process of data wrangling may also include data visualization (see the discussion on research skills in Chap. 8) and data aggregation. As mentioned, data wrangling is based on moving between different formats: some are more abstract (e.g., visualization), others are more concrete (e.g., identifying gaps in data).
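The three wrangling processes listed above (merging, gap handling, and deletion) can be sketched with pandas; the data sources and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical sources: lecture metadata and view counts
lectures = pd.DataFrame({"lecture_id": [1, 2, 3],
                         "title": ["Intro", "EDA", "Models"],
                         "internal_code": ["a1", "a2", "a3"]})
views = pd.DataFrame({"lecture_id": [1, 2, 3],
                      "views": [120, None, 95]})

# Merge multiple data sources into a single dataset
df = lectures.merge(views, on="lecture_id")

# Identify gaps in the data and fill them (here: missing view counts -> 0)
df["views"] = df["views"].fillna(0)

# Delete data that is irrelevant for the project at hand
df = df.drop(columns=["internal_code"])

print(df)
```

Even this small sketch forces the abstraction moves the text describes: merging requires a high-level view of how the sources relate, while gap filling requires a low-level, cell-by-cell decision.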
In addition to the need to move between levels of abstraction, other cognitive challenges can be addressed during data wrangling. For example, data scientists should be aware of selection bias, which plays an important role in this process.


Selection (or "sampling") bias occurs when the sample data that is gathered and prepared for modeling has characteristics that do not represent the future population that the model will see ("Sampling Bias," 2022; What Is Selection Bias and Why Does It Occur?, n.d.). The main problem associated with selection bias is that it is often detected only at a late stage of the data science workflow, when a model fails, after a lot of work has already been invested in the previous stages. See also Chap. 3 in general, and Exercise 3.11 in particular, for an elaboration on common cognitive biases exhibited in the data science workflow. In general, the ability to move between levels of abstraction is connected to other important data science skills, such as critical thinking and reflection (mentioned above), and in any specific situation, it involves considering the audience's background, knowledge, and perspective in order to determine the appropriate level of abstraction required for efficient communication (this is discussed also in the next section in connection with the organizational skill of storytelling).
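Selection bias can be made concrete with a minimal simulation. The scenario below is hypothetical: data is collected only from young respondents, while the deployed model will face the full age range:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: ages uniform between 18 and 80
population = rng.uniform(18, 80, size=100_000)

# Biased sample: data collected only from respondents under 40
biased_sample = population[population < 40][:1000]

# A model trained on the biased sample "sees" a very different
# age distribution than the one it will face in production
print(population.mean())      # roughly 49
print(biased_sample.mean())   # roughly 29
```

The gap between the two means is exactly the kind of mismatch that, as noted above, often surfaces only when the model fails in production.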

11.2.2 Organizational Skills: Storytelling

The importance of the skill of storytelling derives from the fact that data science aims to transform data into value, which, in most cases, takes place in a specific context. While data by themselves are meaningless, their value is reflected in the story told about them. In his Forbes article Data Storytelling: The Essential Data Science Skill Everyone Needs, Dykes (2016) explains that "For thousands of years, storytelling has been an integral part of our humanity. Even in our digital age, stories continue to appeal to us just as much as they did to our ancient ancestors. Stories play a vibrant role in our daily lives—from the entertainment we consume to the experiences we share with others to what we conjure up in our dreams" (Why storytelling is essential?, para. 1). In the case of data science, good storytelling means that data-driven solutions are communicated clearly, concisely, and directly to each relevant target group. A similar message is delivered by the ACM Data Science Task Force on Computing Competencies for Undergraduate Data Science Curricula, published in January 2021 (Danyluk & Leidig, 2021). Among other skills, it mentions 'result delivery' as a skill required of data science graduates: "[…] on presentation of results, the Data Science graduate needs to explain and interpret the numerical conclusions in the client's terminology, and deliver text and graphics ready to be digested by non-technical personnel" (p. 38). The importance of storytelling in the context of data science has been recognized since the early days of the current emergence of data science (about ten years ago; see Chap. 2). For example, Nusca (2012) wrote in a ZDNet post published on December 13, 2012, entitled The key to data science? Telling stories, that "Like journalism, there are many stories to tell from the same set of data, and data scientists must choose carefully" (para. 3). Here again, critical thinking plays an important role.


Exercise 11.4 Analysis of an unknown dataset
Pick a set of data without knowing its source. Analyze the dataset and tell different stories about it. Then, check its source and decide whether any of the stories you told about it make sense. Reflect on the process you went through. What conclusions can you draw? As a class activity, this can be facilitated in pairs, whereby each student selects a dataset to be explored by his or her classmate.

Clearly, to tell a meaningful story, it helps to be familiar with the application domain, in general, and with the source of the data, in particular. Furthermore, to tell a meaningful story, one must also consider the audience to which the story is told. For example, in order to deliver meaningful messages, the same story should be presented from different angles to managers and to customers. In this context, the rhetorical triangle, which includes the ethos, pathos, and logos techniques of persuasion, can be considered. In short, logos is the internal logic of the message, including evidence; pathos appeals to emotion, but also demonstrates shared beliefs and knowledge; and ethos is the resulting trust in the writer/presenter based on his or her perceived authority and professionalism (Aberšek & Aberšek, 2010). For a description of the expression of the rhetorical triangle in the case of data science, see Hazzan and Rakedzon (2022) and Lesson 9 of the Methods of Teaching Data Science course on our Data Science Education website.

Exercise 11.5 The rhetorical triangle
Pick an example of a case study that presents a model generated by a machine learning algorithm. In the context of this case study, explore the nature of the rhetorical triangle and illustrate the expression of each of its edges (logos, ethos, and pathos) when communicating the model generated by the machine learning algorithm to three different kinds of audiences. An example of such a presentation to different kinds of audiences is given in Hazzan and Rakedzon (2022).

Another perspective that highlights the role of storytelling in the context of data science is the analogy between data science and anthropology. Among other things, this analogy suggests using anthropological skills in a data-centric role. For example, the blog Data and the Anthropologist: Could you be using your anthropology skills
in a more data centric role? attempts to partially answer this question by referring to the context, which is highly relevant for the examination of the storytelling skill. In this blog, Astrid Countee writes: “Companies are finding that just having the hard data points isn’t enough to take action. They need context, an understanding of what the data implies, and a plan for how to strategically use those implications to move forward. Sounds suspiciously similar to anthropological training, doesn’t it?” (Countee, 2015, Anthropology and Data Science, para. 4).

Exercise 11.6 Anthropology and data science
The analogy between data science skills and anthropology skills is usually examined in one direction, that is, what anthropological skills can be used in data science. Explore the opposite direction, that is: What data science skills can be used in anthropological research?

11.2.3 Technological Skills: Programming for Data Science

Although we do not delve into technical programming details in this guide, we should remember that some basic programming skills are important for anyone who is involved in the data science workflow, even if they are not directly involved with programming activities. Such basic programming knowledge can help the practitioner make decisions with respect to what can and should be analyzed, and how. An examination of all of the tools that can be used in the data science workflow is beyond the scope of this guide (see an overview in Sect. 1.6). We do, however, wish to make two comments: (a) As soon as the data scientist understands what can be achieved using each kind of tool and how it can be used in the data science workflow, the data scientist's organizational environment and culture can determine which specific tool to use. For example, use of Python is more common among computer scientists; use of R is more common among statisticians and academics, and so on. (b) Some tools do not require programming skills, for instance, Orange Data Mining and Weka, which are machine learning and data mining suites that support data analysis through visual programming, and are intended both for experienced users and programmers, as well as for data science students (Demšar et al., 2013).


Exercise 11.7 Programming tools used in the data science workflow Select 2–3 programming tools that are used in the data science workflow (see an overview in Sect. 1.6). Analyze their advantages and disadvantages from a pedagogical perspective, and compare them. What conclusions can you draw? Reflect: In your analysis, did you consider the characteristics of the learners? If you did, which characteristics did you consider? How are they expressed in your analysis? If not—why?

11.3 Soft Skills

In this section, we discuss skills that are required in order to perform data science activities meaningfully, and that, at the same time, were discussed extensively in the past, prior to the recent attention that data science has received, as important skills needed in order to function well in the twenty-first century job market. Nevertheless, their expression in the context of data science is important, as described below. Again, we address the three kinds of skills: cognitive, organizational, and technological, and focus on a representative example of each kind.

11.3.1 Cognitive Skills: Learning

Data science technologies and frameworks evolve so fast that it is futile to try to master them all. It is, however, important to be familiar with new trends, tools, and methods in data science and to understand their essence in order to communicate meaningfully in the data science ecosystem (which comprises data scientists, experts in the application domain, other practitioners involved in the data science workflow, and the general public). The data scientist's awareness of the need to keep learning, and the attention given to the acquisition of learning skills, reflect the data scientist's openness and acknowledgement of the fact that (a) it is impossible to know everything about every aspect of data science, and (b) theoretical, technical, and practical developments that take place in the data science ecosystem should be learned and practiced continuously, on a regular basis. In order to acquire and master learning skills, in the course of learning new content, it is important to keep reflecting on the learning process: What content was easy for me to learn? What was difficult? To what content that I already know is the new material connected? How can I use the new content in my current work? How will I be able to use the new content in my future professional career? Such reflective thoughts not only improve the current learning process, but also teach one how to


learn—that is, to learn how to learn. In other words, as soon as learners realize, become aware of, and understand their thinking processes, as well as the strengths and weaknesses of their learning processes, they gain insights that will likely support them in their future learning processes. We further discuss learning processes in the context of the technological soft skill of debugging in Sect. 11.3.3.

Exercise 11.8 Lifelong learning Explore the meaning of lifelong learning and explain its unique importance in the context of data science.

Exercise 11.9 Coursera 4-week (14 h) course Learning How to Learn One of the most popular Coursera courses is the 4-week (14 h) course Learning How to Learn: Powerful mental tools to help you master tough subjects, taught by Dr. Barbara Oakley, Professor of Engineering at Oakland University in Rochester, Michigan and Dr. Terrence Sejnowski, Professor at the Salk Institute for Biological Studies at the University of California San Diego. If you do not have time to study the entire course, explore the main learning techniques it focuses on. Discuss the relevance and illustrate the implication of each of these techniques in the data science workflow.

11.3.2 Organizational Skills: Teamwork and Collaboration

As the data science workflow presented in Chap. 2 indicates, data scientists do not work in isolation. Indeed, they work with their team members as well as with other colleagues in their organization, and collaborate with other stakeholders both inside and outside their own organization. For example, in order to decide what data should be gathered, analyzed, and presented, data scientists must collaborate and communicate with experts in data visualization; in order to deliver the data science findings to the organization's customers, data scientists need to communicate with marketing experts; and in order to carry out the exploratory data analysis and modeling phases of the data science workflow, professional programmers and experts in the application domain (for example, sociologists) must work together. In short, teamwork and collaboration skills are required in all such work processes.


Exercise 11.10 A metaphor for a discourse between a data scientist, you, and another employee in your organization
(a) Explore your company and choose a specific aspect of the organization with which you are familiar.
(b) Describe a scenario in which a data scientist, you, and another employee of the company communicate in order to decide what machine learning algorithms will be used for a specific analysis of the data that belongs to the aspect selected in (a). What metaphor is appropriate to describe the discourse?
(c) Repeat (b) with an aspect of your organization with which you are not familiar.

Communication skills are one of the most important teamwork skills for imparting messages in a meaningful way to relevant audiences. Communication skills require additional skills, of which we mention two: (a) attention to the abstraction level on which one delivers his or her messages (using a table, for instance, as opposed to a visualization or a story; see Sect. 11.2) and (b) openness to giving and receiving feedback from both professionals who are part of the data science workflow and other stakeholders outside of the immediate data science workflow. Being open to feedback reflects (a) an understanding of the other person's perspective as well as his or her current knowledge, emotions, and interests, and (b) the realization that any feedback can improve one's professional work.

Exercise 11.11 Giving and receiving feedback on a presentation about data analysis
Describe a scenario in which a data scientist presents findings from his or her data science project to an executive in his or her organization. Then, describe two discourses in which the data scientist receives feedback, and his or her reactions to that feedback. Compare the two discourses: In what ways are they similar? In what ways are they different? Do these similarities and differences result in different outcomes?

11.3.3 Technological Skills: Debugging Data and Models

In most cases, programming activities (on any scale) are not performed properly the first time around; debugging is needed. Debugging means finding the errors in computer programs: what went wrong and how it can be fixed. Sometimes debugging is easy (for example, in the case of syntax errors, that is, spelling errors); sometimes, debugging is difficult, as in the case of execution or logical errors, when the instructions are written properly but the computer program does not behave as expected. In both cases, the debugging process can be viewed as both a problem-solving and a learning process. As a problem-solving process, debugging requires several skills that are applied in typical problem-solving processes: breaking a problem down into sub-problems, trying different alternative solutions, moving between levels of abstraction, and learning from mistakes. As a learning process, debugging means that we improve our understanding of (a) the problem whose solution is implemented, (b) our own understanding, and (c) the solution implementation.

In the case of data science, in addition to the traditional meaning of debugging, we propose using the term debugging in two additional contexts: (a) for the exploratory data analysis phase and (b) for the model hyperparameter tuning phase—both of which are intensive, interactive phases of the data science workflow. Specifically, in these phases, we refer to the debugging of our mental model of the data and the algorithm, a process that enables us to improve our understanding of the problem at hand, as described below.

In the exploratory data analysis phase, the data is analyzed in order to understand its main characteristics, in many cases using visual methods. We improve our understanding by carrying out several activities during the exploratory data analysis process, such as testing underlying assumptions, checking relationships between features, eliminating part of the data to examine the impact of this omission, and so on. These activities enable us to gradually improve our understanding of the data in terms of discovering its underlying structures, identifying the significant features of the data, recognizing meaningful connections between features which may highlight new insights, detecting outliers and anomalies, etc.
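As a minimal illustration of these exploratory data analysis activities, the following sketch (our own; the dataset is hypothetical) “debugs” a mental model of a small dataset: it tests a simple assumption, measures the relationship between two features, flags outliers with the 1.5 × IQR rule, and then eliminates the suspect record to examine the impact of this omission. It uses only the Python standard library; in practice, one would typically work with pandas or a visual environment such as Orange.

```python
import statistics

# A small, hypothetical dataset: hours studied vs. exam score,
# with one suspicious record (a possible data-entry error).
hours  = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [52, 55, 61, 64, 70, 73, 78, 82, 88, 5]

# (1) Test an underlying assumption: where do the scores centre?
print("mean score:", statistics.mean(scores))

# (2) Check the relationship between two features (Pearson correlation).
def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(hours, scores)
print("correlation(hours, scores):", round(r, 2))

# (3) Detect outliers with the 1.5 * IQR rule.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1
outliers = [s for s in scores if s < q1 - 1.5 * iqr or s > q3 + 1.5 * iqr]
print("outliers:", outliers)

# (4) Eliminate the suspect record and examine the impact of the omission:
#     the relationship between the features becomes much clearer.
clean = [(h, s) for h, s in zip(hours, scores) if s not in outliers]
r_clean = pearson([h for h, _ in clean], [s for _, s in clean])
print("correlation without outliers:", round(r_clean, 2))
```

Each step updates the analyst’s mental model of the data, which is precisely the sense in which we propose to call this process debugging.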
In the case of model hyperparameter tuning, the term tuning by itself reflects the fact that we must gradually explore the best tuning for the solution we are trying to model; most probably, we will move forward and backward in this process. Nevertheless, we should recall that each tuning step improves our understanding of both the problem we are trying to solve and the solution we are developing.

We propose to call these two cases ‘debugging’ since, during their execution, we gradually refine our understanding both of the data and of the algorithm in a way that enables us to attain a more suitable model for our data. In other words, such debugging processes are not immediate but rather interactive, so that during their execution we improve our understanding of the data and its features, as well as of the model and its generation process. This process is very similar to the process of debugging computer programs since in both cases we gradually refine our understanding of the object we are exploring (either a computer program, data, a model, or a hyperparameter), while trying different paths towards the solution, and learning also from what seem to be dead ends.

From a wider perspective, examining these two phases of the data science workflow—exploratory data analysis and model hyperparameter tuning—as debugging processes demonstrates the code-data duality introduced by Wing (2006) in her influential paper on computational thinking: “Computational thinking is thinking recursively. It is parallel processing. It is interpreting code as data and data as code” (p. 33). In Chap. 3, we elaborate on computational thinking and its role in data (science) thinking.
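To make the iterative nature of hyperparameter tuning concrete, the following toy sketch (our own illustration; the data and names are hypothetical) tunes the single hyperparameter k of a hand-rolled k-nearest-neighbors classifier against a held-out validation set. Each candidate value of k is one “debugging step” that refines our mental model of how smooth the decision boundary should be; in a real project, one would use a library facility such as scikit-learn’s GridSearchCV rather than this explicit loop.

```python
from collections import Counter

# Tiny, hypothetical 1-D dataset: (feature value, class label).
train = [(0.5, "A"), (1.0, "A"), (1.5, "A"), (1.8, "A"),
         (3.0, "B"), (3.4, "B"), (4.0, "B"), (4.5, "B"),
         (2.1, "A"), (2.6, "B")]
val   = [(0.8, "A"), (1.6, "A"), (2.3, "A"), (3.1, "B"), (4.2, "B")]

def knn_predict(x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(k):
    hits = sum(knn_predict(x, k) == y for x, y in val)
    return hits / len(val)

# 'Debugging' the hyperparameter: each candidate k is one step that
# refines our understanding of the data and of the model we are building.
scores = {k: accuracy(k) for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
print("validation accuracy per k:", scores)
print("chosen k:", best_k)
```

Moving “forward and backward” in tuning simply means revisiting the candidate list after inspecting the validation scores, which is exactly the interactive refinement described above.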

Exercise 11.12 Exploration of a dataset as a debugging process

Select a dataset you are unfamiliar with from one of the online resources. Explore it in a way that improves your understanding of relevant connections between the features of the data and what each connection may tell you about the context of the data. While performing this process, keep reflecting on your thinking processes, noting what you have learned in each step and how you will be able to use what you have learned in your future work and learning processes. If you have any programming experience, explore the similarities and differences between the thinking process you went through and the process of debugging computer programs.

11.4 Teaching Notes

In this chapter, we describe important skills required for the implementation of the data science workflow, both by data scientists and by other stakeholders involved in this workflow. Relevant teaching methods for these skills can be found in several chapters of this guide: Chap. 10, which presents a pedagogical perspective on the data science workflow; Chap. 12, which discusses social and ethical issues of data science education; and Chap. 16, in which we propose methods for teaching machine learning.

Exercise 11.13 The expression of skills in the data science workflow

A report published in January 2021 by the ACM Data Science Task Force on Computing Competencies for Undergraduate Data Science Curricula (Danyluk & Leidig, 2021) lists the following skills of data science graduates: flexibility and joy of learning and working in a fast-paced discipline; staying alert to broader societal, non-technical concerns; and commitment to professional responsibility. Explain in which phases of the data science workflow it is especially important to implement each of these skills. Illustrate your answer with specific scenarios.


Exercise 11.14 Skills of the data science workflow stakeholders

Explore the relevance of each skill described in this chapter for stakeholders of the data science workflow other than those addressed in this chapter.

Exercise 11.15 Additional data science skills

In this exercise, we ask you to repeat Exercise 11.1, presented at the beginning of this chapter. For each cell in Table 11.1, suggest three additional skills and explain their importance for professional data scientists. Explore their connections to the skills described in this chapter.

11.5 Conclusion

We conclude this chapter by highlighting the interdisciplinary aspect of data science through an exercise that explores this characteristic of data science with respect to the skills described in this chapter.

Exercise 11.16 An interdisciplinary perspective on data science skills

Explore the different skills described in this chapter from an interdisciplinary perspective: The expression of which skills warrants the application of an interdisciplinary perspective? The expression of which skills highlights the importance of the interdisciplinarity of data science? Explain your answer.

References

Aberšek, B., & Aberšek, M. K. (2010). Development of communication training paradigm for engineers. Journal of Baltic Science Education, 9(2), 99–108.

Countee, A. (2015). Data and the anthropologist: Could you be using your anthropology skills in a more data centric role? https://thegeekanthropologist.com/2015/11/06/data-and-the-anthropologist-could-you-be-using-your-anthropology-skills-in-a-more-data-centric-role/

Danyluk, A., & Leidig, P. (2021). Computing competencies for undergraduate data science curricula. https://www.acm.org/binaries/content/assets/education/curricula-recommendations/dstf_ccdsc2021.pdf


Demšar, J., Curk, T., Erjavec, A., Gorup, Č., Hočevar, T., Milutinovič, M., Možina, M., Polajnar, M., Toplak, M., Starič, A., Štajdohar, M., Umek, L., Žagar, L., Žbontar, J., Žitnik, M., & Zupan, B. (2013). Orange: Data mining toolbox in Python. The Journal of Machine Learning Research, 14(1), 2349–2353.

Dykes, B. (2016). Data storytelling: The essential data science skill everyone needs. https://www.forbes.com/sites/brentdykes/2016/03/31/data-storytelling-the-essential-data-science-skill-everyone-needs/?sh=29958f1e52ad

Hazzan, O., & Rakedzon, T. (2022). The expression of the rhetorical triangle in data science. https://cacm.acm.org/blogs/blog-cacm/259000-the-expression-of-the-rhetorical-triangle-in-data-science/fulltext

Nusca, A. (2012). The key to data science? Telling stories. https://www.zdnet.com/article/the-key-to-data-science-telling-stories/

Sampling Bias. (2022). Wikipedia. https://en.wikipedia.org/w/index.php?title=Sampling_bias&oldid=1082510404

Schön, D. A. (1983). The reflective practitioner. Basic Books.

Schön, D. A. (1987). Educating the reflective practitioner: Toward a new design for teaching and learning in the professions. Jossey-Bass.

Setlur, V. (2022). AI augments and empowers human expertise. Tableau. https://www.tableau.com/sites/default/files/2022-02/Data_Trends_2022.pdf

What is Selection Bias and Why Does it Occur? (n.d.). Data science and machine learning. Retrieved June 14, 2022, from https://www.kaggle.com/questions-and-answers/a

Wing, J. M. (2006). Computational thinking. Communications of the ACM, 49(3), 33–35. https://doi.org/10.1145/1118178.1118215

Chapter 12

Social and Ethical Issues of Data Science

Abstract The teaching of social issues related to data science should be given special attention regardless of the framework or level at which data science is taught. This assertion is derived from the fact that data science (a) is relevant for many aspects of our lives (such as health, education, social life, and transportation); (b) can be applied in harmful ways (even without explicit intention); and (c) involves ethical considerations derived from the application domain. Of the many possible social topics whose teaching might have been discussed in this chapter, we focus on data science ethics (Sect. 12.2). We also present teaching methods that are especially appropriate for the teaching of social issues of data science (Sect. 12.3). Throughout the chapter, we highlight the social perspective, which in turn further emphasizes the interdisciplinarity of data science.

12.1 Introduction

This chapter addresses an important aspect of data science—its social aspect. This importance is derived from the fact that data science (a) is relevant for many aspects of our lives (such as health, education, social life, and transportation); (b) can be applied in harmful ways (even without direct intention); (c) involves ethical considerations derived from the application domain; and finally (d) involves many people in its workflow, from which it follows that there may be many different perspectives on the social aspect of data science. All these reasons highlight the importance of considering the application domain in any data science-based exploration. We propose to address social issues in any framework in which data science is taught—from elementary school, through high school and academia, to industry. Due to the importance attributed to ethical issues of data science, we focus on data science ethics (Sect. 12.2). We also present a variety of teaching methods (including teaching principles and types of activities) for teaching social aspects of data science (Sect. 12.3). Throughout the chapter, the focus on the social aspect highlights the interdisciplinarity of data science.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_12


Chapters 19 and 20 also deal with human aspects, using data science as a research method for the exploration of social issues. This chapter, on the other hand, examines the social aspect of data science from the behavioral perspective: How do the stakeholders of the data science workflow behave and what social issues do they pay attention to? While in Chaps. 19 and 20 data science is used for research, and programming and mathematics are therefore taught in that context, this chapter highlights the attention that should be given to a topic that, on the one hand, is not always included explicitly in data science study programs, and on the other hand, is very important, should not be neglected, and requires high-level cognitive skills such as critical thinking and reflective abilities. We note that Fig. 2.4, which presents the data life cycle (Berman et al., 2016), does indeed highlight the social aspect of data science, including ethics and policy. Furthermore, Fig. 2.4 reflects the fact that these topics should be addressed in all phases of the data science workflow.

12.2 Data Science Ethics

Codes of ethics guide professionals on how to behave in situations in which it is not clear what is right and what is wrong. The need for a code of ethics arises from the fact that any and every profession creates situations that can neither be predicted nor responded to uniformly by all members of the relevant professional community. In this section, we examine ethical guidelines for data scientists and for other stakeholders involved in the data science workflow.

Two well-known cases in which the need for a data science code of ethics is clear are (a) facial recognition algorithms, which, on the one hand, may speed up security lines, and on the other hand, may discriminate against certain populations that the algorithm fails to recognize, and (b) autonomous vehicles, which, on the one hand, may free us from human driving errors, and on the other hand, may pose new risks for other human drivers and pedestrians.

Exercise 12.1 Famous cases that illustrate the need for an ethical code for data science

(a) Locate resources about the two cases mentioned above that illustrate the need for an ethical code for data science. List the ethical considerations involved in each case.
(b) Find additional cases that illustrate the importance of an ethical code for data science.


Exercise 12.2 Generative adversarial networks from an ethical perspective

Generative adversarial networks (GANs) are used in machine learning for creating new data instances that resemble the training data. Explore the concept of GANs:
(a) In what cases can GANs be helpful? In what cases can GANs cause damage?
(b) How can the benefits of GANs be leveraged? How can the potential damages of GANs be mitigated?
(c) How can GANs be used in teaching social issues and ethics of data science?

Several codes of ethics are related to computing and technology. See, for example, the Software Engineering Code of Ethics and Professional Practice formulated by the ACM/IEEE-CS Joint Task Force (Software Engineering Code—ACM Ethics, 2016).

Exercise 12.3 Codes of ethics

(a) Review the software engineering code of ethics at https://ethics.acm.org/code-of-ethics/software-engineering-code/.
(b) In your opinion, which of its guidelines also fit the case of data science?
(c) Look for codes of ethics of other professions, not necessarily science or engineering related. What do these codes of ethics have in common? In what ways do they differ from one another?

In the post Is It Time for a Data Scientist Code of Ethics?, Jesse Freeman argues that “[o]n the top of most people’s mind may be the idea of using such code to define accountability. But the real concept behind a code of ethics isn’t accountability, per se, but the idea that the group can collectively agree on a set of core principles. These principles drive the actions at a systemic level to help ensure that an individual’s moral compass is pointing in the right direction when their own values or beliefs are questioned” (Freeman, 2019, Data Science Code of Ethics, paras. 1–2). Indeed, it makes sense that data scientists will have their own set of agreed-upon principles encompassed by a data science code of ethics. In this spirit, several attempts have been made in recent years to formulate such codes of ethics (see Exercise 12.4). While most of the codes address the same several issues (e.g., the relationship with the client), each of them emphasizes other specific aspects of data science ethics.


Exercise 12.4 Comparisons of data science codes of ethics

(a) Review the following data science codes of ethics:
I. Data Science Association: Data Science Code of Professional Conduct at https://www.datascienceassn.org/code-of-conduct.html
II. “Ethics That Every Data Scientist Should Follow” at https://analyticsindiamag.com/ethics-that-every-data-scientist-should-follow/
III. A Beginner’s Guide to Data Ethics at https://medium.com/big-data-at-berkeley/things-you-need-to-know-before-you-become-a-data-scientist-a-beginners-guide-to-data-ethics-8f9aa21af742
IV. Oxford Munich Code of Conduct at http://www.code-of-ethics.org/code-of-conduct/
V. Data Science for Social Good at http://www.dssgfellowship.org/2015/09/18/an-ethical-checklist-for-data-science/

(b) Compare how each of them addresses the topics listed in the left-hand column of Table 12.1.

Table 12.1 Analysis of data science codes of ethics (columns I–V refer to the five codes of ethics listed in (a))

Code of ethics category    I    II    III    IV    V
Client
Professionalism
Conflict of interests
Misconduct
Data gathering

(c) To the list of topics presented in the left-hand column of Table 12.1, add at least three topics addressed by at least one of the codes of ethics and complete the table accordingly.
(d) Search for additional data science codes of ethics. If you find such codes, add them to Table 12.1 and complete the relevant cells.
(e) If your organization has a code of ethics (either a general code of ethics or a specific code for data science), add it to the list of codes in Table 12.1 and complete the relevant cells of the table.
(f) What are your conclusions from this analysis?
(g) What implications for data science education can you derive from this activity?

12.2 Data Science Ethics

183

Exercise 12.5 Stakeholders’ behavior in the different phases of the data science workflow

Table 12.2 presents the different phases of the data science workflow (columns) versus the stakeholders involved in the data science workflow (rows).
(a) To the list presented in the left-hand column of Table 12.2, add at least three more stakeholders.
(b) Fill in Table 12.2 by specifying ethical norms that each stakeholder should adhere to in each phase of the data science workflow.

Table 12.2 Ethical norms of different stakeholders in the different phases of the data science workflow

Stakeholder          Data collection                 Exploratory data analysis    Modeling    Conclusions    Implementations
Data scientists      Example: Make sure the
                     diversity represents the
                     population
Software engineers

Exercise 12.6 Responsible AI

A concept that is closely related to ethics is responsible AI.
(a) What is responsible AI? Review at least four resources and use them to define responsible AI.
(b) Responsible AI is sometimes described by pillars. Find two such descriptions of the pillars of responsible AI. Are these pillars similar? Are they different? What aspects of data science does each of the pillars emphasize?


Exercise 12.7 Recommendations of curriculum guidelines with respect to the inclusion of ethics in data science programs

In Chap. 4, we present several reports that offer recommendations for the structure and context of undergraduate data science programs. For example, the recommendation of the National Academies of Sciences (2018) with respect to ethics is:

Recommendation 2.4: Ethics is a topic that, given the nature of data science, students should learn and practice throughout their education. Academic institutions should ensure that ethics is woven into the data science curriculum from the beginning and throughout. (p. 3)

Explore how each of the reports presented in Chap. 4 addresses the topic of ethics. What are the main messages delivered by these recommendations? Are the recommendations of the different reports similar to one another? Do they differ from each other? How are their similarities reflected? How are the differences between them expressed?

Exercise 12.8 Ethical principles to be applied in the creation of image sets¹

An image set is a collection of labeled images used for research in computer vision. One such image set is ImageNet, available at https://www.image-net.org/. The images in ImageNet have all been hand-annotated to indicate what objects are pictured. Since the labeling was done by humans, it was subject to the cognitive biases of the people who participated in the labeling task.
(a) What ethical issues can such a manual labeling process reveal?
(b) What instructions would you give the research team who labeled the images to increase its awareness of ethical issues and to guide the labeling process in a way that adheres to ethical norms?
(c) In the creation of image sets, some categories of images may not be easily available. Think, for example, of pictures of different kinds of pedestrians that must be gathered and labeled in a database whose purpose is to train an autonomous driving application.
1. What ethical challenges might this situation pose?
2. What instructions would you give practitioners who create such image databases to increase their awareness of ethical issues and to guide the dataset creation process in a way that adheres to ethical norms?

¹ This question is inspired by Prof. Lihi Zelnik-Manor’s course “Algorithms and Applications in Image Processing”. We would like to thank Prof. Lihi Zelnik-Manor for her inspiration.


Exercise 12.9 The ethical aspect of product development²

Look at the DALL·E 2 project at https://openai.com/dall-e-2/.
(a) What does the project do?
(b) Scroll down the page and review the names of the practitioners who contributed to the project.
(c) According to what categories are the practitioners presented?
(d) How many practitioners appear in each category?
(e) From an ethical perspective: What can you say about these lists?

Exercise 12.10 Analysis of a documentary movie

Recent years have seen an increase in the production of documentary movies that deal with the ethical challenges of face recognition software.
(a) Find one such movie and watch it (e.g., Coded Bias at https://en.wikipedia.org/wiki/Coded_Bias).
(b) Summarize its main messages.
(c) Formulate ethical guidelines that address the main challenges presented in the movie.

12.3 Methods of Teaching Social Aspects of Data Science

The social aspects of data science differ from the more common topics discussed in a data science course, regardless of whether the learners are K-12 pupils, data science majors, students in related disciplines, non-major students, industry practitioners, researchers, or general users. While most of the topics covered in such courses require analytical skills and modes of thinking (e.g., programming, statistical thinking, and data analysis), social issues also require verbal skills (e.g., storytelling and communication) and social awareness (e.g., taking the other person’s perspective). This section presents teaching principles and kinds of activities that foster the learners’ attention to the social aspect of data science, mainly, but not only, through exercises in which learners can practice their skills. The readership of this guide will likely notice that the teaching principles and kinds of activities presented below are applied also in other chapters of this guide.

² This question is inspired by Prof. Lihi Zelnik-Manor’s course “Algorithms and Applications in Image Processing”. We would like to thank Prof. Lihi Zelnik-Manor for her inspiration.


12.3.1 Teaching Principles

We first present two teaching principles that we recommend using when teaching social issues of data science: active learning and embedded context.

Active Learning

Active learning is a term that advocates the perspective that in order to achieve meaningful learning, learners should be active rather than passive. Among the many descriptions of active learning, we highlight Silberman’s assertion (1996) according to which “Above all, students need to ‘do it’—figure things out by themselves, come up with examples, try out skills, and do assignments that depend on the knowledge they already have or must acquire” (p. ix). Active learning is closely related to inquiry-based learning, problem-based learning, and project-based learning, which are all highly suitable approaches for teaching and learning data science.

Active learning stands in contrast to passive teaching methods in which teachers lecture without letting the students express their skills, opinions, and imagination. In fact, meaningful learning occurs mostly when the learners are active. This claim is derived from the constructivist approach according to which learning is the active acquisition of ideas and knowledge construction, not a passive process of absorbing knowledge (Confrey, 1995; Davis, 1990; Kilpatrick, 1987). In other words, meaningful learning requires the individual to be active and to be engaged in the process of constructing his or her own mental models. We therefore recommend that data science educators encourage learners “to be active in their relationship with the material to be learned” (Newman et al., 2003, p. 2). In Sect. 12.3.2, we present four kinds of activities that apply active learning in the context of the social aspects of data science. A comprehensive resource on active learning, including different types of active learning activities, is available at Active Learning | Center for Educational Innovation (n.d.).
Embedded Context

The term embedded is largely used in the context of computer science with respect to ethics, where the term used in that context is embedded ethiCS (see for example Embedded EthiCS (n.d.)). We propose implementing the pedagogical principle of embedded context also in the case of data science education, in general, and in the case of social issues of data science, one of them being ethics, in particular (see Sect. 12.2).

We illustrate the embedded context principle, which refers to the design of data science programs, with respect to the social aspect of data science on which we are focusing in this chapter. The embedded social-context concept advocates the idea that social issues should not be taught separately, in a separate course or unit, but rather should be integrated into all data science courses. This principle highlights the perspective according to which social issues are as important as other core data science concepts taught in a data science program. We further recommend not to address the social aspect of data science only theoretically, for example, by describing the ethical codes of data science as a list of principles, but rather, to connect each ethical principle to some real-life context. That is, we recommend that authentic cases be embodied in the teaching of the social issues of data science, which in turn, are embodied in the teaching of data science’s science-oriented content related to one of its components: mathematics, statistics, and computer science. In fact, the data science in context idea has already been presented in the Computing Competencies for Undergraduate Data Science Curricula report of the ACM Data Science Task Force (Danyluk & Leidig, 2021) as follows:

4.2 Data Science in context
[…] Data Science curricula should include courses designed to promote dual coverage combining both data science fundamentals and applications, exploring why people turn to data to explain contextual phenomena. Such courses highlight how valuable context is in data analytics; where data are viewed with narratives, and questions often arise about ethics and bias. It can be beneficial to teach some courses with a disciplinary context so that students appreciate that data science is not an abstract set of approaches. Related application disciplines might include physics, biology, chemistry, the humanities, or other areas. (Danyluk & Leidig, 2021, p. 29)

Exercise 12.11 Embedded ethics in data science

(a) Several principles of embedded ethiCS are presented in Embedded EthiCS at https://embeddedethics.seas.harvard.edu/about. Adapt them to the case of data science. For example, the statement:

• The advantage of the Embedded EthiCS distributed pedagogy over stand-alone courses is: 1. It shows students the extent to which ethical and social issues permeate virtually all areas of computer science. (sec. The Embedded EthiCS Advantage, para. 1)

can be adapted very simply to the case of data science, as follows (italics added):

• The advantage of the Embedded Ethics distributed pedagogy in data science over stand-alone courses is: 1. It shows students the extent to which ethical and social issues permeate virtually all areas of data science.

(b) The Ethical Issues in AI and Computing Conference that took place in June 2022 illustrates the implementation of embedded ethics also in data science. What can you learn from the titles of the presentations presented at the conference about the conference orientation?


Exercise 12.12 Revisiting the AI + Ethics curriculum for middle school initiative

In Exercise 7.3, we explore the pedagogical guidelines applied in the development of the AI + Ethics Curriculum for Middle School initiative at https://www.media.mit.edu/projects/ai-ethics-for-middle-school/overview/. Are these guidelines similar to or different from the guidelines presented in this chapter? In what ways are they similar or different? Do these guidelines depend on the age of the learners? Expand this exploration to the ai4k12 initiative at https://ai4k12.org/.

12.3.2 Kinds of Activities

In this section, we present four kinds of activities that adhere to the two teaching principles introduced above. These activities can be integrated into the teaching of many of the topics discussed in this guide (see, for example, Chap. 8 on data science as a research method, Chap. 10 on the pedagogical perspective on the data science workflow, and Chap. 11 on data science skills).

Case Study Analysis A case study is an in-depth, detailed examination of a particular case (or cases) within a real-world context (Case Study, 2022). Clearly, data science provides many case studies for examination, in general, and within a real-world, social context, in particular. Therefore, in the case of the social aspect of data science, case studies should elicit interesting topics related to some social aspect of data science. Case studies share several common characteristics, among them we mention six: 1. A case study tells a story that illustrates a specific theme: an ethical dilemma, a specific behavior, a decision-making process, etc. Like any other story, a case study has actors, or players, and it describes their behavior as well as the interactions among them. See also Sect. 11.2.2 in which we address the storytelling skill and the rhetorical triangle (Aberšek & Aberšek, 2010). 2. The theme of the case study is highlighted and reflected through the case study description. 3. Case studies should be told in language that fits the target audience. 4. Case studies should raise interesting, open questions for exploration that each have several answers that can be judged neither as correct nor as incorrect.


5. Case studies can be explored in a variety of ways, both quantitative and qualitative.
6. Case studies teach us lessons that can be applied in future situations we face.

Exercise 12.13 Exploration of previously published data science case studies
Find 3–5 case studies that focus on the social aspect of data science. Analyze each case study according to the six characteristics listed above.

Exercise 12.14 Development of data science case studies
Describe three imaginary case studies whose purpose is to illustrate the importance of the social aspects of data science.
(a) Describe how you developed the case studies.
(b) For each case study, check whether it has the characteristics listed above.
(c) Ask three colleagues to work on your case studies (one per colleague) and document each of their analyses of their respective case studies. What conclusions did they draw from their case study analyses? Can you improve your case study descriptions based on your colleagues' work? If so, how? If not, why not?
(d) Interview your colleagues about their work and thinking processes while working on the case studies. List at least three lessons that you learned from these interviews.

Develop and Act Out Scenarios

In this kind of activity, the learners are asked to describe scenarios that address social issues of data science and to act them out, playing the different roles in the scenario. This kind of activity requires learners to:

• explore various topics related to the social aspect of data science and choose one of them;
• deepen their exploration of the chosen topic on which the scenario focuses;
• consider the different stakeholders that participate in the scenario. This step exposes the students to the variety of stakeholders that participate in typical scenarios that deal with social issues of data science;
• delve into the details of the scenario, taking into consideration the perspective of each stakeholder: What are the stakeholder's interests? How is each interest related to other interests?


• identify conflicts between the interests of different stakeholders participating in the scenario. If such conflicts exist, learners must analyze their source and suggest possible solutions to resolve them;
• ensure that ethical norms are adhered to, regardless of the specific topic of the scenario and the stakeholders' specific interests.

In a classroom setting, we recommend asking the students to develop such scenarios in teams and then to act them out in front of the entire class.

Exercise 12.15 Developing scenarios about data science
(a) Chapter 8 (Data Science as a Research Method) and Chap. 11 (Data Science Skills) present several activities in which learners are asked to develop scenarios. If you have not done these activities yet, you are invited to work on them now to practice the skill of developing scenarios.
(b) While developing your scenarios, make sure you perform the activities listed above, which are part of the development of any scenario, in general, and of scenarios related to the social aspect of data science, in particular.

Reflection in Action and Reflection on Action

As can be seen in various chapters in this guide, learners are repeatedly asked to reflect on what they have just executed, thought, considered, developed, and so on. In Sect. 11.2.1, which addresses professional-cognitive data science skills, we further highlight the cognitive processes of reflection in-action and reflection on-action as important data science skills (Schön, 1983, 1987). In addition to reflection on problem-solving processes, learners may be given reflective tasks on stories they read, movies they watch, and so on. The idea in such cases is to guide learners to think about what they wrote or viewed, both during the actual writing or viewing (reflection in-action) and after it (reflection on-action). Such reflection tasks can include a request to write down their thoughts about their profession and how, in their future professional careers, they will apply the lessons derived from the story or movie when they face similar cases. For example, in the last lessons of the Machine Learning (Computational Learning) course, taught at the Technion's Faculty of Industrial Engineering and Management, students were asked to work in class on the task presented in Exercise 12.16.

Footnote: We would like to thank Prof. Tamir Hazan for his collaboration.


Exercise 12.16 Reflection on a lecture
Watch Yuval Noah Harari's short talk on How to Survive the Twenty-First Century (https://www.youtube.com/watch?v=gG6WnMb9Fho), which he presented in 2020 at the World Economic Forum in Davos. While watching Harari's talk, reflect on its connections to the social aspect of data science.

In practice, the discussion that ensued in the Machine Learning course about Exercise 12.16 was divided into three parts:
(a) Before the students watched Yuval Noah Harari's talk, a discussion took place in which the students were asked to explain why they decided to study data science. The answers that were given included the applicability of data science in every area of life and the fact that data science is a research-oriented profession. Indeed, several chapters in this guide address data science as a research method, for example, Chap. 8, which introduces this perspective, and Chaps. 19 and 20, which present teaching frameworks in which data science is learned as a research method by researchers in the social sciences (Chap. 19) and in science and engineering (Chap. 20).
(b) While watching Yuval Noah Harari's talk, the students were asked to reflect on what they learned from it, what surprised them, what connections they could see to their professional life, and any other thoughts they had about what they were hearing.
(c) After the students finished watching the talk, they were asked to share their thoughts anonymously in writing and to submit their reflections. Here are two illustrative excerpts (translated from Hebrew). Note the words "stressful", "frightening", etc. that the students used.

• It's frightening not to know what direction the world is going to develop in. It's common to say that the ones who own the data have the power, and we, as students at the Technion, who have the data, are the strong ones. But I don't feel that I have more skills that are related to what to do with the data or that I know what to do with the data [than other people], but I do know that there are people who may know what to do with the data and this is stressful.
• The lecture was very interesting! But it is frightening to think where the world is going. I must mention that this connection between the course and its influence on our lives and its influence on the future is very important in my opinion!


Exercise 12.17 Reflection as a habit of mind
(a) Choose a topic related to the social aspect of data science. Write an essay on this topic, analyzing it from different points of view. During the writing process, keep reflecting on your thinking processes (that is, reflect in-action).
(b) When you finish writing, reflect on the entire writing process (that is, reflect on-action): What was challenging? What knowledge gaps did you have to close? When was the writing process fluent? When was the writing process blocked?
(c) Formulate guidelines for use when writing a story about the social aspect of data science.

Project-Based Learning

Project-based learning (PBL) is a pedagogical tool through which students gain knowledge and skills by developing solutions to real-life problems (Hadim & Esche, 2002). Although PBL is a well-known methodology for applying active learning, in the context of data science it might be challenging for educators to guide the development of projects on any topic that students choose as the project's application domain: the topics might be diverse, and project development might require a level of knowledge in the application domain that the educators may not have. PBL has many pedagogical advantages. For example:

• PBL has been recognized as an efficient method for promoting both interdisciplinary learning and the acquisition of twenty-first century skills, such as teamwork and interpersonal communication (Liu & Huang, 2005). To successfully accomplish interdisciplinary projects in a PBL setting, project teams must include specialists in several disciplines.
• PBL promotes active learning (see Sect. "Active Learning") and increases students' interest in science and technology (Duarte et al., 2012).
• PBL can promote the entrepreneurship skills of students (Dragoumanos et al., 2017).
• Interdisciplinary projects can increase students' motivation and engagement in real-life problems (Yogeshwaran et al., 2019) and promote cooperation between disciplines (Ramamurthy, 2016).


Exercise 12.18 PBL and data science
Based on the description of PBL, discuss the characteristics it shares with data science.

In the context discussed in this chapter, PBL is presented as a teaching method that can foster learners' awareness of social issues related to data science. This assertion is derived from the fact that through PBL, learners can be guided to take into consideration different aspects of the project topic, both technical and social. For example, students can be asked to use data science methods to explore a variety of social issues, such as educational policies or public opinion regarding a specific social issue, or to develop a technological tool that helps solve a social problem.

In order to achieve interdisciplinary learning in PBL, students must have sufficient knowledge in each of the separate disciplines connected to the project topic. Using cross-disciplinary teams is one way to ensure that a project team includes specialists in all required disciplines (Othman et al., 2017). This, however, is not always possible. In many cases, the project team has significant knowledge gaps in one or more of the disciplines required for the project development. For example, Mike et al. (2020) describe a case study of fourth-year electrical engineering students who worked on biomedical signal processing projects, which are evidently interdisciplinary in nature. Nevertheless, the teams were homogeneous, and the students lacked the essential medical expertise required to reach solutions that physicians could apply. Furthermore, students tended to acknowledge this gap only in the advanced phases of the project, and so critical phases, such as goal setting and planning, were performed without the required knowledge. Mike et al. (2020) proposed to close such gaps through an intervention program that exposed students both to the required application domain knowledge and to its importance for their work (see Chap. 6).
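To make the exploration of social issues with data science methods concrete, the sketch below shows the kind of first exploration step a PBL team might take when examining public opinion on a social issue. The dataset, column names, and all values are invented for illustration only.

```python
import pandas as pd

# Hypothetical survey responses on a social issue (all values invented
# for illustration only): each row is one respondent.
survey = pd.DataFrame({
    "age_group": ["18-29", "18-29", "30-44", "30-44",
                  "45-64", "45-64", "65+", "65+"],
    "supports_policy": [1, 1, 1, 0, 1, 0, 0, 0],
})

# A first exploration step: the support rate within each age group.
support_rate = survey.groupby("age_group")["supports_policy"].mean()
print(support_rate)
```

Even a small exploration like this invites the social questions discussed above: Who was surveyed? Which groups are under-represented? What decisions might be made on the basis of these rates?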

Exercise 12.19 Social issues of data science and interdisciplinarity
For each kind of activity presented in Sect. 12.3.2, explain how it highlights the interdisciplinarity of data science.


12.4 Conclusion

The interdisciplinarity of data science, in general, and the attention that should be given to the application domain knowledge in data science problem-solving processes, in particular, are especially important in the case of social issues of data science since, in many cases, these issues are connected to some aspect of our daily life. In conclusion, we describe how the interdisciplinarity of data science is highlighted in each section of this chapter:

(a) In the Ethics of Data Science section (Sect. 12.2), the interdisciplinarity of data science is reflected in the need to take into consideration (I) the variety of stakeholders of the data science workflow when discussing ethical issues, and (II) the codes of ethics derived from each of the data science components (e.g., computer science ethics and the ethics of the application domain).
(b) In the teaching principles and in each kind of activity we suggest using when teaching the social aspect of data science (Sect. 12.3), the interdisciplinarity of data science is highlighted by the variety of topics the activities can address and the fact that students are asked to consider this variety of topics while working on the different exercises.

References

Aberšek, B., & Aberšek, M. K. (2010). Development of communication training paradigm for engineers. Journal of Baltic Science Education, 9(2), 99–108.
Active Learning | Center for Educational Innovation. (n.d.). Retrieved June 10, 2022, from https://cei.umn.edu/teaching-resources/active-learning
Berman, F. (co-chair), Rutenbar, R. (co-chair), Christensen, H., Davidson, S., Estrin, D., Franklin, M., Hailpern, B., Martonosi, M., Raghavan, P., Stodden, V., & Szalay, A. (2016). Realizing the potential of data science: Final report from the national science foundation computer and information science and engineering advisory committee data science working group. National Science Foundation Computer and Information Science and Engineering Advisory Committee Report, December 2016. https://www.nsf.gov/cise/ac-data-science-report/CISEACDataScienceReport1.19.17.pdf
Case Study. (2022). Wikipedia. https://en.wikipedia.org/w/index.php?title=Case_study&oldid=1090767454
Confrey, J. (1995). A theory of intellectual development. For the Learning of Mathematics, 15(1), 38–48.
Danyluk, A., & Leidig, P. (2021). Computing competencies for undergraduate data science curricula. https://www.acm.org/binaries/content/assets/education/curricula-recommendations/dstf_ccdsc2021.pdf
Davis, R. B. (1990). Constructivist views on the teaching and learning of mathematics. Journal for Research in Mathematics Education, 4. ERIC.
Dragoumanos, S., Kakarountas, A., & Fourou, T. (2017). Young technology entrepreneurship enhancement based on an alternative approach of project-based learning. In 2017 IEEE global engineering education conference (EDUCON), pp. 351–358.


Duarte, C., Oliveira, H. P., Magalhães, F., Tavares, V. G., Campilho, A. C., & de Oliveira, P. G. (2012). Proactive engineering. In Proceedings of the 2012 IEEE global engineering education conference (EDUCON), pp. 1–5.
Embedded EthiCS. (n.d.). Retrieved June 10, 2022, from https://embeddedethics.seas.harvard.edu/
Freeman, J. (2019, August 7). Is it time for a data scientist code of ethics? Medium. https://towardsdatascience.com/is-it-time-for-a-data-scientist-code-of-ethics-210b4f987a8
Hadim, H. A., & Esche, S. K. (2002). Enhancing the engineering curriculum through project-based learning. In 32nd Annual frontiers in education (vol. 2, pp. F3F-F3F).
Kilpatrick, J. (1987). What constructivism might be in mathematics education. In Proceedings of the eleventh international conference on the psychology of mathematics education, vol. 1, pp. 3–27.
Liu, J.-S., & Huang, T.-K. (2005). A project mediation approach to interdisciplinary learning. In Fifth IEEE international conference on advanced learning technologies (ICALT'05), pp. 54–58.
Mike, K., Nemirovsky-Rotman, S., & Hazzan, O. (2020). Interdisciplinary education—The case of biomedical signal processing. In 2020 IEEE global engineering education conference (EDUCON), pp. 339–343. https://doi.org/10.1109/EDUCON45650.2020.9125200
National Academies of Sciences, Engineering, and Medicine. (2018). Data science for undergraduates: Opportunities and options. The National Academies Press. https://doi.org/10.17226/25104
Newman, I., Daniels, M., & Faulkner, X. (2003). Open ended group projects: A tool for more effective teaching. In Proceedings of the fifth Australasian conference on computing education, vol. 20, pp. 95–103.
Othman, A., Hussin, H., Mustapha, M., & Parman, S. (2017). Cross-disciplinary team learning in engineering project-based: Challenges in collaborative learning. In 2017 7th World engineering education forum (WEEF), pp. 866–871.
Ramamurthy, B. (2016). A practical and sustainable model for learning and teaching data science. In Proceedings of the 47th ACM technical symposium on computing science education, pp. 169–174.
Schön, D. A. (1983). The reflective practitioner. Basic Books.
Schön, D. A. (1987). Educating the reflective practitioner: Toward a new design for teaching and learning in the professions. Jossey-Bass.
Silberman, M. (1996). Active learning: 101 strategies to teach any subject. ERIC.
Software Engineering Code—ACM Ethics. (2016). https://ethics.acm.org/code-of-ethics/software-engineering-code/
Yogeshwaran, S., Kaur, M. J., & Maheshwari, P. (2019). Project based learning: Predicting bitcoin prices using deep learning. In 2019 IEEE global engineering education conference (EDUCON), pp. 1449–1454.

Part IV

Machine Learning Education

This part of the guide is dedicated to the modeling phase of the data science workflow, whose focus is the generation and testing of the machine learning (ML) model designed to solve the problem under examination. Machine learning education warrants special attention for two main reasons. First, it is one of the main steps of the data science workflow and an important and central emerging approach for data modeling. Second, ML is based heavily on mathematics and computer science content and, therefore, poses unique pedagogical challenges that we address in this part of the guide. Specifically, we address the teaching of ML core concepts and algorithms that are commonly taught in introductory data science courses and present specific teaching methods suitable for teaching ML.

This part comprises the following chapters:

Chapter 13: The Pedagogical Challenge of Machine Learning Education
Chapter 14: Core Concepts of Machine Learning
Chapter 15: Machine Learning Algorithms
Chapter 16: Teaching Methods for Machine Learning

Chapter 13

The Pedagogical Challenge of Machine Learning Education

Abstract Machine learning (ML) is the essence of the modeling phase of the data science workflow. In this chapter, we focus on the pedagogical challenges of teaching ML to various populations. We first describe the terms white box and black box in the context of ML education (Sect. 13.2). Next, we describe the pedagogical challenge with respect to different learner populations, including data science major students as well as non-major students (Sect. 13.3). Then, we present three framework remarks for teaching ML (regarding statistical thinking, interdisciplinary projects, and the application domain knowledge), which, although not mentioned frequently in this part of the book, should be kept in mind in ML teaching processes (Sect. 13.4). We conclude this chapter by highlighting the importance of ML education in the context of the application domain (Sect. 13.5).

13.1 Introduction

Machine learning (ML) has developed in the last 50 years as part of the growing discipline of artificial intelligence and intelligent systems. As ML is one of the most effective methods for modeling huge and complex data, it has, in the last decade, become a significant phase of the data science workflow. Unlike traditional parametric modeling, which presumes an underlying statistical behavior of the researched phenomena and aims to fit the best parameters of this model, ML algorithms learn from experience (Goodfellow et al., 2016) and can learn complex data patterns directly from raw data.

Machine learning is an integral part of current data science curricula (Danyluk & Leidig, 2021; De Veaux et al., 2017; Demchenko et al., 2016) and constitutes a component in the curriculum of many undergraduate and graduate study programs, ranging from programs for data science major students, through undergraduate computer science and engineering programs, to humanities and social science study programs. Furthermore, ML is an important topic not only for students, but also for ML users. As we shall see, the fact that data science is taught today to such a wide variety of learners raises pedagogical challenges. Nevertheless, pedagogical

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_13



research, namely, how to teach ML, often concentrates on the non-major population (see, for example, Sulmont et al., 2019a, 2019b).

Nowadays, ready-to-use labeled datasets are widely available, and so it is easy to skip steps of the data science workflow (e.g., data gathering and cleaning) and to jump directly to the model generation step. While this may be sufficient for learning ML algorithms, it is important to teach how to implement ML in the context of the entire data science workflow.

In this chapter, we first describe the terms white box and black box in the context of ML education (Sect. 13.2) and present the challenge of teaching ML to a variety of populations (Sect. 13.3). Then, we offer several framework remarks that should be considered when designing ML teaching processes (Sect. 13.4). We conclude by re-highlighting the importance of the application domain in the context of ML as well, although ML can also be taught independently of the application domain.
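The ease of jumping directly to the model generation step, noted above, can be shown in a few lines of code (a minimal sketch, assuming scikit-learn and its bundled, ready-to-use iris dataset):

```python
# With a ready-to-use labeled dataset, the gathering and cleaning steps
# of the data science workflow disappear: one jumps straight from
# loading the data to model generation and evaluation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

The brevity is precisely the pedagogical point: none of these lines confronts the learner with where the data came from, how they were labeled, or whether they are biased.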

13.2 Black Box and White Box Understandings

The term white box understanding refers to understanding the details of the algorithm; that is, how it works. Using this terminology, we say that an individual must have a white box understanding both of the algorithm, in order to understand its parameters, and of hyperparameter tuning, to be able to improve the algorithm's performance. The term black box understanding, on the other hand, refers to an understanding of the relations between the input and the output of an algorithm without understanding how the algorithm itself works. In other words, understanding an ML algorithm as a black box refers to the ability to call a library procedure that executes the algorithm, without understanding the internal process by which the algorithm's output is generated.

Taking a black box approach, it is possible to learn the principles of ML even without sufficient mathematical and computational knowledge. For example, one can understand how to use logistic regression as a classifier without understanding how the algorithm works, i.e., without understanding the process required to find the model parameters. Nevertheless, Biehler and Schulte (2018) ask "what if machine learning is used in a data science course: Would it be appropriate to treat it as a black box…? Probably not" (p. 9). One reason for this assertion is that to achieve high performance, ML algorithms require many human decisions for which a white box understanding is needed. For example, one challenging task when designing an ML algorithm is hyperparameter tuning (Sulmont et al., 2019b). Hyperparameter tuning is essential in order to optimize the performance of the learning algorithm and, at the same time,

Footnote: This section is based on Mike, K., & Hazzan, O. (2022). Machine learning for non-major data science students: A white box approach. Special issue on Research on Data Science Education, The Statistics Education Research Journal (SERJ), 21(2), Article 10. Reprint is allowed by the SERJ journal's copyright policy.


it requires an understanding of the mathematical details of the ML algorithm (see Sect. 14.3). We note that the development of ML for data applications has two distinct phases that can be understood as either a black box or a white box: model development (see Sect. 14.4) and model usage. Since each of these two phases may require different mathematical knowledge, each of them can be understood independently either as a black box or as a white box. In the following chapters, we present pedagogical principles and methods that can support a white box understanding of ML algorithms by non-major students.
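The distinction can be illustrated in code (a sketch assuming scikit-learn; the dataset is a bundled example). The call to fit the model treats logistic regression as a black box, while choosing the regularization hyperparameter C by grid search is one of the human decisions for which a white box understanding of the algorithm helps:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Black box usage: the optimization that finds the model parameters is
# hidden behind fit(); only inputs and outputs are visible to the caller.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hyperparameter tuning: the regularization strength C is selected by
# cross-validated grid search -- a human decision whose meaning is clear
# only with some white box understanding of the algorithm.
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5)
search.fit(X_train, y_train)
accuracy = search.score(X_test, y_test)
print("best C:", search.best_params_["logisticregression__C"])
print(f"test accuracy: {accuracy:.2f}")
```

A learner can run this sketch successfully while understanding it entirely as a black box; explaining why one value of C outperforms another, however, requires the white box understanding discussed above.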

13.3 Teaching ML to a Variety of Populations

Data science educators commonly identify three types of data science learners: majors, non-majors, and users (Bhatt et al., 2020; Sulmont et al., 2019b). This section addresses the teaching of ML to each of these groups.

• Majors are students who are majoring in data science or in disciplines such as statistics, electrical engineering, and computer science, and who have extensive mathematical backgrounds that enable them to learn ML algorithms including all of their mathematical details. This kind of learning is commonly referred to as learning ML as a white box.
• Non-majors are students from other disciplines, such as the social sciences or life sciences, who usually lack the mathematical background required in order to learn ML algorithms with all of their mathematical details, and so usually learn ML algorithms as a black box.
• Users of ML algorithms need to understand ML as part of their professional or daily life, for example, physicians who use ML algorithms as diagnostic tools or business managers who use ML algorithms to make financial decisions.

While the common goal of ML curricula for majors and non-majors is to teach the learners how the ML algorithms work and how to generate new ML applications, the goal of ML curricula for users is to teach the functionality and limitations of the ML algorithms, including the interpretation of their outputs and predictions in the context of the relevant application domain. For more on this approach, see Hazzan and Mike's (2022) blog on Machine Learning: Out! Data Science: In!


13.3.1 Machine Learning for Data Science Majors and Allied Majors

Data science programs require extensive knowledge and skills in mathematics, statistics, and computer science (Anderson et al., 2014; Danyluk & Leidig, 2021; De Veaux et al., 2017; Demchenko et al., 2016). It is obvious, then, that data science majors and allied majors, such as majors in statistics, computer science, and electrical engineering, should have sufficient mathematical background to understand the mathematics of ML algorithms. See, e.g., Georgiopoulos et al. (2009).

Machine learning can be taught with or without the context of an application domain. In other words, ML can be taught as a collection of statistical and computational algorithms, without considering the application domain. Although ML lies at the intersection of mathematics and statistics with computer science, in many cases learning ML requires not only an understanding of the algorithms themselves, but also a broader view of the algorithms in the context of the application domain. Such an understanding requires knowledge about the role of ML as a component of the data science workflow, the handling of biases in the data, the role of training and test data, the evaluation of ML methods and models in the context of the application domain, as well as ethics and social responsibility. These topics are not easy to understand and are not well integrated into the curricula for major students (Wing, 2020). As a result, even though mathematics is not much of a challenge for data science majors and allied majors, teaching ML in the context of the application domain might be.

13.3.2 Machine Learning for Non-major Students

Machine learning is an important tool that is nowadays integrated into many occupations in various ways, such as the prediction of economic states or the detection of diseases using medical diagnostic tools. It is also a powerful research tool that can support the analysis of complex data such as text, images, speech, and video. As a result, ML is now being taught to a variety of learners (see Chap. 7), many of whom lack the required mathematical background.

Machine learning algorithms may be a complex topic to learn for those not majoring in statistics, computer science, or data science. Sulmont et al. (2019a, 2019b) describe non-major learners of ML as learners who lack sufficient knowledge in mathematics, statistics, and programming. For example, based on the results of their interviews with instructors of ML courses offered to such learners, Sulmont and her colleagues (2019a) use the Structure of the Observed Learning Outcome (SOLO) taxonomy to map the difficulties such learners encounter when learning ML. The

Footnote: This section is based on Mike, K., & Hazzan, O. (2022). Machine learning for non-major data science students: A white box approach. Special issue on Research on Data Science Education, The Statistics Education Research Journal (SERJ), 21(2), Article 10. Reprint is allowed by the SERJ journal's copyright policy.


SOLO taxonomy, which classifies learning outcomes in terms of their complexity, consists of five levels: prestructural, unistructural, multistructural, relational, and extended abstract (Biggs & Collis, 2014). Sulmont et al. (2019a) found that (a) at the unistructural stage of the SOLO taxonomy, the students' preconception of human thinking versus computer processing was a barrier; (b) at the relational stage, understanding decision making in ML was a barrier; and (c) at the extended abstract stage, difficulties in correctly perceiving the limits of ML applications were a barrier as well. In addition, both mathematics and programming were found to be barriers, and so mathematics was, in general, omitted from ML courses, and programming was not included in all cases. According to Sulmont and her colleagues, "Realizing that higher SOLO learning goals are more difficult to teach is useful for informing course design, public outreach, and the design of educational tools for teaching ML" (Sulmont et al., 2019a, p. 1). Accordingly, in Mike and Hazzan (2022), we present a tool that can support non-major data science students in learning ML by mitigating the mathematical barrier.

13.3.3 Machine Learning for ML Users

One of the main challenges of using ML algorithms is ML explainability and interpretability. Although several definitions have been proposed for these terms, no consensus has yet been reached (Marcinkevičs & Vogt, 2020). Both terms are concerned with humans' ability to understand the predictions of ML algorithms and the reasons for these predictions (Doshi-Velez & Kim, 2017). Since interpretability methods are required to help users build trust in ML models and understand their capabilities, developing both interpretable models and explanation methods is currently one of the main efforts in which the ML research community is engaged (Carvalho et al., 2019; Suresh et al., 2020). Rudin (2019) proposes that new ML techniques be developed that are inherently interpretable. Thus, the explainability and interpretability problem has two sides: humans and machines (Hilgard et al., 2021). While an ongoing effort is being made to improve the machines, it is the role of educators to improve human understanding of ML algorithms.

In the context of education, interpretability is discussed mainly in relation to users rather than students. Long and Magerko (2020) define 17 competencies of AI literacy, several of which are connected to understanding the role of humans in developing the model, choosing data to be used for training, and examining the algorithms and their biases. Suresh et al. (2020) found that people trust incorrect ML recommendations for tasks that they perform correctly most of the time, even when they have prior knowledge about ML or are given information indicating that the system is not confident in its prediction. Several methods have been proposed to help users interpret the outcomes of ML.
For example, Suresh and her colleagues (2021) suggest that, to help users interpret an algorithm's output, the model's prediction should be accompanied by additional examples that generate the same prediction, examples with which the users are familiar and which are not necessarily taken from the given data. Bhatt and his colleagues (2020) investigated whether ML developers consider the users when they examine their model's explainability. They found that the majority of ML developers tend to consider explainability with respect to their own ability to debug the model rather than the end users' ability to interpret it.

Exercise 13.1 The concepts of explainability and interpretability

Search the web and find 3–5 stories that exemplify the concepts of explainability and interpretability.

(a) For each story, identify its main actors, the ML algorithm it refers to, the context in which these concepts are discussed, the end of the story, and what conclusions are drawn (if at all).
(b) Add your conclusions from the examination of each story.
(c) Formulate three guidelines for users of ML methods.
(d) Reflect on what you have learned while working on this exercise. What would you do differently if you were asked to repeat it?
(e) What conclusions can you draw for your own usage of ML results?

13.4 Framework Remarks for ML Education

In this section, we present several framework remarks for teaching ML which, despite not being mentioned frequently in this part of the book, are important to keep in mind while teaching ML in any framework.

13.4.1 Statistical Thinking

Statistical thinking is discussed in Chap. 3, in which the topic is addressed from the cognitive perspective of data science. Here, on the other hand, we discuss statistical thinking in the context of its relevance for learning ML algorithms, specifically, highlighting the attention that the data warrant in the ML modeling phase, a phase that otherwise places heavy emphasis on the algorithmic facet of ML. Statistical thinking is associated with understanding the essence, characteristics, and variability of real-life data, with emphasis on its importance for solving real-life problems (Cobb & Moore, 1997). According to Ben-Zvi and Garfield (2004), statistical thinking "involves an understanding of why and how statistical investigations are conducted and the 'big ideas' that underlie statistical investigations" (p. 8). As specified in Chap. 3, statistical thinking includes (a) the understanding that variation exists in any data source and that real-life data contain outliers, errors, biases, and variance; (b) when and how to use specific statistical data analysis methods; (c) the nature of sampling and how to infer from samples to populations; (d) statistical models and their usage; (e) the context of a given problem when performing investigations and drawing conclusions; (f) the entire process of statistical inquiry; and (g) the relevance of critique and the evaluation of inquiry results (Ben-Zvi & Garfield, 2004).

While the focus of ML education might be the ML algorithms themselves, students must understand that the data are no less important than the algorithms, and maybe even more so. Since this is of such great importance, it is essential that the students understand the fundamental nature of the data, which is the core of statistical thinking. It should be emphasized that statistical understanding helps make good decisions especially, but not only, in the development process of the ML algorithm. In Chap. 14, several core concepts of ML, such as ML algorithm indicators and bias and variance, illustrate the importance of this mode of thinking and understanding.

Exercise 13.2 Machine learning and statistical thinking

Choose a problem whose solution requires the design of an ML algorithm. Describe the relevance of each component of statistical thinking mentioned above for the design process of that ML algorithm.

13.4.2 Interdisciplinary Projects

ML applications often require an interdisciplinary point of view; in reality, unfortunately, ML teams in academia and industry are not always interdisciplinary. Furthermore, ML courses offered by a certain faculty are usually taught by a team from that faculty, and so only students from that one faculty attend them and are exposed only to that faculty's perspective on data science issues. In such cases, teams of learners working on their final project may lack the interdisciplinary perspective required for the project development. A similar situation may occur in industry, as ML teams are mostly made up of data science, computer science, and electrical engineering graduates. Thus, students must be exposed to the importance of interdisciplinarity for the success of ML projects by practicing the full process of data science project development.


Exercise 13.3 Machine learning and interdisciplinary projects

(a) Describe an ML project intended to detect whether or not a person has a specific disease. Describe a scenario in which the lack of medical knowledge among the project developers leads to situations in which the disease is not detected.
(b) Choose two additional application domains and repeat step (a) with respect to each application domain.
(c) For each of the three projects you worked on in this exercise, characterize the expert whom the team should hire to avoid the situations you described.

13.4.3 The Application Domain Knowledge

Data science learners tend to neglect the importance of the application domain knowledge for their ML project (Mike et al., 2020). It is, however, important to consider the application domain knowledge in several phases of the ML development process, including defining the suitable type of ML algorithm, defining the suitable performance indicator, cleaning the data, and deploying the model in the application field. For example, when choosing a performance indicator for an ML algorithm developed to detect pedestrians in an image, it is important to understand the real-world situations in which this algorithm will be deployed. In the same spirit, different performance indicators and accuracy scores are considered suitable for an algorithm designed to drive an autonomous car versus an algorithm designed to count the number of pedestrians walking into a shopping mall (see also the discussion about the domain neglect bias in Chap. 3).

Exercise 13.4 Machine learning performance indicators

(a) Choose an ML algorithm and a possible indicator for its performance.
(b) Suggest two problems that the specific algorithm can solve with the condition that while the chosen indicator is reasonable for one of the problems, it may lead to disaster if used in the second problem as an indicator for the performance of the ML algorithm.
(c) What are your conclusions?
(d) Repeat steps (a)–(c) for two additional ML algorithms.


13.5 Conclusion

In this chapter we saw that ML algorithms can be taught as a general concept and not necessarily in the context of a specific application domain. It is, however, always recommended to associate each ML algorithm with a specific application domain in which it is relevant; otherwise, the importance of ML in the context of data science is not delivered and the interdisciplinarity of data science is not imparted to the students. Indeed, although ML is an interesting topic in itself, it is doubly meaningful and important in the context of data science.

References

Anderson, P., Bowring, J., McCauley, R., Pothering, G., & Starr, C. (2014). An undergraduate degree in data science: Curriculum and a decade of implementation experience. In Proceedings of the 45th ACM technical symposium on computer science education—SIGCSE'14, pp. 145–150. https://doi.org/10.1145/2538862.2538936
Ben-Zvi, D., & Garfield, J. B. (2004). The challenge of developing statistical literacy, reasoning and thinking. Springer.
Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R., Moura, J. M., & Eckersley, P. (2020). Explainable machine learning in deployment. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pp. 648–657.
Biehler, R., & Schulte, C. (2018). Paderborn symposium on data science education at school level 2017: The collected extended abstracts. Universitätsbibliothek.
Biggs, J. B., & Collis, K. F. (2014). Evaluating the quality of learning: The SOLO taxonomy (Structure of the observed learning outcome). Academic Press.
Carvalho, D. V., Pereira, E. M., & Cardoso, J. S. (2019). Machine learning interpretability: A survey on methods and metrics. Electronics, 8(8), 832.
Cobb, G. W., & Moore, D. S. (1997). Mathematics, statistics, and teaching. The American Mathematical Monthly, 104(9), 801–823. https://doi.org/10.1080/00029890.1997.11990723
Danyluk, A., & Leidig, P. (2021). Computing competencies for undergraduate data science curricula. https://www.acm.org/binaries/content/assets/education/curricula-recommendations/dstf_ccdsc2021.pdf
De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., Bryant, L., Cheng, L. Z., Francis, A., Gould, R., Kim, A. Y., Kretchmar, M., Lu, Q., Moskol, A., Nolan, D., Pelayo, R., Raleigh, S., Sethi, R. J., Sondjaja, M., Tiruviluamala, N., et al. (2017). Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and Its Application, 4(1), 15–30. https://doi.org/10.1146/annurev-statistics-060116-053930
Demchenko, Y., Belloum, A., Los, W., Wiktorski, T., Manieri, A., Brocks, H., Becker, J., Heutelbeck, D., Hemmje, M., & Brewer, S. (2016). EDISON data science framework: A foundation for building data science profession for research and industry. In 2016 IEEE international conference on cloud computing technology and science (CloudCom), pp. 620–626. https://doi.org/10.1109/CloudCom.2016.0107
Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. ArXiv Preprint ArXiv:1702.08608.
Georgiopoulos, M., DeMara, R. F., Gonzalez, A. J., Wu, A. S., Mollaghasemi, M., Gelenbe, E., Kysilka, M., Secretan, J., Sharma, C. A., & Alnsour, A. J. (2009). A sustainable model for integrating current topics in machine learning research into the undergraduate curriculum. IEEE Transactions on Education, 52(4), 503–512.
Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (vol. 1). MIT Press.
Hazzan, O., & Mike, K. (2022). Machine learning: Out! Data science: In! https://cacm.acm.org/blogs/blog-cacm/261730-machine-learning-out-data-science-in/fulltext
Hilgard, S., Rosenfeld, N., Banaji, M. R., Cao, J., & Parkes, D. (2021). Learning representations by humans, for humans. In International conference on machine learning, pp. 4227–4238.
Long, D., & Magerko, B. (2020). What is AI literacy? Competencies and design considerations. In Proceedings of the 2020 CHI conference on human factors in computing systems, pp. 1–16.
Marcinkevičs, R., & Vogt, J. E. (2020). Interpretability and explainability: A machine learning zoo mini-tour. ArXiv Preprint ArXiv:2012.01805.
Mike, K., & Hazzan, O. (2022). Machine learning for non-major data science students: A white box approach. Statistics Education Research Journal, 21(2), Article 10.
Mike, K., Nemirovsky-Rotman, S., & Hazzan, O. (2020). Interdisciplinary education—The case of biomedical signal processing. In 2020 IEEE global engineering education conference (EDUCON), pp. 339–343. https://doi.org/10.1109/EDUCON45650.2020.9125200
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.
Sulmont, E., Patitsas, E., & Cooperstock, J. R. (2019a). Can you teach me to machine learn? In Proceedings of the 50th ACM technical symposium on computer science education, pp. 948–954. https://doi.org/10.1145/3287324.3287392
Sulmont, E., Patitsas, E., & Cooperstock, J. R. (2019b). What is hard about teaching machine learning to non-majors? Insights from classifying instructors' learning goals. ACM Transactions on Computing Education, 19(4), 1–16. https://doi.org/10.1145/3336124
Suresh, H., Lao, N., & Liccardi, I. (2020). Misplaced trust: Measuring the interference of machine learning in human decision-making. In 12th ACM conference on web science, pp. 315–324.
Suresh, H., Lewis, K. M., Guttag, J. V., & Satyanarayan, A. (2021). Intuitively assessing ML model reliability through example-based explanations and editing model inputs. ArXiv Preprint ArXiv:2102.08540.
Wing, J. M. (2020). Ten research challenge areas in data science. Harvard Data Science Review. https://doi.org/10.1162/99608f92.c6577b1f

Chapter 14

Core Concepts of Machine Learning

Abstract In this chapter, we focus on the teaching of several core concepts that are common to many machine learning (ML) algorithms (such as hyper-parameter tuning) and, as such, are essential learning goals in themselves, regardless of the ML algorithms. Specifically, we discuss types of ML (Sect. 14.2), ML parameters and hyperparameters (Sect. 14.3), model training, validation, and testing (Sect. 14.4), ML performance indicators (Sect. 14.5), bias and variance (Sect. 14.6), model complexity (Sect. 14.7), overfitting and underfitting (Sect. 14.8), loss function optimization and the gradient descent algorithm (Sect. 14.9), and regularization (Sect. 14.10). We conclude this chapter by emphasizing what ML core concepts should be discussed in the context of the application domain (Sect. 14.11).

14.1 Introduction

Understanding machine learning (ML) algorithms requires not only an understanding of the mathematics behind the algorithms and, in many cases, of the social environments in which they are applied, but also an understanding of some core principles of ML, such as the different types of ML and the development process of ML applications. In this chapter, we present core principles of ML algorithms along with challenges that arise when teaching these concepts to different audiences, and teaching guidelines that may help overcome those challenges. The concepts we present are: types of ML algorithms (Sect. 14.2), parameters and hyperparameters of ML algorithms (Sect. 14.3), model training, validation, and testing (Sect. 14.4), performance indicators of ML algorithms (Sect. 14.5), bias and variance (Sect. 14.6), model complexity (Sect. 14.7), overfitting and underfitting (Sect. 14.8), loss function optimization and the gradient descent algorithm (Sect. 14.9), and regularization (Sect. 14.10).

Some notes before embarking on this chapter:

• As mentioned in the introduction to this guide (Chap. 1), this guide does not aim to teach data science; rather, it addresses data science from a pedagogical perspective. Nevertheless, we note that understanding ML algorithms and core concepts is a prerequisite for understanding this chapter.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_14


• As presented in Chap. 2, data science is an interdisciplinary field that inherits knowledge and skills from mathematics, statistics, and computer science, as well as from the application domain. By the term ML, we refer to the intersection of the mathematics and statistics component of data science with its computer science component. As we shall see in this chapter, ML includes concepts whose understanding is important regardless of the domain in which the ML algorithm is applied. Since ML lies at this intersection, one of its main concepts is the algorithm. A significant portion of this chapter is, therefore, dedicated to the teaching of algorithms, and so it can also be used by computer science educators.

14.2 Types of Machine Learning

Machine learning algorithms can be categorized in several ways. One categorization is by the type of output the ML algorithm returns: categorical or continuous. Categorical-output ML algorithms may classify an animal as a cat or a dog, or determine whether or not a pedestrian is seen in a given image. Continuous-output ML algorithms may predict the price of a house given its location and additional real-estate features, or determine where a pedestrian is located within an image.

Exercise 14.1 Types of machine learning

Add three examples of each of the two output types: categorical and continuous.

Another way to categorize ML algorithms is by their training method. Supervised learning refers to ML algorithms that learn from labeled samples, that is, samples with a known desired output. An example of such data is a set of images of cats and dogs, each with a label denoting the type of animal seen in the image. Unsupervised learning refers to ML algorithms that are designed to find patterns in unlabeled data, for example, an algorithm that divides a collection of images into groups, each containing images of a different type of animal, without attaching a known meaningful label to each group. Reinforcement learning refers to algorithms that make decisions and learn from the responses to these decisions that they receive from the external environment. For example, an algorithm designed to classify the type of animal that appears in an image classifies an image as a cat or a dog, and then improves its decision rules based on feedback it receives from an external entity on whether or not the classification was correct.

Many of the curricula that have been proposed focus on supervised learning of categorically labeled data, that is, on classification algorithms such as KNN, decision trees, and neural networks. Indeed, in many cases classifiers are simpler to understand than other types of ML algorithms; however, many real-world applications require other types of ML methods, which should therefore not be neglected. For example, many applications in economics require algorithms that predict continuous variables, such as income, growth, and so on. It is therefore recommended to adjust the mix of the different types of algorithms learned in a specific course according to the specific learners of the course and the types of applications relevant for them.
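The two output types can be made concrete with a deliberately minimal nearest-neighbor model in plain Python. This is an illustrative sketch only: the data, values, and the `nearest_neighbor` helper are invented here, not taken from any particular curriculum.

```python
# A minimal 1-nearest-neighbor model illustrating the two output types.
# All data and the helper function below are invented for illustration.

def nearest_neighbor(train, query):
    """Return the training sample whose feature is closest to the query."""
    return min(train, key=lambda sample: abs(sample[0] - query))

# Categorical output: each sample is (feature, label).
animals = [(1.0, "cat"), (1.2, "cat"), (3.8, "dog"), (4.1, "dog")]
print(nearest_neighbor(animals, 1.1)[1])   # prints "cat"

# Continuous output: each sample is (feature, price).
houses = [(50, 100_000), (80, 160_000), (120, 250_000)]
print(nearest_neighbor(houses, 85)[1])     # prints 160000
```

The same lookup rule produces a class label in the first case and a numeric prediction in the second, which is exactly the categorical/continuous distinction discussed above.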

14.3 Machine Learning Parameters and Hyperparameters

Machine learning algorithms have both parameters and hyperparameters. The values of the ML parameters are learned from the data, e.g., the weights of the different features in a linear regression algorithm. The values of the ML hyperparameters, on the other hand, are set by the human developer of the application, e.g., the number of layers in a neural network or the number of learning cycles. The purpose of hyperparameters is to control the learning process of the algorithm. Hyperparameter tuning is essential for optimizing the performance of the learning algorithm and, at the same time, requires an understanding of the mathematical details of the ML algorithm. Thus, while it is possible to teach the principles of ML without teaching the mathematical and computational knowledge required to fully understand them, individuals lacking this knowledge may find it difficult to optimize the performance of ML algorithms. For example, one of the subjects interviewed by Sulmont and her colleagues (2019a, 2019b) suggested that "their students cannot understand how tuning works, because they lack the mathematical prerequisite to understand parameters. Therefore, they claim, they think [ML] is magic when you tune parameters and get different results" (p. 11). Teaching ML algorithms as a black box is, therefore, sufficient to impart to students the ability to train ML algorithms, but it may be insufficient when it comes to their ability to optimize these algorithms. Accordingly, in Chap. 16 we propose pedagogical methods that support a white-box understanding of ML by non-majors.
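The distinction can be sketched in a few lines of plain Python (an illustrative sketch; the toy data and the `fit_line` function are invented here): the slope `w` and intercept `b` are parameters learned from the data, while the learning rate and the number of learning cycles are hyperparameters set by the developer.

```python
# Fitting y ≈ w*x + b by gradient descent (illustrative sketch).
# w and b are PARAMETERS: their values are learned from the data.
# learning_rate and cycles are HYPERPARAMETERS: set by the developer
# to control the learning process.

def fit_line(xs, ys, learning_rate=0.01, cycles=2000):  # hyperparameters
    w, b = 0.0, 0.0                                     # parameters
    n = len(xs)
    for _ in range(cycles):
        # gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]   # underlying rule: y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 2), round(b, 2))       # close to 2.0 and 1.0
```

Rerunning `fit_line` with a much smaller `cycles` value or a poorly chosen `learning_rate` yields noticeably worse parameters, which is precisely why hyperparameter tuning matters.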

14.4 Model Training, Testing, and Validation

In this section, we focus on the steps of model training, validation, and testing that are part of the development process of the ML algorithm. Figure 14.1 presents a general scheme of model generation, testing, and prediction for the case of classification. This process is the same for any supervised learning, including regression. We will now elaborate on each step of this process.

Fig. 14.1 A general scheme of the ML model generation, testing, and prediction phases

The first step of the process is training, that is, finding a set of algorithm parameters that minimize the difference between the classification of the training set samples (X train) and the given labels for these samples (the desired Y). For many algorithms, such as neural networks, the algorithm parameters are initiated in the learning phase with random values, which are then refined gradually using an optimization algorithm. Other algorithms, such as KNN, have no parameters, and their learning phase consists of simply storing the training set in the algorithm's memory for future predictions.

The second step of the process is validation, in which the model is validated against samples yet unseen by the algorithm (X validation), for which the labels are also known. The first and second steps are repeated several times, using a different set of hyperparameters each time. This repetition is called hyperparameter tuning, in which we seek the set of hyperparameters that yields the best performance. In the third step, testing, we again test the model against another unseen set of labeled data (X test). This step is required to verify that the model parameters and hyperparameters do not overfit the training and validation data (see Sect. 14.8). In the last step, prediction, the algorithm is deployed and run against real data. These data are obviously not labeled and, therefore, we cannot calculate the performance of the algorithm with respect to them.

Students might find this scheme somewhat confusing for several reasons:

• They may find it difficult to understand why we leave some of the data for testing rather than train the algorithm using all of the available data. They may think that leaving out data is a waste of data and that the learning phase will result in a better model if it is trained with as much data as possible (Sulmont et al., 2019a, 2019b).
• Due to their lack of understanding of the hyperparameter tuning step, learners may wonder why algorithm validation and testing are different steps.
• The sizes of the validation and test sets depend both on the overall amount of data available for model development and on the data distribution. Students with insufficient statistical knowledge may find it difficult to construct the validation and test sets in a way that represents the data properly.
• When not enough data is available for training, validation, and testing, a more complex scheme, such as cross validation or data augmentation, may be required.
• The scheme does not exactly fit the development of all algorithms. For example, the KNN algorithm has no parameters, and so its development does not have a parameter tuning phase, whereas the decision tree algorithm is tuned based on heuristic rules for split generation, and so its training is not based on reducing the classification error on the training set.

Educators should be aware of these difficulties and choose the taught ML algorithms according to the computational and statistical knowledge of their class (see Chap. 3), as well as learners' ability to think abstractly and to move between different levels of abstraction (see Chap. 11).
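The training/validation/testing scheme, including the hyperparameter tuning loop, can be sketched end to end in plain Python. This is an illustrative sketch under invented assumptions: the toy k-NN classifier, the synthetic two-class data, and the 60/20/20 split are all made up here.

```python
import random

# Illustrative sketch of the training/validation/testing scheme with a toy
# k-nearest-neighbors classifier (data, names, and split sizes invented here).

def knn_predict(train, query, k):
    """Predict the majority label among the k nearest training samples."""
    neighbors = sorted(train, key=lambda s: abs(s[0] - query))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

def accuracy(train, dataset, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in dataset)
    return hits / len(dataset)

random.seed(0)
# Two classes centered at 0 and 5, with some spread.
data = [(random.gauss(center, 1.5), label)
        for center, label in [(0, "A"), (5, "B")]
        for _ in range(60)]
random.shuffle(data)
train, validation, test = data[:72], data[72:96], data[96:]  # 60/20/20 split

# Hyperparameter tuning: repeat the train/validate cycle for each candidate k
# and keep the k that performs best on the validation set.
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(train, validation, k))

# Testing: a final estimate of performance on data unseen during tuning.
print(best_k, accuracy(train, test, best_k))
```

Note how the test set is touched exactly once, after tuning is finished; evaluating the candidate `k` values on the test set instead would leak information and defeat the purpose of the third step.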


14.5 Machine Learning Performance Indicators

The performance of ML algorithms can be measured by a variety of indicators, such as accuracy, precision, recall, F1, the receiver operating characteristic (ROC) curve, the area under the curve (AUC), R², and the sum of squared errors (SSE), to name a few. In addition to familiarity with these indicators, it is also important to understand their meaning in the real world and to realize that the selection of different performance indicators, as well as their values, may result in different ML models. Furthermore, misunderstanding the meaning of performance indicators may lead to incorrect conclusions regarding the algorithm's impact in the real world (Shalala et al., 2021). Students with insufficient knowledge in statistics may find it difficult to understand that the value of a selected performance indicator is influenced by the selection of the validation and test sets, which are themselves a sample of the whole dataset, which is itself only a sample of the entire real-world population.

Exercise 14.2 Machine learning performance indicators

Choose a problem for whose solution an ML algorithm is sought. Investigate the interpretation of three performance indicators in the context of the problem you chose. What are the meaning and implications of non-informative indicators for the problem?

Students with insufficient knowledge in statistics might also misinterpret the performance indicators due to cognitive biases. Cognitive biases are phenomena of the human brain that may cause erroneous perceptions and irrational decision-making processes (Kahneman & Tversky, 1973). For example, consider the following question:

A machine learning algorithm was trained to detect photos of lions. The algorithm does not err when detecting photos of lions, but 5% of photos of other animals (in which a lion does not appear) are detected as a photo of a lion. The algorithm was executed on a dataset with a lion-photo rate of 1:1000. If a photo was detected as a lion, what is the probability that it is indeed a photo of a lion? (Hazzan & Mike, 2022, Para. 1)

The correct answer to this question is about 2%, based on Bayes' theorem (see Eq. 14.1). It turns out, however, that about one third of undergraduate computer science, electrical engineering, and data science students fail to answer this question correctly (Hazzan & Mike, 2022).


$$
\begin{aligned}
p(\text{Lion}\mid\text{Lion detected}) &= \frac{p(\text{Lion detected}\mid \text{Lion})\cdot p(\text{Lion})}{p(\text{Lion detected})}\\
&= \frac{p(\text{Lion detected}\mid \text{Lion})\cdot p(\text{Lion})}{p(\text{Lion detected}\mid \text{Lion})\cdot p(\text{Lion})+p(\text{Lion detected}\mid \text{not Lion})\cdot p(\text{not Lion})}\\
&= \frac{1\cdot 0.001}{1\cdot 0.001+0.05\cdot 0.999} = 0.0196 \approx 2\%
\end{aligned}
\tag{14.1}
$$
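Eq. 14.1 can also be checked numerically in a few lines of Python (the variable names below are ours, chosen for readability):

```python
# Numerical check of Eq. 14.1 (Bayes' theorem for the lion question).
p_lion = 1 / 1000                  # base rate of lion photos
p_detect_given_lion = 1.0          # the algorithm never misses a lion
p_detect_given_not_lion = 0.05     # 5% false-positive rate

p_detect = (p_detect_given_lion * p_lion
            + p_detect_given_not_lion * (1 - p_lion))
p_lion_given_detect = p_detect_given_lion * p_lion / p_detect
print(round(p_lion_given_detect, 4))   # prints 0.0196, i.e., about 2%
```

Changing `p_lion` in this snippet shows immediately how sensitive the answer is to the base rate, which is the quantity the base-rate neglect fallacy ignores.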

The lion detection question is an ML paraphrase of the medical diagnosis question whose understanding was investigated by Ward Casscells and his colleagues (1978):

If a test to detect a disease whose prevalence is 1/1000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming you know nothing about the person's symptoms or signs? (p. 999)

Casscells and his colleagues posed the medical diagnosis question to 20 interns, 20 fourth-year medical students, and 20 attending physicians at four Harvard Medical School teaching hospitals. Only 18% of the participants in their experiment answered the question correctly. Several explanations were suggested for this phenomenon (Koehler, 1996), one of which is the base-rate neglect fallacy, identified by Kahneman and Tversky (1973). According to this explanation, the mistake in answering the medical diagnosis question results from ignoring the base rate of the disease in the population. Another explanation was suggested by Leron and Hazzan (2009), who used the dual-process theory (Kahneman, 2002) to explain the error.

We now present two methods for the mitigation of the base-rate neglect cognitive bias that support learners' correct interpretation of indicators of ML algorithms. The first method is to work with frequencies (natural numbers) instead of probabilities (percentages). For example, Cosmides and Tooby (1996) rephrased the medical diagnosis problem as follows:

One out of every 1,000 Americans has disease X. A test has been developed to detect when a person has disease X. Every time the test is given to a person who has the disease, the test comes out positive (i.e., the "true positive" rate is 100%). But sometimes the test also comes out positive when it is given to a person who is completely healthy. Specifically, out of every 1,000 people who are perfectly healthy, 50 of them test positive for the disease (i.e., the "false positive" rate is 5%). Imagine that we have assembled a random sample of 1,000 Americans. They were selected by a lottery. Those who conducted the lottery had no information about the health status of any of these people. Given the information above, on average, how many people who test positive for the disease will actually have the disease? ________ out of ________. (p. 24)

Cosmides and Tooby found that 56% of the participants in their research answered the medical diagnosis problem correctly when it was formulated using frequencies.


Exercise 14.3 True or false

(a) Define the concepts true-positive, true-negative, false-positive, and false-negative. Which of these concepts does the medical diagnosis problem use?
(b) Select a problem from any domain of life whose formulation includes these concepts. Formulate it in two ways: using frequencies and using probabilities (percentages).
(c) If you are working in a team, the team can discuss the different problems, addressing questions such as: In what context is each formulation clearer? Was it easy to transition between the two formulations? Why?

The second tool we present that can mitigate learners' misinterpretation of ML performance indicators is a confusion matrix, which represents the performance of a classifier for the categories mentioned in Exercise 14.3: true-positive, true-negative, false-positive, and false-negative. In general, a confusion matrix is an n × n matrix (according to the number of groups into which the classifier classifies the data); in the current discussion, however, we use a 2 × 2 confusion matrix. In a confusion matrix, each row represents the distribution of classifications for each class in the dataset (sometimes a transposed confusion matrix is presented, with the real labels on the columns). For example, consider the confusion matrix of the lion detection problem presented above with a database of (for the sake of simplicity) 1001 photos of animals (Fig. 14.2). Not only does this representation help visualize the two different categories of correct and erroneous classifications, it also presents them with natural numbers (frequencies) rather than with percentages (probabilities), hence supporting the students' comprehension of the essence of performance indicators of ML algorithms. Using this question, it is also easy to explain the performance indicators in different scenarios. For example, the accuracy of the lion classifier is given by 951/1001 ≈ 95%, while the precision is only 1/51 ≈ 2%.
                          Predicted label
                       Lion        Not lion
  Real     Lion          1             0
  label    Not lion     50           950

Fig. 14.2 Presentation of the lion detection question using a confusion matrix

Therefore, the confusion matrix helps explain the indicators, helping students with insufficient background in statistics and probability overcome the need for Eq. 14.1. Questions like the lion detection question can lead to a discussion about the difference between performance indicators, in general, and their real-world meaning, in particular. Specifically, the importance of selecting the correct performance indicator with respect to its real-world meaning should be highlighted.
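The indicators discussed above can be computed directly from the four cells of the confusion matrix in Fig. 14.2. The short Python sketch below is illustrative; the variable names are ours.

```python
# Computing performance indicators from the lion-detection confusion matrix
# of Fig. 14.2 (TP = 1, FN = 0, FP = 50, TN = 950).

tp, fn, fp, tn = 1, 0, 50, 950
total = tp + fn + fp + tn                      # 1001 photos

accuracy = (tp + tn) / total                   # 951/1001 ≈ 0.95
precision = tp / (tp + fp)                     # 1/51 ≈ 0.02
recall = tp / (tp + fn)                        # 1.0: no lion is missed
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The gap between the high accuracy and the very low precision is visible at a glance, which is the point of the lion detection question.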

Exercise 14.4 Performance indicators Based on the values presented in Fig. 14.2, calculate the value of three additional performance indicators of the classifier described in the lion detection problem.

Exercise 14.5 Comparing performance indicators

Present two situations for which a classifier with performance indicators of accuracy 95% and precision 2% is acceptable, and two situations for which such a classifier is unacceptable (in other words, may cause significant harm or damage).

14.6 Bias and Variance

Our datasets are usually samples that represent populations. In statistics, bias represents the error in the estimation of some parameter, that is, the difference between the estimate and the true value of the parameter being estimated. In the context of ML, bias is associated with an algorithm and refers to the error in the algorithm's predictions, that is, the average difference between the true value that we are trying to predict and the prediction of our model. Intuitively, the algorithm's bias on the training set represents a systematic error in the model.

In statistics, variance represents the dispersion of the estimator. In other words, considering that our data is just a sample of the population, what is the expected spread of the estimator around its true value? In the context of ML, the algorithm's error on the test set is used to estimate the variance.

The algorithm's error on an unseen dataset is affected by both errors, the bias and the variance. We estimate the bias using the algorithm's error on the training set, and the variance, using the algorithm's error on the test set. Domingos (2012) proposed a dart-throwing analogy to demonstrate the phenomena of bias and variance (see Fig. 14.3).
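The estimation of bias and variance by training and test errors can be demonstrated numerically. The following NumPy sketch (ours, not from the book; the data and polynomial degrees are arbitrary choices) fits polynomials of increasing complexity to a noisy quadratic sample and compares the two errors:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                                   # the true relationship
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-1, 1, 21)
y_train = f(x_train) + rng.normal(0, 0.05, x_train.size)   # noisy training sample
y_test = f(x_test) + rng.normal(0, 0.05, x_test.size)      # unseen test sample

train_err, test_err = {}, {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)      # fit on the training sample only
    train_err[degree] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err[degree] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err[degree]:.4f}, "
          f"test MSE {test_err[degree]:.4f}")
```

The degree-1 model shows a high training error (high bias); the degree-9 model drives the training error down while the test error no longer improves, illustrating why both errors must be inspected together.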


14 Core Concepts of Machine Learning

Fig. 14.3 Bias and variance in dart throwing (based on Domingos, 2012)

Students who lack a solid background in statistics might find these terms confusing, since they have no mental construct of the concepts of population, sampling, sampling noise, bias, and variance. These students may find the alternative terms training error and validation/test error (for bias and variance, respectively) more intuitive. Beyond their relevance for understanding the properties of ML algorithms, the concepts of bias and variance (or training and test errors) are important for detecting the overfitting and underfitting conditions described in Sect. 14.8.

14.7 Model Complexity

Model complexity is a measure of how complex the problems that a certain algorithm can solve are, or how complex the data that the algorithm can represent is (Hu et al., 2021). In ML, the model complexity and the data complexity should be in harmony. The concept of model complexity has several mathematical definitions (see Hu et al., 2021), but those definitions require a very strong mathematical background. Nevertheless, it is crucial to understand the concept of model complexity, because it must match the problem complexity to obtain good model performance (see Sect. 14.8, Overfitting and Underfitting).

The intuition behind model complexity may be explained using visualization. For example, Fig. 14.4 presents the iris flower classification problem (virginica vs. versicolor) based on two features (sepal width and petal width). The three colored lines (purple, green, and red) represent the classification lines of three classifiers of different complexity. The purple line is a straight line, which depicts a low-complexity classifier and might be too simple to represent this problem well.

14.8 Overfitting and Underfitting

219

Fig. 14.4 Model complexity in a classification problem

The green line represents a classifier of higher complexity that might be a good fit for this problem, and the red line represents the most complex classifier, which might be too complex to represent the problem, thus causing the phenomenon of overfitting (see Sect. 14.8).

Model complexity can also be demonstrated in regression problems. Figure 14.5 presents a regression of five hypothetical points. In this example, the three colored lines (purple, green, and red) represent the regression lines of three regressors of different complexity. The purple line represents a linear regression, which is too simple to represent the data well. The green line represents a quadratic regression line that fits this data well, and the red line represents the most complex regression line, a fifth-degree polynomial. While this regression line fits each point in the data exactly, it might nevertheless be too complex to represent the data well.

14.8 Overfitting and Underfitting

In supervised learning, overfitting and underfitting are two phenomena that lead to low performance of ML algorithms. Underfitting represents a situation in which the learned model does not represent the data well, and so the algorithm's performance on both the training dataset and the test dataset is low. Overfitting represents a situation in which the model fits the training data very well but does not represent the whole dataset well, and so the algorithm's performance on the test data is low.



Fig. 14.5 Model complexity in a regression problem

The phenomena of overfitting and underfitting are connected to the algorithm's bias and variance: high bias and high variance indicate an underfitting model; low bias and high variance indicate an overfitting model; and low bias and low variance are the desired situation. Students who find the concepts of bias and variance confusing might also find the concepts of overfitting and underfitting confusing. It is therefore possible to teach the concepts of overfitting and underfitting using analogies, such as the two described below.

• Preparation for an exam: Overfitting is similar to memorizing the training dataset without generalizing the learned model to unseen data. Suppose a student prepares for an exam only by memorizing the answers to all of the questions that appeared in previous exams. He or she will not succeed on the upcoming exam if it contains new questions that did not appear in previous exams.
• Irrigation system: Suppose that an autonomous tractor is trained to irrigate a field of tomato bushes planted in a pattern similar to the patterns shown in Fig. 14.6. The purple arrow exemplifies an underfitting route, since it does not consider the true pattern of the plants and neglects the different offsets of the other rows. The red arrow exemplifies an overfitted route, tracking the exact pattern of a single row of plants, which does not fit the general pattern of the field. The green line exemplifies a good model that neither underfits nor overfits the pattern of the planted bushes.


Fig. 14.6 Overfitting and underfitting—irrigation system analogy

Exercise 14.6 Overfitting and underfitting

In the following table, explain the implication of each combination of bias and variance in terms of overfitting and underfitting, and give an example that illustrates each combination.

                 Low bias     High bias
Low variance
High variance

Exercise 14.7 Overfitting and underfitting analogies

Create two additional analogies for underfitting and overfitting. Reflect on the creation process you went through:
(a) How did you search for domains with respect to which you created the analogies?
(b) How did you find the exact situations that correspond to underfitting and overfitting?
(c) Did the creation of these analogies improve your understanding of bias, variance, underfitting, and overfitting? If yes, how?



14.9 Loss Function Optimization and the Gradient Descent Algorithm

Many ML training algorithms, for example, the backpropagation algorithm in the case of neural networks, are based on minimizing a loss function calculated as the difference between the ML algorithm's predictions and the true labels. Two examples of loss functions are the mean of squared errors (MSE) in the case of linear regression and the cross-entropy in the case of logistic regression. The gradient descent algorithm, usually taught in numerical analysis courses, is used to minimize the loss function.

Calculus constitutes essential mathematical knowledge for the training of such algorithms. Alas, it is not a requirement in many academic programs, for example, in the social sciences. Several ML algorithms, such as linear regression and logistic regression, are however part of the statistics curriculum, and so the students learn these algorithms as a black box. We recommend demonstrating how linear regression or logistic regression can be trained using the gradient descent algorithm, building on the students' prior knowledge of these algorithms. Then, more complex optimization algorithms, such as backpropagation, can be introduced based on the intuition gained with linear regression and logistic regression.
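As a sketch of this recommendation (the data and hyperparameters are our own choices, not the book's), simple linear regression can be trained by gradient descent on the MSE loss in a few lines:

```python
# Minimal gradient-descent sketch: fit y ≈ w*x + b by minimizing the MSE loss.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.01, 100)   # data generated from y = 3x + 1 plus noise

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    y_hat = w * x + b
    grad_w = 2 * np.mean((y_hat - y) * x)   # derivative of the MSE with respect to w
    grad_b = 2 * np.mean(y_hat - y)         # derivative of the MSE with respect to b
    w -= lr * grad_w                        # step against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # should approach w ≈ 3, b ≈ 1
```

Because students can verify the recovered slope and intercept against the data-generating line, the loop makes the idea of "walking downhill on the loss" concrete before backpropagation is introduced.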

14.10 Regularization

One method of mitigating overfitting is regularization. With regularization, the ML training algorithm tries to minimize not only the difference between the ML algorithm's predictions and the true labels, but also the magnitude of the algorithm's parameters (weights, coefficients). Mathematically, regularization can be described as adding a factor to the loss function that relates to the magnitude of the algorithm's parameters, for example, the sum of squares of the parameters. A large value of this factor indicates that the model is becoming too complex and is about to overfit the training data.

This explanation might not be accessible to learners with an insufficient mathematical background. Therefore, the analogies used to explain the overfitting and underfitting phenomena may be useful in explaining the regularization concept as well. For example, in the case of the irrigation system (Fig. 14.6), suppose the training algorithm needs to consider not only the plant locations but also the length of the irrigation route, which affects the fuel consumption and the time required to irrigate the entire field. In this case, shorter routes will be preferred, and overfitting will be avoided.
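The effect of such a penalty factor can be demonstrated numerically. In the following sketch (the data, the λ value, and the names are ours), adding an L2 term to the MSE gradient shrinks the learned coefficient:

```python
# Sketch: an L2 penalty lam * w**2 added to the MSE loss shrinks the coefficient.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 50)
y = 2.0 * x + rng.normal(0, 0.1, 50)   # data from y = 2x plus noise

def fit(lam, steps=3000, lr=0.1):
    """Gradient descent on MSE + lam * w**2 for a single weight."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * np.mean((w * x - y) * x) + 2 * lam * w   # MSE gradient + penalty gradient
        w -= lr * grad
    return w

w_plain = fit(lam=0.0)   # no regularization: recovers roughly w ≈ 2
w_reg = fit(lam=1.0)     # regularized: a smaller-magnitude weight is preferred
print(w_plain, w_reg)
```

Comparing the two weights shows students the trade-off directly: the regularized model accepts a slightly worse fit on the training data in exchange for smaller, less overfitting-prone parameters.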


Exercise 14.8 Regularization analogies

Continue developing the analogies you suggested in Exercise 14.7. Propose how regularization can be applied in the domains of the analogies to prevent overfitting.

14.11 Conclusion

In this chapter, we saw that while several core concepts of ML algorithms, such as bias and variance, can be taught as mathematical concepts, not necessarily in the context of a specific application domain, other core concepts of ML algorithms draw their meaning from the application domain, for example, performance indicators and the loss function. It is therefore essential to present these core concepts of ML algorithms in the interdisciplinary context of real-world problems to support their meaningful understanding.

References

Casscells, W., Schoenberger, A., & Graboys, T. B. (1978). Interpretation by physicians of clinical laboratory results. New England Journal of Medicine, 299(18), 999–1001. https://doi.org/10.1056/NEJM197811022991808
Cosmides, L., & Tooby, J. (1996). Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty. Cognition, 58(1), 1–73.
Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78–87.
Hazzan, O., & Mike, K. (2022). The base-rate neglect cognitive bias in data science. https://cacm.acm.org/blogs/blog-cacm/262443-the-base-rate-neglect-cognitive-bias-in-data-science/fulltext
Hu, X., Chu, L., Pei, J., Liu, W., & Bian, J. (2021). Model complexity of deep learning: A survey. Knowledge and Information Systems, 63(10), 2585–2619.
Kahneman, D. (2002). Maps of bounded rationality: A perspective on intuitive judgment and choice. Nobel Prize lecture, December 8. Retrieved December 21, 2007.
Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80(4), 237–251. https://doi.org/10.1037/h0034747
Koehler, J. J. (1996). The base rate fallacy reconsidered: Descriptive, normative, and methodological challenges. Behavioral and Brain Sciences, 19(1), 1–17. https://doi.org/10.1017/S0140525X00041157
Leron, U., & Hazzan, O. (2009). Intuitive vs analytical thinking: Four perspectives. Educational Studies in Mathematics, 71(3), 263–278.
Shalala, R., Amir, O., & Roll, I. (2021). Towards asynchronous data science invention activities at scale. In Proceedings of the 14th international conference on computer-supported collaborative learning (CSCL 2021).



Sulmont, E., Patitsas, E., & Cooperstock, J. R. (2019a). Can you teach me to machine learn? In Proceedings of the 50th ACM technical symposium on computer science education, pp. 948–954. https://doi.org/10.1145/3287324.3287392
Sulmont, E., Patitsas, E., & Cooperstock, J. R. (2019b). What is hard about teaching machine learning to non-majors? Insights from classifying instructors' learning goals. ACM Transactions on Computing Education, 19(4), 1–16. https://doi.org/10.1145/3336124

Chapter 15

Machine Learning Algorithms

Abstract In this chapter, we describe the teaching of several machine learning (ML) algorithms that are commonly taught in introductory ML courses, and analyze them from a pedagogical perspective. The algorithms we discuss are K-nearest neighbors (KNN) (Sect. 15.2), decision trees (Sect. 15.3), the perceptron (Sect. 15.4), linear regression (Sect. 15.5), logistic regression (Sect. 15.6), and neural networks (Sect. 15.7). Finally, we discuss the interrelations between the interdisciplinarity of data science and the teaching of ML algorithms (Sect. 15.8).

15.1 Introduction

Machine learning (ML) is considered a field of research that lies at the intersection of computer science and statistics, and has developed over the last 50 years as part of the growing discipline of artificial intelligence and intelligent systems. Several statistical tools, such as linear regression and logistic regression, are nowadays considered ML algorithms, as they can be described as algorithms that learn from examples. In this chapter, we discuss ML algorithms that originate in computer science as well as ML algorithms that originate in statistics, and present each of them from an educational perspective.

A couple of notes on this chapter:

• As mentioned in the introduction to this guide (Chap. 1), this guide does not aim to teach data science; rather, it addresses data science from a pedagogical perspective. Nevertheless, we note that this chapter does require an understanding of the following ML algorithms: K-nearest neighbors (KNN), decision trees, the perceptron, linear regression, logistic regression, and neural networks.
• Since a significant portion of this chapter is dedicated to the teaching of algorithms, it can also be used by computer science educators.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_15



15 Machine Learning Algorithms

Table 15.1 The KNN algorithm

To classify a new object, U, with features u1, u2, …, un:
(a) Calculate the distance of U from each sample X(i) in the training dataset
(b) Find the K nearest neighbors of U
(c) Find the most common label among these K nearest neighbors
(d) Classify U according to this most common label
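Table 15.1 can be translated almost line by line into code. The following Python sketch (the toy data and names are ours) implements steps (a) through (d) using only basic operations; math.dist computes the Euclidean distance:

```python
# A direct implementation of the KNN classification steps of Table 15.1.
from collections import Counter
from math import dist

def knn_classify(u, samples, labels, k):
    # (a) calculate the distance of U from each sample in the training dataset
    distances = [dist(u, x) for x in samples]
    # (b) find the K nearest neighbors of U
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # (c) find the most common label among these K nearest neighbors
    votes = Counter(labels[i] for i in nearest)
    # (d) classify U according to this most common label
    return votes.most_common(1)[0][0]

samples = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.8)]
labels = ["versicolor", "versicolor", "virginica", "virginica"]
print(knn_classify((1.1, 1.0), samples, labels, k=3))  # → versicolor
```

Nothing beyond subtraction, squaring, square roots, and counting is needed, which is exactly the pedagogical appeal discussed below.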

15.2 K-nearest Neighbors [1]

The K-nearest neighbors (KNN) algorithm is a relatively simple and intuitive classifier that uses the classification algorithm presented in Table 15.1 and is explained below (Hazzan & Mike, 2022). The distance between a new object and its neighbors is calculated using vector distances. Although any of several distance measures may be selected, from a pedagogical perspective, the Euclidean distance is preferred, as it is more intuitive to explain using its geometrical interpretation. Moreover, for classification problems of objects with two features, i.e., the two-dimensional case, the Euclidean distance can be calculated using the Pythagorean theorem (Eq. 15.1):

d^(i) = √( (u1 − x1^(i))² + (u2 − x2^(i))² )    (15.1)

To generalize the algorithm to the n-dimensional case, we present the vector distance between the unknown object, U, and the ith sample, X^(i), as a generalization of the Pythagorean theorem (Eq. 15.2):

d^(i) = √( (u1 − x1^(i))² + (u2 − x2^(i))² + ··· + (un − xn^(i))² )    (15.2)

The KNN algorithm requires an understanding of only basic arithmetic operations, such as addition, subtraction, squaring, square roots, and the comparison of numbers. Thus, with respect to the object-process duality theory, it can be said that the KNN algorithm requires an understanding of mathematical operators that are commonly understood as objects by most high-school pupils and undergraduate students in any subject (see Sect. "The Process-Object Duality Theory" for a discussion of this theory).

Despite the simplicity of the KNN algorithm, it may be used to intuitively explain many core ML concepts, such as classification, training and testing, model performance metrics, hyperparameters, underfitting and overfitting, and the data science workflow.

[1] This section is based on Hazzan, O., & Mike, K. (2022). Teaching core principles of machine learning with a simple machine learning algorithm: The case of the KNN algorithm in a high school introduction to data science course. ACM Inroads, 13(1), 18–25. https://doi.org/10.1145/3514217


Fig. 15.1 The classification problem of Iris virginica versus Iris versicolor

To demonstrate some of these core ML concepts, we use the KNN algorithm and the classic Iris dataset (Fisher, 1936) to differentiate between Iris virginica and Iris versicolor based on sepal width and petal width (see Fig. 15.1). The Iris dataset contains measurements of four flower features (sepal length, sepal width, petal length, and petal width) of 150 samples of iris flowers of three species: virginica, versicolor, and setosa.

To demonstrate hyperparameter tuning, we test the algorithm's accuracy for different Ks (see Fig. 15.2); it is evident that the accuracy is influenced by the K hyperparameter and that an optimal K can be selected based on this exploration.

Furthermore, the relationship between K and the accuracy can also be used to demonstrate the phenomena of underfitting and overfitting. For this purpose, we draw a graph in which the x-axis represents model complexity (represented by K) and the y-axis represents the classification error (see Fig. 15.3). As is evident from this graph, at low complexity levels (high Ks), both the training and the validation/test error rates are high. This region represents Ks at which the model complexity is too low to properly represent the data; in other words, this is the underfitting region. At high complexity levels (low Ks), on the other hand, there is a region in which the training error decreases while the validation/test error increases sharply. This is the overfitting region, in which the model complexity is too high to properly represent the data, since the model overfits the training data.
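Such an exploration of K can be scripted in a few lines. The following sketch uses scikit-learn's KNeighborsClassifier and its bundled Iris data (the K values and the train/test split are our own choices, not the book's):

```python
# Hyperparameter tuning sketch: test accuracy of KNN for several values of K
# on the versicolor-vs-virginica problem, using sepal width and petal width.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
mask = iris.target != 0                     # keep versicolor (1) and virginica (2)
X = iris.data[mask][:, [1, 3]]              # sepal width and petal width
y = iris.target[mask]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

accuracies = {}
for k in (1, 3, 5, 7, 9, 15, 25):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies[k] = clf.score(X_test, y_test)   # test accuracy for this K
print(accuracies)
```

Plotting the resulting dictionary reproduces the shape of Fig. 15.2 and lets students pick an optimal K themselves.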



Fig. 15.2 KNN hyperparameter tuning: accuracy versus K

Fig. 15.3 The underfitting and overfitting phenomena in KNN


Exercise 15.1 K as a hyperparameter

According to Fig. 15.3, what K would you select for the KNN algorithm if you were to use it to classify Iris virginica and Iris versicolor flowers?

In Hazzan and Mike (2022), we suggest that the KNN learning algorithm be the first algorithm included in introductory data science courses. We use three perspectives to support our proposal: computational, cognitive, and pedagogical. We show that despite the simplicity of the KNN algorithm, it exposes novice data science learners to the main ideas of ML and poses interesting questions that address its core concepts. We also discuss how such an approach may eliminate barriers that new teachers might encounter both in learning the topic and in teaching it.

15.3 Decision Trees

Decision trees are popular classification algorithms that are taught in many data science programs (Biehler & Schulte, 2018; Gould et al., 2018; Heinemann et al., 2018; Mariescu-Istodor & Jormanainen, 2019; Sperling & Lickerman, 2012). A decision tree is a flowchart-like model in which each node represents a decision based on the value of a single feature. Figure 15.4 presents a simple decision tree model for the classification of Iris flowers as either versicolor or virginica, based on two features: petal width and sepal width (see Fig. 15.1).

Fig. 15.4 Decision tree model for the classification of Iris flowers

While the use of decision trees for classification is very simple and intuitive, the construction of the trees is not (Biehler & Fleischer, 2021). Indeed, since constructing an optimal decision tree is an NP-complete problem (Laurent & Rivest, 1976), in practice, decision trees are built using heuristics (the two heuristics most commonly used to select the order of the features in the tree are information gain and the Gini index). Both the explanation of why heuristic rules must be used in the model development and the heuristic rules themselves require mathematical knowledge that is beyond the level of many non-major data science students. As a result, model generation is commonly taught with a black-box approach (Delibasic et al., 2013). Delibasic and his colleagues found that 4th-year management students in a business course who learned the tree generation algorithm as a black box achieved algorithmic performance similar to that of students who learned it with a white-box approach. However, students who learned the algorithm with a white-box approach perceived their understanding of the algorithm as significantly higher than those who learned it with a black-box approach.
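Teaching can combine a black-box tree builder with a white-box inspection of the result. The sketch below (the feature choice and the depth limit are ours) uses scikit-learn to build a small tree for the versicolor-vs-virginica problem, similar in spirit to Fig. 15.4, and prints the learned rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
mask = iris.target != 0                 # keep versicolor (1) and virginica (2)
X = iris.data[mask][:, [1, 3]]          # sepal width and petal width
y = iris.target[mask]

# Black box: scikit-learn grows the tree (using the Gini heuristic by default)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# White box: the learned decision rules can be printed and discussed in class
rules = export_text(tree, feature_names=["sepal width", "petal width"])
print(rules)
print("training accuracy:", tree.score(X, y))
```

Even when the construction heuristics remain a black box, printing the rules lets students verify that the tree is a readable flowchart over the two features.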

15.4 Perceptron

The perceptron algorithm is not a useful algorithm in itself, compared with similar ML algorithms such as logistic regression (Sect. 15.6) or the support vector machine. It does, however, have several advantages that can be exploited for educational purposes:

• The perceptron algorithm has a long history compared with modern ML algorithms, and teaching it offers an opportunity to discuss the history of ML.
• The perceptron algorithm requires only the basic mathematical operations of addition and subtraction and can be explained to students with no prior knowledge of calculus.
• The training algorithm of the perceptron is similar to the iterative minimization of a loss function with the gradient descent algorithm and can, therefore, build students' intuition regarding an iterative search for a minimum loss.
• The perceptron algorithm is a linear separator and can be presented as part of an introduction to mathematically more complex algorithms, such as the support vector machine.
• The perceptron is similar to a single neuron in a neural network, and so understanding the perceptron algorithm is essential for understanding neural networks.

The perceptron model has two parameters: a weight vector, w, and a bias, b. Given a weight vector, w, and a bias, b, the perceptron predicts that a new unknown object, u, belongs to Class 1 using the algorithm presented in Table 15.2.

Table 15.2 Vector representation of the perceptron classification algorithm
If w · u + b ≥ 0: ŷ = 1
Else: ŷ = 0

Many ML learners have an insufficient background in linear algebra and are not familiar with the concept of vectors. For example, high school pupils may learn the concept of vectors as part of the physics curriculum and may be familiar with the polar representation of vectors and with vector operations such as addition or subtraction; however, they do not usually learn the cartesian representation of vectors, vectors with more than two dimensions, or the dot product operator. It is therefore important to remember that the perceptron algorithm can be represented using scalars as well (see Table 15.3).

Table 15.3 Non-vectorized representation of the perceptron classification algorithm
If w1·u1 + w2·u2 + b ≥ 0: ŷ = 1
Else: ŷ = 0

This representation of the two-dimensional case makes it possible to plot the separation line on a plane. For example, setting u1 as x and u2 as y, extracting y yields the formula of a line that pupils are usually familiar with (Eq. 15.3):

y = (−w1/w2)·x + (−b/w2)    (15.3)

Given a separating hyperplane between the classes (that is, in the two-dimensional case, the two classes are linearly separable), it can be proven that the learning algorithm (see Table 15.4) will converge after a finite number of steps (i.e., parameters w and b with no classification errors can be found). This algorithm, again, uses only simple mathematical operators (addition, subtraction, and multiplication) and thus requires only a basic mathematical background.

Table 15.4 The perceptron training algorithm
Set w = 0 and b = 0
While classification errors exist:
  For each example x(i) in the training dataset:
    Calculate ŷ(i), the predicted class for sample x(i)
    Compare the predicted ŷ(i) with the real label y(i)
    If y(i) ≠ ŷ(i), update w and b as follows:
      If y(i) is 1: w ← w + x(i); b ← b + 1
      If y(i) is 0: w ← w − x(i); b ← b − 1
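Tables 15.3 and 15.4 can be implemented directly. The following sketch (the toy data and names are ours) trains a two-feature perceptron on a linearly separable dataset, so the loop is guaranteed to terminate:

```python
# A direct implementation of the non-vectorized perceptron (Tables 15.3 and 15.4).
def predict(w, b, u):
    # Table 15.3: classify as 1 iff w1*u1 + w2*u2 + b >= 0
    return 1 if w[0] * u[0] + w[1] * u[1] + b >= 0 else 0

def train(samples, labels):
    # Table 15.4: start from zero and update only on classification mistakes
    w, b = [0.0, 0.0], 0.0
    while True:
        errors = 0
        for x, y in zip(samples, labels):
            if predict(w, b, x) != y:
                sign = 1 if y == 1 else -1       # add x for class 1, subtract for class 0
                w = [w[0] + sign * x[0], w[1] + sign * x[1]]
                b += sign
                errors += 1
        if errors == 0:                          # converged: no classification errors remain
            return w, b

samples = [(2.0, 1.0), (3.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [1, 1, 0, 0]
w, b = train(samples, labels)
print("learned parameters:", w, b)
```

Since the data is separable, the convergence theorem mentioned above applies, and students can verify that the final parameters classify every training sample correctly.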



15.5 Linear Regression

Linear regression offers several advantages from the educational point of view:

• It is the simplest regression algorithm.
• While most of the algorithms presented in this chapter are classifiers, which represent the most common type of algorithm found in many curricula, linear regression, as its name indicates, is a regression algorithm and therefore demonstrates a different type of ML algorithm.
• It is useful in many academic and industrial applications.
• It is a well-known algorithm, even among students without an extensive background in linear algebra and calculus.
• The linear regression weights can be found both analytically (by solving linear equations) and iteratively (using the gradient descent algorithm). Linear regression can, therefore, be used to demonstrate the application of the gradient descent algorithm in the context of a simple and familiar algorithm.
• The loss function in the case of linear regression is the mean of squared errors (MSE). This function is both intuitive and relatively easy to differentiate in order to develop the gradient descent equations.

For more advanced learners, polynomial regression, a popular enhancement of linear regression, can be used to demonstrate the concepts of underfitting and overfitting by fitting first-, second-, and third-order polynomial regressions to quadratic data, for example.
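The analytic-versus-iterative point can be demonstrated numerically. In the sketch below (the data and step size are ours), the normal-equation solution and a gradient-descent run converge to the same weights:

```python
# Linear regression solved two ways: analytically and by gradient descent.
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(80), rng.uniform(0, 2, 80)])   # design matrix [1, x]
y = X @ np.array([1.0, 3.0]) + rng.normal(0, 0.05, 80)      # data from y = 1 + 3x plus noise

# Analytic: solve the normal equations (X^T X) w = X^T y
w_analytic = np.linalg.solve(X.T @ X, X.T @ y)

# Iterative: gradient descent on the MSE loss
w_gd = np.zeros(2)
for _ in range(5000):
    w_gd -= 0.1 * 2 * X.T @ (X @ w_gd - y) / len(y)

print(w_analytic, w_gd)   # the two solutions should (nearly) coincide
```

Seeing both routes reach the same answer reassures students that gradient descent is not a different model, only a different way of finding the same weights.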

15.6 Logistic Regression

Logistic regression is an ML classification algorithm that is similar to the perceptron algorithm. From an educational point of view, logistic regression has several advantages:

• It is a highly useful algorithm in many academic and industrial applications.
• It is a building block of neural networks.
• It is part of the curriculum of statistics courses in many social science study programs, and so many social science students are familiar with some aspects of this algorithm.

The logistic regression classification rule is similar to that of the perceptron, except that a logistic function, σ(z) = 1/(1 + e^(−z)), is applied to the weighted sum w · u + b. Table 15.5 presents the classification algorithm.

Table 15.5 Logistic regression classification algorithm
ŷ = σ(w · u + b)
If ŷ ≥ 0.5: classify as 1
Else: classify as 0

To find the logistic regression parameters, one must differentiate the logistic loss function (Table 15.6), which uses the logarithmic function.

Table 15.6 Logistic loss function
J(w, b) = (1/m) Σ_{i=1}^{m} [ −y^(i) · log(ŷ^(i)) − (1 − y^(i)) · log(1 − ŷ^(i)) ]

Students with an insufficient mathematical background may not be familiar with these mathematical concepts and may, therefore, find it difficult to understand the logistic loss function. Furthermore, the derivative of the logistic loss function is also complex, as it requires learners to apply the chain rule of derivatives and to calculate the derivatives of the log and sigmoid functions, which, again, are unfamiliar to students with an insufficient mathematical background. The final parameter update equation, which is based on the gradient descent algorithm, is surprisingly simple (see Table 15.7, step b).

Table 15.7 Logistic regression training algorithm
(a) Initialize random w
(b) Repeat the following update step E times:
    w = w − α · (1/m) Σ_{i=1}^{m} (ŷ^(i) − y^(i)) x^(i)
    b = b − α · (1/m) Σ_{i=1}^{m} (ŷ^(i) − y^(i))

It is worth noting that if the learning rate, α, is set equal to the number of training samples, m, then the logistic regression training algorithm (Table 15.7) is very similar to the perceptron training algorithm (Table 15.4).
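The update step in Table 15.7 can be implemented in a few lines. The sketch below (the data and hyperparameters are ours) trains a one-feature logistic regression on two Gaussian clusters:

```python
# Gradient-descent sketch of the logistic regression update rule (Table 15.7),
# for a single feature plus bias.
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])   # class 0 around -2, class 1 around +2

sigmoid = lambda z: 1 / (1 + np.exp(-z))

w, b, alpha = 0.0, 0.0, 0.5
for _ in range(2000):
    y_hat = sigmoid(w * x + b)                 # Table 15.5: prediction
    w -= alpha * np.mean((y_hat - y) * x)      # Table 15.7: update for w
    b -= alpha * np.mean(y_hat - y)            # Table 15.7: update for b

accuracy = np.mean((sigmoid(w * x + b) >= 0.5) == y)
print(f"w = {w:.2f}, b = {b:.2f}, training accuracy = {accuracy:.2f}")
```

The derivation of the update rule remains hidden here, but the rule itself, the difference between prediction and label times the input, is short enough to discuss even with students who cannot reproduce the calculus behind it.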

15.7 Neural Networks

A fully connected neural network can be represented as layers of logistic regression, with several possible variations of non-linear activation functions. Building on the students' knowledge of logistic regression, we present the concepts of a neuron, a layer of neurons, and, finally, multilayer networks. While a single neuron can be represented as a vector, a layer of neurons must be represented as a matrix, which might be an unfamiliar concept for students who lack a sufficient mathematical background.

Training a neural network introduces the concept of backpropagation. While it is possible to understand the training algorithm of logistic regression by understanding the gradient descent as a process, in order to understand the backpropagation algorithm, the gradient descent algorithm must be understood as an object, since the derivatives must flow backwards through the network. We recommend that the concept of gradient descent be taught with the linear regression algorithm or the logistic regression algorithm, since many learners are familiar with these algorithms.

15.8 Conclusion

In this chapter, we reviewed several ML algorithms from an educational perspective. We note that the field of ML developed at the intersection of computer science and statistics. Hence, some of the algorithms mentioned in this chapter were historically developed mainly in a computer science context (such as the perceptron), while others were developed mainly in a statistics context (such as the linear regression and logistic regression algorithms). Therefore, historically, different algorithms were taught in different courses. From an interdisciplinary point of view, data science courses provide the opportunity to teach ML algorithms in both contexts.

References

Biehler, R., & Fleischer, Y. (2021). Introducing students to machine learning with decision trees using CODAP and Jupyter Notebooks. Teaching Statistics, 43, S133–S142.
Biehler, R., & Schulte, C. (2018). Paderborn symposium on data science education at school level 2017: The collected extended abstracts. Universitätsbibliothek.
Delibasic, B., Vukicevic, M., & Jovanovic, M. (2013). White-box decision tree algorithms: A pilot study on perceived usefulness, perceived ease of use, and perceived understanding. International Journal of Engineering Education, 29(3), 674–687.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
Gould, R., Suyen, M.-M., James, M., Terri, J., & LeeAnn, T. (2018). Mobilize: A data science curriculum for 16-year-old students. Iase-Web.Org, 1–4.
Hazzan, O., & Mike, K. (2022). Teaching core principles of machine learning with a simple machine learning algorithm: The case of the KNN algorithm in a high school introduction to data science course. ACM Inroads, 13(1), 18–25.
Heinemann, B., Opel, S., Budde, L., Schulte, C., Frischemeier, D., Biehler, R., Podworny, S., & Wassong, T. (2018). Drafting a data science curriculum for secondary schools. In Proceedings of the 18th Koli Calling international conference on computing education research (Koli Calling '18), pp. 1–5. https://doi.org/10.1145/3279720.3279737
Laurent, H., & Rivest, R. L. (1976). Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1), 15–17.
Mariescu-Istodor, R., & Jormanainen, I. (2019). Machine learning for high school students. In Proceedings of the 19th Koli Calling international conference on computing education research, pp. 1–9.
Sperling, A., & Lickerman, D. (2012). Integrating AI and machine learning in software engineering course for high school students. In Proceedings of the 17th ACM annual conference on innovation and technology in computer science education, pp. 244–249.

Chapter 16

Teaching Methods for Machine Learning

Abstract In this chapter, we review four teaching methods for machine learning: visualization (Sect. 16.2), hands-on tasks (Sect. 16.3), programming tasks (Sect. 16.4), and project-based learning (Sect. 16.5). When relevant, as part of the presentation of these pedagogical tools, we analyze them from the perspective of the process-object duality theory and the reduction of abstraction theory.

16.1 Introduction

Many pedagogical strategies have been proposed for machine learning (ML) education, including problem-based learning, project-based learning, collaborative learning, active learning, inquiry-based learning, and design-oriented learning (Sanusi & Oyelere, 2020). In this chapter, we elaborate on four teaching methods for ML that can mitigate some of the challenges reviewed in Chaps. 13, 14 and 15. These methods are: visualization (Sect. 16.2), hands-on tasks (Sect. 16.3), programming tasks (Sect. 16.4), and project-based learning (Sect. 16.5). When relevant, as part of the presentation of these pedagogical tools, we analyze them from the perspective of the process-object duality theory and the reduction of abstraction theory. These theories are reviewed in Sect. 3.2.3.

As already mentioned (especially, but not only, in Chaps. 14 and 15), understanding the mathematics underlying the various algorithms requires an understanding of various mathematical concepts. Furthermore, to achieve a white-box understanding of ML algorithms, students must first gain an object understanding of the underlying mathematics. This assertion can guide curriculum design in the following ways:

• A curriculum should include at least one ML algorithm that requires only mathematical concepts with which the students are already familiar. The learning process of this algorithm will establish the students’ white-box understanding of the ML core principles and will support their learning of other, more complex, algorithms that are taught as a black box. An example of such a curriculum is presented in Hazzan and Mike (2022), where the algorithm used for this purpose was KNN.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_16


• The term cognitive load refers to the amount of information that a person’s working memory can hold (Sweller, 1988). According to cognitive load theory, since working memory has a limited capacity, instructional methods should avoid overloading it. Therefore, if some mathematical background must be completed as part of an ML course, it is recommended to first teach the mathematical concepts separately, out of the context of ML, in order to provide students with the opportunity to achieve an object understanding of these concepts. Otherwise, an excessively high cognitive load may prevent the students from conceptualizing the required mathematical concepts in the context of ML.

Several notes before embarking on this chapter:

• As mentioned in the introduction to this guide (Chap. 1), this guide does not aim to teach data science; rather, it addresses data science from a pedagogical perspective. Nevertheless, we note that this chapter does require an understanding of ML algorithms such as the K-nearest neighbors (KNN), decision trees, gradient descent, and neural networks.
• Since a significant portion of this chapter is dedicated to the teaching of algorithms, it can also be used by computer science educators.

16.2 Visualization

In Chap. 8, we introduce visualization as an important skill for data science researchers. Since it is a basic tool that supports the reduction of the level of abstraction, we now highlight it as a common practice used by ML teachers (Sulmont et al., 2019). Indeed, visualization can be used to demonstrate several aspects of ML algorithms, including:

• the model (see Fig. 15.4—decision tree, for example)
• data features, using histograms and scatter plots (see Fig. 15.1, for example)
• image data (see the perceptron worksheet presented in Table 16.2)
• classification areas
• separation lines
• algorithm performance with respect to hyperparameters (see Fig. 15.2, for example)
• bias and variance (see Fig. 14.3, for example)
• the loss function with respect to algorithm parameters (see https://playground.tensorflow.org, for example)
• the loss function with respect to learning cycles (learning curve) (see https://playground.tensorflow.org, for example)
• the algorithm performance (see Fig. 14.2—confusion matrix, for example).

Animations can also support the students’ understanding of dynamic processes of ML algorithms, particularly during the training phase. These may include animations of:

• the classification area
• the separation line
• the loss function
• the algorithm performance.

Many additional visual aids can be found on the internet; for example, see the TensorFlow playground at https://playground.tensorflow.org.
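To make the ideas of classification areas and separation lines concrete even without a graphics library, teachers can render the decision regions of a simple linear classifier as a character grid. The following Python sketch is our own illustration (it is not taken from this guide); the parameter values and grid dimensions are hypothetical and chosen only so that the separation line falls inside the plotted region.

```python
# A minimal sketch: render the decision regions of a linear classifier
# (w1*x1 + w2*x2 + b > 0) as an ASCII grid. Parameter values are
# hypothetical and serve only to illustrate the idea.

def classify(x1, x2, w1, w2, b):
    """Return 1 if the point falls on the positive side of the separation line."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def render_regions(w1, w2, b, size=10, step=28):
    """Render a size-by-size grid over the feature space.

    '#' marks the positive-class region, '.' the negative one; the border
    between the two character regions traces the separation line.
    """
    rows = []
    for i in range(size - 1, -1, -1):  # top row corresponds to the largest x2
        x2 = i * step
        rows.append("".join(
            "#" if classify(j * step, x2, w1, w2, b) else "."
            for j in range(size)
        ))
    return rows

if __name__ == "__main__":
    for row in render_regions(w1=0.4, w2=0.6, b=-100):
        print(row)
```

A grid like this can also serve as a bridge to richer tools: the same region function can be handed to a plotting library, or students can recolor the grid after each training step to animate how the separation line moves.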

16.3 Hands-On Tasks

In this section, we present four examples of hands-on tasks that assist in learning ML, in which the students do not simulate the ML algorithms using computer software, but rather execute all of the calculations themselves, manually (or using a calculator). The four hands-on tasks that we give are designed for four algorithms: the KNN algorithm (Sect. 16.3.1), the perceptron algorithm (Sect. 16.3.2), the gradient descent algorithm (Sect. 16.3.3), and neural networks (Sect. 16.3.4).

Building on the theory of mathematical ritual and exploration (see Chap. 3), the hands-on tasks require the students to execute manually all of the calculations performed in each step of the algorithm. Like first-grade pupils who learn how to calculate the difference between two numbers by practicing ritual calculations in mathematical exercises until the ritual evaluation turns into an exploration from which learning is derived, in the hands-on tasks presented here, students learn how to calculate the difference between two vectors, for example, by performing all of the mathematical operations manually.

Hands-on tasks similar to those presented in this section are common in statistics education. Hancock and Rummerfield (2020) examined the effect of hands-on simulations on the understanding of sampling distribution and found a significant positive effect of the hands-on activities on the students’ final exam grades. Pfaff and Weinberg (2009) examined the effectiveness of hands-on activities in an introductory college statistics course. Although these activities did not significantly affect the students’ grades, the students reacted positively to the hands-on modules and many listed them as the most interesting element of the course. Students’ responses to the question “Was the hands-on module beneficial for your learning?” showed a median score of 4 on a scale of 1–5.
Executing the hands-on tasks is somewhat similar to tracing computer programs in computer science education (Hazzan et al., 2015). Since the students perform the calculations using pen and paper rather than computer software, the tasks can be seen as a mathematical version of the computer program tracing task. These exercises focus solely on the mathematical concepts of the ML algorithms and, accordingly, are phrased using only mathematical symbols and equations rather than code or pseudo-code.


From the reduction of abstraction perspective, the hands-on tasks support students’ learning by reducing the level of abstraction in two ways: (a) A small training dataset that learners can iterate manually is more concrete than the huge datasets typically used to demonstrate ML algorithms. (b) The manual execution of a task is more concrete than running a simulation, such as a computer program, since attention must be paid to every detail, regardless of its role in the procedure.

16.3.1 Hands-On Task for the KNN Algorithm

In Sect. 15.2, the KNN algorithm is presented together with an illustration of how it can be used to demonstrate core concepts of machine learning. In this section, we present a hands-on task for the KNN algorithm (see Table 16.1) to support its gradual

Table 16.1 KNN worksheet

Part 1 – Image classification

A KNN algorithm is designed to classify images into two types – urban or forest – based on two features – red (the mean level of the red color in the image) and blue (the mean level of the blue color in the image). The following graph presents the training dataset images (urban images in gray and forest images in green). The graph also shows the test set, which comprises two unknown images, A and B. Classify images A and B using a KNN algorithm, with K=1 and K=5.


1. For K=1:
   a. What is A’s classification? Explain your answer.
   b. What is B’s classification? Explain your answer.
2. For K=5:
   a. What is A’s classification? Explain your answer.
   b. What is B’s classification? Explain your answer.
3. Compare your answers to Questions 1 and 2. What can you learn from this comparison?

Part 2 – Iris classification

A KNN algorithm is designed to classify Iris flowers into two classes: setosa and versicolor. Four features are given for each flower: sepal length (SL), sepal width (SW), petal length (PL), and petal width (PW). There are ten samples of flowers in the training set. A researcher found a new flower, U, with the following features: USL=5, USW=3, UPL=2, UPW=2.

1. Calculate the distance of the new flower, U, from each sample in the training set:

Sample | SL  | SW  | PL  | PW  | Label      | Distance, d(i)
1      | 5.1 | 3.5 | 1.4 | 0.2 | Setosa     |
2      | 4.9 | 3.0 | 1.4 | 0.2 | Setosa     |
3      | 4.7 | 3.2 | 1.3 | 0.2 | Setosa     |
4      | 4.6 | 3.1 | 1.5 | 0.2 | Setosa     |
5      | 5.0 | 3.6 | 1.4 | 0.2 | Setosa     |
6      | 7.0 | 3.2 | 4.7 | 1.4 | Versicolor |
7      | 6.4 | 3.2 | 4.5 | 1.5 | Versicolor |
8      | 6.9 | 3.1 | 4.9 | 1.5 | Versicolor |
9      | 5.5 | 2.3 | 4.0 | 1.3 | Versicolor |
10     | 6.5 | 2.8 | 4.6 | 1.5 | Versicolor |

2. Classify the new flower, U, using the KNN algorithm with K=1 and with K=5.


a. For K=1:
   i. The index of the closest training example is: ___
   ii. The label of the closest training example is: ___
   iii. U’s classification for K=1 is: ___
b. For K=5:
   i. The indexes of the 5 closest training examples are: ___
   ii. The labels of the 5 closest training examples are: ___
   iii. U’s classification for K=5 is: ___
3. Compare your answers to Questions a and b. What can you learn from this comparison?

mental construction by students.1 The first phase requires visual identification of the nearest neighbors in the two-dimensional case. It simulates the algorithm’s search for the K-nearest neighbors of a new instance and its classification according to the majority class. The given dataset was drawn in a way that enables students to observe the classification easily without actually calculating the distances. The second phase uses the Iris flowers dataset, which contains 150 examples of three types of Iris flowers, and four features for each flower: sepal length (SL), sepal width (SW), petal length (PL), and petal width (PW) (Fisher, 1936). Even with a small dataset of only 150 flowers, hands-on tasks might be too tedious for learners to execute manually, and so only a portion of the dataset is given in each task. In this phase, the students are asked to manually calculate the Euclidean distances in the four-dimensional case for the Iris flower dataset, to find the nearest neighbors, and to classify a new instance according to the majority class. Since the given dataset is small (10 samples of flowers), the students can perform the calculations manually without using a computer program. Refer also to Sect. 15.2 to see how the Iris flowers dataset is used in KNN learning processes.
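Teachers who wish to check the worksheet entries (or generate variants with other query flowers) can do so with a few lines of code. The following Python sketch is ours, not part of the worksheet: it computes the Euclidean distances from U to the ten training samples of Table 16.1 and classifies U by majority vote among the K nearest neighbors.

```python
import math
from collections import Counter

# Training set from Table 16.1, Part 2: ((SL, SW, PL, PW), label).
TRAINING_SET = [
    ((5.1, 3.5, 1.4, 0.2), "Setosa"),
    ((4.9, 3.0, 1.4, 0.2), "Setosa"),
    ((4.7, 3.2, 1.3, 0.2), "Setosa"),
    ((4.6, 3.1, 1.5, 0.2), "Setosa"),
    ((5.0, 3.6, 1.4, 0.2), "Setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "Versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "Versicolor"),
    ((5.5, 2.3, 4.0, 1.3), "Versicolor"),
    ((6.5, 2.8, 4.6, 1.5), "Versicolor"),
]

def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(sample, training_set, k):
    """Classify `sample` by majority vote among its k nearest neighbors."""
    neighbors = sorted(training_set, key=lambda item: distance(sample, item[0]))
    labels = [label for _, label in neighbors[:k]]
    return Counter(labels).most_common(1)[0][0]

U = (5, 3, 2, 2)  # the new flower from the worksheet
```

For U as given, both knn_classify(U, TRAINING_SET, 1) and knn_classify(U, TRAINING_SET, 5) return "Setosa", so students who complete the table correctly should find that the two values of K agree in this case.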

16.3.2 Hands-On Task for the Perceptron Algorithm

In Sect. 15.4, the perceptron algorithm is introduced as well as several pedagogical challenges teachers may face while teaching it. In this section, we present a hands-on task for the perceptron algorithm (see Table 16.2). The first phase requires the learners

1 This section is based on Mike and Hazzan (2022). Machine learning for non-major data science students: A white box approach, special issue on Research on Data Science Education, The Statistics Education Research Journal (SERJ) 21(2), Article 10. Reprint is allowed by SERJ journal’s copyright policy.

Table 16.2 Perceptron worksheet

Part 1 – The perceptron classification algorithm

The perceptron prediction is denoted ŷ and is calculated by:

ŷ = 1 if w1·x1 + w2·x2 + b > 0
ŷ = 0 otherwise

A perceptron algorithm was trained to classify images as urban or nature, where ŷ = 1 denotes the urban class and ŷ = 0 denotes the nature class. Two features were extracted for each image: x1, the mean red level of the image, and x2, the mean blue level of the image. The given perceptron parameters are: w1 = 0.4, w2 = 0.6, b = -100. Classify the following images using this perceptron.

Image   | x1 (red) | x2 (blue) | w1·x1 + w2·x2 + b | ŷ | Urban/Nature
(image) | 153      | 173       |                   |   |
(image) | 93       | 83        |                   |   |
(image) | 81       | 124       |                   |   |
(image) | 90       | 18        |                   |   |

Part 2 – The perceptron training algorithm

In this part, we will execute the perceptron training algorithm. The training dataset contains four example images and is presented in the following table:

Sample number | Image   | x1 (red) | x2 (blue) | y
1             | (image) | 46       | 22        | 0
2             | (image) | 84       | 67        | 0
3             | (image) | 94       | 129       | 1
4             | (image) | 107      | 118       | 1

to calculate the classification part of the perceptron algorithm, and the second part requires them to execute the model training process. In practice, the model training phase takes place before the classification phase; however, in this hands-on task, for educational purposes, their order is reversed: the students first practice the simpler classification algorithm, and then they practice the more complicated training algorithm. In Part 1, the prediction of the third sample is intentionally erroneous. This error may lead to a discussion regarding model errors and the need to quantify the model’s performance, as well as to a discussion regarding the number of required training steps and the meaning of convergence.


Complete the tracing table of the perceptron learning algorithm:

Step | Sample | w1 | w2  | b | x1  | x2  | y | w1·x1 + w2·x2 + b | ŷ | Correct? | New w1  | New w2    | New b
1    | 1      | 0  | 0   | 0 | 46  | 22  | 0 | 0·46+0·22+0 = 0   | 0 | Y        | 0       | 0         | 0
2    | 2      | 0  | 0   | 0 | 84  | 67  | 0 | 0·84+0·67+0 = 0   | 0 | Y        | 0       | 0         | 0
3    | 3      | 0  | 0   | 0 | 94  | 129 | 1 | 0·94+0·129+0 = 0  | 0 | N        | 0+94=94 | 0+129=129 | 0+1=1
4    | 4      | 94 | 129 | 1 | 107 | 118 | 1 |                   |   |          |         |           |
5    | 1      |    |     |   | 46  | 22  | 0 |                   |   |          |         |           |
6    | 2      |    |     |   | 84  | 67  | 0 |                   |   |          |         |           |
7    | 3      |    |     |   | 94  | 129 | 1 |                   |   |          |         |           |
8    | 4      |    |     |   | 107 | 118 | 1 |                   |   |          |         |           |
9    | 1      |    |     |   | 46  | 22  | 0 |                   |   |          |         |           |
10   | 2      |    |     |   | 84  | 67  | 0 |                   |   |          |         |           |
11   | 3      |    |     |   | 94  | 129 | 1 |                   |   |          |         |           |
12   | 4      |    |     |   | 107 | 118 | 1 |                   |   |          |         |           |
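The tracing table can be checked, or regenerated with new training data, by implementing the perceptron in a few lines of code. The following Python sketch is ours, not part of the worksheet; it uses the standard perceptron update rule with a learning rate of 1 (w ← w + (y − ŷ)·x, b ← b + (y − ŷ)), which reproduces the filled-in rows of the table.

```python
def predict(w1, w2, b, x1, x2):
    """Perceptron prediction: 1 if w1*x1 + w2*x2 + b > 0, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def train_step(w1, w2, b, x1, x2, y):
    """One perceptron learning step: weights change only on a misclassification."""
    error = y - predict(w1, w2, b, x1, x2)
    return w1 + error * x1, w2 + error * x2, b + error

# Part 1: classification with the given parameters (w1=0.4, w2=0.6, b=-100).
PART1_IMAGES = [(153, 173), (93, 83), (81, 124), (90, 18)]

# Part 2: training dataset (x1, x2, y), starting from w1 = w2 = b = 0.
TRAINING_DATA = [(46, 22, 0), (84, 67, 0), (94, 129, 1), (107, 118, 1)]

if __name__ == "__main__":
    for x1, x2 in PART1_IMAGES:
        print(x1, x2, predict(0.4, 0.6, -100, x1, x2))
    w1 = w2 = b = 0
    for x1, x2, y in TRAINING_DATA:
        w1, w2, b = train_step(w1, w2, b, x1, x2, y)
        print(w1, w2, b)
```

Running the training loop for additional passes over the dataset lets the class see when the weights stop changing, which connects directly to the worksheet’s discussion of the number of required training steps and the meaning of convergence.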

16.3.3 Hands-On Task for the Gradient Descent Algorithm

The hands-on task for the gradient descent algorithm is presented in Table 16.3. Gradient descent is the basis of many ML optimization algorithms and is, therefore, a core principle of ML. The basic parameter of the gradient descent algorithm, the learning rate α, is a hyperparameter of any ML algorithm whose learning is based on gradient descent. Therefore, this worksheet gives students the opportunity to try different values of α and to understand the importance of setting a suitable learning rate.


Table 16.3 The gradient descent algorithm worksheet

Given the function J(w) = w⁴ + 0.1·w³ + 0.2·w² + 0.3·w + 0.4.

1. Find the derivative dJ/dw.

2. Trace the gradient descent algorithm for α=0.01.

Step | w | dJ/dw
0    | 0 |
1    |   |
2    |   |
3    |   |
4    |   |

3. Trace the gradient descent algorithm for α=0.5.

Step | w | dJ/dw
0    | 0 |
1    |   |
2    |   |
3    |   |
4    |   |

4. Trace the gradient descent algorithm for α=1.

Step | w | dJ/dw
0    | 0 |
1    |   |
2    |   |
3    |   |
4    |   |

5. Trace the gradient descent algorithm for α=2.

Step | w | dJ/dw
0    | 0 |
1    |   |
2    |   |
3    |   |
4    |   |

6. What can you conclude from these four traces?
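The traces in Table 16.3 can also be generated automatically, which lets students experiment with additional values of α beyond the four in the worksheet. The following Python sketch is a possible implementation of ours (not the book’s reference solution); it traces gradient descent on J(w) = w⁴ + 0.1·w³ + 0.2·w² + 0.3·w + 0.4, whose derivative is dJ/dw = 4w³ + 0.3w² + 0.4w + 0.3.

```python
def dJ(w):
    """Derivative of J(w) = w**4 + 0.1*w**3 + 0.2*w**2 + 0.3*w + 0.4."""
    return 4 * w ** 3 + 0.3 * w ** 2 + 0.4 * w + 0.3

def gradient_descent_trace(alpha, steps, w0=0.0):
    """Return [w0, w1, ..., w_steps] produced by the update w <- w - alpha * dJ(w)."""
    trace = [w0]
    w = w0
    for _ in range(steps):
        w = w - alpha * dJ(w)
        trace.append(w)
    return trace

if __name__ == "__main__":
    for alpha in (0.01, 0.5, 1.0, 2.0):
        print(alpha, [round(w, 4) for w in gradient_descent_trace(alpha, 4)])
```

With α = 0.01 the iterates creep slowly toward the minimum, whereas with α = 2 their magnitude quickly blows up; contrasting these behaviors is exactly the conclusion that Question 6 of the worksheet aims at.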

16.3.4 Hands-On Task for Neural Networks

The neural network concept is presented in Sect. 15.7. The hands-on task for neural networks is presented in Table 16.4. This task is divided into two parts. First, the students are asked to calculate a single output of a very simple two-layer neural network. This task is designed to practice the forward propagation of data in the neural network as a collection of simple logistic regression neurons. The aim of the second part is to practice the vectorized representation of a neural network. In this part, the network is given as a collection of matrices and the students are required to reconstruct the network from its matrix representation.
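The forward propagation that students practice by hand can be expressed compactly in code, which also makes the matrix representation tangible. The following Python sketch shows a two-layer network of logistic (sigmoid) neurons; the weight matrices W1, b1, W2, b2 below are hypothetical placeholders of ours, not the actual numbers of Table 16.4.

```python
import math

def sigmoid(z):
    """Logistic activation, the building block of each neuron."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, weights, biases):
    """One fully connected layer: each row of `weights` is one neuron."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def forward(x, W1, b1, W2, b2):
    """Two-layer forward propagation: input -> hidden layer -> output neuron."""
    hidden = layer(x, W1, b1)
    return layer(hidden, W2, b2)[0]

# Hypothetical parameters: two inputs, two hidden neurons, one output neuron.
W1 = [[0.5, -0.5], [0.3, 0.8]]
b1 = [0.0, 0.0]
W2 = [[1.0, -1.0]]
b2 = [0.0]
```

Reconstructing the network from its matrix representation, the second part of the worksheet, then amounts to reading off the shapes: W1 here is 2×2, so the hidden layer has two neurons, each connected to two inputs, and the single row of W2 is the output neuron.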

16.4 Programming Tasks

Since ML algorithms require large datasets and a lot of computational power, programming tasks are inevitable in ML education. The learning of mathematical concepts can be supported by programming exercises (Leron & Dubinsky, 1995), but such exercises require an understanding of the mathematical concepts, at least as processes, as well as proficiency in programming. We therefore suggest that programming tasks come only after the learners execute visualization tasks or hands-on tasks (or both), thus enabling the students to construct mental images of the algorithms. Specifically, teachers should consider asking learners to work not only on “Write a program that…” tasks, but also on a variety of other kinds of programming tasks (for a complete review, see Hazzan et al., 2020, Chap. 12, Types of Questions in Computer Science Education). In the context of ML education, teachers can consider tasks such as the following:

• Solve the hands-on tasks (see Sect. 16.3) using programming, after executing them manually.
• Implement specific phases of an ML algorithm (if relevant) (see, for example, the exercise in Table 16.5 that demonstrates a task of programming the distance calculation phase of the KNN algorithm).
• Implement the training phase of an ML algorithm.
• Trace the prediction phases of an ML algorithm (if relevant).
• Trace the training phase of an ML algorithm.
• Program a visualization of an ML algorithm (for example, one of the visualizations mentioned in Sect. 16.2).
• Program an animation of the training of an ML algorithm (for example, one of the animations mentioned in Sect. 16.2).

Table 16.4 Neural networks worksheet

Table 16.5 Worksheet—programming tasks on the distance function

1. Write a function, distance2, that is given two vectors, A and B, with two features each—a1, a2, b1, and b2—and returns the distance between the two vectors
2. Write a function, distance3, that is given two vectors, A and B, with three features each—a1, a2, a3, b1, b2, b3—and returns the distance between the two vectors
3. Write a function, distance4, that is given two vectors (lists), each of length 4—A = [a1, a2, a3, a4] and B = [b1, b2, b3, b4]—and returns the distance between the two vectors
4. Write a function, distance, that is given two vectors (lists), each of length n, and returns the distance between the two vectors
5. Write a function, distance_np, that is given two vectors (np.array) of any length and returns the distance between the two vectors
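A possible solution path for this worksheet (a sketch of ours, not the book’s reference solution) shows how the generalization unfolds: the fixed-arity version of task 1 is written out explicitly, and the length-n version of task 4 replaces the explicit terms with a loop over coordinate pairs.

```python
import math

def distance2(a1, a2, b1, b2):
    """Task 1: distance between two 2-feature vectors, written out explicitly."""
    return math.sqrt((a1 - b1) ** 2 + (a2 - b2) ** 2)

def distance(a, b):
    """Task 4: distance between two vectors (lists) of any equal length n."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
```

Tasks 2 and 3 repeat the pattern of distance2 with three and four coordinates, and task 5 replaces the loop with vectorized NumPy operations (for example, np.sqrt(np.sum((a - b) ** 2)) or np.linalg.norm(a - b)), making the progression from ritual repetition to a general abstraction explicit.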

16.5 Project-Based Learning

Project-based learning (PBL) is a pedagogical tool that helps students gain knowledge and skills by developing solutions to real-life problems (Hadim & Esche, 2002). Projects are a common pedagogical method for teaching ML, especially nowadays, as more and more data are freely available on designated sites (Chow, 2019). We highlight two recommendations that should be considered when practicing PBL in the context of ML:

• We recommend that students choose a project based on their own interests in the application domain. We know that students tend to select datasets that are easy to find, but if students have no internal motivation to learn the application domain, the project will be a technical execution of library tools with only very little learning.
• Working on real projects with real data requires an understanding of the application domain. Without such understanding, students cannot evaluate whether or not the available data represent the real-world data, nor can they practice feature engineering or evaluate the ML algorithm’s performance (Mike et al., 2020). If students have no knowledge of the application domain, we recommend that they receive additional mentoring from a specialist in that domain.

16.6 Conclusion

In this chapter, we saw that although ML algorithms can indeed be taught without referring to a specific application domain context, when training and using these algorithms, the data must be interpreted in the context of the application domain; otherwise, performance issues as well as ethical issues will not be properly addressed. This is especially important when ML algorithms are taught in an interdisciplinary project-based learning environment.


References

Chow, W. (2019). A pedagogy that uses a Kaggle competition for teaching machine learning: An experience sharing. In 2019 IEEE international conference on engineering, technology and education (TALE), pp. 1–5.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
Hadim, H. A., & Esche, S. K. (2002). Enhancing the engineering curriculum through project-based learning. In Proceedings—Frontiers in education conference, vol. 2, pp. 1–6. https://doi.org/10.1109/fie.2002.1158200
Hancock, S. A., & Rummerfield, W. (2020). Simulation methods for teaching sampling distributions: Should hands-on activities precede the computer? Journal of Statistics Education, 28(1), 9–17. https://doi.org/10.1080/10691898.2020.1720551
Hazzan, O., Lapidot, T., & Ragonis, N. (2015). Guide to teaching computer science: An activity-based approach. Springer.
Hazzan, O., & Mike, K. (2022). Teaching core principles of machine learning with a simple machine learning algorithm: The case of the KNN algorithm in a high school introduction to data science course. ACM Inroads, 13(1), 18–25.
Hazzan, O., Ragonis, N., & Lapidot, T. (2020). Guide to teaching computer science: An activity-based approach.
Leron, U., & Dubinsky, E. (1995). An abstract algebra story. The American Mathematical Monthly, 102(3), 227–242. https://doi.org/10.1080/00029890.1995.11990563
Mike, K., & Hazzan, O. (2022). Machine learning for non-major data science students: A white box approach. Statistics Education Research Journal, 21(2), Article 10.
Mike, K., Nemirovsky-Rotman, S., & Hazzan, O. (2020). Interdisciplinary education—The case of biomedical signal processing. In 2020 IEEE global engineering education conference (EDUCON), pp. 339–343. https://doi.org/10.1109/EDUCON45650.2020.9125200
Pfaff, T. J., & Weinberg, A. (2009). Do hands-on activities increase student understanding?: A case study. Journal of Statistics Education, 17(3), 7. https://doi.org/10.1080/10691898.2009.11889536
Sanusi, I. T., & Oyelere, S. S. (2020). Pedagogies of machine learning in K-12 context. In 2020 IEEE frontiers in education conference (FIE), pp. 1–8.
Sulmont, E., Patitsas, E., & Cooperstock, J. R. (2019). What is hard about teaching machine learning to non-majors? Insights from classifying instructors’ learning goals. ACM Transactions on Computing Education, 19(4), 1–16. https://doi.org/10.1145/3336124
Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285.

Part V

Frameworks for Teaching Data Science

This part presents several frameworks for teaching data science to professionals whose core activities are management, education, or research and who need data science knowledge to foster their professional development and improve their professionalism. The chapters in this part are organized according to the MERge framework (Hazzan & Lis-Hacohen, 2016) for professional development, according to which disciplinary knowledge is required in order to manage, educate, and perform research. In other words, according to the MERge model, management, education, and research (MER) activities can be carried out meaningfully only when one is an expert in the domain within which these activities are carried out. In the case of data science, this means that expertise in the application domain from which the data is taken is required to manage initiatives, educate learners, and carry out research pertaining to data science content, tools, and methods.

Accordingly, we first present data science education for managers and policymakers (the M in MERge, Chap. 17). Then, we present data science education for teachers (the E in MERge, Chap. 18), and finally, we address data science education for researchers (the R in MERge, Chaps. 19 and 20).

This part includes the following chapters:

Chapter 17: Data Science for Managers and Policymakers
Chapter 18: Data Science Teacher Preparation: The “Methods of Teaching Data Science” Course
Chapter 19: Data Science for Social Science and Digital Humanities Research
Chapter 20: Data Science for Research on Human Aspects of Science and Engineering

Reference

Hazzan, O., & Lis-Hacohen, R. (2016). The MERge model for business development: The amalgamation of management, education and research. Springer.

Chapter 17

Data Science for Managers and Policymakers

Abstract In this chapter, we focus on the first component of the MERge model—management. In line with the MERge model as a professional development framework, we show how managers and policymakers (on all levels) can use data science in their decision-making processes. We describe a workshop for policymakers that focuses on the integration of data science into education systems for policy, governance, and operational purposes (Sect. 17.2). The messages conveyed in this chapter can be applied in other systems and organizations in all sectors—governmental (the first sector), for-profit organizations (the second sector), and non-profit organizations (the third sector). We conclude with an interdisciplinary perspective on data science for managers and policymakers (Sect. 17.3).

17.1 Introduction

Data is used extensively by managers and policymakers. Indeed, managers rely on data analysis in their decision-making processes, in a way that considers as many of the factors involved in the analyzed situation as possible. As it turns out, an organization cannot simply declare that it uses data science for decision-making processes; several preconditions must be fulfilled in order to enable an organization to perform meaningful data analysis, based on which intelligent decision-making processes take place. A data-oriented organizational culture is one of these preconditions (Díaz et al., 2018). Based on conversations with analytics leaders at companies from a wide range of industries and geographies, Díaz and his colleagues present seven takeaways derived from these conversations and from discussions with other executives. One of their conclusions is that “You can’t import data culture and you can’t impose it. Most of all, you can’t segregate it. You develop a data culture by moving beyond specialists and skunkworks, with the goal of achieving deep business engagement, creating employee pull, and cultivating a sense of purpose, so that data can support your operations instead of the other way around.” (Díaz et al., 2018, p. 2).

Managers and other role holders in any organization, who are involved in decision-making processes on a daily basis, should therefore be familiar with the basic principles of data science, the core data science terminology, and the characteristics of the


main kinds of machine learning algorithms, and be able to communicate meaningfully with data scientists and take advantage of data science in their decision-making processes. In this chapter, we present a workshop that we facilitated, which aimed to expose policymakers to the use of data science in education policymaking, governance, and operations (Sect. 17.2). We chose this workshop for several reasons:

• This workshop deals with education systems, which are the topic of this guide, and addresses policymakers (see also Sect. 7.9) and managers as users of data science products (see also Sect. 7.10).
• In the spirit of the MERge model (Hazzan & Lis-Hacohen, 2016), this workshop illustrates the relevance and importance of data science as a managerial tool.
• This workshop demonstrates that familiarity with the application domain—education systems—is crucial in order to understand how to use data science methods meaningfully.

Exercise 17.1 Decision making in governmental offices

In the Methods of Teaching Data Science course (see Chap. 18), we ask the students the following question: Give examples of government ministries whose employees should know data science.

The prospective teachers gave the following examples:

• Ministry of Economics
• Ministry of Environmental Protection
• Ministry of Health
• Ministry of Transportation
• Ministry of Education

(a) For each of these ministries, describe a situation in which policymakers in that ministry should use data science methods in their decision-making processes.
(b) In the continuation of this chapter, keep reflecting on how the ideas presented with respect to the education system can be used in the situations you proposed in (a).


Exercise 17.2 Data culture

Explore the term data culture. Use at least three resources that address this concept.

(a) Give five examples of organizations that promote a healthy data culture.
(b) List five characteristics of organizations that promote a healthy data culture.
(c) Describe five practices that employees in organizations that promote healthy data cultures should master.
(d) Describe five scenarios involving managers in organizations that promote a healthy data culture that illustrate the relevance of data science for their decision-making processes. For each scenario, specify the data science knowledge the managers should have and suggest how and what they can learn from this specific data science content.
(e) Explore the concept of exponential organizations (Ismail, S. (2014). Exponential organizations: Why new organizations are ten times better, faster, and cheaper than yours (and what to do about it). Diversion Books). How do exponential organizations promote data culture?

17.2 Workshop for Policymakers in National Education Systems

In this section, we describe a workshop for policymakers in education that we facilitated in February 2022. Participants included various academicians and a variety of role holders from the Israeli Ministry of Education. The workshop was entitled “Data science for education policymaking, governance, and operations” and the main question that was discussed was: How can data science be incorporated in education systems to improve their management, in general, and their policymaking processes, governance, and operation, in particular?

The workshop illustrates how policymakers can be exposed to the usefulness of data science without delving into the details of programming, mathematics, and statistics. Clearly, if they wish to apply this knowledge, they will have to delve into the details, learning what data science enables, what the opportunities and risks are, what ethical principles should be adhered to, and how to foster the data culture needed in order to use data science for meaningful decision-making processes.

256

17 Data Science for Managers and Policymakers

17.2.1 Workshop Rationale and Content

The rationale of the workshop was presented to the participants as follows: Although data science opens up endless opportunities for organizations that use the data they gather efficiently, ethically, and proactively, education systems tend to use data reactively. In other words, education organizations tend to examine and rely on data in order to learn about the past and make decisions based on this analysis, without exploring what the data can tell them about the future (Hazzan & Zelig, 2016). While education systems tend not to use data proactively, they do collect a lot of data, mainly on students’ and teachers’ functionality. This data, however, as mentioned, is not always used to navigate the education system towards its vision in a way that serves the needs of the individuals as well as the nation. A similar situation is described by Kreuter et al. (2019):

From education to health to criminal justice, government regulation and policy decisions have important effects on social and individual experiences. New data science tools applied to data created by government agencies have the potential to enhance these meaningful decisions. However, certain institutional barriers limit the realization of this potential. First, we need to provide systematic training of government employees in data analytics. Second, we need a careful rethinking of the rules and technical systems that protect data in order to expand access to linked individual-level data across agencies and jurisdictions, while maintaining privacy. (para. 1)

In this spirit, the discussion in the workshop focused on the role of data science in education systems. The following questions were addressed, as well as additional questions raised by the participants in a questionnaire they completed prior to the workshop.
• Current situation
– How does the Israeli Ministry of Education use data science in its operational activities?
– What are the barriers that prevent education systems from applying a proactive mode of operation?
– What challenges and gaps must be overcome to change this reactive mode of operation?
– What professional skills should decision makers in the education system learn in order to abandon the reactive mode of thinking and adopt a proactive state of mind and operation?
• The application of data science in education systems
– Can you find examples of the application of data science in the operational aspects of other countries’ education systems? If you can, what can you learn from their experiences? How do those usages of data science impact the education systems in those countries?
– Can learning analytics and educational data mining be used to improve education policymaking, governance, and operations? If yes—how? If not—why?


– What are the possible contributions of data science to education systems? What are the possible pros and cons of this use?
– How can it be ensured that data science neither replicates nor reinforces inequality among learners?
– Should data science be applied on the school level, the region level, or throughout the entire education system? What are the challenges in each case?
– Does an appropriate data science workflow for the assimilation of data science in education policymaking, governance, and operation processes exist? If one exists, describe it. If not, what characteristics should such a workflow have?
• Governance-related issues
– Is regulation required in order to introduce data science methods into education policymaking, governance, and operations processes? If yes, what aspects should it address? If not, why?
– What regulation on data collection and retention should be promoted and formulated to ensure students’ privacy in the present and in the future? To prevent inequality and ensure fairness?
– How can transparency in data science-based decision-making processes be fostered?
– How can awareness of privacy and information security be increased?
– What are the risks of giving access to public information? How can these risks be mitigated?
• Cross-sectorial collaborations (see e.g., Hazzan et al., 2021)
– What models of adoption of innovation are appropriate for the change process required in the education system to embrace a data culture? Kotter’s model (Kotter, 2012)? The collective impact model (Kania & Kramer, 2011)?
– What cross-sectorial relationships and collaborations are required for education policymaking, governance, and operations? What benefits does each sector gain from such cross-sectorial relationships and collaborations?
– What kind of collaboration processes between academia, the business sector, and the third sector can be created with respect to data science education?
– What is the role of academia in this change process?
We note that the workshop did not address the following two aspects of data science:
• The integration of data science for purposes of researching, changing, or improving teaching and learning processes. Specifically, the workshop did not examine questions such as: Can data science contribute to learning quality? Can data science promote independent learning?
• The contents of the mathematics, statistics, and computer science components of data science.


17.2.2 Workshop Schedule

The workshop was 1.5 h long and the schedule was as follows:
• Introduction—20 min
• Teamwork—30 min
• Team reports (3 min per team) and discussions—30 min
• Summary and wrap-up—10 min.

Specifically, following a brief introduction, which presented the messages described in Sect. 17.2.1, the participants were divided into teams of 5–8. Each team chose one of the activities presented in Table 17.1 and worked on it. Each team was also asked to choose a moderator, whose role was to document the discussion and present the team’s work to the entire workshop forum. The following instructions were given for the team presentations: Prepare a presentation of up to 3 min with two slides as follows:
• Slide 1—The discussed topic and main ideas discussed
• Slide 2—Three action items
Two teams chose to work on the first option (SWOT analysis) and two teams chose to work on the third option (Models of change processes). None of the groups chose to work on the second option (Experience of other countries).

Exercise 17.3 The roles of the workshop facilitator
As mentioned, two groups chose to work on the first option and two groups chose to work on the third option. None of the groups chose to work on Option 2: The integration of data science for policymaking in other countries’ education systems. As the workshop facilitator:
(a) How can this topic selection be explained?
(b) What can you conclude from these topic selections about the participants’ interest in the overall topic of the workshop?
(c) Would you present your conclusions to the workshop participants?
(d) In your opinion, should the group selections influence the workshop facilitation? If so, how? If not, why?

17.2.3 Group Work Products

Table 17.2 presents illustrative quotes from the two groups that performed SWOT analyses of the integration of data science in the Israeli education system. To simplify


Table 17.1 Topics for teamwork in the workshop for policymakers in the education system

Option 1: SWOT (strengths, weaknesses, opportunities, and threats) analysis of the Israeli education system
SWOT analysis is a strategic planning and management technique used to help a person or organization identify the strengths, weaknesses, opportunities, and threats related to that person’s or organization’s strategy or project planning. Specifically, a SWOT analysis of the Israeli education system with respect to the use of data science in its operational management may help navigate this process, if and when it is decided to systematically integrate data science in its operational aspect.
Perform a SWOT analysis of the integration of data science into the Israeli education system for policymaking, governance, and operation purposes.

Option 2: The integration of data science for policymaking in other countries’ education systems
Find countries that use data science for policymaking processes that are related to the operational aspect of their education systems.
– How is data science integrated in these education systems?
– What can be learned from the experience of these countries?
– Is data science integrated in these education systems on the school level, the region level, or the national level?
– What challenges and opportunities can you identify in each case?
Two examples are presented:
• Countries using data to inform education policies (Kelly, 2021): This report explains how data can be transformed into policy so that children everywhere, even the most disadvantaged, can realize their right to learn. The report describes a project that was implemented in countries like Lao PDR, Georgia, and Mongolia to achieve this goal.
• Learning analytics for school and system management (Ifenthaler, 2021): This document presents three case studies that provide insights into how educational organizations have been successful in adopting learning analytics and producing organizational benefits or overcoming organizational hurdles. The conclusion presents guidelines for policymakers, researchers, and education organizations adopting learning analytics. It ends with a set of open questions to be addressed in future research and practice.

Option 3: Models of change processes
– What models of adoption of innovation are suitable for the integration of data science in education systems?
– What kinds of relationships and cross-sectorial cooperation are required for data science-based policymaking, governance, and operation in education systems?
– What benefits can each sector gain from such cooperations? In particular, what is the role of academia in such processes?
Three models for the adoption of innovation are proposed for exploration:
– Kotter’s model (Kotter, 2012) of an eight-stage process for leading, not managing, a change
– Collective impact: A model for addressing complex social issues through cross-sector collaborations. In order to be considered collective impact, initiatives must meet five criteria: common agenda, shared measurement system, mutually reinforcing activities, continuous communication, and backbone organization. See examples and other resources in the Collective Impact Forum (2022)
– The cross-sectorial collaborative shared value strategy ((CS)2V for short): An approach that aims to increase the impact of the organization’s social investments. It directs corporations to connect their core business needs, from the corporate perspective, to society’s core needs, from the social and/or government perspective. This is done by addressing a social problem that intersects with a business concern (Hazzan et al., 2021)


Table 17.2 SWOT analysis of the integration of data science in the Israeli education system

Strengths
• Many teachers are open to innovation and initiate initiatives in their field work;
• Following the Covid-19 pandemic: Technology is used more extensively and enables easier data gathering
Action items:
• Organize the data, suggest ways in which the data can be used, and make the data accessible to professionals in the field;
• Set a standard for accessibility to technology for people with special needs

Weaknesses
• Potential misuse of data;
• Lack of skills for intelligent use of data by teachers and principals
Action items:
• Initiate an organizational change/give more power to the field;
• Address the potential data misuse problem

Opportunities
• Opportunity to perform a third-order change in the education system;
• Opportunity to create a single system that will act logically and in a transparent way
Action items:
• Data should be accessible to the public and not only to policymakers with interests;
• Data gathering for support and improvement (of the education system)

Threats
• Data can be shared with commercial companies;
• Privacy risks
Action items:
• Regulation of the data gathering by commercial companies and by the Ministry of Education

the presentation, we combine the work of the two groups. The quotes presented in Table 17.2 were translated from Hebrew without changing their message or evaluating their correctness.

Exercise 17.4 Culture
Read the report Why data culture matters, at: https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/McKinsey%20Analytics/Our%20Insights/Why%20data%20culture%20matters/Why-data-culture-matters.ashx
Taking a cultural perspective, analyze the SWOT analysis performed by the two groups in the workshop (presented in Table 17.2). You can use the following questions:
(a) Identify cultural characteristics and values of the Israeli education system mentioned in the SWOT analysis.
(b) In your opinion, does this culture promote or obstruct the integration of data science in the operational aspect of the Israeli education system?
(c) Do you need additional information about the Israeli education system to answer these questions? If so, what information do you need?


Exercise 17.5 Working on the group task
You are invited to work on each of the three options presented to the workshop participants (see Table 17.1). In Option 1, you are invited to perform a SWOT analysis of the education system in your country. Reflect on your work: What can policymakers learn from performing such tasks on the integration of data science in the educational system of their country, for policymaking purposes?

17.2.4 Workshop Wrap-Up

Following the group presentations, a discussion took place and the workshop was concluded. Two of the conclusions stated at the end of the workshop were:
• The discussion in the workshop demonstrated how data science can initiate cross-sectorial collaborations;
• The importance of the application domain:
– The discussion in the workshop is not limited to the education system;
– All government offices can facilitate a similar discussion, e.g., Ministry of Environmental Protection, Ministry of Transportation, and Ministry of Welfare.

Exercise 17.6 Data science in the service of government offices
Select three government ministries in your country.
(a) For each ministry, work on the three options presented in Table 17.1.
(b) In the spirit of the workshop described in this section, design a workshop for one of the government offices you discussed.

Exercise 17.7 A follow-up workshop for the “Data science for education policymaking, governance, and operations” workshop
Design a follow-up workshop for the “Data science for education policymaking, governance, and operations” workshop. Define the workshop goals, describe the main activities the participants will work on, lay out the main messages you would wish to convey in your workshop, and formulate general guidelines for its facilitation.


Exercise 17.8 The pedagogical chasm and policymaking
In Chap. 9, we discuss the pedagogical chasm and suggest that policymakers should always consider its existence in the development process of new data science curricula. In the spirit of the workshop described in this chapter, design a workshop for policymakers in the education system of your district, state, or country, that focuses on ways to bridge the pedagogical chasm of data science education with respect to different age groups: K–2, 3–6, and so on. At the end of the design process, reflect:
(a) What are the main guidelines you employed in the design process?
(b) If you were to facilitate the workshop, what outcomes would you expect to achieve?

17.3 Conclusion

This chapter focuses on uses of data science for decision-making processes. We present a workshop for policymakers in the Israeli education system that focuses on uses of data science for the operational aspects of education systems. We suggest that similar workshops be facilitated for policymakers in all kinds of organizations, and specifically, in other ministries and government agencies. The interdisciplinarity of data science is emphasized in this case study by highlighting the importance of application domain knowledge for understanding the context in which the data science applications are explored (in the case described in this chapter, education systems).

References

Collective Impact Forum. (2022). Collective Impact Forum. https://collectiveimpactforum.org/
Díaz, A., Rowshankish, K., & Saleh, T. (2018). Why data culture matters. McKinsey Quarterly, 3(1), 36–53.
Hazzan, O., & Lis-Hacohen, R. (2016). The MERge model for business development: The amalgamation of management, education and research. Springer.
Hazzan, O., Lis-Hacohen, R., Abrahams, B., & Waksman, M. (2021). The cross-sectorial collaborative shared value strategy. https://cacm.acm.org/blogs/blog-cacm/249800-the-cross-sectorial-collaborative-shared-value-strategy/fulltext
Hazzan, O., & Zelig, D. (2016). Adoption of innovation from the business sector by post-primary education organizations. Management in Education, 30(1), 19–28.


Ifenthaler, D. (2021). Learning analytics for school and system management. In OECD digital education outlook 2021: Pushing the frontiers with artificial intelligence, blockchain and robots (p. 161). OECD Publishing.
Ismail, S. (2014). Exponential organizations: Why new organizations are ten times better, faster, and cheaper than yours (and what to do about it). Diversion Books.
Kania, J., & Kramer, M. (2011). Collective impact. Stanford Social Innovation Review, 9(1), 36–41.
Kelly, P. (2021). Countries using data to inform education policies. Blog, Global Partnership for Education. https://www.globalpartnership.org/blog/countries-using-data-inform-education-policies
Kotter, J. P. (2012). Leading change. Harvard Business Review Press.
Kreuter, F., Ghani, R., & Lane, J. (2019). Change through data: A data analytics training program for government employees. Harvard Data Science Review, 1(2), 1–26.

Chapter 18

Data Science Teacher Preparation: The “Method for Teaching Data Science” Course

Abstract In this chapter, we focus on the second component of the MERge model, namely education. We present a detailed description of the Method for Teaching Data Science (MTDS) course that we designed and taught to prospective computer science teachers at our institution, the Technion—Israel Institute of Technology. Since our goal in this chapter is to encourage the implementation and teaching of the MTDS course in different frameworks, we provide the readership with as many details as possible about the course, including the course environment (Sect. 18.2), the course design (Sect. 18.3), the learning targets and structure of the course (Sect. 18.4), the grading policy and assignments (Sect. 18.5), teaching principles we employed in the course (Sect. 18.6), and a detailed description of two of the course lessons (Sect. 18.7). Full, detailed descriptions of all 13 course lessons are available on our Data Science Education website. We hope that this detailed presentation partially closes the pedagogical chasm presented in Chap. 9.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_18

18.1 Introduction

In Chap. 7, we discuss the wide variety of data science learners that exist; one relevant question that arises and should be addressed in this context is who will teach all of these varied populations? In Chap. 9, we discuss the pedagogical chasm, highlighting the importance of accompanying the implementation of a new curriculum with pedagogical tools and teaching material; in this context, one relevant question that arises and should be addressed is what frameworks are appropriate for preparing prospective and in-service teachers to teach data science? In this chapter, we describe one such framework, in which appropriate tools and teaching materials are imparted to prospective and in-service computer science teachers—the Methods of Teaching Data Science (MTDS) course. The MTDS course described in this chapter is a full-semester course that meets 3 h a week for 13 weeks. The description is based on our experience teaching the course at the Faculty of Education in Science and Technology at our institution, the Technion—Israel Institute of Technology, during the spring semester of 2021–2022. The course, which used Moodle as its online learning platform, was taught as a hybrid course, a combination of synchronous and asynchronous lessons: every week, two hours were taught synchronously via Zoom and one hour was taught asynchronously.

The prospective data science teachers who took the MTDS course were prospective high school computer science teachers with sufficient computer science and statistics knowledge, based on which we could help them close knowledge gaps in data science. This enabled us to focus in the course on the pedagogical facets of data science, paying special attention to its interdisciplinary nature and to the importance that should be attributed to the integration of the application domain in data science learning and teaching processes. Accordingly, the course includes data science topics (content knowledge) to close knowledge gaps in data science, alongside pedagogical content topics (pedagogical content knowledge) and technologies (technological knowledge) relevant for the teaching of data science (see also the TPACK framework in Chap. 9; Mishra & Koehler, 2006; Shulman, 1986).

The need for such a course emerged following the integration of a data science unit into the Israeli high school computer science curriculum in 2018 (presented in Sect. 7.3). That said, most of the teaching principles and pedagogical methods presented in this chapter can be applied to any framework in which educators are trained as data science teachers. Furthermore, many topics that we include in this guide can serve either as resources for teaching the course or as learning materials for students enrolled in the course. Our search for a similar, already existing full-semester course yielded no results. We could not find any courses that address the pedagogy of data science (as Sanusi et al. (2022) also indicated). And so, as far as we know, the course presented here is the first full course on methods of teaching data science ever to be taught.
In order to share with the readership as many details as possible, and to present an exhaustive and vivid description of the course, we documented the course and collected data on the teaching and learning processes that took place in the lessons, all in a very systematic manner. To this end, we implemented two methods: First, in parallel to the teaching of the course, we documented its development very carefully, including our design decisions, the course assignments, and what took place in class during each lesson. Second, we distributed a preliminary questionnaire prior to the onset of the course in order to learn about the students’ backgrounds, as well as a mid-semester questionnaire and an end-of-semester questionnaire. The result of this careful documentation is presented in this chapter, in which we describe the course environment (Sect. 18.2), the course design (Sect. 18.3), the learning targets and structure of the course (Sect. 18.4), the grading policy and assignments (Sect. 18.5), and the teaching principles we applied in the course (Sect. 18.6). Then, we delve into the details, describing the content of two of the course lessons (Sect. 18.7). Note: Table 18.3 presents the course schedule. Detailed descriptions of all 13 lessons are available on our Data Science Education website. These descriptions reflect our attempt to give as much detail as possible about what happened in the course, including the activities the students worked on and the discussions that took place in class, so as to impart the essence and atmosphere of the course.


Since we have so far taught the course only once, we believe that it will be further updated in the future. Nevertheless, we assume that the current description, together with the material presented in this guide, form a solid basis for designing MTDS courses according to the teaching environments in which they are to be taught.

18.2 The MTDS Course Environment

The MTDS course was taught at the Faculty of Education in Science and Technology at the Technion—Israel Institute of Technology. The Faculty of Education in Science and Technology offers teaching certificate programs in seven science and engineering tracks (mathematics, physics, chemistry, biology, computer science, electricity, and mechanics) as well as graduate studies (Masters and Doctorates) in all areas of science and engineering education. The course described in this chapter is part of the study program for a teaching certificate in computer science, and grants two credits out of the required 36. It was therefore assumed, as mentioned above, that students who enrolled in the course had the necessary computer science knowledge as well as the mathematical knowledge needed in order to learn basic data science algorithms as a white box (see Chap. 13). We further elaborate on their background in Sect. 18.3.

18.3 The MTDS Course Design

As mentioned above, the MTDS course presented in this chapter can be taught in a variety of frameworks with some necessary adjustments to local conditions. We suggest that such adjustments be based, among other things, on the perspective of the prospective data science teachers taking the course, whose understanding, perceptions, and opinions can be collected at different time points during the semester using questionnaires or in class discussions.

In our case, the course was attended by seven students—three women and four men. Three of the students were Technion graduates who had returned to the Technion to study for a teaching certificate as a second career, two were Technion undergraduates who were studying for their teaching certificate in parallel to their undergraduate studies (in software engineering and in information system engineering), one was enrolled in a full undergraduate program at the Faculty of Education in Science and Technology, and one was a software engineering graduate from another academic institution in Israel who had come to the Technion to study for a teaching certificate. Three of the students were teaching in parallel to their studies: one was teaching computer science, one was teaching mathematics, and one was participating in a program that introduces preservice teachers to the Israeli education system.

We illustrate how we used the prospective teachers’ responses to design the course. In the pre-course preliminary questionnaire, the students indicated two main reasons for joining the course: to become familiar with the Israeli high school data science study program (described in detail in Sect. 7.3) and to learn data science teaching methods and practices. None of the students had used data science before the course and they were, accordingly, most interested in studying machine learning and data visualization. They all indicated their proficiency and confidence in some programming language, mentioning Java (seven students), C (five students), C# (five students), C++ (four students), and Python (four students), among others. Results from this background questionnaire informed us that although the students’ data science knowledge and Python programming skills needed to be enhanced in the course, they did already have the necessary background to learn both the basic data science knowledge and basic Python programming by themselves, based on well-organized material that we would provide. As we shall see below, although the course assignments supported this self-learning process, we had to dedicate one of the mid-semester lessons to closing several knowledge gaps, both in basic data science concepts as well as in Python programming (see Sect. 18.7.3).

18.4 The Learning Targets and Structure of the MTDS Course

The course focused on the activities that comprise the data science workflow and on the pedagogical challenges encountered when teaching them (see Chap. 10). As mentioned above, in order to close the students’ knowledge gaps in data science and Python programming, the course included data science topics (content knowledge), pedagogical content topics (pedagogical content knowledge—PCK), and technological topics (technological knowledge) that are relevant for the teaching of data science (Mishra & Koehler, 2006; Shulman, 1986) (see also our presentation of the TPACK framework in Chap. 9). The course learning targets (LTs) are:
• LT1: The students will be familiar with the Israeli high school data science curriculum (see Sect. 7.3).
• LT2: The students will experience a full data science workflow in Python.
• LT3: The students will be familiar with learning theories, pedagogical approaches, and teaching methods suitable for data science education (see Chap. 10).
To accomplish these targets, the course is divided into three parts:
• Part A: The basics of data science and of data science teaching
• Part B: Challenges of data science teaching
• Part C (two lessons of the course): Student presentations of articles, reports, and curricula that deal with data science and their teaching
The final lesson of the semester is dedicated to a course summary. Here is the detailed course syllabus:


• Data analysis with Python¹
• The Israeli data science curriculum
• Teaching methods suitable for teaching data science on the high school level
• Relevant technologies for teaching and learning data science
• Challenges of teaching data science, e.g., the interdisciplinarity of data science (Chap. 6), and the variety of populations studying data science (Chap. 7)
• Teaching data science skills (Chap. 11).
In addition, each student was required to conduct a research project in data science that examines an educational topic (since the teachers’ application domain is education).
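The first syllabus item, data analysis with Python, can be illustrated with a short, self-contained sketch of the kind of descriptive statistics the course covers. This is our own minimal example using only Python's standard library; the dataset is invented for illustration and does not come from the course materials:

```python
import statistics

# Hypothetical dataset: final exam scores of a small class (invented values).
scores = [78, 85, 92, 67, 88, 73, 95, 81, 70, 84]

# Descriptive statistics of the kind covered in the course's opening lectures.
mean = statistics.mean(scores)
median = statistics.median(scores)
stdev = statistics.stdev(scores)  # sample standard deviation
low, high = min(scores), max(scores)

print(f"mean={mean:.1f}, median={median:.1f}, "
      f"stdev={stdev:.1f}, range=[{low}, {high}]")
```

In the course itself such computations were done with Pandas and Seaborn in the Colab environment; the standard-library version is shown here only to keep the example dependency-free.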

18.5 Grading Policy and Submissions

To earn the course credits and a grade, students had to meet the following requirements:
• Participate in all synchronous lessons.
• Submit all assignments on time. Late submissions were not accepted. All assignments were accessible from the onset of the semester and students were invited to start working on them from the beginning of the course (and even before). Table 18.1 presents the course assignments.
• Facilitate an asynchronous lesson. Each week, one student was responsible for teaching an asynchronous lesson and for reporting on it at the beginning of the following lesson (for up to 15 min). In the asynchronous lessons, all students worked on an activity that either the student or (in most cases) we chose. In addition to leading the task, the students were also requested to reflect on their experience teaching an asynchronous lesson. Table 18.2 presents the activities the students worked on during the asynchronous lessons.
• Complete the preliminary questionnaire prior to the first lesson of the course.

Exercise 18.1 Facilitation of an asynchronous activity
Choose three asynchronous activities from the tasks presented in Table 18.2. For each task, suggest how you would facilitate its execution in an asynchronous process with a class of up to 20 learners.

1 Students with no background in Python were strongly encouraged to close any knowledge gaps they had in Python prior to the onset of the semester, using any source they chose, particularly the videos available on the course website.


18 Data Science Teacher Preparation: The “Method for Teaching Data …

Table 18.1 The course assignments

Assignment 1 (15% of the grade; due: 2nd week of the course)
Watch pre-recorded lectures on data analysis with Python, on the following topics (available on the course website):
• Introduction to data science
• The data science workflow
• Colab environment
• Pandas
• Exploratory data analysis
• Introduction to machine learning
A. Write a personal reflection on the lectures/your learning process/learning and teaching in the flipped classroom format/challenges of teaching and learning data science
B. Suggest five topics that you would like to learn in the course and explain your suggestions
This assignment is limited to two pages.

Assignment 2 (15% of the grade; due: 3rd week of the course)
Watch the following lectures on KNN and machine learning (available on the course website):
• The KNN algorithm
• Performance indicators
• Basic data science concepts: Confusion matrix, overfit and underfit, model complexity versus generalization
• Descriptive statistics
• Python libraries: Seaborn
Read the following paper that demonstrates how to teach the basic concepts of data science with one simple algorithm, KNN: Hazzan, O., & Mike, K. (2022). Teaching core principles of machine learning with a simple machine learning algorithm: The case of the KNN algorithm in a high school introduction to data science course. ACM Inroads 13(1), 18–25. The article is available on the course website.
For each level of Bloom's cognitive taxonomy, develop two questions on the KNN algorithm (12 questions in total)2

Assignment 3 (15% of the grade; due: 4th week of the course)
Choose another machine learning algorithm and learn about it using available online resources. Answer the following questions:
(a) Why did you choose this algorithm? Which other algorithms did you consider?
(b) Describe the resource(s) you used to learn the algorithm you chose, and reflect on your learning process of the algorithm
(c) For each level of Bloom's cognitive taxonomy, develop two questions on the algorithm you chose (12 questions in total)

(continued)

2 The paper the students were asked to read (Hazzan & Mike, 2022) presents examples of questions corresponding to each level of Bloom's taxonomy.



Table 18.1 (continued)

Assignment 4 (Stage (a): 15%; Stage (b): 15%; Stages (c) & (d): 5%; due: Stage (a) 7th week, Stage (b) 9th week, Stages (c) & (d) 10th week of the semester)
A research project in data science education, to be submitted in four stages:
(a) Carry out a mini data science research project on any educational data you choose. Submit the research notebook containing the research description and the data analysis
(b) Record a video clip that describes your research and upload a link to the video clip to the designated forum on the course website. The video clip should not exceed 10 min
(c) Watch at least three of your classmates' video clips and give each of them feedback as a response to the forum message that presents the link to the video clip
(d) Write 10 insights you had while watching the video clips

Assignment 5 (20% of the grade; due: 11th and 12th weeks of the course)
Choose a paper/report/curriculum about data science education. Prepare a presentation on the paper/report/curriculum you chose. Present the paper/report/curriculum in the lesson (up to 15 min per presentation). Following the presentation, reflect on the learning process in the course, in general, and on the work process of this task, in particular. Submit the presentation, including a reflection slide at the end.

18.6 Teaching Principles of the MTDS Course

We implemented four main teaching principles while teaching the MTDS course. Although in our course we applied them in online teaching, they can all, with the exception of Principle 4, also be applied in frontal class settings.

1. Emphasize the interdisciplinarity of data science as one of the central themes of the course, in addition to data science pedagogy and teaching methods.
2. Use active learning (see Sect. "Active Learning") and groupwork as much as possible.
3. Integrate the course assignments (Table 18.1) into the asynchronous tasks (Table 18.2) and incorporate them both in the course content. For example:
   a. Asynchronous Task 4, in which the students were asked to prepare a presentation about computational thinking and statistical thinking, formed the basis for Lesson 6 (Sect. 18.7.1).
   b. Asynchronous Task 7 (in Table 18.2) is based on Assignment 3 (in Table 18.1), and based on these two tasks, our discussion in Lesson 11 (see our data science education website) focused on identifying which algorithms are suitable for high school pupils.



Table 18.2 List of asynchronous tasks3
Each week, one student leads the execution of an asynchronous task and presents it in the following lesson.

Task 1 (given: 1st week; presented: 2nd week)
Attend the Faculty Seminar in which Koby Mike (the course TA) will present his doctoral thesis on data science education (a link to a recorded lecture will be published later on the website). The assignment:
(a) From the topics presented in the lecture, choose three topics that you would like to focus on in the course and explain your choice
(b) Reflect on the process of leading an asynchronous lesson

Task 2 (given: 2nd week; presented: 3rd week)
(a) Is computer science an interdisciplinary field? Is the interdisciplinarity of computer science similar to the interdisciplinarity of data science? Explain your claims
(b) Propose ten exercises/assignments that emphasize the interdisciplinarity of computer science. The exercises/assignments should include at least four types and should be suitable for high school pupils
(c) Reflect on the process of leading an asynchronous lesson

Task 3 (given: 4th week; presented: 5th week)
(a) Use the Ment.io platform to discuss the following topics: 1. The "Questions Workshop" held during Lesson 4; 2. The use of the Ment.io platform for class discussions; 3. Data science ideas and concepts applied in Ment.io
(b) Reflect on the process of leading an asynchronous lesson

Task 4 (given: 5th week; presented: 6th week)
(a) Prepare a presentation of up to 20 min on computational thinking and statistical thinking. The presentation must include various multimedia means, such as video clips, games, texts, examples of questions, a list of references, etc.
(b) Reflect on the process of leading an asynchronous lesson

Task 5 (given: 6th week; presented: 7th week)
(a) Prepare a presentation according to Exercise 2.15 "Overview of the history of data science"
(b) Reflect on the process of leading an asynchronous lesson

Task 6 (given: 7th week; presented: 8th week)
(a) Prepare a presentation according to Exercise 12.4 "Comparisons of data science codes of ethics"
(b) Reflect on the process of leading an asynchronous lesson

(continued)

3 This table lists eight asynchronous tasks although only seven students completed the course; one student dropped the course after completing his asynchronous task.



Table 18.2 (continued)

Task 7 (given: 8th week; presented: 9th week)
(a) In Assignment 3 (see Table 18.1), four different machine learning algorithms were selected. Prepare a 40-min presentation that includes: 1. Presentation of the four algorithms; 2. The process of constructing the presentation, the pedagogical considerations that guided the organization of the presentation, etc.
(b) Reflect on the process of leading an asynchronous lesson

Task 8 (given: 11th week; presented: 12th week)
(a) Challenges and teaching methods involved in mentoring a machine learning/data analysis project in high school
(b) Reflect on the process of leading an asynchronous lesson

4. Use the online learning environment to illustrate teaching methods. Following are three demonstrations of how we applied Principle 4:
   a. The chat function: The chat function in Zoom (or in any other online platform used) makes it possible to solicit students' immediate feedback on different topics. The chat function has many advantages in such situations since it enables all participants to express themselves in parallel and, consequently, to be exposed to many opinions, rather than to a limited number of selected opinions voiced in the F2F format (Hazzan, 2020). See, for example, the chat discussion held in Lesson 6 (Sect. 18.7.1).
   b. Asynchronous lessons: Asynchronous lessons that all students must facilitate during the semester and present at the beginning of the following lesson (see Table 18.2).
   c. Online forums: Course forums that enable students to give feedback on their fellow students' course work (e.g., their projects). As for the teacher's feedback, we prefer that it not be given publicly, but rather privately, to each student individually.

Exercise 18.2 Topics to be included in a Methods of Teaching Data Science course
Before reading the description of the course lessons, suggest topics that you would include in a Methods of Teaching Data Science course.



18.7 Lesson Descriptions

Table 18.3 presents the course schedule. Following the table, we present a detailed description of two of the course's lessons, Lessons 6 and 7; full, detailed descriptions of all 13 lessons are available on our Data Science Education website. We also present students' responses to the mid-semester questionnaire (Sect. 18.7.2), which was distributed after the 6th lesson and guided us to change the preliminary plan of the course and to adjust it according to students' feedback.

Exercise 18.3 Interdisciplinarity of and in the MTDS course The interdisciplinarity of data science is one of the main teaching principles that was emphasized repeatedly throughout the course from different perspectives (see Sect. 18.6). While reading the following description of the two lessons, try to identify where the interdisciplinarity of data science is addressed.

18.7.1 Lesson 6

This lesson was dedicated to data science thinking (see Chap. 3). After the presentation of Asynchronous Task 4, in which the students were asked to prepare a presentation about computational thinking and statistical thinking (see Table 18.2), the lesson continued with elaborations and clarifications. Special attention was given to mathematical thinking and the process-object duality (see Sect. "The Process-Object Duality Theory"). The role of algorithms and data in data science was (re)addressed, as well as their respective relationships to computational thinking and statistical thinking.

The discussion on the process-object duality was based on:
• Students' sharing of the mathematical concepts they found difficult to understand during their mathematical studies at school and during their undergraduate studies. Among the concepts they mentioned were derivatives and proof by induction. With respect to these concepts, we discussed how process conception and object conception are expressed and what mathematical problems each conception enables us to solve.
• Students' answers to the following question, posed in a chat discussion: How would you explain to a friend what the KNN algorithm is? (See Q. 1 in Fig. 4: KNN comprehension questionnaire, Mike & Hazzan, 2022). Here are several illustrative answers:
  – Classify something into a category according to its proximity to other known objects.



Table 18.3 Course schedule

Lesson 1
• Introduction
• Reflection on the preliminary questionnaire
• Discussion: What is data science?
• Principles of data science teaching

Lesson 2
• Presentation of the asynchronous tasks (see Table 18.2)
• Data science and twenty-first century skills
• Learning environments for data science
• Interdisciplinarity, in general, and the interdisciplinarity of data science, in particular

Lesson 3
• Continuation of the discussion about the interdisciplinarity of data science:
  – The pedagogical challenges that this characteristic of data science presents (see Chap. 6)
  – Project-based learning (see Sect. "Project-Based Learning") and the interdisciplinarity challenge (see Chap. 6)
• Presentation of the asynchronous task that deals with the interdisciplinarity of computer science (see Table 18.2)

Lesson 4
• Question Workshop: Exposure to the Question Database and a discussion about:
  – Educational entrepreneurship and the many opportunities open to entrepreneur teachers who teach new topics such as data science
  – The data science teachers community and the role of teachers in the promotion of a learning community

Lesson 5
• The Israeli high school data science curriculum (see Chap. 7)
• The Israeli high school data science program development process (see Chap. 9):
  – Data science ideas highlighted in the program
  – Mentoring the development of the final project as part of the program (30 out of 90 h)

Lesson 6
• Data science thinking (see Chap. 3):
  – Presentation of the asynchronous task in which students were asked to prepare a presentation on computational thinking and statistical thinking (see Table 18.2)
  – Algorithms and data in data science and their relationships to computational thinking and statistical thinking
• Mid-semester questionnaire

Lesson 7
• Presentation of Asynchronous Task 5 on the history of data science (see Table 18.2 and Exercise 2.15, "Overview of the history of data science")
• The process-object conception of the KNN algorithm (see Sect. "The Process-Object Duality Theory")
• Python programming
• White box and black box understandings (see Chap. 13)

Lesson 8
• Presentation of Asynchronous Task 6 on ethics (see Table 18.2)
• Continuation of Lesson 6 on data science thinking (see Chap. 3)
• The pedagogical chasm in data science education (see Chap. 9)
• The Israeli high school data science curriculum for 11th grade AP computer science pupils (see Chap. 7)

Lesson 9
• Data science skills (see Chap. 11): Storytelling and the rhetoric triangle

(continued)



Table 18.3 (continued)

Lesson 10
• Panel of high school computer science teachers who teach the high school data science program

Lesson 11
• Presentation of Asynchronous Task 7 on different machine learning algorithms (see Table 18.2)
• Students' presentations of their papers/reports/curricula on data science education, Part 1 (three presentations) (see Assignment 5, Table 18.1)

Lesson 12
• Presentation of Asynchronous Task 8 on mentoring data science projects in the high school (see Table 18.2)
• Students' presentations of their papers/reports/curricula on data science education, Part 2 (four presentations) (see Assignment 5, Table 18.1)

Lesson 13
• Course summary

– If there were two different groups and you had to choose which group to join, how would you choose? Expected answer: The group whose members are most similar to me.
– The ability to classify an example according to the examples that are most similar to it.
– KNN is a machine learning algorithm.
– You tend to act like your neighbors.
– KNN is a particular way of classifying a particular datum or image according to other examples.
– Tell me who your neighbors are and I will tell you who you are.

Exercise 18.4 Classification of the KNN descriptions into process and object conceptions
(a) How would you explain to a friend what the KNN algorithm is?
(b) Sort the above descriptions of the KNN algorithm according to the conceptions they reflect: process conception or object conception (see Sect. "The Process-Object Duality Theory").
(c) Reflect: How did you decide which conception each quote reflects? What did you learn from this classification?
(d) Repeat this exercise with respect to other machine learning algorithms.

• Students' answers to the following question (see Q. 4 in Fig. 4: KNN comprehension questionnaire, Mike & Hazzan, 2022): In order to classify dogs as Poodle or Labrador, four characteristics were selected: height, weight, tail length, and ear length. The training set included 1000 dogs, 500 of each kind. Based on this dataset, we wish to classify an unknown dog using the KNN classifier.
  a. For K = 5: How many times is the square operation executed?
  b. For K = 11: How many times is the square operation executed?
  c. What conclusion can you draw from your answers to the above two questions?
  d. In your opinion, when are the chances of a correct classification higher?
     I. K = 5
     II. K = 11
     III. It is impossible to decide
     IV. I do not know
  e. Please explain your answer.

In this question, the students are asked to indicate, for K = 5 and for K = 11, how many times the square operator must be calculated in a specific classification problem using the KNN algorithm. Although the K values are different, the square operator must be calculated anew to find the Euclidean distance between the unknown instance and each instance in the training set. In other words, 4000 calculations are required in both cases (4 features × 1000 instances in the training set), regardless of the value of K. Similar to what usually happens in other courses we teach, the initial answers students gave were 1000 (ignoring the 4 features) and 2000 (which assumes the data is represented in a two-dimensional space). Following a class discussion, they understood the correct answer. This observation led us to the conclusion that we should dedicate a lesson to actual data science content rather than to its pedagogy. Accordingly, the next lesson (Lesson 7, Sect. 18.7.3) was dedicated to data science problem solving with Python.
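The reasoning behind the correct answer can be verified with a short simulation. The following sketch is our own illustration (it is not part of the course materials, and the dataset is randomly generated): it counts the square operations performed while classifying one instance with KNN, for different values of K.

```python
import random

def knn_classify(unknown, training_set, labels, k):
    """Classify `unknown` with KNN; return (label, number of square operations)."""
    square_ops = 0
    distances = []
    for instance, label in zip(training_set, labels):
        # Squared Euclidean distance: one square operation per feature
        d = 0.0
        for x, y in zip(unknown, instance):
            d += (x - y) ** 2
            square_ops += 1
        distances.append((d, label))
    distances.sort(key=lambda pair: pair[0])
    nearest = [label for _, label in distances[:k]]
    prediction = max(set(nearest), key=nearest.count)  # majority vote
    return prediction, square_ops

# 1000 dogs, 4 features each (height, weight, tail length, ear length)
random.seed(0)
training_set = [[random.random() for _ in range(4)] for _ in range(1000)]
labels = ["Poodle"] * 500 + ["Labrador"] * 500
unknown_dog = [0.5, 0.5, 0.5, 0.5]

for k in (5, 11):
    _, ops = knn_classify(unknown_dog, training_set, labels, k)
    print(k, ops)  # 4000 square operations for both values of K
```

The count depends only on the number of features and training instances (4 × 1000 = 4000); K affects only how many sorted neighbors vote, not how many distances are computed.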

18.7.2 Mid-Semester Questionnaire

Since the semester comprises 13 lessons, we decided to distribute a mid-semester questionnaire after six lessons. In what follows, we present students' answers to two questions that relate to the main ideas that were imparted up until that point in the course and to challenges they faced. Students' responses are translated from Hebrew. To simplify the presentation, we sorted the students' answers by topic.

• What are, in your opinion, the five main ideas of the course so far?
  – The interdisciplinarity of data science (mentioned as the primary idea by all students)
  – Familiarity with the high school data science curriculum
  – Gaining experience in carrying out research on data
  – The multi-faceted nature of data science:
    A discipline that changes the scientific paradigm (see Sect. 2.2)
    Data science from the research perspective (see Sect. 2.3)
    Understanding the data science paradigm



    Accurate definition of data science
    Correct implementation of data science principles
    A historical examination of data science
  – Cognitive perspective on data science:
    Computational thinking
    Statistical thinking
    Data science thinking defines a new kind of thinking
    Understanding data science concepts, principles, and algorithms as processes and not just as objects
    Becoming familiar with [learners'] difficulties.

Exercise 18.5 Analysis of students' responses to the mid-semester questionnaire
How would you use students' responses to the question "In your opinion, what are the main ideas imparted in the course so far?" if you were designing the continuation of the course? Explain your considerations.

• What are the three most significant challenges for you in the course so far?
  – Content:
    Python
    The complexity of the material
    Finding relevant materials on the internet (it is a new discipline and I feel that there is not enough teaching material available) (see Chap. 9)
    Its [the course's] pioneering nature
    1. Looking for data to practice with; 2. Data analysis using libraries I was not familiar with; 3. New material I am not familiar with, so it does not seem fair to me that I should teach it in class before I become an expert at it myself
  – Course load:
    In my opinion, the course load is very high relative to its two credits
    The course hours [no additional clarifications or explanations were provided for this answer]
  – Pedagogical approach:
    The course does not emphasize or focus on the material delivered to the pupils but rather hovers around it. The course contents are more pedagogical and less learning materials.



    The course is different from all other courses I have attended or know of (it is different in terms of how it is managed and the way the lessons are taught as discussions; I have never experienced that before).

Similar messages were also delivered in students' requests for topics they would want to study in the continuation of the course. They mentioned their wish to focus on (a) topics taught to high school pupils (rather than on theoretical, i.e., pedagogical, topics), such as additional algorithms, the Deep Learning curriculum (for AP pupils), and data collection; and (b) how to develop the needed thinking skills (e.g., statistical thinking) by themselves and how to develop the cognitive skills that their future students require for their data science studies.

In summary, the students' answer to the question "On a 1–5 scale, to what extent do the course contents meet your expectations of the course?" was, on average, 3.6 (four students ranked it at a 4 and two students ranked it at a 3). The answers to the challenges question, the topics the students wish to study in the continuation of the course, as well as the ranking question led us, first, to dedicate the next lesson (Lesson 7, Sect. 18.7.3) to concepts of data science, specifically to the KNN algorithm and to programming in Python, and, second, to include, in the last lesson of the course, a discussion about the students' expectations (data science topics, i.e., content knowledge) versus the actual content of the course (i.e., pedagogical content knowledge) (see the detailed description of Lesson 13 on our Data Science Education website).

18.7.3 Lesson 7

The lesson began with a presentation of Asynchronous Task 5 on the history of data science (see Exercise 2.15). The main conclusion of this presentation was that although the early sprouts of data science emerged in computer science and statistics, the actual development of data science took off when researchers and practitioners realized that the application domain is an important component of data science and cannot be overlooked. This observation, in fact, turned data science into an applicable discipline for many professionals and populations, rather than a discipline that is accessible only to computer scientists and statisticians.

Following our observation in Lesson 6 (Sect. 18.7.1) that students need to practice the algorithmic aspect of data science, and given the feedback they provided in the mid-semester questionnaire with respect to the challenges they faced and continued to face in the course and the topics they wished to study in the course (Sect. 18.7.2), we decided to dedicate this lesson to content knowledge and to lead the students through a process in which they would gradually construct their own mental representation of the KNN algorithm, from process conception to object conception. Therefore, in addition



to a discussion of the content knowledge and pedagogical content knowledge of data science (see Sect. 6.4.1), this lesson also included Python programming, based on our understanding that this group of students was also in need of some practical experience in Python. Specifically, the following steps were applied, in the order they are presented in Chap. 16: visualization (Sect. 16.2), hands-on activity (Sect. 16.3), and programming (Sect. 16.4). First, the students worked on a KNN worksheet that is a version of the worksheet presented in Table 16.1 and applies the visualization and hands-on activity teaching methods (Sects. 16.2 and 16.3, respectively). Then, as described in Sect. 16.4, the students worked on several programming tasks (see Table 16.5). As can be seen, the programming tasks presented in Table 16.5 guide the students from specific cases (length 2, 3 and 4) to the general case, and form the basis for a white-box understanding of the KNN algorithm.

At the end of the lesson, the students were asked to provide feedback on their experience in this lesson. They explained that the programming practice they gained in this lesson was important for their future teaching for two main reasons, referring to content and pedagogy, respectively: (a) the actual Python programming, and (b) the introduction to different kinds of tasks and to the order in which they should be implemented in class. As teachers of the course, we felt that the decision we made to focus on data science content and on Python programming for one lesson was correct. Furthermore, we realized that not only did we address the prospective teachers' request for additional focus on data science content knowledge, but the prospective teachers (as they indicated in their feedback) also learned some new pedagogical ideas and principles.
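The specific-to-general progression can be sketched as follows. This is our own illustration of the idea, not the actual tasks of Table 16.5: a distance function is first written for vectors of length 2, then generalized to any length, and finally used in a white-box KNN classifier.

```python
import math
from collections import Counter

# Specific case: Euclidean distance between two vectors of length 2
def distance2(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# General case: Euclidean distance between two vectors of any length
def distance(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# A white-box KNN classifier built on the general distance function
def knn(unknown, training_set, labels, k=5):
    nearest = sorted(zip(training_set, labels),
                     key=lambda pair: distance(unknown, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]  # majority label among the k neighbors

# The general function agrees with the specific one on length-2 vectors
print(distance2((0, 0), (3, 4)), distance((0, 0), (3, 4)))  # 5.0 5.0
print(knn((1, 1), [(0, 0), (0, 1), (5, 5), (6, 5)],
          ["A", "A", "B", "B"], k=3))  # A
```

Writing the classifier from such building blocks, rather than calling a library routine, is what makes the understanding "white box": every distance computation and vote is visible to the learner.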

18.8 Conclusion

This chapter presents the Methods of Teaching Data Science course. As the course is largely based on this guide, the interdisciplinary perspective of data science is highlighted in a way similar to the way it is highlighted in the different chapters of this book. Indeed, both in the mid-semester questionnaire (see Sect. 18.7.2) and in the end-of-semester questionnaire, in response to the question "What are, in your opinion, the five main ideas of the course?", the interdisciplinarity of data science appeared as the first idea in almost all responses. In addition, the students mentioned the interdisciplinarity of data science as one of their 10 insights after viewing their colleagues' video clips (see Table 18.1, Assignment 4(d)). Here are excerpts from the insights of two students: "The integration of the domains of knowledge was evident in all three clips, when sometimes they integrate even more than one application domain" and "The clips demonstrated nicely the multidisciplinarity of data science, which includes statistics, computer science and an application domain."



We conclude this chapter with two comments regarding the course design:
• The flexibility of the course design: As described in Sect. 18.7, there were cases in which we changed the course design based on student feedback. This flexible course development process could not have been avoided since, on the global level, the course is among the first courses in the world on data science teaching and, on the local level, it was the first time that we, personally, taught the course.
• The voice of prospective data science teachers: The students' voices and their understanding of the different topics discussed in the course reflect the relevance of the MERge model (Hazzan & Lis-Hacohen, 2016), which we implemented in the design of the course. As future data science teachers, the students should first be familiar with the data science content knowledge before they move on to learning the data science pedagogical content knowledge (PCK).

References

Hazzan, O. (2020). The advantages of teaching soft skills to CS undergrads online. https://cacm.acm.org/blogs/blog-cacm/245478-the-advantages-of-teaching-soft-skills-to-cs-undergrads-online/fulltext
Hazzan, O., & Lis-Hacohen, R. (2016). The MERge model for business development: The amalgamation of management, education and research. Springer.
Hazzan, O., & Mike, K. (2022). Teaching core principles of machine learning with a simple machine learning algorithm: The case of the KNN algorithm in a high school introduction to data science course. ACM Inroads, 13(1), 18–25. https://doi.org/10.1145/3514217
Ismail, S. (2014). Exponential organizations: Why new organizations are ten times better, faster, and cheaper than yours (and what to do about it). Diversion Books.
Mike, K., & Hazzan, O. (2022). Machine learning for non-major data science students: A white box approach. Statistics Education Research Journal, 21(2), Article 10.
Mishra, P., & Koehler, M. (2006). Technological pedagogical content knowledge: A framework for teacher knowledge. The Teachers College Record, 108(6), 1017–1054.
Sanusi, I. T., Oyelere, S. S., & Omidiora, J. O. (2022). Exploring teachers' preconceptions of teaching machine learning in high school: A preliminary insight from Africa. Computers and Education Open, 3, 100072.
Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational Researcher, 15(2), 4–14.

Chapter 19

Data Science for Social Science and Digital Humanities Research

Abstract In this chapter and in Chap. 20, we focus on the third component of the MERge model, research, and describe two data science teaching frameworks for researchers: this chapter addresses researchers in social science and digital humanities; Chap. 20 addresses researchers in science and engineering. Following a discussion of the relevance of data science for social science and digital humanities researchers (Sect. 19.2), we describe a data science bootcamp designed for researchers in those areas (Sect. 19.3). Then, we present the curriculum of a year-long specialization program in data science for graduate psychology students that was developed based on this bootcamp (Sect. 19.4). Finally, we discuss the data science teaching frameworks for researchers in social science and digital humanities from motivational perspectives (Sect. 19.5) and conclude by illuminating the importance of an interdisciplinary approach in designing data science curricula for application domain specialists (Sect. 19.6).

19.1 Introduction1

This chapter describes two data science teaching frameworks designed for researchers in social sciences and digital humanities: (a) a two-week bootcamp for researchers in social sciences and digital humanities from various disciplines, including both graduate students and faculty members, and (b) a year-long specialization program (4 h/week) for graduate psychology students. The common goal of all of the researchers who participated in these frameworks was to use data science as a research method for their own research in their own discipline.

We first describe the relevance of data science for social science and digital humanities researchers (Sect. 19.2). Then, we present the two teaching frameworks for these researchers mentioned above. The bootcamp is described in Sect. 19.3. Before we delve into a detailed presentation of its curriculum, we describe the applicants and participants of two bootcamps held in 2020, as the curriculum was designed to meet

1 This chapter is based on the following papers: © 2022 IEEE. Reprinted, with permission, from Mike et al. (2021). © 2022 IEEE. Reprinted, with permission, from Mike and Hazzan (2022).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_19




their goals and pedagogical needs (Sect. 19.3.1). Then we describe the actual learning goals and curriculum of the bootcamp (Sect. 19.3.2). The year-long specialization program for graduate psychology students that was developed based on this bootcamp is described in Sect. 19.4. This program is divided into two courses: Computer Science for Psychological Science (Sect. 19.4.1), which aims to fill the students’ gap in computer science knowledge, and Data Science for Psychological Science (Sect. 19.4.2), which aims to develop the students’ ability to complete the data analysis cycle, that is, to explore, analyze, model, and predict using data that they collected in the first course (in most cases, for their own research thesis). Next, we discuss the data science teaching frameworks for researchers in social science and digital humanities from a motivational perspective (Sect. 19.5). We conclude by highlighting the importance of an interdisciplinary approach in designing such teaching frameworks (Sect. 19.6).

19.2 Relevance of Data Science for Social Science and Digital Humanities Researchers

Humans as well as machines constantly create huge amounts of data. Humans create text, audio, photos, films, and other types of data and share them through a variety of social networks. For example, 326 million new users joined social media between April 2021 and April 2022, equating to growth of almost 900,000 new users every day (Digital Around the World, 2022). Machines generate and collect data about many facets of our lives: health, economy, education, transportation, meteorology, and so on. Wearable healthcare sensors, for example, can continuously collect data on pulse rate, blood pressure, and other vital signs. These data can be analyzed to monitor health status and predict sickness, both on an individual level and on a broader, research-related level. Accessible to all, regardless of socioeconomic status, location, and nationality, social media collect data on people's interests, preferences, and other human characteristics, which, in turn, can be used to predict changes in the social sphere and political trends. Whether created by humans or machines, data are relevant for researchers in the social sciences and digital humanities. Text is one key data type in which researchers from both the social sciences and digital humanities are interested. As data, texts attract considerable attention not only in academia but also in industry, since they represent the most prevalent form of data available for both academic research and commercial applications.


Exercise 19.1 Data science applications that require knowledge in social sciences and digital humanities

(a) Search the web for data science applications whose development required knowledge in the social sciences. What do these applications have in common? In what ways are they different?
(b) Search the web for data science applications whose development required knowledge in the digital humanities. What do these applications have in common? In what ways are they different?
(c) Are there similarities between the applications that require knowledge in the social sciences and the applications that require knowledge in the digital humanities? If yes—what are the similarities? If not—explain the differences between the two.

The increase in the demand for data science education has led to the opening of multiple academic courses, massive open online courses (MOOCs), and other programs that are targeted at a wide range of learners (in terms of age, occupation, and education level) and teach data science at different levels (Berman et al., 2018). In terms of adult education, at one end of the spectrum are academic programs for data scientists that include advanced mathematics, statistics, and computer science, aimed at both undergraduates and graduates (Anderson et al., 2014; Danyluk & Leidig, 2021; De Veaux et al., 2017; Demchenko et al., 2016). Undergraduate programs that integrate data science with another discipline are also available (Havill, 2019; Khuri et al., 2017; Plaue & Cook, 2015). At the other end of the spectrum are experts in various disciplines who need data science skills to improve their professional work and remain relevant in the information age. This chapter focuses on the population at the second end of the spectrum, that is, researchers in the social sciences and digital humanities.
It is important to educate social sciences and digital humanities researchers to understand data science given their role in decision-making processes in a variety of application domains. While data science is clearly essential for the social sciences, the need for data science in the digital humanities is less obvious. The Stanford Data Science Initiative explains: "Data science for humanity – not only for the academic study of the humanities and the social sciences, but also for the betterment of humanity itself – is a deeply interdisciplinary effort… This interdisciplinarity is twofold: data science itself has benefited from the complex, important, and consequential research questions focused both on the rich history of human cultures and societies, and on the present state of humankind – in both its triumphs and its troubles. Data science can help diagnose those troubles, and suggest solutions." (Data Science for Humanity | Data Science, 2022)

Another example is the course Data Science in the Humanities, offered by the University of Illinois’ School of Information Sciences. The relevance of this course for the humanities stems from the fact that “cultural materials usually come to analysts as unstructured texts, images, or sound files, forcing explicit decisions about


data modeling and feature extraction. Cultural questions also highlight the importance of interpreting statistical models in relation to a social context. Last but not least: songs, poems, and stories confront us with vivid problems." (Data Science in the Humanities, 2022).

Exercise 19.2 Data science job ads that require knowledge in social sciences and digital humanities

(a) Search the web for data science job offers that require knowledge in the social sciences. What do these job offers have in common? In what ways are they different?
(b) Search the web for data science job offers that require knowledge in the digital humanities. What do these job offers have in common? In what ways are they different?
(c) Are there similarities between job offers that require knowledge in the social sciences and job offers that require knowledge in the digital humanities? If yes—what are the similarities? If not—explain the differences between the two.

19.3 Data Science Bootcamps for Researchers in Social Sciences and Digital Humanities

In this section, we describe the data science bootcamp for social sciences and digital humanities researchers. Before we delve into its curriculum, we describe the applicants and participants of two bootcamps held in 2020, as the curriculum was designed to meet their goals and pedagogical needs (Sect. 19.3.1). We then describe the bootcamp learning goals and curriculum (Sect. 19.3.2).

19.3.1 Applicants and Participants of Two 2020 Bootcamps for Researchers in Social Sciences and Digital Humanities

One hundred and ninety-five potential participants completed a preliminary questionnaire. Figure 19.1 shows the numerous disciplines of the researchers who completed this questionnaire, and Fig. 19.2 shows their distribution according to academic rank. Of these applicants, 53 participated in one of two data science bootcamps: 29 participated in the February 2020 bootcamp and 24 in the September 2020 bootcamp.


Fig. 19.1 Bootcamp applicants’ research disciplines by gender

It was very interesting for us to discover that a significant majority (44 out of 53, 83%) of the participants in these bootcamps self-identified as women, although the base rate of women among social sciences and humanities graduates in Israel is approximately 66% (Mike et al., 2021). Therefore, we decided to analyze the bootcamps not only from a pedagogical perspective but also from a gender perspective (see Sects. 5.4 and 19.5.2). Thus, Figs. 19.1 and 19.2 show not only the numerous disciplines of the researchers who completed the questionnaire and their respective academic rank, but also their distribution by gender.

Fig. 19.2 Bootcamp applicants' academic rank by gender

To tailor the bootcamp to the researchers' needs, the preliminary questionnaire required the applicants to rate their current knowledge level in data science-related skills, such as programming and statistics (see Fig. 19.3), and asked them to rate their interest in studying different topics included in the bootcamp curriculum (see Fig. 19.4). The expertise level of applicants with respect to the disciplines that compose data science, i.e., mathematics, statistics, computer science, and their application domain, can be mapped on a two-axis diagram (see Fig. 19.5). As indicated in Fig. 19.5, the participants in our bootcamp were experts in their application domains. Indeed, their knowledge level in mathematics, statistics, and computer science varied, but was generally lower than their application domain expertise, and they were, therefore, located in the lower right part of the diagram.


Fig. 19.3 Participants’ pre-bootcamp knowledge in programming, statistics and machine learning (on a 1–5 scale, n = 38)

Fig. 19.4 Participants’ interest in the different bootcamp topics (on a 1–5 scale, n = 38)


Fig. 19.5 Applicants’ computer science and statistics knowledge vs. the domain knowledge

Exercise 19.3 Interviewing researchers in social sciences and digital humanities

Compose ten questions to be used in an interview with researchers in the social sciences and in the digital humanities. The purpose of the interview is to deepen your understanding of their motivation to learn data science (regardless of whether or not they have already studied any data science). Contact one researcher in the social sciences and one researcher in the digital humanities and interview them using the ten questions you composed. Analyze the interviews. What are your conclusions?

19.3.2 The Design and Curriculum of the Data Science for Social Science and Digital Humanities Researchers Bootcamp

Data science programs require extensive knowledge and skills in mathematics, statistics, and computer science (Cassel & Topi, 2015; Danyluk & Leidig, 2021; De Veaux et al., 2017; Demchenko et al., 2016; National Academies of Sciences et al., 2018). Such knowledge is crucial in order to understand and implement advanced data science and machine learning applications. While it is possible to teach and learn the principles of data science without computer science and programming knowledge by


using computerized tools such as Orange Data Mining, Weka, KNIME, and SPSS, such learning is limited. Therefore, in response to the interest that applicants expressed, we decided to teach programming as an integral part of the bootcamp.
Among the things we had to consider in designing our bootcamp were both what the researchers already knew about data science and what they needed to learn about data science that would be valuable for their research. In this spirit, the bootcamp was designed according to the following principles of instruction (Merrill, 2002):
• Learning is promoted when learners are engaged in solving real-world problems.
• Learning is promoted when existing knowledge is activated as a foundation for new knowledge.
• Learning is promoted when new knowledge is applied by the learner.
• Learning is promoted when new knowledge is integrated into the learner's world.
Table 19.1 presents the bootcamp curriculum (topics and hours) in detail. Based on our analysis of the answers to the preliminary questionnaire (Figs. 19.3 and 19.4), the following guidelines were applied in the bootcamp curriculum design:
• No previous programming knowledge was required.
• As the learners had no or little programming experience, we also taught computational thinking (see Sect. 3.2.1) and the basic principles of algorithmic thinking (Wing, 2006).
• While there are many data analysis tools that do not require programming skills, the researchers did expect to learn how to program; thus, we decided to teach them Python (see Fig. 19.4).

Table 19.1 Bootcamp topics and hours

Chapter                          Topics                                                      Hours
Introduction to data science     What is data science? The data science workflow             4
Introduction to web scraping     HTML, JavaScript, ethics                                    6
Computational thinking           Computational thinking, cognitive and social skills         4
Python programming               Data types, loops, conditions, file IO                      20
Web scraping in Python           Beautiful Soup, requests, API libraries, Selenium           12
Data analysis in Python          Pandas, Matplotlib, Seaborn, SK-learn                       12
Machine learning fundamentals    Classification of machine learning algorithms,              10
                                 machine learning project workflow, performance
                                 estimation and improving machine learning
                                 algorithms, hyperparameter tuning
Machine learning algorithms      KNN, SVM, artificial neural networks                        12
Text analysis                    Bag of words, TFIDF, word embedding                         10
Total                                                                                        90


• With respect to the data that was to be analyzed in the bootcamp, most applicants indicated that they were interested in collecting data from the web. Therefore, we taught Python with an emphasis on web scraping (see Table 19.1). • With respect to data analysis, most applicants indicated that they were interested in text analysis, and so the main theme of the data analysis topic was text analysis with machine learning (see Table 19.1).
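The kind of web-scraping exercise implied by this emphasis can be sketched as follows, using the requests and Beautiful Soup libraries listed in Table 19.1. This is a minimal illustration, not material from the bootcamp itself, and the URL and tag choices are hypothetical placeholders:

```python
# Minimal web-scraping sketch: download a page and extract headline text.
# The URL and the choice of <h2> as the "headline" tag are illustrative
# assumptions; a real exercise would target a specific site's structure.
import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    """Download raw HTML; raises an exception on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

def extract_headlines(html):
    """Return the stripped text of every <h2> element on the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# Hypothetical usage:
# headlines = extract_headlines(fetch_page("https://example.com/news"))
```

Separating fetching from parsing, as above, lets learners test the parsing logic on saved HTML files before dealing with live requests and site-specific quirks.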

19.4 Data Science for Psychological Sciences

Following the implementation of the bootcamp, we developed a data science specialization program for graduate psychology students at the School of Psychological Sciences at Tel Aviv University. The school's motivation to offer this specialization is threefold:
• To expose graduate psychology research students to new research methods that are currently emerging from the new discipline of data science.
• To expose graduate psychology research students, in general, and cognitive psychology research students, in particular, to state-of-the-art artificial intelligence and machine learning models that are inspired by the human brain and by human thinking.
• To address the market demand for data scientists with a confirmed background in the social sciences, in general, and in the psychological sciences, in particular.
In the design of the program, we took into consideration the fact that the graduate psychology students' understanding of the application domain of psychology and of the statistics component of data science is relatively high compared with their knowledge and understanding of computer science. Accordingly, the specialization was divided into two courses. The first course, Computer Science for Psychological Science (Sect. 19.4.1), was designed to close the computer science knowledge gap. The second course, Data Science for Psychological Science (Sect. 19.4.2), was designed to develop the students' ability to complete the data science workflow (see Sect. 2.5), that is, to explore, analyze, model, and predict using data that they collected in the first course. Both courses were designed in a flipped classroom format, with pre-recorded asynchronous lectures and weekly online Zoom meetings devoted to answering students' questions and working on advanced exercises in small groups (Rosenberg-Kima & Mike, 2020).


19.4.1 The Computer Science for Psychological Science Course

Introduction to computer science and programming courses, often called CS1 (Hertz, 2010), are designed to introduce students to the fundamentals of programming. CS1 courses are currently taught on two distinct levels. The higher-level CS1 courses are intended for computer science majors or non-majors in allied fields such as data science or other science, technology, engineering, and mathematics (STEM) disciplines (Bryant et al., 2010). These courses require a deep understanding of computational and algorithmic thinking and set a high bar for students. On the lower level, CS0 courses (Babbitt et al., 2019) or CS1 courses for non-majors (Bryant et al., 2010) aim to expose students to programming and to the essence of computer science; these courses do not expect students to master programming at a level that is appropriate for data science. CS1 courses for computer science majors are considered difficult and demanding, and suffer from an overload of learning goals and high failure rates (Becker & Fitzpatrick, 2019). CS1 courses can be even more difficult for students who are not majoring in computer science and allied disciplines, and so several curricular innovations and pedagogical methods have been proposed to support such students (Cortina, 2007; Forte & Guzdial, 2005; Guzdial, 2003; Sullivan, 2013). For example, Forte and Guzdial (2005) developed a media computation course, a CS1 course tailored for non-majors, in which "students learn how different media are encoded electronically and how to write basic code for manipulating digital images, audio, video, and text" (p. 249). A similar initiative was developed by Cortina (2007) for engineering students, in the form of an introductory computer science course for non-majors that "focuses on the major contributions of computer science from the perspective of the process of computation" (p. 218).
The course does not include programming in any programming language but, rather, focuses on algorithms and on the principles of computing and its history. Sullivan (2013) proposed a data-centric CS1 course, which teaches the basics of databases and data mining and offers an introduction to programming and data visualization.
The Computer Science for Psychological Science course was designed to close the computer science knowledge gap. Accordingly, three learning goals (LG) were defined:
• LG1: To develop students' computational thinking, which is considered an important twenty-first century skill and, specifically, an important thinking skill required for programming (Wing, 2006). See Sect. 3.2.1.
• LG2: To enable students to collect data for social psychological research using web scraping with Python.
• LG3: To enable students to collect data for cognitive psychological research by means of computerized experiments developed with JavaScript.
Following previous research that showed that integrating application domain data can enhance students' engagement with technical content (Wiberg, 2009), the


Table 19.2 Computer science for psychological science—topics and number of hours

Topic                                         Hours
Computational thinking                        4
Website design and HTML                       8
Website design and JavaScript                 8
Python programming                            16
Web scraping with Python                      8
Computerized experiments with JavaScript      8
Total                                         52

computer science content and programming tasks were linked to the interests of the psychology graduate students. Specifically, two topics that are relevant for social psychology students emerged in a focus group facilitated by the School of Psychological Sciences faculty members: (a) social network analysis and web scraping, and (b) experiments in computational cognition. Interviews with faculty members also revealed that, whereas the common programming environment for social network analysis and web scraping is Python, experiments in computational cognition are programmed in JavaScript. Website design with HTML constitutes a common body of knowledge required for both web scraping and the development of experiments. Table 19.2 presents the detailed course outline.
Although CS1 courses for majors have been the focus of a large body of research (Luxton-Reilly et al., 2018), research on CS1 for non-majors is practically nonexistent. In a recent survey, Becker and Quille (2019) mapped only 13 papers on CS1 courses for non-majors out of 777 papers on CS1 published by the SIGCSE community in the last 50 years. According to that literature review, the new course design is novel in four aspects:
• Whereas interdisciplinary CS1 courses for STEM majors exist (see Tartaro & Chosed, 2015, for example), it seems that there are no such interdisciplinary courses for non-STEM majors.
• Whereas all existing CS1 courses are intended for undergraduate students, this course is designed for graduate students.
• Whereas it may be challenging to develop an interdisciplinary course due to the need for an interdisciplinary staff (Way & Whidden, 2014), we overcame this challenge by fostering cooperation between the course staff (specifically the second author of this guide), who contributed the computer science expertise, and the graduate students themselves, who contributed the psychology expertise.
• Whereas the students in the 13 reviewed research works pertaining to non-major students used pre-prepared datasets in their learning activities, the students in our course collected their own data, since working with real-life data is known to motivate students in introductory computer science courses (see Mike & Hazzan, 2022). These self-collected data were also relevant for the students' thesis research.


19.4.2 The Data Science for Psychological Science Course

The Data Science for Psychological Science course was designed to develop the non-major students' ability to complete the data analysis workflow, that is, to explore, analyze, model, and predict using data (see Table 19.3). The course, therefore, focused on data analysis, including exploratory data analysis, descriptive statistics, statistical inference, and statistics and machine learning predictive models using Python. Since the psychology curriculum already contains extensive statistics content and the first course teaches the necessary background in programming, the second course elaborates on machine learning, which is located, in the data science Venn diagram, at the intersection of statistics and computer science (see Fig. 2.2). Seven machine learning algorithms are taught in the course:
• The K-nearest neighbors (KNN) and perceptron algorithms are taught mainly for their potential pedagogical value, as they convey the basic principles of machine learning using simple mathematics and can easily be taught applying the white-box approach. See Sects. 15.2, 15.4 and 16.3.
• Linear regression and logistic regression are a little more complex. Since the students are already familiar with the black-box implementation of these algorithms from their statistics background, these algorithms are taught in this course with a white-box approach, including all mathematical details. See Sects. 15.5 and 15.6.
• Support vector machines (SVM) and artificial neural networks (ANN) are taught with only partial mathematical details. See Sect. 15.7.
• K-means is taught as an example of unsupervised learning.
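The white-box approach can be illustrated by implementing KNN directly with elementary mathematics, in the spirit of the exposition described above. The following sketch is ours, not taken from the course materials, and the data points in the usage comment are invented for illustration:

```python
# White-box K-nearest neighbors: classification using nothing but
# Euclidean distance and a majority vote among the k closest points.
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by the majority label of its k nearest neighbors."""
    # Pair each training point's distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance and keep the labels of the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical usage: two well-separated clusters of 2D points.
# points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
# labels = ["a", "a", "a", "b", "b", "b"]
# knn_predict(points, labels, (0.5, 0.5), k=3)  -> "a"
```

Because every step is a visible, elementary operation (a distance, a sort, a vote), students can trace exactly how a prediction is produced, which is the point of teaching KNN white-box before moving to library implementations.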

Table 19.3 Data science for psychological science course—topics and number of hours

Topic                                  Hours
The data science workflow              4
Table manipulation with Python         4
Visualization with Python              4
Statistical inference with Python      4
Principles of machine learning         8
Supervised machine learning            16
Unsupervised machine learning          4
Text analysis                          8
Total                                  52


Exercise 19.4 Uses of supervised and unsupervised machine learning algorithms in social sciences applications

(a) Suggest three problems in social science applications that can be solved using supervised machine learning algorithms. Answer the following questions:
• What do these three problems have in common? In what ways are they different?
• How did you develop each of the problems?
(b) Suggest three problems in social science applications that can be solved using unsupervised machine learning algorithms. Answer the following questions:
• What do these three problems have in common? In what ways are they different?
• How did you develop each of the problems?
(c) What are your conclusions from (a) and (b)?

19.5 Data Science for Social Sciences and Digital Humanities, from a Motivation Theory Perspective

In this section, we illuminate the data science teaching frameworks for social science and digital humanities researchers from two perspectives. We first discuss the teaching frameworks from the perspective of the self-determination theory (Deci & Ryan, 2013) (Sect. 19.5.1), and then from a gender perspective (Sect. 19.5.2).

19.5.1 The Self-determination Theory

In this section, we present the motivation to learn data science using the self-determination theory (SDT) (Deci & Ryan, 2013). The self-determination theory is a broad framework for understanding factors that increase intrinsic motivation, autonomy, and psychological wellness. It emphasizes people's inherent motivational propensities for learning and growing. Research shows that in educational settings, intrinsic motivation results in positive outcomes (Reeve, 2002; Ryan & Deci, 2020). According to the SDT, human beings naturally seek growth and wellness, and the three primary needs that are particularly fundamental for supporting these processes are autonomy, competence, and relatedness. Autonomy concerns a sense of initiative and ownership of one's actions. It is strengthened by the learner's sense of interest


and value and is weakened by external control, by means of either rewards or punishments. Competence concerns the feeling of mastery. The need for competence can be fulfilled in educational settings by well-structured challenges, positive feedback, and opportunities for growth. Relatedness concerns a sense of belonging and connection. It is enabled by the conveyance of indications of respect and caring.
From a motivation perspective, the design of the bootcamp (Sect. 19.3) and the specialization program (Sect. 19.4) enhanced the researchers' motivation by addressing the three components of the SDT: autonomy, competence, and relatedness.
Autonomy. Autonomy was supported by these teaching frameworks as follows:
• The researchers learned to develop and use software tools that they can later use to gain more autonomy in their research.
• The researchers worked individually on their own research projects.
Competence. Competence was supported by these teaching frameworks as follows:
• Even though the expected programming level was high, the learning curve was designed for novice programmers with no prior programming experience. Computational thinking skills were taught before programming, and only the necessary computer science concepts were taught, to reduce cognitive load. For example, recursion, an important computer science concept often integrated into introduction to computer science courses, was skipped.
• Tasks on each topic were integrated and included both computer science and data science concepts. For example, in the Introduction to Computer Science course (the first course in the specialization program), the students gradually developed a system that collects data from a real website.
Relatedness. Relatedness was supported by the bootcamp and the specialization program as follows:
• To enhance the learners' feeling of relatedness to both computer science and data science, the learners practiced the new computer science and data science content while working on their own real research.

19.5.2 Gender Perspective

Data show that women are underrepresented in STEM subjects in K-12, academia, and industry. For example, according to the statistics of the US Department of Education on bachelor's degrees earned in 2018–19, women were awarded only 21% of all degrees in computer science (Digest of Education Statistics, 2020). In Sect. 5.4, we discuss how data science education opens up opportunities to close gender gaps in STEM subjects. In this section, we present this opportunity using the expectancy value theory (Eccles & Wigfield, 2020). The expectancy value theory is a motivational theory that explains the motivation to engage in activities that involve


achievements. The theory was originally developed to explain women's participation in STEM-related studies and occupations. According to the expectancy value theory, the most influential factors in achievement-related choices are expectations for success and subjective task values. Expectations for success are defined as the beliefs of individuals as to how well they will perform on an upcoming task. To enhance the women researchers' expectations for success in our data science teaching frameworks for social science and digital humanities researchers (and especially in the intensive bootcamp), we did not require any prior knowledge in programming, and taught all of the computer science knowledge that is required in the context of data science, and specifically in the context of research in social science and digital humanities, as part of the bootcamp itself. Subjective task value is divided into four categories: interest value, utility value, attainment value, and relative cost. We found that the utility value of data science is highly appealing to women when the relative cost is reduced.
• Utility value is the value that achievement of the specific task will contribute to the overall goals of the individual. The perceived value of learning data science was made up of acquiring new research skills, developing independence in future research that is methodologically based on data science tools, and attaining a cutting-edge, science-oriented position.
• Relative cost represents the course of action an individual may choose to take as an alternative to the specific task. Data science enables women researchers to acquire additional research tools without requiring them to forego their current research tools and traditions, hence reducing the relative cost of learning data science.

Exercise 19.5 Theoretical perspectives on data science education for researchers in social sciences and digital humanities

(a) Both the self-determination theory (SDT) and the expectancy value theory are made up of several components or factors. List the components of each theory and explore connections among their components.
(b) What does this analysis tell us about data science? What does it tell us about data science education?


Exercise 19.6 Data science education for undergraduate social sciences and digital humanities students

This chapter deals with data science education for researchers in the social sciences and digital humanities who are either graduate students or postgraduate researchers in these disciplines. In your opinion:
(a) Should undergraduate students in these disciplines also learn data science? Explain your position.
(b) Are the teaching frameworks presented in this chapter, which are designed for researchers in these disciplines, also suitable for undergraduate students in these disciplines? If yes, why? If not, what adjustments would you make to the design and contents of these teaching frameworks to adapt them to undergraduate students?

19.6 Conclusion

Data science is essentially interdisciplinary in nature. The data science teaching frameworks for social science and digital humanities researchers described in this chapter were designed to take advantage of this. They do not, and cannot, purport to give the researchers comprehensive training in each of the fields that make up data science. Instead, a set of tools was taught, which provided the participants with an opportunity to acquire skills and basic knowledge in the field, supported by work on the researchers' personal research projects, with the objective of enabling them to integrate data science tools into their daily research. Such framing has the potential to enable researchers in the social sciences and digital humanities to better employ data science as a tool. Indeed, it can help them lead their fields towards a more interdisciplinary research approach. This interdisciplinary approach can specifically widen the proverbial shrinking pipeline in computer science—a term coined by Camp (2002) to describe the situation whereby fewer women than men decide to study computer science, and even fewer women decide to continue to graduate studies and pursue an academic career in computer science.


References

Anderson, P., Bowring, J., McCauley, R., Pothering, G., & Starr, C. (2014). An undergraduate degree in data science: Curriculum and a decade of implementation experience. In Proceedings of the 45th ACM technical symposium on computer science education—SIGCSE'14, pp. 145–150. https://doi.org/10.1145/2538862.2538936
Babbitt, T., Schooler, C., & King, K. (2019). Punch cards to python: A case study of a CS0 core course. In Proceedings of the 50th ACM technical symposium on computer science education, pp. 811–817.
Becker, B. A., & Fitzpatrick, T. (2019). What do CS1 syllabi reveal about our expectations of introductory programming students? In Proceedings of the 50th ACM technical symposium on computer science education, pp. 1011–1017.
Becker, B. A., & Quille, K. (2019). 50 years of CS1 at SIGCSE: A review of the evolution of introductory programming education research. In Proceedings of the 50th ACM technical symposium on computer science education, pp. 338–344.
Berman, F., Rutenbar, R., Hailpern, B., Christensen, H., Davidson, S., Estrin, D., Franklin, M., Martonosi, M., Raghavan, P., Stodden, V., & Szalay, A. S. (2018). Realizing the potential of data science. Communications of the ACM, 61(4), 67–72. https://doi.org/10.1145/3188721
Bryant, R. E., Sutner, K., & Stehlik, M. J. (2010). Introductory computer science education at Carnegie Mellon University: A deans' perspective.
Camp, T. (2002). The incredible shrinking pipeline. ACM SIGCSE Bulletin, 34(2), 129–134.
Cassel, B., & Topi, H. (2015). Strengthening data science education through collaboration: Workshop report 7-27-2016. Arlington, VA.
Cortina, T. J. (2007). An introduction to computer science for non-majors using principles of computation. ACM SIGCSE Bulletin, 39(1), 218–222.
Danyluk, A., & Leidig, P. (2021). Computing competencies for undergraduate data science curricula. Association of Computing Machinery (ACM). https://dstf.acm.org/DSTF_Final_Report.pdf
Data Science for Humanity | Data Science. (2022). https://datascience.stanford.edu/research/research-areas/data-science-humanity
Data Science in the Humanities. (2022). http://ischool.illinois.edu/degrees-programs/courses/is417
De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., Bryant, L., Cheng, L. Z., Francis, A., Gould, R., Kim, A. Y., Kretchmar, M., Lu, Q., Moskol, A., Nolan, D., Pelayo, R., Raleigh, S., Sethi, R. J., Sondjaja, M., Tiruviluamala, N., et al. (2017). Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and Its Application, 4(1), 15–30. https://doi.org/10.1146/annurev-statistics-060116-053930
Deci, E. L., & Ryan, R. M. (2013). Intrinsic motivation and self-determination in human behavior. Springer Science & Business Media.
Demchenko, Y., Belloum, A., Los, W., Wiktorski, T., Manieri, A., Brocks, H., Becker, J., Heutelbeck, D., Hemmje, M., & Brewer, S. (2016). EDISON data science framework: A foundation for building data science profession for research and industry. In 2016 IEEE international conference on cloud computing technology and science (CloudCom), pp. 620–626. https://doi.org/10.1109/CloudCom.2016.0107
Digest of Education Statistics. (2020). National Center for Education Statistics. https://nces.ed.gov/programs/digest/d20/tables/dt20_322.50.asp
Digital Around the World. (2022). DataReportal—Global digital insights. https://datareportal.com/global-digital-overview
Eccles, J. S., & Wigfield, A. (2020). From expectancy-value theory to situated expectancy-value theory: A developmental, social cognitive, and sociocultural perspective on motivation. Contemporary Educational Psychology, 101859.
Forte, A., & Guzdial, M. (2005). Motivation and nonmajors in computer science: Identifying discrete audiences for introductory courses. IEEE Transactions on Education, 48(2), 248–253.


Guzdial, M. (2003). A media computation course for non-majors. In Proceedings of the 8th annual conference on innovation and technology in computer science education, pp. 104–108.
Havill, J. (2019). Embracing the liberal arts in an interdisciplinary data analytics program. In Proceedings of the 50th ACM technical symposium on computer science education, pp. 9–14.
Hertz, M. (2010). What do CS1 and CS2 mean? Investigating differences in the early courses. In Proceedings of the 41st ACM technical symposium on computer science education, pp. 199–203.
Khuri, S., VanHoven, M., & Khuri, N. (2017). Increasing the capacity of STEM workforce: Minor in bioinformatics. In Proceedings of the 2017 ACM SIGCSE technical symposium on computer science education, pp. 315–320. https://doi.org/10.1145/3017680.3017721
Luxton-Reilly, A., Albluwi, I., Becker, B. A., Giannakos, M., Kumar, A. N., Ott, L., Paterson, J., Scott, M. J., Sheard, J., & Szabo, C. (2018). Introductory programming: A systematic literature review. In Proceedings companion of the 23rd annual ACM conference on innovation and technology in computer science education, pp. 55–106.
Merrill, M. D. (2002). First principles of instruction. Educational Technology Research and Development, 50(3), 43–59. https://doi.org/10.1007/BF02505024
Mike, K., Hartal, G., & Hazzan, O. (2021). Widening the shrinking pipeline: The case of data science. In 2021 IEEE global engineering education conference (EDUCON), pp. 252–261.
Mike, K., & Hazzan, O. (2022). Interdisciplinary CS1 for non-majors: The case of graduate psychology students. In 2022 IEEE global engineering education conference (EDUCON).
National Academies of Sciences, Engineering, and Medicine. (2018). Data science for undergraduates: Opportunities and options. The National Academies Press. https://doi.org/10.17226/25104
Plaue, C., & Cook, L. R. (2015). Data journalism: Lessons learned while designing an interdisciplinary service course. In Proceedings of the 46th ACM technical symposium on computer science education—SIGCSE'15, pp. 126–131. https://doi.org/10.1145/2676723.2677263
Reeve, J. (2002). Self-determination theory applied to educational settings. Handbook of Self-Determination Research, 2, 183–204.
Rosenberg-Kima, R. B., & Mike, K. (2020). Teaching online teaching: Using the task-centered instructional design strategy for online computer science teachers' preparation. In Teaching, technology, and teacher education during the COVID-19 pandemic: Stories from the field (pp. 119–123). Association for the Advancement of Computing in Education (AACE).
Ryan, R. M., & Deci, E. L. (2020). Intrinsic and extrinsic motivation from a self-determination theory perspective: Definitions, theory, practices, and future directions. Contemporary Educational Psychology, 61, 101860.
Sullivan, D. G. (2013). A data-centric introduction to computer science for non-majors. In Proceedings of the 44th ACM technical symposium on computer science education, pp. 71–76.
Tartaro, A., & Chosed, R. J. (2015). Computer scientists at the biology lab bench. In Proceedings of the 46th ACM technical symposium on computer science education, pp. 120–125. https://doi.org/10.1145/2676723.2677246
Way, T., & Whidden, S. (2014). A loosely-coupled approach to interdisciplinary computer science education. In Proceedings of the international conference on frontiers in education: computer science and computer engineering (FECS), 1.
Wiberg, M. (2009). Teaching statistics in integration with psychology. Journal of Statistics Education, 17(1).
Wing, J. M. (2006). Computational thinking. Communications of the ACM, 49(3), 33–35. https://doi.org/10.1145/1118178.1118215

Chapter 20

Data Science for Research on Human Aspects of Science and Engineering

Abstract In this chapter and in Chap. 19, we focus on the third component of the MERge model—research, and describe two data science teaching frameworks for researchers: Chap. 19 addresses researchers in social science and digital humanities; this chapter addresses science and engineering researchers and discusses how to teach data science methods to science and engineering graduate students to assist them in conducting research on human aspects of science and engineering. In most cases, these target populations, unlike the community of social scientists (discussed in Chap. 19), have the required background in computer science, mathematics, and statistics, and need to be exposed to the human aspects of science and engineering which, in many cases, are not included in scientific and engineering study programs. We start with the presentation of possible human-related science and engineering topics for investigation (Sect. 20.2). Then, we describe a workshop for science and engineering graduate students that can be facilitated in a hybrid format, combining synchronous (online or face to face) and asynchronous meetings (Sect. 20.3). We conclude with an interdisciplinary perspective of data science for research on human aspects of science and engineering (Sect. 20.4).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3_20

20.1 Introduction

Researchers in scientific disciplines investigate natural phenomena related to materials, physical systems, and biological organisms; researchers in engineering disciplines, on the other hand, focus on engineering phenomena, investigating how to improve the performance of technological systems by applying mathematics and scientific principles. In many cases, the results of these science and engineering research works are implemented in the development of specific technologies used by humans. Thus, the human aspect should be addressed in such scientific and engineering research works by questions such as: How do people perceive the ethical aspects of these research works? How do people perceive the usefulness of the tools developed, and how do they intend to use them? What is the impact of these technologies on people's daily lives? How do these tools influence the way organizations operate? And so on. This chapter examines possible science and engineering


research topics from these perspectives (Sect. 20.2). In addition, we propose a weekly schedule of a semester-long workshop for science and engineering graduate students that can be facilitated in a hybrid format, combining synchronous (online or face to face) and asynchronous meetings (Sect. 20.3). We conclude with an interdisciplinary perspective of data science for research on human aspects of science and engineering (Sect. 20.4). This chapter is closely related to Sect. 2.3—Data science as a research method; to Chap. 8—Data Science as a Research Method, in which we describe, among other topics, the skills needed in order to perform meaningful data science research; and to Chap. 12, in which we present social and ethical issues of data science. These topics should also be incorporated into the teaching of the human-related science and engineering topics whose research is discussed in this chapter.

20.2 Examples of Research Topics Related to Human Aspects of Science and Engineering that Can Use Data Science Methods

Research works carried out in science and engineering may involve the human aspects of science and engineering, be they cognitive, behavioral, social, or organizational. Even when science and engineering research works do not explicitly address human aspects, their applications may be connected to people's lives, and so research that focuses on their human aspects should be considered. This perspective further highlights the interdisciplinarity of data science. In this section, we describe several science and engineering disciplines whose human aspects can be investigated using data science methods. Table 20.1 presents several scientific and engineering disciplines and, for each discipline, proposes a scientific or engineering aspect, a human aspect, and one specific human-related research topic associated with the given human aspect. Needless to say, additional scientific and engineering aspects, human aspects, and specific human-related research topics can be associated with each scientific and engineering discipline.


Table 20.1 Science and engineering research: Scientific and engineering aspects, human aspects, and human-related research topics

Scientific/engineering discipline | Scientific/engineering aspect | Human aspect | Human-related research topic
Biology | Cloning | Consequences of cloning | People's beliefs about the implications of cloning on the future of humanity
Physics | Gravitational forces in the solar system | Mental models of the solar system | Young learners' perception of the magnitudes of the solar system
Civil and environmental engineering | Transportation engineering | Commute habits | Mutual relations between commute habits and perceived environmental factors
Mechanical engineering | Biomechanics | Sport | Sport activities that people wish to monitor and why they wish to do so
Biomedical engineering | Nano-mechanics | Wearable technology | Factors influencing the design of wearable smart clothing
Food engineering | Molecular nutrition | Nutrition habits | Factors influencing food waste
Industrial engineering and management | Architecture of organizational communication software tools | Organizational network analysis | Human communication habits as reflected in the organizational networks

Exercise 20.1 Examples of human aspects of engineering research
(a) Add three examples of engineering disciplines to Table 20.1. For each discipline, suggest three examples of research topics that highlight the human aspect of engineering.
(b) What kind of organizations may find these research topics relevant for their operation?
(c) Explore what skills are needed to carry out these research projects.
Readers whose profession is engineering are encouraged to explore this exercise with respect to their engineering discipline.


Exercise 20.2 Data-driven research
(a) Suggest five research topics in scientific disciplines that may be initiated by data that is gathered incidentally.
(b) Suggest five research topics in engineering disciplines that may be initiated by data that is gathered incidentally.
(c) For each topic, suggest a human-related topic that you would find interesting to investigate.
(d) Select two topics from the human-related scientific disciplines and two topics from the human-related engineering disciplines and describe how you would research them.

Exercise 20.3 Data visualization Choose three of the human-related research topics presented in Table 20.1. For each topic, define what data should be collected for its investigation and how visualization can help obtain insights about the data. See Chap. 8 in which visualization is explored as a research skill.

20.3 Workshop on Data Science Research on Human Aspects of Science and Engineering

The research workshop described in this section focuses on the application of data science when studying human aspects of science and engineering. In other words, the workshop is neither a data science workshop nor a Python workshop; rather, data science and Python are used in this workshop as research tools. The workshop consists of 13 sessions and can be organized as a semester-long course in an academic institution (or any other organization) or in any other format that suits the organization in which it is conducted; if needed, the workshop schedule and format can be adjusted to meet its purposes. The description includes the workshop rationale, contents, target audience, framework, prerequisites, requirements and assessment, time schedule and detailed content, and relevant literature. We note that teaching methods described in other chapters of this guide can be applied in the teaching of this workshop as well. This includes methods mentioned in Chap. 8—Data science as a research method, in Chap. 12—Social and ethical issues of data science, and in Chap. 16—Teaching methods for machine learning.


20.3.1 Workshop Rationale

In any science and engineering research university, alongside research that focuses on scientific and engineering topics, research is conducted that deals with human-related topics. Nevertheless, research on the human aspects of science and engineering is less prevalent in such universities. The workshop presented here aims to highlight the relevance of data science for research on human aspects of science and engineering. The workshop design is explicitly based on the scientific and engineering research skills of the target audience, which can be used to investigate human aspects of science and engineering using data science methods. Thus, the workshop may not only increase the learners' awareness of the human aspects of the research they are conducting as part of their graduate studies, but may also expand their arsenal of research tools; consequently, if they wish, they can expand their research topic to include its human aspects using data science methods.

20.3.2 Workshop Contents
• The human aspects of science and engineering
• The data science workflow from a research perspective (see Chaps. 2 and 8)
• Python programming for research on human aspects of science and engineering
• Machine learning methods and their applications in research on human aspects of science and engineering.

20.3.3 Target Audience
• Graduate students in science and engineering
• Senior undergraduate students.

20.3.4 Workshop Framework (in Terms of Weeks)—A Proposal
• Hybrid format (synchronous and asynchronous meetings).
• The synchronous sessions can take place either face to face or online.
• Thirteen weekly 4-h sessions:
  – Once every two weeks, a synchronous lesson (2–4 h) takes place (either as part of the lesson or as the entire lesson; online or face-to-face) in which:
    The students present the discipline of their research and possible applications of data science in their research;


    Students explore research themes related to human aspects of science and engineering and the relevance of data science research tools for such research;
    Students work in groups on tasks that require active learning (see Chap. 12).
  – All other sessions are asynchronous; in these, the students view pre-recorded lessons, work on research activities, and explore new data science topics that are relevant for their research.
Alternative lesson frequencies can be applied according to the population participating in the workshop.

20.3.5 Prerequisites
• An Introduction to Computer Science course for scientists and engineers;
• The basics of Python programming, which can be learned using one of the many available online courses. Due to the variety of such free online resources, we do not recommend any specific resource; rather, each learner can choose the resource that suits his or her needs in terms of background and programming skills.

20.3.6 Workshop Requirements and Assessment
• 10%: Active participation in the synchronous lessons.
• 10%: Writing a researcher's diary with reflections on the research process (see Chap. 11), thinking processes, insights, challenges the student is facing, opportunities that the research opens, and any other thoughts the student has during the research.
• 20%: Short presentation of the student's research in the last workshop session.
• 60%: Design and execution of research using data science methods. The research topic will be related to the human aspects of science and engineering, and will include three submissions:
  – Submission 1: Research problem, research target, and research questions (15%)
  – Submission 2: Data-gathering tools and planning of the analysis process (15%)
  – Submission 3: Research findings: Data analysis and results, research conclusions, research limitations, and possible directions for follow-up research (30%).
Each submission will include the previous submission(s) (including modifications, if needed, as the research progresses). Submission 3 will include a description of the student's full research project.
Note: Students are expected to carry out their research in parallel with the workshop and to complete all submissions on time, up to two weeks after the workshop ends.
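As a quick sanity check of the grading scheme above, the following sketch shows how the stated weights combine into a final grade. The component scores are invented for illustration only; the weights are the ones listed in this section:

```python
# Workshop grading weights from Sect. 20.3.6 (percent of the final grade).
# The three submissions together make up the 60% research component.
weights = {
    "participation": 10,
    "researcher_diary": 10,
    "final_presentation": 20,
    "submission_1": 15,
    "submission_2": 15,
    "submission_3": 30,
}
assert sum(weights.values()) == 100

# Hypothetical component scores on a 0-100 scale (for illustration only).
scores = {
    "participation": 100,
    "researcher_diary": 90,
    "final_presentation": 85,
    "submission_1": 80,
    "submission_2": 90,
    "submission_3": 95,
}

final_grade = sum(weights[k] * scores[k] for k in weights) / 100
print(final_grade)  # 90.0
```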


20.3.7 Workshop Schedule and Detailed Contents

Table 20.2 presents the workshop schedule and content.

Table 20.2 Workshop schedule and content

Session 1
Synchronous:
• Introduction
• The data science research workflow and data science as a research method (in the spirit of grounded theory; see Chap. 8)
• Skills required for the implementation of data science research (see Chap. 8)
• Presentation of students' research topics for their workshop research project. These topics must not be the same research topics they are working on as part of their graduate studies
• Introduction to the Notebooks environment, working with data files
Asynchronous:
• Introduction to data science
• Data analysis with Python

Session 2
Asynchronous:
• The exploratory data analysis stage of the data science workflow
• Pandas library (a Python library for working with sequential and tabular data, which includes tools to manage, analyze, and manipulate data)
• Individual task: Investigation of how data science methods can be implemented in students' research works

Session 3
Synchronous:
• Formulation of research problem, research targets, and research questions
• Kinds of machine learning algorithms and their suitability for research on human aspects of science and engineering
• Online dataset resources
Asynchronous:
• Kinds of data:
  – structured and unstructured data
  – text, voice, and images as data
• Gathering data from websites and social networks


Session 4
Asynchronous:
• Visualization
• Python libraries:
  – matplotlib (a comprehensive library for creating static, animated, and interactive visualizations in Python)
  – seaborn (a data visualization library based on matplotlib, which provides a high-level interface for drawing attractive and informative statistical graphics)
• Treatment of errors and biases in the data
Due: Submission 1 (see Sect. 20.3.6)

Session 5
Synchronous:
• Discussion of Submission 1
• Data science ethics (see Chap. 12): Ethical perspective of the data science workflow, in general, and of the data collection stage, in particular
• Introduction to machine learning (see Chaps. 13, 14, 15, and 16)
• Data analysis with machine learning
Asynchronous:
• Read the following paper by Hazzan and Mike, and reflect on what you have learned: Hazzan, O., & Mike, K. (2022). Teaching core principles of machine learning with a simple machine learning algorithm: The case of the KNN algorithm in a high school introduction to data science course. ACM Inroads, 13(1), 18–25
• The KNN algorithm

Session 6
Asynchronous:
• Linear regression
• Logistic regression

Session 7
Synchronous:
• Text as data: bag of words, TF-IDF, word embedding
• Model complexity (see Sect. 14.7)
• Underfitting and overfitting (see Sect. 14.8)
• Regularization (see Sect. 14.10)
Asynchronous:
• Example of text analysis (for example, topic modeling)
• Analysis of the results of machine learning algorithms

Session 8
Asynchronous:
• Neural networks
• Example of text analysis: sentiment analysis
Due: Submission 2 (see Sect. 20.3.6)


Session 9
Synchronous:
• Discussion of Submission 2
• Validation of the research results through additional data gathering and analysis methods (quantitative and qualitative)
• Discussion: The model construction process with illustrations of research works carried out by workshop participants
Asynchronous:
• Performance analysis of machine learning algorithms (see Sect. 14.5)
• Performance indicators of machine learning algorithms: Precision, recall, F1 (see Sect. 14.5)
• Discussion: The meaning and interpretation of the above machine learning performance indicators in research on human aspects of science and engineering

Session 10
Asynchronous:
• Gathering data from websites: web scraping
• Read the following paper by Elad and discuss the main ideas it presents: Elad, M. (2017). Deep, Deep Trouble. Retrieved August 31, 2019, from https://sinews.siam.org/Details-Page/deep-deep-trouble

Session 11
Synchronous:
• Improving the performance of machine learning algorithms, in general (see Sect. 14.5), and specifically in research on human aspects of science and engineering
• Feature engineering (see Chap. 10)
• Integration of the application domain and its role in the research design (see Chap. 3)
• Challenges in the analysis of the results of machine learning algorithms
• Explainability and interpretability of machine learning models (see Sect. 13.3.3)
Asynchronous:
• Images as data
• Usages of image analysis in research on human aspects of science and engineering
• Read Syed Huma Shah's article "Improving your Machine Learning Model Performance is sometimes futile. Here's why.", posted on December 1, 2020, at https://towardsdatascience.com/improving-your-machine-learning-model-performance-is-sometimes-futile-heres-why-bda848b2768a
• Reflect: What are the article's main messages? How do they relate to research on human aspects of science and engineering that is performed using data science methods?


Session 12
Asynchronous:
• Automation of gathering data from websites: selenium (the selenium package is used to automate web browser interaction from Python)
• Preliminary acquaintance with qualitative data-gathering tools (e.g., interviews and observations)

Session 13
Synchronous:
• Students' presentations of their research
  – The presentation should include the research problem, targets, and questions; data gathering and analysis; and the conclusion
  – The presentation should reflect the story of the research, taking into consideration the rhetorical triangle (see Sect. 11.2.2)
Due: Submission 3 (see Sect. 20.3.6)
Final project submission: Up to two weeks after the workshop ends

20.3.8 Literature (For the Workshop)

The following resource is suggested for additional reading in the asynchronous sessions (Nos. 5, 10, and 11):
Burkov, A. (2019). The Hundred-Page Machine Learning Book. Quebec City, Canada.
We recommend this book due to its purchasing agenda, as stated on page 4:
This book is distributed on the "read first, buy later" principle. […] You have to be able to read a book before paying for it. The read first, buy later principle implies that you can freely download the book, read it and share it with your friends and colleagues. If you liked the book, only then you have to buy it. Now you are all set. Enjoy your reading!

Additional relevant papers can be added according to the teacher's choice and the students' research topics. The teacher can also ask the students to find and present in class papers that demonstrate:
• the use of data science in research on human aspects of science and engineering;
• the use of qualitative research tools in research whose main data-gathering and analysis methods are data science-oriented.


20.4 Conclusion

To conclude, we highlight two characteristics of the workshop described in this chapter that reflect the interdisciplinarity of data science: its topic and its audience.
• The workshop topic: The investigation of human aspects of science and engineering adds cognitive, behavioral, and social research disciplines to the scientific and engineering disciplines.
• The workshop population: The fact that students from a variety of science and engineering disciplines participate in the workshop increases awareness of the many application domains in which data science can be useful.

Epilogue

Abstract In the epilogue, we view the guide from a holistic perspective, reflecting on its big ideas and their interconnections. As can be seen, this guide is multifaceted and addresses teaching methods, skills, learners, perspectives, habits of mind, and data science topics—from programming and statistics, through problem-solving processes, to organizational skills. Indeed, the guide reflects the richness of the discipline of data science, its relatedness to many aspects of our life, and its centrality and potential contribution to the education of future generations in the twenty-first century.

Exercise EP.1 Pedagogical chasms
This guide is one means we propose for crossing the pedagogical chasm presented in Chap. 9. Can you identify specific ideas that are presented in this guide that may support you in crossing the pedagogical chasm in your data science teaching?

Furthermore, the richness of the discipline of data science is reflected in the interconnections between the different chapters of this guide, as they are specified throughout the guide. For example:
• Chapter 3, which introduces ways of thinking, is strongly related to Chap. 8, which addresses data science as a research method; to Chap. 11, which discusses cognitive (and other kinds of) skills; to Chap. 13, in which white-box and black-box understandings are discussed; and to Sect. 14.5, which addresses base-rate neglect.
• Chapter 18, which focuses on data science teacher preparation, and specifically on the Methods of Teaching Data Science course, is clearly related to almost all topics discussed in the guide, as they deal with the same topic—data science education.

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
O. Hazzan and K. Mike, Guide to Teaching Data Science, https://doi.org/10.1007/978-3-031-24758-3


For a deeper examination of this network of connections between the different chapters of this guide, we recommend summarizing each chapter in 1–2 sentences and discussing the many interconnections between the different chapters and sections of the guide. Such an exploration will further expose the interdisciplinary essence of data science and its teaching, which we aim to highlight in this guide.

Exercise EP.2 Connections between chapters
Present additional connections between the different chapters of this guide. The more the merrier.

Clearly, we could have included additional topics in this guide (such as the evaluation of learners' understanding); we will add them to future editions of this guide. If you have any ideas for additional topics that you feel should be included in future editions of this guide, please share them with us. We welcome any and all suggestions. In the meantime, you can find supplementary material to this guide on the Technion Data Science Education website, at https://orithazzan.net.technion.ac.il/data-science-education/.

In conclusion, we recommend that every reader reflect on what he or she has learned from reading this guide, focusing specifically on:
• ten new ideas you have learned from reading this guide;
• the ten most interesting ideas you have learned from reading this guide;
• connections between the different topics discussed in this guide that have special meaning for you, and why;
• five pedagogical ideas, guidelines, tools, and methods you will use in your data science teaching and learning processes, why, and how;
• the five most meaningful messages related to data science that you will impart to your students;
• your five favorite activities, of the over 200 exercises included in this guide, and why;
• your future use of the idea of interdisciplinarity, which is discussed extensively in this guide, in data science education contexts as well as in other frameworks.

Exercise EP.3 Final reflection task
Reflect:
(a) What do you like about data science? What do you dislike about it?
(b) What do you like about data science education? What do you dislike about it?
(c) If you had to formulate one main idea of data science education, what would it be?
(d) What will your main new contribution to data science education be?

Index

A Abstraction, 37, 38, 42–44, 52, 159, 166–169, 174, 175, 213, 236 ACM, 21, 27, 60, 62, 67, 68, 93, 169, 176, 181, 270, 310 ACM/IEEE-CS Joint Task Force, 181 Active learning, 4, 105, 133, 186, 192, 235, 271, 308 Anomalies, 175 Anthropology, 131, 170, 171 API libraries, 291 Application domain thinking, 44, 52–54 Artificial neural networks, 291, 295 Asynchronous lesson, 269, 272, 273 Attainment value, 298 Autonomy, 296, 297

B Bag of words, 291, 310 Base rate neglect, 36, 45, 54, 215, 315 Beautiful Soup, 291 Bias cognitive, 36, 44, 45, 47, 50–52, 115, 166, 169, 184, 214, 215 selection, 168, 169 Biomedical signal processing, 94, 95, 193 Blackbox understanding, 36, 54, 200, 275 Bootcamp, 11, 66, 283, 284, 286–288, 290–292, 297, 298

C Case study, 8, 114, 126, 137, 141, 145, 166, 170, 188, 189, 193, 259, 262 Chasm crossing the, 137, 140, 145, 146 interdisciplinary, 75 pedagogical, 8, 10, 75, 112, 137, 145–147, 262, 265, 275, 315 technological, 145 Chat, 273, 274 Classification, 45–48, 51, 210, 211, 213, 216, 218, 219, 226, 227, 229–232, 236, 237, 240, 242, 247, 276, 277, 291 Classifiers, 45, 47, 52, 106, 200, 211, 216–219, 226, 232, 276 Code-data duality, 176 Code of ethics, 180–182 Cognitive bias, 44, 214 Cognitive load, 236, 297 Cognitive skills, 43, 79, 97, 122, 127, 165–167, 172, 180, 279 Collective impact, 257, 259 Competence, 55, 60, 63–65, 79, 296, 297 Complexity, 66, 77, 86, 93, 139, 203, 219, 278 Computational thinking, 3, 7, 35–39, 52–55, 63, 79, 176, 271, 272, 274, 275, 278, 291, 293, 294, 297

1 The following terms are not listed in the index since they appear countless times throughout this guide: data science, science, engineering, machine learning, domain knowledge, application domain, interdisciplinarity, and interdisciplinary.


Conception object, 41–43, 274–276, 279 procept, 42 process, 41–43, 274, 276, 279 Constructivism, 42, 105 Coverage, 160 Crossing the chasm, 140

D Data coverage, 160, 187 quality, 62, 105, 154, 160, 162, 257 quantity, 160 Data aggregation, 168 Data cleaning, 168 Data collection, 8, 20, 22, 23, 52, 53, 115, 124, 126, 151, 152, 154, 155, 183, 257, 279, 310 Data life cycle, 29, 31, 63, 151, 180 Data mining, 19–21, 28, 64, 67, 68, 151, 171, 293 Data munging, 168 Data preparation, 8, 144, 151, 156, 157, 160, 162 Data remediation, 168 Data science as a profession, 6, 19, 30, 31 Data science as a research method, 6, 8, 19, 24, 26, 28, 30, 109–111, 121, 122, 134, 137, 180, 188, 190, 191, 283, 304, 306, 309, 315 Data science workflow, 5, 8, 9, 28, 29, 37, 63, 103, 105, 106, 122–124, 126–128, 130, 132–134, 141, 151–154, 160–162, 165–169, 171–177, 180, 183, 188, 194, 199, 200, 202, 226, 257, 268, 270, 291, 292, 295, 307, 309, 310 Data thinking, 7, 35–37, 52–55, 79 Data workflow, 151 Data wrangling, 156, 159, 168 Decision Trees, 10, 210, 213, 225, 229, 236 Didactic Transposition, 70, 107 Diffusion of Innovation, 137–141 Digital Humanities, 10, 11, 24, 78, 110, 111, 283–286, 290, 296, 298, 299, 303 Diversity, 4, 6, 8, 19, 32, 66, 68, 101–103, 109, 111, 119, 183 Domain neglect bias, 44, 51, 206 Drag and drop, 14

Index E EDISON data science framework, 60, 61, 64, 65 Educational data mining, 114, 125, 256 Educational entrepreneurship, 275 Embedded context, 4, 186 Embedded ethics, 186, 187 Encapsulation, 37, 41 Entrepreneurship, 192 Ethics, 5, 9, 12, 29, 68, 76, 104, 179–181, 183, 184, 186, 187, 194, 202, 275, 291, 310 Ethics - code of, 180–182 Ethos, 170 Expectancy value theory, 297, 298 Expectations for success, 298 Explainability, 127, 203, 204, 311 Exploratory data analysis, 8, 24, 106, 123, 124, 126, 133, 144, 151, 156, 157, 159, 162, 167, 173, 175, 183, 270, 295, 309 Exponential organizations, 255

F F1, 214, 311 Feature engineering, 160, 161, 248, 311 Flipped classroom, 270, 292 Forum, 191, 258, 271, 273

G Gender, 7, 27, 75, 78, 79, 82, 85, 111, 119, 287, 288, 296, 297 Gender gap, 78, 79 Generative adversarial networks (GAN), 160, 181 Google Colab, 14, 118 Gradient descent algorithm, 10, 209, 222, 230, 232–234, 237, 243 Graduate students, 8, 11, 20, 63, 77, 86, 88, 93, 97, 101, 102, 108–110, 152, 226, 283, 294, 299, 303, 304, 307

H Hands-on tasks, 10, 42, 235, 237, 238, 246 Health, 9, 23, 114, 128, 179, 215, 254, 256, 284 High school, 1, 2, 8, 12, 38, 66, 68, 70, 77, 80, 81, 96, 101, 102, 104–108, 112, 137, 141, 142, 146, 179, 226, 230, 266–273, 275–277, 279, 310 History of data science, 33, 272, 275, 279

HTML, 291, 294 Human aspects, 5, 11, 24, 76, 110, 180, 303–309, 311–313 Hyperparameters, 9, 209, 211, 213, 236 Hyperparameter tuning, 175, 200, 211, 213, 227, 291

I IEEE, 181 Images, 24, 28, 43, 86, 92, 94–96, 105, 106, 144, 184, 185, 202, 206, 210, 246, 285, 293, 309, 311 Indicators, 9, 44, 127, 144, 205, 206, 209, 214–217, 223, 270, 311 Industry, 1, 2, 8, 11, 22, 28, 30, 67, 71, 78, 94, 95, 101, 102, 110, 112, 113, 116, 119, 129, 151, 179, 185, 205, 284, 297 Innovation, 8, 19, 137–143, 145–147, 257, 259, 260, 293 Interdisciplinary pedagogy, 3, 7, 75, 80–82 Interest value, 298 Interpretability, 203, 204, 311 Iris dataset, 227

J JavaScript, 291 Jupyter notebook, 14, 118

K K-12 pupils, 8, 101–105, 185 K-12 teachers, 11, 12 K-nearest neighbors, 10, 105, 225, 226, 236, 240, 295 KNIME, 14, 118, 291 KNN algorithm, 105, 106, 144, 213, 226, 227, 229, 237, 238, 246, 270, 274–277, 279, 280, 310

L Last Mile of Analytics, 161 Learning analytics, 59, 114, 125, 256 Learning community, 161, 275 Life cycle, 29 Life long learning, 173 Linear regression, 10, 105, 211, 219, 222, 225, 232, 234, 295, 310 LinkedIn, 22, 31 Logistic, 10, 107, 200, 222, 225, 230, 232–234, 246, 295, 310

Logos, 170 Loss function, 10, 209, 222, 223, 230, 232, 233, 236, 237

M Machine learning, 3, 5, 9, 10, 14, 21, 24, 30, 40, 41, 44–46, 49, 51–53, 62, 64, 66, 67, 80, 105–108, 115, 116, 123, 126, 127, 129, 130, 133, 140, 141, 144, 153, 160, 161, 166, 167, 170, 171, 174, 176, 181, 190, 191, 199, 200, 202, 203, 205, 206, 209–211, 214, 225, 226, 235, 238, 254, 268, 270, 273, 276, 290–292, 295, 296, 306, 307, 309–311 Mathematical thinking, 37, 40, 52, 53, 274 Matplotlib, 144, 291, 310 Medical diagnosis, 45–47, 49, 76, 96, 215, 216 Mentoring, 91, 94, 97, 248, 273, 275, 276 MERge model, 6, 10, 11, 125, 128, 253, 254, 265, 283, 303 Methods of Teaching Data Science Course, 6, 147, 170, 254, 273, 280 Model complexity, 10, 209, 218, 219, 227, 270, 310 Modeling, 5, 8, 9, 14, 20, 28, 62, 66, 67, 76, 106, 123, 151, 156, 160, 169, 173, 183, 199, 204, 286 Motivation, 1–4, 8, 12, 53, 62, 77, 94, 137, 144, 145, 154, 192, 248, 290, 292, 296, 297 Motivational theories, 297 Multidisciplinarity, 26, 27, 80, 280

N Neural networks, 10, 20, 105, 107, 160, 211, 213, 222, 225, 230, 232, 233, 236, 237, 246, 247, 310 NSF, 23, 27–29, 62, 63, 65, 71

O Object conception, 41, 43, 274–276, 279 Optimization, 10, 41, 107, 209, 213, 222, 243 Orange Data Mining, 14, 118, 168, 171, 291 Organizational skills, 3, 8, 79, 122, 128, 131, 162, 165–167, 169, 173, 315 Outliers, 39, 126, 156, 157, 163, 175, 204 Overfitting, 10, 144, 209, 218–223, 226, 227, 229, 232, 310

P Pandas, 106, 144, 270, 291, 309 Parameters, 9, 167, 199, 200, 209, 211, 213, 222, 230–232 Pathos, 170 PCK, 91, 92, 111, 268, 281 Pedagogical chasm, 147, 315 Perceptron, 10, 105, 107, 225, 230–234, 236, 237, 240, 295 Performance, 9, 22, 28, 44–46, 51, 52, 76, 94, 114, 123, 127, 138, 144, 145, 161, 200, 206, 209, 211, 213, 214, 216–219, 223, 226, 230, 236, 237, 242, 248, 270, 291, 303, 311 Policy makers, 2, 5, 8, 10, 11, 69, 101, 102, 110, 114, 115, 147, 161, 167, 253 Practitioners, 2, 8, 11, 31, 70, 101, 102, 110, 112, 113, 116, 119, 165, 167, 171, 172, 184, 185, 279 Precision, 214, 216, 311 Procept, 41, 42 Procept conception, 42 Process conception, 41, 42, 274, 276, 279 Process-object duality, 37, 40–42, 80, 235, 274 Professional development, 5, 7, 10, 11, 75, 81, 82, 109, 110, 139, 141, 253 Programing tasks, 246, 248 Project-based learning, 10, 76, 91, 146, 186, 192, 235, 248, 275 Project development, 95, 96, 105, 106, 154, 192, 193, 205 Psychological Science, 284, 292–295 Python, 13, 68, 105, 106, 140, 141, 144, 146, 168, 171, 268–270, 275, 277–280, 291–295, 306–310, 312

Q Qualitative research, 124, 312 Quality, 94, 160, 162, 163 Quantity, Data, 160

R Raw data, 1, 22, 24, 28, 168, 199 Real world context, 7, 64, 75–77, 79, 82, 188 Real world data, 7, 28, 75, 77, 82, 93, 160, 248 Recall, 175, 214, 311 Recursion, 297 Reduction of abstraction, 40, 42, 235, 238 Reflection in action, 168, 190

Reflection on action, 168, 190 Regression linear, 10, 105, 211, 219, 222, 225, 232, 234, 295, 310 logistic, 10, 107, 200, 222, 225, 230, 232–234, 246, 295, 310 Regularization, 10, 209, 222, 223, 310 Relatedness, 296, 297, 315 Relative cost, 298 Reliability, 116, 154 Research methods, 65 Rhetorical triangle, 170, 188, 312 R - programming language, 171

S Scratch, 104 Seaborn, 144, 270, 291, 310 Selenium, 291, 312 Self-determination theory, 296, 298 Sentiment analysis, 310 Signal processing, 95 Skills 21st century, 7, 38, 75, 79, 80, 82, 101–103, 112, 117, 172, 192, 275, 293 cognitive, 4, 8, 9, 12, 30, 36, 43, 159, 165, 190, 291, 315 organizational, 8, 9, 117, 121, 126, 167, 172 professional, 3, 7, 9, 20, 65, 75, 79, 110, 121, 133, 165–167, 256 soft, 166, 172 technological, 9, 121, 122, 131, 145, 165–167, 171, 174 SK-learn, 291 Social networks, 25, 101, 116, 284, 294, 309 Social Science, 1, 2, 10, 11, 24, 25, 27, 78, 97, 110, 111, 119, 191, 199, 201, 222, 232, 283–287, 290, 292, 296, 298, 299, 303 Software Engineering Code of Ethics and Professional Practice, 181 SPSS, 291 Statistical inference, 20, 62, 295 STEM, 7, 53, 75–79, 82, 85, 87, 91, 97, 111, 119, 140, 293, 294, 297, 298 Storytelling, 3, 117, 129, 131, 132, 162, 166, 167, 169–171, 185, 188, 275 Structured data, 124, 309 Subjective task values, 298 Supervised learning, 210, 211, 219 SVM, 105, 107, 291, 295 SWOT analysis, 258–261

T Teacher preparation, 10, 112, 265, 315 Technological skills, 8, 9, 121, 122, 131, 165–167, 171, 174 TensorFlow playground, 237 Testing, 9, 52, 94, 175, 209, 211, 213, 226 Text, 284, 291, 310 Text analysis, 291, 295 Textual programming environments, 13, 118 TFIDF, 291, 310 Thinking computational, 3, 7, 35–39, 52, 53, 55, 63, 79, 176, 271, 272, 274, 275, 293, 297 critical, 116, 122, 166 data, 7, 35–37, 52–55, 79 mathematical, 37, 40, 52, 274 statistical, 39, 40, 62, 64, 66, 204, 278 Topic modeling, 310 TPACK, 144, 146, 266, 268 Training, 233 Transdisciplinary, 27 Transportation, 46, 254, 261, 305 Tuning - hyperparameters, 9

U Underfitting, 219, 310

Undergraduate students, 97 Understanding black-box, 200 white-box, 4, 275 Unstructured data, 124, 309 Unsupervised learning, 210 Users, 201 Utility value, 298

V Validation, 311 Variance, 10, 39, 52, 157, 205, 209, 217, 218, 220, 221, 223, 236 Venn diagram, 26, 86, 88 Visualization, 131–133, 144, 236, 295, 310 Visual programming environments, 118, 132 Voice, 7, 20, 39, 59, 116, 161, 281, 309

W Web scraping, 291, 294 Weka, 14, 118, 168, 171, 291 Whitebox understanding, 4, 275 Word embedding, 24, 291, 310 Workflow, 66, 151