Addressing Global Challenges and Quality Education: 15th European Conference on Technology Enhanced Learning, EC-TEL 2020, Heidelberg, Germany, September 14–18, 2020, Proceedings [1st ed.] 9783030577162, 9783030577179

This book constitutes the proceedings of the 15th European Conference on Technology Enhanced Learning, EC-TEL 2020, held online during September 14–18, 2020. The conference was originally planned to take place in Heidelberg, Germany, but was moved online because of the COVID-19 pandemic. The 24 research papers, 20 posters, and 5 demo papers presented in this volume were reviewed and selected from 91 submissions.


Language: English · Pages: XVII, 489 [507] · Year: 2020


Table of contents:
Front Matter ....Pages i-xvii
Exploring Artificial Jabbering for Automatic Text Comprehension Question Generation (Tim Steuer, Anna Filighera, Christoph Rensing)....Pages 1-14
Digital Value-Adding Chains in Vocational Education: Automatic Keyword Extraction from Learning Videos to Provide Learning Resource Recommendations (Cleo Schulten, Sven Manske, Angela Langner-Thiele, H. Ulrich Hoppe)....Pages 15-29
Human-Centered Design of a Dashboard on Students’ Revisions During Writing (Rianne Conijn, Luuk Van Waes, Menno van Zaanen)....Pages 30-44
An Operational Framework for Evaluating the Performance of Learning Record Stores (Chahrazed Labba, Azim Roussanaly, Anne Boyer)....Pages 45-59
Does an E-mail Reminder Intervention with Learning Analytics Reduce Procrastination in a Blended University Course? (Iryna Nikolayeva, Amel Yessad, Bertrand Laforge, Vanda Luengo)....Pages 60-73
Designing an Online Self-assessment for Informed Study Decisions: The User Perspective (L. E. C. Delnoij, J. P. W. Janssen, K. J. H. Dirkx, R. L. Martens)....Pages 74-86
What Teachers Need for Orchestrating Robotic Classrooms (Sina Shahmoradi, Aditi Kothiyal, Jennifer K. Olsen, Barbara Bruno, Pierre Dillenbourg)....Pages 87-101
Assessing Teacher’s Discourse Effect on Students’ Learning: A Keyword Centrality Approach (Danner Schlotterbeck, Roberto Araya, Daniela Caballero, Abelino Jimenez, Sami Lehesvuori, Jouni Viiri)....Pages 102-116
For Learners, with Learners: Identifying Indicators for an Academic Advising Dashboard for Students (Isabel Hilliger, Tinne De Laet, Valeria Henríquez, Julio Guerra, Margarita Ortiz-Rojas, Miguel Ángel Zuñiga et al.)....Pages 117-130
Living with Learning Difficulties: Two Case Studies Exploring the Relationship Between Emotion and Performance in Students with Learning Difficulties (Styliani Siouli, Stylianos Makris, Evangelia Romanopoulou, Panagiotis P. D. Bamidis)....Pages 131-143
Learnersourcing Quality Assessment of Explanations for Peer Instruction (Sameer Bhatnagar, Amal Zouaq, Michel C. Desmarais, Elizabeth Charles)....Pages 144-157
Using Diffusion Network Analytics to Examine and Support Knowledge Construction in CSCL Settings (Mohammed Saqr, Olga Viberg)....Pages 158-172
Supporting Second Language Learners’ Development of Affective Self-regulated Learning Skills Through the Use and Design of Mobile Technology (Olga Viberg, Anna Mavroudi, Yanwen Ma)....Pages 173-186
We Know What You Did Last Semester: Learners’ Perspectives on Screen Recordings as a Long-Term Data Source for Learning Analytics (Philipp Krieter, Michael Viertel, Andreas Breiter)....Pages 187-199
Teaching Simulation Literacy with Evacuations (Andre Greubel, Hans-Stefan Siller, Martin Hennecke)....Pages 200-214
Design of Conversational Agents for CSCL: Comparing Two Types of Agent Intervention Strategies in a University Classroom (Konstantinos Michos, Juan I. Asensio-Pérez, Yannis Dimitriadis, Sara García-Sastre, Sara Villagrá-Sobrino, Alejandro Ortega-Arranz et al.)....Pages 215-229
Exploring Human–AI Control Over Dynamic Transitions Between Individual and Collaborative Learning (Vanessa Echeverria, Kenneth Holstein, Jennifer Huang, Jonathan Sewall, Nikol Rummel, Vincent Aleven)....Pages 230-243
Exploring Student-Controlled Social Comparison (Kamil Akhuseyinoglu, Jordan Barria-Pineda, Sergey Sosnovsky, Anna-Lena Lamprecht, Julio Guerra, Peter Brusilovsky)....Pages 244-258
New Measures for Offline Evaluation of Learning Path Recommenders (Zhao Zhang, Armelle Brun, Anne Boyer)....Pages 259-273
Assessing the Impact of the Combination of Self-directed Learning, Immediate Feedback and Visualizations on Student Engagement in Online Learning (Bilal Yousuf, Owen Conlan, Vincent Wade)....Pages 274-287
CGVis: A Visualization-Based Learning Platform for Computational Geometry Algorithms (Athanasios Voulodimos, Paraskevas Karagiannopoulos, Ifigenia Drosouli, Georgios Miaoulis)....Pages 288-302
How to Design Effective Learning Analytics Indicators? A Human-Centered Design Approach (Mohamed Amine Chatti, Arham Muslim, Mouadh Guesmi, Florian Richtscheid, Dawood Nasimi, Amin Shahin et al.)....Pages 303-317
Emergency Remote Teaching: Capturing Teacher Experiences in Spain with SELFIE (Laia Albó, Marc Beardsley, Judit Martínez-Moreno, Patricia Santos, Davinia Hernández-Leo)....Pages 318-331
Utilising Learnersourcing to Inform Design Loop Adaptivity (Ali Darvishi, Hassan Khosravi, Shazia Sadiq)....Pages 332-346
Fooling It - Student Attacks on Automatic Short Answer Grading (Anna Filighera, Tim Steuer, Christoph Rensing)....Pages 347-352
Beyond Indicators: A Scoping Review of the Academic Literature Related to SDG4 and Educational Technology (Katy Jordan)....Pages 353-357
Pedagogical Underpinnings of Open Science, Citizen Science and Open Innovation Activities: A State-of-the-Art Analysis (Elisha Anne Teo, Evangelia Triantafyllou)....Pages 358-362
Knowledge-Driven Wikipedia Article Recommendation for Electronic Textbooks (Behnam Rahdari, Peter Brusilovsky, Khushboo Thaker, Jordan Barria-Pineda)....Pages 363-368
InfoBiTS: A Mobile Application to Foster Digital Competencies of Senior Citizens (Svenja Noichl, Ulrik Schroeder)....Pages 369-373
Student Awareness and Privacy Perception of Learning Analytics in Higher Education (Stian Botnevik, Mohammad Khalil, Barbara Wasson)....Pages 374-379
User Assistance for Serious Games Using Hidden Markov Model (Vivek Yadav, Alexander Streicher, Ajinkya Prabhune)....Pages 380-385
Guiding Socio-Technical Reflection of Ethical Principles in TEL Software Development: The SREP Framework (Sebastian Dennerlein, Christof Wolf-Brenner, Robert Gutounig, Stefan Schweiger, Viktoria Pammer-Schindler)....Pages 386-391
Git4School: A Dashboard for Supporting Teacher Interventions in Software Engineering Courses (Jean-Baptiste Raclet, Franck Silvestre)....Pages 392-397
Exploring the Design and Impact of Online Exercises for Teacher Training About Dynamic Models in Mathematics (Charlie ter Horst, Laura Kubbe, Bart van de Rotten, Koen Peters, Anders Bouwer, Bert Bredeweg)....Pages 398-403
Interactive Concept Cartoons: Exploring an Instrument for Developing Scientific Literacy (Patricia Kruit, Bert Bredeweg)....Pages 404-409
Quality Evaluation of Open Educational Resources (Mirette Elias, Allard Oelen, Mohammadreza Tavakoli, Gábor Kismihok, Sören Auer)....Pages 410-415
Designing Digital Activities to Screen Locomotor Skills in Developing Children (Benoit Bossavit, Inmaculada Arnedillo-Sánchez)....Pages 416-420
Towards Adaptive Social Comparison for Education (Sergey Sosnovsky, Qixiang Fang, Benjamin de Vries, Sven Luehof, Fred Wiegant)....Pages 421-426
Simulation Based Assessment of Epistemological Beliefs About Science (Melanie E. Peffer, Tessa Youmans)....Pages 427-431
An Approach to Support Interactive Activities in Live Stream Lectures (Tommy Kubica, Tenshi Hara, Iris Braun, Alexander Schill)....Pages 432-436
Educational Escape Games for Mixed Reality (Ralf Klamma, Daniel Sous, Benedikt Hensen, István Koren)....Pages 437-442
Measuring Learning Progress for Serving Immediate Feedback Needs: Learning Process Quantification Framework (LPQF) (Gayane Sedrakyan, Sebastian Dennerlein, Viktoria Pammer-Schindler, Stefanie Lindstaedt)....Pages 443-448
Data-Driven Game Design: The Case of Difficulty in Educational Games (Yoon Jeon Kim, Jose A. Ruipérez-Valiente)....Pages 449-454
Extracting Topics from Open Educational Resources (Mohammadreza Molavi, Mohammadreza Tavakoli, Gábor Kismihók)....Pages 455-460
Supporting Gamification with an Interactive Gamification Analytics Tool (IGAT) (Nadja Zaric, Manuel Gottschlich, Rene Roepke, Ulrik Schroeder)....Pages 461-466
OpenLAIR an Open Learning Analytics Indicator Repository Dashboard (Atezaz Ahmad, Jan Schneider, Hendrik Drachsler)....Pages 467-471
CasualLearn: A Smart Application to Learn History of Art (Adolfo Ruiz-Calleja, Miguel L. Bote-Lorenzo, Guillermo Vega-Gorgojo, Sergio Serrano-Iglesias, Pablo García-Zarza, Juan I. Asensio-Pérez et al.)....Pages 472-476
Applying Instructional Design Principles on Augmented Reality Cards for Computer Science Education (Josef Buchner, Michael Kerres)....Pages 477-481
Extending Patient Education with CLAIRE: An Interactive Virtual Reality and Voice User Interface Application (Richard May, Kerstin Denecke)....Pages 482-486
Correction to: Addressing Global Challenges and Quality Education (Carlos Alario-Hoyos, María Jesús Rodríguez-Triana, Maren Scheffel, Inmaculada Arnedillo-Sánchez, Sebastian Maximilian Dennerlein)....Pages C1-C1
Back Matter ....Pages 487-489

LNCS 12315

Carlos Alario-Hoyos María Jesús Rodríguez-Triana Maren Scheffel Inmaculada Arnedillo-Sánchez Sebastian Maximilian Dennerlein (Eds.)

Addressing Global Challenges and Quality Education 15th European Conference on Technology Enhanced Learning, EC-TEL 2020 Heidelberg, Germany, September 14–18, 2020 Proceedings

Lecture Notes in Computer Science

Founding Editors: Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany; Juris Hartmanis, Cornell University, Ithaca, NY, USA

Editorial Board Members: Elisa Bertino, Purdue University, West Lafayette, IN, USA; Wen Gao, Peking University, Beijing, China; Bernhard Steffen, TU Dortmund University, Dortmund, Germany; Gerhard Woeginger, RWTH Aachen, Aachen, Germany; Moti Yung, Columbia University, New York, NY, USA


More information about this series at http://www.springer.com/series/7409

Carlos Alario-Hoyos · María Jesús Rodríguez-Triana · Maren Scheffel · Inmaculada Arnedillo-Sánchez · Sebastian Maximilian Dennerlein (Eds.)





Addressing Global Challenges and Quality Education 15th European Conference on Technology Enhanced Learning, EC-TEL 2020 Heidelberg, Germany, September 14–18, 2020 Proceedings




Editors Carlos Alario-Hoyos Universidad Carlos III de Madrid Leganés (Madrid), Spain

María Jesús Rodríguez-Triana Tallinn University Tallinn, Estonia

Maren Scheffel Open University Netherlands Heerlen, The Netherlands

Inmaculada Arnedillo-Sánchez Trinity College Dublin Dublin, Ireland

Sebastian Maximilian Dennerlein Graz University of Technology and Know-Center Graz, Austria

ISSN 0302-9743    ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-030-57716-2    ISBN 978-3-030-57717-9 (eBook)
https://doi.org/10.1007/978-3-030-57717-9
LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI

© Springer Nature Switzerland AG 2020

Chapters 6, 10, 15 and 48 are licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). For further details see licence information in the chapters.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Welcome to the proceedings of the 15th European Conference on Technology Enhanced Learning (EC-TEL 2020), one of the flagship events of the European Association of Technology Enhanced Learning (EATEL).

In addition to the social and economic crisis, the pandemic has led to a historic inflection point in education. The worldwide impact of COVID-19 and the closure of schools and universities have led to numerous changes in how, where, and with which tools we learn, and in how education can take place at a distance. One of these changes is that lockdowns triggered the use of Technology-Enhanced Learning (TEL) at a massive level throughout the world, due to the urgent need to transform face-to-face classes into online ones. Such circumstances pose new challenges and research questions that our community must address.

Looking back, the theme of EC-TEL 2020 (Addressing Global Challenges and Quality Education) could not have been more timely, with COVID-19 affecting education and its quality all over the world. Thus, the Sustainable Development Goals of Quality Education and Reduced Inequalities, as defined in the United Nations Agenda 2030, are more essential than ever. In a global world of great contrasts, we are all suffering the consequences of this unexpected pandemic, and it is paramount to find global solutions to provide equitable quality education and promote lifelong learning opportunities for all. This not only requires improving student performance and success in general but also supporting under-represented groups and those disadvantaged by inequality. In this challenging context, the active role of the TEL community may be crucial to pursue the aforementioned goals.

Another important change this year is that EC-TEL 2020 was held online for the first time in its 15-year history. Although EC-TEL 2020 was initially planned to take place in Heidelberg, Germany, local health authorities advised against organizing large events. What could have been a major setback was transformed into a powerful experience in which to continue learning and evolving as a community, thanks to the researchers who continued to rely on this conference to submit and present their contributions.

This year, 91 contributions were received, despite the difficulties some researchers faced in carrying out and reporting their developments and studies. All of these contributions were reviewed by three members of the TEL community, who also had follow-up discussions to agree on a meta-review. As a result, 24 research papers (26.4%) were accepted and presented at the conference. This shows the high competitiveness and quality of this conference year after year. In addition, 20 posters, 5 demos, and 8 impact papers were presented during the conference to fuel the discussions among the researchers. Research papers, posters, and demos can be found in this volume, while impact papers are published in companion proceedings via CEUR.

EC-TEL 2020 was co-located with another major conference in Germany, the 18th Fachtagung Bildungstechnologien der GI Fachgruppe Bildungstechnologien (DELFI 2020).


This helped create partnerships between the two conferences and allowed attendees to exchange ideas and benefit from presentations and workshops offered in both conferences. As an example, 14 workshops and 4 keynotes (Linda Castañeda, Jens Mönig, Samuel Greiff, and Sabine Seufert) were shared between EC-TEL 2020 and DELFI 2020.

The last words are words of gratitude. Thanks to the researchers who sent their contributions to EC-TEL 2020. Thanks to the members of the Programme Committee who devoted their time to give feedback to authors and supported the decision making on paper acceptance. Finally, deep thanks to the local organizers, Marco Kalz and Joshua Weidlich, who worked very hard to host this conference online with the same quality as in previous years.

July 2020

Carlos Alario-Hoyos María Jesús Rodríguez-Triana Maren Scheffel Inmaculada Arnedillo-Sánchez Sebastian Dennerlein Tracie Farrell Frey Tom Broos Zacharoula Papamitsiou Adolfo Ruiz Calleja Kairit Tammets Christian Glahn Marco Kalz Joshua Weidlich EC-TEL 2020 Chairs

Organization

Program Committee Marie-Helene Abel Andrea Adamoli Nora Ayu Ahmad Uzir Gokce Akcayir Carlos Alario-Hoyos Patricia Albacete Dietrich Albert Laia Albó Liaqat Ali Ishari Amarasinghe Boyer Anne Alessandra Antonaci Roberto Araya Inmaculada Arnedillo-Sanchez Juan I. Asensio-Pérez Antonio Balderas Nicolas Ballier Jordan Barria-Pineda Jason Bernard Anis Bey Miguel L. Bote-Lorenzo François Bouchet Yolaine Bourda Bert Bredeweg Andreas Breiter Julien Broisin Tom Broos Armelle Brun Daniela Caballero Manuel Caeiro Rodríguez Sven Charleer Mohamed Amine Chatti John Cook Audrey Cooke Alessia Coppi Mihai Dascalu

Université de Technologie de Compiègne, France Università della Svizzera Italiana, Switzerland University of Edinburgh, UK University of Alberta, Canada Universidad Carlos III de Madrid, Spain University of Pittsburgh, USA University of Graz, Austria Universitat Pompeu Fabra, Spain Simon Fraser University, Canada Universitat Pompeu Fabra, Spain LORIA – KIWI, France European Association of Distance Teaching Universities, The Netherlands Universidad de Chile, Chile Trinity College Dublin, Ireland Universidad de Valladolid, Spain University of Cádiz, Spain Université de Paris, France University of Pittsburgh, USA University of Saskatchewan, Canada University of Paul Sabatier, IRIT, France Universidad de Valladolid, Spain Sorbonne Université, France LRI, CentraleSupélec, France University of Amsterdam, The Netherlands Universität Bremen, Germany Université Toulouse 3 Paul Sabatier, IRIT, France KU Leuven, Belgium LORIA, Université Nancy 2, France Universidad de Chile, Chile University of Vigo, Spain KU Leuven, Belgium University of Duisburg-Essen, Germany University of West of England, UK Curtin University, Australia SFIVET, Switzerland University Politehnica of Bucharest, Romania


Peter de Lange Felipe de Morais Carrie Demmans Epp Sebastian Dennerlein Michael Derntl Philippe Dessus Daniele Di Mitri Yannis Dimitriadis Monica Divitini Juan Manuel Dodero Raymond Elferink Erkan Er Maka Eradze Iria Estévez-Ayres Juan Carlos Farah Tracie Farrell Louis Faucon Baltasar Fernandez-Manjon Carmen Fernández-Panadero Angela Fessl Anna Filighera Olga Firssova Mikhail Fominykh Felix J. Garcia Clemente Jesús Miguel García-Gorrostieta Serge Garlatti Dragan Gasevic Sébastien George Carlo Giovannella Christian Glahn Samuel González-López Sabine Graf Monique Grandbastien Wolfgang Greller David Griffiths Nathalie Guin Bernhard Göschlberger

RWTH Aachen University, Germany UNISINOS, Brazil University of Alberta, Canada Know-Center, Austria University of Tübingen, Germany Université Grenoble Alpes, LaRAC, France Open Universiteit, The Netherlands Universidad de Valladolid, Spain Norwegian University of Science and Technology, Norway Universidad de Cádiz, Spain RayCom BV, The Netherlands Universidad de Valladolid, Spain University of Modena and Reggio Emilia, Italy Universidad Carlos III de Madrid, Spain École Polytechnique Fédérale de Lausanne, Switzerland The Open University, UK École Polytechnique Fédérale de Lausanne, Switzerland Universidad Complutense de Madrid, Spain Universidad Carlos III de Madrid, Spain Know-Center, Austria Technical University of Darmstadt, Multimedia Communications Lab, Germany WELTEN Institute, Open Universiteit, The Netherlands Norwegian University of Science and Technology, Norway Universidad de Murcia, Spain Universidad de la Sierra, Mexico IMT Atlantique, France Monash University, Australia LIUM, Le Mans Université, France University of Tor Vergata, Italy LDE CEL, The Netherlands Technological University of Nogales, Mexico Athabasca University, Canada LORIA, Université de Lorraine, France Vienna University of Education, Austria University of Bolton, UK Université de Lyon, France Research Studios, Austria


Franziska Günther Christian Gütl Carolin Hahnel Stuart Hallifax Bastiaan Heeren Maartje Henderikx Davinia Hernandez-Leo Ángel Hernández-García Tore Hoel Adrian Holzer Pasquale Iero Francisco Iniesto Zeeshan Jan Patricia Jaques Johan Jeuring Ioana Jivet Srecko Joksimovic Ilkka Jormanainen Jelena Jovanovic Ken Kahn Rogers Kaliisa Marco Kalz Anastasios Karakostas Reet Kasepalu Mohammad Khalil Michael Kickmeier-Rust Ralf Klamma Roland Klemke Tomaž Klobučar Carolien Knoop-Van Campen Johannes Konert Külli Kori Panagiotis Kosmas Vitomir Kovanovic Dominik Kowald Milos Kravcik Elise Lavoué Marie Lefevre Dominique Lenne Marina Lepp Amna Liaqat Paul Libbrecht


TU Dresden, Germany Graz University of Technology, Austria DIPF, Leibniz Institute for Research and Information in Education, Germany Laboratoire d’InfoRmatique en Image et Systèmes d'information, France Open Universiteit, The Netherlands Open Universiteit, The Netherlands Universitat Pompeu Fabra, Spain Universidad Politécnica de Madrid, Spain Oslo Metropolitan University, Norway University of Neuchâtel, Switzerland The Open University, UK The Open University, UK The Open University, UK PPGCA, UNISINOS, Brazil Utrecht University and Open Universiteit, The Netherlands Open Universiteit, The Netherlands University of South Australia, Australia University of Eastern Finland, Finland University of Belgrade, Serbia University of Oxford, UK University of Oslo, Norway PH Heidelberg, Germany Aristotle University of Thessaloniki, Greece Tallinn University, Estonia University of Bergen, Norway Graz University of Technology, Austria RWTH Aachen University, Germany Open Universiteit, The Netherlands Jozef Stefan Institute, Slovenia Radboud University, The Netherlands Hochschule Fulda, Germay Tallinn University, Estonia Cyprus University of Technology, Cyprus University of South Australia, Australia Know-Center, Austria DFKI GmbH, Germay Université Jean Moulin Lyon 3, France Université Lyon 1, France Université de Technologie de Compiègne, France University of Tartu, Estonia University of Toronto, Canada IUBH Fernstudium, Germany


Andreas Lingnau Martin Llamas-Nistal Aurelio Lopez-Lopez Domitile Lourdeaux Margarida Lucas Ulrike Lucke Vanda Luengo George Magoulas Jorge Maldonado-Mahauad Nils Malzahn Carlos Martínez-Gaitero Alejandra Martínez-Monés Wannisa Matcha Manolis Mavrikis Agathe Merceron Vasileios Mezaris Christine Michel Konstantinos Michos Alexander Mikroyannidis Tanja Mitrovic Riichiro Mizoguchi Inge Molenaar Anders Morch Pedro Manuel Moreno-Marcos Mathieu Muratet Pedro J. Muñoz-Merino Rob Nadolski Petru Nicolaescu Stavros Nikou Nicolae Nistor Alexander Nussbaumer Jennifer Olsen Alejandro Ortega-Arranz Lahcen Oubahssi Viktoria Pammer-Schindler Sofia Papavlasopoulou Abelardo Pardo Cesare Pautasso Maxime Pedrotti

Ruhr West University of Applied Science, Germany Universidad de Vigo, Spain INAOE, Mexico CNRS, France University of Aveiro, Portugal University of Potsdam, Germany Sorbonne Université, France University of London, UK Universidad de Cuenca, Ecuador, Pontificia Universidad Católica de Chile, Chile Rhine-Ruhr Institute for Applied System Innovation e.V., Germany Escuelas Universitarias Gimbernat, Spain Universidad de Valladolid, Spain University of Edinburgh, UK UCL Knowledge Lab, UK Beuth University of Applied Sciences, Germany Information Technologies Institute, CERT, Greece Université de Lyon, France Universidad de Valladolid, Spain The Open University, UK University of Canterbury, UK Japan Advanced Institute of Science and Technology, Japan Radboud University, The Netherlands University of Oslo, Norway Universidad Carlos III de Madrid, Spain Sorbonne Université, France Universidad Carlos III de Madrid, Spain Open Universiteit, The Netherlands RWTH Aachen University, Germany University of Strathclyde, UK Ludwig Maximilian University of Munich, Germany Graz University of Technology, Austria École Polytechnique Fédérale de Lausanne, Switzerland Universidad de Valladolid, Spain LIUM, Le Mans Université, France Graz University of Technology, Austria Norwegian University of Science and Technology, Norway University of South Australia, Australia University of Lugano, Switzerland Verein zur Förderung eines Deutschen Forschungsnetzes e.V., Germany


Mar Perez-Sanagustin Donatella Persico Yvan Peter Niels Pinkwart Gerti Pishtari Elvira Popescu Francesca Pozzi Luis P. Prieto Ronald Pérez Álvarez Hans Põldoja Eyal Rabin Juliana Elisa Raffaghelli Gustavo Ramirez-Gonzalez Marc Rittberger Tiago Roberto Kautzmann Covadonga Rodrigo Maria Jesus Rodriguez Triana Jeremy Roschelle José A. Ruipérez Valiente Ellen Rusman Merike Saar Demetrios Sampson Eric Sanchez Patricia Santos Mohammed Saqr Petra Sauer Maren Scheffel Daniel Schiffner Andreas Schmidt Marcel Schmitz Jan Schneider Ulrik Schroeder Stefan Schweiger Yann Secq Karim Sehaba Paul Seitlinger Audrey Serna Sergio Serrano-Iglesias Shashi Kant Shankar


Université Paul Sabatier Toulouse III, France Istituto Tecnologie Didattiche, CNR, Italy Université de Lille, France Humboldt-Universität zu Berlin, Germany Tallinn University, Estonia University of Craiova, Romania Istituto Tecnologie Didattiche, CNR, Italy Tallinn University, Estonia Universidad de Costa Rica, Costa Rica Tallinn University, Estonia The Open University of Israel, Israel, and Open Universiteit, The Netherlands University of Florence, Italy Universidad del Cauca, Colombia DIPF, Leibniz Institute for Researach and Information in Education, Germany Universidade do Vale do Rio dos Sinos, Brazil Universidad Nacional de Educación a Distancia, Spain Tallinn University, Estonia Digital Promise, USA Universidad de Murcia, Spain Open Universiteit, The Netherlands Tallinn University, Estonia Curtin University, Australia Université Fribourg, Switzerland Universitat Pompeu Fabra, Spain University of Eastern Finland, Finland Beuth University of Applied Sciences, Germany Open Universiteit, The Netherlands DIPF, Leibniz Institute for Research and Information in Education, Germany Karlsruhe University of Applied Sciences, Germany Zuyd Hogeschool, The Netherlands DIPF, Leibniz Institute for Research and Information in Education, Germany RWTH Aachen University, Germany Bongfish GmbH, Austria Université de Lille, France Laboratoire d’InfoRmatique en Image et Systèmes d’information, Université Lumière Lyon 2, France Tallinn University, Estonia Laboratoire d’InfoRmatique en Image et Systèmes d’information, France Universidad de Valladolid, Spain Tallinn University, Estonia


Kshitij Sharma Tanmay Sinha Sergey Sosnovsky Marcus Specht Srinath Srinivasa Tim Steuer Slavi Stoyanov Alexander Streicher Bernardo Tabuenca Kairit Tammets Stefaan Ternier Stefan Thalmann Paraskevi Topali Richard Tortorella Lucia Uguina Peter Van Rosmalen Olga Viberg Markel Vigo Sara Villagrá-Sobrino Cathrin Vogel Joshua Weidlich Armin Weinberger Professor Denise Whitelock Fridolin Wild Nikoletta Xenofontos Amel Yessad

Additional Reviewers: Anwar, Muhammad; Berns, Anke; Ebner, Markus; Ehlenz, Matthias; Halbherr, Tobias; Koenigstorfer, Florian; Kothiyal, Aditi; Liaqat, Daniyal; Liaqat, Salaar; Müllner, Peter; Ponce Mendoza, Ulises; Raffaghelli, Juliana; Rodriguez, Indelfonso; Zeiringer, Johannes

Norwegian University of Science and Technology, Norway ETH Zurich, Switzerland Utrecht University, The Netherlands Delft University of Technology and Open Universiteit, The Netherlands International Institute of Information Technology, Bangalore, India Technical University of Darmstadt, Germany Open University, The Netherlands Fraunhofer IOSB, Germany Universidad Politécnica de Madrid, Spain Tallinn University, Estonia Open Universiteit, The Netherlands University of Graz, Austria Universidad de Valladolid, Spain University of North Texas, USA IMDEA Networks, Spain Maastricht University, The Netherlands KTH Royal Institute of Technology, Sweden The University of Manchester, UK Universidad de Valladolid, Spain FernUniversität in Hagen, Germany Heidelberg University of Education, Germany Saarland University, Germany The Open University, UK The Open University, UK University of Cyprus, Cyprus Sorbonne Université, France

Contents

Exploring Artificial Jabbering for Automatic Text Comprehension Question Generation (Tim Steuer, Anna Filighera, and Christoph Rensing) .... 1
Digital Value-Adding Chains in Vocational Education: Automatic Keyword Extraction from Learning Videos to Provide Learning Resource Recommendations (Cleo Schulten, Sven Manske, Angela Langner-Thiele, and H. Ulrich Hoppe) .... 15
Human-Centered Design of a Dashboard on Students’ Revisions During Writing (Rianne Conijn, Luuk Van Waes, and Menno van Zaanen) .... 30
An Operational Framework for Evaluating the Performance of Learning Record Stores (Chahrazed Labba, Azim Roussanaly, and Anne Boyer) .... 45
Does an E-mail Reminder Intervention with Learning Analytics Reduce Procrastination in a Blended University Course? (Iryna Nikolayeva, Amel Yessad, Bertrand Laforge, and Vanda Luengo) .... 60
Designing an Online Self-assessment for Informed Study Decisions: The User Perspective (L. E. C. Delnoij, J. P. W. Janssen, K. J. H. Dirkx, and R. L. Martens) .... 74
What Teachers Need for Orchestrating Robotic Classrooms (Sina Shahmoradi, Aditi Kothiyal, Jennifer K. Olsen, Barbara Bruno, and Pierre Dillenbourg) .... 87
Assessing Teacher’s Discourse Effect on Students’ Learning: A Keyword Centrality Approach (Danner Schlotterbeck, Roberto Araya, Daniela Caballero, Abelino Jimenez, Sami Lehesvuori, and Jouni Viiri) .... 102
For Learners, with Learners: Identifying Indicators for an Academic Advising Dashboard for Students (Isabel Hilliger, Tinne De Laet, Valeria Henríquez, Julio Guerra, Margarita Ortiz-Rojas, Miguel Ángel Zuñiga, Jorge Baier, and Mar Pérez-Sanagustín) .... 117
Living with Learning Difficulties: Two Case Studies Exploring the Relationship Between Emotion and Performance in Students with Learning Difficulties (Styliani Siouli, Stylianos Makris, Evangelia Romanopoulou, and Panagiotis P. D. Bamidis) .... 131
Learnersourcing Quality Assessment of Explanations for Peer Instruction (Sameer Bhatnagar, Amal Zouaq, Michel C. Desmarais, and Elizabeth Charles) .... 144
Using Diffusion Network Analytics to Examine and Support Knowledge Construction in CSCL Settings (Mohammed Saqr and Olga Viberg) .... 158
Supporting Second Language Learners’ Development of Affective Self-regulated Learning Skills Through the Use and Design of Mobile Technology (Olga Viberg, Anna Mavroudi, and Yanwen Ma) .... 173
We Know What You Did Last Semester: Learners’ Perspectives on Screen Recordings as a Long-Term Data Source for Learning Analytics (Philipp Krieter, Michael Viertel, and Andreas Breiter) .... 187
Teaching Simulation Literacy with Evacuations: Concept, Technology, and Material for a Novel Approach (Andre Greubel, Hans-Stefan Siller, and Martin Hennecke) .... 200
Design of Conversational Agents for CSCL: Comparing Two Types of Agent Intervention Strategies in a University Classroom (Konstantinos Michos, Juan I. Asensio-Pérez, Yannis Dimitriadis, Sara García-Sastre, Sara Villagrá-Sobrino, Alejandro Ortega-Arranz, Eduardo Gómez-Sánchez, and Paraskevi Topali) .... 215
Exploring Human–AI Control Over Dynamic Transitions Between Individual and Collaborative Learning (Vanessa Echeverria, Kenneth Holstein, Jennifer Huang, Jonathan Sewall, Nikol Rummel, and Vincent Aleven) .... 230
Exploring Student-Controlled Social Comparison (Kamil Akhuseyinoglu, Jordan Barria-Pineda, Sergey Sosnovsky, Anna-Lena Lamprecht, Julio Guerra, and Peter Brusilovsky) .... 244
New Measures for Offline Evaluation of Learning Path Recommenders (Zhao Zhang, Armelle Brun, and Anne Boyer) .... 259
Assessing the Impact of the Combination of Self-directed Learning, Immediate Feedback and Visualizations on Student Engagement in Online Learning (Bilal Yousuf, Owen Conlan, and Vincent Wade) .... 274
CGVis: A Visualization-Based Learning Platform for Computational Geometry Algorithms (Athanasios Voulodimos, Paraskevas Karagiannopoulos, Ifigenia Drosouli, and Georgios Miaoulis) .... 288
How to Design Effective Learning Analytics Indicators? A Human-Centered Design Approach (Mohamed Amine Chatti, Arham Muslim, Mouadh Guesmi, Florian Richtscheid, Dawood Nasimi, Amin Shahin, and Ritesh Damera) .... 303
Emergency Remote Teaching: Capturing Teacher Experiences in Spain with SELFIE (Laia Albó, Marc Beardsley, Judit Martínez-Moreno, Patricia Santos, and Davinia Hernández-Leo) .... 318
Utilising Learnersourcing to Inform Design Loop Adaptivity (Ali Darvishi, Hassan Khosravi, and Shazia Sadiq) .... 332
Fooling It - Student Attacks on Automatic Short Answer Grading (Anna Filighera, Tim Steuer, and Christoph Rensing) .... 347
Beyond Indicators: A Scoping Review of the Academic Literature Related to SDG4 and Educational Technology (Katy Jordan) .... 353
Pedagogical Underpinnings of Open Science, Citizen Science and Open Innovation Activities: A State-of-the-Art Analysis (Elisha Anne Teo and Evangelia Triantafyllou) .... 358
Knowledge-Driven Wikipedia Article Recommendation for Electronic Textbooks (Behnam Rahdari, Peter Brusilovsky, Khushboo Thaker, and Jordan Barria-Pineda) .... 363
InfoBiTS: A Mobile Application to Foster Digital Competencies of Senior Citizens (Svenja Noichl and Ulrik Schroeder) .... 369
Student Awareness and Privacy Perception of Learning Analytics in Higher Education (Stian Botnevik, Mohammad Khalil, and Barbara Wasson) .... 374
User Assistance for Serious Games Using Hidden Markov Model (Vivek Yadav, Alexander Streicher, and Ajinkya Prabhune) .... 380
Guiding Socio-Technical Reflection of Ethical Principles in TEL Software Development: The SREP Framework (Sebastian Dennerlein, Christof Wolf-Brenner, Robert Gutounig, Stefan Schweiger, and Viktoria Pammer-Schindler) .... 386
Git4School: A Dashboard for Supporting Teacher Interventions in Software Engineering Courses (Jean-Baptiste Raclet and Franck Silvestre) .... 392
Exploring the Design and Impact of Online Exercises for Teacher Training About Dynamic Models in Mathematics (Charlie ter Horst, Laura Kubbe, Bart van de Rotten, Koen Peters, Anders Bouwer, and Bert Bredeweg) .... 398
Interactive Concept Cartoons: Exploring an Instrument for Developing Scientific Literacy (Patricia Kruit and Bert Bredeweg) .... 404
Quality Evaluation of Open Educational Resources (Mirette Elias, Allard Oelen, Mohammadreza Tavakoli, Gábor Kismihok, and Sören Auer) .... 410
Designing Digital Activities to Screen Locomotor Skills in Developing Children (Benoit Bossavit and Inmaculada Arnedillo-Sánchez) .... 416
Towards Adaptive Social Comparison for Education (Sergey Sosnovsky, Qixiang Fang, Benjamin de Vries, Sven Luehof, and Fred Wiegant) .... 421
Simulation Based Assessment of Epistemological Beliefs About Science (Melanie E. Peffer and Tessa Youmans) .... 427
An Approach to Support Interactive Activities in Live Stream Lectures (Tommy Kubica, Tenshi Hara, Iris Braun, and Alexander Schill) .... 432
Educational Escape Games for Mixed Reality (Ralf Klamma, Daniel Sous, Benedikt Hensen, and István Koren) .... 437
Measuring Learning Progress for Serving Immediate Feedback Needs: Learning Process Quantification Framework (LPQF) (Gayane Sedrakyan, Sebastian Dennerlein, Viktoria Pammer-Schindler, and Stefanie Lindstaedt) .... 443
Data-Driven Game Design: The Case of Difficulty in Educational Games (Yoon Jeon Kim and Jose A. Ruipérez-Valiente) .... 449
Extracting Topics from Open Educational Resources (Mohammadreza Molavi, Mohammadreza Tavakoli, and Gábor Kismihók) .... 455
Supporting Gamification with an Interactive Gamification Analytics Tool (IGAT) (Nadja Zaric, Manuel Gottschlich, Rene Roepke, and Ulrik Schroeder) .... 461
OpenLAIR an Open Learning Analytics Indicator Repository Dashboard (Atezaz Ahmad, Jan Schneider, and Hendrik Drachsler) .... 467
CasualLearn: A Smart Application to Learn History of Art (Adolfo Ruiz-Calleja, Miguel L. Bote-Lorenzo, Guillermo Vega-Gorgojo, Sergio Serrano-Iglesias, Pablo García-Zarza, Juan I. Asensio-Pérez, and Eduardo Gómez-Sánchez) .... 472
Applying Instructional Design Principles on Augmented Reality Cards for Computer Science Education (Josef Buchner and Michael Kerres) .... 477
Extending Patient Education with CLAIRE: An Interactive Virtual Reality and Voice User Interface Application (Richard May and Kerstin Denecke) .... 482
Author Index .... 487

Exploring Artificial Jabbering for Automatic Text Comprehension Question Generation

Tim Steuer, Anna Filighera, and Christoph Rensing
Technical University of Darmstadt, Darmstadt, Germany
{tim.steuer,anna.filighera,christoph.rensing}@kom.tu-darmstadt.de

Abstract. Many educational texts lack comprehension questions, and authoring them consumes time and money. Thus, in this article, we ask ourselves to what extent artificial jabbering text generation systems can be used to generate textbook comprehension questions. Novel machine learning-based text generation systems jabber on a wide variety of topics with deceptively good performance. To expose the generated texts as such, one often has to understand the actual topic the system jabbers about. Hence, confronting learners with generated texts may cause them to question their level of knowledge. We built a novel prototype that generates comprehension questions given arbitrary textbook passages. We discuss the strengths and weaknesses of the prototype quantitatively and qualitatively. While our prototype is not perfect, we provide evidence that such systems have great potential as question generators, and we identify the most promising starting points that may lead to (semi-)automated generators supporting textbook authors and self-studying.

Keywords: Text comprehension · Language models · Automatic question generation · Educational technology

1 Motivation

Reading, alongside direct verbal communication, is one of the most prevalent forms of learning. For every new subject we encounter in our educational careers, highly motivated educators publish textbooks to help us understand. Even after we finish our formal education, the modern knowledge society is based on lifelong informal learning, in which learners, in the absence of teachers, often devote themselves to textual learning resources. In both the formal and the informal scenario, gaining only a surface-level understanding is likely not enough. If we study a physics or history textbook to pass an exam, for example, deeper understanding of the topic is crucial.

However, reading is difficult, and to deeply comprehend a text, passive consumption is insufficient [7,25]. Instead, readers need to actively reflect on the information provided in the text to reach a deep understanding [7,25]. A well-explored method to actively engage readers is posing questions about what they have read [1,25]. Yet, posing good questions consumes time and money, and thus many texts encountered by learners either contain only a few questions at the end of a chapter or lack questions altogether.

Educational automatic question generation investigates approaches to generate meaningful questions about texts automatically, reducing the necessity for manually generated questions. It relies either on machine learning-based approaches, which excel in question variety and expressiveness but pose mostly factual questions [6], or on rule-based approaches, which can pose comprehension questions tailored to a specific purpose (e.g. [17]) but lack expressiveness and variety [32].

This article investigates a novel machine learning-based question generation approach seeking to generate comprehension questions with high variety and expressiveness. We hereby rely on two main ideas. First, research in the educational domain has investigated learning from errors [19], indicating that explaining why a statement or solution is faulty may foster learning, conceptual understanding, and far transfer [10]. Second, we rely on the artificial jabbering of state-of-the-art neural text generators, which are capable of extrapolating a given text with high structural consistency and in a way that often looks deceptively real to humans. We seek to explore whether this jabbering can be conditioned in such a way that it generates erroneous examples from textbook paragraphs. Presented with such a statement, learners need to justify whether it is true or false (see Fig. 1).

Fig. 1. Example usage of the proposed system: a textbook chapter excerpt (here, a passage on virtual private networks), the generated discussion prompt ("Is the following statement true/false? Please discuss briefly why it is true or false: VPNs have the disadvantage of requiring the VPN tunnel to be established before the Internet can be accessed."), and the student discussing the answer.

This work comprises three main contributions:
1. We present the idea of leveraging artificial jabbering for automatic text comprehension question generation and introduce a prototypical generator.
2. We provide a quantitative and qualitative evaluation of the strengths and weaknesses of such an approach.
3. We distill the main challenges for future work based on an in-depth error analysis of our prototypical generator.

2 Related Work

2.1 Learning from Erroneous Examples

When learning with erroneous examples, students are confronted with a task and its faulty solution and have to explain why it is wrong (e.g. [30]). The underlying theoretical assumptions are that erroneous examples induce a cognitive conflict in students and thus support conceptual change [24], e.g. by pointing out typical misconceptions [29]. It has been shown that erroneous examples are beneficial for learning in a variety of domains such as mathematics [10], computer science [4] or medicine [14]. Also, learners confronted with erroneous examples especially improve deeper measures of learning such as conceptual understanding and far transfer [24]. However, some studies have found that erroneous examples only foster learning when learners receive enough feedback [14,30] and have sufficient prior knowledge [30].

2.2 Neural Text and Question Generation

With the rise of high-capacity machine-learning models, language generation has shifted towards pretraining [27]. Trained on huge datasets, these models provide state-of-the-art results on a wide variety of natural language generation tasks [5,23], such as dialog response generation [22] or abstractive summarization [26]. Novel models like GPT-2 [23] are capable of extrapolating a given text with high structural consistency and in a way that looks deceptively real to humans. They copy the given text's writing style and compose texts which seem to make sense at first glance. Fine-tuning the model even increased the humanness of the generated texts [28]. Research on the credibility of such generated texts found that hand-picked generated news texts were judged credible around 66% of the time, even when the model was not fine-tuned on news articles [28]. Another study found that human raters could detect generated texts in 71.4% of the cases, with two raters often disagreeing on whether a text was fake or not [13]. These findings started a debate in the natural language generation community about whether the models' generation capabilities are too easy to misuse and the models should therefore not be released anymore [28]. Furthermore, such models are able to generate poems [16] and to rewrite stories to incorporate counterfactual events [21].

Besides these open text generation models, special models for question generation exist. They evolved from baseline sequence-to-sequence architectures [6] into several advanced neural architectures (e.g. [5,33]) with different facets such as taking the desired answers into account [34] or being difficulty-aware [8]. Although these systems work well in the general case, they mainly focus on the generation of factual questions [6,20,35]. Thus, although their expressiveness and domain independence are impressive, the educational domain still most often uses template-based generators [15]. These template-based approaches are often able to generate comprehension questions but lack expressiveness and rely on expert rules limiting them to a specific purpose in a specific domain.
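To make the notion of "artificial jabbering" concrete, the sketch below shows how an off-the-shelf GPT-2 checkpoint continues an arbitrary prompt with nucleus sampling. It is only an illustration of the behaviour described above, not the code used in the cited works: the Hugging Face transformers library, the small "gpt2" checkpoint, and the sampling settings are assumptions made here.

```python
# Illustrative sketch: open-ended text continuation ("jabbering") with GPT-2.
# The Hugging Face `transformers` port stands in for the original OpenAI
# release; checkpoint name and sampling settings are assumptions.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Thermal equilibria are defined as"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, do_sample=True, top_p=0.9,
                        max_length=60, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# The continuation copies the style of the prompt and usually reads
# plausibly, but it is not guaranteed to be factually correct.
```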

3 An Experimental Automatic Erroneous Example Generator

To experiment with the idea of using artificial jabbering for improving text comprehension, we propose the following text generation task. The input is a text passage of a learning resource from an arbitrary domain, with a length of 500–1000 words, a range used in psychological studies that found text-accompanying questions to be helpful [1,31]. The output is a generated text comprehension question about the given text passage, asking learners to explain why a given statement is true or false. We aim to generate high-quality questions of good grammaticality, containing educationally valuable claims and having the right difficulty for discussion. Some technical challenges are inherent in the described task. Every approach must tackle discussion candidate selection, as this determines what the main subject of the generated text will be. Also, every approach must provide the neural text generator with a conditioning context to ensure that the generated text is in the intended domain. Finally, every approach must render the actual text with some sort of open-domain generator. These subtasks are active fields of research, and a huge variety of possible approaches with different strengths and weaknesses exists. Yet, our first aim is to evaluate the general viability of such an approach. Thus, we do not experiment with different combinations of sub-components; instead, our generator relies on well-tested, domain-independent, general-purpose algorithms for the different subtasks (see Fig. 2).

Fig. 2. Architecture of the automatic text comprehension question generator. The final output is a justification statement that is combined with a prompt to form the actual text comprehension question.


First, for the discussion candidate selection, we make the simplifying assumption that good discussion candidates are the concepts that are characteristic of the text. To understand why this assumption is a simplification, consider a text about Newtonian physics where a few sentences discuss the common misconception that heavier objects fall faster than lighter objects. This discussion is unlikely to involve any special keywords and thus will not be selected as input to the generator. Yet, it might be very fruitful to generate erroneous examples based on these misconceptions. However, to test our general idea of generating erroneous examples, the simplification should be sufficient: we might select fewer inputs, but those we select should be important. Furthermore, this assumption allows us to rely on state-of-the-art keyphrase extraction algorithms. Considering that the inputs are texts from a variety of domains, the keyphrase selection step needs to be unsupervised and relatively robust to domain changes. Therefore, we apply the YAKE keyphrase extraction algorithm [3], which has been shown to perform consistently on a large variety of different datasets and domains [2]. Stopwords are removed before running keyphrase extraction, and the algorithm's window size is configured to two.

Second, for selecting the conditioning context, a short text that already comprises statements about the subject is needed. Suppose the discussion subject is "Thermal Equilibrium" in a text about physics. For the generator to produce interesting statements, it must receive sentences from the text that discuss thermal equilibria. Thus, we extract up to three sentences comprising the keyphrase by sentence-tokenizing the text (using NLTK 3.4.5) and concatenating the sentences containing the keyphrase.

Third, we need to generate a justification statement as the core for the text comprehension question. We use the pretrained GPT-2 774M parameter model (https://github.com/openai/gpt-2) and apply it similarly to Radford et al. [23] by using plain text for the model conditioning. The plain text starts with the sentences from the conditioning context; to generate the actual justification statement, a discussion starter is appended. It begins with the pluralized discussion subject followed by a predefined phrase, allowing us to choose the type of justification statement the model will generate. For instance, if "Thermal Equilibrium" is our discussion subject, the discussion starter to be completed may be "Thermal equilibria are defined as" or "Thermal equilibria can be used to", depending on the type of faulty statement we aim for. The resulting plain text is given to GPT-2 for completion. To prevent the model from sampling degenerated text, we apply nucleus sampling [12] with top-p = 0.55 and restrict the output length to 70 words.

Finally, we extract the justification statement from the generated text and combine it with a generic prompt to discuss it, resulting in the final text comprehension question. Note that we do not know whether the generated justification statement is actually true or false.

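The following sketch illustrates how the three steps described above can be wired together. It is not the authors' implementation: the paper builds on the original OpenAI GPT-2 release, whereas this sketch substitutes the Hugging Face transformers port, the yake package for keyphrase extraction, and the inflect package for pluralization; the exact prompt assembly, parameter names, and function names are therefore assumptions.

```python
# Illustrative sketch of the pipeline described above, NOT the authors' code.
# Assumptions: the Hugging Face "gpt2-large" (774M) checkpoint stands in for
# the original OpenAI release, `yake` for keyphrase extraction, `inflect`
# for pluralization, and NLTK for sentence tokenization
# (run nltk.download("punkt") once before use).
import inflect
import nltk
import yake
from transformers import GPT2LMHeadModel, GPT2Tokenizer

STARTERS = ["have the disadvantage", "have the advantage",
            "are defined as", "can be used to"]

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")
pluralizer = inflect.engine()


def discussion_candidates(passage, top_n=3):
    # Characteristic keyphrases of the passage (YAKE, window size two).
    extractor = yake.KeywordExtractor(lan="en", windowsSize=2, top=top_n)
    return [kw for kw, _ in extractor.extract_keywords(passage)]


def conditioning_context(passage, keyphrase, max_sentences=3):
    # Up to three sentences of the passage that mention the keyphrase.
    sentences = nltk.sent_tokenize(passage)
    hits = [s for s in sentences if keyphrase.lower() in s.lower()]
    return " ".join(hits[:max_sentences])


def comprehension_question(passage, keyphrase, starter):
    # Discussion starter: pluralized subject + predefined phrase.
    subject = pluralizer.plural_noun(keyphrase) or keyphrase
    opener = f"{subject.capitalize()} {starter}"
    prompt = f"{conditioning_context(passage, keyphrase)} {opener}"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, do_sample=True, top_p=0.55,
                            max_length=input_ids.shape[1] + 70,
                            pad_token_id=tokenizer.eos_token_id)
    continuation = tokenizer.decode(output[0][input_ids.shape[1]:],
                                    skip_special_tokens=True)
    # Keep the first generated sentence as the justification statement.
    statement = opener + continuation.split(".")[0] + "."
    return ("Is the following statement true/false? Please discuss briefly "
            f"why it is true or false: {statement}")
```

For the VPN passage of Fig. 1, discussion_candidates might return phrases such as "virtual private network", and comprehension_question would then yield a prompt similar to the one shown there.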

4 Research Question and Methodology

4.1 Research Question

We evaluate our generation approach on educational texts from a variety of domains, focusing on the following research question:

RQ: To what extent are we able to generate useful text comprehension statements in a variety of domains given short textbook passages?

Looking at the related work, a fraction of the generated statements should already be usable without any adjustments, while many other statements will need adjustment. We conduct a quantitative and a qualitative evaluation. Our procedure includes a quantitative expert survey, a qualitative error analysis to determine useful error categories, and a qualitative analysis of the already usable results to better describe their features.

4.2 Methodology

Quantitatively, a total of 120 text comprehension questions coming from ten educational texts are annotated by ten domain experts who have been teaching at least one university lecture in a similar domain. Texts are equally distributed across five different domains: Computer Science, Machine Learning, Networking, Physics and Psychology. Twelve text comprehension questions are generated for every text. They are based on three extracted discussion candidates and four different discussion starters, which we hypothesized to represent intermediate or deep questions according to Graesser et al. [9]. The discussion starters are: "X has the disadvantage", "X has the advantage", "X is defined as" and "X is used to", where X is the discussion candidate. Every question is rated by two experts who first read the educational text that was used to generate the question and then rate it on five five-point Likert items regarding grammatical correctness, relatedness to the source material, factual knowledge involved when answering the question, conceptual knowledge involved when answering the question, and overall usefulness for learning. Before annotating, every expert saw a short definition of every scale, clarifying their meaning. Additionally, experts can provide qualitative remarks for every question through a free-text field. For the quantitative analysis the ratings were averaged across experts.

We use the quantitatively collected data to guide our qualitative analysis of the research question. To carry out our in-depth error analysis, we consider a statement useless for learning if it scores lower than three on the usefulness scale. This choice was made after qualitatively reviewing a number of examples. We use inductive qualitative content analysis [18] to deduce meaningful error categories for the statements and to categorize the statements accordingly. Our search for meaningful error categories is hereby guided by the given task formulation and its sub-components. Furthermore, the useful generated statements (usefulness ≥ 3) are analyzed. We look at the effects of the different discussion starters and how they influence the knowledge involved in answering the generated questions.
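As a minimal illustration of the quantitative analysis described above, the snippet below averages the two expert ratings per question and computes the share of statements reaching the usefulness threshold of three. The flat CSV layout and the column names are hypothetical; the paper does not describe how the ratings were stored.

```python
# Minimal sketch of the rating aggregation described above; the file name
# and column names are hypothetical.
import pandas as pd

SCALES = ["grammar", "relatedness", "factual", "conceptual", "usefulness"]

# One row per (question_id, expert) with the five Likert ratings (1-5).
ratings = pd.read_csv("expert_ratings.csv")

# Average the two expert ratings for every generated question and scale.
per_question = ratings.groupby("question_id")[SCALES].mean()

# Share of statements considered useful for learning (usefulness >= 3).
useful = (per_question["usefulness"] >= 3).mean()
print(f"{useful:.1%} of the statements reach a usefulness rating of 3 or higher")
```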

5 Results

5.1 Quantitative Overview

The quantitative survey results indicate that many of the generated statements are grammatically sound and connected to the text, but only slightly useful for learning (see Fig. 3). Furthermore, most questions involve some factual knowledge and deeper comprehension, yet both scores vary greatly. Breaking down the different rating scores by domain or discussion starter did not reveal large differences. By looking at various examples of different ratings (see Table 1), we found that a usefulness score of three or larger is indicative of some pedagogical value. With minor changes, such questions could be answered and discussed by experts, although their discussion is probably often not the perfect learning opportunity. In total, 39 of the 120 statements have a usefulness rating of 3 or larger (32.5%), in contrast to 81 statements rated lower (67.5%).

Fig. 3. Overview of the quantitative ratings for the generated statements without any human filtering. Scores range from 1 to 5, where 5 is the best achievable rating. The whiskers indicate 1.5 times the interquartile range and the black bar is the median.
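For reference, the aggregation behind these numbers is straightforward. The sketch below assumes a hypothetical flat export of the expert survey (one row per statement and expert with a usefulness column), which is not the actual file format used here, and computes the mean rating per statement and the share of statements reaching the threshold of three.

```python
import pandas as pd

# Hypothetical export of the expert survey: one row per (statement, expert) pair.
ratings = pd.read_csv("expert_ratings.csv")  # assumed columns: statement_id, expert_id, usefulness

# Average the two expert ratings per statement, as done for the quantitative analysis.
mean_usefulness = ratings.groupby("statement_id")["usefulness"].mean()

# Share of statements reaching the usefulness threshold (>= 3).
useful = mean_usefulness >= 3
print(f"{useful.sum()} of {len(useful)} statements ({useful.mean():.1%}) reach the threshold")
```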

5.2 Qualitative Error Analysis

While conducting the qualitative error analysis, the following main error categories were deduced. Keyword inappropriate means that the discussion candidate was not appropriate for the text because the keyword extraction algorithm selected a misleading or very general key term. Keyword incomplete means that the discussion candidate would have been good if it had comprised additional terms. For example, in physics, the discussion candidate sometimes was “Equilibrium” instead of “Thermal Equilibrium”. Platitude means that the generated statement was a generic platitude and thus not helpful. Hardly discussable means that the statement was either too vague or too convoluted, making it hard to write a good justification. Finally, statement too easy means that students could answer it by relying on common sense alone.

Table 1. Examples of differently rated generated statements (higher = better).

Usefulness rating | Example statement
1 | Fastest possible machines have the disadvantage of being more expensive to build and maintain
2 | Prior knowledges have the advantage that they are easy to measure and easy to measure the causal role of
3 | Knowledge bases can be used to test the performance of models, and to improve the performance of inference engines
4 | Von neumann architectures have the advantage of being able to process a wide range of instructions at the same time, making them highly scalable
5 | Vpns have the disadvantage of being difficult to set up and maintain, and they can be compromised by bad actors

Table 2. The different error categories and their distribution.

Inappropriate keyword | Incomplete keyword | Platitude | Hardly discussable | Statement too easy
43 | 6 | 9 | 11 | 12

The distribution of the different error categories is heavily skewed towards keyword errors (see Table 2). The two keyword-based errors account for 49, or roughly 60%, of the errors. Furthermore, statements generated from faulty keyword selection mostly have a usefulness rating of one. The other error categories are almost equally distributed and are most often rated with a usefulness score of two. The platitude case mostly stems from unnaturally combining a discussion candidate with a discussion starter, resulting in a very generic completion of the sentence by the generator. For instance, if the generator has to complete the sentence “Classical conditionings have the disadvantage ...” it continues with “...of being costly and slow to develop”. The remaining error categories have no clear cause.

Besides the error analysis, annotators left some remarks about the erroneous statements. Two annotators remarked on various occasions that the first part of the sentence (discussion candidate + discussion starter) is incomprehensible and thus the whole statement is worthless. One annotator remarked that missing words in the keyword led to a bad rating for the statement; the keyword was, for example, “knowledge” instead of “knowledge base”. Furthermore, one annotator remarked that a statement did not have enough discussion potential. These comments are in line with our deduced error categories for keyword errors.

Fig. 4. Overview of the quantitative ratings for the generated statements with a usefulness rating of three or higher. Scores range from 1 to 5, where 5 is the best achievable rating. The whiskers indicate 1.5 times the interquartile range and the black bar is the median.

5.3 Quality Characteristics of the Useful Statements

The 39 statements with a usefulness rating of three or higher also score well on the other factors (see Fig. 4). In particular, the involved factual knowledge and deeper comprehension clearly increase. Reviewing these statements shows that they are not simply paraphrases of facts stated in the text. Thus, learners answering the corresponding question cannot simply rely on keyword spotting but need to think about the actual content of the text. Furthermore, the generated statements use technical terminology adequately. Moreover, the different discussion starters play an important role as they lead to different types of statements. When generating with the definition starter, the generator rephrases the definition of a discussion candidate in “its own words”. As a result, these definitions often lack important aspects or contain faulty claims (see Table 3). Thus, to explain why the definition is wrong, learners have to compare and contrast their previous knowledge with the generated definition. The usage starter leads to statements that force learners to transfer the knowledge learnt to new situations (see Table 3). The usage described in the generated statements is normally not mentioned in the text, but can often be deduced from the knowledge provided in the text. The advantage and disadvantage discussion starters require learners to think not only about the discussion candidate but also about similar concepts and solutions, and to compare them (see Table 3). Otherwise, learners cannot tell whether the stated advantage or disadvantage is specific to the discussed concept.

Table 3. Highly rated examples of the different types of statements resulting from different discussion starters.

Discussion starter | Domain | Example statement
Definition | Machine learning | Knowledge bases are defined as data structures that store knowledge and act as a long-term memory
(Dis)advantage | Psychology | Conditionings have the advantage of being simple and universal, which makes them ideal for studies of complex behavior
Usage | Networking | Hosts can be used to forward packets between hosts

Finally, one annotator also provided qualitative remarks for the good statements. These include remarks that the generated statements are helpful but could often be improved by using different discussion starters depending on the domain (e.g., speaking of the advantage of a physical concept is odd). It was also highlighted that the statements cannot simply be answered by copying information from the text, and that thinking about the definition discussion starter sometimes resulted in the annotator checking a textbook to refresh some rusty knowledge.

6 Discussion and Future Work

Concerning our research question, we can say that roughly a third of the statements have some educational value. This is in line with the related work, which reports between 29% and 66% deceptively real statements [13,28]. Yet, even lower numbers of valuable statements can be beneficial. If we do not generate questions directly for the reader, but for textbook authors for further review, the approach can be a source of creative ideas and may reduce the authoring effort. In particular, such systems could be combined with question ranking approaches similar to Heilman et al. [11] to only recommend the most promising candidates.

Furthermore, there is more to our research question than just this quantitative view, and looking at our qualitative results reveals interesting characteristics of the well-generated statements. First, they are not the typical factual Wh-questions that ask for a simple fact or connection directly stated in the text. Therefore, they often need a deeper understanding of the subject matter to be answered correctly. While this can be a benefit, we have to keep in mind that our annotators were experts, and thus drawing connections between the text-inherent knowledge and previously learned subject knowledge might be too difficult for some learners, as also remarked by the annotators. Second, depending on the discussion starter used, we can generate different kinds of useful questions. Our four different discussion starters generate questions requiring three different types of thinking. Depending on the discussion starter, the text comprehension questions involve comparison with previous knowledge, transfer of learned knowledge to new situations or implicit differentiation from similar concepts. This is an encouraging result because it shows that the generator’s expressiveness can be harnessed to create different types of tasks. Moreover, it supports the annotators’ remark that the questions in some domains could be improved by using different discussion starters, and that this is a worthwhile direction for future research. Third, although we work with a variety of domains and input texts from different authors, we were able to generate some valuable questions in every domain. Furthermore, the distribution of the different quality scores did not change much from domain to domain. Hence, our approach seems, at least to a degree, domain-independent. Yet, as currently only a third of the generated statements are usable, this should be reevaluated as soon as the general quality of the statements improves, because there might be a trade-off between domain-independence and statement quality.

In summary, our qualitative analysis of the well-generated questions provided evidence for their adaptability through different discussion starters and showed that they are well suited for text comprehension below the surface level, where learners have to think not only about facts but also have to integrate knowledge. Our error analysis allowed us to identify why we fail to generate interesting questions. The five error categories are promising starting points for future work. Most often, the approach failed because the keyword extraction step did not find a meaningful discussion candidate or extracted only parts of it. This is not surprising, as our goal was to test the general idea without fine-tuning any of the intermediate steps. General-purpose keyword extraction is similar but not identical to discussion candidate extraction. Hence, future work might explore specific educational keyword extraction algorithms and their effect on the generation approach. We assume that a fine-tuned educational keyword extraction algorithm will yield much more valuable statements if it is adaptable to different domains. Furthermore, as discussed in the results section, the platitude errors can be alleviated by not combining discussion starters and discussion candidates in an odd manner. Future work should, therefore, investigate the optimal use of discussion starters, taking into account different domains and discussion candidates. Finally, we have the hardly discussable and statement too easy error categories. While no clear cause of these errors could be identified, we assume that fine-tuning the neural generator with discussion-specific texts would reduce these types of errors. The related work has already shown that fine-tuning neural generators yields performance gains [28]. Yet, one has to be careful not to lose some of the expressiveness of the current model. Thus, future work might explore the relation between fine-tuning for the generation of justification statements and the change in the expressiveness of the model. One important issue thereby is that fine-tuning should allow the model to generate more erroneous statements comprising typical misconceptions of learners, as these are particularly beneficial for learning [29]. Besides the actual generation process, feedback on learners’ answers is crucial and should be explored further [29].

Finally, we would like to point out some limitations of the current study. First, our goal was to explore the general idea of generating such questions, not to find the optimal approach. Hence, this work only provides a lower bound for the quality of the generated questions, and other state-of-the-art keyword extraction algorithms and language generators might yield better performance. We think our work is valuable nevertheless, as it demonstrates a working prototype and its key advantages, and provides a strong baseline for future work. Second, while other combinations of algorithms might yield better performance, we provided an in-depth error analysis to inform the research community on what to focus on. Third, while asking experts to score the statements is common in research, it is unclear whether the experts’ opinion correlates with the actual perception of students. However, we assume that the experts’ assessment agrees with learners’ at least in tendency and that this is sufficient to assess the basic generation idea. Fourth, we are aware that qualitative analysis is always to some degree subjective. Yet, we believe that for complex novel approaches such as the one presented, it is an important and often neglected way of collecting valuable data about the inner workings of the approach.

To conclude, artificial jabbering of neural language models has the potential to foster text comprehension, as it has unique strengths not present in other neural question generators. The initial implementation in this work may be used as a tool for authors, providing them with ideas about what they could ask students. However, it is too error-prone to interact directly with learners, and we provided valuable pointers to improve this in future work.

References

1. Anderson, R.C., Biddle, W.B.: On asking people questions about what they are reading. Psychol. Learn. Motiv. Adv. Res. Theor. 9(C), 89–132 (1975). https://doi.org/10.1016/S0079-7421(08)60269-8
2. Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., Jatowt, A.: YAKE! keyword extraction from single documents using multiple local features. Inf. Sci. 509, 257–289 (2020)
3. Campos, R., Mangaravite, V., Pasquali, A., Jorge, A.M., Nunes, C., Jatowt, A.: YAKE! collection-independent automatic keyword extractor. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds.) ECIR 2018. LNCS, vol. 10772, pp. 806–810. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-76941-7_80


4. Chen, X., Mitrovic, T., Mathews, M.: Do novices and advanced students benefit from erroneous examples differently. In: Proceedings of 24th International Conference on Computers in Education (2016) 5. Dong, L., et al.: Unified language model pre-training for natural language understanding and generation. In: Advances in Neural Information Processing Systems, pp. 13042–13054 (2019) 6. Du, X., Shao, J., Cardie, C.: Learning to ask: neural question generation for reading comprehension. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 8, pp. 1342–1352. Association for Computational Linguistics, Stroudsburg (2017). https://doi.org/ 10.18653/v1/P17-1123. http://aclweb.org/anthology/P17-1123 7. Duke, N.K., Pearson, P.D.: Effective practices for developing reading comprehension. J. Educ. 189(1–2), 107–122 (2009) 8. Gao, Y., Bing, L., Chen, W., Lyu, M.R., King, I.: Difficulty controllable generation of reading comprehension questions. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 4968–4974. AAAI Press (2019) 9. Graesser, A., Rus, V., Cai, Z.: Question classification schemes. In: Proceedings of the Workshop on Question Generation, pp. 10–17 (2008) 10. Große, C.S., Renkl, A.: Finding and fixing errors in worked examples: can this foster learning outcomes? Learn. Instr. 17(6), 612–634 (2007) 11. Heilman, M., Smith, N.A.: Good question! Statistical ranking for question generation. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 609–617. Association for Computational Linguistics (2010) 12. Holtzman, A., Buys, J., Forbes, M., Choi, Y.: The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751 (2019) 13. Ippolito, D., Duckworth, D., Callison-Burch, C., Eck, D.: Human and automatic detection of generated text. arXiv preprint arXiv:1911.00650 (2019) 14. Kopp, V., Stark, R., Fischer, M.R.: Fostering diagnostic knowledge through computer-supported, case-based worked examples: effects of erroneous examples and feedback. Med. Educ. 42(8), 823–829 (2008) 15. Kurdi, G., Leo, J., Parsia, B., Sattler, U., Al-Emari, S.: A systematic review of automatic question generation for educational purposes. Int. J. Artif. Intell. Educ. 30, 1–84 (2019) 16. Liao, Y., Wang, Y., Liu, Q., Jiang, X.: GPT-based generation for classical chinese poetry. arXiv preprint arXiv:1907.00151 (2019) 17. Liu, M., Calvo, R.A., Rus, V.: G-asks: an intelligent automatic question generation system for academic writing support. Dialogue Discourse 3(2), 101–124 (2012) 18. Mayring, P.: Qualitative content analysis. A Companion Qual. Res. 1, 159–176 (2004) 19. Ohlsson, S.: Learning from performance errors. Psychol. Rev. 103(2), 241 (1996) 20. Pan, L., Lei, W., Chua, T.S., Kan, M.Y.: Recent advances in neural question generation. arXiv preprint arXiv:1905.08949 (2019) 21. Qin, L., Bosselut, A., Holtzman, A., Bhagavatula, C., Clark, E., Choi, Y.: Counterfactual story reasoning and generation. arXiv preprint arXiv:1909.04076 (2019) 22. Qin, L., et al.: Conversing by reading: contentful neural conversation with ondemand machine reading. arXiv preprint arXiv:1906.02738 (2019) 23. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019) 24. 
Richey, J.E., et al.: More confusion and frustration, better learning: the impact of erroneous examples. Comput. Educ. 139, 173–190 (2019)


25. Rouet, J.F., Vidal-Abarca, E.: Mining for meaning: cognitive effects of inserted questions in learning from scientific text. Psychol. Sci. Text Comprehension, pp. 417–436 (2002) 26. See, A., Liu, P.J., Manning, C.D.: Get to the point: summarization with pointergenerator networks. arXiv preprint arXiv:1704.04368 (2017) 27. See, A., Pappu, A., Saxena, R., Yerukola, A., Manning, C.D.: Do massively pretrained language models make better storytellers? In: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 843–861 (2019) 28. Solaiman, I., et al.: Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203 (2019) 29. Tsovaltzi, D., McLaren, B.M., Melis, E., Meyer, A.K.: Erroneous examples: effects on learning fractions in a web-based setting. Int. J. Technol. Enhanced Learn. 4(3–4), 191–230 (2012) 30. Tsovaltzi, D., Melis, E., McLaren, B.M., Meyer, A.-K., Dietrich, M., Goguadze, G.: Learning from erroneous examples: when and how do students benefit from them? In: Wolpers, M., Kirschner, P.A., Scheffel, M., Lindstaedt, S., Dimitrova, V. (eds.) EC-TEL 2010. LNCS, vol. 6383, pp. 357–373. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16020-2 24 31. Watts, G.H., Anderson, R.C.: Effects of three types of inserted questions on learning from prose. J. Educ. Psychol. 62(5), 387 (1971) 32. Willis, A., Davis, G., Ruan, S., Manoharan, L., Landay, J., Brunskill, E.: Key phrase extraction for generating educational question-answer pairs. In: Proceedings of the 6th (2019) ACM Conference on Learning@ Scale, pp. 1–10 (2019) 33. Zhang, S., Bansal, M.: Addressing semantic drift in question generation for semisupervised question answering. arXiv preprint arXiv:1909.06356 (2019) 34. Zhao, Y., Ni, X., Ding, Y., Ke, Q.: Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In: EMNLP, pp. 3901– 3910 (2018). http://aclweb.org/anthology/D18-1424 35. Zhou, Q., Yang, N., Wei, F., Tan, C., Bao, H., Zhou, M.: Neural question generation from text: a preliminary study. In: Huang, X., Jiang, J., Zhao, D., Feng, Y., Hong, Y. (eds.) NLPCC 2017. LNCS (LNAI), vol. 10619, pp. 662–671. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73618-1 56

Digital Value-Adding Chains in Vocational Education: Automatic Keyword Extraction from Learning Videos to Provide Learning Resource Recommendations

Cleo Schulten1, Sven Manske1, Angela Langner-Thiele2, and H. Ulrich Hoppe1

1 University Duisburg-Essen, Duisburg, Germany
{schulten,manske,hoppe}@collide.info
2 Evonik Digital GmbH, Evonik Industries AG, Essen, Germany
[email protected]

Abstract. The digital transformation of industry environments creates new demands but also opportunities for vocational education and training (VET). On the one hand, the introduction of new digital learning tools involves the risk of creating a digital parallel world. On the other hand, such tools have the potential to provide intelligent and contextualized access to information sources and learning materials. In this work, we explore approaches to provide such intelligent learning resource recommendations based on a specific learning context. Our approach aims at automatically analyzing learning videos in order to extract keywords, which in turn can be used to discover and recommend new learning materials relevant to the video. We have implemented this approach and investigated the user-perceived quality of the results in a real-world VET setting. The results indicate that the extracted keywords are in line with user-generated keywords and summarize the content of videos quite well. Also, the ensuing recommendations are perceived as relevant and useful.

Keywords: Video analysis · Content analysis · Keyword extraction · Learning resource recommendation · Vocational education and training · Apprenticeship

1 Introduction

During the last decades, digitalization has transformed many fields in industry and professional education and training. More recently, this trend has been discussed under the label of “Industry 4.0”. Apart from changes in social and economic processes, the digital transformation mainly aims at reaching a new level of process automation and integration in businesses (Hirsch-Kreinsen 2016), and comprises the adoption of digital technologies on the levels of processes, organizations, business domains and society (Parviainen et al. 2017). According to Henriette et al. (2015), digital transformation impacts the whole organization, which calls for major changes in habits and ways of working. Embedding new technology demands digital competences and digital literacy from employees, advancing skill profiles and posing new requirements for vocational education and training (“VET”). In the debate about digital transformation, Brynjolfsson and McAfee (2011) point out the role of education and the acquisition of digital competences: “We need not only organizational innovation, orchestrated by entrepreneurs, but also a second broad strategy: investment in the complementary human capital – the education and skills required to get the most out of our racing technology.”

In the chemical industry, the company Evonik Industries1 has recently equipped all their apprentices with digital tools (tablets) to support mobile learning using internet and communication technologies. The integration of digital hardware and software tools in educational contexts introduces new challenges but also bears several risks. In the process of adopting new digital technologies, particularly when pioneering digital initiatives inside an organization, digitalization might lead to a scattering of systems and to a heterogeneous information environment. To eliminate such digital parallel worlds, organizations might follow bottom-up approaches to align and synchronize these initiatives and technologies (Berghaus and Back 2017).

The study presented in this paper is part of a cooperation between university research and a large company in the chemical industry (Evonik). The overall project was mainly focused on the exploration and prototypical implementation of digitalization strategies in the company’s education of apprentices as a specific field of VET. In this endeavour, the creation and use of learning videos was chosen as a starting point in a participatory approach including apprentices as well as instructors. A guiding principle was to minimize the fragmentation of digital learning within the existing infrastructure. In this sense, the videos would be an entry point into a digital “educational workflow”, which would lead over to other learning activities and materials. For this purpose, we establish intelligent information access by automatically analysing learning videos in order to extract keywords from the videos using AI methods. Such keywords can then be used to recommend and discover new learning materials. We implemented this approach and evaluated the user perspective and the quality of such mechanisms in a real-world setting in the VET context at Evonik Industries. The work is framed by the following research questions:

– RQ1: Do learners perceive the presented approach of intelligent information access (i.e., the presented system “NIMROD”) as useful? This comprises (a) the perceived usefulness of the keyword extraction and (b) the suitability of the learning resource recommendations.
– RQ2: Do learner-generated keywords coincide with system-generated keywords?
– RQ3: How do the learners judge the overall quality of the recommendation mechanisms?

To answer these questions, we evaluated the developed system in a user study involving learners in vocational education, who are the main target group of this approach.

1 Evonik Industries AG (2020). https://corporate.evonik.com/en. Retrieved: 2020-02-26.


2 Background

Learning videos have become a promising and popular research topic during the last two decades. One of the main challenges in using learning videos is their meaningful orchestration and embedding into a pedagogical design. Several research projects have investigated this question of creating learning flows and scenarios that benefit from videos. In the JuxtaLearn project, a dedicated learning flow was established in which the learning starts with the creation of videos by the learners (Malzahn et al. 2016). The JuxtaLearn process is divided into eight steps such as planning, editing and sharing creative videos. By creating videos, the learners overcome specific problems of understanding in so-called “tricky topics” (Cruz et al. 2017). These topics usually relate to science, particularly the STEM fields, and are predefined by instructors. Although an automatic video content analysis was not foreseen, research around the project context employed semi-automatic methods for the content analysis of the created learning videos (Erkens et al. 2014). Network text analysis (“NTA”) has been applied in order to relate signal words and domain concepts to indicate missing pre-knowledge and misconceptions. This incorporated both the learning videos and their context, particularly the comments that were given on the videos.

Such network-based approaches often use a sliding window, which quantifies the proximity of words in a sentence to establish connections between nodes (Hecking and Hoppe 2015). Although co-occurrences of words are easy to spot using text proximity, such approaches need manual feedback, a dictionary or an external knowledge source in order to spot compound terms. It is desirable to spot a compound term such as “semipermeable membrane” and treat it as a single node instead of handling “semipermeable” and “membrane” separately. While unsupervised approaches for keyphrase extraction solve this to some extent, Hasan and Ng (2014) point out that such algorithms still need to incorporate background knowledge in order to improve their performance and accuracy. This can be achieved by adding external knowledge representations such as Wikipedia or DBpedia-related knowledge sources. The ESA-T2N approach can be seen as such an extension to network-text analysis that incorporates explicit semantic analysis (“ESA”) in order to bridge science-related concepts to text networks (Taskin et al. 2020). Such approaches convert a text to a network (“T2N”) structure, which externalizes a mental model of the text by modeling semantic similarity or other semantic measures of the connected concepts.

In contrast to pure network-based or linguistic approaches, it is possible to automatically link concepts by using Wikipedia or DBpedia-related data (Auer et al. 2007). The DBpedia project aims to extract structured information from Wikipedia as RDF triples. These triples are interrelated with external open datasets in order to enrich the data, for example with synonyms, translations or geo-locations. Each entry in DBpedia or Wikipedia has a dedicated URI representing a unique concept. When two different surface forms have the same URI, they are mapped to the same concept. For example, this is the case for “chemical reactions” and “reactions type”, where both have the same URI and a corresponding entity “Chemical reaction”. This mechanism is provided by DBpedia Spotlight (Mendes et al. 2011), which analyzes texts and spots DBpedia entries in them (and therefore Wikipedia entries as well). The spotting uses common techniques from natural language processing such as lexicalization and dictionary-based chunking based on the Aho-Corasick algorithm (Commentz-Walter 1979), which outputs multiple matchings to spot compound terms. In addition, it provides a disambiguation based on a vector-space model in order to output the correct surface form depending on the context.

In addition to the analysis of learner- and teacher-generated videos, the linking to other learning activities or resources such as external learning materials is relevant. In the work by Harbarth et al. (2018), different activities in a dedicated and predefined learning flow have been linked using tags corresponding to interactive learning videos. The tags have been reused in connected learning activities such as concept mapping or the creation of flash cards. However, the tagging was not performed automatically from the videos; it was part of the learning activity itself. A similar architectural approach has been realized in the context of this work: a customized Moodle system has been used to deliver micro-courses with interactive videos that embed interactive activities (quizzes, matching tasks, etc.) and knowledge-activating questions to scaffold the learning (Erkens et al. 2019). In the active video watching (AVW) approach, learners’ engagement during video watching is induced via commenting and interactive notetaking (Mitrovic et al. 2017). Hecking et al. (2017) used NTA in order to analyze the video comments in several applications of the AVW system. Based on their findings, they see a potential for intelligent support through micro-scaffolds and nudging that would rely on semantic analysis of the learner-generated content.

Linking learning resources is one possible micro-scaffold to support learners in accessing further information. Establishing such links can be done on the personal trajectory based on activity logs (Zaíane 2002) or using content analysis approaches, for example, as mentioned above. However, many recommender systems in the context of learning materials take user preferences such as ratings of materials into account (Ghauth and Abdullah 2010). Consequently, such approaches have difficulties providing good recommendations if the amount of data, particularly user ratings, is low. Content-based recommender systems analyze the content of an item description and often transform this into a vector space model; the quality of the recommendation highly depends on the quality of the item description (Pazzani and Billsus 2007). Kopeinik, Kowald, and Lex investigated different recommender systems with respect to their quality and applicability in the context of technology-enhanced learning in a data-driven approach (Kopeinik et al. 2016). They conclude that standard resource recommendation algorithms are not well suited for sparse-data learning environments and propose ontology-based recommender systems to better describe the learners and learning resources. Drachsler et al. (2015) presented challenges for recommender systems specific to the field of technology-enhanced learning in a meta-review comprising 82 recommender systems. They suggest taking pedagogical needs and expectations into account by following user-centered design approaches. Further, semantic technologies can be “considered to describe the educational domain and therefore enrich the recommendation process”.
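To illustrate the sliding-window construction behind such network-based text analyses, the following minimal sketch builds a weighted co-occurrence graph from a token sequence. The window size and the use of networkx are assumptions for illustration only and do not reproduce the exact procedures of the cited works.

```python
import itertools
import networkx as nx

def cooccurrence_network(tokens, window_size=3):
    """Connect words that appear together within a sliding window; edge weights count co-occurrences."""
    graph = nx.Graph()
    for start in range(max(len(tokens) - window_size + 1, 1)):
        window = set(tokens[start:start + window_size])
        for a, b in itertools.combinations(sorted(window), 2):
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1
            else:
                graph.add_edge(a, b, weight=1)
    return graph

# Example: "semipermeable" and "membrane" end up as two separate nodes,
# which is why compound-term spotting needs external knowledge.
g = cooccurrence_network("water passes through the semipermeable membrane".split())
print(g.edges(data=True))
```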
This has implications for the research presented in this work. Semantic technologies using AI help to automatically process content given by the learning context or by learner-generated artefacts (Manske and Hoppe 2016). Ahn et al. (2017) describe a system that uses cognitive services from IBM Watson in order to retrieve learning materials. The matching of learning resources (videos and textual resources) to educational videos in the context of recommender systems has been the subject of several publications (Agrawal et al. 2014; Zhao et al. 2018). However, those systems rely on already existing transcripts of videos and a fixed corpus of resources to be linked. Aprin et al. (2019) presented a system that analyzes content such as learning materials using semantic technologies in order to automatically link learning resources and to discover new open educational resources.
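As a minimal illustration of the content-based recommendation idea mentioned above (and not of the NIMROD mechanism itself, which follows an information retrieval approach described in the next section), item descriptions can be mapped into a tf-idf vector space and ranked by their cosine similarity to a description of the learning context. All names in the sketch are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def content_based_recommendations(item_descriptions, query_description, top_n=3):
    """Rank items by cosine similarity between tf-idf vectors of their descriptions."""
    vectorizer = TfidfVectorizer()
    item_matrix = vectorizer.fit_transform(item_descriptions)  # one vector per item description
    query_vector = vectorizer.transform([query_description])   # vector for the learning context
    similarities = cosine_similarity(query_vector, item_matrix)[0]
    ranked = sorted(range(len(item_descriptions)), key=lambda i: similarities[i], reverse=True)
    return ranked[:top_n]

# Example call with hypothetical descriptions of three learning resources.
print(content_based_recommendations(
    ["setting up a glass apparatus", "calibrating an analytic balance", "redox reactions"],
    "video about assembling a glass apparatus with a stirrer"))
```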

3 Video-Analysis and Contextualized Information Access

In this section, we present an approach based on content analysis of videos in order to generate learning resource recommendations. The general workflow is as follows: (1) separation of the audio track from the video; (2) segmentation of the audio track; (3) transcription (speech-to-text); (4) keyword extraction; (5) learning resource recommendation. Figure 1 illustrates this process with an example that has been used for the evaluation of this work (see Sect. 4). In the first step of this workflow, the audio track is separated from the video. Second, the audio track is segmented using periods of silence as separators. In the third step, each audio chunk is transcribed using a speech-to-text API2 and persisted in the file system of the machine. At this point, the persisted transcript can be corrected or changed manually by a user; if the user decides to correct the transcript, this might affect the further outputs and results. In the fourth step, keywords are extracted from the transcripts using DBpedia Spotlight (Mendes et al. 2011) and tf-idf (Baeza-Yates and Ribeiro-Neto 1999) with a corpus created from a Wikipedia database dump. All the keywords extracted by DBpedia Spotlight relate to DBpedia concepts in the form of URIs that maintain a controlled vocabulary consisting (not exclusively) of Wikipedia article names. Thus, these URIs are bound to a certain context within the underlying knowledge base and help to establish semantic relations between concepts.

2 Speech Recognition Library, Anthony Zhang (2017). https://pypi.org/project/SpeechRecognition/. Retrieved: 2020-02-25.

Particularly for the recommendation of learning materials, it is important to obtain a ranking of keywords in order to find the most relevant parts of the learning context. Therefore, the keywords obtained from DBpedia are ranked by their tf-idf scores. An exception are keywords that are also part of the file name, or keywords that consist of several words, like ‘chemical reaction’, with at least one of them also being a tf-idf-rated keyword; these are put at the top of the list. Keywords that do not occur within the tf-idf corpus are discarded. Finally, the keywords are used as input to link learning resources and to provide intelligent information access. This can be accomplished either by representing the keywords in the target context of the learner using interactive tag clouds with linked materials, or by the provision of learning materials suited to the context of the learner. The latter has been evaluated for this work.

Recommendations of learning materials are generated using an information retrieval approach (rather than a typical recommender system). This makes it easy to connect new and already existing search APIs to the system. One of the project aims is to eliminate digital parallel worlds and to homogenize the IT infrastructure in the company. Search APIs hide the internal structures of the connected systems and offer a transparent way to discover information entities across them. Open Search is one approach to easily integrate such an already existing API. Using existing knowledge sources in already implemented management services is crucial for companies in order to preserve a predominance in a certain field of business. To discover learning resources, multiple searches with the extracted keywords are performed using a Google Custom Search API. Each search can be parametrized in order to boost preferred websites from the application context, such as the chemistry-specific encyclopedia “ChemgaPedia”. Afterwards, the search results are sorted by frequency. Additionally, resources that are not well received by the instructors, or that are even prohibited, are pushed to the end of the list. The proposed and evaluated system is a prototypical app (“NIMROD”) that implements all these mechanisms in a client-side native Python application.
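As a concrete but hedged illustration of steps (1)–(3), the following sketch reads the audio track, splits it at periods of silence and transcribes each chunk with the SpeechRecognition library mentioned in the footnote. The use of pydub (which relies on ffmpeg for reading video containers), the silence thresholds and the choice of Google’s free web recognizer are assumptions for this sketch rather than the exact components of NIMROD.

```python
import speech_recognition as sr
from pydub import AudioSegment
from pydub.silence import split_on_silence

def transcribe_video(video_path, language="de-DE"):
    """Steps 1-3 of the workflow: separate the audio, segment it on silence, transcribe each chunk."""
    # Step 1: pydub reads the container via ffmpeg and yields the audio track.
    audio = AudioSegment.from_file(video_path)
    # Step 2: split the track wherever there is a sufficiently long period of silence.
    chunks = split_on_silence(audio, min_silence_len=700,
                              silence_thresh=audio.dBFS - 16, keep_silence=200)
    recognizer = sr.Recognizer()
    transcript_parts = []
    for i, chunk in enumerate(chunks):
        chunk_file = f"chunk_{i:03d}.wav"
        chunk.export(chunk_file, format="wav")   # persist the segment for transcription
        with sr.AudioFile(chunk_file) as source:
            audio_data = recognizer.record(source)
        try:
            # Step 3: speech-to-text for this chunk (here: Google Web Speech API).
            transcript_parts.append(recognizer.recognize_google(audio_data, language=language))
        except sr.UnknownValueError:
            continue  # skip chunks that cannot be recognized
    return " ".join(transcript_parts)
```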

Fig. 1. Workflow: from a video to recommendations of learning materials.
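Steps (4) and (5) of the workflow in Fig. 1 can be sketched as follows: DBpedia concepts are spotted in the transcript via the public Spotlight REST endpoint, ranked by tf-idf scores computed against a background corpus, and the top keywords are then used as queries against the Google Custom Search JSON API, with result links ordered by how often they recur. The endpoint, parameters and the scikit-learn-based tf-idf shown here are illustrative assumptions; NIMROD uses a corpus built from a Wikipedia database dump and a search engine configured to boost or demote specific sites.

```python
import requests
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/de/annotate"  # public endpoint (assumed)

def spot_keywords(transcript, confidence=0.5):
    """Step 4a: let DBpedia Spotlight return surface forms and their concept URIs."""
    response = requests.get(SPOTLIGHT_URL,
                            params={"text": transcript, "confidence": confidence},
                            headers={"Accept": "application/json"}, timeout=30)
    response.raise_for_status()
    return {r["@surfaceForm"]: r["@URI"] for r in response.json().get("Resources", [])}

def rank_keywords(keywords, transcript, background_corpus, filename=""):
    """Step 4b: rank spotted keywords by tf-idf; boost file-name and multi-word keywords."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(background_corpus)  # stand-in for the Wikipedia-dump corpus
    scores = dict(zip(vectorizer.get_feature_names_out(),
                      vectorizer.transform([transcript]).toarray()[0]))
    boosted, regular = [], []
    for kw in keywords:
        parts = kw.lower().split()
        score = max((scores.get(p, 0.0) for p in parts), default=0.0)
        if score == 0.0:
            continue  # discard keywords unknown to the tf-idf corpus
        target = boosted if kw.lower() in filename.lower() or len(parts) > 1 else regular
        target.append((kw, score))
    regular.sort(key=lambda item: item[1], reverse=True)
    return [kw for kw, _ in boosted] + [kw for kw, _ in regular]

def recommend_resources(keywords, api_key, cx, top_n=10):
    """Step 5: one Custom Search query per keyword; rank result URLs by frequency."""
    counts = Counter()
    for kw in keywords:
        response = requests.get("https://www.googleapis.com/customsearch/v1",
                                params={"key": api_key, "cx": cx, "q": kw}, timeout=30)
        response.raise_for_status()
        for item in response.json().get("items", []):
            counts[item["link"]] += 1
    return [url for url, _ in counts.most_common(top_n)]
```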

However, it is important to distinguish the extraction of keywords and their representation in the learning space (e.g., displaying the keywords to learners) from tag recommendations, which have become popular in the context of folksonomies (Jäschke et al. 2007). The mechanism presented in this paper works quite differently from tag recommender systems, both on the algorithmic level and regarding the knowledge-orientation of the entities in the output. In this work, the extracted keywords are aligned to a specific knowledge representation (the DBpedia ontology created by data linking), which is usually not the case for tag recommendations. The tags in folksonomies are typically constructed without formal constraints or predefined structures. Although the flexibility of having a non-controlled vocabulary might support kick-starting systems, it entails several risks. Two typical problems that occur in social tagging systems and affect tag recommendations are spamming and false tagging. Topics that are prone to misconceptions, common mistakes or ambiguities suffer from the weaknesses of social tagging that does not rely on controlled vocabularies. Yet, in learning contexts it is crucial for such mechanisms to have a high precision, because learners might not be able to differentiate or verify results for themselves. Ontologies such as DBpedia address entities with URIs in order to provide a disambiguation of terms and contexts, whereas folksonomies usually ignore ambiguity.

4 Evaluation

4.1 Experimental Procedure

To evaluate the NIMROD system described above, an online questionnaire was set up. A within-subjects design was used wherein every participant was asked to evaluate the relevance of the extracted keywords and of the proposed learning resources for four videos. These videos covered the chemistry topics ‘Calibration of an analytic balance’ (video 1), ‘Set up of a glass apparatus’ (video 2), ‘Oxidation’ (video 3) and ‘Redox reaction – galvanic cell’ (video 4). Videos 1 and 2 originated from our cooperation project, whereas videos 3 and 4 were available on YouTube. The Google Custom Search API used was set up to include websites directly related to chemistry (i.e., “ChemgaPedia”) as well as Wikipedia. The participants were apprentices of related professions in the chemical industry. The order of the videos was randomized to reduce order effects (Fig. 2).

Fig. 2. Overview of the questionnaire process with forwarding to videos


At the start of the questionnaire the participants read a brief description of the NIMROD system and were prompted to disregard the quality of the videos when rating keywords and resource recommendations as well as in the overall evaluation of the system. For each of the four videos the participants had to watch the video, then suggest keywords themselves, then rate the quality of the extracted keywords and proposed learning resources, and lastly give any additional feedback regarding that specific video in an open question. The number of keywords differed per video, as all keywords found by the recommender were included. For each video, 10 links were included as resources. The keywords were to be rated as “important in relation to the topic”, “suitable but not important” or “irrelevant to the topic”. The learning resources could be rated as “suitable and helpful”, “suitable but not helpful” or “unsuitable”. This was repeated for the remaining three videos. Subsequently, a shortened version of the ResQue questionnaire (Pu and Chen 2011) was used for an overall assessment of the NIMROD system. We used items from the constructs ‘Quality of Recommendation’, ‘Perceived Usefulness’, ‘Transparency’, ‘Attitudes’ and ‘Behavioral Intentions’. The questionnaire closed with demographic questions regarding gender, age and occupation. The questionnaire and the videos were in German; therefore, all examples given in this paper were translated for this report.

Operationalization of the Research Questions. For RQ1 we aggregated the participants’ judgements for each extracted keyword and recommended learning resource. For RQ2 we analyzed the user-generated keywords (by the participants) and compared them to the system-generated keywords. The overall judgement of the recommender system (RQ3) was measured using the ResQue questionnaire.

5 Results

Perceived Quality of the Extracted Keywords and Generated Information (RQ1). Overall, 32 apprentices completed the questionnaire (n = 32, 19 male). Their age ranged from 16 to 31 (M = 21.03, SD = 3.614).

Table 1. Rating of keywords and learning resources

        | Keywords: important in relation to the topic | Keywords: suitable but not important | Keywords: irrelevant to the topic | Resources: suitable and helpful | Resources: suitable but not helpful | Resources: unsuitable
Video 1 | 61.98% | 28.13% | 9.90% | 55.31% | 30.00% | 14.69%
Video 2 | 58.81% | 30.40% | 10.80% | 45.00% | 32.81% | 22.19%
Video 3 | 48.38% | 30.93% | 20.69% | 62.81% | 26.88% | 10.31%
Video 4 | 66.25% | 26.10% | 7.66% | 57.19% | 27.50% | 15.31%
Overall | 58.86% | 28.89% | 12.26% | 55.08% | 29.30% | 15.63%


Looking at the ratings given by the participants, as shown in Table 1, the average participant perceived more keywords as positive (i.e., “important in relation to the topic”) than as neutral (i.e., “suitable but not important”), and more as neutral than as negative (i.e., “irrelevant to the topic”). The same holds for the resources. On average, every participant rated 58.85% of the keywords of a video as “important in relation to the topic”, ranging from 48.38% to 66.25% per video. Furthermore, they rated 55.08% of the resources for a video as “suitable and helpful”, with the results per video ranging between 45.00% and 62.81%. In this rating system for keywords and resources, the participants were given three rating options for each keyword or resource. In both cases the first two options can be considered positive, with one of the two being more neutral than the other, while the third is negative. This means, for example, that the average participant rated 90.1% of the proposed keywords for video 1 positively and 9.9% negatively. The worst-rated video regarding its keywords is video 3, which deviates the most from its main topic ‘Oxidation’ by giving everyday examples. At one point an English translation is given, which results in “English” being suggested as a keyword. This suggestion, while reasonable knowing the algorithms behind the program, might not make much sense to a participant or user. The aim of this study was not to achieve a high agreement between the participants regarding the ratings of keywords and resources. However, we were interested in the “degree of divergence” of the ratings. For this purpose, we calculated Fleiss’ kappa to assess the participants’ agreement. Fleiss’ kappa showed κ = .175 (95% CI, .168 to .183), p < .001 for all keywords and κ = .126 (95% CI, .115 to .136), p < .001 for all resources, both of which were significant but slight (Landis and Koch 1977). This shows an overall rather low agreement between the participants.

Qualitative Evaluation of the Keywords Proposed by Participants Compared with Those by NIMROD (RQ2). In preparation for this comparison, the keywords given by all participants were put through DBpedia to leave only those keywords that could have been caught by the program. In the further course of this evaluation, the label keywords suggested by the participants therefore refers to these DBpedia results of the keywords unless specified otherwise. A comparison of the keywords given by the recommender and those suggested by the participants shows that overall, only 11 out of 78 given keywords were not also proposed by at least one participant. Looking more closely at these specific keywords, it becomes clear that they are usually words that are not directly connected to the main topic of the video. One video about oxidation, for example, talks about how the covered topic is also dealt with in chemistry class and how, in chemistry, organic material means carbon compounds. The recommender therefore proposes, amongst other things, chemistry and chemistry class, while the participants understandably do not. The keywords proposed by both the NIMROD system and the participants for video 2, which deals with the ‘Set up of a glass apparatus’, are shown in Table 2. The keywords not proposed by the participants are all not directly related to the main topic; ‘friction’ and ‘rudder’ are both only mentioned in the bigger context of the stirrer being implemented in the glass apparatus.
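The Fleiss’ kappa values reported above can be reproduced roughly as follows. The sketch assumes a hypothetical items-by-raters matrix of category codes and uses statsmodels, which returns the kappa value itself but not the confidence intervals or p-values reported above.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical matrix: one row per rated keyword, one column per participant,
# categories coded as 0 = important, 1 = suitable but not important, 2 = irrelevant.
keyword_ratings = np.loadtxt("keyword_ratings.csv", delimiter=",", dtype=int)

# Convert the (items x raters) codes into an (items x categories) count table.
count_table, _ = aggregate_raters(keyword_ratings)

print("Fleiss' kappa:", fleiss_kappa(count_table, method="fleiss"))
```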


Table 2. DBpedia keyword comparison for video 2: ‘Set up of a glass apparatus’ (blue indicates concurring suggestions, black suggestions made by one party but not the other, and red marks words suggested by participants that did not occur in the spoken words of the video).

Keywords proposed by NIMROD | Keywords proposed by the participants
compressed air, voltage, piston, coolant, ground glass joint, lens mount, friction, rudder, stirrer, chuck, support rod | bayonet, chemistry, clamp holder, compressed air, glass, piston, coolant, multiple neck flask, ground glass joint, lens mount, stirrer, continuous stirred-tank reactor, round-bottom flask, superheating, chuck, tripod, support rod, vaseline

Nevertheless, of all keywords that were not proposed by any participant, the average participant rated 69.32% as important or suitable and 30.68% as irrelevant. This suggests that even if keywords are not obvious to the user, their suggestion can still be comprehensible and is not necessarily considered unfitting. ‘friction’ was rated important or suitable by 26 participants and irrelevant by 6; ‘rudder’ was rated important or suitable by 28 participants and irrelevant by 4. The additional keywords that were given by the participants but not proposed by the NIMROD system can be divided into three groups: a few were in fact mentioned in the video but not recognized by NIMROD; others were synonymous with or related to words that were mentioned in the videos, with the actually mentioned word being suggested by NIMROD (e.g., “calibration” given by the recommender, “calibrate” given by participants); and lastly words that were not mentioned in the video. For all these keyword examples, it is important to once again note that they were translated for this report. As the German language allows for compound words, most keywords given in Table 2 were originally one-word keywords, unlike their English translations. Similarly, we must point out complications regarding the ambiguity of words with DBpedia: ‘bayonet’, for example, while being a chemistry term, yields a match for the eponymous weapon. As seen in Table 3, the last of the three categories was the most frequent one. This category could be used to evaluate the quality of the videos themselves: these keywords include, for example, specifics about setups or advanced information about the topic that might have been left out in the video or is only visually represented and therefore undetectable for the current recommender.

Table 3. Qualitative analysis of the keywords proposed by participants but not by the recommender (number of keywords suggested by participants that were …).

Video | … mentioned in the video and missed by NIMROD | … synonyms of words mentioned in the video | … not explicitly mentioned in the video
Video 1 | 0 | 1 | 6
Video 2 | 0 | 3 | 7
Video 3 | 1 | 1 | 5
Video 4 | 2 | 2 | 13


Of the keywords suggested by the participants that were not picked up by the DBpedia analysis, some were more bullet points than keywords and described the processes of the video.

Evaluation of the Recommender System (RQ3). From the included ResQue items, the corresponding constructs were calculated by computing the mean of their items. These constructs each have a mean score of at least 3, as can be seen in Table 4. The mean of the 5-point Likert scale is therefore reached or exceeded for all constructs. The low score for diversity can be related to the input in an open question, where a participant remarked that some of the resources had very similar content. The higher accuracy score fits the positive ratings of keywords and resources (Table 4).

Table 4. ResQue results

Construct | M | SD
Quality of recommendation | 3.23 | 0.58
  Accuracy | 3.81 | 1.00
  Novelty | 3.09 | 1.09
  Diversity | 3.00 | 0.75
Perceived usefulness | 3.14 | 1.11
Transparency | 3.50 | 1.19
Attitudes | 3.40 | 0.72
Behavioral intentions | 3.08 | 1.16
  Continuance and frequency | 3.02 | 1.22
  Recommendation to friends | 3.22 | 1.16

6 Discussion

For RQ1, we found a rather diverse rating of the recommender-generated keywords and resources. On average, the majority of keywords and resources were rated positively or neutrally. Due to the use of illustrative examples, some videos also contain concepts and terms that digress from the main topic. This, of course, leads to the NIMROD system finding some keywords that do not match the overall topic, resulting in suggestions that seem misplaced to the participant or user. The agreement scores for Fleiss’ kappa show that the relevance judgment was quite diverse. We take these as subjective judgements. They do, however, show a slightly more unanimous view on the keywords. This again corresponds with the observation that some participants questioned the quality of some of the resource recommendations. The differing views on keywords and resources can be based on unclear distinctions between the options, a subjective view on keywords and resources themselves, or non-ideal testing conditions, which could not be controlled as this was an online study. Additionally, the ratings might be influenced by the state of knowledge of the apprentices. In further studies it might be advisable to include trainers as participants to add a more informed view on the topics of the videos and the proposed keywords and learning resources. It might also be a good idea to give the participants a more detailed instruction on the rating scales to achieve a more common understanding.

RQ2 had us take a closer look at the differences between the keywords suggested by NIMROD and by the participants. This showed that keywords suggested only by NIMROD and not by the participants are mostly not the direct focus of the main topic. Of the keywords suggested only by participants, most were not explicitly mentioned in the video, indicating either deficits in the videos themselves or additional knowledge the participants had about the contents.

The results of the ResQue questionnaire, which are used to answer RQ3, show moderate but still positive scores. It is important to note that the participants neither interacted with the tool directly by themselves nor were they informed in detail about the option to use the tool with any media other than videos. Both might influence the rating if they were included in further studies.

7 Conclusion

Digitization and digital transformation pose many challenges for educators in practical apprenticeships. We identified some of these challenges and implemented a prototype to circumvent the scattering of information systems. In this approach, we use methods of content analysis and semantic technologies to automatically extract keywords from videos in order to discover and recommend new learning materials from different repositories. These keywords are not exclusive to the recommendation; in real use cases they can be used to represent content and display cognitive stimuli without a major loss of information (Manske and Hoppe 2016). Although the findings from the study show that the extracted keywords are relatively useful in order to approximate learners’ suggestions, there is a subjective aspect in the distribution of keywords. One could guess that the keywords rated as irrelevant are not suitable for all learners. However, the kappa statistics have shown that there is no agreement about the irrelevance. It is still debatable whether this remaining inaccuracy can be handled through further improvement of AI technologies or whether it is one of the human factors that must be accepted in such complex scenarios. We argue that the use of knowledge-oriented approaches leads to a high accuracy in the extraction, which is in our opinion preferable in learning scenarios. Spotting keywords from an ontology like DBpedia avoids false positives per se, which is desirable as it does not confuse learners with wrong terminology. In summary, this work shows how semantic extraction can be used to enrich learning scenarios in an unobtrusive way with a relatively high acceptance on the part of the learners.


References Agrawal, R., Christoforaki, M., Gollapudi, S., Kannan, A., Kenthapadi, K., Swaminathan, A.: Mining videos from the web for electronic textbooks. In: Glodeanu, C., Kaytoue, M., Sacarea, C. (eds.) ICFCA 2014. LNCS (LNAI), vol. 8478, pp. 219–234. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07248-7_16 Ahn, J.W., et al.: Wizard’s apprentice: cognitive suggestion support for wizard-of-Oz question answering. In: André, E., Baker, R., Hu, X., Rodrigo, M., du Boulay, B. (eds.) AIED 2017. LNCS (LNAI), vol. 10331, pp. 630–635. Springer, Cham (2017). https://doi.org/10.1007/ 978-3-319-61425-0_79 Aprin, F., Manske, S., Hoppe, H.U.: SALMON: sharing, annotating and linking learning materials online. In: Herzog, M.A., Kubincová, Z., Han, P., Temperini, M. (eds.) ICWL 2019. LNCS, vol. 11841, pp. 250–257. Springer, Cham (2019). https://doi.org/10.1007/978-3-03035758-0_23 Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K. (ed.) ASWC/ISWC -2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76298-0_52 Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval, vol. 463. ACM Press, New York (1999) Berghaus, S., Back, A.: Disentangling the fuzzy front end of digital transformation: activities and approaches. In: Association for Information Systems (2017) Brynjolfsson, E., McAfee, A.: Race Against the Machine: How the Digital Revolution is Accelerating Innovation, Driving Productivity, and Irreversibly Transforming Employment and the Economy. Brynjolfsson and McAfee (2011) Commentz-Walter, B.: A string matching algorithm fast on the average. In: Maurer, H.A. (ed.) ICALP 1979. LNCS, vol. 71, pp. 118–132. Springer, Heidelberg (1979). https://doi.org/10. 1007/3-540-09510-1_10 Cruz, S.M.A., Lencastre, J.A., Coutinho, C.P., José, R., Clough, G., Adams, A.: The JuxtaLearn process in the learning of maths’ tricky topics: practices, results and teacher’s perceptions (2017) Drachsler, H., Verbert, K., Santos, O.C., Manouselis, N.: Panorama of recommender systems to support learning. In: Ricci, F., Rokach, L., Shapira, B. (eds.) Recommender Systems Handbook, pp. 421–451. Springer, Boston, MA (2015). https://doi.org/10.1007/978-1-48997637-6_12 Erkens, M., Daems, O., Hoppe, H.U.: Artifact analysis around video creation in collaborative STEM learning scenarios. In: 2014 IEEE 14th International Conference on Advanced Learning Technologies, pp. 388–392. IEEE (2014) Erkens, M., Manske, S., Bodemer, D., Hoppe, H.U., Langner-Thiele, A.: Video-based competence development in chemistry vocational training. In: Proceedings of the ICCE 2019 (2019) Ghauth, K.I., Abdullah, N.A.: Learning materials recommendation using good learners’ ratings and content-based filtering. Educ. Technol. Res. Dev. 58(6), 711–727 (2010) Harbarth, L., et al.: Learning by tagging–supporting constructive learning in video-based environments. DeLFI 2018-Die 16. E-Learning Fachtagung Informatik (2018) Hasan, K.S., Ng, V.: Automatic keyphrase extraction: a survey of the state of the art. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1262–1273) (2014)

Hecking, T., Dimitrova, V., Mitrovic, A., Ulrich Hoppe, U.: Using network-text analysis to characterise learner engagement in active video watching. In: ICCE 2017 Main Conference Proceedings, pp. 326–335. Asia-Pacific Society for Computers in Education (2017)
Hecking, T., Hoppe, H.U.: A network based approach for the visualization and analysis of collaboratively edited texts. In: VISLA@LAK, pp. 19–23 (2015)
Henriette, E., Feki, M., Boughzala, I.: The shape of digital transformation: a systematic literature review. In: MCIS 2015 Proceedings, pp. 431–443 (2015)
Hirsch-Kreinsen, H.: Digitization of industrial work: development paths and prospects. J. Labour Market Res. 49(1), 1–14 (2016). https://doi.org/10.1007/s12651-016-0200-6
Jäschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L., Stumme, G.: Tag recommendations in folksonomies. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 506–514. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74976-9_52
Kopeinik, S., Kowald, D., Lex, E.: Which algorithms suit which learning environments? A comparative study of recommender systems in TEL. In: Verbert, K., Sharples, M., Klobučar, T. (eds.) EC-TEL 2016. LNCS, vol. 9891, pp. 124–138. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45153-4_10
Landis, J., Koch, G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977). https://doi.org/10.2307/2529310
Malzahn, N., Hartnett, E., Llinás, P., Hoppe, H.U.: A smart environment supporting the creation of juxtaposed videos for learning. In: Li, Y., et al. (eds.) State-of-the-Art and Future Directions of Smart Learning, pp. 461–470. Springer, Singapore (2016). https://doi.org/10.1007/978-981-287-868-7_55
Manske, S., Hoppe, H.U.: The concept cloud: supporting collaborative knowledge construction based on semantic extraction from learner-generated artefacts. In: International Conference on Advanced Learning Technologies (ICALT 2016) (2016)
Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia Spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems, pp. 1–8 (2011)
Mitrovic, A., Dimitrova, V., Lau, L., Weerasinghe, A., Mathews, M.: Supporting constructive video-based learning: requirements elicitation from exploratory studies. In: International Conference on Artificial Intelligence in Education, pp. 224–237 (2017)
Parviainen, P., Tihinen, M., Kääriäinen, J., Teppola, S.: Tackling the digitalization challenge: how to benefit from digitalization in practice. Int. J. Inf. Syst. Proj. Manage. 5(1), 63–77 (2017)
Pazzani, M.J., Billsus, D.: Content-based recommendation systems. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) The Adaptive Web. LNCS, vol. 4321, pp. 325–341. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-72079-9_10
Pu, P., Chen, L., Hu, R.: A user-centric evaluation framework for recommender systems. In: Proceedings of the 5th ACM Conference on Recommender Systems, pp. 157–164 (2011)
Taskin, Y., Hecking, T., Hoppe, H.U., Dimitrova, V., Mitrovic, A.: Characterizing comment types and levels of engagement in video-based learning as a basis for adaptive nudging. In: Scheffel, M., Broisin, J., Pammer-Schindler, V., Ioannou, A., Schneider, J. (eds.) EC-TEL 2019. LNCS, vol. 11722, pp. 362–376. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29736-7_27
Taskin, Y., Hecking, T., Hoppe, H.U.: ESA-T2N: a novel approach to network-text analysis. In: Cherifi, H., Gaito, S., Mendes, J.S., Moro, E., Rocha, L.M. (eds.) COMPLEX NETWORKS 2019. SCI, vol. 882, pp. 129–139. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-36683-4_11

Zaíane, O.R.: Building a recommender agent for e-learning systems. In: International Conference on Computers in Education, Proceedings, pp. 55–59. IEEE (2002)
Zhao, J., Bhatt, C., Cooper, M., Shamma, D.A.: Flexible learning with semantic visual exploration and sequence-based recommendation of MOOC videos. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–13 (2018)

Human-Centered Design of a Dashboard on Students' Revisions During Writing

Rianne Conijn1,2,3, Luuk Van Waes3, and Menno van Zaanen4

1 Department of Cognitive Science and Artificial Intelligence, Tilburg University, Tilburg, The Netherlands
2 Human-Technology Interaction Group, Eindhoven University of Technology, Eindhoven, The Netherlands
3 Department of Management, University of Antwerp, Antwerp, Belgium
4 South African Centre for Digital Language Resources, Potchefstroom, South Africa

Abstract. Learning dashboards are often used to provide teachers with insight into students' learning processes. However, simply providing teachers with data on students' learning processes is not necessarily beneficial for improving learning and teaching; the data need to be actionable. Recently, human-centered learning analytics has been suggested as a solution to realize more effective and actionable dashboards. Accordingly, this study aims to identify how these human-centered approaches could be used to design an interpretable and actionable learning dashboard on students' writing processes. The design consists of three iterative steps. First, visualizations on students' revision process, created from keystroke data, were evaluated with writing researchers. Second, the updated visualizations were used to co-design a paper prototype of the dashboard within a focus group session with writing teachers. Finally, the paper prototype was transformed into a digital prototype and evaluated by teachers in individual user test interviews. The results showed that this approach was useful for designing an interpretable dashboard with envisioned actions, which could be further tested in classroom settings.

Keywords: Keystroke logging · Writing process · Writing analytics · Human-centered design · LATUX

1 Introduction

In most classroom environments, teachers are not able to systematically monitor students' writing processes. Teachers have access to the final written products of students (and sometimes intermediate products), but they typically have limited information on how students create these final products. Hence, little is known about where and when students struggle during their writing.

Writing analytics can be used to provide teachers with more insight into students' writing processes. Writing analytics focuses on "the measurement and analysis of written texts for the purpose of understanding writing processes and products, in their educational context, and improving the learning and teaching of writing" [3] (p. 481). Keystroke logging is often used as a tool in writing analytics [15]. Real-time keystroke data offer the potential for automatic extraction of important diagnostic information on students' writing processes, making it possible to provide a more precise identification of writing difficulties, e.g., [14]. Yet, the diagnostic information from fine-grained keystroke data is often not directly intuitive for educational stakeholders.

A solution would be to provide the data in the form of a learning dashboard. A learning dashboard is a tool that provides a visual overview of students' tracked learning activities (e.g., time spent on quizzes, number of learning sessions), to promote awareness and reflection [23]. Dashboards can be employed by teachers, students, or other educational stakeholders to review and analyze data on students' learning processes [22, 23]. Teacher-facing dashboards are specifically aimed at providing teachers with information to improve their teaching and students' learning, and have been proven to be effective in improving post-test scores and engagement [23]. There is limited research on designing dashboards for visualizing students' writing processes. To our knowledge, only two studies address this issue. One study designed a dashboard on collaborative writing [19], in which DocuViz (a Chrome plugin) is used to visualize how a document grows over time, who contributed, and at which times and locations people contributed. Another study describes the process report in Inputlog, which shows pausing behavior, revision behavior, source interaction, and fluency [21]. This report is mostly textual, but also includes two visualizations: the process graph and the fluency graph, detailing the text production and the fluency over time.

One concern with dashboards, and learning analytics in general, is that simply providing teachers with data on students' learning processes is not necessarily beneficial [24]. It has been argued that for the learning analytics to be effective in improving learning and teaching, the data need to be transformed into actionable information [5]. It is not always clear how to act upon insights obtained from tracked learning activities, to improve learning and teaching. For example, even though expert revisers might revise more or make more extensive revisions, simply asking novices to revise more rarely results in higher writing quality. Recently, human-centered learning analytics has been suggested as a solution to realize more effective and actionable learning analytics [2]. Within a human-centered approach, the functionality and design of the system is defined by the actual users of the system, rather than by the developers or researchers [2,11]. Accordingly, the design is more likely to account for all the needs, desires, and experiences of the relevant stakeholders [2,11]. As teachers are the main users of teacher-facing dashboards, they have a central role in the design of the dashboard in the current study. By including the teachers in the design process, the alignment of the learning analytics with the learning design of the course or module could be enhanced and teaching concerns could be addressed better

[16,24]. Accordingly, teachers will be more likely to act upon these analytics. However, evidence of the use of these human-centered approaches for the design of dashboards for teaching writing is limited. Therefore, the current study describes a human-centered approach to inform the design of a teacher-facing dashboard on students' writing processes. Specifically, this study aims to answer the following research question: How to design an interpretable and actionable teacher-facing learning dashboard on students' writing processes?
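To give a flavour of the kind of diagnostic information that keystroke logs afford (as discussed above), the toy sketch below derives a few coarse process indicators from a raw key-event log. The event fields and the pause threshold are invented for illustration; dedicated tools such as Inputlog compute far richer and better-validated measures.

```python
from dataclasses import dataclass

@dataclass
class Keystroke:
    time_ms: int   # timestamp of the key-down event
    key: str       # e.g. "a", "Backspace", "Delete"

def diagnostic_features(log: list[Keystroke], pause_threshold_ms: int = 2000) -> dict:
    """Derive a few coarse writing-process indicators from a keystroke log."""
    deletions = sum(k.key in ("Backspace", "Delete") for k in log)
    gaps = [b.time_ms - a.time_ms for a, b in zip(log, log[1:])]
    long_pauses = sum(g >= pause_threshold_ms for g in gaps)
    total_time_s = (log[-1].time_ms - log[0].time_ms) / 1000 if log else 0.0
    return {
        "keystrokes": len(log),
        "deletion_events": deletions,   # rough proxy for revision activity
        "long_pauses": long_pauses,     # rough proxy for hesitation/planning
        "total_time_s": total_time_s,
    }
```

Indicators like these are the raw material that a dashboard would have to aggregate and visualize before they become interpretable for teachers, which is exactly the design problem addressed in the remainder of this paper.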

1.1 Human-Centered Approaches in Dashboard Design

Several human-centered approaches have been proposed for the design of learning analytics tools and dashboards [8]. In the current study, we employ two frameworks that specifically focus on the human-centered design of learning dashboards: an iterative workflow and a model for teacher inquiry. Learning Awareness Tools – User eXperience (LATUX) is an iterative workflow for designing and deploying awareness tools (such as dashboards) [18]. The LATUX workflow describes five iterative design stages: 1) problem statement and requirements identification; 2) creation of a low-fidelity prototype; 3) creation of a high-fidelity prototype; 4) evaluation of the prototype in pilot classroom studies; and 5) evaluation in the real-world classroom, to evaluate the longer-term impact on learning and instruction. Teacher inquiry is of key importance in each of these stages. In this context, Wise and Jung [24] developed a model for actionable learning analytics to inquire into how teachers are using learning dashboards. This model describes the actions involved in teachers' analytics use, which are divided into the sense-making of the data and the pedagogical response based upon the data. The sense-making consists of how teachers ask questions about the data displayed, where they start looking, which reference points they use, how they interpret the data, and which sources they use to explain the data. The pedagogical response consists of how teachers (intend to) act upon these data, how they align this with their learning design, what impact they expect, and how they measure this impact.

1.2 Approach

The dashboard in this article is focused on displaying information on students' tracked writing processes, for two reasons: 1) students' self-reports on their writing processes are unreliable and 2) analysis of the writing product does not always provide insight into when, where, and why students struggled [20]. The dashboard is specifically focused on revisions, because revisions play an important role in writing [9], but are not visible in the written product. Revisions provide an indication of the issues or points of improvement a writer identified in their text [10]. In addition, revisions can influence the writing quality, the writing process (e.g., disrupt the flow), and the writer's knowledge about the topic or about writing itself [1,9]. Accordingly, revisions play an important role in writing instruction. The design of the dashboard followed a human-centered approach, using the LATUX framework [18] and the model for actionable learning analytics [24]. In

this article, we report on stage 2 and stage 3 of the LATUX framework, divided over three iterative design steps. The first two steps align with stage 2 (both low-fidelity prototyping) and the last step with stage 3 (high-fidelity prototyping). First, visualizations on students' revision processes were created from keystroke data and evaluated with writing researchers in round table sessions (N = 13). Second, the updated visualizations were used to co-design a paper prototype of the dashboard during a focus group session with writing teachers (N = 4). Third, the paper prototype was transformed into a digital prototype and evaluated by writing teachers in individual user tests combined with interviews (N = 6). The full study was approved by the school-level research ethics and data management committee. All participants provided informed consent.

2 Step 1: Creating Visualizations on Revisions

2.1 Method

For the first step, three short round table sessions were held in three groups of 4–5 people each (N = 13). The goal was to get quick evaluations of a variety of different visualizations of students' revision processes. Therefore, writing researchers were chosen as participants, as they generally have high data literacy and hence can provide feedback on data visualizations without much explanation. All participants were attendants of a writing research meeting who had at least two years of experience in writing research. The materials included twenty visualizations of students' revision processes. These visualizations were created in R using an open source dataset on students' revisions [6]. This dataset is annotated on eight properties of revisions: orientation, evaluation, action, linguistic domain, spatial location, temporal location, duration, and sequencing, based on the revision tagset from [7]. Each visualization was made on one or a combination of these properties. For example, one of the visualizations showed the number of surface and deep revisions (orientation) in relation to the time (temporal location) and the spatial location in the writing process (see [6]); an illustrative sketch of this kind of plot is given at the end of this subsection. All visualizations were printed on paper. After a short explanation of the revision properties, the participants were asked to individually evaluate the visualizations, using post-its to indicate aspects they liked, disliked, points of improvement, and questions. In addition, the participants were asked to individually vote on (at most) three visualizations they liked most. The evaluation took 15 min per group, during which the participants were not allowed to talk.
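The sketch below illustrates the kind of plot described above, combining revision orientation (surface vs. deep) with temporal and spatial location. The original visualizations were built in R on the annotated dataset [6]; this Python stand-in uses a tiny invented data frame, and the column names (`start_time`, `location`, `orientation`) are assumptions rather than the published annotation scheme.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical excerpt of an annotated revision log (illustrative values only).
revisions = pd.DataFrame({
    "student":     ["A", "A", "A", "B", "B", "B"],
    "start_time":  [35, 120, 410, 60, 300, 720],          # seconds into the session
    "location":    [0.95, 0.40, 0.99, 0.90, 0.15, 0.80],  # relative position in the text
    "orientation": ["surface", "deep", "surface", "surface", "deep", "deep"],
})

fig, ax = plt.subplots(figsize=(6, 4))
for orientation, marker in [("surface", "o"), ("deep", "s")]:
    subset = revisions[revisions["orientation"] == orientation]
    ax.scatter(subset["start_time"], subset["location"], marker=marker, label=orientation)

ax.set_xlabel("Time of revision (s into writing session)")
ax.set_ylabel("Relative location in text (0 = start, 1 = leading edge)")
ax.legend(title="Revision orientation")
fig.tight_layout()
fig.savefig("revision_orientation_scatter.png")
```

Plots of this shape make it visible at a glance whether a writer revises mostly at the leading edge of the text or also returns to earlier passages, and whether those returns concern surface or deep changes.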

2.2 Results and Conclusion

The comments and votes were used to improve the visualizations of the revision properties. The revision properties with at least two votes were used as input for step 2. For five of the revision properties both a frequency-based and a percentage-based graph were shown, resulting in mixed responses from the participants. The graphs with percentages were said to ease finding differences between students. However, especially the participants with a teaching background argued that frequencies were more intuitive to understand for teachers and students. Therefore, we only selected the graphs displaying frequencies, leaving nine visualizations for step 2.

Fig. 1. Changes in two visualizations of the revision properties after step 1, left: evaluation; right: orientation, spatial and temporal location. Annotated changes: (1) student names added; (2) data ordered from many to few revisions; (3) colors related to meaning (red marks an erroneous correction); (4) more distinguishable and less distracting color scheme; (5) jargon removed from labels. (Color figure online)

These nine visualizations were further improved based on the comments (see Fig. 1). First, the interpretability of the visualizations was improved with respect to visual appeal. The colors were made more distinguishable, where possible related to their meaning (e.g., associating red with errors), and less distracting (e.g., using light colors for dominant categories). The graphs were transformed and the data were ordered from high (top) to low (bottom) frequencies. Additionally, the graphs were improved based on the content. Many participants argued that the labels were sometimes too complex. Accordingly, the labels were changed to reflect student names, and the axis titles were changed to include less jargon. Finally, four visualizations were added to better contextualize the visualizations: the description of the writing assignment, the replay of the writing process, the student's final text, and the total time spent on the assignment.

This first step provided most insight into the design of an interpretable dashboard, but some comments related to actionability were already made. The majority of the participants argued that it was hard to directly identify how to use the graphs. One participant mentioned that teachers need to have a goal for using the visualizations. Some of the possible actions mentioned included: goal-setting for students, providing feedback to students, and letting students reflect on their writing process (possibly compared to a peer).

3 Step 2: Creating a Paper Dashboard of Revisions

3.1 Method

For the second step, a focus group was held with four academic writing teachers, the intended users of the dashboard. The goal was to co-design an interpretable and actionable paper prototype of a teacher-facing revision dashboard. In addition, we aimed to determine how this dashboard could be used in the teachers' learning and teaching practices. The teachers were recruited by email via the university's language center. All teachers had at least seven years of experience in teaching academic writing in Dutch and English within classes ranging from 40 to 120 students (three teachers) and/or individual coaching (three teachers). The focus group started with a brief introduction on learning dashboards. Thereafter, the participants were asked to discuss properties of dashboards they liked and disliked. Next, the participants were shown the nine improved visualizations from step 1 and the four visualizations showing additional information on the writing process. They were asked to individually vote on at most five visualizations they would like to have in the dashboard, in the context of their teaching. After voting, the visualizations with the most votes were discussed, questions about the visualizations were answered by the moderator, and participants were allowed to rearrange their votes. The visualizations with the most votes were then pasted onto an empty sheet to form the paper dashboard. Based on this paper prototype, the participants were asked to discuss what actions they would take based on it and what visualizations or information needed to be changed or added to better inform those actions. The focus group took 90 min. During the focus group, audio was recorded. The audio was transcribed and coded using NVivo 12. The properties of dashboards that the participants liked and disliked were clustered into similar themes. The remainder of the transcript was analyzed using thematic analysis, using the topic list created from the teacher inquiry model for actionable learning analytics [24]. This consists of the topics: asking questions, orientation, reference point, interpretation, explanation, action, alignment, expected impact, measuring impact, and other comments.

3.2 Results and Conclusion

The participants first discussed desirable properties of a dashboard. The participants argued that the dashboard should give a quick overview. Hence, everything should be clearly organized and visible on one screen (not scrollable). There should be a step-wise approach, where only relevant information is displayed on the homepage, with a possibility to gain more details when clicking further. They would like to have control over what is displayed and how it is displayed. The operation of the dashboard should be simple, intuitive, and easy to learn. Lastly, the dashboard should be attractive. This set the scene for the co-design of the revision dashboard. Four visualizations were voted to be included in the dashboard. One visualization was added during the discussion, when the participants realized they

missed the concept of time to explain the visualizations. The results of the discussion on the paper dashboard are reported per theme below.

1. Asking questions. The participants identified several aspects they would look for in the dashboard. They wanted to get insight into students' writing processes, in relation to effort ("do they really revise?"), struggles ("what are they struggling with?"), quality ("how does the process relate to quality?"), and genres ("how does the process change for different genres?"). In addition, one participant mentioned they would like to see whether the concepts covered in class were also addressed within the revisions.

2. Orientation. The orientation, or where you start looking in the dashboard, was only briefly discussed. It should be step-wise, starting from an overview, from which it should be possible to go deeper into the data.

3. Reference point. Two reference points were identified to compare the revision process data. First, data could be compared with other students, for example with students who struggled more, with students who received really high/low grades, or with the class average. In addition, the data could be compared across different texts, for example between drafts or assignments.

4. Interpretation. The interpretation of the visualizations was frequently discussed. The participants mainly focused on the frequency, depth, and distribution of the revisions; the passages students struggled with; and how things might be interpreted in relation to grades. They discussed how each of the visualizations provided complementary information, but were also wondering how they could draw generalized conclusions, because "the visualizations are so specific to one person's process".

5. Explanation. To explain the visualizations, four other sources of information were mentioned. First, the quality of the assignment was needed, to be able to evaluate the revisions. However, some also argued that quality might not always be useful for explanation, because there is no single 'best' writing process. Second, access to the final text was necessary, to contextualize the revisions in the actual text that was written. Third, information on the version of the final text was necessary, as different revision processes might be expected for the first draft compared to the second draft. Fourth, information on the timing of the revisions was needed, and therefore the visualization on the distribution of the revisions over time was added to the dashboard (right plot in Fig. 1).

6. Action. The participants frequently mentioned they did not see how they should act upon these data. However, throughout the discussions, some ideas emerged. Three possible actions were discussed: encourage students to revise/review more, advise students to focus on deeper revisions, and let students reflect on their processes. In addition, three specific types of scaffolding were discussed. First, a workshop where students use the tool for multiple assignments and compare their behavior over time. Second, individual coaching sessions, where the coach goes through the points the student struggled with and, together with the student, finds ways to improve the process. Third, a class teaching setting, where common 'errors' or struggles are

discussed, supported by 'actual' data from the students in class. In addition, the participants mentioned that the tool might be used to measure the effectiveness of their teaching, by identifying whether students are working on the taught constructs.

7. Alignment. The alignment with the course design was only briefly mentioned. Participants mostly wanted to use the tool after the first draft, because then the revision processes are most important. They argued the tool should be used after grading, because if the student receives a good grade, it might not be necessary to look at the process. The participants considered it to be most effective for master students, as opposed to bachelor students, as master students already have more insight into their writing and would be more motivated.

8. Expected impact. Five forms of expected impact were identified for the students: getting students more motivated, enhancing students' awareness of and insight into their own writing process, improving their understanding of (the technical aspects of) writing and language, improving the writing process ("so they are actually trying to apply what you taught"), and improving the writing product. Lastly, an expected impact for the teachers was to make teaching more fun.

9. Measuring impact. The participants only briefly mentioned that some of the expected outcomes needed to be measured, such as the improvement in the writing process over time, and whether students applied what the teacher taught. However, they did not discuss how this could be measured.

10. Other comments. Several other implications were discussed. The participants mentioned that the writing process should not influence their grading. The students might also 'game the system', and hence they should not be rewarded for their process. In addition, students' privacy was mentioned: data need to be anonymized. Concerns were expressed about "yet another system to operate"; hence, they preferred to have the dashboard integrated into the learning management system. Participants also indicated that it might be very time-intensive to look at the writing process, especially at the individual level. Lastly, the participants wondered whether the system would remain interesting to use.

Based on these results, the paper dashboard was then transformed into a digital dashboard (see Fig. 2). The dashboard followed a step-wise approach, starting with an overview that could be fleshed out to display the full details. In addition, teachers could filter specific types of revisions, to focus only on the types of their interest. To better interpret the data, the participants desired to be able to compare the visualizations between students and between assignment versions. Therefore, the digital dashboard was designed to facilitate these comparisons. Lastly, the participants discussed different teaching contexts, class teaching/workshops and individual coaching, and argued that these would require different information in order to be able to act upon the data. Therefore, two tabs were created, one for each of the teaching contexts, each displaying the data preferred in that situation. The digital dashboard was used as input for step 3.

Fig. 2. Digital dashboard. (a) Class overview tab, showing the different types of revisions made for the whole class, for the draft (left) and final version (right) of a summary task. (b) Individual student tab, showing the different types of revisions made (top), the spread of these revisions over time (middle), and the linearity of the production and the density of the revisions mapped onto the written product (bottom) for two students.

4 Step 3: Evaluating a Digital Dashboard of Revisions

4.1 Method

For the third and final step, a series of individual user test interviews was held with six academic writing teachers. The goal was to evaluate the digital prototype of the revision dashboard, both on interpretability and actionability. The teachers were recruited by email from another university. All teachers had at least two years of experience in teaching academic writing within classes ranging from 20 to 120 students (five teachers) and/or individual coaching (three teachers). Four participants were already familiar with the progress graph from Inputlog [13], so they had some prior knowledge on writing process visualizations. The interview sessions consisted of thinking-aloud user testing of the digital dashboard, combined with interview questions on the interpretability and actionability of the dashboard. During the interviews, the participants first received a short introduction on the goal of the interview and the thinking-aloud procedure. The participants then practiced thinking aloud during a specific task in a learning management system dashboard. Thereafter, the participants were asked to explore the dashboard and to voice all actions (e.g., button presses), expectations, interpretations of the visualizations, and aspects they found unclear. After the user test, the participants were interviewed on how they would act upon the data in their own course, how they would align the use of the dashboard with their course design, what the expected impact would be, and how they would measure this impact. During the user test and interview sessions, audio was recorded. As in step 2, the audio was transcribed, coded based on the topic list, and analyzed using thematic analysis in NVivo 12.

4.2 Results and Conclusion

The results of the user test interviews of the digital dashboard are detailed below.

1. Asking questions. The participants did not start with a specific question; they wanted to explore the dashboard first and asked general questions, such as "how do students revise?". More specific questions emerged when the participants explored the data (e.g., "does everyone show such a linear process?").

2. Orientation. The design of the dashboard facilitated a step-wise navigation, following the recommendations from step 2. This sometimes resulted in some confusion in the beginning ("I cannot see anything yet"), but this became clear relatively quickly once they selected an assignment. There were three possible navigation approaches: from the class overview towards a specific student that stood out in that overview; starting from a specific student or assignment that showed issues; or starting from a random assignment.

3. Reference point. The reference points were the same as in step 2, as the design specifically facilitated these comparisons: comparison with other students, with previous versions and previous assignments, and with the class average.

4. Interpretation. The participants were able to interpret all the graphs, although clarifying questions were sometimes needed. Some of the labels were unclear (e.g., the difference between semantic and wording revisions) and the graph displaying the spread of revisions over time and location (scatter plot in Fig. 2b, middle) took some practice. The interpretations evidently followed the data displayed, and were highly related to the current sample of students within the dashboard. For example, it was noted that the students revised heavily, but mainly focused on minor revisions. In addition, the participants noticed that many students revised at the leading edge of the text, and rarely revised in earlier parts, i.e., their writing was fairly linear. Lastly, they noticed how heatmaps showed specific instances of words that were heavily revised and sentences where a student went back to revise something in their earlier produced text.

5. Explanation. The participants discussed different sources that could aid the explanation of the data. First, the type of assignment was mentioned most, as shorter texts might result in fewer deep revisions. Second, the quality and the final text were often mentioned, mostly in relation to the possibility of multiple explanations of the revisions. Participants stated there is no single 'best' writing process. Hence, they argued it might be hard for students to identify what it means if they make only a few revisions; it needs to be related to quality or the final text to draw conclusions. Lastly, three sources were mentioned once: the number of words so far, the total time spent, and students' language background.

6. Action. Contrary to step 2, participants were able to envision a variety of actions based upon the data. Six different actions were discussed: 1) reflection upon students' own processes; 2) discussion on where students struggle; 3) providing advice on what to focus on; 4) instruction on the writing process, i.e., discussing different strategies and approaches to the writing process; 5) measuring the effectiveness of teaching; 6) including information on the writing process in the grading. For the first three actions it was considered necessary to also provide students with access to the dashboard. The same types of scaffolding were discussed as in step 2: individual coaching, workshops, and class teaching. In addition, several participants discussed group work and/or peer feedback: let students discuss their writing processes in groups, to help each other in understanding the visualizations and identifying how things could be done differently.

7. Alignment. The participants discussed that it would be useful to apply the dashboard in multiple assignments, to be able to compare the data over different assignments. In addition, several suggestions were made to align the dashboard with their current course by integrating the tool into the peer-review sessions, as well as within the curriculum, by extending the functionality of the tool over time, as students get more experienced with writing and their writing process.

8. Expected impact. The expected impact was to create awareness of and insight into students' own revision processes and their struggles, but also into other students' processes, to see that there are multiple ways to approach writing.

In addition, it was envisioned that students' writing processes would improve, by using more effective revision strategies.

9. Measuring impact. Measuring impact was only discussed superficially, without concrete measures. The participants only discussed measuring the quality of the writing process, which could be done by determining whether students indeed tried a different approach or whether their process evolved over time.

10. Other comments. Lastly, several concerns were discussed. Similarly to step 2, participants were concerned about the privacy of the students, the time needed to use the dashboard in class, and the multiple systems to operate. Several additional implications were mentioned. First, the participants wanted to know how the data were collected and how accurate they were (transparency). Second, they discussed the learnability of the system: a manual was needed, but they argued the system was easier to use than they had thought. Lastly, one participant mentioned it might be useful as a tool for research as well.

To conclude, the user test interviews showed that participants were able to interpret the dashboard with only a few clarifying questions. This indicates that these steps were successful for creating an interpretable dashboard, which could be further developed in future iterations. In addition, the multiple possible actions envisioned by the teachers provided evidence for the actionability of the revision dashboard. The results also yielded suggestions for further improvements of the design of the dashboard, which are described in the general discussion and conclusion.

5 General Discussion and Conclusion

This study aimed to identify how a learning dashboard could be designed to transform data on students' writing processes into both interpretable and actionable information for teachers. This was done in three steps, based upon the LATUX framework [18]. First, insights were gained into how the data on different properties of revisions, obtained with the revision tagset [7], could best be represented, according to writing researchers. Second, based on these visualizations, a paper prototype was co-designed with writing teachers, and evaluated using the teacher inquiry model [24]. This paper prototype was then transformed into a digital dashboard and evaluated with writing teachers, again following the teacher inquiry model. This process revealed several implications for the design of an interpretable and actionable revision dashboard.

First, the teacher inquiry model [24] proved to be useful to get insight into how teachers approached the dashboard, interpreted the visualizations, and envisioned possible actions upon the data. The use cases mentioned within the envisioned actions are in line with the types of scaffolding found in previous work: whole-class scaffolding, targeted scaffolding, and revising course design [24].

Second, the results showed that for many of the actions, teachers wanted to show the dashboard to the students as well, to discuss and reflect upon their

writing process. This indicates that the dashboard should be both teacher- and student-facing, and that students should be included in future design iterations [2,11].

Third, steps 2 and 3 showed that the teachers preferred to have a step-wise navigation, starting from an overview before diving into details, and with a possibility to filter out irrelevant information. This reduced the information overload and coincides with previous findings from learning dashboard evaluations [4]. However, this also resulted in some confusion in the beginning, as the teachers did not clearly know where to go. In addition, some of the detailed graphs were not clear from the beginning. Therefore, future work should provide a manual with examples or use data storytelling principles (see e.g., [17]) to better guide the users through the dashboard and improve the interpretability.

Finally, several concerns were raised for the adoption of the dashboard, resulting in suggestions for improvement. First, the system needs to be made scalable to allow for large classes. Second, the full system (data collection and dashboard) needs to be integrated into the existing learning management systems. And finally, it needs to be further determined how the benefits might be maximized with minimal input from the teachers.

This study is limited in that it only provides envisioned actions upon the dashboard. Further research should explore whether teachers indeed act upon the dashboard. Moreover, most of the participants had some prior experience with dashboards or with information on the writing process, so the results might not generalize to teachers with less experience. Hence, the dashboard needs to be evaluated with less experienced teachers as well. Lastly, the interpretability and actionability of the dashboard seemed to largely depend on the data shown in the dashboard. For example, when the data did not show any clear patterns or differences between students, the teachers often had more difficulties interpreting the data and envisioning possible actions. However, it is not always known which differences or patterns might be of interest to teachers. Therefore, it is important to further evaluate the dashboard in the field, using real and contextualized data from the teachers' current students [12].

Further evaluation could be done according to steps four and five of the LATUX framework [18]. First, pilot classroom or individual coaching studies need to be considered, followed by longer-term evaluation within the classroom itself. This measures whether the dashboard is actually actionable in practice, and could be used to evaluate the longer-term impact on learning and instruction. In addition, as the learning setting might influence the impact of the dashboard, further evaluations could also be used to identify which actions upon the dashboard are preferred in which situations (cf. [23]).

To conclude, this study showed the first steps of a human-centered approach to designing a writing analytics dashboard. We contend that this approach will result in a dashboard that can be acted upon, and hence can be effective in improving the learning and teaching of writing.

References

1. Barkaoui, K.: What and when second-language learners revise when responding to timed writing tasks on the computer: the roles of task type, second language proficiency, and keyboarding skills. Mod. Lang. J. 100(1), 320–340 (2016). https://doi.org/10.1111/modl.12316
2. Buckingham Shum, S., Ferguson, R., Martinez-Maldonado, R.: Human-centred learning analytics. J. Learn. Analytics 6(2), 1–9 (2019). https://doi.org/10.18608/jla.2019.62.1
3. Buckingham Shum, S., Knight, S., McNamara, D., Allen, L., Bektik, D., Crossley, S.: Critical perspectives on writing analytics. In: Proceedings of the 6th International Conference on Learning Analytics and Knowledge, pp. 481–483. ACM (2016)
4. Charleer, S., Klerkx, J., Duval, E., De Laet, T., Verbert, K.: Creating effective learning analytics dashboards: lessons learnt. In: Verbert, K., Sharples, M., Klobučar, T. (eds.) EC-TEL 2016. LNCS, vol. 9891, pp. 42–56. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45153-4_4
5. Conde, M.A., Hernández-García, A.: Learning analytics for educational decision making. Comput. Hum. Behav. 47, 1–3 (2015). https://doi.org/10.1016/j.chb.2014.12.034
6. Conijn, R., Dux Speltz, E., van Zaanen, M., Van Waes, L., Chukharev-Hudilainen, E.: A process-oriented dataset of revisions during writing. In: Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 356–361 (2020)
7. Conijn, R., Dux Speltz, E., van Zaanen, M., Van Waes, L., Chukharev-Hudilainen, E.: A product and process oriented tagset for revisions in writing. PsyArXiv (2020). https://doi.org/10.31234/osf.io/h25ak
8. Dollinger, M., Liu, D., Arthars, N., Lodge, J.: Working together in learning analytics towards the co-creation of value. J. Learn. Analytics 6(2), 10–26 (2019). https://doi.org/10.18608/jla.2019.62.2
9. Fitzgerald, J.: Research on revision in writing. Rev. Educ. Res. 57(4), 481–506 (1987)
10. Flower, L., Hayes, J.R., Carey, L., Schriver, K., Stratman, J.: Detection, diagnosis, and the strategies of revision. Coll. Compos. Commun. 37(1), 16–55 (1986)
11. Giacomin, J.: What is human centred design? Des. J. 17(4), 606–623 (2014)
12. Holstein, K., McLaren, B.M., Aleven, V.: Co-designing a real-time classroom orchestration tool to support teacher-AI complementarity. J. Learn. Analytics 6(2), 27–52 (2019)
13. Leijten, M., Van Waes, L.: Keystroke logging in writing research: using Inputlog to analyze and visualize writing processes. Written Commun. 30(3), 358–392 (2013). https://doi.org/10.1177/0741088313491692
14. Likens, A.D., Allen, L.K., McNamara, D.S.: Keystroke dynamics predict essay quality. In: Proceedings of the 39th Annual Meeting of the Cognitive Science Society (CogSci 2017), pp. 2573–2578. London, UK (2017)
15. Lindgren, E., Westum, A., Outakoski, H., Sullivan, K.P.H.: Revising at the leading edge: shaping ideas or clearing up noise. In: Lindgren, E., Sullivan, K. (eds.) Observing Writing, pp. 346–365. Brill (2019). https://doi.org/10.1163/9789004392526_017
16. Lockyer, L., Heathcote, E., Dawson, S.: Informing pedagogical action: aligning learning analytics with learning design. Am. Behav. Sci. 57(10), 1439–1459 (2013)

17. Martinez-Maldonado, R., Echeverria, V., Fernandez Nieto, G., Buckingham Shum, S.: From data to insights: a layered storytelling approach for multimodal learning analytics. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–15 (2020)
18. Martinez-Maldonado, R., Pardo, A., Mirriahi, N., Yacef, K., Kay, J., Clayphan, A.: LATUX: an iterative workflow for designing, validating and deploying learning analytics visualisations. J. Learn. Analytics 2(3), 9–39 (2015). https://doi.org/10.18608/jla.2015.23.3
19. Olson, J.S., Wang, D., Olson, G.M., Zhang, J.: How people write together now: beginning the investigation with advanced undergraduates in a project course. ACM Trans. Comput. Human Interact. (TOCHI) 24(1), 1–40 (2017)
20. Ranalli, J., Feng, H.H., Chukharev-Hudilainen, E.: Exploring the potential of process-tracing technologies to support assessment for learning of L2 writing. Assessing Writ. 36, 77–89 (2018). https://doi.org/10.1016/j.asw.2018.03.007
21. Vandermeulen, N., Leijten, M., Van Waes, L.: Reporting writing process feedback in the classroom: using keystroke logging data to reflect on writing processes. J. Writ. Res. 12(1), 109–140 (2020). https://doi.org/10.17239/jowr-2020.12.01.05
22. Verbert, K., Duval, E., Klerkx, J., Govaerts, S., Santos, J.L.: Learning analytics dashboard applications. Am. Behav. Sci. 57(10), 1500–1509 (2013)
23. Verbert, K., et al.: Learning dashboards: an overview and future research opportunities. Pers. Ubiquit. Comput. 18(6), 1499–1514 (2013). https://doi.org/10.1007/s00779-013-0751-2
24. Wise, A.F., Jung, Y.: Teaching with analytics: towards a situated model of instructional decision-making. J. Learn. Analytics 6(2), 53–69 (2019). https://doi.org/10.18608/jla.2019.62.4

An Operational Framework for Evaluating the Performance of Learning Record Stores

Chahrazed Labba, Azim Roussanaly, and Anne Boyer

Lorraine University, Loria, KIWI Team, Nancy, France
{chahrazed.labba,azim.roussanaly,anne.boyer}@loria.fr

Abstract. Nowadays, Learning Record Stores (LRS) are increasingly used within digital learning systems to store learning experiences. Multiple LRS software products have made their appearance in the market. These systems provide the same basic functional features, including receiving, storing and retrieving learning records. Further, some of them may offer varying features like visualization functions and interfacing with various external systems. However, the non-functional requirements such as scalability, response time and throughput may differ from one LRS to another. Thus, for a specific organization, choosing the appropriate LRS is of high importance, since adopting one that is not optimized in terms of non-functional requirements may lead to a loss of money, time and effort. In this paper, we focus on the performance aspect and we introduce an operational framework for analyzing the performance behaviour of LRS under a set of test scenarios. Moreover, the use of our framework provides the user with the possibility to choose the suitable strategy for sending storing requests to optimize their processing while taking into account the underlying infrastructure. A set of metrics is used to provide performance measurements at the end of each test. To validate our framework, we studied and analyzed the performances of two open source LRS, Learning Locker and Trax.

Keywords: Test scenarios · Non-functional requirements · Learning record store · xAPI specifications

1 Introduction

Nowadays, the learning process is digitized and can happen through different activities such as learning management systems, online training, simulations and games. Given the heterogeneity of these learning activities, keeping track of a learner's learning experience is becoming challenging. The xAPI (or Tin Can) specification (https://adlnet.gov/projects/xapi/) came to solve this challenge. According to Advanced Distributed Learning (ADL, https://www.adlnet.gov), xAPI is a technical specification that aims to

facilitate the documentation and communication of learning experiences. This specification defines the way to describe learning experiences by using a structure in the form of "Actor Verb Object". The above structure is known as a learning record or a statement (a minimal example of such a statement and of its transmission is sketched at the end of this section). Further, the xAPI specifies how these statements can be exchanged electronically by transmission over HTTP or HTTPS to an LRS, which is defined by the ADL (https://adlnet.gov/news/2017/04/06/xapi-learning-record-store-test-suite-andadopter/) as follows: "A server that is responsible for receiving, storing, and providing access to Learning Records." (A server here is a system capable of receiving and processing web requests.) Multiple LRS products have made their appearance in the market, such as Learning Locker (https://www.ht2labs.com/learning-locker-community/overview/), Watershed LRS (https://www.watershedlrs.com) and Trax (http://traxproject.fr/traxlrs.php). The competition, in terms of being the market leader, is high. All LRS systems provide the same basic features, including recording and retrieving learning records. Further, some of them offer additional varying functionalities such as visualization functions and interfacing with various external systems. ADL has set up a rational process [2] for choosing an LRS. The process gives more importance to selection based on functional features, such as analytics, reporting and external integrations, than to non-functional requirements such as response time, scalability and throughput. However, these non-functional requirements may differ from one LRS to another. Thus, for a specific organization, choosing the appropriate LRS is of high importance, since adopting a non-optimized one may lead to both monetary and data losses. There exist many tools [1,4,5] that can be used to run performance tests in order to select the appropriate LRS. However, the test settings need to be constantly edited to run different scenarios. Furthermore, adjusting the performance tests is a time-consuming and error-prone task. The aim of this work is to provide those involved in the process of selecting an LRS with automatic test plans requiring minimal configuration effort. To do so, an operational framework for studying and analyzing the performance behaviour of LRS is proposed. Our framework, called LOLA-meter (Laboratoire Ouvert en Learning Analytics), provides the user with two test plans: 1) The performance plan consists of two test types, the load and stress tests. The aim is to study the LRS behaviour under expected and unexpected load conditions. The load can be expressed in terms of the number of simultaneous requests and/or the number of statements sent within a request (i.e., the size of a POST request). 2) The strategy selection plan consists of two types of tests, the Post Chunk Time and the Post Chunk Statement tests. The aim is to determine the suitable strategy to adopt for sending learning experiences to be stored in the LRS while taking into consideration the infrastructure and the size of the generated data. For both types of test plans, a set of performance indicators is defined to analyze the output results of the test scenarios, e.g., response time and throughput. To validate our test scenarios, we studied and

4 5 6 7 8 9

https://adlnet.gov/news/2017/04/06/xapi-learning-record-store-test-suite-andadopter/. System capable of receiving and processing web requests. https://www.ht2labs.com/learning-locker-community/overview/. https://www.watershedlrs.com. http://traxproject.fr/traxlrs.php. Laboratoire Ouvert en Learning Analytics. The size of a post request.

An Operational Framework for Evaluating the Performance of LRS

47

analyzed the performances of two LRSs including Learning Locker (LL) (see Footnote 5) and Trax (see Footnote 7). Both of the LRS were selected because they are open source and conform to the specification requirements10 defined by ADL. The rest of the paper is organized as follows: Sect. 2 presents the related work. Section 3 introduces the proposed test scenarios. Section 4 presents the experiments and the validation using two open source LRS including LL and Trax. Section 5 enumerates the threats to validity. Section 6 presents the conclusions and the future work.

2

Related Work

There are few works on how to select an LRS. Some vendors, for example, LL (see Footnote 5), watershed LRS (see Footnote 6) and waxLRS11 provide on their websites case studies and demos concerning the adoption of their systems. Further, some vendors provide white papers that discuss how to choose the LRS partner and what questions need be asked to evaluate the LRS products. For example Yet Analytics12 provides a set of questions [8] that are organized into three categories “analytics and reports”, “customer support” and “database security, stability and scalability”. In each category, a set of questions are defined and need to be asked to evaluate the LRS partner. While, H2Labs defines 28 questions13 to conduct an LRS needs analysis. The questions cover the types of deployment (on-site or SaaS), the data sending, storing and retrieving. In [2], the author introduces a rational process to select an LRS. The process gives more importance to the selection based on the functional features rather than the non-functional requirements. Further, the paper does not provide in anyway a comparative rating or evaluation of existing LRS software. In [7], the author highlights the special features and issues to consider when selecting an LRS such as conformance requirement, cost, hosting options and data analytics requirements. In [9], the authors present a set of decisive factors to consider, when searching for an LRS, such as analytics and reporting, security, integration and scalability. In [3], the authors proposed a web-based learning environment dedicated for training how to command and control unmanned autonomous vehicles. One of the main issues revealed in the work is the scalability and performance requirements of the integrated LRS for storing stream data. The authors found that the existing LRS may not perform well under certain circumstances. So, they proposed a storage system based on the use of an adhoc server over SQLLite3. Even though the proposed solution presents an efficient storage system, however it leaves out many facilities provided by the LRS. To summarize, both research works and case studies deal with the selection of LRS from a pure functional point of view. However, the evaluation of the LRS based on the non-functional 10 11 12 13

https://lrstest.adlnet.gov. https://www.elucidat.com/blog/using-wax-lrs/. LRS provider https://www.yetanalytics.com. https://www.ht2labs.com/conducting-lrs-needs-analysis/.

48

C. Labba et al.

requirements is not well taken into consideration. In this work, we focus on providing an operational framework to evaluate and rate the existing LRS software. LOLA-meter provides the users with a set of automatic test plans requiring minimum settings effort to study to which extent an LRS fulfills the non-functional requirements. There exist many tools that can be used to run performance tests in order to select the appropriate LRS. However, the test settings in these tools need to be constantly edited to run different scenarios. Furthermore, adjusting the performance tests is a time consuming task and error prone. In the next section we introduce an updated version of the selection process of an LRS and we present the test plans implemented within LOLA-meter.

3

How to Select an LRS?

ADL has setup a rational process [2] for choosing the appropriate LRS. It emphasizes the LRS selection based on the functional features rather than the nonfunctional requirements. In this section, we introduce an updated version of the selection process. An additional step that considers the analysis of the nonfunctional requirements is added. Section 3.1, introduces the modified selection process and Sect. 3.2 presents LOLA-meter for evaluating LRS software using the non-functional requirements. 3.1

Updated Process for Choosing an LRS

The recommended process for choosing an LRS is originally composed of four main steps: 1) project scope, 2) develop an LRS requirement matrix, 3) develop a feature rating matrix and 4) decision making. The process focuses more on the LRS features and on what the system can do to fulfill the functional requirements, and less on the performance aspect. We therefore extend the current process and include an additional step, "Develop a non-functional requirement matrix". In the following, we explain the different steps of the updated selection process introduced in Fig. 1.

Project Scope: This phase is essential for any organization that wishes to adopt an LRS. Meetings with all the involved stakeholders need to be organized to address the following steps: i) study and analyze the feasibility of acquiring an LRS; ii) determine the critical and high-level requirements for the LRS (defining such requirements allows the organization to exclude many unsuitable LRS candidates); iii) fix the budget, knowing that there are different pricing models; and iv) select the required LRS category.

Develop an LRS Requirement Matrix: This phase consists of the following steps: i) identify the LRS products that best match the LRS category identified in the previous step; and ii) develop and populate a functional requirement matrix. The matrix allows assessing the identified LRS systems against the high-level requirements defined in the project scope phase.


Fig. 1. Updated process for choosing an LRS

Develop an LRS Feature Matrix: During this phase, the following steps need to be achieved: i) filter the list of potential candidates by eliminating those that do not fulfill the minimum functional requirements or are over the fixed budget; ii) develop a detailed list of the features of the remaining LRS; iii) use the system feature rating matrix proposed in [2], which consists in assigning numerical ratings to the LRS features, for example "0" if the LRS does not offer the feature and "10" if the LRS presents a strong implementation of the feature (the rating can rely on the available documentation, on feedback from LRS users, or on contact with LRS representatives); and iv) contact the top-scoring vendors to ask for presentations and demos.

Develop a Non-functional Requirement Matrix: The main contribution of the current work is elaborated in this phase. Considering the non-functional requirements is of high importance to select the right LRS: a suitable LRS for one organization is not necessarily suitable for another, since each has its own requirements. In this phase, we introduce the non-functional requirements matrix, presented in Table 1, which gathers a set of metric measurements. These measurements allow the user to determine the LRS that fulfills their needs. The matrix is populated after running each scenario with a given test plan; we propose two different test types, performance test plans and strategy test plans (detailed in Sect. 3.2). The matrix can be used as follows: i) replace the "X" with the name of the executed test scenario; ii) replace the top row (LRS1, LRS2, ...) with the names of the LRS identified in the previous steps of the selection process; iii) replace the row names (metric1, metric2, ...) with the names of the considered metrics, for example response time and throughput.

Table 1. Non-functional requirements matrix for scenario X

LRS NFR rating matrix for scenario X
Metric name   LRS 1   LRS 2   LRS 3   LRS 4
Metric 1
Metric 2
Metric 3
Metric 4
Metric 5

The last step, iv), is to run the test scenario for each LRS and fill the matrix with the average measurements. To summarize, this phase consists of the following steps: i) define the non-functional requirements; ii) run our automated test plans for all the LRS candidates; and iii) update the non-functional matrix.

Decision Making: The final phase consists of selecting one of the LRS products. Based on the feature and non-functional requirement comparisons, as well as the discussions with the LRS vendors, the organization can make its decision about which LRS to adopt.
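As an illustration of how the non-functional requirement matrix described above could be represented and filled programmatically, the following minimal sketch (not part of LOLA-meter) uses a pandas DataFrame; the metric names, LRS names and numbers are placeholders.

# Illustrative sketch: one non-functional requirements matrix per test scenario,
# held as a pandas DataFrame. Metric/LRS names and values are placeholders; real
# values would come from the test-plan reports.
import pandas as pd

metrics = ["avg response time (ms)", "throughput (req/s)", "error rate (%)"]
lrs_candidates = ["LRS 1", "LRS 2", "LRS 3"]

# Empty matrix for a hypothetical scenario "load-test-config1".
nfr_matrix = pd.DataFrame(index=metrics, columns=lrs_candidates, dtype=float)

# Fill one column with the average measurements of a finished test run
# (illustrative numbers only).
nfr_matrix["LRS 1"] = [10751.2, 0.08, 0.0]

print("NFR rating matrix for scenario load-test-config1")
print(nfr_matrix)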

3.2 Design of Test Plans

In this section we describe the proposed test plans to be used during the fourth step of the selection process presented in Fig. 1. We distinguish two plans: performance and strategy selection. Each plan encompasses a set of test types. For both types of test plans, a set of performance indicators is defined to analyze the output of the test scenarios, e.g. response time and throughput. The test plans are implemented and made available through an operational framework (see Footnote 14) that provides 24 different test scenarios. Our framework is based on the Apache JMeter API (see Footnote 15). The selection of JMeter was not arbitrary: according to several evaluations [1,4-6], JMeter has proven to be one of the best open source testing tools. Compared to the JMeter application, our framework exempts the user from the burden of manually preparing configuration files such as JSON and CSV files; everything is generated automatically through graphical interfaces or a simple configuration file (for the non-GUI version of the framework).

Performance Test Plan: This plan encompasses two types of tests, load and stress tests. Each test requires a set of inputs and provides a set of outputs.

14 Open source: https://github.com/Chahrazed-l/Operational Framework.
15 https://jmeter.apache.org.


– Load test: designed to study the LRS performance behaviour under real-life, expected load conditions. In our case, the load can be expressed either in terms of simultaneous requests (see Footnote 16) or in terms of the number of statements sent within requests (see Footnote 17). If the LRS is used by a single organization, the number of connected users is equal to 1 and the load is expressed as the number of statements sent within a post request. If the LRS is dedicated to multi-organizational use, the load is expressed both as the number of concurrent users and as the number of statements sent within requests. The test is provided with real-life statistics of statement generation (see Footnote 18); the user can either rely on these statistics to run the load test or enter their own information about the number of statements generated by learners.
– Stress test: designed to study the robustness of the LRS beyond the limits of normal load conditions. The load can be expressed in terms of simultaneous requests or the number of statements sent within the requests. This test consists of generating sudden heavy loads, either by increasing the number of simultaneous requests (the number of concurrent users) and/or the number of statements (for post requests).

For both test scenarios, a set of inputs needs to be provided to ensure a smooth execution: 1) the request type (get, post or both); 2) the number of test runs (tests are repeated several times to ensure the stability and accuracy of the results); 3) the time interval between runs, i.e. the period separating the end of one test run from the start of the next; 4) the number of requests; 5) the time interval between requests; 6) the number of concurrent users (see Footnote 19); 7) the number of statements (for post requests); and 8) the LRS information required to connect to the deployed LRS.

Strategy Selection Test Plan: This plan allows selecting a suitable strategy for sending post requests. It encompasses two types of tests: the Post Chunk Time test and the Post Chunk Statement test. The aim is to discover which strategy reduces the LRS load and optimizes statement storage, taking into consideration the underlying computational infrastructure. Each test requires a set of inputs and provides a set of outputs. For both tests, the load is expressed as the number of statements sent within a post request.

– Post Chunk Time test: designed to study the LRS performance under dynamic loads sent periodically (every second, minute, hour or day). We fix the time interval separating the end of sending one request and the start of sending the next, while the number of statements changes for each new request.

16 Also denotes the number of concurrent users connected to the LRS.
17 Specific to requests of type post.
18 Extracted from a Moodle dataset; more information is provided in Sect. 4 (Experiments and Validation).
19 One LRS can be used to store statements coming from different sources.


The statement numbers can be generated with three different methods: random, Poisson and Gaussian. For this test, the user selects the time interval to consider as well as the method for generating the statement numbers.
– Post Chunk Statement test: designed to study the LRS performance under static loads sent aperiodically. This test consists of 1) fixing the number of statements sent within each request and 2) using dynamic time intervals to send the requests. The time intervals can be generated with three different methods: random, Poisson and Gaussian. For this test, the user selects the number of statements to send per request as well as the method for generating the time intervals.

Both the performance and strategy selection test plans provide as output a set of metric measurements such as the response time, the response time distribution, the throughput, the latency, the error rate and the connect time, to name a few. We provide an exhaustive set of metrics to visualize the performance behaviour of the LRS under the different test scenarios. At the end of each scenario, a report containing dashboards and charts with all the measurements is generated.
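To make the mechanics of such test scenarios concrete, the following minimal Python sketch mimics what a load or chunk test does: a few concurrent "users" post batches of xAPI statements to an LRS and latencies are aggregated. It is not the LOLA-meter implementation (which relies on the Apache JMeter API); the endpoint, credentials and verb/activity IRIs are placeholders, and the commented Poisson line only illustrates how variable chunk sizes could be generated.

# Minimal load-test sketch (not the LOLA-meter implementation): concurrent users
# post batches of xAPI statements and response times are summarized.
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import requests

LRS_ENDPOINT = "https://example-lrs.org/xapi/statements"   # placeholder
AUTH = ("lrs_user", "lrs_password")                         # placeholder
HEADERS = {"X-Experience-API-Version": "1.0.3"}

def make_statement():
    """Build one minimal xAPI statement (actor/verb/object)."""
    return {
        "id": str(uuid.uuid4()),
        "actor": {"mbox": "mailto:student@example.org", "name": "Student 1"},
        "verb": {"id": "http://adlnet.gov/expapi/verbs/experienced"},
        "object": {"id": "http://example.org/course/chapter-1"},
    }

def post_chunk(n_statements):
    """Send one post request containing n_statements; return (latency in s, status)."""
    payload = [make_statement() for _ in range(n_statements)]
    start = time.perf_counter()
    resp = requests.post(LRS_ENDPOINT, json=payload, auth=AUTH, headers=HEADERS)
    return time.perf_counter() - start, resp.status_code

def run_load_test(concurrent_users=3, requests_per_user=20, statements_per_request=500):
    sizes = [statements_per_request] * requests_per_user
    # For a variable-chunk scenario, sizes could instead be drawn from a
    # Poisson distribution around an expected rate, e.g.:
    # sizes = np.random.poisson(lam=statements_per_request, size=requests_per_user)
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        results = list(pool.map(post_chunk, sizes))
    latencies = np.array([r[0] for r in results])
    errors = sum(1 for _, code in results if code >= 400)
    print(f"avg={latencies.mean():.2f}s p90={np.percentile(latencies, 90):.2f}s "
          f"max={latencies.max():.2f}s error rate={errors / len(results):.1%}")

if __name__ == "__main__":
    run_load_test()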

4 Experiments and Validation

Two open source LRS have been used to validate the test plans: Trax (see Footnote 7) and LL (see Footnote 5). Both LRS have been deployed on machines with the same characteristics in terms of computing power and memory. The evaluation was carried out using a real-life Moodle dataset (see Footnote 20). The dataset contains the data logged during 769 days (more than 2 years); it covers 2169 users, almost 2 million events and 57 different actions such as viewed, updated and submitted. Moodle is an important source of learning data, and many plugins (see Footnotes 21 and 22) have been developed to generate xAPI statements from Moodle content. In the Moodle traces, one can easily recognize the format of an xAPI statement; for the rest of this section, we consider that a Moodle event is equivalent to an xAPI statement. We investigated in more detail the behaviour of the students in terms of event generation. We considered the 10 students who interacted the most with the Moodle content and, for each of them, we extracted the maximum number of events generated over different time intervals: second, minute, hour and day. Table 2 summarizes the extracted information. For example, the maximum numbers of events generated by Student 1 in a second, a minute, an hour and a day are respectively 11, 28, 385 and 1439. For each student, the values observed for the different time intervals are not necessarily observed on the same day. In Sect. 4.1, we use the information of Student 1 as a reference to run the different test scenarios, including the load and stress tests.

20 https://research.moodle.org/158/.
21 https://moodle.org/plugins/mod_tincanlaunch.
22 https://moodle.org/plugins/logstore_xapi.


The rest of the information is provided through LOLA-meter to run the same scenarios with different inputs in terms of statement generation. Due to space limitations, only some of the test scenarios are presented.

Table 2. Statistics for event generation on different time intervals

Username  Student 1  Student 2  Student 3  Student 4  Student 5  Student 6  Student 7  Student 8  Student 9  Student 10
Second    11         14         17         23         33         64         86         224        227        459
Minute    28         42         60         62         75         198        230        3634       5196       6220
Hour      385        604        608        718        737        893        1113       9938       10959      45959
Day       1439       1457       1861       2079       2952       3884       6139       11119      11799      53506
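For illustration, per-student maxima of this kind could be derived from a raw Moodle log with a few lines of pandas. This is a sketch under the assumption that the log is exported with userid and timestamp columns, which is not necessarily the exact schema of the dataset used here.

# Sketch of how the per-student maxima in Table 2 could be derived from a Moodle
# event log; the column names (userid, timestamp) are assumptions about the export.
import pandas as pd

events = pd.read_csv("moodle_events.csv", parse_dates=["timestamp"])  # hypothetical file

def max_events_per_interval(df, freq):
    """Max number of events a student generated in any single interval of size freq."""
    counts = (df.set_index("timestamp")
                .groupby("userid")
                .resample(freq)
                .size())
    return counts.groupby(level="userid").max()

summary = pd.DataFrame({
    "Second": max_events_per_interval(events, "1s"),
    "Minute": max_events_per_interval(events, "1min"),
    "Hour":   max_events_per_interval(events, "1h"),
    "Day":    max_events_per_interval(events, "1D"),
})

# Ten most active students overall, as in Table 2.
top10 = events["userid"].value_counts().head(10).index
print(summary.loc[top10])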

Table 3. Configurations for the load test

Config. label  Users number  Students per user  Statements per student  Total statements/req  Iteration duration (s)  Iterations number  Break time between iterations (s)
Config.1       1             200                11                      2200                  900                     5                  120
Config.2       1             400                11                      4400                  900                     5                  120
Config.3       1             600                11                      6600                  900                     5                  120
Config.4       1             800                11                      8800                  900                     5                  120
Config.5       1             1000               11                      11000                 900                     5                  120
Config.6       1             1100               11                      12100                 900                     5                  120
Config.7       1             1200               11                      13200                 900                     5                  120

4.1 Performance Test Plan

For this plan, two tests have been performed: the load test and the stress test. The load test is used to analyze the behaviour of the LRS under realistic expected loads. The test is performed using the different configurations shown in Table 3. Each configuration is characterized by the number of users, the number of students per user, the number of statements generated per student, the test duration, the number of iterations and the break time between two successive iterations. Each LRS is deployed as furnished by its provider, without modifying any of the underlying technologies such as the database and the application server: LL uses nginx/MongoDB and Trax uses Apache2/MySQL. Table 4 presents the results obtained at the end of each test scenario. Both LRS have been evaluated in the same manner and using the same configurations. The evaluation considers the number of requests processed during the test duration, the error rate (the rate of requests that the LRS failed to store), the response time and the throughput. Based on the results of the performed tests, one can notice that LL sometimes has the best min and max response-time values.

Table 4. Execution results for both Trax and LL using different configurations

Config.   LRS   #req  KO  %Error  Avg (ms)   Min (ms)  Max (ms)  90% (ms)   95% (ms)   99% (ms)   Debit
Config.1  LL    294   0   0.0%    15431.88   5227      42734     24426.00   33678.25   42213.35   0.06
Config.1  Trax  421   0   0.0%    10751.18   9429      23198     12355.80   14309.50   19049.06   0.08
Config.2  LL    152   0   0.0%    30187.72   14113     51952     36565.80   40472.40   48886.48   0.03
Config.2  Trax  234   0   0.0%    19469.05   18616     23681     20047.00   20534.00   22352.90   0.05
Config.3  LL    98    0   0.0%    47315.87   20227     81839     71867.50   74172.85   81839.00   0.02
Config.3  Trax  150   0   0.0%    30782.15   28790     76040     32640.60   36278.65   62869.25   0.03
Config.4  LL    98    0   0.0%    47432.11   14217     88432     60185.60   62722.10   88432.00   0.02
Config.4  Trax  103   0   0.0%    44925.97   39406     109670    54009.00   59329.20   107892.12  0.02
Config.5  LL    59    0   0.0%    80935.97   14778     171988    100402.00  101631.00  171988.00  0.01
Config.5  Trax  91    0   0.0%    53152.45   48587     297079    52377.40   54405.80   297079.00  0.02
Config.6  LL    51    3   5.58%   93050.45   15914     128741    118633.80  123194.60  128741.00  0.01
Config.6  Trax  83    0   0.0%    56662.07   53186     128302    57786.80   61309.20   128302.00  0.02
Config.7  LL    45    17  37.78%  107262.82  2091      149505    142720.80  145747.90  149505.00  0.01
Config.7  Trax  75    0   0.0%    61524.87   57843     144986    62219.80   69537.00   144986.00  0.01

However, Trax outperforms LL in all test scenarios. The performance difference is noticeable in the number of processed requests, the error rate, the response time (average, 90%, 95% and 99%) and the throughput. For example, as shown in Table 4, for Config.6 and Config.7 LL failed to store all the received requests, failing to process respectively 5.58% and 37.78% of them; its application server returned an internal server error (code 500). For the same configurations, Trax succeeded in storing all the received requests. To investigate the results presented in Table 4 in more detail, additional tests have to be performed using the same technologies (application server and database) for both LRS.
The stress test is used to study the performance of the LRS under unexpected load conditions. The test is performed using the configuration shown in Table 5, characterized by the number of concurrent users (3 users), the number of requests per iteration (20 requests), the break time between two successive requests sent by each user, the number of statements sent within each request, the number of iterations (5 iterations) and the time separating two successive iterations (2 min). Each request contains a different number of statements: as shown in Table 5, the first request contains 551 statements, while the last one contains 804. The statements are written in separate JSON files that are sent within the request bodies. Among the 20 requests, each user sends 6 to 7 requests, and the users send their requests concurrently to the LRS. For example, user 1, user 2 and user 3 send respectively 551, 595 and 613 statements within their first requests, and so on for the rest. The results of the stress test are shown in Table 6. For both LRS, we provide the total execution time (from which the break time between iterations is subtracted), the number of statements successfully stored, the error rate, the average response time, and the min and max response times.


Table 5. Configuration for the stress test

Number of users: 3
Number of requests per iteration: 20
Break time between requests (s): 10
#Statements for the 20 requests: 551, 595, 613, 700, 11281, 11967, 804, 12944, 1588, 1602, 13537, 1881, 14149, 613, 2088, 13537, 1881, 14149, 11967, 804
#Iterations: 5
Break time between iterations (s): 120

Table 6. Performance results for a stress test

LRS   Total exec time (min)  #Statements  Error rate  Avg response time (ms)  Min (ms)  Max (ms)
LL    68                     542 924      15%         104077.99               1524      267548
Trax  28                     552 129      17%         41348.32                2312      105941

Even though Trax has a greater error rate (17%), the number of statements it finally stored successfully is higher than the number stored by LL. Further, the overall execution time with Trax is smaller: 28 min compared to 68 min for LL. This is due to the response time, which is better for Trax, as can be seen from the average and maximum response time values in Table 6.

4.2 Strategy Selection Test Plan

The strategy test allows selecting the way xAPI data is sent to the LRS. As explained in Sect. 3.2, we distinguish two types of test: the Post Chunk Statement test and the Post Chunk Time test.

Table 7. Chunk statement strategy configuration

Statement chunk size  Time intervals (s)                                                  Iteration duration (s)  Iterations number
500                   59.94, 180.16, 59.79, 180.13, 59.67, 179.71, 60.3, 180.15, 60.04    900                     5
1000                  240.103, 239.922, 239.388, 240.448                                  900                     5

Table 8. Results of the chunk statement test

Chunk statement size  LRS   Average (ms)  Min (ms)  Max (ms)  90% (ms)  95% (ms)  99% (ms)
500                   LL    1184.67       878       2864      1651.50   2419.00   2864.00
500                   Trax  2896.84       2220      10234     4152.80   5519.80   10234.00
1000                  LL    1943.90       1722      2435      2124.00   2420.00   2435.00
1000                  Trax  4505.60       4235      5127      4811.40   5111.30   5127.00

The Post Chunk Statement test is performed using the configurations presented in Table 7. Each configuration is characterized by the size of the statement chunk, the time intervals, the iteration duration and the iteration number. The main idea is to analyze the LRS performance when sending a fixed number of statements within the request body while using different time intervals. The time separating the sending of one request from the next can be generated using three different methods: Poisson, Gaussian and random. For each method, the user can fix the range for generating the values. As shown in Table 7, we used two different chunks (500 and 1000). The time intervals for the 500-statement chunk are generated using the random method, while the time intervals for the 1000-statement chunk are derived from the first ones (see Footnote 23). The aim is to show the impact of the chunk size on the LRS performance while using the same data generation scenario. Table 8 presents the response time of both LRS using the statement chunk sizes 500 and 1000. One can notice that LL shows better results than Trax in terms of response time when sending requests with a small number of statements separated by a considerable time interval (at least one minute). Further, with a chunk of 1000, both LRS (LL and Trax) show better performance in terms of response time: as shown in Table 8, the response times of LL and Trax are respectively below 2500 ms and 5200 ms, whereas these values were exceeded more than once for both LRS when using the chunk of 500 statements. We explain this result by the fact that, in the first scenario, the number of sent requests is twice the number sent in the second scenario, and the time interval separating two successive requests of 500 statements each is smaller than the one used during the second test. The Post Chunk Time test is performed using the configurations presented in Table 9. Each configuration is characterized by the time chunk, the number of statements to be sent in each request, the iteration duration and the iteration number. The main idea is to analyze the LRS performance when periodically sending a variable number of statements within each request body. The number of statements in one request can be generated using three different methods: Poisson, Gaussian and random. For each method, the user can fix the range for generating the values.

23 The first time interval to wait before sending a request with 1000 statements corresponds to the sum of the first two intervals used for the requests with 500 statements each.


Table 9. Chunk time strategy configuration

Time chunk size (min)  Statement numbers                                                     Iteration duration (s)  Iterations number
1                      120, 500, 50, 100, 300, 50, 10, 700, 1000, 900, 30, 100, 100, 100, 0  900                     5
5                      1070, 2660, 330                                                       900                     5

Table 10. Execution results for chunk time scenario

Chunk time (min)  LRS   %Error  Avg (ms)  Min (ms)  Max (ms)  90% (ms)   95% (ms)   99% (ms)
1                 LL    0.00%   1179.04   404       5784      2225.20    2452.20    5784.00
1                 Trax  0.00%   1551.92   76        6276      4412.60    4728.20    6276.00
5                 LL    0.00%   3501.80   1176      8663      6925.40    8663.00    8663.00
5                 Trax  0.00%   8196.00   1681      19530     16224.00   19530.00   19530.00

As shown in Table 9, we used two different time chunks (1 and 5 min). The statement numbers for the 1-min chunk are generated using the random method, while the numbers for the 5-min chunk are derived from the first ones. The aim is to show the influence of the time chunk size for the same generated amount of data. Table 10 presents the results of running both scenarios with LL and Trax in terms of response time. We notice that LL is better suited to a batch strategy where the size of the request body is not too large and the time separating two successive requests is considerable. However, this was not the case during the load and stress tests, where the request size was large in terms of statement number and the break time between two successive requests was less than 10 s. For both the Post Chunk Time and the Post Chunk Statement tests, the final number of statements to be stored in the LRS is the same. However, we notice that the Post Chunk Statement strategy is more appropriate for sending data to the LRS: overall, the response times recorded for both statement chunks (500 and 1000) are better than those recorded using fixed time chunks. Indeed, we think that the use of fixed time chunks may not be appropriate. During peak use, a significant number of statements may be generated, encapsulated and sent within a single request, which the LRS will be unable to process in a reasonable time; if the chunk time is small, the LRS will be submerged by requests carrying huge numbers of statements. This situation may be even worse if the LRS is used by more than one organization. The use of fixed statement chunks may be more appropriate to keep performance stable, especially in terms of response time, provided that a suitable chunk size is selected.

5 Threats to Validity

The current work presents some limitations that we tried to mitigate where possible: (i) We used a single machine to deploy each LRS used in our tests. We intend to run more tests using load balancing and distributed deployments; cloud computing may be an appropriate environment to do so. (ii) Our test tool runs on a single machine, which is a limitation when it comes to using many concurrent users with heavy loads. We plan to extend the current version to a distributed one where several machines can be used to run large-scale scenarios. (iii) The test tool has been validated using only two open source LRS. We plan to contact LRS vendors to carry out a larger performance evaluation and publish benchmarking studies.

6 Conclusion

Multiple LRS have appeared on the market. ADL provides a rational process for selecting an appropriate LRS; however, this process emphasizes selection based on functional features rather than non-functional requirements. Thus, in this paper we proposed an updated version of the LRS selection process by adding a step called "Develop a non-functional requirement matrix". This step is supported by the development and implementation of a set of test plans to evaluate the performance of an LRS as well as to determine the suitable strategy to adopt for sending learning data. A set of metrics is used to provide the performance measurements at the end of each test. Our automated test plans have been validated using two open source LRS, LL and Trax. Evaluating the performance of an LRS depends on the organization's requirements as well as on the context of use: an appropriate LRS for one organization is not necessarily appropriate for another. The evaluation of LL and Trax performed in the current work does not in any way constitute a recommendation of one LRS over the other; we merely provide a snapshot in time of the results that a user may obtain by using LOLA-meter. As future work, we intend to enhance LOLA-meter to support distributed testing in order to perform large-scale tests. Moreover, we plan to use cloud environments to perform additional performance evaluations covering multiple LRS products with different deployment settings.

Acknowledgement. This work has been done in the framework of the LOLA (see Footnote 8) project, with the support of the French Ministry of Higher Education, Research and Innovation.

References

1. Abbas, R., Sultan, Z., Bhatti, S.N.: Comparative analysis of automated load testing tools: Apache Jmeter, Microsoft Visual Studio (TFS), Loadrunner, Siege. In: 2017 International Conference on Communication Technologies (ComTech), pp. 39-44 (2017)


2. Berking, P.: Technical report: choosing a learning record store. Technical Report version 1.13, Advanced Distributed Learning (ADL) Initiative, December 2016
3. Dodero, J.M., González-Conejero, E.J., Gutiérrez-Herrera, G., Peinado, S., Tocino, J.T., Ruiz-Rube, I.: Trade-off between interoperability and data collection performance when designing an architecture for learning analytics. Future Gener. Comput. Syst. 68, 31-37 (2017). https://doi.org/10.1016/j.future.2016.06.040
4. Khan, R.B.: Comparative study of performance testing tools: Apache Jmeter and HP Loadrunner. Ph.D. thesis, Department of Software Engineering (2016)
5. Maila-Maila, F., Intriago-Pazmiño, M., Ibarra-Fiallo, J.: Evaluation of open source software for testing performance of web applications. In: Rocha, Á., Adeli, H., Reis, L.P., Costanzo, S. (eds.) WorldCIST'19 2019. AISC, vol. 931, pp. 75-82. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-16184-2_8
6. Paz, S., Bernardino, J.: Comparative analysis of web platform assessment tools. In: WEBIST, pp. 116-125 (2017)
7. Presnall, B.: Choosing a learning record store (the 2019 edition). In: DEVLEARN, October 2019
8. Tscheulin, A.: How to choose your LRS partner. Yet Analytics blog (2017)
9. Vermeulen, M., Wolfe, W.: In search of a learning record store. In: DEVLEARN, October 2019

Does an E-mail Reminder Intervention with Learning Analytics Reduce Procrastination in a Blended University Course?

Iryna Nikolayeva1, Amel Yessad1(B), Bertrand Laforge2,3, and Vanda Luengo1

1 Laboratoire d'Informatique de Paris 6 (LIP6), Sorbonne Université, 75005 Paris, France
[email protected]
2 CNRS/IN2P3, Université de Paris, 75013 Paris, France
3 Laboratoire de Physique Nucléaire et des Hautes Energies (LPNHE), Sorbonne Université, 75005 Paris, France

Abstract. Procrastination is a widespread self-regulatory failure. It consists in voluntarily delaying work despite expecting to be worse off as a result. Procrastination impacts students' performance and well-being. It is therefore argued that universities could and should play a more active role in helping freshmen improve their time management. We designed an intervention to scaffold regular work for large university classes in a platform-independent, easily scalable, and transferable manner. Our intervention consisted of a weekly e-mail reminding students to complete chapter quizzes, which closed shortly after the end of each chapter. The content of the reminder e-mails varied across our five experimental groups to additionally include different types of personalised advice. We ran the intervention for one month with 1130 freshmen in a blended university course. We study whether regularly sending e-mails improves work regularity and final performance, as well as the impact of the e-mail content on work regularity and performance. We show that simple e-mail reminders improved the regularity of quiz completion, the total number of quizzes filled in, and the progress in overall performance. We also show that e-mail content matters and that complex personalised advice was counter-productive in our intervention.

Keywords: Learning Management System · Massive course · Email reminder · Self-regulation · Procrastination · Upscale

1 Introduction

When students enter university, they have less regular testing and less teacher supervision. Hence freshmen frequently struggle to work regularly. This results in procrastination behaviours.


Following Steel [30], 'to procrastinate is to voluntarily delay an intended course of action despite expecting to be worse off for the delay'. Procrastination negatively impacts students' performance and well-being [26,30] and is frequently considered a self-regulation failure [11,24]. Extensive research is available on ways to help students self-regulate at university [3,31,34]. Interventions usually require specific technologies or platforms, or the creation of additional courses; in practice, such interventions rarely scale up easily, especially for large classes. We created a simple, platform-independent, and easily scalable intervention for 1130 students: e-mail reminders. We implemented this intervention in a physics course for freshmen. The course provided optional online quizzes closing shortly after the end of the corresponding chapters. The intervention consisted of weekly e-mails reminding students to fill in the open quizzes and featuring more or less personalised content. Does such an e-mail reminder intervention reduce procrastination in a blended university course? We test two hypotheses: (1) sending regular messages to students can help them work more regularly and improve final performance, and (2) the content of these messages can influence students' procrastination behaviour. We assess the impact of the e-mails on the regularity of quiz completion, the final number of completed quizzes, and the performance progress during the entire course.

2 Related Work

It has been shown that online learning environments require strong Self-Regulated Learning (SRL) skills for effective learning [12,34]. SRL is a process in which individual students actively and constructively monitor and control their own motivation, cognition and behaviour towards the successful completion of academic tasks [36]. SRL can be decomposed into three phases: a forethought phase where students set objectives and prepare to act, an action phase where students need to manage their concentration, and a reflection phase where students compare their goals to actual outcomes and deduce lessons for further improvement [37]. Procrastination is a widespread self-regulatory failure [11,24]. In the last decades, meta-analyses have confirmed the association between procrastination and lower academic performance [10,11,18,24]. Indeed, accelerated learning before a deadline has been found less successful than studying at an even pace [1]. It is therefore important to address procrastination, especially when students need to adapt to a new learning methodology and environment, as is the case at university. Improving time management is widely thought to reduce procrastination [21,22,32]. It is argued that universities could and should play a more active role in helping freshmen improve their time management [7,8,23]. To scaffold time management among other SRL skills, interventions have frequently implemented:

– offline courses with active exercises [25],
– offline courses with feedback through dialogues [3],
– additional online information [20,34],
– online prompts [2,4,16,17,28], and
– mobile prompts [31].

Rowe and Rafferty provided a review of and guidelines for such interventions [27] and concluded that prompting SRL skills led to improvements in performance. Some analyses focus on the effects of such interventions on procrastination. Häfner and colleagues tested a 2-h time-management intervention, along with personal behavioural advice, in groups of 12 students [14]. The analysis of the records showed that a deadline rush occurred in the control group but not in the experimental group. Wäschle and colleagues [33] conducted two experiments with visual feedback on self-reported procrastination behaviour. Learners in the visual feedback condition were shown a coloured line chart depicting their weekly reported level of procrastination (i.e., red for high, yellow for medium, and green for low). The results showed that learners in the visual feedback condition had significantly lower levels of self-reported procrastination and set more specific learning goals. The authors also showed that this result was explained by a signalling effect due to better metacognitive awareness as well as an informational effect due to the accuracy of the presented information. However, the improvements in performance were not statistically significant. Schmitz and Wiese [29] created a four-week training in SRL skills for engineering students, in which they asked participants to self-report procrastination behaviour as an outcome variable. They first introduced Zimmerman's [38] model of self-regulation. Then, they presented methods for planning and time management (e.g. day- and week-planning, developing specific and proximal learning goals, prioritizing tasks) as well as methods of self-instruction (e.g. stopping negative thoughts and practicing positive self-talk to enhance concentration and motivation). Schmitz and Wiese obtained a significant decrease in self-reported procrastination behaviour and an increase in perceived self-efficacy. However, they did not measure the impact on performance. Eckert and colleagues [9] designed an intervention based on emotional regulation to reduce procrastination. They asked volunteers to participate in an online course to tackle procrastination, which trained the ability to accept and modify the negative emotions generated by tasks. The authors then compared pre- and post-questionnaires measuring procrastination levels. They found that the online training helped to significantly reduce procrastination behaviours, even though the effect was relatively small. They did not measure the impact on performance. These studies have, however, several limitations. None of them reports a significant effect on performance, a major learning outcome. Their findings are additionally based on self-reported behaviours, prone to over-statements and inaccurate conclusions. Moreover, these interventions would require an important time investment to scale up, either in teaching time or in tool development time.


Data-informed nudges have been shown to efficiently improve student engagement and success at a large scale with little investment (prompts integrated into educational tools, e-mails, smartphone notifications, etc.). Data-informed nudges are simple to use and are adaptable not only across various systems but also across teaching contexts and modes [6]. We study the effects of an easy and scalable online intervention to improve work regularity in a blended learning context at university. We suggest an intervention that directly aims to change students' behaviour using time management prompts, with and without learning analytics, for online deadlines; it therefore requires no additional training and is very easily scalable to any course containing online, active and regular exercises. Changes in behaviour are then analysed from data logs. We test two hypotheses: whether regular messages to students influence their procrastination behaviour and final performance, and whether the content of these messages influences the students.

3 Methods

Our intervention tests two hypotheses: (1) sending regular messages to students can help them work more regularly and improve final performance, and (2) the content of these messages can influence students' procrastination behaviour. The content ranges from the simplest reminder to content with learning analytics and an online questionnaire helping students reflect on previous and future performance. We want to scaffold the three phases of SRL described in [37]: forethought, action, and reflection.

3.1 The Course

We designed an e-mail reminder intervention for students following the same physics course in the first year of university. The course lasted from the end of September 2018 to the beginning of January 2019. In this course, students were split into ten groups, and ten teachers gave weekly lectures in parallel in large auditoria of approximately 150 students each, presenting the course content of the week. Additionally, students were subdivided into groups of 30 and had weekly tutorials, where they solved exercises related to the content taught in lectures; each group of 30 students had one distinct teacher. As for the evaluation, two mandatory experimental practicals were graded, and students had a mid-semester exam in November and a final exam in January. In terms of course material, a course handout contained all the course content in written format; it was available online as a PDF file as well as in print. All online content was stored on the Moodle Learning Management System (LMS). Additionally, small exercises on the Wims server helped students work on several prerequisites.


Videos explaining the key points of each chapter were also available online, and quizzes helped students test whether they had understood the main content of the videos. In order to encourage students to take the quizzes when the corresponding chapters were taught in class, quizzes were only open for a limited time window and closed one week after the end of the corresponding chapter. However, to help students work before the exams, the teachers had decided to open up all quizzes one week before the mid-semester exam as well as two weeks before the final exam. In total, 30 quizzes were available, all graded from 0 to 10. All online content was provided through Moodle. Students following this course were all enrolled in more or less selective interdisciplinary science majors, including mathematics, physics, computer science, chemistry, geology, as well as double degrees between science and a non-scientific field (design, social sciences, languages, etc.). To design our intervention and ensure consistency with the course design, we had exchanges with two teachers throughout the design period. We decided with them that the reminders would address quizzes and would be sent weekly. We also agreed on criteria to randomly assign students to experimental groups in order to reduce differences between these groups, and on the form of the messages sent to each experimental group. The other teachers were not associated with the intervention, to avoid influencing students' reactions to it.

3.2 Collected Data

The students' initial level in physics was evaluated by the teachers at the very beginning of the course using an online questionnaire. Prior to the experiment, students filled in a form related to the forethought phase of the SRL process. The form asked for their goals for the academic year in terms of performance and regularity of work, their motivation to fill in quizzes, and their belief that an e-mail intervention could help them work more regularly. The online activity was logged using Moodle logs. We mainly exploited information about the number of connections to the course on the platform, their timings, connections to specific pages, submissions of quiz responses, and the corresponding scores. The forum was not used within this course. We also had access to students' final grades. The final grade (/100) consisted of the final exam grade (/55), a continuous assessment grade for participation and a mid-term exam (/25), and a grade for the two practicals (/20).

3.3 E-mail Content

We created five types of e-mail content. Depending on the experimental group, the e-mail contained one or several of the following elements (Table 1).


Table 1. Description of interventions per group

Group  Reminder  Summary of results  Advice  Auto-evaluation  E-mail example
0      No e-mail sent
1      x                                                      Fig. 1
2      x         x                                            Fig. 2
3      x         x                   x                        Figure online
4      x         x                           x                Figure online
5      x         x                   x       x                Figure online

Online figures are available at https://gitlab.lip6.fr/nikolayeva/spocmailspublic.

– Reminder: a non-personalised reminder including the date when the quizzes of the current chapter close and a link to those quizzes. This simple e-mail only scaffolds the action phase of the SRL process, in a minimalist manner. Reminders have previously been shown to be efficient in scaffolding SRL strategies [15].
– Summary of results: a graph with the student's scores on open quizzes. The graph encourages the student to start the self-reflection phase of the SRL process and is inspired by the positive results of Wäschle and colleagues [33].
– Summary + Advice: the summary graph is completed by a line indicating the initial score objective set by the student or, if the student had not set one, by teachers. Personalised advice on the next steps is then given, and a reminder of course resources and exercises is added. We not only invite students to self-reflect but also scaffold this self-reflection with advice that aims at helping them prioritize work and suggests alternative strategies to master course content [15].
– Auto-evaluation: we help students engage in better self-regulated learning behaviour. E-mails contained a link to five questions that guided students to set weekly objectives and to note whether they had met the objectives of the previous week. A structured document scaffolded this self-reflection. This content is inspired by the successful intervention of Schmitz and Wiese [29].

Table 1 sums up the e-mail content received by each group, with links to precise e-mail examples. The control group, Group 0, did not receive any e-mails. Between November 9th and December 14th included, we sent the weekly e-mails.
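As an illustration of how lightweight such an intervention is, the following sketch (not the authors' mailing code) generates and sends a Group 1-style reminder with Python's standard library; the SMTP host, sender address, quiz URL and closing date are placeholders.

# Minimal sketch of a weekly non-personalised reminder e-mail (Group 1 style).
import smtplib
from email.message import EmailMessage

SMTP_HOST = "smtp.example-university.fr"          # placeholder
SENDER = "physics-course@example-university.fr"   # placeholder

def build_reminder(student_email, quiz_url, closing_date):
    msg = EmailMessage()
    msg["Subject"] = "Reminder: this week's quizzes close soon"
    msg["From"] = SENDER
    msg["To"] = student_email
    msg.set_content(
        f"The quizzes for the current chapter close on {closing_date}.\n"
        f"You can complete them here: {quiz_url}\n"
    )
    return msg

def send_weekly_reminders(student_emails, quiz_url, closing_date):
    with smtplib.SMTP(SMTP_HOST) as server:
        for address in student_emails:
            server.send_message(build_reminder(address, quiz_url, closing_date))

# Example call:
# send_weekly_reminders(["student@example.org"],
#                       "https://moodle.example.org/quiz", "Friday 18:00")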

3.4 Student Assignment

We randomly assigned students to the six experimental groups. We controlled this random assignment for the confounding variables major and teacher. We did not further control for the initial grade, since that would have imposed too many constraints on the random assignment. We only had around five people from the same tutorial group in each experimental group.


Fig. 1. Example of e-mail content from Group 1: reminder e-mail.

Fig. 2. Example of e-mail content for Group 2: reminder + summary of results.

Moreover, we considered that controlling for the major and the teacher was more important, since there was much more variability in level and motivation between students from different majors than between students from the same major. Indeed, some majors were selective, others were not, and physics was a more or less central subject, so student levels and involvement varied greatly between majors. Teacher variability accounted mainly for the fact that some teachers encouraged students more than others to use online resources. We asked all the students for permission to use their data for research, in accordance with the current European legislation. Among the 1457 students of the course, we obtained the authorisation of 1130 students to use their data. Table 2 gives details about the distribution and the initial level of these 1130 students within the six experimental groups. It shows very similar test completion rates and initial levels, confirming that our experimental groups are comparable. Data logs were gathered from the LMS at the end of the course. Our code is available at https://gitlab.lip6.fr/nikolayeva/spocmailspublic.

Table 2. Description of students in groups

Group  Final number of students  Pre-test scores: mean ± std. dev. (/100)  Pre-test completion rate
0      179                       59 ± 25                                   91%
1      199                       58 ± 26                                   90%
2      181                       60 ± 27                                   93%
3      193                       59 ± 25                                   91%
4      203                       60 ± 27                                   90%
5      175                       58 ± 25                                   93%
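For illustration, the following sketch shows one way such a stratified random assignment could be implemented; the roster columns (major, teacher) are assumed names, and this is not the authors' actual assignment procedure.

# Sketch of a stratified random assignment into six experimental groups,
# balancing the confounders mentioned above (major and teacher).
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

def assign_groups(students: pd.DataFrame, n_groups: int = 6) -> pd.Series:
    """Within each (major, teacher) stratum, shuffle students and deal them
    round-robin into groups 0..n_groups-1 so every group gets a similar mix."""
    assignment = pd.Series(-1, index=students.index)
    for _, stratum in students.groupby(["major", "teacher"]):
        shuffled = stratum.sample(frac=1, random_state=int(rng.integers(1_000_000)))
        assignment.loc[shuffled.index] = np.arange(len(shuffled)) % n_groups
    return assignment

# students = pd.read_csv("students.csv")   # hypothetical roster
# students["group"] = assign_groups(students)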

4 Results

Significant differences appear between groups in quiz completion and in progress over the entire course.

4.1 Quiz Completion

When plotting the distribution of the number of quizzes that each student completed, separated by experimental group, there is a tendency for students receiving the simple reminders (Group 1) to complete more quizzes than the control group (Group 0) (Fig. 3(a)). When performing the Mann-Whitney U test for the difference of medians, we find a p-value of 0.0016; after Bonferroni correction for multiple testing, the p-value is 0.0079, which is lower than the standard significance threshold of 0.05. Therefore, the median number of quizzes completed by students of Group 1 is significantly larger than for the control group. The effect size measured by Cohen's d is 0.31. There is no significant difference between the other groups and the control group after multiple-testing correction. Quiz score distributions do not differ between groups (results shown in the code online). All means were above 8, confirming that quizzes were designed to check the understanding of course content rather than to challenge students with deep comprehension questions.
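The kind of comparison reported above can be sketched in a few lines: a Mann-Whitney U test, a Bonferroni correction over the five comparisons against the control group, and a Cohen's d estimate. The arrays below are placeholders, not the real per-student quiz counts.

# Sketch of the group comparison: Mann-Whitney U, Bonferroni correction, Cohen's d.
import numpy as np
from scipy import stats

quizzes_group0 = np.array([3, 5, 0, 12, 7, 4, 9])    # control (placeholder data)
quizzes_group1 = np.array([6, 8, 2, 15, 10, 7, 11])  # reminder group (placeholder)

u_stat, p_value = stats.mannwhitneyu(quizzes_group1, quizzes_group0,
                                     alternative="two-sided")
p_bonferroni = min(p_value * 5, 1.0)   # five experimental groups compared to control

def cohens_d(a, b):
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

print(f"U={u_stat:.1f}, corrected p={p_bonferroni:.4f}, "
      f"d={cohens_d(quizzes_group1, quizzes_group0):.2f}")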

Fig. 3. Variations in quiz completion between groups: (a) boxplot of quiz completion; (b) p-values for difference in medians with Bonferroni correction.

4.2 Regularity of Quiz Completion Throughout the Course

We investigated whether these quizzes were completed more regularly in the experimental groups. We define group regularity as follows: group A of students works more regularly than group B if the proportion of working students varies less between weeks in group A than in group B.


For a given week, a working student is a student who has completed at least one quiz during that week. Figure 4 presents the number of submissions per week per group. For instance, in week 11, 20% of students submitted a quiz response; in week 12, 40% of Group 1 students submitted a quiz response versus only 31% of Group 0. Before the experiment, the groups behave more or less similarly. Once the experiment starts, a difference in behaviour appears over the weeks: the control group submits systematically fewer quizzes up to three weeks before the exam, and in the last three weeks students from the control group rush to complete leftover quizzes more than in the other groups. The control group therefore procrastinated more than the other groups. We also want to check whether the patterns of quiz completion differ over time. In order to take into account all the time-dependent confounding variables (workload, time before exams, etc.), we compare what happens at the same time points between the different experimental groups. We therefore perform the 2-sample Kolmogorov-Smirnov test for difference in distributions between groups, from the beginning of the intervention (week 11) until the final exam (week 20). We find a Bonferroni-corrected p-value of 0.0031: the distribution of the reminder group (Group 1) differs significantly from that of the control group (Group 0). The effect size measured by the Kolmogorov-Smirnov D is 0.13. No significant effect was found for the other groups and, as expected, no significant effect was found between Group 0 and Group 1 before the beginning of the intervention.
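For illustration, the weekly proportion of working students per group could be computed from the submission logs as sketched below; the column names are assumptions, and in the study the denominator would be the full group roster rather than only the students who ever submitted.

# Sketch: weekly proportion of "working" students (>= 1 quiz submission) per group.
import pandas as pd

submissions = pd.read_csv("quiz_submissions.csv", parse_dates=["timestamp"])  # hypothetical
# Approximate group sizes from submitters; the real denominator is the roster size.
group_sizes = submissions.groupby("group")["userid"].nunique()

submissions["week"] = submissions["timestamp"].dt.isocalendar().week
working = (submissions.groupby(["group", "week"])["userid"]
                      .nunique()
                      .rename("working_students")
                      .reset_index())
working["proportion"] = working.apply(
    lambda row: row["working_students"] / group_sizes[row["group"]], axis=1)

print(working.pivot(index="week", columns="group", values="proportion"))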

Fig. 4. Histogram of the number of quizzes submitted per week by students of each experimental group, divided by the number of students in each group. P-values for differences in these weekly distributions between the control group (Group 0) and others are 0.0031 for Group 1, 0.15 for Group 2, 1 for Group 3, 0.16 for Group 4, and 1 for Group 5. These p-values are Bonferroni corrected for multiple testing.

4.3 Progress on the Entire Course

As shown previously, Group 1 filled in quizzes significantly more regularly than the control group and filled in significantly more quizzes throughout the course. Does this have an impact on performance progress? To measure progress, we compute the ratio between the final score and the initial score. Figure 5 shows the kernel density estimate of this ratio for the control group (Group 0) and the e-mail reminder group (Group 1). Kernel density estimates are a smoothed version of histograms, normalised by the number of samples. We can therefore see in Fig. 5 that the reminder group (Group 1) is slightly enriched in higher progress measures than the control group (Group 0). The main difference is an enrichment around high values, namely around the value 3. Therefore, a test for difference in distributions is more powerful in this case than a test for difference of means. The 2-sample Kolmogorov-Smirnov test gives a p-value of 0.049: the difference in progress between Group 1 and the control group is statistically significant at the threshold of 0.05. The effect size, measured by the Kolmogorov-Smirnov D statistic, is 0.15.
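The progress measure and the two-sample Kolmogorov-Smirnov comparison reported above can be sketched as follows, using placeholder score arrays instead of the real data.

# Sketch: progress ratio (final/initial score) and two-sample KS test.
import numpy as np
from scipy import stats

initial_g0, final_g0 = np.array([50., 60., 40., 70.]), np.array([55., 58., 52., 80.])
initial_g1, final_g1 = np.array([45., 65., 55., 30.]), np.array([60., 70., 75., 85.])

progress_g0 = final_g0 / initial_g0   # control group (placeholder values)
progress_g1 = final_g1 / initial_g1   # reminder group (placeholder values)

d_stat, p_value = stats.ks_2samp(progress_g1, progress_g0)
print(f"KS D={d_stat:.2f}, p={p_value:.3f}")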

Fig. 5. Kernel density estimate of the relative progress on the course. Progress = final grade/initial level grade.

We also checked that there was no significant difference in progress between the other groups and the control group.

5 Discussion and Conclusion

In this paper, we set out to create an intervention that was easily implementable at a large scale, platform-independent, and helped university students to work more regularly. We created a one-month-long intervention during which we sent weekly e-mails with more or less personalised and enriched content. We found that simple e-mail reminders helped students to fill in more quizzes, more regularly.


We also found that the content of the e-mail mattered and that, in our case, a simple non-personalised reminder was more successful than longer, more personalised content. For the most successful e-mail group (Group 1), this regular and more active engagement with course content significantly increased the overall performance of the group. These results validate the hypothesis that regular e-mails to students can influence regularity and performance. The effect sizes of the improvements in quiz completion, regularity, and performance are comparable to, though slightly lower than, those found for scaffolding in fully online activities [35]. With an easy e-mail intervention, we limited procrastination and encouraged active learning. This intervention required very limited time investment from teachers and students, and the most successful e-mails can be generated very easily. It convinced teachers of the usefulness of computer-mediated interventions on their large cohort of students. We also provide insight into how to design efficient reminder e-mails for time regulation. Concerning the second hypothesis, on e-mail content, students reacted better to less sophisticated, easily actionable content. The workload of reading the e-mail may have been too large compared to the perceived importance of the optional quiz exercise, demotivating students from taking these e-mails into account. Moreover, when consulted on a smartphone, sophisticated content would require scrolling down to reach the end of the e-mail, potentially overloading and demotivating students. Likewise, students in Group 4, defined in Sect. 3.3, did not use the opportunity to fill in weekly objectives and reflect on their performance of the past week. Future work may focus more deeply on interface design for e-mails used to manage pending tasks [13]. Another explanation may be the non-pertinence of the additional information: at the moment of the e-mailing, a majority of students had not yet completed their quizzes, and seeing graphs without any grades may have demotivated them. Moreover, as already mentioned, quizzes were not a central activity in this blended course, so helping students reflect on optional content using learning analytics may have appeared not to address the right priorities. However, since these conditions require more reflective analysis than the simple reminder group, the behaviour change in self-regulation might transfer better to future learning experiences. Another open question is whether such an intervention, if performed long enough, creates a habit of working regularly that would then transfer to other courses, or whether it always needs to be accompanied by SRL courses. Our results have several limitations: we could have performed interviews with several students to better understand the results of the e-mail content variation. Unfortunately, this was not possible for practical reasons; in future experiments, such interviews would give us more insight into which mail content works best. Moreover, since we worked only on online data, we did not remind students to do the most important work in this course, namely preparing exercises for tutorials. It is actually remarkable to achieve such performance while only commenting on optional quizzes, and it may be even more efficient to measure the impact of reminders for the most important exercises; indeed, being able to fit the course design has been shown to be crucial [6].


This would, however, require gathering offline data. We did try to obtain this information from self-reported questionnaires (Group 4), since this approach had worked in another setting [29], but as already mentioned, students did not fill in these questionnaires. Fitting in a short session on self-regulatory skills and pointing to the self-reports may encourage students to fill in the reports and may provide us with valuable data; however, it would limit the scalability and ease of use. Another improvement would be to increase the length of the intervention from one month to the entire course length. This may increase the statistical signal and would be particularly useful to make a stronger claim about the overall performance progress. To our knowledge, this is the first study using e-mails to scaffold time management. It is also a first study claiming a significant performance improvement while tackling procrastination in a blended university course. The results are consistent with numerous studies in completely online learning environments [35]. However, the effect size of the overall performance improvement remains small: the sample size is just large enough to statistically validate a slight performance increase. Even though it is in line with [5,19], the link between long-term performance improvement and regular reminders needs to be reproduced. Other main benefits of our intervention are its ease of implementation and independence from platforms. These two qualities make it easy to reproduce in other courses and generate a practical impact [6]. In our university, it also drew attention to learning analytics as a means to improve learning. Such an intervention could also encourage universities to start helping students regulate their learning, as well as encourage studies on the usefulness of learning analytics to scaffold self-regulation. It is a good starting point to show the value of learning analytics to teachers and university administrators.

Acknowledgement. We thank the teachers of the course for supporting this research, specifically Laurence Rezeau and Frédéric Daigne, CAPSULE and Yves Noël for technical support, Clément Choukroun for his precious, thorough, and timely rereading, and Aymeric Dieuleveut for statistical insights.

References

1. Ariely, D., Wertenbroch, K.: Procrastination, deadlines, and performance: self-control by precommitment. Am. Psychol. Soc. 13(3), 219–224 (2002)
2. Bannert, M., Hildebrand, M., Mengelkamp, C.: Effects of a metacognitive support device in learning environments. Comput. Hum. Behav. 25(4), 829–835 (2009)
3. Beaumont, C., Moscrop, C., Canning, S.: Easing the transition from school to HE: scaffolding the development of self-regulated learning through a dialogic approach to feedback. J. Further High. Educ. 40(3), 331–350 (2016)
4. Bixler, B.A.: The effects of scaffolding students' problem-solving process via question prompts on problem solving and intrinsic motivation in an online learning environment, vol. 68. The Pennsylvania State University (2007)
5. Bjork, R.A., Dunlosky, J., Kornell, N.: Self-regulated learning: beliefs, techniques, and illusions. Annu. Rev. Psychol. 64, 417–444 (2013)

6. Blumenstein, M., Liu, D.Y.T., Richards, D., Leichtweis, S., Stephens, J.M.: Data-informed nudges for student engagement and success. In: Lodge, J.M., Horvath, J.C., Corrin, L. (eds.) Learning Analytics in the Classroom: Translating Learning Analytics for Teachers, pp. 185–207. Routledge (2018)
7. Dabbagh, N., Kitsantas, A.: Using web-based pedagogical tools as scaffolds for self-regulated learning. Instr. Sci. 33, 513–540 (2005). https://doi.org/10.1007/s11251-005-1278-3
8. Dabbagh, N., Kitsantas, A.: Using learning management systems as metacognitive tools to support self-regulation in higher education contexts. In: Azevedo, R., Aleven, V. (eds.) International Handbook of Metacognition and Learning Technologies. SIHE, vol. 28, pp. 197–211. Springer, New York (2013). https://doi.org/10.1007/978-1-4419-5546-3_14
9. Eckert, M., Ebert, D.D., Lehr, D., Sieland, B., Berking, M.: Overcome procrastination: enhancing emotion regulation skills reduce procrastination. Learn. Individ. Differ. 52, 10–18 (2016). https://doi.org/10.1016/j.lindif.2016.10.001
10. Eerde, W.V.: A meta-analytically derived nomological network of procrastination. Personality Individ. Differ. 35, 1401–1418 (2003)
11. Fritzsche, B.A., Young, B.R., Hickson, K.C.: Individual differences in academic procrastination tendency and writing success. Personality Individ. Differ. 35, 1549–1557 (2003)
12. Greene, J.A., Moos, D.C., Azevedo, R.: Self-regulation of learning with computer-based learning environments. In: New Directions for Teaching and Learning, pp. 107–115, no. 126 (2011). https://doi.org/10.1002/tl.449
13. Gwizdka, J., Chignell, M.: Individual differences and task-based user interface evaluation: a case study of pending tasks in email. Interact. Comput. 16(4), 769–797 (2004). https://doi.org/10.1016/j.intcom.2004.04.008
14. Häfner, A., Oberst, V., Stock, A.: Avoiding procrastination through time management: an experimental intervention study. Educ. Stud. 40(3), 352–360 (2014)
15. Hill, J.R., Hannafin, M.J.: Teaching and learning in digital environments: the resurgence of resource-based learning. Educ. Technol. Res. Dev. 49(3), 37–52 (2001). https://doi.org/10.1007/BF02504914
16. Hu, H.: Effects of self-regulated learning strategy training on learners' achievement, motivation and strategy use in a web-enhanced instructional environment. Dissertation Abstracts International, UMI No. AA (2007)
17. Kauffman, D.F., Ge, X., Xie, K., Chen, C.H.: Prompting in web-based environments: supporting self-monitoring and problem solving skills in college students. J. Educ. Comput. Res. 38(2), 115–137 (2008)
18. Kim, K.R., Seo, E.H.: The relationship between procrastination and academic performance: a meta-analysis. Personality Individ. Differ. 82, 26–33 (2015)
19. Kimball, D.R., Metcalfe, J.: Delaying judgments of learning affects memory, not metamemory. Mem. Cogn. 31(6), 918–929 (2003). https://doi.org/10.3758/BF03196445
20. Kizilcec, R.F., Pérez-Sanagustín, M., Maldonado, J.J.: Recommending self-regulated learning strategies does not improve performance in a MOOC. In: Proceedings of the Third (2016) ACM Conference on Learning@Scale, pp. 101–104 (2016)
21. Lynch, J.: Effective Ph.D. Candidate Series (2008)
22. Maguire, S., Evans, S.E., Dyas, L.: Approaches to learning: a study of first-year geography undergraduates. J. Geogr. High. Educ. 25(1), 95–107 (2001)

23. van der Meer, J., Jansen, E., Torenbeek, M.: It's almost a mindset that teachers need to change: first-year students need to be inducted into time management. Stud. High. Educ. 35(7), 777–791 (2010)
24. Moon, S.M., Illingworth, A.J.: Exploring the dynamic nature of procrastination: a latent growth curve analysis of academic procrastination. Personality Individ. Differ. 38, 297–309 (2005). https://doi.org/10.1016/j.paid.2004.04.009
25. Núñez, J.C., Cerezo, R., Bernardo, A., Rosário, P., Valle, A., Fernández, E.: Implementation of training programs in self-regulated learning strategies in Moodle format: results of an experience in higher education. Psicothema 23(2), 274–281 (2011)
26. Rabin, L.A., Fogel, J., Nutter-Upham, K.E.: Academic procrastination in college students: the role of self-reported executive function. J. Clin. Exp. Neuropsychol. 33(3), 344–357 (2011). https://doi.org/10.1080/13803395.2010.518597
27. Rowe, F.A., Rafferty, J.A.: Instructional design interventions for supporting self-regulated learning: enhancing academic outcomes in postsecondary e-learning environments. MERLOT J. Online Learn. Teach. 9(4), 590–601 (2013)
28. Saito, H., Miwa, K.: Construction of a learning environment supporting learners' reflection: a case of information seeking on the Web. Comput. Educ. 49(2), 214–229 (2007)
29. Schmitz, B., Wiese, B.S.: New perspectives for the evaluation of training sessions in self-regulated learning: time-series analyses of diary data. Contemp. Educ. Psychol. 31, 64–96 (2006). https://doi.org/10.1016/j.cedpsych.2005.02.002
30. Steel, P.: The nature of procrastination: a meta-analytic and theoretical review of quintessential self-regulatory failure. Psychol. Bull. 133(1), 65–94 (2007)
31. Tabuenca, B., Kalz, M., Drachsler, H., Specht, M.: Time will tell: the role of mobile learning analytics in self-regulated learning. Comput. Educ. 89, 53–74 (2015)
32. Unsworth, K., Kauter, K.: Evaluating an earlybird scheme: encouraging early assignment writing and revising. High. Educ. Res. Dev. 27(1), 69–76 (2008)
33. Wäschle, K., Lachner, A., Stucke, B., Rey, S., Frömmel, C., Nückles, M.: Effects of visual feedback on medical students' procrastination within web-based planning and reflection protocols. Comput. Hum. Behav. 41, 120–136 (2014). https://doi.org/10.1016/j.chb.2014.09.022
34. Wong, J., et al.: Supporting self-regulated learning in online learning environments and MOOCs: a systematic review. Int. J. Hum. Comput. Interact. 35(4–5), 356–373 (2019)
35. Zheng, L.: The effectiveness of self-regulated learning scaffolds on academic performance in computer-based learning environments: a meta-analysis. Asia Pacific Educ. Rev. 17(2), 187–202 (2016). https://doi.org/10.1007/s12564-016-9426-9
36. Zimmerman, B.J., Schunk, D.H.: Self-Regulated Learning and Academic Achievement: Theoretical Perspectives. Routledge, New York (2001)
37. Zimmerman, B.J.: Developing self-fulfilling cycles of academic regulation: an analysis of exemplary models. In: Schunk, D.H., Zimmerman, B.J. (eds.) Self-Regulated Learning: From Teaching to Self-Reflective Practice, pp. 1–19. Guilford Press, New York (1998)
38. Zimmerman, B.J.: Attaining self-regulation: a social cognitive perspective. In: Handbook of Self-Regulation, pp. 13–39. Elsevier (2000)

Designing an Online Self-assessment for Informed Study Decisions: The User Perspective

L. E. C. Delnoij(&), J. P. W. Janssen, K. J. H. Dirkx, and R. L. Martens

Open University of the Netherlands, Valkenburgerweg 177, 6419 AT Heerlen, The Netherlands
[email protected]
https://www.ou.nl/

Abstract. This paper presents the results of a study carried out as part of the design-based development of an online self-assessment for prospective students in higher online education. The self-assessment consists of a set of tests – predictive of completion – and is meant to improve informed decision making prior to enrolment. The rationale is that better decision making will help to address the ongoing concern of non-completion in higher online education. A prototypical design of the self-assessment was created based on an extensive literature review and correlational research aimed at investigating validity evidence concerning the predictive value of the tests. The present study focused on investigating validity evidence regarding the content of the self-assessment (including the feedback it provides) from a user perspective. Results from a survey among prospective students (N = 66) indicated that predictive validity and content validity of the self-assessment are somewhat at odds: three out of the five tests included in the current prototype were considered relevant by prospective students. Moreover, students rated eleven additionally suggested tests – currently not included – as relevant to their study decision. Expectations regarding the feedback to be provided in connection with the tests include an explanation of the measurement and advice for further preparation. A comparison of the obtained scores to a reference group (i.e., other test-takers or successful students) is not expected. Implications for further development and evaluation of the self-assessment are discussed.

Keywords: Study decision · Self-assessment · Feedback · Higher online education · Design-based research

1 Introduction

The number of students not completing a course or study program in higher online education remains problematic, despite a range of initiatives to decrease non-completion rates [30, 34, 35, 37]. It is in the interest of both students and educational institutions to keep non-completion at a minimum [37]. One way to address this problem is by taking action prior to student enrolment, ensuring that the study expectations of prospective students are realistic [27, 37].

Adequate, personalized information has been shown to help prospective students make informed study decisions [9, 16] and, by extension, reduce non-completion [15, 39]. A self-assessment (SA) can provide such information [25, 26]. The current study contributes to the development of such a SA at an open online university. This SA will be available online for prospective students and inform them about the match between their characteristics (knowledge, skills, and attitudes) on the one hand, and what appears to be conducive to (read: predictive of) completion in higher online education on the other hand. The aim of the SA is not to select, but to provide feedback for action, so that prospective students can make a well-considered study choice [9, 15, 16], based on realistic expectations [27]. By following up on feedback suggestions (e.g., for remedial materials) they can start better prepared. However, as Broos and colleagues [3, p. 3] have argued: “…advice may contribute to the study success of some students, but for others, it may be more beneficial to stimulate the exploration of other (study) pathways. It may prevent (…) losing an entire year of study when faster reorientation is possible”. Nonetheless, the SA will be offered as an optional and (in accordance with the open access policy of the institution) non-selective tool to visitors of the institutional website. A first prototypical design of the SA (i.e., its constituent tests) was created, based on two prior studies: an extensive literature review and subsequent correlational research [6, 7]. Both studies were carried out to collect evidence concerning the predictive value of constituent tests regarding completion. However, the predictive value is only one of the five sources of validity evidence, as identified in the Standards for Educational and Psychological Testing [4, 5, 31]. Another important source of validity evidence is the content of the SA [31], which is the main concern of the present investigation. There are various reasons to investigate content validity, in addition to the predictive value of the constituent tests. The most important one is that, although previous research may have indicated that a certain test (variable) is a relevant predictor of completion, this does not necessarily mean that users perceive it as useful in the context of their study decision. When it is not perceived as useful, it becomes less likely that prospective students complete the test(s) and use the information they can gain from it [14]. The previous argument applies not only to each separate test but also to the overarching SA, i.e., whether the SA is perceived as a useful, coherent and balanced set of tests. Second, validity evidence based on the content of a test is not limited to the content of the actual test but includes the feedback provided in relation to obtained scores. Regarding this feedback, several design questions remain unanswered. In short, the general research question addressed in this paper is: ‘What are user expectations regarding the tests included in a SA prior to enrolment, including the feedback provided on obtained test scores?’ The next sections will provide some theoretical background regarding the SA and the feedback design, before elaborating on the more specific research questions and the methods used.

1.1 Self-assessment Model

Figure 1 provides the domain model (UML class diagram) of the SA [18, 38]. The figure illustrates that users attain a score on a predictor (i.e., a test, like basic mathematical skills, or a single indicator, like the number of hours occupied in employment). A predictor included in the SA represents either a dispositional characteristic (i.e., pertaining to the student, like discipline) or a situational characteristic (i.e., pertaining to the student's life circumstances, e.g., social support) [7]. The score a user attains on a test falls within a particular score range (labeled, e.g., unfavorable, sufficient, or favorable odds for completion). The exact score ranges (their cut-off points) of the current SA depend on parameters, which are set in the predictive model [7]. For this paper, it suffices to understand that feedback is designed in relation to the score ranges, rather than particular scores. With respect to the exact constituent content elements of the feedback (apart from the obvious score, cf. Sect. 1.2), the current study is designed to fill in the existing gaps, as indicated by the empty boxes in the lower right part of Fig. 1. These gaps will be discussed in more detail in Sect. 1.2.

Fig. 1. Self-assessment domain model.
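To make the relations in Fig. 1 concrete, the following minimal sketch expresses the domain model as plain Python classes. The predictor name, cut-off values, range labels, and feedback texts are illustrative assumptions for this sketch only; in the actual SA the cut-offs follow from the parameters of the predictive model [7].

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ScoreRange:
    label: str        # e.g., "unfavourable", "sufficient", "favourable"
    lower: float      # inclusive lower cut-off
    upper: float      # exclusive upper cut-off
    feedback: str     # feedback text is attached to the range, not to a single score

@dataclass
class Predictor:
    name: str                 # a test or single indicator
    kind: str                 # "dispositional" or "situational"
    ranges: List[ScoreRange]  # cut-offs would come from the predictive model

    def feedback_for(self, score: float) -> str:
        """Map an attained score to the feedback of the range it falls in."""
        for r in self.ranges:
            if r.lower <= score < r.upper:
                return f"{self.name}: {r.label} odds for completion. {r.feedback}"
        raise ValueError("score outside all defined ranges")

# Illustrative cut-offs and feedback texts (assumptions, not the real SA parameters)
math = Predictor(
    name="basic mathematical skills",
    kind="dispositional",
    ranges=[
        ScoreRange("unfavourable", 0, 50, "Consider the remedial materials before enrolling."),
        ScoreRange("sufficient", 50, 75, "A brief refresher may help you start better prepared."),
        ScoreRange("favourable", 75, 101, "No further preparation appears necessary."),
    ],
)

print(math.feedback_for(62))  # falls in the "sufficient" range, so the refresher advice is returned
```

Designing feedback against score ranges rather than raw scores keeps the advice stable under small score differences and makes it natural to attach further feedback elements, such as those discussed in Sect. 1.2, to a range rather than to a single score.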

Figure 2 shows the tests as presented to prospective students in the first prototypical design of the SA. Tests relating to dispositional variables are presented under the headers ‘knowledge/skills’ and ‘attitude’. Situational variables are presented under the header ‘profile information’. These headers were chosen, instead of research jargon, to align with the users’ frame of reference.

Fig. 2. First prototype of the self-assessment.

The review study that was carried out to make this first selection of tests was inconclusive regarding a number of predictors and appeared biased towards a face-to-face educational context [6]. This means that, in addition to the tests validated in our previous research [6, 7], other tests might be relevant as well. For instance, recent research, not available at the time of the first prototypical design of the self-assessment, has demonstrated that technological skills (e.g., computer skills and information literacy) might be relevant, especially in the context of higher online education [19]. Furthermore, it has been argued that measures of actual behavior should be considered next to self-report measures, to enhance the validity of the SA [22, 24]. Actual behavior might be measured, for instance, through a content sample test, which involves studying course literature and/or watching video lectures, followed by a short exam. Such a content sample test has also been shown to predict first-year academic achievement [24]. All in all, these are sufficient reasons to collect further validity evidence on the content of the SA so far, and to do so from the perspective of prospective users: if they consider the tests to be useful, they are more likely to complete the SA and use the feedback to help them make an informed decision [14].

1.2 Feedback

Feedback during the transition to new educational contexts has been considered pivotal regarding student motivation, confidence, retention, and success [20, 28]. Feedback on test scores in a study decision process can be designed in various ways [2, 3, 11, 25, 26]. However, with a view to transparency, it is evident that the attained score and an explanation of this score should be part of the feedback. Because the feedback provided on a score is connected to a particular score range (Fig. 1), it makes sense to provide and explain the score in this context, as the example presented in Fig. 3 illustrates.

The attained score is visualized through an arrow in a bar. The bar represents the score ranges. Visualization of feedback data has several benefits as evidenced by research in the field of learning analytics: clearly illustrating a point, personalization, and memorability of feedback information [33]. Furthermore, the visualization in a bar representing score ranges is in line with other SAs prior to enrolment [11, 26].

Fig. 3. In-context visualization of the attained score.
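For illustration, a bar of this kind can be drawn with a few lines of matplotlib. The sketch below is only an assumed rendering: the range boundaries, colours, and the example score are made up for the sketch and do not reproduce the actual SA interface.

```python
import matplotlib.pyplot as plt

# Illustrative score ranges (lower bound, width, label) -- not the real cut-offs
ranges = [(0, 50, "unfavourable"), (50, 25, "sufficient"), (75, 25, "favourable")]
colours = ["#e57373", "#ffd54f", "#81c784"]
attained = 62  # example score

fig, ax = plt.subplots(figsize=(6, 1.6))
for (start, width, label), colour in zip(ranges, colours):
    ax.barh(0, width, left=start, color=colour, edgecolor="white")
    ax.text(start + width / 2, 0, label, ha="center", va="center", fontsize=9)

# Arrow marking the attained score within its range
ax.annotate("your score", xy=(attained, 0.4), xytext=(attained, 0.95),
            ha="center", arrowprops=dict(arrowstyle="->"))

ax.set_xlim(0, 100)
ax.set_ylim(-0.5, 1.1)
ax.set_yticks([])
ax.set_xlabel("test score")
plt.tight_layout()
plt.show()
```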

Besides this basic information, additional feedback needs (previously referred to as gaps in Sect. 1.1) are explored in this study. Current practices illustrate the broad variety of possibilities. For instance, the feedback provided in two Flemish self-assessment instruments entailed a comparison of the attained scores to the scores of a reference group consisting of other test-takers [2, 3, 11] or (successful) first-year students [2, 3]. In an online SA used in Germany [25, 26], the feedback focused on assisting prospective students in interpreting their scores, independently of a comparison to a reference group. What is best does not become clear from the literature. For instance, social comparison theory suggests that in times of uncertainty, individuals evaluate their abilities by comparing themselves to others, to reduce that uncertainty [10]. However, others suggest that information on success or failure in comparison to peers might have an adverse impact on students' motivation and self-esteem [8, 21]. Another possible feedback component is an indication of the odds for completion, as described by Fonteyne and Duyck [11]. In this case, odds are based on multiple test scores and visualized by a traffic light system. Though students appeared curious about the odds for completion, they also perceived them as quite confronting. Furthermore, regarding transparency and feedback for action [12], the feedback might contain a description of what was measured [25, 26] and information for action, including tips to improve or a reference to advisory services [2, 3, 25, 26]. Regarding feedback for action, Broos and colleagues [2, 3] have demonstrated that consultation of a feedback dashboard was related to academic achievement. However, a definite causal relationship with the received feedback (i.e., a change in students' beliefs and study behavior) could not be established. Broos and colleagues [3] conclude that dashboard usage may qualify as an early warning signal in itself. Again, it is paramount that prospective students perceive the feedback as relevant, since this will affect their intention to use it, and thereby ultimately the effectiveness of the SA [14]. The present study, therefore, investigates prospective students' expectations regarding the feedback provided in the SA.

1.3 Research Questions

In the present study, we aim to complement the evidence for (predictive) validity of the SA with validity evidence based on the content of the SA, as perceived by prospective users. To that end, we chose to perform a small-scale user study, addressing the following research questions:

1. Which tests do prospective students consider relevant in the study decision process?
2. To what extent do tests considered relevant by prospective students overlap with tests included in the current SA prototype?
3. What are prospective students' expectations regarding the feedback provided in relation to the tests?

2 Method

2.1 Context

The SA is designed, developed, and evaluated in the context of the Open University of the Netherlands (OUNL), which provides mainly online education, occasionally combined with face-to-face meetings. Academic courses, up to full bachelor and master programs, are provided in the following domains: law, management sciences, informatics, environmental sciences, cultural sciences, educational sciences, and psychology. The open-access policy of OUNL means that for all courses, except courses at master degree level, the only entry requirement is a minimum age of 18 years.

2.2 Design

The present study is part of a design-based research process that typically comprises iterative stages of analysis, design, development, and evaluation [17, 32]. More particularly, this study is part of the design stage, reporting on a small-scale user study for further content validation of the SA. The study involves a survey design, examining prospective students' opinions [5].

2.3 Materials

Participants' views on the SA content were investigated via two questions. In the first question, a list of 17 tests, including those already incorporated in the prototypical design, was presented. The additionally presented tests were selected based on a consultation of the literature [e.g., 19, 22, 24] as well as experts in the field. Respondents were asked to rate the perceived usefulness of each test for their study decision on a 5-point Likert scale (completely useless (1), somewhat useless (2), neither useless nor useful (3), somewhat useful (4), and completely useful (5)). In the second question, it was explained that the feedback on each test contains the obtained score and an explanation of this score. Participants were asked to indicate which of the following feedback elements they would expect in addition (multiple answers possible): an explanation of what was measured [25, 26], their score compared to the score of successful students [3], their score compared to the score of other test-takers [2, 11], an indication of their odds for completion [11], and advice on further preparation for (a) course(s) or study program, when relevant [2, 3, 25, 26].

2.4 Participants and Procedure

In total, 73 prospective students were approached to participate and complete the online survey, resulting in 66 valid responses. Participants constituted a convenience sample [5] of prospective students who signed up for a ‘Meet and Match’ event for their study of interest, i.e., law or cultural sciences. We opted for this convenience sample as it consists of prospective students with a serious interest in following a course or study program at the OUNL (as demonstrated by signing up for the Meet and Match event, for which a fee was charged).

2.5 Analysis

Survey data were analyzed in Jamovi 1.1.8.0 [29, 36]. For the usefulness of the tests (research questions 1 and 2), both the mean (the standard measure of central tendency) and the mode were presented. As the measurement level of the data for the first two research questions was ordinal, we based our conclusions on the mode. A mode of 4 (somewhat useful) or 5 (completely useful) was considered indicative of perceived usefulness. In answering research question 3, frequencies were reported for each answer option (see Sect. 2.3).
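For readers who want to reproduce this kind of summary outside Jamovi, the sketch below shows an equivalent computation in Python with pandas. The responses, column names, and feedback elements are made-up examples for illustration only, not the study data.

```python
import pandas as pd

# Made-up example ratings on the 5-point scale (1 = completely useless ... 5 = completely useful)
ratings = pd.DataFrame({
    "basic mathematical skills": [3, 2, 3, 1, 4],
    "academic self-efficacy":    [4, 5, 4, 3, 4],
    "content sample test":       [5, 5, 4, 3, 5],
})

summary = pd.DataFrame({
    "mode": ratings.mode().iloc[0],      # first mode per test if several are tied
    "mean": ratings.mean().round(2),
    "sd":   ratings.std().round(2),
})
summary["perceived_useful"] = summary["mode"] >= 4   # threshold used in the study

# Multiple-response question: one boolean column per feedback element (example data)
feedback = pd.DataFrame({
    "explanation of measurement": [True, True, False, True, True],
    "advice for preparation":     [True, True, True, False, True],
    "comparison to test-takers":  [False, True, False, False, False],
})
percentages = (feedback.mean() * 100).round(1)       # % of respondents selecting each element

print(summary)
print(percentages)
```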

3 Results

3.1 Perceived Usefulness of Self-assessment Tests

The first two research questions were aimed at gaining insight into the perceived usefulness of tests. Table 1 provides an overview of prospective students' ratings of the tests. The scores (modes) are ranked from high to low. The tests that are included in the current prototype of the SA are indicated by a checkmark in the first column, to facilitate exploration of the overlap between ‘ratings of usefulness’ and ‘currently included’ (second research question). A content sample test and tests on interests, learning strategies, motivation, academic self-efficacy, career perspectives, information literacy, intelligence, language skills, perseverance, prior knowledge, procrastination (discipline), study goals and intentions, and writing skills are considered useful (Mode ≥ 4). Not all currently included tests are considered useful by prospective students. Two tests (basic mathematical skills and social support) yielded a mode of 3.00, which was below our threshold. On the other hand, academic self-efficacy, study goals and intentions, and procrastination (discipline) were perceived as useful (Mode = 4.00).

Table 1. Tests ranked on mode usefulness as indicated by prospective students.

Test1                              Mode   M (SD)        Min–Max
Content sample test2               5.00   3.87 (1.14)   1.00–5.00
Interests                          5.00   3.88 (1.30)   1.00–5.00
Learning strategies                5.00   4.29 (0.86)   1.00–5.00
Motivation                         5.00   3.58 (1.37)   1.00–5.00
✓ Academic self-efficacy           4.00   3.58 (1.24)   1.00–5.00
Career perspectives                4.00   3.67 (1.15)   1.00–5.00
Information literacy               4.00   3.92 (1.04)   1.00–5.00
Intelligence2                      4.00   3.84 (1.02)   1.00–5.00
Language skills                    4.00   3.76 (1.10)   1.00–5.00
Perseverance                       4.00   3.55 (1.28)   1.00–5.00
Prior knowledge                    4.00   3.88 (1.05)   1.00–5.00
✓ Procrastination (discipline)2    4.00   3.84 (1.02)   1.00–5.00
✓ Study goals and intentions       4.00   3.71 (0.99)   1.00–5.00
Writing skills2                    4.00   4.07 (0.89)   1.00–5.00
✓ Basic mathematical skills        3.00   2.53 (1.23)   1.00–5.00
Computer skills                    3.00   2.67 (1.18)   1.00–5.00
✓ Social support                   3.00   3.00 (1.24)   1.00–5.00

1 Check marks indicate the tests included in the prototypical SA.
2 Due to a technical error, only answered by 45 respondents.

3.2 Feedback Content

The third research question aimed at gaining insight into prospective students' expectations regarding the feedback provided in relation to the SA tests. Table 2 presents an overview of the potential feedback elements, ranked by the percentage of students that listed each element (high to low). Next to the obtained score and an explanation of this score (i.e., the minimal feedback), 78.8% of the prospective students expect an explanation of what was measured, and 78.8% of the prospective students expect advice on further preparation for (a) course(s) or study, when relevant. Furthermore, 75.8% of the students expect an indication of the chances of completing a course or study. Finally, a comparison with a reference group is not expected by prospective students, as becomes clear from the relatively low frequencies for both comparisons with scores of other test-takers (40.9%) and scores of successful students (39.4%).

Table 2. Feedback content elements expected by prospective students (%)

                                                                             % (N = 66)
Explanation of the test (what was measured)                                  78.8
Advice on further preparation for (a) course(s) or study, when relevant      78.8
Indication of chances of completing a course or study program at the OUNL    75.8
Comparison of obtained score to scores of other test-takers                  40.9
Comparison of obtained score to scores of successful students                39.4

4 Discussion

The present study aimed to collect evidence for the content validity of the SA by gaining insight into prospective students' opinions and expectations of a SA prior to enrolment and the feedback it provides.

4.1 Self-assessment Content

In terms of content validity, further evidence is obtained by the present study for three tests that were already included in the current SA: academic self-efficacy, study goals and intentions, and procrastination (discipline). In line with our previous studies [6, 7], these tests appear useful for prospective students as well. Furthermore, the results of the present study show that prospective students find information on specific knowledge (i.e., prior knowledge), skills (i.e., language skills, information literacy, learning strategies, and writing skills), and experience (i.e., a content sample test) useful in the process of their study decision. Although such tests did not appear as relevant predictors of completion in our previous studies [6, 7], it might be beneficial to (re)consider these as possible tests for the SA and to further investigate, for example, their predictive value in the current context. This holds especially since previous research has also stressed the relevance of, for instance, a content sample test (i.e., providing video lectures on a general academic topic, followed by a short exam) to support students in making well-informed study decisions [22, 24]. Finally, our results show that two tests (i.e., basic mathematical skills and social support) – which proved to be relevant for completion in the online higher education context in our previous studies [6, 7] – are not necessarily perceived as useful by prospective students. Part of this result (basic mathematical skills) is likely to be an artefact of the specific sample, i.e., prospective students interested in law or cultural sciences. However, bearing in mind that prospective students need to recognize the usefulness of the tests [6, 7, 14], this also means due attention should be paid to clarifying the relevance of tests included in the SA to prospective students.

4.2 Feedback Content

Regarding the content of the feedback, results show that potential users of the SA expect an explanation of what was measured, as well as advice on further preparation for a course or study program at the OUNL, when relevant. Prospective students do not expect a comparison of their score to the score of a reference group (i.e., other test-takers or successful students). Overall, these results are in line with evaluations of feedback in learning analytics dashboards (LADs). For instance, Jivet and colleagues [12] have shown that transparency (i.e., explanations of the scales used, and why these are relevant) and support for action (i.e., recommendations on how to change their study behavior) are important for students to make sense of a LAD aimed at self-regulated learning. Following these results, the feedback in the SA domain model (Fig. 1) is complemented with information on what was measured and why, and advice for further preparation for a course or study program in the current context. This information is presented under the headers ‘Measurement’ and ‘Advice’, respectively.

‘Measurement’ contains information on the test and the relevance of this test in relation to studying in online higher education [25, 26]. Yang and Carless [40] have stated that introducing students to the purpose(s) of the feedback is important for feedback to be effective. ‘Advice’ provides information on potential future actions that prospective students may take to start better prepared [2, 3, 25, 26]. In that regard, the feedback literature has suggested that good feedback practices inform students about their active role in generating, processing, and using feedback [21]. Based on the results of the present study, we decided not to include a comparison of the attained score to a reference group in the current prototype of the feedback. Furthermore, the odds for completion are not included in the prototypical feedback, even though a majority of prospective students appears to expect this. Calculating an indication of the odds for completion requires predictive models capturing the combined effects of predictors for each program within a specific field [11]. In the current context, where students do not necessarily commit to a specific study program, but can also decide to enroll in a combination of courses of different study programs, including an indication of the odds for completion appears infeasible. Nevertheless, these results provide input for managing expectations regarding the self-assessment.

4.3 Limitations and Future Directions

Several limitations of the present study are noteworthy, as they point out directions for the future development and evaluation of the self-assessment and the feedback it provides. First, the present study involves a relatively small convenience sample. Participants were interested in specific study domains (i.e., law or cultural sciences), which is likely to have had an impact on certain results (e.g., the perceived usefulness of a basic mathematical skills test). Thus, it would be valuable to extend the current sample with results of prospective students in other fields. Nevertheless, small-scale user studies can be considered part of the rapid, low-cost and low-risk pilot tests, which are an increasingly important instrument in contemporary research, enabling adjustments and refinements in further iterations of the self-assessment and feedback [3]. Second, future development of the self-assessment and its feedback should take into account the opinions of other stakeholders, most notably student advisors, as their work is affected by the SA when prospective students call on their help and advice as a follow-up on attained test results and feedback [2]. A third recommendation is to further investigate extending the content of the SA by including measurements of actual behavior through a content sample test [22, 24]. Interestingly, research has shown that a content sample test is not only predictive of academic achievement; this experience of the content and level of a study program apparently also affects the predictive value of other tests. For instance, Niessen and colleagues [23] have demonstrated that scores on other tests (i.e., procrastination and study skills tests), taken after the first course (i.e., an introductory course), more strongly predict academic achievement than scores on the same tests taken prior to enrolment. As the SA is meant to be a generic, rather than a domain-specific instrument, we aim to develop a program-independent content sample test (e.g., on academic integrity) in the near future.

Finally, the prototypical feedback merits further investigation of, e.g., language and tone [1], the framing of the score (i.e., a focus on what goes well vs. a focus on points of improvement) [13], possible visualizations [1, 33], and, last but not least, impact, i.e., consequential validity [7].

Acknowledgements. We would like to thank Henry Hermans, Hubert Vogten, Harrie Martens, and Steven Elston of the Open University of the Netherlands for their support in the technical design and development of the self-assessment.

References

1. Boscardin, C., Fergus, K.B., Hellevig, B., Hauer, K.E.: Twelve tips to promote successful development of a learner performance dashboard within a medical education program. Med. Teach. 40(8), 855–861 (2018)
2. Broos, T., Verbert, K., Langie, G., Van Soom, C., De Laet, T.: Multi-institutional positioning test feedback dashboard for aspiring students: lessons learnt from a case study in Flanders. In: Proceedings of the 8th International Conference on Learning Analytics and Knowledge, pp. 51–55 (2018). https://doi.org/10.1145/3170358.3170419
3. Broos, T., Pinxten, M., Delporte, M., Verbert, K., De Laet, T.: Learning dashboards at scale: early warning and overall first year experience. Assess. Eval. High. Educ. 1–20 (2019). https://doi.org/10.1080/02602938.2019.1689546
4. Cizek, G.J., Bowen, D., Church, K.: Sources of validity evidence for educational and psychological tests: a follow-up study. Educ. Psychol. Measur. 70(5), 732–743 (2010). https://doi.org/10.1177/0013164410379323
5. Creswell, J.W.: Educational research: planning, conducting, and evaluating quantitative and qualitative research. Prentice Hall, Upper Saddle River (2014)
6. Delnoij, L.E.C., Dirkx, K.J.H., Janssen, J.P.W., Martens, R.L.: Predicting and resolving non-completion in higher (online) education – a literature review. Educ. Res. Rev. 29, 100313 (2020). https://doi.org/10.1016/j.edurev.2020.100313
7. Delnoij, L.E.C., et al.: Predicting completion – the road to informed study decisions in higher online education. Manuscript submitted for publication (2020)
8. Dijkstra, P., Kuyper, H., van der Werf, G., Buunk, A.P., van der Zee, Y.G.: Social comparison in the classroom: a review. Rev. Educ. Res. 78, 828–879 (2008). https://doi.org/10.3102/0034654308321210
9. Essig, G.N., Kelly, K.R.: Comparison of the effectiveness of two assessment feedback models in reducing career indecision. J. Career Assess. 21(4), 519–536 (2013). https://doi.org/10.1177/1069072712475283
10. Festinger, L.: A theory of social comparison processes. Hum. Relat. 7(2), 117–140 (1954). https://doi.org/10.1177/001872675400700202
11. Fonteyne, L., Duyck, W.: Vraag het aan SIMON! [Dutch]. Thema Hoger Onderwijs 2, 56–60 (2015)
12. Jivet, I., et al.: From students with love: an empirical study on learner goals, self-regulated learning and sense-making of learning analytics in higher education (2020)
13. Jug, R., Jiang, X., Bean, S.M.: Giving and receiving effective feedback. Arch. Pathol. Lab. Med. 143(2), 244–251 (2019). https://doi.org/10.5858/arpa.2018-0058-RA
14. King, W.R., He, J.: A meta-analysis of the technology acceptance model. Inf. Manag. 43(6), 740–755 (2006). https://doi.org/10.1016/j.im.2006.05.003

15. Kubinger, K.D., Frebort, M., Müller, C.: Self-assessments im Rahmen der Studienberatung: Möglichkeiten und Grenzen. In: Kubinger, K.D., Frebort, M., Khorramdel, L., Weitensfelder, L. (eds.) Self-Assessment: Theorie und Konzepte, pp. 9–24. Pabst Science Publishers, Lengerich (2012). [German]
16. McGrath, C., et al.: Higher education entrance qualifications and exams in Europe: a comparison (2014). http://www.rand.org/pubs/research_reports/RR574
17. McKenney, S., Reeves, T.C.: Conducting Educational Design Research, 2nd edn. Routledge, London (2018)
18. Microsoft Corporation: Microsoft Visio (2018). https://products.office.com/en/visio/flowchart-software
19. Muljana, P.S., Luo, T.: Factors contributing to student retention in online learning and recommended strategies for improvement: a systematic literature review. J. Inf. Technol. Educ. Res. 18, 19–57 (2019). https://doi.org/10.28945/4182
20. Nicol, D.J.: Assessment for learner self-regulation: enhancing achievement in the first year using learning technologies. Assess. Eval. High. Educ. 34(3), 335–352 (2009). https://doi.org/10.1080/02602930802255139
21. Nicol, D.J., Macfarlane-Dick, D.: Formative assessment and self-regulated learning: a model and seven principles of good feedback practice. Stud. High. Educ. 31(2), 199–218 (2006). https://doi.org/10.1080/03075070600572090
22. Niessen, A.S.M., Meijer, R.R., Tendeiro, J.N.: Predicting performance in higher education using proximal predictors. PLoS ONE 11(4), e0153663 (2016). https://doi.org/10.1371/journal.pone.0153663
23. Niessen, A.S.M., Meijer, R.R., Tendeiro, J.N.: Measuring noncognitive predictors in high-stakes contexts: the effect of self-presentation on self-report instruments used in admission to higher education. Pers. Individ. Differ. 106, 183–189 (2017). https://doi.org/10.1016/j.paid.2016.11.014
24. Niessen, A.S.M., Meijer, R.R., Tendeiro, J.N.: Admission testing for higher education: a multi-cohort study on the validity of high-fidelity curriculum-sampling tests. PLoS ONE 13(6), e0198746 (2018)
25. Nolden, P., Wosnitza, M.: Webgestützte Selbstreflexion von Abbruchrisiken bei Studierenden. Empirische Pädagogik 30(3/4), 576–603 (2016). [German]
26. Nolden, P., et al.: Enhancing student self-reflection on the situation at university. The SRT scale inventory (2019). https://www.researchgate.net/publication/338343147_Enhancing_student_self-reflection_The_SRT_scale_inventory. Accessed 06 Apr 2020
27. Oppedisano, V.: Open University admission policies and drop out rates in Europe (2009). https://ideas.repec.org/p/ucd/wpaper/200944.html
28. O'Regan, L., Brown, M., Harding, N., McDermott, G., Ryan, S.: Technology-enabled feedback in the first year: a synthesis of the literature (2016). http://y1feedback.ie/wp-content/uploads/2016/04/SynthesisoftheLiterature2016.pdf
29. R Core Team: R: A language and environment for statistical computing (2018). [Computer software]. https://cran.r-project.org/
30. Rovai, A.P.: In search of higher persistence rates in distance education online programs. Internet High. Educ. 6(1), 1–16 (2003). https://doi.org/10.1016/S1096-7516(02)00158-6
31. Royal, K.D.: Four tenets of modern validity theory for medical education assessment and evaluation. Adv. Med. Educ. Pract. 8, 567 (2017). https://doi.org/10.2147/AMEP.S139492
32. Sandoval, W.: Educational design research in the 21st century. In: Handbook of Design in Educational Technology, pp. 388–396 (2013)
33. Sedrakyan, G., Mannens, E., Verbert, K.: Guiding the choice of learning dashboard visualizations: linking dashboard design and data visualization concepts. J. Comput. Lang. 50, 19–38 (2019). https://doi.org/10.1016/j.jvlc.2018.11.002

34. Simpson, O.: 22%-can we do better? CWP Retention Lit. Rev. 47, 1–47 (2010)
35. Simpson, O.: Student retention in distance education: are we failing our students? Open Learn. J. Open Dist. e-Learn. 28(2), 105–119 (2013). https://doi.org/10.1080/02680513.2013.847363
36. The jamovi project: jamovi (Version 1.0) [Computer Software] (2019). https://www.jamovi.org
37. Vossensteyn, H., et al.: Dropout and completion in higher education in Europe: main report (2015). http://ec.europa.eu/dgs/education_culture/repository/education/library/study/2015/dropout-completion-he-summary_en.pdf
38. Warmer, J., Kleppe, A.: Praktisch UML [Dutch], 2nd edn. Addison Wesley/Pearson Education, Amsterdam (2001)
39. Wosnitza, M., Beltman, S.: Learning and motivation in multiple contexts: the development of a heuristic framework. Eur. J. Psychol. Educ. 27, 117–193 (2012). https://doi.org/10.1007/s10212-011-0081-6
40. Yang, M., Carless, D.: The feedback triangle and the enhancement of dialogic feedback processes. Teach. High. Educ. 18(3), 285–297 (2013). https://doi.org/10.1080/13562517.2012.719154

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

What Teachers Need for Orchestrating Robotic Classrooms

Sina Shahmoradi1(B), Aditi Kothiyal1, Jennifer K. Olsen1, Barbara Bruno1,2, and Pierre Dillenbourg1

1 Computer-Human Interaction in Learning and Instruction (CHILI) Laboratory, EPFL, Lausanne, Switzerland
[email protected]
2 MOBOTS group within the Biorobotics Laboratory (BioRob), EPFL, Lausanne, Switzerland

Abstract. Educational Robots are gaining popularity in classrooms but can increase the load on teachers compared to the use of more traditional technologies. Providing support to teachers can make them confident in including robots in their teaching routines. In order to support teachers in managing robotic activities in the classroom, it is important to first understand the challenges they face when engaging with these activities. To investigate these challenges, we observed three teachers managing robotic activities across fifteen standard school sessions, followed by retrospective interviews. In these sessions, students performed group activities on assembling and programming different robotic platforms. The results highlight (a) how managing the additional technical complexity of the robotic activity is challenging for teachers and (b) how teachers' interventions focus on supporting students in making connections between their programs and their robots' behaviour in the real world. Building on our results, we discuss how orchestration tools may be designed to help alleviate teachers' challenges and support teachers' interventions in robotic classrooms.

Keywords: Classroom orchestration · Educational robotics · Classroom observation · Need-finding study

1 Introduction

Educational Robots (ER) create an active learning environment by giving pupils the opportunity to experience hands-on activities [5]. Research in ER has produced robotic platforms to be used for education purposes and designed with education constraints in mind (i.e., cheap, robust, easy to use), which has led to the rise of successful robots such as LEGO Mindstorms [19] and Thymio [15].

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 765955 (ANIMATAS). Furthermore, this project is supported by the Swiss National Science Foundation through the National Centre of Competence in Research (NCCR) Robotics.

Commercial ER platforms have also been marketed towards classroom use [1]. In general, when introducing technologies into classrooms, it is not enough to only consider usability at the individual or group level; one must also consider the dynamics of the classroom as a whole [3]. This type of usability is strongly related to teachers' experiences of managing, or orchestrating, their classrooms [3]. Classroom orchestration is how a teacher manages multiple learning activities at multiple social levels (e.g., individual, group and class level) in real time [2]. The orchestration process begins before class, when teachers script the lesson, and continues after class, when teachers reflect on what happened in the class [14]. In this paper, we focus on the real-time orchestration of ER classrooms and on designing orchestration tools within this context. In order to design such orchestration tools, it is important to identify the challenges of bringing multiple robots into the classroom and teachers' actual needs. Within ER, the majority of the research that has taken a teacher focus has not addressed teacher support tools. Rather, the research has addressed teacher needs through teacher training [1], educational materials [1], or having enough hardware/software equipment [1, 10]. However, a greater possibility of technical failures and difficulties in controlling the technology may be the main reasons behind teachers' challenges with robotic activities [16]. Considering this research, the gap in teacher-focused ER work is studying teachers' orchestration needs based on real teachers' behaviours. Previous research on classroom orchestration has identified five main components of classroom orchestration: adaptation, management, assessment, planning, and the role of the actors [14]. Although these components provide a universal framework for the orchestration literature, how they are enacted needs to be adapted to each specific learning context. Looking at the context of robotic activities, robots have three important features that impact the orchestration support needed for real-time adaptation and management during class time: 1) robots are table-top physical objects; 2) they are complex technologies for teachers compared to other technologies in the classroom; and 3) they are almost always used in face-to-face classrooms [16]. Previous research in similar contexts has developed classroom orchestration tools for teacher support [4, 11]. In the context of a multi-tabletop classroom, Martinez-Maldonado et al. [11] used class observation and interviews with teachers to discover teachers' needs for support in managing equality in students' collaboration. Based on these observations, they developed the “Radar of Physical Participation”, which shows teachers the number of touches on the tabletop per student to gauge student collaboration [12]. In another context, training for logistics, Do-Lenh et al. [4] realized via classroom observations that teachers needed to make students think more about their design before running the simulation. To address this challenge, they designed a small paper key (TinkerKey) that gave the control of running the simulations to the teacher, forcing students to think and reflect on their designs. In the context of face-to-face classrooms with intelligent tutoring systems, Holstein found that the lack of built-in customization and monitoring capabilities has been among the main reasons discouraging teachers from using them in their classrooms [6].

In a later study, through a speed-dating method, he discovered that teachers needed real-time dashboards with a heads-up display that support them in deciding how best to allocate their time and attention across students [6]. Although these contexts share some similarities with a robotics classroom, they do not consider all characteristics of a robotic classroom, such as teachers' specific needs in managing the underlying technical system itself, which may influence the orchestration support needed by the teacher. Before developing an orchestration tool for a robotics classroom, problem identification is the first and necessary step to understand where the unique challenges lie [13]. With this goal in mind, this paper presents a summary of teachers' orchestration actions in robotic classrooms, based on observing them during class time. We specifically aim at addressing the following research questions:

– RQ1: What are teachers' breakdowns in managing robotic classroom resources (time, space, logistics), as defined by Prieto [14], that are caused by the robotic activity?
– RQ2: What are teachers' actions when intervening with groups, specific to robotic activities?

This work contributes to research on orchestration need-finding in technology-enhanced learning (TEL) classrooms and to teachers' perspectives on educational robotics by providing more insight into how teachers actually manage robotic activities.

2 Methods

2.1 Participants

We observed fifteen robotics sessions, which were managed by three teachers (one male, two female) in three different primary schools in Switzerland. We recruited teachers by contacting schools with a robotics curriculum. Teachers who were currently conducting robotics activities in their classrooms were invited to participate. Three teachers, two experienced and one novice in the domain of robotics, participated in the study. Table 1 provides the details of all the observed cases. Each class had 12–18 pupils, aged 8 to 14, who were performing a learning activity with an educational robotic platform that lasted around 45–65 min. As shown in Table 1, the activities covered the most common robotic activity types, including programming robots and robot assembly. In eight sessions (cases A and B), students were performing different robotic projects (e.g., building a maze and programming a robot to go through it), and the topic of the other seven sessions was teaching the basics of robotics and programming a robot with a visual programming language [17] (case C). In all the classes, students engaged in group activities of 2–4 students with 1–2 robots per group.

Table 1. Summary of observation cases.

School code   # Sessions   Robotic activity theme                                  Session time            Session place
A             2            Robot assembly                                          After-school activity   Workshop
B             6            Projects including building maze, programming robots    Formal school time      Standard classroom
C             7            Lessons on basics of robots, programming robots         Formal school time      Standard classroom

2.2 Procedure

To answer our research questions, we observed each teacher during their class sessions, followed by retrospective interviews. Our observations were semi-structured and not based on a precise grid, as we did not have a list of expected actions with which teachers have problems. Rather, the goal was to find actions for which teachers need support. The observation protocol was adapted from [7], and the goal was to record all critical incidents in the classroom that were relevant to answering our research questions. During the class, one member of the research team, who was sitting in the back of the classroom, took field notes. Every time a teacher changed an activity or a task, the observer recorded the teacher's action. When the teacher spoke with a group, the observer recorded the focus of the group's activity (i.e., whether they were working with robots or the programming interface) and their conversations with the teacher if they could be heard. Further, between the changes in activity, the observer recorded periodic (every two minutes) descriptions of the teacher's actions. The notes included teachers' actions during the robotics activity and information about the context to make the intention of the action clearer. For instance, we recorded actions such as the teacher addressing the class as a whole (what s/he said to the class), the teacher lecturing on the subject matter, whether the teacher was monitoring, and teacher actions with respect to the robots, as they all could be related to our first research question about finding breakdowns in classroom management. We also recorded all teacher interventions, especially during group work, for our second research question about teacher interventions. The observer did not interrupt the learning activities, but recorded any information volunteered by the teacher during the learning activity. We did not collect any audio or video during the study. At the end of each session (or two consecutive sessions), we conducted semi-structured retrospective interviews, in which we asked the teacher questions about the session(s) that had just concluded. Each interview lasted around 15 min and included questions addressing both our research questions.

For instance, we asked about their experience of managing the class (to answer RQ1), for example, “What was the moment when you felt stressed during the session?” or “What were the most important problems you had in managing this session?”. We also asked teachers to clarify the intention behind their interventions (which helps answer RQ2), for example, “What did you do during a specific intervention?” or “Why did you visit that group?”.

2.3 Data Analysis

After conducting all the observations, the research team reviewed the field notes and interview notes. Some notes were split into two, as the original note captured two different teacher actions. Additionally, we removed any notes of teachers' actions that were not specific to managing robotic activities, such as “teacher is putting the chairs in their positions” or “checking students' attendance.” After the notes were cleaned, we had 249 notes from all the sessions. We used thematic analysis to analyze the data. Specifically, we used affinity diagramming to summarize qualitative patterns by iteratively clustering notes into successively higher-level themes [8]. First, we categorized thirty percent of the data in a joint-interpretation meeting of the whole research team. Then two researchers continued categorizing the remaining data. The final diagram was again reviewed by the research team, and categories were discussed and changed as needed. After verifying the results, we synthesized the emerging lower-level categories into a hierarchy of higher-level categories.
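As an illustration of how such coded notes can be tallied once the affinity diagram is fixed (e.g., for the counts in Table 2 or the per-teacher timeline in Fig. 1), the sketch below aggregates a few made-up note records with pandas. The records, minutes, and theme assignments are assumptions for this sketch, not the study data.

```python
import pandas as pd

# Each record: one coded field note (illustrative examples, not the actual 249 notes)
notes = pd.DataFrame([
    {"teacher": "B", "minute": 12, "theme": "Management",   "subtheme": "Managing technical system issues"},
    {"teacher": "C", "minute": 30, "theme": "Intervention", "subtheme": "Scaffolds such as question prompts or hints"},
    {"teacher": "C", "minute": 31, "theme": "Intervention", "subtheme": "Diagnosis"},
    {"teacher": "A", "minute": 5,  "theme": "Monitoring",   "subtheme": None},
])

# Counts per theme and sub-theme, as reported in Table 2
per_theme = notes.groupby(["theme", "subtheme"], dropna=False).size()

# Distribution over teachers and 10-minute windows of class time, as visualized in Fig. 1
notes["window"] = (notes["minute"] // 10) * 10
per_teacher_time = notes.groupby(["teacher", "window", "theme"]).size().unstack(fill_value=0)

print(per_theme)
print(per_teacher_time)
```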

3 Results

The affinity diagram consists of 4 level-three themes, 20 level-two themes and 24 level-one themes. Table 2 shows the overview of the diagram with the number of notes for each theme and sub-theme. To visualize the recorded actions over teachers and class time, Fig. 1 shows a summary of themes observed over the course of all sessions, separately for each teacher. It shows with whom and when during class each level-two theme occurred. The smaller number of notes for teacher A is due to the smaller number of observed sessions. The four highest-level themes that emerged from the data are: 1) Management, 2) Intervention, 3) Monitoring, and 4) Providing knowledge and instructions, which aligns with previous research on classroom orchestration [14]. Below, we explain each level-three theme, followed by certain level-two themes which are unique to our robotics context.

1) Management includes all teachers' efforts for organizing the activity in class, such as handling issues related to class time, workflow, group management, robotic interfaces, etc.

2) Intervention refers to teachers' actions while supporting a group, which could be initiated by the group members or the teacher, or interrupting the class to provide information.

3) Monitoring consists of all teachers' actions for gathering information on the students' states and that of their learning technologies, including robots and laptops. During the sessions, we found that teachers monitored student progress, the state of the technical systems, and the assessment of student activities.

and laptops. During the sessions we found that teachers monitored student progress and the state of the technical systems, and assessed student activities. 4) Providing knowledge and instructions includes teachers' lectures, such as "how the robot works" and "what the different working modes of the robot are", and teachers providing task instructions to the students. Although this theme is of interest to classroom orchestration, the challenges encountered by the teachers were not unique to robotics activities, so we do not go into further detail.

Table 2. List of the four level-three themes and of the level-two themes for the Management and Intervention themes, with the number of relevant notes in the affinity diagram.

Theme | No. of notes
Management | 118
- Managing technical system issues | 28
- Time management | 17
- Setting up the activity | 24
- Strategies for documenting students' work | 9
- Managing the activity to keep students engaged | 14
- Teacher wants to make sure children wrap up at the end of the activity | 13
- Collaboration management | 7
- Managing unexpected events | 2
- Behavioural issues management | 4
Intervention | 106
- Providing direct guidance to understand the task/robotic interface | 25
- Scaffolds such as question prompts or hints | 33
- Debriefing | 16
- Diagnosis | 26
- Teacher is fixing some part of the activity | 6
Monitoring | 15
Providing knowledge and instructions | 10

3.1 Management

Teacher Is Managing Technical Failures. Although managing technical failures is a general concern for teachers in TEL classrooms [2], in robotic classrooms the source of the problem is unique because the technology is not standalone
and consists of multiple parts. In the observed classrooms, the technology consisted of robots, personal computers and USB dongles for connecting the robots to the computers. As shown in Fig. 1, all the observed technical failures occurred during the sessions of teachers B and C, in which students were performing programming activities. Across eight sessions, teachers had to fix technical failures. Sometimes the issue was something the teacher could not address, with teacher C stating, "The problem is that the programming interface is not updated", and she needed help updating the software.

Fig. 1. Distribution of the level-two themes over teachers and their class time, aggregated over their sessions. The size of each icon shows the frequency of the action in that time window.

To address a technical failure, the teacher engaged in two steps: diagnosing and repairing. Due to the multiple parts used in the robotics activities, diagnosing could be difficult, with teacher C stating, "I don't know what should I change: the robot? the computer? the cable?" Once an issue was diagnosed, teachers would either repair the part, replace the part, or disband the group if the problem could not be fixed. In some cases (cases B and C), teachers could solve the problem by charging or reconnecting a robot. This was especially an issue if, as was the case with the teacher in case C, they "didn't have time to charge the robots before class." In seven cases, teachers changed one part of the robotic platform (the robot, the laptop/PC, or the cable) to solve the problem. This was only possible when there was enough spare equipment. Finally, if the issue could not be fixed, the group could be split among other groups. For example, the teacher in case C split a group with three members after failing to repair the robotic interface; each of the students in this group joined another group of their own choice. As a result, there were groups of four and five members in the class doing the same activity that the teacher had prepared for groups of three. We also observed that teachers proactively tried to avoid technical failures by providing students with support on using the different pieces of hardware. For example, teachers explained to students how to use the USB dongle and reminded them to "make sure robots are charged" (teacher in case B).

Teacher Is Managing Class Time. All teachers mentioned a lack of time for finishing the activity, as the teacher in case B said: "I prefer to have two sessions
after each other." Due to the lack of time, the teacher in case A solved problems himself rather than giving hints to students, and reported, "I usually do that [fixing students' problems] when the class runs out of time." We observed that in at least six sessions, teachers warned students about the time shortage towards the end of the class (5–10 min before the finishing time) and asked them to finish the task in the remaining time.

Teacher Is Setting up the Activity. The time devoted to setting up the activity, such as logging in, is a known problem for teachers in TEL classrooms [2]. In robotic classrooms, this issue is even more critical, as teachers need to take care of distributing several materials among the students, including robots, laptops and/or other materials such as a map or sheet for the activity. On average, 5 min at the beginning and 5 min at the end of a session (i.e., more than 15% of class time) were devoted to this in the observed robotics classrooms. This could be a potential cause of the time shortage discussed previously. Teachers had two different strategies for setting up the activity. The first was using a robot storage area in the classroom (cases A and B) where students could pick up their robots (in case B, students could pick any robot, while in case A each group had a specific box for their project). The second was the teacher distributing the robots, as we observed in case C, and asking students to help bring the laptops. In terms of class space for setting up the activity, an issue related to RQ1, two teachers were worried about the suitability of the class space, as the teacher in case B said, "I want them to play on the floor. There is not enough space. I wanted them to be interactive while playing with the robot."

Teacher Is Managing the Activity Flow. Transitions between activities are one of the important aspects of classroom management [3]. Whenever teachers (cases B and C) noticed that a group had finished their activity before the rest of the class, instead of assigning a new activity, they tried to keep them engaged by asking them to explore the programming interface further on their own, or, as the teacher in case B said, "When they finish their activity I tell them to add extra features on what they did like adding music, clapping, or check their friends' maze." Teachers also worked to gather students' attention, especially for debriefing. The teacher in case C, for instance, stopped all the students' activities by saying, "Stop the robot and listen to me," to get the classroom's attention, as she described later: "They listen to me when I ask them to stop the robots."

In summary, in terms of management, teachers' challenges come back to the spatial distribution of robotics technology in the classroom, i.e., the physical aspect of the technology and the fact that it is composed of multiple parts. Issues related to time and activity management showed teachers' dissatisfaction with managing class time and being overloaded, which is related to RQ1. There could be several reasons for teachers' time shortages. Specific to robotic activities, becoming aware of robot failures and managing the situation is a
time-consuming action. The ways in which awareness tools can support teachers with this issue are elaborated further in the discussion section of the paper.

3.2 Intervention

Teacher Is Scaffolding Using Questions and Prompts. Teachers usually provide scaffolding to help students solve problems [9]. This theme was the most common intervention that teachers engaged in and was observed for all teachers and sessions. Five lower-level themes emerged in this category: reflection (9 notes), elaboration (3 notes), encouragement for autonomy (6 notes), encouragement in peer learning (5 notes), and encouragement in strategy reflection (10 notes), which are explained below. Reflection actions are instances in which the teacher asked students to explain their work through questions similar to "How is your code working?" or "What do you think happens when your robot is going there?". The teachers' goal was to ask students about their understanding of what their robot is supposed to do by making connections between their program and the observed robot behaviour. Elaboration actions are those in which teachers asked a question to get a better understanding of students' knowledge (e.g., "Do you know how to use the programming interface?") or of their work (e.g., "Can your robot stand on its own?"). As the teacher in case B described, "I want the student to go on his own when he finishes a task," and she took actions to support students in becoming autonomous. In five notes, teachers asked students to try implementing their ideas through statements such as, "Go and test it (your idea)." Also, in two cases, the teacher asked students to make decisions about their activities on their own through prompts such as, "Think about which programming language you are going to use." The teacher in case B believed that children learn better from their peers, so she encouraged groups to watch other groups' work by saying "see what are they (another group) doing". Additionally, she asked a student to help another student, especially on basic matters, such as how to use the programming software. The teachers also encouraged students to think about their activity strategy. The teacher in case B warned a student that she was putting too much time into constructing the map rather than programming the robot. In other notes, teachers asked students to change their strategies for programming, like, "Program bit by bit," or for the map by asking students to be more creative, like, "Make your map more interesting."

Teacher Is Debriefing on Students' Activities. Debriefing is one of the well-known teacher orchestration actions for reflecting on students' work using their own results [4]. In our case, the physicality of the robots nicely supported debriefing, as teachers could easily show the performance of a robot to the whole classroom. Teacher C asked questions about the performed activity, either by explaining the goal of the activity while picking up the robot in front of the class and asking questions about it, or by asking all students to gather around
one group's work and discuss the robot's performance while showing students the robot's behaviour.

Teacher Is Providing Direct Guidance. Teachers provided guiding hints to students to help them continue the activity in different situations. The first situation was to explain how to use the programming interfaces and robots whenever students had problems with them, as teacher C described: "The main problem is that they don't know they should send the program to the robot." In this case, the teachers would give hints such as explaining the meaning of a block in the programming interface. The second situation was to guide the students in how to program and run their code on the robot. These hints went beyond using the robotic interfaces and were more related to the specific activity at hand: asking students to correct their programs (e.g., "I would recommend putting sensors in your code"), explaining the logic of a program, or fixing a problem with their maze.

Teacher Is Diagnosing. Compared to programming activities, robotic activities pose new challenges for diagnosis. While in programming activities students can check their program line by line, in robotic activities robots behave in the continuous real world, which makes tracing the program execution line by line harder [18]. The other challenge arises from the inherent characteristics of robots, namely noisy sensors and imperfect motors, which make the connection between the expected execution of the program and the resulting robot behaviour harder to predict [18]. Teachers had three main strategies to diagnose the source of students' problems. The first strategy, which appeared in two cases, was for teachers to help students diagnose their problems by encouraging them to get a better understanding of what the robot should do, by saying "Put yourself in the robot's position" or, as the teacher in case C did, by asking a group to "think how their robot should behave in front of an obstacle." In these actions, teachers were interested in supporting students to "think" about their robot's behaviour, a point which is elaborated further in the Discussion section. The second strategy, observed in nine cases, was for teachers to find the problem in collaboration with students by observing the robot's performance and asking students to perform some actions, like "Change the code" or "Press play and let's test". The teachers' main challenge was having to deal with different interfaces and going back and forth to find the problem. The third strategy was using class resources. The teacher in case B asked students to check the performance of their robot on another group's maze, or the performance of another group's robot on their own maze. This helped students find the source of problems more easily. Teachers were also interested in sharing class resources to diagnose students' activities and ease their workload by asking one student to teach other students (i.e., by assigning peer assistants).

In summary, teachers' interventions focused on engaging students to think about how their program is related to the robot's behaviour (through scaffolding or class-level debriefing) and then, in the reverse direction, on how they can fix the mistakes in their programs by looking at the robot's behaviour. Due to the importance
of teachers' interventions in the classroom, and owing to the specific issues related to robotic activities identified above, this point is addressed further in the discussion section.

3.3 Awareness

Teacher Is Monitoring. All teachers mentioned their interest in going around the classroom and visiting groups one by one. As the teacher in case B described, "I should check what they do . . . I wish I had eyes in the back of my head." The observed monitoring actions performed by teachers consist of two themes. First, teachers monitored whether students' technical systems were working: the teacher in case C, in multiple sessions, had to go around and check whether all groups could connect the robots to the programming interface on their laptops. Second, two teachers went around and monitored which groups had finished the task, while one teacher stated that he visited groups to check whether they were on-task.

4 Discussion

As seen from the themes identified above, some challenges require teachers to be supported outside of the classroom, for instance, providing a library of activities to address the problem of activity flow management (Sect. 3.1), or having a more informative programming interface to save teachers' time in explaining the user interface (Sect. 3.2). These are not the focus of our work. Regarding the other actions that can be supported by orchestration tools in the classroom, some of them are not specific to robotic activities. For instance, orchestration solutions for debriefing (Sect. 3.2), monitoring students' progress (Sect. 3.3) and scaffolding (Sect. 3.2) have already been discussed in the literature [9] and can be adapted to robotics activities. Hence, we only focus on those problems that are specific to orchestrating robotic activities and pertain to our research questions.

To answer our first research question on teachers' breakdowns in managing classrooms, we conclude that the specific problem for teachers in robotic classrooms is handling the technical complexity caused by the multiple parts involved in robotic activities. As described in Sect. 3.1, in programming-robot activities, the setup between the robots and the programming interfaces on students' computers is composed of multiple parts, and the setup process has to be performed during class by students or teachers. This causes several problems for teachers, such as making it difficult to identify the source of a technical fault, making the setup process more time-consuming, and requiring teachers to go around and check whether the technical system is working, which wastes their time.

Regarding our second research question on teachers' interventions, from the questions asked by teachers while scaffolding and helping to diagnose (Sect. 3.2), we inferred that teachers' intention was to make students think about the connection between their robot's behaviour and their program on the screen. This idea comes back to the key skill of tracing program executions for novice
programmers [18]. This skill is even more important when diagnosing errors in robotics activities, in which the challenge for students is to find the source of their mistakes across different interfaces. In this case, teachers asked students to think about how their robot is supposed to behave based on their code and what the robot actually does in the environment.

We now discuss how these findings on teachers' needs shape the way orchestration tools for robotic classrooms should be designed. To address teachers' problems with handling the technical complexity of robotic activities (RQ1), there is a need for awareness and management functionality concerning the technology (the robots and their connections to the programming interfaces). This functionality can assist teachers by notifying them about technical failures during class; an example design is shown in Fig. 2. With such a system, teachers do not have to go around checking whether students' systems are working. Also, in terms of class-level decisions, knowing the total number of technical failures in class not only helps teachers recognize groups who have problems but also provides an aggregated picture of the available technical resources in the classroom. Second, this functionality can assist teachers with robot distribution management: at the beginning of the activity, the system would assign robots to groups, based on the number of available robots and students, and inform teachers whether a robot is charged.

To support teachers in helping students make connections between their program and their robot's behaviour (RQ2), two awareness features look promising. The first is an abstract view of the two parts of the activity (robot behaviour and program), which lets teachers monitor both parts of students' activities at a glance. The second is making the robot's actual behaviour in the environment clearer by visualizing "what the robot sees" (through the robot's sensors); an example of this functionality is shown in Fig. 2, part C. This functionality helps teachers debrief on what students have done or explain how their programs match the robot's behaviour (as seen in our data, this can help teacher C, who was interested in debriefing). The visualization of robot sensors makes visible what is invisible in a robotic activity and can help both teachers and students with diagnosis. As mentioned in the results on diagnosing (Sect. 3.2), in robotic activities it is easy to observe when the robot does not move as expected; the challenge is to understand which program line, block or sensor causes the error. With this tool, teachers would save time in finding the cause of an error. Such a tool might also be used directly by students. However, the added benefit of including the teacher is the pedagogical point of teachers' control over students' reflection and metacognitive processes, which is an important factor in orchestration [4].


Fig. 2. A mock-up of the proposed orchestration functionalities: A) aggregated view of all groups' robot technical status (red colours show the groups that have a problem); B) notification to the teacher about a technical failure; C) abstract view of both the program and the robot behaviour for a group; D) the path of the robot in the environment, shown in black; E) the robot at any point on the path with its sensors' status (the small boxes on the robot show the sensors; a sensor that is 'on' is shown in black); F) the example program in the visual programming language environment [17], which consists of several lines. At any given point of the robot's path, the active line in the program is indicated by the green box over it. (Color figure online)
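To make the first proposed functionality more concrete, the sketch below models per-group robot status and derives the notifications a teacher-facing view could display (Fig. 2, parts A and B). This is only an illustrative sketch under our own assumptions: the class names, fields and the 20% battery threshold (RobotStatus, battery_level, low_battery) are hypothetical and do not come from the paper or any existing system.

```python
# Hypothetical sketch of the proposed awareness functionality: track each group's
# robot status and produce the notifications a teacher dashboard could display.
from dataclasses import dataclass

@dataclass
class RobotStatus:
    group: str
    battery_level: float      # 0.0 .. 1.0
    connected: bool           # robot <-> programming interface link is up

def assign_robots(groups, available_robots):
    """Assign one robot id per group while robots remain (Fig. 2, part A)."""
    return dict(zip(groups, available_robots))

def notifications(statuses, low_battery=0.2):
    """Yield the messages that would be pushed to the teacher (Fig. 2, part B)."""
    for s in statuses:
        if not s.connected:
            yield f"Group {s.group}: robot lost connection to the programming interface"
        if s.battery_level < low_battery:
            yield f"Group {s.group}: robot battery low ({s.battery_level:.0%})"

# Example: one robot with low battery and one disconnected robot.
status = [RobotStatus("A1", 0.15, True), RobotStatus("A2", 0.80, False)]
for msg in notifications(status):
    print(msg)
```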

Our work has limitations due to our small sample size. Some of the observed teachers' actions were only seen for a single teacher, which limits their generality. Further work is needed to replicate our results and see whether the same challenges are found with a more diverse set of teachers and robotics activities. Furthermore, this study only concerns teachers' behaviours during class; additional research is needed to identify teachers' needs for reflection after class. Finally, in this study, the observer could only take field notes, and we did not video-record the sessions due to ethical considerations. As a result, the observer could not capture all the interactions happening in the classroom. Our next step in this project is to implement the proposed functionalities in an orchestration tool and evaluate it in a classroom setting. We will continue to include teachers in our design process, and further observations will open the way to discovering more orchestration functionalities.

5 Conclusion

The goal of this research was to identify teachers' needs that can be supported with orchestration tools during ER activities. Through classroom observations, we found that while some of teachers' problems can be addressed by existing solutions in the literature, two factors unique to programming-robot activities appeared: 1) the robotic platform is not standalone and consists of different parts, which causes technical management problems, and 2) students work in different workspaces (the programming interface and the robot), and tracing from their program to the real robot behaviour is hard for them. To address these problems, we propose two functionalities for orchestration tools. First, we propose an awareness functionality for notifying teachers about robot failures and providing an aggregated
view of the robots' statuses in the class. Second, we propose an abstract and aggregated view of students' activity covering both their programs and the real robot behaviour. This abstracted view can be used for reflecting on students' work and can assist teachers in diagnosis.

References

1. Chevalier, M., Riedo, F., Mondada, F.: Pedagogical uses of Thymio II: how do teachers perceive educational robots in formal education? IEEE Robot. Autom. Mag. 23(2), 16–23 (2016)
2. Dillenbourg, P., Jermann, P.: Technology for classroom orchestration. In: Khine, M.S., Saleh, I.M. (eds.) New Science of Learning, pp. 525–552. Springer, Heidelberg (2010). https://doi.org/10.1007/978-1-4419-5716-0_26
3. Dillenbourg, P., et al.: Classroom orchestration: the third circle of usability. In: Connecting Computer-Supported Collaborative Learning to Policy and Practice: CSCL 2011 Conference Proceedings. Volume I—Long Papers, vol. 1, pp. 510–517. International Society of the Learning Sciences (2011)
4. Do-Lenh, S., Jermann, P., Legge, A., Zufferey, G., Dillenbourg, P.: TinkerLamp 2.0: designing and evaluating orchestration technologies for the classroom. In: Ravenscroft, A., Lindstaedt, S., Kloos, C.D., Hernández-Leo, D. (eds.) EC-TEL 2012. LNCS, vol. 7563, pp. 65–78. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33263-0_6
5. Frangou, S., et al.: Representative examples of implementing educational robotics in school based on the constructivist approach. In: Workshop Proceedings of SIMPAR, pp. 54–65 (2008)
6. Holstein, K., McLaren, B.M., Aleven, V.: Intelligent tutors as teachers' aides: exploring teacher needs for real-time analytics in blended classrooms. In: Proceedings of the Seventh International Learning Analytics & Knowledge Conference, pp. 257–266 (2017)
7. Holstein, K., McLaren, B.M., Aleven, V.: Student learning benefits of a mixed-reality teacher awareness tool in AI-enhanced classrooms. In: Penstein Rosé, C., et al. (eds.) AIED 2018. LNCS (LNAI), vol. 10947, pp. 154–168. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93843-1_12
8. Holtzblatt, K., Beyer, H.: Contextual Design: Defining Customer-Centered Systems. Elsevier, Burlington (1997)
9. Kawalkar, A., Vijapurkar, J.: Scaffolding science talk: the role of teachers' questions in the inquiry classroom. Int. J. Sci. Educ. 35(12), 2004–2027 (2013)
10. Khanlari, A.: Teachers' perceptions of the benefits and the challenges of integrating educational robots into primary/elementary curricula. Eur. J. Eng. Educ. 41(3), 320–330 (2016)
11. Martinez-Maldonado, R., Dimitriadis, Y., Kay, J., Yacef, K., Edbauer, M.T.: Orchestrating a multi-tabletop classroom: from activity design to enactment and reflection. In: Proceedings of the 2012 ACM International Conference on Interactive Tabletops and Surfaces, pp. 119–128 (2012)
12. Martinez-Maldonado, R., Dimitriadis, Y., Kay, J., Yacef, K., Edbauer, M.T.: MTClassroom and MTDashboard: supporting analysis of teacher attention in an orchestrated multi-tabletop classroom. In: Proceedings of CSCL 2013, pp. 119–128 (2013)
13. Martinez-Maldonado, R., Pardo, A., Mirriahi, N., Yacef, K., Kay, J., Clayphan, A.: The LATUX workflow: designing and deploying awareness tools in technology-enabled learning settings. In: Proceedings of the Fifth International Conference on Learning Analytics and Knowledge, pp. 1–10 (2015)
14. Prieto, L.P., Dlab, M.H., Gutiérrez, I., Abdulwahed, M., Balid, W.: Orchestrating technology enhanced learning: a literature review and a conceptual framework. Int. J. Technol. Enhanc. Learn. 3(6), 583 (2011)
15. Riedo, F., Chevalier, M., Magnenat, S., Mondada, F.: Thymio II, a robot that grows wiser with children. In: 2013 IEEE Workshop on Advanced Robotics and Its Social Impacts, pp. 187–193. IEEE (2013)
16. Shahmoradi, S., et al.: Orchestration of robotic activities in classrooms: challenges and opportunities. In: Scheffel, M., Broisin, J., Pammer-Schindler, V., Ioannou, A., Schneider, J. (eds.) EC-TEL 2019. LNCS, vol. 11722, pp. 640–644. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29736-7_57
17. Shin, J., Siegwart, R., Magnenat, S.: Visual programming language for Thymio II robot. In: Conference on Interaction Design and Children (IDC 2014). ETH Zürich (2014)
18. Siegfried, R., Klinger, S., Gross, M., Sumner, R.W., Mondada, F., Magnenat, S.: Improved mobile robot programming performance through real-time program assessment. In: Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education, pp. 341–346 (2017)
19. Williams, A.B.: The qualitative impact of using LEGO MINDSTORMS robots to teach computer engineering. IEEE Trans. Educ. 46(1), 206 (2003)

Assessing Teacher's Discourse Effect on Students' Learning: A Keyword Centrality Approach

Danner Schlotterbeck1, Roberto Araya1, Daniela Caballero1, Abelino Jimenez1, Sami Lehesvuori2, and Jouni Viiri2

1 University of Chile, CIAE IE, Santiago, Chile
[email protected]
2 University of Jyväskylä, Jyväskylä, Finland

Abstract. The way that content-related keywords co-occur along a lesson seems to play an important role in concept understanding and, therefore, in students' performance. Thus, network-like structures have been used to represent and summarize conceptual knowledge, particularly in science areas. Previous work has automated the process of producing concept networks, computed different properties of these networks, and studied the correlation of these properties with students' achievement. This work presents an automated analysis of teachers' concept graphs, the distribution of relevance amongst content-related keywords, and how this affects students' achievement. In particular, we automatically extracted concept networks from transcriptions of 25 physics classes with 327 students and computed three centrality measures (CMs): PageRank, diffusion centrality, and Katz centrality. Next, we studied the relation between CMs and students' performance using multilevel analysis. Results show that PageRank and Katz centrality significantly explain around 75% of the variance between different classes. Furthermore, the overall explained variance increased from 16% to 22% when including keyword centralities of the teacher's discourse as class-level variables. This paper thus presents a useful, low-cost tool for teachers to analyze and learn about how they orchestrate content-related keywords along their speech.

Keywords: Learning analytics · Teacher discourse analysis · Concept graphs · Centrality measures

1 Introduction

During the learning process, the way teachers emphasize concepts and link them to each other throughout the lesson has an effect on the quality of students' learning [1]. Thus, researchers have used network-like structures to summarize and analyze pedagogical link-making, particularly in STEM areas. [2] suggested concept graphs to represent how different content-related keywords concerning a certain science topic relate to each other. [3] implemented an automatic assessment system that extracted concept maps from course textbooks and test sheets and compared the distribution of content-related concepts in those concept maps to obtain the correct balance of these concepts amongst
test items. [4] developed concept networks from the transcriptions of physics lessons and analyzed how some properties of these networks (e.g., density, diameter, number of nodes) were correlated with students' performance. [5] recently performed a large-scale study showing how automatically generated features obtained using a text-as-data approach are closely related to certain domains of two observation protocols widely used for teaching assessment in English language arts. [6] analyzed concept networks made by expert physics instructors and compared their topological properties against the networks made by undergraduate physics students, revealing different motifs positively and negatively correlated with the coherence of the networks and, therefore, with their understanding of the concepts.

The aim of this work is to quantify the importance of each content-related keyword along the lessons and to measure the effect of these properties of the teacher's speech on the students' learning processes. For this purpose, we took a sample of 25 teachers and their students. Conceptual graphs were automatically computed from the teacher's speech by counting the co-occurrences of keywords from a predefined list of 436 keywords in a time window of 10 s. Next, we wanted to quantify each concept's importance within the entire lesson. Since the relative frequency of a certain keyword does not consider how it co-occurs with other keywords, three random-walk based centrality measures (CMs) were calculated over the concept graphs. In this way, we account not only for the frequency of the keywords but also for how often they co-occur in different time intervals, allowing us to consider how keywords were orchestrated in the lesson when assigning them an importance value. Finally, the correlation between the three CMs and the students' knowledge improvement was estimated using multilevel modeling to answer the following research questions:

[RQ1]: Are CMs over the concept graphs able to capture how teachers orchestrate content-related keywords along the lesson?
[RQ2]: To what extent does the centrality of the content-related keywords affect the students' learning achievement?

This paper is organized as follows: Sect. 2 presents the previous studies that motivated this work and provides some background about concept graphs and CMs. Section 3 describes the methodology followed in this work. Section 4 presents descriptive statistics of the dataset and describes the results obtained from the multilevel analysis. Finally, Sect. 5 presents the general conclusions of this work and states some future research directions.

2 Related Work

2.1 Previous Studies

In 2010, [6] asked undergraduate physics students and expert physics instructors to elaborate concept maps using concepts, quantities, laws and fundamental principles as nodes, while the edges were represented by experimental or modelling procedures that relate two nodes. Afterward, they computed the coherence of the concept maps and identified three basic motifs in these maps that were closely related to inductive-like
experimental procedures and deductive-like modelling. Results show that the maps made by the experts looked like tightly woven webs with high clustering and well-defined internal hierarchies, while the novices' maps presented loosely connected structures and lacked overall hierarchy and coherence.

In 2014, as a part of the QuIP project, [7, 8] manually coded 25 Finnish videotaped lessons into concept graphs and calculated several metrics over them. The results revealed that the number of related concepts, and which concepts were connected, were correlated with the students' learning gains. Five years later, [4] automated this analysis using a text mining approach and revealed that density, diameter and average degree centrality were significantly correlated with the quality of learning. These previous works measured the effect that some "macro" or general properties of the concept graph (i.e., properties of the whole network rather than its components) had on the students. This gives a great global perspective of how lessons were developed in terms of teachers' talk; for example, density can be interpreted as how likely a teacher is to relate two concepts in a short window of time. However, previous work has not looked for local information about the lesson in these networks, such as which concepts were given the most relevance or which ones a teacher is more likely to relate to other concepts, and how much this affects students' learning. In 2019, [9] suggested PageRank as a measure to identify relevant nodes in the network (and therefore in the lesson) and performed a case study of two different teachers. The authors concluded that by analyzing this measure, qualitative information about the lesson can be inferred, for instance, whether a teacher used a theoretical approach to the content or a more concrete one. To provide a more local analysis of the concept graph, we study different CMs to complement the previous analyses. The CMs are discussed in the next section.

Recently, [5] analyzed word-to-word transcriptions of 4th and 5th grade videos of English language arts lessons from the Measures of Effective Teaching project and automatically computed classroom speech features such as the proportion of time teachers and students spent talking, the average number of turns per minute, and the number of open-ended questions. These features were compiled into three factors: classroom management, interactive instruction, and teacher-centered instruction. Finally, it was shown that the first two factors were positively correlated with different domains of the Classroom Assessment Scoring System and the Language Arts Teaching Observations, while the third was negatively correlated with certain dimensions of the two scoring systems. This study also showed that, using these automatically extracted features, the performance was similar to using multiple human raters.

2.2 Concept Graphs and Centrality Measures

Along a lesson, the importance of keywords can be measured by the absolute frequency of the keyword in the whole speech, but we can also take into consideration how these keywords co-occur within a certain window of time. These co-occurrences can be summarized in a network-like structure where nodes represent keywords and co-occurrences are represented by an arrow that goes from the first keyword to the second one. Moreover, as the number of co-occurrences increases, so does the width of the arrow between them.


A concept graph [2] is a network-like structure G = (V, E), where V are the nodes, one for each content-related word (i.e., concept), and E are the edges of the graph. We use graph theory notation to represent the connections between concepts, with (i, j) representing a connection from node i to node j, so that $E \subseteq V^2$. In concept graphs, edges are usually labeled with the relation the two concepts satisfy. In this work, nodes were given by a content-related keyword list, while the connections were labeled with the relative co-occurrence frequency. Once the keywords and their co-occurrences are represented in a concept graph, we want to assign to each concept a measure of importance that reveals not only how many times a keyword appeared, but also how it connects to other keywords in the network. For this purpose, one can compute different CMs for each concept in the graph. A CM is a function $c: V \to [0, 1]$, where c(i) is the centrality of node i in the graph G. Our study focused on three CMs calculated over the concept graphs: PageRank [10], diffusion centrality [11] and Katz centrality [12]. These CMs are suitable for this particular problem for two main reasons. First, the absolute frequency of the co-occurrences involving a certain keyword can be normalized, yielding a probability distribution that represents how likely the teacher is to connect it to other keywords in a certain window of time. This suggests that CMs based on random walks are well suited to account for these transition probabilities from keyword to keyword, as it is unlikely that every two keywords have the same probability of co-occurring along a lesson, making it unreasonable to disregard this information. Secondly, many of the most commonly used CMs, such as degree or closeness centrality, are based on a geodesic approach in which the weights of the edges are treated as a "length" rather than a transition probability, or are simply not applicable to weighted graphs at all, forcing the analysis to disregard the frequency of the keywords' co-occurrences. We give a summary of the three CMs, their formulae and some context about each one in the following paragraphs.

PageRank. An algorithm that takes advantage of the features of the world wide web [10], particularly links between web pages, to help search engines rate pages according to the links between them, considering how important each web page is in the entire network. Formally, the PageRank equation [10] over the nodes of a network satisfies:

$$PR(i) = (1 - d) + d \sum_{j \in N^{\mathrm{in}}(i)} \frac{PR(j)}{d_j^{\mathrm{out}}}$$

Where PR(i) is the rank of node i, with values between zero and one, d is an attenuation factor between zero and one, N^in(i) is the set of nodes that have an edge going to i, and d_j^out is the number of edges coming out of node j. Therefore, the PageRank of a keyword depends directly on the PageRank of the keywords that co-occur before it, weighted by the proportion of co-occurrences from those keywords to i amongst the total co-occurrences coming out of the previous keywords. In other words, if a keyword usually co-occurs after other important keywords, then its PageRank should be high too. However, PageRank only accounts for the keywords that co-occur immediately
before. This means this measure is able to capture neither long keyword chains over time nor the probability of a whole chain occurring.

Diffusion Centrality. Proposed by [11], this measure was originally developed to model a contagion process starting at node i, where in each step every neighbor of an infected node is infected with independent probability, and an attenuation factor (d) is considered for each step. Since we account for the in-degree of nodes in this work, the equation for the in-diffusion centrality of a node is:

$$c_{\mathrm{diff}}(i, L) = \sum_{l=1}^{L} \sum_{j \neq i} d^{l} \left[A^{l}\right]_{j,i}$$

Where L is the number of steps considered, A^l is the l-th power of the adjacency matrix, and i, j are nodes in the graph. This measure rates the importance of keyword i as the sum of frequencies of the keyword chains that end at i and have length at most L, where every chain is weighted exponentially lower according to its length. Thus, a keyword with high diffusion centrality is likely to conclude a chain of links between concepts of length at most L.

Katz Centrality. [12] proposed a measure of prestige in networks based on the number of different walks that emanate from a node or arrive at it, according to whether out-degree or in-degree is considered in the case of directed graphs. Since the number of walks is not finite in graphs with cycles, an attenuation parameter (d) is added to the equation to simulate the loss of information when traveling along the network. Since the original Katz-Bonacich centrality was proposed for undirected graphs, Katz receive centrality was employed, which uses column sums and therefore accounts for the in-degree of the nodes. This measure can be calculated as:

$$c_{\mathrm{Katz}}(i) = \sum_{l=1}^{\infty} \sum_{j \neq i} d^{l} \left[A^{l}\right]_{j,i}$$

Where l is the length of the walks considered, A^l is the l-th power of the adjacency matrix, and j, i are nodes in the network. The Katz receive centrality of keyword i is the sum of frequencies of all keyword chains that arrive at i, where the chains are weighted exponentially lower according to their length. This means a keyword with high Katz receive centrality is likely to be at the end of an important concept-link chain.
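To make these definitions concrete, the sketch below computes the three measures directly from a weighted adjacency matrix, following the formulas above. It is only an illustrative implementation under our reading of the definitions (the matrix orientation A[j, i] for the edge j to i, the toy values, and the choices of d and L are ours); the computations actually used in the study may differ in detail.

```python
import numpy as np

def pagerank(A, d=0.85, tol=1e-10, max_iter=1000):
    """Unnormalized PageRank: PR(i) = (1 - d) + d * sum_j A[j, i] * PR(j).
    Assumes each non-empty row of A is normalized to sum to one."""
    pr = np.ones(A.shape[0])
    for _ in range(max_iter):
        new = (1 - d) + d * (A.T @ pr)
        if np.abs(new - pr).sum() < tol:
            return new
        pr = new
    return pr

def diffusion_centrality(A, d=0.5, L=3):
    """c_diff(i, L) = sum_{l=1..L} sum_{j != i} d^l (A^l)[j, i]."""
    n = A.shape[0]
    M, P = np.zeros((n, n)), np.eye(n)
    for _ in range(L):
        P = P @ (d * A)          # accumulates (d * A)^l
        M += P
    np.fill_diagonal(M, 0.0)     # drop the j == i terms
    return M.sum(axis=0)         # column sums ("receive" centrality)

def katz_centrality(A, d=0.5):
    """c_Katz(i) = sum_{l>=1} sum_{j != i} d^l (A^l)[j, i], computed in closed form
    as column sums of (I - d*A)^-1 - I (requires d below 1/spectral radius of A)."""
    n = A.shape[0]
    M = np.linalg.inv(np.eye(n) - d * A) - np.eye(n)
    np.fill_diagonal(M, 0.0)
    return M.sum(axis=0)

# Toy concept graph with three keywords; rows hold normalized out-going weights.
A = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])
print(pagerank(A), diffusion_centrality(A), katz_centrality(A))
```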

3 Method

3.1 Dataset

The dataset consisted of transcriptions of Finnish 9th-grade physics lessons taught by 25 different teachers to their own classes, which were originally videotaped by [7]. The total number of students across all classes added up to 327. Transcriptions were automatically generated from the videotaped lessons using an Automatic Speech
Recognition (ASR) system specifically developed for the Finnish language by Aalto University [13]. The content of the lessons was the introduction of the relation between electric energy and power. Teachers were not told to use particular methodologies and were not given material, since they were expected to develop the lesson using their usual strategies; thus, no intervention was implemented. In addition to the transcriptions, the QuIP researchers designed a Pre- and Post-test to assess the quality of the content knowledge before and after the lesson. The instrument had multiple-choice questions, open-ended questions and calculations (more details in [14]). The booklets for the Pre- and Post-test consisted of 18 and 36 items, respectively. Each question had a binary score (0 or 1), leading to a maximum score of 18 for the Pre- and 36 for the Post-test. Finally, a set of 436 keywords was extracted from the full glossary of a physics textbook by a researcher who is also a physics teacher.

Regarding the background of the students, three level 1 variables (i.e., variables that are independent from one individual to another, even if they belong to the same class) were considered: GENDER_BIN, FISCEDLVL and MISCEDLVL, which account for the students' gender and the education levels of their father and mother, respectively. These variables were chosen as the baseline input, as they are internationally measured in PISA tests and have been proven to affect students' academic achievement in different studies [15–17].

The procedure to obtain word centralities and estimate their effect involved several steps: computation of the concept graphs from the transcriptions for every teacher, calculation of the CMs discussed in the previous section, variable selection using Lasso regression, baseline model fitting with the pre-test score, student's gender and the parents' education levels, full model fitting, and measuring and comparing the quality of the fit of the different models. Each of these steps is discussed in the next sections.

3.2 Feature Engineering

First, to compute the concept graphs, the transcriptions were put through a text mining process. In this procedure, a node was produced in the graph for each of the keywords in the list, and every edge had its weight set to zero. Afterward, every two consecutive lines in the transcription were concatenated to obtain a 10-s phrase. For each 10 s of the teacher's talk, the keywords were searched. Every time two or more keywords appeared in the same 10 s, we added one to the weight of the edge that goes from the concept that appeared first to the last one. Once this algorithm had run through the whole transcription, the weights were normalized so that the weights of all the edges coming out of a node added up to one. In this way we obtained a directed and weighted graph where the weights of the edges coming out of a node form a probability distribution, allowing us to compute centrality measures whose numeric calculation is based on singular value decomposition techniques [12]. Next, the keywords that were not mentioned by any teacher (i.e., keywords that did not appear in any graph) were deleted from the list to avoid considering unnecessary variables. This procedure reduced the number of studied keywords from 436 to 112.
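The sketch below illustrates this construction step. It is only an approximation of the procedure described above: the segmentation into 10-s phrases (non-overlapping pairs of ASR lines) and the pairing rule (an edge for every ordered pair of keywords by order of first appearance within a window) are our assumptions, and the function and variable names are hypothetical.

```python
# Sketch: building a teacher's concept graph from 10-second transcript windows.
from collections import defaultdict

def ten_second_phrases(asr_lines):
    """Concatenate every two consecutive ASR lines into one 10-s phrase
    (assuming roughly 5-s lines and non-overlapping windows)."""
    return [" ".join(asr_lines[i:i + 2]).lower() for i in range(0, len(asr_lines), 2)]

def build_concept_graph(phrases, keywords):
    """Return weights[src][dst]: normalized co-occurrence weight of edge src -> dst."""
    counts = defaultdict(lambda: defaultdict(float))
    for phrase in phrases:
        # keywords present in this window, ordered by first appearance
        found = sorted((k for k in keywords if k in phrase), key=phrase.find)
        for pos, src in enumerate(found):
            for dst in found[pos + 1:]:
                if src != dst:
                    counts[src][dst] += 1.0   # earlier keyword -> later keyword
    # normalize so that the out-going weights of every node sum to one
    return {src: {dst: w / sum(outs.values()) for dst, w in outs.items()}
            for src, outs in counts.items()}

# Usage with hypothetical data:
# phrases = ten_second_phrases(open("lesson_transcript.txt").readlines())
# graph = build_concept_graph(phrases, ["teho", "energia", "jännite"])
```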


Consecutively, the three CMs (PageRank, diffusion centrality and Katz centrality) and the relative frequency of each keyword were computed over each of the 25 concept graphs. For every student, a vector was made by concatenating the education levels of the student's parents, the student's pre- and post-test scores, their gender, and the keyword centralities corresponding to the student's teacher. These vectors were later compiled into four data frames (one for each CM and one for the relative frequency), which were used in the variable selection process.

After the feature engineering process, variable selection was performed for each data set using Lasso regression [18]. This model selects variables by fitting through least squares with l1 regularization, which drives several coefficients to zero. A suitable value for the regularization parameter was estimated using a jackknife [19] procedure over the 25 classes to estimate the out-of-sample error.
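A minimal sketch of this selection step is shown below using scikit-learn. Treating the jackknife over classes as leave-one-class-out cross-validation is our interpretation, and the grid of regularization values is arbitrary; this is not the authors' code.

```python
# Sketch: Lasso variable selection with a leave-one-class-out error estimate.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import LeaveOneGroupOut

def select_keywords(X, y, groups, alphas=np.logspace(-3, 1, 30)):
    """X: students x keyword-centrality matrix, y: post-test scores,
    groups: class id per student. Returns indices of keywords kept by Lasso."""
    logo = LeaveOneGroupOut()
    best_alpha, best_err = None, np.inf
    for alpha in alphas:
        errs = []
        for train, test in logo.split(X, y, groups):
            model = Lasso(alpha=alpha, max_iter=10000).fit(X[train], y[train])
            errs.append(np.mean((model.predict(X[test]) - y[test]) ** 2))
        if np.mean(errs) < best_err:
            best_alpha, best_err = alpha, np.mean(errs)
    final = Lasso(alpha=best_alpha, max_iter=10000).fit(X, y)
    return np.flatnonzero(final.coef_)   # keywords with nonzero coefficients
```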

3.3 Multilevel Analysis

In educational research, it is usual to work with hierarchical data, for example students working in groups, having different teachers, or even attending different schools. Therefore, the usual ordinary least squares (OLS) regression analysis faces several issues due to the multilevel structure of the data, as the model cannot capture effects at different levels and the prediction errors for students within the same group are inevitably correlated, violating one of the main assumptions of this model. To avoid these issues and take the structure of the data into account, multilevel modeling was employed, which can account for the correlation between individuals in the same class by addressing nested effects [20]. In this study, a two-level random intercept model was applied to estimate the effect of the teacher's instructional discourse on the students' understanding of the content. The intercept was allowed to vary across classes to account for the differences between groups, while the level 1 and level 2 slopes were all fixed, since there was no difference in content nor in the tests for any class and no hypotheses regarding differences in the level 1 variables were considered.

Before fitting the model with the independent variables, it is usual in multilevel analysis to fit a null model (a model with no independent variables) to determine the amount of variance present at each level and calculate the intraclass correlation coefficient (ICC) of the data. Secondly, to account for the variance explained by the level 1 variables, a baseline model was implemented which incorporated the level 1 predictors (i.e., parents' education levels, pre-test score and gender). These variables are meant to capture the students' background as individuals rather than as part of a group.

Full Models. Finally, the last models run were the four full models, which incorporated the level 2 variables for each group, namely the centralities of the statistically significant words in their teacher's speech. Relative word frequency was also considered, as we wanted to test whether there is an improvement when using CMs instead of the much simpler approach of word counting. The equations for the full models, accounting for the level 1 variables X_{i,j} and the level 2 variables W_j, read:


$$Y_{i,j} = \beta_{0,j} + \beta_{1} X_{i,j} + r_{i,j}$$
$$\beta_{0,j} = \gamma_{0,0} + \gamma_{0,1} W_{j} + s_{0,j}$$

Here Y_{i,j} represents the post-test score of student i in class j, β_{0,j} is the intercept for class j, β_1 is the vector of fixed slopes for the level 1 variables, γ_{0,0} is the overall intercept, γ_{0,1} is the vector of fixed slopes for the keyword centralities, r_{i,j} is the level 1 (within-class) residual, and s_{0,j} is the level 2 residual capturing the variance between classes. To summarize, Fig. 1 shows the whole procedure from graph computation to coefficient extraction.
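For illustration, a random-intercept model of this form can be fitted with standard mixed-model software; the sketch below uses statsmodels. The column names (post, pre, gender, misced, fisced, class_id) and the keyword-centrality columns are hypothetical placeholders, not the authors' actual variable names.

```python
# Sketch: two-level random-intercept model (students nested in classes).
import statsmodels.formula.api as smf

def fit_random_intercept(df, keyword_cols):
    """df: one row per student; keyword_cols: centrality columns of the student's teacher."""
    fixed = "post ~ pre + gender + misced + fisced + " + " + ".join(keyword_cols)
    model = smf.mixedlm(fixed, data=df, groups=df["class_id"])  # random intercept per class
    return model.fit(reml=True)

# result = fit_random_intercept(students_df, ["c_teho", "c_energia", "c_jannite"])
# result.params  -> fixed-effect slopes (level 1 variables and keyword centralities)
# result.cov_re  -> estimated between-class (intercept) variance
```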

Fig. 1. Conceptual pipeline explaining input, procedure and output for graph computation, feature engineering, variable selection, model fitting and model scoring.


Model Performance Measures. When fitting a regular linear regression, the quality of the fit is usually measured using the R² coefficient (also called the coefficient of determination), which quantifies the amount of variance in the data explained by the independent variables. However, when working with multilevel modeling, the effects of level 1 and level 2 variables can (and should) be measured separately [20]. Moreover, there is no universal measure agreed to be optimal by researchers, and many measures (called pseudo-R²) have been proposed and proven to capture and retain some of the desirable properties of the R² coefficient in particular contexts [21]. In this work, three performance measures were used to calculate the amount of variance explained by different sets of variables: level 1 R² and level 2 R², as presented in [22], were used to calculate the variance explained by the first- and second-level predictors, respectively, while conditional R² [23] was used to calculate the amount of variance explained by the conjunction of both levels. In addition to the three performance measures, likelihood ratio tests were performed at each step with respect to the previous model to ensure the statistical significance of the performance improvement. This means the baseline was compared with the null model, and the full models were compared against the baseline model.
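As a rough illustration of these quantities, the snippet below uses common textbook formulas for the ICC and for level-wise and conditional pseudo-R²; the exact expressions used in the paper are those of [22] and [23], which may differ in detail.

```python
# Sketch: variance-explained measures from fitted variance components.
# sigma2 = within-class (level 1) residual variance, tau2 = between-class (level 2)
# intercept variance; "_null" refers to the intercept-only model.

def icc(tau2_null, sigma2_null):
    return tau2_null / (tau2_null + sigma2_null)

def level1_r2(sigma2_model, sigma2_null):
    return 1.0 - sigma2_model / sigma2_null

def level2_r2(tau2_model, tau2_null):
    return 1.0 - tau2_model / tau2_null

def conditional_r2(var_fixed, tau2_model, sigma2_model):
    """Variance explained jointly by the fixed and random effects."""
    return (var_fixed + tau2_model) / (var_fixed + tau2_model + sigma2_model)

# Example with the null-model components reported in Sect. 4.1: tau2 = 2.38, sigma2 = 63.58
print(round(icc(2.38, 63.58), 2))  # ~0.04, matching the reported ICC
```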

4 Results

Descriptive statistics of the level 1 variables and the outcome variable were calculated. Table 1 presents the mean, median, standard deviation, minimum and maximum for the parents' education levels and the pre- and post-test scores. The sample of students consisted of 170 boys and 157 girls. The largest class had 20 students, while the smallest group had 6. The group with the highest average point difference between the Pre- and Post-test scored an average of 14.1 points and had 14 students, while the group with the lowest average point difference scored 4.2 points and had 9 students. The mean of the average point difference was 8.6, with a standard deviation of 2.4. The selected level 2 predictors (keywords), their average centrality/frequency, and the coefficients associated with the multilevel analysis for the full models can be found in the appendix.

Table 1. Descriptive statistics of the level 1 variables and the outcome variable.

Variable | Mean | Median | Std. dev. | Min | Max
Pre-test | 7.45 | 8 | 2.56 | 0 | 15
Post-test | 16.09 | 17 | 8.13 | 0 | 33
MISCEDLVL | 3.11 | 2 | 1.50 | 0 | 6
FISCEDLVL | 3.60 | 3 | 1.46 | 0 | 6

4.1 Performance of the Different Models

The results from the null model indicated a variance of 2.38 between classes and 63.58 within groups, while the total variance in the post-test was 66.12. From these values the data showed an ICC of 0.04. According to [19], multilevel modelling should be applied to account for the nested effects as long as the ICC is higher than 0.02. Thus, multilevel analysis was justified and necessary in order to account for the effects of the group-level variables. Table 2 shows the number of level 2 variables (keywords), the level 1, level 2 and conditional R², and the χ² test values for the null model, the baseline and the four full models. Results show that the three performance measures increased significantly from the null model to the baseline, and from the baseline to the full models. In particular, when adding variables related to the teacher's discourse, the amount of variance explained at the group level goes from 0.35 to 0.75 when using CMs. However, only two of the four full models showed significant results in the χ² test with respect to the baseline: PageRank and Katz centrality, with p-values of 0.025 and 0.031, respectively. Nevertheless, both significant models increased the amount of variance explained at the second level by more than 40%, and overall both models increased the total variance explained from 16% to 22% (Table 2).

Table 2. Number of level 2 variables, level 1 R², level 2 R², conditional R² and χ² test p-values for the null, baseline, and full models. Statistically significant results are marked with (* < 0.05, ** < 0.01, *** < 0.001). The χ² test was performed with respect to the null model for the baseline and with respect to the baseline for the full models.

Model | Keywords | Level 1 R² | Level 2 R² | Conditional R² | χ² test p-value
Null Model | – | – | – | 0.036 | –
Baseline | – | 0.13 | 0.35 | 0.163 | 7.5e−11 ***
Word Frequencies | 16 | 0.25 | 0.74 | 0.219 | 0.059
PageRank | 15 | 0.25 | 0.75 | 0.224 | 0.025 *
Diffusion Centrality | 17 | 0.25 | 0.75 | 0.224 | 0.053
Katz Centrality | 15 | 0.25 | 0.75 | 0.222 | 0.031 *

4.2 Selected Keywords and Centralities

In this section we present the analysis of the level two variables in each of the full models. Results show that, in general, words clearly related to the topic of the lesson (e.g., energy, voltage) have positive coefficients in both significant models, while concepts with a more distant relationship to the topic (e.g., thermostat, model) generally had negative coefficients. The selected keywords, their English translations, their average centralities across teachers, and their coefficients in the random intercept part of the model are listed in Tables 3 and 4.


Table 3. Selected level 2 predictors for the model using word frequencies as level 2 variables. The words are shown along with their English translation, average frequency, and level 2 coefficient.

Word | Translation | Avg. frequency | Level 2 coef.
Ampeeri | Ampere | 0.0078 | −16.658
Energia | Energy | 0.0307 | −82.467
Jännite | Voltage | 0.1277 | 21.988
Kytkentäkaavio | Circuit Diagram | 0.0080 | −111.456
Lamppu | Lamp | 0.1108 | 16.706
Lämpötila | Temperature | 0.0061 | −15.373
Muuntaja | Transformer | 0.0129 | 35.992
Resistanssi | Resistance | 0.0098 | 6.186
Sulake | Fuse | 0.0199 | 47.957
Sähkömoottori | Electric Motor | 0.0085 | 61.614
Sähkövirta | Current | 0.0423 | −44.984
Teho | Power | 0.2072 | 8.708
Valo | Light | 0.0110 | 3.777
Virtapiiri | Circuit | 0.0179 | 31.899
Watti | Watt | 0.0566 | 7.808
Yksikkö | Unit | 0.0685 | 41.855

Furthermore, the majority of the keywords selected using Lasso have average centralities below 0.1 in both significant models (voltage being the exception). For some keywords, like temperature or thermostat, this happens because only a small proportion of the teachers actually included the keyword in their speech, resulting in a low average centrality. In other cases (e.g., current, light) the keyword was consistently used across all teachers, but its importance to the whole speech was also consistently low.

Table 4. Selected level 2 predictors for the models using CMs as level 2 variables. For every word-CM combination, the average centrality of the word across teachers and the level 2 coefficient from the random intercept model are shown.

Ampere Battery Energy Voltage Magnet Model Transformer Battery

Pagerank Avg. value – 0.0045 0.0251 0.1369 0.0026 0.0086 0.0079 0.0191

Level 2 coef. – −28.680 31.542 8.755 50.077 −14.863 26.706 −10.129

Diffusion centrality Avg. Level 2 value coef. – – 0.0045 −46.101 0.0251 26.812 0.1369 11.732 0.0026 63.453 0.0086 −4.132 0.0080 29.509 0.0192 −1.351

Katz centrality Avg. Level 2 value coef. 0.0069 25.971 0.0045 −54.139 0.0251 23.745 0.1369 9.351 – – 0.0086 −3.242 0.0080 32.522 0.0192 −1.492 (continued)


Table 4. (continued)

Word             PageRank                    Diffusion centrality        Katz centrality
                 Avg. value  Level 2 coef.   Avg. value  Level 2 coef.   Avg. value  Level 2 coef.
Iron core        –           –               0.0032      −27.560         –           –
In series        0.0341      10.129          0.0341      16.984          0.0341      7.456
Electric device  0.0284      20.193          0.0284      7.506           0.0285      13.777
Current          0.0647      −7.479          0.0647      0.737           0.0647      −2.044
Power            –           –               0.1650      −1.198          –           –
Thermostat       0.0024      −64.522         0.0025      −49.967         0.0025      −67.861
Light            0.0209      48.043          0.0209      37.467          0.0208      32.251
Mains voltage    0.0033      −54.166         0.0033      −52.850         0.0033      −67.332
Watt             0.0590      −19.066         0.0590      2.024           0.0590      −8.067
Unit             0.0801      23.874          0.0801      23.079          0.0801      12.587
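As a sketch of the variable-selection step behind Tables 3 and 4, the following snippet shows one way the level-2 keyword predictors could be screened with the Lasso before fitting the full multilevel models. The file names, the class-level aggregation, and the preprocessing are assumptions made for illustration; the paper does not specify this implementation.

    import numpy as np
    from sklearn.linear_model import LassoCV
    from sklearn.preprocessing import StandardScaler

    # Hypothetical inputs: one row per class, one column per candidate keyword centrality
    # (436 columns), and the class-level mean post-test score as the outcome.
    X = np.loadtxt("keyword_centralities.csv", delimiter=",")
    y = np.loadtxt("class_mean_posttest.csv", delimiter=",")

    X_std = StandardScaler().fit_transform(X)        # the Lasso penalty is scale-sensitive
    lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)

    selected = np.flatnonzero(lasso.coef_)           # keywords with non-zero coefficients survive
    print(f"{selected.size} keywords kept out of {X.shape[1]}")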

5 Discussion

The merit of this work lies in demonstrating that PageRank and Katz centrality are significant tools for quantifying the relevance teachers give to different content-related keywords in their instructional discourse, by considering the co-occurrences of these words throughout the speech. Moreover, given a transcription of the lesson and a list of content-related keywords, we automated the analysis and quantified the effect these features have at the class level. In particular, we calculated the extent to which the distribution of relevance amongst keywords affects students' understanding of the relation between electric energy and power, and identified which words were significantly correlated with the improvement in their knowledge. Results show that PageRank and Katz centrality are suitable measures for quantifying keyword importance in teachers' speech, as the amount of variance explained increases for the three performance measures while also showing relevant values in the χ² test. Besides, the difference in statistical significance between the CM models and the relative frequency model shows that CMs are at least able to capture some dependencies between keywords, therefore accounting for information about how they were orchestrated during the lesson that a word-counting approach would disregard. However, variable selection is needed for the model to reach statistical significance, as the number of keywords considered after the variable selection process decreased from the original 436 to fewer than twenty in each of the four full models. This means that fitting the model with the original list of words would not give reliable results, as more than 90% of the independent variables would not actually be correlated with the outcome variable. Nevertheless, the need for variable selection is not necessarily a drawback; it simply means that the number of words with a significant effect was smaller than we initially expected.
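To illustrate the kind of pipeline described above, the sketch below builds a keyword co-occurrence graph from a transcript and computes PageRank and Katz centrality with networkx. The sliding-window co-occurrence rule, the example transcript, and the keyword list are assumptions for illustration; the paper does not prescribe this exact graph construction.

    import itertools
    import networkx as nx

    # Example (hypothetical) inputs: an ASR transcript as a token list and a content-related keyword list.
    keywords = ["energy", "power", "voltage", "current"]
    transcript = "the electric energy and power depend on the voltage and the current in the circuit".split()

    def keyword_graph(tokens, keywords, window=10):
        """Connect two keywords whenever they appear within the same sliding window of tokens."""
        graph = nx.Graph()
        graph.add_nodes_from(keywords)
        keyword_set = set(keywords)
        for i in range(max(len(tokens) - window + 1, 1)):
            present = sorted({t for t in tokens[i:i + window] if t in keyword_set})
            graph.add_edges_from(itertools.combinations(present, 2))
        return graph

    graph = keyword_graph(transcript, keywords)
    pagerank = nx.pagerank(graph)                       # stationary distribution of a random walk
    katz = nx.katz_centrality_numpy(graph, alpha=0.01)  # alpha must stay below 1/largest eigenvalue
    print(pagerank, katz)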


Moreover, most of the keywords selected as level-two predictors presented relatively low average centralities across teachers. This happened because teachers' concept graphs were generally centered around the same core concepts, and those concepts presented very similar centralities across teachers. From the selected keywords we were able to identify two groups according to the proportion of teachers that actually included them in their speech. For the keywords that only a few teachers included in their speech, this means such keywords significantly separate the results of the students who had them included in their lesson from the students who did not. For the keywords that were included consistently across teachers, this implies the lessons were not generally centered around them, but their inclusion in the speech and their relationship with the concepts that concentrate most of the centrality help students improve their learning achievement. These results are in line with those presented in [6], where the general structure of the concept maps made by undergraduate students and physics experts was similar at the macro level, but the concept maps of the latter presented a much richer structure in the clusters formed and a more organized hierarchy between them. Table 4 also shows that the selected keywords have similar values for the three CMs. This was caused by the concept graphs being mainly composed of several short chains of links between concepts rather than a single long one, which results in several higher-order terms in the definitions being zero. However, this is not a drawback, as [6] also showed that the main inductive and deductive motifs of the concept maps were composed of few nodes. This study was conducted using data from a large-scale project, which supports its reliability. In addition, the use of multilevel modelling to account for the nested effects, instead of more classical statistical approaches, allowed us to estimate the effects of word centralities at the group level while considering the correlation between students in the same class. Overall, this work presents an automated analysis of the teachers' concept graphs, the distribution of relevance amongst content-related keywords, and how this affects the students' understanding of the content. In the future, this might become a useful and low-cost tool for teachers to analyze and learn how they orchestrate content-related keywords throughout their speech. For example, considering the performance of state-of-the-art ASR algorithms, it would be possible to transcribe the lesson, extract the concept graphs and compute the keyword centralities to output an automatic concept map that summarizes the instructional speech and the importance given to each content-related keyword along the lesson, providing objective and fast feedback. This analysis can act as an instrument for external and self-evaluation, as looking at the concept graph and the most important keywords of the lesson would require considerably less time and manpower than analyzing the full lesson. Besides, as shown in [5], teachers are more likely to accept feedback from automated algorithms rather than human raters, as algorithms are free of rater bias. Also, the recent development of automatic feature extraction from text-as-data approaches is only making the gap between automatic and human raters smaller.
Moreover, given the test scores of students who had different teachers, one can estimate the effect each content-related keyword has on them via multilevel modelling and plan future lessons with these factors in mind. In addition, considering that the key concept list was manually created by a researcher, another interesting research branch is the automation of the key concept list, perhaps through text mining in textbooks, to make the process fully automated. The last research direction we propose is the visualization and qualitative analysis of these automatically generated graphs and the centralities of content-related keywords, for a deeper understanding of how representative these measures are and how they capture the relevance of these keywords.

Acknowledgements. Support from ANID/PIA/Basal Funds for Centers of Excellence FB0003 and ANID-FONDECYT grant N° 3180590 is gratefully acknowledged.

References

1. Scott, P., Mortimer, E., Ametller, J.: Pedagogical link-making: a fundamental aspect of teaching and learning scientific conceptual knowledge. Stud. Sci. Educ. 47(1), 3–36 (2011)
2. Novak, J.D., Gowin, D.B.: Learning How to Learn. Cambridge University Press, Cambridge (1984)
3. Su, C.Y., Wang, T.I.: Construction and analysis of educational assessments using knowledge maps with weight appraisal of concepts. Comput. Educ. 55(3), 1300–1311 (2010)
4. Caballero, D., Pikkarainen, T., Viiri, J., Araya, R., Espinoza, C.: Automatic network analysis of physics teacher talk. In: European Science Education Research Association (ESERA) 13th Conference
5. Liu, J., Cohen, J.: Measuring Teaching Practices at Scale: A Novel Application of Text-as-Data Methods (2020)
6. Koponen, I.T., Pehkonen, M.: Coherent knowledge structures of physics represented as concept networks in teacher education. Sci. Educ. 19(3), 259–282 (2010)
7. Fischer, H.E., Labudde, P., Neumann, K., Viiri, J.: Quality of Instruction in Physics: Comparing Finland, Germany and Switzerland. Waxmann, Münster (2014)
8. Helaakoski, J., Viiri, J.: Content and content structure of physics lessons and students' learning gains. In: Fischer, H.E., Labudde, P., Neumann, K., Viiri, J. (eds.) Quality of Instruction in Physics: Comparing Finland, Germany and Switzerland, pp. 93–110. Waxmann, Münster (2014)
9. Tekijät, et al.: Automatic visualization of concept networks in classrooms: quantitative measures. Finnish Mathematics and Science Education Research Association (FMSERA) Journal (2019, in press)
10. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. Stanford InfoLab (1999)
11. Banerjee, A., Chandrasekhar, A.G., Duflo, E., Jackson, M.O.: The diffusion of microfinance. Science 341(6144), 1236498 (2013)
12. Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18(1), 39–43 (1953)
13. Kronholm, H., Caballero, D., Araya, R., Viiri, J.: A smartphone application for ASR and observation of classroom interactions. In: Finnish Mathematics and Science Education Research Association (FMSERA) Annual Symposium (2016)
14. Geller, C., Neumann, K., Boone, W.J., Fischer, H.E.: What makes the Finnish different in science? Assessing and comparing students' science learning in three countries. Int. J. Sci. Educ. 36(18), 3042–3066 (2014)
15. Gurría, A.: PISA 2015 resultados clave. OECD, France (2016). https://www.oecd.org/pisa/pisa-2015-results-in-focus-ESP.pdf. Accessed 2 Apr 2020


16. Van Ewijk, R., Sleegers, P.: The effect of peer socioeconomic status on student achievement: a meta-analysis. Educ. Res. Rev. 5(2), 134–150 (2010)
17. Sirin, S.R.: Socioeconomic status and academic achievement: a meta-analytic review of research. Rev. Educ. Res. 75(3), 417–453 (2005)
18. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. Ser. B (Methodological) 58(1), 267–288 (1996)
19. Miller, R.G.: The jackknife: a review. Biometrika 61(1), 1–15 (1974)
20. Du, J., Xu, J., Fan, X.: Factors affecting online groupwork interest: a multilevel analysis. J. Educ. Comput. Res. 49(4), 481–499 (2013)
21. FAQ: What are pseudo R-squareds? UCLA: Statistical Consulting Group. https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/. Accessed 18 Mar 2020
22. Finch, W.H., Bolin, J.E., Kelley, K.: Multilevel Modeling Using R. CRC Press, New York (2019)
23. Nakagawa, S., Johnson, P.C., Schielzeth, H.: The coefficient of determination R² and intraclass correlation coefficient from generalized linear mixed-effects models revisited and expanded. J. R. Soc. Interface 14(134), 20170213 (2017)

For Learners, with Learners: Identifying Indicators for an Academic Advising Dashboard for Students

Isabel Hilliger1, Tinne De Laet2, Valeria Henríquez3, Julio Guerra3, Margarita Ortiz-Rojas4, Miguel Ángel Zuñiga5, Jorge Baier1, and Mar Pérez-Sanagustín1,6

1 Pontificia Universidad Católica de Chile, Santiago, Chile. {ihillige,jabier}@ing.puc.cl, [email protected]
2 Katholieke Universiteit Leuven, Louvain, Belgium. [email protected]
3 Universidad Austral de Chile, Valdivia, Chile. {valeria.henriquez,jguerra}@inf.uach.cl
4 Escuela Superior Politécnica del Litoral, Guayaquil, Ecuador. [email protected]
5 Universidad de Cuenca, Cuenca, Ecuador. [email protected]
6 Université Toulouse III Paul Sabatier, Toulouse, France

Abstract. Learning Analytics (LA) dashboards aggregate indicators about student performance and demographics to support academic advising. The majority of existing dashboards are targeted at advisors and professors, but not much attention has been paid to students' need for information for their own academic decision-making. In this study, we identify relevant indicators from a student perspective using a mixed methods approach. Qualitative data was obtained from an open-ended online questionnaire answered by 31 student representatives, and quantitative data was collected from a closed-ended online questionnaire answered by 652 students from different cohorts. Findings point out relevant indicators to help students choose what courses to take in an upcoming academic period. Since this study is part of a large research project that has motivated the adoption of academic advising dashboards in different Latin American universities, these findings were also contrasted with the indicators of these advising dashboards, informing future developments targeting students.

Keywords: Learning analytics · Learning dashboards · Academic advising · Higher education

1 Introduction

In recent years, higher education institutions have accumulated large amounts of personal and academic data, including the interactions between users and online learning systems and learning management systems (LMS) [1, 2]. This accumulation of information has motivated the development of Learning Analytics (LA) dashboards aimed at supporting different stakeholders in their daily tasks. According to Schwendimann et al. [3], LA dashboards are "single displays that aggregate different indicators about learner(s), learning processes(es) and/or learning context(s) into one or multiple visualizations". Although dashboards were initially focused on the use and development of data mining techniques to support automated interventions, there has been a recent shift towards directly reporting the information to support teachers' and learners' decision-making [2, 4, 5]. Following this approach, several researchers have been exploring how to produce LA dashboards that provide educational stakeholders with actionable information. Along these lines, studies have documented the design and implementation of LA dashboards to support academic advising, whose goal is to help students choose an academic program and a list of courses for an academic period, so they meet all the requirements to graduate with a degree [6–8]. Among these dashboards, we can distinguish three types: 1) student-facing [7, 8]; 2) advisor-facing [9, 10]; and 3) advising dialogue dashboards, designed to support conversations between students and university staff [11, 12]. These dashboards use descriptive and/or predictive analytics to guide students in their decision making, and some of them have already shown promising results.

However, selecting the most appropriate indicators for an academic advising dashboard is not obvious, since it requires the identification of relevant information for different stakeholders and scenarios. According to recent studies, most existing LA dashboards have been designed primarily with university staff, placing the focus on advisors and professors [2, 4]. However, not much attention has been paid to student-facing dashboards [1, 2, 4], and only a few of them have been co-developed with students [4]. As a consequence, there are significant research gaps around the perspective of students on the design of LA dashboards. On the one hand, more studies are needed to understand students' needs for indicators to inform their decision-making [13]. On the other hand, more studies are needed to evaluate the effectiveness of academic advising dashboards in terms of the benefits to learners [2, 11]. To address these gaps, researchers recommend engaging students as co-creators of LA dashboards, acknowledging their ability to make decisions and exercise choice [14]. By doing so, LA designers can better understand the scenarios under which students may use dashboards, as well as their effects on learners' actions [4]. Moreover, the design of any dashboard should anticipate that its use could have a different effect depending on the context and the targeted user, so users need to be involved throughout the design process to meet their needs [2, 4]. In the case of student-facing dashboards, the key is working with learners when creating visualizations to produce solutions that empower them as agents of their own learning [14].

This paper aims to expand the current literature regarding academic advising dashboards, placing the focus on learners. It presents a study that is part of a large-scale project that aims to build capacity to design and implement LA tools in Latin America. As a result of this project, three out of four Latin American universities designed different academic advising dashboards, with a focus on advisors. The current study was conducted at the fourth Latin American university affiliated with this project, where student representatives felt there was a need for student-facing dashboards to help them choose courses.


In order to identify relevant indicators from a students' perspective, qualitative data was obtained from 31 student representatives. In addition, quantitative data was collected from 652 students from different cohorts. Findings point to relevant indicators, which were contrasted with the ones used in the other academic advising dashboards. Implications are discussed to inform the development of dashboards in other higher education institutions.

2 Indicators Used in Academic Advising Dashboards

There are different types of indicators to inform students' decision-making, which could be estimated by using descriptive analytics (regarding what has happened and why it happened) or predictive analytics (regarding what might happen) [15]. In this section, we provide examples of different types of dashboards that have been well documented in the literature, besides describing the indicators therein.

Degree Compass is a student-facing dashboard that was developed at Austin Peay State University [8, 16]. This dashboard uses predictive analytics to provide students with a list of recommended courses. Along with this list, the dashboard provides a five-star rating to indicate in which subjects the student is more likely to achieve the best grades. Studies have shown that its underlying algorithm has predicted students' grades in more than 90 percent of the recommended courses [8, 16]. Studies have also shown that course grades and graduation rates improved after Degree Compass was implemented, closing achievement gaps regarding race and income [8]. Besides, researchers observed a high correlation between recommended courses and student credits [8]. Unlike Degree Compass, eAdvisor integrates a student-facing and an academic-facing dashboard into one system. This system was designed by Arizona State University (ASU) in order to help students choose majors and courses that are offered at this university [7]. For each student, it shows a list of recommended courses that match the requirements for a specific major [7], besides showing his/her GPA, registered major, catalog year, and current track term. Advisors have access to an additional indicator regarding the percentage of students that are on-track and off-track. Although ASU reported that the 4-year graduation rate was 9 percentage points higher after eAdvisor was implemented [17], these promising results have not been backed up with scientific evidence yet.

Regarding advisor-facing dashboards, we have identified two examples targeting advising staff. One example is based on descriptive analytics, and it was developed at the University of Michigan [10]. It consists of an Early Warning System (EWS) that provides academic advisors with information about advising meeting logs, along with the logs of interacting with the Student Explorer dashboard [10] (which shows students' grades and logs in the institutional LMS [18]). The second example is known as GPS advising, and it was developed by Georgia State University (GSU) [9]. GPS sends alerts to advisors under three scenarios: 1) a student does not obtain satisfactory grades in core subjects of their major, 2) a student does not take the required course within the recommended time, or 3) a student signs up for a subject that is not listed as relevant to the major. These three alerts use predictive analytics based on students' grades, student credits, and course enrollment. Both dashboards (EWS and GPS) have been perceived as useful for advising meetings. Once GPS was implemented, the number of advising meetings increased by three or four times. However, no further information is available regarding which indicators have been crucial for advising discussions or student decision-making.

Regarding dashboards to support the academic advising dialogue, there are two tools that have been well documented in the literature. The first tool is called LADA, and it consists of a dashboard that can be used by the advisor individually or during the advising dialogue [11]. By using predictive analytics, LADA shows the chance of a student passing a course, taking into account course difficulty and student academic performance in previous semesters. LADA also shows a list of recommended courses for an upcoming academic period, student grades in prior periods, and student skills. The second tool is called LISSA, which consists of a dashboard that relies mainly on descriptive analytics [12, 19]. Specifically, this dashboard provides advisors with the following indicators: 1) student's program progress, 2) student's results in positioning tests, mid-terms, and end-of-semester exams, 3) course grades, and 4) a histogram of student's academic performance relative to their peers. Prior studies have shown that LADA and LISSA have successfully supported the advising dialogue, helping both advisor and student to take an active role in the dialogue [11, 12].

Among all the dashboards described in this section, we have identified different types of indicators, which could be classified as predictive (e.g. the chances of a student passing a course) or descriptive (e.g. students' program progress). Some of these dashboards have had promising results, such as the perceived usefulness exhibited by EWS, LADA and LISSA [10, 11, 19], or the increase in graduation rates following the implementation of Degree Compass and eAdvisor [8]. However, none of the proposed dashboards have evaluated which indicators have been particularly useful for academic advising, making it difficult to attribute these promising results to actual dashboard usage. Furthermore, prior studies have found that advisors resist using the predictive indicators included in these types of dashboards, due to usability concerns and moral discomfort [11, 20]. In this context, more studies are needed to understand what types of indicators are useful for different academic advising scenarios. This paper aims to identify relevant indicators from a students' perspective, in order to understand what types of scenarios might be useful for student decision-making.

3 Methods

3.1 Study Design and Objective

The research question addressed in this paper is: What indicators are relevant for students to be included in an academic advisor dashboard? To answer this research question, we used mixed methods in an exploratory sequential design. This design involves gathering qualitative data first and then collecting quantitative data to explain relationships found in the qualitative data (Creswell, 2012). In this study, qualitative data was collected to identify indicators that were found relevant by student representatives. Then, quantitative data was collected to confirm whether these indicators are relevant for a larger sample of students. Finally, we contrasted the indicators emerging from the analysis with the indicators provided by three existing academic advising tools.

3.2 Study Context

This study is part of a large research project that aims to build capacity for learning analytics adoption in Latin America. In the context of this project, four Latin American universities (U1, U2, U3, and U4) aimed to adapt the LISSA dashboard for different academic advising scenarios. In U1, the resulting dashboard was called TrAC (from Spanish "Trayectoria Académica y Curricular"). In a first phase, the dashboard aimed to consider indicators that were identified as relevant during focus groups and interviews conducted by a team of researchers to assess the institutional context. These indicators included class attendance, grades, academic workload, program progress, and the comparison of grades at the course level and throughout the program. The final design of TrAC included a visualization of student academic progress (courses required by a study plan, courses taken by a student, courses failed and dropped), student academic records (grade point average, semester grade), and, at the course level, student academic performance relative to their peers and past course grades [21, 22].

In the case of U2, the university had already had an academic advising system since 2013. The process of adapting the LISSA dashboard was therefore focused on discovering new indicators to improve the quality of advising meetings. As in U1, a team of researchers collected information to assess the institutional context, and findings revealed the need for further academic information regarding course difficulty and academic workload. Advisors and students also wanted an early dropout warning system, in addition to new visualizations of student demographics, student welfare information, and student academic performance relative to his/her peers. Forty teaching staff members were involved in the re-design of the academic advising system [23]. Based on the LISSA dashboard, a team of researchers designed mockups and low- and high-fidelity prototypes. This resulted in the creation of three additional visualization windows: (1) a program progress visualization, (2) a list of recommended courses according to academic workload and course difficulty, and (3) a histogram of students' academic performance relative to his/her peers. These visualizations also included previous advising history, the chances of dropping out of an academic program, and information from student welfare and support services.

Regarding U3, the assessment of the institutional context revealed the need for information about the academic strengths and weaknesses of students, in order to provide them with quality feedback and timely support. The indicators that were found relevant for that purpose were the following: program progress (e.g. course grades, courses passed, failed, and dropped), student academic history (e.g. grade point average, semester grade), student progress in each term (academic workload and course retaking rates), and dropout predictions. As a result of a user-centered and iterative software development process, U3 implemented AvAc (from Spanish "Avance Académico"). AvAc is an academic advising dashboard to support academic decision-making during face-to-face student-advisor dialogues. Since the academic advising process was not clearly defined in this university, program chairs were identified as primary users, considering that they regularly talk with students about their academic progress. This dashboard was developed and tested at the Engineering Faculty, involving the faculty dean and vice-dean, four program chairs, ten students, and an IT specialist. Several interviews and focus group sessions were organized with these stakeholders, using prototypes that evolved throughout the different sessions. After validating a high-fidelity prototype, a beta version of the visualization tool was built and improved by considering feedback from 75 teaching and administrative staff.

Like the other universities, a team of researchers also conducted an assessment of the institutional context in U4. This assessment revealed the need to provide students with information for their own decision-making. This need emerged mainly from student representatives, who were interested in getting involved in the design of a student-facing dashboard. From their perspective, this dashboard should focus on helping students choose courses for an upcoming semester. In this study, we present the students' needs detected in U4 and compare how these needs relate to the LA dashboards proposed by the other universities. Specifically, Sub-sect. 3.3 presents the data gathering techniques that were used to explore students' needs for information in U4, besides describing how the data was analyzed to answer the research question addressed in this study.

3.3 Study Participants, Data Gathering Techniques, and Data Analysis Plan

In order to explore what indicators could be relevant for student decision-making in U4, we applied two online surveys. Both surveys were anonymous and participation was voluntary, complying with the ethical standards of the ethical commission of the Pontificia Universidad Católica de Chile and the other institutions participating in this study. The first online survey was applied to a convenience sample of 31 student representatives of different engineering majors (environmental engineering, electrical engineering, construction engineering, structural engineering, transportation engineering, industrial engineering, mechanical engineering, chemical engineering, and others). This questionnaire had the following open-ended questions:

• What information did you use to decide which subjects to take this academic period?
• How did you get the information you used to decide which subjects to take?
• What additional information would you like to have to decide which courses to take? From your perspective, indicate what information you perceive is missing and why it is relevant to you.
• What problems did you face when deciding which courses to take? In case you did not face any kind of problem, you can leave this question blank.

The answers to the questions above were coded using a grounded theory approach, letting core categories and related concepts emerge directly from the data. These categories were used to develop closed-ended questions for a subsequent online survey. This survey was voluntarily answered by a convenience sample of 652 students from different majors and admission cohorts (see Table 1). We ranked responses from the highest to the lowest percentage of respondents who agreed with the corresponding statement, in order to determine the predominant expectations of students and staff concerning LA adoption at their institutions.

Table 1. Participants of the online survey applied during the quantitative phase

                                  Survey respondents (N = 652)
% of female students              47%
Cohorts
  % of first year students        15%
  % of second year students       28%
  % of third year students        11%
  % of fourth year students       31%
  % of fifth year or older        15%

Note: Most survey participants were affiliated with engineering majors (research operations (25%), computer science (13%), environmental engineering (6%), electrical engineering (6%), biological engineering (4%), chemical engineering (4%), and others), while 9% had not decided on a major yet.

In order to identify what indicators are relevant for students to be included in an academic advisor dashboard, we conducted a two-step analysis. First, we identified those indicators that were mentioned by more than 50% of the U4 students who responded to the closed-ended survey (considering indicators that they claimed they currently use to choose courses, in addition to the ones they perceive to be missing). Then, we contrasted these indicators with the ones that have been included in the academic advisor dashboards designed and implemented at U1, U2, and U3.
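A minimal sketch of this two-step analysis is given below, assuming a data frame with one boolean column per indicator from the closed-ended survey and a hand-made set of indicators already covered by the U1–U3 dashboards; all file names, column names and the example set are hypothetical.

    import pandas as pd

    # Hypothetical inputs: one row per respondent, one True/False column per indicator,
    # plus an illustrative set of indicators already shown by the existing advising dashboards.
    responses = pd.read_csv("closed_ended_survey.csv")
    dashboard_indicators = {"program progress", "past course grades", "course difficulty"}

    # Step 1: share of students mentioning each indicator; keep those above the 50% threshold.
    mention_rate = responses.mean(numeric_only=True).sort_values(ascending=False)
    relevant = mention_rate[mention_rate > 0.50]

    # Step 2: contrast the relevant indicators with what the existing dashboards already include.
    covered = [name for name in relevant.index if name in dashboard_indicators]
    missing = [name for name in relevant.index if name not in dashboard_indicators]
    print("Covered by existing dashboards:", covered)
    print("Not yet covered:", missing)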

4 Results

In U4, students currently use different sources of information to decide what courses to take. Figure 1 shows the different sources of information that students mentioned during the qualitative phase of this study and the percentage of mentions obtained from students' responses to the online survey applied during the quantitative phase. Out of 627 students who answered the online survey, 98% mentioned they use an institutional dashboard with information about course schedules for current and future academic periods, whereas 12% mentioned asking tutors and other people about further course information.

Fig. 1. Sources of information that students currently use to decide which subjects to take according to the online survey applied during the quantitative phase of this study. (Horizontal bar chart; categories: Class Scheduling Dashboard, Visualization of Progress in the Program, Facebook Group about Course Information, Student Facebook Group, Learning Management System, Program WhatsApp Group, University Course Catalogue, Other People (Advisors, Acquaintances, etc.), Friends; axis: Percentage of Frequency Mentions, N = 627.)

Among all the different sources of information, students use different indicators to decide what courses to take in an upcoming semester. Figure 2 shows the different indicators that students claimed they currently use during the qualitative phase of this study, in addition to the percentages of frequency mentions that were collected during the quantitative phase. Out of 627 students who answered the online survey, 95% mentioned they use course schedules, followed by information about program progress (66%), students' perceptions of course difficulty (64%), course availability (58%), students' perception of academic workload (56%), and students' comments about different courses (55%).

Fig. 2. Indicators that students currently use to decide which subjects to take according to the online survey applied during the quantitative phase of this study. (Horizontal bar chart; categories: Course Schedules, Program Progress, Perceptions of Course Difficulty, Course Availability, Perception of Workload, Students' Comments, Teaching Staff Information, Curriculum Progress, Course Requisites, Program Requisites, Pending Academic Credits, Course Assessment Tools; axis: Percentage of Frequency Mentions, N = 627.)

Despite using many different forms of information, an important observation is that students still perceive they lack information for their decision-making. Figure 3 shows the different indicators that students perceive they lack according to the findings of the qualitative phase of this study, in addition to the percentages of frequency mentions that were collected during the quantitative phase. Out of 627 students who answered the online survey, 61% mentioned that they lack information about the time spent with teaching assistants (real use of TA's hours), followed by the assessment calendar of each course (56%), the type of assessment tools used in each course (52%), and students' academic performance in prior versions of the course (49%).

Fig. 3. Indicators that students perceive that they lack to decide which subjects to take according to the online survey applied during the quantitative phase of this study. (Horizontal bar chart; categories: Real Use of TA's Hours, Assessment Calendar, Assessment Tool Types, Past Course Grades, Academic Workload, Real Use of Lab Hours, Teaching Methodologies, Teaching Staff Experience, Semesters in which a Course is Offered, Prior Course Syllabus, Expected Performance Levels; axis: Percentage of Frequency Mentions, N = 627.)

Table 2 shows the results of contrasting the indicators that were found relevant in U4 with the ones that have been included in the academic advisor dashboards implemented at U1, U2, and U3. In U4, students considered it relevant to have information about course academic workload, course difficulty, and past course grades. In the dashboards implemented in U1, U2, and U3, these indicators have been included in the form of the student's program progress with respect to the courses required by his/her study plan. Other, more historical indicators were not directly mentioned by the students as important, although they were considered interesting for evaluating students' evolution in the other dashboards. However, the students also mentioned students' comments on courses, in addition to other course details, such as the assessment calendar and the type of assessment tools. This type of information has not necessarily been considered relevant by teaching staff, advisors, and program chairs in U1, U2, and U3, respectively (Table 2).

Table 2. Relevant indicators for academic advising scenarios in different Latin American universities. (Stakeholder columns: U1 (teaching staff), U2 (advisors), U3 (program chairs), U4 (students); the cell shading marking which indicators were relevant for each stakeholder is not reproduced here.)

Relevant indicators:
• Program progress visualization (courses taken from the required study plan)
• Course requisites and credits
• Student academic history (grade point average and semester grade)
• Student academic performance in comparison with their academic cohort
• Student progress in each term (academic workload, courses taken)
• Student academic performance regarding their peers in a specific course
• Past course grades
• History of previous advising meetings
• List of recommended courses according to workload and course difficulty
• Comparison between recommended courses and courses taken by the student
• Chances of dropping out of an academic program
• Information from student welfare and support services
• Course schedule
• Course difficulty
• Course availability
• Course academic workload
• Students' perceptions on different courses
• Hours assigned to work with teaching assistants
• Assessment calendar for each course
• Type of assessment tools used in each course

Note: Black cells reveal the indicators that resulted relevant for each university. In the case of U4, these indicators were the ones mentioned by more than 50% of the U4 students who responded to the closed-ended questions about information they have used or missed during course enrollment (Figs. 2 and 3).


5 Discussion, Limitations, and Conclusions

This study reports the results of an inquiry among students at a Latin American university about the indicators that they found relevant when choosing courses. On the one hand, most students mentioned that they use program progress information to know what courses to take for an upcoming academic period. On the other hand, most students also mentioned the use of and the need for further course-level information. By course-level information, they referred to course schedules, difficulty, availability, and workload, in addition to other details regarding the assessment calendar, the type of assessment tools, and past course grades. Most of these indicators could be classified as descriptive [15], considering that they describe what has happened in different courses in terms of content delivery, assessment, and student academic performance. This could mean that students do not necessarily need predictive indicators to choose courses, such as the ones used in existing academic advising dashboards like Degree Compass [8] or LADA [11]. Along these lines, prior studies have already revealed some resistance towards predictive indicators from the perspective of academic advisors [11], which might explain the recent shift from predictive to descriptive analytics [2]. Potentially, descriptive analytics could help students to plan their next semester in terms of course enrollment and time management, anticipating what to expect from a course in terms of learning outcomes and academic workload.

By comparing the perspectives of different dashboard users in four Latin American universities, we identified similarities and differences between students and staff members. Regarding similarities, program progress was perceived as a relevant indicator by students, teaching staff, advisors, and program chairs (see Table 2). This indicator has already been included in LISSA, an academic advising dashboard which has successfully supported the dialogue between advisors and students at KU Leuven [12, 19]. Regarding differences, U4 students mentioned course-level information that was not necessarily considered relevant by staff members at U1, U2, or U3. According to the closed-ended survey, 55% of U4 students relied on students' comments to choose courses (Fig. 2), and 58% consulted a Facebook group about course information. Existing dashboards have not necessarily included features to promote student socialization, which is crucial in educational theories such as constructivism and cognitive motivation. On the contrary, existing dashboards have offered learners some indicators to compare their performance relative to their peers [2]. This type of comparison could generate feelings of distress, demotivation and disappointment, especially in low-performing students [2]. Thus, student-facing dashboards should avoid these types of indicators, considering that there is no staff member supporting the student during the interpretation of this data. Furthermore, more studies are needed to understand how LA dashboards could support students without necessarily requiring a staff member to influence their decision-making. Although advising experiences have been positive in some higher education institutions [6], not all students attend advising meetings promptly [9]. Successful advising experiences rely on trust and confidence between advisors and advisees, which is not always easy to build.
In this sense, more attention should be focused on student-facing analytics to meet students’ need for information [4].


Regarding the universities involved in this study, not only U4 but also U1 and U2 are planning to develop student-facing dashboards based on the results presented in Sect. 4. In fact, U1 is about to pilot a student-facing dashboard that is very similar to the one targeting teaching staff (TrAC), but its aim is to help students evaluate the impact of enrolling in certain courses in terms of workload and academic performance. More of these LA initiatives are needed to empower students, recognizing their abilities to make decisions and become agents of their own learning [14]. Although this study contributes to a better understanding of students' perspectives on academic advising dashboards, some limitations should be noted. First, data was collected in Latin American universities, so the findings might not be representative of all educational systems. Besides, voluntary response bias could have affected the second questionnaire. Second, the study did not include the examination of indicators in terms of reference frames, such as comparison with peers, achievement and progress. Still, the four Latin American universities involved in this study are currently affiliated with a large research project, in which they have collaborated with European universities. As a result of this collaboration, three out of the four universities developed academic advising dashboards based on LISSA, a dashboard that has been widely documented in the LA literature [12, 19]. Besides, the types of indicators presented in this study have not only been differentiated in terms of their descriptive or predictive nature, but have also been described according to some reference frames, such as 'program progress' and 'regarding peers'. In order to examine students' preferences regarding reference frames, LA researchers should engage students as co-creators of dashboards, anticipating the effects of different types of indicators on learners' actions [4]. Finally, this study revealed that students' need for information could be met with the development of visualizations that aggregate existing academic information. Today, this information might be disaggregated in course catalogues, learning management systems, and other web-based applications used by varied institutional services. However, teams of researchers have already invested efforts to integrate this data into academic advising dashboards that are currently being used in universities in the U.S., Europe and Latin America. Consequently, all higher education institutions are invited to explore the needs of their students, in order to design their own dashboard or adapt an existing one. Around the world, there are learners waiting to be empowered through meaningful information, willing to collaborate with LA researchers in the creation of student-facing analytics.

Acknowledgements. This work was funded by the LALA project (grant no. 586120-EPP-1-2017-1-ES-EPPKA2-CBHE-JP). This project has been funded with support from the European Commission. This publication reflects only the views of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein. The authors would like to thank the student representatives who motivated this study, María José Marín and Cristóbal Chauriye.
Their effort has motivated the design and implementation of student-facing dashboards at the engineering school in Pontificia Universidad Católica de Chile, leveraging the capacities already installed in this institution by means of the LALA project.


References

1. Bodily, R., Verbert, K.: Review of research on student-facing learning analytics dashboards and educational recommender systems. IEEE Trans. Learn. Technol. 10(4), 405–418 (2017)
2. Jivet, I., Scheffel, M., Specht, M., Drachsler, H.: License to evaluate: preparing learning analytics dashboards for educational practice. In: Proceedings of the 8th International Conference on Learning Analytics and Knowledge - LAK 2018, pp. 31–40 (2018)
3. Schwendimann, B.A., et al.: Perceiving learning at a glance: a systematic literature review of learning dashboard research. IEEE Trans. Learn. Technol. 10(1), 30–41 (2017)
4. De Quincey, E., Kyriacou, T., Briggs, C., Waller, R.: Student centred design of a learning analytics system. In: ACM International Conference Proceeding Series, pp. 353–362 (2019)
5. Baker, R.S.: Stupid tutoring systems, intelligent humans. Int. J. Artif. Intell. Educ. 26(2), 600–614 (2016). https://doi.org/10.1007/s40593-016-0105-0
6. Mu, L., Fosnacht, K.: Effective advising: how academic advising influences student learning outcomes in different institutional contexts. Rev. High. Educ. 42(4), 1283–1307 (2019)
7. Phillips, E.D.: Improving advising using technology and data analytics. Chang. Mag. High. Learn. 45, 48–56 (2013)
8. Denley, T.: How predictive analytics and choice architecture can improve student success. Res. Pract. Assess. 2014, 61–69 (2014)
9. Schock, L.: Success through predictive analytics. International Educator, p. 16 (2018)
10. Aguilar, S., Lonn, S., Teasley, S.D.: Perceptions and use of an early warning system during a higher education transition program. In: ACM International Conference Proceeding Series, pp. 113–117 (2014)
11. Gutiérrez, F., Seipp, K., Ochoa, X., Chiluiza, K., De Laet, T., Verbert, K.: LADA: a learning analytics dashboard for academic advising. Comput. Human Behav. 107, 105826 (2018)
12. Charleer, S., Vande Moere, A., Klerkx, J., Verbert, K., De Laet, T.: Learning analytics dashboards to support adviser-student dialogue. IEEE Trans. Learn. Technol. 11(3), 389–399 (2018)
13. Lim, L., Joksimović, S., Dawson, S., Gašević, D.: Exploring students' sensemaking of learning analytics dashboards: does frame of reference make a difference? In: ACM International Conference Proceeding Series, pp. 250–259 (2019)
14. Dollinger, M., Lodge, J.M.: Co-creation strategies for learning analytics. In: ACM International Conference Proceeding Series, pp. 97–101, March 2018
15. Uhomoibhi, J., Azevedo, A.I.R.L., Azevedo, J.M.M.L., Ossiannilsson, E.: Learning analytics in theory and practice: guest editorial. Int. J. Inf. Learn. Technol. 36(4), 286–287 (2019)
16. Oblinger, D.G. (ed.): Game Changers: Education and Information Technologies (2012)
17. Burns, B., Crow, M.M., Becker, M.P.: Innovating together: collaboration as a driving force to improve student success. EDUCAUSE Review, pp. 1–9 (2015)
18. Lonn, S., Teasley, S.D.: Student explorer: a tool for supporting academic advising at scale. In: L@S 2014 - Proceedings of the 1st ACM Conference on Learning at Scale, pp. 175–176 (2014)
19. Millecamp, M., Gutiérrez, F., Charleer, S., Verbert, K., De Laet, T.: A qualitative evaluation of a learning dashboard to support advisor-student dialogues. In: ACM International Conference Proceedings Series, no. 2, pp. 56–60 (2018)
20. Jones, K.M.L.: Advising the whole student: eAdvising analytics and the contextual suppression of advisor values. Educ. Inf. Technol. 24(1), 437–458 (2018). https://doi.org/10.1007/s10639-018-9781-8


21. Chevreux, H., Henríquez, V., Guerra, J., Sheihing, E.: Agile development of learning analytics tools in a rigid environment like a university: benefits, challenges and strategies. In: Transforming Learning with Meaningful Technologies - 14th European Conference on Technology Enhanced Learning, EC-TEL 2019, pp. 705–708 (2019)
22. Guerra, J., et al.: Adaptation and evaluation of a learning analytics dashboard to improve academic support at three Latin American universities. Br. J. Educ. Technol. 51, 973–1001 (2020)
23. Ortiz-Rojas, M., Maya, R., Jimenez, A., Hilliger, I., Chiluiza, K.: A step by step methodology for software design of a learning analytics tool in Latin America: a case study in Ecuador. In: Proceedings of the 14th Latin American Conference on Learning Objects and Technologies, LACLO 2019, pp. 116–122 (2019)

Living with Learning Difficulties: Two Case Studies Exploring the Relationship Between Emotion and Performance in Students with Learning Difficulties

Styliani Siouli1, Stylianos Makris2, Evangelia Romanopoulou3, and Panagiotis P. D. Bamidis3

1 Experimental School of the Aristotle University of Thessaloniki (AUTH), Thessaloniki, Greece. [email protected]
2 8th Primary School of Neapoli, Thessaloniki, Greece. [email protected]
3 Lab of Medical Physics, School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece. [email protected], [email protected]

Abstract. Research demonstrates that positive emotions contribute to students' greater engagement with the learning experience, while negative emotions may detract from the learning experience. The purpose of this study is to evaluate the effect of a computer-based training program on the emotional status and its effect on the performance of two students with learning difficulties: a second-grade student of a primary school with Simpson-Golabi-Behmel syndrome and a fourth-grade student of a primary school with learning difficulties. For the purpose of this study, the "BrainHQ" web-based cognitive training software and the mobile app "AffectLecture" were used. The former was used for improving students' cognitive development, while the latter was used for measuring the affective state of the students before and after each intervention, in order to evaluate the possible improvement of their initial emotional status after the intervention with the "BrainHQ" program, the possible effect of positive/negative emotional status on their performance, as well as the possible effect of high/poor performance on their emotional status. The results of the study demonstrate a positive effect of emotion on performance and of performance on emotional status, in both directions. These findings suggest that the affective state of students should be taken into account by educators, scholars and policymakers.

Keywords: Learning difficulties · Cognitive training · BrainHQ · Emotional status · AffectLecture · Performance



1 Introduction

Educational policy has traditionally paid attention to the cognitive development of students without focusing on how emotions adjust their psychological state and how this affects their academic achievement. Emotions have a large influence over mental health, learning and cognitive functions. Students go through various emotional states during the education process [1], and their mental state is considered to play a major role in building internal motivation [2–4]. Existing research in the field of education has, however, treated students' cognitive processes as far more important than emotional processes [5]. Educational research should therefore also take into consideration the experience of people with disabilities, including those with learning difficulties [6].

The term 'learning difficulties' (LD) is commonly used to describe students with intellectual/learning disabilities. For further specification, this group may be considered a sub-group of all those students who face several disabilities, such as physical, sensory, and emotional-behavioral difficulties, as well as learning difficulties [7]. Learning difficulties are considered a developmental disorder that occurs most frequently in the school years. It is commonly recognized as a "special" difficulty in writing, reading, spelling and mathematics, affecting approximately 15% to 30% of all students [8]. The first signs of the disorder are most frequently diagnosed at preschool age, either in the form of speech disorders or of visual disturbances. Students with learning difficulties (LD) in the middle school years typically display slow and effortful performance of basic academic reading and arithmetic skills. This lack of ability in reading and calculating indicates weaknesses in cognitive processes that have far-reaching implications across learning, teaching, and affective domains [9]. When it comes to obtaining certain academic skills, the majority of students with LD during their middle school years achieve fewer gains in learning and classroom performance. As time goes by, the gap therefore inevitably becomes wider year after year, and their academic achievements fall behind those of their peers [10–12]. In addition to deficiencies in basic academic skills, many students with LD in the middle years of schooling may also have particular cognitive characteristics that slow their learning, such as reduced working memory capacity and the use of unproductive procedures for managing the components of working memory. In conclusion, learners with LD require more carefully adapted academic instruction [13].

Simpson-Golabi-Behmel syndrome (SGBS) is a rare, X-linked disorder with prenatal and postnatal overgrowth, physical development and multiple congenital abnormalities. The disorder was first described by Simpson et al. [14]. The phenotype of the syndrome is broad and includes typical countenance, macroglossia, organomegaly, Nephrolepis, herniation, broad arms and legs, skeletal abnormalities, supraventricular neck, conjunctivitis, more than two nipples and constructional dysfunctions [15]. According to Neri et al. [16], affected males have increased early mortality and an elevated risk of developing neoplasms from fetal age. Women are asymptomatic carriers of the gene who frequently present coarse facial features and mental incapacity [17].


The psychomotor development of patients with SGBS is diverse and ranges from normal intelligence to moderate and severe disorder that may be appeared at birth [18– 20]. Speech delay occurs in 50% of cases and motor delay in 36% [18]. Moreover, speech difficulties were appeared in most affected people, which are partially justified by macroglossia and cleft lip or palate. Affected boys show mild cognitive disorders that are not always related to speech delay and walking. Moreover, they face difficulty in fine and gross motor development [20]. Learning difficulties are frequently related to behavioral problems and ambiguities, attention deficit disorder, hyperactivity, concentration and writing disorders [21]. However, there has been limited research about the behavioral phenotype and the problematic behavior that progress during puberty, as well as the behavioral difficulties during the school years, which need mental health support and therapy [22]. There is a wide range of syndrome characteristics, which have not been identified or examined yet. Every new case being studied, found not to have precisely the same characteristics as the previous one. As new cases of SGBS appear, the clinical picture of the syndrome is consistently expanding [23]. 1.1

1.1 Related Work and Background

Research has shown that children's and youngsters' emotions are associated with their school performance. Typically, positive emotions such as enjoyment of learning show positive relations with performance, while negative emotions such as test anxiety show negative relations [24–26]. Related work in the field of education has indicated that students' learning is related to their emotional state: negative emotions decrease academic performance, while positive emotions increase it [27, 28]. However, recent studies have been based on specific emotions, disregarding the possible presence of other emotions that may have a significant impact on motivation and/or school performance [29]. At this same educational level, Yeager et al. [30] investigated the possible negative correlation between boredom and math activities, while Na [31] observed a negative correlation between anxiety and English learning. Likewise, Pulido and Herrera [32] carried out a study with primary, secondary and university students which revealed that high levels of fear predict low academic performance, irrespective of the school subject. In addition, Trigueros, Aguilar-Parra, Cangas, López-Liria and Álvarez [33], in their study with adolescents, indicated that shame is negatively related to motivation, which adversely impacts learning and therefore academic performance. Moreover, Siouli et al. carried out a study with primary students which showed that emotion influences academic performance in all class subjects and that the teaching process can induce an emotional change over a school week [34]. The results of the aforementioned studies confirm a strong relationship between academic performance and students' emotional state. Several research studies have thus examined the relationship between emotions and school performance; for students with syndromes and learning difficulties, however, the only available data come from our case studies.


2 Methods

2.1 Cognitive Training Intervention

The Integrated Healthcare System Long Lasting Memories Care (LLM Care) [35] was used in this study as an ICT platform that combines cognitive exercises (BrainHQ) with physical activity (wFitForAll). LLM Care was initially deployed to provide training for enhancing the cognitive and physical health of the elderly [36], as well as the quality of life and autonomy of vulnerable people [37]. BrainHQ [38], the cognitive component of LLM Care, is web-based training software developed by Posit Science. It is the only such software available in Greek, can be used on any portable computing device (tablet, cell phone, etc.) as an Android or iOS application, and is provided in various languages. Enhancement of brain performance can lead to multiple benefits in everyday life. Both research studies and user testimonials indicate that BrainHQ offers benefits in thinking, memory and hearing, attention and vision, reaction speed, safer driving, self-confidence, quality of conversation and mood. BrainHQ includes 29 exercises divided into 6 categories: Attention, Speed, Memory, Skills, Intelligence and Navigation [38]. The students' training intervention attempts to fill some of the identified gaps in research and practice concerning elementary school students with learning difficulties and syndromes. Specifically, it aims to provide an intensive intervention that equips students with the skills required to engage more successfully with classroom instruction. The intervention was designed as a relatively long-term, yet cost-effective, program for students with poor performance in elementary school. BrainHQ has therefore been used in an effort to cognitively train students with genetic syndromes and complex medical cases with psychiatric problems that go beyond cognitive function [39, 40]. An intervention using BrainHQ could be a promising approach for individuals with Simpson-Golabi-Behmel syndrome and individuals with learning difficulties.

2.2 The AffectLecture App

The AffectLecture application (courtesy of the Laboratory of Medical Physics, AUTH; available for download through the Google Play marketplace) was used to measure the students' emotional status. It is a self-reporting, emotion-registering tool consisting of a five-level Likert scale that measures a person's emotional status from 1 (very sad) to 5 (very happy) [41].

2.3 Participants

I.M. (Participant A) is an 8-year-old student with Simpson-Golabi-Behmel Syndrome, who completed the second grade of Elementary School in a rural area of Greece and


S.D. (Participant B) is a 10-year-old student with Learning Difficulties who completed the fourth grade of Elementary School in a provincial area of Greece. Participant A performed a 30-session cognitive training intervention applied during school time (3–4 sessions/week, 45 min each). A few sessions were also conducted at the student's house in order to complete the cognitive intervention program. Participant B attended a 40-session training intervention at school during school time and at the student's house (3 sessions/week, 45 min each, for 8 weeks, followed by daily sessions for the last two weeks of the intervention). The cognitive training interventions were performed in classrooms, meaning that both students were in their own school environments and received an equivalent cognitive training, although they faced different learning difficulties. Prior to the beginning of the study, both students were familiarized with the use of the AffectLecture app on their tablets, on which the app was installed. The students were asked to state how they felt by selecting an emoticon. Emotional status was measured before the start and at the end of each training intervention, for the entire duration of the training sessions. As shown in Fig. 1, the students had to choose among five emoticons and select the one that best expressed how they felt at that moment.

Fig. 1. AffectLecture Input

At the beginning of each training intervention, teachers provided a unique 4-digit PIN that gave the students access to the session and allowed them to vote before and after it, stating their emotional status.

2.4 Research Hypotheses

Hypothesis 1: A positive emotional state will have a positive effect on students' performance, while a negative emotional state will have a negative effect on their performance.
Hypothesis 2: High performance will have a positive effect on students' emotional status, while poor performance will have a negative effect on their emotional status.

2.5 Data Collection Methods

Students' performance was assessed by the online interactive BrainHQ program. The detailed session results of BrainHQ and the students' emotional status, as measured by the AffectLecture app before and after each intervention, were used to collect the data required for the purpose of the study.

2.6 Data Collection Procedure

Before the beginning of the study, students were informed about the use of BrainHQ and the AffectLecture app and were instructed to have their tablets with them. The AffectLecture app was installed on both students' devices. Test scores in the 6 categories of cognitive performance (Attention, Speed, Memory, Skills, Intelligence and Navigation) were measured by the interactive BrainHQ program. During each session, students were trained equally in all six categories, starting with Attention and moving on to Memory, Brain Speed, People Skills, Intelligence and Navigation, in order to benefit the most; training time was divided equally among the categories. Each time students completed an exercise level, they earned "Stars" according to their performance and progress, giving them an indication of how their brain was performing and improving. The students' emotional status was measured before the beginning and at the end of each cognitive training session, throughout the intervention period.

2.7 Evaluation Methodology

A non-parametric Wilcoxon signed-rank test was conducted to compare within-intervention differences in emotional status. The AffectLecture responses of each student, before and after every intervention with the BrainHQ cognitive training program, were used for this comparison. Following that, a Spearman rank correlation was computed to examine the relation between the performance and emotional status variables. The significance threshold was set to 0.01 for all tests.
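The following is a minimal illustrative sketch, not the authors' analysis code, of how these two tests can be run in Python with SciPy. The arrays of AffectLecture ratings and BrainHQ stars are hypothetical placeholders for one participant's sessions.

import numpy as np
from scipy.stats import wilcoxon, spearmanr

# Hypothetical per-session data for one participant (invented values)
affect_before = np.array([3, 4, 2, 5, 3, 4, 5, 4, 3, 4])  # 1 (very sad) .. 5 (very happy)
affect_after  = np.array([4, 4, 3, 5, 4, 5, 5, 5, 4, 5])
stars         = np.array([12, 15, 9, 22, 11, 19, 24, 16, 10, 20])  # BrainHQ performance indicator

ALPHA = 0.01  # significance threshold used in the study

# Within-intervention comparison of emotional status (before vs. after each session)
stat, p_wilcoxon = wilcoxon(affect_before, affect_after)
print(f"Wilcoxon signed-rank: statistic={stat:.3f}, p={p_wilcoxon:.3f}, significant={p_wilcoxon < ALPHA}")

# Relation between emotional status before the session and performance
rho_before, p_before = spearmanr(affect_before, stars)
print(f"Spearman rho (affect before vs. stars): rho={rho_before:.2f}, p={p_before:.3f}")

# Relation between performance and emotional status after the session
rho_after, p_after = spearmanr(stars, affect_after)
print(f"Spearman rho (stars vs. affect after): rho={rho_after:.2f}, p={p_after:.3f}")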

3 Results
The within-intervention comparisons revealed a statistically significant difference in emotional status before and after the intervention for participant A (Wilcoxon Z = −3.000, p = 0.003 < 0.01), as well as for participant B (Wilcoxon Z = −3.382, p = 0.001 < 0.01), as shown in Table 1.

Table 1. Within-intervention comparisons.

Participants    Total no. of interventions    Z        p-value
Participant A   30                            −3.000   .003
Participant B   40                            −3.382   .001


Furthermore, the Spearman Rho correlation coefficient was used to measure the strength of the relationship between the performance indicator provided by the BrainHQ cognitive training program (stars) and the emotional status of participants A and B. As hypothesized, performance may be correlated with the affective state of the students. The scatter diagrams (see Fig. 2) suggest a strong positive correlation between emotional status before the intervention and performance (BrainHQ stars) for participant A (Spearman rho r = 0.77, p = 0.000 < 0.01), but no significant correlation for participant B (Spearman rho r = −0.255, p = 0.112 > 0.01). Following the 0.01 criterion, the interpretation of the results indicates that, at least for participant A, a positive emotional status before the intervention tends to be accompanied by higher performance during the cognitive training. These findings point to the importance of positive emotions for performance and other achievement-related outcomes, suggesting that the cognitive training intervention positively influenced students' experiences at the level of performance.

Fig. 2. Scatter Diagrams for emotional status before the intervention and performance in BrainHQ cognitive training interactive program for Participant A and B.

The scatter diagrams (see Fig. 3) also suggest a strong positive correlation between performance (BrainHQ stars) and emotional status after the intervention for participant A (Spearman rho r = 0.896, p = 0.000 < 0.01) and a moderate positive correlation for participant B (Spearman rho r = 0.433, p = 0.005 < 0.01). Following the 0.01 criterion, the interpretation of the results shows that high performance during the interventions tends to improve students' emotional status after the cognitive training. Data on positive and negative emotions were obtained by asking each student to report their final emotional status on AffectLecture. An increase in happiness and motivation was observed when students' performance increased, while markedly different emotions and signs of demotivation were observed when students' performance was poor. These findings highlight the impact of general well-being and happiness on performance.


Fig. 3. Scatter Diagrams for emotional status after the intervention and performance in BrainHQ cognitive training interactive program for Participant A and B.

The correlation coefficients (between emotional status before/after the intervention and performance) and their corresponding p-values are included in Tables 2 and 3.

Table 2. Correlations between affection and performance for participant A.

Participant A                                              Total no. of interventions   Spearman Cor. Coef.   p-value
Affection before the intervention / Performance (Stars)    30                           .77                   .000
Performance (Stars) / Affection after the intervention     30                           .896                  .000

Table 3. Correlations between affection and performance for participant B.

Participant B                                              Total no. of interventions   Spearman Cor. Coef.   p-value
Affection before the intervention / Performance (Stars)    40                           −.255                 .112
Performance (Stars) / Affection after the intervention     40                           .433                  .005

4 Discussion
The present study was designed to investigate the influence of emotional state on students' performance, as well as to identify the relations between cognitive performance and emotion. Specifically, the affective state of two elementary school students with learning difficulties was measured over an extended period by the AffectLecture app, before and after each intervention with BrainHQ. Additionally, cognitive performance was measured by the online interactive BrainHQ program at the end of each session.


As hypothesized, students' positive emotional state had a positive effect on their performance, while their negative emotional state had a negative effect on it. It can also be concluded that high performance had a positive effect on students' emotional status, while poor performance had a negative effect on it. The repeated measures revealed a reciprocal relationship: a significant positive effect of emotion on performance, and of performance on emotional status. The results imply that performance influences students' emotions, suggesting that successful performance attainment and positive feedback can foster positive emotions, while failure can escalate negative emotions. This set of case studies adds to the small body of empirical data regarding the importance of emotions in children with learning difficulties and syndromes. The findings are in agreement with previous studies reporting that achievement emotions can profoundly affect students' learning and performance. Positive activating emotions can positively affect academic performance under most conditions, whereas negative deactivating emotions are posited to uniformly reduce motivation, implying negative effects on performance [42–46]. Numerous research studies have explored the relationship between emotional state and academic achievement; however, for pupils with Simpson-Golabi-Behmel syndrome and learning difficulties, the only available data come from our case studies. The findings of this research, together with the available BrainHQ data (collected stars), reveal the significant contribution of the online brain training program (BrainHQ) to the cognitive enhancement of both students. The results from the cognitive exercises and assessments, in addition to the daily observation of the students, indicate that the intervention with the BrainHQ program had a positive impact on cognitive function, mainly in the areas of visual/working memory, the capacity to retrieve and process new information, processing speed affecting daily life activities, attention, and concentration. Moreover, cognitive training improved students' performance in solving learning problems and in the areas of Memory, Speed, Attention, Skills, Navigation, and Intelligence [47]. At this point it is important to identify possible limitations of the present study design. The measurements that were performed cannot cover the full range of factors that may affect participants' cognitive performance and emotional state. Some of the factors that were not included are: the level of intellectual disability/difficulty, language proficiency, family socioeconomic background, living conditions, intelligence, skills, and learning style. The limitations of this study also include the research method (case study), which has often been criticized for its lack of scientific generalizability. For this reason, the results of this study must be treated with caution, as they may contain a bias toward verification. Larger-scale studies should be conducted to establish the effect of emotional state on the cognitive function of children with learning difficulties/disabilities. On the other hand, case study research has great strength in investigating units consisting of multiple variables of importance, and it allows researchers to retain a holistic view of real-life events, such as behavior and school performance [48].
At this point, we should take into account that students tend to present their emotions as 'more socially acknowledged' when they are being assessed. Following this line of thought, consideration must be given as to whether students consciously and intentionally modified their emotions and behaviours due to the presence of an observer, introducing further bias to the study. Such reactivity to being watched is commonly referred to as the "Hawthorne Effect": a phenomenon in which people alter, or attempt to improve, their behaviour simply because they know it is being studied or evaluated [49]. This intrinsic bias must be taken into consideration when interpreting the findings.

5 Conclusions
The current study provides evidence that learning difficulties can be ameliorated by intensive adaptive training and positive emotional states. The strong positive correlation between affective state and cognitive performance on BrainHQ indicates that the better the affective state of the student, the higher the performance, which was the hypothesis set by the authors. However, the causal direction of this relationship requires further investigation in future studies. Developing and sustaining an educational environment that celebrates the diversity of all learners is circumscribed by the particular political and social environment, as well as by the capacity of school communities and individual teachers to confidently embed inclusive attitudes and practice into their everyday actions. In addition, identifying and accounting for the various dynamics which influence the implementation of inclusive practice is fundamentally bound to the diversity or disability encountered in the classroom. At the same time, it would be useful and promising to perform further research over a long period of time to investigate the influence of positive, neutral and negative affective states on the performance of students with cognitive and learning difficulties. The findings could also have significant implications for understanding the effect of positive or negative emotions on the cognitive function and learning deficits of children with learning disabilities. The results obtained from the children after adaptive training suggest that a positive emotional status during computerized cognitive training may indeed enhance and stimulate cognitive performance, with generalized benefits in a wide range of activities. It is essential that this research continues throughout the school years of both students in order to evaluate the learning benefits. Ultimately, more research on the relation between emotion and performance is needed to better understand students' emotions and their relations with important school outcomes. Social and emotional skills are key components of the educational process, sustaining students' development and supporting effective instruction. These findings may also suggest guidelines for optimizing cognitive learning by strengthening students' positive emotions and minimizing negative ones, and they need to be taken into consideration by educators, parents and school psychologists.


References 1. Pekrun, R., Goetz, T., Titz, W., Perry, R.P.: Academic emotions in students’ self-regulated learning and achievement: a program of qualitative and quantitative research. Educ. Psychol. 37, 91–105 (2002) 2. Bergin, D.A.: Influences on classroom interest. Educ. Psychol. 34, 87–98 (1999) 3. Krapp, A.: Basic needs and the development of interest and intrinsic motivational orientations. Learn. Instr. 15, 381–395 (2005) 4. Hidi, S., Renninger, K.A.: The four-phase-model of interest development. Educ. Psychol. 41, 111–127 (2006) 5. Frenzel, A.C., Becker-Kurz, B., Pekrun, R., Goetz, T., Lüdtke, O.: Emotion transmission in the classroom revisited: a reciprocal effects model of teacher and student enjoyment. J. Educ. Psychol. 110, 628–639 (2018) 6. Lambert, R., et al.: “My Dyslexia is Like a Bubble”: how insiders with learning disabilities describe their differences, strengths, and challenges. Learn. Disabil. Multidiscip. J. 24, 1–18 (2019) 7. Woolfson, L., Brady, K.: An investigation of factors impacting on mainstream teachers’ beliefs about teaching students with learning difficulties. Educ. Psychol. 29(2), 221–238 (2009). https://doi.org/10.1080/01443410802708895 8. Kaplan, H.I., Sadock, B.J.: Modern Synopsis of Comprehensive Textbook of Psychiatry, IV. Williams and Wilkins, Baltimore (1985) 9. Graham, L., et al.: QuickSmart: a basic academic skills intervention for middle school students with learning difficulties. J. Learn. Disabil. 40(5), 410–419 (2007) 10. Hempenstall, K.J.: How might a stage model of reading development be helpful in the classroom? Aust. J. Learn. Disabil. 10(3–4), 35–52 (2005) 11. Cawley, J.F., Fan Yan, W., Miller, J.H.: Arithmetic computation abilities of students with learning disabilities: implications for instruction. Learn. Disabil. Res. Pract. 11, 230–237 (1996) 12. Swanson, H.L., Hoskyn, M.: Instructing adolescents with learning disabilities: a component and composite analysis. Learn. Disabil. Res. Pract. 16, 109–119 (2001) 13. Westwood, P.: Mixed ability teaching: issues of personalization, inclusivity and effective instruction. Aust. J. Remedial Educ. 25(2), 22–26 (1993) 14. Simpson, J.L., Landey, S., New, M., German, J.: A previously unrecognized X-linked syndrome of dysmorphia. Birth Defects Original Art. Ser. 11(2), 18 (1975) 15. Neri, G., Marini, R., Cappa, M., Borrelli, P., Opitz, J.M.: Clinical and molecular aspects of the Simpson-Golabi-Behmel syndrome. Am. J. Med. Genet. Part A 30(1–2), 279–283 (1998) 16. Gertsch, E., Kirmani, S., Ackerman, M.J., Babovic-Vuksanovic, D.: Transient QT interval prolongation in an infant with Simpson–Golabi–Behmel syndrome. Am. J. Med. Genet. Part A 152(9), 2379–2382 (2010) 17. Golabi, M., Rosen, L., Opitz, J.M.: A new X-linked mental retardation-overgrowth syndrome. Am. J. Med. Genet. Part A 17(1), 345–358 (1984) 18. Spencer, C., Fieggen, K., Vorster, A., Beighton, P.: A clinical and molecular investigation of two South African families with Simpson-Golabi-Behmel syndrome. S. Afr. Med. J. (SAMJ) 106(3), 272–275 (2016) 19. Terespolsky, D., Farrell, S.A., Siegel-Bartelt, J., Weksberg, R.: Infantile lethal variant of Simpson-Golabi-Behmel syndrome associated with hydrops fetalis. Am. J. Med. Genet. Part A 59(3), 329–333 (1995)


20. Hughes-Benzie, R.M., et al.: Simpson-Golabi-Behmel syndrome: genotype/phenotype analysis of 18 affected males from 7 unrelated families. Am. J. Med. Genet. Part A 66(2), 227–234 (1996) 21. Cottereau, E., et al.: Phenotypic spectrum of Simpson–Golabi–Behmel syndrome in a series of 42 cases with a mutation in GPC3 and review of the literature. Am. J. Med. Genet. Part C Semin. Med. Genet. 163C(2), 92–105 (2013) 22. Behmel, A., Plöchl, E., Rosenkranz, W.: A new X-linked dysplasia gigantism syndrome: follow up in the first family and report on a second Austrian family. Am. J. Med. Genet. Part A 30(1–2), 275–285 (1988) 23. Young, E.L., Wishnow, R., Nigro, M.A.: Expanding the clinical picture of Simpson-GolabiBehmel syndrome. Pediatr. Neurol. 34(2), 139–142 (2006) 24. Goetz, T., Bieg, M., Lüdtke, O., Pekrun, R., Hall, N.C.: Do girls really experience more anxiety in mathematics? Psychol. Sci. 24(10), 2079–2087 (2013) 25. Pekrun, R., Linnenbrink-Garcia, L.: Introduction to emotions in education. In: International Handbook of Emotions in Education, pp. 11–20. Routledge (2014) 26. Zeidner, M.: Test Anxiety: The State of the Art. Springer, New York (1998). https://doi.org/ 10.1007/b109548 27. García, C.A., Durán, N.C.: Revisiting the concept of self-efficacy as a language learning enhancer. Gist Educ. Learn. Res. J. 15, 68–95 (2017) 28. García, M.D., Miller, R.: Disfemia y ansiedad en el aprendizaje de inglés como lengua extranjera. Rev. Española Discapac. 7, 87–109 (2019) 29. Trigueros-Ramos, R., Gómez, N.N., Aguilar-Parra, J.M., León-Estrada, I.: Influencia del docente de Educación Física sobre la confianza, diversión, la motivación y la intención de ser físicamente activo en la adolescencia. Cuadernos de Psicología del Deporte 1, 222–232 (2019) 30. Yeager, D.S., et al.: Boring but important: a self-transcendent purpose for learning fosters academic self-regulation. J. Personal. Soc. Psychol. 107, 559 (2014) 31. Na, Z.: A study of high school students’ English learning anxiety. Asian EFL J. 9, 22–34 (2007) 32. Pulido, F., Herrera, F.: La influencia de las emociones sobre el rendimiento académico. Cienc. Psicológicas 11, 29–39 (2017) 33. Trigueros, R., Aguilar-Parra, J.M., Cangas, A., López-Liria, R., Álvarez, J.F.: Influence of physical education teachers on motivation, embarrassment and the intention of being physically active during adolescence. Int. J. Environ. Res. Public Health 16, 22–95 (2019) 34. Siouli, S., Dratsiou, I., Tsitouridou, M., Kartsidis, P., Spachos, D., Bamidis, P.D.: Evaluating the AffectLecture mobile app within an elementary school class teaching process. In: 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS), pp. 481–485. IEEE (2017) 35. LLM Care Homepage. www.llmcare.gr. Accessed 25 Apr 2020 36. Bamidis, P.D., et al.: Gains in cognition through combined cognitive and physical training: the role of training dosage and severity of neurocognitive disorder. Front. Aging Neurosci. 7, 152 (2015) 37. Romanopoulou, E., Zilidou, V., Savvidis, T., Chatzisevastou-Loukidou, C., Bamidis, P.: Unmet needs of persons with down syndrome: how assistive technology and game-based training may fill the gap. Stud. Health Technol. Inform. 251, 15–18 (2018) 38. BrainHQ Homepage. www.brainhq.com. Accessed 25 Apr 2020 39. Harrell, W., et al.: Feasibility and preliminary efficacy data from a computerized cognitive intervention in children with chromosome 22q11. 2 deletion syndrome. Res. Develop. Disabil. 34(9), 2606–2613 (2013)


40. Siouli, S., Makris, S., Romanopoulou, E., Bamidis, P.D.: Cognitive computer training in children with cognitive and learning disabilities: two interesting case studies. In: 2018 2nd International Conference on Technology and Innovation in Sports, Health and Wellbeing (TISHW), pp. 1–6. IEEE (2018) 41. Antoniou, P.E., Spachos, D., Kartsidis, P., Konstantinidis, E.I., Bamidis, P.D.: Towards classroom affective analytics. Validating an affective state self-reporting tool for the medical classroom. MedEdPublish 28, 6 (2017) 42. Lane, A.M., Whyte, G.P., Terry, P.C., Nevill, A.M.: Mood, self-set goals and examination performance: the moderating effect of depressed mood. Pers. Individ. Differ. 39, 143–153 (2005) 43. Turner, J.E., Schallert, D.L.: Expectancy–value relationships of shame reactions and shame resiliency. J. Educ. Psychol. 93, 320–329 (2001) 44. Boekaerts, M.: Anger in relation to school learning. Learn. Instr. 3, 269–280 (1993) 45. Hembree, R.: Correlates, causes, effects, and treatment of test anxiety. Rev. Educ. Res. 58, 47–77 (1988) 46. Pekrun, R.: The control-value theory of achievement emotions: assumptions, corollaries, and implications for educational research and practice. Educ. Psychol. Rev. 18, 315–341 (2006) 47. Siouli, S., Makris, S., Romanopoulou, E., Bamidis, P.D.: Cognitive computer training in children with cognitive and learning disabilities: two interesting case studies. In: 2018 2nd International Conference on Technology and Innovation in Sports, Health and Wellbeing, pp. 1–6. IEEE (2018) 48. Kaufman, J., Yin, L.: What matters for Chinese girls; behaviour and performance in school: an investigation of co-educational and single-sex schooling for girls in urban China. In: Gender, Equality and Education from International and Comparative Perspectives, vol. 10, pp. 185–216 (2009) 49. McCambridge, J., Wilson, A., Attia, J., Weaver, N., Kypri, K.: Randomized trial seeking to induce the Hawthorne effect found no evidence for any effect on self-reported alcohol consumption online. J. Clin. Epidemiol. 108, 102–109 (2019)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Learnersourcing Quality Assessment of Explanations for Peer Instruction

Sameer Bhatnagar¹(B), Amal Zouaq¹, Michel C. Desmarais¹, and Elizabeth Charles²

¹ Ecole Polytechnique Montreal, Montreal, Canada
{sameer.bhatnagar,amal.zouaq,michel.desmarais}@polymtl.ca
² Dawson College, Westmount, Canada
[email protected]

Abstract. This study reports on the application of text mining and machine learning methods in the context of asynchronous peer instruction, with the objective of automatically identifying high-quality student explanations. Our study compares the performance of state-of-the-art methods across different reference datasets and validation schemes. We demonstrate that when we extend the task of argument quality assessment along the dimension of convincingness, from curated datasets to data from a real learning environment, new challenges arise, and simpler vector space models can perform as well as a state-of-the-art neural approach.

Keywords: Argument mining · Learnersourcing · Peer instruction

1 Introduction

Learning environments that leverage peer-submitted content carry the great advantage of scaling up to a larger content base. Be it student-generated questions, sample solutions, feedback, or explanations and hints, peer-submitted content can easily help grow a database to meet the demand for practice items and formative assessment. Platforms built on this principle are growing in popularity [7,15], freeing the teacher from the tedious but important task of developing multiple variants of the same items. Critical to the success of such environments is the capacity to automatically assess the quality of learner-generated content. There are a growing number of tools which put students at the centre of this challenge: students submit their own work, but then are prompted to evaluate a subset of their peers' submissions, and sometimes even provide feedback [3,23]. To counter the drawback that lies in the varying ability of novices to evaluate and provide good feedback to their peers, these environments often use pairwise comparison. It is widely accepted that evaluative judgements, based on ranking two objects relative to one another, are easier to make than providing an absolute score. Adaptive Comparative Judgement [22], where teachers assess student


submissions by simply choosing which is better from a pair, has been shown to be a reliable and valid alternative to absolute grading. As such, there is a growing family of learning tools which, at the evaluation step, present the peer-generated content as pairs to the current learner, and prompt for a pairwise ranking. This data can provide feedback to the students who originally submitted the item, but can also be used for moderating learner content. In platforms where students generate content that is part of the learning activities of future students, filtering out irrelevant and misleading material is paramount. However, while removing bad content is important, educators hope to identify, and subsequently maximize the use of, the best student-generated items. Since not all students can be asked to evaluate all possible pairs, this in turn leads to the challenge of optimizing which items need evaluation by the "student-come-moderators", without hindering their learning. Our work is centred on data coming from the subset of these learning environments that enable peer instruction [6], which follow a specific two-stage script:
1. students are prompted to answer a multiple choice question, and provide a free-text explanation that justifies their answer;
2. students are then prompted to reconsider their answer, by presenting them a selection of explanations written by previous students [1].
Students can either decide that their own explanation is best, or indicate which of their peers' explanations was the most convincing. We frame the explanations students produce here as arguments meant to persuade one's peers that their own reasoning is best. Recent research in argument mining has proposed a novel machine learning task: given a pair of arguments on a given topic, can we predict which is more convincing? [11,12,24,26] Our objective is to evaluate if we can extend this task to a peer-instruction setting: given a pair of explanations, one written by a student and another by their peer, can we predict which they will select as more convincing? To our knowledge, this is the first analysis of peer instruction data through the lens of learnersourcing and argument convincingness. Thus, as a reference, we include datasets and methods from the argument mining research community, specifically focused on pairwise preference ranking of arguments for automatic assessment of convincingness. We apply vector space models, as well as a state-of-the-art neural approach, and report on their performance for this task. Our findings suggest that the arguments generated in learning environments centred on undergraduate science topics present a more challenging variant of the task originally proposed in the argument mining community, and that classical approaches can match neural models for performance on this task across datasets, depending on the context.

2 Related Work
2.1 Learnersourcing and Comparative Peer Assessment

The term learnersourcing has been defined as the process in "which learners collectively generate useful content for future learners, while engaging in a meaningful learning experience themselves" [27]. One of the earliest examples of a learning environment centred on this mode is Peerwise [8], wherein students generate question items for the subjects they are learning and share them with their peers, so that others can use them for practice. Ripple [15] is a similarly built system, but with an added recommendation engine which adaptively selects which problems to suggest to which students. Other tools leave the creation of question items to teachers, but call on students to generate and evaluate explanations for the answers. The AXIS [28] system prompts students to generate an explanation for their response to a short answer question, and then evaluate a similar explanation from one of their peers on a scale of 1–10 for helpfulness. This rating data drives the reinforcement learning algorithm that decides which explanations to show to future students. A similar class of learning platforms leverages comparative judgement for supporting how students can evaluate the work of their peers. Juxtapeer [3] asks students to provide feedback to a single peer on their work by explicitly comparing it to that of another. ComPAIR [23] asks students for feedback on each item of the pair they are presented with, with a focus on what makes one "better" than the other. Finally, peer instruction question prompts are being used more frequently inside online learning assignments [2,4]. Moderating the quality of student explanations therein is important work that can begin with unsupervised clustering to filter out irrelevant content [10]; however, identifying the best explanations that might promote learning is at the heart of this research.

2.2 Argument Quality and Convincingness

Work in the area of automatic evaluation of argument quality finds its roots in detecting evidence in legal texts [18], but has accelerated in recent years as more datasets become available, and focus shifts to modelling more qualitative measures, such as convincingness. Some earlier efforts included work on automatic scoring of persuasive essays [21] and modelling persuasiveness in online debate forums [25]. However, evaluating argument convincingness with an absolute score can be challenging for annotators, which has led to significant work that starts from data consisting of paired arguments, labelled with which of the two is most convincing. Research efforts in modelling these pairwise preferences include a feature-rich support vector machine, as well as an end-to-end neural approach based on pre-trained GloVe vectors [20] which are fed into a Bi-directional Long Short-Term Memory network [12]. Most recently, this work has been extended with a Siamese network architecture, wherein the two legs are BiLSTM networks that share weights, trained to detect which argument in a pair has the most convincing evidence [11]. Finally, based on the recent success of transformer models in NLP, the current state of the art for the task of pairwise preference ranking of convincingness is based on fine-tuning pre-trained weights for the publicly available base BERT [9] model set up for sequence classification [26].

3 Data

The objective of this study is to compare and contrast how text mining methods for evaluating argument quality, specifically for argument convincingness, perform in an online learning environment with learner-generated and annotated arguments. Our primary dataset comes from a peer instruction learning environment, myDALITE.org, used by instructors in a context of low-stakes formative assessment in introductory undergraduate science courses. To provide context as to the performance that can be expected for this task in a relatively novel setting, we include in our study three publicly available datasets, each specifically curated for the task of automatic assessment of argument quality along the dimension of convincingness. Table 1 provides examples of an argument pair from each of the datasets.

Table 1. Examples of argument pairs from each dataset. These examples were selected because they were incorrectly classified by all of our models, and demonstrate the challenging nature of the task. In each case, the argument labeled as more convincing is in italics.

(a) A pair of arguments from UKP, for the prompt topic: "school uniforms are a good idea"
a1: I take the view that, school uniform is very comfortable. Because there is the gap between the rich and poor, school uniform is efficient in many ways. If they wore to plain clothes every day, they concerned about clothes by brand and quantity of clothes. Every teenager is sensible so the poor students can feel inferior. Although school uniform is very expensive, it is cheap better than plain clothes. Also they feel sense of kinship and sense of belonging. In my case, school uniform is convenient. I don't have to worry about my clothes during my student days
a2: I think it is bad to wear school uniform because it makes you look unatrel and you cannot express yourself enough so Band school uniform OK

(b) A pair of arguments from IBM ArgQ, for the prompt topic: "We should support information privacy laws"
a1: if a company is not willing to openly say what they are going to do with my data, they shouldn't be allowed to do it
a2: if you are against information privacy laws, then you should not object to having a publicly accessible microphone in your home that others can use to listen to your Private conversations

(c) Student explanations from dalite, for the question prompt: "Rank the magnitudes of the electric field at point A, B and C shown in the following figure from Greatest magnitude to weakest magnitude"
a1: At B, the electric field vectors cancel (E = 0). C is further away than A and is therefore weaker
a2: A is closest, B experiences the least since it is directly in the middle, and C the least since it is most far away

3.1 UKP and IBM

UKPConvArgStrict [12], henceforth referred to as UKP, was the first to propose the task of pairwise preference learning for argument convincingness. The dataset consists of just over 1k individual arguments that support a particular stance for one of 16 topics, collected from online debate portals and annotated as more/less convincing in a crowd-sourcing platform. More recently, a second similar dataset was released by the research group associated with IBM Project Debater, IBMArgQ-9.1kPairs [26], henceforth referred to as IBM ArgQ. IBM ArgQ data is strongly curated with respect to the relative length of the arguments in each pair: in order to control for the possibility that annotators may make their choice of which argument in the pair is more convincing based merely on the length of the text, the mean difference in word count, Δwc, is just 3 words across the entire dataset, which is 10 times more homogeneous than pairs in UKP. Finally, we include a third reference dataset, IBM Evi [11]. The important distinction here is that the arguments are actually extracted as evidence for their respective topic from Wikipedia, and hence represent cleaner, more well-formed text than our other reference datasets. We filter this dataset to only include topics that have at least 50 associated argument pairs. Table 2 summarizes some of the descriptive statistics that can be used to compare these sources, and potentially explain some of our experimental results.

Table 2. Descriptive statistics for each dataset of argument pairs, with the last rows showing dalite data split by discipline. Nargs is the number of individual arguments, distributed across Npairs revolving around Ntopics. Nvocab is the number of unique tokens in all the arguments. wc is the average number of words per argument, shown with the standard deviation (SD). Δwc is the average relative difference in number of words for each argument in each pair.

Dataset            Npairs   Ntopics   Nargs   Nvocab   wc (SD)   Δwc (SD)
IBM ArgQ           9125     11        3474    6710     23 (7)    3 (2)
UKP                11650    16        1052    5170     49 (28)   30 (23)
IBM Evi            5274     41        1513    6755     29 (11)   3 (2)
dalite             8551     102       8942    5571     17 (15)   12 (7)
dalite:Biology     3919     49        4116    3170     15 (14)   10 (6)
dalite:Chemistry   1666     24        1758    2062     20 (14)   12 (7)
dalite:Physics     2966     29        3068    2478     19 (15)   15 (7)

3.2 Dalite

The dalite dataset is from the peer instruction environment explained in the introduction. It contains only the observations where, after writing an explanation for their answer choice, on the review step, students chose a peer's explanation as more convincing than their own. To ensure internal reliability, we only keep argument explanations that were also chosen by at least 5 different students. To ensure that the explanations in each pair are of comparable length, we keep only those with word counts that are within 25 words of each other. This leaves us a dataset with 8551 observations, spanning 2216 learner annotators having completed, on average, 4.0 items each, from a total of 102 items across three disciplines, with at least 50 explanation pairs per item. We draw an analogy between a "topic" in the reference data and a "question item" in dalite.

Table 3. Number of argument pairs in dalite, broken down by discipline and the correctness of the selected answer choice on the initial step and then on the review step: rr (right-right), rw (right-wrong), wr (wrong-right), ww (wrong-wrong).

Discipline   rr (right-right)   rw (right-wrong)   wr (wrong-right)   ww (wrong-wrong)
Biology      2459               124                733                603
Chemistry    1151               51                 228                236
Physics      2288               66                 278                334
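As a rough illustration of the filtering described before Table 3 (keeping only peer explanations chosen by at least 5 different students, and pairs whose word counts are within 25 words of each other), the following pandas sketch operates on a hypothetical votes table; the column names are assumptions for illustration, not the platform's actual schema.

import pandas as pd

# Hypothetical table: one row per review-step vote (invented columns and values)
votes = pd.DataFrame({
    "item": ["q1", "q1", "q1", "q2"],
    "own_explanation": ["my answer ...", "i think ...", "because ...", "field is zero ..."],
    "chosen_explanation": ["peer text a", "peer text a", "peer text b", "peer text c"],
    "chosen_id": [101, 101, 102, 201],
})

# Keep only peer explanations chosen by at least 5 different students
counts = votes["chosen_id"].value_counts()
popular = votes[votes["chosen_id"].isin(counts[counts >= 5].index)]

# Keep only pairs whose word counts differ by at most 25 words
wc_own = popular["own_explanation"].str.split().str.len()
wc_peer = popular["chosen_explanation"].str.split().str.len()
pairs = popular[(wc_own - wc_peer).abs() <= 25]
# With this toy table the result is empty; real data would contain many votes per explanation.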

Table 3 highlights three key differences between the modelling task of this study, and related work in argument mining. First, in IBM ArgQ and UKP, annotators are presented pairs of arguments that are always for the same stance, in order to limit bias due to their opinion on the topic when evaluating which argument is more convincing (this is also true of many of the pairs in IBM Evi). In a peer instruction learning environment, other pairings are possible, and pedagogically relevant. In dalite, the majority of students keep the same answer choice on the review step, and so they are comparing two explanations that are either both for the correct answer choice (“rr”), or an incorrect answer choice (“ww”). This is analogous to arguments for the same stance. However, 17% of the observations in dalite are for when students not only choose an explanation more convincing than their own, but also switch answer choice, either from the wrong to right, or the reverse (convincingness across different stances). Second, a more fundamental difference is that our study focuses on undergraduate science courses across three disciplines, wherein the notion of answer “correctness” is important. There are a growing number of ethics and humanities instructors using the peer instruction platform, where the question prompts are topics more like a debate, as in our reference datasets. We leave these comparisons for future work.


Thirdly, each argument pair is made up of one explanation written by the current learner and an alternative generated by a peer that the current learner chose as more convincing. In this respect, the dalite dataset is different from the reference datasets, since it is the same person, the current learner, who is both author and annotator. In the reference datasets, data was always independently annotated using crowdsourcing platforms.

4 Methodology

Choosing which argument is more convincing from a pair is a binary ordinal regression task, where the objective is to learn a function that, given two feature vectors, can assign the better argument a rank of +1, and the other a rank of −1. It has been proven that such a binary ordinal regression problem can be cast into an equivalent binary classification problem, wherein the model is trained on the difference of the feature vectors of each argument in the pair [13]. Referred to as SVM-rank, this method of learning pairwise preferences has been used extensively in the context of information retrieval (e.g. ranking search results for a query) [14], but also more recently in evaluating the journalistic quality of newspaper and magazine articles [17]. The study accompanying the release of UKP first proposed this method for modelling argument convincingness [12]. In our study, we explore three models, described below.

Vector Space Models, ArgBoW: We follow up on the work using SVM-rank, building simple "bag-of-words" vector space models to represent our argument text. We take all of the individual arguments for a particular topic in our training set (known as "explanations" for a particular question item in the case of the dalite data), lemmatize the tokens, and build term-document matrices for each topic. We then take the arithmetic difference of the normalized term frequency vector representations of the two arguments in each pair, and train Support Vector Machine classifiers on these differences. We refer to this model as ArgBoW. We do not, for this study, include any information related to the topic prompt in our document representation.

Pre-trained Word Embeddings, ArgGloVe: A limitation of vector space models for text classification is the exclusion of words that are "out-of-vocabulary" in the test set when compared to the training data. Previous research has addressed this using language models that offer pre-trained word embeddings that have already learned expansive vocabularies from massive corpora of text [11,12]. In our ArgGloVe model, we encode each token of each argument using 300-dimensional GloVe vectors [20], and represent each argument as the average of its token-level embedding vectors. We then feed these into the same SVM-rank architecture described above.
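As a minimal sketch, not the authors' implementation, the SVM-rank idea shared by ArgBoW and ArgGloVe can be expressed as follows: represent each argument as a vector (here, normalized term frequencies) and train a linear SVM on the difference of the two vectors in each pair. The toy pairs and labels below are invented for illustration; the real study fits one vectorizer per topic and lemmatizes tokens first.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

pairs = [
    ("at B the field vectors cancel so it is weakest", "A is closest so its field is strongest"),
    ("C is farthest away therefore the field is weakest there", "the field is the same everywhere"),
]
labels = [1, -1]  # +1 if the first argument in the pair was chosen as more convincing, else -1

# Fit the vocabulary on all individual arguments of the (training) topic
all_args = [a for pair in pairs for a in pair]
vectorizer = TfidfVectorizer(use_idf=False, norm="l2")  # plain normalized term frequencies
vectorizer.fit(all_args)

# Each training instance is the difference of the two argument vectors
X = np.vstack([
    (vectorizer.transform([a1]) - vectorizer.transform([a2])).toarray().ravel()
    for a1, a2 in pairs
])
y = np.array(labels)

clf = LinearSVC().fit(X, y)
print(clf.predict(X))  # predicted preference for each pair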


Transfer Learning, ArgBERT: Finally, in order to leverage recent advances in transfer learning for NLP, the final model we explore is ArgBERT. We begin with a pre-trained language model for English built using Bi-directional Encoder Representations from Transformers, known as BERT [9], trained on large bodies of text for the tasks of masked token prediction and sentence-pair inference. As proposed in [26], we take the final 768-dimensional hidden state of the base-uncased BERT model, feed it into a binary classification layer, and fine-tune all of the pre-trained weights for the task of sequence classification using our argument-pair data. As in other applications involving sentence pairs for BERT, each argument pair is encoded as [CLS] A [SEP] B, where the special [SEP] token instructs the model as to the boundary between arguments A and B.¹
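The pair encoding can be sketched with the Hugging Face transformers library as below. This illustrates the [CLS]/[SEP] pair format rather than the authors' fine-tuning script, and the example arguments are invented; the classification head shown here is untrained until fine-tuned on argument-pair data.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Pre-trained base-uncased BERT with a 2-way sequence-classification head
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

arg_a = "At B the electric field vectors cancel, so the field there is zero."
arg_b = "A is closest to the charge, so the field is strongest at A."

# The tokenizer produces [CLS] arg_a [SEP] arg_b [SEP] plus segment (token-type) ids
inputs = tokenizer(arg_a, arg_b, return_tensors="pt", truncation=True)

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits  # one score per class; meaningful only after fine-tuning
print(logits.shape)  # torch.Size([1, 2])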

4.1 Baselines and Validation Schemes

We build a baseline model, ArgLength, which is trained on simply the number of words in each argument, as there may be many contexts where students will simply choose the longer/shorter argument, based on the prompt. In order to get a reliable estimate of performance, we employ stratified 5-fold cross-validation for our experiments. This means that for each fold, we train our model on 80% of the available data, ensuring that each topic and output class is represented equally. In the context of peer instruction learning environments, this is meant to give a reasonable estimate of how our models would perform in predicting which explanations will be chosen as most convincing, after a certain number of responses have been collected. However standard practice in the argument mining community is to employ “cross-topic” validation for this task. For each fold, the arguments for one topic (or question item, in the case of dalite) are held out from model training. Evaluating model performance on a yet unseen topic is a stronger estimate for how a model will perform when new question items are introduced into the content base. We evaluate our models using both validation schemes.
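A minimal sketch of the two validation schemes with scikit-learn follows, assuming hypothetical arrays of pair-level feature vectors X, binary preference labels y, and a parallel array of topic ids. StratifiedKFold only approximates the study's held-in scheme (it stratifies on the label alone, not on topic), while LeaveOneGroupOut implements cross-topic validation by holding out all pairs of one topic per fold.

import numpy as np
from sklearn.model_selection import StratifiedKFold, LeaveOneGroupOut
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))          # placeholder feature vectors (e.g. ArgBoW differences)
y = np.tile([1, -1], 20)              # which argument in each pair was preferred
topics = np.repeat(np.arange(4), 10)  # 4 topics, 10 pairs each (invented grouping)

# Scheme 1: stratified 5-fold cross-validation (every topic may appear in training)
for train, test in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = LinearSVC().fit(X[train], y[train])
    print("held-in fold accuracy:", accuracy_score(y[test], clf.predict(X[test])))

# Scheme 2: cross-topic validation (all pairs of one topic held out per fold)
for train, test in LeaveOneGroupOut().split(X, y, groups=topics):
    clf = LinearSVC().fit(X[train], y[train])
    print("cross-topic fold accuracy:", accuracy_score(y[test], clf.predict(X[test])))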

5 Results and Discussion

The performance of each model across the different datasets is presented in Figs. 1 and 2, using respectively 5-fold cross-validation and cross-topic validation. For the purposes of comparison, in Table 4, we denote, to the best of our knowledge, the state-of-the-art performance for each dataset on the task of pairwise classification for convincingness.

¹ Modified from the run_glue.py script provided by the transformers package, built by the company Hugging Face. All code for this study is provided at https://github.com/sameerbhatnagar/ectel2020.

[Figure 1: four panels; (a) Accuracy and (b) ROC-AUC by model (ArgLength, ArgBoW, ArgGlove, ArgBERT) for each dataset (IBM ArgQ, IBM Evi, UKP, dalite); (c) Accuracy and (d) ROC-AUC by model for the dalite disciplines (Biology, Chemistry, Physics).]

Fig. 1. Pairwise ranking classification accuracy and ROC-AUC, evaluated using 5-fold stratified cross-validation, for different models across datasets in Figs. 1a and 1b. Figures 1c and 1d split the performance for the dalite dataset across disciplines.

The first point we remark is that ArgBERT, which is the state of the art for our reference datasets, also performs relatively well on dalite data. This, in part, provides some support for the premise that student explanations, and the peer-vote-data we extract from a peer instruction learning environment, can actually be modelled via a pairwise argument ranking methodology. It also supports our line of inquiry, and lays the foundation for future research in this area. Second, we see that the baseline performance set by ArgLength is not uniform across the datasets, which, we surmise, is due in most part to how carefully curated the datasets are to begin with. IBM ArgQ and IBM Evi are datasets that were released, in part, for the purpose of promoting research on automatic assessment of argument quality, and are meant to serve as a benchmark that cannot simply be learned with word counts alone. As can be seen in Table 2, Δwc is greatest in UKP and dalite, as they are less carefully curated, and contain many more argument pairs that have a larger relative difference in length. This is particularly relevant in the context of learning environments based on peer instruction: depending on the importance placed on the review step by their teacher, students may, or may not, truly engage with their peers' explanations, reverting to simply choosing explanations based solely on how many words they

[Figure 2: four panels; (a) Accuracy and (b) ROC-AUC by model (ArgLength, ArgBoW, ArgGlove, ArgBERT) for each dataset (IBM ArgQ, IBM Evi, UKP, dalite); (c) Accuracy and (d) ROC-AUC by model for the dalite disciplines (Biology, Chemistry, Physics).]

Fig. 2. Pairwise ranking classification accuracy and ROC-AUC, evaluated using cross-topic validation, for different models across datasets in Figs. 2a and 2b. Figures 2c and 2d split the performance for the dalite dataset across disciplines.

have to read (or not). This sets the bar very high for other, more sophisticated approaches. Third, we present the results for two different validation schemes, in Figs. 1 and 2, to highlight the impact they may have on model selection. When evaluating performance using 5-fold cross-validation (Fig. 1), where some topic data is held in for training, the relatively simple ArgBoW performs at least as well as, if not better than, all other models, including the state-of-the-art ArgBERT. This is not the case in Fig. 2, where each individual topic is completely held out from training (standard practice in argument mining research). We surmise that ArgBoW models can learn the requisite vocabulary to explain what makes an argument more convincing and perform well under conditions where at least some data is available for a topic. This effect is pronounced for dalite, where the vocabulary is relatively constrained, as seen in Table 2, where the ratio of Nvocab to Nargs is lowest for dalite. This result is not without precedent: in a study on the task of pairwise ranking of newspaper articles based on "quality", the authors achieve a similar result: when comparing the performance of SVM-rank models using different input feature sets (e.g. use of visual language, use of named entities, affective content), their top performing models achieve pairwise ranking accuracy of 0.84


Table 4. State-of-the-art performance for pairwise argument classification of convincingness for three publicly available datasets, using the cross-topic validation scheme.

Dataset    Acc    AUC    Model
UKP        0.83   0.89   ArgBERT [26]
IBM ArgQ   0.80   0.86   ArgBERT [26]
IBM Evi    0.73   –      EviConvNet [11]

using a combination of content and writing features, but also a 0.82 accuracy with the content words as features alone [17]. While ArgBoW will suffer from the cold-start problem when new question items are added to the set of learning activities, as no student explanations are yet available and the vocabulary is still unknown, this may be remedied by the addition of calculated features to the model. However, the more fundamental result here may point to the importance of verifying multiple validation schemes in future work that ties together methods and data from learning analytics and argument mining research. Cross-topic validation indicates that ArgBERT performs better than ArgBoW, likely because the model does not overfit to the training data, and the final hidden state of the transformer has learned something more general about the task. Yet even better results might be obtained with a simpler vector space model after some student answers have been collected. Fourth, we observe the performance of models across different disciplines in dalite. Results seem to indicate that argument pairs from items in dalite:Physics are easiest to classify. This effect is even more important when we take into account that we have the least data from this discipline (Npairs in Table 2). This may be in part due to the impact of upstream tokenizers, which fail to adequately parse arguments that have a lot of chemical equations or molecular formulae. Most of the items in dalite:Physics are conceptual in nature, and the arguments contain fewer non-word tokens. Finally, we highlight the variance in performance across models, datasets, and validation schemes. We posit that the task of pairwise classification of argument quality, along the dimension of convincingness, is more challenging when the data is generated by students as part of a learning activity than with data collected from crowd-sourcing annotation platforms and online debate portals. This effect is expected to be more pronounced in explanations for items from STEM education, as the difference between correct and incorrect answer choices will be more prevalent than in learning contexts focused on humanistic disciplines. When a student is comparing their explanation, which may be for an incorrect answer choice, with an explanation for a correct answer choice, we refer to this as "wr". This would be the equivalent of argument pairs which contained arguments of opposite stance, which is only true in IBM Evi. Of note is the relatively stable performance of ArgGlove across datasets. This may further indicate that there is promise in vector space approaches, as the semantic information captured in well-trained word embeddings can be leveraged to address the


challenge when there is large variance in words used to express arguments (most pronounced in UKP).
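To make the contrast between the two validation schemes concrete, the following minimal sketch compares ordinary k-fold cross-validation (pairs from each topic partially held in) with grouped, cross-topic cross-validation (every pair from a topic held out together), using scikit-learn. The feature matrix, labels, and topic ids are synthetic placeholders, not the pipeline used in this study, so only the splitting behaviour is meaningful here.

```python
# Sketch: held-in vs. held-out topic validation for pairwise classification.
# Data below is random stand-in data; in practice X would hold pair features
# (e.g. ArgBoW vectors) and `topics` the question-item id of each pair.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_pairs, n_features, n_topics = 600, 50, 12
X = rng.normal(size=(n_pairs, n_features))        # pair feature vectors
y = rng.integers(0, 2, size=n_pairs)              # which argument "wins"
topics = rng.integers(0, n_topics, size=n_pairs)  # topic / question-item id

model = make_pipeline(StandardScaler(), LinearSVC(dual=False))

# Scheme 1: 5-fold CV -- pairs from the same topic can land in both the
# training and the test fold, so topic vocabulary is partially held in.
heldin = cross_val_score(model, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Scheme 2: cross-topic CV -- all pairs from a topic are held out together,
# mirroring the standard practice in argument mining research.
heldout = cross_val_score(model, X, y, groups=topics,
                          cv=GroupKFold(n_splits=5))

print("held-in topics :", heldin.mean().round(3))
print("held-out topics:", heldout.mean().round(3))
```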

6 Conclusion and Future Work

Asynchronous peer instruction offers the potential of not only learnersourcing explanations for formative assessment items, but of curating those explanations as well. This is a scalable modality that enables students to interact with their peers' contributions as part of the learning process. However, to our knowledge, we are among the first to explore methods that aim to distill the data from such a context and automatically assess the quality of student explanations along the dimension of convincingness.

The implications of this study for technology-enhanced learning lie in the design of peer instruction platforms. A machine learning approach, with a careful choice of validation scheme, can help automatically identify the explanations students find most convincing. Directions for future work lie in exploring the features that these models leverage for their predictions, and in developing frameworks for providing feedback to both teachers and students.

The performance of vector space models in this study indicates that more work should be done on expanding the representation of text arguments using calculated features. This approach has performed best in other studies [17, 19], where a combination of writing and content features achieved the best results. This avenue must be explored more thoroughly, especially as it may vary across disciplines and teaching contexts. The current study only addressed the task of predicting which of the explanations in a pair is more convincing. The next task for future work lies in Learning to Rank, wherein this pairwise preference data is used to infer a global ranking of argument quality.

Acknowledgements. Funding for the development of myDALITE.org is made possible by Entente Canada-Québec and the Ministère de l'Éducation et de l'Enseignement supérieur du Québec. Funding for this research was made possible by the support of a Canadian Social Sciences and Humanities Research Council Insight Grant. This project would not have been possible without the SALTISE/S4 network of researcher-practitioners, and the students using myDALITE.org who consented to share their learning traces with the research community.

References

1. Bhatnagar, S., Lasry, N., Desmarais, M., Charles, E.: DALITE: asynchronous peer instruction for MOOCs. In: Verbert, K., Sharples, M., Klobučar, T. (eds.) EC-TEL 2016. LNCS, vol. 9891, pp. 505–508. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45153-4_50
2. University of British Columbia, T.L.T.: ubc/ubcpi, August 2019. https://github.com/ubc/ubcpi. original-date: 2015-02-17T21:37:02Z


3. Cambre, J., Klemmer, S., Kulkarni, C.: Juxtapeer: comparative peer review yields higher quality feedback and promotes deeper reflection. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI 2018, pp. 1–13. ACM Press, Montreal QC, Canada (2018). https://doi.org/10.1145/3173574.3173868
4. Charles, E.S., et al.: Harnessing peer instruction in- and out- of class with myDALITE. In: Fifteenth Conference on Education and Training in Optics and Photonics: ETOP 2019, pp. 11143–11189. Optical Society of America (2019). http://www.osapublishing.org/abstract.cfm?URI=ETOP-2019-11143_89
5. Chen, X., Bennett, P.N., Collins-Thompson, K., Horvitz, E.: Pairwise ranking aggregation in a crowdsourced setting. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pp. 193–202 (2013)
6. Crouch, C.H., Mazur, E.: Peer instruction: ten years of experience and results. Am. J. Phys. 69(9), 970–977 (2001)
7. Denny, P.: The effect of virtual achievements on student engagement. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2013, Paris, France, pp. 763–772. ACM, New York (2013). https://doi.org/10.1145/2470654.2470763
8. Denny, P., Hamer, J., Luxton-Reilly, A., Purchase, H.: PeerWise: students sharing their multiple choice questions. In: Proceedings of the Fourth International Workshop on Computing Education Research, ICER 2008, Sydney, Australia, pp. 51–58. ACM, New York (2008). https://doi.org/10.1145/1404520.1404526
9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
10. Gagnon, V., Labrie, A., Desmarais, M., Bhatnagar, S.: Filtering non-relevant short answers in peer learning applications. In: Proceedings Conference on Educational Data Mining (EDM) (2019)
11. Gleize, M., et al.: Are you convinced? Choosing the more convincing evidence with a Siamese network. arXiv preprint arXiv:1907.08971 (2019). https://www.aclweb.org/anthology/P19-1093/
12. Habernal, I., Gurevych, I.: Which argument is more convincing? Analyzing and predicting convincingness of web arguments using bidirectional LSTM. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1589–1599. Association for Computational Linguistics, Berlin, Germany (2016). https://doi.org/10.18653/v1/P16-1150. https://www.aclweb.org/anthology/P16-1150
13. Herbrich, R., Graepel, T., Obermayer, K.: Support Vector Learning for Ordinal Regression. IET (1999)
14. Joachims, T.: Optimizing search engines using clickthrough data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133–142 (2002)
15. Khosravi, H., Kitto, K., Williams, J.J.: Ripple: a crowdsourced adaptive platform for recommendation of learning activities. arXiv preprint arXiv:1910.05522 (2019)
16. Lippi, M., Torroni, P.: Argumentation mining: state of the art and emerging trends. ACM Trans. Internet Technol. (TOIT) 16(2), 10 (2016)
17. Louis, A., Nenkova, A.: What makes writing great? First experiments on article quality prediction in the science journalism domain. Trans. Assoc. Comput. Linguist. 1, 341–352 (2013)


18. Moens, M.F., Boiy, E., Palau, R.M., Reed, C.: Automatic detection of arguments in legal texts. In: Proceedings of the 11th International Conference on Artificial Intelligence and Law, pp. 225–230 (2007)
19. Nguyen, D., Doğruöz, A.S., Rosé, C.P., de Jong, F.: Computational sociolinguistics: a survey. arXiv preprint arXiv:1508.07544 (2015)
20. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. EMNLP 14, 1532–1543 (2014)
21. Persing, I., Ng, V.: End-to-end argumentation mining in student essays. In: HLT-NAACL, pp. 1384–1394 (2016)
22. Pollitt, A.: The method of adaptive comparative judgement. Assessment in Education: Principles, Policy & Practice 19(3), 281–300 (2012). https://doi.org/10.1080/0969594X.2012.665354
23. Potter, T., Englund, L., Charbonneau, J., MacLean, M.T., Newell, J., Roll, I.: ComPAIR: a new online tool using adaptive comparative judgement to support learning with peer feedback. Teach. Learn. Inquiry 5(2), 89–113 (2017)
24. Simpson, E., Gurevych, I.: Finding convincing arguments using scalable Bayesian preference learning. Trans. Assoc. Comput. Linguist. 6, 357–371 (2018). https://www.aclweb.org/anthology/Q18-1026
25. Tan, C., Niculae, V., Danescu-Niculescu-Mizil, C., Lee, L.: Winning arguments: interaction dynamics and persuasion strategies in good-faith online discussions. arXiv:1602.01103 [physics], pp. 613–624 (2016). https://doi.org/10.1145/2872427.2883081
26. Toledo, A., et al.: Automatic argument quality assessment - new datasets and methods. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5629–5639 (2019). https://www.aclweb.org/anthology/D19-1564.pdf
27. Weir, S., Kim, J., Gajos, K.Z., Miller, R.C.: Learnersourcing subgoal labels for how-to videos. In: Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 405–416. ACM (2015)
28. Williams, J.J., et al.: AXIS: generating explanations at scale with learnersourcing and machine learning. In: Proceedings of the Third (2016) ACM Conference on Learning @ Scale - L@S 2016, Edinburgh, Scotland, UK, pp. 379–388. ACM Press (2016). https://doi.org/10.1145/2876034.2876042

Using Diffusion Network Analytics to Examine and Support Knowledge Construction in CSCL Settings

Mohammed Saqr (University of Eastern Finland, 80100 Joensuu, Finland, [email protected]) and Olga Viberg (KTH Royal Institute of Technology, 100 44 Stockholm, Sweden, [email protected])

Abstract. The analysis of CSCL needs to offer actionable insights about how knowledge construction between learners is built, facilitated and/or constrained, with the overall aim of helping to support knowledge (co-)construction. To address this, the present study demonstrates how network analysis, in the form of diffusion-based visual and quantitative information exchange metrics, can be effectively employed to: 1. visually map the learner networks of information exchange, 2. identify and define student roles in the collaborative process, and 3. test the association between information exchange metrics and performance. The analysis is based on a dataset of a course with a CSCL module (n = 129 students). For each student, we calculated the centrality indices that reflect the roles played in information exchange, range of influence, and connectivity. Students' roles were analysed using unsupervised clustering techniques to identify groups that share similar characteristics with regard to their emerging roles in the information exchange process. The results of this study show that diffusion-based visual and quantitative metrics can be effectively employed, and are valuable methods, to visually map the student networks of information exchange and to detect and define students' roles in the collaborative learning process. Furthermore, the results demonstrated a positive and statistically significant association between diffusion metrics and academic performance.

Keywords: CSCL · Information exchange metrics · Students' roles · Knowledge exchange · Student performance · Higher education



1 Introduction

Computer-supported collaborative learning (CSCL) – a pedagogical approach based on the foundations of cooperative and collaborative learning – "aims to employ computer technology to facilitate collaboration, discussion, and exchanges among peers or between students and teachers and help to achieve the goal of knowledge sharing and knowledge creation" [34, p. 768, 22]. CSCL affords several opportunities to the learner to facilitate the knowledge construction (or creation) process. In particular, CSCL was found to offer the learner opportunities to (1) engage in a joint task, (2) communicate,


(3) share resources and exchange information, (4) engage in productive collaborative learning processes, (5) engage in co-construction of knowledge, (6) monitor and regulate collaborative learning, and (7) find and build groups and communities [18].

However, despite the number of opportunities and affordances that CSCL provides, a key challenge researchers and practitioners face is how to successfully support learning activities that foster effective collaboration [8, 26] leading to productive knowledge construction and, ultimately, to improved learning outcomes. To address this challenge, scholars highlight two key strategies, namely i) structuring the collaboration (i.e., organising collaborative learning activities) to facilitate productive interaction and ii) monitoring collaboration to provide relevant feedback [19]. This study aims to contribute to the development of the latter strategy by focusing on the examination of effective collaboration analysis methods that "should be adapted to the needs of their users more effective" [26, p. 335]. Such analysis methods will ultimately assist instructors in providing more personalised support to meet the needs of the users.

Knowledge among individuals is frequently acquired through social interactions resulting in, for example, the exchange of ideas, endorsement of opinions and adoption of innovations. The process is similar to how an infection propagates when an infected individual has contact with a healthy, susceptible person. Both phenomena propagate or diffuse through the pathways of social networks. The concept of diffusion has been extensively applied to study a variety of phenomena across several domains, including the spread of emotions, the ripple effect of stock market disturbances, the growth of political movements, the viral spread of internet memes, and recently, the diffusion of innovations in educational contexts (e.g., [1]). Yet, compared to other fields, such research attempts in the area of education have hitherto been scarce [4].

Network analytics methods have been found to be effective at showing relevant information to the participants (e.g., the roles taken by students in a collaborative learning process) in CSCL settings [26]. To fill this gap, the present study aims to provide a better understanding of the diffusion process, focusing on knowledge (co-)construction in CSCL environments through the application of diffusion-based methods. In particular, we argue that the use of selected diffusion-based metrics can help to: 1. visualize the information exchange and the key players in a CSCL network, and 2. study their roles in information exchange, to be able to assist instructors in their provision of more adaptive feedback.

2 Background

2.1 Studying Information Diffusion

One of the key challenges that information diffusion researchers encounter is to understand how certain network structures (or relationships) predict the spread of information [33, 36]. Scholars have proposed several modelling methods and diffusion indices. One such model is the popular SIR (Susceptible-Infected-Recovered) model, initially implemented to study the spread of infection. The simplest form of the model proposes that an interaction between an infected person and a susceptible person will, with some probability, cause a new infection [17, 36]. This is a process known as simple


contagion, in which a single source (an infected person) is sufficient to activate a target (a susceptible person). However, in order to convince a person to endorse an opinion or accept a piece of information, multiple interactions are needed; information is also subject to argumentation, negotiation and verification by the recipient [6, 12]. Information diffusion is therefore considered a 'complex contagion'. Researchers have used complex forms of diffusion models or alternative methods that are both computationally efficient and reasonably accurate, e.g., diffusion indices or heuristics. Among these indices are some of the traditionally used centrality measures (e.g., degree centrality) and newly introduced structural measures (e.g., diffusion indices) that specifically address the complexity of the diffusion process and have proven reliable in real-life applications and in mathematical models in a number of empirical and simulation studies [e.g., 27, 31, 37].
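As a toy illustration of this distinction (not taken from the paper), the following sketch seeds two adjacent nodes in a small contact network and compares the reach of a simple contagion, where a single exposed neighbour suffices, with a complex contagion that requires two active neighbours before a node adopts. The network, seeds, and deterministic threshold rule are assumptions made purely for illustration.

```python
# Toy comparison of simple vs. complex contagion on a small contact network.
import networkx as nx

G = nx.karate_club_graph()      # any small undirected contact network
seeds = {0, 1}                  # two adjacent seed nodes

def spread(G, seeds, threshold):
    """Activate a node once at least `threshold` of its neighbours are active."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node in G.nodes:
            if node in active:
                continue
            if sum(nb in active for nb in G.neighbors(node)) >= threshold:
                active.add(node)
                changed = True
    return active

print("simple contagion  (1 active neighbour needed):", len(spread(G, seeds, 1)))
print("complex contagion (2 active neighbours needed):", len(spread(G, seeds, 2)))
```

With a single exposure the whole connected component is eventually reached, whereas the threshold-of-two rule stalls wherever reinforcement from multiple contacts is missing, which is the intuition behind treating information diffusion as complex contagion.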

2.2 Diffusion-Based Modeling and Metrics

Information diffusion research is concerned with the way information is exchanged, consumed or adopted as a social phenomenon through the pathways of the network [36]. Related research can be grouped into three categories: 1) understanding behavioral change mechanisms (e.g., studies investigating the diffusion of certain behaviours or practices, for example the diffusion of a novel educational technology); 2) finding early adopters or influencers of new ideas or innovations (e.g., educators who endorse learning strategies or technology-enhanced learning); and 3) empirical or simulation studies that investigate social network models and/or patterns of interaction influencing the time and the scale of information diffusion [5, 36]. This study follows the last thread of research, i.e., harnessing network structure and diffusion indices to understand the dynamics of diffusion and the roles, emerging in a learning network, of the students who facilitate information diffusion in a CSCL environment.

2.3 Traditional Measures

Degree centralities (i.e., the number of posts sent or received) are suitable and often used measures of student effort and participation in a CSCL setting. However, these measures are insufficient for understanding the diffusion process, since they only quantify the number of interactions; they do not reveal the spread of these interactions beyond the immediate contacts, and being local measures is their most significant limitation [12, 23]. Betweenness centrality measures how often a learner lies on the shortest paths between two others, i.e., bridges them [12, 23, 31]. On the one hand, it does not reflect spread beyond the local context (those two individuals); on the other hand, it is hard to interpret in a CSCL educational context. Closeness centrality (CC) considers the average distance between a node and all others. It might be relevant for determining reachability, yet it does not accurately reflect the process of information exchange, as determined by empirical and simulation studies [12, 23, 31].

2.4 Diffusion Indices

Diffusion Centrality: In contrast to degree centrality, which only indicates the spread to immediate contacts (neighbours, or the ego network), diffusion centrality measures the probability that an individual can spread a property (e.g., information or an opinion) across the whole network. For example, in a CSCL setting, students with higher diffusion centrality are those whose contributions are likely to spread by stimulating replies, further replies to those replies, and so on; indegree centrality, a measure of replies, would only return the number of replies to the original post, while diffusion centrality would reflect the cumulative number of downstream replies. To demonstrate the concept, consider the imaginary discussion presented in Fig. 1. Student A raises a point for discussion and student B agrees; five other students agree and reply to student B, so B has the most replies and therefore the highest degree centrality. However, student C has contributed a solid piece of information that stimulated further discussion and argumentation among multiple students and replies. In a knowledge construction process, we are interested in identifying students like C, and the contributions that stimulate argumentation, engagement and co-construction of knowledge in a collaborative learning network. Diffusion indices offer a more explainable and easier-to-interpret basis for monitoring students' interactions and visualizing their networks than the raw interaction counts offered by degree centralities.

Fig. 1. Fictional conversation among students.
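The contrast in Fig. 1 can be made concrete with a small sketch. Below, the reply network of the fictional discussion is rebuilt (edges point from the author of a reply to the author of the post being replied to), and for A, B and C we compare indegree (direct replies) with the number of downstream replies. The descendant count is only a simplified proxy for diffusion centrality, which additionally weights each hop by a transmission probability, and the exact thread structure is invented for illustration.

```python
# Rebuilding the Fig. 1 scenario as a directed reply network.
import networkx as nx

G = nx.DiGraph()
G.add_edge("B", "A")                                           # B replies to A
G.add_edges_from((s, "B") for s in ["D", "E", "F", "G", "H"])  # five direct replies to B
G.add_edge("C", "A")                                           # C also replies to A ...
G.add_edges_from([("I", "C"), ("J", "C"),                      # ... and C's post sparks
                  ("K", "I"), ("L", "I"),                      # a deeper back-and-forth
                  ("M", "K"), ("N", "J")])

for node in ["A", "B", "C"]:
    direct = G.in_degree(node)               # replies received directly
    downstream = len(nx.ancestors(G, node))  # all replies that eventually feed into this post
    print(f"{node}: direct replies = {direct}, downstream replies = {downstream}")
```

Running this sketch, B collects the most direct replies, but C's contribution reaches more students downstream, which is exactly the kind of contribution a diffusion-based measure is designed to surface.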

Banerjee et al. [2] demonstrated how diffusion centrality predicts the spread of information about financial solutions in Indian villages and, later, the adoption of these financial solutions by villagers. They also showed that diffusion centrality outperforms other methods (e.g., degree and betweenness centralities) in identifying the leaders who influence the spread of information. Scholars have demonstrated the utility of diffusion centralities (DC) in the study of the spread of rumours and in identifying the individuals who are best positioned to spread information or rumours [2]. Furthermore, researchers have used diffusion centrality in a variety of contexts, confirming the empirical utility of the measure [3].


Cross-Clique Centrality: One of the key findings of diffusion research is that the connectedness of an individual influences the range of spread and diffusion [12, 23, 31]. An individual connected to well-connected others has a higher chance of having their message carried further. Several measures have been proposed to capture the connectedness of contacts; of these, we have selected cross-clique centrality, which represents the embeddedness and density of the cliques (triangles) a node belongs to. A highly cross-connected node has a higher potential for the diffusion of information. The measure has been tested in many real-life scenarios and has been shown to predict the transmission of different social phenomena, such as information, online rumours and fake news [9].
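A minimal sketch of this idea, under the assumption suggested by the parenthetical above that the cliques counted are triangles (library implementations such as Centiserve's may count cliques more generally), can be computed on the undirected interaction network:

```python
# Counting the triangles each member belongs to, as a simple proxy for
# cross-clique centrality; the edge list below is purely illustrative.
import networkx as nx

edges = [("S1", "S2"), ("S2", "S3"), ("S1", "S3"),   # a triangle S1-S2-S3
         ("S3", "S4"), ("S4", "S5"), ("S3", "S5"),   # a second triangle sharing S3
         ("S5", "S6")]                               # S6 hangs off the periphery
G = nx.Graph(edges)

print(nx.triangles(G))   # e.g. S3 is embedded in 2 triangles, S6 in none
```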

2.5 Diffusion in CSCL

In CSCL settings, interactions among stakeholders (students and teachers) are facilitated or constrained by the dynamics of the collaborative learning process, which can be seen as the key pathways or 'pipes' through which information is disseminated. This suggests that understanding the pathways through which knowledge is transferred and information is spread in a collaborative learning network is critical for improving students' conditions for learning (i.e., learner support) in CSCL settings. Diffusion-based analytics can: 1. augment the monitoring of CSCL with visualizations based on diffusion, 2. expand the currently used centrality measures with diffusion-based indices that could help identify roles, and 3. help to understand knowledge exchange and predict performance. We briefly review these applications in the next section.

Centralities and Performance: Researchers have used centrality measures in CSCL contexts to reveal productive contributions or to predict performance. For example, [15, 25] reported positive correlations between indegree centrality and performance. Similarly, outdegree centrality has been reported to reveal students' communicability and activity in CSCL [15, 30]. Closeness and betweenness centralities were shown to be positively correlated with performance [15, 25]. It is therefore interesting to investigate the value of diffusion indices as indicators of performance.

Roles in CSCL: The concept of roles has been widely used in CSCL research and practice. It refers to the stated functions or responsibilities that guide individual behavior and regulate group interaction [14]. Scholars distinguish between two types of roles: 1. scripted roles that facilitate the collaborative learning process by assigning and structuring roles and student activities, and 2. emerging roles that are developed spontaneously and continuously by students in their collaborative learning process [11, 32]. The concept of emergent roles offers a view on how learners structure and self-regulate their CSCL processes and activities, suggesting that roles emerge dynamically over longer time periods in relation to students' advancing knowledge, while at the same time being unequally distributed in CSCL contexts [32]. That is, the analysis of emerging roles is critical for understanding the complex nature of the collaborative learning process, including the dynamics of CSCL, individual contributions, and interaction patterns, all of which bring valuable knowledge for facilitating CSCL [16, 32]. Scripts, on the other hand, can detail roles and facilitate students'


role sequences to engage them equally in the relevant collaborative learning processes and activities [32]. Importantly, both types of roles can collide and consequently need to be adapted to students' advancing capabilities and knowledge. However, to adapt them effectively, we need to understand them accurately, and this study aims to fill this gap by examining how diffusion-based visual and quantitative information exchange metrics can help to visualize, identify and define student roles in CSCL settings, or help predict performance.

3 Methods

Data Collection: The data were collected from a first-year problem-based learning (PBL) medical course. The course has an online discussion forum moderated by the teachers. The discussions are triggered by a weekly problem. The problems represent real-life scenarios that match the theme of the week's lectures and seminars. Since the curriculum is PBL-based, the online PBL discussion represents the most important module of the course. The context and assessment are described in detail in [30]. Data were collected from the learning management system Moodle (n = 129 students and n = 15 teachers). We collected all forum posts and their metadata, including post ID, subject, content, timestamp, post author, post target, post replies and thread metadata (i.e., title, timing, and users). The data collection also included user data: IDs, logs, roles (i.e., teachers and students), course grades and group IDs. For each group, a network was constructed by compiling a reply network: a directed edge was drawn from the author of a post (source) to the target of the reply. Network data were aggregated and analysed in the R programming language [29] using the igraph and Centiserve packages [7, 28]. For each course member, we calculated the following variables: 1. outdegree centrality as the total number of outgoing edges, 2. indegree centrality as the total number of incoming edges, 3. degree centrality as the sum of indegree and outdegree, 4. betweenness centrality as the number of times a user has lain on the shortest paths between others, 5. closeness centrality as the inverse distance between a course member and all others in the network, 6. diffusion centrality as the probability that a person propagates information to contacts, added to the probabilities of those contacts transmitting to their contacts (i.e., the cumulative contribution of the person and all contacts to the diffusion process), and 7. cross-clique centrality as the number of cliques (triangles) to which the student belongs. For the course, we calculated the number of edges, the number of collaborators, the average degree as the sum of the degree centralities of all users divided by the number of collaborators, and the density of interactions as the ratio of the number of interactions among collaborators to the maximum possible.

Data Analysis: The data were centred and standardized (mean = 0, SD = 1) as an essential preparation step for K-means clustering, to make variables comparable and to minimize the influence of individual measures on the results of


clustering. Data were examined for collinearity. The results of this examination indicated that indegree, outdegree and degree centralities are collinear, which is not surprising since they measure closely related constructs; thus, only degree centrality was included in the further analysis. To classify students according to their interaction and knowledge construction patterns, this research used the K-means clustering technique, which has been extensively applied in learning analytics and educational data mining research to: 1. uncover behavioral patterns in educational data, and 2. detect emergent groups according to, for example, their self-regulated learning profiles (e.g., [21, 37]) and their roles in collaborative learning, for example in MOOC settings (e.g., [20]). The optimum number of clusters was estimated using the silhouette and elbow methods and the NbClust R package, which provides the consensus of 30 different methods; the majority of these pointed to three as the optimum number of clusters. K-means was then used to classify students according to their roles in the information exchange. For this, we used the centrality measures (degree centrality, betweenness centrality, closeness centrality, diffusion centrality and cross-clique connectivity) that capture the range of a student's interaction patterns and the role(s) played in information exchange. Clustering was evaluated using the average silhouette width, which estimates how well each observation has been clustered by calculating average distances within and between clusters. The Shapiro-Wilk test was used to check the distribution of variables. The Kruskal-Wallis non-parametric one-way analysis of variance was used to compare clusters, and Spearman's rank correlation coefficient was employed to test the correlations among variables, since the variables did not follow a normal distribution. Similarly, the Dwass-Steel-Critchlow-Fligner test was used for pairwise post-hoc comparisons.
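To make the pipeline concrete, here is a hedged Python sketch of the analysis steps described above; the study itself used R with the igraph and Centiserve packages. The reply records are hypothetical, and the diffusion and cross-clique measures are simplified proxies (descendant counts and triangle counts) rather than Centiserve's implementations.

```python
# Sketch: reply network -> centrality features -> standardization -> K-means (k = 3).
import networkx as nx
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Each record: (author of the post, author of the post being replied to).
replies = [("s1", "s2"), ("s2", "s3"), ("s3", "s1"), ("s4", "s1"),
           ("s5", "s1"), ("s5", "s2"), ("s6", "s5"), ("s7", "s5"),
           ("s8", "s7"), ("s2", "s5"), ("s3", "s5"), ("s4", "s5")]

G = nx.DiGraph(replies)            # directed reply network for one group
U = G.to_undirected()
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)

members, features = [], []
for node in G.nodes:
    members.append(node)
    features.append([
        G.in_degree(node) + G.out_degree(node),   # degree centrality
        betweenness[node],                        # betweenness centrality
        closeness[node],                          # closeness centrality
        len(nx.ancestors(G, node)),               # proxy for diffusion centrality (downstream replies)
        nx.triangles(U, node),                    # proxy for cross-clique connectivity
    ])

X = StandardScaler().fit_transform(np.array(features, dtype=float))   # mean = 0, SD = 1
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(dict(zip(members, kmeans.labels_)))
print("average silhouette width:", round(silhouette_score(X, kmeans.labels_), 2))
```

In the actual analysis, as described above, a collinearity check precedes clustering, the number of clusters is chosen by consensus of several indices, and the diffusion measure weights each transmission step by a probability.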

3.1 Participants

The study included interaction data from 129 students in the first year of medical school, representing 15 small groups with a median of nine students per group (range 6 to 12).

4 Results

The total number of the study participants was 144, including 15 tutors for the small groups. The total number of the examined interactions (or posts) was 2620. The mean degree for each participant was 34.9. For students, the average degree was higher (i.e., 37.7), indicating more student-student interactions compared to the interactions between teachers and students. Each course member interacted with a median of seven others, and the average density within the groups was 0.44.

4.1 Visualization

Since the course had 15 distinct groups, we demonstrate an example group (due to space limitations) in Fig. 2. The network was configured to highlight 'productive' students, i.e., those who have higher degree and diffusion centralities. Figure 2


demonstrates the interactions between the participants, where each participant is depicted as a circle and the interactions are represented as lines. The blue concentric levels represent diffusion centrality; the nodes with higher diffusion centralities are darker in color. Students with higher diffusion centralities are those whose posts stimulated more replies, in longer threads, and denser interactions by others. This visualization shows that while two students (i.e., S5, S6) were actively engaged and led the constructive discussions, some others (e.g., S11, S10 and S3) were less engaged in the knowledge (co-)construction process, and others (e.g., S8 and S12) were isolated. Therefore, we can visually separate three subgroups of students: 1. students with the highest values of the diffusion indices (S5, S6), in the innermost circle, who represent leaders and/or influencers, 2. students with less influence and moderate contributions (S3, S4, S7, S9, S10, S11), who represent the arbitrators, and 3. those who are peripheral and/or isolated (S2, S8, S12), the satellites. Next, we investigated the possibility of using unsupervised clustering to identify these roles based on interaction profiles.

Fig. 2. Visualization of the interactions between the participants based on their contribution and diffusion profiles.

4.2 Role Identification

The use of K-means clustering to group students according to their role(s) in the knowledge (co-)construction process resulted in three distinct groups (Fig. 3).

Cluster 1: Leaders. This group represents students who actively stimulate the discussions and information construction in the group and engage others in the collaborative learning process. These students are characterized by high diffusion indices, a high number of posts, and strong embeddedness in the community of collaborators in the learning network. However, they have a low betweenness centrality. This can be explained by the fact that betweenness centrality is a local influence measure, whereas this student group has far-reaching influence reflected by their high diffusion values.


Fig. 3. Clustering students’ roles

Cluster 2: Arbitrators. This group describes students who have an above-average number of posts and above-average diffusion centralities, with the highest betweenness centrality of the three clusters. All this indicates their active role in the group discussions and information exchange. Yet, they are less likely than the leaders to post highly engaging content or to stimulate others.

Cluster 3: Satellites. This group comprises students with low contribution and low influence; they have low participation values, they are less likely to engage with others, and their contributions attract little engagement from others and are therefore less likely to contribute to knowledge (co-)construction.

To verify that the identified clusters are coherent and represent distinct roles, with statistically significant differences in the chosen variables, we compared the values of the selected variables across the groups. Table 1 shows the values of the examined parameters. Since the Shapiro-Wilk test for normality showed that the variables did not follow a normal distribution, we used the Kruskal-Wallis non-parametric one-way analysis of variance. The results of this test exhibit a statistically significant difference between the three clusters for all parameters, except for closeness centrality (p = 0.05). The values of each variable are listed, showing high diffusion indices and low betweenness in the Leaders cluster, above-average diffusion and high betweenness indices in the Arbitrators group, and low values of diffusion and productivity in the Satellites cluster. The effect size (ε²) was highest for the diffusion indices (diffusion centrality = 0.79, cross-clique connectivity = 0.55) and smallest for closeness centrality (0.05) (Table 1). The results of the Dwass-Steel-Critchlow-Fligner pairwise comparisons (post-hoc pairwise statistics) showed a statistically significant difference between all pairs of roles regarding the diffusion indices (diffusion centrality and cross-clique connectivity). The findings did not reveal a statistically significant difference between the pairs regarding closeness centrality, nor between Leaders (cluster 1) and Arbitrators (cluster 2) regarding the number of posts. Betweenness centrality was found to be statistically significant only between


Arbitrators and Satellites. The pairwise comparisons support the conclusion that the diffusion centralities have the highest effect size and statistical significance among all roles. This highlights the value of diffusion indices in differentiating student roles in the information exchange process in CSCL settings (Table 2).

Table 1. Kruskal-Wallis (KW) test showing the comparison among all groups.

Group/Cluster        | Statistics | Degree | Betweenness | Closeness | Diffusion | Cross-clique connectivity
Leaders (n = 9)      | Median     | 59.00  | 3.03        | 0.09      | 654.00    | 384.00
Arbitrators (n = 53) | Median     | 52.00  | 3.61        | 0.08      | 403.00    | 64.00
Satellites (n = 67)  | Median     | 16.00  | 0.34        | 0.08      | 168.00    | 16.00
KW Statistics        | χ²         | 45.83  | 19.46       | 5.98      | 101.12    | 70.63
                     | ε²         | 0.36   | 0.15        | 0.05      | 0.79      | 0.55
                     | p